STAC Research Note: Latency and Throughput Optimization for STAC-ML Markets (Inference)

STAC recently completed a set of experiments using STAC's naive Python implementation of the STAC-ML Markets (Inference) benchmark specification, with ONNX Runtime as the inference engine. We ran the experiments on two stacks under test (SUTs).

The SUTs ran identical hardware and software but differed in their configuration.

The experiment's goals were simple: given a reasonable range of Numbers of Model Instances (NMI) that might run simultaneously, find the configuration settings that optimize Instance Throughput, 99th-percentile latency, or both.
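To make the two optimization targets concrete, the sketch below shows one common way to measure per-instance 99th-percentile latency and throughput for an inference loop. This is an illustrative assumption, not STAC's implementation: the `infer()` function is a hypothetical stand-in for an ONNX Runtime `session.run()` call, and the percentile method is a simple sorted-sample index.

```python
import time

def infer(features):
    # Hypothetical stand-in for an ONNX Runtime session.run() call.
    return sum(features)

def measure(n_requests, features):
    """Return (p99 latency in seconds, throughput in inferences/sec)."""
    latencies = []
    start = time.perf_counter()
    for _ in range(n_requests):
        t0 = time.perf_counter()
        infer(features)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start

    # 99th-percentile latency: the value below which 99% of samples fall.
    latencies.sort()
    p99 = latencies[min(len(latencies) - 1, int(0.99 * len(latencies)))]
    throughput = n_requests / elapsed
    return p99, throughput

p99, tput = measure(1000, [0.1] * 16)
```

In a multi-instance run, each model instance would execute a loop like this concurrently, and the aggregate throughput and worst-case tail latency across instances become the quantities to trade off as NMI grows.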

This research note presents the lessons learned in pursuing these goals.


The use of machine learning (ML) to develop models is now commonplace in trading and investment. Whether the business imperative is reducing time to market for new algorithms, improving model quality, or reducing costs, financial firms have to offload major aspects of model development to machines in order to continue competing in the markets.