STAC Research Note: Latency and Throughput Optimization for STAC-ML Markets (Inference)
STAC recently completed a set of experiments using STAC's naive Python implementation of the STAC-ML Markets (Inference) benchmark specification, with ONNX Runtime as the inference engine. We ran the experiments on two stacks under test (SUTs).
The SUTs used identical hardware and software but differed in their configuration.
The experiments' goals were simple: given a reasonable range for the Number of Model Instances (NMI) that might be running simultaneously, find the configuration settings that maximize Instance Throughput, minimize 99th percentile latency, or both.
The lessons learned in achieving these goals are in this research note.
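As a rough illustration of the selection goal described above, the following Python sketch summarizes per-inference latencies into a throughput figure and a 99th percentile latency, then picks the best configuration for each metric. It is a minimal sketch under stated assumptions, not STAC's harness: the function names (`p99`, `summarize`, `best_configs`) and the configuration labels are hypothetical, throughput is computed simply as inferences per second (which may differ from the benchmark's formal definition of Instance Throughput), and the numbers in the usage lines are placeholders rather than measurements.

```python
def p99(latencies):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies)
    # ceil(0.99 * n) in integer arithmetic, as a 1-based rank
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def summarize(latencies, wall_seconds):
    """Reduce one run to the two metrics of interest.

    Assumption: throughput is taken as completed inferences per second
    of wall-clock time for the run.
    """
    return {
        "throughput": len(latencies) / wall_seconds,
        "p99": p99(latencies),
    }


def best_configs(results):
    """Given {config_label: summary}, return the label with the highest
    throughput and the label with the lowest 99th percentile latency."""
    by_throughput = max(results, key=lambda c: results[c]["throughput"])
    by_latency = min(results, key=lambda c: results[c]["p99"])
    return by_throughput, by_latency


# Placeholder summaries for two hypothetical configurations (not measured data).
example = {
    "config_a": {"throughput": 10.0, "p99": 0.005},
    "config_b": {"throughput": 8.0, "p99": 0.002},
}
fastest, lowest_latency = best_configs(example)
```

In this placeholder data the two metrics favor different configurations, which is why the goal is phrased as optimizing one metric, the other, or both. In practice the comparison would be repeated for each NMI value of interest.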