Scaling a common machine learning workload in the cloud

Project examined scale-up and scale-out of a natural language processing problem

10 January 2019

STAC recently studied the performance, cost, and quality of a common machine-learning (ML) pipeline in natural language processing (NLP): Python-based feature preparation and training of topic models from SEC Form 10-K filings using Latent Dirichlet Allocation. We explored the question of scale-up versus scale-out in three cloud environments:

1. A single Google Cloud Platform (GCP) n1-standard-16 instance (Intel Skylake option)
2. A single GCP n1-standard-96 instance (Intel Skylake option)
3. A Google Cloud Dataproc (Spark as a service) cluster containing 13 x n1-standard-16 Skylake instances (1 master and 12 worker nodes)
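To make the workload concrete, the LDA training step can be sketched in miniature. The following is a toy, pure-Python collapsed Gibbs sampler for LDA; it is illustrative only — the study's actual implementation, feature preparation, and hyperparameters are documented in the full STAC Study, and all names and sample texts here are made up:

```python
import random

def tokenize(text):
    """Lower-case and keep alphabetic whitespace-delimited tokens."""
    return [w for w in text.lower().split() if w.isalpha()]

def train_lda(docs, k=2, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents.

    Returns the vocabulary and the topic-word count table, from which
    per-topic word distributions can be read off.
    """
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    widx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    # Random initial topic assignment for every token, plus count tables.
    z = [[rng.randrange(k) for _ in d] for d in docs]
    ndk = [[0] * k for _ in docs]      # document-topic counts
    nkw = [[0] * V for _ in range(k)]  # topic-word counts
    nk = [0] * k                       # tokens per topic
    for di, d in enumerate(docs):
        for wi, w in enumerate(d):
            t = z[di][wi]
            ndk[di][t] += 1; nkw[t][widx[w]] += 1; nk[t] += 1
    for _ in range(iters):
        for di, d in enumerate(docs):
            for wi, w in enumerate(d):
                t = z[di][wi]
                # Remove this token's current assignment ...
                ndk[di][t] -= 1; nkw[t][widx[w]] -= 1; nk[t] -= 1
                # ... then resample it from the collapsed conditional.
                weights = [(ndk[di][j] + alpha) * (nkw[j][widx[w]] + beta)
                           / (nk[j] + V * beta) for j in range(k)]
                t = rng.choices(range(k), weights=weights)[0]
                z[di][wi] = t
                ndk[di][t] += 1; nkw[t][widx[w]] += 1; nk[t] += 1
    return vocab, nkw

docs = [tokenize(t) for t in [
    "stocks bonds markets trading stocks",
    "proteins cells biology cells lab",
    "markets trading bonds stocks funds",
]]
vocab, nkw = train_lda(docs, k=2)
```

Scaling this loop to thousands of 10-K filings is exactly where the scale-up versus scale-out question arises, since the sampler's per-token inner loop is the dominant cost.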

Excerpts from the study are available here. At the same link, subscribers to the Analytics STAC Track™ can access the full STAC Study™, which contains extensive performance analysis, "war stories" about working with the data science tools and cloud products in the project as a regular customer, and detailed configuration information. The implementation code and test dataset are also available.*

The purpose of this first NLP project was to demonstrate the utility of business-driven benchmarks for ML techniques and technologies and to solicit feedback from the STAC AI Working Group. Machine learning and other forms of artificial intelligence are increasingly important in finance, but firms face a wide array of algorithms, frameworks, hardware, and cloud services to choose from. Benchmarks can be useful signposts in such an environment, but they must use workloads that correspond to key business bottlenecks, measure the properties that a business cares about, and fall within a process that ensures the reliability of results.

The benchmarks in this project were not official benchmarks from the STAC Benchmark Council but rather a proposed suite. STAC developed them by applying principles learned from other STAC Benchmarks—such as strategy backtesting, enterprise tick analytics, and derivatives valuation—to NLP use cases it has encountered at financial firms.

In this exercise, we studied scalability. Suboptimal scaling of model training can mean long cycle times (i.e., low data scientist productivity) and less thorough exploration of hyperparameter spaces (i.e., potentially low model quality). The simple theory behind cloud scaling is that if it takes, say, 60 minutes to run something on one core, we can shrink that time to 1 minute by running on 60 cores, all for the same total cost. Reality usually differs from theory; the questions are by how much, and depending on what. One dimension of this question is scale-up versus scale-out. With cloud instances based on Intel Xeon Platinum (“Skylake”) processors now offering 96 vCPUs, and many more vCPUs per instance coming online soon as cloud providers adopt Intel’s “Cascade Lake” processors (including an announced variant with 192 virtual cores per 2-socket server), the relative merits of scale-up and scale-out are becoming salient for some workloads.
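The back-of-the-envelope arithmetic can be made concrete. A minimal sketch with hypothetical numbers — Amdahl's law is a standard way to model how a serial fraction separates theory from reality, not a result from this study:

```python
def ideal_time(time_1core_min, cores):
    """Perfect linear scaling: same total core-minutes, so same total cost."""
    return time_1core_min / cores

def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: the serial fraction caps the achievable speedup."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

base_min = 60.0
# Theory: 60 minutes on 1 core becomes 1 minute on 60 cores, same 60 core-minutes.
t_ideal = ideal_time(base_min, 60)
# Hypothetical reality: if just 5% of the work is serial, 60 cores deliver
# roughly a 15x speedup (~4 minutes), consuming ~4x the core-minutes — and cost.
t_real = base_min / amdahl_speedup(0.95, 60)
```

Whether the serial fraction is better hidden by a single large instance or by a distributed cluster is precisely the scale-up versus scale-out trade-off the study measured.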

We are grateful to Intel for providing financial support for this effort and to Google for offsetting the cloud costs. Their support demonstrates each company’s commitment to the effort to develop rigorous, vendor-independent benchmark standards with long-term value to the industry.


* If your firm does not have access to these materials, please take a minute to learn about subscription options.