This week marks the anniversary of SysML, the first peer-reviewed computer science conference dedicated to research at the intersection of systems and machine learning. Backed by an impressive slate of researchers, including Jeff Dean and Mike Jordan, SysML fills a gap in coverage of topics like hardware systems for ML, software systems for ML, and ML optimized for metrics beyond predictive accuracy. After successfully piloting a single-day conference last year, this year’s SysML featured 32 full-length papers and 15 demos from a wide range of institutions, including usual suspects like Stanford, Google, and FAIR.
Reflecting on my experience as both a SysML program committee member and attendee, three major trends stood out to me:
This means the gap between our aspirations and our capabilities for even the most basic ML deployments still requires major investments in systems and infrastructure.
While better models are often touted as a panacea for achieving usable ML, there’s an increasing realization that for practical uses, clean and labeled data is equally, if not more, important. It’s easy to forget that a decade ago, Google, recently one of the key proponents of automated neural architecture search for better models, argued that more data beats better algorithms. After several years of wins from innovation in architectures, the trend seems to have reverted to data: many recent state-of-the-art results depend on using more data, and all the data available, rather than on radically different architectures.
Multiple conversations and ongoing projects at SysML reflected this focus on better organizing and making use of data, not just building better models. In particular, one SysML 2019 paper I really liked on this topic was from Google’s TFX group on “Data Validation for Machine Learning.” The paper describes in detail how Google combines statistical tests (e.g., testing differences between distributions) and user-provided constraints (e.g., data types and ranges) in a versioned schema for model inputs. In turn, TFX performs automated data validation both at train time and during model deployment to combat anomalies in data at runtime; the paper reports on results from over 700 production deployments, and on real errors like new feature columns found in the data but not in the model schema.
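To make the combination of user-provided constraints and statistical tests concrete, here is a minimal sketch in plain Python. It is not the TFX API: `FeatureSpec`, `validate_batch`, and `linf_distance` are hypothetical names invented for this illustration, and the L-infinity distance is just one simple way to compare two categorical distributions.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class FeatureSpec:
    """Hypothetical per-feature constraint: expected type plus optional range."""
    dtype: type
    min_value: Optional[float] = None
    max_value: Optional[float] = None


def validate_batch(schema: Dict[str, FeatureSpec],
                   rows: List[Dict[str, Any]]) -> List[str]:
    """Check each row against the schema and return human-readable anomalies."""
    anomalies = []
    for i, row in enumerate(rows):
        # Columns present in the data but absent from the schema.
        for name in sorted(row.keys() - schema.keys()):
            anomalies.append(f"row {i}: column '{name}' not in schema")
        for name, spec in schema.items():
            if name not in row:
                anomalies.append(f"row {i}: missing column '{name}'")
                continue
            value = row[name]
            if not isinstance(value, spec.dtype):
                anomalies.append(
                    f"row {i}: '{name}' has type {type(value).__name__}, "
                    f"expected {spec.dtype.__name__}")
                continue
            if spec.min_value is not None and value < spec.min_value:
                anomalies.append(
                    f"row {i}: '{name}'={value} below minimum {spec.min_value}")
            if spec.max_value is not None and value > spec.max_value:
                anomalies.append(
                    f"row {i}: '{name}'={value} above maximum {spec.max_value}")
    return anomalies


def linf_distance(train_values: List[Any], serving_values: List[Any]) -> float:
    """Largest per-category gap between two empirical distributions."""
    if not train_values or not serving_values:
        raise ValueError("both samples must be non-empty")
    p, q = Counter(train_values), Counter(serving_values)
    n_p, n_q = len(train_values), len(serving_values)
    return max(abs(p[k] / n_p - q[k] / n_q) for k in p.keys() | q.keys())


schema = {"age": FeatureSpec(int, 0, 120), "country": FeatureSpec(str)}
rows = [
    {"age": 34, "country": "US"},
    {"age": -1, "country": "CH", "referrer": "ad"},  # two anomalies
]
anomalies = validate_batch(schema, rows)
drift = linf_distance(["US", "US", "CH", "CH"], ["US", "US", "US", "CH"])
```

In a setup like this, the schema would be versioned alongside the model, and the same checks would run on both training batches and serving traffic; a drift value above some chosen threshold would be flagged for review.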
One of the challenges of productionizing ML is the staggering lack of support for end-to-end model deployment. This includes everything from toolchain support for transitioning from training to serving models, to deploying on potentially heterogeneous clusters, to assessing model reliability and performance. While efforts like TensorFlow Serving, ONNX, and MLflow provide helpful building blocks (and the literature is similarly full of promising proposals), the infrastructure for end-to-end development is largely underdeveloped.
To help deploy with confidence, researchers from ETH Zurich, Alibaba, and Microsoft Research presented a paper on “Continuous Integration for Machine Learning Models.” As part of the ease.ml research project, the authors developed practical tests for assessing whether a new model achieves a desired accuracy target. This problem is challenging because obtaining a good estimate of model accuracy using a straightforward implementation of confidence intervals can require prohibitively large amounts of data. Instead, the paper develops a set of tests that leverage properties of continuous integration, like the fact that software is successively deployed in releases, not written from scratch every time. This allows 1% error tolerance with 99.9% reliability, using only 2K labels per CI test.
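To see why the straightforward approach is expensive, consider a back-of-the-envelope calculation. The sketch below uses the textbook two-sided Hoeffding bound for a single, non-adaptive test of a 0/1 accuracy metric; this is an assumed baseline chosen for illustration, not necessarily the exact one analyzed in the paper, and `hoeffding_sample_size` is a hypothetical helper name.

```python
import math


def hoeffding_sample_size(epsilon: float, delta: float) -> int:
    """Labels needed so that the empirical accuracy is within `epsilon` of the
    true accuracy with probability at least 1 - delta, per Hoeffding's
    inequality: n >= ln(2 / delta) / (2 * epsilon^2)."""
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie in (0, 1)")
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


# 1% error tolerance with 99.9% reliability, as in the setting above.
naive_labels = hoeffding_sample_size(epsilon=0.01, delta=0.001)
print(naive_labels)  # 38005
```

Under these assumptions, one test needs roughly 38,000 fresh labels, and because the requirement scales with 1/epsilon^2, halving the tolerance quadruples it. That is the scale of cost the paper's 2K-label figure should be read against.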
Especially given the challenge and expense of hiring machine learning engineers and building models, there’s continued interest in new abstractions that make it easier, faster, and cheaper to express new model architectures and that offer more familiar interfaces, ideally without incurring performance penalties. Mike Jordan has written on the idea that intelligence automation and intelligent infrastructure comprise a new kind of engineering discipline, and the interfaces that allow users to author models are a core component of this discipline.
SysML featured an entire session on this topic, including two proposals for authoring imperative model code that’s easier to write but is often slower than graph-based approaches, a new engine for executing distributed reinforcement learning tasks, and TensorFlow in the browser. Collectively, these papers illustrate the challenge of programming in a future with ML everywhere – in every program, and on most devices. Much like we saw an evolution from low- to high-level interfaces in the Big Data ecosystem (e.g., from Hadoop MapReduce to Spark SQL), we’re starting to see the first set of proposals beyond classic TensorFlow/PyTorch programming come to fruition (e.g., AutoGraph is actually shipping in TensorFlow v2.0).
This isn’t easy. For example, Section 4 of the TensorFlow.js paper has some hairy details about integrating backpropagation with WebGL, including loss of precision due to Chrome’s handling of mobile GPU floating point, that are reminiscent of early papers on GPGPU programming.
In all, the view from research indicates consuming ML “off the shelf” will require considerable investment in data acquisition, deployment infrastructure, and interfaces. These results echo our conviction at Sisu that leveraging all data available is necessary to enable high quality, useful ML at scale. Putting data from disparate sources and organizational units to work is hard, and requires fundamentally new algorithmic tools and systems that can easily support training, deployment, and debugging. Moreover, we’re just getting started with new interfaces to ML. While much of the work on cutting-edge interfaces targets ML engineers and data scientists, there’s far more work to be done and impact to be had up the stack, via concerted collaboration between ML, UX, and systems.