Peter Bailis · December 04, 2018

Big Data: Where’s the value?

With today’s excitement about AI and machine learning, it’s easy to forget that we were only recently enamored with Big Data and its promise of extracting value from high-volume, high-dimensional data. In just over a decade, Big Data changed our perception of data at scale: far cheaper storage and processing led to a widespread shift, from treating data as a cost center requiring expensive data warehouses to treating data as an asset with huge value waiting to be unlocked.

So, what happened to the promise of Big Data? Despite the widespread collection of data at scale, there’s little evidence that most enterprises efficiently realize this value. Data science remains one of the most in-demand skill sets, and several name-brand efforts based on bespoke, ad-hoc analytics have publicly struggled to scale and deliver on their promises. In many warehouses, only a small fraction of the data is ever utilized. Measured against its promise, it’s surprising that Big Data isn’t considered more of a failure.

Deep learning to the rescue…

Around the time we might have expected a “trough of disillusionment” for Big Data, we saw massive advances in machine learning capabilities via the rise of modern deep learning. Thanks to increased amounts of annotated data and cheap compute, deep networks roared onto the scene in 2012. New architectures like AlexNet obliterated past approaches to machine learning tasks that relied on hand-tuned, manually engineered features. Given continued advances on hard tasks like object detection and question answering, it seems that extracting the value from Big Data is finally within sight.

While modern deep networks are a major advance for machine learning, they excel at processing data that’s different from much of what’s stored in today’s Big Data lakes. Historically, deep networks have performed especially well on unstructured data, like visual and textual data. However, for making predictions over structured data (i.e., data with a common, repetitive, well-defined schema), like the transaction or customer records in a data warehouse, deep networks aren’t a panacea: on structured data, much simpler models often perform nearly as well. Instead, the bottleneck lies in simply putting the data to use.

One of my favorite examples of this phenomenon comes from Google’s recent paper on “Scalable and accurate deep learning with electronic health records.” Buried on page 12 of the Supplemental Materials, we see that logistic regression (which appears in lecture 3 of our intro ML class at Stanford) “essentially performs just as well as Deep Nets” for these predictive tasks, coming within 2-3% in accuracy without any manual feature engineering.
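To make that comparison concrete, here is a minimal sketch using scikit-learn on synthetic tabular data (not the paper’s EHR records): a plain logistic regression versus a small neural network on structured features, with no manual feature engineering. The dataset size, feature counts, and architectures are illustrative assumptions, not the paper’s setup.

```python
# Toy illustration: on synthetic structured (tabular) data, a plain logistic
# regression is often within a few points of a small neural network.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic "structured" data: a fixed schema of numeric features.
X, y = make_classification(n_samples=5000, n_features=40, n_informative=15,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
mlp = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500,
                                  random_state=0))

for name, model in [("logistic regression", logreg), ("small deep net", mlp)]:
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")
```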

A platform for putting data to work

For many use cases, putting data to work doesn’t require a new deep network, or more efficient neural architecture search. Instead, it requires new software tools. What does such a toolkit for using structured data at scale look like? At Sisu, we believe it will:

  1. Help navigate organizations’ existing data at scale. Modern organizations are sitting on massive amounts of data in warehouses like Redshift, BigQuery, and Snowflake. Displaying raw data in a table or set of dashboards is insufficient and impractical—the volume of this data is just too great. A usable ML platform will need to help users proactively identify where to look and how to respond for a given predictive task, in real time.

  2. Provide results users can trust. Deep networks are notoriously difficult to interpret: why should we trust their output? Usable ML platforms must explain their rationale for a given prediction or recommendation so a user can understand and verify the output (a minimal sketch of one such explanation appears after this list). As a result, we believe black-box AutoML-oriented solutions that fail to earn user trust will see near-term uptake only for the lowest-value tasks.

  3. Work alongside users. Except for the most mechanical and precisely specified tasks like datacenter scheduling, we’re years away from complete automation of even routine business workflows. As a result, usable ML platforms will work alongside users, augmenting their intuition and their existing workflows. Users are smart, and ML platforms can make them smarter.
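As a concrete illustration of the second point above, here is a minimal sketch of a per-prediction explanation: for an interpretable linear model, each feature’s contribution to the score is simply coefficient times value, so the rationale behind a prediction can be listed and checked directly. The data, feature names, and model are illustrative assumptions, not Sisu’s system.

```python
# A sketch of a verifiable per-prediction rationale using a linear model:
# each feature's contribution to the logit is coefficient * value.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical structured data with named columns (illustrative only).
feature_names = [f"feature_{i}" for i in range(8)]
X, y = make_classification(n_samples=2000, n_features=8, n_informative=5,
                           random_state=0)
X = StandardScaler().fit_transform(X)
model = LogisticRegression().fit(X, y)

def explain(model, x, names, top_k=3):
    """Return the top contributing features for one prediction."""
    contributions = model.coef_[0] * x          # per-feature contribution to the logit
    order = np.argsort(-np.abs(contributions))  # largest absolute contribution first
    return [(names[i], float(contributions[i])) for i in order[:top_k]]

row = X[0]
print("predicted class:", int(model.predict(row.reshape(1, -1))[0]))
for name, contrib in explain(model, row, feature_names):
    print(f"  {name}: {contrib:+.2f}")
```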

A usable analytics platform with these capabilities would enable fundamentally new platform architectures. In contrast with spreadsheet software or modern business intelligence, which are built around a manual, user-driven interaction model, the vast amount of data available in a modern lake allows us to obtain high-quality results from weaker specifications: instead of requiring users to completely specify their queries of interest, we can infer user intent. We can also use historical interactions to personalize ranking and relevance, and predict future intent using variants of reinforcement learning.
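As a rough sketch of what inferring intent from historical interactions can look like, the toy function below re-ranks candidate results for a user by blending a base relevance score with a boost from that user’s past interactions. The item names, scores, and weighting are invented for illustration; a production system would learn these signals rather than hard-code them.

```python
# Toy personalized ranking: blend base relevance with a user's interaction history.
from collections import Counter

def personalized_rank(candidates, base_scores, user_history, history_weight=0.5):
    """Rank candidate items by base relevance plus a boost from past interactions."""
    clicks = Counter(user_history)  # how often this user engaged with each item
    def score(item):
        return base_scores[item] + history_weight * clicks[item]
    return sorted(candidates, key=score, reverse=True)

# Two users issue the same under-specified query but see different orderings.
candidates = ["revenue_dashboard", "churn_report", "latency_drilldown"]
base_scores = {"revenue_dashboard": 0.9, "churn_report": 0.8, "latency_drilldown": 0.7}

print(personalized_rank(candidates, base_scores, user_history=["churn_report"] * 3))
print(personalized_rank(candidates, base_scores, user_history=["latency_drilldown"] * 5))
```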

Today, these technologies are common in consumer internet applications (e.g., Google’s keyword search, Facebook’s news feed) but are completely foreign to enterprise analytics settings. Given the volume of data available in data lakes, we can finally afford to apply these techniques to private, first-party data as found in modern organizations.

The fundamental challenge lies in leveraging this structured data effectively without requiring expert intervention. At Sisu, we have a strong hypothesis about how to do so. To learn more, sign up for access, or come work with us.

Illustration by Michie Cao.
Peter Bailis · October 03, 2018

A high-stakes conundrum

During his seminal Graduate Computer Systems course at UC Berkeley, Eric Brewer offered the following anecdote:

Eric’s company Inktomi provided web search for major sites such as Yahoo! and ultimately peaked at a public valuation of $25 billion. Their massive success led to an unexpected problem: Inktomi had run out of capacity in their only datacenter. As a result, their growth was capped.

This was the mid-1990s, so Inktomi couldn’t just spin up servers on EC2 the way we can today. Instead, they had to lease a second datacenter over 50 miles away.

Migrating their servers between these datacenters posed a serious challenge. Inktomi couldn’t simply turn their servers off and drive them to the new datacenter because they were serving live traffic. On the other hand, replicating their data would be complex, and full replication would mean doubling their server count.

Nevertheless, Brewer and colleagues devised a clever plan to guarantee 100% uptime without buying a single additional server.

This is seemingly impossible: in a regular database, if the data on any server (i.e., shard) is inaccessible, the engine can’t guarantee a correct result. In fact, some databases simply won’t execute queries if a single server is unavailable.

What is correctness, anyway?

Inktomi’s clever insight was to rethink the definition of “correct.” In web search, some results are more relevant than others. Web search is subjective: when I search for “Kafka,” I may want to read the Wikipedia page for Franz Kafka the author, while you may be looking for docs for the Kafka message queue. While results vary in quality, even modern search engines like Google don’t make claims of optimality. Instead, we use search engines because they’re generally helpful and relevant.

Over the course of a weekend, Inktomi turned off half their servers and drove them to the new datacenter, serving queries from the half left behind. Then they redirected traffic to the newly installed servers. This meant that during the migration, searches only reached one half of Inktomi’s servers.

While half of the Web was effectively excluded from searches during the migration, Inktomi provided 100% uptime. These search results might not have been as useful as the day before, but they were “good enough” for a weekend’s worth of queries. And today, these techniques are still widely used in distributed computing.
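In code, the difference between “fail if any shard is unreachable” and Inktomi’s “answer with whatever is reachable” comes down to a single decision in the scatter-gather loop. The sketch below is a toy model with invented shards, documents, and scores, not Inktomi’s actual system; it only illustrates the best-effort pattern.

```python
# Best-effort scatter-gather search: query every reachable shard, skip the ones
# that are offline (e.g., on a truck to the new datacenter), and merge whatever
# comes back. Results may be incomplete, but the service stays up.
SHARDS = {
    "shard-0": {"kafka": [("Franz Kafka (author)", 0.9)]},
    "shard-1": {"kafka": [("Apache Kafka docs", 0.8)]},   # offline during the move
    "shard-2": {"kafka": [("Kafka on the Shore", 0.5)]},
}
OFFLINE = {"shard-1"}

def search(query, top_k=10):
    results = []
    for name, index in SHARDS.items():
        if name in OFFLINE:
            continue  # tolerate the missing shard instead of failing the query
        results.extend(index.get(query, []))
    # Merge partial results by relevance: "good enough" rather than complete.
    return sorted(results, key=lambda hit: hit[1], reverse=True)[:top_k]

print(search("kafka"))  # the Apache Kafka page is missing, but the query still succeeds
```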

Statistical Robustness: The New Normal

As machine learning and “Software 2.0” become dominant in software development, Inktomi’s lessons are increasingly relevant. Our modern compute stack – from compilers to runtimes to hardware – is built around a deterministic, precise model of computation. But statistical forms of correctness are becoming the new normal. Researchers have begun to explore the impact of this relaxed statistical correctness in domains including language design, concurrent programming, microarchitecture, sensor processing, and query optimization. But in practice, everything’s still up for grabs, and the mainstream software stack stands to be reinvented.

Google’s decades-long dominance in web search highlights the importance of scale and quality in ML-powered products: the highest quality results win. As more of the world’s workloads shift to ML training and inference, which are similar to web search in their statistical robustness, software that can efficiently exploit this robustness is becoming key to success.

If you’re interested in helping build the future of data analytics, join us at Sisu.

Illustration by Michie Cao.
Peter Bailis · July 30, 2018

More data is recorded today than ever before, offering hyper-resolution into the environments and behaviors that define us. Despite this increasing potential, our tools haven’t kept pace. Manual analysis via spreadsheets, charts, and dashboards remains our primary approach, but when these tools are applied to today’s complex data, extracting value is slow, painful, and error-prone. When these tools fail, we turn to people, in the form of machine learning and data science teams. But these teams are scarce, even within the largest organizations.

To close the gap between recording data and acting on it, my research group at Stanford builds new interfaces, algorithms, and systems for making advanced analytics and machine learning usable. Over the past several years, we’ve worked with domain experts to make data-informed advances in the sciences, and with some of the most advanced companies to improve efficiency and reliability. These experiences have proven there’s an opportunity for a new kind of analytics that’s both more usable and more efficient.

After years of sitting on the sidelines at Stanford, I’m putting skin in the game. I’m taking a leave of absence from my tenure-track position at Stanford to found Sisu, a new company headquartered in San Francisco. At Sisu, we’re developing and applying cutting-edge technology to help people use data to make better decisions. We’re building a new analytics stack.

To maximize our impact, we’ve raised a $14.2 million Series A round of financing led by Andreessen Horowitz. In addition, Ben Horowitz, co-founder and general partner at Andreessen Horowitz, has joined Sisu’s board of directors.

While we’re currently in stealth, our team at Sisu is quickly growing, with deep expertise spanning machine learning, databases, and distributed systems. If you want to help build the future of data analytics, join us.

© 2018 Sisu Data, Inc.