Peter Bailis · October 03, 2018

A high-stakes conundrum

During his seminal Graduate Computer Systems course at UC Berkeley, Eric Brewer offered the following anecdote:

Eric’s company Inktomi provided web search for major sites such as Yahoo! and ultimately peaked at a public valuation of $25 billion. Their massive success led to an unexpected problem: Inktomi had run out of capacity in their only datacenter. As a result, their growth was capped.

This was the mid-1990s, so Inktomi couldn’t just spin up servers the way we can on EC2 today. Instead, they had to lease a second datacenter, over 50 miles away.

Migrating their servers between these datacenters posed a serious challenge. Inktomi couldn’t simply turn their servers off and drive them to the new datacenter because they were serving live traffic. On the other hand, replicating their data would be complex, and full replication would mean doubling their server count.

Nevertheless, Brewer and colleagues devised a clever plan to guarantee 100% uptime without buying a single additional server.

This is seemingly impossible: in a regular database, if the data on any server (i.e., shard) is inaccessible, the engine can’t guarantee a correct result. In fact, some databases simply won’t execute queries if a single server is unavailable.

What is correctness, anyway?

Inktomi’s clever insight was to rethink the definition of “correct.” In web search, some results are more relevant than others. Web search is subjective: when I search for “Kafka,” I may want to read the Wikipedia page for Franz Kafka the author, while you may be looking for docs for the Kafka message queue. While results vary in quality, even modern search engines like Google don’t make claims of optimality. Instead, we use search engines because they’re generally helpful and relevant.

Over the course of a weekend, Inktomi turned off half of their servers and drove them to the new datacenter, serving queries from the half left behind. Once the relocated servers were online, they redirected traffic to them. This meant that during the migration, searches only reached one half of Inktomi’s servers.

While half of the Web was effectively excluded from searches during the migration, Inktomi provided 100% uptime. These search results might not have been as useful as the day before, but they were “good enough” for a weekend’s worth of queries. And today, these techniques are still widely used in distributed computing.
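To make the trade-off concrete, here’s a minimal sketch of this kind of best-effort, scatter-gather query in Python. It is not Inktomi’s actual code; the Shard.search method and the score attribute on results are assumptions for illustration. A strictly correct engine would abort as soon as any shard was unreachable, whereas this version merges whatever results it can get.

```python
# Minimal sketch of a best-effort scatter-gather search (illustrative only).
# Assumes a hypothetical Shard object exposing search(query) -> list of
# results, where each result carries a numeric .score; both are assumptions.

from concurrent.futures import ThreadPoolExecutor


def best_effort_search(shards, query, top_k=10, per_shard_timeout=0.5):
    """Fan the query out to all shards and merge whatever comes back.

    A strictly "correct" engine would fail if any shard were unreachable;
    here an unreachable shard is simply skipped, trading completeness of
    the result set for availability.
    """
    results = []
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        futures = [pool.submit(shard.search, query) for shard in shards]
        for future in futures:
            try:
                results.extend(future.result(timeout=per_shard_timeout))
            except Exception:
                # Shard offline (say, in a truck between datacenters): skip it.
                continue
    # Rank the partial results by relevance and return the best ones.
    return sorted(results, key=lambda r: r.score, reverse=True)[:top_k]
```

The same pattern appears today wherever a system prefers a degraded-but-available answer over no answer at all.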

Statistical Robustness: The New Normal

As machine learning and “Software 2.0” become dominant in software development, Inktomi’s lessons are increasingly relevant. Our modern compute stack – from compilers to runtimes to hardware – is built around a deterministic, precise model of computation. But statistical forms of correctness are becoming the new normal. Researchers have begun to explore the impact of this relaxed statistical correctness in domains including language design, concurrent programming, microarchitecture, sensor processing, and query optimization. But in practice, everything’s still up for grabs, and the mainstream software stack stands to be reinvented.

Google’s decades-long dominance in web search highlights the importance of scale and quality in ML-powered products: the highest quality results win. As more of the world’s workloads shift to ML training and inference, which are similar to web search in their statistical robustness, software that can efficiently exploit this robustness is becoming key to success.

If you’re interested in helping build the future of data analytics, join us at Sisu.

Peter Bailis · July 30, 2018

More data is recorded today than ever before, offering hyper-resolution into the environments and behaviors that define us. Despite this increasing potential, our tools haven’t kept pace. Spreadsheets, charts, and dashboards remain our primary tools for manual analysis, but when applied to today’s complex data, extracting value is slow, painful, and error-prone. When these tools fail, we turn to people, in the form of machine learning and data science teams. But these teams are scarce, even within the largest organizations.

To close the gap between recording data and acting on it, my research group at Stanford builds new interfaces, algorithms, and systems for making advanced analytics and machine learning usable. Over the past several years, we’ve worked with domain experts to make data-informed advances in the sciences, and with some of the most advanced companies to improve their efficiency and reliability. These experiences have proven there’s an opportunity for a new kind of analytics that’s both more usable and more efficient.

After years of sitting on the sidelines at Stanford, I’m putting skin in the game. I’m taking a leave of absence from my tenure-track position at Stanford to found Sisu, a new company headquartered in San Francisco. At Sisu, we’re developing and applying cutting-edge technology to help people use data to make better decisions. We’re building a new analytics stack.

To maximize our impact, we’ve raised a $14.2 million Series A round of financing led by Andreessen Horowitz. In addition, Ben Horowitz, co-founder and general partner at Andreessen Horowitz, has joined Sisu’s board of directors.

While we’re currently in stealth, our team at Sisu is quickly growing, with deep expertise spanning machine learning, databases, and distributed systems. If you want to help build the future of data analytics, join us.

© 2018 Sisu Data, Inc.