Lightning-fast Schema Inference in Redshift

by Dan Burkert · February 5, 2019

Lightning-fast Schema Inference in Redshift

Inferring the schema — or organizational structure — of a SQL query is key to understanding what data is contained in a result set. The query schema is how we obtain names of columns, headers in dataframes, and metadata for result export. At Sisu, fast schema inference is a critical first step in helping our users put their data to work.

While schema inference is easy for simple tables (e.g., using SQL’s DESCRIBE), schema inference can be prohibitively expensive for complex queries including joins, subqueries, and derived tables. In this post, we’ll show you a simple trick we’ve used to improve schema inference performance by over 100x in Redshift. The principles here should translate to most relational databases, too.

KISS?

The simplest way to infer the schema of a query is to execute the query and inspect the results. Every library for interacting with a database has some way to inspect the metadata of a result; for example, JDBC has a ResultSet.getMetaData method that returns the schema of the table.

inference method runtime
execute-query 11.2s

Unfortunately, this simple strategy incurs a large cost: the database must fully evaluate the query and return the result, which is subsequently thrown away. As a result, the performance of this approach to schema inference depends on both the complexity of the query and how much data the query processes. We found firsthand that this approach doesn’t scale to large datasets, so we set out to find a better solution.

Redshift Query Execution Flow

Who said the database should actually use query options?

At this point, we looked for a way to trim the data returned to the client without affecting the metadata. The JDBC API provides a convenient method that limits the rows included in the result set: Statement.setMaxRows.

inference method runtime
max-rows 11.3s

Unfortunately, setting the maximum number of rows to 0 via the JDBC API’s setMaxRows parameter has a negligible effect on performance. It turns out that the setMaxRows option is only a hint in the Redshift JDBC driver library and has no effect on the amount of work the database performs or the amount of data passed back to the client. As a result, this option is just as slow as our simple baseline.

Subqueries to the Rescue!

Stymied by slow schema inference, our team received a well-timed hint from a Sisu user with a lot of Redshift expertise. The idea is simple: wrap the query in a filter or limit clause that will force it to return 0 rows:

SELECT * FROM (<query>) WHERE true = false;
-- or
SELECT * FROM (<query>) WHERE true <> true;
-- or
SELECT * FROM (<query>) LIMIT 0;

This results in a huge speed increase:

inference method runtime
true = false 153ms
true <> true 86ms
LIMIT 0 83ms

What causes this massive performance improvement? Each additional clause guarantees an empty result. One consequence of this empty result set is that this optimization cuts down on the amount of data which needs to be transmitted. However, these clauses also enable the database to perform additional optimizations.

Specifically, Redshift is able to push the outer filter (or limit) into the subquery (i.e., the query being analyzed), which in turn enables execution to short-circuit and return an empty result set without ever paying the cost of query evaluation!

Of course, there is a caveat: the effectiveness of this optimization across databases is subject to the internals of the database query optimizer. Ensuring that the optimizer behaves correctly across all databases and versions is fragile.

We found that Redshift will optimize seemingly similar filters differently. For example, with Redshift the true = false filter appears to do more work than the other examples. Whereas true <> true and LIMIT 0 are short-circuited before even being added to the query log (the SVL_QLOG system table), the true = false query is not:

user@cluster:db> SELECT * FROM (SELECT col1 FROM tbl1) WHERE true = false;
...
user@cluster:db> SELECT * FROM (SELECT col1 FROM tbl1) WHERE true <> true;
...
user@cluster:db> SELECT * FROM (SELECT col1 FROM tbl1) LIMIT 0;
...
user@cluster:db> SELECT query, elapsed, substring
   FROM SVL_QLOG
  WHERE userid <> 1
    AND starttime > (GETDATE() - INTERVAL '10 seconds')
  ORDER BY query DESC;
+---------+-----------+-----------------------+
| query   | elapsed   | substring             |
|---------+-----------+-----------------------|
| 1083768 | 77850     | SELECT col1 FROM tbl1 |
+---------+-----------+-----------------------+

Going even faster: The Prepared Statement

Using either the predicates from above or a limit we can avoid returning any rows from the database. At Sisu, we’re obsessed with speed, and asked the natural question: Can we go even faster?

One strategy is to remove the execution step altogether while still validating the query.

The SQL standard contains a PREPARE command that checks and compiles a query. This kind of prepared statement can be run repeatedly without incurring the overhead of the optimizer on every execution.

Redshift Prepared Statement Flow

It turns out that, in creating a prepared statement for a client, the database also returns metadata for the result set schema. As a result, no data is returned to the client and the query is guaranteed to not execute. In addition, unlike the solution which adds an always-false predicate, prepared statements have the benefit of not modifying the original query. This means that, in the event of a syntax error, the resulting error message will contain row and column numbers that match the original query string.

inference method runtime
prepared statement 122ms

So why is the prepared statement a little slower than LIMIT 0 and true <> true despite all strategies ensuring no query is executed? Our working theory is that this small difference occurs because prepared statements must be deallocated explicitly, requiring an additional round trip between the client and database. Since we ran these tests using a client outside the AWS datacenter, this additional round trip incurs a small but measurable difference. To verify, we ran the benchmark again while intentionally leaking the prepared statement and the performance difference disappeared! The additional round trip also shows up when tracing the connection between the client and database with a packet analyzer.

inference method runtime
prepared statement (leaked) 83ms

Conclusion

After experimenting with several methods for schema inference, we settled on using prepared statements. The prepared statement approach satisfies all of the following requirements:

  • Does not tie up database resources executing a query
  • Doesn’t modify user queries, preserving database errors
  • Works on any SQL database
  • Fast

If you’re curious, here are the full timings of each of the methods when inferring the schema of a simple SELECT * query against a Redshift table containing 10 columns and 150,000 rows:

Results Plot

Illustrations by Michie Cao.

To reproduce this analysis against your own database and queries, check out our schema inference benchmark: