Inferring the schema — or organizational structure — of a SQL query is key to understanding what data is contained in a result set. The query schema is how we obtain names of columns, headers in dataframes, and metadata for result export. At Sisu, fast schema inference is a critical first step in helping our users put their data to work.
While schema inference is easy for simple tables (e.g., using SQL’s
DESCRIBE), it can be prohibitively expensive for complex queries including
joins, subqueries, and derived tables. In this post, we’ll show you a simple
trick we’ve used to improve schema inference performance by over 100x in
Redshift. The principles here should translate to most relational databases, too.
The simplest way to infer the schema of a query is to execute the query and
inspect the results. Every library for interacting with a database has some way
to inspect the metadata of a result; for example, JDBC’s
ResultSet.getMetaData() method returns a ResultSetMetaData object describing the columns of the result set.
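Here’s a rough sketch of this naive approach in Java; the conn, query, and columnsOf names are our own placeholders, not code from the original post:

import java.sql.*;
import java.util.*;

// Collect "name type" strings for each column described by the metadata.
static List<String> columnsOf(ResultSetMetaData meta) throws SQLException {
    List<String> columns = new ArrayList<>();
    for (int i = 1; i <= meta.getColumnCount(); i++) { // JDBC columns are 1-indexed
        columns.add(meta.getColumnName(i) + " " + meta.getColumnTypeName(i));
    }
    return columns;
}

// Naive schema inference: execute the full query, then inspect the result.
static List<String> inferSchemaNaive(Connection conn, String query) throws SQLException {
    try (Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(query)) { // the entire query runs here
        return columnsOf(rs.getMetaData());
    }
}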
Unfortunately, this simple strategy incurs a large cost: the database must fully evaluate the query and return the result, which is subsequently thrown away. As a result, the performance of this approach to schema inference depends on both the complexity of the query and how much data the query processes. We found firsthand that this approach doesn’t scale to large datasets, so we set out to find a better solution.
At this point, we looked for a way to trim the data returned to the client
without affecting the metadata. The JDBC API provides a convenient method,
Statement.setMaxRows, that limits the rows included in the result set:
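A sketch of this attempt, reusing the hypothetical conn, query, and columnsOf from above:

static List<String> inferSchemaMaxRows(Connection conn, String query) throws SQLException {
    try (Statement stmt = conn.createStatement()) {
        stmt.setMaxRows(0); // only a hint in the Redshift JDBC driver
        try (ResultSet rs = stmt.executeQuery(query)) { // the database still does all the work
            return columnsOf(rs.getMetaData());
        }
    }
}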
Unfortunately, setting the maximum number of rows to 0 via the JDBC API’s
setMaxRows parameter has a negligible effect on performance. It turns out that
the setMaxRows option is only a hint
in the Redshift JDBC driver and has no effect on the amount of work the
database performs or the amount of data passed back to the client. As a result, this option is just as slow as our simple baseline.
Stymied by slow schema inference, our team received a well-timed hint from a Sisu user with a lot of Redshift expertise. The idea is simple: wrap the query in a filter or limit clause that will force it to return 0 rows:
SELECT * FROM (<query>) WHERE true = false;
-- or
SELECT * FROM (<query>) WHERE true <> true;
-- or
SELECT * FROM (<query>) LIMIT 0;
This results in a huge speed increase: in our tests, schema inference ran over 100x faster than the baseline.
What causes this massive performance improvement? Each of these clauses guarantees an empty result, which by itself cuts down on the amount of data that needs to be transmitted. But these clauses also enable the database to perform additional optimizations.
Specifically, Redshift is able to push the outer filter (or limit) into the subquery (i.e., the query being analyzed), which in turn enables execution to short-circuit and return an empty result set without ever paying the cost of query evaluation!
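A hypothetical helper that applies the LIMIT 0 variant of this trick before reading metadata might look like the following (the q alias is ours; some databases insist that derived tables be named):

static List<String> inferSchemaLimitZero(Connection conn, String query) throws SQLException {
    String wrapped = "SELECT * FROM (" + query + ") AS q LIMIT 0";
    try (Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(wrapped)) { // empty result, metadata intact
        return columnsOf(rs.getMetaData());
    }
}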
Of course, there is a caveat: the effectiveness of this optimization is subject to the internals of each database’s query optimizer, and relying on the optimizer to behave this way across all databases and versions is fragile.
We found that Redshift will optimize seemingly similar filters
differently. For example, the
true = false filter appears
to do more work than the other examples. Whereas
true <> true and
LIMIT 0 are short-circuited before even being added to the query log (the
SVL_QLOG system table), the
true = false query is not:
user@cluster:db> SELECT * FROM (SELECT col1 FROM tbl1) WHERE true = false;
...
user@cluster:db> SELECT * FROM (SELECT col1 FROM tbl1) WHERE true <> true;
...
user@cluster:db> SELECT * FROM (SELECT col1 FROM tbl1) LIMIT 0;
...
user@cluster:db> SELECT query, elapsed, substring FROM SVL_QLOG
                 WHERE userid <> 1
                   AND starttime > (GETDATE() - INTERVAL '10 seconds')
                 ORDER BY query DESC;
+---------+-----------+-----------------------+
| query   | elapsed   | substring             |
|---------+-----------+-----------------------|
| 1083768 | 77850     | SELECT col1 FROM tbl1 |
+---------+-----------+-----------------------+
Using either the predicates from above or a limit, we can avoid returning any rows from the database. At Sisu, we’re obsessed with speed, so we asked the natural question: can we go even faster?
One strategy is to remove the execution step altogether while still validating the query.
The SQL standard contains a
PREPARE command that checks and compiles a query.
This kind of prepared statement can be run repeatedly without incurring the
overhead of the optimizer on every execution.
It turns out that, in creating a prepared statement for a client, the database also returns metadata describing the result set’s schema. No data is returned to the client, and the query is guaranteed not to execute. In addition, unlike the solutions that wrap the query in an always-false predicate or a limit, prepared statements have the benefit of not modifying the original query. This means that, in the event of a syntax error, the resulting error message will contain row and column numbers that match the original query string.
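Here’s a minimal sketch of this approach in JDBC, where PreparedStatement.getMetaData() describes the result without executing the statement (again, conn, query, and columnsOf are our placeholders):

static List<String> inferSchemaPrepared(Connection conn, String query) throws SQLException {
    // Closing the statement deallocates it on the server; as discussed
    // below, that deallocation costs an extra round trip.
    try (PreparedStatement stmt = conn.prepareStatement(query)) {
        return columnsOf(stmt.getMetaData()); // no execution happens
    }
}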
So why is the prepared statement a little slower than
LIMIT 0 and
true <> true, despite all three strategies ensuring no query is executed? Our
working theory is that this small difference occurs because prepared statements
must be deallocated explicitly, requiring an additional round trip between the
client and the database. Since we ran these tests using a client outside the AWS
datacenter, this additional round trip incurs a small but measurable cost.
To verify, we ran the benchmark again while intentionally leaking the prepared
statement, and the performance difference disappeared: the leaked prepared
statement clocked in at 83ms. The additional round trip also shows up when
tracing the connection between the client and the database.
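For the benchmark, "leaking" the statement amounts to skipping the close, a sketch of which is below; this is benchmark-only code, since it leaves a statement allocated on the server:

static List<String> inferSchemaPreparedLeaked(Connection conn, String query) throws SQLException {
    PreparedStatement stmt = conn.prepareStatement(query); // intentionally never closed
    return columnsOf(stmt.getMetaData()); // skips the DEALLOCATE round trip
}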
After experimenting with several methods for schema inference, we settled on using prepared statements. The prepared statement approach satisfies all of our requirements: it returns no rows to the client, it guarantees the query is never executed, and it leaves the original query unmodified, so error messages report row and column numbers that match the original query string.
If you’re curious, we collected full timings for each of these methods by
inferring the schema of a simple
SELECT * query against a Redshift table
containing 10 columns and 150,000 rows.
To reproduce this analysis against your own database and queries, check out our schema inference benchmark.