Lazy Record Streams and clojure.contrib.sql

I’ve been working with the very useful clojure.contrib.sql package. One wrinkle I’ve encountered, in an otherwise lazy language, is that SQL result sets typically are not lazy: by default the driver loads the entire result set into memory at once, even if the java.sql.ResultSet is accessed lazily on the Clojure side. This isn’t so much an issue with clojure.contrib.sql as with how the JDBC classes are being used.

For many use cases (fetching a handful of rows, a user profile, and the like) this is a perfectly reasonable default. When you’re working with result sets that are larger than will fit into memory, though, this default behavior isn’t so desirable.

First Attempt: LIMIT using an Offset

I like the interface that the contrib library offers: a lazy sequence over the result set. Since PostgreSQL supports row ranges in its SQL (via LIMIT and OFFSET), it would seem that you could write up a quick chunking function that pulls ranges of records from the table lazily, while still presenting everything as a single stream of records from a single result set.

The first function, get-record-range , simply takes a table name, a starting offset, and a record count, and executes a select for the given range and offset. The second function uses the first to create a lazy sequence, generating the next block of records when the previous block is exhausted. It includes a check to ensure it terminates when there are no more records.
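The original snippets aren’t reproduced here, but based on the description they would look roughly like this. This is a sketch: the exact names and signatures are reconstructions from the prose, and `exec-sql` stands in for the db/exec-sql wrapper discussed below.

```clojure
;; `exec-sql` stands in for the db/exec-sql wrapper: it runs a parameterized
;; query vector and returns the fully realized rows.
(declare exec-sql)

(defn get-record-range
  "Fetch `block-size` records from `table`, starting at `offset`."
  [table offset block-size]
  (exec-sql
   [(str "SELECT * FROM " (name table) " LIMIT ? OFFSET ?")
    block-size offset]))

(defn record-seq
  "A lazy sequence over every record in `table`, fetched in blocks of
   `block-size`. Terminates when a fetch comes back empty."
  ([table block-size] (record-seq table 0 block-size))
  ([table offset block-size]
   (lazy-seq
    (let [block (get-record-range table offset block-size)]
      (when (seq block)   ; no more records => end of the sequence
        (concat block
                (record-seq table (+ offset block-size) block-size)))))))
```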

The only piece missing from the above is our db/exec-sql , which is just a convenience wrapper around clojure.contrib.sql/with-query-results that forces the lazy record sequence (this was not the source of our memory troubles). db/exec-sql is something I often use to quickly grab the entirety of a result set.
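For reference, db/exec-sql is little more than the following sketch, written against the old clojure.contrib.sql API; the db-spec map and the exact body are assumptions, not the original helper.

```clojure
(ns db
  (:require [clojure.contrib.sql :as sql]))

(def db-spec
  ;; illustrative PostgreSQL connection spec; adjust for your environment
  {:classname   "org.postgresql.Driver"
   :subprotocol "postgresql"
   :subname     "//localhost:5432/mydb"})

(defn exec-sql
  "Execute `sql-params` (a vector of a SQL string followed by its
   parameters) and return the fully realized rows as a vector of maps."
  [sql-params]
  (sql/with-connection db-spec
    (sql/with-query-results rows sql-params
      ;; force the lazy row seq before the connection closes
      (doall (vec rows)))))
```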

The problem with this approach, though, is that using a SELECT * and a LIMIT without an ORDER BY clause doesn’t guarantee the ordering of the records: the database (PostgreSQL, for example) is free to return rows in whatever order it pleases. The impact of this is that the rows you get back will not be consistent from block to block; you are not guaranteed to get all of the rows through this approach, nor are you guaranteed to not get duplicate rows! On one table where we tested this approach, with approximately 600k records, we got back only about 400k unique rows.

Adding an ORDER BY id (since most of our tables contain an id column) resolves these issues, only to incur a large performance penalty while the database computes an ordered list of ids so that it can return the requested block of records.
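With the ordering fix, the query vector from the earlier sketch becomes something like the following (the table name and the block/offset values are purely illustrative):

```clojure
["SELECT * FROM events ORDER BY id LIMIT ? OFFSET ?" 1000 0]
```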

Back to the drawing board: Try Database Cursors

Databases have long had a way of supporting this use case via Cursors. JDBC, and specifically the ResultSet interface, was designed to support them.

It turns out that the default settings on the java.sql.Connection configured by clojure.contrib.sql are not set to use a Cursor. According to ‘PostgreSQL JDBC Driver: Getting results based on a cursor’ there are several settings that need to be in place for the result set to use a cursor:

Auto Commit must be turned off

The java.sql.ResultSet must be created with a ResultSet type of: ResultSet.TYPE_FORWARD_ONLY

The Query must be a single statement, meaning it must not contain semi-colons joining together multiple statements.

You must use a java.sql.Statement and set the FetchSize on the statement to a non-zero value.

It turns out that clojure.contrib.sql has a handy form that will turn off AutoCommit and ensure it will be set back to its previous value so you don’t put the Connection into an inconsistent state (and so that you join any transactions that are in progress). That form is transaction , and trying it out:
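Trying it out might look like this (db-spec is an assumed connection map; the sql alias is clojure.contrib.sql):

```clojure
(require '[clojure.contrib.sql :as sql])

(def db-spec
  ;; illustrative connection spec
  {:classname   "org.postgresql.Driver"
   :subprotocol "postgresql"
   :subname     "//localhost:5432/mydb"})

;; Inside `transaction`, auto-commit is turned off for the duration and
;; restored to its previous value afterwards.
(sql/with-connection db-spec
  (println "auto-commit before:" (.getAutoCommit (sql/connection)))
  (sql/transaction
    (println "auto-commit inside:" (.getAutoCommit (sql/connection)))))
;; auto-commit before: true
;; auto-commit inside: false
```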

The output shows that the AutoCommit is indeed off, but we’ll still need to set a FetchSize ourselves.

The default ResultSet type is ResultSet.TYPE_FORWARD_ONLY, which is what we need. Adding in setFetchSize is straightforward, so, wrapping it all up into a helper function (following the conventions in the sql package), I end up with:
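A sketch of such a helper, following the with-query-results convention from the sql package — the name with-query-results-cursor and the implementation details are my reconstruction, not the post’s original code:

```clojure
(require '[clojure.contrib.sql :as sql])

(defn with-query-results-cursor*
  "Like with-query-results*, but sets a fetch size so the PostgreSQL driver
   streams rows through a cursor. Must be called with auto-commit off
   (e.g. inside `transaction`) and with a single-statement query."
  [fetch-size [query & params] func]
  (with-open [stmt (.prepareStatement (sql/connection) query)]
    ;; prepareStatement defaults to TYPE_FORWARD_ONLY; a non-zero fetch
    ;; size is what triggers cursor-based fetching.
    (.setFetchSize stmt fetch-size)
    (doseq [[i v] (map vector (iterate inc 1) params)]
      (.setObject stmt i v))
    (with-open [rset (.executeQuery stmt)]
      ;; func must consume what it needs before the ResultSet closes
      (func (resultset-seq rset)))))

(defmacro with-query-results-cursor
  "Evaluates body with `results` bound to a lazy, cursor-backed seq of rows."
  [fetch-size results sql-params & body]
  `(with-query-results-cursor* ~fetch-size ~sql-params
     (fn [~results] ~@body)))
```

Used inside `transaction` (so auto-commit is off), something like `(sql/transaction (with-query-results-cursor 500 rows ["SELECT * FROM events"] (doseq [row rows] (prn row))))` should stream the table a block at a time rather than loading it all into memory.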

This brings me back to my goal: being able to wield very large result sets.
