Or: Move That Loop into the Server Already!

This article will illustrate the significance of something that I always thought to be common sense, but I keep seeing people getting this (very) wrong in their productive systems. Chances are, in fact, that most applications out there suffer from this performance problem – and the fix is really easy.

Transferring data in bulk vs. transferring it row by row

This blog post is inspired by a recent Stack Overflow question that was asking about how to call Oracle’s DBMS_OUTPUT.GET_LINES from JDBC. The answer to that specific question can also be seen in a previous blog post here.

This time, I don’t want to discuss that particular Oracle-specific technical problem, but the performance aspect of whether to call:

DBMS_OUTPUT.GET_LINES : Which allows for fetching a bulk of server output into an array

: Which allows for fetching a bulk of server output into an array DBMS_OUTPUT.GET_LINE : Which fetches a single line of server output into a string

While I will continue to discuss this Oracle-specific functionality, this blog post is in no way strictly related to Oracle. It is generally applicable (and again, it should be common sense) for any database, and in fact, any client-server architecture, or even any distributed architecture. The solution to such performance problems will almost always be that transferring a bulk of data is better than transferring it row by row.

Let’s run a benchmark!

The full benchmark logic can be seen in this gist. It includes the boring parts of actually calling the GET_LINE[S] procedures. The beef of the benchmark is this:

int max = 50; long[] getLines = new long[max]; long[] getLine = new long[max]; try (Connection c = DriverManager.getConnection(url, properties); Statement s = c.createStatement()) { for (int warmup = 0; warmup < 2; warmup++) { for (int i = 0; i < max; i++) { s.executeUpdate("begin dbms_output.enable(); end;"); String sql = "begin " + "for i in 1 .. 100 loop " + "dbms_output.put_line('Message ' || i); " + "end loop; " + "end;"; long t1 = System.nanoTime(); logGetLines(c, 100, () -> s.executeUpdate(sql)); long t2 = System.nanoTime(); logGetLine(c, 100, () -> s.executeUpdate(sql)); long t3 = System.nanoTime(); s.executeUpdate("begin dbms_output.disable(); end;"); if (warmup > 0) { getLines[i] = t2 - t1; getLine[i] = t3 - t2; } } } } System.out.println(LongStream.of(getLines).summaryStatistics()); System.out.println(LongStream.of(getLine).summaryStatistics());

What does it do in prose?

It contains a warmup loop whose first iteration doesn’t contribute to our measurements (this is always a good idea)

It runs the benchmarked logic 50 times

It generates 100 DBMS_OUTPUT.PUT_LINE messages for each run in an anonymous PL/SQL loop …

messages for each run in an anonymous PL/SQL loop … … and then fetches those 100 messages immediately with either 1 call to GET_LINES or 100 calls to GET_LINE

or 100 calls to Finally, all the stored execution times are aggregated and printed out conveniently with Java 8 Stream’s summary statistics feature

So, in both cases, we’re generating and fetching 5000 messages.

Both methods using GET_LINES and GET_LINE respectively are functionally equivalent, i.e. the only difference is performance. Again the full benchmark can be seen here (which also benchmarks the effect of JDBC fetchSize, which is none, in this case).

The results are devastating:

{count=50, sum= 69120455, min= 1067521, average= 1382409.100000, max= 2454614} {count=50, sum=2088201423, min=33737827, average=41764028.460000, max=64498375}

We’re gaining a factor of 30x in this benchmark run on my machine. The actual results may vary depending on your hardware and software (e.g. I’m running Oracle 12cR2 in Docker), but regardless of the setup, the results are always very significant.

Note that caching the prepared statement in the client yields better results (see the gist), but nowhere near as good as moving the loop to the server

Does this mean that GET_LINE is slow?

When looking at benchmark results, we must always be very very careful not to draw the wrong conclusions. There can be many reasons why benchmarks show differences in performance. One simple (albeit improbable) reason could be that the GET_LINE implementation simply sucks.

So, I’ve tried to re-implement this benchmark in pure PL/SQL. The full benchmark can be seen here. The beef of it is this:

FOR r IN 1..5 LOOP v_ts := SYSTIMESTAMP; FOR i IN 1..v_repeat LOOP m(); v_i := v_max; dbms_output.get_lines(v_array, v_i); END LOOP; INSERT INTO results VALUES (1, (SYSTIMESTAMP - v_ts)); v_ts := SYSTIMESTAMP; FOR i IN 1..v_repeat LOOP m(); FOR j IN 1 .. v_max LOOP dbms_output.get_line(v_string, v_i); END LOOP; END LOOP; INSERT INTO results VALUES (2, (SYSTIMESTAMP - v_ts)); END LOOP;

Where m() is:

PROCEDURE m IS BEGIN FOR i IN 1 .. v_max LOOP dbms_output.put_line('Message ' || i); END LOOP; END m;

The results are now rather different:

stmt sum avg min max 1 0.0609 0.01218 0.0073 0.0303 2 0.0333 0.00666 0.0063 0.007

This time, calling GET_LINE individually seems to have been 2x faster than the GET_LINES version. Again, it is important not to draw the wrong conclusions! This could be due to:

GET_LINES allocating an additional array copy of the original lines, which resides in the PGA, might be costly

allocating an additional array copy of the original lines, which resides in the PGA, might be costly GET_LINE might have profited from some additional optimisation because we’re never actually consuming the result in the benchmark

But the one thing we can conclude with certainty is: There’s no problem in GET_LINE , so calling it is not inherently worse than calling GET_LINES .

Which brings us back to the JDBC calls

While guess work and hypotheses are usually dangerous, in this case I’m certain of the reason why the JDBC based approach shows such drastic differences. Just watch this excellent talk by Toon Koppelaars from Oracle “NoPLSql and Thick Database Approaches with Toon Koppelaars”, where he explains this with some impressive flame graphs:

The obvious difference between the JDBC benchmark and the PL/SQL one is the fact that the JDBC call has to traverse a vast amount of logic, APIs, “barriers” between the JVM and the Oracle kernel before it can actually invoke the really interesting part. This includes:

JVM overhead

JDBC logic

Network overhead

Various “outer” layers inside the Oracle database

Oracle’s API layers to get into the SQL and PL/SQL execution engines

The actual code running in the PL/SQL engine

In Toon’s talk (which again, you should definitely watch), the examples are running SQL code, not PL/SQL code, but the results are the same. The actual logic is relatively cheap inside of the database (as we’ve seen in the PL/SQL only benchmark), but the overhead is significant when calling database logic from outside the database.

Thus: It is very important to minimise that overhead

There are two ways to minimise that overhead:

The super hard way: Change and / or tweak the API technology, e.g. in Oracle’s case, using the C/OCI bindings can be much faster than JDBC

The easy way: Just move some data collection logic into the database and fetch data in bulk

Let me repeat this one more time:

Fetch (or send) data in bulk

… it’s almost always faster than processing things row-by-row (or as the Oracle folks call it: “slow-by-slow”).

And it does not matter at all, if that database logic is written in SQL or in a procedural language. The point is that accessing objects over the network (any network) is expensive, so you should minimise the access calls if ever possible.

As I’ve tweeted recently:

When you have little data and very complex logic, it's often better to move the data to the logic. When you have tons of data, it's often better to move the logic (simple or complex ) to the data. The latter is done with SQL and/or stored procedures. Start using these things! — Lukas Eder (@lukaseder) December 14, 2017

When calling logic over the network (any network), we should move logic to the data, not data to the logic. When working with RDBMS, we’re doing this through SQL (preferrably) or if SQL doesn’t suffice, we resort to using stored procedures.

When working with HTTP, we’re doing this with – well, it doesn’t have a name, but we should prefer making few physical HTTP calls that aggregate several logical API calls in order to transfer a lot of data in bulk.

When working with “map reduce” or “serverless” etc technology, we’re calling this “functions” or “lambdas”, which are just fancy, more modern names for stored procedures.

Conclusion

I’m not an expert in how to design complex distributed systems. This stuff is really hard. But I’ve spent many years working with RDBMS, which are also, in a very simple way, distributed systems (data on one server, client logic on another one).

A very significant amount of performance problems with RDBMS is related to the simple fact of clients making way too many calls to the database for what could be implemented in a single SQL query. Within the database, once your logic has reached the kernel, stuff gets executed really really fast. Adding more logic to a query is going to cause far less trouble than adding more queries.

Does your application take into account these things? Especially, if you’re using an ORM that generates the SQL for you, does it generate the right amount of queries, or are you suffering from “N+1 problems”? The main reason why ORM-based systems tend to be slow is because developers are not aware of the SQL their ORMs generate, e.g. due to excessive lazy loading, when a single SQL (or JPQL) query would have been a much better choice. It’s not the ORM’s fault. Most ORMs can get this right.

It’s the developer’s responsibility to think about where the logic should be executed. The rule of thumb is:

If you’re looping in the client to fetch individual things from the same server, you’re doing it wrong. Move the loop into the server.

And by doing so, you’ve written your first (ghasp) “stored procedure”. You’ll write many more, and you’ll love it, once you realise how much speed you’re gaining.

Or, in other words:

Move the source, Luke. — Brian Fitzgerald (@ExaGridDba) December 19, 2017

Update: Some criticism from the reddit discussion of this article

/u/cogman10 made good points in his comment warning about batching “too big” workloads, which is perfectly correct when batching write heavy tasks. Large batches may increase the contention inside of your database. If you’re working in an MVCC environment (such as Oracle), having transactions that span millions of updates will put a lot of pressure on the UNDO/REDO log, and all other sessions reading from the same data will feel that pain.

Also, when working with HTTP, beware of the fact that batches are harder to cache than individual requests. This article made the assumption that HTTP requests are:

Authorised – i.e. caching doesn’t even make sense as the authorisation might be revoked or the session terminated

Operating on similar resources, e.g. fetching a bulk of IDs in one go might be more sensible than fetching each ID individuall

… of course, as always, don’t follow advice you find on the internet blindly :) This article illustrated a common mistake. The fix isn’t always as simple as illustrated here, but often it really is.