Throughout my professional career, when a database had several million records I considered it a large volume of data (without quite reaching Big Data). At my previous startup, Otogami, we had a history of some tens of millions of records and thought that was a lot.

Working at Nextail I saw that those millions of records were peanuts, and that the volume of data I had been handling was small compared to what a modern database can manage.

At Nextail every piece of our customers' data is relevant and, for each of them, we handle tables with billions of rows, pushing the database to its limit.

The area I am most focused on is the one that does all the magic of our product, where all the information is crunched and all the mathematical operations are carried out to calculate the optimal distribution of our clients' merchandise. Naturally, it is built with Java.

Simplifying a lot, any business process can be split into three phases:

Loading the information

Processing all the information according to your business rules, producing new information

Persisting the generated information

Our processes generate millions of records that must be persisted. This last phase can consume 20% of the total time.

Searching for the fastest persistence method

In this post I will focus on the last point: how to persist information in the database as fast as possible (without losing data, of course), and in particular how to persist new information (INSERT), not how to update it (UPDATE).

Being curious about how to get that information into the database as fast as possible, and having discovered different things along the way, I've decided to write three posts about how to improve persistence and the common mistakes we often make.

To illustrate the topic, I will provide sample code that loads a dataset and persists it in the database using different techniques. The final code can be found in this GitHub repository.

The goal is to persist data in a single table: no relationships, nothing beyond one table with many rows. I am aware that JPA, which exists to simplify exactly that kind of relationship mapping, will not be able to show off and will not be well represented.

Throughout the articles I will show code, from the simplest and slowest version to the fastest one, explaining what changes I make and why they improve performance.

I also add the restriction of not doing the whole process in a single transaction: it is necessary to persist in multiple commits, to avoid potentially losing information and to avoid overwhelming the database with a very long transaction.

To see the effect of each change on each database engine, I will take metrics on the two most popular open-source databases: MySQL and Postgres. The purpose of this article is not to compare them and find out which database is faster, although everybody knows which is better :D

Using JPA

It is probably the persistence library most widely used by Java developers, and also the most hated (probably because we have turned it into our persistence hammer). The examples and metrics will be done with the most widely used and best-known implementation: Hibernate.

1.- One record at a time

It consists of persisting each record one by one as we iterate over a collection. It is the simplest version of all and, simplifying a lot, it replicates the behavior of the @Transactional annotation of your favorite framework. We could write something like this:
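A minimal sketch of that loop (the `Person` entity and the creation of the `EntityManager` are assumed, not shown):

```java
import java.util.List;

import javax.persistence.EntityManager;
import javax.persistence.EntityTransaction;

public class OneRecordAtATime {

    // Sketch: one transaction, and therefore one commit, per record.
    public static void persist(EntityManager em, List<Person> persons) {
        for (Person person : persons) {
            EntityTransaction tx = em.getTransaction();
            tx.begin();
            em.persist(person);
            tx.commit();
        }
    }
}
```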

Quite a disappointing performance indeed.

2.- One record at a time, in transactions of 1000 elements

The first improvement could be to include a large set of records in each transaction, because we are currently creating one transaction per record:
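A sketch of that change (again assuming a `Person` entity; the batch size of 1000 matches the rest of the post):

```java
import java.util.List;

import javax.persistence.EntityManager;
import javax.persistence.EntityTransaction;

public class OneTransactionPer1000 {

    private static final int BATCH_SIZE = 1000;

    // Sketch: keep the transaction open and commit once per BATCH_SIZE records.
    public static void persist(EntityManager em, List<Person> persons) {
        EntityTransaction tx = em.getTransaction();
        tx.begin();
        int count = 0;
        for (Person person : persons) {
            em.persist(person);
            if (++count % BATCH_SIZE == 0) {
                tx.commit();
                tx.begin();
            }
        }
        tx.commit(); // commit the last, possibly incomplete, batch
    }
}
```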

An improvement of over 20X! It wasn’t very difficult, seeing where we were coming from :)

3.- Flush

Because of the way Hibernate (and the rest of the JPA implementations) works, to do all its magic it needs to track every entity that passes through it. All these entities are managed by the EntityManager, which maintains the uniqueness of instances, their status, and their life cycle.

The more records you persist, the more entities end up in the EntityManager, the more memory they consume, and the more objects it has to manage when performing certain operations. One of those operations is the flush, which JPA executes, for example, every time we commit.

Therefore, if you are going to persist many records, you don't want to keep them all in the persistence context, and you should clean it up often by explicitly calling the flush and clear operations of the EntityManager.
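A sketch of that periodic cleanup at each batch boundary (the `Person` entity is assumed):

```java
import java.util.List;

import javax.persistence.EntityManager;
import javax.persistence.EntityTransaction;

public class FlushAndClear {

    private static final int BATCH_SIZE = 1000;

    public static void persist(EntityManager em, List<Person> persons) {
        EntityTransaction tx = em.getTransaction();
        tx.begin();
        int count = 0;
        for (Person person : persons) {
            em.persist(person);
            if (++count % BATCH_SIZE == 0) {
                em.flush(); // push the pending inserts to the database
                em.clear(); // detach the entities so the persistence context stays small
                tx.commit();
                tx.begin();
            }
        }
        tx.commit();
    }
}
```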

We’ve improved performance by about 2X, and probably avoided some OutOfMemoryError in production.

4.- In batches of 1000 records

Although the operations against the database seem to be performed for each call to the method entityManager.persist(iterator.next()) , this is not how JPA works.

JPA implements the Unit of Work design pattern: it stores new, deleted, and modified objects in the EntityManager, and it is the EntityManager that decides when to perform the operations against the database.

The EntityManager is in charge of keeping the status of your changes consistent: if a query asks for an object that you modified in a previous operation, it takes care of returning the updated version even if the change has not yet reached the database. It is part of the magic, and the complexity, of an ORM.

Therefore, instead of sending the insert operations one by one as you invoke them, Hibernate accumulates them and then executes them all in a succession of calls to the database.

But instead of making one trip to the database per insert, we can configure Hibernate to group the inserts in batches by setting the hibernate.jdbc.batch_size configuration parameter, so that a single trip to the database carries all the insert orders.

To make our process optimal, we only have to make the value of batch_size match the number of elements we chose for the flush. Otherwise, each flush will be out of sync with the number of elements needed to fill a batch.

Vlad Mihalcea explains it better than me in this excellent post (don’t miss his blog and book if you are interested in database persistence topics).

If we review the code we have so far, we see that it matches what Vlad writes, and all we have to do is add the batch_size property to the Hibernate configuration (in your code it probably goes in an XML or properties file):
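For example, in `persistence.xml` it could look like this (a sketch; the same property can equally go in a `.properties` file):

```xml
<property name="hibernate.jdbc.batch_size" value="1000"/>
```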

If we compare the results with the previous version, we see that there has been no improvement in performance, and even a slight worsening. Is Vlad cheating us?

No, he has not cheated us. His example is simpler, and he has not told us a little detail about the primary key.

5.- The primary key

JPA has a problem: it likes to have control over all your entities, and the only mechanism it has to distinguish whether two instances of your objects are the same entity is their primary key.

In this example I am not creating and assigning the value of the primary key: I expect the database to generate it. A newly instantiated, not-yet-persisted JPA object has no primary key, so JPA has to make do with the object's own reference to know whether two references point to the same entity.

In short, after persisting the entities in the database, JPA needs to retrieve the value of the primary key that has just been generated.

If you go to the source code, you can see that the entity that I am persisting has a primary key defined with the strategy GenerationType.IDENTITY :
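The actual entity lives in the linked repository; a sketch consistent with the fields used in the insert examples later in the post would be:

```java
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;

@Entity
public class Person {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY) // the database assigns the id on INSERT
    private Long id;

    private String dni;
    private String name;

    // getters and setters omitted for brevity
}
```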

I'm saying that the primary key has to be generated by the database at insertion time, with some sequence mechanism or auto-increment value.

Although JDBC provides a mechanism to obtain the generated primary key (the getGeneratedKeys method), due to a Hibernate implementation detail it cannot be used in batch operations, and Hibernate deactivates the optimization. I don't know if other implementations suffer from the same problem. Does anyone know why?

Possible solutions to avoid this are:

Find a natural key in our data, avoiding having to generate values for the primary key.

Use a UUID as the identifier.

Manage the identifier value manually, playing with its current maximum value (if you don't have concurrency problems).

Use a generator that implements the hi-lo algorithm.
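As a sketch of the last option: in Hibernate, a standard JPA sequence generator with an `allocationSize` greater than 1 behaves like a hi-lo/pooled optimizer, reserving a block of identifiers per round trip to the sequence (this needs database sequences, so it applies to Postgres but not to MySQL 5.6; the names are illustrative):

```java
@Id
@GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "person_seq")
@SequenceGenerator(name = "person_seq", sequenceName = "person_id_seq", allocationSize = 1000)
private Long id;
```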

To continue with the example, I will choose to assign the id value from a counter of my own. Once the @GeneratedValue annotation is removed from the entity, we have the following code:
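A sketch of that manual assignment (seeding the counter with the table's current `MAX(id)` is my assumption, and it is only safe without concurrent writers):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

import javax.persistence.EntityManager;

public class ManualIdAssignment {

    // Seed with the current MAX(id) of the table; safe only if no
    // other process inserts into the table concurrently.
    private final AtomicLong nextId;

    public ManualIdAssignment(long currentMaxId) {
        this.nextId = new AtomicLong(currentMaxId);
    }

    public void persist(EntityManager em, List<Person> persons) {
        for (Person person : persons) {
            person.setId(nextId.incrementAndGet()); // no @GeneratedValue: we assign the id ourselves
            em.persist(person);
        }
    }
}
```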

Here is the first big difference between MySQL and Postgres: while Postgres gets a 5X improvement, for MySQL it is only 1.3X.

It is clear that Postgres manages batch operations better than MySQL: although the MySQL JDBC driver implements the JDBC specification at the API level, it does not send the batched statements in a single request, but sends them statement by statement.

6.- Insert multiple

If there is anything more efficient than sending all the insert statements together, it is to send all the inserts in a single statement.

Instead of two statements:

INSERT INTO person (dni, name) VALUES ('12345Z', 'Alberto Zaplana');

INSERT INTO person (dni, name) VALUES ('98765A', 'Zoe Alarcón');

we send a single statement:

INSERT INTO person (dni, name) VALUES
  ('12345Z', 'Alberto Zaplana'),
  ('98765A', 'Zoe Alarcón');

With this, in addition to making a single trip, we save the database work when parsing and executing the query.

Unfortunately Hibernate, in spite of knowing all this, does not generate this kind of query… maybe because they're lazy and delegate it to the JDBC driver :)

Both MySQL and Postgres allow you to configure the driver so that, when it finds a batch of inserts with an identical structure, it transforms them into a single multi-row insert.

Activation is done at the URL level, as a connection parameter: in MySQL it is called rewriteBatchedStatements and in Postgres reWriteBatchedInserts. It is not necessary to modify the code, only the database connection string.
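As a sketch, with illustrative hosts and database names, the connection URLs would look like:

```
jdbc:mysql://localhost:3306/test?rewriteBatchedStatements=true
jdbc:postgresql://localhost:5432/test?reWriteBatchedInserts=true
```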

This change has gained us between 1.5X and 4X in performance, leaving MySQL at a more reasonable level compared to Postgres.

Next step?

I haven’t found many more options that dramatically improve performance without leaving JPA.

By applying some other techniques to the database configuration you can get further improvements, but I am saving those for another kind of post.

To simplify the post I have fixed the batch size at 1000 elements, but it would be an interesting exercise to try different sizes to find the most convenient one for each database and for the shape of the data to persist. The results I got would probably change with another batch size.

In the next post I will do the same exercise, with almost the same steps, using JDBC. You will see how it behaves when we remove the ORM and use the database directly.

Conclusion

We have seen that improving the performance of your system can be quite simple and does not require many code changes. We just need to know how the tools we use work, so that we can get the most out of them.

Do you know any other techniques that improve performance in JPA or Hibernate? Do you think I am ignorant and have no idea about JPA? We'd all like to hear about it: if you tell us in the comments or in your own post, we will all learn.

Benchmark setup

Any benchmark that is worthwhile and wants to be taken seriously must specify its configuration. The methodology of this benchmark has not been very rigorous, but I believe it is sufficient to draw conclusions from each test.

The tests were done on a Dell XPS 13 (2017) laptop, with a Core i7-7560U processor, 16GB of memory, and a LiteOn CX2 SSD drive.

The software versions used are:

MySQL 5.6.35

Postgres 9.6.8

Java 8u151

In no case have I tuned the database engines; they run with the default configuration they come with when installed.

In each test I did 10 executions, discarded the best and worst results, and averaged the remaining 8 values.