It's often said that abstractions slow down your program: each layer they add means more work for your application.

While this is often the case, it's not a universal truth.

Removing layers can improve performance somewhat, but the biggest gains come from changing algorithms.

So let's see how we can beat the performance of hand-written SQL and hand-written object materialization, since it is common knowledge (wrongly, as we'll see) that this is the fastest way to talk to the database.

First use case: simple single-table access

CREATE TABLE Post (
  id UUID PRIMARY KEY,
  title VARCHAR NOT NULL,
  created DATE NOT NULL
)

So – the standard pattern to access such a table would be:

SELECT * FROM Post

ignoring (for now) that it would probably be a better style to explicitly name columns. Alternatively, in Postgres we can also do:

SELECT p FROM Post p

which would return a tuple for each row in the table.

For the first query, without going too deep into the actual Postgres protocol, we would get three "columns" with length and content. Parsing such a response would look something like this:

IDataReader dr = ...
return new Post
{
    id = dr.GetGuid(0),
    title = dr.GetString(1),
    created = dr.GetDateTime(2)
};

The second query, on the other hand, has only one "column" with length and content. Parsing such a response requires knowledge of Postgres rules for tuple assembly and is similar to parsing JSON. The code would look like this:

IDataReader dr = ...
return PostgresDriverLibrary.Parse<Post>(dr.GetValue(0));

In the TEXT protocol, the response from Postgres would look like this:

(f4d84c89-c179-4ae4-991a-e2e6bc12d879,"some text",2015-03-12)
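Splitting such a record into its fields mostly means tracking whether we are inside a double-quoted section. A minimal sketch of that step (in Python for brevity; it handles quoted fields with doubled quotes, but not backslash escapes or nested composites):

```python
def split_composite(text):
    """Split a Postgres composite TEXT literal like
    (uuid,"some text",2015-03-12) into raw field strings."""
    assert text.startswith("(") and text.endswith(")")
    body = text[1:-1]
    fields, buf, i, in_quotes = [], [], 0, False
    while i < len(body):
        c = body[i]
        if in_quotes:
            if c == '"':
                if i + 1 < len(body) and body[i + 1] == '"':
                    buf.append('"')  # doubled quote -> literal quote
                    i += 1
                else:
                    in_quotes = False
            else:
                buf.append(c)
        elif c == '"':
            in_quotes = True
        elif c == ',':
            fields.append(''.join(buf))
            buf = []
        else:
            buf.append(c)
        i += 1
    fields.append(''.join(buf))
    return fields

split_composite('(f4d84c89-c179-4ae4-991a-e2e6bc12d879,"some text",2015-03-12)')
# -> ['f4d84c89-c179-4ae4-991a-e2e6bc12d879', 'some text', '2015-03-12']
```

A real driver would then convert each raw field to its target type (UUID, string, date), just as GetGuid/GetString/GetDateTime do in the column-per-field version.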

So, now we can raise a couple of questions:

is it faster or slower for Postgres to return the second version?

can we parse the second response faster than the first response on the client side?

To make things more interesting, let's investigate how talking to Postgres using the BINARY protocol in the first case would compare to using the TEXT protocol in the second case. Common knowledge tells us that binary protocols are much faster than textual ones, but this also isn't always true:

Verdict: for such a simple table, performance of both approaches is similar

Second use case: master-detail table access

A common pattern in DB access is reading from two tables to reconstruct an object on the client side. While several approaches exist, let's use the "standard" one: first read from one table, then from the other. This can sometimes lead to reading inconsistent data, unless we change the isolation level.
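Concretely, wrapping both reads in a REPEATABLE READ transaction gives them a single snapshot, so the detail query cannot see changes committed after the master query ran (sketched here against the Invoice/Item tables defined below):

```sql
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT * FROM Invoice WHERE number IN ('invoice-1', 'invoice-2');
SELECT * FROM Item WHERE invoiceNumber IN ('invoice-1', 'invoice-2');
COMMIT;
```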

For this example, let's use Invoice and Item tables:

CREATE TABLE Invoice (
  number VARCHAR(20) PRIMARY KEY,
  dueDate DATE NOT NULL,
  total NUMERIC NOT NULL,
  paid TIMESTAMPTZ,
  canceled BOOL NOT NULL,
  version BIGINT NOT NULL,
  tax NUMERIC(15,2) NOT NULL,
  reference VARCHAR(15),
  createdAt TIMESTAMPTZ NOT NULL,
  modifiedAt TIMESTAMPTZ NOT NULL
);

CREATE TABLE Item (
  invoiceNumber VARCHAR(20) REFERENCES Invoice,
  _index INT,
  PRIMARY KEY (invoiceNumber, _index),
  product VARCHAR(100) NOT NULL,
  cost NUMERIC NOT NULL,
  quantity INT NOT NULL,
  taxGroup NUMERIC(4,1) NOT NULL,
  discount NUMERIC(6,2) NOT NULL
);

To make things more interesting, we'll also investigate how performance would compare if we used a type instead of a table for the items property. In that case we don't need a join or two queries to reconstruct the whole object.

So let's say that we want to read several invoices and their details. We would usually write something along the lines of:

SELECT * FROM Invoice WHERE number IN ('invoice-1', 'invoice-2', ...)
SELECT * FROM Item WHERE invoiceNumber IN ('invoice-1', 'invoice-2', ...)

and if we wanted to simplify materialization we could add ordering:

SELECT * FROM Invoice WHERE number IN ('invoice-1', 'invoice-2', ...) ORDER BY number
SELECT * FROM Item WHERE invoiceNumber IN ('invoice-1', 'invoice-2', ...) ORDER BY invoiceNumber, _index

While this is slightly more taxing on the database, with a more complicated search it would be much easier to process the results in order using the second version.
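With both result sets ordered by invoice number, materialization can walk the two streams in lockstep instead of keeping every invoice in a dictionary. A sketch of that merge (in Python for brevity, modeling rows as plain tuples whose first element is the invoice number; names are illustrative):

```python
def merge_ordered(invoices, items):
    """Merge invoice rows with their item rows; both inputs must be
    sorted by invoice number (items additionally by _index).
    Returns (invoice_row, [item_rows]) pairs."""
    result = []
    items = iter(items)
    pending = next(items, None)
    for inv in invoices:
        details = []
        # consume all items belonging to the current invoice
        while pending is not None and pending[0] == inv[0]:
            details.append(pending)
            pending = next(items, None)
        result.append((inv, details))
    return result

invoices = [('invoice-1', '2015-04-01'), ('invoice-2', '2015-04-02')]
items = [('invoice-1', 1, 'pen'), ('invoice-1', 2, 'paper'), ('invoice-2', 1, 'ink')]
merged = merge_ordered(invoices, items)
# each invoice ends up paired with its (already ordered) items
```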

On the other hand, by combining records into one big object directly on the database, we can load it in a single query:

SELECT inv,
       ARRAY(SELECT it FROM Item it
             WHERE it.invoiceNumber = inv.number
             ORDER BY it._index) AS items
FROM Invoice inv
WHERE inv.number IN ('invoice-1', 'invoice-2', ...)

The above query actually returns two columns, but it could be changed to return only one column.
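One way to collapse it to a single column would be to wrap the pair in a row constructor (a sketch, not benchmarked here):

```sql
SELECT (inv,
        ARRAY(SELECT it FROM Item it
              WHERE it.invoiceNumber = inv.number
              ORDER BY it._index))
FROM Invoice inv
WHERE inv.number IN ('invoice-1', 'invoice-2');
```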

Materialization of such objects on the client for the first version would look like this:

IDataReader master = ...
IDataReader detail = ...
var memory = new Dictionary<string, Invoice>();
while (master.Read())
{
    var head = new Invoice
    {
        number = master.GetString(0),
        dueDate = master.GetDateTime(1),
        ...
    };
    memory[head.number] = head;
    ...
}
while (detail.Read())
{
    var invoice = memory[detail.GetString(0)];
    var item = new Item
    {
        product = detail.GetString(2),
        cost = detail.GetDecimal(3),
        ...
    };
    invoice.Items.Add(item);
}

The Postgres native format would be materialized as in the first example, along the lines of:

IDataReader dr = ...
return PostgresDriverLibrary.Parse<Invoice>(dr.GetValue(0));

The Postgres response in the TEXT protocol starts to suffer from nesting and escaping, and would look something like:

(invoice-1,2015-03-16,"{""(invoice-1,1,""""product name"""",...)...}",...)

With each nesting layer, more and more space is spent on escaping. Still, by developing a parser optimized for this specific Postgres TEXT response format, we can process it very quickly.
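The growth is easy to quantify: each time a quoted value is embedded into an outer composite, it is wrapped in quotes again and every double quote already present is doubled, so the escaped size roughly doubles per nesting level. A quick illustration of just that quoting step (Python sketch):

```python
def embed(value):
    """Quote a field the way the Postgres TEXT format does when a value
    becomes a field of an outer composite: wrap in double quotes and
    double every double quote already present."""
    return '"' + value.replace('"', '""') + '"'

s = '"'  # a single double-quote character at the innermost level
for depth in range(1, 4):
    s = embed(s)
    print(depth, len(s))  # prints: 1 4, then 2 10, then 3 22
```

So one literal quote character costs 4 characters one level down, 10 two levels down, and so on; this is why the deeply nested responses below spend so much of their bytes on escaping.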

Verdict: manual coding of SQL and materialization has become non-trivial. Joins introduce a noticeable performance difference. The manual approach is losing ground.

Third use case: master-child-detail table access

Sometimes we have nesting two levels deep. Since Postgres has rich type support, this is something we can leverage. So, what would our object-oriented modeling approach look like if we had to store bank account data in a database?

CREATE TYPE Currency AS ENUM ('EUR', 'USD', 'Other');

CREATE TYPE Transaction AS (
  date DATE,
  description VARCHAR(200),
  currency Currency,
  amount NUMERIC(15,2)
);

CREATE TYPE Account AS (
  balance NUMERIC(15,2),
  number VARCHAR(40),
  name VARCHAR(100),
  notes VARCHAR(800),
  transactions Transaction[]
);

CREATE TABLE BankScrape (
  id INT PRIMARY KEY,
  website VARCHAR(1024) NOT NULL,
  at TIMESTAMPTZ NOT NULL,
  info HSTORE NOT NULL,
  externalId VARCHAR(50),
  ranking INT NOT NULL,
  tags VARCHAR(10)[] NOT NULL,
  createdAt TIMESTAMPTZ NOT NULL,
  accounts Account[] NOT NULL
);

Our SQL queries and materialization code will look similar to before (although the complexity has increased drastically). The escaping issue is even worse than before: while reading transactions we are mostly skipping escaped characters. Not to mention that, due to Large Object Heap (LOH) issues, we can't just process a string; it must be done with a TextReader.

Verdict: manual coding of SQL and materialization is really complex. Joins introduce a noticeable performance difference. The manual approach is not competitive in any test:

Conclusions