Really fast CSV imports with Rails and PostgreSQL

Let’s say that we have to import data from some social network. We will be provided with CSV file generated every 5 minutes containing list of posts that were added/liked/commented since the beginning of the day together with likes’ and comments’ counts for each post.

As you can imagine the file will grow in size during the day; also we are looking at not only inserting new posts but also updating counts for the ones we already imported.

Test environment

sample example.csv file containing 50,000 rows with the following format:

post_id ; title ; published_at ; likes_count ; comments_count

PostgreSQL table with unique index on external_post_id

CREATE TABLE public . posts ( id integer NOT NULL DEFAULT nextval ( 'posts_id_seq' :: regclass ), title character varying , published_at timestamp with time zone , likes_count integer NOT NULL DEFAULT 0 , comments_count integer NOT NULL DEFAULT 0 , external_post_id integer NOT NULL , CONSTRAINT posts_pkey PRIMARY KEY ( id ) ); CREATE UNIQUE INDEX index_posts_on_external_post_id ON public . posts USING btree ( external_post_id );

ActiveRecord model

class Post < ActiveRecord :: Base validates :title , presence: true validates :external_post_id , presence: true validates :external_post_id , uniqueness: true end

To better emulate requirements all rows from the CSV file with odd external_post_id will be already present in the database.

Simple approach

To establish baseline let’s start with very simple version that’s using first_or_create provided by ActiveRecord out of the box:

CSV . foreach ( Rails . root . join ( 'example.csv' )) do | row | data = { title: row [ 0 ], published_at: row [ 1 ], likes_count: row [ 2 ], comments_count: row [ 3 ] } Post . where ( external_post_id: row [ 4 ]). first_or_create ( data ). update ( data ) end

The result is expected but still pretty disappointing:

user system total real 112.940000 9.690000 122.630000 ( 159.846472 )

Stepping up our game

We all know there are better ways to perform bulk imports with ActiveRecord and one of them is the activerecord-import gem. It supports PostgreSQL 9.5 on conflict which makes is perfect for out use-case.

According to the documentation the fastest method is to use raw columns and arrays of values and that is exactly what we are going to do:

columns = [ :title , :published_at , :likes_count , :comments_count , :external_post_id ] values = CSV . read ( Rails . root . join ( 'example.csv' )) Post . import columns , values , validate: false , on_duplicate_key_update: { conflict_target: [ :external_post_id ] }

The result is amazing - we are almost 23x faster then before!

user system total real 6.160000 0.060000 6.220000 ( 6.974064 )

Can we be even faster?

The fastest way to put data into PostgreSQL is to use the COPY command, but it comes with some limitations. One especially important in our context is the lack of upsert support.

Until it will be available we can achieve similar functionality using TEMP TABLES.

Post . connection . execute <<- SQL CREATE TEMP TABLE post_imports ( title character varying, published_at timestamp with time zone, likes_count integer, comments_count integer, external_post_id integer ) SQL File . open ( Rails . root . join ( 'data.csv' ), 'r' ) do | file | Post . connection . raw_connection . copy_data %{copy post_imports from stdin with csv delimiter ',' quote '"'} do while line = file . gets do Post . connection . raw_connection . put_copy_data line end end end Post . connection . execute <<- SQL insert into posts(title, published_at, likes_count, comments_count, external_post_id) select title, published_at, likes_count, comments_count, external_post_id from post_imports on conflict(external_post_id) do update set title = EXCLUDED.title, published_at = EXCLUDED.published_at, likes_count = EXCLUDED.likes_count, comments_count = EXCLUDED.comments_count returning id SQL

In the above example we are creating temporary table post_imports and copying all the data in there. Next we are taking advantage of INSERT INTO(...) SELECT ... and ON CONFLICT syntax to move the data from temporary table to the posts table.

The code is longer and less pretty then the one using activerecord-import gem. It’s also 10x faster!

user system total real 0.150000 0.020000 0.170000 ( 0.643315 )

Final thoughts

It’s not surprising that the method closest to the database turns out to be the quickest.