Dealing with various sources of data in web applications requires us to create services that extract information from CSV, Excel, and other file types. In that case, it's best to use an existing library, or if your backend is on Rails, a gem. There are many gems with very cool features, like CSVImporter and Roo. But you can also use plain Ruby's CSV.
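For reference, reading a CSV file with plain Ruby CSV takes only a few lines. This is a minimal sketch; the column names are placeholders:

```ruby
require 'csv'

# Parse CSV text with a header row; each row becomes a CSV::Row
# that can be accessed by column name.
csv_text = "property,value\ntemperature,21.5\nhumidity,60\n"
rows = CSV.parse(csv_text, headers: true)

rows.each do |row|
  puts "#{row['property']}: #{row['value']}"
end
```

In a real import you would use `CSV.foreach(path, headers: true)` instead, which streams the file row by row rather than loading it all into memory.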

Either way, if those are small CSV files, you will get your job done easily. But what if you need to import large CSV files (~100MB / ~1M rows)?

The Problem

For a recent project I worked on, an external system would send a CSV file containing 200k rows every 15 minutes. The aforementioned solutions were simply not good enough: they were slow and ate up a lot of RAM.

I knew I had to find a more efficient solution.

First Try: CSVImporter gem

Because of its simplicity, I first decided to use the CSVImporter gem. After installing the gem, the CSVImporter class was defined like this:

class ForecastCSVImporter
  include CSVImporter

  model Forecast

  column :property
  column :date
  column :value
  column :type
  column :location
  column :created_at
end

Executing an import for a specific file was done like this:

ForecastCSVImporter.new(path: file_path).run!

The implementation was simple and it worked really well on a test CSV file. When I tested it against production-sized data, though, it was far too slow: importing took more than 2 hours. The main problem was that each CSV row was converted into an ActiveRecord model and persisted with an individual #create call.

I tried many different gems and wrote a custom CSV importer using plain Ruby CSV with batch insert commands. Performance improvements were noticeable and the import time was reduced to around 45 minutes, but it was still too slow. I felt I still wasn't on the right track.
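The batch-insert idea can be sketched as follows. This is illustrative only: the table and column names are placeholders, the statements are built as strings rather than executed, and a real importer would also escape values properly (e.g. via `connection.quote`):

```ruby
require 'csv'

# Build multi-row INSERT statements from CSV text, grouping rows into
# batches to cut down the number of database round trips.
def batched_inserts(csv_text, batch_size: 2)
  rows = CSV.parse(csv_text, headers: true)
  rows.each_slice(batch_size).map do |batch|
    values = batch.map { |row| "('#{row['property']}', #{row['value']})" }
    "INSERT INTO forecasts (property, value) VALUES #{values.join(', ')};"
  end
end

# Three rows with batch_size 2 yield two INSERT statements.
statements = batched_inserts("property,value\na,1\nb,2\nc,3\n")
```

Each statement inserts a whole batch at once, which is why this approach beats row-by-row #create calls, but every row still travels through Ruby.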

Solution - Use COPY

After all of these attempts, I finally gave up on Ruby solutions and looked for help from my database.

I found that PostgreSQL has a really powerful yet very simple command called COPY, which copies data between a file and a database table. It works in both directions:

to import data from a CSV file into a database table

to export data from a database table to a CSV file.

Example of usage:

COPY forecasts FROM 'tmp/forecast.csv' CSV HEADER;

This piece of SQL code imports the content of a CSV file into our forecasts table. Note one thing: by default, it's assumed that the number and order of columns in the table are the same as in the CSV file.
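If the CSV file covers only some of the table's columns, COPY also accepts an explicit column list, so the remaining columns fall back to their defaults (the column names here are illustrative):

```sql
COPY forecasts (property, date, value) FROM 'tmp/forecast.csv' CSV HEADER;
```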

Results

Importing a CSV file with ~1M rows now takes under 4 seconds, which is blazing fast compared to the previous solutions!

Library Support

Usually, when writing application code, it's good to avoid writing raw SQL. By default, ActiveRecord doesn't support the COPY command but there is a gem which takes care of that. It's called postgres-copy . The gem provides a simple interface for copying data between a database table and a CSV file.

Let's see an example:

# Enable COPY command on Forecast model
class Forecast < ActiveRecord::Base
  acts_as_copy_target
end

# Run export of table data to a file
Forecast.copy_to '/tmp/forecast.csv'

# Run import from a CSV file to database
Forecast.copy_from '/tmp/forecast.csv'

The SQL commands generated by those calls are:

-- Forecast.copy_to '/tmp/forecast.csv'
COPY (SELECT "forecasts".* FROM "forecasts") TO '/tmp/forecast.csv' WITH DELIMITER ';' CSV HEADER

-- Forecast.copy_from '/tmp/forecast.csv'
COPY forecasts FROM '/tmp/forecast.csv'

Data manipulation

The COPY command is simple and super fast. However, it has restrictions in some advanced scenarios when importing from CSV:

you must use all of the columns from a CSV file

problems arise if you want to manipulate the data before it is inserted into the database table.

You can specify a mapping between CSV columns and table columns, which means the attributes can appear in a different order in the CSV file than in the database table. However, every column present in the CSV file must still be consumed by the import.

These problems are common, but as always, there are workarounds.

The first is to create a temporary table into which you import the raw data from the CSV file. After the COPY command inserts all the data, you can run a custom INSERT command to transfer the data from the temporary table to your target table. Within that INSERT command, you can easily perform data manipulation.

Let's see an example:

COPY forecasts_import FROM 'tmp/forecast.csv';

INSERT INTO forecasts
SELECT location_id::int, value, forecast_type, DATE(created_at)
FROM forecasts_import;

DELETE FROM forecasts_import;

The second approach, for when the attributes in the CSV file and the table don't match, is to read the data from standard input and manipulate it in our application:

COPY forecasts FROM STDIN;

In that case, you need to manually read the data, row by row, from a CSV file (e.g. by using plain Ruby CSV ) and send the data to STDIN of your database connection.

db_conn = ActiveRecord::Base.connection.raw_connection
copy_statement = 'COPY forecasts FROM STDIN'
file_path = '/tmp/forecast.csv'

# Passing a row encoder lets us hand put_copy_data an array of fields;
# without it, put_copy_data expects a pre-formatted string.
db_conn.copy_data(copy_statement, PG::TextEncoder::CopyRow.new) do
  CSV.foreach(file_path, headers: true) do |row|
    db_conn.put_copy_data(row.fields + [Time.zone.now])
  end
end

Although the import process goes through Ruby, there is no overhead of instantiating ActiveRecord objects and performing validations. This is a bit slower than directly importing with the database, but it's still very fast.

Can I use this if I'm not on PostgreSQL?

In addition to PostgreSQL, other databases also support native CSV importing.

For example, Microsoft SQL Server uses the BULK INSERT SQL command, which is pretty similar to PostgreSQL's COPY command, while Oracle ships a command-line tool called SQL*Loader (sqlldr). MySQL also supports CSV file imports with the LOAD DATA INFILE command or the mysqlimport utility.
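As a rough MySQL counterpart to the COPY example above (the file path and table name are illustrative):

```sql
LOAD DATA INFILE '/tmp/forecast.csv'
INTO TABLE forecasts
FIELDS TERMINATED BY ','
IGNORE 1 LINES;
```

The IGNORE 1 LINES clause plays the same role as CSV HEADER in PostgreSQL, skipping the header row of the file.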

Conclusion

Native database import commands will give you far better performance than plain CSV parsers combined with many INSERT commands. Of course, you should take care of your database-level validations, because you are skipping the application-level ones.

If you're working with small CSV files, existing libraries are usually the best solution because of their simplicity, especially if you need to do data manipulation before persisting it to the database.

In all other cases, go ahead and use the database tools because they are really powerful and easy to use.