Importing data is one of the tasks I like the least. It requires to write messy code and it will inevitably be really slow, maybe too slow to be viable.



A good approach in these cases is to rely as much as possible on the database engine using a temporary table to store the CSV content and then calling the various insert into select or update from select queries needed to import data from the temp table. They can definitely save your day.

Last day I stumbled upon this wonderful feature of PostgreSQL that could avoid even the need of parsing the CSV. So, let’s say we have the following database:

And the following CSV:

Code;Name;Description SHO32_LEA01_BLA10;Black shoes;Black leather shoes SHO32_LEA01_RED10;Red shoes;Red leather shoes SHO32_PLA90_BLA10;Black shoes;Black plastic shoes SHO32_PLA90_RED10;Red shoes;Red plastic shoes HAT76_LEA01_BLA10;Black hat;Black leather hat HAT76_LEA01_RED10;Red hat;Red leather hat HAT76_PLA90_BLA10;Black hat;Black plastic hat HAT76_PLA90_RED10;Red hat;Red plastic hat

we can write the entire import algorithm in the DB. Pseudocode:

import the csv inside a temporary table;

for each row, detect the correct model code, material code and color code splitting the code we have in the first column;

create all missing colors, materials and models;

update all existing products (setting the new name and description);

insert all new products;

To achieve this we can write a PostgreSQL function. Basically you can see a function like an aggregate of SQL statements. With a function we can store in the DB all the import logic and when we have to effectively import the data, we can simply execute the function.

Okay, let’s see a function that will load our CSV and will import the data (see the comments for info):

/* a function named import_products that requires an argument an returns nothing */ CREATE OR REPLACE FUNCTION import_products(file_path text) RETURNS VOID SECURITY DEFINER AS $BODY$ BEGIN DROP TABLE IF EXISTS tmp_import_data; /* * create the temporary table * we start inserting only not null columns * and then we update the other columns to set references */ CREATE TEMP TABLE tmp_import_data( product_id integer, product_code varchar(255) NOT NULL, model_id integer, model_code varchar(5), material_id integer, material_code varchar(5), color_id integer, color_code varchar(5), name varchar(255) NOT NULL, description text NOT NULL ); /* copy the entire csv in tmp_import_data */ execute format ( $$copy tmp_import_data (product_code, name, description) from %L delimiter ';' header csv$$ , file_path); /* * set model code, material code and color code individually * splitting the product code */ UPDATE tmp_import_data SET model_code = split_part(product_code, '_', 1), material_code = split_part(product_code, '_', 2), color_code = split_part(product_code, '_', 3); /* insert new colors */ INSERT INTO models (code) SELECT DISTINCT model_code FROM tmp_import_data WHERE NOT EXISTS (SELECT id FROM models WHERE code = model_code); /* set references */ UPDATE tmp_import_data SET model_id = s.id FROM (SELECT id, code FROM models) AS s WHERE s.code = model_code; /* insert new materials */ INSERT INTO materials (code) SELECT DISTINCT material_code FROM tmp_import_data WHERE NOT EXISTS (SELECT code FROM materials WHERE code = material_code); /* set references */ UPDATE tmp_import_data SET material_id = s.id FROM (SELECT id, code FROM materials) AS s WHERE s.code = material_code; /* insert new colors */ INSERT INTO colors (code) SELECT DISTINCT color_code FROM tmp_import_data WHERE NOT EXISTS (SELECT code FROM colors WHERE code = color_code); /* set references */ UPDATE tmp_import_data SET color_id = s.id FROM (SELECT id, code FROM colors) AS s WHERE s.code = color_code; /* update name and description for existing products */ UPDATE products SET name = s.name, description = s.description FROM (SELECT model_id, material_id, color_id, name, description FROM tmp_import_data) AS s WHERE products.model_id = s.model_id AND products.material_id = s.material_id AND products.color_id = s.color_id; /* insert new products */ INSERT INTO products (model_id, material_id, color_id, name, description) SELECT DISTINCT t.model_id, t.material_id, t.color_id, t.name, t.description FROM tmp_import_data AS t WHERE NOT EXISTS ( SELECT id FROM products WHERE model_id = t.model_id AND material_id = t.material_id AND color_id = t.color_id ); END; $BODY$ LANGUAGE plpgsql;

Now we need only to call the function passing a file name, and the function will take care of loading the stuff:

SELECT import_products("/home/user/import.csv");