Back in 2009, Aaron Bycoffe inspired Joe Germuska to write a thirty-line Python script called csvcut for making it easy to select a subset of columns from a CSV file. For the next two years the two of them went back and forth forking and making small revisions to this tool. We got a lot of mileage out of it at the Chicago Tribune, where Joe and I were constantly wrestling with the annoyances of processing datasets efficiently in the command line. In 2011, in a fit of exuberance, I rewrote csvcut and began building other tools for making this sort of work easier. Thus csvkit was born and has been under continuous development ever since.

Over the last several months there have been two major releases of csvkit. These releases have brought long-awaited features such as Python 3 support, a csvformat utility and a new csvkit tutorial—not to mention a slew of bug fixes. To celebrate the latest release, here are eleven of my favorite awesome things you can do with csvkit. If you aren’t using it yet, hopefully this will convince you.

1. Ditch Excel (for real)

Who needs it?

in2csv file1.xls > file1.csv in2csv file2.xlsx > file2.csv

Reference: in2csv

2. Conquer fixed-width formats

Fixed-width files are a particularly frustrating to parse. Save yourself some frustration by using a CSV-formatted schema to convert your fixed-width file into a CSV!

[schema.csv] column,start,length name,1,5 age,6,2 cash,8,3

[data.fixed] Chris44 72 Brian26110 Ryan 18145 Joe 34 83

in2csv -f fixed -s schema.csv data.fixed > data.csv

[data.csv] name,age,cash Chris,44,72 Brian,26,110 Ryan,18,145 Joe,34,83

Be sure to check out FFS for a database of fixed-width schemas for common files, such as those produced by the Bureau of Labor Statistics.

Reference: in2csv

3. Find cells matching a regular expression

[data.csv] name,phone_number Chris,555-999-1111 Brian,555-123-4567 Ryan,555-123-8901 Joe,555-777-1111

Find phone numbers following the pattern “ddd–123-dddd”:

csvgrep -c phone_number -r "\d{3}-123-\d{4}" data.csv > matching.csv

[matching.csv] name,phone_number Brian,555-123-4567 Ryan,555-123-8901

Reference: csvgrep

4. Turn your data into a JSON lookup table

Do you have data that with a unique id column? Would you like to be able to load that data into your browser keyed by its unique id so that you can use it as a lookup table? Well then.

[geo.csv] slug,place,latitude,longitude dcl,Downtown Coffee Lounge,32.35066,-95.30181 tyler-museum,Tyler Museum of Art,32.33396,-95.28174 genecov,Genecov Sculpture,32.299076986939205,-95.31571447849274

csvjson --key slug --indent 4 geo.csv

[keyed.json] { "dcl": { "slug": "dcl", "place": "Downtown Coffee Lounge", "latitude": "32.35066", "longitude": "-95.30181" }, "tyler-museum": { "slug": "tyler-museum", "place": "Tyler Museum of Art", "latitude": "32.33396", "longitude": "-95.28174" }, "genecov": { "slug": "genecov", "place": "Genecov Sculpture", "latitude": "32.299076986939205", "longitude": "-95.31571447849274" } }%

Reference: csvjson

5. Turn a CSV with latitude and longitude columns into GeoJSON

Is your data geographic, like in our previous example? Then don’t stop with JSON—go one step further and make it GeoJSON!

csvjson --lat latitude --lon longitude --key slug --crs EPSG:4269 --indent 4 geo.csv > geo.json

[geo.json] { "type": "FeatureCollection", "bbox": [ -95.31571447849274, 32.299076986939205, -95.28174, 32.35066 ], "features": [ { "type": "Feature", "id": "dcl", "geometry": { "type": "Point", "coordinates": [ -95.30181, 32.35066 ] }, "properties": { "place": "Downtown Coffee Lounge" } }, { "type": "Feature", "id": "tyler-museum", "geometry": { "type": "Point", "coordinates": [ -95.28174, ...

Reference: csvjson

6. Generate summary statistics for any CSV file

[data.csv] name,age,cash Chris,44,72 Brian,26,110 Ryan,18,145 Joe,34,83 Ryan,27,300 Jeremy,98,1

$ csvstat data.csv 1. name <type 'unicode'> Nulls: False Values: Chris, Brian, Jeremy, Joe, Ryan 2. age <type 'int'> Nulls: False Min: 18 Max: 98 Sum: 247 Mean: 41.1666666667 Median: 30.5 Standard Deviation: 26.6359197743 Unique values: 6 3. cash <type 'int'> Nulls: False Min: 1 Max: 300 Sum: 711 Mean: 118.5 Median: 96.5 Standard Deviation: 92.1461701139 Unique values: 6 Row count: 6

csvstat will calculate different statistics based on the type of each column.

Reference: csvstat

7. Execute a SQL query directly on a CSV file

[data.csv] name,age,cash Chris,44,72 Brian,26,110 Ryan,18,145 Joe,34,83 Ryan,27,300 Jeremy,98,1

csvsql --query "select name from data where age > 30" data.csv > old.csv

[old.csv] name Chris Joe Jeremy

Care to take it up a notch? Reference two CSV files and use SQL to JOIN them (this example uses data from the csvkit repository:

csvsql --query "select m.usda_id, avg(i.sepal_length) as mean_sepal_length from iris as i join irismeta as m on (i.species = m.species) group by m.species" examples/iris.csv examples/irismeta.csv

usda_id,mean_sepal_length IRSE,5.006 IRVE2,5.936 IRVI,6.588

(Second example courtesy Jeroen Janssens.)

Reference: csvsql

8. Automatically create a SQL table and import a CSV into a database

[data.csv] name,age,cash Chris,44,72 Brian,26,110 Ryan,18,145 Joe,34,83 Ryan,27,300 Jeremy,98,1

createdb demographics csvsql --db postgresql:///demographics --insert data.csv

Too good to be true? Check it out:

psql -q demographics -c "\d data" Table "public.data" Column | Type | Modifiers --------+----------------------+----------- name | character varying(6) | not null age | integer | not null cash | integer | not null psql -q demographics -c "select * from data" name | age | cash --------+-----+------ Chris | 44 | 72 Brian | 26 | 110 Ryan | 18 | 145 Joe | 34 | 83 Ryan | 27 | 300 Jeremy | 98 | 1 (6 rows)

Reference: csvsql

9. Extract a table from a SQL database into a CSV

sql2csv --db postgresql:///demographics --query "select * from data" > extract.csv

[extract.csv] name,age,cash Chris,44,72 Brian,26,110 Ryan,18,145 Joe,34,83 Ryan,27,300 Jeremy,98,1

Reference: sql2csv

10. Turn your Github issues into a CSV

curl https://api.github.com/repos/onyxfish/csvkit/issues?state=open | in2csv -f json > issues.csv

[issues.csv] url,labels_url,comments_url,events_url,html_url,id,number,title,user,labels,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,pull_request,body https://api.github.com/repos/onyxfish/csvkit/issues/329,https://api.github.com/repos/onyxfish/csvkit/issues/329/labels{/name},https://api.github.com/repos/onyxfish/csvkit/issues/329/comments,https://api.github.com/repos/onyxfish/csvkit/issues/329/events,https://github.com/onyxfish/csvkit/pull/329,42344699,329,Adding support for multiline json documents,,,open,False,,,0,2014-09-09T20:03:02Z,2014-09-09T20:03:02Z,,,Fixes #275 https://api.github.com/repos/onyxfish/csvkit/issues/328,https://api.github.com/repos/onyxfish/csvkit/issues/328/labels{/name},https://api.github.com/repos/onyxfish/csvkit/issues/328/comments,https://api.github.com/repos/onyxfish/csvkit/issues/328/events,https://github.com/onyxfish/csvkit/issues/328,42259481,328,Docs should lead-off with an example of why it's awesome,,,open,False,,,0,2014-09-09T00:36:58Z,2014-09-09T00:36:58Z,,, ...

Reference: in2csv

11. Slice a 9,655 column file using index ranges

Perhaps the most audacious use of csvkit I’ve seen:

Slicing 9,655 Illinois schools columns. All praises to csvkit. pic.twitter.com/4DCD2VgYSC — Dan Hill (@DanHillReports) May 28, 2013

(Credit to Dan Hill.)

Reference: csvcut

What awesome thing are you going to do with csvkit?