Falsehoods Programmers Believe About CSVs

Much of my professional work for the last 10+ years has revolved around handling, importing and exporting CSV files. CSV files are frustratingly misunderstood, abused, and most of all underspecified. While RFC4180 exists, it is far from definitive and goes largely ignored.

Partially as a companion piece to my recent post about how CSV is an encoding nightmare, and partially an expression of frustration, I've decided to make a list of falsehoods programmers believe about CSVs. I recommend my previous post for a more in-depth coverage on the pains of CSVs encodings and how the default tooling (Excel) will ruin your day.

Everything on this list is a false assumption that developers make.

All CSVs are ASCII All CSVs are Win1252 All CSVs are in 8-bit encodings All CSVs are UTF-8 All CSVs are UTF-16 All CSVs contains a single consistent encoding All records contain a single consistent encoding All fields contain a single consistent encoding All CSVs contain records All records contain fields Fields never contain record separators Fields never contain delimiters Fields never contain control characters Delimiters are escaped with a \ All fields are enclosed by double quotes All records are a single line All lines contain a single record All records contain the same number of fields All records contain the same number of fields as the header All records contain the same number of fields or fewer than the header All CSVs contain a header All record separators are CRLF All record separators are LF All record separators are a single byte All record separators are a single rune All newlines are a single byte All CSVs are delimited with a comma All CSVs are delimited with a comma, tab or semicolon TSV isn't CSV All delimiters are a single byte All commas are a single byte All CSVs are readable with Excel Excel is a good tool for working with CSVs Excel is an OK tool for working with CSVs Excel can losslessly save CSVs it opens Using ="{value}" is a good way to get around Excel auto-formatting The first line will never be a poorly supported instruction header Using sep={char} is a good way to get Excel to accept your delimiter Prepending a BOM is a good way to get Excel to read your encoding You can safely name your first column "ID" All CSVs follow RFC4180 Most CSVs follow RFC4180 All CSVs follow the same defined standard All CSVs follow a defined standard All CSVs have a .csv extension All CSV is human readable

Please take these into consideration next time you find yourself working with CSV. If you can think of anything I may have missed I'd be happy to add it.

As a suggested further reading, "The Art of Unix Programming" http://www.catb.org/esr/writings/taoup/html/ch05s02.html#id2901882 section on DSV style which notably says "CSV is a textbook example of how not to design a textual file format"

Updates: