CSV: An Encoding Nightmare

UPDATE 2019-01-16: In the three years since this article was written, parts of the article, in particular talking about UTF-8 are thankfully no longer accurate.

It would appear in a recent update Microsoft has added support for safely reading and writing UTF-8 CSVs to Excel. There is a new format in the save dialog CSV UTF-8 (Comma delimited) which is distinct from Comma Separated Values which is also still in there.

In my testing it appears to safely both load and save. Best of all it appears if there is a BOM, it leaves it, if there isn't it doesn't add one. Very nice handling indeed.

The company I work for manages a self-service data import system, handling information from school districts “Student Information Systems”. There are hundreds if not thousands of SIS’s out there, but what the vast majority have in common is the ability to export CSV. It is the lingua franca, everyone can get us CSVs.

CSV to the casual observer seems a simple portable format, but its looks are deceiving. If the data is pure ASCII (bytes 0-127) you’ll be fine. However, if there is a need for any sort of non-ASCII character, there is some work ahead.

In our case, handling school data from around the world, correctly handling non-ASCII characters is of the utmost importance.

Excel: A World of Hurt

The biggest problem is not CSV itself, but that the primary tool used to interact with it is Excel. Excel handles CSV encodings badly.

Creating a new document in Excel and saving as ”Comma Separated Values (.csv)” it uses your locale’s Windows or Mac codepage. Win-1252 and MacRoman respectively in the United States. If your codepage doesn’t support a character in your document, it will be silently replaced with an underscore _ character.

Because it uses codepages and not a Unicode encoding, it makes processing a painful chore. There is no way to tell the difference between different 8-bit codepages programatically.

One can use heuristics to sort them into an order of likelihood, but there is no way to ever know for sure. We chose to present the user with previews of the most likely codepages, and let them pick the correct one.

Excel for Mac: Broken Beyond Belief

The Macintosh version of Microsoft Excel is particularly harrowing.

As I mentioned above, Excel saves your CSV in your locales codepage. One should note that the Mac codepages fell out of use with OS 9, almost 15 years ago. Microsoft did not get that memo.

While that by definition makes the CSVs the Mac version of Excel exports unusable on Windows, the problem is more unfortunate than that. The Mac version can only read the locales Windows codepage.

That means the Mac version of Excel cannot read CSVs it wrote. That's pathetic.

If you had any extended characters when you saved, they are scrambled when you reopen it. This problem has persisted in every version of Mac Excel up to the current Excel 2016.

One simply cannot safely open a CSV created with the Mac version of Excel, on any platform, anywhere.

UTF-8? What’s that?

UTF-8 is the encoding of the 21st Century. It was quick in overtaking every other encoding. Almost every modern application supports it. This will be easy and we’ll be fine? Not so fast.

If one attempts to open a CSV file encoded as UTF-8 without a Byte Order Mark (BOM) as recommended, any non-ASCII characters are again scrambled.

This is the CSV format Apple’s Numbers exports by default, UTF-8 sans BOM.

If we try it again with a UTF-8 BOM prepended to the file and Excel will read it. This is deceptive because once saved the text will remain correctly encoded UTF-8, but bizarrely the BOM will be stripped causing the file to no longer be correctly readable.

Many naïve application output UTF-8 + BOM CSVs, and they read correctly but do not write correctly. I’ve seen this cause all sorts of headaches because it appears to work but does not.

The Right Way Functional Workaround

As of this writing, there exists a single usable CSV format that Microsoft Excel can both read and write safely across platforms. Tab delimited UTF-16LE with leading Byte Order Mark.

Using Excel to output into this format to begin with, you have to use the Save As dialog and choose “UTF-16 Unicode Text (.txt)”. Frustratingly, it includes a .txt extension by default which can be confusing for non-technical users. However, if you provide clear instructions to Windows users to save their filename in quotes with a CSV extension such as "example.csv" it is workable.

If you operate an application with a CSV exporter please use this as your default export format. It will save us all many headaches in the long run.