The Story of UTF8 VS UTF8MB4

I once got a call from the support team, saying that one of our customers reported that the application fails to save data in one of our business-critical features. The customer is seeing a general error from the application. About 30 of his 500 users are experiencing this issue and can't save data in the application.

After a short 15 minutes debugging session, we saw that the data is transmitted from the client side, successful received in the server side, and the insertion query is fired to the database. But still, no data in the database. Hmm.. now it got interesting.

Looking at the logs, it turns out that for specific inputs, MySQL refused to insert the data to the database. The error MySQL logged was:

Incorrect string value: '\xF0\x9F\x98\x81...' for column 'data' at row 1

Looking at those first 4 bytes, I got to no conclusion as to what was the issue. When googling them, I found that they represent an emoticon in UTF8. So, it means that for this specific input, each character is probably encoded as 4 bytes.

I tried to reproduce this issue with a different string, which has its characters encoded with 1-3 bytes per character. It turned out that it only happens when each character in the data was combined of 4-byte.

We scratched our heads and thought - the character set we used in the database is UTF8, which should support 4 bytes (right?), so what's wrong?

Well, it turns out we were wrong. We quickly realized that MySQL decided that UTF-8 can only hold 3 bytes per character (as it's defined as an alias of utf8mb3). Why? no good reason that I can find documented anywhere. Few years later, when MySQL 5.5.3 was released, they introduced a new encoding called utf8mb4, which is actually the real 4-byte utf8 encoding that you know and love.

Recommendation

if you're using MySQL (or MariaDB or Percona Server), make sure you know your encodings. I would recommend anyone to set the MySQL encoding to utf8mb4. Never use utf8 in MySQL, there is no good reason to do that (unless you like tracing encoding related bugs).

How to convert utf8 to utf8mb4 in MySQL?

So now I had to fix this issue. As I recommend above, I wanted to use utf8mb4 and drop the old utf8. To do that, I used the following ALTER statements. Please DO NOT just copy paste them. You need to make sure you understand each of them and adjust them accordingly.

# Run this once on each schema you have (Replace database_name with your schema name) ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci; # Run this once for each table you have (replace table_name with the table name) ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci; # Run this for each column (replace table name, column_name, the column type, maximum length, etc.) ALTER TABLE table_name CHANGE column_name column_name VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Please note that you'll have to consider the consequences of increasing the column size from 3 bytes per character to 4. For example, MySQL indexes are limited to 768 bytes. This means that if you increase VARCHAR(255) from 3 bytes per character to 4 bytes per character, you won't meet that limit anymore.

To conclude, make sure you read about the internals of every decision you make with MySQL. Oh, and use utf8mb4 instead of utf8 without even thinking about it.