I have seen a few databases recently that could have saved a lot of space by being more efficient with how they stored data. Sometimes this isn't a big problem, when a table is not going to grow particularly quickly. But it can become a big problem and you can be leaving a lot of disk savings on the table.

Let's review some of the benefits of smaller tables:

Indexes are smaller. This means your database needs less space to index your tables, and more RAM can be used to cache results.

The cache can hold more objects, since the objects are smaller.

You'll delay the point at which your database won't fit on a single disk, and you have to shard.

Query results which might have fit in 2 TCP packets will now fit in one.

Backups complete more quickly.

Your application servers will use less RAM to hold the result.

Migrations complete more quickly.

Full table searches complete more quickly.

Let's review some common data types and strategies for holding these. If these are obvious to you - great! You can stop reading at any point. They're not obvious to a lot of people.

A brief reminder before we get started - a bit is a single 0 or 1, and a byte is a series of 8 bits. Every ASCII character can be represented in a single byte.

UUID's

It's common to store UUID's as text fields in your database. A typical UUID representation - "ad91e02b-a147-4c47-aa8c-1f3c2240c0df" - will take up 36 bytes and more if you store it with a prefix like SMS or user_ . A UUID uses only 16 different characters (the hyphens are for display only, like hyphens in a phone number). This means you only need 4 bits to store a UUID character. There are 32 characters in a UUID, so can fit a UUID in 16 bytes, a saving of 55%. If you're using Postgres, you can use the uuid data type to store UUID's, or the binary data type - in MySQL, you can use a binary(16) .

CREATE TABLE users (id uuid PRIMARY KEY);

It's often useful to store a prefix with a UUID, so you know what the ID represents from looking at it - for example, SM123 or user_456 . I wrote a short library that stores a prefix with a UUID, but strips it before writing to the database. To read UUID's out of the database with a prefix, attach them to the SELECT statement:

SELECT 'user_' || id FROM users LIMIT 5;

My old team at Shyp recently converted text ID's to UUID's and wrote about that process on their engineering blog.

UUID's in JSON

It's becoming more common to store relational data in JSON or JSONB columns. There are a lot of reasons to do this or not do this - I don't want to rehash that discussion here. JSONB does lead to inefficient data storage for UUID's, however, since you are limited to storing characters that are valid JSON. If you are storing UUID's this means you can't get down to 16 bytes, since you can't just any byte in JSON. You can base64 encode your 16 byte UUID. In Go, that encoding dance looks something like this:

import "encoding/base64" import "encoding/hex" import "strings" rawUUID := "ad91e02b-a147-4c47-aa8c-1f3c2240c0df" // Strip the hyphens uuidStr := strings.Replace(rawUUID, "-", "", 4) // Decode the hex string into a slice of 16 bytes. bits, _ := hex.DecodeString(uuidStr) // Re-encode that 16-byte slice using base64. fmt.Println(base64.RawURLEncoding.EncodeToString(bits))

That outputs rZHgK6FHTEeqjB88IkDA3w , which is only 22 bytes, a 38% improvement.

Use smaller numbers

A default Postgres integer is 4 bytes and can hold any number from -2147483648 to 2147483648. If you know that the integer you are storing is never going to exceed 32,760, you can use a smallint (2 bytes) to store it and save 2 bytes per row.

Use an enum type

Let's say you have a subscription that can have one of several states ( trial , paid , expired ). Storing the strings "trial" , "paid" , "expired" in the database can take up extra space. Instead use an enum type, which is only 4 bytes (1 byte in MySQL) and ensures you can't accidentally write a bad status like "trail" . Another alternative is to store a smallint and convert them to values that make sense in the application, but this makes it harder to determine what things are if you're querying the database directly, and doesn't prevent mistakes.

Use binary for password hashes

Most password hashing algorithms should give you back raw bytes. You should be able to store the raw bytes directly in the database using bytea .

Move fields out of JSON columns

One downside of JSON/JSONB is that the key gets stored alongside the value for each row in the application. If you are storing a boolean like {"show_blue_button": true} in JSON, you're using 18 bytes per row to store the string "show_blue_button" and only one bit to store the boolean true . If you store this field in a Postgres column, you are only using one or two bits per row. Moving this to a column pays off in terms of space even if you only need the show_blue_button boolean once every 70-140 rows. It's much easier to add indexes on columns than JSON fields as well.

Conclusion

That's it! A small amount of thought and work upfront can pay big dividends down the line. Migrating columns after they're already in place can be a pain. In general, the best approach is to do the following:

Add the new column with the correct type.

Edit your application to write/update both the new and the old column.

Backfill the new column, copying over all values from the old column for old records in the database. If the table is large, do this in batches of 1000 rows or so to avoid locking your table for too long.

Edit your application to read exclusively from the new column.

Drop the old column.

I hope this helps!

Inspired by some tweets from Andrey Petrov and a Heap Analytics post about JSONB.

Liked what you read? I am available for hire.