When working with distributed systems, sequential IDs are not always an option. GUIDs are commonly used, but they’re unnecessarily long. How long do randomly generated IDs really need to be?

Background

An ID is a unique identifier for a record in a database. It’s used to identify that record when updating it, indexing it, etc.

The traditional method of generating IDs has been to use a sequential number. The first record gets ID 1, the second 2, and so forth.

When your database is distributed across multiple machines, however, this can become complicated. In the real world, where we face network failures, machine failures, and general latency, having to coordinate what the next ID value will be has a very real cost.

GUIDs

A popular method of solving this problem is to use Globally Unique Identifiers (GUIDs). GUIDs are essentially just very long random numbers. The idea is, if you make your random number long enough, it will be very unlikely that two machines will ever generate the same one. As crazy as that might sound, it works because there are just so many possible values a long enough number can have.

The format most commonly used is UUID4. It involves 122 bits of random data with 6 bits which are fixed. UUID4s are typically represented as hex characters like this: f0c6b590-0bd6-4c66-8872-f6a0f3aa33ac, but can also be represented in some more fun ways.
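As a quick illustration of the fixed bits, here is a minimal Python sketch using the standard library's uuid module. The variable names are only for illustration; the bit positions follow the standard UUID layout, where a 4-bit version field and a 2-bit variant field account for the 6 non-random bits.

```python
import uuid

u = uuid.uuid4()

# The 4-bit version field sits at bits 76-79 of the 128-bit value.
version_bits = (u.int >> 76) & 0xF

# The top 2 bits of the variant field sit at bits 62-63.
variant_bits = (u.int >> 62) & 0b11

print(u)                   # a different value on every run
print(version_bits)        # always 4 for a UUID4
print(bin(variant_bits))   # always 0b10 for this variant
```

Everything outside those two fields (128 - 6 = 122 bits) is random.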

Many people don’t like UUID4s however. For one, they’re very long, meaning you often store more bits of random data than you truly need to uniquely identify your record. They’re also commonly stored as strings rather than the actual 128 bits they represent. The string form is 36 characters (32 hex digits plus 4 hyphens), so at 8 bits per character it takes a whopping 288 bits to store your 122 bits of random data. They also exhibit a problem with “locality”.
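The storage overhead is easy to check. This small sketch, assuming the ID is stored as ASCII text, compares the string form with the raw 16-byte form that Python's uuid module exposes:

```python
import uuid

u = uuid.uuid4()

# String form: 32 hex digits plus 4 hyphens, one byte per character in ASCII.
text_bits = len(str(u).encode("ascii")) * 8

# Binary form: the 128 bits the UUID actually represents.
binary_bits = len(u.bytes) * 8

print(text_bits, binary_bits)
```

Storing the binary form where the database supports it avoids more than half of that cost.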

Locality

Understanding locality requires a bit of understanding of how computers work. They have hard disks which store lots of data slowly, and memory which stores smaller amounts of data much more quickly. Generally the operating system is smart enough to store in memory the chunk of the hard drive you are likely to access soon. This caching has a dramatic effect on database performance, as most of what a database does is read and write data.

It’s very common for databases to order records by their id, usually in a structure like a B-tree. When you use sequential IDs, this works well, as each new record ends up at the end. The operating system can keep the last block of records in memory, making it efficient for you to add new rows without having to move too much between memory and the hard disk. When your IDs are random however, each new record has to be placed at a random position. This means a lot more thrashing of data to and from disk.
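To make the difference concrete, here is a toy sketch in which a sorted Python list stands in for the ordered index. That is a big simplification of a real B-tree, and the function name is made up for this example. It records where each new key lands relative to the existing keys.

```python
import bisect
import random

def insert_positions(keys):
    """Insert keys one by one into a sorted list and return each key's
    relative landing position (0.0 = front of the index, 1.0 = end)."""
    index = []
    positions = []
    for key in keys:
        pos = bisect.bisect_left(index, key)
        positions.append(pos / len(index) if index else 1.0)
        index.insert(pos, key)
    return positions

# Sequential IDs: every insert lands at the very end of the index.
sequential = insert_positions(range(1, 1001))

# Random 72-bit IDs: inserts are scattered across the whole index.
rng = random.Random(0)
scattered = insert_positions([rng.getrandbits(72) for _ in range(1000)])

print(min(sequential), max(sequential))
print(min(scattered), max(scattered))
```

With sequential keys only the last block of the index is ever touched, while random keys touch blocks all over it.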

Some databases make a compromise. MongoDB for example uses random IDs which start with a timestamp. That way, records tend to be grouped by when they were created, but can nevertheless be created independently on multiple machines simultaneously.

Solution

So, with that fix in mind, the next question is: how long do random IDs need to be? We know they need to be long enough to make it very, very unlikely that two machines will come up with the same one. But very long IDs both require more storage space and make it less pleasant to, for example, include IDs in URLs. No one likes a URL with /fdslkj3r239fj49dfaKJkj4234231fasdAEDFfsda/ in it.

Luckily there is a mathematical answer to this problem found in the Birthday Problem. The Birthday Problem asks the question: “How many people need to be in a room before it’s likely that two share a birthday?” With a little creativity we can imagine this question instead: “How many IDs need to be generated before it’s likely that two records will share an ID?”

An approximate solution to that problem is:

$$p(n) \approx 1 - e^{-n^2 / (2x)}$$

where p(n) is the probability of at least one collision, x is the number of values an ID can have, and n is the number of IDs we plan on generating.
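For readers who would rather compute this than plug it into a calculator, here is a minimal sketch of the standard birthday-problem approximation, 1 - e^(-n²/(2x)). The function name is an assumption made for this example. It uses math.expm1 so that very small probabilities are not rounded away to zero.

```python
import math

def collision_probability(n, x):
    """Approximate chance of at least one collision when drawing n random
    IDs from x equally likely values: 1 - exp(-n^2 / (2x))."""
    return -math.expm1(-(n * n) / (2 * x))

# The classic birthday problem: 23 people, 365 possible birthdays.
birthday = collision_probability(23, 365)

# One million IDs drawn from 122 random bits.
uuid4_million = collision_probability(10**6, 2**122)

print(birthday)
print(uuid4_million)
```

The birthday case comes out a little over one half, which matches the well-known answer of 23 people.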

For example, for UUID4s which have 122 bits of random data, we have $2^{122}$ potential IDs. If we plan on generating one million IDs, the equation becomes:

$$1 - e^{-(10^6)^2 / (2 \cdot 2^{122})}$$

Plugging that into Wolfram Alpha gives us the astronomically small number $9.4 \times 10^{-26}$.

To give a little perspective, you are 78,000,000,000,000,000,000 times more likely to be struck by lightning than to have a collision in those million IDs.

It’s clear that, for most projects, UUID4s are a little excessive. So how long should GUIDs be?

Let’s start by upping the number of IDs we plan on generating to 100 million to give us some breathing room. 72 bits of randomness then gives us a collision probability of about $10^{-6}$, or one in a million. Considering that it’s probably more likely that Heroku falls off the face of the earth in the middle of your big launch than that, it seems safe enough.
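To see how the probability moves with ID length, this sketch evaluates the same birthday approximation, 1 - e^(-n²/(2x)) with x = 2^bits, for 100 million IDs at a few bit lengths. The particular lengths chosen, other than 72 and 122, are arbitrary comparison points.

```python
import math

def collision_probability(n, bits):
    """Birthday approximation for n random IDs of the given bit length."""
    return -math.expm1(-(n * n) / (2 * 2**bits))

n = 100_000_000
table = {bits: collision_probability(n, bits) for bits in (64, 72, 80, 122)}

for bits, p in table.items():
    print(f"{bits:>3} bits: {p:.3g}")
```

Each extra 8 bits divides the collision probability by roughly 256, so the length can be tuned to whatever risk you are comfortable with.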

Encoding

The next question is, what’s the best way to represent it? In an ideal world we would store just the random bits we generated. Working with strings of bits can be harder than you’d expect though. When you need to send it to the client your options are limited. Languages which use IEEE floating point numbers (like JavaScript) generally only provide 53 bits which can be used to represent your value exactly. Anything more than that is going to introduce inaccuracy which would render your ID useless.
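Python floats are the same IEEE 754 doubles that JavaScript uses for its numbers, so the precision limit can be shown with a short sketch. The sample ID here is just the largest 72-bit value, picked for illustration.

```python
# Above 2**53, neighbouring integers collapse onto the same double.
limit = 2**53
neighbours_collapse = float(limit) == float(limit + 1)

# A 72-bit ID does not survive a round trip through a double.
sample_id = 2**72 - 1
round_tripped = int(float(sample_id))

print(neighbours_collapse)
print(sample_id == round_tripped)
```

This is why a 72-bit ID needs to travel to the client as a string rather than as a number.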

Strings however offer a pretty decent option. Using base64 encoding, you can pack 6 bits in each 8 bit character (encoding dependent). 72 bits maps to 12 characters, which is short enough of an ID to include in a URL pleasantly. Be sure to use the URL-safe variant of base64 if you intend on including these IDs in URLs. You could even use base62 or 53 if you want to get even cleaner.
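A minimal sketch of this encoding step, using Python's secrets and base64 modules: 72 bits is 9 bytes, and because 9 is a multiple of 3, the base64 output is exactly 12 characters with no padding.

```python
import base64
import secrets

# 9 bytes = 72 bits from a cryptographically strong random source.
raw = secrets.token_bytes(9)

# The URL-safe alphabet uses '-' and '_' instead of '+' and '/'.
encoded = base64.urlsafe_b64encode(raw).decode("ascii")
decoded = base64.urlsafe_b64decode(encoded)

print(encoded)  # a different 12-character string on every run
```

The standard library's secrets.token_urlsafe(9) produces the same kind of string in one call.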

Unfortunately, this method fails one of our tests: it doesn’t offer any locality. Each new ID is going to end up in a random part of the database, hurting performance. We can, however, do something similar to what MongoDB does and include a partial timestamp at the front, although the exact number of bits to devote to the timestamp is tricky to decide.

A UNIX timestamp accurate to the second is generally given 32 bits, which should be good until 2038. It may not be necessary for you to encode every time since 1970 however. If you are only interested in keeping recent records close to each other, you only need enough values to ensure that you don’t have more values with the same prefix than your database can cache at once. This value is going to vary, but for example you could store the number of seconds you are into the current year. That would require 25 bits (log base 2 of the number of seconds in a year, rounded up). A bucket for every 10 seconds would be 22 bits, every minute 20 (a year has 525,600 minutes, just over $2^{19}$).
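The bit counts can be checked with a few lines of integer arithmetic. This sketch assumes a 365-day year, and the helper name is made up for the example. Note that one-minute buckets come out at 20 bits, because the 525,600 minutes in a year are slightly more than 2^19 = 524,288.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 in a non-leap year

def timestamp_bits(bucket_seconds):
    """Bits needed to number every bucket of the given width within a year."""
    buckets = -(-SECONDS_PER_YEAR // bucket_seconds)  # ceiling division
    return (buckets - 1).bit_length()

bits_by_bucket = {width: timestamp_bits(width) for width in (1, 10, 60)}
print(bits_by_bucket)
```

Wider buckets free up bits for randomness, at the cost of putting more IDs under the same prefix.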

Generating an ID would then be:

(seconds into the current year) + (47 bits of random data)
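Here is one way that recipe could look in Python. This is a sketch under stated assumptions rather than Eager's actual implementation: it assumes UTC, a 25-bit timestamp in the high bits, 47 random bits in the low bits, and URL-safe base64 for the final string. The function and constant names are hypothetical.

```python
import base64
import secrets
from datetime import datetime, timezone

TIMESTAMP_BITS = 25  # enough for every second of a year, leap years included
RANDOM_BITS = 47     # 25 + 47 = 72 bits = 9 bytes

def seconds_into_year(now=None):
    """Whole seconds since the start of the current UTC year.
    `now`, if given, must be a timezone-aware datetime."""
    now = now or datetime.now(timezone.utc)
    start = datetime(now.year, 1, 1, tzinfo=timezone.utc)
    return int((now - start).total_seconds())

def generate_id(now=None):
    """Build a 72-bit ID (timestamp prefix, then random bits) as 12 characters."""
    prefix = seconds_into_year(now)
    value = (prefix << RANDOM_BITS) | secrets.randbits(RANDOM_BITS)
    raw = value.to_bytes(9, "big")
    return base64.urlsafe_b64encode(raw).decode("ascii")

print(generate_id())  # a different 12-character string on every run
```

One caveat: the base64 alphabet is not in ASCII order, so the encoded strings do not sort the same way as the underlying 9 bytes. To get the locality benefit, the index would need to be built on the raw bytes or the integer value rather than on the encoded string.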

You have less randomness, but you now only have to worry about values in the same one-second bucket colliding. If your data comes in regularly, that means you have 100 million / (seconds in a year), or about 3.2, values in each bucket. Running that through the birthday paradox math gives us an even lower collision probability of $3.57 \times 10^{-14}$ for any given bucket. This does come at a cost: if your data isn’t regular, collisions become progressively more common. For example, if all hundred million records are written in just 100 seconds (an impressive feat), the collision probability in each of those buckets becomes about three in a thousand.
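Both figures can be reproduced with the same birthday approximation, 1 - e^(-n²/(2x)), applied to a single one-second bucket with 47 random bits. The names below are assumptions made for this sketch, and the year is taken to be 365 days.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 60 * 60
RANDOM_BITS = 47

def bucket_collision_probability(ids_in_bucket, bits=RANDOM_BITS):
    """Birthday approximation for the IDs that share one timestamp prefix."""
    return -math.expm1(-(ids_in_bucket**2) / (2 * 2**bits))

# 100 million IDs spread evenly over a year: about 3.2 per bucket.
steady = bucket_collision_probability(100_000_000 / SECONDS_PER_YEAR)

# The same 100 million IDs written in 100 seconds: one million per bucket.
burst = bucket_collision_probability(100_000_000 / 100)

print(f"steady: {steady:.3g}")
print(f"burst:  {burst:.3g}")
```

Both numbers are per bucket, so a workload that fills many buckets accumulates that risk across all of them.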

This math is how we generate IDs at Eager, giving you URLs like https://eager.io/app/ZYBle8qUhKFJ. In your next project, before you type auto increment or use a UUID4, take a moment to think about your other options.
