The Pwned Passwords API (part of Troy Hunt’s Have I Been Pwned service) is used tens of millions of times each day, to alert users if their credentials are breached in a variety of online services, browser extensions and applications. Using Cloudflare, the API cached around 99% of requests, making it very efficient to run.

From today, we are offering a new security advancement in the Pwned Passwords API - API clients can receive responses padded with random data. This exists to effectively protect from any potential attack vectors which seek to use passive analysis of the size of API responses to identify which anonymised bucket a user is querying. I am hugely grateful to security researcher Matt Weir who I met at PasswordsCon in Stockholm and has explored proof-of-concept analysis of unpadded API responses in Pwned Passwords and has driven some of the work to consider the addition of padded responses.

Now, by passing a header of “Add-Padding” with a value of “true”, Pwned Passwords API users are able to request padded API responses (to a minimum of 800 entries with additional padding of a further 0-200 entries). The padding consists of randomly generated hash suffixes with the usage count field set to “0”.

Clients using this approach should seek to exclude 0-usage hash suffixes from breach validation. Given most implementations of PwnedPasswords simply do string matching on the suffix of a hash, there is no real performance implication of searching through the padding data. The false positive risk if a hash suffix matches a randomly generated response is very low, 619/(235*4) ≈ 4.44 x 10-40. This means you’d need to do about 1040 queries (roughly a query for every two atoms in the universe) to have a 44.4% probability of a collision.

In the future, non-padded responses will be deprecated outright (and all responses will be padded) once clients have had a chance to update.

You can see an example padded request by running the following curl request:

curl -H Add-Padding:true https://api.pwnedpasswords.com/range/FFFFF

API Structure

The high level structure of the Pwned Passwords API is discussed in my original blog post “Validating Leaked Passwords with k-Anonymity”. In essence, a client queries the API for the first 5 hexadecimal characters of a SHA-1 hashed password (amounting to 20 bits), a list of responses is returned with the remaining 35 hexadecimal characters of the hash (140 bits) of every breached password in the dataset. Each hash suffix is appended with a colon (“:”) and the number of times that given hash is found in the breached data.

An example query for FFFFF can be seen below, with the structure represented:

Without padding, the message length varies given the amount of hash suffixes in the bucket that is queried. It is known that it is possible to fingerprint TLS traffic based on the encrypted message length - fortunately this padding can be inserted in the API responses themselves (in the HTTP content). We can see the difference in download size between two unpadded buckets by running:

$ curl -so /dev/null https://api.pwnedpasswords.com/range/E0812 -w '%{size_download} bytes

' 17022 bytes $ curl -so /dev/null https://api.pwnedpasswords.com/range/834EF -w '%{size_download} bytes

' 25118 bytes

The randomised padded entries can be found with with the “:0” suffix (indicating usage count); for example, below the top three entries are real entries whilst the last 3 represent padding data:

FF1A63ACC70BEA924C5DBABEE4B9B18C82D:10 FF8A0382AA9C8D9536EFBA77F261815334D:12 FFEE791CBAC0F6305CAF0CEE06BBE131160:2 2F811DCB8FF6098B838DDED4D478B0E4032:0 A1BABA501C55ACB6BDDC6D150CF585F20BE:0 9F31397459FF46B347A376F58506E420A58:0

Compression and Randomisation

Cloudflare supports both GZip and Brotli for compression. Compression benefits the PwnedPasswords API as responses are hexadecimal represented in ASCII. That said, compression is somewhat limited given the Avalanche Effect in hashing algorithms (that a small change in an input results in a completely different hash output) - each range searched has dramatically different input passwords and the remaining 35 characters of the SHA-1 hash are similarly different and have no expected similarity between them.

Accordingly, if one were to simply pad messages with null messages (say “000...”), the compression could mean that values padded to the same could be differentiated after compression. Similarly, even without compression, padding messages with the same data could still yield credible attacks.

Accordingly, padding is instead generated with randomly generated entries. In order to not break clients, such padding is generated to effectively look like legitimate hash suffixes. It is possible, however, to identify such messages as randomised padding. As the PwnedPasswords API contains a count field (distinguished by a colon after the remainder of the hex followed by a numerical count), randomised entries can be distinguished with a 0 usage.

Lava Lamps and Workers

I’ve written before about how cache optimisation of Pwned Passwords (including using Cloudflare Workers). Cloudflare Workers has an additional benefit that Workers run before elements are pulled from cache.

This allows for randomised entries to be generated dynamically on a request-to-request basis instead of being cached. This means the resulting randomised padding can differ from request-to-request (thus the amount of entries in a given response and the size of the response).

Cloudflare Workers supports the Web Crypto API, providing for exposure of a cryptographically sound random number generator. This random number generator is used to decide the variable amount of padding added to each response. Whilst a cryptographically secure random number generator is used for determining the amount of padding, as the random hexadecimal padding does not need to be indistinguishable from the real hashes, for computational performance we use the non-cryptographically secure Math.random() to generate the actual content of the padding.

Famously, one of the sources of entropy used in Cloudflare servers is sourced from Lava Lamps. By filming a wall of lava lamps in our San Francisco office (with individual photoreceptors picking up on random noise beyond the movement of the lava), we are able to generate random seed data used in servers (complimented by other sources of entropy along the way). This lava lamp entropy is used alongside the randomness sources on individual servers. This entropy is used to seed cryptographically secure pseudorandom number generators (CSPRNG) algorithms when generating random numbers. Cloudflare Workers runtime uses the v8 engine for JavaScript, with randomness sourced from /dev/urandom on the server itself.

Each response is padded to a minimum of 800 hash suffixes and a randomly generated amount of additional padding (from 200 entries).

This can be seen in two ways, firstly we can see that repeating the same responses to the same endpoint (with the underlying response being cached), yields a randomised amount of lines between 800 and 1000:

$ for run in {1..10}; do curl -s -H Add-Padding:true https://api.pwnedpasswords.com/range/FFFFF | wc -l; done 831 956 870 980 932 868 856 961 912 827

Secondly, we can see a randomised download size in each response:

$ for run in {1..10}; do curl -so /dev/null -H Add-Padding:true https://api.pwnedpasswords.com/range/FFFFF -w '%{size_download} bytes

'; done 35572 bytes 37358 bytes 38194 bytes 33596 bytes 32304 bytes 37168 bytes 32532 bytes 37928 bytes 35154 bytes 33178 bytes

Future Work and Conclusion

There has been a considerable amount of research that has complemented the anonymity approach in Pwned Passwords. For example; Google and Stanford have written a paper about their approach implemented in Google Password Checkup, “Protecting accounts from credential stuffing with password breach alerting” [Usenix].

We have done a significant amount of work exploring more advanced protocols for Pwned Passwords, some of this work can be found in a paper we worked on with academics at Cornell University, “Protocols for Checking Compromised Credentials” [ACM or arXiv preprint]. This research offers two new protocols (FSB, frequency smoothing bucketization, and IDB, identifier-based bucketization) to further reduce information leakage in the APIs.

Further work is needed before these protocols gain the production worthiness that we’d like before they are shipped - but, as always, we’ll keep you updated here on our blog.





