Using Google Cloud Platform to store and query 1.4 billion usernames and passwords

How we used GCP to search a massive data breach dump, and how you can set it up too.

NOTE: If you plan to follow this blog post and set this up, you should understand that you may incur some charges for usage of your Google Cloud Platform resources.

Recently it came to our attention that there was a combined password dump which contained passwords cracked to plaintext.

The dump, said to be one of the largest, was 42 GB in size. That is a lot of usernames and passwords! Woah!

The username and password dump came conveniently sorted alphabetically, with simple scripts to query for email addresses. It also had scripts to count the total number of entries and so on. On any decent laptop or virtual machine with an SSD, the query time is a mere 4–5 seconds. But we wanted to dig a bit deeper. We wanted to count things like:

How many usernames have passwords longer than 100 characters?

How many users have a gmail.com email address?

How many emails and passwords are present for a particular corporate domain?

Format of the records present in the data

username:password

Since the dump files stored each record as a username and password separated by a colon, it seemed like a great opportunity to try Google BigQuery to answer these questions and more.

Google BigQuery enables super-fast SQL queries using the processing power of Google’s infrastructure.

It was a fairly compelling offer: whilst none of us at Appsecco are data scientists, we are definitely familiar with using SQL to query databases.
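To make that concrete, once the dump is loaded into a BigQuery table, a question like the gmail.com one above could be answered with a query along these lines. The dataset, table, and column names here are hypothetical, not the ones we actually created:

bq query --use_legacy_sql=false '
  SELECT COUNT(*) AS gmail_users
  FROM `treasuretrove.credentials`
  WHERE username LIKE "%@gmail.com"'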

We started at the /r/netsec subreddit, which gave us the magnet link for the torrent of this dump.

(Internally we now refer to this password dump as treasure trove!)

The post linked to a gist on GitHub with a magnet link to download the sorted list as a torrent. The torrent didn't start since the magnet link didn't have any tracker information, but a redditor (thank you CiNXNppjlK) posted a magnet link with tracker information.

Using a Google Cloud Platform (GCP) Virtual Machine

Since we planned to use Google BigQuery and wanted the data to be in Google Storage, we decided to download the torrent directly in a virtual machine on GCP. We used a Debian Stretch image available from the console.

Standard options for starting a new virtual machine in GCP Compute Engine
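If you prefer the command line to the console, an equivalent instance can be created with gcloud. The instance name, zone, and machine type below are illustrative assumptions rather than the exact options we picked:

# create a Debian Stretch (Debian 9) VM; adjust name, zone and size to taste
gcloud compute instances create treasure-trove-vm \
  --zone=us-central1-a \
  --machine-type=n1-standard-4 \
  --image-family=debian-9 \
  --image-project=debian-cloud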

We used aria2 for downloading magnet links on the command line.

aria2 is a lightweight command line download utility with full BitTorrent support. More information about aria2: https://aria2.github.io/

Installing aria2

sudo apt-get install aria2

Adding a new disk to store the data

Since we planned to download at least 42 GB of data, it made sense to add an additional SSD to the machine.

The standard SSD comes with 375 GB of storage
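Local SSDs are attached when the instance is created (via the console, as we did here). Before use, the disk needs a filesystem and a mount point. A minimal sketch, assuming the SSD shows up as /dev/sdb and using /mnt/bigdisk as the mount point (the path we use below):

# confirm the device name first; it may differ from /dev/sdb
lsblk
sudo mkfs.ext4 -F /dev/sdb
sudo mkdir -p /mnt/bigdisk
sudo mount /dev/sdb /mnt/bigdisk
sudo chmod a+w /mnt/bigdisk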

Starting the download

We created a folder called torrents on the SSD

cd /mnt/bigdisk && mkdir torrents

Starting a new BitTorrent download using the aria2 client with a magnet link is as simple as

aria2c "MAGNET LINK"

The double quotes ensure that special characters in the magnet link, such as & and ?, are not interpreted by the shell.
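If you don't want to cd into the torrents directory first, aria2c can also be pointed at it directly and told to exit once the download finishes instead of seeding. A variation on the command above (the magnet link itself is elided):

aria2c --dir=/mnt/bigdisk/torrents --seed-time=0 "MAGNET LINK"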

Since this was a torrent being downloaded (and we assumed it would take some time) we went out for a cup of coffee. By the time we came back 30 minutes later, the download was done!

Copying files to the Google Storage Bucket

A nice touch is that every Compute Engine instance comes with some of the command line tools required for working with other Google Cloud services already installed.

Before we could copy files to a Google Storage Bucket, we needed to run the config command

gsutil config

Along with some other output, we got a message saying

Please navigate your browser to the following URL:

We copied the link given in the message into a browser and got an authorization code

When you see the output on your screen, there won't be any blanked-out parts; I am hiding my details here

We knew it worked when we saw the following line of output

Boto config file "/home/REDACTED/.boto" created...

We created a bucket for our treasure trove

gsutil mb gs://treasuretrove

If you plan to do this as well, you would ideally want to append a hyphen and a random hexadecimal string to the bucket name, since bucket names have to be globally unique.

Optional, but it makes for a cooler-looking command

gsutil mb gs://treasuretrove-$(openssl rand -hex 10)

This is just a hangover from how we create AWS S3 buckets. You can choose to ignore this if you prefer your bucket names to be simple words.

If for any reason you are unhappy with the bucket you created, here is the command to remove it

gsutil rm -r gs://treasuretrove

Now we were ready to copy the data.

The gsutil utility has an option to parallelise copying of data using the -m flag. Since we had to copy a lot of data, we used it. The entire copy was done in about 5 minutes.

A caveat if you plan to use this flag: if something does go wrong while copying, you will not be able to use the retry feature. It therefore makes sense to run this command inside tmux, so that a dropped SSH session doesn't kill the copy.
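A minimal tmux workflow for this, assuming tmux is installed (the session name is arbitrary):

tmux new -s copy        # start a named session and run the gsutil command inside it
# detach with Ctrl-b d, then reattach later with:
tmux attach -t copy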

Now we just had to be inside the data directory

cd /mnt/bigdisk/torrents/BreachCompilation/data

and execute the command

gsutil -m cp -r * gs://treasuretrove/

All this typing made us thirsty. That, and the fact that we were copying about 42 GB of data, so it was time for another cup of coffee.

Once this is done, this is what the bucket listing looks like, all the way to z
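The same listing can be checked from the command line, along with a quick sanity check on the total size:

gsutil ls gs://treasuretrove/
gsutil du -s -h gs://treasuretrove/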

Now that we had the data in storage, we could, if required, trash the Compute Engine instance so as not to incur any more charges.
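If you do decide to trash it, a single gcloud command removes the instance. The name and zone here are the illustrative ones from earlier, so substitute your own:

gcloud compute instances delete treasure-trove-vm --zone=us-central1-a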