Apache Spark aims to solve the problem of working with large-scale distributed data, and with access to over 500 million leaked passwords we have a lot of data to dig through. If you spend any time with the password data set, you’ll notice how simple most passwords are. This is why we’re always thinking about how to encourage stronger passwords, and why we recommend turning on two-factor authentication everywhere it’s available.

Tools like Excel and Python are great for data analysis, but Spark solves a different problem: what to do once the data you’re working with is too large to fit into the memory of your local computer. It does this by distributing both the data and the computation across multiple cores or machines.

This tutorial will show you how to get set up to run Spark and introduce the tools and code that let you manipulate and explore the data. Read on to find out how to spot the most common password lengths, suffixes, and more.

500 Million Pwned Passwords 😱

We’re using a combination of the raw pwned password data from Troy Hunt joined with some known common passwords. The pwned password data contains SHA-1 hashes of passwords and a count of the number of times each password has been pwned. Hunt explains why he stores the data that way: “[the] point is to ensure that any personal info in the source data is obfuscated such that it requires a concerted effort to remove the protection, but that the data is still usable for its intended purposes”. Our analysis won’t be terribly interesting if we only look at hashed passwords, so we already grabbed known common passwords and joined those with our data.
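To get a feel for what that joined data looks like once it’s loaded, here’s a minimal sketch of reading it into a Spark DataFrame. The file name (`pwned-passwords-with-plaintext.txt`), the colon delimiter, and the column names (`hash`, `count`, `password`) are assumptions for illustration; adjust them to match the file you download below.

```scala
import org.apache.spark.sql.SparkSession

// Start a local Spark session; "local[*]" uses all available cores.
val spark = SparkSession.builder()
  .appName("pwned-passwords")
  .master("local[*]")
  .getOrCreate()

// Read the joined data set. The file name, delimiter, and column
// names here are assumptions for illustration.
val passwords = spark.read
  .option("delimiter", ":")
  .csv("pwned-passwords-with-plaintext.txt")
  .toDF("hash", "count", "password")

// Peek at a few rows to confirm the schema lines up.
passwords.show(5)
```

If you’d rather explore interactively, the same code works in `spark-shell`, which creates the `spark` session for you.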

Download the data from GitHub: