Some of you out there know that I have been collecting passwords for quite some time. Since 1998, to be exact. Originally I did it just to have big wordlists for password cracking. Then I started gathering them as research for my Perfect Passwords book. Finally it became like a big ball of string, where you just keep doing it because it makes no sense to stop now. My list currently contains about 6 million unique username/password combinations (not counting those from public lists from Gawker, RockYou, and others).

So I thought that some people might be interested in how I collect these passwords. Note that all of these passwords have already been made public and can easily be found by anyone. There are no passwords on my list that have not already been made public. Also note that so far I have never shared this list with anyone.

I use tools such as Athena, which runs massive Google searches and collects passwords in the format “http://user:password@example.com/members”. This tool can easily gather 200,000 combos in a day, but the majority of these are already in my database. I run this about once a month.

I have a script that nightly leeches from a huge list of well-known password sharing web sites.

I use a number of Google alerts that watch for common keylogger log formats. This is just one of many that I use. There are a surprisingly huge number of these logs that can be found via Google, although it is sometimes difficult to parse the passwords from the content.

I use Google alerts to watch for SQL database dumps of forum and other common software databases.

I also use Google alerts to look for passwords on pastebin.com and other related sites.

I use a script that grabs all the Google alerts as RSS feeds and parses out URLs, then another script visits each site and leeches the passwords.

I use RSS feeds from filestube.com to watch for and download password lists that might show up on a number of file sharing sites.

I use RSS feeds from various torrent searches that I put into uTorrent to download automatically.

I use a number of IRC bots that hang out in a large number of IRC channels where password sharing happens. These aren’t as effective as they once were but I still use them occasionally.

I use a script to automatically download posts from various Usenet newsgroups, although most of those are just spam nowadays.

I visit a number of public and private hacking-related forums to get wordlists and hacked passwords. I often pay for VIP memberships (usually the lifetime ones) so that I can access premium content areas. Leeching from forums has to be done manually, because you often have to comment on posts to be able to download the lists, but occasionally I will spend half a day leeching from these forums. Some forums will let you subscribe to posts and will include the entire post contents in the email. This bypasses the often-used “hide hack” and I can just use another script to save that inbox to local files.

I use various FTP search engines to watch for interesting filenames that might show up on FTP sites.

In the past I have used various P2P networks (such as LimeWire) to search for files but those don’t produce many results nowadays.

Every once in a while someone will send me a big dump of their own lists they have collected.
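To make the embedded-credential URL format mentioned above concrete, here is a minimal Python sketch of pulling the username and password out of such a URL. It is not how Athena or any of my scripts work internally; the function name `combo_from_url` is made up for illustration, and it relies only on the standard library's `urllib.parse`.

```python
from urllib.parse import urlsplit, unquote


def combo_from_url(url):
    """Return (username, password) from a URL shaped like
    http://user:password@host/path, or None if either part is missing."""
    parts = urlsplit(url)
    if not parts.username or not parts.password:
        return None
    # Credentials in URLs may be percent-encoded, so decode them.
    return unquote(parts.username), unquote(parts.password)


print(combo_from_url("http://user1:password1@example.com/members"))
# prints ('user1', 'password1')
```

A URL with no credentials, such as http://example.com/members, simply returns None and can be skipped.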

As these scripts collect data, it is all dumped into a directory on my hard drive, and I regularly run a program I wrote that parses all the data looking for passwords in common formats.

Here are some examples of what the program recognizes:



http://www.example.com/members login:user1 password:password1

http://www.example.com/members user: user1 pass: password1

Login: user1 passw:password1

L:user1 P:password1

username:user1 password:password1

http://www.example.com/members L: user1 P: password1

username = user1 password= password1

u=user1 p=password1

username user1 password password1

login id: user1 password: password1

http://www.example.com/members/ L:user1 P:password1
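The formats listed above can be matched with a single regular expression. The following Python sketch is only an approximation of the idea, not my actual program: the label lists, the pattern, and the names `COMBO_RE` and `extract_combos` are assumptions chosen to cover just the examples shown here, and real-world data would need many more cases.

```python
import re

# A username label, a separator (":" or "=" or plain whitespace), the
# username, then the same again for the password. Longer labels come
# first so "username" is not matched as "user" or "u".
COMBO_RE = re.compile(
    r"(?<![A-Za-z])"                           # label must not be the tail of a word
    r"(?:login\s+id|username|user|login|l|u)"  # username label
    r"(?:\s*[:=]\s*|\s+)"                      # separator
    r"(\S+)"                                   # username
    r"\s+"
    r"(?:password|passw|pass|p)"               # password label
    r"(?:\s*[:=]\s*|\s+)"                      # separator
    r"(\S+)",                                  # password
    re.IGNORECASE,
)


def extract_combos(text):
    """Return a list of (username, password) tuples found in text,
    scanning one line at a time so matches never span lines."""
    combos = []
    for line in text.splitlines():
        combos.extend(COMBO_RE.findall(line))
    return combos


print(extract_combos("http://www.example.com/members L: user1 P: password1"))
# prints [('user1', 'password1')]
```

A pattern this loose will produce false positives on ordinary prose, which is one reason a cleanup pass is still needed afterwards.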

It grabs the username/password combos and saves them into a text log file. After a while these files accumulate and I merge them into my master database. In the database I perform cleanup steps such as removing passwords from well-known password hackers (such as pr0test) and other junk that might appear. I also strip domain names off usernames that are email addresses.
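As a rough illustration of the cleanup and merge step, here is a small Python sketch. It assumes the master list is an in-memory set of (username, password) tuples rather than a real database, and it assumes the junk entries can be filtered by a blocklist of usernames; both are simplifications, and the names `normalize_username`, `merge_combos`, and `junk_usernames` are hypothetical.

```python
def normalize_username(username):
    """Strip the domain from usernames that are email addresses."""
    return username.split("@", 1)[0]


def merge_combos(master, new_combos, junk_usernames=frozenset()):
    """Add cleaned combos to the master set and return how many were new.

    master         -- set of (username, password) tuples, modified in place
    new_combos     -- iterable of (username, password) tuples
    junk_usernames -- lowercase usernames to discard (an assumed blocklist)
    """
    added = 0
    for username, password in new_combos:
        username = normalize_username(username)
        if username.lower() in junk_usernames:
            continue
        combo = (username, password)
        if combo not in master:
            master.add(combo)
            added += 1
    return added


master = {("user1", "password1")}
batch = [("user1", "password1"), ("alice@example.com", "secret"), ("pr0test", "x1y2z3")]
print(merge_combos(master, batch, junk_usernames={"pr0test"}))
# prints 1
```

The return value is the number of combos that were not already in the master set.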

What is interesting about all this is how difficult it is to find new username/password combos that aren’t already on my list. These scripts can easily collect 100,000 unique username/password combos every day, but only a few thousand of those are not already on my list.

After 12+ years of collecting passwords, I have found a few interesting facts:

Although my list contains about 6 million username/password combos, the list only contains about 1,300,000 unique passwords.

Of those, approximately 300,000 are used by more than one person; about 1,000,000 appear only once (and a good portion of those are obviously generated by a computer).

The list of the top 20 passwords rarely changes and 1 out of every 50 people uses one of these passwords.
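Figures like the ones above can be computed from a combo list with a simple frequency count. This Python sketch assumes the list fits in memory as (username, password) tuples; `password_stats` and its dictionary keys are names invented for the example, not part of my tooling.

```python
from collections import Counter


def password_stats(combos, top_n=20):
    """Summarize password reuse in a list of (username, password) tuples."""
    counts = Counter(password for _, password in combos)
    total = sum(counts.values())
    top = counts.most_common(top_n)
    return {
        "combos": total,
        "unique_passwords": len(counts),
        # Passwords that appear for more than one combo.
        "shared_passwords": sum(1 for c in counts.values() if c > 1),
        # Passwords that appear exactly once.
        "single_use_passwords": sum(1 for c in counts.values() if c == 1),
        # Fraction of all combos that use one of the top_n passwords.
        "top_share": sum(c for _, c in top) / total if total else 0.0,
    }


sample = [("a", "123456"), ("b", "123456"), ("c", "hunter2"), ("d", "qwerty")]
print(password_stats(sample, top_n=1))
```

On this scale, "1 out of every 50 people" corresponds to a `top_share` of roughly 0.02 with `top_n=20`.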

There are a few flaws with my list that I should point out:

Many of these passwords have been cracked from hashes so a good percentage of them would by nature be crackable, skewing the statistics some.

These passwords are largely dominated by passwords from adult web sites, which are the ones most often shared publicly. This results in a higher percentage of adult-related and obscene passwords.

These passwords are usually from web sites that do not enforce the strong password policies that a private organization might. This is bad because this data doesn’t truly reflect all passwords, but on the other hand it shows the kind of passwords users will select if a password policy is not enforced.

My scripts only grab usernames and passwords between 3 and 30 characters long; all others are thrown out.

None of the passwords contain a colon, because that is the delimiter used to separate usernames and passwords in the combo lists my scripts generate.
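The last two limitations amount to a simple filter applied before a combo is written out. Here is a hedged Python sketch; it assumes the 3 and 30 bounds are inclusive and that only the password is checked for the delimiter, since that is all that is stated above. The names `is_acceptable` and `to_combo_line` are hypothetical.

```python
MIN_LEN, MAX_LEN = 3, 30  # assumed to be inclusive bounds
DELIMITER = ":"           # separates username and password in a combo line


def is_acceptable(username, password):
    """Keep a combo only if both parts are 3 to 30 characters long and
    the password does not contain the delimiter."""
    if not all(MIN_LEN <= len(value) <= MAX_LEN for value in (username, password)):
        return False
    return DELIMITER not in password


def to_combo_line(username, password):
    """Format a combo as it would appear in a combo list."""
    return f"{username}{DELIMITER}{password}"


print(to_combo_line("user1", "password1"))
# prints user1:password1
```

Dropping passwords that contain a colon keeps each line unambiguous to split, at the cost of losing a few otherwise valid passwords.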

So that is how I collect my passwords. Maybe someday I will share the list itself.

Incidentally, the one tool I really wish I had time to build is either a proxy server or a Greasemonkey script that will automatically parse and log username and password combos from web pages that you visit. That would be extremely helpful!

Update (4/25/12): Google recently made changes that broke several of the tools listed here. Now I collect many of my passwords using Google alerts and custom searches turned into RSS feeds and automatically added into a private WordPress blog via AutoBlogged. Before each post is added it runs through a tool I have developed (which I will share eventually) that returns just the username/password combos. I can then use the RSS feed from that private blog as a raw combos list to merge into my master list.


Interested in passwords? Join the discussion on /r/passwords moderated by Mark Burnett