Pipal, Password Analyser

On most internal pen-tests I generally manage to get a password dump from the domain controller (DC). To do some basic analysis on these dumps I wrote Counter, and since its original release I've made quite a few modifications to it to generate extra stats that are useful when reporting to management.

Recently a good friend, n00bz, asked on Twitter if anyone had a tool that he could use to analyse some passwords he had. I pointed him to Counter and said if he had any suggestions for additions to let me know. He did just that and over the last month between us we have come up with a load of new features which we both think will help anyone with a large dump of cracked passwords to analyse. We also got some input from well known password analysts Matt Weir and Martin Bos who I'd like to give a big thanks to.

Before going on, I have to point out that all this tool does is give you the stats and the information to help you analyse the passwords. The real work is done by you in interpreting the results: I give you the numbers, you tell the story.

Seeing as there have been so many changes to the underlying code I also decided to change the name (see "Where is the name from?" below) and do a full new release.

Modular Release

Over the past few months I've been rewriting Pipal to make it modular rather than a huge, monolithic lump. Rather than try to add the extra information here, I've written a short blog post about it.

Pipal Goes Modular

Version 2

Version 2 brings two big changes. The first is a massive speed increase. This patch was submitted by Stefan Venken, who said a small mention would be good enough, but I want to give him a big mention. Running through the LinkedIn lists would have taken many, many hours on version 1; version 2 went through 3.5 million records in about 15 minutes. Thank you.

The second change is the addition of US area code and zip code lookups. This little feature gives some interesting geographical data when run across password lists originating in the US. The best example I've seen of this is the dump from the Military Singles site, where some passwords could clearly be seen to be grouped around US military bases. People in the UK don't have the same relationship with phone numbers so I know this won't work here, but if anyone can suggest any other areas where this might be useful then I'll look at building in some kind of location awareness feature. You could then specify the source of the list and get results customised to the correct area, or just run every area and see if a pattern emerges.

A non-code-base change for version 2 is the move from hosting the code myself to GitHub. This is my first GitHub hosted project so I may get things wrong; if I do, sorry. A number of people asked how they could submit patches so this seems like the best way to do it. Let's hope it works out. See the Download section for more info.

Worked Example

So, what does Pipal do? The easiest way to explain this is to show the output generated by parsing a leaked password list. I've chosen the list of passwords from the phpBB leak which I grabbed from the SkullSecurity site.

The first output is the number of entries in the file parsed and the number of unique entries found. Unfortunately the list I chose has already been run through a unique filter, so these two figures match in this example.

Total entries = 184373

Total unique entries = 184373

Next come the top 10 passwords. The list I chose has already been passed through a filter to strip any duplicates, which is why each word only appears once. The cap of showing the top 10 is configurable by a parameter on the command line. I'd suggest playing with this limit, as sometimes the next entry is the one that starts to explain things.

Top 10 passwords

123456 = 1 (0.0%)

password = 1 (0.0%)

phpbb = 1 (0.0%)

qwerty = 1 (0.0%)

12345 = 1 (0.0%)

12345678 = 1 (0.0%)

letmein = 1 (0.0%)

111111 = 1 (0.0%)

1234 = 1 (0.0%)

123456789 = 1 (0.0%)
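The counting behind the totals and the top 10 can be sketched in a few lines of Ruby. This is only an illustration and not Pipal's actual code: `top_entries` and `format_line` are hypothetical names, and the alphabetical tie-break is an assumption made so that the ordering is deterministic.

```ruby
# Illustrative sketch only; top_entries and format_line are hypothetical
# helpers, not functions taken from the Pipal source.

# Count how often each entry appears and return the most common ones.
def top_entries(entries, limit = 10)
  counts = Hash.new(0)
  entries.each { |entry| counts[entry] += 1 }
  # Highest count first; ties broken alphabetically (an assumption).
  counts.sort_by { |entry, count| [-count, entry] }.first(limit)
end

# Render one result in the "word = count (percent%)" style shown above.
def format_line(entry, count, total)
  percent = (count.to_f / total * 100).round(2)
  "#{entry} = #{count} (#{percent}%)"
end

passwords = %w[123456 password 123456 qwerty password 123456]
puts "Total entries = #{passwords.length}"
puts "Total unique entries = #{passwords.uniq.length}"
top_entries(passwords, 2).each do |entry, count|
  puts format_line(entry, count, passwords.length)
end
```

On a list that has already been de-duplicated, every count is 1 and the percentage rounds down to 0.0%, which is what the phpBB output above shows.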

The next list is the top base words. I define a base word as a word with any non-alpha characters stripped from the start and end. This is useful to identify common words, such as company names or places, which the passwords have been based on. I did consider stripping all non-alpha characters, but in one of the lists I tested on I found the base word "un1c0rn". Leaving the non-alpha characters inside the word makes sense; removing them gives "uncrn", which doesn't really mean anything.
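As a rough sketch of that definition, the stripping can be done with a single regular expression. The helper name `base_word` is hypothetical, and lowercasing the password first is an assumption on my part (the sample output is all lowercase); Pipal's own implementation may differ.

```ruby
# Illustrative sketch only; base_word is a hypothetical helper.
# Strips runs of non-alpha characters from the start and the end of the
# password but leaves anything in the middle alone.
def base_word(password)
  # Lowercasing is an assumption, so that "Dragon1" and "dragon" group together.
  password.downcase.gsub(/\A[^a-z]+|[^a-z]+\z/, "")
end

%w[123un1c0rn!! Dragon99 phpbb 123456].each do |password|
  puts "#{password} -> #{base_word(password).inspect}"
end
```

Note that a purely numeric password ends up with an empty base word, so a real implementation would probably want to skip those, and that this pattern treats non-ASCII letters as non-alpha.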

Unsurprisingly, as this list came from phpBB, the top word that passwords are based on is "phpbb". "password" comes next and is another obvious base word, but "dragon" is one I wouldn't have expected.

Top 10 base words

phpbb = 332 (0.18%)

password = 89 (0.05%)

dragon = 76 (0.04%)

pass = 70 (0.04%)

mike = 69 (0.04%)

blue = 67 (0.04%)

test = 66 (0.04%)

qwerty = 59 (0.03%)

alex = 58 (0.03%)

alpha = 53 (0.03%)

Lengths are next and are fairly self explanatory. It is a shame that the people who put the effort in and had passwords of more than 20 characters still got theirs leaked.

I hope that the 948 passwords of three characters or fewer are a mistake made when cracking the list.

Password length (length ordered)

1 = 33 (0.02%)

2 = 138 (0.07%)

3 = 777 (0.42%)

4 = 4597 (2.49%)

5 = 8199 (4.45%)

6 = 42069 (22.82%)

7 = 32731 (17.75%)

8 = 55338 (30.01%)

9 = 19187 (10.41%)

10 = 11897 (6.45%)

11 = 4934 (2.68%)

12 = 2506 (1.36%)

13 = 1019 (0.55%)

14 = 516 (0.28%)

15 = 233 (0.13%)

16 = 126 (0.07%)

17 = 37 (0.02%)

18 = 28 (0.02%)

19 = 10 (0.01%)

20 = 9 (0.0%)

21 = 6 (0.0%)

22 = 3 (0.0%)

23 = 4 (0.0%)

25 = 2 (0.0%)

27 = 3 (0.0%)

28 = 2 (0.0%)

32 = 4 (0.0%)

Password length (count ordered)

8 = 55338 (30.01%)

6 = 42069 (22.82%)

7 = 32731 (17.75%)

9 = 19187 (10.41%)

10 = 11897 (6.45%)

5 = 8199 (4.45%)

11 = 4934 (2.68%)

4 = 4597 (2.49%)

12 = 2506 (1.36%)

13 = 1019 (0.55%)

3 = 777 (0.42%)

14 = 516 (0.28%)

15 = 233 (0.13%)

2 = 138 (0.07%)

16 = 126 (0.07%)

17 = 37 (0.02%)

1 = 33 (0.02%)

18 = 28 (0.02%)

19 = 10 (0.01%)

20 = 9 (0.0%)

21 = 6 (0.0%)

23 = 4 (0.0%)

32 = 4 (0.0%)

22 = 3 (0.0%)

27 = 3 (0.0%)

25 = 2 (0.0%)

28 = 2 (0.0%)
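The two length tables are the same counts sorted in two different ways. A minimal sketch, assuming hypothetical helper names (`length_counts`, `length_ordered`, `count_ordered`) and assuming ties in the count ordered view fall back to the shorter length first:

```ruby
# Illustrative sketch only; these helpers are hypothetical, not Pipal's code.

# Map each password length to the number of passwords of that length.
def length_counts(passwords)
  counts = Hash.new(0)
  passwords.each { |password| counts[password.length] += 1 }
  counts
end

# Shortest length first.
def length_ordered(counts)
  counts.sort_by { |length, _count| length }
end

# Most common length first; ties go to the shorter length (an assumption).
def count_ordered(counts)
  counts.sort_by { |length, count| [-count, length] }
end

counts = length_counts(%w[abc abcd abcd abcdef abcdef abcdef])
puts "Password length (length ordered)"
length_ordered(counts).each { |length, count| puts "#{length} = #{count}" }
puts "Password length (count ordered)"
count_ordered(counts).each { |length, count| puts "#{length} = #{count}" }
```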

Next a nice graph showing the length data, I'm quite proud of getting this in.

[ASCII bar chart of password counts by length; x-axis: password length, 0 to 32]

Some more self explanatory information comes next. Just over 30% of people chose a password of one to six characters and just over 41% chose one that contained only lowercase alpha characters.

One to six characters = 55807 (30.27%)

One to eight characters = 143874 (78.03%)

More than eight characters = 40507 (21.97%)



Only lowercase alpha = 76041 (41.24%)

Only uppercase alpha = 1706 (0.93%)

Only alpha = 77747 (42.17%)

Only numeric = 20728 (11.24%)



First capital last symbol = 225 (0.12%)

First capital last number = 4749 (2.58%)
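Most of these categories boil down to one anchored regular expression each. The sketch below is my own guess at suitable patterns, not the ones Pipal uses: the constant `CATEGORY_PATTERNS` and the helper `category_counts` are hypothetical, and "symbol" is assumed to mean any character that is not an ASCII letter or digit.

```ruby
# Illustrative sketch only; the patterns are assumptions, not Pipal's own.
CATEGORY_PATTERNS = {
  "Only lowercase alpha"      => /\A[a-z]+\z/,
  "Only uppercase alpha"      => /\A[A-Z]+\z/,
  "Only numeric"              => /\A[0-9]+\z/,
  "First capital last symbol" => /\A[A-Z].*[^a-zA-Z0-9]\z/,
  "First capital last number" => /\A[A-Z].*[0-9]\z/
}

# Count how many passwords match each category; a password may match
# more than one category or none at all.
def category_counts(passwords)
  counts = Hash.new(0)
  passwords.each do |password|
    CATEGORY_PATTERNS.each do |name, pattern|
      counts[name] += 1 if password =~ pattern
    end
  end
  counts
end

sample = %w[password PASSWORD 123456 Password1 Password! pass1]
category_counts(sample).each { |name, count| puts "#{name} = #{count}" }
```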

The external list is a list of words passed in to Pipal on the command line. I check how many times each of these words is included in each password. This is similar to base words but here you tell the app which base words to search for.

If you are wondering why "dragon" is only counted 76 times as a base word but shows 185 times here, that is because there are 109 base words which contain "dragon" but aren't just "dragon", for example "phpdragon".

The external list I'm using is the list claiming to be "The 25 Worst Passwords on the Internet". Another suggestion for a list of words to use is the domains from the Alexa top 1000 list. This could be good if you are analysing a list of passwords from an unknown origin or would like to know if a list from one domain is linked to any other domains.

External list (Top 10)

master = 229 (0.12%)

123456 = 208 (0.11%)

dragon = 185 (0.1%)

password = 164 (0.09%)

monkey = 118 (0.06%)

shadow = 105 (0.06%)

qwerty = 95 (0.05%)

1234567 = 72 (0.04%)

12345678 = 47 (0.03%)

letmein = 44 (0.02%)
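The difference between the external list and the base word count is that this check is a substring match. A minimal sketch, assuming a hypothetical helper `external_counts` and assuming the comparison is case insensitive (the real tool may behave differently):

```ruby
# Illustrative sketch only; external_counts is a hypothetical helper.
# Counts, for each external word, how many passwords contain it anywhere.
def external_counts(passwords, words)
  counts = Hash.new(0)
  passwords.each do |password|
    lowered = password.downcase # case-insensitive matching is an assumption
    words.each do |word|
      counts[word] += 1 if lowered.include?(word.downcase)
    end
  end
  counts
end

passwords = %w[dragon phpdragon Dragon1 master]
external_counts(passwords, %w[dragon master]).each do |word, count|
  puts "#{word} = #{count}"
end
```

Here "dragon" is counted for "phpdragon" as well, which is the effect described above: a word can appear in far more passwords than have it as their base word.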

We now look at months and days in both full and abbreviated form. While "may" could be a person's name or a normal word, it looks like for some reason it is a popular word in the list. "June" and "April" are also popular, but they are also names, which could explain the higher proportion. For days of the week there is a very large preference for "monday" and "friday"; guess which days people change their passwords.

Months

january = 8 (0.0%)

february = 3 (0.0%)

march = 23 (0.01%)

april = 48 (0.03%)

may = 171 (0.09%)

june = 56 (0.03%)

july = 27 (0.01%)

august = 22 (0.01%)

september = 3 (0.0%)

october = 15 (0.01%)

november = 7 (0.0%)

december = 6 (0.0%)



Days

monday = 12 (0.01%)

tuesday = 2 (0.0%)

wednesday = 1 (0.0%)

thursday = 3 (0.0%)

friday = 11 (0.01%)

saturday = 1 (0.0%)

sunday = 5 (0.0%)

Months (Abreviated)

jan = 341 (0.18%)

feb = 42 (0.02%)

mar = 1406 (0.76%)

apr = 108 (0.06%)

may = 171 (0.09%)

jun = 190 (0.1%)

jul = 158 (0.09%)

aug = 83 (0.05%)

sept = 17 (0.01%)

oct = 69 (0.04%)

nov = 161 (0.09%)

dec = 120 (0.07%)



Days (Abreviated)

mon = 953 (0.52%)

tues = 3 (0.0%)

wed = 69 (0.04%)

thurs = 6 (0.0%)

fri = 169 (0.09%)

sat = 187 (0.1%)

sun = 299 (0.16%)

Seeing as we've looked at months and days, why not years? It looks like years around the turn of the millennium are popular in this list. I also ran this on the passwords from the MySpace leak, which showed years around 1990 were popular. Maybe this says something about the age of the average user.

Includes years

1975 = 82 (0.04%)

1976 = 80 (0.04%)

1977 = 96 (0.05%)

1978 = 118 (0.06%)

1979 = 142 (0.08%)

1980 = 130 (0.07%)

1981 = 139 (0.08%)

1982 = 142 (0.08%)

1983 = 168 (0.09%)

1984 = 176 (0.1%)

1985 = 171 (0.09%)

1986 = 152 (0.08%)

1987 = 183 (0.1%)

1988 = 165 (0.09%)

1989 = 139 (0.08%)

1990 = 127 (0.07%)

1991 = 115 (0.06%)

1992 = 82 (0.04%)

1993 = 49 (0.03%)

1994 = 41 (0.02%)

1995 = 25 (0.01%)

1996 = 38 (0.02%)

1997 = 56 (0.03%)

1998 = 49 (0.03%)

1999 = 79 (0.04%)

2000 = 428 (0.23%)

2001 = 236 (0.13%)

2002 = 268 (0.15%)

2003 = 235 (0.13%)

2004 = 180 (0.1%)

2005 = 199 (0.11%)

2006 = 145 (0.08%)

2007 = 91 (0.05%)

2008 = 30 (0.02%)

2009 = 26 (0.01%)

2010 = 57 (0.03%)

2011 = 48 (0.03%)

2012 = 45 (0.02%)

2013 = 27 (0.01%)

2014 = 9 (0.0%)

2015 = 16 (0.01%)

2016 = 12 (0.01%)

2017 = 17 (0.01%)

2018 = 16 (0.01%)

2019 = 26 (0.01%)

2020 = 47 (0.03%)

Years (Top 10)

2000 = 428 (0.23%)

2002 = 268 (0.15%)

2001 = 236 (0.13%)

2003 = 235 (0.13%)

2005 = 199 (0.11%)

1987 = 183 (0.1%)

2004 = 180 (0.1%)

1984 = 176 (0.1%)

1985 = 171 (0.09%)

1983 = 168 (0.09%)
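The year stats can be approximated by checking each password for every year in a range. This is a sketch under assumptions: `year_counts` is a hypothetical helper, the 1975 to 2020 range is taken from the output above, and a year is counted wherever its four digits appear, even inside a longer number.

```ruby
# Illustrative sketch only; year_counts is a hypothetical helper.
# Counts how many passwords contain each year as a substring.
def year_counts(passwords, years = 1975..2020)
  counts = Hash.new(0)
  passwords.each do |password|
    years.each do |year|
      counts[year] += 1 if password.include?(year.to_s)
    end
  end
  counts
end

counts = year_counts(%w[alex1987 2000 summer2000 19872000 dragon])
counts.sort_by { |year, _count| year }.each do |year, count|
  puts "#{year} = #{count}"
end
```

Because this is a plain substring test, a password such as "19872000" is counted for both 1987 and 2000, and long digit strings can produce false positives.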

The common assumption is that when people are forced to use passwords with numbers in, their general response is to add a single digit on the end. Looking at this next set of stats, in this list people actually preferred to add two digits onto the end. The assumption that the last digit will be "1" does however hold true.

Single digit on the end = 14447 (7.84%)

Two digits on the end = 18112 (9.82%)

Three digits on the end = 9637 (5.23%)

Last number

0 = 7753 (4.2%)

1 = 13572 (7.36%)

2 = 8735 (4.74%)

3 = 9313 (5.05%)

4 = 6279 (3.41%)

5 = 6408 (3.48%)

6 = 5991 (3.25%)

7 = 6472 (3.51%)

8 = 5726 (3.11%)

9 = 6728 (3.65%)

[ASCII bar chart of the last number counts; x-axis: digits 0 to 9]

We now look at what the last digits are. Some of the numbers are expected but others, 21984 for example, aren't. Could this be a US zip code?

Last digit

1 = 13572 (7.36%)

3 = 9313 (5.05%)

2 = 8735 (4.74%)

0 = 7753 (4.2%)

9 = 6728 (3.65%)

7 = 6472 (3.51%)

5 = 6408 (3.48%)

4 = 6279 (3.41%)

6 = 5991 (3.25%)

8 = 5726 (3.11%)



Last 2 digits (Top 10)

23 = 3027 (1.64%)

00 = 2185 (1.19%)

01 = 1992 (1.08%)

12 = 1817 (0.99%)

11 = 1620 (0.88%)

99 = 1341 (0.73%)

21 = 1150 (0.62%)

13 = 1095 (0.59%)

69 = 1052 (0.57%)

88 = 1028 (0.56%)



Last 3 digits (Top 10)

123 = 2164 (1.17%)

000 = 708 (0.38%)

234 = 477 (0.26%)

007 = 449 (0.24%)

001 = 430 (0.23%)

666 = 397 (0.22%)

321 = 286 (0.16%)

101 = 284 (0.15%)

002 = 274 (0.15%)

111 = 261 (0.14%)

Last 4 digits (Top 10)

1234 = 424 (0.23%)

2000 = 377 (0.2%)

2002 = 215 (0.12%)

2003 = 202 (0.11%)

2001 = 181 (0.1%)

2005 = 166 (0.09%)

2004 = 153 (0.08%)

1987 = 141 (0.08%)

1988 = 133 (0.07%)

1985 = 132 (0.07%)



Last 5 digits (Top 10)

12345 = 110 (0.06%)

23456 = 68 (0.04%)

54321 = 25 (0.01%)

11111 = 23 (0.01%)

21984 = 21 (0.01%)

00000 = 18 (0.01%)

11988 = 16 (0.01%)

21985 = 15 (0.01%)

23123 = 14 (0.01%)

11984 = 13 (0.01%)
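The trailing digit stats all come from looking at the run of digits at the end of each password. The helpers below are hypothetical and only a sketch. In particular, I am assuming that "N digits on the end" means exactly N digits preceded by at least one non-digit, while "Last N digits" only needs the final N characters to be digits.

```ruby
# Illustrative sketch only; these helpers are hypothetical, not Pipal's code.

# The run of digits at the very end of the password, or "" if there is none.
def trailing_digits(password)
  password[/\d+\z/].to_s
end

# True when the password ends in exactly `count` digits and has something
# other than digits before them (assumed rule, so "123456" never qualifies).
def digits_on_end?(password, count)
  run = trailing_digits(password)
  run.length == count && run.length < password.length
end

# The last `count` characters if they are all digits, otherwise nil.
def last_digits(password, count)
  password[/\d{#{count}}\z/]
end

%w[monkey1 monkey12 alex1121984 123456 dragon].each do |password|
  puts "#{password}: end run=#{trailing_digits(password).inspect} " \
       "last 5=#{last_digits(password, 5).inspect}"
end
```

A password such as "alex1121984" gives "21984" as its last five digits, which hints that some of the odd looking five digit values may simply be the tail end of a date.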

The last three sets of stats are recommendations from Martin. This is where we start moving from analysis to cracking: character sets, character set ordering and Hashcat masks.

Character sets

loweralpha: 76041 (41.24%)

loweralphanum: 65827 (35.7%)

numeric: 20728 (11.24%)

mixedalphanum: 8886 (4.82%)

mixedalpha: 4948 (2.68%)

upperalphanum: 2186 (1.19%)

upperalpha: 1706 (0.93%)

loweralphaspecialnum: 1393 (0.76%)

loweralphaspecial: 1383 (0.75%)

mixedalphaspecialnum: 483 (0.26%)

mixedalphaspecial: 268 (0.15%)

specialnum: 191 (0.1%)

special: 61 (0.03%)

upperalphaspecialnum: 48 (0.03%)

upperalphaspecial: 37 (0.02%)



Character set ordering

allstring: 82695 (44.85%)

stringdigit: 47849 (25.95%)

alldigit: 20728 (11.24%)

othermask: 12040 (6.53%)

stringdigitstring: 11274 (6.11%)

digitstring: 5490 (2.98%)

digitstringdigit: 2180 (1.18%)

stringspecialstring: 837 (0.45%)

stringspecialdigit: 521 (0.28%)

stringspecial: 489 (0.27%)

specialstring: 116 (0.06%)

specialstringspecial: 101 (0.05%)

allspecial: 61 (0.03%)

Hashcat masks (Top 10)

?l?l?l?l?l?l: 18462 (0.0%)

?l?l?l?l?l?l?l?l: 17481 (0.0%)

?l?l?l?l?l?l?l: 13981 (0.0%)

?l?l?l?l?l?l?l?l?l: 8045 (0.0%)

?d?d?d?d?d?d: 7726 (0.0%)

?l?l?l?l?l?l?l?l?l?l: 5253 (0.0%)

?l?l?l?l?l: 5249 (0.0%)

?d?d?d?d?d?d?d?d: 5116 (0.0%)

?l?l?l?l?l?l?d?d: 4956 (0.0%)

?l?l?l?l?l?d?d: 3149 (0.0%)
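A Hashcat mask for a password is built by replacing each character with the placeholder for its class: ?l for lowercase, ?u for uppercase, ?d for digits and ?s for specials. A minimal sketch, with hypothetical helper names and the simplifying assumption that anything which is not an ASCII letter or digit counts as ?s:

```ruby
# Illustrative sketch only; mask_token and hashcat_mask are hypothetical.

# Map one character to its Hashcat mask placeholder.
def mask_token(char)
  case char
  when /[a-z]/ then "?l"
  when /[A-Z]/ then "?u"
  when /[0-9]/ then "?d"
  else "?s" # simplification: everything else is treated as a special
  end
end

# Build the mask for a whole password, one placeholder per character.
def hashcat_mask(password)
  password.each_char.map { |char| mask_token(char) }.join
end

%w[monkey passw0rd Abc12!].each do |password|
  puts "#{password} -> #{hashcat_mask(password)}"
end
```

Counting how often each mask appears and dividing by the total number of passwords gives the percentage for each line.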

Install / Usage

The app will only work with Ruby 1.9.x. If you try to run it in any previous version you will get a warning and the app will close.

Pipal is completely self contained and requires no gems installing so should work on any vanilla Ruby install.

Usage is fairly simple, -? will give you full instructions:

    $ ./pipal.rb -?
    pipal 1.0 Robin Wood (robin@digi.ninja) (http://digi.ninja)

    Usage: pipal [OPTION] ... FILENAME
        --help, -h: show help
        --top, -t X: show the top X results (default 10)
        --output, -o <filename>: output to file
        --external, -e <filename>: external file to compare words against

        FILENAME: The file to count

When you run the app you'll get a nice progress bar which gives you a rough idea of how long the app will take to run. If you want to stop it at any point hitting ctrl-c will stop the parsing and will dump out the stats generated so far.

The progress bar is based on a line count from the file, which it gets using the wc command. If it can't find wc it will make a guess at the number of lines based on the file size and an average line length of 8 bytes, so the progress bar may not be fully accurate but should still give you an idea.
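The fallback described above can be sketched as follows. This is not Pipal's code: `estimate_from_size` and `line_count` are hypothetical helpers, the `wc -l` invocation assumes a Unix-like shell, and rounding the estimate up is my own choice.

```ruby
require "shellwords"

# Illustrative sketch only; both helpers are hypothetical.

# Guess the number of lines from the file size, assuming an average
# line length of 8 bytes as described above.
def estimate_from_size(size_in_bytes, average_line_length = 8)
  (size_in_bytes / average_line_length.to_f).ceil
end

# Ask wc for an exact count and fall back to the estimate if that fails.
def line_count(path)
  output = `wc -l #{Shellwords.escape(path)} 2>/dev/null`
  return output.to_i if $?.success?
  estimate_from_size(File.size(path))
rescue SystemCallError
  # Raised if the command could not be started at all.
  estimate_from_size(File.size(path))
end

puts estimate_from_size(1_474_984) # rough line count for a ~1.4 MB file
```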

Download

Due to the number of people asking about submitting updates I've moved Pipal hosting to GitHub. You can now get the latest version from its GitHub repository.

If you aren't sure what you are doing with GitHub, just click the ZIP button, roughly in the middle of the left-hand side of the page, and that will give you a zip file which you can decompress and use as you would the versions below.

Download Pipal 1.1 - Bug fixed, not calculating correct percentage for Hashcat masks - Reported by Moshe Zioni

Download Pipal 1.0

Analysis

This section was supposed to just contain a few sets of sample stats but as more sites are being hacked and passwords released I've decided to run analysis on any lists I can get my hands on and post the results here. The first six in the list are the original sample sets and are based on password lists from the SkullSecurity site, for the rest, I'll give whatever information I can about where the list came from.

The following stats have been generated by other people.

Feedback/Todo

If you have a read through the source for Pipal you'll notice that it isn't very efficient at the moment. The way I built it was to try to keep each chunk of stats together as a distinct group so that if I wanted to add a new, similar, group then it was easy to just copy and paste the group. Now I've got a working app and I know roughly what I need in the different group types I've got an idea on how to rewrite the main parser to make it much more efficient and hopefully multi-threaded which should speed up the processing by a lot for large lists.

I could have made these changes before releasing version 1.0 but I figured before I do I want to get as much feedback as possible from users about the features already implemented and about any new features they would like to see so that I can bundle all these together into version 2. So, please get in touch if there is a set of stats that you'd like to see included.

One other thing I know needs fixing, Pipal doesn't handle certain character encodings very well. If anyone knows how to correctly deal with different encoding types, especially with regards to regular expressions, please let me know.

Where is the name from?

It comes from Pip Al as a way to celebrate my daughter and n00bz's son, Pippa and Alexander. It also turns out to be the name of a type of fig and a village in Nepal.

Credits

The speed increases added in version 2 were submitted by Stefan Venken, who said a small mention would be good enough, but I want to give him a big mention. Running through the LinkedIn lists would have taken many, many hours on version 1; version 2 went through 3.5 million records in about 15 minutes. Thank you.

I didn't realise it when I included them, but the "Hashcat", "Character sets" and "Character set ordering" stats are all based on an original idea by iPhelix in his tool PACK. If you are interested in generating Hashcat masks then his work is well worth a read.