ROT13 and Caesar Ciphers

THE QUICK BROWN FOX SAID I LOVE LUCY

GUR DHVPX OEBJA SBK FNVQ V YBIR YHPL

ROT13 (short for “rotate 13 places”) is an obfuscation technique familiar to nerds, geeks, and computer programmers. It’s commonly used in online forums as a means of hiding or obscuring spoilers, punchlines, hints, and (sometimes) offensive content.

I’m hesitant to call it encryption because it’s so weak. It is really just a simple substitution cipher (in fact, it's worse than that because, as described below, it's a Caesar Cipher in which the offset for each character is the same and fixed!).

To execute ROT13 you take a letter and shift it along 13 places (and if you go over the end past ‘Z’, you wrap around again to ‘A’). It jumbles the letters up sufficiently such that, at first glance, you can’t read the message, and that, these days, is its only real purpose. If you attempt to use it to store/encrypt passwords or sensitive information you deserve to have your programming license revoked!

It’s such a popular technique that some text editors and news readers have ROT13 functionality built in!

As an example of how this works, the word “HELLO” gets converted by ROT13 into “URYYB”.

Traditionally, ROT13 is only applied to the letters ‘A-Z’ and ‘a-z’ so that case, numbers, and other punctuation are preserved.
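The rule described above (shift letters by 13, wrap past 'Z', leave everything else alone) can be sketched in a few lines of Python. This is only a minimal illustration; the function name `rot13` is chosen here for convenience.

```python
import string


def rot13(text: str) -> str:
    """Rotate A-Z and a-z by 13 places; leave every other character untouched."""
    result = []
    for ch in text:
        if ch in string.ascii_uppercase:
            # Shift within the upper-case alphabet, wrapping past 'Z' back to 'A'.
            result.append(chr((ord(ch) - ord("A") + 13) % 26 + ord("A")))
        elif ch in string.ascii_lowercase:
            # Same shift for lower case, so case is preserved.
            result.append(chr((ord(ch) - ord("a") + 13) % 26 + ord("a")))
        else:
            # Digits, spaces and punctuation pass through unchanged.
            result.append(ch)
    return "".join(result)


print(rot13("HELLO"))  # URYYB
```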

When you were a kid (perhaps you still are one), you might have had fun using some kind of Spy code/de-code wheel. On these devices, the letters ‘A-Z’ are written on concentric disks which can be rotated to offset the alphabet. Two ‘secret agents’ agree on the offset beforehand. Then, to encode a message, the desired letter is selected on the inner wheel, and the coded letter is read from the outer wheel. The decoder inverts the process to read the message. To use a decoding wheel for ROT13, simply rotate the wheel 13 places.
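A decoder wheel is just a rotation by an agreed offset, so it can be sketched as a single Python function that takes the offset as a parameter. The name `caesar` and the sample message are invented for this illustration; decoding is simply the same rotation with the negative offset.

```python
def caesar(text: str, offset: int) -> str:
    """Rotate letters by `offset` places, like turning a decoder wheel."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr((ord(ch) - ord("A") + offset) % 26 + ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") + offset) % 26 + ord("a")))
        else:
            out.append(ch)  # anything that is not a letter is left alone
    return "".join(out)


secret = caesar("MEET AT NOON", 3)   # both agents agreed on an offset of 3
print(secret)                        # PHHW DW QRRQ
print(caesar(secret, -3))            # MEET AT NOON
```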

Why ROT13?

Because the English alphabet has 26 characters, ROT13 has the interesting property that it is self-inverting: two rotations of 13 places add up to a full turn of 26, so applying it twice to a piece of text returns the original. It is for this reason that ROT13 became so popular.

ROT13('HELLO') = 'URYYB'

ROT13('URYYB') = 'HELLO'

ROT13(ROT13('HELLO')) = 'HELLO'

If you are familiar with Boolean logic, this is a property similar to the XOR operator. If performed twice with the same argument, XOR returns the input to the same value.
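The XOR property is easy to check in Python. The two bit patterns below are arbitrary values picked for the illustration.

```python
value = 0b10110010
key = 0b01101100

scrambled = value ^ key     # XOR once: the value is changed
restored = scrambled ^ key  # XOR again with the same key: back to the start

print(scrambled != value)   # True
print(restored == value)    # True
```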

To encode/de-code in ROT13 you only need one command, and you can't get it the wrong way round either!
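As a small illustration of the "one command" idea, Python's standard library ships a `rot_13` text codec, and the same call both encodes and decodes. The wrapper name `rot13` below is just a convenience for this sketch.

```python
import codecs


def rot13(text: str) -> str:
    """Encode or decode with the standard library's rot_13 codec."""
    return codecs.encode(text, "rot_13")


encoded = rot13("HELLO")
decoded = rot13(encoded)  # the very same function undoes the encoding
print(encoded, decoded)   # URYYB HELLO
```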

ROT5

Similar to ROT13, which applies to letters, it’s possible to obfuscate numbers with a similar self-inverting rotation of five places.

43,252,003,274,489,856,000 ↔ 98,707,558,729,934,301,555

There is a hybrid system which encodes text using ROT13, numbers using ROT5, and leaves all other characters unaffected.

ROT5 is subtle; the encoded numbers still look just like ordinary numbers!
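Here is a rough Python sketch of ROT5 and of the hybrid scheme. The function names `rot5` and `rot13_rot5` are made up for this example, and the hybrid leans on the standard library's `rot_13` codec, which leaves digits untouched.

```python
import codecs

DIGITS = "0123456789"


def rot5(text: str) -> str:
    """Rotate each digit 0-9 by five places; leave other characters alone."""
    return "".join(
        str((int(ch) + 5) % 10) if ch in DIGITS else ch
        for ch in text
    )


def rot13_rot5(text: str) -> str:
    """Hybrid: ROT13 for letters, ROT5 for digits, everything else unchanged."""
    return rot5(codecs.encode(text, "rot_13"))


print(rot5("43,252,003,274,489,856,000"))  # 98,707,558,729,934,301,555
print(rot13_rot5("HELLO 123"))             # URYYB 678
```

Like ROT13, both functions are their own inverse, so calling them a second time restores the input.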

ROT47

Another (less-popular) variant is ROT47 which shifts the 94 characters from ASCII 33 (which is the “!” directly after the space) to ASCII 126 “~”. This obfuscates letters, numbers, and punctuation characters but still keeps the output in 7-bit ‘safe’ printable ASCII.

Call the number (425)-555-1212, and ask for "Princess"

r2== E96 ?F>36C WcadX\ddd\`a`a[ 2?5 2D< 7@C Q!C:?46DDQ

ROT47 is far from subtle; it's pretty clear that the message above has been encoded.
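ROT47 works the same way as ROT13, only over the 94 printable characters from ASCII 33 to 126. A minimal Python sketch (the function name `rot47` is chosen for the example) might look like this:

```python
def rot47(text: str) -> str:
    """Rotate the 94 printable ASCII characters (33-126) by 47 places."""
    out = []
    for ch in text:
        code = ord(ch)
        if 33 <= code <= 126:
            # Shift within the 94-character range, wrapping around at '~'.
            out.append(chr(33 + (code - 33 + 47) % 94))
        else:
            out.append(ch)  # spaces and non-ASCII characters are left alone
    return "".join(out)


message = 'Call the number (425)-555-1212, and ask for "Princess"'
print(rot47(message))
print(rot47(rot47(message)) == message)  # True: 47 is half of 94
```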

Caesar Ciphers

THE QUICK BROWN FOX SAID I LOVE LUCY

GUR DHVPX OEBJA SBK FNVQ V YBIR YHPL

Using ROT13 as anything more than an obfuscation technique has more security holes than a piece of Swiss cheese. A simple rotation cipher is called a Caesar Cipher, after Julius Caesar, who is documented as using this technique to 'protect' messages to his troops (he is documented as using ROT3, whilst his great-nephew Augustus used ROT1).

Once you know the technique used, it's fairly trivial to enumerate all possible versions and reveal the source message, even by brute force if you don't know the offset!

ROTn  Cipher
0     THEQUICKBROWNFOXSAIDILOVELUCY
+1    UIFRVJDLCSPXOGPYTBJEJMPWFMVDZ
+2    VJGSWKEMDTQYPHQZUCKFKNQXGNWEA
+3    WKHTXLFNEURZQIRAVDLGLORYHOXFB
+4    XLIUYMGOFVSARJSBWEMHMPSZIPYGC
+5    YMJVZNHPGWTBSKTCXFNINQTAJQZHD
+6    ZNKWAOIQHXUCTLUDYGOJORUBKRAIE
+7    AOLXBPJRIYVDUMVEZHPKPSVCLSBJF
+8    BPMYCQKSJZWEVNWFAIQLQTWDMTCKG
+9    CQNZDRLTKAXFWOXGBJRMRUXENUDLH
+10   DROAESMULBYGXPYHCKSNSVYFOVEMI
+11   ESPBFTNVMCZHYQZIDLTOTWZGPWFNJ
+12   FTQCGUOWNDAIZRAJEMUPUXAHQXGOK
+13   GURDHVPXOEBJASBKFNVQVYBIRYHPL
…     …

There are only 25 rotations to try by brute force!
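The brute-force attack is short enough to sketch in Python: try every rotation and read down the list until one line looks like English. The helper names `shift` and `all_rotations` are invented for this illustration, and the sketch assumes an upper-case message.

```python
def shift(text: str, offset: int) -> str:
    """Rotate upper-case letters by `offset`; leave other characters alone."""
    return "".join(
        chr((ord(ch) - ord("A") + offset) % 26 + ord("A")) if "A" <= ch <= "Z" else ch
        for ch in text
    )


def all_rotations(ciphertext: str) -> list:
    """Return (offset, candidate) for every possible rotation of the alphabet."""
    return [(n, shift(ciphertext, n)) for n in range(26)]


for n, candidate in all_rotations("GURDHVPXOEBJASBKFNVQVYBIRYHPL"):
    print(f"{n:2d} {candidate}")
```

With this ciphertext, the readable line turns up at offset 13.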

Substitution Ciphers

Closely related to Caesar Ciphers are Substitution Ciphers. These still map 1:1 between each character in the source text and cipher text, but adjacent characters in the source do not have to map to adjacent ones in the destination.
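To make the difference from a rotation concrete, here is a rough Python sketch of a general substitution cipher, where the key is a shuffled alphabet rather than an offset. All of the names (`make_key`, `substitute`, `invert`) are hypothetical, and `random` is used only to build a toy key, not a secure one.

```python
import random
import string


def make_key(seed=None) -> dict:
    """Build a 1:1 mapping from each letter to a shuffled alphabet."""
    rng = random.Random(seed)
    shuffled = list(string.ascii_uppercase)
    rng.shuffle(shuffled)
    return dict(zip(string.ascii_uppercase, shuffled))


def substitute(text: str, key: dict) -> str:
    """Replace each letter using the key; other characters pass through."""
    return "".join(key.get(ch, ch) for ch in text.upper())


def invert(key: dict) -> dict:
    """Swap the key around so it can be used for decoding."""
    return {cipher: plain for plain, cipher in key.items()}


key = make_key(seed=1)
secret = substitute("THE QUICK BROWN FOX", key)
print(secret)
print(substitute(secret, invert(key)))  # THE QUICK BROWN FOX
```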

If spaces are preserved in the encoding, it's easy to see where the word breaks are, and thus you can guess at what you think are the more popular words. As each character is always converted over to the same replacement character, common words (and commonly occurring groupings and patterns of letters) start to jump out of the page very quickly (especially if the message is quite long). Removing the white space between words adds a trivial level of complexity.

Substitution ciphers don't have to use only letters; symbols can be used too. Two of the best-known examples of this are "The Dancing Men", from the famous Sherlock Holmes story, and the "PigPen Cipher", which uses fragments of grids and dots to represent the alphabet.

Solving Substitution Ciphers

With or without spaces, substitution ciphers are fairly trivial to guess. There's a 1:1 mapping between each character so, once you know one conversion, you know all other occurrences of that same character (and you also know that this letter can't be used again).
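The "once you know one conversion, you know them all" idea can be sketched in Python by applying a partial key and marking everything still unknown. The function name `apply_partial_key` and the four guessed letters are just for illustration.

```python
import string


def apply_partial_key(ciphertext: str, known: dict) -> str:
    """Show the letters decoded so far; mark undeciphered letters with '_'."""
    return "".join(
        known.get(ch, "_") if ch.isalpha() else ch
        for ch in ciphertext
    )


# Cipher letter -> plain letter, for the guesses locked in so far.
known = {"G": "T", "U": "H", "R": "E", "B": "O"}

print(apply_partial_key("GUR DHVPX OEBJA SBK FNVQ V YBIR YHPL", known))
# THE _____ __O__ _O_ ____ _ _O_E ____

# Plain letters that have not been used yet, and so are still available.
unused = sorted(set(string.ascii_uppercase) - set(known.values()))
print("".join(unused))
```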

The solution space is so small that the solving of these is a hobby (like solving Crossword puzzles, word searches, or Sudoku). These puzzles are called Cryptograms. An example is shown on the left for the quote: "Style and structure are the essence of a book; great ideas are hogwash." - Vladimir Nabokov

The strategy for solving cryptograms is a combination of brute force, heuristics, and letter/word frequency.

Not all letters in the English language are used equally. Some, like the letters 'E', 'T' and 'A', are used very frequently. Unless the message we are trying to decode is very obscure, we'd expect the distribution of symbols in the solution to follow a similar profile. This alone could give a first pass for decoding a message; we simply match the frequency of each symbol in the secret message against the frequency we expect for each letter.

From Wikipedia, here is the ordering of letters in the English language, most frequent first (taken from a corpus of many hundreds and thousands of documents):

ETAOINSHRDLUCMFWYPVBGKQJXZ

(You might also like an article I wrote a few years ago about the game of Hangman and letter distribution).

If our secret message is a representative sample of the entire English language, we'd expect the symbol representing "E" to be the most frequently occurring in our message, followed by "T", then "A" …
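That first pass can be sketched in Python: rank the symbols in the message by how often they occur, then pair them off with the ordering above. The function name `first_pass_key` is invented here, and the result is only a starting guess.

```python
from collections import Counter

# Letter ordering quoted above, most frequent first.
ENGLISH_ORDER = "ETAOINSHRDLUCMFWYPVBGKQJXZ"


def first_pass_key(ciphertext: str) -> dict:
    """Guess cipher -> plain by pairing the ranked symbols with ENGLISH_ORDER."""
    counts = Counter(ch for ch in ciphertext.upper() if "A" <= ch <= "Z")
    ranked = [letter for letter, _ in counts.most_common()]
    return dict(zip(ranked, ENGLISH_ORDER))


# Toy input: X is the most common symbol, so it is guessed to be 'E'.
print(first_pass_key("XXXXYYYZZQ"))
```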

This is far from perfect; the chances are our message is short, and so the letters might not follow this distribution perfectly (or even have enough granularity, or even use all the letters of the alphabet). We can narrow down solutions using brute force and chi-squared tests that compare the observed letter frequencies with the expected ones, but there is so much more we can do very easily.
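For a simple rotation cipher, the chi-squared idea can be sketched as follows: score each of the 26 candidate decodings against expected English letter frequencies and keep the lowest score. The percentages below are commonly quoted approximate values and are an assumption of this sketch (they are not the figures measured in this article); the function names are hypothetical too.

```python
from collections import Counter

# Approximate English letter frequencies in percent (assumed values).
ENGLISH_PERCENT = {
    "E": 12.70, "T": 9.06, "A": 8.17, "O": 7.51, "I": 6.97, "N": 6.75,
    "S": 6.33, "H": 6.09, "R": 5.99, "D": 4.25, "L": 4.03, "C": 2.78,
    "U": 2.76, "M": 2.41, "W": 2.36, "F": 2.23, "G": 2.02, "Y": 1.97,
    "P": 1.93, "B": 1.49, "V": 0.98, "K": 0.77, "J": 0.15, "X": 0.15,
    "Q": 0.10, "Z": 0.07,
}


def shift(text: str, offset: int) -> str:
    """Rotate letters by `offset` (result is upper case); keep other characters."""
    return "".join(
        chr((ord(ch) - ord("A") + offset) % 26 + ord("A")) if "A" <= ch <= "Z" else ch
        for ch in text.upper()
    )


def chi_squared(text: str) -> float:
    """Lower scores mean the letter counts look more like English."""
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    if not letters:
        return 0.0
    counts = Counter(letters)
    norm = sum(ENGLISH_PERCENT.values())
    score = 0.0
    for letter, percent in ENGLISH_PERCENT.items():
        expected = len(letters) * percent / norm
        score += (counts[letter] - expected) ** 2 / expected
    return score


def crack_caesar(ciphertext: str) -> str:
    """Try all 26 rotations and return the one that scores best."""
    return min((shift(ciphertext, n) for n in range(26)), key=chi_squared)


plain = "STYLE AND STRUCTURE ARE THE ESSENCE OF A BOOK; GREAT IDEAS ARE HOGWASH."
print(crack_caesar(shift(plain, 7)))
```

Very short messages may not contain enough letters for the statistics to settle, so treat the result as a ranked guess rather than an answer.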

If white spaces are present, we can apply knowledge of the words in the English language. We know that there are only a limited number of two-, three- and four-letter words, and these words are common. Did you know that one third of all printed English material is made up of the 25 most common words? (The most popular 100 words make up approximately half of all printed English!)

the, of, and, a, to, in, is, you, that, it, he, was, for, on, are, as, with, his, they, I, at, be, this, have, from

Guessing which word could be which and corroborating this with what these symbols/letters would be like in the other words could be a great help.

If there is no white space to give word breaks, we can still apply statistical techniques. Certain combinations of letters often occur together. It's very common to have "TH" next to each other and "ER" and "RE". Certain letters often occur in double form, like "OO", "EE" and "LL". Conversely, have you ever seen a word containing "JJ"*?

We're taught at an early age that it's very common for "Q" to be followed by "U". Although "Q" is not a popular letter, if we do identify one, there's a very good chance the letter after it is a "U". There are similar rules with other letters.

Most (not all) words contain at least one vowel (AEIOU), and if you include "Y" as a vowel you practically include all words. The more letters you lock in, the easier it gets to solve the rest (both because you have partial words to complete, and because the unused letter pool is smaller).

*I can only think of: HAJJ, HAJJES, HAJJI, HAJJIS

Let's take a look

I was curious about letter distribution, so I downloaded a dozen books in plain text from Project Gutenberg. This site has 50,000 free books available!

If you are interested, the books I selected (randomly from the fiction collection) were:

20,000 Leagues under the Sea, Jules Verne
A Tale of Two Cities, Charles Dickens
Around the World in 80 days, Jules Verne
Little Women, Louisa May Alcott
Alice's Adventures in Wonderland, Lewis Carroll
Anna Karenina, Leo Tolstoy
The Arabian Nights, Sir Richard Burton
The Canterbury Tales, Geoffrey Chaucer
The Journey to the Center of the Earth, Jules Verne
The Wonderful Wizard of Oz, L. Frank Baum
War and Peace, Leo Tolstoy
Ben-Hur, Lew Wallace

Obviously, the more books you sample the more refined your distribution will become for generic solving. Alternatively, if you have some idea of the context of your secret message, you might elect to sample a more specific set of books to more accurately represent the sample you have.

Single Letter Frequency

Based on the books above, here is the single letter frequency distribution. The percentages show the percentage over the total of all the letters in these books (Approximately 9 million letters).

Here is the same data plotted in sorted order. The ordering is slightly different to the answer given by Wikipedia, but that's because we're using different samples.
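For anyone wanting to repeat the count on their own sample, a minimal Python sketch is shown here. The function name and the file name in the comment are hypothetical; this is not the code used to produce the figures above.

```python
from collections import Counter


def letter_percentages(text: str) -> dict:
    """Percentage of each letter A-Z over all letters, most frequent first."""
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    if not letters:
        return {}
    counts = Counter(letters)
    return {letter: 100.0 * n / len(letters) for letter, n in counts.most_common()}


# With a real book you would read the text from disk first, for example:
# text = open("book.txt", encoding="utf-8").read()   # hypothetical file name
print(letter_percentages("AAB c!"))  # {'A': 50.0, 'B': 25.0, 'C': 25.0}
```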

Bigrams

Next I looked at the frequency of all bigrams (also called couplets of letters, adjacent pairs, and sometimes called digrams).

To generate this list I ignored any white space and punctuation characters. So, in addition to all the pairs of letters that occur next to each other inside words, this list also contains pairs made from the last letter of one word and the first letter of the next. This will help if your secret message does not contain white space that allows you to determine where the word breaks are.
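The counting approach described above (strip everything except letters, then slide a window along the text) can be sketched in Python like this. The function name `ngram_counts` is invented for the illustration.

```python
from collections import Counter


def ngram_counts(text: str, n: int) -> Counter:
    """Count letter n-grams, ignoring white space and punctuation entirely."""
    letters = "".join(ch for ch in text.upper() if "A" <= ch <= "Z")
    return Counter(letters[i:i + n] for i in range(len(letters) - n + 1))


counts = ngram_counts("The end.", 2)
print(counts)  # includes 'EE', which spans the gap between the two words
```

The same function works for trigrams and longer n-grams by changing `n`.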

Here are the 50 most frequently occurring adjacent pairs of letters:

Interestingly, even though there are 26 characters, the total number of bigrams in my sample is not 26 × 26 (=676). Instead there are 643 distinct items. Not every possible pairing of characters occurs (for instance the pairing "QZ" or "ZX" never occurred in the books I sampled).

As expected, "TH" and "HE" dominate. These bigrams appear inside many words, including the most common words.

Note - This is great data to use if you have no information about either of the characters. However, if you know one of the characters in the pair, you can use this information to find a conditional probability. For instance, if you know a pair is "Q?", then there is a 99.9% chance that the unknown character is a "U".

Here are the top 200 bigrams in tabular form:
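Conditional probabilities fall straight out of the bigram counts: keep only the pairs that start with the known letter and normalise. The sketch below uses a tiny toy sentence rather than the book data, and the function names are hypothetical.

```python
from collections import Counter


def bigram_counts(text: str) -> Counter:
    """Count adjacent letter pairs, ignoring spaces and punctuation."""
    letters = "".join(ch for ch in text.upper() if "A" <= ch <= "Z")
    return Counter(letters[i:i + 2] for i in range(len(letters) - 1))


def next_letter_probabilities(counts: Counter, first: str) -> dict:
    """P(second letter | first letter), computed from the bigram counts."""
    following = {pair[1]: n for pair, n in counts.items() if pair[0] == first}
    total = sum(following.values())
    return {letter: n / total for letter, n in following.items()} if total else {}


counts = bigram_counts("QUICK QUIET QUEEN IRAQ AND")
print(next_letter_probabilities(counts, "Q"))  # {'U': 0.75, 'A': 0.25}
```

In the toy sentence the "QA" pair comes from the word break in "IRAQ AND"; with a large sample, "U" dominates far more heavily.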

# 2-gram %
#1 TH 3.322%   #2 HE 3.108%   #3 AN 1.838%   #4 ER 1.820%   #5 IN 1.801%
#6 ND 1.434%   #7 RE 1.373%   #8 ED 1.244%   #9 ES 1.220%   #10 HA 1.163%
#11 TO 1.101%   #12 EN 1.096%   #13 EA 1.071%   #14 AT 1.062%   #15 HI 1.046%
#16 ON 1.042%   #17 ST 1.011%   #18 OU 1.000%   #19 NT 0.988%   #20 NG 0.949%
#21 AS 0.909%   #22 IT 0.899%   #23 IS 0.881%   #24 ET 0.844%   #25 OR 0.832%
#26 TE 0.797%   #27 SE 0.767%   #28 OF 0.746%   #29 AR 0.735%   #30 TI 0.719%
#31 LE 0.706%   #32 SA 0.690%   #33 VE 0.637%   #34 NE 0.636%   #35 AL 0.629%
#36 ME 0.625%   #37 RO 0.608%   #38 NO 0.598%   #39 SH 0.592%   #40 OT 0.589%
#41 DE 0.588%   #42 EL 0.578%   #43 TA 0.564%   #44 LL 0.561%   #45 TT 0.560%
#46 SO 0.546%   #47 RI 0.543%   #48 DT 0.538%   #49 HO 0.536%   #50 WA 0.531%
#51 SS 0.506%   #52 RA 0.500%   #53 EW 0.496%   #54 EE 0.492%   #55 WH 0.490%
#56 SI 0.478%   #57 OM 0.477%   #58 DI 0.473%   #59 BE 0.467%   #60 DA 0.464%
#61 AD 0.461%   #62 MA 0.453%   #63 EC 0.450%   #64 EM 0.446%   #65 WI 0.441%
#66 CH 0.440%   #67 CO 0.438%   #68 CE 0.437%   #69 UT 0.436%   #70 OW 0.435%
#71 RT 0.432%   #72 LI 0.431%   #73 NA 0.415%   #74 LA 0.401%   #75 FO 0.400%
#76 RS 0.397%   #77 EI 0.389%   #78 AI 0.382%   #79 UR 0.379%   #80 LO 0.379%
#81 WE 0.378%   #82 DO 0.371%   #83 LY 0.369%   #84 IM 0.366%   #85 IL 0.365%
#86 US 0.362%   #87 GH 0.357%   #88 EH 0.353%   #89 ID 0.350%   #90 NS 0.349%
#91 FT 0.346%   #92 OO 0.344%   #93 IC 0.342%   #94 TS 0.331%   #95 UN 0.330%
#96 EF 0.322%   #97 EO 0.321%   #98 HT 0.321%   #99 YO 0.320%   #100 EP 0.316%
#101 DS 0.311%   #102 PE 0.310%   #103 NI 0.307%   #104 NC 0.303%   #105 OS 0.303%
#106 AC 0.301%   #107 LD 0.296%   #108 CA 0.286%   #109 MO 0.284%   #110 UL 0.279%
#111 OL 0.275%   #112 DH 0.273%   #113 IO 0.269%   #114 KE 0.268%   #115 TR 0.265%
#116 IE 0.264%   #117 IR 0.263%   #118 EV 0.263%   #119 AM 0.252%   #120 TW 0.250%
#121 FA 0.249%   #122 GE 0.248%   #123 AY 0.244%   #124 GA 0.243%   #125 PR 0.239%
#126 EY 0.233%   #127 WO 0.232%   #128 SW 0.229%   #129 PA 0.229%   #130 MI 0.228%
#131 RY 0.220%   #132 GO 0.218%   #133 EB 0.215%   #134 FI 0.210%   #135 YA 0.207%
#136 AV 0.206%   #137 BU 0.206%   #138 RD 0.205%   #139 YT 0.204%   #140 PO 0.203%
#141 SP 0.200%   #142 IG 0.200%   #143 OV 0.200%   #144 FE 0.198%   #145 FR 0.190%
#146 DW 0.188%   #147 SU 0.186%   #148 EG 0.183%   #149 AP 0.182%   #150 NH 0.181%
#151 DR 0.177%   #152 DB 0.177%   #153 AB 0.176%   #154 YS 0.176%   #155 OD 0.174%
#156 TU 0.173%   #157 VI 0.173%   #158 GT 0.171%   #159 TL 0.170%   #160 SC 0.165%
#161 PL 0.164%   #162 LT 0.164%   #163 IF 0.163%   #164 TY 0.159%   #165 AG 0.159%
#166 RR 0.159%   #167 YE 0.159%   #168 MY 0.156%   #169 BO 0.156%   #170 KI 0.149%
#171 BL 0.148%   #172 CT 0.147%   #173 OP 0.146%   #174 GI 0.146%   #175 DN 0.146%
#176 UG 0.145%   #177 OH 0.142%   #178 GR 0.142%   #179 RM 0.141%   #180 UP 0.140%
#181 RN 0.140%   #182 OK 0.139%   #183 IV 0.137%   #184 SM 0.136%   #185 IA 0.135%
#186 OA 0.132%   #187 RH 0.130%   #188 DD 0.126%   #189 PI 0.125%   #190 OI 0.124%
#191 AW 0.123%   #192 SL 0.123%   #193 EX 0.121%   #194 SB 0.119%   #195 MP 0.119%
#196 NW 0.119%   #197 DM 0.118%   #198 BA 0.118%   #199 AK 0.118%   #200 SF 0.116%

Trigrams

The next logical expansion is to look at trigrams (sequences of three letters).

There are 9,671 distinct trigrams in my sample (cf. 26 × 26 × 26 = 17,576 possible).

Again, seeing "THE" at the top is no surprise, and neither is "AND". These are both popular words in their own right, and sub-strings of other words. "ING" comes next as the suffix for many verbs, followed by many other triplets you can find inside common words.

Here are the top 200 trigrams in tabular form:

# 3-gram %
#1 THE 2.049%   #2 AND 1.097%   #3 ING 0.758%   #4 HER 0.615%   #5 THA 0.449%
#6 ERE 0.420%   #7 HIS 0.416%   #8 HAT 0.412%   #9 ETH 0.348%   #10 DTH 0.341%
#11 ENT 0.326%   #12 NTH 0.323%   #13 THI 0.306%   #14 FOR 0.304%   #15 OTH 0.303%
#16 ITH 0.302%   #17 WAS 0.300%   #18 HES 0.297%   #19 SHE 0.285%   #20 WIT 0.271%
#21 TTH 0.270%   #22 INT 0.249%   #23 EAN 0.246%   #24 FTH 0.243%   #25 ALL 0.241%
#26 TER 0.240%   #27 OFT 0.240%   #28 VER 0.237%   #29 NOT 0.232%   #30 EDT 0.232%
#31 YOU 0.227%   #32 EST 0.223%   #33 ERS 0.216%   #34 GHT 0.215%   #35 ION 0.212%
#36 STH 0.204%   #37 REA 0.202%   #38 HIM 0.199%   #39 ESS 0.199%   #40 SAN 0.197%
#41 NDT 0.192%   #42 HAD 0.191%   #43 EAR 0.189%   #44 RTH 0.184%   #45 RES 0.183%
#46 HEM 0.182%   #47 ONE 0.180%   #48 HEN 0.180%   #49 EDA 0.179%   #50 HEW 0.179%
#51 NCE 0.178%   #52 HOU 0.177%   #53 EVE 0.175%   #54 AST 0.174%   #55 ATT 0.172%
#56 OME 0.172%   #57 ONT 0.171%   #58 OUT 0.171%   #59 HIN 0.170%   #60 MAN 0.170%
#61 TIN 0.170%   #62 NGT 0.168%   #63 HEA 0.167%   #64 STO 0.167%   #65 HEC 0.165%
#66 ATI 0.164%   #67 THO 0.162%   #68 BUT 0.161%   #69 ESA 0.161%   #70 ATH 0.160%
#71 TAN 0.160%   #72 HAN 0.156%   #73 DIN 0.155%   #74 TIO 0.154%   #75 HED 0.153%
#76 ERA 0.152%   #77 AVE 0.152%   #78 EOF 0.152%   #79 NDS 0.151%   #80 TOT 0.151%
#81 RIN 0.151%   #82 DTO 0.150%   #83 OUL 0.150%   #84 ERT 0.148%   #85 TED 0.146%
#86 RED 0.146%   #87 NDE 0.145%   #88 OUN 0.143%   #89 IGH 0.143%   #90 RAN 0.142%
#91 WHI 0.142%   #92 ORE 0.142%   #93 OUR 0.141%   #94 EWA 0.141%   #95 ORT 0.141%
#96 ETO 0.140%   #97 ILL 0.140%   #98 DAN 0.140%   #99 NTO 0.138%   #100 EDI 0.137%
#101 ANT 0.136%   #102 WER 0.136%   #103 ULD 0.135%   #104 ATE 0.134%   #105 AID 0.134%
#106 YTH 0.133%   #107 SOF 0.133%   #108 ICH 0.132%   #109 STA 0.130%   #110 ECO 0.130%
#111 WHE 0.128%   #112 HEH 0.128%   #113 ARE 0.127%   #114 AIN 0.126%   #115 UGH 0.125%
#116 EIN 0.125%   #117 EAS 0.124%   #118 SAI 0.124%   #119 ONS 0.123%   #120 IST 0.122%
#121 OVE 0.122%   #122 EHA 0.120%   #123 OUS 0.120%   #124 NDI 0.119%   #125 SIN 0.119%
#126 ERI 0.117%   #127 CON 0.117%   #128 STE 0.116%   #129 MEN 0.116%   #130 UND 0.116%
#131 DER 0.116%   #132 NIN 0.116%   #133 SHA 0.115%   #134 NDA 0.115%   #135 NGA 0.115%
#136 EAT 0.115%   #137 HEL 0.115%   #138 RET 0.114%   #139 ASS 0.114%   #140 ISH 0.113%
#141 TOF 0.113%   #142 COM 0.113%   #143 EEN 0.112%   #144 HEP 0.112%   #145 HTH 0.112%
#146 HET 0.111%   #147 NOW 0.108%   #148 HEY 0.108%   #149 EDH 0.107%   #150 ROM 0.107%
#151 FRO 0.107%   #152 EHE 0.107%   #153 ESE 0.106%   #154 DHE 0.106%   #155 ELL 0.106%
#156 EFO 0.106%   #157 NED 0.105%   #158 GTH 0.105%   #159 LEA 0.105%   #160 HAV 0.104%
#161 KIN 0.104%   #162 WHO 0.104%   #163 COU 0.104%   #164 ART 0.103%   #165 NTE 0.102%
#166 HEI 0.102%   #167 ENE 0.101%   #168 HEF 0.101%   #169 ESO 0.101%   #170 SEL 0.100%
#171 DNO 0.100%   #172 OUG 0.100%   #173 IVE 0.099%   #174 EDO 0.099%   #175 WHA 0.099%
#176 AME 0.098%   #177 HEE 0.098%   #178 HIC 0.098%   #179 STI 0.096%   #180 INE 0.096%
#181 EAD 0.096%   #182 EME 0.096%   #183 ERO 0.096%   #184 DHI 0.095%   #185 EMA 0.095%
#186 STR 0.094%   #187 NDH 0.094%   #188 SSI 0.094%   #189 ERY 0.094%   #190 BLE 0.093%
#191 CHA 0.093%   #192 OOK 0.093%   #193 INA 0.092%   #194 SHO 0.092%   #195 TOH 0.091%
#196 NAN 0.091%   #197 IDE 0.091%   #198 OSE 0.090%   #199 DRE 0.089%   #200 IND 0.089%

Quadgrams (or is it Tetragrams?)

After three comes four. I'm not sure if it's correct to call them quadgrams or tetragrams (Latin or Greek?), so instead we'll just call them 4-grams.

There were 87,526 distinct 4-grams in my book samples (cf. 26 × 26 × 26 × 26 = 456,976 possible; less than 20% of the theoretical possible combinations).
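The "distinct versus possible" comparison is a one-liner once the n-grams are collected into a set. Here is a small Python sketch; the function name `ngram_coverage` is hypothetical, and the last line simply re-does the division for the 4-gram figure quoted above.

```python
def ngram_coverage(text: str, n: int) -> tuple:
    """Return (distinct n-grams seen, n-grams possible, fraction seen)."""
    letters = "".join(ch for ch in text.upper() if "A" <= ch <= "Z")
    distinct = {letters[i:i + n] for i in range(len(letters) - n + 1)}
    possible = 26 ** n
    return len(distinct), possible, len(distinct) / possible


print(ngram_coverage("ABABAB", 2))   # only AB and BA occur, out of 676

# Fraction of all possible 4-grams covered by the count quoted in the text.
print(f"{87_526 / 26 ** 4:.1%}")
```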

Here are the top 200:

# 4-gram %
#1 THER 0.325%   #2 THAT 0.302%   #3 WITH 0.256%   #4 DTHE 0.253%   #5 NTHE 0.250%
#6 OTHE 0.219%   #7 OFTH 0.217%   #8 FTHE 0.206%   #9 THES 0.203%   #10 TTHE 0.192%
#11 HERE 0.189%   #12 EAND 0.183%   #13 ETHE 0.177%   #14 ANDT 0.164%   #15 THEM 0.162%
#16 SAND 0.161%   #17 TION 0.151%   #18 INGT 0.144%   #19 NDTH 0.143%   #20 THIS 0.139%
#21 OULD 0.134%   #22 INTH 0.132%   #23 THEC 0.132%   #24 STHE 0.130%   #25 TOTH 0.129%
#26 ANDS 0.129%   #27 EDTH 0.129%   #28 IGHT 0.122%   #29 THIN 0.118%   #30 SAID 0.118%
#31 EVER 0.114%   #32 ATTH 0.111%   #33 RTHE 0.110%   #34 THOU 0.110%   #35 WERE 0.109%
#36 THEY 0.106%   #37 HING 0.106%   #38 DAND 0.105%   #39 NGTH 0.103%   #40 TAND 0.103%
#41 THEP 0.101%   #42 INGA 0.099%   #43 OUGH 0.095%   #44 EDTO 0.095%   #45 THEW 0.094%
#46 THEN 0.094%   #47 EWAS 0.094%   #48 ONTH 0.093%   #49 HICH 0.092%   #50 FROM 0.092%
#51 WHIC 0.092%   #52 HAVE 0.090%   #53 WHAT 0.090%   #54 ANDA 0.090%   #55 EFOR 0.086%
#56 THEF 0.084%   #57 HTHE 0.084%   #58 UGHT 0.083%   #59 TING 0.083%   #60 KING 0.082%
#61 ATHE 0.081%   #62 ANDW 0.081%   #63 ERTH 0.081%   #64 THEI 0.080%   #65 ANDH 0.080%
#66 HEWA 0.078%   #67 DNOT 0.078%   #68 RAND 0.077%   #69 VERY 0.077%   #70 THEE 0.075%
#71 THET 0.075%   #72 FORT 0.075%   #73 ANDI 0.075%   #74 GTHE 0.075%   #75 THED 0.075%
#76 HEHA 0.074%   #77 THEL 0.074%   #78 YTHE 0.073%   #79 HAND 0.072%   #80 HESA 0.071%
#81 HECO 0.071%   #82 YAND 0.071%   #83 EHAD 0.071%   #84 ORTH 0.071%   #85 INGH 0.070%
#86 SELF 0.070%   #87 WHEN 0.069%   #88 ERED 0.069%   #89 THEB 0.069%   #90 THEH 0.067%
#91 MENT 0.067%   #92 NAND 0.067%   #93 EDAN 0.066%   #94 OUND 0.066%   #95 SOME 0.065%
#96 NDER 0.065%   #97 NING 0.065%   #98 HERS 0.064%   #99 HATH 0.063%   #100 TWAS 0.063%
#101 ATIO 0.063%   #102 RING 0.063%   #103 INGS 0.062%   #104 INGO 0.061%   #105 OVER 0.061%
#106 HATT 0.060%   #107 ETHA 0.059%   #108 WOUL 0.059%   #109 ENTH 0.059%   #110 THAN 0.058%
#111 ERAN 0.058%   #112 EDHI 0.058%   #113 LOOK 0.058%   #114 THTH 0.056%   #115 DWIT 0.056%
#116 HATI 0.056%   #117 HEAR 0.056%   #118 ITHA 0.055%   #119 EOFT 0.055%   #120 THEA 0.055%
#121 THEG 0.055%   #122 NGTO 0.055%   #123 INCE 0.054%   #124 ASTH 0.054%   #125 HEIR 0.054%
#126 WILL 0.054%   #127 BEEN 0.053%   #128 FORE 0.053%   #129 MTHE 0.053%   #130 INGI 0.053%
#131 NOTH 0.052%   #132 LING 0.052%   #133 MAND 0.052%   #134 INTO 0.051%   #135 STAN 0.051%
#136 THEO 0.051%   #137 LLTH 0.051%   #138 RETH 0.051%   #139 EDIN 0.051%   #140 HESE 0.051%
#141 HERA 0.051%   #142 DING 0.050%   #143 HOUG 0.050%   #144 ETHI 0.050%   #145 ANDR 0.050%
#146 TOHI 0.049%   #147 DTHA 0.049%   #148 TTER 0.049%   #149 ANCE 0.049%   #150 KNOW 0.049%
#151 TIME 0.049%   #152 REAT 0.048%   #153 SWER 0.048%   #154 COUL 0.048%   #155 UNDE 0.048%
#156 LIKE 0.048%   #157 HEMA 0.047%   #158 SOFT 0.047%   #159 YOUR 0.047%   #160 ITHT 0.047%
#161 PRIN 0.047%   #162 NESS 0.047%   #163 EREA 0.047%   #164 LTHE 0.047%   #165 RINC 0.046%
#166 NHIS 0.046%   #167 WASA 0.046%   #168 DHIS 0.046%   #169 RESS 0.046%   #170 IONS 0.045%
#171 DHER 0.045%   #172 LAND 0.045%   #173 NDIN 0.045%   #174 DHIM 0.044%   #175 MORE 0.044%
#176 ERIN 0.044%   #177 ABLE 0.044%   #178 ESAI 0.044%   #179 ERES 0.044%   #180 ENCE 0.044%
#181 ESAN 0.044%   #182 OUNT 0.043%   #183 TTLE 0.043%   #184 HATS 0.043%   #185 COME 0.043%
#186 HEST 0.043%   #187 LONG 0.042%   #188 PRES 0.042%   #189 UTTH 0.042%   #190 EYOU 0.042%
#191 WHER 0.042%   #192 TOBE 0.042%   #193 ABOU 0.041%   #194 METH 0.041%   #195 EWIT 0.041%
#196 HERO 0.041%   #197 HIMS 0.041%   #198 NDRE 0.041%   #199 NDHE 0.041%   #200 OMTH 0.041%

Things get a little more complicated as we move to four characters. Top of the list is "THER", some of which could be from the word "THE" followed by a word starting with "R", but most of the frequency of "THER" comes from it being part of words like "THERE" and "OTHER" (and all those other words that contain this sub-string).

Looking through the list it is easy to see distinct, popular four-character words in their own right, as well as sub-strings.

5-grams

There were 434,396 distinct 5-grams (cf. 26 × 26 × 26 × 26 × 26 = 11,881,376 possible; less than 4% of the theoretical possible combinations).

Here are the top 200:

# 5-gram %
#1 OFTHE 0.190%   #2 ANDTH 0.122%   #3 TOTHE 0.116%   #4 INTHE 0.112%   #5 THERE 0.108%
#6 NDTHE 0.106%   #7 EDTHE 0.097%   #8 WHICH 0.092%   #9 ATTHE 0.090%   #10 OTHER 0.090%
#11 INGTH 0.085%   #12 THING 0.081%   #13 ONTHE 0.075%   #14 NGTHE 0.074%   #15 OUGHT 0.064%
#16 ATION 0.063%   #17 WOULD 0.059%   #18 EDAND 0.056%   #19 THECO 0.055%   #20 DWITH 0.055%
#21 THEIR 0.053%   #22 HEHAD 0.053%   #23 INGTO 0.052%   #24 EOFTH 0.052%   #25 HEWAS 0.051%
#26 FORTH 0.051%   #27 ERTHE 0.051%   #28 THOUG 0.050%   #29 HOUGH 0.049%   #30 HATTH 0.049%
#31 COULD 0.048%   #32 THATT 0.048%   #33 EVERY 0.048%   #34 ERAND 0.047%   #35 THTHE 0.047%
#36 WITHA 0.046%   #37 DTHAT 0.046%   #38 WITHT 0.046%   #39 THESE 0.046%   #40 ETHAT 0.043%
#41 PRINC 0.043%   #42 ITHTH 0.043%   #43 THATH 0.043%   #44 ORTHE 0.043%   #45 ESAID 0.042%
#46 THEMA 0.042%   #47 THATI 0.042%   #48 ENTHE 0.042%   #49 RINCE 0.042%   #50 EFORE 0.041%
#51 ABOUT 0.040%   #52 ESAND 0.040%   #53 ATHER 0.040%   #54 SOFTH 0.039%   #55 ITWAS 0.039%
#56 ASTHE 0.039%   #57 AFTER 0.038%   #58 SWERE 0.038%   #59 UNDER 0.038%   #60 EWITH 0.038%
#61 WHERE 0.037%   #62 WITHH 0.037%   #63 FROMT 0.037%   #64 ALLTH 0.036%   #65 ETHER 0.036%
#66 LLTHE 0.036%   #67 INGAN 0.036%   #68 ANDHE 0.036%   #69 OMTHE 0.035%   #70 ROMTH 0.035%
#71 THEMO 0.035%   #72 HATHE 0.035%   #73 EDWIT 0.035%   #74 AGAIN 0.034%   #75 NEVER 0.034%
#76 INGHI 0.034%   #77 CEAND 0.034%   #78 BEFOR 0.034%   #79 THEPR 0.034%   #80 TTHAT 0.033%
#81 NGAND 0.033%   #82 THATS 0.033%   #83 OULDN 0.033%   #84 RETHE 0.033%   #85 TOFTH 0.033%
#86 SAIDT 0.032%   #87 THERS 0.032%   #88 COUNT 0.032%   #89 TIONS 0.032%   #90 STAND 0.032%
#91 EDHIM 0.032%   #92 UTTHE 0.032%   #93 HIMSE 0.032%   #94 OFHIS 0.032%   #95 MSELF 0.031%
#96 NTOTH 0.031%   #97 THEWA 0.031%   #98 STHAT 0.031%   #99 ITTLE 0.031%   #100 IMSEL 0.031%
#101 NTHES 0.031%   #102 LITTL 0.031%   #103 INGIN 0.031%   #104 HECOU 0.030%   #105 ROUGH 0.030%
#106 THESA 0.030%   #107 ANDIN 0.030%   #108 BYTHE 0.030%   #109 RIGHT 0.029%   #110 ANDRE 0.029%
#111 HESAI 0.029%   #112 THECA 0.029%   #113 THEST 0.029%   #114 THERO 0.029%   #115 LIGHT 0.029%
#116 TOHIM 0.028%   #117 CTION 0.028%   #118 THERA 0.028%   #119 WITHO 0.028%   #120 EANDT 0.028%
#121 IONOF 0.028%   #122 IDNOT 0.028%   #123 HADBE 0.027%   #124 HEREW 0.027%   #125 INGOF 0.027%
#126 ITHOU 0.027%   #127 SANDT 0.027%   #128 ETHIN 0.027%   #129 ULDNO 0.027%   #130 GREAT 0.027%
#131 ROUND 0.027%   #132 DIDNO 0.027%   #133 HOULD 0.027%   #134 SHOUL 0.027%   #135 EDHER 0.026%
#136 LDNOT 0.026%   #137 OTHIN 0.026%   #138 THOUT 0.026%   #139 THEWO 0.026%   #140 HEREA 0.026%
#141 EDHIS 0.025%   #142 DTHES 0.025%   #143 INHIS 0.025%   #144 INGHE 0.025%   #145 SWITH 0.025%
#146 DTHEM 0.025%   #147 NTHAT 0.025%   #148 ANDWH 0.025%   #149 YTHIN 0.024%   #150 THELA 0.024%
#151 HENTH 0.024%   #152 THATW 0.024%   #153 DBEEN 0.024%   #154 NOTHE 0.024%   #155 SOMET 0.024%
#156 FIRST 0.024%   #157 TWITH 0.024%   #158 NSWER 0.024%   #159 ADBEE 0.023%   #160 THEFI 0.023%
#161 ITHHI 0.023%   #162 PRESS 0.023%   #163 FTHES 0.023%   #164 WASTH 0.023%   #165 HERAN 0.023%
#166 STILL 0.023%   #167 AKING 0.023%   #168 LOOKE 0.023%   #169 BUTTH 0.023%   #170 ASKED 0.023%
#171 TIONO 0.023%   #172 OOKED 0.023%   #173 SHEHA 0.023%   #174 TOHER 0.022%   #175 ANDSO 0.022%
#176 ESTHE 0.022%   #177 URNED 0.022%   #178 THEHA 0.022%   #179 ANDSA 0.022%   #180 OMETH 0.022%
#181 ERING 0.022%   #182 IERRE 0.022%   #183 PIERR 0.022%   #184 THINK 0.022%   #185 DTHER 0.022%
#186 INTOT 0.022%   #187 ANSWE 0.022%   #188 SHEWA 0.022%   #189 PLACE 0.022%   #190 NOTHI 0.022%
#191 HIMAN 0.022%   #192 NCEAN 0.021%   #193 TTING 0.021%   #194 ONAND 0.021%   #195 THEYW 0.021%
#196 TTHES 0.021%   #197 WHILE 0.021%   #198 TURNE 0.021%   #199 SSION 0.021%   #200 WASNO 0.021%

Things get even more interesting here. "OFTHE", "ANDTH", "TOTHE" and "INTHE" at the top are all obvious concatenations of two words. "THERE" comes next.

As the n-grams become longer it's possible to start seeing more distinct (and specific) words. As I was testing the code out with smaller books, pausing to view the intermediate results, it was possible to identify sub-strings of title characters' names and other specific nouns in the books.

This shows us that, unless we're aiming to decode a message with a defined dictionary of possible words, going too deep into n-gram analysis will start to hurt us. Up to about 4-grams, we're mapping the characteristics of the English language. Above 4-grams, it's looking like we are starting to map more to words than distributions of groupings of letters.

6-grams

The error of going too deep into n-gram analysis is confirmed by looking at this list. It doesn't take long to see specific words that obviously belong to one specific book.

There were 1,239,584 distinct 6-grams (cf. 26 × 26 × 26 × 26 × 26 × 26 = 308,915,776 possible; roughly 0.4% of the theoretical possible combinations).

Here are the top 200:

# 6-gram %
#1 ANDTHE 0.090%   #2 INGTHE 0.063%   #3 THOUGH 0.049%   #4 EOFTHE 0.044%   #5 WITHTH 0.042%
#6 THATTH 0.042%   #7 PRINCE 0.042%   #8 HATTHE 0.039%   #9 ITHTHE 0.037%   #10 FROMTH 0.035%
#11 FORTHE 0.035%   #12 SOFTHE 0.035%   #13 EDWITH 0.034%   #14 HOUGHT 0.034%   #15 BEFORE 0.033%
#16 ROMTHE 0.032%   #17 HIMSEL 0.031%   #18 IMSELF 0.031%   #19 LITTLE 0.031%   #20 INGAND 0.029%
#21 TOFTHE 0.029%   #22 NTOTHE 0.029%   #23 HESAID 0.028%   #24 THATHE 0.028%   #25 OULDNO 0.027%
#26 SHOULD 0.027%   #27 DIDNOT 0.026%   #28 ULDNOT 0.026%   #29 WITHOU 0.025%   #30 ALLTHE 0.024%
#31 ITHOUT 0.024%   #32 THEREW 0.024%   #33 YTHING 0.023%   #34 ADBEEN 0.023%   #35 HADBEE 0.023%
#36 ETHING 0.023%   #37 WITHHI 0.023%   #38 LOOKED 0.022%   #39 PIERRE 0.022%   #40 OTHING 0.022%
#41 HENTHE 0.022%   #42 OFTHES 0.022%   #43 ANSWER 0.022%   #44 NCEAND 0.021%   #45 COULDN 0.021%
#46 EANDTH 0.021%   #47 INTOTH 0.021%   #48 NOTHIN 0.021%   #49 TURNED 0.021%   #50 HIMAND 0.020%
#51 SANDTH 0.020%   #52 ROUGHT 0.020%   #53 INGHIS 0.020%   #54 TIONOF 0.020%   #55 NOTHER 0.020%
#56 SHEHAD 0.020%   #57 SAIDTH 0.019%   #58 OTHERS 0.019%   #59 SHEWAS 0.019%   #60 PEOPLE 0.019%
#61 ECOULD 0.019%   #62 NDTHAT 0.019%   #63 THECOU 0.019%   #64 EWOULD 0.019%   #65 OFTHEM 0.019%
#66 HERAND 0.019%   #67 EDTHAT 0.018%   #68 DTOTHE 0.018%   #69 OULDBE 0.018%   #70 ANOTHE 0.018%
#71 THESAM 0.018%   #72 OUGHTH 0.017%   #73 METHIN 0.017%   #74 WHICHH 0.017%   #75 WASTHE 0.017%
#76 RINCES 0.017%   #77 HESAME 0.017%   #78 OMETHI 0.017%   #79 SOMETH 0.017%   #80 WHENTH 0.017%
#81 THROUG 0.017%   #82 HROUGH 0.017%   #83 FATHER 0.017%   #84 AIDTHE 0.017%   #85 SEEMED 0.017%
#86 MOTHER 0.017%   #87 DINTHE 0.017%   #88 EINTHE 0.017%   #89 ANDTHA 0.017%   #90 UNDERS 0.017%
#91 PRESEN 0.017%   #92 ECAUSE 0.016%   #93 THEPRI 0.016%   #94 OFTHEC 0.016%   #95 NDERST 0.016%
#96 INCESS 0.016%   #97 BUTTHE 0.016%   #98 INGHER 0.016%   #99 HEREWA 0.016%   #100 DTHERE 0.016%
#101 EREWAS 0.016%   #102 LOOKIN 0.016%   #103 OOKING 0.016%   #104 EOTHER 0.015%   #105 OULDHA 0.015%
#106 THEFIR 0.015%   #107 THINGS 0.015%   #108 THEWOR 0.015%   #109 MOMENT 0.015%   #110 THEYWE 0.015%
#111 FRIEND 0.015%   #112 NWHICH 0.015%   #113 RETURN 0.015%   #114 THEMAN 0.015%   #115 FRENCH 0.015%
#116 ITHHIS 0.015%   #117 NOFTHE 0.015%   #118 ATIONS 0.015%   #119 NSWERE 0.015%   #120 ETHOUG 0.015%
#121 EANDRE 0.014%   #122 INTHES 0.014%   #123 EPRINC 0.014%   #124 UGHTHE 0.014%   #125 ALWAYS 0.014%
#126 NGWITH 0.014%   #127 LDHAVE 0.014%   #128 ETHERE 0.014%   #129 ULDHAV 0.014%   #130 WASNOT 0.014%
#131 EDTOTH 0.014%   #132 TTHERE 0.014%   #133 ERETHE 0.014%   #134 NGTHAT 0.014%   #135 ESSION 0.014%
#136 HECOUL 0.014%   #137 ECOUNT 0.014%   #138 ATASHA 0.014%   #139 ABOUTT 0.014%   #140 SINTHE 0.014%
#141 NATASH 0.014%   #142 ROFTHE 0.014%   #143 VERTHE 0.014%   #144 INGHIM 0.014%   #145 EVERYT 0.014%
#146 APPEAR 0.013%   #147 ETOTHE 0.013%   #148 EYWERE 0.013%   #149 ROTHER 0.013%   #150 SWERED 0.013%
#151 WHICHT 0.013%   #152 UPONTH 0.013%   #153 RESENT 0.013%   #154 HEYWER 0.013%   #155 TINTHE 0.013%
#156 EFIRST 0.013%   #157 INGTHA 0.013%   #158 HATSHE 0.013%   #159 HEWOUL 0.013%   #160 POSSIB 0.013%
#161 BECAUS 0.013%   #162 INGWIT 0.013%   #163 ANDREW 0.013%   #164 EDTHEM 0.013%   #165 RINCEA 0.013%
#166 OUTTHE 0.013%   #167 IONAND 0.013%   #168 ESTION 0.013%   #169 NDWITH 0.013%   #170 HAVING 0.013%
#171 PRESSI 0.013%   #172 NDTHEN 0.013%   #173 TEDTHE 0.013%   #174 THEREA 0.013%   #175 INCEAN 0.013%
#176 TOTHES 0.013%   #177 ERSELF 0.013%   #178 ECTION 0.013%   #179 THERES 0.013%   #180 ANDHIS 0.013%
#181 HERSEL 0.013%   #182 OFTHEP 0.013%   #183 THEREI 0.013%   #184 SHESAI 0.013%   #185 PONTHE 0.013%
#186 CEANDR 0.013%   #187 RSTAND 0.013%   #188 VERYTH 0.013%   #189 QUESTI 0.013%   #190 WHENHE 0.013%
#191 THECON 0.013%   #192 HEOTHE 0.012%   #193 NEOFTH 0.012%   #194 THEOTH 0.012%   #195 UESTIO 0.012%
#196 INGFOR 0.012%   #197 EELING 0.012%   #198 HECOUN 0.012%   #199 EXPRES 0.012%   #200 STHERE 0.012%

Final return to ROT13

As I was messing with ROT13, I wondered if it was possible to apply ROT13 to a word and make an entirely different (valid) word. A few lines of SQL later revealed that quite a few exist. The longest pair found in my dictionary file was NOWHERE ↔ ABJURER.

NA↔AN NAAN↔ANNA NAG↔ANT NAN↔ANA NAVY↔ANIL NE↔AR NIB↔AVO NO↔AB NOB↔ABO NOON↔ABBA NOWHERE↔ABJURER NU↔AH NUN↔AHA OHO↔BUB ON↔BA ONE↔BAR ONES↔BARF ONYX↔BALK OR↔BE ORA↔BEN ORRA↔BEEN ORT↔BEG OVA↔BIN PENNY↔CRAAL PENT↔CRAG PERRY↔CREEL PRY↔CEL PUNG↔CHAT PURS↔CHEF RAIL↔ENVY RAT↔ENG RE↔ER REAR↔ERNE REE↔ERR REEF↔ERRS REF↔ERS RET↔ERG ROOF↔EBBS SEL↔FRY SENT↔FRAG SERER↔FRERE SHA↔FUN SHE↔FUR SYNC↔FLAP TANG↔GNAT TERRA↔GREEN THY↔GUL TRY↔GEL TUNG↔GHAT UN↔HA UREA↔HERN VEX↔IRK WHA↔JUN WHEN↔JURA
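The search above was done in SQL, but the same idea can be sketched in Python. This is not the original query, just an equivalent under the assumption that the dictionary is available as a list of words; the function name and the five-word sample list are invented for the example.

```python
import codecs


def rot13_word_pairs(words) -> list:
    """Find pairs of dictionary words that are each other's ROT13."""
    vocabulary = {word.upper() for word in words}
    pairs = []
    for word in sorted(vocabulary):
        partner = codecs.encode(word, "rot_13")
        # `word < partner` reports each pair once, in alphabetical order.
        if partner in vocabulary and word < partner:
            pairs.append((word, partner))
    return pairs


sample = ["nowhere", "abjurer", "hello", "green", "terra"]
print(rot13_word_pairs(sample))  # [('ABJURER', 'NOWHERE'), ('GREEN', 'TERRA')]
```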

Encryption Humour

This web page is encrypted with ROT26.
