Domain Name Analysis

What's in a (domain) name? That which we call a server by any other IP address would smell as sweet. Today it seems that just about every business, organization and group has a web presence. We access these web sites not through the IP addresses that the computers use, but through friendly, human-memorable monikers. Services translate these domain names into IP addresses for the computers. Today, Google.com, Microsoft.com and Amazon.com are household names. These are examples of TLDs (Top Level Domains). Each TLD is distinct, and made up of the letters A-Z (case insensitive), the digits 0-9, the simple dash '-', and the period '.' (though domain names cannot start or end with the latter two).

There are only a finite number of ways the above characters can be combined to make domain names (and even fewer that form memorable words or phrases). To assist in this matter (and also to give some indication of the kind of service that each site performs), suffixes are added to each TLD name to categorise them. The most popular suffix, by far, is .com (originally intended to signify a commercial venture). Other common domain suffixes are .net, .gov, .org and .edu, as well as a whole swathe of country-specific suffixes such as .co.uk and .co.jp.

Whilst the balance may shift in the future with the opening up of new generic top-level domains (gTLDs) and their new suffixes, the .com domain still dominates the domain name universe.

.COM and .NET

At the time of writing this blog, according to Verisign there are 102,815,927 registered names in the .com universe, and 14,967,256 in the .net name space. Just because a name is registered does not mean that there is an active website behind it; it just means that somebody has taken the option of reserving that name for current or future use.

The dominance of the .com domain can be seen from the Venn diagram on the left, which compares the overlap of the two name spaces. The diagram shows, for each distinct domain name, whether there is a registration in the .com domain, in the .net domain, or both. NOTE: just because a name is registered in both domains does not mean that both registrations belong to the same company! Sometimes they do, and companies often aggressively purchase matching names in the other domains to prevent confusion and lock out others. Just as often, however, names with different suffixes belong to different organisations; sometimes legitimately, sometimes obtained with dubious intent to misrepresent.

Thankfully, these misleading registrations are not as damaging as they once were, because people are relying more and more on search engines to find their destination. These days, users often type what they are looking for directly into a search box (or even the browser address bar) rather than typing the full http: address of their desired location. Legitimate destinations tend to have superior page ranks and thus appear higher up in the search results.

From the diagram, we can see the ascendancy of the .com name space. Of the 102.8 million names in .com space, only 12.54% are also registered in the .net namespace. The vast majority of owners of .com names (plus the doppelgangers) don't think it worth the effort. Contrast this with the view from the other side. Of the 15 million registered names in .net space, 86.13% are also registered in the .com namespace.
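The two percentages above are simply the same overlap divided by two different totals. As a minimal sketch (the function name and the toy names are my own assumptions, not the queries actually used), the calculation can be expressed with Python sets:

```python
def overlap_shares(com_names, net_names):
    """Return (share of .com names also in .net, share of .net names also in .com)."""
    com, net = set(com_names), set(net_names)
    both = com & net  # names registered under both suffixes
    return len(both) / len(com), len(both) / len(net)


# Hypothetical toy data: names with the suffix already stripped.
com = ["example", "widgets", "rat-bikes", "datagenetics"]
net = ["example", "mesh"]

com_share, net_share = overlap_shares(com, net)
print(f"{com_share:.2%} of .com names also appear in .net")
print(f"{net_share:.2%} of .net names also appear in .com")
```

As a sanity check on the real figures, 12.54% of 102.8 million and 86.13% of 15.0 million both come to roughly 12.9 million names, as they should if they describe the same overlap.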

Because '.com' is now part of our common vernacular, internet users are intimately familiar with it. When coming up with a new business name, sure, it's probably possible to find a suitable name in .net space, but these days, why bother? Unless it's unique, you'd not be able to find the same name free in the .com space, which is where everyone would probably look first. Better to simply research and brainstorm further, find a name you can acquire or repurchase in the .com arena, and bypass all the confusion and customer education.

Distribution of lengths

Having access to a database of domain names, I decided to run some more analysis on the .com and .net databases.

Above is the distribution of lengths of all the registered .com domains.

The modal length (the most common) is 12 characters, and the average length is 13.539 characters. The median length is also 12 characters (there are as many domains shorter than 12 characters as there are domains longer than 12 characters).

Below is a similar chart for all the registered .net domains.

The average length of the .net domains is slightly shorter.

                  .COM      .NET
Average Length    13.539    12.394
Median Length     12        11
Modal Length      12        10
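For anyone wanting to reproduce this kind of summary on their own list of names, here is a minimal sketch using Python's standard `statistics` module. It assumes the lengths are measured on the name without its suffix (which is consistent with the 63-character examples further down); `length_stats` and the toy names are hypothetical.

```python
from statistics import mean, median, multimode


def length_stats(names):
    """Summarise the lengths of a collection of domain names (suffix excluded)."""
    lengths = [len(name) for name in names]
    return {
        "mean": mean(lengths),
        "median": median(lengths),
        "modes": multimode(lengths),  # a list, since several lengths can tie
    }


# Hypothetical toy data.
stats = length_stats(["ab", "abc", "xyz", "abcdef"])
print(stats)
```

`multimode` returns every length that ties for most common, which avoids an arbitrary choice when two lengths are equally popular.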

There are some pretty long names out there. Selected at random, here are a few obscure names that are 63 characters long:

ALTERNATIVE-RENEWABLE-ENERGY-STOCKS-INVESTMENT-WIND-SOLAR-POWER.COM

AXIALCENTRIFUGALRADIALINDUSTRIALMULTIVANEFANSVENTILATORSBLOWERS.COM

ASOCIACIONDEGUIASINTERPRETESDELPATRONATODELAALHAMBRAYGENERALIFE.COM

CARPET-CLEANING-ORANGE-COUNTYRUGSMATTRESSSTEAMUPHOLSTERYODORPET.COM

I'm sure the owners of these domains have fun spelling them out over the phone to potential clients :)

Saturation

There's only a finite number of ways that the permissible characters can be combined to make domain names. Suffice it to say that all possible combinations of two characters have been registered in both the .com and .net domain space. Moving to three characters, again we find the address space saturated, with more than 96% of all combinations taken. (Generally, short domain names are better as they are easier to remember, much easier to type, and often represent acronyms of the holding organisation.)

Moving to four character domain lengths, things are still very congested with 43% of all combinations registered in the .com domain space. Things are slightly slacker in the .net universe with 17% of all possible four character names taken.

By the time we move to five-character lengths, combinatorics becomes more of our friend, as each extra character increases the number of possible domain names by a factor of 38 over the previous length. At five characters, though, 3.4% of all possible .com domain names have still been taken. (By the time we get to five characters, it's already possible to make some pretty ugly domain names, such as ones with repeated characters and dashes.) Saturation has dropped to less than 1% for five-character names in the .net domain. By the time we've reached six characters or more, congestion and saturation aren't really the issue. It's more an issue of finding a name that is memorable, meaningful and pretty (not a random mess of characters, or awkward repeated characters).
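To get a feel for how quickly the name space grows, here is a rough sketch that takes the factor of 38 quoted above at face value. This is a simplification: the real rules are stricter (a name cannot start or end with a dash or a period), so the true number of valid names at each length is somewhat lower. The function names are my own.

```python
def possible_names(length, alphabet_size=38):
    """Upper bound on the number of names of a given length (start/end rules ignored)."""
    return alphabet_size ** length


def saturation(registered, length, alphabet_size=38):
    """Fraction of the possible names of this length that are registered."""
    return registered / possible_names(length, alphabet_size)


for n in range(2, 7):
    print(f"{n} characters: {possible_names(n):,} possible names")
```

Going from four to five characters takes the upper bound from about 2.1 million to about 79 million, which is why saturation falls away so quickly.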

Rats live on no evil star

Of all the domain names in .com space, there are currently 26,169 which are palindromes (they read the same forwards and backwards). There are just 9,403 in .net space. Here is a random selection from the .com namespace:

AZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ.COM

A-------------------------------------------------------------A.COM

ZYXWVUTSRQPONMLKJIHGFEDCBABCDEFGHIJKLMNOPQRSTUVWXYZ.COM

SATOR-AREPO-TENET-OPERA-ROTAS.COM

LOLOLOLOLOLOLOLOLOLOLOLOLOLOL.COM

REFLECTION--NOITCELFER.COM

WONTLOVERSREVOLTNOW.COM

SLATEMETALS.COM
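Testing for a palindrome is a one-liner once the suffix is stripped off. A minimal sketch (the helper names are my own, and I'm assuming the suffix is ignored when making the comparison):

```python
def label_of(domain):
    """Drop the final suffix, e.g. 'SLATEMETALS.COM' -> 'SLATEMETALS'."""
    return domain.rsplit(".", 1)[0]


def is_palindrome(domain):
    """True if the name (without suffix) reads the same in both directions."""
    label = label_of(domain).upper()
    return label == label[::-1]


for name in ["SLATEMETALS.COM", "REFLECTION--NOITCELFER.COM", "DATAGENETICS.COM"]:
    print(name, is_palindrome(name))
```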

Numbers and concatenation

Just over 9% of .com domain names contain at least one numeric character. (In .net space it's over 10%).

Since just about every possible single dictionary word and name has already been taken, many domain names are constructed by concatenating words together. Because spaces are not valid in domain names, registrants have a choice. They can simply weld the words together, such as in my domain name DataGenetics.com (here usefully exploiting the fact that domain names are CasE InSEnSiTIve, so that upper case can be used to mark the start of new words). Alternatively, people may elect to use a dash (-) to break up the words, such as in Rat-Bikes.com. In my personal opinion, any name using a dash is sub-optimal. Users often might not remember if you have a dash or not (could they end up at your competitor if they forget it?). Also, some users are confused by how to use a keyboard to enter a dash (is it an underscore?). It's hard to explain over the phone; do you say "Hyphen", or "Dash", or "Line", or "Minus" when spelling your URL? Finally, with more people using mobile devices, typing the character can sometimes require double shifts.

Having said that, over 12 million .com domains have been registered with dashes (representing 11.8% of the total). Many of these, I suspect, are defensively purchased; in researching this posting, many of the URLs I entered with dashes simply redirected to a more appropriate domain. In .net space the percentage is higher, at 13.1%.
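Percentages like these come from counting the names that satisfy some test and dividing by the total. A minimal sketch, with hypothetical helper names and toy data:

```python
DIGITS = set("0123456789")


def has_digit(name):
    """True if the name contains at least one numeric character."""
    return any(ch in DIGITS for ch in name)


def has_dash(name):
    """True if the name contains at least one dash."""
    return "-" in name


def share_matching(names, predicate):
    """Fraction of names for which predicate(name) is true."""
    names = list(names)
    return sum(1 for name in names if predicate(name)) / len(names)


# Hypothetical toy data (suffix already stripped).
names = ["datagenetics", "rat-bikes", "route66", "4-you"]
print(f"{share_matching(names, has_digit):.1%} contain a digit")
print(f"{share_matching(names, has_dash):.1%} contain a dash")
```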

Frequency

Not all characters are equally used in domain names. A total of 1,392,049,701 characters are used to represent the 102,815,927 .com names. Below is the relative distribution:

Rank  Char        Count
#1    E     141,646,533
#2    A     123,868,625
#3    I     100,401,072
#4    O      96,790,706
#5    S      96,189,067
#6    R      94,784,191
#7    N      93,320,130
#8    T      88,697,233
#9    L      69,479,211
#10   C      57,324,556
#11   D      43,903,797
#12   M      41,929,347
#13   U      41,562,591
#14   H      38,309,479
#15   P      35,773,302
#16   G      35,272,868
#17   B      28,121,111
#18   Y      25,037,505
#19   F      21,407,914
#20   K      19,964,847
#21   W      17,464,287
#22   V      16,644,742
#23   -      16,236,169
#24   X       7,320,389
#25   J       7,264,260
#26   Z       6,583,320
#27   1       3,957,814
#28   2       3,557,306
#29   Q       2,975,019
#30   0       2,913,777
#31   4       2,065,220
#32   3       1,936,546
#33   .       1,935,919
#34   8       1,794,959
#35   5       1,564,812
#36   6       1,393,887
#37   9       1,382,868
#38   7       1,274,322

It's interesting to note that the distribution differs from the traditional pattern found in the English language: E, T, A, I, O, N, S, H, R, D, L … Some of this can be explained by the fact that domain names are not just for the consumption of English-speaking people. Even though other regions have their own domains, .com has become the lingua franca and many businesses simply default to it. (For those interested, there is an interesting article on Wikipedia about the differing relative frequencies of letters in other languages.)

The least popular letter is the letter Q, and both the numeric digits 1 and 2 occur with higher frequency than this letter. The least popular character in .com domain names is the number 7.
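A character tally like the one above is straightforward to build with `collections.Counter`. This is a minimal sketch on toy data, not the query that produced the table; `char_frequencies` is a name I've made up for illustration.

```python
from collections import Counter


def char_frequencies(names):
    """Count how often each character occurs across all names (case-insensitive)."""
    counts = Counter()
    for name in names:
        counts.update(name.upper())  # a string updates the counter character by character
    return counts


# Hypothetical toy data.
counts = char_frequencies(["example", "data"])
total = sum(counts.values())
for char, count in counts.most_common(3):
    print(char, count, f"{count / total:.1%}")
```

Dividing each count by the total gives the relative frequency. For the real table, 'E' accounts for 141,646,533 of the 1,392,049,701 characters, or roughly 10.2%.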

Couplets, triplets, quatrains, (and whatever they call five!)

Certain pairs of characters appear together more frequently than others. Searching through the .com namespace, it's possible to determine that the character combination 'IN' occurs more frequently than any other pair of letters (23,193,376 times). Here are the top 120 doublets:

Rank ??       Count   Rank ??       Count   Rank ??       Count   Rank ??       Count
#1   IN  23,193,376   #2   ER  22,045,322   #3   AN  18,633,554   #4   ES  16,983,404
#5   ON  16,271,855   #6   RE  16,070,874   #7   AR  15,082,764   #8   AL  13,773,410
#9   ST  13,516,029   #10  EN  13,434,330   #11  TE  13,232,970   #12  OR  13,137,866
#13  TI  11,630,714   #14  LE  11,455,196   #15  RA  11,447,567   #16  NE  10,731,677
#17  NG  10,589,719   #18  AT  10,325,754   #19  NT  10,201,868   #20  RI   9,864,274
#21  LI   9,653,122   #22  CO   9,325,669   #23  LA   9,300,814   #24  MA   9,149,085
#25  TO   9,143,487   #26  EA   8,916,932   #27  EL   8,844,154   #28  DE   8,781,881
#29  RO   8,764,327   #30  NS   8,737,029   #31  IC   8,699,288   #32  TA   8,437,680
#33  CA   8,421,233   #34  ME   8,282,189   #35  CH   8,181,017   #36  AS   8,047,599
#37  HO   7,954,085   #38  ND   7,924,789   #39  HE   7,833,639   #40  IT   7,746,934
#41  SE   7,715,012   #42  ET   7,542,073   #43  IS   7,387,741   #44  TH   7,376,080
#45  IO   6,808,156   #46  LL   6,803,100   #47  SI   6,783,264   #48  OU   6,761,302
#49  UR   6,534,693   #50  LO   6,499,349   #51  TR   6,448,385   #52  NA   6,401,442
#53  RT   6,391,591   #54  EC   6,387,507   #55  CE   6,288,226   #56  DI   6,238,405
#57  VE   6,235,411   #58  IL   6,159,701   #59  AC   6,109,501   #60  OL   5,999,110
#61  RS   5,962,453   #62  AM   5,867,200   #63  IA   5,846,628   #64  SA   5,801,203
#65  HA   5,732,145   #66  ED   5,711,386   #67  OM   5,701,672   #68  NI   5,370,265
#69  PA   5,283,311   #70  SH   5,249,674   #71  GE   5,190,053   #72  SO   5,108,273
#73  IE   5,050,900   #74  US   5,030,948   #75  AD   4,938,913   #76  TS   4,910,822
#77  SS   4,906,823   #78  VI   4,906,818   #79  AI   4,875,583   #80  OT   4,869,671
#81  NC   4,825,712   #82  MO   4,760,025   #83  HI   4,749,105   #84  OS   4,711,684
#85  DA   4,603,390   #86  PE   4,532,226   #87  BA   4,513,833   #88  EE   4,456,735
#89  PR   4,437,297   #90  OO   4,425,562   #91  MI   4,418,683   #92  EM   4,391,197
#93  UN   4,286,124   #94  BE   4,271,960   #95  IR   4,099,976   #96  KE   3,996,087
#97  PO   3,945,584   #98  AP   3,808,451   #99  UT   3,798,122   #100 GA   3,796,346
#101 AG   3,780,466   #102 SC   3,645,046   #103 ID   3,630,847   #104 DO   3,592,374
#105 IG   3,582,602   #106 NO   3,550,662   #107 CT   3,529,155   #108 WE   3,505,427
#109 OP   3,489,397   #110 GR   3,436,742   #111 BO   3,432,061   #112 FI   3,393,118
#113 SU   3,377,709   #114 CK   3,325,733   #115 FO   3,316,860   #116 CI   3,289,758
#117 SP   3,262,011   #118 OD   3,147,594   #119 TU   3,075,995   #120 EB   3,002,441
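Counting pairs (or triplets, or longer runs) amounts to sliding a fixed-width window along each name and tallying what falls inside it. Here is a minimal sketch of that idea; `ngram_counts` is a hypothetical helper and the data is a toy example, so treat it as an illustration rather than the method used to build these tables.

```python
from collections import Counter


def ngram_counts(names, n):
    """Count every run of n consecutive characters across all names."""
    counts = Counter()
    for name in names:
        s = name.upper()
        # Slide a window of width n along the name; short names yield nothing.
        counts.update(s[i:i + n] for i in range(len(s) - n + 1))
    return counts


# Hypothetical toy data.
names = ["online", "shopping"]
print(ngram_counts(names, 2).most_common(3))
print(ngram_counts(names, 3)["ING"])
```

The same function covers every table in this section just by changing `n`. Note that a sliding window counts overlapping occurrences separately, so 'AAA' contains 'AA' twice.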

Moving to triplets, we see that the highest frequency combination of letters is 'ING', with 7,402,227 occurrences. It's interesting to note that 'THE' and 'AND', whilst very high up on the list, are outnumbered by 'ING' (the common English -ing verb ending).

Since I know some of you will be curious, the substring 'SEX' ranks #830, with 339,802 occurrences in the .com domain.

Rank ???      Count   Rank ???      Count   Rank ???      Count   Rank ???      Count
#1   ING  7,402,227   #2   ION  4,822,392   #3   ENT  4,451,444   #4   TER  3,967,305
#5   AND  3,942,141   #6   THE  3,550,009   #7   TIO  3,537,762   #8   ERS  3,457,584
#9   INE  3,294,442   #10  EST  3,250,273   #11  LIN  2,903,029   #12  ATI  2,759,932
#13  ONS  2,654,525   #14  ATE  2,430,697   #15  TIN  2,397,200   #16  TOR  2,393,355
#17  ART  2,353,882   #18  RES  2,343,289   #19  TRA  2,332,510   #20  STA  2,284,291
#21  PRO  2,123,707   #22  REA  2,105,693   #23  RAN  2,092,965   #24  CON  2,082,868
#25  ALL  2,070,656   #26  ORT  2,059,681   #27  ESS  2,025,271   #28  NTE  2,012,337
#29  LAN  2,004,800   #30  FOR  1,991,104   #31  STE  1,941,777   #32  CAR  1,938,491
#33  MAR  1,935,419   #34  LES  1,925,969   #35  STO  1,923,055   #36  VER  1,886,713
#37  ANC  1,883,790   #38  ALE  1,818,891   #39  IST  1,802,642   #40  INT  1,801,195
#41  OME  1,794,527   #42  ANT  1,775,946   #43  PER  1,757,252   #44  AGE  1,747,623
#45  ILL  1,738,483   #46  EAL  1,730,384   #47  MEN  1,686,776   #48  NCE  1,679,483
#49  ERI  1,672,661   #50  ICA  1,672,360   #51  ELL  1,650,992   #52  ARE  1,648,406
#53  REE  1,630,078   #54  LLE  1,611,347   #55  TAL  1,610,718   #56  OUR  1,589,818
#57  ONE  1,586,445   #58  ICE  1,578,147   #59  MAN  1,562,247   #60  STR  1,558,690
#61  COM  1,557,329   #62  NES  1,556,766   #63  SIN  1,534,106   #64  ORE  1,511,463
#65  SHO  1,508,233   #66  CHE  1,506,047   #67  IVE  1,498,635   #68  SER  1,485,235
#69  AIN  1,470,391   #70  CHA  1,466,781   #71  STI  1,450,094   #72  ECT  1,439,995
#73  IDE  1,437,605   #74  RIN  1,436,529   #75  AST  1,431,439   #76  POR  1,430,462
#77  CHI  1,421,418   #78  HER  1,421,325   #79  DER  1,412,386   #80  ITE  1,395,460
#81  ARD  1,393,607   #82  PAR  1,391,007   #83  DES  1,381,508   #84  SON  1,381,487
#85  INS  1,366,380   #86  NER  1,361,337   #87  EDI  1,360,163   #88  ERT  1,355,165
#89  INA  1,331,050   #90  NTA  1,325,359   #91  ANG  1,323,553   #92  HOT  1,322,784
#93  IAN  1,321,398   #94  RIC  1,317,343   #95  TON  1,313,181   #96  IND  1,301,584
#97  REN  1,280,235   #98  ESI  1,278,969   #99  HOM  1,278,689   #100 ANA  1,273,481
#101 EAR  1,268,724   #102 WOR  1,261,552   #103 HEA  1,253,535   #104 ECO  1,250,956
#105 AME  1,238,717   #106 GRA  1,233,766   #107 IES  1,216,401   #108 TIC  1,211,928
#109 CTI  1,208,389   #110 ARI  1,202,371   #111 URE  1,201,484   #112 MER  1,197,687
#113 ERA  1,193,752   #114 ELE  1,189,551   #115 HIN  1,187,544   #116 ASS  1,186,078
#117 ERE  1,184,471   #118 NLI  1,184,337   #119 ALI  1,183,155   #120 TUR  1,182,331

Here is the table of the quatrains. Now it's possible to start making out substrings of what must be very common word fragments. I'm a little surprised that 'FREE' did not feature slightly higher than #69, but still it's impressive that there are over half a million occurrences of this substring in the .com domain. (Advanced geek note here for fans of regular expressions: the counts in these tables represent the frequency of matches of these strings, not the number of domain names that contain these strings. For instance, the number of domains which contain the characters 'FREE' is 525,448, yet the table below shows a frequency of 527,595. Why the difference? Simple: some names contain the match more than once!)

Rank ????      Count   Rank ????      Count   Rank ????      Count   Rank ????      Count
#1   TION  3,488,333   #2   ATIO  1,757,364   #3   TING  1,574,233   #4   IONS  1,335,966
#5   LINE  1,281,210   #6   NTER  1,271,171   #7   MENT  1,248,400   #8   HOME  1,142,975
#9   PORT  1,078,931   #10  ANCE  1,039,416   #11  NLIN  1,006,019   #12  ONLI  1,001,340
#13  SERV    969,629   #14  LAND    930,140   #15  INGS    912,074   #16  SIGN    893,141
#17  XN--    853,718   #18  INTE    825,806   #19  ERVI    822,265   #20  CTIO    807,484
#21  IGHT    799,786   #22  DESI    798,598   #23  ESIG    774,347   #24  VICE    751,564
#25  STOR    750,355   #26  STER    725,532   #27  DING    724,758   #28  MEDI    720,883
#29  RVIC    720,437   #30  NS1.    717,066   #31  ESTA    716,667   #32  REAL    711,143
#33  EALT    710,484   #34  CONS    709,192   #35  SHOP    708,987   #36  NS2.    699,106
#37  CENT    698,522   #38  ENTE    685,673   #39  INES    677,334   #40  COMP    660,084
#41  NING    650,625   #42  GROU    650,314   #43  MARK    647,324   #44  TURE    642,651
#45  PHOT    633,798   #46  NESS    632,291   #47  HOTO    630,338   #48  TECH    626,769
#49  THER    626,745   #50  WORK    622,839   #51  OUNT    595,546   #52  RANC    595,149
#53  LING    594,830   #54  ALES    592,989   #55  ROUP    589,783   #56  STAT    587,880
#57  ENTA    582,465   #58  SION    581,508   #59  TERS    579,542   #60  PART    573,153
#61  RING    559,112   #62  SALE    557,777   #63  STIN    557,588   #64  ENTS    551,485
#65  HOUS    548,517   #66  KING    542,883   #67  COUN    529,230   #68  ONAL    527,809
#69  FREE    527,595   #70  ARKE    526,117   #71  REAT    525,958   #72  IONA    521,854
#73  AUTO    519,532   #74  ICES    509,800   #75  CTOR    502,528   #76  ALTH    502,227
#77  YOUR    498,557   #78  CIAL    498,392   #79  OMES    498,037   #80  TORE    491,740
#81  HING    487,146   #82  OGRA    485,742   #83  TATE    483,941   #84  TIVE    481,271
#85  OUSE    481,241   #86  URAN    480,572   #87  OTEL    478,484   #88  CHIN    476,787
#89  UTIO    475,655   #90  SPOR    475,653   #91  ITAL    473,510   #92  BOOK    472,095
#93  CARE    472,038   #94  HEAL    471,643   #95  ATER    470,263   #96  BEST    470,175
#97  RKET    469,833   #98  GRAP    469,391   #99  SAND    467,903   #100 STUD    467,322
#101 RAPH    465,722   #102 OLUT    465,431   #103 TERN    463,946   #104 ALLE    463,517
#105 DENT    463,392   #106 EDIA    461,549   #107 EMEN    460,022   #108 RICA    457,796
#109 RENT    456,295   #110 RESS    455,265   #111 LIFE    453,744   #112 NDER    451,107
#113 ICAL    449,954   #114 GREE    449,776   #115 LUTI    449,660   #116 ILLE    448,233
#117 REEN    447,048   #118 VERS    447,032   #119 PRES    445,529   #120 VENT    442,102
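The difference between "number of matches" and "number of domains containing a match" is easy to demonstrate with a regular expression. The sketch below uses Python's `re` module on made-up names; `match_and_domain_counts` is a hypothetical helper, not the query used for the tables.

```python
import re


def match_and_domain_counts(names, fragment):
    """Return (total matches of fragment, number of names containing it)."""
    pattern = re.compile(re.escape(fragment), re.IGNORECASE)
    matches = 0
    domains = 0
    for name in names:
        hits = len(pattern.findall(name))  # non-overlapping matches in this name
        matches += hits
        if hits:
            domains += 1
    return matches, domains


# Hypothetical toy data: one name contains 'free' twice.
names = ["freestuff", "free-free", "carefree", "paid"]
print(match_and_domain_counts(names, "FREE"))
```

One caveat: `findall` counts non-overlapping matches. That makes no difference for 'FREE', which cannot overlap itself, but it would undercount a fragment such as 'AA' relative to a sliding-window tally.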

And finally, a table of the top 120 five-character patterns. Here the word fragments are obvious. Looking through the list, it's a fun exercise to guess at the obvious domains that contain these strings.

Rank ?????      Count   Rank ?????      Count   Rank ?????      Count   Rank ?????      Count
#1   ATION  1,735,530   #2   TIONS  1,134,141   #3   NLINE    984,560   #4   ONLIN    971,756
#5   CTION    805,503   #6   ESIGN    767,702   #7   SERVI    759,486   #8   DESIG    757,231
#9   ERVIC    718,419   #10  RVICE    694,050   #11  PHOTO    624,232   #12  INTER    619,105
#13  GROUP    586,612   #14  ENTER    535,085   #15  EALTH    478,542   #16  MARKE    477,951
#17  UTION    474,667   #18  ARKET    465,762   #19  COUNT    464,850   #20  STATE    462,895
#21  HOMES    461,746   #22  GRAPH    453,391   #23  SPORT    452,938   #24  LUTIO    445,856
#25  OLUTI    442,088   #26  HOUSE    439,230   #27  HOTEL    430,347   #28  SOLUT    421,275
#29  WORLD    419,720   #30  EMENT    417,623   #31  UCTIO    413,506   #32  STORE    406,636
#33  HEALT    404,053   #34  ENTAL    402,259   #35  RANCE    400,607   #36  MEDIA    396,040
#37  VICES    391,582   #38  CONSU    382,581   #39  IONAL    381,293   #40  ESTAT    375,826
#41  STUDI    371,288   #42  PRODU    370,826   #43  MUSIC    367,149   #44  GREEN    364,664
#45  RODUC    360,858   #46  OGRAP    357,465   #47  TUDIO    357,402   #48  ONSUL    356,447
#49  TIONA    352,751   #50  NSULT    351,168   #51  CATIO    348,964   #52  TOGRA    348,431
#53  CENTE    340,501   #54  USINE    339,681   #55  INESS    339,027   #56  OTOGR    338,962
#57  ODUCT    333,345   #58  SINES    331,515   #59  MOBIL    330,525   #60  TRAVE    327,196
#61  NATIO    326,712   #62  HOTOG    326,706   #63  RAVEL    323,398   #64  BUSIN    321,811
#65  ETING    321,577   #66  NTERN    318,557   #67  COMPA    311,167   #68  INSUR    308,468
#69  URANC    305,892   #70  PORTS    305,602   #71  SURAN    303,164   #72  STING    300,157
#73  RAPHY    288,645   #74  ALEST    287,773   #75  ELECT    286,249   #76  LESTA    283,264
#77  NSURA    283,217   #78  LIGHT    283,044   #79  AMERI    282,633   #80  MENTS    281,103
#81  ERICA    280,697   #82  TWORK    278,375   #83  KETIN    277,880   #84  RKETI    277,557
#85  REALE    274,945   #86  MERIC    274,523   #87  ROPER    273,980   #88  PROPE    273,710
#89  PRESS    272,991   #90  EALES    270,742   #91  CREAT    270,015   #92  SYSTE    269,157
#93  SCHOO    268,952   #94  DIREC    267,861   #95  YSTEM    266,582   #96  IRECT    266,558
#97  OPERT    266,462   #98  CHOOL    265,840   #99  SOCIA    265,413   #100 VILLE    259,832
#101 VIDEO    259,764   #102 TMENT    251,356   #103 ECTIO    248,383   #104 CHRIS    245,314
#105 FAMIL    244,913   #106 ETWOR    244,643   #107 GUIDE    244,638   #108 OMPAN    243,981
#109 TRANS    243,471   #110 NETWO    241,831   #111 SIGNS    239,523   #112 REATI    235,494
#113 CLEAN    235,330   #114 RENTA    234,122   #115 CENTR    233,177   #116 MEDIC    233,069
#117 EARCH    231,572   #118 WATER    229,956   #119 LECTR    229,101   #120 SSION    227,410

OK, enough tables, back to charts

We now know what the most popular characters in domain names are, but what is the most popular starting letter for a domain name? Let's run a quick query to find out:

The most popular numeric digit to start domains is the number '1', but domains starting with numbers are dwarfed by domains starting with letters. Despite the overall frequency of letters following the order E, A, I, the most popular starting letter for a domain is the letter 'S' followed, interestingly, by the letter 'C', then 'M'. The least popular letter to start a domain is the letter 'Q'.

Ending character

Same exercise, but this time with the ending character. Note: this graph is on a different scale. Again, 'S' is the most popular character, but this time with more than double the frequency it has as a starting character. (I'm guessing a lot of domain names are written as plurals.)

After 'S', the next most popular ending letter is 'E', followed by 'T'.

Composite chart

Below is a composite graph showing the frequency of start and end characters on the same scale:

'Y' is not a popular starting letter, but it is a popular finishing letter. Conversely, 'B', 'F', 'J' and 'V' occur frequently as starting letters but rarely as finishing letters. This is probably no great surprise to students of the English language.
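The start and end tallies behind charts like these need nothing more than the first and last character of each name. A minimal sketch on toy data (the function name is hypothetical, and the suffix is assumed to be stripped already):

```python
from collections import Counter


def start_end_counts(names):
    """Return two Counters: first characters and last characters of each name."""
    names = [name.upper() for name in names if name]  # skip empty strings
    starts = Counter(name[0] for name in names)
    ends = Counter(name[-1] for name in names)
    return starts, ends


# Hypothetical toy data.
starts, ends = start_end_counts(["shoes", "signs", "city", "maps"])
print("starts:", starts.most_common(2))
print("ends:", ends.most_common(2))
```

Plotting both Counters against the same axis gives a composite view like the one described above.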

Cross correlations

But what about the distribution of cross-correlations? What is the relative frequency of each ending letter compared to the starting letter? The heatmap below shows this data. The brighter the color, the stronger the count. The vertical axis on the grid shows the starting letter, and the horizontal shows the ending character.

The brightest square on the grid is 'S%S', which has a count of 1,869,669; the most popular combination of start and end characters.

For those not familiar with SQL Server, the '%' represents the wild-card character, which matches any sequence of zero or more characters. (Geek note: I had to use a logarithmic scale to make sensible use of the color space.)
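The grid can be thought of as a tally of (start, end) pairs, with each count squashed onto a logarithmic scale before it is turned into a color. Here is a minimal sketch of both steps. The function names and the exact scaling formula are my own assumptions; the article only says that a logarithmic scale was used.

```python
import math
from collections import Counter


def start_end_pairs(names):
    """Count (first character, last character) pairs, like the pattern 'S%S'."""
    # A pattern such as 'S%S' needs at least two characters, so skip shorter names.
    return Counter((n[0].upper(), n[-1].upper()) for n in names if len(n) >= 2)


def log_intensity(count, max_count):
    """Map a count onto 0..1 using a logarithmic scale (empty cells map to 0)."""
    if count <= 0 or max_count <= 0:
        return 0.0
    return math.log1p(count) / math.log1p(max_count)


# Hypothetical toy data.
pairs = start_end_pairs(["shoes", "signs", "city", "s"])
brightest = max(pairs.values())
for (start, end), count in pairs.most_common():
    print(f"{start}%{end}", count, round(log_intensity(count, brightest), 3))
```

Without the logarithm, a handful of very common cells would wash out everything else, which is presumably why a linear color scale was not usable.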

Video

If a picture speaks a thousand words, then does a video speak a thousand pictures? In this case, no, it speaks 120 pictures! The link below is to an animation helping to show the relative distribution of the frequencies of start and end characters.

This animation presents the same data as shown in the grid above but, on each time slice, lowers the threshold above which items get shaded.

At the start of the animation, you can see the bright cell for 'S%S', along with bright cells for 'C%S', 'M%S' and 'A%S'. The vertical column for '%S' and, to a lesser extent, the '%E' column are visible at the start. As time passes the other vertical columns become more visible, highlighting the fact that the ending character in a domain is more quantised than the starting one. Numbers don't start to make a meaningful appearance until about halfway into the movie.

'4%Q' and '6%Q' are at the bottom of the frequency table.

International Domains

There are now also internationalised domain names (IDN) supporting native Unicode characters for languages with non-Latin alphabets. These domains are outside the scope of this article.

Can we help you?

Want help selecting your next domain, or want access to further research on domain names? Our company is happy to help.

Send us an Email.

Update

Thank you to the numerous people who took the time to email me and correct me about a definition. In this article, I refer to the entire root of a domain name, e.g. Amazon.com, as the TLD. I made a mistake: it is just the .com component of this name that is the TLD. I hope this error didn't mask the enjoyment of the article for you. I appreciate all the feedback I receive.
