What if I asked you to produce a regular expression to validate an email address? You may think about it for a moment, and then simply Google it to produce a regular expression like:

^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$

There are likely thousands of different regular expressions out there. Why is that? Surely somebody has read the RFC 822 standard and produced a reliable regular expression? Well, here’s another one…

(?:(?:\r

)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r

)?[ \t]

)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r

)?[ \t]))*"(?:(?:

\r

)?[ \t])*)(?:\.(?:(?:\r

)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(

?:\r

)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r

)?[

\t]))*"(?:(?:\r

)?[ \t])*))*@(?:(?:\r

)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0

31]+(?:(?:(?:\r

)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\

](?:(?:\r

)?[ \t])*)(?:\.(?:(?:\r

)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+

(?:(?:(?:\r

)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:

(?:\r

)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r

)?[ \t])+|\Z

|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r

)?[ \t]))*"(?:(?:\r

)

?[ \t])*)*\<(?:(?:\r

)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\

r

)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r

)?[

\t])*)(?:\.(?:(?:\r

)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r

)

?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r

)?[ \t]

)*))*(?:,@(?:(?:\r

)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r

)?[

\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r

)?[ \t])*

)(?:\.(?:(?:\r

)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r

)?[ \t]

)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r

)?[ \t])*))*)

*:(?:(?:\r

)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r

)?[ \t])+

|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r

)?[ \t]))*"(?:(?:\r



)?[ \t])*)(?:\.(?:(?:\r

)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:

\r

)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r

)?[ \t

]))*"(?:(?:\r

)?[ \t])*))*@(?:(?:\r

)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031

]+(?:(?:(?:\r

)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](

?:(?:\r

)?[ \t])*)(?:\.(?:(?:\r

)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?

:(?:(?:\r

)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?

:\r

)?[ \t])*))*\>(?:(?:\r

)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?

:(?:\r

)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r

)?

[ \t]))*"(?:(?:\r

)?[ \t])*)*:(?:(?:\r

)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\]

\000-\031]+(?:(?:(?:\r

)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|

\\.|(?:(?:\r

)?[ \t]))*"(?:(?:\r

)?[ \t])*)(?:\.(?:(?:\r

)?[ \t])*(?:[^()<>

@,;:\\".\[\] \000-\031]+(?:(?:(?:\r

)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"

(?:[^\"\r\\]|\\.|(?:(?:\r

)?[ \t]))*"(?:(?:\r

)?[ \t])*))*@(?:(?:\r

)?[ \t]

)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r

)?[ \t])+|\Z|(?=[\["()<>@,;:\\

".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r

)?[ \t])*)(?:\.(?:(?:\r

)?[ \t])*(?

:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r

)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[

\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r

)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-

\031]+(?:(?:(?:\r

)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(

?:(?:\r

)?[ \t]))*"(?:(?:\r

)?[ \t])*)*\<(?:(?:\r

)?[ \t])*(?:@(?:[^()<>@,;

:\\".\[\] \000-\031]+(?:(?:(?:\r

)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([

^\[\]\r\\]|\\.)*\](?:(?:\r

)?[ \t])*)(?:\.(?:(?:\r

)?[ \t])*(?:[^()<>@,;:\\"

.\[\] \000-\031]+(?:(?:(?:\r

)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\

]\r\\]|\\.)*\](?:(?:\r

)?[ \t])*))*(?:,@(?:(?:\r

)?[ \t])*(?:[^()<>@,;:\\".\

[\] \000-\031]+(?:(?:(?:\r

)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\

r\\]|\\.)*\](?:(?:\r

)?[ \t])*)(?:\.(?:(?:\r

)?[ \t])*(?:[^()<>@,;:\\".\[\]

\000-\031]+(?:(?:(?:\r

)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]

|\\.)*\](?:(?:\r

)?[ \t])*))*)*:(?:(?:\r

)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0

00-\031]+(?:(?:(?:\r

)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\

.|(?:(?:\r

)?[ \t]))*"(?:(?:\r

)?[ \t])*)(?:\.(?:(?:\r

)?[ \t])*(?:[^()<>@,

;:\\".\[\] \000-\031]+(?:(?:(?:\r

)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?

:[^\"\r\\]|\\.|(?:(?:\r

)?[ \t]))*"(?:(?:\r

)?[ \t])*))*@(?:(?:\r

)?[ \t])*

(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r

)?[ \t])+|\Z|(?=[\["()<>@,;:\\".

\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r

)?[ \t])*)(?:\.(?:(?:\r

)?[ \t])*(?:[

^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r

)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]

]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r

)?[ \t])*))*\>(?:(?:\r

)?[ \t])*)(?:,\s*(

?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r

)?[ \t])+|\Z|(?=[\["()<>@,;:\\

".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r

)?[ \t]))*"(?:(?:\r

)?[ \t])*)(?:\.(?:(

?:\r

)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r

)?[ \t])+|\Z|(?=[

\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r

)?[ \t]))*"(?:(?:\r

)?[ \t

])*))*@(?:(?:\r

)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r

)?[ \t

])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r

)?[ \t])*)(?

:\.(?:(?:\r

)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r

)?[ \t])+|

\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r

)?[ \t])*))*|(?:

[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r

)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\

]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r

)?[ \t]))*"(?:(?:\r

)?[ \t])*)*\<(?:(?:\r

)

?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r

)?[ \t])+|\Z|(?=[\["

()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r

)?[ \t])*)(?:\.(?:(?:\r

)

?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r

)?[ \t])+|\Z|(?=[\["()<>

@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r

)?[ \t])*))*(?:,@(?:(?:\r

)?[

\t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r

)?[ \t])+|\Z|(?=[\["()<>@,

;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r

)?[ \t])*)(?:\.(?:(?:\r

)?[ \t]

)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r

)?[ \t])+|\Z|(?=[\["()<>@,;:\\

".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r

)?[ \t])*))*)*:(?:(?:\r

)?[ \t])*)?

(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r

)?[ \t])+|\Z|(?=[\["()<>@,;:\\".

\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r

)?[ \t]))*"(?:(?:\r

)?[ \t])*)(?:\.(?:(?:

\r

)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r

)?[ \t])+|\Z|(?=[\[

"()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r

)?[ \t]))*"(?:(?:\r

)?[ \t])

*))*@(?:(?:\r

)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r

)?[ \t])

+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r

)?[ \t])*)(?:\

.(?:(?:\r

)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r

)?[ \t])+|\Z

|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r

)?[ \t])*))*\>(?:(

?:\r

)?[ \t])*))*)?;\s*)

Even this monster cannot truly validate an email address. How can this be? It turns out there is a lot more to the humble email address. Some parts of RFC 822 are actually quite useful, some are just insane. Either way it’s interesting, so let’s dive in…

Sub-addresses

One thing that is particularly worth noting is sub-addresses, because they can be extremely useful and are supported almost everywhere. A sub-address allows you to create different email addresses that all go to the same physical mailbox.

Let’s say Bob’s email address is bob@smith.com. A sub-address uses a + to add a label, like bob+spam@smith.com. If Bob were to sign up to a site with the latter, he would still get the messages as normal at bob@smith.com, but now he can create filters or simply switch off one of the sub-addresses altogether.

One more interesting tidbit: if you use a unique sub-address for each of the sites you sign up to, you will be able to see when someone (or rather, who) sells your email address to someone else… Busted!
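As a rough illustration of the idea, here is a minimal Python sketch that splits an address into its mailbox, label and domain. The function name `split_sub_address` is made up for this example, and it assumes a plain address with no quoted local-part or comments.

```python
from typing import Tuple


def split_sub_address(address: str) -> Tuple[str, str, str]:
    """Split 'user+label@domain' into (user, label, domain).

    Assumption: a plain address, no quoted local-part, no comments.
    """
    # Everything after the last @ is treated as the domain.
    local, _, domain = address.rpartition("@")
    # The first + separates the mailbox name from the label (if any).
    user, _, label = local.partition("+")
    return user, label, domain


print(split_sub_address("bob+spam@smith.com"))  # ('bob', 'spam', 'smith.com')
print(split_sub_address("bob@smith.com"))       # ('bob', '', 'smith.com')
```

Both addresses resolve to the same mailbox name, bob, which is why a filter keyed on the label is enough to sort or silence one of them.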

Where the Regexp Starts to Break Down

#!$%&'*+-/=?^_`{}|~@example.org

Unbeknownst to most people, this is actually a valid email address, because all of the characters you see are perfectly acceptable in the local-part (that’s the bit before the @).

Furthermore, the local-part can contain any characters, including an @ sign, if they are enclosed within double quotes. These are also perfectly valid:

"dream.within@a.dream"@inception.movie

bob."@".smith@mywebsite.com

You will notice that the emails above have been partially converted into links by the Markdown parser for Silvrback because they can get so difficult to parse in text as well.
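To see the breakdown concretely, here is a small Python sketch that runs these addresses through a simple pattern of the kind shown at the top of this post. The pattern is my reading of that expression with the escapes written out, and `looks_valid` is a name invented for this example.

```python
import re

# Assumed form of the simple pattern from the top of the post.
SIMPLE = re.compile(r"^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$")


def looks_valid(address: str) -> bool:
    """Return True if the simple pattern accepts the address."""
    return SIMPLE.match(address) is not None


addresses = [
    "bob@smith.com",
    "bob+spam@smith.com",
    "#!$%&'*+-/=?^_`{}|~@example.org",
    '"dream.within@a.dream"@inception.movie',
    'bob."@".smith@mywebsite.com',
]

for address in addresses:
    print(looks_valid(address), address)
```

Only the first address is accepted. The other four are rejected even though each of them is valid, including the perfectly ordinary sub-address.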

To Insanity and Beyond!

I would be surprised if you’re not at least a little bit impressed at how crazy you can get with an email address. However, before you feel the wash of guilt over all the inadequate regular expressions you’ve implemented or borrowed in your past software, it’s about to get even more intense…

Up until now we are still able to put these rules into a regular expression; in fact, it would look like the monster shown above. But we must continue. It’s time to talk about comments.

Comments are arbitrary text enclosed in parentheses that can appear in four possible places in an email address:

(here)a@b.com

a(here)@b.com

a@(here)b.com

a@b.com(here)

All of these have the same semantic meaning. They work in a similar way to sub-addressing in that they are just cosmetic, and the email will actually arrive in the a@b.com mailbox.

“If it’s worth doing, it’s worth overdoing.” — Ayn Rand

Once again taking it one step further, comments can be nested, one inside another.

If you’ve ever had to deal with recursive regular expressions, you know that it can be very difficult even in the simplest scenarios. Now try mixing that with the monster regular expression above and let your brain explode.

Despite everything the RFC 822 spec allows, we have all agreed that simple, memorable email addresses are the way to go. Maybe we will find a better use for these features in the future, but until then you can contact me on:
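For contrast, nesting is easy to handle once you stop insisting on a regular expression. The Python sketch below strips comments with a simple depth counter. It is a simplification, not an RFC 822 parser: it ignores quoted strings and backslash escapes, and `strip_comments` and the sample addresses are invented for illustration.

```python
def strip_comments(address: str) -> str:
    """Remove parenthesised comments, including nested ones.

    Simplified: ignores quoted strings and backslash escapes,
    which a full parser would also have to handle.
    """
    depth = 0  # how many comments we are currently inside
    kept = []
    for ch in address:
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                raise ValueError("unbalanced ')' in address")
            depth -= 1
        elif depth == 0:
            # Only keep characters that sit outside every comment.
            kept.append(ch)
    if depth != 0:
        raise ValueError("unclosed '(' in address")
    return "".join(kept)


print(strip_comments("a(here)@b.com"))      # a@b.com
print(strip_comments("a(one(two))@b.com"))  # a@b.com
```

The counter is the piece a classic regular expression lacks: it has no way to remember how many parentheses are still open.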

elliotchance+blog(aka(batman))@gmail.com

P.S. I’m hoping the spam bots trawling for email addresses on pages like these aren’t smart enough to pick up that email in their regular expressions…

EDIT: One thing that may not have been clear is that I was talking about validating an email address through a regular expression. There are of course many other ways to properly validate an email address.

Thanks to Will Sargent for suggesting dominicsayers/isemail. If you know of any other libraries, please let me know and I’ll add them to this list for reference.