Congratulations. From this day forward, you will no longer squander your time trying to work out the perfect regex to validate email addresses. You will also never again run the risk of rejecting what is, in fact, a strange, valid email address.

The trick is to first define what we mean by ‘valid’.

We are developers, we are technical folk, so it’s no surprise that the prevailing wisdom is to check that an address matches the official criteria. Some examples of the diversity of those criteria are…

But I say pish! to prevailing wisdom, so…

Everything you know is wrong

Instead of the above approach that largely ignores reality, I believe there are two questions we need to ask:

Did the user understand that they were supposed to type an email address into this field?

Did the user correctly type their email address into this field?

If you have a well laid-out form with a label that says “email”, and the user enters an ‘@’ symbol somewhere, then it’s safe to say they understood that they were supposed to be entering an email address. Easy.
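As a minimal sketch of that first check (the function name is made up for illustration, and Python is just a convenient language to show it in), the whole test fits in one line:

```python
def looks_like_email_attempt(value: str) -> bool:
    """Return True if the user evidently tried to enter an email address."""
    # The only test: is there an '@' somewhere in what they typed?
    return "@" in value


print(looks_like_email_attempt("davidgilbertson@example.com"))
print(looks_like_email_attempt("david gilbertson"))
```

Note that this answers only the first question. It says nothing about whether the address was typed correctly.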

Next, we want to do some validation to ascertain if they correctly entered their email address.

Not possible.

It’s important that you agree with me on this point: it’s not possible.

I know what you’re thinking. “But it helps, right?” That’s like saying that opening and closing your fridge really quickly conserves energy and helps fight climate change. Sure, it helps, if we want to be slaves to the word ‘help’. But most people would agree you have a promising career in a straitjacket if you’re unnecessarily rattling your pickle jars for the benefit of the polar bears.

Let’s explore

Let’s imagine that my email address is davidgilbertson@example.com. That’s 27 stabs at the keyboard that could go awry. Any mistype will definitely result in an incorrect email address but only maybe result in an invalid email address.

[epiphany]

Even if the sun shone through my window and I was visited by a particularly savage sneeze (I suffer from Autosomal Dominant Compelling Helio-Ophthalmic Outburst Syndrome*) and I typed out #!$%&'*+-/=?^_`{}|~@example.com by mistake, I would still pass the most thorough email ‘validation’ techniques. (The flip side is that I could fail and be told my address isn’t valid when it is! On a whim I just emailed the person at #!$%&'*+-/=?^_`{}|~@example.com and she said she gets super pissed off when told that her email address isn’t valid. She regrets buying the example.com domain, too, but won’t give it up, just like the guy that’s got milk.com. We got chatting and it turns out she only lives a few blocks from me and also collects vintage cameras; we’re playing golf next week. I think maybe she’s the one. I should probably close these brackets and get on with the story.)
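For the sceptical, here is a rough sketch of why the sneeze passes. It checks only the part before the ‘@’ against the characters that RFC 5322 allows in an unquoted local part; the function name is hypothetical, and a real by-the-book validator would check a good deal more than this.

```python
import string

# Symbols RFC 5322 allows in an unquoted local part, besides letters and digits.
ATEXT_SYMBOLS = "!#$%&'*+-/=?^_`{|}~"
ATEXT = set(string.ascii_letters + string.digits + ATEXT_SYMBOLS)


def local_part_is_dot_atom(address: str) -> bool:
    """Check that the text before the last '@' is dot-separated runs of legal characters."""
    local, sep, _domain = address.rpartition("@")
    if not sep or not local:
        return False
    # Every dot-separated chunk must be non-empty and made only of allowed characters.
    return all(part and all(ch in ATEXT for ch in part) for part in local.split("."))


sneeze = "#!$%&'*+-/=?^_`{}|~@example.com"
print(local_part_is_dot_atom(sneeze))
```

Every character of the sneeze is on the allowed list, so a check like this has no grounds to complain.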

So what are the odds that any one typo would result in an invalid email address? We will build a statistical model! Let’s look at, say, the ‘g’. I am more likely to mistype with a letter on the visible keyboard with no shift key required (I apply a weighting to non-modified keys in the model). Of all the tappable keys on a physical keyboard, there are six characters that, while not completely invalid, are only valid in certain cases: []\;, and space. 6 out of 48. A 12.5% chance.

But an off-by-one error is more likely: for example, hitting the neighbouring ‘h’ key instead of ‘g’. So from a list of 117 million email addresses I have calculated the frequency of occurrence of each character and, for each, noted which keys lie closest on the keyboard, and factored in the likelihood that a mis-stroke will create an invalid email address. (I know hacking LinkedIn just to make a point about email validation is a bit extreme, but it is important to back up one’s opinions with data.)

For example, ‘e’ carries a low risk of invalidating the address, because all surrounding keys would still result in a valid email address. But ‘p’ has [ and ; within striking distance! So although it’s less common than ‘e’, it carries a higher risk of resulting in an invalid email address if missed.
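To make the neighbouring-key idea concrete, here is a toy sketch. The neighbour map is hand-written for a US QWERTY layout and covers only the two keys discussed; the names are invented for illustration, and the character-frequency and finger-dexterity weightings are left out entirely.

```python
# The six characters flagged earlier as valid only in certain cases.
RISKY = set("[]\\;, ")

# Hypothetical neighbour map: unshifted keys adjacent to each key (US QWERTY assumed).
NEIGHBOURS = {
    "e": "wrsd34",
    "p": "ol;[0-",
}


def invalidation_risk(key: str) -> float:
    """Fraction of a key's neighbours that would produce a risky character."""
    nearby = NEIGHBOURS[key]
    return sum(ch in RISKY for ch in nearby) / len(nearby)


for key in ("e", "p"):
    print(key, round(invalidation_risk(key), 2))
```

Under these assumptions ‘e’ scores zero, because none of its neighbours is risky, while ‘p’ scores two out of six thanks to [ and ;.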

I also consider the relative dexterity of the fingers. We all know that the pinky is the clumsy cousin of the finger family, so that is factored in as well.