Verifying email addresses by the hundreds

A little technical hack that saved hours of effort

In 2015 with the #SaveTheInternet campaign, we offered a “Mail Your MP” tool that made it easy to identify your Member of Parliament and send them an email. We had two problems building that tool:

Contact information for MPs was not easy to find.

The contact information we did find was often outdated.

The second in particular was something we couldn’t solve before launching the campaign. We could only amend our data when participants reported that their MP’s email address bounced. To handle this loop, we had a submission form and volunteers to process it.

The problem, in a nutshell, is that there is only one way to find out if an email address is working: you have to send it an email.

If it bounces, you know for sure that it is not working. This is a technical factor.

If it does not bounce, it could still be an abandoned mailbox. The recipient may not be reading email delivered there. This is a human factor.

A full mailbox will start bouncing, making it identifiable as a technical factor.

This time with #SpeakForMe, we wanted to do something better. Could we at least automate the technical factor? Since 2015 I’ve had occasions to think about how the Simple Mail Transfer Protocol (SMTP) underlying the worldwide email network works. Obsolete email addresses are a security risk in a project I manage at my day job.

Technical background on SMTP

SMTP is an ancient protocol, predating even the web and HTTP. It was written for an era when everyone on the Internet was assumed to be a good person. When a sender’s email server delivers email to a recipient’s email server, the conversation looks like this (using Gmail; within the SMTP session, lines that begin with a three-digit code are from Gmail and the rest are from the sender, and yes, this happens in plain text):

$ telnet gmail-smtp-in.l.google.com 25

Trying 173.194.66.26...

Connected to gmail-smtp-in.l.google.com.

Escape character is '^]'.

220 mx.google.com ESMTP r27si8425622qtj.52 - gsmtp

EHLO demo.speakforme.in

250-mx.google.com at your service, [xxx.xxx.xxx.xxx]

250-SIZE 157286400

250-8BITMIME

250-STARTTLS

250-ENHANCEDSTATUSCODES

250-PIPELINING

250-CHUNKING

250 SMTPUTF8

MAIL FROM:<info@speakforme.in>

250 2.1.0 OK r27si8425622qtj.52 - gsmtp

RCPT TO:<jackerhack@gmail.com>

250 2.1.5 OK r27si8425622qtj.52 - gsmtp

DATA

354 Go ahead r27si8425622qtj.52 - gsmtp

From: SpeakForMe <info@speakforme.in>

To: Kiran Jonnalagadda <jackerhack@gmail.com>

Date: Sat, 16 Dec 2017 10:41:00 +0530

Subject: Test email

This is a test email to demonstrate how SMTP works.

.

421-4.7.0 [35.153.240.239 15] Our system has detected that this message is

421-4.7.0 suspicious due to the very low reputation of the sending IP address.

421-4.7.0 To protect our users from spam, mail sent from your IP address has

421-4.7.0 been temporarily rate limited. Please visit

421 4.7.0 https://support.google.com/mail/answer/188131 for more information. r27si8425622qtj.52 - gsmtp

Connection closed by foreign host.

Things of note here:

The modern protocol is Extended SMTP (ESMTP), which is backward compatible. SMTP conversations began with a HELO, while ESMTP uses EHLO.

SMTP had three-digit status codes (220, 250, 354 and 421 above) while ESMTP provides a longer code with additional detail (2.1.0, 2.1.5 and 4.7.0 here) that the sender can read to understand how to respond.

2xx status codes imply everything is okay.

4xx codes imply temporary failures such as a full mailbox. In SMTP parlance, this is called a “soft fail” or a “soft bounce”.

5xx codes imply permanent failures, like an unknown mailbox. This is a “hard fail” or a “hard bounce”.

There was no authentication of the sender. I opened a connection from my computer and claimed to be sending from info@speakforme.in. No login and password required. This is a legacy from the early internet, when goodwill was assumed, and is why spam is such a huge problem in email.

The From and To addresses appear twice, first as part of the SMTP conversation, second in the DATA section. The first part is called the “envelope header” and the second is the “email header”. What you see in your email client is only the second. BCC is when you put a recipient in the envelope but skip them in the email header.

Gmail refused to accept this email with SMTP code 421, ESMTP 4.7.0. From their documentation, this means: “Our system has detected an unusual rate of unsolicited mail originating from your IP address. To protect our users from spam, mail sent from your IP address has been temporarily blocked.”
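To make the status code rules above concrete, here is a minimal Python sketch that splits a server reply into its three-digit code, its optional enhanced code and its text, then labels it as okay, a soft fail or a hard fail. The names parse_reply and classify are made up for this illustration and are not part of mxsniff or any library.

```python
import re
from typing import NamedTuple, Optional


class Reply(NamedTuple):
    code: int                 # three-digit SMTP code, e.g. 421
    enhanced: Optional[str]   # enhanced status code, e.g. "4.7.0", if present
    text: str                 # human-readable text, continuation lines joined


# "250-..." marks a continuation line, "250 ..." marks the final line.
_LINE = re.compile(r"^(\d{3})(?:[ -](.*))?$")
_ENHANCED = re.compile(r"^([245]\.\d{1,3}\.\d{1,3})\s+(.*)$")


def parse_reply(raw: str) -> Reply:
    """Parse a (possibly multi-line) SMTP reply as seen in the transcript."""
    code = None
    enhanced = None
    parts = []
    for line in raw.splitlines():
        match = _LINE.match(line)
        if not match:
            raise ValueError(f"not an SMTP reply line: {line!r}")
        code = int(match.group(1))
        rest = match.group(2) or ""
        found = _ENHANCED.match(rest)
        if found:
            enhanced, rest = found.group(1), found.group(2)
        parts.append(rest)
    if code is None:
        raise ValueError("empty reply")
    return Reply(code, enhanced, " ".join(parts))


def classify(code: int) -> str:
    """Map the first digit of the code to the categories described above."""
    if 200 <= code < 300:
        return "ok"
    if 300 <= code < 400:
        return "intermediate"  # e.g. 354, the server is waiting for more input
    if 400 <= code < 500:
        return "soft fail"
    if 500 <= code < 600:
        return "hard fail"
    return "unknown"
```

Feeding it the last two lines of Gmail's refusal above would yield code 421 with enhanced code 4.7.0, which classify labels a soft fail.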

Summary: in SMTP, the sender is not required to authenticate themselves, but the recipient will make a guess about whether it wants to accept email from you. This is similar to how credit cards work: anyone who has your card details can take your money, and the only way a bank can tell the difference between a legitimate transaction and fraud is by making an informed guess.
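The envelope and email header distinction noted above, including how BCC works, can be sketched with Python's standard email package. This is only an illustration: build_delivery is a hypothetical helper, and the addresses in the usage note are placeholders.

```python
from email.message import EmailMessage
from typing import List, Sequence, Tuple


def build_delivery(
    sender: str,
    to: Sequence[str],
    bcc: Sequence[str],
    subject: str,
    body: str,
) -> Tuple[str, List[str], EmailMessage]:
    """Return (envelope sender, envelope recipients, message).

    The envelope values are what would go into MAIL FROM and RCPT TO.
    The message is what would be sent after DATA.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(to)
    msg["Subject"] = subject
    msg.set_content(body)
    # BCC recipients are named in the envelope only, never in the headers.
    envelope_recipients = list(to) + list(bcc)
    return sender, envelope_recipients, msg
```

Calling build_delivery("info@speakforme.in", ["mp@example.org"], ["hidden@example.org"], "Test email", "Hello") gives an envelope with two recipients, while the message text itself mentions only mp@example.org. The three values could then be handed to smtplib.SMTP.sendmail, passing msg.as_string() as the message.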

Verifying email addresses

How does this detail help with verifying an email address? Notice these two lines from the exchange:

RCPT TO:<jackerhack@gmail.com>

250 2.1.5 OK r27si8425622qtj.52 - gsmtp

We supplied an email address, and Gmail accepted it. What if we give it an obviously incorrect address?

RCPT TO:<example@gmail.com>

550-5.1.1 The email account that you tried to reach does not exist. Please try

550-5.1.1 double-checking the recipient's email address for typos or

550-5.1.1 unnecessary spaces. Learn more at

550 5.1.1 https://support.google.com/mail/?p=NoSuchUser q2si2682540qki.196 - gsmtp

Now we’re on to something. Gmail is telling us whether the email address exists before it makes that guess on whether it wants to receive an email from us. We could drop the connection right there, armed with this confirmation, without sending the email. If a mailbox is full, Gmail tells you that too:



552-5.2.2 The email account that you tried to reach is over quota. Please direct

552-5.2.2 the recipient to

552 5.2.2 https://support.google.com/mail/?p=OverQuotaPerm q2si2682540qki.196 - gsmtp
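The idea of asking about a recipient and then hanging up can be sketched with Python's standard smtplib. This is a rough illustration under stated assumptions, not how mxsniff implements its probe: probe_address is a hypothetical name, the caller is assumed to already know the recipient domain's mail server (such as gmail-smtp-in.l.google.com above), and the smtp_factory parameter exists only so the function can be exercised without a network.

```python
import smtplib
from typing import Callable, Tuple


def probe_address(
    mx_host: str,
    address: str,
    mail_from: str,
    helo_name: str,
    smtp_factory: Callable[..., smtplib.SMTP] = smtplib.SMTP,
    timeout: float = 30.0,
) -> Tuple[int, str]:
    """Ask mx_host whether it would accept mail for address.

    Returns the server's (code, text) reply to RCPT TO, or to MAIL FROM
    if the server already refused at that step. No DATA is ever sent,
    so no email is delivered.
    """
    smtp = smtp_factory(mx_host, 25, timeout=timeout)
    try:
        smtp.ehlo(helo_name)
        code, text = smtp.mail(mail_from)
        if code < 400:
            # The sender was accepted, so name the recipient.
            code, text = smtp.rcpt(address)
        return code, text.decode("utf-8", "replace")
    finally:
        # Drop the connection without sending a message.
        try:
            smtp.quit()
        except (smtplib.SMTPException, OSError):
            pass
```

A 250 reply suggests the mailbox exists, and a 550 or 552 like the ones above suggests it does not or is full. Many networks block outbound port 25, and the server's opinion of the sending IP address affects the answer, so a refusal to connect says nothing about the address itself.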

Other email providers like Microsoft Outlook do a background check when you name the recipient, refusing to answer if they don’t like you. Still others do it before accepting the connection at all. We found that if we did the asking from a server that the recipient server had no reason to distrust, we got reasonably reliable answers.

For those familiar with SMTP, note “reasonably”. SMTP has complex rules, so an acceptance or rejection is probabilistic rather than deterministic. But it was good enough for our purposes, as we had only a few hundred addresses and could cross-check by hand.

Armed with this lucky insight, I wrote a tool last week to automate the verification. (It’s an added “probe” feature on the mxsniff tool I wrote for something related, so the version you need is on GitHub and not yet on PyPI.)

Using mxsniff has saved us literally hours of effort correcting our data. It’s one of many such technical innovations that helped us launch a nationwide campaign less than a week after conceiving it.

mxsniff is open source, so anyone can use it. Even UIDAI could use it to clean Aadhaar’s database of the many invalid email addresses it will have collected over the years.

As an open source project, it is also open to contributions from others. The probe feature, for instance, is the product of an evening’s efforts, and is inefficient when checking multiple addresses at the same domain since it makes a new connection for each, instead of reusing the connection. Anybody with sufficient technical competence can make this improvement and receive the benefit for themselves, and share it with others too.
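One way the connection reuse described above might look, as a sketch using Python's standard smtplib rather than mxsniff's actual code: group the addresses by domain, then name every recipient for a domain within a single connection. The function names are hypothetical, the mail server for each domain is assumed to be looked up elsewhere, and smtp_factory is only there to allow testing without a network.

```python
import smtplib
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_domain(addresses: Iterable[str]) -> Dict[str, List[str]]:
    """Group addresses by the (lowercased) domain after the last '@'."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for address in addresses:
        domain = address.rsplit("@", 1)[1].lower()
        groups[domain].append(address)
    return dict(groups)


def probe_domain(
    mx_host: str,
    addresses: Iterable[str],
    mail_from: str,
    helo_name: str,
    smtp_factory: Callable[..., smtplib.SMTP] = smtplib.SMTP,
    timeout: float = 30.0,
) -> Dict[str, int]:
    """Probe several addresses at one mail server over a single connection.

    Returns a mapping of address to the SMTP code the server gave for it.
    """
    results: Dict[str, int] = {}
    smtp = smtp_factory(mx_host, 25, timeout=timeout)
    try:
        smtp.ehlo(helo_name)
        mail_code, _ = smtp.mail(mail_from)
        for address in addresses:
            if mail_code >= 400:
                # The sender was refused, so no recipient can be tested.
                results[address] = mail_code
            else:
                rcpt_code, _ = smtp.rcpt(address)
                results[address] = rcpt_code
    finally:
        # Hang up without ever sending DATA.
        try:
            smtp.quit()
        except (smtplib.SMTPException, OSError):
            pass
    return results
```

Servers may limit how many recipients they will answer for in one connection, so a real implementation would probably need to batch the addresses and treat a sudden run of refusals as inconclusive.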

We wish UIDAI had such an open culture of engagement, instead of forcing us to petition our MPs to intervene on our behalf.