In this post I discuss a new paper that will appear at PETS 2018, authored by myself, Jeffrey Han, and Arvind Narayanan.

What happens when you open an email and allow it to display embedded images and pixels? You may expect the sender to learn that you’ve read the email, and which device you used to read it. But in a new paper we find that privacy risks of email tracking extend far beyond senders knowing when emails are viewed. Opening an email can trigger requests to tens of third parties, and many of these requests contain your email address. This allows those third parties to track you across the web and connect your online activities to your email address, rather than just to a pseudonymous cookie.

Illustrative example. Consider an email from the deals website LivingSocial (see details of the example email). When the email is opened, the email client will make requests to 24 third parties across 29 third-party domains.[1] A total of 10 third parties receive an MD5 hash of the user’s email address, including major data brokers Datalogix and Acxiom. Nearly all of the third parties (22 of the 24) set or receive cookies with their requests. In a webmail client the cookies are the same browser cookies used to track users on the web, and indeed many major web trackers (including domains belonging to Google, comScore, Adobe, and AOL) are loaded when the email is opened. While this example email has a large number of trackers relative to the average email in our corpus, the majority of emails (70%) embed at least one tracker.

How it works. Email tracking is possible because modern graphical email clients allow rendering a subset of HTML. JavaScript is invariably stripped, but embedded images and stylesheets are allowed. These are downloaded and rendered by the email client when the user views the email.[2] Crucially, many email clients, and almost all web browsers, in the case of webmail, send third-party cookies with these requests. The email address is leaked by being encoded as a parameter into these third-party URLs.
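
To make the mechanism concrete, here is a minimal sketch of how a sender could embed a hashed address in a remote image URL. The host `tracker.example`, the parameter name `e`, and the lowercase-and-trim normalization are assumptions made for illustration; they are not taken from any specific tracker.

```python
import hashlib
from urllib.parse import urlencode


def email_md5(address: str) -> str:
    """MD5 hex digest of a trimmed, lowercased address (normalization is assumed)."""
    return hashlib.md5(address.strip().lower().encode("utf-8")).hexdigest()


def tracking_pixel_tag(address: str,
                       base_url: str = "https://tracker.example/pixel.gif") -> str:
    """Build a 1x1 image tag whose URL carries the hashed address.

    The host and the parameter name "e" are hypothetical.
    """
    query = urlencode({"e": email_md5(address)})
    return f'<img src="{base_url}?{query}" width="1" height="1" alt="">'


if __name__ == "__main__":
    print(tracking_pixel_tag("Alice@Example.com"))
```

When a client that loads remote content renders this tag, it requests the URL, and the third party receives the hash along with any cookies the client sends.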

Measuring email tracking at scale. To understand the privacy implications of viewing and interacting with emails we assembled a collection of messages from mailing lists on the top sites.[3] Using OpenWPM, a web measurement platform developed at Princeton, we simulated a user opening each email and clicking links from within a webmail client that loads remote content. We found that 85% of emails in our corpus contain embedded third-party content, and 70% contain resources categorized as trackers by popular tracking-protection lists. Many of these third parties, including 7 of the top 10, also have a large web presence.

When “anonymous” web tracking isn’t. About 29% of emails leak the user’s email address to at least one third party when the email is opened, and about 19% of senders sent at least one email that had such a leak. The majority of these leaks (62%) are intentional.[4] If the leaked email address is associated with a tracking cookie, as it would be in many webmail clients, the privacy risk to users is greatly amplified. Since a tracking cookie can be shared with traditional web trackers, the email address can allow those trackers to link tracking profiles from before and after a user clears their cookies. If a user reads their email on multiple devices, trackers can use that address as an identifier to link tracking data cross-device.
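
One simple way to spot such leaks is to search each request URL for known encodings of the address. The sketch below is a simplified illustration, not the detection methodology from the paper; the set of encodings, the function names, and the example URLs are all assumptions.

```python
import base64
import hashlib
from urllib.parse import quote

# Encodings whose match must respect letter case.
CASE_SENSITIVE = {"base64"}


def email_encodings(address: str) -> dict[str, str]:
    """Return a few common encodings of an address (an illustrative, incomplete set)."""
    raw = address.strip().lower()
    data = raw.encode("utf-8")
    return {
        "plaintext": raw,
        "urlencoded": quote(raw, safe=""),
        # Padding is stripped so padded and unpadded forms both match.
        "base64": base64.b64encode(data).decode("ascii").rstrip("="),
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def find_leaks(address: str, request_urls: list[str]) -> list[tuple[str, str]]:
    """Return (url, encoding_name) pairs for URLs that contain the address."""
    needles = email_encodings(address)
    leaks = []
    for url in request_urls:
        for name, needle in needles.items():
            if name in CASE_SENSITIVE:
                hit = needle in url
            else:
                hit = needle.lower() in url.lower()
            if hit:
                leaks.append((url, name))
    return leaks
```

A real detector would also need to handle nested or chained encodings, which this sketch ignores.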

Most of the top leak recipients, including LiveIntent, Acxiom, Conversant Media, and Neustar, are involved in “people-based” marketing. These third parties receive leaked email addresses from between 24 and 68 of the 902 email senders studied. People-based marketing is defined by Acxiom as “the ability to perform targeting and measurement at the level of real people, not just devices, by resolving identity across digital and offline channels.” In other words, it is a term used to describe a set of services which allow marketers to use tracking data collected across any of a user’s devices, as well as offline data, to target that user on any of their devices. This could include offline data such as purchases made using a loyalty card at a grocery store, if that data is associated with the purchaser’s email address (or a hash of it).

While our data does not let us measure how the companies use leaked email addresses they receive when a user views an email, we can get some insight into potential uses by examining their product pages. The marketing materials and privacy policies of the four companies mentioned above detail their use of email addresses for cross-device targeting and/or data onboarding products.[5]

Are leaks of hashed email addresses less of a privacy concern? In many cases the leaked email address is hashed; in fact, 68% of all leaks which occur while viewing emails are hashed, one-third of which also include the domain portion of the email address in plaintext. Hashed email is considered by some leak recipients to not be personally identifying information.[6]

From a computer science perspective, the claim that a hashed email address is not personally identifying is patently false. When user records in a database are keyed by hashed email address, looking up the record for a given email address is trivial: simply hash it first and look it up (indeed, this is the whole point of storing hashed email addresses at all). What if you have data associated with a hash of an unknown email address and want to recover the original address? It’s surprisingly easy: you can rent a multi-GPU virtual machine for $14.40 an hour,[7] which gives you 73 billion MD5 hash computations per second based on published benchmarks. Modern password-guessing methods have gotten really good at enumerating plausible sequences of characters and numbers in passwords, and we believe these methods will extend to email addresses. If they do, it would mean that email address hashes can be broken much more efficiently than through brute forcing (i.e., trying all possible combinations of characters). We posit that with a trillion guesses, at a cost of about 6 US cents, it should be possible to enumerate the majority of email addresses in use.
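
The arithmetic behind that estimate, and the idea of recovering an address by enumerating candidates, can be sketched as follows. The hash rate and hourly price are the figures quoted in this post; the candidate lists and function names are hypothetical, and a real attack would use a far larger, smarter candidate generator running on GPUs rather than this CPU loop.

```python
import hashlib
from itertools import product

HASHES_PER_SECOND = 73e9   # MD5 rate quoted in the post
DOLLARS_PER_HOUR = 14.40   # hourly price quoted in the post


def guessing_cost_usd(guesses: float) -> float:
    """Cost of computing `guesses` MD5 hashes at the quoted rate and price."""
    seconds = guesses / HASHES_PER_SECOND
    return seconds / 3600 * DOLLARS_PER_HOUR


def md5_hex(address: str) -> str:
    return hashlib.md5(address.encode("utf-8")).hexdigest()


def recover_email(target_hash: str, local_parts: list[str], domains: list[str]):
    """Try every local@domain candidate; return the match, or None if not found."""
    for local, domain in product(local_parts, domains):
        candidate = f"{local}@{domain}"
        if md5_hex(candidate) == target_hash:
            return candidate
    return None


if __name__ == "__main__":
    # A trillion guesses take roughly 13.7 seconds at the quoted rate.
    print(f"{guessing_cost_usd(1e12) * 100:.1f} US cents")
    leaked = md5_hex("bob@example.org")
    print(recover_email(leaked, ["alice", "bob"], ["example.com", "example.org"]))
```

The computed cost comes out slightly under the 6 cents quoted above, which is consistent with rounding up.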

Additional leaks occur when users click on links in emails. When an email link is clicked the URL is typically handed over to the user’s browser, or to a new tab in the user’s browser, in the case of webmail. Email addresses and other identifiers may be embedded in these links, and may ultimately cause the user’s email address to leak to third parties on the web. We found that about 11% of links contain requests that leak the user’s email address to a third party and about 12% of all emails contain such a link. The largest recipients of these leaks are Google, Facebook, and Twitter, and the top recipients overall are very similar to the top third-party trackers on the web.

Leaks in link clicks can also allow email trackers to work around privacy protections in email clients that strip cookies from remote resources (like Apple Mail) or in those that proxy remote resources (like Gmail). Since the clicked link is opened in the user’s browser, the tracker can make the explicit link between the user’s cookie and the leaked email address while the resulting page is loaded.

What can users do? All of the privacy risks discussed in our paper stem from remote resources, so users can use mail clients which support blocking images by default to completely avoid the problem. However, that can often result in emails which are unreadable; this is particularly true for marketing emails.

In Section 6.2 of the paper we survey 16 mail clients and find that a patchwork of privacy features is employed, but that no setup offers complete protection from the threats we identify. Mail clients that block cookies by default, like Apple Mail, offer some level of protection. In these clients it’s more difficult for a tracker to track users across mailing lists, since the mail client doesn’t provide a persistent identifier. The same is true for webmail clients which proxy images, like Gmail and Yandex. Content proxying has the added benefit of preventing a tracker from being able to link the browser’s cookies to any identifiers received when an email is opened.

Even with the defenses employed by the clients we studied, trackers which receive the user’s leaked email address will continue to be able to track and target users in these clients and on the web. As an example, LiveIntent’s marketing material reassures clients that it will continue to work in Gmail since “targeting is primarily based around the e-mail address’s [sic] MD5 hash”. Regardless of the defenses deployed by the client, control of tracking is handed off to the user’s browser when email links are clicked.

We found that the tracking protection lists EasyList and EasyPrivacy reduce the number of email leaks that occur when an email is viewed by 87%. Perhaps the best option for privacy-conscious users today is to use webmail and install tracking protection tools, such as uBlock Origin or Ghostery. Users who want to use a standalone client must find one which supports privacy extensions; of the clients we studied, the only one that supports such extensions is Thunderbird. Having tracking protection tools installed in the browser will also provide protection when email links are clicked. In Section 7 of the paper we prototyped a server-side filtering feature which uses the tracking protection lists to filter the HTML body of emails before they reach the user. We found it to be nearly as effective as a tracking blocker running in the user’s browser.
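
The idea of server-side filtering can be illustrated with a short sketch that drops remote resources hosted on blocked domains before the email body is delivered. This is not the prototype from the paper: the domain set below is hypothetical, and real lists such as EasyList and EasyPrivacy use a much richer rule syntax than a plain set of domains.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical blocklist; a real filter would load rules from a tracking-protection list.
BLOCKED_DOMAINS = {"tracker.example", "ads.example"}


def is_blocked(url: str) -> bool:
    """True if the URL's host is a blocked domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)


class TrackerFilter(HTMLParser):
    """Re-emit the HTML, omitting <img> and <link> tags that load blocked URLs."""

    RESOURCE_ATTRS = {"img": "src", "link": "href"}

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def _keep(self, tag, attrs):
        attr = self.RESOURCE_ATTRS.get(tag)
        url = dict(attrs).get(attr) if attr else None
        return not (url and is_blocked(url))

    def handle_starttag(self, tag, attrs):
        if self._keep(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if self._keep(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def filter_email_html(body: str) -> str:
    parser = TrackerFilter()
    parser.feed(body)
    parser.close()
    return "".join(parser.out)
```

Calling `filter_email_html` on a message body returns the same HTML minus any image or stylesheet tag that points at a blocked host; links that the user clicks are left untouched in this sketch.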

Data, code, and paper release

You can read the paper here. We are also releasing the code and data publicly, including all of the raw and parsed email bodies and crawls of all HTML emails. We hope that this dataset will spur additional research in this area.

Interested in hearing more from me? Follow me on Twitter @s_englehardt.

Thanks to Arvind Narayanan and Gunes Acar for their helpful comments on this blog post.

[1] The full list of third parties embedded in the LivingSocial example email given above is as follows:

Parties receiving an MD5 hash of the user’s email address: American List Counsel (alcmpn.com), LiveIntent (liadm.com), Datalogix (nexac.com), Acxiom (rlcdn.com, pippio.com, acxiom-online.com), Criteo (criteo.com, emailretargeting.com), Conversant Media (dotomi.com), V12 Data (v12group.com), VideoAmp (videoamp.com), Neustar (agkn.com), and alocdn.com. With the exception of emailretargeting.com and agkn.com all of the previous domains also set or receive cookies.

Additional parties setting or receiving cookies: MediaMath (mathtag.com), TapAd (tapad.com), IPONWEB (bidswitch.net), AOL (advertising.com), Centro (sitescout.com), The Trade Desk (adsrvr.org), Adobe (demdex.net), OpenX (openx.net), comScore (scorecardresearch.com, voicefive.com), Oracle (bluekai.com), Google (doubleclick.net), Realtime Targeting Aps (mojn.com).

Third-party domains requested without cookies or email hash: LiveIntent (licasd.com), Google (2mdn.net), Akamai (akamai.net).

[2] Unless they are proxied by the user’s email server; of the providers we studied (Section 6.2 in the paper), only Gmail and Yandex do so.

[3] Our email corpus was compiled by automatically signing up for mailing lists on the top 14,700 of the Alexa top 1 million sites, in addition to the Alexa top 500 shopping and top 500 news sites. In total, we received 12,618 emails from 902 senders.

[4] We classify the intentionality of leaks using the methodology detailed in Section 4.1 of the paper.

[5] LiveIntent’s marketing material touts the benefits of email-address-based tracking over cookies. In particular they highlight that email hash allows “Communication with clients across all screens and devices: Unlike the cookie, which represents an anonymous user, the email address represents a known customer. It’s unique to that individual, and remains persistent across all devices, apps and browsers.” Similarly, LiveIntent also explains how targeting users with hashed email addresses allows them to continue to serve targeted ads in Gmail despite Gmail’s image proxy.

Neustar’s privacy policy states: “[The onboarding process] allows advertisers to use their offline information about customer preferences (CRM data) … in the online environment. … We use de-identified information such as a hashed email address provided by our advertising client, to create a link between that de-identified CRM data and a Cookie ID, Mobile Advertising ID, or other persistent identifier assigned to a unique but de-identified user. That information can then be used to deliver targeted advertising…” and “We also create and store linkages between and among household or individual level identifiers such as Cookie IDs, Mobile Advertising IDs, hashed email addresses and/or other persistent IDs that have been assigned to a unique but de-identified user. This process is sometimes called ‘cross device linking’.”

Acxiom’s Data Service API supports data queries on an MD5 or SHA1 hash of an email address.

Conversant Media’s marketing material implies that they use email address, in addition to purchase data, to match user data across devices.

[6] For example, LiveIntent’s privacy policy states: “We may collect identifiers that are used by our advertising partners to identify a specific individual … To de-identify this information, either we or our business partners perform a mathematical process (commonly known as hashing) to convert the information into a code.”

[7] A GPU is a type of processor optimized for highly parallel tasks, and is typically used for graphics processing. GPUs can also very efficiently compute hashes. In this post, we provide price quotes for Amazon’s `p2.16xlarge` EC2 cloud instance.

Image assets from the Noun Project used in this post: “Browser” by Designify.me, “Database” by Aybige, “Image” by Alfa Design, “HTML File” by Burak Kucukparmaksiz, “Computer Tower” by Melvin.