HTML5 AppCache

Application Caches allow website authors to specify that portions of their websites should be stored on the disk and made available even if the user is offline. The mechanism is controlled by cache manifests that outline the rules for storing and retrieving cache items within the app.

Similarly to implicit browser caching, AppCaches make it possible to store unique, user-dependent data - be it inside the cache manifest itself, or inside the resources it requests. The resources are retained indefinitely and not subject to the browser’s usual cache eviction policies.

AppCache appears to occupy a netherworld between HTML5 storage mechanisms and the implicit browser cache. In some browsers, it is purged along with cookies and stored website data; in others, it is discarded only if the user opts to delete the browsing history and all cached documents.

Note: AppCache is likely to be succeeded by Service Workers; the privacy properties of both mechanisms are likely to be comparable.

Flash resource cache

Flash maintains its own internal store of resource files, which can be probed using a variety of techniques. In particular, the internal repository includes an asset cache, relied upon to store Runtime Shared Libraries signed by Adobe to improve applet load times. There is also Adobe Flash Access, a mechanism to store automatically acquired licenses for DRM-protected content.

As of this writing, these document caches do not appear to be coupled to any browser privacy settings and can only be deleted by making several independent configuration changes in the Flash Settings Manager UI on macromedia.com. We believe there is no global option to delete all cached resources or prevent them from being stored in the future.

Browsers other than Chrome appear to share Flash asset data across all installations and in private browsing modes, which may have consequences for users who rely on separate browser instances to maintain distinct online identities.

SDCH dictionaries - Removed from Chrome 59+

SDCH is a Google-developed compression algorithm that relies on the use of server-supplied, cacheable dictionaries to achieve compression rates considerably higher than what’s possible with methods such as gzip or deflate for several common classes of documents.

The site-specific dictionary caching behavior at the core of SDCH inevitably offers an opportunity for storing unique identifiers on the client: both the dictionary IDs (echoed back by the client using the Avail-Dictionary header), and the contents of the dictionaries themselves, can be used for this purpose, in a manner very similar to the regular browser cache.

In Chrome, the data did not persist across browser restarts; it was, however, shared between profiles and incognito modes, and was not deleted with other site data when the user requested such an operation. Google addressed this in bug 327783.

Other script-accessible storage mechanisms

Several other more limited techniques make it possible for JavaScript or other active content running in the browser to maintain and query client state, sometimes in a fashion that can survive attempts to delete all browsing and site data.

For example, it is possible to use window.name or sessionStorage to store persistent identifiers for a given window: if a user deletes all client state but does not close a tab that at some point in the past displayed a site determined to track the browser, re-navigation to any participating domain will allow the window-bound token to be retrieved and the new session to be associated with the previously collected data.

More obviously, the same is true for active JavaScript: any currently open JavaScript context is allowed to retain state even if the user attempts to delete local site data; this can be done not only by the top-level sites open in the currently-viewed tabs, but also by “hidden” contexts such as HTML frames, web workers, and pop-unders. This can happen by accident: for example, a running ad loaded in an <iframe> may remain completely oblivious to the fact that the user attempted to clear all browsing history, and keep using a session ID stored in a local variable in JavaScript. (In fact, in addition to JavaScript, Internet Explorer will also retain session cookies for the currently-displayed origins.)

Another interesting and often-overlooked persistence mechanism is the caching of RFC 2617 HTTP authentication credentials: once explicitly passed in a URL, the cached values may be sent on subsequent requests even after all the site data is deleted in the browser UI.
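To illustrate why URL-embedded credentials can double as an identifier: under the Basic scheme of RFC 2617, the user name and password are simply Base64-encoded into an Authorization header. The sketch below models only that encoding step, not any browser's credential cache; the host name tracker.example and the token id-12345 are made up for the example.

```python
import base64
from urllib.parse import urlsplit


def basic_auth_header(url: str) -> str:
    """Return the Basic Authorization header value implied by a URL's user-info part."""
    parts = urlsplit(url)
    user = parts.username or ""
    password = parts.password or ""
    # RFC 2617 Basic scheme: base64("user:password")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


# Hypothetical tracking URL: the "user name" carries the identifier.
header = basic_auth_header("https://id-12345:x@tracker.example/pixel.gif")
print(header)
```

Decoding the header on the server side recovers the same token on every request for which the browser replays the cached credentials.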

In addition to the cross-browser approaches discussed earlier in this document, there are also several proprietary APIs that can be leveraged to store unique identifiers on the client system. Interesting examples are the proprietary persistence behaviors in some versions of Internet Explorer, including the userData API.

Last but not least, a variety of other, less common plugins and plugin-mediated interfaces likely expose analogous methods for storing data on the client, but have not been studied in detail as a part of this write-up; an example of this may be the PersistenceService API in Java, or the DRM license management mechanisms within Silverlight.

Lower-level protocol identifiers

On top of the fingerprinting mechanisms associated with HTTP caching and with the purpose-built APIs available to JavaScript programs and plugin-executed code, modern browsers provide several network-level features that offer an opportunity to store or retrieve unique identifiers:

Origin Bound Certificates (aka ChannelID) were persistent self-signed certificates identifying the client to an HTTPS server, envisioned as the future of session management on the web. A separate certificate is generated for every newly encountered domain and reused for all connections initiated later on.



By design, OBCs function as unique and stable client fingerprints, essentially replicating the operation of authentication cookies; they are treated as “site and plug-in data” in Chrome, and can be removed along with cookies.



Uncharacteristically, sites can leverage OBC for user tracking without performing any actions that would be visible to the client: the ID can be derived simply by taking note of the cryptographic hash of the certificate automatically supplied by the client as a part of a legitimate SSL handshake.



ChannelID is currently suppressed in Chrome in “third-party” scenarios (e.g., for different-domain frames). NOTE: this feature and its successor, TLS Token Binding, were removed years ago.

The set of supported ciphersuites can be used to fingerprint a TLS/SSL handshake. Note that clients have been actively deprecating various ciphersuites in recent years, making this attack even more powerful.

In a similar fashion, two separate mechanisms within TLS - session identifiers and session tickets - allow clients to resume previously terminated HTTPS connections without completing a full handshake; this is accomplished by reusing previously cached data. These session resumption protocols provide a way for servers to identify subsequent requests originating from the same client for a short period of time.

HTTP Strict Transport Security is a security mechanism that allows servers to demand that all future connections to a particular host name need to happen exclusively over HTTPS, even if the original URL nominally begins with “http://”.



It follows that a fingerprinting server could set long-lived HSTS headers for a distinctive set of attacker-controlled host names for each newly encountered browser; this information could be then retrieved by loading faux (but possibly legitimately-looking) subresources from all the designated host names and seeing which of the connections are automatically switched to HTTPS.



In an attempt to balance security and privacy, any HSTS pins set during normal browsing were carried over to the incognito mode in Chrome; there is no propagation in the opposite direction, however. (Update: this behavior was changed in Chrome 64, such that Chrome won't use on-disk HSTS information for incognito requests.)

It is worth noting that leveraging HSTS for tracking purposes requires establishing log(n) connections to uniquely identify n users, which makes it relatively unattractive, except for targeted uses; that said, creating a smaller number of buckets may be a valuable tool for refining other imprecise fingerprinting signals across a very large user base.

Last but not least, virtually all modern browsers maintain internal DNS caches to speed up name resolution (and, in some implementations, to mitigate the risk of DNS rebinding attacks).



Such caches can be easily leveraged to store small amounts of information for a configurable amount of time; for example, with 16 available IP addresses to choose from, around 8-9 cached host names would be sufficient to uniquely identify every computer on the Internet. On the flip side, the value of this approach is limited by the modest size of browser DNS caches and the potential conflicts with resolver caching on ISP level.
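Both the HSTS and the DNS cache techniques amount to writing an identifier as a sequence of digits, one digit per host name: an HSTS host has two observable states (upgraded to HTTPS or not), while a cached DNS name can point to any of the available IP addresses. The following minimal sketch works through that arithmetic. The function names are ours, and the figure of 2^32 machines is an assumed stand-in for "every computer on the Internet".

```python
from typing import List


def slots_needed(population: int, states_per_slot: int) -> int:
    """Smallest number of host names whose combined states cover `population` IDs."""
    slots, capacity = 0, 1
    while capacity < population:
        capacity *= states_per_slot
        slots += 1
    return slots


def encode_id(identifier: int, states_per_slot: int, slots: int) -> List[int]:
    """Write `identifier` as `slots` digits in base `states_per_slot`, least significant first."""
    if not 0 <= identifier < states_per_slot ** slots:
        raise ValueError("identifier does not fit in the available slots")
    digits = []
    for _ in range(slots):
        identifier, digit = divmod(identifier, states_per_slot)
        digits.append(digit)
    return digits


def decode_id(digits: List[int], states_per_slot: int) -> int:
    """Rebuild the identifier from the per-host-name states observed later."""
    value = 0
    for digit in reversed(digits):
        value = value * states_per_slot + digit
    return value


ASSUMED_POPULATION = 2 ** 32  # assumption: roughly the size of the IPv4 address space

print(slots_needed(ASSUMED_POPULATION, 16))  # DNS cache: 16 candidate IPs per host name
print(slots_needed(ASSUMED_POPULATION, 2))   # HSTS: each host name is either pinned or not
```

Under these assumptions, 16 addresses per name give 8 host names (9 would cover 2^36 clients), in line with the estimate above, whereas the two-state HSTS channel needs one host name per bit of the identifier.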

Machine-specific characteristics

With the notable exception of Origin-Bound Certificates, the techniques described in section 1 of the document rely on a third-party website explicitly placing a new unique identifier on the client system. Another, less obvious approach to web tracking relies on querying or indirectly measuring the inherent characteristics of the client system. Individually, each such signal will reveal just several bits of information - but when combined together, it seems probable that they may uniquely identify almost any computer on the Internet. In addition to being harder to detect or stop, such techniques could be used to cross-correlate user activity across various browser profiles or private browsing sessions. Furthermore, because the techniques are conceptually very distant from HTTP cookies, the authors find it difficult to decide how, if at all, the existing cookie-centric privacy controls in the browser should be used to govern such practices.

EFF Panopticlick is one of the most prominent experiments demonstrating the principle of combining low-value signals into a high-accuracy fingerprint; there is also some evidence of sophisticated passive fingerprints being used by commercial tracking services.

Browser-level fingerprints

The most straightforward approach to fingerprinting is to construct identifiers by actively and explicitly combining a range of individually non-identifying signals available within the browser environment:

User-Agent string, identifying the browser version, OS version, and some of the installed browser add-ons.



(In cases where User-Agent information is not available or imprecise, browser versions can be usually inferred very accurately by examining the structure of other headers and by testing for the availability and semantics of the features introduced or modified between releases of a particular browser.)

Clock skew and drift: unless synchronized with an external time source, most systems exhibit clock drift that, over time, produces a fairly unique time offset for every machine. Such offsets can be measured with microsecond precision using JavaScript. In fact, even in the case of NTP-synchronized clocks, ppm-level skews may be possible to measure remotely.

Fairly fine-grained information about the underlying CPU and GPU, either as exposed directly (GL_RENDERER) or as measured by executing JavaScript benchmarks and testing for driver- or GPU-specific differences in WebGL rendering or the application of ICC color profiles to <canvas> data.

Screen and browser window resolutions, including parameters of secondary displays for multi-monitor users.

The window-manager- and addon-specific “thickness” of the browser UI in various settings (e.g., window.outerHeight - window.innerHeight).

The list and ordering of installed system fonts - enumerated directly or inferred with the help of an API such as getComputedStyle.

The list of all installed plugins, ActiveX controls, and Browser Helper Objects, including their versions - queried or brute-forced through navigator.plugins[]. (Some add-ons also announce their existence in HTTP headers.)

Information about installed browser extensions and other software. While the set cannot be directly enumerated, many extensions include web-accessible resources that aid in fingerprinting. In addition to this, add-ons such as popular ad blockers make detectable modifications to viewed pages, revealing information about the extension or its configuration. Using browser “sync” features may result in these characteristics being identical for a given user across multiple devices. A similar but less portable approach specific to Internet Explorer allows websites to enumerate locally installed software by attempting to load DLL resources via the res:// pseudo-protocol.

Random seeds reconstructed from the output of non-cryptosafe PRNGs (e.g., Math.random(), multipart form boundaries, etc.). In some browsers, the PRNG is initialized only at startup, or reinitialized using values that are system-specific (e.g., based on system time or PID).

According to the EFF, their Panopticlick experiment - which combines only a relatively small subset of the actively-probed signals discussed above - is able to uniquely identify 95% of desktop users based on system-level metrics alone. Current commercial fingerprinters are reported to be considerably more sophisticated and their developers might be able to claim significantly higher success rates.

Of course, the value of some of the signals discussed here will be diminished on mobile devices, where both the hardware and the software configuration tend to be more homogeneous; for example, measuring window dimensions or the list of installed plugins offers very little data on most Android devices. Nevertheless, we feel that the remaining signals - such as clock skew and drift and the network-level and user-specific signals described later on - are together likely more than sufficient to uniquely identify virtually all users.

When discussing potential mitigations, it is worth noting that restrictions such as disallowing the enumeration of navigator.plugins[] generally do not prevent fingerprinting; the set of all notable plugins and fonts ever created and distributed to users is relatively small and a malicious script can conceivably test for every possible value in very little time.
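The principle of combining individually weak signals can be sketched in a few lines. The example below is an illustration only, not a description of Panopticlick or of any commercial product: it hashes a canonicalized set of signals into one identifier, and adds up the information each signal contributes, assuming (unrealistically) that the signals are statistically independent. The signal names and population shares are hypothetical.

```python
import hashlib
import json
import math
from typing import Dict


def fingerprint(signals: Dict[str, str]) -> str:
    """Hash a set of signals into one stable identifier, independent of key order."""
    canonical = json.dumps(signals, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def surprisal_bits(share_of_population: float) -> float:
    """Bits revealed by a value that only `share_of_population` of users exhibit."""
    return -math.log2(share_of_population)


def combined_bits(shares: Dict[str, float]) -> float:
    """Upper-bound estimate: bits add up only if the signals are independent."""
    return sum(surprisal_bits(share) for share in shares.values())


# Hypothetical observations for one browser and the (made-up) share of users
# showing the same value for each signal.
observed = {"user_agent": "ExampleBrowser/1.0", "fonts": "A,B,C", "screen": "1920x1080"}
shares = {"user_agent": 1 / 1024, "fonts": 1 / 4096, "screen": 1 / 32}

print(fingerprint(observed))
print(combined_bits(shares))
```

Because real-world signals are correlated (a given OS implies a given font set, for instance), the summed figure overstates the identifying power and is best read as an upper bound.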

Network configuration fingerprints

An interesting set of additional device characteristics is associated with the architecture of the local network and the configuration of lower-level network protocols; such signals are disclosed independently of the design of the web browser itself. The traits covered here are generally shared between all browsers on a given client and cannot be easily altered by common privacy-enhancing tools or practices; they include:

The external client IP address. For IPv6 addresses, this vector is even more interesting: in some settings, the last octets may be derived from the device's MAC address and preserved across networks.

A broad range of TCP/IP and TLS stack fingerprints, obtained with passive tools such as p0f. The information disclosed on this level is often surprisingly specific: for example, TCP/IP traffic will often reveal high-resolution system uptime data through TCP timestamps.

Ephemeral source port numbers for outgoing TCP/IP connections, generally selected sequentially by most operating systems.

The local network IP address for users behind network address translation or HTTP proxies (via WebRTC). Combined with the external client IP, internal NAT IP uniquely identifies most users, and is generally stable for desktop browsers (due to the tendency for DHCP clients and servers to cache leases).

Information about proxies used by the client, as detected from the presence of extra HTTP headers (Via, X-Forwarded-For). This can be combined with the client’s actual IP address revealed when making proxy-bypassing connections using one of several available methods.

With active probing, the list of open ports on the local host indicating other installed software and firewall settings on the system. Unruly actors may also be tempted to probe the systems and services in the visitor’s local network; doing so directly within the browser will circumvent any firewalls that normally filter out unwanted incoming traffic.
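The remark about IPv6 addresses in the list above refers to interface identifiers built with the modified EUI-64 scheme, where the 48-bit MAC address is split in half, the bytes ff:fe are inserted in the middle, and one bit of the first octet is flipped. The following sketch reverses that construction; the function name is ours, the sample addresses use the documentation prefix 2001:db8::/32, and the check is a heuristic, since many systems use randomized interface identifiers instead.

```python
import ipaddress
from typing import Optional


def mac_from_ipv6(address: str) -> Optional[str]:
    """Return the MAC address embedded in a modified EUI-64 interface ID, if any."""
    interface_id = ipaddress.IPv6Address(address).packed[8:]  # low 64 bits
    # Modified EUI-64 places ff:fe between the two halves of the MAC address.
    if interface_id[3:5] != b"\xff\xfe":
        return None
    mac = bytearray(interface_id[:3] + interface_id[5:])
    mac[0] ^= 0x02  # undo the inverted universal/local bit
    return ":".join(f"{octet:02x}" for octet in mac)


# The same device seen behind two different network prefixes.
print(mac_from_ipv6("2001:db8::211:22ff:fe33:4455"))
print(mac_from_ipv6("2001:db8:1:2:211:22ff:fe33:4455"))
```

Both calls yield the same hardware address even though the network prefix differs, which is what allows the identifier to follow a device across networks.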





User-dependent behaviors and preferences

In addition to trying to uniquely identify the device used to browse the web, some parties may opt to examine characteristics that aren’t necessarily tied to the machine, but that are closely associated with specific users, their local preferences, and the online behaviors they exhibit. Similarly to the methods described in section 2, such patterns would persist across different browser sessions, profiles, and across the boundaries of private browsing modes. The following data is typically open to examination:

Preferred language, default character encoding, and local time zone (sent in HTTP headers and visible to JavaScript).

Data in the client cache and history. It is possible to detect items in the client’s cache by performing simple timing attacks; for any long-lived cache items associated with popular destinations on the Internet, a fingerprinter could detect their presence simply by measuring how quickly they load (and by aborting the navigation if the latency is greater than expected for local cache).



(It is also possible to directly extract URLs stored in the browsing history, although such an attack requires some user interaction in modern browsers.)

Mouse gesture, keystroke timing and velocity patterns, and accelerometer readings (ondeviceorientation) that are unique to a particular user or to particular surroundings. There is a considerable body of scientific research suggesting that even relatively trivial interactions are deeply user-specific and highly identifying.

Any changes to default website fonts and font sizes, website zoom level, and the use of any accessibility features such as text color, size, or CSS overrides (all indirectly measurable with JavaScript).

The state of client features that can be customized or disabled by the user, with special emphasis on mechanisms such as DNT, third-party cookie blocking, changes to DNS prefetching, pop-up blocking, Flash security and content storage, and so on. (In fact, users who extensively tweak their settings from the defaults may be actually making their browsers considerably easier to uniquely fingerprint.)

On top of this, user fingerprinting can be accomplished by interacting with third-party services through the user’s browser, using the ambient credentials (HTTP cookies) maintained by the browser:

Users logged into websites that offer collaboration features can be de-anonymized by covertly instructing their browser to navigate to a set of distinctively ACLed resources and then examining which of these navigation attempts result in a new collaborator showing up in the UI.

Request timing, onerror and onload handlers, and similar measurement techniques can be used to detect which third-party resources return HTTP 403 error codes in the user’s browser, thus constructing an accurate picture of which sites the user is logged into; in some cases, finer-grained insights into user settings or preferences on the site can be obtained, too.



(A similar but possibly more versatile login-state attack can be also mounted with the help of Content Security Policy, a new security mechanism introduced in modern browsers.)

Any of the explicit web application APIs that allow identity attestation may be leveraged to confirm the identity of the current user (typically based on a starting set of probable guesses).

Fingerprinting prevention and detection challenges

In a world with no possibility of fingerprinting, web browsers would be indistinguishable from each other, with the exception of a small number of robustly compartmentalized and easily managed identifiers used to maintain login state and implement other essential features in response to user’s intent.

In practice, the Web is very different: browser tracking and fingerprinting are attainable in a large number of ways. A number of the unintentional tracking vectors are a product of implementation mistakes or oversights that could be conceivably corrected today; many others are virtually impossible to fully rectify without completely changing the way that browsers, web applications, and computer networks are designed and operated. In fact, some of these design decisions might have played an unlikely role in the success of the Web.

In lieu of eliminating the possibility of web tracking, some have raised hope of detecting use of fingerprinting in the online ecosystem and bringing it to public attention via technical means through browser- or server-side instrumentation. Nevertheless, even this simple concept runs into a number of obstacles:

Some fingerprinting techniques simply leave no remotely measurable footprint, thus precluding any attempts to detect them in an automated fashion.

Most other fingerprinting and tagging vectors are used in fairly evident ways, but could be easily redesigned so that they are practically indistinguishable from unrelated types of behavior. This would frustrate any programmatic detection strategies in the long haul, particularly if they are attempted on the client (where the party seeking to avoid detection can reverse-engineer the checks and iterate until the behavior is no longer flagged as suspicious).

The distinction between behaviors that may be acceptable to the user and ones that might not is hidden from view: for example, a cookie set for abuse detection looks the same as a cookie set to track online browsing habits. Without a way to distinguish between the two and properly classify the observed behaviors, tracking detection mechanisms may provide little real value to the user.

