Uniqueness of human mobility

In 1930, Edmond Locard showed that 12 points are needed to uniquely identify a fingerprint [30]. Our unicity test estimates the number of points p needed to uniquely identify the mobility trace of an individual. The fewer points needed, the more unique the traces are and the easier they would be to re-identify using outside information. For re-identification purposes, outside observations could come from any publicly available information, such as an individual's home address, workplace address, or geo-localized tweets or pictures. To the best of our knowledge, this is the first quantification of the uniqueness of human mobility traces with random points in a sparse, simply anonymized mobility dataset of the scale of a small country.

Given I_p, a set of p spatio-temporal points, and D, a simply anonymized mobility dataset, we evaluate ε, the uniqueness of traces, by extracting from D the subset of trajectories S(I_p) that match the p points composing I_p [see Methods]. A trace is unique if |S(I_p)| = 1, that is, if the subset contains only one trace. For example, in Fig. 2A, we evaluate the uniqueness of traces given I_{p=2}. The two spatio-temporal points contained in I_{p=2} are zone I from 9am to 10am and zone II from 12pm to 1pm. The red and the green traces both satisfy I_{p=2}, so neither of them is unique. However, we can also evaluate the uniqueness of traces knowing I_{p=3}, adding as a third point zone III between 3pm and 4pm. In this case |S(I_{p=3})| = 1, uniquely characterizing the green trace. A lower bound on the risk of deductive disclosure of a user's identity is given by the uniqueness of his mobility trace, that is, the likelihood that this brute-force characterization succeeds.
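As an illustration, the matching step can be sketched in Python. This is a minimal sketch under stated assumptions, not the procedure given in Methods: the encoding of a point as a (zone, starting hour) pair, the function names `matching_traces` and `is_unique`, and the toy data (including the extra zones IV and V and the blue trace) are all hypothetical and only loosely mirror Fig. 2A.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical encoding: a point is (zone label, starting hour of a one-hour window).
Point = Tuple[str, int]


def matching_traces(dataset: Dict[str, Set[Point]],
                    known_points: Iterable[Point]) -> Set[str]:
    """Return the ids of the traces in D that contain every point of I_p, i.e. S(I_p)."""
    known = set(known_points)
    return {uid for uid, trace in dataset.items() if known <= trace}


def is_unique(dataset: Dict[str, Set[Point]],
              known_points: Iterable[Point]) -> bool:
    """A set of points uniquely characterizes a trace when |S(I_p)| = 1."""
    return len(matching_traces(dataset, known_points)) == 1


# Toy dataset (made up for this sketch).
D = {
    "red":   {("I", 9), ("II", 12), ("IV", 15)},
    "green": {("I", 9), ("II", 12), ("III", 15)},
    "blue":  {("V", 9), ("II", 12)},
}

I_2 = [("I", 9), ("II", 12)]      # zone I at 9am-10am, zone II at 12pm-1pm
I_3 = I_2 + [("III", 15)]         # adding zone III at 3pm-4pm

print(matching_traces(D, I_2))    # red and green both match
print(is_unique(D, I_3))          # only green matches all three points
```

With two points the red and green traces are both compatible; adding the third point leaves only the green trace.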

Figure 2 (A) I_{p=2} means that the information available to the attacker consists of two spatio-temporal points (I and II). In this case, the target was in zone I between 9am and 10am and in zone II between 12pm and 1pm. In this example, the traces of two anonymized users (red and green) are compatible with the constraints defined by I_{p=2}. The subset S(I_{p=2}) contains more than one trace, so the traces are not unique. However, the green trace would be uniquely characterized if a third point, zone III between 3pm and 4pm, were added (I_{p=3}). (B) The uniqueness of traces with respect to the number p of given spatio-temporal points (I_p). The green bars represent the fraction of unique traces, i.e. |S(I_p)| = 1. The blue bars represent the fraction of traces with |S(I_p)| ≤ 2. Knowing as few as four spatio-temporal points taken at random (I_{p=4}) is therefore enough to uniquely characterize 95% of the traces amongst 1.5 M users. (C) Box-plot of the minimum number of spatio-temporal points needed to uniquely characterize every trace on the non-aggregated database. At most eleven points are enough to uniquely characterize all considered traces.

Our dataset contains 15 months of mobility data for 1.5 M people, a significant and representative part of the population of a small European country and roughly the same number of users as the location-based service Foursquare® [31]. Just as with smartphone applications or electronic payments, the mobile phone operator records the interactions of the user with his phone. This creates a comparable longitudinally sparse and discrete database [Fig. 3]. On average, 114 interactions per user per month are recorded across the nearly 6500 antennas. Antennas in our database are distributed throughout the country and serve, on average, ~2000 inhabitants each, covering areas ranging from 0.15 km² in cities to 15 km² in rural areas. The number of antennas is strongly correlated with population density (R² = 0.6426) [Fig. 3C]. The same is expected from businesses, places in location-based social networks, or WiFi hotspots.

Figure 3 (A) Probability density function of the number of recorded spatio-temporal points per user during a month. (B) Probability density function of the median inter-interaction time with the service. (C) The number of antennas per region is correlated with its population (R² = 0.6426). These plots strongly emphasize the discrete character of our dataset and its similarities with datasets such as the ones collected by smartphone apps.

Fig. 2B shows the fraction of unique traces (ε) as a function of the number of available points p. Four randomly chosen points are enough to uniquely characterize 95% of the users (ε > .95), whereas two randomly chosen points still uniquely characterize more than 50% of the users (ε > .5). This shows that mobility traces are highly unique and can therefore be re-identified using little outside information.
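The estimate of ε for a given p can be sketched as a simple sampling procedure. The sketch below is one possible reading, under stated assumptions: it draws a single random set of p points per user and counts how often that set matches exactly one trace. The function name `estimate_uniqueness`, the point encoding, and the toy data are hypothetical; the sampling scheme actually used is described in Methods and may differ.

```python
import random
from typing import Dict, Set, Tuple

# Hypothetical encoding: a point is (zone label, starting hour).
Point = Tuple[str, int]


def estimate_uniqueness(dataset: Dict[str, Set[Point]], p: int, seed: int = 0) -> float:
    """Estimate the fraction of traces uniquely characterized by p random points."""
    rng = random.Random(seed)
    # Only traces with at least p points can supply p known points.
    eligible = [uid for uid, trace in dataset.items() if len(trace) >= p]
    if not eligible:
        raise ValueError("no trace has at least p points")
    unique = 0
    for uid in eligible:
        # Draw I_p at random from this user's own trace.
        known = set(rng.sample(sorted(dataset[uid]), p))
        # Count the traces compatible with I_p, i.e. |S(I_p)|.
        matches = sum(1 for trace in dataset.values() if known <= trace)
        if matches == 1:
            unique += 1
    return unique / len(eligible)


# Toy dataset (made up): users a and b are indistinguishable, c is not.
toy = {
    "a": {("I", 9), ("II", 12)},
    "b": {("I", 9), ("II", 12)},
    "c": {("I", 9), ("III", 12)},
}
print(estimate_uniqueness(toy, p=2))
```

In this toy dataset only user c can be singled out with two points, so the estimate for p = 2 is one third. On real data the inner loop would need an index rather than a full scan.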

Scaling properties

Nonetheless, ε depends on the spatial and temporal resolution of the dataset. Here, we determine this dependence by lowering the resolution of our dataset through spatial and temporal aggregation [Fig. 1C]. We do this by increasing the size of a region, aggregating neighbouring cells into clusters of v cells, or by reducing the dataset's temporal resolution, increasing the length of the observation time window to h hours [see Methods]. Both of these aggregations are bound to decrease ε and therefore make re-identification harder.
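The effect of the two aggregations on a single trace can be sketched as follows. This is only an illustration under assumptions: points are encoded as (antenna id, hour index), the mapping `cluster_of` from antennas to clusters is taken as given (how neighbouring cells are grouped is described in Methods and is not reproduced here), and the name `coarsen` is hypothetical.

```python
from typing import Dict, Set, Tuple

# Hypothetical encoding: a point is (antenna id, hour index since the start of the data).
Point = Tuple[int, int]


def coarsen(trace: Set[Point], cluster_of: Dict[int, int], h: int) -> Set[Point]:
    """Replace each antenna by its cluster and each hour by its h-hour window."""
    if h < 1:
        raise ValueError("h must be at least 1 hour")
    return {(cluster_of[antenna], hour // h) for antenna, hour in trace}


# Made-up example: four antennas grouped into clusters of v = 2.
cluster_of = {0: 0, 1: 0, 2: 1, 3: 1}
identity = {i: i for i in range(4)}      # no spatial aggregation (v = 1)

a = {(0, 8), (2, 12)}
b = {(1, 9), (3, 13)}

print(coarsen(a, identity, 1) == coarsen(b, identity, 1))      # distinct at full resolution
print(coarsen(a, cluster_of, 2) == coarsen(b, cluster_of, 2))  # identical once coarsened
```

Two traces that are distinct at full resolution collapse onto the same coarsened trace, which is why aggregation can only lower ε.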

Fig. 4A shows how the uniqueness of mobility traces ε decreases as the spatial and temporal resolution of the data is lowered. This reduction, however, is quite gradual. Given four points (p = 4), we find that ε > .5 when using a resolution of h = 5 hours and v = 5 antennas.

Figure 4 Uniqueness of traces (ε) when we lower the resolution of the dataset with (A) p = 4 and (D) p = 10 points. It is easier to attack a dataset that is coarse on one dimension and fine along another than a medium-grained dataset along both dimensions. Given four spatio-temporal points, more than 60% of the traces are uniquely characterized in a dataset with an h = 15-hour temporal resolution while less than 40% of the traces are uniquely characterized in a dataset with a temporal resolution of h = 7 hours and with clusters of v = 7 antennas. The region covered by an antenna ranges from 0.15 km² in urban areas to 15 km² in rural areas. (B–C) When lowering the temporal or the spatial resolution of the dataset, the uniqueness of traces decreases as a power function ε = α − x^β. (E) While ε decreases according to a power function, its exponent β decreases linearly with the number of points p. Accordingly, a few additional points might be all that is needed to identify an individual in a dataset with a lower resolution.

Statistically, we find that traces are more unique when coarse on one dimension and fine along another than when they are medium-grained along both dimensions. Indeed, given four points, ε > .6 in a dataset with a temporal resolution of h = 15 hours or a spatial resolution of v = 15 antennas, while ε < .4 in a dataset with a temporal resolution of h = 7 hours and a spatial resolution of v = 7 antennas [Fig. 4A].

Next, we show that it is possible to find one formula to estimate the uniqueness of traces given both the spatial and temporal resolution of the data and the number of points available to an outside observer. Fig. 4B and 4C show that the uniqueness of a trace decreases as the power function ε = α − x^β, for decreases in both the spatial and temporal resolution (x) and for all considered p = 4, 6, 8 and 10 (see Table S1). The uniqueness of human mobility can thus be expressed using the single formula ε = α − (vh)^β. We find that this power function fits the data better than other two-parameter functions such as an exponential α − exp(λx), a stretched exponential α − exp(x^β), or a standard linear function α − βx (see Table S1). Both estimators for α and β are highly significant (p < 0.001) [32] and the mean pseudo-R² is 0.98 for the I_{p=4} case and the I_{p=10} case. The fit is good at all levels of spatial and temporal aggregation [Fig. S3A–B].
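To make the fitting step concrete, the following sketch fits ε = α − x^β by least squares using only the Python standard library. It is not the estimation procedure behind Table S1: the grid search over β, the function name `fit_power_law`, and the synthetic data (generated from made-up values α = 1.9 and β = 0.13) are assumptions chosen for illustration. For a fixed β the least-squares α has the closed form mean(ε + x^β), so only β needs to be searched.

```python
from typing import Optional, Sequence, Tuple


def fit_power_law(x: Sequence[float], eps: Sequence[float],
                  beta_grid: Optional[Sequence[float]] = None) -> Tuple[float, float]:
    """Least-squares fit of eps = alpha - x**beta; returns (alpha, beta)."""
    if len(x) != len(eps) or len(x) < 2:
        raise ValueError("need at least two (x, eps) pairs of equal length")
    if beta_grid is None:
        beta_grid = [i / 1000 for i in range(1, 1001)]   # 0.001, 0.002, ..., 1.000
    best = None
    for beta in beta_grid:
        powers = [xi ** beta for xi in x]
        # For fixed beta, the optimal alpha is the mean of eps + x**beta.
        alpha = sum(e + pw for e, pw in zip(eps, powers)) / len(x)
        sse = sum((e - (alpha - pw)) ** 2 for e, pw in zip(eps, powers))
        if best is None or sse < best[0]:
            best = (sse, alpha, beta)
    return best[1], best[2]


# Synthetic, noise-free data (not values from the paper).
true_alpha, true_beta = 1.9, 0.13
xs = [1, 2, 4, 8, 16, 32]                 # x stands for the aggregation level
eps_values = [true_alpha - xi ** true_beta for xi in xs]

alpha_hat, beta_hat = fit_power_law(xs, eps_values)
print(alpha_hat, beta_hat)
```

On noise-free synthetic data the search recovers the generating parameters. With real measurements a dedicated nonlinear least-squares routine would also provide the standard errors needed for the significance tests mentioned above.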

The power-law dependency of ε means that, on average, each time the spatial or temporal resolution of the traces is divided by two, their uniqueness decreases by a constant factor ~ 2^(−β). This implies that privacy is increasingly hard to gain by lowering the resolution of a dataset.

Fig. 2B shows that, as expected, ε increases with p. The mitigating effect of p on ε is mediated by the exponent β which decays linearly with p: β = 0.157 − 0.007p [Fig. 4E]. The dependence of β on p implies that a few additional points might be all that is needed to identify an individual in a dataset with a lower resolution. In fact, given four points, a two-fold decrease in spatial or temporal resolution makes it 9.3% less likely to identify an individual, while given ten points, the same two-fold decrease results in a reduction of only 6.2% (see Table S1).
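The arithmetic behind this comparison can be sketched directly from the two relations stated in the text, the linear fit β = 0.157 − 0.007p and the halving factor 2^(−β). The function names are hypothetical. Because the percentages quoted above are attributed to Table S1, the linear approximation should be expected to give somewhat different values (roughly 8.6% for p = 4 and 5.9% for p = 10), while showing the same trend.

```python
def beta_of_p(p: int) -> float:
    """Exponent beta from the linear fit reported in the text (Fig. 4E)."""
    return 0.157 - 0.007 * p


def halving_factor(p: int) -> float:
    """Factor ~2**(-beta) by which uniqueness changes per two-fold loss of resolution."""
    return 2 ** (-beta_of_p(p))


for p in (4, 6, 8, 10):
    factor = halving_factor(p)
    # The relative reduction is 1 - factor, printed here as a percentage.
    print(f"p = {p:2d}  beta = {beta_of_p(p):.3f}  reduction = {100 * (1 - factor):.1f}%")
```

The reduction shrinks as p grows, which is the sense in which a few additional points can offset a loss of resolution.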

Because of the functional dependency of ε on p through the exponent β, mobility datasets are likely to be re-identifiable using information on only a few outside locations.