Next, we detail how the data is collected and aggregated (high-level sketch in Fig. 1).

Fig. 1 The Tesco Grocery 1.0 data collection process. We obtain the nutritional properties of an area’s typical product by: (i) considering the purchases made with fidelity cards at tills (raw data); (ii) mapping those purchases to their nutritional properties (as per product labels); and (iii) collating the nutritional properties at area-level (based on the areas the fidelity cards were sent to), and averaging those properties out. This results in the nutritional properties of each area’s typical product.

Purchase record collection

Given the use of Clubcards, the information about the purchase of each individual product is stored in the following anonymized form: {customer area, GTIN, timestamp}, where the area is where the customer lives, and the GTIN is the Global Trade Item Number, which companies use to uniquely identify their trade items globally. The purchase record is then joined with a product database that is maintained by Tesco and populated with the information that the producers of the food items provide. The facts printed on the nutrition label [40] are: {total energy, net weight, fats, saturated fats, carbohydrates, free sugars, proteins, fibers}. For drinks only, the volume and the relative volume of alcohol are also reported.
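The join between purchase records and the product database can be illustrated with a minimal sketch. The area codes, GTINs, field names, and the choice to drop purchases whose GTIN is missing from the database are all assumptions made for illustration; the source does not specify them.

```python
# Hypothetical purchase records in the anonymized {area, GTIN, timestamp} form.
purchases = [
    {"area": "AREA_A", "gtin": "00000000000017", "timestamp": "2015-03-02T10:15:00"},
    {"area": "AREA_B", "gtin": "00000000000024", "timestamp": "2015-03-02T11:40:00"},
    {"area": "AREA_A", "gtin": "00000000000099", "timestamp": "2015-03-03T09:05:00"},
]

# Hypothetical product database keyed by GTIN (field names are assumptions).
product_db = {
    "00000000000017": {"kcal": 250.0, "grams": 100.0, "category": "sweets"},
    "00000000000024": {"kcal": 150.0, "grams": 300.0, "category": "dairy"},
}


def join_purchases(purchases, product_db):
    """Attach label information to each purchase, keyed on the GTIN."""
    joined = []
    for record in purchases:
        label = product_db.get(record["gtin"])
        if label is None:
            continue  # assumption: purchases with an unknown GTIN are dropped
        joined.append({**record, **label})
    return joined


joined = join_purchases(purchases, product_db)
```

Each joined record carries both the area of the purchaser and the label facts of the product, which is all the later area-level aggregation needs.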

The total energy is expressed in kilocalories (kcal). The nutrients are expressed in grams (g) and represent the total weight of that nutrient in the product. The weight of fat comprises that of saturated fats, and the weight of carbohydrates comprises that of sugar. We compute the total weight of alcohol by multiplying the total volume by the relative volume of alcohol and by the density of alcohol. For example, there are 75 grams of alcohol in a 750 ml bottle of wine with 12.5% alcohol by volume \(\left(750\,\text{ml}\cdot 0.125\cdot 0.8\,\frac{\text{g}}{\text{ml}}=75\,\text{g}\right)\). To obtain the calorie intake at the nutrient level, we use the conversion factors set by EU directive 90/496/EEC [41], which is implemented by all EU countries and is widely adopted in nutrition studies [42]. That is, we map grams into the corresponding calories by multiplying them by these fixed factors: 9 kcal per gram for fats, 7 kcal per gram for alcohol, 4 kcal per gram for proteins and carbohydrates, and 2 kcal per gram for fibers. Our value of kcal is then equal to the total calories reported on the product label (when available), rounded to the closest unit.
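The alcohol-weight computation and the gram-to-kcal conversion can be sketched as follows. The factors and the rounded density of 0.8 g/ml are the ones quoted above; the function and key names are illustrative assumptions.

```python
# Energy conversion factors in kcal per gram, as quoted in the text.
KCAL_PER_GRAM = {"fat": 9, "alcohol": 7, "protein": 4, "carb": 4, "fiber": 2}

# Rounded density of alcohol used in the worked example (g/ml).
ALCOHOL_DENSITY_G_PER_ML = 0.8


def alcohol_grams(volume_ml, abv_percent):
    """Weight of alcohol: volume x relative volume of alcohol x density."""
    return volume_ml * (abv_percent / 100.0) * ALCOHOL_DENSITY_G_PER_ML


def nutrient_kcal(grams_by_nutrient):
    """Map grams of each nutrient to kcal with the fixed factors."""
    return {n: g * KCAL_PER_GRAM[n] for n, g in grams_by_nutrient.items()}


# The 750 ml bottle of wine at 12.5% alcohol by volume from the text.
wine_alcohol_g = alcohol_grams(750, 12.5)
wine_alcohol_kcal = nutrient_kcal({"alcohol": wine_alcohol_g})["alcohol"]
```

Under these factors the 75 g of alcohol in the bottle contribute about 525 kcal (75 g times 7 kcal per gram).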

The GTIN is also associated with a product category. The categorization includes 17 non-overlapping classes: fruit & vegetables; grains (e.g., bread, rice, pasta); red meat (e.g., pork, beef); poultry; fish; dairy (e.g., milk, cheese); eggs; fats & oils (e.g., butter, olive oil); sweets (e.g., chocolate, candies); readymade items (e.g., pre-cooked meal); sauces (e.g., tomato sauce, soups); tea & coffee; soft drinks (e.g., carbonated sodas); bottled water; beer; wine; spirits. All items have been manually labeled and validated according to this categorization.

Overall, our dataset is created starting from all the 422,453,987 individual food items purchased from January 1st 2015 to December 31st 2015 by Clubcard owners whose residence area is within Greater London’s boundaries. The dataset contains 67,296 unique products (distinct GTINs) purchased at least once.

Geographic areas and census statistics

We aggregate individual purchase records at area-level using a variety of geographic resolutions: LSOA (Lower Layer Super Output Area), MSOA (Middle Layer Super Output Area), Ward, and LA (Local Authority or, more informally, Borough).

The choice of these specific aggregation levels is motivated by three main factors. First, they are spatial partitions defined by the Office for National Statistics (ONS, https://www.ons.gov.uk). Many publicly available official statistics are provided at the level of these areas and, consequently, most geographic studies on UK data use them as reference geographic units. Second, the ONS provides the mapping between different levels of spatial aggregation [43], which facilitates the aggregation of our raw data to coarser partitions. The mapping is always exact except at Ward level, where some higher-granularity areas might lie across the boundaries of multiple Wards; in those cases, the ONS applies standard guidelines to provide a best-fitting match. Finally, aggregating the data at multiple granularities will enable a wider range of studies that might benefit from having either a higher number of smaller areas containing fewer datapoints (but still big enough to preserve anonymity) or a lower number of areas characterized by more robust statistics. In the dataset, for each area, we report a number of basic census statistics collected by the ONS [44] in 2015, including: population, number of males, number of females, number of residents aged [0–17], number of residents aged [18–64], number of residents aged 65+, average age, surface area, and density of residents. Table 1 summarizes the average of some of these statistics.
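Rolling fine-grained totals up to a coarser partition through a lookup table can be sketched as below. The area identifiers and the lookup are hypothetical stand-ins for the ONS mapping, and the approach of carrying sums and counts (rather than averaging averages) is an assumption about one reasonable way to do it.

```python
from collections import defaultdict


def roll_up(totals_by_fine_area, fine_to_coarse):
    """Sum per-area totals into the coarser areas given by the lookup."""
    coarse_totals = defaultdict(float)
    for area, total in totals_by_fine_area.items():
        coarse_totals[fine_to_coarse[area]] += total
    return dict(coarse_totals)


# Hypothetical LSOA -> MSOA lookup (stand-in for the ONS mapping).
lsoa_to_msoa = {"LSOA_1": "MSOA_X", "LSOA_2": "MSOA_X", "LSOA_3": "MSOA_Y"}

# Roll up the numerator (total kcal) and denominator (item count) separately,
# then recompute the typical product's energy at the coarser level.
kcal_sum = roll_up({"LSOA_1": 1000.0, "LSOA_2": 500.0, "LSOA_3": 800.0}, lsoa_to_msoa)
item_count = roll_up({"LSOA_1": 4, "LSOA_2": 2, "LSOA_3": 4}, lsoa_to_msoa)
energy = {area: kcal_sum[area] / item_count[area] for area in kcal_sum}
```

Recomputing the ratio from rolled-up sums keeps the coarse-level average weighted by the number of items, which averaging the fine-level averages would not.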

Table 1 Statistics for the areas corresponding to the three spatial aggregations. For each spatial aggregation, the number of areas, the average surface, and the average numbers of residents are reported.

Aggregation of food purchases to areas

Given a purchase made by a customer living in a given area a, we map the purchase to that area. The same aggregation procedure is applied for all geographical levels, so we use the generic term area to denote census areas at any level, from LSOA to Borough. For each area, we provide three sets of variables, described below.

Penetration

The first group of variables expresses the Tesco penetration in an area. For each area a, we report the total number of products purchased (quantity(a)). To indicate how representative the customer base is of the overall population in the area, we compute the ratio between the number of unique customers and the number of residents, as recorded by the census:

$$representativeness(a)=customers(a)/population(a).$$ (1)

In the dataset, we provide the min-max normalized versions of these values, which can be used to filter out areas whose user base differs the most from the census statistics (see Technical Validation Section):

$$\mathit{representativeness}^{\mathit{norm}}(a)=\frac{\mathit{representativeness}(a)-\min\{\mathit{representativeness}\}}{\max\{\mathit{representativeness}\}-\min\{\mathit{representativeness}\}}.$$ (2)
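Equations 1 and 2 can be illustrated with a short sketch. The area names and counts are made up, and returning 0.0 when all areas share the same value is an assumption added to avoid a division by zero.

```python
def representativeness(customers, population):
    """Eq. 1: unique customers divided by census residents, per area."""
    return {area: customers[area] / population[area] for area in customers}


def min_max_normalize(values):
    """Eq. 2: rescale the per-area values to the [0, 1] interval."""
    lowest, highest = min(values.values()), max(values.values())
    span = highest - lowest
    # Assumption: if every area has the same value, map all of them to 0.0.
    return {area: ((v - lowest) / span if span else 0.0) for area, v in values.items()}


# Hypothetical counts for three areas.
customers = {"A": 100, "B": 300, "C": 200}
population = {"A": 1000, "B": 1000, "C": 1000}

rep = representativeness(customers, population)
rep_norm = min_max_normalize(rep)
```

The area with the lowest ratio maps to 0 and the one with the highest to 1, so a threshold on the normalized value can be applied uniformly across aggregation levels.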

We also report the number of days with at least one purchase (purchase-dates(a) ∈ [0; 365]). To provide an estimate of how often residents of area a go shopping, we supply the number of days on which a Clubcard was used, summed over all Clubcards in area a (man-day(a)).
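The two day-based counters can be sketched as below. The input is assumed to be a list of (card identifier, date) pairs for a single area; such card-level identifiers are hypothetical here, since the records described in this section carry only the area, the GTIN, and the timestamp.

```python
def penetration_days(card_uses):
    """Return (purchase_dates, man_day) for one area.

    card_uses: iterable of (card_id, date_string) pairs, one per purchase.
    """
    distinct_pairs = set(card_uses)  # one entry per card per day
    purchase_dates = len({day for _, day in distinct_pairs})  # days with >= 1 purchase
    man_day = len(distinct_pairs)  # card-days, summed over all cards
    return purchase_dates, man_day


# Hypothetical uses: card c1 shops twice on 5 Jan and once on 9 Jan, c2 once on 5 Jan.
uses = [
    ("c1", "2015-01-05"),
    ("c1", "2015-01-05"),
    ("c2", "2015-01-05"),
    ("c1", "2015-01-09"),
]
purchase_dates, man_day = penetration_days(uses)
```

Several purchases by the same card on the same day count once, so man-day(a) grows with the number of distinct shopping days rather than with basket size.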

Nutrients

The second group captures the nutritional properties of the food purchased. Because the number of people consuming the food purchased with a Clubcard is unknown (e.g., singles vs. families), we cannot produce reliable estimates of per-capita food purchases. Therefore, we describe each area in terms of its typical product, whose nutritional values are the average over all the food products bought by the area residents.

We first report the typical food product’s weight (not including drinks) and the typical drink product’s volume:

$$weight(a)=\frac{{\sum }_{p\in {P}_{a}}grams(p)}{| {P}_{a}| }$$ (3)

$$volume(a)=\frac{{\sum }_{p\in {P}_{a}}liters(p)}{| {P}_{a}| },$$ (4)

where \(P_{a}\) is the set of all food products purchased by the residents of area a, and p is one of such products.

The energy intake of food is measured in calories. To capture the energy intake, we compute the total calories contained in the typical product (i.e., the average number of calories across all products purchased by area residents):

$$energy(a)=\frac{{\sum }_{p\in {P}_{a}}kcal(p)}{| {P}_{a}| },$$ (5)

where kcal(p) is the value of kilocalories in p.

Given two foods with the same amount of calories but different energy concentrations, the delivery of pleasure within people’s brains is quicker for the calorie-dense food. To capture the density of calories rather than simple calorie counts, we compute:

$$\mathit{energy\text{-}density}(a)=\frac{\sum_{p\in P_{a}}\mathit{kcal}(p)}{\sum_{p\in P_{a}}\mathit{grams}(p)},$$ (6)

which reflects the concentration of calories in the area’s typical product.
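The typical product's weight, energy, and energy density (Eqs. 3, 5, and 6) can be sketched as follows. The product records and field names are illustrative assumptions.

```python
def typical_product(products):
    """Average weight and energy, plus energy density, over purchased items.

    products: list of dicts with 'grams' and 'kcal', one entry per item bought.
    """
    n = len(products)
    total_grams = sum(p["grams"] for p in products)
    total_kcal = sum(p["kcal"] for p in products)
    return {
        "weight": total_grams / n,                   # Eq. 3
        "energy": total_kcal / n,                    # Eq. 5
        "energy_density": total_kcal / total_grams,  # Eq. 6: ratio of sums
    }


# Two hypothetical items: a dense 100 g product and a light 300 g product.
area_items = [{"grams": 100.0, "kcal": 250.0}, {"grams": 300.0, "kcal": 150.0}]
stats = typical_product(area_items)
```

Note that Eq. 6 is a ratio of sums: here it gives 1.0 kcal/g, whereas averaging the two per-item densities (2.5 and 0.5 kcal/g) would give 1.5 kcal/g.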

Not all calories are created equal, though: food contains different types of nutrients that the body processes in different ways to produce energy and extract structural material for its growth and maintenance. The food nutrients we consider are: fats, saturated fats, carbohydrates (carbs), sugar, proteins, fibers, and alcohol. For each area, we compute the grams of each individual nutrient contained in the typical product:

$$\mathit{nutrient}_{i}(a)=\frac{\sum_{p\in P_{a}}\mathit{grams}(\mathit{nutrient}_{i}(p))}{|P_{a}|},$$ (7)

where \(\mathit{grams}(\mathit{nutrient}_{i}(p))\) is the weight in grams of nutrient i in product p. We calculate the typical product’s energy content given by each of these nutrients as:

$$\mathit{energy\text{-}nutrient}_{i}(a)=\frac{\sum_{p\in P_{a}}\mathit{kcal}(\mathit{nutrient}_{i}(p))}{|P_{a}|},$$ (8)

where \(\mathit{kcal}(\mathit{nutrient}_{i}(p))\) is the energy intake given by that nutrient in product p.
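Equations 7 and 8 can be sketched per nutrient as below. For simplicity the example covers only the five nutrients whose conversion factors are stated explicitly in the text; the field names and the treatment of a missing nutrient as 0 g are assumptions.

```python
# Fixed conversion factors (kcal per gram) stated in the text.
KCAL_PER_GRAM = {"fat": 9, "alcohol": 7, "protein": 4, "carb": 4, "fiber": 2}


def nutrient_profile(products):
    """Return (grams, kcal) of each nutrient in the typical product."""
    n = len(products)
    grams = {
        nutrient: sum(p.get(nutrient, 0.0) for p in products) / n  # Eq. 7
        for nutrient in KCAL_PER_GRAM
    }
    # Eq. 8: with fixed per-gram factors, the average kcal of a nutrient
    # equals the factor times its average grams.
    kcal = {nutrient: grams[nutrient] * KCAL_PER_GRAM[nutrient] for nutrient in grams}
    return grams, kcal


# Two hypothetical items with nutrient weights in grams.
items = [
    {"fat": 10.0, "protein": 5.0},
    {"carb": 20.0, "fiber": 4.0},
]
grams, kcal = nutrient_profile(items)
```

Because the factors are constants, averaging kcal per item and converting the averaged grams are equivalent; the sketch uses the second form to keep a single pass over the items.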

Going beyond individual nutrients, one could study their composite impact. We therefore capture the diversity of nutrients contained in the typical product, computed as the Shannon entropy H of two distributions: the relative weights of the nutrients (Eqs. 9 and 11) and the calories given by the nutrients (Eqs. 10 and 12):

$$f_{\mathit{nutrient}_{i}}(a)=\frac{\mathit{nutrient}_{i}(a)}{\sum_{j}\mathit{nutrient}_{j}(a)},$$ (9)

$$f_{\mathit{energy\text{-}nutrient}_{i}}(a)=\frac{\mathit{energy\text{-}nutrient}_{i}(a)}{\sum_{j}\mathit{energy\text{-}nutrient}_{j}(a)},$$ (10)

$$H_{\mathit{nutrients}}(a)=-\sum_{j}f_{\mathit{nutrient}_{j}}(a)\cdot \log_{2}f_{\mathit{nutrient}_{j}}(a),$$ (11)

$$H_{\mathit{energy\text{-}nutrients}}(a)=-\sum_{j}f_{\mathit{energy\text{-}nutrient}_{j}}(a)\cdot \log_{2}f_{\mathit{energy\text{-}nutrient}_{j}}(a).$$ (12)

Both values of entropy are also expressed in a normalized form with values bounded in [0,1]. These are obtained by dividing the entropy values by the maximum entropy, calculated as \(\log_{2}\)(number of distinct nutrients).
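The entropy and its normalized form can be sketched as follows. The input is a list of non-negative per-nutrient values (grams or kcal); skipping zero entries follows the usual convention that 0 · log 0 = 0, which is an assumption about how empty nutrients are handled.

```python
import math


def shannon_entropy(values):
    """Entropy in bits of the distribution obtained by normalizing `values`."""
    total = sum(values)
    fractions = [v / total for v in values if v > 0]  # convention: 0 * log(0) = 0
    return -sum(f * math.log2(f) for f in fractions)


def normalized_entropy(values):
    """Entropy divided by its maximum, log2(number of entries); needs >= 2 entries."""
    return shannon_entropy(values) / math.log2(len(values))


# Four equally represented nutrients give the maximum entropy of 2 bits.
uniform_h = shannon_entropy([1.0, 1.0, 1.0, 1.0])
uniform_h_norm = normalized_entropy([1.0, 1.0, 1.0, 1.0])
# A typical product made of a single nutrient has zero diversity.
single_h = shannon_entropy([4.0, 0.0, 0.0, 0.0])
```

The normalized value makes areas comparable regardless of how many nutrients enter the distribution: 1 means perfectly even shares, 0 means a single nutrient dominates entirely.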

Product categories

We compute the probability distribution of items belonging to the 17 different product categories being purchased in area a and the entropy of that distribution:

$$f_{\mathit{category}_{i}}(a)=\frac{\sum_{p\in P_{a}}\mathit{category}_{i}(p)}{|P_{a}|},$$ (13)

$$H_{\mathit{category}}(a)=-\sum_{j}f_{\mathit{category}_{j}}(a)\cdot \log_{2}f_{\mathit{category}_{j}}(a),$$ (14)

where \(\mathit{category}_{i}(p)\) is an indicator function set to 1 if product p belongs to category i, and to 0 otherwise. We then calculate the relative weight of products belonging to any category compared to the total weight (this is done only for food products, not for drinks). We also calculate the entropy (and, as for nutrients, the normalized entropy) of these relative weights:

$$f_{\mathit{category\text{-}weight}_{i}}(a)=\frac{\sum_{p\in P_{a}}\mathit{grams}(\mathit{category}_{i}(p))}{\mathit{weight}(a)},$$ (15)

$$H_{\mathit{category\text{-}weight}}(a)=-\sum_{j}f_{\mathit{category\text{-}weight}_{j}}(a)\cdot \log_{2}f_{\mathit{category\text{-}weight}_{j}}(a).$$ (16)
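The count-based category distribution and its entropy (Eqs. 13 and 14) can be sketched as below. The list of purchased categories is made up, and the sketch covers only the count-based variant, not the weight-based one.

```python
import math
from collections import Counter

NUM_CATEGORIES = 17  # number of product categories in the dataset


def category_distribution(categories):
    """Eq. 13: share of purchased items falling in each category."""
    counts = Counter(categories)
    n = len(categories)
    return {category: count / n for category, count in counts.items()}


def entropy_bits(distribution):
    """Eq. 14: Shannon entropy (in bits) of a distribution given as a dict."""
    return -sum(f * math.log2(f) for f in distribution.values() if f > 0)


# Hypothetical categories of four items purchased in one area.
bought = ["wine", "wine", "fish", "eggs"]
shares = category_distribution(bought)
h_category = entropy_bits(shares)
h_category_norm = h_category / math.log2(NUM_CATEGORIES)
```

Categories that are never purchased in an area simply do not appear in the distribution and contribute nothing to the entropy, while the normalization still uses all 17 categories as the maximum.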

Biases and limitations

Representativeness

The sample of people whose purchases are reflected in this dataset is very large but not random, as it is a set of self-selected people who decided to shop at Tesco and to opt in for a Clubcard subscription. Therefore, the set of Clubcard owners considered is not representative of the overall population in terms of demographics, socio-economic factors, or spatial distribution. However, we provide guidelines on how to filter the data to increase robustness (see Technical Validation Section).

Coverage

As we will show in the Technical Validation section, the concentration of Tesco stores is higher in the northern part of London. As a result, some areas of the city exhibit low penetration. In the dataset, we provide the information to identify low-coverage areas and filter them out if needed.

Limited scope

The number of purchases considered is very large but by no means covers the full scope of food consumption habits of London residents. Naturally, this dataset does not reflect food consumption in restaurants. Also, it does not cover the grocery purchases made in other chains or stores and, even among Tesco customers, it does not include information on online purchases or on products purchased by people who do not own a Clubcard. In short, although this dataset reveals trends in food consumption habits at area level, it does not represent the full picture of daily food consumption.

Average product

From our data, there is no way of estimating the diet of individual customers. Therefore, the nutrient values are provided as averages over all the items purchased by the residents of an area. In other words, we represent the nutritional features of the hypothetical average product consumed in the area. This type of representation is appropriate for some types of analysis but introduces limitations to any study that requires an average representation at the level of individuals, rather than at the level of geographical area.