Since the datasets come from various companies that have adopted different standards, their irregular spatial distributions are aggregated onto a grid with square cells. This allows comparisons between different areas and eases the geographical management of the data. Thus, the area of Milan is covered by a grid overlay of 10,000 squares (each about 235×235 meters), and Trentino by a grid overlay of 6,575 squares (see Fig. 2). This grid is projected with the WGS84 (EPSG:4326) standard.

Figure 2 The various grid systems employed in this project.

Call detail records

The Call Detail Records (CDRs) are provided by the Semantics and Knowledge Innovation Lab (SKIL) (http://jol.telecomitalia.com/jolskil/) of Telecom Italia. Every time a user engages in a telecommunication interaction, a Radio Base Station (RBS) is assigned by the operator and delivers the communication through the network. Then, a new CDR is created recording the time of the interaction and the RBS which handled it. From the RBS it is possible to obtain an indication of the user's geographical location, thanks to the coverage map $C_{map}$, which associates each RBS with the portion of territory that it serves (its coverage area, Fig. 3).

Figure 3 An example of coverage map of Milan.

In order to spatially aggregate the CDRs inside the grid, each interaction is associated with the coverage area $v$ of the RBS which handled it. Hence, the number of records $S_i(t)$ in a grid square $i$ at time $t$ is computed as follows:

$$S_i(t) = \sum_{v \in C_{map}} R_v(t) \, \frac{A_{v \cap i}}{A_v}$$

where $R_v(t)$ is the number of records in the coverage area $v$ at time $t$, $A_v$ is the surface of the coverage area $v$ and $A_{v \cap i}$ is the surface of the spatial intersection between $v$ and the square $i$.
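As a minimal sketch of the area-weighted allocation above, the following Python snippet distributes per-coverage-area record counts over grid squares. It assumes, purely for illustration, that coverage areas and grid squares are axis-aligned rectangles given as (xmin, ymin, xmax, ymax); real coverage areas are arbitrary polygons and would require a geometry library. The function names are hypothetical and are not taken from the released code.

```python
def rect_area(r):
    """Area of an axis-aligned rectangle (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = r
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)


def intersection_area(a, b):
    """Area of the overlap between two axis-aligned rectangles."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]),
               min(a[2], b[2]), min(a[3], b[3]))
    return rect_area(overlap)


def aggregate_to_grid(records, coverage_areas, grid_squares):
    """Compute S_i(t) = sum over v of R_v(t) * area(v intersect i) / area(v).

    records:        {coverage_id: R_v(t)} for a single time slot t
    coverage_areas: {coverage_id: rectangle}  (simplifying assumption)
    grid_squares:   {square_id: rectangle}
    """
    totals = {}
    for square_id, square in grid_squares.items():
        s = 0.0
        for v, count in records.items():
            area_v = rect_area(coverage_areas[v])
            if area_v > 0:
                s += count * intersection_area(coverage_areas[v], square) / area_v
        totals[square_id] = s
    return totals


# Toy example: one coverage area spanning two unit squares equally.
coverage = {"v1": (0.0, 0.0, 2.0, 1.0)}
grid = {1: (0.0, 0.0, 1.0, 1.0), 2: (1.0, 0.0, 2.0, 1.0)}
result = aggregate_to_grid({"v1": 10}, coverage, grid)
```

In this toy case the ten records of the coverage area are split evenly between the two squares, since each square contains half of the coverage surface.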

There are many types of CDRs and Telecom Italia has recorded the following activities:

Received SMS: a CDR is generated each time a user receives an SMS

Sent SMS: a CDR is generated each time a user sends an SMS

Incoming Call: a CDR is generated each time a user receives a call

Outgoing Call: a CDR is generated each time a user issues a call

Internet: a CDR is generated each time a user starts or ends an Internet connection. During the same connection, a CDR is also generated if the connection lasts for more than 15 min or the user has transferred more than 5 MB.

The shared datasets were created by combining all this anonymous information, with a temporal aggregation in time slots of ten minutes. The number of records in the datasets, $S'_i(t)$, follows the rule:

$$S'_i(t) = S_i(t) \, k$$

where k is a constant defined by Telecom Italia, which hides the true number of calls, SMS and connections.
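The temporal aggregation and scaling can be sketched as follows. This is an illustrative example only: it assumes that the rule is a plain multiplication by k, that events are available as Python `datetime` objects, and it uses a made-up value for k, since the real constant is not disclosed. The helper names are hypothetical.

```python
from collections import Counter
from datetime import datetime


def slot_start(ts):
    """Return the start of the ten-minute slot containing datetime ts."""
    return ts.replace(minute=ts.minute - ts.minute % 10,
                      second=0, microsecond=0)


def scaled_counts(timestamps, k):
    """Count events per ten-minute slot, then multiply each count by k."""
    counts = Counter(slot_start(ts) for ts in timestamps)
    return {slot: n * k for slot, n in counts.items()}


events = [
    datetime(2013, 11, 1, 0, 3),
    datetime(2013, 11, 1, 0, 9, 59),
    datetime(2013, 11, 1, 0, 10),
]
K = 0.5  # hypothetical value; the real constant is not published
shared = scaled_counts(events, K)
```

Because the counts are scaled by an undisclosed constant, the shared values preserve relative differences between squares and time slots but not the absolute number of events.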

Telecommunications activity

The first type of dataset represents the activity of Trentino and Milan, showing all the aforementioned telecommunication events which took place within these areas. The data provide information about Telecom Italia's customers interacting with the network and about other people using it while roaming.

Telecommunications interactions

Two types of CDR datasets were also produced to measure the interaction intensity between different locations: one from a particular area (Trentino/Milan) to any of the Italian provinces and one quantifying the interactions within the city/province (e.g., Milan to Milan). Since Telecom Italia only possesses the data of its own customers, the computed interactions are only between them. This means that (at most) 34% of the population's data is collected, due to Telecom Italia's market share (http://www.agcom.it/documents/10179/1734740/Studio-Ricerca+24-07-2014/5541e017-3c7a-42ff-b82f-66b460175f68?version=1.0, date of access 06/08/2014). Moreover, there is no information about missed calls.

Social pulse

The Social Pulse dataset is composed of geo-located tweets that were posted by users from Trentino and Milan between November 1, 2013 and December 31, 2013. The stream was gathered through the Twitter Streaming API (https://dev.twitter.com/docs/streaming-apis), which is a free service allowing the extraction of ~1% of the total Twitter feed through a set of filters provided by the user. This process saves the author username, the tweet content and the timestamp at which the tweet was written. In order to ensure the privacy of the original users, their usernames have been obfuscated and the text of each tweet has been replaced with a list of entities extracted by the dataTXT-NEX tool (https://dandelion.eu/products/datatxt/). The obfuscation of the username has been done using the hash function SHA-1 and two randomly generated strings (SALT1 and SALT2):

$$\mathit{username}_{new} = \mathrm{sha1}(\mathit{SALT1} + \mathit{username} + \mathit{SALT2})$$
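A minimal sketch of this obfuscation step in Python, using the standard `hashlib` module, is given below. The UTF-8 encoding, the hexadecimal output format and the way the salts are generated are assumptions made for illustration; the source only specifies SHA-1 applied to the concatenation of SALT1, the username and SALT2.

```python
import hashlib
import secrets


def obfuscate_username(username, salt1, salt2):
    """Return sha1(SALT1 + username + SALT2) as a hexadecimal string."""
    data = (salt1 + username + salt2).encode("utf-8")  # encoding is assumed
    return hashlib.sha1(data).hexdigest()


# Two randomly generated salts (generation method is an assumption).
SALT1 = secrets.token_hex(16)
SALT2 = secrets.token_hex(16)

alias = obfuscate_username("example_user", SALT1, SALT2)
```

With fixed salts the mapping is deterministic, so all tweets by the same author share the same obfuscated identifier, while the original username cannot be looked up without knowing both salts.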

dataTXT is a tool that identifies meaningful sequences of one or more terms and links them to the most appropriate Wikipedia page. More information about this tool, including its performance, can be found in ref. 31.

Weather station data

The weather data describe the type and intensity of meteorological phenomena in Milan and Trentino. The data for Milan are collected by Agenzia Regionale per la Protezione dell'Ambiente (ARPA) (http://www2.arpalombardia.it/siti/arpalombardia/meteo/richiesta-dati-misurati/Pagine/RichiestaDatiMisurati.aspx), while Trentino's data are collected by Meteotrentino (http://www.meteotrentino.it).

Milan

In Milan, the type and the intensity of the phenomena are continuously measured by different sensors located within the city limits. Each sensor has a unique ID, a type and a location. Different sensors can share the same location.

The data are split into two datasets called Legend and Weather Phenomena. The former provides the locations of the sensors and the units of measurement, while the latter contains the measurement files for each sensor. The sensors can measure different meteorological phenomena: Wind Direction, Wind Speed, Temperature, Relative Humidity, Precipitation, Global Radiation, Atmospheric Pressure and Net Radiation. There is no spatial aggregation, and the data are aggregated in time slots of 60 min.

Trentino

The dataset contains measurements of temperature, precipitation and wind speed/direction taken at 36 weather stations placed around the Province of Trentino. There is no spatial aggregation, and the data are aggregated in time slots of 15 min.

Precipitation

The precipitation datasets provide information about precipitation intensity and type over the geographical area. The data of Milan and Trentino are collected by ARPA (http://www.arpa.piemonte.it/rischinaturali) and by Meteotrentino (http://www.meteotrentino.it) respectively. Since they adopt different standards, we organized two sections to describe them.

Milan

This dataset is temporally aggregated every 10 min and spatially aggregated in four quadrants of equal size (11.75×11.75 km), each corresponding to 50×50 squares of the grid used for the aggregation. The quadrants are referred to by the IDs 1, 2, 3 and 4, and the corresponding grid square IDs are computed by the formula y×100+x, where x and y are bounded as follows:

Quadrant 1: x in [1,50], y in [50,99];

Quadrant 2: x in [51,100], y in [50,99];

Quadrant 3: x in [51,100], y in [0,49];

Quadrant 4: x in [1,50], y in [0,49].
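The mapping between quadrants and grid square IDs can be sketched in Python as follows. The bounds and the formula y×100+x come from the text above; the function and variable names are hypothetical.

```python
GRID_WIDTH = 100  # number of squares per grid row, from the formula y*100+x

# Inclusive (low, high) bounds of x and y for each quadrant.
QUADRANTS = {
    1: {"x": (1, 50), "y": (50, 99)},
    2: {"x": (51, 100), "y": (50, 99)},
    3: {"x": (51, 100), "y": (0, 49)},
    4: {"x": (1, 50), "y": (0, 49)},
}


def square_ids(quadrant):
    """List the grid square IDs (y*100 + x) that belong to a quadrant."""
    x_lo, x_hi = QUADRANTS[quadrant]["x"]
    y_lo, y_hi = QUADRANTS[quadrant]["y"]
    return [y * GRID_WIDTH + x
            for y in range(y_lo, y_hi + 1)
            for x in range(x_lo, x_hi + 1)]


def quadrant_of(square_id):
    """Return the quadrant containing a grid square ID, or None."""
    y, x = divmod(square_id - 1, GRID_WIDTH)
    x += 1  # x runs from 1 to 100, y from 0 to 99
    for quadrant, bounds in QUADRANTS.items():
        if (bounds["x"][0] <= x <= bounds["x"][1]
                and bounds["y"][0] <= y <= bounds["y"][1]):
            return quadrant
    return None


all_ids = sorted(i for q in QUADRANTS for i in square_ids(q))
```

Under these bounds each quadrant contains 2,500 squares, and together the four quadrants cover the square IDs from 1 to 10,000 without overlap.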

The precipitation intensity is described as:

Absent: precipitation quantity equal to 0 mm/h. Defined as intensity 0;

Slight: precipitation quantity in [0,2] mm/h. Defined as intensity 1;

Moderate: precipitation quantity in [2,10] mm/h. Defined as intensity 2;

Heavy: precipitation quantity in [10,100] mm/h. Defined as intensity 3;

while the precipitation type is characterized as Absent (type: 0), Rain (type: 1) and Snow (type: 2).

Trentino

The precipitation intensity values for Trentino are spatially aggregated over the Trentino grid and temporally aggregated every 10 min. They follow the standard described below:

very slight: precipitation intensity in [1,3], meaning an amount of [0.20,2.0] mm/h;

slight: precipitation intensity in [4,6], meaning an amount of [2.0,7.0] mm/h;

moderate: precipitation intensity in [7,9], meaning an amount of [7.0,16.0] mm/h;

heavy: precipitation intensity in [10,12], meaning an amount of [16.0,30.0] mm/h;

very heavy: precipitation intensity in [13,15], meaning an amount of [30.0,70.0] mm/h;

extreme: precipitation intensity in [16,18], meaning an amount of more than 70 mm/h.
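The scale above can be encoded as a simple lookup table. The sketch below is illustrative: the table values are taken from the list above, while the function name and the choice to return `None` for values outside 1 to 18 are assumptions.

```python
# (intensity code range, label, precipitation amount range in mm/h)
TRENTINO_INTENSITY = [
    ((1, 3), "very slight", (0.20, 2.0)),
    ((4, 6), "slight", (2.0, 7.0)),
    ((7, 9), "moderate", (7.0, 16.0)),
    ((10, 12), "heavy", (16.0, 30.0)),
    ((13, 15), "very heavy", (30.0, 70.0)),
    ((16, 18), "extreme", (70.0, None)),  # no upper bound is given
]


def describe_intensity(value):
    """Map an intensity code to (label, mm/h range); None if out of scale."""
    for (lo, hi), label, mm_range in TRENTINO_INTENSITY:
        if lo <= value <= hi:
            return label, mm_range
    return None  # assumption: codes outside 1..18 are left undescribed


example = describe_intensity(5)
```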

The precipitation data collection is not continuous, due to technical issues such as the presence of snow over the radar sensor. For this reason, we released a data availability dataset, which indicates whether or not the data were collected for a specific time interval.

SET electricity

SET manages almost the entire electrical network over the Trentino territory. It uses around 180 primary distribution lines (medium voltage lines) to bring energy from the national grid to Trentino's consumers. To ensure the privacy of SET's customers, their locations and the geometry of the 180 primary distribution lines are not explicitly exposed. Consequently, the Customer site dataset shows the number of customer sites of each power line per grid square, while the Line measurement dataset indicates the amount of energy flowing through the lines at time t. Customer sites provide energy to different types of customers (e.g., houses, condominiums, business activities, industries etc.), which require different amounts of electricity. For privacy reasons this information is hidden, meaning that in the dataset the energy flowing is uniformly distributed among the various types of customers.

Figure 4 shows the process we followed to transform the original dataset into the shared one. In the first layer we have the exact position of each customer site (e.g., some of them are industries, others are small houses) and the precise geometry of each line. In the second layer we lose the exact geometries of customer sites and power lines. However, this information is summarized in the Customer site dataset, where for each grid square the number of customer sites is recorded along with the power line they are connected to. In the third layer we know how the customer sites of a power line are distributed over the grid and the energy flowing through each power line (from the Line measurement dataset). It is then possible to distribute the energy flowing through a power line p over the grid in order to build a choropleth map of the energy consumption in each grid square (last layer in Fig. 4).

Figure 4 The SET customers are spatially aggregated into the grid squares and the energy consumption is uniformly divided among the customers, hiding their different type (e.g., houses, condominiums, business activities, industries).

The Line measurement dataset is temporally aggregated in time slots of 10 min.
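The redistribution of line energy over the grid can be sketched as follows. This is an illustrative example under the uniform-distribution assumption stated above: the energy of each power line is split among grid squares in proportion to the number of its customer sites in each square. The record layout, identifiers and function name are hypothetical and do not reflect the actual file format.

```python
from collections import defaultdict


def distribute_energy(customer_sites, line_energy):
    """Spread the energy of each power line over the grid squares.

    customer_sites: iterable of (square_id, line_id, n_sites) records
    line_energy:    {line_id: energy flowing through the line at time t}
    Each customer site of a line receives an equal share of its energy.
    """
    sites_per_line = defaultdict(int)
    for _, line_id, n_sites in customer_sites:
        sites_per_line[line_id] += n_sites

    energy_per_square = defaultdict(float)
    for square_id, line_id, n_sites in customer_sites:
        total = sites_per_line[line_id]
        if total > 0 and line_id in line_energy:
            energy_per_square[square_id] += line_energy[line_id] * n_sites / total
    return dict(energy_per_square)


# Toy example with two lines and two grid squares.
sites = [(1, "L1", 3), (2, "L1", 1), (2, "L2", 2)]
energy = {"L1": 40.0, "L2": 10.0}
per_square = distribute_energy(sites, energy)
```

In this toy case square 1 receives three quarters of the energy of line L1, while square 2 receives the remaining quarter plus all the energy of line L2. The resulting per-square values are the kind of quantity that a choropleth map of consumption would display.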

News

The news datasets contain all the articles published on the websites http://www.milanotoday.it and http://www.trentotoday.it. Each news item is associated with the geographical location where the event happened. News items referring to the general area (the whole city of Milan or the whole Province of Trentino) are geo-tagged to its administrative centre.

Code availability

The datasets are released under the Open Database License (ODbL) and are publicly available in the Harvard Dataverse.

Different types of software and tools were used in the dataset generation process, and it would have been too complicated to share and explain all the source code used. For this reason, we shared a simpler version of the code, to help readers understand part of the process explained in the Methods section. The software is written in Python 2.7 and can be found at [Data citation 1]. Unfortunately, since it was not possible to share the input (raw) files, this code cannot be executed to perfectly reproduce the datasets.

converter.py: converts the raw CDRs to the grid overlay as explained previously. The output is written in the same directory where the script resides.