Large-scale networks of human interaction, in particular country-wide telephone call networks, can be used to redraw geographical maps by applying algorithms of topological community detection. The geographic projections of the emerging areas in a few recent studies on single regions have been suggested to share two distinct properties: first, they are cohesive, and second, they tend to closely follow socio-economic boundaries and are similar to existing political regions in size and number. Here we use an extended set of countries and clustering indices to quantify overlaps, providing ample additional evidence for these observations using phone data from countries of various scales across Europe, Asia, and Africa: France, the UK, Italy, Belgium, Portugal, Saudi Arabia, and Ivory Coast. In our analysis we use the known approach of partitioning country-wide networks, and an additional iterative partitioning of each of the first level communities into sub-communities, revealing that cohesiveness and matching of official regions can also be observed on a second level if spatial resolution of the data is high enough. The method has possible policy implications on the definition of the borderlines and sizes of administrative regions.

Competing interests: We have the following interests: This study was partly funded by AudiVolkswagen, BBVA, The Coca Cola Company, Ericsson, Expo 2015 and Ferrovial. Orange, British Telecom, Telecom Italia and Saudi Telecom Company provided datasets for this research. This does not alter the authors' adherence to all the PLOS ONE policies on sharing data and materials.

Funding: Financial supporters of the Senseable City Laboratory are: the National Science Foundation, the MIT SMART program, the Center for Complex Engineering Systems (CCES) at KACST and MIT, AudiVolkswagen, BBVA, The Coca Cola Company, Ericsson, Expo 2015, Ferrovial and all the members of the MIT Senseable City Lab Consortium. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Further related work has partitioned commuting data in the US and has also found regions that are both cohesive and follow existing borders [13]. A related study analyzed the movements of virtual avatars in a massive multiplayer online game [8]. In this case the detection of communities from networks of raw mobility data yielded an almost exact match with the underlying socio-economic regions of the virtual society. The study also quantified the strong influence of borders on mobility. Similar investigations on borders were performed using networks of money flows [17], or using GPS tracks of vehicles and an Infomap approach to compare detected clusters with existing administrative borders on a more local level [18]. Mobility of mobile phone users has been explored for the country of Portugal [19].

Community detection of phone call networks via modularity optimization (see Materials and Methods) was established in previous works [10], [11], leading to spatially cohesive regions generally consistent with the geopartitioning of major political regions of the considered countries. Communication networks have been shown to be a reasonable proxy for other human interaction networks [14], [15], making the observations generalizable to human activity beyond phone calls. A massive communication network inferred from a large telecommunications database of landline calls in Great Britain has been studied previously [10]. The study found geographically cohesive regions that generally correspond with administrative regions, while unveiling unexpected spatial structures that had previously only been hypothesized in the literature. The cohesiveness of single regions was assessed [10], showing how “integrated” different regions of the UK are. Further, communication networks of Belgium based both on the average duration of calls and on the total call duration were analyzed [11]. The latter study yielded a number of groups which the authors called spatially balanced: the 17 areas found resemble the urban hierarchy suggested earlier [16]. Further, these groups are always made up of adjacent municipalities, although this is not a necessary outcome of the algorithm. Both results were also found for France [12]. Finally, the partitioning of an average call duration network in Belgium delineated exactly two areas which closely follow the linguistic border inside the country. Language therefore seems to constitute a strong barrier in human spatial organization as seen through communication.

Here our approach focuses on the topology of human interactions in the form of networks. A network viewpoint emphasizes that the behavior of a complex system is shaped by the strong interactions among its constituents and offers the possibility to analyze social systems within an abstracted, mathematically tractable framework [9]. Our main point of interest is the partitioning of human population in space based on the raw networks of communication activities. We build on a small corpus of previous studies [10]–[13], in which human activity networks within single countries have been studied. The partitioned networks have been shown to reflect the linguistic or cultural borders of the underlying geographical space, and to follow administrative boundaries, sometimes surprisingly closely. We now bring together large data sets from a number of different countries, broadening the scope to a multitude of regions and cultural backgrounds, showing that the observed effects tend to hold in general, and also on a second level of partitioning if the given data is fine-grained enough to allow such a partitioning. To this end we employ community detection algorithms which optimize modularity, as in previous works. Further, we compare the borders proposed in a rejected administrative referendum in Portugal to the underlying regions given by the human activity data. This comparison demonstrates the practical implications of our work, which is able to reveal the actual underlying social structure of the population and to provide “ground truth” to decision makers.

In recent years, human geography and many other areas of social science have been experiencing exciting new developments due to the availability of large-scale data from human interactions, communications, and movements. Advances in information and communication technologies, and the accumulation of massive data sets on human behavior, now allow researchers to study human interaction and mobility patterns with unprecedented precision [1]. Data from human interactions such as mobile phone usage can provide insights on various questions in human geography which otherwise would be impossible to understand quantitatively. The issues that can now be tackled in unprecedented detail concern fields as diverse as geomarketing [2] and urban planning [3], with implications for epidemiology and the spread of diseases [4], for the spread of information in general [5], and for the understanding of individual mobility patterns [6] and political movements [7]. Even purely virtual environments have the potential to advance our understanding of the nature of human movements [8].

We use two classical measures of clustering similarity to quantify partition overlap, i.e. how well two different partitions of the same set of locations match: Rand's criterion [24] and the Fowlkes and Mallows index [25]. Both of these measures are based on comparing sets of pairs of locations which are assigned either to the same community in both partitions or to different communities. A perfect match between two partitions yields a value of 1 for both indices. For the case of two completely unrelated clusterings, both indices are in general strictly larger than zero, more so for Rand's criterion [25]. Therefore, to have a baseline, we calculated the average of each index over 1000 random reshufflings of locations in the given partitionings of administrative regions. To have a measure grounded in another, information-theoretical approach, we also use the variation of information (VI). The VI has mathematical properties that are in line with our general intuition of what “more different” and “less different” should mean for two clusterings of a set [26]. For formal definitions of all measures see SI.
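The three measures and the reshuffling baseline can be sketched in a few lines of Python. This is a minimal illustration using the standard textbook definitions, which may differ in normalization from the formal definitions in the SI; all function names are hypothetical, and partitions are assumed to be given as equal-length lists of labels, one per location.

```python
import math
import random
from collections import Counter
from itertools import combinations


def pair_counts(a, b):
    """Classify all location pairs by co-membership in partitions a and b.

    Returns (n11, n10, n01, n00): together in both, together only in a,
    together only in b, separated in both.
    """
    n11 = n10 = n01 = n00 = 0
    for i, j in combinations(range(len(a)), 2):
        same_a = a[i] == a[j]
        same_b = b[i] == b[j]
        if same_a and same_b:
            n11 += 1
        elif same_a:
            n10 += 1
        elif same_b:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00


def rand_index(a, b):
    """Share of pairs on which the two partitions agree."""
    n11, n10, n01, n00 = pair_counts(a, b)
    return (n11 + n00) / (n11 + n10 + n01 + n00)


def fowlkes_mallows(a, b):
    """Geometric mean of pairwise precision and recall."""
    n11, n10, n01, _ = pair_counts(a, b)
    denom = math.sqrt((n11 + n10) * (n11 + n01))
    return n11 / denom if denom else 0.0


def variation_of_information(a, b):
    """VI = H(a) + H(b) - 2 I(a; b), in nats (unnormalized)."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    h_a = -sum(c / n * math.log(c / n) for c in pa.values())
    h_b = -sum(c / n * math.log(c / n) for c in pb.values())
    mi = sum(c / n * math.log(c * n / (pa[x] * pb[y]))
             for (x, y), c in pab.items())
    return h_a + h_b - 2 * mi


def shuffled_baseline(detected, official, index, runs=1000, seed=0):
    """Average index value after randomly reshuffling the official labels."""
    rng = random.Random(seed)
    labels = list(official)
    total = 0.0
    for _ in range(runs):
        rng.shuffle(labels)
        total += index(detected, labels)
    return total / runs
```

Comparing `rand_index(detected, official)` with `shuffled_baseline(detected, official, rand_index)` then indicates how far the observed overlap lies above what unrelated partitions of the same sizes would give.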

Boundaries produced by the algorithm which match official boundaries might naively be interpreted as a “natural” validation of the hypothesis of closely followed borders: two divisions of a country, partitions from networks and official borders, would not coincide just by chance but rather for a reason. However, if the algorithm's result does not match, the reasons, apart from a genuine deviation of human interaction regions from official boundaries, could also include low population density near the border, which makes boundaries visually float while leaving modularity scores practically unchanged, and other minor statistical fluctuations. Due to such influences, the boundaries produced by the algorithm cannot always be treated as exact; they may be shifted slightly. However, the cores of detected regions have been found to be stable for the UK data set [10].

The intriguing property of the modularity optimization approach is that the resulting network division has no predetermined number of partitions. Only the raw topological information of the input network determines the range of communities detected. Further, the algorithm does not fix the sizes nor the distribution of sizes of the detected groups, and it is not limited by any spatial constraints.

To the extracted communication networks we apply an algorithm for community detection following a standard modularity optimization approach [21], [22]. The method scores all the edges of the network according to their relative strength compared to a null model with respect to the weight of the nodes they connect, and aims to maximize the cumulative score inside the communities, preferring edges with a positive score and avoiding those with a negative score. The particular optimization algorithm [23] is a variation of the technique used by [10]. The idea is an iterative improvement of the partitioning in terms of the modularity score, starting from the trivial case where all nodes are gathered into one community, with three kinds of possible improvements: 1) dividing a community into two new communities, 2) joining two communities into one, and 3) shifting a part of one community to another existing community. The outcome of partitioning spatial networks is in general not qualitatively dependent on the particular algorithm used. We use this one because of its ability to consistently provide the best results in terms of modularity score compared to other algorithms [23], see also SI.
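To make the score concrete, the following sketch evaluates the modularity of a given partition of a weighted directed network. It assumes the common directed null model, in which the expected weight of an edge is proportional to the out-strength of its source and the in-strength of its target; the exact null model of [21]–[23] may differ, and the function name and data layout are illustrative only. It computes the score, not the optimization itself.

```python
from collections import defaultdict


def modularity(weights, community):
    """Modularity of a partition of a weighted directed network.

    weights:   dict {(source, target): weight}, self-loops allowed
    community: dict {node: community label}
    Null model (an assumption of this sketch): out_strength * in_strength / total.
    """
    total = sum(weights.values())
    out_strength = defaultdict(float)
    in_strength = defaultdict(float)
    for (src, dst), w in weights.items():
        out_strength[src] += w
        in_strength[dst] += w

    # Observed share of weight that stays inside communities.
    inside = sum(w for (src, dst), w in weights.items()
                 if community[src] == community[dst]) / total

    # Expected share of weight inside communities under the null model.
    out_by_comm = defaultdict(float)
    in_by_comm = defaultdict(float)
    for node, label in community.items():
        out_by_comm[label] += out_strength[node]
        in_by_comm[label] += in_strength[node]
    expected = sum(out_by_comm[c] * in_by_comm[c] for c in out_by_comm) / total ** 2

    return inside - expected
```

An optimizer of the kind described above would repeatedly try splits, merges, and shifts of nodes, keeping a move whenever this value increases.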

We construct interaction networks between the locations of a country at the available spatial resolution level: the weight of the link from one location to another is the total duration of calls initiated by users of the first location to users of the second. This process generates a weighted directed network in which the loop edges from locations to themselves are also considered. The nodes in these networks are the locations, which are municipalities, zip codes, special geographical units such as exchange areas, or cell tower areas, as defined in the “Spatial resolution” column of Table 1. In the case of Portugal and France the users are attached to their actual locations during a call, while for Belgium they are attached to their formal residence locations. The UK and Italy networks are based on landline calls, i.e. the locations of the users are fixed.
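The aggregation step can be illustrated as follows. The record layout (origin location, destination location, duration) is a hypothetical simplification for this sketch; the actual data sets were already anonymized and aggregated by the operators.

```python
from collections import defaultdict


def build_network(calls):
    """Aggregate call records into a weighted directed network.

    calls: iterable of (origin_location, destination_location, duration)
    Returns a dict {(origin, destination): total duration}.
    """
    weights = defaultdict(float)
    for origin, destination, duration in calls:
        # Loop edges (origin == destination) are kept, as in the text.
        weights[(origin, destination)] += duration
    return dict(weights)
```

For example, `build_network([("A", "B", 60), ("A", "B", 30), ("A", "A", 5)])` yields a link of weight 90 from A to B and a loop of weight 5 on A.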

We consider seven country-wide data sets of telephone calls, in France, the UK, Italy, Belgium, Portugal, Saudi Arabia, and the Ivory Coast; details on the data are given in Table 1. All data sets comprise mobile phone data with the exception of landline calls in the UK and Italy. Data was provided by single phone providers with possibly heterogeneous coverage over the respective countries; we have no information on local market shares and on resulting possible inhomogeneities in spatial coverage. The Ivory Coast data was released to researchers during the D4D mobile phone data challenge [20] and was used as is. All other data sets are proprietary and subject to stricter data privacy agreements; therefore we cannot provide more detailed information on metadata or on the data collection process than is given in Table 1. All data has been anonymized and aggregated on the operator side prior to receipt and in line with all local data protection laws. No special cleaning process was performed which could have introduced substantial bias. All the operators who provided the data possess country-wide coverage for the corresponding countries. In those cases where the coverage ratio was substantially spatially inhomogeneous, the appropriate normalization by the local market shares has been performed for the aggregated communication networks.

Results

Iterative partitioning of subregions reveals similar properties on a second level

By partitioning the country-wide networks of human telephone interactions we obtain spatially cohesive regions generally consistent with the geopartitioning of greater political regions. However, it is possible to go one step further and apply the community detection method in an iterated fashion. Namely, applying the network partitioning to the subnetwork inside each of the detected first-level regions produces a second-level subpartitioning of the network into smaller subregions. The panels in the right column of Figs. 1 and 2 show that the second-level subpartitioning again possesses the same general properties: all the subregions are geographically cohesive. Since the match of first-level regions with NUTS2 is most consistent for France, it is possible to also compare level two regions with NUTS3 in this country without running into too many inconsistencies due to deviations on the first level. These inconsistencies result in visual artifacts, see the few “hollow” regions in Fig. 1B. Although the number of level two regions found (207) is higher than the number of existing NUTS3 regions (96 without the overseas departments), many borders are again followed reasonably well. The most visible mismatches occur in the same south-eastern parts where level 1 regions are already mismatched.
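The iterated procedure can be sketched independently of the particular optimizer. In the sketch below, `detect` is a hypothetical placeholder for any community detection routine that maps a weighted edge dictionary to node labels; it is not the algorithm of [23].

```python
def second_level_partition(weights, detect):
    """Partition a network, then re-partition each first-level community.

    weights: dict {(source, target): weight}
    detect:  callable taking such a dict and returning {node: label}
             (a stand-in for the modularity optimizer, assumed here)
    Returns (first, second), where second maps each node to a pair
    (first-level label, second-level label).
    """
    first = detect(weights)
    second = {}
    for label in set(first.values()):
        members = {node for node, c in first.items() if c == label}
        # Keep only links with both endpoints inside this community.
        sub = {(s, d): w for (s, d), w in weights.items()
               if s in members and d in members}
        sub_labels = detect(sub) if sub else {}
        for node in members:
            # Nodes without internal links fall back to sub-label 0.
            second[node] = (label, sub_labels.get(node, 0))
    return first, second
```

Because links leaving a first-level community are discarded before the second run, second-level regions are nested inside first-level regions by construction.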

Deviations

The findings of our approach, especially deviations from specific borderlines, have the additional potential to serve as a decision aid for administrative officials and regional planners, either for or against specific possible subdivisions of a country, as well as to give an insight into the long-term geographic impacts of historic events. Using telephone call data, which is recorded and stored by telephone providers and is therefore relatively easy to access by the respective companies, has the added benefit of being several orders of magnitude less costly than performing censuses. In the following we highlight the case of Portugal and the administrative referendum of 1998 [28]. This referendum failed, as the majority of citizens voted against the newly proposed borders, shown in Fig. 2E. The proposed regions show a poor match with the regions from human interaction networks. This is also reflected in the clustering indices, which lie below the indices of the historical regions (but still above the NUTS2 values), and it is one of the possible reasons why the referendum failed. Comparing our partitioning result of Portugal to the currently existing official territorial division of NUTS reveals that NUTS2 is more coarse-grained (5 regions in continental Portugal) while NUTS3 is more fine-grained (28 regions in continental Portugal) than our partitioning, which is in between (7 regions) and matches historical regions to some extent better. For example, the referendum proposed to split up Beira and to shift the Beira–Ribatejo border while not placing borderlines between i) Minho and Douro Litoral and between ii) Alto Alentejo and Baixo Alentejo, although these borders appear in the partition. These observations and the evidence from the clustering indices show that historical effects of human behavior could outlast modern categorization and might have an impact on policies today.

Another example of possible policy implications comes from the granularity of the results. While on the country-wide level the partitioning algorithm gives the closest match for NUTS2 regions in Belgium, Italy, and France, in the UK the same scale of partitioning appears to match the NUTS1 definition instead. The definition of different levels of NUTS regions is known to be country-dependent. Therefore the deviation of the scale of NUTS regions in the UK from other EU countries may provide valuable input for creating a possibly more adequate definition of hierarchical regions with the aim of being homogeneous EU-wide, especially considering that different levels of NUTS regions correspond to very specific levels of structural funding, possibly impacting regional performance substantially [29]. For the case of Ivory Coast, the stronger deviations from political regions than in European countries possibly hint, on the one hand, at the mentioned inhomogeneous distribution of population or cell towers. On the other hand, the deviations might stem from the young age of the country's administrative structure, which is still going through processes of social reorganization after two recent civil wars. Here the present political borders, which are not fully consistent with earlier tribal structures, were defined only a few decades ago, as opposed to the long history behind the subdivision of France. In conclusion, the regional structures based on the actual human interactions can be determined in an automated way independent of a country's history and can provide possible alternatives to existing administrative regions for organizing societies.

Finding “breaking lines”

So far we used the partitioning program without any restrictions on the number of communities and left it to the algorithm to find the most “natural” number in terms of modularity. In this section, we modify the algorithm used above to limit this number to the smallest nontrivial number of communities, namely two, revealing the “breaking lines” of countries, i.e. the borders which split up countries into exactly two parts in terms of total network weight by optimizing modularity. In this case the algorithm optimizes the modularity value over all possible bisections, by restricting the ability of the algorithm to create a new community once the maximal allowed number of communities (here two) is already reached. We report the results in Fig. 4. France is split by a border going from center north to center south following the eastern borders of the regions Upper Normandy, Île-de-France, Centre, Limousin, and Midi-Pyrénées, Fig. 4A. The UK splits along a west-east line which also splits Wales in two, following the same split already found in level 1 regions, Fig. 4B. Mainland Italy is split along a line roughly following the northern border of Emilia-Romagna, Fig. 4C; Belgium is split along the Dutch-French language barrier with Brussels assigned to the northern Dutch part, Fig. 4D; Portugal is split slightly south of the Mondego river, Fig. 4E. A similar bi-split of Belgium was previously found [11]; however, it required a different measure for network links, the average duration of one call, while the split we report here is obtained from the same network of total call durations, with only a limitation on the number of communities.

Figure 4. Split of countries into two parts. (A) France is split by a border going from center north to center south almost exactly following regional borders. (B) The UK splits along a west-east line which also splits Wales in two. (C) Mainland Italy is split along a line roughly following the northern border of Emilia-Romagna with the islands of Sardinia and Sicily being assigned to the northern part. (D) Belgium is split along the Dutch-French language barrier with Brussels assigned to the northern Dutch part. (E) Portugal is split roughly along the ancient border of the county of Portugal. (F) Ivory Coast and (G) Saudi Arabia are split into western and eastern parts. https://doi.org/10.1371/journal.pone.0081707.g004

It is possible to quantify the strength of the splits by looking at the weights of links within each side compared to the total weight of the network. In all cases, we observe that the countries are divided into two parts with nearly equal network weight. Therefore, if the links were distributed homogeneously, we would expect around 50% of the links to run between the two split parts. However, the actual picture is quite different. Belgium displays the strongest split among all European countries, with the smallest share of all links going between the north and the south partition. The next strongest splits are those of France and Italy. The weakest splits are those of the UK and Portugal. The modularity scores for the two-part partitions follow the same order: Belgium, France, Italy, UK, Portugal. Note that the “breaking lines” and their strengths do not necessarily come with any political implications. First, a consequence of the algorithm is the separation of the network into two parts with almost equal total link weight. If we assume homogeneous communication behavior, then the population is expected to be divided in half by the process.
Therefore, these splits have to be discussed with care, as results can be strongly influenced by heterogeneous population densities and/or geographically distinct features such as mountains. In some cases, however, additional cultural reasons may well be justifiable. The clearest division, that of Belgium, falls in line with previous results where a strong language barrier was found between the northern Flemish region and the southern Walloon region [11]. In our case the bilingual city of Brussels is assigned to the northern instead of the southern partition, but it is not clear if this is simply due to the population distribution. Nevertheless, approximately half of the total of links between the north and south areas go between Brussels and the south, making the capital a bridge between the regions and providing motivation for future studies on the human interactions within the Brussels region.

Apart from clear cultural differences as in Belgium, the results in Italy might be influenced by migration patterns. Here, surprisingly, we find that the western and southern islands of Sardinia and Sicily are connected to the northern partition. This could be due to the nature of the algorithm, which would assign possibly weak connections between the islands and the mainland less clearly. On the other hand, substantial migration flows from southern parts of Italy to the north since the Second World War are well known. In particular, the north-western regions of Piedmont, Lombardy, Liguria and Aosta Valley were the destinations of a large proportion of the migration flows of the 1950s and 1960s, since industrial development in Italy has its origins there [30]. The connection of Sicily and Sardinia to this northern part could be due to family ties between migrated and remaining family members. Migration data would be needed to come to firmer conclusions.

The detected breaking line of Portugal, Fig. 4E, follows on the west side roughly the historical borders of the Condado Portucalense (county of Portugal), slightly south of the city of Coimbra and the Mondego river. This county of Portugal existed from the late ninth to the early twelfth century and was a fiercely disputed region between Moorish and Christian rule, with often shifting borders due to conquests and reconquests. This period marks the time in which the national identity of the Portuguese people was formed and the basis for the Portuguese kingdom was created. Given that the split is not very strong (a relatively large percentage of links exists between the split areas), it is not clear if the borderline we find can be reasonably attributed to these ancient regions or just to the surrounding areas of Lisbon and Porto. However, it is at least interesting to find a split into north and south, as certain rivalries between those regions have left their imprints on almost every aspect of Portuguese social life [31]. These results do not come with any explicit policy implications due to the unclear causal relations, but the method can offer either careful attempts at historical insights into the evolution of specific communities, or provide possible “ground truth” on their cohesiveness if communication strength between the inhabitants is taken as a measure.
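The split strength discussed in this section, i.e. the share of link weight running between the two parts, can be computed as in the sketch below. The function names are illustrative, and the homogeneous expectation assumes that both endpoints of a link fall on either side independently, which gives the 50% reference value for two parts of equal weight.

```python
def cross_split_share(weights, side):
    """Fraction of total link weight running between the two sides.

    weights: dict {(source, target): weight}
    side:    dict {node: 0 or 1}, the bisection found by the algorithm
    """
    total = sum(weights.values())
    crossing = sum(w for (s, d), w in weights.items() if side[s] != side[d])
    return crossing / total


def expected_cross_share(p):
    """Expected crossing share if link endpoints were placed independently,
    with a fraction p of the weight attached to one side (an assumption
    of this sketch, not a definition from the paper)."""
    return 2 * p * (1 - p)
```

A measured share far below `expected_cross_share(0.5)`, which equals 0.5, indicates a strong split; a share close to it indicates a weak one.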