Transcription

Chandler’s book includes population data from 2250 BC to AD 1975 in various charts and tables. The book contains 656 9×5.5 inch pages and is divided into multiple sections, including Sources and Methods, Continental Tables and Maps (highlighting locations of major cities as illustrated in Fig. 4), Data Sheets for Ancient Cities (the main tables of the book shown in Fig. 1), Tables of the World’s Largest Cities, and Whereabouts of Unfamiliar Cities. Each page in the Data Sheets for Ancient Cities section (Fig. 1) contains a range of 15-30 data points per page. These pages are divided into four columns: (1) data year, (2) the population value (underlined values are Chandler’s estimates), (3) text describing the origin of the population estimate, and (4) citation information for each entry.

Figure 4: Sample Chandler Map. Continental map illustration from Chandler’s book, located in the Continental Tables and Maps section of the text. Although useful in locating some cities, the image quality of these maps is variable even in the original text and they provide approximate locations only.

As with any digitization project, a significant component of the work is converting the printed text (in this case a hardcover book) into digital format. There are several ways this task could be done. Because the Chandler book is 656 pages long, its size warranted use of a Kirtas machine, which uses optical character recognition (OCR) to convert printed text into an encoded format. An OCR system can convert text into a portable document format (PDF) file, which can be manipulated using a word processing program. This differs from a scanner, which converts print media to a picture that cannot be readily manipulated.

We had planned to use a Kirtas machine to convert the printed text to digital format. However, the font of the printed book was not easily recognized by the Kirtas machine, and the quality of the printed pages was variable (e.g., Fig. 1). As a result, none of the OCR software we tested (Microsoft OneNote, Adobe Acrobat Pro, and Free OCR) was able to accurately convert the printed text. After multiple attempts with the Kirtas machine, this approach was discarded and the text was manually transcribed into Microsoft Excel (Fig. 3). In total, 1,746 city locations were originally transcribed and then checked twice by research assistants for transcription errors and accuracy. If an entry did not match across all three passes (the original transcription and the two checks), we referred back to the original documents for assessment and amendment. The final Chandler dataset contains 1,599 city locations, since some originally transcribed cities were later combined or could not be geocoded accurately.
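The comparison of the original transcription against the two checks can be sketched as below. This is a minimal illustration and not the authors' procedure, which was carried out by hand in Excel; the function name, the (city, year) key, and the sample values are all hypothetical.

```python
def find_mismatches(original, check_a, check_b):
    """Return keys whose values differ across three transcription passes.

    Each argument maps a hypothetical (city, year) key to a population value.
    A key missing from any pass is also flagged for review.
    """
    all_keys = set(original) | set(check_a) | set(check_b)
    flagged = []
    for key in sorted(all_keys):
        values = (original.get(key), check_a.get(key), check_b.get(key))
        # Flag the entry unless all three passes recorded the same value.
        if None in values or len(set(values)) != 1:
            flagged.append(key)
    return flagged


# Illustrative values only, not taken from the dataset.
original = {("CityA", 1500): 40000, ("CityB", 1600): 25000}
check_a = {("CityA", 1500): 40000, ("CityB", 1600): 26000}
check_b = {("CityA", 1500): 40000, ("CityB", 1600): 25000}

flagged = find_mismatches(original, check_a, check_b)
```

Entries returned in `flagged` would be the ones to look up again in the printed source.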

We received Modelski’s dataset directly from the author in the digital text format depicted in Fig. 2. The book itself, which consists of 245 pages, contains descriptive text recounting shifts in population values and their origins. We reformatted the Microsoft Word tables into Excel tables using a format similar to that of the Chandler dataset. This format includes country names along the y-axis and time periods across the x-axis, as depicted in Supplementary Fig. 1.

Geolocation

Geocoding is the process of assigning geo-referenced coordinates, or longitude and latitude values, to a record to identify its location on Earth’s surface. It is often the first step in any spatial analysis when the data are not already geolocated18. Online geocoding platforms, such as CartoDB or the Google Places API (Application Programming Interface), can be used to process large amounts of data when the entries can be matched in batch mode. This process allows all locations (up to a pre-determined limit for some geocoding services, such as 10,000 queries per day for the Google Places API) to be submitted as a single batch, rather than interactively or individually19,21.

Geocoding or geolocating tools have been used most frequently and discussed in relation to medical field-based studies, such as public health or epidemiology19–21. In these studies, geocoding is often done at the individual address level. Geocoding at the address level allows for application of accuracy validation techniques and procedures since address locations can be checked by multiple geocoding services. When geocoding at the address level, accuracy can be measured by comparing the distances between geocoded points of different methods22,23.
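One common way to measure the distance between two geocoded points is the haversine great-circle formula. The sketch below is illustrative only: the cited studies may use other distance measures, and the function name and the Earth radius constant are assumptions of this example.

```python
import math

# Mean Earth radius in kilometres; treating the Earth as a sphere is an
# approximation chosen for this sketch.
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Distance between results from two hypothetical geocoders for the same record.
offset_km = haversine_km(0.0, 0.0, 0.0, 1.0)
```

A small distance between the points returned by different services suggests that they agree on the location; a large one marks the record for review.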

Here, we geocoded population data for cities using a single, central latitude and longitude point with 2 to 8 significant figures depending on the geocoding database used. Urban extent data, or polygons defining the city boundary rather than city center points, are not included in our dataset due to a lack of available data. This lack of area extent information may limit the type of analysis possible using this dataset, but the point estimates of population size are a first step towards developing a more comprehensive dataset of urban extent. For example, users of this dataset could estimate area extents based on assumptions about population densities and land use, but this adds another level of uncertainty. Ultimately, the quality of the final geocoded dataset is partially determined by the quality and limitations of the original data.

After transcription of both datasets was complete, each city was geocoded, or assigned a corresponding longitude and latitude value. Spatial coverage of the entire dataset for six different time periods, accompanied by the frequency of data points per city for each period, is pictured in panels a–f of Fig. 5. Panel g shows population-weighted global mean centers (GMCs) for each of the six time periods. GMCs show the center of a region’s population at a given time point and can be useful for tracking human settlement patterns and shifts on a large, global scale over time. Here, we calculated the GMC for each time period by averaging the x,y coordinates (longitude, latitude) of all cities, weighted by their z-coordinates (population), to determine the center of global population for each era. The results show an initial westward trend from the Mesopotamian origin. After the year AD 1500 this trend reverses and becomes eastward. For the years between 2000 BC and AD 1000, where Modelski and Chandler sometimes recorded different population values for the same city and time period, we selected Modelski’s values, as his work focused on this ancient time period. The final dataset retains both Chandler’s and Modelski’s values for users to select at their discretion. The calculation of GMCs is only one of many possible uses of this dataset.
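The population-weighted mean center described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the coordinates have already been projected to planar x,y values (the caption of Fig. 5 notes the Goode Homolosine projection), and the function name and sample values are hypothetical.

```python
def weighted_mean_center(points):
    """Return the population-weighted mean (x, y) of a set of cities.

    points: iterable of (x, y, population) tuples in planar coordinates.
    """
    points = list(points)
    total_population = sum(pop for _, _, pop in points)
    if total_population <= 0:
        raise ValueError("total population must be positive")
    # Each coordinate is weighted by the city's population.
    mean_x = sum(x * pop for x, _, pop in points) / total_population
    mean_y = sum(y * pop for _, y, pop in points) / total_population
    return (mean_x, mean_y)


# Illustrative values only: two cities in one time period.
period_points = [(0.0, 0.0, 100), (10.0, 0.0, 300)]
center = weighted_mean_center(period_points)
```

Running this once per time period gives one center per era, which is how a shift over time could be traced.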

Figure 5: Spatial and Temporal Representation: Global View of Data Points. (a–f) illustrate both the spatial and temporal frequency of city-level population points for different time periods. (a–d) represent the pre-modern period, from 3700 BC–AD 1800, and use the same scale to measure frequency of data points per city. (e,f) represent the modern period, with a shorter time frame per period; the frequency scale is therefore shorter and separated into thirds. (g) illustrates global mean centers (GMCs) for the same time periods. Each GMC is weighted by city population for each data point and was calculated, and is pictured, in the Goode Homolosine projection.

Originally, we used geocoding software platforms, such as CartoDB (Fig. 3). However, due to the long timescale and global coverage of the data, there were changes in city names over time, as well as numerous similar city names across space. These similarities and name changes resulted in a many-to-many relationship between city names and city locations, and made it difficult to automatically and uniquely match the geographic location of a city.

Next, the GeoNames database was used to improve upon the CartoDB results. The database, or geographic gazetteer, derives its data from the US Board on Geographic Names, Wikipedia, and the National Geospatial-Intelligence Agency, and also relies on ‘ambassadors’ from over 70 countries representing over 250 regions, whose role is to discover potential global city/place location data sources such as military, governmental, educational, and mapping-based sources24. The continuously evolving result is a database of over 10 million geographical names, all of which are freely downloadable. GeoNames contains a comprehensive list of cities with populations over 1,000 inhabitants, alternate city/locational names, type of location, and corresponding longitude and latitude. Data can be accessed by downloading a large text file or by using the GeoNames web services (API) for location matching. We downloaded a text document containing cities with over 1,000 inhabitants and used this table in a table join in ArcGIS software to match cities with their corresponding longitude and latitude values. A table join merges two tables into a single table based on a common field (the concatenated city/country name in this case).
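A table join on a concatenated city/country key can be sketched as below. The join in this work was performed in ArcGIS; this Python version is only an illustration of the idea, and the field names, key format, and sample rows are hypothetical rather than the actual GeoNames schema.

```python
from collections import defaultdict


def make_key(city, country):
    """Build the common join field from a city and country name."""
    return f"{city.strip().lower()}|{country.strip().lower()}"


def join_to_gazetteer(cities, gazetteer):
    """Match (city, country) pairs to gazetteer coordinates.

    Returns three collections: unique matches, keys with several candidate
    locations, and keys with no match at all.
    """
    index = defaultdict(list)
    for row in gazetteer:
        index[make_key(row["name"], row["country"])].append((row["lat"], row["lon"]))

    matched, ambiguous, unmatched = {}, {}, []
    for city, country in cities:
        key = make_key(city, country)
        hits = index.get(key, [])
        if len(hits) == 1:
            matched[key] = hits[0]
        elif len(hits) > 1:
            ambiguous[key] = hits  # same name shared by several locations
        else:
            unmatched.append(key)
    return matched, ambiguous, unmatched


# Illustrative rows only; names and coordinates are placeholders.
gazetteer = [
    {"name": "Alpha", "country": "Xland", "lat": 10.0, "lon": 20.0},
    {"name": "Beta", "country": "Yland", "lat": 1.0, "lon": 2.0},
    {"name": "Beta", "country": "Yland", "lat": 3.0, "lon": 4.0},
]
cities = [("alpha ", "Xland"), ("Beta", "Yland"), ("Gamma", "Zland")]
matched, ambiguous, unmatched = join_to_gazetteer(cities, gazetteer)
```

Keeping ambiguous and unmatched keys separate makes it clear which entries still need a manual look-up.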

The GeoNames geographical gazetteer does have significant challenges and limitations. First, GeoNames does not provide coordinates for ancient locations or for cities whose names have changed over time. For this purpose both the Ancient Locations25 database of archaeological sites and the Getty Thesaurus of Geographic Names26 were used. GeoNames also contains only city point data, meaning it does not include urban extents. In a majority of cases, GeoNames determines these point data by calculating the centroid of a city area when available. However, some locations appear to use some other general point location within the urban extent27. This inconsistency most likely occurs as the result of user contributions to the database27. Duplicate toponyms, or multiple locations that share the same city name, are also challenging when joining tables with GeoNames and ArcGIS, and can lead to multiple matches for the same city name.

Although this approach was more successful than the initial one, some cities remained either unmatched or matched to the incorrect geographic location. These errors arose from the varying quality of the original data in the Chandler and Modelski datasets. In the Chandler dataset, only approximately 50% of city names included a corresponding country name. The remaining locations appeared in tables at the end of the text highlighting the 75 largest cities of the world at specific time periods, and could be cross-referenced with the continental maps at the start of the book (as shown in Fig. 4) or with the Whereabouts of Unfamiliar Cities section at the end of the text. However, as noted in Fig. 4, the image quality of these maps is quite variable, even in the original text. Modelski’s population tables also included only city names, grouped by region/continent. Without a country name, it can be both challenging and time consuming to determine the correct geographic location because of duplicate toponyms. As a result, we added the most probable country names to all entries in the Chandler and Modelski datasets.

Alternate spellings, typographic errors, truncation of names to save characters, and city name changes over time can all serve to complicate and hinder the geocoding process. Inevitably, working with a larger, comprehensive dataset requires a significant amount of time and effort devoted to manual and programmatic data cleaning.

As a result, all entries were manually checked for accuracy. Errors were remedied by a one-by-one look-up in Google Earth and Wikipedia’s GeoHack toolservers28, which provide map sources and World Geodetic System 1984 (WGS 1984) coordinate system-based geolocations. GeoHack visually highlights specific locations through various global map services such as GeoNames, Google Earth, Google Maps, OpenStreetMap, MapQuest, and Bing Maps. The Ancient Locations database, Getty Thesaurus of Geographic Names, and GeoHack database were also used for ancient city locations.

Less than 10% of the dataset was discarded as a result of unmatchable city locations due to changes in names and spellings. The final geocoding result had a 90% match rate. The standard minimum match rate necessary in determining a dataset to be spatially reliable is considered to be 85% for address-based data29.

A one-by-one look-up is a tedious process that can itself lead to manual transcription errors. We nevertheless undertook this approach because original data format challenges and omissions prohibited accurate automated matching to large online databases, and because the dataset is relatively small (1,741 city entries and 10,353 unique city/date/population values). As a result, this dataset does not provide spatial confidence values, which estimate the accuracy of the geocoded result. However, a geolocation reliability scale and rating was created and applied to the entire dataset based on the methods described above. This rating scale is described and discussed in the Technical Validation section.

Code availability

These three datasets were then combined using R statistical software to create one large population dataset spanning the years 3700 BC to AD 2000. A sample script converts the three datasets (one from Chandler and two from Modelski, one each for the ancient and modern time periods) from wide format into long format, standardizes the column names, and combines them into one large dataset30. A link to this script is included in the References section. Many other approaches could be used to complete this task.
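As one example of an alternative approach, the wide-to-long conversion and combination step can be sketched in plain Python. This is not the authors' R script: the column names, the use of negative integers for BC years, and the sample rows are assumptions made for illustration.

```python
def wide_to_long(rows, source, id_fields=("city", "country")):
    """Convert wide rows (one column per year) to long rows (one row per value).

    rows: list of dicts whose non-id keys are year labels such as "-2000" or
    "1000" (negative numbers for BC is a convention assumed here).
    Empty cells are skipped.
    """
    long_rows = []
    for row in rows:
        ids = {field: row[field] for field in id_fields}
        for column, value in row.items():
            if column in id_fields or value in (None, ""):
                continue
            long_rows.append(
                {**ids, "year": int(column), "population": value, "source": source}
            )
    return long_rows


# Illustrative rows only, not values from either source.
chandler_wide = [{"city": "CityA", "country": "CountryX", "1000": 50000, "1500": ""}]
modelski_wide = [{"city": "CityA", "country": "CountryX", "-2000": 30000}]

# Standardized long tables can simply be concatenated into one dataset.
combined = (wide_to_long(chandler_wide, "Chandler")
            + wide_to_long(modelski_wide, "Modelski"))
```

Keeping a `source` column in the long table preserves both authors' values for the same city and year, so users can choose between them.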

Dataset development challenges and limitations

Chandler’s data have a number of limitations. Despite Chandler’s accomplishment in creating the dataset, the resulting population data are temporally and spatially sparse, as illustrated in Figs 5 and 6. Figure 5 provides a snapshot spatial view of all data points across the entire dataset for given time periods and highlights the number of population values for each city, while Fig. 6 illuminates the dataset’s temporal sparseness. As indicated by the leftmost bar in the histogram, over 600 of the 1,741 original cities have only one population value. As a result, temporal gaps in measured or interpolated population values can be hundreds or thousands of years in duration, especially before AD 1100. Data points are also sparse for South Asian, South American, North American, and African cities. These data alone are not accurate global representations of all population values through time. Rather, they highlight the population values of important global cities during important time periods. This fact limits the scope of analysis possible with this dataset.

Figure 6: Temporal Representation: Frequency Histogram. Histogram reporting temporal frequency of data points for individual cities. Over 600 cities have only one population value for the entire dataset time period as indicated by the leftmost histogram bar, signifying the temporal sparseness of the dataset. Table 2 emphasizes the tail end of this distribution, highlighting cities with the highest frequency of population points through time.

Other limitations of Chandler’s work include both his definition of a city and his data interpolation methods. Chandler defines a city as ‘an urban area including suburbs lying outside of the municipal area, and omitting farmland lying within the municipality’ but follows this definition with the statement that suburb growth was not significant until 1850 (ref. 5). However, suburb, or peri-urban, growth has been documented prior to this point in history.

Cities and their boundaries and populations are constantly changing. Chandler’s early city population estimates used the spatial extent of the city to estimate the population using a common population density of the region and time period. However, these population densities shift with family and city structure5. Related archaeological and historical research helping to predict city size has also been significantly improved since Chandler published his dataset5.

It should also be noted that the definition of what constitutes an urban center varies across time periods and regions31. While Chandler’s and Modelski’s data provide general trends in population trajectories over time, these patterns can originate from quite different growth or abandonment narratives, as the true history of a settlement is complex8,32.

Since the publication of Chandler’s data in 1987, a number of scholars have proposed methods to improve the dataset. For example, Bairoch, Pasciuti, and Chase-Dunn have proposed and carried out work to improve the coverage and accuracy of these historic city population estimates5,33. Pasciuti and Chase-Dunn proposed creating a new dataset of city populations for the Urbanization and Empire Formation Project in their report, Estimating The Population Sizes of Cities5. Bairoch aimed to improve Chandler’s population estimates by also taking into consideration the land type within city walls (commercial, residential, gardens, or grazing), uninhabitable space within buildings, and the density of occupations; he suggests increasing Chandler’s estimates for European cities by 15% and for Latin American cities by up to 50% (refs 5,33).

It is also important to note that urbanization is not a linear process. A number of historical events, such as natural disasters (e.g., fires, earthquakes, droughts) and human conflicts (e.g., wars, invasions, colonialism, conflicts over natural resources), have influenced human migration patterns and, as a result, affected both settlement population counts and general urbanization trends. The objective of this work is not to reconcile these issues, but to spatialize the data so they are in a usable format for researchers, including historians, to critique these population estimates and analyze global urbanization trends.

Despite these challenges and shortcomings of the Chandler and Modelski data, the combined dataset presented here is a significant advance in the understanding of urbanization and cities through history. Although both the Chandler and Modelski volumes are available, they are neither widely accessible nor readily usable in their tabular form. Both are available in print format only (although Modelski’s digital data was made available to the authors via personal communication), and the process of spatializing the data requires a significant investment of time and of human and technical resources. Thus, although this dataset does have limitations, its easily accessible and usable format will allow it to be tested and examined by the wider research community, such as geographers, historians, archeologists, or ecologists. Spatializing and digitizing the dataset presents a new, improved basis for data dissection and visualization. Providing the data in this easily accessible and usable format also encourages a more rigorous critique by the scientific community of the population estimates. Left in their current forms, these valuable scholarly works remain in an impractical and less usable format.