Datasets

We use two geo-referenced datasets on population and CO₂ emissions in the continental US, defined on a fine geometrical grid. The population dataset is obtained from the Global Rural-Urban Mapping Project (GRUMPv1)22. These data combine gridded census and satellite data for the population of urban and rural areas in the United States in the year 2000 (Fig. 1a and Sec. 3). GRUMPv1 provides high-resolution gridded population data at 30 arc-seconds, equivalent to a grid of 0.926 km × 0.926 km at the Equator.
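The quoted cell size can be checked with a short calculation. The sketch below assumes a spherical Earth on which one arc-minute of a great circle is one nautical mile (1.852 km); the function name `cell_size_km` is hypothetical and is not part of GRUMPv1.

```python
import math

# Assumption: one arc-minute of a great circle is one nautical mile.
KM_PER_ARCMIN = 1.852

def cell_size_km(arcsec, lat_deg=0.0):
    """Return (east-west, north-south) cell extent in km on a spherical Earth."""
    north_south = (arcsec / 60.0) * KM_PER_ARCMIN
    # Meridians converge towards the poles, so the east-west extent shrinks.
    east_west = north_south * math.cos(math.radians(lat_deg))
    return east_west, north_south

ew_equator, ns_equator = cell_size_km(30.0)     # at the Equator
ew_40, ns_40 = cell_size_km(30.0, lat_deg=40.0) # at a mid-latitude of the US
```

Under this approximation the east-west extent of a cell is smaller than 0.926 km at US latitudes, which is why the text specifies the Equator.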

Figure 1 Population and emissions in US. (a) The population map of the contiguous US from the Global Rural-Urban Mapping Project (GRUMPv1)22 dataset in logarithmic scale. (b) The CO₂ emissions map of the contiguous US from the Vulcan Project (VP) dataset23 measured in log base 10 scale of metric tonnes of carbon per year. (c) Map of the mean household income per capita of 3,092 US counties in dollars from US Census Bureau dataset24 for the year 2000.

The emissions dataset is obtained from the Vulcan Project (VP) compiled at Arizona State University23. The VP provides fossil fuel CO₂ emissions in the continental US at a spatial resolution of 10 km × 10 km (0.1 deg × 0.1 deg grid) from 1999 to 2008. The data are separated according to economic sectors and activities (see Sec. 3 for details): Commercial, Industrial and Residential sectors (obtained from country-level aggregation of non-geocoded sources and non-electricity producing sources from geocoded locations), Electricity Production (geolocated sources associated with the production of electricity, such as thermal power stations), Onroad Vehicles (mobile transport using designated roadways, such as automobiles, buses and motorcycles), Nonroad Vehicles (mobile surface sources that do not travel on roadways, such as boats, trains and snowmobiles), Aircraft (comprising Airports, i.e. geolocated sources associated with the taxi, takeoff and landing cycles of air travel, and Aircraft, i.e. gridded sources associated with the airborne component of air travel) and Cement Industry.

We analyze the annual average of emissions in 2002 for the total of all sectors combined (see Fig. 1b) and for each sector separately (Fig. 2). The choice of 2002 data (rather than 2000, as for population) reflects the constraint that it is the only year for which CO₂ emissions have been quantified at the scale of individual factories, power plants, roadways and neighborhoods and on an hourly basis23.

Figure 2 The CO₂ emissions maps in metric tonnes of carbon per year from Vulcan Project (VP) dataset23 for each sector: (a) Aircraft, (b) Cement, (c) Commercial, (d) Industrial, (e) On-road, (f) Non-road, (g) Residential and (h) Electricity.

To define the boundary of cities, we use the notion of spatial continuity by aggregating settlements that are close to each other into cities15,16,17,18,20,21. Such a procedure, called the City Clustering Algorithm (CCA), considers cities as constituted of contiguous commercial and residential areas, for which we also know the CO₂ emissions from the Vulcan Project dataset. By using two microscopically defined datasets, we are able to match precisely the population of each agglomeration to its rate of CO₂ emissions, constructing the urban agglomerations from the bottom up without resorting to predefined administrative boundaries.

We also use the US income dataset provided in ASCII format by the US Census Bureau24 for the year 2000. This dataset provides the mean household income per capita for the 3,092 US counties. For each county, we combined the income data and the administrative boundaries25 in order to relate them to the geolocated datasets (Fig. 1c and Sec. 3).

We first apply the CCA to construct cities by aggregating populated sites, where D_i denotes the population at site i. The procedure depends on a population threshold D* and a distance threshold ℓ. If D_i > D*, the site i is populated. The length ℓ represents a cutoff distance between sites to consider them as spatially contiguous, i.e. we aggregate all nearest-neighbor sites which are at distances smaller than ℓ. Thus a CCA cluster or city is defined by populated sites within a distance smaller than ℓ, as seen schematically in Fig. 3. Starting from an arbitrary seed, we add all populated neighbors at distances to the cluster smaller than ℓ until no more sites can be added to the cluster. The scaling laws produced by the CCA depend weakly on D* and ℓ, and we are interested in the region of parameter space where the scaling laws are independent of these parameters.
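The clustering procedure described above can be sketched as a breadth-first search on the population grid. This is a minimal illustration, not the authors' implementation: it assumes a uniform square grid with distances measured between cell centers in units of the cell size, and the names `city_clustering` and `ell_cells` are hypothetical.

```python
from collections import deque
import numpy as np

def city_clustering(pop, d_star, ell_cells):
    """Label CCA clusters on a 2D population grid.

    pop       : 2D array with the population D_i of each cell
    d_star    : threshold D*; cells with D_i > D* count as populated
    ell_cells : cutoff distance ell, in units of the cell size (assumption)
    Returns an integer array: 0 for unpopulated cells, 1..K for clusters.
    """
    populated = pop > d_star
    labels = np.zeros(pop.shape, dtype=int)
    r = int(np.floor(ell_cells))
    # Offsets of all cells whose centers lie at a distance smaller than ell.
    offsets = [(di, dj)
               for di in range(-r, r + 1)
               for dj in range(-r, r + 1)
               if (di, dj) != (0, 0) and di * di + dj * dj < ell_cells ** 2]
    n_rows, n_cols = pop.shape
    current = 0
    for seed in zip(*np.nonzero(populated)):
        if labels[seed]:
            continue  # already absorbed by an earlier cluster
        current += 1
        labels[seed] = current
        queue = deque([seed])
        while queue:  # grow the cluster until no more sites can be added
            i, j = queue.popleft()
            for di, dj in offsets:
                ni, nj = i + di, j + dj
                if (0 <= ni < n_rows and 0 <= nj < n_cols
                        and populated[ni, nj] and labels[ni, nj] == 0):
                    labels[ni, nj] = current
                    queue.append((ni, nj))
    return labels

# Toy grid: two groups of populated cells separated by an empty column.
grid = np.array([[1500, 1200, 0, 3000],
                 [ 900, 2000, 0, 2500],
                 [   0,    0, 0,    0]])
labels = city_clustering(grid, d_star=1000, ell_cells=1.5)
cluster_pop = {k: int(grid[labels == k].sum()) for k in range(1, labels.max() + 1)}
merged = city_clustering(grid, d_star=1000, ell_cells=2.5)
```

In the toy grid the two groups stay separate for a cutoff of 1.5 cells and merge into one cluster for a cutoff of 2.5 cells. With the 0.926 km cells of the population data, ℓ = 5 km would correspond to roughly 5.4 cell widths under the uniform-grid assumption.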

Figure 3 CCA stages: We consider that if D_i > D*, then the site i is populated (light blue squares). Each site is defined by its geometric center (black circles) and the length ℓ represents a cutoff on the distance to define the nearest neighbor sites. We aggregate all nearest-neighbor sites, i.e. a CCA is defined by populated sites within a distance smaller than ℓ (red circles).

This aggregation criterion, based on the geographical continuity of development, was shown to provide strong evidence of Zipf's law in the US and UK15,16,17,18,20,21, in agreement with established results in urban sciences26,27,28,29. For cut-off lengths above ℓ = 5 km, it was shown that CCA clusters obey Zipf's law and that the Zipf exponent is independent of ℓ. In what follows, we first present results for aggregated clusters at ℓ = 5 km and then show the robustness of the scaling laws over a larger range of parameter space.

In order to assign the total CO₂ emissions to a given CCA cluster, we superimpose the obtained cluster on the CO₂ emissions dataset. If a populated site composing a CCA cluster falls inside a CO₂ site, we assign to the populated site a share of the corresponding CO₂ emissions proportional to its area, 0.926² km², considering that the emissions density is constant across the CO₂ site of 10² km². For a given CCA cluster, we then calculate the population (POP) and CO₂ emissions by adding the values of the constitutive sites of the cluster.
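The area-proportional assignment can be illustrated with a few lines of code. The sketch assumes that each population cell lies entirely inside a single emissions cell and that the emissions of the host cell are already known for every population cell; the function names and the numerical values in the example are hypothetical.

```python
import numpy as np

POP_CELL_KM = 0.926  # side of a population cell, from the text
CO2_CELL_KM = 10.0   # side of an emissions cell, from the text

def co2_per_population_cell(co2_cell_total):
    """Share of a CO2 cell's emissions assigned to one population cell inside it,
    assuming a uniform emissions density across the CO2 cell."""
    return co2_cell_total * (POP_CELL_KM ** 2) / (CO2_CELL_KM ** 2)

def cluster_totals(pop, co2_share, labels):
    """Sum population and assigned CO2 over the cells of each cluster label."""
    totals = {}
    for k in np.unique(labels[labels > 0]):
        mask = labels == k
        totals[int(k)] = (float(pop[mask].sum()), float(co2_share[mask].sum()))
    return totals

# Hypothetical example: four population cells, two clusters.
pop = np.array([[1500.0, 1200.0], [0.0, 3000.0]])
labels = np.array([[1, 1], [0, 2]])
host_co2 = np.array([[5.0e4, 5.0e4], [5.0e4, 2.0e5]])  # emissions of the host CO2 cell
share = co2_per_population_cell(host_co2)
totals = cluster_totals(pop, share, labels)
```

Each population cell receives the fraction 0.926²/10² ≈ 0.0086 of the emissions of its host cell, and the cluster totals are plain sums over the member cells.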

Scaling of emissions with city size

Figure 4 shows the correlation between the total annual CO₂ emissions and POP for each CCA cluster for ℓ = 5 km and D* = 1000 (N = 2281). We perform a non-parametric regression with bootstrapped 95% confidence bands30,31 (see Sec. 3). We find that the emissions grow with the size of the cities, on average, faster than the expected linear behavior. The result can be approximated over many orders of magnitude by a power law, yielding the following allometric scaling law:

$$\log \mathrm{CO_2} = A + \beta \log \mathrm{POP}, \tag{1}$$

where A = 2.05 ± 0.12 and β = 1.38 ± 0.03 (R² = 0.76) is the allometric scaling exponent obtained from Ordinary Least Squares (OLS) analysis32 for this particular set of parameters, ℓ = 5 km and D* = 1000 (see Sec. 3 for details on OLS and on the estimation of the exponent error; all emissions are measured in log base 10 of metric tonnes of carbon per year).
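As an illustration of how such an exponent is estimated, the sketch below fits a straight line to log10(CO₂) versus log10(POP) by OLS. The data are synthetic, generated around the quoted values of A and β, and the standard error shown is the textbook OLS formula, which may differ from the error estimate detailed in Sec. 3; `fit_allometry` is a hypothetical name.

```python
import numpy as np

def fit_allometry(pop, co2):
    """OLS fit of log10(CO2) = A + beta * log10(POP).

    Returns A, beta, R^2 and the standard error of beta.
    """
    x, y = np.log10(pop), np.log10(co2)
    beta, A = np.polyfit(x, y, 1)  # slope first, then intercept
    resid = y - (A + beta * x)
    ss_res = float(np.sum(resid ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    # Standard error of the slope under the usual OLS assumptions.
    se_beta = float(np.sqrt(ss_res / (len(x) - 2) / np.sum((x - x.mean()) ** 2)))
    return float(A), float(beta), r2, se_beta

# Synthetic clusters scattered around the quoted scaling law (not real data).
rng = np.random.default_rng(0)
pop = 10 ** rng.uniform(3, 7, size=2000)
co2 = 10 ** (2.05 + 1.38 * np.log10(pop) + rng.normal(0, 0.3, size=2000))
A, beta, r2, se_beta = fit_allometry(pop, co2)
```

A fitted β above one corresponds to superlinear scaling, β = 1 to isometric scaling and β below one to sublinear scaling.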

Figure 4 Scaling of CO₂ emissions versus population. We found a superlinear relation between CO₂ (metric tonnes/year) and POP with the allometric scaling exponent β = 1.38 ± 0.03 (R² = 0.76) for the case ℓ = 5 km, D* = 1000. The solid (black) line is the Nadaraya-Watson estimator, the dashed (black) lines are the lower and upper confidence interval and the solid (red) line is the linear regression.
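The non-parametric curve and its confidence band can be sketched as follows. This is an illustrative version only: it assumes a Gaussian kernel with a fixed, hand-picked bandwidth and a pointwise percentile bootstrap, whereas the actual kernel, bandwidth selection and bootstrap scheme are those described in Sec. 3. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def nadaraya_watson(x, y, grid, bandwidth):
    """Gaussian-kernel Nadaraya-Watson estimate of E[y | x] at the grid points."""
    u = (grid[:, None] - x[None, :]) / bandwidth
    w = np.exp(-0.5 * u ** 2)          # kernel weights, one row per grid point
    return (w @ y) / w.sum(axis=1)     # locally weighted average of y

def bootstrap_band(x, y, grid, bandwidth, n_boot=200, level=0.95, seed=0):
    """Pointwise percentile band from resampling (x, y) pairs with replacement."""
    rng = np.random.default_rng(seed)
    n = len(x)
    curves = np.empty((n_boot, len(grid)))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        curves[b] = nadaraya_watson(x[idx], y[idx], grid, bandwidth)
    alpha = (1.0 - level) / 2.0
    lower, upper = np.quantile(curves, [alpha, 1.0 - alpha], axis=0)
    return lower, upper

# Synthetic log-log data scattered around a straight line (not real data).
rng = np.random.default_rng(1)
x = rng.uniform(3, 7, size=2000)                      # log10(POP)
y = 2.05 + 1.38 * x + rng.normal(0, 0.3, size=2000)   # log10(CO2)
grid = np.linspace(4.0, 6.0, 21)
estimate = nadaraya_watson(x, y, grid, bandwidth=0.3)
lower, upper = bootstrap_band(x, y, grid, bandwidth=0.3)
```

Comparing the kernel estimate with the OLS line is a visual check that a single power law describes the data over the whole range of city sizes.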

In addition, we investigate the robustness of the allometric exponent as a function of the thresholds D* and ℓ. Figure 5a shows β as a function of the cut-off length ℓ for different values of the population threshold D* (1000, 2000, 3000 and 4000). We observe that β increases with ℓ until it reaches a saturation value that is relatively independent of D*. Averaging the exponent over D* in the plateau region with ℓ > 10 km gives the mean exponent for the total emissions (see Table 1). Thus, we find superlinear allometry indicating an inefficient emissions law for cities: doubling the city population results in an average increment of 146% in CO₂ emissions, rather than the expected isometric 100%. This positive non-extensivity suggests that the high productivity found in larger cities3,4 is achieved at the expense of a disproportionately larger amount of emissions compared to small cities.

Figure 5 Behavior of allometric exponent β. (a) We plot β for the total emissions for different D* as a function of ℓ. The exponent β increases with ℓ until a saturation value. (b) Allometric exponent versus ℓ for the different sectors of the economy as indicated. The scaling exponent ranges from sublinear behavior (β < 1, optimal) in the cement and aircraft sectors, to superlinear behavior (β > 1, suboptimal) in nonroad and onroad vehicles and residential emissions, up to the less efficient sectors in commercial, industrial and electricity production activities.

Figure 5b investigates the emissions of cities as deconstructed by different sectors and activities of the economy. We perform non-parametric regression with bootstrapped 95% confidence bands for each sector (see Fig. 6 for D* = 1000 and ℓ = 5 km), compute β versus ℓ, and find that the exponents for the different sectors saturate to an approximately constant value for ℓ > 10 km. We assign to each sector an average exponent over the plateau, as seen in Table 1. The sectors with higher exponents (less efficient) are Residential, Industrial, Commercial and Electricity Production, whose exponents lie above the average for the total emissions. Onroad vehicles also scale superlinearly, yet with an exponent below the total average. The exponent for Nonroad vehicles is also below the average, while the Aircraft sector displays approximately isometric scaling. Cement Production displays sublinear scaling, although the reported data are less significant than the rest, with only 20 datapoints of cities available.

Table 1 Allometric exponents for CO₂ emissions according to different sectors and total emissions of all sectors

Figure 6 The plot shows the CO₂ behavior measured in metric tonnes of carbon per year versus POP of the CCA clusters for different sectors. We found a superlinear relation between CO₂ and POP for all the cases, except for the Aircraft and Cement sectors. The solid (black) line is the Nadaraya-Watson estimator, the dashed (black) lines are the lower and upper confidence interval and the solid (red) line is the linear regression.

We further investigate the dependence of the allometric exponent β on the income per capita of cities by aggregating the CCA clusters by their income (INC) and plotting the obtained β(INC) in Fig. 7 (see also Fig. 8). We find an inverted U-shape relationship, which is analogous to the so-called environmental Kuznets curve (EKC)7,33,34. We observe that β initially increases for cities with low income per capita until an income turning point located at $37,235 per capita (in 2000 US dollars). After the turning point, β decreases, indicating an environmental improvement for high-income cities. However, the allometric exponent always remains larger than one regardless of the income level (except for the lowest income), indicating that almost all large cities are less efficient than small ones, no matter their income.

Figure 7 Dependence of allometric exponent β on the income per capita of the CCA clusters. We found an inverted-U-shaped curve similar to an environmental Kuznets curve (EKC). In other words, we find a decrease of the allometric exponent β for the lower and higher income levels, with the following regression coefficients: a_0 = −247.35, a_1 = 108.88 and a_2 = −11.91. The income turning point is located at $37,235.
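The turning point can be related to the regression coefficients quoted in the Fig. 7 caption. The functional form is not spelled out in this section, so the sketch below assumes a quadratic in the base-10 logarithm of income, β = a_0 + a_1 log10(INC) + a_2 [log10(INC)]²; under that assumption the vertex of the parabola lands close to the quoted turning point. The names `beta_of_income`, `x_star` and `income_star` are hypothetical.

```python
import math

# Regression coefficients quoted in the Fig. 7 caption.
A0, A1, A2 = -247.35, 108.88, -11.91

def beta_of_income(income_usd):
    """Quadratic in log10(income); this functional form is an assumption."""
    x = math.log10(income_usd)
    return A0 + A1 * x + A2 * x * x

# Vertex of the parabola: d(beta)/dx = A1 + 2 * A2 * x = 0.
x_star = -A1 / (2.0 * A2)
income_star = 10.0 ** x_star          # turning-point income in US dollars
beta_max = beta_of_income(income_star)
```

Because a_2 is negative the parabola opens downwards, so the vertex is a maximum of β, which is the inverted-U shape described in the text.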

Figure 8 Total CO₂ emissions in metric tonnes of carbon per year versus POP of CCA clusters for different income ranges as indicated. We found a superlinear relation between CO₂ and POP for all the cases except for the lowest income range, below $25,119. The solid (black) line is the Nadaraya-Watson estimator, the dashed (black) lines are the lower and upper confidence interval and the solid (red) line is the linear regression. The resulting exponent β(INC) is plotted in Fig. 7.

Comparison with MSA

A further important issue in the scaling of cities is the dependence on the way they are defined15,16,17,18,20,21,35. Thus, it is of interest to compare our results with definitions based on administrative boundaries, such as the commonly used Metropolitan Statistical Areas (MSA)36 provided by the US Census Bureau37. MSAs are constructed from administrative boundaries by aggregating neighboring counties which are related socioeconomically via, for instance, large commuting patterns. A drawback is that MSAs are available only for a subset (274 cities) of the most populated cities in the US and therefore can represent only the upper tail of the distribution17,21,35 (see Sec. 3 for details).

Furthermore, we find that the MSA construction violates the expected extensivity3,17 between the land area occupied by the MSAs and their population, since the MSA definition overestimates the area of small agglomerations17. This is indicated in Fig. 9, where we find the regression:

$$\log \mathrm{AREA}_{\mathrm{MSA}} = a_{\mathrm{MSA}} + b_{\mathrm{MSA}} \log \mathrm{POP}, \tag{2}$$

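with a_MSA = 0.81 ± 0.36 and b_MSA = 0.51 ± 0.06 (R² = 0.48). This approximate square-root law implies that the density is not constant across the MSAs:

$$\rho_{\mathrm{MSA}} = \frac{\mathrm{POP}}{\mathrm{AREA}_{\mathrm{MSA}}} \sim \mathrm{POP}^{0.49}. \tag{3}$$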

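In contrast, CCA clusters capture precisely the occupied area of the agglomeration, leading to the expected extensive relation between land area and population, as seen also in Fig. 9:

$$\log \mathrm{AREA}_{\mathrm{CCA}} = a_{\mathrm{CCA}} + b_{\mathrm{CCA}} \log \mathrm{POP}, \tag{4}$$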

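with a_CCA = −2.86 ± 0.06 and b_CCA = 0.94 ± 0.01, with small dispersion (R² = 0.99), implying that the population density of CCA clusters is well defined (extensive), i.e. it is approximately constant across population sizes:

$$\rho_{\mathrm{CCA}} = \frac{\mathrm{POP}}{\mathrm{AREA}_{\mathrm{CCA}}} \sim \mathrm{POP}^{0.06}. \tag{5}$$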

In summary, while the CCA displays an almost isometric relation between population and area, the MSA shows sublinear scaling between these two measures. As a consequence, the emissions of CCA clusters are independent of the population density, as expected. On the other hand, from Eq. 2 and Eq. 6, the MSA leads to a superlinear scaling between them, $\mathrm{CO_2} \sim \rho_{\mathrm{MSA}}^{\,\beta_{\mathrm{MSA}}/(1 - b_{\mathrm{MSA}})} \approx \rho_{\mathrm{MSA}}^{1.88}$.
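The density exponents follow from the area regressions by eliminating the area, and the emissions-density exponent follows by further eliminating the population. The short derivation below is a sketch in notation assumed here: AREA for the occupied land area, ρ = POP/AREA for the density, and the fitted values b_MSA = 0.51, b_CCA = 0.94 and β_MSA = 0.92 quoted in the text.

```latex
\begin{align}
\log \mathrm{AREA} = a + b \log \mathrm{POP}
  &\;\Rightarrow\; \rho \equiv \frac{\mathrm{POP}}{\mathrm{AREA}} \sim \mathrm{POP}^{\,1-b}, \\
1 - b_{\mathrm{MSA}} = 1 - 0.51 = 0.49,
  &\qquad 1 - b_{\mathrm{CCA}} = 1 - 0.94 = 0.06, \\
\mathrm{CO_2} \sim \mathrm{POP}^{\,\beta}
  &\;\Rightarrow\; \mathrm{CO_2} \sim \rho^{\,\beta/(1-b)}, \qquad
  \frac{\beta_{\mathrm{MSA}}}{1 - b_{\mathrm{MSA}}} = \frac{0.92}{0.49} \approx 1.88 .
\end{align}
```

For the CCA clusters the density exponent 1 − b_CCA is close to zero, so the density barely changes with city size and cannot account for the variation in emissions.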

Figure 9 Scaling of the occupied land area versus population for MSAs and CCA clusters. Two problems are evident from this comparison. First, the range of population obtained by MSA is two decades smaller than that of CCA, since CCA captures all city sizes while MSA is defined only for the top 274 cities. Second, the MSA violates the extensivity between land area and population while CCA does not. This is due to the fact that MSA agglomerates many small cities into a single administrative boundary with a large area which can be largely unpopulated, as can be seen in the examples of Fig. 10. This results in an overestimation of the areas of small cities compared with large cities, and hence in the violation of extensivity shown in the figure. This endogenous bias is absent in the CCA definition. This bias in the small cities ultimately affects the allometric exponent, yielding a β_MSA smaller than the one obtained using the CCAs.

The non-extensive character of the MSA areas is due to the fact that many MSAs are constituted by aggregating small disconnected clusters, resulting in large unpopulated areas inside the MSA. This is exemplified in some typical MSAs plotted in Fig. 10, such as Las Vegas, Albuquerque, Flagstaff and others. The plots show that a large MSA area is associated with a series of disconnected small counties, as seen, for instance, in the region near Las Vegas. This clustering of disconnected small cities inside an MSA results in an overestimation of the emissions associated with the Las Vegas MSA, for instance. The same pattern is verified for many small cities, especially in the Midwest of the US, as seen in the other panels. For some large cities, like NY, the agglomeration captures similar shapes as in the occupied areas obtained with CCA, although it is also clearly seen that the area of the NY MSA contains many unoccupied regions. Therefore, the occupied area of a typical MSA is overestimated in comparison to the area that is actually populated as captured by the CCA, and the bias is larger for small cities than for large ones. This endogeneity bias leads to an overestimation of the CO₂ emissions of the small cities as compared to large cities. Consequently, we find a smaller allometric exponent for MSA than for CCA, with an almost extensive relation:

$$\log \mathrm{CO_2} = A_{\mathrm{MSA}} + \beta_{\mathrm{MSA}} \log \mathrm{POP}, \tag{6}$$

with A_MSA = 1.08 ± 0.38 and β_MSA = 0.92 ± 0.07 (R² = 0.71, see Fig. 11). This result is consistent with previous studies of the scaling of emissions: Fragkias et al.36, who used MSAs and found a linear scaling between emissions and city size, and Rybski et al.38, who used administrative boundaries to define 256 cities in 33 countries. Tables 2 and 3 summarize the results for CCA and MSA cities.

Table 2 Population ranking of the top 15 CCA cities for D* = 1000 inhabitants and ℓ = 5 km. The total number of cities for these parameters is N = 2281. The areas are given in km², the incomes per capita are given in US$ and the CO₂ emissions are given in metric tonnes/year

Table 3 Population ranking of the top 15 MSA/CMSA cities and the associated CCA (†) for D* = 1000 inhabitants and ℓ = 5 km. The areas are given in km² and the CO₂ emissions are given in metric tonnes/year

Figure 10 Examples of MSA and CMSA combining the datasets from Global Rural-Urban Mapping Project (GRUMPv1), Vulcan Project (VP) and US Census Bureau22,23,37: (a)–(c) MSA of Albuquerque (Albuquerque, NM); (d)–(f) MSA of Flagstaff (Flagstaff, AZ–UT); (g)–(i) CMSA of Los Angeles (Los Angeles–Riverside–Orange County, CA); (j)–(l) MSA of Reno (Reno, NV); and (m)–(o) MSA of Las Vegas (Las Vegas, NV–AZ). In the first column, we plot the population as given by the GRUMPv1 dataset inside the administrative boundary of the MSA as provided by the US Census Bureau. The grey regions show the large unpopulated areas included inside the MSA. The large MSA areas thus put together different populated clusters into one large administrative boundary. In the second column, we plot the CO₂ emissions dataset inside the boundary of each MSA. The population and the CO₂ emissions are plotted in logarithmic scale according to the color bar at the bottom of the plot. In the third column, we plot the CCA clusters inside the corresponding MSA. Unlike the MSA, the CCA captures the contiguous occupied area of a city.

Figure 11 CO₂ emissions in metric tonnes/year versus POP using the MSA/CMSA definition of cities for the total CO₂ emissions. We found an almost extensive relation between CO₂ and POP with the allometric scaling exponent β_MSA = 0.92 ± 0.07 (R² = 0.71). The solid (black) line is the Nadaraya-Watson estimator, the dashed (black) lines are the lower and upper confidence interval and the solid (red) line is the linear regression.

Thus, the measurement bias in the MSAs leads to the smaller β found for MSA as compared with CCA, since low-density MSAs have relatively large areas. Hence, the CCA results, which are not subject to that endogeneity bias, should be considered the main source of information on emissions. They show a positive link between emissions and population size as well as the expected extensive behavior of the occupied land. This analysis calls attention to the need to use a proper definition of cities when the scaling behavior of small cities needs to be accurately represented. Indeed, this issue arises in the controversy regarding the distribution of city sizes for small cities, since the distribution of administrative cities (such as US Places) is found to be broadly lognormal (that is, a power law in the tail that deviates into a log-Gaussian for small cities)21,39,40,41,42, while the distribution of geography-based agglomerations like CCA is found to be Zipf distributed along all cities (power law for all cities)13,14,15,16,17,18,20.