Estimating internal migration flows between administrative units

Following Garcia et al.60 a gravity model-based approach was used to estimate the total number of people migrating from one administrative unit to any other administrative unit, between 2005 and 2010, within each malaria endemic country located in Africa, Asia, Latin America and the Caribbean62,63 (Supplementary Table 1).

The simplest gravity-type spatial interaction model, proposed by Zipf64, considers the total population in a location of origin i and in a location of destination j (henceforth simply indicated as i and j), and the distance between the two locations to predict the migration flow (MIG ij ) between them. Thus, migration flows between administrative units can be estimated using the following function:

(1) M I G i j = P i α P j β d i j γ

where P i α and P j β represent the populations in the location of origin i and of destination j, respectively, and d ij γ represents the distance between i and j; with α, β, and γ being parameters, used to indicate the magnitude of the effect for each covariate, that are typically estimated in the statistical modelling framework.

In this study, following the notation from Henry et al.59 and Garcia et al.60, the basic gravity-type spatial interaction in equation (1) was extended in order to include additional geographical and socioeconomic factors described in detail in the Data collection and preparation subsection below. Since the census-based migration microdata extracted from the IPUMSI database61 represent only a sample of the total census, a logistic regression was used to model the proportion of people migrating between administrative units65. In particular, the logistic regression was used to model the proportion of people residing in j in the census year who were in i ‘n’ years prior to the census. Thus, the proportion of migrants in j in the census year that were previously residing in i was estimated using the following logistic regression function:

(2) p i j = e β 0 + β 1 P i + β 2 P j − β 3 d i j 1 + e β 0 + β 1 P i + β 2 P j − β 3 d i j

where p i j = M I G i j / T O T j ; with MIG ij and TOT j representing the number of people residing in j in the census year that were in i ‘n’ years prior to the census and the total population residing in j in the census year, respectively.

Initially, a separate vector β=(β 0 , β 1 , …, β n ) of coefficients was used in the linear predictor of the gravity model for each country (including malaria non-endemic countries), in Africa, Asia, Latin America and the Caribbean, for which migration data were available in the IPUMSI database61 (hereafter referred as IPUMSI countries; Table 1).

Table 1 Summary information about the edited IPUMSI 5-year internal migration microdata and the administrative unit datasets used to estimate the 5-year (2005–2010) internal human migration flows for every malaria endemic country Full size table

However, since the main aim of this study was to estimate internal human migration flows for malaria endemic countries for which migration data are not available, ultimately, models where the linear predictors were common across all countries located in the same continent were constructed (under the assumption of homogeneity of the process along the space). To investigate possible nonlinear relationships, models where linear predictors were replaced by additive predictors, using a Generalized Additive Modelling (GAM) framework66, were also explored.

GAM is a type of regression that, while preserving the functionality of using linear terms, allows covariates to have different and possibly opposite effects on the response variable by incorporating regression coefficients with smooth nonlinear form (Fig. 2).

Figure 2: Variation of the effect of the distance between administrative units (dij) on the predicted proportion of migrants in j in the census year that were previously residing in i (solid line) and 0.95 confidence intervals (dashed lines) as estimated by using a GAM. The rug plot (i.e., the vertical lines along the x axis) represents the distribution of the observed dij values. This example shows the result obtained using data for all countries located in Latin America and the Caribbean. Full size image

Thus, all possible combinations of covariates (listed in Table 2 and Supplementary Table 2) were tested in a logistic regression model and then only the linear predictors of all continuous covariates of the best predictive logistic regression model were also modelled using a GAM.

Table 2 Summary information about the source datasets and the main covariates tested in the spatial gravity models and used to derive additional covariates (Supplementary Table 2) for improving the predictive power of the models. Full size table

For each continent, the overall combinations of covariates and model types were explored using a multi-step approach to identify the model with the greatest predictive power in countries for which migration data were not available. The best model was then selected using a leave-one-out cross-validation approach67 in which the observed proportion of migrants in j previously residing in i for all countries except one were used for fitting models, that were subsequently used to predict the proportion of migrants in j previously residing in i in the withheld country. The correlation coefficient (R2) was selected to measure the variance explained after verifying homoscedasticity and testing overdispersion using a chi-squared test. This process was then repeated through iteratively withholding one country at the time. For each model, the R2 values for all withheld countries were averaged and used to rank each models according to their predictive power averaged across all withheld countries (Fig. 3). The overall best predictive model for each continent (Supplementary Table 3) was then used to predict the proportion of migrants residing in j who were previously residing in i for every malaria endemic country located in the corresponding continent (refer to Supplementary Table 4a,b and c for summary statistics of each best predictive model for Africa, Asia, and Latin America and the Caribbean, respectively).

Figure 3: Boxplots showing the distribution of all R2-values, for each withheld country, for all logistic regression (a,c,e) and GAM (b,d,f) models explored for Africa (a,b), Asia (c,d) and Latin America and the Caribbean (e,f). The red lines represent the best averaged R2 values used to select the best predictive model for each continent (Supplementary Table 3) while the red dots represent the R2 values, for all withheld countries, calculated using the best predictive model referring to the continent in which they are located. Full size image

Finally, in order to estimate the total number of people that migrated from i to j between 2005 and 2010 (Figs 4,5,6), for each country the predicted proportion of migrants residing in j was multiplied by the 2010 total population in j; with the latter calculated using either the corresponding WorldPop68–70 or the Gridded Population of the World version 4 (GPWv4)71 dataset adjusted to match United Nations Population Division (UNPD) estimates for 2010 (ref. 72). Refer to the Data collection and preparation subsection section below for a detailed description of how the population datasets mentioned above were identified and used.

Figure 4: Estimated internal human migration flows between subnational administrative units for every malaria endemic country in Africa (Supplementary Table 1). Coordinates for all three panels refer to GCS WGS 1984. For illustrative purposes, subnational unit boundaries are shown only in the insets and the colour ranges used to represent the flows are country-specific (refer to Supplementary Fig. 1 for additional close-up views of internal migration flows in Africa). Full size image

Figure 5: Estimated internal human migration flows between subnational administrative units for every malaria endemic country in Asia (Supplementary Table 1). Coordinates for all three panels refer to GCS WGS 1984. For illustrative purposes, subnational unit boundaries are shown only in the insets and the colour ranges used to represent the flows are country-specific (refer to Supplementary Fig. 2a,b for additional close-up views of internal migration flows in Asia). Full size image

Figure 6: Estimated internal human migration flows between subnational administrative units for every malaria endemic country in Latin America and the Caribbean (Supplementary Table 1). Coordinates for all three panels refer to GCS WGS 1984. For illustrative purposes, subnational unit boundaries are shown only in the insets and the colour ranges used to represent the flows are country-specific (refer to Supplementary Fig. 3 for additional close-up views of internal migration flows in Latin America and Caribbean). Full size image

Both model selection and prediction were performed using an R73 script contained in the WorldPop-InternalMigration-v1 code74 briefly described in the Code availability subsection below.

Data collection and preparation

In most of the countries available through the online IPUMSI database, internal migration variables were recorded by asking respondents either their administrative unit of residence 15, 5, or 1 prior to the census, or their previous residence and the number of years they are residing in the current locality. Considering that 5-year was the temporal interval available for most of the countries in the IPUMSI database and the fact that it has been demonstrated that both 1- and 5-year census-based internal migration data generally align well with shorter-term population movements in terms of relative strength of connections56,58, the 5-year migration data were used in this study. This maximised the amount of data that could be used to fit the gravity models subsequently used for predicting internal migration flows for every malaria endemic country. Thus, for each country listed in Table 1, harmonized, census-based 5-year internal migration data were extracted from the most recent census microdata available through the IPUMSI database61, downloaded locally, and eventually uploaded into a PostgreSQL database using a Microsoft Visual Studio 2010 user interface. The IPUMSI data stored in the PostgreSQL database were subsequently queried, using SQL, to quantify the number of people that migrated from each subnational administrative unit i to every other subnational administrative unit j during the 5-year timespan. These numbers were then matched to the corresponding country administrative unit spatial dataset, extracted from either the Global Administrative Areas (GADM)75 or the Global Administrative Unit Layers (GAUL)76 database, in a GIS environment. This was done by manually adding a unique ‘ID’ to each spatial unit corresponding to the one in the PostgreSQL database (hereafter referred as ‘IPUMSID’). In some cases, depending on the country, either the spatial detail of the IPUMSI migration data had to be reduced to match the lower spatial detail of the corresponding administrative unit dataset or spatially contiguous units in the administrative unit dataset had to be merged together to match the lower spatial detail of the IPUMSI migration data. In some other cases, ‘IPUMSIDs’ had to be edited or spatially contiguous units in the administrative unit dataset had to be merged together to match the reorganisation of the administrative units during the 5 years prior to the census. Finally, before calculating the migration flows between administrative units, another SQL query was used to classify each person in the census sample as either an internal migrant (1) or not (0). Examples of SQL queries used to perform the tasks described above are included in the WorldPop-InternalMigration-v1 code74 briefly described in the Code availability subsection below.

Response variable and covariates

For each country, the response variable, or the proportion of migrants residing in j in the census year that were residing in i 5 years prior to the census, was obtained by dividing the number of migrants residing in j in the census year that were residing in i 5 years prior the census by the total population residing in j in the census year; with both numbers based only on the information contained in IPUMSI census samples.

The administrative units spatially matching the IPUMSI migration microdata were used to calculate the distance between each pair of administrative units, their area, total population, and proportion of urban population. These main covariates (Table 2), along with other covariates derived from them (Supplementary Table 2), represent the pull and push migration factors, known to influence internal migration59,60,77, that were used to extend the basic gravity model proposed by Zipf64.

Other factors, including environmental factors59,60, and country-specific factors, such as literacy and percentage of male population59 or infrastructure and transportation78, were not used because (i) the factors listed in the previous paragraph alone proved to be able to explain most of the variance in the gravity models of Garcia et al.59, and (ii) only globally available datasets were explored in order to consistently model internal migration across all countries.

Calculating response variable and covariates

For each country, the total population in each administrative unit was calculated using the corresponding WorldPop79 (Data Citation 1) or GPWv4 (ref. 80) population count raster dataset adjusted to match UNPD estimates for 201072. The GPWv4 datasets were resampled to the spatial resolution of the WorlpPop datasets and used only for countries for which the WorldPop datasets were not available (Supplementary Table 1).

The area of each unit was calculated using each country vector administrative unit dataset projected to the most appropriate country-specific projected coordinate system, in order to minimize areal distortion, and ultimately reprojected to GCS WGS84.

The proportion of people in urbanized areas in each unit was calculated using the MODIS 500 m Global Urban Extent raster dataset81,82. The latter was converted to vector polygons, using the ArcGIS ‘Raster to Polygon’ tool83, and intersected with the reprojected country vector administrative unit dataset using the ArcGIS ‘Intersect’ tool83. Then, both the intersect output (containing polygons representing the total urban area within each unit uniquely identified by its ‘IPUMSID’) and the country vector administrative unit dataset were rasterized, at the resolution of the corresponding raster population dataset (i.e., 3 arc seconds 3 arc equals to approximately 100 m at the equator), and co-registered with it.

The two raster outputs, along with the population count raster dataset, were then input to the ArcGIS ‘Zonal Statistics as Table’ tool83 to generate two tables containing the total population and urban population in each unit (with the rasterized administrative units and thus their ‘IPUMSIDs’ used to define the zones). Subsequently, both tables were joined to the attribute table of the vector administrative unit dataset, using the ‘IPUMSID’ field to perform the join operation, and the proportion of urban population in each unit was calculated simply dividing its urban population by its total population.

The geodesic distance between each pair of administrative units, with the latter represented by their centroids, was calculated using the ArcGIS ‘Generate Near Table (Analysis)’ tool83. The ‘IN_FID’ and ‘NEAR_FID’ fields (identifying the administrative unit of origin and destination, respectively) in the output ‘distance’ table were then used for joining twice the ‘centroid attribute’ table using the centroid ‘ID’ field to perform the join operation. Since the ‘centroid attribute’ table contains the attributes of each administrative unit represented by the corresponding centroid, the join operation allowed to generate a ‘distance’ table containing all pairs of origin and destination administrative units along with their ‘IPUMSIDs’ and attributes including the unit’s area, total population, and proportion of urban population. Origin and destination ‘IPUMSID’ fields were then renamed ‘NODEI’ and ‘NODEJ’, respectively.

A ‘contiguity’ table containing information about spatial contiguity of administrative units (defined based on polygons sharing an edge) was generated using the ArcGIS ‘Generate Spatial Weights Matrix’ tool83 and subsequently joined with the ‘distance’ table to obtain a new table containing all main covariates, listed in Table 2, calculated at the unit level. This join operation (based on both the ‘NODEI’ and ‘NODEJ’ field the in the ‘distance’ table and the corresponding ‘IPUMSID’ and ‘NID’ field in the ‘contiguity’ table) was performed through two different R scripts depending on whether the country is an IPUMSI or a non-IPUMSI countries. In particular, the R script for the IPUMSI countries added to the new table a ‘MIGIJ’ field containing the number of people that migrated from each ‘NODEI’ to each other ‘NODEJ’ according to the IPUMSI migration microdata and calculated the response variable.

Finally, on a continent basis, all IPUMSI country tables were merged together and input to an R73 script that generated the additional covariates listed in Supplementary Table 2, identified the best predictive model for each continent, as described in the previous section, and was used to estimate the 5-year (2005–2010) internal human migration flows for every malaria endemic country using the best predictive model selected for the corresponding continent.

All operations described above, excluding the reprojection of the vector administrative unit datasets and the calculation of their surface areas, for all IPUMSI and non-IPUMSI countries, were performed using the WorldPop-InternalMigration-v1 code74 briefly described in the Code availability subsection below.

Code availability

The WorldPop-InternalMigration-v1 code74, used to produce the open access archive of estimated 5-year (2005–2010) internal human migration flows described in this article, is publicly available through Figshare. It consists of 1) a Microsoft Visual Studio 2010 user interface allowing users to upload the IPUMSI census microdata to a PostgreSQL database; 2) example SQL queries that were used to match the spatial detail of the IPUMSI migration data to spatial detail of the corresponding administrative unit dataset and to identify internal migrants within the IPUMSI census samples 3) an ArcToolbox geoprocessing tool82 that assigns a unique ID to each administrative unit and calculates the corresponding total population and proportion of urban population; 4) a Python84/ArcPy83 script that creates two tables, one containing spatial contiguity information between each pair of administrative units (‘contiguity.csv’) and another one containing the ISO country code, the continent in which the country is located, the distance between each pair of administrative units, their total population, proportion of urban population, surface area, and the geographic coordinates (GCS WGS84) of their centroid (‘distance.csv’); 5) two R73 scripts, one for the IPUMSI countries used to query the IPUMSI migration microdata loaded in the PostgreSQL database, calculate the response variable, and join the query result with the two output tables of the python script, and another one for the non-IPUMSI countries used just to join together the two output tables of the python script; and 6) an R73 script that performs the model selection and estimates the 5-year (2005–2010) internal human migration flows between subnational administrative units.

All available sets of code are named progressively and must be run sequentially according to the order in which they are presented above. They are also internally documented in order to both briefly explain their purpose and, when required, guide the user through their customization.