Tracking population flows is especially important in the context of the outbreak of COVID-19 in China and the rest of the world. This outbreak emerged in Wuhan (a prefecture-level city in the province of Hubei) in the run-up to the Chinese Lunar New Year’s Eve on 24 January 2020, which is associated with the annual Chunyun mass migration (which can involve as many as three billion trips). The potential scale and range of the diffusion of the outbreak were particularly alarming given the position of Wuhan as a central hub in China’s rail and aviation networks and given the severity of COVID-19.

We used nationwide mobile phone data to track population outflow from Wuhan and linked this to COVID-19 infection counts at the prefecture level. Our data include 296 prefectures in 31 provinces and regions in China (average population 4.40 million, 94.07% of China’s population). Mobile phone geolocation data, which can reliably quantify human movement, provide precise, verifiable and real-time information5,6,7,8,9,10,11. We conceptualized epidemiological morbidity as a function of the movement of the human population from a disease epicentre. We therefore normalized disease risk to the population inflow from Wuhan rather than to the size of the local population.

Our approach differs from previous studies in which individual mobility and disease spread1,2,3,4,12,13 were linked, as we used real-time data about actual movement, focussed on aggregate population flows rather than individual tracking, and implemented a new modelling approach. That is, other recent studies on COVID-19 have used historical population flow data (for example, data on Chunyun migrations from previous years) to estimate case exportation during the current outbreak14,15,16,17,18. However, the benefits of observing rather than estimating population movements are substantial as inaccurate predictions can have important consequences for policy-making: under-reaction can result in disease spread and over-reaction can lead to medically, socially and economically inefficient policies. Moreover, in contrast to previous approaches to epidemiological modelling12,13,14,15,16,17,18, we take advantage of detailed data about the population flow that emanated from the source of the outbreak to develop a population-flow-based risk source model to test the extent to which population flow data can capture the spatio-temporal dynamics of the spread of the SARS-CoV-2 virus.

To measure the total aggregate population outflow from Wuhan before the region was quarantined on 23 January 2020, we used country-wide data (provided by a major national carrier) that tracked all of the movements out of Wuhan between 1 January and 24 January 2020. The onset of symptoms of the first recorded case of COVID-19 in Wuhan was 1 December 2019; by 19 February 2020—the end of our study period—74,576 infected cases had been verified in mainland China according to data from the China Center for Disease Control and Prevention19,20,21. Our time period includes the time at which the news about the outbreak initially appeared (on 31 December 2019 and 9 January 2020) and the annual Lunar New Year migration (which culminated on 24 January 2020). The dataset included any mobile phone user who had spent at least 2 h in Wuhan during this period and it tracked the total daily flow of such individuals to all other prefectures throughout mainland China. Locations were detected when users simply had their phones on. The dataset includes two measures of population outflow: the customer count of the carrier and their extrapolated count of total population movement. We use the latter in our primary analyses and the former as a robustness check (Supplementary Information).

We defined population flow as the total aggregate count of people who entered any given prefecture from Wuhan during the whole observation period (1–24 January 2020). Because Wuhan (population of 11.08 million people in 2018) is a major transportation hub, many of these people were travellers passing through rather than residents. The definition is also weighted by the number of transits through Wuhan since some people may have entered and exited Wuhan on several occasions in January (especially if they lived in neighbouring prefectures). This can be thought of as a linear weighting of additional infection and transmission risk from repeated transits. There were 11,478,484 counts of movements from Wuhan: 8,685,007 to other prefectures within Hubei and 2,793,477 to prefectures in other provinces.
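The counting rule described above (movements rather than unique people, restricted to the observation window) can be sketched as follows. This is a minimal illustration only: the record layout `(user_id, date, destination)` and the sample rows are hypothetical and are not the carrier's actual data format.

```python
from collections import Counter

# Hypothetical trip records: (user_id, ISO date, destination prefecture).
# Each record stands for one movement out of Wuhan; the layout is an assumption.
trips = [
    ("u1", "2020-01-03", "Xiaogan"),
    ("u1", "2020-01-10", "Xiaogan"),  # same user, second transit: counted again
    ("u2", "2020-01-15", "Beijing"),
    ("u3", "2020-01-20", "Xiaogan"),
    ("u4", "2020-01-25", "Beijing"),  # outside the observation window: ignored
]

def aggregate_outflow(records, start="2020-01-01", end="2020-01-24"):
    """Count movements (not unique people) into each prefecture within the window."""
    flow = Counter()
    for _user, date, destination in records:
        if start <= date <= end:  # ISO-formatted dates compare correctly as strings
            flow[destination] += 1
    return flow

outflow = aggregate_outflow(trips)
print(dict(outflow))
```

Because repeated transits by the same user are counted each time, the total behaves as the linear weighting of repeated exposure that the definition describes.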

Key dates during this period were 23 January, when Wuhan was quarantined, and 24 January, Lunar New Year’s Eve (outbound holiday travel is typically completed before this evening). We analysed the efficacy of the quarantine (Fig. 1b, c), which was manifested in a reduction of 52% and 38% in inter- and intra-provincial population outflow, respectively, on 23 January 2020 compared with 22 January 2020 (when there were 546,324 and 141,208 counts of intra- and extra-provincial travel, respectively), and a further reduction of 94% and 84% on 24 January 2020 compared with 23 January 2020. With the imposition of the quarantine, first in Wuhan (and two neighbouring prefectures) at 10:00 on 23 January 2020 and then in 12 other prefectures in Hubei by the end of the day on 24 January 2020, population outflow from Wuhan almost completely stopped (the average daily outflow thereafter was just 1,087 people to all prefectures outside of Hubei, which probably comprised government workers).

Fig. 1: Geographical distribution of population outflow and confirmed COVID-19 cases as of 19 February 2020. a, There is a high overlap between the geographical distribution of aggregate population outflow from Wuhan until 24 January 2020 (in red) and the number of confirmed cases of COVID-19 in other Chinese prefectures (n = 296 prefectures). Map source: National Catalogue Service for Geographic Information. Grey areas lack population outflow data. b, c, During the time that is historically the peak period for outbound Lunar New Year holiday travel, total population outflow from Wuhan to other parts of Hubei (b) is more than three times higher than the population outflow to outside provinces (c). After the implementation of the quarantine at 10:00 on 23 January 2020, population outflow from Wuhan became minimal, except to the adjacent prefectures (b). In b, the first peak possibly corresponds to the start of the winter break of (roughly one million) college students in Wuhan and the second peak is associated with outbound Chunyun travel.

We combined the population flow dataset with the count and geographical location of confirmed cases of COVID-19 nationwide (Fig. 1a), which used consistent and stringently enforced case ascertainment during this period. As of 19 February 2020, there were 74,576 infected cases in mainland China, of which 29,549 occurred outside of Wuhan and there were 2,118 fatalities (according to data from the China Center for Disease Control and Prevention).

Population flow from Wuhan was hypothesized to export the virus to other locations, where it caused local outbreaks (that is, either by importation or community transmission (refs. 19,20,21)). Indeed, we find a strong correlation between total population flow and the number of infections in each prefecture (Fig. 2a, b). Consistent with our hypothesis, the cumulative number of infections is highly correlated with aggregate population outflow from Wuhan from 1 to 24 January 2020, and the correlation increases over time from r = 0.522 on 24 January 2020 to r = 0.919 on 5 February 2020, and increases further to r = 0.952 on 19 February 2020 (P < 0.001 for all) (Fig. 2a–c). As there is little travel throughout the country during this period, the population outflow variable is comparable to a lagged variable in a time series. The correlation exhibited the same robust pattern even when different time windows of population outflow were used (Extended Data Fig. 1). The correlation between population outflow from Hubei province (excluding Wuhan itself) and the number of infections in each prefecture (Fig. 2c) followed a similar pattern but was substantially weaker; this correlation increased from r = 0.365 on 24 January 2020 to r = 0.583 on 19 February 2020.
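The correlation analysis above can be illustrated with a short sketch. All numbers below are synthetic and chosen only to mimic a weak early relationship and a strong late one; they are not the study's data. Working on log-transformed values follows the presentation in Fig. 2a, b and is an assumption about how the reported r values were computed.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

rng = np.random.default_rng(0)
n_prefectures = 296

# Synthetic log outflow per prefecture (illustrative values only).
log_outflow = rng.normal(8.0, 1.5, size=n_prefectures)

# Early snapshot: case counts only loosely track outflow (large noise).
log_cases_early = 0.5 * log_outflow + rng.normal(0.0, 1.5, size=n_prefectures)
# Late snapshot: case counts have converged towards the outflow distribution.
log_cases_late = 0.9 * log_outflow + rng.normal(0.0, 0.4, size=n_prefectures)

r_early = pearson_r(log_outflow, log_cases_early)
r_late = pearson_r(log_outflow, log_cases_late)
print(f"early r = {r_early:.3f}, late r = {r_late:.3f}")
```

Repeating the calculation for each reporting date against the same fixed outflow vector gives a curve of the kind summarized in Fig. 2c.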

Fig. 2: Factors correlated with confirmed COVID-19 cases. a, b, The relationship between the log-transformed aggregate population outflow from Wuhan (up to 24 January 2020) and the log-transformed number of confirmed cases by prefecture on 26 January 2020 (a) and 19 February 2020 (b). Red circles are prefectures in Hubei; light blue circles are four quarantined prefectures in Zhejiang (including Wenzhou); and the six largest prefectures in China are indicated with unique colours. c, Relationship over time between the number of confirmed cases (cumulative until 19 February 2020) and the cumulative population inflow (up to 24 January 2020) from Wuhan, the cumulative inflow from Hubei province excluding Wuhan, the frequency of Baidu search terms related to the virus, the GDP, population and distance from Wuhan of the prefectures. Over time, the correlation between population outflow from Wuhan and the number of infected cases increases from Pearson’s r = 0.522 on 24 January 2020 to r = 0.952 on 19 February (n = 296 prefectures). The decrease in the predictive strength of online search behaviour might reflect information saturation, while the decrease in the predictive strength of GDP, population size and distance suggests that late-stage Chunyun migration from Wuhan was to a more diverse set of prefectures (and not merely to the closest, largest and most-developed prefectures) and/or that community transmissions began to predominate. d, The correlation with daily infections is consistent throughout the period with Pearson’s r ranging from 0.496 on 24 January 2020 to a peak of 0.926 on 4 February 2020 (n = 296 prefectures). Fluctuations probably indicate lags in the reporting of cases (that are smoothed in c); weaker correlations on the last few days reflect that more than 90% of prefectures outside of Hubei reported no new cases.

For completeness we compared the predictive strength of aggregate population outflow to other factors—such as the relative frequency of Baidu search engine queries for virus-related terms in each prefecture (for example, novel coronavirus, flu, SARS, atypical pneumonia and surgical mask)22,23,24, the gross domestic product (GDP) and population size of each prefecture, and other movement variables. Each of these factors became less predictive of local outbreak size over time, either for the number of cumulative cases or the number of daily reported cases (Fig. 2c, d and Extended Data Figs. 2, 3).

We also evaluated a gravity model4,13. Gravity models were originally developed to model flow volumes or other interactions between geographical areas based simply on distance between two regions and their populations. Here, we use a special case of the gravity model with only the population variable for the ‘recipient’ prefecture as Wuhan is always the ‘donor’ and thus a constant value (Supplementary Information 4.1). This model (with a significantly negative parameter for distance) predicts the high quantity of travel from Wuhan to other prefectures in Hubei and to geographically proximate provinces (Fig. 1). However, it does not explain the high traffic of population outflow to more distant coastal cities. That outflow does not strictly follow a gravity model is not surprising given the rationales for Chunyun migration patterns, which are primarily based on social connections8,25.
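For reference, a gravity model in its textbook form, together with the single-donor special case described above, can be written as follows. The symbols $G$, $P$, $d$, $\alpha_1$, $\alpha_2$ and $\gamma$ are generic notation assumed here for illustration; the exact specification used in the analysis is given in Supplementary Information 4.1 and may differ.

$$F_{ij} = G\,\frac{P_i^{\alpha_1} P_j^{\alpha_2}}{d_{ij}^{\gamma}} \quad\Longrightarrow\quad F_{Wj} = G'\,\frac{P_j^{\alpha_2}}{d_{Wj}^{\gamma}}, \qquad G' = G\,P_W^{\alpha_1}$$

Here $F_{ij}$ is the flow from area $i$ to area $j$, $P$ is population and $d$ is distance. With Wuhan ($W$) as the only donor, its population is absorbed into the constant $G'$, and a negative estimated distance parameter corresponds to $\gamma > 0$.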

Furthermore, we tested a gravity model to predict the infection count. Although the population size of the recipient prefecture and distance were significant predictors (P < 0.001), a mediation analysis shows that population flow from Wuhan mediates the effect of distance. Figure 2c, d illustrates why this is the case. Aggregate population flow from Wuhan exhibits a high and progressively stronger correlation with infection prevalence in destination locations over time. By contrast, the predictive strength of the distance from Wuhan, population size and GDP (an alternative source of gravity) of each prefecture shows no increases or decreases over time. There is no advantage to using distance to estimate population flow and infection spread when the actual population flow is observable, as in our case.

Next, we used two sets of models—one cross-sectional and one dynamic model—to statistically model and benchmark the extent to which aggregate population outflow from Wuhan predicts the spread and distribution of infections with SARS-CoV-2 across mainland China. We developed what we call a risk source model that leverages observed population flow data to operationalize the risk emanating from the epidemic source.

We first modelled the effect of outflow on infection by using the following multiplicative exponential model:

$$y_i = c \prod_{j=1}^{m} e^{\beta_j x_{ji}} \, e^{\sum_{k=1}^{n} \lambda_k I_{ik}}$$ (1)

in which $y_i$ is the number of cumulative (or daily) confirmed cases in prefecture $i$ (depending on the model); $x_{1i}$ is the cumulative population outflow from Wuhan to prefecture $i$ from 1 to 24 January 2020; $x_{2i}$ is the GDP of prefecture $i$; $x_{3i}$ is the population size of prefecture $i$; $m$ is the number of variables included; and $c$ and $\beta_j$ are parameters to estimate. $\lambda_k$ is the fixed effect for province $k$; $n$ is the number of provinces considered in the analysis; $I_{ik}$ is a dummy variable, with $I_{ik} = 1$ if $i \in k$ (prefecture $i$ belongs to province $k$) and $I_{ik} = 0$ otherwise (Supplementary Information).
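Taking logarithms of model (1) makes its structure easier to read. The following is a restatement of the equation above rather than an additional result; $k(i)$, denoting the province that contains prefecture $i$, is notation introduced here for illustration.

$$\ln y_i = \ln c + \sum_{j=1}^{m} \beta_j x_{ji} + \sum_{k=1}^{n} \lambda_k I_{ik} = \ln c + \sum_{j=1}^{m} \beta_j x_{ji} + \lambda_{k(i)}$$

Each $\beta_j$ is therefore the change in log cases per unit change in the corresponding covariate, and the province fixed effect shifts the intercept for all prefectures in the same province. If a covariate enters on a log scale (as the outflow variable does in Fig. 3a), its exponential term becomes a power law in the untransformed variable.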

We applied a nonlinear least-squares method (Levenberg–Marquardt algorithm) to estimate the parameters of a model with confirmed cases as the dependent variable and aggregate Wuhan population outflow from 1–24 January 2020 as the sole predictor variable ($R^2$ = 0.772 on 24 January to $R^2$ = 0.946 on 19 February) and a model with population size and GDP as additional covariates ($R^2$ = 0.809 on 24 January to $R^2$ = 0.967 on 19 February) (Supplementary Tables 1, 2). Although these additional covariates improve the fit, the parameter for population flow from Wuhan becomes increasingly dominant, whereas the GDP and population of a prefecture become increasingly less predictive over time. Overall, the performance of the models continuously improved as more infected cases were confirmed, suggesting that the spreading pattern of the virus gradually converged to the distribution of the population outflow from Wuhan to other prefectures in China. As a robustness check, we evaluate a model using daily confirmed cases and find consistent results (Supplementary Tables 3, 4).
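A fit of this kind can be sketched with SciPy, whose `curve_fit` supports the Levenberg–Marquardt algorithm via `method="lm"`. The sketch below is an illustration under stated assumptions, not the study's code: it uses a single covariate, omits province fixed effects, treats the covariate as log outflow, and runs on synthetic data with made-up parameter values.

```python
import numpy as np
from scipy.optimize import curve_fit

def risk_source_model(x, c, beta1):
    """Model (1) reduced to one covariate and no province fixed effects."""
    return c * np.exp(beta1 * x)

rng = np.random.default_rng(1)

# Synthetic data: x1 plays the role of log outflow; true parameters are made up.
x1 = rng.uniform(4.0, 12.0, size=296)
y = risk_source_model(x1, 0.05, 0.8) * rng.lognormal(0.0, 0.1, size=x1.size)

# Starting values from a log-linear fit. This works here because every synthetic
# y is positive; real case counts containing zeros would need another strategy.
slope, intercept = np.polyfit(x1, np.log(y), 1)

# Levenberg-Marquardt nonlinear least squares on the untransformed counts.
params, _cov = curve_fit(
    risk_source_model, x1, y, p0=(np.exp(intercept), slope), method="lm"
)
c_hat, beta_hat = params

predicted = risk_source_model(x1, c_hat, beta_hat)
r_squared = 1.0 - np.sum((y - predicted) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"c = {c_hat:.4f}, beta1 = {beta_hat:.4f}, R^2 = {r_squared:.3f}")
```

Fitting on the original scale, rather than regressing log cases on the covariates, gives more weight to prefectures with many cases; which weighting is preferable depends on the application.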

The logic behind this convergence over time, as well as the predictive strength of the model, is that population flow from Wuhan to other prefectures fundamentally determines the eventual distribution of total infections in China. During the earliest phase of the outbreak, before the quarantine of Wuhan, there was a relative lack of awareness of the virus and few countermeasures preventing its spread. SARS-CoV-2 should thus have spread relatively randomly across the entire prefecture of Wuhan; that is, our results imply that the number of infected people was uniformly distributed (statistically speaking) in the population outflowing from Wuhan into different prefectures across the country.

Using the daily predicted cases in model (1), we are also able to calculate a daily risk score for prefectures based on the difference between the number of predicted and confirmed cases on any given date (Supplementary Information). A higher-than-expected level of infection suggests more community transmission (that is, ‘underperforming’ compared to the benchmark derived from the outflow population from Wuhan). On the other hand, ‘over-performing’ prefectures, with fewer cases than expected, are also noteworthy, as they could have implemented highly successful public health measures (or may be prone to inaccurate data reporting). For example, Extended Data Fig. 4 identifies prefectures with transmission risk index values above the upper bound of the 90% confidence interval on 29 January, and the crossing of this threshold was indeed associated with imminent quarantine. The predictive strength of aggregate population flow from Wuhan and the overall fit of model (1) over time can also act as an early warning index of an epidemiological transition; they reflect the degree to which imported infections are dominant at any point in time. If model strength decreases significantly at any location, this may indicate that community transmission is overtaking imported cases.
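One simple way to turn the gap between confirmed and predicted cases into a score is sketched below. The exact definition used in the study is in the Supplementary Information; the standardization and the normal-approximation threshold of 1.645 used here are assumptions made for illustration, and the case numbers are invented.

```python
import numpy as np

Z_90 = 1.645  # bound of a two-sided 90% interval under a normal approximation

def daily_risk_scores(confirmed, predicted):
    """Standardized gap between confirmed and predicted cases on one date.

    Positive scores mean more cases than the outflow-based benchmark predicts.
    """
    gap = np.asarray(confirmed, dtype=float) - np.asarray(predicted, dtype=float)
    return (gap - gap.mean()) / gap.std()

# Invented values for eight prefectures on a single date.
predicted = np.array([120.0, 80.0, 45.0, 30.0, 200.0, 60.0, 15.0, 90.0])
confirmed = np.array([118.0, 85.0, 40.0, 33.0, 195.0, 160.0, 12.0, 88.0])

scores = daily_risk_scores(confirmed, predicted)
flagged = np.flatnonzero(scores > Z_90)  # indices with far more cases than expected
print(np.round(scores, 2), flagged)
```

In this toy example only the sixth prefecture (index 5), with 160 confirmed cases against 60 predicted, exceeds the upper bound.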

We next developed a spatio-temporal model to explore changes in distribution and growth of COVID-19 across all prefectures over time (rather than on individual dates) (Supplementary Information 3.2). We use a Cox proportional hazards framework and replace the constant scaling parameter of model (1) with a time-varying hazard rate function $\lambda_0(t)$, which has the S-shaped form that epidemics typically follow (for example, logistic, generalized logistic or Gompertz functions26,27):

$$\lambda(t \mid x_i) = \lambda_0(t) \left( \prod_{j=1}^{m} e^{\beta_j x_{ji}} \right) e^{\sum_{k=1}^{n} \lambda_k I_{ik}}$$ (2)

in which $\lambda(t \mid x_i)$ is the hazard function describing the number of cumulative confirmed cases at time $t$ given population outflow from Wuhan to prefecture $i$ and other variables; $x_i = \{x_{1i}, x_{2i}, \ldots, x_{mi}\}$ are the realized values of the covariates for prefecture $i$; the other notation is the same as in model (1).

This model extends our risk source model to a dynamic context; it incorporates all infected cases across all locales and dates to statistically derive the COVID-19 epidemic curve and growth pattern across mainland China. We used the same method as before to estimate the parameters (Supplementary Information). When using only the single variable of total population outflow from Wuhan (from 1 to 24 January 2020) to each other prefecture, we observe $R^2$ = 0.927 for the exponential–logistic model (Fig. 3a); the inclusion of local population and GDP increases $R^2$ to 0.957 (alternate models are in Supplementary Table 5).
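The exponential–logistic form of model (2) can be sketched in the same way as the cross-sectional fit. The code below is a hedged illustration on synthetic data: the logistic parameterization of $\lambda_0(t)$ (`K`, `r`, `t0`), the single covariate, the absence of fixed effects and all numerical values are assumptions, not the specification in the Supplementary Information.

```python
import numpy as np
from scipy.optimize import curve_fit

def dynamic_model(tx, K, r, t0, beta):
    """Logistic baseline lambda_0(t) multiplied by exp(beta * x)."""
    t, x = tx  # first row: day index, second row: covariate (log outflow)
    baseline = K / (1.0 + np.exp(-r * (t - t0)))
    return baseline * np.exp(beta * x)

rng = np.random.default_rng(2)

days = np.arange(1, 28, dtype=float)          # t = 1 (24 Jan) to t = 27 (19 Feb)
log_outflow = rng.uniform(5.0, 11.0, size=40)  # 40 synthetic prefectures

# Build the prefecture-by-day panel and flatten it to one row per observation.
t_grid, x_grid = np.meshgrid(days, log_outflow)
tx = np.vstack([t_grid.ravel(), x_grid.ravel()])

true_params = (0.02, 0.35, 9.0, 0.8)  # made-up K, r, t0, beta
noise = rng.lognormal(0.0, 0.05, size=tx.shape[1])
y = dynamic_model(tx, *true_params) * noise

params, _cov = curve_fit(
    dynamic_model, tx, y, p0=(0.05, 0.3, 10.0, 0.7), method="lm"
)
K_hat, r_hat, t0_hat, beta_hat = params

predicted = dynamic_model(tx, *params)
r_squared = 1.0 - np.sum((y - predicted) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"beta = {beta_hat:.3f}, t0 = {t0_hat:.2f}, R^2 = {r_squared:.3f}")
```

Pooling all prefectures and dates into one fit is what allows a single epidemic curve shape to be shared across locations while the covariates scale its height.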

Fig. 3: Predictive model based on population outflow. a, The surface indicates the fitted performance of our epidemiological model (see model (5) in the Supplementary Information) with a single variable $x_{1i}$, which indicates the outflow population from Wuhan to prefecture $i$ (log transformed) for all prefectures, with $t$ as the number of days after outbound Chunyun is over (that is, $t = 1$ is 24 January 2020). The dots represent the actual number of confirmed cases for a given $x_{1i}$ and $t$. Red dots represent prefectures in which the reported number of confirmed cases is greater than the values predicted by the model; black dots are all other cases, $R^2$ = 0.930 (n = 7,992 data points). See Extended Data Fig. 8 for a robustness check. b, Risk scores over time provide a dynamic picture of shifting transmission risks in different prefectures.

We use logic similar to that above to contrast the expected and observed outcomes to gauge epidemiological risk. Here, model predictions serve as reference patterns across time (Extended Data Figs. 5, 6). The differences in the growth trends between the number of predicted and confirmed cases can signal higher levels of SARS-CoV-2 community transmission. We use the integral of the differences over time to create a total transmission risk index (normalized by subtracting the mean and dividing by the standard deviation) and identify a list of prefectures above and below the 90% confidence interval (Extended Data Fig. 7 and Supplementary Table 11). Indeed, our model identifies a list of statistically significant underperforming prefectures; in most of these cases, we observed the subsequent imposition of quarantine (Extended Data Figs. 5, 6, Supplementary Information and Supplementary Table 12). On the other hand, prefectures with lower trends than expected might have had more successful public health measures. Figure 3b shows the dynamic shifts in the risk index score for selected prefectures, which makes it possible to monitor which prefectures performed better in controlling the transmission risk over time.
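The total transmission risk index described above (integrate the confirmed-minus-predicted difference over time, then standardize) can be sketched as follows. The trapezoid rule with daily spacing, the normal-approximation cut-off of 1.645 and the synthetic trajectories are assumptions made for illustration.

```python
import numpy as np

Z_90 = 1.645  # bound of a two-sided 90% interval under a normal approximation

def total_risk_index(confirmed, predicted):
    """Integrated gap per prefecture, standardized across prefectures.

    Rows are prefectures and columns are consecutive days.
    """
    diff = np.asarray(confirmed, dtype=float) - np.asarray(predicted, dtype=float)
    # Trapezoid rule with a spacing of one day between columns.
    integral = 0.5 * (diff[:, 1:] + diff[:, :-1]).sum(axis=1)
    return (integral - integral.mean()) / integral.std()

rng = np.random.default_rng(3)
days = np.arange(1, 28)
curve = 100.0 / (1.0 + np.exp(-0.35 * (days - 9)))  # synthetic reference curve

predicted = np.tile(curve, (50, 1))                  # 50 synthetic prefectures
confirmed = predicted + rng.normal(0.0, 2.0, size=predicted.shape)
confirmed[7] += 0.3 * curve                          # one prefecture grows faster

index = total_risk_index(confirmed, predicted)
above = np.flatnonzero(index > Z_90)   # more cases than expected
below = np.flatnonzero(index < -Z_90)  # fewer cases than expected
print(above, below)
```

Integrating over time rather than comparing a single date makes the index less sensitive to day-to-day reporting lags.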

In summary, using detailed mobile-phone geolocation data to compute aggregate population movements, we track the transit of people from Wuhan to the rest of mainland China up to 24 January 2020. The geographical flow of people anticipates the subsequent location, intensity and timing of outbreaks in the rest of mainland China up to 19 February 2020. These data outperform other measures, such as population size, wealth or distance from the risk source. We modelled the epidemic curves of COVID-19 across different locales using population flows and showed that deviations from model predictions served as tools to detect the burden of community transmission.

The logic of our population-flow-based risk source model differs from classic epidemiological models that rely on assumptions regarding population mixing, population compartment sizes and viral properties. By assuming that risk arises from human population movements, our risk source model is able to parsimoniously capture the distribution of the epidemic. The model has several advantages: it makes no assumptions regarding travel patterns or effective distance effects; allows for nonlinear estimations; generates a non-arbitrary, source-linked risk score; and is easily adapted to other empirical contexts. Notably, the multiplicative functional form can also accommodate multiple risk sources—for example, for countries in which there are multiple disease epicentres. As an example, we evaluated the distinct impact of population flow from Hubei (excluding Wuhan) as an alternative risk source in our models, and found that it had little impact on the spread and growth of COVID-19 in the country (Supplementary Tables 6, 10).

We focused on the relative strength of the outbreak in each area, rather than the absolute number of cases, although one can predict the number of cases by using reported data to calibrate the parameters of the model. A key contribution of our approach is to robustly characterize the structure or relative distribution of cases across different geographical areas and over time, which is driven fundamentally by the cumulative outflow from Wuhan. Another benefit is that non-systematic inaccuracy of COVID-19 case-finding is relatively unimportant as long as we capture the distribution of population flow accurately over time, which we do.

Our approach is generalizable to any dataset that captures population movements (for example, train-ticketing or car-tolling data). This method can also be implemented in a live fashion (if suitable data are available) to facilitate policy decisions—for example, for the allocation of resources and manpower across specific geographical locales based on the predicted strength of the epidemic. This could also yield a dynamic performance metric when contrasted against real-time reports of infections, and, as we show, identify which areas have higher virus transmission risk or more effective measures.

Other techniques to forecast the levels of an epidemic in defined populations in advance have, of course, been proposed—whether the use of online search behaviour22,23,24 or the use of network sensors (that is, the monitoring of people who are at heightened risk of falling ill given their network position)28. Our approach relies on data regarding population flow. Indeed, historical (that is, baseline) information about population flows—undisturbed by the imposition of quarantines or by publicity regarding outbreaks, both of which happened here—could also be valuable to public health experts and government officials when new outbreaks occur.

When people move, they take contagious diseases with them. Their movements are thus a harbinger of the future status of an epidemic, and this offers the prospect of using data-analytic techniques to control an epidemic before it strikes too hard.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.