We estimated the level of overdispersion in COVID-19 transmission by using a mathematical model that is characterised by R 0 and the overdispersion parameter k of a negative binomial branching process. We fit this model to worldwide data on COVID-19 cases to estimate k given the reported range of R 0 and interpret this in the context of superspreading.

This suggests that not all symptomatic cases cause a secondary transmission, which was also estimated to be the case for past coronavirus outbreaks (SARS/MERS) 8 , 9 . High individual-level variation (i.e. overdispersion) in the distribution of the number of secondary transmissions, which can lead to so-called superspreading events, is crucial information for epidemic control 9 . High variation in the distribution of secondary cases suggests that most cases do not contribute to the expansion of the epidemic, which means that containment efforts that can prevent superspreading events have a disproportionate effect on the reduction of transmission.

A novel coronavirus disease (COVID-19) outbreak, which is considered to be associated with a market in Wuhan, China, is now affecting a number of countries worldwide 1 , 2 . A substantial number of human-to-human transmission has occurred; the basic reproduction number R 0 (the average number of secondary transmissions caused by a single primary case in a fully susceptible population) has been estimated around 2–3 3 – 5 . More than 100 countries have observed confirmed cases of COVID-19. A few countries have already been shifting from the containment phase to the mitigation phase 6 , 7 , with a substantial number of locally acquired cases (including those whose epidemiological link is untraceable). On the other hand, there are countries where a number of imported cases were ascertained but fewer secondary cases have been reported than might be expected with an estimated value of R 0 of 2–3.

Due to interventions targeting travellers (e.g. screening and quarantine), the risk of transmission from imported cases may be lower than that from local cases. As part of the sensitivity analysis in Extended data , we estimated k assuming that the reproduction number of imported cases is smaller than that of local cases.

We used simulations to investigate potential bias caused by underreporting, one of the major limitations of the present study. Underreporting in some countries may be more frequent than others because of limited surveillance and/or testing capacity, causing heterogeneity in the number of cases that could have affected the estimated overdispersion. See Extended data (Supplementary materials) 16 for detailed methods.

To test if our assumption of overdispersed offspring distribution better describes the data, we compared our negative-binomial branching process model with a Poisson branching process model, which assumes that the offspring distribution follows a Poisson distribution instead of negative-binomial. Since a negative-binomial distribution converges to a Poisson distribution as k → ∞, we approximately implemented a Poisson branching process model by fixing k of the negative-binomial model at 10 10 . We compared the two models by the widely-applicable Bayesian information criterion (WBIC) 15 .

Varying the assumed R 0 between 0–5 (fixed at an evenly-spaced grid of values), we estimated the overdispersion parameter k using the likelihood function described above. We used the Markov-chain Monte Carlo (MCMC) method to provide 95% credible intervals (CrIs). The reciprocal of k was sampled where the prior distribution for the reciprocal was weakly-informed half-normal (HalfNormal( σ = 10)). We employed the adaptive hit-and-run Metropolis algorithm 13 and obtained 500 thinned samples from 10,000 MCMC steps (where the first half of the chain was discarded as burn-in). We confirmed that the final 500 samples have an effective sample size of at least 300, indicating sufficiently low auto-correlation.

with a convention ∑ m = 0 − 1 c ( m ; s ) = 0 . We assumed that the growth of a cluster in a country had ceased if 7 days have passed since the latest reported case (denoted by set A ). We applied the final size likelihood c ( x ; s ) to those countries and c o ( x ; s ) to the rest of the countries (countries with an ongoing outbreak: B ). The total likelihood is

If the observed case counts are part of an ongoing outbreak in a country, cluster sizes may grow in the future. To address this issue, we adjusted the likelihood for those countries with ongoing outbreak by only using the condition that the final cluster size of such a country has to be larger than the currently observed number of cases. The corresponding likelihood function is

Assuming that the offspring distributions (distribution of the number of secondary transmissions) for COVID-19 cases are identically- and independently-distributed negative-binomial distributions, we constructed the likelihood of observing the reported number of imported/local cases (outbreak size) of COVID-19 for each country. The probability mass function for the final cluster size resulting from s initial cases is, according to Blumberg et al . 12 , given by

We extracted the number of imported/local cases in the affected countries ( Table 1 ) from the WHO situation report 38 10 published on 27 February 2020, which was the latest report of the number of imported/local cases in each country (as of the situation report 39, WHO no longer reports the number of cases stratified by the site of infection). As in the WHO situation reports, we defined imported cases as those whose likely site of infection is outside the reporting country and local cases as those whose likely site of infection is inside the reporting country. Those whose site of infection was under investigation were excluded from the analysis (Estonia had no case with a known site of infection and was excluded). In Egypt and Iran, no imported cases have been confirmed, which cause the likelihood value to be zero; data in these two countries were excluded. To distinguish between countries with and without an ongoing outbreak, we extracted daily case counts from an online resource 11 and determined the dates of the latest case confirmation for each country (as of 27 February).

Model comparison between negative-binomial and Poisson branching process models suggested that a negative-binomial model better describes the observed data; WBIC strongly supported the negative-binomial model with a difference of 11.0 ( Table 2 ). The simulation of the effect of underreporting suggested that possible underreporting is unlikely to cause underestimation of overdispersion parameter k ( Extended data , Figure S3) 16 . A slight increase in the estimate of k was observed when the reproduction number for imported cases was assumed to be lower due to interventions ( Extended data , Table S1).

The result of the joint estimation suggested the likely bounds for R 0 and k (95% CrIs: R 0 1.4–12; k 0.04–0.2). The upper bound of R 0 did not notably differ from that of the prior distribution (=13.5), suggesting that our model and the data only informed the lower bound of R 0 . This was presumably because the contribution of R 0 to the shape of a negative-binomial distribution is marginal when k is small ( Extended data , Figure S1) 16 . A scatterplot ( Extended data , Figure S2) 16 exhibited a moderate correlation between R 0 and k (correlation coefficient -0.4).

Our estimation suggested substantial overdispersion ( k ≪ 1) in the offspring distribution of COVID-19 ( Figure 1A and Figure 2 ). Within the current consensus range of R 0 (2–3), k was estimated to be around 0.1 (median estimate 0.1; 95% CrI: 0.05–0.2 for R 0 = 2.5). For the R 0 values of 2–3, the estimates suggested that 80% of secondary transmissions may have been caused by a small fraction of infectious individuals (~10%; Figure 1B ).

Discussion

Our results suggested that the offspring distribution of COVID-19 is highly overdispersed. For the likely range of R 0 of 2–3, the overdispersion parameter k was estimated to be around 0.1, suggesting that the majority of secondary transmission may be caused by a very small fraction of individuals (80% of transmissions caused by ~10% of the total cases). These results are consistent with a number of observed superspreading events observed in the current COVID-19 outbreak17, and also in line with the estimates from the previous SARS/MERS outbreaks8.

The overdispersion parameter for the current COVID-19 outbreak has also been estimated by stochastic simulation18 and from contact tracing data in Shenzhen, China19. The former study did not yield an interpretable estimate of k due to the limited data input. In the latter study, the estimates of R e (the effective reproduction number) and k were 0.4 (95% confidence interval: 0.3–0.5) and 0.58 (0.35–1.18), respectively, which did not agree with our findings. However, these estimates were obtained from pairs of cases with a clear epidemiological link and therefore may have been biased (downward for R 0 and upward for k) if superspreading events had been more likely to be missed during the contact tracing.

Although cluster size distributions based on a branching process model are useful in inference of the offspring distribution from limited data12,20, they are not directly applicable to an ongoing outbreak because the final cluster size may not yet have been observed. In our analysis, we adopted an alternative approach which accounts for possible future growth of clusters to minimise the risk of underestimation. As of 27 February 2020, the majority of the countries in the dataset had ongoing outbreaks (36 out of 43 countries analysed, accounting for 2,788 cases of the total 2,816). Even though we used the case counts in those countries only as the lower bounds of future final cluster sizes, which might have only partially informed of the underlying branching process, our model yielded estimates with moderate uncertainty levels (at least sufficient to suggest that k may be below 1). Together with the previous finding suggesting that the overdispersion parameter is unlikely to be biased downwards21, we believe our analysis supports the possibility of highly-overdispersed transmission of COVID-19.

A number of limitations need to be noted in this study. We used the confirmed case counts reported to WHO and did not account for possible underreporting of cases. Heterogeneities between countries in surveillance and intervention capacities, which might also be contributing to the estimated overdispersion, were not considered (although we investigated such effects by simulations; see Extended data, Figure S3)16. Reported cases whose site of infection classified as unknown, which should in principle be counted as either imported or local cases, were excluded from analysis. Some cases with a known site of infection could also have been misclassified (e.g., cases with travel history may have been infected locally). The distinction between countries with and without ongoing outbreak (7 days without any new confirmation of cases) was arbitrary. However, we believe that our conclusion is robust because the distinction does not change with different thresholds (4–14 days), within which the serial interval of SARS-CoV-2 is likely to fall22,23.

Our finding of a highly-overdispersed offspring distribution suggests that there is benefit to focusing intervention efforts on superspreading. As most infected individuals do not contribute to the expansion of transmission, the effective reproduction number could be drastically reduced by preventing relatively rare superspreading events. Identifying characteristics of settings that could lead to superspreading events will play a key role in designing effective control strategies.