Significance

This study provides information on sex ratio at birth (SRB) reference levels and SRB imbalance. Using a comprehensive database and a Bayesian estimation model, we estimate that SRB reference levels are significantly different from the commonly assumed historical norm of 1.05 for most regions. We identify 12 countries with strong statistical evidence of SRB imbalance: Albania, Armenia, Azerbaijan, China, Georgia, Hong Kong (SAR of China), India, Republic of Korea, Montenegro, Taiwan (Province of China), Tunisia, and Vietnam.

Abstract

The sex ratio at birth (SRB; ratio of male to female live births) imbalance in parts of the world over the past few decades is a direct consequence of sex-selective abortion, driven by the coexistence of son preference, readily available technology of prenatal sex determination, and fertility decline. Estimation of the degree of SRB imbalance is complicated because of unknown SRB reference levels and because of the uncertainty associated with SRB observations. Reproducible methods are needed to construct SRB estimates with uncertainty and to assess SRB inflation due to sex-selective abortion. We compile an extensive database from vital registration systems, censuses, and surveys with 10,835 observations, equivalent to 16,602 country-years of information, from 202 countries. We develop Bayesian methods for SRB estimation for all countries from 1950 to 2017. We model the SRB regional and national reference levels, the fluctuation around national reference levels, and the inflation. The estimated regional reference levels range from 1.031 (95% uncertainty interval [1.027; 1.036]) in sub-Saharan Africa to 1.063 [1.055; 1.072] in southeastern Asia, 1.063 [1.054; 1.072] in eastern Asia, and 1.067 [1.058; 1.077] in Oceania. We identify 12 countries with strong statistical evidence of SRB imbalance during 1970–2017, resulting in 23.1 [19.0; 28.3] million missing female births globally. The majority of those missing female births are in China, with 11.9 [8.5; 15.8] million, and in India, with 10.6 [8.0; 13.6] million.

We describe a method for probabilistic and reproducible estimation of the sex ratio at birth (SRB; ratio of male to female live births) for all countries, with a focus on assessing the SRB reference levels (which we henceforth term “baseline level”) and SRB imbalance due to sex-selective abortion.

Under normal circumstances, SRB varies in a narrow range around 1.05, with only a few known variations among ethnic groups (1–13). For most of human history, SRB remained within that natural range. However, over recent decades, SRBs have risen in a number of Asian countries and in eastern Europe (14–30). The increasing imbalance in SRB is due to a combination of three main factors that lead to sex-selective abortion (22, 24). Firstly, most societies with abnormal SRB inflation have a persistent, strong son preference, which provides the motivation. Secondly, since the 1970s, prenatal sex diagnosis and access to sex-selective abortion have become increasingly available (31–35), providing the method. Thirdly, fertility has fallen to low levels around the world, resulting in a “squeezing effect”: parents attain both the desired small family and the ideal sex composition by resorting to sex selection (22). Consequently, sex-selective abortion provides a means to avoid large families while still having male offspring. Necessary conditions for the occurrence of sex-selective abortions include a large tolerance for induced abortion from both the population and the medical establishment, available techniques for early sex detection, and legal medical abortion for several weeks after onset of pregnancy (36).

Estimation of the degree of SRB imbalance is complicated by the amount of uncertainty associated with SRB observations due to data quality issues and sampling errors. While the United Nations (UN) Population Division publishes estimates for all countries in the World Population Prospects (WPP), its estimates are deterministic and depend on expert-based opinions which are not reproducible (37). Although modeling and simulation studies of the SRB have been carried out for selected countries (38–40), these studies did not estimate the SRB and its natural fluctuations; instead, SRB estimates were taken from the UN WPP. A recent assessment by the Global Burden of Disease Study 2017 (41) produced estimates for 195 countries based on 8,936 country-years of data but does not assess baseline values or imbalances. An up-to-date systematic analysis of the SRB, one of the most fundamental demographic indicators, for all countries over time, using all available data and a reproducible estimation method, is urgently needed.

To fill the research void, we develop model-based estimates for 212 countries (referring to populations that are considered as “countries” or “areas” in the UN classification) from 1950 to 2017. Our analyses are based on a comprehensive database on national-level SRB with data from vital registration (VR) systems, censuses, and international and national surveys. In total, we have 10,835 observations, equivalent to 16,602 country-years of information, in our database from 202 countries. We implement two Bayesian hierarchical models to estimate SRBs in two types of country-years: (i) those that are not affected by sex-selective abortion and (ii) those that may be affected by sex-selective abortion that leads to SRB imbalances.

In the model for country-years not affected by sex-selective abortion, the SRB is given by the product of a baseline value and a country-year-specific multiplier that accounts for natural fluctuation around the baseline value. We allow baseline values to differ across countries within a region, and across regions, to incorporate SRB differences due to ethnic origin (1–13). Hence, for this purpose, regions refer to groupings of countries based on their dominant ethnic group (SI Appendix, Table S17). For example, we group countries in Europe, North America, Australia, and New Zealand to refer to the regional grouping of countries with a majority of Caucasians. Within each country and region, we assume that the baseline value is constant over time.
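The multiplicative structure described above can be illustrated with a short simulation. This is a minimal sketch rather than the estimation model itself: it assumes the multiplier follows a first-order autoregressive process on the log scale (the Methods state an autoregressive process of order 1, but the log scale is an assumption here), and the function name and the values of `rho` and `sigma` are hypothetical choices for illustration only.

```python
import math
import random


def simulate_inflation_free_srb(baseline, n_years, rho=0.9, sigma=0.004, seed=0):
    """Simulate SRB_t = baseline * multiplier_t for n_years.

    Assumption (not from the paper): log(multiplier_t) follows a stationary
    AR(1) process with autocorrelation rho (|rho| < 1) and innovation SD sigma.
    """
    rng = random.Random(seed)
    # Draw the first log-multiplier from the stationary distribution of the AR(1).
    log_mult = rng.gauss(0.0, sigma / math.sqrt(1.0 - rho ** 2))
    path = []
    for _ in range(n_years):
        path.append(baseline * math.exp(log_mult))
        # AR(1) update: next value is pulled back toward 0, i.e. multiplier toward 1.
        log_mult = rho * log_mult + rng.gauss(0.0, sigma)
    return path


# Example: 68 years (1950-2017) around the sub-Saharan Africa baseline median of 1.031.
srb_path = simulate_inflation_free_srb(1.031, 68)
```

With `sigma=0` the multiplier is exactly 1 and the simulated SRB equals the baseline in every year, which mirrors the case of a country whose SRB shows no deviation from its national baseline.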

The model for natural fluctuations in the SRB is fitted to the global database after excluding data from country-years that may have been affected by masculinization of the SRB. We use inclusive criteria to identify such country-years, based on a combination of qualitative and quantitative approaches. We select countries with at least one of the following manifestations of son preference: (i) a high level of desired sex ratio at birth (DSRB), (ii) a high level of sex ratio at last birth (SRLB), or (iii) strong son preference or inflated SRB suggested by a literature review. The earliest start for the sex ratio inflation is set to 1970, which is when sex-selective abortions first became available.

We parametrize SRB inflation during a sex ratio transition using a trapezoid to allow for consecutive phases of increase, stagnation, and a decrease back to zero. We incorporate the fertility squeeze effect by including the total fertility rate [TFR, obtained from the UN WPP 2017 (37)] in the model to inform the start year of SRB inflation. Parameters are estimated with a Bayesian hierarchical model (42) to share information across countries about the inflation start year, the maximum inflation, and the lengths of the three phases of the inflation period.
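A trapezoid of this kind can be written as a piecewise-linear function of time. The sketch below is illustrative only: the function and parameter names are hypothetical, and the example values are not estimates from the model. The inflation returned here is the quantity that is added to the inflation-free SRB in country-years at risk.

```python
def trapezoid_inflation(year, start, max_inflation,
                        len_increase, len_plateau, len_decrease):
    """SRB inflation in a given year under a trapezoid shape.

    Phases after `start`: linear increase over len_increase years,
    plateau at max_inflation for len_plateau years,
    linear decrease back to zero over len_decrease years.
    """
    t = year - start
    if t <= 0:
        return 0.0  # before the transition starts
    if t < len_increase:
        return max_inflation * t / len_increase  # increase phase
    if t <= len_increase + len_plateau:
        return max_inflation  # stagnation phase
    end = len_increase + len_plateau + len_decrease
    if t < end:
        return max_inflation * (end - t) / len_decrease  # decrease phase
    return 0.0  # transition completed


# Hypothetical transition: starts 1980, peaks at +0.10 after 10 years,
# stays flat for 5 years, returns to zero over the next 10 years.
inflation_1985 = trapezoid_inflation(1985, 1980, 0.10, 10, 5, 10)
```

Four numbers per country (start year, maximum, and phase lengths, with three lengths in total) fully determine the curve, which is what makes it practical to pool these parameters across countries in a hierarchical model.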

To quantify the effect of SRB imbalance due to sex-selective abortion, we calculate the annual number of missing female births (AMFB) and the cumulative number of missing female births (CMFB) over time. AMFB is defined as the difference between the number of female live births based on SRB without inflation and the number of female live births based on SRB with inflation. CMFB for a certain period is the sum of AMFB over the period. We define countries with strong evidence of SRB inflation to be those countries with at least 1 y with at least 95% probability of a positive number of missing female births (AMFB > 0).
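The definitions of AMFB and CMFB, and the 95% criterion, can be sketched in a few lines. This is a hedged illustration, not the authors' code: it assumes that estimated female births are obtained from total births as B / (1 + SRB) and that expected female births are obtained by dividing estimated male births by the inflation-free SRB; the exact computation is given in the SI Appendix and may differ. All function names are hypothetical.

```python
def amfb(total_births, srb_with_inflation, srb_without_inflation):
    """Annual number of missing female births for one country-year.

    Assumption: male births are taken as given, and the expected number of
    female births is what the inflation-free SRB would imply for them.
    """
    estimated_female = total_births / (1.0 + srb_with_inflation)
    estimated_male = total_births - estimated_female
    expected_female = estimated_male / srb_without_inflation
    return expected_female - estimated_female


def cmfb(births_by_year, srb_with_by_year, srb_without_by_year):
    """Cumulative missing female births: the sum of AMFB over a period."""
    return sum(
        amfb(b, s1, s0)
        for b, s1, s0 in zip(births_by_year, srb_with_by_year, srb_without_by_year)
    )


def has_strong_evidence(amfb_draws_by_year, threshold=0.95):
    """True if, in at least one year, the share of posterior draws with
    AMFB > 0 is at least `threshold`.

    amfb_draws_by_year: one list of posterior AMFB draws per year.
    """
    for draws in amfb_draws_by_year:
        prob_positive = sum(d > 0 for d in draws) / len(draws)
        if prob_positive >= threshold:
            return True
    return False
```

When the SRB with inflation equals the SRB without inflation, `amfb` returns zero, so countries without inflation contribute nothing to the cumulative total.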

Results

The compiled database, annual estimates for national, regional, and global SRB during 1950–2017, and national AMFB during 1970–2017 are available in Datasets S1–S4. The SRB estimates for selected years by country are in SI Appendix, Table S20.

Global and Regional SRB Estimates. The global and regional SRB median estimates and 95% uncertainty intervals in 1990 and 2017 are presented in Fig. 1 and Table 1. Globally, the SRB in 2017 is 1.068 (95% uncertainty interval, [1.059; 1.077]). Levels and trends vary across regions. In 2017, the regional-level estimated SRBs range from 1.032 [1.026; 1.039] in sub-Saharan Africa to 1.133 [1.076; 1.187] in eastern Asia.

Fig. 1. Global and regional SRB estimates in 1990 and 2017, and regional baseline values of SRB. Dots indicate median estimates, and horizontal lines refer to 95% uncertainty intervals. Regional baseline values are in dark green, where the vertical line segments refer to median estimates, and green shaded areas are 95% uncertainty intervals.

Table 1. Global and regional SRB in 1990, 2000, and 2017

Between 1990 and 2017, the change in the global SRB is not statistically significant. For the same period, none of the regional estimated SRBs show significant reductions, while the SRB in Caucasus and central Asia increases by 0.010 [0.001; 0.019]. Between 1990 and 2000, the increase in the global SRB is 0.005 [−0.001; 0.013]. During 1990–2000, the increases in regional SRB are significantly above zero in eastern Asia at 0.042 [0.009; 0.083], southern Asia at 0.014 [0.005; 0.022], and Caucasus and central Asia at 0.012 [0.005; 0.020]. Between 2000 and 2017, the changes in SRB are not significant for any region. However, on a global level, the decrease in SRB during 2000–2017 is significantly below zero at −0.010 [−0.018; −0.002].

The regional SRB baseline values range from 1.031 [1.027; 1.036] in sub-Saharan Africa to 1.063 [1.055; 1.072] in southeastern Asia, 1.063 [1.054; 1.072] in eastern Asia, and 1.067 [1.058; 1.077] in Oceania (Table 1 and Fig. 1). Compared with the conventional value of 1.05 for the SRB baseline adopted by the UN WPP (37), the regional baseline values differ significantly from 1.05 for 6 out of 10 regions: significantly above 1.05 for “ENAN” (the combination of countries in Europe, North America, Australia, and New Zealand), southeastern Asia, eastern Asia, and Oceania and significantly below 1.05 for sub-Saharan Africa and Latin America and the Caribbean. In 2017, the aggregated SRBs in three regions (southern Asia, Caucasus and central Asia, and eastern Asia) are significantly above their corresponding regional baseline median estimates. In 1990, the aggregated regional-level SRBs in southern Asia and eastern Asia are significantly above their regional baseline median estimates.

National SRB Estimates Case Studies. We illustrate SRB estimates for 12 countries which are identified to have strong statistical evidence of SRB inflation. The SRB median estimates and 95% uncertainty intervals for the 12 countries are shown in Table 2 and Fig. 2. TFR estimates are overlaid onto SRB estimates in Fig. 2, to illustrate the relationship between the start year of the SRB inflation period and fertility decline, as incorporated into the model to estimate the start year of the inflation period.

Table 2. SRB results for countries with strong statistical evidence of SRB inflation

Fig. 2. SRB estimates during 1950–2017 for countries with strong statistical evidence of SRB inflation. The scale on the left y axis refers to SRB, and the scale on the right y axis refers to TFR. Red lines and shaded areas are country-specific SRB median estimates and their 95% uncertainty intervals. Dark green horizontal lines are median estimates for regional SRB baselines. Light green horizontal lines are median estimates for national SRB baselines. Observations from different data series are differentiated by colors, where VR data are black solid dots. The blue square dots are the UN WPP 2017 TFR estimates. Blue vertical lines indicate median estimates for start and end years (if before 2017) of the SRB inflation period. TFR values in the start years of SRB inflation periods are shown.

Among the 12 countries, 9 are from Asian regions (Caucasus and central Asia, eastern Asia, southeastern Asia, and southern Asia). TFR values at the start of sex ratio transitions vary across countries. As shown in Fig. 2, India is a country with a high TFR value of 5.2 at the start of its inflation period in 1975, while SRB inflation is estimated to start in Vietnam in 2001 when its TFR declined to 2.0 and in Hong Kong, SAR of China in 2004 with a TFR at 1.0.

Since the start of the inflation, SRBs reached their maximum before 2017 for all 12 countries. During the sex ratio transitions, SRB reached its maximum after 2000 in 9 countries. The earliest maximum occurred in Republic of Korea in 1990, and the latest occurred in Vietnam in 2012. The highest median estimates of in-country maximum SRB since the start of inflation are in China (1.179 [1.141; 1.221] in 2005), Armenia (1.176 [1.150; 1.203] in 2000), Azerbaijan (1.171 [1.145; 1.197] in 2003), Hong Kong, SAR of China (1.157 [1.140; 1.174] in 2011), and Republic of Korea (1.151 [1.131; 1.171] in 1990). The SRBs have converged back to the range of natural fluctuations in 2007 for Republic of Korea, in 2013 for Hong Kong (SAR of China), and in 2016 for Georgia. By 2017, the lowest SRBs among the 12 countries are 1.054 [1.028; 1.081] in Tunisia and 1.056 [1.034; 1.078] in Republic of Korea, while the highest are 1.134 [1.097; 1.168] in Azerbaijan and 1.143 [1.079; 1.205] in China.

Missing Female Births Estimates. From 1970 to 2017, the total CMFB for the 12 countries with strong statistical evidence of SRB inflation is 23.1 [19.0; 28.3] million (Table 3 and Fig. 3). The majority of CMFB between 1970 and 2017 are concentrated in China, with 11.9 [8.5; 15.8] million, and in India, with 10.6 [8.0; 13.6] million. The CMFB between 1970 and 2017 in China and India made up 51.40% [41.28%; 61.28%] and 45.94% [36.09%; 55.83%], respectively, of the total CMFB.

Table 3. CMFB for periods 1970–1990, 1991–2000, 2001–2017, and 1970–2017, for countries with strong statistical evidence of SRB inflation

Fig. 3. SRB in 2017 and the CMFB during 1970–2017, by country. Countries are colored by the levels of their SRB median estimates. Radii of circles are proportional to CMFB for countries. For a high-resolution plot of Fig. 3, see SI Appendix, section 11.

Materials and Methods

Details on the database and model are provided in SI Appendix and summarized in this section.

Model Inputs. We produce SRB estimates for 212 countries with total population size greater than 90,000 as of 2017. Due to data availability and inclusion criteria, we construct a database with data from 202 countries. The database includes 10,835 data points on national-level SRB, corresponding to 16,602 country-years of information. On average, 82.2 country-years of data are available for each of the 202 countries with data.

In the SRB database, we compile VR data from the UN Demographic Yearbook and the Human Mortality Database, sampling registration system data for India, Pakistan, and Bangladesh from annual reports, international survey data from microdata or reports (Demographic and Health Surveys, World Fertility Surveys, Reproductive Health Survey, Multiple Indicator Cluster Surveys, Pan Arab Project for Family Health, and Pan Arab Project for Child Development), and census and national-level survey data from reports. For survey data with available microdata files, we use a jackknife method to calculate sampling errors for observations with varying reference periods (SI Appendix, section 1). We conduct data quality checks for VR data before inclusion (SI Appendix, section 1). We exclude data from country-periods in which national-level conflicts or natural disasters occurred. The crises are identified using the UN IGME criteria (65). Additional information on data processing for China and India is given in SI Appendix, section 1. Detailed information on data sources is in SI Appendix, Table S19.

Estimates of TFR and number of births for all countries are obtained from the UN WPP 2017 version (37); we use the annual estimates from 1950 to 2017.

Selection of Countries at Risk for SRB Inflation. We generate DSRB in 220 Demographic and Health Surveys from 73 countries and generate the SRLB in 283 Demographic and Health Surveys from 83 countries. We follow the steps described in Bongaarts (15) to compute DSRB and SRLB. We identify 11 countries with high DSRB and 13 countries with high SRLB. We also conduct a systematic literature review to identify countries with empirical evidence of SRB inflation, as well as countries with populations that are considered to have a son preference or to be a patrimonial society. Twenty-three countries are identified by the literature review.

In total, out of the 212 countries considered, there is information on DSRB/SRLB criteria and/or literature for 90 countries, and we identify 29 countries at risk for SRB inflation using the three selection criteria. We assume that the remaining 122 countries without information on DSRB/SRLB and literature are not at risk for SRB inflation. This includes 65 countries with VR coverage during 1970–2017, which we assume would have been identified in the literature search if SRB imbalance or son preference were present. The remaining 57 countries without information cover only 3.2% of all births globally in 1970–2017. SI Appendix, section 2 explains the selection in detail.

Model of Country-Years Without SRB Inflation. We model SRB in country-years without inflation as a product of two components: (i) a national baseline value, which is assumed to be constant over time, and (ii) a country-year–specific multiplier that captures the natural fluctuation of the country-specific SRB around its respective baseline value over time. We allow for baseline values to differ across countries within the same region, to incorporate SRB differences due to ethnic origin on a national level in a region. The national baselines are pooled toward the same regional baseline. The regional baselines capture ethnic differences across regions. We assign independent uniform priors to each of the regional baseline values.

The country-year–specific multiplier is modeled with an autoregressive time series process of order 1 within a country. For countries without any data or with very limited information, the multiplier is equal to (or shrunk toward) 1, such that the estimated SRBs without inflation are given by (or close to) their corresponding national baselines. For countries where the data suggest different levels or trends, the multipliers capture these natural deviations from national baselines.

To estimate SRB for country-years without inflation, and to estimate baseline values, we fit the model to a reduced database by excluding data from the 29 countries at risk for SRB inflation with reference year from 1970 onward. We keep the data with reference year before 1970 for the 29 countries, since sex-selective abortion technology was not widely available or affordable before 1970.

Model of Country-Years with Potential SRB Inflation. We model SRB in the 29 countries at risk for SRB inflation as the sum of two parts: (i) the inflation-free SRB level, given by the model of country-years without SRB inflation as described above, and (ii) a nonnegative SRB inflation factor. The country-year–specific SRB inflation factor is modeled from 1970 onward for those countries. The parametrization of the inflation factor is described in the introduction. The hierarchical distributions for the start year of SRB inflation follow a truncated t distribution to capture start years with outlying TFR levels. Normal distributions are used for the other parameters of the trapezoid function, with lower truncations at zero. We assign vague priors to the mean and SD of these truncated distributions, with the exception of the mean of the start year of SRB inflation. The mean for the start year is determined by an analysis of the relation between fertility levels and the start as observed in countries with high-quality VR data (SI Appendix, section 2).

Out of the 29 countries at risk for SRB inflation, we identify 12 countries with strong statistical evidence of SRB inflation (as listed in Table 2). These countries are selected based on the AMFB: a country is identified as having SRB inflation if the probability of AMFB greater than zero is at least 95% for at least 1 y during 1970–2017.

Data Quality Model. We construct a data quality model to account for varying data quality from VR systems, surveys, and censuses. We account for differences in error variance across observations, where error variance is given by stochastic errors for VR data and the sum of sampling and nonsampling errors for non-VR data. Sampling errors are computed to reflect the sampling design. Nonsampling errors are estimated within the model by data source type. Errors (and hence the error variance) associated with non-VR data tend to be larger than errors associated with VR data, and this is reflected in the model fitting, as the weight assigned to a data point increases as its error variance decreases. Resulting model-based estimates are more strongly weighted by observations with smaller errors, and uncertainty ranges are narrower for country-periods with more observations with smaller error variance. The details are in SI Appendix, section 2.

Model Validation. We use two out-of-sample validation exercises, one in-sample validation exercise, and a simulation to assess model performance. In the first out-of-sample validation exercise for countries without SRB inflation, we leave out all data that are collected after 2004, corresponding to 20% of the global reduced database. We fit the model without the inflation factor to the remaining training database, and obtain median estimates, projections, and uncertainty intervals that would have been constructed in the year 2005 based on available data at that time. We also conduct an in-sample validation to test the performance of the model without the inflation factor. We randomly leave out 20% of the global reduced database and fit the model to the training database. We repeat this process 30 times.

In the second out-of-sample exercise, we focus on the 29 countries at risk for SRB inflation and leave out all data in these countries collected after 2009, corresponding to approximately 20% of the data for these countries. We fit the model with the inflation factor to the training set to obtain median estimates and projections for the SRB and inflation. We also assess the performance of the inflation model by simulating the SRB for each country after 1970 based on the median estimates of the global parameters of the inflation model (and not the country-specific data).

We calculate various validation measures to assess model performance, including prediction errors and coverage. The error for each left-out observation is defined as the difference between the left-out observation and the posterior median of the predictive distribution based on the training database. Coverage refers to the percentage of left-out data points falling above or below their corresponding 95% or 80% prediction intervals. For the 30 rounds of in-sample validations, we compute the averages of these measures. The model validation results suggest that the models are reasonably well calibrated (SI Appendix, section 3).
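The validation measures described above can be computed from left-out observations and draws from their predictive distributions. The following is a minimal sketch under stated assumptions: the function names are hypothetical, interval bounds are taken as equal-tailed empirical quantiles of the draws, and the median error and median absolute error are used as summaries, which the text does not specify.

```python
import statistics


def empirical_quantile(sorted_draws, q):
    """Quantile of sorted draws by linear interpolation between order statistics."""
    pos = q * (len(sorted_draws) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_draws) - 1)
    frac = pos - lo
    return sorted_draws[lo] * (1.0 - frac) + sorted_draws[hi] * frac


def validation_summary(observed, predictive_draws, level=0.95):
    """Summarize prediction errors and coverage for left-out observations.

    observed: left-out SRB observations.
    predictive_draws: for each observation, draws from its predictive
        distribution based on the training data.
    """
    alpha = (1.0 - level) / 2.0
    errors, below, above = [], 0, 0
    for y, draws in zip(observed, predictive_draws):
        s = sorted(draws)
        # Error: left-out observation minus the predictive median.
        errors.append(y - statistics.median(s))
        if y < empirical_quantile(s, alpha):
            below += 1
        elif y > empirical_quantile(s, 1.0 - alpha):
            above += 1
    n = len(observed)
    return {
        "median_error": statistics.median(errors),
        "median_abs_error": statistics.median(abs(e) for e in errors),
        "pct_below": 100.0 * below / n,
        "pct_above": 100.0 * above / n,
        "pct_within": 100.0 * (n - below - above) / n,
    }
```

For a well-calibrated model and `level=0.95`, roughly 2.5% of left-out observations would be expected below and 2.5% above their intervals; calling the function again with `level=0.80` gives the corresponding 80% check.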

Acknowledgments

We thank Christophe Z. Guilmoto for helpful comments and discussions, Vladimira Kantorova for guidance on data sources, and Danan Gu for guidance on Chinese data. We thank the reviewers and the editors for their insightful comments and suggestions. This work is supported by a research grant from the National University of Singapore (R-608-000-125-646). The study described is solely the responsibility of the authors and does not necessarily represent the official views of the UN.

Footnotes

Author contributions: F.C. and L.A. designed research; F.C. and L.A. performed research; F.C., P.G., A.R.C., and L.A. analyzed data; F.C., P.G., A.R.C., and L.A. wrote the paper; and F.C. and P.G. compiled the database.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission. J.B. is a guest editor invited by the Editorial Board.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1812593116/-/DCSupplemental.