Water system and demographic data sources

Detailed information about public water systems was gathered from the U.S. EPA’s Safe Drinking Water Information System (SDWIS) [37]. Our target population were CWSs in each U.S. state that were active at some point between 2010 and 2014. We restricted our analysis to CWSs because these systems serve customers in their homes year-round, whereas non-community systems can serve non-residential settings such as office buildings and campgrounds. We did not include CWSs that purchased their water from another supplier; purchasing water systems are rarely required to test for nitrate and therefore rarely collect nitrate data. In total, we retrieved data for 412,835 systems, of which 42,114 were CWSs active between 2010 and 2014 that did not purchase their water. Relevant characteristics obtained for each system included: activity status, system type (community, non-community, etc.), number of people served, source water type (groundwater or surface water), affiliated wholesaler or purchasing systems, and region served by the system (city, county). CWS system sizes were classified using categories defined by the EPA: very small (≤500 people); small (501–3300); medium (3301–10,000); large (10,001–100,000); and very large (>100,000).

We obtained race, ethnicity, poverty, and home ownership information (2010–2014 five-year estimates) and the proportion of households in urbanized areas (2010 estimates) from the U.S. Census Bureau for each county, census-designated place, and county subdivision in the 50 U.S. states [38, 39]. Agricultural data on the amount of livestock (cows, goats, horses, pigs, sheep) per 100 acres and the percent of land area used as cropland were obtained from the U.S. Department of Agriculture’s 2012 Census of Agriculture for each U.S. county [40]. Demographics and agriculture variables were assumed to remain constant throughout our study period.

Identifying populations served by CWSs

Characterizing the demographics of the communities served by each water system is challenging in part because little information is publicly available on the geographic areas served by each CWS [41]. Few states provide public access to electronic records documenting the service areas of their public water systems, so we relied on the information included in SDWIS.

SDWIS’s Water System module provides address variables conveying the location of each water system’s “legal entity,” i.e., the mailing address of the administrative personnel associated with the system. Separate variables, city served and county served, describe the areas to which a system directly provides water, and the primacy agency code specifies the agency that has regulatory oversight of the water system (typically a state agency encompassing the cities or counties served). Using SDWIS’s Geographic Area module, which some states primarily use to report the “areas served” parameters, we were able to augment our database’s cities and counties served.

We used the city served and county served fields in SDWIS to determine the areas served by each CWS. SDWIS provided information in the counties served field for > 99% of CWSs (n = 41,781), but only 48.1% of CWSs reported information in the cities served field (n = 20,267). By contacting state agencies, we were able to supplement SDWIS data for 1509 CWSs in three states; however, 13 states rarely or never record information in the cities served field (Fig. 1). Although administrative address information was available for nearly all water systems, we concluded that such data did not reliably identify the areas served by each CWS. Some system administrator addresses were located hundreds of miles away from the cities served by their affiliated water systems or were located in a different state, and for 40% of the systems with both a city name (pertaining to the administrator’s city) and a city served designated in SDWIS, the two fields shared no overlapping cities. Furthermore, the demographics of the areas associated with the administrative addresses often varied substantially from the demographics of each water system’s cities served (Additional file 1: Table S1).

Fig. 1 Coverage of the states and regions included in our statistical analyses. Some portions of the country were not included in our analysis either because they did not have a community water system on record or because we were unable to collect nitrate data for that area Full size image

Data obtained for wholesale water systems varied widely in whether they included the cities or customers served by downstream purchasing systems. Because purchasing systems were not included in our analysis, it was important that the data for each wholesaler included all of the cities that purchase its water in order to best characterize the population served by each CWS. As described in Additional file 1, we amended wholesalers’ city served and population served fields to include the area and people served by systems that purchase wholesale water. As a result of these modifications, we expanded data on the cities served for 1245 wholesale systems (174 of which lacked any city served data prior to considering purchased water systems) and adjusted the system size for 220 wholesale systems. Data that we compiled for U.S. CWSs, including service areas and purchaser-seller connections, are provided in Additional files 2 and 3.

Linking demographic and water system data

Water system characteristics and demographic data were linked by matching the names in the SDWIS city served and county served fields with the geographic names in the Census Bureau demographic data. Because the majority of city served names matched to county subdivisions in the Northeastern U.S. (Pennsylvania and all states to the north and east), county subdivision demographics were prioritized as a match for CWSs in the Northeast. Census-designated places were prioritized in all other U.S. regions, where county subdivision matching was infrequent. Notably, the city served and county served fields are not checked for consistency at a national level, so SDWIS entries are prone to local (rather than official) naming conventions and typographical errors. We used an approximate string matching algorithm based on Jaro-Winkler distance to identify the most similar name in the demographic data for each city served and county served that did not have an identical match. Typographical errors and differences in abbreviation conventions were corrected to improve string matching. All county served names provided by SDWIS were matched to Census Bureau records in this manner. For city served names that did not match or matched multiple Census Bureau records after these corrections, we gathered additional information from online encyclopedias and search engines about individual water systems and geographical locations. In some cases, this additional information allowed us to definitively link city served names from SDWIS with location names in Census Bureau data, while in other cases the names were considered non-matching and were not included in subsequent analyses. In total, we successfully linked demographic data to every CWS that provided a county served (41,781 systems) and 96.8% of CWSs for which we had city served information (21,253 systems).

For CWSs serving multiple cities or counties, we calculated a weighted average for each demographic parameter based on the relative populations of each of the cities or counties served. For instance, if a CWS served five cities, then the proportion of Hispanic residents associated with that CWS was calculated as an average of the proportion of Hispanic residents in each of the five cities, weighted by each city’s population. When demographic information was missing for one or more cities served, as was the case for 262 systems (1.2%), these weighted averages were calculated based on all available demographic data for cities served by that system.

Nitrate data

There is no national database of water contaminant concentration data. Thus, nitrate data were acquired by contacting SDWA-designated primacy agencies in each state through a combination of email and Freedom of Information Act requests and by gathering data from ten state-operated online data portals. In total, we compiled data for over 2.7 million samples analyzed for nitrate and/or nitrate-nitrite (sum of nitrate plus nitrite). Of these, 616,591 samples were collected from our CWSs between 2010 and 2014, covering 40,178 (96.2%) of the systems for which we had county-level demographic data.

Nitrate (NO 3 -N) concentrations were used when available to estimate each CWS’s annual nitrate concentration. However, CWSs in 13 states primarily reported nitrate-nitrite (NO 3 /NO 2 -N) data, and some CWSs reported both. In order to extrapolate NO 3 -N concentrations from NO 3 /NO 2 -N concentrations, we used a univariate linear regression based on all samples for which we could pair by system, sampling date, and sub-facility (n = 48,925 pairs). This regression ([NO 3 -N] = 0.987*[NO 3 /NO 2 -N], r2 ≈ 0.93) was used to estimate NO 3 -N concentrations when only NO 3 /NO 2 -N concentrations were provided.

We made additional adjustments to the dataset to address discrepancies in how nitrate data were reported by states. In some cases units (mg/L, μg/L) were either missing or appeared to be misreported (e.g., 0.2 ng/L), complicating the interpretation of sample results, particularly those that substantially diverged from other samples collected from the same system. We developed a protocol to identify outliers that were either misreported or misevaluated based on the average difference between each sample result and other samples from that system. Outliers were confirmed by visual inspection, and we removed 498 of such cases (0.1% of all samples) from the analysis. In addition, 17 states never provided detection limits (DLs) for samples without detectable nitrate. DLs are useful for quantitative analyses of nondetects to constrain the range of possible values for samples below the DL (e.g., < 0.2 mg/L indicates a more limited range of possible concentrations than < 1 mg/L). Among 151,463 nitrate and nitrate-nitrite samples for which DLs were provided, 99.4% of DLs were ≤ 1 mg/L and 17.4% of DLs were exactly 1 mg/L, so 1 mg/L was used as a reasonable upper-bound DL for all nondetects that were missing a DL (12.3% of all samples).

Statistical analyses

We used a mixed-effects linear regression model to evaluate the relationship between nitrate concentrations and demographic, system-characteristic, and land-use data. Model parameters were selected a priori to allow us to explore associations between environmental justice indicators and nitrate concentrations while controlling for possible confounders. Environmental justice indicators included those related to race and ethnicity (percent Hispanic residents, percent non-Hispanic Black residents) and those related to poverty (percent of families with income below the poverty line, percent of home ownership). System characteristics of interest included system size and type of source water (surface water or groundwater). Land use variables (number of livestock per 100 acres, percent of land area used as cropland, percentage of homes in urbanized areas) were included to account for agricultural and wastewater contributions to nitrate source water contamination. State was included as a random effect to capture the influence of different monitoring and reporting practices for drinking water and differences in state regulations of pollutant sources and enforcement. In the baseline model, we used county-level demographics because county served information was provided for nearly all CWSs, offering nearly complete geographical coverage (Fig. 1). Collinearity was evaluated by visual inspection of Spearman correlation coefficients (Additional file 1: Tables S2-S6) and by calculating variance inflation factors (VIFs) for each model. Race, ethnicity, and at least one poverty indicator were retained in all models, and other covariates were retained if they were significantly associated with the outcome (p < 0.05) or if removing them from the model changed the effect estimates for other covariates by more than 10%.

Nitrate data were right skewed and approximately lognormal. Because nitrate concentrations were below the DL in nearly 30% of water samples, we used a multiple imputation method adapted from Lubin et al. [42] to address nondetects. Multiple imputation can provide less biased variance estimates relative to a simple substitution method (typically DL/2 or DL/\( \sqrt{2} \)) in datasets with more than 10% nondetects [42]. For each year in which a CWS operated (“system-year”), we calculated a minimum average nitrate concentration, imputing zero for nondetects, and a maximum average nitrate concentration, imputing the DL for nondetects. Together, these values constitute an interval that contains every possible system-year average, and for system-years without nondetect data, these two values are identical. Because we lacked information on the relative contributions of multiple sources to overall water flow within a system, we assumed all sub-facilities and all samples contributed equally to the system-year average. For each system, we then calculated a five-year system average concentration as an interval, with the lower/upper bounds of the interval represented as the average of all of the minima/maxima of system-year intervals, respectively. The ranges of these five-year system average intervals were generally narrow: 67.3% of system averages had a range ≤ 0.2 mg/L, 99.8% had a range ≤ 1 mg/L, and 0.2% (91 systems) had a system average interval range > 1 mg/L.

We used non-parametric bootstrap resampling (1000 bootstrap iterations) and maximum likelihood estimation to estimate means (\( \overset{\sim }{x} \)) and standard deviations (\( \overset{\sim }{\sigma } \)) of a normal distribution that best fit the natural logarithms of our system average intervals. Natural log-transformed system averages were then randomly imputed in accordance with the proportions of the cumulative distribution function of a normal distribution with mean \( \overset{\sim }{x} \) and standard deviation\( \overset{\sim }{\sigma } \) truncated at the natural logarithms of the system average minimum and maximum. We did not impute values for system averages based entirely on detected concentrations (37.9% of systems) because we could calculate these averages directly. We then evaluated ln-transformed system averages as the outcome variable in a mixed-effects regression model. This procedure (bootstrap resampling, lognormal parameter estimation, imputation, and regression analysis) was repeated five times to determine the consistency of our results. Regression results from each of these five phases were pooled using techniques outlined by Rubin [43].

To evaluate disparities in drinking water exposures to nitrate levels of health concern, we conducted a separate set of analyses in which we dichotomized the outcome variable. Using the same multiple imputation approach, systems were categorized as “high nitrate” if their estimated system average was ≥5 mg/L, a level associated with adverse health outcomes in some epidemiological studies, and as “low nitrate” otherwise. We used a generalized logistic mixed-effects regression model to evaluate the effects of the same set of predictors, including the state-specific random effect, on the binary nitrate response variable. As before, five sets of regression results were pooled using the Rubin method.

We assessed whether our results depended on geographic region or on the spatial resolution of the demographic data. To evaluate variability across geographic regions, we conducted a similar set of regression analyses stratified into four regions (Midwest, Northeast, South, and West) defined by the U.S. Census Bureau. To evaluate the effect of the spatial scale of the demographic data, we also repeated our analyses using demographic data representing the city or cities served by each CWS. Thirteen states rarely or never report city served values in SDWIS (Fig. 1), so a substantial portion (49.4%) of CWSs were not included in this city-level analysis. Since agricultural data were only available at the county level, our analyses using city-level demographics included agricultural variables corresponding to the counties served by each system. Finally, to determine whether differences between the county-level and city-level results were more dependent on the scope (geographic area considered in the analysis) or scale (resolution of the city vs. county demographics), we developed an intermediate third model using county-level demographics for only those CWSs included in our city-level analysis.

All analyses were conducted using R version 3.4.1 [44].