The objectives of this paper are to (1) introduce the WQP to water quality data users, including an overview of holdings in the U.S. at the national scale, the standardized data model, and data access and services; (2) describe challenges and opportunities associated with using WQP data; and (3) demonstrate through an example the value of the WQP data by characterizing seasonal variation of lake water clarity for regions of the continental U.S. The R code used to access, download, analyze, and display WQP data as shown in the figures is included as supporting information.

In response to the need for common access to standardized water quality data for the nation, the U.S. Geological Survey (USGS), the U.S. Environmental Protection Agency (EPA), and the National Water Quality Monitoring Council developed the Water Quality Portal (WQP), the largest standardized water quality data access tool available at the time of this writing. Here water quality refers to physical, chemical, radiological, and biological variables that affect the fitness of water for ecological, biotic, human, or other uses. The WQP adopted and extended a standardized water quality data format, and makes standardized water quality data from numerous sources accessible through a website and web services. The goal of the WQP is to be a single point of access for national‐scale water quality data to facilitate analysis and decision making at the local, regional, and national scales; the WQP includes results from the late 1800s through present day. The WQP database is growing rapidly in both contributors (records from Federal, Tribal, State, local, academic, and nongovernmental monitoring organizations have roughly doubled since 2013) and users, who in turn are producing research [e.g., Corsi et al., 2015], tools [Hirsch and Cicco, 2015], and third‐party extensions (e.g., the Eagle River Watershed Water Quality Courier, http://erwc.wqcourier.com).

Users need access to data in standard formats in order to discover, process, and analyze data for management, research, and other uses, but disparities in water data description, management, and curation represent a challenge for researchers and resource managers. While large repositories of standardized machine‐readable water data sets already exist for high‐frequency hydrologic data such as streamflow (e.g., U.S. Geological Survey National Water Information System [NWIS], Consortium of Universities for the Advancement of Hydrologic Science [CUAHSI] Water Data Center), water quality data tend to be collected and distributed by a greater number of monitoring and research organizations, many of which have not adopted common or standard water quality data formats and description approaches. Moreover, the ways in which water quality monitoring organizations provide users with information on their methods for data collection, description, and access range from public‐facing web services to flat files available only upon request.

The recognition of the importance of aquatic systems to food [Maynard, 2015], security [World Economic Forum, 2016], and society [Wilson and Carpenter, 1999; Millennium Ecosystem Assessment, 2005] has elevated freshwater‐related research and driven observational networks and funding models to expand to regional, national, and even global scales (e.g., U.S. Geological Survey National Water Quality Assessment Program [Hirsch et al., 1988], the National Ecological Observatory Network [Schimel et al., 2007], and the Global Lake Ecological Observatory Network [Hamilton et al., 2015]). Data standardization (the common adoption of rules by which data are described and recorded) and data dissemination approaches vary widely across the hundreds of research groups and organizations that collect water data [e.g., Klug et al., 2012; Solomon et al., 2013; Sharma et al., 2015]. Consequently, the gathering of disparate data elements from a variety of sources to address large‐scale research questions, hereafter termed data aggregation or simply aggregation, is labor intensive and often requires sustained effort by diverse groups of experts [e.g., Soranno et al., 2015]. Most data aggregation efforts, large or small, are motivated by project‐specific questions, and thus the resources for curation and re‐use of these data often are gone when the project and funding conclude.

2 The Water Quality Portal

The need for a common data standard, web interface, and web services for water quality data drove the cooperative development of the WQP (www.waterqualitydata.us) by the USGS, EPA, and National Water Quality Monitoring Council [Blodgett et al., 2016]. Released in 2012, the WQP serves hundreds of millions of water quality records from millions of sites contributed by more than 450 organizations (Table 1), and is consistent with the Federal Open Data Policy [Burwell et al., 2013]. The WQP aggregates and standardizes data from three data sources, each of which is an independently maintained system of record (i.e., funded, operated, and maintained independently of the WQP by Federal agency owners). The Federal agency data owners are ultimately responsible for long‐term stewardship of the data made available by the WQP: the USGS National Water Information System (NWIS) [USGS, 2016], U.S. Department of Agriculture (USDA) Sustaining the Earth's Watersheds—Agricultural Research Data System (STEWARDS) [Steiner et al., 2009], and the USEPA STOrage and RETrieval Water Quality eXchange (STORET‐WQX) [USEPA, 2016]. Of these, the EPA STORET‐WQX is the only database to which external data providers (i.e., non‐EPA affiliated) may submit data. A graphical overview of the WQP can be found at the EPA STORET‐WQX website (https://www.epa.gov/waterdata/storage-and-retrieval-and-water-quality-exchange). Addition, deletion, and modification of data sources are outside of the purview of the WQP and occur at the discretion of data owners.

Table 1. Water Quality Portal Data Holdings Including Groundwater, Inland, and Marine Water Observations

| Data Source | Number of Sites^b | Number of Results^b | URL |
| --- | --- | --- | --- |
| USGS NWIS and USGS BioData | 1,616,518 | 94,075,242 | http://waterdata.usgs.gov/nwis; https://aquatic.biodata.usgs.gov/landing.action |
| USDA STEWARDS | 227 | 1,230,333 | http://www.nrrig.mwa.ars.usda.gov/stewards/stewards.html |
| EPA STORET‐WQX | 740,532 | 202,277,135 | https://www.epa.gov/waterdata/storage-and-retrieval-and-water-quality-exchange |
| WQP total^b | 2,357,268 | 297,582,710 | www.waterqualitydata.us |

2.1 Overview of Data Set

More than 297 million water quality records from more than two million unique sampling sites are available through the WQP (Table 1). The WQP is not a static data set; the date of data access is indicated for all figures and tables generated by WQP queries, and replication of the queries at a later date may return additional records and sites. Records can be queried across space and time, by the type of site at which samples were collected, and by the type of water quality variable measured. A comprehensive description of query parameters can be found in the WQP User Guide (http://waterqualitydata.us/portal_userguide/). Sites included in the WQP are found across the U.S. (see site locations for various water quality characteristics in the contiguous U.S. shown in Figure 1), Canada, Mexico, and other countries; and in aquatic, marine, and groundwater environments. About 11% of results within the WQP are from countries other than the U.S. (∼32 million records). The spatial distribution of particular constituent records and site types varies (Figure 1), and is a function of a number of factors: the percentage of land cover as surface water; State, regional, and Tribal resources allocated to water monitoring; the extent to which contributing organizations prioritize data sharing; and unique local or regional water quality issues. Fourteen site types are found in the WQP: aggregate groundwater use, aggregate surface water use, atmosphere, estuary, facility, glacier, lake, land, ocean, spring, stream, subsurface, well, and wetland. For the purposes of representing WQP data in this paper, the 14 site types were grouped into six larger categories. Aggregate groundwater use, spring, subsurface, and well site types are grouped as groundwater; estuary and ocean are marine; lake, stream, and facility remain as standalone site types; and the remaining site types, representing <5% of sites, make up other (Figure 2).
Streams are the most abundant site type (n = 515,726, 22%) and result type (n = 172,484,642, 58%) within the WQP.

Figure 1. Spatial distribution and counts of sites for five water quality variables are shown by Hydrologic Unit Code (HUC) 8 catchments for all sites (right) and for lake sites (left) in the continental U.S. The water quality variables shown include (a and b) arsenic, (c and d) nitrogen, (e and f) phosphorus, (g and h) Secchi depth, and (i and j) temperature. Data were accessed on 4 January 2017 from http://www.waterqualitydata.us/.

Figure 2. Distribution of results stored in WQP based on characteristic groups, site types, and time collected between 1 January 1950 and 18 October 2016. The absolute number of results is displayed below the characteristic group names. Lines to the right of the bar chart are the temporal distributions of results stored in WQP by characteristic group and are scaled relative to the maximum annual record count for a given characteristic group. This figure does not include the 10,346,741 records (<5% of total) with unspecified characteristic groups or site types, which appear as “Not Assigned” in WQP. Data were accessed on 18 October 2016 from https://www.waterqualitydata.us/.

Water quality records dating back to 1892 are available through the WQP. The total number of water quality records per year has generally increased through time (Figure 2); however, there are notable spikes in observations in the 1970s, likely due to the Clean Water Act, and during other periods (Figure 2). Water quality characteristic groups include physical conditions, chemical and bacteriological water analyses, chemical analyses of fish tissue, taxon abundance data, toxicity data, habitat assessment scores, and biological index scores, among others.
Within these groups, thousands of water quality variables registered in the EPA Substance Registry Service (https://iaspub.epa.gov/sor_internet/registry/substreg/home/overview/home.do) and the Integrated Taxonomic Information System (https://www.itis.gov/) are represented. Across all site types, physical characteristics (e.g., temperature and water level) are the most common water quality result type in the system (n = 136,639,157; 46% of all results).
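
The grouping of the 14 WQP site types into six broader categories described in this section can be written as a simple lookup. The following is a minimal Python sketch and is not the R code distributed with the paper. The lowercase keys, the function names, and the choice to send unrecognized values to "other" are assumptions made for illustration; the site type strings actually returned by the WQP may differ in case or wording.

```python
from collections import Counter

# The 14 site types named in the text, mapped to the six display categories.
# Keys are lowercased for matching; this normalization is an assumption of the sketch.
SITE_TYPE_GROUPS = {
    "aggregate groundwater use": "groundwater",
    "spring": "groundwater",
    "subsurface": "groundwater",
    "well": "groundwater",
    "estuary": "marine",
    "ocean": "marine",
    "lake": "lake",
    "stream": "stream",
    "facility": "facility",
    "aggregate surface water use": "other",
    "atmosphere": "other",
    "glacier": "other",
    "land": "other",
    "wetland": "other",
}


def group_site_type(site_type: str) -> str:
    """Return the broad category for a site type.

    Values not found in the lookup fall back to "other"; that fallback is an
    assumption of this sketch, not a rule stated in the text.
    """
    return SITE_TYPE_GROUPS.get(site_type.strip().lower(), "other")


def count_by_group(site_types):
    """Tally an iterable of site type strings by broad category."""
    return Counter(group_site_type(s) for s in site_types)
```

Calling `count_by_group` on the site type column of a downloaded site table would give per-category site counts comparable in spirit to those summarized in Figure 2, subject to the caveats above.
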

2.2 Water Quality Exchange (WQX) Data Model

For data from disparate sources to be aggregated and standardized, a shared data model is required. The Water Quality Exchange data model (WQX; http://www.exchangenetwork.net/data-exchange/wqx/), initially developed by the Environmental Information Exchange Network, was adapted by EPA to support submission of water quality records to the EPA STORET Data Warehouse [USEPA, 2016], and has subsequently become the standard data model for the WQP. Data models are structures used to represent domains of information, and consist of entities (tables; e.g., WQX sites and results), attributes (columns or fields; e.g., WQX OrganizationIdentifier or CharacteristicName), and keys (relationships between entities; e.g., WQX MonitoringLocationIdentifier, used to link results to sites in the WQP). Sites consist of unique OrganizationIdentifier and MonitoringLocationIdentifier combinations, while results for each sampling activity at a site are linked to a unique ActivityIdentifier. Source data from the USGS NWIS, EPA STORET‐WQX, and USDA STEWARDS are mapped to the WQX data model prior to being made available through the web services and web interface. No additional data checks or quality assurance protocols are applied by the WQP; these tasks are at the discretion of the system of record and outside of the purview of the WQP. The WQX data model includes metadata (“data about data”) needed to fully describe water quality observations, e.g., units associated with a water quality value (ResultMeasure/MeasureUnitCode) or sample collection method name (SampleCollectionMethod/MethodName). Minimum metadata requirements for submission of sites and results into WQX are described in detail on the WQX webpage for the XML schema and submission documentation: http://www.exchangenetwork.net/data-exchange/wqx/.
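
To make the entity, attribute, and key terminology concrete, the sketch below links result records to site records in plain Python. It is an illustration only: the records are invented placeholder values, the `SiteType` attribute and all function names are hypothetical, and joining on the pair (OrganizationIdentifier, MonitoringLocationIdentifier) is a choice made here because the text describes sites as unique combinations of those two attributes.

```python
# Invented example records; the identifiers and values are placeholders, not WQP data.
EXAMPLE_SITES = [
    {"OrganizationIdentifier": "ORG-A",
     "MonitoringLocationIdentifier": "ORG-A-001",
     "SiteType": "lake"},  # "SiteType" is a hypothetical attribute for this sketch
]

EXAMPLE_RESULTS = [
    {"OrganizationIdentifier": "ORG-A",
     "MonitoringLocationIdentifier": "ORG-A-001",
     "ActivityIdentifier": "ACT-1",
     "CharacteristicName": "Secchi depth",
     "ResultMeasureValue": "2.4",
     "ResultMeasure/MeasureUnitCode": "m"},
    {"OrganizationIdentifier": "ORG-B",
     "MonitoringLocationIdentifier": "ORG-B-009",
     "ActivityIdentifier": "ACT-2",
     "CharacteristicName": "Temperature",
     "ResultMeasureValue": "18.1",
     "ResultMeasure/MeasureUnitCode": "deg C"},
]


def site_key(record):
    """Composite key identifying a site: organization plus monitoring location."""
    return (record["OrganizationIdentifier"],
            record["MonitoringLocationIdentifier"])


def link_results_to_sites(sites, results):
    """Attach site attributes to each result that has a matching site.

    Returns (linked, orphans): results merged with their site record, and
    results for which no site with the same key was found.
    """
    site_index = {site_key(s): s for s in sites}
    linked, orphans = [], []
    for result in results:
        site = site_index.get(site_key(result))
        if site is None:
            orphans.append(result)
        else:
            linked.append({**site, **result})
    return linked, orphans
```

With the placeholder data, `link_results_to_sites(EXAMPLE_SITES, EXAMPLE_RESULTS)` pairs the first result with its site and reports the second as unmatched, which is the kind of check a user might run after downloading site and result tables separately.
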
Of the 280 WQX metadata elements, 18 are mandatory for submission of sites and 32 are required for results (including the mandatory site attributes used to link sites to results). An additional 120 elements are conditionally required: for example, if a result value (ResultMeasureValue) is entered for dissolved oxygen concentration, an entry of units (ResultMeasure/MeasureUnitCode), such as mg/l, is also mandatory. WQX metadata elements have standardized descriptions and a discrete number of accepted reference datums. For example, geospatial coordinates, an attribute of the WQX site entity, accept a limited number of datums (e.g., WGS 84 or NAD 83).
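
The conditional requirement described above (a result value implies a unit code) can be expressed as a small rule check. The sketch below encodes only the single rule given as an example in the text; it is not the WQX schema validation, and the rule table and function names are hypothetical.

```python
# Each rule is (trigger_field, required_field): when the trigger field is
# filled in, the required field must be filled in as well. Only the one
# example from the text is encoded; the remaining conditional elements are not.
CONDITIONAL_RULES = [
    ("ResultMeasureValue", "ResultMeasure/MeasureUnitCode"),
]


def is_filled(record, field):
    """True when the field is present and not blank."""
    value = record.get(field)
    return value is not None and str(value).strip() != ""


def missing_conditional_fields(record):
    """List the conditionally required fields that are missing from a record."""
    return [
        required
        for trigger, required in CONDITIONAL_RULES
        if is_filled(record, trigger) and not is_filled(record, required)
    ]
```

A record with a dissolved oxygen value but no unit code would be flagged, while a record that reports neither a value nor a unit passes this particular rule.
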