During the 2009 influenza A H1N1 pandemic, direct and indirect flight data from the country of origin significantly improved models that predicted when the virus would be detected in other countries. Software that predicts where infected travelers will travel can therefore help prevent the further spread of infectious disease outbreaks. There are publicly available (e.g., GLEAMviz) and non-publicly available (e.g., Centers for Disease Control and Prevention, BlueDot) software packages and models for flight network analysis; however, validations of these capabilities have not been published in the peer-reviewed literature. In this manuscript, we validate a new open-access, web-based application against cases of Zika that have arrived in the United States.

The limited public health and biosurveillance resources available to prepare against Zika must be focused on the locations where the virus is most likely to spread. Developing accurate models to anticipate Zika spread is one of the first steps in mitigating further spread and transmission of the disease. Air traffic data, network models, and simulations have likely improved disease spread forecasting capabilities. 7 , 8 , 9 , 10 , 11 , 12 , 13 , 14 Many of these approaches appear to have been useful in predicting the movement of infected travelers over flight networks, but they have not been validated in the peer-reviewed literature.

Zika Virus was first isolated in 1947 in Uganda from a captive rhesus monkey. 1 Five years later, the first evidence of humans contracting the virus was reported in Uganda and the United Republic of Tanzania. 2 Over the next 50 years, the virus spread throughout Africa and Asia 3 , though only 14 human cases were reported during this time period. 4 The first major human outbreak of Zika occurred in 2007, in the Federated States of Micronesia on the island of Yap, where approximately three quarters of the population of 7,391 contracted the virus. 4 The most likely source of this outbreak was the inadvertent import of a mosquito vector or the arrival of an infected traveler. 4 Duffy et al. (2009) warned that Zika might easily spread via air travel, citing a medical volunteer who traveled back to the United States after contracting Zika in Yap. The current and ongoing Zika outbreak officially started on May 7, 2015, when Brazil’s National Reference Laboratory confirmed the virus had been isolated in the Americas. 3 There is active transmission of the Zika Virus in 31 countries in the Americas, with 4,160 confirmed cases and 175,636 suspected cases. 5 , 6

Method

FLIRT Software

FLIRT, a biosurveillance application developed under the open source Apache 2.0 license (available at www.apps.eha.io), is designed to predict where infected travelers will likely travel. FLIRT contains a database of flight schedule information that is updated monthly (Innovata, 2016) and visualizes passenger flow data over flight networks. FLIRT has two distinct modes: (1) “Scheduled Direct Flights” displays nonstop flight paths from a selected airport to all possible destinations and ranks destinations by summed seat count; and, (2) “Passenger Simulation” displays the results of a Monte Carlo simulation based on passenger layover and transfer probabilities, and was created to mirror real-life air travel behavior. All FLIRT results are exportable to JSON, CSV, XML, and XLSX formats.

Data

FLIRT uses a database of flights scheduled by over 800 airlines, with each record detailing one scheduled flight route (Innovata, 2016). For each route, the following data are available: (1) operating carrier and call sign; (2) origin and arrival airports; (3) schedule (start and end times, days of week, all repeating weekly); (4) effective and discontinued dates; and, (5) number of seats available. Some scheduled routes operate multiple legs; these routes are split into their constituent flights at interstitial stops, using the relative distance of each leg to impute its departure and arrival times. The data include current, past, and planned routes, extending as far back as October 1, 2014 and continuing into the future as far as 2018. The number of future scheduled flight routes decreases over time (e.g., approximately half as many routes are scheduled at the start of 2017). At the time of publication, the database contained 54,870,563 flights globally, and tables containing flight counts by region, country, day of the week, and by airport are available in the supplementary materials.
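The leg-splitting step described above can be illustrated with a minimal sketch. It assumes that the total elapsed time of a multi-leg route is divided among its legs in proportion to leg distance, with no ground time at intermediate stops and all times in one time zone; the function name, airport codes, and distances are hypothetical and are not taken from FLIRT’s source code.

```python
from datetime import datetime

def split_route(stops, leg_km, departure, arrival):
    """Split a multi-leg route into flights, imputing times by relative distance.

    Assumption for illustration: zero ground time at intermediate stops.
    """
    total_time = arrival - departure
    total_km = sum(leg_km)
    legs = []
    leg_start = departure
    for i, km in enumerate(leg_km):
        # Each leg receives a share of the total time equal to its share of distance.
        leg_end = leg_start + total_time * (km / total_km)
        legs.append((stops[i], stops[i + 1], leg_start, leg_end))
        leg_start = leg_end
    return legs

# Hypothetical route AAA -> BBB -> CCC with legs of 1,000 km and 3,000 km.
legs = split_route(
    ["AAA", "BBB", "CCC"],
    [1000, 3000],
    datetime(2016, 2, 1, 8, 0),
    datetime(2016, 2, 1, 12, 0),
)
for origin, dest, dep, arr in legs:
    print(origin, dest, dep.time(), arr.time())
```

In this example the first leg covers one quarter of the distance and is therefore assigned one quarter of the four-hour journey.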

Nonstop Flight Analysis (FLIRT Direct Scheduled Flights Mode)

FLIRT assumes that airlines will optimize their planned schedules to meet demand and fill available seats. Thus, FLIRT uses the number of seats available as a proxy for the number of passengers traveling on that route. One degree of edge travel, for a given time period, is calculated by summing all of the seats between two destinations (nodes) for the selected time period. An example is provided in Figure 1.

Fig. 1: A screenshot of FLIRT’s interface displaying a network graph based upon scheduled nonstop flights from GRU between 01 February 2016 to 01 April 2016.
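The seat-summing calculation behind the scheduled direct flights mode can be sketched as follows. This is an illustration only: the record layout, function name, and seat counts are hypothetical and do not reflect FLIRT’s actual schema or data.

```python
from collections import defaultdict
from datetime import date

# Hypothetical flight records: (origin, destination, departure date, seats).
flights = [
    ("GRU", "MIA", date(2016, 2, 3), 250),
    ("GRU", "MIA", date(2016, 2, 10), 250),
    ("GRU", "JFK", date(2016, 2, 5), 300),
    ("GRU", "JFK", date(2016, 4, 15), 300),  # outside the window used below
    ("BOG", "MIA", date(2016, 2, 7), 180),   # not a selected origin below
]

def seat_totals(flights, origins, start, end):
    """Sum scheduled seats from the selected origins to each destination."""
    totals = defaultdict(int)
    for origin, dest, day, seats in flights:
        if origin in origins and start <= day <= end:
            totals[dest] += seats
    # Rank destinations by summed seat count, largest first.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

ranked = seat_totals(flights, {"GRU"}, date(2016, 2, 1), date(2016, 4, 1))
print(ranked)
```

Passing several airports in `origins` aggregates seats across them, which mirrors the multi-airport selection described in the Interface section.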

To estimate the global distribution of risk from an outbreak in a location, FLIRT simulates passenger behavior, given these assumptions: (1) infected travelers behave the same as all travelers; (2) the total number of seats scheduled between two locations (nodes) is directly proportional to the number of passengers traveling between those locations (nodes); (3) some travelers take journeys consisting of multiple flights (edges); (4) the probability distribution of the number of edges per trip for all journeys worldwide is the same as that for U.S. domestic flights; (5) travelers on multi-edge trips do not double back (i.e., subsequent destinations in a trip that would leave a passenger closer to their origin than to their current location are excluded); and, (6) transfers occur in a temporal window after a passenger arrives at an airport (node) weighted according to a Poisson distribution with λ = 2 hours. If multiple airports are selected, simulated passengers are assigned to each origin airport weighted by their relative total volume of scheduled outgoing seats.

Given a time interval and an origin airport or set of airports, FLIRT simulates trips at random times within that interval, with the above described behavior. Given the previously stated assumptions, and the assumption that the disease prevalence of the passengers traveling through the selected airports is equal, the aggregated number of passengers arriving at airports (nodes) should be directly proportional to the rate of arrival of imported disease cases from an outbreak. An example is provided in Figure 2.

Fig. 2: A screenshot of FLIRT’s interface displaying a network graph based upon the simulation of 20,000 passengers departing from CCS between 11 January 2016 to 11 March 2016.
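A simplified sketch of such a Monte Carlo passenger simulation is given below. It is not FLIRT’s implementation. It illustrates seat-weighted origin and destination choice, a fixed distribution of flights per trip, and a no-doubling-back rule that drops destinations closer to the origin than to the current airport. Transfer timing (the Poisson window) is omitted, great-circle distance is an assumed distance measure, and the network, seat counts, and leg probabilities are hypothetical.

```python
import math
import random
from collections import Counter

# Hypothetical toy network: SEATS[origin][destination] = scheduled seats.
SEATS = {
    "CCS": {"MIA": 600, "PTY": 400},
    "PTY": {"MIA": 300, "IAH": 300, "CCS": 400},
    "MIA": {"JFK": 500, "ATL": 500, "CCS": 600},
    "IAH": {}, "JFK": {}, "ATL": {},
}
# Approximate airport coordinates (latitude, longitude) in degrees.
COORDS = {
    "CCS": (10.60, -66.99), "PTY": (9.07, -79.38), "MIA": (25.79, -80.29),
    "IAH": (29.98, -95.34), "JFK": (40.64, -73.78), "ATL": (33.64, -84.43),
}
# Hypothetical probability that a trip consists of 1, 2, or 3 flights.
LEG_PROBS = {1: 0.6, 2: 0.3, 3: 0.1}

def distance_km(a, b):
    """Great-circle (haversine) distance between two airports."""
    lat1, lon1 = map(math.radians, COORDS[a])
    lat2, lon2 = map(math.radians, COORDS[b])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def simulate_trip(origin, rng):
    """Return the final destination of one simulated passenger."""
    n_legs = rng.choices(list(LEG_PROBS), weights=list(LEG_PROBS.values()))[0]
    current = origin
    for _ in range(n_legs):
        # No doubling back: drop destinations closer to the origin
        # than to the passenger's current airport.
        options = {d: s for d, s in SEATS[current].items()
                   if distance_km(d, origin) >= distance_km(d, current)}
        if not options:
            break
        # Choose the next airport in proportion to scheduled seats.
        current = rng.choices(list(options), weights=list(options.values()))[0]
    return current

def simulate(origins, n_passengers, seed=0):
    """Count simulated arrivals per airport for passengers leaving `origins`."""
    rng = random.Random(seed)
    # Assign passengers to origins in proportion to outgoing seats.
    out_seats = [sum(SEATS[o].values()) for o in origins]
    arrivals = Counter()
    for _ in range(n_passengers):
        origin = rng.choices(origins, weights=out_seats)[0]
        arrivals[simulate_trip(origin, rng)] += 1
    return arrivals

arrivals = simulate(["CCS"], 1000)
print(arrivals.most_common())
```

Under the proportionality assumptions stated above, the relative arrival counts, rather than their absolute values, are the quantity of interest.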

Interface

FLIRT users select an airport (node) by searching for name or airport code and selecting from a list of matches. Multiple airports may be selected by: (1) selecting multiple airports from the search interface; (2) automatically including airports in a selectable radius; or, (3) drawing a rectangular selection box on the map. Users also select the start and end dates of flight routes and then choose a mode. In the direct scheduled flights mode, FLIRT displays one-degree connectivity from the selected airport(s) to all destinations (e.g., Figure 1). Users may instead run a passenger simulation after specifying the number of passengers (up to 20,000) they wish to simulate. Results of both modes are displayed with color and thickness scaled routes, as a heat map, and in tabular form. All passenger simulations are cached and may be shared via a unique URL.

Validation, Verification, Evaluation of FLIRT

FLIRT’s scheduled direct flights and passenger simulation modes were used to assess records and future schedules of flights departing from five selected origin airports traveling to the continental U.S. over three time periods (Table 1). Origin airports (nodes) were selected based on the number of suspected and confirmed cases per country. As of 02 February 2016, news reports indicated that Brazil, Colombia, El Salvador, Venezuela, and Honduras had the most suspected human cases of Zika Virus. Only international airports were evaluated, since they are the only nodes in the network capable of sending infected travelers to the United States from locations with sustained local Zika Virus transmission. The international airport with the most passengers in each Zika-affected country was selected as the origin, as each origin country had one main international airport carrying at least twice as many passengers annually as the next busiest international airport in that country (with the exception of Honduras, where the busiest airport carried 1.2 times the passenger traffic of the second busiest airport). Regardless of the specific airport from which Zika-infected travelers chose to fly, this analysis focused on determining which locations in the lower 48 United States are at the highest risk of receiving Zika-infected travelers. The airports selected were Guarulhos International Airport (GRU) in Sao Paulo, Brazil; El Dorado International Airport (BOG) in Bogota, Colombia; Monseñor Óscar Arnulfo Romero International Airport (SAL) in San Salvador, El Salvador; Simón Bolívar International Airport (CCS) in Maiquetia, Venezuela; and Ramón Villeda Morales International Airport (SAP) in San Pedro Sula, Honduras.

Using FLIRT’s scheduled direct flights mode, individual network maps were generated for each of the five origin airports using counts of seats traveling from the selected origin airports to all possible connected global destinations in each of the three time periods. Then, only the U.S. destination results were extracted, and the number of seats from each origin was aggregated to determine the total connectedness between all five origin airports and each possible U.S. destination. Using the passenger simulation, five global simulations were generated for each time range (20,000 passengers per simulation), and each simulation yielded nearly identical results. The results of these five simulations were summed to produce the final simulation results.

To validate FLIRT, FLIRT’s output was compared to the locations of actual U.S. imported Zika cases. Two time ranges were used to assess FLIRT’s ability to predict the rate of imported Zika cases to the U.S. during the 2015 Zika Virus epidemic, and one future time range was considered to make future Zika distribution predictions (Table 1). For each time range, direct scheduled flights and passenger simulation results from selected origins to continental U.S. airports were exported from FLIRT (Table 1). On 02 February 2016, FLIRT predicted a priori which U.S. locations were most at risk of receiving Zika-infected travelers from February 2016 to April 2016, and this prediction was published in The Guardian (Kelkar, 2016). This prediction was validated in this study. The analysis using the Expanded Data range used all available case data collected for this study (163 U.S. Zika cases) and compared it against FLIRT’s 11 January 2016 – 11 March 2016 prediction. The Expanded Data range (a similar time period) was analyzed to further validate FLIRT post hoc, and covered the previous time period merely to avoid the post hoc fallacy.

Case count data were obtained daily and manually from news reports using Google Alerts and searching the Internet from January 11, 2016 to March 11, 2016 using the search terms: (1) new U.S. Zika case; (2) U.S. Zika cases; (3) Zika Virus U.S.; and, (4) searches by each state (e.g., Florida Zika Cases). Information about all confirmed and suspected Zika cases and their location was collected; however, detailed geographic information beyond the state level was not always available in the news reports. Distinguishing between cases was not difficult due to the heightened media attention surrounding Zika cases at that time. Most news articles reported heavily on the single first case arriving in a previously Zika-free state. As the Zika Virus epidemic progressed, news articles began reporting on several new cases within a single news article, especially in Zika hotspot states (e.g., Florida, Texas, and New York) where several new cases appeared each day. If specific sub-state level geographic information was available for a case, it was recorded; if not, only state level information was recorded. Additional manual internet searches were conducted to collect the missing data for these cases, and cases were de-duplicated. Real-time reporting alerts from Google Alerts helped the authors identify trends in reporting, like spikes in news activity directly following a new U.S. case finding. Observing these reporting trends helped to identify repeated case reports.

Collected case data were compared to the CDC’s case count information (available at http://www.cdc.gov/zika/geo/united-states.html). Overall, this study’s case data collection matched the CDC’s state level information with two generalizable differences. First, the CDC had higher case counts for states that contain many cases of Zika (e.g., FL 49 vs. 34). This is partially explained by the longer time frame for which the CDC reported its data (January 1, 2015 – March 9, 2016). Second, the data set that was created in this study reported on outbreaks in 6 states the CDC had not reported (AZ, KY, ME, NE, UT, WV). This is most likely because outbreaks occurred in these states after the March 9th cutoff date of the CDC’s available data at the time this study was conducted. Additionally, this data set frequently contained more detailed information (e.g., county level spatial data), which allowed for higher accuracy in associating cases with specific airports within the airport regional analysis.

For comparison with actual case data, and for future predictions of Zika distribution, FLIRT data was exported for the two validation time ranges and grouped by state and airport/metro region. For the state-level analysis, all nonstop flight airport seat counts and simulation results within a state were aggregated at the state level for incoming U.S. flights from Zika affected areas. In the airport/metro regional analyses, each airport code was kept unique, unless the airports were within 60 miles of each other (often representative of large metropolitan regions). Accordingly, JFK – LGA – EWR – HPN, IAD – DCA – BWI, MIA – FLL, and SJC – OAK – SFO were grouped, and all nonstop flights and simulation results for each of these airports were aggregated.
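The two aggregation levels can be illustrated with a short sketch. The metro groupings are those listed above; the seat counts are hypothetical, and the helper names are chosen for illustration only.

```python
from collections import Counter

# Metro groupings named in the text (airports within 60 miles of each other).
METRO_GROUPS = [
    ("JFK", "LGA", "EWR", "HPN"),
    ("IAD", "DCA", "BWI"),
    ("MIA", "FLL"),
    ("SJC", "OAK", "SFO"),
]
REGION = {code: "-".join(group) for group in METRO_GROUPS for code in group}

# State in which each example airport is located.
AIRPORT_STATE = {"JFK": "NY", "EWR": "NJ", "MIA": "FL", "FLL": "FL", "MCO": "FL"}

# Hypothetical summed seat counts per destination airport.
seats = {"JFK": 900, "EWR": 300, "MIA": 1500, "FLL": 400, "MCO": 600}

def aggregate(values, key):
    """Sum per-airport values after mapping each airport to a group label."""
    totals = Counter()
    for airport, value in values.items():
        totals[key(airport)] += value
    return dict(totals)

# Airports outside any metro group keep their own code as the region label.
by_region = aggregate(seats, lambda a: REGION.get(a, a))
by_state = aggregate(seats, lambda a: AIRPORT_STATE[a])
print(by_region)
print(by_state)
```

Note that a metro region can span state lines (EWR is in New Jersey), so the state-level and region-level totals are not nested.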

Zika cases were assigned geographic locations based on known location information from news reports. Geographic information was available for all Zika cases at least at the state level. Because our simulation algorithms simulate passenger transfer behaviors, which often include transfers to regional domestic flights after international travel, FLIRT’s simulation output includes both regional and international airports as destination results. For analysis of geographic case data against simulation results, we associated each case with whichever airport (regional or international) was closest to the known case location (based on Google Maps road distance calculated in miles). If only county level information was known, then the largest airport (regional or international) in or nearest to the county center was selected. When geographic information beyond the state level was not known, cases were associated with the state’s highest-traffic international airport. If the nearest airport to a case was within a metro area (airports grouped because they were within 60 miles of each other), the case was associated with the whole metropolitan group (e.g., JFK – LGA – EWR).

Because the selected origin airports are external to the United States, FLIRT’s Scheduled Direct Flights results include international airports almost exclusively. Therefore, for this Zika Virus distribution comparison, the case was associated with the nearest international airport, akin to looking for the case’s likely port of entry. The Zika case was then grouped with the airport to which it was geographically closest. If only state-level geographic information was known about the case, the case was associated with the largest of the two selected state airports. To assess whether the rank order of FLIRT’s predictions corresponded with the rank order of imported Zika caseloads, we computed Kendall’s τ for the same six permutations of data.
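As an illustration of the rank comparison, the sketch below computes Kendall’s τ for a small set of paired values. The τ-b variant (which adjusts for ties) is an assumption, since the text does not state which variant was used, and the seat and case counts are hypothetical.

```python
import math
from itertools import combinations

def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation between two equal-length sequences."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both variables: contributes to neither term
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denom

# Hypothetical predicted seat totals and observed case counts for four regions.
predicted_seats = [500, 300, 200, 100]
observed_cases = [10, 4, 6, 1]
tau = kendall_tau_b(predicted_seats, observed_cases)
print(round(tau, 3))
```

Here five of the six region pairs are ordered the same way by both variables and one pair is reversed, giving τ = (5 − 1) / 6.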

Generalized Linear Models

This study assumed that the rate of imported Zika cases over time in U.S. locations would be proportional to the number of flights from Zika-affected areas. We tested this with univariate Gaussian generalized linear models (GLMs) by regressing imported Zika cases against FLIRT’s estimates. We ran these models on all permutations for FLIRT prediction type (one-degree connectedness and multi-degree simulation), time period (restricted to early data and all data), and aggregation level (state and airport region). Before running GLMs, all input variables were standardized by dividing by twice the standard deviation. 15
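The standardization and model fit can be sketched as follows. A univariate Gaussian model with an identity link is equivalent to ordinary least squares, which is what this sketch fits. Scaling only the predictor and using the sample standard deviation are assumptions made for illustration, and the data are hypothetical.

```python
import statistics

def scale_2sd(values):
    """Divide each value by twice the sample standard deviation."""
    sd = statistics.stdev(values)
    return [v / (2 * sd) for v in values]

def fit_line(x, y):
    """Least-squares slope and intercept (Gaussian model, identity link)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical FLIRT estimates and imported case counts for five regions.
flirt_estimate = [120, 450, 300, 800, 50]
cases = [2, 7, 3, 12, 1]

raw_slope, raw_intercept = fit_line(flirt_estimate, cases)
std_slope, std_intercept = fit_line(scale_2sd(flirt_estimate), cases)
print(raw_slope, std_slope)
```

After scaling, the slope is the expected change in cases for a change of two standard deviations in the FLIRT estimate, while the intercept is unchanged.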

To obtain more concretely interpretable coefficients, the 100,000-passenger simulation was rescaled using actual passenger data for the source airports of interest (see supplemental information). These numbers indicate total passengers per year and were divided by two to obtain a rough estimate of outgoing passengers, assuming that: (1) layover passengers are a negligible portion of passengers; and, (2) overestimating the number of outgoing passengers would bias the effect estimates. Multiplying FLIRT’s simulated passenger estimates by the ratio of total outgoing passengers to total simulated passengers converts the estimates to passengers per year. Because the simulations were run on flight data, and matched with cases, for a period of 61 days, this was multiplied by 61 / 365 to obtain a rough estimate of the number of passengers traveling from the selected airports in the time period of observation. We divided this result by 100,000 to obtain a measure of Zika cases per 100,000 passengers from selected airports.
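The rescaling arithmetic can be followed in a short worked example. The annual passenger total and simulated arrival count are hypothetical, chosen only to show the order of operations described above.

```python
def passengers_in_window(sim_arrivals, annual_total, sim_total=100_000, days=61):
    """Convert simulated arrivals to estimated real passengers in the window."""
    # Halve the annual total (arrivals plus departures) to approximate departures.
    outgoing_per_year = annual_total / 2
    # Scale simulated passengers up to real passengers per year.
    per_year = sim_arrivals * outgoing_per_year / sim_total
    # Keep only the share of the year covered by the observation window.
    return per_year * days / 365

# Hypothetical inputs: 2,500 of 100,000 simulated passengers arrive at one
# destination, and the source airports handle 40 million passengers per year.
estimate = passengers_in_window(2_500, 40_000_000)
units_of_100k = estimate / 100_000
print(round(estimate), round(units_of_100k, 3))
```

With these inputs the destination receives an estimated 500,000 passengers per year from the selected airports, or roughly 83,562 over the 61-day window.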

Future Predictions of Imported Zika Cases

FLIRT was used to calculate scheduled direct flights (i.e., one degree of edge connectivity) and multi-degree passenger simulations using projected airline schedule data from 11 March 2016 to 06 March 2016. The results of both outputs were compared, and states and airport regions were ranked according to their relative air traffic. Global FLIRT results were analyzed to create a ranked list to assess the risk of Zika case distribution globally and where U.S. destinations fall within these ranks.