Abstract Modeling plays a major role in policy making, especially for infectious disease interventions, but such models can be complex and computationally intensive. A more systematic exploration is needed to gain a thorough understanding of the system. We present an active learning approach, based on machine learning techniques such as iterative surrogate modeling and model-guided experimentation, to systematically analyze both common and edge manifestations of complex model runs. Symbolic regression is used for nonlinear response surface modeling with automatic feature selection. First, we illustrate our approach using an individual-based model for influenza vaccination. After optimizing the parameter space, we observe an inverse relationship between vaccination coverage and cumulative attack rate, reinforced by herd immunity. Second, we demonstrate the use of surrogate modeling techniques on input-response data from a deterministic dynamic model, which was designed to explore the cost-effectiveness of varicella-zoster virus vaccination. We use symbolic regression to handle high dimensionality and correlated inputs, and to identify the most influential variables. The insights provided are used to focus research, reduce dimensionality and decrease decision uncertainty. We conclude that active learning is needed to fully understand complex systems behavior. Surrogate models can be readily explored at no computational expense, and can also be used as emulators to improve rapid policy making in various settings.

Author Summary Mathematical models are used as pragmatic tools to inform policy makers on public health interventions and many non-health problems. Considerable efforts have been made to build realistic simulation models. It is crucial to understand the system's behavior and the effect of model assumptions and parameter values on the results before drawing conclusions for policy. Common and edge manifestations of complex model runs should be analyzed, and we therefore present an active learning approach, also known as a sequential design of experiments, based on surrogate modeling and a model-guided experimentation process. First, we illustrate our approach with an individual-based model for influenza and demonstrate the benefits compared to current complex modeling practices. Second, we establish the power of our approach with a high-dimensional model with correlated inputs to explore the cost-effectiveness of varicella-zoster vaccination programs. The most influential variables are identified with the aim of reducing dimensionality and decreasing decision uncertainty. We also elaborate on the use of surrogate models as emulators to improve rapid policy making in various settings. To this purpose we provide an interactive platform through which the reader can instantaneously explore the sensitivity of the surrogate models to parameter changes for both our applications.

Citation: Willem L, Stijven S, Vladislavleva E, Broeckhove J, Beutels P, Hens N (2014) Active Learning to Understand Infectious Disease Models and Improve Policy Making. PLoS Comput Biol 10(4): e1003563. https://doi.org/10.1371/journal.pcbi.1003563 Editor: Marcel Salathé, Pennsylvania State University, United States of America Received: September 5, 2013; Accepted: February 12, 2014; Published: April 17, 2014 Copyright: © 2014 Willem et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Funding: LW is supported by an interdisciplinary PhD grant of the Special Research Fund (Bijzonder Onderzoeksfonds, BOF) of the University of Antwerp. SS is funded by the Agency for Innovation by Science and Technology in Flanders (IWT). NH acknowledges support from the University of Antwerp scientific chair in Evidence-Based Vaccinology, financed in 2009–2014 by a gift from Pfizer. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing interests: The authors have declared that no competing interests exist.

Introduction For many health care interventions, pre-introduction clinical trials are unfeasible for budget or ethical reasons, and mathematical models are used as pragmatic tools to inform policy [1]. This is particularly the case for large-scale infectious disease interventions. Simple static fixed-risk models are commonly used for health economic evaluation, and sometimes inappropriately so for infectious diseases. Dynamic models representing transmission or evolutionary dynamic systems have contributed to our understanding of biological mechanisms and the spread of infections. For instance, in view of its global public health importance, influenza has been the subject of many simulation studies [2]–[10]. The levels of computational complexity and data capacity needs vary substantially between deterministic compartmental models and stochastic individual-based models, the two most widely used types of dynamic models. Such models are developed through an iterative process of designing, coding and validating with empirical data, but few have undergone sufficient testing across a range of settings and situations to be fully validated [1]. In order to improve confidence in model-based conclusions, it is necessary to gain a thorough understanding of the system and assess how model assumptions and parameters alter the results and policy decisions [9]. Individual-based models are computationally expensive and can be too complex to fully explore and understand the system's behavior [1]. Different scenarios and parameter values may be explored to account for methodological, structural and parameter uncertainty [3]–[5], [11]. Parameter values can be drawn from a distribution or changed at random over a plausible range. Parameter uncertainty is now routinely explored for static health economic models using linear regression on a Latin hypercube design (LHD).
Unfortunately, these techniques are far less used in the context of dynamic models due to the computational complexity and lack of knowledge of some of the fundamental parameter values and their distributions [12]–[14]. Nonetheless, changes in a limited set of parameters, or in the full set, should be explored. Clearly, independent of which method is chosen, it should be transparent and justified in the context of the model [1]. To explore parameter influence, symbolic regression can be used. Symbolic regression enables nonlinear response surface modeling with automatic feature selection. It aims to capture input-response behavior with algebraic expressions without a priori assumptions about model structure [15], [16]. Many variants of this method exist [17]–[19], but here we apply Pareto-aware symbolic regression (SR), which uses multiple selection objectives [15], [16], [20]. The algebraic expressions are surrogate models for the original computationally intensive simulation model. The model responses can be instantaneously predicted for a set of inputs using the algebraic expressions. These expressions provide information on the relationships between inputs and responses. Ensemble-based SR uses a collection of surrogate models as a final solution, and the accordance of the models defines a measure of the prediction uncertainty. Input conditions that are hard to predict indicate that more simulation samples are needed from the corresponding input region. The goal of this paper is to present an iterative modeling approach with a model-guided experimentation process to systematically analyze both common and edge manifestations of model runs. Figure 1 presents the methodology we recommend to explore simulation models. First, inputs are sampled using a maximin LHD, in which the minimum distance between points is maximized. Second, each point of the input space is used to initialize the simulation model.
Next, the input–response data from the simulations are modeled with SR to create surrogate models which can be used for response predictions, feature selection and to identify conditions with large prediction uncertainty. Finally, these insights are used to enhance the experimental design of subsequent simulations by adapting the sampling strategy or reducing dimensionality.
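The four-step cycle above can be sketched as a small, self-contained loop. The sketch below is purely illustrative: `simulate` and `fit_surrogate` are toy stand-ins (a quadratic test function and a nearest-neighbour predictor) for a real simulator such as FluTE and for symbolic regression, respectively; only the Latin hypercube sampler reflects the sampling scheme described in the Methods.

```python
import random

def latin_hypercube(n_samples, n_dims, rng=None):
    """Basic LHD on [0,1]^d: each of n_samples strata per dimension is used exactly once."""
    rng = rng or random.Random(42)
    design = []
    for _ in range(n_dims):
        # one random point inside each of the n_samples equal-width intervals
        column = [(i + rng.random()) / n_samples for i in range(n_samples)]
        rng.shuffle(column)
        design.append(column)
    # transpose so each row is one input configuration
    return [list(row) for row in zip(*design)]

def simulate(x):
    """Toy stand-in for one expensive simulation run."""
    return x[0] ** 2 + 0.5 * x[1]

def fit_surrogate(xs, ys):
    """Toy stand-in for symbolic regression: a nearest-neighbour predictor."""
    def predict(x):
        _, y = min(zip(xs, ys),
                   key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))
        return y
    return predict

# iterative cycle: design -> simulate -> surrogate -> extended design
xs = latin_hypercube(10, 2)
ys = [simulate(x) for x in xs]
for cycle in range(3):
    surrogate = fit_surrogate(xs, ys)
    new_xs = latin_hypercube(5, 2, random.Random(cycle))  # refined design
    xs += new_xs
    ys += [simulate(x) for x in new_xs]
```

Each cycle fits a surrogate on all data gathered so far and then extends the design; in the real workflow the new points would target the input regions where the surrogate ensemble disagrees most.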


Figure 1. Iterative active learning approach with a simulation model. (1) A Latin hypercube design is used to make configuration files. (2) These configurations are used for the simulation model. (3) All input-response data are modeled with SR. (4) The surrogate models obtained with SR are used to achieve system understanding. The response prediction uncertainty can be used to adapt the experimental design (1) for the following modeling cycle. https://doi.org/10.1371/journal.pcbi.1003563.g001 While the general goal is to use a combination of previously described methods from machine learning, known as sequential experimental design [21]–[23] or active learning [24]–[26], we want to emphasize that our approach with SR is an improvement in the field of infectious disease modeling. An iterative modeling approach from [13] was based on step-wise linear regression to estimate response hypersurfaces and was limited to polynomials of the third order. Okais et al [14] presented a framework to perform a preliminary sensitivity analysis considering logic and scientific relevance before conducting multivariate sensitivity analysis. However, consistent reproducible methods for the latter were not described. Longini et al [2] calculated confidence intervals for disease burden estimates based on prior distributions instead of fixed parameter values, but they did not describe variable importance or system exploration analyses. Van Hoek et al [27] performed variable importance analyses without feedback to the experimental design. We apply our methodology to the open-source model FluTE [8], which is a stochastic individual-based model for pandemic influenza. We illustrate a stepwise system exploration and sensitivity analysis of a computationally expensive simulation model. The model simulates a population with realistic social contact networks and transmission based on the natural history of influenza.
Several parameters can be varied to modify the spread of influenza, with the basic reproduction number R0 typically described as the most essential. R0 is defined as the expected number of secondary infections caused by a primary infection in a fully susceptible population. For the purpose of our study, we focus on two main outcomes of this model: the clinical attack rate and the day at which the influenza epidemic peaks. High dimensionality and correlated inputs cause problems in exploring system behavior, which can be resolved through iterative modeling with SR. The approach we present is relevant for many public health problems, and we illustrate this through an example of a previously published dynamic model-based economic evaluation of varicella-zoster virus (VZV) vaccination [27]. Varicella (chickenpox) is a typical childhood infection caused by VZV; after recovery from chickenpox, the virus may reactivate later in life to cause herpes zoster (shingles). The probability of experiencing herpes zoster increases with time since the primary VZV infection, but is reduced by natural re-exposure to VZV (e.g., parents are typically re-exposed when their child has chickenpox). Although infant VZV vaccination was shown to dramatically reduce chickenpox morbidity and mortality, there are lingering concerns about its adverse impact on shingles, as it reduces VZV re-exposure opportunities [28]. Many modeling and economic studies have aimed to tackle this problem (reviewed in [29]–[31]), but the van Hoek et al model [32] is of special interest because it includes empirical observations on social mixing patterns and combined childhood and adult vaccination strategies. This age-structured dynamic transmission model was used to perform a cost-effectiveness analysis for the United Kingdom [27] with 185 input parameters, including 100 correlated transmission rates, to calculate the incremental gain in Quality Adjusted Life Years (QALY) and costs.
Parameter uncertainty was incorporated and conclusions were based on the results of 1000 runs. We analyzed this input-response data with SR to identify driving parameters and compared our findings with a linear regression analysis [33]. Reducing the dimensionality may improve uncertainty analysis since the most influential parameters can be sampled more intensively.

Methods The method section follows the approach presented in Figure 1. We provide a guided step-by-step example with a basic simulation model as supplementary information (Protocol S1). Design of experiments We used space-filling Latin hypercube designs (LHD) to create parameter sets for the FluTE simulations. In the general case, a sample value from the first interval of the first input parameter is matched at random with sample values from intervals chosen for the other input parameters [13]. Then the second interval of the first input parameter is matched at random with sample values from previously unused intervals of the other features. Each interval of every input parameter is sampled once and only once. LHD has the advantage that the number of samples is independent of the number of dimensions of the input space, and can be determined based on the computational budget, the input dimensions and the complexity of the simulation. Computing a space-filling LHD can be an onerous task, and we therefore used the maximin designs of spacefillingdesigns.nl [34], [35]. Because our designs have a rather limited number of sample points, we extended the designs using the Intersite-projected distance method of the Sequential Experimental Design (SED) toolbox [36]–[38]. Simulation model Influenza model. We made use of an open-source individual-based model for influenza epidemics written in C++, called FluTE [8]. All individuals in the model are members of different social mixing groups. Influenza transmission within each group is based on random mixing. The geographical distribution, employment rates and commuting behavior of the population are based on the 2000 Census data for Seattle (500 000 people) and Los Angeles County (11 million people), distributed together with the source code of the model. The simulation runs in 12-hour time steps, representing daytime (work, school and community contacts) and nighttime (home and community contacts).
Contact probabilities were tuned such that the final age-specific clinical attack rates were similar to past influenza pandemics and observed household attack rates. The model can simulate several intervention strategies based on changes in susceptibility and infectivity due to vaccination or antivirals, and on changes in contact probabilities between individuals due to social distancing measures. VZV model. The economic evaluation of VZV vaccination was based on a deterministic dynamic compartmental model [27], [32] with 185 inputs, including 100 correlated transmission rates between 10 age groups. Underlying contact rates were estimated from a survey of social mixing patterns, and uncertainty was specified by bootstrapping the original sample. The model was adapted and calibrated to data from the UK. Source code was not available, but we made use of a dataset with 1000 runs, previously subjected to linear regression analysis [33]. Surrogate modeling with symbolic regression Symbolic regression (SR) captures input-response behavior by efficiently exploring hundreds of thousands of algebraic expressions of the input variables [15], [16]. Aside from choosing the modeling primitives, no assumptions or restrictions are made on the model structure, and genetic programming is used to optimize the search process. SR is a biologically inspired method that mimics Darwinian evolution by applying genetic variation and natural selection to the modeling ensemble [39]. We used the SR implementation from the DataModeler package in Mathematica [40]. The result of an SR run is an ensemble of tree-based regression models that give a good approximation of the response variable. The algorithm consists of the following steps [41]: Model initialization. In the first step of the algorithm, a population of SR models is generated randomly and the algebraic expressions of the models are represented by parse trees.
Every model is a potential solution that explains the response behavior using a subset of the input variables. The parse tree of every model consists of terminals and primitive functions. A terminal is either an input variable or a constant. We used a set of primitive functions in which summation and multiplication have arbitrary arity. These functions can be adjusted according to the problem domain. Model evaluation. The model fitness is determined by minimizing two objective functions: the prediction error 1 − R², with R being Pearson's correlation coefficient of the observed and predicted responses, and the model complexity. We define model complexity as the sum of the number of nodes in all possible subtrees of a given tree, which is equivalent to the visitation length, i.e. the total number of links traversed starting from the root node to each of the terminal nodes of the parse tree. The complexity objective is used to avoid excessive growth of the model expressions. Because of the complexity objective, the presence of a variable in a sufficiently evolved population indicates that the variable is relevant for describing the response [20]. Model archiving and elitism. After the evaluation of all models, a fixed-size archive of the best-achieving models is maintained. This is an elitism strategy that ensures that the best-achieving models are never lost after recombination. The archive is populated by a selection of the least-dominated models from both the population and the archive of the previous generation. A model dominates another model if it performs better on at least one objective and does not score worse on any of the other objectives. A model is said to be Pareto optimal if no other model dominates it. This way, we can define the Pareto front of a model set. Model evolution. A new population of SR models is generated at every step of the algorithm. New models are created with genetic operators like crossover and mutation.
Crossover is the process of combining parent models into child models by using subtrees of both parents (Figure 2A). Mutation of a model introduces random alterations in its expression tree (Figure 2B). To select which models will generate offspring, a Pareto tournament is held. Models that compete for offspring in the tournament are selected from both the current population and the archive. To generate the offspring, a fixed fraction of the children are created by crossing over two parents obtained from the Pareto tournament; the remaining children are generated using the mutation operator. Every 10 generations the population is re-initialized with random models to ensure diversity in the population and to counteract inbreeding.
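The dominance and archiving logic described above can be sketched in a few lines, using plain (error, complexity) pairs as stand-ins for SR parse-tree models; the numbers are invented for illustration.

```python
def dominates(a, b):
    """a dominates b: no worse on every objective, strictly better on at least one.
    Objectives are (prediction error, complexity), both minimized."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(models):
    """Keep every model that no other model dominates."""
    return [m for m in models
            if not any(dominates(o, m) for o in models if o is not m)]

# invented (error, complexity) pairs for five candidate models
candidates = [(0.10, 40), (0.12, 25), (0.05, 90), (0.12, 60), (0.30, 10)]
archive = pareto_front(candidates)
# (0.12, 60) is dominated by (0.12, 25); the other four models survive
```

In the actual algorithm the archive additionally has a fixed size; this sketch simply keeps the whole Pareto front.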


Figure 2. Recombination and modification of algebraic expressions. (A) One-point crossover example in symbolic regression. Two individuals swap subtrees, resulting in two new expressions. (B) One-point mutation example in symbolic regression. The operator minus is replaced by another operator of the same arity. https://doi.org/10.1371/journal.pcbi.1003563.g002 This evolutionary process is repeated over many generations. A maximum number of generations, a time budget or a model accuracy threshold can be used as criteria to stop the process. For this paper we used time budgets based on the size and dimensionality of the data sets. Timings are listed in Table 1 and an example with different time budgets for RUN 3 is described in Text S1.
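To make the operators of Figure 2 and the visitation-length complexity measure concrete, here is a toy implementation on parse trees encoded as nested tuples, e.g. ('+', 'x1', ('*', 'x2', 'x3')) for x1 + x2*x3. The encoding is ours for illustration and differs from DataModeler's internals.

```python
import random

def subtrees(tree, path=()):
    """Enumerate (path, subtree) pairs; a path indexes into nested tuples."""
    yield path, tree
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtrees(child, path + (i,))

def replace_at(tree, path, new):
    """Return a copy of tree with the subtree at path replaced by new."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace_at(tree[i], path[1:], new),) + tree[i + 1:]

def size(tree):
    """Number of nodes in a parse tree."""
    if not isinstance(tree, tuple):
        return 1
    return 1 + sum(size(c) for c in tree[1:])

def visitation_length(tree):
    """Complexity as defined in the Methods: sum of node counts over all subtrees."""
    return sum(size(sub) for _, sub in subtrees(tree))

def crossover(parent_a, parent_b, rng):
    """Replace a random subtree of parent_a with a random subtree of parent_b."""
    path_a, _ = rng.choice(list(subtrees(parent_a)))
    _, sub_b = rng.choice(list(subtrees(parent_b)))
    return replace_at(parent_a, path_a, sub_b)

def mutate(tree, rng, ops=('+', '-', '*')):
    """Swap the operator at a random internal node for another of the same arity."""
    internal = [(p, t) for p, t in subtrees(tree) if isinstance(t, tuple)]
    if not internal:
        return tree
    path, node = rng.choice(internal)
    return replace_at(tree, path, (rng.choice(ops),) + node[1:])

rng = random.Random(0)
parent = ('+', 'x1', ('*', 'x2', 'x3'))  # x1 + x2*x3
child = crossover(parent, ('-', 'x4', 'x5'), rng)
```

Because complexity is one of the selection objectives, children that receive large grafted subtrees must compensate with accuracy to survive archiving.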


Table 1. Symbolic regression settings. https://doi.org/10.1371/journal.pcbi.1003563.t001 System understanding Conclusions are based on a model selection from the knee of the Pareto front, and we perform nonlinear optimization of the constants within these models. A model ensemble of high quality and minimal complexity, obtained through an effective SR algorithm, can facilitate system understanding and focus the research. Variable presence in the final ensemble (taken over several independent SR runs) provides a robust indication of the importance of input variables. Only inputs significantly related to the response can survive harsh evolutionary pressure and reach the final ensembles [20]. Besides variable importance, final ensembles also provide dimensionality trade-offs between complexity and accuracy of models. Another example of system understanding is the automatic generation of hypotheses for meta-variables, low-order transformations of driver inputs, which can potentially linearize the final models and enable further application of powerful linear and regularized linear learning. Sensitivity analysis of the constructed ensembles is a key element of facilitated system understanding. The prediction divergence of the model ensemble is a measure of the prediction uncertainty. Conditions that are hard to predict might be missing from the design.
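The two ensemble diagnostics of this section, variable presence and prediction divergence, can be sketched as follows. The three hand-written models below are hypothetical stand-ins for a final SR ensemble.

```python
import statistics

# hypothetical final ensemble: (expression label, predictor, variables used)
ensemble = [
    ("0.9*x1 + x2", lambda x: 0.9 * x["x1"] + x["x2"], {"x1", "x2"}),
    ("x1*x2",       lambda x: x["x1"] * x["x2"],       {"x1", "x2"}),
    ("x1 + 0.1*x3", lambda x: x["x1"] + 0.1 * x["x3"], {"x1", "x3"}),
]

def variable_presence(ensemble, variables):
    """Fraction of ensemble models in which each input variable appears."""
    return {v: sum(v in used for _, _, used in ensemble) / len(ensemble)
            for v in variables}

def prediction_divergence(ensemble, x):
    """Spread (standard deviation) of the ensemble predictions at input x;
    large values flag input regions that need more simulation samples."""
    preds = [model(x) for _, model, _ in ensemble]
    return statistics.stdev(preds)

presence = variable_presence(ensemble, ["x1", "x2", "x3"])
# x1 appears in every model, so it is flagged as the dominant driver
```

A variable present in (almost) every ensemble model is flagged as a driver of the response, while inputs where the predictions diverge mark regions that the next design iteration should sample more densely.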

Discussion We present an iterative process of active learning with SR for the systematic exploration of simulation models. Our initial experimental setup for the pandemic influenza model showed only a subset of the system's behavior but provided insights leading to an improved design. We explored common and edge manifestations and ended up with an ensemble of surrogate models for the complex simulation model. The surrogate models for reactive immunization strategies revealed effects like herd immunity and can be useful to instantly evaluate reactive strategies for specific R0 values based on plausible estimates for vaccination coverage and efficacy. Although we demonstrated these methods on two vaccination programs, based on two distinctly different dynamic models (stochastic individual-based and deterministic compartmental), we are certain that these methods are relevant to address a wide range of public health problems that are informed by modeling. Surrogate modeling with SR identified the most important variables. A decrease in the uncertainty of these parameters would improve the robustness of the simulation results. Feature selection can also be useful during the development of the simulator. For example, the travel parameter does not seem important in the current FluTE implementation, although other studies have stressed the role of travel restrictions on epidemics [5], [6], [44], [47]. A revision of the travel implementation may be required. Considerable efforts have been made to build realistic simulation models of high quality, but most of these are not fully explored. Ideally, each model should be analyzed systematically to understand system behavior and to assess the impact of model assumptions and parameters on the results. The availability of simple surrogate models based on complex simulation models not only serves to understand the original complex model better, but also to emulate it.
Policy makers can easily use an interactive interface, such as the ones we present in this paper, to mimic the context in which their decisions take place (e.g., transfer model outcomes between broadly comparable countries) and predict the effectiveness or cost-effectiveness of health interventions. In that sense, the use of surrogate models as emulators provides a great opportunity to enhance both the understanding of these models and the reliability and speed of policy making based on existing elaborate model structures. Specifically, for FluTE we could instantly formulate some insights from the emulator (Figure 6) in clear language for policy makers. First, we predict that without reactive measures 36% of the population will be infected. Second, only a few imported cases are enough to start the epidemic, hence (complete) isolation may delay (prevent) the epidemic. Third, 30% vaccination coverage (percentage of the population vaccinated) may result in a 55% reduction in the number of cases, and 60% coverage in a 95% reduction, due to indirect protection from the interruption of transmission pathways in a partially immune population. In the future, we aim to automate the iterative surrogate modeling approach in order to speed up the process and make it more accessible. The number of realizations should be analyzed in more detail. We already acquired substantial insights into the transmission and vaccination dynamics implemented in the FluTE model with five iterations. However, this system exploration could be further expanded, for instance by considering other interventions (e.g. social distancing) separately or in combination with vaccination. While the presented epidemiological results were acquired using the previous generation of simulators, we argue that our approach is applicable to all simulators and should be used for testing and validation when new simulators are developed, and thereafter for emulation to aid policy making across settings.

Acknowledgments We would like to thank Dr Albert-Jan van Hoek for making the VZV dataset available and Dr Dennis Chao for making the Flute model available. We also thank the audience of the April 2012 SIMID workshop for constructive feedback on our presentation “Symbolic regression for modeling epidemiological systems” (www.simid.be).

Author Contributions Conceived and designed the experiments: LW SS EV. Performed the experiments: LW SS. Analyzed the data: LW SS EV. Contributed reagents/materials/analysis tools: EV JB PB NH. Wrote the paper: LW SS EV JB PB NH.