
We consider the problem of inferring constraints on a high-dimensional parameter space with a computationally expensive likelihood function. We propose a machine learning algorithm that maps out the Frequentist confidence limit on parameter space by intelligently targeting likelihood evaluations so as to quickly and accurately characterize the likelihood surface in both low- and high-likelihood regions. We compare our algorithm to Bayesian credible limits derived by the well-tested Markov Chain Monte Carlo (MCMC) algorithm using both multi-modal toy likelihood functions and the seven yr Wilkinson Microwave Anisotropy Probe cosmic microwave background likelihood function. We find that our algorithm correctly identifies the location, general size, and general shape of high-likelihood regions in parameter space while being more robust against multi-modality than MCMC.


1. INTRODUCTION

We consider the following problem. A researcher wants to constrain a theory described by some number N_p of tunable parameters, which we will represent by the N_p-dimensional vector θ. The available data measure some function f which is an implicit function of θ. The researcher is able to calculate a χ2 value quantifying the fit of a model to the data and wishes to find a confidence limit defined by some rule, e.g., χ2(θ) ≤ χ2_lim. This is a general phrasing of a Frequentist confidence limit as discussed in the Appendix. We discuss alternatives for determining χ2_lim in Section 2.1 below. We will also consider the case that the researcher wishes to find a Bayesian credible limit to some probability (1 − α)%.

If the calculation going from θ to χ2 is rapid, this problem is trivial. One generates a fine grid of points in the N_p-dimensional parameter space, evaluates χ2 at each of these points, and defines the confidence limit as only those points which satisfy the rule. Often, however, the calculation is not rapid, or N_p is large enough that such a grid-based method will take a prohibitively long time. In that case, one must make intelligent choices regarding which points in parameter space are actually interesting enough to warrant evaluation.

Markov Chain Monte Carlo (MCMC) methods find the desired limits by drawing random samples from the parameter space according to a probability density (the Bayesian posterior) which can be interpreted as the probability that some value of θ represents the true value realized by the physical process underlying the data. With enough of these samples, one can integrate this probability density and find regions which contain some set amount of the total probability. In this case, "credible limits" are phrased as target values (1 − α)% for the probability integral, i.e., 68% for "1σ" limits, 95% for "2σ" limits, and so on (Gilks et al. 1996).

We propose an alternative solution, using methods from machine learning (to wit, Gaussian processes, though the algorithm admits many possible drivers; see Section 3) to find the desired confidence limit by directly searching the parameter space for points which lie on the boundary χ2 = χ2_lim. We refer to this algorithm as Active Parameter Searching (APS). We present the algorithm in detail in Sections 2 and 3. As it searches, APS uses the knowledge it has already gained about χ2 to improve its search according to a metric that rewards the algorithm both for finding points on the confidence limit and for sampling points from previously unexplored regions of parameter space. We find that this behavior makes APS more robust against multi-modal χ2 functions (see Section 4) without sacrificing the speed of convergence associated with traditional MCMC methods (see Section 5). We show that, while APS is designed specifically to yield Frequentist confidence limits (which we define in the Appendix), it is also capable of roughly approximating Bayesian credible limits (see Sections 2.4, 4, and 5).

Readers should be aware that, while MCMC can be analytically shown to asymptotically sample the Bayesian posterior, APS is designed to discover Frequentist confidence limits purely through exploration, i.e., there is no a priori motivation behind the APS algorithm. Our only demonstration of APS's efficacy is empirical.
It has done well on the problems we have tested it on, though we have tried to choose those problems to be representative of problems likely to be encountered by the community. Readers who require theoretically motivated results will likely need to resort to a sampling routine like MCMC to derive their final constraints. In the case of multi-modal likelihood functions, the MultiNest algorithm (Feroz & Hobson 2008; Feroz et al. 2009) has been shown to be effective. We hope to show in Section 4 that, even in these use cases, APS can provide a valuable service by identifying the general regions of parameter space over which the Bayesian posterior needs to be integrated, i.e., by guiding users in the process of setting priors and proposal densities for a sampling algorithm, a task which has proven difficult for MCMC in the case of multi-modal likelihood functions.

2. THE APS ALGORITHM

The APS algorithm was originally presented by Bryan (2007) as follows.

1. (1A) Generate some initial number N_s of randomly distributed points in parameter space. Find the values of χ2 corresponding to these points.

2. (2A) Generate some number N_c of candidate points. For each of these points, find the N_g nearest points already sampled by APS (we use a k-d tree as described in Bentley 1975). Use these nearest neighbor points and their corresponding χ2 values to predict the value of χ2 at each of the candidate points. We will signify this prediction by μ. Calculate some value σ quantifying confidence in μ as a prediction of χ2. Low values of σ should correspond to confident predictions.

3. (3A) Choose the candidate point that maximizes the statistic

S = σ − |μ − χ2_lim|,    (1)

and evaluate χ2 at that point. Add this point and its χ2 value to the list of sampled points begun in (1A). Return to (2A) and repeat until convergence.

The confidence limit reported by APS is the list of all points found for which χ2 ≤ χ2_lim. These points, when plotted in one- or two-dimensional slices of parameter space, ought to sketch out a region which contains the true value of θ with the desired confidence (as defined in the Appendix). This is in contrast to the credible limits of MCMC, which represent an integral over the sampled points containing some fraction of the total posterior probability. We present an alternative interpretation for yielding credible limits from APS in Section 2.4. The −|μ − χ2_lim| term in Equation (1) provides an incentive for APS to choose points that it believes will lie on the boundary of the confidence limit. The σ term provides an incentive for APS to explore unknown regions of parameter space, making the algorithm robust against multi-modal χ2 functions. In the present work, we introduce modifications and expansions to this basic algorithm as described below.

2.1. Determining χ2_lim

As discussed above, APS draws confidence limits by mapping iso-χ2 contours of χ2 in parameter space. Thus, the user is presented with the problem of determining an appropriate χ2_lim. APS admits two possible solutions. The user can fiat a definite value of χ2_lim based on the characteristics of the data at hand, or χ2_lim can be set adaptively, based on APS's knowledge of χ2.

The first possibility, a user-defined χ2_lim, is based on the theory of Frequentist confidence intervals, which we briefly discuss here and consider in much greater detail in the Appendix. If a data set is comprised of N_d Gaussian-distributed data points with a known covariance structure, then χ2 will be distributed according to a χ2-distribution with N_d degrees of freedom. This distribution can be integrated to determine the value of χ2 containing (1 − α)% of the total probability. This limiting value will be the user-defined χ2_lim.

If the user does not feel comfortable setting χ2_lim a priori, APS can learn χ2_lim by defining it as χ2_lim = χ2_min + Δχ2. Here, χ2_min is the minimum value of χ2 discovered by APS. Δχ2 is set based on the properties of the N_p-dimensional parameter space being explored. To wit, Wilks (1938) proves that Δχ2 for an N_p-dimensional parameter space will be distributed according to a χ2-distribution with N_p degrees of freedom. One can perform the same integral discussed above and set Δχ2 to that value which encloses the desired confidence limit. This is the well-known Likelihood Ratio Test (Neyman & Pearson 1933). In this case, the user is relying on APS to eventually find the true χ2_min of the likelihood surface being explored. Section 2.2.3 below outlines an additional means of search by which our expanded APS is capable of efficiently locating χ2_min.

2.2. Modified Search Algorithms

Bryan et al. (2007) show that the APS search algorithm as presented in steps (1A–3A) is robust against multi-modal likelihood functions, using it to identify a disjoint region of high likelihood in cosmological parameter space as constrained by the one yr Wilkinson Microwave Anisotropy Probe (WMAP) cosmic microwave background (CMB) anisotropy power spectrum. This is the principal advantage to selecting points for evaluation according to the statistic (1).
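To make the selection rule concrete, the following minimal sketch illustrates one pass of steps (2A)–(3A). It is not the packaged implementation: the predict routine (standing in for the Gaussian process of Section 3), evaluate_chisq, and all names and types here are illustrative assumptions.

```cpp
#include <cmath>
#include <random>
#include <vector>

// Illustrative types: a Gaussian-process prediction at a candidate point.
struct Prediction { double mu, sigma; };

// Assumed provided: predicts chi^2 at q from the N_g nearest sampled
// points (Section 3), and the expensive likelihood evaluation itself.
Prediction predict(const std::vector<double>& q);
double evaluate_chisq(const std::vector<double>& q);

// One iteration of steps (2A)-(3A): propose n_candidates random points,
// score each with S = sigma - |mu - chisq_lim| (Equation (1)), and spend
// the single expensive chi^2 call on the argmax.
std::vector<double> aps_step(int n_candidates, int dim, double chisq_lim,
                             const std::vector<double>& lo,
                             const std::vector<double>& hi,
                             std::mt19937& rng) {
    std::vector<double> best;
    double best_S = -1.0e300;
    for (int c = 0; c < n_candidates; ++c) {
        std::vector<double> q(dim);
        for (int i = 0; i < dim; ++i) {
            std::uniform_real_distribution<double> u(lo[i], hi[i]);
            q[i] = u(rng);
        }
        Prediction p = predict(q);
        double S = p.sigma - std::fabs(p.mu - chisq_lim);  // Equation (1)
        if (S > best_S) { best_S = S; best = q; }
    }
    evaluate_chisq(best);  // step (3A)
    return best;
}
```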
Unfortunately, it is possible that the focus on exploring unknown regions of parameter space might slow down convergence in the case of uni-modal likelihood functions. Therefore, we augment steps (1A–3A) with the following modified search algorithms to ensure that APS adequately characterizes known regions of high likelihood while simultaneously searching for new regions to explore. These modified searches will require APS to keep track of information it learns about the function as it searches. To facilitate this discussion, we introduce the following notation: θ denotes a point in parameter space; named sets (e.g., the set of known local minima) denote sets of points in parameter space; variables denoted {X} are sets of numbers stored by APS. We are greatly indebted to Eric Linder of the Lawrence Berkeley National Laboratory, who first suggested to us the modifications described below in Sections 2.2.1 and 2.2.3.

2.2.1. Characterization of Known Low-χ2 Regions

As it searches, APS will keep track of a list of all of the discovered points which are believed to be local minima of χ2 (how it finds these will be described below in Section 2.2.3). After each cycle of the main algorithm (steps 1A–3A above), our version of APS attempts "focused searches" about the points in this list. For each local minimum θ_min, APS performs the following (a sketch of the bisection in step (2B) follows at the end of this subsection).

1. (1B) If fewer than N_p focused searches have been proposed about θ_min, randomly choose a point from a small sphere surrounding θ_min in parameter space and evaluate χ2 at that point.

2. (2B) If the point does not satisfy χ2 ≤ χ2_lim (which is highly probable), use bisection along the line connecting the proposed point to θ_min to find a point which does satisfy χ2 ≤ χ2_lim, i.e., find a point on that line for which χ2 ≤ χ2_lim—often θ_min itself—and a point along that line for which χ2 > χ2_lim, and iteratively step half the distance between them until arriving at the desired χ2 = χ2_lim (James 1972). The points discovered in this way with χ2 ≤ χ2_lim are stored in the set of "boundary points" about θ_min (each local minimum will have its own set of boundary points).

3. (3B) If N_p boundary points have been found for θ_min, skip steps (1B) and (2B). Instead, examine the boundary points and identify the points corresponding to both the maximum and minimum values of each parameter θ_i. These points will be referred to as extremal boundary points. There will be 2 × N_p of them, one for the maximum and one for the minimum in each of our N_p parameters.

4. (4B) For each extremal boundary point, propose some number N_v of new points (we have chosen N_v = 20 for the sake of speed), each displaced from the extremal boundary point by a random, small vector in parameter space constrained to be perpendicular to the line connecting that point to θ_min. There will now be 2 × N_p × N_v proposed points (N_v such points for each of the 2 × N_p extremal boundary points).

5. (5B) At each point proposed above, find the N_g nearest neighbor points as in step (2A) above and evaluate Equation (1). Find the point which maximizes S and evaluate χ2 at that point. If the new point does not lie on the χ2 = χ2_lim boundary, use bisection to find a new boundary point. Store the resulting boundary point in the set of boundary points about θ_min.

The modification described in this subsection is designed to explore the χ2 ≤ χ2_lim surface by starting from its extreme corners and stepping perpendicularly to its "radii" (the vector connecting θ_min to an extremal boundary point would be a radius if the boundary described a sphere), with the hope of being able to extend the known bounds of the confidence limit. In this way, our extended APS attempts to plumb the full Frequentist confidence limit as rapidly as possible. However, we do not want this to come at the expense of the wide-ranging exploration of steps (1A–3A). Our version of APS alternates between performing steps (1A–3A) and steps (1B–5B). If there are N_m distinct known local minima, our algorithm will perform N_m iterations of steps (1A–3A), and then perform steps (1B–5B) once for each known local minimum.
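The bisection of step (2B) can be sketched as follows. This is an illustrative reimplementation rather than the packaged code; evaluate_chisq stands in for the user's χ2 function, and the fixed iteration count is a placeholder for a proper tolerance test.

```cpp
#include <vector>

double evaluate_chisq(const std::vector<double>& q);  // assumed, as above

// Bisect between a point known to satisfy chi^2 <= chisq_lim ("inside")
// and a point that does not ("outside"), converging on the
// chi^2 = chisq_lim boundary (step (2B); cf. James 1972).
std::vector<double> bisect_to_boundary(std::vector<double> inside,
                                       std::vector<double> outside,
                                       double chisq_lim, int max_iter = 40) {
    std::vector<double> mid(inside.size());
    for (int it = 0; it < max_iter; ++it) {
        for (size_t i = 0; i < mid.size(); ++i)
            mid[i] = 0.5 * (inside[i] + outside[i]);
        if (evaluate_chisq(mid) <= chisq_lim)
            inside = mid;   // midpoint inside: move the inner bracket out
        else
            outside = mid;  // midpoint outside: move the outer bracket in
    }
    return inside;  // last known point satisfying chi^2 <= chisq_lim
}
```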

2.2.2. Extension of the χ2_lim Contour

It is, however, also possible that the S-maximization in steps (1A–3A) could learn something about the shape of the χ2 = χ2_lim contour, even if it does not place points directly on that contour. To explore that possibility, our extended APS performs the following modified search after each iteration of steps (1A–3A).

1. (1C) For each local minimum θ_min, store, in addition to the boundary points, the projection of those boundary points onto a unit sphere in parameter space, i.e., for each boundary point, draw a line connecting it with θ_min and keep track of where a unit sphere centered on θ_min intersects that line. Call this set of projections the projected boundary.

2. (2C) Call the point for which χ2 has been evaluated in step (3A) θ_new. Find the local minimum θ_min that is nearest to θ_new in parameter space.

3. (3C) Find the projection of θ_new onto the unit sphere surrounding that θ_min. Call this projected point p_new. Find the point in the projected boundary that is nearest to p_new. Record the parameter space distance between p_new and that point. Add this distance to the set {D} ({D} will be a running list of this distance for all points θ_new found by steps 1A–3A, regardless of which local minimum they are nearest).

4. (4C) If the distance is greater than the 2/3 quantile of {D}, then perform bisection between θ_min and the point evaluated in step (3A). Add the resulting χ2 = χ2_lim point to the set of boundary points about θ_min.

Step (4C) ensures that, if the point θ_new is in a novel direction from θ_min, APS will find a point on the contour in that direction. Because θ_new was chosen by maximizing Equation (1), we assume that there was something "interesting" about that direction from θ_min (perhaps a sharp corner or distended extension of the contour). In this way, our extended APS will attempt to cover the surface of the confidence limit in the most informative way possible. Because we expect the unit spheres surrounding the local minima to become more densely populated as the algorithm runs, we periodically delete the list {D} so that new directions away from the minima can still be explored.

2.2.3. Function Minimization

The penultimate modification to the original APS algorithm that we propose in this work is to utilize function minimization to identify local minima in χ2. One of the principal drawbacks of the algorithm presented in steps (1A–3A) is that it makes no account of "near misses." If step (3A) finds that χ2 is lower than average but still greater than χ2_lim, the original APS algorithm does nothing with that information. We have tried to improve upon that with our incorporation of bisection in Sections 2.2.1 and 2.2.2 above. An additional innovation is to allow APS to use points with χ2 below the threshold χ2_min + G(χ2_{1/10} − χ2_min) (Equations (3) and (4)) to seed a simplex function minimizer as developed by Nelder & Mead (1965). Here, G is a user-set parameter, χ2_min is the minimum discovered χ2 value, and χ2_{1/10} is the 1/10 quantile of the χ2 values discovered by steps (1A–3A) (not those discovered by any of the modifications discussed in Sections 2.2.1, 2.2.2, or 2.2.3). Because the Nelder-Mead simplex minimizer can be very time intensive, we balance our algorithm by keeping track of how many times χ2 is called by the different searches and demanding that the number of χ2 evaluations devoted to all other variants of search (i.e., steps (1A–3A), (1B–5B), and (1C–4C)) equal the number of χ2 evaluations devoted to the simplex minimizer before a new simplex minimization can occur.

If the simplex minimizer converges to a point of sufficiently low χ2, our extended APS determines whether or not it has found a new local minimum to be added to the list of known minima by comparing the end point of the minimization to each of the points already in that list. For each known minimum, APS calculates the mid-point of the line connecting it to the simplex end point. If, for all known minima, χ2 at the mid-point is greater than at the end points it connects (Equation (5)), i.e., if a ridge of high χ2 separates the simplex end point from every known minimum, then the end point is deemed a new local minimum and is added to the list. If not, the end point is assumed to be a part of one of the previously identified loci of low χ2 and it is not added. Regardless of whether or not it passes the test in Equation (5), the end point is added to the set of all simplex end points discovered by APS. As a further guard against excessive simplex searching, all points that pass the test in Equation (3) are further required to satisfy a separation condition from previously discovered simplex end points (Equation (6)) before they are allowed to seed a new simplex search. In this way, our extended APS avoids seeding simplex searches with points that are inside of already known regions of low χ2.
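The mid-point test just described can be sketched as follows. This is an illustrative reading of the test (in particular, the comparison value at the mid-point is our assumption), with evaluate_chisq assumed as before.

```cpp
#include <algorithm>
#include <vector>

double evaluate_chisq(const std::vector<double>& q);  // assumed, as above

// Decide whether a converged simplex end point is a genuinely new local
// minimum: for each known minimum, evaluate chi^2 at the mid-point of the
// connecting line. The candidate is accepted as new only if every
// mid-point sits higher than both end points, i.e., a ridge of high chi^2
// separates it from every known minimum.
bool is_new_minimum(const std::vector<double>& candidate,
                    double candidate_chisq,
                    const std::vector<std::vector<double>>& known_minima,
                    const std::vector<double>& known_chisq) {
    for (size_t j = 0; j < known_minima.size(); ++j) {
        std::vector<double> mid(candidate.size());
        for (size_t i = 0; i < mid.size(); ++i)
            mid[i] = 0.5 * (candidate[i] + known_minima[j][i]);
        const double barrier = evaluate_chisq(mid);
        if (barrier <= std::max(candidate_chisq, known_chisq[j]))
            return false;  // smooth descent connects them: same basin
    }
    return true;
}
```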
2.2.4. Simplex-driven APS Search

Having introduced the concept of the Nelder & Mead (1965) simplex search in Section 2.2.3 above, we additionally use this tool to improve upon the original APS search algorithm, steps (1A)–(3A). Broadly speaking, the purpose of the original APS algorithm is to identify the point which maximizes the S statistic in Equation (1) and evaluate χ2 at that point. As more points are sampled, S will, presumably, come to be maximized at points that are near the χ2 = χ2_lim contour but far from other sampled points, so APS will efficiently explore the parameter space. Steps (1A)–(3A) approximately achieve this end by proposing a random sample of candidate points and evaluating χ2 at the point which maximizes S within that sample. We use the Nelder & Mead (1965) simplex to replace this process as follows (a sketch of the maximization follows at the end of this subsection).

1. (1Ã) Perform step (1A) as above.

2. (2Ã) Generate N_p + 1 randomly chosen points in parameter space. For each of these points, use Equation (1) as in steps (2A–3A) above to calculate S. This set of S values will be called {S}.

3. (3Ã) Use the N_p + 1 points and {S} as the seed for a Nelder & Mead (1965) simplex search that maximizes S. At each subsequent point sampled by this search, calculate S as above. Run this simplex until S converges to a local maximum.

4. (4Ã) Evaluate χ2 at the point which maximized S. Add this point to the list of points sampled by APS. Return to step (2Ã). Repeat until convergence.

By choosing the maximum-S point based on a simplex search rather than a finite random sample, we hope to find a true local maximum in S, rather than just the maximum relative to the limited number of sample points chosen. Steps (1Ã)–(4Ã) replace steps (1A)–(3A) in our expanded APS algorithm.
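The following bare-bones Nelder-Mead maximizer of S illustrates steps (2Ã)–(4Ã). It is a schematic, not the packaged implementation: S_statistic stands in for Equation (1), the standard reflection, expansion, contraction, and shrink coefficients are assumed, and real code needs a proper convergence test rather than a fixed iteration count.

```cpp
#include <algorithm>
#include <vector>

double S_statistic(const std::vector<double>& q);  // Equation (1), assumed

// Maximize S by minimizing f = -S over a simplex of N_p + 1 vertices.
std::vector<double> maximize_S(std::vector<std::vector<double>> pts,
                               int max_iter = 1000) {
    const size_t n = pts[0].size();  // N_p; pts holds N_p + 1 vertices
    auto f = [](const std::vector<double>& q) { return -S_statistic(q); };
    std::vector<double> val(pts.size());
    for (size_t i = 0; i < pts.size(); ++i) val[i] = f(pts[i]);
    for (int it = 0; it < max_iter; ++it) {
        // order vertices so pts[0] is best (largest S), pts.back() worst
        for (size_t i = 0; i + 1 < pts.size(); ++i)
            for (size_t j = i + 1; j < pts.size(); ++j)
                if (val[j] < val[i]) {
                    std::swap(val[i], val[j]);
                    std::swap(pts[i], pts[j]);
                }
        // centroid of every vertex except the worst
        std::vector<double> c(n, 0.0);
        for (size_t i = 0; i + 1 < pts.size(); ++i)
            for (size_t k = 0; k < n; ++k) c[k] += pts[i][k] / double(n);
        auto blend = [&](double w) {  // c + w * (c - worst)
            std::vector<double> p(n);
            for (size_t k = 0; k < n; ++k)
                p[k] = c[k] + w * (c[k] - pts.back()[k]);
            return p;
        };
        std::vector<double> r = blend(1.0);          // reflection
        double fr = f(r);
        if (fr < val[0]) {
            std::vector<double> e = blend(2.0);      // expansion
            double fe = f(e);
            if (fe < fr) { pts.back() = e; val.back() = fe; }
            else         { pts.back() = r; val.back() = fr; }
        } else if (fr < val[pts.size() - 2]) {       // beats second worst
            pts.back() = r; val.back() = fr;
        } else {
            std::vector<double> cn = blend(-0.5);    // inside contraction
            double fc = f(cn);
            if (fc < val.back()) { pts.back() = cn; val.back() = fc; }
            else
                for (size_t i = 1; i < pts.size(); ++i) {  // shrink to best
                    for (size_t k = 0; k < n; ++k)
                        pts[i][k] = pts[0][k] + 0.5 * (pts[i][k] - pts[0][k]);
                    val[i] = f(pts[i]);
                }
        }
    }
    size_t best = 0;  // return the vertex with the largest S encountered
    for (size_t i = 1; i < pts.size(); ++i)
        if (val[i] < val[best]) best = i;
    return pts[best];
}
```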

2.3. Code Implementing APS

We make code implementing our extended APS for an arbitrary χ2 function available at https://github.com/uwssg/APS/. The code is written in C++. It is designed for users to use with any function. Users need only write a sub-class of the chisquared class defined in chisq.h and chisq.cpp that evaluates their desired χ2 and pass an instantiation of this sub-class to an instantiation of the aps class defined in aps.h and aps.cpp. It is this aps class that does the work of performing the searches described above. Users should not need to modify this class (though they are certainly welcome to). Our aps class provides the option for several user-specified parameters. We list them below along with the C++ methods (belonging to the class aps) used to set them.

1. The parameter G from Equation (4) above can be set using the method aps.set_grat.

2. χ2_lim can be manually set using the method aps.set_target.

3. Δχ2 can be set in the constructor for the class aps.

4. The number N_g of nearest neighbors used in steps (2Ã)–(3Ã) can be set in the constructor for the class aps.

5. The minimum and maximum bounds in parameter space can be set using the methods aps.set_max and aps.set_min. These bounds will constrain the range of points sampled in steps (2Ã)–(3Ã). They will not constrain searches performed with bisection or the simplex minimizer.

6. If there is a characteristic length scale associated with a parameter, this can be set using the method aps.set_characteristic_length. This length scale will be used to normalize all distances in parameter space. If it is not set, it will default to the difference between the maximum and minimum values set by aps.set_max and aps.set_min. Setting a characteristic length scale will principally affect what points are considered nearest neighbors for the purposes of steps (2Ã)–(3Ã). If the characteristic length scale of a parameter is large, that parameter will be down-weighted when calculating the parameter-space distance between two points. If the characteristic length scale of a parameter is small, that parameter will be up-weighted in length calculations.

More complete documentation is made available with the source code.
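As a hypothetical sketch of how these pieces fit together (the exact constructor signature and the virtual method of chisquared to override are assumptions here, so the uncertain calls are left as comments; consult the documentation shipped with the code for the real interface):

```cpp
#include "aps.h"    // the aps class described above
#include "chisq.h"  // the chisquared base class

// Hypothetical user likelihood: sub-class chisquared and override its
// chi^2 evaluation hook (the actual method name is defined in chisq.h).
class my_chisq : public chisquared {
    // ... evaluate chi^2 for the user's model here ...
};

int main() {
    my_chisq chifn;

    // Constructor arguments are illustrative only; per the list above,
    // N_g and delta chi^2 are among the quantities set at construction.
    aps searcher(/*N_g=*/20, /*delta_chisq=*/12.6);

    searcher.set_grat(1.0);  // G from Equation (4)
    // Bounds and characteristic lengths per parameter (signatures
    // assumed; see aps.h):
    // searcher.set_min(...); searcher.set_max(...);
    // searcher.set_characteristic_length(...);

    // ... hand chifn to searcher and run the search loop per the docs ...
    return 0;
}
```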

2.4. Interpreting APS Outputs

Originally, APS was conceived as a way to draw Frequentist confidence limits. This is by far the most direct way to interpret APS outputs: a list of points is given with corresponding values of χ2. All of the points with χ2 ≤ χ2_lim are taken to cover the desired confidence limit.

There is, however, also a way to use the outputs from APS to draw an approximate Bayesian credible limit. As discussed in the introduction, Bayesian parameter constraints assume that there is some probability distribution ("the posterior") over θ. (1 − α)% credible limits are drawn by integrating over this probability distribution and selecting that region which contains (1 − α)% of the total probability. This integral can be very time-consuming if the dimensionality of the parameter space is large. Markov Chain Monte Carlo methods overcome this difficulty by randomly drawing samples from the posterior and assuming that, if enough samples are drawn, the distribution of those samples will be sufficient to reconstruct the attributes of the full posterior probability density. APS outputs cannot be used this way, since APS does not select its points probabilistically, but in an attempt to recreate the behavior of the function in interesting regions of parameter space. We can still, however, use these points to approximate the integral over the posterior probability density. The algorithm we propose is as follows. The total set of points sampled by APS will be referred to below as the sampled set. For each point a in the sampled set:

1. (1D) Initialize two vectors a_max and a_min. Set all of the values in these vectors to some nonsense number.

2. (2D) Select the 3N_p + 1 nearest neighbor points to a from the sampled set (this will obviously include a itself, which can be discarded). Call this set of nearest neighbors the neighbor set.

3. (3D) For each point n in the neighbor set, calculate the vector δ such that δ_i = |n_i − a_i| (here the vertical bars denote the absolute value). For each component δ_i of δ, starting with the largest and working toward the smallest, use δ_i to set the corresponding bound a_max,i or a_min,i (depending on the sign of n_i − a_i) if it has not already been set.

4. (4D) a_max and a_min should now describe an asymmetric, rectangular hyperbox about a. To symmetrize it, step through all N_p dimensions and reset either a_max,i or a_min,i so that d_min = a_i − a_min,i and d_max = a_max,i − a_i are equal to the minimum of {d_max, d_min} from the asymmetric hyperbox. Return to step (1D) and repeat for the next point.

At this point, each point a in the sampled set should have a corresponding symmetric hyperbox surrounding it described by a_max and a_min. To integrate the posterior probability density, we will assume that all of the parameter space points in each hyperbox correspond to the χ2 of the box's central point, i.e., the χ2 discovered by APS for the point at the center of the hyperbox is the χ2 value for all of the points inside the hyperbox. The posterior probability associated with each box is then P_i ∝ exp(−χ2_i/2) V_i, where V_i is the hypervolume of the hyperbox surrounding the ith point. To plot the (1 − α)% credible limit, one need only plot the points whose hyperboxes are among the lowest-χ2 hyperboxes containing (1 − α)% of the total probability. We test this algorithm on a function with an exactly known credible limit in Section 4 (see Figures 5–7) and on a real, physical data set in Section 5 (see Figure 14). In Section 4, we find that our results agree with the known credible limit. In Section 5 we find that APS correctly identifies the region and general shape of the Bayesian credible limits, though with significant noise. The Bayesian credible limits found by APS are systematically larger than the Bayesian credible limits found by MCMC. This motivates our claim in the introduction that APS may be principally useful for identifying regions of high posterior density on largely unknown likelihood functions, allowing users to more shrewdly set priors for sampling algorithms like MCMC.

Figure 1. One-dimensional example of prediction using Gaussian processes. The black curve is the function being considered. The crosses are the points at which it has been sampled. The green curve is the resulting prediction and the red curves represent the 1σ and 2σ uncertainty bounds. We set K according to Equation (17) with N_g = 5 for the five sampled points.

The functionality described by this section is provided by the classes defined in the source code file aps_extractor.cpp in our software package.
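A schematic of the hyperbox integration, assuming the boxes have been built per steps (1D)–(4D): each box is weighted by exp(−χ2/2) times its volume, evaluated relative to χ2_min for numerical stability. All names here are illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// One APS sample: chi^2 at the box center and the box hypervolume.
struct Box { double chisq; double volume; };

// Return the chi^2 value such that the lowest-chi^2 hyperboxes up to that
// value enclose a fraction (1 - alpha) of the total posterior probability.
// Points whose boxes fall at or below the returned value trace the
// approximate credible limit.
double credible_threshold(std::vector<Box> boxes, double alpha) {
    std::sort(boxes.begin(), boxes.end(),
              [](const Box& a, const Box& b) { return a.chisq < b.chisq; });
    const double chisq_min = boxes.front().chisq;  // subtract for stability
    double total = 0.0;
    for (const Box& b : boxes)
        total += std::exp(-0.5 * (b.chisq - chisq_min)) * b.volume;
    double running = 0.0;
    for (const Box& b : boxes) {
        running += std::exp(-0.5 * (b.chisq - chisq_min)) * b.volume;
        if (running >= (1.0 - alpha) * total) return b.chisq;
    }
    return boxes.back().chisq;  // alpha == 0: everything is enclosed
}
```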

3. GAUSSIAN PROCESSES

Steps (2Ã)–(3Ã) of APS require that a prediction μ be made regarding the value of χ2 at points as yet unsampled by the algorithm. An uncertainty σ must also be assigned to this prediction. APS writ large is agnostic regarding the method used to find μ and σ. We choose to use a Gaussian process. Gaussian processes take noisy, sparse measurements of an unknown function (in our case, χ2 at the points in parameter space already sampled by APS) and use them to make inferences about unmeasured values of the function by assuming that the function is a random process. Gaussian processes have already found use in physics and astronomy constructing non-parametric models of the cosmic expansion (Shafieloo et al. 2012), interpolating the point spread function across astronomical images (Bergé et al. 2012), and interpolating models of the non-linear matter power spectrum from N-body simulations (Habib et al. 2007). They are useful in the current context because not only do they give robust predictions μ for the unmeasured values of the function being modeled, but they also provide well-motivated prediction uncertainties σ, a requirement of Equation (1). We present below an introduction to the formalism of Gaussian processes drawn heavily from Chapter 2 and Appendix A of Rasmussen & Williams (2006).

Gaussian processes use the sampled data (θ^(i), χ2,(i)), where i indexes over all of the sampled points, to predict μ = χ2,(q) at the unknown point θ^(q) by assuming that the function represents a sample drawn from a random process distributed across parameter space. At each point in parameter space, χ2 is assumed to be distributed according to a normal distribution with mean and variance dictated by the "covariogram" C_ij. If we have N_g measurements of χ2 (recall from steps (2Ã–3Ã) above that, for each point where we must predict the value of χ2, we are only using the N_g nearest neighbor sampling points as data), then we assume they are distributed according to a multivariate normal distribution with covariance matrix K C_ij (Equation (11)). C_ij is a function assumed by the user encoding how variations in χ2 at one point in parameter space affect variations in χ2 at other points in parameter space. K is a parameter controlling the normalization of C_ij. The diagonal elements additionally include the intrinsic variance in the value of χ2 at a single point. Rasmussen & Williams (2006) treat the special case of a zero-mean process and find (see their Equations (2.19), (2.25), and (2.26))

μ = Σ_ij C_qi [C^−1]_ij χ2,(j),
σ^2 = K (C_qq − Σ_ij C_qi [C^−1]_ij C_jq),    (12)

where the sums over i and j are sums over the sampled points in parameter space. C_iq relates the sampled point i to the candidate point q. We do not wish to assume that the mean value of χ2 is zero everywhere. Therefore, we modify the equation for μ to give

μ = χ̄2 + Σ_ij C_qi [C^−1]_ij (χ2,(j) − χ̄2),    (13)

where χ̄2 is the algebraic mean of the sampled χ2,(i). Note the similarity to a multi-dimensional Taylor series expansion with the covariance matrix playing the role of the derivatives. Equation (12) differs from Equation (6) in Bryan et al. (2007) because they used the semivariance in place of the covariance C_ij. In practice, the two assumptions result in equivalently valid μ and σ.

The form of the covariogram must be assumed. One possibility, taken from Equation (4.9) of Rasmussen & Williams (2006), is

C_ij = exp(−D_ij^2 / 2),    (14)

where D_ij is the normalized distance in parameter space between the points θ^(i) and θ^(j),

D_ij^2 = Σ_n [ (θ_n^(i) − θ_n^(j)) / (ℓ_n (max_n − min_n)) ]^2,    (15)

where max_n − min_n is the difference between the maximum and minimum values of the nth parameter taken from the N_g points used to seed the Gaussian process and ℓ_n is a "hyper-parameter" associated with the nth parameter. We discuss the normalization parameter K below in Section 3.1.
We discuss the setting of hyper-parameters in Section 3.2. The exponential form of C_ij quantifies the assumption that distant points should not be very correlated. Another possibility (Equation (4.29) of Rasmussen & Williams 2006) is the arcsine covariogram of Equation (16), corresponding to the covariance function of a neural network with a single hidden layer. In this function, the parameter vectors are recentered and renormalized relative to the span (max_n − min_n) in each dimension, and σ_0 and σ_1 are hyper-parameters. In Section 5, we test APS using both covariograms (14) and (16) and find no appreciable difference in the resulting parameter constraints.

3.1. Normalizing the Covariogram

The normalization constant K in Equation (11)—known as the "Kriging parameter" for the geophysicist who pioneered the overall method—also must be assumed. Determining the value of K is somewhat problematic because, examining Equation (13), one sees that the factors of K and K^−1 cancel out of the prediction μ, so that the assumed value of K has no effect on the accuracy of the prediction. If the opposite had been true, one could heuristically set K to maximize the accuracy of μ. Instead, we set K to the value that maximizes the likelihood of the input data, i.e., the N_g nearest neighbors in parameter space used to perform the Gaussian process inference. Recall from Equation (11) that we are treating our input data as though they were samples drawn from a probability distribution. To maximize the probability of the observed data (i.e., maximizing the value of the multivariate Gaussian in Equation (11)), we must set K equal to

K = (1/N_g) Σ_ij (χ2,(i) − χ̄2) [C^−1]_ij (χ2,(j) − χ̄2).    (17)

We adopt this assumption throughout this work. Figure 1 applies the Gaussian process of Equations (12), (13), and (17) with covariogram (14) to a toy one-dimensional function. Inspection shows many desirable behaviors in μ and σ. As θ^(q) approaches the sampled points θ^(i), μ approaches χ2,(i) and σ approaches zero. Closer to a sampled point, the Gaussian process knows more about the true behavior of the function. Far from the θ^(i), σ is larger, and the S statistic in Equation (1) will induce the APS algorithm to examine the true value of χ2.

3.2. Setting Hyper-parameters

The covariograms C_ij presented above in Equations (14) and (16) (indeed, any covariogram) each contain arbitrary hyper-parameters whose values can be set to fine-tune the Gaussian process model (ℓ_n in Equation (14) and {σ_0, σ_1} in Equation (16)). In order to set the hyper-parameters used in the APS Gaussian process, we randomly select a set of up to 3000 points from those already sampled by APS and use these to construct an error function E(h), where h is a vector specifying the hyper-parameters required by the chosen C_ij. E is defined (Equation (18)) as the summed error between the prediction μ(θ^(i); h), calculated according to Equation (13) using hyper-parameters h (and ignoring θ^(i) as its own nearest neighbor), and the known value χ2,(i), which is already known since θ^(i) has been chosen from the points previously sampled by APS. If h has more than two dimensions, we minimize E in h-space using the simplex minimizer of Nelder & Mead (1965). Otherwise, we do a simple grid search for the value of h which minimizes E. This optimal value of h is used to set the hyper-parameters for our Gaussian process model. As APS samples more points in parameter space, we expect it to learn more about the function and thus be capable of a more accurate Gaussian process, so we allow it to periodically reset the hyper-parameters. However, minimizing E can be a very time-consuming process, so we have APS keep track of the amount of time elapsed since the algorithm began.
We divide this elapsed time by the number of calls made to the χ2 function to give the average total time spent per call to χ2, with all of the overhead imposed by APS included. We compare this average with the average amount of time spent on just the χ2 evaluation itself, without any of the overhead imposed by APS. APS is only allowed to reset its hyper-parameters if the difference between these two times is less than 0.1 second per call to χ2 (of course, we do require that APS optimize its hyper-parameters at least once, so that we can have confidence in our Gaussian process model).
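To make the prediction formalism concrete, the sketch below implements Equations (12)–(14) and (17) for a single query point using the Eigen linear algebra library. It is illustrative only, and assumes coordinates have already been normalized by the characteristic lengths of Equation (15).

```cpp
#include <Eigen/Dense>
#include <algorithm>
#include <cmath>

// Gaussian-process prediction of chi^2 at a query point from its N_g
// nearest sampled neighbors: squared-exponential covariogram (Equation
// (14)), mean-corrected predictor (Equation (13)), maximum-likelihood
// Kriging normalization K (Equation (17)), and variance (Equation (12)).
void gp_predict(const Eigen::MatrixXd& X,  // N_g x N_p neighbor coordinates
                const Eigen::VectorXd& y,  // chi^2 at those neighbors
                const Eigen::VectorXd& q,  // query point, length N_p
                double& mu, double& sigma) {
    const int n = static_cast<int>(X.rows());
    auto sqexp = [](double d2) { return std::exp(-0.5 * d2); };
    Eigen::MatrixXd C(n, n);
    Eigen::VectorXd cq(n);
    for (int i = 0; i < n; ++i) {
        cq(i) = sqexp((X.row(i).transpose() - q).squaredNorm());
        for (int j = 0; j < n; ++j)
            C(i, j) = sqexp((X.row(i) - X.row(j)).squaredNorm());
    }
    const double ybar = y.mean();  // algebraic mean of the sampled chi^2
    Eigen::VectorXd r = y - Eigen::VectorXd::Constant(n, ybar);
    Eigen::LDLT<Eigen::MatrixXd> chol(C);
    Eigen::VectorXd alpha = chol.solve(r);            // C^{-1} (y - ybar)
    mu = ybar + cq.dot(alpha);                        // Equation (13)
    const double K = r.dot(alpha) / n;                // Equation (17)
    const double var = K * (1.0 - cq.dot(chol.solve(cq)));  // Equation (12)
    sigma = std::sqrt(std::max(var, 0.0));
}
```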

4. A TOY EXAMPLE

We will now test the ability of APS to learn the confidence limits of an artificial χ2 function and compare it to that of MCMC driven by the traditional Metropolis-Hastings algorithm (Gilks et al. 1996; Lewis & Bridle 2002). The algorithm (1A)–(3A) in Section 2 was originally presented and tested against the one yr data release from the WMAP satellite in Bryan et al. (2007). They found that the algorithm detected a second locus of low χ2 that had gone undetected by previous, MCMC-based analyses of the data. This second fit to the data has disappeared as the signal-to-noise ratio of the data has improved (as we will see in Section 5); therefore, we will use for our test a toy χ2 function with multiple loci of low χ2.

We construct our toy χ2 as a series of low-χ2 ellipses in a five-dimensional parameter space (we choose a five-dimensional parameter space because it allows us to reasonably show all of the two-dimensional sub-spaces in the plots which follow). We consider functions with two, three, and four discrete minima in χ2. The locations of the low-χ2 regions are generated using a pseudo-random number generator taken from Marsaglia (2003) with parameters taken from Press et al. (2007). χ2 for this toy function is calculated as

χ2 = Σ_i [ (θ_i − c_nearest,i) / w_nearest,i ]^2,    (19)

where c_nearest,i is the ith coordinate of the center of the nearest low-χ2 region and w_nearest,i is the (pseudo-randomly generated) width of that ellipse in the ith coordinate. We refrain from introducing any interesting parameter degeneracies so that the function remains integrable in two-dimensional sub-spaces. This gives us a hard-truth "control" against which to evaluate our test APS runs below. Readers wishing to re-create this χ2 function can find it in our software package as the class ellipses_integrable in the source code files chisq.h and chisq.cpp.
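A function in the spirit of Equation (19) can be sketched as follows. The packaged ellipses_integrable class is the authoritative version; here "nearest" is implemented as the center yielding the lowest elliptical distance, and all names are illustrative.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Multi-modal toy chi^2: the elliptical distance (Equation (19)) to the
// nearest of several low-chi^2 centers, with per-axis widths w.
double toy_chisq(const std::vector<double>& theta,
                 const std::vector<std::vector<double>>& centers,
                 const std::vector<std::vector<double>>& widths) {
    double best = std::numeric_limits<double>::max();
    for (size_t j = 0; j < centers.size(); ++j) {
        double chisq = 0.0;
        for (size_t i = 0; i < theta.size(); ++i) {
            const double d = (theta[i] - centers[j][i]) / widths[j][i];
            chisq += d * d;  // Equation (19) for center j
        }
        best = std::min(best, chisq);  // keep the nearest (lowest) center
    }
    return best;
}
```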
To test the performance of APS, we ran APS on versions of this function with two, three, and four discrete minima in χ2. We allowed APS to sample a total of 10,000 points in the five-dimensional parameter space. We had APS adaptively set χ2_lim = χ2_min + Δχ2, Δχ2 = 11.0 being the 95% confidence limit bound for a χ2 probability distribution with 5 degrees of freedom. We used a Matérn covariogram with ν = 3/2 for our Gaussian process (Rasmussen & Williams Equation (4.17)). Figures 2, 3, and 4 plot the points found by APS (black points) against the known Δχ2 = 11.0 contours of the function (red contours). Obviously, APS did a very good job of exploring all of the available low-χ2 regions.

Figure 2. This plot shows all 10 two-dimensional sub-spaces of our five-dimensional toy likelihood function with two minima in χ2. The black points are found by APS after 10,000 samplings. The red contours are the known contours of the function.

Figure 3. This plot shows all 10 two-dimensional sub-spaces of our five-dimensional toy likelihood function with three minima in χ2. The black points are found by APS after 10,000 samplings. The red contours are the known contours of the function.

Figure 4. This plot shows all 10 two-dimensional sub-spaces of our five-dimensional toy likelihood function with four minima in χ2. The black points are found by APS after 10,000 samplings. The red contours are the known contours of the function.

As mentioned above, because our toy χ2 function is constructed without any parameter degeneracies, we are able to directly integrate the posterior and determine the true Bayesian credible limits in our parameter space. Figures 5, 6, and 7 plot the true 95% credible limits in all 10 two-dimensional sub-spaces of our five-dimensional parameter space (red contours). They also show the APS-determined credible limits as described in Section 2.4 (black points). The APS-determined credible limits correspond well with the true credible limits, giving us confidence in APS's ability to map out Bayesian credible limits as well as Frequentist confidence limits.

Figure 5. This plot shows all 10 two-dimensional sub-spaces of our five-dimensional toy likelihood function with two minima in χ2. The black points represent the 95% Bayesian credible limit found by APS (and determined as described in Section 2.4) after 10,000 samplings. The red contours represent the known 95% Bayesian credible limit of the toy function.

Figure 6. This plot shows all 10 two-dimensional sub-spaces of our five-dimensional toy likelihood function with three minima in χ2. The black points represent the 95% Bayesian credible limit found by APS (and determined as described in Section 2.4) after 10,000 samplings. The red contours represent the known 95% Bayesian credible limit of the toy function.

Figure 7. This plot shows all 10 two-dimensional sub-spaces of our five-dimensional toy likelihood function with four minima in χ2. The black points represent the 95% Bayesian credible limit found by APS (and determined as described in Section 2.4) after 10,000 samplings. The red contours represent the known 95% Bayesian credible limit of the toy function.

The tests above demonstrate APS's ability to correctly determine credible and confidence limits. They do not, however, address the question of APS's chief advantage over Markov Chain Monte Carlo methods: its ability to successfully locate all of the low-χ2 regions in a given parameter space. To demonstrate this, we ran APS 100 times on each of the likelihood functions plotted in Figures 2–4. Each time, the pseudo-random number generator driving APS was given a different seed. In nearly all cases, APS successfully located all of the local minima in χ2, usually in fewer than 10,000 evaluations of χ2 (one case of the two-mode χ2 function required 10,352 evaluations to find both low-χ2 regions; one required 12,549 evaluations). For comparison, we also ran 100 Markov Chain Monte Carlo searches on these χ2 functions, each run consisting of five independent chains. We allowed the MCMCs to run until they had sampled 100,000 points in parameter space. These searches were not nearly as successful as APS at locating all of the low-χ2 regions. In the case of the two-mode function, 50 of the 100 MCMC searches failed to identify both modes. In the case of the three-mode function, 56 of the 100 MCMC searches failed to identify all of the modes. In the case of the four-mode function, 87 of the 100 MCMC searches failed to identify all of the modes. This is not to say that APS is perfect at identifying all of the modes of a likelihood function.
If the characteristic widths w_j,i in Equation (19) are too small, we enter a regime in which neither APS nor MCMC is capable of reliably identifying all of the modes. However, this test does indicate that APS is more robust than MCMC against multi-modal likelihood functions, as was first demonstrated by Bryan et al. (2007). This should not be a surprising statement. In order to speed convergence, MCMC attempts to learn the size and shape of any high likelihood region it discovers. It uses what it learns to propose new random steps in parameter space so that it can efficiently converge to a state in which it is sampling the Bayesian posterior. If an MCMC chain discovers one of several high likelihood regions, it will learn a proposal density that conforms to that high likelihood region, disregarding any others that may exist. It will, in other words, become trapped in that high likelihood region. MCMC can overcome this difficulty by running several chains in parallel and hoping that, if multiple high likelihood regions exist, each chain will become trapped in a different region and the whole MCMC will thus learn all of the modes in the parameter space. This, however, requires initializing a number of chains that is large relative to both the dimensionality of the parameter space and the number of discrete high likelihood regions it contains so that the odds of discovering all of the high likelihood regions are large. If the user does not know how many high likelihood regions exist in the parameter space, this could be problematic. Because steps (2Ã–4Ã) keep APS exploring widely through parameter space, even after a high likelihood region has been identified, APS can discover the presence of multiple high likelihood regions without requiring the user's prior knowledge of them.

5. A REAL EXAMPLE: WMAP

Section 4 demonstrates the robustness of APS against multi-modal χ2 functions. However, it does have some shortcomings as a test of APS's abilities. As was already noted, there are no interesting parameter degeneracies in Equation (19). Furthermore, because this function was constructed artificially, there is no noise. At any point in parameter space, χ2 falls smoothly and monotonically toward the associated local minimum of χ2. To really have confidence in APS's abilities, we must test it on actual physical data drawn from the universe, with all of the noisy and ill-behaved properties that entails. To that end, we will now consider for our χ2 the likelihood function on the 7 yr data release of the WMAP CMB satellite (Jarosik et al. 2011). For simplicity, we consider only the anisotropy power spectrum of the temperature–temperature correlation function. The parameter space we explore is the six-dimensional space of the spatially flat concordance ΛCDM cosmology:

1. Ω_b h^2—the density of baryons in the universe;

2. Ω_CDM h^2—the density of cold dark matter in the universe;

3. τ—the optical depth to the surface of last scattering. τ cannot be constrained without polarization data, which we do not consider here. However, it functions as a nuisance parameter to be marginalized over. The fact that APS can handle it as well as MCMC is an important test of APS's feasibility.

4. h—the Hubble parameter in units of 100 km s^−1 Mpc^−1;

5. n_s—the spectral index controlling the distribution of scalar density perturbations in the universe; and

6. ln[10^10 A_s]—the amplitude of the initial scalar density perturbations in the universe.

These are the parameters taken by the publicly available MCMC code CosmoMC, which we use to run our baseline MCMC chains (Lewis & Bridle 2002). The χ2 function in this case involves converting these parameters into the power spectrum of anisotropy C_ℓ s on the sky and comparing those C_ℓ s to the actual measurements registered by the WMAP satellite. Using the (very fast) Boltzmann code CAMB (Lewis et al. 2000) to calculate the C_ℓ s takes 1.3 s on a 2.5 GHz processor. Evaluating the χ2 using the likelihood code provided by the WMAP team takes an additional 2 s (this can, of course, be sped up by using a faster machine and multiple cores). Hence the need for an algorithm like APS or MCMC to accelerate the search. We will consider both Equations (14) and (16) as the covariogram for our Gaussian process, though we find that the choice has little effect on our results. We will also consider setting χ2_lim adaptively as well as setting it by hand.

To gauge the appropriateness of applying APS to this problem, we consider the two following tests. In Figure 8 we plot the minimum discovered value of χ2 yielded by APS and MCMC as a function of the number of samples drawn by each. If APS failed to find a suitable value of χ2_min, the use of Δχ2 to find χ2_lim would be questionable. However, as we can see, the χ2_min yielded by APS (1273.3) is within a few of the χ2_min yielded by MCMC (1270.7).

Figure 8. Minimum value of χ2 found by APS and MCMC on the WMAP 7 problem as a function of steps sampled by each.

The reader will note that MCMC discovers its χ2_min much more rapidly than APS; however, we will see in Figures 10 through 12 that this rapid descent to χ2_min does not translate into rapid convergence to the final credible limit. Because MCMC infers its limits based on the number of samples drawn from the Bayesian posterior probability, MCMC still requires many tens (or hundreds) of thousands of points to be sampled after χ2_min has been discovered. Because it is learning the shape of the function directly, APS does not impose this requirement. The reader should also consider Figure 12, which shows the Frequentist confidence intervals discovered by APS when χ2_lim is set by hand using prior knowledge of the size of the WMAP seven yr data set. In this case, APS does not need to learn χ2_min in order to proceed. This is the original use case for which APS was designed. Here we see that the confidence intervals discovered agree very well with the Bayesian credible intervals discovered by MCMC.

The second question we consider is whether or not our assumption that χ2 can be modeled by a Gaussian process is valid. In Figure 9 we plot the average fractional error in μ as a predictor of χ2 as a function of the number of steps sampled (note: the vertical axis actually shows the average fractional error over the preceding 500 steps so that the early, imprecise models do not contaminate the later, precise models). As APS samples more points, its Gaussian process model becomes more accurate. This effect is more pronounced for covariogram (14) than for covariogram (16). However, we will see below that this difference does not affect the parameter constraints determined by APS.

Figure 9. Average value of |μ − χ2|/χ2 as a function of the number of points sampled by APS in the WMAP 7 problem. This is shown as a check on our assumption that χ2 can be approximated as a Gaussian process.
Though the squared exponential covariogram (14) gives a better fit to the χ2 function than the arcsine covariogram (16), we see in Figures 10 and 11 that this has little effect on the parameter constraints derived by APS.

We now consider the ultimate results, namely the 95% confidence limits on our six-dimensional cosmological parameter space yielded by APS. To provide a baseline against which to test APS, we run four independent MCMC chains using CosmoMC. Like the MCMC code used in Section 4, CosmoMC periodically adjusts its proposal density by learning the covariance matrix of the points it has already sampled. At each step, it proposes a new point in parameter space by randomly selecting an eigenvector of that learned covariance matrix and stepping along it. We integrate the posterior by taking the output chains, discarding the first 50% of steps as a burn-in period, and thinning the remaining steps so that we have a set of effectively independent samples drawn from the posterior. Quantitatively, this is determined by keeping only every Lth sample after burn-in, where L is the smallest lag at which the normalized correlation between samples falls below a set threshold (Equation (20)); here n is the index over the N_p parameters and (i) is the index over the samples drawn. We find the 95% credible limits in two-dimensional slices of our parameter space using a total of 460,000 MCMC steps. These are assumed to be the true credible limits on the parameter space and are plotted as the thick red contours in Figures 10 through 12 below.

Against these control contours, we plot the Frequentist confidence limits (all points with χ2 ≤ χ2_lim) discovered by APS after only 50,000 evaluations of χ2. We consider both an adaptive χ2_lim = χ2_min + 12.6 (12.6 is the Δχ2 corresponding to a 95% confidence limit in the case of 6 parameters) and an absolute χ2_lim set by hand (the 95% confidence limit for a χ2 probability distribution with the 1199 degrees of freedom corresponding to the 1199 C_ℓ s in the data set). These are the black points in Figures 10, 11, and 12, respectively. These confidence limits overlap very nicely with the true credible limits discovered by MCMC, comporting with the idea that, in the large data limit, Bayesian credible limits and Frequentist confidence limits ought to align.

Figure 10. Ten two-dimensional sub-spaces of our six-dimensional parameter space (τ is ignored because no real constraint can be gleaned from these data). The thick red contours are the 95% Bayesian credible limits determined by MCMC after 460,000 calls to χ2. The black points are the 95% Frequentist confidence limits determined by APS after 50,000 calls to χ2, with χ2_lim set adaptively. Equation (14) is used for the Gaussian process covariogram.

Figure 11. Ten two-dimensional sub-spaces of our six-dimensional parameter space (τ is ignored because no real constraint can be gleaned from these data). The thick red contours are the 95% Bayesian credible limits determined by MCMC after 460,000 calls to χ2. The black points are the 95% Frequentist confidence limits determined by APS after 50,000 calls to χ2, with χ2_lim set adaptively. Equation (16) is used for the Gaussian process covariogram.

Figure 12. Ten two-dimensional sub-spaces of our six-dimensional parameter space (τ is ignored because no real constraint can be gleaned from these data).
The thick red contours are the 95% Bayesian credible limits determined by MCMC after 460,000 calls to χ2. The black points are the 95% Frequentist confidence limits determined by APS after 50,000 calls to χ2, with χ2_lim set by hand based on the size of the data set. Equation (14) is used for the Gaussian process covariogram.

For comparison of convergence properties, we plot the Bayesian credible limits discovered by MCMC after only 50,000 χ2 evaluations. These are the black regions in Figure 13. While MCMC has clearly converged to the final confidence limits in some of the two-dimensional sub-spaces shown, others (Figures 13(f), 13(g), and 13(j)) evince the presence of significant noise in the MCMC chains. Indeed, even the "final" credible limit contour (the red contour) in Figure 13(f) shows significantly more noise along the edges than, for instance, Figure 13(a). This is not meant to disparage the convergence properties of MCMC, but to show that any convergence advantage MCMC may have over APS is not absolute. This should be compared with the clear advantage APS demonstrated over MCMC in identifying multi-modal likelihood functions in Section 4 above.

Figure 13. Ten two-dimensional sub-spaces of our six-dimensional parameter space (τ is ignored because no real constraint can be gleaned from these data). The thick red contours are the 95% Bayesian credible limits determined by MCMC after 460,000 calls to χ2. The black regions are the 95% Bayesian credible limits determined by MCMC after 50,000 calls to χ2.

In Figure 14, we plot the 95% Bayesian credible limit discovered by APS after 50,000 evaluations of χ2 and determined by the method outlined in Section 2.4. Unfortunately, we do not see here the convergence to the true credible limits that we saw in Figures 5–7. While it does appear that APS has correctly identified the locations and general shapes of the true credible limits, it has, in most cases, drawn credible limits that are larger than those found by MCMC. This should not be surprising. APS was designed with Frequentist confidence limits in mind, and Figures 10–12 seem much better converged than Figure 14. However, it does seem that APS might be useful in cases where little is known a priori about χ2 and users need to be confident that they have discovered all of the low-χ2 regions in a particular parameter space. Once APS has been used to identify rough Bayesian credible limits, more detailed algorithms (like MCMC) can be run using the prior knowledge gained by APS to refine the credible limits.

Figure 14. Ten two-dimensional sub-spaces of our six-dimensional parameter space (τ is ignored because no real constraint can be gleaned from these data). The thick red contours are the 95% Bayesian credible limits determined by MCMC after 460,000 calls to χ2. The black points are the 95% Bayesian credible limits determined by APS according to the method outlined in Section 2.4. While the APS credible limit covers the entire MCMC limit, it also covers regions of parameter space excluded by MCMC.

As a final note, we do not plot any confidence or credible limits for τ. This is because τ needs CMB polarization data for meaningful constraints. For the sake of simplicity, we did not consider CMB polarization data here.
Thus, τ has become a "nuisance parameter": it affects the fit of our theoretical models, but cannot itself be meaningfully constrained. MCMC handles such parameters by effectively integrating over them when delivering the two-dimensional contours of Figures 10 through 12. APS handles such nuisance parameters by simply allowing them to vary and returning any combinations of useful-plus-nuisance parameters that satisfy the χ2 ≤ χ2_lim criterion. The fact that this nuisance parameter did not significantly impede the convergence of APS relative to MCMC speaks well of APS's ability to handle actual physical problems, which often include parameters that have little physical interest but are required to, e.g., calibrate noise distributions or systematics.

We reiterate that the purpose of this section was to demonstrate that the presence of real-world noise in the data and likelihood function does not significantly impede APS's ability to determine confidence and credible limits. The problem of drawing parameter constraints from the WMAP likelihood function is, by now, well understood, and shrewdly drawn priors and proposal densities are readily available. Indeed, prior knowledge of early WMAP results allowed us to use initial proposal densities centered on the known best-fit parameter values for the MCMC runs displayed in this section (Equations (21)), where θ is 100 times the ratio of the sound horizon to the angular diameter distance at recombination (this is the input parameter CosmoMC uses in place of h). These distributions are fairly similar to the final, marginalized, one-dimensional distributions of those same parameters. Conversely, for the purposes of our APS runs, we initialize the algorithm with only gross assumptions about the allowed ranges of the parameters. In less well-studied problems, users may not understand their parameter spaces well enough to be able to supply shrewd proposal densities in the vein of Equations (21). In such a situation, the exploratory behavior of APS demonstrated in Section 4 may prove more important than the theoretical rigor offered by MCMC's sampling algorithm.
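The thinning rule used for the control chains (described after Figure 9 above) can be sketched as follows. The threshold on the normalized autocorrelation is not reproduced in the text, so the value below is a placeholder, and the routine is illustrative rather than CosmoMC's implementation.

```cpp
#include <cmath>
#include <vector>

// Choose a thinning length L for a post-burn-in MCMC chain: the smallest
// lag at which the normalized autocorrelation of every parameter drops
// below a chosen threshold (in the spirit of Equation (20)).
int thinning_length(const std::vector<std::vector<double>>& chain,  // steps x N_p
                    double threshold = 0.1) {
    const size_t n = chain.size(), np = chain[0].size();
    for (size_t L = 1; L < n / 2; ++L) {
        bool decorrelated = true;
        for (size_t p = 0; p < np && decorrelated; ++p) {
            double mean = 0.0;
            for (size_t i = 0; i < n; ++i) mean += chain[i][p] / n;
            double num = 0.0, den = 0.0;
            for (size_t i = 0; i + L < n; ++i)
                num += (chain[i][p] - mean) * (chain[i + L][p] - mean);
            for (size_t i = 0; i < n; ++i)
                den += (chain[i][p] - mean) * (chain[i][p] - mean);
            if (std::fabs(num / den) > threshold) decorrelated = false;
        }
        if (decorrelated) return static_cast<int>(L);
    }
    return static_cast<int>(n / 2);  // chain too correlated to thin reliably
}
```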

6. CONCLUSIONS

In Section 4 we indicated that, consistent with the findings of Bryan et al. (2007), APS is more robust than MCMC against multi-modal likelihood functions (at least on the idealized cartoon functions considered). In Section 5, we showed that the presence of realistic noise does not interfere with APS's ability to derive Frequentist confidence limits. However, the Bayesian credible limits yielded by APS remain noisy and imprecise. We believe that this performance can still be of use to the community, especially when trying to evaluate proof-of-concept parameter constraints (perhaps for forecasting parameter constraints from future experiments) with very expensive χ2 functions on poorly understood parameter spaces. Certainly, for users interested in plotting Frequentist confidence limits, APS is a viable option.

There remain, however, several concerns to be addressed in the use of APS. Unlike MCMC, there is no obvious criterion for convergence of the APS algorithm. One possibility is to interpret the Frequentist confidence limit as a hyperbox (i.e., keep track of the allowable maximum and minimum values of each parameter as the parameter space is explored) and watch the growth of the volume of that hyperbox as a function of the number of χ2 evaluations. Figure 15 plots this convergence metric for the APS run in Figure 10. While it appears that APS may have converged after ~50,000 evaluations, the volume of the Frequentist confidence limit continues to grow, though Figure 10 already shows good coverage of the control Bayesian credible limit. It is, of course, possible that we are seeing the effect of the slight difference between the Bayesian credible limit and the Frequentist confidence limit, even in the large-data case.

Figure 15. Parameter space volume of the Frequentist confidence limit in Figure 10 as a function of the number of χ2 evaluations.

Another concern is that, while APS was explicitly designed to rapidly converge to the Frequentist confidence limit (where rapidity is measured in the number of χ2 evaluations made), the searches in Section 2 are complicated enough that they will add extra clock time to each evaluation. We have attempted to balance APS so that it does not rely too heavily on expensive calculations. However, some overhead is inevitably required. Figure 16 shows the average overhead time in seconds added by APS to each χ2 evaluation as a function of the number of evaluations. As the number of evaluations grows, the number of points APS must search through when constructing its Gaussian process also grows, and so does this extra time (the initial spike is due to the first expensive call to the hyper-parameter optimization in Section 3.2). Though the extra time stays within the bounds of a few tenths of a second per evaluation, this may still be too expensive in the case of functions for which a direct evaluation takes a vanishingly small amount of time.

Figure 16. Extra time added by APS to each χ2 evaluation as a function of the number of evaluations.

Finally, one advantage of MCMC that APS cannot overcome is its comparative ease of implementation. To this end, we make our code available at https://github.com/uwssg/APS/. The code is presented as a series of C++ classes with directions indicating where the user can interface with the desired χ2 function. Those with questions about the code or the algorithm should not hesitate to contact the authors.
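The hyperbox convergence diagnostic can be tracked with a few lines of bookkeeping. The sketch below (illustrative names and structure) accumulates the bounding box of all points satisfying χ2 ≤ χ2_lim and reports its volume, as plotted in Figure 15.

```cpp
#include <vector>

// Bounding hyperbox of all confidence-limit points found so far. Feed it
// every point with chi^2 <= chisq_lim and record volume() against the
// number of chi^2 evaluations to reproduce a diagnostic like Figure 15.
struct BoundingBox {
    std::vector<double> lo, hi;
    bool empty = true;

    void add(const std::vector<double>& p) {
        if (empty) { lo = hi = p; empty = false; return; }
        for (size_t i = 0; i < p.size(); ++i) {
            if (p[i] < lo[i]) lo[i] = p[i];
            if (p[i] > hi[i]) hi[i] = p[i];
        }
    }
    double volume() const {
        if (empty) return 0.0;
        double v = 1.0;
        for (size_t i = 0; i < lo.size(); ++i) v *= (hi[i] - lo[i]);
        return v;
    }
};
```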

The authors thank the referee for comments and suggestions that we hope have made this paper more useful to astrophysical audiences. They also thank Christopher Genovese for useful discussions regarding statistical theory and interpretation. S.F.D. thanks Eric Linder, Arman Shafieloo, and Jacob Vanderplas for useful conversations about Gaussian processes. S.F.D. also acknowledges the hospitality of the Institute for the Early Universe at Ewha Womans University in Seoul, Korea, who hosted him while some of this work was done. We acknowledge support from DOE award grant number DESC0002607 and NSF grant IIS-0844580.

APPENDIX: FREQUENTIST CONFIDENCE INTERVALS