Abstract We give an overview of the basic principles of approximate Bayesian computation (ABC), a class of stochastic methods that enable flexible and likelihood-free model comparison and parameter estimation. Our new open-source software called ABrox is used to illustrate ABC for model comparison on two prominent statistical tests, the two-sample t-test and the Levene test. We further highlight the flexibility of ABC compared to classical Bayesian hypothesis testing by computing an approximate Bayes factor for two multinomial processing tree models. Last but not least, throughout the paper, we introduce ABrox using the accompanying graphical user interface.

Citation: Mertens UK, Voss A, Radev S (2018) ABrox—A user-friendly Python module for approximate Bayesian computation with a focus on model comparison. PLoS ONE 13(3): e0193981. https://doi.org/10.1371/journal.pone.0193981
Editor: Yong Deng, Southwest University, CHINA
Received: May 3, 2017; Accepted: February 22, 2018; Published: March 8, 2018
Copyright: © 2018 Mertens et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All datasets containing the simulation results are available from the Zenodo data repository (https://zenodo.org/record/1010125#.WeB0VYb-ui4).
Funding: We acknowledge financial support by Deutsche Forschungsgemeinschaft and Ruprecht-Karls-Universität Heidelberg within the funding programme Open Access Publishing (UKM).
Competing interests: The authors have declared that no competing interests exist.

Introduction
Approximate Bayesian computation (ABC) is a computational method founded in Bayesian statistics. ABC enables parameter estimation as well as model comparison when the probability of the observed data under the model under consideration is unknown, or, put differently, when the likelihood function is not available or computationally intractable. The idea of ABC is to bypass the computation of the likelihood by comparing model predictions with observed data, thus taking a simulation-based approach. In psychological research, classical null hypothesis significance testing (NHST) based on the p-value has increasingly fallen into disrepute due to its many weaknesses [1–6]. One main criticism, especially stressed by Wagenmakers and colleagues [7], is the fact that p-values only depend on what is expected under the null hypothesis (H0) but not on what is expected under the alternative hypothesis (H1). Hence, a small p-value only states that the observed data are unlikely under H0, but they might be even more unlikely under H1. Furthermore, the p-value is often misinterpreted (e.g., as the probability of the null hypothesis being true) [8]. As a solution, many authors suggest the use of Bayes factors, as they quantify evidence for both H0 and H1 and can easily be interpreted as the relative evidence for one hypothesis with respect to another [1, 2, 4, 7, 8]. Despite the advantages of Bayes factors, a shortcoming is that their computation is often hard or even impossible, which in turn narrows their applicability [9]. Wagenmakers and colleagues [5] address this aspect by stating: “In order to profit from the practical advantages that Bayesian parameter estimation and Bayes factor hypothesis tests have to offer it is vital that the procedures of interest can be executed in accessible, user-friendly software package” (p. 4). They provide such a software package, called JASP [10], in order to ease the process of Bayesian hypothesis testing.
With JASP, they continue, “[…] users are able to conduct classical analyses as well as Bayesian analyses, without having to engage in computer programming or mathematical derivation” (p. 4). Yet, users of JASP are restricted when it comes to the specification of user-defined prior distributions or the application of non-standard tests. In addition, users are limited in the application of Bayesian hypothesis testing in the sense that JASP always assumes a point null hypothesis, and thus the models being compared are always nested. Although such a scenario is undoubtedly the most common one in psychological research, it is desirable to extend the methods for Bayesian model comparison to allow for non-nested model comparison. ABrox addresses exactly these issues by providing a flexible toolbox for estimating Bayes factors in areas where an exact mathematical solution is not available. The outline of the paper is as follows. First, we will give an introduction to ABC with a focus on model comparison (technical details on ABC can be found elsewhere, [11, 12] or [13]). We will then introduce our software called ABrox and demonstrate its validity by analysing two standard statistical tests: the Bayesian two-sample t-test and the Levene test. Although there exists a Bayesian version of the Levene test [14], a default Bayes factor is only available for the two-sample t-test [15]. Lastly, we show the flexibility of ABrox by comparing two multinomial processing tree models applied to the weapon-misidentification task [16, 17].

Approximate Bayesian computation for model selection
As previously mentioned, ABC can be used for both parameter estimation and model comparison. If one is interested in the latter, the general idea of ABC methods is to test how well simulated data from different models can mimic the observed data. If a model does not represent the true data-generating process of the observed data reasonably well, then simulated data from this model will not resemble the observed data adequately. Imagine, for instance, that some data follow a quadratic trend. Trying to simulate such data with a model assuming a linear trend will fail most of the time. After multiple runs of the algorithm, the models are compared by counting the number of times each one was able to generate data that resemble the observed data sufficiently well. One common algorithm is the ABC rejection algorithm for model selection, in its basic form introduced by Rubin (1984) [18] (see explanation below):

1. Draw m* from the prior P(m)
2. Sample θ* from the prior P(θ | m*)
3. Simulate a candidate dataset D* ∼ f(D | θ*, m*)
4. Compute the distance. If d(D0, D*) ≤ ϵ, accept (m*, θ*); otherwise reject it.
5. Return to 1 until k particles are accepted.

First (step 1), one of the models under consideration (m*) is picked based on the (prior) probabilities of the models. In step 2, given the model, parameters (θ*) are drawn from the prior distributions of the parameters of the chosen model (P(θ | m*)). Using model and parameters, data (D*) are simulated in step 3 and then compared with the observed data in step 4. For this purpose, a distance measure d has to be defined that quantifies the dissimilarity of simulated and observed data. If the distance d does not exceed a predefined threshold (ϵ), the so-called particle containing both the model indicator and the sampled parameters (m*, θ*) is accepted and stored. The process is repeated N times until k particles have been accepted.
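As a minimal sketch of the rejection scheme above, consider two toy Gaussian models: model 0 fixes the mean at 0, model 1 draws it from a standard normal prior. The models, the uniform model prior, the sample mean as summary statistic, and the tolerance eps = 0.05 are all illustrative assumptions, not prescribed by the algorithm itself:

```python
import numpy as np

rng = np.random.default_rng(42)
observed = rng.normal(loc=1.0, scale=1.0, size=100)  # toy "observed" data

def summary(data):
    return np.mean(data)  # a single summary statistic

def simulate(m, theta, n):
    # model 0 fixes the mean at 0; model 1 uses the sampled parameter as mean
    return rng.normal(loc=0.0 if m == 0 else theta, scale=1.0, size=n)

def rejection_abc(observed, k=200, eps=0.05):
    s_obs = summary(observed)
    particles = []
    while len(particles) < k:
        m = int(rng.integers(0, 2))                 # step 1: draw model index (uniform prior)
        theta = rng.normal(0.0, 1.0)                # step 2: draw parameter from its prior
        d_sim = simulate(m, theta, len(observed))   # step 3: simulate a candidate dataset
        if abs(summary(d_sim) - s_obs) <= eps:      # step 4: accept if close enough
            particles.append((m, theta))            # step 5: repeat until k particles
    return particles

particles = rejection_abc(observed)
counts = np.bincount([m for m, _ in particles], minlength=2)
post = counts / counts.sum()  # approximate posterior model probabilities
```

With uniform model priors the prior odds cancel, so an approximate Bayes factor reduces to the ratio of the approximated posterior model probabilities (provided both counts are nonzero).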
The number of accepted particles belonging to a model m′, divided by the total number of accepted particles, approximates the marginal posterior probability of that model, stating how likely the model is given the data at hand (D0) [11]:

p(m′ | D0) ≈ (number of accepted particles with model index m′) / k   (1)

With this metric, we can calculate the Bayes factor (BF) as the ratio of marginal likelihoods (p(D0 | m1)/p(D0 | m2)). The Bayes factor (BF12) describes the amount of change from prior odds to posterior odds after the data have been observed [5]:

BF12 = p(D0 | m1) / p(D0 | m2) = [p(m1 | D0) / p(m2 | D0)] · [p(m2) / p(m1)]   (2)

In Eq 2, p(m2)/p(m1) is referred to as the model prior odds, indicating how likely one model (m2) is relative to the other (m1) before seeing the data. p(m1 | D0)/p(m2 | D0) is the ratio of marginal posterior probabilities as approximated via ABC.

Summary statistics
In step 4 of the aforementioned algorithm, the distance could be computed by taking every data point in the observed and simulated datasets into account. While computing the distance based on every data point is theoretically reasonable, it is infeasible in most settings. If the datasets are large, there is always some deviation between simulated and observed data (the distance would only be zero if every simulated data point were exactly equal to the corresponding observed one). Hence, the acceptance rates would be extremely low if the distance threshold (ϵ) were too strict. However, simply increasing the distance threshold to account for the low acceptance rates would not be reasonable either, because exceedingly large thresholds distort the approximation [19]. To account for these problems, the distance d between observed and simulated data is often based on a summary statistic s(D) (e.g., mean and (co)variance). While computing the distance on summary statistics suffers less from the curse of dimensionality, it often comes with a loss of information. Didelot et al. [12] showed that insufficient summary statistics, which do not capture the whole information in the raw data, can lead to inconsistencies.
Hence, ABC model selection may fail to recover the true model. The challenge is to balance consistency and information loss [12]. The approximation of the true Bayes factor therefore depends on the summary statistics capturing the necessary information in the data.

The threshold ϵ
Besides picking an appropriate summary statistic, the notion of simulated data being “close enough” to the observed data has to be defined. Users interested in ABC have to set a specific threshold value ϵ on a distance metric in order to decide whether a particle (m*, θ*) should be accepted or discarded. The smaller the threshold ϵ, the better the marginal posterior distribution approximates the true posterior distribution [11]. If the threshold is too liberal, too many particles are accepted and the resulting posterior distribution is not a good approximation. If, on the other hand, the threshold is too strict, it might take very long to find the required number of particles with distances smaller than ϵ, and the procedure becomes computationally inefficient. The challenge is to find an ϵ that leads to a good approximation while retaining a tolerable running time of the algorithm. To tackle the problem of finding a good threshold, Beaumont, Cornuet, Marin, and Robert [20] proposed to select ϵ as the kth percentile of distances computed prior to the actual start of the algorithm (with k being a small value such as 5 or 1).
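This percentile heuristic can be sketched in a few lines; the pilot distances below are made-up stand-ins for distances collected in a pilot run, and the choice of the 5th percentile is one of the small values mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pilot run: collect distances for many simulated datasets before the
# actual ABC run (here replaced by made-up absolute-normal distances).
pilot_distances = np.abs(rng.normal(size=10_000))

# Set the tolerance to the 5th percentile of the pilot distances.
eps = np.percentile(pilot_distances, 5)

# By construction, roughly 5% of pilot particles fall below this tolerance.
acceptance_rate = np.mean(pilot_distances <= eps)
```

The same pilot distances can then be reused in the main run, so the heuristic costs little beyond the simulations one would perform anyway.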

Introducing ABrox
We developed ABrox as an open-source Python module which enables approximate Bayesian model comparison and parameter estimation. With ABrox, we introduce a graphical user interface (GUI) designed as a tool for all-purpose ABC, making the methods much more accessible to researchers interested in applying ABC. The GUI is especially useful for researchers without much prior knowledge of ABC. All necessary steps are separated into small chunks to keep the structure clear and simple. Users can also save and load their projects, which allows for good maintenance and easy sharing with collaborators. Moreover, prior distributions are automatically visualized for a better user experience. ABrox provides high flexibility regarding the specification of mathematical models and is especially useful for Bayesian model comparison when none of the traditional methods is suitable, the usual assumptions underlying those methods are not met, the models of interest are not nested, or available software does not permit changing prior settings. Detailed instructions on how to install ABrox can be found online (see https://github.com/mertensu/ABrox).

Comparison to other software packages
The main advantages of ABrox compared to other software packages such as the R package abc [21] or the Python module ABC-SysBio [22] are its domain-independence, meaning that it is not designed for a specific field of research, and its ease of use due to the GUI. The software DIYABC [23] is another attempt at simplifying the use of ABC by providing a GUI for experimental biologists. DIYABC provides a convenient way to conduct ABC inferences about population history using SNP (single nucleotide polymorphism), DNA sequence, and microsatellite data, the areas of research it was designed for. Like ABrox, DIYABC does not necessitate prior experience with a programming language.
Because DIYABC focuses on these specialized fields of research, the process of simulating data and computing summary statistics is handled by the program itself, which is a helpful feature for users; at the same time, however, this restricts the software to those fields.

The ABC reference table
The reference table in ABC is the starting point for most algorithms. It is a large table containing in each row the model index (only used for model comparison), the summary statistics of a simulated dataset, the parameter(s) used for simulating the data, and the distance with respect to the observed summary statistics. The reference table contains as many rows as there are simulated datasets. ABrox usually generates the reference table internally, but it is also possible to import an external reference table stored in a comma-separated value (csv) file. The header of the csv-file needs to satisfy the following conditions: there has to be a column named idx containing the model index; if there are k summary statistics, each value has to be stored in a separate column named s_ followed by a number (usually 1 to k); the same logic applies to the parameters, where each parameter column name should start with p_; in addition, there has to be a column named distance containing the distance to the observed summary statistics.

Features of ABrox
Implemented algorithms. For model comparison, either the rejection algorithm or a random forest approach can be chosen. Random forests [24] are a supervised learning algorithm used extensively in machine learning. Given a sample of input-output pairs, a random forest learns a mapping from the input to the output by training multiple decision trees (a standard non-parametric algorithm for classification and regression) and aggregating their decision-function outputs.
It incorporates the techniques of bootstrap aggregation and random feature selection to reduce over-fitting and thus increase the predictive generalization of the model. In the context of ABC, random forests learn to map summary statistics to either model indices or parameters of a model. Readers interested in more details should consult the work by Pudlo et al. (2015) [25]. Besides model comparison, parameter estimation can be conducted either via the basic rejection algorithm or via a Markov chain Monte Carlo (MCMC) based algorithm by Wegmann [26].

Cross-validation. ABrox allows for cross-validating the results of both model comparison and parameter estimation using leave-one-out cross-validation as used in [21]. In leave-one-out cross-validation, a randomly chosen simulated summary statistic is treated as a pseudo-observed summary statistic, and an ABC algorithm is run in order to estimate parameters or a Bayes factor (posterior model probability). This procedure is repeated N times. In the case of parameter estimation, ABrox then provides a prediction error (see Eq 3) stating how different the predicted parameters (θ̂) are from the true parameters (θ):

Epred = Σi (θ̂i − θi)² / (N · Var(θ))   (3)

Because the information provided by Eq 3 is limited, an additional pdf-file is automatically saved. This file contains one plot for each parameter in which the estimated parameter is plotted against the true parameter. When performing model comparison, ABrox presents a confusion matrix stating how often the pseudo-observed summary statistics were correctly assigned to the model they were generated from. In this matrix, the diagonal elements should be large and the off-diagonal elements should be small. The confusion matrix is also stored as a heatmap-style pdf-file to ease the interpretation of the results.

The building blocks of ABrox
To calculate a Bayes factor, the user needs to provide several pieces of information:
- the data to be analyzed (optional),
- the prior distributions for all parameters of all models,
- functions to simulate data from the models,
- a function to compute summary statistics from data,
- a function to compute the distance between the summary statistics of observed and simulated data (optional).

The functions to compute the summary statistics and the distance are the same for all compared models. Prior distributions are internally stored as Python dictionaries, whereas the simulate-, summary-, and distance-functions are user-defined functions.

Data import in ABrox. The data tab in ABrox shown in S1 Fig is used to import an external data file. The file should be stored as a comma-separated value file (.csv). The imported data can be inspected and modified in the data tab. Note that users are able to access the data in the embedded Python Console at the bottom by typing data.

The prior distribution in ABrox. Prior distributions for the parameters of each model have to be specified. S2 Fig shows the prior settings tab of ABrox with a parameter called x whose prior is a standard normal distribution. In the current version of ABrox, a total of nine different prior distributions are available.

The simulate-function in ABrox. As a next step, the user has to write a Python function for each model that simulates data. The simulated dataset has to be in the form of a NumPy array [27]. The simulate-function can be written in the simulate tab (S3 Fig). Note that the name of the function defaults to simulate and should not be changed. The function has only one argument (params), a dictionary whose keys correspond to the names of the specified parameters and whose values contain one sampled value from the respective prior distribution. Thus, each parameter value can easily be accessed by the parameter's name.
The user has to take care that the format of the imported (observed) data is identical to the format of the simulated data. Put differently, the return type of the simulate-function has to be the same as the type of the imported data. If, for instance, the observed data is a multidimensional array with 100 rows and two columns, the simulate-function has to return a multidimensional array of the same shape. If this is not the case, ABrox will throw an error message.

The summary- and distance-function in ABrox. After the specification of each model (priors and simulate-function), a new Python function has to be written that calculates the summary statistics from the data. The function takes as input the output of the simulate-function specified in the previous step and should return the summary statistics, which can be a scalar or a vector. In ABrox, the user can optionally write a user-defined distance function for computing the distance between observed and simulated summary statistics. If the function is left empty, ABrox automatically uses the default distance metric, the Euclidean distance scaled by the median absolute deviation (MAD).

The settings tab in ABrox. In the settings tab (S4 Fig), the working directory has to be set. Next, the type of analysis, that is, model comparison or parameter estimation, has to be selected. Note that ABrox allows comparing more than two models by simply adding more models to the Project Tree. Each model needs to be specified with a unique simulate-function and the corresponding prior distribution(s) for its parameter(s). If the rejection algorithm is selected, ABrox calculates a matrix containing approximate Bayes factors. The random forest approach, on the other hand, calculates posterior model probabilities. In the case of parameter estimation, a summary of the posterior distributions of the parameters is returned.
In addition, a data frame with all samples from the posterior of each parameter is saved in the working directory. The Console Panel at the bottom of S4 Fig shows the progress of the computation and informs the user as soon as the computation is finished. The results are stored inside a Python variable which can be accessed from the integrated Python Console. The user interface of ABrox generates a Python script in the working directory with all the configurations specified in the GUI. This feature is especially useful if users want to use ABrox directly within the Python framework. Besides the obligatory settings, we added the option to generate pseudo-observed data from a model instead of having to import data (Model Test Settings). With this option, one can check whether the approximate Bayes factor favors the model the data have been simulated from, or whether the parameters chosen for the pseudo-observed data can be estimated accurately.
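To make the building blocks concrete, the three user-supplied functions might look as follows for a simple one-parameter Gaussian model. The parameter name x, the sample size, and the chosen summaries are illustrative; only the calling conventions (a params dictionary, a NumPy array return value, and summary and distance functions operating on that array) follow the description above:

```python
import numpy as np

def simulate(params):
    """Simulate one dataset; params maps parameter names to values
    sampled from the respective prior distributions."""
    return np.random.normal(loc=params["x"], scale=1.0, size=100)

def summary(data):
    """Reduce a (simulated or observed) dataset to summary statistics."""
    return np.array([np.mean(data), np.var(data)])

def distance(s_sim, s_obs):
    """Optional user-defined distance between summary vectors; if left
    empty in the GUI, ABrox falls back to its default, a MAD-scaled
    Euclidean distance. Here: plain Euclidean distance for brevity."""
    return float(np.linalg.norm(s_sim - s_obs))
```

For example, summary(simulate({"x": 0.0})) yields a length-2 summary vector, and the distance of any summary vector to itself is 0.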
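The leave-one-out cross-validation for model comparison described under Features of ABrox can also be sketched. The snippet below is a simplified stand-in: instead of a full ABC run per held-out summary, each pseudo-observed summary is classified by a majority vote among its nearest reference summaries, and the two toy models and all numbers are made up:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy reference table: 100 summaries per model; model 0 summaries are
# centered at 0.0, model 1 summaries at 2.0.
idx = np.repeat([0, 1], 100)
s = np.concatenate([rng.normal(0.0, 0.5, 100), rng.normal(2.0, 0.5, 100)])

def classify(i, n_keep=10):
    """Treat summary i as pseudo-observed and vote among the n_keep
    closest remaining reference summaries."""
    others = np.delete(np.arange(len(s)), i)
    nearest = others[np.argsort(np.abs(s[others] - s[i]))[:n_keep]]
    return np.bincount(idx[nearest], minlength=2).argmax()

# Confusion matrix: rows are true models, columns are predicted models.
confusion = np.zeros((2, 2), dtype=int)
for i in range(len(s)):
    confusion[idx[i], classify(i)] += 1
```

As described above, a reliable model-comparison setup should yield large diagonal entries and small off-diagonal entries in this matrix.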

Application example 2: A Bayesian Levene test
In this section, we show the flexibility of ABC model comparison by implementing a Bayesian Levene test for two samples. The corresponding ABrox project file can be downloaded at https://github.com/mertensu/ABrox/tree/master/project_files. The Levene test checks whether the variances of different samples are equal (homogeneous). Assuming two samples, the null hypothesis (H0) is specified as:

H0: σ1² = σ2²   (4)

Using the ABrox framework, we construct two models. The model representing H0 is restricted in the sense that both population variances are homogeneous. Under the alternative hypothesis (H1), however, the variances of the two groups are allowed to differ. Note that the H0 model has one parameter (the identical variance within groups) whereas the H1 model has two parameters (one variance for each sample). Although this parameterization is valid, we reparameterized both models such that H0 has no parameter (we fixed the variance in both groups to 1). Following the approach for the two-sample t-test, we chose for H1 a prior specifying the ratio of the two variances. A prior for a ratio should have a mean of 1 and has to be strictly positive. In order to meet both conditions, we chose a gamma distribution with shape k = 8 and scale θ = 0.125. The distribution has the form shown in Fig 4.
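Under this reparameterization, the two simulate-functions might look as follows. The group means, the sample size, and the treatment of the gamma draw as a variance ratio are illustrative assumptions, not prescribed by ABrox; note that the gamma prior's mean is shape times scale, i.e. 8 × 0.125 = 1:

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_h0(params, n=50):
    """H0: both groups share the fixed unit variance (no free parameter)."""
    return rng.normal(0.0, 1.0, size=(2, n))

def simulate_h1(params, n=50):
    """H1: the second group's variance is scaled by a ratio drawn from
    the Gamma(shape=8, scale=0.125) prior, which has mean 1 and is
    strictly positive."""
    group1 = rng.normal(0.0, 1.0, size=n)
    group2 = rng.normal(0.0, np.sqrt(params["ratio"]), size=n)
    return np.stack([group1, group2])

ratio = rng.gamma(shape=8.0, scale=0.125)  # one draw from the prior
data = simulate_h1({"ratio": ratio})
```

Both functions return arrays of the same shape, so the same summary and distance functions can be applied to either model's output, as required by the framework.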


Fig 4. Gamma distribution used as the prior for the ratio of standard deviations. Parameters are k = 8 and θ = 0.125. https://doi.org/10.1371/journal.pone.0193981.g004

Simulation results
For the Levene test, we simulated datasets with a ratio of variances varying uniformly between 1 (homogeneity) and 2. For each of two sample sizes (N = 50 and N = 200), we simulated 1000 datasets. Fig 5 shows the relationship between p-values from the Levene test and approximate Bayes factors. The vertical lines in Fig 5 represent the critical value of α = 0.05. All p-values to the left of this line would lead to the decision of accepting H1 (the variances are heterogeneous). The horizontal lines correspond to a BF10 of 1. Therefore, all values above this line would indicate support for H1, whereas all Bayes factors below this line would indicate support for H0 (homogeneous variances). Although p-value and Bayes factor often lead to the same decision about the homogeneity of variances, the left plot of Fig 5 shows that there are many datasets where the decisions differ if the sample size is small. The area above the horizontal line and to the right of the vertical line shows datasets where the Bayes factor supports heterogeneity of variances although the result of the Levene test is not significant (p > .05).


Fig 5. Relationship of p-values and approximated Bayes factors for a Levene test; left: N = 50, right: N = 200. https://doi.org/10.1371/journal.pone.0193981.g005

Concluding remarks
This article introduced the reader to ABrox, a Python module for approximate Bayesian computation with a user-friendly interface. We demonstrated the similarity of approximate and default Bayes factors on two prominent statistical tests and showed the flexibility of ABC in a comparison of multinomial processing tree models. ABrox defaults to the rejection algorithm for both parameter inference and model choice. However, more sophisticated algorithms are implemented as well. For parameter inference, a Markov chain Monte Carlo (MCMC) based algorithm by Wegmann [26] is available. For model comparison, there is also the option to use random forests to calculate posterior model probabilities [25]. Moreover, we plan to extend the set of available machine-learning algorithms for ABC by adding neural networks to the toolbox. With ABrox, we hope to lower the burden for researchers wishing to apply ABC methods in their own labs.