Why Bayesian?

Bayesian learning uses Bayes' theorem to update the probability of a hypothesis as more evidence becomes available. This article explains how Bayesian learning can be used in machine learning. Bayesian approaches are believed to play a significant role in data science because of the following capabilities:

New observations or evidence can incrementally improve the estimated posterior probability

Incorporating prior knowledge/belief with the observed data to determine the final posterior probability

Ability to express uncertainty in predictions
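The first capability, incremental updating, can be illustrated with a small sketch. It assumes a Beta prior over a team's win probability and 0/1 match outcomes; the function names and all numbers are illustrative assumptions and are not part of the experiment described in this article.

```python
def update_beta(alpha, beta, outcomes):
    """Update a Beta(alpha, beta) belief with a batch of 0/1 outcomes."""
    wins = sum(outcomes)
    losses = len(outcomes) - wins
    return alpha + wins, beta + losses


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Prior belief (assumed): roughly a 50% win rate, weakly held.
prior = (2.0, 2.0)

# Evidence arrives in two batches; each posterior becomes the next prior.
after_first = update_beta(*prior, [1, 1, 0, 1])
after_second = update_beta(*after_first, [1, 0, 1, 1])

# Updating on all the evidence at once gives the same posterior.
all_at_once = update_beta(*prior, [1, 1, 0, 1, 1, 0, 1, 1])
```

The posterior after each batch serves as the prior for the next one, and the spread of the resulting Beta distribution expresses how uncertain the estimate still is.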

Moreover, Bayesian methods can be used to produce probabilistic predictions over several hypotheses rather than assigning each instance to a single hypothesis. Consider a simple scenario in which we have to predict which team is going to win the cricket world cup this year using the past performance of each team: classical machine learning techniques are typically used to predict only the winning team, whereas Bayesian learning can be used to determine the probability of each team winning the world cup. Yet Bayesian learning has been neglected in the area of machine learning due to the following reasons:

Difficulty in acquiring the initial knowledge (belief)

Significant computation cost that is required to perform the Bayesian inferences

However, in recent years Bayesian learning has been widely adopted and, in some cases, has outperformed other machine learning techniques. For example, recent competition winners have used Bayesian learning to build state-of-the-art solutions for certain machine learning challenges:

March Machine Learning Mania (2017) - 1st place (Used Bayesian logistic regression model)

Observing Dark Worlds (2012) - 1st place and 2nd place

Since Bayesian learning has shown its potential for solving complex prediction tasks with high accuracy, we wanted to explore its capabilities for general regression and classification tasks. Therefore, we evaluated classification and regression models that use Bayesian inference on several publicly available datasets. Furthermore, we provide a comparative analysis of the Bayesian models against baseline machine learning techniques with respect to accuracy and training time.

Experiment Setup

Bayesian Models

We used the PyMC3 probabilistic programming framework for our implementations. PyMC3 is written in Python and allows various Bayesian statistical models to be specified in code. The classification model was implemented as a Multinomial Logistic Regression model, whereas the regression was carried out using a linear regression model implemented with the Generalized Linear Model (GLM) module of PyMC3.

Let’s start with the linear regression model since that is somewhat simpler than the multinomial logistic regression model that was used for classification.

Regression Model - Linear Regression

In linear regression we try to fit the relationship between the dependent variable $Y$ and the predictor variables $X$ to a straight line, usually by minimizing the sum of squared errors (least squares). Consider a dataset with $N$ samples and $D$ features. We can represent the linear relationship between the predictor variables of the $i^{th}$ sample and its dependent variable using the following equation:

\begin{equation} y_i = \tau + \sum\limits^{D}_{j=1}w_{j}x_{ij} + \epsilon \end{equation}

where $w_j$ and $\tau$ are the coefficients to be determined. The error term is denoted by $\epsilon$.

In the Bayesian model we treat $\tau$ and $w_j$ as random variables (stochastic variables) whose distributions are to be inferred. Let’s assume that each of these random variables has a normal prior distribution.

\begin{equation} w_{j} \sim Normal(\mu_{j}, \sigma_{j}) \text{, where } j \in\Bbb Z\cap[1,D] \end{equation} \begin{equation} \tau \sim Normal(\mu_{0} , \sigma_{0}) \end{equation}

Let’s define the deterministic variable $\mu_i$ as follows:

\begin{equation} \mu_i = \tau +\sum\limits^{D}_{j=1}w_{j}x_{ij} \end{equation}

Then we define $y_i$ as a normally distributed random variable with mean $\mu_i$ and standard deviation $\sigma$. $\sigma$ is another random variable whose distribution is determined using Bayesian inference.

\begin{equation} y_i \sim Normal(\mu_i, \sigma) \end{equation} \begin{equation} \sigma \sim HalfCauchy(\beta) \end{equation}

Now that we have defined the Bayesian linear regression model, we can use the observations $Y$ to infer the posterior distribution of the model parameters $w_j$, $\tau$ and $\sigma$. Here $\mu_0$, $\sigma_0$, $\mu_j$, $\sigma_j$ and $\beta$ are the hyper-parameters of the model.
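To make the model definition concrete, the following is a minimal sketch of the unnormalized log posterior (log prior plus log likelihood) that an inference method would work with. It uses only the Python standard library rather than PyMC3, and the function names, the toy data, and the hyper-parameter values (`prior_mean`, `prior_sd`, `beta`) are assumptions chosen for illustration, not the settings used in the experiment.

```python
import math


def normal_logpdf(x, mean, sd):
    """Log density of Normal(mean, sd) at x."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)


def half_cauchy_logpdf(x, beta):
    """Log density of HalfCauchy(beta) at x; the density is zero for x <= 0."""
    if x <= 0.0:
        return -math.inf
    return math.log(2.0) - math.log(math.pi * beta) - math.log1p((x / beta) ** 2)


def log_joint(w, tau, sigma, X, y, prior_mean=0.0, prior_sd=10.0, beta=5.0):
    """Unnormalized log posterior of the Bayesian linear regression model."""
    if sigma <= 0.0:
        return -math.inf
    total = normal_logpdf(tau, prior_mean, prior_sd)           # tau ~ Normal
    total += sum(normal_logpdf(wj, prior_mean, prior_sd) for wj in w)  # w_j ~ Normal
    total += half_cauchy_logpdf(sigma, beta)                   # sigma ~ HalfCauchy
    for x_i, y_i in zip(X, y):
        mu_i = tau + sum(wj * xij for wj, xij in zip(w, x_i))  # deterministic mean
        total += normal_logpdf(y_i, mu_i, sigma)               # y_i ~ Normal(mu_i, sigma)
    return total


# Toy data (assumed), roughly following y = 1 + 2x.
X = [[1.0], [2.0], [3.0]]
y = [3.1, 4.9, 7.2]

good = log_joint([2.0], 1.0, 0.5, X, y)   # parameters close to the data
bad = log_joint([-2.0], 1.0, 0.5, X, y)   # parameters far from the data
```

Parameter values that explain the observations well receive a higher log posterior than values that do not, which is the quantity that both MAP estimation and MCMC sampling rely on.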

Multiclass Classification Model - Multinomial Logistic Regression (AKA Softmax Regression)

For the multiclass classification tasks we implemented a Multinomial Logistic Regression model. The multinomial regression model can be denoted as follows:

\begin{equation} \theta = \alpha + X.\beta \end{equation} \begin{equation} P(Y=y_i|X) = \frac{exp(\theta_{i})}{\sum\limits^{K}_{j=1}exp(\theta_{j})} \end{equation}

As in the linear regression model, $X$ and $Y$ denote the predictor variables and the dependent variable, respectively. However, here we have $K$ dependent variables, where $K$ is the number of classes. We could also have defined $K-1$ dependent variables and found the probability of class $K$ by subtracting the sum of the probabilities of the other $K-1$ classes from the total probability, which is equal to one. Each dependent variable $Y=y_i$ represents the hypothesis that an instance belongs to the $i^{th}$ class. Therefore, $P(Y=y_i|X)$ is the probability that an instance belongs to the $i^{th}$ class given the data $X$. $\beta$ is the coefficient matrix with dimensions $D\times K$, whereas $\alpha$ is a vector of size $K$. Here $X.\beta$ is the dot product between $X$ and the coefficient matrix $\beta$.
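The two equations above can be sketched in a few lines of plain Python. The values of `x`, `alpha` and `beta` below are hypothetical (two features, three classes), and subtracting the maximum score before exponentiating is a common numerical safeguard that does not change the result.

```python
import math


def softmax(theta):
    """Map K real-valued scores to K probabilities that sum to one."""
    m = max(theta)  # subtracting the max avoids overflow; the result is unchanged
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(x, alpha, beta):
    """theta_k = alpha_k + sum_d x_d * beta[d][k], with beta of shape D x K."""
    return [
        alpha[k] + sum(x[d] * beta[d][k] for d in range(len(x)))
        for k in range(len(alpha))
    ]


# Hypothetical values: D = 2 features, K = 3 classes.
x = [1.0, 2.0]
alpha = [0.0, 0.5, -0.5]
beta = [[0.2, -0.1, 0.0],
        [0.1, 0.3, -0.2]]

p = softmax(class_scores(x, alpha, beta))
```

Because the entries of `p` sum to one, the probability of the last class could equally be obtained as one minus the sum of the others, which is the $K-1$ formulation mentioned above.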

As with linear regression, in the Bayesian model for Softmax regression we treat the parameters $\alpha$ and $\beta$ as random variables, and we assume that each of them is normally distributed.

\begin{equation} \alpha \sim Normal(\mu_{0}, \sigma_{0}) \end{equation} \begin{equation} \beta \sim Normal(\mu_{i}, \sigma_{i}) \end{equation}

However, here $\alpha$ and $\beta$ are not just scalar random variables; they are a random vector and a random matrix, respectively.

Let’s define the deterministic variable $p$ as follows:

\begin{equation} p = Softmax(\alpha + X.\beta) \end{equation}

Now we can represent the observations $Y$ using a categorical distribution, which takes deterministic variable $p$ as its parameter.

\begin{equation} Y \sim Categorical(p) \end{equation}

Here $\mu$ and $\sigma$ are the hyper-parameters of the model.
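As a small illustration of the categorical likelihood, the sketch below computes the log likelihood of observed class labels given per-instance class probabilities $p$. The probabilities and labels are made-up values, and the function name is an assumption for this example.

```python
import math


def categorical_loglik(probs, labels):
    """Sum of log p[i][y_i] over instances, for Y ~ Categorical(p)."""
    return sum(math.log(p_i[y_i]) for p_i, y_i in zip(probs, labels))


# Hypothetical class probabilities for three instances and K = 3 classes.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1],
         [0.3, 0.3, 0.4]]

loglik_likely = categorical_loglik(probs, [0, 1, 2])    # labels match the highest p
loglik_unlikely = categorical_loglik(probs, [2, 0, 1])  # labels match low p
```

Labels that agree with the high-probability classes give a larger log likelihood, so parameters $\alpha$ and $\beta$ that produce such probabilities are favoured during inference.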

Training the Models

For both regression and classification models, we used three techniques to perform the Bayesian inference.

Bayesian with MAP: Maximum a posteriori (MAP) estimation is used for Bayesian inference

Bayesian with MCMC: Markov chain Monte Carlo (MCMC) sampling methods are used to learn the model parameters

Bayesian with MCMC (MAP start): Same as Bayesian with MCMC, except that the MAP estimate, rather than a random point, is used as the first step of the sampler

For logistic regression we used the Metropolis sampler, whereas the linear regression models were trained using the No-U-Turn Sampler (NUTS). If NUTS failed with an error, the Metropolis sampler was used instead. We drew 5000 samples for each dataset.
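The difference between the three techniques can be sketched on a toy problem. The code below is not the PyMC3 setup used in the experiment: it assumes a one-parameter model (a Normal mean with a Normal prior and known noise), finds the MAP estimate with a simple ternary search, and runs a hand-written random-walk Metropolis sampler from a random-style start and from the MAP start. All names, data values and settings are illustrative assumptions.

```python
import math
import random


def log_posterior(theta, data, prior_sd=10.0, noise_sd=1.0):
    """Unnormalized log posterior of a Normal mean with a Normal(0, prior_sd) prior."""
    log_prior = -0.5 * (theta / prior_sd) ** 2
    log_lik = sum(-0.5 * ((y - theta) / noise_sd) ** 2 for y in data)
    return log_prior + log_lik


def find_map(data, lo=-50.0, hi=50.0, iters=200):
    """MAP estimate by ternary search; valid because this log posterior is concave."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_posterior(m1, data) < log_posterior(m2, data):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)


def metropolis(data, start, n_samples=5000, step_sd=0.5, seed=0):
    """Random-walk Metropolis sampler with a symmetric Normal proposal."""
    rng = random.Random(seed)
    current = start
    current_lp = log_posterior(current, data)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step_sd)
        proposal_lp = log_posterior(proposal, data)
        # Accept with probability min(1, posterior ratio); 1 - random() lies in (0, 1].
        if math.log(1.0 - rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples


data = [4.8, 5.1, 5.3, 4.9, 5.4]                    # toy observations (assumed)
map_estimate = find_map(data)                       # "Bayesian with MAP"
random_start = metropolis(data, start=0.0)          # "Bayesian with MCMC"
map_start = metropolis(data, start=map_estimate)    # "Bayesian with MCMC (MAP start)"
```

MAP returns a single value, while each sampler returns 5000 draws that approximate the whole posterior. Starting from the MAP estimate means the chain begins in a high-probability region, so the early samples do not have to be discarded while the chain travels there.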

Baseline Models

Moreover, we trained several baseline models to compare against the performance of the Bayesian models.

Classification: XGBoost, Random Forest Classification, Logistic Regression (one-vs-rest)

Regression: Lasso Regression, Random Forest Regression



Datasets

We used several classification and regression datasets for our experiment. A brief summary of the datasets is shown in the following tables.

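| Dataset | # of samples | Test set (samples or % of data) | # of features | # of classes |
| --- | --- | --- | --- | --- |
| Adult dataset | 40016 | 7455 | 14 | 2 |
| Glass dataset | 215 | 20% | 9 | 6 |
| Iris | 151 | 20% | 4 | 3 |
| Optdigits | 5620 | 1797 | 64 | 10 |
| Titanic | 1309 | 20% | 4 | 2 |
| Bank | 41188 | 20% | 20 | 2 |
| IDA2016Challenge | 76000 | 16000 | 171 | 2 |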

Table 1: Summary of classification dataset

Table 2: Summary of regression dataset

We used datasets of varying complexity (in terms of the number of samples, features, and classes) for our experiment. Therefore, the evaluation results let us draw more general conclusions about the performance of Bayesian machine learning, rather than conclusions tied to the specific characteristics of a single dataset.

We used the standard test set for those datasets that have a standard test split. For the other datasets, we generated the test set by splitting the dataset randomly. The size of each such random test split is shown as a percentage of the dataset in Tables 1 and 2.
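A random 20% hold-out split of the kind described above can be sketched as follows. The helper name, the fixed seed, and the stand-in rows are assumptions for illustration; the article does not state how the split was implemented.

```python
import random


def random_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out a fraction as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(150))  # stand-in for 150 dataset rows
train_rows, test_rows = random_split(rows)
```

Fixing the seed keeps the split reproducible, so every model is trained and evaluated on the same rows.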

Experiment Results

We collected various statistics, such as the accuracy of the models (mean squared error for regression models) and the time required to train each model. For simplicity, let’s discuss the statistics for classification and regression separately.
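For reference, the two quality measures can be computed as in the following sketch. The function names and the small example inputs are illustrative assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true class labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


acc = accuracy([0, 1, 2, 1], [0, 1, 1, 1])                   # 3 of 4 labels correct
mse = mean_squared_error([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])   # (1 + 0 + 4) / 3
```

Higher accuracy is better for classification, whereas lower mean squared error is better for regression.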

| Dataset | XGBoost | Random Forest Classifier | Logistic Regression | Bayesian (Random Start) | Bayesian (MAP Start) | Bayesian (MAP) |
| --- | --- | --- | --- | --- | --- | --- |
| Optdigits | 95.77% | 93.93% | 94.71% | 90.15% | 94.10% | 94.16% |
| Adult | 85.47% | 83.42% | 82.29% | 76.53% | 76.02% | 76.02% |
| Glass | 85.19% | 77.78% | 57.41% | 0.00% | 38.89% | 38.89% |
| Iris | 92.11% | 92.11% | 92.11% | 97.37% | 92.11% | 92.11% |
| Titanic | 78.96% | 78.35% | 79.27% | 80.49% | 78.66% | 80.49% |
| Bank | 91.87% | 91.29% | 91.28% | 88.95% | 88.95% | 88.95% |
| IDA2016Challenge | 98.90% | 98.58% | 98.41% | 97.66% | 97.66% | 97.66% |

Table 3: Classification accuracy of models for each dataset

Classification Results

Chart 1: Classification accuracy comparison

If we look at Chart 1, we can observe that, with a few exceptions, the Bayesian classifiers come close to the benchmark classification models in terms of accuracy. The clearest exceptions are the Adult dataset, where the Bayesian models trail the baselines by roughly 6 to 9 percentage points, and the Glass dataset, where the Metropolis sampler and MAP failed to converge to a better solution than the one they started from. For the simple classification datasets, Iris and Titanic, at least one Bayesian classifier achieves the highest accuracy (97.37% versus 92.11% on Iris, and 80.49% versus 79.27% on Titanic).

Chart 2: Classification - Training time

Another important observation is that MAP estimation gives results similar to those of the Bayesian models that use MCMC sampling algorithms. This raises the question of whether the only advantage of using Bayesian samplers instead of single point estimation techniques such as MAP is the ability to determine the confidence interval of the posterior predictions. However, in Chart 2 we can observe that the sampling techniques take longer than all the other techniques; in some cases, the training time is more than 1000 times longer. I intentionally left the training time for IDA2016Challenge out of the chart so that the lower values remain visible.
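The contrast between a point estimate and a set of posterior samples can be sketched as follows. The samples here are simulated draws from an assumed Normal(5.0, 0.5) posterior standing in for MCMC output, and the interval is an equal-tailed interval read directly from the sorted samples; none of these values come from the experiment.

```python
import random


def credible_interval(samples, mass=0.95):
    """Equal-tailed interval containing `mass` of the posterior samples."""
    ordered = sorted(samples)
    tail = (1.0 - mass) / 2.0
    lo_idx = int(tail * (len(ordered) - 1))
    hi_idx = int((1.0 - tail) * (len(ordered) - 1))
    return ordered[lo_idx], ordered[hi_idx]


# Stand-in for MCMC output: draws from an assumed posterior Normal(5.0, 0.5).
rng = random.Random(1)
samples = [rng.gauss(5.0, 0.5) for _ in range(5000)]

point_estimate = sum(samples) / len(samples)  # a single number, as a point estimate gives
low, high = credible_interval(samples)        # a range that expresses uncertainty
```

A point estimate such as MAP provides only the first kind of output, so the interval is what the extra sampling time buys.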

Regression Results

| Dataset | Lasso Regression | Random Forest Regression | Bayesian (Random Start) | Bayesian (MAP Start) | Bayesian (MAP) |
| --- | --- | --- | --- | --- | --- |
| KC House Data | 4.28E+10 | 1.75E+10 | 6.09E+10 | 5.72E+10 | 1.84E+11 |
| Financial Distress | 2.82E+00 | 6.21E+00 | 2.34E+00 | 2.68E+00 | 1.61E+00 |
| Boston | 9.93E+01 | 9.92E+00 | 2.08E+01 | 2.08E+01 | 2.13E+01 |
| Winequality Red | 6.55E-01 | 3.36E-01 | 4.33E-01 | 4.37E-01 | 4.35E-01 |
| Diabetes | 3.09E+03 | 2.87E+03 | 2.87E+03 | 2.87E+03 | 2.83E+03 |
| CCPP | 3.02E+02 | 1.19E+01 | 3.34E+01 | 2.07E+01 | 2.55E+01 |

Table 4: Mean squared error of models for each dataset

Table 4 shows the Mean Squared Error (MSE) of the models that we evaluated for each dataset. However, for the KC House, Financial Distress and CCPP datasets, NUTS failed during sampling for Bayesian (Random Start). We believe that this failure is due to some undefined calculation (such as division by zero) during the gradient computation of the NUTS sampler. In those cases, the Metropolis sampler was used as the Bayesian inference technique.

For the regression models there are a few interesting observations. Notice that at least one of the three Bayesian regression models outperforms Lasso Regression on every dataset except KC House. This is surprising because Lasso Regression is a regularized version of linear regression and Random Forest Regression is a more flexible, non-linear model. Lasso Regression performs feature selection and regularization by penalizing the absolute values of the regression coefficients, whereas Random Forest Regression uses an ensemble of randomized decision trees. Therefore, both are expected to be more accurate than a simple linear regression model. Yet the linear regression model that uses Bayesian inference outperforms the Lasso regression model, and it even outperforms Random Forest Regression on the Financial Distress dataset and, marginally and with MAP only, on the Diabetes dataset.
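The Lasso penalty mentioned above can be written down in a few lines. The sketch below uses one common convention (mean squared error plus `lam` times the sum of absolute coefficients); libraries differ in how they scale the two terms, and the function name, data and `lam` value are assumptions for illustration.

```python
def lasso_objective(w, tau, X, y, lam):
    """Mean squared error plus lam times the sum of absolute coefficients."""
    n = len(y)
    squared_error = 0.0
    for x_i, y_i in zip(X, y):
        prediction = tau + sum(wj * xij for wj, xij in zip(w, x_i))
        squared_error += (y_i - prediction) ** 2
    l1_penalty = sum(abs(wj) for wj in w)
    return squared_error / n + lam * l1_penalty


# Toy data (assumed): the second feature is not needed to fit y.
X = [[1.0, 0.0], [2.0, 1.0]]
y = [2.0, 4.0]

sparse = lasso_objective([2.0, 0.0], 0.0, X, y, lam=0.1)  # second coefficient is zero
dense = lasso_objective([2.0, 0.5], 0.0, X, y, lam=0.1)   # second coefficient is used
```

Because every non-zero coefficient adds to the penalty, the objective favours solutions in which unhelpful coefficients are exactly zero, which is how Lasso performs feature selection.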

Chart 3: Regression - Training time

Another interesting observation is that Bayesian with MCMC (Random Start) has a training time similar to that of the Lasso and Random Forest Regression models. We expected Bayesian with MCMC (MAP Start) to take less time than Bayesian with random start, because the MAP start gives the sampler a better starting point instead of letting it wander around the solution space from a random starting step. Yet it takes much longer to draw the same number of samples than Bayesian with random start. This is the opposite of what we observed for classification, where Bayesian with random start takes much more time than Bayesian with MAP start. The difference arises because the Metropolis sampler was used for Bayesian with random start in the cases where NUTS failed. For those same datasets, Bayesian with MAP start does not fail, but NUTS seems to take much more time to search for optimal samples.

Discussion and Conclusion

Here I have listed the other observations, including those discussed above:

Bayesian classifiers outperform the baseline classifiers for simple datasets (in terms of the number of dimensions, classes, and instances) such as Iris and Titanic.

For the Glass dataset, the MCMC sampler fails to converge to a better point than the one it started from. Therefore, the accuracy is zero for the Bayesian (Random Start) model.

The Bayesian MAP estimation shows similar or better performance (in terms of accuracy and speed) compared to the MCMC sampling optimization.

Bayesian regression has a competitive accuracy compared to the baseline regression models for most of the datasets.

For the KC House, Financial Distress and CCPP datasets, NUTS failed with an error when using a random starting step (when NUTS fails, the Metropolis sampler is used).

The NUTS sampler does not fail for any regression dataset when used with the MAP start.

Yet for those regression datasets where NUTS failed with the random start, Bayesian with MAP start takes a longer time to converge.

MAP estimation performs similarly to or better than the Bayesian sampling methods on the larger training datasets. This observation is expected, since the posterior distribution typically concentrates around its mode as more data becomes available, so the MAP point estimate yields predictions close to those of the full posterior.

However, Bayesian sampling methods take longer to train (even 1000 times longer for some datasets) than the other benchmark models.

Yet, the MAP estimation can be performed in less time with similar accuracy compared to the Bayesian sampling methods.

We can derive the following conclusions from the above observations.

We can successfully use Bayesian learning for machine learning with competitive accuracies compared to baseline models for both classification and regression tasks.

Moreover, Bayesian learning can add more flexibility and customizability to models.

Bayesian inference with MAP estimation is sufficient for machine learning when only a prediction is required. However, we may require Bayesian inference with MCMC sampling or other approximation techniques if the confidence interval of the predictions is to be determined quantitatively.

However, Bayesian inference with MCMC sampling can be inefficient compared to MAP estimation or other machine learning techniques in terms of the time taken to learn the parameters (train the model).

The Bayesian classifiers that use MCMC sampling failed to converge to a better solution from the given starting step for the Glass dataset. However, we were unable to determine the exact reasons for that observation.

We observed that MAP estimation is capable of providing competitive performance compared to MCMC sampling. This observation raises the question of whether the only advantage of using posterior probability distributions instead of MAP is the ability to determine the confidence interval of the posterior prediction.

NUTS failed with an error for three datasets. Moreover, it performed poorly on the datasets where it did work: even though NUTS was expected to be faster than the Metropolis sampler, it took longer to draw the same number of samples. It is still unknown whether these observations are due to issues with the implementation of NUTS in PyMC3 or to complications in using NUTS with the specific datasets.

We need to explore further to understand how factors such as the number of features, the number of samples and the complexity of the task affect the execution time of MCMC sampling. Even the specific Bayesian model structures used for the experiment could be the reason for certain observations, such as the long time taken to converge when sampling is used. Therefore, understanding the nature of Bayesian learning requires extensive analysis that considers a wide range of such aspects.