You can find the code on Github at:

https://github.com/crhuffer/Insurance

Goal

Today I wanted to build an initial model and code base for a predictive model on a somewhat random data set. I intend this to be the starting point for this blog series: a set of experiments with different data science methods and techniques on a variety of data sets, a playground to satisfy some of my curiosity about what works, how well, and when.

Data

The data set that I chose as a starting point is a small insurance data set on Kaggle that I know very little about. I didn’t read the description of the data, so I am starting with a relatively blank slate. I did look through the column names and the number of rows, which made it clear that I could try to predict the charge amount from the information about each person, making it a reasonable fit for this post.

Here is the link to the data set:

https://www.kaggle.com/mirichoi0218/insurance

File structure

I use Python 3 in Spyder for my code development, and I like to modularize my code by task, so I created three files.

1Preprocessing.py

2DescriptiveStats.py

3Modeling.py

For each file I open up a new console in Spyder, which lets me work on a couple of different things at the same time while the other kernels run in the background.

Preprocessing

This is what the raw data looks like in Spyder’s variable explorer.

We have a couple of things that need to be dummified. Let’s do that first. I prefer to leave the raw data alone, so I make a new dataframe df_InsuranceProcessed to store the cleaned and modified data.
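The dummification step can be sketched with pandas’ get_dummies. The column names come from the Kaggle data set, but the rows below are made up for illustration, and the exact call (e.g. whether a first level is dropped) is my assumption rather than the code in the repo:

```python
import pandas as pd

# A few rows shaped like the Kaggle insurance data; the values
# here are made up for illustration.
df_Insurance = pd.DataFrame({
    'age': [19, 33, 45],
    'sex': ['female', 'male', 'male'],
    'bmi': [27.9, 22.7, 30.1],
    'children': [0, 1, 2],
    'smoker': ['yes', 'no', 'no'],
    'region': ['southwest', 'northeast', 'southeast'],
    'charges': [16884.92, 1725.55, 8240.59],
})

# Leave the raw dataframe alone; dummify the categorical columns
# into a new processed copy.
df_InsuranceProcessed = pd.get_dummies(
    df_Insurance, columns=['sex', 'smoker', 'region'])
```

After this, each categorical column is replaced by one 0/1 indicator column per level (e.g. sex_male, smoker_yes), which is what tree models like XGBoost can consume directly.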

I also wanted to do a 70/20/10 training/validation/test split, which I did with sklearn.model_selection’s train_test_split function.
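Since train_test_split only produces two pieces at a time, a 70/20/10 split takes two calls. A minimal sketch, using a dummy 100-row frame in place of the real processed data (the random_state values are my choice, not the repo’s):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Dummy stand-in for df_InsuranceProcessed, just to demonstrate the split.
df_InsuranceProcessed = pd.DataFrame({'age': range(100), 'charges': range(100)})

# 70/20/10: first take 70% for training, then split the remaining 30%
# into validation (two thirds of it, i.e. 20% overall) and test (10%).
df_train, df_rest = train_test_split(
    df_InsuranceProcessed, train_size=0.7, random_state=42)
df_val, df_test = train_test_split(df_rest, test_size=1/3, random_state=42)
```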

Moving to the modeling file, our train data looks like this:

We want to predict ‘charges’, which is numeric, so it is reasonable to start with this as a regression problem. I chose XGBRegressor as my initial model; it is my favorite model at the moment for regression problems because of its speed and good performance.

To prepare the dataframe for modeling, I want to break it up into the target (y) and the input features (X’s), which I do in the following code:
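One common way to do this split, sketched on a toy frame (the variable names Xcolumns and targetcolumn are my choices, not necessarily the repo’s):

```python
import pandas as pd

# Toy stand-in for the dummified training data.
df_train = pd.DataFrame({
    'age': [19, 33, 45],
    'bmi': [27.9, 22.7, 30.1],
    'smoker_yes': [1, 0, 0],
    'charges': [16884.92, 1725.55, 8240.59],
})

# Target is 'charges'; every other column is an input feature.
targetcolumn = 'charges'
Xcolumns = [col for col in df_train.columns if col != targetcolumn]

X_train = df_train[Xcolumns]
y_train = df_train[targetcolumn]
```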

Modeling

Then we can plug it into the model, calculate our loss function (RMSE) on the validation data, and set up new columns for storing the prediction, residual, and fractional error, which are things I like to look at during error analysis.

The RMSE of the model is Val_RMSE = 16517561.866376236, which sounds pretty bad, but descriptive stats on the target variable will give a better sense of the scale of the data.

Error analysis

Let’s also do some error analysis. First, let’s look at the actual vs. predicted values.
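An actual-vs-predicted scatter like the one in the post can be made with matplotlib. The data below is made up to stand in for the validation set, and the y = x reference line is my addition (points on it would be perfect predictions):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so this runs outside Spyder
import matplotlib.pyplot as plt
import numpy as np

# Made-up actuals and predictions standing in for the validation set.
rng = np.random.default_rng(1)
actual = rng.uniform(1e3, 6e4, 100)
predicted = actual * rng.normal(1.0, 0.1, 100)

fig, ax = plt.subplots()
ax.scatter(actual, predicted, alpha=0.5)
# Reference line: points on y = x are predicted exactly right.
lims = [actual.min(), actual.max()]
ax.plot(lims, lims, color='red')
ax.set_xlabel('actual charges')
ax.set_ylabel('predicted charges')
fig.savefig('actual_vs_predicted.png')
```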

My main takeaway from this plot is that there are both outliers and more typical values, and the model is doing a pretty decent job of predicting the outliers even with default parameters and very little preprocessing. This is a really good sign!

Let’s also look at the residual:

We are usually over-predicting, but there are cases where we drastically under-predict. We can look at the same thing with the fractional error to help understand how much of the error is proportional to the size of the target.
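Histograms of the residual and the fractional error side by side make this comparison easy. The residuals below are synthetic, shaped to mimic the pattern described in the post (mostly small errors with a few large misses):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so this runs outside Spyder
import matplotlib.pyplot as plt
import numpy as np

# Synthetic residuals: mostly small errors, occasionally large ones.
rng = np.random.default_rng(2)
actual = rng.uniform(1e3, 6e4, 200)
residual = actual - actual * rng.normal(0.95, 0.1, 200)
fractionalerror = residual / actual

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(residual, bins=30)
ax1.set_xlabel('residual (actual - predicted)')
ax2.hist(fractionalerror, bins=30)
ax2.set_xlabel('fractional error (residual / actual)')
fig.savefig('error_analysis.png')
```

Dividing by the actual value puts every row on the same relative scale, which is why the fractional error plot can look better behaved than the raw residuals even when large-charge rows dominate the residual tails.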

This seems better behaved: it is more symmetric about zero and has fewer obvious outliers. This may suggest that our model is frequently failing on the outliers themselves (I have already done a little bit of descriptive stats, so I have other reasons to suspect this). Overall, the goal of the residual and fractional error plots is to give me something to compare future models against, so we will leave this here for the moment.

Descriptive stats

Finally, I want to start on descriptive stats for this data set. As a first step, I wanted to look at how age and gender relate in the data set.

The distributions look surprisingly similar. Let’s explore this more deeply.

There seem to be the same fluctuations for male (blue; sorry for the missing legend) and female. That is really strange. It is hard to tell how similar they are, so let’s put them on the same plot.
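Overlaying the two histograms with some transparency (and a legend this time) makes the comparison direct. The ages below are randomly generated stand-ins; only the group sizes (483 male, 453 female) come from the post:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so this runs outside Spyder
import matplotlib.pyplot as plt
import numpy as np

# Random stand-in ages; the counts match the describe output below.
rng = np.random.default_rng(3)
ages_male = rng.integers(18, 65, 483)
ages_female = rng.integers(18, 65, 453)

fig, ax = plt.subplots()
bins = range(18, 66, 2)
ax.hist(ages_male, bins=bins, alpha=0.5, label='male')
ax.hist(ages_female, bins=bins, alpha=0.5, label='female')
ax.set_xlabel('age')
ax.set_ylabel('count')
ax.legend()  # the legend the earlier figure was missing
fig.savefig('age_by_sex.png')
```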

At least they are not identical. This still looks unreasonably similar to me. At this point, I strongly suspect that the data is synthetic. How often can you sample a population of humans and get such similar numbers of male and female participants with such similar age distributions? Luckily this doesn’t affect my goal of having a data set I can run some models on, but it is worth keeping in mind. If it is true, the data might be missing some of the variation and error that exist in real-life data sets.

Now, I want to explore how the target variable relates to these parameters. But first, to quantify how similar the age distributions really are, we can run a describe command on age for all rows, for males only, and for females only.
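A sketch of that describe step, on a toy frame. After the dummification above, the ‘sex’ column becomes indicator columns such as sex_male, which is the filter I assume here; the data itself is randomly generated:

```python
import numpy as np
import pandas as pd

# Toy frame with an age column and the dummified sex_male indicator.
rng = np.random.default_rng(4)
df_train = pd.DataFrame({
    'sex_male': rng.integers(0, 2, 50),
    'age': rng.integers(18, 65, 50),
})

desc_all = df_train['age'].describe()
desc_male = df_train.loc[df_train['sex_male'] == 1, 'age'].describe()
desc_female = df_train.loc[df_train['sex_male'] == 0, 'age'].describe()
print(desc_all, desc_male, desc_female, sep='\n\n')
```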

All rows:

count    936.000000
mean      39.284188
std       14.076126
min       18.000000
25%       27.000000
50%       39.000000
75%       51.000000
max       64.000000
Name: age, dtype: float64

Males:

count    483.000000
mean      38.679089
std       13.964482
min       18.000000
25%       26.000000
50%       38.000000
75%       51.000000
max       64.000000
Name: age, dtype: float64

Females:

count    453.000000
mean      39.929360
std       14.181171
min       18.000000
25%       27.000000
50%       40.000000
75%       52.000000
max       64.000000
Name: age, dtype: float64

Again, these seem to be very similar distributions; my hunch is that they are too similar to be believable.

There are too many ages to make a clean set of boxplots with charges on the y-axis and age and sex as the x and hue parameters: there will be either too many hues or too many x values for the plot to look good and be easily readable. It will still help me understand the data, though, so let’s do it anyway.
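The x/hue vocabulary above is seaborn’s, so I assume a seaborn boxplot like the following. The data is made up to mimic the pattern discussed next (a charge floor that rises with age, plus occasional large outliers):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so this runs outside Spyder
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up charges: a floor that rises with age plus occasional
# large outliers, mimicking the pattern in the real data.
rng = np.random.default_rng(5)
n = 300
age = rng.integers(18, 65, n)
sex = rng.choice(['male', 'female'], n)
charges = 250.0 * age + rng.exponential(2000.0, n)
df = pd.DataFrame({'age': age, 'sex': sex, 'charges': charges})

fig, ax = plt.subplots(figsize=(16, 5))
sns.boxplot(data=df, x='age', y='charges', hue='sex', ax=ax)
fig.savefig('charges_by_age_sex.png')
```

With one box per age per sex, this produces the crowded-but-informative plot described below.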

(Note: I have cropped the lower image because the legend went well off the page.) These are really interesting plots. There is an almost parabolic trend in the majority of the data: no row has charges below a floor that is set by age, and that floor rises as age increases. Then there are some outliers with much larger than typical charges. This is definitely a case where age will be a valuable feature. (I should have looked at the feature importances from the XGB model.)

The other thing that I notice is that charges above the lowest value for a given age seem to occur much more frequently in the male rows. This suggests that sex will also be an important variable.

Clearly there is a lot more data exploration and modeling work that can be done on this data set, but this feels like an okay start.

Next steps:

Look at the variable importances

Continue data exploration on some of the other parameters

Implement cross-fold validation

Start adjusting hyperparameters of the XGB model to see if we can reduce overfitting

Add to the error analysis and put it into a function to make comparisons with later models more automated

Record the amount of time spent on data science and the amount of time spent on writing the blog post…

I doubt I will do all of that in the next post, but I will see what I can do.

Are there other things you think I should be trying? Do you disagree with what I said? Let me know what you think!