So you’ve dabbled in data science and have heard the term “XGBoost” thrown around a bit, but don’t know what it is. I’m a big fan of learning by doing, so let’s try using XGBoost on a real-life problem: diagnosing Parkinson’s.

XGBoost is a popular technique and a neat alternative to traditional regression/neural nets. It stands for eXtreme Gradient Boosting: rather than fitting one big model, it builds an ensemble of decision trees one at a time, fitting each new tree to the gradient of the loss so it corrects the mistakes of the trees before it. Here’s a popular graphic from the XGBoost website as an example:

Not so menacing now, huh?
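If “fit each new tree to the gradient of the loss” feels abstract, here’s a minimal sketch of the core loop in plain Python and scikit-learn, assuming squared-error loss on a toy regression problem (so the negative gradient is just the residual). XGBoost layers a lot of machinery on top of this, but the basic idea is the same:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.sin(X).ravel()

learning_rate = 0.1
pred = np.zeros_like(y)  # start from a constant prediction of 0
trees = []

for _ in range(100):
    residuals = y - pred  # negative gradient of squared-error loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    trees.append(tree)
    pred += learning_rate * tree.predict(X)  # each tree nudges the prediction toward the target

print(f"Training MSE after boosting: {np.mean((y - pred) ** 2):.4f}")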

This sounds simple but can be extremely powerful in practice. Take, for example, Parkinson’s detection: we have several metrics we can analyze, and ultimately we need to diagnose Parkinson’s (classification!). This is a perfect problem for XGBoost (especially since there is a single output, so we don’t need the MultiOutput wrapper; more on that later).

Let’s Write 10 Lines of Code
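One bit of housekeeping before we start counting: the snippets below assume the following imports (I’m not counting these toward the 10 lines):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier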

Let’s get started by gathering some data. On the recommendation of my good friend Shlok, I found an excellently formatted dataset: hop on over to UCI’s ML repository and download the Parkinson’s dataset called parkinsons.data (here’s a link) (and in case that disappears, it’s in this repo). It’s a CSV, so we can parse it quickly with Pandas:

df = pd.read_csv('parkinsons.data')

Next, we need to get features and labels. The columns are conveniently all numeric except for the first one (name), and the label column is ‘status’ (already 0 or 1). Let’s ignore what the features mean for now and blindly analyze them (don’t do this in practice). This makes it rather convenient to quickly grab the training data:

features = df.loc[:, df.columns != 'status'].values[:, 1:]  # everything except 'status', minus the first (name) column

labels = df.loc[:, 'status'].values

Next, we’ll scale our features to between -1 and 1 so they’re normalized. We can do this using sklearn’s brilliant MinMaxScaler:

scaler = MinMaxScaler((-1, 1))

X = scaler.fit_transform(features)

We’re at 5 lines so far. Next, let’s split the data into training and testing sets so we can protect against overfitting. There aren’t many datapoints, so let’s split 14% off into test data, this time using sklearn’s train_test_split convenience function:

X_r, X_s, Y_r, Y_s = train_test_split(X, labels, test_size=0.14)
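Side note: if you want the exact same split on every run, and to keep the 0/1 class balance intact in such a small dataset, train_test_split takes a couple of extra arguments. This is an optional tweak, not one of our 10 lines, and the random_state value is arbitrary:

X_r, X_s, Y_r, Y_s = train_test_split(X, labels, test_size=0.14, random_state=7, stratify=labels)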

Then we use xgboost’s XGBClassifier, which is already built for classification and provided through the xgboost module (pip install xgboost):

model = XGBClassifier()

model.fit(X_r, Y_r)

That should take a split second to finish building the trees. Isn’t it great that we can converge without spending hours training?
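The defaults are all we need here, but for reference, XGBClassifier exposes the usual boosting knobs: the number of trees, their depth, and the learning rate. The values below are purely illustrative; I haven’t tuned them for this dataset:

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)  # illustrative values, not tuned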

We’re at 8 lines now, cutting it close! Let’s wrap this up by evaluating our model against the test set from earlier, with the accuracy_score function from sklearn:

Y_hat = [round(yhat) for yhat in model.predict(X_s)]  # predict on the held-out test set

print(accuracy_score(Y_s, Y_hat))

You should see an accuracy in the high 90s (~96.42% on the test set!). This is already impressive: the original paper from 2007 reports a classification accuracy of 91.8 ± 2.0%, and other papers from 2016 report 96.4% (SVM) and 97% with tuning (Boosted LogReg); with some tuning of our own, our model could plausibly match or even exceed those state-of-the-art results.
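One caveat: the dataset only has about 195 rows, so with a ~27-sample test set the accuracy will bounce around a fair bit from run to run. For a steadier estimate, a quick 5-fold cross-validation (extra credit, not part of the 10 lines) looks something like this:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(XGBClassifier(), X, labels, cv=5)  # 5-fold CV on the scaled features
print(scores.mean(), scores.std())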

And that’s it! 10 lines of code and you’ve trained a full XGBoost classifier for Parkinson’s. You can find the full source code for this, alongside another model for the UPDRS data, in the Train.ipynb Jupyter notebook here.

Afterthoughts