Summary:

This database contains 76 attributes, but all published experiments use a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to date. The “target” field refers to the presence of heart disease in the patient and is either 0 (no presence) or 1 (presence). (This data is from a Kaggle dataset.) My full code can be found on GitHub or Kaggle.

About the Data

Attribute Information:

age = age in years

sex = (1 = male; 0 = female)

cp = chest pain type

trestbps = resting blood pressure (in mm Hg on admission to the hospital)

chol = serum cholestoral in mg/dl

fbs = (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)

restecg = resting electrocardiographic results

thalach = maximum heart rate achieved

exang = exercise induced angina (1 = yes; 0 = no)

oldpeak = ST depression induced by exercise relative to rest

slope = the slope of the peak exercise ST segment

ca = number of major vessels (0-3) colored by flourosopy

thal = 3 = normal; 6 = fixed defect; 7 = reversible defect

target = 1 or 0 (1 = heart disease; 0 = no heart disease)

Here is just a snippet of the data
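One quick way to get a feel for these 14 columns is to put a couple of illustrative rows into a pandas DataFrame (the values below are made up for illustration, not rows from the real Kaggle file):

```python
import pandas as pd

# A tiny hand-made sample with the 14 attributes described above
# (values are illustrative, not taken from the actual dataset)
df = pd.DataFrame({
    "age": [63, 37], "sex": [1, 1], "cp": [3, 2],
    "trestbps": [145, 130], "chol": [233, 250], "fbs": [1, 0],
    "restecg": [0, 1], "thalach": [150, 187], "exang": [0, 0],
    "oldpeak": [2.3, 3.5], "slope": [0, 0], "ca": [0, 0],
    "thal": [1, 2], "target": [1, 1],
})
print(df.head())
```

With the real data you would load the Kaggle CSV the same way and call `df.head()` to see a snippet like the one shown here.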



Build Model

The boosted tree model is an ensemble method that combines many decision trees. For this problem, I built a boosted tree model with a max depth of 1 and 500 trees. I chose a boosted tree model because it makes it easy to apply many feature importance algorithms. (Below are the results of the model.)
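The post doesn't say which library was used, so as a library-agnostic sketch, here is an equivalent setup with scikit-learn's GradientBoostingClassifier, 500 trees of depth 1, on synthetic stand-in data (the real code would use the 13 heart-disease predictors instead):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the 13 predictors and the binary target
X = rng.normal(size=(303, 13))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=303) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 500 boosted stumps (max depth 1), matching the post's settings
model = GradientBoostingClassifier(n_estimators=500, max_depth=1,
                                   random_state=0)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```

Depth-1 trees ("stumps") make the ensemble purely additive in the features, which also keeps the feature-attribution methods below easy to reason about.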

Results of Boosted Trees Model

Feature Importance

1. Directional Feature Contributions (DFCs) - for individual cases

2. Average DFC over all data

3. Gain-based feature importance - for all data

4. Permutation feature importance - for all data

1. Directional Feature Contributions (DFCs)

DFCs show which features contributed most to an individual prediction. They also show whether an increase in a feature pushes the predicted probability up or down.

As you can see from the picture above, the most important feature was ca, followed by thalach. Keep in mind, though, that this is only for a single person; for someone else the algorithm may weigh the features differently. The nice thing about DFCs is that if you sum up all the contribution values, the total equals the model's predicted probability that the person has heart disease.
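With depth-1 trees, the DFC idea can be reproduced by hand: each stump splits on exactly one feature, so its output is attributed entirely to that feature, and the per-feature totals add up to the model's prediction. Here is an illustration with scikit-learn (an assumption on my part; the post doesn't name its library). Note that in this sketch the contributions sum to the raw log-odds score rather than a probability:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

# init='zero' means the raw score is exactly the sum of the tree outputs
model = GradientBoostingClassifier(n_estimators=100, max_depth=1,
                                   init="zero", random_state=0)
model.fit(X, y)

x = X[:1]  # explain a single example, as DFCs do
contrib = np.zeros(X.shape[1])
base = 0.0
for (tree,) in model.estimators_:
    f = tree.tree_.feature[0]  # the one feature this stump splits on
    val = model.learning_rate * tree.predict(x)[0]
    if f >= 0:
        contrib[f] += val      # attribute the stump's output to its feature
    else:
        base += val            # stump never split: treat as a baseline term

# The contributions (plus baseline) recover the raw log-odds prediction
print(contrib, base + contrib.sum(), model.decision_function(x)[0])
```

Each feature's entry in `contrib` is signed, so you can read off whether that feature pushed this particular prediction up or down.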



2. Average DFC Over All Data

Average DFC is a pretty simple concept: take the DFCs for each person and average them together across the whole dataset.

So according to this feature importance algorithm, the most important feature for determining whether someone has heart disease is the number of major vessels (0-3) colored by fluoroscopy (ca), followed by thal.
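Continuing the hand-rolled illustration from above (again a scikit-learn sketch, not the post's actual code), averaging the absolute per-example contributions turns the local DFCs into a global ranking:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

model = GradientBoostingClassifier(n_estimators=100, max_depth=1,
                                   init="zero", random_state=0)
model.fit(X, y)

# Per-example, per-feature contributions (depth-1: one feature per stump)
contrib = np.zeros(X.shape)
for (tree,) in model.estimators_:
    f = tree.tree_.feature[0]
    if f >= 0:
        contrib[:, f] += model.learning_rate * tree.predict(X)

# Mean absolute DFC over all examples gives a global importance score
mean_abs = np.abs(contrib).mean(axis=0)
print(np.argsort(mean_abs)[::-1])  # features ranked by average |DFC|
```

Here the target was built from features 0 and 2, so those should come out on top, mirroring how ca and thal dominate in the post's averaged plot.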

3. Gain-based feature importance - for all data

Gain-based feature importance measures how much the loss decreases when splitting on a particular feature, summed over every split that uses it.

Just like the mean DFCs, the gain-based feature importance says the top two features for predicting heart disease are the number of major vessels (0-3) colored by fluoroscopy (ca), followed by thal.
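Most tree libraries expose this directly. In scikit-learn (a stand-in here, since the post doesn't name its library), `feature_importances_` is the impurity/gain-based score, normalized to sum to 1:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = (1.5 * X[:, 0] + X[:, 3] > 0).astype(int)

model = GradientBoostingClassifier(n_estimators=200, max_depth=1,
                                   random_state=0)
model.fit(X, y)

# Gain-based importances, accumulated over all splits and normalized
print(model.feature_importances_)
```

Feature 0 carries the strongest signal in this toy target, so it should receive the largest gain-based score.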

4. Permutation feature importance - for all data

Permutation feature importance is computed by shuffling one feature's values at a time in the evaluation set and measuring how much model performance drops. One difference between this method and the others is that it works with any model, not just boosted trees, so for this one I used a logistic regression model, which got an accuracy of 0.86 and a recall of 0.93.
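scikit-learn ships this as `permutation_importance` (shown here on synthetic data with a logistic regression, mirroring the post's choice of model):

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 1] + 0.3 * X[:, 4] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Works with any fitted estimator, not just boosted trees
clf = LogisticRegression().fit(X_train, y_train)

# Shuffle one column at a time on the held-out set; the score drop
# is that feature's importance
result = permutation_importance(clf, X_test, y_test, n_repeats=20,
                                random_state=0)
print(result.importances_mean)
```

Because the shuffling happens on the evaluation set, this measures how much the fitted model actually relies on each feature, independent of how the model was trained.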

Based on Logistic Model



The main difference in the results is that this method determined that maximum heart rate achieved (thalach) is the most important feature, followed by thal and ca.

Conclusion

Based on the models above, I would say the most important features for determining heart disease are ca (the number of major vessels colored by fluoroscopy), thal, and thalach (maximum heart rate achieved).