I have been doing Kaggle’s Quora Question Pairs competition for about a month now, and by reading the discussions on the forums, I’ve noticed a recurring topic that I’d like to address. People seem to be struggling with getting the performance of their models past a certain point. The usual approach is to use XGBoost, ensembles and stacking. While those can generally give good results, I’d like to talk about why it is still important to do feature importance analysis.

Data exploration

As an example, I will be using the Quora Question Pairs dataset. The dataset has 404,290 pairs of questions, and 37% of them are semantically the same (“duplicates”). The goal is to find out which ones.

The first steps are loading the dataset and exploring the data:

import pandas as pd

# Load the dataset
train = pd.read_csv('train.csv', dtype={'question1': str, 'question2': str})

print('Training dataset row number:', len(train))  # 404290
print('Duplicate question pairs ratio: %.2f' % train.is_duplicate.mean())  # 0.37

Examples of duplicate and non-duplicate question pairs are shown below.

| question1 | question2 | is_duplicate |
|---|---|---|
| What is the step by step guide to invest in share market in india? | What is the step by step guide to invest in share market? | 0 |
| How can I be a good geologist? | What should I do to be a great geologist? | 1 |
| How can I increase the speed of my internet connection while using a VPN? | How can Internet speed be increased by hacking through DNS? | 0 |
| How do I read and find my YouTube comments? | How do I read and find my YouTube comments? | 1 |

This is the word cloud inspired by a Kaggle kernel for data exploration. The cloud shows which words are popular (most frequent). The word cloud is created from words used in both questions. As you can see, the prevalent words are ones you would expect to find in a question (e.g. “best way”, “lose weight”, “difference”, “make money”, etc.)

We now have some idea about what our dataset looks like.

Feature engineering

I created 24 features, some of which are shown below. All code is written in Python using the standard machine learning libraries (pandas, sklearn, numpy). You can get the full code from my GitHub notebook. Examples of some features:

q1_word_num – number of words in question1

q2_length – number of characters in question2

word_share – ratio of shared words between the questions

same_first_word – 1 if both questions share the same first word, else 0

from nltk.tokenize import word_tokenize

def word_share(row):
    # Ratio of shared distinct tokens to all distinct tokens in both questions
    q1_words = set(word_tokenize(row['question1']))
    q2_words = set(word_tokenize(row['question2']))
    union = q1_words.union(q2_words)
    if not union:  # both questions are empty: avoid division by zero
        return 0.0
    return len(q1_words.intersection(q2_words)) / len(union)

def same_first_word(row):
    q1_words = word_tokenize(row['question1'])
    q2_words = word_tokenize(row['question2'])
    if not q1_words or not q2_words:  # guard against empty questions
        return 0.0
    return float(q1_words[0].lower() == q2_words[0].lower())

# A sample of the features
train['word_share'] = train.apply(word_share, axis=1)

train['q1_word_num'] = train.question1.apply(lambda x: len(word_tokenize(x)))
train['q2_word_num'] = train.question2.apply(lambda x: len(word_tokenize(x)))
train['word_num_difference'] = abs(train.q1_word_num - train.q2_word_num)

train['q1_length'] = train.question1.apply(len)
train['q2_length'] = train.question2.apply(len)
train['length_difference'] = abs(train.q1_length - train.q2_length)

train['q1_has_fullstop'] = train.question1.apply(lambda x: int('.' in x))
train['q2_has_fullstop'] = train.question2.apply(lambda x: int('.' in x))

train['q1_has_math_expression'] = train.question1.apply(lambda x: int('[math]' in x))
train['q2_has_math_expression'] = train.question2.apply(lambda x: int('[math]' in x))

train['same_first_word'] = train.apply(same_first_word, axis=1)
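To make word_share and same_first_word concrete, here is a minimal, self-contained sketch that computes both for one of the duplicate pairs from the table above. It is an illustration, not the notebook code: the regex-based `tokenize` helper is a simplified stand-in for nltk's `word_tokenize`, and unlike the original word_share it lowercases the words before comparing them.

```python
import re

def tokenize(text):
    # Simplified stand-in for nltk's word_tokenize (an assumption of this sketch):
    # lowercase the text and keep runs of word characters only.
    return re.findall(r"\w+", text.lower())

def word_share(q1, q2):
    # Shared distinct words divided by all distinct words in both questions.
    a, b = set(tokenize(q1)), set(tokenize(q2))
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def same_first_word(q1, q2):
    # 1.0 if both questions start with the same word, else 0.0.
    t1, t2 = tokenize(q1), tokenize(q2)
    return float(bool(t1) and bool(t2) and t1[0] == t2[0])

q1 = "How can I be a good geologist?"
q2 = "What should I do to be a great geologist?"

share = word_share(q1, q2)        # 4 shared words out of 12 distinct words
first = same_first_word(q1, q2)   # "how" vs "what"
print(round(share, 3), first)
```

The two questions share "i", "be", "a" and "geologist" out of 12 distinct words, so the ratio is 4/12, even though the pair is labelled as a duplicate. A single feature like this is informative but far from sufficient on its own.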

Baseline model performance

To measure model performance, we first split the dataset into a training set and a test set, holding out 20% of the data for testing. The held-out set (X_test and y_test) is used for every evaluation that follows.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

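The model is evaluated with the log loss function, the same metric that is used in the competition.

$logloss = -\frac{1}{N} \displaystyle\sum_{i=1}^{N} \displaystyle\sum_{j=1}^{M} y_{i,j} \log(p_{i,j})$

Here N is the number of question pairs, M is the number of classes (two in this case), $y_{i,j}$ is 1 if pair i belongs to class j and 0 otherwise, and $p_{i,j}$ is the predicted probability that pair i belongs to class j. Lower values are better.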
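As a rough sanity check on the metric, the sketch below implements the two-class case of the formula by hand with the standard library only. The labels and probabilities are made up for illustration, and `binary_log_loss` is a hypothetical helper name rather than anything from the notebook. The second part estimates what a model would score if it predicted the class prior for every pair, using the rounded 37% duplicate ratio quoted earlier.

```python
import math

def binary_log_loss(y_true, p_pred, eps=1e-15):
    # Mean negative log-likelihood of the true labels, where p_pred holds
    # the predicted probability of the positive class (duplicate).
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Three made-up pairs: true labels and predicted duplicate probabilities.
loss = binary_log_loss([1, 0, 1], [0.9, 0.2, 0.6])

# Constant baseline: predict 0.37 for every pair in a sample with
# 37 duplicates per 100 pairs (the rounded ratio from the dataset).
labels = [1] * 37 + [0] * 63
prior_loss = binary_log_loss(labels, [0.37] * 100)

print(round(loss, 4), round(prior_loss, 4))
```

The constant baseline comes out at roughly 0.66, which gives a reference point for the scores reported in the rest of the post: a model only adds value to the extent that it gets below that number.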

To test the model with all the features, we use the Random Forest classifier. It is a powerful “out of the box” ensemble classifier. No hyperparameter tuning was done: the hyperparameters can stay fixed because we are comparing the model’s performance across different feature sets. This simple model gives a logloss score of 0.62923, which would put us in 1371st place out of 1692 teams at the time of writing this post. Now let’s see if doing feature selection could help us lower the logloss.

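from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, log_loss

# 50 trees, default hyperparameters otherwise
model = RandomForestClassifier(n_estimators=50, n_jobs=8)
model.fit(X_train, y_train)

predictions_proba = model.predict_proba(X_test)  # class probabilities for log loss
predictions = model.predict(X_test)              # hard labels for accuracy and F1

log_loss_score = log_loss(y_test, predictions_proba)
acc = accuracy_score(y_test, predictions)
f1 = f1_score(y_test, predictions)

print('Log loss: %.5f' % log_loss_score)  # 0.62923
print('Acc: %.5f' % acc)  # 0.70952
print('F1: %.5f' % f1)  # 0.59173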

Feature importance

To get the feature importance scores, we will use an algorithm that does feature selection by default: XGBoost. It is the king of Kaggle competitions. If you are not using a neural net, you probably have one of these somewhere in your pipeline. XGBoost uses gradient boosting to build the decision trees in the ensemble. Each tree contains nodes, and each internal node splits the data on a single feature. The number of times a feature is used for a split across all the trees serves as a proxy for how much that feature contributes to the model’s performance: features that are split on often tend to matter more, and features that are never split on do not affect the predictions at all.
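The split-count idea described above can be sketched in a few lines. The nested-dict trees below are a hypothetical toy representation invented for this example, not XGBoost's internal format. Note also that XGBoost can report other importance types, such as gain, and which one `feature_importances_` returns depends on the library version and configuration.

```python
from collections import Counter

# Hypothetical toy ensemble: each internal node names the feature it splits on,
# and None marks a leaf.
trees = [
    {"feature": "word_share",
     "left": {"feature": "q1_length", "left": None, "right": None},
     "right": {"feature": "word_share", "left": None, "right": None}},
    {"feature": "word_share",
     "left": None,
     "right": {"feature": "same_first_word", "left": None, "right": None}},
]

def count_splits(node, counts):
    # Walk one tree and count how often each feature is used to split.
    if node is None:
        return
    counts[node["feature"]] += 1
    count_splits(node["left"], counts)
    count_splits(node["right"], counts)

counts = Counter()
for tree in trees:
    count_splits(tree, counts)

# Normalise the counts so the importance scores sum to 1.
total = sum(counts.values())
importance = {name: n / total for name, n in counts.items()}
print(importance)
```

In this toy ensemble word_share accounts for three of the five splits, so it receives the highest score, while any feature that never appears in a split would simply be missing from the result.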

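import matplotlib.pyplot as plt
from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=500)
model.fit(X, y)
feature_importance = model.feature_importances_

plt.figure(figsize=(16, 6))
# Log scale so that small scores stay visible; 'nonpositive' replaces the
# 'nonposy' keyword used by older matplotlib releases
plt.yscale('log', nonpositive='clip')
plt.bar(range(len(feature_importance)), feature_importance, align='center')
# features is assumed to hold the feature names in the same order as the columns of X
plt.xticks(range(len(feature_importance)), features, rotation='vertical')
plt.title('Feature importance')
plt.ylabel('Importance')
plt.xlabel('Features')
plt.show()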

Looking at the graph below, we see that some features are not used at all, while some (word_share) impact the performance greatly. We can reduce the number of features by taking a subset of the most important features.

Using the feature importance scores, we reduce the feature set. The pruned set keeps every feature whose importance score is strictly greater than a chosen threshold. The helper below uses 0.05 as its default threshold; note that the call shown passes min_score=0.01.

def extract_pruned_features(feature_importances, min_score=0.05):
    # Keep the features whose importance score is strictly above min_score
    column_slice = feature_importances[feature_importances['weights'] > min_score]
    return column_slice.index.values

# feature_importances is assumed to be a DataFrame indexed by feature name,
# with the importance scores in a 'weights' column
pruned_features = extract_pruned_features(feature_importances, min_score=0.01)
X_train_reduced = X_train[pruned_features]
X_test_reduced = X_test[pruned_features]

def fit_and_print_metrics(X_train, y_train, X_test, y_test, model):
    # Fit the model and report log loss on the held-out set
    model.fit(X_train, y_train)
    predictions_proba = model.predict_proba(X_test)
    log_loss_score = log_loss(y_test, predictions_proba)
    print('Log loss: %.5f' % log_loss_score)

Model performance with feature importance analysis

As a result of using the pruned features, our previous model – Random Forest – scores better. With little effort, the algorithm gets a lower loss, and it also trains more quickly and uses less memory because the feature set is reduced.

model = RandomForestClassifier(50, n_jobs=8)

# LogLoss 0.59251
fit_and_print_metrics(X_train_reduced, y_train, X_test_reduced, y_test, model)

# LogLoss 0.63376
fit_and_print_metrics(X_train, y_train, X_test, y_test, model)

Playing a bit more with the feature importance scores (plotting the logloss of our classifier for different subsets of pruned features), we can lower the loss even more. In this particular case, Random Forest actually works best with only one feature! Using only the feature “word_share” gives a logloss of 0.5544. If you are interested in seeing this step in detail, the full version is in the notebook.
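The threshold search described above can be sketched as a small loop. This is a hedged outline rather than the notebook code: `select_features` and `sweep_thresholds` are hypothetical helper names, the importance scores are made up, and `fake_loss` is a stand-in for an evaluator that would really fit the classifier on the selected columns and return its logloss on the test set.

```python
def select_features(importances, min_score):
    # Keep features whose importance is strictly above the threshold.
    return [name for name, score in importances.items() if score > min_score]

def sweep_thresholds(importances, thresholds, evaluate):
    # evaluate(features) must return a loss (lower is better).
    # Returns one (threshold, features, loss) row per usable threshold.
    results = []
    for t in thresholds:
        feats = select_features(importances, t)
        if feats:  # skip thresholds that remove every feature
            results.append((t, feats, evaluate(feats)))
    return results

# Made-up importance scores and a stand-in evaluator, for illustration only.
importances = {"word_share": 0.40, "q1_length": 0.08, "q2_has_fullstop": 0.005}
fake_loss = lambda feats: 0.55 + 0.02 * (len(feats) - 1)

results = sweep_thresholds(importances, [0.0, 0.01, 0.05, 0.1, 0.5], fake_loss)
best = min(results, key=lambda row: row[2])
print(best)
```

To use it with a real model, `evaluate` would fit the Random Forest on `X_train[feats]` and return `log_loss(y_test, model.predict_proba(X_test[feats]))`. Plotting the loss column of `results` against the threshold gives the kind of curve mentioned above.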

Conclusion

As I have shown, feature importance analysis has the potential to increase a model’s performance. While some models like XGBoost do feature selection for us, it is still important to know the impact of a given feature on the model’s performance, because it gives you more control over the task you are trying to accomplish. The “no free lunch” theorem (there is no solution which is best for all problems) tells us that even though XGBoost usually outperforms other models, it is up to us to discern whether it is really the best solution. Using XGBoost to find a subset of important features, and then passing that subset to models that do no feature selection of their own, can noticeably improve their performance: in this example it lowered the Random Forest logloss from about 0.63 to 0.59, and to 0.5544 with word_share alone.