Feature Selection and Feature Extraction are among the important aspects of Machine Learning. In this blog post, we continue our discussion of Feature Selection in Machine Learning. The topics for this post are Variable Ranking (or Feature Ranking) and Feature Subset Selection Methods.

Feature Selection in Machine Learning: Variable Ranking and Feature Subset Selection Methods

In the previous blog post, I introduced the basic definitions, terminology and motivation of Feature Selection. For quick reference, the link to the preceding post of the series is below:

Topic 1: Variable Ranking

Variable Ranking is the process of ordering the features by the value of some scoring function, which usually measures feature-relevance.

Scoring function: The score S(fi) is computed from the training data and measures some criterion of feature fi. By convention, a high score indicates a valuable (relevant) feature.

A simple feature selection method based on variable ranking is to select the k highest-ranked features according to S. This is usually not optimal, but it is often preferable to other, more complicated methods, and it is computationally efficient: only n scores need to be calculated and sorted.
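As an illustration, this top-k selector can be sketched in a few lines of NumPy. The scoring function used here (absolute Pearson correlation with the target) and the toy data are illustrative assumptions, not prescribed by the method:

```python
import numpy as np

def rank_features(X, y, k):
    """Score each feature, then return the indices of the k best.

    S(f_i) is taken here to be |corr(f_i, y)|; any relevance score
    computed from the training data could be substituted.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    scores = np.array([abs(np.corrcoef(X[:, i], y)[0, 1])
                       for i in range(X.shape[1])])
    # Sort the n scores once and keep the k highest-ranked features
    return np.argsort(scores)[::-1][:k], scores

# Toy data: feature 0 tracks the target, feature 1 is pure noise
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 100)
X = np.column_stack([y + 0.1 * rng.normal(size=100),
                     rng.normal(size=100)])
top_k, scores = rank_features(X, y, k=1)
```

On this toy data the noisy copy of the target outranks the pure-noise feature, as expected.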

Ranking Criteria: Correlation Criteria and Information Theoretic Criteria


Ranking Criteria poses some questions:


1. Can variables with a small score be automatically discarded?

The answer is NO!

Even variables with a small score can improve class separability.

Whether this helps depends on the correlation between x1 and x2.

In such a case, the class-conditional distributions have a high covariance in the direction orthogonal to the line between the two class centers.


2. Can a useless variable (i.e. one with a small score) be useful together with others?

The answer is YES!

• The correlation between a variable and the target is not enough to assess relevance

• The correlation / covariance between pairs of variables also has to be considered (potentially difficult)

The diversity of features also needs to be considered.

Information Theoretic Criteria

3. Can two variables that are useless by themselves be useful together?

The answer is YES!

This can be done using the Information Theoretic Criteria.


Mutual information is a measure of "how much information (in terms of entropy) two random variables share".

It can also detect non-linear dependencies among variables.

However, it is harder to estimate than correlation.
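A rough histogram-based estimator (a common textbook construction; the bin count is an assumed tuning choice) shows both points: mutual information picks up a purely non-linear dependence that correlation misses, at the cost of an estimate that depends on the discretization:

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram estimate of I(X;Y) = sum p(x,y) log(p(x,y)/(p(x)p(y)))."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of X
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of Y
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])))

rng = np.random.default_rng(2)
x = rng.normal(size=5000)
corr = abs(np.corrcoef(x, x**2)[0, 1])          # ~0 despite full dependence
mi_dependent = mutual_information(x, x**2)      # clearly positive
mi_independent = mutual_information(x, rng.normal(size=5000))  # near zero
```

Note the estimation difficulty mentioned above: even for truly independent variables the histogram estimate is slightly positive, and the value changes with the number of bins.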

Variable Ranking — Single Variable Classifiers

Idea: select variables according to their individual predictive power.

Criterion: the performance of a classifier built with a single variable, e.g. using the value of the variable itself.

Predictive power is usually measured in terms of error rate (or criteria based on the False Positive Rate and False Negative Rate).

Also, a combination of single variable classifiers can be deployed using ensemble methods (boosting, …).
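As a sketch (the threshold learner and the toy data are illustrative assumptions), a single variable classifier can be as simple as one threshold on the raw feature value, with its error rate serving as the ranking criterion:

```python
import numpy as np

def single_variable_error(f, y):
    """Lowest error rate of a one-feature threshold classifier.

    Every observed value is tried as a threshold, in both orientations;
    the resulting error rate is the feature's (inverse) predictive power.
    """
    best = 1.0
    for t in np.unique(f):
        pred = (f > t).astype(int)
        err = min(np.mean(pred != y), np.mean(pred == y))  # either side of t
        best = min(best, err)
    return best

rng = np.random.default_rng(3)
y = rng.integers(0, 2, 500)
informative = y + 0.3 * rng.normal(size=500)   # tracks the class
noise = rng.normal(size=500)                   # unrelated to the class

err_informative = single_variable_error(informative, y)
err_noise = single_variable_error(noise, y)
```

Ranking features by `single_variable_error` in ascending order then recovers the usual variable ranking.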

Topic 2: Feature Subset Selection

The goal of Feature Subset Selection is to find the optimal feature subset. Feature Subset Selection Methods can be classified into three broad categories:

Filter Methods

Wrapper Methods

Embedded Methods

For Feature Subset Selection you’d need:

A measure for assessing the goodness of a feature subset (scoring function)

A strategy to search the space of possible feature subsets

Finding a minimal optimal feature subset for an arbitrary target concept is hard, so good heuristics are needed.

Filter Methods

Filter Methods select subsets of variables as a pre-processing step, independently of the classifier that will be used.

It would be worthwhile to note that Variable Ranking-Feature Selection is a Filter Method.


Key features of Filter Methods for Feature Subset Selection:

Filter Methods are usually fast.

Filter Methods provide a generic selection of features, not tuned to a given learner (universal).

Filter Methods are also often criticized because the feature set is not optimized for the classifier that is ultimately used.

Filter Methods are sometimes used as a pre-processing step for other methods.

Wrapper Methods

In Wrapper Methods, the learner is considered a black box. The interface of the black box is used to score subsets of variables according to the predictive power of the learner when using each subset.

Results vary for different learners

One needs to define:

• how to search the space of all possible variable subsets

• how to assess the prediction performance of a learner
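These two design choices can be made concrete with a greedy forward-selection sketch. Both the search strategy (grow the subset one feature at a time) and the black-box learner (a nearest-centroid classifier scored on a held-out half) are illustrative assumptions; any learner and any search strategy could be substituted:

```python
import numpy as np

def score_subset(X, y, subset):
    """Black-box evaluation: train a nearest-centroid classifier on half
    the data and report its accuracy on the other half."""
    half = len(y) // 2
    Xtr, Xte = X[:half][:, subset], X[half:][:, subset]
    ytr, yte = y[:half], y[half:]
    c0 = Xtr[ytr == 0].mean(axis=0)
    c1 = Xtr[ytr == 1].mean(axis=0)
    pred = (np.linalg.norm(Xte - c1, axis=1)
            < np.linalg.norm(Xte - c0, axis=1)).astype(int)
    return np.mean(pred == yte)

def forward_selection(X, y, k):
    """Greedy search over subsets: repeatedly add the single feature
    whose addition gives the best black-box score."""
    selected = []
    while len(selected) < k:
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        best = max(candidates, key=lambda j: score_subset(X, y, selected + [j]))
        selected.append(best)
    return selected

# Toy data: features 0 and 1 are informative, features 2-4 are noise
rng = np.random.default_rng(4)
y = rng.integers(0, 2, 1000)
X = np.column_stack([y + 0.5 * rng.normal(size=1000),
                     1 - y + 0.5 * rng.normal(size=1000),
                     rng.normal(size=(1000, 3))])
chosen = forward_selection(X, y, k=2)
```

Because the subsets are scored through the learner itself, a different learner plugged into `score_subset` can yield a different selected subset, which is exactly the "results vary for different learners" caveat above.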


Embedded Methods

Embedded Methods are specific to a given learning machine.

They perform variable selection implicitly in the process of training.

E.g. the WINNOW algorithm (a linear unit with multiplicative updates).
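A minimal sketch of WINNOW on boolean features follows (the toy target, with a single relevant feature, is an assumption for illustration). Because the updates are multiplicative, the weights of irrelevant features are driven down geometrically during training, so the learned weight vector doubles as an implicit feature selection:

```python
import numpy as np

def winnow(X, y, alpha=2.0, epochs=10):
    """WINNOW: a linear unit over boolean features, trained with
    multiplicative weight updates (promotion / demotion by alpha)."""
    n, d = X.shape
    theta = d            # classic threshold choice
    w = np.ones(d)
    for _ in range(epochs):
        for x, target in zip(X, y):
            pred = int(w @ x >= theta)
            if pred != target:
                if target == 1:
                    w[x == 1] *= alpha    # promote on a false negative
                else:
                    w[x == 1] /= alpha    # demote on a false positive
    return w

# Toy target: the label is simply feature 0; features 1-9 are irrelevant
rng = np.random.default_rng(5)
X = rng.integers(0, 2, size=(200, 10))
y = X[:, 0]
w = winnow(X, y)
```

After training, the weight on feature 0 dominates the rest, so thresholding the weight vector selects the relevant feature without any separate selection step.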

Summary

Correlation and Mutual information between single variables and the target are often used as Ranking-Criteria of variables.

One cannot automatically discard variables with small scores — they may still be useful together with other variables.

We discussed the three categories of Feature Subset Selection methods: Filter, Wrapper, and Embedded Methods.

Thanks for reading this blog post. Suggestions for improving it are most welcome.

Other blog posts that may interest you: