Variance Threshold is a simple baseline feature selection method. It removes all features whose variance does not meet a specified threshold. Features with zero variance are constant and provide no information that distinguishes between different samples, and thus can be removed.
Why Use Variance Threshold?
- Remove Uninformative Features: Features with low variance are often not very informative. Removing them can help in reducing the dimensionality of the data without losing much information.
- Reduce Overfitting: Removing features that do not vary much can help reduce the complexity of the model and thus avoid overfitting.
- Improve Efficiency: Fewer features can lead to faster training and prediction times.
How It Works?
- Calculate Variance: Compute the variance for each feature in the dataset.
- Threshold Comparison: Compare the variance of each feature against a predefined threshold.
- Feature Selection: Retain features whose variance is above the threshold (a minimal sketch of these three steps follows below).
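The sketch below is a hand-rolled version of these steps using NumPy on a toy matrix. Note that scikit-learn's VarianceThreshold uses the population variance (ddof=0) and keeps features whose variance is strictly greater than the threshold:
import numpy as np
# Toy matrix: the first two columns are constant, the last two vary
X = np.array([
    [1, 2, 1, 2],
    [1, 2, 2, 3],
    [1, 2, 3, 4],
    [1, 2, 4, 5],
    [1, 2, 5, 6],
])
threshold = 0.0
variances = X.var(axis=0)        # Step 1: per-feature (population) variance
mask = variances > threshold     # Step 2: compare each variance to the threshold
X_selected = X[:, mask]          # Step 3: keep only the features that pass
print(variances)                 # [0. 0. 2. 2.]
print(X_selected.shape)          # (5, 2)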
Example
Let’s go through an example to illustrate the use of the VarianceThreshold method from the sklearn library.
Sample Code:
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
# Sample dataset
data = {
    'Feature_1': [1, 1, 1, 1, 1],
    'Feature_2': [2, 2, 2, 2, 2],
    'Feature_3': [1, 2, 3, 4, 5],
    'Feature_4': [2, 3, 4, 5, 6]
}
df = pd.DataFrame(data)
# Define the VarianceThreshold object with a threshold
threshold = 0.0 # This threshold will remove only features with zero variance
vt = VarianceThreshold(threshold)
# Fit the VarianceThreshold model and transform the data
vt.fit(df)
transformed_data = vt.transform(df)
# Get the names of the features that are kept
features_kept = df.columns[vt.get_support()]
print(f'Features kept: {features_kept.tolist()}')
print(f'Transformed Data:\n{pd.DataFrame(transformed_data, columns=features_kept)}')
Explanation
1. Import Libraries:
- Import the necessary libraries: pandas and VarianceThreshold from sklearn.
2. Create Sample Data:
- A sample dataset is created with 4 features. Feature_1 and Feature_2 have zero variance, while Feature_3 and Feature_4 have non-zero variance.
3. Define VarianceThreshold Object:
- The VarianceThreshold object is initialized with a threshold of 0.0, meaning only features with zero variance will be removed.
4. Fit and Transform Data:
- The fit method computes the variance of each feature, and the transform method removes the features that do not meet the threshold.
5. Get Features Kept:
- vt.get_support() returns a boolean array indicating which features are retained.
- The names of the retained features are printed along with the transformed data.
Output
Features kept: ['Feature_3', 'Feature_4']
Transformed Data:
Feature_3 Feature_4
0 1 2
1 2 3
2 3 4
3 4 5
4 5 6
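As a small usage note on the same df: fit and transform can be combined into a single fit_transform call, and recent scikit-learn releases (1.0 and later) also expose the retained column names via get_feature_names_out:
# One-step alternative to the separate fit/transform calls above
vt2 = VarianceThreshold(threshold=0.0)
transformed = vt2.fit_transform(df)
# When the input was a DataFrame, get_feature_names_out() returns the names of the
# retained columns (scikit-learn 1.0+); equivalent to df.columns[vt2.get_support()]
print(vt2.get_feature_names_out())
print(pd.DataFrame(transformed, columns=vt2.get_feature_names_out()))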
Using a Higher Threshold
You can specify a higher threshold to remove features whose variance is low but not exactly zero. (In this toy dataset, Feature_3 and Feature_4 each have a population variance of 2.0, so a threshold of 0.5 still keeps both of them.)
higher_threshold = 0.5
vt_high = VarianceThreshold(higher_threshold)
vt_high.fit(df)
transformed_data_high = vt_high.transform(df)
features_kept_high = df.columns[vt_high.get_support()]
print(f'Features kept with higher threshold: {features_kept_high.tolist()}')
print(f'Transformed DataFrame with higher threshold:\n{pd.DataFrame(transformed_data_high, columns=features_kept_high)}')
Detailed Example
Let’s consider a detailed example with a larger dataset:
import numpy as np
# Creating a larger sample dataset
np.random.seed(0)
data_large = {
    'Feature_1': np.random.randint(0, 2, size=100),   # Binary feature, variance around 0.25
    'Feature_2': np.random.randint(0, 10, size=100),  # Higher variance (roughly 8)
    'Feature_3': np.random.normal(0, 1, size=100),    # Variance around 1
    'Feature_4': np.random.normal(0, 0.01, size=100)  # Very low variance (around 1e-4)
}
df_large = pd.DataFrame(data_large)
# Applying Variance Threshold
vt_large = VarianceThreshold(threshold=0.1)
vt_large.fit(df_large)
transformed_data_large = vt_large.transform(df_large)
features_kept_large = df_large.columns[vt_large.get_support()]
print(f'Features kept in large dataset: {features_kept_large.tolist()}')
print(f'Transformed DataFrame with large dataset:\n{pd.DataFrame(transformed_data_large, columns=features_kept_large)}')
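To see why each feature was kept or dropped, you can inspect the variances_ attribute of the fitted selector. The exact numbers depend on the random draw, but with this seed only Feature_4 should fall below the 0.1 cutoff:
# Per-feature variances computed during fit, paired with the column names
for name, var in zip(df_large.columns, vt_large.variances_):
    print(f'{name}: variance = {var:.4f}')
# Feature_1 is binary, so its variance is close to 0.25 and it passes the
# 0.1 threshold; Feature_4 has a variance near 1e-4 and is removed.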
Practical Considerations
- Choosing the Threshold: The choice of threshold depends on the specific problem and dataset. A higher threshold will remove more features, possibly including those that carry some information. It’s important to experiment with different thresholds.
- Scaling Data: Variance depends on the scale of the features, so it can be useful to scale your data before applying Variance Threshold so that features are comparable. Keep in mind that standardizing every feature to unit variance would make the threshold meaningless; scaling to a common range (e.g. min-max) is a better fit (see the short sketch after this list).
- Type of Data: Variance Threshold is typically more useful for continuous features. For categorical data, other methods like chi-square test or mutual information might be more appropriate.
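Here is a minimal sketch of the scaling point above (an illustrative choice, not the only option): min-max scaling brings every feature into the [0, 1] range before thresholding.
from sklearn.preprocessing import MinMaxScaler
# Scale each feature to [0, 1] so their variances are on a comparable scale,
# then apply the threshold to the scaled data
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_large), columns=df_large.columns)
vt_scaled = VarianceThreshold(threshold=0.01)  # threshold is illustrative; tune it for your data
vt_scaled.fit(df_scaled)
print(f'Features kept after scaling: {df_large.columns[vt_scaled.get_support()].tolist()}')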
Conclusion
Variance Threshold is a straightforward and effective method for feature selection, especially when you need to remove features with little to no variance. It’s a good first step in the feature selection process, helping to simplify your dataset and, in some cases, improve model performance. However, it’s important to choose the threshold carefully and to consider the nature of your data to get the most out of this technique.