Variance Threshold is a simple baseline feature selection method. It removes all features whose variance does not meet a specified threshold. Features with zero variance are constant and provide no information that distinguishes between different samples, and thus can be removed.
Why Use Variance Threshold?
- Remove Uninformative Features: Features with low variance are often not very informative. Removing them can help in reducing the dimensionality of the data without losing much information.
- Reduce Overfitting: Removing features that do not vary much can help reduce the complexity of the model and thus avoid overfitting.
- Improve Efficiency: Fewer features can lead to faster training and prediction times.
How It Works?
- Calculate Variance: Compute the variance for each feature in the dataset.
- Threshold Comparison: Compare the variance of each feature against a predefined threshold.
- Feature Selection: Retain features whose variance is above the threshold (a minimal sketch of these three steps follows below).
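The sketch below is a hand-rolled version of these steps using NumPy on a toy matrix. Note that scikit-learn's VarianceThreshold uses the population variance (ddof=0) and keeps features whose variance is strictly greater than the threshold:
import numpy as np
# Toy matrix: the first two columns are constant, the last two vary
X = np.array([
    [1, 2, 1, 2],
    [1, 2, 2, 3],
    [1, 2, 3, 4],
    [1, 2, 4, 5],
    [1, 2, 5, 6],
])
threshold = 0.0
variances = X.var(axis=0)        # Step 1: per-feature (population) variance
mask = variances > threshold     # Step 2: compare each variance to the threshold
X_selected = X[:, mask]          # Step 3: keep only the features that pass
print(variances)                 # [0. 0. 2. 2.]
print(X_selected.shape)          # (5, 2)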
Example
Let’s go through an example to illustrate the use of the VarianceThreshold method from the sklearn library.
Sample Code:
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
# Sample dataset
data = {
    'Feature_1': [1, 1, 1, 1, 1],
    'Feature_2': [2, 2, 2, 2, 2],
    'Feature_3': [1, 2, 3, 4, 5],
    'Feature_4': [2, 3, 4, 5, 6]
}
df = pd.DataFrame(data)
# Define the VarianceThreshold object with a threshold
threshold = 0.0 # This threshold will remove only features with zero variance
vt = VarianceThreshold(threshold)
# Fit the VarianceThreshold model and transform the data
vt.fit(df)
transformed_data = vt.transform(df)
# Get the names of the features that are kept
features_kept = df.columns[vt.get_support()]
print(f'Features kept: {features_kept.tolist()}')
print(f'Transformed Data:\n{pd.DataFrame(transformed_data, columns=features_kept)}')
Explanation
1. Import Libraries:
- Import the necessary libraries: pandas and VarianceThreshold from sklearn.
2. Create Sample Data:
- A sample dataset is created with 4 features. Feature_1 and Feature_2 have zero variance, while Feature_3 and Feature_4 have non-zero variance.
3. Define VarianceThreshold Object:
- The VarianceThreshold object is initialized with a threshold of 0.0, meaning only features with zero variance will be removed.
4. Fit and Transform Data:
- The fit method computes the variance of each feature, and the transform method removes the features that do not meet the threshold.
5. Get Features Kept:
- vt.get_support() returns a boolean array indicating which features are retained.
- The names of the retained features are printed along with the transformed data.
Output
Features kept: ['Feature_3', 'Feature_4']
Transformed Data:
Feature_3 Feature_4
0 1 2
1 2 3
2 3 4
3 4 5
4 5 6
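As a small usage note on the same df: fit and transform can be combined into a single fit_transform call, and recent scikit-learn releases (1.0 and later) also expose the retained column names via get_feature_names_out:
# One-step alternative to the separate fit/transform calls above
vt2 = VarianceThreshold(threshold=0.0)
transformed = vt2.fit_transform(df)
# When the input was a DataFrame, get_feature_names_out() returns the names of the
# retained columns (scikit-learn 1.0+); equivalent to df.columns[vt2.get_support()]
print(vt2.get_feature_names_out())
print(pd.DataFrame(transformed, columns=vt2.get_feature_names_out()))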
Using a Higher Threshold
You can specify a higher threshold to remove features whose variance is low but not exactly zero. (In this toy dataset, Feature_3 and Feature_4 each have a population variance of 2.0, so a threshold of 0.5 still keeps both of them.)
higher_threshold = 0.5
vt_high = VarianceThreshold(higher_threshold)
vt_high.fit(df)
transformed_data_high = vt_high.transform(df)
features_kept_high = df.columns[vt_high.get_support()]
print(f'Features kept with higher threshold: {features_kept_high.tolist()}')
print(f'Transformed DataFrame with higher threshold:\n{pd.DataFrame(transformed_data_high, columns=features_kept_high)}')
Detailed Example
Let’s consider a detailed example with a larger dataset:
import numpy as np
# Creating a larger sample dataset
np.random.seed(0)
data_large = {
    'Feature_1': np.random.randint(0, 2, size=100),   # Binary feature, variance around 0.25
    'Feature_2': np.random.randint(0, 10, size=100),  # Higher variance (roughly 8)
    'Feature_3': np.random.normal(0, 1, size=100),    # Variance around 1
    'Feature_4': np.random.normal(0, 0.01, size=100)  # Very low variance (around 1e-4)
}
df_large = pd.DataFrame(data_large)
# Applying Variance Threshold
vt_large = VarianceThreshold(threshold=0.1)
vt_large.fit(df_large)
transformed_data_large = vt_large.transform(df_large)
features_kept_large = df_large.columns[vt_large.get_support()]
print(f'Features kept in large dataset: {features_kept_large.tolist()}')
print(f'Transformed DataFrame with large dataset:\n{pd.DataFrame(transformed_data_large, columns=features_kept_large)}')
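To see why each feature was kept or dropped, you can inspect the variances_ attribute of the fitted selector. The exact numbers depend on the random draw, but with this seed only Feature_4 should fall below the 0.1 cutoff:
# Per-feature variances computed during fit, paired with the column names
for name, var in zip(df_large.columns, vt_large.variances_):
    print(f'{name}: variance = {var:.4f}')
# Feature_1 is binary, so its variance is close to 0.25 and it passes the
# 0.1 threshold; Feature_4 has a variance near 1e-4 and is removed.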
Practical Considerations
- Choosing the Threshold: The choice of threshold depends on the specific problem and dataset. A higher threshold will remove more features, possibly including those that carry some information. It’s important to experiment with different thresholds.
- Scaling Data: Variance depends on the scale of the features, so it can be useful to scale your data before applying Variance Threshold so that features are comparable. Keep in mind that standardizing every feature to unit variance would make the threshold meaningless; scaling to a common range (e.g. min-max) is a better fit (see the short sketch after this list).
- Type of Data: Variance Threshold is typically more useful for continuous features. For categorical data, other methods like chi-square test or mutual information might be more appropriate.
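Here is a minimal sketch of the scaling point above (an illustrative choice, not the only option): min-max scaling brings every feature into the [0, 1] range before thresholding.
from sklearn.preprocessing import MinMaxScaler
# Scale each feature to [0, 1] so their variances are on a comparable scale,
# then apply the threshold to the scaled data
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_large), columns=df_large.columns)
vt_scaled = VarianceThreshold(threshold=0.01)  # threshold is illustrative; tune it for your data
vt_scaled.fit(df_scaled)
print(f'Features kept after scaling: {df_large.columns[vt_scaled.get_support()].tolist()}')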
Conclusion
Variance Threshold is a straightforward and effective method for feature selection, especially when you need to remove features with little to no variance. It’s a good first step in the feature selection process, helping to simplify your dataset and, in some cases, improve model performance. However, it’s important to choose the threshold carefully and to consider the nature of your data to get the most out of this technique.