13 Feature Selection
In Chapters 5 through 8, we described different methods for adding new predictors to the model (e.g., interactions, spline columns, etc.). We can add any predictors (or functions of predictors) based on first principles, our intuition, or contextual knowledge.
We are also motivated to remove unwanted or irrelevant predictors. This can decrease model complexity, training time, computer memory, the cost of acquiring unimportant predictors, and so on. For some models, the presence of non-informative predictors can decrease performance (as we’ll see in Figure 13.1).
We may want to remove predictors solely based on their distributions; unsupervised feature selection focuses on just the predictor data.
More often, though, we are interested in supervised feature selection1: the removal of predictors that do not appear to have a relationship to the outcome. This implies that we have a statistic that can compute some measure of each predictor’s utility for predicting the outcome.
While it is easy to add predictors to a model, it can be very difficult (and computationally expensive) to properly determine which predictors to remove. The reason comes back to the misuse (or over-use) of data. We’ve described the problem of using the same data to fit and evaluate a model. By naively re-predicting the training set, we can accidentally get overly optimistic performance statistics. To compensate for this, resampling repeatedly allocates some data for model fitting and other data for model evaluation.
For feature selection, we have the same problem when the same data are used to measure importance, filter the predictors, and then train the model on the smaller feature set. It is similar to repeatedly taking the same test and then measuring our knowledge using the grade for the last test repetition. It doesn’t accurately measure the quantity that is important to us. This is the reason that this chapter is situated here instead of in Part 2; understanding overfitting is crucial to effectively remove predictors.
This chapter will summarize the most important aspects of feature selection. This is a broad topic and a more extensive discussion can be found in Kuhn and Johnson (2019). First, we’ll use an example to quantify the effect of irrelevant predictors on different models. Next, the four main approaches for feature selection are described and illustrated.
13.1 The Cost of Irrelevant Predictors
Does it matter if the model is given extra predictors that are unrelated to the outcome? It depends on the model. To quantify this, data were simulated using the equation described in Hooker (2004) and used in Sorokina et al. (2008):
\[ y_i = \pi^{x_{i1} x_{i2}} \sqrt{ 2 x_{i3} } - \arcsin(x_{i4}) + \log(x_{i3} + x_{i5}) - (x_{i9} / x_{i10}) \sqrt{x_{i7} / x_{i8}} - x_{i2} x_{i7} + \epsilon_i \tag{13.1}\]
where predictors 1, 2, 3, 6, 7, and 9 are standard uniform while the others are uniform on [0.6, 1.0]. Note that \(x_6\) is not used in the equation. The errors \(\epsilon_i\) are Gaussian with mean zero and a standard deviation2 of 0.25. The training and test sets were simulated to have 1,000 and 100,000 rows, respectively.
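To make the simulation concrete, here is a minimal numpy sketch of how data following Equation 13.1 could be generated. The function name, seeds, and the option to append irrelevant columns are ours, not part of the original studies.

```python
import numpy as np

def simulate_hooker(n, n_noise=0, sd=0.25, rng=None):
    """Simulate data from Equation 13.1, optionally adding irrelevant noise columns."""
    rng = np.random.default_rng(rng)
    x = np.empty((n, 10))
    for j in (0, 1, 2, 5, 6, 8):          # predictors 1, 2, 3, 6, 7, 9: uniform on [0, 1]
        x[:, j] = rng.uniform(0.0, 1.0, n)
    for j in (3, 4, 7, 9):                # predictors 4, 5, 8, 10: uniform on [0.6, 1.0]
        x[:, j] = rng.uniform(0.6, 1.0, n)
    y = (
        np.pi ** (x[:, 0] * x[:, 1]) * np.sqrt(2 * x[:, 2])
        - np.arcsin(x[:, 3])
        + np.log(x[:, 2] + x[:, 4])
        - (x[:, 8] / x[:, 9]) * np.sqrt(x[:, 6] / x[:, 7])
        - x[:, 1] * x[:, 6]
        + rng.normal(0.0, sd, n)
    )
    noise = rng.normal(size=(n, n_noise))  # standard normal columns, unrelated to y
    return np.hstack([x, noise]), y

X_train, y_train = simulate_hooker(1_000, rng=1)
X_test, y_test = simulate_hooker(100_000, rng=2)
```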
To assess the effect of useless predictors, between 10 and 100 extra columns of standard normal data were generated (unrelated to the outcome). Models were fit, tuned, and the RMSE of the test set was computed for each model. Twenty simulations were run for each combination of models and extra features3.
The percent difference from baseline RMSE was computed (baseline being the model with no extra columns). Figure 13.1 shows the results for the following models:
- Bagged regression trees
- Bayesian additive regression trees (BART)
- Boosted trees (via lightGBM)
- Cubist
- Generalized additive models (GAMs)
- K-nearest neighbors (KNN)
- Multivariate adaptive regression splines (MARS)
- Penalized linear regression
- Random forest
- RuleFit
- Single-layer neural networks
- Support vector machines (radial basis function kernel)
These models are described in detail in subsequent chapters. The colors of the points/lines signify whether the model automatically removes unused predictors from the model equation (see Section 13.6 below).
The results show that a few models, K-nearest neighbors, neural networks, and support vector machines, have severely degraded performance as the number of predictors increases. This is most likely due to the use of cross- and/or dot-products of the predictor columns in their calculations. The extra predictors add significant noise to these calculations, and that noise propagates in a way that inhibits the models from accurately determining the underlying relationship with the outcome.
However, several other models use these same calculations without being drastically affected. This is due to the models automatically performing feature selection during model training. Let’s briefly describe two sets of models that were largely unaffected by the extra columns.
The first set includes penalized linear regression, generalized additive models, and RuleFit. These add a penalty to their calculations that prevents the model parameters from becoming abnormally large unless the underlying data warrant a large coefficient. This effectively reduces the impact of extra predictors by keeping their corresponding model coefficients at or near zero.
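As a rough illustration of this behavior (not a reproduction of the simulation study above), a lasso-style penalty will usually drive the coefficients for pure-noise columns to, or very near, zero. The sketch below assumes scikit-learn is available and reuses the hypothetical simulate_hooker() function defined earlier.

```python
from sklearn.linear_model import LassoCV

X, y = simulate_hooker(1_000, n_noise=50, rng=1)
fit = LassoCV(cv=10).fit(X, y)

# The last 50 columns are noise; count how many of their coefficients were set to zero
noise_coef = fit.coef_[10:]
print(f"{(noise_coef == 0).sum()} of {noise_coef.size} noise coefficients are exactly zero")
```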
The second set of models includes MARS and Cubist. These selectively add predictors to their regression equations, including only those judged to have predictive value.
The other unaffected models are tree-based. As seen in Figure 2.5, these types of models make different rules by selecting predictors to split the data. If a predictor was not used in any split, it is functionally independent of the outcome in the prediction equation. This provides some insensitivity to the presence of non-informative predictors, although there is slight degradation in RMSE as the number of columns increases.
In summary, irrelevant columns minimally affect ML models that automatically select features. Models that do not have this attribute can be crippled under the same circumstances.
13.2 Different Levels of Features
When we talk about selecting features, we should be clear about what level of computation we mean. There are often two levels:
- Original predictors are the unmodified version of the data.
- Derived predictors refer to the set of columns that are present after feature engineering.
For example, if our model requires all numeric inputs, a categorical predictor is often converted to binary indicators. A date-time column is the original predictor, while the binary indicators for individual days of the week are the derived predictors. Similarly, the 550 measurement predictors in the barley data are the original predictors, and embedded values, such as the PCA components, are the derived predictors.
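One practical consequence is bookkeeping: to filter at the original level, we need a map from every derived column back to its source. A small sketch with pandas one-hot encoding (the data and column names are hypothetical):

```python
import pandas as pd

orders = pd.DataFrame({"day": ["Mon", "Tue", "Sat"], "distance": [2.1, 0.7, 5.3]})
derived = pd.get_dummies(orders, columns=["day"])   # columns: distance, day_Mon, day_Sat, day_Tue

# Record which original predictor produced each derived column
origin = {col: ("day" if col.startswith("day_") else col) for col in derived.columns}

# Removing the original 'day' predictor means dropping every column it produced
to_drop = [col for col, src in origin.items() if src == "day"]
reduced = derived.drop(columns=to_drop)
```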
Depending on the context, our interest in feature selection could be at either level. If the original predictors are expensive to estimate, we would be more interested in removing them at the original level; conversely, if any of a predictor's derived columns are important, we need to keep the original predictor.
This idea will arise again in ?sec-importance when we discuss variable importance scores for explaining models.
13.3 Overfitting the Feature Set
We’ve previously discussed overfitting, where a model finds patterns in the training data that are not reproducible. It is also possible to overfit the predictor set by finding the training set predictors that appear to be connected to the outcome but do not show the same relationship in other data.
This is most likely to occur when the number of predictors (\(p\)) is much larger than the training set size (\(n_{tr}\)) or when no external validation is used to verify performance. Ambroise and McLachlan (2002) show reanalyses of high-dimensional biology experiments that used a backward selection algorithm to remove predictors. In some cases, they could find a model and predictor subset with perfect accuracy even when the training set outcome data were randomly shuffled.
Recall that in Section 1.5 we argued that the preprocessing methods and the supervised model together encompass the entire model pipeline. Technically, feature selection (supervised or unsupervised) falls under the definition of preprocessing. It is important that the data used to train the pipeline (i.e., select the predictors) are different from the data used to evaluate how well it worked.
For example, if we use a resampling scheme, we have to repeat the feature selection process within each resample. If we used 10-fold cross-validation, each of the 10 analysis sets (90% of the training set) would have its own set of selected predictors.
Ambroise and McLachlan (2002) demonstrated that when feature selection was enacted once, and then the model with this subset was resampled, they could produce models with excessively optimistic performance metrics. However, when the performance statistics were computed with multiple realizations of feature selection, there was very little false optimism.
Another improper use of data, found in the literature as well as in real-life analyses, occurs when the entire data set is used to select predictors before the initial split is used to create the training and test sets. This “bakes in” the false optimism from the start, the epitome of information leakage4.
However, performing feature selection in each resample can multiplicatively increase the computational cost, especially if the model has many tuning parameters. For this reason, we tend to focus on methods that automatically select features during the normal process of training the model (as discussed below).
One scheme to understand if we are overfitting to the predictors is to create one or more artificial “sentinel predictors” that are random noise. Once the analysis that ranks/selects predictors is finished, we can see how high the algorithm ranks these deliberately irrelevant predictors. It is also possible to create a rule where predictors that rank higher than the sentinels are selected to be included in the model.
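A minimal sketch of the sentinel idea, with random-forest impurity importance as the (interchangeable) ranking statistic and the hypothetical simulate_hooker() data from before:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X, y = simulate_hooker(1_000, rng=rng)

# Append five pure-noise sentinel columns to the real predictors
sentinels = rng.normal(size=(X.shape[0], 5))
X_aug = np.hstack([X, sentinels])

imp = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_aug, y).feature_importances_

# Keep only the real predictors that outrank the best-scoring sentinel
cutoff = imp[-5:].max()
keep = np.where(imp[:-5] > cutoff)[0]
print("retained predictor indices:", keep)
```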
13.4 Unsupervised Selection
There are occasions where the values of the predictors are unfavorable or possibly detrimental to the model. The simplest example is a zero-variance column, where all of the values of a predictor in the training set have the same value. There is no information in these predictors and, although the model might be able to tolerate such a predictor, taking the time to identify and remove such columns is advisable.
But what about situations where only one or two rows have non-zero values? It is difficult to filter on variance values since the predictors could have different units or ranges. The underlying problem occurs when very few unique values exist (i.e., “coarse data”), and one or two predictor values capture most of the rows. Perhaps we can identify this situation.
Consider a hypothetical training set of 500 data points with a count-based predictor that has mostly zero counts. However, there are three data points with a count of 1 and one instance with a count of 2. The variance of these data is low, but it would be difficult to determine a cutpoint for removal. Kuhn (2008) developed an algorithm to find near-zero variance predictors, as defined by two criteria:
- The frequency ratio is the frequency of the most prevalent value over that of the second most frequent value. For our example, the ratio is 496/3 = 165.3.
- The proportion of unique values, which is 3 / 500 = 0.006.
We can determine thresholds for both of these criteria to remove the predictor. The default behavior is to remove the column if the frequency ratio is greater than 19 and the proportion of unique values is less than 0.1. For our example, both criteria are met, so the predictor would be eliminated.
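A sketch of the two criteria applied to the hypothetical count predictor (496 zeros, three ones, and one two); the thresholds mirror the defaults described above.

```python
import numpy as np

def near_zero_variance(x, freq_cut=19.0, unique_cut=0.1):
    """Flag a predictor using the frequency ratio and proportion-of-unique-values rules."""
    values, counts = np.unique(x, return_counts=True)
    if values.size == 1:                       # zero variance: always flag
        return True
    counts = np.sort(counts)[::-1]
    freq_ratio = counts[0] / counts[1]         # most common count / second most common
    unique_prop = values.size / x.size         # proportion of distinct values
    return freq_ratio > freq_cut and unique_prop < unique_cut

x = np.array([0] * 496 + [1] * 3 + [2])
near_zero_variance(x)    # True: ratio 496 / 3 = 165.3 and 3 / 500 = 0.006
```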
Another example of an unsupervised filter reduces between-predictor correlations. A high degree of correlation can reduce the effectiveness of some models. We can filter out predictors by choosing a threshold for the (absolute) pairwise correlations and finding the smallest subset of predictors to meet this criterion.
A good example is Figure 13.2, which visualizes the correlation matrix for the fifteen predictors in the forestry data set (from Section 3.9). The predictors for the January minimum temperature and the annual mean temperature are highly correlated (corr = 0.93), and keeping only one of them might be sufficient (and could produce a better model). Similar issues occur with other predictor pairs. If we apply a threshold of 0.50, then eight predictors (annual_maximum_temperature, annual_mean_temperature, annual_minimum_temperature, annual_precipitation, dew_temperature, elevation, longitude, and maximum_vapor) would be removed before model training.
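A greedy version of this filter is sketched below. The rule for choosing which member of an offending pair to drop (the one with the larger mean absolute correlation) is one common heuristic, similar in spirit to the approach in Kuhn (2008), although implementations differ; `forestry` is assumed to be a data frame of the numeric predictors.

```python
import numpy as np
import pandas as pd

def correlation_filter(df: pd.DataFrame, threshold: float = 0.5) -> list:
    """Return columns to drop so that no remaining pair has |correlation| above the threshold."""
    corr = df.corr().abs()
    np.fill_diagonal(corr.values, 0.0)         # ignore self-correlations
    drop = []
    while corr.to_numpy().max() > threshold:
        i, j = np.unravel_index(corr.to_numpy().argmax(), corr.shape)
        a, b = corr.columns[i], corr.columns[j]
        # Drop whichever member of the offending pair is, on average,
        # more correlated with everything else
        worst = a if corr[a].mean() >= corr[b].mean() else b
        drop.append(worst)
        corr = corr.drop(index=worst, columns=worst)
    return drop

# e.g., filtered = forestry.drop(columns=correlation_filter(forestry, threshold=0.5))
```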
Similar redundancy can occur in categorical predictors. In this case, measures of similarity can be used to filter the predictors in the same way.
Unsupervised filters are only needed in specific circumstances. Some models (such as trees) are resistant to the distributions of the predictor’s data5, while others are excessively sensitive.
13.5 Classes of Supervised Methods
There are three main strategies for supervised feature selection. The first, and most effective, has already been discussed: automatic selection, where a subset of predictors is determined during model training. We'll discuss this more in the next section.
The second supervised strategy is called wrappers. In this case, a sequential algorithm proposes feature subsets, fits the model with these subsets, and then determines a better subset from the results. This sounds similar to iterative optimization because it is. Many of the same tools can be used. Wrappers are the most thorough approach to searching the space of feature subsets, but this can come with an enormous computational cost.
A third strategy is to use a filter to screen predictors before adding them to the model. For example, the analysis shown in Figure 2.3 computed how each item in a food order impacted the delivery time. We could have applied a filter where we only used food item predictors whose lower confidence bound for the increase in time was greater than one. The idea behind the filter is that it is applied once prior to the model fit.
The next few sections will describe the mechanics of these techniques. However, how we utilize these algorithms is tricky. We definitely want to avoid overfitting the predictor set to the training data.
13.6 Automatic Selection
To more closely demonstrate how automatic feature selection works, let's return to the Cubist model fit to the delivery data in Section 2.4. That model created a series of regression trees, converted each tree to rules (i.e., paths through the tree), and then fit a linear regression for every rule. Recall that there was a sizable number of rules (2,007), each with its own linear regression. Despite the size of the rule set, only 13 of the original 30 predictors in the data set were used in rules or model fits; 17 were judged to be irrelevant for predicting delivery times.
This shows that we can often reduce the number of features for free.
Different models view the training data differently, and the list of relevant predictors will likely change from model to model. When we fit a MARS model (?sec-mars-reg) to the data, only 9 original predictors were used. If we use automatic selection methods to understand what is important, we should fit a variety of different models and create a consensus estimate of which predictors are “active” for the data set.
13.7 Wrapper Methods
Wrapper methods (Kohavi and John 1998) use an overarching search method to optimize binary indicators, one for each potential predictor, that control which predictors are given to the model.
The most popular approach is recursive feature elimination (RFE), which starts with the complete set of predictors and uses a mechanism to rank each predictor in terms of its utility for predicting the outcome (Guyon et al. 2002). Using this ordering, predictors are successively removed while performance is monitored. The performance profile determines the optimal subset size, and the most important predictors are included in the model. RFE is greedy, though; it predetermines the order in which the predictors are eliminated and will never consider a subset that is inconsistent with this order.
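scikit-learn ships implementations (sklearn.feature_selection.RFE and RFECV), but a hand-rolled sketch makes the mechanics clearer. Here a random forest supplies the one-time ranking and cross-validation tracks the performance profile; note that, as written, the ranking sees all of the supplied data, so in practice this whole function would itself be wrapped in an outer resampling loop (Section 13.3).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def rfe_profile(X, y, subset_sizes, cv=10):
    """Rank predictors once, then evaluate nested subsets in that fixed order."""
    full_fit = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
    order = np.argsort(full_fit.feature_importances_)[::-1]   # most to least important
    profile = {}
    for size in sorted(subset_sizes, reverse=True):
        active = order[:size]                                 # keep the top `size` predictors
        rmse = -cross_val_score(
            RandomForestRegressor(n_estimators=300, random_state=0),
            X[:, active], y, cv=cv, scoring="neg_root_mean_squared_error",
        ).mean()
        profile[size] = rmse
    return profile

# e.g., rfe_profile(X, y, subset_sizes=[30, 20, 10, 5])
```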
Two global search methods described in Chapter 12, genetic algorithms (GA) and simulated annealing, can also be used. In this instance, the search space is the predictor space represented as binary indicators. The same process applies, though. For example, a GA would contain a population of different subsets, each producing a fitted model. The performance values enable the creation of the next generation via selection, reproduction, and mutation. In the case of simulated annealing, an initial subset is created and evaluated, and perturbations are created by randomly altering a small set of indicators (to create a “nearby” subset). Suboptimal subsets can be accepted similarly to our application for tuning parameters.
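A sketch of the simulated-annealing variant over binary inclusion indicators; `score()` is assumed to be a user-supplied function that returns a resampled RMSE (lower is better) for a given subset, and the cooling schedule is arbitrary.

```python
import numpy as np

def anneal_subsets(score, n_predictors, iterations=150, temp=0.10, rng=None):
    """Simulated annealing over binary inclusion indicators (True = predictor kept)."""
    rng = np.random.default_rng(rng)
    current = rng.integers(0, 2, size=n_predictors).astype(bool)
    current_score = score(current)
    best, best_score = current.copy(), current_score
    for i in range(1, iterations + 1):
        candidate = current.copy()
        flip = rng.choice(n_predictors, size=rng.integers(1, 4), replace=False)
        candidate[flip] = ~candidate[flip]              # perturb a few indicators
        if not candidate.any():                         # never evaluate an empty subset
            continue
        cand_score = score(candidate)
        # Always accept improvements; accept worse subsets with a probability
        # that shrinks as the score gap grows and the schedule cools
        if cand_score < current_score or rng.random() < np.exp(-(cand_score - current_score) * i / temp):
            current, current_score = candidate, cand_score
        if current_score < best_score:
            best, best_score = current.copy(), current_score
    return best, best_score
```

In the analysis later in this section, a routine like this would be run once per resample, with score() training on the analysis set and measuring the RMSE on the assessment set.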
However, for each of these cases, the details matter. First, we should probably use separate data to
- estimate performance for each candidate subset and
- determine how far the search should proceed.
If we have large data sets, we can partition them into separate validation sets for each purpose. Otherwise, resampling is probably the answer to solve one or both of these problems.
The need to optimize tuning parameters can greatly complicate this entire process. A set of optimal parameter values for large subset sizes may perform very poorly for small subsets. In some cases, such as \(m_{try}\) and the number of embedding dimensions, the tuning parameter depends on the subset size.
In this case, the traditional approach uses a nested resampling procedure similar to the one shown in Section 11.4. An inner resampling scheme tunes the model, and the optimized model is used to predict the assessment set from the outer resample. Even with parallel processing, this can become an onerous computational task.
However, let’s illustrate a wrapper method using an artificial data set from the simulations described in Section 13.1. We’ll use Equation 13.1 and supplement the nine truly important predictors with twenty noise columns. For simplicity, we’ll use a neural network to predict the outcome with a static set of tuning parameters. The training set consisted of 1,000 data points.
Since feature selection is part of the broader model pipeline, 10-fold cross-validation was used to measure the effectiveness of 150 iterations of simulated annealing (Kuhn and Johnson 2019, Sect. 12.2). A separate SA run is created for each resample; we use the analysis set to train the model on the current subset, and the assessment set is used to measure the RMSE.
Figure 13.3 shows the results, where each point is the average of the 10 RMSE values for the SA at each iteration. The results clearly improve as the search begins to include most of the important predictors and discard some of the noise columns. After approximately 120 iterations, the solution stabilizes. This appears to be a very straightforward search process that clearly indicates when to stop the SA search.
However, looks can be deceiving. The feature selection process is incredibly variable, and adding or removing an important predictor can cause severe changes in performance. Figure 13.3 shows the average change in performance, and the relative smoothness of this curve is somewhat misleading.
Figure 13.4 shows what could have happened by visualizing each of the 10 searches. The results are much more dynamic and vary significantly from resample to resample. Most (but not all) searches contain one or two precipitous drops in the RMSE, with many spikes of high error (caused by a bad perturbation of the solution). These cliffs and spikes do not occur reproducibly at the same iteration or in the same way.
We should keep in mind that Figure 13.3 lets us know where, on average, the search should stop, but our actual search result (that uses the entire training set) is more likely to look like what we see in Figure 13.4.
For this particular search, the numerically best result occurred at iteration 113, and the final model contained 17 predictors; eight of the twenty noise columns were still included. Our resampled estimate of the RMSE at this iteration was 0.302 (a perfect model would have a value of 0.25). However, the plots of the SA performance profiles indicate a range, probably above 125 iterations, where the results are equivocal.
13.8 Filter Methods
Filters can use any scoring function that quantifies how effectively the predictor predicts the outcome (Kohavi and John 1998; Duch 2006). One method is to use a univariate statistic to evaluate each predictor independently. For example, the area under the ROC curve (?sec-roc) can be computed for a binary classification outcome and a set of numeric predictors. The filter could be applied using these values by pre-defining a threshold for the statistic that defines “important enough.” Alternatively, the top \(p^*\) predictors can be retained and used in the supervised model. This works well unless predictors have ties in their statistics.
The example of using ROC curves raises an interesting question. Should we rank the predictors by their area under the curve or should we convert that analysis to a p-value to see if the area under the curve is greater than 0.5 (Hanley and McNeil 1983)? Our advice is to do the former. Hypothesis tests and p-values have their uses but the AUC estimates themselves are probably more numerically stable. To factor in uncertainty, we could filter on the lower confidence bound of the AUC. This is generally true for almost any ranking statistic that has a corresponding p-value.
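A sketch of this style of filter for a binary outcome, using a bootstrap percentile as a simple stand-in for a formal lower confidence bound on each predictor's AUC:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_filter(X, y, lower_bound=0.5, n_boot=500, rng=None):
    """Keep numeric predictors whose bootstrap 5th-percentile AUC exceeds `lower_bound`."""
    rng = np.random.default_rng(rng)
    n, keep = len(y), []
    for j in range(X.shape[1]):
        aucs = []
        for _ in range(n_boot):
            idx = rng.integers(0, n, n)
            if len(np.unique(y[idx])) < 2:      # a resample needs both classes
                continue
            auc = roc_auc_score(y[idx], X[idx, j])
            aucs.append(max(auc, 1.0 - auc))    # direction of the association is irrelevant
        if np.percentile(aucs, 5) > lower_bound:
            keep.append(j)
    return keep
```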
Multivariate methods can also be used. These simultaneously compute measures of importance for all predictors. This is usually preferred since univariate statistics can be compromised when the predictors are highly correlated and/or have important interaction effects.
A variety of models have built-in methods for measuring importance. For example:
- MARS (Friedman 1991) and decision trees can compute the importance of each predictor by aggregating the improvement in the objective function each time that predictor is used in a split (for trees) or in one of the model's constructed features (for MARS).
- Some neural networks can use the values of their coefficients in each layer to compute an aggregate measure of effectiveness for each feature (Garson 1991; Gevrey, Dimopoulos, and Lek 2003; Olden, Joy, and Death 2004).
- Similarly, partial least squares models can compute importance scores from their loading estimates (Wold, Johansson, and Cocchi 1993; Farrés et al. 2015).
- As mentioned in Section 8.1.2, BART (and H-statistics) can measure the importance of individual predictors using the same techniques to discover interaction effects.
There are also model-free tools for computing importance scores from an existing model based on permuting each predictor and measuring how performance metrics change. This approach was initially popularized by the random forest model (?sec-rand-forest-cls) but can be easily extended to any model. See Breiman (2001), Strobl et al. (2007), and ?sec-importance for more details. There are a few other methods worth mentioning:
- The Relief algorithm (Kira and Rendell 1992; Kononenko 1994) and its descendant methods randomly select training set points and use the outcome values of their nearest neighbors to aggregate statistics that measure importance.
- Minimum redundancy, maximum relevance analysis (MRMR) combines two statistics, one for importance and another for between-predictor correlation, to find a subset that has the best of both quantities (Peng, Long, and Ding 2005).
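The model-free permutation approach mentioned just before this list is available directly in scikit-learn. A brief sketch, again using the hypothetical simulated data and a separate assessment set:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = simulate_hooker(1_000, n_noise=20, rng=3)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_tr, y_tr)

# Shuffle each column in turn and record how much the held-out RMSE degrades
result = permutation_importance(
    model, X_val, y_val, scoring="neg_root_mean_squared_error", n_repeats=10, random_state=0
)
ranking = result.importances_mean.argsort()[::-1]   # most to least important
```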
There are also tools to help combine multiple filters into a compound score (Karl et al. 2023). For example, we may want to blend the number of active predictors and multiple performance characteristics into a single score used to rank different tuning parameter candidates. One way to do this is to use desirability functions. These will be described in ?sec-combined-metrics. Also, Section 12.3.2 of Kuhn and Johnson (2019) shows an example of how desirability functions can be incorporated into feature selection.
Once an appropriate scoring function is determined, a tuning parameter can be added for how many predictors are to be retained. This is like RFE, but we optimize all the tuning parameters simultaneously and evaluate subsets in a potentially less greedy way.
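One way to set this up (a sketch only, not the analysis behind Figure 13.5, which used a space-filling design and a different importance score) is to place a filter and the model in one pipeline and tune the number of retained predictors alongside the other parameters:

```python
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("filter", SelectKBest(score_func=mutual_info_regression)),
    ("scale", StandardScaler()),
    ("nnet", MLPRegressor(max_iter=2000)),
])

param_grid = {
    "filter__k": [5, 10, 15, 20, 29],                  # how many predictors to retain
    "nnet__hidden_layer_sizes": [(5,), (13,), (20,)],
    "nnet__alpha": [1e-8, 1e-4, 1e-1],                 # weight-decay penalty
}

search = GridSearchCV(pipe, param_grid, cv=10, scoring="neg_root_mean_squared_error")
# search.fit(X, y)   # the filter is re-estimated within every resample, avoiding leakage
```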
There are two potential downsides, though. First, there is still the potential for enormous computational costs, especially if the ranking process is expensive. For example, if we use a grid search, there is significant potential for the same subset size to occur more than once in the grid6. In this case, we are repeating the same importance calculation multiple times.
The second potential problem is that we will use the same data set to determine optimal parameters and how many features to retain. This could very easily lead to overfitting and/or optimization bias, especially if our training set is not large.
We ranked the predictors with the standard random forest importance score, using the same data as the previous simulated annealing search. The number of features to retain was treated as an additional tuning parameter. Cross-validation was used to select the best candidate from the space-filling design with 50 candidate points. Figure 13.5(a) shows the tuning results. The number of selected predictors appears to have the largest effect on the results; performance worsens as we remove too many features (which probably includes informative ones). Too much penalization also appears to increase the RMSE.
The model with the smallest RMSE selected the top 9 most important predictors, 13 hidden units, a penalty value of \(10^{-8.78}\), a learning rate of \(10^{-1.47}\), and tanh activation. The resampled RMSE associated with the numerically best results was 0.28. When that model was fit on the entire training set, the holdout RMSE estimate was 0.282. One reason that these two estimates of performance are so close is the relatively high \(n_{tr}/p\) ratio (1,000 / 29 \(\approx\) 34.5).
Recall that the importance scores are computed on different analysis sets. For the best model, Figure 13.5(b) contains the frequency at which each predictor was selected. For the most part, the importance scores are pretty accurate; almost all informative features are consistently selected across folds. When the final model was fit on the entire training set, a single irrelevant predictor was included in the model (by random chance) along with true predictors \(x_{1}\), \(x_{2}\), \(x_{3}\), \(x_{4}\), \(x_{5}\), \(x_{7}\), \(x_{9}\), and \(x_{10}\). These results are unusually clean, mostly due to the large sample size and relatively small predictor set.
Similar to the discussion in Section 10.10, the different predictor sets discovered in the resamples can give us a sense of our results’ consistency. They are “realizations” of what could have occurred. Once we fit the final model, we get the actual predictor set.
Chapter References
1. Recall that we use the terms “predictors” and “features” interchangeably. We favor the former term, but “features” is more commonly used when discussing the filtering of columns.
2. For this value of the simulated standard deviation, the best possible RMSE value is also 0.25.
3. Details and code can be found at https://github.com/topepo/noise_features_sim.
4. Recall our mantra: “Always have a separate piece of data that can contradict what you believe.”
5. Zero-variance predictors do not harm these models, but determining such predictors computationally will speed up their training.
6. The same issue can also occur with iterative search.