In many applications we may have a large number of features—dozens, hundreds or even more. In machine learning contexts, where we are interested primarily in prediction, we typically don’t want to use all the features, for a couple of reasons:
Avoiding overfitting.
As we add more and more features to a model, bias is reduced but variance of the estimated model parameters increases. At some point, variance overwhelms bias, and our predictive ability declines.
Avoiding large computation time and/or memory usage.
Depending on the solution method used, computation time for a linear model increases with the cube, or at least the square, of the number of features. And modern ML models can be even worse in terms of computational burden. In addition, large models may result in lack of convergence or instability of the output.
Many, many methods for feature selection have been developed over the years, and you will often hear someone claim that their favorite is the one to use, The Best. As you may guess, the fact that there are conflicting claims as to “best” reflects the fact that no such Best method exists. Indeed, much of the feature selection process is ad hoc, and in any case, it will often boil down to the user’s judgment.
We will cover a few feature selection methods here, to explain their rationales and how to use them in qeML. You may typically use two or more of them in any given application.
We will use standard notation here, with n and p denoting the number of points in our dataset (i.e. number of rows) and the number of features (i.e. number of columns), respectively. Important points to remember:
The larger n is, the larger the value of p that can be used before the increase in variance overwhelms the reduction in bias.
However, that bias-variance tradeoff transition point for p depends on the application. There is no magic formula for best p as a function of n.
Nevertheless, a commonly used rule of thumb is to limit p to at most n^0.5. Some theoretical analyses even suggest the more conservative rule p < log(n). Again, these are oversimplifications, but you may find them useful as guides to intuition; see the short computation following this list.
It should be kept in mind that categorical features can greatly add to the value of p. See below.
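For instance, with n = 10000, the size of the dataset analyzed below, the two rules of thumb work out as follows:

n <- 10000
sqrt(n)  # first rule: p at most about 100
log(n)   # conservative rule: p below about 9.2 (natural log)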
Consider the dataset nyctaxi included in qeML, which is derived from taxi trip data made publicly available by the New York City government. Many public analyses have been made on various versions of the NYC taxi data, frequently with the goal of predicting trip time. Let’s take a look:
> data(nyctaxi)
> dim(nyctaxi)
[1] 10000 5
> head(nyctaxi)
passenger_count trip_distance PULocationID DOLocationID PUweekday
2969561 1 1.37 236 43 1
7301968 2 0.71 238 238 4
3556729 1 2.80 100 263 3
7309631 2 2.62 161 249 4
3893911 1 1.20 236 163 5
4108506 5 2.40 161 164 5
DOweekday tripTime
2969561 1 598
7301968 4 224
3556729 3 761
7309631 4 888
3893911 5 648
4108506 5 977
(Time-of-day, month etc. data is also available but not in this particular dataset.)
So n = 10000, p = 5. At first glance, one might guess that we could use all 5 features. Let’s try a linear model:
> z <- qeLin(nyctaxi,'tripTime')
holdout set has 1000 rows
Warning messages:
1: In eval(tmp, parent.frame()) :
7 rows removed from test set, due to new factor levels
2: In predict.lm(object, newx) :
prediction from a rank-deficient fit may be misleading
3: In predict.lm(object, newx) :
prediction from a rank-deficient fit may be misleading
To understand those warning messages, let’s look at the estimated beta coefficients:
> summary(z)
Call:
lm(formula = cbind(tripTime) ~ ., data = xy)
Residuals:
Min 1Q Median 3Q Max
-3807.5 -186.0 -49.8 131.8 3179.2
Coefficients: (4 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2125.474 351.347 -6.049 1.51e-09 ***
trip_distance 163.056 1.670 97.628 < 2e-16 ***
PULocationID4 1424.775 345.770 4.121 3.81e-05 ***
PULocationID7 1212.331 349.196 3.472 0.000520 ***
PULocationID10 655.251 371.870 1.762 0.078097 .
PULocationID12 1829.276 408.229 4.481 7.52e-06 ***
PULocationID13 1271.201 338.220 3.758 0.000172 ***
PULocationID17 1167.109 477.898 2.442 0.014619 *
PULocationID24 1356.014 342.706 3.957 7.66e-05 ***
PULocationID25 1381.439 354.736 3.894 9.92e-05 ***
PULocationID26 1282.229 506.842 2.530 0.011429 *
PULocationID33 1353.187 363.058 3.727 0.000195 ***
PULocationID35 726.147 474.147 1.531 0.125687
PULocationID36 972.828 391.276 2.486 0.012927 *
...
PULocationID260 972.537 375.500 2.590 0.009614 **
PULocationID261 1334.116 340.437 3.919 8.97e-05 ***
PULocationID262 1362.842 337.713 4.036 5.50e-05 ***
PULocationID263 1343.784 337.113 3.986 6.77e-05 ***
PULocationID264 1378.455 341.150 4.041 5.38e-05 ***
PULocationID265 2163.796 370.248 5.844 5.27e-09 ***
DOLocationID3 869.664 339.996 2.558 0.010549 *
DOLocationID4 913.146 108.088 8.448 < 2e-16 ***
DOLocationID7 1065.241 105.990 10.050 < 2e-16 ***
DOLocationID10 1703.133 211.081 8.069 8.06e-16 ***
...
DOLocationID119 1077.036 274.406 3.925 8.74e-05 ***
DOLocationID121 1515.904 340.850 4.447 8.80e-06 ***
DOLocationID123 NA NA NA NA
DOLocationID124 2707.981 339.513 7.976 1.70e-15 ***
DOLocationID125 1123.702 105.203 10.681 < 2e-16 ***
...
DOLocationID262 933.146 96.473 9.673 < 2e-16 ***
DOLocationID263 885.520 94.578 9.363 < 2e-16 ***
DOLocationID264 960.527 108.120 8.884 < 2e-16 ***
DOLocationID265 169.068 123.391 1.370 0.170666
DayOfWeek2 44.691 15.004 2.979 0.002903 **
DayOfWeek3 100.480 14.188 7.082 1.53e-12 ***
DayOfWeek4 105.223 13.990 7.522 5.95e-14 ***
DayOfWeek5 133.312 13.775 9.678 < 2e-16 ***
DayOfWeek6 97.030 14.461 6.710 2.07e-11 ***
DayOfWeek7 57.133 14.579 3.919 8.97e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 327.2 on 8663 degrees of freedom
Multiple R-squared: 0.7128, Adjusted R-squared: 0.7017
F-statistic: 64 on 336 and 8663 DF, p-value: < 2.2e-16
There were 266 pickup/dropoff locations. That means 265 dummy-variable features were created each for pickup and dropoff. For instance, there would be such a variable for pickup at location 168, equal to 1 if pickup was at that location, 0 otherwise. Similarly, there are 6 more dummies for day of the week. R’s lm function, which qeLin wraps, automatically converts factors to dummies. There is no dummy for the 266th location, as it is indicated by the other 265 dummies all being 0.
So, p is not 5 after all; with the two numeric features plus the 530 location dummies and the 6 day-of-week dummies, it’s 2 + 530 + 6 = 538!
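To see the dummy expansion concretely, here is a minimal sketch with a toy factor; model.matrix is the base-R mechanism that lm uses for the conversion.

f <- factor(c('a','b','c','a'))
model.matrix(~ f)  # columns (Intercept), fb, fc; base level 'a' gets no dummy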
We are assuming no interactions here. But if some pickup/dropoff location combinations behave quite differently from others, we ought to consider interaction terms. These will have their own dummy variables (products of the original dummies), hugely increasing p if we include them all.
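As a rough illustration of that growth, again with toy factors rather than the actual location variables:

puLoc <- factor(rep(letters[1:5],20))        # 5-level factor, 100 observations
doLoc <- factor(rep(letters[1:5],each=20))
ncol(model.matrix(~ puLoc + doLoc))  # 9: intercept + 4 + 4 main-effect dummies
ncol(model.matrix(~ puLoc * doLoc))  # 25: adds the 4 x 4 = 16 product dummies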
Now, in those warning messages, the term “rank-deficient” refers to multicollinearity in the data, i.e. strong relations among the features, producing a possibly unstable result. Note that the model was so unstable that the coefficient for DOLocationID123 turned out to be NA.
But even the other coefficients have rather large standard errors. This is the “variance” side of the bias-variance tradeoff.
A more subtle problem is in the warning, “7 rows removed from test set, due to new factor levels.” By default, all qeML predictive functions split the data into training and holdout sets. Say some pickup location had only two rows in our data, both of which happen to be in the holdout set. Then the lm function would say, “Wait, I’ve never heard of these factor levels before,” so qeLin removes them. If many rows are removed, our predictive ability suffers.
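Here is a minimal sketch of the issue, using hypothetical toy data:

trainDF <- data.frame(loc=factor(c('a','a','b')), y=c(10,12,20))
fit <- lm(y ~ loc, data=trainDF)
testDF <- data.frame(loc=factor('c'))  # a level never seen in training
predict(fit, testDF)  # fails: "factor loc has new levels c"

Since lm estimated no coefficient for level 'c', prediction is impossible for that row; qeLin’s remedy is simply to drop such rows from the holdout set.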
What should we look for in a “good” feature selection method? We take the following as goals:
Desired Property A: The method should have predictive ability as a central goal.
Desired Property B: The method should be reasonably easy to explain.
Desired Property C: The method should produce an ordered sequence of candidate models.
All the methods discussed here will have this property. This is so important that we will give it its own subsection:
A method should produce some kind of ordered list, for instance: best single predictor; best pair of predictors; best triplet; etc. We can evaluate each of the p feature sets via cross-validation, i.e. by noting the testAcc component in the output of qeML functions, and then choose the one that performs best; see the sketch below. (A method may simply rank features according to importance, but we would still assess the performance of the first feature, the first two features together, the first three features together and so on.)
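As a sketch of that assessment, suppose some method has ranked the features of the taxi data as in the vector ranked below (the ordering here is hypothetical); we fit, say, a linear model to each nested feature set and compare holdout errors. A single random holdout is noisy, so in practice one might average several runs.

data(nyctaxi)
ranked <- c('trip_distance','PULocationID','DOLocationID','DayOfWeek')  # hypothetical ranking
errs <- sapply(seq_along(ranked), function(m)
   qeLin(nyctaxi[ , c(ranked[1:m],'tripTime')], 'tripTime')$testAcc)
which.min(errs)  # size of the best prefix of the ranking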
In the all possible subsets (APS) approach, by contrast, one looks at all 2^p possible feature sets, a prohibitively large number computationally.
Moreover, remember, we are working with randomness. We choose holdout sets randomly, and usually the dataset itself is considered to be a random sample from some (real or conceptual) population. Due to this randomness, it’s possible that some combination of features will appear to perform well, just by accident. The more feature sets we look at, the greater the chance of this occurring.
Note that some of the methods discussed below do their own internal cross-validation, and ultimately give us the “best” feature set according to that criterion. Nevertheless, they still provide an ordered sequence of feature sets.
This last point is important because we may wish to use the selected feature set sequence as input to a different ML method. There is no theoretical reason to believe the best feature set for one ML method is also best for another, but as noted, feature selection is an ad hoc business.
A classic approach for linear and generalized linear models is to first fit a full model, then retain only those features that are “statistically significant”. In the nyctaxi data shown above, we would retain, for instance, PULocationID33 but not PULocationID35.
This would violate our Desired Property A above. Use of p-values in general is problematic (see the “NoPVals” vignette), and this is no exception. Just because βi is nonzero does not mean that Feature i has strong predictive power; what if βi is nonzero but tiny?
One could generate an ordered sequence of feature sets in the above manner by first setting the threshold for a p-value very low, then progressively higher. This is similar to stepwise regression: In the forward version, one starts with no features and adds them one at a time. At each step, one tests for the β coefficient being 0, and chooses the feature that is most “significant.” When no new feature is found to be “significant,” the process stops, and we use whatever features made it into the model so far. The backward version starts with a full model and removes features one at a time. Though the latter may seem to be an improvement, in that it accounts for the possibility that a feature is useful individually but not when similar features are already in the equation, it still suffers from the fundamental problems of p-values.
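For comparison, base R’s step function performs a related stepwise search, though it selects by AIC rather than by p-values. A minimal sketch (likely slow here, given the hundreds of dummies):

full <- lm(tripTime ~ ., data=nyctaxi)
# forward search from the empty model toward the full one, guided by AIC
fwd <- step(lm(tripTime ~ 1, data=nyctaxi), scope=formula(full),
            direction='forward', trace=0)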
Here we fit a linear model, subject to the constraint that the sum of the absolute values of the estimated βi is at most λ, a hyperparameter. Starting at 0, we increase λ to ever-larger values, relaxing the constraint, with the effect that the estimated βi become nonzero one at a time. This gives us an ordered sequence of candidate features, as desired.
This is similar to a forward stepwise method, but NOT based on p-values; instead, the decision of how far along that sequence to go is based on cross-validated mean squared prediction error.
Here is the LASSO approach on the nyctaxi data:
> lassout <- qeLASSO(nyctaxi,'tripTime')
> lassout$lambda.whichmin
[1] 51
> lassout$whenEntered
trip_distance PULocationID.132 PULocationID.186 DayOfWeek.1
2 29 31 31
DOLocationID.132 DOLocationID.124 DayOfWeek.5 DOLocationID.33
33 34 34 35
DOLocationID.189 DOLocationID.225 PULocationID.213 PULocationID.76
35 35 37 38
PULocationID.263 DOLocationID.1 DOLocationID.177 DayOfWeek.2
38 38 38 38
PULocationID.124 DOLocationID.35 DOLocationID.36 DOLocationID.89
39 39 39 39
DOLocationID.231 DOLocationID.263 PULocationID.239 DOLocationID.155
39 39 40 40
PULocationID.40 PULocationID.43 PULocationID.146 PULocationID.161
41 41 41 41
PULocationID.230 DOLocationID.22 DOLocationID.37 DOLocationID.161
41 41 41 41
DOLocationID.162 DOLocationID.251 PULocationID.163 PULocationID.211
41 41 42 42
PULocationID.257 DOLocationID.49 DOLocationID.81 DOLocationID.179
42 42 42 42
...
For each value of λ run by the code, one or more features joined the model. The cross-validated mean squared prediction error reached its minimum at the 51st λ. Along the way, trip_distance was the first feature added, at the 2nd λ, followed by PULocationID.132 (29th), PULocationID.186 (31st), DayOfWeek.1 (31st) and so on.
Consider the Random Forests (RF) ML method. It has various hyperparameters, such as the maximum number of levels in a tree, and we wish to find a “good” combination of those settings. As a byproduct, though, of the fact that each split in a tree brings in a new feature, many RF implementations compute some kind of variable importance ranking, thus giving us our desired ordered sequence.
We can then run RF on each feature set in the sequence, or as mentioned, try the sequence on some other ML method. We would then use whichever feature set yielded the best cross-validated performance.
Here is an example with the nyctaxi data:
> z <- qeRFranger(nyctaxi,'tripTime')
holdout set has 1000 rows
Loading required namespace: ranger
Warning message:
In eval(tmp, parent.frame()) :
9 rows removed from test set, due to new factor levels
> z$variable.importance
trip_distance PULocationID DOLocationID DayOfWeek
2447411690 212569502 201650058 85733194
So, our feature set sequence might be:

trip_distance
trip_distance and PULocationID
trip_distance and PULocationID and DOLocationID
trip_distance and PULocationID and DOLocationID and DayOfWeek
Note that the algorithm chose the categorical variables as a whole here, not breaking down into their dummy variable components.
Another example is Principal Components Analysis (PCA). Here our p features are replaced by p linear combinations (the “principal components,” PCs) of the original features. The PCs are uncorrelated.
The PCs form our new features, and are ordered by variance, largest to smallest. Since a variable with small variance is essentially constant and thus has little predictive value, we take only the PCs with larger variance.
For the first m PCs, say, we compare the sum of their variances (which, the PCs being uncorrelated, is also the variance of their sum) to the total variance of all p PCs, i.e. the total variance of the original features. By varying the proportion of total variance we require, we produce an ordered sequence of feature sets, as desired.
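A sketch with base R’s prcomp, on a stand-in numeric matrix (categorical features would first need dummy coding, e.g. via model.matrix):

x <- matrix(rnorm(1000*10), ncol=10)  # stand-in for a numeric feature matrix
pcs <- prcomp(x, scale.=TRUE)
cumsum(pcs$sdev^2) / sum(pcs$sdev^2)  # cumulative proportion of total variance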
The qeML function qePCA first finds the PCs, then applies whatever ML method is specified by the user. The function qeUMAP does the same for UMAP, a kind of nonlinear analog of PCA.
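A hypothetical call, assuming qePCA’s pcaProp argument, which specifies the proportion of total variance the retained PCs should cover:

# fit a linear model on enough PCs to cover 90% of total variance
z <- qePCA(nyctaxi, 'tripTime', 'qeLin', pcaProp=0.90)
z$testAcc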
One more measure of variable importance is Shapley values. This concept describes an attempt to use game theory to apportion credit for a win among several players in a team; the “players” here are the features. While it is a clever idea, it has been the subject of much criticism in terms of practical value.
My personal “go to” method for feature selection is FOCI (Feature Ordering by Conditional Independence). Its theoretical foundations are complex, but it boils down to measuring the reduction in variance in predicting Y from X, versus predicting Y simply from its mean.
FOCI is implemented in a CRAN package of the same name, and qeML’s qeFOCI function provides a convenient interface. Example:
> w <- qeFOCI(nyctaxi,'tripTime')
> w$selectedVar
index names
1: 1 trip_distance
2: 202 DOLocationID.75
3: 348 DayOfWeek.1
4: 78 PULocationID.151
Trip distance, dropoff location 75, Mondays and pickup location 151 were chosen, in that order, after which the algorithm decided that adding any further features would be counterproductive. Again, we could simply take that 4-feature set for whatever ML method we wish, or we could use the above to set up an ordered sequence of feature sets,
trip_distance
trip_distance and DOLocationID.75
trip_distance and DOLocationID.75 and DayOfWeek.1
trip_distance and DOLocationID.75 and DayOfWeek.1 and PULocationID.151
running our desired ML method on each feature set, then choosing the one with best cross-validation performance.
Rather than doing feature selection per se, we might transform the data to summary variables, or embeddings. We could group the pickup/dropoff locations by ZIP Code, for instance.
Or, we could consider only the more frequently used locations. The qeML function factorToTopLevels would help here. We will use it to reduce pickup location to just a few places that appear a lot in our data; all others will be lumped together as ‘other’.
A look at the frequency counts, say via table(nyctaxi$PULocationID) (output not shown), reveals that lots of locations have small counts, less than 50 and probably many in the single-digit range. So, PULocationID is a prime candidate for simplification. Let’s go with a cutoff of 75:

> newPUlocs <- factorToTopLevels(nyctaxi$PULocationID,lowCountThresh=75)

So, now we have a new, simpler version of the pickup location data. Let’s take a look at old versus new.
> head(newPUlocs,50)
[1] 236 238 100 161 236 161 238 107 170 other 170 137
[13] 239 143 164 164 151 239 161 100 142 107 238 43
[25] 237 170 238 238 236 161 236 other 230 138 other 79
[37] 186 132 other 100 234 142 151 263 236 142 142 144
[49] 236 230
43 Levels: 100 107 113 114 13 132 137 138 140 141 142 143 144 148 151 ... other
> head(nyctaxi$PULocationID,50)
[1] 236 238 100 161 236 161 238 107 170 211 170 137 239 143 164 164 151 239 161
[20] 100 142 107 238 43 237 170 238 238 236 161 236 41 230 138 261 79 186 132
[39] 152 100 234 142 151 263 236 142 142 144 236 230
224 Levels: 1 3 4 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 ... 265
So, for example, the pickup location in row 10 of our data was location 211, but since that was one of the “rare” locations, it was recoded to “other”.
How much dimension reduction did we accomplish?

> length(levels(newPUlocs))
[1] 43

That’s much less than the original 265. Of course, we could reduce even further by using a larger threshold.
So, let’s try out the new data:
> newDOlocs <- factorToTopLevels(nyctaxi$DOLocationID,lowCountThresh=75)
> nyctaxi$PULocationID <- newPUlocs
> nyctaxi$DOLocationID <- newDOlocs
> z <- qeLin(nyctaxi,'tripTime')
Dimension reduction at least got us a stable solution. Does it predict well? Comparing the testAcc and baseAcc components of the output (not shown here), the Mean Absolute Prediction Error is much less than what one would have by simply guessing all cases to be the overall mean trip time.
By varying lowCountThresh, we could generate an ordered sequence of feature sets, as before.
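A minimal sketch of such a sweep (the threshold grid is arbitrary, and we reload the data first, since the location columns were overwritten above):

data(nyctaxi)
for (thresh in c(25,50,75,100,150)) {
   tmp <- nyctaxi
   tmp$PULocationID <- factorToTopLevels(tmp$PULocationID,lowCountThresh=thresh)
   tmp$DOLocationID <- factorToTopLevels(tmp$DOLocationID,lowCountThresh=thresh)
   print(qeLin(tmp,'tripTime')$testAcc)  # holdout error at this threshold
}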