In many applications we may have a large number of features—dozens, hundreds or even more. In machine learning contexts, where we are interested primarily in prediction, we typically don’t want to use all the features, for a couple of reasons:
Avoiding overfitting.
As we add more and more features to a model, bias is reduced but variance of the estimated model parameters increases. At some point, variance overwhelms bias, and our predictive ability declines.
Avoiding large computation time and/or memory usage.
Depending on the solution method used, computation time for a linear model increases with the cube, or at least the square, of the number of features. And modern ML models can be even worse in terms of computational burden. In addition, large models may result in lack of convergence or instability of the output.
Many, many methods for feature selection have been developed over the years, and you will often hear someone claim that their favorite is the one to use, The Best. As you may guess, the fact that there are conflicting claims as to “best” reflects the fact that no such Best method exists. Indeed, much of the feature selection process is ad hoc, and in any case, it will often boil down to the user’s judgment.
We will cover a few feature selection methods here, to explain their rationales and how to use them in qeML. You may typically use two or more of them in any given application.
We will use standard notation here, with n and p denoting the number of points in our dataset (i.e. number of rows) and the number of features (i.e. number of columns), respectively. Important points to remember:
The larger n is, the larger the value of p that can be used before the increase in variance overwhelms the reduction in bias.
However, that bias-variance tradeoff transition point for p depends on the application. There is no magic formula for best p as a function of n.
Nevertheless, a commonly used rule of thumb is to limit p to at most n^0.5. Some theoretical analyses even suggest the more conservative rule p < log(n). Again, these are oversimplifications, but you may find them useful as guides to intuition; see the short computation following this list.
It should be kept in mind that categorical features can greatly add to the value of p. See below.
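For instance, with n = 10000, the size of the dataset analyzed below, the two rules of thumb work out as follows:

n <- 10000
sqrt(n)  # first rule: p at most about 100
log(n)   # conservative rule: p below about 9.2 (natural log)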
Consider the dataset nyctaxi included in qeML, which is derived from taxi trip data made publicly available by the New York City government. Many public analyses have been made on various versions of the NYC taxi data, frequently with the goal of predicting trip time. Let’s take a look:
> data(nyctaxi)
> dim(nyctaxi)
[1] 10000 5
> head(nyctaxi)
passenger_count trip_distance PULocationID DOLocationID PUweekday
2969561 1 1.37 236 43 1
7301968 2 0.71 238 238 4
3556729 1 2.80 100 263 3
7309631 2 2.62 161 249 4
3893911 1 1.20 236 163 5
4108506 5 2.40 161 164 5
DOweekday tripTime
2969561 1 598
7301968 4 224
3556729 3 761
7309631 4 888
3893911 5 648
4108506 5 977
(Time-of-day, month etc. data is also available but not in this particular dataset.)
So n = 10000, p = 5. At first glance, one might guess that we could use all 5 features. Let’s try a linear model:
> z <- qeLin(nyctaxi,'tripTime')
holdout set has 1000 rows
Warning messages:
1: In eval(tmp, parent.frame()) :
7 rows removed from test set, due to new factor levels
2: In predict.lm(object, newx) :
prediction from a rank-deficient fit may be misleading
3: In predict.lm(object, newx) :
prediction from a rank-deficient fit may be misleading
To understand those warning messages, let’s look at the estimated beta coefficients:
> summary(z)
Call:
lm(formula = cbind(tripTime) ~ ., data = xy)
Residuals:
Min 1Q Median 3Q Max
-3807.5 -186.0 -49.8 131.8 3179.2
Coefficients: (4 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2125.474 351.347 -6.049 1.51e-09 ***
trip_distance 163.056 1.670 97.628 < 2e-16 ***
PULocationID4 1424.775 345.770 4.121 3.81e-05 ***
PULocationID7 1212.331 349.196 3.472 0.000520 ***
PULocationID10 655.251 371.870 1.762 0.078097 .
PULocationID12 1829.276 408.229 4.481 7.52e-06 ***
PULocationID13 1271.201 338.220 3.758 0.000172 ***
PULocationID17 1167.109 477.898 2.442 0.014619 *
PULocationID24 1356.014 342.706 3.957 7.66e-05 ***
PULocationID25 1381.439 354.736 3.894 9.92e-05 ***
PULocationID26 1282.229 506.842 2.530 0.011429 *
PULocationID33 1353.187 363.058 3.727 0.000195 ***
PULocationID35 726.147 474.147 1.531 0.125687
PULocationID36 972.828 391.276 2.486 0.012927 *
...
PULocationID260 972.537 375.500 2.590 0.009614 **
PULocationID261 1334.116 340.437 3.919 8.97e-05 ***
PULocationID262 1362.842 337.713 4.036 5.50e-05 ***
PULocationID263 1343.784 337.113 3.986 6.77e-05 ***
PULocationID264 1378.455 341.150 4.041 5.38e-05 ***
PULocationID265 2163.796 370.248 5.844 5.27e-09 ***
DOLocationID3 869.664 339.996 2.558 0.010549 *
DOLocationID4 913.146 108.088 8.448 < 2e-16 ***
DOLocationID7 1065.241 105.990 10.050 < 2e-16 ***
DOLocationID10 1703.133 211.081 8.069 8.06e-16 ***
...
DOLocationID119 1077.036 274.406 3.925 8.74e-05 ***
DOLocationID121 1515.904 340.850 4.447 8.80e-06 ***
DOLocationID123 NA NA NA NA
DOLocationID124 2707.981 339.513 7.976 1.70e-15 ***
DOLocationID125 1123.702 105.203 10.681 < 2e-16 ***
...
DOLocationID262 933.146 96.473 9.673 < 2e-16 ***
DOLocationID263 885.520 94.578 9.363 < 2e-16 ***
DOLocationID264 960.527 108.120 8.884 < 2e-16 ***
DOLocationID265 169.068 123.391 1.370 0.170666
DayOfWeek2 44.691 15.004 2.979 0.002903 **
DayOfWeek3 100.480 14.188 7.082 1.53e-12 ***
DayOfWeek4 105.223 13.990 7.522 5.95e-14 ***
DayOfWeek5 133.312 13.775 9.678 < 2e-16 ***
DayOfWeek6 97.030 14.461 6.710 2.07e-11 ***
DayOfWeek7 57.133 14.579 3.919 8.97e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 327.2 on 8663 degrees of freedom
Multiple R-squared: 0.7128, Adjusted R-squared: 0.7017
F-statistic: 64 on 336 and 8663 DF, p-value: < 2.2e-16
There were 266 pickup/dropoff locations. That means 265 dummy-variable features were created each for pickup and dropoff. For instance, there would be such a variable for pickup at location 168, equal to 1 if pickup was at that location, 0 otherwise. Similarly, there are 6 more dummies for day of the week. R’s lm function, which qeLin wraps, automatically converts factors to dummies. There is no dummy for the 266th location, as it is indicated by the other 265 dummies all being 0.
So, p is not 5 after all; with the two numeric features plus the 530 location dummies and the 6 day-of-week dummies, it’s 2 + 530 + 6 = 538!
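To see the dummy expansion concretely, here is a minimal sketch with a toy factor; model.matrix is the base-R mechanism that lm uses for the conversion.

f <- factor(c('a','b','c','a'))
model.matrix(~ f)  # columns (Intercept), fb, fc; base level 'a' gets no dummy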
We are assuming no interactions here. But if some pickup/dropoff location combinations behave quite differently from others, we ought to consider interaction terms. These will have their own dummy variables (products of the original dummies), hugely increasing p if we include them all.
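As a rough illustration of that growth, again with toy factors rather than the actual location variables:

puLoc <- factor(rep(letters[1:5],20))        # 5-level factor, 100 observations
doLoc <- factor(rep(letters[1:5],each=20))
ncol(model.matrix(~ puLoc + doLoc))  # 9: intercept + 4 + 4 main-effect dummies
ncol(model.matrix(~ puLoc * doLoc))  # 25: adds the 4 x 4 = 16 product dummies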
Now, in those warning messages, the term “rank-deficient” refers to multicollinearity in the data, i.e. strong relations among the features, producing a possibly unstable result. Note that the model was so unstable that the coefficient for DOLocationID123 turned out to be NA.
But even the other coefficients have rather large standard errors. This is the “variance” side of the bias-variance tradeoff.
A more subtle problem is in the warning, “7 rows removed from test set, due to new factor levels.” By default, all qeML predictive functions split the data into training and holdout sets. Say some pickup location had only two rows in our data, both of which happen to be in the holdout set. Then the lm function would say, “Wait, I’ve never heard of these factor levels before,” so qeLin removes them. If many rows are removed, our predictive ability suffers.
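Here is a minimal sketch of the issue, using hypothetical toy data:

trainDF <- data.frame(loc=factor(c('a','a','b')), y=c(10,12,20))
fit <- lm(y ~ loc, data=trainDF)
testDF <- data.frame(loc=factor('c'))  # a level never seen in training
predict(fit, testDF)  # fails: "factor loc has new levels c"

Since lm estimated no coefficient for level 'c', prediction is impossible for that row; qeLin’s remedy is simply to drop such rows from the holdout set.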
What should we look for in a “good” feature selection method? We take the following as goals:
Desired Property A: The method should have predictive ability as a central goal.
Desired Property B: The method should be reasonably easy to explain.
Desired Property C: The method should produce an ordered sequence of candidate models.
All the methods discussed here will have this property. This is so important that we will give it its own subsection:
A method should produce some kind of ordered list, for instance: best single predictor; best pair of predictors; best triplet; etc. We can evaluate each of the p feature sets via cross-validation, i.e. by noting the testAcc component in the output of qeML functions, and then choose the one that performs best; see the sketch below. (A method may simply rank features according to importance, but we would still assess the performance of the first feature, the first two features together, the first three features together and so on.)
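As a sketch of that assessment, suppose some method has ranked the features of the taxi data as in the vector ranked below (the ordering here is hypothetical); we fit, say, a linear model to each nested feature set and compare holdout errors. A single random holdout is noisy, so in practice one might average several runs.

data(nyctaxi)
ranked <- c('trip_distance','PULocationID','DOLocationID','DayOfWeek')  # hypothetical ranking
errs <- sapply(seq_along(ranked), function(m)
   qeLin(nyctaxi[ , c(ranked[1:m],'tripTime')], 'tripTime')$testAcc)
which.min(errs)  # size of the best prefix of the ranking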
In the all possible subsets (APS) approach, by contrast, one looks at all 2^p possible feature sets, a prohibitively large number computationally.
Moreover, remember, we are working with randomness. We choose holdout sets randomly, and usually the dataset itself is considered to be a random sample from some (real or conceptual) population. Due to this randomness, it’s possible that some combination of features will appear to perform well, just by accident. The more feature sets we look at, the greater the chance of this occurring.
Note that some of the methods discussed below do their own internal cross-validation, and ultimately give us the “best” feature set according to that criterion. Nevertheless, they still provide an ordered sequence of feature sets.
This last point is important because we may wish to use the selected feature set sequence as input to a different ML method. There is no theoretical reason to believe the best feature set for one ML method is also best for another, but as noted, feature selection is an ad hoc business.
A classic approach for linear and generalized linear models is to first fit a full model, then retain only those features that are “statistically significant”. In the nyctaxi data shown above, we would retain, for instance, PULocationID33 but not PULocationID35.
This would violate our Desired Property A above. Use of p-values in general is problematic (see the “NoPVals” vignette), and this is no exception. Just because βi is nonzero does not mean that Feature i has strong predictive power; what if βi is nonzero but tiny?
One could generate an ordered sequence of feature sets in the above manner by first setting the threshold for a p-value very low, then progressively higher. This is similar to stepwise regression: In the forward version, one starts with no features and adds them one at a time. At each step, one tests for the β coefficient being 0, and chooses the feature that is most “significant.” When no new feature is found to be “significant,” the process stops, and we use whatever features made it into the model so far. The backward version starts with a full model and removes features one at a time. Though the latter may seem to be an improvement, in that it accounts for the possibility that a feature is useful individually but not when similar features are already in the equation, it still suffers from the fundamental problems of p-values.
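For comparison, base R’s step function performs a related stepwise search, though it selects by AIC rather than by p-values. A minimal sketch (likely slow here, given the hundreds of dummies):

full <- lm(tripTime ~ ., data=nyctaxi)
# forward search from the empty model toward the full one, guided by AIC
fwd <- step(lm(tripTime ~ 1, data=nyctaxi), scope=formula(full),
            direction='forward', trace=0)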
Here we fit a linear model, subject to the constraint that the sum of the absolute values of the estimated βi is at most λ, a hyperparameter. Starting at 0, we increase λ to ever-larger values, relaxing the constraint, with the effect that the estimated βi become nonzero one at a time. This gives us an ordered sequence of candidate features, as desired.
This is similar to a forward stepwise method, but NOT based on p-values; instead, the decision of how far along that sequence to go is based on cross-validated mean squared prediction error.
Here is the LASSO approach on the nyctaxi data:
> lassout <- qeLASSO(nyctaxi,'tripTime')
> lassout$lambda.whichmin
[1] 51
> lassout$whenEntered
trip_distance PULocationID.132 PULocationID.186 DayOfWeek.1
2 29 31 31
DOLocationID.132 DOLocationID.124 DayOfWeek.5 DOLocationID.33
33 34 34 35
DOLocationID.189 DOLocationID.225 PULocationID.213 PULocationID.76
35 35 37 38
PULocationID.263 DOLocationID.1 DOLocationID.177 DayOfWeek.2
38 38 38 38
PULocationID.124 DOLocationID.35 DOLocationID.36 DOLocationID.89
39 39 39 39
DOLocationID.231 DOLocationID.263 PULocationID.239 DOLocationID.155
39 39 40 40
PULocationID.40 PULocationID.43 PULocationID.146 PULocationID.161
41 41 41 41
PULocationID.230 DOLocationID.22 DOLocationID.37 DOLocationID.161
41 41 41 41
DOLocationID.162 DOLocationID.251 PULocationID.163 PULocationID.211
41 41 42 42
PULocationID.257 DOLocationID.49 DOLocationID.81 DOLocationID.179
42 42 42 42
...
For each value of λ run by the code, one or more features joined the model. The cross-validated mean squared prediction error reached its minimum at the 51st λ. Along the way, trip_distance was the first feature added, at the 2nd λ, followed by PULocationID.132 (29th), PULocationID.186 (31st), DayOfWeek.1 (31st) and so on.
Consider the Random Forests (RF) ML method. It has various hyperparameters, such as the maximum number of levels in a tree, and we wish to find a “good” combination of those settings. As a byproduct, though, of the fact that each split in a tree brings in a new feature, many RF implementations compute some kind of variable importance ranking, thus giving us our desired ordered sequence.
We can then run RF on each feature set in the sequence, or as mentioned, try the sequence on some other ML method. We would then use whichever feature set yielded the best cross-validated performance.
Here is an example with the nyctaxi data:
> z <- qeRFranger(nyctaxi,'tripTime')
holdout set has 1000 rows
Loading required namespace: ranger
Warning message:
In eval(tmp, parent.frame()) :
9 rows removed from test set, due to new factor levels
> z$variable.importance
trip_distance PULocationID DOLocationID DayOfWeek
2447411690 212569502 201650058 85733194
So, our feature set sequence might be:

trip_distance
trip_distance and PULocationID
trip_distance and PULocationID and DOLocationID
trip_distance and PULocationID and DOLocationID and DayOfWeek
Note that the algorithm chose the categorical variables as a whole here, not breaking down into their dummy variable components.
Another example is Principal Components Analysis (PCA). Here our p features are replaced by p linear combinations (the “principal components,” PCs) of the original features. The PCs are uncorrelated.
The PCs form our new features, and are ordered by variance, largest to smallest. Since a variable with small variance is essentially constant and thus has little predictive value, we take only the PCs with larger variance.
For the first m PCs, say, we compare the sum of their variances (which, the PCs being uncorrelated, is also the variance of their sum) to the total variance of all p PCs, i.e. the total variance of the original features. By varying the proportion of total variance we require, we produce an ordered sequence of feature sets, as desired.
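A sketch with base R’s prcomp, on a stand-in numeric matrix (categorical features would first need dummy coding, e.g. via model.matrix):

x <- matrix(rnorm(1000*10), ncol=10)  # stand-in for a numeric feature matrix
pcs <- prcomp(x, scale.=TRUE)
cumsum(pcs$sdev^2) / sum(pcs$sdev^2)  # cumulative proportion of total variance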
The qeML function qePCA first finds the PCs, then applies whatever ML method is specified by the user. The function qeUMAP does the same for UMAP, a kind of nonlinear analog of PCA.
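A hypothetical call, assuming qePCA’s pcaProp argument, which specifies the proportion of total variance the retained PCs should cover:

# fit a linear model on enough PCs to cover 90% of total variance
z <- qePCA(nyctaxi, 'tripTime', 'qeLin', pcaProp=0.90)
z$testAcc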
One more measure of variable importance is Shapley values. This concept describes an attempt to use game theory to apportion credit for a win among several players in a team; the “players” here are the features. While it is a clever idea, it has been the subject of much criticism in terms of practical value.
My personal “go to” method for feature selection is FOCI (Feature Ordering by Conditional Independence). Its theoretical foundations are complex, but it boils down to measuring the reduction in variance in predicting Y from X, versus predicting Y simply from its mean.
FOCI is implemented in a CRAN package of the same name, and qeML’s qeFOCI function provides a convenient interface. Example:
> w <- qeFOCI(nyctaxi,'tripTime')
> w$selectedVar
index names
1: 1 trip_distance
2: 202 DOLocationID.75
3: 348 DayOfWeek.1
4: 78 PULocationID.151
Trip distance, dropoff location 75, Mondays and pickup location 151 were chosen, in that order, after which the algorithm decided that adding any further features would be counterproductive. Again, we could simply take that 4-feature set for whatever ML method we wish, or we could use the above to set up an ordered sequence of feature sets,
trip_distance
trip_distance and DOLocationID.75
trip_distance and DOLocationID.75 and DayOfWeek.1
trip_distance and DOLocationID.75 and DayOfWeek.1 and PULocationID.151
running our desired ML method on each feature set, then choosing the one with best cross-validation performance.
Rather than doing feature selection per se, we might transform the data to summary variables, or embeddings. We could group the pickup/dropoff locations by ZIP Code, for instance.
Or, we could consider only the more frequently used locations. The qeML function factorToTopLevels would help here. We will use it to reduce pickup location to just a few places that appear a lot in our data; all others will be lumped together as ‘other’.
A look at the frequency counts, say via table(nyctaxi$PULocationID) (output not shown), reveals that lots of locations have small counts, less than 50 and probably many in the single-digit range. So, PULocationID is a prime candidate for simplification. Let’s go with a cutoff of 75:

> newPUlocs <- factorToTopLevels(nyctaxi$PULocationID,lowCountThresh=75)

So, now we have a new, simpler version of the pickup location data. Let’s take a look at old versus new.
> head(newPUlocs,50)
[1] 236 238 100 161 236 161 238 107 170 other 170 137
[13] 239 143 164 164 151 239 161 100 142 107 238 43
[25] 237 170 238 238 236 161 236 other 230 138 other 79
[37] 186 132 other 100 234 142 151 263 236 142 142 144
[49] 236 230
43 Levels: 100 107 113 114 13 132 137 138 140 141 142 143 144 148 151 ... other
> head(nyctaxi$PULocationID,50)
[1] 236 238 100 161 236 161 238 107 170 211 170 137 239 143 164 164 151 239 161
[20] 100 142 107 238 43 237 170 238 238 236 161 236 41 230 138 261 79 186 132
[39] 152 100 234 142 151 263 236 142 142 144 236 230
224 Levels: 1 3 4 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 ... 265
So, for example, the pickup location in row 10 of our data was location 211, but since that was one of the “rare” locations, it was recoded to “other”.
How much dimension reduction did we accomplish?

> length(levels(newPUlocs))
[1] 43

That’s much less than the original 265. Of course, we could reduce even further by using a larger threshold.
So, let’s try out the new data:
> newDOlocs <- factorToTopLevels(nyctaxi$DOLocationID,lowCountThresh=75)
> nyctaxi$PULocationID <- newPUlocs
> nyctaxi$DOLocationID <- newDOlocs
> z <- qeLin(nyctaxi,'tripTime')
Dimension reduction at least got us a stable solution. Does it predict well? Comparing the testAcc and baseAcc components of the output (not shown here), the Mean Absolute Prediction Error is much less than what one would have by simply guessing all cases to be the overall mean trip time.
By varying lowCountThresh, we could generate an ordered sequence of feature sets, as before.
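A minimal sketch of such a sweep (the threshold grid is arbitrary, and we reload the data first, since the location columns were overwritten above):

data(nyctaxi)
for (thresh in c(25,50,75,100,150)) {
   tmp <- nyctaxi
   tmp$PULocationID <- factorToTopLevels(tmp$PULocationID,lowCountThresh=thresh)
   tmp$DOLocationID <- factorToTopLevels(tmp$DOLocationID,lowCountThresh=thresh)
   print(qeLin(tmp,'tripTime')$testAcc)  # holdout error at this threshold
}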