3.5. Discussion

Thus far, we have explored a variety of techniques for variable selection and shrinkage, including subset selection, ridge regression, lasso regression, and principal components regression (Notes on PCR). A critical question emerges: which method is most appropriate for a given situation? There is no one-size-fits-all answer, but students will explore this question in depth through an upcoming coding assignment: a simulation study comparing these techniques, with prediction error on held-out test sets as the key performance metric.

For this study, we will examine three different design matrices:

  1. The first design matrix, X1, consists of a small set of features carefully selected by experts for their known predictive power with respect to the response variable Y.

  2. The second design matrix, X2, extends X1 by including all quadratic terms and pairwise interaction terms for the features in X1. While these features are expected to be related to Y, they are also correlated with one another, adding a layer of complexity to the analysis.

  3. The third design matrix, X3, further extends X2 by appending 500 noise features. Each noise feature is generated by selecting a true feature (a column of X2) and randomly permuting its n values; this is repeated 500 times. Because each noise feature shares the marginal distribution of a true feature it may look plausible, but the permutation destroys any relationship with Y (a sketch of this construction follows the list).
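
A minimal sketch of how X2 and X3 might be built, assuming X1 is an n-by-p NumPy array; the function names make_X2 and make_X3, the use of scikit-learn's PolynomialFeatures, and the random seed are illustrative choices, not part of the assignment specification.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)  # seed fixed for reproducibility (illustrative)

def make_X2(X1):
    """Augment X1 with all quadratic and pairwise interaction terms."""
    poly = PolynomialFeatures(degree=2, include_bias=False)
    return poly.fit_transform(X1)

def make_X3(X2, n_noise=500, rng=rng):
    """Append noise features: each one is a randomly chosen column of X2
    whose n values are independently permuted, breaking any link to Y."""
    n, p = X2.shape
    noise_cols = []
    for _ in range(n_noise):
        col = X2[:, rng.integers(p)].copy()  # pick a true feature
        rng.shuffle(col)                     # permute its n observations
        noise_cols.append(col)
    return np.hstack([X2, np.column_stack(noise_cols)])
```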

Key insights that students may uncover during this exercise include:

  • For X1, the feature set is small and carefully curated, often obviating the need for shrinkage or selection; the full least squares model may be sufficient.

  • For X2, where the features are correlated yet mostly relevant to Y, shrinkage methods such as ridge regression, as well as methods based on the top principal components such as PCR, can be effective.

  • For X3, ridge regression and PCR are less effective because they implicitly favor the leading principal components. Since much of the variation in X3 is noise-driven, those leading components are not necessarily related to Y. Here, methods like the lasso, which can set the coefficients of individual features exactly to zero, become more valuable (a sketch of such a comparison follows this list).
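
As a rough illustration of the comparison the assignment calls for, the sketch below fits OLS, ridge, lasso, and PCR on a random train/test split and reports test mean squared error. The function name compare_methods, the 50/50 split, the grids of regularization values and component counts, and the use of scikit-learn pipelines are all illustrative assumptions rather than the required setup.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def compare_methods(X, y, seed=0):
    """Fit OLS, ridge, lasso, and PCR on a training split and return
    each method's mean squared prediction error on the test split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, random_state=seed)

    models = {
        "ols": Pipeline([("scale", StandardScaler()),
                         ("fit", LinearRegression())]),
        "ridge": Pipeline([("scale", StandardScaler()),
                           ("fit", RidgeCV(alphas=np.logspace(-3, 3, 50)))]),
        "lasso": Pipeline([("scale", StandardScaler()),
                           ("fit", LassoCV(cv=5, random_state=seed))]),
        # PCR: choose the number of principal components by cross-validation.
        "pcr": GridSearchCV(
            Pipeline([("scale", StandardScaler()),
                      ("pca", PCA()),
                      ("fit", LinearRegression())]),
            param_grid={"pca__n_components": range(1, min(X.shape[1], 20) + 1)},
            cv=5),
    }

    errors = {}
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        errors[name] = mean_squared_error(y_te, model.predict(X_te))
    return errors
```

Applying compare_methods to the same response Y with each of X1, X2, and X3 in turn, and averaging the test errors over repeated simulations, is one way to surface the patterns described above.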