
STA 235H - Model Selection II: Shrinkage

Fall 2022

McCombs School of Business, UT Austin


Last class

  • Started with our prediction chapter

    • Bias vs. Variance

    • Validation set approach and Cross-validation

    • How to choose a model for a continuous outcome (RMSE)

    • Stepwise selection


Knowledge check from last week

1) Which model is higher bias: A complex model or a simpler one?

2) Why do we split our data into training and testing datasets?

3) How do we compare models with continuous outcomes?


How forward stepwise selection works: Example from last class

1) Start with a null model (no covariates)

  • Your best guess will be the average of the outcome in the training dataset!

2) Test out all models with one covariate, and select the best one:

  • E.g. logins ~ female, logins ~ succession, logins ~ age, ...
  • logins ~ succession is the best one (according to RMSE)

3) Test out all models with two covariates that include succession:

  • E.g. logins ~ succession + female, logins ~ succession + age, logins ~ succession + city, ...

4) You will end up with k possible models (k: total number of predictors).

  • Choose the best one according to the RMSE (a small sketch in R follows below).
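Below is a minimal sketch of this procedure in R. It assumes train.data and test.data splits with the outcome logins (as in last class); the helper rmse_fs, the selected/candidates objects, and the loop are illustrative, not the exact code used in class.

# Forward stepwise selection by test-set RMSE (illustrative sketch)
rmse_fs = function(model, data) sqrt(mean((data$logins - predict(model, data))^2))

# Step 1: null model -> predict the training average of the outcome
rmse_null = sqrt(mean((test.data$logins - mean(train.data$logins))^2))

candidates = setdiff(names(train.data), c("logins", "unsubscribe", "id"))
selected = c()
best_by_size = data.frame(size = integer(), rmse = numeric())

for (size in seq_along(candidates)) {
  # Steps 2-3: try adding each remaining covariate to the ones already selected
  rmses = sapply(setdiff(candidates, selected), function(v) {
    f = reformulate(c(selected, v), response = "logins")
    rmse_fs(lm(f, data = train.data), test.data)
  })
  selected = c(selected, names(which.min(rmses)))   # keep the best addition
  best_by_size = rbind(best_by_size, data.frame(size = size, rmse = min(rmses)))
}

# Step 4: compare the best model of each size (and the null model) by RMSE
best_by_size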

Today: Continuing our journey

  • How to improve our linear regressions:

    • Ridge regression
    • Lasso regression
  • Look at binary outcomes


Honey, I shrunk the coefficients!


What is shrinkage?

  • We reviewed the stepwise procedure: a subsetting model selection approach.

    • Select k out of p total predictors
  • Shrinkage (a.k.a. regularization): Fit a model with all p predictors, but introduce bias (i.e. shrink coefficients towards 0) in exchange for a reduction in variance.

    • Ridge regression

    • Lasso regression


On top of a ridge.


Ridge Regression: An example

  • Predict spending based on frequency of visits to a website


Ordinary Least Squares

  • In an OLS, we minimize the sum of squared errors, i.e. $\min_\beta \sum_{i=1}^{n}\left(spend_i - (\beta_0 + \beta_1 \cdot freq_i)\right)^2$
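As a quick sketch, this is just a simple linear regression in R (spending.data is a hypothetical data frame holding the example's training observations, with columns spend and freq):

# OLS fit for the example: spend regressed on freq
ols = lm(spend ~ freq, data = spending.data)
coef(ols)   # estimated beta_0 (intercept) and beta_1 (slope)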


What about fit?

  • Does the OLS fit the testing data well?


Ridge Regression

  • Let's shrink the coefficients: Ridge regression!


Ridge Regression: What does it do?

  • Ridge regression introduces bias to reduce variance in the testing data set.

  • In a simple regression (i.e. one regressor/covariate):

$\underbrace{\min_\beta \sum_{i=1}^{n}\left(y_i - \beta_0 - x_i\beta_1\right)^2}_{OLS} + \underbrace{\lambda\beta_1^2}_{\text{Ridge Penalty}}$

  • λ is the penalty factor: it indicates how much we want to shrink the coefficients.
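To make the objective concrete, here is the penalized loss written out as an R function (an illustrative sketch only; ridge_loss is not part of the class code, and spend/freq are the example's variables):

# Ridge objective for the one-regressor example:
# sum of squared errors plus lambda times the squared slope coefficient
ridge_loss = function(b0, b1, lambda, spend, freq) {
  sum((spend - (b0 + b1 * freq))^2) + lambda * b1^2
}
# A larger lambda puts more weight on keeping b1 close to 0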

Q1: In general, which model will have smaller β coefficients?

a) A model with a larger λ

b) A model with a smaller λ


Remember... we care about accuracy in the testing dataset!


RMSE on the testing dataset: OLS

$RMSE = \sqrt{\frac{1}{4}\sum_{i=1}^{4}\left(spend_i - (132.5 - 16.25 \cdot freq_i)\right)^2} = 28.36$
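The same number can be computed by hand in R (a sketch, assuming the ols fit from before and a 4-observation test.data with columns spend and freq):

# Test-set RMSE for the OLS fit
pred = predict(ols, newdata = test.data)
sqrt(mean((test.data$spend - pred)^2))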


RMSE on the testing dataset: Ridge Regression

$RMSE = \sqrt{\frac{1}{4}\sum_{i=1}^{4}\left(spend_i - (119.5 - 11.25 \cdot freq_i)\right)^2} = 12.13$


Ridge Regression in general

  • For regressions that include more than one regressor:

$\underbrace{\min_\beta \sum_{i=1}^{n}\left(y_i - \sum_{k=0}^{p} x_{ik}\beta_k\right)^2}_{OLS} + \underbrace{\lambda \sum_{k=1}^{p}\beta_k^2}_{\text{Ridge Penalty}}$

  • In our previous example, if we had two regressors, female and freq:

$\min_\beta \sum_{i=1}^{n}\left(spend_i - \beta_0 - \beta_1 female_i - \beta_2 freq_i\right)^2 + \lambda\left(\beta_1^2 + \beta_2^2\right)$

  • Because the ridge penalty includes the β coefficients, the scale of the variables matters:

    • Standardize the variables (you can do this as an option in your code)

How do we choose λ?

Cross-validation!

1) Choose a grid of λ values

  • The grid you choose will be context dependent (play around with it!)

2) Compute cross-validation error (e.g. RMSE) for each

3) Choose the λ with the smallest cross-validation error (a small sketch follows below).
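The same procedure can be run directly with the glmnet package (a minimal sketch, assuming train.data and the lambda_seq grid used in the caret code on the next slides, which wraps this for us):

library(glmnet)

# Predictor matrix (drop the intercept column) and outcome vector
x = model.matrix(logins ~ . - unsubscribe - id, data = train.data)[, -1]
y = train.data$logins

# 10-fold cross-validation over the lambda grid (alpha = 0 gives ridge)
cv_fit = cv.glmnet(x, y, alpha = 0, lambda = lambda_seq, nfolds = 10)
cv_fit$lambda.min   # lambda with the smallest cross-validation error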


λ vs RMSE?

(Plots: cross-validation RMSE as a function of λ, followed by a zoomed-in view around the minimum.)

How do we do this in R?

library(caret)

set.seed(100)   # cross-validation uses random folds, so set a seed

hbo = read.csv("https://raw.githubusercontent.com/maibennett/sta235/main/exampleSite/content/Classes/Week11/1_Shrinkage/data/hbomax.csv")

lambda_seq = seq(0, 20, length = 500)   # grid of lambda values to test

# train.data is assumed to be the training split of hbo created earlier
ridge = train(logins ~ . - unsubscribe - id,
              data = train.data,
              method = "glmnet",                            # elastic net
              preProcess = "scale",                         # standardize variables
              trControl = trainControl("cv", number = 10),  # 10-fold CV
              tuneGrid = expand.grid(alpha = 0,             # alpha = 0: ridge
                                     lambda = lambda_seq))

plot(ridge)   # cross-validation RMSE as a function of lambda
  • We will be using the caret package

  • We are doing cross-validation, so remember to set a seed!

  • You need to create a grid of λ values to be tested

  • The function we will use is train: same as before

    • method = "glmnet" means it will run an elastic net.
    • alpha = 0 means it is a ridge regression.
    • lambda = lambda_seq is not strictly necessary (you can provide your own grid).

  • Important objects in the cross-validation output:

    • results$lambda: vector of λ values that were tested
    • results$RMSE: RMSE for each λ
    • bestTune$lambda: the λ that minimizes the cross-validation error
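For instance, a quick way to look at these objects (a short sketch, assuming the ridge object fit above):

head(ridge$results[, c("lambda", "RMSE")])   # cross-validation RMSE for each lambda
ridge$bestTune$lambda                        # lambda that minimizes the CV error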

How do we do this in R?

OLS regression:

lm1 = lm(logins ~ succession + city,
         data = train.data)
coef(lm1)
## (Intercept)  succession        city
##    7.035888   -6.306371    2.570454

# rmse(model, data): test-set RMSE (e.g. modelr::rmse or a helper defined in class)
rmse(lm1, test.data)
## [1] 2.089868

Ridge regression:

coef(ridge$finalModel, ridge$bestTune$lambda)
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept)  6.564243424
## female       0.002726465
## city         0.824387472
## age          0.046468790
## succession  -2.639308962
rmse(ridge, test.data)
## [1] 2.097452

Throwing a lasso


Lasso regression

  • Very similar to ridge regression, except it changes the penalty term:

$\underbrace{\min_\beta \sum_{i=1}^{n}\left(y_i - \sum_{k=0}^{p} x_{ik}\beta_k\right)^2}_{OLS} + \underbrace{\lambda \sum_{k=1}^{p}|\beta_k|}_{\text{Lasso Penalty}}$

  • In our previous example:

$\min_\beta \sum_{i=1}^{n}\left(spend_i - \beta_0 - \beta_1 female_i - \beta_2 freq_i\right)^2 + \lambda\left(|\beta_1| + |\beta_2|\right)$

  • Lasso regression is also called l1 regularization:

$||\beta||_1 = \sum_{k=1}^{p}|\beta_k|$
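(For comparison, the ridge penalty is the squared l2 norm of the coefficients, $||\beta||_2^2 = \sum_{k=1}^{p}\beta_k^2$, which is why ridge is also called l2 regularization.)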


Q2: Which of the following are TRUE?

a) A ridge regression will have p coefficients (if we have p predictors)

b) A lasso regression will have p coefficients (if we have p predictors)

c) The larger the λ, the larger the L1 or L2 norm

d) The larger the λ, the smaller the L1 or L2 norm


Ridge vs Lasso

Ridge:

  • Final model will have p coefficients

  • Usually better with multicollinearity

Lasso:

  • Can set coefficients exactly to 0

  • Improves interpretability of the model

  • Can be used for model selection

And how do we do Lasso in R?

library(caret)

set.seed(100)

hbo = read.csv("https://raw.githubusercontent.com/maibennett/sta235/main/exampleSite/content/Classes/Week11/1_Shrinkage/data/hbomax.csv")

lambda_seq = seq(0, 20, length = 500)

lasso = train(logins ~ . - unsubscribe - id, data = train.data,
              method = "glmnet",
              preProcess = "scale",
              trControl = trainControl("cv", number = 10),
              tuneGrid = expand.grid(alpha = 1,             # alpha = 1: lasso
                                     lambda = lambda_seq))

plot(lasso)

Exactly the same!

  • ... But change alpha=1!!

And how do we do Lasso in R?

Ridge regression:

coef(ridge$finalModel, ridge$bestTune$lambda)
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept)  6.564243424
## female       0.002726465
## city         0.824387472
## age          0.046468790
## succession  -2.639308962
rmse(ridge, test.data)
## [1] 2.097452

Lasso regression:

coef(lasso$finalModel, lasso$bestTune$lambda)
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                      s1
## (Intercept)  6.84122778
## female       .
## city         0.87982819
## age          0.03099797
## succession  -2.83492585
rmse(lasso, test.data)
## [1] 2.09171

Note that lasso dropped female from the model: its coefficient (the ".") is exactly 0.

A note on binary outcomes

  • If we are predicting binary outcomes, RMSE would not be an appropriate measure anymore!

    • We will use accuracy instead: The proportion (%) of correctly classified observations.
  • For example:
set.seed(100)

# The outcome is now binary, so wrap it in factor() for classification
lasso = train(factor(unsubscribe) ~ . - id, data = train.data,
              method = "glmnet", preProcess = "scale",
              trControl = trainControl("cv", number = 10),
              tuneGrid = expand.grid(alpha = 1, lambda = lambda_seq))

# Predicted classes on the testing data
pred.values = lasso %>% predict(test.data)

# Accuracy: share of correctly classified observations
mean(pred.values == test.data$unsubscribe)
## [1] 0.736

Main takeaway points

  • You can shrink coefficients to introduce bias and decrease variance.

  • Ridge and Lasso regression are similar:

    • Lasso can be used for model selection.
  • Importance of understanding how to estimate the penalty coefficient.


