
STA 235H - Model Selection II: Shrinkage

Fall 2022

McCombs School of Business, UT Austin


Last class

  • Started with our prediction chapter

    • Bias vs. Variance

    • Validation set approach and Cross-validation

    • How to choose a model for a continuous outcome (RMSE)

    • Stepwise selection


Knowledge check from last week

1) Which model is higher bias: A complex model or a simpler one?

2) Why do we split our data into training and testing datasets?

3) How do we compare models with continuous outcomes?


How forward stepwise selection works: Example from last class

1) Start with a null model (no covariates)

  • Your best guess will be the average of the outcome in the training dataset!

2) Test out all models with one covariate, and select the best one:

  • E.g. logins ~ female, logins ~ succession, logins ~ age, ...
  • logins ~ succession is the best one (according to RMSE)

3) Test out all models with two covariates that include succession:

  • E.g. logins ~ succession + female, logins ~ succession + age, logins ~ succession + city, ...

4) You will end up with k possible models (k: total number of predictors).

  • Choose the best one according to the RMSE (a small sketch in R follows below).
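Below is a minimal sketch of this procedure in R. It assumes train.data and test.data splits with the outcome logins (as in last class); the helper rmse_fs, the selected/candidates objects, and the loop are illustrative, not the exact code used in class.

# Forward stepwise selection by test-set RMSE (illustrative sketch)
rmse_fs = function(model, data) sqrt(mean((data$logins - predict(model, data))^2))

# Step 1: null model -> predict the training average of the outcome
rmse_null = sqrt(mean((test.data$logins - mean(train.data$logins))^2))

candidates = setdiff(names(train.data), c("logins", "unsubscribe", "id"))
selected = c()
best_by_size = data.frame(size = integer(), rmse = numeric())

for (size in seq_along(candidates)) {
  # Steps 2-3: try adding each remaining covariate to the ones already selected
  rmses = sapply(setdiff(candidates, selected), function(v) {
    f = reformulate(c(selected, v), response = "logins")
    rmse_fs(lm(f, data = train.data), test.data)
  })
  selected = c(selected, names(which.min(rmses)))   # keep the best addition
  best_by_size = rbind(best_by_size, data.frame(size = size, rmse = min(rmses)))
}

# Step 4: compare the best model of each size (and the null model) by RMSE
best_by_size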

Today: Continuing our journey

  • How to improve our linear regressions:

    • Ridge regression
    • Lasso regression
  • Look at binary outcomes


Honey, I shrunk the coefficients!


What is shrinkage?

  • We reviewed the stepwise procedure: a subsetting model selection approach.

    • Select k out of p total predictors
  • Shrinkage (a.k.a. regularization): Fit a model with all p predictors, but introduce bias (i.e. shrink coefficients towards 0) in exchange for a reduction in variance.

    • Ridge regression

    • Lasso regression


On top of a ridge.


Ridge Regression: An example

  • Predict spending based on frequency of visits to a website


Ordinary Least Squares

  • In an OLS, we minimize the sum of squared errors, i.e. $\min_\beta \sum_{i=1}^{n}\left(spend_i - (\beta_0 + \beta_1 \cdot freq_i)\right)^2$
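As a quick sketch, this is just a simple linear regression in R (spending.data is a hypothetical data frame holding the example's training observations, with columns spend and freq):

# OLS fit for the example: spend regressed on freq
ols = lm(spend ~ freq, data = spending.data)
coef(ols)   # estimated beta_0 (intercept) and beta_1 (slope)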


What about fit?

  • Does the OLS fit the testing data well?


Ridge Regression

  • Let's shrink the coefficients: Ridge regression!


Ridge Regression: What does it do?

  • Ridge regression introduces bias to reduce variance in the testing data set.

  • In a simple regression (i.e. one regressor/covariate):

$\underbrace{\min_\beta \sum_{i=1}^{n}\left(y_i - \beta_0 - x_i\beta_1\right)^2}_{OLS} + \underbrace{\lambda\beta_1^2}_{\text{Ridge Penalty}}$

  • λ is the penalty factor: it indicates how much we want to shrink the coefficients.
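To make the objective concrete, here is the penalized loss written out as an R function (an illustrative sketch only; ridge_loss is not part of the class code, and spend/freq are the example's variables):

# Ridge objective for the one-regressor example:
# sum of squared errors plus lambda times the squared slope coefficient
ridge_loss = function(b0, b1, lambda, spend, freq) {
  sum((spend - (b0 + b1 * freq))^2) + lambda * b1^2
}
# A larger lambda puts more weight on keeping b1 close to 0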

Q1: In general, which model will have smaller β coefficients?

a) A model with a larger λ

b) A model with a smaller λ


Remember... we care about accuracy in the testing dataset!


RMSE on the testing dataset: OLS

$RMSE = \sqrt{\frac{1}{4}\sum_{i=1}^{4}\left(spend_i - (132.5 - 16.25 \cdot freq_i)\right)^2} = 28.36$
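The same number can be computed by hand in R (a sketch, assuming the ols fit from before and a 4-observation test.data with columns spend and freq):

# Test-set RMSE for the OLS fit
pred = predict(ols, newdata = test.data)
sqrt(mean((test.data$spend - pred)^2))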


RMSE on the testing dataset: Ridge Regression

$RMSE = \sqrt{\frac{1}{4}\sum_{i=1}^{4}\left(spend_i - (119.5 - 11.25 \cdot freq_i)\right)^2} = 12.13$


Ridge Regression in general

  • For regressions that include more than one regressor:

$\underbrace{\min_\beta \sum_{i=1}^{n}\left(y_i - \sum_{k=0}^{p} x_{ik}\beta_k\right)^2}_{OLS} + \underbrace{\lambda \sum_{k=1}^{p}\beta_k^2}_{\text{Ridge Penalty}}$

  • In our previous example, if we had two regressors, female and freq:

$\min_\beta \sum_{i=1}^{n}\left(spend_i - \beta_0 - \beta_1 female_i - \beta_2 freq_i\right)^2 + \lambda\left(\beta_1^2 + \beta_2^2\right)$

  • Because the ridge penalty includes the β coefficients, the scale of the variables matters:

    • Standardize the variables (you can do this as an option in your code)

How do we choose λ?

Cross-validation!

1) Choose a grid of λ values

  • The grid you choose will be context dependent (play around with it!)

2) Compute cross-validation error (e.g. RMSE) for each

3) Choose the λ with the smallest cross-validation error (a small sketch follows below).
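The same procedure can be run directly with the glmnet package (a minimal sketch, assuming train.data and the lambda_seq grid used in the caret code on the next slides, which wraps this for us):

library(glmnet)

# Predictor matrix (drop the intercept column) and outcome vector
x = model.matrix(logins ~ . - unsubscribe - id, data = train.data)[, -1]
y = train.data$logins

# 10-fold cross-validation over the lambda grid (alpha = 0 gives ridge)
cv_fit = cv.glmnet(x, y, alpha = 0, lambda = lambda_seq, nfolds = 10)
cv_fit$lambda.min   # lambda with the smallest cross-validation error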


λ vs RMSE?

(Plots: cross-validation RMSE as a function of λ, followed by a zoomed-in view around the minimum.)

How do we do this in R?

library(caret)

set.seed(100)   # cross-validation uses random folds, so set a seed

hbo = read.csv("https://raw.githubusercontent.com/maibennett/sta235/main/exampleSite/content/Classes/Week11/1_Shrinkage/data/hbomax.csv")

lambda_seq = seq(0, 20, length = 500)   # grid of lambda values to test

# train.data is assumed to be the training split of hbo created earlier
ridge = train(logins ~ . - unsubscribe - id,
              data = train.data,
              method = "glmnet",                            # elastic net
              preProcess = "scale",                         # standardize variables
              trControl = trainControl("cv", number = 10),  # 10-fold CV
              tuneGrid = expand.grid(alpha = 0,             # alpha = 0: ridge
                                     lambda = lambda_seq))

plot(ridge)   # cross-validation RMSE as a function of lambda
  • We will be using the caret package

  • We are doing cross-validation, so remember to set a seed!

  • You need to create a grid of λ values to be tested

  • The function we will use is train: same as before

    • method = "glmnet" means it will run an elastic net.
    • alpha = 0 means it is a ridge regression.
    • lambda = lambda_seq is not strictly necessary (you can provide your own grid).

  • Important objects in the cross-validation output:

    • results$lambda: vector of λ values that were tested
    • results$RMSE: RMSE for each λ
    • bestTune$lambda: the λ that minimizes the cross-validation error
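For instance, a quick way to look at these objects (a short sketch, assuming the ridge object fit above):

head(ridge$results[, c("lambda", "RMSE")])   # cross-validation RMSE for each lambda
ridge$bestTune$lambda                        # lambda that minimizes the CV error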

How do we do this in R?

OLS regression:

lm1 = lm(logins ~ succession + city,
         data = train.data)
coef(lm1)
## (Intercept)  succession        city
##    7.035888   -6.306371    2.570454

# rmse(model, data): test-set RMSE (e.g. modelr::rmse or a helper defined in class)
rmse(lm1, test.data)
## [1] 2.089868

Ridge regression:

coef(ridge$finalModel, ridge$bestTune$lambda)
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept)  6.564243424
## female       0.002726465
## city         0.824387472
## age          0.046468790
## succession  -2.639308962
rmse(ridge, test.data)
## [1] 2.097452

Throwing a lasso


Lasso regression

  • Very similar to ridge regression, except it changes the penalty term:

$\underbrace{\min_\beta \sum_{i=1}^{n}\left(y_i - \sum_{k=0}^{p} x_{ik}\beta_k\right)^2}_{OLS} + \underbrace{\lambda \sum_{k=1}^{p}|\beta_k|}_{\text{Lasso Penalty}}$

  • In our previous example:

$\min_\beta \sum_{i=1}^{n}\left(spend_i - \beta_0 - \beta_1 female_i - \beta_2 freq_i\right)^2 + \lambda\left(|\beta_1| + |\beta_2|\right)$

  • Lasso regression is also called l1 regularization:

$||\beta||_1 = \sum_{k=1}^{p}|\beta_k|$
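(For comparison, the ridge penalty is the squared l2 norm of the coefficients, $||\beta||_2^2 = \sum_{k=1}^{p}\beta_k^2$, which is why ridge is also called l2 regularization.)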


Q2: Which of the following are TRUE?

a) A ridge regression will have p coefficients (if we have p predictors)

b) A lasso regression will have p coefficients (if we have p predictors)

c) The larger the λ, the larger the L1 or L2 norm

d) The larger the λ, the smaller the L1 or L2 norm


Ridge vs Lasso

Ridge:

  • Final model will have p coefficients

  • Usually better with multicollinearity

Lasso:

  • Can set coefficients exactly to 0

  • Improves interpretability of the model

  • Can be used for model selection

And how do we do Lasso in R?

library(caret)

set.seed(100)

hbo = read.csv("https://raw.githubusercontent.com/maibennett/sta235/main/exampleSite/content/Classes/Week11/1_Shrinkage/data/hbomax.csv")

lambda_seq = seq(0, 20, length = 500)

lasso = train(logins ~ . - unsubscribe - id, data = train.data,
              method = "glmnet",
              preProcess = "scale",
              trControl = trainControl("cv", number = 10),
              tuneGrid = expand.grid(alpha = 1,             # alpha = 1: lasso
                                     lambda = lambda_seq))

plot(lasso)

Exactly the same!

  • ... But change alpha=1!!

And how do we do Lasso in R?

Ridge regression:

coef(ridge$finalModel, ridge$bestTune$lambda)
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept)  6.564243424
## female       0.002726465
## city         0.824387472
## age          0.046468790
## succession  -2.639308962
rmse(ridge, test.data)
## [1] 2.097452

Lasso regression:

coef(lasso$finalModel, lasso$bestTune$lambda)
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                      s1
## (Intercept)  6.84122778
## female       .
## city         0.87982819
## age          0.03099797
## succession  -2.83492585
rmse(lasso, test.data)
## [1] 2.09171

Note that lasso dropped female from the model: its coefficient (the ".") is exactly 0.

A note on binary outcomes

  • If we are predicting binary outcomes, RMSE would not be an appropriate measure anymore!

    • We will use accuracy instead: The proportion (%) of correctly classified observations.
  • For example:
set.seed(100)

# The outcome is now binary, so wrap it in factor() for classification
lasso = train(factor(unsubscribe) ~ . - id, data = train.data,
              method = "glmnet", preProcess = "scale",
              trControl = trainControl("cv", number = 10),
              tuneGrid = expand.grid(alpha = 1, lambda = lambda_seq))

# Predicted classes on the testing data
pred.values = lasso %>% predict(test.data)

# Accuracy: share of correctly classified observations
mean(pred.values == test.data$unsubscribe)
## [1] 0.736

Main takeaway points

  • You can shrink coefficients to introduce bias and decrease variance.

  • Ridge and Lasso regression are similar:

    • Lasso can be used for model selection.
  • Importance of understanding how to estimate the penalty coefficient.


