
Started with our prediction chapter
Bias vs. Variance
Validation set approach and Cross-validation
How to choose a model for a continuous outcome (RMSE)
Stepwise selection
1) Which model is higher bias: A complex model or a simpler one?
2) Why do we split our data into training and testing datasets?
3) How do we compare models with continuous outcomes?
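To recap the validation set approach, here is a minimal base-R sketch on simulated data (all variable names and the rmse helper are made up for illustration): fit on a training split, then compare models by RMSE on the held-out testing split.

```r
# Validation set approach: fit on a training split, evaluate on a test split.
set.seed(100)

n = 200
x = runif(n, 0, 10)
y = 3 + 2 * x + rnorm(n, sd = 2)            # simulated continuous outcome
d = data.frame(x = x, y = y)

train_idx = sample(seq_len(n), size = 0.7 * n)  # 70/30 split
train.d = d[train_idx, ]
test.d  = d[-train_idx, ]

m = lm(y ~ x, data = train.d)

# RMSE in the *testing* data is what we use to compare models
rmse = function(model, data) {
  sqrt(mean((data$y - predict(model, data))^2))
}
rmse(m, test.d)
```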
1) Start with a null model (no covariates)
2) Test out all models with one covariate, and select the best one
3) Test out all models with two covariates that include the best one-covariate model (here, succession!)
4) You will end up with k possible models (k: total number of predictors).
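The forward-selection loop above is what base R's step() automates (here selecting by AIC rather than test RMSE); a sketch on simulated data, with made-up variable names:

```r
set.seed(100)
n = 300
d = data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y = 1 + 2 * d$x1 - 1.5 * d$x2 + rnorm(n)    # x3 is pure noise

null_model = lm(y ~ 1, data = d)               # step 1: no covariates
full_scope = ~ x1 + x2 + x3                    # candidate predictors

# Forward selection: at each step, add the single predictor that most
# improves the AIC, until no addition helps.
fwd = step(null_model, scope = full_scope, direction = "forward", trace = 0)
names(coef(fwd))
```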
How to improve our linear regressions:
Look at binary outcomes

Honey, I shrunk the coefficients!
We reviewed the stepwise procedure: a subsetting model selection approach.
Shrinkage (a.k.a. Regularization): Fitting a model with all p predictors, but introducing bias (i.e. shrinking coefficients towards 0) in exchange for an improvement in variance.
Ridge regression
Lasso regression
On top of a ridge.
Ridge regression introduces bias to reduce variance in the testing data set.
In a simple regression (i.e. one regressor/covariate), OLS minimizes:

$$\min_\beta \underbrace{\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i\beta_1\right)^2}_{\text{OLS}}$$

Ridge regression adds a penalty term:

$$\min_\beta \underbrace{\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i\beta_1\right)^2}_{\text{OLS}} + \underbrace{\lambda\cdot\beta_1^2}_{\text{Ridge Penalty}}$$
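For intuition, a base-R sketch that minimizes the one-covariate ridge objective above directly with optim() (simulated data; this is an illustration, not the class code):

```r
set.seed(100)
n = 100
x = rnorm(n)
y = 5 + 3 * x + rnorm(n)

# Penalized objective: sum((y - b0 - x*b1)^2) + lambda * b1^2
ridge_obj = function(beta, lambda) {
  sum((y - beta[1] - x * beta[2])^2) + lambda * beta[2]^2
}

fit_ridge = function(lambda) {
  optim(c(0, 0), ridge_obj, lambda = lambda)$par[2]   # return the slope
}

# Larger lambda -> slope shrinks toward 0
sapply(c(0, 10, 100, 1000), fit_ridge)
```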
Q1: In general, which model will have smaller β coefficients?
a) A model with a larger λ
b) A model with a smaller λ
Remember... we care about accuracy in the testing dataset!
$$RMSE = \sqrt{\frac{1}{44}\sum_{i=1}^{44}\left(spend_i - (132.5 - 16.25\cdot freq_i)\right)^2} = 28.36$$

$$RMSE = \sqrt{\frac{1}{44}\sum_{i=1}^{44}\left(spend_i - (119.5 - 11.25\cdot freq_i)\right)^2} = 12.13$$
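The RMSE formula above can be computed by hand in base R (a sketch: the spend and freq vectors below are simulated stand-ins, not the class data):

```r
# RMSE = sqrt(mean of squared prediction errors)
set.seed(100)
freq  = runif(44, 1, 10)
spend = 120 - 11 * freq + rnorm(44, sd = 12)   # made-up data

pred = 119.5 - 11.25 * freq                    # the second model's predictions
rmse_by_hand = sqrt(mean((spend - pred)^2))
rmse_by_hand
```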
In general (p predictors):

$$\min_\beta \underbrace{\sum_{i=1}^{n}\Big(y_i - \sum_{k=0}^{p} x_{ik}\beta_k\Big)^2}_{\text{OLS}} + \underbrace{\lambda\cdot\sum_{k=1}^{p}\beta_k^2}_{\text{Ridge Penalty}}$$

E.g.:

$$\min_\beta \sum_{i=1}^{n}\left(spend_i - \beta_0 - \beta_1 female_i - \beta_2 freq_i\right)^2 + \lambda\cdot(\beta_1^2 + \beta_2^2)$$
Because the ridge penalty includes the β coefficients themselves, the scale of the predictors matters:
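A quick base-R illustration of why scale matters (made-up data): measuring the same covariate in different units changes the size of its coefficient, and therefore how hard an unscaled penalty hits it, which is why the code later uses preProcess = "scale".

```r
set.seed(100)
x_km = runif(100, 0, 10)          # distance in kilometers
x_m  = x_km * 1000                # same distance in meters
y = 2 + 4 * x_km + rnorm(100)

# OLS slopes differ exactly by the unit factor...
coef(lm(y ~ x_km))["x_km"]        # roughly 4
coef(lm(y ~ x_m))["x_m"]          # roughly 0.004

# ...so an unscaled ridge penalty lambda * beta^2 punishes the km version
# about 1000^2 times harder than the meter version for the same fit.
# Standardizing puts all coefficients on a comparable scale:
x_std = scale(x_km)
coef(lm(y ~ x_std))["x_std"]
```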
How do we choose λ? Cross-validation!
1) Choose a grid of λ values
2) Compute cross-validation error (e.g. RMSE) for each
3) Choose the λ with the smallest cross-validation error.
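The three steps above can be sketched by hand in base R (an illustration on simulated data, using the closed-form ridge solution; the class code uses caret instead):

```r
set.seed(100)
n = 200; p = 5
X = scale(matrix(rnorm(n * p), n, p))     # standardized predictors
y = X %*% c(3, -2, 0, 0, 1) + rnorm(n, sd = 2)

# Closed-form ridge coefficients: (X'X + lambda*I)^{-1} X'y
ridge_coef = function(X, y, lambda) {
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
}

# 1) Choose a grid of lambda values
lambda_seq = seq(0, 20, length = 50)

# 2) Compute 10-fold cross-validated RMSE for each lambda
folds = sample(rep(1:10, length.out = n))
cv_rmse = sapply(lambda_seq, function(l) {
  errs = sapply(1:10, function(f) {
    b = ridge_coef(X[folds != f, ], y[folds != f], l)
    mean((y[folds == f] - X[folds == f, ] %*% b)^2)
  })
  sqrt(mean(errs))
})

# 3) Choose the lambda with the smallest CV error
best_lambda = lambda_seq[which.min(cv_rmse)]
best_lambda
```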
We will be using the caret package.
We are doing cross-validation, so remember to set a seed!
You need to create a grid for the λ's that will be tested.
The function we will use is train: Same as before.
method = "glmnet" means that it will run an elastic net; alpha = 0 means it is a ridge regression; providing lambda = lambda_seq is not necessary (you can provide your own grid).

library(caret)

set.seed(100)

hbo = read.csv("https://raw.githubusercontent.com/maibennett/sta235/main/exampleSite/content/Classes/Week11/1_Shrinkage/data/hbomax.csv")

lambda_seq = seq(0, 20, length = 500)

ridge = train(logins ~ . - unsubscribe - id,
              data = train.data,
              method = "glmnet",
              preProcess = "scale",
              trControl = trainControl("cv", number = 10),
              tuneGrid = expand.grid(alpha = 0, lambda = lambda_seq))

plot(ridge)
Important objects in the trained object (here, ridge):

ridge$results$lambda: Vector of λ values that were tested
ridge$results$RMSE: RMSE for each λ
ridge$bestTune$lambda: λ that minimizes the error term.

OLS regression:
lm1 = lm(logins ~ succession + city, data = train.data)
coef(lm1)

## (Intercept)  succession        city
##    7.035888   -6.306371    2.570454

rmse(lm1, test.data)

## [1] 2.089868

Ridge regression:

coef(ridge$finalModel, ridge$bestTune$lambda)

## 5 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept)  6.564243424
## female       0.002726465
## city         0.824387472
## age          0.046468790
## succession  -2.639308962

rmse(ridge, test.data)

## [1] 2.097452

Throwing a lasso
$$\min_\beta \underbrace{\sum_{i=1}^{n}\Big(y_i - \sum_{k=0}^{p} x_{ik}\beta_k\Big)^2}_{\text{OLS}} + \underbrace{\lambda\cdot\sum_{k=1}^{p}|\beta_k|}_{\text{Lasso Penalty}}$$

E.g.:

$$\min_\beta \sum_{i=1}^{n}\left(spend_i - \beta_0 - \beta_1 female_i - \beta_2 freq_i\right)^2 + \lambda\cdot(|\beta_1| + |\beta_2|)$$

The lasso penalty is the L1 norm of the coefficient vector:

$$||\beta||_1 = \sum_{k=1}^{p}|\beta_k|$$
Q2: Which of the following are TRUE?
a) A ridge regression will have p coefficients (if we have p predictors)
b) A lasso regression will have p coefficients (if we have p predictors)
c) The larger the λ, the larger the L1 or L2 norm
d) The larger the λ, the smaller the L1 or L2 norm
Ridge
Final model will have p coefficients
Usually better with multicollinearity
Lasso
Can set coefficients = 0
Improves interpretability of model
Can be used for model selection
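The reason lasso can set coefficients exactly to 0 while ridge cannot is visible in the single standardized-predictor case, where both penalized problems have closed forms (a sketch; b_ols is a hypothetical OLS estimate):

```r
b_ols  = 2                       # hypothetical OLS coefficient
lambda = seq(0, 6, by = 1)

# Closed forms for minimizing (b_ols - b)^2 plus the penalty:
ridge_b = b_ols / (1 + lambda)                            # shrinks, never 0
lasso_b = sign(b_ols) * pmax(abs(b_ols) - lambda / 2, 0)  # soft-threshold: hits 0

data.frame(lambda, ridge_b, lasso_b)
```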
library(caret)

set.seed(100)

hbo = read.csv("https://raw.githubusercontent.com/maibennett/sta235/main/exampleSite/content/Classes/Week11/1_Shrinkage/data/hbomax.csv")

lambda_seq = seq(0, 20, length = 500)

lasso = train(logins ~ . - unsubscribe - id,
              data = train.data,
              method = "glmnet",
              preProcess = "scale",
              trControl = trainControl("cv", number = 10),
              tuneGrid = expand.grid(alpha = 1, lambda = lambda_seq))

plot(lasso)
Exactly the same, except alpha = 1!

Ridge regression:

coef(ridge$finalModel, ridge$bestTune$lambda)

## 5 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept)  6.564243424
## female       0.002726465
## city         0.824387472
## age          0.046468790
## succession  -2.639308962

rmse(ridge, test.data)

## [1] 2.097452

Lasso regression:

coef(lasso$finalModel, lasso$bestTune$lambda)

## 5 x 1 sparse Matrix of class "dgCMatrix"
##                      s1
## (Intercept)  6.84122778
## female       .
## city         0.87982819
## age          0.03099797
## succession  -2.83492585

rmse(lasso, test.data)

## [1] 2.09171

If we are predicting binary outcomes, RMSE would not be an appropriate measure anymore!

For example:
set.seed(100)

lasso = train(factor(unsubscribe) ~ . - id,
              data = train.data,
              method = "glmnet",
              preProcess = "scale",
              trControl = trainControl("cv", number = 10),
              tuneGrid = expand.grid(alpha = 1, lambda = lambda_seq))

pred.values = lasso %>% predict(test.data)

mean(pred.values == test.data$unsubscribe)

## [1] 0.736

You can shrink coefficients to introduce bias and decrease variance.
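The mean(pred.values == test.data$unsubscribe) line above is classification accuracy: the share of predictions that match the actual outcome. A tiny base-R illustration with made-up vectors:

```r
actual = c(1, 0, 1, 1, 0, 0, 1, 0)   # hypothetical true labels
pred   = c(1, 0, 0, 1, 0, 1, 1, 0)   # hypothetical predictions

accuracy = mean(pred == actual)      # proportion of correct predictions
accuracy                             # 6 of 8 match -> 0.75
```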
Ridge and Lasso regression are similar, but differ in their penalty term (L2 vs. L1 norm).
It is important to understand how to estimate the penalty coefficient λ (e.g. via cross-validation).

James, G. et al. (2021). "Introduction to Statistical Learning with Applications in R". Springer. Chapter 6.
STHDA. (2018). "Penalized Regression Essentials: Ridge, Lasso & Elastic Net"
