Started with our prediction chapter
Bias vs. Variance
Validation set approach and Cross-validation
How to choose a model for a continuous outcome (RMSE)
Stepwise selection
1) Which model has higher bias: a complex model or a simpler one?
2) Why do we split our data into training and testing datasets?
3) How do we compare models with continuous outcomes?
1) Start with a null model (no covariates).
2) Test out all models with one covariate, and select the best one.
3) Test out all models with two covariates that include the covariate selected in step 2 (e.g. succession), and again select the best one.
4) Continue adding one covariate at a time; you will end up with k possible models (k: total number of predictors).
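The steps above can be sketched numerically. The course does this in R; here is a minimal forward-stepwise sketch in Python with made-up data (the data, the 4 predictors, and the in-sample RMSE criterion are all illustrative assumptions, not the course's code):

```python
# Hedged sketch: forward stepwise selection by RMSE, on simulated data.
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))                              # 4 candidate predictors
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(size=n)   # only x0 and x2 matter

def rmse_with(cols):
    """Fit OLS on the chosen columns (plus intercept); return in-sample RMSE."""
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return np.sqrt(np.mean((y - Z @ beta) ** 2))

selected, remaining = [], list(range(4))
path = [rmse_with([])]                    # 1) start with the null model
while remaining:
    # 2)-3) try adding each remaining covariate to the current best set
    best = min(remaining, key=lambda j: rmse_with(selected + [j]))
    selected.append(best)
    remaining.remove(best)
    path.append(rmse_with(selected))      # 4) one candidate model per size
print(selected, [round(r, 2) for r in path])
```

The `path` list holds one candidate model per size; in practice you would compare them on held-out (testing) RMSE, not in-sample RMSE.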
How to improve our linear regressions:
Look at binary outcomes
Honey, I shrunk the coefficients!
We reviewed the stepwise procedure: a subset-selection approach to choosing a model.
Shrinkage (a.k.a. regularization): fitting a model with all p predictors, but introducing bias (i.e. shrinking coefficients towards 0) in exchange for an improvement in variance.
Ridge regression
Lasso regression
On top of a ridge.
Ridge regression introduces bias to reduce variance in the testing data set.
In a simple regression (i.e. one regressor/covariate):
$$\min_{\beta}\ \underbrace{\sum_{i=1}^{n}\left(y_i-\beta_0-x_i\beta_1\right)^2}_{\text{OLS}}\ +\ \underbrace{\lambda\cdot\beta_1^2}_{\text{Ridge penalty}}$$
Q1: In general, which model will have smaller β coefficients?
a) A model with a larger λ
b) A model with a smaller λ
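A quick numeric check of Q1 (a Python sketch on simulated data, not the course's R code): for the one-covariate ridge objective, the minimizing slope has a closed form, and making λ larger shrinks it toward 0.

```python
# Hedged sketch: the ridge slope shrinks toward 0 as lambda grows.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(size=100)
xc, yc = x - x.mean(), y - y.mean()   # center so the intercept drops out

def ridge_slope(lam):
    # Closed-form minimizer of sum_i (y_i - b*x_i)^2 + lam * b^2
    return (xc @ yc) / (xc @ xc + lam)

slopes = [ridge_slope(lam) for lam in (0.0, 10.0, 100.0, 1000.0)]
print([round(b, 3) for b in slopes])  # monotonically closer to 0
```

λ = 0 recovers the OLS slope; the denominator grows with λ, so the coefficient can only shrink.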
Remember... we care about accuracy in the testing dataset!
$$\text{RMSE}=\sqrt{\frac{1}{44}\sum_{i=1}^{44}\left(\text{spend}_i-(132.5-16.25\cdot \text{freq}_i)\right)^2}=28.36$$
$$\text{RMSE}=\sqrt{\frac{1}{44}\sum_{i=1}^{44}\left(\text{spend}_i-(119.5-11.25\cdot \text{freq}_i)\right)^2}=12.13$$
$$\min_{\beta}\ \underbrace{\sum_{i=1}^{n}\Big(y_i-\sum_{k=0}^{p}x_{ik}\beta_k\Big)^2}_{\text{OLS}}+\underbrace{\lambda\cdot\sum_{k=1}^{p}\beta_k^2}_{\text{Ridge penalty}}$$
$$\min_{\beta}\ \sum_{i=1}^{n}\left(\text{spend}_i-\beta_0-\beta_1\text{female}_i-\beta_2\text{freq}_i\right)^2+\lambda\cdot(\beta_1^2+\beta_2^2)$$
Because the ridge penalty includes the β coefficients, the scale of the predictors matters: standardize them before fitting!
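To see why scale matters, here is a sketch in Python with hypothetical numbers (income in dollars vs. thousands of dollars): rescaling a predictor rescales its coefficient, so an unstandardized ridge penalty λ·β² would treat the same relationship very differently.

```python
# Hedged sketch: same relationship, two units, very different penalties.
import numpy as np

rng = np.random.default_rng(2)
income_dollars = rng.normal(50_000, 10_000, size=200)
y = 0.0002 * income_dollars + rng.normal(size=200)

def ols_slope(x, y):
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / (xc @ xc)

b_dollars = ols_slope(income_dollars, y)           # tiny coefficient
b_thousands = ols_slope(income_dollars / 1000, y)  # exactly 1000x larger
# An unstandardized ridge penalty lam * b^2 differs by a factor of 1e6:
print(b_dollars**2, b_thousands**2)
```

Standardizing the predictors (as `preProcess = "scale"` does in the course's caret code) puts all coefficients on a comparable footing before the penalty is applied.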
Cross-validation!
1) Choose a grid of λ values
2) Compute cross-validation error (e.g. RMSE) for each
3) Choose the λ with the smallest cross-validation error.
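The three steps above, sketched by hand in Python (the course does this with caret's train; this hand-rolled version on simulated data shows what happens under the hood, using a closed-form ridge fit with no intercept for simplicity):

```python
# Hedged sketch: choosing lambda by 5-fold cross-validation on a grid.
import numpy as np

rng = np.random.default_rng(3)
n, p = 120, 10
X = rng.normal(size=(n, p))
y = X[:, 0] - X[:, 1] + rng.normal(size=n)

def ridge_fit(X, y, lam):
    # Closed-form ridge coefficients (no intercept, for simplicity)
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_rmse(lam, k=5):
    folds = np.array_split(np.arange(n), k)
    errs = []
    for f in folds:
        train = np.setdiff1d(np.arange(n), f)
        beta = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((y[f] - X[f] @ beta) ** 2))
    return np.sqrt(np.mean(errs))

grid = np.linspace(0, 20, 50)            # 1) grid of lambda values
scores = [cv_rmse(lam) for lam in grid]  # 2) CV error for each
best_lam = grid[int(np.argmin(scores))]  # 3) keep the minimizer
print(best_lam)
```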
We will be using the caret package.

- We are doing cross-validation, so remember to set a seed!
- You need to create a grid for the λ's that will be tested.
- The function we will use is train:
  - Formula and data: same as before.
  - method = "glmnet" means that it will run an elastic net.
  - alpha = 0 means it is a ridge regression.
  - lambda = lambda_seq is not necessary (you can provide your own grid).

library(caret)

set.seed(100)

hbo = read.csv("https://raw.githubusercontent.com/maibennett/sta235/main/exampleSite/content/Classes/Week11/1_Shrinkage/data/hbomax.csv")

lambda_seq = seq(0, 20, length = 500)

ridge = train(logins ~ . - unsubscribe - id,
              data = train.data,
              method = "glmnet",
              preProcess = "scale",
              trControl = trainControl("cv", number = 10),
              tuneGrid = expand.grid(alpha = 0, lambda = lambda_seq))

plot(ridge)
Important objects in cv:

- results$lambda: vector of λ values that were tested.
- results$RMSE: RMSE for each λ.
- bestTune$lambda: the λ that minimizes the error term.

OLS regression:
lm1 = lm(logins ~ succession + city, data = train.data)
coef(lm1)

## (Intercept)  succession        city
##    7.035888   -6.306371    2.570454
rmse(lm1, test.data)
## [1] 2.089868
Ridge regression:
coef(ridge$finalModel, ridge$bestTune$lambda)
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept)  6.564243424
## female       0.002726465
## city         0.824387472
## age          0.046468790
## succession  -2.639308962
rmse(ridge, test.data)
## [1] 2.097452
Throwing a lasso
$$\min_{\beta}\ \underbrace{\sum_{i=1}^{n}\Big(y_i-\sum_{k=0}^{p}x_{ik}\beta_k\Big)^2}_{\text{OLS}}+\underbrace{\lambda\cdot\sum_{k=1}^{p}|\beta_k|}_{\text{Lasso penalty}}$$
$$\min_{\beta}\ \sum_{i=1}^{n}\left(\text{spend}_i-\beta_0-\beta_1\text{female}_i-\beta_2\text{freq}_i\right)^2+\lambda\cdot(|\beta_1|+|\beta_2|)$$
$$||\beta||_1=\sum_{k=1}^{p}|\beta_k|$$
Q2: Which of the following are TRUE?
a) A ridge regression will have p coefficients (if we have p predictors)
b) A lasso regression will have p coefficients (if we have p predictors)
c) The larger the λ, the larger the L1 or L2 norm
d) The larger the λ, the smaller the L1 or L2 norm
Ridge
Final model will have p coefficients
Usually better with multicollinearity
Lasso
Can set coefficients = 0
Improves interpretability of model
Can be used for model selection
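The contrast can be seen in a sketch using scikit-learn (the course uses R's glmnet; sklearn's Ridge and Lasso play the same role here, on simulated data where two of three predictors are pure noise): the lasso sets the noise coefficients exactly to zero, while ridge only shrinks them.

```python
# Hedged sketch: lasso zeroes out irrelevant coefficients; ridge does not.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(4)
n = 300
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + rng.normal(size=n)   # x1 and x2 are pure noise

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)
print(ridge.coef_)   # all 3 coefficients nonzero (just shrunk)
print(lasso.coef_)   # noise coefficients set exactly to 0
```

This is why the lasso improves interpretability and can double as a model-selection tool: the surviving nonzero coefficients are the selected predictors.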
library(caret)

set.seed(100)

hbo = read.csv("https://raw.githubusercontent.com/maibennett/sta235/main/exampleSite/content/Classes/Week11/1_Shrinkage/data/hbomax.csv")

lambda_seq = seq(0, 20, length = 500)

lasso = train(logins ~ . - unsubscribe - id,
              data = train.data,
              method = "glmnet",
              preProcess = "scale",
              trControl = trainControl("cv", number = 10),
              tuneGrid = expand.grid(alpha = 1, lambda = lambda_seq))

plot(lasso)
Exactly the same as for ridge, but with alpha = 1!

Ridge regression:
coef(ridge$finalModel, ridge$bestTune$lambda)
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept)  6.564243424
## female       0.002726465
## city         0.824387472
## age          0.046468790
## succession  -2.639308962
rmse(ridge, test.data)
## [1] 2.097452
Lasso regression:
coef(lasso$finalModel, lasso$bestTune$lambda)
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                      s1
## (Intercept)  6.84122778
## female       .
## city         0.87982819
## age          0.03099797
## succession  -2.83492585
rmse(lasso, test.data)
## [1] 2.09171
If we are predicting binary outcomes, RMSE would not be an appropriate measure anymore!
For example:
set.seed(100)

lasso = train(factor(unsubscribe) ~ . - id,
              data = train.data,
              method = "glmnet",
              preProcess = "scale",
              trControl = trainControl("cv", number = 10),
              tuneGrid = expand.grid(alpha = 1, lambda = lambda_seq))

pred.values = lasso %>% predict(test.data)

mean(pred.values == test.data$unsubscribe)
## [1] 0.736
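The R chunk above scores the classifier by accuracy (the share of predictions that match the observed labels) instead of RMSE. A minimal Python sketch of the same metric, with made-up labels:

```python
# Hedged sketch: accuracy = share of predictions matching observed labels.
import numpy as np

observed  = np.array([1, 0, 0, 1, 1, 0, 1, 0])
predicted = np.array([1, 0, 1, 1, 0, 0, 1, 0])
accuracy = np.mean(predicted == observed)
print(accuracy)  # 6 of 8 correct -> 0.75
```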
You can shrink coefficients to introduce bias and decrease variance.
Ridge and lasso regression are similar, but differ in their penalties (L2 vs. L1) and in whether coefficients can be shrunk exactly to zero.
It is important to understand how to estimate the penalty coefficient λ (e.g. via cross-validation).
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). "An Introduction to Statistical Learning with Applications in R". Springer. Chapter 6.
STHDA. (2018). "Penalized Regression Essentials: Ridge, Lasso & Elastic Net".