Next week will be the last class with new material.
The final week of class will be for a review session.
You need to choose a topic for Homework 6
Talking about the bias vs. variance trade-off.
Linear models, model selection and regularization:
Continue on our prediction journey:
Decision Trees: Classification and Regression Trees (CART)
Activity in R: Remember to try to complete it before the end of the class!
Ridge and lasso regression add bias to a linear model to reduce variance:
$\lambda$ represents the ridge/lasso penalty: the larger $\lambda$, the smaller the (sum of) coefficients, e.g. $\sum_k \beta_k^2$ or $\sum_k |\beta_k|$.
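To make the penalty concrete, here is a minimal sketch (not from the original slides) using the glmnet package; a numeric covariate matrix x and outcome vector y are assumed to exist.

library(glmnet)
# alpha = 0 fits ridge (penalty: sum of beta_k^2); alpha = 1 fits lasso (penalty: sum of |beta_k|)
ridge <- glmnet(x, y, alpha = 0)
lasso <- glmnet(x, y, alpha = 1)
# cv.glmnet chooses lambda by cross-validation; a larger lambda shrinks the coefficients more
cv_lasso <- cv.glmnet(x, y, alpha = 1)
coef(cv_lasso, s = "lambda.min")   # coefficients at the cross-validated lambda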
Q1: What is the main difference (in terms of the final model) between Ridge and Lasso regression?
Trees, trees everywhere!
From the videos/readings, how would you explain to someone what a decision tree is?
Create a flow chart for making decisions
... But there are many decisions!
How many variables do we use?
How do we sort them? In what order do we place them?
How do we split them?
How deep do we go?
Q2: What is the main disadvantage of a shallower tree (compared to a deeper tree)?
a) Higher variance
b) Higher bias
c) Lower variance
d) Lower bias
Structure:
Main advantages
Simple interpretation
Mirror human decision-making
Graphic displays!
Handle categorical variables
Main disadvantages
Overfitting
Not very accurate/not very robust
Remember our HBO Max example?
Predict who will cancel their subscription
We have some information:
city: Whether the customer lives in a big city or not.
female: Whether the customer is female or not.
age: Customer's age (in years).
logins: Number of logins to the platform in the past week.
succession: Whether the person has watched Succession or not.
unsubscribe: Whether they canceled their subscription or not.
Our outcome is binary, so this is a classification task.
Let's start looking at two variables:
City & Succession
Recursive Binary Splitting:
Divide regions of covariates in two (recursively).
This works both for continuous and categorical/binary variables.
We test out every covariate and see which one reduces the error the most in our predictions.
In regression tasks, we can use RMSE.
In classification tasks, we can use the accuracy/classification error rate, the Gini Index, or entropy.
$G = \sum_{k=0}^{1} \hat{p}_{mk}(1 - \hat{p}_{mk})$, where $\hat{p}_{mk}$ is the proportion of obs. in region $m$ for class $k$.
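As a small illustrative sketch (not part of the original slides), the Gini index of a single region can be computed directly from its class proportions:

# Gini impurity of one region, given the vector of class proportions p_mk
gini <- function(p) sum(p * (1 - p))
gini(c(0.5, 0.5))   # 0.50: maximally impure for a binary outcome
gini(c(0.9, 0.1))   # 0.18: a much purer region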
In our HBO Max example:
Q3: According to the Gini Index, is it better or worse to have a high $\hat{p}_{mk}$ (i.e. closer to 1)?
$$G = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})$$
succession yields a lower Gini compared to city (0.428 vs. 0.482).
How do we choose?
1) Start at the root node
2) Split the parent node at covariate $x_i$ to minimize the sum of child node impurities (see the sketch below)
3) Stop if the leaves are pure or an early stopping criterion is satisfied; else repeat steps (1) and (2) for each new child node
4) Prune your tree according to a complexity parameter (cp)
5) Assign the average outcome (regression) or the majority (classification) in each leaf.
Adapted from "Machine Learning FAQs" (Raschka, 2021)
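To make steps (1)-(2) concrete, here is an illustrative sketch (not from the slides) of searching for the best binary split of one numeric covariate by minimizing the weighted Gini impurity of the two child nodes; the HBO Max column names at the end are only a hypothetical usage example.

# Gini impurity of one region, from the observed outcomes in that region
gini_region <- function(y) {
  p <- table(y) / length(y)
  sum(p * (1 - p))
}

# Try every candidate cut point of a numeric covariate x and return the one that
# minimizes the weighted impurity of the two resulting child nodes
best_split <- function(x, y) {
  xs   <- sort(unique(x))
  cuts <- head(xs, -1) + diff(xs) / 2   # midpoints between observed values
  scores <- sapply(cuts, function(cp) {
    left  <- y[x <  cp]
    right <- y[x >= cp]
    (length(left) * gini_region(left) + length(right) * gini_region(right)) / length(y)
  })
  cuts[which.min(scores)]
}

# e.g. best_split(hbo.train$age, hbo.train$unsubscribe)   # hypothetical usage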
$$\sum_{m=1}^{|T|} \sum_{i:\, i \in R_m} (y_i - \hat{y}_i)^2 + \alpha |T|$$
What happens if $\alpha = 0$?
library(caret)
set.seed(100)
ct = train(
  factor(unsubscribe) ~ . - id,   # remember your outcome needs to be a factor!
  data = hbo.train,
  method = "rpart",               # the method is called rpart
  trControl = trainControl("cv", number = 10),
  tuneLength = 15
)
tuneLength is useful when you don't want to pass a specific grid (usually it might not be enough, though!)

library(rpart)
set.seed(100)
ct = train(
  factor(unsubscribe) ~ . - id,
  data = hbo.train,
  method = "rpart",
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(cp = seq(0, 1, by = 0.01)),
  control = rpart.control(minsplit = 20)
)
cp: Complexity parameter.
A split must decrease the overall lack of fit by a factor of cp, or it is not attempted.
This is the parameter for pruning the tree: the higher the cp, the smaller the tree!
minsplit: Min. number of obs. in a node to attempt a split.
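As a hedged sketch of how these two controls behave in rpart itself (not shown in the original slides; the HBO Max training data hbo.train is assumed):

library(rpart)
# Grow a deep tree (cp = 0), requiring at least 20 obs. in a node to attempt a split
fit <- rpart(factor(unsubscribe) ~ . - id, data = hbo.train,
             control = rpart.control(cp = 0, minsplit = 20))
printcp(fit)                      # cross-validated error for each value of cp
pruned <- prune(fit, cp = 0.03)   # a larger cp prunes the tree back to fewer splits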
plot(ct)
ct$bestTune
##     cp
## 4 0.03
library(rattle)
fancyRpartPlot(ct$finalModel, caption = "Classification tree for Unsubscribe")
What do you think the percentages in the leaves represent?
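A minimal sketch of how the tuned tree could be evaluated out of sample (not in the original slides; a test split hbo.test with the same columns as hbo.train is assumed):

ct_preds <- predict(ct, newdata = hbo.test)                 # class predictions from the caret object
confusionMatrix(ct_preds, factor(hbo.test$unsubscribe))     # accuracy and the confusion matrix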
Regression Trees
Outcome is continuous
Very similar to what we have seen with classification trees:
set.seed(100)
rt = train(
  logins ~ . - unsubscribe - id,
  data = hbo.train,
  method = "rpart",
  trControl = trainControl("cv", number = 10),
  tuneLength = 20
)
plot(rt)
set.seed(100)
tuneGrid = expand.grid(cp = seq(0, 0.1, by = 0.005))
rt = train(
  logins ~ . - unsubscribe - id,
  data = hbo.train,
  method = "rpart",
  trControl = trainControl("cv", number = 10),
  tuneGrid = tuneGrid
)
plot(rt)
fancyRpartPlot(rt$finalModel, caption="Regression Tree for Login")
rt$finalModel
## n= 5000 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 5000 66387.3700 4.806800  
##   2) succession>=0.5 3535 24633.5000 2.973409  
##     4) city< 0.5 500 517.1580 0.322000 *
##     5) city>=0.5 3035 20022.2800 3.410214 *
##   3) succession< 0.5 1465 1200.0180 9.230717  
##     6) city< 0.5 212 132.2028 8.061321 *
##     7) city>=0.5 1253 728.8571 9.428571 *
Q4: What would the predicted value be for a customer who hasn't watched Succession and lives in a city?
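A minimal sketch for getting predicted values out of the fitted regression tree (not from the slides; hbo.test is an assumed held-out set with the same columns as hbo.train):

rt_preds <- predict(rt, newdata = hbo.test)   # predicted logins: the leaf averages shown above
head(rt_preds)
RMSE(rt_preds, hbo.test$logins)               # caret's RMSE() helper for out-of-sample error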
Main advantages:
Easy to interpret and explain (you can plot them!)
Mirrors human decision-making.
Can handle qualitative predictors (without need for dummies).
Main disadvantages:
Accuracy not as high as other methods
Very sensitive to training data (e.g. overfitting)
Use of decision trees as building blocks for more powerful prediction methods!
Bagging
Random Forests
Boosting
James, G. et al. (2021). "Introduction to Statistical Learning with Applications in R". Springer. Chapter 8.
Starmer, J. (2018). "Decision Trees". Video materials from StatQuest (YouTube).
STHDA. (2018). "CART Model: Decision Tree Essentials".