Use the knowledge check portion of the JITT to assess your own understanding:
Be sure to answer the question correctly (look at the feedback provided).
The feedback is a guideline; try to use your own words.
If you are struggling with material covered in STA 301H: Check the course website for resources and come to Office Hours.
Office Hours Prof. Bennett: Wed 10.30-11.30am and Thu 4.00-5.30pm
No in-person class next week -- Recorded class
Quick multiple regression review:
Looking at your data:
Nonlinear models:
Three criteria:
lm(Adj_Revenue ~ bechdel_test + Adj_Budget + Metascore + imdbRating, data=bechdel)
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -127.0710    17.0563 -7.4501   0.0000
## bechdel_test   11.0009     4.3786  2.5124   0.0121
## Adj_Budget      1.1192     0.0367 30.4866   0.0000
## Metascore       7.0254     1.9058  3.6864   0.0002
## imdbRating     15.4631     3.3914  4.5595   0.0000
What does each column represent?
"Estimate": Point estimates of our parameters $\beta$. We call them $\hat{\beta}$.
"Standard Error" (SE): You can think of it as the variability of $\hat{\beta}$. The smaller the SE, the more precise $\hat{\beta}$ is!
"t-value": A value from the Student's t distribution that measures how many SEs away from 0 $\hat{\beta}$ is. You can calculate it as $t = \hat{\beta} / SE$. It relates to our null hypothesis $H_0: \beta = 0$.
"p-value": The probability, assuming the null hypothesis is true, of observing a t-value at least as extreme as the one we obtained. When it is smaller than the significance level $\alpha$ (conventionally 0.05), we reject the null hypothesis and call the coefficient statistically significant. $\alpha$ is the probability of rejecting a null hypothesis that is actually true (Type I error).
We are testing $H_0: \beta = 0$ vs $H_1: \beta \neq 0$
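As a rough illustration of how the "t value" and "Pr(>|t|)" columns relate to the estimate and its standard error, the sketch below redoes the arithmetic for the bechdel_test row of the table. The residual degrees of freedom are not reported in these slides, so the p-value here uses a normal (large-sample) approximation; that choice is an assumption, not necessarily what R did to produce the table.

```r
# Values copied from the bechdel_test row of the regression table
beta_hat <- 11.0009   # "Estimate"
se       <- 4.3786    # "Std. Error"

# t-value: how many standard errors the estimate is away from 0
t_val <- beta_hat / se

# Two-sided p-value for H0: beta = 0.
# ASSUMPTION: degrees of freedom are unknown here, so we use the normal
# approximation. With a fitted model `fit` you could instead use
# 2 * pt(-abs(t_val), df = df.residual(fit)).
p_val <- 2 * pnorm(-abs(t_val))

round(c(t_value = t_val, p_value = p_val), 4)
```

The t-value should match the table up to rounding, and the approximate p-value should land very close to the reported 0.0121.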
Note: Figures adapted from @AllisonHorst's art
Reject the null if the t-value falls outside the dashed lines.
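The decision rule can be sketched in a few lines. The t-values are copied from the first regression table; the critical value assumes a 5% significance level and, because the degrees of freedom are not shown in the slides, a normal approximation (both are assumptions made for this illustration).

```r
# t-values copied from the first regression table
t_values <- c(`(Intercept)` = -7.4501,
              bechdel_test  = 2.5124,
              Adj_Budget    = 30.4866,
              Metascore     = 3.6864,
              imdbRating    = 4.5595)

alpha <- 0.05                      # assumed significance level
critical <- qnorm(1 - alpha / 2)   # the "dashed lines" sit at -critical and +critical

# Reject H0: beta = 0 when the t-value falls outside [-critical, critical]
reject <- abs(t_values) > critical
data.frame(t_value = t_values, reject_H0 = reject)
```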
How would you test that in an equation?
Interactions!
Interaction model:
$$Revenue = \beta_0 + \beta_1 Bechdel + \beta_3 Budget + \beta_6 (Budget \times Bechdel) + \beta_4 IMDB + \beta_5 MetaScore + \varepsilon$$
How should we think about this?
Write the equation for a movie that does not pass the Bechdel test. What does it look like?
Now do the same for a movie that passes the Bechdel test. What does it look like?
Now, let's interpret some coefficients:
If $Bechdel = 0$, then:
$$Revenue = \beta_0 + \beta_3 Budget + \beta_4 IMDB + \beta_5 MetaScore + \varepsilon$$
If $Bechdel = 1$, then:
$$Revenue = (\beta_0 + \beta_1) + (\beta_3 + \beta_6) Budget + \beta_4 IMDB + \beta_5 MetaScore + \varepsilon$$
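To see the two group-specific equations come out of a single fitted model, here is a minimal sketch on simulated data. The variables `d`, `x`, `y`, and the true coefficients are hypothetical and chosen only for illustration; they are not the Bechdel data.

```r
set.seed(235)                       # arbitrary seed, for reproducibility
n <- 500
d <- rep(c(0, 1), each = n / 2)     # hypothetical binary indicator (like Bechdel)
x <- runif(n, 0, 10)                # hypothetical continuous covariate (like Budget)
y <- 1 + 2 * d + 3 * x + 0.5 * d * x + rnorm(n, sd = 0.1)
sim <- data.frame(y, d, x)

# In an R formula, d * x expands to d + x + d:x
fit <- lm(y ~ d * x, data = sim)
b <- coef(fit)

# Intercept and slope implied for each group
group0 <- c(intercept = unname(b["(Intercept)"]),
            slope     = unname(b["x"]))
group1 <- c(intercept = unname(b["(Intercept)"] + b["d"]),
            slope     = unname(b["x"] + b["d:x"]))
rbind(group0, group1)
```

The group with d = 1 gets both a shifted intercept and a different slope, which is the pattern the two equations describe.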
lm(Adj_Revenue ~ bechdel_test*Adj_Budget + Metascore + imdbRating, data=bechdel)
##                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)             -124.1997    17.4932 -7.0999   0.0000
## bechdel_test               7.5138     6.4257  1.1693   0.2425
## Adj_Budget                 1.0926     0.0513 21.2865   0.0000
## Metascore                  7.1424     1.9126  3.7344   0.0002
## imdbRating                15.2268     3.4069  4.4694   0.0000
## bechdel_test:Adj_Budget    0.0546     0.0737  0.7416   0.4585
What is the association between budget and revenue for movies that pass the Bechdel test?
What is the difference in the association between budget and revenue for movies that pass vs movies that don't pass the Bechdel test?
Is that difference statistically significant (at conventional levels)?
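One way to work through these questions is to do the arithmetic directly on the reported coefficients. The sketch below copies the estimates and the interaction p-value from the table; the 0.05 threshold is the conventional level and is an assumption here.

```r
# Estimates copied from the interaction model output
b_budget      <- 1.0926   # Adj_Budget
b_interaction <- 0.0546   # bechdel_test:Adj_Budget
p_interaction <- 0.4585   # Pr(>|t|) for the interaction term

# Budget slope for movies that do not pass (bechdel_test = 0)
slope_fail <- b_budget
# Budget slope for movies that pass (bechdel_test = 1)
slope_pass <- b_budget + b_interaction
# Difference in slopes between the two groups is the interaction coefficient
slope_diff <- slope_pass - slope_fail

# Is the difference statistically significant at the (assumed) 5% level?
significant <- p_interaction < 0.05

c(slope_fail = slope_fail, slope_pass = slope_pass, slope_diff = slope_diff)
significant
```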
Let's look at another example
cars <- read.csv("https://raw.githubusercontent.com/maibennett/sta235/main/exampleSite/content/Classes/Week2/1_OLS/data/SoCalCars.csv", stringsAsFactors = FALSE)
names(cars)
##  [1] "type"      "certified" "body"      "make"      "model"     "trim"
##  [7] "mileage"   "price"     "year"      "dealer"    "city"      "rating"
## [13] "reviews"   "badge"
Data source: "Modern Business Analytics" (Taddy, Hendrix, & Harding, 2018)
Do you think there's a difference between how price changes over time for luxury vs non-luxury cars?
How would you test this?
Let's go to R
$$\widehat{Price} = \hat{\beta}_0 + \hat{\beta}_1 Rating + \hat{\beta}_2 Miles + \hat{\beta}_3 Luxury + \hat{\beta}_4 Year + \hat{\beta}_5 Luxury \times Year$$
The coefficient you are interested in is $\hat{\beta}_5$:
1) What is the association between price and year for non-luxury cars?
2) What is the association between price and year for luxury cars?
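One way to approach both questions, following the same steps used in the Bechdel example, is to plug $Luxury = 0$ and $Luxury = 1$ into the fitted equation and read off the coefficient on $Year$ in each case:

```latex
% Luxury = 0 (non-luxury cars)
\widehat{Price} = \hat{\beta}_0 + \hat{\beta}_1 Rating + \hat{\beta}_2 Miles + \hat{\beta}_4 Year

% Luxury = 1 (luxury cars)
\widehat{Price} = (\hat{\beta}_0 + \hat{\beta}_3) + \hat{\beta}_1 Rating + \hat{\beta}_2 Miles + (\hat{\beta}_4 + \hat{\beta}_5) Year
```

Under this model, the year slope is $\hat{\beta}_4$ for non-luxury cars and $\hat{\beta}_4 + \hat{\beta}_5$ for luxury cars, so $\hat{\beta}_5$ is the difference between the two.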
What should we do before we run any model?
Inspect your data!
Use vtable:
library(vtable)
vtable(cars)
Use summary to see the min, max, mean, and quartiles:
library(dplyr)  # provides %>% and select()
cars %>% select(price, mileage, year) %>% summary()
##      price            mileage            year
##  Min.   :   1790   Min.   :     0   Min.   :1966
##  1st Qu.:  16234   1st Qu.:     5   1st Qu.:2017
##  Median :  23981   Median :    56   Median :2019
##  Mean   :  32959   Mean   : 21873   Mean   :2018
##  3rd Qu.:  36745   3rd Qu.: 36445   3rd Qu.:2020
##  Max.   :1499000   Max.   :292952   Max.   :2021
What can you say about this variable?
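In the summary output, the mean price is well above the median and the maximum is far from the third quartile, which is the usual signature of a right-skewed variable. The sketch below uses simulated (hypothetical) prices, not the cars data, to show how a log transformation can pull such a variable toward symmetry.

```r
set.seed(1)  # arbitrary seed
# HYPOTHETICAL right-skewed prices drawn from a log-normal distribution
price_sim <- rlnorm(1000, meanlog = 10, sdlog = 0.6)

# On the raw scale, a long right tail drags the mean above the median
raw_stats <- c(mean = mean(price_sim), median = median(price_sim))

# On the log scale, mean and median are much closer together
log_stats <- c(mean = mean(log(price_sim)), median = median(log(price_sim)))

raw_stats
log_stats

# Histograms make the same point visually (only drawn in an interactive session)
if (interactive()) {
  hist(price_sim, main = "Simulated price")
  hist(log(price_sim), main = "log(simulated price)")
}
```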
$$\log(Price) = \beta_0 + \beta_1 Rating + \beta_2 Miles + \beta_3 Luxury + \beta_4 Year + \varepsilon$$
Remember: $\beta_2$ represents the average change in the outcome variable, $\log(Price)$, for a one-unit increase in the independent variable $Miles$.
$$\log(Y) = \hat{\beta}_0 + \hat{\beta}_1 X$$
We want to compare the outcome for a regression with $X = x$ and $X = x + 1$:
$$\log(y_0) = \hat{\beta}_0 + \hat{\beta}_1 x \quad (1)$$
and
$$\log(y_1) = \hat{\beta}_0 + \hat{\beta}_1 (x + 1) \quad (2)$$
Subtracting (1) from (2):
$$\log(y_1) - \log(y_0) = \hat{\beta}_0 + \hat{\beta}_1 (x + 1) - (\hat{\beta}_0 + \hat{\beta}_1 x)$$
$$\log\left(\frac{y_1}{y_0}\right) = \hat{\beta}_1$$
$$\log\left(1 + \frac{y_1 - y_0}{y_0}\right) = \hat{\beta}_1$$
Solving exactly for the relative change:
$$\rightarrow \frac{\Delta y}{y} = \exp(\hat{\beta}_1) - 1$$
For small changes, $\log(1 + z) \approx z$, so:
$$\frac{y_1 - y_0}{y_0} \approx \hat{\beta}_1$$
$$\rightarrow \%\Delta y \approx 100 \times \hat{\beta}_1$$
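To get a feel for when the approximation is close to the exact result, the following sketch compares the two percentage changes over a small grid of coefficient values. The grid itself is arbitrary and chosen only for illustration.

```r
# Hypothetical coefficient values on the log scale
beta_grid <- c(0.01, 0.05, 0.10, 0.30, 0.50)

comparison <- data.frame(
  beta_hat   = beta_grid,
  approx_pct = 100 * beta_grid,              # 100 * beta
  exact_pct  = 100 * (exp(beta_grid) - 1)    # 100 * (exp(beta) - 1)
)
# How far the approximation is from the exact percentage change
comparison$gap <- comparison$exact_pct - comparison$approx_pct

print(round(comparison, 2))
```

The gap is negligible for coefficients near 0.01 and grows quickly as the coefficient gets larger, so the shortcut is best reserved for small coefficients.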
$$\log(Price) = \beta_0 + \beta_1 Rating + \beta_2 Miles + \beta_3 Luxury + \beta_4 Year + \varepsilon$$
For an additional 1,000 miles (note: Miles is measured in thousands of miles), the logarithm of the price increases/decreases, on average, by $\hat{\beta}_2$, holding other variables constant.
For an additional 1,000 miles, the price increases/decreases, on average, by approximately $100 \times \hat{\beta}_2$%, holding other variables constant.
summary(lm(log(price) ~ rating + mileage + luxury + year, data = cars))
##
## Call:
## lm(formula = log(price) ~ rating + mileage + luxury + year, data = cars)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.14363 -0.29112 -0.02593  0.26412  2.28855
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.5105164  0.1518312  16.535  < 2e-16 ***
## rating       0.0305782  0.0057680   5.301 1.27e-07 ***
## mileage     -0.0098628  0.0004327 -22.792  < 2e-16 ***
## luxury       0.5517712  0.0228132  24.186  < 2e-16 ***
## year         0.0118783  0.0030075   3.950 8.09e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.436 on 2083 degrees of freedom
## Multiple R-squared:  0.4699, Adjusted R-squared:  0.4689
## F-statistic: 461.6 on 4 and 2083 DF,  p-value: < 2.2e-16
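As a hedged illustration of how these estimates translate into percentage changes in price, the sketch below applies both the $100 \times \hat{\beta}$ shortcut and the exact $100 \times (\exp(\hat{\beta}) - 1)$ formula to the coefficients copied from the output.

```r
# Estimates copied from the log(price) regression output
coefs <- c(rating  = 0.0305782,
           mileage = -0.0098628,
           luxury  = 0.5517712,
           year    = 0.0118783)

approx_pct <- 100 * coefs               # shortcut: 100 * beta
exact_pct  <- 100 * (exp(coefs) - 1)    # exact:    100 * (exp(beta) - 1)

round(cbind(approx_pct, exact_pct), 2)
```

For mileage the two versions nearly coincide (a bit under a 1% lower price per additional 1,000 miles), while for the much larger luxury coefficient they differ noticeably, so the exact formula is the safer choice there.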
Another way to capture nonlinear associations between the outcome (Y) and covariates (X) is to include polynomial terms:
Let's look at an example!
$$\log(Wage) = \beta_0 + \beta_1 Educ + \beta_2 Exp + \varepsilon$$
$$\log(Wage) = \beta_0 + \beta_1 Educ + \beta_2 Exp + \beta_3 Exp^2 + \varepsilon$$
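When fitting a squared term in R, the term needs to be wrapped in `I()`, because inside a formula `^` is a formula operator rather than arithmetic. The sketch below shows the difference on simulated data; the variables `exper` and `log_wage` and their coefficients are hypothetical.

```r
set.seed(42)  # arbitrary seed
exper <- runif(200, 0, 40)   # hypothetical years of experience
log_wage <- 1 + 0.04 * exper - 0.0006 * exper^2 + rnorm(200, sd = 0.05)
sim <- data.frame(log_wage, exper)

# Without I(): exper^2 is read as a formula operation and collapses to exper,
# so no squared term enters the model
fit_no_I <- lm(log_wage ~ exper + exper^2, data = sim)

# With I(): the expression is evaluated arithmetically and the squared term is included
fit_I <- lm(log_wage ~ exper + I(exper^2), data = sim)

names(coef(fit_no_I))
names(coef(fit_I))
```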
One additional year of education is associated, on average, with an approximately $100 \times \hat{\beta}_1$% increase in hourly wages, holding experience constant.
What is the association between experience and wages?
Increasing work experience from $Exp_0$ to $Exp_0 + 1$ years is associated, on average, with approximately a $(\hat{\beta}_2 + 2\hat{\beta}_3 \times Exp_0) \times 100$% increase in hourly wages, holding education constant.
Increasing work experience from 20 to 21 years is associated, on average, with approximately a $(\hat{\beta}_2 + 2\hat{\beta}_3 \times 20) \times 100$% increase in hourly wages, holding education constant.
summary(lm(log(wage) ~ education + experience + I(experience^2), data = CPS1985))
##
## Call:
## lm(formula = log(wage) ~ education + experience + I(experience^2),
##     data = CPS1985)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.12709 -0.31543  0.00671  0.31170  1.98418
##
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)      0.5203218  0.1236163   4.209 3.01e-05 ***
## education        0.0897561  0.0083205  10.787  < 2e-16 ***
## experience       0.0349403  0.0056492   6.185 1.24e-09 ***
## I(experience^2) -0.0005362  0.0001245  -4.307 1.97e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4619 on 530 degrees of freedom
## Multiple R-squared:  0.2382, Adjusted R-squared:  0.2339
## F-statistic: 55.23 on 3 and 530 DF,  p-value: < 2.2e-16
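Using the estimates above, the association between experience and wages can be sketched numerically. The helper names below are hypothetical; the coefficients are copied from the output. The first function is the derivative-based approximation, the second is the exact one-year difference on the log scale, and the last line gives the experience level at which the fitted quadratic peaks.

```r
# Estimates copied from the wage regression output
b_exp  <- 0.0349403    # experience
b_exp2 <- -0.0005362   # I(experience^2)

# Approximate % change in wage for one more year, starting at exp0
# (derivative of the quadratic: beta2 + 2 * beta3 * exp0)
marginal_pct <- function(exp0) 100 * (b_exp + 2 * b_exp2 * exp0)

# Exact one-year difference on the log scale, in log points x 100:
# beta2 + beta3 * ((exp0 + 1)^2 - exp0^2) = beta2 + beta3 * (2 * exp0 + 1)
one_year_log_diff_pct <- function(exp0) 100 * (b_exp + b_exp2 * (2 * exp0 + 1))

marginal_20 <- marginal_pct(20)
log_diff_20 <- one_year_log_diff_pct(20)

# Experience level where the fitted association switches from positive to negative
peak_experience <- -b_exp / (2 * b_exp2)

round(c(marginal_20 = marginal_20,
        log_diff_20 = log_diff_20,
        peak_experience = peak_experience), 3)
```

Because the squared term is negative, each additional year of experience is associated with a smaller wage gain than the previous one, and the fitted curve flattens out at a bit over 32 years of experience.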
The model you fit depends on what you want to analyze.
Plot your data!
Make sure you capture associations that make sense.
Issues with regressions and our data:
Outliers?
Heteroskedasticity
Regression models with discrete outcomes:
Ismay, C. & A. Kim. (2021). "Statistical Inference via Data Science". Chapter 6 & 10.
Keegan, B. (2018). "The Need for Openness in Data Journalism". GitHub Repository