class: center, middle, inverse, title-slide

.title[
# STA 235H - Multiple Regression: Overview and Statistical Adjustment
]
.subtitle[
## Fall 2023
]
.author[
### McCombs School of Business, UT Austin
]

---

<!-- <script type="text/javascript"> -->
<!-- MathJax.Hub.Config({ -->
<!--   "HTML-CSS": { -->
<!--     preferredFont: null, -->
<!--     webFont: "Neo-Euler" -->
<!--   } -->
<!-- }); -->
<!-- </script> -->

<style type="text/css">
.small .remark-code { /*Change made here*/
  font-size: 80% !important;
}
.tiny .remark-code { /*Change made here*/
  font-size: 90% !important;
}
</style>

# Today

.pull-left[
- Quick **.darkorange[multiple regression]** review
  - How does OLS work?

- **.darkorange[What can we say]** using regressions?
  - Interpreting coefficients
]

.pull-right[
![](https://media.giphy.com/media/3o752flP8nVxQjXhO8/source.gif)
]

---
background-position: 50% 50%
class: left, bottom, inverse

.big[
Nothing "Ordinary" about OLS
]

---
background-position: 50% 50%
class: center, middle

.box-3Trans[What do **you** understand about regressions?]

---

# Remembering Regressions

- Linear Regression is a **.darkorange[very useful tool]**.
  - Simple supervised learning approach.
  - Many fancy methods are generalizations or extensions of linear regression!

- It's a way to (partially) describe a **.darkorange[data generating process (DGP)]**.

.box-3Trans.medium[$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$$]

---

# Essential Parts of a Regression

.pull-left[
.box-2tL[Y]

.box-2[Outcome Variable]

.box-2[Response Variable]

.box-2[Dependent Variable]

.box-2t[*Thing you want to explain or predict*]
]

---

# Essential Parts of a Regression

.pull-left[
.box-2tL[Y]

.box-2[Outcome Variable]

.box-2[Response Variable]

.box-2[Dependent Variable]

.box-2t[*Thing you want to explain or predict*]
]

.pull-right[
.box-2tL[X]

.box-2[Explanatory Variable]

.box-2[Predictor Variable]

.box-2[Independent Variable]

.box-2t[*Thing you use to explain or predict Y*]
]

---

# Identify the variables

.pull-left[
.box-2transl.jost-normal[A study examines the effect of smoking on lung cancer]
<br>
]

---

# Identify the variables

.pull-left[
.box-2transl.jost-normal[A study examines the effect of smoking on lung cancer]
<br>
.box-4transl.jost-normal[Fantasy football fanatics predict the performance of a player based on past performance, health status, and characteristics of the opposing team]
]

---

# Identify the variables

.pull-left[
.box-2transl.jost-normal[A study examines the effect of smoking on lung cancer]
<br>
.box-4transl.jost-normal[Fantasy football fanatics predict the performance of a player based on past performance, health status, and characteristics of the opposing team]
]

.pull-right[
.box-5transl.jost-normal[You want to see if taking more AP classes in high school improves college grades]
<br>
]

---

# Identify the variables

.pull-left[
.box-2transl.jost-normal[A study examines the effect of smoking on lung cancer]
<br>
.box-4transl.jost-normal[Fantasy football fanatics predict the performance of a player based on past performance, health status, and characteristics of the opposing team]
]

.pull-right[
.box-5transl.jost-normal[You want to see if taking more AP classes in high school improves college grades]
<br>
.box-7transl.jost-normal[Netflix uses your past viewing history, the day of the week, and the time of
the day to guess which show you want to watch next]
]

---

# Two Purposes of Regression

.pull-left[
.box-3tL[Prediction]

.box-3[Forecast the future]

.box-3[Focus is on <b>Y</b>]

.box-3tL[Netflix trying to guess your next show]
]

--

.pull-right[
.box-6tL[Explanation]

.box-6[Explain the effect of <b>X</b> on <b>Y</b>]

.box-6[Focus is on <b>X</b>]

.box-6tL[Netflix looking at the effect of time of the day on show selection]
]

---

# What do we want to estimate in a regression?

- When we run a regression we have an outcome `\(Y\)` and explanatory variables or covariates `\(X\)`.

- **We want to estimate the `\(\beta\)`'s**

--

- One important distinction:
  - `\(\beta\)`'s are the **.darkorange[population parameters]** we want to estimate.
  - `\(\hat\beta\)`'s are the **.darkorange[estimates]** of those parameters.

---

# How do we estimate the coefficients in a regression?

.pull-left[
- **.darkorange[Ordinary Least Squares]** is the most popular way.

`$$\min_{\beta} \sum_{i=1}^n\Big[Y_i - \Big(\beta_0 + \sum_{j=1}^p\beta_jX_{ij}\Big)\Big]^2$$`
]

--

.pull-right[
<img src="f2023_sta235h_2_reg_files/figure-html/cookies1-1.svg" style="display: block; margin: auto;" />
]

---

# How do we estimate the coefficients in a regression? (cont.)

.pull-left[
![:scale 100%](https://github.com/maibennett/sta235/raw/main/exampleSite/content/Classes/Week1/2_OLS/images/3dplot1.png)
]

.pull-right[
![:scale 100%](https://github.com/maibennett/sta235/raw/main/exampleSite/content/Classes/Week1/2_OLS/images/3dplot2.png)
]

---
background-position: 50% 50%
class: left, bottom, inverse

.big[
Let's get into some data
]

---

# Let's introduce an example: The Bechdel Test

--

.pull-left[
- **.darkorange[Three criteria:]**
  1. At least two named women
  2. Who talk to each other
  3. About something besides a man
]

.pull-right[
![:scale 80%](https://github.com/maibennett/sta235/raw/main/exampleSite/content/Classes/Week1/2_OLS/images/bechdel.png)
]

---

# Do movies pass the test?

.center[
![:scale 50%](https://github.com/maibennett/sta235/raw/main/exampleSite/content/Classes/Week1/2_OLS/images/hickey-bechdel-11.png)]

---

# Is it worth it for my movie to pass the Bechdel test?

- I'm a profit-maximizing investor and want to know whether it's in my best interest to swap a male character for a female one.

- What is the **.darkorange[simplest model]** you could fit?

--

.box-6trans.medium[$$Revenue = \beta_0 + \beta_1 Bechdel + \varepsilon$$]

---

# Let's analyze some models

- We have some data and code on the **<u>.darkorange[[course website](https://sta235.netlify.app/classes/week1/)]</u>**

- Dataset from **<u>.darkorange[[fivethirtyeight.com](https://github.com/fivethirtyeight/data/tree/master/bechdel)]</u>**:
  - Focus on 1990 onward

<table class=" lightable-paper" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; margin-left: auto; margin-right: auto;'>
<caption>Summary Statistics</caption>
 <thead>
  <tr>
   <th style="text-align:left;"> Variable </th>
   <th style="text-align:left;"> N </th>
   <th style="text-align:left;"> Mean </th>
   <th style="text-align:left;"> Std. Dev. </th>
   <th style="text-align:left;"> Min </th>
   <th style="text-align:left;"> Pctl. 25 </th>
   <th style="text-align:left;"> Pctl.
75 </th>
   <th style="text-align:left;"> Max </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Year </td>
   <td style="text-align:left;"> 2087 </td>
   <td style="text-align:left;"> 2004.963 </td>
   <td style="text-align:left;"> 6.755 </td>
   <td style="text-align:left;"> 1990 </td>
   <td style="text-align:left;"> 1999 </td>
   <td style="text-align:left;"> 2011 </td>
   <td style="text-align:left;"> 2014 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Adj_Revenue </td>
   <td style="text-align:left;"> 2087 </td>
   <td style="text-align:left;"> 66.254 </td>
   <td style="text-align:left;"> 92.07 </td>
   <td style="text-align:left;"> 0 </td>
   <td style="text-align:left;"> 4.36 </td>
   <td style="text-align:left;"> 86.936 </td>
   <td style="text-align:left;"> 968.41 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Adj_Budget </td>
   <td style="text-align:left;"> 1369 </td>
   <td style="text-align:left;"> 61.498 </td>
   <td style="text-align:left;"> 57.784 </td>
   <td style="text-align:left;"> 0.02 </td>
   <td style="text-align:left;"> 19.3 </td>
   <td style="text-align:left;"> 88.47 </td>
   <td style="text-align:left;"> 470.839 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Metascore </td>
   <td style="text-align:left;"> 1755 </td>
   <td style="text-align:left;"> 5.663 </td>
   <td style="text-align:left;"> 1.66 </td>
   <td style="text-align:left;"> 1.1 </td>
   <td style="text-align:left;"> 4.5 </td>
   <td style="text-align:left;"> 6.8 </td>
   <td style="text-align:left;"> 9.7 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> imdbRating </td>
   <td style="text-align:left;"> 2085 </td>
   <td style="text-align:left;"> 6.546 </td>
   <td style="text-align:left;"> 0.979 </td>
   <td style="text-align:left;"> 1.5 </td>
   <td style="text-align:left;"> 6 </td>
   <td style="text-align:left;"> 7.2 </td>
   <td style="text-align:left;"> 9.3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> bechdel_test </td>
   <td style="text-align:left;"> 2087 </td>
   <td style="text-align:left;"> 0.571 </td>
   <td style="text-align:left;"> 0.495 </td>
   <td
style="text-align:left;"> 0 </td>
   <td style="text-align:left;"> 0 </td>
   <td style="text-align:left;"> 1 </td>
   <td style="text-align:left;"> 1 </td>
  </tr>
</tbody>
</table>

---

# Let's analyze some models

```r
summary(lm(Adj_Revenue ~ bechdel_test, data = bechdel))
```

```
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)   76.4553     3.0641 24.9521        0
## bechdel_test -17.8616     4.0544 -4.4055        0
```

- How do you interpret these results?

---

# Let's analyze some models

```r
summary(lm(Adj_Revenue ~ bechdel_test, data = bechdel))
```

```
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)   76.4553     3.0641 24.9521        0
## bechdel_test -17.8616     4.0544 -4.4055        0
```

- `\(\hat{\beta}_0\)` is the average adjusted revenue (in millions of dollars) for movies that do not pass the Bechdel test.

- <u>On average</u>, movies that pass the Bechdel test have an adjusted revenue that is `\(|\hat{\beta}_1|\)` million dollars lower than movies that don't pass it.

<br>
<br>

.box-3LA[Negative effect of including more women?]

---

# What gives?

.center[
![:scale 50%](https://github.com/maibennett/sta235/raw/main/exampleSite/content/Classes/Week1/2_OLS/images/538_bechdel.png)]

---

# More variables

.pull-left[
.center[
![:scale 80%](https://media.giphy.com/media/1ZlrrYTN7gAxdGE8nM/source.gif)
]]

.pull-right[
- **.darkorange[Bechdel test]** could be capturing the effect of other variables:
  - What **.darkorange[type]** of movies are the ones that pass the test?
  - What is their **.darkorange[budget]**?
]

---

# More variables

```r
summary(lm(Adj_Revenue ~ bechdel_test + Adj_Budget + Metascore + imdbRating, data = bechdel))
```

```
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -127.0710    17.0563 -7.4501   0.0000
## bechdel_test   11.0009     4.3786  2.5124   0.0121
## Adj_Budget      1.1192     0.0367 30.4866   0.0000
## Metascore       7.0254     1.9058  3.6864   0.0002
## imdbRating     15.4631     3.3914  4.5595   0.0000
```

<br>

.box-3LA[Positive and significant!]

--

- How do we interpret the relevant coefficient now?

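A minimal simulated sketch of why a coefficient can change sign once budget is held constant. These are **not** the fivethirtyeight data: the variable names (`passes`, `budget`, `revenue`) and every number in the data generating process are assumptions chosen for illustration. Passing movies are given smaller budgets on average, and the true coefficient on `passes` is 5 by construction.

```r
# Simulated illustration only (hypothetical DGP, not the Bechdel dataset)
set.seed(235)
n <- 5000

# Hypothetical indicator: 1 if the movie passes the test, 0 otherwise
passes <- rbinom(n, size = 1, prob = 0.57)

# Assumption: movies that pass have budgets that are 20 (million) lower on average
budget <- rnorm(n, mean = 80 - 20 * passes, sd = 15)

# Assumption: revenue depends on both budget and passing (true effect of passing = 5)
revenue <- 10 + 5 * passes + 1.1 * budget + rnorm(n, mean = 0, sd = 20)

sim <- data.frame(revenue, passes, budget)

# Unadjusted model: the coefficient on passes also absorbs the budget gap
unadjusted <- lm(revenue ~ passes, data = sim)

# Adjusted model: compares movies with the same budget
adjusted <- lm(revenue ~ passes + budget, data = sim)

coef(unadjusted)["passes"]
coef(adjusted)["passes"]
```

By construction, the unadjusted coefficient is expected to be close to 5 + 1.1 × (−20) = −17, while the adjusted one is expected to be close to 5. The adjusted coefficient is read as the average difference in revenue between movies that pass and movies that don't, *holding budget constant*.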
---

# Main takeaway points

.pull-left[
.center[
![](https://media.giphy.com/media/PLHdpauwfN2MvHcHxL/giphy.gif)
]
]

.pull-right[
- Regressions are super useful...

- But you need to know **.darkorange[how]** to interpret them.

- Be sure not to overstate your claims!

- Remember the magic words for interpretation
]

---

# Next class

.pull-left[
- Continue with **.darkorange[multiple regression models]**:
  - Interactions and how to interpret them
  - **.darkorange["Nonlinear" models]**
]

.pull-right[
![](https://media.giphy.com/media/1Q9AekyhJ2U5bdTzdr/giphy.gif?cid=ecf05e47lcna24lgv5gbpkiciso4ykvxq9urjrc5d50bfq8b&rid=giphy.gif&ct=g)
]

---

# References

- Heiss, A. (2020). "Course: Program Evaluation for Public Service". *Slides for Regression and Inference*.

- Ismay, C. & A. Kim. (2021). "Statistical Inference via Data Science". Chapter 10.

- Keegan, B. (2018). "The Need for Openness in Data Journalism". *[Github Repository](https://github.com/brianckeegan/Bechdel/blob/master/Bechdel_test.ipynb)*