class: center, middle, inverse, title-slide

.title[
# STA 235H - Final Trivia
]
.subtitle[
## Fall 2023
]
.author[
### McCombs School of Business, UT Austin
]

---

<!-- <script type="text/javascript"> -->
<!-- MathJax.Hub.Config({ -->
<!--   "HTML-CSS": { -->
<!--     preferredFont: null, -->
<!--     webFont: "Neo-Euler" -->
<!--   } -->
<!-- }); -->
<!-- </script> -->

<style type="text/css">
.small .remark-code { /*Change made here*/
  font-size: 80% !important;
}

.tiny .remark-code { /*Change made here*/
  font-size: 80% !important;
}
</style>

# Announcements

- **.darkorange[Homework 6]** is due on Monday 12/04 (make sure you **.darkorange[submit on time]**)

  - Check out the course website for all the details (e.g. pay attention to style).

--

- **.darkorange[Final Exam]** is on Thursday 12/07 for all sections.

  - It will last 2 hours, so **.darkorange[we will start at 8.30am]**.

  - Remember to have your laptop fully charged and to bring a copy of your note sheet.

--

- Questions included in exams/homework are very similar to what we do in class and in previous assignments. **.darkorange[Make sure you review the class exercises/answer keys so you understand the answers]**.

--

- I will have extended office hours the week of 12/04: on **.darkorange[Tuesday 2.00pm - 4.00pm]**, and regular hours on **.darkorange[Wednesday 10.30am - 11.30am]**.

---

# Rules of Final Trivia

1) **.darkorange[Form groups]**: 2 or 3 students (no more, no less).

--

2) **.darkorange[Choose a name for your group]**: You can be funny or classic.

--

3) **.darkorange[You need to complete all the questions]**: It doesn't matter if you don't know the answer! Make your best guess.

--

4) **.darkorange[Ask questions]**: I will give your team time to complete each question; after the time is up, you will submit your answers and we will check them.

  - If something isn't clear, **now is the time to ask**.

--

5) **.darkorange[There are prizes]**: At the end of the session, we will crown the teams that perform the best. If there is a tie in scores, the team that submits their answers the fastest moves up.

--

.center[**.darkorange[Note:]** All slides and answers will be posted on Wednesday at 4pm. Make sure to take notes!]

---

.center2[
![](https://media.giphy.com/media/pjd6Oggp5BIzJp33CK/giphy.gif)
]

---
background-position: 50% 50%
class: left, bottom, inverse

.big[
Regressions
]

---

# Amazon prices

In this question, we are looking at luggage prices on Amazon US. We have data scraped from the website:

```r
amz = read.csv("https://raw.githubusercontent.com/maibennett/website_github/master/exampleSite/content/files/data/amz_luggage.csv")

amz %>% select(-title) %>% head(.)
```

```
##      uid       asin stars reviews boughtInLastMonth isBestSeller  price
## 1 297768 B004DPRTSE   3.7     138                 0            0  67.90
## 2 297745 B0BSBX2XPR   3.6     379               100            0 109.11
## 3 295361 B0BG96C62P   4.7    2835                 0            0 100.42
## 4 295069 B00NF9HISK   3.3     825                 0            0  57.56
## 5 297025 B09VD9FGDH   2.5     221                 0            0  70.11
## 6 296801 B0B3LZ9KNH   2.5    3368               200            0 102.42
```

We want to find the association between the covariates and the outcome (`price`).
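---

# Reminder: reading an interaction model

Before the questions, a quick reminder of how coefficients combine in a model with an interaction. Using the same specification you will see in the output on the next slides,

`\(\widehat{Price} = \hat{\beta}_0 + \hat{\beta}_1 BestSeller + \hat{\beta}_2 Stars + \hat{\beta}_3 BestSeller \times Stars\)`

- The difference in average price between a best-seller and a non-best-seller with the **same** number of stars is `\(\hat{\beta}_1 + \hat{\beta}_3 \times Stars\)`.

- The association between price and stars is `\(\hat{\beta}_2\)` for non-best-sellers and `\(\hat{\beta}_2 + \hat{\beta}_3\)` for best-sellers.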
---

# Question 1

Looking at the following output, **what is the difference in price between a best-selling and a non-best-selling luggage item with 4.5 stars?** Write down the appropriate number. *Approximate your answer to 2 decimal places.*

.tiny[
```r
lm1 = lm(price ~ isBestSeller*stars, data = amz)
summary(lm1)
```

```
## 
## Call:
## lm(formula = price ~ isBestSeller * stars, data = amz)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -115.054  -26.893   -1.386   25.022  137.700 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         99.3724     1.8165  54.707   <2e-16 ***
## isBestSeller        29.4236    17.1668   1.714   0.0866 .  
## stars                4.9546     0.5299   9.350   <2e-16 ***
## isBestSeller:stars  11.5092     4.9535   2.323   0.0202 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.4 on 4037 degrees of freedom
## Multiple R-squared:  0.0586, Adjusted R-squared:  0.0579 
## F-statistic: 83.77 on 3 and 4037 DF,  p-value: < 2.2e-16
```
]

---

# Question 1: Answer

--

**.darkorange[Answer:]** 81.21 (i.e. `\(29.4236 + 11.5092\times4.5\)`)

---

# Question 2

Looking at the following output, **what is the association between prices and stars for best-sellers?** Write down the appropriate number. *Approximate your answer to 2 decimal places.*

.tiny[
```r
lm1 = lm(price ~ isBestSeller*stars, data = amz)
summary(lm1)
```

```
## 
## Call:
## lm(formula = price ~ isBestSeller * stars, data = amz)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -115.054  -26.893   -1.386   25.022  137.700 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         99.3724     1.8165  54.707   <2e-16 ***
## isBestSeller        29.4236    17.1668   1.714   0.0866 .  
## stars                4.9546     0.5299   9.350   <2e-16 ***
## isBestSeller:stars  11.5092     4.9535   2.323   0.0202 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.4 on 4037 degrees of freedom
## Multiple R-squared:  0.0586, Adjusted R-squared:  0.0579 
## F-statistic: 83.77 on 3 and 4037 DF,  p-value: < 2.2e-16
```
]

---

# Question 2: Answer

--

**.darkorange[Answer:]** 16.46 (i.e. `\(4.9546 + 11.5092\)`)

---

# Question 3

Looking at the following output, **what is the association between prices and stars for best-sellers?** Write down the <u>interpretation</u>.

.tiny[
```r
lm1 = lm(price ~ isBestSeller*stars, data = amz)
summary(lm1)
```

```
## 
## Call:
## lm(formula = price ~ isBestSeller * stars, data = amz)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -115.054  -26.893   -1.386   25.022  137.700 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         99.3724     1.8165  54.707   <2e-16 ***
## isBestSeller        29.4236    17.1668   1.714   0.0866 .  
## stars                4.9546     0.5299   9.350   <2e-16 ***
## isBestSeller:stars  11.5092     4.9535   2.323   0.0202 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.4 on 4037 degrees of freedom
## Multiple R-squared:  0.0586, Adjusted R-squared:  0.0579 
## F-statistic: 83.77 on 3 and 4037 DF,  p-value: < 2.2e-16
```
]

---

# Question 3: Answer

--

**.darkorange[Answer:]** For one additional star in the product's rating, the average price of best-selling luggage is $16.46 higher.
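---

# Questions 1-3: Checking the numbers in R

If you want to verify these answers, you can compute them directly from the fitted model object. This is a minimal sketch that assumes `lm1` has been estimated exactly as on the previous slides; the `predict()` comparison is just an alternative route to the same quantity.

.small[
```r
b = coef(lm1)

# Question 1: best-seller vs. non-best-seller difference at 4.5 stars (should match 81.21)
unname(b["isBestSeller"] + b["isBestSeller:stars"] * 4.5)

# Same quantity via predicted prices
predict(lm1, newdata = data.frame(isBestSeller = 1, stars = 4.5)) -
  predict(lm1, newdata = data.frame(isBestSeller = 0, stars = 4.5))

# Question 2: association between price and stars for best-sellers (should match 16.46)
unname(b["stars"] + b["isBestSeller:stars"])
```
]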
---
background-position: 50% 50%
class: left, bottom, inverse

.big[
Causal Inference
]

---

# Does Academic Probation Work?

Academic probation is a tool widely used by universities to make sure students maintain minimum academic standards.

In this section, we will analyze data from a large Canadian university on the effects of academic probation, originally used in Lindo, Sanders, and Oreopoulos' (2010) paper, "Ability, Gender, and Performance Standards: Evidence from Academic Probation".

.tiny[
```r
probation = read.csv("https://raw.githubusercontent.com/maibennett/website_github/master/exampleSite/content/files/data/probation.csv")
```
]

.pull-left[
.small[
- `creditsY`: Credits attempted in year Y = 1, 2.
- `credits_earnedY`: Credits earned in year Y = 1, 2.
- `GPA_yearY`: GPA at the end of year Y = 1, 2.
- `CGPA_yearY`: Cumulative GPA at the end of year Y = 1, 2.
- `sex`: Gender of the student (M: Male, F: Female).
- `age_at_entry`: Age of the student when they first enrolled.
]]

.pull-right[
.small[
- `gradinY`: Student graduated in Y years, Y = 4, 5, or 6.
- `left_school`: Whether the student left school or not after the first assessment.
- `hsgrade_pct`: Percentile of graduation in their high school.
- `probation_year1`: Whether the student was on academic probation by the end of year 1.
- `suspended_year1`: Whether the student was suspended by the end of year 1.
]]

---

# Question 4

First, we would like to analyze whether students that are on academic probation in their first year are more likely to drop out. For this, we run the following regression:

.small[
```r
summary(lm_robust(left_school ~ probation_year1, data = probation))
```

```
## 
## Call:
## lm_robust(formula = left_school ~ probation_year1, data = probation)
## 
## Standard error type:  HC2 
## 
## Coefficients:
##                 Estimate Std. Error t value   Pr(>|t|) CI Lower CI Upper    DF
## (Intercept)      0.03755  0.0009849   38.13 5.691e-313  0.03562  0.03948 44360
## probation_year1  0.07165  0.0038290   18.71  7.761e-78  0.06415  0.07916 44360
## 
## Multiple R-squared:  0.01481 ,	Adjusted R-squared:  0.01479 
## F-statistic: 350.2 on 1 and 44360 DF,  p-value: < 2.2e-16
```
]

.small[
**a-b) What is the percentage of students that drop out of school according to their academic status?** (Use two decimal places)

**c) Is the difference in drop-out rates statistically significant?**

**d) Interpret the coefficient for `probation_year1`.**
]

---

# Question 4: Answer

--

**.darkorange[Answer:]** On average, 3.76% of students not on academic probation drop out of this college, while that percentage is 10.92% for students on academic probation. The difference is statistically significant. In terms of interpretation, being on probation is associated with an average increase in the probability of dropping out of 7.2 percentage points compared to not being on probation.
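---

# Question 4: Checking the rates in R

The two drop-out percentages can also be recovered directly from the data. A minimal sketch, assuming the `probation` data is loaded as above (before any filtering) and `dplyr` is attached:

.small[
```r
# Mean of left_school by probation status:
# the 0 group corresponds to the intercept (0.0376, i.e. 3.76%) and the 1 group
# to the intercept plus the probation_year1 coefficient (0.1092, i.e. 10.92%)
probation %>%
  group_by(probation_year1) %>%
  summarise(drop_out_rate = mean(left_school, na.rm = TRUE))
```
]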
---

# Assessing the effect of academic probation

To assess the effect of probation on academic performance, we decide to run the following regression <u>on students that have not dropped out</u>.

.tiny[
```r
probation = probation %>% filter(left_school==0)

summary(lm(GPA_year2 ~ probation_year1 + credits1 + credits_earned1 + GPA_year1 + factor(sex) + age_at_entry + hsgrade_pct, data = probation))
```

```
## 
## Call:
## lm(formula = GPA_year2 ~ probation_year1 + credits1 + credits_earned1 + 
##     GPA_year1 + factor(sex) + age_at_entry + hsgrade_pct, data = probation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3545 -0.3239  0.0646  0.3708  2.5300 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.0557184  0.0803853  13.133  < 2e-16 ***
## probation_year1  0.2827426  0.0132546  21.332  < 2e-16 ***
## credits1        -0.0069394  0.0120652  -0.575   0.5652    
## credits_earned1  0.0245169  0.0116370   2.107   0.0351 *  
## GPA_year1        0.6971113  0.0059157 117.842  < 2e-16 ***
## factor(sex)M    -0.0957468  0.0061847 -15.481  < 2e-16 ***
## age_at_entry    -0.0248064  0.0041120  -6.033 1.63e-09 ***
## hsgrade_pct      0.0032055  0.0001307  24.529  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5817 on 38322 degrees of freedom
##   (3857 observations deleted due to missingness)
## Multiple R-squared:  0.5139, Adjusted R-squared:  0.5139 
## F-statistic:  5789 on 7 and 38322 DF,  p-value: < 2.2e-16
```
]

---

# Question 5

Interpret the coefficient of interest in the previous model.

---

# Question 5: Answer

--

**.darkorange[Answer:]** On average, students who are on academic probation in their first year (vs. those who are not) have a second-year GPA that is 0.28 points higher, holding the other variables constant.

---

# Question 6

Do you think the previous estimate is a causal effect? Why or why not? Give a specific example.

---

# Question 6: Answer

--

**.darkorange[Answer:]** For the previous coefficient to be a causal effect, we would need to have controlled for all confounders. However, there are most likely unobserved confounders we are not capturing. For example, more vulnerable students might be more likely to be on academic probation (e.g. they have more responsibilities outside college), and that might also directly affect their GPA in the second year.

---
background-position: 50% 50%
class: left, bottom, inverse

.big[
Prediction
]

---

# Candy, candy, candy

In this section, we will be predicting the win percentage for candy bars! We have the following dataset for this:

.tiny[
```r
candy = read.csv("https://raw.githubusercontent.com/maibennett/website_github/master/exampleSite/content/files/data/candy_r.csv")
```
]

.pull-left[
.small[
- `competitorname`: Name of the candy
- `chocolate`: Is it chocolate?
- `fruity`: Is it fruit flavored?
- `caramel`: Is there caramel in the candy?
- `peanutalmondy`: Does it contain peanuts, peanut butter or almonds?
- `nougat`: Does it contain nougat?
- `crispedricewafer`: Does it contain crisped rice, wafers, or a cookie component?
]]

.pull-right[
.small[
- `hard`: Is it a hard candy?
- `bar`: Is it a bar?
- `pluribus`: Is it one of many candies in a bag/box?
- `sugarpercent`: The percentile of sugar it falls under within the data set.
- `pricepercent`: The unit price percentile compared to the rest of the set.
- `winpercent`: The overall win percentage according to 269,000 matchups.
]]

---

# Question 7

We ran a simple regression tree for this. Looking at the following tree, what is the predicted outcome for a fruity candy that has a sugar percent of 60%, doesn't have chocolate, nougat, or peanuts, comes in a bar, and has a price percentile of 0.51?

<img src="f2023_sta235h_15_FinalTrivia_files/figure-html/rd_linear-1.svg" style="display: block; margin: auto;" />

---

# Question 7: Answer

--

**.darkorange[Answer:]** The predicted win percentage is 44.
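---

# How a tree like this can be fit

The tree above comes from the script linked in the next question. As a rough sketch only (the `train.data`/`test.data` split, the `dt` object, and the exact tuning settings all come from that script and may differ from what is shown here), a regression tree for `winpercent` could be fit with `caret` and evaluated with `modelr::rmse()`:

.small[
```r
library(caret)
library(modelr)

set.seed(100)

# Assumes train.data and test.data were created in the linked course script
dt = train(winpercent ~ ., data = train.data,
           method = "rpart",
           trControl = trainControl(method = "cv", number = 10),
           tuneLength = 10)

rpart.plot::rpart.plot(dt$finalModel)  # draws a tree diagram like the one above

rmse(dt, test.data)  # out-of-sample RMSE, as asked in the next question
```
]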
---

# Question 8

Using the code provided <u>[here](https://www.magdalenabennett.com/files/data/Trivia/f2023_sta235h_15_FinalTrivia.R)</u>, how does your previous model perform? Write down the number (use two decimal places).

---

# Question 8: Answer

--

**.darkorange[Answer:]**

```r
rmse(dt, test.data) %>% round(., 2)
```

```
## [1] 14.26
```

---

# Question 9

Your turn. Fit a random forest to predict the outcome of interest. Tune only the number of randomly selected predictors, and use the code provided as a starting point. Use 100 trees and 10-fold cross-validation.

- Provide your code, the optimal value of `mtry`, and the performance of your model.

---

# Question 9: Answer

--

.small[
```r
tuneGrid = expand.grid(
  mtry = 1:13,
  splitrule = "variance",
  min.node.size = 5
)

set.seed(100)

rf = train(winpercent ~ .,
           data = train.data,
           method = "ranger",
           num.trees = 100,
           tuneGrid = tuneGrid,
           importance = "permutation",
           trControl = trainControl(method = "cv", number = 10))

rf$bestTune
```

```
##   mtry splitrule min.node.size
## 6    6  variance             5
```

```r
rmse(rf, test.data) %>% round(., 2)
```

```
## [1] 15.03
```
]

*(For Mac users: `mtry = 5` and `RMSE = 15.16`)*
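---

# Follow-up: which features drive the predictions?

Not one of the trivia questions, but since the random forest above was fit with `importance = "permutation"`, you can inspect which predictors matter most. This assumes the `rf` object from the previous slide is still in memory:

.small[
```r
# Permutation-based variable importance, rescaled to 0-100 by caret
varImp(rf)

# The same information as a plot
plot(varImp(rf))
```
]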