class: center, middle, inverse, title-slide

.title[
# STA 235H - Bootcamp
]
.subtitle[
## Fall 2023
]
.author[
### McCombs School of Business, UT Austin
]

---

<!-- <script type="text/javascript"> -->
<!-- MathJax.Hub.Config({ -->
<!-- "HTML-CSS": { -->
<!-- preferredFont: null, -->
<!-- webFont: "Neo-Euler" -->
<!-- } -->
<!-- }); -->
<!-- </script> -->

<style type="text/css">
.small .remark-code { /*Change made here*/
  font-size: 80% !important;
}
.tiny .remark-code { /*Change made here*/
  font-size: 90% !important;
}
</style>

# Bootcamp Agenda

.pull-left[
- What do we need?

- Quick look into **.darkorange[R]** and **.darkorange[RStudio]**

- RScript format

- Refresher from the **.darkorange[tidyverse]**:
    - Data wrangling
    - Plots and figures
    - Regressions
]

.pull-right[
.center[
![:scale 150%](https://media.giphy.com/media/jCL5JbYPYQz96/giphy.gif)]
]

---

# How comfortable are you with R?

<img src="f2023_sta235h_bootcamp_files/figure-html/hist1-1.svg" style="display: block; margin: auto;" />

---

# R for coding

.pull-left[
R is the programming language we will use for **.darkorange[statistical analysis]**

.center[
![:scale 40%](https://github.com/maibennett/sta235/raw/main/exampleSite/content/bootcamp/images/R_logo.png)]
]

--

.pull-right[
RStudio is the IDE (Integrated Development Environment) we will use **.darkorange[to run R on our computers]**.

.center[
![:scale 80%](https://github.com/maibennett/sta235/raw/main/exampleSite/content/bootcamp/images/RStudio-Logo-Flat.png)]
]

---

# Let's look at RStudio

.center[
![:scale 80%](https://github.com/maibennett/sta235/raw/main/exampleSite/content/bootcamp/images/rstudio_window.png)]

---

# Let's look at RStudio - Script

.center[
![:scale 110%](https://github.com/maibennett/sta235/raw/main/exampleSite/content/bootcamp/images/rstudio_window_script.png)]

---

# Let's look at RStudio - Environment

.center[
![:scale 80%](https://github.com/maibennett/sta235/raw/main/exampleSite/content/bootcamp/images/rstudio_window_environment.png)]

---

# Let's look at RStudio - Console

.center[
![:scale 120%](https://github.com/maibennett/sta235/raw/main/exampleSite/content/bootcamp/images/rstudio_window_console.png)]

---

# Let's look at RStudio - Help and others

.center[
![:scale 60%](https://github.com/maibennett/sta235/raw/main/exampleSite/content/bootcamp/images/rstudio_window_help.png)]

---

# Useful basic commands

- `install.packages("name")`: Installs the package "name" on your computer. You only need to run this once!

--

- `library(name)`: Loads the package "name" in your current session. Do this at the top of every script, and load only the packages you will use (to avoid confusion).

--

- `?function`: Opens the help file for `function` (if more than one loaded package provides a function with that name, you can choose which help file to open).

---

# Also... don't restore RData into a new session!

.center[
<video width="640" height="480" controls>
<source src="https://github.com/maibennett/sta235/raw/main/exampleSite/content/bootcamp/images/unclick_save_rdata.mp4" type="video/mp4">
</video>
]

---
background-position: 50% 50%
class: center, middle

.box-7Trans[Let's go to R]

---

# Data Wrangling

.pull-left[
- Most of the time, we need to **.darkorange[transform]**, **.darkorange[clean]**, and **.darkorange[structure]** data for analysis.

- Examples of data wrangling include dropping missing observations, merging different datasets, and identifying outliers.

- **.darkorange[R can help us do that!]**
]

.pull-right[
.center[
![](https://github.com/maibennett/sta235/raw/main/exampleSite/content/bootcamp/images/ah_wrangling.jpg)]
]

---

# Into the tidyverse

.pull-left[
.center[
![](https://github.com/maibennett/sta235/raw/main/exampleSite/content/bootcamp/images/pipe.png)]
]

.pull-right[
- For data wrangling, we will use the **.darkorange[tidyverse]**: a collection of packages that follow a similar design structure (e.g. dplyr, ggplot2)

- It works through **.darkorange[pipes]**: `%>%`
    - Pipes chain functions together!
]

---

# Useful functions for wrangling

--

- `mutate(var = var1 + var2)`: Creates a new variable or replaces an existing one. It takes as arguments the name of the variable and what you want that variable to be.

--

- `filter(var == 1)`: Subsets your data according to a logical statement. Remember that logical statements use "==" instead of "="!

--

- `group_by(var1, var2)`: Groups observations by the values of one or more variables. You can use it either to create a variable with values at the group level, or to summarize your dataset by group.

--

- `select(var1, var2)`: Selects specific variables from the dataset (and drops the others). To drop variables instead of keeping them, use `select(-var1, -var2)`.

--

- `rename(var_new = var_old)`: The name says it all. Used to rename variables.

---

# Other useful functions

- `is.na(var)`: Logical function that returns TRUE if the observation is a missing value (NA) and FALSE otherwise.

--

- `ifelse(logic_statement, val1, val2)`: Very useful function to create conditional values. It returns `val1` where the logical statement is TRUE and `val2` where it is FALSE.

--

- `!(logic_statement)`: The exclamation point acts as a negation. If you want to invert a logical statement, use this (e.g. `!is.na(var)` will return TRUE if the observation of `var` is NOT missing and FALSE if it's missing).

--

- `table(var)`: Tabulates the different values of a variable

---
background-position: 50% 50%
class: center, middle

.box-5Trans[Let's go to R]

---

# Plotting in R

.pull-left[
- Plotting your data is a **.darkorange[very intuitive way]** to see what's going on.

- It's also useful for conveying **.darkorange[complex analyses]**!

- Make sure your plots are always **.darkorange[informative]** and that they **.darkorange[tell the story]** you want to highlight.
]

.pull-right[
.center[
![](https://github.com/maibennett/sta235/raw/main/exampleSite/content/bootcamp/images/ggplot.png)]
]

---

# General structure of ggplot

- `ggplot()` works in **.darkorange["layers"]**:
    - You can provide different geometries and "add" them to your plot (same with themes!)

--

- You always start with `ggplot(data = d, aes(x = var1, y = var2, color = var3))`, depending on what you want to do:
    - `aes()` stands for aesthetics, and it tells `ggplot()` which variables you want to use and how. Sometimes you need one variable (e.g. histogram), sometimes you need two (e.g. scatter plot), or even three or more! (e.g. scatter plot for different groups)

--

- You can provide `aes()` in the `ggplot()` function (as seen above), or in each geometric layer:

.center[e.g. `ggplot(data = d) + geom_point(aes(x = var1, y = var2))`]

---

# General structure of ggplot

- Some common geometries that are useful:
    - `geom_point()`: Creates a scatter plot
    - `geom_line()`: Creates a line plot
    - `geom_histogram()` or `geom_density()`: Creates a histogram or a density plot for your data!
    - `geom_smooth()`: Creates a smooth function that goes through your data. By default, it uses a loess or gam smoother, depending on the size of the data. Use `method = "lm"` as an argument if you want to fit a regression line!

--

- Finally, looks are also important!
    - `theme()` allows you to play around with every aspect of your plot (e.g. font size, grid lines, etc.)
    - Using a pre-packaged theme can be useful, too.
    I personally like `theme_minimal()` or the `theme_ipsum_rc()` from the `hrbrthemes` package.

---
background-position: 50% 50%
class: center, middle

.box-2Trans[Let's go to R]

---

# Regression Analysis

.pull-left[
- Regressions help us **.darkorange[quantify the relationship]** between different variables.

- In R, we can get **.darkorange[many important insights]** from regression analysis!
]

.pull-right[
<img src="f2023_sta235h_bootcamp_files/figure-html/plot_study-1.svg" style="display: block; margin: auto;" />
]

---

# Regressions in R

- The main command for running regressions is `lm(y ~ x1 + x2, data = d)`, where `y` is our outcome of interest and `x1` and `x2` are regressors.

--

- For convenience, we can store the regression in a separate object (e.g. `lm1 = lm(y ~ x1 + x2, data = d)`), so we can later manipulate it:
    - `summary(lm1)`: Provides a summary table of the results (including estimates, standard errors, and p-values).
    - `lm1$coefficients`: Recovers the exact estimated coefficients (e.g. useful if you want to use them later).
    - `summary(lm1)$coefficients`: Matrix of results. Includes columns for the estimated betas, standard errors, t-stats, and p-values.

---
background-position: 50% 50%
class: center, middle

.box-6Trans[Let's go to R]

---

# R is useful and fun!

.center[
![:scale 80%](https://github.com/maibennett/sta235/raw/main/exampleSite/content/bootcamp/images/know_r.png)]
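
---

# Regressions in R: a quick sketch

The regression commands from the "Regressions in R" slide (`lm()`, `summary()`, and `$coefficients`) can be tried on any data frame. This is a minimal sketch, assuming the built-in `mtcars` dataset stands in for `d`. It is not a course dataset, and the variables `mpg`, `wt`, and `hp` are chosen purely for illustration.

```r
# Illustrative only: mtcars ships with base R and is an assumption here,
# not the data used in class.
d = mtcars

# Regress fuel efficiency (mpg) on weight (wt) and horsepower (hp)
lm1 = lm(mpg ~ wt + hp, data = d)

# Summary table: estimates, standard errors, t-stats, and p-values
summary(lm1)

# Named vector with the estimated coefficients
betas = lm1$coefficients
betas["wt"]

# Matrix of results: one row per term, one column per statistic
results = summary(lm1)$coefficients
colnames(results)
results["wt", "Pr(>|t|)"]
```

Storing the fit in `lm1` means the model is estimated once, and individual numbers (such as one coefficient or one p-value) can then be pulled out by name instead of being copied from the printed table.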