Carry out a full prediction project by applying what we have learned about prediction in this class.
You will choose a dataset out of the three (3) available ones, and compete with other students that have chosen the same one.
For each dataset, you will have to predict 2 outcomes: one continuous outcome and one binary outcome. You will use two (2) methods for each, and state which of the two is your preferred one.
Using each preferred method, the instruction team will run your code on a testing dataset with the same structure as the data provided and obtain the appropriate measures for accuracy.
For each outcome, we will rank the performance of the model for each student, and average their ranking. This will determine their score in the “performance” aspect (see evaluation below).
A 2-page report (without counting images/tables) about your project with the following sections:
The report needs to have a title (you can keep it classic or be creative) and clearly indicate the different sections/subsections. All tables and plots need to be placed at the end of the document (they don’t count towards the 2-page limit), and have appropriate descriptive captions (numbered so you can reference them in the text).
Use standard margins and a font size of at least 11pt.
You will also need to submit an R script with the code and an RData file with your model.
Homework assignments will be graded according to the following items:
Without considering performance, the maximum grade is 94 points. The last 6 points will be awarded based on performance: the student with the best-performing models (on average) will receive 6 points, and the student with the lowest-performing models (or with code that does not allow us to estimate performance) will receive 0 points. Students who achieve the same performance will be assigned the average of their tied rankings.
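As an illustration of how per-outcome rankings can be averaged, and how ties can be handled with average ranks, here is a minimal sketch. The student labels, the numbers, and the choice of RMSE and accuracy as performance measures are all assumptions made for the example; the measures the instruction team actually uses may differ.

```r
# Hypothetical test-set results for four students (made-up numbers).
perf <- data.frame(
  student = c("A", "B", "C", "D"),
  rmse    = c(10.2, 9.8, 10.2, 12.0),  # continuous outcome: lower is better (assumed measure)
  acc     = c(0.81, 0.85, 0.79, 0.85)  # binary outcome: higher is better (assumed measure)
)

# Rank each outcome; tied students share the average of the ranks they span.
perf$rank_reg   <- rank(perf$rmse, ties.method = "average")
perf$rank_class <- rank(-perf$acc, ties.method = "average")

# Average the two rankings for each student.
perf$avg_rank <- (perf$rank_reg + perf$rank_class) / 2

# Students ordered from best to worst average ranking.
print(perf[order(perf$avg_rank), ])
```

In this toy example, students A and C tie on the continuous outcome and each receive rank 2.5.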
NBA performance: In this dataset, you will have information about players' characteristics and performance between the 2020-2021 and 2022-2023 NBA seasons. You will have to predict salaries, and which players are in the top 25% of best-paid players. The data has approximately 1,600 observations and 74 variables.
Student dropout: This dataset contains information about several student characteristics, including some demographic and socioeconomic variables, as well as their performance in college. Using these characteristics, you will have to predict a student’s score and also their probability of dropping out. The dataset has approximately 2,500 observations and 34 variables.
Housing prices: This is a real dataset (slightly cleaned) from the Home Mortgage Disclosure Act for housing purchasing loans throughout the US in 2017. It contains information about the individual’s application, and you will have to predict whether the person gets approved or not, and the amount of the loan. The data has approximately 2,800 observations and 34 variables.
Select your dataset on Canvas by Thursday 11/16 (check the announcement).
For this assignment, you will have to submit three different files:
Written report: This is the 2-page PDF file (without counting images and tables) that you need to submit. See the instructions above for sections, styling, etc. You should name your report as follows: “EID_LastnameFirstLetter_report.pdf” (e.g. mc72574_BennettM_report.pdf).
R Script: This is a full, clean script that should replicate your results. It should have one section for the regression task and one for the classification task, and both should be clearly delimited. The same applies to the different methods you use within each task. Follow the structure of the following template to make sure you are submitting an appropriate file. You should name your script as follows: “EID_LastnameFirstletter_script.R” (e.g. mc72574_BennettM_script.R).
install.packages()
, etc.). These things will be evaluated as well.
Final Models (RData): To facilitate replication, you will also have to submit an RData file with your two preferred models. Do not submit anything else in this file. You should name your file as follows: “EID_LastnameFirstLetter_models.RData” (e.g. mc72574_BennettM_models.RData). Follow the instructions in the homework template to make sure you are submitting everything correctly. Make sure that your regression task model is named “reg.model” and your classification task model is named “class.model” to make evaluation easier.
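One way to produce an RData file that contains only the two required objects is sketched below. The toy data, the choice of `lm()` and `glm()` as stand-ins for your preferred methods, and the temporary output path are assumptions for illustration; the homework template remains the reference for what to submit.

```r
# Toy data standing in for a cleaned training set (hypothetical).
set.seed(1)
train <- data.frame(x = rnorm(50))
train$y_cont <- 2 * train$x + rnorm(50)
train$y_bin  <- as.integer(train$x + rnorm(50) > 0)

# Placeholder models, stored under the object names the assignment requires.
reg.model   <- lm(y_cont ~ x, data = train)
class.model <- glm(y_bin ~ x, data = train, family = binomial)

# Save only these two objects (written to a temp folder here; use your own path).
out_file <- file.path(tempdir(), "EID_LastnameFirstLetter_models.RData")
save(reg.model, class.model, file = out_file)

# Sanity check: load into a fresh environment and list what the file contains.
check_env <- new.env()
loaded <- load(out_file, envir = check_env)
print(loaded)
```

Loading the file into a fresh environment before submitting is a quick way to confirm that nothing else was saved by accident.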
You will most likely need to do some data cleaning before fitting any models. Make sure you do this to your entire dataset before you do anything else (see the previous template to guide you in the structure of your script). Pay special attention to the different variables you might encounter (e.g. Are they categorical? Should I transform them? Are there variables that are not predictors?).
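As a rough sketch of the kind of checks described above, the snippet below inspects variable types, drops a column that is not a predictor, and converts the remaining columns to suitable types. The data frame and its column names (`id`, `team`, `minutes`) are made up for the example and are not taken from any of the course datasets.

```r
# Hypothetical raw data: an identifier, a categorical variable, and a
# numeric variable that was read in as text.
raw <- data.frame(
  id      = 1:4,
  team    = c("BOS", "LAL", "BOS", "MIA"),
  minutes = c("30.5", "28.1", "35.0", "12.4"),
  stringsAsFactors = FALSE
)

# Inspect the type of every column before modeling.
str(raw)

clean <- raw
clean$id      <- NULL                       # identifier, not a predictor
clean$team    <- factor(clean$team)         # categorical variable
clean$minutes <- as.numeric(clean$minutes)  # text converted to numeric

str(clean)
```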
Depending on your dataset, it might make sense to aggregate some categories in categorical variables or mutate certain variables to be able to use the information they provide. Remember that you need to have the same categories in your training and testing dataset, so if you have too many categories (or one category with too few observations), you might run into issues.
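One possible way to aggregate sparse categories while keeping the same levels in training and testing data is sketched below. The vectors, the threshold `min_n`, the helper `lump()`, and the label "Other" are all hypothetical choices for the example.

```r
# Hypothetical categorical predictor in training and testing data.
train_pos <- c("G", "G", "F", "F", "F", "C", "C", "G-F")
test_pos  <- c("G", "F", "C", "F-C")  # "F-C" never appears in training

# Keep categories with at least `min_n` training observations.
min_n  <- 2
counts <- table(train_pos)
keep   <- names(counts)[counts >= min_n]

# Helper (hypothetical): recode anything outside `keep` as "Other" and
# fix the factor levels so both datasets share exactly the same ones.
lump <- function(x, keep) {
  x <- ifelse(x %in% keep, x, "Other")
  factor(x, levels = c(keep, "Other"))
}

train_f <- lump(train_pos, keep)
test_f  <- lump(test_pos, keep)

print(table(train_f))
print(table(test_f))
```

Because the set of categories to keep is decided from the training data only, a category that shows up for the first time in the testing data falls into "Other" instead of causing an error at prediction time.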
You will also run into missing values. How you handle these is up to you. Some common alternatives are (1) drop observations with missing values (caveat: depending on missingness, this might be a bad idea if you are losing a lot of data), (2) not use variables with a lot of missing values (caveat: again, depending on the level of missing values, you might be losing important information), (3) impute average values to missing observations (caveat: you are introducing some noise in the variable). As a note, imputation should only be done for predictors, and not outcome variables.
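Here is a minimal sketch of alternative (3), mean imputation applied to predictors only. The data frame and column names are made up, and dropping rows with a missing outcome is an assumption of this example, not a requirement stated in the assignment.

```r
# Hypothetical data with missing values in the outcome and in two predictors.
dat <- data.frame(
  outcome = c(10, NA, 14, 9, 12),
  age     = c(25, 30, NA, 41, 36),
  income  = c(50, NA, 65, NA, 80)
)

# Share of missing values per column, useful for choosing between alternatives.
miss_share <- colMeans(is.na(dat))
print(miss_share)

# Outcomes are never imputed; in this sketch, rows without an outcome are dropped.
dat <- dat[!is.na(dat$outcome), ]

# Impute each predictor with the mean of its observed values.
predictors <- c("age", "income")
for (v in predictors) {
  v_mean <- mean(dat[[v]], na.rm = TRUE)
  dat[[v]][is.na(dat[[v]])] <- v_mean
}

print(dat)
```

If you split your data into training and testing sets, computing the means on the training portion and reusing them for the testing portion avoids using information from the testing data.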
You can wrangle your data in an easy way, dropping a lot of data and variables, or spend a little more time in this part to be sure to preserve more information. This will have an impact on the accuracy of your prediction (remember that, usually, more info is better than less!). The decision is yours 😀