STA 235H - Multiple Regression: Outliers

Fall 2023

McCombs School of Business, UT Austin

Why should we inspect our data before doing anything else?

Identifying outliers

  • How do we identify outliers?

    • Visual inspection (e.g. plots, tables)

    • Creating thresholds (e.g. z-scores, IQR)

  • There is no definitive way to identify outliers

    • Like the characterization of pornography, "I know it when I see it" (P. Stewart, 1964)
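
A minimal R sketch of the two threshold rules named above, using a made-up income vector (not the HMDA data). The cutoffs (2 standard deviations, 1.5 times the IQR) are common conventions assumed here, not values taken from the slides.

```r
# Made-up incomes (in thousands); the last two values are deliberately extreme.
income <- c(42, 55, 61, 48, 73, 66, 58, 51, 69, 45, 400, 950)

# z-score rule: flag values more than z_cut standard deviations from the mean.
z_cut <- 2  # assumed cutoff
z <- (income - mean(income)) / sd(income)
flag_z <- abs(z) > z_cut

# IQR rule: flag values more than 1.5 * IQR below Q1 or above Q3.
q <- unname(quantile(income, c(0.25, 0.75)))
iqr <- q[2] - q[1]
flag_iqr <- income < q[1] - 1.5 * iqr | income > q[2] + 1.5 * iqr

# Side-by-side view of which observations each rule flags.
data.frame(income, z = round(z, 2), flag_z, flag_iqr)
```

In this toy vector the z-score rule flags only the largest value, because the extreme values themselves inflate the mean and standard deviation, while the IQR rule flags both. Disagreements like this are one reason no single threshold is definitive.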

HMDA Data for Bastrop County

  • 2017 Home Mortgage Disclosure Act (HMDA) data for Bastrop County (near Austin)

Association between loan amount and income

Identifying outliers

Association with complete data

Association after removing outliers

Compare both coefficients: Complete data

summary(lm(loan_amount_000s ~ applicant_income_000s, data = hmda))
##
## Call:
## lm(formula = loan_amount_000s ~ applicant_income_000s, data = hmda)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -458.93  -36.97   -8.77   35.47  365.27
##
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)           141.05028    4.15313   33.96   <2e-16 ***
## applicant_income_000s   0.84000    0.03663   22.93   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67.04 on 875 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.3754, Adjusted R-squared: 0.3747
## F-statistic: 525.8 on 1 and 875 DF, p-value: < 2.2e-16

Compare both coefficients: Data without outliers

summary(lm(loan_amount_000s ~ applicant_income_000s, data = hmda_without_outliers))
##
## Call:
## lm(formula = loan_amount_000s ~ applicant_income_000s, data = hmda_without_outliers)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -272.22  -36.09   -6.82   34.12  360.06
##
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)           133.52408    4.47317   29.85   <2e-16 ***
## applicant_income_000s   0.92376    0.04171   22.15   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 64.82 on 873 degrees of freedom
## Multiple R-squared: 0.3597, Adjusted R-squared: 0.359
## F-statistic: 490.5 on 1 and 873 DF, p-value: < 2.2e-16
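
In the two summaries above, the income slope moves from 0.84 to 0.92 while the second fit uses only two fewer observations (residual degrees of freedom fall from 875 to 873). The slides do not show how hmda_without_outliers was built, so the sketch below illustrates the same comparison on simulated data. The variable names, the two extreme points, and the income cutoff of 500 are all assumptions made for illustration.

```r
set.seed(235)

# Simulated applicants: income and loan amount, both in thousands.
n <- 200
income <- runif(n, 20, 200)
loan <- 130 + 0.9 * income + rnorm(n, sd = 60)
sim <- data.frame(income = income, loan = loan)

# Add two high-income applicants with comparatively small loans.
sim <- rbind(sim, data.frame(income = c(900, 1200), loan = c(300, 350)))

# Hypothetical rule for this toy data: treat incomes above 500 as outliers.
sim_trimmed <- subset(sim, income < 500)

fit_all  <- lm(loan ~ income, data = sim)
fit_trim <- lm(loan ~ income, data = sim_trimmed)

# Coefficients side by side, one row per model.
rbind(complete = coef(fit_all), without_outliers = coef(fit_trim))
```

Two observations out of roughly two hundred are enough to pull the slope down noticeably here, because points far from the mean of the predictor have high leverage.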

What to do with outliers?

  1. Check them!

    • Make sure there's no coding error; try to understand what's happening there.

  2a. If they are wrongly coded:

    • You can remove them, but always add a note explaining why.
    • Be aware of sample selection: dropping observations changes which population your results describe.

  2b. If they are correctly coded:

    • Run the analysis both with and without the outliers (don't just drop them!).
    • Robust results do not depend on just a few observations.
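
One way to follow step 2a while keeping a record of the decision is to flag rows in the data instead of deleting them silently. The sketch below is an illustration on made-up data; the column names (exclude, exclude_reason) and the assumption that a negative income is a missing-value code are hypothetical.

```r
# Made-up data: -999 looks like a missing-value code, 820 is extreme but plausible.
loans <- data.frame(
  id     = 1:6,
  income = c(52, 61, -999, 48, 75, 820),
  loan   = c(180, 200, 150, 170, 230, 400)
)

# Record the exclusion and its reason in the data itself.
loans$exclude <- FALSE
loans$exclude_reason <- NA_character_

bad_code <- loans$income < 0
loans$exclude[bad_code] <- TRUE
loans$exclude_reason[bad_code] <- "negative income: likely a missing-value code"

# Analysis data drops only the miscoded row; the correctly coded extreme
# value stays, so models can still be run with and without it (step 2b).
analysis_data <- loans[!loans$exclude, ]
```

Keeping the flag and the reason alongside the raw data makes it easy to report how many observations were removed and why.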

Let's do some exercises!
