
Intro to Regression

CoderGirl Data Analysis


Today’s Topics

1. Intro to regression analysis


2. What is regression analysis?
a. Model building
b. Model evaluation
3. Hands-On Regression in Excel
What is Regression Analysis?
➔ First, what’s a model?
◆ A model is a mathematical description of the relationship between two or more variables.
● Dependent variable: the variable whose value depends on the independent variable(s)
● Independent variable(s): variable(s) whose values do not depend on the other variables
➔ Deterministic (formulaic) vs. Probabilistic (statistical)

Deterministic example: convert Celsius to Fahrenheit
(0°C × 9/5) + 32 = 32°F

Probabilistic example: estimate weight from height
weight = 80 + 2 * height

General linear form: y = a + b*x
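The deterministic vs. probabilistic contrast above can be sketched in Python (illustrative only; the weight coefficients 80 and 2 come from the slide's example, not from a fitted model):

```python
# Deterministic model: an exact conversion formula with no error term.
def celsius_to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32

# Probabilistic (statistical) model: an estimate of the outcome; real
# observations differ from the estimate by an error term.
def estimated_weight(height):
    return 80 + 2 * height

print(celsius_to_fahrenheit(0))  # 32.0
print(estimated_weight(70))      # 220
```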
What is Regression Analysis?
➔ Regression analysis is a statistical process
that helps describe the relationship
between variables
◆ How can an outcome be explained?
◆ What will a future outcome be? (i.e., prediction and/or forecasting)
What is Regression Analysis?
➔ Many types of regression
◆ Linear - continuous dependent variable and linear relationship
◆ Logistic - dichotomous (1/0) dependent variable
◆ Polynomial - power of independent variable(s); aka “curved” relationship
◆ Stepwise - automated, iterative selection of independent variables
◆ Ridge, Lasso, and more!
● but we’ll focus on linear regression (LR)
What is Regression Analysis?

1. Simple Linear Regression (SLR) involves one dependent variable and one independent
variable.

2. Multiple Linear Regression (MLR) involves one dependent variable and 2+ independent
variables (most common).

Linear regression only works when the dependent variable is continuous!
Linear Regression
➔ Variable Types
◆ Covariates ~ independent variables (your X’s)
◆ Outcomes ~ dependent variables (your Y’s)
● Scale - continuous
● Nominal - categorical with no inherent order
● Ordinal - categorical with an ordered scale (e.g., a Likert scale)
Simple Linear Regression (SLR)

yi = Bi*xi + c + ε
where yi ~ estimated outcome
xi ~ independent variable
Bi ~ coefficient
c ~ constant/intercept
ε ~ error term
Simple Linear Regression (SLR)

yi = Bi*xi + c + ε
where yi ~ estimated outcome
xi ~ independent variable
Bi ~ coefficient
c ~ constant/intercept
ε ~ error term

Note: we’re predicting/estimating the outcome, so there is an error associated with it.
Multiple Linear Regression (MLR)

yi = Bi*xi + Bj*xj + c + ε
where yi ~ estimated outcome
xi, xj ~ independent variables
Bi, Bj ~ coefficients
c ~ constant/intercept
ε ~ error term
Linear Regression Fundamentals
Least Squares Method - minimizes the sum of squared deviations, penalizing large errors more
than small errors:

∑ (yi − ŷi)²

We minimize the sum of the squared prediction errors,

Q = ∑i (yi − (b0 + b1*xi))²

(that is, take the derivative with respect to b0 and b1, set each to 0, and solve for b0 and b1) and get the
"least squares estimates" for b0 and b1:

b0 = ȳ − b1*x̄ and b1 = ∑i (xi − x̄)(yi − ȳ) / ∑i (xi − x̄)²
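These closed-form estimates can be sketched in Python (purely illustrative; the data points below are made up to lie exactly on a line):

```python
# Least-squares estimates for simple linear regression, computed
# directly from the closed-form formulas above.
def least_squares(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
         / sum((x - x_bar) ** 2 for x in xs)
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Points lying exactly on y = 3 + 2x recover b0 = 3, b1 = 2.
b0, b1 = least_squares([1, 2, 3, 4], [5, 7, 9, 11])
print(b0, b1)  # 3.0 2.0
```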
Linear Regression Assumptions

1. The mean of the response, E(Yi), at each value of the predictor, xi, is a linear function of xi.
a. The expected value of the error term is zero.
(In plain terms: X(s) and Y have a linear relationship.)
2. The errors, εi, are independent.
a. Autocorrelation - disturbances are correlated with one another.
(In plain terms: no hidden correlations between independent variables.)
3. The errors, εi, at each value of the predictor, xi, are normally distributed.
(In plain terms: a normal distribution means we can do statistical tests.)
4. The errors, εi, at each value of the predictor, xi, have equal variances (denoted σ²).
a. Heteroscedastic - disturbances are not all equal.
(In plain terms: the error doesn’t change drastically across values.)
Pearson Correlation
➔ Measures the strength of the relationship between two variables
➔ Several correlation coefficients exist, but Pearson is the most common
➔ Quantitative variables only
➔ Scale of -1 to +1
◆ -1 Strong negative relationship
◆ 0 No relationship
◆ +1 Strong positive relationship

https://www.spss-tutorials.com/pearson-correlation-coefficient/
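Pearson’s r can be computed straight from its definition; a small Python sketch with made-up data:

```python
from math import sqrt

# Pearson correlation coefficient for two quantitative variables:
# covariance term divided by the product of the spread of each variable.
def pearson_r(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - x_bar) ** 2 for x in xs))
    sy = sqrt(sum((y - y_bar) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3], [2, 4, 6]))  # ≈ 1.0  (perfect positive)
print(pearson_r([1, 2, 3], [6, 4, 2]))  # ≈ -1.0 (perfect negative)
```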
Linear Regression Model Evaluation
➔ Mean Square Error (MSE): balances the tradeoff between bias and inefficiency
◆ MSE(B*) = E[(B* − B)²]

➔ R²: how much of the variance is explained by the predictor(s)
◆ Values between 0 and 1.
◆ Caution! A high R² alone does not mean the model fits well.
➔ Pearson Coefficient
◆ If r = -1, then there is a perfect negative linear relationship between x and y.
◆ If r = 1, then there is a perfect positive linear relationship between x and y.
◆ If r = 0, then there is no linear relationship between x and y.
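As a sketch of the R² definition (1 − SSres/SStot) in Python, with hypothetical observed and predicted values:

```python
# R²: the share of the outcome's variance explained by the predictions.
def r_squared(ys, y_hats):
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))  # residual SS
    ss_tot = sum((y - y_bar) ** 2 for y in ys)                # total SS
    return 1 - ss_res / ss_tot

# Perfect predictions give R² = 1; always predicting the mean gives R² = 0.
print(r_squared([1, 2, 3], [1, 2, 3]))  # 1.0
print(r_squared([1, 2, 3], [2, 2, 2]))  # 0.0
```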
Linear Regression Model Evaluation
➔ F-test: compares the fits of different linear models (multiple variables)
◆ Null hypothesis: The fit of the intercept-only model and your model are equal.
◆ Alternative hypothesis: The fit of the intercept-only model is significantly reduced
compared to your model.
● If the p-value for the F-test of overall significance is less than your significance level,
you can reject the null hypothesis
○ The higher the F statistic, the better the model fits relative to the intercept-only model

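The overall F statistic can be sketched from the standard sums-of-squares decomposition; the function name and example data below are hypothetical, not from the deck:

```python
# Overall F statistic: mean explained sum of squares divided by mean
# residual sum of squares. A large F means the model fits far better
# than the intercept-only model.
def f_statistic(ys, y_hats, n_predictors):
    n = len(ys)
    y_bar = sum(ys) / n
    ss_reg = sum((yh - y_bar) ** 2 for yh in y_hats)          # explained
    ss_res = sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))  # residual
    return (ss_reg / n_predictors) / (ss_res / (n - n_predictors - 1))

# Observed values and fitted values from the least-squares line
# y = -0.4 + 1.2x on x = 1..5 (hypothetical single-predictor fit):
f = f_statistic([1, 2, 3, 4, 6], [0.8, 2.0, 3.2, 4.4, 5.6], n_predictors=1)
print(round(f, 1))  # 108.0
```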
Schedule
▶ Recap of Regression Basics
▶ Single Variable
▶ Multi Variable
▶ Excel Output
▶ “Best” Model
▶ t Stat & p-Value

▶ Regression in R

▶ Hands On
▶ Let’s try our hand
Single Variable Regression (Simple Regression)
▶ Measures a single independent variable, x, as it impacts a single dependent variable, y
▶ Goal: Minimize Error using Least Squares Methodology

▶ GPA ~ Parent Income


▶ Red Line = Error (residual) in the fitted scatterplot
▶ Sum of Residuals = 8.88E-16, i.e., numerically zero
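The near-zero sum of residuals (8.88E-16 on the slide) is a general property of least-squares fits, which this Python sketch with made-up data demonstrates:

```python
# Fit a least-squares line, then check that the residuals sum to
# (numerically) zero -- which is why output shows tiny values
# like 8.88E-16 instead of exactly 0.
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
     / sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(sum(residuals))  # on the order of 1e-16, effectively zero
```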
Multi Variable Regression
▶ Measures multiple independent variables as they impact a single dependent variable, y
▶ Variable Significance
▶ Numeric – P-values
▶ Categorical – F-test (anova)

▶ Typically a combination of numeric and categorical values


▶ Numeric (Continuous) – Age, Height, Weight
▶ Categorical (Discrete) – Region, Gender, Shirt Size

▶ Categorical Variables
▶ Must be split into indicator variables
▶ Region Example
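Splitting a categorical variable into indicator (dummy) variables can be sketched as follows; the helper name and region data are hypothetical:

```python
# Convert a categorical column into 0/1 indicator (dummy) columns.
# One level is dropped as the baseline to avoid perfect collinearity.
def indicator_columns(values, levels, baseline):
    kept = [lv for lv in levels if lv != baseline]
    return [[1 if v == lv else 0 for lv in kept] for v in values]

regions = ["East", "West", "South", "East"]
print(indicator_columns(regions, ["East", "West", "South"], baseline="East"))
# [[0, 0], [1, 0], [0, 1], [0, 0]]  -- columns are West, South
```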
Excel Output

▶ Model Fit
▶ Model Significance
▶ Coefficients
▶ Residuals
“Best” Model
▶ What is the “best” fitting model?
▶ Goldilocks balance with the number of predictors
▶ Too few: An underspecified model tends to produce biased estimates.
▶ Too many: An overspecified model tends to have less precise estimates.
▶ Just right: A model with the correct terms has no bias and the most precise estimates.

▶ It depends!
Excel Output – t-Stat & P-value
▶ t-Stat = Coefficient / Standard Error
▶ P-value = calculated from the t-stat and degrees of freedom
▶ What do these values actually measure?
▶ The probability of results at least this extreme if the coefficient were truly zero
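A minimal sketch of the t-stat formula (the values are hypothetical; the p-value would then come from the t distribution with the model's residual degrees of freedom):

```python
# t statistic for a coefficient: the estimate divided by its standard error.
def t_stat(coefficient, std_error):
    return coefficient / std_error

# A coefficient of 2.0 with standard error 0.5 gives t = 4.0.
print(t_stat(2.0, 0.5))  # 4.0
```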
