
Intro to Regression

CoderGirl Data Analysis


Today’s Topics

1. Intro to regression analysis


2. What is regression analysis?
a. Model building
b. Model evaluation
3. Hands-On Regression in Excel
What is Regression Analysis?
➔ First, what’s a model?
◆ A model is a mathematical description of the relationship between two or more variables.
● Dependent variable: the variable whose value depends on the independent variable(s)
● Independent variable(s): variable(s) whose values do not depend on the other variables
➔ Deterministic (formulaic) vs. Probabilistic (statistical)

Deterministic example: convert Celsius to Fahrenheit
(0°C × 9/5) + 32 = 32°F

Probabilistic example: estimate weight from height
weight = 80 + 2 * height

General linear form: y = a + b*x
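The deterministic vs. probabilistic contrast above can be sketched in Python (illustrative only; the weight coefficients 80 and 2 come from the slide's example, not from a fitted model):

```python
# Deterministic model: an exact conversion formula with no error term.
def celsius_to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32

# Probabilistic (statistical) model: an estimate of the outcome; real
# observations differ from the estimate by an error term.
def estimated_weight(height):
    return 80 + 2 * height

print(celsius_to_fahrenheit(0))  # 32.0
print(estimated_weight(70))      # 220
```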
What is Regression Analysis?
➔ Regression analysis is a statistical process
that helps describe the relationship
between variables
◆ How can an outcome be explained?
◆ What will a future outcome be? (i.e., prediction and/or forecasting)
What is Regression Analysis?
➔ Many types of regression
◆ Linear - continuous dependent variable and linear relationship
◆ Logistic - dichotomous (1/0) dependent variable
◆ Polynomial - power of independent variable(s); aka “curved” relationship
◆ Stepwise - automated, iterative selection of independent variables
◆ Ridge, Lasso, and more!
● but we’ll focus on linear regression (LR)
What is Regression Analysis?

1. Simple Linear Regression (SLR) involves one dependent variable and one independent
variable.

2. Multiple Linear Regression (MLR) involves one dependent variable and 2+ independent
variables (most common).

Linear regression only works when the dependent variable is continuous!
Linear Regression
➔ Variable Types
◆ Covariates ~ independent variables (your X’s)
◆ Outcomes ~ dependent variables (your Y’s)
● Scale - continuous
● Nominal - categorical with no inherent order
● Ordinal - categorical with an ordered scale (e.g., a Likert scale)
Simple Linear Regression (SLR)

yi = Bi*xi + c + ε
where yi ~ estimated outcome
xi ~ independent variable
Bi ~ coefficient
c ~ constant/intercept
ε ~ error term
Simple Linear Regression (SLR)

yi = Bi*xi + c + ε
where yi ~ estimated outcome
xi ~ independent variable
Bi ~ coefficient
c ~ constant/intercept
ε ~ error term

Note: we’re predicting/estimating the outcome, so there is an error associated with it.
Multiple Linear Regression (MLR)

yi = Bi*xi + Bj*xj + c + ε
where yi ~ estimated outcome
xi, xj ~ independent variables
Bi, Bj ~ coefficients
c ~ constant/intercept
ε ~ error term
Linear Regression Fundamentals
Least Squares Method - minimizes the sum of squared deviations, penalizing large errors more
than small errors:

∑ (yi − ŷi)²

We minimize the sum of the squared prediction errors,

Q = ∑i (yi − (b0 + b1*xi))²

(that is, take the derivative with respect to b0 and b1, set each to 0, and solve for b0 and b1) and get the
"least squares estimates" for b0 and b1:

b0 = ȳ − b1*x̄ and b1 = ∑i (xi − x̄)(yi − ȳ) / ∑i (xi − x̄)²
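These closed-form estimates can be sketched in Python (purely illustrative; the data points below are made up to lie exactly on a line):

```python
# Least-squares estimates for simple linear regression, computed
# directly from the closed-form formulas above.
def least_squares(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
         / sum((x - x_bar) ** 2 for x in xs)
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Points lying exactly on y = 3 + 2x recover b0 = 3, b1 = 2.
b0, b1 = least_squares([1, 2, 3, 4], [5, 7, 9, 11])
print(b0, b1)  # 3.0 2.0
```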
Linear Regression Assumptions

1. The mean of the response, E(Yi), at each value of the predictor, xi, is a linear function of xi.
a. The expected value of the error term is zero.
(In plain terms: X(s) and Y have a linear relationship.)
2. The errors, εi, are independent.
a. Autocorrelation - disturbances are correlated with one another.
(In plain terms: no hidden correlations between independent variables.)
3. The errors, εi, at each value of the predictor, xi, are normally distributed.
(In plain terms: a normal distribution means we can do statistical tests.)
4. The errors, εi, at each value of the predictor, xi, have equal variances (denoted σ²).
a. Heteroscedastic - disturbances are not all equal.
(In plain terms: the error doesn’t change drastically across values.)
Pearson Correlation
➔ Measures the strength of the relationship between two variables
➔ Several correlation coefficients exist, but Pearson is the most common
➔ Quantitative variables only
➔ Scale of -1 to +1
◆ -1 Strong negative relationship
◆ 0 No relationship
◆ +1 Strong positive relationship

https://www.spss-tutorials.com/pearson-correlation-coefficient/
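Pearson’s r can be computed straight from its definition; a small Python sketch with made-up data:

```python
from math import sqrt

# Pearson correlation coefficient for two quantitative variables:
# covariance term divided by the product of the spread of each variable.
def pearson_r(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - x_bar) ** 2 for x in xs))
    sy = sqrt(sum((y - y_bar) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3], [2, 4, 6]))  # ≈ 1.0  (perfect positive)
print(pearson_r([1, 2, 3], [6, 4, 2]))  # ≈ -1.0 (perfect negative)
```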
Linear Regression Model Evaluation
➔ Mean Square Error (MSE): balances the tradeoff between bias and inefficiency
◆ MSE(B*) = E[(B* − B)²]

➔ R²: how much of the variance is explained by the predictor(s)
◆ Values between 0 and 1.
◆ Caution! A high R² alone does not mean the model fits well.
➔ Pearson Coefficient
◆ If r = -1, then there is a perfect negative linear relationship between x and y.
◆ If r = 1, then there is a perfect positive linear relationship between x and y.
◆ If r = 0, then there is no linear relationship between x and y.
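As a sketch of the R² definition (1 − SSres/SStot) in Python, with hypothetical observed and predicted values:

```python
# R²: the share of the outcome's variance explained by the predictions.
def r_squared(ys, y_hats):
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))  # residual SS
    ss_tot = sum((y - y_bar) ** 2 for y in ys)                # total SS
    return 1 - ss_res / ss_tot

# Perfect predictions give R² = 1; always predicting the mean gives R² = 0.
print(r_squared([1, 2, 3], [1, 2, 3]))  # 1.0
print(r_squared([1, 2, 3], [2, 2, 2]))  # 0.0
```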
Linear Regression Model Evaluation
➔ F-test: compares the fits of different linear models (multiple variables)
◆ Null hypothesis: The fit of the intercept-only model and your model are equal.
◆ Alternative hypothesis: The fit of the intercept-only model is significantly reduced
compared to your model.
● If the p-value for the F-test of overall significance is less than your significance level,
you can reject the null hypothesis
○ The higher the F statistic, the better the model fits relative to the intercept-only model

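The overall F statistic can be sketched from the standard sums-of-squares decomposition; the function name and example data below are hypothetical, not from the deck:

```python
# Overall F statistic: mean explained sum of squares divided by mean
# residual sum of squares. A large F means the model fits far better
# than the intercept-only model.
def f_statistic(ys, y_hats, n_predictors):
    n = len(ys)
    y_bar = sum(ys) / n
    ss_reg = sum((yh - y_bar) ** 2 for yh in y_hats)          # explained
    ss_res = sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))  # residual
    return (ss_reg / n_predictors) / (ss_res / (n - n_predictors - 1))

# Observed values and fitted values from the least-squares line
# y = -0.4 + 1.2x on x = 1..5 (hypothetical single-predictor fit):
f = f_statistic([1, 2, 3, 4, 6], [0.8, 2.0, 3.2, 4.4, 5.6], n_predictors=1)
print(round(f, 1))  # 108.0
```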
Schedule
▶ Recap of Regression Basics
▶ Single Variable
▶ Multi Variable
▶ Excel Output
▶ “Best” Model
▶ t Stat & p-Value

▶ Regression in R

▶ Hands On
▶ Let’s try our hand
Single Variable Regression (Simple Regression)
▶ Measures a single independent variable, x, as it impacts a single dependent variable, y
▶ Goal: Minimize Error using Least Squares Methodology

▶ GPA ~ Parent Income


▶ Red Line = Error (residual) in the fitted scatterplot
▶ Sum of Residuals = 8.88E-16, i.e., numerically zero
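The near-zero sum of residuals (8.88E-16 on the slide) is a general property of least-squares fits, which this Python sketch with made-up data demonstrates:

```python
# Fit a least-squares line, then check that the residuals sum to
# (numerically) zero -- which is why output shows tiny values
# like 8.88E-16 instead of exactly 0.
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
     / sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(sum(residuals))  # on the order of 1e-16, effectively zero
```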
Multi Variable Regression
▶ Measures multiple independent variables as they impact a single dependent variable, y
▶ Variable Significance
▶ Numeric – P-values
▶ Categorical – F-test (anova)

▶ Typically a combination of numeric and categorical values


▶ Numeric (Continuous) – Age, Height, Weight
▶ Categorical (Discrete) – Region, Gender, Shirt Size

▶ Categorical Variables
▶ Must be split into indicator variables
▶ Region Example
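Splitting a categorical variable into indicator (dummy) variables can be sketched as follows; the helper name and region data are hypothetical:

```python
# Convert a categorical column into 0/1 indicator (dummy) columns.
# One level is dropped as the baseline to avoid perfect collinearity.
def indicator_columns(values, levels, baseline):
    kept = [lv for lv in levels if lv != baseline]
    return [[1 if v == lv else 0 for lv in kept] for v in values]

regions = ["East", "West", "South", "East"]
print(indicator_columns(regions, ["East", "West", "South"], baseline="East"))
# [[0, 0], [1, 0], [0, 1], [0, 0]]  -- columns are West, South
```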
Excel Output

▶ Model Fit
▶ Model Significance
▶ Coefficients
▶ Residuals
“Best” Model
▶ What is the “best” fitting model?
▶ Goldilocks balance with the number of predictors
▶ Too few: An underspecified model tends to produce biased estimates.
▶ Too many: An overspecified model tends to have less precise estimates.
▶ Just right: A model with the correct terms has no bias and the most precise estimates.

▶ It depends!
Excel Output – t-Stat & P-value
▶ t-Stat = Coefficient / Standard Error
▶ P-value = calculated from the t-stat and degrees of freedom
▶ What do these values actually measure?
▶ The probability of results at least this extreme if the coefficient were truly zero
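A minimal sketch of the t-stat formula (the values are hypothetical; the p-value would then come from the t distribution with the model's residual degrees of freedom):

```python
# t statistic for a coefficient: the estimate divided by its standard error.
def t_stat(coefficient, std_error):
    return coefficient / std_error

# A coefficient of 2.0 with standard error 0.5 gives t = 4.0.
print(t_stat(2.0, 0.5))  # 4.0
```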
