Adv Analytical Theory and Methods: Regression
Chapter Sections
Summary
6 Regression
Regression analysis attempts to explain the
influence that input (independent) variables
have on the outcome (dependent) variable
Questions regression might answer
What is a person’s expected income?
What is the probability that an applicant will default on a loan?
OLS
6.1.2 Model Description
Example
6.1.2 Model Description
With Normally Distributed
Errors
Making additional assumptions on the
error term provides further
capabilities
It is common to assume the error
> library(lattice)
> splom(~income_input[c(2:5)], groups = NULL, data = income_input,
    axis.line.tck = 0, axis.text.alpha = 0)
6.1.2 Model Description
Example in R
Scatterplot
income~age: strong +
trend
income~educ: slight +
trend
income~gender: no trend
6.1.2 Model Description
Example in R
Quantify the linear relationship trends
incr
Educ coef: ~1.76, each additional year of education -> about $1.76k more income
Gender coef: ~-0.93, male income is lower by about $930
Residuals – assumed to be normally distributed
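The fit summarized on this slide can be sketched as follows. The income_input data are not reproduced here, so this minimal sketch simulates a stand-in data frame; the name demo_income and the coefficients used to generate it are assumptions chosen only to resemble the example.

```r
# Synthetic stand-in for income_input; generating coefficients are illustrative only.
set.seed(42)
n <- 500
demo_income <- data.frame(
  Age       = sample(18:70, n, replace = TRUE),
  Education = sample(10:20, n, replace = TRUE),
  Gender    = sample(0:1, n, replace = TRUE)
)
demo_income$Income <- 7 + 1.0 * demo_income$Age + 1.8 * demo_income$Education -
  0.9 * demo_income$Gender + rnorm(n, mean = 0, sd = 12)

# Ordinary least squares fit with all three input variables
demo_fit <- lm(Income ~ Age + Education + Gender, data = demo_income)

# Coefficient table: estimates, standard errors, t values, p-values
demo_coefs <- coef(summary(demo_fit))
print(round(demo_coefs, 3))
```

Each row of the table gives the estimated change in Income for a one-unit change in that input, holding the other inputs fixed.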
6.1.2 Model Description
Example in R
Examine residuals – uncertainty or sampling
error
Small p-values indicate statistically significant
results
Age and Education highly significant, p<2e-16
Gender p=0.13 is large, not significant at the 90% confidence level
Therefore, drop variable gender from linear
model
> results2 <- lm(Income ~ Age + Education, income_input)
> summary(results2) # results about the same as before
Residual standard error: residual standard
6.1.2 Model Description
Categorical Variables
In the example in R, Gender is a binary variable
Variables like Gender are categorical variables in
could be implemented
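One way R handles a categorical input with more than two values is to expand it into binary indicator columns. The sketch below uses a hypothetical three-level variable named region, which is not part of the income example.

```r
# Hypothetical three-level categorical input (not from the income example)
region <- factor(c("north", "south", "west", "south", "north", "west"))

# model.matrix() shows the indicator (dummy) columns lm() would build.
# The first level ("north") is the reference and gets no column of its own.
dummies <- model.matrix(~ region)
print(dummies)
```

A variable with k levels becomes k - 1 binary columns, each measured relative to the reference level.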
6.1.2 Model Description
Confidence Intervals on the
Parameters
model to apply
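Confidence intervals on the fitted parameters can be obtained with confint(). This is a minimal sketch on simulated data; ci_data and its generating values are assumptions, not the income example.

```r
# Simulated data with a known linear relationship (illustrative values)
set.seed(1)
ci_data <- data.frame(x = runif(200, 0, 10))
ci_data$y <- 3 + 2 * ci_data$x + rnorm(200, sd = 1.5)

ci_fit <- lm(y ~ x, data = ci_data)

# 95% confidence intervals for the intercept and the slope
ci_params <- confint(ci_fit, level = 0.95)
print(ci_params)
```

An interval that excludes zero corresponds to a parameter that is significant at the matching level.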
6.1.3 Diagnostics
Evaluating the Linearity
Assumption
Income as a quadratic function of Age
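If the relationship is curved, a squared term can be added to the model formula. The sketch below assumes simulated data with a quadratic dependence on age; the variable names and coefficients are illustrative.

```r
set.seed(7)
age <- runif(300, 18, 70)
# Assumed curved relationship: outcome rises with age, then flattens
outcome <- 10 + 3 * age - 0.03 * age^2 + rnorm(300, sd = 4)

fit_linear    <- lm(outcome ~ age)
fit_quadratic <- lm(outcome ~ age + I(age^2))  # I() keeps ^ as arithmetic

# Compare how much variance each model explains
r2_linear    <- summary(fit_linear)$r.squared
r2_quadratic <- summary(fit_quadratic)$r.squared
print(c(linear = r2_linear, quadratic = r2_quadratic))
```

The model is still linear in its parameters, so lm() applies unchanged.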
6.1.3 Diagnostics
Evaluating the Residuals
The error terms were assumed to be normally
distributed with zero mean and constant variance
> with(results2, { plot(fitted.values, residuals, ylim = c(-40, 40)) })
6.1.3 Diagnostics
Evaluating the Residuals
The next four figures don't fit the zero-mean, constant-variance assumption
[Figure panels: nonlinear trend in residuals; residuals not centered on zero]
6.1.3 Diagnostics
Evaluating the Residuals
[Figure panels: residuals not centered on zero; variance not constant]
6.1.3 Diagnostics
Evaluating the Normality
Assumption
The normality assumption still has to be validated
> hist(results2$residuals)
Residuals centered
on zero and appear
normally distributed
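A Q-Q plot is another common check alongside the histogram. The sketch below simulates data with normal errors, so the variable names and values are assumptions rather than the income example.

```r
set.seed(3)
x <- runif(200, 0, 10)
y <- 5 + 2 * x + rnorm(200, sd = 2)  # errors drawn from a normal distribution

qq_fit   <- lm(y ~ x)
qq_resid <- residuals(qq_fit)

# Points close to the reference line suggest approximately normal residuals
qqnorm(qq_resid, main = "Normal Q-Q plot of residuals")
qqline(qq_resid)
```

Systematic curvature away from the line, especially in the tails, would suggest the normality assumption does not hold.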
6.1.3 Diagnostics
Evaluating the Normality
Assumption
[Figure panels: normally distributed residuals; non-normally distributed residuals]
6.1.3 Diagnostics
N-Fold Cross-Validation
To prevent overfitting, a common practice splits the
dataset into training and test sets, develops the model
on the training set and evaluates it on the test set
If the dataset is too small for such a split, N-fold cross-validation can be used instead
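N-fold cross-validation can be sketched in base R as below. The data are simulated, and the fold count of 5 and all names are assumptions made for illustration.

```r
set.seed(11)
cv_data <- data.frame(x = runif(120, 0, 10))
cv_data$y <- 1 + 0.5 * cv_data$x + rnorm(120, sd = 1)

n_folds <- 5
# Randomly assign each row to one of n_folds equally sized folds
fold_id <- sample(rep(1:n_folds, length.out = nrow(cv_data)))

fold_rmse <- sapply(1:n_folds, function(k) {
  train <- cv_data[fold_id != k, ]
  test  <- cv_data[fold_id == k, ]
  fit   <- lm(y ~ x, data = train)       # fit on the other N-1 folds
  pred  <- predict(fit, newdata = test)  # score on the held-out fold
  sqrt(mean((test$y - pred)^2))
})

cv_rmse <- mean(fold_rmse)  # average held-out error across folds
print(cv_rmse)
```

Every observation is used for testing exactly once and for training N-1 times, which makes better use of a small dataset than a single split.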
6.1.3 Diagnostics
Other Diagnostic Considerations
The model might be improved by including
additional input variables
However, the adjusted R² applies a penalty as the number
of parameters increases
Residual plots should be examined for outliers
Points markedly different from the majority of points
They result from bad data, data processing errors, or actual
rare occurrences
Finally, the magnitude and signs of the estimated
parameters should be examined to see if they make
sense
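The adjusted R² penalty mentioned above can be illustrated by adding an input that is unrelated to the outcome. This is a sketch on simulated data; the names x1 and noise_var are hypothetical.

```r
set.seed(5)
n <- 100
x1 <- rnorm(n)
noise_var <- rnorm(n)            # unrelated to y by construction
y <- 2 + 1.5 * x1 + rnorm(n)

fit_base  <- summary(lm(y ~ x1))
fit_extra <- summary(lm(y ~ x1 + noise_var))

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1), with p input variables
p <- 2
adj_by_hand <- 1 - (1 - fit_extra$r.squared) * (n - 1) / (n - p - 1)

print(c(r2_base = fit_base$r.squared, r2_extra = fit_extra$r.squared,
        adj_base = fit_base$adj.r.squared, adj_extra = fit_extra$adj.r.squared))
```

Plain R² never decreases when an input is added, which is why the adjusted value is the one to compare across models of different size.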
6.2.3 Diagnostics
Deviance and the Pseudo-R²
In logistic regression, deviance = -2 log L
where L is the maximized value of the likelihood function
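The deviance can be read off a fitted glm object. The sketch below assumes simulated 0/1 outcomes and uses the common definition pseudo-R² = 1 - (deviance / null deviance), which this excerpt does not spell out.

```r
set.seed(21)
n <- 400
x <- rnorm(n)
# Assumed true model: log-odds of the event = -0.5 + 1.2 * x
event <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))

logit_fit <- glm(event ~ x, family = binomial(link = "logit"))

null_dev  <- logit_fit$null.deviance  # deviance of the intercept-only model
resid_dev <- logit_fit$deviance       # deviance of the fitted model
pseudo_r2 <- 1 - resid_dev / null_dev

# For 0/1 outcomes, the fitted deviance equals -2 times the log-likelihood
minus2_loglik <- -2 * as.numeric(logLik(logit_fit))
print(c(deviance = resid_dev, minus2_loglik = minus2_loglik, pseudo_r2 = pseudo_r2))
```

A pseudo-R² near 0 means the inputs add little over the intercept-only model, and values closer to 1 indicate a better fit.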
6.2.3 Diagnostics
Receiver Operating Characteristic (ROC)
Curve
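An ROC curve plots the true positive rate against the false positive rate as the classification threshold varies. The sketch below computes it by hand in base R on simulated data; all names and values are assumptions, and a dedicated package could be used instead.

```r
set.seed(9)
n <- 300
x <- rnorm(n)
actual <- rbinom(n, 1, plogis(1.5 * x))
score <- fitted(glm(actual ~ x, family = binomial))  # predicted probabilities

# Sweep the classification threshold from high to low
thresholds <- c(Inf, sort(unique(score), decreasing = TRUE))
tpr <- sapply(thresholds, function(t) mean(score[actual == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(score[actual == 0] >= t))

# Area under the curve by the trapezoid rule
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

plot(fpr, tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)  # reference line for a random classifier
print(auc)
```

A curve that hugs the upper-left corner, with an area close to 1, indicates a classifier that separates the two classes well.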
6.2.3 Diagnostics
Histogram of the Probabilities
input variables
If this is not true, the models perform poorly
In linear regression, the further assumption of normally