
Advanced Analytical Theory and Methods: Regression
Chapter Sections
 6.1 Linear Regression
 6.2 Logistic Regression
 6.3 Reasons to Choose and Cautions
 6.4 Additional Regression Models
 Summary
6 Regression
 Regression analysis attempts to explain the influence that input (independent) variables have on the outcome (dependent) variable
 Questions regression might answer
 What is a person’s expected income?
 What is the probability an applicant will default on a loan?
 Regression can find the input variables having the greatest statistical influence on the outcome
 Then, one can try to produce better values of those input variables
 E.g., if reading level at age 10 predicts students’ later success, then try to improve early-age reading
6.1 Linear Regression
 Models the relationship between several input variables and a continuous outcome variable
 Assumption is that the relationship is linear
 Various transformations can be used to achieve a linear relationship
 Linear regression models are probabilistic
 Involves randomness and uncertainty
 Not deterministic like Ohm’s Law (V = IR)
6.1.1 Use Cases
 Real estate example
 Predict residential home prices
 Possible inputs – living area, #bathrooms, #bedrooms, lot size, property taxes
 Demand forecasting example
 Restaurant predicts quantity of food needed
 Possible inputs – weather, day of week, etc.
 Medical example
 Analyze effect of proposed radiation treatment
 Possible inputs – radiation treatment duration, frequency
6.1.2 Model Description
Example
 Predict a person’s annual income as a function of age and education
 Ordinary Least Squares (OLS) is a common technique to estimate the parameters
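For concreteness, a minimal sketch of the model OLS estimates in this example (standard linear-model form; the notation below is ours, not taken from the slides):

Income = b0 + b1(Age) + b2(Education) + error

OLS chooses the estimates of b0, b1, and b2 that minimize the sum of squared residuals, Σ(observed income – predicted income)², over the training data.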
With Normally Distributed
Errors
 Making additional assumptions on the
error term provides further
capabilities
 It is common to assume the error

term is a normally distributed random


variable
 Mean zero and constant variance
 That is
6.1.2 Model Description
With Normally Distributed Errors
 With this assumption, the expected value of the outcome is the regression line itself: E(y) = b0 + b1x1 + … + bnxn
 And the variance is V(y) = σ²
6.1.2 Model Description
With Normally Distributed Errors
 Normality assumption with one input variable
 E.g., for x = 8, E(y) ≈ 20, but the observed y values vary roughly from 15 to 25
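As a small illustrative simulation of this idea in R (the mean and standard deviation below are assumptions chosen to mirror the slide’s example, not values from the text):

> set.seed(1)
> expected_y <- 20                         # E(y) at x = 8, per the example
> sigma <- 2.5                             # illustrative constant error standard deviation
> rnorm(5, mean = expected_y, sd = sigma)  # simulated observations scatter around 20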


6.1.2 Model Description
Example in R
Be sure to get the publisher's R downloads:
http://www.wiley.com/WileyCDA/WileyTitle/productCd-111887613X.html

> income_input = as.data.frame(read.csv("c:/data/income.csv"))   # load the income dataset
> income_input[1:10,]                                            # view the first 10 records
> summary(income_input)                                          # summary statistics
> library(lattice)
> splom(~income_input[c(2:5)], groups=NULL, data=income_input,
        axis.line.tck=0, axis.text.alpha=0)                      # scatterplot matrix of columns 2 through 5
6.1.2 Model Description
Example in R
 Scatterplot matrix
 Examine the bottom row:
 income~age: strong positive trend
 income~educ: slight positive trend
 income~gender: no trend
6.1.2 Model Description
Example in R
 Quantify the linear relationship trends
> results <- lm(Income~Age+Education+Gender, income_input)
> summary(results)
 Intercept: income of $7263 for a newborn female
 Age coefficient: ~1, so each additional year of age adds ~$1k of income
 Education coefficient: ~1.76, so each additional year of education adds ~$1.76k of income
 Gender coefficient: ~ -0.93, so a male’s income is about $930 lower
 Residuals – assumed to be normally distributed
6.1.2 Model Description
Example in R
 Examine residuals – uncertainty or sampling error
 Small p-values indicate statistically significant results
 Age and Education highly significant, p < 2e-16
 Gender p = 0.13 is large, so Gender is not significant at the 90% confidence level
 Therefore, drop the variable Gender from the linear model
> results2 <- lm(Income~Age+Education, income_input)
> summary(results2)   # results about the same as before
 Residual standard error: the estimate of the standard deviation of the error term
6.1.2 Model Description
Categorical Variables
 In the example in R, Gender is a binary variable
 Variables like Gender are categorical variables, in contrast to numeric variables where numeric differences are meaningful
 The book section discusses how income by state could be implemented (a sketch of one coding approach follows)
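A minimal sketch of how a categorical variable such as state could be handled in R; the small data frame and its values below are hypothetical, used only to show the indicator (dummy) coding that lm() applies to factors:

> df <- data.frame(Income = c(50, 62, 48, 55),
                   Age    = c(30, 45, 28, 36),
                   State  = factor(c("CA", "NY", "CA", "NY")))   # hypothetical data
> model.matrix(~ State, data = df)             # shows the 0/1 indicator coding for State
> fit <- lm(Income ~ Age + State, data = df)   # lm() expands the factor into indicators automatically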
6.1.2 Model Description
Confidence Intervals on the Parameters
 Once an acceptable linear model is developed, it is often useful to draw some inferences
 R provides confidence intervals using the confint() function
> confint(results2, level=.95)
 For example, the Education coefficient was 1.76, and the corresponding 95% confidence interval is (1.53, 1.99)
6.1.2 Model Description
Confidence Interval on the Expected Outcome
 In the income example, the regression line provides the expected income for a given Age and Education
 Using the predict() function in R, a confidence interval on the expected outcome can be obtained
> Age <- 41
> Education <- 12
> new_pt <- data.frame(Age, Education)
> conf_int_pt <- predict(results2, new_pt, level=.95,
                         interval="confidence")
> conf_int_pt
 Expected income = $68699, with a corresponding 95% confidence interval
6.1.2 Model Description
Prediction Interval on a Particular Outcome
 The predict() function in R also provides upper/lower bounds on a particular outcome, called prediction intervals
> pred_int_pt <- predict(results2, new_pt, level=.95,
                         interval="prediction")
> pred_int_pt
 Expected income = $68699, prediction interval ($44988, $92409)
 This is a much wider interval because the confidence interval applies to the expected outcome that falls on the regression line, while the prediction interval applies to an outcome that may appear anywhere within the distribution of possible outcomes
6.1.3 Diagnostics
Evaluating the Linearity Assumption
 A major assumption in linear regression modeling is that the relationship between the input and outcome variables is linear
 The most fundamental way to evaluate this is to plot the outcome variable against each input variable
 In the following figure, a linear model would not apply
 In such cases, a transformation might allow a linear model to apply
6.1.3 Diagnostics
Evaluating the Linearity Assumption
 Income as a quadratic function of Age
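A minimal sketch of applying such a transformation in R, assuming the income_input data frame from the earlier example (the quadratic term is illustrative, matching the Age example above):

> results_quad <- lm(Income ~ Age + I(Age^2), data=income_input)   # add a squared-Age term
> summary(results_quad)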

6.1.3 Diagnostics
Evaluating the Residuals
 The error terms were assumed to be normally distributed with zero mean and constant variance
> with(results2, {
      plot(fitted.values, residuals, ylim=c(-40,40))   # plot residuals against fitted values
  })

6.1.3 Diagnostics
Evaluating the Residuals
 The next four figures don’t fit the zero mean, constant variance assumption
 Figure: nonlinear trend in residuals
 Figure: residuals not centered on zero

6.1.3 Diagnostics
Evaluating the Residuals
 Figure: residuals not centered on zero
 Figure: variance not constant
6.1.3 Diagnostics
Evaluating the Normality Assumption
 The normality assumption still has to be validated
> hist(results2$residuals)   # histogram of the residuals
 Residuals are centered on zero and appear normally distributed
6.1.3 Diagnostics
Evaluating the Normality Assumption
 Another option is to examine a Q-Q plot, comparing the observed data against the quantiles (Q) of the assumed distribution
> qqnorm(results2$residuals)   # Q-Q plot of the residuals
> qqline(results2$residuals)   # reference line for normally distributed data
6.1.3 Diagnostics
Evaluating the Normality Assumption
 Figure: normally distributed residuals
 Figure: non-normally distributed residuals

6.1.3 Diagnostics
N-Fold Cross-Validation
 To prevent overfitting, a common practice splits the dataset into training and test sets, develops the model on the training set, and evaluates it on the test set
 If the quantity of data is insufficient for this, an N-fold cross-validation technique can be used (see the sketch below)
 Dataset is randomly split into N datasets of equal size
 Model is trained on N-1 of the sets and tested on the remaining one
 Process is repeated N times
 Average the N model errors over the N folds
 Note: if N = size of the dataset, this is the leave-one-out procedure
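A minimal sketch of N-fold cross-validation in base R, assuming the income_input data frame and the Income ~ Age + Education model from earlier (the number of folds and the use of mean squared error are illustrative choices):

> N <- 5                                      # number of folds (illustrative)
> n <- nrow(income_input)
> folds <- sample(rep(1:N, length.out = n))   # randomly assign each record to a fold
> errors <- numeric(N)
> for (k in 1:N) {
+     train <- income_input[folds != k, ]     # train on N-1 folds
+     test  <- income_input[folds == k, ]     # test on the remaining fold
+     fit   <- lm(Income ~ Age + Education, data = train)
+     errors[k] <- mean((test$Income - predict(fit, newdata = test))^2)
+ }
> mean(errors)                                # average model error over the N folds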

6.1.3 Diagnostics
Other Diagnostic Considerations
 The model might be improved by including additional input variables
 However, the adjusted R² applies a penalty as the number of parameters increases
 Residual plots should be examined for outliers
 Points markedly different from the majority of points
 They result from bad data, data processing errors, or actual rare occurrences
 Finally, the magnitude and signs of the estimated parameters should be examined to see if they make sense

6.2 Logistic Regression
Introduction
 In linear regression modeling, the outcome variable is continuous – e.g., income ~ age and education
 In logistic regression, the outcome variable is categorical; this chapter focuses on two-valued outcomes like true/false, pass/fail, or yes/no

6.2.1 Logistic Regression
Use Cases
 Medical
 Probability of a patient’s successful response to a specific medical treatment – inputs could include age, weight, etc.
 Finance
 Probability an applicant defaults on a loan
 Marketing
 Probability a wireless customer switches carriers (churns)
 Engineering
 Probability a mechanical part malfunctions or fails

6.2.2 Logistic Regression
Model Description
 Logistic regression is based on the logistic function f(y) = e^y / (1 + e^y)
 As y -> infinity, f(y) -> 1; and as y -> -infinity, f(y) -> 0
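A small R sketch of this S-shaped behavior (purely illustrative; the plotting range is our choice):

> y <- seq(-6, 6, by = 0.1)
> f <- exp(y) / (1 + exp(y))    # logistic function
> plot(y, f, type = "l")        # rises from near 0 to near 1 as y increases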

6.2.2 Logistic Regression
Model Description
 With the range of f(y) as (0,1), the logistic function models the probability of an outcome occurring
 In contrast to linear regression, the values of y are not directly observed; only the values of f(y) in terms of success or failure are observed
 Writing p = f(y), the quantity ln(p/(1-p)) = y is called the log odds ratio, or logit of p
 Maximum Likelihood Estimation (MLE) is used to estimate the model parameters; MLE is beyond the scope of this book
6.2.2 Logistic Regression
Model Description: customer churn example
 A wireless telecom company estimates the probability of a customer churning (switching companies)
 Variables collected for each customer: age (years), married (y/n), duration as a customer (years), churned contacts (count), churned (true/false)
 After analyzing the data and fitting a logistic regression model, Age and Churned_contacts were selected as the best predictor variables
6.2.3 Diagnostics
Model Description: customer churn example
> head(churn_input)            # Churned = 1 if customer churned
> sum(churn_input$Churned)     # 1743/8000 churned
 Use the Generalized Linear Model function glm()
> Churn_logistic1 <- glm(Churned~Age+Married+Cust_years+Churned_contacts,
      data=churn_input, family=binomial(link="logit"))
> summary(Churn_logistic1)     # Age + Churned_contacts best
> Churn_logistic3 <- glm(Churned~Age+Churned_contacts,
      data=churn_input, family=binomial(link="logit"))
> summary(Churn_logistic3)     # Age + Churned_contacts

6.2.3 Diagnostics
Deviance and the Pseudo-R2
 In logistic regression, deviance = -2 log L, where L is the maximized value of the likelihood function used to obtain the parameter estimates
 Two deviance values are provided
 Null deviance = deviance based on only the y-intercept term
 Residual deviance = deviance based on all parameters
 Pseudo-R2 = 1 - (residual deviance / null deviance) measures how well the fitted model explains the data
 A value near 1 indicates a good fit over the null model
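A one-line sketch of computing this quantity in R from the fitted model above (continuing Churn_logistic3; the deviance and null.deviance components are standard glm output):

> 1 - Churn_logistic3$deviance / Churn_logistic3$null.deviance   # pseudo-R2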

6.2.3 Diagnostics
Receiver Operating Characteristic (ROC) Curve
 Logistic regression is often used to classify
 In the Churn example, a customer can be classified as Churn if the model predicts a high probability of churning
 Although 0.5 is often used as the probability threshold, other values can be used based on the desired error tradeoff
 For two classes, C and nC, we have (see the sketch below)
 True Positive: predict C, when actually C
 True Negative: predict nC, when actually nC
 False Positive: predict C, when actually nC
 False Negative: predict nC, when actually C
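A minimal sketch of classifying with a 0.5 threshold and tabulating these four outcomes in R (continuing the churn objects above; the threshold and table labels are illustrative):

> prob <- predict(Churn_logistic3, type="response")              # estimated churn probabilities
> pred_class <- ifelse(prob >= 0.5, 1, 0)                        # classify as Churn when prob >= 0.5
> table(predicted = pred_class, actual = churn_input$Churned)    # counts of TP, TN, FP, FN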

6.2.3 Diagnostics
Receiver Operating Characteristic (ROC) Curve
 The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR)

6.2.3 Diagnostics
Receiver Operating Characteristic (ROC) Curve
> library(ROCR)
> Pred = predict(Churn_logistic3, type="response")   # predicted churn probabilities
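A minimal sketch of how the ROC curve and the area under it could be obtained with ROCR from these predictions (continuing the objects above; the exact calls are our assumption about how the plot was produced):

> pred_obj <- prediction(Pred, churn_input$Churned)               # predictions vs. actual labels
> perf <- performance(pred_obj, measure="tpr", x.measure="fpr")   # TPR and FPR at all thresholds
> plot(perf)                                                      # the ROC curve
> performance(pred_obj, measure="auc")@y.values                   # area under the ROC curve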


6.2.3 Diagnostics
Histogram of the Probabilities
 It is interesting to visualize the counts of the customers who churned and who didn’t churn against the estimated churn probability

6.3 Reasons to Choose and Cautions
 Linear regression – outcome variable is continuous
 Logistic regression – outcome variable is categorical
 Both models assume a linear additive function of the input variables
 If this is not true, the models perform poorly
 In linear regression, the further assumption of normally distributed error terms is important for many statistical inferences
 Although a set of input variables may be a good predictor of an output variable, “correlation does not imply causation”

6.4 Additional Regression Models
 Multicollinearity is the condition when several input variables are highly correlated
 This can lead to inappropriately large coefficients
 To mitigate this problem (see the sketch below)
 Ridge regression applies a penalty based on the size of the coefficients
 Lasso regression applies a penalty proportional to the sum of the absolute values of the coefficients
 Multinomial logistic regression – used for a more-than-two-state categorical outcome variable
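A minimal sketch of how ridge and lasso fits might look in R using the glmnet package (glmnet is not used elsewhere in this chapter, and the income_input columns are assumed from the earlier example):

> library(glmnet)
> x <- as.matrix(income_input[, c("Age", "Education")])   # input matrix
> y <- income_input$Income                                 # outcome
> ridge_fit <- glmnet(x, y, alpha = 0)   # alpha = 0: ridge penalty on squared coefficients
> lasso_fit <- glmnet(x, y, alpha = 1)   # alpha = 1: lasso penalty on absolute coefficients
> coef(ridge_fit, s = 0.1)               # coefficients at an illustrative penalty weight s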
