
Advanced Analytical Theory and Methods: Regression
Chapter Sections
 6.1 Linear Regression
 6.2 Logistic Regression
 6.3 Reasons to Choose and Cautions
 6.4 Additional Regression Models
 Summary
6 Regression
 Regression analysis attempts to explain the influence that input (independent) variables have on the outcome (dependent) variable
 Questions regression might answer
 What is a person’s expected income?
 What is the probability an applicant will default on a loan?
 Regression can find the input variables having the greatest statistical influence on the outcome
 Then, one can try to produce better values of those input variables
 E.g., if reading level at age 10 predicts students’ later success, then try to improve early-age reading
6.1 Linear Regression
 Models the relationship between several input variables and a continuous outcome variable
 Assumption is that the relationship is linear
 Various transformations can be used to achieve a linear relationship
 Linear regression models are probabilistic
 Involves randomness and uncertainty
 Not deterministic like Ohm’s Law (V = IR)
6.1.1 Use Cases
 Real estate example
 Predict residential home prices
 Possible inputs – living area, #bathrooms, #bedrooms, lot size, property taxes
 Demand forecasting example
 Restaurant predicts quantity of food needed
 Possible inputs – weather, day of week, etc.
 Medical example
 Analyze effect of proposed radiation treatment
 Possible inputs – radiation treatment duration, frequency
6.1.2 Model Description
Example
 Predict a person’s annual income as a function of age and education
 Ordinary Least Squares (OLS) is a common technique to estimate the parameters
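For concreteness, a minimal sketch of the model OLS estimates in this example (standard linear-model form; the notation below is ours, not taken from the slides):

Income = b0 + b1(Age) + b2(Education) + error

OLS chooses the estimates of b0, b1, and b2 that minimize the sum of squared residuals, Σ(observed income – predicted income)², over the training data.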
With Normally Distributed
Errors
 Making additional assumptions on the
error term provides further
capabilities
 It is common to assume the error

term is a normally distributed random


variable
 Mean zero and constant variance
 That is
6.1.2 Model Description
With Normally Distributed Errors
 With this assumption, the expected value of the outcome is the regression line itself: E(y) = b0 + b1x1 + … + bnxn
 And the variance is V(y) = σ²
6.1.2 Model Description
With Normally Distributed Errors
 Normality assumption with one input variable
 E.g., for x = 8, E(y) ≈ 20, but the observed y values vary roughly from 15 to 25
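As a small illustrative simulation of this idea in R (the mean and standard deviation below are assumptions chosen to mirror the slide’s example, not values from the text):

> set.seed(1)
> expected_y <- 20                         # E(y) at x = 8, per the example
> sigma <- 2.5                             # illustrative constant error standard deviation
> rnorm(5, mean = expected_y, sd = sigma)  # simulated observations scatter around 20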


6.1.2 Model Description
Example in R
Be sure to get the publisher's R downloads:
http://www.wiley.com/WileyCDA/WileyTitle/productCd-111887613X.html

> income_input = as.data.frame(read.csv("c:/data/income.csv"))   # load the income dataset
> income_input[1:10,]                                            # view the first 10 records
> summary(income_input)                                          # summary statistics
> library(lattice)
> splom(~income_input[c(2:5)], groups=NULL, data=income_input,
        axis.line.tck=0, axis.text.alpha=0)                      # scatterplot matrix of columns 2 through 5
6.1.2 Model Description
Example in R
 Scatterplot matrix
 Examine the bottom row:
 income~age: strong positive trend
 income~educ: slight positive trend
 income~gender: no trend
6.1.2 Model Description
Example in R
 Quantify the linear relationship trends
> results <- lm(Income~Age+Education+Gender, income_input)
> summary(results)
 Intercept: income of $7263 for a newborn female
 Age coefficient: ~1, so each additional year of age adds ~$1k of income
 Education coefficient: ~1.76, so each additional year of education adds ~$1.76k of income
 Gender coefficient: ~ -0.93, so a male’s income is about $930 lower
 Residuals – assumed to be normally distributed
6.1.2 Model Description
Example in R
 Examine residuals – uncertainty or sampling error
 Small p-values indicate statistically significant results
 Age and Education highly significant, p < 2e-16
 Gender p = 0.13 is large, so Gender is not significant at the 90% confidence level
 Therefore, drop the variable Gender from the linear model
> results2 <- lm(Income~Age+Education, income_input)
> summary(results2)   # results about the same as before
 Residual standard error: the estimate of the standard deviation of the error term
6.1.2 Model Description
Categorical Variables
 In the example in R, Gender is a binary variable
 Variables like Gender are categorical variables, in contrast to numeric variables where numeric differences are meaningful
 The book section discusses how income by state could be implemented (a sketch of one coding approach follows)
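A minimal sketch of how a categorical variable such as state could be handled in R; the small data frame and its values below are hypothetical, used only to show the indicator (dummy) coding that lm() applies to factors:

> df <- data.frame(Income = c(50, 62, 48, 55),
                   Age    = c(30, 45, 28, 36),
                   State  = factor(c("CA", "NY", "CA", "NY")))   # hypothetical data
> model.matrix(~ State, data = df)             # shows the 0/1 indicator coding for State
> fit <- lm(Income ~ Age + State, data = df)   # lm() expands the factor into indicators automatically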
6.1.2 Model Description
Confidence Intervals on the Parameters
 Once an acceptable linear model is developed, it is often useful to draw some inferences
 R provides confidence intervals using the confint() function
> confint(results2, level=.95)
 For example, the Education coefficient was 1.76, and the corresponding 95% confidence interval is (1.53, 1.99)
6.1.2 Model Description
Confidence Interval on the Expected Outcome
 In the income example, the regression line provides the expected income for a given Age and Education
 Using the predict() function in R, a confidence interval on the expected outcome can be obtained
> Age <- 41
> Education <- 12
> new_pt <- data.frame(Age, Education)
> conf_int_pt <- predict(results2, new_pt, level=.95,
                         interval="confidence")
> conf_int_pt
 Expected income = $68699, with a corresponding 95% confidence interval
6.1.2 Model Description
Prediction Interval on a Particular Outcome
 The predict() function in R also provides upper/lower bounds on a particular outcome, called prediction intervals
> pred_int_pt <- predict(results2, new_pt, level=.95,
                         interval="prediction")
> pred_int_pt
 Expected income = $68699, prediction interval ($44988, $92409)
 This is a much wider interval because the confidence interval applies to the expected outcome that falls on the regression line, while the prediction interval applies to an outcome that may appear anywhere within the distribution of possible outcomes
6.1.3 Diagnostics
Evaluating the Linearity Assumption
 A major assumption in linear regression modeling is that the relationship between the input and outcome variables is linear
 The most fundamental way to evaluate this is to plot the outcome variable against each input variable
 In the following figure, a linear model would not apply
 In such cases, a transformation might allow a linear model to apply
6.1.3 Diagnostics
Evaluating the Linearity Assumption
 Income as a quadratic function of Age
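A minimal sketch of applying such a transformation in R, assuming the income_input data frame from the earlier example (the quadratic term is illustrative, matching the Age example above):

> results_quad <- lm(Income ~ Age + I(Age^2), data=income_input)   # add a squared-Age term
> summary(results_quad)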

6.1.3 Diagnostics
Evaluating the Residuals
 The error terms were assumed to be normally distributed with zero mean and constant variance
> with(results2, {
      plot(fitted.values, residuals, ylim=c(-40,40))   # plot residuals against fitted values
  })

6.1.3 Diagnostics
Evaluating the Residuals
 The next four figures don’t fit the zero mean, constant variance assumption
 Figure: nonlinear trend in residuals
 Figure: residuals not centered on zero

6.1.3 Diagnostics
Evaluating the Residuals
 Figure: residuals not centered on zero
 Figure: variance not constant
6.1.3 Diagnostics
Evaluating the Normality Assumption
 The normality assumption still has to be validated
> hist(results2$residuals)   # histogram of the residuals
 Residuals are centered on zero and appear normally distributed
6.1.3 Diagnostics
Evaluating the Normality Assumption
 Another option is to examine a Q-Q plot, comparing the observed data against the quantiles (Q) of the assumed distribution
> qqnorm(results2$residuals)   # Q-Q plot of the residuals
> qqline(results2$residuals)   # reference line for normally distributed data
6.1.3 Diagnostics
Evaluating the Normality Assumption
 Figure: normally distributed residuals
 Figure: non-normally distributed residuals

6.1.3 Diagnostics
N-Fold Cross-Validation
 To prevent overfitting, a common practice splits the dataset into training and test sets, develops the model on the training set, and evaluates it on the test set
 If the quantity of data is insufficient for this, an N-fold cross-validation technique can be used (see the sketch below)
 Dataset is randomly split into N datasets of equal size
 Model is trained on N-1 of the sets and tested on the remaining one
 Process is repeated N times
 Average the N model errors over the N folds
 Note: if N = size of the dataset, this is the leave-one-out procedure
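A minimal sketch of N-fold cross-validation in base R, assuming the income_input data frame and the Income ~ Age + Education model from earlier (the number of folds and the use of mean squared error are illustrative choices):

> N <- 5                                      # number of folds (illustrative)
> n <- nrow(income_input)
> folds <- sample(rep(1:N, length.out = n))   # randomly assign each record to a fold
> errors <- numeric(N)
> for (k in 1:N) {
+     train <- income_input[folds != k, ]     # train on N-1 folds
+     test  <- income_input[folds == k, ]     # test on the remaining fold
+     fit   <- lm(Income ~ Age + Education, data = train)
+     errors[k] <- mean((test$Income - predict(fit, newdata = test))^2)
+ }
> mean(errors)                                # average model error over the N folds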

6.1.3 Diagnostics
Other Diagnostic Considerations
 The model might be improved by including additional input variables
 However, the adjusted R² applies a penalty as the number of parameters increases
 Residual plots should be examined for outliers
 Points markedly different from the majority of points
 They result from bad data, data processing errors, or actual rare occurrences
 Finally, the magnitude and signs of the estimated parameters should be examined to see if they make sense

6.2 Logistic Regression
Introduction
 In linear regression modeling, the outcome variable is continuous – e.g., income ~ age and education
 In logistic regression, the outcome variable is categorical; this chapter focuses on two-valued outcomes like true/false, pass/fail, or yes/no

6.2.1 Logistic Regression
Use Cases
 Medical
 Probability of a patient’s successful response to a specific medical treatment – inputs could include age, weight, etc.
 Finance
 Probability an applicant defaults on a loan
 Marketing
 Probability a wireless customer switches carriers (churns)
 Engineering
 Probability a mechanical part malfunctions or fails

6.2.2 Logistic Regression
Model Description
 Logistic regression is based on the logistic function f(y) = e^y / (1 + e^y)
 As y -> infinity, f(y) -> 1; and as y -> -infinity, f(y) -> 0
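A small R sketch of this S-shaped behavior (purely illustrative; the plotting range is our choice):

> y <- seq(-6, 6, by = 0.1)
> f <- exp(y) / (1 + exp(y))    # logistic function
> plot(y, f, type = "l")        # rises from near 0 to near 1 as y increases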

6.2.2 Logistic Regression
Model Description
 With the range of f(y) as (0,1), the logistic function models the probability of an outcome occurring
 In contrast to linear regression, the values of y are not directly observed; only the values of f(y) in terms of success or failure are observed
 Writing p = f(y), the quantity ln(p/(1-p)) = y is called the log odds ratio, or logit of p
 Maximum Likelihood Estimation (MLE) is used to estimate the model parameters; MLE is beyond the scope of this book
6.2.2 Logistic Regression
Model Description: customer churn example
 A wireless telecom company estimates the probability of a customer churning (switching companies)
 Variables collected for each customer: age (years), married (y/n), duration as a customer (years), churned contacts (count), churned (true/false)
 After analyzing the data and fitting a logistic regression model, Age and Churned_contacts were selected as the best predictor variables
6.2.3 Diagnostics
Model Description: customer churn example
> head(churn_input)            # Churned = 1 if customer churned
> sum(churn_input$Churned)     # 1743/8000 churned
 Use the Generalized Linear Model function glm()
> Churn_logistic1 <- glm(Churned~Age+Married+Cust_years+Churned_contacts,
      data=churn_input, family=binomial(link="logit"))
> summary(Churn_logistic1)     # Age + Churned_contacts best
> Churn_logistic3 <- glm(Churned~Age+Churned_contacts,
      data=churn_input, family=binomial(link="logit"))
> summary(Churn_logistic3)     # Age + Churned_contacts

6.2.3 Diagnostics
Deviance and the Pseudo-R2
 In logistic regression, deviance = -2 log L, where L is the maximized value of the likelihood function used to obtain the parameter estimates
 Two deviance values are provided
 Null deviance = deviance based on only the y-intercept term
 Residual deviance = deviance based on all parameters
 Pseudo-R2 = 1 - (residual deviance / null deviance) measures how well the fitted model explains the data
 A value near 1 indicates a good fit over the null model
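A one-line sketch of computing this quantity in R from the fitted model above (continuing Churn_logistic3; the deviance and null.deviance components are standard glm output):

> 1 - Churn_logistic3$deviance / Churn_logistic3$null.deviance   # pseudo-R2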

6.2.3 Diagnostics
Receiver Operating Characteristic (ROC) Curve
 Logistic regression is often used to classify
 In the Churn example, a customer can be classified as Churn if the model predicts a high probability of churning
 Although 0.5 is often used as the probability threshold, other values can be used based on the desired error tradeoff
 For two classes, C and nC, we have (see the sketch below)
 True Positive: predict C, when actually C
 True Negative: predict nC, when actually nC
 False Positive: predict C, when actually nC
 False Negative: predict nC, when actually C
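A minimal sketch of classifying with a 0.5 threshold and tabulating these four outcomes in R (continuing the churn objects above; the threshold and table labels are illustrative):

> prob <- predict(Churn_logistic3, type="response")              # estimated churn probabilities
> pred_class <- ifelse(prob >= 0.5, 1, 0)                        # classify as Churn when prob >= 0.5
> table(predicted = pred_class, actual = churn_input$Churned)    # counts of TP, TN, FP, FN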

6.2.3 Diagnostics
Receiver Operating Characteristic (ROC) Curve
 The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR)

6.2.3 Diagnostics
Receiver Operating Characteristic (ROC) Curve
> library(ROCR)
> Pred = predict(Churn_logistic3, type="response")   # predicted churn probabilities
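A minimal sketch of how the ROC curve and the area under it could be obtained with ROCR from these predictions (continuing the objects above; the exact calls are our assumption about how the plot was produced):

> pred_obj <- prediction(Pred, churn_input$Churned)               # predictions vs. actual labels
> perf <- performance(pred_obj, measure="tpr", x.measure="fpr")   # TPR and FPR at all thresholds
> plot(perf)                                                      # the ROC curve
> performance(pred_obj, measure="auc")@y.values                   # area under the ROC curve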


6.2.3 Diagnostics
Histogram of the Probabilities
 It is interesting to visualize the counts of the customers who churned and who didn’t churn against the estimated churn probability

6.3 Reasons to Choose and Cautions
 Linear regression – outcome variable is continuous
 Logistic regression – outcome variable is categorical
 Both models assume a linear additive function of the input variables
 If this is not true, the models perform poorly
 In linear regression, the further assumption of normally distributed error terms is important for many statistical inferences
 Although a set of input variables may be a good predictor of an output variable, “correlation does not imply causation”

6.4 Additional Regression Models
 Multicollinearity is the condition when several input variables are highly correlated
 This can lead to inappropriately large coefficients
 To mitigate this problem (see the sketch below)
 Ridge regression applies a penalty based on the size of the coefficients
 Lasso regression applies a penalty proportional to the sum of the absolute values of the coefficients
 Multinomial logistic regression – used for a more-than-two-state categorical outcome variable
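A minimal sketch of how ridge and lasso fits might look in R using the glmnet package (glmnet is not used elsewhere in this chapter, and the income_input columns are assumed from the earlier example):

> library(glmnet)
> x <- as.matrix(income_input[, c("Age", "Education")])   # input matrix
> y <- income_input$Income                                 # outcome
> ridge_fit <- glmnet(x, y, alpha = 0)   # alpha = 0: ridge penalty on squared coefficients
> lasso_fit <- glmnet(x, y, alpha = 1)   # alpha = 1: lasso penalty on absolute coefficients
> coef(ridge_fit, s = 0.1)               # coefficients at an illustrative penalty weight s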
