LECTURE7 ExtraordinaryRegression 2012-13
Extraordinary Regression: Non-Normal, Non-Parametric & Non-Straight Relationships
Andy Marshall
Practical 7 (Introductory Lecture)
Spring 2013
[Figure: Statistics Forum Summary Stats 2009-10, by date, day and time]
One-to-one Help
Remaining Help Sessions
Non-normal Methods
Identifying a non-normal response variable:
1) Data distribution
- Count data with few values, low mean or low sample size
[Figure: histogram and kernel density of Distance from Low Tide (m)]
2) Data exploration
- Skew
- Non-straight
[Figure: normal Q-Q plot of Distance (m), sample vs theoretical quantiles]
Non-normal Methods
Identifying a non-normal response variable:
3) Residual plots show curvature or heteroscedasticity
[Figure: residuals vs fitted, normal Q-Q, scale-location and Cook's distance plots]
4) Tests (not essential)
Non-Parametric Regression
(i.e. no defined error distribution)
Non-parametric Regression
Kendall’s robust line method:
• “Robust”, i.e. few assumptions
• Slope (z) = median of all possible slopes (for every pair of points)
• Simple (!)
z[i,j] = (yj - yi) / (xj - xi)
[Figure: scatter of y against x with the fitted robust line]
Cleveland (2006)
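A minimal sketch of the slope definition above in R: the median of every pairwise slope. The function and data here are invented purely for illustration, not taken from the practical.

# Kendall's robust line: slope = median of all pairwise slopes z[i,j]
kendall_line <- function(x, y) {
  ij <- combn(seq_along(x), 2)                                # every pair of points (i, j)
  z  <- (y[ij[2, ]] - y[ij[1, ]]) / (x[ij[2, ]] - x[ij[1, ]]) # z[i,j] = (yj - yi)/(xj - xi)
  b  <- median(z, na.rm = TRUE)                               # robust slope
  a  <- median(y - b * x)                                     # robust intercept
  c(intercept = a, slope = b)
}
kendall_line(x = c(0, 1, 2, 3, 4), y = c(5, 7, 8, 12, 15))    # example call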
Non-parametric Regression
Kendall’s robust line method: H0: Slope (mu; μ) = 0 (i.e. W– = W+)
• Less influenced by outliers than OLS regression
• Wilcoxon Signed-Rank (One-sample Inference):
- Critical value W
[Figure: scatter of y against x]
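A minimal sketch of the test above: a one-sample Wilcoxon signed-rank test applied to the pairwise slopes to test H0: slope = 0. The x and y values are invented for illustration.

x  <- c(0, 1, 2, 3, 4); y <- c(5, 7, 8, 12, 15)
ij <- combn(seq_along(x), 2)
z  <- (y[ij[2, ]] - y[ij[1, ]]) / (x[ij[2, ]] - x[ij[1, ]])   # all pairwise slopes
wilcox.test(z, mu = 0, exact = FALSE)   # W+ vs W-; small p-value rejects a zero slope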
Non-parametric Regression
Kendall’s robust line method: H0: Slope (mu; μ) = 0 (i.e. W– = W+)
• Cleveland (2006) gives other non-parametric alternatives
• 3 drawbacks:
- No measure of proportion of variance
- One predictor variable only
[Figure: scatter of y against x]
Cleveland (2006)
Non-Normal Parametric Regression
(These models are still parametric, i.e. response variable has defined distribution)
• ≥1 predictors
• Normal errors
• “Logistic regression”
• Probability distribution of two alternative outcomes
• E.g. presence/absence, 0/1, categorical data
• Models the probability of getting result 1
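A minimal sketch of a logistic (binomial) GLM in R; the variables presence and distance are hypothetical and simulated only to show the call.

set.seed(1)
distance <- runif(50, 0, 100)
presence <- rbinom(50, 1, plogis(2 - 0.04 * distance))   # simulated 0/1 response
m <- glm(presence ~ distance, family = binomial)
summary(m)                            # coefficients on the logit (log-odds) scale
head(predict(m, type = "response"))   # fitted probabilities of getting result 1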
• “Poisson regression”
• Example in practical and Crawley (2007)
E.g. Poisson regression model: log(y) = β0 + β1x1 + β2x2 + β3x3 + ...
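A minimal sketch of fitting this model in R with simulated data (the practical's real example, Cancers against Distance from Crawley, follows below).

set.seed(2)
x1 <- runif(100, 0, 10)
y  <- rpois(100, lambda = exp(0.5 + 0.2 * x1))   # counts with a log-linear mean
m  <- glm(y ~ x1, family = poisson)
summary(m)                                       # beta0, beta1 reported on the log scale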
> yv <- predict(model, list(Distance=xv))
> plot(Distance,Cancers)
> lines(xv,exp(yv))
[Figure: Cancers against Distance with the fitted Poisson curve]
IMPORTANT
One more step is needed if the data continue to defy the distribution:
Overdispersion…
(mean = variance → dispersion = 1)
Quasi-likelihood in GLMs
Overdispersion
• Used where there is greater variability than expected
• Rule of thumb: overdispersion is concerning if dispersion > 1.5
(Residual Deviance > 1.5 × residual degrees of freedom)
(often shows a funnel shape in residual diagnostic plots)
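The rule of thumb above can be checked directly from a fitted model; a minimal sketch with deliberately overdispersed simulated counts.

set.seed(3)
x <- runif(100)
y <- rnbinom(100, mu = exp(1 + x), size = 1)   # more variable than a Poisson
m <- glm(y ~ x, family = poisson)
deviance(m) / df.residual(m)                   # dispersion; > 1.5 is concerning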
Quasi-likelihood in GLMs
Example of overdispersion:
• Crawley (2005) clusters.txt
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.186865   0.188728   0.990   0.3221
Distance    -0.006138   0.003667  -1.674   0.0941 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Quasi-likelihood in GLMs
Overdispersion
• Deviance is scaled by overdispersion coefficient (D/df)
• Poisson → quasipoisson
• Problems:
- Generally reduces power of the test
- Can’t use automated stepwise reduction…
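A minimal sketch of the quasipoisson refit and the F-based analysis of deviance it requires (data simulated for illustration, as above).

set.seed(3)
x <- runif(100)
y <- rnbinom(100, mu = exp(1 + x), size = 1)
mq <- glm(y ~ x, family = quasipoisson)
summary(mq)                                                 # std. errors scaled by D/df
anova(glm(y ~ 1, family = quasipoisson), mq, test = "F")    # compare nested models by F test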
Quasi-likelihood in GLMs
Alternatives to quasi-likelihood:
• Remove more intercorrelated predictors
• Poisson (p372 Q&K):
- Adjust parameters: √(χ²/ν)
- Negative binomial GLM (see prac)
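A minimal sketch of the negative binomial alternative using glm.nb() from the MASS package; the data are simulated for illustration (see the practical for the real example).

library(MASS)
set.seed(4)
x <- runif(100)
y <- rnbinom(100, mu = exp(1 + x), size = 1)
mnb <- glm.nb(y ~ x)
summary(mnb)      # theta models the extra-Poisson variation directly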
(1) Diagnostic Plots
[Figure: residual diagnostic plots (residuals vs fitted, normal Q-Q, standardised deviance residuals, Cook's distance) for Gaussian and Poisson/binomial models; Marshall et al. (2010)]
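A minimal sketch of producing these diagnostic plots for any fitted model in R (simulated Poisson data for illustration).

set.seed(5)
x <- runif(80)
y <- rpois(80, exp(1 + x))
m <- glm(y ~ x, family = poisson)
par(mfrow = c(2, 2))
plot(m)   # the standard residual diagnostic plots for the fitted model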
Non-straight Models
Clues for non-linearity
• Data exploration:
- Curve
- Hump
- Complex relationship
[Figures: scatterplots of carbon against prop90_ba and ht_dbh against elevation]
y = a + bx + cx²
y = a + bx + cx² + dx³
[Figure: example quadratic and cubic curves plotted against x]
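A minimal sketch of fitting the quadratic and cubic equations above by ordinary least squares (data simulated for illustration).

set.seed(6)
x <- seq(-4, 4, length = 50)
y <- 2 + 0.5 * x + 1.5 * x^2 + rnorm(50, sd = 2)
quad  <- lm(y ~ x + I(x^2))            # y = a + bx + cx^2
cubic <- lm(y ~ x + I(x^2) + I(x^3))   # y = a + bx + cx^2 + dx^3
anova(quad, cubic)                     # does the cubic term improve the fit?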
Non-linear Models
Generalised Additive Models (GAMs):
• Wiggly relationships
Non-linear Models
GAMs:
• Uses deviance and AIC, as for GLM, then use analysis of deviance: anova(simple,complex,test="F")
• Like a GLM, an error probability distribution is specified:
gam(y ~ s(x1) + s(x2) + s(x3), family = xxx)
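A minimal sketch of this workflow; the mgcv package is used here for illustration, whereas the practical uses package gam with fixed df inside s(). Data are simulated.

library(mgcv)
set.seed(7)
x1 <- runif(100); x2 <- runif(100)
y  <- sin(3 * x1) + rnorm(100, sd = 0.3)
simple  <- gam(y ~ s(x1))              # default Gaussian family
complex <- gam(y ~ s(x1) + s(x2))
anova(simple, complex, test = "F")     # analysis of deviance between nested GAMs
AIC(simple, complex)                   # parsimony check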
Non-linear Models
Over-fitting in GAMs:
• A GAM can have perfect fit
• BUT fit ≠ explanatory power (we want to represent the “parent population”)
• GAMs can overfit, so need to adjust effective degrees of freedom
• Use parsimony to decide between models (e.g. quadratic = 3 df)
[Figure: smoothers for elevation with ~8.65 vs 3.84 effective degrees of freedom]
[Figures: Marshall et al. (in review)]
What Next?
Some extensions to methods covered (all possible in R):
• Weighted GLM (under-dispersion; e.g. Ridout & Besbeas 2004)
Take Home Messages
i. There are several sound methods for distributions other than normal
ii. Diagnostic checks (plots and tests) are vital but not included in some statistical software!
iii. Don’t be afraid to try non-linear methods, but beware of over-fitting
Homework
i. Reading as shown on slides
ii. Assignment data analysis
iii. Complete all practical exercises
iv. Add requests for the two tutorials onto Stats Forum
• Diagnostics: plot(modelname)
[Figure: residual diagnostic plots with outlying points IT3265, BW2832 and LU3988 labelled]
• Interactions:
glm(y ~ x1*x2*x3, family=xxx)
glm(y ~ x1 + x2 + x3 + x1:x2:x3, family=xxx)
Multiple GLM
Model Reduction: (same as MLR)
• Before running first model, deal with intercorrelations:
Multiple GLM
Model Reduction: (same as MLR)
• Next run full model and remove any non-significant interactions (3-way > 2-way …)
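A minimal sketch of removing a non-significant three-way interaction and testing the reduction; variable names and data are invented for illustration.

set.seed(8)
d <- data.frame(x1 = runif(60), x2 = runif(60), x3 = runif(60))
d$y <- rpois(60, exp(0.5 + d$x1 + d$x2))
full    <- glm(y ~ x1 * x2 * x3, data = d, family = poisson)
reduced <- update(full, . ~ . - x1:x2:x3)     # drop the 3-way term first
anova(reduced, full, test = "Chisq")          # is the 3-way interaction needed?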
Quasi-likelihood in GLMs
Example of overdispersion:
model <- glm(Cancers ~ Distance, family = quasipoisson)
→ anova(simple,complex,test="F")
Non-straight Models
Clues for non-linearity
• Diagnostics example (see practical)
[Figure: residuals vs fitted and normal Q-Q plots, with plots PSP1, PSP9 and PSP12 highlighted]
• Significance remains
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.352888   1.539830   9.971 2.86e-08 ***
elevation   -0.002892   0.001227  -2.358   0.0314 *
Non-linear Models
Non-linear (least squares) regression:
• Polynomial regression essentially transforms the data, but if we can’t transform…
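When no transformation linearises the relationship, nls() fits by non-linear least squares; a minimal sketch with a simulated asymptotic curve (formula and starting values are illustrative only).

set.seed(9)
x <- runif(60, 0, 10)
y <- 20 * (1 - exp(-0.5 * x)) + rnorm(60)
m <- nls(y ~ a * (1 - exp(-b * x)), start = list(a = 15, b = 0.3))
summary(m)    # estimates of a (asymptote) and b (rate)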
Non-linear Models
Degrees of Freedom in GAMs:
• Df are not integers (“effective df”)
• Try various dfs using package gam – use s(variable,x.xx)
Non-linear Models
GAM with binary data
• Output is the probability of success (i.e. 1 rather than 0)
model <- gam(sp ~ ., family=binomial)
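A minimal sketch of a binomial GAM and its fitted probabilities, here using mgcv with simulated presence/absence data; the variable names are hypothetical.

library(mgcv)
set.seed(10)
elev <- runif(200, 500, 2000)
pres <- rbinom(200, 1, plogis(-4 + 0.004 * elev))
m <- gam(pres ~ s(elev), family = binomial)
head(predict(m, type = "response"))   # probability of a presence (1) at each site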
Non-linear Models
Multiple GAM variables
• Smoother determined from points either side, so increased error at extremes
• Interactions: s(x1,x2)
- More complicated than lm/glm
- E.g. Crawley 2005 contour plot, ozone.data.txt, p617