
Julian J. Faraway

July 2002

Chapter 1

Introduction

1.1 Look before you leap!

Statistics starts with a problem, continues with the

collection of data, proceeds with the data analysis and

finishes with conclusions.

It is a common mistake of inexperienced Statisticians to

plunge into a complex analysis without paying attention

to what the objectives are or even whether the data are

appropriate for the proposed analysis.

Formulation of a problem

-- often more essential than its solution. You must:

1. Understand the physical background.

2. Understand the objective.

3. Make sure you know what the client wants.

4. Put the problem into statistical terms.

Simply feeding the data into an analysis is not enough: the results may be totally meaningless.

Data Collection

– important to understand

how the data was collected

Initial Data Analysis

-- critical step to be performed

Numerical summaries

Graphical summaries

• One variable – boxplots, histograms, etc.

• Two variables – scatterplots.

• Many variables – interactive graphics.

Are the data distributed as you expected?

http://my.mofile.com/tangyc886 — R Language and Statistical Analysis, East China Normal University (September 2008)

Example – pima (Page 9)

Variables

pregnant -- number of times pregnant

glucose -- plasma glucose concentration

diastolic -- diastolic blood pressure

triceps -- triceps skin fold thickness

insulin -- 2-hour serum insulin

bmi -- body mass index

diabetes -- diabetes pedigree function

age -- age

test -- diabetes test result (0 = negative, 1 = positive)

Initial data analysis

Pages 10 - 13

Two things are unusual:

• Max of pregnant = 17? Possible?

• Diastolic (blood pressure) = 0? Are these missing values?

Data rearranged – recode the data (a minimal sketch follows)

• NA for missing values

• designate test as a factor with labels
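A minimal sketch of this recoding, assuming the pima data frame from the faraway package:

> library(faraway)
> data(pima)
> pima$diastolic[pima$diastolic == 0] <- NA                        # zero blood pressure treated as missing
> pima$test <- factor(pima$test, labels = c("negative", "positive"))
> summary(pima)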

Graphical summary

1.2 When to use Regression Analysis

– Model relationship between Y and X1,…,Xp

Y --- response/output/dependent variable (continuous)

X1,…,Xp --- predictor/input/independent

/explanatory variables

p=1 --- simple regression

p>1 --- multiple regression --- diastolic on bmi and diabetes

more than one Y --- multivariate regression

Analysis of Covariance --- diastolic on bmi and test

Analysis of Variance (ANOVA) --- diastolic on test

Logistic Regression --- test on diastolic and bmi

Chapter 2

Estimation

Linear Model

Y = β0 + β1 X1 + … + βp Xp + ε

yi = β0 + β1 x1i + … + βp xpi + εi,  i = 1, 2, …, n

Matrix/vector representation: y = Xβ + ε,
where y = (y1, …, yn)ᵀ, ε = (ε1, …, εn)ᵀ, β = (β1, …, βp)ᵀ.

n dimensions = p dimensions + (n − p) dimensions

Estimation of β such that Xβ is close to Y.

Geometrical representation (figure): y splits into the fitted part Xβ and the residual ε.

Least squares estimation (LSE)

β̂ = (XᵀX)⁻¹ Xᵀ y

Xβ̂ = Hy, where H = X(XᵀX)⁻¹Xᵀ --- the orthogonal projection of y onto M(X)

E(β̂) = β,  Var(β̂) = (XᵀX)⁻¹ σ²

Predicted values: ŷ = Hy = Xβ̂

Residuals: ε̂ = (I − H)y

Residual sum of squares: ε̂ᵀε̂ = yᵀ(I − H)y
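A minimal sketch verifying β̂ = (XᵀX)⁻¹Xᵀy against lm(), using the gala data from the faraway package (the example discussed later on page 23):

> library(faraway)
> data(gala)
> X <- model.matrix(~ Area + Elevation + Nearest + Scruz + Adjacent, gala)
> y <- gala$Species
> solve(t(X) %*% X, t(X) %*% y)     # beta-hat = (X'X)^{-1} X'y
> coef(lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, gala))   # should agree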

Is β̂ a good estimate?

• Geometrically good – orthogonal projection

• MLE if the εi are iid normal

• BLUE if E(ε) = 0 and Var(ε) = σ²I

--- from Gauss-Markov Theorem

Situations for other than LSE:

• GLSE if errors are correlated or have unequal variances

• Robust estimate if errors are long tailed

• Biased estimate (e.g. ridge regression) if predictors are

highly correlated (i.e. collinear).

Goodness of fit

• R² = 1 − Σ(ŷi − yi)² / Σ(yi − ȳ)² = 1 − RSS / (Total SS corrected for the mean)

• Estimate of σ²:  σ̂² = ε̂ᵀε̂ / (n − p)

Example --- Gala (Page 23-25)

Variables

Species -- # of species of tortoise

Endemics -- # of endemic species

Area -- area of the island (km²)

Elevation -- highest elevation of the island (m)

Nearest -- distance from the nearest island (km)

Scruz -- distance from Santa Cruz island (km)

Adjacent -- area of the adjacent island (km²)

gfit <- lm(Species ~ Area + Elevation + Nearest + Scruz

+ Adjacent, data = gala)

summary(gfit)

Chapter 3

Inference

General assumptions for linear models

y ~ N(Xβ, σ²I)

Thus

β̂ = (XᵀX)⁻¹Xᵀy ~ N(β, (XᵀX)⁻¹σ²)

Hypothesis tests to compare models

Two models:

Ω – Large model, with q parameters (dimension)

ω – small model, with p parameters which consists of a

subset of predictors that are in Ω

Law of parsimony – use ω if the data support it!

See the geometric illustration on page 27.

• Geometrical illustration (figure): y, the residual for ω, the residual for Ω, and the difference between the two models, drawn above the small model space.

Test statistic

(RSSω − RSSΩ) / RSSΩ

or equivalently the likelihood-ratio statistic

maxβ,σ∈Ω L(β, σ | y) / maxβ,σ∈ω L(β, σ | y)

F = [(RSSω − RSSΩ)/(q − p)] / [RSSΩ/(n − q)] ~ F(q−p, n−q)

Reject the null hypothesis (ω) if F > F(q−p, n−q)(α).

Comment:

The same test statistic applies not just when ω is a

subset of Ω but also to a subspace. This test is very

widely used in regression and analysis of variance.

When it is applied in different situations, the form of test

statistic may be re-expressed in various different ways.

The beauty of this approach is that you only need to know the general form. In any particular case, you just need to figure out which model represents the null and which the alternative hypothesis, fit them, and compute the test statistic. It is very versatile.

Some examples

1. Test of all predictors

Are any of the predictors useful?

Ω: y = Xβ + ε,  X: n by p

ω: y = µ + ε – predict y by its mean

(H0: β1 = … = βp−1 = 0)

Now we find that

RSSΩ = (y − Xβ̂)ᵀ(y − Xβ̂) = RSS,  dfΩ = n − p

RSSω = (y − ȳ)ᵀ(y − ȳ) = SYY,  dfω = n − 1  -- the sum of squares corrected for the mean

Comment:

• A failure to reject the null hypothesis is not the end of

the game — you must still investigate the possibility of

non-linear transformations of the variables and of

outliers which may obscure the relationship. Even then,

you may just have insufficient data to demonstrate a

real effect which is why we must be careful to say “fail

to reject” the null rather than “accept” the null.

It would be a mistake to conclude that no real

relationship exists.

• When the null is rejected, this does not imply that the

alternative model is the best model. We don’t know

whether all the predictors are required to predict the

response or just some of them. Other predictors might

also be added —for example quadratic terms in the

existing predictors. Either way, the overall F-test is just

the beginning of an analysis and not the end.

• Example – old economic dataset on page 29.

variables:

dpi – per-capita disposable income

ddpi – percent rate of change in dpi

sr – aggregate personal saving divided by disposable

income

pop15, pop75 – percentage population under 15 and over 75
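A minimal sketch of the test of all predictors for the savings data, assuming the faraway package:

> library(faraway)
> data(savings)
> g <- lm(sr ~ pop15 + pop75 + dpi + ddpi, savings)
> summary(g)          # the F-statistic at the bottom tests H0: all slopes are zero
> gnull <- lm(sr ~ 1, savings)
> anova(gnull, g)     # the same test as an explicit comparison of nested models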

2. Testing just one predictor

Can one particular predictor be dropped?

• RSSΩ – RSS for the model with all the predictors

• RSSω – RSS for the model with all the predictors except predictor i

Test statistic:

• F-statistic (with 1 and n − p degrees of freedom), or

• t-statistic with n − p degrees of freedom: ti = β̂i / se(β̂i)

Example -- Savings (Page 30)

--- Three methods:

• 1) F-statistic

• 2) t-statistic

• 3) compare two nested models using anova(g2,g)
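A minimal sketch of the three approaches, taking pop75 as the predictor under test (the choice of predictor is for illustration) and assuming the savings fit g from above:

> summary(g)                                   # t-statistic and p-value for pop75
> g2 <- lm(sr ~ pop15 + dpi + ddpi, savings)   # same model without pop75
> anova(g2, g)                                 # F-statistic; here F equals the square of the t-statistic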


3. Testing a pair of predictors

Suppose we wish to test the significance of

variables Xj and Xk. We might construct a table as

shown just above and find that both variables

have p-values greater than 0.05 thus indicating

that individually neither is significant. Does this

mean that both Xj and Xk can be eliminated from

the model? Not necessarily

Except in special circumstances, dropping one variable

from a regression model causes the estimates of the

other parameters to change so that we might find that

after dropping Xj, that a test of the significance of Xk

shows that it should now be included in the model.

If you really want to check the joint significance of Xj

and Xk, you should fit a model with and then without

them and use the general F-test discussed above.

Remember that even the result of this test may

depend on what other predictors are in the model.

4. Testing a subspace

Can two predictors be replaced by their sum?

Null hypothesis:

H0: βj = βk (a linear subspace)

Example – Savings

• 1) H0: β pop15 =β pop75

Null model :

y=β 0+β pop15 (pop15+pop75)+β dpi dpi+β ddpi ddpi+ε

> g <- lm(sr ~ . , savings)

> gr <- lm(sr ~ I(pop15 + pop75) + dpi + ddpi, savings)

> anova(gr,g)

% The period . stands for the other variables in the data frame

% I() argument is evaluated rather than interpreted as part of the model formula

• 2) H0: β ddpi =1

Null model :

y=β 0+β pop15 pop15+β pop75 pop75+β dpi dpi+ddpi+ε

% The fixed term is called an offset

> gr <- lm(sr ~ pop15 + pop75 + dpi + offset(ddpi), savings)

> anova(gr, g)

Two other ways as usual:

t-statistic: t = (β̂ − c) / se(β̂), where c is the point hypothesis (here c = 1).

Confidence Intervals for β

Confidence intervals are closely related to hypothesis tests:

For a 100(1-α )% confidence region, any point that lies

within the region represents a null hypothesis that would

not be rejected at the 100α % level while every point

outside represents a null hypothesis that would be rejected.

The confidence region provides a lot more information than

a single hypothesis test in that it tells us the outcome of a

whole range of hypotheses about the parameter values.

Simultaneous confidence region for β

Confidence interval for one β i

• General form:

estimate ± critical value × s.e. of estimate

• For βi:

β̂i ± t(α/2)n−p σ̂ √[(XᵀX)⁻¹ii]

• A one-at-a-time interval may not be a plausible joint statement, especially when the estimates of the βi are heavily correlated.

Example – Savings

• 95% confidence interval for pop75

• 95% confidence interval for growth

• Joint 95% confidence region for pop75 and growth

Note : “ellipse” is needed for drawing confidence ellipses
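A minimal sketch, assuming the savings fit g from earlier and the ellipse package:

> confint(g, "pop75")                                   # 95% CI for the pop75 coefficient
> confint(g, "ddpi")                                    # 95% CI for the growth (ddpi) coefficient
> library(ellipse)
> plot(ellipse(g, c("pop75", "ddpi")), type = "l")      # joint 95% confidence region
> points(coef(g)["pop75"], coef(g)["ddpi"], pch = 18)   # the point estimate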

Confidence intervals for predictions

Q: Given a set of predictors x0, what is the predicted

response?

A:  ŷ0 = x0ᵀ β̂

Two predictions:

• of the future mean response

• of future observations

The point estimates are the same, while the intervals are

mean response:       ŷ0 ± t(α/2)n−p σ̂ √( x0ᵀ(XᵀX)⁻¹x0 )

future observation:  ŷ0 ± t(α/2)n−p σ̂ √( 1 + x0ᵀ(XᵀX)⁻¹x0 )

Orthogonality (page 41)

Identifiability (page 44)

What can go wrong? (page 46)

• Source and quality of the data

• Error component

• Structural component

Confidence in the conclusions from a model declines as

we progress through these.

Most statistical theory rests on the assumption that the

model is correct. In practice, the best one can hope for

is that the model is a fair representation of reality. A

model can be no more than a good portrait.


All models are wrong but some are useful.

George Box

So far as the theories of mathematics are about reality, they are not certain; so far as they are certain, they are not about reality.

Einstein

Prediction and Extrapolation (Page 52)

• Is the new X0 within the range of validity of the model?

• Is it close to the range of the original data?

• If not,

* the prediction may be unrealistic and

* confidence intervals for prediction get wider as

we move away from the data.

Example

• Model:

>g4 <- lm(sr ~ pop75, data=savings)

>summary(g4)

• Prediction/fit and Confidence interval (CI)

>grid <- seq(0, 10, 0.1)

>p <- predict(g4, data.frame(pop75=grid), se=T)

>cv <- qt(0.975, 48)

>matplot(grid, cbind(p$fit, p$fit-cv*p$se, p$fit+cv*p$se),

+ lty=c(1,2,2), type="l", xlab="pop75", ylab="Savings")

>rug(savings$pop75) % show the location of the pop75 values

Conclusion (See Figures 3.6 and 3.7): the widening of the CI does not reflect the

possibility that the structure of the model itself may change as we move into new

territory.

The only thing worse than a prediction is no prediction at all.

(a quote from the 4th century)

Chapter 4

Errors in Predictors

Regression model Y=Xβ +ε allows for Y being measured

with error by having the ε term.

What if X is also measured with error, i.e. the observed X is not the X used to generate Y?

Consider true values (ξi, ηi) with ηi = β0 + β1 ξi, observed as xi = ξi + δi and yi = ηi + εi, i = 1, 2, …, n, where

var εi = σ²,  var δi = σδ²,  σξ² = Σ(ξi − ξ̄)²/n,  σξδ = cov(ξ, δ).


Then

β̂1 = Σ(xi − x̄) yi / Σ(xi − x̄)²

E β̂1 ≈ β1 (σξ² + σξδ) / (σξ² + σδ² + 2σξδ)

If cov(ξ, δ) = 0, this becomes  E β̂1 = β1 / (1 + σδ²/σξ²)  →  β1 as σδ²/σξ² → 0.

Thus the estimate of β 1 will be biased!

If the variability in the errors of observation of X (σδ²) is small relative to the spread of X (σξ²), we need not be concerned.

If not, it’s a serious problem. Other methods should be

considered!

Attention on the difference:

• Errors-in-predictors case: the observed X differs from the true predictor values.

• Usual regression case: Y is generated conditional on the fixed value of X.

Example (page 56)

> x <- 10*runif(50)

> y <- x + rnorm(50)

> gx <- lm(y ~ x)

The true regression coefficients are β0 = 0 and β1 = 1. See how they change when the measurement error 5*rnorm(50) is added to x: the slope decreases from 1 to 0.25 theoretically, and from 0.97 to 0.435 numerically. With σδ² = 25 and σξ² = 100/12, the expected slope is β1 / (1 + σδ²/σξ²) = 0.25 (see the slide above and the simulation on page 57).
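A minimal sketch of the attenuation; the error-contaminated predictor z and the set.seed call are illustrative additions to the book's example:

> set.seed(1)
> x <- 10 * runif(50)
> y <- x + rnorm(50)
> coef(lm(y ~ x))          # slope close to 1
> z <- x + 5 * rnorm(50)   # x observed with measurement error
> coef(lm(y ~ z))          # slope attenuated towards 1/(1 + 25/(100/12)) = 0.25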

Chapter 5

Generalized Least Squares

The general case

Error: var ε = σ²Σ, where σ² is unknown but Σ is known.

GLS minimizes (y − Xβ)ᵀ Σ⁻¹ (y − Xβ):

β̂ = (XᵀΣ⁻¹X)⁻¹ XᵀΣ⁻¹ y

GLS reduces to OLS by regressing S⁻¹y on S⁻¹X, where S is the triangular matrix from the Cholesky decomposition Σ = SSᵀ (since var S⁻¹ε = σ²I).

Example: Longley’s regression data (page 59)

• Errors: ε i+1 =ρ ε i+δ i,δ i» N(0,τ 2)

• ∑=(∑ij ): ∑ij =ρ |i-j|

• Two methods: 1) Use the formula 2) Choleski Decomposition
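A minimal sketch of the Cholesky approach; Sigma, X and y are placeholders for a known covariance structure, the model matrix and the response:

> # Sigma: known n x n error covariance structure; X: model matrix; y: response
> S  <- t(chol(Sigma))      # lower-triangular S with Sigma = S %*% t(S)
> Xs <- solve(S, X)         # S^{-1} X
> ys <- solve(S, y)         # S^{-1} y
> lm(ys ~ Xs - 1)           # OLS on the transformed data gives the GLS estimate of beta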


Revisit the example using nlme library (Page 61)

Weighted Least Squares

Σ = diag(1/w1, …, 1/wn) – diagonal (uncorrelated) errors with weights wi, i = 1, 2, …, n.

Low/high variability ⇒ high/low weight.

Regress √wi·yi on √wi·xi using OLS (a minimal R sketch follows the strongx example below).

Example -- strongx:

• An experiment was designed to study the interaction of

certain kinds of elementary particles on collision with proton

targets.

• The cross-section (crossx) variable is believed to be linearly related to the inverse of the energy (the energy variable has already been inverted). At each level of the momentum, a very large number of observations were taken, so the standard deviation of the response (sd) could be estimated accurately.

• (Figure 5.1) Is the unweighted fit better? For lower values of

energy, the variance in the response is less, thus the weight

is high. So the weighted fit tries to catch these points better

than the others.
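A minimal sketch of the weighted fit for the strongx example, assuming the strongx data from the faraway package (columns crossx, energy, sd):

> data(strongx)
> g  <- lm(crossx ~ energy, strongx)                    # ordinary (unweighted) fit
> gw <- lm(crossx ~ energy, strongx, weights = sd^-2)   # weight each point by 1/variance
> summary(gw)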

Iteratively Reweighted Least Squares(IRWLS), p64

Var(ε ) is not completely known

Model Σ (e.g. var ε = γ0 + γ1 x1), estimate β by least squares, re-estimate the variance parameters from the residuals, and iterate until convergence.

Use gls() in the nlme library – model the variance and

jointly estimate the regression and weighting

parameters using likelihood based method.
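A minimal sketch using gls() from nlme; the particular variance function below is an assumption for illustration, not the book's choice:

> library(nlme)
> g <- gls(crossx ~ energy, data = strongx, weights = varConstPower(form = ~ energy))
> summary(g)    # variance-function and regression parameters estimated jointly by likelihood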

Chapter 6

Testing for Lack of Fit

Estimate of σ²

• If the model is correct, then σ̂² is unbiased for σ².

• If the model is too simple (not correct), then σ̂² will overestimate σ².

• If the model is too complex and overfits the data, then σ̂² will underestimate σ².

Test statistic when σ² is known:

lack of fit if  (n − p) σ̂² / σ²  >  χ²(1−α)n−p

Example – strongx

Model 1: without a quadratic term ⇒ lack of fit (p-value = 0.0048)

Model 2: with a quadratic term ⇒ adequate fit (p-value = 0.85363)

σ² unknown

The σ̂² based on the regression model needs to be compared to some model-free estimate.

Repeat y independently at one or more values of x. The replication (within-group) variability gives the "pure error" estimate of σ², SSpe/dfpe, where

SSpe = Σ_{distinct x} Σ_{given x} (y − ȳ)²

dfpe = Σ_{distinct x} (#replicates − 1)

If you fit a model that assigns one parameter to each group of

observations with fixed x then the σ̂ 2 from this model will be the pure

error σˆ 2 . You use factor(x) instead of x directly! See the example

below.

Comparing this model to the regression model amounts to the lack-of-fit test -- the most convenient way.

Alternative (ANOVA): partition the RSS into that due to lack of fit and that due to pure error:

              df              SS           MS
Lack of Fit   n − p − dfpe    RSS − SSpe   (RSS − SSpe)/(n − p − dfpe)
Pure Error    dfpe            SSpe         SSpe/dfpe

F = MS(Lack of Fit) / MS(Pure Error); lack of fit if F > F(1−α) with (n − p − dfpe, dfpe) degrees of freedom.

Example – corrosion:

Thirteen specimens of 90/10 Cu-Ni alloys with varying iron

content in percent. The specimens were submerged in sea

water for 60 days and the weight loss due to corrosion was

recorded in units of milligrams per square decimeter per

day.

• Model 1:

> g <- lm(loss ~ Fe, data=corrosion)

Both R2=97% and graph (Figure 6.3) show good fit to the data.

• Model 2:

(Reserve a parameter for each group of data with the same x.

The fitted values are the means in each group, see Figure 6.3.)

> ga <- lm(loss ~ factor(Fe),data=corrosion)

Compare the two models:

> anova(g, ga)

• p-value = 0.0086 ⇒ a lack of fit!

• Reason: the regression sd = 3.06, while the pure-error sd = sqrt(11.8/6) = 1.4.

(Are the replicates genuine? The low pure-error sd may be caused by some correlation in the measurements, or by an unmeasured third variable.)

• Another model other than a straight line (though not obvious!)

> gp <- lm(loss ~ Fe+I(Fe^2)+I(Fe^3)+I(Fe^4)+I(Fe^5)+I(Fe^6),

+ corrosion)

> summary(gp)$r.squared

[1] 0.99653

• R2=99.7%

• The fit of this model is excellent (see Figure 6.4)

but it is clearly ridiculous!

• This is a consequence of OVERFITTING the data.

About R2 and goodness of fit.

• No need to become too focused on measures of fit like R2

• If one method shows that the null hypothesis is accepted,

we cannot conclude that we have the true model. After all, it

may be that we just did not have enough data to detect the

inadequacies of the model. All we can say is that the model

is not contradicted by the data.

• It is also possible to detect lack of fit by less formal,

graphical methods.

• A more general question is how good a fit do you really

want?

• By increasing the complexity of the model, it is possible to fit

the data more closely. By using as many parameters as

data points, we can fit the data exactly.

• Very little is achieved by doing this since we learn nothing

beyond the data itself and any predictions made using such

a model will tend to have very high variance.

• The question of how complex a model to fit is difficult and

fundamental.

Chapter 7

Diagnostics

Regression model building is often an iterative and

interactive process. The first model we try may

prove to be inadequate.

Regression diagnostics are used to detect problems

with the model and suggest improvements.

This is a hands-on process.

Residuals and Leverage

H – hat matrix: H = X(XᵀX)⁻¹Xᵀ, thus

ŷ = Hy

ε̂ = y − ŷ = (I − H)y = (I − H)Xβ + (I − H)ε = (I − H)ε

var ε̂ = (I − H)σ²   (since var ε = σ²I)

hi = Hii -- leverages

var ε̂i = σ²(1 − hi): a large leverage hi will "force" the fit to be close to yi.

Some facts: Σ hi = p,  hi ≥ 1/n


"Rule of thumb" – leverages of more than 2p/n should be looked at more closely. Large values of hi are due to extreme values in X (hi corresponds to a Mahalanobis distance defined by X). Also notice that var ŷi = hi σ².

Example – savings (page 72)

• Index plot of residuals ⇒ Chile and Zambia have the largest and smallest residuals.

• Index plot of leverages ⇒ Libya and the United States have the highest leverages.

Studentized Residuals

Since var ε̂i = σ²(1 − hi), the (internally) studentized residuals are

ri = ε̂i / (σ̂ √(1 − hi))

If the model assumptions are correct, then

var ri=1 and corr(ri,rj) is small

Thus studentization can correct for the non-constant variance in the residuals when the errors have constant variance, but it cannot correct for heteroscedasticity in the errors.

Example – savings (page 75)

An outlier Test

An outlier is a point that does not fit the current model.

Outliers may affect the fit. See Figure 7.2.

An outlier test enables us to distinguish between truly unusual points and residuals which are large but not exceptional.

Exclude point i and recompute the estimates to get β̂(i) and σ̂²(i), with ŷ(i) = xiᵀ β̂(i).

Test statistic (jackknife, or externally studentized / cross-validated, residuals):

ti = ε̂i / (σ̂(i) √(1 − hi)) = ri √[ (n − p − 1) / (n − p − ri²) ]  ~  t(n−p−1)

When all cases are to be tested we must adjust the level

of the test using Bonferroni correction. However it is

conservative!

Example – savings (page 76)
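A minimal sketch of the jackknife-residual outlier test with a Bonferroni correction, assuming the savings fit g from earlier:

> tj <- rstudent(g)                       # jackknife (externally studentized) residuals
> tj[which.max(abs(tj))]                  # the most extreme case
> n <- nrow(savings); p <- length(coef(g))
> 2 * (1 - pt(max(abs(tj)), n - p - 1))   # unadjusted two-sided p-value
> # Bonferroni adjustment: multiply the p-value by n (or compare with alpha/n)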

Some notes about outliers and their treatment with an

example: Star (page 76-78)

Influential Observations

An influential point is one whose removal from the

dataset would cause a large change in the fit.

An influential point may or may not be an outlier and

may or may not have large leverage but it will tend to

have at least one of those two properties.

Measures of influence

• Change in the coefficients: β̂ − β̂(i)

• Cook statistics – popular now:

Di = (β̂ − β̂(i))ᵀ (XᵀX) (β̂ − β̂(i)) / (p σ̂²)  =  (1/p) · ri² · hi/(1 − hi)

• The first term, ri², is the residual effect and the second is the leverage. The combination leads to the influence.

• An index plot of Di can be used to identify influential points.

Example – savings (page 79)

• Full data

• Exclude one point -- helpful using lm.influence()
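A minimal sketch, assuming the savings fit g from earlier:

> d <- cooks.distance(g)
> plot(d, ylab = "Cook's distance")      # index plot to identify influential cases
> li <- lm.influence(g)
> head(li$coefficients)                  # change in each coefficient when a case is dropped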

Residual Plots

Residuals vs. fitted – to look for

• heteroscedasticity (non-constant variance)

• nonlinearity

Residuals or abs(residuals) vs. one of the predictors xi –

look for if xi should be included.

Example 1 – savings (pages 81-83, Figure 7.6)

Scatter plots of the residuals vs. the predictor pop15 show non-constant variance in two groups.

Example 2 – simulation example (page 83)

Non-Constant Variance

Two approaches to dealing with non-constant variance.

1) weighted least squares

2) variance stabilizing transformation y → h(y):

h(y) = ∫ dy / √(var y)

• var y ∝ (Ey)²  ⇒  h(y) = log y

• var y ∝ Ey  ⇒  h(y) = √y – appropriate for count response data

Note: Use graphical techniques (e.g. residual plots)

to detect problems/structures of unsuspected

natures.

Example – Galapagos (page 84)


Non-Linearity

How to check if the systematic part Ey=Xβ is correct?

As before, we may look at plots of ε̂ against ŷ and against each xi.

What about the effect of other x on the y vs. xi plot?

Partial regression plots (or added variable plots): can isolate the effect of xi on y.

• Regress y on all x except xi, and get the residuals δ̂.

• Regress xi on all x except xi, and get the residuals γ̂.

• Plot δ̂ against γ̂.


Partial residual plots:

Plot ε̂ + β̂i xi against xi.

Comparison: Partial residual plots are reckoned to be

better for non-linearity detection while added variable

plots are better for outlier/influential detection.

Example – savings (pages 85-87)
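A minimal sketch of an added variable plot for pop15 in the savings data, assuming the faraway package:

> d <- residuals(lm(sr ~ pop75 + dpi + ddpi, savings))      # y regressed on all x except pop15
> m <- residuals(lm(pop15 ~ pop75 + dpi + ddpi, savings))   # pop15 regressed on the other x
> plot(m, d, xlab = "pop15 residuals", ylab = "savings residuals")
> coef(lm(d ~ m))    # the slope equals the pop15 coefficient in the full model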

Assessing Normality

Use QQ-plot

Example 1 – savings (page 88, see Figure 7.9)

Example 2 – simulated data from different distributions

(page 88, see Figure 7.10)

Ways to treat non-normality (page 90)

Half-normal plots

Designed for assessment of positive data

Useful for leverages or Cook Statistics – look for

outliers in the model.

! No need to look for a straight line relationship

Example – savings (page 91)


Correlated Errors

Plot residuals against time

Use formal tests like the Durbin-Watson test or the runs test.

(If errors are correlated, you can use GLS)

Example – airquality (pages 93-94)

• Variables: ozone

solar radiation

temperature

wind speed

• Non-transformed response ⇒ non-constant variance and nonlinearity

• Log transformed response

• Check for serial correlation

Chapter 8

Transformation

Transformations of the response and predictors can

improve the fit and correct violations of model

assumptions such as constant error variance.

We may also consider adding additional predictors

that are functions of the existing predictors like

quadratic or crossproduct terms.

Transforming the response

Example – Log response

log y=β 0+β 1 x +ε (linear regression)

In the original scale:

y=exp(β 0+β 1 x) exp(ε )

The errors enter multiplicatively in the original scale.

Compare:

y=exp(β 0+β 1 x) +ε (non-linear regression)

The errors enter additively in the original scale.

Usual approach: Try different transformations and check

the residuals.


Box-Cox method/transformation (for y>0)

• Profile log-likelihood assuming normal errors:  L(λ) = −(n/2) log(RSS(λ)/n) + (λ − 1) Σ log yi

• Usually L(λ ) is maximized over {-2,-1,-1/2,0,1/2,1,2}

Transforming the response can make the model harder

to interpret so we don’t want to do it unless it’s really

necessary.

One way to check this is to form a confidence interval for λ. A 100(1−α)% confidence interval for λ is {λ : L(λ) > L(λ̂) − ½ χ²₁(1−α)}.

Example 1:  CI(λ) = (0.6, 1.4) ⇒ no need to transform

Example 2 – gala:  CI(λ) = (0.1, 0.5) ⇒ cube-root transformation
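A minimal sketch using boxcox() from the MASS package on the gala fit gfit from Chapter 2; the lambda grid below is an assumption for illustration:

> library(MASS)
> boxcox(gfit, plotit = TRUE, lambda = seq(-0.25, 0.75, by = 0.05))   # profile log-likelihood with 95% CI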


Other transformations

• Logit transformation log(y/(1-y)) for proportions

• Fisher’s Z transformation 0.5log((1+y)/(1-y)) for correlations

Transforming the predictors

Segmented regression (Broken stick regression)

Example – savings (page 98-100)

• Method 1: subset regression fits – discontinuous at the knot point.

• Method 2: hockey-stick basis functions (a minimal sketch follows).
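A minimal sketch of the hockey-stick (broken stick) fit for the savings data, taking the knot at pop15 = 35 as a plausible choice:

> lhs <- function(x) ifelse(x < 35, 35 - x, 0)    # basis function to the left of the knot
> rhs <- function(x) ifelse(x < 35, 0, x - 35)    # basis function to the right of the knot
> gb <- lm(sr ~ lhs(pop15) + rhs(pop15), savings)
> summary(gb)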

Polynomials & orthogonal polynomials:

add powers of the predictor(s) until the p-value is

significant.

Regression Splines

Polynomials have the advantage of smoothness but the

disadvantage that each data point affects the fit globally. This

is because the power functions used for the polynomials take

non-zero values across the whole range of the predictor.

In contrast, the broken stick regression method localizes the

influence of each data point to its particular segment which is

good but we do not have the same smoothness as with the

polynomials.

There is a way we can combine the beneficial aspects of both

these methods — smoothness and local influence — by using

B-spline basis functions.

Cubic B-spline basis functions:

• A given basis function is non-zero on interval defined by four

successive knots and zero elsewhere.

This property ensures the local influence property.

• The basis function is a cubic polynomial for each sub-interval

between successive knots

• The basis function is continuous and continuous in its first and

second derivatives at each knot point.

This property ensures the smoothness of the fit.

• The basis function integrates to one over its support

A constructed example: y = sin²(2πx³) + ε,  ε ~ N(0, 0.1²)

1) Orthogonal polynomial regression

2) Regression splines


Chapter 9

Scale Changes, Principal Components and

Collinearity

Changes of Scale (page 106)

Rescaling xi as (xi + a)/b leaves the t and F tests, σ̂² and R² unchanged, and β̂i → b·β̂i.

Rescaling y in the same way leaves the t and F tests and R² unchanged, but σ̂² and the β̂i are rescaled by b.

Principal Components Regression (PCR)

-- Transform X to orthogonality

Partial Least Squares (PLS) (page 114)

-- Find the best orthogonal linear combination of X for

predicting Y

PCR and PLS compared

PCR:

• PCR attempts to find linear combinations of the predictors

that explain most of the variation in these predictors using

just a few components.

• The purpose is dimension reduction.

• Because the principal components can be linear

combinations of all the predictors, the number of variables

used is not always reduced.

• Because the principal components are selected using only

the X-matrix and not the response, there is no definite

guarantee that the PCR will predict the response particularly

well although this often happens.

• If it happens that we can interpret the principal components

in a meaningful way, we may achieve a much simpler

explanation of the response. Thus PCR is geared more

towards explanation than prediction.

PLS:

• In contrast, PLS finds linear combinations of the predictors

that best explain the response. It is most effective when

there are large numbers of variables to be considered.

• If successful, the variability of prediction is substantially reduced.

• On the other hand, PLS is virtually useless for explanation

purposes.

Collinearity (page 117)

If XTX is singular, i.e. some predictors are linear

combinations of others, we have (exact) collinearity and

there is no unique least squares estimate of β . If XTX is

close to singular, we have (approximate) collinearity or

multicollinearity (some just call it collinearity). This

causes serious problems with the estimation of β and

associated quantities as well as the interpretation.

Collinearity can be detected in several way.

• Examination of the correlation matrix of the predictors will

reveal large pairwise collinearities.

• A regression of xi on all other predictors gives R2i. Repeat for

all predictors. R2i close to one indicates a problem.

• If R²i is close to one then the variance inflation factor (VIF) 1/(1 − R²i) will be very large, and so will var(β̂i).

• Examine the eigenvalues of XᵀX – small eigenvalues indicate a problem. The condition number is κ = √(λ1/λp); the other condition numbers √(λ1/λi) are also worth considering because they indicate whether more than just one independent linear combination is to blame.

Collinearity leads to

• imprecise estimates of β —the signs of the coefficients

may be misleading.

• t-tests which fail to reveal significant factors

• missing importance of predictors

Example – Longley (pages 118-120)
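A minimal sketch of these detection steps for the Longley data, using the longley data frame built into R (base R only; Employed is the response in its last column):

> data(longley)
> g <- lm(Employed ~ ., longley)
> round(cor(longley[, -7]), 3)                              # large pairwise correlations among the predictors
> r2 <- summary(lm(GNP ~ . - Employed, longley))$r.squared  # regress one predictor on the others
> 1 / (1 - r2)                                              # variance inflation factor for GNP
> x <- model.matrix(g)[, -1]
> e <- eigen(t(x) %*% x)$values
> sqrt(e[1] / e)                                            # condition numbers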

Ridge Regression

Ridge regression makes the assumption that the

regression coefficients (after normalization) are not

likely to be very large.

It is appropriate for use when the design matrix is

collinear and the usual least squares estimates of β

appear to be unstable.

The ridge regression estimates of β are then given by  β̂ = (XᵀX + λI)⁻¹ Xᵀ y.

Example – Longley (page 121)
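A minimal sketch using lm.ridge() from the MASS package on the Longley data; the lambda grid is an assumption for illustration:

> library(MASS)
> gr <- lm.ridge(Employed ~ ., longley, lambda = seq(0, 0.1, by = 0.001))
> select(gr)                                 # HKB, L-W and GCV suggestions for lambda
> matplot(gr$lambda, t(gr$coef), type = "l", xlab = "lambda", ylab = "ridge coefficients")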

Ridge regression estimates of coefficients are biased.

Bias is undesirable but there are other considerations.

MSE decomposition:

E(β̂ − β)² = (E(β̂) − β)² + E(β̂ − E β̂)² = bias² + variance

Sometimes a large reduction in the variance may be obtained at the price of an increase in the bias. If the

MSE is reduced as a consequence then we may be

willing to accept some bias. This is the trade-off that

Ridge Regression makes - a reduction in variance at

the price of an increase in bias.

This is a common dilemma.


Chapter 10

Variable Selection

Select the best subset of predictors

Prior to variable selection:

• Identify outliers and influential points - maybe exclude them

at least temporarily.

• Add in any transformations of the variables that seem

appropriate.

Hierarchical Models

When selecting variables, it is important to respect the

hierarchy. Lower order terms should not be removed

from the model before higher order terms in the same

variable.

There are two common situations where this arises:

• Polynomials models.

• Models with interactions.

Stepwise Procedures

Backward Elimination

Forward Selection

Stepwise Regression

Example – statedata (page 126)

Criterion-based Procedures

The Akaike Information Criterion (AIC)

AIC = −2 log(likelihood) + 2p

and the Bayes Information Criterion (BIC)

BIC = −2 log(likelihood) + p log n

(for a Gaussian linear model, −2 log(likelihood) = n log(RSS/n) up to a constant)

Example – statedata
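A minimal sketch of AIC-based selection with step(), building statedata from the built-in state.x77 matrix as in the book:

> statedata <- data.frame(state.x77, row.names = state.abb)
> g <- lm(Life.Exp ~ ., data = statedata)
> gs <- step(g)        # backward search minimizing AIC
> summary(gs)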

Adjusted R² (R²a)

R² = 1 − RSS/TSS

R²a = 1 − [RSS/(n − p)] / [TSS/(n − 1)] = 1 − (1 − R²)(n − 1)/(n − p) = 1 − σ̂²model / σ̂²null

Predicted Residual Sum of Squares (PRESS)

PRESS = Σi ε̂(i)², where the ε̂(i) are the residuals calculated without using case i in the fit.

Mallow's Cp Statistic

Cp = RSSp / σ̂² + 2p − n,  where σ̂² is estimated from the model with all predictors.

Example – statedata (page 130)


Chapter 11

Statistical Strategy and Model

Uncertainty

Strategy

Procedure: Diagnostics → Transformation → Variable Selection → Diagnostics → …

There is a danger of doing too much analysis. The

more transformations and permutations of leaving

out influential points you do, the better fitting model

you will find. Torture the data long enough, and

sooner or later it will confess.

Fitting the data well is no guarantee of good predictive

performance or that the model is a good

representation of the underlying population.


You should:

1. Avoid complex models for small datasets.

2. Try to obtain new data to validate your proposed

model. Some people set aside some of their existing

data for this purpose.

3. Use past experience with similar data to guide the

choice of model.

Model multiplicity: The same data may support different

models. Conclusions drawn from the models may differ

quantitatively and qualitatively.

See further from author’s experiment and discussion

(page 135-137)

Chapter 12

A Complete Example

Chapter 13

Robust and Resistant Regression

Errors ~ normal ⇒ LSE

Errors ~ long-tailed ⇒

• LSE with outliers removed through outlier tests

• Robust (M-estimation) regression: minimize Σ ρ(εi)

• Least Trimmed Squares (LTS) – resistant regression

Possible choices of ρ

• ρ(x) = x² -- least squares regression

• ρ(x) = |x| -- least absolute deviations regression (LAD)

• Huber’s method – a compromise between LS and LAD

• Example – Chicago insurance data (page 151)

Chapter 14

Missing Data

Chapter 15

Analysis of Covariance

An example – Regression problem with a mixture of

quantitative and qualitative predictors

Purpose: to see the effect of a medication on cholesterol

level

Two groups:

• 1) Treatment: receive the medication, age 50 ~ 70

• 2) Control: not receive the medication, age 30 ~ 50

Figure 15 shows that the mean reduction in cholesterol

was 0% and 10% respectively.

Better not to be treated?

Not a two sample problem as the two groups differ w.r.t.

age


Taking age into account (e.g. comparing at age 50) ⇒ the difference between the two groups is again 10%, but in favor of the treatment!

Analysis of covariance: adjust the groups for the age (a covariate) difference and examine the effect of the medication.

Method: code the qualitative predictor(s) (covariate(s)) and incorporate them within the y = Xβ + ε framework.

• y – change in cholesterol level

• x –age

• d – 0 (or −1) for the control group and 1 for the treatment group (see 15.2 for coding qualitative predictors, p. 164)

Models:

• y=β 0+β 1x+ε (in R y~x)

• y=β 0+β 1x+β 2d+ε (in R y ~ x+d)

• y=β 0+β 1x+β 2d+β 3x.d+ε (in R y ~ x*d)

A two-level example (page 161)

(English medieval cathedrals)

Variables

• X – nave height

• Y – total length

• Style – r: Romanesque, g: Gothic

Note that some have parts in both styles

Purpose: to see how the length is related to height for the

two styles

Coding qualitative predictors

For a k-level predictor, k − 1 dummy variables are needed for the representation: one parameter is used to represent the overall mean effect, or perhaps the mean of some reference level, and so only k − 1 further parameters are needed.

Treatment coding

A 4-level factor is coded using 3 dummy variables:

            dummy variables
level        1   2   3
  1          0   0   0
  2          1   0   0
  3          0   1   0
  4          0   0   1

• The first one is the reference level to which other levels are

compared.

• R assigns levels to a factor in alphabetical order by default.

• The columns are orthogonal and the corresponding

dummies will be too. The dummies won’t be orthogonal to

the intercept.

• Treatment coding is the default choice for R

Helmert Coding (page 165)

• Not as convenient for interpretation. It is the default choice in S-PLUS.

The choice of coding does not affect R², σ̂² or the overall F-statistic, but it does affect β̂, so you need to know what the coding is before making conclusions about β̂.

A three-level example (page 165)

The data:

• IQ scores for identical twins, one raised by foster parents,

the other by the natural parents.

• social class of natural parents (high, middle or low).

Purpose: Predict the IQ of the twin with foster parents

from the IQ of the twin with the natural parents and the

social class of natural parents.

Separate lines model

• > g <- lm(Foster ~ Biological*Social, twins)

• > summary(g)

• The reference level is high class, being first alphabetically.

• We see that the intercept for low class line would be

-1.872+9.0767 while the slope for the middle class line

would be 0.9776-0.005.

Can the model be simplified to the parallel lines

model?

• > gr <- lm(Foster ~ Biological+Social, twins)

• > anova(gr,g)

Further reduction to a single line model

• > gr <- lm(Foster ~ Biological, twins)

Conclusion: the single-line model is accepted (p-value = 0.59423 ⇒ the null is not rejected).


Chapter 16

ANOVA

Introduction

Predictors are now all categorical/ qualitative.

ANOVA is used to partition the overall variance in the

response to that due to each of the factors and the error.

Predictors are now typically called factors which have

some number of levels.

The parameters are now often called effects.

Two kinds of models

• fixed-effects models-- parameters are considered fixed

• random-effects models -- parameters are random

One-Way Anova

The model

Estimation and testing

• Estimate the effects

• Test if there is a difference in the levels of the factor.

Example (page 169)

• Response: blood coagulation times

• Factor: diets

• Sample size: 24 (animals)

Diagnosis (page 171)

Multiple Comparisons (page 172)

• Bonferroni (good for a few comparisons, but conservative if there are many)

• Fisher's LSD (after the overall F-test shows a difference)

• Tukey's Honest Significant Difference (HSD) (for all pairwise comparisons)
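A minimal sketch of the one-way fit and Tukey's HSD for the coagulation example (page 169), assuming the coagulation data from the faraway package (columns coag and diet):

> data(coagulation)
> g <- aov(coag ~ diet, coagulation)
> summary(g)        # overall F-test for any difference among the diets
> TukeyHSD(g)       # all pairwise differences with family-wise 95% intervals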

Contrasts

Scheffe’s theorem for multiple comparisons

Testing for homogeneity of variance – Levene's test

Two-Way Anova

One observation per cell

More than one observation per cell

Interpreting the interaction effect

Example – rats (pages 181-184)

• 48 rats were allocated to 3 poisons (I,II,III) and 4 treatments

(A,B,C,D).

• The response was survival time in tens of hours.

• 4 replicates in each cell

• The reference level is I for poison and A for treat!

Blocking designs (page 185)

Postscript

Computational and statistical analysis software

Symbolic computation software

Maple v9.0

Mathematica (v5.0)

Symbolic toolbox for Matlab (v7.0)

Scientific workplace (SWP)

Maxima (from Macsyma)

MuPAD

Reduce

……

Numerical computation software

Matlab

Scilab

Statistical analysis software

• SAS (v9.1)      • R (v9)

• BMDP (v7)       • Minitab (v14)

• Statistica (v6) • Systat (v10)

• Stata (v8)      • JMP (v5)

• Gauss           • MacAnova

Bayesian statistical analysis

WinBUGS (v1.4) (including ARS)

CODA (v0.5-1 for R)

JAGS (v0.50)

Econometric data analysis software

Tsp (v6.5)

Eviews (v4)

Istm2000

TSA

OX/OXMatrix packages

• PcGets, PcGive, Tsp, ARFIMA, GARCH, MSVAR


A comprehensive comparison of data-processing software:

Comparison of mathematical

programs for data analysis

(Edition 4.4)

by Stefan Steinhaus

(stefan@steinhaus-net.de)

http://www.scientificweb.com/ncrunch/
