Example:
Biologists are interested in the characteristics of growth curves, that is, finding a model for describing how organisms grow with time. Relationships of this type tend to be curvilinear in that the rate of growth decreases with age and eventually stops altogether. A polynomial model is sometimes used for this purpose.
This example concerns the growth of rabbit jawbones. Measurements were made on lengths of jawbones for rabbits of various ages. The data are given in Table 8.8, and the plot of the data is given in Fig. 8.3, where the line is the estimated polynomial regression line described below. Two points for much older rabbits are shown on the plot but not used in the regression.
Table 8.8 Rabbit Jawbone Length
AGE LENGTH AGE LENGTH AGE LENGTH

[FIGURE 8.3: jawbone length versus age, with the estimated polynomial regression line]
==========================================================
library(gdata)  # provides read.xls
data=read.xls("file destination", sheet=1)
data
with(data, plot(age, length))
model=lm(length~poly(age, degree=4),data=data)
summary(model)
==========================================================
The results of this regression are
Call:
lm(formula = length ~ poly(age, degree = 4), data = data)
Residuals:
Min 1Q Median 3Q Max
-3.4540 -0.8948 0.2523 1.0698 3.0396
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.2607 0.3192 122.980 < 2e-16
poly(age, degree = 4)1 52.1100 1.6893 30.847 < 2e-16
poly(age, degree = 4)2 -23.5047 1.6893 -13.914 1.09e-12
poly(age, degree = 4)3 7.5661 1.6893 4.479 0.000171
poly(age, degree = 4)4 -0.7002 1.6893 -0.415 0.682339
---
Residual standard error: 1.689 on 23 degrees of freedom
Multiple R-squared: 0.9806, Adjusted R-squared: 0.9773
F-statistic: 291.3 on 4 and 23 DF, p-value: < 2.2e-16
The following R code plots the estimated polynomial curve on the scatter plot:
==========================================================
with(data,lines(age, predict(model),col="black"))
==========================================================
[Figure: jawbone length versus age with the fitted fourth-degree polynomial curve]
The fitted curve appears to describe the data well, but the regression results show that the fourth-degree coefficient is not significant. We therefore remove the fourth-degree term from the model and reduce the polynomial to third order. The polynomial regression model is
==========================================================
model=lm(length~poly(age, degree=3),data=data)
summary(model)
==========================================================
and we have
Call:
lm(formula = length ~ poly(age, degree = 3), data = data)
Residuals:
Min 1Q Median 3Q Max
-3.8051 -0.8179 0.2235 1.0571 2.9557
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.2607 0.3137 125.158 < 2e-16
poly(age, degree = 3)1 52.1100 1.6599 31.394 < 2e-16
poly(age, degree = 3)2 -23.5047 1.6599 -14.160 3.78e-13
poly(age, degree = 3)3 7.5661 1.6599 4.558 0.000128
---
Residual standard error: 1.66 on 24 degrees of freedom
Multiple R-squared: 0.9805, Adjusted R-squared: 0.9781
F-statistic: 402.3 on 3 and 24 DF, p-value: < 2.2e-16
and the fitted polynomial curve is
[Figure: jawbone length versus age with the fitted third-degree polynomial curve]
================================================================
with(data,lines(age, predict(model),col="black"))
================================================================
As can be seen, the third-degree polynomial fits the data well, and all estimated coefficients are significant.
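One caution when reading these summaries: R's poly() uses orthogonal polynomials by default, so the reported coefficients are not the coefficients of age, age^2, and age^3 on the raw scale (use poly(age, degree=3, raw=TRUE), or simply predict(), to work on that scale). Once raw-scale coefficients are in hand, the cubic is cheap to evaluate, for instance by Horner's rule; the Python sketch below uses made-up coefficients purely for illustration.

```python
def cubic(b0, b1, b2, b3, x):
    """Evaluate b0 + b1*x + b2*x**2 + b3*x**3 by Horner's rule."""
    return ((b3 * x + b2) * x + b1) * x + b0

# Hypothetical raw-scale coefficients, for illustration only:
print(cubic(1.0, 2.0, -0.5, 0.25, 2.0))  # -> 5.0
```

Horner's rule evaluates the polynomial with three multiplications and three additions, which also tends to be numerically stabler than summing explicit powers.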
(2) The Multiplicative Regression Model
The multiplicative regression model is
y = e^β0 · x1^β1 · x2^β2 ⋯ xm^βm · e^ε,
where e refers to the Naperian constant used as the base for natural logarithms. Note that the error term e^ε is a multiplicative factor. That is, the value of the deterministic portion is multiplied by the error. The value of this factor, when ε = 0, is one. When the random error is positive the multiplicative factor is greater than 1; when negative it is less than 1. This type of error is quite logical in many applications where variation is proportional to the magnitude of the values of the variable.
The multiplicative model is nonlinear, and it is difficult to use the least squares method to estimate its unknown parameters directly. A practical way to estimate them is to transform the equation from a nonlinear to a linear one. Taking the logarithm of both sides of the multiplicative model gives
log(y) = β0 + β1 log(x1) + β2 log(x2) + ⋯ + βm log(xm) + ε.
This model is easily implemented.
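The effect of the transformation can be checked numerically: for the deterministic part of a one-predictor multiplicative model, the log of the response is exactly linear in the log of the predictor. A small Python sketch, with made-up parameter values:

```python
import math

b0, b1 = 1.2, 2.3                   # hypothetical parameters of y = e^b0 * x^b1
x = 1.5
y = math.exp(b0) * x ** b1          # multiplicative model, error term omitted
lhs = math.log(y)                   # log of the response
rhs = b0 + b1 * math.log(x)         # the linearized (regression) form
print(abs(lhs - rhs) < 1e-12)       # the transform linearizes exactly
```

This is why ordinary least squares applied to the logged data recovers the parameters of the multiplicative model.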
Example:
It is desired to study the size range of squid eaten by sharks and tuna. The beak (mouth) of the squid is indigestible, hence it is found in the digestive tracts of harvested fish; therefore, it may be possible to predict the total squid weight with a regression that uses various beak dimensions as predictors. The beak measurements used here as predictors are the variables rl and w, and the response is the total weight wt.
Data are obtained on a sample of 22 specimens. The data are given in the following table.
obs rl w wt
1 1.31 0.35 1.95
2 1.55 0.47 2.9
3 0.99 0.32 0.72
4 0.99 0.27 0.81
5 1.05 0.3 1.09
6 1.09 0.31 1.22
7 1.08 0.31 1.02
8 1.27 0.34 1.93
9 0.99 0.29 0.64
10 1.34 0.37 2.08
11 1.3 0.38 1.98
12 1.33 0.38 1.9
13 1.86 0.65 8.56
14 1.58 0.5 4.49
15 1.97 0.59 8.49
16 1.8 0.59 6.17
17 1.75 0.59 7.54
18 1.72 0.63 6.36
19 1.68 0.68 7.63
20 1.75 0.62 7.78
21 2.19 0.72 10.15
22 1.73 0.55 6.88
Solution:
First, let us fit the multiple linear regression for the data using R. The output is in the following table.
Call:
lm(formula = wt ~ rl + w, data = data)
Residuals:
Min 1Q Median 3Q Max
-1.6391 -0.5087 0.1070 0.5484 0.9674
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.8349 0.7648 -8.937 3.11e-08 ***
rl 3.2747 1.4161 2.313 0.032117 *
w 13.4008 3.3800 3.965 0.000831 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
[Residual plots for the linear model: residuals versus fitted values and normal Q-Q plot; observations 15 and 18 are flagged as extreme]
The regression appears to fit well and both coefficients are significant, although the p-value for RL is only 0.032. However, the residual plots reveal some problems:
• The residuals have a curved pattern: positive at the extremes and negative
in the center. This pattern suggests a curved response.
• The residuals are less variable with smaller values of the predicted value
and then become increasingly dispersed as values increase. This pattern
reveals a heteroscedasticity problem.
We noted that the logarithmic transformation should be used when the standard
deviation is proportional to the mean. The pattern of residuals for the linear
regression would appear to suggest that the variability is proportional to the size of
the squid. This type of variability is logical for variables related to sizes of
biological specimens, which suggests a multiplicative error. The multiplicative
model itself is appropriate for this example. The following R code is used for estimating it:
==========================================================
model=lm(log(wt)~log(rl)+log(w),data=data)
summary(model)
plot(model)
==========================================================
Call:
lm(formula = log(wt) ~ log(rl) + log(w), data = data)
Residuals:
Min 1Q Median 3Q Max
-0.27314 -0.08790 0.02622 0.12677 0.17398
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.1689 0.4783 2.444 0.024454 *
log(rl) 2.2785 0.4933 4.619 0.000187 ***
log(w) 1.1092 0.3736 2.969 0.007886 **
---
Residual standard error: 0.1492 on 19 degrees of freedom
Multiple R-squared: 0.9768, Adjusted R-squared: 0.9744
F-statistic: 400.5 on 2 and 19 DF, p-value: 2.929e-16
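As a quick sanity check on the fitted log-log model, the sketch below plugs the first specimen's beak measurements (rl = 1.31, w = 0.35) into the fitted equation and back-transforms to the weight scale; the coefficients are taken from the summary above.

```python
import math

# Coefficients from the fitted log-log model above
b0, b_rl, b_w = 1.1689, 2.2785, 1.1092

rl, w = 1.31, 0.35          # first specimen in the data table
log_wt = b0 + b_rl * math.log(rl) + b_w * math.log(w)
pred_wt = math.exp(log_wt)  # back-transform to the original weight scale
print(round(pred_wt, 2))    # close to the observed wt of 1.95
```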
[Residual plots for the log model: residuals versus fitted values and normal Q-Q plot; observations 2, 3, and 21 are flagged]
Converting the fitted equation back to the original units gives the estimated multiplicative model
WT = e^1.1689 · RL^2.2785 · W^1.1092.
(3) Logistic Regression
Logistic regression is designed for the situation where the response variable, y, has only two possible outcomes. In this situation, y is said to be a binary variable because it takes just two values, 0 (for failure) or 1 (for success). For example, y might represent whether a student succeeds or does not succeed in passing college algebra. We focus merely on the probability of a success, p, since the probability of a failure is 1 − p. In other words, the dependent variable y follows a binomial distribution with probability of success p and a single trial, n = 1. In logistic regression our interest is in whether the probability p is influenced by one or more independent variables x1, x2, ..., xm. We will denote the value of p at some specific set of values for the independent variables as p_i. Using the properties of the binomial probability distribution, we have E(y_i) = μ_{y|x} = p_i and Var(y_i) = σ_i² = p_i(1 − p_i).
When we use ordinary regression in this situation, we face two problems:
1. The fitted values are not restricted to the range (0, 1), and there is no way to keep them in that range.
2. Even if we could find such a way, the distribution of the dependent variable is not normal.
The first problem is addressed by expressing the relationship between p_i and the independent variables as a nonlinear function known as the logistic function. Estimating the parameters by maximum likelihood rather than least squares solves the second problem.
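To make the maximum likelihood idea concrete, the sketch below writes out the Bernoulli log-likelihood that logistic regression maximizes. The data and coefficients are hypothetical; a real fit searches for the (b0, b1) that maximize this quantity.

```python
import math

def loglik(b0, b1, xs, ys):
    """Bernoulli log-likelihood of the logistic model ln(odds) = b0 + b1*x."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))   # logistic function
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

# With b0 = b1 = 0, every p is 0.5, so each observation contributes ln(0.5):
print(round(loglik(0.0, 0.0, [1.0, 2.0], [0, 1]), 4))  # -> -1.3863
```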
odds = p_i/(1 − p_i).
Under the logistic regression model,
ln(odds) = β0 + β1 x1 + β2 x2 + ⋯ + βm xm.
This is our familiar linear regression model. However, the linear influence is on the
ln(odds). If βj is positive, then for each unit increase in xj we expect an increase of βj in ln(odds), assuming of course that all other x values can be held constant. In turn, this means that the probability of success must be increasing as xj increases. However, the increase is nonlinear. Once p_i becomes large, further increases in xj can only cause slight increases in p_i.
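This diminishing effect is easy to see numerically. Below is a minimal Python sketch of the logistic curve with hypothetical coefficients: each unit step in x adds the same amount to ln(odds), but the jumps in p shrink as p approaches 1.

```python
import math

def p_success(b0, b1, x):
    """Invert ln(odds) = b0 + b1*x to the success probability p."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

b0, b1 = -4.0, 1.0   # hypothetical coefficients
for x in (2, 4, 6, 8):
    print(x, round(p_success(b0, b1, x), 3))
# Each step adds b1 = 1 to ln(odds), yet p rises 0.119 -> 0.500 -> 0.881 -> 0.982:
# the increments shrink as p nears 1.
```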
We compare the odds of success for two individuals with different values of the independent variables using the odds ratio. If individual 1 has values x11, x12, ..., x1m and individual 2 has values x21, x22, ..., x2m, then their odds ratio is odds_1/odds_2. If the two individuals have the same values of the independent variables, then the odds ratio is 1, meaning the two individuals have the same odds and hence the same probability of success.
Logistic regression can use the same mix of dummy and interval independent variables as ordinary regression. The ln(odds) is sometimes called the logit function. Since the link between the expected value of y_i and the linear expression in terms of the independent variables comes through the logits, the logit is referred to as the link function.
[FIGURE 13.2: observed and fitted concentration]
Solution
We fit the model
ln(odds) = β0 + β1 x,
where x is a city's median family income and odds is the probability a city will have TIF divided by the probability it will not have TIF. We use the following R code:
==========================================================
library(gdata)  # provides read.xls
data=read.xls("File destination", sheet=1)
data
model=glm(tif ~ income, data = data, family = "binomial")
summary(model)
==========================================================
we have the following results
Call:
glm(formula = tif ~ income, family = "binomial", data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8781 -0.8021 -0.4736 0.8097 1.9461
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -11.3487 3.3511 -3.387 0.000708
income 1.0019 0.2954 3.392 0.000695
---
(Dispersion parameter for binomial family taken to be 1)
To test the significance of the income coefficient, we can use the Wald test from the aod package:
===============================================================
library(aod)
wald.test(b = coef(model), Sigma = vcov(model), Terms = 2)
==========================================================
Note that the Terms argument in the wald.test function can be a single number, to test a single coefficient, or a vector, to test several coefficients jointly. For example, to test H0: β1 = β2 = β3 = 0, use Terms=c(2:4).
The following are the results of Wald test
Wald test:
----------
Chi-squared test:
X2 = 11.5, df = 1, P(> X2) = 0.00069
The Wald test shows that the p-value is very small, so we reject the null hypothesis and conclude that β1 differs from zero.
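For a single coefficient, the Wald chi-square is simply the squared z value from the coefficient table, as a quick Python check with the estimates above confirms.

```python
b_income, se = 1.0019, 0.2954      # estimate and std. error from the summary
z = b_income / se
X2 = z * z                          # Wald chi-square with 1 df
print(round(z, 3), round(X2, 1))   # -> 3.392 11.5, matching the output above
```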
For estimating the odds ratio and the 95% confidence interval, use the following
code
=========================================================
exp(cbind(Point_Estimate = coef(model), confint(model, level = 0.95)))
=========================================================
The results show that the estimated odds ratio is 2.7234, and we can conclude that
more wealthy cities have a higher probability of adopting TIF. For every
additional $1000 in median income, the odds of adopting TIF are multiplied by a
factor between 1.603 and 5.205.
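The point estimate is e raised to the fitted coefficient, which a short Python check reproduces. Note that a Wald-type interval exp(b ± 1.96·se) will differ somewhat from R's confint, which uses profile likelihood.

```python
import math

b_income, se = 1.0019, 0.2954                 # from the logistic summary above
odds_ratio = math.exp(b_income)               # odds multiplier per $1000 of income
wald_ci = (math.exp(b_income - 1.96 * se),
           math.exp(b_income + 1.96 * se))
print(round(odds_ratio, 3))                   # -> 2.723
```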
(4) Poisson Regression
The Poisson distribution is widely used as a model for count data. It is frequently
appropriate when the counts are of events in specific regions of time or space.
Dependent variables that might be modeled using Poisson regression would include counts of events of this kind. There is no fixed upper limit on the possible number of events. Recalling the properties of the Poisson, there is a single parameter, μ, which is the expected number of events. It is essential that μ be positive, and the regression function must enforce this. Poisson regression assumes each y_i follows a Poisson distribution with mean μ_i, where
ln(μ_i) = β0 + β1 x1 + β2 x2 + ⋯ + βm xm.
The linear expression on the right may take on either positive or negative values, but μ_i = e^{ln(μ_i)} is always positive, as required.
When the counts come from regions of different sizes s_i, the term ln(s_i) is added to the model. At first glance, the term ln(s_i) may seem like just another independent variable in the Poisson regression. However, its coefficient is identically 1, so that no parameter need be estimated for it. This is called an offset variable, and all Poisson regression software will allow you to indicate such a marker. Sometimes size is only specified up to a constant of proportionality. That is, we might not know exactly the sizes of units i and i′, but we know that unit i is twice the size of unit i′. This suffices, as the unknown proportionality constant will become an additive constant once logarithms are computed, and be combined with the intercept β0.
Example:
Measurements are available on the number of fatalities and the number of workers in a group of industries over a series of years. The following figure graphs the rates per 1000 workers (number of fatalities × 1000/number of workers).
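The two points above, that the log link keeps μ positive and that an offset enters with coefficient fixed at 1, can be sketched in a few lines of Python (the coefficients are hypothetical):

```python
import math

b0, b1 = 0.5, -0.8            # hypothetical Poisson regression coefficients

# The log link: the linear predictor may be negative, but mu never is.
for x in (-3.0, 0.0, 3.0):
    mu = math.exp(b0 + b1 * x)
    assert mu > 0

# An offset ln(s) has coefficient 1, so doubling the size s doubles the mean:
def mu_with_offset(s, x):
    return math.exp(math.log(s) + b0 + b1 * x)

print(round(mu_with_offset(2.0, 1.0) / mu_with_offset(1.0, 1.0), 12))
```

The printed ratio is 2 (up to floating-point rounding): the proportionality carries through the exponential exactly because ln(s) enters additively.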
[Figure: fatality rate per 1000 workers by year]
We would like to see that fatality rates are declining, but is there any evidence that
this is so?
Solution:
We will model the number of fatalities each year as a Poisson variable with mean μ_i = λ_i s_i, where λ_i is the rate of fatalities per worker in year i and s_i is the number of workers in these industries during year i. To model a trend in time, we use
ln(μ_i) = β0 + β1 i,
where i = year − 1982. The link function is the logarithmic function. By using
the following R code
===============================================================
model=glm(fatal ~ 1 + period, family = poisson(link = log), data = data)
summary(model)
===============================================================
we obtain the following results
Call:
glm(formula = fatal ~ 1 + period, family = poisson(link = log),
data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.26087 -0.59709 -0.08476 0.26053 1.81730
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 6.306839 0.029875 211.107 < 2e-16
period -0.015205 0.004903 -3.101 0.00193
From the results, we can conclude that the Poisson regression fits the data properly, since the p-value of the period coefficient is significant. To test the entire relationship, one can perform the Wald test as follows:
==========================================================
library(aod)
wald.test(b=coef(model),Sigma=vcov(model),Terms=2)
==========================================================
and the results are
Wald test:
----------
Chi-squared test:
X2 = 9.6, df = 1, P(> X2) = 0.0019
As is seen, the p-value of the Wald statistic is significant. Thus, we can conclude that the Poisson regression fits the data well.
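The fitted slope can be translated into an annual rate of change: each one-year increase in period multiplies the expected number of fatalities by exp(β1). A quick Python computation with the estimate above:

```python
import math

b_period = -0.015205                      # slope from the Poisson summary above
annual_factor = math.exp(b_period)        # multiplier on expected fatalities per year
decline_pct = (1.0 - annual_factor) * 100.0
print(round(annual_factor, 4), round(decline_pct, 2))  # -> 0.9849 1.51
```

That is, the model estimates roughly a 1.5% decline in expected fatalities per year, which answers the question posed above in the affirmative.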
We can see this fit as follows
==========================================================
data$pred<- exp(model$linear.predictors)
with(data,plot(period,fatal,type="p"))
with(data,lines(period,pred,lty=1))
==========================================================
[Figure: observed fatalities (points) and fitted Poisson regression curve (line) versus period]