##
## Call:
## lm(formula = Rooms ~ Crews)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.9990 -4.9901 0.8046 4.0010 17.0010
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7847 2.0965 0.851 0.399
## Crews 3.7009 0.2118 17.472 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.336 on 51 degrees of freedom
## Multiple R-squared: 0.8569, Adjusted R-squared: 0.854
## F-statistic: 305.3 on 1 and 51 DF, p-value: < 2.2e-16
par(mfrow=c(1,2))
plot(Crews,Rooms,xlab="Number of Crews",ylab="Number of Rooms Cleaned")
abline(m1)
StanRes1 <- rstandard(m1)
plot(Crews,sqrt(abs(StanRes1)),xlab="Number of Crews", ylab="Square Root(|Standardized Residuals|)")
abline(lsfit(Crews,sqrt(abs(StanRes1))))
par(mfrow=c(2,2))
plot(m1)
From these plots, it is evident that the variability in the standardized residuals tends to increase with the number of crews; the constant-variance assumption is violated, so a variance-stabilizing transformation is warranted.
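The square-root fix applied next can be motivated with a small simulation (hypothetical Poisson counts, not the cleaning data): for count-like responses the variance grows with the mean, and sqrt() roughly equalizes it.

```r
# Hypothetical simulation (not the cleaning data): Poisson counts have
# variance equal to their mean, so groups with larger means are noisier.
set.seed(1)
small <- rpois(5000, lambda = 4)    # low-mean counts
large <- rpois(5000, lambda = 40)   # high-mean counts

var(large) / var(small)             # roughly 10: variance tracks the mean

# After a square-root transform, both variances are near the constant 1/4
var(sqrt(large)) / var(sqrt(small)) # close to 1
```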
##
## Call:
## lm(formula = sqrtrooms ~ sqrtcrews)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.09825 -0.43988 0.06826 0.42726 1.20275
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2001 0.2757 0.726 0.471
## sqrtcrews 1.9016 0.0936 20.316 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.594 on 51 degrees of freedom
## Multiple R-squared: 0.89, Adjusted R-squared: 0.8879
## F-statistic: 412.7 on 1 and 51 DF, p-value: < 2.2e-16
par(mfrow=c(1,2))
plot(sqrtcrews,sqrtrooms,xlab="Sqrt(Number of Crews)",ylab="Sqrt(Number of Rooms Cleaned)")
abline(m2)
StanRes2 <- rstandard(m2)
plot(sqrtcrews,sqrt(abs(StanRes2)),xlab="Square Root(Number of Crews)", ylab="Square Root(|Standardized Residuals|)")
abline(lsfit(sqrtcrews,sqrt(abs(StanRes2))))
par(mfrow=c(2,2))
plot(m2)
Case 2: Nonlinearity
(1) Original Data
responsetransformation <- read.delim("~/Desktop/23W Stats101A/Data/responsetransformation.txt")
attach(responsetransformation)
plot(x,y)
m1 <- lm(y~x)
par(mfrow=c(2,2))
plot(m1)
The scatter plot and the standardized residual plot show that the two variables have a nonlinear relationship. Moreover, the variance is not constant but changes dramatically across x values. We should consider a transformation.
par(mfrow=c(2,3))
plot(density(y,bw="SJ",kern="gaussian"),type="l",
main="Gaussian kernel density estimate",xlab="y")
boxplot(y,ylab="Y")
qqnorm(y, ylab = "Y")
qqline(y, lty = 2, col=2)
sj <- bw.SJ(x,lower = 0.05, upper = 100)
plot(density(x,bw=sj,kern="gaussian"),type="l",
main="Gaussian kernel density estimate",xlab="x")
boxplot(x,ylab="x")
qqnorm(x, ylab = "x")
qqline(x, lty = 2, col=2)
par(mfrow=c(1,1))
inverseResponsePlot(m1,key=TRUE)
##       lambda        RSS
## 1  0.3321126   265.8749
## 2 -1.0000000 46673.8798
## 3  0.0000000  3583.8067
## 4  1.0000000  7136.8828
The inverse response plot shows that λ = 0.33 (≈ 1/3, i.e., a cube-root transformation) would produce the best result.
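What inverseResponsePlot() does can be sketched on simulated data (an assumption; these are not the course data): it regresses the fitted values from the linear model on y^λ and reports the λ that minimizes the RSS. If the truth is y ∝ x³, the minimizing λ sits near 1/3.

```r
# Simulated sketch (not the course data): recover lambda near 1/3 by grid
# search, mimicking what car::inverseResponsePlot() reports.
set.seed(2)
x_sim <- runif(200, 1, 5)
y_sim <- (2 * x_sim)^3 * exp(rnorm(200, sd = 0.05))
yhat  <- fitted(lm(y_sim ~ x_sim))

# RSS from regressing the fitted values on y^lambda
rss_for <- function(lambda) {
  ty <- if (lambda == 0) log(y_sim) else y_sim^lambda
  sum(resid(lm(yhat ~ ty))^2)
}
grid <- seq(0.1, 1, by = 0.01)
grid[which.min(sapply(grid, rss_for))]  # typically close to 1/3
```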
bc <- powerTransform(m1)
summary(bc)
Check whether the transformed Y is now normal. (Note: the power transformation does not guarantee normality of the transformed data, so we should check.)
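As a hedged illustration of such a check (simulated data, not the course dataset): if Y is the cube of a normal variable, the raw values are right-skewed, while the cube-root transform recovers symmetry. A simple sample-skewness statistic makes this concrete alongside the QQ plot.

```r
# Simulated stand-in data (not the course dataset): the cube of a normal
# variable is right-skewed; its cube root is normal by construction.
set.seed(3)
y_sim <- rnorm(2000, mean = 5)^3

# Simple sample skewness (no extra packages needed)
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew(y_sim)          # clearly positive: right-skewed
skew(y_sim^(1/3))    # near zero: roughly symmetric
```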
ty <- y^(1/3)
par(mfrow=c(2,2))
sj <- bw.SJ(ty,lower = 0.05, upper = 100)
plot(density(ty,bw=sj,kern="gaussian"),type="l",
main="Gaussian kernel density estimate",xlab=expression(Y^(1/3)))
boxplot(ty,ylab=expression(Y^(1/3)))
qqnorm(ty, ylab = expression(Y^(1/3)))
qqline(ty, lty = 2, col=2)
m2 <- lm(ty~x)
plot(x,ty,ylab=expression(Y^(1/3)))
abline(m2)
summary(m2)
##
## Call:
## lm(formula = ty ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.144265 -0.035128 -0.002067 0.035897 0.161090
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.008947 0.011152 0.802 0.423
## x 0.996451 0.004186 238.058 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05168 on 248 degrees of freedom
## Multiple R-squared: 0.9956, Adjusted R-squared: 0.9956
## F-statistic: 5.667e+04 on 1 and 248 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(m2)
par(mfrow=c(2,3))
plot(density(MaxSalary,bw="SJ",kern="gaussian"),type="l",
main="Gaussian kernel density estimate",xlab="MaxSalary")
boxplot(MaxSalary,ylab="MaxSalary")
qqnorm(MaxSalary, ylab = "MaxSalary")
qqline(MaxSalary, lty = 2, col=2)
plot(density(Score,bw="SJ",kern="gaussian"),type="l",
main="Gaussian kernel density estimate",xlab="Score")
boxplot(Score,ylab="Score")
qqnorm(Score, ylab = "Score")
qqline(Score, lty = 2, col=2)
(2) Transformed Data
Find the best transformation using the Box-Cox method.
bc <- powerTransform(cbind(MaxSalary,Score)~1)
summary(bc)
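powerTransform() chooses λ by maximizing the Box-Cox likelihood. A minimal univariate sketch of what it optimizes (on simulated lognormal data, not the salary data) shows the profile log-likelihood peaking near λ = 0, i.e., the log transformation:

```r
# Hedged sketch (simulated lognormal data, not the salary data): maximize the
# Box-Cox profile log-likelihood over a grid of lambda values.
set.seed(4)
v <- exp(rnorm(500, mean = 2, sd = 0.5))

bc_loglik <- function(lambda, v) {
  n <- length(v)
  tv <- if (lambda == 0) log(v) else (v^lambda - 1) / lambda
  # Normal log-likelihood of the transformed data plus the Jacobian term
  -n / 2 * log(sum((tv - mean(tv))^2) / n) + (lambda - 1) * sum(log(v))
}
grid <- seq(-1, 1, by = 0.05)
grid[which.max(sapply(grid, bc_loglik, v = v))]  # near 0: log transform
```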
m2 <- lm(log(MaxSalary)~sqrt(Score))
par(mfrow=c(1,1))
plot(sqrt(Score),log(MaxSalary),xlab=expression(sqrt(Score)))
abline(m2,lty=2,col=2)
par(mfrow=c(2,3))
plot(density(log(MaxSalary),bw="SJ",kern="gaussian"),type="l",
main="Gaussian kernel density estimate",xlab="log(MaxSalary)")
boxplot(log(MaxSalary),ylab="log(MaxSalary)")
qqnorm(log(MaxSalary), ylab = "log(MaxSalary)")
qqline(log(MaxSalary), lty = 2, col=2)
sj <- bw.SJ(sqrt(Score),lower = 0.05, upper = 100)
plot(density(sqrt(Score),bw=sj,kern="gaussian"),type="l",
main="Gaussian kernel density estimate",xlab=expression(sqrt(Score)))
boxplot(sqrt(Score),ylab=expression(sqrt(Score)))
qqnorm(sqrt(Score), ylab=expression(sqrt(Score)))
qqline(sqrt(Score), lty = 2, col=2)
par(mfrow=c(2,2))
plot(m2)
m3 <- lm(Sales~Price)
summary(m3)
##
## Call:
## lm(formula = Sales ~ Price)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1051.6 -419.3 -10.3 303.2 4745.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7223 1094 6.601 2.53e-08 ***
## Price -8706 1496 -5.820 4.17e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 838.3 on 50 degrees of freedom
## Multiple R-squared: 0.4038, Adjusted R-squared: 0.3919
## F-statistic: 33.87 on 1 and 50 DF, p-value: 4.166e-07
par(mfrow=c(1,2))
hist(Sales)
hist(Price)
par(mfrow=c(1,1))
plot(Price,Sales, main="Original Data")
abline(m3)
par(mfrow=c(2,2))
plot(m3)
Notice that the distribution of each variable in the two histograms appears to be skewed, with a large outlier. In addition, the scatter plot makes clear that a straight line does not adequately model the relationship between Price and Sales.
When studying the relationship between price and quantity in economics, it is common practice to take logarithms of both price and quantity, since interest lies in predicting the effect of a 1% increase in price on the quantity sold.
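The elasticity interpretation of a log-log slope can be checked with a line of arithmetic (using the slope value −5.1477 reported in the model output): if log(Q) = a + b·log(P), a 1% price increase multiplies Q by 1.01^b, which for small changes is roughly a b% change in quantity.

```r
# Worked check of the log-log elasticity interpretation.
b <- -5.1477                 # slope estimate from the fitted log-log model
change <- 1.01^b - 1         # exact proportional change in Q for +1% in P
round(100 * change, 2)       # about -4.99, i.e. roughly a 5% reduction
```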
logS <- log(Sales)
logP <- log(Price)
m4 <- lm(logS ~ logP)
summary(m4)
##
## Call:
## lm(formula = logS ~ logP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.88973 -0.18188 0.04025 0.22087 1.31026
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.8029 0.1744 27.53 < 2e-16 ***
## logP -5.1477 0.5098 -10.10 1.16e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4013 on 50 degrees of freedom
## Multiple R-squared: 0.671, Adjusted R-squared: 0.6644
## F-statistic: 102 on 1 and 50 DF, p-value: 1.159e-13
par(mfrow=c(2,2))
plot(m4)
According to the slope estimate (−5.15), we estimate that for every 1% increase in price there will be approximately a 5.1% reduction in demand.
However, the standardized residual plot still shows a nonrandom pattern, so we should not be satisfied with the current fitted model.