Transformation (Case Study)


Dr. Maria Cha
Jan 27, 2023

Case 1: Non-constant variance


(1) Original data
cleaning <- read.delim("~/Desktop/23W Stats101A/Data/cleaning.txt")
attach(cleaning)
m1 <- lm(Rooms~Crews)
summary(m1)

##
## Call:
## lm(formula = Rooms ~ Crews)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.9990 -4.9901 0.8046 4.0010 17.0010
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7847 2.0965 0.851 0.399
## Crews 3.7009 0.2118 17.472 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.336 on 51 degrees of freedom
## Multiple R-squared: 0.8569, Adjusted R-squared: 0.854
## F-statistic: 305.3 on 1 and 51 DF, p-value: < 2.2e-16

par(mfrow=c(1,2))
plot(Crews,Rooms,xlab="Number of Crews",ylab="Number of Rooms Cleaned")
abline(m1)
StanRes1 <- rstandard(m1)
plot(Crews,sqrt(abs(StanRes1)),xlab="Number of Crews", ylab="Square Root(|Standardized Residuals|)")
abline(lsfit(Crews,sqrt(abs(StanRes1))))

par(mfrow=c(2,2))
plot(m1)

From the standardized residual plot, it is evident that the variability in the standardized residuals tends to increase with the number of crews.
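As a supplementary numeric check (my own addition, not part of the original notes), the score test for non-constant error variance from the car package can back up the visual impression; a small p-value is evidence of heteroscedasticity.

library(car)    # provides ncvTest()
ncvTest(m1)     # score test for non-constant error variance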

(2) Transformed data: take the square roots of both X and Y

In this case, since the data are in the form of counts, we shall try the square root transformation of both X and Y.
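A brief simulation sketch (my own illustration, not from the notes) of why the square root is the usual variance-stabilizing transformation for counts: for Poisson-like data Var(Y) = E(Y) grows with the mean, while Var(sqrt(Y)) stays close to 1/4 regardless of the mean.

set.seed(123)
sapply(c(5, 20, 80), function(mu) {
  y <- rpois(1e5, mu)                            # simulated counts with mean mu
  c(var_y = var(y), var_sqrt_y = var(sqrt(y)))   # variance grows; variance of sqrt stays near 1/4
})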

sqrtcrews <- sqrt(Crews)
sqrtrooms <- sqrt(Rooms)
m2 <- lm(sqrtrooms~sqrtcrews)
summary(m2)

##
## Call:
## lm(formula = sqrtrooms ~ sqrtcrews)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.09825 -0.43988 0.06826 0.42726 1.20275
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2001 0.2757 0.726 0.471
## sqrtcrews 1.9016 0.0936 20.316 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.594 on 51 degrees of freedom
## Multiple R-squared: 0.89, Adjusted R-squared: 0.8879
## F-statistic: 412.7 on 1 and 51 DF, p-value: < 2.2e-16

par(mfrow=c(1,2))
plot(sqrtcrews,sqrtrooms,xlab="Sqrt(Number of Crews)",ylab="Sqrt(Number of Rooms Cleaned)")
abline(m2)
StanRes2 <- rstandard(m2)
plot(sqrtcrews,sqrt(abs(StanRes2)),xlab="Square Root(Number of Crews)", ylab="Square Root(|Standardized Residuals|)")
abline(lsfit(sqrtcrews,sqrt(abs(StanRes2))))


par(mfrow=c(2,2))
plot(m2)

From the standardized residual plot, the variability in the standardized residuals remains relatively constant.
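Repeating the score test on the transformed model (again my addition, not part of the notes) should no longer flag non-constant variance.

ncvTest(m2)   # a large p-value is consistent with constant variance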

Case 2: Nonlinearity
(1) Original Data
responsetransformation <- read.delim("~/Desktop/23W Stats101A/Data/responsetransformation.txt")
attach(responsetransformation)

plot(x,y)


m1 <- lm(y~x)
par(mfrow=c(2,2))
plot(m1)

The scatter plot and the standardized residual plot show that the two variables have a nonlinear relationship. Also, the variance is not constant but changes dramatically across x values. We should consider a transformation.
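A supplementary diagnostic (not in the original notes): residualPlots() from the car package plots the residuals against x and against the fitted values, and reports Tukey's curvature test, which formalizes the visual impression of nonlinearity.

residualPlots(m1)   # a significant curvature test supports a nonlinear relationship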

(2) Transformed data: Y → Y^(1/3)


ty <- y^(1/3)
m2 <- lm(ty~x)
plot(x,ty,ylab=expression(Y^(1/3)))
abline(m2)


summary(m2)

##
## Call:
## lm(formula = ty ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.144265 -0.035128 -0.002067 0.035897 0.161090
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.008947 0.011152 0.802 0.423
## x 0.996451 0.004186 238.058 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05168 on 248 degrees of freedom
## Multiple R-squared: 0.9956, Adjusted R-squared: 0.9956
## F-statistic: 5.667e+04 on 1 and 248 DF, p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(m2)

(b) Finding the best transformation


Explore the original data distribution for X and Y.


par(mfrow=c(2,3))
plot(density(y,bw="SJ",kern="gaussian"),type="l",
main="Gaussian kernel density estimate",xlab="y")
boxplot(y,ylab="Y")
qqnorm(y, ylab = "Y")
qqline(y, lty = 2, col=2)
sj <- bw.SJ(x,lower = 0.05, upper = 100)
plot(density(x,bw=sj,kern="gaussian"),type="l",
main="Gaussian kernel density estimate",xlab="x")
boxplot(x,ylab="x")
qqnorm(x, ylab = "x")
qqline(x, lty = 2, col=2)

(1) Inverse Response Plot


library(car)

## Loading required package: carData

par(mfrow=c(1,1))
inverseResponsePlot(m1,key=TRUE)

       lambda        RSS
    0.3321126   265.8749
   -1.0000000 46673.8798
    0.0000000  3583.8067
    1.0000000  7136.8828

The inverse response plot shows that λ = 0.33 would produce the best result.
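The following hand-rolled sketch (my own illustration, not code from the notes) shows what the inverse response plot is doing: for each candidate lambda, regress the fitted values from m1 on y^lambda (log(y) for lambda = 0) and record the residual sum of squares; the lambda with the smallest RSS is the suggested power for transforming Y.

yhat <- fitted(m1)
rss_for_lambda <- function(lambda) {
  ty <- if (abs(lambda) < 1e-8) log(y) else y^lambda
  sum(resid(lm(yhat ~ ty))^2)
}
sapply(c(-1, 0, 1/3, 1), rss_for_lambda)                # compare with the RSS column above
optimize(rss_for_lambda, interval = c(-2, 2))$minimum   # should be close to 0.33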

(2) Box-Cox transformation


library(MASS)
par(mfrow=c(1,1))
boxcox(m1,lambda=seq(0.28,0.39,length=20))

bc <- powerTransform(m1)
summary(bc)

## bcPower Transformation to Normality


## Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## Y1 0.3326 0.33 0.3197 0.3454
##
## Likelihood ratio test that transformation parameter is equal to 0
## (log transformation)
## LRT df pval
## LR test, lambda = (0) 699.1082 1 < 2.22e-16
##
## Likelihood ratio test that no transformation is needed
## LRT df pval
## LR test, lambda = (1) 924.444 1 < 2.22e-16

The R result for the power transformation suggests λ = 0.332.
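As a quick follow-up (my addition, not part of the notes), testTransform() from car tests a specific value of lambda; the convenient rounded power 1/3 lies inside the Wald interval reported above, so it should not be rejected.

testTransform(bc, lambda = 1/3)   # LR test of lambda = 1/3 against the estimated power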

Check if the transformed Y is now normal. (Note: the power transformation does not guarantee normality of the transformed data, so we should check.)

ty <- y^(1/3)
par(mfrow=c(2,2))
sj <- bw.SJ(ty,lower = 0.05, upper = 100)
plot(density(ty,bw=sj,kern="gaussian"),type="l",
main="Gaussian kernel density estimate",xlab=expression(Y^(1/3)))
boxplot(ty,ylab=expression(Y^(1/3)))
qqnorm(ty, ylab = expression(Y^(1/3)))
qqline(ty, lty = 2, col=2)
m2 <- lm(ty~x)
plot(x,ty,ylab=expression(Y^(1/3)))
abline(m2)


summary(m2)

##
## Call:
## lm(formula = ty ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.144265 -0.035128 -0.002067 0.035897 0.161090
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.008947 0.011152 0.802 0.423
## x 0.996451 0.004186 238.058 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05168 on 248 degrees of freedom
## Multiple R-squared: 0.9956, Adjusted R-squared: 0.9956
## F-statistic: 5.667e+04 on 1 and 248 DF, p-value: < 2.2e-16

Case 4: Transform X and Y simultaneously


(1) Original Data
data(salarygov)
attach(salarygov)
m1 <- lm(MaxSalary~Score)
par(mfrow=c(1,1))
plot(Score,MaxSalary)
abline(m1,lty=2,col=2)


par(mfrow=c(2,2))
plot(m1)

par(mfrow=c(2,3))
plot(density(MaxSalary,bw="SJ",kern="gaussian"),type="l",
main="Gaussian kernel density estimate",xlab="MaxSalary")
boxplot(MaxSalary,ylab="MaxSalary")
qqnorm(MaxSalary, ylab = "MaxSalary")
qqline(MaxSalary, lty = 2, col=2)
plot(density(Score,bw="SJ",kern="gaussian"),type="l",
main="Gaussian kernel density estimate",xlab="Score")
boxplot(Score,ylab="Score")
qqnorm(Score, ylab = "Score")
qqline(Score, lty = 2, col=2)


(2) Transformed data

Find the best transformation using the Box-Cox transformation.

bc <- powerTransform(cbind(MaxSalary,Score)~1)
summary(bc)

## bcPower Transformations to Multinormality


## Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## MaxSalary -0.0973 0.0 -0.2483 0.0537
## Score 0.5974 0.5 0.4619 0.7329
##
## Likelihood ratio test that transformation parameters are equal to 0
## (all log transformations)
## LRT df pval
## LR test, lambda = (0 0) 125.0901 2 < 2.22e-16
##
## Likelihood ratio test that no transformations are needed
## LRT df pval
## LR test, lambda = (1 1) 211.0704 2 < 2.22e-16

Re-run the regression analysis based on the transformed data (the rounded powers 0 and 0.5 correspond to a log transformation of MaxSalary and a square-root transformation of Score).

m2 <- lm(log(MaxSalary)~sqrt(Score))
par(mfrow=c(1,1))
plot(sqrt(Score),log(MaxSalary),xlab=expression(sqrt(Score)))
abline(m2,lty=2,col=2)


par(mfrow=c(2,3))
plot(density(log(MaxSalary),bw="SJ",kern="gaussian"),type="l",
main="Gaussian kernel density estimate",xlab="log(MaxSalary)")
boxplot(log(MaxSalary),ylab="log(MaxSalary)")
qqnorm(log(MaxSalary), ylab = "log(MaxSalary)")
qqline(log(MaxSalary), lty = 2, col=2)
sj <- bw.SJ(sqrt(Score),lower = 0.05, upper = 100)
plot(density(sqrt(Score),bw=sj,kern="gaussian"),type="l",
main="Gaussian kernel density estimate",xlab=expression(sqrt(Score)))
boxplot(sqrt(Score),ylab=expression(sqrt(Score)))
qqnorm(sqrt(Score), ylab=expression(sqrt(Score)))
qqline(sqrt(Score), lty = 2, col=2)

par(mfrow=c(2,2))
plot(m2)
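As an optional variation (my own sketch, not part of the notes), the same model can be written with bcPower() from car using the rounded powers, where lambda = 0 gives the log and lambda = 0.5 a rescaled square root. Because the predictor differs from sqrt(Score) only by a linear rescaling, this reproduces the fitted values and R-squared of m2.

m2_bc <- lm(bcPower(MaxSalary, 0) ~ bcPower(Score, 0.5))   # m2_bc is a hypothetical name
summary(m2_bc)$r.squared                                   # equals summary(m2)$r.squared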

Case 5: Transform X and Y using logarithms: the result shows the ‘percentage effect.’
confood1 <- read.delim("~/Desktop/23W Stats101A/Data/confood1.txt")
attach(confood1)

m3 <- lm(Sales~Price)
summary(m3)


##
## Call:
## lm(formula = Sales ~ Price)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1051.6 -419.3 -10.3 303.2 4745.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7223 1094 6.601 2.53e-08 ***
## Price -8706 1496 -5.820 4.17e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 838.3 on 50 degrees of freedom
## Multiple R-squared: 0.4038, Adjusted R-squared: 0.3919
## F-statistic: 33.87 on 1 and 50 DF, p-value: 4.166e-07

par(mfrow=c(1,2))
hist(Sales)
hist(Price)

par(mfrow=c(1,1))
plot(Price,Sales, main="Original Data")
abline(m3)

par(mfrow=c(2,2))
plot(m3)


Notice that the distribution of each variable in the two histograms appears to be skewed with a large outlier. In addition, it is clear that a straight
line does not adequately model the relationship between Price and Sales based on the scatter plot.

When studying the relationship between price and quantity in economics, it is common practice to take the logarithms of both price and quantity
since interest lies in predicting the effect of a 1% increase in price on quantity sold.

logP <- log(Price)
logS <- log(Sales)
m4 <- lm(logS~logP)
par(mfrow=c(1,2))
plot(logP,logS,xlab="log(Price)",ylab="log(Sales)", main="Log-transformed Data")
abline(m4)

summary(m4)

##
## Call:
## lm(formula = logS ~ logP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.88973 -0.18188 0.04025 0.22087 1.31026
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.8029 0.1744 27.53 < 2e-16 ***
## logP -5.1477 0.5098 -10.10 1.16e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4013 on 50 degrees of freedom
## Multiple R-squared: 0.671, Adjusted R-squared: 0.6644
## F-statistic: 102 on 1 and 50 DF, p-value: 1.159e-13

par(mfrow=c(2,2))
plot(m4)

According to the slope estimate (-5.1), we estimate that for every 1% increase in price there will be approximately a 5.1% reduction in demand. However, the standardized residual plot still shows some nonrandom pattern, so we should not be satisfied with the current fitted model.
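The arithmetic behind the ‘percentage effect’ reading (my own illustration, not from the notes): in the log-log model log(Sales) = b0 + b1*log(Price), a 1% increase in price multiplies expected sales by about 1.01^b1, which for small changes is roughly a b1 percent change.

b1 <- coef(m4)["logP"]
(1.01^b1 - 1) * 100   # about -5, i.e. roughly a 5% drop in sales per 1% price increase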
