Transformation (Case Study)


Dr. Maria Cha
Jan 27, 2023

Case 1: Non-constant variance


(1) Original data
cleaning <- read.delim("~/Desktop/23W Stats101A/Data/cleaning.txt")
attach(cleaning)
m1 <- lm(Rooms~Crews)
summary(m1)

##
## Call:
## lm(formula = Rooms ~ Crews)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.9990 -4.9901 0.8046 4.0010 17.0010
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7847 2.0965 0.851 0.399
## Crews 3.7009 0.2118 17.472 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.336 on 51 degrees of freedom
## Multiple R-squared: 0.8569, Adjusted R-squared: 0.854
## F-statistic: 305.3 on 1 and 51 DF, p-value: < 2.2e-16

par(mfrow=c(1,2))
plot(Crews,Rooms,xlab="Number of Crews",ylab="Number of Rooms Cleaned")
abline(m1)
StanRes1 <- rstandard(m1)
plot(Crews,sqrt(abs(StanRes1)),xlab="Number of Crews", ylab="Square Root(|Standardized Residuals|)")
abline(lsfit(Crews,sqrt(abs(StanRes1))))

par(mfrow=c(2,2))
plot(m1)

From the standardized residual plot, it is evident that the variability in the standardized residuals tends to increase with the number of crews.
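As a supplementary numeric check (my own addition, not part of the original notes), the score test for non-constant error variance from the car package can back up the visual impression; a small p-value is evidence of heteroscedasticity.

library(car)    # provides ncvTest()
ncvTest(m1)     # score test for non-constant error variance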

(2) Transformed data: take the square roots of both X and Y

In this case, since the data are in the form of counts, we shall try the square root transformation of both X and Y.
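A brief simulation sketch (my own illustration, not from the notes) of why the square root is the usual variance-stabilizing transformation for counts: for Poisson-like data Var(Y) = E(Y) grows with the mean, while Var(sqrt(Y)) stays close to 1/4 regardless of the mean.

set.seed(123)
sapply(c(5, 20, 80), function(mu) {
  y <- rpois(1e5, mu)                            # simulated counts with mean mu
  c(var_y = var(y), var_sqrt_y = var(sqrt(y)))   # variance grows; variance of sqrt stays near 1/4
})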

sqrtcrews <- sqrt(Crews)
sqrtrooms <- sqrt(Rooms)
m2 <- lm(sqrtrooms~sqrtcrews)
summary(m2)

##
## Call:
## lm(formula = sqrtrooms ~ sqrtcrews)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.09825 -0.43988 0.06826 0.42726 1.20275
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2001 0.2757 0.726 0.471
## sqrtcrews 1.9016 0.0936 20.316 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.594 on 51 degrees of freedom
## Multiple R-squared: 0.89, Adjusted R-squared: 0.8879
## F-statistic: 412.7 on 1 and 51 DF, p-value: < 2.2e-16

par(mfrow=c(1,2))
plot(sqrtcrews,sqrtrooms,xlab="Sqrt(Number of Crews)",ylab="Sqrt(Number of Rooms Cleaned)")
abline(m2)
StanRes2 <- rstandard(m2)
plot(sqrtcrews,sqrt(abs(StanRes2)),xlab="Square Root(Number of Crews)", ylab="Square Root(|Standardized Residuals|)")
abline(lsfit(sqrtcrews,sqrt(abs(StanRes2))))


par(mfrow=c(2,2))
plot(m2)

From the standardized residual plot, the variability in the standardized residuals remains relatively constant.
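Repeating the score test on the transformed model (again my addition, not part of the notes) should no longer flag non-constant variance.

ncvTest(m2)   # a large p-value is consistent with constant variance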

Case 2: Nonlinearity
(1) Original Data
responsetransformation <- read.delim("~/Desktop/23W Stats101A/Data/responsetransformation.txt")
attach(responsetransformation)

plot(x,y)


m1 <- lm(y~x)
par(mfrow=c(2,2))
plot(m1)

The scatter plot and the standardized residual plot show that the two variables have a nonlinear relationship. Also, the variance is not constant but changes dramatically across x values. We should consider a transformation.
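A supplementary diagnostic (not in the original notes): residualPlots() from the car package plots the residuals against x and against the fitted values, and reports Tukey's curvature test, which formalizes the visual impression of nonlinearity.

residualPlots(m1)   # a significant curvature test supports a nonlinear relationship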

(2) Transformed data: Y → Y^(1/3)


ty <- y^(1/3)
m2 <- lm(ty~x)
plot(x,ty,ylab=expression(Y^(1/3)))
abline(m2)


summary(m2)

##
## Call:
## lm(formula = ty ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.144265 -0.035128 -0.002067 0.035897 0.161090
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.008947 0.011152 0.802 0.423
## x 0.996451 0.004186 238.058 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05168 on 248 degrees of freedom
## Multiple R-squared: 0.9956, Adjusted R-squared: 0.9956
## F-statistic: 5.667e+04 on 1 and 248 DF, p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(m2)

(b) Finding the best transformation


Explore the original data distribution for X and Y.


par(mfrow=c(2,3))
plot(density(y,bw="SJ",kern="gaussian"),type="l",
main="Gaussian kernel density estimate",xlab="y")
boxplot(y,ylab="Y")
qqnorm(y, ylab = "Y")
qqline(y, lty = 2, col=2)
sj <- bw.SJ(x,lower = 0.05, upper = 100)
plot(density(x,bw=sj,kern="gaussian"),type="l",
main="Gaussian kernel density estimate",xlab="x")
boxplot(x,ylab="x")
qqnorm(x, ylab = "x")
qqline(x, lty = 2, col=2)

(1) Inverse Response Plot


library(car)

## Loading required package: carData

par(mfrow=c(1,1))
inverseResponsePlot(m1,key=TRUE)

       lambda        RSS
    0.3321126   265.8749
   -1.0000000 46673.8798
    0.0000000  3583.8067
    1.0000000  7136.8828

The inverse response plot shows that λ = 0.33 would produce the best result.
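The following hand-rolled sketch (my own illustration, not code from the notes) shows what the inverse response plot is doing: for each candidate lambda, regress the fitted values from m1 on y^lambda (log(y) for lambda = 0) and record the residual sum of squares; the lambda with the smallest RSS is the suggested power for transforming Y.

yhat <- fitted(m1)
rss_for_lambda <- function(lambda) {
  ty <- if (abs(lambda) < 1e-8) log(y) else y^lambda
  sum(resid(lm(yhat ~ ty))^2)
}
sapply(c(-1, 0, 1/3, 1), rss_for_lambda)                # compare with the RSS column above
optimize(rss_for_lambda, interval = c(-2, 2))$minimum   # should be close to 0.33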

(2) Box-Cox transformation


library(MASS)
par(mfrow=c(1,1))
boxcox(m1,lambda=seq(0.28,0.39,length=20))

bc <- powerTransform(m1)
summary(bc)

## bcPower Transformation to Normality


## Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## Y1 0.3326 0.33 0.3197 0.3454
##
## Likelihood ratio test that transformation parameter is equal to 0
## (log transformation)
## LRT df pval
## LR test, lambda = (0) 699.1082 1 < 2.22e-16
##
## Likelihood ratio test that no transformation is needed
## LRT df pval
## LR test, lambda = (1) 924.444 1 < 2.22e-16

The R result for the power transformation suggests λ = 0.332.
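As a quick follow-up (my addition, not part of the notes), testTransform() from car tests a specific value of lambda; the convenient rounded power 1/3 lies inside the Wald interval reported above, so it should not be rejected.

testTransform(bc, lambda = 1/3)   # LR test of lambda = 1/3 against the estimated power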

Check if the transformed Y is now normal. (Note: the power transformation does not guarantee normality of the transformed data, so we should check.)

ty <- y^(1/3)
par(mfrow=c(2,2))
sj <- bw.SJ(ty,lower = 0.05, upper = 100)
plot(density(ty,bw=sj,kern="gaussian"),type="l",
main="Gaussian kernel density estimate",xlab=expression(Y^(1/3)))
boxplot(ty,ylab=expression(Y^(1/3)))
qqnorm(ty, ylab = expression(Y^(1/3)))
qqline(ty, lty = 2, col=2)
m2 <- lm(ty~x)
plot(x,ty,ylab=expression(Y^(1/3)))
abline(m2)


summary(m2)

##
## Call:
## lm(formula = ty ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.144265 -0.035128 -0.002067 0.035897 0.161090
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.008947 0.011152 0.802 0.423
## x 0.996451 0.004186 238.058 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05168 on 248 degrees of freedom
## Multiple R-squared: 0.9956, Adjusted R-squared: 0.9956
## F-statistic: 5.667e+04 on 1 and 248 DF, p-value: < 2.2e-16

Case 4: Transform X and Y simultaneously


(1) Original Data
data(salarygov)
attach(salarygov)
m1 <- lm(MaxSalary~Score)
par(mfrow=c(1,1))
plot(Score,MaxSalary)
abline(m1,lty=2,col=2)


par(mfrow=c(2,2))
plot(m1)

par(mfrow=c(2,3))
plot(density(MaxSalary,bw="SJ",kern="gaussian"),type="l",
main="Gaussian kernel density estimate",xlab="MaxSalary")
boxplot(MaxSalary,ylab="MaxSalary")
qqnorm(MaxSalary, ylab = "MaxSalary")
qqline(MaxSalary, lty = 2, col=2)
plot(density(Score,bw="SJ",kern="gaussian"),type="l",
main="Gaussian kernel density estimate",xlab="Score")
boxplot(Score,ylab="Score")
qqnorm(Score, ylab = "Score")
qqline(Score, lty = 2, col=2)


(2) Transformed data

Find the best transformation using the Box-Cox transformation.

bc <- powerTransform(cbind(MaxSalary,Score)~1)
summary(bc)

## bcPower Transformations to Multinormality


## Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## MaxSalary -0.0973 0.0 -0.2483 0.0537
## Score 0.5974 0.5 0.4619 0.7329
##
## Likelihood ratio test that transformation parameters are equal to 0
## (all log transformations)
## LRT df pval
## LR test, lambda = (0 0) 125.0901 2 < 2.22e-16
##
## Likelihood ratio test that no transformations are needed
## LRT df pval
## LR test, lambda = (1 1) 211.0704 2 < 2.22e-16

Re-run the regression analysis based on the transformed data (the rounded powers 0 and 0.5 correspond to a log transformation of MaxSalary and a square-root transformation of Score).

m2 <- lm(log(MaxSalary)~sqrt(Score))
par(mfrow=c(1,1))
plot(sqrt(Score),log(MaxSalary),xlab=expression(sqrt(Score)))
abline(m2,lty=2,col=2)


par(mfrow=c(2,3))
plot(density(log(MaxSalary),bw="SJ",kern="gaussian"),type="l",
main="Gaussian kernel density estimate",xlab="log(MaxSalary)")
boxplot(log(MaxSalary),ylab="log(MaxSalary)")
qqnorm(log(MaxSalary), ylab = "log(MaxSalary)")
qqline(log(MaxSalary), lty = 2, col=2)
sj <- bw.SJ(sqrt(Score),lower = 0.05, upper = 100)
plot(density(sqrt(Score),bw=sj,kern="gaussian"),type="l",
main="Gaussian kernel density estimate",xlab=expression(sqrt(Score)))
boxplot(sqrt(Score),ylab=expression(sqrt(Score)))
qqnorm(sqrt(Score), ylab=expression(sqrt(Score)))
qqline(sqrt(Score), lty = 2, col=2)

par(mfrow=c(2,2))
plot(m2)
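As an optional variation (my own sketch, not part of the notes), the same model can be written with bcPower() from car using the rounded powers, where lambda = 0 gives the log and lambda = 0.5 a rescaled square root. Because the predictor differs from sqrt(Score) only by a linear rescaling, this reproduces the fitted values and R-squared of m2.

m2_bc <- lm(bcPower(MaxSalary, 0) ~ bcPower(Score, 0.5))   # m2_bc is a hypothetical name
summary(m2_bc)$r.squared                                   # equals summary(m2)$r.squared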

Case 5: Transform X and Y using logarithms: the result shows the ‘percentage effect.’
confood1 <- read.delim("~/Desktop/23W Stats101A/Data/confood1.txt")
attach(confood1)

m3 <- lm(Sales~Price)
summary(m3)


##
## Call:
## lm(formula = Sales ~ Price)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1051.6 -419.3 -10.3 303.2 4745.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7223 1094 6.601 2.53e-08 ***
## Price -8706 1496 -5.820 4.17e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 838.3 on 50 degrees of freedom
## Multiple R-squared: 0.4038, Adjusted R-squared: 0.3919
## F-statistic: 33.87 on 1 and 50 DF, p-value: 4.166e-07

par(mfrow=c(1,2))
hist(Sales)
hist(Price)

par(mfrow=c(1,1))
plot(Price,Sales, main="Original Data")
abline(m3)

par(mfrow=c(2,2))
plot(m3)


Notice that the distribution of each variable in the two histograms appears to be skewed with a large outlier. In addition, it is clear that a straight
line does not adequately model the relationship between Price and Sales based on the scatter plot.

When studying the relationship between price and quantity in economics, it is common practice to take the logarithms of both price and quantity
since interest lies in predicting the effect of a 1% increase in price on quantity sold.

logP <- log(Price)
logS <- log(Sales)
m4 <- lm(logS~logP)
par(mfrow=c(1,2))
plot(logP,logS,xlab="log(Price)",ylab="log(Sales)", main="Log-transformed Data")
abline(m4)

summary(m4)

##
## Call:
## lm(formula = logS ~ logP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.88973 -0.18188 0.04025 0.22087 1.31026
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.8029 0.1744 27.53 < 2e-16 ***
## logP -5.1477 0.5098 -10.10 1.16e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4013 on 50 degrees of freedom
## Multiple R-squared: 0.671, Adjusted R-squared: 0.6644
## F-statistic: 102 on 1 and 50 DF, p-value: 1.159e-13

par(mfrow=c(2,2))
plot(m4)

According to the slope estimate (-5.1), we estimate that for every 1% increase in price there will be approximately a 5.1% reduction in demand. However, the standardized residual plot still shows some nonrandom pattern, so we should not be satisfied with the current fitted model.
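The arithmetic behind the ‘percentage effect’ reading (my own illustration, not from the notes): in the log-log model log(Sales) = b0 + b1*log(Price), a 1% increase in price multiplies expected sales by about 1.01^b1, which for small changes is roughly a b1 percent change.

b1 <- coef(m4)["logP"]
(1.01^b1 - 1) * 100   # about -5, i.e. roughly a 5% drop in sales per 1% price increase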
