
FEATURE

Residual Analysis and Data Transformations: Important Tools in Statistical Analysis
George C.J. Fernandez¹
Department of Agricultural Economics, University of Nevada-Reno, Reno, NV 89557-0107
Received for publication 13 May 1991. Accepted for publication 2 Dec. 1991. The cost of publishing this paper was defrayed in part by the payment of page charges. Under postal regulations, this paper therefore must be hereby marked advertisement solely to indicate this fact.
¹Assistant Professor in Plant Breeding and Biometrics.

Analysis of variance (ANOVA) is a commonly used statistical analysis in agricultural experiments. Additivity, variance homogeneity, and normality are often considered prerequisites for ANOVA (Cochran, 1943; Eisenhart, 1947). The interpretation of ANOVA is valid when the random errors are independently distributed according to a normal distribution with zero mean and an unknown but fixed variance (Kempthorne, 1952; Scheffe, 1959; Steel and Torrie, 1980). Failure to meet one or more of these assumptions affects the significance levels and the sensitivity of the F test (Gomez and Gomez, 1984; Kempthorne, 1952; Little and Hills, 1978). Thus, strong deviations from one or more of the assumptions must be checked and corrected before the statistical analysis and interpretation of the results.

Discrepancies of many kinds between an assumed model and the data can be detected by studying the error component or residuals (Anscombe and Tukey, 1963; Emerson and Stoto, 1983). The residuals are the deviations of the observed values from the values predicted by the assumed model. If the assumptions of the model are valid, a residual plot (a scatter plot of the residuals against the predicted values) will show a random distribution. If the residual plot has an unexplained systematic pattern, then the ANOVA model is not appropriate. Residual plots can be used to detect violations of assumptions in ANOVA, such as variance heterogeneity (unequal variance), auto-correlated error (nonindependence), and the presence of outliers. Thus, it is crucial to examine the residuals before interpreting the data.

Violation of assumptions in ANOVA

Nonadditivity, variance heterogeneity, and nonnormality. The additivity requirement implies that the block and treatment effects should be additive. For example, in a randomized complete-block design, the difference in observed values for any two treatments should be the same in every block except for the experimental error component (Finney, 1989). Equality of variance refers to the variance of the error component, which should be the same for all treatments and all blocks. A synonym for this condition is homogeneity of variance or homoscedasticity, and the converse condition is called heterogeneity of variance or heteroscedasticity. The normality assumption implies that every individual error component should be drawn from a normal frequency distribution. The existence of a relationship between the size of the residuals and the predicted value indicates that the variance of the residuals is functionally related to the mean. This type of variance heterogeneity is usually associated with nonadditivity and/or nonnormally distributed data (Box et al., 1978; Gomez and Gomez, 1984), and a wedge- or fan-shaped pattern is seen in the residual plots (Emerson and Stoto, 1983). Ott (1988) proposed an alternative test, Hartley's test for homogeneity of variance, to verify the assumption of equal variance. The residuals also can be examined for normality and homogeneity by drawing normal probability plots and box plots by treatment, respectively, using PROC UNIVARIATE in SAS (SAS Institute, Inc., 1988).

Auto-correlations. One essential requirement of ANOVA is that the "error" components of the observed responses should be independent of each other. The assumption of no auto-correlation is secured by proper randomization. If the "error" components are not independent, the validity of the F test of significance can be seriously impaired (Finney, 1989; Sokal and Rohlf, 1987). There is no simple adjustment or transformation to overcome the lack of independence of error. The basic design of the experiment or the way in which the analysis is performed must be changed to deal with this problem. A cyclic pattern in the residual plot is an indication of auto-correlated (nonindependent) error (Fernandez, 1990a; Gomez and Gomez, 1984). If auto-correlated errors are observed in residual plots in special experimental layouts, a repeated-measures ANOVA (Fernandez, 1991) or moving mean covariance analysis (Fernandez, 1990a) may be appropriate to make adjustments for auto-correlation.

Outliers. Outliers or influential observations can be detected by plotting the standardized residuals against predicted values. If the absolute value of the standardized residual is >2.5, that observation can be treated as an outlier (Freund and Littell, 1986; SAS, 1988). If the residual analysis reveals influential or extreme observations (outliers), check first whether the outlier is due to a recording error. Do not seek an excuse for rejecting the outlier, but investigate the possibility that the outlier may have unexplained implications worthy of further investigation.

Data transformations

Data are transformed to make them conform more closely to the assumptions underlying the ANOVA (Bartlett, 1947). Transformation is undertaken with three objectives: i) to make the error variances more nearly homogeneous; ii) to improve additivity; iii) to produce a more nearly normal error distribution (Finney, 1989). The transformation of data implies the replacement of each observation by some simple function of its magnitude, followed by a standard ANOVA. Thus, the original data are transformed to a new scale, resulting in data that are expected to satisfy the assumptions of additivity, normality, and homogeneity of variance. Because a common transformation scale is used for all observations in the data, treatment ranks are not altered, and the mean comparisons remain valid.

A convenient rule of thumb for deciding whether transformation would be effective is to find the ratio between the largest and the smallest data values. Transformation could be helpful when the ratio is large, >20 (Emerson and Stoto, 1983). Tests of homogeneity of variance for two or more samples can be performed using Bartlett's test of homogeneity (Snedecor and Cochran, 1956). SAS code for performing Bartlett's test of homogeneity is available in the SAS/STAT sample library examples (SAS Institute, Inc., 1988).

If variance stabilization is the primary objective of transformation, then efforts should be made to find the transformation that best achieves it. Logarithmic, square-root, and arcsin conversions are the most commonly used transformations for ANOVA of problem data (Gomez and Gomez, 1984; Sokal and Rohlf, 1987; Steel and Torrie, 1980).
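The two numerical checks described above, the cutoff of 2.5 on standardized residuals and the largest-to-smallest ratio rule of thumb, can be sketched in a few lines of Python. This is an illustrative sketch only, not the SAS procedure cited in the text. It assumes a one-way layout in which the predicted value is the treatment mean, and it standardizes by the root mean square error alone, which is a simplification of the studentized residuals that SAS reports. The function names and the data are hypothetical.

```python
import math


def standardized_residuals(groups):
    """Return (predicted, standardized residual) pairs for a one-way layout.

    groups maps a treatment label to its list of observations. The predicted
    value is the treatment mean (assumption: one-way model), and each residual
    is divided by the root mean square error.
    """
    raw = []
    sse = 0.0
    n_total = 0
    for values in groups.values():
        treatment_mean = sum(values) / len(values)
        for y in values:
            raw.append((treatment_mean, y - treatment_mean))
            sse += (y - treatment_mean) ** 2
        n_total += len(values)
    error_df = n_total - len(groups)
    rmse = math.sqrt(sse / error_df)
    return [(pred, res / rmse) for pred, res in raw]


def flag_outliers(std_residuals, cutoff=2.5):
    """Indices of observations whose |standardized residual| exceeds cutoff."""
    return [i for i, (_, r) in enumerate(std_residuals) if abs(r) > cutoff]


def max_min_ratio(values):
    """Largest-to-smallest ratio; values above 20 suggest a transformation."""
    smallest = min(values)
    if smallest <= 0:
        raise ValueError("the ratio rule needs strictly positive data")
    return max(values) / smallest


# Hypothetical data: three treatments, six plots each, one suspicious value.
plots = {
    "A": [10, 11, 9, 10, 11, 9],
    "B": [20, 21, 19, 20, 21, 19],
    "C": [30, 31, 29, 30, 31, 41],
}
residuals = standardized_residuals(plots)
suspects = flag_outliers(residuals)
ratio = max_min_ratio([y for values in plots.values() for y in values])
```

With these made-up numbers only the last observation of treatment C (index 17) exceeds the cutoff, and the ratio of 41/9 is well below 20. A flagged value is a prompt to check the record, not a licence to discard it.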

HORTSCIENCE, VOL. 27(4), APRIL 1992 297
Log transformation: When the treatment standard deviation (S) is proportional to the treatment mean and the treatment effects are multiplicative, a log transformation is recommended (Steel and Torrie, 1980; Gomez and Gomez, 1984). Data consisting of "whole" numbers that cover a wide range of values (number of diseased plants per plot, number of pods per square meter) often need a log transformation.

Square-root transformation: This transformation is appropriate for data consisting of small whole numbers from rare events, e.g., number of insects captured in a trap. For such data, the variance is proportional to the mean. If the data contain zeros, 0.5 or 1 is added to the original data before performing square-root or log transformations, respectively.

Arcsin transformation: A typical characteristic of percentages based on counts is that the variances of means near 0% and 100% tend to be smaller than the variances of means near the middle range (30% to 70%) (Finney, 1989). Percentage data based on counts are discrete and have a binomial distribution. The arcsin or angular transformation is appropriate for data of this type, obtained from counts and expressed as decimal fractions or percentages. If the percentages range from 30% to 70%, the arcsin transformation is not needed. Arcsin transformations convert the percentages to angles whose sines are the square roots of the percentages expressed as decimals:

$\mathrm{Arcsin}(Y) = \sin^{-1}\left(Y^{0.5}\right)$

where the Ys are the decimal fractions. If the data include values of 0% and 100%, these values are replaced by (1/4n) and [100 - (1/4n)], respectively, where n is the total number of units upon which the percentage data were based. Tables of arcsin values can be obtained from statistical text books (Gomez and Gomez, 1984; Steel and Torrie, 1980) or by using the ARSIN function in SAS (1988). The arcsin transformation is inappropriate for unconstrained percentages, such as rates of growth increase, that might have values >100% or even negative values (Finney, 1989).

Power transformation. When the functional relationship between the treatment means and variances is unknown, it is possible to use the data to estimate a suitable transformation. Box and Cox (1964) proposed the power transformation

$Y^{(t)} = (Y^{\lambda} - 1)/\lambda \quad (\lambda \neq 0), \qquad Y^{(t)} = \log Y \quad (\lambda = 0)$

where $Y^{(t)}$ is the transformed response, λ varies over the range of -2 to 2, and the residual sum of squares SSE(λ) should be minimal. Box et al. (1978) described a relatively simple method to determine the suitable power transformation using the data in question. The following steps describe this method: 1) Estimate the treatment means ($\bar{Y}$) (single factor) or treatment combination means (two or more factors) and the treatment standard deviations (S). 2) Calculate the logs of the $\bar{Y}$s and the logs of the Ss. 3) Plot log(S) on log($\bar{Y}$) and examine for a linear relationship. A strong nonlinear relationship indicates that a simple power transformation is not appropriate for such data, and distribution-free, nonparametric methods such as the Kruskal-Wallis test, Wilcoxon rank-sum test, and Mann-Whitney U test (SAS, 1988) should be considered as alternative methods (Kempthorne, 1952; Sokal and Rohlf, 1987). 4) Regress log(S) on log($\bar{Y}$) and test for a significant linear relationship. If the regression is not significant (P > 0.05), data transformation usually is not necessary. A significant regression (P < 0.05) indicates that the data should be transformed and the regression coefficient estimated. 5) Estimate the power (λ) of the transformation by subtracting the regression coefficient (β) from 1.

The value of the power (λ) indicates the appropriate transformation. For example, if β approximately equals 2, then λ = 1 - β = -1. Thus, the appropriate transformation would be reciprocals. Some commonly used transformations and their power (λ) values are: λ = 1, no transformation; λ = 0.5, square root; λ = 0, log; and λ = -1, reciprocal.

If the appropriate transformation is estimated by Box's method using the data in question, one df is usually taken away from the error df in the ANOVA, since the same data are used to determine the proper transformation (Box et al., 1978).

In addition to these transformations, for which the transformed variable has constant variance, there are two transformations that have been used extensively in biological assay that do not have this property, i.e., the probit and logit transformations for variables that have values between 0 and 1 (Kempthorne, 1952). Comprehensive accounts of the probit and logit transformations can be found in Finney (1962) and Berkson (1944), respectively.

Fig. 1. SAS program statements for analysis of variance and residual analysis of original and square-root and log-transformed data for mungbean plant heights.

If the power transformation fails to suggest a suitable transformation because of extreme observations in the data, ranks of the observations can be used in ANOVA (Conover and Iman, 1981). Many nonparametric statistical methods (the Wilcoxon rank-sum test, the Kruskal-Wallis k-sample test, and Friedman's two-way analysis) use ranks, and analyses based on ranks are often better than those based on the original observations (Quade, 1966).

Tests of significance and mean separation should be carried out on the transformed data. Care should be taken in interpreting means of transformed data. To keep the interpretation in the metric of the original scale, the transformed means and associated confidence intervals can be back-transformed (Gomez and Gomez, 1984) and reported in parentheses along with the transformed means.

Analysis of rating scale data. Rating scales can be defined as a series of numbers representing the degree of intensity of some characteristic based on visual or sensory estimates (Little, 1985). A comprehensive account of how to analyze rating scale data can be found in Little (1985).

Checking for violations of assumptions in ANOVA by residual analysis is very important but is practiced less commonly in agricultural research since it involves additional computations and graphical display of residuals. Further, choosing the appropriate transformation is not straightforward, and without examining the residuals, it is difficult to confirm the appropriateness of the transformation.

The use of SAS software in statistical analysis is rapidly increasing with the availability of command-driven SAS for personal computers (PC-SAS). In a recent study, PC-SAS was identified as one of the more versatile and easy-to-use software programs available on the market (Milliken and Remmenga, 1989). In addition, PC-SAS provides powerful data management and is flexible in formatting output (Fernandez, 1990b). With SAS available to perform the residual analysis and to estimate the appropriate power transformation, the horticulturist can easily perform residual analyses. Therefore, the purpose of this paper is to emphasize the importance of residual analysis and to present PC-SAS program statements to perform residual analysis and estimate the appropriate power transformation.
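The regression steps of the method of Box et al. (1978) described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the SAS program of Fig. 2. Natural logs are assumed (the slope does not depend on the base). The significance test of step 4 is omitted because it needs a t distribution. The table of named powers and the rule of picking the nearest one are conveniences introduced here. All function names and data are hypothetical.

```python
import math
from statistics import mean, stdev


def estimate_power(groups):
    """Regress log(S) on log(mean) across treatments; return (beta, lambda).

    groups is a list of lists, one list of observations per treatment (or
    treatment combination). Every treatment needs a positive mean and a
    nonzero standard deviation so that the logs exist.
    """
    log_means = [math.log(mean(values)) for values in groups]
    log_sds = [math.log(stdev(values)) for values in groups]
    x_bar = mean(log_means)
    y_bar = mean(log_sds)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(log_means, log_sds))
    sxx = sum((x - x_bar) ** 2 for x in log_means)
    beta = sxy / sxx          # least-squares slope of log(S) on log(mean)
    return beta, 1.0 - beta   # step 5: lambda = 1 - beta


# Named powers (assumption: a short lookup used only for this sketch).
NAMED_POWERS = {1.0: "none", 0.5: "square root", 0.0: "log", -1.0: "reciprocal"}


def suggest_transformation(lam):
    """Return the named transformation whose power is closest to lam."""
    nearest = min(NAMED_POWERS, key=lambda p: abs(p - lam))
    return NAMED_POWERS[nearest]


# Hypothetical data built so that S is proportional to the squared mean.
cells = [[4.75, 5.0, 5.25], [9.0, 10.0, 11.0], [16.0, 20.0, 24.0]]
beta, lam = estimate_power(cells)
choice = suggest_transformation(lam)
```

For this constructed data set the slope is 2, so λ is -1 and the reciprocal is suggested. In practice the plot of step 3 and the significance test of step 4 should be examined before the estimate of λ is trusted.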
Example
Data on mungbean [Vigna radiata (L.) Wilczek] plant height at 50% flowering for eight genotypes, grown in two separate experiments in the summer and the fall seasons at the Asian Vegetable Research and Development Center in Taiwan, are used here as a worked example to emphasize the importance of residual analysis. The design in each season was a randomized complete block with three replications, and a combined ANOVA over seasons was carried out. Statistical analysis was carried out using PC-SAS (SAS, 1988), and the SAS program statements for this and the subsequent analyses (log, square-root, and inverse transformed) are presented in Figs. 1 and 2.
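As a language-neutral illustration of the transformations named in this example and described earlier, the following Python sketch applies the square-root, log, inverse, and arcsin conversions. It is not the SAS code of Fig. 1. It assumes natural logs, angles reported in degrees, and a reading of "1/4n" as 1/(4n) in percentage units. The function names and the sample values are hypothetical.

```python
import math


def sqrt_transform(values):
    """Square root; 0.5 is added to every value when the data contain zeros."""
    shift = 0.5 if any(v == 0 for v in values) else 0.0
    return [math.sqrt(v + shift) for v in values]


def log_transform(values):
    """Natural log; 1 is added to every value when the data contain zeros."""
    shift = 1.0 if any(v == 0 for v in values) else 0.0
    return [math.log(v + shift) for v in values]


def inverse_transform(values):
    """Reciprocal (power lambda = -1); values must be nonzero."""
    return [1.0 / v for v in values]


def arcsin_transform(percentages, n):
    """Angular transformation of percentages based on counts of n units.

    0% and 100% are replaced by 1/(4n) and 100 - 1/(4n) before conversion
    (assumption: this is how the replacement rule is meant to be read).
    """
    low = 1.0 / (4 * n)
    high = 100.0 - low
    angles = []
    for p in percentages:
        if p == 0:
            p = low
        elif p == 100:
            p = high
        angles.append(math.degrees(math.asin(math.sqrt(p / 100.0))))
    return angles


# Hypothetical plant heights (cm) and germination percentages from 20 seeds.
heights = [42.0, 55.5, 61.0]
inverse_heights = inverse_transform(heights)
angles = arcsin_transform([0, 50, 100], n=20)
```

Whichever conversion is used, the same function is applied to every observation, so treatment ranks on the transformed scale are preserved or, for the reciprocal, exactly reversed.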
The "wedge-shaped" residual plot (Fig. 3a) for the untransformed data clearly indicated the presence of unequal variance or heterogeneity. The residual plots for the commonly used transformations (log and square-root) (Fig. 3 b-c) also indicated the presence of heterogeneity even after transformation. Therefore, the statistical significance levels and the sensitivity of the F test for the untransformed, log-transformed, and square-root-transformed mungbean plant height data are biased.

Fig. 2. SAS program statements for the estimation of the suitable power transformation and for the adjustment in ANOVA due to the loss in one degree of freedom.

Table 1. Comparisons of ANOVA statistics (P values) for the original and the transformed data for mungbean plant heights obtained in two separate plantings.

Source               df     HT        SQRT(HT)   Log(HT)   1/HT
Seasons (S)           1     0.0001    0.0001     0.0001    0.0001
Replicate[season]     4     0.924     0.917      0.906     0.866
Genotype (G)          7     0.0014    0.0004     0.0001    0.0001
S × G                 7     0.6007    0.5756     0.3993    0.0267
Error                27(z)

(z) The df for the inverse-transformed data is one less than the error df since the same data have been used to choose the appropriate power transformation according to Box et al. (1978).

The method of Box et al. (1978) for power transformation was used to estimate the appropriate power transformation for this data set. The SAS statements are given in Fig. 2. The means ($\bar{Y}$) and the standard deviations (S) for genotype × season combinations were estimated. The log(S) was regressed on the log($\bar{Y}$). The regression coefficient (β1 = 2.142) is significant (P = 0.0067). From the regression coefficient, the power (λ) was estimated:

$\lambda = 1 - \beta_1 = 1 - 2.142 = -1.142 \approx -1$
Thus, Box's method on the plant height data suggested that an inverse transformation would be appropriate.

The results of the ANOVA on the untransformed, log-transformed, square-root-transformed, and inverse-transformed plant height data were compared (Table 1). One df was taken away from the error df for the inverse-transformed data since the same data had been used to choose the appropriate power transformation (Box et al., 1978). The appropriate SAS statements for making adjustments in the error df are given in Fig. 2. The random distribution of the residuals against the predicted values of the inverse-transformed data in the residual plot (Fig. 3d) clearly shows that the inverse transformation removed the heterogeneity of variance.

The P values for the original and the transformed plant height data indicated that the significance levels for the main effects of season and genotype were in agreement for the original and the transformed data. However, the season × genotype interaction was significant (P = 0.0267) for the inverse-transformed data, whereas the P values for the original, log-transformed, and square-root-transformed data indicated that the interaction was not significant (P > 0.39). Because of the heterogeneity of variance, the differential responses of genotypes across the two seasons were not detected in the original, log-transformed, or square-root-transformed data. This example clearly indicates that if the assumption of homogeneity of variance is not met in ANOVA, both the significance levels and the sensitivity of the F tests are biased.

This example clearly shows that significance levels in ANOVA may be biased if the data violate the assumptions of the ANOVA. Furthermore, this may lead one to draw incorrect conclusions from the analysis. Therefore, checking data for violation of the ANOVA assumptions before discussing the results is critical. Residual analysis is a powerful tool to detect the problems associated with violation of the ANOVA assumptions. The commonly used transformations such as the log and the square-root conversions may not be appropriate for every data set. The method of power transformation according to Box et al. (1978) is recommended to estimate the appropriate transformation. Although the residual analysis and the estimation of the power transformation need additional calculations, with the use of the PC-SAS program presented here, the proper ANOVA can be performed without tedious calculations.
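The degrees-of-freedom bookkeeping behind Table 1, including the one df removed for the inverse-transformed data, can be checked with a short Python sketch. It assumes the combined randomized complete-block layout described in the example (2 seasons, 3 replications per season, 8 genotypes). The function names are hypothetical, and the mean square and error sum of squares passed to the F ratio in the usage lines are made-up numbers, not results from the paper.

```python
def combined_rcb_df(seasons, replications, genotypes):
    """Degrees of freedom for a combined ANOVA over seasons (RCB per season)."""
    total = seasons * replications * genotypes - 1
    df = {
        "season": seasons - 1,
        "replicate[season]": seasons * (replications - 1),
        "genotype": genotypes - 1,
        "season x genotype": (seasons - 1) * (genotypes - 1),
    }
    # Error df is whatever remains after the model terms are removed.
    df["error"] = total - sum(df.values())
    return df


def adjusted_f_ratio(ms_effect, sse, error_df, df_lost=1):
    """F ratio after removing df_lost error df for a data-chosen power."""
    adjusted_df = error_df - df_lost
    adjusted_mse = sse / adjusted_df
    return ms_effect / adjusted_mse, adjusted_df


df_table = combined_rcb_df(seasons=2, replications=3, genotypes=8)
# Hypothetical mean square and error sum of squares, for illustration only.
f_value, error_df = adjusted_f_ratio(ms_effect=10.0, sse=54.0,
                                     error_df=df_table["error"])
```

For this layout the unadjusted error df works out to 28, so removing one gives the 27 error df listed in Table 1. Because the error sum of squares is divided by a smaller df, the adjusted error mean square is slightly larger and the F test slightly more conservative.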



Fig. 3. Residual plots of the original, square-root and log, and inverse-transformed data for mungbean plant heights.

Literature Cited
Anscombe, F.J. and J.W. Tukey. 1963. The examination and analysis of residuals. Technometrics 5:141-160.
Bartlett, M.S. 1947. The use of transformations. Biometrics 3:39-52.
Berkson, J. 1944. Application of the logistic function to bio-assay. J. Amer. Stat. Assn. 39:357-365.
Box, G.E.P. and D.R. Cox. 1964. An analysis of transformations. J. Royal Stat. Soc. Ser. B. 26:211-243.
Box, G.E.P., W.G. Hunter, and J.S. Hunter. 1978. Statistics for experimenters: An introduction to design, data analysis, and model building. Wiley, New York.
Cochran, W.G. 1943. Some consequences when the assumptions for the analysis of variance are not satisfied. Biometrics 3:22-38.
Conover, W.J. and R.L. Iman. 1981. Rank transformation as a bridge between parametric and nonparametric statistics. Amer. Statistician 35:124-129.
Eisenhart, C. 1947. The assumptions underlying the analysis of variance. Biometrics 3:1-21.
Emerson, J.D. and M.A. Stoto. 1983. Transforming data, p. 97-126. In: D.C. Hoaglin, F. Mosteller, and J.W. Tukey (eds.). Understanding robust and exploratory data analysis. Wiley, New York.
Fernandez, G.C.J. 1990a. Evaluation of moving mean and border row mean covariance analysis for error control in yield trials. J. Amer. Soc. Hort. Sci. 115:241-244.
Fernandez, G.C.J. 1990b. Analysis of lattice design using PC-SAS. HortScience 25:1450.
Fernandez, G.C.J. 1991. Repeated measure analysis of line-source sprinkler experiments. HortScience 26:339-342.
Finney, D.J. 1962. Probit analysis. 2nd ed. Cambridge Univ. Press, Cambridge, U.K. p. 20-64.
Finney, D.J. 1989. Was this in your statistical text book? V. Transformation of data. Expt. Agr. 25:165-175.
Freund, R.J. and R.C. Littell. 1986. SAS system for regression. SAS Institute, Inc., Cary, N.C.
Gomez, K.A. and A.A. Gomez. 1984. Statistical procedures for agricultural research. 2nd ed. Wiley, New York. p. 680.
Kempthorne, O. 1952. Design and analysis of experiments. Wiley, New York.
Little, T.M. 1985. Analysis of percentage and rating scale data. HortScience 20:642-644.
Little, T.M. and F.J. Hills. 1978. Agricultural experimentation: Design and analysis. Wiley, New York. p. 350.
Milliken, G.A. and M.D. Remmenga. 1989. Statistical analysis and the personal computer. HortScience 24:45-52.
Ott, L. 1988. An introduction to statistical methods and data analysis. 3rd ed. PWS-Kent, Boston. p. 835.
Quade, D. 1966. On analysis of variance for the k-sample problem. Annals Mathematical Stat. 37:1747-1758.
SAS Institute, Inc. 1988. SAS/STAT user's guide, release 6.03. SAS Institute, Inc., Cary, N.C.
Scheffe, H. 1959. The analysis of variance. Wiley, New York. p. 555.
Snedecor, G.W. and W.G. Cochran. 1956. Statistical methods applied to experiments in agriculture and biology. The Iowa State College Press, Ames.
Sokal, R.R. and F.J. Rohlf. 1987. Introduction to biostatistics. W.H. Freeman, New York.
Steel, R.G.D. and J.H. Torrie. 1980. Principles and procedures of statistics. McGraw-Hill, New York.

