Univariate Analysis and Normality Test: 1
Univariate Analysis and Normality Test Using SAS,
STATA, and SPSS
Hun Myoung Park
This document summarizes graphical and numerical methods for univariate analysis and
normality test, and illustrates how to test normality using SAS 9.1, STATA 9.2 SE, and
SPSS 14.0.
1. Introduction
2. Graphical Methods
3. Numerical Methods
4. Testing Normality Using SAS
5. Testing Normality Using STATA
6. Testing Normality Using SPSS
7. Conclusion
1. Introduction
Descriptive statistics provide important information about variables. Mean, median, and
mode measure the central tendency of a variable. Measures of dispersion include variance,
standard deviation, range, and interquartile range (IQR). Researchers may draw a
histogram, a stem-and-leaf plot, or a box plot to see how a variable is distributed.
Statistical methods are based on various underlying assumptions. One common
assumption is that a random variable is normally distributed. In many statistical analyses,
normality is often conveniently assumed without any empirical evidence or test. But
normality is critical in many statistical methods. When this assumption is violated,
interpretation and inference may not be reliable or valid.
Figure 1. Comparing the Standard Normal and a Bimodal Probability Distributions
[Figure: two density plots, titled "Standard Normal Distribution" and "Bimodal Distribution". Horizontal axes run from -5 to 5; vertical axes from 0 to .4.]
The t-test and ANOVA (analysis of variance) compare group means, assuming that variables
follow normal probability distributions. Otherwise, these methods do not make much
sense. Figure 1 illustrates the standard normal probability distribution and a bimodal
distribution. How can you compare means of these two random variables?
http://www.indiana.edu/~statmath
© 2002-2006 The Trustees of Indiana University
There are two ways of testing normality (Table 1). Graphical methods display the
distributions of random variables or differences between an empirical distribution and a
theoretical distribution (e.g., the standard normal distribution). Numerical methods
present summary statistics such as skewness and kurtosis, or conduct statistical tests of
normality. Graphical methods are intuitive and easy to interpret, while numerical
methods provide more objective ways of examining normality.
Table 1. Graphical Methods versus Numerical Methods
                Graphical Methods          Numerical Methods
Descriptive     Stem-and-leaf plot         Skewness
                (Skeletal) box plot        Kurtosis
                Histogram
Theory-driven   P-P plot                   Shapiro-Wilk, Shapiro-Francia test
                Q-Q plot                   Kolmogorov-Smirnov test (Lilliefors test)
                                           Anderson-Darling/Cramer-von Mises tests
                                           Jarque-Bera test, Skewness-Kurtosis test
Graphical and numerical methods are either descriptive or theory-driven. The dot plot
and histogram, for instance, are descriptive graphical methods, while skewness and
kurtosis are descriptive numerical methods. The P-P and Q-Q plots are theory-driven
graphical methods for testing normality, whereas the Shapiro-Wilk W and Jarque-Bera
tests are theory-driven numerical methods.
Figure 2. Histograms of Normally and Nonnormally Distributed Variables
[Figure: two histograms. Left panel: "A Normally Distributed Variable (N=500)"; horizontal axis: Randomly Drawn from the Standard Normal Distribution (Seed=1,234,567), -3 to 3. Right panel: "A Nonnormally Distributed Variable (N=164)"; horizontal axis: Per Capita Gross National Income in 2005 ($1,000), 0 to 60.]
Three variables are employed here. The first variable is the unemployment rate of Illinois,
Indiana, and Ohio in 2005. The second variable includes 500 observations that were
randomly drawn from the standard normal distribution. This variable is supposed to be
normally distributed with zero mean and a variance of 1 (left plot in Figure 2). An
example of a nonnormal distribution is per capita gross national income (GNI) in 2005
of 164 countries in the world. This variable, GNIP, is severely skewed to the right and is
very unlikely to be normally distributed (right plot in Figure 2). See the Appendix for
details about these variables.
2. Graphical Methods
Graphical methods visualize the distribution of a random variable and compare the
distribution to a theoretical one using plots. These methods are either descriptive or
theory-driven. The former method is based on the empirical data, whereas the latter
considers both empirical and theoretical distributions.
2.1 Descriptive Plots
Among frequently used descriptive plots are the stem-and-leaf plot, dot plot, (skeletal)
box plot, and histogram. When N is small, a stem-and-leaf plot or dot plot is useful to
summarize data. A stem-and-leaf plot and dot plot work well for continuous or event
count variables. Figure 3 presents the stem-and-leaf plots for unemployment rates of
three states.
Figure 3. Stem-and-Leaf Plot of Unemployment Rate of Illinois, Indiana, Ohio

-> state = IL
Stem-and-leaf plot for rate (Rate)
rate rounded to nearest multiple of .1
plot in units of .1
   3. | 7889
   4* | 011122344
   4. | 556666666677778888999
   5* | 0011122222333333344444
   5. | 5555667777777888999
   6* | 000011222333444
   6. | 555579
   7* | 0033
   7. |
   8* | 0
   8. | 8

-> state = IN
Stem-and-leaf plot for rate (Rate)
rate rounded to nearest multiple of .1
plot in units of .1
   3* | 1
   3. | 89
   4* | 012234
   4. | 566666778889999
   5* | 00000111222222233344
   5. | 555666666777889
   6* | 002222233344
   6. | 5666677889
   7* | 1113344
   7. | 67
   8* | 14

-> state = OH
Stem-and-leaf plot for rate (Rate)
rate rounded to nearest multiple of .1
plot in units of .1
   3* | 8
   4* | 014577899
   5* | 01223333445556667778888888999
   6* | 001111122222233444446678899
   7* | 01223335677
   8* | 1223338
   9* | 99
  10* | 1
  11* |
  12* |
  13* | 3
A box plot presents the minimum, 25th percentile (1st quartile), 50th percentile (median),
75th percentile (3rd quartile), and maximum in a box and lines.¹ Outliers, if any, appear
outside the (adjacent) minimum and maximum lines. As such, a box plot effectively
summarizes these major percentiles using a box and lines. If a variable is normally
distributed, its 25th and 75th percentiles are symmetric, and its median and mean are
located at the same point exactly in the center of the box.²
In Figure 4, you should see outliers in Illinois and Ohio that affect the shapes of
corresponding boxes. In contrast, the Indiana unemployment rate does not have outliers,
and its symmetric box implies that the rate appears to be normally distributed.
¹ The first quartile cuts off the lowest 25 percent of data; the second quartile cuts the data set in half; and the third
quartile cuts off the lowest 75 percent or highest 25 percent of data. See http://en.wikipedia.org/wiki/Quartile.
² SAS reports a mean as "+" between (adjacent) minimum and maximum lines.
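The percentiles a box plot summarizes can be computed directly. The following is a minimal Python sketch, not output from SAS, STATA, or SPSS: the function name `box_plot_summary`, the sample values, and the 1.5 x IQR rule for flagging outliers are assumptions made for illustration, and quartile conventions differ slightly across packages.

```python
from statistics import quantiles

def box_plot_summary(values):
    """Five-number summary plus points outside the 1.5*IQR fences (assumed rule)."""
    data = sorted(values)
    # "inclusive" treats the data as a complete population of observed values.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {"min": data[0], "q1": q1, "median": median,
            "q3": q3, "max": data[-1], "outliers": outliers}

# Hypothetical rates, loosely shaped like a right-skewed state.
summary = box_plot_summary([3.8, 4.9, 5.5, 6.1, 6.4, 7.0, 8.3, 13.3])
print(summary)
```

A single large value stretches the upper whisker and is flagged as an outlier, which is the pattern the text describes for Illinois and Ohio.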
Figure 4. Box Plots of Unemployment Rates of Illinois, Indiana, and Ohio
[Figure: box plots for Illinois (N=102), Indiana (N=92), and Ohio (N=88). Horizontal axis: Unemployment Rate (%), 2 to 14. Source: Bureau of Labor Statistics; Indiana Business Research Center (http://www.stats.indiana.edu/).]
The histogram graphically shows how each category (interval) accounts for the proportion of total observations and is more appropriate for large N samples (Figure 5).
Figure 5. Histograms of Unemployment Rates of Illinois, Indiana, and Ohio
[Figure: histograms for Illinois (N=102), Indiana (N=92), and Ohio (N=88). Horizontal axes run from 0 to 15; vertical axes from 0 to .5. Source: Bureau of Labor Statistics; Indiana Business Research Center (http://www.stats.indiana.edu/).]
2.2 Theory-driven Plots
The P-P and Q-Q plots are considered here. The probability-probability plot (P-P plot or percent plot) compares an empirical cumulative distribution function of a variable with a specific theoretical cumulative distribution function (e.g., the standard normal distribution function). In Figure 6, Ohio appears to deviate more from the fitted line than Indiana.
Figure 6. P-P Plots of Unemployment Rates of Indiana and Ohio (Year 2005)
[Figure: two P-P plots, "2005 Indiana Unemployment Rate (N=92 Counties)" and "2005 Ohio Unemployment Rate (N=88 Counties)". Horizontal axis: Empirical P[i] = i/(N+1), 0.00 to 1.00. Source: Bureau of Labor Statistics.]
Similarly, the quantile-quantile plot (Q-Q plot) compares ordered values of a variable with quantiles of a specific theoretical distribution (i.e., the normal distribution). If two distributions match, the points on the plot will form a linear pattern passing through the origin with a unit slope. P-P and Q-Q plots are used to see how well a theoretical distribution models the empirical data. In Figure 7, Indiana appears to have a smaller variation in its unemployment rate than Ohio. In contrast, Ohio appears to have a wider range of outliers in the upper extreme.
Figure 7. Q-Q Plots of Unemployment Rates of Indiana and Ohio (Year 2005)
[Figure: two Q-Q plots, "2005 Indiana Unemployment Rate (N=92 Counties)" and "2005 Ohio Unemployment Rate (N=88 Counties)". Vertical axis: Unemployment Rate in 2005; horizontal axis: Inverse Normal. Grid lines are 5, 10, 25, 50, 75, 90, and 95 percentiles. Source: Bureau of Labor Statistics.]
Detrended normal P-P and Q-Q plots depict the actual deviations of data points from the straight horizontal line at zero. No specific pattern in a detrended plot indicates normality of the variable. SPSS can generate detrended P-P and Q-Q plots.
Although visually appealing, these graphical methods do not provide objective criteria to determine normality of variables. Interpretations are thus a matter of judgment.
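The coordinates behind P-P and Q-Q plots are simple to compute. The sketch below is an illustration in Python rather than the procedure any of the three packages uses internally: it takes the plotting position P[i] = i/(N+1) shown on the axis of Figure 6, fits a normal distribution with the sample mean and standard deviation (an assumption), and returns both pairs of coordinates. The function name and the sample data are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def pp_qq_points(values):
    """Return P-P and Q-Q coordinates against a normal fitted to the sample."""
    data = sorted(values)
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    points = []
    for i, x in enumerate(data, start=1):
        p_empirical = i / (n + 1)               # plotting position P[i]
        points.append({
            "x": x,                             # ordered value (Q-Q vertical axis)
            "p_empirical": p_empirical,         # P-P horizontal axis
            "p_theoretical": fitted.cdf(x),     # P-P vertical axis
            "q_theoretical": fitted.inv_cdf(p_empirical),  # Q-Q horizontal axis
        })
    return points

pts = pp_qq_points([1, 2, 3, 4, 5])
for row in pts:
    print(row)
```

For data that follow the fitted distribution, `p_theoretical` tracks `p_empirical` and `x` tracks `q_theoretical`, which is the straight-line pattern described above.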
3. Numerical Methods
Numerical methods use descriptive statistics and statistical tests to examine normality.
3.1 Descriptive Statistics
Measures of dispersion such as variance reveal how observations of a random variable deviate from the mean. The (sample) variance of a variable is computed from the second central moment.
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$$
Skewness is based on the third standardized moment that measures the degree of symmetry of a probability distribution. If skewness is greater than zero, the distribution is skewed to the right, having more observations on the left.
$$\frac{E[(x-\mu)^3]}{\sigma^3} = \frac{\sum (x_i - \bar{x})^3}{s^3 (n-1)} = \frac{\sqrt{n-1}\,\sum (x_i - \bar{x})^3}{\left[\sum (x_i - \bar{x})^2\right]^{3/2}}$$
The following is a list of descriptive statistics for unemployment rates of three states.
 state |    N      mean   median    max    min   variance   skewness   kurtosis
-------+------------------------------------------------------------------------
    IL |  102  5.421569     5.35    8.8    3.7   .8541837   .6570033   3.946029
    IN |   92  5.641304      5.5    8.4    3.1   1.079374   .3416314   2.785585
    OH |   88    6.3625      6.1   13.3    3.8   2.126049   1.665322   8.043097
-------+------------------------------------------------------------------------
 Total |  282  5.786879     5.65   13.3    3.1   1.473955    1.44809   8.383285
Indiana has the smallest skewness of .3416 that is close to zero. In contrast, Ohio has a large skewness of 1.6653 indicating many observations on the left of the probability distribution.
Kurtosis, based on the fourth central moment, measures the thinness of tails or "peakedness" of a probability distribution.
$$\frac{E[(x-\mu)^4]}{\sigma^4} = \frac{\sum (x_i - \bar{x})^4}{s^4 (n-1)} = \frac{(n-1)\sum (x_i - \bar{x})^4}{\left[\sum (x_i - \bar{x})^2\right]^2}$$
Like skewness, kurtosis also shows how the distribution of a variable deviates from a normal distribution. If kurtosis of a random variable is less than three (or if kurtosis-3 is less than zero), the distribution has thicker tails and a lower peak compared to a normal distribution (first plot in Figure 8).³ See the kurtosis of 2.7856 of Indiana.
³ SAS and SPSS produce (kurtosis-3), while STATA returns the kurtosis. SAS uses its weighted kurtosis formula with the degree of freedom adjusted. So, if N is small, SAS, STATA, and SPSS may report different kurtosis.
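The three moment formulas of Section 3.1 translate almost literally into code. The following Python sketch implements the expressions as written in this document (with n-1 in the standardization); as the footnote notes, SAS, STATA, and SPSS each apply their own adjustments, so their output need not match this exactly. The function name and the toy data are assumptions for illustration.

```python
from math import sqrt

def sample_moments(values):
    """Variance, skewness, and kurtosis using the formulas of Section 3.1."""
    n = len(values)
    xbar = sum(values) / n
    d2 = sum((x - xbar) ** 2 for x in values)   # sum of squared deviations
    d3 = sum((x - xbar) ** 3 for x in values)   # sum of cubed deviations
    d4 = sum((x - xbar) ** 4 for x in values)   # sum of fourth-power deviations
    variance = d2 / (n - 1)
    skewness = sqrt(n - 1) * d3 / d2 ** 1.5
    kurtosis = (n - 1) * d4 / d2 ** 2           # compare with 3, not with 0
    return variance, skewness, kurtosis

variance, skewness, kurtosis = sample_moments([1, 2, 3, 4, 5])
print(variance, skewness, kurtosis)
```

A symmetric sample returns zero skewness; subtracting 3 from the kurtosis gives the quantity that SAS and SPSS report.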
In contrast, kurtosis larger than 3 indicates a higher peak and thin tails (last plot). Note that Ohio has a large kurtosis of 8.0431. A normally distributed random variable should have skewness and kurtosis near zero and three, respectively (second plot in Figure 8).
Figure 8. Probability Distributions with Different Kurtosis
[Figure: three density plots, titled "Kurtosis < 3", "Kurtosis = 3", and "Kurtosis > 3". Horizontal axes run from -5 to 5; vertical axes from 0 to .8.]
3.2 Theory-driven Statistics
Skewness and kurtosis are based on the empirical data. The numerical methods for testing normality compare empirical data with a theoretical distribution. Widely used methods include the Kolmogorov-Smirnov (K-S) D test (Lilliefors test), Shapiro-Wilk test, Anderson-Darling test, and Cramer-von Mises test (SAS Institute 1995).⁴ The K-S D test and Shapiro-Wilk W test are commonly used. The K-S, Anderson-Darling, and Cramer-von Mises tests are based on the empirical distribution function (EDF), which is defined for a set of N independent observations x1, x2, …, xn with a common distribution function F(x) (SAS 2004).
Table 2. Numerical Methods for Testing Normality
Test                 Stat.   N range        Dist.    SAS    STATA       SPSS
Jarque-Bera          χ²                     χ²(2)    -      -           -
Skewness-Kurtosis    χ²      9≤N            χ²(2)    -      .sktest     -
Shapiro-Wilk         W       7≤N≤2,000      -        YES    .swilk      YES
Shapiro-Francia      W'      5≤N≤5,000      -        -      .sfrancia   -
Kolmogorov-Smirnov   D                      EDF      YES    *           YES
Cramer-von Mises     W²                     EDF      YES    -           -
Anderson-Darling     A²                     EDF      YES    -           -
* The STATA .ksmirnov command is not used for testing normality.
⁴ The UNIVARIATE and CAPABILITY procedures have the NORMAL option to produce four statistics.
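To make the EDF idea concrete, the sketch below computes a Kolmogorov-Smirnov D statistic in Python as the largest gap between the empirical distribution function and a normal CDF whose mean and standard deviation are estimated from the sample (the Lilliefors setting). It is an illustration only: the function name and data are hypothetical, no p-value is computed, and the packages' own small-sample conventions may differ.

```python
from statistics import NormalDist, mean, stdev

def ks_d_statistic(values):
    """Largest distance between the EDF and a normal CDF fitted to the sample."""
    data = sorted(values)
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    d = 0.0
    for i, x in enumerate(data, start=1):
        f = fitted.cdf(x)
        # The EDF jumps from (i-1)/n to i/n at x, so check both sides of the step.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

d_example = ks_d_statistic([1, 2, 3, 4, 5])
print(round(d_example, 4))
```

A small D means the EDF stays close to the fitted normal curve; judging whether D is "too large" requires the critical values that the statistical packages supply.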
The Shapiro-Wilk W is the ratio of the best estimator of the variance to the usual corrected sum of squares estimator of the variance (Shapiro and Wilk 1965).⁵ The statistic is positive and less than or equal to one. Being close to one indicates normality. The W statistic requires that the sample size is greater than or equal to 7 and less than or equal to 2,000 (Shapiro and Wilk 1965).⁶
$$W = \frac{\left(\sum a_i x_{(i)}\right)^2}{\sum (x_i - \bar{x})^2}$$
where $a' = (a_1, a_2, \ldots, a_n) = m'V^{-1}\left[m'V^{-1}V^{-1}m\right]^{-1/2}$, $m' = (m_1, m_2, \ldots, m_n)$ is the vector of expected values of standard normal order statistics, V is the n by n covariance matrix, $x' = (x_1, x_2, \ldots, x_n)$ is a random sample, and $x_{(1)} < x_{(2)} < \ldots < x_{(n)}$.
The Shapiro-Francia (S-F) W' test is an approximate test that modifies the Shapiro-Wilk W. The S-F statistic uses $b' = (b_1, b_2, \ldots, b_n) = m'(m'm)^{-1/2}$ instead of a'. The statistic was developed by Shapiro and Francia (1972) and Royston (1983). The recommended sample sizes for the STATA .sfrancia command range from 5 to 5,000 (STATA 2005). SAS and SPSS do not support this statistic. Table 3 summarizes test statistics for 2005 unemployment rates of Illinois, Indiana, and Ohio.
Table 3. Normality Test for 2005 Unemployment Rates of Illinois, Indiana, and Ohio
Test (software)                 Illinois     Indiana      Ohio
Shapiro-Wilk (SAS)              ...          ...          ...
Shapiro-Wilk (STATA)            ...          ...          ...
Shapiro-Francia (STATA)         ...          ...          ...
Kolmogorov-Smirnov (SAS)        ...          ...          ...
Cramer-von Mises (SAS)          ...          ...          ...
Anderson-Darling (SAS)          ...          ...          ...
Jarque-Bera                     12.2928      1.9458       149.5495
Skewness-Kurtosis (STATA)       10.59        1.99         43.75
[Each state has a Test and a P-value column in the original table. Only the test statistics of the last two rows are legible in this copy.]
The SAS UNIVARIATE and CAPABILITY procedures perform the Kolmogorov-Smirnov D, Anderson-Darling A², and Cramer-von Mises W² tests, which are useful especially when N is larger than 2,000.
3.3 Jarque-Bera (Skewness-Kurtosis) Test
The test statistics mentioned in the previous section tend to reject the null hypothesis when N becomes large. Given a large number of observations, the Jarque-Bera test and Skewness-Kurtosis test will be the alternatives for testing normality.
⁵ The W statistic was constructed by considering the regression of ordered sample values on corresponding expected normal order statistics, which for a sample from a normally distributed population is linear (Royston 1982). Shapiro and Wilk's (1965) original W statistic is valid for the sample sizes between 3 and 50, but Royston extended the test by developing a transformation of the null distribution of W to approximate normality throughout the range between 7 and 2000.
⁶ STATA .swilk command, based on Shapiro and Wilk (1965) and Royston (1992), can be used with from 4 to 2000 observations (STATA 2005).
The Jarque-Bera test, a type of Lagrange multiplier test, was developed to test normality, heteroscedasticity, and serial correlation (autocorrelation) of regression residuals (Jarque and Bera 1980). The Jarque-Bera statistic is computed from skewness and kurtosis and asymptotically follows the chi-squared distribution with two degrees of freedom.
$$n\left[\frac{skewness^2}{6} + \frac{(kurtosis-3)^2}{24}\right] \sim \chi^2(2),$$
where n is the number of observations.
The above formula gives a penalty for increasing the number of observations that implies a good asymptotic property of the Jarque-Bera test. The computation for 2005 unemployment rates is as follows.⁷
12.292825 = 102*(0.66685022^2/6 + 1.0553068^2/24)
1.9458304 = 92*(0.34732004^2/6 + (-0.1583764)^2/24)
149.54945 = 88*(1.69434105^2/6 + 5.4132289^2/24)
The STATA Skewness-Kurtosis test is based on D'Agostino, Belanger, and D'Agostino, Jr. (1990) and Royston (1991) (STATA 2005). Note that in Ohio the Jarque-Bera statistic of 150 is quite different from the S-K statistic of 44 (see Table 3).
Table 4. Comparison of Methods for Testing Normality
[Table: columns are N = 10, 100, 500, 1,000, 5,000, and 10,000. Rows are Mean, Standard deviation, Minimum, 1st quantile, Median, 3rd quantile, Maximum, Skewness (SAS), Kurtosis-3 (SAS), Jarque-Bera, Skewness (STATA), Kurtosis (STATA), S-K (STATA), Shapiro-Wilk W, Shapiro-Francia W' (STATA), Kolmogorov-Smirnov D (SAS), Cramer-von Mises W² (SAS), and Anderson-Darling A² (SAS). Most cell values are not legible in this copy. The test entries that can be matched to the N=500 column are Jarque-Bera 3.3483 (.1875), Shapiro-Wilk W .9956 (.1680), Kolmogorov-Smirnov D .0269 (.1500), Cramer-von Mises W² .0834 (.1945), and Anderson-Darling A² .5409 (.1712).]
* P-value in parenthesis
⁷ Skewness and kurtosis are computed using the SAS UNIVARIATE and CAPABILITY procedures that report kurtosis minus 3.
Table 4 presents results of normality tests for random variables with different values for N. The data were randomly generated from the standard normal distribution with a seed of 1,234,567 in SAS. As N grows, the mean, median, skewness, and (kurtosis-3) approach zero, while the standard deviation gets close to 1.
The Kolmogorov-Smirnov D, Anderson-Darling A², and Cramer-von Mises W² are computed in SAS, while the Skewness-Kurtosis and Shapiro-Francia W' are computed in STATA. All four statistics do not reject the null hypothesis of normality regardless of the number of observations (Table 4). Note that the Shapiro-Wilk W is not reliable when N is larger than 2,000 and S-F W' is valid up to 5,000 observations. The Jarque-Bera and Skewness-Kurtosis tests show consistent results when N is large.
3.4 Software Issues
The UNIVARIATE procedure of SAS/BASE and CAPABILITY of SAS/QC compute various statistics and produce P-P and Q-Q plots. These procedures provide many numerical methods including Cramer-von Mises and Anderson-Darling.⁸ The P-P plot is generated only in CAPABILITY.
In contrast, STATA has many individual commands to examine normality. In particular, STATA provides .sktest and .sfrancia to conduct Skewness-Kurtosis and Shapiro-Francia W' tests, respectively. SPSS EXAMINE provides numerical and graphical methods for normality test. The detrended P-P and Q-Q plots can be generated in SPSS. Table 5 summarizes SAS procedures and STATA/SPSS commands that are used to test normality of random variables.
Table 5. Comparison of Procedures and Commands Available
                              SAS 9.1                  STATA 9.2 SE            SPSS 14.0
Descriptive statistics        UNIVARIATE, CAPABILITY   .summarize, .tabstat    Descriptives, Frequencies, Examine
  (Skewness/Kurtosis)
Histogram, dot plot           UNIVARIATE, CAPABILITY   .histogram, .dotplot    Graph, Igraph, Examine, Frequencies
Stem-and-leaf plot            UNIVARIATE               .stem                   Examine
Box plot                      UNIVARIATE               .graph box              Examine, Igraph
P-P plot                      CAPABILITY               .pnorm                  Pplot
Q-Q plot                      UNIVARIATE, CAPABILITY   .qnorm                  Pplot, Examine
Detrended Q-Q/P-P plot        -                        -                       Pplot, Examine
Jarque-Bera (S-K) test        -                        .sktest                 -
Shapiro-Wilk W                UNIVARIATE, CAPABILITY   .swilk                  Examine
Shapiro-Francia W'            -                        .sfrancia               -
Kolmogorov-Smirnov            UNIVARIATE, CAPABILITY   -                       Examine
Cramer-von Mises              UNIVARIATE, CAPABILITY   -                       -
Anderson-Darling              UNIVARIATE, CAPABILITY   -                       -
⁸ MINITAB also performs the Kolmogorov-Smirnov and Anderson-Darling tests.
4. Testing Normality in SAS
SAS has the UNIVARIATE and CAPABILITY procedures to compute descriptive statistics, draw various graphs, and conduct statistical tests for normality. The two procedures have similar usage and produce similar statistics in the same format. However, UNIVARIATE produces stem-and-leaf, box, and normal probability plots, while CAPABILITY provides P-P and CDF plots that UNIVARIATE does not.
This section illustrates how to summarize normally and nonnormally distributed variables and conduct normality tests of these variables using the two procedures. Figure 9 presents histograms of these variables.
Figure 9. Histograms of Normally and Nonnormally Distributed Variables
[Figure: two histograms with fitted normal curves. Left panel: variable random, vertical axis Percent, curve Normal(Mu=-.095 Sigma=1.0033). Right panel: variable GNIP, vertical axis Percent, curve Normal(Mu=8.9646 Sigma=13.567).]
4.1 A Normally Distributed Variable
The UNIVARIATE procedure provides a variety of descriptive statistics, and draws Q-Q, stem-and-leaf, normal probability, and box plots. This procedure also conducts Shapiro-Wilk, Kolmogorov-Smirnov, Anderson-Darling, and Cramer-von Mises tests. The Shapiro-Wilk W will be reported only if N is not larger than 2000.
Let us take a look at an example of the UNIVARIATE procedure. The NORMAL option performs normality tests; the PLOT option draws stem-and-leaf and box plots; finally, the QQPLOT statement draws a Q-Q plot.
PROC UNIVARIATE DATA=masil.normality NORMAL PLOT;
   VAR random;
   QQPLOT random /NORMAL(MU=EST SIGMA=EST COLOR=RED L=1);
RUN;
Like UNIVARIATE, the CAPABILITY procedure also produces various descriptive statistics and plots. CAPABILITY can draw a P-P plot using the PPPLOT option but does not support stem-and-leaf, box, and normal probability plots.
4.1.1 SAS Output of Descriptive Statistics
In the following CAPABILITY procedure, QQPLOT, PPPLOT, and HISTOGRAM statements respectively draw a Q-Q plot, a P-P plot, and a histogram. Note that the INSET statement adds summary statistics to graphs such as a histogram and a Q-Q plot.⁹
PROC CAPABILITY DATA=masil.normality NORMAL;
   VAR random;
   QQPLOT random /NORMAL(MU=EST SIGMA=EST COLOR=RED L=1);
   PPPLOT random /NORMAL(MU=EST SIGMA=EST COLOR=RED L=1);
   HISTOGRAM /NORMAL(COLOR=MAROON W=4) CFILL = BLUE CFRAME = LIGR;
   INSET MEAN STD /CFILL=BLANK FORMAT=5.2 ;
RUN;
⁹ Unlike the UNIVARIATE statement, the CAPABILITY statement does not have the PLOT option that draws stem-and-leaf, box, and normal probability plots.

                     The CAPABILITY Procedure
                        Variable:  random

                            Moments
N                         500    Sum Weights                500
Mean               -0.0950725    Sum Observations    -47.536241
Std Deviation      1.00330171    Variance            1.00661432
Skewness            0.0203721    Kurtosis            -0.3988198
Uncorrected SS     506.819932    Corrected SS        502.300544
Coeff Variation    -1055.3019    Std Error Mean      0.04486902

                  Basic Statistical Measures
    Location                    Variability
Mean     -0.09507     Std Deviation            1.00330
Median   -0.11959     Variance                 1.00661
Mode       .          Range                    5.34911
                      Interquartile Range      1.41773

                  Tests for Location: Mu0=0
Test           -Statistic-    -----p Value------
Student's t    t  -2.11889    Pr > |t|    0.0346
Sign           M       -28    Pr >= |M|   0.0138
Signed Rank    S     -6523    Pr >= |S|   0.0435

                     Tests for Normality
Test                  --Statistic---    -----p Value------
Shapiro-Wilk          W     0.995564    Pr < W      0.168
Kolmogorov-Smirnov    D     0.026891    Pr > D     >0.150
Cramer-von Mises      W-Sq  0.083351    Pr > W-Sq   0.195
Anderson-Darling      A-Sq  0.540894    Pr > A-Sq   0.171

    Quantiles (Definition 5)
Quantile          Estimate
100% Max       2.511694336
99%            2.055464409
95%            1.530450397
90%            1.215210586
75% Q3         0.612538495
50% Median    -0.119592165
25% Q1        -0.805191028
10%           -1.413548051
5%            -1.794057126
1%            -2.219479314
0% Min        -2.837417522

             Extreme Observations
-------Lowest------        ------Highest------
      Value     Obs              Value     Obs
-2.83741752      29         2.14897641     119
-2.59039285     204         2.21109349     340
-2.47829639      73         2.42113892     325
-2.39126554     391         2.42171307     139
-2.24047386     393         2.51169434     332

4.1.2 Graphical Methods
The stem-and-leaf and box plots, produced by the UNIVARIATE procedure, illustrate that the variable is normally distributed (Figure 10). The locations of the first quartile, mean, median, and third quartile indicate a bell-shaped distribution. Note that the mean -.0951 and median -.1196 are very close.
Figure 10. Stem-and-Leaf and Box Plots of a Normally Distributed Variable
                   Histogram                        #  Boxplot
 2.75+*                                             1     |
     .**                                            4     |
     .********                                     23     |
     .****************                             46     |
     .***********************                      68  +-----+
     .***************************                  80  |     |
     .***************************************     116  *--+--*
     .**********************                       64  +-----+
     .*******************                          56     |
     .*********                                    27     |
     .*****                                        13     |
-2.75+*                                             2     |
      ----+----+----+----+----+----+----+----
      * may represent up to 3 counts
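The quantile table is labeled "Definition 5", which in SAS refers to the empirical distribution function with averaging: when n*p is a whole number j, the quantile is the average of the j-th and (j+1)-th ordered values; otherwise it is the next ordered value. The Python sketch below illustrates that rule as understood here; it is not SAS code, and the function name and the integer `percent` argument are choices made for the example.

```python
def percentile_def5(values, percent):
    """Percentile by the empirical distribution function with averaging.

    `percent` is an integer from 0 to 100 so that n*p can be tested exactly.
    """
    data = sorted(values)
    n = len(data)
    j, remainder = divmod(n * percent, 100)
    if remainder != 0:
        return data[j]                  # fractional part > 0: take x(j+1)
    if j == 0:
        return data[0]                  # 0th percentile is the minimum
    if j == n:
        return data[-1]                 # 100th percentile is the maximum
    return (data[j - 1] + data[j]) / 2  # whole number: average x(j) and x(j+1)

sample = list(range(1, 11))             # hypothetical data: 1, 2, ..., 10
print(percentile_def5(sample, 25), percentile_def5(sample, 50))
```

With N=500, the 99th percentile under this rule is the average of the 495th and 496th ordered values, which is why it need not coincide with any single observation in the Extreme Observations list.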
The normal probability plot available in UNIVARIATE shows a straight line, implying normality of the randomly drawn variable (Figure 11).
Figure 11. Normal Probability Plot of a Normally Distributed Variable
[Figure: SAS line-printer normal probability plot of random, vertical axis from -2.75 to 2.75, horizontal axis from -2 to +2. The points follow the reference line.]
The P-P and Q-Q plots show that the data points do not seriously deviate from the fitted line (Figure 12). They consistently indicate that the variable is normally distributed.
Figure 12. P-P and Q-Q Plots of a Normally Distributed Variable
[Figure: left panel, P-P plot of the cumulative distribution of random against Normal(Mu=-.095 Sigma=1.0033); right panel, Q-Q plot of random against Normal Quantiles, -4 to 4.]
4.1.3 Numerical Methods
The mean of -.0951 is very close to 0 and the variance is almost 1. The skewness and kurtosis-3 are respectively .0204 and -.3988, indicating an almost perfect normal distribution. However, these descriptive statistics do not provide conclusive information about normality.
SAS provides four different statistics for testing normality. The Shapiro-Wilk W of .9956 does not reject the null hypothesis that the variable is normally distributed at the .05 level (p=.168).
Similarly, Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling tests do not reject the null hypothesis. Since the number of observations is less than 2,000, however, the Shapiro-Wilk W test will be appropriate for this case.
The Jarque-Bera test also indicates the normality of the randomly drawn variable at the .05 level (p=.1875). Computation is
$$500\left[\frac{.0203721^2}{6} + \frac{(-.3988198)^2}{24}\right] = 3.3482776 \sim \chi^2(2)$$
Consequently, we can safely confirm that the randomly drawn variable is normally distributed.
4.2 A Nonnormally Distributed Variable
Let us take the per capita gross national income as an example of a nonnormally distributed variable. See the appendix for details about this variable.
4.2.1 SAS Output of Descriptive Statistics
This section employs the UNIVARIATE procedure to compute descriptive statistics and perform normality tests.
PROC UNIVARIATE DATA=masil.gnip NORMAL PLOT;
   VAR gnip;
   HISTOGRAM / NORMAL(COLOR=MAROON W=4) CFILL = BLUE CFRAME = LIGR;
   QQPLOT gnip /NORMAL(MU=EST SIGMA=EST COLOR=RED L=1);
RUN;

                     The UNIVARIATE Procedure
                        Variable:  GNIP

                            Moments
N                         164    Sum Weights                164
Mean                8.9645732    Sum Observations    1470.19001
Std Deviation      13.5667877    Variance            184.057728
Skewness           2.04947469    Kurtosis            3.60816725
Uncorrected SS     43181.0356    Corrected SS        30001.4096
Coeff Variation    151.337798    Std Error Mean      1.05938813

                  Basic Statistical Measures
    Location                    Variability
Mean     8.964573     Std Deviation           13.56679
Median   2.765000     Variance               184.05773
Mode     1.010000     Range                   65.34000
                      Interquartile Range      7.72500
                  Tests for Location: Mu0=0
Test           -Statistic-    -----p Value------
Student's t    t  8.462029    Pr > |t|    <.0001
Sign           M        82    Pr >= |M|   <.0001
Signed Rank    S      6765    Pr >= |S|   <.0001

                     Tests for Normality
Test                  --Statistic---    -----p Value------
Shapiro-Wilk          W     0.663114    Pr < W     <0.0001
Kolmogorov-Smirnov    D     0.284426    Pr > D     <0.0100
Cramer-von Mises      W-Sq  4.346966    Pr > W-Sq  <0.0050
Anderson-Darling      A-Sq  22.23115    Pr > A-Sq  <0.0050

    Quantiles (Definition 5)
Quantile       Estimate
100% Max         65.630
99%              59.590
95%              38.980
90%              32.600
75% Q3            8.680
50% Median        2.765
25% Q1            0.955
10%               0.450
5%                0.370
1%                0.290
0% Min            0.290

        Extreme Observations
----Lowest----        ----Highest---
Value      Obs        Value      Obs
 0.29      164        46.32        5
 0.29      163        47.39        4
 0.31      162        54.93        3
 0.33      161        59.59        2
 0.34      160        65.63        1

4.2.2 Graphical Methods
The stem-and-leaf, box, and normal probability plots all indicate that the variable is not normally distributed (Figure 13). Most observations are highly concentrated on the left side of the distribution.
Figure 13. Stem-and-Leaf, Box, and Normal Probability Plots
                   Histogram                        #  Boxplot
67.5+*                                              1     *
    .
    .*                                              1     *
    .*                                              1     *
    .*                                              2     *
    .*                                              3     *
    .**                                             6     *
    .**                                             5     0
    .**                                             5     0
    .*                                              2     0
    .**                                             6     |
    .***                                            7     |
    .******                                        17  +--+--+
 2.5+************************************         108  *-----*
     ----+----+----+----+----+----+----+-
     * may represent up to 3 counts
[Figure: SAS line-printer normal probability plot of GNIP, vertical axis from 2.5 to 67.5, horizontal axis from -2 to +2. The points depart markedly from the reference line.]
Figure 14. P-P and Q-Q Plots of a Nonnormally Distributed Variable
[Figure: left panel, P-P plot of the cumulative distribution of GNIP against Normal(Mu=8.9646 Sigma=13.567); right panel, Q-Q plot of GNIP (0 to 70) against Normal Quantiles (-3 to 3).]
The P-P and Q-Q plots in Figure 14 show that the data points seriously deviate from the fitted line.
4.2.3 Numerical Methods
Per capita gross national income has a mean of 8.9646 and a large variance of 184.0577. Its skewness and kurtosis-3 are 2.0495 and 3.6082, respectively, indicating that the variable is highly skewed to the right with a high peak and thin tails.
It is not surprising that the Shapiro-Wilk test rejects the null hypothesis; W is .6631 and its p-value is less than .0001. Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling tests also report similar results. Finally, the Jarque-Bera test returns 203.7717, which rejects the null hypothesis of normality at the .05 level (p<.0001).
$$164\left[\frac{2.04947469^2}{6} + \frac{3.60816725^2}{24}\right] = 203.77176 \sim \chi^2(2)$$
To sum up, we can conclude that the per capita gross national income is not normally distributed.
5. Testing Normality Using STATA

In STATA, you have to use individual commands to get specific statistics or draw various plots. This section contrasts a normally distributed and a nonnormally distributed variable using graphical and numerical methods.

5.1 Graphical Methods

A histogram is the most widely used graphical method. Histograms of normally and nonnormally distributed variables are presented in the introduction (Figure 2). The STATA .histogram command is followed by a variable name and options. The normal option adds a normal density curve to the histogram.

. histogram normal, normal
. histogram gnip, normal

Now let us draw a stem-and-leaf plot using the .stem command. The stem-and-leaf plot of the randomly drawn variable shows a bell-shaped distribution (Figure 15).

. stem normal

Figure 15. Stem-and-Leaf Plot of a Normally Distributed Variable
[Stem-and-leaf plot for normal: values rounded to the nearest multiple of .01 and plotted in units of .01, with stems running from -28* to 25*.]

In contrast, per capita gross national income is highly skewed to the right, with most observations below $10,000 (Figure 16).

. stem gnip

Figure 16. Stem-and-Leaf Plot of a Nonnormally Distributed Variable
[Stem-and-leaf plot for gnip: values rounded to the nearest multiple of .1 and plotted in units of .1, with stems running from 0** to 6**.]
The .dotplot command generates a dot plot, very similar to the stem-and-leaf plot, in descending order (Figure 17).

. dotplot normal
. dotplot gnip

Figure 17. Dotplots of Normally and Nonnormally Distributed Variables
[Left panel: A Normally Distributed Variable (N=500). Right panel: A Nonnormally Distributed Variable (N=164). Horizontal axes: Frequency.]

The .graph box command draws a box plot. In the left plot of Figure 18, the shaded box represents the 25th percentile, median, and 75th percentile. The right plot, in contrast, has an asymmetric box with many outliers beyond the adjacent maximum line.

. graph box normal
. graph box gnip

Figure 18. Box Plots of Normally and Nonnormally Distributed Variables
[Left panel: A Normally Distributed Variable (N=500). Right panel: A Nonnormally Distributed Variable (N=164).]

The .pnorm command produces a standardized normal P-P plot. The left plot shows almost no deviation from the fitted line, while the right depicts an s-shaped curve that largely deviates from the line (Figure 19).[10]

. pnorm normal
. pnorm gnip

[10] In SAS, a P-P plot has the cumulative distribution of an empirical variable on the X axis and the theoretical normal distribution on the Y axis. In STATA, these distributions are located reversely.
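The coordinates behind such plots can be sketched in a few lines. The following Python example is only an illustration, not STATA's implementation: it assumes the plotting position P[i] = i/(N+1) and a normal distribution fitted with the sample mean and standard deviation, which is what the axis labels of the STATA plots suggest. The function name pp_qq_points and the sample values are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def pp_qq_points(values):
    """Return (pp, qq) coordinate lists for normal P-P and Q-Q plots.

    pp holds (empirical P[i], fitted normal CDF at x[i]);
    qq holds (fitted normal quantile at P[i], observed x[i]).
    """
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))  # normal with sample mean and sd
    pp, qq = [], []
    for i, x in enumerate(xs, start=1):
        p = i / (n + 1)  # assumed plotting position P[i] = i/(N+1)
        pp.append((p, fitted.cdf(x)))
        qq.append((fitted.inv_cdf(p), x))
    return pp, qq


# Hypothetical right-skewed values, for illustration only.
sample = [0.3, 0.4, 1.0, 2.8, 8.7, 32.6, 65.6]
pp, qq = pp_qq_points(sample)
for (p, f), (q, x) in zip(pp, qq):
    print(f"P[i]={p:.3f}  F={f:.3f}  inverse normal={q:8.3f}  observed={x:6.1f}")
```

For a normally distributed variable both sets of points fall close to the 45-degree line; for a skewed variable such as this one they bend away from it.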
Figure 19. P-P Plots of Normally and Nonnormally Distributed Variables
[Left panel: A Normally Distributed Variable (N=500). Right panel: A Nonnormally Distributed Variable (N=164). Vertical axes: Normal F[(x-m)/s]. Horizontal axes: Empirical P[i] = i/(N+1).]

The .qnorm command produces a standardized normal Q-Q plot. The Q-Q plots in Figure 20 show a pattern similar to that of the P-P plots. In the right plot, data points systematically deviate from the straight fitted line.

. qnorm normal
. qnorm gnip

Figure 20. Q-Q Plots of Normally and Nonnormally Distributed Variables
[Left panel: A Normally Distributed Variable (N=500). Right panel: A Nonnormally Distributed Variable (N=164). Horizontal axes: Inverse Normal. Grid lines are 5, 10, 25, 50, 75, 90, and 95 percentiles.]

5.2 Numerical Methods

Let us first get summary statistics using the .summarize command. The detail option lists various statistics in addition to the mean, standard deviation, minimum, and maximum. Skewness and kurtosis of the randomly drawn variable are respectively close to 0 and 3, implying normality. Per capita gross national income has a large skewness of 2.03 and kurtosis of 6.46, being skewed to the right with a high peak and heavy tails.

. summarize normal, detail

                           normal
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -2.219479      -2.837418
 5%    -1.794057      -2.590393
10%    -1.413548      -2.478296       Obs                 500
25%     -.805191      -2.391266       Sum of Wgt.         500

50%    -.1195922                      Mean          -.0950725
                        Largest       Std. Dev.      1.003302
75%     .6125385       2.211093
90%     1.215211       2.421139       Variance       1.006614
95%      1.53045       2.421713       Skewness       .0203109
99%     2.055464       2.511694       Kurtosis       2.593181

. sum gnip, detail

                            gnip
-------------------------------------------------------------
      Percentiles      Smallest
 1%          .29            .29
 5%          .37            .29
10%          .45            .31       Obs                 164
25%         .955            .33       Sum of Wgt.         164

50%        2.765                      Mean           8.964573
                        Largest       Std. Dev.      13.56679
75%         8.68          47.39
90%         32.6          54.93       Variance       184.0577
95%        38.98          59.59       Skewness       2.030682
99%        59.59          65.63       Kurtosis       6.462734

The STATA .tabstat command is very useful for producing descriptive statistics in table form. The column(variable) option lists statistics vertically (in table rows). The command for the variable normal is skipped.

. tabstat gnip, stats(n mean sum max min range sd var semean skewness kurtosis ///
      median p1 p5 p10 p25 p50 p75 p90 p95 p99 iqr q) column(variable)

   stats |     normal        gnip
---------+-----------------------
       N |        500         164
    mean |  -.0950725    8.964573
     sum |  -47.53624     1470.19
     max |   2.511694       65.63
     min |  -2.837418         .29
   range |   5.349112       65.34
      sd |   1.003302    13.56679
variance |   1.006614    184.0577
se(mean) |    .044869    1.059388
skewness |   .0203109    2.030682
kurtosis |   2.593181    6.462734
     p50 |  -.1195922       2.765
      p1 |  -2.219479         .29
      p5 |  -1.794057         .37
     p10 |  -1.413548         .45
     p25 |   -.805191        .955
     p50 |  -.1195922       2.765
     p75 |   .6125385        8.68
     p90 |   1.215211        32.6
     p95 |    1.53045       38.98
     p99 |   2.055464       59.59
     iqr |    1.41773       7.725
     p25 |   -.805191        .955
     p50 |  -.1195922       2.765
     p75 |   .6125385        8.68
---------------------------------

Now let us conduct statistical tests for normality. STATA provides three testing methods: the Shapiro-Wilk test, the Shapiro-Francia test, and the skewness-kurtosis test. The .swilk and .sfrancia commands respectively conduct the Shapiro-Wilk and Shapiro-Francia tests. Neither test rejects normality of the randomly drawn variable, and both reject normality of per capita gross national income.

. swilk normal

                   Shapiro-Wilk W test for normal data
    Variable |    Obs       W           V         z       Prob>z
-------------+--------------------------------------------------
      normal |    500    0.99556      1.492     0.962    0.16804

. sfrancia normal

                  Shapiro-Francia W' test for normal data
    Variable |    Obs       W'          V'        z       Prob>z
-------------+--------------------------------------------------
      normal |    500    0.99645      1.273     0.541    0.29412

. swilk gnip

                   Shapiro-Wilk W test for normal data
    Variable |    Obs       W           V         z       Prob>z
-------------+--------------------------------------------------
        gnip |    164    0.66322     42.309     8.530    0.00000

. sfrancia gnip

                  Shapiro-Francia W' test for normal data
    Variable |    Obs       W'          V'        z       Prob>z
-------------+--------------------------------------------------
        gnip |    164    0.66365     45.790     7.413    0.00001

STATA's .sktest command performs the skewness-kurtosis (SK) test, which is conceptually similar to the Jarque-Bera test. The noadjust option suppresses the empirical adjustment made by Royston (1991). The following SK tests do not reject normality of the randomly drawn variable at the .05 level.

. sktest normal

                   Skewness/Kurtosis tests for Normality
                                                 ------- joint ------
    Variable |  Pr(Skewness)   Pr(Kurtosis)  adj chi2(2)    Prob>chi2
-------------+-------------------------------------------------------
      normal |      0.851         0.027         4.93         0.0850

. sktest normal, noadjust

                   Skewness/Kurtosis tests for Normality
                                                 ------- joint ------
    Variable |  Pr(Skewness)   Pr(Kurtosis)      chi2(2)    Prob>chi2
-------------+-------------------------------------------------------
      normal |      0.851         0.027         4.93         0.0850

Surprisingly, the SK test rejects normality of the randomly drawn variable at the .1 level. The Jarque-Bera statistic is 3.4823 = 500*(.0203109^2/6+(2.593181-3)^2/24), which is not large enough to reject the null hypothesis (p=.1753). The Jarque-Bera test appears more reliable than the STATA SK test (see Table 4).

Like the Shapiro-Wilk and Shapiro-Francia tests, the SK test rejects normality of per capita gross national income.

. sktest gnip

                   Skewness/Kurtosis tests for Normality
                                                 ------- joint ------
    Variable |  Pr(Skewness)   Pr(Kurtosis)  adj chi2(2)    Prob>chi2
-------------+-------------------------------------------------------
        gnip |      0.000         0.000        55.33         0.0000

. sktest gnip, noadjust

                   Skewness/Kurtosis tests for Normality
                                                 ------- joint ------
    Variable |  Pr(Skewness)   Pr(Kurtosis)      chi2(2)    Prob>chi2
-------------+-------------------------------------------------------
        gnip |      0.000         0.000        75.39         0.0000

The Jarque-Bera statistic of per capita gross national income is 194.6489 = 164*(2.030682^2/6+(6.462734-3)^2/24). This large chi-squared statistic rejects the null hypothesis (p<.0001).

In conclusion, the graphical and numerical methods provide sufficient evidence that the randomly drawn variable is normally distributed and that per capita gross national income is not.
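The two Jarque-Bera computations above are easy to check by hand. The Python sketch below is an illustration added here, not part of STATA: it takes the skewness and kurtosis that .summarize reports (kurtosis itself, so 3 is subtracted) and uses the fact that the upper-tail probability of a chi-squared variable with 2 degrees of freedom is exp(-x/2). The function names are hypothetical.

```python
import math


def jarque_bera(n, skewness, kurtosis):
    """Jarque-Bera statistic from N, skewness, and kurtosis (normal = 3)."""
    return n * (skewness ** 2 / 6 + (kurtosis - 3) ** 2 / 24)


def chi2_df2_pvalue(statistic):
    """Upper-tail probability of a chi-squared(2) variable, exp(-x/2)."""
    return math.exp(-statistic / 2)


# Skewness and kurtosis as reported by STATA's .summarize, detail
jb_normal = jarque_bera(500, 0.0203109, 2.593181)
jb_gnip = jarque_bera(164, 2.030682, 6.462734)

print(round(jb_normal, 4), round(chi2_df2_pvalue(jb_normal), 4))  # 3.4823 0.1753
print(round(jb_gnip, 4), chi2_df2_pvalue(jb_gnip))  # 194.6489 and a p-value near zero
```

With SAS or SPSS output, which reports kurtosis-3, the subtraction of 3 would have to be dropped.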
6. Testing Normality Using SPSS

SPSS has the DESCRIPTIVES and FREQUENCIES commands to produce descriptive statistics. DESCRIPTIVES is usually applied to continuous variables, but FREQUENCIES is also able to produce various descriptive statistics including skewness and kurtosis. The IGRAPH command draws histograms and box plots. The PPLOT command produces (detrended) P-P and Q-Q plots. This command is able to draw the detrended Q-Q plot that SAS and STATA do not support.

The EXAMINE command can produce both descriptive statistics and various plots, such as a stem-and-leaf plot, histogram, box plot, P-P plot, and Q-Q plot. EXAMINE also performs the Kolmogorov-Smirnov and Shapiro-Wilk tests for normality.

6.1 A Normally Distributed Variable

Let us get summary statistics using DESCRIPTIVES and FREQUENCIES. The statistics are specified in the /STATISTICS subcommand. The output of the following DESCRIPTIVES command is skipped here.[11]

DESCRIPTIVES VARIABLES=normal
  /STATISTICS=MEAN SUM STDDEV VARIANCE RANGE MIN MAX SEMEAN KURTOSIS SKEWNESS.

FREQUENCIES VARIABLES=normal
  /NTILES= 4
  /STATISTICS=STDDEV VARIANCE RANGE MINIMUM MAXIMUM SEMEAN MEAN MEDIAN MODE SUM
   SKEWNESS SESKEW KURTOSIS SEKURT
  /HISTOGRAM
  /ORDER= ANALYSIS.

Statistics
normal
N                         Valid           500
                          Missing           0
Mean                                   -.0951
Std. Error of Mean                     .04487
Median                                 -.1196
Mode                                 -2.84(a)
Std. Deviation                        1.00330
Variance                                1.007
Skewness                                 .020
Std. Error of Skewness                   .109
Kurtosis                                -.399
Std. Error of Kurtosis                   .218
Range                                    5.35
Minimum                                 -2.84
Maximum                                  2.51
Sum                                    -47.54
Percentiles               25           -.8066
                          50           -.1196
                          75            .6132
a Multiple modes exist. The smallest value is shown

The variable has a mean close to zero and a variance close to one. The median is very close to the mean. The kurtosis-3 and skewness approach zero. Note that SPSS, like SAS, reports kurtosis-3.

[11] In order to execute this command, open a syntax window and paste it into the window. Or click Analyze→Descriptive Statistics→Descriptives and provide a variable of interest.

6.1.1 Graphical Methods

The /HISTOGRAM subcommand of FREQUENCIES tells SPSS to draw a histogram of the variable (left plot in Figure 21). You can get the identical plot using EXAMINE or the following GRAPH command.

GRAPH /HISTOGRAM=normal.

The IGRAPH command can produce a better histogram (right plot in Figure 21).[12] The histogram suggests that the variable is normally distributed.

IGRAPH /VIEWNAME='Histogram' /X1 = VAR(normal) TYPE = SCALE /Y = $count
  /COORDINATE = VERTICAL /X1LENGTH=3.0 /YLENGTH=3.0 /X2LENGTH=3.0
  /CHARTLOOK='NONE'
  /Histogram SHAPE = HISTOGRAM CURVE = OFF X1INTERVAL AUTO X1START = 0.

Figure 21. Histograms of a Normally Distributed Variable
[Left panel: histogram from FREQUENCIES. Right panel: histogram from IGRAPH.]

EXAMINE can produce a stem-and-leaf plot and a box plot using the /PLOT subcommand with the STEMLEAF and BOXPLOT options (Figure 22).[13] The stem-and-leaf plot is very similar to the histogram in Figure 21.

[12] From the menu, click Graphs→Interactive→Histogram and then specify a variable.
[13] To run this command from the menu, click Analyze→Descriptive Statistics→Explore, and then include the variable you want to examine.
EXAMINE VARIABLES=normal
  /PLOT BOXPLOT STEMLEAF HISTOGRAM NPPLOT
  /MISSING LISTWISE /NOTOTAL.

Figure 22. Stem-and-Leaf Plot and Box Plot of a Normally Distributed Variable

normal Stem-and-Leaf Plot

 Frequency    Stem &  Leaf
     2.00       -2 .  &
    13.00       -2 .  011&
    27.00       -1 .  555667889
    56.00       -1 .  001111222233333444
    64.00       -0 .  55555566777888889999
   116.00       -0 .  000000011111111122222222233333334444444
    80.00        0 .  000001111112222223333334444
    68.00        0 .  5555566677777888899999
    46.00        1 .  000111122233444
    23.00        1 .  5567789
     4.00        2 .  4&
     1.00        2 .  &

 Stem width:      1.00
 Each leaf:       3 case(s)
 & denotes fractional leaves.

[Box plot of normal.]

Both extremes (i.e., minimum and maximum) and the 25th, 50th, and 75th percentiles are symmetrically arranged in the box plot.

Figure 23. P-P and Detrended P-P Plots of a Normally Distributed Variable
[Left panel: P-P plot. Right panel: detrended P-P plot.]
Now let us draw P-P and Q-Q plots using the PPLOT command. The /TYPE subcommand chooses either a P-P or a Q-Q plot, and /DIST specifies a probability distribution (e.g., the standard normal distribution). The variable does not deviate far from the fitted line (Figure 23).[14] This command automatically produces detrended P-P and Q-Q plots as well.

PPLOT /VARIABLES=normal /NOLOG /NOSTANDARDIZE
  /TYPE=PP /FRACTION=BLOM /TIES=MEAN /DIST=NORMAL.

The following PPLOT command draws Q-Q and detrended Q-Q plots of the variable (Figure 24).[15] As in STATA, the Q-Q plot and detrended Q-Q plot have observed quantiles on the X axes and normal quantiles on the Y axes.

PPLOT /VARIABLES=normal /NOLOG /NOSTANDARDIZE
  /TYPE=QQ /FRACTION=BLOM /TIES=MEAN /DIST=NORMAL.

Figure 24. Q-Q and Detrended Q-Q Plots of a Normally Distributed Variable
[Left panel: Q-Q plot. Right panel: detrended Q-Q plot.]

The P-P and Q-Q plots indicate no significant deviation from the fitted line. The /PLOT NPPLOT subcommand of EXAMINE can also produce these plots. Probably due to bugs, the latest SPSS 14.02 (EXAMINE and PPLOT) does not produce the color P-P and Q-Q plots that are available in SPSS 13.xx and 12.xx.

[14] From the menu, click Graphs→P-P and then specify a variable of interest.
[15] Click Graphs→Q-Q and then specify a variable.
6.1.2 Numerical Methods

EXAMINE has the /PLOT NPPLOT subcommand to test normality of a variable. This command performs the Kolmogorov-Smirnov and Shapiro-Wilk tests and draws a normal (detrended) Q-Q plot as well.

EXAMINE VARIABLES=normal
  /PLOT NPPLOT
  /STATISTICS DESCRIPTIVES EXTREME
  /CINTERVAL 95
  /MISSING LISTWISE /NOTOTAL.

The above EXAMINE command first produces descriptive statistics (/STATISTICS DESCRIPTIVES), draws a normal Q-Q plot (/PLOT NPPLOT), and then conducts normality tests.

Case Processing Summary
                               Cases
             Valid            Missing            Total
           N   Percent       N   Percent       N   Percent
normal   500    100.0%       0       .0%     500    100.0%

Descriptives
normal                                          Statistic   Std. Error
Mean                                               -.0951       .04487
95% Confidence Interval     Lower Bound            -.1832
for Mean                    Upper Bound            -.0069
5% Trimmed Mean                                    -.0933
Median                                             -.1196
Variance                                            1.007
Std. Deviation                                    1.00330
Minimum                                             -2.84
Maximum                                              2.51
Range                                                5.35
Interquartile Range                                  1.42
Skewness                                             .020         .109
Kurtosis                                            -.399         .218

Extreme Values
normal                 Case Number     Value
Highest       1            332          2.51
              2            139          2.42
              3            325          2.42
              4            340          2.21
              5            119          2.15
Lowest        1             29         -2.84
              2            204         -2.59
              3             73         -2.48
              4            391         -2.39
              5            393         -2.24

Tests of Normality
             Kolmogorov-Smirnov(a)            Shapiro-Wilk
           Statistic    df     Sig.      Statistic    df    Sig.
normal          .027   500   .200(*)          .996   500    .168
* This is a lower bound of the true significance.
a Lilliefors Significance Correction

Since N is less than 2,000, we have to read the Shapiro-Wilk statistic, which does not reject the null hypothesis of normality (p=.168). SPSS reports the same Kolmogorov-Smirnov statistic of .027, but it provides an adjusted p-value of .200, a bit larger than the .150 that SAS reports.

6.2 A Nonnormally Distributed Variable

Let us consider per capita gross national income, a variable that is not normally distributed.

6.2.1 Graphical Methods

Figure 25. Histogram and Box Plot of a Nonnormally Distributed Variable
[Histogram and box plot of gnip.]

Figure 25 illustrates that the distribution is heavily skewed to the right. There are many outliers beyond the extreme line in the box plot (left plot of Figure 25). The median and the 25th percentile are close to each other.

The IGRAPH and EXAMINE commands below produce the histogram, stem-and-leaf plot, and box plot of the nonnormally distributed variable gnip. The stem-and-leaf plot is skipped here.
IGRAPH /VIEWNAME='Histogram' /X1 = VAR(gnip) TYPE = SCALE /Y = $count
  /COORDINATE = VERTICAL /X1LENGTH=3.0 /YLENGTH=3.0 /X2LENGTH=3.0
  /CHARTLOOK='NONE'
  /Histogram SHAPE = HISTOGRAM CURVE = OFF X1INTERVAL AUTO X1START = 0.

EXAMINE VARIABLES=gnip
  /PLOT BOXPLOT STEMLEAF HISTOGRAM
  /MISSING LISTWISE /NOTOTAL.

Figure 26 presents the P-P and detrended P-P plots, where data points significantly deviate from the straight fitted line. The detrended P-P plot shows a systematic deviation of data points.

PPLOT /VARIABLES=gnip /NOLOG /NOSTANDARDIZE
  /TYPE=PP /FRACTION=BLOM /TIES=MEAN /DIST=NORMAL.

Figure 26. P-P and Detrended P-P Plots of a Nonnormally Distributed Variable
[Left panel: P-P plot. Right panel: detrended P-P plot.]

The Q-Q and detrended Q-Q plots also show a significant deviation from the fitted line (Figure 27). The detrended Q-Q plot also presents a systematic pattern of deviation, indicating nonnormality of the variable.

PPLOT /VARIABLES=gnip /NOLOG /NOSTANDARDIZE
  /TYPE=QQ /FRACTION=BLOM /TIES=MEAN /DIST=NORMAL.
Figure 27. Q-Q and Detrended Q-Q Plots of a Nonnormally Distributed Variable
[Left panel: Q-Q plot. Right panel: detrended Q-Q plot.]

6.2.2 Numerical Methods

The descriptive statistics of gnip indicate that the variable is not normally distributed. There is a large gap between the mean of 8.9646 and the median of 2.7650. The skewness and kurtosis-3 are 2.049 and 3.608, respectively. The variable appears severely skewed to the right with a higher peak and heavy tails.

EXAMINE VARIABLES=gnip
  /PLOT BOXPLOT STEMLEAF HISTOGRAM NPPLOT
  /MISSING LISTWISE /NOTOTAL.

Case Processing Summary
                               Cases
             Valid            Missing            Total
           N   Percent       N   Percent       N   Percent
gnip     164    100.0%       0       .0%     164    100.0%

Descriptives
gnip                                            Statistic   Std. Error
Mean                                               8.9646      1.05939
95% Confidence Interval     Lower Bound            6.8727
for Mean                    Upper Bound           11.0565
5% Trimmed Mean                                    7.1877
Median                                             2.7650
Variance                                          184.058
Std. Deviation                                   13.56679
Minimum                                               .29
Maximum                                             65.63
Range                                               65.34
Interquartile Range                                  7.92
Skewness                                            2.049         .190
Kurtosis                                            3.608         .377

Tests of Normality
             Kolmogorov-Smirnov(a)            Shapiro-Wilk
           Statistic    df     Sig.      Statistic    df    Sig.
gnip            .284   164     .000           .663   164    .000
a Lilliefors Significance Correction

The Shapiro-Wilk test rejects the null hypothesis of normality at the .05 level. The Jarque-Bera test also rejects the null hypothesis with a large statistic of 204. Its computation is skipped (see section 4.2.3). Based on consistent results from both graphical and numerical methods, we can conclude that the variable gnip is not normally distributed.
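The Kolmogorov-Smirnov statistic in the table above is the largest vertical distance between the empirical distribution function and a normal distribution fitted with the sample mean and standard deviation. The Python sketch below shows that computation under those assumptions; it is an illustration, not SPSS's code, and it does not compute the Lilliefors-corrected significance, which requires special tables or simulation. The function name and the five data values are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def ks_normal_statistic(values):
    """Largest distance D between the empirical CDF and a fitted normal CDF."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))  # parameters estimated from the data
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = fitted.cdf(x)
        # The empirical CDF jumps from (i-1)/n to i/n at x, so check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


# Hypothetical data, for illustration only.
d = ks_normal_statistic([1, 2, 3, 4, 5])
print(round(d, 3))  # about 0.136
```

Because the mean and standard deviation are estimated from the same data, the usual Kolmogorov-Smirnov critical values are too lenient, which is why SPSS applies the Lilliefors correction.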
7. Conclusion

Univariate analysis is the first step of data analysis once a data set is ready. Various descriptive statistics provide valuable basic information about variables that is used to determine what method of analysis should be employed.

Normality is commonly assumed in many statistical and economic methods without any empirical test. Violation of this assumption will result in unreliable inferences and misleading interpretations.

There are graphical and numerical methods for conducting univariate analysis and normality tests (Table 1). Some are descriptive and others are theory-driven. The graphical methods produce various plots, such as a stem-and-leaf plot, histogram, and P-P plot, that are intuitive and easy to interpret. Numerical methods compute a variety of measures of central tendency and dispersion such as mean, median, quantile, variance, and standard deviation.

Skewness and kurtosis provide clues to the normality of a variable. If skewness and kurtosis-3 are close to zero, the variable may be normally distributed. If the skewness of a variable is larger than 0, the variable is skewed to the right with many observations on the left of the distribution; a negative skewness indicates many observations on the right. If kurtosis-3 is greater than 0 (or kurtosis is greater than 3), the distribution has a high peak and heavy tails (third plot in Figure 8). If kurtosis is smaller than 3, the variable has a low peak and thin tails (first plot in Figure 8). Keep in mind that SAS and SPSS report kurtosis-3, while STATA returns kurtosis itself.

In addition to these descriptive statistics, there are formal ways to perform normality tests. The Shapiro-Wilk and Shapiro-Francia tests are proper when N is less than 2,000 and 5,000, respectively. The Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling tests are recommended when N is large. The Jarque-Bera test, although not supported by most statistical software, is a good alternative for normality testing.

The SAS UNIVARIATE and CAPABILITY procedures provide a variety of descriptive statistics and normality testing methods including the Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling tests (Table 5). These procedures produce stem-and-leaf plots, box plots, histograms, P-P plots, and Q-Q plots as well. STATA has various commands for univariate analysis and graphics. In particular, STATA supports the Shapiro-Francia test, a modification of the Shapiro-Wilk test, and the skewness-kurtosis test. But there is no command for the Kolmogorov-Smirnov test for normality in STATA. SPSS can produce detrended P-P and Q-Q plots, and perform the Shapiro-Wilk and Kolmogorov-Smirnov tests with the Lilliefors significance correction.
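The difference in conventions also explains why the packages print slightly different numbers for the same variable: STATA reports 2.030682 and 6.462734 for gnip, whereas SAS and SPSS report 2.049 and 3.608. The Python sketch below converts one set into the other. It assumes that STATA's figures are the plain moment-based skewness and kurtosis and that SAS and SPSS apply the usual small-sample adjustments; the function names are hypothetical, and the formulas are the standard textbook ones rather than anything taken from the vendors' code.

```python
import math


def adjusted_skewness(g1, n):
    """Convert moment-based skewness g1 to the sample-adjusted skewness."""
    return g1 * math.sqrt(n * (n - 1)) / (n - 2)


def adjusted_excess_kurtosis(b2, n):
    """Convert moment-based kurtosis b2 (normal = 3) to adjusted kurtosis-3."""
    g2 = b2 - 3  # excess kurtosis before adjustment
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)


# STATA output for gnip (N=164)
print(round(adjusted_skewness(2.030682, 164), 3))         # 2.049
print(round(adjusted_excess_kurtosis(6.462734, 164), 3))  # 3.608
```

The adjustment shrinks as N grows, so for large samples the only practical difference is whether 3 has been subtracted from kurtosis.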
Appendix A: Data Sets

This document uses the following three variables.

1. A Randomly Drawn Variable

This variable includes 500 observations that were randomly drawn from the standard normal distribution with a seed of 1,234,567. The RANNOR() function of SAS was used as a random number generator.

%LET n=500; %LET dataset=n500;

DATA masil.&dataset;
   seed=1234567;
   DO i=1 TO &n;
      normal=RANNOR(seed);
      OUTPUT;
   END;
RUN;

. tabstat normal, stat(mean sd p25 median p75 skewness kurtosis)

    variable |      mean        sd       p25       p50       p75  skewness  kurtosis
-------------+----------------------------------------------------------------------
      normal | -.0950725  1.003302  -.805191 -.1195922  .6125385  .0203109  2.593181
-------------------------------------------------------------------------------------

2. Per Capita Gross National Income in 2005

This data set includes per capita gross national incomes of 164 countries in the world that are provided by the World Bank (http://web.worldbank.org/).

. tabstat gnip, stat(mean sd p25 median p75 skewness kurtosis)

    variable |      mean        sd       p25       p50       p75  skewness  kurtosis
-------------+----------------------------------------------------------------------
        gnip |  8.964573  13.56679      .955     2.765      8.68  2.030682  6.462734
-------------------------------------------------------------------------------------

3. Unemployment Rate of Illinois, Indiana, and Ohio in 2005

This unemployment rate is provided by the Bureau of Labor Statistics. Actual data were downloaded from http://www.stats.indiana.edu/, the Indiana Business Research Center of the Kelley School of Business, Indiana University.

. tabstat rate, stat(mean sd p25 median p75 skewness kurtosis) by(state)

[Summary for variables: rate, by categories of: state. The state means are 5.421569 (IL), 5.641304 (IN), and 6.3625 (OH), with standard deviations of .9242206, 1.038929, and 1.458098; the overall mean is 5.786879 with a standard deviation of 1.214066.]
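Readers without SAS can draw a comparable variable in other software. The Python sketch below mirrors the DATA step above using only the standard library. It is an analogue, not a replica: Python's generator differs from RANNOR(), so the same seed will not reproduce the 500 values analyzed in this document, and the function name draw_standard_normal is hypothetical.

```python
import random
from statistics import mean, stdev


def draw_standard_normal(n, seed):
    """Draw n pseudo-random values from the standard normal distribution."""
    rng = random.Random(seed)  # a seeded generator makes the draw repeatable
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


normal = draw_standard_normal(500, 1234567)
print(len(normal), round(mean(normal), 4), round(stdev(normal), 4))
```

The sample mean and standard deviation should be near 0 and 1 but will not equal them exactly, just as the SAS draw has a mean of -.0951 and a standard deviation of 1.0033.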
References

Bera, Anil K., and Carlos M. Jarque. 1981. "Efficient Tests for Normality, Homoscedasticity and Serial Independence of Regression Residuals: Monte Carlo Evidence." Economics Letters, 7(4):313-318.
D'Agostino, Ralph B., Albert Belanger, and Ralph B. D'Agostino, Jr. 1990. "A Suggestion for Using Powerful and Informative Tests of Normality." American Statistician, 44(4): 316-321.
Jarque, Carlos M., and Anil K. Bera. 1980. "Efficient Tests for Normality, Homoscedasticity and Serial Independence of Regression Residuals." Economics Letters, 6(3):255-259.
Jarque, Carlos M., and Anil K. Bera. 1987. "A Test for Normality of Observations and Regression Residuals." International Statistical Review, 55(2):163-172.
Mitchell, Michael N. 2004. A Visual Guide to STATA Graphics. College Station, TX: STATA Press.
Royston, J. P. 1982. "An Extension of Shapiro and Wilk's W Test for Normality to Large Samples." Applied Statistics, 31(2): 115-124.
Royston, J. P. 1983. "A Simple Method for Evaluating the Shapiro-Francia W' Test of Non-Normality." Statistician, 32(3) (September): 297-300.
Royston, P. 1991. "Comment on sg3.4 and an Improved D'Agostino test." Stata Technical Bulletin, 3: 13-24.
Royston, P. 1992. "Approximating the Shapiro-Wilk W-Test for Non-normality." Statistics and Computing, 2:117-119.
SAS Institute. 1995. SAS/QC Software: Usage and Reference I and II. Cary, NC: SAS Institute.
SAS Institute. 2004. SAS 9.1.3 Procedures Guide Volume 4. Cary, NC: SAS Institute.
Shapiro, S. S., and M. B. Wilk. 1965. "An Analysis of Variance Test for Normality (Complete Samples)." Biometrika, 52(3/4) (December):591-611.
Shapiro, S. S., and R. S. Francia. 1972. "An Approximate Analysis of Variance Test for Normality." Journal of the American Statistical Association, 67 (337) (March): 215-216.
STATA Press. 2003. STATA Graphics Reference Manual Release 8. College Station, TX: STATA Press.
STATA Press. 2005. STATA Reference Manual Release 9. College Station, TX: STATA Press.
Acknowledgements

I am grateful to Jeremy Albright and Kevin Wilhite at the UITS Center for Statistical and Mathematical Computing for comments and suggestions.

Revision History

• 2002 First draft
• 2006 Revision with new data