
# © 2002-2006 The Trustees of Indiana University

Univariate Analysis and Normality Test: 1

Univariate Analysis and Normality Test Using SAS,
STATA, and SPSS
Hun Myoung Park
This document summarizes graphical and numerical methods for univariate analysis and
normality test, and illustrates how to test normality using SAS 9.1, STATA 9.2 SE, and
SPSS 14.0.
1. Introduction
2. Graphical Methods
3. Numerical Methods
4. Testing Normality Using SAS
5. Testing Normality Using STATA
6. Testing Normality Using SPSS
7. Conclusion

1. Introduction
Descriptive statistics provide important information about variables. Mean, median, and
mode measure the central tendency of a variable. Measures of dispersion include variance,
standard deviation, range, and interquartile range (IQR). Researchers may draw a
histogram, a stem-and-leaf plot, or a box plot to see how a variable is distributed.
Statistical methods are based on various underlying assumptions. One common
assumption is that a random variable is normally distributed. In many statistical analyses,
normality is often conveniently assumed without any empirical evidence or test. But
normality is critical in many statistical methods. When this assumption is violated,
interpretation and inference may not be reliable or valid.
Figure 1. Comparing the Standard Normal and a Bimodal Probability Distribution
[Figure: two density plots titled "Standard Normal Distribution" and "Bimodal Distribution"; horizontal axis from -5 to 5, vertical axis from 0 to .4]


T-test and ANOVA (Analysis of Variance) compare group means, assuming variables
follow normal probability distributions. Otherwise, these methods do not make much
sense. Figure 1 illustrates the standard normal probability distribution and a bimodal
distribution. How can you compare means of these two random variables?
There are two ways of testing normality (Table 1). Graphical methods display the
distributions of random variables or differences between an empirical distribution and a
theoretical distribution (e.g., the standard normal distribution). Numerical methods
present summary statistics such as skewness and kurtosis, or conduct statistical tests of
normality. Graphical methods are intuitive and easy to interpret, while numerical
methods provide more objective ways of examining normality.
Table 1. Graphical Methods versus Numerical Methods
                 Graphical Methods                  Numerical Methods
Descriptive      Stem-and-leaf plot,                Skewness
                 (skeletal) box plot                Kurtosis
                 Histogram
Theory-driven    P-P plot                           Shapiro-Wilk, Shapiro-Francia test
                 Q-Q plot                           Kolmogorov-Smirnov test (Lilliefors test)
                                                    Anderson-Darling/Cramer-von Mises tests
                                                    Jarque-Bera test, Skewness-Kurtosis test

Graphical and numerical methods are either descriptive or theory-driven. The dot plot
and histogram, for instance, are descriptive graphical methods, while skewness and
kurtosis are descriptive numerical methods. The P-P and Q-Q plots are theory-driven
graphical methods for testing normality, whereas the Shapiro-Wilk W and Jarque-Bera
tests are theory-driven numerical methods.
Figure 2. Histograms of Normally and Non-normally Distributed Variables
[Figure: left panel, "A Normally Distributed Variable (N=500)", randomly drawn from the standard normal distribution (Seed=1,234,567), horizontal axis from -3 to 3; right panel, "A Non-normally Distributed Variable (N=164)", Per Capita Gross National Income in 2005 ($1,000), horizontal axis from 0 to 60]


Three variables are employed here. The first variable is the 2005 unemployment rate of
counties in Illinois, Indiana, and Ohio. The second variable includes 500 observations
that were randomly drawn from the standard normal distribution. This variable is supposed
to be normally distributed with zero mean and a variance of 1 (left plot in Figure 2). An
example of a non-normal distribution is per capita gross national income (GNI) in 2005
of 164 countries in the world. This variable, GNIP, is severely skewed to the right and is
very unlikely to be normally distributed (right plot in Figure 2). See the Appendix for
details about these variables.

http://www.indiana.edu/~statmath

2. Graphical Methods
Graphical methods visualize the distribution of a random variable and compare the
distribution to a theoretical one using plots. These methods are either descriptive or
theory-driven. The former method is based on the empirical data, whereas the latter
considers both empirical and theoretical distributions.
2.1 Descriptive Plots
Among frequently used descriptive plots are the stem-and-leaf plot, dot plot, (skeletal)
box plot, and histogram. When N is small, a stem-and-leaf plot or dot plot is useful to
summarize data. Both plots work well for continuous or event count variables. Figure 3
presents the stem-and-leaf plots for unemployment rates of three states.
Figure 3. Stem-and-Leaf Plot of Unemployment Rate of Illinois, Indiana, Ohio

    -> state = IL
    Stem-and-leaf plot for rate (Rate)
    rate rounded to nearest multiple of .1
    plot in units of .1

       3. | 7889
       4* | 011122344
       4. | 556666666677778888999
       5* | 0011122222333333344444
       5. | 5555667777777888999
       6* | 000011222333444
       6. | 555579
       7* | 0033
       7. |
       8* | 0
       8. | 8

    -> state = IN
    Stem-and-leaf plot for rate (Rate)
    rate rounded to nearest multiple of .1
    plot in units of .1

       3* | 1
       3. | 89
       4* | 012234
       4. | 566666778889999
       5* | 00000111222222233344
       5. | 555666666777889
       6* | 002222233344
       6. | 5666677889
       7* | 1113344
       7. | 67
       8* | 14

    -> state = OH
    Stem-and-leaf plot for rate (Rate)
    rate rounded to nearest multiple of .1
    plot in units of .1

       3* | 8
       4* | 014577899
       5* | 01223333445556667778888888999
       6* | 001111122222233444446678899
       7* | 01223335677
       8* | 1223338
       9* | 99
      10* | 1
      11* |
      12* |
      13* | 3


A box plot presents the minimum, 25th percentile (1st quartile), 50th percentile (median),
75th percentile (3rd quartile), and maximum in a box and lines (note 1). Outliers, if any,
appear outside the (adjacent) minimum and maximum lines. As such, a box plot effectively
summarizes these major percentiles. If a variable is normally distributed, its 25th and
75th percentiles are symmetric about the median, and its median and mean are located at
the same point exactly in the center of the box (note 2).
In Figure 4, you should see outliers in Illinois and Ohio that affect the shapes of the
corresponding boxes. In contrast, the Indiana unemployment rate does not have outliers,
and its symmetric box suggests that the rate is approximately normally distributed.
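
The quantities a box plot draws can be sketched in a few lines of Python. This is only an illustration, not the SAS or STATA implementation: the 1.5 x IQR fences used to flag outliers are a common convention assumed here (the text does not state the rule its software uses), the quartile interpolation method differs across packages, and the `rates` values are hypothetical.

```python
import statistics

def box_plot_summary(values):
    """Return the numbers a (skeletal) box plot draws, plus flagged outliers."""
    data = sorted(values)
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumption: the usual 1.5 x IQR fences decide what counts as an outlier.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "adjacent_min": inside[0],    # end of the lower whisker
        "adjacent_max": inside[-1],   # end of the upper whisker
        "outliers": [x for x in data if x < lower_fence or x > upper_fence],
    }

# Hypothetical unemployment rates (percent), not taken from the paper's data.
rates = [3.8, 4.9, 5.3, 5.8, 6.1, 6.4, 7.0, 7.7, 8.3, 13.3]
summary = box_plot_summary(rates)
print(summary)
```

A single large value such as 13.3 is flagged as an outlier and stretches the range without moving the box much, which is the pattern the text describes for Illinois and Ohio.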

Note 1. The first quartile cuts off the lowest 25 percent of data; the second quartile cuts
the data set in half; and the third quartile cuts off the lowest 75 percent or highest 25
percent of data. See http://en.wikipedia.org/wiki/Quartile.
Note 2. SAS reports a mean as “+” between (adjacent) minimum and maximum lines.

Figure 4. Box Plots of Unemployment Rates of Illinois, Indiana, and Ohio
[Figure: box plots for Illinois (N=102), Indiana (N=92), and Ohio (N=88); horizontal axis: Unemployment Rate (%), from 2 to 14. Source: Bureau of Labor Statistics; Indiana Business Research Center (http://www.stats.indiana.edu/)]

The histogram graphically shows how each category (interval) accounts for the proportion
of total observations and is more appropriate for large N samples (Figure 5).

Figure 5. Histograms of Unemployment Rates of Illinois, Indiana, and Ohio
[Figure: histograms for Illinois (N=102), Indiana (N=92), and Ohio (N=88); horizontal axes from 0 to 15, vertical axes from 0 to .5. Source: Bureau of Labor Statistics; Indiana Business Research Center (http://www.stats.indiana.edu/)]

2.2 Theory-driven Plots
The P-P and Q-Q plots are considered here. The probability-probability plot (P-P plot or
percent plot) compares an empirical cumulative distribution function of a variable with a
specific theoretical cumulative distribution function (e.g., the standard normal
distribution function). In Figure 6, Ohio appears to deviate more from the fitted line
than Indiana.

Figure 6. P-P Plots of Unemployment Rates of Indiana and Ohio (Year 2005)
[Figure: two P-P plots, "2005 Indiana Unemployment Rate (N=92 Counties)" and "2005 Ohio Unemployment Rate (N=88 Counties)"; horizontal axis: Empirical P[i] = i/(N+1), from 0.00 to 1.00. Source: Bureau of Labor Statistics]

Similarly, the quantile-quantile plot (Q-Q plot) compares ordered values of a variable with
quantiles of a specific theoretical distribution (i.e., the normal distribution). If two
distributions match, the points on the plot will form a linear pattern passing through the
origin with a unit slope. P-P and Q-Q plots are used to see how well a theoretical
distribution models the empirical data. In Figure 7, Indiana appears to have a smaller
variation in its unemployment rate than Ohio. In contrast, Ohio appears to have a wider
range of outliers in the upper extreme.

Figure 7. Q-Q Plots of Unemployment Rates of Indiana and Ohio (Year 2005)
[Figure: two Q-Q plots, "2005 Indiana Unemployment Rate (N=92 Counties)" and "2005 Ohio Unemployment Rate (N=88 Counties)"; vertical axis: Unemployment Rate in 2005; horizontal axis: Inverse Normal. Grid lines are 5, 10, 25, 50, 75, 90, and 95 percentiles. Source: Bureau of Labor Statistics]

Detrended normal P-P and Q-Q plots depict the actual deviations of data points from the
straight horizontal line at zero. No specific pattern in a detrended plot indicates
normality of the variable. SPSS can generate detrended P-P and Q-Q plots.
Although visually appealing, these graphical methods do not provide objective criteria to
determine normality of variables. Interpretations are thus a matter of judgment.
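
The coordinates behind P-P and Q-Q plots can be sketched as follows. This is a minimal Python illustration, not the STATA .pnorm/.qnorm or SAS implementation. It assumes the plotting position P[i] = i/(N+1) shown on the axis of Figure 6 and a normal distribution fitted with the sample mean and standard deviation; the function name and the sample values are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def pp_qq_points(values):
    """Return (P-P points, Q-Q points) against a normal fitted to the data."""
    data = sorted(values)
    n = len(data)
    # Assumption: the theoretical distribution uses the sample mean and SD.
    fitted = NormalDist(mean(data), stdev(data))
    pp, qq = [], []
    for i, x in enumerate(data, start=1):
        p = i / (n + 1)                    # empirical P[i] = i/(N+1)
        pp.append((p, fitted.cdf(x)))      # P-P: empirical vs. theoretical probability
        qq.append((fitted.inv_cdf(p), x))  # Q-Q: theoretical quantile vs. ordered value
    return pp, qq

# Hypothetical sample, not taken from the paper's data.
example_pp, example_qq = pp_qq_points([5.5, 4.1, 6.3, 5.0, 7.2])
for (p, theoretical_p), (q, x) in zip(example_pp, example_qq):
    print(f"P-P: ({p:.3f}, {theoretical_p:.3f})   Q-Q: ({q:.3f}, {x:.1f})")
```

If the data are close to normal, both sets of points lie near the 45-degree line; a detrended version would plot the vertical distance from that line instead.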

3. Numerical Methods
Numerical methods use descriptive statistics and statistical tests to examine normality.

3.1 Descriptive Statistics
Measures of dispersion such as variance reveal how observations of a random variable
deviate from the mean. The (sample) variance of a variable is computed from the second
central moment.

$$s^2 = \frac{\sum_i (x_i - \bar{x})^2}{n-1}$$

Skewness is based on the third standardized moment that measures the degree of symmetry
of a probability distribution. If skewness is greater than zero, the distribution is skewed
to the right, having more observations on the left.

$$\frac{E[(x-\mu)^3]}{\sigma^3} = \frac{\sum_i (x_i - \bar{x})^3}{s^3 (n-1)} = \frac{\sqrt{n-1}\,\sum_i (x_i - \bar{x})^3}{\left[\sum_i (x_i - \bar{x})^2\right]^{3/2}}$$

Kurtosis, based on the fourth central moment, measures the heaviness of the tails, often
loosely described as the “peakedness”, of a probability distribution.

$$\frac{E[(x-\mu)^4]}{\sigma^4} = \frac{\sum_i (x_i - \bar{x})^4}{s^4 (n-1)} = \frac{(n-1)\,\sum_i (x_i - \bar{x})^4}{\left[\sum_i (x_i - \bar{x})^2\right]^{2}}$$

Like skewness, kurtosis also shows how the distribution of a variable deviates from a
normal distribution. If kurtosis of a random variable is less than three (or if kurtosis-3
is less than zero), the distribution has lighter tails and a flatter peak compared to a
normal distribution (first plot in Figure 8).

The following is a list of descriptive statistics for unemployment rates of three states.

     state |   N      mean   median    max   min   variance   skewness   kurtosis
    -------+----------------------------------------------------------------------
        IL | 102  5.421569     5.35    8.8   3.7   .8541837   .6570033   3.946029
        IN |  92  5.641304      5.5    8.4   3.1   1.079374   .3416314   2.785585
        OH |  88    6.3625      6.1   13.3   3.8   2.126049   1.665322   8.043097
    -------+----------------------------------------------------------------------
     Total | 282  5.786879     5.65   13.3   3.1   1.473955    1.44809   8.383285
    ------------------------------------------------------------------------------

Indiana has the smallest skewness of .3416 that is close to zero. In contrast, Ohio has a
large skewness of 1.6653, indicating many observations on the left of the probability
distribution. See the kurtosis of 2.7856 of Indiana (note 3).

Note 3. SAS and SPSS produce (kurtosis -3), while STATA returns the kurtosis. SAS uses its
weighted kurtosis formula with the degree of freedom adjusted. So, if N is small, SAS,
STATA, and SPSS may report different kurtosis.
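
The three moment-based formulas of Section 3.1 can be written out directly. The Python sketch below follows the formulas as printed in this section (with n-1 in the scaling); it is not the routine used by SAS, STATA, or SPSS, each of which applies its own small-sample adjustment, and the function name and example data are hypothetical.

```python
import math

def moment_statistics(values):
    """Sample variance, skewness, and kurtosis following the Section 3.1 formulas."""
    n = len(values)
    xbar = sum(values) / n
    ss2 = sum((x - xbar) ** 2 for x in values)   # sum of squared deviations
    ss3 = sum((x - xbar) ** 3 for x in values)   # sum of cubed deviations
    ss4 = sum((x - xbar) ** 4 for x in values)   # sum of fourth-power deviations
    variance = ss2 / (n - 1)
    skewness = math.sqrt(n - 1) * ss3 / ss2 ** 1.5
    kurtosis = (n - 1) * ss4 / ss2 ** 2          # compare with 3 for a normal distribution
    return variance, skewness, kurtosis

# Hypothetical symmetric sample: skewness is zero and kurtosis is well below 3.
variance, skewness, kurtosis = moment_statistics([1, 2, 3, 4, 5])
print(variance, skewness, kurtosis)
```

Subtracting 3 from the returned kurtosis gives the (kurtosis-3) quantity that SAS and SPSS report.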

A kurtosis larger than 3, in contrast, indicates heavier tails, typically accompanied by a
sharper peak (last plot in Figure 8). Note that Ohio has a large kurtosis of 8.0431. A
normally distributed random variable should have skewness and kurtosis near zero and
three, respectively (second plot in Figure 8).

Figure 8. Probability Distributions with Different Kurtosis
[Figure: three density plots titled "Kurtosis < 3", "Kurtosis = 3", and "Kurtosis > 3"; horizontal axes from -5 to 5, vertical axes from 0 to .8]

3.2 Theory-driven Statistics
Skewness and kurtosis are based on the empirical data. The numerical methods for testing
normality compare empirical data with a theoretical distribution. Widely used methods
include the Kolmogorov-Smirnov (K-S) D test (Lilliefors test), Shapiro-Wilk test,
Anderson-Darling test, and Cramer-von Mises test (SAS Institute 1995) (note 4). The K-S D
test and Shapiro-Wilk W test are commonly used. The K-S, Anderson-Darling, and Cramer-von
Mises tests are based on the empirical distribution function (EDF), which is defined as a
set of N independent observations x1, x2, …, xn with a common distribution function F(x)
(SAS 2004).

Table 2. Numerical Methods for Testing Normality
    Test                 Stat.   N range       Dist.    SAS   STATA       SPSS
    Jarque-Bera          χ2      -             χ2(2)    -     -           -
    Skewness-Kurtosis    χ2      9≤N           χ2(2)    -     .sktest     -
    Shapiro-Wilk         W       7≤N≤2,000     -        YES   .swilk      YES
    Shapiro-Francia      W’      5≤N≤5,000     -        -     .sfrancia   -
    Kolmogorov-Smirnov   D       -             EDF      YES   *           YES
    Cramer-von Mises     W2      -             EDF      YES   -           -
    Anderson-Darling     A2      -             EDF      YES   -           -
    * The STATA .ksmirnov command is not used for testing normality.

Note 4. The UNIVARIATE and CAPABILITY procedures have the NORMAL option to produce four
statistics.
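
The EDF idea behind the Kolmogorov-Smirnov D statistic can be sketched as the largest gap between the empirical distribution function and a fitted normal CDF. The Python code below is an illustration only: it estimates the mean and standard deviation from the sample (the Lilliefors setting) and returns D without a p-value, because the p-value requires special tables or simulation that are not covered here. The function name and data are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def ks_d_statistic(values):
    """Largest distance between the EDF and a normal CDF fitted to the sample."""
    data = sorted(values)
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    d = 0.0
    for i, x in enumerate(data, start=1):
        theoretical = fitted.cdf(x)
        # The EDF jumps from (i-1)/n to i/n at x, so both sides are checked.
        d = max(d, i / n - theoretical, theoretical - (i - 1) / n)
    return d

# Hypothetical sample, not taken from the paper's data.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9]
d_value = ks_d_statistic(sample)
print(round(d_value, 4))
```

Smaller values of D mean the empirical and theoretical distribution functions track each other more closely.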

The Shapiro-Wilk W is the ratio of the best estimator of the variance to the usual
corrected sum of squares estimator of the variance (Shapiro and Wilk 1965) (note 5). The
statistic is positive and less than or equal to one. Being close to one indicates
normality. The W statistic requires that the sample size is greater than or equal to 7 and
less than or equal to 2,000 (Shapiro and Wilk 1965) (note 6).

$$W = \frac{\left(\sum_i a_i x_{(i)}\right)^2}{\sum_i (x_i - \bar{x})^2}$$

where $a' = (a_1, a_2, \ldots, a_n) = m'V^{-1}\left[m'V^{-1}V^{-1}m\right]^{-1/2}$,
$m' = (m_1, m_2, \ldots, m_n)$ is the vector of expected values of standard normal order
statistics, $V$ is the n by n covariance matrix, $x' = (x_1, x_2, \ldots, x_n)$ is a random
sample, and $x_{(1)} < x_{(2)} < \ldots < x_{(n)}$.

The Shapiro-Francia (S-F) W’ test is an approximate test that modifies the Shapiro-Wilk W.
The S-F statistic uses $b' = (b_1, b_2, \ldots, b_n) = m'(m'm)^{-1/2}$ instead of $a'$. The
statistic was developed by Shapiro and Francia (1972) and Royston (1983). The recommended
sample sizes for the STATA .sfrancia command range from 5 to 5,000 (STATA 2005). SAS and
SPSS do not support this statistic.

The SAS UNIVARIATE and CAPABILITY procedures perform the Kolmogorov-Smirnov D,
Anderson-Darling A2, and Cramer-von Mises W2 tests, which are useful especially when N is
larger than 2,000. Table 3 summarizes test statistics for 2005 unemployment rates of
Illinois, Indiana, and Ohio.

Table 3. Normality Test for 2005 Unemployment Rates of Illinois, Indiana, and Ohio
[Table: test statistic and P-value for each of Illinois, Indiana, and Ohio. Rows: Shapiro-Wilk (SAS), Shapiro-Wilk (STATA), Shapiro-Francia (STATA), Kolmogorov-Smirnov (SAS), Cramer-von Mises (SAS), Anderson-Darling (SAS), Jarque-Bera, and Skewness-Kurtosis (STATA). The cell values are scrambled in this copy and cannot be assigned to cells reliably.]

3.3 Jarque-Bera (Skewness-Kurtosis) Test
The test statistics mentioned in the previous section tend to reject the null hypothesis
when N becomes large. Given a large number of observations, the Jarque-Bera test and
Skewness-Kurtosis test will be the alternatives for testing normality.

Note 5. The W statistic was constructed by considering the regression of ordered sample
values on corresponding expected normal order statistics, which for a sample from a
normally distributed population is linear (Royston 1982). Shapiro and Wilk’s (1965)
original W statistic is valid for the sample sizes between 3 and 50, but Royston extended
the test by developing a transformation of the null distribution of W to approximate
normality throughout the range between 7 and 2000.
Note 6. The STATA .swilk command, based on Shapiro and Wilk (1965) and Royston (1992), can
be used with from 4 to 2000 observations (STATA 2005).

The Jarque-Bera test, a type of Lagrange multiplier test, was developed to test normality,
heteroscedasticity, and serial correlation (autocorrelation) of regression residuals
(Jarque and Bera 1980). The Jarque-Bera statistic is computed from skewness and kurtosis
and asymptotically follows the chi-squared distribution with two degrees of freedom.

$$n\left[\frac{\text{skewness}^2}{6} + \frac{(\text{kurtosis}-3)^2}{24}\right] \sim \chi^2(2),$$

where n is the number of observations. The above formula gives a penalty for increasing
the number of observations that implies a good asymptotic property of the Jarque-Bera
test. The computation for 2005 unemployment rates is as follows (note 7).

    Illinois:  12.292825 = 102*(0.66685022^2/6 + 1.0553068^2/24)
    Indiana:   1.9458304 = 92*(0.34732004^2/6 + (-0.1583764)^2/24)
    Ohio:      149.54945 = 88*(1.69434105^2/6 + 5.4132289^2/24)

The STATA Skewness-Kurtosis test is based on D’Agostino, Belanger, and D’Agostino, Jr.
(1990) and Royston (1991) (STATA 2005). Note that in Ohio the Jarque-Bera statistic of 150
is quite different from the S-K statistic of 44 (see Table 3).

Note 7. Skewness and kurtosis are computed using the SAS UNIVARIATE and CAPABILITY
procedures that report kurtosis minus 3.
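
The Jarque-Bera computation for the three states can be checked with a short Python sketch. The inputs are the skewness and (kurtosis-3) values quoted in the computation above; the function name is hypothetical. The p-value uses the fact that the upper tail probability of a chi-squared distribution with two degrees of freedom is exp(-x/2), so no statistical library is needed.

```python
import math

def jarque_bera(n, skewness, excess_kurtosis):
    """Jarque-Bera statistic and its asymptotic chi-squared(2) p-value."""
    statistic = n * (skewness ** 2 / 6 + excess_kurtosis ** 2 / 24)
    # For two degrees of freedom the chi-squared survival function is exp(-x/2).
    p_value = math.exp(-statistic / 2)
    return statistic, p_value

# (N, skewness, kurtosis-3) as quoted in the text for the 2005 unemployment rates.
inputs = {
    "Illinois": (102, 0.66685022, 1.0553068),
    "Indiana": (92, 0.34732004, -0.1583764),
    "Ohio": (88, 1.69434105, 5.4132289),
}
results = {state: jarque_bera(*args) for state, args in inputs.items()}
for state, (statistic, p_value) in results.items():
    print(f"{state}: JB = {statistic:.4f}, p = {p_value:.4f}")
```

Under this asymptotic approximation, only Indiana fails to reject normality at the .05 level.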
Table 4.1000) .2500) .2500) .9873 (.3877) .7507 1.2570) .2500) 5.5498 .3757) -.0065 -3.2633 2.0269 (.0219 .1500) .9980 (.0348 (.0708 (.0203 2.1500) .6310 1.69434105^2/6 + 5.© 2002-2006 The Trustees of Indiana University Univariate Analysis and Normality Test: 9 The Jarque-Bera test.0033 -2.9998 (.2500) .9620 -.54945 = 88*(1. where n is the number of observations.9840 (.edu/~statmath .0850) .6411 1.2843) .0834 (.6623 3.2340) .1000) .5240 .7099 -.000 Mean Standard deviation Minimum 1stquantile Median 3rdquantile Maximum Skewness sas Kurtosis-3 sas Jarque-Bera Skewness stata Kurtosis stata S-K stata Shapiro-Wilk W Shapiro-F W’stata Kolmogorov-S Dsas Cramer-M W2 sas Anderson-D A2 sas .1500) .0224 .26 (.2589) .0204 -.9580 (.indiana.2618 (.1620 -1.2797) .7 12.1500) .70 (.9965 (. a type of Lagrange multiplier test.0153 1.3140 .0607 (.2238 2.000 -.000 10.9921 1.7171 (.4009) .2500) .5087) .0793 (.

Table 4 presents results of normality tests for random variables with different values for
N. The data were randomly generated from the standard normal distribution with a seed of
1,234,567 in SAS. As N grows, the mean, median, skewness, and (kurtosis-3) approach zero,
while the standard deviation gets close to 1.

The Kolmogorov-Smirnov D, Anderson-Darling A2, and Cramer-von Mises W2 are computed in SAS,
while the Skewness-Kurtosis and Shapiro-Francia W’ are computed in STATA. All four
statistics do not reject the null hypothesis of normality regardless of the number of
observations (Table 4). Note that the Shapiro-Wilk W is not reliable when N is larger than
2,000 and S-F W’ is valid up to 5,000 observations. The Jarque-Bera and Skewness-Kurtosis
tests show consistent results when N is large.

3.4 Software Issues
The UNIVARIATE procedure of SAS/BASE and CAPABILITY of SAS/QC compute various statistics
and produce P-P and Q-Q plots. These procedures provide many numerical methods including
Cramer-von Mises and Anderson-Darling (note 8). The P-P plot is generated only in
CAPABILITY.

In contrast, STATA has many individual commands to examine normality. In particular, STATA
provides .sktest and .sfrancia to conduct Skewness-Kurtosis and Shapiro-Francia W’ tests,
respectively. SPSS EXAMINE provides numerical and graphical methods for normality test.
The detrended P-P and Q-Q plots can be generated in SPSS. Table 5 summarizes SAS
procedures and STATA/SPSS commands that are used to test normality of random variables.

Table 5. Comparison of Procedures and Commands Available
                                  SAS 9.1                  STATA 9.2 SE            SPSS 14.0
    Descriptive statistics        UNIVARIATE, CAPABILITY   .summarize, .tabstat    Descriptives, Frequencies, Examine
      (Skewness/Kurtosis)
    Histogram, dot plot           UNIVARIATE, CAPABILITY   .histogram, .dotplot    Graph, Igraph, Examine, Frequencies
    Stem-and-leaf plot            UNIVARIATE               .stem                   Examine
    Box plot                      UNIVARIATE               .graph box              Examine, Igraph
    P-P plot                      CAPABILITY               .pnorm                  Pplot
    Q-Q plot                      UNIVARIATE, CAPABILITY   .qnorm                  Pplot, Examine
    Detrended Q-Q/P-P plot        -                        -                       Pplot, Examine
    Jarque-Bera (S-K) test        -                        .sktest                 -
    Shapiro-Wilk W                UNIVARIATE, CAPABILITY   .swilk                  Examine
    Shapiro-Francia W’            -                        .sfrancia               -
    Kolmogorov-Smirnov            UNIVARIATE, CAPABILITY   -                       Examine
    Cramer-von Mises              UNIVARIATE, CAPABILITY   -                       -
    Anderson-Darling              UNIVARIATE, CAPABILITY   -                       -

Note 8. MINITAB also performs the Kolmogorov-Smirnov and Anderson-Darling tests.
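
The pattern described for Table 4, with sample moments settling toward their population values as N grows, can be illustrated with a small simulation. This Python sketch is not a reproduction of Table 4: Python's generator differs from the SAS random number stream, so the seed below is only illustrative and the numbers will not match the SAS results. The moment formulas follow Section 3.1, and the function name is hypothetical.

```python
import math
import random

def sample_moments(sample):
    """Mean, standard deviation, skewness, and kurtosis-3 of a sample."""
    n = len(sample)
    xbar = sum(sample) / n
    ss2 = sum((x - xbar) ** 2 for x in sample)
    ss3 = sum((x - xbar) ** 3 for x in sample)
    ss4 = sum((x - xbar) ** 4 for x in sample)
    sd = math.sqrt(ss2 / (n - 1))
    skewness = math.sqrt(n - 1) * ss3 / ss2 ** 1.5
    excess_kurtosis = (n - 1) * ss4 / ss2 ** 2 - 3
    return xbar, sd, skewness, excess_kurtosis

# Illustrative seed only; it does not reproduce the SAS random number stream.
rng = random.Random(1234567)
moments_by_n = {}
for n in (10, 100, 1000, 10000):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    moments_by_n[n] = sample_moments(sample)
    print(n, ["%.4f" % value for value in moments_by_n[n]])
```

With small N the moments can sit far from 0, 1, 0, and 0 even though the data are truly normal, which is one reason descriptive statistics alone are not conclusive.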

4. Testing Normality in SAS
SAS has the UNIVARIATE and CAPABILITY procedures to compute descriptive statistics, draw
various graphs, and conduct statistical tests for normality. The two procedures have
similar usage and produce similar statistics in the same format. However, UNIVARIATE
produces stem-and-leaf, box, and normal probability plots, while CAPABILITY provides P-P
and CDF plots that UNIVARIATE does not.

This section illustrates how to summarize normally and non-normally distributed variables
and conduct normality tests of these variables using the two procedures. Figure 9 presents
histograms of these variables.

Figure 9. Histograms of Normally and Non-normally Distributed Variables
[Figure: left panel, histogram of random with fitted curve Normal (Mu=-.095 Sigma=1.0033); right panel, histogram of GNIP with fitted curve Normal (Mu=8.9646 Sigma=13.567); vertical axes: Percent]

4.1 A Normally Distributed Variable
The UNIVARIATE procedure provides a variety of descriptive statistics, and draws Q-Q,
stem-and-leaf, normal probability, and box plots. This procedure also conducts
Shapiro-Wilk, Kolmogorov-Smirnov, Anderson-Darling, and Cramer-von Mises tests. The
Shapiro-Wilk W will be reported only if N is not larger than 2000.

Let us take a look at an example of the UNIVARIATE procedure. The NORMAL option performs
normality tests; the PLOT option draws a stem-and-leaf plot and a box plot; finally, the
QQPLOT statement draws a Q-Q plot.

    PROC UNIVARIATE DATA=masil.normality NORMAL PLOT;
       VAR random;
       QQPLOT random /NORMAL(MU=EST SIGMA=EST COLOR=RED L=1);
    RUN;

Like UNIVARIATE, the CAPABILITY procedure also produces various descriptive statistics and
plots. CAPABILITY can draw a P-P plot using the PPPLOT option but does not support
stem-and-leaf, box, and normal probability plots.

4.1.1 SAS Output of Descriptive Statistics
In the following CAPABILITY procedure, QQPLOT, PPPLOT, and HISTOGRAM statements
respectively draw a Q-Q plot, a P-P plot, and a histogram. Note that the INSET statement
adds summary statistics to graphs such as a histogram and a Q-Q plot (note 9).

    PROC CAPABILITY DATA=masil.normality NORMAL;
       VAR random;
       QQPLOT random /NORMAL(MU=EST SIGMA=EST COLOR=RED L=1);
       PPPLOT random /NORMAL(MU=EST SIGMA=EST COLOR=RED L=1);
       HISTOGRAM /NORMAL(COLOR=MAROON W=4) CFILL = BLUE CFRAME = LIGR;
       INSET MEAN STD /CFILL=BLANK FORMAT=5.2;
    RUN;

    The CAPABILITY Procedure
    Variable: random

                                Moments
    N                         500    Sum Weights                500
    Mean               -0.0950725    Sum Observations    -47.536241
    Std Deviation      1.00330171    Variance            1.00661432
    Skewness           -0.0203721    Kurtosis            -0.3988198
    Uncorrected SS     506.819932    Corrected SS        502.300544
    Coeff Variation    -1055.3019    Std Error Mean      0.04486902

                  Basic Statistical Measures
        Location                    Variability
    Mean     -0.09507     Std Deviation            1.00330
    Median   -0.11959     Variance                 1.00661
    Mode      .           Range                    5.34911
                          Interquartile Range      1.41773

               Tests for Location: Mu0=0
    Test           -Statistic-    -----p Value------
    Student's t    t  -2.11889    Pr > |t|    0.0346
    Sign           M       -28    Pr >= |M|   0.0138
    Signed Rank    S     -6523    Pr >= |S|   0.0435

                       Tests for Normality
    Test                  --Statistic---    -----p Value-----
    Shapiro-Wilk          W     0.995564    Pr < W      0.168
    Kolmogorov-Smirnov    D     0.026891    Pr > D     >0.150
    Cramer-von Mises      W-Sq  0.083351    Pr > W-Sq   0.195
    Anderson-Darling      A-Sq  0.540894    Pr > A-Sq   0.171

    Quantiles (Definition 5)
    Quantile         Estimate
    100% Max      2.511694336
    99%           2.055464409
    95%           1.530450397
    90%           1.215210586
    75% Q3        0.612538495
    50% Median   -0.119592165
    25% Q1       -0.805191028
    10%          -1.413548051
    5%           -1.794057126
    1%           -2.219479314
    0% Min       -2.837417522

                 Extreme Observations
    -------Lowest-------      -------Highest------
          Value      Obs            Value      Obs
    -2.83741752       29       2.14897641      119
    -2.59039285      204       2.21109349      340
    -2.47829639       73       2.42113892      325
    -2.39126554      391       2.42171307      139
    -2.24047386      393       2.51169434      332

Note 9. Unlike the UNIVARIATE statement, the CAPABILITY statement does not have the PLOT
option that draws stem-and-leaf, box, and normal probability plots.

4.1.2 Graphical Methods
The stem-and-leaf and box plots, produced by the UNIVARIATE procedure, illustrate that the
variable is normally distributed (Figure 10). The locations of the first quartile, mean,
median, and third quartile indicate a bell-shaped distribution. Note that the mean -.0951
and median -.1196 are very close.

Figure 10. Stem-and-Leaf and Box Plots of a Normally Distributed Variable

        Histogram                                       #   Boxplot
     2.75+*                                             1      |
         .**                                            4      |
         .********                                     23      |
         .****************                             46      |
         .***********************                      68   +-----+
         .***************************                  80   |     |
         .***************************************     116   *--+--*
         .**********************                       64   +-----+
         .*******************                          56      |
         .*********                                    27      |
         .*****                                        13      |
    -2.75+*                                             2      |
          ----+----+----+----+----+----+----+---
          * may represent up to 3 counts

Shapiro-Wilk W of . Normal Probability Plot of a Normally Distributed Variable Normal Probability Plot 2. 4 n r a n d o m o f 0 -1 r a n d o m 0.© 2002-2006 The Trustees of Indiana University Univariate Analysis and Normality Test: 14 The normal probability plot available in UNIVARIATE shows a straight line.3 Numerical Methods The mean of -. 0033) 0. 0 -4 -3 -2 -1 0 Nor m al 1 2 3 4 Q uant i l es 4. 0 3 2 C u 0.3988.edu/~statmath .0951 is very close to 0 and variance is almost 1.0.75+ * | +++** | ******** | ******* | ***** | ****** | ******* | ***** | ****** | ****** |******* -2. 0 -3 0. 095 Si gm a=1.75+*+ +----+----+----+----+----+----+----+----+----+----+ -2 -1 0 +1 +2 The P-P and Q-Q plots show that the data points do not seriously deviates from the fitted line (Figure 12). 2 0. Figure 12.0204 and -. 0 0. SAS provides four different statistics for testing normality. 6 i s t r i b u t i o 0.1. 8 m u l a t i v e 1 D 0. The skewness and kurtosis-3 are respectively -.9956 does not reject the null hypothesis that the variable is normally distributed at the . implying normality of the randomly drawn variable (Figure 11). 4 0.indiana. 2 -2 0. indicating an almost perfect normal distribution. They consistently indicate that the variable is normally distributed. However.05 level http://www. 6 Nor m al ( M u=. Figure 11. 8 1. P-P and Q-Q Plots of a Normally Distributed Variable 1. these descriptive statistics do not provide conclusive information about normality.

5667877 2.© 2002-2006 The Trustees of Indiana University Univariate Analysis and Normality Test: 15 (p<. The UNIVARIATE Procedure Variable: GNIP Moments N Mean Std Deviation Skewness Uncorrected SS Coeff Variation 164 8.9645732 13.72500 . The Jarque-Bera test also indicates the normality of the randomly drawn variable at the .010000 Variability Std Deviation Variance Range Interquartile Range 13.2 A Non-normally Distributed Variable Let us take the per capita gross national income as an example of non-normally distributed variable.4096 1. RUN. Similarly. 4.3482776(2) 6 24 ⎣ ⎦ Consequently.edu/~statmath 8.964573 2. Computation is ⎡ . Shapiro-Wilk W test will be appropriate for this case. See the appendix for details about this variable.04947469 43181.34000 7. 4.0.39881982 ⎤ 500 ⎢ + ⎥ ~ 3.1875).2.05773 65.60816725 30001.000.168). VAR gnip.765000 1.19001 184.gnip NORMAL PLOT.02037212 .05 level (p=. QQPLOT gnip /NORMAL(MU=EST SIGMA=EST COLOR=RED L=1).337798 Sum Weights Sum Observations Variance Kurtosis Corrected SS Std Error Mean 164 1470.05938813 Basic Statistical Measures Location Mean Median Mode http://www.0356 151. Cramer-von Mises.1 SAS Output of Descriptive Statistics This section employs the UNIVARIATE procedure to compute descriptive statistics and perform normality tests. HISTOGRAM / NORMAL(COLOR=MAROON W=4) CFILL = BLUE CFRAME = LIGR.057728 3. PROC UNIVARIATE DATA=masil. Since the number of observations is less than 2.0. we can safely confirm that the randomly drawn variable is normally distributed. and Anderson-Darling tests do not reject the null hypothesis.56679 184. Kolmogorov-Smirnov.indiana.

               Tests for Location: Mu0=0
    Test           -Statistic-    -----p Value------
    Student's t    t  8.462029    Pr > |t|    <.0001
    Sign           M        82    Pr >= |M|   <.0001
    Signed Rank    S      6765    Pr >= |S|   <.0001

                       Tests for Normality
    Test                  --Statistic---    -----p Value------
    Shapiro-Wilk          W     0.663114    Pr < W     <0.0001
    Kolmogorov-Smirnov    D     0.284426    Pr > D     <0.0100
    Cramer-von Mises      W-Sq  4.346966    Pr > W-Sq  <0.0050
    Anderson-Darling      A-Sq  22.23115    Pr > A-Sq  <0.0050

    Quantiles (Definition 5)
    Quantile      Estimate
    100% Max        65.630
    99%             59.590
    95%             38.980
    90%             32.600
    75% Q3           8.680
    50% Median       2.765
    25% Q1           0.955
    10%              0.450
    5%               0.370
    1%               0.290
    0% Min           0.290

            Extreme Observations
    ----Lowest----        ----Highest---
    Value      Obs        Value      Obs
     0.29      164        46.32        5
     0.29      163        47.39        4
     0.31      162        54.93        3
     0.33      161        59.59        2
     0.34      160        65.63        1

4.2.2 Graphical Methods
The stem-and-leaf, box, and normal probability plots all indicate that the variable is not
normally distributed (Figure 13). Most observations are highly concentrated on the left
side of the distribution.

Figure 13. Stem-and-Leaf, Box, and Normal Probability Plots

        Histogram                                    #   Boxplot
    67.5+*                                           1      *
        .
        .*                                           1      *
        .*                                           1      *
        .*                                           2      *
        .*                                           3      *
        .**                                          6      *
        .**                                          5      0
        .**                                          5      0
        .*                                           2      0
        .**                                          6      |
        .***                                         7      |
        .******                                     17   +--+--+
     2.5+************************************      108   *-----*
         ----+----+----+----+----+----+----+-
         * may represent up to 3 counts

[Figure: SAS line-printer normal probability plot of GNIP; vertical axis from 2.5 to 67.5, horizontal axis from -2 to +2; the points bend sharply away from the reference line]

Figure 14. P-P and Q-Q Plots of a Non-normally Distributed Variable
[Figure: left panel, cumulative distribution of GNIP against Normal (Mu=8.9646 Sigma=13.567), both axes from 0.0 to 1.0; right panel, GNIP (from 0 to 70) against Normal Quantiles (from -3 to 3)]

The P-P and Q-Q plots in Figure 14 show that the data points seriously deviate from the fitted line.

4.2.3 Numerical Methods
Per capita gross national income has a mean of 8.9646 and a large variance of 184.0577. Its skewness and kurtosis-3 are 2.0495 and 3.6082, respectively, indicating that the variable is highly skewed to the right with a high peak and thin tails.

It is not surprising that the Shapiro-Wilk test rejects the null hypothesis. W is .6631 and the p-value is less than .0001. The Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling tests also report similar results. Finally, the Jarque-Bera test returns 203.7718, which rejects the null hypothesis of normality at the .05 level (p<.0001).

$$164\left[\frac{2.049474692^{2}}{6}+\frac{3.608167252^{2}}{24}\right] \approx 203.77176 \sim \chi^{2}(2)$$

To sum up, we can conclude that the per capita gross national income is not normally distributed.
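The Jarque-Bera computation above can be sketched in a few lines of Python. This is only an illustration, not part of the SAS output: the function name `jarque_bera` is hypothetical, and it expects skewness and kurtosis-3 as SAS reports them. The p-value uses the fact that a chi-squared variable with 2 degrees of freedom has the survival function exp(-x/2).

```python
import math

def jarque_bera(n, skewness, excess_kurtosis):
    """Return (statistic, p_value) for the Jarque-Bera test.

    excess_kurtosis is kurtosis-3, the quantity that SAS reports.
    """
    stat = n * (skewness ** 2 / 6.0 + excess_kurtosis ** 2 / 24.0)
    # For a chi-squared distribution with 2 degrees of freedom,
    # P(X > x) = exp(-x / 2), so no statistical library is needed.
    p_value = math.exp(-stat / 2.0)
    return stat, p_value

# Values taken from the SAS output for per capita gross national income.
stat, p_value = jarque_bera(164, 2.049474692, 3.608167252)
print(f"JB = {stat:.4f}, p < .0001: {p_value < 0.0001}")
```

With STATA output, which reports kurtosis itself, 3 would have to be subtracted from the kurtosis before calling the function.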

5. Testing Normality Using STATA
In STATA, you have to use individual commands to get specific statistics or draw various plots. This section contrasts a normally distributed and a non-normally distributed variable using graphical and numerical methods.

5.1 Graphical Methods
A histogram is the most widely used graphical method. Histograms of normally and non-normally distributed variables are presented in the introduction (Figure 2). The STATA .histogram command is followed by a variable name and options. The normal option adds a normal density curve to the histogram.

    . histogram normal, normal
    . histogram gnip, normal

Now let us draw a stem-and-leaf plot using the .stem command. The stem-and-leaf plot of the randomly drawn variable shows a bell-shaped distribution (Figure 15).

    . stem normal

Figure 15. Stem-and-Leaf Plot of a Normally Distributed Variable

    Stem-and-leaf plot for normal
    normal rounded to nearest multiple of .01
    plot in units of .01

    -28* | 4
    -27* |
    -26* |
    -25* | 9
    -24* | 8
    -23* | 9
    -22* | 40
    -21* | 93221
    -20* | 8650
    -19* | 8842
    -18* | 875200
    -17* | 94
    -16* | 9987550
    -15* | 97643320
    -14* | 87755432110
    -13* | 98777655433210
    -12* | 8866666433210
    -11* | 987774332210
    -10* | 875322
     -9* | 88887665542210
     -8* | 99988777533110
     -7* | 77766544100
     -6* | 998332
     -5* | 99988877654433221110
     -4* | 9998766655444433321
     -3* | 88766654433322221100
     -2* | 999988766555544433322111100
     -1* | 8888777776655544433222221110
     -0* | 99887776655433333111
      0* | 01233344445669
      1* | 0111222333445666778
      2* | 0001234444556889999
      3* | 1133444556667899
      4* | 014455667777
      5* | 00112334556888
      6* | 0001123668899
      7* | 00233466799999
      8* | 1122334667889
      9* | 012445666778889
     10* | 1133457799
     11* | 1222334445689
     12* | 122233489
     13* | 26889
     14* | 2777799
     15* | 00112459
     16* | 1347
     17* | 02467
     18* | 358
     19* | 03556
     20* |
     21* | 5
     22* | 1
     23* |
     24* | 22
     25* | 1

In contrast, per capita gross national income is highly skewed to the right, having most observations within \$10,000 (Figure 16).

    . stem gnip

Figure 16. Stem-and-Leaf Plot of a Non-normally Distributed Variable

    Stem-and-leaf plot for gnip
    gnip rounded to nearest multiple of .1
    plot in units of .1

[Stem-and-leaf display of gnip with stems 0** through 6**; the individual leaves are not recoverable here]

The .dotplot command generates a dotplot, very similar to the stem-and-leaf plot, in descending order (Figure 17).

    . dotplot normal
    . dotplot gnip

Figure 17. Dotplots of Normally and Non-normally Distributed Variables
[Left panel: A Normally Distributed Variable (N=500). Right panel: A Non-normally Distributed Variable (N=164). X axis: Frequency]

The .graph box command draws a box plot. In the left plot of Figure 18, the shaded box represents the 25th percentile, median, and 75th percentile. The right plot, in contrast, has an asymmetric box with many outliers beyond the adjacent maximum line.

    . graph box normal
    . graph box gnip

Figure 18. Box plots of Normally and Non-normally Distributed Variables
[Left panel: A Normally Distributed Variable (N=500). Right panel: A Non-normally Distributed Variable (N=164)]

The .pnorm command produces a standardized normal P-P plot.10 The left plot shows almost no deviation from the fitted line, while the right depicts an s-shaped curve that largely deviates from the line (Figure 19).

    . pnorm normal
    . pnorm gnip

10 In SAS, a P-P plot has the theoretical normal distribution on the X axis and the cumulative distribution of an empirical variable on the Y axis. In STATA, these distributions are located reversely.
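To make the P-P plot less of a black box, the sketch below computes the pairs of probabilities that such a plot displays, following the axis labels of the STATA plots: the empirical probability P[i] = i/(N+1) and the normal probability F[(x-m)/s]. The function names and the small `skewed` sample are hypothetical, and using the sample standard deviation (n-1 in the denominator) for s is an assumption.

```python
from statistics import NormalDist, mean, stdev

def pp_points(values):
    """Return (empirical, theoretical) probability pairs for a normal P-P plot."""
    xs = sorted(values)
    n = len(xs)
    m, s = mean(xs), stdev(xs)   # assumption: s is the sample standard deviation
    std_normal = NormalDist()    # standard normal distribution
    return [(i / (n + 1), std_normal.cdf((x - m) / s))
            for i, x in enumerate(xs, start=1)]

def max_pp_gap(values):
    """Largest vertical distance between the points and the 45-degree line."""
    return max(abs(theoretical - empirical)
               for empirical, theoretical in pp_points(values))

symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
skewed = [0.3, 0.4, 0.5, 0.9, 2.8, 8.7, 32.6, 65.6]   # made-up, right-skewed values
print(max_pp_gap(symmetric), max_pp_gap(skewed))
```

Points close to the 45-degree line, where the two probabilities are equal, suggest normality. A right-skewed sample such as `skewed` produces a visibly larger gap than the symmetric one.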

Figure 19. P-P plots of Normally and Non-normally Distributed Variables
[Left panel: A Normally Distributed Variable (N=500), Normal F[(normal-m)/s] against Empirical P[i] = i/(N+1). Right panel: A Non-normally Distributed Variable (N=164), Normal F[(gnip-m)/s] against Empirical P[i] = i/(N+1)]

The .qnorm command produces a standardized normal Q-Q plot. The Q-Q plots in Figure 20 show a similar pattern to the P-P plots. In the right plot, data points systematically deviate from the straight fitted line.

    . qnorm normal
    . qnorm gnip

Figure 20. Q-Q plots of Normally and Non-normally Distributed Variables
[Left panel: A Normally Distributed Variable (N=500). Right panel: A Non-normally Distributed Variable (N=164). X axis: Inverse Normal. Grid lines are 5, 10, 25, 50, 75, 90, and 95 percentiles]

5.2 Numerical Methods
Let us first get summary statistics using the .summarize command. The detail option lists various statistics in addition to the mean, standard deviation, minimum, and maximum. Skewness and kurtosis of the randomly drawn variable are respectively close to 0 and 3, implying normality. Per capita gross national income has a large skewness of 2.03 and kurtosis of 6.46, being skewed to the right with a high peak and flat tails.

    . summarize normal, detail

                               normal
    -------------------------------------------------------------
          Percentiles      Smallest
     1%    -2.219479      -2.837418
     5%    -1.794057      -2.590393
    10%    -1.413548      -2.478296       Obs                 500
    25%     -.805191      -2.391266       Sum of Wgt.         500

    50%    -.1195922                      Mean          -.0950725
                            Largest       Std. Dev.      1.003302
    75%     .6125385       2.211093
    90%     1.215211       2.421139       Variance       1.006614
    95%      1.53045       2.421713       Skewness      -.0203109
    99%     2.055464       2.511694       Kurtosis       2.593181

    . sum gnip, detail

                                gnip
    -------------------------------------------------------------
          Percentiles      Smallest
     1%          .29            .29
     5%          .37            .29
    10%          .45            .31       Obs                 164
    25%         .955            .33       Sum of Wgt.         164

    50%        2.765                      Mean           8.964573
                            Largest       Std. Dev.      13.56679
    75%         8.68          47.39
    90%         32.6          54.93       Variance       184.0577
    95%        38.98          59.59       Skewness       2.030682
    99%        59.59          65.63       Kurtosis       6.462734

The STATA .tabstat command is very useful for producing descriptive statistics in a table form. The column(variable) option lists statistics vertically (in table rows). The command for the variable normal is skipped.

    . tabstat gnip, stats(n mean sum max min range sd var semean skewness kurtosis ///
          median p1 p5 p10 p25 p50 p75 p90 p95 p99 iqr q) column(variable)

       stats |    normal           stats |      gnip
    ---------+----------        ---------+----------
           N |       500               N |       164
        mean | -.0950725            mean |  8.964573
         sum | -47.53624             sum |   1470.19
         max |  2.511694             max |     65.63
         min | -2.837418             min |       .29
       range |  5.349112           range |     65.34
          sd |  1.003302              sd |  13.56679
    variance |  1.006614        variance |  184.0577
    se(mean) |   .044869        se(mean) |  1.059388
    skewness | -.0203109        skewness |  2.030682
    kurtosis |  2.593181        kurtosis |  6.462734
         p50 | -.1195922             p50 |     2.765
          p1 | -2.219479              p1 |       .29
          p5 | -1.794057              p5 |       .37
         p10 | -1.413548             p10 |       .45
         p25 |  -.805191             p25 |      .955
         p50 | -.1195922             p50 |     2.765
         p75 |  .6125385             p75 |      8.68
         p90 |  1.215211             p90 |      32.6
         p95 |   1.53045             p95 |     38.98
         p99 |  2.055464             p99 |     59.59
         iqr |   1.41773             iqr |     7.725
         p25 |  -.805191             p25 |      .955
         p50 | -.1195922             p50 |     2.765
         p75 |  .6125385             p75 |      8.68
    --------------------        --------------------

Now let us conduct statistical tests for normality. STATA provides three testing methods: the Shapiro-Wilk test, the Shapiro-Francia test, and the Skewness-Kurtosis test. The .swilk and .sfrancia commands respectively conduct the Shapiro-Wilk and Shapiro-Francia tests. Both tests do not reject normality of the randomly drawn variable and reject normality of per capita gross national income.

    . swilk normal

                       Shapiro-Wilk W test for normal data
        Variable |    Obs       W           V         z       Prob>z
    -------------+-------------------------------------------------
          normal |    500    0.99556      1.492     0.962    0.16804

    . sfrancia normal

                      Shapiro-Francia W' test for normal data
        Variable |    Obs       W'          V'        z       Prob>z
    -------------+-------------------------------------------------
          normal |    500    0.99645      1.273     0.541    0.29412

    . swilk gnip

                       Shapiro-Wilk W test for normal data
        Variable |    Obs       W           V         z       Prob>z
    -------------+-------------------------------------------------
            gnip |    164    0.66322     42.309     8.530    0.00000

    . sfrancia gnip

                      Shapiro-Francia W' test for normal data
        Variable |    Obs       W'          V'        z       Prob>z
    -------------+-------------------------------------------------
            gnip |    164    0.66365     45.790     7.413    0.00001

STATA's .sktest command performs the Skewness-Kurtosis test that is conceptually similar to the Jarque-Bera test. The noadjust option suppresses the empirical adjustment made by Royston (1991). The following S-K tests do not reject normality of the randomly drawn variable at the .05 level.

    . sktest normal

                       Skewness/Kurtosis tests for Normality
                                                     ------- joint ------
        Variable |  Pr(Skewness)   Pr(Kurtosis)  adj chi2(2)    Prob>chi2
    -------------+-------------------------------------------------------
          normal |      0.851         0.027         4.93         0.0850

    . sktest normal, noadjust

                       Skewness/Kurtosis tests for Normality
                                                     ------- joint ------
        Variable |  Pr(Skewness)   Pr(Kurtosis)      chi2(2)    Prob>chi2
    -------------+-------------------------------------------------------
          normal |      0.851         0.027         4.93         0.0850

Surprisingly, the S-K test rejects normality of the randomly drawn variable at the .1 level. The Jarque-Bera statistic is 3.4823 = 500*(.0203109^2/6+(2.593181-3)^2/24), which is not large enough to reject the null hypothesis (p=.1753). The Jarque-Bera test appears more reliable than the STATA S-K test (see Table 4).

Like the Shapiro-Wilk and Shapiro-Francia tests, the S-K test rejects normality of per capita gross national income.

    . sktest gnip

                       Skewness/Kurtosis tests for Normality
                                                     ------- joint ------
        Variable |  Pr(Skewness)   Pr(Kurtosis)  adj chi2(2)    Prob>chi2
    -------------+-------------------------------------------------------
            gnip |      0.000         0.000        55.33         0.0000

    . sktest gnip, noadjust

                       Skewness/Kurtosis tests for Normality
                                                     ------- joint ------
        Variable |  Pr(Skewness)   Pr(Kurtosis)      chi2(2)    Prob>chi2
    -------------+-------------------------------------------------------
            gnip |      0.000         0.000        75.39         0.0000

The Jarque-Bera statistic of the per capita gross national income is 194.6489 = 164*(2.030682^2/6+(6.462734-3)^2/24). This large chi-squared rejects the null hypothesis (p<.0001).

In conclusion, the graphical methods and numerical methods provide sufficient evidence that the randomly drawn variable is normally distributed and per capita gross national income is not.
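The Jarque-Bera statistic computed from the STATA output (194.6489) is smaller than the 203.77 that Section 4.2.3 obtains from the SAS output for the same variable. The source does not explain the gap. The sketch below assumes that STATA reports moment-based skewness and kurtosis while SAS reports the usual bias-adjusted versions; under that assumption the standard conversion formulas reproduce the SAS figures from the STATA ones. The function names are hypothetical.

```python
import math

def adjusted_skewness(g1, n):
    """Convert moment-based skewness g1 to the bias-adjusted skewness."""
    return g1 * math.sqrt(n * (n - 1)) / (n - 2)

def adjusted_excess_kurtosis(b2, n):
    """Convert moment-based kurtosis b2 to the bias-adjusted kurtosis-3."""
    g2 = b2 - 3.0   # moment-based excess kurtosis
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6.0)

n = 164
stata_skewness, stata_kurtosis = 2.030682, 6.462734   # from .summarize gnip, detail

sas_skewness = adjusted_skewness(stata_skewness, n)
sas_kurtosis = adjusted_excess_kurtosis(stata_kurtosis, n)

jb_stata = n * (stata_skewness ** 2 / 6 + (stata_kurtosis - 3) ** 2 / 24)
jb_sas = n * (sas_skewness ** 2 / 6 + sas_kurtosis ** 2 / 24)

print(round(sas_skewness, 4), round(sas_kurtosis, 4))   # 2.0495 3.6082
print(round(jb_stata, 2), round(jb_sas, 2))             # 194.65 203.77
```

Both statistics lead to the same conclusion here. The point is only that skewness and kurtosis should come from the same package, and follow the same convention, when a Jarque-Bera statistic is computed by hand.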

6. Testing Normality Using SPSS
SPSS has the DESCRIPTIVES and FREQUENCIES commands to produce descriptive statistics. DESCRIPTIVES is usually applied to continuous variables, but FREQUENCIES is also able to produce various descriptive statistics including skewness and kurtosis. The IGRAPH command draws histogram and box plots. The PPLOT command produces (detrended) P-P and Q-Q plots. This command is able to draw the detrended Q-Q plot that SAS and STATA do not support.

The EXAMINE command can produce both descriptive statistics and various plots, such as a stem-and-leaf plot, histogram, box plot, P-P plot, and Q-Q plot. EXAMINE also performs the Kolmogorov-Smirnov and Shapiro-Wilk tests for normality.

6.1 A Normally Distributed Variable
Let us get summary statistics using DESCRIPTIVES and FREQUENCIES. The statistics are specified in the /STATISTICS subcommand. The output of the following DESCRIPTIVES command is skipped here.11

    DESCRIPTIVES VARIABLES=normal
      /STATISTICS=MEAN SUM STDDEV VARIANCE RANGE MIN MAX SEMEAN KURTOSIS SKEWNESS.

    FREQUENCIES VARIABLES=normal /NTILES= 4
      /STATISTICS=STDDEV VARIANCE RANGE MINIMUM MAXIMUM SEMEAN MEAN MEDIAN MODE
      SUM SKEWNESS SESKEW KURTOSIS SEKURT
      /HISTOGRAM /ORDER= ANALYSIS.

    Statistics
    normal
    N                         Valid          500
                              Missing          0
    Mean                                  -.0951
    Std. Error of Mean                    .04487
    Median                                -.1196
    Mode                                -2.84(a)
    Std. Deviation                       1.00330
    Variance                               1.007
    Skewness                               -.020
    Std. Error of Skewness                  .109
    Kurtosis                               -.399
    Std. Error of Kurtosis                  .218
    Range                                   5.35
    Minimum                                -2.84
    Maximum                                 2.51
    Sum                                   -47.54
    Percentiles               25          -.8066
                              50          -.1196
                              75           .6132
    a Multiple modes exist. The smallest value is shown

The variable has a mean close to zero and a variance close to one. The median is very close to the mean. The kurtosis-3 and skewness approach zero. Note that SPSS, like SAS, reports kurtosis-3.

11 In order to execute this command, open a syntax window and paste it into the window. Or click Analyze→Descriptive Statistics→Descriptives and provide a variable of interest.

6.1.1 Graphical Methods
The /HISTOGRAM subcommand of FREQUENCIES tells SPSS to draw a histogram of the variable (left plot in Figure 21). You can get the identical plot using EXAMINE or the following GRAPH command.

    GRAPH /HISTOGRAM=normal.

The IGRAPH command can produce a better histogram (right plot in Figure 21).12 The histogram suggests that the variable is normally distributed.

    IGRAPH /VIEWNAME='Histogram' /X1 = VAR(normal) TYPE = SCALE /Y = \$count
      /COORDINATE = VERTICAL /X1LENGTH=3.0 /YLENGTH=3.0 /X2LENGTH=3.0 /CHARTLOOK='NONE'
      /Histogram SHAPE = HISTOGRAM CURVE = OFF X1INTERVAL AUTO X1START = 0.

Figure 21. Histograms of a Normally Distributed Variable

EXAMINE can produce a stem-and-leaf plot and a box plot using the /PLOT subcommand with the STEMLEAF and BOXPLOT options (Figure 22).13 The stem-and-leaf plot is very similar to the histogram in Figure 21.

    EXAMINE VARIABLES=normal
      /PLOT BOXPLOT STEMLEAF HISTOGRAM NPPLOT
      /MISSING LISTWISE /NOTOTAL.

12 From the menu, click Graphs→Interactive→Histogram and then specify a variable.
13 To run this command from the menu, click Analyze→Descriptive Statistics→Explore, and then include the variable you want to examine.

Figure 22. Stem-and-Leaf Plot and Box Plot of a Normally Distributed Variable

    normal Stem-and-Leaf Plot

     Frequency    Stem &  Leaf

         2.00       -2 .  &
        13.00       -2 .  011&
        27.00       -1 .  555667889
        56.00       -1 .  001111222233333444
        64.00       -0 .  55555566777888889999
       116.00       -0 .  000000011111111122222222233333334444444
        80.00        0 .  000001111112222223333334444
        68.00        0 .  5555566677777888899999
        46.00        1 .  000111122233444
        23.00        1 .  5567789
         4.00        2 .  4&
         1.00        2 .  &

     Stem width:      1.00
     Each leaf:       3 case(s)

     & denotes fractional leaves.

Both extremes (i.e., minimum and maximum) and the 25th, 50th, and 75th percentiles are symmetrically arranged in the box plot.

Now let us draw P-P and Q-Q plots using the PPLOT command.14 This command automatically produces detrended P-P and Q-Q plots as well. The /TYPE subcommand chooses either a P-P or Q-Q plot and /DIST specifies a probability distribution (e.g., the standard normal distribution). The variable does not deviate far away from the fitted line (Figure 23).

    PPLOT /VARIABLES=normal /NOLOG /NOSTANDARDIZE
      /TYPE=P-P /FRACTION=BLOM /TIES=MEAN /DIST=NORMAL.

Figure 23. P-P and Detrended P-P Plots of a Normally Distributed Variable

The following PPLOT command draws the Q-Q and detrended Q-Q plots of the variable (Figure 24).15 The /PLOT NPPLOT subcommand of EXAMINE can also produce these plots. Unlike STATA, SPSS puts observed quantiles on the X axes and normal quantiles on the Y axes of the Q-Q plot and detrended Q-Q plot.

    PPLOT /VARIABLES=normal /NOLOG /NOSTANDARDIZE
      /TYPE=Q-Q /FRACTION=BLOM /TIES=MEAN /DIST=NORMAL.

Figure 24. Q-Q and Detrended Q-Q Plots of a Normally Distributed Variable

The P-P and Q-Q plots indicate no significant deviation from the fitted line. Probably due to bugs, the latest SPSS 14.02 (EXAMINE and PPLOT) does not produce color P-P and Q-Q plots, which are available in SPSS 13.xx and 12.xx.

14 From the menu, click Graphs→P-P and then specify a variable of interest.
15 Click Graphs→Q-Q and then specify a variable.

6.1.2 Numerical Methods
EXAMINE has the /PLOT NPPLOT subcommand to test normality of a variable. This command performs the Kolmogorov-Smirnov and Shapiro-Wilk tests and draws a normal (detrended) Q-Q plot as well.

    EXAMINE VARIABLES=normal
      /PLOT NPPLOT /STATISTICS DESCRIPTIVES EXTREME
      /CINTERVAL 95 /MISSING LISTWISE /NOTOTAL.

The above EXAMINE command first produces descriptive statistics (/STATISTICS DESCRIPTIVES), draws a normal Q-Q plot (/PLOT NPPLOT), and then conducts normality tests.

    Case Processing Summary
                                   Cases
                 Valid            Missing            Total
                 N    Percent     N    Percent       N    Percent
    normal     500    100.0%      0       .0%      500    100.0%

    Descriptives
    normal                                        Statistic   Std. Error
    Mean                                             -.0951       .04487
    95% Confidence Interval    Lower Bound           -.1832
    for Mean                   Upper Bound           -.0069
    5% Trimmed Mean                                  -.0933
    Median                                           -.1196
    Variance                                          1.007
    Std. Deviation                                  1.00330
    Minimum                                           -2.84
    Maximum                                            2.51
    Range                                              5.35
    Interquartile Range                                1.42
    Skewness                                          -.020         .109
    Kurtosis                                          -.399         .218

    Extreme Values
    normal                 Case Number     Value
    Highest      1             332          2.51
                 2             139          2.42
                 3             325          2.42
                 4             340          2.21
                 5             119          2.15
    Lowest       1              29         -2.84
                 2             204         -2.59
                 3              73         -2.48
                 4             391         -2.39
                 5             393         -2.24

    Tests of Normality
               Kolmogorov-Smirnov(a)             Shapiro-Wilk
               Statistic    df     Sig.          Statistic    df     Sig.
    normal       .027      500    .200(*)          .996      500     .168
    * This is a lower bound of the true significance.
    a Lilliefors Significance Correction

Since N is less than 2,000, we have to read the Shapiro-Wilk statistic, which does not reject the null hypothesis of normality (p=.168). SPSS reports the same Kolmogorov-Smirnov statistic of .027, but it provides an adjusted p-value of .200, a bit larger than the .150 that SAS reports.

6.2 A Non-normally Distributed Variable
Let us consider a variable of per capita gross national income that is not normally distributed.

6.2.1 Graphical Methods
The following IGRAPH and EXAMINE commands produce the histogram, stem-and-leaf plot, and box plot of a non-normally distributed variable gnip. The stem-and-leaf plot is skipped here.

    IGRAPH /VIEWNAME='Histogram' /X1 = VAR(gnip) TYPE = SCALE /Y = \$count
      /COORDINATE = VERTICAL /X1LENGTH=3.0 /YLENGTH=3.0 /X2LENGTH=3.0 /CHARTLOOK='NONE'
      /Histogram SHAPE = HISTOGRAM CURVE = OFF X1INTERVAL AUTO X1START = 0.

    EXAMINE VARIABLES=gnip
      /PLOT BOXPLOT STEMLEAF HISTOGRAM
      /MISSING LISTWISE /NOTOTAL.

Figure 25. Histogram and Box Plot of a Non-normally Distributed Variable

Figure 25 illustrates that the distribution is heavily skewed to the right. The median and the 25th percentile are close to each other. There are many outliers beyond the extreme line in the box plot (left plot of Figure 25).

Figure 26 presents the P-P and detrended P-P plots where data points significantly deviate from the straight fitted line. The detrended P-P plot shows a systematic deviation of data points.

    PPLOT /VARIABLES=gnip /NOLOG /NOSTANDARDIZE
      /TYPE=P-P /FRACTION=BLOM /TIES=MEAN /DIST=NORMAL.

Figure 26. P-P and Detrended P-P Plots of a Non-normally Distributed Variable

The Q-Q and detrended Q-Q plots also show a significant deviation from the fitted line (Figure 27). The detrended Q-Q plot also presents a systematic pattern of deviation indicating non-normality of the variable.

    PPLOT /VARIABLES=gnip /NOLOG /NOSTANDARDIZE
      /TYPE=Q-Q /FRACTION=BLOM /TIES=MEAN /DIST=NORMAL.

Figure 27. Q-Q and Detrended Q-Q Plots of a Non-normally Distributed Variable

6.2.2 Numerical Methods
The descriptive statistics of gnip indicate that the variable is not normally distributed. There is a large gap between the mean of 8.9646 and the median of 2.7650. The skewness and kurtosis-3 are 2.049 and 3.608, respectively. The variable appears severely skewed to the right with a higher peak and flat tails.

    EXAMINE VARIABLES=gnip
      /PLOT BOXPLOT STEMLEAF HISTOGRAM NPPLOT
      /MISSING LISTWISE /NOTOTAL.

    Case Processing Summary
                                   Cases
                 Valid            Missing            Total
                 N    Percent     N    Percent       N    Percent
    gnip       164    100.0%      0       .0%      164    100.0%

    Descriptives
    gnip                                          Statistic   Std. Error
    Mean                                             8.9646      1.05939
    95% Confidence Interval    Lower Bound           6.8727
    for Mean                   Upper Bound          11.0565
    5% Trimmed Mean                                  7.1877
    Median                                           2.7650
    Variance                                        184.058
    Std. Deviation                                 13.56679
    Minimum                                             .29
    Maximum                                           65.63
    Range                                             65.34
    Interquartile Range                                7.92
    Skewness                                          2.049         .190
    Kurtosis                                          3.608         .377

    Tests of Normality
               Kolmogorov-Smirnov(a)             Shapiro-Wilk
               Statistic    df     Sig.          Statistic    df     Sig.
    gnip         .284      164     .000            .663      164     .000
    a Lilliefors Significance Correction

The Shapiro-Wilk test rejects the null hypothesis of normality at the .05 level. The Jarque-Bera test also rejects the null hypothesis with a large statistic of 204. Its computation is skipped (see section 4.2.3). Based on a consistent result from both graphical and numerical methods, we can conclude the variable gnip is not normally distributed.

7. Conclusion
Univariate analysis is the first step of data analysis once a data set is ready. Various descriptive statistics provide valuable basic information about variables that is used to determine what method of analysis should be employed.

There are graphical and numerical methods for conducting univariate analysis and normality tests (Table 1). Some are descriptive and others are theory-driven. The graphical methods produce various plots such as a stem-and-leaf plot, histogram, and a P-P plot that are intuitive and easy to interpret. Numerical methods compute a variety of measures of central tendency and dispersion such as mean, median, quantile, variance, and standard deviation.

Skewness and kurtosis provide clues to the normality of a variable. If skewness and kurtosis-3 are close to zero, the variable may be normally distributed. If the skewness of a variable is larger than 0, the variable is skewed to the right with many observations on the left of the distribution. A negative skewness indicates many observations on the right. If kurtosis-3 is greater than 0 (or kurtosis is greater than 3), the distribution has a high peak and flat tails (third plot in Figure 8). If kurtosis is smaller than 3, the variable has a low peak and thick tails (first plot in Figure 8). Keep in mind that SAS and SPSS report kurtosis-3, while STATA returns kurtosis itself.

In addition to these descriptive statistics, there are formal ways to perform normality tests. The Shapiro-Wilk and Shapiro-Francia tests are proper when N is less than 2,000 and 5,000, respectively. The Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling tests are recommended when N is large. The Jarque-Bera test, although not supported by most statistical software, is a good alternative for normality testing.

The SAS UNIVARIATE and CAPABILITY procedures provide a variety of descriptive statistics and normality testing methods including Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling tests (Table 5). These procedures produce stem-and-leaf, box plot, histogram, P-P plot, and Q-Q plot as well. STATA has various commands for univariate analysis and graphics. In particular, STATA supports the Shapiro-Francia test, a modification of the Shapiro-Wilk test, and the skewness-kurtosis test. But there is no command for the Kolmogorov-Smirnov test for normality in STATA. SPSS can produce detrended P-P and Q-Q plots, and perform the Shapiro-Wilk and Kolmogorov-Smirnov tests with Lilliefors significance correction.

Normality is commonly assumed in many statistical and economic methods without any empirical test. Violation of this assumption will result in unreliable inferences and misleading interpretations.
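The sample-size guidance above can be written down as a small lookup, shown here as a rough sketch rather than a rule. The function name `recommended_tests` is hypothetical, and reading "N is large" as N of 2,000 or more is an assumption, since the text does not give a cutoff.

```python
def recommended_tests(n):
    """List the normality tests suggested for a sample of size n."""
    tests = []
    if n < 2000:
        tests.append("Shapiro-Wilk")       # proper when N is less than 2,000
    if n < 5000:
        tests.append("Shapiro-Francia")    # proper when N is less than 5,000
    if n >= 2000:                          # assumption: "large" means N >= 2,000
        tests += ["Kolmogorov-Smirnov", "Cramer-von Mises", "Anderson-Darling"]
    return tests

for n in (164, 500, 3000, 10000):
    print(n, recommended_tests(n))
```

For the two variables used in this document (N=164 and N=500), the lookup points to the Shapiro-Wilk and Shapiro-Francia tests, which matches how the SPSS output is read in Section 6.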

Appendix A: Data Sets
This document uses the following three variables.

1. A Randomly Drawn Variable
This variable includes 500 observations that were randomly drawn from the standard normal distribution with a seed of 1,234,567. The RANNOR() of SAS was used as a random number generator.

    %LET n=500;
    %LET dataset=n500;

    DATA masil.&dataset;
       seed=1234567;
       DO i=1 TO &n;
          normal=RANNOR(seed);
          OUTPUT;
       END;
    RUN;

    . tabstat normal, stat(mean sd p25 median p75 skewness kurtosis)

        variable |      mean        sd       p25       p50       p75  skewness  kurtosis
    -------------+----------------------------------------------------------------------
          normal | -.0950725  1.003302  -.805191 -.1195922  .6125385 -.0203109  2.593181
    ------------------------------------------------------------------------------------

2. Per Capita Gross National Income in 2005
This data set includes per capita gross national incomes of 164 countries in the world that are provided by the World Bank (http://web.worldbank.org/).

    . tabstat gnip, stat(mean sd p25 median p75 skewness kurtosis)

        variable |      mean        sd       p25       p50       p75  skewness  kurtosis
    -------------+----------------------------------------------------------------------
            gnip |  8.964573  13.56679      .955     2.765      8.68  2.030682  6.462734
    ------------------------------------------------------------------------------------

3. Unemployment Rate of Illinois, Indiana, and Ohio in 2005
This unemployment rate is provided by the Bureau of Labor Statistics. Actual data were downloaded from http://www.stats.indiana.edu/, the Indiana Business Research Center of the Kelley School of Business, Indiana University.

    . tabstat rate, stat(mean sd p25 median p75 skewness kurtosis) by(state)

    Summary for variables: rate
         by categories of: state

     state |      mean        sd       p25       p50       p75  skewness  kurtosis

[Output rows for IL, IN, OH, and Total; the cell values are not recoverable here]
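For readers without SAS, a variable like the first one can be drawn with Python's standard library, as sketched below. This is an analogue only: Python's generator is not RANNOR(), so the same seed yields different values and the statistics will not match those reported in this document. The function name `draw_standard_normal` is hypothetical.

```python
import random
from statistics import mean, stdev

def draw_standard_normal(n=500, seed=1234567):
    """Draw n values from the standard normal distribution, reproducibly."""
    rng = random.Random(seed)   # private generator, so the seed has no side effects
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

normal = draw_standard_normal()
print(len(normal), round(mean(normal), 3), round(stdev(normal), 3))
```

Fixing the seed serves the same purpose as in the SAS data step: the draw can be repeated exactly, which keeps later plots and test results reproducible.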

References
Bera, Anil K., and Carlos M. Jarque. 1981. "Efficient Tests for Normality, Homoscedasticity and Serial Independence of Regression Residuals: Monte Carlo Evidence." Economics Letters, 7(4):313-318.
D'Agostino, Ralph B., Albert Belanger, and Ralph B. D'Agostino, Jr. 1990. "A Suggestion for Using Powerful and Informative Tests of Normality." American Statistician, 44(4): 316-321.
Jarque, Carlos M., and Anil K. Bera. 1980. "Efficient Tests for Normality, Homoscedasticity and Serial Independence of Regression Residuals." Economics Letters, 6(3):255-259.
Jarque, Carlos M., and Anil K. Bera. 1987. "A Test for Normality of Observations and Regression Residuals." International Statistical Review, 55(2):163-172.
Mitchell, Michael N. 2004. A Visual Guide to STATA Graphics. College Station, TX: STATA Press.
Royston, J. P. 1982. "An Extension of Shapiro and Wilk's W Test for Normality to Large Samples." Applied Statistics, 31(2): 115-124.
Royston, J. P. 1983. "A Simple Method for Evaluating the Shapiro-Francia W' Test of Non-Normality." Statistician, 32(3) (September): 297-300.
Royston, P. 1991. "Comment on sg3.4 and an Improved D'Agostino Test." Stata Technical Bulletin, 3: 13-24.
Royston, P. 1992. "Approximating the Shapiro-Wilk W-Test for Non-normality." Statistics and Computing, 2:117-119.
SAS Institute. 1995. SAS/QC Software: Usage and Reference I and II. Cary, NC: SAS Institute.
SAS Institute. 2004. SAS 9.1.3 Procedures Guide Volume 4. Cary, NC: SAS Institute.
Shapiro, S. S., and M. B. Wilk. 1965. "An Analysis of Variance Test for Normality (Complete Samples)." Biometrika, 52(3/4) (December):591-611.
Shapiro, S. S., and R. S. Francia. 1972. "An Approximate Analysis of Variance Test for Normality." Journal of the American Statistical Association, 67 (337) (March): 215-216.
STATA Press. 2003. STATA Graphics Reference Manual Release 8. College Station, TX: STATA Press.
STATA Press. 2005. STATA Reference Manual Release 9. College Station, TX: STATA Press.

Acknowledgements
I am grateful to Jeremy Albright and Kevin Wilhite at the UITS Center for Statistical and Mathematical Computing for comments and suggestions.

Revision History
• 2002 First draft
• 2006 Revision with new data