EXAMINE

Univariate Statistics
Notation
The following notation is used throughout this chapter unless otherwise noted:

Let $y_1 < \dots < y_m$ be $m$ distinct ordered observations for the sample and $c_1, \dots, c_m$ be the corresponding caseweights. Then

$$cc_i = \sum_{k=1}^{i} c_k = \text{cumulative frequency up to and including } y_i$$

and

$$W = cc_m = \sum_{k=1}^{m} c_k = \text{total sum of weights.}$$

Descriptive Statistics

Minimum and Maximum

$$\min = y_1, \qquad \max = y_m$$

Range

$$\text{range} = y_m - y_1$$


Mean ($\bar{y}$)

$$\bar{y} = \frac{\sum_{i=1}^{m} c_i y_i}{W}$$

Confidence Interval for the Mean

$$\text{lower bound} = \bar{y} - t_{\alpha/2,\,W-1}\,SE$$

$$\text{upper bound} = \bar{y} + t_{\alpha/2,\,W-1}\,SE$$

where SE is the standard error.

Median

The median is the 50th percentile, which is calculated by the method requested.
The default method is HAVERAGE.

Interquartile Range (IQR)

IQR = 75th percentile − 25th percentile, where the 75th and 25th percentiles are
calculated by the method requested for percentiles.

Variance ($s^2$)

$$s^2 = \frac{1}{W-1}\sum_{i=1}^{m} c_i\,(y_i - \bar{y})^2$$
i =1

Standard Deviation

$$s = \sqrt{s^2}$$

Standard Error

$$SE = \frac{s}{\sqrt{W}}$$
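The weighted mean, variance, standard deviation, and standard error above can be sketched in a few lines of Python. This is a minimal illustration, not the product's implementation; the function name `weighted_descriptives` and the dictionary keys are hypothetical, and the standard error is taken to be $s/\sqrt{W}$.

```python
import math

def weighted_descriptives(y, c):
    """Return W, mean, variance, standard deviation and standard error.

    y: distinct observed values; c: matching caseweights (same length).
    """
    W = sum(c)                                   # total sum of weights
    mean = sum(ci * yi for yi, ci in zip(y, c)) / W
    var = sum(ci * (yi - mean) ** 2 for yi, ci in zip(y, c)) / (W - 1)
    sd = math.sqrt(var)
    se = sd / math.sqrt(W)                       # assumes SE = s / sqrt(W)
    return {"W": W, "mean": mean, "variance": var, "sd": sd, "se": se}

# Value 1.0 carries weight 2, so it counts as two cases.
stats = weighted_descriptives([1.0, 2.0, 4.0], [2, 1, 1])
print(stats)
```

Note that the divisor for the variance is $W - 1$, the total weight minus one, not the number of distinct values minus one.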

Skewness ($g_1$) and SE of Skewness

$$g_1 = \frac{W M_3}{(W-1)(W-2)\,s^3}$$

$$SE(g_1) = \sqrt{\frac{6W(W-1)}{(W-2)(W+1)(W+3)}}$$

$$M_3 = \sum_{i=1}^{m} c_i\,(y_i - \bar{y})^3$$

Kurtosis ($g_2$) and SE of Kurtosis

$$g_2 = \frac{W(W+1)\,M_4 - 3M_2^2\,(W-1)}{(W-1)(W-2)(W-3)\,s^4}$$

$$M_2 = \sum_{i=1}^{m} c_i\,(y_i - \bar{y})^2$$

$$M_4 = \sum_{i=1}^{m} c_i\,(y_i - \bar{y})^4$$

$$SE(g_2) = \sqrt{\frac{4\,(W^2-1)\,SE^2(g_1)}{(W-3)(W+5)}}$$
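As a rough check of the skewness and kurtosis formulas, the following Python sketch computes $g_1$, $g_2$ and their standard errors from the weighted moments $M_2$, $M_3$, $M_4$. The function name `skewness_kurtosis` is hypothetical, and the sketch assumes $W > 3$ so that no denominator vanishes.

```python
import math

def skewness_kurtosis(y, c):
    """Return (g1, SE(g1), g2, SE(g2)) for values y with caseweights c."""
    W = sum(c)
    mean = sum(ci * yi for yi, ci in zip(y, c)) / W
    M2 = sum(ci * (yi - mean) ** 2 for yi, ci in zip(y, c))
    M3 = sum(ci * (yi - mean) ** 3 for yi, ci in zip(y, c))
    M4 = sum(ci * (yi - mean) ** 4 for yi, ci in zip(y, c))
    s = math.sqrt(M2 / (W - 1))                  # standard deviation
    g1 = W * M3 / ((W - 1) * (W - 2) * s ** 3)
    se_g1 = math.sqrt(6 * W * (W - 1) / ((W - 2) * (W + 1) * (W + 3)))
    g2 = (W * (W + 1) * M4 - 3 * M2 ** 2 * (W - 1)) / (
        (W - 1) * (W - 2) * (W - 3) * s ** 4)
    se_g2 = math.sqrt(4 * (W ** 2 - 1) * se_g1 ** 2 / ((W - 3) * (W + 5)))
    return g1, se_g1, g2, se_g2

g1, se_g1, g2, se_g2 = skewness_kurtosis([1, 2, 3, 4, 5], [1, 1, 1, 1, 1])
print(g1, se_g1, g2, se_g2)
```

For the symmetric sample 1, 2, 3, 4, 5 the skewness is zero and the kurtosis is negative, as expected for a flat distribution.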

5% Trimmed Mean ($T_{0.9}$)

$$T_{0.9} = \frac{1}{0.9W}\left\{ (cc_{k_1+1} - tc)\,y_{k_1+1} + (W - cc_{k_2-1} - tc)\,y_{k_2} + \sum_{i=k_1+2}^{k_2-1} c_i y_i \right\}$$

where $k_1$ and $k_2$ satisfy the following conditions

$$cc_{k_1} < tc \le cc_{k_1+1}, \qquad W - cc_{k_2} < tc \le W - cc_{k_2-1}$$

and

$$tc = 0.05W$$

Note: If $k_1 + 1 = k_2$, then $T_{0.9} = y_{k_2}$
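The trimmed-mean definition can be followed step by step in code. The sketch below is a minimal illustration under two assumptions that the text does not spell out: $cc_0$ is taken to be 0, and all caseweights are positive. The function name `trimmed_mean_5pct` is hypothetical.

```python
def trimmed_mean_5pct(y, c):
    """5% trimmed mean of sorted distinct values y with caseweights c."""
    m = len(y)
    cc = [0.0]                                   # cc[0] = 0 is an assumption
    for ci in c:
        cc.append(cc[-1] + ci)                   # cc[i] = cumulative weight
    W = cc[m]
    tc = 0.05 * W                                # weight trimmed from each tail
    # k1: cc[k1] < tc <= cc[k1 + 1]
    k1 = next(k for k in range(m) if cc[k] < tc <= cc[k + 1])
    # k2: W - cc[k2] < tc <= W - cc[k2 - 1]
    k2 = next(k for k in range(1, m + 1) if W - cc[k] < tc <= W - cc[k - 1])
    if k1 + 1 == k2:
        return y[k2 - 1]
    total = (cc[k1 + 1] - tc) * y[k1]            # partial weight on y_{k1+1}
    total += (W - cc[k2 - 1] - tc) * y[k2 - 1]   # partial weight on y_{k2}
    total += sum(c[i - 1] * y[i - 1] for i in range(k1 + 2, k2))
    return total / (0.9 * W)

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(trimmed_mean_5pct(values, [1] * 10))
```

The lists are 0-based while the formulas are 1-based, so $y_{k}$ in the text is `y[k - 1]` in the code.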

Percentiles
There are five methods for computation of percentiles. Let

$$tc_1 = Wp, \qquad tc_2 = (W+1)\,p$$

where $p$ is the requested percentile divided by 100, and $k_1$ and $k_2$ satisfy

$$cc_{k_1} \le tc_1 < cc_{k_1+1}$$

$$cc_{k_2} \le tc_2 < cc_{k_2+1}$$

Then,

$$g_1 = \frac{tc_1 - cc_{k_1}}{c_{k_1+1}}, \qquad g_1^* = tc_1 - cc_{k_1}$$

$$g_2 = \frac{tc_2 - cc_{k_2}}{c_{k_2+1}}, \qquad g_2^* = tc_2 - cc_{k_2}$$

Let x be the pth percentile; the five definitions are as follows:

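Waverage (Weighted Average at $y_{tc_1}$)

$$x = \begin{cases} y_{k_1+1} & \text{if } g_1^* \ge 1 \\ (1 - g_1^*)\,y_{k_1} + g_1^*\,y_{k_1+1} & \text{if } g_1^* < 1 \text{ and } c_{k_1+1} \ge 1 \\ (1 - g_1)\,y_{k_1} + g_1\,y_{k_1+1} & \text{if } g_1^* < 1 \text{ and } c_{k_1+1} < 1 \end{cases}$$

Round (Observation Closest to $tc_1$)

If $c_{k_1+1} \ge 1$, then

$$x = \begin{cases} y_{k_1} & \text{if } g_1^* < \tfrac{1}{2} \\ y_{k_1+1} & \text{if } g_1^* \ge \tfrac{1}{2} \end{cases}$$

If $c_{k_1+1} < 1$, then

$$x = \begin{cases} y_{k_1} & \text{if } g_1 < \tfrac{1}{2} \\ y_{k_1+1} & \text{if } g_1 \ge \tfrac{1}{2} \end{cases}$$

Empirical (Empirical Distribution Function)

$$x = \begin{cases} y_{k_1} & \text{if } g_1^* = 0 \\ y_{k_1+1} & \text{if } g_1^* > 0 \end{cases}$$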

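Haverage (Weighted Average at $y_{tc_2}$)

$$x = \begin{cases} y_{k_2+1} & \text{if } g_2^* \ge 1 \\ (1 - g_2^*)\,y_{k_2} + g_2^*\,y_{k_2+1} & \text{if } g_2^* < 1 \text{ and } c_{k_2+1} \ge 1 \\ (1 - g_2)\,y_{k_2} + g_2\,y_{k_2+1} & \text{if } g_2^* < 1 \text{ and } c_{k_2+1} < 1 \end{cases}$$

Aempirical (Empirical Distribution Function with Averaging)

$$x = \begin{cases} (y_{k_1} + y_{k_1+1})/2 & \text{if } g_1^* = 0 \\ y_{k_1+1} & \text{if } g_1^* > 0 \end{cases}$$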

Note: If any of the 25th, 50th, or 75th percentiles is requested, Tukey Hinges will also
be printed.
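The five percentile definitions share the same bookkeeping (cumulative weights, a target count, and a fractional part), which the following Python sketch makes explicit. It is an illustration under stated assumptions: $cc_0 = 0$, positive caseweights, and clamping to the smallest or largest value when the target falls outside the range where $y_k$ and $y_{k+1}$ both exist (the text does not say how those edge cases are handled). The function name `percentile` and the lowercase method strings are hypothetical.

```python
from bisect import bisect_right

def percentile(y, c, p, method="haverage"):
    """p-th quantile (0 <= p <= 1) of sorted distinct y with caseweights c."""
    m = len(y)
    cc = [0.0]
    for ci in c:
        cc.append(cc[-1] + ci)
    W = cc[m]
    tc = (W + 1) * p if method == "haverage" else W * p
    k = bisect_right(cc, tc) - 1                 # cc[k] <= tc < cc[k + 1]
    if k >= m:                                   # assumption: clamp to maximum
        return y[m - 1]
    if k == 0:                                   # assumption: clamp to minimum
        return y[0]
    g_star = tc - cc[k]
    g = g_star / c[k]                            # c_{k+1} is c[k] (0-based)
    lo, hi = y[k - 1], y[k]                      # y_k and y_{k+1}
    frac = g_star if c[k] >= 1 else g
    if method in ("waverage", "haverage"):
        return hi if g_star >= 1 else (1 - frac) * lo + frac * hi
    if method == "round":
        return lo if frac < 0.5 else hi
    if method == "empirical":
        return lo if g_star == 0 else hi
    if method == "aempirical":
        return (lo + hi) / 2 if g_star == 0 else hi
    raise ValueError("unknown method: " + method)

data, wts = [1, 2, 3, 4, 5], [1] * 5
print([percentile(data, wts, q) for q in (0.25, 0.5, 0.75)])
```

With unit weights the default HAVERAGE method interpolates at position $(W+1)p$, while WAVERAGE interpolates at $Wp$, so the two generally differ.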

Tukey Hinges

Let $Q_1$, $Q_2$, and $Q_3$ be the 25th, 50th, and 75th percentiles. If $c^* \ge 1$, where
$c^* = \min(c_1, \dots, c_m)$, define

$$d = \frac{\text{greatest integer} \le (W+3)/2}{2}$$

$$L_1 = d$$

$$L_2 = W/2 + 1/2$$

$$L_3 = W + 1 - d$$

Otherwise

$$d = \frac{\text{greatest integer} \le (W/c^* + 3)/2}{2}$$

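and

$$L_1 = d\,c^*$$

$$L_2 = W/2 + c^*/2$$

$$L_3 = W + c^* - d\,c^*$$

Then for every $i$, $i = 1, 2, 3$, find $h_i$ such that

$$cc_{h_i} \le L_i < cc_{h_i+1}$$

and

$$Q_i = \begin{cases} (1 - a_i^*)\,y_{h_i} + a_i^*\,y_{h_i+1} & \text{if } a_i^* < 1 \text{ and } c_{h_i+1} \ge 1 \\ (1 - a_i)\,y_{h_i} + a_i\,y_{h_i+1} & \text{if } a_i^* < 1 \text{ and } c_{h_i+1} < 1 \\ y_{h_i+1} & \text{if } a_i^* \ge 1 \end{cases}$$

where

$$a_i^* = L_i - cc_{h_i}$$

$$a_i = \frac{a_i^*}{c_{h_i+1}}$$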
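The hinge computation can be sketched as follows. This is an illustrative reading of the definitions, in which the greatest-integer operation applies to $(W+3)/2$ before the division by 2 (which matches Tukey's depth rule); $cc_0 = 0$ and the clamping at the ends of the data are assumptions, and the name `tukey_hinges` is hypothetical.

```python
import math
from bisect import bisect_right

def tukey_hinges(y, c):
    """Return (Q1, Q2, Q3) for sorted distinct y with caseweights c."""
    m = len(y)
    cc = [0.0]
    for ci in c:
        cc.append(cc[-1] + ci)
    W = cc[m]
    c_star = min(c)
    if c_star >= 1:
        d = math.floor((W + 3) / 2) / 2
        depths = [d, W / 2 + 0.5, W + 1 - d]
    else:
        d = math.floor((W / c_star + 3) / 2) / 2
        depths = [d * c_star, W / 2 + c_star / 2, W + c_star - d * c_star]
    hinges = []
    for L in depths:
        h = bisect_right(cc, L) - 1              # cc[h] <= L < cc[h + 1]
        if h >= m:                               # assumption: clamp to maximum
            hinges.append(y[m - 1])
            continue
        if h == 0:                               # assumption: clamp to minimum
            hinges.append(y[0])
            continue
        a_star = L - cc[h]
        if a_star >= 1:
            hinges.append(y[h])                  # y_{h+1}
        else:
            a = a_star if c[h] >= 1 else a_star / c[h]
            hinges.append((1 - a) * y[h - 1] + a * y[h])
    return tuple(hinges)

print(tukey_hinges([1, 2, 3, 4, 5], [1] * 5))
print(tukey_hinges([1, 2, 3, 4], [1] * 4))
```

For unit weights this reproduces the familiar rule: the hinges are the medians of the lower and upper halves of the data, with the overall median included in both halves when the count is odd.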

M-Estimation (Robust Location Estimation)

The M-estimator T of location is the solution of

$$\sum_{i=1}^{m} c_i\,\Psi\!\left(\frac{y_i - T}{s}\right) = 0$$

where Ψ is an odd function and s is a measure of the spread.



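An alternative form of M-estimation is

$$\sum_{i=1}^{m} c_i \left(\frac{y_i - T}{s}\right) \omega\!\left(\frac{y_i - T}{s}\right) = 0$$

where

$$\omega(u) = \frac{\Psi(u)}{u}$$

After rearranging the above equation, we get

$$T = \frac{\sum_{i=1}^{m} c_i\,y_i\,\omega\!\left(\frac{y_i - T}{s}\right)}{\sum_{i=1}^{m} c_i\,\omega\!\left(\frac{y_i - T}{s}\right)}$$

Therefore, the algorithm to find M-estimators is defined iteratively by

$$T_{k+1} = \frac{\sum_{i=1}^{m} c_i\,y_i\,\omega\!\left(\frac{y_i - T_k}{s}\right)}{\sum_{i=1}^{m} c_i\,\omega\!\left(\frac{y_i - T_k}{s}\right)}$$

The algorithm stops when either

$$|T_{k+1} - T_k| \le \varepsilon\,(T_{k+1} + T_k)/2, \quad \text{where } \varepsilon = 0.005$$

or the number of iterations exceeds 30.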

M-Estimators

Four M-estimators (Huber, Hampel, Andrews, and Tukey) are available. Let

$$u_i = \frac{y_i - T}{s}$$

where

$$s = \text{median of } \tilde{y}_1, \dots, \tilde{y}_m \text{ with caseweights } c_1, \dots, c_m$$

and

$$\tilde{y}_i = |y_i - \tilde{y}|, \text{ where } \tilde{y} \text{ is the median.}$$

Huber (k), k > 0

$$\omega(u_i) = \begin{cases} 1 & \text{if } |u_i| \le k \\ \dfrac{k}{u_i}\,\mathrm{sgn}(u_i) & \text{if } |u_i| > k \end{cases}$$

The default value of k = 1.339.

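Hampel (a, b, c), 0 < a ≤ b ≤ c

$$\omega(u_i) = \begin{cases} 1 & \text{if } |u_i| \le a \\ \dfrac{a}{u_i}\,\mathrm{sgn}(u_i) & \text{if } a < |u_i| \le b \\ \dfrac{a}{u_i}\,\dfrac{c - |u_i|}{c - b}\,\mathrm{sgn}(u_i) & \text{if } b < |u_i| \le c \\ 0 & \text{if } |u_i| > c \end{cases}$$

By default, a = 1.7, b = 3.4 and c = 8.5.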



Andrews' Wave (c), c > 0

$$\omega(u_i) = \begin{cases} \dfrac{c}{\pi u_i}\,\sin\!\left(\dfrac{\pi u_i}{c}\right) & \text{if } |u_i| \le c \\ 0 & \text{if } |u_i| > c \end{cases}$$

By default, c = 1.34π.

Tukey's Biweight (c)

$$\omega(u_i) = \begin{cases} \left(1 - \dfrac{u_i^2}{c^2}\right)^2 & \text{if } |u_i| \le c \\ 0 & \text{if } |u_i| > c \end{cases}$$

By default, c = 4.685.
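The four weight functions and the iterative reweighting scheme can be put together in a short Python sketch. Several points are assumptions rather than statements from the text: the starting value $T_0$ is taken to be the median, the scale $s$ is the median absolute deviation from the median (computed here with `statistics.median`, which only suits unit caseweights), and an absolute value is placed on the right-hand side of the stopping rule as a guard for negative locations. All function names are hypothetical.

```python
import math
from statistics import median

def huber(u, k=1.339):
    return 1.0 if abs(u) <= k else k / abs(u)    # (k/u) sgn(u) = k/|u|

def hampel(u, a=1.7, b=3.4, c=8.5):
    au = abs(u)
    if au <= a:
        return 1.0
    if au <= b:
        return a / au
    if au <= c:
        return (a / au) * (c - au) / (c - b)
    return 0.0

def andrews(u, c=1.34 * math.pi):
    if u == 0:
        return 1.0                               # limit of sin(x)/x at 0
    if abs(u) > c:
        return 0.0
    return c / (math.pi * u) * math.sin(math.pi * u / c)

def tukey(u, c=4.685):
    return (1 - (u / c) ** 2) ** 2 if abs(u) <= c else 0.0

def m_estimate(y, c, omega, s, t0, eps=0.005, max_iter=30):
    """Iterate T_{k+1} = sum(c y w) / sum(c w) until the change is small."""
    t = t0
    for _ in range(max_iter):
        w = [ci * omega((yi - t) / s) for yi, ci in zip(y, c)]
        denom = sum(w)
        if denom == 0:                           # every case was rejected
            break
        t_new = sum(wi * yi for wi, yi in zip(w, y)) / denom
        done = abs(t_new - t) <= eps * abs(t_new + t) / 2
        t = t_new
        if done:
            break
    return t

y = [2.1, 2.4, 2.5, 2.6, 2.9, 50.0]              # one gross outlier
c = [1] * len(y)
med = median(y)
s = median(abs(yi - med) for yi in y)            # spread: median absolute deviation
t_huber = m_estimate(y, c, huber, s, med)
t_tukey = m_estimate(y, c, tukey, s, med)
print(t_huber, t_tukey)
```

Huber's function only downweights the outlier, whereas the redescending functions (Hampel, Andrews, Tukey) give it zero weight once it lies far enough from the current estimate.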

Tests of Normality

Shapiro-Wilk Statistic (W)

Since the W statistic is based on the order statistics of the sample, the caseweights
have to be restricted to integers. Hence, before W is calculated, all the caseweights
are rounded to the closest integer and the series is expanded. Let ci∗ be the closest
integer to ci ; then

$$cc_i^* = \sum_{k=1}^{i} c_k^*, \qquad W_s = cc_m^* = \sum_{k=1}^{m} c_k^*$$

The original series $y = \{y_1, \dots, y_m\}$ is expanded to

$$x = \{x_1, \dots, x_{W_s}\}$$

where

$$x_{cc_{i-1}^*+1} = \dots = x_{cc_i^*} = y_i, \quad i = 1, \dots, m$$

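Then the W statistic is defined as

$$W = \frac{\left(\sum_{i=1}^{W_s} a_i x_i\right)^2}{\sum_{i=1}^{W_s} (x_i - \bar{x})^2}$$

where

$$\bar{x} = \frac{\sum_{i=1}^{W_s} x_i}{W_s}$$

$$a_1^2 = a_{W_s}^2 = \begin{cases} \dfrac{\Gamma(W_s/2)}{\sqrt{2}\,\Gamma((W_s+1)/2)} & \text{if } 5 \le W_s \le 20 \\ \dfrac{\Gamma((W_s+1)/2)}{\sqrt{2}\,\Gamma(W_s/2+1)} & \text{if } W_s > 20 \end{cases}$$

$$a_{W_s} = \sqrt{a_{W_s}^2}, \qquad a_1 = -\sqrt{a_1^2}$$

$$a_i = \frac{2\,m_i}{c}, \quad i = 2, \dots, W_s - 1$$

$$m_i = \Psi^{-1}\!\left(\frac{i - \alpha}{W_s - 2\alpha + 1}\right), \text{ where } \Psi \text{ is the c.d.f. of a standard normal distribution}$$

$$\alpha = 0.314195 + 0.063336\,\beta - 0.010895\,\beta^2$$

$$\beta = \log_{10} W_s$$

$$c^2 = \frac{4\sum_{i=2}^{W_s-1} m_i^2}{1 - 2a_1^2}$$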
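The coefficient construction and the $W$ statistic can be sketched in Python. This is a hedged illustration only: it reads the end coefficients as $\Gamma(\cdot)/(\sqrt{2}\,\Gamma(\cdot))$ and the normalizing sum as running over $i = 2, \dots, W_s - 1$ (which makes the squared coefficients sum to 1), and it breaks rounding ties upward when expanding the series. The function names are hypothetical, and the significance lookup is not covered.

```python
import math
from statistics import NormalDist

def expand_series(y, c):
    """Repeat each y_i round(c_i) times (ties at .5 rounded up: an assumption)."""
    return [yi for yi, ci in zip(y, c) for _ in range(math.floor(ci + 0.5))]

def shapiro_wilk_coefficients(n):
    """Approximate coefficients a_1..a_n for n >= 5."""
    if n < 5:
        raise ValueError("coefficients are defined here for n >= 5")
    if n <= 20:
        ratio = math.exp(math.lgamma(n / 2) - math.lgamma((n + 1) / 2))
    else:
        ratio = math.exp(math.lgamma((n + 1) / 2) - math.lgamma(n / 2 + 1))
    a1_sq = ratio / math.sqrt(2)                 # assumed sqrt(2) in denominator
    beta = math.log10(n)
    alpha = 0.314195 + 0.063336 * beta - 0.010895 * beta ** 2
    inv = NormalDist().inv_cdf                   # inverse standard normal c.d.f.
    m = [inv((i - alpha) / (n - 2 * alpha + 1)) for i in range(1, n + 1)]
    c = math.sqrt(4 * sum(mi ** 2 for mi in m[1:-1]) / (1 - 2 * a1_sq))
    a = [2 * mi / c for mi in m]
    a[0], a[-1] = -math.sqrt(a1_sq), math.sqrt(a1_sq)
    return a

def shapiro_wilk_w(x):
    """W statistic for an expanded sample x (sorted internally)."""
    x = sorted(x)
    n = len(x)
    a = shapiro_wilk_coefficients(n)
    mean = sum(x) / n
    numerator = sum(ai * xi for ai, xi in zip(a, x)) ** 2
    return numerator / sum((xi - mean) ** 2 for xi in x)

sample = expand_series([2.1, 2.4, 2.5, 2.6, 2.9, 3.0, 3.3, 3.4, 3.8, 4.1], [1] * 10)
coef = shapiro_wilk_coefficients(len(sample))
w_stat = shapiro_wilk_w(sample)
print(w_stat)
```

Because the coefficients are antisymmetric and their squares sum to 1, the statistic cannot exceed 1; values close to 1 indicate agreement with normality.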

Based on the computed W statistic, the significance is calculated by linearly
interpolating within the range of simulated critical values given in Shapiro and
Wilk (1965).
If non-integer weights are specified, the Shapiro-Wilk statistic is calculated
when the weighted sample size lies between 3 and 50. For no weights or integer
weights, the statistic is calculated when the weighted sample size lies between 3
and 5000.
If $W > w_{0.99}$, the critical value of the 99th percentile, the significance is reported
as >0.99. Similarly, if $W < w_{0.01}$, the critical value of the first percentile, the
significance is reported as <0.01.

Kolmogorov-Smirnov Statistic with Lilliefors’ Significance

Lilliefors (1967) presented a table for testing normality using the Kolmogorov-
Smirnov statistic when the mean and variance of the population are unknown. This
statistic [1] is

$$D_a = \max\{D^+, D^-\}$$

where

$$D^+ = \max_i \{\hat{F}(y_i) - F(y_i)\}$$

$$D^- = \max_i \{F(y_i) - \hat{F}(y_{i-1})\}$$

where $\hat{F}(x)$ is the sample cumulative distribution and $F(x)$ is the cumulative
normal distribution whose mean and variance are estimated from the sample.
Dallal and Wilkinson (1986) corrected the critical values for testing normality
reported by Lilliefors. With the corrected table they derived an analytic
approximation to the upper tail probabilities of Da for probabilities less than 0.1.
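The statistic $D_a$ can be computed in one pass over the distinct values. The sketch below assumes that the sample cumulative distribution at $y_i$ is $cc_i / W$ (with value 0 before the first observation) and that the normal distribution is fitted with the weighted mean and the variance $s^2$ defined earlier; the function name `lilliefors_statistic` is hypothetical.

```python
import math
from statistics import NormalDist

def lilliefors_statistic(y, c):
    """D_a = max(D+, D-) for sorted distinct y with caseweights c."""
    W = sum(c)
    mean = sum(ci * yi for yi, ci in zip(y, c)) / W
    sd = math.sqrt(sum(ci * (yi - mean) ** 2 for yi, ci in zip(y, c)) / (W - 1))
    F = NormalDist(mean, sd).cdf                 # fitted normal c.d.f.
    cum = 0.0
    d_plus = d_minus = float("-inf")
    for yi, ci in zip(y, c):
        below = cum / W                          # sample c.d.f. at y_{i-1}
        cum += ci
        d_plus = max(d_plus, cum / W - F(yi))
        d_minus = max(d_minus, F(yi) - below)
    return max(d_plus, d_minus)

d_a = lilliefors_statistic([1, 2, 3, 4, 5], [1] * 5)
print(d_a)
```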

[1] This algorithm applies to SPSS 7.0 and later releases.

The following formula is used to estimate the critical value $D_c$ for probability
0.1.

$$D_c = \frac{-b - \sqrt{b^2 - 4ac}}{2a}$$

where, if $W \le 100$,

$$a = -7.01256\,(W + 2.78019)$$

$$b = 2.99587\,\sqrt{W + 2.78019}$$

$$c = 2.1804661 + \frac{0.974598}{\sqrt{W}} + \frac{1.67997}{W}$$

If $W > 100$ [2]

$$a = -7.90289126054 \cdot W^{0.98}$$

$$b = 3.180370175721 \cdot W^{0.49}$$

$$c = 2.2947256$$

The Lilliefors significance $p$ is calculated as follows:

If $D_a = D_c$, $p = 0.1$.

If $D_a > D_c$, $p = \exp\{a D_a^2 + b D_a + c - 2.3025851\}$.

If $D_{0.2} \le D_a < D_c$, $p$ is obtained by linear interpolation between $D_{0.2}$ and $D_c$, where $D_{0.2}$ is the
critical value for probability 0.2.

If $D_a < D_{0.2}$, $p$ is reported as $> 0.2$.
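The significance rules can be sketched as a small function. The coefficients follow the formulas above, read with square roots in $b$ and in the second term of $c$; the critical value $D_{0.2}$ comes from a table that this chapter does not reproduce, so it is passed in as a parameter, and the value 0.16 used in the example is an arbitrary placeholder, not a tabulated value. The function names are hypothetical.

```python
import math

def dallal_wilkinson_coeffs(W):
    """Coefficients a, b, c of the upper-tail approximation."""
    if W <= 100:
        a = -7.01256 * (W + 2.78019)
        b = 2.99587 * math.sqrt(W + 2.78019)
        c = 2.1804661 + 0.974598 / math.sqrt(W) + 1.67997 / W
    else:
        a = -7.90289126054 * W ** 0.98
        b = 3.180370175721 * W ** 0.49
        c = 2.2947256
    return a, b, c

def critical_value_10pct(W):
    """D_c: root of a D^2 + b D + c = 0, where the approximation gives p = 0.1."""
    a, b, c = dallal_wilkinson_coeffs(W)
    return (-b - math.sqrt(b * b - 4 * a * c)) / (2 * a)

def lilliefors_significance(d_a, W, d_02):
    """Return p, or None when p would be reported as '> 0.2'."""
    a, b, c = dallal_wilkinson_coeffs(W)
    d_c = critical_value_10pct(W)
    if d_a >= d_c:                               # at d_a == d_c this gives 0.1
        return math.exp(a * d_a ** 2 + b * d_a + c - 2.3025851)
    if d_a >= d_02:                              # interpolate between 0.2 and 0.1
        return 0.2 - 0.1 * (d_a - d_02) / (d_c - d_02)
    return None

d_c20 = critical_value_10pct(20)
print(d_c20, lilliefors_significance(0.25, 20, d_02=0.16))
```

The constant 2.3025851 is $\ln 10$, so the exponential equals exactly 0.1 when the quadratic $a D_a^2 + b D_a + c$ is zero, which is how $D_c$ is defined.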

[2] This algorithm applies to SPSS 7.0 and later releases. To learn about algorithms
for previous releases, call SPSS Technical Support.

Group Statistics
Assume that there are $k$ ($k \ge 2$) combinations of grouping factors. For every
combination $i$, $i = 1, 2, \dots, k$, let $\{y_{i1}, \dots, y_{im_i}\}$ be the sample observations with
the corresponding caseweights $\{c_{i1}, \dots, c_{im_i}\}$.

Spread versus Level


If a transformation value, $a$, is given, the spread ($s$) and level ($l$) are defined based
on the transformed data. Let $x$ be the transformed value of $y$; for every
$i = 1, \dots, k$, $j = 1, \dots, m_i$

$$x_{ij} = \begin{cases} \ln y_{ij} & \text{if } a = 0 \\ y_{ij}^{\,a} & \text{otherwise} \end{cases}$$

Then the spread ($s_i$) and the level ($l_i$) are respectively defined as the Interquartile
Range and the median of $\{x_{i1}, \dots, x_{im_i}\}$ with corresponding caseweights
$\{c_{i1}, \dots, c_{im_i}\}$. However, if $a$ is not specified, the spread and the level are natural
logarithms of the Interquartile Range and of the median of the original data.
Finally, the slope is the regression coefficient of $s$ on $l$, which is defined as

$$\frac{\sum_{i=1}^{k} (l_i - \bar{l})(s_i - \bar{s})}{\sum_{i=1}^{k} (l_i - \bar{l})^2}$$
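The transformation and the slope are simple enough to sketch directly. In this illustration the means $\bar{l}$ and $\bar{s}$ are taken as plain averages over the $k$ groups, which is an assumption, and the per-group levels and spreads are supplied by the caller (they would come from the median and interquartile range computed by the requested percentile method). The function names are hypothetical.

```python
import math

def power_transform(y, a):
    """x = ln(y) when a == 0, otherwise y ** a."""
    return math.log(y) if a == 0 else y ** a

def spread_level_slope(levels, spreads):
    """Regression coefficient of spread on level across groups."""
    k = len(levels)
    l_bar = sum(levels) / k
    s_bar = sum(spreads) / k
    num = sum((l - l_bar) * (s - s_bar) for l, s in zip(levels, spreads))
    den = sum((l - l_bar) ** 2 for l in levels)
    return num / den

# Hypothetical per-group levels and spreads for three groups.
print(spread_level_slope([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```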

In some situations, the transformations cannot be done. The spread-versus-level
plot and Levene statistic will not be produced if:
• a is a negative integer and at least one data value is 0

• a is a negative non-integer and at least one data value is less than or equal to 0

• a is a positive non-integer and at least one data value is less than 0

• a is not specified and the median or the spread is less than or equal to 0

Levene Test of Homogeneity of Variances


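The Levene test statistic is based on the transformed data and is defined by

$$L_a = \frac{(W - k)\,\sum_{i=1}^{k} w_i\,(\bar{z}_i - \bar{z})^2}{(k - 1)\,\sum_{i=1}^{k}\sum_{l=1}^{m_i} c_{il}\,(z_{il} - \bar{z}_i)^2}$$

where

$$w_i = \sum_{l=1}^{m_i} c_{il}$$

$$\bar{x}_i = \frac{\sum_{l=1}^{m_i} c_{il}\,x_{il}}{w_i}$$

$$z_{il} = |x_{il} - \bar{x}_i|$$

$$\bar{z}_i = \frac{\sum_{l=1}^{m_i} c_{il}\,z_{il}}{w_i}$$

$$\bar{z} = \frac{\sum_{i=1}^{k} w_i\,\bar{z}_i}{W}$$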

The significance of $L_a$ is calculated from the F distribution with degrees of
freedom $k - 1$ and $W - k$.
Groups with zero variance are included in the test.
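The Levene statistic can be sketched as a weighted one-way analysis of variance on absolute deviations from the group means. The code below computes only the statistic, not its significance (which would need an F distribution routine); the function name `levene_statistic` and the list-of-lists input layout are hypothetical.

```python
def levene_statistic(groups, weights):
    """L_a for k groups; groups[i] and weights[i] hold x_il and c_il."""
    k = len(groups)
    W = sum(sum(c) for c in weights)
    w, z, z_bar_i = [], [], []
    for x, c in zip(groups, weights):
        wi = sum(c)
        x_bar = sum(ci * xi for xi, ci in zip(x, c)) / wi
        zi = [abs(xi - x_bar) for xi in x]       # absolute deviations
        w.append(wi)
        z.append(zi)
        z_bar_i.append(sum(ci * zil for zil, ci in zip(zi, c)) / wi)
    z_bar = sum(wi * zb for wi, zb in zip(w, z_bar_i)) / W
    between = sum(wi * (zb - z_bar) ** 2 for wi, zb in zip(w, z_bar_i))
    within = sum(
        ci * (zil - zb) ** 2
        for zi, c, zb in zip(z, weights, z_bar_i)
        for zil, ci in zip(zi, c)
    )
    return (W - k) / (k - 1) * between / within

l_a = levene_statistic([[1, 2, 3], [2, 4, 6]], [[1, 1, 1], [1, 1, 1]])
print(l_a)
```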

Robust Levene’s Test of Homogeneity of Variances


With the current version of Levene's test $L_a$, the following options can be considered
in order to obtain robust Levene's tests:
• Levene's test $L_b$ based on $z_{il}^{(b)} = |x_{il} - \tilde{x}_i|$, where $\tilde{x}_i$ is the median of the $x_{il}$'s for
group $i$.

Median calculation is done by the method requested. The default method is
HAVERAGE. Once the $\tilde{x}_i$'s and hence the $z_{il}^{(b)}$'s are calculated, apply the
formula for $L_a$, shown in the section above, to obtain $L_b$ by replacing $z_{il}$, $\bar{z}_i$ and
$\bar{z}$ with $z_{il}^{(b)}$, $\bar{z}_i^{(b)}$ and $\bar{z}^{(b)}$ respectively.

Two significances of $L_b$ are given. One is calculated from an F-distribution
with degrees of freedom $k - 1$ and $W - k$. The other is calculated from an F-
distribution with degrees of freedom $k - 1$ and $v$. The value of $v$ is given by:

$$v = \frac{\left(\sum_{i=1}^{k} u_i\right)^2}{\sum_{i=1}^{k} \dfrac{u_i^2}{v_i}}$$

where

$$u_i = \sum_{l=1}^{m_i} c_{il}\,\bigl(z_{il}^{(b)} - \bar{z}_i^{(b)}\bigr)^2$$

in which

$$\bar{z}_i^{(b)} = \frac{\sum_{l=1}^{m_i} c_{il}\,z_{il}^{(b)}}{w_i}$$

and

$$v_i = w_i - 1.$$

• Levene's test $L_c$ based on $z_{il}^{(c)} = |x_{il} - T_{i,0.9}|$, where $T_{i,0.9}$ is the 5% trimmed mean
of the $x_{il}$'s for group $i$.

Once the $T_{i,0.9}$'s and hence the $z_{il}^{(c)}$'s are calculated, apply the formula of $L_a$ to
obtain $L_c$ by replacing $z_{il}$, $\bar{z}_i$ and $\bar{z}$ with $z_{il}^{(c)}$, $\bar{z}_i^{(c)}$ and $\bar{z}^{(c)}$ respectively.

The significance of $L_c$ is calculated from an F-distribution with degrees of
freedom $k - 1$ and $W - k$.
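The adjusted denominator degrees of freedom $v$ for the median-based test can be sketched as follows. The function takes the absolute deviations from the group medians as input, so the median itself (computed by whichever percentile method was requested) is left to the caller; the name `robust_levene_df` and the input layout are hypothetical.

```python
def robust_levene_df(z_groups, weights):
    """v = (sum u_i)^2 / sum(u_i^2 / v_i), with v_i = w_i - 1."""
    u, v_i = [], []
    for z, c in zip(z_groups, weights):
        wi = sum(c)
        z_bar = sum(ci * zil for zil, ci in zip(z, c)) / wi
        u.append(sum(ci * (zil - z_bar) ** 2 for zil, ci in zip(z, c)))
        v_i.append(wi - 1)
    return sum(u) ** 2 / sum(ui ** 2 / vi for ui, vi in zip(u, v_i))

# Absolute deviations from the medians of the groups (1, 2, 3) and (2, 4, 6).
v = robust_levene_df([[1, 0, 1], [2, 0, 2]], [[1, 1, 1], [1, 1, 1]])
print(v)
```

When the groups have very unequal within-group variation of the deviations, $v$ falls well below $W - k$, which makes the second significance more conservative.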

Plots
Normal Probability Plot (NPPLOT)
For every distinct observation $y_i$, $R_i$ is the rank (the mean of ranks is assigned to
ties). The normal score $NS_i$ is calculated by

$$NS_i = \Psi^{-1}\!\left(\frac{R_i}{W + 1}\right)$$

where $\Psi^{-1}$ is the inverse of the standard normal cumulative distribution function.

The NPPLOT is the plot of $(y_1, NS_1), \dots, (y_m, NS_m)$.
Detrended Normal Plot

The detrended normal plot is the scatterplot of $(y_1, D_1), \dots, (y_m, D_m)$, where $D_i$
is the difference between the Z-score and normal score, which is defined by

$$D_i = Z_i - NS_i$$

and

$$Z_i = \frac{y_i - \bar{y}}{s}$$

where $\bar{y}$ is the average and $s$ is the standard deviation.
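The points for both plots can be generated together. The sketch below assumes that a distinct value with caseweight $c_i$ occupies ranks $cc_{i-1}+1$ through $cc_i$, so its mean rank is $cc_{i-1} + (c_i + 1)/2$; that reading of "mean of ranks" is an assumption, and the function name `normal_plot_points` is hypothetical.

```python
import math
from statistics import NormalDist

def normal_plot_points(y, c):
    """Return a list of (y_i, NS_i, D_i) for sorted distinct y with weights c."""
    W = sum(c)
    mean = sum(ci * yi for yi, ci in zip(y, c)) / W
    s = math.sqrt(sum(ci * (yi - mean) ** 2 for yi, ci in zip(y, c)) / (W - 1))
    inv = NormalDist().inv_cdf                   # inverse standard normal c.d.f.
    points = []
    cum = 0.0
    for yi, ci in zip(y, c):
        rank = cum + (ci + 1) / 2                # mean rank of the tied cases
        cum += ci
        ns = inv(rank / (W + 1))                 # normal score
        z = (yi - mean) / s                      # Z-score
        points.append((yi, ns, z - ns))          # detrended value D = Z - NS
    return points

pts = normal_plot_points([1, 2, 3], [1, 1, 1])
print(pts)
```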

Boxplot
The boundaries of the box are Tukey’s hinges. The length of the box is the
interquartile range based on Tukey’s hinges. That is,

IQR = Q3 − Q1

Define

STEP = 1.5 IQR

A case is an outlier if

Q3 + STEP ≤ yi < Q3 + 2 STEP


or
Q1 − 2 STEP < yi ≤ Q1 − STEP

A case is an extreme if

yi ≥ Q3 + 2 STEP

or

yi ≤ Q1 − 2 STEP
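The outlier and extreme rules translate directly into a small classifier. The function name `classify_case` and the label strings are hypothetical; `q1` and `q3` are the Tukey hinges.

```python
def classify_case(y, q1, q3):
    """Label a value as 'extreme', 'outlier' or 'inside' using Tukey's hinges."""
    step = 1.5 * (q3 - q1)                       # STEP = 1.5 IQR
    if y >= q3 + 2 * step or y <= q1 - 2 * step:
        return "extreme"
    if y >= q3 + step or y <= q1 - step:
        return "outlier"
    return "inside"

# With hinges 2 and 4, STEP is 3: outliers start at 7 and -1, extremes at 10 and -4.
print([classify_case(v, 2, 4) for v in (6.9, 7, 10, -1, -4)])
```

The extreme test is applied first so that the half-open outlier intervals in the definition do not need to be written out explicitly.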

References
Brown, M. B., and Forsythe, A. B. 1974. Robust tests for the equality of
variances. Journal of the American Statistical Association, 69: 364–367.

Dallal, G. E., and Wilkinson, L. 1986. An analytic approximation to the
distribution of Lilliefors's test statistic for normality. The American Statistician,
40(4): 294–296 (Correction: 41: 248).

Frigge, M., Hoaglin, D. C., and Iglewicz, B. 1987. Some implementations for the
boxplot. In: Computer Science and Statistics Proceedings of the 19th
Symposium on the Interface, R. M. Heiberger and M. Martin, eds. Alexandria,
Va.: American Statistical Association.

Glaser, R. E. 1983. Levene's robust test of homogeneity of variances.
Encyclopedia of Statistical Sciences 4. New York: Wiley, 608–610.

Hoaglin, D. C., Mosteller, F., and Tukey, J. W. 1983. Understanding robust and
exploratory data analysis. New York: John Wiley & Sons, Inc.

Hoaglin, D. C., Mosteller, F., and Tukey, J. W. 1985. Exploring data tables,
trends, and shapes. New York: John Wiley & Sons, Inc.

Lilliefors, H. W. 1967. On the Kolmogorov-Smirnov tests for normality with mean
and variance unknown. Journal of the American Statistical Association, 62:
399–402.

Loh, W. Y. 1987. Some modifications of Levene's test of variance
homogeneity. Journal of Statistical Computation and Simulation, 28: 213–226.

Tukey, J. W. 1977. Exploratory data analysis. Reading, Mass.: Addison-Wesley.

Velleman, P. F., and Hoaglin, D. C. 1981. Applications, basics, and computing of
exploratory data analysis. Boston: Duxbury Press.
