Examine

EXAMINE
Univariate Statistics
Notation
The following notation is used throughout this chapter unless otherwise noted:
K
Let y1 < ym be m distinct ordered observations for the sample and c1 , K, c m
be the corresponding caseweights. Then
i
cci = ∑c k = cumulative frequency up to and including yi
k =1
and
m
W = ccm = ∑c k = total sum of weights.
k =1
Descriptive Statistics
Minimum and Maximum
min = y1 , max = ym
Range
range = ym − y1
1
2 EXAMINE
Mean ybg
m
∑c y i i
i =1
y=
W
Confidence Interval for the Mean
lower bound = y − tα / 2,W −1SE

upper bound = y + tα / 2,W −1SE
where SE is the standard error.
Median
The median is the 50th percentile, which is calculated by the method requested.
The default method is HAVERAGE.
Interquartile Range (IQR)
IQR = 75th percentile − 25th percentile, where the 75th and 25th percentiles are
calculated by the method requested for percentiles.
e j
Variance s 2
∑ c b y − y g2
1
s2 = i i
W −1
i =1
Standard Deviation
s = s2
EXAMINE 3
Standard Error
s
SE =
W
b g
Skewness g1 and SE of Skewness
WM3
g1 =
a fa
W − 1 W − 2 s3 f
b g aW − 26faWWaW+ 1−fa1Wf + 3f
SE g1 =
∑ c by − yg
m
3
M3 = i i
i =1
b g
Kurtosis g 2 and SE of Kurtosis
g2 =
a f
W W + 1 M 4 − 3M22 W − 1 a f
a
W −1 W − 2 W − 3 s4 fa fa f
m
∑ c by − yg
2
M2 = i i
i =1
∑ c by − yg
4
M4 = i i
i =1
b g
SE g2 =
d i b g
4 W 2 − 1 SE 2 g1
aW − 3faW + 5f
4 EXAMINE
5% Trimmed Mean 0.9
R| U
S|ecck +1 − tcj yk +1 + eW − cck −1 − tcj yk + ∑ ci yi |V|
k −1
2
1
T0.9 =
T W
1 1 2 2
0.9W
i = k +2
1
where k1 and k 2 satisfy the following conditions
cck1 < tc ≤ cck1 +1 , W − cck2 < tc ≤ W − cck2 −1
and
tc = 0.05W
Note: If k1 + 1 = k 2 , then T0.9 = y k2
Percentiles
There are five methods for computation of percentiles. Let
tc1 = Wp, b
tc2 = W + 1 p g
where p is the requested percentile divided by 100, and k1 and k 2 satisfy
cck1 ≤ tc1 < cck1 +1

cck2 ≤ tc2 < cck2 +1
Then,
g1 =
dtc − cc i ,
1 k1
g1∗ = tc1 − cck1
ck1 +1
dtc 2 − cck2 i, g2∗ = tc2 − cck2

g2 =
ck2 +1
EXAMINE 5
Let x be the pth percentile; the five definitions are as follows:
Waverage (Weighted Average at y tc 1 )
R| y if g1∗ ≥ 1
| k1 +1
x = Se1 − g j y ∗
+ g1∗ y k1 +1 if g1∗ < 1 and ck1 +1 ≥ 1
||b1 − g g y 1 k1
T 1 k1 + g1 y k1 +1 if g1∗ < 1 and ck1 +1 < 1
Round (Observation Closest to tc 1 )
If ck1 +1 ≥ 1 , then
R| y if g1∗ <
S| y
1
k1
x= 2
T k1 +1 if g1∗ ≥ 1
2
If ck1 +1 < 1 , then
R| yk if g1 < 21
x= S| yk +11
T 1
if g1 ≥ 21
Empirical (Empirical Distribution Function)
R| y if g1∗ = 0
x= S| y k1
T k1 +1 if g1∗ > 0
6 EXAMINE
Haverage (Weighted Average at y tc 2 )
R| y if g2∗ ≥ 1
| k 2 +1
x = Se1 − g j y∗
+ g2∗ y k2 +1 if g2∗ < 1 and ck2 +1 ≥ 1
||b1 − g g y
2 k2
T 2 k2 + g2 y k2 +1 if g2∗ < 1 and ck2 +1 < 1
Aempirical (Empirical Distribution Function with Averaging)
R|e yk + y k1 +1 2j if g1∗ = 0
x= S| yk1
T 1 +1
if g1∗ > 0
Note: If either the 25th, 50th, or 75th percentiles is request, Tukey Hinges will also
be printed.
Tukey Hinges
Let Q1 , Q2 , and Q3 be the 25th, 50th, and 75th percentiles. If c∗ ≥ 1 , where

c∗ = min c1 , b K, c g , define
m
d=
greatest integer ≤ W + 3 2 cb g h
2
L1 = d
L2 = W 2 + 1 2
L3 = W + 1 − d
Otherwise
d=
greatest integer ≤ W c∗ + 3 2 e j
2
EXAMINE 7
and
L1 = dc∗
L2 = W 2 + c∗ 2
L3 = W + c∗ − dc∗
Then for every i, i = 1, 2, 3 , find hi such that
cchi ≤ Li < cchi +1
and
R|e1 − a j y ∗
+ ai∗ yhi +1 if ai∗ < 1 and chi +1 ≥ 1
|
i hi
Q = Sb1 − a g y + ai yhi +1 if ai∗ < 1 and chi +1 < 1

i
|| y i hi
if ai∗ ≥ 1
T 1 hi +
where
ai∗ = Li − cchi
ai∗
ai =
chi +1
M-Estimation (Robust Location Estimation)
The M-estimator T of location is the solution of
∑ c ΨFGH s IJK = 0
m
y −T i
i
i =1
where Ψ is an odd function and s is a measure of the spread.

8 EXAMINE
An alternative form of M-estimation is
∑ c FGH s IJK ω FGH s IJK = 0

m
y −T iy −T i
i
i =1
where
b g Ψubug
ω u =
After rearranging the above equation, we get
∑ ci yiω FGH i s IJK

m
y −T
=1
T= i m
∑ ciω FGH i s IJK

y −T
i =1
Therefore, the algorithm to find M-estimators is defined iteratively by
∑ c y ω FGH s IJK
m
y −T i k
i i
i =1
Tk +1 =
∑ c ω FGH s IJK
m
y −T i k
i
i =1
The algorithm stops when either
Tk +1 − Tk ≤ ε Tk +1 + Tk b g 2 , where ε = 0.005
or the number of iterations exceeds 30.
EXAMINE 9
M-Estimators
Four M-estimators (Huber, Hampel, Andrew, and Tukey) are available. Let
yi − T
ui =
s
where
s = median of ~
y1 , K, ~y
m with caseweights c1 , K, c
m
and
~
yi = yi − ~
y, where ~
y is the median .
Huber (k), k > 0
1R| if ui ≤ k
b g S
ω ui = k
b g
ui|T
sgn ui if ui > k
The default value of k = 1339

.
Hampel (a, b, c), 0 < a ≤ b ≤ c
R|1 if ui ≤ a
|| ua sgnbu g
i if a < ui ≤ b
|| i
ω bu g = S
i
|| ua cc−− ub sgnbu g
i
i if b < ui ≤ c
||i
|T0 if ui > c
By default, a = 1.7, b = 3.4 and c = 8.5.

10 EXAMINE
Andrew’s Wave (c), c > 0
R| c sinF π u I
|π u H c K if ui ≤ c
i
ω bu g = S i
i
||0
T if ui > c
. π
By default, c = 134
Tukey’s Biweight (c)
R|F u I 2
|G1 − J
2
ω bu g = SH c K
i
2
if ui ≤ c
i
||
T0 if ui > c
By default, c = 4.685.
Tests of Normality
Shapiro-Wilk Statistic (W)
Since the W statistic is based on the order statistics of the sample, the caseweights
have to be restricted to integers. Hence, before W is calculated, all the caseweights
are rounded to the closest integer and the series is expanded. Let ci∗ be the closest
integer to ci ; then
i m
cci∗ = ∑ ck∗ , Ws = ∗
ccm = ∑c ∗
k
k =1 k =1
The original series y = y1 , l K, y q is expanded to

m
o K, xw t
x = x1 , s
where
xcc∗
i −1 +1
= K = xcc ∗
i
= yi , i = 1, K, m
EXAMINE 11
Then the W statistic is defined as
FW I 2
GG ∑ ai xi JJ
s
W= W
Hi K =1
∑ b xi − x g
s
2
i =1
where
Ws
∑x i
i =1
x=
Ws
R| ΓbW 2g s
if 5 ≤ Ws ≤ 20
|| 2ΓdbW + 1g 2i s
a12 = aW
2
=S
s
|| ΓdbW + 1g 2i
s
if Ws > 20
|T 2ΓbW 2 + 1g s
aWs = aW
2
a1 = − a12 , s
b g K, W − 1
ai = 2 c mi , i = 2, s
m =Ψ G
F i − α I , where Ψ is the c.d.f. of a standard normal distribution
−1
i
H W − 2α + 1JK s
α = 0.314195 + 0.063336β − 0.010895β 2
β = log10 Ws
Ws −1
mi2
c =4
2
∑ d1 − 2a i
i =1
2
i
12 EXAMINE
Based on the computed W statistic, the significance is calculated by linearly

interpolating within the range of simulated critical values given in Shapiro and
Wilk (1965).
If non-integer weights are specified, the Shapiro-Wilk’s statistic is calculated
when the weighted sample size lies between 3 and 50. For no weights or integer
weights, the statistic is calculated when the weighted sample size lies between 3
and 5000.
If W > w0.99 , the critical value of 99th percentile, the significance is reported
as >0.99. Similarly, if W < w0.01 , the critical value of first percentile, the
significance is reported as <0.01.
Kolmogorov-Smirnov Statistic with Lilliefors’ Significance
Lilliefors (1967) presented a table for testing normality using the Kolmogorov-
Smirnov statistic when the mean and variance of the population are unknown. This
1
statistic is
l
Da = max D+ , D− q
where
o b g b gt
D+ = max i F$ yi − F yi
D− = max i oFb yi g − Fb yi gt
$
−1
af af
where F$ x is the sample cumulative distribution and F x is the cumulative
normal distribution whose mean and variance are estimated from the sample.
Dallal and Wilkinson (1986) corrected the critical values for testing normality
reported by Lilliefors. With the corrected table they derived an analytic
approximation to the upper tail probabilities of Da for probabilities less than 0.1.
1
This algorithm applies to SPSS 7.0 and later releases.
EXAMINE 13
The following formula is used to estimate the critical value Dc for probability
0.1.
F −b − I
Dc =
H b 2 − 4ac
K
2a
where, if W ≤ 100 ,
b
a = −7.01256 W + 2.78019 g
b = 2.99587 W + 2.78019
0.974598 .
167997
c = 2.1804661 + +
W W
2
If W > 100
a = −7.90289126054 * W 0.98
b = 3180370175721
. * W 0.49
c = 2.2947256
The Lilliefors significance p is calculated as follows:

If Da = Dc , p = 01
. .
{
If Da > Dc , p = exp aDa2 + bDa + c − 2.3025851 . }
If D0.2 ≤ Da < Dc , linear interpolation between D0.2 and Dc where D0.2 is the
critical value for probability 0.2 is done.
If Da > D0.2 , p is reported as > 0.2 .
2
This algorithm applies to SPSS 7.0 and later releases. To learn about algorithms
for previous releases, call SPSS Technical Support.
14 EXAMINE
Group Statistics
Assume that there are k k ≥ 2 b g combinations of grouping factors. For every
K, k , let oy , K, y t be the sample observations with
combination i, i = 1, 2, i1 imi
the corresponding caseweights oc , K , c t . i1 imi
Spread versus Level

If a transformation value, a, is given, the spread(s) and level(l) are defined based
K K
on the transformed data. Let x be the transformed value of y; for every
i = 1, , k , j = 1, , mi
R|ln y if a = 0
xij = S| y ij
Ta
ij otherwise
bg bg
Then the spread si and the level li are respectively defined as the Interquartile
Range and the median of ox , K , x t
i1 imi with corresponding caseweights
oci , K, cim t . However, if a is not specified, the spread and the level are natural
1 i
logarithms of the Interquartile Range and of the median of the original data.
Finally, the slope is the regression coefficient of s on l, which is defined as
∑ dli − l ibsi − s g
i =1
k
∑ dli − l i
2
i =1
In some situations, the transformations cannot be done. The spread-versus-level

plot and Levene statistic will not be produced if:
• a is a negative integer and at least one of the data is 0
EXAMINE 15
• a is a negative non-integer and at least one of the data is less than or equal to 0
• a is a positive non-integer and at least one of the data is less than 0
• a is not specified and the median or the spread is less than or equal to 0
Levene Test of Homogeneity of Variances

The Levene test statistic is based on the transformed data and is defined by
F W −kI
∑ wi bzi − z g 2
La = G
H k − 1 JK k m
i =1
∑ ∑ cil bzil − zi g
i
2
i =1 l =1
where
mi
wi = ∑c
l =1
il
mi
∑c x
l =1
il il
xi =
wi
zil = x il − x i
mi
∑
cil zil
zi =
l =1
wi
∑
wi z i
z=
i =1
W
16 EXAMINE
The significance of La is calculated from the F distribution with degrees of

freedom k − 1 and W − k .
Groups with zero variance are included in the test.
Robust Levene’s Test of Homogeneity of Variances

With the current version of Levene’s test La, the followings can be considered as
options in order to obtain robust Levene’s tests:
• Levene’s test L based on z ( b ) = |x - ~ x | where ~
b x is the median of x ’s for
li il i i il
group i.
Median calculation is done by the method requested. The default method is

HAVERAGE. Once the ~ x i ’s and hence z li( b ) ’s are calculated, apply the
formula for La, shown in the section above, to obtain Lb by replacing zil, z i and
z with z li( b ) , z i( b ) and z ( b ) respectively.
Two significances of Lb are given. One is calculated from a F-distribution

with degrees of freedom k - 1 and W - k. Another is calculated from a F-
distribution with degrees of freedom k - 1 and v. The value of v is given by:
k 2
 


∑
i =1
ui 

v= k
 ui2 


∑v
i =1 i


where
mi
ui = ∑c
l =1
il ( z il( b ) − z i( b ) ) 2
in which
mi
c il z il( b )
z i( b ) = ∑ l =1
wi
and
EXAMINE 17
vi = wi - 1.
• Levene’s test Lc based on z il( c ) = |xil - Ti, 0.9 | where Ti, 0.9 is the 5% trimmed mean
of xil’s for group i.
Once the Ti, 0.9 ’s and hence z il( c ) ’s are calculated, apply the formula of La to
obtain Lc by replacing zil, z i and z with z li( c ) , z i( c ) and z ( c ) respectively.
The significance of Lc is calculated from a F-distribution with degrees of

freedom k - 1 and W - k.
Plots
Normal Probability Plot (NPPLOT)
For every distinct observation yi , Ri is the rank (the mean of ranks is assigned to
ties). The normal score NSi is calculated by
NSi = Ψ −1
FG R IJ
H W + 1K
i
where Ψ −1 is the inverse of the standard normal cumulative distribution function.

b gKb
The NPPLOT is the plot of y1 , NS1 , , ym , NSm . g
Detrended Normal Plot
The detrended normal plot is the scatterplot of y1 , D1 , b g K, b y m, g

Dm , where Di
is the difference between the Z-score and normal score, which is defined by
Di = Zi − NSi
and
18 EXAMINE
yi − y
Zi =
s
where y is the average and s is the standard deviation.
Boxplot
The boundaries of the box are Tukey’s hinges. The length of the box is the
interquartile range based on Tukey’s hinges. That is,
IQR = Q3 − Q1
Define
STEP = 1.5 IQR
A case is an outlier if
Q3 + STEP ≤ yi < Q3 + 2 STEP

or
Q1 − 2 STEP < yi ≤ Q1 − STEP
A case is an extreme if
yi ≥ Q3 + 2 STEP
or
yi ≤ Q1 − 2 STEP
References
Brown, M. B., and Forsythe, A. B. (1974). Robust Tests for the Equality of
Variances. Journal of the American Statistical Association, 69, p364-367.
EXAMINE 19
Dallal, G. E., and Wilkinson, L. 1986. An analytic approximation to the

distribution of Lilliefor’s test statistic for normality. The American Statistician,
40(4): 294–296 (Correction: 41: 248).
Frigge, M., Hoaglin, D. C., and Iglewicz, B. 1987. Some implementations for the
boxplot. In: Computer Science and Statistics Proceedings of the 19th
Symposium on the Interface, R. M. Heiberger and M. Martin, eds. Alexandria,
Va.: American Statistical Association.
Glaser, R. E. (1983). Levene’s Robust Test of Homogeneity of Variances.

Encyclopedia of Statistical Sciences 4. NY: Wiley, p608-610.
Hoaglin, D. C., Mosteller, F., and Tukey, J. W. 1983. Understanding robust and
exploratory data analysis. New York: John Wiley & Sons, Inc.
Hoaglin, D. C., Mosteller, F., and Tukey, J. W. 1985. Exploring data tables,
trends, and shapes. New York: John Wiley & Sons, Inc.
Lilliefors, H. W. 1967. On the Kolmogorov-Smirnov tests for normality with mean

and variance unknown. Journal of the American Statistical Association, 62:
399–402.
Loh, W. Y. (1987). Some Modifications of Levene’s Test of Variance

Homogeneity. Journal of Statistical Computation and Simulation, 28, p213-
226.
Tukey, J. W. 1977. Exploratory data analysis. Reading, Mass.: Addison-Wesley.
Velleman, P. F., and Hoaglin, D. C. 1981. Applications, basics, and computing of

exploratory data analysis. Boston: Duxbury Press.

Examine

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Examine

Uploaded by

Copyright:

Available Formats

EXAMINE

Minimum and Maximum

Confidence Interval for the Mean

lower bound = y − tα / 2,W −1SE

where SE is the standard error.

Interquartile Range (IQR)

5% Trimmed Mean 0.9

where k1 and k 2 satisfy the following conditions

cck1 < tc ≤ cck1 +1 , W − cck2 < tc ≤ W − cck2 −1

Note: If k1 + 1 = k 2 , then T0.9 = y k2

cck1 ≤ tc1 < cck1 +1

dtc 2 − cck2 i, g2∗ = tc2 − cck2

Let x be the pth percentile; the five definitions are as follows:

Waverage (Weighted Average at y tc 1 )

T 1 k1 + g1 y k1 +1 if g1∗ < 1 and ck1 +1 < 1

Round (Observation Closest to tc 1 )

If ck1 +1 < 1 , then

Empirical (Empirical Distribution Function)

Haverage (Weighted Average at y tc 2 )

T 2 k2 + g2 y k2 +1 if g2∗ < 1 and ck2 +1 < 1

Aempirical (Empirical Distribution Function with Averaging)

Let Q1 , Q2 , and Q3 be the 25th, 50th, and 75th percentiles. If c∗ ≥ 1 , where

Then for every i, i = 1, 2, 3 , find hi such that

cchi ≤ Li < cchi +1

Q = Sb1 − a g y + ai yhi +1 if ai∗ < 1 and chi +1 < 1

M-Estimation (Robust Location Estimation)

The M-estimator T of location is the solution of

where Ψ is an odd function and s is a measure of the spread.

An alternative form of M-estimation is

∑ c FGH s IJK ω FGH s IJK = 0

After rearranging the above equation, we get

∑ ci yiω FGH i s IJK

∑ ciω FGH i s IJK

Therefore, the algorithm to find M-estimators is defined iteratively by

The algorithm stops when either

Huber (k), k > 0

The default value of k = 1339

Hampel (a, b, c), 0 < a ≤ b ≤ c

By default, a = 1.7, b = 3.4 and c = 8.5.

Andrew’s Wave (c), c > 0

Tukey’s Biweight (c)

Shapiro-Wilk Statistic (W)

The original series y = y1 , l K, y q is expanded to

Then the W statistic is defined as

α = 0.314195 + 0.063336β − 0.010895β 2

Based on the computed W statistic, the significance is calculated by linearly

Kolmogorov-Smirnov Statistic with Lilliefors’ Significance

The Lilliefors significance p is calculated as follows:

the corresponding caseweights oc , K , c t . i1 imi

Spread versus Level

In some situations, the transformations cannot be done. The spread-versus-level

• a is a positive non-integer and at least one of the data is less than 0

Levene Test of Homogeneity of Variances

The significance of La is calculated from the F distribution with degrees of

Robust Levene’s Test of Homogeneity of Variances

Median calculation is done by the method requested. The default method is

Two significances of Lb are given. One is calculated from a F-distribution

The significance of Lc is calculated from a F-distribution with degrees of

where Ψ −1 is the inverse of the standard normal cumulative distribution function.

The detrended normal plot is the scatterplot of y1 , D1 , b g K, b y m, g

where y is the average and s is the standard deviation.

STEP = 1.5 IQR

Q3 + STEP ≤ yi < Q3 + 2 STEP

Dallal, G. E., and Wilkinson, L. 1986. An analytic approximation to the