Univariate Statistics
Notation
The following notation is used throughout this chapter unless otherwise noted:
Let $y_1 < \dots < y_m$ be $m$ distinct ordered observations for the sample and $c_1, \dots, c_m$ be the corresponding caseweights. Then

$$cc_i = \sum_{k=1}^{i} c_k = \text{cumulative frequency up to and including } y_i$$

and

$$W = cc_m = \sum_{k=1}^{m} c_k = \text{total sum of weights.}$$
Descriptive Statistics
Minimum and Maximum
$$\min = y_1, \qquad \max = y_m$$
Range
$$\text{range} = y_m - y_1$$
2 EXAMINE
Mean ($\bar{y}$)
$$\bar{y} = \frac{\sum_{i=1}^{m} c_i y_i}{W}$$
Median
The median is the 50th percentile, which is calculated by the method requested.
The default method is HAVERAGE.
IQR = 75th percentile − 25th percentile, where the 75th and 25th percentiles are
calculated by the method requested for percentiles.
Variance ($s^2$)
$$s^2 = \frac{1}{W-1} \sum_{i=1}^{m} c_i (y_i - \bar{y})^2$$
Standard Deviation
$$s = \sqrt{s^2}$$
Standard Error
$$SE = \frac{s}{\sqrt{W}}$$
Skewness ($g_1$) and SE of Skewness
$$g_1 = \frac{W M_3}{(W-1)(W-2)\,s^3}$$
$$SE(g_1) = \sqrt{\frac{6W(W-1)}{(W-2)(W+1)(W+3)}}$$
where
$$M_3 = \sum_{i=1}^{m} c_i (y_i - \bar{y})^3$$
Kurtosis ($g_2$) and SE of Kurtosis
$$g_2 = \frac{W(W+1)M_4 - 3M_2^2\,(W-1)}{(W-1)(W-2)(W-3)\,s^4}$$
where
$$M_2 = \sum_{i=1}^{m} c_i (y_i - \bar{y})^2$$
$$M_4 = \sum_{i=1}^{m} c_i (y_i - \bar{y})^4$$
and
$$SE(g_2) = \sqrt{\frac{4(W^2-1)\,SE^2(g_1)}{(W-3)(W+5)}}$$
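The weighted descriptive formulas above can be sketched in a few lines of Python. This is a minimal illustration that assumes distinct values `y` with caseweights `c`; the function name and the returned dictionary keys are hypothetical, not part of the procedure.

```python
import math

def weighted_descriptives(y, c):
    """Weighted mean, variance, skewness and kurtosis (illustrative sketch)."""
    W = sum(c)                                        # total sum of weights
    mean = sum(ci * yi for yi, ci in zip(y, c)) / W
    # Weighted central moment sums M2, M3, M4
    M2 = sum(ci * (yi - mean) ** 2 for yi, ci in zip(y, c))
    M3 = sum(ci * (yi - mean) ** 3 for yi, ci in zip(y, c))
    M4 = sum(ci * (yi - mean) ** 4 for yi, ci in zip(y, c))
    var = M2 / (W - 1)
    s = math.sqrt(var)
    g1 = W * M3 / ((W - 1) * (W - 2) * s ** 3)
    se_g1 = math.sqrt(6 * W * (W - 1) / ((W - 2) * (W + 1) * (W + 3)))
    g2 = (W * (W + 1) * M4 - 3 * M2 ** 2 * (W - 1)) / (
        (W - 1) * (W - 2) * (W - 3) * s ** 4)
    se_g2 = math.sqrt(4 * (W ** 2 - 1) * se_g1 ** 2 / ((W - 3) * (W + 5)))
    return {"mean": mean, "variance": var, "sd": s, "se": s / math.sqrt(W),
            "g1": g1, "se_g1": se_g1, "g2": g2, "se_g2": se_g2}

stats = weighted_descriptives([1, 2, 3, 4, 5], [1, 1, 1, 1, 1])
```

The sketch needs $W > 3$ and a nonzero variance; otherwise the skewness and kurtosis denominators vanish.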
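5% Trimmed Mean
$$T_{0.9} = \frac{1}{0.9W} \left\{ (cc_{k_1+1} - tc)\, y_{k_1+1} + (W - cc_{k_2-1} - tc)\, y_{k_2} + \sum_{i=k_1+2}^{k_2-1} c_i y_i \right\}$$
where $k_1$ and $k_2$ index the observations at which the lower and upper trimming cut-offs fall, and
$$tc = 0.05W$$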
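One way to read the 5% trimmed mean $T_{0.9}$ is as a weighted mean after removing $tc = 0.05W$ units of caseweight from each end of the ordered sample. The sketch below implements that reading directly instead of locating the boundary indices; the function name and the `trim` parameter are assumptions made for illustration.

```python
def trimmed_mean(y, c, trim=0.05):
    """Trimmed mean for ordered distinct values y with caseweights c (sketch)."""
    W = sum(c)
    tc = trim * W
    kept = list(c)                      # weight retained by each observation
    for order in (range(len(kept)), reversed(range(len(kept)))):
        remaining = tc                  # weight still to be removed from this end
        for i in order:
            cut = min(kept[i], remaining)
            kept[i] -= cut
            remaining -= cut
            if remaining <= 0:
                break
    return sum(k * yi for k, yi in zip(kept, y)) / ((1 - 2 * trim) * W)

t = trimmed_mean(list(range(1, 21)), [1] * 20)   # drops the values 1 and 20
```

Partially trimmed boundary observations keep the weights $cc_{k_1+1} - tc$ and $W - cc_{k_2-1} - tc$, which matches the first two terms of the formula.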
Percentiles
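There are five methods for computation of percentiles. Let

$$tc_1 = Wp, \qquad tc_2 = (W+1)p$$

where $p$ is the requested percentile divided by 100, and $k_1$ and $k_2$ satisfy

$$cc_{k_1} \le tc_1 < cc_{k_1+1}, \qquad cc_{k_2} \le tc_2 < cc_{k_2+1}$$

Then,

$$g_1^* = tc_1 - cc_{k_1}, \qquad g_1 = \frac{g_1^*}{c_{k_1+1}}$$

and $g_2^*$ and $g_2$ are defined in the same way from $tc_2$ and $k_2$. Let $x$ be the requested percentile.

Weighted Average at $tc_1$ (WAVERAGE)
$$x = \begin{cases} y_{k_1+1} & \text{if } g_1^* \ge 1 \\ (1 - g_1^*)\, y_{k_1} + g_1^*\, y_{k_1+1} & \text{if } g_1^* < 1 \text{ and } c_{k_1+1} \ge 1 \\ (1 - g_1)\, y_{k_1} + g_1\, y_{k_1+1} & \text{if } g_1^* < 1 \text{ and } c_{k_1+1} < 1 \end{cases}$$

Observation Closest to $tc_1$ (ROUND)
If $c_{k_1+1} \ge 1$, then
$$x = \begin{cases} y_{k_1} & \text{if } g_1^* < \tfrac{1}{2} \\ y_{k_1+1} & \text{if } g_1^* \ge \tfrac{1}{2} \end{cases}$$
Otherwise
$$x = \begin{cases} y_{k_1} & \text{if } g_1 < \tfrac{1}{2} \\ y_{k_1+1} & \text{if } g_1 \ge \tfrac{1}{2} \end{cases}$$

Empirical Distribution Function (EMPIRICAL)
$$x = \begin{cases} y_{k_1} & \text{if } g_1^* = 0 \\ y_{k_1+1} & \text{if } g_1^* > 0 \end{cases}$$

Weighted Average at $tc_2$ (HAVERAGE)
$$x = \begin{cases} y_{k_2+1} & \text{if } g_2^* \ge 1 \\ (1 - g_2^*)\, y_{k_2} + g_2^*\, y_{k_2+1} & \text{if } g_2^* < 1 \text{ and } c_{k_2+1} \ge 1 \\ (1 - g_2)\, y_{k_2} + g_2\, y_{k_2+1} & \text{if } g_2^* < 1 \text{ and } c_{k_2+1} < 1 \end{cases}$$

Empirical Distribution Function with Averaging (AEMPIRICAL)
$$x = \begin{cases} (y_{k_1} + y_{k_1+1})/2 & \text{if } g_1^* = 0 \\ y_{k_1+1} & \text{if } g_1^* > 0 \end{cases}$$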
Note: If any of the 25th, 50th, or 75th percentiles is requested, Tukey hinges will also be printed.
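As an illustration of the default method, the sketch below computes a HAVERAGE percentile from $tc_2 = (W+1)p$. It assumes $k_2$ is the index with $cc_{k_2} \le tc_2 < cc_{k_2+1}$, and it returns the smallest or largest observation when $tc_2$ falls outside the range of cumulative weights; that boundary handling and the function name are assumptions for the example.

```python
from bisect import bisect_right
from itertools import accumulate

def haverage(y, c, p):
    """HAVERAGE percentile for ordered distinct y with caseweights c (sketch)."""
    cc = list(accumulate(c))            # cumulative caseweights cc_1..cc_m
    W = cc[-1]
    tc2 = (W + 1) * p
    if tc2 < cc[0]:                     # assumed boundary handling
        return y[0]
    if tc2 >= cc[-1]:                   # assumed boundary handling
        return y[-1]
    j = bisect_right(cc, tc2) - 1       # 0-based: cc[j] <= tc2 < cc[j + 1]
    g_star = tc2 - cc[j]
    c_next = c[j + 1]
    if g_star >= 1:
        return y[j + 1]
    if c_next >= 1:
        return (1 - g_star) * y[j] + g_star * y[j + 1]
    g = g_star / c_next
    return (1 - g) * y[j] + g * y[j + 1]

median = haverage([1, 2, 3, 4], [1, 1, 1, 1], 0.5)
```

With unit caseweights this reduces to linear interpolation at position $(W+1)p$ in the ordered sample.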
Tukey Hinges
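$$d = \frac{\text{greatest integer} \le (W+3)/2}{2}$$
$$L_1 = d$$
$$L_2 = W/2 + 1/2$$
$$L_3 = W + 1 - d$$
Otherwise
$$d = \frac{\text{greatest integer} \le (W/c^* + 3)/2}{2}$$
and
$$L_1 = d c^*$$
$$L_2 = W/2 + c^*/2$$
$$L_3 = W + c^* - d c^*$$
and, for $i = 1, 2, 3$, with $h_i$ the index that locates $L_i$ among the cumulative weights,
$$Q_i = \begin{cases} (1 - a_i^*)\, y_{h_i} + a_i^*\, y_{h_i+1} & \text{if } a_i^* < 1 \text{ and } c_{h_i+1} \ge 1 \\ (1 - a_i)\, y_{h_i} + a_i\, y_{h_i+1} & \text{if } a_i^* < 1 \text{ and } c_{h_i+1} < 1 \\ y_{h_i+1} & \text{if } a_i^* \ge 1 \end{cases}$$
where
$$a_i^* = L_i - cc_{h_i}$$
$$a_i = \frac{a_i^*}{c_{h_i+1}}$$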
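M-Estimation
The M-estimator $T$ of location is the solution of

$$\sum_{i=1}^{m} c_i\, \Psi\!\left(\frac{y_i - T}{s}\right) = 0$$

where

$$\omega(u) = \frac{\Psi(u)}{u}$$

Since $\Psi(u) = u\,\omega(u)$, the equation can be rearranged as

$$T = \frac{\sum_{i=1}^{m} c_i y_i\, \omega\!\left(\frac{y_i - T}{s}\right)}{\sum_{i=1}^{m} c_i\, \omega\!\left(\frac{y_i - T}{s}\right)}$$

which defines the iteration

$$T_{k+1} = \frac{\sum_{i=1}^{m} c_i y_i\, \omega\!\left(\frac{y_i - T_k}{s}\right)}{\sum_{i=1}^{m} c_i\, \omega\!\left(\frac{y_i - T_k}{s}\right)}$$

The iteration stops when either

$$|T_{k+1} - T_k| \le \varepsilon\, (T_{k+1} + T_k)/2, \quad \text{where } \varepsilon = 0.005$$

or the number of iterations exceeds 30.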
M-Estimators
Four M-estimators (Huber, Hampel, Andrew, and Tukey) are available. Let
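$$u_i = \frac{y_i - T}{s}$$
where
$$s = \text{median of } \tilde{y}_1, \dots, \tilde{y}_m \text{ with caseweights } c_1, \dots, c_m$$
and
$$\tilde{y}_i = |y_i - \tilde{y}|, \text{ where } \tilde{y} \text{ is the median.}$$

Huber
$$\omega(u_i) = \begin{cases} 1 & \text{if } |u_i| \le k \\ \dfrac{k}{u_i}\,\mathrm{sgn}(u_i) & \text{if } |u_i| > k \end{cases}$$

Hampel
$$\omega(u_i) = \begin{cases} 1 & \text{if } |u_i| \le a \\ \dfrac{a}{u_i}\,\mathrm{sgn}(u_i) & \text{if } a < |u_i| \le b \\ \dfrac{a}{u_i} \cdot \dfrac{c - |u_i|}{c - b}\,\mathrm{sgn}(u_i) & \text{if } b < |u_i| \le c \\ 0 & \text{if } |u_i| > c \end{cases}$$

Andrew
$$\omega(u_i) = \begin{cases} \dfrac{c}{\pi u_i} \sin\!\left(\dfrac{\pi u_i}{c}\right) & \text{if } |u_i| \le c \\ 0 & \text{if } |u_i| > c \end{cases}$$
By default, $c = 1.34\pi$.

Tukey
$$\omega(u_i) = \begin{cases} \left(1 - \dfrac{u_i^2}{c^2}\right)^2 & \text{if } |u_i| \le c \\ 0 & \text{if } |u_i| > c \end{cases}$$
By default, $c = 4.685$.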
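The iteratively reweighted scheme for M-estimators can be sketched as follows. The example makes several assumptions that the text does not state: caseweights are integers (so the weighted medians can be found by expanding the sample), the iteration starts from the median, and the Huber constant `k=1.339` is used as an illustrative default. All function names are hypothetical.

```python
import statistics

def huber_weight(u, k=1.339):           # k: assumed default for illustration
    return 1.0 if abs(u) <= k else k / abs(u)

def tukey_weight(u, c=4.685):
    return (1 - (u / c) ** 2) ** 2 if abs(u) <= c else 0.0

def m_estimate(y, c, weight, eps=0.005, max_iter=30):
    """Iteratively reweighted location estimate (sketch, integer caseweights)."""
    expanded = [yi for yi, ci in zip(y, c) for _ in range(ci)]
    med = statistics.median(expanded)
    # Spread s: median of absolute deviations from the median
    s = statistics.median([abs(v - med) for v in expanded])
    T = med                              # assumed starting value
    for _ in range(max_iter):
        w = [ci * weight((yi - T) / s) for yi, ci in zip(y, c)]
        T_new = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
        if abs(T_new - T) <= eps * (T_new + T) / 2:
            return T_new
        T = T_new
    return T

huber = m_estimate([1, 2, 3, 4, 100], [1, 1, 1, 1, 1], huber_weight)
tukey = m_estimate([1, 2, 3, 4, 100], [1, 1, 1, 1, 1], tukey_weight)
```

Both estimates stay near the bulk of the data despite the value 100, whereas the ordinary mean is 22. The sketch does not guard against a zero spread or a zero weight sum.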
Tests of Normality
Since the W statistic is based on the order statistics of the sample, the caseweights
have to be restricted to integers. Hence, before W is calculated, all the caseweights
are rounded to the closest integer and the series is expanded. Let $c_i^*$ be the closest integer to $c_i$; then

$$cc_i^* = \sum_{k=1}^{i} c_k^*, \qquad W_s = cc_m^* = \sum_{k=1}^{m} c_k^*$$

and the expanded series is

$$x = \{x_1, \dots, x_{W_s}\}$$

where

$$x_{cc_{i-1}^* + 1} = \dots = x_{cc_i^*} = y_i, \qquad i = 1, \dots, m$$
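$$W = \frac{\left(\sum_{i=1}^{W_s} a_i x_i\right)^2}{\sum_{i=1}^{W_s} (x_i - \bar{x})^2}$$
where
$$\bar{x} = \frac{\sum_{i=1}^{W_s} x_i}{W_s}$$
$$a_1^2 = a_{W_s}^2 = \begin{cases} \dfrac{\Gamma(W_s/2)}{\sqrt{2}\,\Gamma\big((W_s+1)/2\big)} & \text{if } 5 \le W_s \le 20 \\ \dfrac{\Gamma\big((W_s+1)/2\big)}{\sqrt{2}\,\Gamma(W_s/2 + 1)} & \text{if } W_s > 20 \end{cases}$$
$$a_{W_s} = \sqrt{a_{W_s}^2}, \qquad a_1 = -\sqrt{a_1^2}$$
$$a_i = (2/c)\, m_i, \qquad i = 2, \dots, W_s - 1$$
$$m_i = \Psi^{-1}\!\left(\frac{i - \alpha}{W_s - 2\alpha + 1}\right), \text{ where } \Psi \text{ is the c.d.f. of a standard normal distribution}$$
$$\beta = \log_{10} W_s$$
$$c^2 = \frac{4 \sum_{i=2}^{W_s - 1} m_i^2}{1 - 2a_1^2}$$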
Lilliefors (1967) presented a table for testing normality using the Kolmogorov-Smirnov statistic when the mean and variance of the population are unknown. This statistic is

$$D_a = \max\{D^+, D^-\}$$

where

$$D^+ = \max_i \{\hat{F}(y_i) - F(y_i)\}$$
$$D^- = \max_i \{F(y_i) - \hat{F}(y_{i-1})\}$$

where $\hat{F}(x)$ is the sample cumulative distribution and $F(x)$ is the cumulative normal distribution whose mean and variance are estimated from the sample.
Dallal and Wilkinson (1986) corrected the critical values for testing normality
reported by Lilliefors. With the corrected table they derived an analytic
approximation to the upper tail probabilities of Da for probabilities less than 0.1.
Note: This algorithm applies to SPSS 7.0 and later releases.
The following formula is used to estimate the critical value $D_c$ for probability 0.1.

$$D_c = \frac{-b - \sqrt{b^2 - 4ac}}{2a}$$

where, if $W \le 100$,

$$a = -7.01256\,(W + 2.78019)$$
$$b = 2.99587\sqrt{W + 2.78019}$$
$$c = 2.1804661 + \frac{0.974598}{\sqrt{W}} + \frac{1.67997}{W}$$

If $W > 100$,

$$a = -7.90289126054\, W^{0.98}$$
$$b = 3.180370175721\, W^{0.49}$$
$$c = 2.2947256$$

If $D_a > D_c$, $p = \exp\{aD_a^2 + bD_a + c - 2.3025851\}$.
If $D_{0.2} \le D_a < D_c$, $p$ is obtained by linear interpolation between $D_{0.2}$ and $D_c$, where $D_{0.2}$ is the critical value for probability 0.2.
If $D_a < D_{0.2}$, $p$ is reported as $> 0.2$.
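The critical value and the upper-tail probability can be checked numerically. The sketch below evaluates the coefficients for a given $W$, solves the quadratic for $D_c$, and computes $p$ for a statistic above $D_c$; it reads the coefficient formulas with square roots in $b$ and $c$ for $W \le 100$, and the function names are hypothetical. It does not cover the interpolation range, because the formula for $D_{0.2}$ is not given here.

```python
import math

def dw_coefficients(W):
    """Coefficients a, b, c of the tail approximation for sample size W."""
    if W <= 100:
        a = -7.01256 * (W + 2.78019)
        b = 2.99587 * math.sqrt(W + 2.78019)
        c = 2.1804661 + 0.974598 / math.sqrt(W) + 1.67997 / W
    else:
        a = -7.90289126054 * W ** 0.98
        b = 3.180370175721 * W ** 0.49
        c = 2.2947256
    return a, b, c

def critical_value_10pct(W):
    """Positive root D_c of a*D^2 + b*D + c = 0 (a is negative)."""
    a, b, c = dw_coefficients(W)
    return (-b - math.sqrt(b * b - 4 * a * c)) / (2 * a)

def upper_tail_p(D, W):
    """Approximate p-value, intended for D above the critical value D_c."""
    a, b, c = dw_coefficients(W)
    return math.exp(a * D * D + b * D + c - 2.3025851)

Dc = critical_value_10pct(20)
p_at_Dc = upper_tail_p(Dc, 20)
```

Because $D_c$ is a root of the quadratic, the exponent at $D_a = D_c$ reduces to $-2.3025851 \approx -\ln 10$, so the approximation returns $p = 0.1$ there, consistent with $D_c$ being the critical value for probability 0.1.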
Note: This algorithm applies to SPSS 7.0 and later releases. To learn about algorithms for previous releases, call SPSS Technical Support.
Group Statistics
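Assume that there are $k$ ($k \ge 2$) combinations of grouping factors. For every combination $i$, $i = 1, 2, \dots, k$, let $\{y_{i1}, \dots, y_{im_i}\}$ be the sample observations with the corresponding caseweights $\{c_{i1}, \dots, c_{im_i}\}$.
When a power transformation with exponent $a$ is specified, the computations are done on the transformed data. Let $x$ be the transformed value of $y$; for every $i = 1, \dots, k$, $j = 1, \dots, m_i$,
$$x_{ij} = \begin{cases} \ln y_{ij} & \text{if } a = 0 \\ y_{ij}^{a} & \text{otherwise} \end{cases}$$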
Then the spread $s_i$ and the level $l_i$ are respectively defined as the Interquartile Range and the median of $\{x_{i1}, \dots, x_{im_i}\}$ with corresponding caseweights $\{c_{i1}, \dots, c_{im_i}\}$. However, if $a$ is not specified, the spread and the level are natural logarithms of the Interquartile Range and of the median of the original data.
Finally, the slope is the regression coefficient of s on l, which is defined as
$$\frac{\sum_{i=1}^{k} (l_i - \bar{l})(s_i - \bar{s})}{\sum_{i=1}^{k} (l_i - \bar{l})^2}$$
The transformation cannot be computed when:
• $a$ is a negative non-integer and at least one data value is less than or equal to 0
• $a$ is not specified and the median or the spread is less than or equal to 0
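$$L_a = \frac{(W - k) \sum_{i=1}^{k} w_i (\bar{z}_i - \bar{z})^2}{(k - 1) \sum_{i=1}^{k} \sum_{l=1}^{m_i} c_{il} (z_{il} - \bar{z}_i)^2}$$
where
$$w_i = \sum_{l=1}^{m_i} c_{il}$$
$$\bar{x}_i = \frac{\sum_{l=1}^{m_i} c_{il} x_{il}}{w_i}$$
$$z_{il} = |x_{il} - \bar{x}_i|$$
$$\bar{z}_i = \frac{\sum_{l=1}^{m_i} c_{il} z_{il}}{w_i}$$
$$\bar{z} = \frac{\sum_{i=1}^{k} w_i \bar{z}_i}{W}$$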
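The statistic $L_a$ can be sketched directly from these definitions, taking $z_{il}$ as the absolute deviation from the weighted group mean. The input layout (a list of `(values, caseweights)` pairs) and the function name are assumptions for the example.

```python
def levene_mean(groups):
    """Levene statistic L_a from a list of (values, caseweights) pairs (sketch)."""
    k = len(groups)
    W = sum(sum(c) for _, c in groups)
    w, zbar_i, within = [], [], 0.0
    for x, c in groups:
        wi = sum(c)
        xbar = sum(ci * xi for xi, ci in zip(x, c)) / wi
        z = [abs(xi - xbar) for xi in x]            # absolute deviations z_il
        zb = sum(ci * zi for zi, ci in zip(z, c)) / wi
        within += sum(ci * (zi - zb) ** 2 for zi, ci in zip(z, c))
        w.append(wi)
        zbar_i.append(zb)
    zbar = sum(wi * zb for wi, zb in zip(w, zbar_i)) / W
    between = sum(wi * (zb - zbar) ** 2 for wi, zb in zip(w, zbar_i))
    return (W - k) * between / ((k - 1) * within)

L_a = levene_mean([([1, 2, 3], [1, 1, 1]), ([0, 2, 4], [1, 1, 1])])
```

Groups with identical spread give a statistic of 0, because every group then has the same mean absolute deviation.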
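• Levene's test $L_b$ based on $z_{il}^{(b)} = |x_{il} - \tilde{x}_i|$, where $\tilde{x}_i$ is the median of the $x_{il}$'s for group $i$.
The degrees-of-freedom adjustment for $L_b$ is
$$v = \frac{\left(\sum_{i=1}^{k} u_i\right)^2}{\sum_{i=1}^{k} \dfrac{u_i^2}{v_i}}$$
where
$$u_i = \sum_{l=1}^{m_i} c_{il} \left(z_{il}^{(b)} - \bar{z}_i^{(b)}\right)^2$$
in which
$$\bar{z}_i^{(b)} = \frac{\sum_{l=1}^{m_i} c_{il} z_{il}^{(b)}}{w_i}$$
and
$$v_i = w_i - 1.$$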
• Levene's test $L_c$ based on $z_{il}^{(c)} = |x_{il} - T_{i,0.9}|$, where $T_{i,0.9}$ is the 5% trimmed mean of the $x_{il}$'s for group $i$.
Once the $T_{i,0.9}$'s and hence the $z_{il}^{(c)}$'s are calculated, apply the formula of $L_a$ to obtain $L_c$ by replacing $z_{il}$, $\bar{z}_i$ and $\bar{z}$ with $z_{il}^{(c)}$, $\bar{z}_i^{(c)}$ and $\bar{z}^{(c)}$ respectively.
Plots
Normal Probability Plot (NPPLOT)
For every distinct observation yi , Ri is the rank (the mean of ranks is assigned to
ties). The normal score NSi is calculated by
$$NS_i = \Psi^{-1}\!\left(\frac{R_i}{W + 1}\right)$$
$$D_i = Z_i - NS_i$$
and
$$Z_i = \frac{y_i - \bar{y}}{s}$$
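A small sketch of the normal scores $NS_i$ and the differences $D_i$ follows. It assumes integer caseweights, so that the $c_i$ tied cases at $y_i$ occupy ranks $cc_{i-1}+1, \dots, cc_i$ and receive the mean rank $cc_{i-1} + (c_i + 1)/2$; that rank rule and the function name are assumptions made for the example.

```python
import math
from itertools import accumulate
from statistics import NormalDist

def normal_scores(y, c):
    """Return (NS_i, D_i) pairs for ordered distinct y with caseweights c."""
    cc = list(accumulate(c))
    W = cc[-1]
    mean = sum(ci * yi for yi, ci in zip(y, c)) / W
    s = math.sqrt(sum(ci * (yi - mean) ** 2 for yi, ci in zip(y, c)) / (W - 1))
    result, prev = [], 0
    for yi, ci, cci in zip(y, c, cc):
        rank = prev + (ci + 1) / 2                   # mean rank of the tied cases
        ns = NormalDist().inv_cdf(rank / (W + 1))    # inverse standard normal c.d.f.
        z = (yi - mean) / s
        result.append((ns, z - ns))
        prev = cci
    return result

scores = normal_scores([1, 2, 3], [1, 1, 1])
```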
Boxplot
The boundaries of the box are Tukey’s hinges. The length of the box is the
interquartile range based on Tukey’s hinges. That is,
IQR = Q3 − Q1
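Define
$$\text{STEP} = 1.5\ \text{IQR}$$
A case is an outlier if
$$Q_3 + \text{STEP} \le y_i < Q_3 + 2\ \text{STEP}$$
or
$$Q_1 - 2\ \text{STEP} < y_i \le Q_1 - \text{STEP}$$
A case is an extreme if
$$y_i \ge Q_3 + 2\ \text{STEP}$$
or
$$y_i \le Q_1 - 2\ \text{STEP}$$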
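The boxplot rules can be sketched as a small classifier. The example assumes STEP is 1.5 times the interquartile range of the hinges and treats outliers as the cases between one and two STEPs beyond the hinges; the `step_factor` parameter, the labels, and the function name are assumptions for illustration.

```python
def classify_case(y, q1, q3, step_factor=1.5):
    """Label a value relative to the hinges q1 and q3 (illustrative sketch)."""
    step = step_factor * (q3 - q1)       # assumed STEP = 1.5 * IQR
    if y >= q3 + 2 * step or y <= q1 - 2 * step:
        return "extreme"
    if y >= q3 + step or y <= q1 - step:
        return "outlier"
    return "inside"

labels = [classify_case(v, 10, 20) for v in (15, 36, 50, -5, -20)]
```

The extreme test runs first, so a value that satisfies both conditions is reported as an extreme.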
References
Brown, M. B., and Forsythe, A. B. 1974. Robust tests for the equality of variances. Journal of the American Statistical Association, 69: 364-367.
Frigge, M., Hoaglin, D. C., and Iglewicz, B. 1987. Some implementations for the
boxplot. In: Computer Science and Statistics Proceedings of the 19th
Symposium on the Interface, R. M. Heiberger and M. Martin, eds. Alexandria,
Va.: American Statistical Association.
Hoaglin, D. C., Mosteller, F., and Tukey, J. W. 1983. Understanding robust and
exploratory data analysis. New York: John Wiley & Sons, Inc.
Hoaglin, D. C., Mosteller, F., and Tukey, J. W. 1985. Exploring data tables,
trends, and shapes. New York: John Wiley & Sons, Inc.