
Numeric Summaries and Descriptive Statistics
populations vs. samples
• we want to describe both samples and
populations
• the latter is a matter of inference…
“outliers”
• minority cases, so different from the majority
that they merit separate consideration
– are they errors?
– are they indicative of a different pattern?
• think about possible outliers with care, but
beware of mechanical treatments…
• significance of outliers depends on your
research interests
summaries of distributions
• graphic vs. numeric
– graphic may be better for visualization
– numeric are better for statistical/inferential
purposes
• resistance to outliers is usually an advantage
in either case
general characteristics
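• kurtosis [“peakedness”]
– ‘leptokurtic’: sharply peaked
– ‘platykurtic’: flat-topped
[figure: density (D) plotted against X, from -5 to 5, for a leptokurtic and a platykurtic distribution]
• skew (skewness)
– right (positive) skew: long tail toward high values
– left (negative) skew: long tail toward low values
[figure: density curves for a right-skewed and a left-skewed distribution]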
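The slides give no formulas for skew or kurtosis, so the sketch below assumes the standard moment-based definitions. The helper names `skewness` and `excess_kurtosis` are hypothetical (they are not base R functions), and the data are the unit 1 rim diameters tabulated later in these slides.

```r
# Rim diameters (cm) for unit 1, from the data table in these slides
unit1 <- c(12.6, 11.6, 16.3, 13.1, 12.1, 26.9, 9.7, 11.5, 14.8, 13.5, 12.4, 13.6)

# Hypothetical helper: moment-based skewness (0 for a symmetric batch)
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Hypothetical helper: moment-based excess kurtosis (0 for a normal distribution)
excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

skew_all  <- skewness(unit1)               # all 12 rims, including the 26.9 cm one
skew_trim <- skewness(unit1[unit1 < 20])   # same batch without the 26.9 cm rim
kurt_all  <- excess_kurtosis(unit1)

print(c(skew_all = skew_all, skew_trim = skew_trim, kurt_all = kurt_all))
```

A single very large rim pulls the skewness strongly positive; without it the batch is much closer to symmetric.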
central tendency
• measures of central tendency
– provide a sense of the value expressed by
multiple cases, over all…

• mean
• median
• mode
mean
• center of gravity
• evenly partitions the sum of all
measurements among all cases; the average of
all measurements
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
mean – pro and con
• crucial for inferential statistics

• mean is not very resistant to outliers

• a “trimmed mean” may be better for
descriptive purposes
mean
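rim diameter (cm)

          unit 1   unit 2
          12.6     16.2
          11.6     16.4
          16.3     13.8
          13.1     13.2
          12.1     11.3
          26.9     14.0
          9.7      9.0
          11.5     12.5
          14.8     15.6
          13.5     11.2
          12.4     12.2
          13.6     15.5
                   11.7
n         12       13
total     168.1    172.6
total/n   14.0     13.3

back-to-back stem-and-leaf plot (stem = whole cm, leaf = tenths; == marks each mean):

      unit 1 | stem | unit 2
           9 |  26  |
             |  25  |
             |  24  |
             |  23  |
             |  22  |
             |  21  |
             |  20  |
             |  19  |
             |  18  |
             |  17  |
           3 |  16  | 24
             |  15  | 56
  14.0 ==  8 |  14  | 0
         651 |  13  | 28  == 13.3
         641 |  12  | 25
          65 |  11  | 237
             |  10  |
           7 |   9  | 0
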
R: mean(x)
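
As a check on the table, the totals and means can be reproduced in R. This is a minimal sketch; the vector names `unit1` and `unit2` are just labels chosen here for the two columns of rim diameters.

```r
# Rim diameters (cm), copied from the table above
unit1 <- c(12.6, 11.6, 16.3, 13.1, 12.1, 26.9, 9.7, 11.5, 14.8, 13.5, 12.4, 13.6)
unit2 <- c(16.2, 16.4, 13.8, 13.2, 11.3, 14.0, 9.0, 12.5, 15.6, 11.2, 12.2, 15.5, 11.7)

# mean(x) is the total divided by the number of cases
total1 <- sum(unit1); n1 <- length(unit1)
total2 <- sum(unit2); n2 <- length(unit2)

mean1 <- mean(unit1)   # identical to total1 / n1
mean2 <- mean(unit2)   # identical to total2 / n2

print(round(c(unit1 = mean1, unit2 = mean2), 1))   # 14.0 and 13.3, as in the table
```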
trimmed mean
rim diameter (cm), sorted; the lowest and highest case in each unit are trimmed

          unit 1   unit 2
          9.7      9.0
          11.5     11.2
          11.6     11.3
          12.1     11.7
          12.4     12.2
          12.6     12.5
          13.1     13.2
          13.5     13.8
          13.6     14.0
          14.8     15.5
          16.3     15.6
          26.9     16.2
                   16.4
n         10       11
total     131.5    147.2
total/n   13.2     13.4

[figure: back-to-back stem-and-leaf plot of both units, with the trimmed means marked (unit 1 = 13.2, unit 2 = 13.4)]

R: mean(x, trim=.1)
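
The trimming can also be done by hand to see what `trim=.1` does to these batches. The sketch below assumes the same two vectors of rim diameters; `trim_by_hand` is a hypothetical helper written for illustration, not part of R.

```r
unit1 <- c(12.6, 11.6, 16.3, 13.1, 12.1, 26.9, 9.7, 11.5, 14.8, 13.5, 12.4, 13.6)
unit2 <- c(16.2, 16.4, 13.8, 13.2, 11.3, 14.0, 9.0, 12.5, 15.6, 11.2, 12.2, 15.5, 11.7)

# Hypothetical helper: drop k cases from each tail, then average the rest
trim_by_hand <- function(x, k) {
  s <- sort(x)
  mean(s[(k + 1):(length(s) - k)])
}

# With trim = 0.1, R drops floor(n * 0.1) cases from each end:
# one case per tail for both n = 12 and n = 13
tm1 <- mean(unit1, trim = 0.1)
tm2 <- mean(unit2, trim = 0.1)

print(c(tm1, trim_by_hand(unit1, 1)))   # the two values agree
print(c(tm2, trim_by_hand(unit2, 1)))
```

Trimming moves the unit 1 mean down from 14.0 toward the bulk of the batch, while unit 2 barely changes.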
median
• 50th percentile…
• less useful for inferential purposes
• more resistant to effects of outliers…
median
rim diameter (cm), sorted

          unit 1   unit 2
          9.7      9.0
          11.5     11.2
          11.6     11.3
          12.1     11.7
          12.4     12.2
          12.6     12.5
          13.1     13.2
          13.5     13.8
          13.6     14.0
          14.8     15.5
          16.3     15.6
          26.9     16.2
                   16.4
median    12.85    13.20

(unit 1: midpoint of 12.6 and 13.1; unit 2: the 7th of 13 cases)

[figure: back-to-back stem-and-leaf plot of both units, with the medians marked (unit 1 = 12.85, unit 2 = 13.20)]
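
A short R sketch of the same calculation, using base R's `median()`. It also drops the 26.9 cm rim from unit 1 to illustrate how little the median moves compared with the mean; the vector names are labels chosen here.

```r
unit1 <- c(12.6, 11.6, 16.3, 13.1, 12.1, 26.9, 9.7, 11.5, 14.8, 13.5, 12.4, 13.6)
unit2 <- c(16.2, 16.4, 13.8, 13.2, 11.3, 14.0, 9.0, 12.5, 15.6, 11.2, 12.2, 15.5, 11.7)

med1 <- median(unit1)   # even n: average of the two middle cases
med2 <- median(unit2)   # odd n: the middle case

# Effect of removing the largest unit 1 rim on each measure
no_outlier   <- unit1[unit1 != max(unit1)]
shift_median <- med1 - median(no_outlier)
shift_mean   <- mean(unit1) - mean(no_outlier)

print(c(med1 = med1, med2 = med2))
print(c(shift_median = shift_median, shift_mean = shift_mean))
```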
mode
• the most numerous category
• for ratio data, often implies that data have
been grouped in some way
• can be more or less created by the grouping
procedure
• for theoretical distributions, simply the
location of the peak of the frequency
distribution
[figure: bar chart of settlement classes: isolated scatters, hamlets, villages, regional centers]
modal class = ‘hamlets’
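
To illustrate how grouping can more or less create the mode, the sketch below bins the unit 1 rim diameters (from the tables earlier in these slides) into 2 cm classes twice, changing only where the classes start. `modal_class` is a hypothetical helper, and the class boundaries are arbitrary choices made for the example.

```r
unit1 <- c(12.6, 11.6, 16.3, 13.1, 12.1, 26.9, 9.7, 11.5, 14.8, 13.5, 12.4, 13.6)

# Hypothetical helper: label(s) of the most numerous class for a given set of breaks
modal_class <- function(x, breaks) {
  counts <- table(cut(x, breaks = breaks, right = FALSE))
  names(counts)[counts == max(counts)]
}

mode_a <- modal_class(unit1, seq(9, 27, by = 2))   # classes [9,11), [11,13), ...
mode_b <- modal_class(unit1, seq(8, 28, by = 2))   # classes [8,10), [10,12), ...

print(mode_a)   # "[11,13)"
print(mode_b)   # "[12,14)"
```

The same twelve measurements give a different modal class depending only on where the class boundaries fall.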
dispersion
• measures of dispersion
– summarize degree of clustering of cases, esp.
with respect to central tendency…

• range
• variance
• standard deviation
range
rim diameter (cm), sorted

          unit 1   unit 2
          9.7      9.0
          11.5     11.2
          11.6     11.3
          12.1     11.7
          12.4     12.2
          12.6     12.5
          13.1     13.2
          13.5     13.8
          13.6     14.0
          14.8     15.5
          16.3     15.6
          26.9     16.2
                   16.4
min       9.7      9.0
max       26.9     16.4

[figure: back-to-back stem-and-leaf plot of both units, with a bar beside each unit spanning its lowest to highest case]

• would be better to use midspread (interquartile range)…

R: range(x)
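
A minimal sketch comparing the range with the midspread for the two units. It assumes base R's `IQR()` as the midspread; note that `IQR()` uses R's default quantile definition, which can differ slightly from hinge-based midspreads such as those reported by `fivenum()`.

```r
unit1 <- c(12.6, 11.6, 16.3, 13.1, 12.1, 26.9, 9.7, 11.5, 14.8, 13.5, 12.4, 13.6)
unit2 <- c(16.2, 16.4, 13.8, 13.2, 11.3, 14.0, 9.0, 12.5, 15.6, 11.2, 12.2, 15.5, 11.7)

# range(x) returns the minimum and maximum; diff() turns them into a width
width1 <- diff(range(unit1))   # 26.9 - 9.7
width2 <- diff(range(unit2))   # 16.4 - 9.0

# Midspread: width of the middle half of the batch
mid1 <- IQR(unit1)
mid2 <- IQR(unit2)

print(c(width1 = width1, width2 = width2))
print(c(mid1 = mid1, mid2 = mid2))
```

The range of unit 1 is set entirely by its two most extreme rims, which is why the midspread is the more resistant summary.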


variance
• analogous to average deviation of cases from
mean
• in fact, based on sum of squared deviations from
the mean: “sum-of-squares”

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

R: var(x)
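
The definition can be checked directly against `var(x)`. A minimal sketch, using the unit 1 rim diameters from the tables earlier in these slides:

```r
unit1 <- c(12.6, 11.6, 16.3, 13.1, 12.1, 26.9, 9.7, 11.5, 14.8, 13.5, 12.4, 13.6)

n          <- length(unit1)
deviations <- unit1 - mean(unit1)   # x_i minus the mean
sum_sq     <- sum(deviations^2)     # the "sum-of-squares"
s2         <- sum_sq / (n - 1)      # divide by n - 1, not n

print(c(by_hand = s2, builtin = var(unit1)))   # the two values agree
```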
variance
• computational form:

$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n - 1}$$
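
The computational form needs only two running totals, the sum of the values and the sum of their squares. The sketch below confirms that it gives the same result as `var(x)` for the unit 2 rim diameters from the earlier tables.

```r
unit2 <- c(16.2, 16.4, 13.8, 13.2, 11.3, 14.0, 9.0, 12.5, 15.6, 11.2, 12.2, 15.5, 11.7)

n      <- length(unit2)
sum_x  <- sum(unit2)     # sum of the values
sum_x2 <- sum(unit2^2)   # sum of the squared values

s2_comp <- (sum_x2 - sum_x^2 / n) / (n - 1)

print(c(computational = s2_comp, builtin = var(unit2)))
```

This form is convenient for hand calculation, but it subtracts two large, nearly equal numbers, so on a computer it can lose precision when the mean is large relative to the spread.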
• note: units of variance are squared…
• this makes variance hard to interpret
• ex.: projectile point sample:
mean = 22.6 mm
variance = 38 mm²
• what does this mean???
standard deviation
• square root of variance:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

$$s = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n - 1}}$$
standard deviation
• units are in same units as base measurements

• ex.: projectile point sample:
mean = 22.6 mm
standard deviation = 6.2 mm
• mean ± sd (16.4 to 28.8 mm)
– should give at least some intuitive sense of where most
of the cases lie, barring major effects of outliers
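
The projectile point measurements themselves are not given, so the sketch below applies the same idea to the rim diameter data from the earlier tables: build the mean ± sd interval and count how many cases fall inside it. `within_one_sd` is a hypothetical helper written for illustration.

```r
unit1 <- c(12.6, 11.6, 16.3, 13.1, 12.1, 26.9, 9.7, 11.5, 14.8, 13.5, 12.4, 13.6)
unit2 <- c(16.2, 16.4, 13.8, 13.2, 11.3, 14.0, 9.0, 12.5, 15.6, 11.2, 12.2, 15.5, 11.7)

# Hypothetical helper: TRUE for cases inside mean - sd .. mean + sd
within_one_sd <- function(x) {
  lower <- mean(x) - sd(x)
  upper <- mean(x) + sd(x)
  x >= lower & x <= upper
}

inside1 <- sum(within_one_sd(unit1))   # 11 of 12 cases
inside2 <- sum(within_one_sd(unit2))   # 9 of 13 cases

print(c(inside1 = inside1, inside2 = inside2))
```

In unit 1 the single 26.9 cm rim inflates the sd so much that the interval covers every other case, which is the outlier effect the slide warns about.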
rim diameter (cm)

  value (x)          deviation (x − mean)   squared deviation
  unit 1   unit 2    unit 1   unit 2        unit 1   unit 2
  12.6     16.2      -1.4     2.9           1.98     8.54
  11.6     16.4      -2.4     3.1           5.80     9.75
  16.3     13.8      2.3      0.5           5.25     0.27
  13.1     13.2      -0.9     -0.1          0.83     0.01
  12.1     11.3      -1.9     -2.0          3.64     3.91
  26.9     14.0      12.9     0.7           166.20   0.52
  9.7      9.0       -4.3     -4.3          18.56    18.29
  11.5     12.5      -2.5     -0.8          6.29     0.60
  14.8     15.6      0.8      2.3           0.63     5.40
  13.5     11.2      -0.5     -2.1          0.26     4.31
  12.4     12.2      -1.6     -1.1          2.59     1.16
  13.6     15.5      -0.4     2.2           0.17     4.94
           11.7               -1.6                   2.49

                unit 1   unit 2
  mean:         14.0     13.3
  n:            12       13
  sum of sq.:   212.19   60.20
  variance:     19.29    5.02
  stand. dev.:  4.39     2.24
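
The table can be rebuilt step by step in R. This is a sketch under the assumption that the slide's columns are deviations from the unit mean and their squares; `deviation_table` is a hypothetical helper.

```r
unit1 <- c(12.6, 11.6, 16.3, 13.1, 12.1, 26.9, 9.7, 11.5, 14.8, 13.5, 12.4, 13.6)
unit2 <- c(16.2, 16.4, 13.8, 13.2, 11.3, 14.0, 9.0, 12.5, 15.6, 11.2, 12.2, 15.5, 11.7)

# Hypothetical helper: one row per case with its deviation and squared deviation
deviation_table <- function(x) {
  dev <- x - mean(x)
  data.frame(x = x, dev = round(dev, 1), sq = round(dev^2, 2))
}

tab1 <- deviation_table(unit1)
tab2 <- deviation_table(unit2)

# Summary rows: sum of squares, variance, standard deviation
summarise_unit <- function(x) {
  ss <- sum((x - mean(x))^2)
  c(sum_sq = ss, variance = ss / (length(x) - 1), sd = sqrt(ss / (length(x) - 1)))
}
summary1 <- summarise_unit(unit1)
summary2 <- summarise_unit(unit2)

print(tab1)
print(round(rbind(unit1 = summary1, unit2 = summary2), 2))
```

Note that one case (26.9 cm) contributes 166.20 of unit 1's 212.19 sum of squares.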
trimmed dispersion measures
• variance and sd are even more sensitive to
extreme values (outliers) than the mean…
• why??

• you can calculate a trimmed version of the
variance simply by eliminating cases from the
tails, and calculating the variance in the normal
way…
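
A minimal sketch of a trimmed variance, using the unit 1 rim diameters from the earlier tables. `trimmed_var` is a hypothetical helper, and trimming one case per tail is an assumption chosen to match the trimmed mean example.

```r
unit1 <- c(12.6, 11.6, 16.3, 13.1, 12.1, 26.9, 9.7, 11.5, 14.8, 13.5, 12.4, 13.6)

# Hypothetical helper: drop k cases from each tail, then take the ordinary variance
trimmed_var <- function(x, k) {
  s <- sort(x)
  var(s[(k + 1):(length(s) - k)])
}

v_full    <- var(unit1)
v_trimmed <- trimmed_var(unit1, 1)   # drops 9.7 and 26.9

print(c(full = v_full, trimmed = v_trimmed))
```

Because deviations are squared, removing the one extreme rim reduces the variance far more than it changes the mean.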
trimmed standard deviation
• trimmed sd is calculated differently

$$s_T = \sqrt{\frac{(n - 1)\, s_W^2}{n_T - 1}}$$

• s_T = trimmed standard deviation
• n = number of cases in untrimmed batch
• s_W^2 = variance of trimmed (winsorized) batch
• n_T = number of cases in the trimmed batch
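
The formula can be sketched in R as follows. The slide does not say how the winsorized batch is built, so this assumes the usual procedure: each trimmed case is replaced by the nearest remaining value, so the batch keeps all n cases. `winsorize` and `trimmed_sd` are hypothetical helpers, and one case is trimmed per tail as in the trimmed mean example.

```r
unit1 <- c(12.6, 11.6, 16.3, 13.1, 12.1, 26.9, 9.7, 11.5, 14.8, 13.5, 12.4, 13.6)

# Hypothetical helper: replace the k lowest and k highest cases
# with the nearest values that remain after trimming (assumed procedure)
winsorize <- function(x, k) {
  s <- sort(x)
  n <- length(s)
  s[1:k] <- s[k + 1]
  s[(n - k + 1):n] <- s[n - k]
  s
}

# Hypothetical helper: trimmed sd from the winsorized variance
trimmed_sd <- function(x, k) {
  n    <- length(x)          # cases in the untrimmed batch
  n_t  <- n - 2 * k          # cases in the trimmed batch
  s2_w <- var(winsorize(x, k))
  sqrt((n - 1) * s2_w / (n_t - 1))
}

s_t <- trimmed_sd(unit1, 1)
print(c(trimmed_sd = s_t, ordinary_sd = sd(unit1)))
```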