Lecture 3: Descriptive Statistics: Matt Golder & Sona Golder
Introduction
Notes
In a broad sense, making an inference implies partially or completely describing
a phenomenon.
Before discussing inference making, we must have a method for characterizing
or describing a set of numbers.
The characterizations must be meaningful so that knowledge of the descriptive
measures enables us to clearly visualize a set of numbers.
Generally, we can characterize a set of numbers using either graphical or
numerical methods.
We're now going to look at numerical methods.
Numerical Methods
Notes
Graphical displays are not usually adequate for the purpose of making
inferences.
We need rigorously defined quantities for summarizing the information
contained in the sample. These sample quantities typically have mathematical
properties that allow us to make probability statements regarding the goodness
of our inferences.
The quantities we define are descriptive measures of a set of data.
Descriptive Measures
Notes
Typically, we are interested in two types of descriptive numbers: (i) measures of
central tendency and (ii) measures of dispersion or variation.
Measures of central tendency, sometimes called location statistics,
summarize a distribution by its typical value.
Measures of dispersion indicate, through a single number, the extent to which
a distribution's observations differ from one another.
Together with information on its shape, the measures convey the most
distinctive aspects of a distribution.
Central Tendency
Notes
There are three basic measures of central tendency:
1. Mean
2. Median
3. Mode
Central Tendency
Notes
Table: Points Scored in Week One of the 2008 NFL Season

Redskins     7    Giants    16    Bengals     10    Ravens    17
Jets        20    Dolphins  14    Chiefs      10    Patriots  17
Texans      17    Steelers  38    Jaguars     10    Titans    17
Lions       21    Falcons   34    Seahawks    10    Bills     34
Rams         3    Eagles    38    Buccaneers  20    Saints    24
Bears       29    Colts     13    Panthers    26    Chargers  24
Cardinals   23    49ers     13    Cowboys     28    Browns    10
Vikings     19    Packers   24    Broncos     41    Raiders   14

Figure: Frequency and density histograms of points scored.
Mean
Notes
The mean is also known as an expected value or, colloquially, as an average.
The mean of a sample of n measured responses, X_1, X_2, ..., X_n, is given by:

\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{X_1 + X_2 + \cdots + X_n}{n}

The symbol \bar{X}, read "X bar", refers to the sample mean; \mu is used to denote
the population mean.
Mean
Notes
Calculating a mean only makes sense if the data are numeric, and (in general)
if the variable in question is measured at least at the interval level.
While the mean of an ordinal variable might provide some useful information,
the fact that the data are only ordinal means that the assumption necessary for
summation (that the values of the variable are equally spaced) is
questionable.
It never makes sense to calculate the mean of a nominal-level variable.
Mean
Notes
It's common to think of the mean as the "balance point" of the data, that is,
the point in X at which, if every value of X was given weight according to
its size, the data would balance.
Thus, if we add a single data point (call it X_{N+1}) to an existing variable X,
the new mean (using N + 1 observations) will be:

\bar{X}_{new} = \frac{N\bar{X} + X_{N+1}}{N + 1}
Mean
Notes
A mean can also be thought of as the value of X that minimizes the squared
deviations between itself and every value of X_i. That is, suppose we are
interested in choosing a value for X (call it \hat{X}) that minimizes:

f(\hat{X}) = \sum_{i=1}^{N}(X_i - \hat{X})^2 = \sum_{i=1}^{N}(X_i^2 + \hat{X}^2 - 2\hat{X}X_i)
Mean
Notes
To find the minimum, we need to calculate \frac{\partial f(\hat{X})}{\partial \hat{X}}, set it equal to zero, and solve:

\frac{\partial f(\hat{X})}{\partial \hat{X}} = \sum_{i=1}^{N}(2\hat{X} - 2X_i) = 0

2N\hat{X} - 2\sum_{i=1}^{N} X_i = 0

\hat{X} = \frac{1}{N}\sum_{i=1}^{N} X_i = \bar{X}
Mean
Notes
Note that the mean is very susceptible to the effect of outliers in the data.
Exceptionally large values in X will tend to pull the mean in their direction,
and so can provide a misleading picture of the actual central location of the
variable in question.
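To see this numerically, here is a quick sketch in Python (the lecture's own examples use Stata and R; the data values here are made up for illustration):

```python
# A single extreme value drags the mean upward, while the median barely moves.
from statistics import mean, median

data = [10, 12, 13, 14, 15]
with_outlier = data + [200]

print(mean(data), median(data))                  # 12.8 13
print(mean(with_outlier), median(with_outlier))  # 44.0 13.5
```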
Geometric Mean
Notes
So far, we have looked at what formally is called the arithmetic mean.
There are two other potentially useful variants on the mean: the geometric
mean and the harmonic mean.
The geometric mean is defined as:
\bar{X}_G = \left(\prod_{i=1}^{N} X_i\right)^{1/N} = \sqrt[N]{X_1 \times X_2 \times \cdots \times X_N}
Geometric Mean
Notes
The geometric mean can also be written using logarithms:
\left(\prod_{i=1}^{N} X_i\right)^{1/N} = \exp\left(\frac{1}{N}\sum_{i=1}^{N}\ln X_i\right)
That is, the geometric mean of a variable X is equal to the exponent of the
arithmetic mean of the natural logarithm of that variable.
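The identity is easy to check directly; a small Python sketch (the values are hypothetical):

```python
import math

x = [2.0, 8.0]  # hypothetical positive values
n = len(x)

g_direct = math.prod(x) ** (1 / n)                       # Nth root of the product: 16**0.5 = 4.0
g_via_logs = math.exp(sum(math.log(v) for v in x) / n)   # exp of the mean of the logs, ~4.0
```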
Geometric Mean
Notes
The geometric mean has a geometric interpretation. You can think of the
geometric mean of two values X1 and X2 as the answer to the question:
What is the length of one side of a square that has an area equal to
that of a rectangle with width X1 and height X2 ?
Geometric Mean
Notes
This idea is represented graphically (for the two-value case) in Figure 2.
Figure: A square with sides of length X_G has the same area as a rectangle with width X_1 and height X_2.
Geometric Mean
Notes
Finally, note a few things:
The geometric mean is only appropriate for variables with positive values.
The geometric mean is always less than or equal to the arithmetic mean:
\bar{X}_G \le \bar{X}
The two are only equal if the values of X are the same for all N
observations.
While rare, the geometric mean is actually the more appropriate measure
of central tendency to use for phenomena (such as percentages) that are
more accurately multiplied rather than summed.
Geometric Mean
Notes
Example: Suppose that the price of something doubled (went up 200 percent
of the original price) in 2008, and then decreased by 50 percent (back to its
original price) in 2009.
The actual average change in the price across the two
years is not (200 + 50)/2 = 125 percent, but rather
\sqrt{200 \times 50} = \sqrt{10000} = 100, or zero
percent average (annualized) net change.
Harmonic Mean
Notes
The harmonic mean is defined as:
\bar{X}_H = \frac{N}{\sum_{i=1}^{N} \frac{1}{X_i}}

i.e., it is the product of N and the reciprocal of the sum of the reciprocals of
the X_i's.
Equivalently:

\bar{X}_H = \frac{1}{\overline{\left(\frac{1}{X}\right)}},
i.e., the harmonic mean is the reciprocal of the (arithmetic) mean of the
reciprocals of X.
Harmonic Mean
Notes
Because it considers reciprocals, the harmonic mean is the friendliest toward
small values in the data, and the least friendly to large values.
This means, among other things, that:
the harmonic mean will always be the smallest of the means in value (the
arithmetic mean will be the biggest, and the geometric will be in
between the other two)
the harmonic mean tends to limit the impact of large outliers, and
increase the weight given to small values of X.
To be honest, there are not a lot of instances where one is likely to use the
harmonic mean.
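For completeness, a small Python check of the definition and of the ordering (harmonic ≤ geometric ≤ arithmetic) mentioned above, using hypothetical values:

```python
import math
from statistics import harmonic_mean, mean

x = [1.0, 4.0, 4.0]  # hypothetical values

h = len(x) / sum(1 / v for v in x)   # 3 / 1.5 = 2.0
g = math.prod(x) ** (1 / len(x))     # 16 ** (1/3), roughly 2.52
a = mean(x)                          # 3.0

assert math.isclose(h, harmonic_mean(x))  # matches the stdlib implementation
assert h <= g <= a                        # harmonic <= geometric <= arithmetic
```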
Median
Notes
The median of a variable X, sometimes labeled X_{Med} or \tilde{X}, is the middle
observation or the 50th percentile.
Practically speaking, the median is the value of:
the \left(\frac{N-1}{2} + 1\right)th-largest value of X when N is odd, and
the mean of the \frac{N}{2}th- and \frac{N+2}{2}th-largest values of X when N is
even.
Median
Notes
Example: For our NFL opening day data, there are 32 teams.
The median is therefore the average of the numbers of points scored by the
16th- and 17th-most point-scoring teams.
Those values are 17 and 19, making the median 18 (even though exactly 18
points were not scored by any team that week).
The median is typically only calculated for ordinal-, interval- or ratio-level data.
There is no particular problem with ordinal variables since it just reflects a
middle value (and an ordinal variable orders the observations).
The median is not calculated for nominal data.
Median
Notes
While the mean is the number that minimizes the squared distance to the data,
the median is the value of c that minimizes the absolute value of the distance
to the data:

X_{Med} = \arg\min_c \sum_{i=1}^{N} |X_i - c|.
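A brute-force check of this property in Python, using the small 5-observation example that appears in the lecture's MSD and MAD tables:

```python
data = [10, 20, 30, 50, 90]  # example data from the lecture's tables

def sum_abs_dev(c):
    """Sum of absolute deviations of the data from candidate value c."""
    return sum(abs(x - c) for x in data)

# Search all integer candidates; the minimizer is the median, 30.
best = min(range(0, 91), key=sum_abs_dev)
print(best)  # 30
```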
Mode
Notes
The mode, X_{Mode}, is nothing more than the most commonly-occurring
value of X.
In our NFL data, for example, the most common number of points scored was
ten, i.e., X_{Mode} = 10.
The mode is the only measure of central tendency appropriate for use with data
measured at any level, including nominal.
That means that the mode is the only descriptive statistic appropriate for
nominal variables.
The mode is (technically) undefined for any variable that has equal maximum
frequencies for two or more variable values.
Dichotomous Variables
Notes
Think about a dichotomous (binary) variable, D. Note a few things:
The mean \bar{D} = \frac{1}{N}\sum_{i=1}^{N} D_i \in [0, 1] is equal to the proportion of 1s in
the data.
The median D_{Med} \in \{0, 1\}, depending on whether there are more 0s
or 1s in the data.
The mode D_{Mode} = D_{Med}.
The mean of a binary variable tells us a lot about it: not just its mean, but, as
we'll see, its variance as well.
It is never the case that a binary variable's mean is equal to a value actually
present in the data (unless every observation takes the same value).
It is common to use the median/mode as a measure of central tendency for
binary variables.
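These facts are easy to verify; a quick Python sketch with a hypothetical binary variable:

```python
from statistics import mean, median, mode

d = [0, 0, 0, 1, 1]  # hypothetical binary data

print(mean(d))    # 0.4, the proportion of 1s
print(median(d))  # 0, since 0s outnumber 1s
print(mode(d))    # 0, equal to the median
```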
Relationships
Notes
Figure: Central Tendency in a Symmetric Distribution with a Single Peak (Mean = Median = Mode)
Relationships
Notes
If a continuous variable Z is right-skewed, then
Z_{Mode} \le Z_{Med} \le \bar{Z}
Relationships
Notes
Figure: Points Scored in Week One of the 2008 NFL Season Redux
(Histogram of points scored: Mode = 10, Median = 18, Mean = 20.03)
Cautions
Notes
The mean, mode, and median are all poor descriptions of the data when you
have a bimodal distribution.
Strongly bimodal distributions are an indication that what you've really got is a
dichotomous variable, or possibly that there are multiple dimensions in the
intermediate values (Polity).
With skewed distributions, notably exponential and power-law distributions,
the mean can be much larger than the median and the mode.
Dispersion
Notes
A measure of central tendency does not provide an adequate description of
some variable X on its own because it only locates the center of the
distribution of data.
Range
Notes
The range is the most basic measure of dispersion it is the difference
between the highest and lowest values of a (interval- or ratio-level) variable:
Range(X) = X_{max} - X_{min}
A more robust alternative is the interquartile range (IQR), the difference
between the 75th and 25th percentiles of X.
Example: The IQR for our NFL data is (24.5 - 13) = 11.5, which means that
the middle 50 percent of teams scored points that fell within about a
12-point range.
The IQR is analogous to the range, except that it is robust to outlying data
points.
Example: If Denver had scored 82 points instead of 41, the range would then
be (82 - 3) = 79, rather than 38, but the IQR would be unaffected.
Deviations
Notes
A limitation of percentiles and ranges is that they do not make use of the
information in all of the data.
Approaches that make use of all the information are typically based on the idea
of a deviation.
A deviation is just the extent to which an observation's value on X differs from
some benchmark value.
The typical benchmarks are (i) the median and (ii) the mean.
In either case, the deviation is just the (signed) difference between an
observation's value and that benchmark.
Mean Deviation
Notes
Individual deviations are not particularly interesting. Instead we want, say, a
typical deviation.
We might think to start with the average deviation from the mean value of
X:
Mean Deviation = \frac{1}{N}\sum_{i=1}^{N}(X_i - \bar{X})
But note that this measure is actually useless because the mean deviation
always equals 0, no matter the spread of the distribution.
A corollary of the fact that the mean minimizes the sum of squared deviations
is that the mean is also the value of X that makes the sum of deviations from
it equal to zero.
Mean Deviation
Notes
Proof.
Let the sum of the deviations be:

d = \sum (X - \bar{X})

Take the sum through the brackets, and note that \sum \bar{X} is the same as N\bar{X}:

d = \sum X - N\bar{X}

And we know that \bar{X} = \frac{\sum X}{N}, so N\bar{X} = \sum X. And so:

d = \sum X - \sum X = 0

Mean Squared Deviation
Notes
A better option is to square the deviations before averaging them, giving the
mean squared deviation (MSD):

MSD = \frac{1}{N}\sum_{i=1}^{N}(X_i - \bar{X})^2
The MSD is intuitive enough, in that it is the average squared deviation from
the mean.
At this point, we can begin to learn something not just about the mean of the
data, but also about its variability.
Variance
Notes
The relevance of degrees of freedom is that, in the simple little example above,
we actually only have one effective observation that is telling us about the
variation in X, not two.
As a result, we should consider revising the denominator of our estimate of the
MSD downwards by one.
This gives us the variance:
Variance = s^2 = \frac{1}{N-1}\sum_{i=1}^{N}(X_i - \bar{X})^2
Variance
Notes
We can provide a shortcut for calculating \sum(X - \bar{X})^2:

\sum(X - \bar{X})^2 = \sum(X - \bar{X})(X - \bar{X}) = \sum(X^2 - 2X\bar{X} + \bar{X}^2) = \sum X^2 - 2\bar{X}\sum X + N\bar{X}^2

We can note that \bar{X} = \frac{\sum X}{N} and that \sum X = N\bar{X}, so:

\sum(X - \bar{X})^2 = \sum X^2 - 2\frac{(\sum X)^2}{N} + N\frac{(\sum X)^2}{N^2} = \sum X^2 - \frac{(\sum X)^2}{N}

In other words, you can calculate the variance with just two pieces of information:
\sum X^2 and (\sum X)^2.
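A numerical check of the shortcut in Python, again using the lecture's 5-observation example:

```python
data = [10, 20, 30, 50, 90]
n = len(data)
xbar = sum(data) / n  # 40.0

direct   = sum((x - xbar) ** 2 for x in data)             # 4000
shortcut = sum(x * x for x in data) - sum(data) ** 2 / n  # 12000 - 8000 = 4000

variance = shortcut / (n - 1)  # 1000.0
```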
Standard Deviation
Notes
The variance is expressed in terms of squared units rather than the units of
X.
To put s2 back on the same scale as X, we can take its square root and obtain
the standard deviation:
Standard Deviation = s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(X_i - \bar{X})^2}
s (or \sigma for a population) is the closest analogue to an "average deviation from
the mean" that we have.
s is also the standard measure of empirical variability used with interval- and
ratio-level data.
In fact, it's generally good practice to report s every time you report \bar{X}.
Standard Deviation
Notes
Notice that, as with the mean, s is expressed in the units of the original
variable X.
Example: In our 32-team NFL data, the variance s2 is 93.6, while the standard
deviation is about 9.67.
That means that an average (or, more accurately, typical) team's score
was about 9-10 points away from the empirical mean of (about) 20.
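We can reproduce these numbers from the Week One table at the start of the lecture; a quick Python sketch:

```python
from statistics import mean, median, variance, stdev

# The 32 opening-day scores from the table above.
points = [7, 20, 17, 21, 3, 29, 23, 19,
          16, 14, 38, 34, 38, 13, 13, 24,
          10, 10, 10, 10, 20, 26, 28, 41,
          17, 17, 17, 34, 24, 24, 10, 14]

print(mean(points))                # 20.03125
print(median(points))              # 18.0
print(round(variance(points), 5))  # 93.57964 (s^2, with N-1 in the denominator)
print(round(stdev(points), 6))     # 9.673657
```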
A Variant: Geometric s
Notes
As with the (arithmetic) mean, there's a variant of s that is better used when
the geometric mean is the more appropriate statistic.
The geometric standard deviation is defined as:

s_G = \exp\left(\sqrt{\frac{\sum_{i=1}^{N}(\ln X_i - \ln \bar{X}_G)^2}{N}}\right)
Example: Suppose that Denver put up 82 points instead of 41, but the other
teams scores were identical.
The result is that the mean of points rises (from 20 to about 21.3), but the
standard deviation rises even more dramatically (from 9.6 to about 14.2).
That is, changing the points scored by a single team caused s to increase by 50
percent.
Median Absolute Deviation
Notes
The median absolute deviation (MAD) is often presented where there are
highly influential outlier observations, i.e., in the same situations where the
median is used as a measure of central tendency:
We consider deviations around the median, rather than the mean, and
When considering a "typical" value for the deviations, we use the median
value rather than the mean.
An alternative is the mean absolute deviation around the mean:

MAD2 = \frac{1}{N}\sum_{i=1}^{N}|X_i - \bar{X}|

This is merely a substitution of the absolute value into the standard formula for
a mean deviation, but one that lacks the robustness to outliers that MAD does.
Example
Notes
Table: Mean Squared Deviation (MSD), Variance (s^2), and Standard Deviation (s)

Data X    Deviations (X - \bar{X})    Squared Deviations (X - \bar{X})^2
10        -30                         900
20        -20                         400
30        -10                         100
50         10                         100
90         50                         2500

\sum X = 200, so \bar{X} = 200/5 = 40
\sum(X - \bar{X})^2 = 4000
MSD = 4000/5 = 800
s^2 = 4000/4 = 1000
s = \sqrt{1000} \approx 32
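The table's arithmetic, checked in Python:

```python
data = [10, 20, 30, 50, 90]
n = len(data)
xbar = sum(data) / n                   # 40.0
sq = [(x - xbar) ** 2 for x in data]   # [900, 400, 100, 100, 2500]

msd = sum(sq) / n        # 4000 / 5 = 800.0
s2 = sum(sq) / (n - 1)   # 4000 / 4 = 1000.0
s = s2 ** 0.5            # roughly 31.6, which rounds to 32
```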
Example
Notes
Table: Deviations, MAD, and MAD2

Data X    Absolute Deviations (Median) |X - X_{Med}|    Absolute Deviations (Mean) |X - \bar{X}|
10        20                                            30
20        10                                            20
30         0                                            10
50        20                                            10
90        60                                            50

\sum X = 200, so \bar{X} = 200/5 = 40; X_{Med} = 30
MAD = 20
MAD2 = 120/5 = 24
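And the MAD/MAD2 table, checked the same way in Python:

```python
from statistics import mean, median

data = [10, 20, 30, 50, 90]
med = median(data)   # 30
xbar = mean(data)    # 40

mad = median(abs(x - med) for x in data)              # median of [20, 10, 0, 20, 60] -> 20
mad2 = sum(abs(x - xbar) for x in data) / len(data)   # 120 / 5 = 24.0
```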
When a variable takes J distinct values X_1, ..., X_J with frequencies f_1, ..., f_J
(where \sum_j f_j = N), the mean can be written as:

\bar{X} = \frac{\overbrace{(X_1 + X_1 + \cdots + X_1)}^{f_1 \text{ times}} + \overbrace{(X_2 + X_2 + \cdots + X_2)}^{f_2 \text{ times}} + \cdots}{N} = \frac{1}{N}(X_1 f_1 + X_2 f_2 + \cdots + X_J f_J)

and the variance as:

s^2 = \frac{1}{N-1}\sum_{j=1}^{J}(X_j - \bar{X})^2 f_j
Example
Notes
Table: Calculation of Mean and Standard Deviation from Relative Frequency
Distribution (n = 200)

Height X    Weighting f_j    Deviation (X - \bar{X})    Squared Deviations (X - \bar{X})^2    Weighted Squared Deviations (X - \bar{X})^2 f_j
60           4               -9                         81                                    324
63          12               -6                         36                                    432
66          44               -3                          9                                    396
69          64                0                          0                                      0
72          56                3                          9                                    504
75          16                6                         36                                    576
78           4                9                         81                                    324

\bar{X} = 69.3
s^2 = 2556/199 = 12.8
s = \sqrt{12.8} = 3.6
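Computing directly from the frequencies in Python (using the exact mean of 69.3 rather than the table's rounded deviations) gives the same figures:

```python
heights = [60, 63, 66, 69, 72, 75, 78]
freqs   = [4, 12, 44, 64, 56, 16, 4]

n = sum(freqs)                                           # 200
xbar = sum(x * f for x, f in zip(heights, freqs)) / n    # 69.3
s2 = sum(f * (x - xbar) ** 2
         for x, f in zip(heights, freqs)) / (n - 1)

print(xbar)                 # 69.3
print(round(s2, 1))         # 12.8
print(round(s2 ** 0.5, 1))  # 3.6
```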
Moments
Notes
Means and variances are examples of moments.
Moments are typically used to characterize random variables (rather than
empirical distributions of variables), but they give rise to some useful additional
statistics.
Think of the moment as a description of a distribution . . .
We can take a moment around some number, usually the mean (or zero).
In general, the kth moment around a variable's mean is M_k = E[(X - \mu)^k].
The mean is the first moment (taken around zero): M_1 = \bar{X} = E(X) = \frac{\sum X}{N}.
The variance is the second moment around the mean: s^2 = \frac{\sum(X - \bar{X})^2}{N-1}.
Skew
Notes
Skew is the dimensionless version of the third moment:

M_3 = E[(X - \mu)^3] = \frac{\sum(X - \bar{X})^3}{N}

s^3 = s.d.(X)^3 = \left(\sqrt{s^2}\right)^3

Skew = \gamma_1 = \frac{M_3}{s^3} = \frac{\frac{1}{N}\sum_{i=1}^{N}(X_i - \bar{X})^3}{\left[\frac{1}{N}\sum_{i=1}^{N}(X_i - \bar{X})^2\right]^{3/2}}
N
Skew
Notes
Skew is a measure of symmetry; it measures the extent to which a distribution
has long, drawn out tails on one side or the other.
Skew = 0 is symmetrical.
Skew > 0 is positive (tail to the right).
Skew < 0 is negative (tail to the left).
Skew
Notes
Figure: Symmetrical Distribution (Zero Skew)
Skew
Notes
Figure: Right-Skewed Distribution (Skew > 0)
Skew
Notes
Figure: Left-Skewed Distribution (Skew < 0)
Kurtosis
Notes
Kurtosis is the dimensionless version of the fourth moment:

M_4 = E[(X - \mu)^4] = \frac{\sum(X - \bar{X})^4}{N}

Kurtosis = \gamma_2 = \frac{M_4}{s^4} - 3 = \frac{\frac{1}{N}\sum_{i=1}^{N}(X_i - \bar{X})^4}{\left[\frac{1}{N}\sum_{i=1}^{N}(X_i - \bar{X})^2\right]^2} - 3
Kurtosis
Notes
Kurtosis has to do with the "peakedness" of a distribution.
Peaked, with heavy tails = leptokurtic (M_4 is large).
Medium-tailed = mesokurtic.
Flat, with light tails = platykurtic (M_4 is small).
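The same computation works for kurtosis. Note that Stata and R's moments package report the un-subtracted version M_4/s^4 (about 2.52 for the NFL data), while the formula above subtracts 3 to give excess kurtosis; a Python check:

```python
# The 32 opening-day scores from the Week One table.
points = [7, 20, 17, 21, 3, 29, 23, 19,
          16, 14, 38, 34, 38, 13, 13, 24,
          10, 10, 10, 10, 20, 26, 28, 41,
          17, 17, 17, 34, 24, 24, 10, 14]

n = len(points)
xbar = sum(points) / n
m2 = sum((x - xbar) ** 2 for x in points) / n  # second moment about the mean
m4 = sum((x - xbar) ** 4 for x in points) / n  # fourth moment about the mean

raw_kurtosis = m4 / m2 ** 2          # ~2.52, the un-subtracted version
excess_kurtosis = raw_kurtosis - 3   # ~-0.48, slightly platykurtic
```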
Moments
Notes
Figure: Kurtosis: Examples (density plots of leptokurtic, mesokurtic, and platykurtic distributions)
Special Cases
Notes
If the data are symmetrical, then a few things are true:
The median is equal to the average (mean) of the first and third
quartiles. That, in turn, means that:
Half the IQR equals the MAD (i.e., the distance between the median
and (say) the 75th percentile is equal to the MAD).
The skew is zero.
Special Cases
Notes
If you have nominal-level variables, all you can do is report category
percentages.
You should not use any of the measures discussed above with
nominal-level data.
Special Cases
Notes
If you have dichotomous variables, then the descriptive statistics described
above have some special characteristics.
The range is always 1.
The percentiles are always either 0 or 1.
The IQR is always either 0 (if there are fewer than 25 or more than 75
percent 1s) or 1 (if there are between 25 and 75 percent 1s).
The variance of a dichotomous variable D is related to the mean:

s_D^2 = \bar{D}(1 - \bar{D})

Since \bar{D} is necessarily between zero and one, s_D^2 is as well.
It follows that s_D > s_D^2.
s_D and s_D^2 will always be greatest when \bar{D} = 0.5, i.e., an equal number of
zeros and ones, and will decline as one moves away from this value.
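A quick Python check against the NFC indicator summarized on the next slide (note that Stata's reported variance uses the N-1 denominator, hence the factor N/(N-1)):

```python
d = [0] * 16 + [1] * 16   # the NFC indicator: 16 AFC (0) and 16 NFC (1) teams
n = len(d)
dbar = sum(d) / n         # 0.5

pop_var = dbar * (1 - dbar)           # 0.25, the D-bar(1 - D-bar) formula
sample_var = n / (n - 1) * pop_var    # ~0.2580645, Stata's N-1 version
```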
Special Cases
Notes
Example: If we added an indicator called NFC to our NFL Week One data, it
would look like this:
. sum NFC, detail

                            NFC
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            0              0       Obs                  32
25%            0              0       Sum of Wgt.          32

50%           .5                      Mean                 .5
                        Largest       Std. Dev.      .5080005
75%            1              1
90%            1              1       Variance       .2580645
95%            1              1       Skewness              0
99%            1              1       Kurtosis              1
Best Practices
Notes
It is useful to present summary statistics for every variable you use in a paper.
Presenting summary measures is a good way of ensuring that your reader can
(better) understand what is going on.
Typically, one presents means, standard deviations, and minimums &
maximums, as well as an indication of the total number of observations in the
data.
For an important dependent variable, it's also generally a good idea to
present some sort of graphical display of the distribution of the response
variable. This is less important if the variable is dichotomous.
Best Practices
Notes
Summary Statistics

Variable         Mean    Standard Deviation    Minimum    Maximum
Assassination    0.01    0.09                  0          1

Note: N = 5614. Statistics are based on all non-missing observations in the model in Table X.
Best Practices
Notes
Figure: Histograms of Points Scored, by Conference (AFC and NFC)
Stata
Notes
50%           18                      Mean           20.03125
                        Largest       Std. Dev.      9.673657
75%           25              34
90%           34              38      Variance       93.57964
95%           38              38      Skewness       .5219589
99%           41              41      Kurtosis       2.518387
. ameans points

    Variable |       Type        Obs        Mean     [95% Conf. Interval]
-------------+------------------------------------------------------------
      points | Arithmetic         32    20.03125     16.54352     23.51898
             |  Geometric         32    17.57103     14.36262     21.49615
             |   Harmonic         32    14.65255     11.30963      20.8009
---------------------------------------------------------------------------
Stata
Notes
. tabstat points, stats(mean median sum max min range sd variance skewness kurtosis q)

    variable |      mean       p50       sum       max       min     range
-------------+--------------------------------------------------------------
      points |  20.03125        18       641        41         3        38
------------------------------------------------------------------------------
             |        sd  variance  skewness  kurtosis       p25       p50       p75
             |  9.673657  93.57964  .5219589  2.518387        13        18        25
------------------------------------------------------------------------------
Stata
Notes
. summarize population, detail
. return list

scalars:
                  r(N) =  42
              r(sum_w) =  42
               r(mean) =  18531.88095238095
                r(Var) =  871461347.2293844
                 r(sd) =  29520.52416928576
           r(skewness) =  2.480049617027598
           r(kurtosis) =  9.764342116660073
                r(sum) =  778339
                r(min) =  27
                r(max) =  146001
                 r(p1) =  27
                 r(p5) =  32
                r(p10) =  276
                r(p25) =  2405
                r(p50) =  5372
                r(p75) =  15892
                r(p90) =  59330
                r(p95) =  65667
                r(p99) =  146001
Stata
Notes
. sort NFC
. by NFC: summarize points

------------------------------------------------------------------------
-> NFC = AFC

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+----------------------------------------------------------
      points |        16      19.125    10.07224         10         41

------------------------------------------------------------------------
-> NFC = NFC

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+----------------------------------------------------------
      points |        16     20.9375    9.497149          3         38
R
Notes
> install.packages("foreign")
> library(foreign)
> install.packages("psych")
> library(psych)
> NFL <- read.dta("NFLWeekOne.dta")  # change path
> attach(NFL)
> summary(points)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   3.00   13.00   18.00   20.03   24.50   41.00
> geometric.mean(points)
[1] 17.57103
> harmonic.mean(points)
[1] 14.65255
R
Notes
> library(Hmisc)
> describe(points)
points
      n missing  unique    Mean     .05     .10     .25     .50     .75     .90     .95
     32       0      18   20.03    8.65   10.00   13.00   18.00   24.50   34.00   38.00

              3  7 10 13 14 16 17 19 20 21 23 24 26 28 29 34 38 41
    Frequency 1  1  5  2  2  1  4  1  2  1  1  3  1  1  1  2  2  1
    %         3  3 16  6  6  3 12  3  6  3  3  9  3  3  3  6  6  3
> library(pastecs)
> stat.desc(points)
     nbr.val     nbr.null       nbr.na          min          max        range          sum
     32.0000       0.0000       0.0000       3.0000      41.0000      38.0000     641.0000
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
     18.0000      20.0312       1.7101       3.4877      93.5796       9.6737       0.4829
R
Notes
> var(points)
[1] 93.58
> sd(points)
[1] 9.674
> mad(points)
[1] 8.896
> library(moments)
> skewness(points)
[1] 0.522
> kurtosis(points)
[1] 2.518
> by(points, NFC, summary)
NFC: AFC
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   10.0    12.2    17.0    19.1    21.0    41.0
------------------------------------------------------------
NFC: NFC
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    3.0    15.2    22.0    20.9    26.5    38.0