Chapter 1 in IPS
Methods for Describing Data
IPS Sections 1.1 – 1.3
Variables and their distributions
• Variable: Any characteristic of an individual that takes different
values for different individuals
• Distribution: Describes the values a variable takes and how
frequently these values occur.
• The distribution of a variable can be described graphically and/or
numerically
• Main features
• proportion of individuals with each value
• “shape”, “center” and “spread”.
• Two Types of variables: categorical and quantitative
• Simplest variable is a categorical variable
• a variable that takes on a few discrete values that usually
have no natural numerical coding
Results of the 2012 Election Poll is a categorical variable
[Figure: bar graph, "Gallup Poll: 2012 Election, June 23, 2012"; bars for Obama, Romney, Undecided; vertical axis from 0 to 50]
Bar Graph
[Figure: bar graphs from Stat 104, Spring 2011, "Fastest Speeds Ever Driven, in mph", by sex (Female, Male); one panel shows raw counts (73 male, 65 female respondents), the other is labeled "% within"]
Building a histogram
1) Start by defining your bins. Look at the range of
values and choose a bin width that gives a reasonable
number of bins. In this case, I chose bins of
$2,500 starting at $25,000.
2) Then go through the list (in this case starting
with Alabama, with an income of $31,295),
adding a block onto the respective bin to
represent that observation.
3) Go through the list for all your observations.
Thus, the histogram represents how many
observations fall within the defined bins.
The shape of your histogram is therefore affected by
how you define your bins.
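The binning steps above can be sketched in Python. This is a minimal illustration rather than the course's Stata workflow: it uses only the four state incomes shown on these slides (not all 51), and the names `BIN_START`, `BIN_WIDTH`, and `bin_left_edge` are made up for the example.

```python
from collections import Counter

BIN_START = 25_000   # left edge of the first bin (from the slide)
BIN_WIDTH = 2_500    # bin width (from the slide)

# Only the four incomes shown on the slides; the full list has 51 entries.
incomes = {
    "Alabama": 31_295,
    "Alaska": 37_271,
    "Arizona": 31_458,
    "Wyoming": 40_676,
}


def bin_left_edge(value, start=BIN_START, width=BIN_WIDTH):
    """Return the left edge of the bin [edge, edge + width) containing value."""
    return start + ((value - start) // width) * width


# Step through the list, adding one "block" to the matching bin.
counts = Counter(bin_left_edge(v) for v in incomes.values())

# Text histogram: one '#' per observation in each non-empty bin.
for edge in sorted(counts):
    print(f"[{edge:>6}, {edge + BIN_WIDTH:>6}): {'#' * counts[edge]}")
```

Changing `BIN_WIDTH` or `BIN_START` moves observations between bins, which is why the shape of the histogram depends on how the bins are defined.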
[Figure: the histogram built up one observation at a time, in four panels: #1: Alabama = $31,295; #2: Alaska = $37,271; #3: Arizona = $31,458; #51: Wyoming = $40,676. x-axis: income (25000 to 60000); y-axis: Frequency]
Histograms: Per Capita Income (in dollars)
of the (n = 51) states (varying bins and y-axis)
[Figure: four histograms of the same income data with different bins; x-axis: income (20000 to 60000); y-axis: Frequency in two panels, Percent in the other two]
Histograms can be sensitive to
definition of the bins (IPS, Ex 1.4)
Measures of Center
• The mean and median are the two most common
measures of the center of a distribution
Algebraic formula for the mean
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Some notes about this formula…
• Assuming a sample of n individuals indexed by
i = 1, 2, 3, …, n
• “x_i” denotes the variable measurement on each person in
the sample
• “∑” denotes the summation operator
• “Bar” notation denotes average (we say “x-bar”)
Per capita income by state (n = 51)
[Figure: histogram of per capita income; y-axis: Percent]
Mean = $35,470/person
Median: another measure of center
• Mean is sensitive to presence of large observations
• Think of
• mean of {1, 3, 5} = 3
• mean of {1, 3, 20} = 8
• Median is the middle number in the set of observations and is not
sensitive to ‘extreme’ observations
• Sort the observations from smallest to largest
• If there is an odd number of observations, median is the middle
number
• If there is an even number of observations, the median is the average of the
two values ‘straddling’ the middle
• Ex.1: {1, 2, 3, 6}: median = 2.5, mean = 3
• Ex.2: {1, 2, 3, 6, 500}: median = 3, mean = 102.4
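The small examples on this slide can be checked with Python's standard `statistics` module. This is only a sketch for verifying the arithmetic; the helper name `center` is made up for the illustration.

```python
from statistics import mean, median


def center(data):
    """Return (mean, median) of a list of numbers."""
    return mean(data), median(data)


# The four small data sets used on the slide.
for data in ([1, 3, 5], [1, 3, 20], [1, 2, 3, 6], [1, 2, 3, 6, 500]):
    m, med = center(data)
    print(f"{data}: mean = {m}, median = {med}")
```

Replacing 5 with 20, or appending 500, moves the mean a lot while the median barely changes.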
Per capita income by state (n = 51)
[Figure: histogram of per capita income; y-axis: Percent]
Mean = $35,470/person
Median = $34,257/person
Idealized right-skewed distribution
Mean larger than Median
In 1998 salary survey of
Harvard’s 1973 entering class,
mean salary was $750,000
compared with a median of
$175,000… why?
Idealized Symmetric Distribution
Mean and median are the same
Effect of Shape on Mean and
Median
[Figure: three density curves, labeled "Skewed to the left", "Skewed to the right", and "Symmetric"]
Some examples
in Stata
Per capita income by state (n = 51)
[Figure: histogram of per capita income; y-axis: Percent]
Mean = $35,470/person
Median = $34,257/person
Skewed-right
Measuring Spread (Variability) in Data
The variance of a set of data
• The “center” of a group of observations can be
measured by the mean
• The variability of a single observation x_i can be
measured by its distance from the center (e.g. mean)
$$x_i - \bar{x}$$
• Since we want this to always be a positive number, this
distance is converted to
$$(x_i - \bar{x})^2$$
Variance
• The variance is the “average” of squared deviations from the
mean
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
Standard Deviation
• The standard deviation (SD) is the square root of the
variance
$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
• Note:
The SD is in the original units of measurement
The variance is in the (original units)^2
Example: variance & standard deviation
• LeBron James’s points scored in the NBA finals (5 games) were:
{30, 32, 29, 26, 26}.
http://www.basketball-reference.com/players/n/nowitdi01/gamelog/2011/
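$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{30 + 32 + 29 + 26 + 26}{5} = 28.6$$
• Calculate the variance:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{(30 - 28.6)^2 + (32 - 28.6)^2 + \cdots + (26 - 28.6)^2}{5 - 1}$$
$$s^2 = \frac{1.96 + 11.56 + \cdots + 6.76}{4} = 6.8 \text{ points}^2$$
• Calculate the standard deviation:
$$s = \sqrt{s^2} = \sqrt{6.8 \text{ points}^2} = 2.608 \text{ points}$$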
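The hand calculation above can be reproduced in Python as a check. This is a sketch using the standard `statistics` module, whose `variance` and `stdev` functions divide by n - 1; the variable names are chosen for the illustration.

```python
from statistics import mean, stdev, variance

points = [30, 32, 29, 26, 26]   # the five game totals from the slide

x_bar = mean(points)

# Variance "by hand": sum of squared deviations divided by n - 1.
s2_by_hand = sum((x - x_bar) ** 2 for x in points) / (len(points) - 1)

# The same quantities from the library (sample versions, n - 1 divisor).
s2 = variance(points)
s = stdev(points)

print(f"mean = {x_bar}, variance = {s2:.1f} points^2, SD = {s:.3f} points")
```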
[Figure: histograms of four variables (var1, var2, var3, var4); x-axis: 4 to 18; y-axis: Percent of Observations]
. summ
A confusing detail
• Why divide by n – 1?
• Relatively easy to show that $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$, since
$$\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \bar{x} = n\bar{x} - n\bar{x} = 0$$
• So once you solve for the mean, all you have left are n – 1
pieces of information, in essence. That’s the amount of
information remaining to do the next calculation on.
• More simply, what is the st. dev. (SD), or average spread
around the mean, of the following sets of numbers:
a) {1, 2, 3} b) {1}
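The two points on this slide can be tried out numerically. The sketch below assumes Python's `statistics` module; `deviations` and `sample_sd_or_none` are hypothetical helper names. It shows that deviations from the mean sum to zero, and that the sample SD is undefined for a single observation because n - 1 = 0.

```python
from statistics import StatisticsError, mean, stdev


def deviations(data):
    """Deviations of each observation from the sample mean."""
    x_bar = mean(data)
    return [x - x_bar for x in data]


def sample_sd_or_none(data):
    """Sample SD (n - 1 divisor), or None when it is undefined (n < 2)."""
    try:
        return stdev(data)
    except StatisticsError:
        return None


print(sum(deviations([1, 2, 3])))     # deviations always sum to zero
print(sample_sd_or_none([1, 2, 3]))   # set a)
print(sample_sd_or_none([1]))         # set b): no spread information left
```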
A more important detail: sensitivity to
extreme values
• Standard deviation and variance (like the mean) can be
sensitive to large observations
• SD of {1, 3, 5} = 2
• SD of {1, 3, 20} = 10.4
• Actually, even more sensitive than the mean…why?
• This issue will arise several times in the course…
• Standard deviation and mean lose natural interpretation
in skewed data or data with outliers
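One way to explore the "why?" above is to look at the squared deviations directly. The sketch below (the helper name `squared_deviations` is made up) prints them for the two example sets; the single large observation contributes most of the total once its deviation is squared.

```python
from statistics import mean, stdev


def squared_deviations(data):
    """Squared deviation of each observation from the sample mean."""
    x_bar = mean(data)
    return [(x - x_bar) ** 2 for x in data]


for data in ([1, 3, 5], [1, 3, 20]):
    print(
        data,
        "mean =", mean(data),
        "SD =", round(stdev(data), 1),
        "squared deviations =", squared_deviations(data),
    )
```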
Per capita income by state (n = 51)
[Figure: histogram of per capita income; y-axis: Percent]
Mean = $35,470/person
Median = $34,257/person
SD =
Measuring Spread
Percentiles of a distribution (IQR)
• The pth percentile of a distribution is that value such that p%
of the observations fall at or below it.
• The 25th percentile is the value with 25% of the
observations at or below it, 75% above
• It is called the first quartile Q1,
• the 50th percentile is the median M, and
• the 75th percentile is the third quartile Q3
• Called a quantile when expressed as a proportion instead of
percentage (25th percentile = .25 quantile)
• In a small set of numbers, it may not be possible to find exact
values for the percentiles
• The five-number summary of a distribution consists of
• Min, Q1, M, Q3, Max
From Moore and McCabe (IPS)
• The first quartile Q1 is the median of the observations
whose position in the ordered list is to the left of the
location of the overall median.
• The third quartile Q3 is the median of the observations
whose position in the ordered list is to the right of the
location of the overall median.
• e.g., for 1, 2, 3, 4, 5: Q1 = 1.5, M = 3, Q3 = 4.5
• Interquartile range, IQR = Q3 – Q1, is another measure of
variability of the distribution
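The quartile rule quoted above can be written out as a short function. This is a sketch of that particular rule only (software packages, Stata included, may use other percentile conventions and give slightly different quartiles); the function name is made up, and it assumes at least two observations.

```python
from statistics import median


def five_number_summary(data):
    """Min, Q1, M, Q3, Max using the 'median of each half' rule.

    Assumes len(data) >= 2. When n is odd, the overall median is
    excluded from both halves.
    """
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]                               # left of the median's location
    upper = xs[half + 1:] if n % 2 else xs[half:]   # right of the median's location
    return min(xs), median(lower), median(xs), median(upper), max(xs)


lo, q1, m, q3, hi = five_number_summary([1, 2, 3, 4, 5])
iqr = q3 - q1
print(lo, q1, m, q3, hi, "IQR =", iqr)
```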
Five number summary of a distribution
[Figure: histogram; y-axis: Percent]
Min = $26,535
Shape - Detecting Outliers
• Moore and McCabe: an observation is an outlier if it falls
more than
• 1.5 x IQR below Q1 or
• 1.5 x IQR above Q3 i.e.,
• outside the interval
• (Q1 - 1.5 x IQR, Q3 + 1.5 x IQR)
• Tuition Data:
• Q1 = $6,108 Q3 = $11,660
• 1.5 x IQR = 1.5 x (11,660 – 6,108) = $8,328
• So the criterion is: an observation
• below 6,108 – 8,328 = − $2,220 (impossible) or
• above 11,660 + 8,328 = $19,988
• There are no small outliers, but there are several large
outliers…
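The 1.5 x IQR rule is easy to apply in code. The sketch below uses the tuition quartiles from the slide; the function names `outlier_fences` and `is_outlier` are made up for the illustration, and the two tuition values tested at the end are hypothetical.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower, upper) cutoffs: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(x, q1, q3, k=1.5):
    """True if x falls outside the interval (Q1 - k*IQR, Q3 + k*IQR)."""
    low, high = outlier_fences(q1, q3, k)
    return x < low or x > high


low, high = outlier_fences(6108, 11660)   # tuition quartiles from the slide
print(low, high)

# Hypothetical tuition values, for illustration only.
print(is_outlier(21000, 6108, 11660), is_outlier(9000, 6108, 11660))
```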
Another Plot Type – Box plots
• Box plots are designed to show clearly the center, spread
(especially IQR), and outliers
Box plot of Per Capita Income
[Figure: box plot; axis: income (20,000 to 60,000)]
      Percentiles   Smallest
 1%      26535        26535
 5%      27935        27897
10%      29515        27935     Obs            51
25%      31891        29108     Sum of Wgt.    51
[Figure: histogram] Histogram shows relative frequency of observations, shape in the center
Some terminology
• Moore and McCabe define a boxplot where the lines extend from
the box out to the smallest and largest observations (pg. 38)
Changing the units of measurement
• What happens to the mean, standard deviation, etc… of a
set of data if one changes the scale, or units, of
measurements?
• For example, moving from English to metric scales
• Fahrenheit to Celsius
• feet to meters
• Calories to joules
• stones to kilograms
• Hopefully you get the idea
Effects of linear transformation
y_i = a + b x_i
• mean (Y) = a + b × mean (X)
• median (Y) = a + b × median (X)
• variance (Y) = b^2 × variance (X)
• SD (Y) = |b| × SD (X)
• IQR (Y) = |b| × IQR (X)
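These rules can be verified numerically. The sketch below uses the Fahrenheit-to-Celsius conversion (a = -160/9, b = 5/9) on a handful of illustrative temperature readings; the readings and variable names are assumptions for the example, and the IQR rule is left out because it depends on the quartile convention used.

```python
import math
from statistics import mean, median, stdev, variance

a, b = -160 / 9, 5 / 9           # Celsius = a + b * Fahrenheit
x = [81, 81, 80, 84, 83, 79]     # illustrative Fahrenheit readings (assumed)
y = [a + b * xi for xi in x]     # the same readings in Celsius

# For each summary: (computed directly on y, predicted from x by the rule).
checks = {
    "mean": (mean(y), a + b * mean(x)),
    "median": (median(y), a + b * median(x)),
    "variance": (variance(y), b ** 2 * variance(x)),
    "SD": (stdev(y), abs(b) * stdev(x)),
}

for name, (direct, via_rule) in checks.items():
    print(f"{name}: direct = {direct:.4f}, via rule = {via_rule:.4f}")
```

The shift a moves the mean and median but drops out of the variance and SD, which only feel the scale factor b.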
Example of a simple linear transformation
$$\text{cel} = \frac{5}{9}(\text{fahr} - 32) = \frac{5}{9}\,\text{fahr} - \frac{160}{9}$$
Great Blue Hill Weather Observatory
Milton, MA
Oldest continuous weather
record in North America
Blue Hill Average Daily High Temps, by Month, 1885 to Present
Month   Max_Fahr   Max_Cels
Jan     33.6       0.89

June   High Temp (2010)   High Temp (2009)
1      81                 65
2      81                 73
3      80                 64
4      84                 71
5      83                 61
6      79                 69
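$$\text{cel} = \frac{5}{9}\,\text{fahr} - \frac{160}{9}$$
$$\text{mean}_{\text{cel}} = \frac{5}{9}\,\text{mean}_{\text{fahr}} - \frac{160}{9} = \frac{5}{9}(56.79) - \frac{160}{9} = 13.77$$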
We’ll be using the transformations in a bit…
Density Curves
Using a density curve to approximate a
population distribution
• Histograms show distribution of the
actual collected data, directly.
• While informative, histograms can
be awkward to manipulate to
compute proportions.
• Density curves can be used to
approximate proportions of a
population within a range of values
• In this Gary test score data, the true
proportion (relative frequency) of
actual scores at less than a 6th grade
equivalency is 0.303 (= 287/947)
Using a density curve to approximate a
population distribution
• Plot at right shows a normal
density curve (bell shaped curve)
used to approximate this
proportion of the histogram
• By convention, all densities are
constructed so that the area
under the density curve is 1.
• Mathematical formulas or tables
can be used to show the shaded
area is 0.293, close to the
histogram’s real proportion of
0.303
The plots side by side…
[Figure: normal density curve with shaded area = 0.293, beside the histogram with proportion = 287/947 = 0.303]
Percent vs. Density Scale (vertical axis)
[Figure: two histograms of var1 (10000 observations each); x-axis: 0 to 10; left panel y-axis: Percent, right panel y-axis: Density]
• In the density scale for the vertical axis, the total area under the
histogram is 1 (shape is unchanged).
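The relationship between the two vertical scales can be seen with a few lines of code. In this sketch the data values and the bin width are made up; the point is only that percent heights sum to 100, while density heights are scaled so that bar areas (height × width) sum to 1.

```python
from collections import Counter

data = [0.5, 1.2, 1.7, 2.1, 2.4, 2.8, 3.3, 3.9, 4.6, 6.1]   # made-up values
width = 2.0                                                  # bin width (assumed)

# Bin index k covers the interval [k * width, (k + 1) * width).
counts = Counter(int(x // width) for x in data)
n = len(data)

percent = {k: 100 * c / n for k, c in counts.items()}       # heights sum to 100
density = {k: c / (n * width) for k, c in counts.items()}   # areas sum to 1

total_area = sum(height * width for height in density.values())

for k in sorted(counts):
    print(f"[{k * width}, {(k + 1) * width}): "
          f"percent = {percent[k]:.0f}, density = {density[k]:.3f}")
print("total area under density-scale histogram =", total_area)
```

Both scales divide each count by a constant, so the bars keep the same relative heights and the shape is unchanged.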
Not all distributions and density
functions are symmetric
[Figure: skewed histograms shown on Density and Percent scales; one axis label reads "Number of Deaths per 1000 live births"]
shaped densities with single
peak and ‘nice’ mathematical
properties
Formula (no need to memorize):
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right)$$
[Figure: histogram of 400 observations drawn randomly from a Normal density; x-axis: -4 to 4; y-axis: Density]
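Some properties of a
Normal Density Curve
[Figure: density-scale histogram of var1 (10000 observations); x-axis: 0 to 10; y-axis: Density]
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right)$$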
The Normal Distribution
• The normal density curves comprise a family that describes the ‘normal
distributions’
• The curves are symmetric, unimodal, and bell shaped
(Unimodal = one peak)
[Figure: "Normal Distributions": density curves for N(0, 1), N(0, 2), and N(1, 1); x-axis: -4 to 4]
Normal Distributions
• The distributions are represented symbolically as N(μ, σ) – the
normal distribution with mean μ and standard deviation (sd) σ
• All normal distributions follow the 68-95-99.7 rule:
• 68% of observations fall within one σ of μ;
• 95% fall within 2σ of μ;
• 99.7% fall within 3σ of μ
• Next graph from IPS, pp 70-71 illustrates this for hypothetical
‘standard normal’, μ = 0, σ = 1.
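The 68-95-99.7 rule can be checked against exact normal areas. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8+); the helper name `within` is made up for the illustration.

```python
from statistics import NormalDist


def within(k, mu=0.0, sigma=1.0):
    """Proportion of a N(mu, sigma) distribution within k SDs of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 2, 3):
    print(f"within {k} SD: {within(k):.4f}")

# The proportions do not depend on which normal distribution is used.
print(f"N(100, 15), within 2 SD: {within(2, mu=100, sigma=15):.4f}")
```

The rule's 68, 95, and 99.7 are rounded versions of these areas.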
Standardizing Normal Distributions
• All normal distributions have the same shape with possibly
different center and spread (sd).
• If Z has a N(0,1) distribution, then
• σZ is N(0,σ), (written σZ ~ N(0,σ))
• X = μ + σZ is N(μ,σ)
• Conversely, if X is N(μ,σ) distributed, then
• Z = (X − μ)/σ has a N(0,1) distribution
• The last relationship leads to standardization and the use of
standard normal distribution, N(0,1) to compute relative
frequencies…more on this in a bit
• Note that the simple transformation Z = (X – μ)/σ calculates the
distance of an observation from the mean, measured in number
of standard deviations, σ. This is often referred to as a
Z-score.
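Standardizing and its inverse are one-line computations. In the sketch below the function names are made up, and the values μ = 100, σ = 15 are used purely as example inputs.

```python
def z_score(x, mu, sigma):
    """Distance of x from the mean, in standard deviations: (x - mu) / sigma."""
    return (x - mu) / sigma


def from_z(z, mu, sigma):
    """Inverse transformation: x = mu + sigma * z."""
    return mu + sigma * z


# Example inputs (assumed for illustration): mu = 100, sigma = 15.
print(z_score(130, 100, 15))   # 2 SDs above the mean
print(z_score(85, 100, 15))    # 1 SD below the mean
print(from_z(2.0, 100, 15))    # back to the original scale
```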
Example: Distribution of IQ scores
IQ tests are calibrated so that the mean is μ = 100 and the SD is σ = 15,
and the scores are normally distributed.
[Figure: image from Wikipedia]
Standardizing
Standardizing the normal
distribution
The first question wants to find P(Y ≤ 95) where:
2) What fraction score at least 105?
$$P(Y \ge 105) = P\left(\frac{Y - \mu_Y}{\sigma_Y} \ge \frac{105 - 100}{15}\right)$$
Why???
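Questions 1 and 2 can be computed directly, as a check on a Table A lookup. The sketch below assumes IQ ~ N(100, 15) as stated earlier and uses `statistics.NormalDist`; the variable names are chosen for the illustration.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)   # IQ scores, as calibrated on the slide
z = NormalDist()                    # standard normal, N(0, 1)

# 1) Fraction scoring at most 95.
p_at_most_95 = iq.cdf(95)

# 2) Fraction scoring at least 105, on the original scale and after standardizing.
p_at_least_105 = 1 - iq.cdf(105)
p_via_z = 1 - z.cdf((105 - 100) / 15)

print(f"P(Y <= 95)  = {p_at_most_95:.4f}")
print(f"P(Y >= 105) = {p_at_least_105:.4f} (via Z: {p_via_z:.4f})")
```

Standardizing does not change the answer; it only lets a single table for N(0, 1) serve every normal distribution.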
3) What IQ score will place a person in the
top 10%?
• Want to find x0 such that P(X > x0) = 0.10
• Think of X as 100 + 15 × Z, with Z ~ N(0,1)
• Find z0 such that P(Z > z0) = 0.10, then compute
• x0 = 100 + 15 × z0
• Because Table A only gives area to the left, need to state
this problem as: what value z0 has area 0.9 to the left?
z0 = 1.28
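The same steps can be carried out with an inverse normal function instead of Table A. This sketch uses `NormalDist.inv_cdf` from the Python standard library; `z0` and `x0` follow the slide's notation.

```python
from statistics import NormalDist

# Value with area 0.90 to its left under N(0, 1).
z0 = NormalDist().inv_cdf(0.90)

# Convert back to the IQ scale: x0 = 100 + 15 * z0.
x0 = 100 + 15 * z0

print(f"z0 = {z0:.2f}, x0 = {x0:.1f}")
```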
4) Einstein was thought to have an IQ of about
160. What proportion of people have an IQ higher
than Big Al?
$$P(Y > 160) = P\left(\frac{Y - \mu_Y}{\sigma_Y} > \frac{160 - 100}{15}\right)$$
P(Z > 4.0) = ____________
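Printed normal tables often stop before z = 4, so software is a convenient way to get this tail area. A minimal sketch, again assuming IQ ~ N(100, 15):

```python
from statistics import NormalDist

z_einstein = (160 - 100) / 15                  # standardized score
p_above = 1 - NormalDist().cdf(z_einstein)     # upper-tail area beyond that score

print(f"z = {z_einstein}, P(Z > z) = {p_above:.2e}")
```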
Some Properties of the Normal
Distribution
• If Y ~ N(μ,σ), the area less than some value a (or b) is
denoted by P(Y < a). This anticipates later material on
probability
• For now, think of P ( ) as a proportion or relative frequency
• If Y ~ N(0,1) then P(Y < a) can be found by using the table
at the front of the book
Some important Z values
• What are the Z values with 0.10, 0.05 and 0.025 in the lower tail of the
N(0,1) distribution?
• From Table A
Prob (Z < -1.28) = 0.10
Prob (Z < -1.645) = 0.05
Prob (Z < -1.96) = 0.025
• By symmetry
Prob (Z > 1.28) = 0.10
Prob (Z > 1.645) = 0.05
Prob (Z > 1.96) = 0.025
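These table values can be reproduced with an inverse normal function. The sketch below uses `NormalDist.inv_cdf`; the dictionary name `lower_tail_z` is made up for the illustration.

```python
from statistics import NormalDist

z = NormalDist()   # standard normal, N(0, 1)

# z value with the given area in the lower tail.
lower_tail_z = {p: z.inv_cdf(p) for p in (0.10, 0.05, 0.025)}

for p, value in lower_tail_z.items():
    # By symmetry, the upper-tail cutoff is the negative of the lower-tail one.
    print(f"P(Z < {value:.3f}) = {p}   and   P(Z > {-value:.3f}) = {p}")
```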
Outliers and the Normal Distribution
• The IQR for the normal distribution is
• 75th – 25th percentile = 0.675 – (−0.675) = 1.35
• 1.5 × IQR = 1.5 × 1.35 = 2.02
• 75th percentile + (1.5 × IQR) = 2.69
• Probability (Z > 2.69) = .0036, about 0.4%
• Z has a standard normal distribution (mean 0, sd 1)
• Probability here equivalent to area under the normal density
to the right of 2.69
• So with a normal distribution, large outliers should occur less
than ½ of 1% of the time. Same for small outliers. But we DO
expect outliers if our sample size is large enough!
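The calculation on this slide can be repeated without rounding the quartiles. In the sketch below, the sample size `n` is a hypothetical value chosen only to show that some flagged points are expected in large normal samples.

```python
from statistics import NormalDist

z = NormalDist()   # standard normal, N(0, 1)

q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Chance that a single standard normal observation is a "large outlier".
p_large_outlier = 1 - z.cdf(upper_fence)

n = 1000                                         # hypothetical sample size
expected_flagged = n * 2 * p_large_outlier       # both tails, by symmetry

print(f"IQR = {iqr:.2f}, upper fence = {upper_fence:.2f}")
print(f"P(Z > fence) = {p_large_outlier:.4f}")
print(f"expected flagged points in n = {n}: {expected_flagged:.1f}")
```

With unrounded quartiles the fence is about 2.70 rather than 2.69, so the tail area differs slightly from a table lookup but stays below ½ of 1%.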
What’s Normal?
Examples from 1st Day Survey (last year)
[Figure: density-scale histograms of Height, Text Messages, and Random N]
Detecting non-Normality in Graphs
• Histogram very good at showing lack of symmetry
(skewing)
Unit Recap
• Summarizing Data (cont.)
• Numerically
• Proportions
• Center (mean, median)
• Spread (std.dev./variance, IQR)
• Shape (skewness, outliers)
• Changing units of measurement (linear transformations)
• Density Curves and the Normal Distribution
– The “68-95-99.7 rule”
– Standardizing the normal distribution (z-score)
• Stata is your friend!