Unit 01 - Describing Data and Its Distributions - 4 Per Page
numerically
• Main features
• Simplest variable is a categorical variable
6/25/2012
[Figure: bar charts of "Fastest Speeds Ever Driven, in mph" for 73 male and 65 female respondents; categories: Never Driven, 1 to 80, 81 to 100, 101 or faster; y-axis: "% within"]
Quantitative (Numerically-valued) variables
• Quantitative variables can assume:
• Small possible number of values (ex: number of heads in 3 coin tosses): Discrete
• Large number of values (ex: age or income of a survey respondent): Continuous
• Many displays for quantitative variables
• Histogram (most common)
• Boxplot
• Example: per capita income (2006), 50 states plus District of Columbia

Building a histogram
1) Start by defining your bins. Look at the range of values, and choose bins so that you have a good number of them. In this case, I choose bins of $2,500 starting at $25,000.
2) Then go through the list (in this case starting with Alabama, with income of $31,295), adding a block onto the respective bin to represent that observation.
3) Go through the list for all your observations.
Thus, the histogram represents how many observations fall within the defined bins.
The shape of your histogram is therefore affected by how you define your bins.
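The binning procedure can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `bin_counts` is hypothetical, the bins are treated as half-open intervals (an assumption the slides do not state), and only the four state incomes quoted in this handout are used, not the full list of 51.

```python
def bin_counts(values, start, width, n_bins):
    """Count how many values fall into each bin [lo, lo + width)."""
    counts = [0] * n_bins
    for v in values:
        idx = int((v - start) // width)  # index of the bin this value lands in
        if 0 <= idx < n_bins:
            counts[idx] += 1             # "add a block" onto that bin
    return counts

# Only the four incomes quoted in the slides; the other 47 are not listed here.
incomes = {"Alabama": 31295, "Alaska": 37271, "Arizona": 31458, "Wyoming": 40676}

# Bins of $2,500 starting at $25,000, as in the slides.
narrow = bin_counts(incomes.values(), start=25000, width=2500, n_bins=14)

# Same data with wider bins (an arbitrary alternative) to show the shape changing.
wide = bin_counts(incomes.values(), start=25000, width=5000, n_bins=7)

for i, count in enumerate(narrow):
    lo = 25000 + 2500 * i
    print(f"${lo:,} to ${lo + 2500:,}: {'#' * count}")
```

Comparing `narrow` and `wide` shows the same observations piling up differently once the bin width changes.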
[Figure: building the income histogram one observation at a time; panels: #1: Alabama = $31,295, #2: Alaska = $37,271, #3: Arizona = $31,458, #51: Wyoming = $40,676; x-axis: income; y-axis: Frequency]
[Figure: four histograms "… of the (n = 51) states (varying bins and y-axis)"; x-axis: income; y-axes: Frequency and Percent]
Algebraic formula for the mean
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
Some notes about this formula…
• Assuming a sample of n individuals indexed by i = 1, 2, 3, …, n
• “xi” denotes variable measurements on each person in the sample
• “∑” denotes the summation operator

Per capita income by state (n = 51)
[Figure: histogram of income; x-axis: income; y-axis: Percent]
Mean = $35,470/person
_____________

• mean of {1, 3, 5} = 3
• mean of {1, 3, 20} = 8
• Median is the middle number in the set of observations and is not sensitive to ‘extreme’ observations
• Sort the observations from smallest to largest
• If there is an odd number of observations, median is the middle number

[Figure: histogram of income; x-axis: income; y-axis: Percent]
Mean = $35,470/person
Median =
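A small Python sketch of the mean and median, using the two example sets from the slides. The helper names are hypothetical, and the even-length branch of `median` uses the usual convention of averaging the two middle values, which this excerpt does not spell out.

```python
def mean(xs):
    """Arithmetic mean: add up the observations and divide by n."""
    return sum(xs) / len(xs)

def median(xs):
    """Middle value of the sorted observations."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                   # odd n: the single middle number
    return (s[mid - 1] + s[mid]) / 2    # even n: assumed convention (average of the two middle values)

for data in ([1, 3, 5], [1, 3, 20]):
    print(data, "mean =", mean(data), "median =", median(data))
```

Replacing 5 with 20 moves the mean from 3 to 8 while the median stays at 3.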
[Figure: two density curves with the median marked; panels: "Skewed to the left" and "Skewed to the right"; x-axis: x]
• In a right skewed distribution, the mean is greater than the median
• In a left skewed distribution, the mean is less than the median
• In a symmetric distribution the mean is approximately (sometimes exactly) equal to the median
[Figure: symmetric distribution; x-axis: data; y-axis: frequency]

Some examples in Stata
Per capita income by state (n = 51)
[Figure: histogram of income]
Skewed-right

Measuring Spread (Variability) in Data
• Median
• Quartiles and more general percentiles
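\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad s = \sqrt{s^2} \]
• Note:
The SD is in the original units of measurement
The variance is in the (original units)^2

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{30 + 32 + 29 + 26 + 26}{5} = 28.6 \]
• Calculate the variance:
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{(30 - 28.6)^2 + (32 - 28.6)^2 + \cdots + (26 - 28.6)^2}{5 - 1} = \frac{1.96 + 11.56 + \cdots + 6.76}{4} = 6.8 \text{ points}^2 \]
• Calculate the standard deviation:
\[ s = \sqrt{s^2} = \sqrt{6.8 \text{ points}^2} = 2.608 \text{ points} \]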
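The worked example can be checked step by step in Python. This is a minimal sketch using the five scores from the slide; the variable names are illustrative only.

```python
import math
import statistics

scores = [30, 32, 29, 26, 26]           # the five observations from the worked example
n = len(scores)

x_bar = sum(scores) / n                 # sample mean
squared_devs = [(x - x_bar) ** 2 for x in scores]
s2 = sum(squared_devs) / (n - 1)        # sample variance: divide by n - 1
s = math.sqrt(s2)                       # standard deviation, back in the original units

print(f"mean = {x_bar:.1f}, variance = {s2:.1f}, SD = {s:.3f}")

# Cross-check against the standard library, which also divides by n - 1.
assert math.isclose(s2, statistics.variance(scores))
assert math.isclose(s, statistics.stdev(scores))
```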
[Figure: four histograms, panels var1, var2, var3, var4; y-axis: Percent of Observations]

. summ
Variable    Obs    Mean        Std. Dev.   Min         Max
var1        200    9.078236    1.038261    6.689424    11.7296
var2        200    10.9526     1.012461    7.813696    13.5883
var3        200    9.03134     2.002863    4.407923    14.48719
var4        200    11.13801    1.924267    6.302924    16.35018

A confusing detail
• Why divide by n – 1?
• Relatively easy to show that:
\[ \sum_{i=1}^{n} (x_i - \bar{x}) = 0, \]
since
\[ \sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \bar{x} = n\bar{x} - n\bar{x} = 0 \]
• So once you solve for the mean, all you have left are n – 1
• More simply, what is the st. dev. (SD), or average spread around the mean, of the following sets of numbers:
a) {1, 2, 3} b) {1}
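A short Python sketch of the point behind dividing by n – 1: the deviations from the mean always sum to zero, so the last deviation is fixed by the others. It also tries the two sets from the question; `statistics.stdev` raising an error for a single observation is standard-library behaviour, while the variable names are illustrative only.

```python
import statistics

def deviations(xs):
    """Deviations of each observation from the sample mean."""
    x_bar = sum(xs) / len(xs)
    return [x - x_bar for x in xs]

devs = deviations([30, 32, 29, 26, 26])
total = sum(devs)                        # zero, up to floating-point rounding
last_from_others = -sum(devs[:-1])       # the last deviation is determined by the rest

print("sum of deviations:", round(total, 10))
print("last deviation:", round(devs[-1], 10), "recovered:", round(last_from_others, 10))

sd_a = statistics.stdev([1, 2, 3])       # set a): three observations
try:
    sd_b = statistics.stdev([1])         # set b): one observation, n - 1 = 0
except statistics.StatisticsError:
    sd_b = None                          # the sample SD is undefined for a single point

print("SD of {1, 2, 3}:", sd_a, "| SD of {1}:", sd_b)
```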
• Standard deviation and variance (like the mean) can be sensitive to large observations
• SD of {1, 3, 5} = 2
• SD of {1, 3, 20} = 10.4
• Actually, even more sensitive than the mean…why?
• This issue will arise several times in the course…
• Standard deviation and mean lose natural interpretation in skewed data or data with outliers

[Figure: histogram of income; x-axis: income; y-axis: Percent]
Mean = $35,470/person
Median = $34,257/person
SD = $5,734/person
_____________
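One way to explore the "why?" is to look at the squared deviations directly. The sketch below, using the two sets from the slide, is only a suggestion of where to look: because deviations are squared before averaging, a single large observation dominates the sum.

```python
import statistics

base = [1, 3, 5]
with_outlier = [1, 3, 20]

# How much each summary grows when 5 is replaced by 20.
mean_ratio = statistics.mean(with_outlier) / statistics.mean(base)
sd_ratio = statistics.stdev(with_outlier) / statistics.stdev(base)

# Squared deviations for the set with the large observation.
m = statistics.mean(with_outlier)
squared = [(x - m) ** 2 for x in with_outlier]
share_of_largest = squared[-1] / sum(squared)   # fraction of the total due to the value 20

print(f"mean grows by a factor of {mean_ratio:.2f}")
print(f"SD grows by a factor of {sd_ratio:.2f}")
print(f"squared deviations: {squared}, share from 20: {share_of_largest:.0%}")
```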
Percentiles of a distribution
• The pth percentile of a distribution is that value such that p% of the observations fall at or below it.
• The 25th percentile is the value with 25% of the observations at or below it, 75% above
• It is called the first quartile Q1,
• the 50th percentile is the median M, and
• the 75th percentile is the third quartile Q3
• Called a quantile when expressed as a proportion instead of percentage (25th percentile = .25 quantile)
• In a small set of numbers, it may not be possible to find exact values for the percentiles
• The five-number summary of a distribution consists of
• Min, Q1, M, Q3, Max

Measuring Spread (IQR)
From Moore and McCabe (IPS)
• The first quartile Q1 is the median of the observations whose position in the ordered list is to the left of the location of the overall median.
• The third quartile Q3 is the median of the observations whose position in the ordered list is to the right of the location of the overall median.
• e.g., 1, 2, 3, 4, 5: Q1 = 1.5, M = 3, Q3 = 4.5
• Interquartile range, IQR = Q3 – Q1, is another measure of variability of the distribution
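The Moore and McCabe quartile rule can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes at least two observations; other software (Stata included) may use a different percentile rule and return slightly different quartiles.

```python
def median(sorted_xs):
    """Median of an already-sorted list."""
    n = len(sorted_xs)
    mid = n // 2
    if n % 2 == 1:
        return sorted_xs[mid]
    return (sorted_xs[mid - 1] + sorted_xs[mid]) / 2

def five_number_summary(xs):
    """Min, Q1, M, Q3, Max using the 'median of each half' rule."""
    s = sorted(xs)
    n = len(s)
    half = n // 2
    lower = s[:half]                           # positions to the left of the median's location
    upper = s[half + 1:] if n % 2 else s[half:]  # positions to the right of it
    return s[0], median(lower), median(s), median(upper), s[-1]

minimum, q1, m, q3, maximum = five_number_summary([1, 2, 3, 4, 5])
iqr = q3 - q1
print("Q1 =", q1, "M =", m, "Q3 =", q3, "IQR =", iqr)
```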
Five number summary of a distribution
[Figure: histogram of income; x-axis: income; y-axis: Percent]
Min = $26,535
Q1 = $31,891
Median = $34,257
Q3 = $38,712

. summarize income, detail
income
      Percentiles   Smallest
 1%   26535         26535
 5%   27935         27897
10%   29515         27935       Obs           51
25%   31891         29108       Sum of Wgt.   51

US News and World Report 1995 College and University Rating
[Figure: histogram; x-axis: Out of State (Non-resident) Tuition; y-axis: Percent of Observations]
Histogram shows relative frequency of observations, shape in the center
[Figure: histogram; x-axis: Out of State (Non-resident) Tuition; y-axis: Percent of Observations]
[Figure: boxplot of income; axis from 20,000 to 60,000]

Some terminology
• Moore and McCabe define a boxplot where the lines extend from the box out to the smallest and largest observations (pg. 38)
• … extend out from the box only to the largest and smallest observations that are not suspected outliers (pg. 39)

Outliers are often data errors
One value of height on Stat 104 poll entered as 5.2 inches
[Figure: boxplot; y-axis: Height in inches]
• Always positive
[Figure: histograms of var1 with an approximating density curve; x-axis: var1; y-axis: Density; annotation: “balance point”]
• Whenever histograms and approximating density functions are shown on the same plot, the histogram is drawn on the `density’ scale
• On the density scale for the vertical axis, the total area under the histogram is 1 (shape is unchanged).
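The density scale can be illustrated with a small Python sketch: each bar height is the bin count divided by (n × bin width), so the bar areas add up to 1. The data values below are made up purely for illustration, and `density_heights` is a hypothetical helper.

```python
def density_heights(values, start, width, n_bins):
    """Bar heights on the density scale: count / (n * bin width)."""
    counts = [0] * n_bins
    for v in values:
        idx = int((v - start) // width)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    n = sum(counts)
    return [c / (n * width) for c in counts]

# Made-up observations between 0 and 10 (not from the slides).
data = [1.2, 2.5, 2.7, 3.1, 3.3, 3.8, 4.4, 4.9, 5.6, 7.0]
bin_width = 2
heights = density_heights(data, start=0, width=bin_width, n_bins=5)

# Area of each bar is height * width; together they sum to 1.
area = sum(h * bin_width for h in heights)
print("heights:", heights)
print("total area:", round(area, 10))
```

The heights are proportional to the counts, which is why the shape of the histogram does not change.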
Not all distributions and density functions are symmetric
[Figure: histograms of Number of Deaths per 1000 live births; y-axes: Percent of Countries and Density]

The Normal Density Curve (in blue, below)
A family of symmetric, bell shaped densities with single … properties
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right) \]
[Figure: normal density curve; x-axis from -4 to 4; y-axis: Density]

Some properties of a Normal Density
[Figure: histogram of 10000 observations; y-axis: Density]
[Figure: normal density curves N(0, 1), N(0, 2), and N(1, 1); x-axis: x]
• 68% of observations fall within one σ of μ;
• 95% fall within 2σ of μ;
`standard normal’, μ = 0, σ = 1.
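The percentages in this rule can be checked numerically with Python's standard-library `statistics.NormalDist`. The helper name `prob_within` is hypothetical; the result does not depend on which μ and σ are chosen.

```python
from statistics import NormalDist

def prob_within(k, mu=0.0, sigma=1.0):
    """Probability that a normal observation falls within k SDs of its mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2):
    print(f"within {k} SD: {prob_within(k):.4f}")

# Same answer for any normal distribution, e.g. mean 100 and SD 15.
print(f"within 1 SD for N(100, 15): {prob_within(1, mu=100, sigma=15):.4f}")
```

The exact values are a little above 68% and 95%, which is why the rule is stated as an approximation.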
IQ tests calibrated so that the mean: μ = 100, and the SD: σ = 15… and the scores are normally distributed
[Figure: IQ score distribution; image from Wikipedia]

• Standardizing helps in interpreting observations
• An individual with IQ = 130 is
• (130 – 100)/15 = 2.0 σ’s above the mean
• Higher than 97.5% of population. Why?
• An individual with IQ = 115 is 1.0 σ’s above mean
• Higher than ___________________% of population
100 - 32/2 = 84
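A minimal Python sketch of standardizing IQ scores, comparing the rule-of-thumb answers with the values from the normal CDF. The function name `z_score` is illustrative; μ = 100 and σ = 15 are the values given for IQ.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above the mean."""
    return (x - mu) / sigma

MU, SIGMA = 100, 15
standard_normal = NormalDist()            # mean 0, SD 1

for iq in (130, 115):
    z = z_score(iq, MU, SIGMA)
    pct_below = 100 * standard_normal.cdf(z)
    print(f"IQ {iq}: z = {z:.1f}, higher than about {pct_below:.1f}% of the population")

# Rule-of-thumb versions: half of the area outside the central band lies above it.
rule_130 = 100 - (100 - 95) / 2           # 95% within 2 SDs
rule_115 = 100 - (100 - 68) / 2           # 68% within 1 SD
print("rule of thumb:", rule_130, "and", rule_115)
```

The CDF values differ slightly from the rule-of-thumb figures because 68% and 95% are themselves rounded.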
3) What IQ score will place a person in the top 10%?
• Want to find x0 such that P(X > x0) = 0.10
• Think of X as 100 + 15 × Z, with Z ~ N(0,1)
• Find z0 such that P(Z > z0) = 0.10, then compute
• x0 = 100 + 15 × z0
• Because Table A only gives area to the left, need to state this problem as: what value z0 has area 0.9 to the left?
z0 = 1.28
x0 = 100 + 15(1.28) = 119.2

4) Einstein was thought to have an IQ of about 160. What proportion of people have an IQ higher than Big Al?
\[ P(Y > 160) = P\left( \frac{Y - \mu_Y}{\sigma_Y} > \frac{160 - 100}{15} \right) = P(Z > 4.0) = \]
____________
Look it up in our friend ____________________
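Both calculations can also be done in software instead of with a table. The sketch below uses Python's `statistics.NormalDist`; its `inv_cdf` plays the role of reading Table A backwards, and the variable names are illustrative only.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)
z = NormalDist()                         # standard normal

# Problem 3: the z-value with area 0.90 to its left, then convert back to the IQ scale.
z0 = z.inv_cdf(0.90)
x0 = 100 + 15 * z0
x0_direct = iq.inv_cdf(0.90)             # same cutoff, computed on the IQ scale directly
print(f"z0 = {z0:.2f}, x0 = {x0:.1f}")

# Problem 4: area to the right of 160, i.e. P(Z > 4.0).
p_above_160 = 1 - iq.cdf(160)
print(f"P(Y > 160) = {p_above_160:.2e}")
```

Table A rounds z0 to 1.28, so the table-based cutoff of 119.2 matches the computed value to one decimal place.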
What’s Normal?
• How to decide if data can be assumed normally distributed?
… on the basis of asymmetry (for histograms) and …

Examples from 1st Day Survey (last year)
[Figure: density-scale histograms; panels: Height, Text Messages, Random N; y-axis: Density]
Unit Recap
• Summarizing Data (cont.)
• Numerically
• Proportions
• Center (mean, median)
• Spread (std.dev./variance, IQR)
• Shape (skewness, outliers)
• Changing units of measurement (linear transformations)
• Density Curves and the Normal Distribution
– The “68-95-99.7 rule”
– Standardizing the normal distribution (z-score)
• Stata is your friend!