Unit 01 - Describing Data and Its Distributions - 1 Per Page

Unit 1: Univariate Data
Chapter 1 in IPS
• Data and their distributions

• Changing units of measurement (linear transformations)
• Density curves and the Normal distribution
1
Methods for Describing Data
IPS Sections 1.1 – 1.3
• Data and their distributions

– Graphical displays of data
– Descriptive Summaries of data
– Center, Spread, Shape
• Density Curves and the Normal distributions
– The “68-95-99.7 rule”
– Standardizing the normal distribution
2
Variables and their distributions
• Variable: Any characteristic of an individual that takes different
values for different individuals
• Distribution: Describes the values a variable takes and how
frequently these values occur.
• The distribution of a variable can be described graphically and/or
numerically
• Main features
• proportion of individuals with each value
• “shape”, “center” and “spread”.
• Two Types of variables: categorical and quantitative
• Simplest variable is a categorical variable
• a variable that takes on a few discrete values that usually
have no natural numerical coding
3
Results of the 2012 Election Poll is a categorical variable
[Figure: bar graph, Gallup Poll: 2012 Election, June 23, 2012; bars for Obama, Romney, and Undecided; vertical axis from 0 to 50]
4
Bar Graph
Stat 104 - Spring 2011
Fastest Speeds Ever Driven, in mph
[Figure: bar graph by gender (Female, Male) across the categories Never Driven, 1 to 80, 81 to 100, 101 or faster]
5
Spring 2011
[Figure: two bar graphs over the same categories, for 73 male respondents and 65 female respondents: raw counts (top) and % within gender (bottom)]
6
Quantitative (Numerically-valued) variables
• Quantitative variables can assume

• Small possible number of values (ex: number of heads
in 3 coin tosses): Discrete
• Large number of values (ex: age or income of a survey
respondent): Continuous
• Many displays for quantitative variables
• Histogram (most common)
• Boxplot
• Example: per capita income (2006), 50 states plus District
of Columbia
7
Building a histogram
1) Start by defining your bins. Look at the range of
values, and choose a reasonable number of bins.
In this case, I chose bins of
$2,500 starting at $25,000.
2) Then go through the list (in this case starting
with Alabama, with an income of $31,295),
adding a block onto the respective bin to
represent that observation.
3) Repeat for all your observations.
Thus, the histogram represents how many
observations fall within the defined bins.
The shape of your histogram is therefore affected by
how you define your bins.
8
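The binning steps above can be sketched in a few lines of Python (the slides themselves use Stata). This is only an illustration: it uses the four state incomes quoted on the slides rather than all 51 values, and it assumes each bin includes its left edge and excludes its right edge, a convention the slides do not specify.

```python
from collections import Counter

# Only the four incomes quoted on the slides; the full 51-value list is not shown there.
incomes = {"Alabama": 31295, "Alaska": 37271, "Arizona": 31458, "Wyoming": 40676}

BIN_START = 25_000  # left edge of the first bin
BIN_WIDTH = 2_500   # width of every bin

def bin_left_edge(value, start=BIN_START, width=BIN_WIDTH):
    """Return the left edge of the bin [left, left + width) containing value."""
    return start + ((value - start) // width) * width

# Step 2 and 3: add one "block" to the matching bin for every observation.
counts = Counter(bin_left_edge(v) for v in incomes.values())

for left in sorted(counts):
    print(f"[{left}, {left + BIN_WIDTH}): {'#' * counts[left]}")
```

Changing `BIN_WIDTH` or `BIN_START` changes which observations share a bin, which is why the shape of a histogram depends on the bin definition.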
[Figure: four histogram panels (Frequency vs. income, $25,000 to $60,000) showing the histogram as observations are added: #1: Alabama = $31,295; #2: Alaska = $37,271; #3: Arizona = $31,458; #51: Wyoming = $40,676]
9
Histograms: Per Capita Income (in dollars)
of the (n = 51) states (varying bins and y-axis)
[Figure: four histograms of income ($20,000 to $60,000) using different bin widths; two panels use a Frequency vertical axis and two use a Percent vertical axis]
10
Histograms can be sensitive to
definition of the bins (IPS, Ex 1.4)
11
Measures of Center
• Mean and median are the two most common
measures of center of a distribution
• Mean, denoted $\bar{x}$, is the simple arithmetic
average (formula coming up)
• Mean of the set of numbers {1, 1, 5, -1} is
• $\bar{x} = (1 + 1 + 5 - 1) / 4 = 6 / 4 = 1.5$
12
Algebraic formula for the mean
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Some notes about this formula…
• Assuming a sample of n individuals indexed by
i = 1, 2, 3, …, n
• “$x_i$” denotes variable measurements on each person in
the sample
• “∑” denotes the summation operator
• “Bar” notation denotes average (we say “x-bar”)
13
Per capita income by state (n = 51)
[Figure: histogram of income (Percent vs. income, $20,000 to $60,000)]
Mean = $35,470/person
14
Median: another measure of center
• Mean is sensitive to presence of large observations
• Think of
• mean of {1, 3, 5} = 3
• mean of {1, 3, 20} = 8
• Median is the middle number in the set of observations and is not
sensitive to ‘extreme’ observations
• Sort the observations from smallest to largest
• If there is an odd number of observations, median is the middle
number
• If an even number of observations, median is the average of the
two values `straddling’ the middle
• Ex.1: {1, 2, 3, 6}: median = 2.5, mean = 3
• Ex.2: {1, 2, 3, 6, 500}: median = 3, mean = 102.4
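The two examples above can be checked with Python's standard `statistics` module (a sketch in Python, not the Stata used in the course). Adding the single value 500 barely moves the median but drags the mean far to the right.

```python
from statistics import mean, median

ex1 = [1, 2, 3, 6]
ex2 = [1, 2, 3, 6, 500]  # same data plus one extreme observation

for data in (ex1, ex2):
    # median() averages the two middle values when the count is even
    print(data, "median =", median(data), "mean =", mean(data))
```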
15
[Figure: histogram of income (Percent vs. income, $20,000 to $60,000)]
Mean = $35,470/person
Median = $34,257/person
16
Idealized right-skewed distribution
Mean larger than Median
In 1998 salary survey of
Harvard’s 1973 entering class,
mean salary was $750,000
compared with a median of
$175,000… why?
17
Idealized Symmetric Distribution
Mean and median are the same
18
Effect of Shape on Mean and
Median
• In a right skewed distribution, the mean is
greater than the median
• In a left skewed distribution, the mean is less
than the median
• In a symmetric distribution the mean is
approximately (sometimes exactly) equal to the
median
19
Histograms show shape
[Figure: three histograms: Skewed to the left, Skewed to the right, and Symmetric]
Some examples in Stata
20
[Figure: histogram of income (Percent vs. income, $20,000 to $60,000)]
Mean = $35,470/person
Median = $34,257/person
Skewed-right
21
Measuring Spread (Variability) in Data
Two common methods

1. Variance and standard deviation
• Measure spread about the mean
• Most often used, but also sensitive to large values
in skewed distributions
2. Quantiles and percentiles
• Median
• Quartiles and more general percentiles
22
The variance of a set of data
• The “center” of a group of observations can be
measured by the mean
• The variability of a single observation $x_i$ can be
measured by its distance from the center (e.g. mean)
$(x_i - \bar{x})$
• Since we want this to never be negative, this
distance is converted to
$(x_i - \bar{x})^2$
• The “average” of these “squared deviations from the
mean” is used as a measure of variability
23
Variance
• The variance is the “average” of squared deviations from the
mean
• If there are n observations $x_1, x_2, \ldots, x_n$, then the variance is
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
24
Standard Deviation
• The standard deviation (SD) is the square root of the
variance
$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
• Note:
The SD is in the original units of measurement
The variance is in (original units)²
25
Example: variance & standard deviation
• LeBron James’s points scored in the NBA finals (5 games) were:
{30, 32, 29, 26, 26}.
http://www.basketball-reference.com/players/n/nowitdi01/gamelog/2011/
• Calculate the mean:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{30 + 32 + 29 + 26 + 26}{5} = 28.6$$
• Calculate the variance:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{(30 - 28.6)^2 + (32 - 28.6)^2 + \cdots + (26 - 28.6)^2}{5 - 1}$$
$$s^2 = \frac{1.96 + 11.56 + \cdots + 6.76}{4} = 6.8 \text{ points}^2$$
• Calculate the standard deviation:
$$s = \sqrt{s^2} = \sqrt{6.8 \text{ points}^2} = 2.608 \text{ points}$$
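The same calculation can be reproduced as a short Python sketch (Python is used here for illustration; the course itself uses Stata). It computes the variance by hand with the n – 1 divisor and compares the result with `statistics.variance` and `statistics.stdev`, which use the same divisor.

```python
from statistics import mean, stdev, variance

points = [30, 32, 29, 26, 26]  # points per game from the example

x_bar = mean(points)
squared_devs = [(x - x_bar) ** 2 for x in points]   # (x_i - x_bar)^2
s2_by_hand = sum(squared_devs) / (len(points) - 1)  # divide by n - 1
s_by_hand = s2_by_hand ** 0.5

print("mean     =", x_bar)
print("variance =", round(s2_by_hand, 3), "library:", round(variance(points), 3))
print("SD       =", round(s_by_hand, 3), "library:", round(stdev(points), 3))
```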
26
[Figure: four histograms (Percent of Observations vs. value, 4 to 18) for var1, var2, var3, and var4]
. summ
Variable   Obs   Mean       Std. Dev.   Min        Max
var1       200   9.078236   1.038261    6.689424   11.7296
var2       200   10.9526    1.012461    7.813696   13.5883
var3       200   9.03134    2.002863    4.407923   14.48719
var4       200   11.13801   1.924267    6.302924   16.35018
27
A confusing detail
• Why divide by n – 1?
• It is relatively easy to show that $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$, since
$$\sum_{i=1}^{n}(x_i - \bar{x}) = \sum_{i=1}^{n} x_i - \sum_{i=1}^{n}\bar{x} = n\bar{x} - n\bar{x} = 0$$
• So once you solve for the mean, all you have left are n – 1
pieces of information, in essence. That’s the amount of
information remaining to do the next calculation on.
• More simply, what is the st. dev. (SD), or average spread
around the mean, of the following sets of numbers:
a) {1, 2, 3} b) {1}
28
A more important detail: sensitivity to
extreme values
• Standard deviation and variance (like the mean) can be
sensitive to large observations
• SD of {1, 3, 5} = 2
• SD of {1, 3, 20} = 10.4
• Actually, even more sensitive than the mean…why?
• This issue will arise several times in the course…
• Standard deviation and mean lose natural interpretation
in skewed data or data with outliers
29
[Figure: histogram of income (Percent vs. income, $20,000 to $60,000)]
Mean = $35,470/person
Median = $34,257/person
SD = $5,734/person
30
Measuring Spread
Percentiles of a distribution (IQR)
• The pth percentile of a distribution is that value such that p%
of the observations fall at or below it.
• The 25th percentile is the value with 25% of the
observations at or below it, 75% above
• It is called the first quartile Q1,
• the 50th percentile is the median M, and
• the 75th percentile is the third quartile Q3
• Called a quantile when expressed as a proportion instead of
percentage (25th percentile = .25 quantile)
• In a small set of numbers, it may not be possible to find exact
values for the percentiles
• The five-number summary of a distribution consists of
• Min, Q1, M, Q3, Max
31
From Moore and McCabe (IPS)
• The first quartile Q1 is the median of the observations
whose position in the ordered list is to the left of the
location of the overall median.
• The third quartile Q3 is the median of the observations
whose position in the ordered list is to the right of the
location of the overall median.
• e.g., for the data 1, 2, 3, 4, 5: Q1 = 1.5, M = 3, Q3 = 4.5
• Interquartile range, IQR = Q3 – Q1, is another measure of
variability of the distribution
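The quartile rule quoted above can be sketched in Python as follows. The function name `five_number_summary` is made up for this illustration, and the sketch assumes at least two observations. Note that statistical software often uses other percentile conventions, so its quartiles can differ slightly from this rule on small data sets.

```python
from statistics import median

def five_number_summary(values):
    """Min, Q1, M, Q3, Max using the 'median of each half' rule.

    Observations to the left (right) of the overall median's location
    form the lower (upper) half; with an odd count the middle
    observation belongs to neither half.
    """
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]            # positions left of the median's location
    upper = xs[half + (n % 2):]  # positions right of the median's location
    return min(xs), median(lower), median(xs), median(upper), max(xs)

mn, q1, m, q3, mx = five_number_summary([1, 2, 3, 4, 5])
print("Q1 =", q1, "M =", m, "Q3 =", q3, "IQR =", q3 - q1)
```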
32
Five number summary of a distribution
[Figure: histogram of income (Percent vs. income, $20,000 to $60,000)]
Min = $26,535
Q1 = $31,891
Median = $34,257
Q3 = $38,712
Max = $55,755
. summarize income, detail
                        income
      Percentiles   Smallest
 1%      26535        26535
 5%      27935        27897
10%      29515        27935      Obs                 51
25%      31891        29108      Sum of Wgt.         51
50%      34257                   Mean          35470.67
                     Largest     Std. Dev.     5734.745
75%      38712        45877
90%      42392        46344      Variance      3.29e+07
95%      46344        49852      Skewness      1.275147
99%      55755        55755      Kurtosis       5.07413
33
US News and World Report
1995 College and University Rating
[Figure: histogram of Out of State (Non-resident) Tuition, $0 to $25,000]
Percentile   1%     5%     10%    25%    50%    75%     90%     95%     99%
Tuition      2250   3811   4470   6108   8670   11660   15476   17720   19700
34
Shape - Detecting Outliers
• Moore and McCabe: an observation is an outlier if it falls
more than
• 1.5 x IQR below Q1 or
• 1.5 x IQR above Q3 i.e.,
• outside the interval
• (Q1 - 1.5 x IQR, Q3 + 1.5 x IQR)
• Tuition Data:
• Q1 = $6,108 Q3 = $11,660
• 1.5 x IQR = 1.5 x (11,660 – 6,108) = $8,328
• So the criterion is: an observation
• below 6,108 – 8,328 = − $2,220 (impossible) or
• above 11,660 + 8328 = $19,988
• There are no small outliers, but there are several large
outliers…
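The 1.5 × IQR criterion for the tuition data can be sketched in Python as below. The helper names `outlier_fences` and `is_outlier` are made up for this illustration, and the three test values passed to `is_outlier` are arbitrary, not taken from the data set.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower, upper) cutoffs: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(x, q1, q3, k=1.5):
    """True if x falls outside the interval (Q1 - k*IQR, Q3 + k*IQR)."""
    low, high = outlier_fences(q1, q3, k)
    return x < low or x > high

# Quartiles of the tuition data from the slide
low, high = outlier_fences(6108, 11660)
print("lower cutoff:", low, "upper cutoff:", high)

# Arbitrary illustrative values, not observations from the data set
for tuition in (2250, 19700, 21000):
    print(tuition, "outlier?", is_outlier(tuition, 6108, 11660))
```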
35
Another Plot Type – Box plots
• Box plots are designed to show clearly the center, spread
(especially IQR), and outliers
• They are based on the five-number summary
• Minimum, Q1, Median, Q3, Maximum
• Easiest to explain with an example, using the tuition data.
36
Box plot of Per Capita Income
[Figure: box plot of income; vertical axis from 20,000 to 60,000]
From Stata Documentation
. summarize income, detail
                        income
      Percentiles   Smallest
 1%      26535        26535
 5%      27935        27897
10%      29515        27935      Obs                 51
25%      31891        29108      Sum of Wgt.         51
50%      34257                   Mean          35470.67
                     Largest     Std. Dev.     5734.745
75%      38712        45877
90%      42392        46344      Variance      3.29e+07
95%      46344        49852      Skewness      1.275147
99%      55755        55755      Kurtosis       5.07413
37
US News and World Report
1995 College and University Rating
[Figure: histogram of Out of State (Non-resident) Tuition, $0 to $25,000]
Histogram shows relative frequency of observations, shape in the center
[Figure: boxplot of 1995 Out of State Tuition, $0 to $25,000]
Boxplot shows center, spread and outliers
38
Outliers are often data errors
One value of height on Stat 104 poll entered as 5.2 inches
[Figure: boxplot of Height in inches; vertical axis from 0 to 80]
39
Some terminology
• Moore and McCabe define a boxplot where the lines extend from
the box out to the smallest and largest observations (pg. 38)
• Moore and McCabe define a modified boxplot where the lines
extend out from the box only to the largest and smallest
observations that are not suspected outliers (pg. 39)
• Stata produces `modified` boxplots, with * used to mark points
more than 1.5 × IQR above the 75th percentile or below the 25th
percentile. We will always use modified boxplots in this course
(we will simply call them boxplots).
40
Changing the units of measurement
• What happens to the mean, standard deviation, etc… of a
set of data if one changes the scale, or units, of
measurements?
• For example, moving from English to metric scales
• Fahrenheit to Celsius
• feet to meters
• Calories to joules
• stones to kilograms
• Hopefully you get the idea
41
Effects of linear transformation
$y_i = a + b x_i$
• mean (Y) = a + b × mean (X)
• median (Y) = a + b × median (X)
• variance (Y) = b² × variance (X)
• SD (Y) = |b| × SD (X)
• IQR (Y) = |b| × IQR (X)
In this notation, $x_i$ and $y_i$ are used for individual
observations, and X and Y are used for the collections of
$x_i$’s and $y_i$’s
42
Direct and simple transformation from
Fahrenheit to Celsius
$$\text{cel} = \frac{5}{9}(\text{fahr} - 32) = \frac{5}{9}\,\text{fahr} - \frac{160}{9}$$
Example of a simple linear transformation
43
Great Blue Hill Weather Observatory
Milton, MA
Oldest continuous weather
record in North America
44
Blue Hill Average Daily High Temps, by Month, 1885 to Present
Month   Max_Fahr   Max_Cels
Jan     33.6       0.89
Feb     34.6       1.44
Mar     43.2       6.22
Apr     54.6       12.56
May     66.3       19.06
Jun     74.6       23.67
Jul     79.9       26.61
Aug     77.7       25.39
Sep     70.7       21.50
Oct     60.5       15.83
Nov     48.6       9.22
Dec     37.2       2.89

June   High Temp (2010)   High Temp (2009)
1      81                 65
2      81                 73
3      80                 64
4      84                 71
5      83                 61
6      79                 69
7      72                 75
8      69                 69
9      70                 57
10     57                 56
11     68                 55
12     67                 69
13     64                 72
14     76                 60
15     78                 58
16     75                 60
17     72                 67
18     86                 62
19     83                 69
20     86                 73
Mean   75.55              65.25
45
`Testing’ the formulas…
. summ max_f max_c
Variable   Obs   Mean       Std. Dev.   Min    Max
max_f      12    56.79167   17.20647    33.6   79.9
max_c      12    13.77333   9.560084    .89    26.61
$$\text{cel} = \frac{5}{9}\,\text{fahr} - \frac{160}{9}$$
$$\text{mean}_{\text{cel}} = \frac{5}{9}\,\text{mean}_{\text{fahr}} - \frac{160}{9} = \frac{5}{9}(56.79) - \frac{160}{9} = 13.77$$
We’ll be using the transformations in a bit…
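The same check can be sketched in Python using the twelve monthly Fahrenheit values from the Blue Hill table. Here the Celsius values are computed without rounding, so the rule SD(Y) = |b| × SD(X) holds up to floating-point error; the Celsius column on the slide is rounded to two decimals, which is presumably why its reported standard deviation differs slightly from 5/9 of the Fahrenheit one.

```python
from statistics import mean, stdev

# Monthly average daily high temperatures (Fahrenheit), Jan through Dec
max_fahr = [33.6, 34.6, 43.2, 54.6, 66.3, 74.6,
            79.9, 77.7, 70.7, 60.5, 48.6, 37.2]

# Linear transformation y = a + b*x with a = -160/9 and b = 5/9
a, b = -160 / 9, 5 / 9
max_cels = [a + b * f for f in max_fahr]

mean_f, mean_c = mean(max_fahr), mean(max_cels)
sd_f, sd_c = stdev(max_fahr), stdev(max_cels)

print("mean:", round(mean_c, 2), "vs a + b*mean(X) =", round(a + b * mean_f, 2))
print("SD:  ", round(sd_c, 3), "vs |b|*SD(X)    =", round(abs(b) * sd_f, 3))
```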
46
Density Curves
• Density curves (a theoretical construct) are often used to
approximate an observed distribution of values in a data set
• The plots on the following slides come from IPS, section 1.3
(Colors come from previous edition, content is the same)
• The data are the Iowa Test vocabulary scores for
n = 947 Gary, Indiana 7th graders.
• The test yields a grade equivalent vocabulary score
47
Using a density curve to approximate a
population distribution
• Histograms show distribution of the
actual collected data, directly.
• While informative, histograms can
be awkward to manipulate to
compute proportions.
• Density curves can be used to
approximate proportions of a
population within a range of values
• In this Gary test score data, the true
proportion (relative frequency) of
actual scores at less than a 6th grade
equivalency is 0.303 (= 287/947)
48
Using a density curve to approximate a
population distribution
• Plot at right shows a normal
density curve (bell shaped curve)
used to approximate this
proportion of the histogram
• By convention, all densities are
constructed so that the area
under the density curve is 1.
• Mathematical formulas or tables
can be used to show the shaded
area is 0.293, close to the
histogram’s real proportion of
0.303
49
The plots side by side…
[Figure: the two plots side by side. Density curve: shaded area = 0.293. Histogram: proportion = 287/947 = 0.303]
50
Percent vs. Density Scale (vertical axis)
[Figure: two histograms of var1 (10000 observations, 0 to 10), one on the Percent scale and one on the Density scale]
• Whenever a histogram and an approximating density function are shown
on the same plot, the histogram is drawn on the `density’ scale
• In the density scale for the vertical axis, the total area under the
histogram is 1 (shape is unchanged).
• This is just a technical detail; don’t get too hung-up on it

51
Density Curves
• Always on or above the horizontal axis (never negative)
• Area under the curve equals 1
• Median is the point where 50% of area is to the left
and 50% is to the right.
• Mean is the “balance point”
52
Not all distributions and density
functions are symmetric
[Figure: two panels showing the right-skewed distribution of Number of Deaths per 1000 live births (0 to 200), one of them on the Density scale]
But everybody’s favorite density curve is…

53
The Normal Density Curve (in blue, below)
A family of symmetric, bell shaped densities with single
peak and `nice’ mathematical properties
[Figure: histogram of 400 observations drawn randomly from a Normal density, with the Normal density curve overlaid; horizontal axis from -4 to 4, vertical axis Density]
Formula (no need to memorize):
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]$$
54
Some properties of a Normal Density Curve
[Figure: histogram of var1 (10000 observations, 0 to 10) on the Density scale]
• Think of a normal density as the `histogram’ for a hypothetically
infinite sized population
• Interesting duality in interpretation of normal density
• When describing a distribution of numbers observed in a
study, the normal density sometimes is used to approximate
a histogram of the data set.
• When trying to learn about a large population using a small
data set, the observed histogram is thought of as an
approximation to the normal density corresponding to the
entire population
• Normal density has a `theoretical’ mean and standard deviation
55
Changing the location and scale of a
Normal Density
Formula for a normal density with mean μ and
variance σ² (or equivalently standard deviation σ)
is:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]$$
IPS calls this “~ N(μ, σ)”
(distributed as normal with mean ‘μ’ and std. dev. ‘σ’)
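As a rough illustration, the density formula can be typed out directly in Python and compared with the `pdf` method of `statistics.NormalDist` from the standard library. The function name `normal_density` is made up for this sketch, and the values μ = 100, σ = 15 are just an example pair.

```python
import math
from statistics import NormalDist

def normal_density(x, mu=0.0, sigma=1.0):
    """Height of the N(mu, sigma) density curve at x (sigma is the SD)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Peak of the standard normal curve, at x = mu = 0
print(normal_density(0))

# Changing location and scale: compare with the library for mu = 100, sigma = 15
print(normal_density(115, mu=100, sigma=15), NormalDist(100, 15).pdf(115))
```

A larger σ spreads the curve out and lowers its peak, while changing μ only shifts the curve left or right.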
56
The Normal Distribution
• The normal density curves comprise a family that describe the `normal
distributions’
• The curves are symmetric, unimodal, and bell shaped
(Unimodal = one peak)
[Figure: Normal Distributions: density curves for N(0, 1), N(0, 2), and N(1, 1); x from -4 to 4, y from 0.0 to 0.4]
57
Normal Distributions
• The distributions are represented symbolically as N(μ,σ) – the
normal distribution with mean μ and standard deviation (sd) σ
• All normal distributions follow the 68-95-99.7 rule:
• 68% of observations fall within one σ of μ;
• 95% fall within 2σ of μ;
• 99.7% fall within 3σ of μ
• Next graph from IPS, pp 70-71 illustrates this for hypothetical
`standard normal’, μ = 0, σ = 1.
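The 68-95-99.7 rule can be checked numerically with a short Python sketch using `statistics.NormalDist` (available in Python 3.8 and later). The rule's percentages are rounded versions of the areas computed here.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mu = 0, sigma = 1

# Area within k standard deviations of the mean: P(-k < Z < k)
within = {k: Z.cdf(k) - Z.cdf(-k) for k in (1, 2, 3)}

for k, area in within.items():
    print(f"within {k} sd: {area:.4f}")
```

Because any N(μ,σ) variable can be standardized, the same three areas apply to every normal distribution.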
58
59
Standardizing Normal Distributions
• All normal distributions have the same shape with possibly
different center μ and spread (sd) σ.
• If Z has a N(0,1) distribution, then
• σZ is N(0,σ), (written σZ ~ N(0,σ))
• X = μ + σZ is N(μ,σ)
• Conversely, if X is N(μ,σ) distributed, then
• Z = (X - μ)/σ has a N(0,1) distribution
• The last relationship leads to standardization and the use of the
standard normal distribution, N(0,1), to compute relative
frequencies…more on this in a bit
• Note that the simple transformation Z = (X – μ)/σ calculates the
distance of an observation from the mean, measured in number
of standard deviations, σ. This is often referred to as a
Z-score.
60
Example: Distribution of IQ scores
IQ tests calibrated
so that the mean:
μ = 100, and the
SD: σ = 15…
and the scores
are normally
distributed
Image from
Wikipedia
61
Standardizing
• Standardizing helps in interpreting observations
• An individual with IQ = 130 is
• (130 – 100)/15 = 2.0 σ’s above the mean
• Higher than 97.5% of population. Why?
• An individual with IQ = 115 is 1.0 σ’s above mean
• Higher than 100 - 32/2 = 84% of population
62
More Questions That We Can
Answer
(the possibilities are endless!)
1) What fraction of the population score no higher than 95
on an IQ test?
2) What fraction score at least 105?
3) What IQ score will place a person in the top 10%?
4) Einstein was thought to have an IQ of about 160. What
proportion of people have an IQ higher than Big Al?
63
Standardizing the normal
distribution
The first question wants to find P(Y ≤ 95) where:
• Y denotes IQ score and P denotes either proportion of population with
IQ less than or equal to 95, or probability of randomly selecting a
member of the population whose IQ would be no higher than 95.
• Convert P(Y ≤ 95) to a statement about a standard normal
distribution…so we need to standardize/z-score:
$$P(Y \le 95) = P\left(\frac{Y - \mu_Y}{\sigma_Y} \le \frac{95 - 100}{15}\right) = P(Z \le -0.333) = \text{???}$$
• Next step…find the area under the curve (use a table!)

64
Note use of term `probability’ here instead of proportion
65
66
2) What fraction score at least 105?
$$P(Y \ge 105) = P\left(\frac{Y - \mu_Y}{\sigma_Y} \ge \frac{105 - 100}{15}\right) = P(Z \ge 0.333) = 0.3707$$
Why???
67
68
3) What IQ score will place a person in the
top 10%?
• Want to find x0 such that P(X > x0) = 0.10
• Think of X as 100 + 15 × Z, with Z ~ N(0,1)
• Find z0 such that P(Z > z0) = 0.10, then compute
• x0 = 100 + 15 × z0
• Because Table A only gives area to the left, need to state
this problem as: what value z0 has area 0.9 to the left?
z0 = 1.28
x0 = 100 + 15(1.28) = 119.2
69
4) Einstein was thought to have an IQ of about
160. What proportion of people have an IQ higher
than Big Al?
$$P(Y > 160) = P\left(\frac{Y - \mu_Y}{\sigma_Y} > \frac{160 - 100}{15}\right) = P(Z > 4.0)$$
= ____________
Look it up in our friend ____________________
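The four IQ questions can also be worked through in software rather than with a printed table. The sketch below uses Python's `statistics.NormalDist` with μ = 100 and σ = 15; the variable names are made up for illustration. Software uses the unrounded z-score (for example -0.3333... rather than -0.33), so its answers can differ from table lookups in the third decimal place.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)  # Y ~ N(100, 15)

# 1) Fraction scoring no higher than 95: area to the left of 95
p_at_most_95 = iq.cdf(95)

# 2) Fraction scoring at least 105: area to the right of 105
p_at_least_105 = 1 - iq.cdf(105)

# 3) Score that places a person in the top 10%: value with area 0.90 to its left
top_10_cutoff = iq.inv_cdf(0.90)

# 4) Fraction scoring above 160, i.e. P(Z > 4.0)
p_above_160 = 1 - iq.cdf(160)

print(f"P(Y <= 95)  = {p_at_most_95:.4f}")
print(f"P(Y >= 105) = {p_at_least_105:.4f}")
print(f"top 10% cutoff = {top_10_cutoff:.1f}")
print(f"P(Y > 160)  = {p_above_160:.6f}")
```

Questions 1 and 2 give the same answer because 95 and 105 are equally far from the mean and the normal curve is symmetric.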
70
Some Properties of the Normal
Distribution
• If Y ~ N(μ,σ), the area less than some value a (or b) is
denoted by P(Y < a). This anticipates later material on
probability
• For now, think of P ( ) as a proportion or relative frequency
• If Y ~ N(0,1) then P(Y < a) can be found by using the table
at the front of the book
1) Area sums to one:
P(Y ≤ a) + P(Y > a) = 1
2) Symmetric
P(Y < -a) = P(Y > +a)
71
Properties….
3) Areas within two bounds (upper and lower)
• P(a < Y < b) = P( Y < b) - P(Y < a)
• An IQ Y score between 95 and 105 has probability
• P(95 < Y < 105) = P( Y < 105) - P(Y < 95)
= 0.6293 – 0.3707 = 0.2586
4) No area at a single point!
• So P(Y < b) = P(Y ≤ b)
• P(a < Y < b) = P(a ≤ Y ≤ b) = P(a < Y ≤ b), etc
• This is only true for theoretical density curves
72
Some important Z values
• What are the Z values with 0.10, 0.05 and 0.025 in the lower tail of the
N(0,1) distribution?
• From Table A
Prob (Z < -1.28) = 0.10
Prob (Z < -1.645) = 0.05
Prob (Z < -1.96) = 0.025
• By symmetry
Prob (Z > 1.28) = 0.10
Prob (Z > 1.645) = 0.05
Prob (Z > 1.96) = 0.025
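These tail values can be recovered without a table by inverting the standard normal cumulative distribution. A minimal Python sketch with `statistics.NormalDist.inv_cdf`:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal, N(0, 1)

# z value with the given area in the lower tail
lower_tail_z = {p: Z.inv_cdf(p) for p in (0.10, 0.05, 0.025)}

for p, z in lower_tail_z.items():
    # By symmetry, the upper-tail value is the same number with the sign flipped
    print(f"P(Z < {z:.3f}) = {p}   and   P(Z > {-z:.3f}) = {p}")
```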
73
Outliers and the Normal Distribution
• The IQR for the normal distribution is
• 75th – 25th percentile = 0.675 – (-.675) = 1.35
• 1.5 × IQR = 1.5 × 1.35 = 2.02
• 75th percentile + (1.5 × IQR) = 2.69
• Probability (Z > 2.69) = .003 = 0.3%
• Z has a standard normal distribution (mean 0, sd 1)
• Probability here equivalent to area under the normal density
to the right of 2.69
• So with a normal distribution, large outliers should occur less
than ½ of 1% of the time. Same for small outliers. But we DO
expect outliers if our sample size is large enough!
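The calculation above can be repeated with unrounded quartiles as a Python sketch. The variable names are made up for illustration; the result agrees with the slide's rounded figures to within rounding.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal, N(0, 1)

q1, q3 = Z.inv_cdf(0.25), Z.inv_cdf(0.75)  # 25th and 75th percentiles
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr               # cutoff for large outliers

# Area under the normal density to the right of the cutoff
p_large_outlier = 1 - Z.cdf(upper_fence)

print(f"IQR = {iqr:.3f}, upper cutoff = {upper_fence:.3f}")
print(f"P(large outlier) = {p_large_outlier:.4f}")

# Expected number of outliers (both tails) in a sample of n = 500
print("expected outliers in 500 observations:", round(2 * 500 * p_large_outlier, 1))
```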
74
What’s Normal?
• How to decide if data can
be assumed normally
distributed?
• Histograms and boxplots
help to rule out normality
on the basis of asymmetry
(for histograms) and
outliers (for boxplots)
A normal quantile plot is specially designed for the purpose. We
will not cover this in this course. Please read the text if you are
interested in learning about this plot.
General Concept: Plot observed data against what you’d expect if
the data were normally distributed.
75
Examples from 1st Day Survey (last year)
[Figure: three histograms on the Density scale: Height (55 to 80), # Text Messages/Month (0 to 1500), and Random Numbers (0 to 100)]
76
Detecting non-Normality in Graphs
• Histograms are very good at showing lack of symmetry
(skewness)
• Box plots are very good at showing outliers
• Outliers are rare but not impossible in normally
distributed data – 0.3% outliers expected in each tail
(upper and lower)
• In a sample of 500, would expect around 3 total outliers
• Normal quantile plots (not covered in this course) are very
good at showing non-normality, but are more trouble than
they are worth
• One last example in Stata (S&P 500 Index Daily
Changes)…
77
Unit Recap
• What is statistics?
• Types of data
• Summarizing Data
• Graphically
• Bar plots (Categorical)
• Histograms (Quantitative)
• Box plots (Quantitative)
• Details of a graph are important
• Vertical axis scale
• Number of bins in histogram
78
Unit Recap
• Summarizing Data (cont.)
• Numerically
• Proportions
• Center (mean, median)
• Spread (std.dev./variance, IQR)
• Shape (skewness, outliers)
• Density Curves and the Normal Distribution
– The “68-95-99.7 rule”
– Standardizing the normal distribution (z-score)
• Stata is your friend!
79
