Unit 01 - Describing Data and Its Distributions - 4 Per Page

6/25/2012
Methods for Describing Data

IPS Sections 1.1 – 1.3 (Chapter 1 in IPS)

Unit 1: Univariate Data
• Data and their distributions
– Graphical displays of data
– Descriptive summaries of data
– Center, Spread, Shape
• Changing units of measurement (linear transformations)
• Density curves and the Normal distribution
– The “68-95-99.7 rule”
– Standardizing the normal distribution
Variables and their distributions

• Variable: Any characteristic of an individual that takes different values for different individuals
• Distribution: Describes the values a variable takes and how frequently these values occur.
• The distribution of a variable can be described graphically and/or numerically
• Main features
  • proportion of individuals with each value
  • “shape”, “center” and “spread”.
• Two types of variables: categorical and quantitative
• Simplest variable is a categorical variable
  • a variable that takes on a few discrete values that usually have no natural numerical coding

Results of the 2012 Election Poll is a categorical variable
[Figure: bar graph, Gallup Poll: 2012 Election, June 23, 2012; categories Obama, Romney, Undecided; vertical axis from 0 to 50.]
Bar Graph

Stat 104 - Spring 2011
Fastest Speeds Ever Driven, in mph
[Figure: bar graph of counts by gender (Male, Female) for the categories Never Driven, 1 to 80, 81 to 100, 101 or faster.]

Spring 2011
73 male respondents, 65 female respondents
[Figure: the same categories shown two ways: raw counts (top) and % within gender (bottom).]
Quantitative (Numerically-valued) variables
• Quantitative variables can assume
  • A small number of possible values (ex: number of heads in 3 coin tosses): Discrete
  • A large number of values (ex: age or income of a survey respondent): Continuous
• Many displays for quantitative variables
  • Histogram (most common)
  • Boxplot
• Example: per capita income (2006), 50 states plus District of Columbia

Building a histogram
1) Start by defining your bins. Look at the range of values, and choose bins so that you end up with a reasonable number of them. In this case, I chose bins of $2,500 starting at $25,000.
2) Then go through the list (in this case starting with Alabama, with income of $31,295), adding a block onto the respective bin to represent that observation.
3) Go through the list for all your observations.
Thus, the histogram represents how many observations fall within the defined bins.
The shape of your histogram is therefore affected by how you define your bins.
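The binning steps can be sketched in a few lines of Python. This is only an illustration: the course itself uses Stata, the helper name bin_counts is made up here, and only the four state incomes quoted on these slides are used (not the full 51-state data set).

```python
def bin_counts(values, start, width):
    """Count how many values fall in each bin [left, left + width)."""
    counts = {}
    for v in values:
        k = int((v - start) // width)      # index of the bin this value lands in
        left = start + k * width           # left edge of that bin
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))

# The four incomes shown on the slides (2006 per capita income, in dollars).
incomes = {"Alabama": 31295, "Alaska": 37271, "Arizona": 31458, "Wyoming": 40676}

# Bins of $2,500 starting at $25,000, as chosen on the slide.
counts = bin_counts(incomes.values(), start=25000, width=2500)
for left, c in counts.items():
    print(f"[{left}, {left + 2500}): {'#' * c}")
```

Rerunning with a different start or width moves observations between bins, which is why the shape of a histogram depends on how the bins are defined.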
Histograms: Per Capita Income (in dollars)
[Figure: the histogram built up one state at a time. #1: Alabama = $31,295; #2: Alaska = $37,271; #3: Arizona = $31,458; …; #51: Wyoming = $40,676. x-axis: income (25000 to 60000); y-axis: Frequency.]

Histograms of the (n = 51) states (varying bins and y-axis)
[Figure: four histograms of income (20000 to 60000) drawn with different bin widths; the y-axis is Frequency in the top panels and Percent in the bottom panels.]
Histograms can be sensitive to definition of the bins (IPS, Ex 1.4)

Measures of Center
• Mean and median are the two most common measures of center of a distribution
• Mean, denoted x̄, is the simple arithmetic average (formula coming up)
• Mean of the set of numbers {1, 1, 5, -1} is
  • x̄ = (1 + 1 + 5 - 1) / 4 = 6 / 4 = 1.5
Algebraic formula for the mean

\[ \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

Some notes about this formula…
• Assuming a sample of n individuals indexed by i = 1, 2, 3, …, n
• “xi” denotes the variable measurement on each person in the sample
• “∑” denotes the summation operator
• “Bar” notation denotes average (we say “x-bar”)

Per capita income by state (n = 51)
[Figure: histogram of income (20000 to 60000); y-axis: Percent.]
Mean = $35,470/person
Median: another measure of center

• Mean is sensitive to presence of large observations
• Think of
  • mean of {1, 3, 5} = 3
  • mean of {1, 3, 20} = 8
• Median is the middle number in the set of observations and is not sensitive to ‘extreme’ observations
• Sort the observations from smallest to largest
• If there is an odd number of observations, median is the middle number
• If an even number of observations, median is the average of the two values `straddling’ the middle
• Ex.1: {1, 2, 3, 6}: median = 2.5, mean = 3
• Ex.2: {1, 2, 3, 6, 500}: median = 3, mean = 102.4

Per capita income by state (n = 51)
[Figure: histogram of income (20000 to 60000); y-axis: Percent.]
Mean = $35,470/person
Median = $34,257/person
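Ex.1 and Ex.2 on the median slide can be checked with Python's standard statistics module. This is a sketch for illustration only (the course software is Stata, so the choice of Python is an assumption).

```python
from statistics import mean, median

ex1 = [1, 2, 3, 6]
ex2 = [1, 2, 3, 6, 500]   # same data plus one extreme observation

results = {}
for name, data in (("Ex.1", ex1), ("Ex.2", ex2)):
    results[name] = (median(data), mean(data))
    print(f"{name}: median = {results[name][0]}, mean = {results[name][1]}")
```

Adding the single value 500 moves the median from 2.5 to 3 but moves the mean from 3 to 102.4.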
Idealized right-skewed distribution
Mean larger than Median
In a 1998 salary survey of Harvard’s 1973 entering class, mean salary was $750,000 compared with a median of $175,000… why?

Idealized Symmetric Distribution
Mean and median are the same
Effect of Shape on Mean and Median

• In a right skewed distribution, the mean is greater than the median
• In a left skewed distribution, the mean is less than the median
• In a symmetric distribution the mean is approximately (sometimes exactly) equal to the median

Histograms show shape
[Figure: three histograms labeled Skewed to the left, Skewed to the right, and Symmetric.]
Some examples in Stata
Per capita income by state (n = 51)
[Figure: histogram of income (20000 to 60000); y-axis: Percent.]
Mean = $35,470/person
Median = $34,257/person
Skewed-right

Measuring Spread (Variability) in Data
Two common methods
1. Variance and standard deviation
  • Measure spread about the mean
  • Most often used, but also sensitive to large values in skewed distributions
2. Quantiles and percentiles
  • Median
  • Quartiles and more general percentiles
The variance of a set of data

• The “center” of a group of observations can be measured by the mean
• The variability of a single observation xi can be measured by its distance from the center (e.g. mean): (xi − x̄)
• Since we want this to always be a positive number, this distance is converted to (xi − x̄)²
• The “average” of these “squared deviations from the mean” is used as a measure of variability

Variance
• The variance is the “average” of squared deviations from the mean
• If there are n observations x1, x2, …, xn, then the variance is

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
Standard Deviation

• The standard deviation (SD) is the square root of the variance

\[ s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]

• Note:
  The SD is in the original units of measurement
  The variance is in the (original units)²

Example: variance & standard deviation

• LeBron James’s points scored in the NBA finals (5 games) were: {30, 32, 29, 26, 26}.
  http://www.basketball-reference.com/players/n/nowitdi01/gamelog/2011/
• Calculate the mean:

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{30 + 32 + 29 + 26 + 26}{5} = 28.6 \]

• Calculate the variance:

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{(30 - 28.6)^2 + (32 - 28.6)^2 + \dots + (26 - 28.6)^2}{5 - 1} = \frac{1.96 + 11.56 + \dots + 6.76}{4} = 6.8 \text{ points}^2 \]

• Calculate the standard deviation:

\[ s = \sqrt{s^2} = \sqrt{6.8 \text{ points}^2} = 2.608 \text{ points} \]
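The points-scored example can be reproduced step by step. The sketch below is in Python (an assumption; the course uses Stata) and computes the variance by hand with the n − 1 divisor, then compares it with the standard library's variance function, which uses the same divisor.

```python
from statistics import mean, variance

points = [30, 32, 29, 26, 26]   # points scored in the 5 games

xbar = mean(points)                              # sample mean
squared_devs = [(x - xbar) ** 2 for x in points] # squared deviations from the mean
s2 = sum(squared_devs) / (len(points) - 1)       # divide by n - 1, not n
s = s2 ** 0.5                                    # SD is the square root of the variance

print(f"mean = {xbar}, variance = {s2:.1f} points^2, SD = {s:.3f} points")
print(f"library variance = {variance(points):.1f}")
```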
[Figure: four histograms of var1, var2, var3 and var4; x-axis from 4 to 18; y-axis: Percent of Observations.]

. summ

Variable    Obs       Mean   Std. Dev.        Min        Max
var1        200   9.078236   1.038261    6.689424    11.7296
var2        200    10.9526   1.012461    7.813696    13.5883
var3        200    9.03134   2.002863    4.407923   14.48719
var4        200   11.13801   1.924267    6.302924   16.35018

A confusing detail

• Why divide by n – 1?
• Relatively easy to show that the deviations from the mean sum to zero:

\[ \sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \bar{x} = n\bar{x} - n\bar{x} = 0 \]

• So once you solve for the mean, all you have left are n – 1 pieces of information, in essence. That’s the amount of information remaining to do the next calculation on.
• More simply, what is the st. dev. (SD), or average spread around the mean, of the following sets of numbers:
  a) {1, 2, 3}   b) {1}
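Both points about dividing by n – 1 can be checked numerically. This Python sketch (Python is an assumption; the course uses Stata) confirms that deviations from the mean sum to zero for the points data {30, 32, 29, 26, 26}, and tries sets a) and b) with the standard library's stdev, which divides by n − 1.

```python
from statistics import StatisticsError, mean, stdev

data = [30, 32, 29, 26, 26]
xbar = mean(data)
total_deviation = sum(x - xbar for x in data)   # zero, up to floating-point rounding
print(f"sum of deviations = {total_deviation:.10f}")

sd_a = stdev([1, 2, 3])    # set a): three observations
print(f"SD of {{1, 2, 3}} = {sd_a}")

try:
    stdev([1])             # set b): one observation, so n - 1 = 0
    single_point_has_sd = True
except StatisticsError:
    single_point_has_sd = False
print(f"SD defined for a single observation? {single_point_has_sd}")
```

With a single observation there is no information left about spread once the mean is known, so the library refuses to compute an SD.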
A more important detail: sensitivity to extreme values

• Standard deviation and variance (like the mean) can be sensitive to large observations
  • SD of {1, 3, 5} = 2
  • SD of {1, 3, 20} = 10.4
• Actually, even more sensitive than the mean…why?
• This issue will arise several times in the course…
• Standard deviation and mean lose natural interpretation in skewed data or data with outliers

Per capita income by state (n = 51)
[Figure: histogram of income (20000 to 60000); y-axis: Percent.]
Mean = $35,470/person
Median = $34,257/person
SD = $5,734/person
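The two small sets from the sensitivity slide, {1, 3, 5} and {1, 3, 20}, make the comparison concrete. A Python sketch (an assumption; the course uses Stata) that measures how much one large value inflates the mean versus the SD:

```python
from statistics import mean, stdev

base = [1, 3, 5]
with_outlier = [1, 3, 20]   # same set with the largest value replaced by 20

mean_ratio = mean(with_outlier) / mean(base)
sd_ratio = stdev(with_outlier) / stdev(base)

print(f"SD:   {stdev(base):.1f} -> {stdev(with_outlier):.1f}  (x{sd_ratio:.2f})")
print(f"mean: {mean(base)} -> {mean(with_outlier)}  (x{mean_ratio:.2f})")
```

One hint toward the "why": the SD is built from squared deviations, so a single large deviation counts for much more than it does in the mean.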
Percentiles of a distribution

• The pth percentile of a distribution is that value such that p% of the observations fall at or below it.
• The 25th percentile is the value with 25% of the observations at or below it, 75% above
  • It is called the first quartile Q1,
  • the 50th percentile is the median M, and
  • the 75th percentile is the third quartile Q3
• Called a quantile when expressed as a proportion instead of percentage (25th percentile = .25 quantile)
• In a small set of numbers, it may not be possible to find exact values for the percentiles

Measuring Spread (IQR)
From Moore and McCabe (IPS)

• The first quartile Q1 is the median of the observations whose position in the ordered list is to the left of the location of the overall median.
• The third quartile Q3 is the median of the observations whose position in the ordered list is to the right of the location of the overall median.
• e.g., 1, 2, 3, 4, 5: Q1 = 1.5, M = 3, Q3 = 4.5
• Interquartile range, IQR = Q3 – Q1, is another measure of variability of the distribution
• The five-number summary of a distribution consists of
  • Min, Q1, M, Q3, Max
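The Moore and McCabe quartile rule can be written out directly. The sketch below is in Python (an assumption; the course uses Stata), the function name five_number_summary is made up for illustration, and it assumes at least two observations. Software packages often use other percentile conventions, so their quartiles can differ slightly from this rule.

```python
from statistics import median

def five_number_summary(values):
    """Min, Q1, M, Q3, Max using the rule: Q1 and Q3 are the medians of the
    observations to the left and right of the overall median's location."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]                                  # left of the median's location
    upper = xs[half + 1:] if n % 2 else xs[half:]      # right of the median's location
    return (xs[0], median(lower), median(xs), median(upper), xs[-1])

summary = five_number_summary([1, 2, 3, 4, 5])
iqr = summary[3] - summary[1]
print(summary, "IQR =", iqr)
```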
Five number summary of a distribution
[Figure: histogram of income (20000 to 60000); y-axis: Percent.]

. summarize income, detail

                         income
      Percentiles      Smallest
 1%        26535          26535
 5%        27935          27897
10%        29515          27935       Obs                  51
25%        31891          29108       Sum of Wgt.          51

50%        34257                      Mean           35470.67
                        Largest       Std. Dev.      5734.745
75%        38712          45877
90%        42392          46344       Variance       3.29e+07
95%        46344          49852       Skewness       1.275147
99%        55755          55755       Kurtosis        5.07413

Min = $26,535
Q1 = $31,891
Median = $34,257
Q3 = $38,712
Max = $55,755

US News and World Report 1995 College and University Rating
[Figure: histogram of Out of State (Non-resident) Tuition, x-axis from 0 to 25000.]

Percentile    1%     5%    10%    25%    50%     75%     90%     95%     99%
Tuition     2250   3811   4470   6108   8670   11660   15476   17720   19700
Shape - Detecting Outliers

• Moore and McCabe: an observation is an outlier if it falls more than
  • 1.5 x IQR below Q1 or
  • 1.5 x IQR above Q3
• i.e., outside the interval (Q1 - 1.5 x IQR, Q3 + 1.5 x IQR)
• Tuition Data:
  • Q1 = $6,108, Q3 = $11,660
  • 1.5 x IQR = 1.5 x (11,660 – 6,108) = $8,328
• So the criterion is: an observation
  • below 6,108 – 8,328 = −$2,220 (impossible) or
  • above 11,660 + 8,328 = $19,988
• There are no small outliers, but there are several large outliers…

Another Plot Type – Box plots

• Box plots are designed to show clearly the center, spread (especially IQR), and outliers
• They are based on the five-number summary
  • Minimum, Q1, Median, Q3, Maximum
• Easiest to explain with an example, using the tuition data.
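The 1.5 x IQR rule for the tuition data reduces to two subtractions and a multiplication. A Python sketch (an assumption; the course uses Stata), with a made-up helper name outlier_fences:

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower, upper) cutoffs; observations outside them are flagged as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles of the 1995 out-of-state tuition data, from the slide.
low, high = outlier_fences(q1=6108, q3=11660)
print(f"flag tuition below {low:,.0f} or above {high:,.0f}")
```

Because tuition cannot be negative, only the upper cutoff can flag anything here.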
Box plot of Per Capita Income
[Figure: box plot of income, axis from 20,000 to 60,000, shown next to the same summarize income, detail output that appears on the five-number summary slide.]

US News and World Report 1995 College and University Rating
[Figure: histogram (top) and box plot (bottom) of 1995 Out of State (Non-resident) Tuition, axis from 0 to 25,000. From Stata Documentation.]
• Histogram shows relative frequency of observations, shape in the center
• Boxplot shows center, spread and outliers
Outliers are often data errors
[Figure: box plot of Height in inches, axis from 0 to 80.]
One value of height on Stat 104 poll entered as 5.2 inches

Some terminology

• Moore and McCabe define a boxplot where the lines extend from the box out to the smallest and largest observations (pg. 38)
• Moore and McCabe define a modified boxplot where the lines extend out from the box only to the largest and smallest observations that are not suspected outliers (pg. 39)
• Stata produces `modified’ boxplots, with * used to mark points more than 1.5 × IQR above the 75th percentile or below the 25th percentile. We will always use modified boxplots in this course (we will simply call them boxplots).
Changing the units of measurement

• What happens to the mean, standard deviation, etc… of a set of data if one changes the scale, or units, of measurements?
• For example, moving from English to metric scales
  • Fahrenheit to Celsius
  • feet to meters
  • Calories to joules
  • stones to kilograms
• Hopefully you get the idea

Effects of linear transformation

yi = a + b xi
• mean (Y) = a + b × mean (X)
• median (Y) = a + b × median (X)
• variance (Y) = b² × variance (X)
• SD (Y) = |b| × SD (X)
• IQR (Y) = |b| × IQR (X)

In this notation, xi and yi are used for individual observations, and X and Y are used for the collections of xi’s and yi’s
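The transformation rules can be checked numerically with the Fahrenheit to Celsius conversion, which is linear with a = −160/9 and b = 5/9. The sketch below is in Python (an assumption; the course uses Stata) and uses the twelve monthly average high temperatures from the Blue Hill table in this unit. Each statistic is computed two ways: directly on the converted data, and by applying the rule to the Fahrenheit statistic.

```python
from statistics import mean, median, stdev, variance

a, b = -160 / 9, 5 / 9          # cel = a + b * fahr

# Blue Hill average daily high by month (Jan..Dec), in Fahrenheit.
max_f = [33.6, 34.6, 43.2, 54.6, 66.3, 74.6, 79.9, 77.7, 70.7, 60.5, 48.6, 37.2]
max_c = [a + b * f for f in max_f]

# (computed directly on Celsius data, computed via the rule from Fahrenheit data)
checks = {
    "mean":     (mean(max_c),     a + b * mean(max_f)),
    "median":   (median(max_c),   a + b * median(max_f)),
    "variance": (variance(max_c), b ** 2 * variance(max_f)),
    "SD":       (stdev(max_c),    abs(b) * stdev(max_f)),
}
for name, (direct, via_rule) in checks.items():
    print(f"{name:8s} direct = {direct:9.4f}   via rule = {via_rule:9.4f}")
```

Note that the shift a affects only the measures of center; the measures of spread depend on b alone.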
Great Blue Hill Weather Observatory
Milton, MA
Oldest continuous weather record in North America

Direct and simple transformation from Fahrenheit to Celsius

\[ \text{cel} = \frac{5}{9} (\text{fahr} - 32) = \frac{5}{9} \, \text{fahr} - \frac{160}{9} \]

Example of a simple linear transformation
Blue Hill Average Daily High Temps, by Month, 1885 to Present

Month   Max_Fahr   Max_Cels
Jan       33.6       0.89
Feb       34.6       1.44
Mar       43.2       6.22
Apr       54.6      12.56
May       66.3      19.06
Jun       74.6      23.67
Jul       79.9      26.61
Aug       77.7      25.39
Sep       70.7      21.50
Oct       60.5      15.83
Nov       48.6       9.22
Dec       37.2       2.89

June    High Temp (2010)   High Temp (2009)
1             81                 65
2             81                 73
3             80                 64
4             84                 71
5             83                 61
6             79                 69
7             72                 75
8             69                 69
9             70                 57
10            57                 56
11            68                 55
12            67                 69
13            64                 72
14            76                 60
15            78                 58
16            75                 60
17            72                 67
18            86                 62
19            83                 69
20            86                 73
Mean          75.55              65.25

`Testing’ the formulas…

. summ max_f max_c

Variable   Obs       Mean   Std. Dev.    Min     Max
max_f       12   56.79167   17.20647    33.6    79.9
max_c       12   13.77333   9.560084     .89   26.61

\[ \text{cel} = \frac{5}{9} \, \text{fahr} - \frac{160}{9} \]

\[ \text{mean}_{\text{cel}} = \frac{5}{9} \, \text{mean}_{\text{fahr}} - \frac{160}{9} = \frac{5}{9} (56.79) - \frac{160}{9} = 13.77 \]

We’ll be using the transformations in a bit…
Density Curves

• Density curves (a theoretical construct) are often used to approximate an observed distribution of values in a data set
• The plots on the following slides come from IPS, section 1.3 (Colors come from previous edition, content is the same)
• The data are the Iowa Test vocabulary scores for n = 947 Gary, Indiana 7th graders.
• The test yields a grade equivalent vocabulary score

Using a density curve to approximate a population distribution

• Histograms show distribution of the actual collected data, directly.
• While informative, histograms can be awkward to manipulate to compute proportions.
• Density curves can be used to approximate proportions of a population within a range of values
• In this Gary test score data, the true proportion (relative frequency) of actual scores at less than a 6th grade equivalency is 0.303 (= 287/947)
Using a density curve to approximate a population distribution

• Plot at right shows a normal density curve (bell shaped curve) used to approximate this proportion of the histogram
• By convention, all densities are constructed so that the area under the density curve is 1.
• Mathematical formulas or tables can be used to show the shaded area is 0.293, close to the histogram’s real proportion of 0.303

The plots side by side…
[Figure: histogram with shaded bars, Proportion = 287/947 = 0.303, next to the normal density curve with shaded Area = 0.293.]
Percent vs. Density Scale (vertical axis)
[Figure: two histograms of var1 (0 to 10), 10000 observations each; left panel on the Percent scale (0 to 8), right panel on the Density scale (0 to .4).]

• Whenever histograms and approximating density functions are shown on the same plot, the histogram is drawn on the `density’ scale
• In the density scale for the vertical axis, the total area under the histogram is 1 (shape is unchanged).
• This is just a technical detail; don’t get too hung-up on it

Density Curves

• Always positive
• Area under the curve equals 1
• Median is the point where 50% of area is to the left and 50% is to the right.
• Mean is the “balance point”
Not all distributions and density functions are symmetric
[Figure: histograms of Number of Deaths per 1000 live births (0 to 200), drawn on the Percent of Countries scale and on the Density scale.]
But everybody’s favorite density curve is…

The Normal Density Curve (in blue, below)

A family of symmetric, bell shaped densities with single peak and `nice’ mathematical properties

Formula (no need to memorize):

\[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right] \]

[Figure: histogram of 400 observations drawn randomly from a Normal density, x-axis from -4 to 4, with the normal density curve overlaid.]
Some properties of a Normal Density
[Figure: histogram of var1 (0 to 10), 10000 observations, on the Density scale (0 to .4).]

• Think of a normal density as the `histogram’ for a hypothetically infinite sized population
• Interesting duality in interpretation of normal density
  • When describing a distribution of numbers observed in a study, the normal density sometimes is used to approximate a histogram of the data set.
  • When trying to learn about a large population using a small data set, the observed histogram is thought of as an approximation to the normal density corresponding to the entire population
• Normal density has a `theoretical’ mean and standard deviation

Changing the location and scale of a Normal Density Curve

Formula for a normal density with mean μ and variance σ² (or equivalently standard deviation σ) is:

\[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right] \]

IPS calls this “~ N(μ, σ)” (distributed as normal with mean ‘μ’ and std. dev. ‘σ’)
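The normal density formula translates directly into code. The Python sketch below is an illustration only (the course uses Stata; the names normal_pdf and area are made up here). It evaluates the density and uses a simple midpoint sum to confirm numerically that the area under the curve is 1, whatever the values of μ and σ.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma) at x, where sigma is the standard deviation."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def area(f, lo, hi, steps=10000):
    """Approximate the area under f between lo and hi with a midpoint sum."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))

peak = normal_pdf(0)   # height of the standard normal curve at its center
total = area(lambda x: normal_pdf(x, mu=100, sigma=15), 100 - 8 * 15, 100 + 8 * 15)
print(f"peak of N(0,1) = {peak:.4f}, area under N(100,15) = {total:.6f}")
```

Changing mu slides the curve left or right; changing sigma stretches it and lowers the peak so that the total area stays at 1.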
The Normal Distribution

• The normal density curves comprise a family that describe the `normal distributions’
• The curves are symmetric, unimodal, and bell shaped (Unimodal = one peak)
[Figure: Normal Distributions; density curves of N(0, 1), N(0, 2) and N(1, 1) for x from -4 to 4.]

Normal Distributions

• The distributions are represented symbolically as N(μ, σ) – the normal distribution with mean μ and standard deviation (sd) σ
• All normal distributions follow the 68-95-99.7 rule:
  • 68% of observations fall within one σ of μ;
  • 95% fall within 2σ of μ;
  • 99.7% fall within 3σ of μ
• Next graph from IPS, pp 70-71 illustrates this for hypothetical `standard normal’, μ = 0, σ = 1.
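The 68-95-99.7 rule can be verified from the standard normal distribution itself. A Python sketch (an assumption; the course uses Stata and Table A) using NormalDist from the standard statistics module:

```python
from statistics import NormalDist

Z = NormalDist()   # standard normal: mean 0, sd 1

# Proportion of the distribution within k standard deviations of the mean.
within = {k: Z.cdf(k) - Z.cdf(-k) for k in (1, 2, 3)}
for k, p in within.items():
    print(f"within {k} sd of the mean: {100 * p:.2f}%")
```

The rule's 68, 95 and 99.7 are rounded versions of these proportions.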
Standardizing Normal Distributions

• All normal distributions have the same shape with possibly different center μ and spread (sd) σ.
• If Z has a N(0,1) distribution, then
  • σZ is N(0, σ), (written σZ ~ N(0, σ))
  • X = μ + σZ is N(μ, σ)
• Conversely, if X is N(μ, σ) distributed, then
  • Z = (X − μ)/σ has a N(0,1) distribution
• The last relationship leads to standardization and the use of the standard normal distribution, N(0,1), to compute relative frequencies…more on this in a bit
• Note that the simple transformation Z = (X – μ)/σ calculates the distance of an observation from the mean, measured in number of standard deviations, σ. This is often referred to as a Z-score.
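The two directions of standardizing are one-line functions. This Python sketch is illustrative (the course uses Stata; the function names are made up), and it uses the IQ calibration from this unit, μ = 100 and σ = 15, as example values.

```python
def z_score(x, mu, sigma):
    """Distance of x from the mean, in standard deviations: Z = (X - mu) / sigma."""
    return (x - mu) / sigma

def from_z(z, mu, sigma):
    """Inverse transformation: X = mu + sigma * Z."""
    return mu + sigma * z

z_130 = z_score(130, mu=100, sigma=15)   # an IQ of 130, in sd units
x_back = from_z(z_130, mu=100, sigma=15) # converting back recovers the score
print(f"z = {z_130}, back to original scale = {x_back}")
```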
Example: Distribution of IQ scores

IQ tests calibrated so that the mean: μ = 100, and the SD: σ = 15… and the scores are normally distributed
[Image from Wikipedia]

Standardizing

• Standardizing helps in interpreting observations
• An individual with IQ = 130 is
  • (130 – 100)/15 = 2.0 σ’s above the mean
  • Higher than 97.5% of population. Why?
• An individual with IQ = 115 is 1.0 σ’s above mean
  • Higher than 100 - 32/2 = 84% of population
More Questions That We Can Answer
(the possibilities are endless!)

1) What fraction of the population score no higher than 95 on an IQ test?
2) What fraction score at least 105?
3) What IQ score will place a person in the top 10%?
4) Einstein was thought to have an IQ of about 160. What proportion of people have an IQ higher than Big Al?

Standardizing the normal distribution

The first question wants to find P(Y ≤ 95) where:
• Y denotes IQ score and P denotes either proportion of population with IQ less than or equal to 95, or probability of randomly selecting a member of the population whose IQ would be no higher than 95.
• Convert P(Y ≤ 95) to a statement about a standard normal distribution…so we need to standardize/z-score:

\[ P(Y \le 95) = P\left( \frac{Y - \mu_Y}{\sigma_Y} \le \frac{95 - 100}{15} \right) = P(Z \le -0.333) = \; ??? \]

• Next step…find the area under the curve (use a table!)

Note use of term `probability’ here instead of proportion
2) What fraction score at least 105?

\[ P(Y \ge 105) = P\left( \frac{Y - \mu_Y}{\sigma_Y} \ge \frac{105 - 100}{15} \right) = P(Z \ge 0.333) = 0.3707 \]

Why???
3) What IQ score will place a person in the top 10%?

• Want to find x0 such that P(X > x0) = 0.10
• Think of X as 100 + 15 × Z, with Z ~ N(0,1)
• Find z0 such that P(Z > z0) = 0.10, then compute
  • x0 = 100 + 15 × z0
• Because Table A only gives area to the left, need to state this problem as: what value z0 has area 0.9 to the left?
  z0 = 1.28
  x0 = 100 + 15(1.28) = 119.2

4) Einstein was thought to have an IQ of about 160. What proportion of people have an IQ higher than Big Al?

\[ P(Y > 160) = P\left( \frac{Y - \mu_Y}{\sigma_Y} > \frac{160 - 100}{15} \right) = P(Z > 4.0) = \] ____________

Look it up in our friend ____________________
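The four IQ questions can also be answered without a table. The sketch below uses Python's statistics.NormalDist (an assumption; the slides use Table A and Stata) with the IQ calibration μ = 100, σ = 15. Because the code uses z = ±1/3 exactly rather than the table's rounded ±0.33, the answers to questions 1 and 2 come out near 0.369 instead of the tabled 0.3707.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)

p_at_most_95 = iq.cdf(95)            # 1) area to the left of 95
p_at_least_105 = 1 - iq.cdf(105)     # 2) area to the right of 105
top10_cutoff = iq.inv_cdf(0.90)      # 3) value with area 0.90 to its left
p_above_160 = 1 - iq.cdf(160)        # 4) area to the right of 160 (z = 4.0)

print(f"1) P(Y <= 95)  = {p_at_most_95:.4f}")
print(f"2) P(Y >= 105) = {p_at_least_105:.4f}")
print(f"3) top 10% starts at IQ = {top10_cutoff:.1f}")
print(f"4) P(Y > 160)  = {p_above_160:.2e}")
```

Questions 1 and 2 give the same answer because 95 and 105 are the same distance from the mean and the normal curve is symmetric.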
Some Properties of the Normal Distribution

• If Y ~ N(μ, σ), the area less than some value a (or b) is denoted by P(Y < a). This anticipates later material on probability
• For now, think of P( ) as a proportion or relative frequency
• If Y ~ N(0,1) then P(Y < a) can be found by using the table at the front of the book

1) Area sums to one:
   P(Y ≤ a) + P(Y > a) = 1
2) Symmetric
   P(Y < -a) = P(Y > +a)

Properties….

3) Areas within two bounds (upper and lower)
  • P(a < Y < b) = P(Y < b) - P(Y < a)
  • An IQ score Y between 95 and 105 has probability
  • P(95 < Y < 105) = P(Y < 105) - P(Y < 95) = 0.6293 – 0.3707 = 0.2586
4) No area at a single point!
  • So P(Y < b) = P(Y ≤ b)
  • P(a < Y < b) = P(a ≤ Y ≤ b) = P(a < Y ≤ b), etc
  • This is only true for theoretical density curves
Some important Z values

• What are the Z values with 0.10, 0.05 and 0.025 in the lower tail of the N(0,1) distribution?
• From Table A
  Prob (Z < -1.28) = 0.10
  Prob (Z < -1.645) = 0.05
  Prob (Z < -1.96) = 0.025
• By symmetry
  Prob (Z > 1.28) = 0.10
  Prob (Z > 1.645) = 0.05
  Prob (Z > 1.96) = 0.025

Outliers and the Normal Distribution

• The IQR for the normal distribution is
  • 75th – 25th percentile = 0.675 – (-.675) = 1.35
  • 1.5 × IQR = 1.5 × 1.35 = 2.02
  • 75th percentile + (1.5 × IQR) = 2.69
• Probability (Z > 2.69) = .003 = 0.3%
  • Z has a standard normal distribution (mean 0, sd 1)
  • Probability here equivalent to area under the normal density to the right of 2.69
• So with a normal distribution, large outliers should occur less than ½ of 1% of the time. Same for small outliers. But we DO expect outliers if our sample size is large enough!
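The outlier calculation for the standard normal can be repeated without rounding the quartiles. A Python sketch using statistics.NormalDist (an assumption; the slides use Table A):

```python
from statistics import NormalDist

Z = NormalDist()                        # standard normal

q1, q3 = Z.inv_cdf(0.25), Z.inv_cdf(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr            # large outliers lie above this value
p_large = 1 - Z.cdf(upper_fence)        # chance that one observation is a large outlier
expected_in_500 = 500 * 2 * p_large     # both tails, sample of 500

print(f"IQR = {iqr:.3f}, upper fence = {upper_fence:.3f}")
print(f"P(large outlier) = {100 * p_large:.2f}%")
print(f"expected outliers in a sample of 500: {expected_in_500:.1f}")
```

The unrounded tail probability is a little above the slide's rounded 0.3%, still well under ½ of 1%, and in a sample of 500 it works out to a handful of flagged points across both tails.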
What’s Normal?

• How to decide if data can be assumed normally distributed?
• Histograms and boxplots help to rule out normality on the basis of asymmetry (for histograms) and outliers (for boxplots)

A normal quantile plot is specially designed for the purpose. We will not cover this in this course. Please read the text if you are interested in learning about this plot.

General Concept: Plot observed data against what you’d expect if the data were normally distributed.

Examples from 1st Day Survey (last year)
[Figure: density-scale histograms of Height (55 to 80), # Text Messages/Month (0 to 1500), and Random Numbers (0 to 100).]
Detecting non-Normality in Graphs

• Histogram very good at showing lack of symmetry (skewing)
• Box plots very good at showing outliers
  • Outliers are rare but not impossible in normally distributed data – 0.3% outliers expected in each tail (upper and lower)
  • In a sample of 500, would expect around 3 total outliers
• Normal quantile plot (not covered in this course) is very good at showing non-normality, but is more trouble than it is worth
• One last example in Stata (S&P 500 Index Daily Changes)…

Unit Recap

• What is statistics?
• Types of data
• Summarizing Data
  • Graphically
    • Bar plots (Categorical)
    • Histograms (Quantitative)
    • Box plots (Quantitative)
  • Details of a graph are important
    • Vertical axis scale
    • Number of bins in histogram

Unit Recap

• Summarizing Data (cont.)
  • Numerically
    • Proportions
    • Center (mean, median)
    • Spread (std.dev./variance, IQR)
    • Shape (skewness, outliers)
• Changing units of measurement (linear transformations)
• Density Curves and the Normal Distribution
– The “68-95-99.7 rule”
– Standardizing the normal distribution (z-score)
• Stata is your friend!
