
Unit 1: Univariate Data

Chapter 1 in IPS

• Data and their distributions


• Changing units of measurement (linear transformations)
• Density curves and the Normal distribution

1
Methods for Describing Data
IPS Sections 1.1 – 1.3

• Data and their distributions


– Graphical displays of data
– Descriptive Summaries of data
– Center, Spread, Shape
• Changing units of measurement (linear transformations)
• Density Curves and the Normal distributions
– The “68-95-99.7 rule”
– Standardizing the normal distribution

2
Variables and their distributions
• Variable: Any characteristic of an individual that takes different
values for different individuals
• Distribution: Describes the values a variable takes and how
frequently these values occur.
• The distribution of a variable can be described graphically and/or
numerically
• Main features
• proportion of individuals with each value
• “shape”, “center” and “spread”.
• Two Types of variables: categorical and quantitative
• Simplest variable is a categorical variable
• a variable that takes on a few discrete values that usually
have no natural numerical coding

3
Results of the 2012 Election Poll is a categorical variable
Gallup Poll: 2012 Election, June 23, 2012
[Bar graph: percent of respondents (vertical axis, 0 to 50) for Obama, Romney, and Undecided]

4
Bar Graph
Stat 104 - Spring 2011
Fastest Speeds Ever Driven, in mph
[Bar graphs by gender (73 male respondents, 65 female respondents) across the categories Never Driven, 1 to 80, 81 to 100, 101 or faster. One panel shows raw counts; the other shows % within gender.]

6
Quantitative (Numerically-valued) variables

• Quantitative variables can assume either
• A small number of possible values (ex: number of heads
in 3 coin tosses): Discrete
• A large number of possible values (ex: age or income of a survey
respondent): Continuous
• Many displays for quantitative variables
• Histogram (most common)
• Boxplot
• Example: per capita income (2006), 50 states plus District
of Columbia

7
Building a histogram
1) Start by defining your bins. Look at the range of
values, and choose bins so that you have a reasonable
number of them. In this case, I chose bins of width
$2,500 starting at $25,000.
2) Then go through the list (in this case starting
with Alabama, with income of $31,295),
adding a block onto the respective bin to
represent that observation.
3) Continue through the list for all your observations.
Thus, the histogram represents how many
observations fall within the defined bins.
The shape of your histogram is therefore affected by
how you define your bins.
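The binning steps above can be sketched in a few lines of Python. This is only an illustration: it uses the four state incomes quoted on the slides (the other 47 are not listed here), and the helper name `bin_left_edge` is made up for this example.

```python
from collections import Counter

# Four of the 51 per capita incomes quoted on the slides.
incomes = {"Alabama": 31295, "Alaska": 37271, "Arizona": 31458, "Wyoming": 40676}

BIN_START = 25_000   # left edge of the first bin
BIN_WIDTH = 2_500    # width of every bin


def bin_left_edge(value, start=BIN_START, width=BIN_WIDTH):
    """Return the left edge of the bin [edge, edge + width) containing value."""
    return start + ((value - start) // width) * width


# Step 2 and 3: drop one "block" into the bin for each observation.
counts = Counter(bin_left_edge(v) for v in incomes.values())

for edge in sorted(counts):
    print(f"[{edge}, {edge + BIN_WIDTH}): {'#' * counts[edge]}")
```

Changing `BIN_START` or `BIN_WIDTH` moves observations between bins, which is why the shape of a histogram depends on the bin definition.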
[Figure: four snapshots of the histogram under construction (x-axis: income, 25000 to 60000; y-axis: Frequency): #1: Alabama = $31,295; #2: Alaska = $37,271; #3: Arizona = $31,458; #51: Wyoming = $40,676]

9
Histograms: Per Capita Income (in dollars)
of the (n = 51) states (varying bins and y-axis)

[Figure: four histograms of income (x-axis 20000 to 60000) with different bin widths; two use a Frequency scale and two use a Percent scale on the y-axis]

10
Histograms can be sensitive to
definition of the bins (IPS, Ex 1.4)

11
Measures of Center
• Mean and median are the two most common
measures of center of a distribution

• Mean, denoted $\bar{x}$, is the simple arithmetic
average (formula coming up)

• Mean of the set of numbers {1, 1, 5, -1} is
• $\bar{x} = (1 + 1 + 5 - 1) / 4 = 6 / 4 = 1.5$

12
Algebraic formula for the mean
$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Some notes about this formula…
• Assuming a sample of n individuals indexed by
i = 1, 2, 3….n
• “xi” denotes variable measurements on each person in
the sample
• “∑” denotes the summation operator
• “Bar” notation denotes average (we say “x-bar”)
13
Per capita income by state (n = 51)
[Figure: histogram of income (x-axis 20000 to 60000; y-axis: Percent)]
Mean = $35,470/person

14
Median: another measure of center
• Mean is sensitive to presence of large observations
• Think of
• mean of {1, 3, 5} = 3
• mean of {1, 3, 20} = 8
• Median is the middle number in the set of observations and is not
sensitive to ‘extreme’ observations
• Sort the observations from smallest to largest
• If there is an odd number of observations, median is the middle
number
• If an even number of observations, median is the average of the
two values `straddling’ the middle
• Ex.1: {1, 2, 3, 6}: median = 2.5, mean = 3
• Ex.2: {1, 2, 3, 6, 500}: median = 3, mean = 102.4
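A quick way to check the examples above is Python's standard `statistics` module. This is a small sketch, not part of the course's Stata workflow; the helper name `center` is made up for illustration.

```python
from statistics import mean, median


def center(data):
    """Return (mean, median) for a list of numbers."""
    return mean(data), median(data)


# The four small data sets used on this slide.
for data in ([1, 3, 5], [1, 3, 20], [1, 2, 3, 6], [1, 2, 3, 6, 500]):
    m, med = center(data)
    print(data, "mean =", m, "median =", med)
```

Adding the single value 500 moves the mean from 3 to 102.4, while the median only moves from 2.5 to 3.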

15
Per capita income by state (n = 51)
[Figure: histogram of income (x-axis 20000 to 60000; y-axis: Percent)]
Mean = $35,470/person
Median = $34,257/person

16
Idealized right-skewed distribution
Mean larger than Median
In a 1998 salary survey of
Harvard’s 1973 entering class,
the mean salary was $750,000
compared with a median of
$175,000… why?

17
Idealized Symmetric Distribution
Mean and median are the same

18
Effect of Shape on Mean and
Median

• In a right skewed distribution, the mean is
greater than the median

• In a left skewed distribution, the mean is less
than the median

• In a symmetric distribution the mean is
approximately (sometimes exactly) equal to the
median
19
Histograms show shape

[Figure: three histograms titled Skewed to the left, Skewed to the right, and Symmetric]
Some examples in Stata

20
Per capita income by state (n = 51)
[Figure: histogram of income (x-axis 20000 to 60000; y-axis: Percent)]
Mean = $35,470/person
Median = $34,257/person
Skewed-right

21
Measuring Spread (Variability) in Data

Two common methods


1. Variance and standard deviation
• Measure spread about the mean
• Most often used, but also sensitive to large values
in skewed distributions
2. Quantiles and percentiles
• Median
• Quartiles and more general percentiles

22
The variance of a set of data
• The “center” of a group of observations can be
measured by the mean
• The variability of a single observation $x_i$ can be
measured by its distance from the center (e.g. mean): $(x_i - \bar{x})$
• Since we want this to never be negative, this
distance is converted to $(x_i - \bar{x})^2$

• The “average” of these “squared deviations from the
mean” is used as a measure of variability

23
Variance
• The variance is the “average” of squared deviations from the
mean

• If there are n observations $x_1, x_2, \dots, x_n$, then the variance is

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

24
Standard Deviation
• The standard deviation (SD) is the square root of the
variance

$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

• Note:
The SD is in the original units of measurement
The variance is in (original units)²

25
Example: variance & standard deviation
• LeBron James’s points scored in the NBA finals (5 games) were:
{30, 32, 29, 26, 26}.
http://www.basketball-reference.com/players/n/nowitdi01/gamelog/2011/

• Calculate the mean:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{30 + 32 + 29 + 26 + 26}{5} = 28.6$$

• Calculate the variance:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{(30 - 28.6)^2 + (32 - 28.6)^2 + \dots + (26 - 28.6)^2}{5 - 1} = \frac{1.96 + 11.56 + \dots + 6.76}{4} = 6.8 \text{ points}^2$$

• Calculate the standard deviation:

$$s = \sqrt{s^2} = \sqrt{6.8 \text{ points}^2} = 2.608 \text{ points}$$
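The hand calculation above can be reproduced step by step in Python. This is a sketch for checking the arithmetic; the variable names are chosen for this example, and `statistics.variance` is used only as a cross-check (it also divides by n - 1).

```python
from statistics import mean, variance

points = [30, 32, 29, 26, 26]          # points scored in the 5 games

x_bar = mean(points)                   # the mean
squared_devs = [(x - x_bar) ** 2 for x in points]
s2 = sum(squared_devs) / (len(points) - 1)   # divide by n - 1, not n
s = s2 ** 0.5                          # SD is the square root of the variance

print("mean     =", round(x_bar, 1))
print("variance =", round(s2, 1), "points^2")
print("SD       =", round(s, 3), "points")
print("library variance agrees:", abs(variance(points) - s2) < 1e-9)
```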

26
[Figure: four histograms of var1, var2, var3, and var4 (x-axis 4 to 18; y-axis: Percent of Observations)]

. summ

    Variable |   Obs       Mean   Std. Dev.        Min        Max
-------------+---------------------------------------------------
        var1 |   200   9.078236   1.038261   6.689424    11.7296
        var2 |   200    10.9526   1.012461   7.813696    13.5883
        var3 |   200    9.03134   2.002863   4.407923   14.48719
        var4 |   200   11.13801   1.924267   6.302924   16.35018

27
A confusing detail
• Why divide by n – 1?
• Relatively easy to show that $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$, since

$$\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \bar{x} = n\bar{x} - n\bar{x} = 0$$
• So once you solve for the mean, all you have left are n – 1
pieces of information, in essence. That’s the amount of
information remaining to do the next calculation on.
• More simply, what is the st. dev. (SD), or average spread
around the mean, of the following sets of numbers:
a) {1, 2, 3} b) {1}
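Both points on this slide can be checked numerically. The sketch below (plain Python, with a made-up helper `deviations`) shows that deviations from the mean sum to zero, and that the n - 1 formula gives no answer at all for a single observation.

```python
from statistics import StatisticsError, mean, stdev


def deviations(data):
    """Return each observation's deviation from the sample mean."""
    x_bar = mean(data)
    return [x - x_bar for x in data]


# Deviations always sum to zero, so only n - 1 of them are free to vary.
print(deviations([1, 2, 3]), "sum =", sum(deviations([1, 2, 3])))

# a) {1, 2, 3}: the sample SD is defined.
print("SD of {1, 2, 3} =", stdev([1, 2, 3]))

# b) {1}: with one observation there is no spread to measure (division by n - 1 = 0).
try:
    stdev([1])
except StatisticsError as err:
    print("SD of {1} is undefined:", err)
```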

28
A more important detail: sensitivity to
extreme values
• Standard deviation and variance (like the mean) can be
sensitive to large observations
• SD of {1, 3, 5} = 2
• SD of {1, 3, 20} = 10.4
• Actually, even more sensitive than the mean…why?
• This issue will arise several times in the course…
• Standard deviation and mean lose natural interpretation
in skewed data or data with outliers

29
Per capita income by state (n = 51)
[Figure: histogram of income (x-axis 20000 to 60000; y-axis: Percent)]
Mean = $35,470/person
Median = $34,257/person
SD = $5,734/person

30
Measuring Spread
Percentiles of a distribution (IQR)
• The pth percentile of a distribution is that value such that p%
of the observations fall at or below it.
• The 25th percentile is the value with 25% of the
observations at or below it, 75% above
• It is called the first quartile Q1,
• the 50th percentile is the median M, and
• the 75th percentile is the third quartile Q3
• Called a quantile when expressed as a proportion instead of
percentage (25th percentile = .25 quantile)
• In a small set of numbers, it may not be possible to find exact
values for the percentiles
• The five-number summary of a distribution consists of
• Min, Q1, M, Q3, Max
31
From Moore and McCabe (IPS)
• The first quartile Q1 is the median of the observations
whose position in the ordered list is to the left of the
location of the overall median.
• The third quartile Q3 is the median of the observations
whose position in the ordered list is to the right of the
location of the overall median.
• e.g., 1, 2, 3, 4, 5: Q1= 1.5, M = 3, Q3 = 4.5
• Interquartile range, IQR = Q3 – Q1, is another measure of
variability of the distribution
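The quartile rule quoted above can be sketched as a short Python function. The function name `five_number_summary` is made up for this example, and note that statistical software (Stata included) may use slightly different percentile conventions, so results can differ on small data sets.

```python
from statistics import median


def five_number_summary(data):
    """Min, Q1, M, Q3, Max using the rule quoted above.

    Q1 is the median of the observations to the left of the overall
    median's location, Q3 the median of those to the right. When n is
    odd, the middle observation itself belongs to neither half.
    Requires at least two observations.
    """
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    upper = xs[half + (n % 2):]
    return min(xs), median(lower), median(xs), median(upper), max(xs)


low, q1, m, q3, high = five_number_summary([1, 2, 3, 4, 5])
print("Q1 =", q1, " M =", m, " Q3 =", q3, " IQR =", q3 - q1)
```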

32
Five number summary of a distribution
[Figure: histogram of income (x-axis 20000 to 60000; y-axis: Percent)]
Min = $26,535
Q1 = $31,891
Median = $34,257
Q3 = $38,712
Max = $55,755

. summarize income, detail

                          income
-------------------------------------------------------------
      Percentiles      Smallest
 1%        26535          26535
 5%        27935          27897
10%        29515          27935       Obs                  51
25%        31891          29108       Sum of Wgt.          51

50%        34257                      Mean           35470.67
                        Largest       Std. Dev.      5734.745
75%        38712          45877
90%        42392          46344       Variance       3.29e+07
95%        46344          49852       Skewness       1.275147
99%        55755          55755       Kurtosis        5.07413
33
US News and World Report
1995 College and University Rating
[Figure: histogram of Out of State (Non-resident) Tuition, x-axis 0 to 25000]

Percentiles of out-of-state tuition (in dollars):

Percentile     1%     5%    10%    25%    50%     75%     90%     95%     99%
Tuition      2250   3811   4470   6108   8670   11660   15476   17720   19700

34
Shape - Detecting Outliers
• Moore and McCabe: an observation is an outlier if it falls
more than
• 1.5 x IQR below Q1 or
• 1.5 x IQR above Q3 i.e.,
• outside the interval
• (Q1 - 1.5 x IQR, Q3 + 1.5 x IQR)
• Tuition Data:
• Q1 = $6,108 Q3 = $11,660
• 1.5 x IQR = 1.5 x (11,660 – 6,108) = $8,328
• So the criterion is: an observation is an outlier if it falls
• below 6,108 – 8,328 = − $2,220 (impossible) or
• above 11,660 + 8,328 = $19,988
• There are no small outliers, but there are several large
outliers…
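The 1.5 × IQR rule is easy to express as a function. The sketch below uses the tuition quartiles from the slide; the function name `outlier_fences` is an illustrative choice, not terminology from the text.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower, upper) cutoffs: Q1 - k * IQR and Q3 + k * IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Tuition data: Q1 = $6,108 and Q3 = $11,660.
lower, upper = outlier_fences(6108, 11660)
print("lower cutoff:", lower)   # negative, so no tuition can fall below it
print("upper cutoff:", upper)


def is_outlier(x, q1=6108, q3=11660):
    """Flag a tuition value that falls outside the two cutoffs."""
    lo, hi = outlier_fences(q1, q3)
    return x < lo or x > hi


# The 99th percentile ($19,700) sits just inside the upper cutoff,
# so the large outliers are all within the top 1% of schools.
print(is_outlier(19700), is_outlier(20000))
```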
35
Another Plot Type – Box plots
• Box plots are designed to show clearly the center, spread
(especially IQR), and outliers

• They are based on the five-number summary
• Minimum, Q1, Median, Q3, Maximum

• Easiest to explain with an example, using the tuition data.

36
Box plot of Per Capita Income
[Figure: box plot of income, axis from 20,000 to 60,000]
From Stata Documentation

. summarize income, detail

                          income
-------------------------------------------------------------
      Percentiles      Smallest
 1%        26535          26535
 5%        27935          27897
10%        29515          27935       Obs                  51
25%        31891          29108       Sum of Wgt.          51

50%        34257                      Mean           35470.67
                        Largest       Std. Dev.      5734.745
75%        38712          45877
90%        42392          46344       Variance       3.29e+07
95%        46344          49852       Skewness       1.275147
99%        55755          55755       Kurtosis        5.07413
US News and World Report
1995 College and University Rating
[Figure: histogram of Out of State (Non-resident) Tuition, x-axis 0 to 25000]
Histogram shows relative frequency of observations, shape in the center
[Figure: boxplot of 1995 Out of State Tuition, axis 0 to 25,000]
Boxplot shows center, spread and outliers
38
Outliers are often data errors
One value of height on Stat 104 poll entered as 5.2 inches
[Figure: boxplot of Height in inches, axis 0 to 80]

39
Some terminology
• Moore and McCabe define a boxplot where the lines extend from
the box out to the smallest and largest observations (pg. 38)

• Moore and McCabe define a modified boxplot where the lines
extend out from the box only to the largest and smallest
observations that are not suspected outliers (pg. 39)

• Stata produces `modified’ boxplots, with * used to mark points
more than 1.5 × IQR above the 75th percentile or below the 25th
percentile. We will always use modified boxplots in this course
(we will simply call them boxplots).

40
Changing the units of measurement
• What happens to the mean, standard deviation, etc… of a
set of data if one changes the scale, or units, of
measurements?
• For example, moving from English to metric scales
• Fahrenheit to Celsius
• feet to meters
• Calories to joules
• stones to kilograms
• Hopefully you get the idea

41
Effects of linear transformation
$y_i = a + b x_i$
• mean (Y) = a + b × mean (X)
• median (Y) = a + b × median (X)
• variance (Y) = b² × variance (X)
• SD (Y) = |b| × SD (X)
• IQR (Y) = |b| × IQR (X)

In this notation, $x_i$ and $y_i$ are used for individual
observations, and X and Y are used for the collection of
$x_i$’s and $y_i$’s
42
Direct and simple transformation from
Fahrenheit to Celsius

Example of a simple linear transformation:

$$\text{cel} = \frac{5}{9}(\text{fahr} - 32) = \frac{5}{9}\,\text{fahr} - \frac{160}{9}$$

43
Great Blue Hill Weather Observatory
Milton, MA
Oldest continuous weather
record in North America

44
Blue Hill Average Daily High Temps, by Month, 1885 to Present

Month   Max_Fahr   Max_Cels
Jan       33.6       0.89
Feb       34.6       1.44
Mar       43.2       6.22
Apr       54.6      12.56
May       66.3      19.06
Jun       74.6      23.67
Jul       79.9      26.61
Aug       77.7      25.39
Sep       70.7      21.50
Oct       60.5      15.83
Nov       48.6       9.22
Dec       37.2       2.89

Daily high temperatures, June 1 to June 20

June   High Temp (2010)   High Temp (2009)
1            81                 65
2            81                 73
3            80                 64
4            84                 71
5            83                 61
6            79                 69
7            72                 75
8            69                 69
9            70                 57
10           57                 56
11           68                 55
12           67                 69
13           64                 72
14           76                 60
15           78                 58
16           75                 60
17           72                 67
18           86                 62
19           83                 69
20           86                 73
Mean       75.55              65.25
`Testing’ the formulas…
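. summ max_f max_c

    Variable |   Obs       Mean   Std. Dev.    Min     Max
-------------+--------------------------------------------
       max_f |    12   56.79167   17.20647    33.6    79.9
       max_c |    12   13.77333   9.560084     .89   26.61

$$\text{cel} = \frac{5}{9}\,\text{fahr} - \frac{160}{9}$$

$$\text{mean}_{cel} = \frac{5}{9}\,\text{mean}_{fahr} - \frac{160}{9} = \frac{5}{9}(56.79) - \frac{160}{9} = 13.77$$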
We’ll be using the transformations in a bit…
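The same check can be run in Python on the twelve monthly values from the Blue Hill table. This sketch converts each Fahrenheit value with a = -160/9 and b = 5/9 and compares the summaries of the converted data with what the linear-transformation rules predict. Small differences from the Stata output are expected because the Max_Cels column there was rounded to two decimals.

```python
from statistics import mean, median, stdev

# Average daily high temperature by month (Jan to Dec), in Fahrenheit.
max_fahr = [33.6, 34.6, 43.2, 54.6, 66.3, 74.6,
            79.9, 77.7, 70.7, 60.5, 48.6, 37.2]

a, b = -160 / 9, 5 / 9                      # cel = a + b * fahr
max_cels = [a + b * f for f in max_fahr]    # transform every observation

# Rule: mean(Y) = a + b * mean(X); median(Y) = a + b * median(X)
print(mean(max_cels), a + b * mean(max_fahr))
print(median(max_cels), a + b * median(max_fahr))

# Rule: SD(Y) = |b| * SD(X); the shift a has no effect on spread
print(stdev(max_cels), abs(b) * stdev(max_fahr))
```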

46
Density Curves

• Density curves (a theoretical construct) are often used to
approximate an observed distribution of values in a data set
• The plots on the following slides come from IPS, section 1.3
(Colors come from previous edition, content is the same)
• The data are the Iowa Test vocabulary scores for
n = 947 Gary, Indiana 7th graders.
• The test yields a grade equivalent vocabulary score

47
Using a density curve to approximate a
population distribution
• Histograms show distribution of the
actual collected data, directly.
• While informative, histograms can
be awkward to manipulate to
compute proportions.
• Density curves can be used to
approximate proportions of a
population within a range of values
• In this Gary test score data, the true
proportion (relative frequency) of
actual scores at less than a 6th grade
equivalency is 0.303 (= 287/947)

48
Using a density curve to approximate a
population distribution
• Plot at right shows a normal
density curve (bell shaped curve)
used to approximate this
proportion of the histogram
• By convention, all densities are
constructed so that the area
under the density curve is 1.
• Mathematical formulas or tables
can be used to show the shaded
area is 0.293, close to the
histogram’s real proportion of
0.303

49
The plots side by side…

Area
= 0.293
Proportion
=287/947
=0.303

50
Percent vs. Density Scale (vertical axis)
[Figure: two histograms of var1 (10000 observations each, x-axis 0 to 10), one with Percent and one with Density on the vertical axis]

• Whenever a histogram and an approximating density function are shown
on the same plot, the histogram is drawn on the `density’ scale

• In the density scale for the vertical axis, the total area under the
histogram is 1 (shape is unchanged).

• This is just a technical detail; don’t get too hung up on it


51
Density Curves
• Always on or above the horizontal axis (never negative)
• Area under the curve equals 1
• Median is point where 50% of area is to the left
and 50% is to the right.
• Mean is
“balance point”

52
Not all distributions and density
functions are symmetric

[Figure: two plots of Number of Deaths per 1000 live births (x-axis 0 to 200), one of them on the Density scale]

But everybody’s favorite density curve is…


53
The Normal Density Curve (in blue, below)
A family of symmetric, bell shaped densities with single
peak and `nice’ mathematical properties
[Figure: Normal density curve drawn over a histogram (Density scale, x-axis -4 to 4)]
Formula (no need to memorize):

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right]$$

Histogram of 400 observations drawn randomly from a Normal density
Some properties of a Normal Density Curve
[Figure: density-scale histogram of var1 (10000 observations, x-axis 0 to 10)]

• Think of a normal density as the `histogram’ for a hypothetically
infinite sized population
• Interesting duality in interpretation of normal density
• When describing a distribution of numbers observed in a
study, the normal density sometimes is used to approximate
a histogram of the data set.
• When trying to learn about a large population using a small
data set, the observed histogram is thought of as an
approximation to the normal density corresponding to the
entire population
• Normal density has a `theoretical’ mean and standard deviation
55
Changing the location and scale of a
Normal Density
Formula for a normal density with mean μ and
variance σ2 (or equivalently standard deviation σ)
is:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right]$$

IPS calls this “~ N(μ, σ)”
(distributed as normal with mean ‘μ’ and std. dev. ‘σ’)
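For readers who want to see the formula in action, here is a direct transcription into Python. The function name `normal_density` is made up for this sketch; the standard library's `NormalDist.pdf` is used only as a cross-check.

```python
from math import exp, pi, sqrt
from statistics import NormalDist


def normal_density(x, mu=0.0, sigma=1.0):
    """Height of the N(mu, sigma) density curve at x (sigma is the SD)."""
    z = (x - mu) / sigma
    return exp(-0.5 * z ** 2) / (sigma * sqrt(2 * pi))


# Peak of the standard normal curve, at x = mu = 0.
print(normal_density(0))

# Changing mu shifts the curve; changing sigma stretches it and lowers the peak.
print(normal_density(100, mu=100, sigma=15))

# Cross-check against the standard library.
print(NormalDist(100, 15).pdf(110), normal_density(110, mu=100, sigma=15))
```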

56
The Normal Distribution
• The normal density curves comprise a family that describe the `normal
distributions’
• The curves are symmetric, unimodal, and bell shaped
(Unimodal = one peak)
[Figure: Normal Distributions, showing the density curves for N(0, 1), N(0, 2), and N(1, 1) over x from -4 to 4]
Normal Distributions
• The distributions are represented symbolically as N(μ, σ) – the
normal distribution with mean μ and standard deviation (sd) σ
• All normal distributions follow the 68-95-99.7 rule:
• 68% of observations fall within one σ of μ;
• 95% fall within 2σ of μ;
• 99.7% fall within 3σ of μ
• Next graph from IPS, pp 70-71 illustrates this for hypothetical
`standard normal’, μ = 0, σ = 1.
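The rule can be verified numerically with Python's `statistics.NormalDist`. This is a sketch; the helper name `within_k_sd` is an illustrative choice. It also shows that the percentages do not depend on which μ and σ are used.

```python
from statistics import NormalDist


def within_k_sd(k, mu=0.0, sigma=1.0):
    """Proportion of a N(mu, sigma) distribution within k SDs of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 2, 3):
    # Same answer for the standard normal and for IQ-like scores N(100, 15).
    print(k, round(within_k_sd(k), 4), round(within_k_sd(k, 100, 15), 4))
```

The exact values are slightly different from the rounded rule (for example, 2 SDs cover about 95.4%, and exactly 95% corresponds to 1.96 SDs).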

58
59
Standardizing Normal Distributions
• All normal distributions have the same shape with possibly
different center μ and spread (sd) σ.
• If Z has a N(0,1) distribution, then
• σZ is N(0,σ), (written σZ ~ N(0,σ))
• X = μ + σZ is N(μ,σ)
• Conversely, if X is N(μ,σ) distributed, then
• Z = (X – μ)/σ has a N(0,1) distribution
• The last relationship leads to standardization and the use of
standard normal distribution, N(0,1) to compute relative
frequencies…more on this in a bit
• Note that the simple transformation Z = (X – μ)/σ calculates the
distance of an observation from the mean, measured in number
of standard deviations, σ. This is often referred to as a
Z-score.

60
Example: Distribution of IQ scores

IQ tests calibrated
so that the mean:
μ = 100, and the
SD: σ = 15…
and the scores
are normally
distributed

Image from
Wikipedia

61
Standardizing

• Standardizing helps in interpreting observations
• An individual with IQ = 130 is
• (130 – 100)/15 = 2.0 σ’s above the mean
• Higher than 97.5% of population. Why?
• An individual with IQ = 115 is 1.0 σ’s above mean
• Higher than 100 - 32/2 = 84% of population
62
More Questions That We Can
Answer
(the possibilities are endless!)
1) What fraction of the population score no higher than 95
on an IQ test?

2) What fraction score at least 105?

3) What IQ score will place a person in the top 10%?

4) Einstein was thought to have an IQ of about 160. What
proportion of people have an IQ higher than Big Al?

63
Standardizing the normal
distribution
The first question wants to find P(Y ≤ 95) where:

• Y denotes IQ score and P denotes either proportion of population with
IQ less than or equal to 95, or probability of randomly selecting a
member of the population whose IQ would be no higher than 95.

• Convert P(Y ≤ 95) to a statement about a standard normal
distribution…so we need to standardize (compute a z-score):

$$P(Y \le 95) = P\left(\frac{Y - \mu_Y}{\sigma_Y} \le \frac{95 - 100}{15}\right) = P(Z \le -0.333) = \;???$$

• Next step…find the area under the curve (use a table!)


64
Note use of term `probability’ here instead of proportion

65
66
2) What fraction score at least 105?

$$P(Y \ge 105) = P\left(\frac{Y - \mu_Y}{\sigma_Y} \ge \frac{105 - 100}{15}\right) = P(Z \ge 0.333) = 0.3707$$

Why???

67
68
3) What IQ score will place a person in the
top 10%?
• Want to find x0 such that P(X > x0) = 0.10
• Think of X as 100 + 15 × Z, with Z ~ N(0,1)
• Find z0 such that P(Z > z0) = 0.10, then compute
• x0 = 100 + 15 × z0
• Because Table A only gives area to the left, need to state
this problem as: what value z0 has area 0.9 to the left?

z0 = 1.28

x0 = 100 + 15(1.28) = 119.2

69
4) Einstein was thought to have an IQ of about
160. What proportion of people have an IQ higher
than Big Al?

$$P(Y > 160) = P\left(\frac{Y - \mu_Y}{\sigma_Y} > \frac{160 - 100}{15}\right) = P(Z > 4.0) = \text{\_\_\_\_\_\_\_\_\_\_\_\_}$$

Look it up in our friend ____________________
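The four IQ questions can also be answered without a table. The sketch below uses Python's `statistics.NormalDist` with μ = 100 and σ = 15; the variable names are chosen for this example. Software does not round z to two decimals the way Table A does, so the answers to questions 1 and 2 come out near 0.369 rather than the table value 0.3707.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)

# 1) Fraction scoring no higher than 95: area to the left of 95.
p_at_most_95 = iq.cdf(95)

# 2) Fraction scoring at least 105: one minus the area to the left of 105.
p_at_least_105 = 1 - iq.cdf(105)

# 3) Score that puts a person in the top 10%: the value with area 0.90 to its left.
top_10_cutoff = iq.inv_cdf(0.90)

# 4) Fraction scoring above 160, i.e. more than 4 SDs above the mean.
p_above_160 = 1 - iq.cdf(160)

print(round(p_at_most_95, 4), round(p_at_least_105, 4))
print(round(top_10_cutoff, 1))
print(p_above_160)
```

Questions 1 and 2 give the same number because 95 and 105 are equally far from the mean and the curve is symmetric.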

70
Some Properties of the Normal
Distribution
• If Y ~ N(μ,σ), the area less than some value a (or b) is
denoted by P(Y < a). This anticipates later material on
probability
• For now, think of P ( ) as a proportion or relative frequency
• If Y ~ N(0,1) then P(Y < a) can be found by using the table
at the front of the book

1) Area sums to one:
P(Y ≤ a) + P(Y > a) = 1
2) Symmetric (about 0 when Y ~ N(0,1))
P(Y < -a) = P(Y > +a)
71
Properties….
3) Areas within two bounds (upper and lower)
• P(a < Y < b) = P( Y < b) - P(Y < a)
• An IQ Y score between 95 and 105 has probability
• P(95 < Y < 105) = P( Y < 105) - P(Y < 95)
= 0.6293 – 0.3707 = 0.2586
4) No area at a single point!
• So P(Y < b) = P(Y ≤ b)
• P(a < Y < b) = P(a ≤ Y ≤ b) = P(a < Y ≤ b), etc
• This is only true for theoretical density curves

72
Some important Z values
• What are the Z values with 0.10, 0.05 and 0.025 in the lower tail of the
N(0,1) distribution?

• From Table A
Prob (Z < -1.28) = 0.10
Prob (Z < -1.645) = 0.05
Prob (Z < -1.96) = 0.025
• By symmetry
Prob (Z > 1.28) = 0.10
Prob (Z > 1.645) = 0.05
Prob (Z > 1.96) = 0.025

73
Outliers and the Normal Distribution
• The IQR for the standard normal distribution is
• 75th – 25th percentile = 0.675 – (-0.675) = 1.35
• 1.5 × IQR = 1.5 × 1.35 = 2.02
• 75th percentile + (1.5 × IQR) = 2.69
• Probability (Z > 2.69) = .003 = 0.3%
• Z has a standard normal distribution (mean 0, sd 1)
• Probability here equivalent to area under the normal density
to the right of 2.69
• So with a normal distribution, large outliers should occur less
than ½ of 1% of the time. Same for small outliers. But we DO
expect outliers if our sample size is large enough!
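The calculation above can be repeated with unrounded quartiles. This is a sketch using Python's `statistics.NormalDist`; the variable names are illustrative. With full precision the tail area is about 0.35% per tail, consistent with the slide's rounded 0.3% and with "less than ½ of 1%".

```python
from statistics import NormalDist

z = NormalDist()                       # standard normal: mean 0, sd 1

q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr           # the large-outlier cutoff

# Area under the normal density to the right of the cutoff.
p_large_outlier = 1 - z.cdf(upper_fence)

# Both tails together, in a sample of 500 normally distributed values.
expected_in_500 = 500 * 2 * p_large_outlier

print(round(iqr, 3), round(upper_fence, 3))
print(round(p_large_outlier, 4), round(expected_in_500, 1))
```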

74
What’s Normal?

• How to decide if data can be assumed normally
distributed?
• Histograms and boxplots
help to rule out normality
on the basis of asymmetry
(for histograms) and
outliers (for boxplots)
A normal quantile plot is specially designed for the purpose. We
will not cover this in this course. Please read the text if you are
interested in learning about this plot.

General Concept: Plot observed data against what you’d expect if
the data were normally distributed.
Examples from 1st Day Survey (last year)
[Figure: density-scale histograms of Height (55 to 80), # Text Messages/Month (0 to 1500), and Random Numbers (0 to 100)]

76
Detecting non-Normality in Graphs
• Histogram very good at showing lack of symmetry
(skewing)

• Box plots very good at showing outliers


• Outliers are rare but not impossible in normally
distributed data – 0.3% outliers expected in each tail
(upper and lower)
• In a sample of 500, would expect around 3 total outliers

• Normal quantile plots (not covered in this course) are very
good at showing non-normality, but are more trouble than
they are worth
• One last example in Stata (S&P 500 Index Daily
Changes)…
77
Unit Recap
• What is statistics?
• Types of data
• Summarizing Data
• Graphically
• Bar plots (Categorical)
• Histograms (Quantitative)
• Box plots (Quantitative)
• Details of a graph are important
• Vertical axis scale
• Number of bins in histogram

78
Unit Recap
• Summarizing Data (cont.)
• Numerically
• Proportions
• Center (mean, median)
• Spread (std.dev./variance, IQR)
• Shape (skewness, outliers)
• Changing units of measurement (linear transformations)
• Density Curves and the Normal Distribution
– The “68-95-99.7 rule”
– Standardizing the normal distribution (z-score)
• Stata is your friend!

79
