
Histograms, Central Tendency, and Variability for Describing a Single Interval Variable

Lecture 2

Reading: Sections 5.1 – 5.6

Review: Data Types

The Economist, September 6, 2014



Lecture 2 Slides, ECO220Y1Y, 1


Histogram
• Histogram graphically describes how a single variable containing interval data is distributed
• Range of data divided into non-overlapping and equal width classes (bins) that cover range of values
How many bins? Width of bins?
[Figure: relative frequency histogram of Inflation Rate, 2011; n = 174 countries; y-axis: Fraction]
http://data.worldbank.org/indicator/FP.CPI.TOTL.ZG

[Figure: three histograms of Inflation Rate, 2011 (n = 174 countries), with y-axis Frequency, Fraction, and Density respectively]
• Frequency histogram: Bar height = number of observations in bin
• Relative frequency histogram: Bar height = fraction of obs. in bin
• Density histogram: Bar area measures the fraction of observations in bin
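To make the three definitions concrete, here is a minimal sketch that computes all three bar heights from the same bins. The data are simulated stand-ins (an assumption, not the World Bank inflation data), and the choice of six bins is arbitrary.

```python
import numpy as np

# Hypothetical inflation rates (percent); NOT the World Bank data from the slide.
rng = np.random.default_rng(0)
inflation = rng.gamma(shape=2.0, scale=3.0, size=174)

# Six equal-width, non-overlapping bins that cover the range of the data.
counts, edges = np.histogram(inflation, bins=6)
widths = np.diff(edges)

frequency = counts                 # bar height = number of obs. in bin
fraction = counts / counts.sum()   # bar height = fraction of obs. in bin
density = fraction / widths        # bar AREA (height x width) = fraction of obs. in bin

print(frequency.sum())             # total number of observations
print(fraction.sum())              # heights add up to 1
print((density * widths).sum())    # areas add up to 1
```

All three versions share the same bins, so they have the same shape; only the scale of the vertical axis differs.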



[Figure: density histogram of Inflation Rate, 2011; n = 34 OECD countries]
Can this histogram tell us the exact number of countries with inflation between 2 and 4 percent? Is it definitely above 40%?

[Figure: three density histograms of Inflation Rate, 2011 (n = 34 OECD countries), each drawn with a different number of bins]
Number of bins changes the appearance of the histogram
One suggestion: # of bins ≈ $\sqrt{n}$
OECD inflation: $\sqrt{34} = 5.83$ and STATA picked 5
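The square-root suggestion is easy to check by hand or in code. The sketch below rounds down, which is an assumption that happens to match the 5 bins reported for the OECD data; other software may round differently or use another rule entirely.

```python
import math

def sqrt_rule_bins(n: int) -> int:
    """Square-root suggestion for the number of bins, rounded down (an assumption)."""
    return max(1, math.floor(math.sqrt(n)))

print(round(math.sqrt(34), 2))   # 5.83
print(sqrt_rule_bins(34))        # 5
print(sqrt_rule_bins(174))       # 13
```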



Shape of Things
• Histogram gives overview of a variable with a single picture
– Can make informal inferences about the shape of population
• Symmetric: If draw an imaginary line at center, have mirror image on each side
• Bell/Normal/Gaussian
• Positively skewed: long tail to right (aka right skewed)
• Negatively skewed: long tail to left (aka left skewed)
• Modality: # major peaks
– Most distributions are unimodal: one major peak

Four Perfectly Symmetric Histograms
[Figure: four density histograms, each perfectly symmetric]



Two Perfectly Bell Shaped Histograms
[Figure: two density histograms, each perfectly bell shaped]
But histograms of real data will never be perfect: we always mean approximately
For example, we’d describe the histogram to the right as Normal (Bell) shaped

Four Positively Skewed Histograms
[Figure: four density histograms, each with a long tail to the right]
Alternatively, these are right skewed



Four Negatively Skewed Histograms
[Figure: four density histograms, each with a long tail to the left]
Alternatively, these are left skewed

[Figure: two relative frequency histograms, n = 157 countries, in 2017 (or most recent year). Left: Percent of Population Living Below International Poverty Line (x-axis: % Below Poverty Line). Right: Percent of Population Living Above International Poverty Line (x-axis: % Above Poverty Line).]
Data retrieved “Proportion of population below the international poverty line of US$1.90 per day (%)” from the World Health Organization on June 6, 2022: https://www.who.int/data/gho/data/indicators/indicator-details/GHO/proportion-of-population-below-the-international-poverty-line-of-us$1-90-per-day-(-)
In Canada in 2013 (the most recent year of data), 0.5% of the population lives below the international poverty line.
In Malawi in 2016 (the most recent year of data), 70.3% of the population lives below the international poverty line.
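The two poverty histograms describe the same information, since % above = 100 − % below, yet one is right skewed and the other left skewed. A minimal sketch of why, using simulated percentages (an assumption, not the WHO data) and a simple moment-based skewness measure that these slides do not themselves define:

```python
import numpy as np

def skewness(x):
    """Moment-based skewness: the average cubed z-score."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

# Hypothetical "% below poverty line" values for 157 countries; not the WHO data.
rng = np.random.default_rng(1)
below = np.clip(rng.exponential(scale=12.0, size=157), 0, 100)
above = 100 - below

print(skewness(below))   # positive: long tail to the right
print(skewness(above))   # same size, opposite sign: long tail to the left
```

Subtracting a variable from a constant mirrors its histogram, so the direction of skew flips while the spread stays the same.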



Four Bimodal Histograms
[Figure: four density histograms, each with two major peaks]

Figure 3: Violation Scores at Initial Inspection


Source: Farronato and Zervas (2022)
“Consumer Reviews and Regulation:
Evidence from NYC Restaurants”
https://www.nber.org/papers/w29715

Notes: This shows the distribution of violation scores that restaurants obtain during
the initial inspection. The vertical lines correspond to the score thresholds that
would assign A-B-C letter grades. Scores of 13 or less automatically give an A-grade,
while higher scores imply that a restaurant will be reinspected within a few weeks.
For the purpose of this plot, inspection scores are capped at 50.



[Figure: Ages of first-time mothers in the U.S. in 1980]
[Figure: Ages of first-time mothers in the U.S. in 2016]
The New York Times, August 4, 2018, “The Age That Women Have Babies: How a Gap Divides America”

Samples vs. Populations


• Sample is a random subset of population
– Sampling noise: Chance differences between
population and a random sample
• Driven by the sample size, not sample size relative to
the population size, which is assumed infinite (pp. 30 –
31, “The Sample Size is What Matters”)
– Informal inference: consider sample size (𝑛)
• Never see the perfect forms (Plato): statements about
shape always approximate
• “Nearly Normal Condition”
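Sampling noise can be illustrated with a short simulation. This sketch assumes a Normal population with mean 100 and s.d. 15 (hypothetical IQ-like values chosen for illustration) and draws directly from that distribution, which mimics an infinite population.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_means(n, draws=2000, mu=100.0, sigma=15.0):
    """Means of `draws` independent random samples, each of size n, from Normal(mu, sigma)."""
    return rng.normal(mu, sigma, size=(draws, n)).mean(axis=1)

# How much do sample means bounce around the population mean of 100?
for n in (10, 30, 1000):
    means = sample_means(n)
    print(n, round(float(means.std()), 2))
```

The spread of the sample means shrinks as n grows, and the population size never enters the calculation.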




[Figure: four density histograms of IQ: Population, N = 10,000,000; Sample 1, n = 10; Sample 2, n = 10; Sample 3, n = 10]
Sample 1 is a LIE!!
How many samples would you have in real life?
Why aren’t these samples perfectly Bell shaped?

[Figure: four density histograms of IQ: Population, N = 10,000,000; Sample 1, n = 30; Sample 2, n = 30; Sample 3, n = 30]
Why are there more bins than last slide?



[Figure: four density histograms of IQ: Population, N = 10,000,000; Sample 1, n = 1000; Sample 2, n = 1000; Sample 3, n = 1000]

What to Conclude About Shape?


[Figure: two density histograms. Left: X, n = 30. Right: Y, n = 500.]
Is the graph on the left symmetric? Bell shaped?
Is the graph on the right symmetric? Bell shaped? Bi-modal?



Hsieh and Olken (2014) JEP “The Missing ‘Missing Middle’” Summer 2014 http://pubs.aeaweb.org/doi/pdfplus/10.1257/jep.28.3.89

“There is a clear bimodality in the distribution of value-added/capital for the large firms. However, the capital questionnaire for large firms was ambiguous as to whether the results were to be entered in thousands or millions of Rupiah. Our best guess is that approximately half the firms used thousands and half used millions.” http://www.aeaweb.org/jep/app/2803/28030089_app.pdf
If the real distribution of value-added/capital for large firms is Normal, is the bimodal shape caused by sampling error or non-sampling error?



Summary Statistics
• Statistics (i.e. summary statistics) give a
concise idea of what data “look like”
– For a single variable, statistics can give numeric
measures of:
• Central tendency: mean and median
• Variability: range, variance, standard deviation,
coefficient of variation, IQR
• Relative standing: percentiles
– For two variables, also measure relationship
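As a rough illustration of the single-variable measures listed above, the following sketch computes each one with NumPy. The eight values are hypothetical, and percentile conventions differ slightly across software, so the quartiles here may not match STATA or the textbook exactly.

```python
import numpy as np

# Hypothetical inflation rates (percent) for eight countries; illustrative only.
x = np.array([1.2, 2.0, 2.5, 3.1, 3.3, 3.9, 4.4, 6.5])

# Central tendency
mean = x.mean()
median = np.median(x)             # even # of obs.: average of the 2 middle ones

# Variability
data_range = x.max() - x.min()
variance = x.var(ddof=1)          # sample variance, divides by n - 1
sd = x.std(ddof=1)                # sample standard deviation
cv = sd / mean                    # coefficient of variation
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1                     # interquartile range

# Relative standing
p90 = np.percentile(x, 90)

print(mean, median, data_range, variance, sd, cv, iqr, p90)
```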


Mean and Median
• Population mean, a parameter: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
• Sample mean, a statistic: $\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}$
• Which is subject to sampling error?
• Median is the middle obs. after sorting
– if even # of obs., average 2 middle ones
[Figure: relative frequency histogram of Inflation Rate, 2011; n = 34 OECD countries; mean = 3.1, median = 3.3]



Two Symmetric Distributions: Normal and Uniform
[Figure: four density histograms. IQ: Population (mu=100.0, med=100.0) and Sample, n=49 (X-bar=103.9, med=103.0). Book Rating: Population (mu=50.0, med=50.0) and Sample, n=41 (X-bar=55.3, med=62.1).]
Why does $\bar{X}$, which is a statistic, differ from $\mu$, which is a parameter?
Why does the population median exactly equal $\mu$ in both distributions?

[Figure: relative frequency histogram of Inflation Rate, 2011; n = 174 countries; mean = 6.6, median = 5.0]
Why is the mean greater than the median?
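One way to build intuition for this question is a small numeric example. The ten values below are hypothetical (not the actual inflation data): most are low and two are very high, giving a long right tail.

```python
import statistics

# Hypothetical inflation rates (percent): most are low, two are very high.
rates = [2.0, 3.0, 3.5, 4.0, 5.0, 5.5, 6.0, 8.0, 20.0, 53.0]

print(statistics.mean(rates))      # 11.0
print(statistics.median(rates))    # 5.25

# Dropping the two extreme values moves the mean far more than the median.
trimmed = rates[:-2]
print(statistics.mean(trimmed))    # 4.625
print(statistics.median(trimmed))  # 4.5
```

The mean uses the size of every observation, so the right tail pulls it up; the median depends only on the ordering, so it barely moves.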



Figure 2. Distribution of Local Business Tax Changes
Fuest, C., A. Peichl, and S. Siegloch. 2018. “Do Higher Corporate Taxes Reduce Wages? Micro Evidence from Germany.” American Economic Review, 108 (2): 393-418. DOI:10.1257/aer.20130570
What is the variable?
What is the unit of observation?
Notes: The histogram shows the distribution of changes in the local business tax rate. The sample consists of 17,999 tax rate changes in 10,001 municipalities. We omit 0.1 percent of the observations with absolute changes larger than 5 percentage points for illustrative purposes.

Measures of Variability (Spread)
• Range: max – min
• Variance:
– Population: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
– Sample: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}$
• Standard deviation: $s = \sqrt{variance}$
• Coefficient of variation (textbook)
[Figure: density histogram of Inflation Rate, 2011; n = 34 OECD countries; min = -0.3, max = 6.5; var = 1.6, sd = 1.3]
For all 174 countries, is the range bigger or smaller than 6.8?
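The two variance formulas differ only in the center used ($\mu$ versus $\bar{X}$) and the divisor (N versus n − 1). A minimal sketch in plain Python, with hypothetical data chosen so the arithmetic is easy to verify by hand:

```python
import math

def population_variance(xs):
    """sigma^2: squared deviations from mu, divided by N."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def sample_variance(xs):
    """s^2: squared deviations from X-bar, divided by n - 1."""
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # hypothetical values

print(max(data) - min(data))                 # range: 7.0
print(population_variance(data))             # 32 / 8 = 4.0
print(sample_variance(data))                 # 32 / 7
print(math.sqrt(sample_variance(data)))      # sample s.d.
```

Whether to treat the eight values as a whole population or as a sample is a modelling decision; the code simply shows both calculations on the same numbers.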



Breaking Down Variance
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}$
• Numerator: “total sum of squares” (TSS): $TSS = \sum_{i=1}^{n} (x_i - \bar{X})^2$
– If all sampled countries have 3% inflation ($x_i = 3$ for all $i$), what would TSS & $s^2$ be?
• Denominator: $\nu$ (“nu”)
– Only n – 1 free obs left after calculate mean
– Degrees of freedom: $\nu = n - 1$
• Units of variance?
– How about s.d.?
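The questions on this slide can be explored numerically. The sketch below uses hypothetical values and shows three things: TSS for constant data, why only n − 1 deviations are free, and how variance and s.d. respond to a change of units (percent versus fractions).

```python
import math

def tss(xs):
    """Total sum of squares: squared deviations from the sample mean."""
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** 2 for x in xs)

def sample_variance(xs):
    return tss(xs) / (len(xs) - 1)   # denominator is nu = n - 1

# Every sampled country has 3% inflation: no variability at all.
constant = [3.0] * 5
print(tss(constant), sample_variance(constant))   # 0.0 0.0

# Deviations from the mean always sum to zero, so the last one is
# pinned down by the other n - 1.
percent = [1.0, 2.0, 3.0, 6.0]
xbar = sum(percent) / len(percent)
deviations = [x - xbar for x in percent]
print(sum(deviations))

# Units: re-express the same data as fractions (divide by 100).
fractions = [p / 100 for p in percent]
var_ratio = sample_variance(percent) / sample_variance(fractions)
sd_ratio = math.sqrt(var_ratio)
print(var_ratio, sd_ratio)   # roughly 10000 and 100
```

Variance is in squared units of the variable, which is why rescaling by 100 changes it by a factor of 100²; the s.d. is in the original units.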

Empirical Rule (Normal/Bell)
• If a random sample is drawn from a Normal population then about:
– 68.3% of observations will lie within 1 s.d. of the mean (i.e. between $\bar{X} - s$ and $\bar{X} + s$)
– 95.4% of observations will lie within 2 s.d. of the mean (i.e. between $\bar{X} - 2s$ and $\bar{X} + 2s$)
– 99.7% of observations will lie within 3 s.d. of the mean (i.e. between $\bar{X} - 3s$ and $\bar{X} + 3s$)
• “Empirical Rule” only applies if Normal
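A quick simulation gives a feel for these percentages. The sketch assumes a Normal population with mean 100 and s.d. 15 (hypothetical values) and a large sample, so the observed shares should land close to, but not exactly on, 68.3%, 95.4%, and 99.7%.

```python
import numpy as np

rng = np.random.default_rng(7)
sample = rng.normal(loc=100.0, scale=15.0, size=100_000)  # hypothetical Normal population

xbar = sample.mean()
s = sample.std(ddof=1)   # sample s.d. (divides by n - 1)

def share_within(k):
    """Fraction of observations between X-bar - k*s and X-bar + k*s."""
    return float(np.mean(np.abs(sample - xbar) <= k * s))

for k in (1, 2, 3):
    print(k, round(share_within(k), 3))
```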



SAT Scores Distributions: Normal
• SAT score mean is:
– 1230 for students with
HH income > $200,000
– 970 for students with HH
income < $20,000

For the random sample (right):
about 68.3% of students have scores between 775.4 and 1173
about 95.4% of students have scores between 576.6 and 1371.8
about 99.7% of students have scores between 377.8 and 1570.6
Douglas Belkin, May 16, 2019, “SAT to Give Students ‘Adversity Score’ to Capture Social and Economic Background,” The Wall Street Journal https://www.wsj.com/articles/sat-to-give-students-adversity-score-to-capture-social-and-economic-background-11557999000
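The three intervals follow from the Empirical Rule, so the sample mean and s.d. can be backed out from the first interval alone. The slide does not state them; the sketch below infers them on the assumption that the 68.3% interval is exactly $\bar{X} \pm s$, then rebuilds all three intervals as a check.

```python
# Endpoints of the 68.3% interval, taken from the slide.
lo1, hi1 = 775.4, 1173.0

xbar = (lo1 + hi1) / 2   # midpoint of the interval -> inferred sample mean
s = (hi1 - lo1) / 2      # half the interval width -> inferred sample s.d.
print(round(xbar, 1), round(s, 1))   # 974.2 198.8

# Rebuild the 1, 2, and 3 s.d. intervals from the inferred values.
for k in (1, 2, 3):
    print(k, round(xbar - k * s, 1), round(xbar + k * s, 1))
```

The rebuilt 2 s.d. and 3 s.d. intervals match the ones stated for the sample, which supports the inferred mean of about 974.2 and s.d. of about 198.8.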

[Figure: four density histograms of X: Histogram #1, n = 94; Histogram #2, n = 154; Histogram #3, n = 298; Histogram #4, n = 521]
Noticing Normality, can we approximate the s.d. of X in each?



Chebysheff’s Theorem
• At least $100 \times (1 - 1/k^2)$% of observations lie within k s.d.’s of the mean for k > 1
– At least 75% of obs. lie within 2 s.d. of mean
• $1 - 1/2^2 = 3/4$
– At least 89% of obs. lie within 3 s.d. of mean
• $1 - 1/3^2 = 8/9$
– Can be applied to all samples no matter how population is distributed
– What about within one s.d.?
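The bound can be compared against actual data. The sketch below uses a deliberately skewed simulated sample (an assumption; any data set would do) to show that the observed share within k s.d. is at least the Chebysheff bound, and that the bound says nothing useful at k = 1.

```python
import numpy as np

def chebysheff_bound(k):
    """Minimum fraction of observations within k s.d. of the mean."""
    return max(0.0, 1.0 - 1.0 / k ** 2)

# Hypothetical right-skewed sample; clearly not Normal.
rng = np.random.default_rng(3)
x = rng.exponential(scale=10.0, size=500)
xbar, s = x.mean(), x.std(ddof=1)

def share_within(k):
    """Observed fraction of observations within k sample s.d. of the sample mean."""
    return float(np.mean(np.abs(x - xbar) <= k * s))

for k in (1, 2, 3):
    print(k, round(chebysheff_bound(k), 3), round(share_within(k), 3))
```

The bound is a guaranteed minimum rather than an estimate, so observed shares are often well above it.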

[Figure: relative frequency histogram of GDP per capita (PPP), 2012 est.; n = 185 countries; mean = 14955.1, sd = 16243.0]
How to describe the shape of the distribution of this variable?
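The reported mean and s.d. already hint at the shape. A short calculation, using only the two summary numbers from the slide plus the assumption that GDP per capita cannot be negative:

```python
mean, sd = 14955.1, 16243.0   # summary statistics reported on the slide

# One s.d. below the mean is already a negative number.
lower_1sd = mean - sd
print(round(lower_1sd, 1))    # -1287.9

# The 2 s.d. interval, to which Chebysheff's Theorem applies for any shape.
lower_2sd, upper_2sd = mean - 2 * sd, mean + 2 * sd
print(round(lower_2sd, 1), round(upper_2sd, 1))   # -17530.9 47441.1
```

If the distribution were Normal, a noticeable share of countries would fall below mean − s.d., which is impossible here, so the Empirical Rule is a poor fit. Chebysheff’s Theorem still holds: at least 75% of countries lie within 2 s.d. of the mean, which in practice means between 0 and about 47441.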



Recap
• Started to describe a single interval variable
– The histogram is a powerful visual summary tool
• Three types – frequency, relative frequency, and
density – but all give same big picture
• Describe shape with well-known terms, if appropriate
– Sometimes terms don’t work and sentences are needed
– Summary stats: mean, median, s.d., range, etc.
• For an important measure of variability – s.d. – the
Empirical Rule (special case) and Chebysheff’s Theorem
(general) help us get a grasp on the meaning of the s.d.
