DESCRIPTIVE
STATISTICS
Chapter 5
The Normal Approximation for
Data
The Normal Curve
• Discovered in 1720
by A. de Moivre
• Around 1870, A.
Quetelet had the
idea of using the
curve as an “ideal”
histogram to which
histograms of data
could be compared!
$$ y = \frac{1}{\sqrt{2\pi}} \, e^{-x^2/2} $$
Use a table, not the formula
• The area under the normal curve
– between -1 and +1 is about 68%
– between -2 and +2 is about 95%
– between -3 and +3 is about 99.7%
– outside -3 and +3 is about 0.3%
– outside -4 and +4 is about 0.006%
• Many histograms for data are similar
in shape to the normal curve, but not all!
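The areas listed above can be checked numerically instead of read from a table. The sketch below assumes Python 3.8+ and uses the standard library's `statistics.NormalDist`; the helper name `area_between` is made up for this illustration.

```python
from statistics import NormalDist

# Standard normal curve: average 0, SD 1.
STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def area_between(low, high):
    """Area under the standard normal curve between low and high, as a fraction."""
    return STANDARD_NORMAL.cdf(high) - STANDARD_NORMAL.cdf(low)


for k in (1, 2, 3):
    inside = area_between(-k, k)
    print(f"between -{k} and +{k}: {inside:.2%}   outside: {1 - inside:.2%}")
```

The printed percentages should agree with the rounded table values (about 68%, 95% and 99.7%).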
How to compare?
• The histogram must be drawn to the
same scale as the normal curve:
– Make the horizontal scale the same,
that is, convert the data units to
standard units.
– A value is converted to standard units
by seeing how many SDs it is above or
below the average.
– Ex. Convert “185cm” to standard units,
if this measurement comes from a
sample with mean 170cm and SD 10cm.
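The conversion in this example is a single subtraction and division. A minimal sketch (the function name `to_standard_units` is an assumption for illustration):

```python
def to_standard_units(value, mean, sd):
    """Number of SDs that value lies above (+) or below (-) the average."""
    return (value - mean) / sd


# 185 cm in a sample with mean 170 cm and SD 10 cm
z = to_standard_units(185, mean=170, sd=10)
print(z)  # 1.5
```

So 185 cm is 1.5 SDs above the average, that is, +1.5 in standard units.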
Histogram and the Normal
Curve
Converting back
• Find the height (in inches)
which is equal to -1.2 in standard units.
Solution: The histogram corresponds to a data
set with average 63.5 inches and standard
deviation 2.5 inches. So,
the height is 63.5 – 1.2 × 2.5 = 60.5 inches,
because -1.2 in standard units means “1.2 standard
deviations” below the average (which is 0 in standard units).
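Converting back reverses the standard-units formula: multiply by the SD, then add the average. A small sketch of this solution (the function name `from_standard_units` is hypothetical):

```python
def from_standard_units(z, mean, sd):
    """Value in data units that lies z SDs away from the average."""
    return mean + z * sd


# -1.2 in standard units, for average 63.5 inches and SD 2.5 inches
height = from_standard_units(-1.2, mean=63.5, sd=2.5)
print(round(height, 1))  # 60.5
```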
Finding areas under the curve
• Example 1: Find the area under the
normal curve, when x is greater than
0
Answer: 50%, because the curve is symmetric about 0.
Finding areas under the curve
• Example 2: Find the area between 0
and 1 under the normal curve
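Examples 1 and 2 can be checked without a table. The sketch below assumes Python 3.8+ (`statistics.NormalDist`); `cdf(x)` gives the area under the curve to the left of `x`.

```python
from statistics import NormalDist

curve = NormalDist()  # standard normal curve: average 0, SD 1

# Example 1: area to the right of 0
area_right_of_0 = 1 - curve.cdf(0)

# Example 2: area between 0 and 1
area_0_to_1 = curve.cdf(1) - curve.cdf(0)

print(f"{area_right_of_0:.1%}")  # 50.0%
print(f"{area_0_to_1:.1%}")      # 34.1%
```

The second result is half of the 68% between -1 and +1, as symmetry suggests.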
Finding areas under the curve
• Example 3:
Finding areas under the curve
• Example 4:
More Examples
The Normal Approximation
for Data
• The heights of the men age 18-74 in
HANES averaged 69 inches; SD was
3 inches. Use the normal curve to
estimate the percentage of these
men with heights between 63 inches
and 72 inches.
Convert to standard units
and find the area
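The two steps for the HANES example (convert to standard units, then find the area) can be sketched as follows, assuming Python 3.8+ and `statistics.NormalDist` in place of a printed table:

```python
from statistics import NormalDist

mean_height, sd_height = 69, 3  # inches, from the HANES example

# Step 1: convert both endpoints to standard units
z_low = (63 - mean_height) / sd_height   # -2.0
z_high = (72 - mean_height) / sd_height  # 1.0

# Step 2: area under the standard normal curve between them
curve = NormalDist()
fraction = curve.cdf(z_high) - curve.cdf(z_low)
print(f"{fraction:.0%}")  # 82%
```

The same number comes from the table values: half of 95% (between -2 and 0) plus half of 68% (between 0 and +1) is roughly 82%.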
Graphically …
Remark!
• If the normal curve is a good fit for
data then we need only
– the mean, and
– the standard deviation
to find (approximately) the percentage
of data that falls in any given interval!
• If the histogram deviates from a normal
curve significantly, such approximations
are far from reality in general.
Percentiles
• Mean and Standard Deviation are
less satisfactory if the normal curve is
NOT a good approximation for data!
• Ex. Income survey in U.S., 1992. The
average was $44,500. However this
histogram does not look like normal.
It has a long right hand tail. The
proportion of the data below the
average is a) less b) equal c) more
than 50%?
Histogram of the example
• Answer? More, because the median is
less than the mean (average).
What percent has negative
(!) income ?
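• If the normal curve were a good
approximation, then about 8% of incomes would be negative:
with average $44,500 and SD $32,000, an income of $0
is about 1.4 SDs below the average.
• However, NO income is negative!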
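The size of this mismatch can be worked out from the average ($44,500) and the SD ($32,000) reported for the income data in these slides. The sketch below assumes Python 3.8+ and shows what the normal curve would predict, not what the data contain.

```python
from statistics import NormalDist

mean_income = 44_500  # average from the 1992 income example
sd_income = 32_000    # SD reported for the same data

# $0 in standard units: about -1.39
z_zero = (0 - mean_income) / sd_income

# Area under the standard normal curve to the left of z_zero
fraction_negative = NormalDist().cdf(z_zero)
print(f"{fraction_negative:.0%}")  # 8%
```

A prediction of about 8% negative incomes is a sign that the normal curve is a poor fit for this long-tailed histogram.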
Use of percentiles
• A percentile is a VALUE IN THE
DATA SET (it has the same data
units as the data set!)
• The rth percentile is the value such that
r% of the data is below that value, and
(100-r)% is above that value.
Example of Percentile
• 143.7, 146.6, 148.0, 150.2, 151.0,
157.6, 164.0, 168.0, 171.4, 173.1,
174.0, 174.7, 175.0, 176.0, 179.0,
183.7, 185.0, 192.0, 192.0, 195.0
• Find 15th, 25th, 50th, 75th and 90th
percentiles.
• 15th percentile = 148.0 cm
• 25th percentile = 151.0 cm
• 50th percentile (median) = 173.1 cm
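The three answers above are consistent with one simple rule: sort the data and take the value at position r% of n, rounded up. That rule is an assumption inferred from the answers shown, and other textbooks and software use slightly different percentile conventions. A sketch under that assumption (the function name `percentile` is made up here):

```python
def percentile(sorted_data, r):
    """r-th percentile by the rank rule: the value at position ceil(r * n / 100)."""
    n = len(sorted_data)
    rank = max(1, (r * n + 99) // 100)  # integer ceiling of r * n / 100
    return sorted_data[rank - 1]        # ranks start at 1, list indices at 0


data = [143.7, 146.6, 148.0, 150.2, 151.0,
        157.6, 164.0, 168.0, 171.4, 173.1,
        174.0, 174.7, 175.0, 176.0, 179.0,
        183.7, 185.0, 192.0, 192.0, 195.0]

for r in (15, 25, 50, 75, 90):
    print(f"{r}th percentile = {percentile(data, r)}")
```

Under this rule the 75th and 90th percentiles come out as 179.0 and 192.0; those two values follow from the assumed convention rather than from the slide.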
Interquartile Range
• Another measure of spread:
75th percentile – 25th percentile
• Useful when the distribution is
skewed.
• In contrast, SD would be influenced
by a small percentage of cases in the
tail.
• Again, be careful with the
cases when the normal approximation is not good!
Income Distribution: Percentiles
and Interquartile Range
r     rth percentile
1     $1,300
10    $10,200
25    $20,100
50    $36,800
75    $58,100
90    $85,000
99    $151,800
Interquartile Range: $58,100 - $20,100 = $38,000
SD was $32,000.
Percentiles and the Normal Curve
• SAT scores: Average 535, SD 100.
The scores follow a normal curve.
Estimate the 95th percentile of the
SAT score distribution.
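One way to work this out: find the value in standard units with 95% of the area to its left, then convert back to SAT points. The sketch below assumes Python 3.8+ and uses `NormalDist.inv_cdf` in place of reading the table backwards.

```python
from statistics import NormalDist

average, sd = 535, 100  # SAT example

# Standard-units value with 95% of the area to its left (about 1.65)
z95 = NormalDist().inv_cdf(0.95)

# Convert back to SAT points: average + z * SD
score_95th = average + z95 * sd
print(f"z = {z95:.2f}, 95th percentile = {score_95th:.0f}")
```

With the table value z = 1.65 the estimate is 535 + 1.65 × 100 = 700, so the 95th percentile is about 700.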
Change of Scale
• Find the average and SD of the list
1,3,4,5,7
Average = 4, SD = 2
• Take the above list, multiply by 3 and
then add 7. What are the new avg.
and SD?
Average = 19 (= 3 × 4 + 7), SD = 6 (= 3 × 2)
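The change-of-scale rule can be checked directly on this list. The sketch uses `statistics.pstdev` (divide by n), which is an assumption chosen because it reproduces the SD of 2 given above; the sample SD (`stdev`, divide by n - 1) would give a different number.

```python
from statistics import mean, pstdev

data = [1, 3, 4, 5, 7]
scaled = [3 * x + 7 for x in data]  # multiply by 3, then add 7

avg, sd = mean(data), pstdev(data)
new_avg, new_sd = mean(scaled), pstdev(scaled)

print(avg, sd)
print(new_avg, new_sd)
```

Adding 7 shifts the average but leaves the SD alone; multiplying by 3 scales both.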