
CONTINUOUS DISTRIBUTIONS & CHECKING FOR NORMAL DISTRIBUTION



Exploring Quantitative Data


1. Always plot your data: make a graph.
2. Look for the overall pattern (shape, center, and spread) and
for striking departures such as outliers.
3. Calculate a numerical summary to briefly describe center
and spread.
4. Sometimes, the overall pattern of a large number of
observations is so regular that we can describe it by a
smooth curve.

Histograms
For large datasets and/or quantitative variables that take many values:
• Divide the possible values into classes or intervals of equal width.
• Count how many observations fall into each interval. Instead of counts, one may also use percents.
• Draw a picture representing the distribution: each bar height is equal to the number (or percent) of observations in its interval.
Distribution of IQ Scores
[Figure: histogram of IQ scores; x-axis: IQ Score, y-axis: Number of Students]

Class                   Count
75 ≤ IQ Score < 85        2
85 ≤ IQ Score < 95        3
95 ≤ IQ Score < 105      10
105 ≤ IQ Score < 115     16
115 ≤ IQ Score < 125     13
125 ≤ IQ Score < 135     10
135 ≤ IQ Score < 145      5
145 ≤ IQ Score < 155      1
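The conversion from counts to percents can be sketched in a few lines of Python. This is only an illustration: the class counts are copied from the IQ-score table, while the variable names and the text-based "bars" are assumptions made for this example.

```python
# Class counts copied from the IQ-score table.
classes = [
    ((75, 85), 2),
    ((85, 95), 3),
    ((95, 105), 10),
    ((105, 115), 16),
    ((115, 125), 13),
    ((125, 135), 10),
    ((135, 145), 5),
    ((145, 155), 1),
]

total = sum(count for _, count in classes)

# Convert each count to a percent of all observations.
percents = [100 * count / total for _, count in classes]

# Rough text histogram: one '#' per student in the class.
for ((low, high), count), pct in zip(classes, percents):
    print(f"{low:>3} <= IQ < {high:<3} {'#' * count:<16} {count:>2} ({pct:.1f}%)")
```

Whether the bars show counts or percents, the shape of the histogram is the same; only the scale of the vertical axis changes.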
Outliers
An important kind of deviation is an outlier. Outliers are observations that lie outside
the overall pattern of a distribution. Always look for outliers and try to explain them.
The overall pattern here is fairly symmetrical except for two states that clearly do not belong to the main pattern. Alaska and Florida have unusually small and large percents, respectively, of elderly residents in their populations.

A large gap in the distribution is typically a sign of an outlier.

Density Curves
Here is a histogram of vocabulary scores of 947 seventh
graders.

The smooth curve drawn over the histogram is a mathematical model for the distribution.

A density curve describes the overall pattern of a distribution. The area under the curve and above any range of values on the horizontal axis is the proportion of all observations that fall in that range. A density curve is always on or above the horizontal axis and has an area of exactly 1 underneath it.
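The two area properties can be checked numerically. The sketch below assumes a Normal-shaped density with a made-up mean of 7 and standard deviation of 1.5 (these are not the parameters of the vocabulary-score data) and approximates areas with the trapezoid rule.

```python
from statistics import NormalDist

# Hypothetical density curve: a Normal model with assumed mean 7 and sd 1.5.
curve = NormalDist(mu=7.0, sigma=1.5)

def area_under(pdf, low, high, steps=10_000):
    """Approximate the area under pdf between low and high (trapezoid rule)."""
    width = (high - low) / steps
    total = 0.5 * (pdf(low) + pdf(high))
    for i in range(1, steps):
        total += pdf(low + i * width)
    return total * width

# Total area, taken far into both tails, is approximately 1.
total_area = area_under(curve.pdf, -8.0, 22.0)

# The area above the range 6 to 9 is the proportion of observations in that range.
proportion_6_to_9 = area_under(curve.pdf, 6.0, 9.0)
exact = curve.cdf(9.0) - curve.cdf(6.0)

print(round(total_area, 6), round(proportion_6_to_9, 4), round(exact, 4))
```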

Density Curves
• The mean and standard deviation computed from actual observations (data) are denoted by x̄ (“x-bar”) and s, respectively.
• The mean and standard deviation of the actual distribution represented by the density curve are denoted by µ (“mu”) and σ (“sigma”), respectively.

Normal Distributions
One particularly important class of density curves is the class of Normal curves, which describe Normal distributions.
• All Normal curves are symmetric, single-peaked, and bell-shaped.
• A specific Normal curve is described by giving its mean µ and standard deviation σ.
The Normal Distribution
• The Normal distribution is found to be a suitable model for
many naturally occurring variables which tend to be
symmetrically distributed about a central modal value - the
mean.

• e.g. human heights, weights, IQs etc., and also the output from many production processes.
• The Normal distribution is characterized by the mean and the standard deviation.
• Ability to put a particular score into perspective
• How many standard deviations above/below the mean?
There is no single normal curve, but a family of curves, each one defined by its mean, µ, and standard deviation, σ; µ and σ are called the parameters of the distribution.

As we can see, the curves may have different centres and/or different spreads, but they all have certain characteristics in common:
• The curve is bell-shaped.
• It is symmetrical about the mean (µ).
• The mean, mode and median coincide.
• Area beneath the Normal Distribution Curve
No matter what the values of µ and σ are for a normal probability distribution, the total area under the curve is equal to one.
• We can therefore consider partial areas under the curve as representing probabilities. The partial area between a stated number of standard deviations below and above the mean is always the same, as illustrated below.
• N.B. The curve neither finishes nor meets the horizontal axis at µ ± 3σ; it only approaches it and actually goes on indefinitely.
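The claim that the partial area depends only on the number of standard deviations, and not on µ or σ, can be checked with Python's standard library. The two curves below use hypothetical parameters chosen only for illustration; `area_within` is a helper name introduced here.

```python
from statistics import NormalDist

def area_within(dist, k):
    """Area under dist's curve within k standard deviations of its mean."""
    return dist.cdf(dist.mean + k * dist.stdev) - dist.cdf(dist.mean - k * dist.stdev)

# Two hypothetical Normal curves with different centres and spreads.
curve_a = NormalDist(mu=0, sigma=1)
curve_b = NormalDist(mu=50, sigma=15)

for k in (1, 2, 3):
    print(k, round(area_within(curve_a, k), 4), round(area_within(curve_b, k), 4))
# 1 0.6827 0.6827
# 2 0.9545 0.9545
# 3 0.9973 0.9973
```

The area within 3 standard deviations is close to, but not exactly, 1, which matches the note that the curve never actually meets the horizontal axis.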

The Empirical Rule


In the Normal distribution with mean µ and standard deviation σ:
• approximately 68% of the observations fall within σ of µ.
• approximately 95% of the observations fall within 2σ of µ.
• approximately 99.7% of the observations fall within 3σ of µ.
Shape of probability distributions
Symmetrical – the right half is a mirror image of the left half
Skewness – the distribution lacks symmetry: the data are sparse at one end and piled up at the other end
• Absence of symmetry
• Extreme values in one side of a distribution
• The shape of probability distributions can be very
important in statistical analysis
• Scores often only approximate a normal distribution
• Skewness and Kurtosis are two measures of how scores
may deviate from the perfectly normal distribution
Skewness
• Skew implies that the shape of a unimodal distribution is asymmetric about its mean
– the mean lies towards the direction of the skew (the longer tail) relative to the median
• Positive skew – scores are bunched towards the left, with the longer tail extending to the right
– in a positively skewed distribution there tend to be some positive outliers
• Negative skew – scores are bunched towards the right, with the longer tail extending to the left
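Skewness
[Figure: three density curves showing the positions of the mean, median and mode. Negatively skewed: mean, then median, then mode (left to right). Symmetric (not skewed): mean, median and mode coincide. Positively skewed: mode, then median, then mean (left to right).]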
Coefficient of Skewness
• Coefficient of Skewness (Sk) – compares the mean and median relative to the magnitude of the standard deviation; Md is the median; σ is the Std Dev

Sk = 3(µ − Md) / σ

• If Sk < 0, the distribution is negatively skewed (skewed to the left).
• If Sk = 0, the distribution is symmetric (not skewed).
• If Sk > 0, the distribution is positively skewed (skewed to the right).
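As a rough illustration, the coefficient Sk = 3(mean − median) / standard deviation can be computed with Python's `statistics` module. The scores are hypothetical, and `pearson_skewness` is a name introduced for this sketch; it uses the population standard deviation (`pstdev`) to match σ in the formula.

```python
from statistics import mean, median, pstdev

def pearson_skewness(values):
    """Sk = 3 * (mean - median) / population standard deviation."""
    return 3 * (mean(values) - median(values)) / pstdev(values)

# Hypothetical scores with a long right tail (mean 5, median 4).
right_tailed = [2, 3, 3, 4, 4, 4, 5, 5, 6, 14]
# Mirror image of the same scores: long left tail.
left_tailed = [-v for v in right_tailed]
# Evenly spread scores: mean equals median.
symmetric = [1, 2, 3, 4, 5]

print(pearson_skewness(right_tailed) > 0)  # True: positively skewed
print(pearson_skewness(left_tailed) < 0)   # True: negatively skewed
print(pearson_skewness(symmetric))         # 0.0: not skewed
```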
Kurtosis
Leptokurtic: high and thin
Mesokurtic: normal in shape
Platykurtic: flat and spread out
Positive values of kurtosis (measured as excess kurtosis, so that a Normal distribution scores zero) indicate a pointy and heavy-tailed distribution, whereas negative values indicate a flat and light-tailed distribution. The further the value is from zero, the more likely it is that the data are not normally distributed.
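A minimal sketch of how skewness and excess kurtosis values might be computed from raw scores, using simple moment-based definitions. Statistical packages often apply small-sample corrections, so their output can differ slightly from these formulas; the function names and both datasets are assumptions made for illustration.

```python
from statistics import mean

def central_moment(values, order):
    """Average of (value - mean) raised to the given power."""
    m = mean(values)
    return sum((v - m) ** order for v in values) / len(values)

def skewness(values):
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    # Subtracting 3 makes a Normal distribution score zero.
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

flat = list(range(1, 11))                      # hypothetical flat, light-tailed scores
peaked = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]     # hypothetical pointy, heavy-tailed scores

print(excess_kurtosis(flat) < 0)     # True: platykurtic
print(excess_kurtosis(peaked) > 0)   # True: leptokurtic
```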

[Figure: leptokurtic, mesokurtic and platykurtic curves compared]
Testing for Normality
(or approximate Normality)
• Visual inspection of Histograms, Box Plots, Q-Q Plots etc.
• Values for Skewness & Kurtosis
• Sometimes transformations can correct for skew and kurtosis
– Square root of raw scores
– Inverse (1/x) of raw scores
– Base 10 Log of raw scores
• Kolmogorov-Smirnov Test / Shapiro-Wilk Test / Anderson-Darling Test and others
• Large samples (200+) can offset concerns
• Multivariate outliers - Check Mahalanobis distances
• Winsorizing (replacing extreme outliers, say |z| ≥ 3.0, with the nearest less extreme value) or trimming them may be appropriate
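The effect of the transformations on skew can be sketched as follows. The raw scores are hypothetical and strictly positive (the log and inverse transformations are undefined at zero), and the moment-based `skewness` helper is an assumption of this example rather than a prescribed formula.

```python
import math
from statistics import mean

def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical positive scores with a long right tail.
raw = [1, 2, 2, 3, 3, 4, 5, 8, 15, 40]

transformed = {
    "raw": raw,
    "square root": [math.sqrt(v) for v in raw],
    "log10": [math.log10(v) for v in raw],
    "inverse": [1 / v for v in raw],
}
skews = {name: skewness(vals) for name, vals in transformed.items()}

for name, value in skews.items():
    print(f"{name:<12} skewness = {value:.2f}")
```

For these scores the square root reduces the positive skew and the log reduces it further. Note that the inverse transformation reverses the order of the scores, so its result needs to be interpreted with care.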
Check Histograms & Box Plots

Normal Quantile Plots


Q-Q (quantile–quantile) plot
One way to assess if a distribution is indeed
approximately Normal is to plot the data on a Normal
quantile plot.

The data points are ranked and the percentile ranks are converted to z-scores. The z-scores are then used for the x axis against which the data are plotted on the y axis of the Normal quantile plot.

• If the distribution is indeed Normal, the plot will show a straight line, indicating a good match between the data and a Normal distribution.

• Systematic deviations from a straight line indicate a non-Normal distribution. Outliers appear as points that are far away from the overall pattern of the plot.
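The ranking and conversion steps can be sketched as below. The percentile convention (rank − 0.5) / n is one common choice and is an assumption here, as are the function name and the sample values; plotting software may use a slightly different convention.

```python
from statistics import NormalDist

def normal_quantile_points(data):
    """Return (z, value) pairs for a Normal quantile plot."""
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()
    points = []
    for rank, value in enumerate(ordered, start=1):
        # Percentile rank of this observation; (rank - 0.5) / n avoids
        # percentiles of exactly 0 or 1, which have no finite z-score.
        p = (rank - 0.5) / n
        points.append((std_normal.inv_cdf(p), value))
    return points

# Hypothetical sample of 11 observations.
sample = [61, 64, 66, 67, 69, 70, 71, 73, 74, 76, 79]
points = normal_quantile_points(sample)

for z, value in points:
    print(f"z = {z:+.2f}  value = {value}")
```

Plotting the values against the z-scores and checking whether the points fall close to a straight line is then the visual check described above.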
P-P plot

• The P-P plot (probability–probability plot) is another useful graph for checking normality; it plots the cumulative probability of a variable against the cumulative probability of a particular distribution (in this case we would specify a normal distribution).
• P-P plots can be interpreted in the same way as Q-Q plots
• If you have a lot of scores, Q-Q plots can be easier to interpret than P-P plots because they will display fewer values
Kolmogorov-Smirnov Test
• The Kolmogorov-Smirnov test can be used to see if a
distribution of scores significantly differs from a normal
distribution.
• If the K-S test is significant (Sig. is less than .05) then the scores are significantly different from a normal distribution. Otherwise, the test gives no significant evidence of a departure from normality, and the scores can be treated as approximately normally distributed.
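To give a feel for what the test measures, the sketch below computes only the K-S statistic D: the largest gap between the cumulative proportion of the sample and the cumulative probability of a Normal curve fitted to it. It does not compute the significance value, which statistical software derives separately (with an adjustment when the mean and standard deviation are estimated from the same data). The function name and both samples are assumptions for illustration.

```python
from statistics import NormalDist, mean, stdev

def ks_statistic(data):
    """Largest gap between the empirical CDF and a fitted Normal CDF."""
    ordered = sorted(data)
    n = len(ordered)
    model = NormalDist(mean(ordered), stdev(ordered))
    d = 0.0
    for i, value in enumerate(ordered):
        cdf = model.cdf(value)
        # The empirical CDF steps from i/n up to (i + 1)/n at this value.
        d = max(d, cdf - i / n, (i + 1) / n - cdf)
    return d

# Constructed sample that follows Normal quantiles closely.
near_normal = [NormalDist().inv_cdf((i - 0.5) / 20) for i in range(1, 21)]
# Hypothetical strongly right-skewed sample.
skewed = [1, 2, 2, 3, 3, 4, 5, 8, 15, 40]

print(round(ks_statistic(near_normal), 3), round(ks_statistic(skewed), 3))
```

A larger D means a larger departure from the fitted Normal curve; whether it is significant depends on the sample size.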
Shapiro-Wilk Test
• The Shapiro–Wilk test does much the same thing, but it
has more power to detect differences from normality (so
this test might be significant when the K-S test is not).
• In large samples these tests can be significant even when the scores are only slightly different from a normal distribution, so they should always be interpreted in conjunction with histograms, P-P or Q-Q plots, and the values of skew and kurtosis.

Standardizing Observations
• If a variable x has a distribution with mean µ and standard deviation σ, then the standardized value of x, or its z-score, is

z = (x − µ) / σ

Z score – represents the number of standard deviations a value (x) is above or below the mean of a set of numbers. When the data are normally distributed, the z-score can also be converted to a probability.

All Normal distributions are the same if we measure in units of size σ from the mean µ as center.
The standard Normal distribution is the Normal distribution with mean 0 and standard deviation 1. That is, the standard Normal distribution is N(0,1).
Z Scores: An example
• A dataset is normally distributed with a mean of 50 and standard deviation of 10. Determine the z score for a value of 70.

z = (70 − 50) / 10 = 2.0
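The same calculation as a small Python sketch; `z_score` is simply a helper name chosen for this example.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

z = z_score(70, mu=50, sigma=10)
print(z)  # 2.0
```

A value of 70 is therefore two standard deviations above the mean of 50.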

The Standard Normal Table


Because all Normal distributions are the same when we standardize, we can find areas under any Normal curve from a single table.

The Standard Normal Table is a table of areas under the standard Normal curve. The table entry for each value z is the area under the curve to the left of z.

The Standard Normal Table


Suppose we want to find the proportion of observations from the
standard Normal distribution that are less than 0.81.
We can use the standard normal table.

P(z < 0.81) = .7910

z     .00    .01    .02
0.7   .7580  .7611  .7642
0.8   .7881  .7910  .7939
0.9   .8159  .8186  .8212
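The table entries are cumulative areas, so they can be reproduced with the standard Normal cumulative distribution function. A minimal sketch using Python's `statistics.NormalDist`; the helper name `table_entry` is an assumption of this example.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def table_entry(z):
    """Area under the standard Normal curve to the left of z, to 4 decimals."""
    return round(std_normal.cdf(z), 4)

# Rebuild the three table rows shown in the extract.
print("z     .00     .01     .02")
for row in (0.7, 0.8, 0.9):
    entries = [table_entry(row + col) for col in (0.00, 0.01, 0.02)]
    print(f"{row:.1f}   " + "  ".join(f"{e:.4f}" for e in entries))

print(table_entry(0.81))  # 0.791, i.e. P(z < 0.81) = .7910
```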

Normal Calculations
Find the proportion of observations from the standard Normal distribution
that are between –1.25 and 0.81.
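One way to work this out is to subtract the area to the left of –1.25 from the area to the left of 0.81. A short sketch with the standard library, with variable names chosen for this example:

```python
from statistics import NormalDist

std_normal = NormalDist()

area_left_of_upper = std_normal.cdf(0.81)    # about 0.7910
area_left_of_lower = std_normal.cdf(-1.25)   # about 0.1056

# The area between the two z-values is the difference of the two left-hand areas.
proportion_between = area_left_of_upper - area_left_of_lower
print(round(proportion_between, 4))  # 0.6854
```

So roughly 68.5% of observations from the standard Normal distribution fall between –1.25 and 0.81.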

Normal Calculations
How to Solve Problems Involving Normal Distributions

Express the problem in terms of the observed variable x.

Draw a picture of the distribution and shade the area of interest under the curve.

Perform calculations.
• Standardize x to restate the problem in terms of a standard Normal variable z.
• Use the table and the fact that the total area under the curve is 1 to find the required area under the standard Normal curve.

Write your conclusion in the context of the problem.



Normal Calculations – Further example


According to the Health and Nutrition Study, the heights (in
inches) of adult men aged 18–24 are N(70, 2.8).

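If exactly 10% of men aged 18–24 are shorter than a particular man, how tall is he?

[Figure: N(70, 2.8) density curve with a lower-tail area of .10 shaded to the left of the unknown height (?); the mean is at 70]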

Normal Calculations
How tall is a man who is taller than exactly 10% of men aged 18–24?

Look up the probability closest to 0.10 in the table. Find the corresponding standardized score. The value you seek is that many standard deviations from the mean.

z      .07    .08    .09
–1.3   .0853  .0838  .0823
–1.2   .1020  .1003  .0985
–1.1   .1210  .1190  .1170

z = –1.28

Normal Calculations
How tall is a man who is taller than exactly 10% of men aged 18–24? Heights are N(70, 2.8) and z = –1.28.

We need to “unstandardize” the z-score to find the observed value (x):

z = (x − µ) / σ  ⟹  x = µ + zσ

x = 70 + z(2.8)
= 70 + [(–1.28) × (2.8)]
= 70 + (–3.58) = 66.42

A man would have to be approximately 66.42 inches tall or less to be in the lower 10% of all men in the population.
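The table lookup and the "unstandardizing" step can be cross-checked with an inverse cumulative distribution function. This sketch uses `statistics.NormalDist.inv_cdf`; the small difference from the slide's answer comes from the table value z = –1.28 being rounded to two decimals.

```python
from statistics import NormalDist

heights = NormalDist(mu=70, sigma=2.8)

# Exact standardized score with 10% of the area to its left.
z = NormalDist().inv_cdf(0.10)       # about -1.2816
x_exact = heights.inv_cdf(0.10)      # about 66.41 inches

# Same steps as the worked example, using the table value z = -1.28.
x_table = 70 + (-1.28) * 2.8         # about 66.42 inches

print(round(z, 4), round(x_exact, 2), round(x_table, 2))
```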
Example:
A variety of questions about the spending habits of supermarket shoppers can be answered given the information that: μ = £50.00, σ = £15.00, n = 500

Probability of a shopper spending over £80.


• We need P(x > £80), i.e. the probability that a customer spends over £80
• μ = £50.00
• σ = £15.00
• x = £80
• Standardise x:

z = (x − μ) / σ = (80 − 50) / 15 = 30 / 15 = 2.00

• From tables:
P(x > £80) = 1.0 - 0.9772 = 0.0228
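The table-based answer can be confirmed in a few lines, assuming (as the example does) that spending follows a Normal distribution with the stated mean and standard deviation. Variable names are chosen for this sketch.

```python
from statistics import NormalDist

spending = NormalDist(mu=50.0, sigma=15.0)

z = (80 - 50.0) / 15.0             # standardized value of £80
p_over_80 = 1 - spending.cdf(80)   # upper-tail area beyond £80

print(z, round(p_over_80, 4))  # 2.0 0.0228
```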
Probability that a shopper spends between £30 and £80,
i.e. P(£30 < x < £80)

μ = £50.00, σ = £15.00

P(x < 80): reading from tables is 0.9772 (see previous slide)

P(x < 30):

z = (x − μ) / σ = (30 − 50) / 15 = −20 / 15 = −1.333

The table lists only positive values of z, so when z is negative we invoke the symmetry of the curve and use its absolute value: P(z < −1.33) = 1 − P(z < 1.33).

Reading from tables is 1 - 0.9082 = 0.0918

Therefore: P(£30 < x < £80) = 0.9772 - 0.0918 = 0.8854

The percentage of shoppers who spend between £30 and £80 is: 88.5%
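As a cross-check, the same probability can be computed without rounding z to two decimals. This is a sketch under the example's assumption that spending is Normal with mean £50 and standard deviation £15; it gives roughly 0.886, slightly above the table-based 0.8854 because the table lookup uses z = −1.33 rather than −1.333.

```python
from statistics import NormalDist

spending = NormalDist(mu=50.0, sigma=15.0)

p_below_80 = spending.cdf(80)   # about 0.9772
p_below_30 = spending.cdf(30)   # about 0.0912 (table value with z = -1.33: 0.0918)

# Area between £30 and £80 is the difference of the two left-hand areas.
p_between = p_below_80 - p_below_30
print(f"{100 * p_between:.1f}%")
```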