Doane Chapter 04B

Descriptive Statistics (Part 2)
Standardized Data
Percentiles and Quartiles
Box Plots
Grouped Data
Skewness and Kurtosis (optional)
Chapter 4
For any population with mean $\mu$ and standard
deviation $\sigma$, the percentage of observations that lie
within k standard deviations of the mean must be at
least $100[1 - 1/k^2]$.
Developed by mathematicians Jules Bienaymé
(1796-1878) and Pafnuty Chebyshev (1821-1894).
Standardized Data
Chebyshev's Theorem
For k = 2 standard deviations,
$100[1 - 1/2^2] = 75\%$
So, at least 75.0% will lie within $\mu \pm 2\sigma$.
For k = 3 standard deviations,
$100[1 - 1/3^2] = 88.9\%$
So, at least 88.9% will lie within $\mu \pm 3\sigma$.
Although applicable to any data set, these limits
tend to be too wide to be useful.
Standardized Data
Chebyshev's Theorem
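The bound above is easy to tabulate. The following is a minimal sketch; the function name chebyshev_min_percent is made up for illustration and is not from the slides.

```python
def chebyshev_min_percent(k: float) -> float:
    """Minimum percent of observations within k standard deviations of the mean.

    Hypothetical helper: the bound 100[1 - 1/k^2] only says something for k > 1.
    """
    if k <= 1:
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 100.0 * (1.0 - 1.0 / k**2)


for k in (2, 3):
    print(f"k = {k}: at least {chebyshev_min_percent(k):.1f}%")
```

Because the bound holds for any distribution, it is much looser than the normal-distribution figures used on the following slides.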
The Empirical Rule states that for data from a
normal distribution, we expect that for
k = 1, about 68.26% will lie within $\mu \pm 1\sigma$
k = 2, about 95.44% will lie within $\mu \pm 2\sigma$
k = 3, about 99.73% will lie within $\mu \pm 3\sigma$
The normal or Gaussian distribution was named for
Karl Gauss (1777-1855).
The normal distribution is symmetric and is also
known as the bell-shaped curve.
Standardized Data
The Empirical Rule
Note: no upper bound is given.
Data values outside $\mu \pm 3\sigma$ are rare.
Distance from the mean is measured in terms of
the number of standard deviations.
Standardized Data
The Empirical Rule
If 80 students take an exam, how many will score
within 2 standard deviations of the mean?
Assuming exam scores follow a normal distribution,
the empirical rule states
about 95.44% will lie within $\mu \pm 2\sigma$
so 95.44% × 80 ≈ 76 students will score
within $\mu \pm 2\sigma$.
How many students will score more than 2
standard deviations from the mean?
Standardized Data
Example: Exam Scores
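The exam-score arithmetic can be checked with the standard normal distribution in Python's statistics module. This is a sketch under the slide's assumption that scores are normally distributed; share_within is a hypothetical helper name. The exact normal share within 2 standard deviations is about 95.45%, which differs from the slide's table-based 95.44% only in the last digit.

```python
from statistics import NormalDist


def share_within(k: float) -> float:
    """Share of a normal population within k standard deviations of the mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)


n_students = 80
within_2 = n_students * share_within(2)   # expected count within 2 std devs
beyond_2 = n_students - within_2          # expected count beyond 2 std devs

print(f"within 2 std devs: about {within_2:.0f} students")
print(f"beyond 2 std devs: about {beyond_2:.0f} students")
```

Under these assumptions, roughly 76 students fall within two standard deviations and roughly 4 fall beyond them.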
Unusual observations are those that lie beyond
$\mu \pm 2\sigma$.
Outliers are observations that lie beyond
$\mu \pm 3\sigma$.
Standardized Data
Unusual Observations
For example, the P/E ratio data contains several
large data values. Are they unusual or outliers?
7 8 8 10 10 10 10 12 13 13 13 13
13 13 13 14 14 14 15 15 15 15 15 16
16 16 17 18 18 18 18 19 19 19 19 19
20 20 20 21 21 21 22 22 23 23 23 24
25 26 26 26 26 27 29 29 30 31 34 36
37 40 41 45 48 55 68 91
Standardized Data
Unusual Observations
If the sample came from a normal distribution, then
the Empirical Rule states:
$\bar{x} \pm 1s = 22.72 \pm 1(14.08) = (8.6, 36.8)$
$\bar{x} \pm 2s = 22.72 \pm 2(14.08) = (-5.4, 50.9)$
$\bar{x} \pm 3s = 22.72 \pm 3(14.08) = (-19.5, 65.0)$
Standardized Data
The Empirical Rule
[Figure: number line centered at 22.72 with limits at 8.6 and 36.8, at -5.4 and 50.9, and at -19.5 and 65.0. Values beyond 2s are marked Unusual; values beyond 3s are marked Outliers.]
Are there any unusual values or outliers?
7 8 . . . 48 55 68 91
Standardized Data
The Empirical Rule
A standardized variable (Z) redefines each
observation in terms of the number of standard
deviations from the mean.
Standardization formula for a population:
$z_i = \frac{x_i - \mu}{\sigma}$
Standardization formula for a sample:
$z_i = \frac{x_i - \bar{x}}{s}$
Standardized Data
Defining a Standardized Variable
$z_i$ tells how far away the observation is from the
mean.
For example, for the P/E data, the first value is $x_1 = 7$.
The associated z value is
$z_1 = \frac{x_1 - \bar{x}}{s} = \frac{7 - 22.72}{14.08} = -1.12$
Standardized Data
A negative z value means the observation is below
the mean.
Positive z means the observation is above the
mean. For $x_{68} = 91$,
$z_{68} = \frac{x_{68} - \bar{x}}{s} = \frac{91 - 22.72}{14.08} = 4.85$
Standardized Data
Here are the standardized z values for the P/E
data:
Standardized Data
What do you conclude for these four values?
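The standardized values can be reproduced from the raw P/E data. This is a minimal sketch using Python's statistics module; the variable names are illustrative. "Unusual" here means beyond 2 but not beyond 3 sample standard deviations, so the two lists do not overlap.

```python
from statistics import mean, stdev

pe_ratios = [
    7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
    14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
    19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
    26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91,
]

x_bar = mean(pe_ratios)   # sample mean
s = stdev(pe_ratios)      # sample standard deviation (n - 1 divisor)

# z_i = (x_i - x_bar) / s for every observation
z = [(x - x_bar) / s for x in pe_ratios]

# Beyond 2 but not beyond 3 standard deviations
unusual = [x for x, zi in zip(pe_ratios, z) if 2 < abs(zi) <= 3]
# Beyond 3 standard deviations
outliers = [x for x, zi in zip(pe_ratios, z) if abs(zi) > 3]

print(f"mean = {x_bar:.2f}, s = {s:.2f}")
print("unusual:", unusual)
print("outliers:", outliers)
```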
In Excel, use =STANDARDIZE(x, Mean, StDev) to calculate the
standardized z value of an observation x.
MegaStat calculates standardized values as well
as checks for outliers.
Standardized Data
What do we do with outliers in a data set?
If due to erroneous data, then discard.
An outrageous observation (one completely outside
of an expected range) is certainly invalid.
Recognize unusual data points and outliers and
their potential impact on your study.
Research books and articles on how to handle
outliers.
Standardized Data
Outliers
For a normal distribution, the range of values is $6\sigma$
(from $\mu - 3\sigma$ to $\mu + 3\sigma$).
If you know the range R (high - low), you can
estimate the standard deviation as $\sigma = R/6$.
Useful for approximating the standard deviation
when only R is known.
This estimate depends on the assumption of
normality.
Standardized Data
Estimating Sigma
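As a quick illustration of the R/6 rule, here is a sketch with a hypothetical helper name. The inputs 7 and 91 are the smallest and largest P/E ratios from the earlier slides. That sample is right-skewed rather than normal, so the estimate is only a rough guide there.

```python
def estimate_sigma_from_range(low: float, high: float) -> float:
    """Rough standard deviation estimate R/6, assuming roughly normal data."""
    return (high - low) / 6


# Smallest and largest P/E ratios: R = 91 - 7 = 84
print(estimate_sigma_from_range(7, 91))
```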
Percentiles are data that have been divided into
100 groups.
For example, you score in the 83rd percentile on a
standardized test. That means that 83% of the
test-takers scored below you.
Deciles are data that have been divided into
10 groups.
Quintiles are data that have been divided into
5 groups.
Quartiles are data that have been divided into
4 groups.
Percentiles
Percentiles are used to establish benchmarks for
comparison purposes (e.g., health care,
manufacturing and banking industries use 5, 25,
50, 75 and 90 percentiles).
Quartiles (25, 50, and 75 percent) are commonly
used to assess financial performance and stock
portfolios.
Percentiles are used in employee merit evaluation
and salary benchmarking.
Percentiles
Quartiles are scale points that divide the sorted
data into four groups of approximately equal size.
The three values that separate the four groups are
called $Q_1$, $Q_2$, and $Q_3$, respectively.
Lower 25% | $Q_1$ | Second 25% | $Q_2$ | Third 25% | $Q_3$ | Upper 25%
Quartiles
The second quartile $Q_2$ is the median, an important
indicator of central tendency.
Lower 50% | $Q_2$ | Upper 50%
$Q_1$ and $Q_3$ measure dispersion since the
interquartile range $Q_3 - Q_1$ measures the degree of
spread in the middle 50 percent of data values.
Lower 25% | $Q_1$ | Middle 50% | $Q_3$ | Upper 25%
Quartiles
The first quartile $Q_1$ is the median of the data
values below $Q_2$, and the third quartile $Q_3$ is the
median of the data values above $Q_2$.
Lower 25% | $Q_1$ | Second 25% | $Q_2$ | Third 25% | $Q_3$ | Upper 25%
For the first half of the data, 50% lie above and 50% lie below $Q_1$.
For the second half of the data, 50% lie above and 50% lie below $Q_3$.
Quartiles
Depending on n, the quartiles $Q_1$, $Q_2$, and $Q_3$ may
be members of the data set or may lie between
two of the sorted data values.
Quartiles
For small data sets, find quartiles using the method of
medians:
Step 1. Sort the observations.
Step 2. Find the median $Q_2$.
Step 3. Find the median of the data values that lie below $Q_2$.
Step 4. Find the median of the data values that lie above $Q_2$.
Method of Medians
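The four steps can be sketched in Python as follows. One assumption is labeled in the code: the "values below" and "values above" the median are taken by position in the sorted list, and the middle observation is left out of both halves when n is odd. The function name is illustrative.

```python
from statistics import median


def quartiles_by_medians(values):
    """Return (Q1, Q2, Q3) using the method of medians.

    Assumption: halves are split by position in the sorted data; for odd n
    the middle observation is excluded from both halves.
    """
    xs = sorted(values)                       # Step 1: sort
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = xs[:half]                         # values below the median
    upper = xs[half:] if n % 2 == 0 else xs[half + 1:]  # values above it
    return median(lower), median(xs), median(upper)     # Steps 3, 2, 4


print(quartiles_by_medians([1, 2, 4, 4, 8, 8, 8, 8]))
print(quartiles_by_medians([0, 3, 3, 6, 6, 6, 10, 15]))
```

The two sample lists are data sets A and B from the Caution slide, and they give the same three quartiles.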
Use Excel function =QUARTILE(Array, k) to return
the kth quartile.
Excel treats quartiles as a special case of
percentiles. For example, to calculate $Q_3$:
=QUARTILE(Array, 3)
=PERCENTILE(Array, 0.75)
Excel calculates the quartile positions as:
Position of $Q_1$: 0.25n + 0.75
Position of $Q_2$: 0.50n + 0.50
Position of $Q_3$: 0.75n + 0.25
Excel Quartiles
Consider the following P/E ratios for 68 stocks in a
portfolio.
Use quartiles to define benchmarks for stocks that
are low-priced (bottom quartile) or high-priced (top
quartile).
7 8 8 10 10 10 10 12 13 13 13 13 13 13 13 14 14
14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19
19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26
26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91
Example: P/E Ratios and Quartiles
Using Excel's method of interpolation, the quartile
positions are:
Quartile | Position Formula | Interpolate Between
$Q_1$ | 0.25(68) + 0.75 = 17.75 | $X_{17}$ and $X_{18}$
$Q_2$ | 0.50(68) + 0.50 = 34.50 | $X_{34}$ and $X_{35}$
$Q_3$ | 0.75(68) + 0.25 = 51.25 | $X_{51}$ and $X_{52}$
The quartiles are:
Quartile | Formula
First ($Q_1$) | $Q_1 = X_{17} + 0.75(X_{18} - X_{17}) = 14 + 0.75(14 - 14) = 14$
Second ($Q_2$) | $Q_2 = X_{34} + 0.50(X_{35} - X_{34}) = 19 + 0.50(19 - 19) = 19$
Third ($Q_3$) | $Q_3 = X_{51} + 0.25(X_{52} - X_{51}) = 26 + 0.25(26 - 26) = 26$
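The position-and-interpolate calculation can be sketched as below. This is an illustrative re-implementation of the position formulas quoted on the slides, not Excel's own code, and the function name interpolated_quartile is hypothetical.

```python
import math

pe_ratios = [
    7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
    14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
    19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
    26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91,
]


def interpolated_quartile(values, q):
    """Quartile q (1, 2, or 3) from the position formula (q/4)n + (1 - q/4)."""
    xs = sorted(values)
    n = len(xs)
    position = (q / 4) * n + (1 - q / 4)   # 1-based position, e.g. 17.75
    lower = math.floor(position)           # index of the value just below
    fraction = position - lower            # how far to move toward the next value
    if lower >= n:                         # position falls on the last value
        return xs[-1]
    return xs[lower - 1] + fraction * (xs[lower] - xs[lower - 1])


for q in (1, 2, 3):
    print(f"Q{q} = {interpolated_quartile(pe_ratios, q)}")
```

Python's statistics.quantiles(data, n=4, method="inclusive") uses the same interpolation scheme and should agree with this sketch.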
So, to summarize:
Lower 25% of P/E Ratios | $Q_1$ = 14 | Second 25% of P/E Ratios | $Q_2$ = 19 | Third 25% of P/E Ratios | $Q_3$ = 26 | Upper 25% of P/E Ratios
These quartiles express central tendency and
dispersion. What is the interquartile range?
Because of clustering of identical data values,
these quartiles do not provide clean cut points
between groups of observations.
Whether you use the method of
medians or Excel, your quartiles will be
about the same. Small differences in
calculation techniques typically do not
lead to different conclusions in
business applications.
Tip
Quartiles generally resist outliers.
However, quartiles do not provide clean cut points
in the sorted data, especially in small samples with
repeating data values.
Data set A: 1, 2, 4, 4, 8, 8, 8, 8    $Q_1$ = 3, $Q_2$ = 6, $Q_3$ = 8
Data set B: 0, 3, 3, 6, 6, 6, 10, 15    $Q_1$ = 3, $Q_2$ = 6, $Q_3$ = 8
Although they have identical quartiles, these two
data sets are not similar. The quartiles do not
represent either data set well.
Caution
Some robust measures of central tendency and
dispersion using quartiles are:
Statistic | Formula | Excel | Pro | Con
Midhinge | $\frac{Q_1 + Q_3}{2}$ | =0.5*(QUARTILE(Data,1)+QUARTILE(Data,3)) | Robust to presence of extreme data values. | Less familiar to most people.
Midspread | $Q_3 - Q_1$ | =QUARTILE(Data,3)-QUARTILE(Data,1) | Stable when extreme data values exist. | Ignores magnitude of extreme data values.
Coefficient of quartile variation (CQV) | $100 \times \frac{Q_3 - Q_1}{Q_3 + Q_1}$ | None | Relative variation in percent so we can compare data sets. | Less familiar to non-statisticians.
Dispersion Using Quartiles
The mean of the first and third quartiles.
Midhinge = $\frac{Q_1 + Q_3}{2}$
For the 68 P/E ratios,
Midhinge = $\frac{Q_1 + Q_3}{2} = \frac{14 + 26}{2} = 20$
A robust measure of central tendency since
quartiles ignore extreme values.
Midhinge
A robust measure of dispersion.
Midspread = $Q_3 - Q_1$
Midspread = $Q_3 - Q_1 = 26 - 14 = 12$
Midspread (Interquartile Range)
Measures relative dispersion; expresses the
midspread as a percent of the sum of the quartiles
(twice the midhinge).
$CQV = 100 \times \frac{Q_3 - Q_1}{Q_3 + Q_1}$
For the 68 P/E ratios,
$CQV = 100 \times \frac{Q_3 - Q_1}{Q_3 + Q_1} = 100 \times \frac{26 - 14}{26 + 14} = 30.0\%$
Similar to the CV, CQV can be used to compare
data sets measured in different units or with
different means.
Coefficient of Quartile Variation (CQV)
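The three quartile-based statistics can be computed together. This is a small sketch with illustrative function names, using the P/E quartiles $Q_1$ = 14 and $Q_3$ = 26 from the slides.

```python
def midhinge(q1: float, q3: float) -> float:
    """Mean of the first and third quartiles."""
    return (q1 + q3) / 2


def midspread(q1: float, q3: float) -> float:
    """Interquartile range Q3 - Q1."""
    return q3 - q1


def cqv(q1: float, q3: float) -> float:
    """Coefficient of quartile variation, in percent."""
    return 100 * (q3 - q1) / (q3 + q1)


q1, q3 = 14, 26  # quartiles of the 68 P/E ratios
print(midhinge(q1, q3), midspread(q1, q3), cqv(q1, q3))
```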
A useful tool of exploratory data analysis (EDA).
Also called a box-and-whisker plot.
Based on a five-number summary:
$X_{\min}$, $Q_1$, $Q_2$, $Q_3$, $X_{\max}$
Consider the five-number summary for the
68 P/E ratios:
$X_{\min}$ = 7, $Q_1$ = 14, $Q_2$ = 19, $Q_3$ = 26, $X_{\max}$ = 91
Box Plots
[Figure: box plot of the P/E data labeling the minimum, $Q_1$, median ($Q_2$), $Q_3$, maximum, box, and whiskers. The plot is right-skewed. The center of the box is the midhinge.]
Box Plots
Use quartiles to detect unusual data points.
The cutoff points are called fences and can be found
using the following formulas:
Fence | Inner fences | Outer fences
Lower fence | $Q_1 - 1.5(Q_3 - Q_1)$ | $Q_1 - 3.0(Q_3 - Q_1)$
Upper fence | $Q_3 + 1.5(Q_3 - Q_1)$ | $Q_3 + 3.0(Q_3 - Q_1)$
Values outside the inner fences are unusual while
those outside the outer fences are outliers.
Box Plots
Fences and Unusual Data Values
For example, consider the P/E ratio data:
Fence | Inner fences | Outer fences
Lower fence | 14 - 1.5(26 - 14) = -4 | 14 - 3.0(26 - 14) = -22
Upper fence | 26 + 1.5(26 - 14) = +44 | 26 + 3.0(26 - 14) = +62
Ignore the lower fences since they are negative and P/E
ratios are only positive.
Box Plots
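Truncate the whisker at the fences and display
unusual values and outliers as dots.
[Figure: box plot of the P/E data with the inner and outer fences marked. Points beyond the inner fence are labeled Unusual; points beyond the outer fence are labeled Outliers.]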
Box Plots
Based on these fences, there are three unusual
P/E values and two outliers.
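The count of unusual values and outliers can be verified by applying the fence formulas to the P/E data. This is a minimal sketch with illustrative function names; values beyond an outer fence are reported only as outliers, not also as unusual.

```python
pe_ratios = [
    7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
    14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
    19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
    26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91,
]


def fences(q1, q3):
    """Return the (lower, upper) inner and outer fences."""
    iqr = q3 - q1
    return {
        "inner": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "outer": (q1 - 3.0 * iqr, q3 + 3.0 * iqr),
    }


def classify(values, q1, q3):
    """Split values into (unusual, outliers) using the fences."""
    f = fences(q1, q3)
    (in_lo, in_hi), (out_lo, out_hi) = f["inner"], f["outer"]
    outliers = [x for x in values if x < out_lo or x > out_hi]
    unusual = [
        x for x in values
        if (x < in_lo or x > in_hi) and out_lo <= x <= out_hi
    ]
    return unusual, outliers


unusual, outliers = classify(pe_ratios, q1=14, q3=26)
print("fences:", fences(14, 26))
print("unusual:", unusual)
print("outliers:", outliers)
```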
Although some information is lost, grouped data
are easier to display than raw data.
When bin limits are given, the mean and standard
deviation can be estimated.
Accuracy of grouped estimates depends on
- the number of bins
- distribution of data within bins
- bin frequencies
Grouped Data
Nature of Grouped Data
Consider the frequency distribution for prices of
Lipitor for three cities:
Grouped Data
Mean and Standard Deviation
Estimate the mean and standard deviation by
$\bar{x} = \sum_{j=1}^{k} \frac{f_j m_j}{n} = \frac{3427.5}{47} = 72.92552$
$s = \sqrt{\sum_{j=1}^{k} \frac{f_j (m_j - \bar{x})^2}{n - 1}} = \sqrt{\frac{2091.48936}{47 - 1}} = 6.74293$
where
$m_j$ = class midpoint, $f_j$ = class frequency,
k = number of classes, n = sample size
Note: don't round off too soon.

Grouped Data
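The grouped-data formulas can be sketched as below. The Lipitor frequency table is not reproduced in this text, so the bins and frequencies here are made-up illustrative values, not the slide's data. The function name is also hypothetical.

```python
import math


def grouped_mean_std(bins, freqs):
    """Estimate the mean and sample standard deviation from grouped data.

    bins  : list of (lower limit, upper limit) pairs, one per class
    freqs : class frequencies f_j, in the same order
    """
    midpoints = [(lo + hi) / 2 for lo, hi in bins]          # m_j
    n = sum(freqs)                                          # sample size
    x_bar = sum(f * m for f, m in zip(freqs, midpoints)) / n
    ss = sum(f * (m - x_bar) ** 2 for f, m in zip(freqs, midpoints))
    return x_bar, math.sqrt(ss / (n - 1))


# Hypothetical frequency distribution (NOT the Lipitor data from the slide)
bins = [(60, 65), (65, 70), (70, 75)]
freqs = [2, 5, 3]

x_bar, s = grouped_mean_std(bins, freqs)
cv = 100 * s / x_bar  # coefficient of variation, in percent
print(f"mean = {x_bar:.5f}, s = {s:.5f}, CV = {cv:.1f}%")
```

Keeping full precision until the final print follows the slide's advice not to round off too soon.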
How accurate are grouped estimates compared to
ungrouped estimates?
Now estimate the coefficient of variation
$CV = 100(s/\bar{x}) = 100(6.74293/72.92552) = 9.2\%$
For the previous example, we can compare the
grouped data statistics to the ungrouped data
statistics.
Grouped Data
Accuracy Issues
For this example, very little information was lost
due to grouping.
However, accuracy could be lost due to the nature
of the grouping (i.e., if the data values were not evenly
spaced within bins).
Grouped Data
Accuracy Issues
The dot plot shows a relatively even distribution
within the bins.
Effects of uneven distributions within bins tend to
average out unless there is systematic skewness.
Grouped Data
Accuracy Issues
Accuracy tends to improve as the number of bins
increases.
If the first or last class is open-ended, there will be
no class midpoint (no mean can be estimated).
Assume a lower limit of zero for the first class
when the data are nonnegative.
You may be able to assume an upper limit for
some variables (e.g., age).
Median and quartiles may be estimated even with
open-ended classes.
Grouped Data
Accuracy Issues
Generally, skewness may be indicated by looking
at the sample histogram or by comparing the mean
and median.
This visual indicator is imprecise and does not take
into consideration sample size n.
Skewness and Kurtosis
Skewness
Skewness
Skewness is a unit-free statistic.
The coefficient compares two samples measured
in different units or one sample with a known
reference distribution (e.g., symmetric normal
distribution).
Calculate the sample's skewness coefficient as:
Skewness = $\frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3$
In Excel, go to Tools | Data Analysis | Descriptive Statistics or
use the function =SKEW(array).
Skewness
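The skewness coefficient can be computed directly from its definition. This sketch implements the adjusted sample skewness n/((n-1)(n-2)) times the sum of cubed standardized deviations, which is the form Excel documents for =SKEW. The function name and the small test list are illustrative.

```python
from statistics import mean, stdev


def sample_skewness(values):
    """Adjusted sample skewness coefficient (requires n >= 3)."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least three observations")
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 divisor)
    total = sum(((x - x_bar) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * total


print(sample_skewness([1, 2, 3, 4, 5]))    # symmetric data
print(sample_skewness([1, 2, 3, 4, 10]))   # one large value pulls the right tail
```

A coefficient near zero is consistent with symmetry. A positive value indicates a longer right tail.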
Consider the following table showing the 90%
range for the sample skewness coefficient.
Skewness
Coefficients within the 90% range may be
attributed to random variation.
Skewness
Coefficients outside the range suggest the sample
came from a nonnormal population.
Skewness
As n increases, the range of chance variation
narrows.
Skewness
Kurtosis is the relative length of the tails and the
degree of concentration in the center.
Consider three kurtosis prototype shapes.
Kurtosis
A histogram is an unreliable guide to kurtosis since
scale and axis proportions may differ.
Excel and MINITAB calculate kurtosis as:
Kurtosis = $\frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$
Kurtosis
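The kurtosis coefficient can be sketched the same way. This implements the sample excess kurtosis in the form Excel documents for =KURT. The function name and the small test lists are illustrative.

```python
from statistics import mean, stdev


def sample_kurtosis(values):
    """Sample excess kurtosis coefficient (requires n >= 4)."""
    n = len(values)
    if n < 4:
        raise ValueError("kurtosis needs at least four observations")
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 divisor)
    total = sum(((x - x_bar) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * total - correction


print(sample_kurtosis([1, 2, 3, 4, 5]))    # flat, evenly spaced data
print(sample_kurtosis([1, 2, 3, 4, 10]))   # one extreme value
```

Because of the subtracted correction term, a normal population has an expected coefficient near zero under this definition.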
Consider the following table of expected 90%
range for sample kurtosis coefficient.
Kurtosis
A sample coefficient within the ranges may be
attributed to chance variation.
Kurtosis
Coefficients outside the range would suggest the
sample differs from a normal population.
Kurtosis
As sample size increases, the chance range
narrows.
Inferences about kurtosis are risky for n < 50.
Kurtosis
Applied Statistics in
Business and Economics
End of Chapter 4
