Descriptive Statistics - Note1

Numerical Descriptive Measures
Definitions
The central tendency locates the central value in a data set.
The variation measures how close to the centre or how dispersed
(scattered) the observations are from the centre.
The shape is the pattern of the distribution of values from the
lowest value to the highest value.
Describing Data Numerically
Central Tendency     Dispersion
Arithmetic Mean      Range
Median               Interquartile Range
Mode                 Variance
                     Standard Deviation
                     Coefficient of Variation
Measures of Central Tendency
Calculating the Mean, Median and Mode
Purpose: To determine the “centre” of the data values.
The Mean
The mean is also known as the average.

Calculating the Sample Mean
from raw data
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
where
$\bar{x}$ (pronounced x-bar) = sample mean
$x_i$ = the ith observation (values taken by x)
$n$ = sample size = number of observations
Example 1
The number of work days lost due to illness in a
business per week is given below
(for a 10 week period)
36, 28, 33, 29, 28, 32, 33, 33, 34, 32
Calculate the mean number of days lost per week during the above period.
Sample mean,
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{36 + 28 + 33 + \dots + 32}{10} = \frac{318}{10} = 31.8$$
Exercise 1
The following are the ages (in years) of all eight employees of a small company
53, 32, 61, 27, 39, 44, 49, 57
Find the mean age of these employees.
Answer: 45.25 years
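The two mean calculations above can be checked with a few lines of Python. This is a minimal sketch taken straight from the definition of the mean; the function name `sample_mean` is an illustrative choice and does not come from the notes.

```python
def sample_mean(values):
    """Arithmetic mean: the sum of the observations divided by their count n."""
    n = len(values)
    if n == 0:
        raise ValueError("the mean of an empty dataset is undefined")
    return sum(values) / n


# Work days lost per week (Example 1) and employee ages (Exercise 1)
days_lost = [36, 28, 33, 29, 28, 32, 33, 33, 34, 32]
ages = [53, 32, 61, 27, 39, 44, 49, 57]

print(sample_mean(days_lost))  # 31.8
print(sample_mean(ages))       # 45.25
```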
Properties of the Sample Mean
• Uniqueness: for a given set of data there is one and only one mean.
• Affected (distorted) by extreme values (outliers). For example:
  Data 1, 2, 3, 4, 5:  mean = (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3
  Data 1, 2, 3, 4, 10: mean = (1 + 2 + 3 + 4 + 10)/5 = 20/5 = 4
• May be better replaced by the median when the distribution of the data is skewed.
• An important property of the mean is that it includes every value in your data set as part of the calculation.
The Median
The median is the value of the middle observation in a dataset.
Calculating the Median
from raw data
Step 1: First, arrange the observations in ascending order.
Step 2: Then, find the middle position using the following formula:
$$\text{Median position} = \frac{n + 1}{2}$$
Step 3: If n is odd, the median is the value in the median position. If n is even, the position falls half way between two ranks, so the median is the average of the two middle values.
Example 1
Find the median for the following data set.
27 38 12 34 42 40 24 40 23
• The ordered set becomes
Observation  12 23 24 27 34 38 40 40 42
Rank          1  2  3  4  5  6  7  8  9
• The median position is (9 + 1)/2 = 5, i.e. the 5th rank (observation).
• Therefore the median = 34
Exercise 1
Sambiri Silicon manufactures computer monitors.
The following data are numbers of computer
monitors produced at the company for a sample of
10 days. Find the median.
24 31 27 25 35 33 26 40 25 28
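A short Python sketch of the median steps may help with this exercise. The exercise has an even number of observations (n = 10), so the position (n + 1)/2 = 5.5 falls between two ranks; the sketch follows the usual convention of averaging the two middle values in that case. The function name `median_from_raw` is illustrative only.

```python
def median_from_raw(values):
    """Median of raw data using the (n + 1) / 2 position rule."""
    ordered = sorted(values)          # Step 1: ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty dataset is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd n: position (n + 1) / 2 is a whole rank (index is 0-based)
        return ordered[middle]
    # Even n: average the two values either side of position (n + 1) / 2
    return (ordered[middle - 1] + ordered[middle]) / 2


example_data = [27, 38, 12, 34, 42, 40, 24, 40, 23]       # n = 9
monitors = [24, 31, 27, 25, 35, 33, 26, 40, 25, 28]       # n = 10

print(median_from_raw(example_data))  # 34
print(median_from_raw(monitors))      # 27.5
```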
Properties of the Median
• In an ordered array, the median is the “middle” number (50% above, 50% below).
• Uniqueness: there is only one median for each set of data.
• Not affected by extreme values.
[Figure: two dot plots on a 0 to 10 axis; Median = 3 in both]
The Mode
The mode is the most frequently occurring value in a dataset.
Calculating the Mode
from raw data
Step 1: First, arrange the observations in ascending order.
Step 2: The mode is the most frequently occurring value in the dataset.
Example 1
• Find the mode for the data below
7.00 11.00 14.25 15.00 15.00 15.50
19.00 19.00 19.00 19.00 21.00 22.00
23.00 24.00 25.00 27.00 27.00 28.00
34.22 43.25
The mode is 19.00 because it occurs the most times, i.e. four (4) times.
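The same count can be reproduced in Python. The sketch below uses `collections.Counter` from the standard library and returns every value that reaches the highest frequency, since a dataset may have more than one mode; the function name `modes` is illustrative.

```python
from collections import Counter


def modes(values):
    """Return (list of most frequent values, their frequency)."""
    if not values:
        raise ValueError("the mode of an empty dataset is undefined")
    counts = Counter(values)
    highest = max(counts.values())
    most_frequent = sorted(v for v, c in counts.items() if c == highest)
    return most_frequent, highest


data = [7.00, 11.00, 14.25, 15.00, 15.00, 15.50,
        19.00, 19.00, 19.00, 19.00, 21.00, 22.00,
        23.00, 24.00, 25.00, 27.00, 27.00, 28.00,
        34.22, 43.25]

print(modes(data))  # ([19.0], 4)
```

If every value occurs equally often, the function returns all of them, which corresponds to the "no single mode" situation.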
Properties of the Mode
• Normally, the mode is used for categorical data where we wish to know which is the most common category.
• Not affected by extreme values.
• The mode is not unique: a dataset can have no mode, one mode or several modes.
[Figure: two dot plots, one labelled "No Mode" and one labelled "Mode = 9"]
• When there are several modes, it is unclear which one best describes the central tendency of the data.
• This is particularly problematic with continuous data, because it is likely that no single value is more frequent than the others.
• For example, consider measuring the weights of 30 people (to the nearest 0.1 kg). How likely is it that we will find two or more people with exactly the same weight (e.g., 67.4 kg)? Probably not very likely: many people might be close, but with such a small sample (30 people) and a large range of possible weights, you are unlikely to find two people with exactly the same weight, that is, to the nearest 0.1 kg. This is why the mode is very rarely used with continuous data.
Question
When re‐ordering, the most common hat or
jeans size is what you would like to know, not
the average hat or jeans size.
The Shape: Skewness
The shape is the pattern of the distribution of values from the
lowest value to the highest value.
[Figures: one symmetric histogram and two skewed histograms]
Measures of skewness
• Pearson’s coefficient
• Bowley’s coefficient (Galton’s coefficient)
Basic Business Statistics, 11e © 2009 Prentice-Hall, Inc.
Measures of Central Tendency: Summary
• Sample Mean: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
• Median: middle value in the ordered array
• Mode: most frequently observed value
• Geometric Mean: $\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$, the rate of change of a variable over time
Measures of Dispersion
Which dataset has the larger variation?
[Figure: Dataset 1 and Dataset 2; Population 1 and Population 2 have the same centre but different variation]
Population 1              Population 2
Narrow range              Wide range
Smaller variation         Larger variation
Smaller deviation         Larger deviation
Observations clustered    Observations spread out
The measures of central tendency (the mean, median and mode) do not
reveal the whole picture of the distribution of the dataset.
Two datasets with the same mean may have completely different spreads.
The amount or degree of spread is known as variation.
Measures of variation: Range, Variance, Standard Deviation, Coefficient of Variation
Measures of variation give information on the spread or variability
or dispersion of the data values.
[Figure: same centre, different variation]
Measures of Dispersion: The Range
$$\text{Range} = X_{\text{largest}} - X_{\text{smallest}}$$
Example:
[Figure: dot plot on a 0 to 14 axis, smallest value 1 and largest value 13]
Range = 13 - 1 = 12
Why The Range Can Be Misleading
• Ignores the way in which data are distributed
[Figure: two dot plots on a 7 to 12 axis with differently distributed values]
Range = 12 - 7 = 5 in both cases
• Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
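The sensitivity of the range to a single outlier is easy to reproduce. In the Python sketch below the two lists are the datasets shown above, which differ only in their last value; the function name `data_range` is illustrative.

```python
def data_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


# Eleven 1s, eight 2s, four 3s, then 4 and either 5 or 120
without_outlier = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
with_outlier = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]

print(data_range(without_outlier))  # 4
print(data_range(with_outlier))     # 119
```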

The Sample Variance
Variance is used to measure the dispersion of values relative to the mean.
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$$
Where
$\bar{x}$ = arithmetic mean
$n$ = sample size
$x_i$ = ith observation of the variable x
The Sample Standard Deviation
• Most commonly used measure of variation
• Tells us how much observations in our sample differ from the mean value within our sample.
• Has the same units as the original data, making it easier to interpret.
$$s = \sqrt{s^2}$$
Example
For the sample data $x_i$:
2, 3, 5, 1, 4, 3, 2, 4
find the
1. Sample variance
2. Sample standard deviation
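One way to work this example is sketched below in Python. It computes the sample variance twice, once from the squared deviations about the mean and once from the computational shortcut (sum of squares minus n times the squared mean), both divided by n - 1, and then takes the square root for the standard deviation. The function names are illustrative.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """Equivalent shortcut: (sum of x^2 - n * mean^2) / (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mean = sum(values) / n
    return (sum(x * x for x in values) - n * mean ** 2) / (n - 1)


data = [2, 3, 5, 1, 4, 3, 2, 4]
s2 = sample_variance(data)           # 12 / 7
s = math.sqrt(s2)

print(round(s2, 3))                              # 1.714
print(round(sample_variance_shortcut(data), 3))  # 1.714
print(round(s, 3))                               # 1.309
```

Here the mean is 3, the squared deviations sum to 12, and n - 1 = 7, so the variance is 12/7.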
The variation or dispersion in a set of values refers to
how spread out the values are from each other.
• The variation is small when the values are close together.
• There is no variation if the values are the same.
[Figure: one distribution with smaller variation and one with larger variation]
The Coefficient of Variation
The variance and the standard deviation are useful as measures of
variation of the values of a single variable for a single
population (or sample).
If we want to compare the variation of two variables, we cannot use
the variance or the standard deviation because:
1. The variables might have different means.
2. The variables might have different units.
The coefficient of variation:
• Measures variation relative to the mean
• Is expressed as a percentage (%)
$$CV = \left( \frac{s}{\bar{x}} \right) \times 100\%$$
The coefficient of variation compares the variability of two
different datasets even if they have different units of measurement.
Example 1
Spot, the dog, weighs 65 pounds. Spot’s weight
fluctuates 5 pounds depending on Spot’s
exercise level.
Sea Biscuit, the horse, weighs 1200 pounds.
Sea Biscuit’s weight fluctuates 125 pounds
depending on the number of rides Sea
Biscuit goes on.
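A rough comparison of the two animals can be sketched in Python. The sketch rests on an assumption that the notes do not state explicitly: each quoted fluctuation is treated as a standard deviation and each quoted weight as a mean. The function name `coefficient_of_variation` is illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """CV = (standard deviation / mean) x 100, as a percentage."""
    if mean == 0:
        raise ValueError("the CV is undefined when the mean is zero")
    return std_dev / mean * 100


# Assumption: fluctuation is read as the standard deviation, weight as the mean
spot_cv = coefficient_of_variation(5, 65)
sea_biscuit_cv = coefficient_of_variation(125, 1200)

print(round(spot_cv, 2))         # 7.69
print(round(sea_biscuit_cv, 2))  # 10.42
```

Under that reading, Sea Biscuit's weight varies more relative to its mean, and the two percentages can be compared directly despite the very different weights.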
Coefficient of Variation
Some financial investors use the
coefficient of variation as a measure of
risk.
What does the Coefficient of
Variation tell us about the risk of a
stock that the standard deviation
does not?
Relative to the amount invested in a
stock, the coefficient of variation reveals
the risk of a stock in terms of the size of
the standard deviation relative to the
size of the mean (in percentage).
Example 2
Relative to the amount of money invested in the
stock, which stock, A or B, is riskier?
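                     Stock A    Stock B
Average price        $50        $100
Standard deviation   $5         $5
Comparing Coefficients of Variation
$$CV_A = \left( \frac{s}{\bar{x}} \right) \times 100\% = \frac{\$5}{\$50} \times 100\% = 10\%$$
$$CV_B = \left( \frac{s}{\bar{x}} \right) \times 100\% = \frac{\$5}{\$100} \times 100\% = 5\%$$
Comparing the coefficients of variation, the relative variation is
higher in stock A (10%) than in stock B (5%), so stock A is riskier
relative to its price even though both stocks have the same standard deviation.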
Example 3
The yearly salaries of all employees who work
for a company have a mean of $62,350 and a
standard deviation of $6820.
The years of experience for the same
employees have a mean of 15 years and a
standard deviation of 2 years.
Is the relative variation in the salaries larger or
smaller than that in the years of experience for
these employees?
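This question can be answered by computing the two coefficients of variation. The Python sketch below uses only the means and standard deviations quoted above; the dictionary layout and variable names are illustrative choices.

```python
# (mean, standard deviation) for each variable, as given in Example 3
variables = {
    "salary (dollars)": (62350, 6820),
    "experience (years)": (15, 2),
}

# CV = (standard deviation / mean) x 100, as a percentage
cv = {name: sd / mean * 100 for name, (mean, sd) in variables.items()}

for name, value in cv.items():
    print(f"{name}: CV = {value:.2f}%")
# salary (dollars): CV = 10.94%
# experience (years): CV = 13.33%

salary_is_less_variable = cv["salary (dollars)"] < cv["experience (years)"]
print(salary_is_less_variable)  # True
```

The relative variation in salaries is therefore smaller than that in years of experience, even though the salary standard deviation is far larger in absolute terms.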
Interpretation
• A low (%) value shows low variability, implying tight clustering of observations about the mean.
• A middle to high (%) value shows high variability, implying that observations are widely spread.
Measures of Position for ungrouped data
(Quartile Measures)
• Quartiles split the ranked data into 4 equal segments.
  25% | 25% | 25% | 25%
      Q1    Q2    Q3
• The first quartile (lower quartile), Q1, is the value below which 25% of the observations lie.
• The second quartile (middle quartile), Q2, is the same as the median, so 50% of the observations lie below it.
• The third quartile (upper quartile), Q3, is the value below which 75% of the observations lie.
Quartile Measures
• Q1 = 25th percentile = P25
• Q2 = 50th percentile = P50
• Q3 = 75th percentile = P75
Locating Quartile Positions
Step 1: First, arrange the observations in ascending order.
Step 2: Find the quartile positions using the following formulas.
$$Q_1 \text{ position} = 0.25(n + 1)$$
$$Q_2 \text{ position} = 0.5(n + 1)$$
$$Q_3 \text{ position} = 0.75(n + 1)$$
Step 3: Determine the quartile values.
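The three steps can be sketched in Python as follows. The sketch assumes that a fractional position is resolved by linear interpolation between the two neighbouring ranked values (for example, position 2.5 is half way between the 2nd and 3rd values). The function names are illustrative, and the data are the nine observations from the earlier median example.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, rank in sorted data."""
    n = len(ordered)
    lower_rank = int(position)         # whole part of the rank
    fraction = position - lower_rank   # how far towards the next rank
    if lower_rank < 1:
        return ordered[0]
    if lower_rank >= n:
        return ordered[-1]
    lower = ordered[lower_rank - 1]
    upper = ordered[lower_rank]
    return lower + fraction * (upper - lower)


def quartiles(values):
    """Return (Q1, Q2, Q3) using positions 0.25, 0.5 and 0.75 times (n + 1)."""
    ordered = sorted(values)                       # Step 1
    n = len(ordered)
    positions = [p * (n + 1) for p in (0.25, 0.5, 0.75)]   # Step 2
    return tuple(value_at_position(ordered, pos) for pos in positions)  # Step 3


data = [27, 38, 12, 34, 42, 40, 24, 40, 23]   # n = 9, positions 2.5, 5, 7.5
print(quartiles(data))  # (23.5, 34.0, 40.0)
```

Q2 comes out as 34, which agrees with the median found for the same data.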
The Interquartile Range (IQR)
Remember that the range can be distorted by
outliers.
The IQR excludes these outliers and focuses on the
spread of the middle 50% of the data values.
The IQR is also called the 50% mid‐spread range.
$$IQR = Q_3 - Q_1$$
Weakness
The IQR, like the range, also provides no
information on the clustering of observations
within the dataset as it uses only two
observations in its computation.
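The contrast between the range and the IQR when an outlier is present can be illustrated in Python. The data are the twelve ordered values used in Example 2 of this section, which end in the outlier 45. The sketch relies on `statistics.quantiles` with `method="exclusive"`, which places the quartiles at positions based on n + 1 with linear interpolation and so should match the position rule used in these notes. Replacing 45 with 18 is a hypothetical change made only for the comparison.

```python
import statistics

data = [7, 8, 9, 10, 11, 12, 13, 13, 14, 17, 17, 45]

# Quartiles at positions 0.25(n + 1), 0.5(n + 1), 0.75(n + 1)
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1
full_range = max(data) - min(data)

print(q1, q3, iqr)   # 9.25 16.25 7.0
print(full_range)    # 38

# Hypothetical comparison: replace the outlier 45 with 18
no_outlier = data[:-1] + [18]
q1_b, _, q3_b = statistics.quantiles(no_outlier, n=4, method="exclusive")
iqr_b = q3_b - q1_b
range_b = max(no_outlier) - min(no_outlier)

print(iqr_b)    # 7.0
print(range_b)  # 11
```

The range drops from 38 to 11 when the outlier is replaced, while the IQR stays at 7.0.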
Example 1
Given Sample Data in Ordered Array:

11 12 13 16 16 17 18 21 22
Find
1. Q1 and Q3
2. IQR
Locating First quartile, Q1
11 12 13 16 16 17 18 21 22
(n = 9)
Q1 is in the 0.25(9 + 1) = 2.5th position of the ranked data,
so use the value half way between the 2nd and 3rd values:
$$Q_1 = \frac{12 + 13}{2} = 12.5 \quad \text{or} \quad Q_1 = 12 + \frac{13 - 12}{2} = 12.5$$
Locating Third Quartile, Q3
11 12 13 16 16 17 18 21 22
(n = 9)
Q3 is in the 0.75(9 + 1) = 7.5th position of the ranked data,
so use the value half way between the 7th and 8th values:
$$Q_3 = \frac{18 + 21}{2} = 19.5 \quad \text{or} \quad Q_3 = 18 + \frac{21 - 18}{2} = 19.5$$
$$IQR = Q_3 - Q_1 = 19.5 - 12.5 = 7.0$$
Example 2
Given Sample Data in Ordered Array:

7 8 9 10 11 12 13 13 14 17 17 45
Find
1. Q1 and Q3
2. IQR
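Locating First quartile, Q1
7 8 9 10 11 12 13 13 14 17 17 45
(n = 12) Q1 is in the 0.25(12 + 1) = 3.25th position of the ranked data,
i.e. a quarter of the way from the 3rd value to the 4th value.
First find the value half way between the 3rd and 4th values, which is
$$\frac{9 + 10}{2} = 9.5$$
Then take the value half way between the 3rd value and that midpoint:
$$Q_1 = \frac{9 + 9.5}{2} = 9.25 \quad \text{or} \quad Q_1 = 9 + \frac{10 - 9}{4} = 9.25$$
Locating Third Quartile, Q3
7 8 9 10 11 12 13 13 14 17 17 45
(n = 12) Q3 is in the 0.75(12 + 1) = 9.75th position of the ranked data,
i.e. three quarters of the way from the 9th value to the 10th value.
First find the value half way between the 9th and 10th values, which is
$$\frac{14 + 17}{2} = 15.5$$
Then take the value half way between that midpoint and the 10th value:
$$Q_3 = \frac{15.5 + 17}{2} = 16.25 \quad \text{or} \quad Q_3 = 17 - \frac{17 - 14}{4} = 16.25$$
$$IQR = Q_3 - Q_1 = 16.25 - 9.25 = 7.0$$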
End of Chapter
Grouped data
 Mean
 Variance
 CV