
Chapter 3

NUMERICALLY
SUMMARIZING DATA

OUTLINE
3.1 Measures of Central Tendency
3.2 Measures of Dispersion
3.3 Measures of Position
3.4 The Five-Number Summary and Boxplots

MEASURES OF CENTRAL
TENDENCY
• A measure of central tendency numerically describes the center, or typical value, of a data set.
The three most widely used measures of central tendency:
Mean
Median
Mode

MEASURES OF CENTER

• Measures of center indicate where the center, or most typical value, of a data set lies.

MEAN

This is the most popular and useful measure of central tendency.

It is the sum of the observations (measurements) divided by the number of observations (measurements).

$$\text{Mean} = \frac{\text{Sum of the measurements}}{\text{Number of measurements}}$$

MEAN

Sample mean (n = sample size):
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Population mean (N = population size):
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

MEAN
Example 1

Find the mean rate of return for a portfolio equally invested in five stocks having the following annual rates of return: 11.2%, 8.07%, 5.55%, 13.7%, 21%.

Solution
$$\bar{x} = \frac{11.2 + 8.07 + 5.55 + 13.7 + 21}{5} = \frac{59.52}{5} = 11.904\%$$
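The calculation in Example 1 can be checked with a few lines of Python. This is only a minimal sketch; the variable names are illustrative and do not come from the slides.

```python
# Annual rates of return (in percent) listed in Example 1
returns = [11.2, 8.07, 5.55, 13.7, 21]

# Mean = sum of the measurements / number of measurements
mean_return = sum(returns) / len(returns)

print(f"Mean rate of return: {mean_return:.3f}%")
```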

MEDIAN
The median of a set of measurements is the value that
falls in the middle when the measurements are
arranged in order of magnitude.
When determining the median, pay attention to the number of observations (k).
• k is odd: Median = the number at the ((k+1)/2)th location of the ordered array.
• k is even: Median = the average of the two numbers in the middle (the numbers at the (k/2)th and the [(k/2)+1]th locations of the ordered array).

MEDIAN
Example 2

Odd number of observations
The salaries of seven employees were recorded (in 1000s): 28, 60, 26, 32, 30, 26, 29. Find the median salary.
Ordered array: 26, 26, 28, 29, 30, 32, 60
There are seven salaries (k = 7). The median is the salary at the (k+1)/2 = (7+1)/2 = 4th location of the ordered array.
The median is 29.

Even number of observations
Suppose an additional salary of $31,000 is added to the group of salaries recorded before. Find the median salary.
Ordered array: 26, 26, 28, 29, 30, 31, 32, 60
There are eight salaries (k = 8). The two salaries in the middle are 29 (in the (k/2)th = 4th location) and 30 (in the [(k/2)+1]th = 5th location).
The median is the average of these two numbers: 29.5.
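The odd/even location rule can be sketched in Python as follows. The function name `median_by_position` is an illustrative choice, not something defined in the slides; the standard library's `statistics.median` is used only as a cross-check.

```python
import statistics

def median_by_position(values):
    """Median via the location rule: (k+1)/2 for odd k, average of k/2 and k/2+1 for even k."""
    ordered = sorted(values)
    k = len(ordered)
    if k % 2 == 1:
        # Odd k: the value at location (k+1)/2 (1-based), i.e. index (k+1)//2 - 1
        return ordered[(k + 1) // 2 - 1]
    # Even k: average of the values at locations k/2 and k/2 + 1 (1-based)
    return (ordered[k // 2 - 1] + ordered[k // 2]) / 2

salaries = [28, 60, 26, 32, 30, 26, 29]      # in $1000s, k = 7
salaries_with_new = salaries + [31]          # k = 8 after adding $31,000

median_odd = median_by_position(salaries)
median_even = median_by_position(salaries_with_new)

print(median_odd, median_even)
print(statistics.median(salaries), statistics.median(salaries_with_new))
```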

MODE
The mode of a set of measurements is the value that occurs most frequently.
A set of data may have one mode (or modal class), or two or more modes.

For large data sets, the modal class is much more relevant than a single-value mode.

MODE
Example 3
• The manager of a men’s clothing store observes the waist size (in inches) of trousers sold last week: 31, 34, 36, 33, 28, 34, 30, 34, 32, 40.
• The mode of this data set is 34 in.

This information seems valuable (for example, for the design of a new display in the store), much more so than “the median is 33.5 in.”
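The mode and median in Example 3 can be reproduced with Python's standard `statistics` module. This is a minimal sketch; `multimode` is used because a data set may have more than one mode.

```python
import statistics

# Waist sizes (in inches) of trousers sold last week, from Example 3
waist_sizes = [31, 34, 36, 33, 28, 34, 30, 34, 32, 40]

# multimode returns every value tied for the highest frequency
modes = statistics.multimode(waist_sizes)
median_size = statistics.median(waist_sizes)

print("Mode(s):", modes)
print("Median:", median_size)
```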

RELATIONSHIP AMONG
MEAN, MEDIAN, AND MODE
If a distribution is symmetrical, the mean, median and mode coincide.

• If a distribution is nonsymmetrical, and skewed to the left or to the right, the three measures differ.

A positively skewed distribution (“skewed to the right”): Mode < Median < Mean
A negatively skewed distribution (“skewed to the left”): Mean < Median < Mode
USING THE
MEAN, MEDIAN, AND
MODE
When to use (or not use) each measure of central location:
• The mean is very sensitive to extreme values, and thus should not be used when a few extreme values lie far from most of the observations. The mean is used in most statistical analyses.
• The median is not affected by extreme values and therefore can be used in their presence. However, the median does not reflect all the values in the data set, only the location of the observation in the middle.
• The mode should be used mainly for categorical data.

SUMMARY EXAMPLES
Example 4
A professor of statistics wants to report the results of a
midterm exam, taken by 100 students.
• The mean of the test marks is 73.90
• The median of the test marks is 81
• The mode of the test marks is 84
Describe the information each one provides.

The mean provides information about the overall performance level of the class. It can serve as a tool for making comparisons with other classes and/or other exams.

The median indicates that half of the class received a grade below 81%, and half of the class received a grade above 81%. A student can use this statistic to place his/her mark relative to other students in the class.

The mode must be used when data is qualitative. If marks are classified by letter grade, the frequency of each grade can be calculated. Then, the mode becomes a logical measure to compute.
SUMMARY EXAMPLES
Example 5
• The following sample represents the lateness of arriving flights at a certain domestic airport (in minutes): 22, 12, 4, -3……
(a) Find the mean, median, and mode of this sample. Do these data form a skewed distribution? If so, is it negatively or positively skewed?
(b) Which measure should not be used? Change the largest lateness to 34 minutes (rather than 67). Which measures of central location are affected?
(c) A person is waiting for the arrival of a certain flight. He is told the flight will probably be no more than 10 minutes late. Should he believe this is a reliable estimate? Use the distribution of data requested in part (b).

SUMMARY EXAMPLES
Example 5 - solution
• We run the data in Excel using the ‘Descriptive Statistics’ tool.

Lateness
Mean                  10.8709677
Standard Error        2.6436135
Median                6
Mode                  4
Standard Deviation    14.719017
Sample Variance       216.649462
Kurtosis              6.39059859
Skewness              2.17922953
Range                 75
Minimum               -8
Maximum               67
Sum                   337
Count                 31

• The distribution of these data shows positive skewness.
• Do not use the mean, because an ‘outlier’ of 67 minutes lateness affects (increases) the mean value to be almost 11 minutes.

[Figure: Lateness]

SUMMARY EXAMPLES
Example 5 - solution
• When the largest observation is changed from 67 to 34, the mean decreases to 9.81 minutes, but the median and mode do not change.

Lateness
Mean                  9.806451613
Standard Error        2.034339265
Median                6
Mode                  4
Standard Deviation    11.32672166
Sample Variance       128.2946237
Kurtosis              0.919374432
Skewness              1.051857781
Range                 48
Minimum               -8
Maximum               40
Sum                   304
Count                 31

• It is reasonable to believe that the lateness will not exceed 10 minutes. From the ogive we see that about 60% of the flights arrive within 10 minutes of the scheduled arrival time.

[Figure: histogram and ogive of Lateness; x-axis bins: -1, 8, 17, 26, 35, More; left axis: Frequency (0 to 20); right axis: cumulative percentage (0% to 100%)]
3.2 MEASURES OF
DISPERSION
Measures of central tendency fail to tell the whole story about the distribution.
A question of interest still remains unanswered:

How much are the values of a given set spread out around the mean value?

WHY DO WE NEED MEASURES OF
DISPERSION?
Observe two hypothetical data sets:

Set 1: Small variability
The mean provides a good representation of the values in the data set.

Set 2: Larger variability
The mean is the same as before, but it no longer represents the values in the set as well as before.

RANGE
• The range of a set of measurements is the difference between the largest and smallest measurements.
• Its major advantage is the ease with which it can be computed.
• Its major shortcoming is its failure to provide information on the dispersion of the values between the two end points.

VARIANCE

This measure reflects the dispersion of all the measurement values.

The variance of a population of N measurements x1, x2, …, xN having a mean $\mu$ is defined as
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

The variance of a sample of n measurements x1, x2, …, xn having a mean $\bar{x}$ is defined as
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
VARIANCE

Consider two small populations:

A: 8, 9, 10, 11, 12
B: 4, 7, 10, 13, 16

The mean of both populations is 10, but the measurements in B are more dispersed than those in A. A measure of dispersion should agree with this observation.

Can the sum of deviations from the mean be a good measure of dispersion?

A: 8-10 = -2, 9-10 = -1, 10-10 = 0, 11-10 = +1, 12-10 = +2; Sum = 0
B: 4-10 = -6, 7-10 = -3, 10-10 = 0, 13-10 = +3, 16-10 = +6; Sum = 0
VARIANCE

The sum of deviations is zero for both populations. Therefore, it is not a good measure of dispersion, since clearly their dispersions are not equal.

VARIANCE

Let us calculate the variance of the two populations:

$$\sigma_A^2 = \frac{(8-10)^2 + (9-10)^2 + (10-10)^2 + (11-10)^2 + (12-10)^2}{5} = 2$$

$$\sigma_B^2 = \frac{(4-10)^2 + (7-10)^2 + (10-10)^2 + (13-10)^2 + (16-10)^2}{5} = 18$$

Why is the variance defined as the average squared deviation? Why not use the sum of squared deviations as a measure of dispersion instead? After all, the sum of squared deviations increases in magnitude when the dispersion of a data set increases!
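The comparison of populations A (8, 9, 10, 11, 12) and B (4, 7, 10, 13, 16) can be sketched in Python. The helper name `sum_of_deviations` is illustrative only. `statistics.pvariance` computes the population variance (division by N).

```python
import statistics

population_a = [8, 9, 10, 11, 12]
population_b = [4, 7, 10, 13, 16]

def sum_of_deviations(values):
    """Sum of (x - mean); this is always zero, so it cannot measure dispersion."""
    mu = statistics.mean(values)
    return sum(x - mu for x in values)

dev_a = sum_of_deviations(population_a)
dev_b = sum_of_deviations(population_b)

# Population variance: average squared deviation from the mean
var_a = statistics.pvariance(population_a)
var_b = statistics.pvariance(population_b)

print("Sum of deviations:", dev_a, dev_b)
print("Population variances:", var_a, var_b)
```

The deviations cancel in both populations, while the variances differ, which is why the squared deviations are used.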
VARIANCE

Let us calculate the sum of squared deviations for both data sets.

Which data set has a larger dispersion? Data set B is more dispersed around the mean.

[Figure: dot plots of data set A (axis marked 1, 2, 3) and data set B (axis marked 1, 3, 5)]

VARIANCE

$$\text{Sum}_A = (1-2)^2 + \dots + (1-2)^2 + (3-2)^2 + \dots + (3-2)^2 = 10$$

$$\text{Sum}_B = (1-3)^2 + (5-3)^2 = 8$$

Sum_A > Sum_B. This is inconsistent with the observation that set B is more dispersed.

VARIANCE

However, when calculated on a “per observation” basis (variance), the dispersions are properly ordered.

$$\sigma_A^2 = \frac{\text{Sum}_A}{N} = \frac{10}{10} = 1$$

$$\sigma_B^2 = \frac{\text{Sum}_B}{N} = \frac{8}{2} = 4$$
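The argument can be sketched in Python. The exact contents of data set A are an assumption here: the slides give only the sum (ten squared deviations of 1 around a mean of 2) and N = 10, so A is taken to be five 1s and five 3s. Data set B is taken to be the two values 1 and 5.

```python
import statistics

# ASSUMPTION: A consists of five 1s and five 3s (mean 2, N = 10), inferred from Sum_A = 10
data_a = [1] * 5 + [3] * 5
# B consists of the two observations 1 and 5 (mean 3, N = 2)
data_b = [1, 5]

def sum_of_squared_deviations(values):
    mu = statistics.mean(values)
    return sum((x - mu) ** 2 for x in values)

sum_sq_a = sum_of_squared_deviations(data_a)
sum_sq_b = sum_of_squared_deviations(data_b)

# Dividing by N gives the population variance, which orders the two sets correctly
var_a = sum_sq_a / len(data_a)
var_b = sum_sq_b / len(data_b)

print("Sum of squared deviations:", sum_sq_a, sum_sq_b)
print("Variance (per observation):", var_a, var_b)
```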

VARIANCE
Example 6
• Find the variance of the following set of numbers, representing annual rates of return for a group of mutual funds. Assume the set is (i) a sample, (ii) a population: -2, 4, 5, 6.9, 10

Solution

$$\bar{x} = \frac{\sum_{i=1}^{5} x_i}{5} = \frac{-2 + 4 + 5 + 6.9 + 10}{5} = \frac{23.9}{5} = 4.78$$

Assuming a sample:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{1}{5 - 1}\left[(-2 - 4.78)^2 + (4 - 4.78)^2 + \dots + (10 - 4.78)^2\right] = 19.59\ \text{percent}^2$$
VARIANCE

Example 6 - solution continued

Assuming a population (with $\mu = 4.78$ and $N = 5$):
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{1}{5}\left[(-2 - 4.78)^2 + (4 - 4.78)^2 + \dots + (10 - 4.78)^2\right] = 15.6736\ \text{percent}^2$$

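Both versions of the variance in Example 6 can be checked with Python's standard library. This is a sketch: `statistics.variance` divides by n - 1 (sample) and `statistics.pvariance` divides by N (population).

```python
import statistics

# Annual rates of return (in percent) for the mutual funds in Example 6
fund_returns = [-2, 4, 5, 6.9, 10]

mean_return = statistics.mean(fund_returns)

# (i) Treating the data as a sample: divide by n - 1
sample_variance = statistics.variance(fund_returns)

# (ii) Treating the data as a population: divide by N
population_variance = statistics.pvariance(fund_returns)

print(f"Mean: {mean_return:.2f}")
print(f"Sample variance: {sample_variance:.4f} percent^2")
print(f"Population variance: {population_variance:.4f} percent^2")
```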
STANDARD DEVIATION

The standard deviation of a set of measurements is the square root of the variance of the set.

Sample standard deviation: $s = \sqrt{s^2}$
Population standard deviation: $\sigma = \sqrt{\sigma^2}$

STANDARD DEVIATION
 Example 7
The daily percentages of defective items in two weeks of production (10 working days) were calculated for two production lines.
Which line provides good items more consistently?

Line 1: 8.3, 6.2, 20.9, 2.7, 33.6, 42.9, 24.4, 5.2, 3.1,
30.05

Line 2: 12.1, 2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, 1.3,
11.4

EXAMPLE 7: SOLUTION
SPSS printout obtained from the “Descriptive Statistics” sub-menu.
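The comparison asked for in Example 7 can also be made without SPSS. The following is a minimal Python sketch that treats each line's ten daily percentages as a sample; the line with the smaller standard deviation is the more consistent one.

```python
import statistics

# Daily percentage of defective items over 10 working days (Example 7)
line_1 = [8.3, 6.2, 20.9, 2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.05]
line_2 = [12.1, 2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, 1.3, 11.4]

# Sample standard deviation (divides by n - 1 before taking the square root)
sd_1 = statistics.stdev(line_1)
sd_2 = statistics.stdev(line_2)

more_consistent = "Line 1" if sd_1 < sd_2 else "Line 2"

print(f"Line 1: mean = {statistics.mean(line_1):.3f}, s = {sd_1:.3f}")
print(f"Line 2: mean = {statistics.mean(line_2):.3f}, s = {sd_2:.3f}")
print("More consistent:", more_consistent)
```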
INTERPRETING THE
STANDARD DEVIATION
The standard deviation can be used to
 compare the variability of several distributions
 make a statement about the general shape of a
distribution.
When describing the shape of a distribution we refer to
 A distribution with any shape
 A mound shaped distribution

EMPIRICAL RULE –
DESCRIBING A MOUND SHAPED
DATA SET

If a sample of measurements has a mound-shaped distribution, the interval…

($\bar{x} - s$, $\bar{x} + s$) contains approximately 68% of the measurements
($\bar{x} - 2s$, $\bar{x} + 2s$) contains approximately 95% of the measurements
($\bar{x} - 3s$, $\bar{x} + 3s$) contains approximately 99.7% of the measurements

EMPIRICAL RULE
 Example 8
Running the Descriptive Statistics tool in Excel we have
Mean = 17.959
Standard deviation (sample) = 0.556
• Based on the following histogram, describe the set of data.

• Solution

From the histogram it appears that the distribution is approximately mound-shaped. We’ll use the empirical rule to describe the data.

[Figure: histogram of Measurements; x-axis bins: 17, 17.4, 17.8, 18.2, 18.6, More; y-axis: Frequency (0 to 15)]

THE EMPIRICAL RULE –
INTERPRETING THE STANDARD
DEVIATION
Example 8 – solution continued
Running the Descriptive Statistics tool in Excel we have
Mean = 17.959
Standard deviation (sample) = 0.556

• From the empirical rule we get:

Approximately 68% of the data lie between 17.403 and 18.515
[17.959 - 1(0.556), 17.959 + 1(0.556)]

Approximately 95% of the data lie between 16.847 and 19.071
[17.959 - 2(0.556), 17.959 + 2(0.556)]

Approximately 99.7% of the data lie between 16.291 and 19.627
[17.959 - 3(0.556), 17.959 + 3(0.556)]
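The three empirical-rule intervals follow directly from the mean and the sample standard deviation. A short Python sketch, using illustrative names that are not taken from the slides:

```python
mean = 17.959
s = 0.556

# Approximate coverage for mound-shaped data at 1, 2 and 3 standard deviations
coverage = {1: "68%", 2: "95%", 3: "99.7%"}

# Interval (mean - k*s, mean + k*s) for each k, rounded to three decimals
intervals = {k: (round(mean - k * s, 3), round(mean + k * s, 3)) for k in coverage}

for k, (low, high) in intervals.items():
    print(f"Approximately {coverage[k]} of the data lie between {low} and {high}")
```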

CHEBYSHEV THEOREM -
DESCRIBING ANY DATA SET
The proportion of observations in any sample that lie within k standard deviations of the mean is at least $1 - 1/k^2$, for any k > 1.
This theorem is valid for any set of measurements (sample, population) of any shape!

k    Interval                              Minimum %
2    ($\bar{x} - 2s$, $\bar{x} + 2s$)      at least 75% ($1 - 1/2^2$)
3    ($\bar{x} - 3s$, $\bar{x} + 3s$)      at least 89% ($1 - 1/3^2$)
4    ($\bar{x} - 4s$, $\bar{x} + 4s$)      at least 94% ($1 - 1/4^2$)
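The minimum percentages in the table come from evaluating 1 - 1/k² at each k. A minimal Python sketch follows; the function name `chebyshev_lower_bound` is hypothetical and chosen only for illustration.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of observations within k standard deviations of the mean."""
    if k <= 1:
        # The bound is only informative for k > 1
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

bounds = {k: chebyshev_lower_bound(k) for k in (2, 3, 4)}

for k, bound in bounds.items():
    print(f"k = {k}: at least {bound:.1%} of the observations")
```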

THE CHEBYSHEV
THEOREM
Example 9
• Employee salaries were recorded and a histogram was created. Describe these data using the correct numerical measures.

• Solution

Creating the histogram, we realize that the distribution is positively skewed. Chebyshev's theorem needs to be used to describe the data.

[Figure: histogram of Salary; x-axis bins: 155, 200, 245, 290, 335, 380, 425; y-axis: Frequency (0 to 20)]

THE CHEBYSHEV
THEOREM
Example 9 - solution continued
• From Excel we have:
Mean = 243.2
Standard deviation = 58.354
• Applying Chebyshev's theorem:

• At least 75% of the salaries lie within
[243.2 - 2(58.354), 243.2 + 2(58.354)] = [126.492, 359.908]
Actual count: 39 (97.5%)

• At least 88.9% of the salaries lie within
[243.2 - 3(58.354), 243.2 + 3(58.354)] = [68.138, 418.262]
Actual count: all (100%)

THE COEFFICIENT OF
VARIATION
The coefficient of variation is the ratio of the standard deviation to the mean. It is a useful statistic for comparing the degree of variation from one data series to another, even if the means are drastically different from one another.
The higher the coefficient of variation, the greater the level of dispersion around the mean. It is generally expressed as a percentage.

Sample coefficient of variation: $cv = \dfrac{s}{\bar{x}}$

Population coefficient of variation: $CV = \dfrac{\sigma}{\mu}$
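As an illustration, the sample coefficient of variation can be computed for the two production lines of Example 7. Reusing that data here is a choice made for this sketch, not something the slides do, and the function name is hypothetical.

```python
import statistics

def sample_cv(values):
    """Sample coefficient of variation: s divided by the sample mean."""
    return statistics.stdev(values) / statistics.mean(values)

# Daily percentage of defective items (data from Example 7)
line_1 = [8.3, 6.2, 20.9, 2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.05]
line_2 = [12.1, 2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, 1.3, 11.4]

cv_1 = sample_cv(line_1)
cv_2 = sample_cv(line_2)

# Expressed as percentages, as is customary for the coefficient of variation
print(f"Line 1: cv = {cv_1:.1%}")
print(f"Line 2: cv = {cv_2:.1%}")
```

Because the two lines have different means, the cv compares their variability relative to each line's own average level.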

MEASURES OF RELATIVE
LOCATION AND BOX
PLOTS
Additional information on the general shape of a data set can be
obtained by describing the relative location of 5 values within the
data set.
We use percentiles to describe these 5 relative locations. What is a
percentile?

MEASURES OF RELATIVE
LOCATION AND BOX PLOTS
Percentile
 The pth percentile of a set of measurements is the value for which
 At most p% of the measurements are less than that value
 At most (100-p)% of all the measurements are greater than that value.

Example
• Suppose your score is the 60th percentile of an SAT test. Then 60% of all the scores lie below your score, and 40% lie above it.
MEASURES OF RELATIVE
LOCATION AND BOX PLOTS
Here are two possible approaches commonly used to describe a set of values.

The five-number summary:
• Smallest value
• First quartile (Q1)
• Median (Q2)
• Third quartile (Q3)
• Largest value

- OR -

• The first decile (the 10th percentile)
• First quartile (Q1)
• Median (Q2)
• Third quartile (Q3)
• The ninth decile (the 90th percentile)

QUARTILES AND
VARIABILITY
Quartiles can provide an idea about the shape of a
histogram

[Figure: quartiles Q1, Q2, Q3 marked on a positively skewed histogram and on a negatively skewed histogram]

INTER-QUARTILE RANGE

This is a measure of the spread of the middle 50% of the observations.
A large value indicates a large spread of the observations.

Interquartile range = Q3 – Q1
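A small Python sketch of the quartiles and the interquartile range, applied to the seven salaries from Example 2 (reusing that data is a choice made here for illustration). Quartile conventions differ between textbooks and software, and the slides do not state which one they use; the `"inclusive"` method of `statistics.quantiles` is an assumption.

```python
import statistics

# Salaries (in $1000s) from Example 2
salaries = [28, 60, 26, 32, 30, 26, 29]

# ASSUMPTION: the "inclusive" interpolation method; other conventions can give other quartiles
q1, q2, q3 = statistics.quantiles(salaries, n=4, method="inclusive")

iqr = q3 - q1

five_number_summary = (min(salaries), q1, q2, q3, max(salaries))

print("Five-number summary:", five_number_summary)
print("Interquartile range:", iqr)
```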

BOX PLOT
 A box plot is a pictorial display that provides the main descriptive measures of the
measurement set:
 L - the largest measurement
 Q3 - The upper quartile
 Q2 - The median
 Q1 - The lower quartile
 S - The smallest measurement
An outlier is defined as any value
that is more than 1.5(Q3 – Q1)
away from the box.

[Figure: box plot with S, Q1, Q2, Q3 and L marked; a whisker on each side of the box extends at most 1.5(Q3 – Q1) from the box]

BOX PLOT
• Example 11: Create a box plot for the data regarding the GMAT scores of 200 applicants.

GMAT scores: 512, 531, 461, 515, ...

Smallest = 449
Q1 = 512
Median = 537
Q3 = 575
Largest = 788
IQR = 63
Outliers = (788, 788, 766, 763, 756, 719, 712, 707, 703, 694, 690, 675)

Lower limit: 512 - 1.5(IQR) = 417.5
Upper limit: 575 + 1.5(IQR) = 669.5

[Figure: box plot on an axis marked 417.5, 449, 512, 537, 575, 669.5, 788]
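The outlier limits in Example 11 follow from the quartiles alone. The sketch below uses only the summary values reported on the slide, since the full list of 200 scores is not given; the variable names are illustrative.

```python
# Quartiles reported for the GMAT scores in Example 11
q1, q3 = 512, 575

iqr = q3 - q1

# Values more than 1.5 * IQR away from the box are treated as outliers
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Outliers as listed on the slide
reported_outliers = [788, 788, 766, 763, 756, 719, 712, 707, 703, 694, 690, 675]

# Check that every reported outlier falls outside the limits
all_outside = all(x < lower_limit or x > upper_limit for x in reported_outliers)

print(f"IQR = {iqr}, limits = ({lower_limit}, {upper_limit})")
print("All reported outliers fall outside the limits:", all_outside)
```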
BOX PLOT
Example 11 - continued

[Figure: box plot on an axis marked 449, 512 (Q1), 537 (Q2), 575 (Q3), 669.5; 25% of the scores lie below Q1, 50% between Q1 and Q3, and 25% above Q3]

• Interpreting the box plot results
• The scores range from 449 to 788.
• About half the scores are smaller than 537, and about half are larger than 537.
• About half the scores lie between 512 and 575.
• About a quarter lie below 512 and a quarter above 575.

BOX PLOT

Example 11 - continued
The data set is positively skewed.

[Figure: box plot on an axis marked 449, 512 (Q1), 537 (Q2), 575 (Q3), 669.5; segments labeled 25%, 50%, 25%]
