NUMERICALLY
SUMMARIZING DATA
OUTLINE
3.1 Measures of Central Tendency
3.2 Measures of Dispersion
3.3 Measures of Position
3.4 The Five-Number Summary and Boxplots
MEASURE OF CENTRAL
TENDENCY
A measure of central tendency numerically describes the
center, or typical value, of a data set.
The three most widely used measures of central tendency are:
Mean
Median
Mode
MEAN
Example 1
Solution
11 .2 8.07 5.55 13.7 21
x 9.764%
5
MEDIAN
The median of a set of measurements is the value that
falls in the middle when the measurements are
arranged in order of magnitude.
When determining the median, pay attention to the
number of observations (k).
‘k’ is odd
Median = the value at the ((k+1)/2)th location of the ordered
array.
‘k’ is even
Median = the average of the two middle values (the values at the (k/2)th and
the ((k/2)+1)th locations of the ordered array).
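The odd/even rule can be sketched in Python as follows. This is a minimal illustration, not part of the original slides; the function name `median` and the first two small data sets are assumptions chosen for the example, while the waist sizes are those given in Example 3.

```python
def median(values):
    """Return the median using the odd/even location rule."""
    ordered = sorted(values)   # arrange in order of magnitude
    k = len(ordered)           # number of observations
    if k == 0:
        raise ValueError("median of an empty data set is undefined")
    if k % 2 == 1:
        # k odd: the value at the ((k+1)/2)th location (1-based)
        return ordered[(k + 1) // 2 - 1]
    # k even: average of the (k/2)th and ((k/2)+1)th values (1-based)
    return (ordered[k // 2 - 1] + ordered[k // 2]) / 2


# Hypothetical data sets for illustration
print(median([7, 1, 5]))       # odd k: middle value of 1, 5, 7
print(median([7, 1, 5, 3]))    # even k: average of 3 and 5

# Waist sizes from Example 3 (k = 10, so the two middle values are averaged)
waist_sizes = [31, 34, 36, 33, 28, 34, 30, 34, 32, 40]
print(median(waist_sizes))
```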
Example 2
MODE
The mode of a set of measurements is the value that
occurs most frequently.
A set of data may have one mode (or modal class),
or two or more modes.
Example 3
The manager of a men’s clothing store observes the waist size (in inches) of
trousers sold last week: 31, 34, 36, 33, 28, 34, 30, 34, 32, 40.
The mode of this data set is 34 in.
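The mode in Example 3 can be checked with a short Python sketch. The helper name `modes` is an assumption for this illustration; it returns a list because a data set may have more than one mode.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs most frequently, in ascending order."""
    if not values:
        raise ValueError("mode of an empty data set is undefined")
    counts = Counter(values)       # value -> number of occurrences
    top = max(counts.values())     # highest frequency observed
    return sorted(v for v, c in counts.items() if c == top)


# Waist sizes (in inches) from Example 3
waist_sizes = [31, 34, 36, 33, 28, 34, 30, 34, 32, 40]
print(modes(waist_sizes))          # [34]

# A hypothetical bimodal data set
print(modes([1, 1, 2, 2, 3]))      # [1, 2]
```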
RELATIONSHIP AMONG
MEAN, MEDIAN, AND MODE
If a distribution is symmetrical and has a single mode, the mean,
median, and mode coincide.
SUMMARY EXAMPLES
Example 4
A professor of statistics wants to report the results of a
midterm exam, taken by 100 students.
• The mean of the test marks is 73.90
• The median of the test marks is 81
• The mode of the test marks is 84
Describe the information each one provides.
The mean provides information about the overall performance level
of the class. It can serve as a tool for making comparisons with other
classes and/or other exams.
The median indicates that half of the class received a grade below 81%,
and half of the class received a grade above 81%. A student can use
this statistic to place his/her mark relative to other students in the class.
Example 5 - solution
We run the data in Excel using the ‘Descriptive Statistics’ tool.
When the largest observation is changed from 67 to 34, the mean decreases to 9.80
minutes, but the median and mode do not change.
Range: 48
Minimum: -8
Maximum: 40
Sum: 304
Count: 31
3.2 MEASURES OF
DISPERSION
Measures of central tendency fail to tell the whole story
about the distribution.
A question of interest still remains unanswered:
WHY DO WE NEED MEASURES OF
DISPERSION?
Observe two hypothetical
data sets:
Set 1: Small variability
RANGE
The range of a set of measurements is the difference
between the largest and smallest measurements.
VARIANCE
This measure reflects the dispersion of all the
measurement values.
The variance of a population of N measurements
x1, x2, …, xN having a mean $\mu$ is defined as
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
The variance of a sample of n measurements
x1, x2, …, xn having a mean $\bar{x}$ is defined as
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
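The two definitions differ only in the divisor (N for a population, n - 1 for a sample). A minimal Python sketch, with function names and data chosen here purely for illustration, makes the difference concrete:

```python
def population_variance(xs):
    """Average squared deviation from the population mean (divide by N)."""
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one measurement")
    mu = sum(xs) / n
    return sum((x - mu) ** 2 for x in xs) / n


def sample_variance(xs):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two measurements")
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


# Hypothetical measurements: mean 5, squared deviations 9 + 1 + 1 + 9 = 20
data = [2, 4, 6, 8]
print(population_variance(data))   # 20 / 4
print(sample_variance(data))       # 20 / 3
```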
VARIANCE
A: 8 9 10 11 12
B: 4 7 10 13 16
The mean of both populations is 10, but the measurements in B are
more dispersed than those in A.
Deviations from the mean in B: 4 - 10 = -6, 7 - 10 = -3, 10 - 10 = 0, 13 - 10 = +3, 16 - 10 = +6
The deviations from the mean sum to 0.
$\sigma_B^2 = \frac{(4-10)^2 + (7-10)^2 + (10-10)^2 + (13-10)^2 + (16-10)^2}{5} = 18$
Why is the variance defined as the average squared deviation?
Why not use the sum of squared deviations as a measure of
dispersion instead?
After all, the sum of squared deviations increases in
magnitude when the dispersion of a data set increases!!
Data set B is more dispersed around the mean.
A: 1 2 3
B: 1 3 5
Example 6
Find the variance of the following set of numbers, representing annual rates of
return for a group of mutual funds. Assume the set is (i) a sample, (ii) a
population: -2, 4, 5, 6.9, 10
Solution
Assuming a sample:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{1}{5 - 1}\left[(-2 - 4.78)^2 + (4 - 4.78)^2 + \dots + (10 - 4.78)^2\right] = 19.59\ \text{percent}^2$
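Both parts of Example 6 can be cross-checked with Python's standard `statistics` module. This is a sketch for verification only; the variable names are assumptions. `statistics.variance` divides by n - 1 (sample) and `statistics.pvariance` divides by N (population).

```python
import statistics

# Annual rates of return (percent) from Example 6
returns = [-2, 4, 5, 6.9, 10]

mean_return = statistics.mean(returns)       # 4.78
sample_var = statistics.variance(returns)    # part (i): divide by n - 1
population_var = statistics.pvariance(returns)  # part (ii): divide by N

print(round(mean_return, 2))     # 4.78
print(round(sample_var, 2))      # 19.59
print(round(population_var, 2))  # 15.67
```

The population variance is smaller because the same sum of squared deviations (78.368) is divided by 5 instead of 4.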
STANDARD DEVIATION
Sample standard deviation: $s = \sqrt{s^2}$
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Example 7
The daily percentage of defective items in two weeks of
production (10 working days) was calculated for two
production lines.
Which line provides good items more consistently?
Line 1: 8.3, 6.2, 20.9, 2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.05
Line 2: 12.1, 2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, 1.3, 11.4
EXAMPLE 7 : SOLUTION
SPSS PRINTOUT OBTAINED FROM THE “DESCRIPTIVE STATISTICS” SUB-MENU.
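As a rough cross-check of Example 7 outside SPSS, the sample mean and sample standard deviation of each line can be computed in Python. This sketch assumes that "more consistent" means "smaller sample standard deviation"; the variable names are illustrative.

```python
import statistics

# Daily percentage of defective items from Example 7
line1 = [8.3, 6.2, 20.9, 2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.05]
line2 = [12.1, 2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, 1.3, 11.4]

for name, data in (("Line 1", line1), ("Line 2", line2)):
    mean = statistics.mean(data)
    s = statistics.stdev(data)   # sample standard deviation (n - 1 divisor)
    print(f"{name}: mean = {mean:.3f}, s = {s:.3f}")

# The line with the smaller standard deviation varies less from day to day
more_consistent = (
    "Line 2" if statistics.stdev(line2) < statistics.stdev(line1) else "Line 1"
)
print(more_consistent)
```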
INTERPRETING THE
STANDARD DEVIATION
The standard deviation can be used to
• compare the variability of several distributions
• make a statement about the general shape of a distribution.
When describing the shape of a distribution we refer to
• a distribution with any shape
• a mound-shaped distribution
EMPIRICAL RULE –
DESCRIBING A MOUND SHAPED
DATA SET
Example 8
Running the Descriptive statistics tool in Excel we have
Mean = 17.959
Standard deviation (sample) = 0.556
Based on the following histogram, describe the set of data.
[Histogram: Frequency vs. Measurements; bins 17, 17.4, 17.8, 18.2, 18.6, More]
Solution
From the histogram it appears that the distribution
is approximately mound shaped. We’ll use the
empirical rule to describe the data.
THE EMPIRICAL RULE –
INTERPRETING THE STANDARD
DEVIATION
Example 8 – solution continued
Running the Descriptive statistics tool in Excel we have
Mean = 17.959
Standard deviation (sample) = 0.556
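For a mound-shaped data set, the empirical rule states that roughly 68%, 95%, and 99.7% of the measurements fall within one, two, and three standard deviations of the mean. The sketch below applies those standard percentages to the mean and standard deviation reported for Example 8; the helper name `interval` is an assumption for illustration.

```python
# Statistics reported for Example 8
mean = 17.959
s = 0.556

# Approximate shares given by the empirical rule for k = 1, 2, 3
EMPIRICAL_SHARE = {1: 0.68, 2: 0.95, 3: 0.997}


def interval(mean, s, k):
    """Return (mean - k*s, mean + k*s)."""
    return mean - k * s, mean + k * s


for k, share in EMPIRICAL_SHARE.items():
    low, high = interval(mean, s, k)
    print(f"about {share:.1%} of measurements in ({low:.3f}, {high:.3f})")
```

With these values the intervals are approximately (17.403, 18.515), (16.847, 19.071), and (16.291, 19.627).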
CHEBYSHEV THEOREM -
DESCRIBING ANY DATA SET
The proportion of observations in any sample that lie within k
standard deviations of the mean is at least $1 - 1/k^2$
for any k > 1.
This theorem is valid for any set of measurements (sample,
population) of any shape!!
k    Interval                            Minimum %
2    $(\bar{x} - 2s,\ \bar{x} + 2s)$     at least 75% ($1 - 1/2^2$)
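The bound can be computed directly and checked against data. In the sketch below the function names and the skewed data set are hypothetical; the check uses the population standard deviation of the data set itself, for which the bound is guaranteed to hold.

```python
import statistics


def chebyshev_bound(k):
    """Minimum proportion within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("the bound is informative only for k > 1")
    return 1 - 1 / k ** 2


def share_within(data, k):
    """Observed proportion of values strictly within k std devs of the mean."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)
    return sum(1 for x in data if abs(x - mu) < k * sigma) / len(data)


print(chebyshev_bound(2))   # 0.75

# Hypothetical, strongly skewed data: the bound still holds
skewed = [1, 1, 2, 2, 3, 4, 5, 9, 15, 40]
print(share_within(skewed, 2) >= chebyshev_bound(2))   # True
```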
THE CHEBYSHEV
THEOREM
Example 9
Employee salaries were recorded and a histogram was
created. Describe this data using the correct numerical
measures.
Solution
Creating the histogram we realize that the distribution is positively
skewed. The Chebyshev theorem applies to a data set of any shape, so we
use it to describe the data.
[Histogram: Frequency vs. Salary; bins 155, 200, 245, 290, 335, 380, 425]
Example 9 – solution continued
From Excel we have:
Mean = 243.2
Standard deviation = 58.354
Applying the Chebyshev theorem
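The intervals implied by the theorem for Example 9 can be sketched as follows, using only the mean and standard deviation reported from Excel. The function name `chebyshev_interval` is an assumption, and the actual counts from the salary data are not available here, so only the guaranteed minimum shares are shown.

```python
# Statistics reported for Example 9
mean = 243.2
s = 58.354


def chebyshev_interval(mean, s, k):
    """Return (low, high, minimum share) for k standard deviations, k > 1."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return mean - k * s, mean + k * s, 1 - 1 / k ** 2


for k in (2, 3):
    low, high, share = chebyshev_interval(mean, s, k)
    print(f"k = {k}: at least {share:.1%} of salaries in ({low:.3f}, {high:.3f})")
```

For k = 2 the interval is about (126.492, 359.908) with at least 75% of the salaries; for k = 3 it is about (68.138, 418.262) with at least 88.9%.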
THE COEFFICIENT OF
VARIATION
The coefficient of variation represents the ratio of the
standard deviation to the mean, and it is a useful statistic for
comparing the degree of variation from one data series to
another, even if the means are drastically different from one
another.
The higher the coefficient of variation, the greater the level
of dispersion around the mean. It is generally expressed as a
percentage.
Sample coefficient of variation: $cv = \frac{s}{\bar{x}}$
Population coefficient of variation: $CV = \frac{\sigma}{\mu}$
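A short Python sketch shows why the coefficient of variation is useful when means differ greatly. The function name and both data series are hypothetical, chosen so that the series with the larger standard deviation has the smaller relative variation.

```python
import statistics


def coefficient_of_variation(data):
    """Sample coefficient of variation: s divided by the sample mean."""
    mean = statistics.mean(data)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for mean 0")
    return statistics.stdev(data) / mean


# Hypothetical series with very different means
small_scale = [10, 12, 14]          # s = 2,  mean = 12
large_scale = [1000, 1010, 1020]    # s = 10, mean = 1010

for name, data in (("small_scale", small_scale), ("large_scale", large_scale)):
    print(f"{name}: cv = {coefficient_of_variation(data):.2%}")
```

Although `large_scale` has the larger standard deviation, its variation relative to its mean is much smaller.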
MEASURES OF RELATIVE
LOCATION AND BOX
PLOTS
Additional information on the general shape of a data set can be
obtained by describing the relative location of 5 values within the
data set.
We use percentiles to describe these 5 relative locations. What is a
percentile?
Percentile
The pth percentile of a set of measurements is the value for which
• at most p% of the measurements are less than that value
• at most (100-p)% of all the measurements are greater than that value.
Example
Suppose your score is the 60th percentile of an SAT test. Then at most
60% of the scores are below your score and at most 40% are above it.
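Several conventions exist for computing percentiles. The sketch below uses the nearest-rank convention, which is an assumption made here (the slides do not say which convention they use); it always returns a value satisfying the two "at most" conditions in the definition. The function name and the scores are hypothetical.

```python
import math


def percentile(values, p):
    """Nearest-rank pth percentile (0 < p <= 100) of a non-empty data set."""
    if not values:
        raise ValueError("percentile of an empty data set is undefined")
    if not 0 < p <= 100:
        raise ValueError("p must satisfy 0 < p <= 100")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)   # 1-based location
    return ordered[rank - 1]


# Hypothetical scores 1, 2, ..., 10
scores = list(range(1, 11))
p60 = percentile(scores, 60)
print(p60)   # 6

# Check the definition: at most 60% below, at most 40% above
below = sum(1 for x in scores if x < p60) / len(scores)
above = sum(1 for x in scores if x > p60) / len(scores)
print(below <= 0.60, above <= 0.40)   # True True
```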
Here are two possible approaches commonly used to describe a set of
values.
QUARTILES AND
VARIABILITY
Quartiles can provide an idea about the shape of a
histogram.
[Figure: two histograms with Q1, Q2, Q3 marked; left: positively skewed histogram, right: negatively skewed histogram]
INTER-QUARTILE RANGE
Interquartile range = Q3 – Q1
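A minimal sketch of the interquartile range in Python, using `statistics.quantiles` from the standard library (Python 3.8+). Quartile conventions differ between tools, so results may not match Excel or SPSS exactly; the default "exclusive" method of the module is used here as an assumption, and the data set is hypothetical.

```python
import statistics


def interquartile_range(data):
    """Return Q3 - Q1 using the module's default quartile method."""
    q1, _q2, q3 = statistics.quantiles(data, n=4)   # three cut points
    return q3 - q1


# Hypothetical data: quartiles are 2, 4, 6 under the default method
data = [1, 2, 3, 4, 5, 6, 7]
print(statistics.quantiles(data, n=4))   # [2.0, 4.0, 6.0]
print(interquartile_range(data))         # 4.0
```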
BOX PLOT
A box plot is a pictorial display that provides the main descriptive measures of the
measurement set:
L - the largest measurement
Q3 - The upper quartile
Q2 - The median
Q1 - The lower quartile
S - The smallest measurement
An outlier is defined as any value
that is more than 1.5(Q3 – Q1)
away from the box.
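The outlier rule can be sketched as a pair of fences placed 1.5(Q3 – Q1) below Q1 and above Q3. The function names and the quartile values in this example are hypothetical.

```python
def outlier_fences(q1, q3):
    """Return (lower fence, upper fence) at 1.5 * IQR beyond the box."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def is_outlier(x, q1, q3):
    """True if x lies more than 1.5 * IQR away from the box."""
    low, high = outlier_fences(q1, q3)
    return x < low or x > high


# Hypothetical quartiles: IQR = 10, so the fences are -5 and 35
print(outlier_fences(10, 20))    # (-5.0, 35.0)
print(is_outlier(40, 10, 20))    # True
print(is_outlier(35, 10, 20))    # False: exactly on the fence
```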
BOX PLOT
Example 11
Create a box plot for the data regarding the GMAT scores of 200 applicants.
[Box plot: whiskers end at 449 and 669.5; Q1 = 512, Q2 = 537, Q3 = 575]
Example 11 - continued
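The data set is positively skewed.
[Box plot: whiskers end at 449 and 669.5; Q1 = 512, Q2 = 537, Q3 = 575; 25% of the data lie below Q1, 50% between Q1 and Q3, and 25% above Q3]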
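One common way to read skewness from a box plot is to compare the distance from the median to each quartile. The sketch below applies that reading to the values shown for Example 11; the function name and the rule of thumb itself are assumptions for illustration, not a method stated in the slides.

```python
# Values read from the Example 11 box plot
left_end, q1, q2, q3, right_end = 449, 512, 537, 575, 669.5


def quartile_skew(q1, q2, q3):
    """Classify skewness by comparing the two halves of the box."""
    upper_half = q3 - q2   # spread of the upper middle 25%
    lower_half = q2 - q1   # spread of the lower middle 25%
    if upper_half > lower_half:
        return "positively skewed"
    if upper_half < lower_half:
        return "negatively skewed"
    return "roughly symmetric"


print(q3 - q1)                     # interquartile range: 63
print(q3 - q2, q2 - q1)            # 38 25
print(quartile_skew(q1, q2, q3))   # positively skewed
```

The right whisker (669.5 - 575 = 94.5) is also longer than the left whisker (512 - 449 = 63), which points the same way.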