
Decision Science

Describing Data: Numerically


Chapter Topics

 Measures of central tendency, variation, and shape
 Mean, median, mode, geometric mean, harmonic mean
 Quartiles
 Range, interquartile range, variance and standard
deviation, coefficient of variation
 Symmetric and skewed distributions
 Kurtosis
 Five number summary and box-and-whisker plots
A Thought

You are missing


the excitement
of life!
STATISTICS is a journey from data to WISDOM!
If you really wonder and think
about such issues….

…then, you must enter         


[Slide: a page densely filled with raw two-digit numbers (values from 20 to 100), overlaid with the question:]

Is this Statistics?
Statistics …?

 … is the science of counting.


 … is the science of averages.
 … is a way to get information from data.
 … is the art and science of collecting,
analyzing, presenting and interpreting data.
 … is a tool for creating new understanding from
a set of numbers.
Statistics …?

 … is a discipline which is concerned with:


 designing experiments and other data collection,

 summarizing information to aid understanding,

 drawing conclusions from data, and

 estimating the present or predicting the future.


Statistics …?
 … Statistics is the scientific application of mathematical
principles to the collection, analysis, and presentation of
numerical data. Statisticians contribute to scientific enquiry
by applying their mathematical and statistical knowledge to
the design of surveys and experiments; the collection,
processing, and analysis of data; and the interpretation of
the results.
American Statistical Association
Data and Information …

Data → Statistics → Information

Data: Facts, especially numerical facts, collected together for reference or information.

Information: Knowledge communicated concerning some particular fact.
Vocabulary of Statistics…

 Population
— a population is the group of all items of
interest to a statistics practitioner.

 Sample
— A sample is a set of data drawn from the
population. [It is a part of a population]
Vocabulary of Statistics…(cont)
 Parameter
— A descriptive measure of a population.

 Statistic
— A descriptive measure of a sample.
[Diagram: the sample is a subset of the population. A parameter describes the population; a statistic describes the sample.]

 Populations have Parameters,


 Samples have Statistics.
Where do we get data for
analysis? … Sources of Data

 Primary Sources: The data collector is the one using


the data for analysis
 Data from a survey
 Data collected from an experiment
 Observed data

 Secondary Sources: The person performing data


analysis is not the data collector
 Analyzing census data
 Examining data from print journals or data published on
the internet or publicly available information.
Vocabulary of Statistics…(cont)
 Variable
…are things that we measure, control, or manipulate in
research. They differ in many respects, most notably in
the role they are given in our research and in the type of
measures that can be applied to them.
… A variable is a characteristic of an item or individual.

 Variables may be …
— Discrete and Continuous.
Vocabulary of Statistics…(cont)
 Data
Data are the different numerical values associated
with a variable.

 Since variables can be discrete and continuous, so


also the DATA
– discrete data and continuous data.

 Data may be qualitative or quantitative.


Assuming that you have collected necessary
data for a problem to be addressed…

 Where do we begin our journey of analyzing data?

We have to learn how to


process data to derive
meaningful conclusions.
The First Step towards data analysis is …
First Step

To Understand the Data

 DESCRIBE THE DATA


Descriptive Statistics…

 …are methods of organizing, summarizing,


and presenting data in a convenient and
informative way.

 These methods include:

 Graphical Techniques, and

 Numerical Techniques.
There are two ways to
DESCRIBE the data...
 One, Visualization

 GRAPHICAL or TABULAR and

 Second, Numerical measures


Basic Statistics
is used to understand
the data

Understanding data by

summarizing it through
various statistical
measures.
And, we can describe
our data in terms of
various characteristics
of the same!!!!
DATA CAN BE CHARACTERIZED BY
A NUMBER OF PARAMETERS…!!!

 DATA MAY BE CHARACTERIZED IN TERMS OF THE


FOLLOWING CHARACTERISTICS

● CENTRAL TENDENCY

● DISPERSION

● SYMMETRY

● PEAKEDNESS
Describing Data Numerically

Central Tendency: Arithmetic Mean, Median, Mode

Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
 Types of series -----
 Individual series or series without frequencies
 Frequency Series
 Discrete series (or Frequency Array)
 Continuous Series (Frequency Distribution or Series with
Class intervals)

 Grouped Data

 Ungrouped Data
FIRST…
MEASURES OF
CENTRAL
TENDENCY
Which measure should be
used?

 A manufacturer produces shirts in various sizes,


what should he be more interested in?
 Mean or Median or both?

 Series of Income of households

 Claims processed by insurance company


 If we need to know how much money to budget to cover
claims, which measure should be used?
Which measure should be
used?

A committee consists of two males (coded as 1)


and three females (coded as 2)

Which measure should you use?

Categorical data Vs Numerical data


 Whether to use Mean or median or mode is
context-specific

 Type of data (categorical or numerical)


 Categorical – Median or Mode
 Numerical – Mean
 Skewness in the data
 Skewed - Median or Mode
 Open end C.I.
 Mean should be avoided
What should be considered in
choosing the measures

 Purpose
 Common value
 Mid value
 Average
 Algebraic treatment
 Nature of Data
 Open end C.I.
 Varying C.I./unequal C.I.
 Extreme Values
 Size of the data
Measures of Central Tendency

 Objective:
 To describe data
 One value that describes the entire data
 Bird's eye view
 To facilitate comparison
 At one point of time, or
 Over a period of time
 Also called central value/expected
value/average
Requisites of a Good Average

 Easy to understand
 Easy to compute
 Based on all items in data
 Should not be unduly affected by extreme
values
 Subject to further algebraic treatment
 Rigidly defined
 Sampling stability
Calculate Mean and Median

 A real estate broker wants to determine the


median selling price of 10 houses

House Prices:
$67,000
$91,000
$95,000
$105,000
$116,000
$122,000
$148,000
$167,000
$189,000
$5,250,000

What will be the Median?
It will be the average of 116,000 and 122,000 = 119,000.
Note – 5,250,000 (extreme value) does not enter into the calculation.

What will be the Mean?
= 635,000 --- higher than any of the other nine houses!!!
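The comparison above can be checked with a minimal Python sketch. It uses only the standard-library `statistics` module and the ten prices listed on the slide; the variable names are illustrative.

```python
from statistics import mean, median

# Selling prices of the 10 houses from the slide (in dollars)
prices = [67000, 91000, 95000, 105000, 116000,
          122000, 148000, 167000, 189000, 5250000]

avg = mean(prices)      # pulled upward by the single extreme value
mid = median(prices)    # average of the 5th and 6th ranked prices

# Count how many houses sell for less than the mean
below_mean = sum(p < avg for p in prices)

print(avg, mid, below_mean)
```

Nine of the ten houses fall below the mean, which is why the median is the more representative figure here.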
Measures of Central Tendency
Overview
Central Tendency

Mean: arithmetic average, $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Median: midpoint of ranked values
Mode: most frequently observed value (if one exists)
Arithmetic Mean
 The arithmetic mean (mean) is the most
common measure of central tendency
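 For a population of N values:

$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$

where the $x_i$ are the population values and N is the population size

 For a sample of size n:

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$

where the $x_i$ are the observed values and n is the sample size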
Arithmetic Mean
(continued)

 The most common measure of central tendency


 Mean = sum of values divided by the number of values
 Affected by extreme values (outliers)

[Two dot plots on a 0 to 10 axis: values 1, 2, 3, 4, 5 (left) and 1, 2, 3, 4, 10 (right)]

Left: Mean = (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3
Right: Mean = (1 + 2 + 3 + 4 + 10)/5 = 20/5 = 4
 Based on all items
 Subject to further algebraic treatment
 Rigidly defined
 It is a calculated value rather than a positional value (the median, by contrast, is positional)

 Unduly affected by extreme values
 Not preferred for open-ended or varying class intervals (C.I.)
 Gives equal weightage to all values
And, now we see an interesting
quote about Averages -

Three statisticians went target shooting. The first one


took aim, shot and missed by a foot to the left. The
second one took aim, shot and missed by a foot to the
right. Whereupon the third one exclaimed, “We got it!”
and walked away.
Median
 In an ordered list, the median is the "middle"
number (50% above, 50% below)

[Two dot plots on a 0 to 10 axis: Median = 3 in both]

 Not affected by extreme values


Finding the Median
 Arrange the observations in an ordered data array
 The location of the median:
$\text{Median position} = \left(\frac{n+1}{2}\right)^{\text{th}}$ position in the ordered data

 If the number of values is odd, the median is the middle number


 If the number of values is even, the median is the average of
the two middle numbers
 Note that $\frac{n+1}{2}$ is not the value of the median, only the
position of the median in the ranked data
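The position rule can be sketched in Python as follows. This is a minimal illustration, not a library routine: the function name `median_by_position` is hypothetical, and the sample lists are taken from the mean and quartile examples elsewhere in these slides.

```python
def median_by_position(values):
    """Median found via the (n + 1) / 2 position rule on ranked data."""
    ranked = sorted(values)
    n = len(ranked)
    position = (n + 1) / 2          # 1-based position, not the median itself
    lower = ranked[int(position) - 1]
    if position.is_integer():       # odd n: the middle value
        return position, lower
    upper = ranked[int(position)]   # even n: average the two middle values
    return position, (lower + upper) / 2

odd_case = median_by_position([1, 2, 3, 4, 10])
even_case = median_by_position([29, 31, 35, 39, 39, 40, 43, 44, 44, 52])
print(odd_case, even_case)
```

For five values the position is 3, so the median is the 3rd ranked value; for ten values the position is 5.5, so the 5th and 6th ranked values are averaged.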
Median

 Positional average – because it refers to the
place of a value in the series, and the place of that value is
such that an equal number of values lie on either
side of it

 Refers to the middle value of the distribution

 Splits observations into two halves

 i.e. 50th percentile


Median

 If C.I. are unequal, will it affect calculation of the Median?
 No (because only the median class is used in the formula)
 Does conversion of inclusive to exclusive C.I. have to be done for calculation of the Median?
 Yes
 Do extreme values affect its calculation?
 No
 Is it most appropriate for qualitative or quantitative data?
 Qualitative data (e.g. where ranks are given, i.e. where items are not counted or measured but are scored)
 Is it suitable for skewed distributions, like income or price distributions?
 Yes
 Can the Median be determined graphically?
 Yes, from ogives
 Does not consider all values in data
 Data needs to be arranged
 Not capable of further algebraic treatment
 Need to convert inclusive into exclusive
 Not as exact as the Mean (for an even number of items)
Mode
 Value that occurs most often
 The value that has highest frequency in a series of
observations
 It is the value about which items are more closely
concentrated
 Not affected by extreme values
 Used for either numerical or categorical data
 There may be no mode, or several modes

[Two dot plots: on a 0 to 14 axis, Mode = 9; on a 0 to 6 axis, No Mode]
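The three cases (one mode, no mode, several modes) can be sketched in Python. The helper `modes` and the sample lists are hypothetical illustrations, and the convention that "no value repeats" means "no mode" is an assumption made here to match the slide.

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest frequency.

    Convention assumed here: if no value repeats, report no mode
    (an empty list) rather than treating every value as a mode.
    """
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return [v for v, c in counts.items() if c == top]

single = modes([1, 3, 5, 9, 9, 9, 12])       # numerical data, one mode
none = modes([1, 2, 3, 4, 5, 6])             # every value occurs once
several = modes(["S", "M", "M", "L", "L"])   # categorical data, two modes
print(single, none, several)
```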
Mode

 Graphically, it is the value of the variable at


which the curve reaches its maximum
 Bi-modal / multimodal → Mode is ill-defined

 How can you find Mode graphically?


 Histogram
 Frequency Polygon
Mode
 Not affected by extreme values
 Can be calculated for open end distribution
 E.g. consumer preferences (qualitative
phenomenon)

 Not based on all values


 May not be the central value
 Not capable of further algebraic treatment
 Not rigidly defined (several formulae), so it is
the most unstable average
Relationship between Mean,
Median and Mode

 Median – divides distribution into two halves


 Mean – centre of gravity
 Mode – under the peak of curve

 Mode = 3 Median – 2 Mean (empirical relationship for moderately skewed distributions)

 In symmetrical distribution,
 Mean=Median=Mode
Which measure of location
is the "best"?

 Mean is generally used, unless extreme


values (outliers) exist . . .
 Then median is often used, since the median
is not sensitive to extreme values.
 Example: Median home prices may be reported for
a region – less sensitive to outliers
Quartiles

 Time to get ready in morning for 10 days, in


minutes is shown in following table

 Q1 = 35 would mean ?

 Q3 = 44 would mean?

Days:           1   2   3   4   5   6   7   8   9  10
Ranked values: 29  31  35  39  39  40  43  44  44  52
Quartiles
 Quartiles split the ranked data into 4 segments with
an equal number of values per segment (note that
the widths of the segments may be different)

25% 25% 25% 25%

Q1 Q2 Q3
 The first quartile, Q1, is the value for which 25% of the
observations are smaller or equal to value of Q1 and 75%
of the observations are larger or equal to the value of Q1
 Q2 is the same as the median (50% are smaller, 50% are
larger)
 Q3: only 25% of the observations are greater than the third
quartile, and 75% of the observations are less than the value
of Q3
Quartile Formulas

Find a quartile by determining the value in the


appropriate position in the ranked data, where

First quartile position: Q1 = 0.25(n+1)

Second quartile position: Q2 = 0.50(n+1)


(the median position)

Third quartile position: Q3 = 0.75(n+1)

where n is the number of observed values


Quartiles

 Example: Find the first quartile


Sample Ranked Data: 11 12 13 16 16 17 18 21 22

(n = 9)
Q1 is in the 0.25(9+1) = 2.5 position of the ranked data,
so use the value halfway between the 2nd and 3rd values (12 and 13),

so Q1 = 12.5
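The worked example can be reproduced with a short Python sketch of the (n + 1) position rule. The function names are hypothetical, and resolving a fractional position by linear interpolation is an assumption (it matches the "halfway between" step above). Other textbooks and software use different conventions, so their results can differ slightly.

```python
def value_at_position(ranked, position):
    """Value at a 1-based (possibly fractional) position in ranked data.

    Fractional positions are resolved by linear interpolation between
    the two neighbouring ranked values (an assumed convention).
    """
    lower_index = int(position)            # whole part of the position
    fraction = position - lower_index
    lower = ranked[lower_index - 1]
    if fraction == 0:
        return lower
    upper = ranked[lower_index]
    return lower + fraction * (upper - lower)

def percentile(values, p):
    """P-th percentile using the (P/100)(n + 1) position rule."""
    ranked = sorted(values)
    return value_at_position(ranked, (p / 100) * (len(ranked) + 1))

data = [11, 12, 13, 16, 16, 17, 18, 21, 22]
q1, q2, q3 = (percentile(data, p) for p in (25, 50, 75))
print(q1, q2, q3)
```

The sketch does not guard against positions below 1 or above n, which can occur for extreme percentiles in small samples.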
Percentiles and Quartiles

 Percentiles and Quartiles indicate the position of a


value relative to the entire set of data…. Just like
Median

 Generally used to describe large data sets

 Example: An IQ score at the 90th percentile means that 10%


of the population has a higher IQ score and 90% have a lower
IQ score.

Pth percentile = value located in the (P/100)(n + 1)th


ordered position
Data do not give up their secrets easily. They
must be tortured to confess
-- Jeff Hopper
Second Measure,

Shape of the distribution

Skewness
SKEWNESS……

Skewness refers to the lack of symmetry or


asymmetry in the data
When a distribution is not symmetrical, it is
called skewed or asymmetrical.

It looks at how deviations cluster above
and below the measure of central tendency

 It describes the SHAPE of the given


distribution of some observations.
SKEWNESS……
 Symmetrical distribution
 Where frequencies are distributed equally on both
sides of the central value
 No skewness
 What is the relationship between Mean, Median &
Mode
 Mean = Median = Mode

 Asymmetrical Distribution
 When frequencies are not equally distributed on both
sides of central value
 The curve gets inclined to one side
 Mean ≠ Median ≠ Mode
SKEWNESS……
A distribution can have
 Positively skewed curve/Positive skewness:
 The curve is inclined to right
 Mean > Median > Mode
 (Q3 – Median) > (Median –Q1)
 In this case, the majority of the observations have values less
than the mean, and the distribution has a long tail on the right-hand side, i.e.
very few frequencies are spread over the high-value end of the
curve. (The long tail is caused by some extremely large values.)
 Negatively skewed curve/Negative skewness:
 The curve is inclined to left
 Mean < Median < Mode
 (Q3 – Median) < (Median –Q1)
 In this case, the majority of the observations have values greater
than the mean, and the distribution has a long tail on the left-hand side, i.e.
very few frequencies are spread over the low-value end of the curve. (The long tail
is caused by some extremely low values.)

 Zero skewness. In that case, the distribution is


symmetrical.
Shape of a Distribution

 Describes how data are distributed


 Measures of shape
 Symmetric or skewed

Left-Skewed: Mean < Median
Symmetric: Mean = Median
Right-Skewed: Median < Mean
A CASE OF POSITIVE SKEWNESS

A CASE OF NEGATIVE SKEWNESS


Test of Skewness

 Curve is not bell shaped


 Mean ≠ Median ≠ Mode
 Mean – Mode ≠ 0 indicates that there is skewness
in the data
 Mean > Mode: Positive skewness
 Mean < Mode: Negative skewness
 Quartiles
 Q3 – Median ≠ Median – Q1
 Q3 – Median > Median – Q1: Positive skewness
 Q3 – Median < Median – Q1 : Negative Skewness
 Sum of positive deviations from central values
is not equal to Sum of negative deviations from
central values
 Percentiles
 P90 – Median ≠ Median – P10
 Deciles
 D9 – Median ≠ Median – D1
 When Q1 – Smallest Value (S) ≠ Largest Value (L) – Q3
 Q1 – S < L – Q3: Positive skewness
 Q1 – S > L – Q3: Negative skewness
How to measure
Skewness
Coefficient of Skewness

Skewness in distribution is measured by a numerical


measure, coefficient of skewness
Measure of skewness – Based on
Karl Pearson's measure – Averages
Bowley's measure – Quartiles
Kelly's measure – Percentiles & Deciles
Moments measure – Moments
Measures of SKEWNESS:
 Absolute Measure of Skewness: (Mean – Mode)

 Karl Pearson's Coefficient of Skewness: (Mean – Mode)/(Standard Deviation).

 If the mode is ill-defined, then use the relationship:

 Mean – Mode = 3(Mean – Median)

 Bowley's Coefficient of Skewness: (Q3 + Q1 – 2Median)/(Q3 – Q1)

 Kelly's Coefficient of Skewness: (P90 + P10 – 2Median)/(P90 – P10) OR (D9 + D1 – 2Median)/(D9 – D1)
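As a rough illustration, Pearson's and Bowley's coefficients can be sketched in Python. Several choices here are assumptions rather than anything the slides specify: the function names, the use of the population standard deviation (`pstdev`), quartiles taken at the 0.25(n + 1) and 0.75(n + 1) positions with linear interpolation, and the small illustrative data set.

```python
from statistics import mean, median, pstdev

def position_value(ranked, position):
    """Value at a 1-based fractional position, by linear interpolation."""
    i = int(position)
    frac = position - i
    if frac == 0:
        return ranked[i - 1]
    return ranked[i - 1] + frac * (ranked[i] - ranked[i - 1])

def pearson_skewness(values):
    """3 (Mean - Median) / S.D., the form used when the mode is ill-defined."""
    return 3 * (mean(values) - median(values)) / pstdev(values)

def bowley_skewness(values):
    """(Q3 + Q1 - 2 Median) / (Q3 - Q1), quartiles at 0.25(n+1) and 0.75(n+1)."""
    ranked = sorted(values)
    n = len(ranked)
    q1 = position_value(ranked, 0.25 * (n + 1))
    q3 = position_value(ranked, 0.75 * (n + 1))
    return (q3 + q1 - 2 * median(ranked)) / (q3 - q1)

sample = [1, 2, 3, 4, 10]   # illustrative data with one large value
pearson = pearson_skewness(sample)
bowley = bowley_skewness(sample)
print(round(pearson, 3), round(bowley, 3))
```

Both coefficients come out positive for this sample, consistent with a long tail on the high-value side.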
Points to be noted:
 P50 = Median
 D5 = Median

 Karl Pearson's measure is based on all values of the distribution
 Bowley's ignores the top and bottom 25%
 Sk_p lies between ±3; Bowley's lies between ±1
 Quartile Deviation – Measure of dispersion
 If Median + Q.D. = Q3 and Median – Q.D. = Q1, then the distribution is
symmetrical
 Bowley's coefficient of skewness is also known
as the Quartile coefficient of skewness
In which share should you invest?
SHARE - A SHARE - B
Return (%) Probability Return (%) Probability
2 0.05 1 0.02
9 0.29 4 0.08
12 0.24 7 0.10
16 0.17 9 0.13
19 0.12 12 0.16
23 0.07 16 0.18
28 0.03 21 0.31
30 0.04 30 0.02
Total 1.00 Total 1.00
Expected Return 14.10% Expected Return 14.10%
Standard Deviation 6.40 Standard Deviation 6.40
Skewness 0.80 Skewness -1.07
Now, look at their distribution …
[Chart: PROBABILITY DISTRIBUTION OF RETURNS for SHARE A and SHARE B; x-axis RETURN (0 to 35), y-axis PROBABILITY (0.00 to 0.35)]

In which share do you think you should invest?
THIRD……
MEASURE OF
PEAKEDNESS
KURTOSIS……

 KURTOSIS is a measure of PEAKEDNESS.


 It tells about the CONCENTRATION of items at the
central value
 degree to which there are either too many or too few
observations in the middle of a distribution as
compared to a normal distribution
 Kurtosis may be Mesokurtic, Leptokurtic or Platykurtic.
 Kurtosis is measured thus:
 Kurtosis = $\mu_4 / \mu_2^2$ (based on moments)
 If it is equal to three, it is MESOKURTIC.
 If it is greater than three, it is LEPTOKURTIC.
 If it is less than three, it is PLATYKURTIC.
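A minimal Python sketch of the moment-based measure follows, taking the formula as the fourth central moment divided by the square of the second. The function names, the tolerance used for the "equal to three" test, and the two tiny data sets are illustrative assumptions.

```python
def moment_kurtosis(values):
    """Kurtosis as mu_4 / mu_2**2, using central moments about the mean."""
    n = len(values)
    m = sum(values) / n
    mu2 = sum((x - m) ** 2 for x in values) / n   # second central moment
    mu4 = sum((x - m) ** 4 for x in values) / n   # fourth central moment
    return mu4 / mu2 ** 2

def classify(k, tol=1e-9):
    """Label a kurtosis value against the benchmark of three."""
    if abs(k - 3) <= tol:
        return "mesokurtic"
    return "leptokurtic" if k > 3 else "platykurtic"

flat = [1, 2, 3, 4, 5]                   # evenly spread values
peaked = [1, 3, 3, 3, 3, 3, 3, 3, 5]     # most values at the centre

k_flat = moment_kurtosis(flat)
k_peaked = moment_kurtosis(peaked)
print(k_flat, classify(k_flat))
print(k_peaked, classify(k_peaked))
```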
KURTOSIS……
 Kurtosis may be Mesokurtic, Leptokurtic or Platykurtic.
 Mesokurtic:
 Distribution that has a 'normal' peak or medium height, i.e. neither
too peaked nor too flat-topped
 There is normal concentration of frequencies at the centre
 It is considered the benchmark against which the other distributions,
leptokurtic and platykurtic, are defined

 Leptokurtic curve/distribution:
 Distribution that has a relatively higher peak than the normal
distribution
 It has more concentration of frequencies at the
centre than the normal distribution
 Platykurtic curve/distribution:
 Distribution that is relatively flatter than the normal
distribution
 It has less concentration of frequencies at the
centre than the normal distribution

Leptokurtic Example

[Histogram: Mean = 3.0, Std. Dev = 1.03, N = 20; x-axis from 1.0 to 5.0]

Platykurtic Example

[Histogram: Mean = 3.0, Std. Dev = 1., N = 18; x-axis from 1.0 to 5.0]
 Not all mesokurtic curves are normal curves
 But all normal curves are mesokurtic curves

 Not all symmetrical curves are normal curves
 But all normal curves are symmetrical curves
Five-Number Summary

The five-number summary refers to five descriptive


measures:
minimum
first quartile
median
third quartile
maximum

minimum ≤ Q1 ≤ median ≤ Q3 ≤ maximum


BOX-PLOT

[Box plot of "Age of respondent" on an axis from 34 to 46, with the whiskers, lower quartile, median, and upper quartile labelled]
Box-and-Whisker Plot
The plot can be oriented horizontally or vertically

Example:

minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, maximum = 70

Each of the four segments between these values holds 25% of the observations.
Question!

An investment of Rs 1000 is made. In 2009 the value of the investment
increases by 10%, and in 2010 it decreases by 10%.
1) Find the average rate of return for the two-period investment.
2) What is the amount you receive at the end of the two periods?

Year Percentage Change


2009 +10
2010 - 10

What is the mean percentage return over time?


Geometric Mean

 Used when we want to average the percentage


rate of change in some variable
 E.g. when we want to average the growth rate,
average growth of population
 Used in questions of depreciation calculated by
the written-down value (WDV) method
 Average interest rate on deposits made in bank,
wherein the rate is the compound interest
Geometric Mean

 Geometric mean
 Used to measure the rate of change of a variable
over time

$\bar{x}_g = \sqrt[n]{x_1 \times x_2 \times \cdots \times x_n} = (x_1 \times x_2 \times \cdots \times x_n)^{1/n}$

 Geometric mean rate of return

 Measures the status of an investment over time

$\bar{r}_g = (x_1 \times x_2 \times \cdots \times x_n)^{1/n} - 1$

 Where $x_i$ is the growth factor (1 + rate of return) in time period i
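The earlier question (Rs 1000, +10% in 2009 and -10% in 2010) can be worked through with a minimal Python sketch. It assumes each period's rate is converted to a growth factor of 1 + rate before multiplying; the function name `geometric_mean_return` is illustrative.

```python
from math import prod

def geometric_mean_return(rates):
    """Geometric mean rate of return from per-period rates (0.10 = +10%)."""
    factors = [1 + r for r in rates]          # growth factor per period
    return prod(factors) ** (1 / len(factors)) - 1

rates = [0.10, -0.10]                         # +10% in 2009, -10% in 2010
final_amount = 1000 * prod(1 + r for r in rates)
arithmetic = sum(rates) / len(rates)
geometric = geometric_mean_return(rates)
print(final_amount, arithmetic, geometric)
```

The arithmetic mean of the two rates is zero, yet the investment ends at about Rs 990. The geometric mean return is slightly negative (roughly -0.5% per period), which agrees with the actual outcome.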
Question!

 A car travels first 200 km at a speed of 40


km/hour, the next distance of 560 km at an
average speed of 80 km/h and remaining
distance of 400 km at average speed of 50
km/h. Determine the average speed?

 A car travels at an average speed of 50 km/h


for first 4 hours, at 80 km/h for next 3 hours and
60 km/h for last 3 hours. Determine the average
speed of car?
Harmonic Mean

 Like G.M., it is a specialized average


 It is used to average rates, expressed as ratio
of two factors
 E.g. Speed = D /T (km/hour)
 Production Rate = Rs /Kg
 Rs /share
 No of units per hour
 No. of hours per unit
 Reciprocal of [A.M. of Reciprocals]
 $HM = \frac{1}{\frac{1}{N}\sum \frac{1}{X}} = \frac{N}{\sum \frac{1}{X}}$
Harmonic Mean

 When to apply HM:

 1. Rates
 2. Following to be considered:
 If rates are to be averaged over Numerator units,
then HM should be used
 If rates are to be averaged over Denominator units,
then AM should be used
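The two car questions posed earlier illustrate this rule, and a short Python sketch makes the difference concrete. Function names are hypothetical. In both cases the average speed is simply total distance divided by total time. When the legs are given as distances, this works out to a distance-weighted harmonic mean of the speeds; when they are given as durations, it is a time-weighted arithmetic mean.

```python
def average_speed_by_distance(legs):
    """legs: (distance_km, speed_kmh) pairs -> distance-weighted harmonic mean."""
    total_distance = sum(d for d, _ in legs)
    total_time = sum(d / s for d, s in legs)
    return total_distance / total_time

def average_speed_by_time(legs):
    """legs: (hours, speed_kmh) pairs -> time-weighted arithmetic mean."""
    total_time = sum(h for h, _ in legs)
    total_distance = sum(h * s for h, s in legs)
    return total_distance / total_time

# Question 1: 200 km at 40 km/h, 560 km at 80 km/h, 400 km at 50 km/h
first = average_speed_by_distance([(200, 40), (560, 80), (400, 50)])
# Question 2: 4 h at 50 km/h, 3 h at 80 km/h, 3 h at 60 km/h
second = average_speed_by_time([(4, 50), (3, 80), (3, 60)])
print(first, second)
```

The first trip covers 1160 km in 20 hours (58 km/h); the second covers 620 km in 10 hours (62 km/h).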
 Fourth Measure,

Measures of Dispersion
MEASURES OF VARIATION &
SPREAD
 The degree to which numerical data tend to
spread around an average value is called the
measure of variation or dispersion.
 Dispersion is the degree of spread in a variable
- it describes the distance from the average.
Mean alone does not provide a complete or
sufficient description of data

 These measures also look at how uniformly
observations are distributed across the
values
 Absolute measures and relative measures of
variation
Quick question!

 Zero spread means

all values are equal to the central value


Measures of Variability

Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation

 Measures of variation give


information on the spread
or variability of the data
values.

Same center,
different variation
Range

 Simplest measure of variation


 Difference between the largest and the smallest
observations:

Range = Xlargest – Xsmallest

Example:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

Range = 14 - 1 = 13
Disadvantages of the Range
 Ignores the way in which data are distributed,
i.e. it ignores all values of the distribution except the two extremes

[Two dot plots on a 7 to 12 axis with different distributions: Range = 12 - 7 = 5 in both]

 Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
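The sensitivity to a single outlier can be reproduced with a few lines of Python, using the two lists from the slide (the helper name `value_range` is illustrative):

```python
def value_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)

# Eleven 1s, eight 2s, four 3s, then the last two values from the slide
base = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
with_outlier = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]

range_base = value_range(base)
range_outlier = value_range(with_outlier)
print(range_base, range_outlier)
```

Changing one value out of 25 moves the range from 4 to 119, while the other 24 observations are identical.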
Disadvantages of the Range

 Cannot tell us about the character of distribution


between 2 extreme items

 Cannot be computed in open-ended distribution

 No sampling stability
 Then why use the Range?

 To get quick rather than very accurate picture of


variability (to get rough idea)

 Applications?
 Weather forecast
 Quality control
 Fluctuations in share prices
Range

 Range is the difference between two extreme


items and does not take into account the scatter
within extreme items

 To overcome this problem → Interquartile Range (I.Q.R)

 I.Q.R considers middle 50% of the distribution


i.e. 1/4th of observations at lower end of
distribution & 1/4th of observations at upper end
of distribution are ignored
Interquartile Range

 Can eliminate some outlier problems by using


the interquartile range

 Eliminate high- and low-valued observations


and calculate the range of the middle 50% of
the data

 Interquartile range = 3rd quartile – 1st quartile


IQR = Q3 – Q1
Interquartile Range

 The interquartile range (IQR) measures the


spread in the middle 50% of the data

 Defined as the difference between the


observation at the third quartile and the
observation at the first quartile

IQR = Q3 - Q1

Semi-interquartile Range or Q.D.= (Q3 - Q1 )/ 2


Quartile Deviation (Q.D.)

 Q.D. gives the average amount by which the
two quartiles differ from the Median
 In a symmetrical curve, Q3 and Q1 are equidistant from the
Median

 Median ± Q.D. covers approximately 50% of
observations

 Small Q.D. → high uniformity or small variation
of the central 50% of observations
 The problem with Q.D. is that it is an absolute measure, expressed in the units of the data, so it cannot be used to compare series measured in different units

 The remedy is a relative measure:

Coefficient of Q.D. = (Q3 - Q1) / (Q3 + Q1)
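A minimal Python sketch of the quartile-based measures follows. The data values are illustrative assumptions, and quartile conventions differ between textbooks and software; `method="inclusive"` from the standard `statistics` module is only one common choice, so other tools may return slightly different quartiles.

```python
from statistics import quantiles

data = [10, 12, 14, 15, 17, 18, 18, 24]  # illustrative values (assumption)

# Quartile conventions differ; the "inclusive" method is assumed here.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

iqr = q3 - q1                               # interquartile range
quartile_deviation = iqr / 2                # semi-interquartile range (Q.D.)
coefficient_of_qd = (q3 - q1) / (q3 + q1)   # relative, unit-free measure

print(q1, q3, iqr, quartile_deviation, round(coefficient_of_qd, 4))
```

The coefficient of Q.D. has no units, which is what allows two series measured on different scales to be compared.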


Limitations of Quartile Deviation

 Not based on every item
 Affected considerably by sampling fluctuations
 Not capable of further algebraic treatment
 Not a true measure of dispersion because it does not tell us about the scatter around an average
 It shows a distance on a scale (it is a positional measure)

 Percentile Range = P90 - P10


 Range and I.Q.R. measure the spread of the data, but both take into account only two data values

 We need a measure that considers every data value

 STANDARD DEVIATION / VARIANCE / COEFFICIENT OF VARIATION: show how far the observations differ, on average, from the mean value

VARIANCE = $\sum (X_i - \mu)^2 / N$
STANDARD DEVIATION = $\sqrt{\text{variance}}$
COEFFICIENT OF VARIATION = S.D. / mean
Population Variance

 Average of squared deviations of values from the mean

 Population variance:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

Where μ = population mean
N = population size
x_i = ith value of the variable x
Sample Variance

 Average (approximately) of squared deviations of values from the mean

 Sample variance:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Where x̄ = arithmetic mean of the sample
n = sample size
x_i = ith value of the variable x
Population Standard Deviation

 Most commonly used measure of variation
 Shows variation about the mean
 Has the same units as the original data

 Population standard deviation:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
Sample Standard Deviation

 Most commonly used measure of variation
 Shows variation about the mean
 Has the same units as the original data

 Sample standard deviation:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
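The only difference between the population and sample formulas is the divisor (N versus n - 1). A minimal Python sketch, using a hypothetical data set and a hypothetical helper named `variance`, makes the difference concrete:

```python
from math import sqrt

def variance(values, sample=True):
    """Average squared deviation from the mean.

    Divides by n - 1 for a sample and by N for a population.
    """
    mean = sum(values) / len(values)
    squared_deviations = sum((x - mean) ** 2 for x in values)
    divisor = len(values) - 1 if sample else len(values)
    return squared_deviations / divisor

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values, mean = 5

population_variance = variance(data, sample=False)  # 32 / 8
sample_variance = variance(data, sample=True)       # 32 / 7
population_sd = sqrt(population_variance)
sample_sd = sqrt(sample_variance)

print(population_variance, population_sd)
print(round(sample_variance, 4), round(sample_sd, 4))
```

Treated as a sample, the same data always give a slightly larger variance than when treated as a population, and the gap shrinks as n grows.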
Standard Deviation

 The standard deviation is useful for judging whether the mean is reliable, i.e. representative of the data

 Note: The standard deviation is always calculated from the arithmetic mean
 Because the sum of squared deviations is least when taken from the mean (a property of the mean)
 It is subject to further algebraic treatment
 A combined S.D. can be calculated for two or more groups
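The property that the sum of squared deviations is smallest when taken from the mean can be illustrated numerically. The sketch below uses hypothetical data and a hypothetical helper `sum_of_squares_about`; it checks a handful of candidate centres and is not a proof.

```python
def sum_of_squares_about(values, centre):
    """Sum of squared deviations of the values from a chosen centre."""
    return sum((x - centre) ** 2 for x in values)

data = [1, 2, 3, 4, 10]                 # hypothetical values
mean = sum(data) / len(data)            # 4.0
median = sorted(data)[len(data) // 2]   # 3 (odd number of values)

ss_mean = sum_of_squares_about(data, mean)
ss_median = sum_of_squares_about(data, median)
print(ss_mean, ss_median)

# Any other centre gives a sum at least as large as the one about the mean.
for centre in (0, 2.5, 3.9, 4.1, 6, 10):
    assert sum_of_squares_about(data, centre) >= ss_mean
```

Here the sum about the mean is 50 and the sum about the median is 55, which is why the standard deviation is defined around the arithmetic mean rather than another average.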
Calculation Example: Sample Standard Deviation

Sample data (x_i): 10 12 14 15 17 18 18 24
n = 8, mean = x̄ = 16

$$s = \sqrt{\frac{(10 - \bar{x})^2 + (12 - \bar{x})^2 + (14 - \bar{x})^2 + \cdots + (24 - \bar{x})^2}{n - 1}}$$

$$= \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}}$$

$$= \sqrt{\frac{130}{7}} = 4.3095$$

A measure of the "average" scatter around the mean
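The worked example can be checked with Python's standard `statistics` module, whose `stdev` function uses the n - 1 divisor. This is only a verification sketch of the arithmetic above.

```python
from statistics import mean, stdev

sample = [10, 12, 14, 15, 17, 18, 18, 24]

x_bar = mean(sample)                                   # sample mean
sum_squared_dev = sum((x - x_bar) ** 2 for x in sample)
s = stdev(sample)                                      # divides by n - 1

print(x_bar, sum_squared_dev, round(s, 4))
```

The intermediate sum of squared deviations is 130, matching the numerator in the calculation.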
Measuring variation

[Figure: two distributions, one with a small standard deviation and one with a large standard deviation]


Comparing Standard Deviations
Mean = 15.5 for each data set

[Figure: dot plots of three data sets on a common axis from 11 to 21]
Data A: s = 3.338 (compare to the two cases below)
Data B: s = 0.926 (values are concentrated near the mean)
Data C: s = 4.570 (values are dispersed far from the mean)
Advantages of Variance and Standard Deviation
 Commonly used as a measure of dispersion because of its mathematical properties
 Based on every item
 Subject to further algebraic treatment
 Correlation, skewness, sampling etc.
 Less affected by sampling fluctuations
 Rigidly defined

 However, it is affected by extreme values
 Gives more weight to extreme values (due to squaring)
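Standard Deviation

 Why use the standard deviation when we have the variance?
 The variance is expressed in squared units, whereas the standard deviation has the same units as the original data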
Application in Finance
 In financial analysis, the standard deviation is often used as a measure of the volatility and risk associated with financial variables
 E.g. Re/USD exchange rates are recorded on 100 selected days in each of two years. In which year was the value of the Re more volatile, given the following standard deviations?
 S.D. 1 = 0.007033
 S.D. 2 = 0.003938
 The Re was more volatile during the first year, since S.D. 1 > S.D. 2
Application: Risk of a single asset
 Raveena and Karishma are considering two investment alternatives, Asset A and Asset B. They are not sure which of the two single assets is better, so they approach a financial planner for assistance:

Year                      Rate of Return, Asset A    Rate of Return, Asset B
5 years ago               11.3%                      9.4%
4 years ago               12.5%                      17.1%
3 years ago               13.0%                      13.3%
2 years ago               12.0%                      10.0%
1 year ago                12.2%                      11.2%
Total                     61.0                       61.0
Average Rate of Return    12.2%                      12.2%
Standard Deviation        0.63                       3.12
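The summary rows of the table can be reproduced with a short Python sketch. The returns are taken from the table; the table's standard deviations match the sample (n - 1) formula, which is what `statistics.stdev` computes.

```python
from statistics import mean, stdev

# Annual rates of return in percent, from the table above.
asset_a = [11.3, 12.5, 13.0, 12.0, 12.2]
asset_b = [9.4, 17.1, 13.3, 10.0, 11.2]

mean_a, mean_b = mean(asset_a), mean(asset_b)
sd_a, sd_b = stdev(asset_a), stdev(asset_b)   # sample standard deviations

print(round(mean_a, 1), round(sd_a, 2))
print(round(mean_b, 1), round(sd_b, 2))
```

Both assets earn the same average return, so the larger standard deviation of Asset B marks it as the riskier of the two.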
Question!

 There are two grocery stores, one large and the other small.
 If we want to find which store has the more fluctuating sales, which measure should be used?

 Comparing only the S.D. would be misleading.
 In this case the coefficient of variation should be used because it adjusts for the scale of units in the population

 Standard deviation is an absolute measure, expressed in its original units.

 Hence, it is not suitable for comparing two series expressed in different units or on very different scales

 The solution is the coefficient of variation


Coefficient of Variation
 Measures relative variation
 Always in percentage (%)
 Shows variation relative to mean
 Can be used to compare two or more sets of data measured in different units

Population coefficient of variation:

$$CV = \left(\frac{\sigma}{\mu}\right) \cdot 100\%$$

Sample coefficient of variation:

$$CV = \left(\frac{s}{\bar{x}}\right) \cdot 100\%$$
Application: Risk of a single asset (revisited)
 Raveena and Karishma are considering two investment alternatives, Asset A and Asset B. They are not sure which of the two single assets is better, so they approach a financial planner for assistance:

Year                      Rate of Return, Asset A    Rate of Return, Asset B
5 years ago               11.3%                      9.4%
4 years ago               12.5%                      17.1%
3 years ago               13.0%                      13.3%
2 years ago               12.0%                      10.0%
1 year ago                12.2%                      11.2%
Total                     61.0                       61.0
Average Rate of Return    12.2%                      12.2%
Standard Deviation        0.63                       3.12
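Applying the coefficient of variation to the two assets is a one-line calculation each. The sketch below uses the rounded mean and standard deviation from the table, so the results are approximate.

```python
# Summary figures from the table (percent returns).
mean_return = 12.2
sd_a, sd_b = 0.63, 3.12

cv_a = sd_a / mean_return * 100   # relative variation of Asset A, in %
cv_b = sd_b / mean_return * 100   # relative variation of Asset B, in %

print(round(cv_a, 1), round(cv_b, 1))
```

Because the two means are equal, the CV ranks the assets exactly as the standard deviation does: Asset B varies roughly five times as much relative to its average return.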
Question!

 Which stock should be bought, A or B? Is B the riskier asset?
 From the closing prices of both stocks over the last several months, the standard deviations were found to be different:
 sA = Rs. 14; sB = Rs. 56
 Mean closing price: stock A = Rs. 28; stock B = Rs. 560

 Should stock A be purchased, since the S.D. of stock B is larger?

 Ans.
 CVA = 14 / 28 * 100 = 50%
 CVB = 56 / 560 * 100 = 10%

This shows that the market value of stock A fluctuates more from period to period than that of stock B
Comparing Coefficient of Variation
 Let's say, Stock A:
 Average price last year = $50
 Standard deviation = $5

$$CV_A = \left(\frac{s}{\bar{x}}\right) \cdot 100\% = \frac{5}{50} \cdot 100\% = 10\%$$

 Stock B:
 Average price last year = $100
 Standard deviation = $5

$$CV_B = \left(\frac{s}{\bar{x}}\right) \cdot 100\% = \frac{5}{100} \cdot 100\% = 5\%$$

Both stocks have the same standard deviation, but stock B is less variable relative to its price
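The comparison can be wrapped in a small reusable function. This is a sketch only; the name `coefficient_of_variation` and the zero-mean check are illustrative assumptions rather than anything prescribed by the slides.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the coefficient of variation as a percentage."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100 * std_dev / mean

# Stock A and Stock B: same standard deviation, different average price.
cv_a = coefficient_of_variation(std_dev=5, mean=50)
cv_b = coefficient_of_variation(std_dev=5, mean=100)

print(f"CV_A = {cv_a:.0f}%, CV_B = {cv_b:.0f}%")
```

The guard on a zero mean is included because the ratio has no meaning there, and the CV also becomes unstable when the mean is very close to zero.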
Which measure of dispersion to use

 Depends on
 Type of data available
 If there are extreme values and few items -> avoid S.D.
 If the data are skewed -> avoid M.D. (when taken from the mean)
 If there are gaps around the quartiles -> avoid Q.D.
 If there are open-end classes -> use Q.D.
 Purpose of investigation
 For elementary treatment of a statistical measure -> Range, Q.D., M.D.
 For further statistical analysis -> S.D.
Question!

 The population size and per capita personal income for a random sample of six states of India are given for 2011. What is the mean per capita personal income for 2011?

(Per capita personal income is total personal income divided by total mid-year population)
States        Population     Per capita income at constant prices (Rs)
Andhra Pr.    84665533       37061
Assam         31169272       20193
Bihar         103804637      12012
Goa           1457723        96885
Gujarat       60383628       48511
Delhi         16753235       100050
Total         298234028      314712

Source:
http://pbplanning.gov.in/pdf/Statewise%20GSDP%20PCI%20and%20G.R.pdf
Ans.

States        Population (W)   Per capita income at constant prices, Rs (X)   W.X
Andhra Pr.    84665533         37061                                          3.13779E+12
Assam         31169272         20193                                          6.29401E+11
Bihar         103804637        12012                                          1.2469E+12
Goa           1457723          96885                                          1.41231E+11
Gujarat       60383628         48511                                          2.92927E+12
Delhi         16753235         100050                                         1.67616E+12
Total         298234028        314712                                         9.76075E+12

Arithmetic mean of the six state figures = 314712 / 6 = 52452.00
Mean per capita personal income (population-weighted) = 9.76075E+12 / 298234028 = 32728.51
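The two answers can be reproduced with a short Python sketch using the populations and per capita incomes from the table. The dictionary layout and variable names are illustrative assumptions.

```python
# (population, per capita income in Rs) for each state, from the table.
states = {
    "Andhra Pr.": (84665533, 37061),
    "Assam": (31169272, 20193),
    "Bihar": (103804637, 12012),
    "Goa": (1457723, 96885),
    "Gujarat": (60383628, 48511),
    "Delhi": (16753235, 100050),
}

total_population = sum(pop for pop, _ in states.values())

# Unweighted mean treats every state as equally important.
simple_mean = sum(income for _, income in states.values()) / len(states)

# Weighted mean uses population as the weight for each state.
weighted_mean = (
    sum(pop * income for pop, income in states.values()) / total_population
)

print(simple_mean, round(weighted_mean, 2))
```

The unweighted mean is pulled up by Goa and Delhi, which have high incomes but small populations. Weighting by population gives the per capita figure for the six states taken together.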
Weighted Mean and Measures of Grouped Data

 The weighted mean of a set of data is

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{\sum_{i=1}^{n} w_i}$$

 Where w_i is the weight of the ith observation and the denominator is the total weight, Σ w_i

 Use when data is already grouped into n classes, with w_i values in the ith class
Question!

 If a survey asks respondents to select an age category such as "18 to 25", rather than giving their specific age, how should the average age be calculated?

 This is grouped data, and we can compute the approximate mean and variance
Approximations for Grouped Data
Suppose data are grouped into K classes, with frequencies f1, f2, . . ., fK, and the midpoints of the classes are m1, m2, . . ., mK

 For a sample of n observations, the mean is

$$\bar{x} = \frac{\sum_{i=1}^{K} f_i m_i}{n}, \quad \text{where } n = \sum_{i=1}^{K} f_i$$

 For a sample of n observations, the variance is

$$s^2 = \frac{\sum_{i=1}^{K} f_i (m_i - \bar{x})^2}{n - 1}$$
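The grouped-data approximations can be sketched in Python as follows. The class limits and frequencies are hypothetical, and the function name `grouped_mean_and_variance` is an illustrative assumption. Each class is represented by its midpoint, so the results are approximations to the values the raw data would give.

```python
def grouped_mean_and_variance(class_limits, frequencies):
    """Approximate sample mean and variance from class midpoints."""
    midpoints = [(lower + upper) / 2 for lower, upper in class_limits]
    n = sum(frequencies)
    mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n
    variance = sum(
        f * (m - mean) ** 2 for f, m in zip(frequencies, midpoints)
    ) / (n - 1)
    return mean, variance

# Hypothetical classes (lower limit, upper limit) and their frequencies.
classes = [(10, 20), (20, 30), (30, 40)]
counts = [2, 5, 3]

grouped_mean, grouped_variance = grouped_mean_and_variance(classes, counts)
print(grouped_mean, round(grouped_variance, 3))
```

With midpoints 15, 25 and 35 the approximate mean is 26 and the approximate variance is 490 / 9. An open-ended class would need an assumed midpoint before this method could be applied.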
The Empirical Rule

 If the data distribution is bell-shaped, then the interval:
 μ ± 1σ contains about 68% of the values in the population or the sample
 μ ± 2σ contains about 95% of the values in the population or the sample
 μ ± 3σ contains almost all (about 99.7%) of the values in the population or the sample

[Figure: bell-shaped curves with the intervals μ ± 1σ (68%), μ ± 2σ (95%) and μ ± 3σ (99.7%) marked]
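The percentages in the empirical rule can be recovered from the normal distribution. The sketch below assumes the bell shape is exactly normal, which real data only approximate, and uses `NormalDist` from Python's standard `statistics` module.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def coverage(k):
    """Share of a normal distribution lying within k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sigma: {coverage(k):.1%}")
```

Because the calculation is done on the standardised scale, the same shares hold for any mean μ and standard deviation σ.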
A more general interpretation of the standard deviation is derived from Chebyshev's Theorem, which applies to all shapes of distributions
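Chebyshev's Theorem states that, for any distribution and any k > 1, at least 1 - 1/k² of the values lie within k standard deviations of the mean. A minimal Python sketch of that lower bound follows; the function name is an illustrative assumption.

```python
def chebyshev_lower_bound(k):
    """Minimum share of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

for k in (2, 3):
    print(f"within {k} sigma: at least {chebyshev_lower_bound(k):.1%}")
```

The guarantees (at least 75% within 2σ and about 88.9% within 3σ) are weaker than the 95% and 99.7% of the empirical rule, which is the price of making no assumption about the shape of the distribution.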
---------- ------------ --------------------
What did we Learn?
 Described measures of central tendency
 Mean, median, mode
 Illustrated the shape of the distribution
 Symmetric, skewed
 Described measures of variation
 Range, interquartile range, variance and standard deviation,
coefficient of variation
