
Describing Data through Statistics

Prof(Dr.) Harsh Vardhan


27-28/07/2023
“A cloud does not know why it moves in just such a direction
and at such a speed.
It feels an impulsion... this is the place to go now... But the
sky knows the reasons and the patterns behind all clouds,
and you will know too when you lift yourself high enough to
see beyond horizons”

BS/IIFT/Harsh/3 2
Session Objective
 To distinguish between measures of central tendency,
measures of variation & measures of shape
 To calculate descriptive summary measures for a
population
 To differentiate between sample & population variance &
standard deviation
 To understand the meaning of standard deviation as it is
applied by using the empirical rule & Chebyshev’s theorem

Content
To understand the following concepts:
 Measures of Central Tendency
 Mean, Median, Mode, Midrange, Midhinge, Quartile, Percentiles
 Measures of Variability
 Range, Interquartile Range, Variance and Standard Deviation, Coefficient of Variation
 Measures of Shape
 Symmetric, Skewed, using Box-and-Whisker Plots, Kurtosis
 Differentiate between sample and population variance and standard deviation
 Locating Extreme Outliers: Z-Score
 Understand the meaning of standard deviation as it is applied using the
empirical rule and Chebyshev’s theorem

Summary Measures

Central Tendency: Mean, Median, Mode, Midrange, Midhinge
Quartile
Variation: Range, Variance, Standard Deviation, Coefficient of Variation

Central Tendency & Dispersion
 Central tendency is the midpoint of a distribution. A measure of central
tendency is also known as a measure of location.
 Dispersion is the spread of the data in a distribution, i.e. the extent to
which the observations are spread out.
 Curves representing the data points in a data set may be either
symmetrical or skewed.
 Symmetrical: a characteristic of a distribution in which each half is the
mirror image of the other half.
 Skewness: the extent to which the data points of a distribution are
concentrated at one end or the other, i.e. its lack of symmetry.

Contd.

 A positively skewed distribution is skewed to the right: it tails off toward
the high end of the scale.
 A negatively skewed distribution is skewed to the left: it tails off toward
the low end of the scale.
 Kurtosis is the degree of peakedness of a distribution.
 Curves may have the same central location and dispersion and be
symmetrical, yet differ in their degree of kurtosis.

Measures of Central Tendency

Central Tendency: Mean, Median, Mode, Midrange, Midhinge

Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

The Mean (Arithmetic Average)

 It is the arithmetic average of the data values:
 Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}$
 The most common measure of central tendency
 Affected by extreme values (outliers)

[Figure: two dot plots on number lines, one with Mean = 5 and one, containing an extreme value, with Mean = 6.]

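A minimal sketch of how one extreme value moves the mean. The two data sets are hypothetical, chosen only so that the means come out as 5 and 6; they are not taken from the slide.

```python
from statistics import mean

# Hypothetical data: the second list replaces the largest value (9)
# with an extreme value (14); everything else is unchanged.
without_outlier = [1, 3, 5, 7, 9]
with_outlier = [1, 3, 5, 7, 14]

print(mean(without_outlier))  # 5
print(mean(with_outlier))     # 6
```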
Contd.
 Population mean:
$\mu = \frac{\sum x}{N}$, where $\sum x$ is the sum of all observations and
N is the number of elements in the population.
 Sample mean:
$\bar{x} = \frac{\sum x}{n}$, where $\sum x$ is the sum of all observations and
n is the number of elements in the sample.
 Weighted average for grouped data:
$\bar{x} = \frac{\sum fx}{\sum f}$, where f is the frequency of the class with midpoint x.
 Step-deviation form for grouped data:
$\bar{x} = A + h \cdot \frac{\sum fd}{\sum f}$, where A is an arbitrary midpoint,
h is the class interval and $d = \frac{x - A}{h}$.

Advantages and Disadvantages of
Arithmetic Mean

 Advantages: the arithmetic mean represents the whole data set. The
concept is familiar. Every data set has a mean, and it is unique for a
data set. It is useful for comparing several data sets.
 Disadvantages: although it reliably reflects all values in the data set,
it is affected by extreme values that are not representative of the data.
It is cumbersome to compute because every observation enters the
calculation. It cannot be calculated for a distribution with an
open-ended class.

Mean - Ungrouped Data
For a population: $\mu = \frac{X_1 + X_2 + X_3 + \cdots + X_N}{N} = \frac{\sum X}{N}$
For a sample: $\bar{X} = \frac{X_1 + X_2 + X_3 + \cdots + X_n}{n} = \frac{\sum X}{n}$

Mean - Grouped Data
For a population: $\mu = \frac{\sum fX}{N}$
For a sample: $\bar{X} = \frac{\sum fX}{n}$

Table | Approximation of the Arithmetic Mean from a Frequency Distribution

Class (net profit in       Absolute class frequency      Class
millions of dollars)       (number of companies), f      midpoint X        fX
-1,250 to under 0                     6                     -625        -3,750
0 to under 1,250                     49                      625        30,625
1,250 to under 2,500                 18                    1,875        33,750
2,500 to under 3,750                 15                    3,125        46,875
3,750 to under 5,000                  3                    4,375        13,125
5,000 to under 6,250                  2                    5,625        11,250
6,250 to under 7,500                  4                    6,875        27,500
7,500 to under 8,750                  2                    8,125        16,250
8,750 to under 10,000                 1                    9,375         9,375
Total                       Σf = N = 100                          ΣfX = 185,000

Estimated arithmetic mean = 185,000/100 = 1,850, i.e. $1,850 million

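The estimate can be checked with a short sketch. The midpoints and frequencies are copied from the net-profit table; the function name `grouped_mean` is only illustrative.

```python
# Class midpoints X and frequencies f from the net-profit table
# (values in millions of dollars).
midpoints = [-625, 625, 1875, 3125, 4375, 5625, 6875, 8125, 9375]
frequencies = [6, 49, 18, 15, 3, 2, 4, 2, 1]


def grouped_mean(midpoints, frequencies):
    """Return sum(f * X) / sum(f) for a frequency distribution."""
    total_fx = sum(f * x for f, x in zip(frequencies, midpoints))
    return total_fx / sum(frequencies)


estimated_mean = grouped_mean(midpoints, frequencies)
print(estimated_mean)  # 1850.0
```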


Numerical Problem - Grouped Mean (27/07/23)
$\bar{x} = A + h \cdot \frac{\sum fd}{\sum f}$, where
A is an arbitrary midpoint,
h is the class interval,
x is the midpoint of the class interval,
$d = \frac{x - A}{h}$.
Calculate the mean of the following table:

Class interval   0-8   8-16   16-24   24-32   32-40   40-48
Frequency          8      7      16      24      15       7

Ans.
A = 28, h = 8

Class interval   Mid-value x   Frequency f   d = (x-A)/h    fd
0-8                   4             8            -3         -24
8-16                 12             7            -2         -14
16-24                20            16            -1         -16
24-32                28            24             0           0
32-40                36            15             1          15
40-48                44             7             2          14
Total                              77                       -25

$\bar{x} = A + h \cdot \frac{\sum fd}{\sum f} = 28 + 8 \cdot \frac{(-25)}{77} = 28 - \frac{200}{77} \approx 25.403$

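As a cross-check, the step-deviation formula and the direct formula $\sum fx / \sum f$ should give the same mean. A minimal sketch using the class intervals and frequencies of this problem (variable names are illustrative):

```python
lower_limits = [0, 8, 16, 24, 32, 40]   # lower class limits
frequencies = [8, 7, 16, 24, 15, 7]
h = 8    # class interval (width)
A = 28   # arbitrary midpoint, as chosen in the answer

midpoints = [lower + h / 2 for lower in lower_limits]
d = [(x - A) / h for x in midpoints]

sum_f = sum(frequencies)
sum_fd = sum(f * di for f, di in zip(frequencies, d))

mean_step = A + h * sum_fd / sum_f                                   # step-deviation form
mean_direct = sum(f * x for f, x in zip(frequencies, midpoints)) / sum_f  # direct form

print(sum_fd, sum_f)        # -25.0 77
print(round(mean_step, 3))  # 25.403
```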
Geometric Mean
 The geometric mean is another measure of central tendency.
It measures the average rate of change or growth of a
quantity, computed by taking the nth root of the product of n
values representing change.
 A good working rule is to use the GM when calculating the
average percentage change in some variable over time,
for example the average inflation rate or the average growth rate of
savings bank interest.
$GM = \sqrt[n]{\prod x}$

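A small sketch of the GM as an average growth rate. The three yearly rates (+10%, +20%, -5%) are hypothetical, and expressing each change as a growth factor (1 + rate) is an assumption about how the "values representing change" are encoded.

```python
from math import prod
from statistics import geometric_mean

# Hypothetical yearly changes of +10%, +20% and -5%, written as growth factors.
growth_factors = [1.10, 1.20, 0.95]
n = len(growth_factors)

gm = prod(growth_factors) ** (1 / n)   # nth root of the product
average_rate = gm - 1                  # average growth rate per year

# The standard library offers the same calculation directly.
gm_stdlib = geometric_mean(growth_factors)

print(f"average growth per year: {average_rate:.2%}")
```

Growing at `average_rate` for three years gives the same overall change as the three individual years combined, which is the property that makes the GM the right average here.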
The Median
 An important measure of central tendency
 In an ordered array, the median is the middle
number.
 If n is odd, the median is the middle number.
 If n is even, the median is the average of the 2
middle numbers.
 Not affected by extreme values

[Figure: two dot plots on number lines, with and without extreme values; Median = 5 in both.]

Median-Contd.
 The median is the middle point of the data set, a measure of location
that divides the data set into halves.
 Median = the $\frac{n+1}{2}$th item in the ordered data array.
 Extreme values do not affect the median as strongly as they affect the
mean. It is easy to understand and can be calculated for any kind
of data, even for grouped data with an open-ended class.
 We can find the median for qualitative descriptions, such as the sharpness
of an image.
 However, the data need to be arranged in order first (time consuming), so
the mean is generally easier to compute than the median.

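The odd/even rule for the ordered array can be sketched as follows. The function name is illustrative; the two data sets are the nine-value and eight-value arrays used elsewhere in this deck.

```python
def median_by_position(values):
    """Median of an ordered array: the middle item, or the mean of the two middle items."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]                          # item (n + 1)/2, counting from 1
    return (ordered[middle - 1] + ordered[middle]) / 2  # average of the 2 middle items


odd_sample = [11, 12, 13, 16, 16, 17, 18, 21, 22]   # n = 9
even_sample = [10, 12, 14, 15, 17, 18, 18, 24]      # n = 8

print(median_by_position(odd_sample))   # 16
print(median_by_position(even_sample))  # 16.0
```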
Median - Ungrouped Data
For a population: $M = X_{(N+1)/2}$
For a sample: $m = X_{(n+1)/2}$
X = population (or sample) value
N = number of observations in population
n = number of observations in sample
subscript = position of X in ordered array

Median - Grouped Data
For a population: $M = L + \frac{(N/2) - F}{f} \cdot w$
For a sample: $m = L + \frac{(n/2) - F}{f} \cdot w$
L = the median class’s lower limit
f = its absolute frequency
w = its width
F = the sum of frequencies up to (but not including) those of the median class

The Median divides the area in the graph in half

To locate the Median

Arrange the responses in order from lowest to highest (or highest to lowest):

Response
very dissatisfied
very dissatisfied
somewhat dissatisfied
somewhat satisfied   (the middle case = Median)
somewhat satisfied
very satisfied
very satisfied

Numerical Problem-Median-Discrete distribution
Obtain the median for the following frequency distribution:

x    f
1    8
2   10
3   11
4   16
5   20
6   25
7   15
8    9
9    6

In the case of a discrete distribution:
1. Find the cumulative frequencies.
2. Determine N/2, where N = Σf.
3. Find the cumulative frequency just greater than N/2.
4. The corresponding value of x is the median.

Ans.

x    f    cf
1    8     8
2   10    18
3   11    29
4   16    45
5   20    65
6   25    90
7   15   105
8    9   114
9    6   120
Total N = 120

N/2 = 60. The cumulative frequency (cf) just greater than N/2 = 60 is 65.
The value of x corresponding to 65 is 5; therefore 5 is the median.

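The four steps for a discrete distribution can be sketched as below, using the x and f values of this problem. The function name is illustrative, and the sketch follows the slide's rule literally (first cumulative frequency strictly greater than N/2), without any special handling when a cumulative frequency equals N/2 exactly.

```python
from itertools import accumulate

x_values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
frequencies = [8, 10, 11, 16, 20, 25, 15, 9, 6]


def discrete_median(x_values, frequencies):
    """Return the x whose cumulative frequency first exceeds N/2."""
    cumulative = list(accumulate(frequencies))   # step 1: cumulative frequencies
    half = cumulative[-1] / 2                    # step 2: N/2 with N = sum of f
    for x, cf in zip(x_values, cumulative):
        if cf > half:                            # step 3: first cf greater than N/2
            return x                             # step 4: corresponding x
    raise ValueError("frequencies must contain at least one positive count")


print(discrete_median(x_values, frequencies))  # 5
```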
Median of continuous frequency distribution
Find the median wage of the following distribution:

Wages in Rs.   No. of laborers
20-30                 3
30-40                 5
40-50                20
50-60                10
60-70                 5

Median - Grouped Data
For a population: $M = L + \frac{(N/2) - F}{f} \cdot w$
L = the median class’s lower limit
f = frequency of the median class
w = its width
F = the sum of frequencies up to (but not including) those of the median class

Ans.

Wages in Rs.   No. of laborers   c.f.
20-30                 3             3
30-40                 5             8
40-50                20            28
50-60                10            38
60-70                 5            43
Total              N = 43

N/2 = 43/2 = 21.5
The c.f. just greater than 21.5 is 28. Therefore the median class is 40-50.
L = 40, w = 10, f = 20, N = 43, F = 8
M = L + (N/2 - F) * (w/f)
M = 40 + (21.5 - 8) * 10/20
M = 40 + 13.5 * 0.5
  = 40 + 6.75
Median = 46.75

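A minimal sketch of the grouped-median formula M = L + (N/2 - F) * (w/f), applied to the wage distribution. It assumes equal class widths; the function and argument names are illustrative.

```python
def grouped_median(lower_limits, width, frequencies):
    """Median of a continuous frequency distribution with equal class widths."""
    n = sum(frequencies)
    cumulative_before = 0                        # F: frequencies below the current class
    for lower, f in zip(lower_limits, frequencies):
        if cumulative_before + f > n / 2:        # this is the median class
            return lower + (n / 2 - cumulative_before) * width / f
        cumulative_before += f
    raise ValueError("frequencies must contain at least one positive count")


wage_lower_limits = [20, 30, 40, 50, 60]
wage_frequencies = [3, 5, 20, 10, 5]

print(grouped_median(wage_lower_limits, 10, wage_frequencies))  # 46.75
```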
The Mode
A Measure of Central Tendency
Value that Occurs Most Often
Not Affected by Extreme Values
There May Not be a Mode
There May be Several Modes
Used for Either Numerical or Categorical Data

[Figure: two dot plots on number lines, one with Mode = 9 and one with no mode.]

A bimodal distribution- number of fish caught

Mode-advantages & disadvantages.
 The mode, unlike the mean, can be used for qualitative as well as
quantitative data.
 It is not affected by extreme values.
 It can be used with an open-ended class.
 However, it is not used as often as the median or the mean.
 There may be no mode: it is possible that no value occurs more than
once, or that every value occurs the same number of
times.

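The cases above (one mode, several modes, no mode, qualitative data) can be illustrated with a short sketch. The helper `modes` and all sample values are hypothetical; treating "every value occurs equally often" as "no mode" follows the last bullet.

```python
from collections import Counter


def modes(values):
    """Return the list of modes; an empty list means there is no mode."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    # More than one distinct value, all equally frequent: no mode.
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return [value for value, c in counts.items() if c == top]


print(modes([1, 2, 2, 3, 3, 4]))        # [2, 3]  (bimodal)
print(modes([1, 2, 3]))                 # []      (no mode)
print(modes(["red", "blue", "red"]))    # ['red'] (qualitative data)
```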
Mean, Mode, Median Comparison
 For a symmetrical distribution that has one mode, the mean, mode
and median are all equal.
 For a positively skewed distribution, the mode is at the highest point, the
median is to its right, and the mean is to the right of both.
 For a negatively skewed distribution, the mode is at the highest point, the
median is to its left, and the mean is to the left of both.
 When the population is skewed, the median is the best measure of location.
The median is not influenced by the frequency of occurrence of a single value,
as the mode is, nor pulled by extreme values, as the mean is.

Mean Mode Median Comparison

[Figure: three frequency curves. Negatively skewed: Mean, Median, Mode (left to right). Symmetrical: Mean = Median = Mode. Positively skewed: Mode, Median, Mean (left to right).]

The mode can be estimated by Mode ≈ 3 Median - 2 Mean

The Shape of a Frequency Curve
[Figure: two panels of frequency curves, one illustrating skewness and one illustrating kurtosis.]


Midrange

 A measure of central tendency
 Average of the smallest and largest observations:
$\text{Midrange} = \frac{x_{largest} + x_{smallest}}{2}$
 Affected by extreme values

[Figure: two dot plots on number lines from 0 to 10, each with Midrange = 5.]

Quartiles

 Not a measure of central tendency
 Split ordered data into 4 quarters of 25% each: Q1, Q2 (Median), Q3
 Position of the i-th quartile: $\text{position of } Q_i = \frac{i(n+1)}{4}$

Data in Ordered Array: 11 12 13 16 16 17 18 21 22

Position of Q1 = 1•(9 + 1)/4 = 2.50, so Q1 = (12 + 13)/2 = 12.5

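A sketch of the position rule i(n+1)/4 for the nine-value array. The slide only shows a position that falls halfway between two items; extending this to linear interpolation for any fractional position is an assumption of the sketch, as is the function name.

```python
import math


def quartile(data, i):
    """i-th quartile (i = 1, 2, 3) at position i(n + 1)/4 in the ordered array."""
    ordered = sorted(data)
    n = len(ordered)
    position = i * (n + 1) / 4            # 1-based position, may be fractional
    lower = math.floor(position)
    fraction = position - lower
    lower = min(max(lower, 1), n)         # keep indices inside the array
    upper = min(lower + 1, n)
    low_value, high_value = ordered[lower - 1], ordered[upper - 1]
    return low_value + fraction * (high_value - low_value)


data = [11, 12, 13, 16, 16, 17, 18, 21, 22]
print(quartile(data, 1))  # 12.5
print(quartile(data, 2))  # 16.0
print(quartile(data, 3))  # 19.5
```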
Problem- Quartiles
Eight coins were tossed together and the number of heads was noted.
The operation was repeated 256 times, and the frequencies (f)
obtained for the different values of x, the number of heads, are
shown in the following table. Calculate the median and the quartiles.

X   0   1   2   3   4   5   6   7   8
f   1   9  26  59  72  52  29   7   1

Solution

X      0    1    2    3    4    5    6    7    8
f      1    9   26   59   72   52   29    7    1
c.f.   1   10   36   95  167  219  248  255  256

Median: N/2 = 256/2 = 128; the c.f. just greater than 128 is 167. Therefore the median is 4.
Q1: N/4 = 256/4 = 64; the c.f. just greater than 64 is 95. Therefore Q1 is 3.
Q3: 3N/4 = 192; the c.f. just greater than 192 is 219. Therefore Q3 is 5.

Midhinge
 A measure of central tendency
 The midpoint of the 1st and 3rd quartiles
 Not affected by extreme values
$\text{Midhinge} = \frac{Q_1 + Q_3}{2}$

Data in Ordered Array: 11 12 13 16 16 17 18 21 22

$\text{Midhinge} = \frac{Q_1 + Q_3}{2} = \frac{12.5 + 19.5}{2} = 16$

Summary Measures

Central Tendency: Mean ($\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$), Median, Mode, Midrange, Midhinge
Quartile
Variation: Range, Variance ($s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$), Standard Deviation, Coefficient of Variation

The Concept of Dispersion
 Dispersion = variety, diversity, amount of variation between
scores.
 The greater the dispersion of a variable, the greater the
range of scores, and the greater the differences between
scores.

Central Tendency & Dispersion
 Measures of Central Tendency can be informative, but may not give a
complete picture.
Example: assume there are 14 students in each of 3 classes, with the
following scores (out of a possible 10) on a test:

Class #   Scores                            Median   Mean
1         6 6 6 6 6 6 6 6 6 6 6 6 6 6         6       6
2         1 2 2 3 4 4 4 8 8 9 9 10 10 10      6       6
3         4 4 5 5 5 6 6 6 6 7 7 7 8 8         6       6

Mean = Median ... but the distributions are very different

 Class 1 is completely homogeneous
 Class 2 is more dispersed
 Class 3 is not homogeneous, but shows clustering
Thus, we need measures of dispersion to better characterize these classes.

The Concept of Dispersion: Examples
 Typically, a large city will have more diversity than a small town.
 Some cities (New Delhi, Mumbai, Kolkata) are more culturally diverse
than others (Srinagar, Pauri, Darjeeling).

The Concept of Dispersion

 The taller curve has less dispersion.


 The flatter curve has more dispersion.

Measures of Variation

Variation:
Range
Interquartile Range
Variance: Population Variance, Sample Variance
Standard Deviation: Population Standard Deviation, Sample Standard Deviation
Coefficient of Variation: $CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\%$

Measure of Dispersion

 A measure of dispersion describes how the observations in a data set
are scattered or spread out.
 Dispersion gives additional information that enables us to judge
the reliability of the measure of central tendency. The more
dispersed the data, the less representative the measure of central
tendency.
 Financial analysts are concerned about dispersed earnings, which
reflect higher risk for stakeholders. Similarly, drugs of
dispersed quality may endanger life.

The Range

Measure of Variation
 Difference Between Largest & Smallest Observations:
Range = Xlargest - Xsmallest
 Quick and easy indication of variability
 Ignores How Data is distributed, heavily influenced by the extreme values:

[Figure: two dot plots on number lines from 7 to 12 with differently distributed data; both have Range = 12 - 7 = 5.]

Interquartile Range (IQR)
The interquartile range measures approximately how far from the
median we must go on either side before we can include one half of the
values in the data set.
Also known as the midspread:
the spread in the middle 50%
 Difference between the third and first quartiles:
 Interquartile Range = Q3 - Q1
 Not affected by extreme values
Example, n = 9:
Data in Ordered Array: 11 12 13 16 16 17 17 18 21
Q3 - Q1 = 17.5 - 12.5 = 5

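The example can be reproduced with the standard library. This sketch relies on `statistics.quantiles` with `method="exclusive"`, which places the i-th quartile at position i(n+1)/4 in the ordered array, the same positioning rule used for quartiles earlier in this deck.

```python
from statistics import quantiles

data = [11, 12, 13, 16, 16, 17, 17, 18, 21]

# n=4 asks for the three quartile cut points.
q1, q2, q3 = quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

print(q1, q3, iqr)  # 12.5 17.5 5.0
```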
Interquartile Range (IQR)
Disadvantages of using the range and the IQR:
(1) they ignore a lot of information: each measure uses only two
scores;
(2) they give no idea of how much the scores differ from the
center of the distribution.

Mean Absolute Deviation -
Ungrouped Data
For a population:
$MAD = \frac{\sum |X - \mu|}{N}$
For a sample:
$MAD = \frac{\sum |X - \bar{X}|}{n}$
numerators = the sums of absolute
differences between each observed
population or sample value (X) and the
population ($\mu$) or sample ($\bar{X}$) mean

Variance

Important measure of variation

Shows variation about the mean:

For the population: $\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$ (use N in the denominator)

For the sample: $s^2 = \frac{\sum (X_i - \bar{X})^2}{n-1}$ (use n - 1 in the denominator)

Variance

 Computational form:

$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n-1}$

Standard Deviation

 Square root of the variance:

$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n-1}}$

Sample Standard Deviation

$s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n-1}}$   For the sample: use n - 1 in the denominator.

Data: Xi : 10 12 14 15 17 18 18 24

n = 8, Mean = 16

$s = \sqrt{\frac{(10-16)^2 + (12-16)^2 + (14-16)^2 + (15-16)^2 + (17-16)^2 + (18-16)^2 + (18-16)^2 + (24-16)^2}{8-1}} = \sqrt{\frac{130}{7}} \approx 4.3095$

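A sketch that evaluates the sample standard deviation for all eight data values, once with the definitional form and once with the computational form, to show that the two agree. Variable names are illustrative.

```python
from math import sqrt

data = [10, 12, 14, 15, 17, 18, 18, 24]
n = len(data)
mean = sum(data) / n

# Definitional form: sum of squared deviations from the mean, divided by n - 1.
sum_sq_dev = sum((x - mean) ** 2 for x in data)
var_definitional = sum_sq_dev / (n - 1)

# Computational form: (sum of x^2 - (sum of x)^2 / n) divided by n - 1.
var_computational = (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)

s = sqrt(var_definitional)
print(mean, sum_sq_dev, round(s, 4))  # 16.0 130.0 4.3095
```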
Measures of Variability
Example: Partners in Accounting Firms
An analyst takes a sample of the number of partners in six of the largest accounting firms
in the U.S. What are the sample variance and sample standard deviation?

Firm Number of Partners


PricewaterhouseCoopers 3327
Ernst & Young 3200
Deloitte 3135
KPMG 2178
RSM US 799
CliftonLarsonAllen 735

Measures of Variability
Example: Partners in Accounting Firms

$\bar{x} = \frac{13,374}{6} = 2229.00$

x_i        (x_i - x̄)^2
3327       1,205,604
3200         942,841
3135         820,836
2178           2,601
799        2,044,900
735        2,232,036
TOTAL 13,374   7,248,818

$s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1} = \frac{7,248,818}{5} = 1,449,763.6$

$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}} = \sqrt{1,449,763.6} \approx 1,204.06$

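The same figures can be checked with Python's `statistics` module, whose `variance` and `stdev` functions use the sample (n - 1) denominator. The list name `partners` is illustrative.

```python
from statistics import mean, stdev, variance

# Number of partners in the six firms from the example.
partners = [3327, 3200, 3135, 2178, 799, 735]

sample_mean = mean(partners)
sample_variance = variance(partners)   # divides by n - 1
sample_sd = stdev(partners)            # square root of the sample variance

print(sample_mean)              # 2229
print(sample_variance)          # 1449763.6
print(round(sample_sd, 2))      # 1204.06
```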
Comparing Standard Deviations

[Figure: three dot plots (Data A, Data B, Data C) on a common axis from 11 to 21.]

Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = 0.9258
Data C: Mean = 15.5, s = 4.57

Standard Deviation

We squared the deviations only to keep their sum from being zero
(a property of the mean). Now that we have a non-zero number, we
take the square root to get the statistic s, the standard deviation:

$s^2 = \frac{\sum (X_i - \bar{X})^2}{n-1}$, so $s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n-1}}$

Class #   Scores                            Median   Mean   Range   IQR   Variance   Std Deviation
1         6 6 6 6 6 6 6 6 6 6 6 6 6 6         6       6       0     0.0      0.0         0.0
2         1 2 2 3 4 4 4 8 8 9 9 10 10 10      6       6       9     5.8     11.7         3.4
3         4 4 5 5 5 6 6 6 6 7 7 7 8 8         6       6       4     2.0      1.7         1.3

Why the Standard Deviation?

 The most important and widely used measure of dispersion.


 If you have occasion to look at reports or journal articles where
data analyses are conducted to support conclusions, SD will be
referred to.
 SD provides an approximate picture of the average amount each
number in a set (e.g. age, income, height, weight) varies from the
center point (i.e. the mean)

Variance -
Ungrouped Data
For a population:
$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{\sum X^2 - N\mu^2}{N}$

For a sample:
$s^2 = \frac{\sum (X - \bar{X})^2}{n-1} = \frac{\sum X^2 - n\bar{X}^2}{n-1}$

numerators = the sums of squared
deviations between each population
or sample value (X) and the
population ($\mu$) or sample ($\bar{X}$) mean

Standard Deviation
 To find the SD:
 Subtract the mean from each score.
 Square the deviations.
 Sum the squared deviations.
 Divide the sum of the squared deviations by N (for a population) or by n - 1 (for a sample).
 Find the square root of the result.

Use of Standard Deviation

 The standard deviation enables us to determine where the values of a
frequency distribution are located with respect to the mean. For a normal
(bell-shaped) distribution:
 About 68% of the values in the population fall within ±1 standard
deviation (σ) of the mean μ.
 About 95% of the values in the population fall within ±2 standard
deviations (2σ) of the mean μ.
 About 99.7% of the values in the population fall within ±3 standard
deviations (3σ) of the mean μ.

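The percentages for a normal distribution can be compared with Chebyshev's theorem, which the session objectives name but the slides do not spell out. The sketch below uses the standard result that, for any distribution and any k > 1, at least 1 - 1/k² of the values lie within k standard deviations of the mean; the function names are illustrative.

```python
from math import erf, sqrt


def normal_share_within(k):
    """Share of a normal population within k standard deviations of the mean."""
    return erf(k / sqrt(2))


def chebyshev_lower_bound(k):
    """Chebyshev's minimum share within k standard deviations (any distribution, k > 1)."""
    return 1 - 1 / k ** 2


for k in (1, 2, 3):
    line = f"k={k}: normal {normal_share_within(k):.4f}"
    if k > 1:
        line += f", Chebyshev at least {chebyshev_lower_bound(k):.4f}"
    print(line)
```

The normal shares are much higher than the Chebyshev bounds because the bounds must hold for every distribution shape, not only the bell-shaped one.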
Normal Frequency Distribution

Z Scores
 Z score: represents the number of standard deviations a value (x) is above or
below the mean of a set of numbers
 A z score translates a value’s raw distance from the mean
into units of standard deviation:
$Z = \frac{x - \mu}{\sigma}$
 Negative z scores indicate that the raw value (x) is below the mean;
positive z scores indicate x values above the mean

Measures of Variability
For a normally distributed population with a mean of 50 and a standard deviation of 10, an x value
of 70 would have a z score of 2:

$z = \frac{70 - 50}{10} = 2$

This z score signifies that 70 is 2 standard deviations above the mean.

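A minimal sketch of the z-score calculation, plus one way it might be used to locate extreme values. The cut-off of 3 standard deviations, the helper `extreme_values` and the sample data are all assumptions for illustration; the slides do not state a threshold.

```python
from statistics import mean, pstdev


def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma


def extreme_values(values, threshold=3.0):
    """Values whose |z| exceeds the threshold (assumed cut-off, not from the slides)."""
    mu, sigma = mean(values), pstdev(values)
    return [x for x in values if abs(z_score(x, mu, sigma)) > threshold]


print(z_score(70, 50, 10))          # 2.0

# Hypothetical data: nineteen values of 10 and a single value of 100.
sample = [10] * 19 + [100]
print(extreme_values(sample))       # [100]
```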
Coefficient of Variation
Measure of relative variation
Always a %
Shows variation relative to the mean
Used to compare 2 or more groups
Formula (for a sample): $CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\%$

Comparing Coefficient of Variation

 Stock A: Average price last year = $50
Standard deviation = $5
 Stock B: Average price last year = $100
Standard deviation = $5

Coefficient of Variation: $CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\%$
Stock A: CV = 10%
Stock B: CV = 5%

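The stock comparison can be sketched in a few lines; the function name is illustrative. Both stocks have the same standard deviation, but relative to its mean price Stock A varies twice as much.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return 100 * std_dev / mean


cv_a = coefficient_of_variation(5, 50)    # Stock A
cv_b = coefficient_of_variation(5, 100)   # Stock B

print(cv_a, cv_b)  # 10.0 5.0
```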
Shape

 Describes how data are distributed
 Measures of shape:
Symmetric or Skewed

[Figure: three curves. Left-skewed: Mean, Median, Mode (left to right). Symmetric: Mean = Median = Mode. Right-skewed: Mode, Median, Mean (left to right).]

Coefficient of Skewness
 Coefficient of Skewness (Sk): compares the mean
and the median relative to the magnitude of the standard deviation;
 μ is the mean;
 Md is the median;
 Sk is the coefficient of skewness;
 σ is the standard deviation
$S_k = \frac{3(\mu - M_d)}{\sigma}$

Coefficient of Skewness

 Summary measure for skewness:

$S_k = \frac{3(\mu - M_d)}{\sigma}$

 If Sk < 0, the distribution is negatively skewed (skewed to
the left).
 If Sk = 0, the distribution is symmetric (not skewed). If Sk is
close to 0, it is almost symmetric.
 If Sk > 0, the distribution is positively skewed (skewed to
the right).

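A sketch of the coefficient of skewness applied to two data sets from earlier in this deck. Using the population standard deviation (`pstdev`) for σ is an assumption; a sample standard deviation would change the magnitude of Sk but not its sign.

```python
from statistics import mean, median, pstdev


def pearson_skewness(values):
    """Sk = 3 * (mean - median) / standard deviation."""
    sigma = pstdev(values)   # assumption: population SD stands in for sigma
    if sigma == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return 3 * (mean(values) - median(values)) / sigma


partners = [3327, 3200, 3135, 2178, 799, 735]   # mean 2229 < median 2656.5
scores = [10, 12, 14, 15, 17, 18, 18, 24]       # mean 16 = median 16

print(pearson_skewness(partners) < 0)   # True: negatively skewed
print(pearson_skewness(scores))         # 0.0: mean and median coincide
```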
Box-and-Whisker Plot
 Graphical Display of Data Using
5-Number Summary

5-number summary in the example plot: X_smallest = 4, Q1 = 6, Median = 8, Q3 = 10, X_largest = 12

Distribution Shape &
Box-and-Whisker Plots

[Figure: box-and-whisker plots for left-skewed, symmetric and right-skewed distributions, each marking Q1, Median and Q3.]

Learning
 Discussed Measures of Central Tendency
Mean, Median, Mode, Midrange, Midhinge
 Quartiles
 Addressed Measures of Variation
The Range, Interquartile Range, Variance,
Standard Deviation, Coefficient of Variation
 Determined Shape of Distributions
Symmetric, Skewed, Box-and-Whisker Plot

Thanks
