
MEASURES OF DISPERSION

Applied Statistics and Computing Lab Indian School of Business


Learning goals
- To understand the need for studying dispersion
- To understand the idea behind measures of dispersion
- To study different measures of dispersion
- Additional topics:
  - Standardization of a variable
  - Skewness and kurtosis
  - Five-point summary

Need to study dispersion


Two patients are admitted to the Intensive Care Unit of a hospital. On the night before their operations, the doctor makes a last visit at 9pm; the blood pressure of Patient 1 is 110/80 and that of Patient 2 is 120/70. Although both readings are normal, as a precaution the doctor asks the nurse to check their blood pressure every 2 hours. At 7.30 the next morning, the nurse reports that the average blood pressure of both patients was normal, 120/80. The chart of their actual blood pressures was:

Time    Patient 1    Patient 2
11pm    120/80       110/60
1am     100/80       100/60
3am     100/60       100/70
5am     130/80       130/90
7am     150/100      160/120
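The nurse's report can be checked against the chart. The R sketch below is illustrative only: the vector names (`p1_sys`, `p1_dia`, and so on) are assumptions introduced here, and the spread is summarized with a simple maximum-minus-minimum difference. It shows that both patients average 120/80 while their readings vary widely.

```r
# Overnight readings from the chart (11pm, 1am, 3am, 5am, 7am).
# Vector names are illustrative, not part of the original slides.
p1_sys <- c(120, 100, 100, 130, 150)  # Patient 1, systolic
p1_dia <- c(80, 80, 60, 80, 100)      # Patient 1, diastolic
p2_sys <- c(110, 100, 100, 130, 160)  # Patient 2, systolic
p2_dia <- c(60, 60, 70, 90, 120)      # Patient 2, diastolic

readings <- list(p1_sys = p1_sys, p1_dia = p1_dia,
                 p2_sys = p2_sys, p2_dia = p2_dia)

# Average reading per patient and component
bp_mean <- sapply(readings, mean)

# Spread per patient and component: maximum minus minimum
bp_spread <- sapply(readings, function(v) diff(range(v)))

print(bp_mean)
print(bp_spread)
```

The averages alone hide the fact that Patient 2's systolic pressure moved across a 60-point band overnight.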

Need to study dispersion (contd.)


- What if the doctor decides to operate on the patients without looking at the blood pressure chart?
- What if someone decides to visit a tourist destination next week based only on last week's average temperature, as given in our data?
- What if I am interested in working with company X (which is visiting our campus) and I am given only the mean salary of its employees?
- In an extreme case, the same value of a central tendency could also describe a dataset in which every observation equals one constant value

[Figure: scale marked 10 to 13, with Mode and Median labelled]

Examples
- Variability in temperature through the week
- Scatter of the horsepower capacities among the cars available
- Spread of the prices at which varieties of a single product (say, rice varieties) are available
- Variability in returns on investments

Need for measures of dispersion (contd.)


- Helps determine the reliability of the measure of central tendency
- Facilitates comparison of two sets of data
- Useful for building further statistical measures

Desired properties
- A good measure should not be strongly affected when the data change slightly
- A good measure should be representative of the majority of the data
- A good measure should allow us to declare an interval within which most of the values lie, with a certain degree of confidence

Dataset
- Body measurements on 507 individuals: 247 men and 260 women
- Primarily in their 20s and 30s, with some exceptions
- All individuals exercise several hours a week
- Of the 28 variables in this dataset, we consider Gender (1 = Male, 0 = Female) and Weight (in kg)

Data source: Measurements collected by authors Grete Heinz and Louis J. Peterson for their study

Dataset (contd.)
           Min. weight (kg)   Max. weight (kg)   Mean weight (kg)   Median weight (kg)
Female     42                 105.2              60.6               59
Male       53.9               116.4              78.14              77.3
Overall    42                 116.4              69.15              68.2

Evaluating dispersion
- Consider the boundaries (measures based on selected values)
- Consider distance from a central tendency (measures based on all the values)

Possible outputs: report the extreme values, build an absolute measure, calculate a coefficient
- High coefficient: large spread, high variability
- Small coefficient: small spread, less variability

1. Considering the boundaries


- These measures consider and report only the boundaries of the data
- They try to capture how far the values of the variable reach
- The spread of the data is not considered relative to any central tendency
- These measures overlook the patterns of values within the boundaries

Minimum and maximum values
ADVANTAGES:
- Useful when a range of tolerance exists, i.e. if values beyond a certain limit are harmful or unacceptable
- Easy to compute and understand
DISADVANTAGES:
- Ignores any pattern in the data
- Ignores most of the data

Range = (Maximum value) - (Minimum value)
ADVANTAGES:
- Easy comparison of variability across datasets
- Easy to compute and understand
DISADVANTAGES:
- Ignores any pattern in the data
- Ignores most of the data

Inter-quartile range = (3rd quartile) - (1st quartile)
ADVANTAGES:
- Highlights the middle portion of the distribution of values
- Easy to understand
DISADVANTAGES:
- More difficult to compute than the min-max values and the range
- Ignores irregularities at the extremes
- Ignores 25% of the data on each side

           (Min. weight, Max. weight)   Weight range   Weight inter-quartile range
Female     (42, 105.2)                  63.2           11.1
Male       (53.9, 116.4)                62.5           14.55
Overall    (42, 116.4)                  74.4           20.45
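As a quick check of the boundary-based measures, the R sketch below recomputes the overall range and inter-quartile range from the summary figures reported in these slides, then applies the same measures to a small raw vector. The object names are illustrative assumptions, and `IQR()` is used with R's default quantile method, which may differ slightly from other quartile conventions.

```r
# Overall weight summary as reported in the slides (kg)
overall <- c(min = 42, q1 = 58.4, q3 = 78.85, max = 116.4)

weight_range <- overall[["max"]] - overall[["min"]]  # compare with 74.4 in the table
weight_iqr   <- overall[["q3"]] - overall[["q1"]]    # compare with 20.45 in the table

# The same measures on raw data (a small hypothetical dataset)
x <- c(1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7)
raw_minmax <- range(x)        # range() returns c(minimum, maximum)
raw_range  <- diff(range(x))  # maximum minus minimum
raw_iqr    <- IQR(x)          # 3rd quartile minus 1st quartile

print(c(range = weight_range, iqr = weight_iqr))
print(c(range = raw_range, iqr = raw_iqr))
```

Note that `range()` in R returns the two extreme values, so the range as a single number needs `diff(range(x))`.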


2. Considering distance from central tendency


- Consider the deviations of the values from a measure of central tendency
- What if we simply sum all these deviations?
- Consider a hypothetical dataset (1,1,2,2,3,3,4,5,5,6,6,7,7), for which Mean = Median = 4
- The positive and negative deviations cancel: $\sum_{i=1}^{13} (x_i - \bar{x}) = \sum_{i=1}^{13} (x_i - 4) = 0$
- Taking absolute values or squares of the deviations ensures that only their magnitudes are considered

Absolute deviations
For a dataset consisting of n observations $x_1, \dots, x_n$ with mean $\bar{x}$ and median $M$:

Absolute deviations: $|x_i - \bar{x}|$ or $|x_i - M|$

Mean absolute deviation from mean = $\frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|$

Mean absolute deviation from median = $\frac{1}{n} \sum_{i=1}^{n} |x_i - M|$

Median absolute deviation from median = $\mathrm{median}(|x_i - M|)$

                                         Female weights   Male weights
Mean absolute deviation from mean        7.33             8.58
Mean absolute deviation from median      7.19             8.57
Median absolute deviation from median    5.1              7.2
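The three absolute-deviation measures can be tried on the hypothetical dataset with Mean = Median = 4. This is a minimal R sketch; the object names are illustrative. It also confirms that the raw deviations sum to zero, which is why absolute values are needed.

```r
# Hypothetical dataset from the slides (mean = median = 4)
x <- c(1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7)

# Raw deviations from the mean cancel out
sum_dev <- sum(x - mean(x))

# Absolute deviation measures
mad_from_mean   <- mean(abs(x - mean(x)))      # mean absolute deviation from mean
mad_from_median <- mean(abs(x - median(x)))    # mean absolute deviation from median
med_abs_dev     <- median(abs(x - median(x)))  # median absolute deviation from median

# Base R's mad() rescales by 1.4826 by default; constant = 1 gives the raw value
med_abs_dev_builtin <- mad(x, constant = 1)

print(c(sum_dev, mad_from_mean, mad_from_median, med_abs_dev))
```

Because the mean and the median coincide for this dataset, the two mean absolute deviations are equal here; in the weight data they differ slightly.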

Measures based on squared deviation


For a dataset consisting of n observations,

Variance = $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$

To obtain a measure in the same units as the original data, we take the square root:

Standard deviation = $\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

                     Weight, females   Weight, males
Variance             92.46             110.52
Standard deviation   9.62              10.51
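A short R sketch of the squared-deviation measures follows, using the hypothetical dataset (1,1,2,2,3,3,4,5,5,6,6,7,7). One point worth flagging: the formula above divides by n, whereas R's built-in `var()` and `sd()` divide by n - 1. The names `var_n` and `sd_n` are illustrative assumptions.

```r
# Hypothetical dataset from the slides
x <- c(1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7)
n <- length(x)

# Variance and standard deviation with divisor n, as in the formula above
var_n <- mean((x - mean(x))^2)
sd_n  <- sqrt(var_n)

# R's built-ins use divisor n - 1 (the sample variance)
var_sample <- var(x)
sd_sample  <- sd(x)

# The two versions differ only by the factor (n - 1) / n
print(c(var_n = var_n, var_sample = var_sample,
        rescaled = var_sample * (n - 1) / n))
```

For a dataset of a few hundred observations, such as the weight data, the two divisors give nearly identical results.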

Relative measures of dispersion



Coefficient of range = $\frac{\max - \min}{\max + \min}$
- Lies in [0, 1] for non-negative data
- The higher the coefficient, the broader the range
- $CR_{\text{female}} = 0.43$, $CR_{\text{male}} = 0.37$

Coefficient of variation = $\frac{\sigma}{\bar{x}} \times 100$
- Computes the variability per unit mean
- Indicates how consistent the data are with respect to the mean
- The higher the coefficient, the more spread out the observations
- $CV_{\text{male}} = 13.45$, $CV_{\text{female}} = 15.87$

Relative to their mean, the weights of females are more spread out than those of males
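The relative measures can be reproduced from the summary statistics already reported in these slides. The R sketch below assumes the coefficient of range is (max - min) / (max + min) and the coefficient of variation is 100 times the standard deviation over the mean; the helper names `coef_range` and `coef_var` are hypothetical.

```r
# Hypothetical helpers working on summary statistics
coef_range <- function(min_val, max_val) (max_val - min_val) / (max_val + min_val)
coef_var   <- function(sd_val, mean_val) 100 * sd_val / mean_val

# Summary figures reported in the slides (weights in kg)
cr_female <- coef_range(42, 105.2)
cr_male   <- coef_range(53.9, 116.4)
cv_female <- coef_var(9.62, 60.6)
cv_male   <- coef_var(10.51, 78.14)

print(round(c(cr_female = cr_female, cr_male = cr_male), 2))
print(round(c(cv_female = cv_female, cv_male = cv_male), 2))
```

Both coefficients are unit-free, so they rank the female weights as relatively more variable even though the male weights have the larger standard deviation.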

Comparing measures of dispersion


All the measures that consider distance from a central tendency are based on all the values.

Absolute deviations:
- Absolute deviations are less affected by extreme values than squared deviations
- Absolute deviations are easy to understand and interpret
- Median absolute deviation is the least affected by slight changes in the data, across all measures of dispersion
- Algebraic manipulation of measures based on absolute deviations is difficult

Variance and standard deviation:
- Variance and standard deviation are the most popular measures of dispersion, because they are algebraically amenable and useful in building further statistical measures
- Both play an important part in building and evaluating further statistical measures
- Standard deviation is easier to understand than variance, as it is in the same units as the original data
- Variance is the most affected by extreme values, as it is based on squared deviations
- Standard deviation is not very easy to compute
- Standard deviation cannot be calculated for data with open-ended classes

Coefficients:
- Coefficients are free of units and therefore facilitate comparison
- Useful even when two variables are measured in two different units

Standardization
Standardized variable of X: $Z = \frac{X - \bar{X}}{\sigma_X}$
- Mean of standardized variable = 0
- Variance of standardized variable = 1
- Standardized variables are free of units
- Therefore measures of variation of standardized variables are comparable
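A minimal R sketch of standardization follows. The function name `standardize` and the example vector are illustrative assumptions. It checks that the standardized variable has mean 0 and variance 1, and that a change of units (kg to grams, say) leaves the standardized values unchanged.

```r
# Hypothetical helper: subtract the mean, divide by the standard deviation
standardize <- function(x) (x - mean(x)) / sd(x)

x <- c(1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7)
z <- standardize(x)

# Base R's scale() performs the same centering and scaling
z_scale <- as.numeric(scale(x))

# Changing the unit of measurement does not change the standardized values
z_rescaled <- standardize(1000 * x)

print(c(mean = mean(z), variance = var(z)))
```

Here `sd()` and `var()` both use the n - 1 divisor, so the variance of `z` is exactly 1 under that same convention.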

Example
- How is the weight of a new-born affected by whether the mother smokes? Further, does smoking affect the perinatal mortality rate, which varies with birth weight?
- Yerushalmy J. found in his 1971 paper that although low birth weight is associated with an increase in the number of babies who die shortly after birth, the babies of smokers tended to have much lower death rates than the babies of nonsmokers.* In this study, he compared perinatal death rates by grouping babies by birth weight
- In 1986 and 1993, Wilcox & Russell and Wilcox (respectively) strongly recommended that babies be grouped by their relative (or standardized) birth weight, rather than by their absolute weight (in kg)
- What happened then?

[Table from Yerushalmy J. (1971)**; weights measured in grams]

* and ** taken from Deborah Nolan and Terry Speed's Stat Labs: Mathematical Statistics Through Applications

Example (contd.)

[Figure: graphs taken from Deborah Nolan and Terry Speed's Stat Labs: Mathematical Statistics Through Applications]

Further to deviations
- Variance $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$ is the sum of squares of deviations from the mean divided by n, or the expected value of the squared deviation of X from its mean
- Expected values of higher powers of the deviations from the mean give additional information about the distribution of the data
- The expected value of any power of the deviations of a variable X from its mean (say, the r-th power) is called the r-th central moment of that variable: $\mu_r = E[(X - E(X))^r] = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^r$
- Central moments depict the spread and shape of the data
- Variance is the 2nd central moment: $\mu_2 = \sigma^2$
- Measures using the 3rd and 4th central moments are useful for understanding the shape of the distribution

Skewness
- Skewness is a measure of symmetry (or the lack of it) in a dataset
- A distribution is right-skewed or positively skewed if it stretches asymmetrically to the right
- It is left-skewed or negatively skewed if the asymmetric stretch is on the left
- Measuring skewness using moments: $\gamma_1 = \frac{\mu_3}{\mu_2^{3/2}} = \frac{\mu_3}{\sigma^3}$
- Important to note that if a distribution is perfectly symmetric, $\mu_3 = 0$
- The sign of the coefficient = the sign of $\mu_3$
- A coefficient of skewness close to zero indicates a nearly symmetric distribution

Visuals from Aczel A., Sounderpandian J. Complete business statistics

Kurtosis
- Kurtosis is a measure of the peakedness of a dataset
- The reference value for kurtosis is 3, the value for the normal distribution; such a curve is called mesokurtic
- A value larger than 3 indicates a distribution that is more peaked with longer (heavier) tails; such a curve is called leptokurtic
- A value smaller than 3 indicates a flatter curve with shorter (lighter) tails; such a curve is called platykurtic
- Measuring kurtosis using moments: $\beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{\mu_4}{\sigma^4}$

[Figure: the red line represents the frequency curve of a long-tailed distribution, the blue line that of a short-tailed distribution, and the black line the standard bell curve]

Visual from http://whatilearned.wikia.com/wiki/File:Kurtosis.jpg
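The moment-based coefficients can be sketched in base R without any extra package. The function names below (`central_moment`, `skew_coef`, `kurt_coef`) and the right-skewed example vector are assumptions made for illustration; the definitions follow the moment ratios $\mu_3 / \mu_2^{3/2}$ and $\mu_4 / \mu_2^2$ with divisor n.

```r
# r-th central moment with divisor n (hypothetical helper)
central_moment <- function(x, r) mean((x - mean(x))^r)

# Moment-based coefficients of skewness and kurtosis
skew_coef <- function(x) central_moment(x, 3) / central_moment(x, 2)^(3 / 2)
kurt_coef <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

# Symmetric hypothetical dataset from the slides
x_sym <- c(1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7)

# Illustrative dataset with one large value stretching the right tail
x_right <- c(1, 1, 2, 2, 3, 10)

print(c(skew_sym = skew_coef(x_sym), kurt_sym = kurt_coef(x_sym)))
print(c(skew_right = skew_coef(x_right), kurt_right = kurt_coef(x_right)))
```

The symmetric dataset gives a skewness of zero and a kurtosis well below 3 (a flat, short-tailed shape), while the single large value in `x_right` produces a positive skewness.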

Example
Table of the gender-wise skewness and kurtosis of weights:
           Female   Male   Entire dataset
Skewness   1.14     0.29   0.40
Kurtosis   5.59     3.15   2.65

Example (contd.)
- Skewness and kurtosis give numeric measures of the information presented in a histogram
- The histogram of the weights of females is highly stretched to the right, leading to a positive and high skewness of 1.14
- The stretch of the histogram for the weights of the entire dataset is moderate and much less than that for the weights of females. This is reflected in the lower skewness of 0.40
- The weights of males are stretched almost equally on both sides of the centre, giving a skewness of 0.29, which is close to zero
- Skewness and kurtosis shed light on important characteristics such as symmetry and peakedness
- They give additional information about the distribution of the data, beyond the measures of central tendency and dispersion

Point summary
A very useful and practical application of measures of central tendency and dispersion.

5-point summary
- Minimum
- 1st quartile
- Median
- 3rd quartile
- Maximum

6-point summary
- Minimum
- 1st quartile
- Median
- Mean
- 3rd quartile
- Maximum

Gives an idea about the extreme values, the values within which the middle 50% of the values lie, and the centrality of the data.

6-point summary of Weights in the body measurement data:

Min.   1st Qu.   Median   Mean    3rd Qu.   Max.
42     58.4      68.2     69.15   78.85     116.4
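In R, `summary()` returns the 6-point summary of a numeric vector directly, and `quantile()` gives the 5-point version. The sketch below uses the hypothetical dataset (1,1,2,2,3,3,4,5,5,6,6,7,7) rather than the weight data; the object names are illustrative, and the quartiles follow R's default quantile method.

```r
# Hypothetical dataset from the slides
x <- c(1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7)

# 6-point summary: Min., 1st Qu., Median, Mean, 3rd Qu., Max.
six_point <- summary(x)

# 5-point summary: the same values without the mean
five_point <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))

print(six_point)
print(five_point)
```

Base R also provides `fivenum()`, which uses hinges instead of quantiles and can differ slightly from `quantile()` for some sample sizes.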

R code for each measure (x stands for the variable name):

Measure                                  R code
Minimum                                  min(x)
Maximum                                  max(x)
Range                                    diff(range(x))    (range(x) alone returns the minimum and maximum)
Inter-quartile range                     IQR(x)
Mean absolute deviation about mean       mean(abs(x - mean(x)))
Mean absolute deviation about median     mean(abs(x - median(x)))
Median absolute deviation about median   median(abs(x - median(x)))
Variance                                 var(x)
Standard deviation                       sd(x)
Coefficient of range                     (max(x) - min(x)) / (max(x) + min(x))
Coefficient of variation                 library(raster); cv(x)
Standardization of a variable            function(x) {(x - mean(x)) / sqrt(var(x))}
Skewness and kurtosis                    library(moments); skewness(x); kurtosis(x)
6-point summary                          summary(x)

Thank you
