
Uploaded by ASClabISB

This tutorial on Measures of Dispersion is prepared by the Applied Statistics and Computing lab at the Indian School of Business, Hyderabad. It is a part of the module on Descriptive Statistics, prepared by us.


Learning goals

- To understand the need for studying dispersion
- To understand the idea behind measures of dispersion
- To study different measures of dispersion
- Additional topics
  - Standardization of a variable
  - Skewness and Kurtosis
  - Five-point summary

Applied Statistics and Computing Lab


Two patients are admitted to the Intensive Care Unit of a hospital. The night before their operations, the doctor makes a last visit at 9pm and records a blood pressure of 110/80 for Patient 1 and 120/70 for Patient 2. Although both readings are normal, the doctor asks the nurse, as a precaution, to check their blood pressure every 2 hours. At 7.30 the next morning, the nurse reports that the average blood pressure of both patients was normal, at 120/80. The chart of their actual blood pressures was:

Time    Patient 1    Patient 2
11pm    120/80       110/60
1am     100/80       100/60
3am     100/60       100/70
5am     130/80       130/90
7am     150/100      160/120
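The nurse's report can be checked directly from the chart. The following is a minimal sketch in R, with the readings transcribed by hand from the chart above; the variable names are illustrative only. It suggests that both patients do average 120/80, while the spread of the readings around that average differs.

```r
# Systolic and diastolic readings transcribed from the chart (11pm to 7am).
p1_sys <- c(120, 100, 100, 130, 150)
p1_dia <- c(80, 80, 60, 80, 100)
p2_sys <- c(110, 100, 100, 130, 160)
p2_dia <- c(60, 60, 70, 90, 120)

# Averages: identical for the two patients.
means <- c(p1_sys = mean(p1_sys), p1_dia = mean(p1_dia),
           p2_sys = mean(p2_sys), p2_dia = mean(p2_dia))

# Spread (largest minus smallest reading): not identical.
spread <- c(p1_sys = diff(range(p1_sys)), p1_dia = diff(range(p1_dia)),
            p2_sys = diff(range(p2_sys)), p2_dia = diff(range(p2_dia)))

print(means)
print(spread)
```

The average alone hides the fact that Patient 2's readings swing more widely than Patient 1's.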


- What if the doctor decides to operate on the patients without looking at the blood pressure chart?
- What if someone decides to visit a tourist destination next week based only on last week's average temperature, as given in our data?
- What if I am interested in working with company X (which is visiting our campus) and I am given only the mean salary of its employees?
- In an extreme case, the same measure of central tendency could also describe a dataset in which every value is the same constant

[Figure: Mode and Median marked on an axis with ticks at 10, 11, 12 and 13]

Examples

- Variability in temperature through the week
- Scatter of the horsepower capacities among the cars available
- Spread of the prices at which varieties of a single product (say, rice varieties) are available
- Variability in returns on investments

A measure of dispersion:

- Helps determine the reliability of the measure of central tendency
- Facilitates comparison of two sets of data
- Is useful for building further statistical measures

Desired properties

- A good measure should not be strongly affected if the data change slightly
- A good measure should be representative of the majority of the data
- A good measure should allow us to declare an interval within which most of the values lie, with a certain degree of confidence

Dataset

- Body measurements on 507 individuals: 247 men and 260 women
- Primarily in their 20s and 30s, with some exceptions
- All individuals exercise several hours a week
- From the 28 variables present in this dataset, we consider the variables Gender (1 = Male, 0 = Female) and Weight (in Kg.)


Data source: Measurements collected by authors Grete Heinz and Louis J. Peterson for their study

Dataset (contd.)

                            Female    Male     Overall
Min. weight (in Kgs.)       42        53.9     42
Max. weight (in Kgs.)       105.2     116.4    116.4
Mean weight (in Kgs.)       60.6      78.14    69.15
Median weight (in Kgs.)     59        77.3     68.2


Evaluating dispersion

Consider distance from a central tendency (Measures based on all the values)


- These measures consider and report only the boundaries of the data
- They try to capture how far the values of the variable reach
- The spread of the data is not considered relative to any central tendency
- These measures overlook the patterns of values within the boundaries


(Min, Max)

ADVANTAGES:
- Useful when a range of tolerance exists, i.e. if values beyond a certain limit are harmful or unacceptable
- Easy to compute and understand

DISADVANTAGES:
- Ignores any pattern in the data
- Ignores most of the data

Range

ADVANTAGES:
- Easy comparison of variability across datasets
- Easy to compute and understand

DISADVANTAGES:
- Ignores any pattern in the data
- Ignores most of the data

Inter-quartile range

ADVANTAGES:
- Highlights the middle portion of the distribution of values
- Easy to understand

DISADVANTAGES:
- More difficult to compute than Min-max and range
- Ignores irregularities on the extremes
- Ignores 25% of the data on each side

                               Female
(Min. weight, Max. weight)     (42, 105.2)
Weight range                   63.2
Weight inter-quartile range    11.1
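The boundary-based measures can be sketched in a few lines of R. The weights below are hypothetical values made up for illustration (they are not the body measurement data), and `IQR()` uses R's default quantile rule, which may differ slightly from a hand calculation that uses another convention.

```r
# Hypothetical weights in kg, invented for illustration only.
w <- c(48, 52, 55, 57, 59, 61, 64, 70, 95)

w_min   <- min(w)
w_max   <- max(w)
w_range <- w_max - w_min   # same value as diff(range(w))
w_iqr   <- IQR(w)          # 3rd quartile minus 1st quartile

# Drop the single extreme value (95) and recompute both measures.
w_trim     <- w[w < 90]
trim_range <- diff(range(w_trim))
trim_iqr   <- IQR(w_trim)

print(c(range = w_range, iqr = w_iqr))
print(c(range = trim_range, iqr = trim_iqr))
```

In this sketch, removing one extreme value changes the range far more than it changes the inter-quartile range, which is the sense in which the inter-quartile range "ignores irregularities on the extremes".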

Evaluating dispersion

Consider distance from a central tendency (Measures based on all the values)


- Consider the deviations of the values from the measure of central tendency
- What if we simply sum all these deviations?
- Consider a hypothetical dataset (1,1,2,2,3,3,4,5,5,6,6,7,7), for which Mean = Median = 4
- Consider $\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{13} (x_i - 4) = 0$
- Taking absolute values or taking squares ensures that we consider only the magnitudes of the deviations
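The cancellation can be checked on the hypothetical dataset from the slide. This is a small sketch in R; the variable names are illustrative.

```r
# Hypothetical dataset from the slide; its mean and median are both 4.
x <- c(1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7)

dev <- x - mean(x)         # signed deviations from the mean

sum_dev <- sum(dev)        # positive and negative deviations cancel
sum_abs <- sum(abs(dev))   # magnitudes only
sum_sq  <- sum(dev^2)      # squared magnitudes

print(c(sum_dev = sum_dev, sum_abs = sum_abs, sum_sq = sum_sq))
```

The plain sum is zero for any dataset when deviations are taken from the mean, so it carries no information about spread; the absolute and squared versions do.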


Absolute deviations

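For a dataset consisting of n observations $x_1, \dots, x_n$ with mean $\bar{x}$ and median $M$:

- Absolute deviations: $|x_i - \bar{x}|$ or $|x_i - M|$
- Mean absolute deviation from mean $= \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|$
- Mean absolute deviation from median $= \frac{1}{n} \sum_{i=1}^{n} |x_i - M|$
- Median absolute deviation from median $= \text{median}(|x_i - M|)$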

Female weights Male weights 8.58 8.57 7.2


Mean absolute deviation from mean Mean absolute deviation from median Median absolute deviation from median
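The three absolute-deviation measures can be sketched in R as follows. The data vector is hypothetical and chosen so that the mean and median differ. Note that base R's `mad()` multiplies by 1.4826 by default, so `constant = 1` is needed to get the plain median absolute deviation.

```r
# Hypothetical data with one large value, so that mean (6) and median (4) differ.
x <- c(2, 3, 3, 4, 5, 6, 19)

mad_from_mean   <- mean(abs(x - mean(x)))       # mean absolute deviation from mean
mad_from_median <- mean(abs(x - median(x)))     # mean absolute deviation from median
med_abs_dev     <- median(abs(x - median(x)))   # median absolute deviation from median

# Base R equivalent of the last measure (default constant is 1.4826, not 1).
med_abs_dev_builtin <- mad(x, constant = 1)

print(c(mad_from_mean, mad_from_median, med_abs_dev))
```

In this example the single large value pulls both mean-based measures up, while the median absolute deviation stays at 1.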


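For a dataset consisting of n observations $x_1, \dots, x_n$ with mean $\bar{x}$,

Variance $= \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$

In order to look at a measure that has the same unit of measurement as the original data, we can take the square root:

Standard deviation $= \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

Variance (Weight, females) = 92.46
Variance (Weight, males) = 110.52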
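A short sketch in R of the variance and standard deviation, using the hypothetical 13-value dataset from the earlier slide. One point worth flagging: R's built-in `var()` and `sd()` divide by n - 1, not by n, so they give slightly larger values than the divide-by-n definition.

```r
# Hypothetical dataset from the earlier slide (mean = 4).
x <- c(1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7)
n <- length(x)

var_n  <- mean((x - mean(x))^2)   # divides by n: 56 / 13
sd_n   <- sqrt(var_n)             # same units as the data

var_r  <- var(x)                  # R's built-in divides by n - 1: 56 / 12
sd_r   <- sd(x)

print(c(var_n = var_n, var_r = var_r, sd_n = sd_n, sd_r = sd_r))
```

The two versions are related by `var_n = var_r * (n - 1) / n`, and the difference becomes negligible for large n such as the 507 individuals in the body measurement data.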


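Coefficient of range: $\frac{\max - \min}{\max + \min}$

- Always lies in [0, 1] for non-negative data
- The higher the coefficient, the broader the range!

Coefficient of range (Weight, females) = 0.43
Coefficient of range (Weight, males) = 0.37

Coefficient of variation: $\frac{\sigma}{\bar{x}} \times 100$

- Computes the variability per unit mean
- Indicates how consistent the data are with respect to their mean
- The higher the coefficient, the more spread out the observations

Coefficient of variation (Weight, males) = 13.45
Coefficient of variation (Weight, females) = 15.87

The values of weights among females are more spread out than those among males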
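As a worked check, both coefficients can be recomputed from the summary figures quoted on the earlier slides (minimum, maximum, mean and variance of weight by gender). The sketch below is in R; the helper names `coef_range` and `coef_var` are illustrative and not part of any package.

```r
# Summary figures quoted on the earlier slides (weights in kg).
female <- list(min = 42,   max = 105.2, mean = 60.6,  var = 92.46)
male   <- list(min = 53.9, max = 116.4, mean = 78.14, var = 110.52)

# Illustrative helpers (hypothetical names).
coef_range <- function(s) (s$max - s$min) / (s$max + s$min)
coef_var   <- function(s) 100 * sqrt(s$var) / s$mean

cr <- round(c(female = coef_range(female), male = coef_range(male)), 2)
cv <- round(c(female = coef_var(female),   male = coef_var(male)),   2)

print(cr)
print(cv)
```

Although the male weights have the larger variance, dividing by the larger male mean gives them the smaller coefficient of variation, which is why the female weights are described as more spread out.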


All the measures that consider distance from central tendency are based on all the values!

- Absolute deviations are less affected by extreme values than squared deviations
- Absolute deviations are easy to understand and interpret
- Median absolute deviation is the least affected by slight changes in the data, across all measures of dispersion
- Variance and standard deviation are the most popular measures of dispersion, due to their usefulness in building further statistical measures and because they are algebraically amenable
- Both play an important part in building and evaluating further statistical measures
- Standard deviation is easier to understand than variance, as it is in the same units as the original data
- Algebraic manipulation of measures based on absolute deviations is difficult
- Variance is the most affected by extreme values, as it is based on squared deviations
- Standard deviation is not very easy to compute
- Standard deviation cannot be calculated for data with open-ended classes

- Coefficients are free of units and therefore facilitate comparison
- Coefficients are useful even when two variables are measured in two different units


Standardization

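Standardized variable of $X$: $Z = \frac{X - \bar{X}}{\sigma_X}$, where $\bar{X}$ is the mean and $\sigma_X$ the standard deviation of $X$

- Mean of standardized variable = 0
- Variance of standardized variable = 1
- Standardized variables are free of units
- Therefore measures of variation of standardized variables are comparable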
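A minimal sketch of standardization in R, assuming the same form as the function on the R-code slide (subtract the mean, divide by the standard deviation). The weights are hypothetical; they are expressed once in kilograms and once in grams to illustrate that the standardized values do not depend on the unit.

```r
# Standardize: subtract the mean, divide by the standard deviation.
# sd() uses the n - 1 denominator, matching sqrt(var(x)).
standardize <- function(x) (x - mean(x)) / sd(x)

# Hypothetical weights, in kg and in grams.
w_kg <- c(48, 52, 55, 57, 59, 61, 64, 70, 95)
w_g  <- w_kg * 1000

z_kg <- standardize(w_kg)
z_g  <- standardize(w_g)

print(round(z_kg, 3))
print(c(mean = mean(z_kg), var = var(z_kg)))
```

Base R's `scale()` performs the same centering and scaling; `as.numeric(scale(w_kg))` should agree with `z_kg`.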


Example

- How is the weight of a new-born affected by whether the mother smokes or not? Further, does smoking affect the perinatal mortality rate, which varies for different birth weights?
- Yerushalmy J. found in his 1971 paper that, although low birth weight is associated with an increase in the number of babies who die shortly after birth, the babies of smokers tended to have much lower death rates than the babies of nonsmokers.* In this study, he compared perinatal death rates by grouping babies by birth weight
- In 1986 and 1993, Wilcox & Russell and Wilcox (respectively) strongly recommended that babies should be grouped based on their relative (or standardized) birth weight, rather than on their absolute weights
- What happened then?

Table in Yerushalmy J. (1971)**

(Weights measured in grams)


* and ** taken from Deborah Nolan and Terry Speed's Stat Labs: Mathematical Statistics Through Applications

Example (contd.)


Graphs taken from Deborah Nolan and Terry Speed's Stat Labs: Mathematical Statistics Through Applications

Further to deviations

- Variance $= \sigma^2$ is the sum of squares of deviations from the mean divided by n, or the expected value of the squared deviation of $X$ from its mean
- Expected values of higher powers of deviations from the mean give additional information about the distribution of the data
- The expected value of any power of the deviations from the mean of a variable $X$ (say the $r$th power) is called the $r$th central moment of that variable:
  $\mu_r(X) = E\left[(X - E(X))^r\right] = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^r$
- Central moments depict the spread and shape of the data
- Variance is the 2nd central moment
- Measures using the 3rd and 4th central moments are useful to understand the shape of the distribution
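The definition of a central moment translates almost directly into R. The sketch below uses a hypothetical helper name, `central_moment`, and the 13-value dataset from the earlier slide; it divides by n, in line with the definition of variance given above.

```r
# r-th central moment: average of the r-th powers of deviations from the mean.
# (Hypothetical helper name; divides by n.)
central_moment <- function(x, r) mean((x - mean(x))^r)

# Hypothetical dataset from the earlier slide (mean = 4).
x <- c(1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7)

m1 <- central_moment(x, 1)   # always 0: deviations cancel
m2 <- central_moment(x, 2)   # variance with divisor n: 56 / 13
m3 <- central_moment(x, 3)   # 0 here, because the data are symmetric about 4
m4 <- central_moment(x, 4)   # 392 / 13

print(c(m1 = m1, m2 = m2, m3 = m3, m4 = m4))
```

The second moment reproduces the variance, while the third and fourth feed the shape measures discussed next.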

Skewness

- Skewness is a measure of symmetry (or the lack of it) in a dataset
- A distribution is right-skewed or positively skewed if it stretches asymmetrically to the right
- It is left-skewed or negatively skewed if the asymmetric stretch is on the left
- Measuring skewness using moments:
  Coefficient of skewness $= \frac{\mu_3}{\mu_2^{3/2}} = \frac{\mu_3}{\sigma^3}$, where $\mu_r$ is the $r$th central moment and $\sigma$ the standard deviation
- Important to note that if a distribution is perfectly symmetric, $\mu_3 = 0$
- The sign of the coefficient = the sign of $\mu_3$
- A coefficient of skewness closer to zero indicates a highly symmetric distribution
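A moment-based coefficient of skewness can be sketched in base R as below. The helper names and the three data vectors are hypothetical, chosen to show a symmetric, a right-skewed and a left-skewed case. The `skewness()` function of the `moments` package is, to the best of my understanding, based on the same ratio of moments, but that is an assumption worth verifying against its documentation.

```r
# Hypothetical helpers: r-th central moment and moment-based skewness.
central_moment <- function(x, r) mean((x - mean(x))^r)
skew <- function(x) central_moment(x, 3) / central_moment(x, 2)^(3 / 2)

sym_data   <- c(1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7)  # symmetric about 4
right_data <- c(1, 2, 2, 3, 3, 3, 4, 10)                # long tail on the right
left_data  <- -right_data                               # mirror image

s_sym   <- skew(sym_data)
s_right <- skew(right_data)
s_left  <- skew(left_data)

print(c(symmetric = s_sym, right = s_right, left = s_left))
```

Mirroring the data flips the sign of the coefficient but not its magnitude, which matches the statement that the sign of the coefficient follows the sign of the 3rd central moment.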


Visuals from Aczel A., Sounderpandian J. Complete business statistics

Kurtosis

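- Kurtosis is a measure of the peakedness of a dataset
- The reference value for kurtosis is 3 (the value for the normal distribution); such a curve is called the Mesokurtic curve
- A value larger than 3 indicates a distribution that is more sharply peaked, with longer (heavier) tails; such a curve is termed the Leptokurtic curve
- A value smaller than 3 indicates a flatter distribution with shorter (lighter) tails; such a curve is called the Platykurtic curve
- Measuring kurtosis using moments:
  Kurtosis $= \frac{\mu_4}{\mu_2^2} = \frac{\mu_4}{\sigma^4}$, where $\mu_r$ is the $r$th central moment and $\sigma$ the standard deviation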
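The moment-based kurtosis can be sketched in base R in the same way. The helper names and both data vectors are hypothetical; the second vector is deliberately built with two far-out values so that its tails are long.

```r
# Hypothetical helpers: r-th central moment and moment-based kurtosis.
central_moment <- function(x, r) mean((x - mean(x))^r)
kurt <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

flat_data  <- c(1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7)   # evenly spread values
tails_data <- c(-10, -1, -1, 0, 0, 0, 0, 1, 1, 10)       # most values near 0, two far out

k_flat  <- kurt(flat_data)    # below 3
k_tails <- kurt(tails_data)   # above 3

print(c(flat = k_flat, long_tails = k_tails))
```

The evenly spread data give a value well below 3, while the data concentrated near the centre with two distant values give a value above 3.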


- The red line represents a frequency curve of a long-tailed distribution
- The blue line represents a frequency curve of a short-tailed distribution
- The black line is the standard bell curve

Example

Table of the gender-wise skewness and kurtosis of weights:

            Female    Male    Entire dataset
Skewness    1.14      0.29    0.40
Kurtosis    5.59      3.15    2.65


Example (contd.)

- We see that skewness and kurtosis capture, as numeric measures, the information presented in a histogram
- The histogram of weights of females is highly stretched on the right, leading to a positive and high skewness measure of 1.14
- The stretch of the histogram for weights of the entire dataset is moderate and much less than that for weights of females. This is reflected in the lower skewness of 0.40
- The weights of males are stretched almost equally on both sides of the centre, giving a skewness measure as close to zero as 0.29
- Skewness and kurtosis shed light on important characteristics such as symmetry and peakedness
- They give information about the distribution of the data beyond what the measures of central tendency and dispersion provide


Point summary

- A very useful and practical application of measures of central tendency and dispersion
- 5-point summary: Minimum, 1st quartile, Median, 3rd quartile, Maximum
- 6-point summary: Minimum, 1st quartile, Median, Mean, 3rd quartile, Maximum
- Gives an idea about the extreme values, the values within which the middle 50% of the values lie, and also the centrality of the data

6-point summary of Weights in the body measurement data:

Min.    1st Qu.    Median    Mean     3rd Qu.    Max.
42      58.4       68.2      69.15    78.85      116.4
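Both summaries are available in base R. The sketch below uses the hypothetical 13-value dataset from the earlier slide rather than the body measurement data; quartiles follow R's default quantile rule, so other conventions may give slightly different quartile values on other data.

```r
# Hypothetical dataset from the earlier slide.
x <- c(1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7)

# 5-point summary: minimum, quartiles, maximum.
five_point <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))

# 6-point summary: summary() adds the mean between the median and 3rd quartile.
six_point <- summary(x)

print(five_point)
print(six_point)
```

The layout printed by `summary()` is the same as that of the 6-point summary of weights shown above.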


Measure                                   R-code (x stands for the variable name)
Minimum                                   min(x)
Maximum                                   max(x)
Range                                     diff(range(x))
Inter-quartile range                      IQR(x)
Mean absolute deviation about mean        mean(abs(x - mean(x)))
Mean absolute deviation about median      mean(abs(x - median(x)))
Median absolute deviation about median    median(abs(x - median(x)))
Variance                                  var(x)
Standard deviation                        sd(x)
Coefficient of range                      (max(x) - min(x)) / (max(x) + min(x))
Coefficient of variation                  library(raster); cv(x)
Standardization of a variable             function(x) {(x - mean(x)) / sqrt(var(x))}
Skewness and Kurtosis                     library(moments); skewness(x); kurtosis(x)
6-point summary                           summary(x)

Thank you
