
Yekatit 12 Hospital Medical College

Department of Public Health


Lecture 2: Summarizing Data
For Weekend MPH Students
BY
Dube Jara (BSc in PH, MPHE, PhD Candidate)
Assistant Professor of Epidemiology
Email: jaradube@gmail.com

February, 2024
Addis Ababa, Ethiopia
1 Dube Jara (Assistant Professor &PhD Candidate), For MPH
Content
 Introduction
 Numerical Summary Measures
 Measures of Central Tendency
 Measures of Dispersion



Introduction
 Data collected by different methods and presented in tables and figures need to be described in some concise way
 The number of observations in a sample may be large
 Trying to look at all the data at once makes it easy to lose track of the overall picture
 These problems can be overcome by summarizing the data numerically



Introduction…
 Measures of Central Tendency
 Mean
 Median
 Mode
 Measures of Variability
 Range
 Mean deviation
 Variance
 Standard Deviation
 Skewness
 Positive skew
 Normal distribution
 Negative skew



Measures of Central Tendency

 The tendency of statistical data to concentrate at certain values is called “central tendency”
 The various methods of determining the actual value at which the data tend to concentrate are called measures of central tendency (MCT) or averages
 Hence, an average is a value which tends to sum up or describe the mass of the data
 The objective of calculating an MCT is to determine a single figure which may be used to represent the whole dataset
 In that sense it is an even more compact description of the statistical data than the frequency distribution
 Since an MCT represents the entire data, it facilitates comparison within one group or between groups of data


Characteristics of a good MCT
 An MCT is good or satisfactory if it possesses the following characteristics:
 It should be based on all the observations
 It should not be affected by the extreme values
 It should be as close to the maximum number of values as possible
 It should have a definite value
 It should not be subjected to complicated and tedious calculations
 It should be capable of further algebraic treatment
 It should be stable with regard to sampling


Most common measures…

 The most common measures of central tendency include:


 Arithmetic Mean
 Median
 Mode
 Others



Arithmetic Mean

A. Ungrouped Data (simple mean)


 The arithmetic mean is the "average" of the data set and by far the most widely used measure of central location
 The arithmetic mean is, in general, a very natural measure of central location
 It is the sum of all the observations divided by the total number of observations


Arithmetic Mean…

Example: Age of ten students (years)
18, 22, 19, 20, 21, 25, 18, 24, 23, 19
Mean = (18 + 22 + 19 + 20 + 21 + 25 + 18 + 24 + 23 + 19) / 10 = 209 / 10 = 20.9 years

Example

The heart rates for n=10 patients were as follows (beats per minute):
167, 120, 150, 125, 150, 140, 40, 136, 120, 150
What is the arithmetic mean for the heart rate of these patients?
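
The question above can be checked with a few lines of Python. This is only a minimal sketch; the function name `arithmetic_mean` is illustrative and not part of the lecture.

```python
# Heart rates (beats per minute) of the n = 10 patients listed above
heart_rates = [167, 120, 150, 125, 150, 140, 40, 136, 120, 150]

def arithmetic_mean(values):
    """Sum of all observations divided by the number of observations."""
    return sum(values) / len(values)

mean_hr = arithmetic_mean(heart_rates)
print(round(mean_hr, 1))  # 129.8
```

Note that the single low value of 40 pulls the mean down, which illustrates why the mean is sensitive to extreme observations.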



B. Mean for Grouped data

In calculating the mean from grouped data, we assume that all values falling into a particular class interval are located at the mid-point of that class interval. It is calculated as follows:

$$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$$

where:
k is the number of class intervals
m_i is the mid-point of the ith class interval
f_i is the frequency of the ith class interval
Example. Compute the mean age of 169 subjects from the grouped data.

Class interval   Mid-point (mi)   Frequency (fi)   mi × fi
10-19            14.5             4                58.0
20-29            24.5             66               1617.0
30-39            34.5             47               1621.5
40-49            44.5             36               1602.0
50-59            54.5             12               654.0
60-69            64.5             4                258.0
Total            -                169              5810.5

Mean = 5810.5/169 = 34.38 years
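
The grouped-mean calculation can be reproduced with a short Python sketch. The helper name `grouped_mean` is an assumption made for illustration; the mid-points and frequencies are those of the table.

```python
# Mid-points and frequencies of the six age classes in the table
midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
frequencies = [4, 66, 47, 36, 12, 4]

def grouped_mean(mids, freqs):
    """Sum of (mid-point x frequency) divided by the total frequency."""
    total = sum(m * f for m, f in zip(mids, freqs))
    return total / sum(freqs)

print(sum(frequencies))                                # 169
print(round(grouped_mean(midpoints, frequencies), 2))  # 34.38
```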



Skewness

 Skewness measures the degree of asymmetry displayed by the data
 If extremely low or extremely high observations are present in a distribution, then the mean tends to shift towards those scores
 Skewness is computed by first adding together the cubed deviations from the mean and then dividing by the product of the cubed standard deviation and the number of observations:

$$\text{Skewness} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n\,s^3}$$


Skewness…
 If skewness equals zero, the histogram is symmetric
about the mean.
 Based on the type of skewness, distributions can be:
a. Negatively skewed distribution: occurs when majority of scores are at the
right end of the curve and a few small scores are scattered at the left end.
b. Positively skewed distribution: Occurs when the majority of scores are at the
left end of the curve and a few extreme large scores are scattered at the right end.
c. Symmetrical distribution: It is neither positively nor negatively skewed. A
curve is symmetrical if one half of the curve is the mirror image of the other half.
 In unimodal (one-peak) symmetrical distributions, the mean, median and
mode are identical.



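Example
 Data: 14, 89, 93, 95, 96
 Mean = 387/5 = 77.4
 The median is 93
 The negative skew is reflected in the outlying low value of 14, which pulls the mean below the median
 Data: 2, 3, 5, 8, 11, 16, 143
 Mean = 188/7 ≈ 26.9
 The outlying high value of 143 pulls the mean far above the median of 8 (positive skew)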
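
The sign of the skewness for the two small data sets in this example can be checked with the cubed-deviation formula described earlier. This is a sketch under one assumption: the standard deviation is computed with divisor n (the slides do not state which divisor is intended). The function name `skewness` is illustrative.

```python
from math import sqrt

def skewness(values):
    """Sum of cubed deviations divided by n times the cubed SD (divisor n assumed)."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((x - mean) ** 2 for x in values) / n)
    return sum((x - mean) ** 3 for x in values) / (n * sd ** 3)

low_outlier = [14, 89, 93, 95, 96]         # one extreme low value
high_outlier = [2, 3, 5, 8, 11, 16, 143]   # one extreme high value

print(skewness(low_outlier) < 0)   # True: negatively skewed
print(skewness(high_outlier) > 0)  # True: positively skewed
```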



Kurtosis
 Kurtosis measures how peaked the histogram is
 It characterises the relative peakedness or flatness of a distribution compared to the normal distribution
 Its definition is similar to that for skewness, with the exception that the fourth power is used instead of the third:

$$\text{Kurtosis} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n\,s^4}$$

 Data with a high degree of peakedness are said to be leptokurtic, and have values of kurtosis over 3.0
 Flat histograms are platykurtic, and have kurtosis values less than 3.0
The kurtosis of the commuting times is equal to 6.43, and hence is relatively peaked.
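
A minimal Python sketch of the fourth-power formula follows. It assumes the standard deviation is computed with divisor n, which the slides do not specify, and the function name `kurtosis` is illustrative. The heart-rate values are reused from the earlier arithmetic-mean example purely as sample input.

```python
from math import sqrt

def kurtosis(values):
    """Sum of fourth-power deviations divided by n times SD^4 (divisor n assumed)."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((x - mean) ** 2 for x in values) / n)
    return sum((x - mean) ** 4 for x in values) / (n * sd ** 4)

heart_rates = [167, 120, 150, 125, 150, 140, 40, 136, 120, 150]
k = kurtosis(heart_rates)

# Classify against the normal-curve reference value of 3
label = "leptokurtic" if k > 3 else "platykurtic" if k < 3 else "mesokurtic"
print(round(k, 2), label)
```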
[Figure: Kurtosis. Frequency curves plotted against value: leptokurtic (k > 3), mesokurtic (k = 3) and platykurtic (k < 3)]
A few words about the normal curve

 Skewness = 0
 Kurtosis = 3

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2 / (2\sigma^2)}$$

Characteristics of mean

 The value of the arithmetic mean is determined by every item in the series
 It is greatly affected by extreme values
 The sum of the deviations about it is zero
 The sum of the squares of deviations from the arithmetic mean is less than that computed from any other point
 For a given set of data there is one and only one arithmetic mean (uniqueness)


Advantages and disadvantages
Advantages
1. It is based on all values given in the distribution.
2. It is most easily understood.
3. It is most amenable to algebraic treatment.
Disadvantages
1. It may be greatly affected by extreme items, and its usefulness as a “summary of the whole” may be considerably reduced.
2. When the distribution has open-end classes, its computation would be based on assumption, and therefore may not be valid.

Geometric mean
 It is obtained by taking the nth root of the product of “n” values, i.e. if the values of the observations are denoted by x1, x2, …, xn then

$$GM = \sqrt[n]{x_1 x_2 \cdots x_n}$$

 The geometric mean is preferable to the arithmetic mean if the series of observations contains one or more unusually large values
 The above method of calculating the geometric mean is satisfactory only if there are a small number of items


Geometric mean…
 But if n is a large number, computing the nth root of the product of these values by simple arithmetic is tedious work
 To facilitate the computation of the geometric mean we make use of logarithms
 The above formula when reduced to its logarithmic form will be:

$$GM = \sqrt[n]{x_1 x_2 \cdots x_n} = (x_1 x_2 \cdots x_n)^{1/n}$$

$$\log GM = \log (x_1 x_2 \cdots x_n)^{1/n} = \frac{1}{n} \log (x_1 x_2 \cdots x_n) = \frac{1}{n} \left( \log x_1 + \log x_2 + \cdots + \log x_n \right) = \frac{\sum \log x_i}{n}$$


Geometric mean…
 The logarithm of the geometric mean is equal to the arithmetic
mean of the logarithms of individual values.
 The actual process involves obtaining logarithm of each value,
adding them and dividing the sum by the number of
observations.
 The quotient so obtained is then looked up in the tables of anti-logarithms, which will give us the geometric mean


Geometric mean…

Example: The geometric mean may be calculated for the following parasite counts per 100 fields of thick films.

7 8 3 14 2 1 440 15 52 6 2 1 1 25
12 6 9 2 1 6 7 3 4 70 20 200 2 50
21 15 10 120 8 4 70 3 1 103 20 90 1 237

$$GM = \sqrt[42]{7 \times 8 \times 3 \times \cdots \times 1 \times 237}$$

log GM = 1/42 (log 7 + log 8 + log 3 + … + log 237)
= 1/42 (0.8451 + 0.9031 + 0.4771 + … + 2.3747)
= 1/42 (41.9985)
= 0.99996 ≈ 1.0000


Geometric mean…
 The anti-log of this value is 9.9992 ≈ 10, and this is the required geometric mean
 By contrast, the arithmetic mean, which is inflated by the high values of 440, 237 and 200, is 39.8 ≈ 40
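
The logarithmic route to the geometric mean can be sketched in Python for the 42 parasite counts of this example. The function name `geometric_mean` is illustrative; base-10 logarithms are used to mirror the hand calculation, and full floating-point logs may differ slightly from four-figure table values.

```python
from math import log10

# Parasite counts per 100 fields of thick film (42 observations)
counts = [
    7, 8, 3, 14, 2, 1, 440, 15, 52, 6, 2, 1, 1, 25,
    12, 6, 9, 2, 1, 6, 7, 3, 4, 70, 20, 200, 2, 50,
    21, 15, 10, 120, 8, 4, 70, 3, 1, 103, 20, 90, 1, 237,
]

def geometric_mean(values):
    """Anti-log of the arithmetic mean of the logs (values must be positive)."""
    mean_log = sum(log10(v) for v in values) / len(values)
    return 10 ** mean_log

gm = geometric_mean(counts)
am = sum(counts) / len(counts)
print(round(gm), round(am, 1))  # geometric mean about 10, arithmetic mean 39.8
```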


Characteristics of geometric Mean
1. It is a calculated value and depends upon the size of all the items.
2. It gives less importance to extreme items than does the arithmetic mean.
3. For any series of positive items it is never larger than the arithmetic mean (the two are equal only when all items are identical).
4. It exists ordinarily only for positive values.


Advantages and disadvantages
Advantages:
1. Since it is less affected by extremes, it is a more preferable average than the arithmetic mean
2. It is capable of algebraic treatment
3. It is based on all values given in the distribution

Disadvantages:
1. Its computation is relatively difficult
2. It cannot be determined if there is any negative value in the distribution, or where one of the items has a zero value


Other means
 Harmonic mean
 Weighted mean (WM)
 Trimmed mean


Median
Ungrouped data
 The median is the value which divides the data set into two equal parts
 If the number of values is odd, the median will be the middle value when all values are arranged in order of magnitude
 When the number of observations is even, there is no single middle value but two middle observations


Median…
 In this case the median is the mean of these two middle observations, when all observations have been arranged in the order of their magnitude
 The principal strength of the sample median is that it is insensitive to very large or very small values
 The principal weakness of the sample median is that it is determined mainly by the middle points in a sample and is less sensitive to the actual numerical values of the remaining data points
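
The odd and even cases of the ungrouped median can be sketched as follows. The function name `median` is illustrative (Python's standard library also provides `statistics.median`); the two data sets are reused from earlier examples in this lecture.

```python
def median(values):
    """Middle value of the ordered data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([14, 89, 93, 95, 96]))                      # 93 (odd n)
print(median([18, 22, 19, 20, 21, 25, 18, 24, 23, 19]))  # 20.5 (even n)
```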
Grouped data
 In calculating the median from grouped data, we assume that the values within a class interval are evenly distributed through the interval
 The first step is to locate the class interval in which the median is located, using the following procedure:
 Find n/2 and locate the first class interval whose cumulative frequency is at least n/2
 Then, use the following formula:

$$\tilde{x} = L_m + \left( \frac{\frac{n}{2} - F_c}{f_m} \right) W$$

where,
Lm = lower true class boundary of the interval containing the median
Fc = cumulative frequency of the interval just before (listed above) the median class interval
fm = frequency of the interval containing the median
W = class interval width
n = total number of observations


Example. Compute the median age of 169 subjects from the grouped data.

n/2 = 169/2 = 84.5

Class interval   Mid-point (mi)   Frequency (fi)   Cum. freq
10-19            14.5             4                4
20-29            24.5             66               70
30-39            34.5             47               117
40-49            44.5             36               153
50-59            54.5             12               165
60-69            64.5             4                169
Total                             169


Example…
 n/2 = 84.5, which falls in the 3rd class interval (30-39)
 Lower true boundary = 29.5, upper true boundary = 39.5, so W = 10
 Frequency of the class, fm = 47
 Cumulative frequency of the interval just before the median class interval, Fc = 70
 (n/2 – Fc) = 84.5 - 70 = 14.5
 Median = 29.5 + (14.5/47) × 10 = 32.59 ≈ 33 years
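
The grouped-median procedure can be sketched in Python as below. The function name `grouped_median` and its argument layout are assumptions for illustration; the lower true class boundaries (9.5, 19.5, ...) follow from the class limits in the table.

```python
def grouped_median(lower_boundaries, frequencies, width):
    """Median for grouped data: Lm + ((n/2 - Fc) / fm) * W."""
    n = sum(frequencies)
    half = n / 2
    cumulative = 0  # Fc: cumulative frequency before the current class
    for lower, freq in zip(lower_boundaries, frequencies):
        if cumulative + freq >= half:
            # First class whose cumulative frequency reaches n/2
            return lower + (half - cumulative) / freq * width
        cumulative += freq
    raise ValueError("frequency table is empty")

lower_boundaries = [9.5, 19.5, 29.5, 39.5, 49.5, 59.5]
frequencies = [4, 66, 47, 36, 12, 4]

print(round(grouped_median(lower_boundaries, frequencies, 10), 2))  # 32.59
```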


Characteristics of Median

 It is an average of position
 It is affected more by the number of items than by extreme values
 There is only one median for a given set of data (uniqueness)


Advantages & Disadvantages
Advantages
 It is easily calculated and is not much disturbed by extreme values
 It is more typical of the series
 The median may be located even when the data are incomplete, e.g. when the class intervals are irregular and the final classes have open ends
Disadvantages
 It is determined mainly by the middle points and is less sensitive to the remaining data points (weakness)
 It is not as generally familiar as the arithmetic mean

Mode

 The most frequently occurring value among all observations in a set of data
 The modal class is the class interval with the highest frequency in grouped data
 It is not influenced by extreme values
 It is possible to have more than one mode or no mode
 It is not a good summary of the majority of the data
 The mode is not often used in biological or medical data
 Find the modal values for the following data:
 22, 66, 69, 70, 73 (no modal value)
 1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5 kg (modal value = 3.0 kg)

Ungrouped data
 It is a value which occurs most frequently in a set of values
 If all the values are different there is no mode; on the other hand, a set of values may have more than one mode
 Distributions are described by their number of modes:
 Unimodal: A distribution with one mode
 Bimodal: A distribution with two modes
 Trimodal: A distribution with three modes


Examples
Example
 Data are: 1, 2, 3, 4, 4, 4, 4, 5, 5, 6
 Mode is 4 “Unimodal”
Example
 Data are: 1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8
 There are two modes = 2 & 5
 This distribution is said to be “bi-modal”
Example
 Data are: 2.62, 2.75, 2.76, 2.86, 3.05, 3.12
 No mode, since all the values are different
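
The three examples above can be reproduced with a small Python sketch. The helper name `modes` is illustrative; it returns an empty list when every value occurs only once, matching the "no mode" convention used in these slides.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); empty list if all values are different."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([1, 2, 3, 4, 4, 4, 4, 5, 5, 6]))         # [4] (unimodal)
print(modes([1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8]))   # [2, 5] (bimodal)
print(modes([2.62, 2.75, 2.76, 2.86, 3.05, 3.12]))   # [] (no mode)
```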
Grouped data

 To find the mode of grouped data, we usually refer to the modal class, where the modal class is the class interval with the highest frequency
 If a single value for the mode of grouped data must be specified, it is taken as the mid-point of the modal class interval


Characteristics of Mode
 It is an average of position
 It is not affected by extreme values
 It is the most typical value of the distribution
Advantages

 Since it is the most typical value it is the most descriptive average
 Since the mode is usually an “actual value”, it indicates the precise value of an important part of the series


Disadvantages
 Unless the number of items is fairly large and the distribution
reveals a distinct central tendency, the mode has no significance
 It is not capable of mathematical treatment
 In a small number of items the mode may not exist



Choice of MCT
 Which measure of central tendency is best with a given set of data?
 Two factors are important in making this decision:
 The scale of measurement (type of data)
 The shape of the distribution of the observations


The scale of measurement (type of data)

 The mean can be used for discrete and continuous data
 The median is appropriate for discrete and continuous data as well, but can also be used for ordinal data
 The mode can be used for all types of data, but may be especially useful for nominal and ordinal measurements
 For discrete or continuous data, the “modal class” can be used
 The geometric mean is used primarily for observations measured on a logarithmic scale
 The harmonic mean is a suitable MCT when the data pertain to rates and time
 The weighted mean is commonly used in the calculation of the mean for different outcomes

Shape of a Distribution
(a) Symmetric and unimodal distribution: mean, median, and mode should all be approximately the same
[Figure: symmetric unimodal curve with the mean, median and mode at the same central point]


Shape…
(b) Bimodal: mean and median should be about the same, but may take a value that is unlikely to occur; two modes might be best
[Figure: bimodal curve with the two modes marked]


Shape…
(c) Skewed to the right (positively skewed): the mean is sensitive to extreme values, so the median might be more appropriate
[Figure: right-skewed curve with the mode, median and mean marked]


Shape…
(d) Skewed to the left (negatively skewed): same as (c)
[Figure: left-skewed curve with the mode, median and mean marked]


Measures of Dispersion

Other synonymous terms:
– “Measure of Variation”
– “Measure of Spread”
– “Measures of Scatter”


Measures of Dispersion…
 While the mean, median, etc. give useful information about the center of the data, we also need to know how “spread out” the numbers are about the center
 Measures of dispersion quantify the variation or dispersion of a set of data from its central location
 Dispersion refers to the variety exhibited by the values of the data
 The amount may be small when the values are close together
 If all the values are the same, there is no dispersion


Measures of Dispersion…

Consider the following data sets:

Set 1: 60 40 30 50 60 40 70 50 (mean = 50)
Set 2: 50 49 49 51 48 50 53 50 (mean = 50)

 The two data sets given above both have a mean of 50, but set 1 is clearly more “spread out” than set 2. How do we express this numerically?
 The object of measuring this scatter or dispersion is to obtain a single summary figure which adequately shows whether the distribution is compact or spread out
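
One way to see the difference numerically is to compute a couple of spread measures for both sets. This is an illustrative sketch using Python's standard `statistics` module; the range and the sample standard deviation are chosen here only as examples of the measures introduced in the slides that follow.

```python
from statistics import mean, stdev

set1 = [60, 40, 30, 50, 60, 40, 70, 50]
set2 = [50, 49, 49, 51, 48, 50, 53, 50]

for name, data in (("Set 1", set1), ("Set 2", set2)):
    data_range = max(data) - min(data)  # largest minus smallest value
    sd = stdev(data)                    # sample standard deviation (divisor n - 1)
    print(name, "mean:", mean(data), "range:", data_range, "SD:", round(sd, 2))
```

Both sets report the same mean, while the range and standard deviation of set 1 are much larger than those of set 2.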


Measures of Dispersion…

 Measures of dispersion include:


 Range
 Inter-quartile range
 Variance
 Standard deviation
 Coefficient of variation
 Standard error



Range (R)

 The difference between the largest and smallest observations in a sample
 Range = Maximum value – Minimum value
 Example:
 Data values: 5, 9, 12, 16, 23, 34, 37, 42
 Range = 42 - 5 = 37


Properties of range
 It is the simplest crude measure and can be easily
understood
 It takes into account only two values which causes it to
be a poor measure of dispersion
 Very sensitive to extreme observations
 The larger the sample size, the larger the range



Quantiles and percentiles

Percentiles
 Percentiles divide the ordered data into 100 equal parts
 Percentiles are less sensitive to outliers and not greatly affected by the sample size (n)
 Commonly used percentiles:
 P0: The minimum
 P25 (25th percentile): 25% of the sample values are less than or equal to this value; the 1st quartile
 P50: 50% of the sample values are less than or equal to this value; the 2nd quartile
 P75: 75% of the sample values are less than or equal to this value; the 3rd quartile
 P100: The maximum

Quantiles and percentiles…

 Particularly useful quantiles are the quartiles of the distribution
 The quartiles divide the distribution into four equal parts
 The second quartile is the median


Interquartile Range (IQR)
 It is the difference between the third and the first quartiles
 To compute it, we first sort the data in ascending order, then find the data value below which the first quarter of the observations lie (first quartile), and then the value below which three quarters lie (third quartile)
 IQR is the distance (difference) between these quartiles

Interquartile Range (IQR)…

 The IQR is preferable to the range as a measure of dispersion
 It is less prone to distortion by a single large or small value
 That is, outliers in the data have little effect on the interquartile range
 It can also be computed when the distribution has open-end classes


Interquartile Range (IQR)…

 Indicates the spread of the middle 50% of the observations, and is used with the median
IQR = Q3 - Q1
 Example: Suppose the first and third quartiles for weights of girls 12 months of age are 8.8 kg and 10.2 kg, respectively.
IQR = 10.2 kg – 8.8 kg = 1.4 kg
i.e. the middle 50% of the infant girls weigh between 8.8 and 10.2 kg


Interquartile Range (IQR)…

E.g. Given the following data set (age of patients): 18, 59, 24, 42, 21, 23, 24, 32, find the interquartile range.
 Sort the data from lowest to highest
 Find the bottom and the top quarters of the data
 Find the difference (interquartile range) between the two quartiles
18 21 23 24 24 32 42 59
1st quartile = the {(n+1)/4}th observation = (2.25)th observation = 21 + (23 - 21) × 0.25 = 21.5
3rd quartile = the {3(n+1)/4}th observation = (6.75)th observation = 32 + (42 - 32) × 0.75 = 39.5
Hence, IQR = 39.5 - 21.5 = 18
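
The (n+1)-position rule used in this example can be sketched in Python. The function name `quantile_n_plus_1` is illustrative; other software may use different interpolation rules and give slightly different quartiles.

```python
def quantile_n_plus_1(values, p):
    """Quantile at the p*(n+1)-th ordered observation, with linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    position = p * (n + 1)   # 1-based position, e.g. 2.25
    k = int(position)        # whole part: the k-th ordered observation
    fraction = position - k  # fractional part used for interpolation
    if k < 1:
        return ordered[0]
    if k >= n:
        return ordered[-1]
    return ordered[k - 1] + fraction * (ordered[k] - ordered[k - 1])

ages = [18, 59, 24, 42, 21, 23, 24, 32]
q1 = quantile_n_plus_1(ages, 0.25)
q3 = quantile_n_plus_1(ages, 0.75)
print(q1, q3, q3 - q1)  # 21.5 39.5 18.0
```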
Properties of IQR
 It is a simple and versatile measure
 It encloses the central 50% of the observations
 It is not based on all observations but only on two specific
values
 It is important in selecting cut-off points in the formulation of
clinical standards
 Since it excludes the lowest and highest 25% values, it is not
affected by extreme values
 Less sensitive to the size of the sample



Variance (σ², s²)
 The main objection to the mean deviation, that the negative signs are ignored, is removed by taking the square of the deviations from the mean
 The variance is the average of the squares of the deviations taken from the mean
 The deviations are squared because the sum of the deviations of the individual observations of a sample about the sample mean is always 0


Variance...

 The variance can be thought of as an average of squared deviations


 Variance is used to measure the dispersion of values relative to
the mean
 When values are close to their mean (narrow range) the
dispersion is less than when there is scattering over a wide range.
 Population variance = σ²
 Sample variance = S²

Ungrouped data

 Let X1, X2, ..., XN be the measurements on N population units, then:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}, \quad \text{where } \mu = \frac{\sum_{i=1}^{N} X_i}{N} \text{ is the population mean.}$$

Example
 Following are the survival times of n = 11 patients after heart transplant surgery.
 The survival time for the “ith” patient is represented as Xi for i = 1, …, 11.
 Calculate the sample variance and SD.

Grouped data

$$S^2 = \frac{\sum_{i=1}^{k} (m_i - \bar{x})^2 f_i}{\sum_{i=1}^{k} f_i - 1}$$

where
m_i = the mid-point of the ith class interval
f_i = the frequency of the ith class interval
x̄ = the sample mean
k = the number of class intervals

Example. Compute the variance and SD of the age of 169 subjects from the grouped data.

Mean = 5810.5/169 = 34.38 years

Class interval   mi     fi    (mi - mean)   (mi - mean)²   (mi - mean)² fi
10-19            14.5   4     -19.88        395.21         1580.86
20-29            24.5   66    -9.88         97.61          6442.55
30-39            34.5   47    0.12          0.0144         0.68
40-49            44.5   36    10.12         102.41         3686.92
50-59            54.5   12    20.12         404.81         4857.77
60-69            64.5   4     30.12         907.21         3628.86
Total                   169                                20197.64

S² = 20197.64/(169 - 1) = 120.22
SD = √S² = √120.22 = 10.96 years
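
The grouped variance and SD can be recomputed with a short Python sketch that carries the unrounded mean through the calculation. The function name `grouped_variance` is illustrative.

```python
from math import sqrt

midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
frequencies = [4, 66, 47, 36, 12, 4]

def grouped_variance(mids, freqs):
    """Sample variance for grouped data: sum of f*(m - mean)^2 over (n - 1)."""
    n = sum(freqs)
    mean = sum(m * f for m, f in zip(mids, freqs)) / n
    squared = sum(f * (m - mean) ** 2 for m, f in zip(mids, freqs))
    return squared / (n - 1)

variance = grouped_variance(midpoints, frequencies)
print(round(variance, 2), round(sqrt(variance), 2))  # 120.22 10.96
```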
Grouped data
 A sample variance is calculated for a sample of individual values (X1, X2, …, Xn) and uses the sample mean (x̄) rather than the population mean µ.


Example:
 Areas of sprayable surfaces with DDT from a sample of 15 houses are as follows (m²):
101, 105, 110, 114, 115, 124, 125, 125, 130, 133, 135, 136, 137, 140, 145
 Find the variance and standard deviation of the above distribution (n = 15)
 The mean of the sample is 125 m², and the sample variance is
s² = Σ(xi – x̄)² / (n - 1)
= {(101 - 125)² + (105 - 125)² + … + (145 - 125)²} / (15 - 1)
= 2502/14
= 178.71 (square metres)²
 Hence, the standard deviation = √Variance = √178.71 = 13.37 m²
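
The same result can be obtained with Python's standard `statistics` module, whose `variance` and `stdev` functions use the sample (n - 1) divisor. The variable names are illustrative.

```python
from statistics import mean, stdev, variance

# Sprayable surface areas (square metres) for the 15 sampled houses
areas = [101, 105, 110, 114, 115, 124, 125, 125, 130, 133, 135, 136, 137, 140, 145]

print(mean(areas))                # 125
print(round(variance(areas), 2))  # 178.71
print(round(stdev(areas), 2))     # 13.37
```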
Properties of Variance
 The main disadvantage of variance is that its unit is the square
of the units of the original measurement values
 The variance gives more weight to the extreme values as
compared to those which are near to mean value, because the
difference is squared in variance
 The drawbacks of variance are overcome by the standard
deviation



Standard deviation (σ, s)

 The sample and population standard deviations are denoted by S and σ (by convention), respectively
 It is the square root of the variance
 This produces a measure having the same scale as that of the individual values

$$\sigma = \sqrt{\sigma^2} \quad \text{and} \quad S = \sqrt{S^2}$$


Ungrouped data
 Let X1, X2, ..., XN be the measurements on N population units, then:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}, \qquad \sigma = \sqrt{\sigma^2}, \quad \text{where } \mu = \frac{\sum_{i=1}^{N} X_i}{N} \text{ is the population mean.}$$

Ungrouped…

Sample standard deviation:

$$S = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

√ = square root
Σ = sum (sigma)
x = score for each point in the data
x̄ = mean of scores for the variable
n = sample size (number of observations or cases)

SD...
 This measure of variation is universally used to show the scatter
of the individual measurements around the mean of all the
measurements in a given distribution.
 Note that the sum of the deviations of the individual observations
of a sample about the sample mean is always 0.



Properties of SD
 The SD has the advantage of being expressed in the same units
of measurement as the mean
 SD is considered to be the best measure of dispersion and is used
widely because of the properties of the theoretical normal curve
 However, if the units of measurement of the variables in two data sets are not the same, then their variability can’t be compared by comparing the values of SD


SD Vs Standard Error (SE)
 SD describes the variability among individual values in a
given dataset
 SE is used to describe the variability among sample means obtained from one sample to another
 We interpret the SE of the mean to mean that another similarly conducted study may give a mean that lies within ± SE of the observed mean


Standard Error

 SD is about the variability of individuals


 SE is used to describe the variability in the means of repeated
samples taken from the same population

 For example, imagine 5,000 samples, each of the same size n=11
 This would produce 5,000 sample means. This new collection has its own pattern of variability
 We describe this new pattern of variability using the SE, not the SD


Example: The heart transplant surgery
n=11, SD=168.89, Mean=161 days
 What happens if we repeat the study? What will our next mean be? Will it be close? How different will it be? The focus here is on the generalizability of the study findings
 The behavior of the mean from one replication of the study to the next is referred to as the sampling distribution of the mean
 We can also have a sampling distribution of the median or the SD


SE…

SE = SD/√n = 168.89/√11 = 50.9 days

We interpret this to mean that a similarly conducted study might produce an average survival time that is near 161 days, ±50.9 days.
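
The figure of 50.9 days follows from the usual formula for the standard error of the mean, SE = SD/√n. A minimal Python sketch, with the illustrative function name `standard_error`, using the heart transplant summary values quoted above:

```python
from math import sqrt

def standard_error(sd, n):
    """Standard error of the mean: sample SD divided by the square root of n."""
    return sd / sqrt(n)

se = standard_error(168.89, 11)
print(round(se, 1))  # 50.9
```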


Coefficient of variation (CV)
 The standard deviation is an absolute measure of deviation
of observations around their mean and is expressed with
the same unit of the data
 Due to this nature of the standard deviation it is not directly
used for comparison purposes with respect to variability
 A special measure called the coefficient of variation, is
often used for this purpose



Coefficient of variation (CV)...
 When two data sets have different units of measurements, or their
means differ sufficiently in size, the CV should be used as a
measure of dispersion
 It is the best measure to compare the variability of two series or sets of observations
 Data with a smaller coefficient of variation are considered more consistent


CV ...

CV is the ratio of the SD to the mean multiplied by 100.

$$CV = \frac{S}{\bar{x}} \times 100$$

              SD         Mean        CV (%)
SBP           20 mmHg    140 mmHg    14.3
Cholesterol   80 mg/dl   400 mg/dl   20.0

 “Cholesterol is more variable than systolic blood pressure”

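
The two CV values in the table can be verified with a small Python sketch; the function name `coefficient_of_variation` is illustrative.

```python
def coefficient_of_variation(sd, mean):
    """CV (%) = SD divided by the mean, multiplied by 100."""
    return sd / mean * 100

cv_sbp = coefficient_of_variation(20, 140)   # systolic blood pressure
cv_chol = coefficient_of_variation(80, 400)  # cholesterol
print(round(cv_sbp, 1), round(cv_chol, 1))   # 14.3 20.0
```

Because the CV is unit-free, the two variables can be compared directly even though they are measured in different units.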
Thank you !!
