
BIOSTATISTICS

LECTURE & CBL 2


STATISTICAL AVERAGES
(MEASURES OF CENTRAL TENDENCY)
AND
MEASURES OF DISPERSION
BY
DR. SAMIRA FAIZ
STATISTICAL AVERAGES (MEASURES OF CENTRAL TENDENCY)
• The word "average" implies a value in the distribution, around which the other values are distributed.
• It gives a mental picture of the central value.
• There are several kinds of averages, of which the commonly used are:
(1) The Arithmetic Mean
(2) The Median
(3) The Mode
The Mean
• The arithmetic mean is widely used in statistical calculation.
• It is sometimes simply called the Mean.
• To obtain the mean, the individual observations are first added together, and then divided by the number of observations.
• The operation of adding together is called 'summation' and is denoted by the sign Σ (sigma).
• The individual observation is denoted by the sign x and the mean is denoted by the sign x̄ (called "x bar").
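Written as a single formula (a compact restatement of the bullets above, not shown on the slide itself):

```latex
% mean = sum of the n individual observations, divided by n
\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
```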
Properties of mean
Merits
• It is rigidly defined.
• It is easily comprehensible and easy to calculate.
• Mean is based upon all observations in the data set.
• Mean is amenable to further mathematical treatment.
Demerits
• Cannot be used when dealing with qualitative data or with skewed quantitative data.
• Cannot be obtained if a single observation is missing or lost.
• Mean is highly affected by the presence of extreme values.
The Median
• It is an average of a different kind, which does not depend upon the total and the number of items.
• To obtain the median, the data is first arranged in ascending or descending order of magnitude, and then the value of the middle observation is located; this value is the median.
The Median
• EXAMPLE (if the number of observations is odd):
• The diastolic blood pressure of 9 individuals was recorded (Fig. 11).
• The median is 79, which is the value of the middle observation (Fig. 12).
The Median
• EXAMPLE (if the number of observations is even):
• The diastolic blood pressure of 10 individuals was recorded (Fig. 13).
• In this example, the median is (79 + 81) / 2 = 80 (Fig. 14). A short code sketch follows below.
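The figures with the raw readings are not reproduced here, so the values below are hypothetical, chosen only so that the medians match the slide (79 for the odd case, 80 for the even case):

```python
import statistics

# Hypothetical diastolic BP readings (the slide's actual data are in Figs. 11 and 13);
# values chosen so the medians match the slide.
odd_bp = [71, 75, 75, 77, 79, 81, 83, 84, 90]        # 9 values -> middle (5th) value
even_bp = [71, 75, 75, 77, 79, 81, 83, 84, 90, 95]   # 10 values -> mean of the two middle values

print(statistics.median(odd_bp))   # 79
print(statistics.median(even_bp))  # 80.0  ((79 + 81) / 2)
```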
Properties of the median
Merits
• Median is easy to comprehend and easy to calculate.
• Median is rigidly defined.
• It is not affected by the presence of extreme values.
• Median can be used as a measure of central tendency even in the case of qualitative data, where the individuals can be ranked.
Demerits
• It is not directly dependent upon all the given values.
• It is not amenable to further mathematical or algebraic treatment.
The Mode
• The mode is the commonly occurring value in a distribution of data.
• It is the most frequent item or the most "fashionable" value in a series of
observations.
• A data set can have no mode, one mode, or many modes (see the code sketch after this list).
• For example:
No mode: 1, 2, 3, 4, 6, 8, 9. Say "amodal data" or "no mode" (don't say "zero mode").
One mode (unimodal): 1, 2, 3, 3, 4, 5 or 1, 2, 2, 3, 3, 4, 4, 4, 5.
Two modes (bimodal): 1, 1, 2, 3, 4, 4, 5.
Three modes (trimodal): 1, 1, 2, 3, 3, 4, 5, 5.
More than one mode (two, three or more) = multimodal.
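As a quick check of these examples, a minimal sketch using Python's standard library (multimode needs Python 3.8+):

```python
from statistics import multimode

# multimode returns every value that occurs with the highest frequency
print(multimode([1, 2, 3, 4, 6, 8, 9]))        # every value once -> no distinct mode (amodal)
print(multimode([1, 2, 2, 3, 3, 4, 4, 4, 5]))  # [4]        -> unimodal
print(multimode([1, 1, 2, 3, 4, 4, 5]))        # [1, 4]     -> bimodal
print(multimode([1, 1, 2, 3, 3, 4, 5, 5]))     # [1, 3, 5]  -> trimodal
```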
Properties of mode
Merits
• Mode is easy to understand and easy to calculate.
• In some cases, it can be determined merely by inspection.
• Mode is not affected by the presence of extreme values.
Demerits
• Mode is not directly based upon all observations.
• It cannot be algebraically treated.
• Mode may not be unique. A frequency distribution may be bimodal or even multimodal, i.e. there may be more than one value which has the maximum frequency.
DATA DISTRIBUTION
• A data distribution is a function or a listing which shows all the
possible values of the data.
• Most importantly, it tells you how often each value occurs.
• Data can be distributed (spread out) in different ways.
• It can be spread out more on the left or more on the right of the baseline.
• But...
NORMAL DISTRIBUTION
• There are many cases where the data tends to cluster around the central value, with no distortion towards the left or the right.
• Such a data distribution is called a 'Normal Distribution'.
NORMAL DISTRIBUTION
• The mean, median, and mode
are all equal in the normal
distribution.
• In a normal distribution the
graph appears as a classical,
symmetrical ‘bell-shaped’ curve

Mean = median = mode


SKEWED OR ASYMMETRIC DATA
• Skewness is asymmetry in a statistical distribution, in
which the curve appears distorted or skewed, either to
the left or to the right
POSITIVE SKEWNESS
• When the distribution is skewed to the right, the tail on the curve's right-hand side is longer than the tail on the left-hand side because of the outliers. In this situation the mean is greater than the mode, as the outliers tend to shift the mean towards them.
• This situation is also called positive skewness.
• Mean > median > mode
• The mean is the biggest and the mode is the smallest.
NEGATIVE SKEWNESS
• When the distribution is skewed to the left, the tail on the curve's left-hand side is longer than the tail on the right-hand side because of the outliers. In this situation the mean is less than the mode, as the outliers tend to shift the mean towards them.
• This situation is also called negative skewness.
• Mode > median > mean
• The mode is the biggest and the mean is the smallest.
KURTOSIS
• Kurtosis is a measure of the degree to which a distribution is peaked or flat in comparison to a normal distribution.
• A normal or bell-shaped distribution is said to be 'Mesokurtic'.
Platykurtic
• The graph may exhibit a flattened appearance when an excessive proportion of the observed values fall in both tails of the curve.
• Such a distribution is said to be 'Platykurtic'.
• It has a shorter peak and thicker tails than the normal distribution.
Leptokurtic
• Conversely, a distribution may
possess a smaller proportion of
observations in its tails, so that its
graph exhibits a peaked appearance
• i.e., most observations gather
around the center
• Such a distribution is said to be
‘leptokurtic’
• Higher peak and thinner tails than
the normal distribution
Summary
MEASURE OF CENTRAL TENDENCY and the types of data each suits:
• Mean: Quantitative (symmetric) only
• Median: Ordinal; Quantitative (skewed)
• Mode: Nominal; Ordinal; Quantitative
CENTRAL TENDENCY - CBL
• Scenario
• The table below gives the number of accidents each year at a particular road junction.
• Work out the mean, median and mode for the values (a worked code sketch follows below).

Year:       1991  1992  1993  1994  1995  1996  1997  1998
Accidents:     4     5     4     2    10     5     3     5

• Mean = 4.75 accidents, Median = 4.5 accidents, Mode = 5 accidents
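A minimal check of these answers using Python's standard library:

```python
import statistics

# Accidents per year at the junction, 1991-1998 (values from the CBL table)
accidents = [4, 5, 4, 2, 10, 5, 3, 5]

print(statistics.mean(accidents))    # 4.75
print(statistics.median(accidents))  # 4.5  (average of 4 and 5 after sorting)
print(statistics.mode(accidents))    # 5    (occurs three times)
```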


MEASURES OF DISPERSION
MEASURE OF DISPERSION
• There are several measures of variation or "dispersion" of which the
following are widely known:
1. Range
2. Mean Deviation
3. Variance
4. The Standard Deviation
5. Coefficient of Variation
1. RANGE
• The simplest measure of dispersion
• It is the difference between the largest and smallest values
• The range may be expressed, for example, as "5 to 15", or by the actual difference, 10.
• The range is not of much practical importance, because it
indicates only the extreme values and nothing about the
dispersion (spread) of values between the two extreme values.
• It is greatly affected by extreme values
R = maximum value – minimum value
2. MEAN DEVIATION
• It is the average of the absolute deviations of the observations from the arithmetic mean.
• It is given by the formula: Mean deviation = Σ|X − x̄| / n

Example:
• The diastolic blood pressure of 10 individuals was as follows:
83, 75, 81, 79, 71, 95, 75, 77, 84 and 90.
• Find the mean deviation (a worked sketch follows below).
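A short worked check of this example; the absolute deviations are taken from the mean of 81:

```python
# Mean deviation = average of the absolute deviations from the arithmetic mean
bp = [83, 75, 81, 79, 71, 95, 75, 77, 84, 90]

mean = sum(bp) / len(bp)                                   # 810 / 10 = 81.0
mean_deviation = sum(abs(x - mean) for x in bp) / len(bp)  # 56 / 10 = 5.6
print(mean, mean_deviation)  # 81.0 5.6
```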
3. VARIANCE
• Variance in statistics is a measurement of the spread between
numbers in a data set or how far a set of numbers is spread out.
• Variance describes how much a random variable differs from its expected value (e.g., the mean). That is, it measures how far each number is from the mean, and therefore from every other number in the set.
A large variance indicates that numbers in the
set are far from the mean and from each
other, while a small variance indicates the
opposite.
DEFINITION OF VARIANCE
• The variance is defined as the average of the squares of the differences between the individual values and the mean of the data set.

Sample variance: s² = Σ(X − x̄)² / (n − 1)

• Expected value = mean = x̄
• Observed value = individual value = X
• Difference between observed value and expected value = X − x̄
• Square of the difference = (X − x̄)²
• Sum of the squared differences = Σ(X − x̄)² (the sum is needed to calculate the average)
POSITIVE/NEGATIVE VARIANCE
• Variance can never be negative, because it is the average squared deviation from the mean, and anything squared is never negative.
• The average of non-negative numbers cannot be negative either. Therefore, variance can never possess a negative sign.
• However, a "variance" can be described as negative in practice, for example in finance:
Income variance: $150
Expense variance: $175
Net income variance: $25 (unfavorable or favorable?)
• It is unfavorable, i.e., the net income variance is practically negative.
Must remember,
• A variance value of zero indicates that all values within a set of
numbers are identical.
e.g., if the age of all the study participants is 27 years, the variance will be zero.

• All variances that are not zero will always be positive numbers.
• Meaning, variance can be zero but can never be negative.
Formulae of variance
• Population variance: σ² = Σ(X − μ)² / N
• Sample variance: s² = Σ(X − x̄)² / (n − 1)
Population Variance vs. Sample Variance
• For a large population, it is impossible to get all the data. So, we take a sample and calculate its variance.
• The formula for the sample variance is a slight twist on the population variance formula.
• In the sample variance, the divisor is reduced by 1, so that the variance comes out slightly bigger (e.g., 10/5 = 2 while 10/4 = 2.5).
• The idea is not simply to get a larger variance, but to be realistic.
• It is reasonable: if we use the population variance formula on sample data, the variance tends to be underestimated.
• That is why, for the sample variance, we reduce the divisor (the sample size) by 1.
Advantages of Variance
• Statisticians use variance to see how individual numbers relate to each other within a data set, rather than using broader mathematical techniques such as arranging numbers into quartiles.
• The advantage of variance is that it treats all deviations from the mean the same, regardless of their direction.
• The squared deviations cannot sum to zero, which prevents the appearance of "no variability at all" in the data.
• It is often used primarily as the quantity whose square root is taken to calculate the standard deviation of the data set.
Disadvantages of Variance
• One drawback is that it gives added weight to outliers. Outliers are
the numbers that are far from the mean. Squaring these numbers
can skew the data.
• Another pitfall of using variance is that it is not easily interpreted. Because it uses squared units rather than the natural data units, the interpretation is less intuitive. Therefore, the square root of its value is often taken instead, which gives the standard deviation of the data set.
• Higher values of variance indicate greater variability, but there is no intuitive interpretation of specific values.
• Despite these drawbacks, some statistical hypothesis tests use it in their calculations, for example ANOVA.
Calculating Variance
• The following are the sizes, in inches, of 5 wooden sticks; we will use them to illustrate all the measures of dispersion:
1, 4, 4, 9, 10
Task:
• Calculate variance
Calculating Variance
• Arranging the data in ascending order:

X      X − x̄             (X − x̄)²
1      1 − 5.6 = −4.6     (−4.6)² = 21.16
4      4 − 5.6 = −1.6     (−1.6)² = 2.56
4      4 − 5.6 = −1.6     (−1.6)² = 2.56
9      9 − 5.6 = 3.4      (3.4)² = 11.56
10     10 − 5.6 = 4.4     (4.4)² = 19.36
x̄ = Mean = 28/5 = 5.6     Σ(X − x̄)² = 57.2

Variance = Σ(X − x̄)² / (n − 1)
• s² = 57.2 / (5 − 1)
• s² = 14.3 sq. inches
4. Standard Deviation
• Measures how spread out the
numbers are from the average
• The larger the standard
deviation, the greater the
dispersion of values around the
mean
• It is the "root mean square deviation", or
• The square root of the variance: SD = √s²
Calculate SD
• Variance = 14.3 sq. inches
• SD = √Variance
• Standard deviation = √14.3 ≈ 3.78 inches (see the code sketch below)
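The same variance and standard deviation for the stick lengths, checked with Python's standard library:

```python
import statistics

sticks = [1, 4, 4, 9, 10]  # stick lengths in inches

s2 = statistics.variance(sticks)  # sample variance (divides by n - 1) -> 14.3
sd = statistics.stdev(sticks)     # square root of the sample variance -> ~3.78
print(s2, sd)
```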
(Figure: example showing the mean and standard deviation of normally distributed data)
The Empirical Rule
• The empirical rule tells you what percentage
of your normally distributed data falls within
a certain number of standard deviations from
the mean:
• 68% of the data falls within
one standard deviation of the mean
• 95% of the data falls within
two standard deviations of
the mean.
• 99.7% of the data falls within
three standard deviations of
the mean.
A standard deviation is a statistic that measures the spread of a dataset's values relative to their mean.
• The distances of 1, 2 and 3 standard deviations from the mean, in both directions, enclose approximately 68%, 95% and 99.7% of the total area under the curve respectively (mean = x̄).
• x̄ ± 1 SD = 68% (since the area from x̄ to x̄ + 1 SD is 34%, and from x̄ − 1 SD to x̄ is 34%)
• x̄ ± 2 SD = 95%
• x̄ ± 3 SD = 99.7%
• The total area under the curve is 100%.
Let's look at a pizza delivery example. Assume that a pizza restaurant has a mean delivery time of 30 minutes and a standard deviation of 5 minutes (30 ± 5). Using the Empirical Rule, we can determine that 68% of the delivery times are between 25 and 35 minutes (30 ± 5), 95% are between 20 and 40 minutes (30 ± 2×5), and 99.7% are between 15 and 45 minutes (30 ± 3×5). The chart below illustrates this property graphically.
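The same intervals, computed directly as a minimal sketch:

```python
# Empirical-rule intervals for the pizza example (mean 30 min, SD 5 min)
mean, sd = 30, 5

for k, pct in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    print(f"{pct} of delivery times: {mean - k * sd} to {mean + k * sd} minutes")
# 68% of delivery times: 25 to 35 minutes
# 95% of delivery times: 20 to 40 minutes
# 99.7% of delivery times: 15 to 45 minutes
```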
MCQ (say whether true or false)
• The BMI of a group of 28-year-old males follow a
normal distribution with mean 25Kg/m² and
standard deviation of 5 Kg/m²
a. About 99.7% of the males have BMI between 15 and 35 Kg/m²
b. About 2.28% of males have BMI above
35Kg/m²
c. 68% of the data has the BMI between 20 and
30 Kg/m²
d. 16% of the males have BMI less than 20Kg/m²
(Answer: F,T,T,T)
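To see where these answers come from, the empirical rule can be applied with mean 25 and SD 5; the exact normal probabilities (via Python's NormalDist, available from Python 3.8) agree closely:

```python
from statistics import NormalDist

bmi = NormalDist(mu=25, sigma=5)  # BMI ~ Normal(25, 5)

print(bmi.cdf(35) - bmi.cdf(15))  # ~0.954 -> 15-35 Kg/m² is mean ± 2 SD, so (a) "99.7%" is False
print(1 - bmi.cdf(35))            # ~0.023 -> about 2.28% above 35 Kg/m², so (b) is True
print(bmi.cdf(30) - bmi.cdf(20))  # ~0.683 -> about 68% between 20 and 30 Kg/m², so (c) is True
print(bmi.cdf(20))                # ~0.159 -> about 16% below 20 Kg/m², so (d) is True
```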
Advantages of SD
• Units same as units of raw data
• Arithmetically easy to handle
• Measures variation in data
• Used in further calculation
• Investors can use standard deviation to assess how consistent returns
are over time.
5. Coefficient of Variation
Definition:
• The Coefficient of Variation (C.V) is a measure which is independent of the unit of measurement.
Reasons for computing it:
• When two data sets of the same variable are recorded in different units and we are interested in comparing the variation in the data sets (e.g., weight in pounds and in kg).
• When two data sets have the same standard deviation (but different means).
Coefficient of Variation
• Suppose two samples of human males yield the following results:
Sample 1 (25 years old): mean weight 145 pounds, standard deviation 10 pounds
Sample 2 (11 years old): mean weight 80 pounds, standard deviation 10 pounds
• A comparison of the standard deviations alone would lead us to conclude that the two samples possess equal variability of 10 pounds. If we compute the C.V., we get:
Coefficient of Variation
Sample 1 (25 years old):
• C.V = (10/145) × 100 = 6.9%
Sample 2 (11 years old):
• C.V = (10/80) × 100 = 12.5%
If we compare the results, we can conclude that:
• The data for the 25-year-olds (sample 1) have less variability, OR
• The data for the 11-year-olds (sample 2) have more variability, OR
• Sample 1 is more precise (consistent) as compared to sample 2.
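A minimal sketch of the same comparison:

```python
# Coefficient of variation = (SD / mean) x 100; unit-free, so the two samples are comparable
def coefficient_of_variation(sd, mean):
    return sd / mean * 100

print(round(coefficient_of_variation(10, 145), 1))  # 6.9  -> sample 1 (25-year-olds)
print(round(coefficient_of_variation(10, 80), 1))   # 12.5 -> sample 2 (11-year-olds)
```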
MORE ABOUT STATISTICAL
DISTRIBUTION
STANDARD ERROR OF MEAN
• The standard error is considered part of inferential statistics.
• In statistics, a sample mean deviates from the actual mean of a population;
this deviation is the standard error of the mean.
• In cases where multiple samples are collected, the mean of each sample
may vary slightly from the others, creating a spread. This spread is most
often measured as the standard error, accounting for the differences
between the means across the datasets.
• When the standard error is small, the data is said to be more
representative of the true population mean.
• Therefore, the standard error (SE) of a statistic is, approximately, the standard deviation of that statistic across repeated samples from the population.
Standard Error of Mean (SEM)
• The standard error of the mean (SEM) measures how far the sample
mean of the data is likely to be from the true population mean.
• SEM is the SD of the theoretical distribution of the sample means (the
sampling distribution).
• SEM is calculated by taking the standard deviation and dividing it by
the square root of the sample size.
SEM=SD/√n
• The SEM is always smaller than the SD.
Why divide by √n?
• By dividing by the square root of n, you are paying a "penalty" for using a sample instead of the entire population.
• Sampling allows us to make guesses or inferences about a population.
• The smaller the sample, the less confidence you might have in those inferences; that is the origin of the "penalty".
Relationship between sample size and SEM
• As your sample size increases toward the size of the entire
population, the difference between the population mean and sample
mean becomes smaller and smaller.
• Therefore, the larger the sample size, the smaller the standard error of the mean.
Calculate SEM
• To determine the prevalence of anemia in pregnancy, hemoglobin level of
100 females was recorded. Mean hemoglobin level was found to be 12 ±2
g/dl. Calculate standard error of mean. (Annual 2013)

Solution
• SEM = SD /√n
• SEM = 2/ √100
• SEM = 2/ 10
• SEM = 0.2
CONFIDENCE INTERVALS
• A confidence interval in statistics refers to the probability that the
population parameter (e.g., mean) will fall between two set values.
• The two set values are generally defined by the lower and upper
bounds or limits.
• The confidence interval is expressed as a percentage (the most
frequently quoted percentages are 90%, 95%, and 99%). The
percentage reflects the confidence level.
CONFIDENCE LIMITS
• Confidence limits are the numbers at the upper and lower end of a
confidence interval;
• For example,
if your mean is 102.86
with confidence limits of 99.29 and 106.43,
your confidence interval is 99.29 to 106.43.

Formula to find 95% confidence limit


“95%CI = mean ± 2 SEM”
CONFIDENCE LEVEL
• Most people use 95% confidence level,
although you could use other values.
• Setting 95% confidence level means that if
you take repeated random samples from a
population and calculate the mean and
confidence limits , you will be 95%
confident that your population mean will
lie within these confidence limits.
Calculate confidence limits
• To determine the prevalence of anemia in pregnancy, the hemoglobin level of 100 females was recorded. The mean hemoglobin level was found to be 12 ± 2 g/dl. Find the 95% confidence limits within which the population mean hemoglobin level would lie.

Solution:
n = 100, Mean = 12 g/dl, SD = 2 g/dl, 95% CI = ?
SEM = SD/√n = 2/√100 = 0.2
95% CI = 12 ± 2(0.2)
95% CI = 12 ± 0.4 (lower limit = 12 − 0.4, upper limit = 12 + 0.4)
95% CI = 11.6 to 12.4 g/dl
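The SEM and confidence limits for this example, as a minimal sketch:

```python
import math

# Hemoglobin example: n = 100, mean = 12 g/dl, SD = 2 g/dl
n, mean, sd = 100, 12, 2

sem = sd / math.sqrt(n)   # 2 / 10 = 0.2
lower = mean - 2 * sem    # 11.6 g/dl
upper = mean + 2 * sem    # 12.4 g/dl
print(sem, lower, upper)  # 0.2 11.6 12.4
```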
Various confidence levels and their critical values
• 68% CI => mean ± 1.0 SEM
• 90% CI => mean ± 1.645 SEM
• 95% CI => mean ± 1.96 SEM (≈ 2 SEM)
• 99% CI => mean ± 2.576 SEM
Thank you
