Measures of central tendency and dispersion

Measures of location
A single value that can represent a whole set of data is called an average. Because such a value tends to lie
at the centre of the distribution, it is called a measure of central tendency; because it locates the
general position of the data, it is also called a measure of location.
The important measures of central tendency are the mean, the median and the mode.
Mean :
It is the most familiar measure of central tendency. It is commonly termed the average or, more accurately,
the arithmetic mean. It is defined as the sum of the observations divided by the number of observations, i.e. the
mean of n observations x1, x2, x3, ..., xn is given by
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Where
$x_i$ stands for the ith observed value
n stands for the number of observations
$\sum x_i$ stands for the sum of all observed x values
$\bar{x}$ stands for the mean value of x
For example, the mean of 10, 20, 30 and 40 is 100/4 = 25.
Merits:
It is easy to calculate and simple to follow.
It is based on all the observations.
It is determined for almost every kind of data.
Demerits:
Mean is highly affected by extreme values.
It is not an appropriate average for highly skewed distributions.
Median :
The median is the middle value of the data when they are arranged in order of magnitude. When the data are
arranged, the median is the middle value if the number of values is odd, and the mean of the two middle values if
the number of values is even. In other words, the value that divides the ordered set of data into two equal parts is
called the median.
For an odd number of values
Calculate the sample median for the following set of observations: 1, 5, 2, 8, 7.
Start by sorting the values: 1, 2, 5, 7, 8.
The median is 5 since it is the middle observation in the ordered list.
For an even number of values
Calculate the sample median for the following set of observations: 1, 5, 2, 8, 7, 2.
Start by sorting the values: 1, 2, 2, 5, 7, 8.
The average of the two middlemost terms is (2 + 5)/2 = 3.5. Therefore, the median is 3.5 since it is the
average of the middle observations in the ordered list.
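The mean and median examples above can be checked with a short sketch. It uses Python's standard `statistics` module; the variable names are illustrative only.

```python
import statistics

# Data sets taken from the worked examples above.
mean_sample = [10, 20, 30, 40]
odd_sample = [1, 5, 2, 8, 7]
even_sample = [1, 5, 2, 8, 7, 2]

mean_value = statistics.mean(mean_sample)      # sum of values / number of values
median_odd = statistics.median(odd_sample)     # middle value after sorting
median_even = statistics.median(even_sample)   # mean of the two middle values

print(mean_value, median_odd, median_even)
```

The function sorts the data internally, so the observations do not need to be arranged beforehand.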
For a grouped frequency distribution the median is given by
$\text{Median} = L + \frac{\frac{n}{2} - F}{f} \times c$
Where
L = the lower limit of the median class (the median class is the class that contains the (n/2)th observation of
the series)
n = total number of observations
F = cumulative frequency of the class just preceding the median class
f = frequency of the median class
c = the class interval (width) of the median class
Advantages :
It is not affected by extremely high or low values.
It can be calculated from frequency distribution
Disadvantages :
It is not based on all the observations
Compared to the mean, it is less reliable for drawing inferences and harder to use in advanced statistics.
Mode :
Mode is the value of a data set that occurs most frequently.
For example ,the mode of the observations 3,6,7,9,6,8 and 6 is 6.
When every value occurs the same number of times in the data, there is no mode. If two or more values
share the highest frequency, there are two or more modes and the distribution is said to be multimodal.
If the data have only one mode the distribution is said to be unimodal, and if the data have two modes
the distribution is said to be bimodal.
For a grouped frequency distribution the mode is given by
$\text{Mode} = L + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times c$
Where
L = lower limit of the modal class (the modal class is the class where the frequency is maximum)
c = the class interval (width) of the modal class
$\Delta_1$ = difference between the frequency of the modal class and that of the premodal class
$\Delta_2$ = difference between the frequency of the modal class and that of the postmodal class
Advantages:
It is easy to understand and simple to calculate.
It is not affected by extremely high or low values.
It can be calculated from frequency distribution
Disadvantages:
It is not based on all the values.
It is a less reliable average than the mean when the number of observations is small.
It can be misleading: the mode tells which score is most frequent, but tells nothing about the other scores in
the distribution.
Example:
Calculate median and mode from the following data.
Marks in test   Frequency   Cumulative frequency   Range of cumulative frequency
5-9             12          12                     0-12
9-13            8           20                     12-20
13-17           15          35                     20-35
17-21           19          54                     35-54
21-25           14          68                     54-68
25-29           7           75                     68-75
Here n = 75, therefore n/2 = 75/2 = 37.5. The (n/2)th value (37.5) falls in the cumulative range 35-54, so the
median class is 17-21.
That is, L = 17, F = 35, f = 19 and c = 4
$\text{Median} = L + \frac{\frac{n}{2} - F}{f} \times c = 17 + \frac{37.5 - 35}{19} \times 4 = 17.53$
Calculation of Mode :
Here ,the frequency is maximum in 17-21. So ,17-21 is the modal class.
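L = 17, c = 4, $\Delta_1$ = 19 - 15 = 4, $\Delta_2$ = 19 - 14 = 5
$\text{Mode} = L + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times c = 17 + \frac{4}{4 + 5} \times 4 = 18.78$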
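The grouped median and mode can be checked with a small sketch. It assumes the standard interpolation formulas, Median = L + ((n/2 - F)/f) × c and Mode = L + (Δ1/(Δ1 + Δ2)) × c, and equal class widths; the function names are illustrative only.

```python
# Lower class limits, common class width and frequencies from the marks table.
lower_limits = [5, 9, 13, 17, 21, 25]
width = 4
freqs = [12, 8, 15, 19, 14, 7]

def grouped_median(lower_limits, freqs, width):
    """Interpolated median: L + ((n/2 - F) / f) * c."""
    n = sum(freqs)
    cumulative = 0  # F: cumulative frequency before the current class
    for lower, f in zip(lower_limits, freqs):
        if cumulative + f >= n / 2:
            return lower + (n / 2 - cumulative) / f * width
        cumulative += f

def grouped_mode(lower_limits, freqs, width):
    """Interpolated mode: L + (d1 / (d1 + d2)) * c."""
    i = freqs.index(max(freqs))                       # modal class
    prev_f = freqs[i - 1] if i > 0 else 0             # premodal frequency
    next_f = freqs[i + 1] if i < len(freqs) - 1 else 0  # postmodal frequency
    d1 = freqs[i] - prev_f
    d2 = freqs[i] - next_f
    return lower_limits[i] + d1 / (d1 + d2) * width

median = grouped_median(lower_limits, freqs, width)
mode = grouped_mode(lower_limits, freqs, width)
print(round(median, 2), round(mode, 2))
```

With these frequencies the median class and the modal class are both 17-21.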
Measures of dispersion
Why study dispersion?
An average , such as the mean or the median only locates the centre of the data. An average does not tell
us anything about the spread of data.
Dispersion
The degree to which numerical data tend to spread about an average value is called the dispersion or
variation of the data.
Measures of dispersion quantify the extent to which individual values deviate from the central value.
Important measures of dispersion include the following:
- Range
- Standard deviation and variance
- Standard error
- Coefficient of variation
Range :
Range is defined as the difference between the maximum and the minimum value of a given data.
It is the simplest measure of dispersion. It gives a general idea about the total spread of the observations.
If $X_m$ denotes the maximum observation and $X_0$ denotes the minimum observation, then the range is defined as
Range = $X_m - X_0$
Uses :
Used to define the normal limits of biological characteristics, e.g. glucose, blood cholesterol,
haemoglobin, bilirubin etc.
Merits :
It is easy to understand and calculate
It does not require any special knowledge
It takes minimum time to calculate the value of range
Limitations :
It does not take into account all items of the distribution; only the two extreme values are considered.
It is affected by extreme values.
It is a poor measure of dispersion and does not give a good picture of the variability.
It plays no role in advanced statistics.
Standard deviation
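The standard deviation is defined as the positive square root of the mean of the squared deviations taken
from the arithmetic mean of the data. It is the most commonly used measure of dispersion.
For sample data the standard deviation is denoted by s (σ is used for the population standard deviation) and,
taking n - 1 as the divisor for a sample, is defined as:
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
For example, for the numbers 1, 2 and 3, the mean is 2 and the standard deviation is
$s = \sqrt{\frac{(1-2)^2 + (2-2)^2 + (3-2)^2}{3 - 1}} = \sqrt{\frac{2}{2}} = 1$
The greater the SD, the greater is the variation between observations.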
Merits :
It is based on all the observations
Provides a good description of variability
Plays an important role in advanced statistics
Uses :
It is expressed in the original units of the data.
Best measure of dispersion
Helps in finding the standard error and co-efficient of variation
It is of great importance for the analysis of data and for the various statistical inferences.
Demerits :
It is more difficult to compute than the range.
It is affected by extreme values.
It cannot be used directly to compare the variability of data sets measured in different units.
The Variance
Variance is another absolute measure of dispersion. It is defined as the average of the squared differences
between each of the observations in a set of data and the mean.
For sample data the variance is denoted by $s^2$ and
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Where $\bar{x}$ is the sample mean and n is the number of observations in the sample.
Standard error :
Standard error is defined as the standard deviation of the sample observations divided by the square root of
the sample size, given as
$SE(\text{mean}) = \frac{SD}{\sqrt{n}}$
Where SD is the standard deviation of the sample observations and n is the sample size.
The greater the SE, the less precisely the sample mean estimates the population mean.
Uses :
To determine whether the difference between two group means is significant
$Z = \frac{\bar{x}_1 - \bar{x}_2}{SE(\bar{x}_1 - \bar{x}_2)}$
To calculate the size of sample if the population standard deviation is known
$SE = \frac{\sigma}{\sqrt{n}}$ or $n = \left(\frac{\sigma}{SE}\right)^2$
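The standard deviation, variance and standard error can be computed as in the sketch below. It assumes the sample (n - 1) definitions, which is what `statistics.stdev` and `statistics.variance` implement; the data are the numbers 1, 2 and 3 from the standard deviation example.

```python
import math
import statistics

data = [1, 2, 3]

s = statistics.stdev(data)        # sample standard deviation (divisor n - 1)
s2 = statistics.variance(data)    # sample variance (divisor n - 1)
se = s / math.sqrt(len(data))     # standard error of the mean: SD / sqrt(n)

print(s, s2, round(se, 3))
```

For the population versions (divisor n), `statistics.pstdev` and `statistics.pvariance` can be used instead.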
Coefficient of Variation:
The most important of the relative measures of dispersion is the coefficient of variation. The coefficient
of variation (CV) is defined as the SD divided by the mean, given as
$CV = \frac{SD}{\text{Mean}} \times 100$
It is expressed as a percentage.
Uses of Coefficient of Variation
The coefficient of variation is used to assess the consistency of the data. By consistency we mean the uniformity
of the values of the data/distribution about their arithmetic mean. A distribution with a smaller CV than
another is taken as the more consistent of the two.
CV is also very useful for comparing the variability of two populations that are expressed in different units of
measurement.
Example :
In a survey it was observed that the mean height of 100 adults was 170 cm with an SD of 12 cm, and the
mean height of 100 children was 50 cm with an SD of 7 cm. Which group shows the greater
variation?
            Mean     SD      CV
Adults      170 cm   12 cm   7.1 %
Children    50 cm    7 cm    14.0 %
Although adult height shows the greater variation in terms of SD, children's height actually shows the greater
relative variation, because the CV of children's height is greater than the CV of adults' height.
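The two CV values in the height example can be reproduced with a minimal sketch; the function name is illustrative.

```python
def coefficient_of_variation(mean, sd):
    """CV as a percentage: (SD / mean) * 100."""
    return sd / mean * 100

cv_adults = coefficient_of_variation(mean=170, sd=12)
cv_children = coefficient_of_variation(mean=50, sd=7)

print(round(cv_adults, 1), round(cv_children, 1))
```

Because the CV is unit-free, the same function could compare, say, heights in cm with weights in kg.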
Probability
Experiment (E ):
An experiment is any well-defined action that may result in a number of outcomes. For example, the rolling of
a die can be considered an experiment.
Outcome (O) :
An outcome is defined as any possible result of an experiment.
Sample space (S) :
The sample space is defined as the set of all possible outcomes of an experiment. It is usually denoted by S.
For example, when a coin is tossed, it has two possible outcomes. One is called head and the other is called tail.
Head is denoted by H and tail is denoted by T. Thus the sample space consists of head and tail. In set theory
notation, we can write S as:
S = {Head, Tail} or S = {H, T}
Event :
Event is a collection of one or more of the outcomes of an experiment
Equally likely events :
Two or more events are said to be equally likely if each one of them has an equal chance of occurring.
Mutually exclusive events:
Two or more events are said to be mutually exclusive when the occurrence of any one of them excludes the
occurrence of the others.
Probability
Probability is a numerical measure of the likelihood of an event relative to a set of alternative events. For
example, there is a 50% probability of observing heads relative to observing tails when flipping a coin.
If an experiment can produce N mutually exclusive and equally likely outcomes, of which n outcomes are
favorable to the occurrence of event A, then the probability of A is denoted by P(A) and is defined as the ratio
n/N. Thus the probability of A is given by
$P(A) = \frac{n}{N} = \frac{\text{number of outcomes favorable to } A}{\text{total number of possible outcomes}}$
Properties of probability :
For any event, the probability of the event is greater than or equal to zero.
The probability of an event lies between 0 and 1.
If the probability of an event A is expressed as P(A), then $0 \le P(A) \le 1$.
When an event is certain to occur, it has a probability equal to 1; when it is impossible for the event to occur, it
has a probability equal to 0.
Probability of an event = sum of probabilities of its elements
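The classical definition P(A) = n/N can be sketched as follows. The fair die and the event "an even number is rolled" are assumed here purely for illustration; they are not taken from the examples above.

```python
from fractions import Fraction

# Sample space for rolling one fair die (assumed example).
sample_space = {1, 2, 3, 4, 5, 6}
event_even = {2, 4, 6}

def classical_probability(event, sample_space):
    """P(A) = favorable outcomes / total outcomes, for equally likely outcomes."""
    favorable = len(event & sample_space)
    return Fraction(favorable, len(sample_space))

p_even = classical_probability(event_even, sample_space)
p_certain = classical_probability(sample_space, sample_space)
p_impossible = classical_probability(set(), sample_space)

print(p_even, p_certain, p_impossible)
```

Using `Fraction` keeps the result exact, which makes the bounds 0 and 1 easy to verify.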
Hypothesis testing/Statistical inference
Statistical hypothesis
A statistical hypothesis is an assertion or statement about a population or the probability distribution characterizing a
population which we want to verify on the basis of information contained in a sample.
Some examples of statistical hypotheses
- A researcher may hypothesize that a new drug is effective in cancer
- The court may assume that the indicted person is innocent
- A clinician may hypothesize that vitamins A and D play a significant role in promoting growth
Hypothesis testing
The procedure of decision- making through proper statistical tests between two contending hypotheses is referred to
as hypothesis testing or test of hypothesis
Types of hypothesis
Statistical tests are concerned with two types of hypothesis. They are Null hypothesis and Alternative Hypothesis
Null hypothesis
The hypothesis which is to be tested is called the null hypothesis. It is denoted by H0. It is the starting point of
the investigation.
Example: suppose we suspect that the use of coffee increases the chance of heart attack. To start with, we assume
that heart attack has no link with the use of coffee. This is taken as H0, and we hope it will be rejected by the
sample data.
Alternative Hypothesis
The hypothesis which is accepted when the null hypothesis has been rejected is called the alternative hypothesis. It is
denoted by H1 or Ha. Whatever we are expecting from the sample data is taken as the alternative hypothesis.
Example: suppose we suspect that oral contraceptives cause breast cancer. This is the result we hope to obtain from the
sample, so it is taken as the alternative hypothesis; the null hypothesis is that oral contraceptives do not cause
breast cancer.
Test Statistic
A statistic on which the decision to accept or reject a hypothesis is based is called a test statistic.
Steps in performing a test of Hypothesis :
1. State the null hypothesis
2. Select an appropriate test statistic
3. Choose a significance level α of the test ,usually α = 5%
4. Calculate the test statistic (z, t, F, χ2 etc.) and determine the p value (the probability of obtaining a result at
least as extreme as the one observed if the null hypothesis is true)
5. Draw a conclusion on the basis of the p value, i.e. decide whether the observed difference is statistically significant or not.
If the p value is less than or equal to α, reject the null hypothesis and conclude that the observed difference is statistically
significant. If the p value is greater than α, do not reject the null hypothesis and conclude that the observed difference
is not statistically significant.
[Flowchart: State hypothesis → Select test statistic → Choose a significance level → Calculate test statistic → Make
statistical decision → either reject H0 (conclude H0 is rejected) or do not reject H0 (conclude H0 may be accepted)]
z-test
A statistical test to determine whether the difference between two means is significant or not is referred to as z-test
The test is applied for large samples ,n > 30
Steps in performing z-test :
1. State the null hypothesis
H0 : There is no difference between two population means
2. Calculate the z value
$z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
Where
$\bar{x}_1$ and $\bar{x}_2$ are the means of the first and second sample respectively
$s_1$ and $s_2$ are the standard deviations of the first and second sample respectively
$n_1$ and $n_2$ are the sizes of the first and second sample respectively
3. Choose a significance level α, usually α = 5 %
4. Refer the z value to a statistical table to find the p value. If p ≤ 0.05, reject H0; otherwise do not reject it.
Example :
The haemoglobin (Hb) level of children was measured in 143 girls and 127 boys. The results are summarized in the table.
        Number (n)   Mean   SD
Girls   143          11.2   1.4
Boys    127          11.0   1.3
In the observed sample, the girls had higher Hb levels on average than the boys did. The question is
whether the observed difference (0.2 g/dl) occurred by chance (sampling error) or reflects a true difference.
Null hypothesis :
There is no significant difference between the haemoglobin (Hb) level of girls and boys
Calculation :
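Here, $n_1$ = 143, $\bar{x}_1$ = 11.2, $s_1$ = 1.4
$n_2$ = 127, $\bar{x}_2$ = 11.0, $s_2$ = 1.3
Therefore $z = \frac{11.2 - 11.0}{\sqrt{\frac{1.4^2}{143} + \frac{1.3^2}{127}}} = \frac{0.2}{0.164} = 1.22$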
Interpretation :
Since z < 1.96 , p > 0.05 ,the difference is not statistically significant. Hence we conclude that there is no significant
difference between the haemoglobin (Hb) level of girls and boys
Confidence Interval
Lower limit
= $(\bar{x}_1 - \bar{x}_2) - 1.96 \times SE(\bar{x}_1 - \bar{x}_2)$
= (11.2 - 11.0) - 1.96 × 0.164
= 0.2 - 0.32
= -0.12
Upper limit
= 0.2 + 1.96 × 0.164
= 0.52
Therefore ,95% confidence interval for the difference between two population means will be (-0.12 ,0.52 )
Interpretation :
Since this confidence interval includes zero, we can not reject null hypothesis. Hence we conclude that there is no
significant difference between the haemoglobin (Hb) level of girls and boys at 5% level of significance.
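The z value, p value and 95% confidence interval of the haemoglobin example can be reproduced with the sketch below. It assumes a two-sided test and obtains the p value from the standard normal distribution via `math.erfc` rather than from a printed table; the function name is illustrative.

```python
import math

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2, z_crit=1.96):
    """Large-sample z-test for the difference between two means."""
    diff = mean1 - mean2
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)   # SE of the difference
    z = diff / se
    # Two-sided p value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p = math.erfc(abs(z) / math.sqrt(2))
    ci = (diff - z_crit * se, diff + z_crit * se)    # 95% CI when z_crit = 1.96
    return z, p, ci

# Girls: n = 143, mean 11.2, SD 1.4; boys: n = 127, mean 11.0, SD 1.3
z, p, ci = two_sample_z(11.2, 1.4, 143, 11.0, 1.3, 127)
print(round(z, 2), round(p, 3), [round(limit, 2) for limit in ci])
```

Both routes lead to the same decision: the p value is above 0.05 and the interval contains zero.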
Correlation
Correlation refers to the relationship between two continuous variables ,say X and Y ,in the case where each particular
value of X is paired with one particular value of Y. For example ,the measures of height for individual human subjects
,paired with their corresponding measures of weight ; the number of hours that individual students in a statistics
course spend studying prior to an exam ,paired with their corresponding measures of performance on the exam and so
on
When the two variables are meaningfully related and both increase or both decrease together, the
correlation is termed positive. For example, the length of an iron bar increases as the temperature increases.
If an increase in one variable is associated with a decrease in the other variable, the correlation is termed negative.
For example, the volume of a gas decreases as the pressure increases, and the demand for a commodity
increases as its price decreases.
If there is no relationship between the two variables, so that a change in the value of one is not accompanied by any
systematic change in the other, there is said to be no correlation or zero correlation.
Situations for correlation :
1. Two continuous characters are measured in the same person
- Weight and height
- Weight and cholesterol
- The number of hours spent studying prior to an exam and performance on the exam
2. The same character is measured in two related groups
- Tallness in parents and tallness in children
- Study of intelligence quotient (IQ) in brothers and in corresponding sisters
Types of correlation :
1. Perfect correlation
- Perfect positive correlation
- Perfect negative correlation
2. Partial correlation
- Partial or moderately positive correlation
- Partial or moderately negative correlation
3. No correlation or zero correlation
Perfect positive correlation :
In this case, two variables (say X and Y) are directly proportional to each other, i.e. both variables rise or fall in the same
proportion. r = +1 indicates that X and Y are perfectly related in a positive linear sense. All points in a scatter diagram lie on
a straight line that has a positive slope. For example, the relationship between voltage and current in an
electrical circuit.
Perfect negative correlation :
In this case ,X and Y are inversely proportional to each other ,ie ,one rises ,the other one falls in the same proportion . r
= -1 indicates X and Y are perfectly related in a negative linear sense .All points in a scatter diagram lie on the straight
line that has a negative slope. For example the relationship between pressure and volume of gas at a particular
temperature.
Partial correlation
Partial or moderately positive correlation
Value of r close to 1 indicates a significant linear relationship with positive slope. In this case ,the non-zero values of r
lies between 0 and +1 ,ie ,0 < r < 1 .Examples are : age of husband and age of wife ,glucose and HbA 1c etc
Partial or moderately negative correlation
Value of r close to -1 indicates a significant linear relationship with negative slope. In this case ,the non-zero values of r
lie between 0 and -1 ,ie ,-1 < r < 0 .Examples include ,income and malnutrition ,income and infant mortality etc.
Coefficient of Correlation
Coefficient of Correlation is a quantitative measure of the direction and strength of linear relationship between two
numerically measured variables. It is denoted by r.
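If, for two variables X and Y, SS(X) and SS(Y) stand for their sums of squares and SP(X,Y) for their sum of
products, then r is defined as
$r = \frac{SP(X,Y)}{\sqrt{SS(X) \times SS(Y)}}$
where $SP(X,Y) = \sum xy - \frac{(\sum x)(\sum y)}{n}$, $SS(X) = \sum x^2 - \frac{(\sum x)^2}{n}$ and $SS(Y) = \sum y^2 - \frac{(\sum y)^2}{n}$.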
Properties of r :
1. r lies between -1 and +1, i.e. $-1 \le r \le 1$
2. The sign of r reflects the trend of the points
3. If r = -1 ,two variables are inversely proportional to each other ,ie, when one rises ,the other one falls in the
same proportion
4. If r = 1 ,two variables are perfectly related in a positive linear sense ,ie, both variable rise or fall in the same
proportion
5. If 0 < r < 1 there exists a moderately positive correlation, and if -1 < r < 0 a moderately negative correlation,
between the two variables
6. If r = 0 ,there exists no linear relationship between the two variables
7. r is unaffected by change of scale or origin
Scatter diagram
A scatter diagram is a tool for analyzing relationships between two variables , i.e., how one variable changes with the
other variable. This diagram simply plots pairs of corresponding data from two variables, which are usually two
variables in a process being studied. The scatter diagram does not determine the exact relationship between the two
variables, but it does indicate whether they are correlated or not.
The scatter diagram is used to 1) quickly confirm a hypothesis that two variables are correlated 2) provide a graphical
representation of the strength of the relationship between two variables
An example of a scatter diagram that shows no correlation is shown in Figure .
Correlation measures the strength of association between two variables.
Regression enables us to predict the values of a dependent variable (e.g. weight) based on an independent variable
(e.g. height).
Example :
Mouse   Length (x)   Weight (y)
1       2            1
2       5            4
3       8            3
4       12           4
5       14           8
6       19           9
7       22           8
Null hypothesis :
H0 : There is no relationship between weight and length of mice ,ie ,
H0 : r = 0
Calculation :
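Calculate $\sum x$, $\sum y$, $\sum x^2$, $\sum y^2$, $\sum xy$
Length (x)   Weight (y)   x²     y²    xy
2            1            4      1     2
5            4            25     16    20
8            3            64     9     24
12           4            144    16    48
14           8            196    64    112
19           9            361    81    171
22           8            484    64    176
$\sum x$ = 82   $\sum y$ = 37   $\sum x^2$ = 1278   $\sum y^2$ = 251   $\sum xy$ = 553
$r = \frac{553 - \frac{82 \times 37}{7}}{\sqrt{\left(1278 - \frac{82^2}{7}\right)\left(251 - \frac{37^2}{7}\right)}} = \frac{119.57}{\sqrt{317.43 \times 55.43}}$
= 0.90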
Interpretation :
The calculated value 0.90 exceeds the tabulated value 0.754 at the 5% level of significance with 7 - 2 = 5 degrees of
freedom. So p < 0.05, and we reject H0 and conclude that a significant relationship exists between the weight and length of mice.
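The correlation coefficient for the mouse data can be recomputed with the sketch below. It assumes the sum-of-squares form r = SP(X,Y)/√(SS(X)·SS(Y)) and uses the weights implied by the column totals of the calculation table (Σy = 37, Σy² = 251, Σxy = 553); the function name is illustrative.

```python
import math

# Length (x) and weight (y) of the seven mice, consistent with the column totals.
length = [2, 5, 8, 12, 14, 19, 22]
weight = [1, 4, 3, 4, 8, 9, 8]

def pearson_r(x, y):
    """Pearson correlation from sums of squares and sum of products."""
    n = len(x)
    sp = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n   # SP(X,Y)
    ss_x = sum(a * a for a in x) - sum(x) ** 2 / n                # SS(X)
    ss_y = sum(b * b for b in y) - sum(y) ** 2 / n                # SS(Y)
    return sp / math.sqrt(ss_x * ss_y)

r = pearson_r(length, weight)
print(round(r, 2))
```

The result is then compared with the tabulated critical value for n - 2 degrees of freedom.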
Regression:
The word regression was used by Francis Galton in 1885. It is defined as “the dependence of one variable upon another
variable”. For example, weight depends upon height.
In regression we can estimate the unknown values of one (dependent) variable from known values of the other
(independent) variable.
Regression procedures are very widely used in research involved in the social and natural sciences ,especially in the
medical and health sciences. For example ,for the assessment of nutritional status the concept of regression is applied
to develop standard chart of height and weight for normal healthy population. From this chart we can find standard
weight of an individual if we know his/her height.
Often in biochemical tests ,we apply regression to find concentration of blood glucose ,cholesterol ,TG ,insulin
,creatinine etc from absorbance (optical density ,OD).We first develop a standard regression curve from data on
concentration of these parameters and their OD. Then we find unknown concentrations from the fitted regression line.
Regression that involves only two variables ,one of which is dependent variable and the other is independent variable
is referred to as Simple regression.
The model associated with simple regression is called simple regression model which is given by
Y = α + βX + ε
Where
X is the independent variable
Y is the dependent variable
α is the intercept
β is the regression co-efficient (slope of the line) and
ε is the error term
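The estimate of β, denoted as $\hat{\beta}$, is calculated as
$\hat{\beta} = \frac{SP(X,Y)}{SS(X)} = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}$
The estimate of α, denoted as $\hat{\alpha}$, is calculated as
$\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$
Therefore the estimated regression equation becomes
$\hat{y} = \hat{\alpha} + \hat{\beta}x$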
Example :
Concentrations of blood glucose (mmol/l) and the corresponding ultraviolet absorbances are given in the
following table :
Concentration (mmol/l)   1     2      3      4      5
Absorbance               0.1   0.36   0.57   1.09   2.05
a. Calculate the slope and intercept
b. Find the regression equation
c. An unknown blood sample has an absorbance of 1.65. What is the concentration of glucose in the
sample?
Here absorbance is taken as X and concentration as Y.
X      Y   X²     Y²   XY
0.1    1   0.01   1    0.1
0.36   2   0.13   4    0.72
0.57   3   0.32   9    1.71
1.09   4   1.19   16   4.36
2.05   5   4.20   25   10.25
$\sum X$ = 4.17   $\sum Y$ = 15   $\sum X^2$ = 5.85   $\sum Y^2$ = 55   $\sum XY$ = 17.14
$\hat{\beta} = \frac{17.14 - \frac{4.17 \times 15}{5}}{5.85 - \frac{4.17^2}{5}}$
= 4.63/2.37
= 1.95
$\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$
= 3 - 1.95 × 0.83
= 1.38
$\hat{y} = \hat{\alpha} + \hat{\beta}x$
= 1.38 + 1.95x
Now the estimated concentration for an absorbance of 1.65 is
$\hat{y}$ = 1.38 + 1.95 × 1.65 = 4.60 mmol/l
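The slope, intercept and prediction of the glucose example can be checked with a least-squares sketch, taking absorbance as X and concentration as Y as in the calculation table. The function name is illustrative.

```python
absorbance = [0.1, 0.36, 0.57, 1.09, 2.05]   # X (independent variable)
concentration = [1, 2, 3, 4, 5]              # Y (dependent variable, mmol/l)

def fit_line(x, y):
    """Least-squares estimates: slope = SP(X,Y) / SS(X), intercept = ybar - slope * xbar."""
    n = len(x)
    sp = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    ss_x = sum(a * a for a in x) - sum(x) ** 2 / n
    slope = sp / ss_x
    intercept = sum(y) / n - slope * sum(x) / n
    return slope, intercept

slope, intercept = fit_line(absorbance, concentration)
predicted = intercept + slope * 1.65          # concentration at absorbance 1.65

print(round(slope, 2), round(intercept, 2), round(predicted, 2))
```

Working with unrounded coefficients gives a prediction of about 4.59 mmol/l; the hand calculation arrives at 4.60 because it uses the rounded values 1.38 and 1.95.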
Analysis of variance (ANOVA)
Experiment
An experiment is any process or study which results in the collection of data, the outcome of which is unknown.
For example ,before introducing a new drug treatment to reduce high blood pressure, the researcher carries out an
experiment to compare the effectiveness of the new drug with that of one currently prescribed.
Experimental Unit
A unit is a person, animal or thing which is actually studied by a researcher; the basic objects upon which the study or
experiment is carried out. For example, a person , a rat etc
Treatment
Treatment is something that researchers administer to experimental units. For example, a doctor treats patients with
a skin condition with different creams to see which is most effective.
Replication
Replication is the repetition of an experimental condition so that the variability associated with the phenomenon can
be estimated.
ANOVA :
ANOVA is a general method of analyzing data from designed experiments whose objective is to compare three or
more groups. It replaces multiple t tests with a single F test.
The analysis of variance, also referred to as the F test, was developed by R. A. Fisher, the British statistician.
One way ANOVA :
One-way ANOVA is used to test for differences among two or more independent groups. Typically, however,
one-way ANOVA is used to test for differences among at least three groups, since the two-group case can be covered by a
t-test. When there are only two means to compare, the t-test and the F-test are equivalent; the relation between ANOVA
and t is given by $F = t^2$.
Suppose we are exploring the relationship between training hours per week (the dependent variable) and sport (the
independent variable).Suppose sport has three levels : runners ,cyclists and swimmers. We can ask question – are there
differences overall between the sports ? The answer is given by the p value for sport in analysis of variance (ANOVA)
Criteria :
1. Independence of groups and observations within each group

2. The distributions in each of the groups should be normal
3. Variances of all groups should be the same
4. Dependent variable should be continuous
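[Diagram: the total sum of squares (SST) is partitioned into the between-groups sum of squares (SSBG) and the
within-groups sum of squares (SSWG); F = MSBG / MSWG]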
Procedure for One-Way ANOVA :
1. State the null hypothesis of equality of the group means
H0 : All the group means are equal
2. Define the α level (usually α = 0.05)
3. Calculate the F statistic
I. Calculate the total sum of squares (SST) and the between-groups sum of squares (SSBG)
$SST = \sum x^2 - \frac{(\sum x)^2}{N}$ and $SSBG = \sum_j \frac{(\sum x_j)^2}{n_j} - \frac{(\sum x)^2}{N}$
II. Calculate the within-groups sum of squares (SSWG) as the difference between total SS and between SS, given as SSWG =
SST - SSBG
III. Calculate degrees of freedom
Df for between groups = k - 1 and Df for within groups = N - k, where k is the number of groups and N is the total number of observations
IV. Calculate the mean sums of squares, i.e. the variance between groups (MSBG) and within groups (MSWG)
MSBG = SSBG/(k - 1) and MSWG = SSWG/(N - k)
V. Calculate the F value as F = MSBG / MSWG
4. Refer the F value to the table value with df n1 and n2, where n1 is the df for MSBG and n2 is the df for MSWG
5. Interpretation
Find the p value using the table. Compare this p value to the level of significance α, say α = 0.05. If p ≤ 0.05, reject the null
hypothesis and conclude that the treatment differences are significant; otherwise conclude that they are not significant.
Example :
Three different treatments are given to 3 groups of patients with anemia .Increase in Hb % level was
noted after one month and is given below. Find whether the difference in improvement in 3 groups
is significant or not
Group A 3 1 2 0 1 2 2
Group B 3 2 2 3 1 3 2
Group C 3 4 5 4 2 2 4
Solution :
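Table : Calculation of Hb % for three groups of patients
Group A   Group B   Group C   Group A    Group B    Group C
$x_1$     $x_2$     $x_3$     $x_1^2$    $x_2^2$    $x_3^2$
3         3         3         9          9          9
1         2         4         1          4          16
2         2         5         4          4          25
0         3         4         0          9          16
1         1         2         1          1          4
2         3         2         4          9          4
2         2         4         4          4          16
$\sum x_1$ = 11   $\sum x_2$ = 16   $\sum x_3$ = 24   $\sum x_1^2$ = 23   $\sum x_2^2$ = 40   $\sum x_3^2$ = 90
N = 7 + 7 + 7 = 21
$\sum x^2 = \sum x_1^2 + \sum x_2^2 + \sum x_3^2$ = 23 + 40 + 90 = 153
$\sum x = \sum x_1 + \sum x_2 + \sum x_3$ = 11 + 16 + 24 = 51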
Null hypothesis :
H0 : There is no difference in increase in Hb% between 3 groups
Calculation :
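1. Calculation of total sum of squares (SST) (for all values)
SST = $\sum x^2 - \frac{(\sum x)^2}{N}$
= $153 - \frac{51^2}{21}$ = 29.14
2. Calculation of between sum of squares (SSBG) (i.e. between groups)
SSBG = $\sum_j \frac{(\sum x_j)^2}{n_j} - \frac{(\sum x)^2}{N}$
= $\frac{11^2 + 16^2 + 24^2}{7} - \frac{51^2}{21}$ = 12.28
3. Calculation of within sum of squares (SSWG) (i.e. within groups)
SSWG = SST - SSBG = 29.14 - 12.28 = 16.86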
Table : Calculation for ANOVA
Source of variation   Degrees of freedom (df)   Sum of squares   Mean sum of squares   F
Between groups        2 (3-1)                   12.28            6.14                  6.53
Within groups         18 (21-3)                 16.86            0.94
Interpretation :
The calculated value (6.53) is greater than the tabulated value (3.55) at 2 and 18 df at the 5% level of significance. So p <
0.05. Hence we reject the null hypothesis and conclude that there are significant differences in the increase in Hb% between
the three groups.
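The sums of squares and the F ratio of the anemia example can be recomputed with the sketch below, which follows the same steps (SST, SSBG, SSWG, mean squares, F). The function name is illustrative, and no p value is computed here; the F ratio would still be compared with a tabulated value.

```python
groups = {
    "A": [3, 1, 2, 0, 1, 2, 2],
    "B": [3, 2, 2, 3, 1, 3, 2],
    "C": [3, 4, 5, 4, 2, 2, 4],
}

def one_way_anova(samples):
    """Return (SST, SSBG, SSWG, F) for a list of independent groups."""
    all_values = [x for sample in samples for x in sample]
    n_total = len(all_values)
    k = len(samples)
    correction = sum(all_values) ** 2 / n_total            # (sum x)^2 / N
    ss_total = sum(x * x for x in all_values) - correction
    ss_between = sum(sum(s) ** 2 / len(s) for s in samples) - correction
    ss_within = ss_total - ss_between
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ss_total, ss_between, ss_within, ms_between / ms_within

ss_total, ss_between, ss_within, f_value = one_way_anova(list(groups.values()))
print(round(ss_total, 2), round(ss_between, 2), round(ss_within, 2), round(f_value, 2))
```

Without intermediate rounding the F ratio comes out at about 6.56; the hand calculation gives 6.53 because it divides the rounded mean squares 6.14 and 0.94. The conclusion is the same either way.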
Definition of some basic terms in epidemiology
Ratio :
A measure comparing two different quantities, obtained by dividing one quantity by the other. Suppose a and b are
two different quantities. Then a/b or a : b is called the ratio of a to b.
Proportion :
A measure comparing two quantities in which the numerator is included in the denominator. If a and b are two
different quantities, then a/(a+b) is the proportion of a in a+b.
Example : The proportion of females in a class of males and females is Females/(Males + Females)
N.B. : The numerator is the upper part of a fraction and the denominator is the lower part of the fraction
Percentage :
Percentage is a proportion expressed per 100. The percentage of females in a class would be: Females/(Males + Females) × 100
Rate :
A rate is like a proportion in which the numerator is related to the denominator, but the denominator is
specific to a period of time; it is usually expressed per 1,000 or 10,000 population.
Crude death rate = (Number of deaths during the year / Mid-year population) × 1,000
Risk :
Risk is defined as the probability that an event will occur, for example, that an individual will become ill or die,
within a period of time.
If a population has N people and A of them develop the disease during a period of time, the
proportion A/N represents the risk of disease in the population over that period.
Risk = $\frac{A}{N}$
Incidence rate :
Incidence rate is defined as the ratio of the number of subjects developing the disease to the total time at risk
experienced by the subjects followed.
Incidence rate = $\frac{A}{\text{Total person-time}}$, where A represents the number of subjects developing the disease.
Suppose that we measure an incidence rate in a population as 47 cases occurring in 158 person-months.
Incidence rate = $\frac{47}{158}$ = 0.297 cases per person-month
Using person-years instead of person-months, the equation would be
Incidence rate = $\frac{47}{158/12}$ = 3.57 cases per person-year
Relation between Risk and Incidence Rate
Because the interpretation of risk is so much more straightforward than that of incidence rate, it is often
convenient to convert incidence rate measures into risk measures. The simplest formula to convert an incidence
rate to a risk is as follows
Risk = Incidence Rate × Time
Let’s see how this equation works. Suppose that we have a population of 10,000 people who experience an
incidence rate of lung cancer of 8 cases per 10,000 person-years. If we followed the population for 1 year,
the equation tells us that the risk of lung cancer would be 8 in 10,000 for the 1-year period, or 0.0008.
Risk = $\frac{8}{10{,}000 \text{ person-years}} \times 1 \text{ year}$ = 0.0008
If the same rate were experienced for only half a year, then the risk would be half of 0.0008, or 0.0004. The equation
calculates risk as directly proportional to both the incidence rate and the time period, so as the time period is
extended, the risk becomes proportionally greater.
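The incidence rate and the rate-to-risk conversion can be sketched as follows. The function names are illustrative, and the 158 months in the example are treated as person-time at risk, which is an assumption about the wording above.

```python
def incidence_rate(cases, person_time):
    """Number of new cases divided by the total person-time at risk."""
    return cases / person_time

def risk_from_rate(rate, time):
    """Simple conversion: risk = incidence rate * time."""
    return rate * time

rate_per_month = incidence_rate(47, 158)        # cases per person-month
rate_per_year = incidence_rate(47, 158 / 12)    # same data, per person-year

lung_cancer_rate = 8 / 10_000                   # cases per person-year
risk_one_year = risk_from_rate(lung_cancer_rate, 1)
risk_half_year = risk_from_rate(lung_cancer_rate, 0.5)

print(round(rate_per_month, 3), round(rate_per_year, 2), risk_one_year, risk_half_year)
```

The rate and the follow-up time must use the same time unit for the product to be a valid risk.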
Incidence
Incidence is a measure of disease that allows us to determine a person's probability of being diagnosed
with a disease during a given period of time. Therefore, incidence is the number of newly diagnosed cases
of a disease.
Incidence rate
An incidence rate is the number of new cases of a disease divided by the number of persons at risk for the
disease. If, over the course of one year, five women are diagnosed with breast cancer, out of a total female
study population of 200 (who do not have breast cancer at the beginning of the study period), the
incidence of breast cancer in this population was 0.025.
Prevalence
Prevalence is a measure of disease that allows us to determine a person's likelihood of having a disease.
Therefore, the number of prevalent cases is the total number of cases of disease existing in a population.
Prevalence rate
A prevalence rate is the total number of cases of a disease existing in a population divided by the total
population. So, if a measurement of cancer is taken in a population of 40,000 people and 1,200 were
recently diagnosed with cancer and 3,500 are living with cancer, then the prevalence of cancer is 0.118.
Incidence proportion
Incidence proportion is the number of new cases within a specified time period divided by the size of the
population initially at risk. For example, if a population initially contains 1,000 non-diseased persons and 28
develop a condition over two years of observation, the incidence proportion is 28 cases per 1,000 persons,
i.e. 2.8%.
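The incidence proportion is a single division, shown here as a small Python sketch. The function name `incidence_proportion` is a hypothetical choice for illustration. The same function also reproduces the breast cancer figure of 0.025 given earlier, since that example also divides new cases by the population initially at risk.

```python
def incidence_proportion(new_cases, initially_at_risk):
    """New cases in the period divided by the population initially at risk."""
    return new_cases / initially_at_risk


# 28 new cases among 1,000 initially non-diseased persons over two years
two_year_proportion = incidence_proportion(28, 1_000)

# 5 new breast cancer cases among 200 women over one year
breast_cancer_incidence = incidence_proportion(5, 200)

print(f"Two-year incidence proportion: {two_year_proportion:.1%}")
print(f"One-year breast cancer incidence: {breast_cancer_incidence:.3f}")
```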
Incidence vs. prevalence
Incidence | Prevalence
Incidence is a measurement of the number of new individuals who contract a disease during a particular period of time. | Prevalence is a measurement of all individuals affected by the disease within a particular period of time.
Incidence conveys information about the risk of contracting the disease. | Prevalence indicates how widespread the disease is.
Incidence is more useful when talking about diseases of short duration, such as chickenpox. | Prevalence is a useful parameter when talking about long-lasting diseases, such as HIV.
If, over the course of one year, five women are diagnosed with breast cancer, out of a total female study population of 200 (who do not have breast cancer at the beginning of the study period), the incidence of breast cancer in this population was 0.025. | If a measurement of cancer is taken in a population of 40,000 people and 1,200 were recently diagnosed with cancer and 3,500 are living with cancer, then the prevalence of cancer is 0.118.
Relative risk
Relative risk (RR) is the risk of an event (or of developing a disease) relative to exposure. Relative risk is a ratio of
the probability of the event occurring in the exposed group versus a non-exposed group.
RR = (risk in exposed group) / (risk in non-exposed group)
Example-1
In a study, the probability of developing lung cancer among smokers was 20% and among non-smokers 1%. This
situation is expressed in the 2 × 2 table. Calculate the relative risk.
              Lung cancer   No lung cancer
Smokers       a = 20        b = 80
Non-smokers   c = 1         d = 99
Here, a = 20 (%), b = 80, c = 1, and d = 99. Then the relative risk of cancer associated with smoking would be
RR = [a/(a + b)] / [c/(c + d)] = (20/100) / (1/100) = 20
Smokers would be twenty times as likely as non-smokers to develop lung cancer.
Example-2
Consider a study that examines the risk factors for breast cancer among women participating in the Survey. In a
sample of 4540 women who gave birth to their first child before the age of 25, 65 developed breast cancer. Of the
1628 women who first gave birth at age 25 or older, 31 were diagnosed with breast cancer. If we consider exposure to
be the condition of having first given birth at age 25 or older, then calculate the relative risk.
RR = (31/1628) / (65/4540) = 1.33
Women who first gave birth at 25 years of age or older are more likely to develop breast cancer than women
who first gave birth at a younger age.
Relative risk is also called the risk ratio because it is the risk in the exposed divided by the risk in the
unexposed.
It is suited to clinical trial data, where it is used to compare the risk of developing a disease in people not
receiving the new medical treatment (or receiving a placebo) versus people who are receiving an established
(standard of care) treatment.
RR = 1 indicates there is no association between disease and exposure
RR > 1 indicates occurrence of disease is higher in the exposed group than in the unexposed group
RR < 1 indicates occurrence of disease is lower in the exposed group than in the unexposed group
RR = 2 indicates occurrence of disease is two times as high in the exposed group as in the unexposed group
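The two relative risk examples can be checked with a short Python sketch. The function name `relative_risk` is hypothetical, and the sketch assumes the usual 2 × 2 layout in which a and b are the exposed with and without disease, and c and d are the unexposed with and without disease.

```python
def relative_risk(a, b, c, d):
    """Relative risk from 2 x 2 counts.

    a: exposed with disease      b: exposed without disease
    c: unexposed with disease    d: unexposed without disease
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed


# Example-1: 20% of smokers and 1% of non-smokers develop lung cancer
rr_smoking = relative_risk(20, 80, 1, 99)

# Example-2: exposure = first birth at age 25 or older
rr_first_birth = relative_risk(31, 1628 - 31, 65, 4540 - 65)

print(f"RR (smoking):     {rr_smoking:.2f}")
print(f"RR (first birth): {rr_first_birth:.2f}")
```

Both values are greater than 1, so in each example the disease occurs more often in the exposed group.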
Odds ratio
The odds ratio is the ratio of the odds of an event occurring in one group to the odds of it occurring in another
group. These groups might be men and women, or an experimental group and a control group.
If the probabilities of the event in each of the groups are p1 (first group) and p2 (second group), then the odds
ratio is:
OR = (p1/q1) / (p2/q2) = (p1 × q2) / (p2 × q1)
where qx = 1 − px.
OR = 1 indicates there is no association between disease and exposure
OR > 1 indicates the odds of disease are higher in the exposed group than in the unexposed group
OR < 1 indicates the odds of disease are lower in the exposed group than in the unexposed group
OR = 2 indicates the odds of disease are two times as high in the exposed group as in the unexposed group
Example-1
Suppose that in a sample of 100 men, 90 have drunk wine in the previous week, while in a sample of 100 women
only 20 have drunk wine in the same period. The odds of a man drinking wine are 90 to 10, or 9:1, while the odds
of a woman drinking wine are only 20 to 80, or 1:4 = 0.25:1. The odds ratio is thus 9/0.25, or 36, showing that men
are much more likely to drink wine than women.
Using the above formula for the calculation yields the same result:
OR = (0.9/0.1) / (0.2/0.8) = (0.9 × 0.8) / (0.2 × 0.1) = 0.72/0.02 = 36
Example-2
Among the 2914 women who had previously used oral contraceptives, 273 developed breast cancer and 2641 did
not. Of the 7976 women who had never used oral contraceptives, 716 developed breast cancer and 7260 did
not. Calculate the odds ratio.
OR = [(273/2914) / (1 − 273/2914)] / [(716/7976) / (1 − 716/7976)]
= (273/2641) / (716/7260)
= 1.05
The odds of developing breast cancer among users would be 1.05 times the odds among non-users.
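Both odds ratio examples can be reproduced with a brief Python sketch. The function names `odds` and `odds_ratio` are hypothetical. The sketch takes the two event probabilities as input and divides the odds in the first group by the odds in the second, where the odds of an event with probability p are p/(1 − p).

```python
def odds(p):
    """Odds of an event with probability p, i.e. p / (1 - p)."""
    return p / (1 - p)


def odds_ratio(p1, p2):
    """Odds in the first group divided by odds in the second group."""
    return odds(p1) / odds(p2)


# Example-1: 90 of 100 men and 20 of 100 women drank wine
or_wine = odds_ratio(90 / 100, 20 / 100)

# Example-2: breast cancer among users and non-users of oral contraceptives
or_contraceptives = odds_ratio(273 / 2914, 716 / 7976)

print(f"OR (wine):           {or_wine:.2f}")
print(f"OR (contraceptives): {or_contraceptives:.2f}")
```

In the second example the odds ratio is close to 1, so the odds of disease in the two groups are nearly the same.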
Attributable risk
Attributable risk is the portion of the incidence of a disease in the exposed that is due to the exposure. It is the
incidence of a disease in the exposed that would be eliminated if exposure were eliminated.
Attributable risk implies that not all of the disease incidence is due to the exposure, since even some unexposed
individuals (for example, non-smokers) develop the disease.
Thus, incidence in exposed group = incidence not due to the exposure + incidence due to the exposure
Incidence in unexposed group = incidence not due to the exposure
Therefore, the incidence in the exposed group that is attributable to the exposure can be calculated by
subtracting:
Incidence due to the exposure = incidence in exposed group − incidence in unexposed group
Example:
Table: Hypothetical data giving 1-year disease risks for exposed and unexposed people. Calculate attributable risk.
             Unexposed   Exposed   Total
Disease      900         500       1,400
No disease   89,100      9,500     98,600
Total        90,000      10,000    100,000
Incidence in exposed group = 500/10,000 = 0.05
Incidence in unexposed group = 900/90,000 = 0.01
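Attributable risk = (incidence in exposed − incidence in unexposed) / incidence in exposed = (0.05 − 0.01)/0.05 = 0.8
That is, 80% of the disease incidence in the exposed group is attributable to the exposure.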
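The value 0.8 in this example is consistent with dividing the excess incidence in the exposed group by the total incidence in the exposed group. The Python sketch below assumes that reading. The function name `attributable_fraction` is hypothetical.

```python
def attributable_fraction(incidence_exposed, incidence_unexposed):
    """Share of the incidence in the exposed that is due to the exposure.

    Assumption: excess incidence (exposed minus unexposed) is divided by
    the incidence in the exposed, which matches the example's result of 0.8.
    """
    excess = incidence_exposed - incidence_unexposed
    return excess / incidence_exposed


incidence_exposed = 500 / 10_000    # 0.05
incidence_unexposed = 900 / 90_000  # 0.01

fraction = attributable_fraction(incidence_exposed, incidence_unexposed)

print(f"Incidence due to exposure: {incidence_exposed - incidence_unexposed:.2f}")
print(f"Attributable fraction:     {fraction:.2f}")
```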
Mathematics:
Example-1
Table: Breast cancer cases and person-years of observation for women with tuberculosis repeatedly exposed to
multiple x-ray fluoroscopies and unexposed women with tuberculosis.
                      Radiation exposure
                      Yes       No        Total
Breast cancer cases   41        15        56
Person-years          28,010    19,017    47,027
Calculate rates (cases/10,000 person-years).
Solution:
Rate for exposed group = (41/28,010) × 10,000 = 14.6
Rate for unexposed group = (15/19,017) × 10,000 = 7.9
Rate for total population = (56/47,027) × 10,000 = 11.9
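These three rates can be verified with a small Python sketch. The function name `rate_per_10000` is hypothetical. It divides cases by person-years and scales the result to cases per 10,000 person-years.

```python
def rate_per_10000(cases, person_years):
    """Incidence rate expressed as cases per 10,000 person-years."""
    return cases / person_years * 10_000


rate_exposed = rate_per_10000(41, 28_010)
rate_unexposed = rate_per_10000(15, 19_017)
rate_total = rate_per_10000(56, 47_027)

print(f"Exposed:   {rate_exposed:.1f}")
print(f"Unexposed: {rate_unexposed:.1f}")
print(f"Total:     {rate_total:.1f}")
```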
Example-2
Table: Hypothetical data giving 1-year disease risks for people at three levels of exposure
                       Exposure
             None      Low       High      Total
Disease      100       1,200     1,200     2,500
No disease   9,900     58,800    28,800    97,500
Total        10,000    60,000    30,000    100,000
From the above table, calculate the risk, the risk ratio, and the proportion of all cases at each exposure level.
Solution:
Proportion of all cases, no exposure = 100/2,500 = 0.04
Proportion of all cases, low exposure = 1,200/2,500 = 0.48
Proportion of all cases, high exposure = 1,200/2,500 = 0.48
Risk, no exposure = 100/10,000 = 0.01
Risk, low exposure = 1,200/60,000 = 0.02
Risk, high exposure = 1,200/30,000 = 0.04
Risk ratio, no exposure (reference group) = 0.01/0.01 = 1
Risk ratio, low exposure = 0.02/0.01 = 2
Risk ratio, high exposure = 0.04/0.01 = 4
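The whole solution can be reproduced in one pass with a short Python sketch. The variable names and the dictionary layout are hypothetical choices for illustration. The sketch assumes the "None" exposure column is the reference group for the risk ratios, as in the solution above.

```python
# (disease cases, total people) for each exposure level, from the table
levels = {
    "None": (100, 10_000),
    "Low": (1_200, 60_000),
    "High": (1_200, 30_000),
}

total_cases = sum(cases for cases, _ in levels.values())

# Assumption: the unexposed ("None") group is the reference for risk ratios
reference_cases, reference_total = levels["None"]
reference_risk = reference_cases / reference_total

summary = {}
for name, (cases, total) in levels.items():
    risk = cases / total
    summary[name] = {
        "risk": risk,
        "risk_ratio": risk / reference_risk,
        "proportion_of_cases": cases / total_cases,
    }

for name, row in summary.items():
    print(
        f"{name:5s} risk={row['risk']:.2f} "
        f"risk ratio={row['risk_ratio']:.0f} "
        f"proportion of cases={row['proportion_of_cases']:.2f}"
    )
```

The low and high exposure groups each contribute the same share of all cases, although the risk in the high exposure group is twice the risk in the low exposure group, because the low exposure group is twice as large.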
