A single value that can represent a whole set of data is called an average. Because such a value tends to lie at or
indicate the center of the distribution, it is called a measure of central tendency; because it locates the
general position of the data, it is also called a measure of location.
Mean :
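It is the most familiar measure of central tendency. It is commonly termed the average or, more accurately,
the arithmetic mean. It is defined as the sum of the observations divided by the number of observations, i.e.
x̄ = (x1 + x2 + ... + xn) / n = Σx / n
where x̄ is the mean, Σx is the sum of the observations and n is the number of observations.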
Merits:
Demerits:
Median :
The median is the middle value in the arrayed data. When the data are arranged in order, the median is
the middle value if the number of values is odd, and the mean of the two middle values if the number of
values is even. In other words, the value that divides the arrayed set of data into two equal parts is called
the median.
The average of the two middlemost terms is (2 + 5)/2 = 3.5. Therefore, the median is 3.5 since it is the
average of the middle observations in the ordered list.
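For grouped data,
Median = L + ((n/2 − F) / f) × c
where
L = the lower limit of the median class (the median class is the class that contains the (n/2)th observation of the
series), F = the cumulative frequency of the classes before the median class, f = the frequency of the median
class, c = the class width and n = the total number of observations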
Advantages :
Disadvantages :
It is not based on all the observations
Compared to the mean, it is less reliable for drawing inferences and harder to use with advanced statistics.
Mode :
The mode is the value that occurs most frequently in the data. When every value occurs the same number of
times, there is no mode. If two or more values share the highest frequency, there are two or more modes and
the distribution is said to be multimodal. If the data have only one mode, the distribution is said to be
unimodal; if they have two modes, it is said to be bimodal.
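For grouped data,
Mode = L + ((f1 − f0) / (2f1 − f0 − f2)) × c
where
L = lower limit of the modal class (the modal class is the class where the frequency is maximum), f1 = frequency
of the modal class, f0 = frequency of the class before it, f2 = frequency of the class after it and c = the class width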
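For ungrouped data, the mean, median and mode described above can be checked with Python's standard statistics module. The following is a minimal sketch; the observations are hypothetical values chosen only for illustration.

```python
import statistics

# Hypothetical observations, chosen only for illustration.
values = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(values)        # sum of observations / number of observations
median = statistics.median(values)    # even count: mean of the two middle values
modes = statistics.multimode(values)  # every value tied for the highest frequency

print(mean, median, modes)
```

Note that statistics.multimode returns every value when all values occur equally often, which is the case the text describes as having no mode.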
Advantages:
Disadvantages:
Example:
Here n = 75, therefore n/2 = 75/2 = 37.5. The (n/2)th value (37.5) falls in the class 17-21, so the median class is 17-21.
That is, L = 17, F = 35, f = 19 and c = 4
Median = L + C = 17.66
Calculation of Mode :
L = 17 c=4
Mode = L + c = 18.78
Measures of dispersion
An average, such as the mean or the median, only locates the centre of the data. It does not tell
us anything about the spread of the data.
Dispersion
The degree to which numerical data tend to spread about an average value is called the dispersion or
variation of the data.
Measures of dispersion quantify the extent to which individual values deviate from the central value. The measures discussed here are:
Range
Standard deviation and variance
Standard error
Co-efficient of variation
Range :
Range is defined as the difference between the maximum and the minimum values of a given data set.
It is the simplest measure of dispersion and gives a general idea of the total spread of the observations.
If Xm denotes the maximum observation and X0 denotes the minimum observation, then the range is defined as
Range = Xm − X0
Uses :
Used to define the normal limits of biological characteristics, e.g. glucose, blood cholesterol,
haemoglobin, bilirubin etc.
Merits :
Limitations :
It does not take all items of the distribution into account; only the two extreme values are
considered.
It is a poor measure of dispersion and does not give a good picture of the variability.
Standard deviation
The standard deviation is defined as the positive square root of the mean of the squared deviations taken
from the arithmetic mean of the data. It is the most commonly used measure of dispersion.
For sample data the standard deviation is denoted by s (σ is used for a population) and is defined as:
S=
For example, for the numbers 1, 2 and 3, the mean is 2 and the standard deviation is
S=
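The standard deviation of the numbers 1, 2 and 3 can be checked with a short sketch. Two divisor conventions are in common use, and which one the text intends is an assumption: dividing by n matches the verbal definition above (the mean of the squared deviations), while dividing by n − 1 is the usual sample standard deviation. Both are shown.

```python
import statistics

data = [1, 2, 3]

# Divisor n: square root of the mean of the squared deviations.
sd_divisor_n = statistics.pstdev(data)

# Divisor n - 1: the usual sample standard deviation.
sd_divisor_n_minus_1 = statistics.stdev(data)

print(round(sd_divisor_n, 3), round(sd_divisor_n_minus_1, 3))
```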
Merits :
Uses :
It is of great importance for the analysis of data and for the various statistical inferences.
Demerits :
The Variance
Variance is another absolute measure of dispersion. It is defined as the average of the squared difference
between each of the observations in a set of data and the mean.
Standard error :
Standard error is defined as the standard deviation of the sample observations divided by the square root of
the sample size, given as
SE (mean) = SD / √n
where SD is the standard deviation of the sample observations and n is the sample size.
Uses :
Z= or
Coefficient of Variation:
The most important of the relative measures of dispersion is the coefficient of variation. The coefficient
of variation (CV) is defined as the SD divided by the mean, given as
CV = (SD / mean) × 100
It is expressed as a percentage.
The coefficient of variation is used to assess the consistency of the data. By consistency we mean the uniformity
of the values of the data/distribution about the arithmetic mean. A distribution with
a smaller CV is taken to be more consistent than one with a larger CV.
CV is also very useful for comparing the variability of two populations that are expressed in different units of
measurement.
Example :
In a survey it was observed that the mean height of 100 adults was 170 cm with an SD of 12 cm, and the
mean height of 100 children was 50 cm with an SD of 7 cm. Which group shows greater
variation?
            Mean      SD       CV
Adults      170 cm    12 cm    7.1 %
Children    50 cm     7 cm     14.0 %
Although adult height has the larger SD, children's height shows the greater relative variation, because its
CV (14.0 %) is greater than the CV of adults' height (7.1 %).
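The CV comparison in this example can be reproduced with a few lines of Python. This is a minimal sketch; the function name coefficient_of_variation is an illustrative choice, not something defined in the text.

```python
def coefficient_of_variation(sd, mean):
    """Return the coefficient of variation as a percentage: (SD / mean) * 100."""
    return sd / mean * 100

# Means and SDs from the height example.
cv_adults = coefficient_of_variation(sd=12, mean=170)
cv_children = coefficient_of_variation(sd=7, mean=50)

print(round(cv_adults, 1), round(cv_children, 1))
```

Because the CV is unit-free, the same function can compare characters measured in different units.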
Probability
Experiment (E ):
An experiment is any well-defined action that may result in a number of outcomes. For example, the rolling of
a die can be considered an experiment.
Outcome (O) :
The sample space is defined as the set of all possible outcomes of an experiment. It is usually denoted by S.
For example, when a coin is tossed, it has two possible outcomes: head and tail.
Head is denoted by H and tail by T. Thus the sample space consists of head and tail. In set theory
notation, we can write S as:
S = {Head, Tail} or S = {H, T}
Event :
Two or more events are said to be equally likely if each one of them has an equal chance of occurring.
Two or more events are said to be mutually exclusive when the occurrence of any one of them excludes the
occurrence of the others.
Probability
Probability is a numerical measure of the likelihood of an event relative to a set of alternative events. For
example, there is a 50% probability of observing heads rather than tails when flipping a fair coin.
If an experiment can produce N mutually exclusive and equally likely outcomes, of which n outcomes are
favorable to the occurrence of event A, then the probability of A is denoted by P(A) and is defined as the ratio
n/N. Thus the probability of A is given by
P(A) = n / N = (number of outcomes favorable to A) / (total number of possible outcomes)
Properties of probability :
For any event, the probability of the event is greater than or equal to zero and less than or equal to 1.
When an event is certain to occur, it has a probability equal to 1; when it is impossible for the event to occur, it
has a probability equal to 0.
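The classical definition P(A) = n/N and the two boundary properties can be illustrated with a short sketch. The die example and the helper name probability are assumptions made for illustration; exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def probability(event, sample_space):
    """Classical probability: favorable outcomes / all equally likely outcomes."""
    favorable = event & sample_space  # only outcomes inside the sample space count
    return Fraction(len(favorable), len(sample_space))

die = {1, 2, 3, 4, 5, 6}   # sample space for one roll of a fair die
even = {2, 4, 6}           # event A: an even number is rolled

p_even = probability(even, die)        # n/N = 3/6
p_impossible = probability({7}, die)   # an outcome that cannot occur
p_certain = probability(die, die)      # an event that is certain to occur

print(p_even, p_impossible, p_certain)
```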
Statistical hypothesis
A statistical hypothesis is an assertion or statement about a population or the probability distribution characterizing a
population which we want to verify on the basis of information contained in a sample.
Hypothesis testing
The procedure of decision-making between two contending hypotheses through proper statistical tests is referred to
as hypothesis testing or test of hypothesis.
Types of hypothesis
Statistical tests are concerned with two types of hypothesis: the null hypothesis and the alternative hypothesis.
Null hypothesis
The hypothesis which is to be tested is called the null hypothesis. It is denoted by H0. It is the starting point of the
investigation.
For example, suppose we suspect that the use of coffee increases the chance of heart attack. To start with, we assume that heart attack has no link with the use
of coffee. This is taken as H0, and we hope it will be rejected by the sample data.
Alternative Hypothesis
The hypothesis which is accepted when the null hypothesis has been rejected is called the alternative hypothesis. It is
denoted by H1 or Ha. Whatever we are expecting from the sample data is taken as the alternative hypothesis.
For example, suppose we expect that oral contraceptives cause breast cancer and hope to get this result from the sample. This is taken as the
alternative hypothesis, and the null hypothesis is that oral contraceptives do not cause breast cancer.
Test Statistic
A statistic on which the decision whether to accept or reject a hypothesis can be based is called a test statistic.
4. Calculate the test statistic (z, t, F, χ² etc.) and determine the p value (the probability of obtaining a result at least as extreme as the one observed if the null hypothesis is true)
5. Draw a conclusion on the basis of the p value, i.e. decide whether the observed difference is statistically significant or not.
If the p value is less than or equal to α, reject the null hypothesis and conclude that the observed difference is statistically
significant. If the p value is greater than α, do not reject the null hypothesis and conclude that the observed difference
is not statistically significant.
[Flowchart: State hypothesis → Make statistical decision → either reject H0 (conclude H0 is rejected) or do not reject H0 (conclude H0 may be accepted)]
z-test
A statistical test to determine whether the difference between two means is significant or not is referred to as a z-test.
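2. Calculate z value
z = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)
where x̄1 and x̄2 are the two sample means, s1 and s2 are the sample standard deviations and n1 and n2 are the sample sizes.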
4. Refer the z value to a statistical table to find the p value. If p ≤ 0.05, reject H0; otherwise do not reject it.
Example :
The haemoglobin (Hb) level of children was measured in 143 girls and 127 boys. The results are summarized in table.
Null hypothesis :
There is no significant difference between the haemoglobin (Hb) level of girls and boys
Calculation :
Therefore z = 1.22
Interpretation :
Since z < 1.96 , p > 0.05 ,the difference is not statistically significant. Hence we conclude that there is no significant
difference between the haemoglobin (Hb) level of girls and boys
Confidence Interval
Lower limit
= (11.2 − 11.0) − 1.96 × SE
= −0.12
Upper limit
= (11.2 − 11.0) + 1.96 × SE
= 0.52
where SE is the standard error of the difference between the two means (about 0.16 here, since z = 0.2/SE = 1.22).
Therefore, the 95% confidence interval for the difference between the two population means is (−0.12, 0.52).
Interpretation :
Since this confidence interval includes zero, we cannot reject the null hypothesis. Hence we conclude that there is no
significant difference between the haemoglobin (Hb) level of girls and boys at the 5% level of significance.
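The z-test and the confidence interval for a difference between two means can be sketched in Python as follows. The sample sizes (143 and 127) and the means (11.2 and 11.0) come from the example; the standard deviations 1.4 and 1.3, and the pairing of each mean with a sample size, are hypothetical values assumed here because the summary table is not reproduced. The function name two_sample_z is also an illustrative choice.

```python
import math

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2, z_crit=1.96):
    """z statistic, two-sided p value and confidence interval for mean1 - mean2."""
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)   # standard error of the difference
    diff = mean1 - mean2
    z = diff / se
    p = math.erfc(abs(z) / math.sqrt(2))         # two-sided p value from the normal curve
    ci = (diff - z_crit * se, diff + z_crit * se)
    return z, p, ci

# SDs of 1.4 and 1.3 are assumed (hypothetical) values, not taken from the text.
z, p, ci = two_sample_z(11.2, 1.4, 143, 11.0, 1.3, 127)
print(round(z, 2), round(p, 3), tuple(round(limit, 2) for limit in ci))
```

With these assumed standard deviations the sketch gives a z value and interval in line with the worked example: z is below 1.96 and the interval includes zero, so the null hypothesis is not rejected.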
Correlation
Correlation refers to the relationship between two continuous variables, say X and Y, where each particular
value of X is paired with one particular value of Y. Examples are the heights of individual human subjects
paired with their corresponding weights, or the number of hours that individual students in a statistics
course spend studying prior to an exam paired with their corresponding performance on the exam.
When the two variables are meaningfully related and both increase or both decrease together, the
correlation is termed positive. For example, the length of an iron bar increases as the temperature increases.
If an increase in one variable is associated with a decrease in the other variable, the correlation is termed negative.
For example, the volume of a gas decreases as the pressure increases, or the demand for a particular commodity
increases as its price decreases.
If there is no relationship between the two variables such that the value of one variable change and the other variable
- The number of hours spend studying prior to an exam and performance on the exam
Types of correlation :
1. Perfect correlation
2. Partial correlation
In perfect positive correlation, the two variables (say X and Y) are directly proportional to each other, i.e. both variables rise or fall in the same
proportion. r = +1 indicates that X and Y are perfectly related in a positive linear sense. All points in a scatter diagram lie on
a straight line that has a positive slope. An example is the relationship between voltage and current in an
electrical circuit.
In perfect negative correlation, as one variable rises the other falls in the same proportion. r
= −1 indicates that X and Y are perfectly related in a negative linear sense. All points in a scatter diagram lie on a straight
line that has a negative slope. An example is the relationship between pressure and volume of a gas at a particular
temperature.
Partial correlation
A value of r close to +1 indicates a strong linear relationship with positive slope. In this case, the values of r
lie between 0 and +1, i.e. 0 < r < 1. Examples are age of husband and age of wife, glucose and HbA1c etc.
A value of r close to −1 indicates a strong linear relationship with negative slope. In this case, the values of r
lie between 0 and −1, i.e. −1 < r < 0. Examples include income and malnutrition, income and infant mortality etc.
Coefficient of Correlation
The coefficient of correlation is a quantitative measure of the direction and strength of the linear relationship between two
numerically measured variables. It is denoted by r.
If, for two variables X and Y, SS(X) and SS(Y) stand for their sums of squares and SP(X,Y) for their sum of
products, then r is defined as
r = SP(X,Y) / √(SS(X) × SS(Y))
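The definition of r in terms of SS(X), SS(Y) and SP(X,Y) can be written out directly. This is a minimal sketch; the paired observations are hypothetical and the function name pearson_r is an illustrative choice.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r = SP(X,Y) / sqrt(SS(X) * SS(Y))."""
    n = len(xs)
    ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n                  # SS(X)
    ss_y = sum(y * y for y in ys) - sum(ys) ** 2 / n                  # SS(Y)
    sp_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n  # SP(X,Y)
    return sp_xy / math.sqrt(ss_x * ss_y)

# Hypothetical paired observations, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print(round(r, 4))
```

Data lying exactly on a rising straight line give r = +1, and data on a falling straight line give r = −1.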
Properties of r :
Scatter diagram
A scatter diagram is a tool for analyzing relationships between two variables , i.e., how one variable changes with the
other variable. This diagram simply plots pairs of corresponding data from two variables, which are usually two
variables in a process being studied. The scatter diagram does not determine the exact relationship between the two
variables, but it does indicate whether they are correlated or not.
The scatter diagram is used to 1) quickly confirm a hypothesis that two variables are correlated 2) provide a graphical
representation of the strength of the relationship between two variables
Example :
H0 : r = 0
Calculation :
Calculate , , , ,
r = SP(X,Y) / √(SS(X) × SS(Y))
= 0.90
Interpretation :
The calculated value 0.90 exceeds the tabulated value 0.878 at the 5% level of significance (α = 0.05) with 7 − 2 =
5 degrees of freedom. So p < 0.05, and we reject H0 and conclude that a significant relationship exists between the weight and length of mice.
Regression:
The word regression was first used by Francis Galton in 1885. It is defined as “the dependence of one variable upon another
variable”. For example, weight depends upon height.
In regression we can estimate the unknown values of one (dependent) variable from known values of the other
(independent) variable.
Regression procedures are very widely used in research involved in the social and natural sciences ,especially in the
medical and health sciences. For example ,for the assessment of nutritional status the concept of regression is applied
to develop standard chart of height and weight for normal healthy population. From this chart we can find standard
weight of an individual if we know his/her height.
Often in biochemical tests ,we apply regression to find concentration of blood glucose ,cholesterol ,TG ,insulin
,creatinine etc from absorbance (optical density ,OD).We first develop a standard regression curve from data on
concentration of these parameters and their OD. Then we find unknown concentrations from the fitted regression line.
Regression that involves only two variables, one of which is the dependent variable and the other the independent variable,
is referred to as simple regression.
The model associated with simple regression is called the simple regression model, which is given by
Y = α + βX + ε
where
α is the intercept, β is the slope (regression coefficient) and ε is the random error. They are estimated from the data as
β = SP(X,Y) / SS(X)
α = ȳ − β x̄
Example :
The concentration (mmol/l) of blood glucose and the corresponding ultraviolet absorbance are given in the
following table :
Concentration   1     2      3      4      5
Absorbance      0.1   0.36   0.57   1.09   2.05
a. Calculate the slope and intercept
c. An unknown blood sample has an absorbance of 1.65. What is the concentration of glucose in the
sample?
X (absorbance)   Y (concentration)   X²     Y²    XY
0.10             1                   0.01   1     0.10
0.36             2                   0.13   4     0.72
0.57             3                   0.32   9     1.71
1.09             4                   1.19   16    4.36
2.05             5                   4.20   25    10.25
ΣX = 4.17        ΣY = 15             ΣX² = 5.85   ΣY² = 55   ΣXY = 17.14
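SP(X,Y) = ΣXY − (ΣX)(ΣY)/n = 17.14 − (4.17 × 15)/5 = 4.63
and
SS(X) = ΣX² − (ΣX)²/n = 5.85 − (4.17)²/5 = 2.37
β = SP(X,Y) / SS(X) = 4.63/2.37
= 1.95
α = ȳ − β x̄
= 3 − 1.95 × 0.83
= 1.38
y = α + βx
= 1.38 + 1.95x
Now the estimated concentration for a sample with an absorbance of 1.65 is
y = 1.38 + 1.95 × 1.65 ≈ 4.60 mmol/l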
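The slope, intercept and estimated concentration for this example can be recomputed from the raw data. The sketch below uses the sum-of-products and sum-of-squares form of the least-squares estimates; variable names are illustrative. Small differences from the hand calculation come from rounding intermediate values.

```python
absorbance = [0.10, 0.36, 0.57, 1.09, 2.05]   # X, independent variable
concentration = [1, 2, 3, 4, 5]               # Y, dependent variable (mmol/l)

n = len(absorbance)
mean_x = sum(absorbance) / n
mean_y = sum(concentration) / n

# Sum of products SP(X,Y) and sum of squares SS(X).
sp_xy = (sum(x * y for x, y in zip(absorbance, concentration))
         - sum(absorbance) * sum(concentration) / n)
ss_x = sum(x * x for x in absorbance) - sum(absorbance) ** 2 / n

slope = sp_xy / ss_x                   # regression coefficient
intercept = mean_y - slope * mean_x    # value of Y when X = 0

def predict_concentration(a):
    """Estimated concentration (mmol/l) for a given absorbance."""
    return intercept + slope * a

estimate = predict_concentration(1.65)
print(round(slope, 2), round(intercept, 2), round(estimate, 2))
```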
Experiment
An experiment is any process or study which results in the collection of data, the outcome of which is unknown.
For example ,before introducing a new drug treatment to reduce high blood pressure, the researcher carries out an
experiment to compare the effectiveness of the new drug with that of one currently prescribed.
Experimental Unit
A unit is a person, animal or thing which is actually studied by a researcher; the basic objects upon which the study or
experiment is carried out. For example, a person , a rat etc
Treatment
A treatment is something that researchers administer to experimental units. For example, a doctor treats patients with
a skin condition with different creams to see which is most effective.
Replication
Replication is the repetition of an experimental condition so that the variability associated with the phenomenon can
be estimated.
ANOVA :
ANOVA is a general method of analyzing data from designed experiments whose objective is to compare three or
more groups. It replaces multiple t tests with a single F test.
The analysis of variance is also referred to as the F test; it was developed by R. A. Fisher, the British statistician.
One-way ANOVA is used to test for differences among two or more independent groups. Typically, however, the one-way
ANOVA is used to test for differences among at least three groups, since the two-group case can be covered by a
t-test. When there are only two means to compare, the t-test and the F-test are equivalent; the relation between ANOVA
and t is given by F = t².
Suppose we are exploring the relationship between training hours per week (the dependent variable) and sport (the
independent variable). Suppose sport has three levels: runners, cyclists and swimmers. We can ask the question: are there
overall differences between the sports? The answer is given by the p value for sport in the analysis of variance (ANOVA).
Criteria :
SST = SSBG + SSWG
F = MSBG / MSWG
I. Calculate the total sum of squares (SST) and the between-group sum of squares (SSBG)
II. Calculate the within-group sum of squares (SSWG) as the difference between the total SS and the between SS, given as SSWG =
SST − SSBG
III. Calculate degrees of freedom
IV. Calculate the mean sums of squares, i.e. the variance between groups (MSBG) and within groups (MSWG)
V. Interpretation
Find the p value using the table. Compare this p value to the level of significance α, say α = 0.05. If p ≤ 0.05, reject the null hypothesis.
Example :
Three different treatments are given to 3 groups of patients with anemia. The increase in Hb % level was
noted after one month and is given below. Find whether the difference in improvement between the 3 groups
is significant or not.
Group A 3 1 2 0 1 2 2
Group B 3 2 2 3 1 3 2
Group C 3 4 5 4 2 2 4
Solution :
N = 7 + 7 + 7 = 21
ΣX² = ΣXA² + ΣXB² + ΣXC² = 23 + 40 + 90 = 153
ΣX = ΣXA + ΣXB + ΣXC = 11 + 16 + 24 = 51
Null hypothesis :
Calculation :
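SST = ΣX² − (ΣX)²/N
= 153 − (51)²/21 = 29.14
2. Calculation of between sum of squares (SSBG) (i.e. between groups)
SSBG = [(ΣXA)²/nA + (ΣXB)²/nB + (ΣXC)²/nC] − (ΣX)²/N
= (11² + 16² + 24²)/7 − (51)²/21 = 136.14 − 123.86 = 12.28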
Interpretation :
The calculated value (6.53) is greater than the tabulated value (3.55) at 2 and 18 df at the 5% level of significance. So p <
0.05. Hence we reject the null hypothesis and conclude that there are significant differences in the increase in Hb% between
the three groups.
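The sums of squares and the F ratio for this example can be recomputed from the raw data. The following is a minimal sketch of the steps listed earlier (SST, SSBG, SSWG by subtraction, then the mean squares); the function name one_way_anova is an illustrative choice. Without rounding intermediate values the F ratio comes out at about 6.56, close to the hand-calculated 6.53.

```python
def one_way_anova(groups):
    """Return SST, SSBG, SSWG and F for a list of groups of observations."""
    all_values = [x for group in groups for x in group]
    n_total = len(all_values)
    correction = sum(all_values) ** 2 / n_total                 # (sum of X)^2 / N

    ss_total = sum(x * x for x in all_values) - correction      # SST
    ss_between = sum(sum(g) ** 2 / len(g) for g in groups) - correction  # SSBG
    ss_within = ss_total - ss_between                           # SSWG = SST - SSBG

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)  # MSBG / MSWG
    return ss_total, ss_between, ss_within, f_ratio

group_a = [3, 1, 2, 0, 1, 2, 2]
group_b = [3, 2, 2, 3, 1, 3, 2]
group_c = [3, 4, 5, 4, 2, 2, 4]

ss_total, ss_between, ss_within, f_ratio = one_way_anova([group_a, group_b, group_c])
print(round(ss_total, 2), round(ss_between, 2), round(ss_within, 2), round(f_ratio, 2))
```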
Definition of some basic terms in epidemiology
Ratio :
A measure for comparing two different values, obtained by dividing one quantity by another. Suppose a and b are
two different quantities. Then a/b or a : b is called the ratio of a to b.
Proportion :
A measure for comparing two values in which the numerator is included in the denominator. If a and b are two
different quantities, then a/(a+b) is the proportion of a to a+b.
N.B.: The numerator is the upper part of a fraction and the denominator is the lower part of the fraction.
Percentage :
Percentage is a proportion expressed per 100. The percentage of females in a class would be: (number of females / total number of students) × 100
Rate :
A rate is like a proportion in which the numerator is related to the denominator, but the
denominator is specific to a period of time; it is usually expressed per 10,000.
Risk :
Risk is defined as the probability that an event will occur, for example, that an individual will become ill or die,
within a period of time.
If a population has N people and A people out of the N develop the disease during a period of time, the
proportion A/N represents the risk of disease in the population over that period.
Risk = A / N
Incidence rate :
Incidence rate is defined as the ratio of the number of subjects developing disease to the total time experienced by
the subjects followed.
Incidence rate = A / Time, where A represents the number of subjects developing disease and Time is the total time at risk experienced by the subjects followed.
Suppose that we measure an incidence rate in a population as 47 cases occurring in 158 months.
Incidence rate = 47 cases / 158 months ≈ 0.30 cases per month of person-time
Because the interpretation of risk is so much more straightforward than that of incidence rate, it is often
convenient to convert incidence rate measures into risk measures. The simplest formula to convert an incidence
rate to a risk is as follows:
Risk ≈ Incidence rate × Time
Let’s see how this equation works. Suppose that we have a population of 10,000 people who experience an
incidence rate of lung cancer of 8 cases per 10,000 person-years. If we followed the population for 1 year,
the equation tells us that the risk of lung cancer would be 8 in 10,000 for the 1-year period, or 0.0008.
Risk = (8 cases / 10,000 person-years) × 1 year = 0.0008
If the same rate were experienced for only half a year, then the risk would be half of 0.0008, or 0.0004. The equation
calculates risk as directly proportional to both the incidence rate and the time period, so as the time period is
extended, the risk becomes proportionally greater.
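The conversion from incidence rate to risk described here can be sketched as a one-line function. The function name risk_from_rate is an illustrative choice, and the conversion is an approximation that is reasonable only when the resulting risk is small.

```python
def risk_from_rate(incidence_rate, time):
    """Approximate risk as incidence rate multiplied by the length of follow-up."""
    return incidence_rate * time

rate = 8 / 10_000   # 8 cases per 10,000 person-years, from the example

risk_one_year = risk_from_rate(rate, 1)       # followed for 1 year
risk_half_year = risk_from_rate(rate, 0.5)    # followed for half a year

print(risk_one_year, risk_half_year)
```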
Incidence
Incidence is a measure of disease that allows us to determine a person's probability of being diagnosed
with a disease during a given period of time. Therefore, incidence is the number of newly diagnosed cases
of a disease.
Incidence rate
An incidence rate is the number of new cases of a disease divided by the number of persons at risk for the
disease. If, over the course of one year, five women are diagnosed with breast cancer, out of a total female
study population of 200 (who do not have breast cancer at the beginning of the study period), the
incidence of breast cancer in this population was 0.025.
Prevalence
Prevalence is a measure of disease that allows us to determine a person's likelihood of having a disease.
Therefore, the number of prevalent cases is the total number of cases of disease existing in a population.
Prevalence rate
A prevalence rate is the total number of cases of a disease existing in a population divided by the total
population. So, if a measurement of cancer is taken in a population of 40,000 people and 1,200 were
recently diagnosed with cancer and 3,500 are living with cancer, then the prevalence of cancer is (1,200 + 3,500)/40,000 = 0.118.
Incidence proportion
Incidence proportion is the number of new cases within a specified time period divided by the size of the
population initially at risk. For example, if a population initially contains 1,000 non-diseased persons and 28
develop a condition over two years of observation, the incidence proportion is 28 cases per 1,000 persons,
i.e. 2.8%.
Incidence | Prevalence
Incidence is a measurement of the number of new individuals who contract a disease during a particular period of time. | Prevalence is a measurement of all individuals affected by the disease within a particular period of time.
Incidence conveys information about the risk of contracting the disease. | Prevalence indicates how widespread the disease is.
Incidence is more useful when talking about diseases of short duration, such as chickenpox. | Prevalence is a useful parameter when talking about long-lasting diseases, such as HIV.
If, over the course of one year, five women are diagnosed with breast cancer, out of a total female study population of 200 (who do not have breast cancer at the beginning of the study period), the incidence of breast cancer in this population was 0.025. | If a measurement of cancer is taken in a population of 40,000 people and 1,200 were recently diagnosed with cancer and 3,500 are living with cancer, then the prevalence of cancer is 0.118.
Relative risk
Relative risk (RR) is the risk of an event (or of developing a disease) relative to exposure. It is the ratio of
the probability of the event occurring in the exposed group to that in the non-exposed group.
RR = (probability of event in exposed group) / (probability of event in non-exposed group)
Example-1
In a study, the probability of developing lung cancer was 20% among smokers and 1% among non-smokers. This
situation is expressed in the 2 × 2 table. Calculate the relative risk.
Here, a = 20 (%), b = 80, c = 1, and d = 99. Then the relative risk of cancer associated with smoking would be
RR = [a/(a+b)] / [c/(c+d)] = (20/100) / (1/100) = 20
Example-2
Consider a study that examines the risk factors for breast cancer among women participating in the Survey. In a
sample of 4540 women who gave birth to their first child before the age of 25, 65 developed breast cancer. Of the
1628 women who first gave birth at age 25 or older, 31 were diagnosed with breast cancer. If we consider exposure to
be the condition of having first given birth at age 25 or older, then calculate the relative risk.
RR = (31/1628) / (65/4540) = 1.33
Women who first gave birth at 25 years of age or older are more likely to develop breast cancer than women
who first gave birth at a younger age.
Relative risk can be called risk ratio because it is the ratio of the risk in the exposed divided by the risk in the
unexposed.
It is suited to clinical trial data, where it is used to compare the risk of developing a disease, in people not
receiving the new medical treatment (or receiving a placebo) versus people who are receiving an established
(standard of care) treatment.
RR > 1 indicates that the occurrence of disease is higher in the exposed group than in the unexposed group
RR < 1 indicates that the occurrence of disease is lower in the exposed group than in the unexposed group
RR = 2 indicates that the occurrence of disease is two times as high in the exposed group as in the unexposed group
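Both relative-risk examples can be reproduced with a small helper. This is a minimal sketch; the function name relative_risk and its argument names are illustrative choices.

```python
def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Risk in the exposed group divided by risk in the unexposed group."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    return risk_exposed / risk_unexposed

# Example-1: 20 of 100 smokers and 1 of 100 non-smokers develop lung cancer.
rr_smoking = relative_risk(20, 100, 1, 100)

# Example-2: exposure is first giving birth at age 25 or older.
rr_first_birth = relative_risk(31, 1628, 65, 4540)

print(round(rr_smoking, 2), round(rr_first_birth, 2))
```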
Odds ratio
The odds ratio is the ratio of the odds of an event occurring in one group to the odds of it occurring in another
group. These groups might be men and women, or an experimental group and a control group.
If the probabilities of the event in each of the groups are p1 (first group) and p2 (second group), then the odds
ratio is:
OR = (p1/q1) / (p2/q2) = (p1 × q2) / (p2 × q1)
where qx = 1 − px.
OR > 1 indicates that the odds of disease are higher in the exposed group than in the unexposed group
OR < 1 indicates that the odds of disease are lower in the exposed group than in the unexposed group
OR = 2 indicates that the odds of disease are two times as high in the exposed group as in the unexposed group
Example-1
Suppose that in a sample of 100 men, 90 have drunk wine in the previous week, while in a sample of 100 women
only 20 have drunk wine in the same period. The odds of a man drinking wine are 90 to 10, or 9:1, while the odds
of a woman drinking wine are only 20 to 80, or 1:4 = 0.25:1. The odds ratio is thus 9/0.25, or 36, showing that men
are much more likely to drink wine than women.
Using the above formula for the calculation yields the same result:
OR = (0.9/0.1) / (0.2/0.8) = (0.9 × 0.8) / (0.1 × 0.2) = 36
Example – 2
Among the 2914 women who had previously used oral contraceptives, 273 developed breast cancer and 2641 did
not. Of the 7976 women who had never used oral contraceptives, 716 developed breast cancer and 7260 did
not. Calculate the odds ratio.
OR = [(273/2914) / (1 − 273/2914)] / [(716/7976) / (1 − 716/7976)]
= (273/2641) / (716/7260)
= 1.05
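The two odds-ratio examples can be checked with the following sketch, which computes the odds p/(1 − p) in each group and then their ratio. The function names odds and odds_ratio are illustrative choices.

```python
def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1 - p)

def odds_ratio(p1, p2):
    """Odds in the first group divided by odds in the second group."""
    return odds(p1) / odds(p2)

# Example-1: 90 of 100 men and 20 of 100 women drank wine.
or_wine = odds_ratio(90 / 100, 20 / 100)

# Example-2: breast cancer among users and never-users of oral contraceptives.
or_contraceptives = odds_ratio(273 / 2914, 716 / 7976)

print(round(or_wine, 2), round(or_contraceptives, 2))
```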
Attributable risk
Attributable risk is the portion of the incidence of a disease in the exposed that is due to the exposure. It is the
incidence of a disease in the exposed that would be eliminated if exposure were eliminated.
Attributable risk recognizes that not all of the disease incidence is due to the exposure, since even some unexposed
individuals (for example, non-smokers) develop the disease.
Thus incidence in exposed group = incidence not due to the exposure + incidence due to the exposure
Therefore, the incidence in the exposed group that is attributable to the exposure can be calculated by
subtraction:
Attributable risk = incidence in exposed group − incidence in unexposed group
Example :
Table : Hypothetical data giving 1-year disease risks for exposed and unexposed people. Calculate attributable risk.
Mathematics :
Example-1
Table : Breast cancer cases and person-years of observation for women with tuberculosis repeatedly exposed to
multiple x-ray fluoroscopies and unexposed women with tuberculosis.
                        Radiation exposure
                        Yes       No        Total
Breast cancer cases     41        15        56
Person-years            28,010    19,017    47,027
Calculate rates (cases /10,000 person-yr)
Solution :
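One way to work through this example is sketched below, assuming the intended answer is the rate in each group per 10,000 person-years and, following the definition of attributable risk given earlier, the difference between the exposed and unexposed rates. Function and variable names are illustrative.

```python
def rate_per_10000(cases, person_years):
    """Incidence rate expressed as cases per 10,000 person-years."""
    return cases / person_years * 10_000

rate_exposed = rate_per_10000(41, 28_010)     # radiation exposure: yes
rate_unexposed = rate_per_10000(15, 19_017)   # radiation exposure: no
rate_total = rate_per_10000(56, 47_027)

# Rate attributable to exposure: exposed rate minus unexposed rate.
rate_difference = rate_exposed - rate_unexposed

print(round(rate_exposed, 1), round(rate_unexposed, 1),
      round(rate_total, 1), round(rate_difference, 1))
```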
Example -2
Table : Hypothetical data giving 1-year disease risks for people at three levels of exposure
                      Exposure
              None      Low       High      Total
Disease       100       1200      1200      2500
No disease    9900      58,800    28,800    97,500
Total         10,000    60,000    30,000    100,000
From the above table ,calculate risk ,risk ratio and proportion of all cases.
Solution :
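A possible working for this example is sketched below. It assumes that the unexposed ("None") group is the reference level for the risk ratio and that "proportion of all cases" means each level's share of the 2500 cases; both are assumptions about what the question intends.

```python
table = {
    "None": {"disease": 100, "total": 10_000},
    "Low": {"disease": 1200, "total": 60_000},
    "High": {"disease": 1200, "total": 30_000},
}

total_cases = sum(row["disease"] for row in table.values())
reference_risk = table["None"]["disease"] / table["None"]["total"]  # assumed reference level

results = {}
for level, row in table.items():
    risk = row["disease"] / row["total"]
    results[level] = {
        "risk": risk,                                   # 1-year risk at this level
        "risk_ratio": risk / reference_risk,            # relative to no exposure
        "proportion_of_cases": row["disease"] / total_cases,
    }
    print(level, results[level])
```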