Definition
The word ‘statistics’ originated from the Latin word ‘status’, which means ‘state’ or ‘position’. It is so called because the first statistics consisted of data collected on the total number of people residing in Egypt, for estimating the number of persons capable of paying tax to the king.
In olden days, statistics was mainly used for various administrative requirements, for example: estimating budget requirements, the amount of tax to be collected, the labour and military force required, and the amount of clothing, food, schools and hospitals required, etc.
The word statistics is used in two different ways – one in a plural sense and the other in a singular sense.
In the singular sense, it is the ‘science’ dealing with the methods of data collection, compilation, tabulation and analysis needed to provide meaningful and valid interpretations.
Biostatistics
Statistical methods applied in the fields of medicine, biology and public health are termed biostatistics. It may be described as the application of statistical methods to the solution of biological problems. Biostatistics is known by many names, such as medical statistics, health statistics and vital statistics.
Vital statistics: Statistics related to vital events in life, such as births, deaths, marriages, morbidity, etc. These terms are overlapping and not exclusive of each other.
Uses of Biostatistics
Statistical methods are widely used in almost all fields. Most of the basic as well as
advanced statistical methods are applied in fields such as medicine, biology, public
health etc.
Constant: A constant is a value that does not change with any situation. Example: the value of π is 3.1416, which does not change with time, place or person. The value of ‘e’ is 2.718.
Variable: A variable is a characteristic that can take on different values with respect to person, time, place or any other factor. Example: blood pressure, height, weight, age, blood group, etc.
Populations under study can be finite or infinite in number. A population may refer to the values of a variable for a concrete collection of objects, e.g. birth weights of infants, anthropometric measurements of adults, nutrient contents of varieties of foods, etc.
Sample : A small portion of the population which truly represents the population with
respect to the study characteristics of the population is known as sample. A
representative sample represents the population with respect to the characteristics
under study. Sample can be representative, if it is drawn carefully with an
appropriate size and good procedure.
Example: 100 persons selected randomly from a study population of 1000 persons for estimating the mean Hb.
When the sample size increases, the estimated statistic values will be as close as possible to the population parameter values.
Variables are classified under four levels of measurement such as nominal, ordinal,
interval and ratio.
1. Nominal : Nominal variables are those variables measured with attributes that
are exhaustive and mutually exclusive.
Exhaustive means every participant in the study can be classified into one of the attributes. Example: the attributes of the variable religion may be Hindu, Muslim, Christian, Others. Then every participant of the study can be included in one of the categories. If the category ‘Others’ is not included as an attribute, a participant who is a Buddhist is not able to mark any attribute.
Mutually exclusive means every participant in the study should fall in one and only one attribute. Example: gender (attributes: male, female), place of residence (attributes: rural, urban), socio-economic status (attributes: low, average, high), etc.
2. Ordinal : Ordinal variables are those variables measured with attributes that have the characteristics of a nominal variable (exhaustive and mutually exclusive) and can be rank-ordered. Rank-ordered means that one attribute is greater than or less than another attribute, but it is not possible to state precisely how much greater or how much less. In this scale the relative order is meaningful, but the differences between categories are not known.
Example : Pain (Attributes : No pain, Mild pain, Moderate pain, Severe pain),
Satisfaction – 5 point Likert scale ( Attributes : Very dissatisfied, dissatisfied,
neither dissatisfied nor satisfied, satisfied, very satisfied) etc.
If feasible, collect data using interval or ratio level measurement, because one can recode an interval or ratio level variable into a nominal or ordinal variable. So it is possible to convert a higher level of measurement into a lower level, but not the opposite.
Example : The age of a participant in a study can be recorded as the actual age based on the last birthday. In this case age is defined as a ratio level measurement, which is appropriate for parametric tests. Later the age can be recoded into an ordinal level of measurement (0-10, 10-20, 20-30, etc.). If the age is collected at the nominal or ordinal level, it is not possible to convert it into the ratio level.
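As an illustration of recoding a ratio-level variable into an ordinal one, here is a minimal Python sketch; the cut-points and the ages in the list are assumptions chosen for illustration, not values from the text.

```python
def age_group(age_years):
    """Recode a ratio-level age into an ordinal category (hypothetical cut-points)."""
    if age_years < 10:
        return "0-9"
    elif age_years < 20:
        return "10-19"
    elif age_years < 30:
        return "20-29"
    else:
        return "30+"

ages = [4, 12, 19, 25, 33, 8]                # actual ages (ratio level)
print([age_group(a) for a in ages])          # ordinal categories; the reverse conversion is impossible
```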
Qualitative data: Qualitative data is defined as the data that is not precisely
measurable.
Ordinal data : If the qualitative data is expressed in some intrinsic order, such
data is called ordinal data.
(Example: pain – nil/mild/moderate/severe; income – poor/middle/high; attitude – strongly agree, agree, disagree, strongly disagree, etc.)
Quantitative data are divided into two types, namely discrete and continuous.
Discrete data: The quantitative data that can be expressed only in whole
numbers is called discrete data. (Example: Pulse rate, Number of RBC, number
of injections given to patients etc.).
Continuous Data: The quantitative data that is capable of assuming any fractional value within a range of numbers is called continuous data. (Example: weight, height, temperature.)
[Diagram: Classification of data into qualitative and quantitative types]
The form in which the data were originally collected is called raw data, which looks like a mass of numbers. Useful information is usually not immediately evident from raw data, so rearrangement of the data is essential to extract information from it.
The tabulation of qualitative data is carried out by counting the occurrence of each of the attributes of the variable and presenting them in a frequency distribution.
Example : The smoking status of 20 lung cancer patients was collected as Yes (y) and No (n). The raw data appear as y, n, y, y, y, y, n, y, n, y, n, y, n, y, n, y, n, y, y, n. Simply count the occurrences of ‘y’ and ‘n’ in the raw data, which are 12 and 8 respectively. The resulting tabulation of the data is presented in the following table.
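A minimal Python sketch of this tally, using the raw data quoted above; the counting itself is the only point being illustrated.

```python
from collections import Counter

# smoking status of 20 lung cancer patients, as listed in the example above
raw = ['y', 'n', 'y', 'y', 'y', 'y', 'n', 'y', 'n', 'y',
       'n', 'y', 'n', 'y', 'n', 'y', 'n', 'y', 'y', 'n']
freq = Counter(raw)            # frequency of each attribute
print(freq['y'], freq['n'])    # 12 8
```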
(b) Class Limits: Consider a class 20 – 30. The minimum value which can be included in the class is the lower class limit, and the maximum value that can be included in the class is the upper class limit. So when the class is 10 – 20, then 10 is the lower class limit and 20 is the upper class limit. Now consider the classes 0 – 9, 10 – 19, 20 – 29, 30 – 39, 40 – 49, 50 – 59, etc. Then for the class 20 – 29, 20 is the lower limit and 29 is the upper limit.
(d) Class Interval: The difference between class boundaries is called the class interval, class width or span of the class. E.g., for classes 10 – 20, 20 – 30, 30 – 40, … the class interval is the difference between 20 and 10, or 30 and 20, etc. When the classes are 10 – 19, 20 – 29, 30 – 39, etc., the class interval is the difference between 19.5 and 9.5, 29.5 and 19.5, etc.
The steps for the formation of frequency distribution for quantitative data are
Find out the minimum and maximum values of the given data and divide the
total range of observation into some arbitrary intervals called class interval
Draw a table with the first column indicating the class interval. This column
should have an appropriate label along with unit of measurement.
Label the second column ‘tally marks’. Going through the data, cross off (‘/’) each observation and put a tally mark ‘|’ against the interval in which that observation falls.
Continue with the other observations, indicating every fifth tally mark in an
interval by crossing the previous four tally marks as shown IIII , so that it will
be easy to count multiples of five.
After placing the tally marks for all the observations in the appropriate groups,
count the tally marks and indicate the number as the frequency of that class in
the next column
The total of frequencies of all the class intervals will add up to the total
number of observations in the data set.
Finally give a suitable heading or titles to the table.
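The steps above can be mirrored in a short Python sketch; the data values and class boundaries below are assumptions made purely for illustration (lower boundary inclusive, upper exclusive).

```python
# tally quantitative data into class intervals and print the frequency distribution
data = [23, 45, 12, 67, 34, 29, 41, 55, 18, 62, 37, 48]
classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]

freq = {c: 0 for c in classes}
for value in data:
    for lower, upper in classes:
        if lower <= value < upper:        # place the observation in its class
            freq[(lower, upper)] += 1
            break

for (lower, upper), f in freq.items():
    print(f"{lower} - {upper}: {f}")
print("Total:", sum(freq.values()))       # equals the number of observations
```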
(i) The classes should be clearly defined and should not lead to any ambiguity.
(ii) The classes should be exhaustive, i.e., each of the given values should be included in one of the classes.
(iii) The classes should be mutually exclusive and non-overlapping.
(iv) The classes should be of equal width. This principle, however, cannot always be rigidly followed. E.g.: children below 15 years can be classified with unequal class intervals such as <7 days, 7 – 28 days, 28 days – 1 year, 1 – 4 years, 5 – 9 years, etc. to see the mortality pattern.
(v) Indeterminate classes, e.g. the open-ended classes ‘less than a’ or ‘greater than b’, should be avoided as far as possible, since they create difficulty in analysis and interpretation.
(vi) The number of classes should neither be too large nor too small. It should
preferably lie between 5 and 15.
Number of children:
0,1,2,1,8,8,8,2,8,3,4,5,2,0,0,1,5,4, 1,3,5,3,6,6,5,2,1,3,7,1,3,2,1,0,5,6
Explanation: Consider the first observation ‘0’ and put a tally mark against ‘0’. Now consider the second observation ‘1’ and put a tally mark against ‘1’. Proceed in the same way. Every fifth tally mark is to be crossed as shown in the table. Then count the number of tally marks, which gives the frequency.
Sol: We need to form classes of ages. The minimum age is 10 and the maximum age is 89. Let us take the classes as 10 – 20, 20 – 30, … etc. and prepare the following table.
From the preliminary table, a final table is constructed. In the final table the ‘tally mark’ column is omitted, and the column heading ‘frequency’ is changed to the number of the specific items analysed. An ideal frequency table can have 5 to 20 classes, so that it can be accommodated on one page. The tables should be serially numbered and there must be a proper heading, which is self-explanatory. A vertical arrangement is preferred to a horizontal arrangement. Often, a percentage column is also appended to give a comparative picture. Thus the final table for the age of patients can be presented as follows.
The tabular arrangement obtained by counting the different combinations that can be formed with two variables is called a two-way table or two-way classification. When three variables are considered simultaneously, the tabulation is called a three-way table. In general, when more than one variable is studied, resulting in a subdivision of classes, the classification is known as a multi-way table or manifold classification.
Consider the data regarding the variable blood group of nursing students. The
simple classification will be
When two variables are considered – blood group and Rh factor (+ve and −ve) – the information can be arranged in a tabular form by counting the different combinations that can be formed with the two variables, namely A+ve, A−ve, B+ve, B−ve, etc. Such a presentation is called a two-way table. The following table shows the two-way and three-way tables.
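A minimal Python sketch of building such a two-way table by counting combinations; the (blood group, Rh) pairs below are hypothetical observations, not the data from the table in the text.

```python
from collections import Counter

# hypothetical (blood group, Rh factor) observations for nursing students
students = [("A", "+"), ("O", "+"), ("B", "-"), ("A", "+"), ("AB", "+"),
            ("O", "-"), ("A", "-"), ("O", "+"), ("B", "+"), ("AB", "-")]

two_way = Counter(students)                       # count each combination once
for group in ["A", "B", "AB", "O"]:
    pos = two_way[(group, "+")]
    neg = two_way[(group, "-")]
    print(f"{group}: Rh positive = {pos}, Rh negative = {neg}")
```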
Frequency tables provide a view of the data and its principal characteristics, whereas diagrams provide a visual method of examining data. Diagrams help to get a real grasp of the overall picture rather than specific details. Many types of diagrams and graphs are used to represent different types of data.
1. Bar Diagrams
One of the most commonly seen diagrams is the bar diagram. The commonly used types of bar diagram are the simple, component, multiple and percentage bar diagrams.
In a bar diagram, mark the categories of the variable on the X axis and the frequency on the Y axis. A rectangular bar is erected for each category of the variable, and the height of the bar is proportional to the frequency of that category. Each bar should have an equal width, and there should be equal spacing between two successive bars.
The bar diagram representing information shown in the table regarding
distribution of blood group is presented below.
Distribution of nursing students according to blood group
Blood groups No. of nursing students
A 12
B 3
AB 7
O 18
[Figure: Simple bar diagram of the number of nursing students by blood group (A = 12, B = 3, AB = 7, O = 18)]
A bar diagram can also be made with the percentage values for the above table.
Blood group | No. of nursing students | Percent
A | 12 | 30.0
B | 3 | 7.5
AB | 7 | 17.5
O | 18 | 45.0
Total | 40 | 100.0
[Figure: Simple bar diagram showing the blood group of students, with the percentage of nursing students on the Y axis (A = 30.0%, B = 7.5%, AB = 17.5%, O = 45.0%)]
[Figure: Component and percentage bar diagrams showing the Rh type (Rh positive and Rh negative) within each blood group A, B, AB and O]
There may be situations where neither the total number nor the percentage is given any importance, but the component parts alone are taken into account for graphical presentation. Then the data can be presented as a ‘multiple bar diagram’.
[Figure: Multiple bar diagram showing the number of nursing students by blood group and Rh type (Rh positive and Rh negative bars for groups A, B, AB and O)]
2. Histogram
Example :
The table below gives the distribution of patients according to haemoglobin level.
[Figure: Histogram of the number of patients by HB level (class intervals 6 – 8, 8 – 10, 10 – 12 and 12 – 14)]
How a histogram differs from a simple bar diagram: A histogram is different from a bar diagram in three ways: (i) in a simple bar diagram all the bars are of equal width, but in a histogram the width varies with the class interval; (ii) in a simple bar diagram there is some gap between bars, but there is no gap between bars in a histogram; (iii) in a simple bar diagram the length of the bar is proportional to the frequency of the category, but in a histogram the area of the bar is proportional to the frequency of the class.
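A minimal Python (matplotlib) sketch of a histogram drawn with touching bars over class intervals; the haemoglobin values are invented for illustration, and the class boundaries 6, 8, 10, 12 and 14 are taken from the figure above.

```python
import matplotlib.pyplot as plt

# hypothetical haemoglobin values (g%) for illustration only
hb = [7.2, 8.1, 8.9, 9.4, 9.8, 10.2, 10.6, 11.0, 11.3, 11.9,
      12.1, 12.4, 12.8, 13.0, 13.5, 9.1, 10.9, 11.6, 12.2, 8.6]

# bars touch each other; with equal class widths the area is proportional to the frequency
plt.hist(hb, bins=[6, 8, 10, 12, 14], edgecolor="black")
plt.xlabel("HB level (g%)")
plt.ylabel("No. of patients")
plt.title("Histogram of haemoglobin levels")
plt.show()
```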
3. Frequency Polygon/Curve
To draw a Frequency Polygon, the mid-values are marked along the X axis on a
suitable scale and the frequencies along the Y axis. The frequencies corresponding
to each mid-point will be shown in the chart area with plotted points.
Example : Consider the incidence of female breast cancer in Mumbai and Delhi as
given in the following Table.
From the frequency polygon, it can be seen that the age pattern in the number of breast cancer cases is almost the same up to the age of 45 years, and thereafter more cases are seen in Mumbai compared to Delhi. These differences are absolute differences in the number of cases and do not reflect the incidence rates. In order to compare, the incidence rate has to be calculated (the number of cases per women at risk in different age groups).
The drawing of the Frequency curve is exactly similar to that of drawing a frequency
polygon except that the plotted points are joined by a smooth free hand curve.
If the frequency polygon is drawn for the cumulative frequencies rather than the absolute frequencies of the class intervals, it is called a cumulative frequency polygon. If the points of the cumulative frequency polygon are joined with a smooth curve instead of straight lines, it is called an ogive. With the help of this curve one may find the number of observations falling below or above a specific value. This is useful for the calculation of quartiles, deciles, percentiles, the median and the mode, and for comparison between two or more groups.
Example :
[Figure: Less-than ogive of the cumulative number of hepatitis cases (No. of patients) by day of reporting]
From the less-than ogive, for any given value on the X axis one can determine the number of observations that are less than or equal to that value by extending a vertical line from that value up to the point where it intersects the ogive curve.
[Figure: Greater-than ogive of the cumulative number of hepatitis cases (No. of patients) by day of reporting]
The greater than ogive is helpful in determining the number of observations above a
specific value, within the range of the observed data.
[Figure: Less-than and greater-than cumulative frequency curves of reported hepatitis cases]
When both ogives are drawn on the same graph, they intersect at some point, and the corresponding value of the variable indicated on the X axis is the median of the entire data.
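A minimal Python sketch of computing less-than cumulative frequencies and reading off an approximate median by linear interpolation; the class intervals and frequencies are assumptions for illustration, not the hepatitis data above.

```python
# less-than cumulative frequencies and a median estimate by linear interpolation
classes = [(1, 3), (3, 5), (5, 7), (7, 9), (9, 11)]     # assumed class boundaries
freq =    [14,     25,     31,     24,     14]          # assumed frequencies

cum, running = [], 0
for f in freq:
    running += f
    cum.append(running)                 # cumulative frequency at each upper boundary

n = running
half = n / 2
for (lower, upper), c, f in zip(classes, cum, freq):
    if c >= half:                       # median class: where N/2 is first reached
        below = c - f                   # cumulative frequency below the median class
        median = lower + (half - below) / f * (upper - lower)
        print("Estimated median:", round(median, 2))
        break
```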
4. Line Diagram
Let us consider the number of deaths reported from the Medical College
Health Centre area for the period from 2001 to 2004.
In this case, the number of deaths is spread over a period of four years. The objective is to see the mortality trend from a suitable diagram or chart. Certainly, a ‘line diagram’ may be the better choice.
[Figure: Line diagram of the number of deaths reported from the Medical College Health Centre area, 2001 to 2004]
In line graphs, time is taken as the independent variable (years, months, hours, etc.) on the X axis and one or more variables are taken on the Y axis. Corresponding to each such point on the X axis, the number of observations for that year is indicated by a dot or plotted point. The line obtained by joining all the plotted points is called a line graph.
Two or more variables depending on the same time variable can be presented on the same graph. This diagram is very useful for comparing the trend of two or more variables over a period of time. The following figure gives information on the variation in SBP at different time intervals after surgery in both groups; a comparison of SBP between the groups at different time intervals is also possible.
[Figure: Line graph of SBP (mm Hg) at different time intervals after surgery for the experimental and control groups]
Another advantage of these diagrams is that the variable under study can be interpolated for any given time period within the observed range.
The measures that describe the tendency of the data to lie in the centre (middle) are called ‘Measures of Central Tendency’ or ‘Statistical Averages’. The measures of central tendency reduce or condense a group of observations into a single number and make possible the comparison of different sets of data.
The important measures of central tendency are – (i) The Arithmetic Mean (ii) The
Median and (iii) The Mode.
Ungrouped data
For ungrouped data the arithmetic mean (x̄) can be calculated by the formula x̄ = Σx / n.
Example 5.1: The gains in weight of 5 albino rats over a period of 5 days are 5, 6, 4, 8 and 7. The arithmetic mean or mean is
x̄ = (5 + 6 + 4 + 8 + 7) / 5 = 30 / 5 = 6.0
Grouped data
Example:
Following data gives age in years in case of child deaths. Find the average age.
A.M. = Σ(x·f) / N = 275 / 172 = 1.6 years
In this case the exact mark (the x value for each student) is not known. So we assume that all the students in the ‘5 – 15’ class might have scored 10 marks each (the mid-value of the class, calculated as (lower limit + upper limit) / 2), all the students in the ‘15 – 25’ class are assumed to have scored 20 marks each, and so on. The table is to be reconstructed with two more columns, viz. mid-value (x) and x·f, as follows:
Marks awarded to each of 100 students in a class
Again, the Arithmetic Mean will be calculated with the same formula:
A.M. = Σ(x·f) / N = 2400 / 100 = 24 marks
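A minimal Python sketch of the grouped-data arithmetic mean, A.M. = Σ(x·f)/N with x the class mid-value; the class intervals and frequencies are assumptions chosen so that Σ(x·f) = 2400 and N = 100, matching the worked result above.

```python
classes = [(5, 15), (15, 25), (25, 35), (35, 45)]    # assumed class intervals (marks)
freq    = [20,      40,       20,       20]          # assumed frequencies, N = 100

mid = [(low + high) / 2 for low, high in classes]    # mid-value of each class
N = sum(freq)
mean = sum(x * f for x, f in zip(mid, freq)) / N
print(mean)                                          # 24.0 marks
```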
The advantage of the A.M. is that it has all the desirable properties of a good measure of central tendency. Computation of the A.M. uses all the observations, it can be used for further mathematical and statistical operations, and the behaviour of the sample mean over repeated samples is easy to predict.
Disadvantages: (i) The sample mean cannot be computed for nominal and non-numerical ordinal data. (ii) The sample mean is sensitive to extreme values in the data set, i.e. it will be influenced considerably by abnormal values. (iii) The mean cannot be computed from a grouped distribution with open-ended class intervals.
B. Median
The median is the middle value, which divides the observed values into two equal halves. Since it is the middle-most value by position, the median is also called a positional average.
For calculation of the median, the values should be arranged in order of magnitude, and the central value represents the median.
When the total number of observations (n) is odd:
Median = the [(n + 1)/2]th term, if the observations are arranged in either ascending or descending order.
When the total number of observations (n) is even
Median = The average of (n/2)th and [(n/2) +1)]th term, if the observations are
arranged either in ascending or descending order
Grouped data
In this case the data is grouped (classified), and the median can be obtained with the help of a mathematical formula:
Median = l + [(N/2 − m) / f] × C
where l = lower boundary of the median class, m = cumulative frequency up to the median class, f = frequency of the median class and C = class interval. The ‘median class’ is defined as the class in which N/2 lies.
Example : Calculation of median dosage of drug for the data given in table.
The median is not a good measure of central tendency because it has none of the desirable properties, except that it is easy to understand, easy to calculate if the data is ungrouped, and has a well-defined formula. Still, it is the best measure if there are a few abnormal observations, or if some of the observations are missing but their positions in the distribution are known.
C. Mode
The mode is that value of the variable which occurs most frequently.
Grouped data
In this case the data is grouped, so the mode can be calculated with the formula
Mode = l + [f2 / (f1 + f2)] × C
where l = lower boundary of the modal class,
C = class interval,
f1 = frequency of the class just preceding the modal class,
f2 = frequency of the class just succeeding the modal class.
The modal class is defined as the class in which the maximum frequency lies.
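A minimal Python sketch of the grouped-data mode formula quoted above; the modal class boundary, class interval and neighbouring frequencies used in the call are hypothetical values, not data from the text.

```python
def grouped_mode(lower_boundary, class_interval, f_preceding, f_succeeding):
    """Mode = l + [f2 / (f1 + f2)] * C, using the symbols defined above."""
    return lower_boundary + (f_succeeding / (f_preceding + f_succeeding)) * class_interval

# hypothetical example: modal class 150-170 (l = 150, C = 20), f1 = 20, f2 = 20
print(grouped_mode(150, 20, 20, 20))   # 160.0
```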
Example :
The mean is easy to calculate and understand, is based on all the observations and is least affected by sampling fluctuations, but it is affected very much by extreme values.
The median is easy to calculate and is not affected by extreme values. With an even number of observations, it is not an exact value. It is not based on all the observations, unlike the mean.
The mode is used when the values vary widely. For skewed distributions, or samples where there is wide variation, the mode and median are useful.
MEASURES OF DISPERSION
The measures of central tendency – Mean, Median and Mode – will not be enough to provide comparison and valid inference. Two distributions may centre around the same point, i.e. the same arithmetic mean, but differ in variation from the arithmetic mean. Such variation is called dispersion, spread, scatter or variability.
A ‘Good Measure of Dispersion’ must have the properties like – (i) Easy to calculate
and Easy to understand (ii) Rigidly defined (iii) Representative of the whole group (iv)
Less amenable to sampling fluctuation and (v) Amenable for further mathematical
and statistical operations.
1. Range
The Range [R] is defined as the difference between the Highest (H) and Lowest
(L) observation in a given set of data. It is calculated as: R = H – L
It is the simplest measure of dispersion, which will provide quick and easy
inference of the data.
2. Mean Deviation
The Mean Deviation (Average Deviation) is defined as the mean of the absolute
Deviation of observations from any one of the measures of central tendency.
If the deviation is taken from the A.M., it is called ‘Mean Deviation about Mean’. For
deviation from the Median, it is known as ‘M.D. about Median’ and if the deviation is
computed from the Mode, it is called the ‘M.D. about Mode’. If it is not specifically
stated, M.D. signifies the ‘Mean Deviation about Mean’.
Ungrouped data : For ungrouped data the Mean Deviation is calculated with the formula
M.D. = Σ|x − x̄| / n
Find the sum of |x − x̄| and divide by n (here n = 10). The computation can then be done as follows.
Grouped data:
Mean Deviation for grouped data can be calculated with the formula
M.D. = Σ|x − x̄|·f / N
Example : Calculate the mean deviation for the distribution given below
In order to apply the formula for grouped data, the table has to be reconstructed with additional columns: mid-value (x), x·f, (x − x̄), |x − x̄| and |x − x̄|·f.

Systolic B.P. | Mid-value (x) | No. of patients (f) | x·f | (x − x̄) | |x − x̄| | |x − x̄|·f
110 – 130 | 120 | 10 | 1200 | −40 | 40 | 400
130 – 150 | 140 | 20 | 2800 | −20 | 20 | 400
150 – 170 | 160 | 40 | 6400 | 0 | 0 | 0
170 – 190 | 180 | 20 | 3600 | +20 | 20 | 400
190 – 210 | 200 | 10 | 2000 | +40 | 40 | 400
Total | | N = 100 | Σx·f = 16000 | | | Σ|x − x̄|·f = 1600

x̄ = Σx·f / N = 16000 / 100 = 160 mm Hg

M.D. about x̄ (grouped) = Σ|x − x̄|·f / N = 1600 / 100 = 16 mm Hg
3. Standard Deviation
The Standard Deviation (S.D.) is defined as the ‘root mean square deviation’, i.e. the square root of the average of the sum of the squares of the deviations taken from the A.M. It is denoted by the symbol ‘σ’ (sigma) if all the items of the population are taken into account, and by ‘s’ for a sample estimate.
If the square root is not computed, it is called variance. Or, the square of standard
deviation is called variance. The standard deviation is considered as the best
measure of dispersion.
S.D. = √( Σ(x − x̄)² / n )
Find the sum of (x − x̄)² and divide by n. The computation can be done as follows.

Systolic BP (x) | (x − x̄) | (x − x̄)²
140 | −20 | 400
120 | −40 | 1600
260 | +100 | 10000
120 | −40 | 1600
150 | −10 | 100
140 | −20 | 400
200 | +40 | 1600
170 | +10 | 100
120 | −40 | 1600
180 | +20 | 400
Σx = 1600 | | Σ(x − x̄)² = 17800

x̄ = Σx / n = 1600 / 10 = 160 (mean)

S.D. = √( Σ(x − x̄)² / n ) = √(17800 / 10) = √1780 = 42.2 mm Hg
Grouped data : For grouped data the standard deviation is calculated by the formula

S.D. = √( Σ(x − x̄)²·f / N )

Systolic B.P. | Mid-value (x) | No. of patients (f) | x·f | (x − x̄) | (x − x̄)² | (x − x̄)²·f
110 – 130 | 120 | 10 | 1200 | −40 | 1600 | 16000
130 – 150 | 140 | 20 | 2800 | −20 | 400 | 8000
150 – 170 | 160 | 40 | 6400 | 0 | 0 | 0
170 – 190 | 180 | 20 | 3600 | +20 | 400 | 8000
190 – 210 | 200 | 10 | 2000 | +40 | 1600 | 16000
Total | | N = 100 | Σx·f = 16000 | | | Σ(x − x̄)²·f = 48000

where x̄ = Σx·f / N = 16000 / 100 = 160 mm Hg

S.D. (grouped data) = √( Σ(x − x̄)²·f / N ) = √(48000 / 100) = √480 = 22 mm Hg
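A minimal Python sketch that reproduces the grouped-data mean and standard deviation for the systolic BP table above, using the class mid-values and frequencies from that table.

```python
from math import sqrt

mid  = [120, 140, 160, 180, 200]     # class mid-values (mm Hg)
freq = [10, 20, 40, 20, 10]          # number of patients in each class

N = sum(freq)
mean = sum(x * f for x, f in zip(mid, freq)) / N                    # 160.0 mm Hg
variance = sum((x - mean) ** 2 * f for x, f in zip(mid, freq)) / N  # 480.0
sd = sqrt(variance)
print(mean, round(sd, 1))            # 160.0 21.9 (about 22 mm Hg)
```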
The square of the S.D. is called the ‘variance’, which can be calculated by omitting the square root sign in the formula for the S.D.; it is denoted by ‘s²’ for the sample variance and ‘σ²’ for the population variance.
4. Percentiles
Quartiles: They are three different points located on the entire range of a variable such as height – Q1, Q2 and Q3. Q1, or the lower quartile, will have 25% of the observations falling to its left and 75% to its right; Q2, or the median, will have 50% of the observations on either side; and Q3, or the upper quartile, will have 75% of the observations to its left and 25% to its right.
Quintiles: Quintiles, four in number, divide the distribution into 5 equal parts. So the 20th percentile, or first quintile, will have 20% of the observations falling to its left and 80% to its right.
Deciles: Nine in number, they divide the distribution into 10 equal parts. The first decile, or 10th percentile, will divide the distribution into 10% and 90%, while the 9th decile will divide it into 90% and 10%, and the 5th decile is the same as the median. So the median of a variable can also be called the second quartile Q2, the 5th decile, or the 50th percentile P50.
5. Quartile Deviation
The Quartile Deviation (Q.D.) is defined as the average deviation between the first quartile and the third quartile, or the ‘semi-interquartile range’. It is calculated with the formula
Q.D. = (Q3 − Q1) / 2
where Q1 is the 1st quartile value and Q3 is the 3rd quartile value.
In the case of ungrouped data, the Q.D. is not usually calculated due to limited
number of observations.
In the case of grouped data:
Q1 = l1 + [(N/4) − m1] × C / f1
where l1 = lower boundary of the 1st quartile class (the class where N/4 lies), N = total frequency, m1 = cumulative frequency (less than) up to the 1st quartile class, C = class interval and f1 = frequency in the 1st quartile class; Q3 is obtained similarly, using 3N/4.
The Q.D. is usually computed if there are a few abnormal observations below the 1st quartile or above the 3rd quartile, so as to nullify their effect. It is not considered a good measure, as it signifies only the average deviation between the first and third quartiles, and most of the desirable properties are not satisfied.
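A minimal Python sketch of the grouped-data quartile formula above (with k = 1 for Q1 and k = 3 for Q3) and the resulting quartile deviation; the class boundaries and frequencies are the systolic BP values used earlier, reused here only as an illustration.

```python
def grouped_quartile(k, classes, freq):
    """Q = l + [(k*N/4 - m) / f] * C, following the formula above."""
    N = sum(freq)
    target = k * N / 4
    cum = 0
    for (lower, upper), f in zip(classes, freq):
        if cum + f >= target:                          # quartile class found
            return lower + (target - cum) / f * (upper - lower)
        cum += f

classes = [(110, 130), (130, 150), (150, 170), (170, 190), (190, 210)]
freq = [10, 20, 40, 20, 10]
q1 = grouped_quartile(1, classes, freq)
q3 = grouped_quartile(3, classes, freq)
print(q1, q3, (q3 - q1) / 2)   # 145.0 175.0 15.0 (Q1, Q3 and Q.D. in mm Hg)
```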
RELATIVE MEASURES OF DISPERSION
The measures of dispersion – Range, M.D., S.D. and Q.D. – are always positive quantities expressed in the units of the original observations, and hence they are often called ‘absolute measures of dispersion’. To compare the variability of two sets of data, a relative measure, the coefficient of variation (C.V.), is used:
C.V. = (S.D. / A.M.) × 100
Measures of relationship
Correlation
Sometimes we have to work with more than one variable at a time to understand their relationship, or the dependence of one variable on another. For example: to understand the relationship between the birth weight of newborns and gestational age, or between characteristics measured in the same person such as weight and cholesterol, or weight and height. At other times, the same characteristic is measured in two related groups, such as tallness in parents and tallness in children, intelligence quotient (IQ) in sisters and brothers, degree of heaviness or obesity in parents and their children, and so on.
If only two variables are involved, it is called ‘bivariate correlation’, and if more than two variables are involved, it is known as ‘multivariate correlation’. If the plotted points of a correlation graph fall on or near a line, it is called linear correlation; if on or near a curve, it is curvilinear correlation, and so on.
Example : If the cholesterol level increases, the BP may also increase, i.e. if there is an increase in one variable, there will be a corresponding increase in the other. Similarly, if height increases, there may be a proportional increase in weight; the birth weight of a newborn is positively correlated with gestational age, and so on.
Correlation Coefficient
The coefficient of correlation measures the strength and direction of the linear relationship between two quantitative variables.
Direction: The sign (+ or −) of the correlation coefficient indicates the direction of the relationship. A correlation coefficient between 0 and 1 indicates a positive or direct relationship; a correlation coefficient between 0 and −1 indicates a negative correlation or inverse relationship.
Strength: The absolute value of the correlation coefficient |r| indicates the strength of the relationship between the variables. For example, a correlation coefficient of −0.65 is stronger than a correlation coefficient of 0.60, because the absolute value of −0.65 is greater than the absolute value of 0.60. The closer the correlation coefficient is to −1 or +1, the stronger the correlation. There are no hard and fast rules for what constitutes a ‘strong’, ‘moderate’ or ‘weak’ correlation; some rough guidelines are given below.
Statistical tables are available to check whether the obtained correlation coefficient is statistically significant. If the obtained correlation coefficient is greater than the table value of r for the appropriate degrees of freedom (d.f.), there is a significant relationship between the variables. If the obtained correlation value is less than the table value, the relationship is not statistically significant.
Coefficient of determination
The square of the correlation coefficient (r²) is called the coefficient of determination. It is the proportion of the variance in one variable that can be explained by the other.
Example : The correlation coefficient r between the birth weight of newborns and gestational age is 0.745. The coefficient of determination r² = 0.555, or 55.5%, which implies that 55.5% of the variation in birth weight (dependent variable) can be explained by the gestational age (independent variable) of the newborn.
Calculation of Pearson correlation coefficient
There are various formulae for computing the correlation coefficient ‘r’; the simplest one is
r = Covariance(x, y) / (S.D. of x × S.D. of y)
where Covariance(x, y) = Σ(x − x̄)(y − ȳ) / n
As the S.D. of x, the S.D. of y and the covariance of (x, y) involve a lot of calculation, another formula is suggested for the computation of the correlation coefficient:
r = [nΣxy − (Σx)(Σy)] / √{ [nΣx² − (Σx)²] [nΣy² − (Σy)²] }
where n is the number of pairs. If r is positive, there is a positive correlation; if it is negative, there is a negative correlation; and if it is zero, there is no correlation.
Example:
In a study to assess the hours of study per day, for second year B.Sc
Nursing exam and the marks obtained in the university Exam, the following data
were obtained. Calculate the correlation coefficient.
Hours of study (x) | Marks obtained in the university exam (y) | xy | x² | y²
2 | 20 | 40 | 4 | 400
8 | 40 | 320 | 64 | 1600
10 | 42 | 420 | 100 | 1764
6 | 24 | 144 | 36 | 576
3 | 36 | 108 | 9 | 1296
Σx = 29 | Σy = 162 | Σxy = 1032 | Σx² = 213 | Σy² = 5636
Correlation coefficient, r = [5(1032) − (29)(162)] / √{ [5(213) − (29)²] [5(5636) − (162)²] } = +0.70
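A minimal Python sketch of the computational formula for r given above, checked against the hours-of-study example; nothing beyond that formula is assumed.

```python
from math import sqrt

def pearson_r(x, y):
    """r = [n*Sxy - Sx*Sy] / sqrt([n*Sxx - Sx^2] * [n*Syy - Sy^2])"""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

hours = [2, 8, 10, 6, 3]            # hours of study (x)
marks = [20, 40, 42, 24, 36]        # university exam marks (y)
print(round(pearson_r(hours, marks), 2))   # about +0.70
```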
Rank Correlation
When precise measurements of the variables are not available, the Pearson correlation coefficient cannot be calculated. Even if precise measurements are available, it is not justifiable to calculate ‘r’ unless the variables follow a normal distribution.
Spearman rank correlation is appropriate to use when either or both the variables
are measured at ordinal level of measurement, or at interval / ratio level that do not
meet the assumptions of normality. The dependent and independent variables must
be paired observations
It was introduced by Spearman and hence is known as ‘Spearman’s Rank Correlation’, denoted by ‘R’. Like the Pearson correlation coefficient, the Spearman rank correlation coefficient ranges between −1 and +1, where the sign indicates the direction of the relationship and the absolute value indicates the strength of the relationship.
Example:
Steps involved:
1. Reconstruct the table with additional columns d = [Rank
given by 1st examiner (x) – 2nd Examiner (y)] and d2
Candidate | Examiner I (x) | Examiner II (y) | d = x − y | d²
A | 2 | 3 | −1 | 1
B | 1 | 4 | −3 | 9
C | 4 | 2 | 2 | 4
D | 5 | 5 | 0 | 0
E | 3 | 1 | 2 | 4
F | 7 | 6 | 1 | 1
G | 6 | 7 | −1 | 1
Total (n = number of candidates = 7) | | | | Σd² = 20
Rank correlation R = 1 − 6Σd² / [n(n² − 1)] = 1 − (6 × 20) / [7(7² − 1)] = 1 − 120/336 = 1 − 0.36 = +0.64
Interpretation : Since the rank correlation R = +0.64, which is more than +0.5, it is suggestive of a moderate positive correlation, i.e. if the marks given by one examiner increase, the marks given by examiner 2 also tend to increase.
Qn. 2 Information on the IQ and personality scores of 6 students is given below. Study the relationship between these two variables using the Spearman correlation.
Sl no | IQ index (x) | Rank of x | Personality score (y) | Rank of y | Difference in ranks (d) | d²
1 | 10 | 6 | 9 | 5.5 | 0.5 | 0.25
2 | 9 | 4.5 | 9 | 5.5 | −1 | 1
3 | 6 | 2 | 7 | 3 | −1 | 1
4 | 9 | 4.5 | 5 | 1.5 | 3 | 9
5 | 8 | 3 | 8 | 4 | −1 | 1
6 | 4 | 1 | 5 | 1.5 | −0.5 | 0.25
Total | | | | | | Σd² = 12.5
R = 1 − 6Σd² / [n(n² − 1)] = 1 − (6 × 12.5) / [6(6² − 1)] = 1 − 75/210 = 1 − 0.36 = 0.64
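A minimal Python sketch that ranks the observations (giving tied values the average of their ranks, as in the table above) and applies R = 1 − 6Σd²/[n(n² − 1)]; it reproduces the IQ and personality example.

```python
def average_ranks(values):
    """Rank values (1 = smallest); tied values share the average of their ranks."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1          # first 1-based position of v
        count = ordered.count(v)
        ranks.append(first + (count - 1) / 2)
    return ranks

def spearman_R(x, y):
    rx, ry = average_ranks(x), average_ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    n = len(x)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

iq          = [10, 9, 6, 9, 8, 4]
personality = [9, 9, 7, 5, 8, 5]
print(round(spearman_R(iq, personality), 2))   # about 0.64
```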
Scatter diagram
This is useful to assess the relationship between two variables. In plotting data of
this type, one variable is placed on the x –axis and the second variable on the y –
axis. The (x,y) points are indicated by means of dots. The pattern made by these
dots is indicative of a possible relationship between two variables. The scatter
may be linear, curve linear, or exponential.
Example : The data of 10 albino rats on intake of proteins and gain on weight.
[Figure: Scatter diagram of protein intake versus gain in weight (gm) for 10 albino rats]
A scatter plot is a graph with points plotted to show a possible relationship between two sets of data. In plotting data of this type, one variable is placed on the X axis and the second variable on the Y axis. The (x, y) points are indicated by means of dots. After plotting all the observations, one can see how the values on the Y axis are scattered for a given value on the X axis and, similarly, how the values on the X axis are scattered for a given value on the Y axis. This diagram shows the scatter of the two variables with respect to each other.
The dependent variable represents the output or outcome whose variation is being studied. The independent variables represent inputs or causes, i.e. potential reasons for variation.
Regression
Regression analysis is a method of developing a mathematical equation that predicts the dependent variable for a given value of the independent variable, based on a sample of measurements of both the dependent and independent variables.
Linear Regression
The first step in regression is to draw a scatter diagram for the independent variable X and the dependent variable Y. If the trend in the scatter diagram of the values of the dependent variable Y for given values of X happens to be like a line, the regression technique applied for the estimation of the dependent variable Y is called linear regression.
If one independent variable is used to develop a linear equation that describes the
relationship between dependent and independent variable, such linear regression is
called simple linear regression.
The concept of regression lies in identifying a line, called the regression line, that is nearest to the data points marked on the scatter diagram, so that for a given value of X a close prediction of the value of Y can be made. For this, the observed data are used to find the mathematical quantities of the equation of a straight line. The identified line passes through the point whose coordinates are the means of X and Y. Mathematically, the equation of the straight line is
Ŷ = β̂0 + β̂1X
Fitting a regression line to the data means obtaining the values of β̂1 and β̂0 in the equation. The method of estimating the slope and the intercept of the linear regression is called the least squares method.
The method of least squares ensures that the sum of squares of all vertical distances between the fitted regression line and the data points is the least among all possible regression lines for this data.
The expressions for the least squares estimates of the slope and the intercept are obtained as

β̂1 = SSxy / SSxx = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²

Equivalently, the regression coefficient β̂1 = [Σxy − (Σx)(Σy)/n] / [Σx² − (Σx)²/n]

Intercept: β̂0 = ȳ − β̂1 x̄
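A minimal Python sketch of the least squares estimates written above (slope = SSxy/SSxx, intercept = ȳ − β̂1x̄); the protein and weight-gain numbers are invented for illustration and are not the albino rat data from the scatter diagram.

```python
def least_squares_fit(x, y):
    """Return (intercept, slope) of the simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx              # slope
    b0 = y_bar - b1 * x_bar         # intercept
    return b0, b1

# hypothetical protein intake (g) and gain in weight (gm)
protein = [10, 11, 12, 13, 14, 15, 16]
gain    = [12, 13, 15, 16, 18, 20, 21]
b0, b1 = least_squares_fit(protein, gain)
print(f"Y = {b0:.2f} + {b1:.2f} X")   # predicted gain for a given protein intake
```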
Many statements are made with certain elements of uncertainty – it may probably rain tomorrow, a patient will probably recover after surgery, a newly tested drug may be effective, etc. No conclusion can be drawn with 100 per cent certainty. Probability is the measure of the chance or uncertainty associated with a conclusion.
If an event can occur in N mutually exclusive and equally likely ways and if m of
them possess a specific characteristic E, then
P(E) = m/N
Probability is usually expressed by the symbol ‘p’. It ranges from zero to one. When
p=0, it means that no chance of an event happening or its occurrence is impossible
(example, an animal giving birth to a human child). If p=1, it means that the chances
of an event happening are 100% (example, death for any living being).
Types of Probability ; There are two types of probability (a) mathematical and (b)
statistical
To find out the probabilities in all the above problems, evidence based on
empirical data is required.
1. If past experience indicates that 1000 first pregnancies resulted in the delivery of 530 girls, the probability of getting a girl in the first pregnancy is 530/1000 = 0.53.
2. If past data show that out of 200 cases of kidney transplantation, 80 succeeded, then the probability of survival after transplantation is 80/200 = 0.4.
3. If past data show that out of 1000 first deliveries, 180 developed postpartum depression, the probability that a woman develops postpartum depression after her first delivery is 180/1000 = 0.18.
Sum of the numbers obtained with two dice:

Die 2 \ Die 1 | 1 | 2 | 3 | 4 | 5 | 6
6 | 7 | 8 | 9 | 10 | 11 | 12
5 | 6 | 7 | 8 | 9 | 10 | 11
4 | 5 | 6 | 7 | 8 | 9 | 10
3 | 4 | 5 | 6 | 7 | 8 | 9
2 | 3 | 4 | 5 | 6 | 7 | 8
1 | 2 | 3 | 4 | 5 | 6 | 7
Theoretical probability – the total number of ways of getting a sum of 7 with two dice, divided by the total number of outcomes, is 6/36 = 0.167.
If probability of an event happening is p and that of not happening is defined by q,
then q=1-p or p+q=1
PROBABILITY LAWS
Law of Additivity
If A and B are mutually exclusive outcomes, then the probability that either A or B will occur is P(A or B) = P(A) + P(B).
Probabilities:
P(2) = 1/6
P(5) = 1/6
P(2 or 5) = P(2) + P(5) = 1/6 + 1/6 = 2/6 = 1/3
Example 3: A glass jar contains 1 red, 3 green, 2 blue, and 4 yellow
marbles. If a single marble is chosen at random from the jar,
what is the probability that it is yellow or green?
Probabilities:
P(yellow) = 4/10
P(green) = 3/10
P(yellow or green) = P(yellow) + P(green) = 4/10 + 3/10 = 7/10
The probability that independent events will occur jointly is the product of the probabilities of each event. If A and B are independent events, then the probability that both A and B will occur, P(AB), is P(AB) = P(A) × P(B).
Example 1: What is the probability of tossing a coin twice and getting a head on each toss?
Probabilities:
Probability of getting a head on the first toss: P(H1) = 1/2
Probability of getting a head on the second toss: P(H2) = 1/2
P(H1H2) = P(H1) × P(H2) = 1/2 × 1/2 = 1/4
Example 2: A coin is tossed and a single 6-sided die is rolled. Find the probability of
landing on the head side of the coin and rolling a 3 on the die.
Probabilities:
P(head) = 1/2
P(3) = 1/6
P(head and 3) = P(head) × P(3) = 1/2 × 1/6 = 1/12
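A minimal Python simulation that checks the multiplication law used in Example 2; the trial count and the random seed are arbitrary choices.

```python
import random

# P(head on a coin AND 3 on a die) should be close to 1/2 * 1/6 = 1/12 (about 0.083)
random.seed(1)
trials = 100_000
hits = 0
for _ in range(trials):
    coin = random.choice(["H", "T"])
    die = random.randint(1, 6)
    if coin == "H" and die == 3:
        hits += 1
print(hits / trials)   # close to 0.083
```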
Example 3: Newborns classified by blood ‘Rh’ factor and sex are cited below. From the table, the probability of a child being female with Rh +ve = 45/100 = 0.45.
The probability that event B occurs, given that event A has already occurred, is P(B | A) = P(A and B) / P(A).
Example 1 : If the analysis of records shows that 90 per cent of a large number of patients with abdominal tuberculosis came with complaints of pain in the abdomen, vomiting and constipation of long duration, then the conditional probability of these complaints, given abdominal tuberculosis, is 0.90. The conditional probability is restricted to a specific group; in the above example the restricted group is the patients with abdominal TB.
Uses of probability
2. All the 3 measures of Central tendency – mean, median and mode are
equal (i.e. mean = median = mode).
6. If the total area under the curve is considered as 100%, (Mean ± 0.67 S.D.) covers 50% of the area, (Mean ± 1.96 S.D.) covers 95% of the area and (Mean ± 2.58 S.D.) covers 99% of the area.
7. The 1st Quartile value (Q1) and 3rd Quartile value (Q3) are equidistant from
the Mean (Q2).
5. Various other statistical tests, like the t-test, chi-square test, F test, etc., are also developed on the basis of normality principles. In short, the entire body of tests of significance is founded upon the principles of normality.
6. During generalization of results, the researcher often has to predict a possible interval in which 95 per cent of the sample estimates may lie (the 95% confidence interval). Its calculation is made possible through the theory of the normal distribution.
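A minimal Python check of the areas quoted in point 6, using the error function for the standard normal distribution; no other assumptions are involved.

```python
from math import erf, sqrt

def area_within(z):
    """Proportion of a normal distribution lying within mean ± z standard deviations."""
    return erf(z / sqrt(2))

for z in (0.67, 1.96, 2.58):
    print(z, round(area_within(z), 3))
# 0.67 -> about 0.50, 1.96 -> about 0.95, 2.58 -> about 0.99
```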
The deviation from normality can be of two types : skewness and Kurtosis
Negative Skewness: If more values are concentrated above the median, it is called negative skewness. A negatively skewed distribution has a long tail to the left; the mean and median are both less than the mode, and the mean is less than the median. For a negatively skewed distribution, Mean < Median < Mode.
Positive Skewness: If more values are concentrated below the median, it is called positive skewness. A positively skewed distribution has a long tail to the right; the mean and median are both greater than the mode, and the mean is greater than the median. For a positively skewed distribution, Mean > Median > Mode.
If the values are symmetrically distributed on either side of the median, there is no
skewness and it is normal curve. A normal distribution curve has skewness zero.
There are many measures of skewness. A simple measure of skewness (Karl Pearson's coefficient) is
Skewness = (Mean − Mode) / S.D.
or, equivalently,
Skewness = 3(Mean − Median) / S.D.
Mesokurtic : A distribution that is peaked in the same way as any normal distribution
is said to be mesokurtic or normokurtic. The peak of a mesokurtic distribution is
neither high nor low rather it is considered to be a baseline for the two other
classifications
Leptokurtic : Leptokurtic distributions are those that have a peak greater than that of a mesokurtic distribution. Leptokurtic distributions are identified by peaks that are thin and tall.
Platykurtic : Platykurtic distributions are those that have a peak lower than a
mesokurtic distribution. Platykurtic distributions are characterized by a certain
flatness to the peak, and have slender tails
There are many measures of kurtosis; a simple formula for kurtosis is given by K = (Q3 − Q1) / [2(P90 − P10)], where Q1 = first quartile, Q3 = third quartile, P10 = 10th percentile and P90 = 90th percentile.
If K = 0.263, the curve is normokurtic. If it is greater than 0.263, it is platykurtic. If it is less than 0.263, it is leptokurtic.
Sampling error
For example, if one measures the height of a thousand individuals from a place with a population of one lakh, the average height of the thousand is typically not the same as the average height of all one lakh people of that place. Since sampling is typically done to determine the characteristics of a whole population, the difference between the sample and population values is considered to be sampling error.
Exact measurement of sampling error is generally not feasible since the true
population values are unknown
DESIGN OF EXPERIMENTS
It can be defined as “the logical construction of the experiment in which the degree of uncertainty with which the inference is drawn may be well defined”.
The specific questions that the experiment is intended to answer must be clearly
identified before carrying out the experiment.
Experiment
An experiment is a device or a means of getting an answer to problem under
consideration. Experiment can be classified into two categories as Absolute and
Comparative.
Absolute experiments consist in determining the absolute value of some characteristic, e.g. (i) obtaining the average intelligence quotient of a group of people, (ii) finding the correlation coefficient between two variables in a bivariate distribution.
Comparative experiments are designed to compare the effect of two or more
objects on some population characteristics.
Treatments
Various objects of comparison in a comparative experiment are termed as
treatments.
Example 1. A corn field is divided into four parts; each part is ‘treated’ with a different fertiliser to see which produces the most corn.
Example 2. A teacher practises different teaching methods on different groups in her class to see which yields the best results.
Example 3. A doctor treats a patient with a skin condition with different creams to see which is most effective.
Experimental units
The smallest division of the experimental material to which we apply the
treatments and on which we make observations on the variable under study is
termed as experimental unit. e.g., in field experiments, the plot of “land” is the
experimental unit.
Blocks
In agricultural experiments, most of the times we divide the whole
experimental units into relatively homogeneous sub-groups or strata. These strata,
which are more uniform amongst themselves than the field as a whole are known as
blocks.
Yield
The measurement of the variable under study on different experimental units
are termed as yields.
Example : A farmer wishes to evaluate a new fertilizer. He uses the new fertilizer on
one field of crops (A), while using his current fertilizer on another field of crops (B).
The irrigation system on field A has recently been repaired and provides adequate
water to all of the crops, while the system on field B will not be repaired until next
season. He concludes that the new fertilizer is far superior.
The problem with this experiment is that the farmer has neglected to control for the
effect of the differences in irrigation. This leads to experimental bias, the favoring of
certain outcomes over others. To avoid this bias, the farmer should have tested the
new fertilizer in identical conditions to the control group, which did not receive the
treatment. Without controlling for outside variables, the farmer cannot conclude that
it was the effect of the fertilizer, and not the irrigation system, that produced a better
yield of crops.
Another type of bias that is most apparent in medical experiments is the placebo effect. Since many patients are confident that a treatment will positively affect them, they react to a control treatment which actually has no physical effect at all, such as a sugar pill. For this reason, it is important to include control, or placebo, groups in medical experiments to evaluate the difference between the placebo effect and the actual effect of the treatment.
The simple existence of placebo groups is sometimes not sufficient for avoiding bias
in experiments. If members of the placebo group have any knowledge (or suspicion)
that they are not being given an actual treatment, then the effect of the treatment
cannot be accurately assessed. For this reason, double-blind experiments are
generally preferable. In this case, neither the experimenters nor the subjects are
aware of the subjects' group status. This eliminates the possibility that the
experimenters will treat the placebo group differently from the treatment group,
further reducing experimental bias.
Experimental Error
A large homogeneous field is divided into different plots and different
treatments are applied to these plots. Experience tells us that even if the same
treatment is used on all plots, the yields would still vary due to the difference in soil
fertility. Such variation from plot to plot, which is due to random factors beyond
human control, is called as experimental error.
Replication
Replication means ‘the repetition of the treatments under investigation’.
Randomization
‘Randomization’ is the process of assigning the treatments to the various experimental units in a purely chance manner.
Local Control
The process of reducing the experimental error by dividing the relatively heterogeneous experimental area (field) into homogeneous blocks is known as local control.
Randomization
It is the method of creating homogeneous treatment groups to eliminate any potential biases.
One standard method for assigning subjects to treatment groups is to label each
subject, then use a table of random numbers to select from the labeled subjects.
If, for instance, an experimenter had reason to believe that age might be a significant
factor in the effect of a given medication, he might choose to first divide the
experimental subjects into age groups, such as under 30 years old, 30-60 years old,
and over 60 years old. Then, within each age level, individuals would be assigned to
treatment groups using a completely randomized design.
Example
A researcher is carrying out a study of the effectiveness of four different skin creams
for the treatment of a certain skin disease. He has eighty subjects and plans to divide
them into 4 treatment groups of twenty subjects each. Using a randomized block
design, the subjects are assessed and put in blocks of four according to how severe
their skin condition is; the four most severe cases are the first block, the next four
most severe cases are the second block, and so on to the twentieth block. The four
members of each block are then randomly assigned, one to each of the four
treatment groups
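A minimal Python sketch of random assignment within blocks, in the spirit of the randomized block design just described; the subject IDs, the severity ordering and the cream names are placeholders, not details from the study.

```python
import random

random.seed(42)
treatments = ["Cream A", "Cream B", "Cream C", "Cream D"]
subjects = [f"S{i:02d}" for i in range(1, 81)]    # 80 subjects, assumed ordered by severity

for start in range(0, len(subjects), 4):          # each block = 4 consecutive subjects
    block = subjects[start:start + 4]
    order = treatments[:]
    random.shuffle(order)                         # random assignment within the block
    for subject, treatment in zip(block, order):
        print(subject, treatment)
```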
A matched pairs design is a special case of the randomized block design. It is used
when the experiment has only two treatment conditions; and participants can be
grouped into pairs, based on some blocking variable. Then, within each pair,
participants are randomly assigned to different treatments.
Consider a matched pairs design for the above-mentioned skin disease experiment. The 80 participants are grouped into 40 matched pairs; each pair is matched on gender and age. For example, pair 1 might be two women, both aged 21; pair 2 might be two women, both aged 22, and so on.
For the above example, the matched pairs design is an improvement over the
completely randomized design and the randomized block design. Like the other
designs, the matched pairs design uses randomization to control for confounding.
Example : with 4 treatments A, B, C and D, one typical arrangement of a 4×4 LSD (Latin square design) is given below.
A B C D
B A D C
C D A B
D C B A
TESTING OF HYPOTHESIS OR TESTS OF SIGNIFICANCE
Essentially, a hypothesis is an informed guess and a test of significance is a search for its proof. It is a search for the truth when two or more arguments appear to be ‘apparently correct’.
Hypotheses are the anticipated results; they are the starting point for any research. A hypothesis is usually formulated on the basis of a literature search, material evidence, real-life experience or even intuition, and the arguments put forward to accept or reject the claim are often termed tests of significance.
The methodology used to see whether the difference between a sample estimate (statistic) and the true value of the population (parameter), or between two or more independent sample estimates, is due to sampling variation (the peculiar nature of the sample) or otherwise is called testing of hypothesis or a test of significance. It measures the strength of evidence for believing that the claims put forward by the investigator are true or false.
The testing of hypothesis is often equated to a criminal trial, because every citizen in India is considered innocent before the court of law till his guilt is proved. Similarly, till the statistical significance is tested with suitable tests and proved, it is believed that there is no difference between the groups, or between the sample estimate and the true value of the population.
For example, if a researcher intends to test the association between smoking and cancer, he starts with the hypothesis that there is no association between smoking and cancer. Such a hypothesis, indicating ‘no difference’ or ‘equal’, is called the ‘Null Hypothesis’, denoted by ‘H0’.
Example : 10 patients are waiting in the outpatient section. The doctor can call any one of them as the first patient, the 2nd patient, the 3rd patient and so on with some amount of freedom, but the 10th patient must go last, and for her there is no freedom to be called or to enter, since she is the last one in the queue. So in a sample of 10 patients, all 9 patients except the last one have some amount of freedom to enter the study as the 1st, 2nd, 3rd, and so on. So if n is the sample size, the degrees of freedom (d.f.) are always one less than the total number, i.e. (n − 1). Thus, in a one-sample test with n items the degrees of freedom will be (n − 1), and in a two-sample test with sample sizes n1 = 10 (first sample) and n2 = 20 (second sample), the total d.f. = (n1 − 1) + (n2 − 1) = (n1 + n2 − 2) = 10 + 20 − 2 = 28.
6. Compare the calculated value of the test statistic with the table value and
interpret
Refer to the appropriate statistical table (for a normality test refer to the normal table, for a t-test the t table, for a chi-square test the chi-square table, and so on). As a general rule, refer to the table at the 5 per cent probability level (p = 0.05) corresponding to the appropriate degrees of freedom to get the minimum level of significance. If the calculated value of the statistic (z, t, chi-square or F) is more than the table value, the test is statistically significant at the 5 per cent level, i.e. the probability that the observed difference is due only to sampling variation is less than 5 in 100; in other words, in 95 per cent or more of such sample studies the argument of the researcher (the alternate hypothesis) may be true. If the calculated value is less than or equal to the table value, the inference is that the test is not statistically significant (p > 0.05), which means the null hypothesis is accepted, that is, there is no difference between the two groups, and any difference present numerically may be due to sampling variation.
Generally, the calculated values of the test statistic are compared with the table
value (2 sided or 2 tail).
Table x summarizes the state of affairs in the population and the nature of Type I and Type II errors.

State of the null hypothesis in the population | Decision: Accept H0 | Decision: Reject H0
H0 is true | Correct – no error | Type I error (α error)
H0 is false | Type II error (β error) | Correct – no error
If the decision has been made to reject the null hypothesis and, in fact, the null hypothesis is true, we have made a Type I error. A Type I error occurs when the researcher concludes that there is a statistically significant difference when in reality it does not exist. Type I and Type II errors are inversely related: when the probability of a Type II error decreases, the probability of a Type I error increases.
A Type I error has the probability of alpha (α), the level of statistical significance that
the researcher has set up.
If the alternative hypothesis is in fact true and the null hypothesis is actually false, but the decision maker concludes that the null hypothesis should not be rejected, then we have made what is called a Type II error. The probability of making this incorrect decision is called beta (β). The quantity (1 − β) is called the power of the test.
No error is made if the null hypothesis is true and the decision is made to
accept it. A correct decision is also made if the null hypothesis is false and the
decision is made to reject the null hypothesis.
b) Confidence Interval
Confidence interval estimation is one way to make inference about the parameter.
After drawing a random sample of adequate size, a sample statistic, either the sample mean or the sample proportion, is calculated. This value is called the point estimate. Then the researcher defines an interval around this value within which the population value is likely to lie. Since the researcher uses a sample, the first step in constructing an interval estimate is to decide on the risk the researcher is willing to take of being wrong. An interval estimate is wrong if it does not contain the population parameter. This probability of error is called α (alpha).
The exact value of α will depend on the nature of research question, but a 5 %
(.05) probability is commonly used. Setting α = .05,also called 95% confidence level,
means that over the long run the researcher is willing to be wrong only 5% of the
time. In other words, if the researcher draws 100 random samples of same size and
if the limits are calculated every time then it may not contain the parameter in 5
times.
From the table of the Normal distribution, it is known that mean ± 1.96 SD will
contain 95 per cent of the cases and only 5 per cent of the cases will lie outside it.
Similarly, mean ± 2.58 SD will contain 99 per cent of the cases and only 1 per cent
will lie outside. These properties can be used in the estimation procedure.
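For instance, the 95% and 99% limits for a mean follow directly from these multipliers applied to the standard error of the mean; the sketch below uses made-up summary figures (mean, SD and n are hypothetical).

```python
# Sketch: 95% and 99% confidence intervals for a mean (hypothetical data).
import math

mean, sd, n = 11.5, 1.8, 100           # hypothetical sample mean, SD and size
se = sd / math.sqrt(n)                 # standard error of the mean

ci95 = (mean - 1.96 * se, mean + 1.96 * se)
ci99 = (mean - 2.58 * se, mean + 2.58 * se)
print(f"95% CI: {ci95[0]:.2f} to {ci95[1]:.2f}")
print(f"99% CI: {ci99[0]:.2f} to {ci99[1]:.2f}")
```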
Introduction
There are various types of problems for which tests of significance are used for
drawing conclusions. Different types of problems need different tests, but the basis of
all tests and the steps involved in the procedure are the same. The common types of
problems are
PROCEDURE
Where
S = √[(n1S1² + n2S2²) / (n1 + n2)]
n1 - Size of the first sample
n2 - Size of the second sample
S1, S2 - Standard deviations of the first and second samples
Example:
Solution:
Given
                Sample 1    Sample 2
Sample size     62          76
Mean            15.5        20.0
SD              6.5         7.1
(b) Null hypothesis. There is no difference between the means of the hearing
thresholds taken in the sound proof room and in the field, that is, the two samples
have come from the same population.
Critical ratio = (x̄1 − x̄2) / [S √(1/n1 + 1/n2)] = 3.846
(e) Comparison with the theoretical value. The probability of observing this
value (3.846) or a greater value by chance is less than 1%. Hence the null hypothesis
is rejected (p<0.01).
(f) Inference. There is evidence to believe that the hearing level tested in the
sound-proof room is different from the hearing level tested in the field.
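The critical ratio of 3.846 can be reproduced from the summary figures tabulated above; the following sketch simply codes the pooled-SD formula used in this example, with no numbers beyond those already given.

```python
# Sketch: reproduce the critical ratio from the summary statistics above.
import math

n1, mean1, sd1 = 62, 15.5, 6.5
n2, mean2, sd2 = 76, 20.0, 7.1

# Pooled standard deviation: S = sqrt((n1*S1^2 + n2*S2^2) / (n1 + n2))
s = math.sqrt((n1 * sd1**2 + n2 * sd2**2) / (n1 + n2))
se = s * math.sqrt(1 / n1 + 1 / n2)          # standard error of the difference
ratio = abs(mean1 - mean2) / se

print(f"S = {s:.3f}, SE = {se:.3f}, critical ratio = {ratio:.3f}")   # about 3.846
```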
t = (x̄1 − x̄2) / [S √(1/n1 + 1/n2)]
SE = S √(1/n1 + 1/n2)
where
S = √[(Σ(x1 − x̄1)² + Σ(x2 − x̄2)²) / (n1 + n2 − 2)] is the pooled standard deviation of the two samples
Example:
In the feeding trial, 17 children were given a high protein food supplement in addition
to their normal diet and 15 comparable children were kept on the normal diet. The
total calorie intake per child per day in the high protein group was 1296 and in the
control group 1293. They were kept on this feeding trial for a period of seven months.
At the end of the study, the changes (initial – final) in the haemoglobin (g%) level of
the two groups were assessed and are given in Table 12.2. Does it provide any
evidence to say that the change in the haemoglobin level of the children who
received the high protein food is different from that of the control group?
Table 12.2 Changes in Haemoglobin Levels of Children in the High Protein Diet and
Control Groups
(b) Null hypothesis. The two samples have come from the population with
same mean. In other words, there is no difference in means of the change in
haemoglobin values between the children fed on the high protein diet and normal
diet.
(c) Standard error of the difference in means. This estimate of the standard
error of difference in means is given by the formula.
SE = S √(1/n1 + 1/n2)
In this problem,
S = √[(41.2704 + 33.8646) / (17 + 15 − 2)] = 1.5826 g%
t = (x̄1 − x̄2) / [S √(1/n1 + 1/n2)] = 2.923
(e) Comparison with the theoretical value. This critical ratio, t, follows a
t-distribution with n1 + n2 − 2 = (17 + 15 − 2 = 30) degrees of freedom. The
t-distribution for 30 degrees of freedom gives the 5% level as 2.042 and the 1% level
as 2.750. Our observed value is 2.923, which is greater than the 1% level. This
means the probability of getting by chance a value as large as 2.923 or greater is
less than 1% (p < 0.01).
(f) Inference. This experiment provides evidence to show that the mean
change in the haemoglobin level of the children fed on high protein diet is different
from the mean change in the haemoglobin level of the children fed on normal diet.
Comparison of Means of Two Correlated Samples (i.e., with the Same Subjects
in Both Groups): The Paired t-test
Critical Ratio.
t = (d̄ − 0) / (S / √n)
where
S = √[Σ(d − d̄)² / (n − 1)]
Example
Solution
(b) Null hypothesis. The sample is taken from the population in which there is
no difference in the skin-fold thickness.
(c) Standard error of the mean of difference. The mean of the difference is
0.75 mm. The estimate of the population standard deviation is given by
S = √[Σ(d − d̄)² / (n − 1)]
Where n stands for the number of subjects included in the study. In this problem,
S = √(20.25 / 11) = 1.357 mm
t = (d̄ − 0) / (S / √n) = 0.75 / (1.357 / √12) = 1.91
(e) Comparison with the theoretical value. This critical ratio, t, follows a t-
distribution with n − 1 (12 − 1 = 11) degrees of freedom. The 5% level is 2.201 and
the 1% level is 3.106 for 11 degrees of freedom. The value 1.91 is less than the 5%
level. This means the probability of getting by chance a value as large as 1.91 or
greater is more than 5%. Therefore, we do not reject the null hypothesis.
(f) Inference. This experiment does not provide any evidence to say that there
is a difference between the initial and final values of skin-fold thickness.
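The paired t value of 1.91 follows directly from the mean difference (0.75 mm), the standard deviation of the differences (1.357 mm) and n = 12 quoted above; the sketch below redoes this arithmetic, and the commented line shows how the same test could be run from raw paired data with scipy.

```python
# Sketch: paired t-test from the summary figures quoted above.
import math

d_bar, s_d, n = 0.75, 1.357, 12       # mean difference, SD of differences, pairs
t = (d_bar - 0) / (s_d / math.sqrt(n))
print(f"t = {t:.2f} on {n - 1} degrees of freedom")   # about 1.91

# With raw before/after measurements the equivalent call would be
# scipy.stats.ttest_rel(before, after).
```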
Elements of Analysis of Variance (F Test)
An ANOVA (analysis of variance), sometimes called the F test, is closely related to
the t test. The major difference is that, while the t test measures the difference
between the means of two groups, an ANOVA tests the difference between the
means of two or more groups.
Thus, while comparing the means of more than two groups, the total
variability is divided into two parts: (i) the share attributable to the assignable cause,
often termed the 'between group variation', and (ii) the share attributable to the
chance cause, called the 'within group variation', and the two are compared. It is
difficult to estimate directly the portion of the variation due to chance, because it is
beyond the control of the researcher. Still, we can estimate the total variation due to
all factors, from which the portion due to the assignable cause is subtracted to obtain
the portion due to the chance cause. Then the ratio of the variability due to the
assignable cause to the variability due to the chance cause (error) is called the
F value.
F = Mean variability between groups / Mean variability within groups (error)
As F is computed as a ratio, it is also known as the F ratio.
A one-way ANOVA, or single factor ANOVA, tests differences between groups that
are classified on only one independent variable. One potential drawback of an
ANOVA is that the F value tells only that there is a significant difference among the
groups, not which groups are significantly different from each other. To find out
where the differences exist, post-hoc comparisons are used. Some commonly used
post-hoc tests are Scheffé's and Tukey's.
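A one-way ANOVA on hypothetical data (the three groups and their values below are made up for illustration) could be run as follows; scipy.stats.f_oneway returns the F ratio described above and its p-value.

```python
# Sketch: one-way ANOVA on three hypothetical groups of measurements.
from scipy import stats

group_a = [12.1, 13.4, 11.8, 12.9, 13.0]
group_b = [14.2, 15.1, 14.8, 13.9, 15.3]
group_c = [12.5, 12.9, 13.1, 12.2, 13.6]

f_ratio, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_ratio:.2f}, p = {p_value:.4f}")
# A significant F only says the group means differ somewhere;
# a post-hoc test (e.g. Tukey's) is needed to locate the difference.
```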
The Chi-square test is a 'non-parametric test' or 'distribution free test'. It is a 'test for
association' or 'test for independence'. This test is most commonly used when the
data are in frequencies, such as the number of responses in two or more categories.
It can be used with any data which can be reduced to proportions or percentages.
The Chi-square test is denoted by the Greek letter χ² and is pronounced
'kye square'.
Chi-square (χ²) = Σ (O − E)² / E
Expected frequency (E) = (Row total × Column total) / Grand total
If no relationship exists between the row and column variables, the value of χ² will
be small. If there is a relationship between the variables, the value of the χ² statistic
will be large.
The d.f. is calculated by (C − 1)(R − 1), where C = number of columns and R =
number of rows.
The chi-square test can be used irrespective of the size of the sample. Still, it is
advisable to use it only if the expected frequency is more than 5 in each cell, and it is
not at all advisable if any expected frequency is zero.
However, in case the expected frequency is ≤5 in any cell, a correction formula
suggested by Yates is recommended. If Yates' correction is applied, the general
formula for the computation of chi-square is given below:
Chi-square (χ²) = Σ (|O − E| − ½)² / E
Illustration
Hypothetical data for chi-square showing the effectiveness of a new type of surgery
χ² = 9 + 9 + 3 + 3 = 24
Since the calculated value is greater than the table value, the observed
difference is statistically significant. The percentage having improvement in condition
is more in the treatment group (40%).
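Since the original 2 × 2 table for this illustration is not reproduced here, the sketch below uses a hypothetical 2 × 2 table (all counts are made up) to show how the chi-square statistic, its degrees of freedom and the expected frequencies are obtained; scipy applies Yates' correction for 2 × 2 tables when correction=True.

```python
# Sketch: chi-square test of association on a hypothetical 2x2 table.
import numpy as np
from scipy import stats

# Rows: treatment / control; columns: improved / not improved (made-up counts).
observed = np.array([[40, 60],
                     [20, 80]])

chi2, p, dof, expected = stats.chi2_contingency(observed, correction=True)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.4f}")
print("Expected frequencies:\n", expected.round(2))
```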
Mann-Whitney U Test
U1 = n1n2 + n1(n1 + 1)/2 − R1
U2 = n1n2 + n2(n2 + 1)/2 − R2
Where n1 = First sample size
n2 = Second sample size
R1 = Sum of ranks of the first group
R2 = Sum of ranks of the second group
Now U is the minimum of (U1, U2). Tables of U values are available for different
levels of significance and for different values of n1 and n2. For statistical significance,
the calculated value must be equal to or lower than the table value.
Alternatively, the z statistic can be computed as follows:
Z = (U − n1n2/2) / √[n1n2(n1 + n2 + 1) / 12]
Now, the critical value of z can be used to assess the statistical significance.
Example
A survey was conducted among mothers to ascertain their views on the PNDT Act.
One group consisted of primigravid mothers and the other of multigravid mothers;
the scores are given below. Can the researcher conclude that the opinions are the
same for the two groups?
Opinion on the PNDT Act
Subject   Score of primigravida   Rank   Score of multigravida   Rank
1         12                      3.5    14                      7
2         14                      7      25                      18
3         12                      3.5    17                      11
4         13                      5      18                      12
5         20                      14     19                      13
6         16                      10     21                      15
7         15                      9      24                      17
8         14                      7      22                      16
9         9                       1
10        11                      2
          n1 = 10       R1 = 62          n2 = 8        R2 = 109
U1 = n1n2 + n1(n1 + 1)/2 − R1
   = 10 × 8 + 10(10 + 1)/2 − 62
   = 73
U2 = n1n2 + n2(n2 + 1)/2 − R2
   = 10 × 8 + 8(8 + 1)/2 − 109
   = 7
z = (U − n1n2/2) / √[n1n2(n1 + n2 + 1) / 12]
  = (7 − 10 × 8/2) / √[10 × 8(10 + 8 + 1) / 12]
  = (7 − 40) / √126.67
  = −33 / 11.25
  = −2.93 (i.e. |z| = 2.93)
Since the calculated value of |z| is greater than the critical value at the 5% level
(1.96), the observed difference is significant. Multigravid women favour the act more
than primigravid women (find the mean and median of the two sets for clarity).
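The same calculation can be checked in code; the sketch below re-ranks the two sets of scores from the table above with scipy's rankdata (which handles ties with mean ranks, as in the table) and applies the U formulas of the text. Note that scipy.stats.mannwhitneyu uses a slightly different, but equivalent, convention for U, so the manual formulas are used here.

```python
# Sketch: Mann-Whitney U for the PNDT-opinion scores tabulated above.
import numpy as np
from scipy import stats

primi = [12, 14, 12, 13, 20, 16, 15, 14, 9, 11]   # n1 = 10
multi = [14, 25, 17, 18, 19, 21, 24, 22]          # n2 = 8

ranks = stats.rankdata(primi + multi)             # mean ranks for ties
r1, r2 = ranks[:len(primi)].sum(), ranks[len(primi):].sum()
n1, n2 = len(primi), len(multi)

u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1             # 73
u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2             # 7
u = min(u1, u2)

z = (u - n1 * n2 / 2) / np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
print(f"R1 = {r1}, R2 = {r2}, U1 = {u1}, U2 = {u2}, |z| = {abs(z):.2f}")  # about 2.93
```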
Wilcoxon Signed-Ranks Test
This test is an alternative to the paired 't' test. It is ideal when the level of
measurement is ordinal. It can also be used with interval or ratio level data when the
assumption of normality is not met. The computational procedure starts with finding
the difference for each pair of observations. In the next step the differences are
ranked ignoring the positive or negative signs; zero differences are ignored and are
not ranked. Assign rank 1 to the smallest difference, 2 to the next smallest, and so
on. When two or more differences are the same, use the procedure for tied ranks.
Now restore the signs to the ranks and find the sum of the positive ranks and the
negative ranks separately. Let the letter T denote the smaller of the two totals. For
significance, the calculated value of T must be equal to or lower than the table value.
Alternatively, the Z statistic can be computed as
Z = [T − n(n + 1)/4] / √[n(n + 1)(2n + 1) / 24]
Now the critical values of z can be used to assess the statistical significance.
Wilcoxon signed rank test
Knowledge scores of 13 subjects before and after an intervention programme are
given. Test whether the education programme was effective.
T= smaller of totals = 6
n=13
z = [T − n(n + 1)/4] / √[n(n + 1)(2n + 1) / 24]
  = (6 − 13 × 14/4) / √(13 × 14 × 27 / 24)
  = −39.5 / 14.31
  = −2.76 (i.e. |z| = 2.76)
Since the calculated value is greater than the table value (1.96), the observed
difference is statistically significant.
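The normal approximation above uses only T and n; the sketch below reproduces it for T = 6 and n = 13 as quoted in the example (with the raw before/after scores, scipy.stats.wilcoxon would be the direct equivalent).

```python
# Sketch: normal approximation for the Wilcoxon signed-rank statistic.
import math

T, n = 6, 13                                  # smaller rank total and sample size
mean_T = n * (n + 1) / 4                      # 45.5 under the null hypothesis
sd_T = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
z = (T - mean_T) / sd_T
print(f"|z| = {abs(z):.2f}")                  # about 2.76, beyond 1.96

# With the raw before/after scores the equivalent call would be
# scipy.stats.wilcoxon(before, after).
```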
RELIABILITY AND VALIDITY OF TEST SCORE
A test score is called reliable when we have reasons for believing the score to be
stable and trustworthy. Stability and trustworthiness depend upon the degree to
which the score is an index of 'true ability', that is, is free of chance error. Scores
achieved on unreliable tests are neither stable nor trustworthy. In fact, a comparison
of scores made upon repetition of an unreliable test, or upon two parallel forms of the
same test, will reveal many discrepancies, some large and some small, in the two
scores made by each individual in the group. The correlation of the test with itself,
computed in several ways, is called the reliability coefficient of the test.
There are mainly three procedures in common use for computing the reliability
coefficient (sometimes called the self-correlation) of a test. These are
In the split-half method, the test is first divided into two equivalent
"halves" and the correlation found for these half-tests. From the reliability of
the half-test, the self-correlation of the whole test is then estimated by the
Spearman-Brown prophecy formula. The first set of scores, for example,
represents performance on the odd-numbered items 1, 3, 5, 7, etc.; and the
second set of scores, performance on the even-numbered items 2, 4, 6, 8, etc.
Other ways of making up two half-tests which will be comparable in content,
difficulty and susceptibility to practice are employed, but the odds-evens split
is the one most commonly used. From the self-correlation of the half-tests, the
reliability coefficient of the whole test may be estimated from the formula
r = 2r1 / (1 + r1)
(Spearman-Brown prophecy formula for estimating reliability from two comparable
halves of a test; r1 is the correlation between the two half-tests)
The split-half method is regarded by many as the best of the methods for measuring
test reliability. One of its main advantages is the fact that all data for computing
reliability are obtained on one occasion, so that variations brought about by
differences between two testing situations are eliminated. A marked disadvantage of
the split-half technique lies in the fact that chance errors may affect scores on the
two halves of the test in the same way, thus tending to make the reliability coefficient
too high. This follows because the test is administered only once. The longer the
test, the less the probability that the effects of temporary and variable disturbances
will be cumulative in one direction, and the more accurate the estimate of score
reliability.
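A split-half reliability computation on hypothetical item scores (the simulated data, number of examinees and number of items below are all made up) would correlate odd-item and even-item totals and then apply the Spearman-Brown formula given above.

```python
# Sketch: split-half reliability with the Spearman-Brown correction.
import numpy as np

rng = np.random.default_rng(1)
n_examinees, n_items = 50, 20

# Hypothetical 0/1 item scores driven by a latent "true ability",
# so the two halves share something to correlate on.
ability = rng.normal(size=(n_examinees, 1))
prob_correct = 1 / (1 + np.exp(-ability))
scores = (rng.random((n_examinees, n_items)) < prob_correct).astype(int)

odd_total = scores[:, 0::2].sum(axis=1)     # items 1, 3, 5, ...
even_total = scores[:, 1::2].sum(axis=1)    # items 2, 4, 6, ...

r_half = np.corrcoef(odd_total, even_total)[0, 1]   # correlation of the half-tests
r_full = 2 * r_half / (1 + r_half)                  # Spearman-Brown prophecy formula
print(f"half-test r = {r_half:.2f}, estimated whole-test reliability = {r_full:.2f}")
```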
VITAL STATISTICS
Vital statistics have great importance in health. They help to identify the health
problems of the community and to work out solutions for them. Vital statistics are
systematically collected information regarding the events which occur in human life,
such as births, marriages, deaths etc., in a given population.
i. Census
ii. Records of vital Registration
iii. To see the met and unmet health needs of the community
iv. To fix up priority for National Health Programmes
4. Indicators of health
i. Mortality Indicators
ii. Morbidity indicators and
i. Mortality Indicators
Mortality means death. Death is very much related to the health status of a country.
Mortality rates are generally computed without considering any of the influencing
factors like age, sex, occupation etc. Such rates, computed in a crude form, are
called the Crude Death Rate (C.D.R.). If the 'factors influencing deaths' are taken
into account at the time of computation, the rate is called a Specific Death Rate.
The word Crude means not refined. Crude Death Rate means the death rate
that is computed without looking into the specific factors responsible for death such
as age, sex, occupation etc. So the Crude Death Rate is defined as the ratio of all
deaths from all causes during one year in a specified geographical area to the total
mid-year population, expressed per 1000.
If various factors responsible for deaths are considered, such rates are called
specific death rates. The important specific death rates are:
The word infants denote children below 1 year. In this age group, deaths are
more due to various reasons and hence, it is considered as a sensitive index. The
Infant Mortality Rate (IMR) is defined as the ratio of Infant deaths (deaths below 1
year) to the total live births in one year and always expressed per 1000.
It is defined as the ratio of deaths of infants from 28 days of age to less than 1 year
to the total live births, expressed per 1000. It is computed with the formula:
The Maternal Mortality Rate is defined as the ratio of maternal deaths (deaths
during pregnancy or after delivery but within 42 days) to total live births and
expressed per 1000.
MMR = (No. of maternal deaths × 1000) / Total live births in the year
Fatality means the chance of non-survival. So the fatality rate indicates the
severity of an illness. It is computed as the ratio of the number of deaths due to a
particular disease to the total cases registered in the hospital with the same disease,
and usually it is expressed as a percentage. It is computed with the formula:
PMR (age) = (No. of deaths after 60 years of age × 100) / Total deaths at all ages
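These mortality indicators are simple ratios; the sketch below computes the crude death rate, infant mortality rate and maternal mortality rate from hypothetical counts (all figures are made up for illustration).

```python
# Sketch: common mortality indicators from hypothetical counts.
def rate(numerator, denominator, per=1000):
    return numerator * per / denominator

deaths_all_causes = 800          # deaths in one year (hypothetical)
mid_year_population = 100_000
live_births = 2_500
infant_deaths = 60               # deaths below 1 year of age
maternal_deaths = 4              # deaths during pregnancy or within 42 days of delivery

cdr = rate(deaths_all_causes, mid_year_population)   # per 1000 mid-year population
imr = rate(infant_deaths, live_births)               # per 1000 live births
mmr = rate(maternal_deaths, live_births)             # per 1000 live births

print(f"CDR = {cdr:.1f}, IMR = {imr:.1f}, MMR = {mmr:.2f} (all per 1000)")
```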
a) Prevalence rate.
1. Population at risk
In mortality statistics, the base for the computation of various indices is generally the
mid-year population. But in the case of morbidity, the rates are computed by
considering the population at risk. It is defined as the aggregate of all the people
who are supposed to have the risk of getting the disease; it may be the total persons
residing in a geographic area, the total persons working in a factory, the students
and teachers assembled in a class, all family members residing together, and so on.
2. Incidence Rate
Incidence indicates new sickness. So incidence rates are used to measure the
number of newly affected persons or the new spells of sickness reported. If the
number of persons developing the disease during a specified time is considered, it is
called the Incidence Rate (persons), which is defined as the ratio of persons newly
affected to the population at risk and expressed per 1000 or even per 100. Likewise,
the Incidence Rate (spells or episodes) is the ratio of new spells of sickness to the
total population at risk per 1000. These rates can be calculated by the formula:
3. Prevalence Rate
Usually, prevalence rates are computed for chronic illnesses like tuberculosis,
leprosy etc. In this case, the word prevalence signifies cases, both old and new.
The period prevalence rate (persons) is defined as the ratio of current cases of
illness (both old and new) for a specified period to the total population at risk,
expressed per 1000.
Also, the period prevalence rate (spells) can be defined as the ratio of current spells
of sickness (both old spells and new spells) to the total population at risk, multiplied
by 1000.
At the same time, the point prevalence rate is defined as the ratio of current illness
at a particular point of time to the population at risk, expressed per 1000.
Period prevalence rate (persons) = (No. of persons currently sick during a specified period of time × 1000) / Total population at risk
Period prevalence rate (spells) = (No. of current spells of sickness during a period of time × 1000) / Total population at risk
Point prevalence rate = (No. of persons sick (both old and new cases) at a particular point of time × 1000) / Population at risk
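The morbidity rates above follow the same pattern, with the population at risk in the denominator; the sketch below uses hypothetical counts for a single community.

```python
# Sketch: incidence and prevalence rates from hypothetical counts.
population_at_risk = 5_000
new_cases = 40                    # persons newly affected during the period
new_spells = 55                   # spells (episodes) of sickness during the period
current_cases_period = 120        # old + new cases during the period
current_cases_point = 90          # cases present at a particular point of time

incidence_persons = new_cases * 1000 / population_at_risk
incidence_spells = new_spells * 1000 / population_at_risk
period_prevalence = current_cases_period * 1000 / population_at_risk
point_prevalence = current_cases_point * 1000 / population_at_risk

print(f"Incidence (persons) = {incidence_persons:.1f} per 1000")
print(f"Incidence (spells)  = {incidence_spells:.1f} per 1000")
print(f"Period prevalence   = {period_prevalence:.1f} per 1000")
print(f"Point prevalence    = {point_prevalence:.1f} per 1000")
```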
This rate is called crude because none of the factors like age, sex, religion,
occupation etc. responsible for high fertility has been considered at the time of its
computation. The Crude Birth Rate is defined as the ratio of total live births during
one year to the mid-year estimated population, expressed per 1000.
In the computation of the General Fertility Rate, one restriction is applied to the
denominator compared with the C.B.R., i.e. in the denominator only the female
population in the reproductive age group is considered. It is defined as the ratio of
total live births to the female population in the reproductive age group (15-44 or
15-49 years), expressed per 1000.
It is defined as the ratio of total live births in a specified age group (say age
group 20-24 years) to the total women in the same age group (20-24 years)
multiplied by 1000.
The sum of all single-year age-specific fertility rates is called the Total Fertility Rate.
If age-specific fertility rates are computed for intervals, say women aged 15-19
years, 20-24 years etc., the sum must be multiplied by the class interval. If the
single-year age-specific birth rates are computed per 1000 women, the total fertility
rate is also expressed per 1000 women.
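For example, with age-specific fertility rates per 1000 women in the usual five-year age groups (all rates below are hypothetical), the total fertility rate is the sum of the group rates multiplied by the class interval.

```python
# Sketch: total fertility rate from hypothetical 5-year age-specific rates.
# Rates are live births per 1000 women per year in each age group.
asfr_per_1000 = {
    "15-19": 40, "20-24": 180, "25-29": 160,
    "30-34": 90, "35-39": 40, "40-44": 15, "45-49": 5,
}
class_interval = 5   # width of each age group in years

tfr_per_1000_women = sum(asfr_per_1000.values()) * class_interval
tfr_per_woman = tfr_per_1000_women / 1000
print(f"TFR = {tfr_per_1000_women} per 1000 women = {tfr_per_woman:.2f} children per woman")
```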
The Net Reproduction Rate is the ideal rate to assess the fertility status of a country.
The Gross Reproduction Rate provides the average number of female children that
would be born to a female over her entire reproductive life span. But there is no
guarantee that all the women in the reproductive span, or their female offspring, are
going to survive throughout the reproductive life. Therefore, a deduction is made
from the GRR to that effect by considering the current mortality conditions of
females. The calculations are done by life table techniques. Almost all the developing
countries are trying to attain an N.R.R. = 1, so that the population may be able to
maintain itself. Then every woman in the reproductive age group will be replaced by
one, which makes no change in the population. If the N.R.R. is more than 1, it
indicates an increase in the population and if it is less than 1, it is suggestive of a
declining trend.