BSRM New Notes
Principal,
2021
First Edition: 2021
ISBN: 978-81-953600-1-7
Publication, Distribution and Promotion Rights reserved by Bhumi Publishing, Nigave Khalasa, Kolhapur
Despite every effort, there may still be chances for some errors and omissions to have crept in
inadvertently.
No part of this publication may be reproduced in any form or by any means, electronically, mechanically,
by photocopying, recording or otherwise, without the prior permission of the publishers.
The views and results expressed in various articles are those of the authors and not of editors or
publisher of the book.
Published by:
Bhumi Publishing,
Nigave Khalasa, Kolhapur 416207, Maharashtra, India
Website: www.bhumipublishing.com
E-mail: bhumipublishing@gmail.com
Book Available online at:
https://www.bhumipublishing.com/books/
Preface
We have great pleasure and privilege in presenting the book “Applied Biostatistics:
An Essential tool in Healthcare Profession”, which is based on the new PCI syllabus and
will be helpful in the courses of B. Pharm, M. Pharm, B. Sc., M. Sc. and B. Ed.
In this book every topic has been explained in detail and supported by sufficient
solved examples. The questions are categorised according to the types of methods
applied. We have tried to keep the language as simple as possible, which will help students
to understand the statistical concepts more easily. We have also tried to cover the
material in greater depth so that readers will benefit in their preparation for
competitive exams.
We hope that this book will be appreciated and accepted by all institutes, teachers
and students. There may be a few mistakes and deficiencies; we will be grateful if
readers point them out to us. We also welcome any suggestions from
your side.
Acknowledgment
First and foremost, praises and thanks to the God, the Almighty, for His showers of
blessings throughout my work to complete this book successfully.
I would like to express my deep and sincere thanks to Dr. Sanjay J. Surana, Principal,
R. C. Patel Institute of Pharmaceutical Education and Research, Shirpur and Shirpur
Education Society (SES), Shirpur for giving me opportunity and to make me able to
write this book.
I wish to express my special gratitude to my soul mate Mr. Umesh D. Laddha whose
dynamism, vision, sincerity and motivation have deeply inspired me for the
completion of this book.
I am also thankful to Central Hindu Military Education Society, Nashik for providing
facilities while writing this book.
I am extremely grateful to my parents for their love, prayers, caring and sacrifices for
educating and preparing me for my future. I am very much thankful to my Son
Devesh for his love, understanding and continuing support to complete my work. Also
I express my thanks to my sisters, brother and in laws for their support and valuable
time.
- Archana V. Nerpagar
This book is dedicated to
Hard work, Patience and Efforts….
And
To my lovely Son Devesh
Index
1. Introduction to biostatistics
2. Probability
1. INTRODUCTION TO BIOSTATISTICS
Introduction:
Statistics is a very broad subject, with applications in a vast number of different fields.
Statistics or statistical analysis is the branch of mathematics which deals with the study of
collection, analysis, interpretation, presentation and organization of data. In other words, statistics
is the methodology which scientists and mathematicians have developed for interpreting and
drawing conclusions from collected data. It is the science of gaining information from numerical
and categorical data. Statistics in practice is applied successfully to study the effectiveness of
medical treatments, the reaction of consumers to television advertising, the attitude of young
people toward marriage, and much more. It’s safe to say that nowadays statistics is used in every
field of science.
Biostatistics is defined as the application of statistical tools and methods to the data
derived from biological sciences. It is the application of statistics in the development and use of
therapeutic drugs and devices in humans and animals. The science of biostatistics consists of
biological experiments (specifically in medicine, pharmacy, agriculture and fishery), the collection,
analysis and interpretation of data, the inferences and results. The goal of Biostatistics is to promote
statistical science and its application in the study of medicine, human health and disease.
Let us first define some basic terms of statistics that are necessary for understanding
biological and agricultural analysis.
1. Statistical Data:
The collection of numerical statements of facts is called data. The numerical data used in
statistical analysis is obtained from scientific enquiry. The numerical facts collected in
scientific data are known as observations.
In statistics, data describes the characteristics of the individuals studied. A characteristic is a quality possessed
by an individual or observation.
Characteristics are of two types:
(i) Non measurable characteristics (Attributes): The characteristics related to the qualities of
the observations are called attributes. E.g. sex, literacy, blood group, pass, fail etc.
(ii) Measurable characteristics (variables): The characteristics related to the quantity of the
observations are called variables. Their values are always varying E.g. height, weights and
ages of persons, temperature, water salinity, etc.
For example, the weights of children in a class are 35kg, 37kg, 32kg, 38kg, 34kg, 39kg, 36kg and
40kg. This statistical statement contains numerical values which form data for analysis with 8
observations.
Type of Data:
Data can be classified into two types:
a) Qualitative data:
The data which deals with the descriptions or qualities of individuals is qualitative data.
This data can be observed but cannot be measured using any unit. The qualitative characteristics
are known as attributes.
e.g. blood group, colours, smells, tastes, appearance, emotions etc.
b) Quantitative data:
The data which deals with the numerical values of the individuals is quantitative data. This
data can be measured in some units. The quantifiable characteristics are known as variables.
e.g. height, weight, length, temperature etc.
The quantitative variables are further divided as follows:
i) Discrete variables: The variables that can take only a finite number of specific values in the
given range are known as discrete variables. Discrete variables can be counted in a finite amount of
time.
For example, we can count the change in our pocket, the money in a bank account, the number of
tablets in a pack, the number of students in a class, parity, or the number of myocardial infarctions.
ii) Continuous variables: The variables that can take an infinite number of values in the given
range are known as continuous variables. Their values cannot be counted one by one; we
would count forever and never finish. Many of the variables studied in biology are
continuous variables.
For example, age. We cannot state an exact age because it can always be measured more finely.
The age of a person could be given as 25 years, 5 months, 10 days, 6 hours, 40 minutes, 4 seconds, 4
milliseconds, 9 nanoseconds, 10 picoseconds, and so on. Other examples are weight, diastolic blood pressure,
volume and time required for recovery.
Collection of Data:
Data Collection is an important aspect of any type of research study. Inaccurate data
collection can impact the results of a study and ultimately lead to invalid results. The data can be
collected on the basis of qualitative or quantitative characteristics. To check the effect of drug in
curing a disease, we have to collect the quantitative information about patients before and after
application of drug.
There are many methods of data collection depending on our research designs and
methodologies. Generally, data is collected from two sources, primary sources and secondary
sources.
Primary sources: The original or first-hand source from which information is collected is called
a primary source, and the data collected from a primary source is primary data, i.e. when an
investigator collects data himself with a definite plan or purpose in mind, it is called
primary data. E.g. data obtained by the census commissioner for a population census.
To collect primary data following methods are used:
a) Observation method: Observation is the main source of information in the field of research. In
this method observations are recorded from experiments or a specific situation.
b) Questionnaire method: This method plays an important role in data collection process.
Questionnaire, usually, consists of number of objective questions that the respondent has to
answer.
The questionnaire should be designed properly. All questions asked should be relevant to the
subject of research. Questions should be short, simple, clear and easy to understand. They
should be arranged in order from easy to difficult. The information can be
collected through a questionnaire sent by mail or post, which is called postal inquiry.
c) Interview method: An interview is a verbal conversation between two people in order to
collect required information for research. It is an interactional communication in which questions
are asked by interviewer for specific purpose to obtain research related information and answers
are given by interviewee. There are different types of interview like Personal interview, Telephone
interview, Focus Group interview, Depth interview, Projective techniques.
Secondary Sources: The sources of information such as published literature or published reports
are known as secondary sources and data collected from secondary sources is secondary data. i.e.
data which is not originally collected but obtained from published or unpublished sources is called
secondary data. Some of the secondary sources are as follows:
1) Government publications, 2) Census report, 3) Periodicals and books, 4) Research review
journals, 5) Research articles, 6) Research papers, 7) Magazines, 8) Academic publications, 9)
Research literature, 10) Ph.D. Thesis.
Classification of Data:
The process of arranging collected data into homogeneous groups or classes according to their
common characteristics is called classification. After collecting the qualitative or quantitative data, it
is required to sort the data from the questionnaire according to the common characteristics. With
proper classification, the unnecessary information is dropped.
e.g. During population census, people in the country are classified according to sex
(males/females), marital status (married/unmarried), residential place (rural/urban), age groups,
profession, etc.
Raw Data: When some information is collected and presented as it is, without any arrangement, it is called raw data.
For Example: Given below are the marks (out of 25) obtained by 20 students of class VII A in
mathematics in a test.
18, 16, 12, 10, 5, 5, 4, 19, 20, 10, 12, 12, 15, 15, 15, 8, 8, 8, 8, 16
Observation:
Each entry collected as a numerical fact in the given data is called an observation.
Array:
The raw data when put in ascending or descending order of magnitude is called an array or
arrayed data.
For Example: The above data is arranged in ascending order and represented as:
4, 5, 5, 8, 8, 8, 8, 10, 10, 12, 12, 12, 15, 15, 15, 16, 16, 18, 19, 20
Range:
The difference between the highest and the lowest value of the observation is called the
range of the data.
In the above data,
Highest marks obtained = 20
Lowest marks obtained = 4
Therefore, 𝑟𝑎𝑛𝑔𝑒 = 20 − 4 = 16
Frequency Distribution:
a) Frequency: The number of times an observation or value of a variable occurs in a given series of
observations is called the frequency of that observation.
e.g. consider the marks of 15 students in a class as follows: 22, 24, 20, 22, 23, 20, 25, 22, 22,
25, 20, 25, 20, 22, 24.
Here, number 20 is repeating 4 times. So frequency of 20 is 4.
22 is repeating 5 times. So its frequency is 5.
Similarly, frequency of 23 is 1, frequency of 24 is 2 and frequency of 25 is 3.
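The counting in this example can be checked with a short script. This is only a minimal sketch in Python; the variable names `marks` and `frequency` are chosen here for illustration and are not part of the text.

```python
from collections import Counter

# Marks of the 15 students from the example above.
marks = [22, 24, 20, 22, 23, 20, 25, 22, 22, 25, 20, 25, 20, 22, 24]

# Counter maps each distinct observation to the number of times it occurs.
frequency = Counter(marks)

# Print the observations in ascending order with their frequencies.
for value in sorted(frequency):
    print(value, frequency[value])
```

The frequencies add up to 15, the number of students, which is a quick check that no observation was missed.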
b) Frequency distribution: The tabular arrangement of observations in the collected scientific
data individually or in groups or classes along with their frequencies is called frequency
distribution.
c) Class: The group of observations in the data under our consideration is called class.
e.g. the marks out of 100 can be divided into the classes as 0-10, 10-20, 20-30, …, 90-100.
Classes are also known as class intervals. Each class interval is assigned two values. The
smallest value is called lower limit and the highest value is called upper limit of certain class
interval.
e.g. For a class 40-50,
Lower limit = 40 and upper limit = 50.
Classes can be of two types:
i) Continuous classes: The classes of the form 10-20, 20-30, 30-40,… in which lower limit of any
class is equal to upper limit of its previous class are called continuous classes.
ii) Non continuous classes: The classes of the form 10-19, 20-29, 30-39, … are called non
continuous classes.
But these classes can be made continuous by subtracting a term $\frac{d}{2}$ from the lower limits of all
classes and adding a term $\frac{d}{2}$ to the upper limits of all classes. The newly formed class intervals are the
d) Class width (h): The difference between upper limit and lower limit of the class interval is called
class width. It is denoted by h .
Class width = upper limit − lower limit
e.g. for a class interval 45-55,
Class width = 55 − 45 = 10
e) Class mark or mid value (X): The class mark or mid value is the value which lies exactly in the
middle of the class interval. It is denoted by X and given by,
$\text{class mark} = X = \frac{\text{lower limit} + \text{upper limit}}{2}$
e.g. for the class interval 30-40,
$\text{Class mark} = X = \frac{30 + 40}{2} = 35$.
f) Relative frequency:
It is given by the formula,
$\text{Relative frequency of a class} = \frac{\text{Frequency of a class}}{\text{Total frequency}}$
g) Percentage frequency:
$\text{Percentage frequency of a class} = \frac{\text{Frequency of a class}}{\text{Total frequency}} \times 100$
h) Frequency density: If the class intervals of a frequency distribution are of unequal width, the
frequency densities can be used to compare the concentration of frequencies in the class intervals and to
construct a histogram.
$\text{Frequency density of a class} = \frac{\text{Frequency of a class}}{\text{Class width}}$
Ex.1. The following data gives the marks obtained by 50 students in Mathematics. Prepare a grouped
frequency distribution table taking the class intervals 20-24, 25-29, 30-34, etc.
21 20 55 39 48 46 36 54 42 30
29 42 32 40 34 31 35 37 52 44
39 45 37 33 51 53 52 46 43 47
41 26 52 48 25 34 37 33 36 27
54 36 41 33 23 39 28 44 45 38
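One way to tally this exercise is sketched below in Python. It assumes inclusive classes (both limits belong to the class, as in 20-24) and eight classes, ending at 55-59 so that the largest mark, 55, is covered. The helper name `grouped_frequency` is hypothetical.

```python
# Marks of the 50 students, row by row as given in the exercise.
marks = [
    21, 20, 55, 39, 48, 46, 36, 54, 42, 30,
    29, 42, 32, 40, 34, 31, 35, 37, 52, 44,
    39, 45, 37, 33, 51, 53, 52, 46, 43, 47,
    41, 26, 52, 48, 25, 34, 37, 33, 36, 27,
    54, 36, 41, 33, 23, 39, 28, 44, 45, 38,
]

def grouped_frequency(data, start, width, n_classes):
    """Count observations in inclusive classes start..start+width-1, and so on."""
    table = []
    for k in range(n_classes):
        lower = start + k * width
        upper = lower + width - 1          # inclusive upper limit, e.g. 24 for 20-24
        count = sum(1 for x in data if lower <= x <= upper)
        table.append((lower, upper, count))
    return table

table = grouped_frequency(marks, start=20, width=5, n_classes=8)
for lower, upper, count in table:
    print(f"{lower}-{upper}: {count}")
```

The frequencies should add up to 50; if they do not, an observation has fallen outside the chosen classes.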
Ex.2. Prepare a grouped frequency distribution table from following data. Take the classes 30-55,
55-80, etc.
110 175 161 157 155 108 164 128 114 178
165 133 195 151 71 94 97 42 30 62
138 156 167 124 164 146 116 149 104 141
103 150 162 149 79 113 69 121 93 143
140 144 187 184 197 87 40 122 103 148
Here, the observation 155 is included in the class 155-180 and not in 130-155, where it is the upper
limit.
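The rule that an observation equal to an upper limit goes into the next class can be written as a half-open interval, lower ≤ x < upper. The following Python sketch applies it to the data of this exercise, assuming seven classes from 30-55 to 180-205. The helper name `exclusive_frequency` is hypothetical.

```python
# The 50 observations of the exercise, row by row.
values = [
    110, 175, 161, 157, 155, 108, 164, 128, 114, 178,
    165, 133, 195, 151, 71, 94, 97, 42, 30, 62,
    138, 156, 167, 124, 164, 146, 116, 149, 104, 141,
    103, 150, 162, 149, 79, 113, 69, 121, 93, 143,
    140, 144, 187, 184, 197, 87, 40, 122, 103, 148,
]

def exclusive_frequency(data, start, width, n_classes):
    """Count observations in classes [lower, upper): the upper limit belongs to the next class."""
    table = []
    for k in range(n_classes):
        lower = start + k * width
        upper = lower + width
        count = sum(1 for x in data if lower <= x < upper)
        table.append((lower, upper, count))
    return table

table = exclusive_frequency(values, start=30, width=25, n_classes=7)
counts = {(lower, upper): count for lower, upper, count in table}
for (lower, upper), count in counts.items():
    print(f"{lower}-{upper}: {count}")
```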
Frequencies can also be distributed using cumulative frequency.
Cumulative frequency (c.f.): The successive addition of frequencies in the table is known as
cumulative frequency. It is calculated by adding each frequency from a frequency distribution table
to the sum of its predecessors. Cumulative frequency is used to determine the number of
observations that lie above (or below) particular value in a data set.
There are two types of cumulative frequency:
a) Cumulative Frequency Less Than Type (c.f.l.t.t.):
The successive addition of frequencies of all classes previous to the current class is
cumulative frequency less than type. The addition is carried out from top to bottom i.e. from lowest
class to the highest class.
b) Cumulative Frequency More Than Type (c.f.m.t.t.):
It is obtained by adding the frequencies of highest class to the lowest class i.e. addition of
frequencies is from bottom to top.
Ex.1. Find cumulative frequency distribution for following data.
Class Limits 10-20 20-30 30-40 40-50 50-60 60-70 70-80
Frequencies 2 4 7 10 16 8 3
Ans:
Less than C.F More than C.F
Class Limit Freq C.B Marks C.F Marks C.F
Total 50
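Both cumulative frequency columns for this example can be generated by running sums. This is a small Python sketch using the class limits and frequencies given in the question; the names `less_than` and `more_than` are illustrative.

```python
from itertools import accumulate

classes = ["10-20", "20-30", "30-40", "40-50", "50-60", "60-70", "70-80"]
freq = [2, 4, 7, 10, 16, 8, 3]

# Less than type: add frequencies from the lowest class downwards in the table.
less_than = list(accumulate(freq))

# More than type: add frequencies from the highest class upwards in the table.
more_than = list(accumulate(reversed(freq)))[::-1]

for cls, f, lt, mt in zip(classes, freq, less_than, more_than):
    lower, upper = cls.split("-")
    print(f"{cls}: f={f}, less than {upper}: {lt}, more than {lower}: {mt}")
```

The last less than value and the first more than value both equal the total frequency, 50.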
Constructing relative frequency and percentage frequency tables:
Thirty AA batteries were tested to determine how long they would last. The results, to the
nearest minute, were recorded as follows:
423, 369, 387, 411, 393, 394, 371, 377, 389, 409, 392, 408, 431, 401, 363, 391, 405, 382, 400, 381,
399, 415, 428, 422, 396, 372, 410, 419, 386, 390
An analyst studying these data might want to know not only how long batteries last, but also
what proportion of the batteries falls into each class interval of battery life.
This relative frequency of a particular observation or class interval is found by dividing the
frequency (f) by the number of observations (n): that is, (f ÷ n). Thus:
Relative frequency = frequency ÷ number of observations
The percentage frequency is found by multiplying each relative frequency value by 100.
Thus:
Percentage frequency = relative frequency × 100 = (f ÷ n) × 100
Battery life,    Frequency (f)    Relative     Percent
minutes (x)                       frequency    frequency
360-369          2                0.07         7
370-379          3                0.10         10
380-389          5                0.17         17
390-399          7                0.23         23
400-409          5                0.17         17
410-419          4                0.13         13
420-429          3                0.10         10
430-439          1                0.03         3
Total            30               1            100
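The relative and percentage frequency columns of the battery table follow directly from the two formulas above. A minimal Python sketch, using the class frequencies from the table (the dictionary names are illustrative):

```python
# Class frequencies of battery life (minutes) from the table above.
freq = {
    "360-369": 2, "370-379": 3, "380-389": 5, "390-399": 7,
    "400-409": 5, "410-419": 4, "420-429": 3, "430-439": 1,
}

n = sum(freq.values())                                  # number of observations

relative = {cls: f / n for cls, f in freq.items()}      # f ÷ n
percent = {cls: r * 100 for cls, r in relative.items()} # (f ÷ n) × 100

for cls in freq:
    print(cls, freq[cls], round(relative[cls], 2), round(percent[cls]))
```

Because each value is rounded separately, the rounded columns need not add up to exactly 1 or 100, although the unrounded values do.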
The data from the table above has been summarized in the line graph below.
[Figure: line graph of weight in kg]
b) Bar Diagram:
It is commonly used to represent the statistical data. Bar is a thick line. In bar diagram only
the length or height of bars is taken into consideration. The data is represented by thick bars of
uniform width keeping the uniform gaps in between two bars. The lengths or heights of bars are
taken proportional to the values they represent. Bars can be drawn vertically or horizontally.
The bar diagram is classified into four main types:
(i) Simple Bar Diagram:
It is used to represent only one observation i.e. one bar represents one observation. So
there are as many bars as the number of observations. We can use different colours or shades to
identify data and to make the diagram attractive.
Ex.1. Draw a simple bar diagram to represent the profits of a bank for 5 years.
Ans:
[Figure: simple bar diagram of profits (million $) for the years 1989 to 1993]
[Figure: bar diagram of population for the years 1971, 1981, 1991, 2001 and 2011]
[Figure: bar diagram of imports and exports for the years 1991 to 1995]
[Figure: bar diagram of number of students (A and B) in Arts, Science, Commerce and Agriculture]
[Figure: bar diagram of oats, barley and wheat for the years 1991 to 1995]
[Figure: bar diagram of marks in Practical, Stat and Maths for the years 2005 to 2007]
[Figure: percentage bar diagram of oats, barley and wheat for the years 1991 to 1994]
c) Pie Diagram:
A pie diagram or pie chart is a circular graph in which a circle is divided into sectors. The
angle of sector is proportional to the frequency or percentage of observation. Different shades or
colours can be used to differentiate the variables.
Steps to construct pie diagram/chart:
1. Express the given values of the variables in terms of angles/degrees of the total value, i.e.
if a set of actual values of frequencies is given, then the angle is given by
$\theta = \frac{\text{actual frequency of observation}}{\text{total frequency}} \times 360$
Ans:
Item Expenditure Angle (𝜽)
Agriculture 4200 210
Irrigation 1500 75
Health 1000 50
Education 500 25
Total 7200 360
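The angles in this table can be reproduced from the formula for $\theta$. A short Python sketch, using the expenditure figures from the table (the names `expenditure` and `angles` are illustrative):

```python
# Expenditure by item, as in the table above.
expenditure = {"Agriculture": 4200, "Irrigation": 1500, "Health": 1000, "Education": 500}

total = sum(expenditure.values())

# Angle of each sector: (value / total) × 360 degrees.
angles = {item: value / total * 360 for item, value in expenditure.items()}

for item, angle in angles.items():
    print(f"{item}: {angle:.0f} degrees")
```

The angles always add up to 360 degrees, which is a useful check before drawing the sectors.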
[Figure: pie diagram of expenditure on Agriculture, Irrigation, Health and Education]
The graphic method of representing data is becoming more effective and powerful than
diagrammatic representation. It plays an important role in comparison in all fields of study.
According to A. L. Boddington, “The wandering of a line is more powerful in its effect on the mind
than a tabulated statement; it shows what is happening and what is likely to take place, just as
quickly as the eye is capable of working.” The presentation of statistics in the form of graphs
facilitates many processes. Frequency distribution can be represented graphically in following
ways:
a) Histogram
b) Frequency polygon
c) Frequency curve
d) Ogive curve
a) Histogram:
It is one of the most important and useful methods of presenting continuous frequency
distribution. A histogram is similar to a bar diagram which shows continuous frequency
distribution of quantitative data. In this, the continuous class intervals are taken along X-axis and
the frequencies on Y-axis.
Steps to construct histogram:
1. Draw the vertical and horizontal axes using scale.
2. Take the continuous classes on X-axis and if the classes are not continuous then make them
continuous and write on X- axis.
3. Take the frequencies on Y-axis with certain multiples.
4. Draw the bar up to the required frequency for each class interval.
5. Different shades or colours can be used to decorate histogram.
b) Frequency Polygon:
Frequency polygon is a line graph derived from histogram by joining the mid points of all
bars in histogram. It begins and ends at the base line i.e. X-axis.
Steps to construct frequency polygon:
1. Draw the histogram for the given data with continuous class intervals.
2. Take the class interval before the given first class and class interval after the last class with
frequency as zero and the constant width.
3. Mark the mid points of all classes at the top of the bars. Also mark the mid points of two
extra classes in step 2.
4. Join all mid points successively with straight lines.
5. The complete bounded figure is frequency polygon.
c) Frequency Curve:
With the help of frequency polygon and histogram, we can draw a smooth curve. It is
obtained by joining the points in frequency polygon with free hand in order to get smooth curve. It
removes the ruggedness of polygon. A smoothed frequency curve represents a generalised
characterization of the data collected from the population or mass. Like frequency polygon,
frequency curve also begins and ends at the base line.
d) Ogive curve and cumulative frequency polygon:
It is also known as Cumulative frequency curve, as it is used to represent cumulative
frequency distribution of continuous classes. As there are two types of cumulative frequencies i.e.
less than type and more than type, accordingly there are two types of ogives for any grouped
frequency distribution.
(i) Less than frequency curve (Ogive)
(ii) More than frequency curve (Ogive)
(i) Less Than Frequency Curve:
In this, cumulative frequency less than type is calculated and plotted against the upper limit
of the classes. The points so obtained are joined by a smooth curve. It is an increasing curve sloping
upward from left to right of the graph. It is in the shape of an elongated ‘S’.
(ii) More Than Frequency Curve:
In this, cumulative frequencies more than type are calculated and plotted against the lower
limit of the classes. The points so obtained are joined by a smooth curve. It is a decreasing curve
sloping downward from left to right of the graph. It is in the shape of elongated upside down ‘S’.
An interesting feature of the two ogive curves together is that their point of intersection
gives the median.
Steps for constructing an ogive:
1. Prepare the required cumulative frequency distribution table either less than or more than
or both.
2. Draw and label the X (horizontal) and the Y (vertical) axes.
3. Represent the cumulative frequencies on the Y-axis and the class limits on the X-axis.
4. Plot the cumulative frequency at each class limit with the height being the corresponding
cumulative frequency.
5. Connect the points with segments. Less than ogive curve always starts from coordinate
point zero.
Significance of graphic representation:
Graphic representation is a visual form of presentation of data. It is more effective and
result oriented than diagrammatic representation. The presentation of statistics in the form of
graphs facilitates many processes in biostatistics.
$= \frac{x_1 + x_2 + \dots + x_n}{n}$
$\bar{x} = \frac{\sum x_i}{n}$
Ex.1. The weights of 5 students (in kg) are 20, 21, 25, 14 and 30. Find the average of their weights.
Ans: Here, $n = 5$
$\bar{x} = \frac{\sum x_i}{n}$
$= \frac{20 + 21 + 25 + 14 + 30}{5}$
$= \frac{110}{5}$
$= 22$ kg
30 = 𝑎 + 20
𝑎 = 30 − 20 = 10
$\bar{x} = 10$
So, $\bar{x} = \frac{\sum x_i}{n}$
$10 = \frac{p + 4p}{2}$
$20 = 5p$
$p = \frac{20}{5}$
$p = 4$
100 = 3𝑘 + 16
84 = 3𝑘
𝑘 = 28
Obs ($x_i$)    Freq ($f_i$)      $f_i x_i$
10             7                 70
20             2                 40
30             5                 150
40             3                 120
50             9                 450
Total          $\sum f_i = 26$   $\sum f_i x_i = 830$
$\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{830}{26} = 31.9230$
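The column totals and the mean of this table can be verified with a few lines of Python. This is a sketch only; the names `x`, `f`, `sum_f` and `sum_fx` simply mirror the symbols $x_i$, $f_i$, $\sum f_i$ and $\sum f_i x_i$.

```python
# Observations and their frequencies from the table above.
x = [10, 20, 30, 40, 50]
f = [7, 2, 5, 3, 9]

sum_f = sum(f)                                  # total frequency
sum_fx = sum(fi * xi for fi, xi in zip(f, x))   # sum of the f_i * x_i column

mean = sum_fx / sum_f
print(sum_f, sum_fx, round(mean, 4))
```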
$\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{229}{72} = 3.18056$
2. Find the mid values (X) of all classes in the third column using the formula,
$\text{class mark} = X = \frac{\text{lower limit} + \text{upper limit}}{2}$
3. Take any mid value as an assumed mean A; for easy calculations consider the middle value as
assumed mean.
4. Calculate the step deviation in the fourth column using $u_i = \frac{X_i - A}{h}$, where $h$ is the class width.
Ex.1. Calculate average marks by step deviation method from the following data.
Marks              0 – 10    10 – 20    20 – 30    30 – 40    40 – 50    50 – 60
No. of students    42        44         58         35         26         15
Marks      $f_i$    $X_i$       $u_i$    $f_i u_i$
0 – 10     42       5           -2       -84
10 – 20    44       15          -1       -44
20 – 30    58       $25 = A$    0        0
30 – 40    35       35          1        35
40 – 50    26       45          2        52
50 – 60    15       55          3        45
Total      220                           4
$\text{A.M.} = \bar{x} = A + \frac{\sum f_i u_i}{\sum f_i} \times h$
$= 25 + \frac{4}{220} \times 10$
$= 25 + 0.1818$
$= 25.1818$
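The step deviation calculation for this example can be sketched in Python as follows. The code assumes equal class width `h = 10` and the assumed mean `A = 25` used above; the variable names are illustrative. As a cross-check it also computes the mean directly from the mid values, which must give the same result whatever value of A is chosen.

```python
# Lower class limits, class width and frequencies from the example above.
lower = [0, 10, 20, 30, 40, 50]
h = 10
f = [42, 44, 58, 35, 26, 15]

mid = [l + h / 2 for l in lower]        # class marks X_i: 5, 15, ..., 55
A = 25                                  # assumed mean (a mid value)
u = [(x - A) / h for x in mid]          # step deviations u_i = (X_i - A) / h

sum_f = sum(f)
sum_fu = sum(fi * ui for fi, ui in zip(f, u))

mean = A + sum_fu / sum_f * h           # step deviation formula

# Cross-check: direct method using sum(f_i * X_i) / sum(f_i).
direct_mean = sum(fi * xi for fi, xi in zip(f, mid)) / sum_f

print(sum_f, sum_fu, round(mean, 4), round(direct_mean, 4))
```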
Classes    $f_i$    $X_i$       $u_i$    $f_i u_i$
0 – 10     5        5           -3       -15
10 – 20    10       15          -2       -20
20 – 30    40       25          -1       -40
30 – 40    30       $35 = A$    0        0
40 – 50    20       45          1        20
50 – 60    10       55          2        20
60 – 70    4        65          3        12
Total      119                           -23
$\text{A.M.} = \bar{x} = A + \frac{\sum f_i u_i}{\sum f_i} \times h$
$= 35 + \frac{-23}{119} \times 10$
$= 35 - 1.93$
$= 33.07$
Ex.3. Calculate mean by step deviation method.
Classes    0-30    30-60    60-90    90-120    120-150    150-180
Freq.      8       13       22       27        18         7
Classes    $f_i$    $X_i$       $u_i$    $f_i u_i$
0-30       8        15          -2       -16
30-60      13       45          -1       -13
60-90      22       $75 = A$    0        0
90-120     27       105         1        27
120-150    18       135         2        36
150-180    7        165         3        21
Total      95                            55
$\text{A.M.} = \bar{x} = A + \frac{\sum f_i u_i}{\sum f_i} \times h$
$= 75 + \frac{55}{95} \times 30$
$= 75 + 17.37$
$= 92.37$
Ex.4. Calculate average marks by step deviation method.
Marks 0-10 10-20 20-30 30-40 40-50 50-60
No. Of stud 42 44 58 35 26 15
$\text{A.M.} = \bar{x} = A + \frac{\sum f_i u_i}{\sum f_i} \times h$
$= 35 + \frac{61}{139} \times 10$
$= 35 + 4.39$
$= 39.39$
Merits of Mean:
(i) It is rigidly defined.
(ii) It is easy to understand and easy to calculate.
(iii) It is the unique value.
(iv) It is based upon all values of the given data.
(v) It is capable of further mathematical treatment.
(vi) It is not much affected by sampling fluctuations.
Demerits of Mean:
(i) It cannot be calculated if any observations are missing.
(ii) It cannot be calculated for the data with open end classes.
(iii) It is affected by extreme values.
(iv) It cannot be located graphically.
(v) It may be a number which is not present in the data.
(vi) It cannot be calculated for the data representing qualitative characteristic.
2. Median:
The median is the value which divides the data into two equal parts. Half of the
observations are above the median and half are below it. It is determined by ranking the data and
finding the number of observations. It is another frequently used measure of central tendency.
Calculation of Median:
Depending on types of data, there are following methods for the calculation of median:
a) For raw data:
Steps:
1. Arrange the given data in ascending order.
2. If the number of observations (n) is odd then the median is the exact central value,
i.e. $\text{Median} = \text{observation at position } \frac{n+1}{2}$
3. If the number of observations is even then there are two central values, say $x_1$ and $x_2$, such that
$x_1 = \left(\frac{n}{2}\right)$th observation and $x_2 = \left(\frac{n}{2} + 1\right)$th observation. Hence the median is the average of
$x_1$ and $x_2$.
Ex.1. Find median: 61, 63, 60, 64, 65, 62, 63, 69, 68.
Ans: Arrange data in ascending order: 60, 61, 62, 63, 63, 64, 65, 68, 69
Here, $n = 9$ (odd no.)
Hence, $\text{Median} = \text{observation at position } \frac{n+1}{2} = \frac{9+1}{2} = 5$
$\text{Median} = 5\text{th observation} = 63$
Ex.2. Find median: 30, 60, 28, 35, 46, 47, 63, 64, 62, 32
Ans: Ascending order: 28, 30, 32, 35, 46, 47, 60, 62, 63, 64
Here, $n = 10$ (even no.)
Hence, central observations = observations at positions $\frac{n}{2}$ and $\frac{n}{2} + 1$
= observations at positions 5 and 6
$\text{Median} = \frac{5\text{th obs} + 6\text{th obs}}{2} = \frac{46 + 47}{2} = 46.5$
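The steps for raw data, in both the odd and the even case, can be sketched as a small Python function. The function name `median` is chosen here for illustration; positions in the text are counted from 1, while Python list indices start at 0, hence the `- 1`.

```python
def median(data):
    """Median of raw data following the steps above."""
    ordered = sorted(data)                 # step 1: ascending order
    n = len(ordered)
    if n % 2 == 1:
        # odd n: observation at position (n + 1) / 2
        return ordered[(n + 1) // 2 - 1]
    # even n: average of observations at positions n/2 and n/2 + 1
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

ex1 = median([61, 63, 60, 64, 65, 62, 63, 69, 68])
ex2 = median([30, 60, 28, 35, 46, 47, 63, 64, 62, 32])
print(ex1, ex2)
```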
$\frac{N}{2} = \frac{120}{2} = 60$
CF just greater than 60 = 65
Hence, the median is the observation corresponding to CF 65
$\text{Median} = 5$
Ex.2. Find median for following.
Obs 5 10 15 20 25 30 35
Freq 1 3 13 17 27 36 38
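For a discrete frequency distribution such as this one, the rule used in the previous example (take the observation whose cumulative frequency is just greater than N/2) can be sketched in Python. This is an illustration under that rule only; the variable names are not from the text.

```python
from itertools import accumulate

# Observations and frequencies from the exercise above.
obs = [5, 10, 15, 20, 25, 30, 35]
freq = [1, 3, 13, 17, 27, 36, 38]

cf = list(accumulate(freq))   # less than type cumulative frequencies
N = cf[-1]                    # total frequency

# First observation whose cumulative frequency is just greater than N / 2.
median = next(x for x, c in zip(obs, cf) if c > N / 2)

print(cf, N / 2, median)
```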
Ans:
classes 𝒇𝒊 CF
20-30 14 14
30-40 23 37
40-50 27 64
50-60 21 85
60-70 15 100
Total 100
$\frac{N}{2} = \frac{100}{2} = 50$
CF just greater than 50 = 64
Hence, Median class = 40-50
Here, $L = 40$, $h = 10$, $f = 27$, $c = 37$
$\text{Median} = L + \frac{h}{f}\left(\frac{N}{2} - c\right)$
$= 40 + \frac{10}{27}(50 - 37)$
$= 40 + 4.814$
$= 44.814$
classes 𝒇𝒊 CF
20-25 100 100
25-30 140 240
30-35 200 440
35-40 320 760
40-45 300 1060
45-50 240 1300
Total 1300
$\frac{N}{2} = \frac{1300}{2} = 650$
CF just greater than 650 = 760
Hence, Median class = 35-40
Here, $L = 35$, $h = 5$, $f = 320$, $c = 440$
$\text{Median} = L + \frac{h}{f}\left(\frac{N}{2} - c\right)$
$= 35 + \frac{5}{320}(650 - 440)$
$= 35 + 3.28$
$= 38.28$
Ex.3. Find median from following data.
Class 0-10 10-20 20-30 30-40 40-50
Freq 8 15 22 15 8
Ans:
classes 𝒇𝒊 CF
0-10 8 8
10-20 15 23
20-30 22 45
30-40 15 60
40-50 8 68
Total 68
$\frac{N}{2} = \frac{68}{2} = 34$
CF just greater than 34 = 45
Hence, Median class = 20-30
Here, $L = 20$, $h = 10$, $f = 22$, $c = 23$
$\text{Median} = L + \frac{h}{f}\left(\frac{N}{2} - c\right)$
$= 20 + \frac{10}{22}(34 - 23)$
$= 20 + 5$
$= 25$
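The three grouped examples above all follow the same pattern: find the class whose cumulative frequency is just greater than N/2, then apply the median formula with L, h, f and c. A Python sketch of that procedure, assuming continuous classes of equal width (the function name `grouped_median` is hypothetical):

```python
from itertools import accumulate

def grouped_median(lower_limits, width, freq):
    """Median of a grouped distribution: L + (h / f) * (N / 2 - c)."""
    cf = list(accumulate(freq))            # less than type cumulative frequencies
    half = cf[-1] / 2                      # N / 2
    # Median class: first class whose CF is just greater than N / 2.
    k = next(i for i, c in enumerate(cf) if c > half)
    c_prev = cf[k - 1] if k > 0 else 0     # c: CF of the class before the median class
    return lower_limits[k] + width / freq[k] * (half - c_prev)

m1 = grouped_median([20, 30, 40, 50, 60], 10, [14, 23, 27, 21, 15])
m2 = grouped_median([20, 25, 30, 35, 40, 45], 5, [100, 140, 200, 320, 300, 240])
m3 = grouped_median([0, 10, 20, 30, 40], 10, [8, 15, 22, 15, 8])
print(round(m1, 2), round(m2, 2), round(m3, 2))
```

In the third example the distribution is symmetric about 25, so the median falls exactly at the middle of the median class.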
Merits of Median:
(i) It is rigidly defined.
(ii) It is easy to understand and easy to calculate.
(iii) It is not affected by extreme values.
(iv) Even if extreme values are not known median can be calculated.
(v) It can be located just by inspection in many cases.
(vi) It is the unique value.
(vii) It is not much affected by sampling fluctuations.
(viii) It can be calculated for data based on ordinal scale i.e. on ordering.
Demerits of Median:
(i) It is not based upon all values of the given data.
(ii) For larger data size the arrangement of data in the increasing order is difficult process.
(iii) It is not capable of further mathematical treatment.
(iv) It is insensitive to some changes in the data values.
3. Mode:
The mode is the value that occurs most frequently in a set of observations. It is an
observation which repeats maximum number of times. The mean and median require a calculation
but the mode is found simply by counting the number of times each value occurs in a data set.
Sometimes we may come across a distribution having more than one mode. If there are two modes
then it is bimodal distribution. Likewise if there are more modes then it is multi-modal distribution.
Calculation of Mode:
a) For raw data:
An observation repeating maximum number of times is mode.
Ex.1. Find mode for following data: 61, 62, 63, 61, 63, 64, 64, 64, 60, 65.
Ans: Mode = observation repeating the maximum number of times = 64
b) For discrete frequency distribution:
An observation corresponding to the highest frequency in the table is mode.
Ex.1. Find mode.
Size 5 10 15 20 25 30 35
Freq 1 3 13 36 27 17 5
Ans:
Classes 20-30 30-40 40-50 50-60 60-70 70-80 80-90
Freq 28 32 45 60 56 40 20
(𝒇𝟏 ) (𝒇𝒎 ) (𝒇𝟐 )
$\text{Mode} = 50 + \frac{60 - 45}{120 - 45 - 56} \times 10$
$= 50 + 7.89$
$= 57.89$
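The arithmetic above corresponds to the usual grouped mode formula, Mode = L + (f_m − f_1) / (2 f_m − f_1 − f_2) × h, where L is the lower limit of the modal class, f_m its frequency, and f_1 and f_2 the frequencies of the classes before and after it. The formula is inferred here from the worked numbers, so the Python sketch below should be read with that assumption in mind; the function name `grouped_mode` and the choice of 0 for a missing neighbouring class are also assumptions.

```python
def grouped_mode(lower_limits, width, freq):
    """Mode of a grouped distribution with equal class widths."""
    m = max(range(len(freq)), key=lambda i: freq[i])   # index of the modal class
    f_m = freq[m]
    f_1 = freq[m - 1] if m > 0 else 0                  # class before (assumed 0 if none)
    f_2 = freq[m + 1] if m < len(freq) - 1 else 0      # class after (assumed 0 if none)
    return lower_limits[m] + (f_m - f_1) / (2 * f_m - f_1 - f_2) * width

mode = grouped_mode([20, 30, 40, 50, 60, 70, 80], 10, [28, 32, 45, 60, 56, 40, 20])
print(round(mode, 2))
```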
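Ex.2. Determine the mode.
Classes 0-100 100-200 200-300 300-400 400-500
Freq 12 18 27 20 17
Ans:
Classes 0-100 100-200 200-300 300-400 400-500
Freq 12 18 27 20 17
($f_1$) ($f_m$) ($f_2$)
The highest frequency is 27, so the modal class is 200-300 and its lower limit is 200.
$\text{Mode} = 200 + \frac{27 - 18}{54 - 18 - 20} \times 100$
$= 200 + 56.25$
$= 256.25$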
Measures of dispersion:
We have learnt about the various measures of central tendency. A measure of central
tendency gives us an idea of where the observations concentrate, but it cannot describe the
distribution completely. If we know only the mean of a distribution, we cannot form a complete
idea about its observations, because different sets of observations may have the same arithmetic
mean and yet differ in how far their values spread around it. A measure of central tendency is a
single value that represents a characteristic, such as the age or height of a group of persons, while
a measure of dispersion quantifies how much the persons in the group vary from each other and
from the measure of central tendency. So, can a measure of central tendency alone describe the
data fully or adequately?
To understand this, consider the following example.
i. Coefficient of Range
ii. Coefficient of Mean Deviation
iii. Coefficient of Variation
iv. Coefficient of Quartile Deviation
1. Range:
Range is the quickest and simplest measure of dispersion. For a given set of data, it is
defined as the difference between the highest (maximum) and the lowest (minimum) observation.
The range is often reported as “from (the minimum) to (the maximum),” i.e., as two numbers.
Coefficient of Range: It is given by
$\text{Coefficient of Range} = \frac{L - S}{L + S} \times 100$
where $L$ is the largest and $S$ is the smallest observation.
= 50%
$\text{Coefficient of Range} = \frac{L - S}{L + S} \times 100$
$= \frac{75 - 50}{75 + 50} \times 100$
$= \frac{25}{125} \times 100$
$= 20\%$
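The range and its coefficient can be checked with a few lines of Python. This is only a minimal sketch: the function name `coefficient_of_range` and the sample list are illustrative assumptions, chosen so that the extremes match the worked values L = 75 and S = 50.

```python
def coefficient_of_range(values):
    """Return (range, coefficient of range in percent) for a list of numbers."""
    largest = max(values)    # L in the formula
    smallest = min(values)   # S in the formula
    data_range = largest - smallest
    coefficient = (largest - smallest) / (largest + smallest) * 100
    return data_range, coefficient


# Hypothetical observations; only the extremes 50 and 75 come from the text.
sample = [50, 62, 58, 75, 66]
data_range, coefficient = coefficient_of_range(sample)
print(data_range, round(coefficient, 2))
```

Only the two extreme values enter the calculation, which is why the range is quick to find but ignores the rest of the series.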
Merits of Range:
(i) It is rigidly defined.
(ii) It gives rough but quick answer.
(iii) It is simple to understand and easy to calculate. It can be found by mere inspection.
(iv) It can be calculated from extreme values only. So we need not know the details of the series
to calculate the range.
Demerits of range:
(i) It is not representative since it is not based on all the observations of the series.
(ii) It is not capable of further algebraic treatment.
(iii) In case of open-end classes range cannot be determined exactly.
(iv) It is not a stable measure of dispersion and is very much affected by the fluctuations of
sampling.
Coefficient of Mean Deviation: It is common for all the following three types of data and is given
by,
$\text{Coeff. of } MD(A) = \frac{MD}{A} \times 100$, where $A$ = mean / median / mode
$x_i$    $|x_i - 5|$
1 4
2 3
3 2
4 1
5 0
6 1
7 2
8 3
9 4
Total 20
$MD(5) = \frac{\sum |x_i - 5|}{9} = \frac{20}{9} = 2.22$
$x_i$    $|x_i - 6|$    $|x_i - 7|$
2 4 5
5 1 2
7 1 0
8 2 1
7 1 0
6 0 1
12 6 5
3 3 4
Total 18 18
$MD(6) = \frac{\sum |x_i - 6|}{8} = \frac{18}{8} = 2.25$
$MD(7) = \frac{\sum |x_i - 7|}{8} = \frac{18}{8} = 2.25$
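The mean deviation about any point A is the average of the absolute deviations from A. The sketch below recomputes the two values for the observations 2, 5, 7, 8, 7, 6, 12, 3; the helper name `mean_deviation` is an assumption made for illustration.

```python
def mean_deviation(values, about):
    """Mean of the absolute deviations of `values` from the point `about`."""
    return sum(abs(x - about) for x in values) / len(values)


observations = [2, 5, 7, 8, 7, 6, 12, 3]

md_6 = mean_deviation(observations, 6)   # sum of |x - 6| is 18, n = 8
md_7 = mean_deviation(observations, 7)   # sum of |x - 7| is 18, n = 8

# Coefficient of M.D. = MD / A x 100, with A the point used above
coeff_md_6 = md_6 / 6 * 100

print(md_6, md_7, coeff_md_6)
```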
Ans:
$x_i$    $f_i$    $x_i f_i$    $|x_i - 12|$    $f_i |x_i - 12|$
10 2 20 2 4
11 5 55 1 5
12 7 84 0 0
13 3 39 1 3
14 1 14 2 2
Total 18 212 14
$A = \bar{x} = \frac{\sum x_i f_i}{\sum f_i} = \frac{212}{18} = 11.78 \cong 12$
$MD(12) = \frac{\sum f_i |x_i - 12|}{\sum f_i} = \frac{14}{18} = 0.78$
$MD(3) = \frac{\sum f_i |x_i - 3|}{\sum f_i} = \frac{35}{40} = 0.875$
Ex.3. Calculate M.D. from mode.
𝑥𝑖 10 12 15 17 19
𝑓𝑖 2 4 10 5 1
Ans:
$x_i$    $f_i$    $|x_i - 15|$    $f_i |x_i - 15|$
10 2 5 10
12 4 3 12
15 10 0 0
17 5 2 10
19 1 4 4
Total 22 36
$A = \text{mode} = 15$
$MD(15) = \frac{\sum f_i |x_i - 15|}{\sum f_i} = \frac{36}{22} = 1.64$
$\bar{x} = \frac{\sum f_i X_i}{\sum f_i} = \frac{1350}{50} = 27$
Hence,
$MD(27) = \frac{\sum f_i |X_i - 27|}{\sum f_i} = \frac{472}{50} = 9.44$
Ex.2. Calculate M.D. and coefficient of M.D about median for following data.
Classes 0-10 10-20 20-30 30-40 40-50
Freq 8 15 22 15 8
Ans:
Classes $f_i$ C.F. Mid values ($X_i$) $|X_i - 25|$ $f_i |X_i - 25|$
0-10 8 8 5 20 160
10-20 15 23 15 10 150
20-30 22 45 25 0 0
30-40 15 60 35 10 150
40-50 8 68 45 20 160
Total 68 620
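$\frac{N}{2} = \frac{68}{2} = 34$
$\text{Median} = L + \frac{h}{f}\left(\frac{N}{2} - c\right)$
$= 20 + \frac{10}{22}(34 - 23)$
$= 20 + 5$
$= 25$
Hence,
$MD(25) = \frac{\sum f_i |X_i - 25|}{\sum f_i} = \frac{620}{68} = 9.12$
$\text{Coeff. of } MD(A) = \frac{MD}{A} \times 100 = \frac{9.12}{25} \times 100 = 36.48\%$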
Ans:
Classes $f_i$ C.F. Mid values ($X_i$) $|X_i - 42.5|$ $f_i |X_i - 42.5|$
25-30 6 6 27.5 15 90
30-35 12 18 32.5 10 120
35-40 17 35 37.5 5 85
40-45 30 65 42.5 0 0
45-50 10 75 47.5 5 50
50-55 10 85 52.5 10 100
55-60 8 93 57.5 15 120
60-65 5 98 62.5 20 100
65-70 2 100 67.5 25 50
Total 100 715
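$\frac{N}{2} = \frac{100}{2} = 50$
$\text{Median} = L + \frac{h}{f}\left(\frac{N}{2} - c\right)$
$= 40 + \frac{5}{30}(50 - 35)$
$= 40 + 2.5$
$= 42.5$
Hence,
$MD(42.5) = \frac{\sum f_i |X_i - 42.5|}{\sum f_i} = \frac{715}{100} = 7.15$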
Ans:
Classes $f_i$ C.F. Mid values ($X_i$) $|X_i - 38.5|$ $f_i |X_i - 38.5|$
20-25 10 10 22.5 16 160
25-30 14 24 27.5 11 154
30-35 20 44 32.5 6 120
35-40 36 80 37.5 1 36
40-45 30 110 42.5 4 120
45-50 24 134 47.5 9 216
Total 134 806
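$\frac{N}{2} = \frac{134}{2} = 67$
$\text{Median} = L + \frac{h}{f}\left(\frac{N}{2} - c\right)$
$= 35 + \frac{5}{36}(67 - 44)$
$= 35 + 3.19$
$= 38.19 \cong 38.5$
Hence,
$MD(38.5) = \frac{\sum f_i |X_i - 38.5|}{\sum f_i} = \frac{806}{134} = 6.015$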
Coefficient of variation: It is the same for all three types of data and is given by
$CV = \frac{\text{Standard Deviation}}{\text{Mean}} \times 100$
OR
$\sigma = \sqrt{\frac{\sum x_i^2}{n} - (\bar{x})^2}$
$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}} = \sqrt{\frac{34}{6}} = \sqrt{5.67} = 2.38$
$CV = \frac{SD}{\bar{x}} \times 100 = \frac{2.38}{5} \times 100 = 47.6\%$
Ans: Here, $\bar{x} = \frac{\sum x_i}{n} = \frac{40}{10} = 4$
Use formula,
$\sigma = \sqrt{\frac{\sum x_i^2}{n} - (\bar{x})^2} = \sqrt{\frac{520}{10} - 4^2} = \sqrt{36} = 6$
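When only the totals are known, the shortcut formula for the standard deviation is easy to script. The sketch below assumes the totals quoted in the answer (sum of observations 40, sum of squares 520, n = 10); the function names are illustrative, not part of the text.

```python
import math


def sd_from_sums(sum_x, sum_x2, n):
    """Mean and standard deviation from the totals of x and x squared."""
    mean = sum_x / n
    variance = sum_x2 / n - mean ** 2   # (sum of x^2)/n - (mean)^2
    return mean, math.sqrt(variance)


def coefficient_of_variation(sd, mean):
    """CV = standard deviation / mean x 100."""
    return sd / mean * 100


mean, sd = sd_from_sums(40, 520, 10)
cv = coefficient_of_variation(sd, mean)
print(mean, sd, cv)
```

The same two functions apply to a frequency distribution if the totals are replaced by the frequency-weighted sums and n by the total frequency.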
Ex.3. For a certain distribution of 25 observations, the mean is 50 and the S.D. is 4. Find the
coefficient of variation (C.V.).
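OR
$\sigma = \sqrt{\frac{\sum f_i x_i^2}{\sum f_i} - (\bar{x})^2}$
$\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{9770}{254} = 38.46 \cong 39$
$\sigma = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i}} = \sqrt{\frac{48574}{254}} = \sqrt{191.24} = 13.83$
$\sigma = \sqrt{\frac{\sum f_i (X_i - \bar{x})^2}{\sum f_i}}$, where $\bar{x} = \frac{\sum f_i X_i}{\sum f_i}$ and $X_i$ = mid values
OR
$\sigma = \sqrt{\frac{\sum f_i X_i^2}{\sum f_i} - (\bar{x})^2}$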
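CI $f_i$ $X_i$ $f_i X_i$ $X_i - 58$ $(X_i - 58)^2$ $f_i (X_i - 58)^2$
$\bar{x} = \frac{\sum f_i X_i}{\sum f_i} = \frac{5800}{100} = 58$
$\sigma = \sqrt{\frac{\sum f_i (X_i - \bar{x})^2}{\sum f_i}} = \sqrt{\frac{40000}{100}} = \sqrt{400} = 20$
CI $f_i$ $X_i$ $f_i X_i$ $X_i - 40$ $(X_i - 40)^2$ $f_i (X_i - 40)^2$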
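$\bar{x} = \frac{\sum f_i X_i}{\sum f_i} = \frac{5475}{139} = 39.39 \cong 40$
$\sigma = \sqrt{\frac{\sum f_i (X_i - \bar{x})^2}{\sum f_i}} = \sqrt{\frac{34275}{139}} = \sqrt{246.58} = 15.70$
$CV = \frac{\sigma}{\bar{x}} \times 100 = \frac{15.70}{40} \times 100 = 39.25\%$
$\sigma = \sqrt{\frac{\sum f_i (X_i - \bar{x})^2}{\sum f_i}} = \sqrt{\frac{268750}{70}} = \sqrt{3839.29} = 61.96$
$CV = \frac{\sigma}{\bar{x}} \times 100 = \frac{61.96}{130} \times 100 = 47.66\%$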
$QD = \frac{Q_3 - Q_1}{2}$
Coefficient of Quartile Deviation: It is the same for all the following three types of data and is given
by,
$\text{Coeff. of } QD = \frac{Q_3 - Q_1}{Q_3 + Q_1} \times 100$
Hence, $QD = \frac{Q_3 - Q_1}{2} = \frac{11 - 5}{2} = \frac{6}{2} = 3$
Ex.2. Following are the marks obtained by 10 students: 56, 48, 65, 35, 42, 75, 82, 60, 55, 50. Find
the quartile deviation and its coefficient.
Ans: Ascending order: 35, 42, 48, 50, 55, 56, 60, 65, 75, 82
Here, $n = 10$ (not divisible by 4)
1st quartile $= Q_1 = \left(\frac{n + 1}{4}\right)^{\text{th}}$ obs $= \left(\frac{10 + 1}{4}\right)^{\text{th}}$ obs
$\frac{N}{4} = \frac{25}{4} = 6.25$
C.f. just greater than 6.25 = 10
So, $Q_1$ = obs corr. to 6.25 = 1
$\frac{3N}{4} = \frac{75}{4} = 18.75$
C.f. just greater than 18.75 = 22
So, $Q_3$ = obs corr. to 18.75 = 3
Where:
$L$ = lower limit of Q1 class
$h$ = width of Q1 class
$f$ = frequency of Q1 class
$c$ = cumulative frequency of the class preceding the Q1 class
5. Now calculate $\frac{3N}{4}$.
6. See the CF just greater than $\frac{3N}{4}$; its corresponding class is the Q3 class.
7. For the 3rd quartile, use the formula $Q_3 = L + \frac{h}{f}\left(\frac{3N}{4} - c\right)$
Where:
$L$ = lower limit of Q3 class
$h$ = width of Q3 class
$f$ = frequency of Q3 class
$c$ = cumulative frequency of the class preceding the Q3 class
8. Calculate quartile deviation using formula.
Ex.1. Calculate quartile deviation and its coefficient for following data.
Classes 10–15 15–20 20-25 25-30 30-35 35-40 40-45 45-50
Freq 4 4 6 8 10 9 7 5
$Q_1 = L + \frac{h}{f}\left(\frac{N}{4} - c\right)$
$= 20 + \frac{5}{6}(13.25 - 8)$
$= 20 + \frac{26.25}{6}$
$= 20 + 4.375$
$= 24.375$
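The quartile steps for a continuous frequency distribution can be sketched as a short function. It assumes equal class widths and uses the class data of this example (lower limits 10 to 45, width 5); the name `grouped_quartile` is hypothetical.

```python
def grouped_quartile(lower_limits, width, freqs, k):
    """k-th quartile (k = 1, 2, 3) using L + (h/f) * (k*N/4 - c)."""
    total = sum(freqs)                # N
    target = k * total / 4            # N/4 for Q1, 3N/4 for Q3
    cumulative = 0                    # c: c.f. of the classes already passed
    for lower, f in zip(lower_limits, freqs):
        if cumulative + f > target:   # first class whose c.f. exceeds the target
            return lower + width / f * (target - cumulative)
        cumulative += f
    raise ValueError("quartile class not found")


lower_limits = [10, 15, 20, 25, 30, 35, 40, 45]
freqs = [4, 4, 6, 8, 10, 9, 7, 5]

q1 = grouped_quartile(lower_limits, 5, freqs, 1)
q3 = grouped_quartile(lower_limits, 5, freqs, 3)
qd = (q3 - q1) / 2
coeff_qd = (q3 - q1) / (q3 + q1) * 100
print(q1, round(q3, 3), round(qd, 3), round(coeff_qd, 2))
```

Here `q1` reproduces the value 24.375 obtained by hand; `q3`, `qd` and `coeff_qd` follow from the same formula applied with 3N/4.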
Exercise
1. What do you mean by statistics? Add note on statistical data.
2. Explain following terms in frequency distribution and data;
i. Individual data
ii. Discrete data
iii. Grouped data
iv. Classes
v. Frequency
vi. Class limits
vii. Class boundaries
viii. Class frequency
ix. Class interval
3. The given data contain the weights in kg of a group of 60 students. Prepare a frequency table
taking the class width as 10 kg, with the first class interval as 40 and less than 50.
50 52 86 94 49 90 76 96 64 70
69 80 79 73 81 110 84 67 77 65
74 60 115 61 83 72 79 103 51 78
71 66 77 84 42 69 80 68 104 79
54 59 100 53 76 50 78 63 95 42
40 82 41 75 63 113 98 43 55 76
4. Prepare a discrete frequency table for following data containing number of defectives in a lot.
2, 3, 1, 0, 1, 2, 1, 0, 1, 4, 5, 3, 2, 1, 0, 1, 3, 4, 1 , 5, 4, 3, 1, 0, 0, 1, 0, 2, 3, 1, 2, 4, 5, 0, 1, 0, 1,
0, 2, 4, 3, 5, 0, 1, 3, 2, 1, 0, 2, 2, 3, 0, 1, 3, 4, 0, 1, 3, 2, 5, 0, 1, 2.
5. For each of the given frequency distribution draw Histogram, Frequency polygon and
Cumulative frequency polygon
i.
Weight in kg 80-90 90-100 100-110 110-120 120-130
No. of workers 07 11 15 08 04
6. Following table gives the birth rate per thousand of different countries over certain period.
Represent the given data by a suitable diagram plotting the countries against their birth rate.
7. Draw a pie diagram for the following data of seventh five year plan of Government.
Agriculture 14%
Irrigation 13%
Health 27%
Education 15%
Social Development 16%
Employment 16%
Note: Angle at centre is given by $\frac{\text{Percentage of item}}{100} \times 360$
8. Draw a pie diagram to represent the following data of population in a town;
Males 2000
Females 1800
Boys 4200
Girls 2000
Total 10000
11. The given data contain the marks obtained by a batch of 10 students in a certain class test.
Calculate the median marks.
28, 35, 46, 47, 60, 30, 32, 62, 64, 63.
12. Calculate median weight from given weight in grams
68, 66, 35, 42, 26, 85, 44, 80, 33, 72.
31. The following are the goals scored by a team. Calculate Q.D.
No. of goals scored 0 1 2 3 4
No. of matches 1 9 7 5 3
2. Probability
Introduction:
‘What is probability?’ Nobody has a really good answer to this question. It is the language
which we use to describe uncertainty. The theory of probability originated in games of chance:
the correspondence between two French mathematicians, Blaise Pascal and Pierre de Fermat,
gave rise to the study of probability. Throughout the 18th century, the application of
probability moved from games of chance to scientific problems. In the study of statistics, we are
concerned basically with the presentation and interpretation of chance outcomes that occur in a
planned study or scientific investigation. Statisticians use the word experiment to describe any
process that generates a set of data.
Probability is a part of our everyday lives. Modern research in probability theory is closely
related to the field of measure theory. The development of probability theory has been stimulated
by the variety of its applications. Statistics is one important branch of applied probability. One of
the difficulties in developing a theory of probability has been the definition of probability itself.
The search for a widely acceptable definition took centuries and was marked by controversy. The
matter was finally resolved in the 20th century by treating probability theory on an axiomatic basis.
Probability:
Probability is the branch of statistics that studies the possible outcomes of given events
together with their likelihoods and distributions. In common use, the word probability is used to
mean the chance that a particular event will occur.
e.g. It is likely to rain; there is a 60% chance that India will win the match, etc.
Probability is the measure of the likelihood that an event will occur. It is
quantified as a number between 0 and 1 (where 0 indicates impossibility and 1 indicates
certainty). The higher the probability of an event, the more certain we are that the event will occur.
The topic of probability is seen in many facets of the modern world. The theory of probability is not
just taught in mathematics courses, but is applied in practical fields such as insurance, industrial
quality control, the study of genetics, quantum mechanics, and the kinetic theory of gases.
In order to clear the concept of probability, we have some basic concepts:
1. Random Experiment:
It is a repeatable action in which all the possible results are known but the exact result is not
known in advance.
e.g. “Tossing of a coin” is a random experiment, because we know all the possible results in
advance, i.e. either Head or Tail, but the exact one is not known.
2. Outcomes:
The results of the random experiments are called outcomes.
Let B be the event that card drawn is Red or Black. This event has outcomes of all 52 cards; so it is a
sure event.
iii) Impossible event: An event which does not contain any sample point of the sample space is an
impossible event.
e.g. A die is thrown. 𝑆 = {1, 2, 3, 4, 5, 6}
Let C be the event that number on upper face is greater than 6.
So 𝐶 = { } = ∅
iv) Complementary event: Let A be an event of the sample space S. Then the complement of event
A is the set containing the points in S but not in A. It is denoted by $A'$ or $A^c$.
e.g. A die is thrown. 𝑆 = {1, 2, 3, 4, 5, 6}
Let D be the event of getting odd no. 𝐷 = {1, 3, 5}
Then complement of D is 𝐷 𝑐 = {2, 4, 6}.
v) Mutually exclusive events: Two events, say A and B, are said to be mutually exclusive or disjoint
events if they have no common point, i.e. $A \cap B = \emptyset$
e.g. Throwing of a die. $S = \{1, 2, 3, 4, 5, 6\}$
A be the event of occurring even no. on upper face. $A = \{2, 4, 6\}$
B be the event of occurring odd no. on upper face. $B = \{1, 3, 5\}$
Here $A \cap B = \emptyset$, i.e. they have no common elements.
So, A and B are mutually exclusive events.
vi) Exhaustive events: Two or more events are said to be exhaustive events if their union is the
sample space, i.e. suppose A and B are events of S. A and B are exhaustive if $A \cup B = S$
Hence, permutations of n different objects taken r at a time is the total number of ways in
which n objects can be arranged at r places in a line and it is given by
${}^nP_r = \frac{n!}{(n - r)!}$
In particular, if $r = n$ then
${}^nP_n = n!$
Ex.1. In how many ways can 5 different objects be arranged by taking 2 at a time?
Ans: Here, $n = 5$ and $r = 2$
${}^5P_2 = \frac{5!}{(5 - 2)!} = \frac{5 \times 4 \times 3 \times 2 \times 1}{3 \times 2 \times 1} = 20$ ways
Ex.2. Calculate the number of ways in which three people from a group of seven people can be
seated in a row.
Ans: This is a case of permutation since the order is important.
Here, $n = 7$, $r = 3$
The number of possible ways is:
${}^7P_3 = \frac{7!}{(7 - 3)!} = \frac{7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1}{4 \times 3 \times 2 \times 1} = 210$ ways
Combinations:
A combination is a group of objects, irrespective of order, taken some or all at a time.
E.g. suppose there is a group of three students say X, Y and Z. We have to make different groups
containing two students in each group.
In this case, the groups formed are,
XY, XZ, YZ. (Here, XY is the same group as YX.)
Hence, total number of combinations of n different objects taken r at a time is given by
${}^nC_r = \frac{n!}{r!\,(n - r)!}$
In particular, if $r = n$ then
${}^nC_n = 1$
Ex.2. Calculate the number of combinations in which three people can be selected from a group of
seven.
Ans: Here the order is not important so it is case of combination.
Here, $n = 7$, $r = 3$
The number of possible combinations is:
${}^7C_3 = \frac{7!}{3!\,(7 - 3)!} = \frac{7 \times 6 \times 5 \times 4!}{3! \times 4!} = \frac{7 \times 6 \times 5}{3 \times 2 \times 1} = 35$ combinations
Thus, for the same n and r, the number of permutations is never less than the number of
combinations, and it is strictly greater whenever r is 2 or more (here 210 against 35).
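These counts can be verified with Python's standard library. The sketch assumes Python 3.8 or later, where `math.perm` and `math.comb` are available; the variable names are illustrative.

```python
import math

# Permutations: order matters
ways_5p2 = math.perm(5, 2)   # 5! / (5 - 2)!
ways_7p3 = math.perm(7, 3)   # 7! / (7 - 3)!

# Combinations: order does not matter
ways_7c3 = math.comb(7, 3)   # 7! / (3! * (7 - 3)!)

# Each combination of r objects can be arranged in r! orders,
# so nPr = nCr * r!
relation_holds = ways_7p3 == ways_7c3 * math.factorial(3)

print(ways_5p2, ways_7p3, ways_7c3, relation_holds)
```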
Classical Definition of Probability:
Statistically, the term probability can be defined in following way:
If S is the sample space with n outcomes of a random experiment and A is an event with m
outcomes then the probability of event A is denoted by P(A) and is defined as,
$P(A) = \frac{\text{No. of outcomes in } A}{\text{No. of outcomes in } S} = \frac{m}{n}$
In short, if
$n(S)$ = no. of elements in sample space S
$n(A)$ = no. of elements in event A
then
$P(A) = \frac{n(A)}{n(S)}$
$P\left(\bigcup_{i=1} A_i\right) = \sum_{i=1} P(A_i)$
These axioms are all we need to develop a theory of probability, but there is a collection of
commonly used properties which follow directly from these axioms, and which we make extensive
use of when carrying out probability calculations.
Property A: Probability of Complementary event
𝑃 (𝐴𝑐 ) = 1 – 𝑃 (𝐴).
Property B: 𝑃(∅) = 0
Property C: If 𝐴 ⊆ 𝐵, then 𝑃(𝐴) ≤ 𝑃(𝐵).
Property D: Addition Property
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
Ex.1. In a box, there are 5 Aspirin, 6 Analgin and 10 Paracetamol. If one tablet is chosen at random
find the probability that:
i) it is Analgin.
ii) it is Aspirin or Paracetamol.
Ans: There are in total 5 + 6 + 10 = 21 tablets in the box.
$n(S) = 21$
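The classical definition P(A) = n(A)/n(S) can be applied to the tablet box directly. This is a minimal sketch with exact fractions; the dictionary `tablets` and the helper `probability` are illustrative names, and the counts are those stated in the question (5 Aspirin, 6 Analgin, 10 Paracetamol).

```python
from fractions import Fraction

tablets = {"Aspirin": 5, "Analgin": 6, "Paracetamol": 10}
n_s = sum(tablets.values())          # n(S): total number of tablets


def probability(favourable_names):
    """P(A) = n(A) / n(S) for the event 'the tablet is one of these kinds'."""
    n_a = sum(tablets[name] for name in favourable_names)
    return Fraction(n_a, n_s)


p_analgin = probability(["Analgin"])
p_aspirin_or_paracetamol = probability(["Aspirin", "Paracetamol"])

print(n_s, p_analgin, p_aspirin_or_paracetamol)
```

The two events are complementary, so their probabilities add up to 1, in line with Property A.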
Ex.2. A card is selected at random from well shuffled pack of 52 cards. Find the probability of
getting
i) a face card ii) a red card iii) not a club card.
Ans: There are total 52 cards in a pack.
𝑛(𝑆) = 52
ii) Let event B: getting odd number on 1st die and 5 on 2nd die.
$B = \{(1, 5), (3, 5), (5, 5)\}$ So $n(B) = 3$
$P(B) = \frac{3}{36} = \frac{1}{12}$
Conditional probability:
Conditional probability is the likelihood of an event occurring given that another event has
already occurred, i.e. the probability of an event A may change once we know that some other
event B has occurred. It is known as the conditional probability of the
event A given that the event B has occurred. We write this as $P(A \mid B)$.
If A and B are any 2 events with $P(B) > 0$, then
$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$
Similarly, $P(B \mid A) = \frac{P(A \cap B)}{P(A)}$; $P(A) > 0$
Ex.1. You toss a fair coin three times. Given that you have observed at least one heads, what is the
probability that you observe at least two heads?
Ans: A coin is tossed three times.
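$S = \{TTT, TTH, THT, THH, HHH, HHT, HTH, HTT\}$, $n(S) = 8$
Let A be the event that at least one heads is observed.
$A = \{TTH, THT, THH, HHH, HHT, HTH, HTT\}$
$P(A) = \frac{7}{8}$
Let B be the event that at least two heads are observed.
$B = \{THH, HHH, HHT, HTH\}$, so $A \cap B = B$ and $P(A \cap B) = \frac{4}{8}$
$P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{4}{8} \cdot \frac{8}{7}$
$= \frac{4}{7}$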
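The same conditional probability can be obtained by listing the sample space and counting. This is a sketch under the assumption of a fair coin, so that all eight outcomes are equally likely; the function names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All 8 equally likely outcomes of three tosses, e.g. "HHT"
sample_space = ["".join(tosses) for tosses in product("HT", repeat=3)]


def prob(event):
    """Classical probability: favourable outcomes / all outcomes."""
    favourable = sum(1 for outcome in sample_space if event(outcome))
    return Fraction(favourable, len(sample_space))


def at_least_one_head(outcome):
    return outcome.count("H") >= 1


def at_least_two_heads(outcome):
    return outcome.count("H") >= 2


p_a = prob(at_least_one_head)
p_a_and_b = prob(lambda o: at_least_one_head(o) and at_least_two_heads(o))
p_b_given_a = p_a_and_b / p_a        # P(B | A) = P(A and B) / P(A)

print(p_a, p_a_and_b, p_b_given_a)
```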
Ex.2. Out of 50 people surveyed in a study, 35 people smoke, of whom 20 are males. What is the
probability that a surveyed person is male, given that the person smokes?
Ans: Here, $n(S) = 50$
Let A be the event that the person is a smoker.
$P(A) = \frac{n(A)}{n(S)} = \frac{35}{50}$
Let B be the event that the person is male. Then $A \cap B$ is the event of the person being
male and a smoker.
$P(A \cap B) = \frac{20}{50}$
$P(B \mid A) = \frac{P(A \cap B)}{P(A)}$
$= \frac{20}{50} \cdot \frac{50}{35}$
$= \frac{4}{7}$
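Multiplication Property:
If A and B are any two events of a given random experiment with $P(A) > 0$, then
$P(A \cap B) = P(A) \cdot P(B \mid A)$
If A and B are independent events, then $P(B \mid A) = P(B)$ and so $P(A \cap B) = P(A) \cdot P(B)$.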
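Ex.4. Find the probability that a single toss of a die will result in a number less than 4 if it is given
that the toss resulted in an odd number.
Ans: For tossing of a die, $S = \{1, 2, 3, 4, 5, 6\}$, $n(S) = 6$
Given that the toss has already resulted in an odd number.
Let event A: toss resulted in an odd number.
$A = \{1, 3, 5\}$, $n(A) = 3$
$P(A) = \frac{3}{6} = \frac{1}{2}$
Let event B: toss resulted in a number less than 4. $B = \{1, 2, 3\}$
$A \cap B = \{1, 3\}$, $n(A \cap B) = 2$
$P(A \cap B) = \frac{2}{6} = \frac{1}{3}$
Hence, $P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{1}{3} \cdot \frac{2}{1} = \frac{2}{3}$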
Ex.5. A bag contains 3 pink candies and 7 green candies. Two candies are taken out from the bag
with replacement. Find the probability that both candies are pink.
Ans: Here, $n(S) = 3 + 7 = 10$
Let A be the event that the first candy is pink and
B be the event that the second candy is pink.
$P(A) = P(B) = \frac{3}{10}$
Since the candies are taken out with replacement, the events A and B are independent.
$P(B \mid A) = P(B) = \frac{3}{10}$
Hence, $P(A \cap B) = P(A) \cdot P(B \mid A) = \frac{3}{10} \cdot \frac{3}{10} = \frac{9}{100}$
Random Variables:
Specifying a model for a random experiment via a complete description of sample space S
and probability P may not always be convenient or necessary. In practice we are only interested in
various observations (i.e., numerical measurements) of the experiment. We include these into our
modelling process via the introduction of random variables.
A random variable is a function that associates a real number with each element in the
sample space. A random variable is neither random nor a variable. A random variable is a function
defined on a sample space. The values of the function can be anything at all, but for us they will
always be numbers.
E.g. consider the sample space for tossing a fair coin twice:
𝑆 = {𝐻𝐻, 𝐻𝑇, 𝑇𝐻, 𝑇𝑇}
These outcomes are equally likely. There are several random quantities we could associate
with this experiment. For example, we could count the number of heads, or the number of tails.
Formally, a random variable is a real-valued function which acts on the elements of the sample
space (outcomes), i.e. to each outcome the random variable assigns a real number. Random
variables are usually denoted by upper case letters.
In our example, if we let X be the number of heads, we have
$X(HH) = 2$;
$X(HT) = 1$;
$X(TH) = 1$;
$X(TT) = 0$.
Hence, 2, 1, 1, 0 are the values of the random variable X for the outcomes in sample space S.
In short,
Outcomes HH HT TH TT
Random variable (X) 2 1 1 0
Ex.1. Suppose a coin is tossed twice. Find the probability distribution for the number of heads.
Ans: A coin is tossed twice. Hence the distribution is as follows,
Outcomes ($S$)              HH    HT    TH    TT    Total
Random variable ($X$)       2     1     1     0
Probability of the outcome  1/4   1/4   1/4   1/4   1
so that $P(X = 2) = \frac{1}{4}$, $P(X = 1) = \frac{1}{2}$ and $P(X = 0) = \frac{1}{4}$.
Ex.2. Find the probability function corresponding to the random variable X, the number of heads,
assuming that a fair coin is tossed thrice.
Ans: Outcomes $S = \{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT\}$
Random variable ($X$)       3    2    2    1    2    1    1    0
Probability of the outcome  1/8  1/8  1/8  1/8  1/8  1/8  1/8  1/8
Hence, $P(X = 0) = \frac{1}{8}$, $P(X = 1) = \frac{3}{8}$, $P(X = 2) = \frac{3}{8}$ and $P(X = 3) = \frac{1}{8}$.
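A probability distribution of this kind can be built by enumerating the outcomes and counting heads. The sketch below assumes a fair coin; `heads_distribution` is an illustrative name.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def heads_distribution(tosses):
    """Map each value x of X (number of heads) to P(X = x) for a fair coin."""
    outcomes = list(product("HT", repeat=tosses))          # equally likely outcomes
    counts = Counter(outcome.count("H") for outcome in outcomes)
    return {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}


two_tosses = heads_distribution(2)
three_tosses = heads_distribution(3)

print(two_tosses)
print(three_tosses)
```

In both cases the probabilities add up to 1, as they must for any probability distribution.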
The Expected value (or Mean) of a random variable X is a number E(X) given by,
$E(X) = \sum_{i=1}^{n} x_i p_i$ (For Discrete variables)
$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$ (For Continuous variables)
Ex.1. A die is thrown. The random variable X is “the number of dots that appear”. Find the expected
value of this random variable.
Ans: For throwing of a die, the outcomes of dots are 𝑆 = {1, 2, 3, 4, 5, 6}
Hence, the probability distribution table is given by
No. of dots ($x_i$)    1    2    3    4    5    6    Total
$P(X = x_i) = p_i$     1/6  1/6  1/6  1/6  1/6  1/6  1
$x_i p_i$              1/6  1/3  1/2  2/3  5/6  1    3.5
$E(X) = \sum_{i=1}^{6} x_i p_i = \frac{21}{6} = \frac{7}{2} = 3.5$
Ex.2. A lot containing 7 components is sampled by a quality inspector; the lot contains 4 good
components and 3 defective components. A sample of 3 is taken by the inspector. Find the expected
value of the number of good components in this sample.
Ans: This is a case of combination where $n = 7$ and $r = 3$
To find n(S),
${}^7C_3 = \frac{7!}{3!\,(7 - 3)!} = \frac{7 \times 6 \times 5 \times 4!}{3! \times 4!} = \frac{7 \times 6 \times 5}{3 \times 2 \times 1} = 35$ combinations
So, $n(S) = 35$
Using the formula ${}^4C_r \times {}^3C_{3-r}$, find the number of samples containing 0, 1, 2 or 3 good
components by putting r = 0, 1, 2, 3 respectively.
The Probability distribution table for number of good components in a sample is,
No. of good comp. ($x_i$)   0      1      2      3      Total
$P(X = x_i) = p_i$          1/35   12/35  18/35  4/35   1
$x_i p_i$                   0      12/35  36/35  12/35  60/35
$E(X) = \sum_{i=1}^{4} x_i p_i = \frac{60}{35} = \frac{12}{7} = 1.7$
Thus, if a sample of size 3 is selected at random again and again from a lot of 4 good
components and 3 defective components, it will contain, on average, 1.7 good components.
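The distribution of good components and its expected value can be reproduced by counting samples with `math.comb`. This is a sketch only; the variable names are assumptions, and the counts (4 good, 3 defective, sample of 3) come from the example.

```python
from fractions import Fraction
from math import comb

good, defective, sample_size = 4, 3, 3

# n(S): number of ways to choose 3 components out of 7
n_s = comb(good + defective, sample_size)

# P(X = r): choose r good components and (3 - r) defective ones
distribution = {
    r: Fraction(comb(good, r) * comb(defective, sample_size - r), n_s)
    for r in range(sample_size + 1)
}

# E(X) = sum of x * P(X = x)
expected_good = sum(x * p for x, p in distribution.items())

print(n_s, distribution, expected_good, float(expected_good))
```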
The Variance of a random variable X with the probability distribution P(X=xi) is a number
Var(X) or $\sigma^2$ given by,
$\sigma^2 = Var(X) = E[X - E(X)]^2 = \sum_i [x_i - E(X)]^2 \cdot p_i$ (For Discrete variables)
$\sigma^2 = Var(X) = E[X - E(X)]^2 = \int_{-\infty}^{\infty} [x - E(X)]^2 f(x)\,dx$ (For Conti. variables)
Ex.3. Let the random variable X represent the number of automobiles that are used for official
business on any given workday. The probability distribution for the company is
X = xi            0    1    2    3
P(X = xi) = pi    0.2  0.1  0.4  0.3
Calculate the variance of the random variable X.
Ans: To calculate the expected value,
X = xi            0    1    2    3    Total
P(X = xi) = pi    0.2  0.1  0.4  0.3  1
$x_i p_i$         0    0.1  0.8  0.9  1.8
$x_i^2$           0    1    4    9
$x_i^2 p_i$       0    0.1  1.6  2.7  4.4
$E(X) = \sum_{i=0}^{3} x_i p_i = 1.8$
$E(X^2) = \sum_{i=0}^{3} x_i^2 p_i = 4.4$
Now, $Var(X) = \sigma^2 = E(X^2) - [E(X)]^2$
$= 4.4 - (1.8)^2$
$= 4.4 - 3.24$
$= 1.16$
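The shortcut Var(X) = E(X²) − [E(X)]² can be checked numerically. The sketch below uses the probabilities from the answer table (0.2, 0.1, 0.4, 0.3); the function names are illustrative, and results are compared with a small tolerance because of floating-point rounding.

```python
def expectation(distribution):
    """E(X) = sum of x * p over the distribution {x: p}."""
    return sum(x * p for x, p in distribution.items())


def variance(distribution):
    """Var(X) = E(X^2) - [E(X)]^2."""
    mean = expectation(distribution)
    mean_of_squares = sum(x ** 2 * p for x, p in distribution.items())
    return mean_of_squares - mean ** 2


automobiles = {0: 0.2, 1: 0.1, 2: 0.4, 3: 0.3}

mean_autos = expectation(automobiles)
var_autos = variance(automobiles)
print(round(mean_autos, 2), round(var_autos, 2))
```

A quick sanity check before using such a table is that the probabilities add up to 1.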
Ex.4. Let the random variable X represent the number of defective parts for a machine when 3
parts are sampled from a production line and tested. The following is the probability distribution of
X. Calculate $\sigma^2$.
xi    0     1     2     3
pi    0.51  0.38  0.10  0.01
Ans:
xi            0     1     2     3     Total
pi            0.51  0.38  0.10  0.01  1
$x_i p_i$     0     0.38  0.20  0.03  0.61
$x_i^2$       0     1     4     9
$x_i^2 p_i$   0     0.38  0.40  0.09  0.87
$E(X) = \sum_{i=0}^{3} x_i p_i = 0.61$
$E(X^2) = \sum_{i=0}^{3} x_i^2 p_i = 0.87$
Now, $Var(X) = \sigma^2 = E(X^2) - [E(X)]^2$
$= 0.87 - (0.61)^2$
$= 0.87 - 0.3721$
$= 0.4979$
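Ex.5. Find the expected value for the density function of a random variable X given by
$f(x) = \frac{1}{2}x$, $0 < x < 2$
$= 0$ otherwise
Ans: $E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$
$= \int_0^2 x \left(\frac{1}{2}x\right) dx$
$= \int_0^2 \frac{1}{2}x^2\,dx = \left.\frac{x^3}{6}\right|_0^2 = \frac{4}{3}$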
Ex.1. If X is binomially distributed with 6 trials and a probability of success equal to $\frac{1}{4}$ at each
$P(X = 4) = \frac{6!}{4!\,(6 - 4)!} \left(\frac{1}{4}\right)^4 \left(\frac{3}{4}\right)^{6-4}$
$= 15 \times \frac{1}{256} \times \frac{9}{16}$
$= \frac{135}{4096}$
$= 0.033$
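Ex.2. When an unbiased coin is tossed 8 times what is the probability of getting:
a) less than 4 heads b) more than 5 heads?
Ans: Here, $n = 8$
Let p be the probability of getting head
$p = \frac{1}{2}$, $q = 1 - \frac{1}{2} = \frac{1}{2}$
a) $P(X < 4) = P(0) + P(1) + P(2) + P(3)$
$= \left(\frac{1}{2}\right)^8 + 8\left(\frac{1}{2}\right)^8 + 28\left(\frac{1}{2}\right)^8 + 56\left(\frac{1}{2}\right)^8$
$= 93\left(\frac{1}{2}\right)^8$
$= \frac{93}{256} = 0.3633$
b) $P(X > 5) = P(6) + P(7) + P(8)$
$= 28\left(\frac{1}{2}\right)^8 + 8\left(\frac{1}{2}\right)^8 + \left(\frac{1}{2}\right)^8$
$= 37\left(\frac{1}{2}\right)^8$
$= \frac{37}{256}$
$= 0.1445$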
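The binomial probabilities in these examples can be computed exactly with fractions. This is a minimal sketch; `binomial_pmf` is an illustrative helper, not a library function.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(r, n, p):
    """P(X = r) = nCr * p^r * q^(n - r), with q = 1 - p."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)


# Six trials with p = 1/4: probability of exactly 4 successes
p_four_of_six = binomial_pmf(4, 6, Fraction(1, 4))

# Eight tosses of an unbiased coin
half = Fraction(1, 2)
p_less_than_4 = sum(binomial_pmf(r, 8, half) for r in range(0, 4))
p_more_than_5 = sum(binomial_pmf(r, 8, half) for r in range(6, 9))

print(p_four_of_six, p_less_than_4, p_more_than_5)
print(round(float(p_less_than_4), 4), round(float(p_more_than_5), 4))
```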
Ex.3. A biased die is thrown thirty times and the number of sixes seen is eight. If the die is thrown a
further twelve times, find:
a) the probability that a six will occur exactly twice;
b) the expected number of sixes;
c) the variance of number of sixes.
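Ans: A biased die is thrown thirty times and the number of sixes seen is eight.
So, $p = \frac{8}{30} = \frac{4}{15}$, $q = \frac{11}{15}$
a) $P(2) = \frac{12!}{2!\,(12 - 2)!} \left(\frac{4}{15}\right)^2 \left(\frac{11}{15}\right)^{12-2}$
$= \frac{66 \times 4^2 \times 11^{10}}{15^{12}}$
$= 0.211$
b) Expected number of sixes = mean
$E(X = r) = \bar{x} = np = 12 \times \frac{4}{15} = 3.2$
c) Variance of sixes,
$V(X = r) = \sigma^2 = npq = 12 \times \frac{4}{15} \times \frac{11}{15} = 2.347$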
Ex.4. A random variable is binomially distributed with mean 6 and variance 4.2. Find 𝑃(𝑋 ≤ 6)
Ans: Since X is a binomial distribution,
Mean = 𝑛 𝑝 = 6
Variance = 𝑛 𝑝 𝑞 = 4.2
$6 \times q = 4.2$
$q = \frac{4.2}{6} = 0.7$
Also,
$p = 1 - q = 1 - 0.7 = 0.3$
This gives,
$n \times 0.3 = 6$
$n = 20$
Now,
$P(r \le 6) = \sum_{r=0}^{6} {}^{20}C_r\,(0.3)^r (0.7)^{20-r}$
$P(X = r \le 6) = P(r = 0) + P(r = 1) + P(r = 2) + P(r = 3) + P(r = 4) + P(r = 5) + P(r = 6)$
$= {}^{20}C_0 (0.7)^{20} + {}^{20}C_1 (0.3)^1 (0.7)^{19} + {}^{20}C_2 (0.3)^2 (0.7)^{18} + {}^{20}C_3 (0.3)^3 (0.7)^{17} + \dots + {}^{20}C_6 (0.3)^6 (0.7)^{14}$
$= 0.6080$
Ex.5. Inland Revenue audits 5% of all companies every year. The companies selected for auditing in
any one year are independent of the previous year’s selection.
a) What is the probability that the company ‘Ross Waste Disposal’ will be selected for auditing
exactly twice in the next 5 years?
b) What is the probability that the company will be audited exactly twice in the next 2 years?
c) What is the exact probability that this company will be audited at least once in the next 4
years?
Ans: Here, $p = 0.05$, $q = 1 - p = 0.95$
a) For $n = 5$, $r = 2$
$P(X = r) = \frac{n!}{r!\,(n - r)!} p^r q^{n-r} = \frac{5!}{2!\,3!} (0.05)^2 (0.95)^3 = 0.0214$
b) For $n = 2$, $r = 2$
$P(X = r) = \frac{n!}{r!\,(n - r)!} p^r q^{n-r} = \frac{2!}{2!\,0!} (0.05)^2 (0.95)^0 = 0.0025$
c) For $n = 4$, $r \ge 1$
$P(X = r \ge 1) = 1 - P(X = 0)$
$= 1 - \frac{4!}{0!\,4!} (0.05)^0 (0.95)^4 = 0.1855$
$\sigma^2 = npq \Rightarrow 2 = 6 \times q$
$q = \frac{1}{3}$
Again,
$p = 1 - q \Rightarrow p = 1 - \frac{1}{3}$
$p = \frac{2}{3}$
Hence, $\bar{x} = np \Rightarrow 6 = n \times \frac{2}{3}$
$n = 9$
Ex.7. Eight coins are tossed at a time 256 times. Number of heads at each throw is recorded and
results are given below. Find the expected frequencies and fit the Binomial distribution.
No. of Heads at a throw 0 1 2 3 4 5 6 7 8
Frequency 2 6 30 52 67 56 32 10 1
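Ans: The probability of getting a head in a single throw is,
$P(H) = p = \frac{1}{2}$
Hence, $P(T) = q = 1 - p = \frac{1}{2}$
No. of heads ($r$)    $P(X = r) = {}^8C_r\, p^r q^{8-r}$    Expected frequency $= 256 \times P(X = r)$
0    ${}^8C_0 \left(\frac{1}{2}\right)^0 \left(\frac{1}{2}\right)^8$    1
1    ${}^8C_1 \left(\frac{1}{2}\right)^1 \left(\frac{1}{2}\right)^7$    8
2    ${}^8C_2 \left(\frac{1}{2}\right)^2 \left(\frac{1}{2}\right)^6$    28
3    ${}^8C_3 \left(\frac{1}{2}\right)^3 \left(\frac{1}{2}\right)^5$    56
4    ${}^8C_4 \left(\frac{1}{2}\right)^4 \left(\frac{1}{2}\right)^4$    70
5    ${}^8C_5 \left(\frac{1}{2}\right)^5 \left(\frac{1}{2}\right)^3$    56
6    ${}^8C_6 \left(\frac{1}{2}\right)^6 \left(\frac{1}{2}\right)^2$    28
7    ${}^8C_7 \left(\frac{1}{2}\right)^7 \left(\frac{1}{2}\right)^1$    8
8    ${}^8C_8 \left(\frac{1}{2}\right)^8 \left(\frac{1}{2}\right)^0$    1
Total    256
The Binomial distribution is properly fit.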
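The expected frequencies of a fitted binomial distribution are N times the binomial probabilities. The sketch below assumes a fair coin (p = 1/2) and N = 256 throws, as in the example; the function name is hypothetical.

```python
from math import comb


def expected_binomial_frequencies(n, p, total):
    """Expected frequency of r successes, r = 0..n, in `total` repetitions."""
    return [total * comb(n, r) * p ** r * (1 - p) ** (n - r) for r in range(n + 1)]


observed = [2, 6, 30, 52, 67, 56, 32, 10, 1]
expected = expected_binomial_frequencies(8, 0.5, 256)

for heads, (obs, exp) in enumerate(zip(observed, expected)):
    print(heads, obs, exp)
```

Printing the two columns side by side makes it easy to see where the observed counts differ from the fitted ones; both columns total 256.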
$P(X = r, \lambda) = \frac{\lambda^r}{r!} e^{-\lambda}$, $r = 0, 1, 2, \dots$
$= 0$ otherwise
where, λ is the parameter of the Poisson’s distribution.
The probability distribution of the Poisson random variable X, representing the number of
outcomes occurring in a given time interval or specified region denoted by t, is
$P(X = r, \lambda t) = \frac{(\lambda t)^r}{r!} e^{-\lambda t}$, $r = 0, 1, 2, \dots$
$= 0$ otherwise
where, λ is the average number of outcomes per unit time, distance, area or volume
The Poisson distribution occurs in different situations, for example:
1. It gives the probabilities of a given number of phone calls in a certain time interval;
2. It gives the probabilities of a given number of flaws on a length unit of a wire;
3. It gives the probabilities of a specific number of faults on an area unit of a fabric;
4. It gives the probabilities of a specific number of bacteria in a volume unit of a solution;
5. It gives the probabilities of a specific number of accidents on time unit.
Note:
1. For $X \sim P(\lambda)$,
Expectation = mean = $E(X) = \lambda$
Variance = $V(X) = \lambda$
2. $P(X \le n; \lambda) = \sum_{r=0}^{n} P(r; \lambda)$
Ex.1. During a laboratory experiment, the average number of radioactive particles passing through
a counter in 1 millisecond is 4. What is the probability that 6 particles enter the counter in a given
millisecond?
Ans: Here the outcomes $r = 6$, $\lambda t = 4$
Using Poisson’s distribution,
$P(X = r, \lambda t) = \frac{(\lambda t)^r}{r!} e^{-\lambda t}$
$P(6; 4) = \frac{4^6}{6!} e^{-4} = 0.1042$
Ex.2. The average number of planes landing at an airport each hour is 10 while the maximum
number it can handle is 15. What is the probability that on a given hour some planes will have to be
put on a holding pattern?
Ans: Here, the outcome $X = r > 15$, $\lambda t = 10$
Using the sum of Poisson probabilities,
$P(X \le n; \lambda) = \sum_{r=0}^{n} P(r; \lambda)$
$P(X > 15; \lambda) = 1 - \sum_{r=0}^{15} P(r; \lambda)$
$= 1 - [P(r = 0) + P(r = 1) + P(r = 2) + \dots + P(r = 15)]$
$= 1 - 0.9513$
$= 0.0487$
Ex.3. The average number of accidents at a level-crossing every year is 5. Calculate the probability
that there are exactly 3 accidents this year.
Ans: Here, $r = 3$, $\lambda t = 5$
$P(X = r, \lambda t) = \frac{(\lambda t)^r}{r!} e^{-\lambda t}$
$P(X = 3, 5) = \frac{5^3}{3!} e^{-5} = 0.1404$
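The three Poisson examples can be recomputed with the standard library alone. This is a sketch; `poisson_pmf` and `poisson_cdf` are illustrative helper names, and `mean` stands for the product λt.

```python
from math import exp, factorial


def poisson_pmf(r, mean):
    """P(X = r) = mean^r * e^(-mean) / r!"""
    return mean ** r * exp(-mean) / factorial(r)


def poisson_cdf(n, mean):
    """P(X <= n): sum of the probabilities for r = 0, 1, ..., n."""
    return sum(poisson_pmf(r, mean) for r in range(n + 1))


p_six_particles = poisson_pmf(6, 4)            # radioactive particles
p_holding_pattern = 1 - poisson_cdf(15, 10)    # more than 15 planes in an hour
p_three_accidents = poisson_pmf(3, 5)          # level-crossing accidents

print(round(p_six_particles, 4),
      round(p_holding_pattern, 4),
      round(p_three_accidents, 4))
```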
Therefore, the Poisson distribution is $P(X = r, \lambda) = N \times \frac{\lambda^r}{r!} e^{-\lambda}$
X    0            1                    2                                3                                4
f    $e^{-0.61}$  $e^{-0.61}(0.61)$    $e^{-0.61}\frac{(0.61)^2}{2!}$   $e^{-0.61}\frac{(0.61)^3}{3!}$   $e^{-0.61}\frac{(0.61)^4}{4!}$
Where, $\bar{x}$ and $\sigma$ are the mean and standard deviation of the distribution respectively and
$z = \frac{r - \bar{x}}{\sigma}$ is called the standard normal variate.
Ex.1. Suppose a particular population has $\bar{x} = 4$ and $\sigma = 2$. Find the probability of a randomly
selected value being greater than 6.
Ans: The Z value corresponding to $r = 6$ is,
$z = \frac{r - \bar{x}}{\sigma} = \frac{6 - 4}{2} = 1$
Note:
1. The normal distribution curve (z curve) is “bell-shaped”, with two tails that theoretically never
meet the X-axis, and is symmetric about $X = \bar{x}$
2. In a standard normal distribution, $\bar{x} = 0$ and $\sigma^2 = 1$, denoted by $N(0, 1)$
Hence, mean = median = mode = $\bar{x}$
3. The area under the Z curve gives the probability,
i.e. $P(-\infty < r < \infty) = 1$
Since the curve is symmetric about $\bar{x}$, we have
$P(X = r < \bar{x}) = P(X = r > \bar{x}) = 0.5$
4. Any normal distribution can be converted into the standard normal variate (SNV) Z using the
formula
For $X \sim N(\bar{x}, \sigma^2)$, $z = \frac{x - \bar{x}}{\sigma}$ [It uses Z table for finding probability]
5. The probability of the variate having a value within a certain interval [a, b] is calculated using
$P(a < X < b) = \int_a^b \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \bar{x}}{\sigma}\right)^2} dx$
Ex.2. Wool fibre breaking strengths are normally distributed with mean 𝑥 = 23.56 Newton and
standard deviation 𝜎 = 4.55. What proportion of fibres would have a breaking strength of 14.45
or less?
Ans: Here, $\bar{x} = 23.56$, $\sigma = 4.55$ and $r = 14.45$.
Draw a diagram and label it with the given values.
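Convert the raw score to a standard score:
$$z = \frac{r - \bar{x}}{\sigma} = \frac{14.45 - 23.56}{4.55} \approx -2.0$$
That is, the raw score of 14.45 is equivalent to a standard score of -2.0. It is negative
because it is on the left-hand side of the curve.
Use tables to find the probability and adjust this result to the required probability:
$P(r < 14.45) = P(z < -2.0)$
$= 0.5 - P(0 < z < 2)$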
= 0.5 − 0.4772
= 0.0228
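As a cross-check on the table look-ups, the same probabilities can be obtained from the cumulative distribution function. This is a minimal sketch using `statistics.NormalDist` from the Python standard library; the helper name `prob_below` is an illustrative choice.

```python
from statistics import NormalDist

def prob_below(r: float, mean: float, sd: float) -> float:
    """P(X < r) for a normal variate with the given mean and standard deviation."""
    return NormalDist(mu=mean, sigma=sd).cdf(r)

# Ex.1: mean 4, standard deviation 2, probability of a value greater than 6
p_ex1 = 1 - prob_below(6, 4, 2)

# Ex.2: wool fibre strength, mean 23.56, standard deviation 4.55, value of 14.45 or less
z_ex2 = (14.45 - 23.56) / 4.55
p_ex2 = prob_below(14.45, 23.56, 4.55)

print(round(p_ex1, 4), round(z_ex2, 2), round(p_ex2, 4))
```

Small differences from the table values are expected because the table method rounds z to -2.0 before the look-up.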
Inverse process: (to find a value for $r$ corresponding to a given probability)
1. Draw a diagram and label it.
2. Shade the area given in the question.
3. Use probability tables to find the Z-score.
4. Convert the standard score Z to the raw score $r$ using the inverse formula
$$r = z \times \sigma + \bar{x}$$
Ex.3. Carrots entering a processing factory have an average length of 15.3 cm and standard
deviation of 5.4cm. If the lengths are approximately normally distributed, what is the maximum
length of the lowest 5% of the load (Given 𝑇𝑎𝑏 𝑧 = 1.645 at 5 %)?
Ans: Here, $\bar{x} = 15.3$, $\sigma = 5.4$ and $r = ?$
Draw a diagram and label it.
Use standard normal tables to find the Z-score corresponding to this area of probability.
Convert the standard score Z to a raw score $r$ using the inverse formula
$$r = z \times \sigma + \bar{x}$$
Here, the Z-score for the lowest 5% is -1.645 from the normal Z table (negative because it is below the mean).
Hence, $r = z \times \sigma + \bar{x}$
$= -1.645 \times 5.4 + 15.3$
$= 6.4$
The maximum length of the lowest 5% of the load is 6.4 cm.
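The inverse process can also be sketched in code. The example below assumes Python's standard `statistics.NormalDist`, whose `inv_cdf` method plays the role of reading the Z table backwards; the variable names are illustrative.

```python
from statistics import NormalDist

mean_length = 15.3   # cm, from the carrot example
sd_length = 5.4      # cm

# Step 1: Z-score that cuts off the lowest 5% of a standard normal curve
z = NormalDist().inv_cdf(0.05)

# Step 2: convert the standard score back to a raw score, r = z * sigma + mean
r = z * sd_length + mean_length

print(round(z, 3), round(r, 1))
```

Calling `NormalDist(mean_length, sd_length).inv_cdf(0.05)` directly would combine both steps into one.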
Ex.4. The finish times for marathon runners during a race are normally distributed with a mean of
195 minutes and a standard deviation of 25 minutes.
a) What is the probability that a runner will complete the marathon within 3 hours?
b) Calculate to the nearest minute, the time by which the first 8% runners have completed the
marathon.
c) What proportion of the runners will complete the marathon between 3 hours and 4 hours?
a) $r = 180 \Rightarrow z = \dfrac{180 - 195}{25} = -0.6$
Hence, the proportion of runners taking between 3 hours and 4 hours is approximately 70%.
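All three parts of the marathon example can be verified numerically. The sketch below assumes only the mean (195 minutes) and standard deviation (25 minutes) given in the question and uses `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

finish_time = NormalDist(mu=195, sigma=25)   # minutes

# a) probability of finishing within 3 hours (180 minutes)
p_within_3h = finish_time.cdf(180)

# b) time by which the first 8% of runners have finished
t_first_8_percent = finish_time.inv_cdf(0.08)

# c) proportion finishing between 3 hours (180 min) and 4 hours (240 min)
p_between_3h_and_4h = finish_time.cdf(240) - finish_time.cdf(180)

print(round(p_within_3h, 4), round(t_first_8_percent), round(p_between_3h_and_4h, 4))
```

Part c) should come out close to the figure of approximately 70% stated above.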
Ex.5. For the following standard normal variates z find the proportion (area) occupied by them as
measured from zero.
i) z = 1.98
ii) z = -0.5
iii) z = 1.35 to 2.18
iv) z = 1.98 to 0.5
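Areas measured from zero can be read from the Z table or computed directly. The following sketch covers parts (i) to (iii) using `statistics.NormalDist`; the helper name `area_from_zero` is an illustrative choice.

```python
from statistics import NormalDist

std_normal = NormalDist()   # standard normal N(0, 1)

def area_from_zero(z: float) -> float:
    """Area under the Z curve between 0 and z (the same for +z and -z by symmetry)."""
    return abs(std_normal.cdf(z) - 0.5)

area_i = area_from_zero(1.98)
area_ii = area_from_zero(-0.5)
# Area between two z values on the same side of zero: difference of the two areas
area_iii = area_from_zero(2.18) - area_from_zero(1.35)

print(round(area_i, 4), round(area_ii, 4), round(area_iii, 4))
```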
Exercise
Q.1. Suppose that a pair of fair dice are to be tossed, and let the random variable X denote the sum
of the points. Obtain the probability distribution for X
Q.2. Find the expected value for the density function of a random variable X given by
$f(x) = \frac{1}{2}x$ for $0 < x < 2$
$= 0$ otherwise
Q.3. Find the variance and standard deviation of the random variable of above question 2.
Q.4. The probability that a driver must stop at any one traffic light coming to Lincoln University is
0.2. There are 15 sets of traffic lights on the journey.
a) What is the probability that a student must stop at exactly 2 of the 15 sets of traffic lights?
b) What is the probability that a student will be stopped at 1 or more of the 15 sets of traffic
lights?
Q.5. The number of typing mistakes made by a secretary has a Poisson distribution. The mistakes
are made independently at an average rate of 1.65 per page. Find the probability that a three-page
letter contains no mistakes.
Q.6. The download time of a resource web page is normally distributed with a mean of 6.5 seconds
and a standard deviation of 2.3 seconds.
a) What proportion of page downloads take less than 5 seconds?
b) What is the probability that the download time will be between 4 and 10 seconds?
c) How many seconds will it take to complete 35% of the download?
Q.7. For a binomial distribution, mean is 5 and S.D. is 16. Find n, p, q.
Q.8. Mean and S.D. of a binomial distribution are 3 and 2. Find n, p, q.
Q.9. For a binomial distribution, mean is 206 and S.D. is 4. Find n, p, q.
Q.10. Fit Poisson’s distribution to following.
Death 0 1 2 3 4
Freq 122 60 15 2 1
Introduction:
In the first chapter, we have discussed the collection, distribution and analysis of collected
scientific data. The data in biostatistics are generally based on individual observations. Sampling is
very often used in our daily life. For example, while purchasing food grains from a shop we usually
examine a handful from the bag to assess the quality of the commodity. A doctor examines a few
drops of blood as a sample and draws conclusions about the blood constitution of the whole body.
Thus, most of our investigations are based on samples. In this chapter, let us see the importance of
sampling and the various methods of selecting samples from the population.
Population:
In a statistical enquiry, all the items which fall within the range of enquiry are known as the
Population or Universe. In other words, the population is the set of all possible observations to be
investigated, having at least one property in common. For example, the population of a country may
share a common language, literature, geographic origin and genetic heritage which distinguish it
from people of other nationalities. The total number of students studying in a school or college,
the total number of books in a library and the total number of houses in a village or town are
some examples of populations.
The objects or individuals in the population are called members or elements, and the
number of members in the population is the population size. Depending on its size, a
population can be finite or infinite. If the number of members in the population is finite (countable),
it is a finite population, e.g. the number of students in a college, the number of workers in a factory,
or the articles produced by a company on a particular day. If the number of members in the
population is infinite, it is an infinite population, e.g. the number of stars in a galaxy or the number
of people watching television programmes. Statisticians use the word population to refer not only
to people but to all items that have been chosen for study.
Census:
Sometimes it is possible and practical to examine and study every person or item in the
population; such a complete enumeration is called a census. A census is the procedure of
systematically collecting and recording information about every member of a given population. It
provides true measures of the population. For example, if we study the average annual income of the
families of a particular area having 1000 families, then we must study the income of all 1000
families; no family should be left out.
The population census of India is taken every 10 years. The first census was
taken in 1871 – 72. The latest census was taken in 2011.
Merits of census:
1. The data is collected from each and every item of the population.
2. The results are more accurate and reliable.
3. Intensive study is possible.
4. The data collected may be used for various surveys, analyses etc.
Demerits of census:
1. It requires a large number of enumerators and it is a costly method.
2. It requires more money, labour, time, energy etc.
3. It is not possible in some circumstances where the universe is infinite.
Sample:
If the population is infinite or very large, it is impossible to study each and every member.
In this case, small groups of members are selected from the population so that all its important
properties or characteristics are covered by the members of those groups. Such groups are known as
samples. Thus, a sample is a small group with a finite number of members selected from a statistical
population so that all the important characteristics of the entire population are covered by the
members of the group. It is a representative subset of people, events or items from a larger
population. To represent a population well, a sample should be randomly collected and adequately
large. The members of a sample, which cannot be further subdivided for sampling, are known as
sample points, and the number of members in a sample is called the sample size. Often it is
necessary to use samples for research because it is impractical to study the whole population. For
example, to study the average height of 12-year-old boys in a country, we could not measure all of
the 12-year-old boys in that country, but we could measure a sample of them.
Reasons for selecting a sample:
Sampling is inevitable in the following situations:
1. Complete enumerations are practically impossible when the population is infinite.
2. When the results are required in a short time.
3. When the area of survey is wide.
4. When resources for survey are limited particularly in respect of money and trained persons.
5. When the item or unit is destroyed under investigation.
Sampling frame:
To adopt any sampling procedure it is essential to have a list identifying each sampling
unit by a number. Such a list or map is called a sampling frame. A list of voters, a list of householders,
a list of villages in a district and a list of farmers are a few examples of sampling frames.
Sampling Methods:
The method of selecting small groups i.e. samples from the population which represent the
characteristics (like height, weight, colour) of the population is called sampling method. If we want
to get really good conclusions from our samples, we need to ensure that we choose our samples
correctly.
The sampling process involves following stages:
(i) Define the population for statistical analysis.
(ii) Specify a sampling frame, i.e. a set of items or events that can be measured.
(iii) Specify a sampling method for selecting items or events from the frame.
(iv) Determine the sample size.
(v) Implement the sampling plan.
(vi) Carry out the sampling and collect the data.
Merits of Sampling:
There are many advantages of sampling methods over census method. They are as follows:
1. Under sampling a statistical investigation is carried out speedily.
2. It results in reduction of cost, time, energy and labour.
3. Sampling ends up with greater accuracy of results.
4. The size of the sample can be increased or decreased according to the size of the universe,
availability of resources and degree of accuracy desired.
5. It has greater scope.
Types of Sampling Techniques (Methods):
Following are the different types of sampling which are commonly used:
1. Simple Random Sampling
2. Systematic Sampling
3. Stratified Random Sampling
4. Cluster Sampling
5. Quota Sampling
1. Simple Random Sampling:
It is the most popular method of choosing a sample from a population for a wide range of
purposes. A simple random sample is a group of individuals selected from a larger population using
either a random number table or a random number generator. Every individual of this sample is
selected randomly and has an equal chance (probability) of being selected. The process or technique
of selecting individuals with the same probability of being selected is known as simple random
sampling.
Suppose we have a population size (N) of 10,000 students in a university. Each of them is
known as a unit or member. To select a sample of the required size (n), say 200, we could use simple
random sampling. Students would be selected at random and sent a questionnaire for analysis.
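A random number generator makes this selection straightforward. The sketch below is only an illustration: it assumes the students are numbered 1 to 10,000 (a hypothetical numbering) and uses a fixed seed so that the draw can be repeated.

```python
import random

N = 10_000          # population size from the example
n = 200             # required sample size

population = range(1, N + 1)    # hypothetical student numbers 1..N
rng = random.Random(2021)       # fixed seed chosen only for reproducibility

# random.sample draws without replacement, so every student has the same
# chance of selection and no student is picked twice
sample = rng.sample(population, n)

print(sorted(sample)[:10])
```

Sampling with replacement could be sketched with `rng.choices(population, k=n)` instead, in which case the same student may appear more than once.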
Steps to create Simple Random Sample:
a) Define the population
Suppose a researcher wants to study the career goals of students in an Institute
which has about 8,000 students. Thus, the population size (N) is 8,000. He wants to select a
sample of size (n) 100 students using systematic sampling. With systematic random sampling, every
student would have an equal chance of being selected for the required sample.
Steps to create Systematic Random Sample:
a) Define the population
b) Select the sample size (n)
c) List the population and arrange in specific or definite order
d) Calculate value of k
e) Select the first unit
f) Select sample of required size.
a) Define the population:
In the situation described above, the population size (N) is 8,000 students and we are
interested in all the students of the Institute. The Institute may consist of male and female students.
If we choose to study only females, the male students of the Institute would be excluded.
b) Select the Sample Size (n):
Decide the number of members of the sample for the study. Suppose we choose a
sample size (n) of 100 female students. The sample size limits the number of questionnaires and the
time required to distribute them to students.
c) List the population and arrange in specific or definite order:
For a sample of 100 female students we first have to identify all 8,000 students of the Institute.
Collect the information about all the females studying in the Institute. Then arrange all the
females in a specific order, i.e. either assign numbers from 1 to N or arrange them alphabetically.
d) Calculate value of ‘k’:
Assuming that we have chosen a sample of 100 students, we need to find the value of k,
which is the ratio of the population size to the sample size:
$$k = \frac{N}{n} = \frac{8000}{100} = 80$$
It tells us that we have to choose 1 student in every 80 students from the population of
8,000 students of the Institute.
e) Select the first unit:
After finding k, we need to select the first student at random. As we have assigned numbers
to the members of population, choose any student at random from 1 to 80 (k) and suppose it is 25th
student.
f) Select sample of required size:
We have the first member i.e. 25th student of our sample. So we can select the remaining 99
members easily using value k.
Now add 𝑘 = 80 to first member 25 which will give next member.
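Steps d) to f) can be sketched in a few lines of code. The function name `systematic_sample` and the numbering of students from 1 to N are assumptions made for illustration; the values N = 8,000, n = 100 and a first unit of 25 come from the example.

```python
import random

def systematic_sample(N: int, n: int, start=None):
    """Select every k-th unit, with k = N // n, beginning at a start between 1 and k."""
    k = N // n
    if start is None:
        start = random.randint(1, k)   # random first unit when none is supplied
    return [start + i * k for i in range(n)]

# Example from the text: first unit is the 25th student, k = 8000 / 100 = 80
sample = systematic_sample(8000, 100, start=25)

print(sample[:4], sample[-1])
```

The selected students are the 25th, 105th, 185th and so on, each 80 places after the previous one.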
This means that we need to select 60 male students and 40 female students for our sample of
100 students.
f) Select sample of required size:
Finally, we have to select 60 male students from 600 and 40 female students from 400
using either simple random sampling or systematic random sampling to complete the sample.
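Proportional allocation across strata can be sketched as follows. The stratum sizes (600 male and 400 female students) and the total sample size of 100 are taken from the example; the function name, the student numbering within each stratum and the fixed seed are assumptions made for illustration.

```python
import random

def proportional_allocation(strata_sizes: dict, n: int) -> dict:
    """Sample size per stratum: n_h = n * N_h / N, rounded to whole members."""
    total = sum(strata_sizes.values())
    return {name: round(n * size / total) for name, size in strata_sizes.items()}

strata_sizes = {"male": 600, "female": 400}
allocation = proportional_allocation(strata_sizes, 100)

# Draw a simple random sample inside each stratum (members numbered 1..N_h)
rng = random.Random(7)   # fixed seed only for reproducibility
sample = {
    name: rng.sample(range(1, strata_sizes[name] + 1), size)
    for name, size in allocation.items()
}

print(allocation)
```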
The principal reasons for using stratified random sampling rather than simple random
sampling include:
1. Stratification may produce a smaller error of estimation than would be produced by a simple
random sample of the same size. This result is particularly true if measurements within strata
are very homogeneous.
2. The cost per observation in the survey may be reduced by stratification of the population
elements into convenient groupings.
3. Estimates of population parameters may be desired for subgroups of the population. These
subgroups should then be identified.
Exercise
1. Define population. Explain need of sample in detail.
2. Differentiate census and sample.
3. Explain simple random sampling with replacement and without replacement.
4. Distinguish between: simple random sampling with replacement and without replacement.
5. Write short note on systematic sampling.
6. Distinguish between: stratified sampling and cluster sampling.
7. Explain quota sampling.
8. Differentiate between systematic sampling and stratified sampling.
9. Give advantages and disadvantages of simple random sampling.
10. Explain the advantages of sampling.
4. Correlation
Introduction:
In the previous lesson, we learned about the joint probability distribution of two random
variables X and Y. In this lesson, we'll extend our investigation of the relationship between two
random variables by learning how to quantify the extent or degree to which two random
variables X and Y are associated or correlated.
The idea of correlation is used by the common person in day-to-day life, knowingly or
unknowingly. For example, when parents advise their children to work hard so that they may get
good marks, they are correlating good marks with hard work.
In the previous lesson we studied the characteristics, measures of central tendency
and measures of dispersion of one variable, i.e. univariate data. But there are variables which are
related to each other, e.g. the height and weight of persons. Such data,
containing two variables which are related to each other, are called bivariate data in statistical
analysis. Sometimes the variables may be interrelated, like blood pressure and age. The nature and
strength of the relationship may be studied by correlation and regression.
Correlation:
In statistical analysis, two sets of data or two random variables may depend on each other
in such a way that an increase or decrease in the values of one variable results in either an increase
or a decrease in the values of another variable. The extent of the linear relationship between two or
more variables is called correlation.
E.g. correlation between the demand for a product and its price
Correlation is a single number that describes the degree of linear relationship between two
variables. It is a statistical technique which shows how strongly pairs of variables are related. Two
variables are said to be correlated if a change in one of the variables results in a change in the other
variable.
Uses of correlation:
1. It is used in physical and social sciences.
2. Businessmen estimate costs, sales, price etc. using correlation.
3. It is useful for economists to study the relationship between variables like price, quantity
etc.
4. It is helpful in measuring the degree of relationship between variables like income and
expenditure, price and supply, supply and demand etc.
5. Sampling error can be calculated.
6. It is the basis for the concept of regression.
Scatter Diagram:
A scatter diagram is the diagrammatic representation of the relationship between two variables.
It is the simplest method of studying correlation. In a scatter diagram, one variable is taken along
the horizontal axis and the second variable is taken along the vertical axis. Each pair of observations
of the two variables is represented by a dot in the plane of the axes. There are as many dots in the
plane as the number of paired observations of the two variables. The direction of the dots shows the
scattering or concentration of the given points, which further helps to decide the type of correlation.
The following are the types of correlation:
1) Positive Correlation:
If a change in the values of one variable leads to a change in the same direction in the values
of another variable then it is positive correlation. It is a relationship between two variables which
move in the same direction. In positive correlation, if the values of one variable decrease then the
values of the other variable also decrease, and vice versa.
E.g. Price and supply are two variables which are positively correlated. When price increases,
supply also increases; when price decreases, supply decreases.
The scatter diagram for positive correlation is shown below. The line corresponding to the
scatter plot is an increasing line.
[Figure: Positive Correlation]
2) Negative Correlation:
If a change in the values of one variable leads to the opposite change in the values of another
variable then it is negative correlation. It is a relationship between two variables which move in
opposite directions. In negative correlation, if the values of one variable decrease then the values of
the other variable increase, and if the values of one variable increase then the values of the second
variable decrease.
E.g. Price and demand are two variables which are negatively correlated. When price increases,
demand decreases; when price decreases, demand increases.
The scatter diagram for negative correlation is shown below. The line corresponding to the
scatter plot is a decreasing line.
[Figure: Negative Correlation]
3) Zero Correlation:
When there is no relationship between two variables, it is zero
correlation. An increase or decrease in the values of one variable does not affect the other variable.
E.g. The statement “the more weight I gain, the smarter I will be” has no basis: intelligence is not
affected by weight, i.e. there is no relation between these two variables.
The scatter diagram for zero correlation is shown below. No correlation occurs when there
is no linear dependency between two variables.
[Figure: Zero Correlation]
the degree of that relationship. This measure of correlation is called the correlation coefficient, i.e.
the numerical value that determines, in unit-free terms, the degree to which two variables are
related to each other. It gives the strength and direction of a linear
relationship.
Covariance:
Before studying correlation coefficient we will start with covariance which computes the
dependence between two random variables say X and Y.
i.e. if X and Y are two random variables (discrete or continuous) with respective means $\bar{x}$ and $\bar{y}$ then
the covariance of X and Y, denoted by Cov(X, Y), is defined as:
$$\mathrm{Cov}(X, Y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n} = \frac{\sum x_i y_i}{n} - \bar{x}\,\bar{y}$$
where $x_i$ are observations in X and $y_i$ are observations in Y.
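The two forms of the covariance formula give the same value, which can be confirmed numerically. The sketch below uses a small set of father and son heights as illustrative data; the function names are assumptions made for this example.

```python
def covariance_deviation_form(x, y):
    """Cov(X, Y) = sum((x_i - mean_x) * (y_i - mean_y)) / n"""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n

def covariance_product_form(x, y):
    """Cov(X, Y) = sum(x_i * y_i) / n - mean_x * mean_y"""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum(xi * yi for xi, yi in zip(x, y)) / n - mean_x * mean_y

# Illustrative data: heights of fathers (x) and sons (y)
x = [64, 65, 66, 67, 68, 69, 70]
y = [66, 67, 65, 68, 70, 68, 72]

cov_a = covariance_deviation_form(x, y)
cov_b = covariance_product_form(x, y)
print(round(cov_a, 4), round(cov_b, 4))
```

Note that both functions divide by n, as in the definition above, rather than by n - 1.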
Note:
1. The value of correlation coefficient lies in between -1 and +1.
2. If correlation coefficient = 1 then it is perfectly positive correlation.
3. If correlation coefficient = -1 then it is perfectly negative correlation.
4. If correlation coefficient = 0 then it is zero correlation i.e. there is no correlation.
5. If correlation coefficient >0 then variables are positively correlated.
6. If correlation coefficient <0 then variables are negatively correlated.
Coefficient of correlation can be measured using two methods:
1) Karl Pearson’s Correlation Coefficient (r)
2) Spearman’s Rank Correlation Coefficient (R)
1) Karl Pearson’s Correlation Coefficient (r):
This is a simple and the most common way to measure the degree of correlation between two
variables. It is also known as the product-moment correlation coefficient. It is a measure of the
strength as well as the direction of a linear relationship between two variables. It tries to draw a line
of best fit through the data of two variables and indicates how far the points are from the line of fit.
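If $x_1, x_2, x_3, \dots, x_n$ are n observations of variable X and $y_1, y_2, y_3, \dots, y_n$ are n observations of
variable Y then Karl Pearson’s Correlation Coefficient, denoted by r, is defined as
$$r = \frac{\mathrm{Cov}(X, Y)}{\sigma_x \sigma_y}, \quad \text{where } \sigma_x = \sqrt{\frac{\sum (x_i - \bar{x})^{2}}{n}} \text{ and } \sigma_y = \sqrt{\frac{\sum (y_i - \bar{y})^{2}}{n}}$$
Thus,
$$r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left[n\sum x_i^{2} - \left(\sum x_i\right)^{2}\right]\left[n\sum y_i^{2} - \left(\sum y_i\right)^{2}\right]}}$$
OR
$$r = \frac{n\sum dx\,dy - \sum dx \sum dy}{\sqrt{\left[n\sum dx^{2} - \left(\sum dx\right)^{2}\right]\left[n\sum dy^{2} - \left(\sum dy\right)^{2}\right]}}$$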
Steps:
1. Find the means $\bar{x}$, $\bar{y}$ of the two variables X and Y.
2. Take the deviations dx, dy of the two series X and Y using the formulae $dx = x_i - \bar{x}$ and
$dy = y_i - \bar{y}$. Then take their squares $dx^{2}$ and $dy^{2}$ and prepare the table as shown below.
X    Y    dx    dy    $dx^{2}$    $dy^{2}$    dx·dy
Ex.1. Calculate the coefficient of correlation from the 7 pairs of observations, given that,
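$\sum x = 212$, $\sum y = 152$, $\sum x^{2} = 6514$, $\sum y^{2} = 3390$, $\sum xy = 4681$.
Ans:
$$r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left[n\sum x_i^{2} - \left(\sum x_i\right)^{2}\right]\left[n\sum y_i^{2} - \left(\sum y_i\right)^{2}\right]}}$$
$$= \frac{7 \times 4681 - 212 \times 152}{\sqrt{\left[7 \times 6514 - (212)^{2}\right]\left[7 \times 3390 - (152)^{2}\right]}}$$
$$= \frac{32767 - 32224}{\sqrt{[45598 - 44944][23730 - 23104]}}$$
$$= \frac{543}{\sqrt{654 \times 626}}$$
$$= \frac{543}{639.8468}$$
r = 0.8486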
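When only the sums are given, as in Ex.1, the formula can be applied directly. The following is a minimal sketch; the function name `pearson_from_sums` and its parameter names are illustrative.

```python
import math

def pearson_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Karl Pearson's r from n and the five sums used in the formula."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
    return numerator / denominator

# Values given in Ex.1
r = pearson_from_sums(n=7, sum_x=212, sum_y=152, sum_x2=6514, sum_y2=3390, sum_xy=4681)
print(round(r, 4))
```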
Ex.2. Find Karl Pearson’s coefficient of correlation from the following data between height of father
(x) and son (y).
X 64 65 66 67 68 69 70
Y 66 67 65 68 70 68 72
Ans:
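$$\bar{x} = \frac{\sum x}{n} = \frac{469}{7} = 67 \quad \& \quad \bar{y} = \frac{\sum y}{n} = \frac{476}{7} = 68$$
$$r = \frac{n\sum dx\,dy - \sum dx \sum dy}{\sqrt{\left[n\sum dx^{2} - \left(\sum dx\right)^{2}\right]\left[n\sum dy^{2} - \left(\sum dy\right)^{2}\right]}}$$
$$= \frac{7 \times 25 - 0 \cdot 0}{\sqrt{\left[7 \times 28 - (0)^{2}\right]\left[7 \times 34 - (0)^{2}\right]}}$$
$$= \frac{175}{\sqrt{196 \times 238}}$$
$$= \frac{175}{215.9814}$$
r = 0.810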
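The deviation table behind this calculation can be generated programmatically. The sketch below follows the steps listed earlier (means, deviations, squares and products) for the data of Ex.2 and then reuses the same function for Ex.3; the function name is an illustrative choice.

```python
import math

def pearson_by_deviations(x, y):
    """Return r and the rows (x, y, dx, dy, dx^2, dy^2, dx*dy) of the working table."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    rows = []
    for xi, yi in zip(x, y):
        dx, dy = xi - mean_x, yi - mean_y
        rows.append((xi, yi, dx, dy, dx * dx, dy * dy, dx * dy))
    s_dx = sum(row[2] for row in rows)
    s_dy = sum(row[3] for row in rows)
    s_dx2 = sum(row[4] for row in rows)
    s_dy2 = sum(row[5] for row in rows)
    s_dxdy = sum(row[6] for row in rows)
    r = (n * s_dxdy - s_dx * s_dy) / math.sqrt(
        (n * s_dx2 - s_dx**2) * (n * s_dy2 - s_dy**2)
    )
    return r, rows

# Ex.2: heights of fathers (x) and sons (y)
r2, table2 = pearson_by_deviations([64, 65, 66, 67, 68, 69, 70],
                                   [66, 67, 65, 68, 70, 68, 72])
for row in table2:
    print(*row)
print("r =", round(r2, 3))

# Ex.3 data, using the same function
r3, _ = pearson_by_deviations([65, 66, 67, 67, 68, 69, 70, 72],
                              [67, 68, 65, 68, 72, 72, 69, 71])
```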
Ex.3. Calculate the correlation coefficient for the following heights of fathers (x) and their sons (y).
x 65 66 67 67 68 69 70 72
y 67 68 65 68 72 72 69 71
Ans:
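$$\bar{x} = \frac{\sum x}{n} = \frac{544}{8} = 68 \quad \& \quad \bar{y} = \frac{\sum y}{n} = \frac{552}{8} = 69$$
$$r = \frac{n\sum dx\,dy - \sum dx \sum dy}{\sqrt{\left[n\sum dx^{2} - \left(\sum dx\right)^{2}\right]\left[n\sum dy^{2} - \left(\sum dy\right)^{2}\right]}}$$
$$= \frac{8 \times 24 - 0 \cdot 0}{\sqrt{\left[8 \times 36 - (0)^{2}\right]\left[8 \times 44 - (0)^{2}\right]}}$$
$$= \frac{192}{\sqrt{288 \times 352}}$$
$$= \frac{192}{318.3959}$$
$r = 0.6030$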
Rx = Ranks of data X
Ry = Ranks of data Y
Ex.1. Calculate Rank correlation coefficient from following data.
Marks by Judge A 81 72 60 33 29 11 56 42
Marks by Judge B 75 56 42 15 30 20 60 80
Ans:
X    Y    Rx    Ry    D = Rx - Ry    D²
81 75 1 2 -1 1
72 56 2 4 -2 4
60 42 3 5 -2 4
33 15 6 8 -2 4
29 30 7 6 1 1
11 20 8 7 1 1
56 60 4 3 1 1
42 80 5 1 4 16
Total 32
$$R = 1 - \frac{6\sum D^{2}}{n(n^{2} - 1)} = 1 - \frac{6 \times 32}{8 \times 63} = 1 - 0.3810 = 0.6190$$
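The ranking and the formula can be combined in a short routine. This sketch assumes that the highest value receives rank 1, as in the table above, and that there are no tied values; the function names are illustrative.

```python
def ranks_descending(values):
    """Rank 1 for the largest value; assumes all values are distinct (no ties)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman_rank_correlation(x, y):
    """R = 1 - 6 * sum(D^2) / (n * (n^2 - 1)), where D = Rx - Ry."""
    rx, ry = ranks_descending(x), ranks_descending(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n**2 - 1))

# Ex.1: marks awarded by Judge A and Judge B
judge_a = [81, 72, 60, 33, 29, 11, 56, 42]
judge_b = [75, 56, 42, 15, 30, 20, 60, 80]

R = spearman_rank_correlation(judge_a, judge_b)
print(round(R, 4))
```

Tied values would need average ranks and a correction term, which this sketch does not handle.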
Ex.2. Psychological tests of intelligence and arithmetical ability were applied to 10 candidates.
Results are given in table. Compute rank correlation coefficient between X and Y.
Intelligence ratio (X) 90 95 115 96 85 110 89 98 97 93
Arithmetical ability (Y) 95 90 110 100 85 105 94 106 111 93
Ans.:
X    Y    $R_x$    $R_y$    $D^{2} = (R_x - R_y)^{2}$
90 95 8 6 4
95 90 6 9 9
115 110 1 2 1
96 100 5 5 0
85 85 10 10 0
110 105 2 4 4
89 94 9 7 4
98 106 3 3 0
97 111 4 1 9
93 93 7 8 1
Total 32
$$R = 1 - \frac{6\sum D^{2}}{N(N^{2} - 1)}$$
$$= 1 - \frac{6 \times 32}{10\,(99)}$$
= 1 – 0.194 = 0.806
Ex.3. In dance competition, two judges rank 10 participants in following order. From given data
calculate coefficient of rank correlation?
Ranking by judge M 6 4 3 1 7 8 9 10 5 2
Ranking by judge N 4 1 6 7 8 7 10 3 2 5
Ans.:
Rank by M ($R_x$)    Rank by N ($R_y$)    $D^{2} = (R_x - R_y)^{2}$
6 4 4
4 1 9
3 6 9
1 7 36
7 8 1
8 7 1
9 10 1
10 3 49
5 2 9
2 5 9
Total 128
$$R = 1 - \frac{6\sum D^{2}}{N(N^{2} - 1)}$$
$$= 1 - \frac{6 \times 128}{10\,(99)}$$
= 1 – 0.776
= 0.224
Merits of Rank Correlation Coefficient:
Spearman's Rank method is the only way of studying correlation between qualitative data
which cannot be measured in figures but can be arranged in serial order.
Demerits of Rank Correlation
1) The method cannot be used with two-way frequency tables or bivariate frequency
distributions.
2) It can be conveniently used only when n is small, say up to 30; otherwise the calculations become
tedious.
Exercise
1. Define correlation. Explain different methods of studying correlation.
2. What is Spearman’s rank correlation? When can it be used?
3. Om electrical obtained 120 tube lights from two companies and tested their life in hours.
The following results were obtained. Calculate the coefficient of variation and find which
company’s tubes are more durable?
Life of tubes (hrs) Company A Company B
800-1000 12 15
1000-1200 20 22
1200-1400 38 40
1400-1600 12 13
1600-1800 15 28
1800-2000 03 02
4. From given data of height of father and daughter in centimeters, calculate the correlation
coefficient.
Father 165 168 160 163 170 175 173
Daughter 160 175 166 159 173 180 177
5. Following table shows ages (X) in years and blood pressure (Y). From given data calculate
correlation coefficient.
X 25 50 60 43 51 74 46 33 49 58
Y 120 135 140 115 130 133 126 139 125 136
6. In epidemiological study of glaucoma in urban and rural population following data was
made available by WHO. Find if there is any correlation between urban and rural area.
No. of cases per 1000
Urban 23 35 28 36 45 39 19
Rural 20 30 22 40 35 45 22
7. Examine correlation for given data containing erythrocytes sedimentation rate in mm/hr of
10 male and female.
Male 112 65 70 82 105 75 60
Female 85 100 90 63 78 105 90
8. From data given in table find out the value of Karl Pearson’s Coefficient of correlation.
Fertilizer used 15 19 22 27 35 40 50
Productivity 80 95 102 118 135 144 150
9. Calculate the coefficient of correlation for given data of marks obtained by students in
Pharmaceutics and Pharmacology.
Pharmaceutics 75 51 42 77 62 81 60 58 66 49
Pharmacology 69 48 64 45 71 42 64 70 40 65
10. Compute coefficient of correlation from following data of supply and price of goods.
Supply 182 160 152 169 158 166 179
Price 167 198 152 170 162 152 180
11. Table contains values of import of raw material and export of finished formulation in
suitable unit. Calculate coefficient of correlation.
Import 15 21 15 16 25 19 12 25 21 10
Export 12 16 14 14 22 17 10 23 19 09
12. In a vocal music contest, two judges rank 09 competitors in following order.
Judge A 5 10 8 9 7 5 4 6 7 3
Judge B 3 9 9 6 10 8 7 8 3 6
13. Calculate Karl Pearson’s coefficient of correlation between x and y. State its kind.
X 39 65 62 90 82 75 25 98 36 78
Y 47 53 58 86 62 68 60 91 51 84
15. Given the following values of x and y. Find the correlation coefficient.
X 3 5 6 8 9 11
Y 2 3 4 6 5 8
5. Regression
Introduction:
After studying the relationship between two variables, now in this chapter we are going to
estimate the values of one variable when values of another variable are given. The variable which is
to be estimated is called “dependent” variable and the other is “independent” variable. Thus, the
term regression is used when we want to predict value of a variable based on the value of another
variable. Correlation gives us the extent of linear relationship between two variables while
regression analysis gives the measure of average relationship between two or more variables in
terms of original units of data.
The statistical technique of determining unknown values of one variable from the known
values of another variable is called regression analysis. The relationships between rainfall and
agricultural production, or between consumer expenditure and disposable income, are examples
where regression is applied. Regression analysis is also used to define and characterize
dose-response relationships, to fit linear portions of pharmacokinetic data and to obtain the best fit
to linear physical-chemical relationships.
Difference between correlation and regression:
Correlation is related to regression, but its application and interpretation are different from
those of regression. Let us see the exact difference between correlation and regression, as both
describe the strength of the linear relationship between two or more variables.
Regression:
1. It predicts the values of the dependent variable based on the known values of the independent
variable, assuming an average relationship between two or more variables.
Correlation:
1. It gives the association or intensity of relationship between two variables (x and y).
Types of Regression:
Regression analysis can be classified into following types:
1. Simple Regression: In regression analysis, if only two variables are studied at a time then it
is called simple regression i.e. there is only one independent variable.
2. Multiple Regression: In regression analysis, if more than two variables are studied at a
time then it is called multiple regression i.e. there are two or more independent variables.
3. Linear Regression: If the graphical representation of the given data gives a straight-line
pattern then it is linear regression.
4. Non-linear Regression: If the graphical representation of the given data gives a curved
pattern then it is non-linear or curvilinear regression.
In this chapter, we are going to study Simple linear regression and its equations.
Simple Linear Regression:
Simple linear regression uses only one independent variable and examines the linear
relationship between two continuous variables: dependent (y) and independent (x) using straight
line. When the two variables are related, it is possible to predict a response value from a predictor
value with better than chance accuracy.
Regression provides the line that "best" fits the data. This line can then be used to:
Examine how the response variable changes as the predictor variable changes.
Predict the value of the dependent variable (y) for a given value of the independent variable (x).
Regression Lines:
In regression analysis of two variables, a regression line is a smooth curve fitted to the set of
paired data of x and y; if the curve is a straight line then it is a line of linear regression. There are as
many regression lines as variables. In simple linear regression we take two variables
X and Y, so there are only two regression lines:
Regression line of Y on X: This gives the most probable values of Y from the given values of X.
Regression line of X on Y: This gives the most probable values of X from the given values of Y.
Properties of Regression Lines:
(i) For perfect correlation, i.e. $r = \pm 1$, the two lines coincide, so there is only one straight line.
(ii) If $r = 0$, the two variables are uncorrelated and the two lines cut each other at right angles.
(iii) If the regression lines are close to each other, there is a high degree of correlation.
(iv) If the regression lines are far apart, there is a low degree of correlation.
(v) The two regression lines intersect at the point $(\bar{x}, \bar{y})$, i.e. at the means of X and Y.
[Figure omitted; surviving labels: Graphic, Algebraic]
$$b_{yx} = \frac{n\sum d_x d_y - \sum d_x \sum d_y}{n\sum d_x^2 - \left(\sum d_x\right)^2}, \qquad d_x = x_i - \bar{x}, \quad d_y = y_i - \bar{y}$$
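(ii) Equation of regression line X on Y is given by
$$x - \bar{x} = b_{xy}\,(y - \bar{y})$$
where $\bar{y} = \frac{\sum y_i}{n}$ and $\bar{x} = \frac{\sum x_i}{n}$, and
$$b_{xy} = \frac{n\sum d_x d_y - \sum d_x \sum d_y}{n\sum d_y^2 - \left(\sum d_y\right)^2}, \qquad d_x = x_i - \bar{x}, \quad d_y = y_i - \bar{y}$$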
And
(ii) Regression coefficient of X on Y is given by
$$b_{xy} = r \cdot \frac{\sigma_x}{\sigma_y}$$
Thus,
$$\theta = \tan^{-1}\left[\frac{1 - r^2}{r} \cdot \frac{\sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2}\right]$$
(10) The angle between regression lines indicates the degree of dependence between the variables.
$$\bar{x} = \frac{\sum x}{n} = \frac{89}{5} = 17.8 \cong 18$$
$$\bar{y} = \frac{\sum y}{n} = \frac{283}{5} = 56.6 \cong 57$$
(i) Part A: Equation of regression line Y on X is given by
$$y - \bar{y} = b_{yx}\,(x - \bar{x})$$
$$b_{xy} = \frac{n\sum d_x d_y - \sum d_x \sum d_y}{n\sum d_y^2 - \left(\sum d_y\right)^2} = \frac{5 \times 34 - (-1 \times -2)}{5 \times 72 - (-2)^2} = \frac{170 - 2}{360 - 4} = 0.4719$$
Now, regression line X on Y:
$x - \bar{x} = b_{xy}\,(y - \bar{y})$
$x - 17.8 = 0.4719\,(y - 56.6)$
$x - 17.8 = 0.4719y - 26.71$
$x - 0.4719y = -8.91$
i) To estimate the value of y when $x = 25$, use the equation of regression line Y on X.
Therefore, $y - 1.25(25) = 34.35$
$y - 31.25 = 34.35$
$y = 65.6$
ii) To estimate the value of x when $y = 50$, use the regression line X on Y.
Therefore, $x - 0.4719(50) = -8.91$
$x - 23.60 = -8.91$
$x = 14.69$
Ex.3. Find the regression line of Y on X and the regression line of X on Y if
          X     Y
A.M.      36    85
S.D.      11    8
r = 0.66
Ans.: Given $\bar{x} = 36$, $\bar{y} = 85$, $\sigma_x = 11$, $\sigma_y = 8$, $r = 0.66$
Now,
Regression coefficient of Y on X is given by
$$b_{yx} = r \cdot \frac{\sigma_y}{\sigma_x} = 0.66 \times \frac{8}{11} = 0.48$$
And regression coefficient of X on Y is given by
$$b_{xy} = r \cdot \frac{\sigma_x}{\sigma_y} = 0.66 \times \frac{11}{8} = 0.9075 \approx 0.91$$
Hence,
Equation of regression line Y on X is
$y - \bar{y} = b_{yx}\,(x - \bar{x})$
$y - 85 = 0.48\,(x - 36)$
$y - 85 = 0.48x - 17.28$
$y - 0.48x = 67.72$
Equation of regression line X on Y is
$x - \bar{x} = b_{xy}\,(y - \bar{y})$
$x - 36 = 0.91\,(y - 85)$
$x - 36 = 0.91y - 77.35$
$x - 0.91y = -41.35$
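The calculation in Ex.3 can be checked with a short program. The following is a minimal sketch, not part of the original text; the function name `regression_lines` and its argument names are assumptions chosen for illustration. It uses only the two formulas for the regression coefficients and the fact that both lines pass through the means. Because it keeps the coefficients unrounded, its intercepts can differ slightly from a hand calculation that rounds the coefficients first.

```python
def regression_lines(mean_x, mean_y, sd_x, sd_y, r):
    """Return slope and intercept of both regression lines.

    Y on X:  y = a_yx + b_yx * x
    X on Y:  x = a_xy + b_xy * y
    """
    b_yx = r * sd_y / sd_x          # regression coefficient of Y on X
    b_xy = r * sd_x / sd_y          # regression coefficient of X on Y
    a_yx = mean_y - b_yx * mean_x   # both lines pass through (mean_x, mean_y)
    a_xy = mean_x - b_xy * mean_y
    return b_yx, a_yx, b_xy, a_xy


# Summary values taken from Ex.3
b_yx, a_yx, b_xy, a_xy = regression_lines(36, 85, 11, 8, 0.66)
print(f"Y on X: y = {a_yx:.2f} + {b_yx:.4f} x")
print(f"X on Y: x = {a_xy:.2f} + {b_xy:.4f} y")
print(f"b_yx * b_xy = {b_yx * b_xy:.4f} (should equal r squared)")
```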
Note:
Suppose we are given the equations of two regression lines and it is not stated which one is the regression equation of Y on X and which is that of X on Y.
In such a case, assume that the first equation is Y on X and then calculate the regression coefficients $b_{yx}$ and $b_{xy}$.
If these two values satisfy the property of regression coefficients,
$$r^2 = b_{xy} \cdot b_{yx} \le 1$$
then our assumption is correct.
Otherwise, interchange the two equations.
Ex.4. Given the two line regression as;
8𝑥 – 10𝑦 + 66 = 0
40𝑥 – 18𝑦 – 214 = 0
Find average of x and y as well as correlation coefficient between x and y.
Ans.:
Part A:
Solve two equations simultaneously to find average of X and Y.
8𝑥 – 10𝑦 = − 66 ……….. (1)
40𝑥 – 18𝑦 = 214 …..….. (2)
Multiply equation (1) by 5 and subtract equation (2) from it.
$40x - 50y = -330$
$-(40x - 18y = 214)$
$-32y = -544$
$y = 17$
Now, substitute $y = 17$ in equation (1).
$8x - 10(17) = -66$
$8x - 170 = -66$
$8x = 104$
$x = 13$
Therefore, $\bar{x} = 13$ and $\bar{y} = 17$
Part B: to find correlation coefficient, we have two lines of regression but which line is
regression line Y on X and vice-versa is not known.
Let’s assume that
8𝑥 – 10𝑦 = − 66 is regression line Y on X
40𝑥 – 18𝑦 = 214 is regression line X on Y
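Now,
Check whether the stated assumption is correct or not.
Writing the first line as $y = 0.8x + 6.6$ gives $b_{yx} = \frac{8}{10}$, and writing the second line as $x = 0.45y + 5.35$ gives $b_{xy} = \frac{18}{40} = \frac{9}{20}$.
Check 1: Signs: both regression coefficients are positive.
Check 2: Product of the two regression coefficients:
$$b_{yx} \cdot b_{xy} = \frac{8}{10} \times \frac{9}{20} = \frac{72}{200} = 0.36 < 1$$
Hence, our assumption is correct.
Now,
$$r = \pm\sqrt{b_{xy} \cdot b_{yx}} = \pm\sqrt{0.36} = \pm 0.6$$
But the sign of the correlation coefficient is the same as the sign of both regression coefficients. So, $r = +0.6$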
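The procedure described in the Note and applied in Ex.4 can be sketched in code. This is an illustrative sketch only: the function name `analyse_regression_pair` and the convention of passing each line as a tuple (a, b, c) for a·x + b·y + c = 0 are assumptions made here, not notation from the text. It solves the two equations for the means, assumes the first line is Y on X, and interchanges the lines if the product of the coefficients exceeds 1.

```python
import math


def analyse_regression_pair(line1, line2):
    """Each line is (a, b, c), meaning a*x + b*y + c = 0.

    Returns (mean_x, mean_y, r). Assumes the two lines are a valid pair of
    regression lines, so both regression coefficients have the same sign.
    """
    a1, b1, c1 = line1
    a2, b2, c2 = line2

    # The lines intersect at the means: solve the 2x2 system by Cramer's rule.
    det = a1 * b2 - a2 * b1
    mean_x = (b1 * c2 - b2 * c1) / det
    mean_y = (a2 * c1 - a1 * c2) / det

    # Assume line1 is Y on X (solve for y) and line2 is X on Y (solve for x).
    b_yx, b_xy = -a1 / b1, -b2 / a2
    if b_yx * b_xy > 1:
        # Assumption violated: interchange the roles of the two lines.
        b_yx, b_xy = -a2 / b2, -b1 / a1

    # r takes the common sign of the regression coefficients.
    r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)
    return mean_x, mean_y, r


# The two lines of Ex.4: 8x - 10y + 66 = 0 and 40x - 18y - 214 = 0
print(analyse_regression_pair((8, -10, 66), (40, -18, -214)))
```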
Exercise
1. What is regression analysis? Explain the concepts of regression.
2. Comment on the properties of regression coefficient and lines.
3. Write a note on different methods to find regression coefficient.
4. For a certain data set of two variables, the regression equations are $6x + y - 31 = 0$ and $3x + 2y - 26 = 0$. Find the means of x and y as well as the coefficient of correlation r.
5. From given regression equations calculate coefficient correlation and 𝜎𝑦 2 ; where 𝜎𝑥 2 = 0.9
Productivity index 68 60 62 80 65 40 52 62 60 81
8. From following data find the two regression lines.
X 7 6 10 14 13
Y 22 18 20 26 24
9. Calculate two lines of regression.
X 7 6 10 14 13
Y 22 18 20 26 24
10. Following data gives values of X and Y.
X 16 12 18 14 12 10 15 12
Y 87 88 89 86 87 80 85 83
Introduction:
Recall that we typically cannot take a census of the entire population of interest, so we take a sample from that population in order to make estimates and draw conclusions about the population.
A quantity computed from the values in a sample is called a statistic. Values such as the mean $\bar{x}$, the standard deviation or the proportion of individuals in a sample are statistics, and they vary from sample to sample of a population. This variability is called sampling variability.
For example, the average age at which a child learned to walk for one sample of 10 children would generally differ from the average age for a different sample of 10 children.
Sampling Distribution:
The statistics commonly computed from a sample are the sample proportion ($\hat{p}$) and the sample mean ($\bar{x}$); both are random variables because they vary from sample to sample. The distribution of such a statistic is called its sampling distribution. A sampling distribution is the probability distribution of a statistic obtained from a large number of samples drawn from a specific population, i.e. the distribution of the values the statistic takes over all possible samples of a fixed size.
For example, to find the sampling distribution of the mean GPAT score for all graduate students in a given year, take repeated random samples of graduate students from the general population of students and compute the average test score for each sample. The distribution of those sample means is the sampling distribution of the average GPAT score.
The variability of a sampling distribution is measured by its variance or standard deviation.
Sampling Error:
Most of the time, the value of a statistic calculated from a sample is used as the value for the population from which the sample was drawn. In general, however, there is some difference between the value calculated from the sample and the corresponding population value. This difference is called sampling error.
Sampling is the selection of a specific number of observations from a larger population for analysis, and the selection itself introduces error. Sampling error arises because a sample, however carefully chosen, does not represent the entire population exactly; as a result, the values obtained from the sample differ from those that would be obtained from the entire population. Sampling error cannot be eliminated completely, but it can be reduced by selecting a sufficiently large sample and by ensuring that it represents the entire population.
Standard Error of the Mean:
The variability of a sampling distribution is measured by its variance or standard deviation. The standard deviation of the sample means is a measure of sampling error known as the standard error, or standard error of the mean (SEM). It is a measure of uncertainty.
The standard error of the mean is the standard deviation of the sampling distribution of the sample mean. It measures how accurately a sample represents the population: sample means deviate from the actual population mean, and the SEM describes the typical size of this deviation. The standard error is inversely proportional to the square root of the sample size, so the larger the sample size, the smaller the standard error.
For example, for an upcoming national election, 2000 voters are chosen at random and
asked if they will vote for candidate A or candidate B. Out of the 2000 voters, 1040 (52%) state that
they will vote for candidate A. The researchers report that candidate A is expected to receive 52%
of the final vote, with a margin of error of 2%.
In this situation, the 2000 voters are a sample from all the actual voters. The sample
proportion of 52% is an estimate of the true proportion who will vote for candidate A in the actual
election. The margin of error of 2% is a quantitative measure of the uncertainty – the possible
difference between the true proportion who will vote for candidate A and the estimate of 52%.
Significance of SEM:
The standard error of the mean (SEM) estimates the variability of sample means when the samples are selected from the same population, while the standard deviation measures the variability within a single sample. The SEM is used to determine how precisely the sample mean estimates the population mean. A smaller SEM indicates a more precise estimate of the population mean, and a larger sample size results in a smaller SEM.
The standard error of mean is also used to calculate the confidence interval, which is a
range of values likely to include the population mean.
For example, a medical research team tests a new drug to lower cholesterol. They report
that, in a sample of 400 patients, the new drug lowers cholesterol by an average of 20 units
(mg/dL). The 95% confidence interval for the average effect of the drug is that it lowers cholesterol
by 18 to 22 units.
In this situation, the 400 patients are a sample of all patients who may be treated with the
drug. The confidence interval of 18 to 22 is a quantitative measure of the uncertainty – the possible
difference between the true average effect of the drug and the estimate of 20 mg/dL.
a) Standard error of the mean for one sample:
For a sample of size n and standard deviation σ, the SEM is given by
$$SEM = \frac{\sigma}{\sqrt{n}}$$
where σ = S.D. of sample
$$SEM_{(\bar{x}_1 - \bar{x}_2)} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
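As a small illustration of the two standard error formulas, the sketch below shows how the SEM shrinks as the sample size grows. The function names `sem_one_sample` and `sem_two_samples` and the numerical values (an S.D. of 10 with samples of 25 and 100) are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def sem_one_sample(sd, n):
    """Standard error of the mean for one sample: sd / sqrt(n)."""
    return sd / math.sqrt(n)


def sem_two_samples(sd1, n1, sd2, n2):
    """Standard error of the difference between two sample means."""
    return math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)


# Hypothetical values: the same S.D. with a sample four times as large
# gives a standard error half as big.
print(sem_one_sample(10, 25))
print(sem_one_sample(10, 100))
print(sem_two_samples(10, 25, 10, 100))
```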
Statistical Inference:
Statistical inference is the process through which inferences about a population are made
based on certain statistics calculated from a sample of data drawn from that population. It is
important in order to analyze data properly. Indeed, proper data analysis is necessary to interpret
research results and to draw appropriate conclusions. In this chapter, three basic statistical
concepts are presented: estimation, confidence interval, and P-value, and these concepts are
applied to the comparisons of proportions, means.
Statistical inference is the act of generalisation or estimation about the larger sized
population from the sample information.
There are two common types of statistical inference:
Estimation
Testing of Hypothesis
Estimation:
Statistical estimation is concerned with the method by which population characteristics are
estimated from sample information. The objective of estimation is to approximate the value of a
population parameter on the basis of a sample statistic i.e. when the value of parameter is unknown
then it can be estimated on the basis of a random sample.
For example, the sample mean 𝑥 is used to estimate the population mean 𝜇.
There are two types of estimation:
(1) Point Estimation (likely value for parameter)
(2) Interval Estimation (also called confidence interval for parameter)
(1) Point Estimates:
A point estimator of an unknown parameter of a population is a single value or point of
sample statistic. It is always provided with its standard error which is a measure of uncertainty
associated with estimation process.
For example, the sample mean 𝑥 is point estimate of the population mean 𝜇.
Characteristics of Estimators:
The desirability of an estimator is judged by its characteristics. There are three important
criteria:
(i) Unbiasedness:
An unbiased estimator of a population parameter is an estimator whose expected value is equal to the parameter value, i.e. an estimator $\hat{\theta}$ of a population parameter $\theta$ is said to be unbiased if
$$E(\hat{\theta}) = \theta$$
For example, the sample mean $\bar{x}$ is an unbiased estimator of the population mean $\mu$, since
$$E(\bar{x}) = \mu$$
(ii) Consistency:
An estimator is said to be consistent if the difference between the estimator and the target population parameter becomes smaller as we increase the sample size. In particular, an unbiased estimator $\hat{\theta}$ of a parameter $\theta$ is consistent if
$$V(\hat{\theta}) \to 0 \text{ as } n \to \infty$$
Note that unbiasedness together with a vanishing variance is a sufficient condition for consistency, not a necessary one; a biased estimator can also be consistent if its bias shrinks to zero as n increases.
For example, the variance of the sample mean $\bar{x}$ is $\sigma^2 / n$, which decreases to zero as we increase the sample size n.
(iii) Efficiency:
If we are given two unbiased estimators for a population parameter then the estimator with
a smaller variance is more efficient.
For example, for a normally distributed population, it can be shown that the sample median is an unbiased estimator for $\mu$. It can also be shown, however, that the sample median has a greater variance than that of the sample mean, for the same sample size. Hence $\bar{x}$ is a more efficient estimator than the sample median.
Confidence Interval:
In statistics, a confidence interval is used to describe the amount of uncertainty associated with a sample estimate of a population parameter. It is an interval estimate combined with a probability statement: an interval within which the true value of the population parameter is expected to lie with a stated level of confidence. The width of the confidence interval depends on the properties of the population and on the level of confidence chosen.
Here we construct confidence intervals for the population proportion (p) and the population mean ($\mu$), using the sample proportion ($\hat{p}$) and the sample mean ($\bar{x}$) as point estimates, which form the centre of the confidence interval. The width of the confidence interval depends on two things:
i. Level of confidence
ii. Standard error
The confidence level refers to the percentage of all possible samples whose confidence intervals can be expected to include the true population parameter. For example, a 95% confidence level implies that 95% of the confidence intervals constructed in this way would include the true population parameter. The confidence level determines the multiplier in the confidence interval.
The general form of a confidence interval is
$$\text{point estimate} \pm \text{multiplier} \times \text{standard error}$$
a) Confidence Intervals for Population Proportions (p):
It is constructed by taking the sample proportion ($\hat{p}$) as the point estimate, with sample size n and the standard error of the sample proportion. Here z is the multiplier for the chosen level of confidence, found from the z table.
$$\text{confidence interval for } p = \hat{p} \pm z\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
For example, in a survey, people are asked whether they wear a seatbelt while driving. In a sample of 1356 males, 677 said that they wear a seatbelt while driving.
$$\hat{p} = \frac{677}{1356} = 0.499$$
Let's construct a 95% confidence interval for the population proportion from which the sample proportion was drawn.
To compute the confidence interval we need the z multiplier and the standard error.
From the z table, for the 95% confidence level, the multiplier is z = 1.96.
Hence,
$$SE = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} = \sqrt{\frac{0.499(1 - 0.499)}{1356}} = 0.0136$$
Thus,
$$\text{confidence interval} = 0.499 \pm 1.96\,(0.0136) = 0.499 \pm 0.027 = [0.472, 0.526]$$
i.e. we are 95% confident that the population proportion of males who wear a seatbelt while driving is between 0.472 and 0.526.
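The seatbelt example can be reproduced with a few lines of code. The sketch below is illustrative only; the function name `proportion_ci` is an assumption. Instead of reading z from a table, it obtains the multiplier from the standard normal distribution provided by Python's `statistics.NormalDist`, so the result may differ from a hand calculation in the third or fourth decimal place.

```python
import math
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.95):
    """Confidence interval for a population proportion (normal approximation)."""
    p_hat = successes / n
    # Two-sided multiplier, e.g. about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se


# Seatbelt survey: 677 of 1356 males
low, high = proportion_ci(677, 1356)
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```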
b) Confidence Intervals for Population Mean ($\mu$):
It is constructed by taking the sample mean ($\bar{x}$) as the point estimate, with sample size n and the standard error of the sample mean. Here t is the multiplier for the chosen level of confidence, found from the t table.
$$\text{confidence interval for } \mu = \bar{x} \pm t\,\frac{\sigma}{\sqrt{n}}$$
For example, in a class survey, students are asked how many hours they sleep per night. In a sample of 22 students, the mean was 5.77 hours with a standard deviation of 1.572 hours.
Let's construct a 95% confidence interval for the mean number of hours slept per night in the population from which the sample was drawn.
To compute the confidence interval we need the t multiplier and the standard error.
From the t table, with degrees of freedom 22 – 1 = 21 and for the 95% confidence level, the multiplier is t = 2.08.
Hence,
$$SE = \frac{\sigma}{\sqrt{n}} = \frac{1.572}{\sqrt{22}} = 0.335$$
Thus,
$$\text{confidence interval} = 5.77 \pm 2.08\,(0.335) = 5.77 \pm 0.697 = [5.073, 6.467]$$
i.e. we are 95% confident that the population mean number of hours students sleep per night is between 5.073 hours and 6.467 hours.
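The same calculation can be written as a short function. This is a minimal sketch with an assumed function name, `mean_ci`; because Python's standard library has no t distribution, the t multiplier is passed in as read from the t table rather than computed.

```python
import math


def mean_ci(sample_mean, sd, n, t_multiplier):
    """Confidence interval for a population mean.

    t_multiplier is the tabulated t value for n - 1 degrees of freedom
    at the chosen confidence level (looked up by the user).
    """
    se = sd / math.sqrt(n)
    margin = t_multiplier * se
    return sample_mean - margin, sample_mean + margin


# Sleep survey: n = 22, mean 5.77 h, S.D. 1.572 h, t = 2.08 for 21 df
low, high = mean_ci(5.77, 1.572, 22, 2.08)
print(f"95% CI: [{low:.3f}, {high:.3f}] hours")
```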
P - value:
The p-value is the probability, calculated assuming that the null hypothesis is true, of obtaining a test result at least as extreme as the one actually observed. It measures the strength of the evidence against the null hypothesis and helps to determine the significance of the result. The p-value is a number between 0 and 1 and is interpreted in the following way:
A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis.
A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis.
A p-value very close to the cut-off (0.05) is considered marginal, and the result could go either way.
7. Testing of Hypothesis
Introduction:
In this chapter, we are going to study second type of statistical inference in the form of
hypothesis testing by using various statistical methods and probability distributions; the first one
was confidence intervals. The main purpose of this study is to make decisions and draw inferences
about the available data of population using samples of that population. In pharmaceutical studies,
the purpose is often to demonstrate that “a new drug is effective, or possibly to show that it is more
effective than the existing drug”. While in clinical trials, the purpose is to demonstrate that “the new
drug is better than a placebo control”. In this chapter, we will focus only on numeric outcomes. This chapter also introduces the critical (or rejection) region approach to hypothesis testing and compares it to the p-value approach.
A statistical measure such as mean, standard deviation or variance which describes
population is known as parameter.
Estimation:
The statistical estimation is one of the main objectives or methods of statistics in which
conclusions about a population are drawn and/or decisions are taken from the analysis of the
sample drawn from that population. Statistical inference includes:
1. Estimation theory
2. Tests of hypothesis
3. Non Parametric tests
4. Sequential analysis
In estimation theory, we estimate the unknown value of the population parameter based on
sample observations i.e. the statistical measure calculated from sample is assigned to the
population of that sample.
E.g. suppose we are given a sample of weights of 100 students in a school. So it is possible to
estimate the average weight of all students in that school using the weights of these 100 students.
But, in general, there is difference between the value calculated from sample and the
corresponding value of population. This difference is called Sampling error.
Tests of Hypothesis:
A sample is assumed to be a small representative of the total population. In many experiments, however, the sample does not fully represent the population from which it is selected, so statistical conclusions and estimates about the population may go wrong. In such cases, statements about a population parameter are tested to decide whether the difference between the sample value and the parametric value is due to chance or not. This whole procedure is called testing of hypothesis. In this chapter, we test whether the difference between a sample mean and a population mean is significant, i.e. whether or not the two can be regarded as equal.
A hypothesis, in statistics, is a statement about a population which is supposed to be true
till it is proved to be false. It should be stated before conducting the statistical tests of hypothesis. It
is set up in two ways:
a) Null Hypothesis
b) Alternative Hypothesis
a) Null Hypothesis:
The null hypothesis is the statement which is actually tested for rejection. It is stated under the assumption that “there is no significant difference” between the sample result and the population result. We assume that the null hypothesis is true, although in pharmaceutical research we usually wish to prove it false.
The null hypothesis is generally denoted by H0.
b) Alternative Hypothesis:
When the null hypothesis is rejected, another statement, called the alternative hypothesis, is accepted. It is the research hypothesis, which is generally believed to be true by the researcher.
The alternative hypothesis is generally denoted by Ha or H1.
Note: A hypothesis test has two possible outcomes:
Reject H0 and accept Ha because there is sufficient evidence in favour of Ha.
Fail to reject H0 (loosely, “accept H0”) because there is insufficient evidence to support Ha.
Failure to reject H0 does not mean that the null hypothesis is true. It only means that we do not have sufficient evidence to support H1.
Elements necessary for Hypothesis testing:
1. Level of significance:
The level of significance, denoted by $\alpha$, is the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference.
The p-value, by contrast, measures the strength of the evidence against the null hypothesis and is compared with $\alpha$: if the p-value is less than $\alpha$, reject the null hypothesis; otherwise do not reject it. Equivalently, the calculated value of the test statistic is compared with the tabulated (critical) value at the $\alpha$ level of significance: if the calculated value does not exceed the tabulated value, accept the null hypothesis; otherwise reject it.
2. Region of Acceptance and Rejection:
The region of acceptance is the range of values of the test statistic which leads to acceptance of the null hypothesis, while the set of values for which the null hypothesis is rejected is called the region of rejection. The region of rejection is also known as the critical region. The values which separate the critical region from the region of acceptance are called critical values.
3. Power of test:
The power of a statistical test is the probability of rejecting the null hypothesis when it is false. It ranges from 0 to 1. Thus, power is the ability of a test to correctly reject the null hypothesis. Although a hypothesis test can be conducted without it, calculating the power of a test beforehand helps to ensure that the sample size is large enough for the purpose of the test.
Types of error:
No hypothesis test is 100% correct, as it is based on probabilities. At the time of testing of
hypothesis, null hypothesis may be accepted or may be rejected. Depending on this acceptance or
rejection there are two types of errors:
a) Type one (I) error
b) Type two (II) error
a) Type I error:
Rejection of the null hypothesis when it is true is called a type I error. The probability of committing a type I error is $\alpha$, i.e. the level of significance, which is set before testing the hypothesis. An $\alpha$ of 0.05 means there is a 5% chance of rejecting the null hypothesis when it is in fact true. To decrease the chance of this error, use a lower value of $\alpha$. However, a lower value of $\alpha$ makes the test less likely to detect a true difference if one exists.
b) Type II error:
Accepting the null hypothesis when it is false is called a type II error. The probability of making a type II error is $\beta$; the power of the test is $1 - \beta$. To decrease the chance of this error, ensure that the test has enough power, i.e. the sample size should be large enough to detect a practical difference if one exists.
One Tailed test and Two Tailed test:
In statistical hypothesis testing, we need to judge whether the test is one-tailed or two-tailed so that we can find the critical values in tables such as the standard normal z distribution table and the t distribution table. The z curve and the t curve are bell shaped and symmetric about the vertical axis. Each end of the curve is a tail, which forms the rejection area for the hypothesis.
One-tailed test: A test of a statistical hypothesis, where the region of rejection is on only one side of the sampling distribution, is called a one-tailed test.
Two-tailed test: A test of a statistical hypothesis, where the region of rejection is on both sides of the sampling distribution, is called a two-tailed test.
Steps to perform Hypothesis testing:
All hypothesis tests are conducted in the same way.
1) State the hypotheses.
2) Formulate an analysis plan: Choose significance level and test statistic
z-test:
A z-test is a statistical test used to determine whether two population means are different
when the variances are known and the sample size is large. The test statistic is assumed to have
normal distribution and standard deviation should be known for an accurate z-test to be
performed. It is basically used for dealing with problems relating to large samples when 𝑛 ≥ 30.
There are different types of z-test each for different purpose. Some of the popular types are
outlined below:
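a) For comparing two proportions:
Let $p_1$ and $p_2$ be two sample proportions with the respective sample sizes $n_1$ and $n_2$.
To test:
$$H_0: p_1 = p_2$$
Against:
$$H_1: p_1 \neq p_2$$
Test statistic:
$$z = \frac{p_1 - p_2}{\sqrt{p(1 - p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}, \qquad p = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}$$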
Here, $p_1 = \frac{400}{600}$, $n_1 = 600$ and $p_2 = \frac{600}{1000}$, $n_2 = 1000$
To test:
$$H_0: p_1 = p_2$$
Against:
$$H_1: p_1 \neq p_2$$
Test statistic:
$$z = \frac{p_1 - p_2}{\sqrt{p(1 - p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
Where
$$p = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{600 \times \frac{400}{600} + 1000 \times \frac{600}{1000}}{600 + 1000} = \frac{400 + 600}{1600} = 0.625$$
Hence,
$$z = \frac{0.6667 - 0.6}{\sqrt{0.625 \times 0.375 \left(\frac{1}{600} + \frac{1}{1000}\right)}} = \frac{0.0667}{0.025} = 2.67$$
Cal z = 2.67
At 5% level, Tab z = 1.96
Cal z > Tab z,
So reject $H_0$.
Therefore, there is a significant difference in the percentage of literacy of the two cities.
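A hand calculation of this kind is easy to check by computer. The following sketch is illustrative and not part of the original text: the function name `two_proportion_z` is an assumption, the pooled proportion is computed from the raw counts, and the two-sided p-value is taken from the standard normal distribution in Python's `statistics` module.

```python
import math
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2):
    """z statistic and two-sided p-value for H0: p1 = p2 (pooled proportion)."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                      # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Counts used in the worked example: 400 of 600 and 600 of 1000
z, p_value = two_proportion_z(400, 600, 600, 1000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
print("reject H0" if abs(z) > 1.96 else "accept H0")
```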
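b) For one sample mean:
Let $\bar{x}$ be the sample mean and $\mu$ be the population mean, with $\sigma$ as standard deviation and $n$ as sample size.
To test:
$$H_0: \bar{x} = \mu$$
Against:
$$H_1: \bar{x} \neq \mu$$
Test statistic:
$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$
Where,
$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$$
At $\alpha$% level,
If $|\text{Cal } z| \le \text{Tab } z$, then accept $H_0$.
Otherwise, reject $H_0$.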
Ex.1. The mean plasma potassium level of 50 adult males with a certain disease was found to be 3.356 mEq/litre and the S.D. was 0.5 mEq/litre. The normal adult value of plasma potassium is 4.6 mEq/litre. Based on the above data, can it be concluded that males with the disease have a lower plasma potassium level than the normal level? (Given at 5%, $z = 1.96$)
Ans: Here, $\bar{x} = 3.356$, $\sigma = 0.5$, $n = 50$
To test:
$$H_0: \mu = 4.6$$
Against:
$$H_1: \mu \neq 4.6$$
Test statistic:
$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{3.356 - 4.6}{0.5 / \sqrt{50}} = \frac{-1.244}{0.0707} = -17.59$$
$|\text{Cal } z| = 17.59$
At 5% level, Tab z = 1.96
$|\text{Cal } z| > \text{Tab } z$,
Hence, reject $H_0$.
Thus, since the sample mean lies below 4.6, males with the disease have a significantly lower plasma potassium level than normal males.
Ex.2. A machine produces metal plates of thickness 1.5 cm with S.D. 0.2 cm. A sample of 100 plates produced by the machine has an average thickness of 1.52 cm. Is the machine fulfilling the purpose for which it is designed?
To test:
$$H_0: \mu = 1.5$$
Against:
$$H_1: \mu \neq 1.5$$
Test statistic:
$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{1.52 - 1.5}{0.2 / \sqrt{100}} = \frac{0.02}{0.02} = 1$$
At 5% level, Tab z = 1.96
Cal z ≤ Tab z,
Hence, accept $H_0$.
Thus, the machine is fulfilling its purpose.
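The one-sample z statistic used in these examples can be sketched as a small function. The name `one_sample_z` is an assumption made for illustration; the function simply evaluates the test statistic, and the comparison with the tabulated value 1.96 is done on the absolute value so that a negative z is handled correctly.

```python
import math


def one_sample_z(sample_mean, pop_mean, sd, n):
    """z statistic for testing a sample mean against a population mean."""
    return (sample_mean - pop_mean) / (sd / math.sqrt(n))


TAB_Z = 1.96  # tabulated two-tailed value at the 5% level

for label, z in [
    ("plasma potassium", one_sample_z(3.356, 4.6, 0.5, 50)),
    ("plate thickness", one_sample_z(1.52, 1.5, 0.2, 100)),
]:
    decision = "reject H0" if abs(z) > TAB_Z else "accept H0"
    print(f"{label}: z = {z:.3f}, {decision}")
```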
Ex.3. Six bottles from a batch of suspension were assayed for paracetamol content by a spectrophotometric method. Each 5 ml of suspension contains 500, 503, 509, 515, 502 and 507 mg of paracetamol. Test the hypothesis that the average content of paracetamol is 505 mg.
Ans:
$$\bar{x} = \frac{\sum x_i}{n} = \frac{500 + 503 + 509 + 515 + 502 + 507}{6} = 506$$
To test:
$$H_0: \mu = 505$$
Against:
$$H_1: \mu \neq 505$$
Test statistic:
$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} \quad \text{where} \quad \sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$$
To find S.D.,
x_i      (x_i − x̄)    (x_i − x̄)²
500      -6           36
503      -3           9
509      3            9
515      9            81
502      -4           16
507      1            1
Total                 152
$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}} = \sqrt{\frac{152}{6}} = 5.033$$
Hence,
$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{506 - 505}{5.033 / \sqrt{6}} = 0.4866$$
At 5% level, Tab z = 1.96
Cal z ≤ Tab z,
So accept $H_0$.
c) For two sample means:
Let $\bar{x}_1$ and $\bar{x}_2$ be two sample means with standard deviations $\sigma_1$ and $\sigma_2$ and sample sizes $n_1$ and $n_2$ respectively.
To test:
$$H_0: \bar{x}_1 = \bar{x}_2$$
Against:
$$H_1: \bar{x}_1 \neq \bar{x}_2$$
Test statistic:
$$z = \frac{\bar{x}_1 - \bar{x}_2}{S.E.}$$
Where, Standard Error is
$$S.E. = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
At $\alpha$% level,
If Cal z ≤ Tab z, then accept $H_0$.
Otherwise, reject $H_0$.
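Ex.1. Random samples drawn from two places gave the following data relating to the wing length of anopheles mosquitoes. Test at the 5% level that the mean wing length is the same for mosquitoes at the two places.
(Given $z = 1.96$)
         Place ‘A’    Place ‘B’
Mean     3.60         3.58
S.D.     1.8          1.6
Size     50           50
Ans.: Here, $\bar{x}_1 = 3.60$, $\bar{x}_2 = 3.58$
$\sigma_1 = 1.8$, $\sigma_2 = 1.6$
$n_1 = 50$, $n_2 = 50$
To test:
$$H_0: \bar{x}_1 = \bar{x}_2$$
Against:
$$H_1: \bar{x}_1 \neq \bar{x}_2$$
Test statistic:
$$z = \frac{\bar{x}_1 - \bar{x}_2}{S.E.}$$
Where,
$$S.E. = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = \sqrt{\frac{(1.8)^2}{50} + \frac{(1.6)^2}{50}} = \sqrt{0.116} = 0.34$$
Hence,
$$z = \frac{3.60 - 3.58}{0.34} = 0.059$$
At 5% level, Tab z = 1.96
Cal z ≤ Tab z,
So accept $H_0$.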
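Ex.2. In two groups of infants of 6 months of age the following values were observed:
Group    No. of infants    Mean weight    S.D.
1        100               6.9 kg         1.10 kg
2        169               7.3 kg         0.91 kg
Test whether the mean weights are significantly different at the 5% level.
Test statistic:
$$z = \frac{\bar{x}_1 - \bar{x}_2}{S.E.}$$
Where,
$$S.E. = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = \sqrt{\frac{(1.10)^2}{100} + \frac{(0.91)^2}{169}} = \sqrt{0.0121 + 0.0049} = 0.13$$
Hence,
$$z = \frac{6.9 - 7.3}{0.13} = -3.077$$
$|\text{Cal } z| = 3.077$
At 5% level, Tab z = 1.96
$|\text{Cal } z| > \text{Tab } z$,
So reject $H_0$.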
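The two-sample z statistic for the mosquito and infant examples can be computed as follows. This is a sketch under the assumption that the function name `two_sample_z` and its argument order are acceptable stand-ins for the notation of the text; it does not round the standard error, so the printed z may differ slightly from a hand calculation that rounds S.E. to two decimals.

```python
import math


def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """z statistic for comparing two sample means (large samples)."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / se


TAB_Z = 1.96  # tabulated two-tailed value at the 5% level

z_wings = two_sample_z(3.60, 1.8, 50, 3.58, 1.6, 50)       # wing lengths
z_infants = two_sample_z(6.9, 1.10, 100, 7.3, 0.91, 169)   # infant weights

for label, z in [("wing length", z_wings), ("infant weight", z_infants)]:
    decision = "reject H0" if abs(z) > TAB_Z else "accept H0"
    print(f"{label}: z = {z:.3f}, {decision}")
```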
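$$H_0: \bar{x} = \mu$$
Against:
$$H_1: \bar{x} \neq \mu$$
Test statistic:
$$t = \frac{\bar{x} - \mu}{\sigma / \sqrt{n - 1}}$$
Where,
$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$$
is the S.D. of the sample and $n - 1$ is the degrees of freedom. (If the S.D. is instead computed with divisor $n - 1$, use $\sigma / \sqrt{n}$ in the denominator; the two forms give the same t.)
At $\alpha$% level,
If $|\text{Cal } t| \le \text{Tab } t$, then accept $H_0$.
Otherwise, reject $H_0$.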
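Ex.1. A random sample of 20 sachets of powder containing a certain drug gives a mean API content of 42 mg and an S.D. of 6 mg. Test the hypothesis that the population mean is 44 mg.
(Given at 5% level and 19 d.f., Tab t = 2.093)
Ans.: Given $n = 20$, $\bar{x} = 42$ mg, $\sigma = 6$ mg
To test:
$$H_0: \mu = 44 \text{ mg}$$
Against:
$$H_1: \mu \neq 44 \text{ mg}$$
Test statistic:
$$t = \frac{\bar{x} - \mu}{\sigma / \sqrt{n - 1}} = \frac{42 - 44}{6 / \sqrt{19}} = \frac{-2}{1.3765} = -1.453$$
$|\text{Cal } t| = 1.453$
At 5% level and 19 d.f., Tab t = 2.093
$|\text{Cal } t| \le \text{Tab } t$,
Hence accept $H_0$.
Thus, the sample gives no significant evidence that the population mean differs from 44 mg.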
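The sachet example can be checked with the sketch below. The function name `one_sample_t` is an assumption; the function follows the form of the statistic used in the example, with the sample S.D. divided by the square root of n − 1, and the tabulated t value is supplied by the user from the t table.

```python
import math


def one_sample_t(sample_mean, pop_mean, sd, n):
    """t statistic with the sample S.D. (divisor n) and sqrt(n - 1)."""
    return (sample_mean - pop_mean) / (sd / math.sqrt(n - 1))


# Sachet example: n = 20, mean 42 mg, S.D. 6 mg, hypothesised mean 44 mg
t = one_sample_t(42, 44, 6, 20)
tab_t = 2.093  # given in the example for 19 d.f. at the 5% level
print(f"t = {t:.3f}")
print("reject H0" if abs(t) > tab_t else "accept H0")
```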
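b) t-test for small sample size ($n < 30$) to compare equality of two sample means:
Let $\bar{x}_1$ and $\bar{x}_2$ be two sample means of sizes $n_1$ and $n_2$ with standard deviations $\sigma_1$ and $\sigma_2$ respectively.
To test:
$$H_0: \bar{x}_1 = \bar{x}_2$$
Against:
$$H_1: \bar{x}_1 \neq \bar{x}_2$$
Test statistic:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \qquad \sigma = \sqrt{\frac{n_1\sigma_1^2 + n_2\sigma_2^2}{n_1 + n_2 - 2}}$$
with $n_1 + n_2 - 2$ degrees of freedom.
At $\alpha$% level,
If $|\text{Cal } t| \le \text{Tab } t$, then accept $H_0$.
Otherwise, reject $H_0$.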
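Ex.2. A laboratory checked the shelf life of a formulation (generic product) obtained from two different manufacturers. The data are given in the table. Check whether there is any significant difference between the same kind of product from the two manufacturers.
(Given at 5% level and for 25 df, Tab t = 2.06)
Product    Mean         S.D.    Sample size
Brand X    2000 days    250     12
Brand Y    2230 days    300     15
Test statistic:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$
Where,
$$\sigma = \sqrt{\frac{n_1\sigma_1^2 + n_2\sigma_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{12(250)^2 + 15(300)^2}{12 + 15 - 2}} = \sqrt{84000} = 289.827$$
Hence,
$$t = \frac{2000 - 2230}{289.827\sqrt{\frac{1}{12} + \frac{1}{15}}} = \frac{-230}{112.25} = -2.049$$
$|\text{Cal } t| = 2.049$
At 5% level and 25 df, Tab t = 2.06
$|\text{Cal } t| \le \text{Tab } t$,
So accept $H_0$.
Thus, there is no significant difference in shelf life between the two brands.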
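The pooled calculation for the two brands can be sketched as below. The function name `pooled_t` is an assumption for illustration. It follows the form used in the example, in which the sample standard deviations are weighted by the sample sizes and divided by n1 + n2 − 2, and it returns the degrees of freedom so that the right table entry can be looked up.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """t statistic for two small samples using a pooled standard deviation.

    sd1 and sd2 are sample S.D.s computed with divisor n, as in the text.
    Returns (t, degrees_of_freedom).
    """
    df = n1 + n2 - 2
    pooled_sd = math.sqrt((n1 * sd1 ** 2 + n2 * sd2 ** 2) / df)
    se = pooled_sd * math.sqrt(1 / n1 + 1 / n2)
    return (mean1 - mean2) / se, df


# Brand X: mean 2000, S.D. 250, n = 12; Brand Y: mean 2230, S.D. 300, n = 15
t, df = pooled_t(2000, 250, 12, 2230, 300, 15)
tab_t = 2.06  # given in the example for 25 df at the 5% level
print(f"t = {t:.4f} with {df} df")
print("reject H0" if abs(t) > tab_t else "accept H0")
```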
$$\sigma = \sqrt{\frac{n\sum d^2 - \left(\sum d\right)^2}{n(n - 1)}}$$
$n - 1$ is the degrees of freedom.
At $\alpha$% level,
If $|\text{Cal } t| \le \text{Tab } t$, then accept $H_0$.
Otherwise, reject $H_0$.
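Ex.3. A test of 150 marks was taken before and after training by newly joined candidates in the production department. The table contains the marks obtained by the candidates. Test whether there is any change in the candidates' scores after training.
(Given at 5% level and 4 df, Tab t = 4.6)
Candidate                         A      B      C      D      E
Marks obtained before training    110    120    123    132    125
Marks obtained after training     120    118    125    136    121
Ans.: Taking d = (marks after) − (marks before), the differences are 10, −2, 2, 4 and −4, so $n = 5$, $\sum d = 10$, $\sum d^2 = 140$ and $\bar{d} = \frac{\sum d}{n} = 2$.
$$\sigma = \sqrt{\frac{n\sum d^2 - \left(\sum d\right)^2}{n(n - 1)}} = \sqrt{\frac{5(140) - (10)^2}{5(5 - 1)}} = \sqrt{30} = 5.477$$
Hence,
$$t = \frac{\bar{d}}{\sigma / \sqrt{n}} = \frac{2}{5.477 / \sqrt{5}} = \frac{2}{2.449} = 0.816$$
At 5% level, 4 df, Tab t = 4.6
Cal t ≤ Tab t,
So, accept $H_0$
Thus, training has not shown any significant effect on scores. (The usual two-tailed 5% table value for 4 df is 2.776; the conclusion is unchanged.)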
Ex.4. A fertilizer was tested for its effect on the yield of rice grown in 10 plots. Another set of
10 plots of similar size and condition was taken as control. Test the effect of the fertilizer.
(Given for 9 df, 𝑡0.05 = 2.10)
Fertilizer applied 16 14 18 15 13 17 16 15 14 13
Fertilizer Not applied 10 12 11 9 13 13 12 14 13 11
Ans.: Prepare the following table.
x y 𝒅 = (𝒙 − 𝒚) 𝒅𝟐
16 10 6 36
14 12 2 4
18 11 7 49
15 9 6 36
13 13 0 0
17 13 4 16
16 12 4 16
15 14 1 1
14 13 1 1
13 11 2 4
Total - Σ𝑑 = 33 Σ𝑑2 = 163
To test:
𝐻0 : 𝑥 = 𝑦
Against:
𝐻1 : 𝑥 ≠ 𝑦
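$$\sigma = \sqrt{\frac{n \sum d^2 - \left(\sum d\right)^2}{n(n - 1)}} = \sqrt{\frac{10(163) - (33)^2}{10(10 - 1)}} = \sqrt{6.01} \approx 2.45$$
Hence, with mean difference $\bar{d} = 33/10 = 3.3$,
$$t = \frac{\bar{d}}{\sigma / \sqrt{n}} = \frac{3.3}{2.45 / \sqrt{10}} = \frac{3.3}{0.775} \approx 4.26$$
At 5% level, 9 df, 𝑇𝑎𝑏 𝑡 = 2.10
𝐶𝑎𝑙 𝑡 > 𝑇𝑎𝑏 𝑡 ,
So, reject 𝐻0
Thus, the fertilizer has a significant effect on the yield of rice.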
F-test:
The F-test is used to compare the equality of two variances by taking their ratio. The larger
variance is placed in the numerator so that the test becomes a right-tailed test, which is easier
to evaluate from tables.
To calculate F value:
Let 𝜎1 2 and 𝜎2 2 be the two sample variances with sample sizes 𝑛1 and 𝑛2 respectively.
To test:
𝐻0 : 𝜎1 2 = 𝜎2 2
Against:
𝐻1 : 𝜎1 2 ≠ 𝜎2 2 𝑜𝑟 𝜎1 2 > 𝜎2 2 𝑜𝑟 𝜎1 2 < 𝜎2 2
Test statistic: F-test
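$$F = \frac{\sigma_1^2}{\sigma_2^2} \quad \text{for } \sigma_1^2 > \sigma_2^2$$
$$F = \frac{\sigma_2^2}{\sigma_1^2} \quad \text{for } \sigma_1^2 < \sigma_2^2$$
Where,
$$\sigma_1^2 = \frac{\sum (x_i - \bar{x})^2}{n_1 - 1} \quad \text{and} \quad \sigma_2^2 = \frac{\sum (y_i - \bar{y})^2}{n_2 - 1}$$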
At 𝛼% level,
If 𝐶𝑎𝑙 𝐹 ≤ 𝑇𝑎𝑏 𝐹 , then Accept 𝐻0 .
Otherwise, reject 𝐻0 .
Ex.1. Two samples are drawn from two populations. From the data given below, test whether the
two samples have the same variance at 5% level of significance. (Given at 5% F8,7 = 3.76)
total
Sample I (x) 60 65 71 74 76 82 85 87 600
Sample II (y) 61 66 67 85 78 63 85 88 91 684
Ans: To test:
𝐻0 : 𝜎1 2 = 𝜎2 2
Against:
𝐻1 : 𝜎1 2 ≠ 𝜎2 2
Here,
$$\bar{x} = \frac{\sum x_i}{n_1} = \frac{600}{8} = 75$$
$$\bar{y} = \frac{\sum y_i}{n_2} = \frac{684}{9} = 76$$
$$F = \frac{\sigma_2^2}{\sigma_1^2} = \frac{138.75}{90.85} = 1.527$$
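The two sample variances and their ratio for Ex.1 can be recomputed with the sketch below. It uses the n − 1 divisor defined above; the helper name `sample_variance` is illustrative.

```python
def sample_variance(values):
    """Sample variance with the n - 1 divisor, as defined in this section."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

x = [60, 65, 71, 74, 76, 82, 85, 87]        # Sample I
y = [61, 66, 67, 85, 78, 63, 85, 88, 91]    # Sample II

var_x = sample_variance(x)
var_y = sample_variance(y)
# Larger variance in the numerator, so that the test is right-tailed
f_ratio = max(var_x, var_y) / min(var_x, var_y)
print(round(var_x, 2), round(var_y, 2), round(f_ratio, 3))
```

The ratio comes out near 1.53, which is below the tabulated value 3.76 quoted in the example.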
$$F = \frac{\sigma_2^2}{\sigma_1^2} = \frac{0.49}{0.36} = 1.3611$$
Exercise
1. In a sample of 400 parts manufactured by a factory, the number of defective parts was
found to be 30. The company, however, claimed that only 5% of their product is defective. Is
the claim tenable? (Given at 5%, 𝑧 = 1.96)
2. The mean life of a sample of optical lenses produced by a company is computed to be
1570hrs with S.D. of 120hrs after which it is presumed that the lenses do not maintain
accuracy. The company claims that the average life of the lenses produced by it is 1600hrs.
Using the level of significance of 0.05, can you say the claim is acceptable?
3. In an investigation on Neonatal Blood Pressure in relation to maturity, the following results
were obtained:
Babies 9 days old Number Mean S.B.P S.D.
1. Normal 54 75 6
2. Neonatal asphyxia 14 69 5
Is the difference in mean S.B.P. between the two groups significant at 5% level? (Given at 5%
level, 𝑧 = 1.96)
4. Intelligence test on two groups of boys and girls gave the following results.
Group Mean S.D. n
1. Boys 75 15 150
2. Girls 70 20 150
Is there any significant difference in the mean scores obtained by boys and girls at 5% level?
(Use z-test)
5. For a random sample of 10 persons fed on diet A, the increase in weight in pounds in a
certain period were : 10, 6, 16, 17, 13, 12, 8, 14, 15, 9
For another random sample of 12 persons fed on diet B the same is given as: 7, 13, 22, 15,
12, 14, 18, 8, 21, 23, 10, 17
Test whether the diets A and B are different as regards their effect on increase in weight.
6. Life time of batteries for a random sample of 10 from a large consignment gave the
following data. Can we accept the hypothesis that average life time of battery is 400 hrs?
(Note: use t-test at 5% level for df = 9).
Battery 1 2 3 4 5 6 7 8 9 10
Life (in hrs X 100) 5.6 4.2 4.6 4.1 5.2 3.8 3.9 4.3 4.4 3.9
7. Two laboratories M and N carry out independent estimation of fat in ice-cream made by
same industry. A sample is taken from each batch and their observations are given in table
below. Is there any significant difference between the mean fat content obtained by two labs
M and N?
Batch No.: 1 2 3 4 5 6 7 8 9 10
Lab M 6 6 9 6 7 6 7 8 8 4
Lab N 7 6 7 8 6 5 7 7 9 4
8. Ten incubation periods of polio cases are given below. Discuss the mean by using t-test.
Days: 66 69 70 65 69 71 70 68 63 62.
9. A drug given for treatment to 10 patients suffering from diabetes showed change in blood
pressure as given in table. Is it reasonable to believe, at 5% level, that the drug has no side
effect in the form of a change in B.P.?
125 130 120 140 135 125 120 140 135 125
10. The weights at birth (in kg) of female children born in a hospital are given below. Is there
anything to suggest that the mean weight of the children differs significantly from the
population mean of 3 kg?
2.5 3.0 2.5 3.0 3.2 3.5 2.5 3.1 2.9 3.5
8. ANOVA
Introduction:
In the previous chapter, we learned to use the t-test for testing the equality of means of two
populations based on data from two independent samples. The t-test and z-test, developed in the
early 20th century, were the standard tools until 1918, when Ronald Fisher created the analysis of
variance (ANOVA), which is an extension of the t-test and the z-test. In this chapter, we are going
to test the equality of means of three or more populations. This comparison is based on the
partition of the total variation into its component parts; hence the method is called analysis of
variance. Since its introduction by Sir Ronald A. Fisher, the method has been used in many research
fields.
ANOVA is a statistical tool used to test whether the means of three or more populations are
significantly different from each other when variances are unknown. It checks the impact of one or
more factors by comparing the means of samples.
Assumptions to use the ANOVA:
To use the ANOVA test we make the following assumptions:
Each group sample is drawn from a normally distributed population.
All populations have a common/same variance.
Within each sample, the observations are sampled randomly and independently of each
other.
The sample sizes for the groups are equal and greater than 10
Factor effects are additive
Types of ANOVA:
There are two types of ANOVA:
1. One-way ANOVA(unidirectional)
2. Two-way ANOVA
1. One-way ANOVA:
The one-way analysis of variance (ANOVA) is generally used to determine whether there
are any statistically significant differences between the means of three or more independent
(unrelated) groups using the F-distribution. It has only one independent variable affecting the
dependent variable, so it is also known as one factor analysis of variance. The null hypothesis for
the test is that all group means are equal. Therefore, a significant result means that at least two
of the means are unequal.
Limitations of the One Way ANOVA
A one way ANOVA will tell us that at least two groups differ from each other, but it
cannot tell us which groups differ.
2. Two-way ANOVA:
A two-way ANOVA is an extension of one-way ANOVA in which there are two independent
factors affecting the dependent variable. It is mostly used when there is quantitative as well as
qualitative data.
In two-way ANOVA, two null hypotheses are tested if one observation is placed in each cell.
i.e. the hypotheses would be:
H01: For column factor.
H02: For row factor.
ANOVA Table:
This is the table that shows the output of the ANOVA analysis and whether there is a
statistically significant difference between our group means. The tabular arrangement of source,
sum of squares, degree of freedom, mean sum of square (MSS) and F-ratio is called ANOVA table.
The word "source" stands for source of variation. Some authors prefer to use "between" and
"within" instead of "treatments" and "error", respectively.
Steps to perform ANOVA:
Following are the steps to perform ANOVA:
Step 1: setup null hypothesis (𝑯𝟎 ) for given data.
a) For one way ANOVA there should be only one null hypothesis.
b) For two way ANOVA there should be two null hypotheses.
Step 2: Find column sums i.e. 𝑐1 , 𝑐2 , 𝑐3 …
In case of two way ANOVA along with column sums, find row sums i.e. 𝑟1 , 𝑟2 , 𝑟3 , …
Note: for large values, minimize data by subtracting smallest element of data from all observation
Step 3: Find grand total (GT):
𝐺𝑇 = 𝑐1 + 𝑐2 + 𝑐3 + … . = 𝑟1 + 𝑟2 + 𝑟3 + … .
Step 4: Find correction factor (C. F.)
$$C.F. = \frac{(GT)^2}{N}$$
Where; N = Total number of observations
Step 5: Find column sum of squares (CSS)
$$CSS = \frac{c_1^2}{n_1} + \frac{c_2^2}{n_2} + \frac{c_3^2}{n_3} + \cdots - C.F.$$
Where n1, n2, n3… are the number of observations in respective columns.
Similarly, for two way ANOVA along with CSS, find out row sum of squares (RSS).
$$RSS = \frac{r_1^2}{n_1} + \frac{r_2^2}{n_2} + \frac{r_3^2}{n_3} + \cdots - C.F.$$
Where n1, n2, n3… are the number of observations in respective rows
Step 6: Calculate total sum of squares (TSS):
𝑻𝑺𝑺 = 𝑺𝒖𝒎 𝒐𝒇 𝒔𝒒𝒖𝒂𝒓𝒆𝒔 𝒐𝒇 𝒂𝒍𝒍 𝒐𝒃𝒔𝒆𝒓𝒗𝒂𝒕𝒊𝒐𝒏𝒔 − 𝑪. 𝑭.
Step 7: Calculate error sum of squares (ESS)
a) For one way ANOVA:
𝑬𝑺𝑺 = 𝑻𝑺𝑺 − 𝑪𝑺𝑺
b) For two way ANOVA
𝑬𝑺𝑺 = 𝑻𝑺𝑺 − (𝑪𝑺𝑺 + 𝑹𝑺𝑺)
Step 8: Find degree of freedom (df)
a) For one way ANOVA:
df for 𝐶𝑆𝑆 = 𝑐 − 1 (where c = No. of columns )
df for 𝐸𝑆𝑆 = 𝑐 (𝑟 − 1) (Where c and r = No. of columns and rows )
b) For two way ANOVA:
df for 𝐶𝑆𝑆 = 𝑐 − 1
df for 𝑅𝑆𝑆 = 𝑟 – 1
df for 𝐸𝑆𝑆 = (𝑐 − 1) (𝑟 − 1)
Step 9: ANOVA table
a) One way ANOVA table
Source | Sum of squares | df | Mean sum of squares (MSS) | F ratio
Between columns (CSS) | CSS | c − 1 | CSS / (c − 1) | F = (larger value from MSS) / (smaller value)
Error (ESS) | ESS | c(r − 1) | ESS / c(r − 1) |
b) Two way ANOVA table
Source | Sum of squares | df | Mean sum of squares (MSS) | F ratio
Between columns (CSS) | CSS value | c − 1 | CSS / (c − 1) | F1 = (larger of the MSS values for CSS and ESS) / (smaller value)
Between rows (RSS) | RSS value | r − 1 | RSS / (r − 1) | F2 = (larger of the MSS values for RSS and ESS) / (smaller value)
Error (ESS) | ESS value | (c − 1)(r − 1) | ESS / (c − 1)(r − 1) |
Step 10) Conclusion
a) One way ANOVA:
Ex.1. Use ANOVA and determine whether the machines are significantly different in their mean
speed (Given; at 5% F2,12 = 3.89).
Machines
A1 A2 A3
25 31 24
30 39 30
36 38 28
38 42 25
31 35 38
Ans.:
1) Set up null Hypothesis;
𝐻0 = There is no significant difference in the average speed of three machines.
2) Minimize data by subtracting the smallest observation (i.e. 24) from all and calculate
column sum.
Data become;
A1 A2 A3
1 7 0
6 15 6
12 14 4
14 18 1
7 11 14
𝑐1 = 40 𝑐2 = 65 𝑐3 = 25
5) Calculate CSS
$$CSS = \frac{c_1^2}{n_1} + \frac{c_2^2}{n_2} + \frac{c_3^2}{n_3} - C.F.$$
$$= \frac{40^2}{5} + \frac{65^2}{5} + \frac{25^2}{5} - 1126.67$$
= [320 + 845 + 125] – 1126.67
= 1290 – 1126.67
= 163.33
6) Calculate TSS
𝑇𝑆𝑆 = [ 𝑆𝑢𝑚 𝑜𝑓 𝑠𝑞𝑢𝑎𝑟𝑒𝑠 𝑜𝑓 𝑎𝑙𝑙 𝑜𝑏𝑠𝑒𝑟𝑣𝑎𝑡𝑖𝑜𝑛𝑠] – 𝐶. 𝐹.
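Squares of the coded observations:
A1 A2 A3
1 49 0
36 225 36
144 196 16
196 324 1
49 121 196
Sum = 426 Sum = 915 Sum = 249
𝑇𝑆𝑆 = [426 + 915 + 249] – 1126.67 = 1590 – 1126.67 = 463.33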
Varieties
A B C D
20 25 24 23
29 23 20 20
21 21 22 20
Ans.:
1) Set up null Hypothesis.
𝐻0 = The four varieties do not differ significantly among themselves
2) Minimize data by subtracting the smallest observation (i.e. 20) from all and calculate
column sum.
Data become;
A1 A2 A3 A4
0 5 4 3
9 3 0 0
1 1 2 0
𝑐1 = 10 𝑐2 = 9 𝑐3 = 6 𝑐4 = 3
3) Calculate Grand Total
$$G.T. = c_1 + c_2 + c_3 + c_4 = 10 + 9 + 6 + 3 = 28$$
4) Find Correction Factor:
$$C.F. = \frac{(G.T.)^2}{N} = \frac{(28)^2}{12} = 65.33$$
5) Calculate CSS
$$CSS = \frac{c_1^2}{n_1} + \frac{c_2^2}{n_2} + \frac{c_3^2}{n_3} + \frac{c_4^2}{n_4} - C.F.$$
$$= \frac{10^2}{3} + \frac{9^2}{3} + \frac{6^2}{3} + \frac{3^2}{3} - 65.33$$
= [33.33 + 27 + 12 + 3] – 65.33
= 75.33 – 65.33
= 10
6) Calculate TSS
𝑇𝑆𝑆 = [Sum of squares of all observations] – 𝐶. 𝐹.
Squares of the coded observations:
A1 A2 A3 A4
0 25 16 9
81 9 0 0
1 1 4 0
Sum = 82 Sum = 35 Sum = 20 Sum = 9
𝑇𝑆𝑆 = [82 + 35 + 20 + 9] – 65.33
= 146 – 65.33
= 80.67
7) Find ESS
𝐸𝑆𝑆 = 𝑇𝑆𝑆 – 𝐶𝑆𝑆 = 80.67 – 10 = 70.67
8) Degree of freedom
For CSS = 𝑐 − 1 = 4 − 1 = 3
For ESS = 𝑐(𝑟 − 1) = 4(3 − 1) = 8
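Steps 3 to 8 for this example can be traced with a small Python sketch. It follows the correction-factor method described in the steps above; the function name `one_way_anova` and the dictionary keys are illustrative choices, and the F ratio is formed as larger MSS over smaller MSS, following the convention of the ANOVA table in this chapter.

```python
def one_way_anova(columns):
    """One-way ANOVA sums of squares by the correction-factor method."""
    n_total = sum(len(col) for col in columns)
    grand_total = sum(sum(col) for col in columns)
    cf = grand_total ** 2 / n_total                         # correction factor
    css = sum(sum(col) ** 2 / len(col) for col in columns) - cf
    tss = sum(v * v for col in columns for v in col) - cf
    ess = tss - css
    df_css = len(columns) - 1
    # N - c equals c(r - 1) when every column has r observations
    df_ess = n_total - len(columns)
    mss_css, mss_ess = css / df_css, ess / df_ess
    f_ratio = max(mss_css, mss_ess) / min(mss_css, mss_ess)
    return {"CF": cf, "CSS": css, "TSS": tss, "ESS": ess,
            "df_CSS": df_css, "df_ESS": df_ess, "F": f_ratio}

# Coded data (20 subtracted from every observation), one list per variety
varieties = [[0, 9, 1], [5, 3, 1], [4, 0, 2], [3, 0, 0]]
result = one_way_anova(varieties)
for key, value in result.items():
    print(key, round(value, 2))
```

Subtracting a constant from all observations does not change any of the sums of squares, so the same function can be applied to the raw data.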
Y \ X | X1 | X2 | X3 | X4 | Row totals
Y1 | 10 | 13 | 16 | 14 | 𝑟1 = 53
Y2 | 6 | 15 | 13 | 6 | 𝑟2 = 40
Y3 | 11 | 10 | 18 | 0 | 𝑟3 = 39
Total | 𝑐1 = 27 | 𝑐2 = 38 | 𝑐3 = 47 | 𝑐4 = 20 |
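For a table such as the one above, with one observation in each cell, the two-way quantities CSS, RSS, TSS and ESS can be computed as in the following sketch. It applies the general steps given earlier in this chapter; the function name `two_way_anova` is illustrative.

```python
def two_way_anova(table):
    """Two-way ANOVA sums of squares; table is a list of rows, one value per cell."""
    r, c = len(table), len(table[0])
    n_total = r * c
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    grand_total = sum(row_sums)
    cf = grand_total ** 2 / n_total                  # correction factor
    css = sum(s ** 2 for s in col_sums) / r - cf     # each column has r observations
    rss = sum(s ** 2 for s in row_sums) / c - cf     # each row has c observations
    tss = sum(v * v for row in table for v in row) - cf
    ess = tss - (css + rss)
    return {"CF": cf, "CSS": css, "RSS": rss, "TSS": tss, "ESS": ess,
            "df_CSS": c - 1, "df_RSS": r - 1, "df_ESS": (c - 1) * (r - 1)}

table = [
    [10, 13, 16, 14],   # Y1
    [6, 15, 13, 6],     # Y2
    [11, 10, 18, 0],    # Y3
]
result = two_way_anova(table)
for key, value in result.items():
    print(key, value)
```

The mean sums of squares and the two F ratios then follow by dividing each sum of squares by its degrees of freedom, as laid out in the two way ANOVA table.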
Exercise
1. What do you mean by ANOVA? Explain in short.
2. Describe various steps for ANOVA.
3. The following table gives the yield of 15 sample plots under three varieties of seed. You
are required to find whether the average yields of land under different varieties of seed show a
significant difference.
A 20 21 23 16 20
B 18 20 17 15 25
C 25 28 22 28 32
4. Blood of groups A, B, and AB from four different persons was studied for a particular
characteristic. Set up a table of analysis of variance and find out whether there is any
significant difference between the means for the persons and for the blood groups.
Blood group
Persons
A B AB
1 7 9 10
2 4 7 6
3 7 5 7
4 6 6 9
5. The following data gives sales made by three MR. Perform analysis of variance to test
whether there is any difference between sales made by three MR.
A B C
300 600 700
400 300 300
300 300 400
500 400 600
0 - 500
6. The life time in hours for cells taken randomly from four individuals were observed as given
in table. On the basis of observation, analyze the values for their variance and study its
significance.
7. Samples of peanut butter produced by a company in three different batches are tested for
aflatoxin content and the following results are obtained. Use the 0.05 level of significance to
test whether the differences among the three sample means are significant.
Batch A 1.0 2.2 4.8 0.4 1.5 3.3
Batch B 0.7 1.2 5.2 3.6 1.8 2.5
Batch C 4.3 5.5 2.7 1.1 0.3 0.5
8. Two random samples were drawn from normal populations. Set up ANOVA table for given
data.
Sample 1 20 15 14 16 18 10 12 17
Sample 2 16 15 8 28 10 14 10 8 12 16
10. A company appoints four salesmen J, K, L, and M and observes their sales in three different
zones viz. A, B, C. Carry out ANOVA. (Note: figures are in lakhs)
Zones Salesmen
J K L M
A 55 41 48 56
B 52 51 56 49
C 49 49 46 48
9. Chi-square test
Introduction:
In the previous few chapters, statistical inference has concentrated on statistics such as the
mean and proportion, which have been used to obtain interval estimates and to test hypotheses
concerning population parameters. This chapter changes the approach to inferential statistics by
studying whole distributions and the relationship between two distributions. These inferences are
drawn using the chi-square test.
The chi-square test is one of the most useful and widely used tests in statistics because the
assumptions needed to perform it are minimal. Thus, this test can be used in most circumstances.
Chi-square test:
The chi-square test is a procedure for testing whether two categorical variables are related in
some population. The null hypothesis of the chi-square test is that no relationship exists between
the categorical variables in the population; they are independent. This test is based on the
difference between what is actually observed in the data and what would be expected if there were
truly no relationship between the variables.
The Chi-square test is denoted by 𝜒 2 and given by,
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
Where, 𝑂𝑖 = Observed values
𝐸𝑖 = Expected values
Steps to perform chi-square test:
1) Set up the hypothesis as,
To test:
𝐻0 : 𝑂𝑖 = 𝐸𝑖
Against:
𝐻1 : 𝑂𝑖 ≠ 𝐸𝑖
2) Test statistics: Chi square
3) Formula:
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
4) Inference:
At 𝛼% level,
If 𝐶𝑎𝑙 𝜒 2 ≤ 𝑇𝑎𝑏 𝜒 2 , then Accept 𝐻0 .
Otherwise, reject 𝐻0 .
Ex.1. During clinical trials of newly developed drug for treatment of diabetic retinopathy out of 120
volunteers, 76 persons were administered a new drug. Out of 76 persons, 24 persons showed
symptoms of DR. Amongst those not administered the new drug 12 persons were not affected by
DR. Find out whether the new drug is effective or not by using Chi-Square test.
(Given, at 5% level, 𝜒 2 = 3.84)
Ans.: From given data, prepare table as given below;
DR status | Drug administered | Not administered | Total
Developed | 24 | 32 | 56
Not developed | 52 | 12 | 64
Total | 76 | 44 | 120
1) To test:
𝐻0 : 𝑂𝑖 = 𝐸𝑖 i.e. Drug is not effective
Against:
𝐻1 : 𝑂𝑖 ≠ 𝐸𝑖
(O) | (E) | O − E | (O − E)² | (O − E)²/E
24 | 35.47 | −11.47 | 131.5609 | 3.709
32 | 20.53 | 11.47 | 131.5609 | 6.408
52 | 40.53 | 11.47 | 131.5609 | 3.246
12 | 23.47 | −11.47 | 131.5609 | 5.605
TOTAL | | | | 18.968
$$\chi^2 = \sum \frac{(O - E)^2}{E} = 18.968$$
4) Inference:
At 5% level and 1 d.f., 𝑇𝑎𝑏 𝜒 2 = 3.84
𝐶𝑎𝑙 𝜒 2 > 𝑇𝑎𝑏 𝜒 2 ,
So, reject 𝐻0 .
Thus, drug is effective.
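The expected frequencies and the χ² value of Ex.1 can be reproduced with a short sketch. Each expected frequency is taken as (row total × column total) / grand total; the function name `chi_square_independence` is illustrative.

```python
def chi_square_independence(table):
    """Chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df

# Rows: DR developed / not developed; columns: drug administered / not administered
observed = [[24, 32], [52, 12]]
chi2, df = chi_square_independence(observed)
print(round(chi2, 3), df)
```

Since the code does not round the expected frequencies, its value (about 18.96) differs slightly from the hand calculation in the second decimal place; it remains far above the tabulated 3.84 for 1 d.f.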
Ex.2. Following table shows the result of an experiment of study of effectiveness of vaccines on
resistance to a particular disease.
Attacked Not attacked
Vaccinated 11 31
Non-Vaccinated 30 8
Using Chi-square test, analyze the results of experiments for independence between vaccination
and attack.
(Given, at 5% level, 𝜒 2 = 3.84)
1) To test:
𝐻0 : 𝑂𝑖 = 𝐸𝑖 i.e. Vaccination and attack of disease are independent.
Against:
𝐻1 : 𝑂𝑖 ≠ 𝐸𝑖
2) Test statistics: Chi square
Formula:
$$\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
Where,
$$E_{ij} = \frac{c_i \times r_j}{\text{total}}$$
$$\chi^2 = \sum \frac{(O - E)^2}{E} = 22.1153$$
3) Inference:
At 5% level and 1 d.f., 𝑇𝑎𝑏 𝜒 2 = 3.84
𝐶𝑎𝑙 𝜒 2 > 𝑇𝑎𝑏 𝜒 2 ,
So, reject 𝐻0 .
Thus, vaccination and attack of disease are not independent.
Ex.3. A certain drug claimed to be effective in curing colds. In an experiment on 328 people with
colds, half of them were given sugar pills and half were given drug. The patients’ reactions to the
treatment were recorded. Test the hypothesis that drug is no better to the treatment than the sugar
pills.
(Given, at 5% level, 𝜒 2 = 3.84)
1) To test:
𝐻0 : 𝑂𝑖 = 𝐸𝑖 i.e. drug is no better to the treatment than the sugar pills.
Against:
𝐻1 : 𝑂𝑖 ≠ 𝐸𝑖
$$\chi^2 = \sum \frac{(O - E)^2}{E} = 22.1153$$
3) Inference:
At 5% level and 1 d.f., 𝑇𝑎𝑏 𝜒 2 = 3.84
𝐶𝑎𝑙 𝜒 2 > 𝑇𝑎𝑏 𝜒 2 ,
So, reject 𝐻0 .
Thus, the drug is better than the sugar pills for the treatment.
Ex.4. In a mobile manufacturing company, different numbers of faulty mobiles were produced in
each shift, as shown in the table. Test the hypothesis that the number of faulty mobiles produced
is independent of the shift if the numbers of shifts worked are in the ratio 4:5:3.
(Given, at 5% level and 2 d.f. 𝜒 2 = 5.99)
$$\chi^2 = \sum \frac{(O - E)^2}{E} = 4.1523$$
3) Inference:
At 5% level and 2 d.f., 𝑇𝑎𝑏 𝜒 2 = 5.99
𝐶𝑎𝑙 𝜒 2 < 𝑇𝑎𝑏 𝜒 2 ,
So, accept 𝐻0 .
Thus, the number of faulty mobiles is independent of shift.
Ex.5. In an industry, the number of accidents in three shifts was 12, 14, 19. Can we conclude that all
shifts are equally dangerous?
(Given at 5% level, 𝜒2 2 = 5.99 𝜒3 2 = 7.81)
Ans.:
1) To test:
𝐻0 : 𝑂𝑖 = 𝐸𝑖 i.e. All shifts are equally dangerous.
Against:
𝐻1 : 𝑂𝑖 ≠ 𝐸𝑖
2) Test statistics: Chi square
Formula:
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
Under the null hypothesis, the number of accidents in each shift will be equal to the average.
$$\text{Average} = \frac{12 + 14 + 19}{3} = 15$$
Now, prepare a table as follows;
Shift | (O) | (E) | O − E | (O − E)² | (O − E)²/E
Morning | 12 | 15 | −3 | 9 | 0.6
Afternoon | 14 | 15 | −1 | 1 | 0.067
Night | 19 | 15 | 4 | 16 | 1.067
TOTAL | | | | | 1.733
$$\chi^2 = \sum \frac{(O - E)^2}{E} = 1.733$$
3) Inference:
At 5% level and 2 d.f., 𝑇𝑎𝑏 𝜒2 2 = 5.99
𝐶𝑎𝑙 𝜒 2 < 𝑇𝑎𝑏 𝜒 2 ,
So, accept 𝐻0 .
Thus, all shifts are equally dangerous.
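A goodness-of-fit calculation of this kind can be sketched as follows. The expected frequencies are obtained by sharing the total count according to a stated ratio, which is 1:1:1 for equally dangerous shifts. The function name `goodness_of_fit` is illustrative.

```python
def goodness_of_fit(observed, ratio):
    """Chi-square statistic when expected counts follow the given ratio."""
    total = sum(observed)
    expected = [total * part / sum(ratio) for part in ratio]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return expected, chi2

# Accidents in the morning, afternoon and night shifts
expected, chi2 = goodness_of_fit([12, 14, 19], [1, 1, 1])
print(expected, round(chi2, 4))
```

The statistic evaluates to 26/15, about 1.73, which is below the tabulated 5.99 for 2 d.f. The same function accepts any other ratio, for example 4:5:3 or 9:3:3:1.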
Exercise
1. A newly discovered drug was administered to 800 volunteers during clinical trials out of a
total of 3000 volunteers. The number of fever cases is shown below. Discuss the usefulness of the
drug (use the χ² test at 1% level of significance).
2. Certain drug is claimed to be effective in treatment of migraine. During clinical trials, out of 400
people suffering from migraine, 200 were administered tablet containing drug and 200 were
administered placebo tablet. The patient responses were recorded as shown in table. Test the
hypothesis that drug is no better to treatment.
Treatment Cured Harmed No effect
Drug 135 20 45
Placebo 65 25 110
3. The following table shows the number of people with or without high blood pressure. Do the
data reveal an association between age groups and acidity?
4. Genetic theory states that children having one parent of blood group A and other of B will
always possesses one of the blood group out of A, AB, B and that the proportion of three types
will be on an average be as 1:2:1. A report states that out of 250 children having one of A
parent and one B parent 30% were found to be A, 45% type AB and remaining of type B. Test
the hypothesis by X2 test.
5. Theory predicts that the proportion of mangos in the four groups J, K, L and M should be
9:3:3:1. In an experiment among 1600 mangos the numbers in the four groups were 880, 315, 280,
125. Do the experimental results support the theory?
6. From the data given below about the treatment of 500 patients suffering from a disease, state
whether the new treatment is superior to the conventional treatment.
Treatment Favorable Non-favorable Total
New 280 60 340
Conventional 120 40 160
Total 400 100 500
(Given for 1d.f., 𝜒0.05 2 = 3.84)
7. In an experiment on pea breeding, Mendel obtained the following frequencies of seeds: 315
round and yellow, 101 wrinkled and yellow, 108 round and green, 32 wrinkled and green. Theory
predicts that the frequencies should be in proportion 9: 3: 3: 1. Examine the correspondence
between theory and experiment. (Given at 5% level, 𝜒3 2 = 7.815)
8. From the following data use Chi-square test and conclude whether inoculation is effective in
preventing a disease.
Attacked Not attacked Total
Inoculated 31 469 500
Non-inoculated 185 1315 1500
Total 216 1784 2000
(Given for 1d.f., 𝜒0.05 2 = 3.84)
9. In a mobile manufacturing company, different numbers of faulty mobiles were produced in each
shift, as shown in the table. Test the hypothesis that the number of faulty mobiles produced is
independent of the shift if the numbers of shifts worked are in the ratio 4:5:3.
(Given, at 5% level and 2 d.f. 𝜒 2 = 5.99)
Shift No. of faulty mobiles
Morning 10
Afternoon 24
Night 08
10. A certain drug claimed to be effective in curing colds. In an experiment on 328 people with
colds, half of them were given sugar pills and half were given drug. The patients’ reactions to
the treatment were recorded. Test the hypothesis that drug is no better to the treatment than
the sugar pills. (Given, at 5% level, 𝜒 2 = 3.84)
Introduction:
In the previous chapters, all the parametric methods of hypothesis testing, namely the z-test,
Student's t-test, Karl Pearson's correlation coefficient and ANOVA, were based on some
underlying assumptions about the data such as "normally distributed data" and "population
variances are equal". These methods are referred to as parametric tests. But what if the
assumptions on which the methods are based do not hold? In such cases we need distribution-free
methods for hypothesis testing, which make few or no assumptions about the underlying
distribution. Such distribution-free methods are called nonparametric tests.
Nonparametric tests are often used when the population data has an unknown distribution
or when the sample size is small. Nonparametric statistics are typically used on qualitative data.
This method is useful when the data has no clear numerical interpretation, and is best to use with
data that has a ranking of sorts. This type of statistics can be used without the mean, sample size,
standard deviation, or the estimation of any other related parameters when none of that
information is available.
Common nonparametric tests include Chi square, Mann Whitney U-test (Wilcoxon rank-
sum test), Kruskal-Wallis test, Friedman test and Spearman's rank-order correlation.
1. Mann Whitney U-test (Wilcoxon rank-sum test):
A popular nonparametric test to compare outcomes between two independent groups is the Mann
Whitney U test. The Mann Whitney U test, sometimes called the Mann Whitney Wilcoxon Test or the
Wilcoxon Rank Sum Test, is used to test whether two samples are likely to derive from the same
population. This test also compares the medians between the two populations. This test is often
performed as a two-sided test.
Steps:
1) Set up null hypothesis as follows:
To test:
H0: The two populations are equal
Against:
H1: The two populations are not equal.
2) Test statistic:
If 𝑛1 and 𝑛2 are the two sample sizes, then the Mann Whitney test statistic, denoted by U, is
the smaller of 𝑈1 and 𝑈2 defined below.
$$U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1$$
$$U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - R_2$$
New Drug 3 6 4 2 1
Is there a difference in the number of episodes of shortness of breath over a 1 week period
in participants receiving the new drug as compared to those receiving the placebo? By inspection, it
appears that participants receiving the placebo have more episodes of shortness of breath, but is
this statistically significant? (Given at the 5% level of significance and 5 d.f., 𝑇𝑎𝑏 𝑈 = 2).
Ans:
1) To test:
H0: The two populations are equal
Against:
H1: The two populations are not equal.
2) Test statistic:
$$U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1$$
$$U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - R_2$$
3) Prepare the table as follows.
Is there statistical evidence of a difference in APGAR scores in women receiving the new and
enhanced versus usual prenatal care? (given at 5% level, for 𝑛1 = 8, 𝑛2 = 7 𝑇𝑎𝑏𝑈 = 10)
Ans:
1) To test:
H0: The two populations are equal
Against:
H1: The two populations are not equal.
2) Test statistic:
$$U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1$$
$$U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - R_2$$
3) Prepare the table as follows.
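Usual Care | New Program | Ascending order: Usual Care (x) | Ascending order: New Program (y) | Rank: Usual Care (𝑹𝒙) | Rank: New Program (𝑹𝒚)
8 | 9 | 2 | | 1 |
7 | 8 | 3 | | 2 |
6 | 7 | 5 | | 3 |
2 | 8 | 6 | 6 | 4.5 | 4.5
5 | 10 | 7 | 7 | 7 | 7
8 | 9 | 7 | | 7 |
7 | 6 | 8 | 8 | 10.5 | 10.5
3 | | 8 | 8 | 10.5 | 10.5
 | | | 9 | | 13.5
 | | | 9 | | 13.5
 | | | 10 | | 15
Total | | | | 45.5 | 74.5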
$$U_1 = 8 \times 7 + \frac{8(8 + 1)}{2} - 45.5 = 46.5$$
$$U_2 = 8 \times 7 + \frac{7(7 + 1)}{2} - 74.5 = 9.5$$
Here, 𝑈2 < 𝑈1 . So 𝑈 = 𝑈2 = 9.5
4) Given at 5% level of significance and for 𝑛1 = 8, 𝑛2 = 7, 𝑇𝑎𝑏 𝑈 = 10
𝐶𝑎𝑙 𝑈 < 𝑇𝑎𝑏 𝑈
So reject 𝐻0 .
Thus, the two populations of APGAR scores are not equal in women receiving usual prenatal
care as compared to the new program of prenatal care.
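The ranking and the two U values of Ex.2 can be verified with the sketch below. Tied observations receive the average of the ranks they occupy, as in the table above. The function names `average_ranks` and `mann_whitney_u` are illustrative.

```python
def average_ranks(values):
    """Map each distinct value to its rank, averaging ranks over ties."""
    ordered = sorted(values)
    rank_of = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        # Tied values occupy positions i+1 .. j; assign their mean rank
        rank_of[ordered[i]] = (i + 1 + j) / 2
        i = j
    return rank_of

def mann_whitney_u(x, y):
    """Return (R1, R2, U1, U2, U) using the formulas of this section."""
    rank_of = average_ranks(x + y)          # rank the pooled observations
    r1 = sum(rank_of[v] for v in x)
    r2 = sum(rank_of[v] for v in y)
    n1, n2 = len(x), len(y)
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return r1, r2, u1, u2, min(u1, u2)

usual_care = [8, 7, 6, 2, 5, 8, 7, 3]
new_program = [9, 8, 7, 8, 10, 9, 6]
r1, r2, u1, u2, u = mann_whitney_u(usual_care, new_program)
print(r1, r2, u1, u2, u)
```

A useful check on any hand calculation is that 𝑈1 + 𝑈2 equals 𝑛1 𝑛2, here 46.5 + 9.5 = 56.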
Ex.3. A clinical trial is run to assess the effectiveness of a new anti-retroviral therapy for patients
with HIV. Patients are randomized to receive a standard anti-retroviral therapy (usual care) or the
new anti-retroviral therapy and are monitored for 3 months. The primary outcome is viral load
which represents the number of HIV copies per milliliter of blood. A total of 30 participants are
randomized and the data are shown below.
Std Therapy: 7500 8000 2000 550 1250 1000 2250 6800 3400 6300 9100 970 1040 670 400
New Therapy: 400 250 800 1400 8000 7400 1020 6000 920 1420 2700 4200 5200 4100 undetectable
Is there statistical evidence of a difference in viral load in patients receiving the standard versus the
new anti-retroviral therapy? (Given at 5% level and for 𝑛1 = 𝑛2 = 15, 𝑇𝑎𝑏 𝑈 = 64)
Ans:
1) To test:
$$U_1 = 15 \times 15 + \frac{15(15 + 1)}{2} - 245 = 100$$
$$U_2 = 15 \times 15 + \frac{15(15 + 1)}{2} - 220 = 125$$
Here, 𝑈1 < 𝑈2 . So 𝑈 = 𝑈1 = 100
4) Given at 5% level of significance and 𝑛1 = 𝑛2 = 15 𝑇𝑎𝑏 𝑈 = 64
𝐶𝑎𝑙 𝑈 > 𝑇𝑎𝑏 𝑈
So accept 𝐻0 .
Thus, there is no statistical evidence that the treatment groups differ in viral load.
Test for Paired data: The following nonparametric tests are used in the case of paired data such
as the "before-after" case.
Sign Test: The Sign Test is the simplest nonparametric test for matched or paired data. The
approach is to analyze only the signs of the difference scores. The test statistic for the Sign Test
is the number of positive signs or number of negative signs, whichever is smaller.
Steps:
1) Set up null hypothesis.
To test:
H0: The median difference is zero.
Against:
H1: The median difference is not zero
2) Test statistic:
The test statistic for the Sign Test is the smaller of the number of positive or negative signs.
3) Compute the test statistic:
Calculate the difference 𝐵𝑒𝑓𝑜𝑟𝑒 𝑐𝑎𝑠𝑒 – 𝐴𝑓𝑡𝑒𝑟 𝑐𝑎𝑠𝑒 for each pair. Count the number of positive
differences and the number of negative differences to obtain the test statistic.
4) Set up decision rule and make conclusion:
For given 𝛼 level of significance,
If the smaller of the number of positive or negative signs ≤ 𝑇𝑎𝑏 𝑣𝑎𝑙𝑢𝑒 then Reject 𝐻0
Otherwise Accept 𝐻0 .
Ex.1. A new chemotherapy treatment is proposed for patients with breast cancer. Investigators
are concerned with patient's ability to tolerate the treatment and assess their quality of life both
before and after receiving the new chemotherapy treatment. Quality of life (QOL) is measured on
an ordinal scale and for analysis purposes, numbers are assigned to each response category as
follows: 1=Poor, 2= Fair, 3=Good, 4= Very Good, 5 = Excellent.
The data are shown below.
Ex.1. A study is run to evaluate the effectiveness of an exercise program in reducing systolic blood
pressure in patients with pre-hypertension (defined as a systolic blood pressure between 120-139
mmHg or a diastolic blood pressure between 80-89 mmHg). A total of 15 patients with pre-
hypertension enroll in the study, and their systolic blood pressures are measured. Each patient then
participates in an exercise training program where they learn proper techniques and execution of a
series of exercises. Patients are instructed to do the exercise program 3 times per week for 6 weeks.
After 6 weeks, systolic blood pressures are again measured. The data are shown below.
Is there a difference in systolic blood pressures after participating in the exercise program as
compared to before? (Given at 5% level and n= 15, 𝑊 = 25)
Ans:
1) Set up null hypothesis.
To test:
H0: The median difference is zero.
Against:
H1: The median difference is not zero
2) Test statistic:
The test statistic for the Wilcoxon Signed Rank Test is W, defined as the smaller of W+ and
W- which are the sums of the positive and negative ranks, respectively.
Rank table:
Observed Ascending order of Signed Ranks
Differences Differences
7 1 1
-2 -2 -2.5
8 -2 -2.5
-4 -3 -4
20 -4 -6
-3 -4 -6
6 4 6
7 6 8
8 -7 -10
4 7 10
9 7 10
-4 8 12.5
-7 8 12.5
1 9 14
-2 20 15
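The signed ranks in the table above, and the sums W+ and W-, can be recomputed from the observed differences with the following sketch. Ranks are assigned to the absolute differences, with average ranks for ties, and then given the sign of the difference. The function name `wilcoxon_signed_rank` is illustrative, and dropping zero differences is an assumption of the sketch (none occur in this data set).

```python
def wilcoxon_signed_rank(differences):
    """Return (W_plus, W_minus, W) for a list of paired differences."""
    nonzero = [d for d in differences if d != 0]   # assumption: zeros are dropped
    ordered = sorted(abs(d) for d in nonzero)
    rank_of = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        rank_of[ordered[i]] = (i + 1 + j) / 2      # average rank for tied values
        i = j
    w_plus = sum(rank_of[abs(d)] for d in nonzero if d > 0)
    w_minus = sum(rank_of[abs(d)] for d in nonzero if d < 0)
    return w_plus, w_minus, min(w_plus, w_minus)

observed = [7, -2, 8, -4, 20, -3, 6, 7, 8, 4, 9, -4, -7, 1, -2]
w_plus, w_minus, w = wilcoxon_signed_rank(observed)
print(w_plus, w_minus, w)
```

As a check, W+ and W- must add up to n(n + 1)/2, which is 120 for n = 15. The resulting W is then compared with the tabulated value given in the example.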
$$H = \frac{12}{N(N + 1)} \left( \frac{R_1^2}{n_1} + \frac{R_2^2}{n_2} + \frac{R_3^2}{n_3} + \cdots \right) - 3(N + 1)$$
$$H = \frac{12}{N(N + 1)} \left( \frac{R_1^2}{n_1} + \frac{R_2^2}{n_2} + \frac{R_3^2}{n_3} + \cdots \right) - 3(N + 1)$$
Here, 𝑁 = 20, 𝑛1 = 𝑛2 = 𝑛3 = 𝑛4 = 5
𝑅1 = 46, 𝑅2 = 62, 𝑅3 = 24, 𝑅4 = 78
$$H = \frac{12}{20(20 + 1)} \left( \frac{46^2}{5} + \frac{62^2}{5} + \frac{24^2}{5} + \frac{78^2}{5} \right) - 3(20 + 1)$$
= 0.02857 × 2524 − 3(21)
= 72.114 − 63
= 9.114
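The value of H can be checked from the rank sums with a short sketch. It assumes the standard Kruskal-Wallis form, in which the final term is 3(N + 1), matching the 3(21) that appears in the calculation above. The function name `kruskal_wallis_h` is illustrative.

```python
def kruskal_wallis_h(rank_sums, sizes):
    """Kruskal-Wallis H from the rank sum and size of each group."""
    n_total = sum(sizes)
    weighted = sum(r * r / n for r, n in zip(rank_sums, sizes))
    return 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)

rank_sums = [46, 62, 24, 78]
sizes = [5, 5, 5, 5]
# The rank sums of all groups together must equal N(N + 1)/2
assert sum(rank_sums) == sum(sizes) * (sum(sizes) + 1) // 2
h = kruskal_wallis_h(rank_sums, sizes)
print(round(h, 3))
```

Without intermediate rounding of 12/420, the statistic is close to 9.11.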
Summary:
Mann Whitney U Test:
Use: To compare a continuous outcome in two independent samples.
Null Hypothesis: H0: Two populations are equal
Test Statistic: The test statistic is U, the smaller of
$$U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1$$
$$U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - R_2$$
Where, 𝑅1 = 𝑅𝑥 i.e. sum of ranks of data x.
𝑅2 = 𝑅𝑦 i.e. sum of ranks of data y.
Decision Rule: Reject H0 if U < critical value from table
Sign Test
Use: To compare a continuous outcome in two matched or paired samples.
Null Hypothesis: H0: Median difference is zero
Test Statistic: The test statistic is the smaller of the number of positive or negative signs.
Decision Rule: Reject H0 if the smaller of the number of positive or negative signs < critical
value from table.
Wilcoxon Signed Rank Test
Use: To compare a continuous outcome in two matched or paired samples.
Null Hypothesis: H0: Median difference is zero
Test Statistic: The test statistic is W, defined as the smaller of W+ and W- which are the sums
of the positive and negative ranks of the difference scores, respectively.
Decision Rule: Reject H0 if W < critical value from table.
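Kruskal Wallis Test
Test Statistic: The test statistic is H,
\[
H = \frac{12}{N(N+1)} \left( \frac{R_1^2}{n_1} + \frac{R_2^2}{n_2} + \frac{R_3^2}{n_3} + \cdots \right) - 3(N+1)
\]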
Minimize or eliminate confounding variables, which can offer alternative explanations for the
experimental results. (A confounding variable is an “extra” variable that the investigator did not
account for. Such variables can ruin an experiment and give false results.)
Allow the different variables involved in the experiment to be correlated.
Describe how samples are allocated to groups or selected for the experiment.
Method: Label all subjects with the same number of digits. For example, if there are 20 subjects,
number them 01 to 20. Randomly select non-repeating numbers from these labels for the first treatment
and then repeat for the other treatments.
Advantages
1. It is the simplest design.
2. It can accommodate any number of subjects and treatments.
3. Although the sample size might not be the same for each treatment, this design is simple to analyze.
Limitations:
1. It is best suited to situations involving relatively few treatments.
2. All subjects must be as homogeneous as possible. Any extreme deviation results in error.
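The allocation method of the completely randomized design can be sketched in Python as follows. This is only an illustration: the function name, the treatment labels "A" and "B" and the fixed seed are assumptions, and any leftover subjects are simply placed in the last treatment group.

```python
import random


def completely_randomized(n_subjects, treatments, seed=None):
    """Randomly split subject labels 1..n_subjects among the treatments."""
    rng = random.Random(seed)                  # seed only makes the example repeatable
    labels = list(range(1, n_subjects + 1))
    rng.shuffle(labels)                        # random order, no label repeated
    size = n_subjects // len(treatments)
    allocation = {}
    for i, treatment in enumerate(treatments):
        start = i * size
        # the last treatment also receives any leftover subjects
        end = start + size if i < len(treatments) - 1 else n_subjects
        allocation[treatment] = sorted(labels[start:end])
    return allocation


allocation = completely_randomized(20, ["A", "B"], seed=1)
for treatment, subjects in allocation.items():
    print(treatment, subjects)
```

Every subject appears in exactly one group, which mirrors drawing non-repeating numbers for the first treatment and then for the next.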
2. Randomized block design:
In this design, subjects are first sorted into homogeneous groups called blocks, and
then treatments are assigned randomly within each block, in the same way as discussed above.
Method: Subjects are classified into blocks according to their characteristics, and within each block
the treatment is randomized. Each block is independent of the others.
Advantages:
1. It can produce more precise results than the completely randomized design.
2. It can accommodate any number of treatments or replications.
3. Blocking produces more comparable (homogeneous) groups.
Limitations:
1. Statistical analysis of the data is relatively difficult.
2. Missing observations within a block increase the complexity of the study.
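A minimal Python sketch of randomization within blocks is given below. The block names, subject numbers and treatment labels are hypothetical, and the sketch assumes that the size of each block is a multiple of the number of treatments so that every treatment occurs equally often in every block.

```python
import random


def randomized_block(blocks, treatments, seed=None):
    """Assign treatments at random within each block.

    blocks: dict mapping a block name to the list of subject labels in that block.
    Returns a dict mapping subject label -> (block name, treatment).
    """
    rng = random.Random(seed)
    plan = {}
    for name, subjects in blocks.items():
        # equal replication of each treatment inside the block
        order = list(treatments) * (len(subjects) // len(treatments))
        rng.shuffle(order)                     # randomization is done per block
        for subject, treatment in zip(subjects, order):
            plan[subject] = (name, treatment)
    return plan


# Hypothetical blocks formed on one characteristic (for example age group).
blocks = {"younger": [1, 2, 3, 4], "older": [5, 6, 7, 8]}
plan = randomized_block(blocks, ["A", "B"], seed=2)
for subject in sorted(plan):
    print(subject, plan[subject])
```

Because the shuffle is repeated separately for each block, the allocation in one block does not depend on the allocation in another.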
Limitation:
It can produce a carry-over effect, which may give false results. A carry-over effect is
the presence of residue of the previous treatment in the subject at the time of the current treatment.
This problem can be overcome by allowing enough time for the drug to wash out of the body.
If the first row and the first column contain the n letters in alphabetical order, the Latin square
is called a standard square.
This design is used mainly in the pharmaceutical field for bioequivalence studies, as well as in
agriculture.
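One simple way to write down a standard square is the cyclic construction sketched below in Python. This is only one of many possible Latin squares of a given size, the function name is illustrative, and the sketch assumes n is at most 26 so that single letters can be used as treatment labels.

```python
import string


def standard_latin_square(n):
    """Cyclic n x n Latin square whose first row and first column are alphabetical."""
    letters = string.ascii_uppercase[:n]
    # each row is the previous row shifted one place to the left
    return [[letters[(row + col) % n] for col in range(n)] for row in range(n)]


square = standard_latin_square(4)
for row in square:
    print(" ".join(row))
```

For n = 4 the rows are A B C D, B C D A, C D A B and D A B C. Each letter occurs exactly once in every row and every column.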
Advantages:
1. It minimizes carry-over effects.
2. It minimizes the inter-subject variability.
3. It makes it possible to study formulation variables, which are the most important part of a
bioequivalence study.
Limitations:
1. The study takes more time because of the washout period.
2. The patient dropout rate is high.
3. Randomization is somewhat difficult.
\[
2N = \frac{2 \left[ Z_\alpha \sqrt{2p(1-p)} + Z_\beta \sqrt{P_c(1-P_c) + P_d(1-P_d)} \right]^2}{(P_c - P_d)^2}
\]
Where;
2N: Total number of subjects = Nd + Nc
Nd: Number of subjects in drug group
Nc: Number of subjects in control group
Pc: Success rate in control group
Pd: Success rate in drug group
Zα: 1.645 (one-sided α = 0.05)
Zβ: 1.282 (power = 90%)
\[
p = \frac{R_d + R_c}{N_d + N_c}
\]
Where Rd and Rc are the numbers of successes in the drug and control groups respectively.
2. For continuous response:
\[
2N = \frac{4 \sigma^2 (Z_\alpha + Z_\beta)^2}{\delta^2}
\]
δ: the true difference which is to be detected between the control and drug groups.
Zα and Zβ: 1.96 and 1.282 respectively
σ: standard deviation (σ² is the variance)
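Both sample size formulas can be evaluated with the short Python sketch below. The function names and the numerical inputs are hypothetical. For the dichotomous case the sketch takes p as the pooled success rate, which for equal group sizes is the mean of Pc and Pd (an assumption of this sketch). The default Z values are the ones listed above.

```python
import math


def total_n_dichotomous(pc, pd, z_alpha=1.645, z_beta=1.282):
    """Total sample size 2N for a dichotomous (success/failure) response."""
    p = (pc + pd) / 2                          # assumption: equal group sizes
    bracket = (z_alpha * math.sqrt(2 * p * (1 - p))
               + z_beta * math.sqrt(pc * (1 - pc) + pd * (1 - pd)))
    return 2 * bracket ** 2 / (pc - pd) ** 2


def total_n_continuous(sigma, delta, z_alpha=1.96, z_beta=1.282):
    """Total sample size 2N for a continuous response."""
    return 4 * sigma ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2


# Hypothetical inputs, for illustration only.
n_binary = total_n_dichotomous(pc=0.40, pd=0.60)
n_continuous = total_n_continuous(sigma=10, delta=5)

# Round up to the next even whole number so both groups have the same size.
print(2 * math.ceil(n_binary / 2), 2 * math.ceil(n_continuous / 2))
```

In practice the calculated 2N is rounded up, since a fraction of a subject cannot be recruited. A smaller difference to be detected (Pc - Pd or δ) increases the required sample size sharply, because it enters the denominator as a square.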
Exercise
1. Define experimental design. Give its significance.
2. Enlist the steps involved in experimental design.
3. Explain in detail the various types of experimental designs used in clinical trials.
4. Write a note on randomization, replication and local control.
5. Give the formula for calculation of sample size for clinical trials.
What is Statistics?
Statistics is a branch of mathematics which deals with the collection, organization, analysis,
interpretation and presentation of data obtained from experiments or surveys.
Initially, the use of statistics was restricted to the collection of information on health,
population, property etc. by the respective governing bodies. Later, developments in the field
made statistics an essential part of many disciplines. Nowadays almost all fields involve applications of
statistics, including Pharmacy, Medicine, Political Science, Commerce, Business, Media and
many more. Moreover, every person uses statistics in daily life, although he or she may not be aware
that the practice is called statistics.
Progress in computing has made the use of statistics simpler still. With
computers and newly developed software one can handle huge sets of numerical data and
obtain results in a fraction of a minute. Even with advanced Android phones, people can apply
statistics anywhere for data analysis and presentation.
What is biostatistics?
Biostatistics is the term used when the tools of statistics are applied to data obtained from
biological areas. In other words, biostatistics is the application of statistics to a wide range of
topics in biology.
Biostatistics involves the collection, summarization, analysis, interpretation and presentation of
data from various biological experiments, especially in medicine, pharmacy, agriculture and fishery. A
major branch of this is medical biostatistics, which is exclusively concerned with medicine and
health.
In medicine, whether in research, diagnosis or treatment, everything depends on measurement
or counting. For example, saying that a tablet disintegrates rapidly or slowly has no meaning unless it
is expressed in figures. Therefore, biostatistics is also called quantitative medicine.
Applications of Biostatistics:
1. In research and development in the pharmaceutical industry:
Biostatistics plays a crucial role in research and development in the pharmaceutical industry.
Drug discovery and development, development of new formulations of existing drugs and development of
generic products are all major areas that involve applications of biostatistics.
It takes about 10-15 years to develop one new medicine, from the time it is discovered to
when it is available for treating patients. The average cost to research and develop each successful
drug is estimated to be $800 million to $1 billion. The overall process involves discovery of drug
molecules, screening of drug molecules, preclinical study, submission of an IND (Investigational
New Drug Application), clinical trials and submission of an NDA (New Drug Application) for approval of
the drug for marketing. From the discovery phase to submission of the NDA, every stage involves
the application of biostatistics: experimental design, stating hypotheses, checking probability,
selection of sampling techniques, collection of data through different experiments, arrangement of
collected data, and analysis and interpretation of results. The investigator needs prior permission from
the FDA for both preclinical and clinical trials and has to submit data which prove the safety and
efficacy of the drug for further study. The FDA monitors all generated data very carefully and
permits the investigator to proceed to the next stage only if satisfied. All these data are numerical and involve
different calculations and representation of the generated data in a specific manner so that appropriate
conclusions can be drawn from them. As already stated, the overall process involves a huge investment of
money as well as time, so any error in the results can lead the investigator to tremendous loss.
Such errors can be avoided or minimized by the application of biostatistics.
Similarly, in the development of generic products, the applicant has to prove bioequivalence of
the developed product with the innovator's product and submit the data to the FDA in an ANDA (Abbreviated New
Drug Application), which again involves the application of biostatistics.
2. In the study of anatomy and physiology:
i) Biostatistics is used to study various physiological and anatomical parameters and their correlation
with health, e.g. mean pulse rate, average glucose level, mean and variance of weight and height.
For example, the mean height of boys in Maharashtra is less than that in Punjab. Whether this difference
is due to natural variation or to a difference in nutrition can be studied with the
application of statistics.
ii) Biostatistics is also used to study the normal, healthy population and to set limits for
abnormality.
3. In pharmacology:
i) In pharmacology, biostatistics plays an important role in finding the action of a drug on animals or humans.
ii) It is also used to compare two different drugs, two different formulations of the same drug or two
identical dosage forms from different manufacturers.
4. In medicine:
i) Biostatistics is used to find the relation between two factors, such as T.B. and smoking.
ii) It can be used to compare the efficacy of drugs.
iii) The signs and symptoms of a disease or syndrome are identified by statistical study. E.g. in
typhoid, fever is observed in almost all cases and cough is rare.
5. In community medicine and public health:
i) Biostatistics is used in community medicine to find the usefulness of sera and vaccines, for example by
comparison between vaccinated and unvaccinated groups.
ii) It is used in epidemiological studies to find the role of causative factors, e.g. deficiency of calcium
in osteoporosis.
iii) In public health, it can be used to check the effectiveness of preventive measures. For example, a fall
in the death rate may be the result of the availability of modern facilities in hospitals, advancement in
medicines or an increase in public awareness.
Exercise
1. Define statistics and biostatistics. Why is biostatistics called quantitative medicine?
2. Explain the applications of biostatistics in pharmacy in detail.