
Applied Biostatistics:

An Essential Tool in Healthcare Profession


(ISBN: 978-81-953600-1-7)

Ms. Archana V. Nerpagar Mr. Umesh D. Laddha

Assistant Lecturer Assistant Professor,

Bhonsala Military School, Nashik MET’s Institute of Pharmacy, Nasik

Mrs. Savita Mandan Dr. Sanjay J. Surana

Assistant Professor, Principal,

R. C. Patel Institute of Pharmaceutical R. C. Patel Institute of Pharmaceutical

Education and Research, Shirpur Education and Research, Shirpur

Dr. Sanjay J. Kshirsagar

Principal,

MET’s Institute of Pharmacy, Nasik

2021
First Edition: 2021

ISBN: 978-81-953600-1-7

 Copyright reserved by the publishers

Publication, Distribution and Promotion Rights reserved by Bhumi Publishing, Nigave Khalasa, Kolhapur
Despite every effort, there may still be chances for some errors and omissions to have crept in
inadvertently.
No part of this publication may be reproduced in any form or by any means, electronically, mechanically,
by photocopying, recording or otherwise, without the prior permission of the publishers.
The views and results expressed in various articles are those of the authors and not of editors or
publisher of the book.

Published by:
Bhumi Publishing,
Nigave Khalasa, Kolhapur 416207, Maharashtra, India
Website: www.bhumipublishing.com
E-mail: bhumipublishing@gmail.com
Book Available online at:

https://www.bhumipublishing.com/books/
Preface
We have great pleasure and privilege in presenting the book “Applied Biostatistics:
An Essential Tool in Healthcare Profession”, which is based on the new PCI syllabus and
will be helpful in the courses of B. Pharm, M. Pharm, B. Sc., M. Sc. and B. Ed.

This book accentuates the relationships among probability, probability distributions
and hypothesis testing. To explain the methodology of hypothesis testing, the concepts
of the null and research hypotheses are highlighted. The book also contains an in-depth study
of the standard parametric analyses along with their nonparametric alternatives.
Nonparametric techniques are particularly useful in research activities with small sample
sizes.

In this book every topic is explained in detail and supported by sufficient
solved examples. The questions are categorised according to the types of methods
applied. We have tried to keep the language as simple as possible to help students
understand the statistical concepts more easily. We have also tried to cover the
material in enough depth that readers will benefit when preparing for
competitive exams.

We hope that this book will be appreciated and accepted by all institutes, teachers
and students. There may be a few mistakes and deficiencies; we will be grateful if
readers point them out to us. We also welcome any suggestions from
your side.
Acknowledgment
First and foremost, praises and thanks to the God, the Almighty, for His showers of
blessings throughout my work to complete this book successfully.

I would like to express my deep and sincere thanks to Dr. Sanjay J. Surana, Principal,
R. C. Patel Institute of Pharmaceutical Education and Research, Shirpur and Shirpur
Education Society (SES), Shirpur for giving me opportunity and to make me able to
write this book.

I wish to express my special gratitude to my soul mate Mr. Umesh D. Laddha whose
dynamism, vision, sincerity and motivation have deeply inspired me for the
completion of this book.

I am also thankful to Mrs. Savita S. Mandan from R. C. Patel Institute of
Pharmaceutical Education and Research, Shirpur for her valuable addition.

I am also thankful to Central Hindu Military Education Society, Nashik for providing
facilities while writing this book.

I am extremely grateful to my parents for their love, prayers, caring and sacrifices for
educating and preparing me for my future. I am very much thankful to my Son
Devesh for his love, understanding and continuing support to complete my work. Also
I express my thanks to my sisters, brother and in laws for their support and valuable
time.

- Archana V. Nerpagar
This book is dedicated to
Hard work, Patience and Efforts….
And
To my lovely Son Devesh
Index

Sr. No. Topic

1. Introduction to biostatistics
2. Probability
3. Sample and sampling techniques
4. Correlation
5. Regression
6. Sampling Variability, Significance & Statistical inference
7. Testing of hypothesis
8. ANOVA
9. Chi-square test
10. Non-parametric test
11. Experimental design
12. Applications of Biostatistics in Pharmacy
13. References
14. Standard Value Tables


Applied Biostatistics: An Essential Tool in Healthcare Profession

1. INTRODUCTION TO BIOSTATISTICS

Introduction:
Statistics is a very broad subject, with applications in a vast number of different fields.
Statistics, or statistical analysis, is the branch of mathematics that deals with the
collection, organization, analysis, interpretation and presentation of data. In other words, statistics
is the methodology which scientists and mathematicians have developed for interpreting and
drawing conclusions from collected data. It is the science of gaining information from numerical
and categorical data. In practice, statistics is applied to study the effectiveness of
medical treatments, the reaction of consumers to television advertising, the attitude of young
people toward marriage, and much more. Nowadays statistics is used in nearly every
field of science.
Biostatistics is defined as the application of statistical tools and methods to the data
derived from biological sciences. It is the application of statistics in the development and use of
therapeutic drugs and devices in humans and animals. The science of biostatistics consists of
biological experiments (specifically in medicine, pharmacy, agriculture and fishery), the collection,
analysis and interpretation of data, the inferences and results. The goal of Biostatistics is to promote
statistical science and its application in the study of medicine, human health and disease.
Let us first define some basic terms of statistics that are necessary for understanding
biological and agricultural analysis.
1. Statistical Data:
A collection of numerical statements of facts is called data. In
statistical analysis, numerical data is obtained from scientific enquiry. The numerical facts in the collected
scientific data are known as observations.
In statistics, data describes characteristics. A characteristic is a quality possessed
by an individual or observation.
Characteristics are of two types:
(i) Non measurable characteristics (Attributes): The characteristics related to the qualities of
the observations are called attributes. E.g. sex, literacy, blood group, pass, fail etc.
(ii) Measurable characteristics (variables): The characteristics related to the quantity of the
observations are called variables. Their values vary from one observation to another. E.g. heights, weights and
ages of persons, temperature, water salinity, etc.
For example, the weights of children in a class are 35kg, 37kg, 32kg, 38kg, 34kg, 39kg, 36kg and
40kg. This statistical statement contains numerical values, which form the data for analysis, with 8
observations.

Bhumi Publishing, India

Type of Data:
Data can be classified into two types:
a) Qualitative data:
The data which deals with the descriptions or qualities of individuals is qualitative data.
This data can be observed but cannot be measured using any unit. The qualitative characteristics
are known as attributes.
e.g. blood group, colours, smells, tastes, appearance, emotions etc.
b) Quantitative data:
The data which deals with the numerical values of the individuals is quantitative data. This
data can be measured in some units. The quantifiable characteristics are known as variables.
e.g. height, weight, length, temperature etc.
The quantitative variables are further divided as follows:
i) Discrete variables: Variables that can take only specific, separate values (a finite number of them in a
given range) are known as discrete variables. Discrete variables can be counted in a finite amount of
time.
For example, we can count the change in our pocket, the money in a bank account, the number of
tablets in a pack, the number of students in a class, parity (number of births) or the number of myocardial infarctions.
ii) Continuous variables: Variables that can take an infinite number of values in a given
range are known as continuous variables. Their possible values cannot be counted: we
could go on listing them forever and never finish. Many of the variables studied in biology are
continuous variables.
For example, age. We cannot state an exact age, because it can always be measured more finely.
The age of a person could be given as 25 years, 5 months, 10 days, 6 hours, 40 minutes, 4 seconds, 4
milliseconds, 9 nanoseconds, 10 picoseconds, and so on. Other examples are weight, diastolic blood pressure,
volume and time required for recovery.
Collection of Data:
Data Collection is an important aspect of any type of research study. Inaccurate data
collection can impact the results of a study and ultimately lead to invalid results. The data can be
collected on the basis of qualitative or quantitative characteristics. For example, to check the effect of a drug in
curing a disease, we have to collect quantitative information about patients before and after
administration of the drug.
There are many methods of data collection depending on our research designs and
methodologies. Generally, data is collected from two sources, primary sources and secondary
sources.
Primary sources: The original, first-hand source from which information is collected is called a
primary source, and the data collected from a primary source is primary data, i.e. when an
investigator collects data himself with a definite plan or purpose in mind, it is called
primary data. E.g. data obtained by the census commissioner for a population census.
To collect primary data following methods are used:
a) Observation method: Observation is the main source of information in the field of research. In
this method observations are recorded from experiments or a specific situation.
b) Questionnaire method: This method plays an important role in data collection process.
Questionnaire, usually, consists of number of objective questions that the respondent has to
answer.
The questionnaire should be designed properly. All questions asked should be relevant to the
subject of research. Questions should be short, simple, clear and easy to understand. They
should be arranged in order from easy to difficult. The information through a questionnaire can be
collected by mail or post, which is called postal inquiry.
c) Interview method: An interview is a verbal conversation between two people in order to
collect the information required for research. It is an interactional communication in which questions
are asked by the interviewer for a specific purpose, to obtain research-related information, and answers
are given by the interviewee. There are different types of interview, such as personal interview, telephone
interview, focus group interview, depth interview and projective techniques.
Secondary Sources: The sources of information such as published literature or published reports
are known as secondary sources and data collected from secondary sources is secondary data. i.e.
data which is not originally collected but obtained from published or unpublished sources is called
secondary data. Some of the secondary sources are as follows:
1) Government publications, 2) Census report, 3) Periodicals and books, 4) Research review
journals, 5) Research articles, 6) Research papers, 7) Magazines, 8) Academic publications, 9)
Research literature, 10) Ph.D. Thesis.
Classification of Data:
The process of arranging collected data into homogeneous groups or classes according to
common characteristics is called classification. After collecting qualitative or quantitative data, it
is required to sort the data from the questionnaire according to the common characteristics. With
proper classification, unnecessary information is dropped.
e.g. During population census, people in the country are classified according to sex
(males/females), marital status (married/unmarried), residential place (rural/urban), age groups,
profession, etc.
Raw Data: When information is collected and presented as recorded, without any arrangement, it is called raw data.
For Example: Given below are the marks (out of 25) obtained by 20 students of class VII A in
mathematics in a test.
18, 16, 12, 10, 5, 5, 4, 19, 20, 10, 12, 12, 15, 15, 15, 8, 8, 8, 8, 16


Observation:
Each entry collected as a numerical fact in the given data is called an observation.
Array:
The raw data when put in ascending or descending order of magnitude is called an array or
arrayed data.
For Example: The above data is arranged in ascending order and represented as:
4, 5, 5, 8, 8, 8, 8, 10, 10, 12, 12, 12, 15, 15, 15, 16, 16, 18, 19, 20
Range:
The difference between the highest and the lowest value of the observation is called the
range of the data.
In the above data,
Highest marks obtained = 20
Lowest marks obtained = 4
Therefore, 𝑟𝑎𝑛𝑔𝑒 = 20 − 4 = 16
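The array and the range above can be reproduced with a few lines of Python. This is only an illustrative sketch; the variable names are our own and not part of the text.

```python
# Marks (out of 25) of the 20 students, as listed in the raw data example.
marks = [18, 16, 12, 10, 5, 5, 4, 19, 20, 10,
         12, 12, 15, 15, 15, 8, 8, 8, 8, 16]

array = sorted(marks)                 # raw data arranged in ascending order
data_range = max(marks) - min(marks)  # highest value minus lowest value

print(array)
print("Range =", data_range)
```

Passing `reverse=True` to `sorted` would give the array in descending order instead.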
Frequency Distribution:
a) Frequency: The number of times an observation or value of a variable occurs in a given series of
observations is called the frequency of that observation.
e.g. consider the marks of 15 students in a class as follows: 22, 24, 20, 22, 23, 20, 25, 22, 22,
25, 20, 25, 20, 22, 24.
Here, number 20 is repeating 4 times. So frequency of 20 is 4.
22 is repeating 5 times. So its frequency is 5.
Similarly, frequency of 23 is 1, frequency of 24 is 2 and frequency of 25 is 3.
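As a quick check, the frequencies above can be counted programmatically. The sketch below uses Python's standard `collections.Counter`; the variable names are illustrative.

```python
from collections import Counter

# Marks of the 15 students from the example above.
marks = [22, 24, 20, 22, 23, 20, 25, 22, 22, 25, 20, 25, 20, 22, 24]

frequency = Counter(marks)  # maps each observation to its number of occurrences

for value in sorted(frequency):
    print(value, frequency[value])
```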
b) Frequency distribution: The tabular arrangement of observations in the collected scientific
data individually or in groups or classes along with their frequencies is called frequency
distribution.
c) Class: The group of observations in the data under our consideration is called class.
e.g. the marks out of 100 can be divided into the classes as 0-10, 10-20, 20-30, …, 90-100.
Classes are also known as class intervals. Each class interval is assigned two values. The
smallest value is called lower limit and the highest value is called upper limit of certain class
interval.
e.g. For a class 40-50,
Lower limit = 40 and upper limit = 50.
Classes can be of two types:
i) Continuous classes: Classes of the form 10-20, 20-30, 30-40, … in which the lower limit of any
class is equal to the upper limit of its previous class are called continuous classes.
ii) Non continuous classes: Classes of the form 10-19, 20-29, 30-39, … are called non
continuous classes.
These classes can be made continuous by subtracting the term $d/2$ from the lower limits of all
classes and adding the term $d/2$ to the upper limits of all classes. The newly formed class intervals are
continuous and are known as class boundaries,
where $d$ = difference between the lower limit of any class and the upper limit of its previous class.
Ex. Make the following non continuous classes continuous.
Class 20-29 30-39 40-49 50-59 60-69
Frequency 5 8 12 7 6

Ans: Here, the common difference is
$d = 30 - 29 = 40 - 39 = 50 - 49 = 60 - 59 = 1$, so $\dfrac{d}{2} = \dfrac{1}{2} = 0.5$.
So subtract 0.5 from all lower limits and add 0.5 to all upper limits of the classes.

Class    Frequency    Class Boundaries
20-29    5            19.5-29.5
30-39    8            29.5-39.5
40-49    12           39.5-49.5
50-59    7            49.5-59.5
60-69    6            59.5-69.5
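The conversion rule (subtract d/2 from every lower limit, add d/2 to every upper limit) can be sketched in Python as follows. The function name `class_boundaries` is a hypothetical helper introduced only for illustration, and it assumes the gap d is the same between all neighbouring classes.

```python
def class_boundaries(class_limits):
    """Turn non-continuous class limits into continuous class boundaries."""
    # d = lower limit of a class minus the upper limit of the previous class
    d = class_limits[1][0] - class_limits[0][1]
    half = d / 2
    return [(lower - half, upper + half) for lower, upper in class_limits]

# Class limits from the exercise above.
limits = [(20, 29), (30, 39), (40, 49), (50, 59), (60, 69)]
boundaries = class_boundaries(limits)

for (lower, upper), (b_lower, b_upper) in zip(limits, boundaries):
    print(f"{lower}-{upper}  ->  {b_lower}-{b_upper}")
```

Each upper boundary coincides with the next lower boundary, which is exactly what makes the classes continuous.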

d) Class width (h): The difference between the upper limit and the lower limit of a class interval is called the
class width. It is denoted by $h$.
$h = \text{upper limit} - \text{lower limit}$
e.g. for a class interval 45-55,
Class width $= h = 55 - 45 = 10$
e) Class mark or mid value (X): The class mark or mid value is the value which lies exactly in the
middle of the class interval. It is denoted by $X$ and given by
$X = \dfrac{\text{lower limit} + \text{upper limit}}{2}$
e.g. for the class interval 30-40,
Class mark $= X = \dfrac{30 + 40}{2} = 35$.

f) Relative frequency:
It is given by the formula
$\text{Relative frequency of a class} = \dfrac{\text{Frequency of the class}}{\text{Total frequency}}$

g) Percentage frequency:
$\text{Percentage frequency of a class} = \dfrac{\text{Frequency of the class}}{\text{Total frequency}} \times 100$

h) Frequency density: If the class intervals of a frequency distribution are of unequal width,
frequency densities can be used to compare the concentration of frequencies in the class intervals and to
construct the histogram.

$\text{Frequency density of a class} = \dfrac{\text{Frequency of the class}}{\text{Class width}}$
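The quantities defined in d) to h) can be computed together for a grouped table. The sketch below uses hypothetical classes of unequal width (the values are invented purely for illustration and do not come from the text) so that the frequency density differs visibly from the raw frequency.

```python
# Hypothetical grouped data with unequal class widths (illustrative values only).
classes = [((0, 10), 4), ((10, 20), 6), ((20, 40), 8), ((40, 50), 2)]
total = sum(f for _, f in classes)

rows = []
for (lower, upper), f in classes:
    width = upper - lower          # class width h
    mark = (lower + upper) / 2     # class mark X
    relative = f / total           # relative frequency
    percentage = 100 * f / total   # percentage frequency
    density = f / width            # frequency density
    rows.append((lower, upper, f, width, mark, relative, percentage, density))

for row in rows:
    print("{}-{}: f={}, h={}, X={}, rel={:.2f}, %={:.0f}, density={:.2f}".format(*row))
```

In this made-up table the class 20-40 has the largest frequency (8), but its frequency density (0.4) equals that of the class 0-10, because it is twice as wide.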

There are two types of frequency distribution.


1) Discrete (ungrouped) frequency distribution:
In discrete frequency distribution, the observations are arranged in ascending order
without considering the repeated ones in the table. Second column of the table contains frequencies
of corresponding observations.
Ex. Following data gives number of members in 30 families. Classify the data and prepare frequency
distribution table.
3 3 4 2 4 3 5 6 2 4
3 4 1 6 3 2 7 6 1 1
5 5 3 2 1 3 1 5 4 3

Observations Tally Marks Frequency


1 5
2 4
3 8
4 5
5 4
6 3
7 1
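A discrete frequency distribution table with tally marks can also be generated by a short program. In the sketch below the `tally` helper is our own illustrative function; it writes strokes in bundles of five (by hand, the fifth stroke is usually drawn across the first four).

```python
from collections import Counter

# Number of members in the 30 families, as given in the example above.
family_sizes = [3, 3, 4, 2, 4, 3, 5, 6, 2, 4,
                3, 4, 1, 6, 3, 2, 7, 6, 1, 1,
                5, 5, 3, 2, 1, 3, 1, 5, 4, 3]

def tally(count):
    """Render a count as tally strokes grouped in bundles of five."""
    bundles, rest = divmod(count, 5)
    return " ".join(["|||||"] * bundles + (["|" * rest] if rest else []))

table = sorted(Counter(family_sizes).items())  # (observation, frequency) pairs

for size, freq in table:
    print(f"{size:>2}  {tally(freq):<12} {freq}")
print("Total", sum(freq for _, freq in table))
```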


2) Continuous (grouped) frequency distribution:


Steps:
1) Find the maximum and minimum values from the given data.
2) Decide the number of class intervals to be formed.
3) Classes should be formed in such way that least value should be included in first class interval
and the maximum value should be included in the last class interval.
4) If the classes are continuous then the upper limit is included in next class i.e. if the class interval
is 15 – 20 then 20 will occur in next class interval.
Example:
Consider the following marks (out of 50) obtained in Mathematics by 60 students of Class VIII:
21, 10, 30, 22, 33, 5 , 37, 12, 25, 42, 15, 39, 26, 32, 18, 27, 28, 19, 29, 35, 31, 24,36, 18, 20, 38, 22, 44,
16, 24, 10, 27, 39, 28, 49, 29, 32, 23, 31, 21, 34, 22, 23, 36, 24, 36, 33, 47, 48, 50 , 39, 20, 7, 16, 36, 45,
47, 30, 22, 17.
If we make a frequency distribution table for each observation, then the table would be too
long, so, for convenience, we make groups of observations.
From the above data 5 is the minimum value and 50 is the maximum value. So we have to
make such classes that 1st class includes minimum value and last class includes maximum value. So
the class interval will be 0 – 10, 10 – 20 and so on. According to these classes find the frequency.
So the Frequency distribution table is as follows:

Groups Tally Marks Frequency


0-10 || 2
10-20 10
20-30 21
30-40 19
40 - 50 7
50 - 60 | 1
Total 60
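The grouping rule used here (continuous classes in which the upper limit belongs to the next class) can be sketched in Python. The function `grouped_frequencies` is a hypothetical helper written for this illustration; classes that contain no observation are simply not listed.

```python
# Marks (out of 50) of the 60 students, as listed in the example above.
marks = [21, 10, 30, 22, 33, 5, 37, 12, 25, 42, 15, 39, 26, 32, 18, 27, 28, 19, 29, 35,
         31, 24, 36, 18, 20, 38, 22, 44, 16, 24, 10, 27, 39, 28, 49, 29, 32, 23, 31, 21,
         34, 22, 23, 36, 24, 36, 33, 47, 48, 50, 39, 20, 7, 16, 36, 45, 47, 30, 22, 17]

def grouped_frequencies(data, start, width):
    """Count observations in continuous classes; the upper limit is excluded."""
    counts = {}
    for x in data:
        lower = start + ((x - start) // width) * width  # lower limit of x's class
        key = (lower, lower + width)
        counts[key] = counts.get(key, 0) + 1
    return dict(sorted(counts.items()))

table = grouped_frequencies(marks, start=0, width=10)

for (lower, upper), f in table.items():
    print(f"{lower}-{upper}: {f}")
print("Total", sum(table.values()))
```

Note that the mark 50 falls into the class 50-60, not 40-50, because an upper limit is counted in the next class.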

Ex.1. The following data gives the marks obtained by 50 students in Mathematics. Prepare a grouped
frequency distribution table taking the class intervals 20-24, 25-29, 30-34, etc.
21 20 55 39 48 46 36 54 42 30
29 42 32 40 34 31 35 37 52 44
39 45 37 33 51 53 52 46 43 47
41 26 52 48 25 34 37 33 36 27
54 36 41 33 23 39 28 44 45 38


Class 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59


Frequency 3 5 8 11 8 7 7 1
Note: Above class intervals are not continuous. But if the class intervals are continuous then the
upper limit value is not included in that corresponding class. See the example given below.

Ex.2. Prepare a grouped frequency distribution table from following data. Take the classes 30-55,
55-80, etc.
110 175 161 157 155 108 164 128 114 178
165 133 195 151 71 94 97 42 30 62
138 156 167 124 164 146 116 149 104 141
103 150 162 149 79 113 69 121 93 143
140 144 187 184 197 87 40 122 103 148

Classes      30-55   55-80   80-105   105-130   130-155   155-180   180-205
Frequency    3       4       7        9         12        11        4

Here, the observation 155 is included in the class 155-180 and not in 130-155, where it is the upper
limit.
Frequency distributions can also be presented using cumulative frequencies.
Cumulative frequency (c.f.): The successive addition of frequencies in the table is known as
cumulative frequency. It is calculated by adding each frequency from a frequency distribution table
to the sum of its predecessors. Cumulative frequency is used to determine the number of
observations that lie above (or below) a particular value in a data set.
There are two types of cumulative frequency:
a) Cumulative Frequency Less Than Type (c.f.l.t.t.):
The successive addition of the frequency of the current class and the frequencies of all classes before it gives the
cumulative frequency less than type. The addition is carried out from top to bottom, i.e. from the lowest
class to the highest class.
b) Cumulative Frequency More Than Type (c.f.m.t.t.):
It is obtained by adding the frequencies from the highest class down to the lowest class, i.e. the addition of
frequencies is from bottom to top.
Ex.1. Find the cumulative frequency distribution for the following data.
Class Limits 10-19 20-29 30-39 40-49 50-59 60-69 70-79
Frequencies 2 4 7 10 16 8 3

Ans:
                                  Less than C.F                   More than C.F
Class Limit   Freq   C.B          Marks            C.F            Marks          C.F
10 – 19       2      9.5 – 19.5   Less than 19.5   2              9.5 or more    48 + 2 = 50
20 – 29       4      19.5 – 29.5  Less than 29.5   2 + 4 = 6      19.5 or more   44 + 4 = 48
30 – 39       7      29.5 – 39.5  Less than 39.5   6 + 7 = 13     29.5 or more   37 + 7 = 44
40 – 49       10     39.5 – 49.5  Less than 49.5   13 + 10 = 23   39.5 or more   27 + 10 = 37
50 – 59       16     49.5 – 59.5  Less than 59.5   23 + 16 = 39   49.5 or more   11 + 16 = 27
60 – 69       8      59.5 – 69.5  Less than 69.5   39 + 8 = 47    59.5 or more   3 + 8 = 11
70 – 79       3      69.5 – 79.5  Less than 79.5   47 + 3 = 50    69.5 or more   3
Total         50
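Both cumulative columns are running sums, one taken from the top and one from the bottom. A minimal Python sketch using the standard `itertools.accumulate` function is given below; the list names are illustrative.

```python
from itertools import accumulate

# Class boundaries and frequencies from the answer table above.
lower_boundaries = [9.5, 19.5, 29.5, 39.5, 49.5, 59.5, 69.5]
upper_boundaries = [19.5, 29.5, 39.5, 49.5, 59.5, 69.5, 79.5]
frequencies = [2, 4, 7, 10, 16, 8, 3]

# Less than type: add frequencies from the lowest class downwards in the table.
less_than = list(accumulate(frequencies))
# More than type: add frequencies from the highest class upwards, then restore the order.
more_than = list(accumulate(reversed(frequencies)))[::-1]

for ub, cf in zip(upper_boundaries, less_than):
    print(f"Less than {ub}: {cf}")
for lb, cf in zip(lower_boundaries, more_than):
    print(f"{lb} or more: {cf}")
```

The last "less than" value and the first "more than" value both equal the total frequency, which is a convenient check on the arithmetic.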
Constructing relative frequency and percentage frequency tables:
Thirty AA batteries were tested to determine how long they would last. The results, to the
nearest minute, were recorded as follows:
423, 369, 387, 411, 393, 394, 371, 377, 389, 409, 392, 408, 431, 401, 363, 391, 405, 382, 400, 381,
399, 415, 428, 422, 396, 372, 410, 419, 386, 390
An analyst studying these data might want to know not only how long batteries last, but also
what proportion of the batteries falls into each class interval of battery life.
This relative frequency of a particular observation or class interval is found by dividing the
frequency (f) by the number of observations (n): that is, (f ÷ n). Thus:
Relative frequency = frequency ÷ number of observations
The percentage frequency is found by multiplying each relative frequency value by 100.
Thus:
Percentage frequency = relative frequency × 100 = (f ÷ n) × 100
Battery life,    Frequency (f)    Relative     Percent
minutes (x)                       frequency    frequency
360-369          2                0.07         7
370-379          3                0.10         10
380-389          5                0.17         17
390-399          7                0.23         23
400-409          5                0.17         17
410-419          4                0.13         13
420-429          3                0.10         10
430-439          1                0.03         3
Total            30               1.00         100

From the above table the analyst can conclude that
7% of the AA batteries have a life from 360 minutes up to but less than 370 minutes, and the
probability of any randomly selected AA battery having a life in this range is approximately 0.07.
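The relative and percentage frequency columns for the battery data can be reproduced with the sketch below. The helper `class_label` is a hypothetical function introduced for illustration; it assumes classes of width 10 whose lower limits are multiples of 10, as in the table.

```python
from collections import Counter

# Battery life in minutes for the 30 AA batteries, as recorded above.
battery_life = [423, 369, 387, 411, 393, 394, 371, 377, 389, 409,
                392, 408, 431, 401, 363, 391, 405, 382, 400, 381,
                399, 415, 428, 422, 396, 372, 410, 419, 386, 390]

def class_label(minutes, width=10):
    """Return the class such as '360-369' that contains the given value."""
    lower = (minutes // width) * width
    return f"{lower}-{lower + width - 1}"

counts = Counter(class_label(x) for x in battery_life)
n = len(battery_life)

# Each row: (class, frequency f, relative frequency f/n, percentage frequency).
table = [(label, f, round(f / n, 2), round(100 * f / n))
         for label, f in sorted(counts.items())]

for label, f, rel, pct in table:
    print(f"{label}  {f:>2}  {rel:.2f}  {pct:>3}")
```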
Representation of data (Data Graphics):
Raw statistical data is difficult for a common person to understand. Yet we
expect readers to pay attention to the figures and compare two or more sets of
observations, which are mostly presented in reports or newspapers.
The representation of collected scientific or statistical data in a simple and attractive
form, using different diagrams and graphs so that a common person can understand it, is called data graphics.
Data can be represented in two ways:
1) Diagrams
2) Graphs
1) Diagrammatic representation of data:
It is visual form for presentation of statistical data, highlighting the basic facts and
relationship. The diagrams drawn on the basis of collected data are easily understood and
appreciated by all. A large number of diagrams are used in bio-statistical analyses. Here are the
important types of diagrams which are commonly used for presentation of qualitative data:
a) Line Diagram
b) Bar Diagram
(i) Simple bar diagram
(ii) Multiple bar diagram
(iii) Divided/Sub-divided bar diagram
(iv) Percentage bar diagram
c) Pie Diagram
a) Line Diagram:
It is the simplest type of diagram. In line diagram only lines are drawn to represent given
variables. The variable is taken along X-axis and the frequencies of the observations are taken along
Y-axis. The lines may be vertical or horizontal. The distance between lines is kept uniform. The lines
are drawn such that their length is proportional to the frequencies.
Ex.1. The table below shows Sam's weight in kilograms for 5 months.
Month Weight in kg
January 49
February 54
March 61
April 69
May 73

The data from the table above is summarized in the line graph below.

[Figure: line graph of Sam's weight in kg (Y-axis, 0 to 80) for the months January to May]

b) Bar Diagram:
It is commonly used to represent the statistical data. Bar is a thick line. In bar diagram only
the length or height of bars is taken into consideration. The data is represented by thick bars of
uniform width keeping the uniform gaps in between two bars. The lengths or heights of bars are
taken proportional to the values they represent. Bars can be drawn vertically or horizontally.
The bar diagram is classified into four main types:
(i) Simple Bar Diagram:
It is used to represent only one observation i.e. one bar represents one observation. So
there are as many bars as the number of observations. We can use different colours or shades to
identify data and to make the diagram attractive.
Ex.1. Draw a simple bar diagram to represent the profits of a bank for 5 years.

Years 1989 1990 1991 1992 1993


Profits
(million $) 10 12 18 25 42

Ans:

[Figure: simple bar diagram of profits (million $) for the years 1989 to 1993]

Ex.2. Represent following data by simple bar diagram.


Years 1971 1981 1991 2001 2011
Population 45 40 50 52 47
Ans:

[Figure: simple bar diagram of population for the years 1971 to 2011]

(ii) Multiple Bar Diagram:


It is also known as compound bar diagram. Multiple bar diagram is used for comparing two
or more variables. The number of variables may be 2, 3 or 4 or more. The bars are drawn adjacent
to each other as per the number of variables. In case of 2 variables, pair of bars is drawn. In case of
3 variables, we draw triple bars. In order to distinguish bars, they may be either differently
coloured or different type of crossing or dotting is used. An index is also prepared to identify the
meaning of different colours or dotting.
Ex.1. Draw multiple bar diagram to represent the imports and exports of Canada (values in $) for
the years 1991 to 1995.
Years Imports Exports
1991 7930 4260
1992 8850 5225
1993 9780 6150
1994 11720 7340
1995 12150 8145
Ans:

[Figure: multiple bar diagram of imports and exports for the years 1991 to 1995]

Ex.2. Represent the following data by a multiple bar diagram.

College    No. of students
           Arts    Science    Commerce    Agriculture
A          120     800        600         400
B          750     500        300         450
Ans:

[Figure: multiple bar diagram of the number of students in colleges A and B for Arts, Science, Commerce and Agriculture]

(iii) Divided/Sub-divided Bar Diagram:


It is also known as a component bar diagram because each bar is sub-divided according to
the components it consists of. The complete bar represents the total value of an observation along with
the values of its components. Each component can be distinguished from the others by a different
colour.
Ex.1. The table below shows the quantity, in hundred kg, of wheat, barley and oats produced on a
certain farm during the years 1991 to 1995.
Years Wheat Barley Oats
1991 34 18 27
1992 43 14 24
1993 43 16 27
1994 45 13 34
1995 35 15 27
Draw sub-divided bar diagram to illustrate data.

[Figure: sub-divided bar diagram of wheat, barley and oats for the years 1991 to 1995]

Ex.2. Represent the data by suitable bar diagram.


Year    Marks in
        Maths    Stat    Practical
2005    40       60      90
2006    35       55      85
2007    45       40      80

[Figure: sub-divided bar diagram of marks in Maths, Stat and Practical for the years 2005 to 2007]

(iv) Percentage Bar Diagram:


Like the sub-divided bar diagram, in this case also the data of one variable (observation) is put on
a single bar, but in terms of percentages. All the bars in this diagram are equal in height, representing
the value 100 as a percentage. The values of all components are converted into percentages. The
component part of each division is depicted as a percentage in each bar.
Ex.1. The table below shows the quantity, in hundred kg, of wheat, barley and oats produced on a
certain farm during the years 1991 to 1994.
Years Wheat Barley Oats
1991 34 18 27
1992 43 14 24
1993 43 16 27
1994 45 13 34
Ans:
          1991              1992              1993              1994
          % value  Cum %    % value  Cum %    % value  Cum %    % value  Cum %
Wheat     43       43       53       53       50       50       49       49
Barley    23       66       17       70       19       69       14       63
Oats      34       100      30       100      31       100      37       100
Total     100               100               100               100
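The percentage and cumulative columns of the answer can be recomputed from the production table. The following Python sketch is illustrative only; it rounds each share to a whole percent, which happens to add up to exactly 100 for every year in this data set but need not do so in general.

```python
# Crop production in hundred kg, copied from the table of Ex.1 above.
production = {
    1991: {"Wheat": 34, "Barley": 18, "Oats": 27},
    1992: {"Wheat": 43, "Barley": 14, "Oats": 24},
    1993: {"Wheat": 43, "Barley": 16, "Oats": 27},
    1994: {"Wheat": 45, "Barley": 13, "Oats": 34},
}

percent_table = {}
for year, crops in production.items():
    total = sum(crops.values())
    rows, cumulative = [], 0
    for crop, quantity in crops.items():
        percent = round(100 * quantity / total)  # share of the year's total, whole percent
        cumulative += percent                    # running total marks the top of each segment
        rows.append((crop, percent, cumulative))
    percent_table[year] = rows

for year, rows in percent_table.items():
    print(year, rows)
```

The cumulative value of each component is the height at which its segment ends when the bar is drawn.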

[Figure: percentage bar diagram of wheat, barley and oats for the years 1991 to 1994]

c) Pie Diagram:
A pie diagram or pie chart is a circular graph in which a circle is divided into sectors. The
angle of sector is proportional to the frequency or percentage of observation. Different shades or
colours can be used to differentiate the variables.
Steps to construct pie diagram/chart:
1. Express the given values of the variables as angles (in degrees) out of the total, i.e.
a. If the actual values of the frequencies are given, then the angle is given by
$\theta = \dfrac{\text{actual frequency of observation}}{\text{total frequency}} \times 360$
b. If the frequencies are given in terms of percentages, then
$\theta = \dfrac{\text{\% frequency of observation}}{100} \times 360$
2. Draw a circle of appropriate radius with a compass.
3. Taking a radius as the base line, draw the angle of the first component at the centre of the
circle using a protractor.
4. Draw all sectors representing components of given data.
5. Label the sectors and circle graph.
6. If different shades or colours are used for the components then prepare an index.

Ex.1. Draw the pie chart for following data.

Item Agriculture Irrigation Health Education

Expenditure 4200 1500 1000 500

Ans:
Item Expenditure Angle (𝜽)
Agriculture 4200 210
Irrigation 1500 75
Health 1000 50
Education 500 25
Total 7200 360
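The sector angles in the answer follow directly from the formula for θ. A minimal Python sketch is shown below; the dictionary and variable names are our own.

```python
# Expenditure figures from Ex.1 above.
expenditure = {"Agriculture": 4200, "Irrigation": 1500, "Health": 1000, "Education": 500}

total = sum(expenditure.values())
# Angle of each sector: (actual value / total) x 360 degrees.
angles = {item: value * 360 / total for item, value in expenditure.items()}

for item, angle in angles.items():
    print(f"{item}: {angle:.0f} degrees")
print("Sum of angles:", sum(angles.values()))
```

The angles must always add up to 360 degrees, which is a useful check before drawing the sectors.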

[Figure: pie chart of expenditure with sectors Agriculture, Irrigation, Health and Education]

Significance of diagrammatic representation:


Diagrams are an effective technique to represent data. A layman cannot easily understand
tabulated data, but with a single glance at a diagram one gets a complete picture of the data
presented. According to M.J. Moroney, “diagrams register a meaningful impression almost before
we think.”
Diagrams are useful because of the following reasons:
(i) They give very clear picture of data.
(ii) They are easy to understand in short time.
(iii) They facilitate comparison between different samples.
(iv) They help to remember data easily.
(v) They make complex data simple.
(vi) They have universal utility.
(vii) No mathematics knowledge is required to draw and understand diagrams.
Limitations of diagrammatic representation:
Diagrammatic representation has the following limitations:
(i) Diagrams do not show the small differences properly.
(ii) In statistical analysis, diagrams are of no use.
(iii) Diagrams are just supplement to tabulation.
(iv) Diagrammatic presentation of data shows only an estimate of the actual behaviour of the
variables.
(v) They can be used only for comparative studies.

2) Graphic Representation of Data:


A graph is a vivid form of presentation of data. The graphic method helps to
present quantitative data in a simple, clear and attractive manner. It is the simplest and most common
aid to numerical reading, giving a picture of the numbers in such a way that the variables
can be easily compared.


The graphic method of representing data is often more effective and powerful than the
diagrammatic representation. It plays an important role in comparison in all fields of study.
According to A. L. Boddington, “The wandering of a line is more powerful in its effect on the mind
than a tabulated statement; it shows what is happening and what is likely to take place, just as
quickly as the eye is capable of working.” The presentation of statistics in the form of graphs
facilitates many processes. Frequency distribution can be represented graphically in following
ways:
a) Histogram
b) Frequency polygon
c) Frequency curve
d) Ogive curve
a) Histogram:
It is one of the most important and useful methods of presenting continuous frequency
distribution. A histogram is similar to a bar diagram which shows continuous frequency
distribution of quantitative data. In this, the continuous class intervals are taken along X-axis and
the frequencies on Y-axis.
Steps to construct histogram:
1. Draw the vertical and horizontal axes using scale.
2. Take the continuous classes on the X-axis; if the classes are not continuous, first make them continuous and then mark them on the X-axis.
3. Take the frequencies on Y-axis with certain multiples.
4. Draw the bar up to the required frequency for each class interval.
5. Different shades or colours can be used to decorate histogram.

Ex.1. Draw a Histogram for the following data.


Classes 8-10 10-12 12-14 14-16 16-18 18-20 20-22 22-24
Frequency 24 52 42 48 12 8 14 6


Ex.2. Draw a Histogram for the following data.


Classes 10-15 15-20 20-25 25-30 30-35
Frequency 2 6 7 5 3

b) Frequency Polygon:
A frequency polygon is a line graph derived from a histogram by joining the mid points of the tops of all bars in the histogram. It begins and ends at the base line i.e. X-axis.
Steps to construct frequency polygon:
1. Draw the histogram for the given data with continuous class intervals.
2. Add one class interval before the first class and one class interval after the last class, each with frequency zero and the same class width.
3. Mark the mid points of all classes at the top of the bars. Also mark the mid points of two
extra classes in step 2.
4. Join all mid points successively with straight lines.
5. The complete bounded figure is frequency polygon.

Ex.1. Draw a frequency polygon for the following data.


Monthly wages 9-11 11-13 13-15 15-17 17-19 19-21 21-23 23-25 25-27
(New classes)
No. of workers 0 6 53 85 56 21 16 8 0


c) Frequency Curve:
With the help of frequency polygon and histogram, we can draw a smooth curve. It is
obtained by joining the points in frequency polygon with free hand in order to get smooth curve. It
removes the ruggedness of polygon. A smoothed frequency curve represents a generalised
characterization of the data collected from the population or mass. Like frequency polygon,
frequency curve also begins and ends at the base line.
d) Ogive curve and cumulative frequency polygon:
It is also known as Cumulative frequency curve, as it is used to represent cumulative
frequency distribution of continuous classes. As there are two types of cumulative frequencies i.e.
less than type and more than type, accordingly there are two types of ogives for any grouped
frequency distribution.
(i) Less than frequency curve (Ogive)
(ii) More than frequency curve (Ogive)
(i) Less Than Frequency Curve:
In this, cumulative frequency less than type is calculated and plotted against the upper limit
of the classes. The points so obtained are joined by a smooth curve. It is an increasing curve sloping
upward from left to right of the graph. It is in the shape of an elongated ‘S’.
(ii) More Than Frequency Curve:
In this, cumulative frequencies more than type are calculated and plotted against the lower
limit of the classes. The points so obtained are joined by a smooth curve. It is a decreasing curve
sloping downward from left to right of the graph. It is in the shape of elongated upside down ‘S’.
An interesting feature of the two ogive curves together is that their point of intersection
gives the median.
Steps for constructing an ogive:
1. Prepare the required cumulative frequency distribution table either less than or more than
or both.
2. Draw and label the X (horizontal) and the Y (vertical) axes.
3. Represent the cumulative frequencies on the Y-axis and the class limits on the X-axis.
4. Plot the cumulative frequency at each class limit with the height being the corresponding
cumulative frequency.
5. Connect the points with line segments (or a smooth curve). A less than ogive always starts at zero cumulative frequency, plotted at the lower limit of the first class.
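
The first step, preparing the cumulative frequency table, can be sketched in a few lines. This is an illustrative sketch only; it reuses the classes and frequencies of the second histogram example, and the variable names are assumptions made for the illustration. Less than cumulative frequencies are paired with upper class limits and more than cumulative frequencies with lower class limits, as described above.

```python
from itertools import accumulate

# Classes and frequencies from the second histogram example
classes = [(10, 15), (15, 20), (20, 25), (25, 30), (30, 35)]
freqs = [2, 6, 7, 5, 3]

# Less than type: running total from the first class downwards
less_than = list(accumulate(freqs))
# More than type: running total from the last class upwards
more_than = list(accumulate(reversed(freqs)))[::-1]

# Points to plot for each ogive: (class limit, cumulative frequency)
less_than_points = [(upper, cf) for (_, upper), cf in zip(classes, less_than)]
more_than_points = [(lower, cf) for (lower, _), cf in zip(classes, more_than)]

print("Less than ogive points:", less_than_points)
print("More than ogive points:", more_than_points)
```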
Significance of graphic representation:
Graphic representation is a visual form of presentation of data. It is more effective and
result oriented than diagrammatic representation. The presentation of statistics in the form of
graphs facilitates many processes in biostatistics.


The main significances of graphs are as follows:


(i) They are more attractive and impressive than a table of figures.
(ii) They make comparison easy.
(iii) They help to present data in simple and understandable way.
(iv) Correlation between two series can be studied easily.
(v) They save time and energy of statistician as well as observer.
(vi) It needs no special knowledge of mathematics to understand graphs.
Limitations of Graphic Representation of Data:
Graphic representation of data suffers from following limitations:
(i) Graphs may be misused by choosing misleading scales.
(ii) Exact values cannot be read accurately from a graph.
(iii) Graphs do not measure magnitude of data; they only show the fluctuations in them.
(iv) The interpretation of graphs varies from person to person.
Measures of Central Tendency
After collecting a set of statistical data, we are usually interested in making some statistical
summary statements about this large and complex set of individual values of variables. According
to Prof. Bowley, “Measures of central tendency are statistical constants which enable us to
comprehend in a single effort the significance of the whole.”
The measures of central tendency describe a distribution in terms of its most ‘frequent’, ‘typical’ or ‘average’ data value. Each is a summary measure that attempts to describe a complete set of data with a single value which represents the centre of the distribution. So, measures of central tendency are sometimes also called measures of central location.
The Measures of Central Tendency are used:
1) To concentrate data at a single value.
2) To facilitate comparison between data.
Criteria for an Ideal Measure of Central Tendency:
(i) It should be properly and rigidly defined.
(ii) It should be simple to understand & easy to calculate.
(iii) It should be based upon all values of given data.
(iv) It should be capable of further mathematical treatment.
(v) It should have sampling stability.
(vi) It should not be unduly affected by extreme values.
The following are the five measures of central tendency which are common in use:
1. Arithmetic Mean (AM)
2. Median
3. Mode

4. Geometric Mean (GM)
5. Harmonic Mean (HM)
1. Arithmetic Mean (AM):
The A.M. or simply mean is the most popular and well known measure of central tendency.
It is also known as ‘average’. Many statistical analyses use the mean as a standard reference point.
A.M. is defined as the sum of all observations divided by the number of observations in the data.
Calculation of Arithmetic mean:
Depending on the type of data, the arithmetic mean is calculated as follows:
a) For raw data:
If $x_1, x_2, \ldots, x_n$ are n observations then AM is given by

$$\text{AM} = \bar{x} = \frac{\text{Sum of all observations}}{\text{Total number of observations}} = \frac{x_1 + x_2 + \dots + x_n}{n}$$

$$\bar{x} = \frac{\sum x_i}{n}$$

Ex.1. The weights of 5 students (in kg) are 20, 21, 25, 14 and 30. Find the average of their weights.
Ans: Here, n = 5

$$\bar{x} = \frac{\sum x_i}{n} = \frac{20 + 21 + 25 + 14 + 30}{5} = \frac{110}{5} = 22 \text{ kg}$$

Ex.2. Arithmetic mean of 5 values 2, 4, a, 9, 5 is 6. What will be the value of ‘a’?


Ans: Here, $n = 5$ and $\bar{x} = 6$

So, $$\bar{x} = \frac{\sum x_i}{n}$$

$$6 = \frac{2 + 4 + a + 9 + 5}{5}$$

$$30 = a + 20$$

$$a = 30 - 20 = 10$$

Ex.3. The average of p and 4p is 10. Find the value of p.


Ans: Here, $n = 2$ i.e. p and 4p, and $\bar{x} = 10$

So, $$\bar{x} = \frac{\sum x_i}{n}$$

$$10 = \frac{p + 4p}{2}$$

$$20 = 5p$$

$$p = \frac{20}{5} = 4$$

Ex.4. Arithmetic mean of 2k, 4, 7, 5, k is 20. Find value of k.


Ans: Here, $n = 5$ and $\bar{x} = 20$

So, $$\bar{x} = \frac{\sum x_i}{n}$$

$$20 = \frac{2k + 4 + 7 + 5 + k}{5}$$

$$100 = 3k + 16$$

$$84 = 3k$$

$$k = 28$$
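
The raw-data formula is easy to check numerically. The sketch below is illustrative only; the helper names `arithmetic_mean` and `missing_value` are assumptions made for this illustration. The second helper rearranges the formula as in Ex.2: the unknown observation equals n times the mean minus the sum of the known observations.

```python
def arithmetic_mean(observations):
    """Sum of all observations divided by the number of observations."""
    return sum(observations) / len(observations)


def missing_value(known, n, mean):
    """Unknown observation when the mean of all n observations is given."""
    return n * mean - sum(known)


weights = [20, 21, 25, 14, 30]          # Ex.1, weights in kg
print(arithmetic_mean(weights))         # 22.0

print(missing_value([2, 4, 9, 5], n=5, mean=6))  # Ex.2, value of a: 10
```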

b) For discrete frequency distribution:


If $x_1, x_2, \ldots, x_n$ are n observations along with the corresponding frequencies $f_1, f_2, \ldots, f_n$ then AM is given by

$$\text{AM} = \bar{x} = \frac{f_1 x_1 + f_2 x_2 + \dots + f_n x_n}{f_1 + f_2 + \dots + f_n}$$

$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$$

Frequency distribution table in this looks like,

Observations ($x_i$)    Frequencies ($f_i$)    $f_i x_i$

Ex.1. Calculate AM for following data.


Obs 10 20 30 40 50
freq 7 2 5 3 9

Ans: Prepare the following table.

Obs ($x_i$)    Freq ($f_i$)    $f_i x_i$
10             7               70
20             2               40
30             5               150
40             3               120
50             9               450
Total          $\sum f_i = 26$    $\sum f_i x_i = 830$

$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{830}{26} = 31.923$$

Ex.2. Find average from the information given below.


Age in Yrs 1 2 3 4 5 6
No. Of deaths 12 15 18 10 9 8

Ans: Prepare the following table.

Age in Yrs (𝒙𝒊 ) No. Of deaths (𝒇𝒊 ) 𝒇𝒊 𝒙𝒊


1 12 12
2 15 30
3 18 54
4 10 40
5 9 45
6 8 48
Total 72 229

$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{229}{72} = 3.18056$$
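
For a discrete frequency distribution the same calculation can be sketched as a frequency-weighted mean. The function name `frequency_mean` is an assumption for illustration; the data are those of Ex.2 above.

```python
def frequency_mean(values, freqs):
    """Mean of a discrete frequency distribution: sum(f*x) / sum(f)."""
    total_fx = sum(f * x for x, f in zip(values, freqs))
    return total_fx / sum(freqs)


ages = [1, 2, 3, 4, 5, 6]            # age in years
deaths = [12, 15, 18, 10, 9, 8]      # number of deaths

print(round(frequency_mean(ages, deaths), 5))
```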

c) For continuous frequency distribution:


This is also known as the step deviation method. For continuous class intervals, the AM is calculated as follows:
1. Prepare the distribution table containing the columns as follows,

Class intervals    Frequencies ($f_i$)    Mid values ($X_i$)    $u_i = \frac{X_i - A}{h}$    $f_i u_i$

2. Find the mid values ($X_i$) of all classes in the third column using the formula,
$$\text{class mark} = X = \frac{\text{lower limit} + \text{upper limit}}{2}$$
3. Take any mid value as an assumed mean A; for easy calculations consider the middle value as the assumed mean.
4. Calculate the step deviation in the fourth column using $u_i = \frac{X_i - A}{h}$.
5. Find the products of the frequency and step deviation columns, $f_i u_i$.
6. AM is calculated by,
$$\text{AM} = \bar{x} = A + \frac{\sum f_i u_i}{\sum f_i} \times h$$

Where, $A$ = assumed mean taken from class marks (X)
$f_i$ = Frequencies
$h$ = Class width
$u_i = \frac{X_i - A}{h}$, Step deviation

Ex.1. Calculate average marks by step deviation method from the following data.
Marks 0 – 10 10 – 20 20 - 30 30 - 40 40 - 50 50 - 60
No. Of students 42 44 58 35 26 15

Ans: This is a continuous frequency distribution with continuous classes.
Prepare the following table.

CI         Freq. ($f_i$)    Mid values ($X_i$)    $u_i = \frac{X_i - A}{h}$    $f_i u_i$
0-10       42               5                     -2                           -84
10-20      44               15                    -1                           -44
20-30      58               25 = A                0                            0
30-40      35               35                    1                            35
40-50      26               45                    2                            52
50-60      15               55                    3                            45
Total      220                                                                 4

$$\text{A.M.} = \bar{x} = A + \frac{\sum f_i u_i}{\sum f_i} \times h = 25 + \frac{4}{220} \times 10 = 25 + 0.1818 = 25.1818$$


Ex.2. Calculate mean by step deviation method.


Classes 0 – 10 10 – 20 20 – 30 30 - 40 40 - 50 50 - 60 60 - 70
Freq. 5 10 40 30 20 10 4

Ans: This is a continuous frequency distribution with continuous classes.
Prepare the following table.

CI         Freq. ($f_i$)    Mid values ($X_i$)    $u_i = \frac{X_i - A}{h}$    $f_i u_i$
0-10       5                5                     -3                           -15
10-20      10               15                    -2                           -20
20-30      40               25                    -1                           -40
30-40      30               35 = A                0                            0
40-50      20               45                    1                            20
50-60      10               55                    2                            20
60-70      4                65                    3                            12
Total      119                                                                 -23

$$\text{A.M.} = \bar{x} = A + \frac{\sum f_i u_i}{\sum f_i} \times h = 35 + \frac{-23}{119} \times 10 = 35 - 1.93 = 33.07$$
Ex.3. Calculate mean by step deviation method.
Classes 0-30 30-60 60-90 90-120 120-150 150-180
Freq. 8 13 22 27 18 7

Ans: This is a continuous frequency distribution with continuous classes.
Prepare the following table.

CI          Freq. ($f_i$)    Mid values ($X_i$)    $u_i = \frac{X_i - A}{h}$    $f_i u_i$
0-30        8                15                    -2                           -16
30-60       13               45                    -1                           -13
60-90       22               75 = A                0                            0
90-120      27               105                   1                            27
120-150     18               135                   2                            36
150-180     7                165                   3                            21
Total       95                                                                  55

$$\text{A.M.} = \bar{x} = A + \frac{\sum f_i u_i}{\sum f_i} \times h = 75 + \frac{55}{95} \times 30 = 75 + 17.37 = 92.37$$
Ex.4. Calculate average marks by step deviation method.
Marks 0-10 10-20 20-30 30-40 40-50 50-60
No. Of stud 42 44 58 35 26 15

Ans: This is a continuous frequency distribution with continuous classes.
Prepare the following table.

CI         Freq. ($f_i$)    Mid values ($X_i$)    $u_i = \frac{X_i - A}{h}$    $f_i u_i$
0-10       42               5                     -2                           -84
10-20      44               15                    -1                           -44
20-30      58               25 = A                0                            0
30-40      35               35                    1                            35
40-50      26               45                    2                            52
50-60      15               55                    3                            45
Total      220                                                                 4

$$\text{A.M.} = \bar{x} = A + \frac{\sum f_i u_i}{\sum f_i} \times h = 25 + \frac{4}{220} \times 10 = 25 + 0.18 = 25.18$$
Ex.5. Calculate mean by step deviation method.
Classes 0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80
Freq. 5 10 20 40 30 20 10 4

Ans: This is a continuous frequency distribution with continuous classes.
Prepare the following table.

CI         Freq. ($f_i$)    Mid values ($X_i$)    $u_i = \frac{X_i - A}{h}$    $f_i u_i$
0-10       5                5                     -3                           -15
10-20      10               15                    -2                           -20
20-30      20               25                    -1                           -20
30-40      40               A = 35                0                            0
40-50      30               45                    1                            30
50-60      20               55                    2                            40
60-70      10               65                    3                            30
70-80      4                75                    4                            16
Total      139                                                                 61

$$\text{A.M.} = \bar{x} = A + \frac{\sum f_i u_i}{\sum f_i} \times h = 35 + \frac{61}{139} \times 10 = 35 + 4.39 = 39.39$$
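
The six steps of the step deviation method can be sketched as follows. This is a minimal illustration under stated assumptions: all classes have the same width, and the function name and the `assumed_mean` parameter are hypothetical. The choice of assumed mean does not change the result; it only simplifies hand calculation.

```python
def step_deviation_mean(classes, freqs, assumed_mean=None):
    """Mean of a continuous frequency distribution by step deviation.

    classes: list of (lower limit, upper limit) pairs of equal width.
    """
    mids = [(lower + upper) / 2 for lower, upper in classes]   # class marks X
    h = classes[0][1] - classes[0][0]                          # class width
    # Default assumed mean A: a mid value near the middle of the table
    a = mids[len(mids) // 2] if assumed_mean is None else assumed_mean
    u = [(x - a) / h for x in mids]                            # step deviations
    sum_fu = sum(f * ui for f, ui in zip(freqs, u))
    return a + sum_fu / sum(freqs) * h


# Data of Ex.5
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80)]
freqs = [5, 10, 20, 40, 30, 20, 10, 4]

print(round(step_deviation_mean(classes, freqs, assumed_mean=35), 2))
```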
Merits of Mean:
(i) It is rigidly defined.
(ii) It is easy to understand and easy to calculate.
(iii) It is the unique value.
(iv) It is based upon all values of the given data.
(v) It is capable of further mathematical treatment.
(vi) It is not much affected by sampling fluctuations.
Demerits of Mean:
(i) It cannot be calculated if any observations are missing.
(ii) It cannot be calculated for the data with open end classes.
(iii) It is affected by extreme values.
(iv) It cannot be located graphically.
(v) It may be a number which is not present in the data.
(vi) It cannot be calculated for data representing a qualitative characteristic.

2. Median:
The median is the value which divides the data into two equal parts. Half of the observations are above the median and half are below it. It is determined by ranking the data and finding the middle observation. It is another frequently used measure of central tendency.
Calculation of Median:
Depending on types of data, there are following methods for the calculation of median:
a) For raw data:
Steps:
1. Arrange the given data in ascending order.
2. If the number of observations (n) is odd then the median is the exact central value,
i.e. $$\text{Median} = \text{observation at position } \frac{n+1}{2}$$
3. If the number of observations is even then there are two central values, say $x_1$ and $x_2$, such that $x_1$ is the $\left(\frac{n}{2}\right)$th observation and $x_2$ is the $\left(\frac{n}{2}+1\right)$th observation. Hence the median is the average of these two central values,
i.e. $$\text{Median} = \frac{x_1 + x_2}{2}$$


Ex.1. Find median: 61, 63, 60, 64, 65, 62, 63, 69, 68.
Ans: Arrange data in ascending order: 60, 61, 62, 63, 63, 64, 65, 68, 69
Here, $n = 9$ (odd no.)
Hence, $$\text{Median} = \text{observation at position } \frac{n+1}{2} = \frac{9+1}{2} = 5$$
Median = 5th observation = 63

Ex.2. Find median: 30, 60, 28, 35, 46, 47, 63, 64, 62, 32
Ans: Ascending order: 28, 30, 32, 35, 46, 47, 60, 62, 63, 64
Here, $n = 10$ (even no.)
Hence, central observations = observations at positions $\frac{n}{2}$ and $\frac{n}{2}+1$ = observations at positions 5 and 6
$$\text{Median} = \frac{\text{5th obs} + \text{6th obs}}{2} = \frac{46 + 47}{2} = 46.5$$
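
The odd and even cases can be sketched in one small function. The function name `median_raw` is an assumption for illustration; positions in the text are counted from 1, while Python lists are indexed from 0, so the $\left(\frac{n}{2}\right)$th observation is `data[n // 2 - 1]`.

```python
def median_raw(observations):
    """Median of raw data: middle value, or average of the two middle values."""
    data = sorted(observations)          # step 1: ascending order
    n = len(data)
    if n % 2 == 1:
        return data[n // 2]              # position (n + 1) / 2, counted from 1
    return (data[n // 2 - 1] + data[n // 2]) / 2   # positions n/2 and n/2 + 1


print(median_raw([61, 63, 60, 64, 65, 62, 63, 69, 68]))        # Ex.1: 63
print(median_raw([30, 60, 28, 35, 46, 47, 63, 64, 62, 32]))    # Ex.2: 46.5
```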

b) For discrete frequency distribution:


Steps:
1. Prepare the cumulative frequency (C.F.) less than type table.
2. Find the value of $\frac{N}{2}$ where $N = \sum f_i$.
3. See the C.F. just greater than $\frac{N}{2}$.
4. The observation corresponding to the C.F. just greater than $\frac{N}{2}$ is the Median.
The frequency distribution table in this looks like,
Observations ($x_i$)    Frequencies ($f_i$)    C.F.

Ex.1. Find median for following.


Obs 1 2 3 4 5 6 7 8 9
Freq 8 10 11 16 20 25 15 9 6
Ans: Prepare a table to find the cumulative frequency less than type.
𝒙𝒊 𝒇𝒊 CF
1 8 8
2 10 18
3 11 29
4 16 45
5 20 65
6 25 90
7 15 105
8 9 114
9 6 120
Total 120

$$\frac{N}{2} = \frac{120}{2} = 60$$
CF just greater than 60 = 65
Hence, median is the observation corresponding to C.F. 65
Median = 5
Ex.2. Find median for following.
Obs 5 10 15 20 25 30 35
Freq 1 3 13 17 27 36 38

Ans: Prepare a table to find the cumulative frequency less than type.


𝒙𝒊 𝒇𝒊 CF
5 1 1
10 3 4
15 13 17
20 17 34
25 27 61
30 36 97
35 38 135
Total 135
$$\frac{N}{2} = \frac{135}{2} = 67.5$$
CF just greater than 67.5 = 97
Hence, median is the observation corresponding to C.F. 97
Median = 30
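
The four steps can be sketched as a running total that stops at the first cumulative frequency exceeding N/2. This is an illustrative sketch; the function name is hypothetical and the observations are assumed to be listed in ascending order, as in the examples.

```python
def median_discrete(values, freqs):
    """Median of a discrete frequency distribution (values in ascending order)."""
    half = sum(freqs) / 2                # N / 2
    cumulative = 0
    for x, f in zip(values, freqs):
        cumulative += f                  # less than type C.F.
        if cumulative > half:            # first C.F. just greater than N / 2
            return x


print(median_discrete([1, 2, 3, 4, 5, 6, 7, 8, 9],
                      [8, 10, 11, 16, 20, 25, 15, 9, 6]))      # Ex.1: 5
print(median_discrete([5, 10, 15, 20, 25, 30, 35],
                      [1, 3, 13, 17, 27, 36, 38]))             # Ex.2: 30
```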
c) For continuous frequency distribution:
Steps:
1. Prepare the cumulative frequency (C.F.) less than type table.
2. Find the value of $\frac{N}{2}$ where $N = \sum f_i$.
3. See the C.F. just greater than $\frac{N}{2}$. The class interval corresponding to this C.F. is the Median class.
4. Find the C.F. of the pre-median class and denote it by ‘c’.
5. Hence Median is given by,
$$\text{Median} = L + \frac{h}{f}\left(\frac{N}{2} - c\right)$$

Where, $L$ = lower limit of Median class
$h$ = Width of Median class
$f$ = Frequency of Median class
$c$ = C.F. of Pre-median class

The frequency distribution table in this looks like,

Class intervals    Frequencies ($f_i$)    C.F.

Ex.1. Find median.


Class 20-30 30-40 40-50 50-60 60-70
Freq 14 23 27 21 15

Ans:
classes 𝒇𝒊 CF
20-30 14 14
30-40 23 37
40-50 27 64
50-60 21 85
60-70 15 100
Total 100

$$\frac{N}{2} = \frac{100}{2} = 50$$
CF just greater than 50 = 64
Hence, Median class = 40-50

Here, $L = 40$, $h = 10$, $f = 27$, $c = 37$

$$\text{Median} = L + \frac{h}{f}\left(\frac{N}{2} - c\right) = 40 + \frac{10}{27}(50 - 37) = 40 + 4.81 = 44.81$$

Ex.2. Calculate median.


Class 20-25 25-30 30-35 35-40 40-45 45-50
Freq 100 140 200 320 300 240


Ans: Prepare following frequency distribution table:

classes 𝒇𝒊 CF
20-25 100 100
25-30 140 240
30-35 200 440
35-40 320 760
40-45 300 1060
45-50 240 1300
Total 1300

$$\frac{N}{2} = \frac{1300}{2} = 650$$
CF just greater than 650 = 760
Hence, Median class = 35-40
Here, $L = 35$, $h = 5$, $f = 320$, $c = 440$

$$\text{Median} = L + \frac{h}{f}\left(\frac{N}{2} - c\right) = 35 + \frac{5}{320}(650 - 440) = 35 + 3.28 = 38.28$$
Ex.3. Find median from following data.
Class 0-10 10-20 20-30 30-40 40-50
Freq 8 15 22 15 8

Ans:

classes 𝒇𝒊 CF
0-10 8 8
10-20 15 23
20-30 22 45
30-40 15 60
40-50 8 68
Total 68

$$\frac{N}{2} = \frac{68}{2} = 34$$
CF just greater than 34 = 45
Hence, Median class = 20-30
Here, $L = 20$, $h = 10$, $f = 22$, $c = 23$

$$\text{Median} = L + \frac{h}{f}\left(\frac{N}{2} - c\right) = 20 + \frac{10}{22}(34 - 23) = 20 + 5 = 25$$
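
The median formula for grouped data can be sketched as below. This is a minimal illustration; the function name is an assumption, and classes are assumed to be continuous and given as (lower limit, upper limit) pairs in ascending order.

```python
def median_grouped(classes, freqs):
    """Median = L + (h / f) * (N/2 - c) for a continuous frequency distribution."""
    half = sum(freqs) / 2                          # N / 2
    c = 0                                          # C.F. of the pre-median class
    for (lower, upper), f in zip(classes, freqs):
        if c + f > half:                           # this is the median class
            return lower + (upper - lower) * (half - c) / f
        c += f


# Data of Ex.3
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [8, 15, 22, 15, 8]
print(median_grouped(classes, freqs))              # 25.0
```

Because the distribution in Ex.3 is symmetric, the median coincides with the centre of the middle class.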
Merits of Median:
(i) It is rigidly defined.
(ii) It is easy to understand and easy to calculate.
(iii) It is not affected by extreme values.
(iv) Even if extreme values are not known median can be calculated.
(v) It can be located just by inspection in many cases.
(vi) It is the unique value.
(vii) It is not much affected by sampling fluctuations.
(viii) It can be calculated for data based on ordinal scale i.e. on ordering.
Demerits of Median:
(i) It is not based upon all values of the given data.
(ii) For larger data sets, arranging the data in increasing order is a difficult process.
(iii) It is not capable of further mathematical treatment.
(iv) It is insensitive to some changes in the data values.

3. Mode:
The mode is the value that occurs most frequently in a set of observations. It is an
observation which repeats maximum number of times. The mean and median require a calculation
but the mode is found simply by counting the number of times each value occurs in a data set.
Sometimes we may come across a distribution having more than one mode. If there are two modes
then it is bimodal distribution. Likewise if there are more modes then it is multi-modal distribution.
Calculation of Mode:
a) For raw data:
An observation repeating maximum number of times is mode.


Ex.1. Find mode for following data: 61, 62, 63, 61, 63, 64, 64, 64, 60, 65.
Ans: Mode = observation repeating the maximum number of times = 64
b) For discrete frequency distribution:
An observation corresponding to the highest frequency in the table is mode.
Ex.1. Find mode.
Size 5 10 15 20 25 30 35
Freq 1 3 13 36 27 17 5

Ans: Mode = observation with the highest frequency (36) = 20
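
Both cases amount to counting, which can be sketched with the standard library. The function names are assumptions for illustration; `modes_raw` returns a list so that bimodal or multi-modal data are reported in full.

```python
from collections import Counter


def modes_raw(observations):
    """All observations that occur the maximum number of times."""
    counts = Counter(observations)
    highest = max(counts.values())
    return [x for x, count in counts.items() if count == highest]


def mode_discrete(values, freqs):
    """Observation with the highest frequency (first one if there are ties)."""
    return values[freqs.index(max(freqs))]


print(modes_raw([61, 62, 63, 61, 63, 64, 64, 64, 60, 65]))     # [64]
print(mode_discrete([5, 10, 15, 20, 25, 30, 35],
                    [1, 3, 13, 36, 27, 17, 5]))                # 20
```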

c) For continuous frequency distribution:


Steps:
1. Find the maximum frequency in the table, denoted by $f_m$.
2. The class interval corresponding to $f_m$ is the Modal class.
3. Find the frequency of the Pre-modal class ($f_1$) and the frequency of the Post-modal class ($f_2$).
4. Hence Mode is given by,
$$\text{Mode} = L + \frac{f_m - f_1}{2f_m - f_1 - f_2} \times h$$

Where, $L$ = lower limit of Modal class
$f_m$ = maximum frequency
$f_1$ = frequency of pre-modal class
$f_2$ = frequency of post-modal class
$h$ = width of modal class.
Ex.1. Calculate mode.
Classes 20-30 30-40 40-50 50-60 60-70 70-80 80-90
Freq 28 32 45 60 56 40 20

Ans:
Classes    20-30    30-40    40-50      50-60      60-70      70-80    80-90
Freq       28       32       45         60         56         40       20
                             ($f_1$)    ($f_m$)    ($f_2$)

Here, maximum frequency = $f_m$ = 60
Modal class = 50-60
Hence, $L = 50$, $h = 10$, $f_1 = 45$, $f_2 = 56$, $f_m = 60$

$$\text{Mode} = L + \frac{f_m - f_1}{2f_m - f_1 - f_2} \times h = 50 + \frac{60 - 45}{120 - 45 - 56} \times 10 = 50 + 7.89 = 57.89$$
Ex.2. Determine mode.
Classes    0-100    100-200    200-300    300-400    400-500
Freq       12       18         27         20         17

Ans:
Classes    0-100    100-200    200-300    300-400    400-500
Freq       12       18         27         20         17
                    ($f_1$)    ($f_m$)    ($f_2$)

Here, maximum frequency = $f_m$ = 27
Modal class = 200-300
Hence, $L = 200$, $h = 100$, $f_1 = 18$, $f_2 = 20$, $f_m = 27$

$$\text{Mode} = L + \frac{f_m - f_1}{2f_m - f_1 - f_2} \times h = 200 + \frac{27 - 18}{54 - 18 - 20} \times 100 = 200 + 56.25 = 256.25$$
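
The grouped-data mode formula can be sketched as follows. This is an illustration only: the function name is hypothetical, and treating a missing neighbouring class as having frequency zero (when the modal class is the first or last class) is an assumption of this sketch, not something stated in the text.

```python
def mode_grouped(classes, freqs):
    """Mode = L + (fm - f1) / (2*fm - f1 - f2) * h for continuous classes."""
    m = freqs.index(max(freqs))                       # position of the modal class
    fm = freqs[m]
    f1 = freqs[m - 1] if m > 0 else 0                 # pre-modal frequency (assumed 0 if absent)
    f2 = freqs[m + 1] if m < len(freqs) - 1 else 0    # post-modal frequency (assumed 0 if absent)
    lower, upper = classes[m]
    return lower + (fm - f1) * (upper - lower) / (2 * fm - f1 - f2)


# Data of Ex.1 (classes 20-30 to 80-90)
classes = [(20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80), (80, 90)]
freqs = [28, 32, 45, 60, 56, 40, 20]
print(round(mode_grouped(classes, freqs), 2))         # 57.89
```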
Measures of dispersion:
We have learnt about the various measures of central tendency. Measures of central tendency give us an idea of the concentration of the observations about the central part of the data, but they cannot describe the distribution completely. If we know only the average or mean of a certain distribution, we cannot form a complete idea about the observations of that distribution, because there may be different sets of observations having the same arithmetic mean. These sets of observations may differ or vary in their values about the measure of central tendency. A measure of central tendency is a single value that represents a characteristic such as age or height of a group of persons, while a measure of dispersion quantifies how much persons in the group vary from each other and from the measure of central tendency. But can the central tendency describe the data fully or adequately?
To understand it, consider the following example.


The daily income of the workers in two factories is:


Factory A 35 45 50 65 70 90 100
Factory B 60 65 65 65 65 65 70
Here, in both the factories the mean of the data is the same i.e. 65; but
(i) In factory A, the observations are much more scattered from the mean.
(ii) In factory B, almost all the observations are concentrated around the mean.
Thus, the two groups differ even though they have the same mean. Hence we need to differentiate between the two groups, and for that we need some measures which can measure the degree of scatteredness.
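
A quick numerical check makes the point concrete. The sketch below is illustrative only (the variable names are assumptions); it computes the mean of each factory and lists how far each income lies from that mean.

```python
factory_a = [35, 45, 50, 65, 70, 90, 100]
factory_b = [60, 65, 65, 65, 65, 65, 70]

summary = {}
for name, incomes in [("Factory A", factory_a), ("Factory B", factory_b)]:
    mean = sum(incomes) / len(incomes)
    deviations = [x - mean for x in incomes]      # distance of each income from the mean
    summary[name] = (mean, deviations)
    print(name, "mean =", mean, "deviations =", deviations)
```

Both means come out equal, while the deviations for factory A are far larger in size than those for factory B.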
Dispersion:
Scattering of data is also known as dispersion. According to W.I. King, “the term dispersion is
used to indicate the facts within a given group, the items differ from one another in size or in other
words, there is lack of uniformity in their sizes.” Spiegel defines it as, “the degree to which
numerical data tends to spread about an average value is called variation or dispersion of the data.”
Similarly, in the words of A.L. Bowley, “Dispersion is a measure of variation of the items.”
It is clear from these definitions that the deviation or variation of each observation from the
central value (i.e. mean, median or mode) is called dispersion or scattering of data. Dispersion is
defined as the degree of variation or deviation of each observation from the central value of the
distribution. The single value which describes the variability or scattering of observations from the central value is called a Measure of Dispersion. The measures of dispersion give the extent to which the observations vary from the average of the data. They help in studying the important characteristics of the data.
If 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 are the observations in the given data and A is any measure of central
tendency i.e. mean/median/mode then the deviation or dispersion is given by
Deviation = 𝑥𝑖 − 𝐴
Criteria for an Ideal Measure of Dispersion:
(i) It should be properly defined.
(ii) It should be easy to understand and easy to calculate.
(iii) It should be based on all the observations.
(iv) It should not be affected by the sampling fluctuations.
The following are the important measures of dispersion:
1. Range
2. Mean Deviation (MD)
3. Variance and Standard Deviation (SD)
4. Quartile Deviation (QD)
At the same time, we will calculate relative measures of dispersions as:


i. Coefficient of Range
ii. Coefficient of Mean Deviation
iii. Coefficient of Variation
iv. Coefficient of Quartile Deviation

1. Range:
Range is the quickest and simplest measure of dispersion. For a given set of data, range is defined as the difference between the highest (maximum) and lowest (minimum) observation; it takes no other observation into account. The range is often reported as “from (the minimum) to (the maximum),” i.e., two numbers.
Coefficient of Range: is given by

$$\text{Coefficient of Range} = \frac{L - S}{L + S} \times 100$$

Where, L = largest value of data
S = smallest value of data
a) For raw data and discrete frequency distribution:
Range is the difference between the highest value and the lowest value of the data. If L is the
largest (highest) value and S is the smallest (lowest) value of the observations in the data then
𝑅𝑎𝑛𝑔𝑒 = 𝐿 − 𝑆
Ex.1. The marks obtained by 10 students in Mathematics are given below. Find the range.
15, 16, 16, 29, 11, 23, 35, 25, 19, 20
Ans: Here, Largest value = 𝐿 = 35
Smallest value = 𝑆 = 11
𝑅𝑎𝑛𝑔𝑒 = 𝐿 − 𝑆 = 35 – 11 = 24.
Ex.2. Calculate range and coefficient of range for following data:
Marks 10 15 20 25 30
No. Of students 7 8 13 12 10
Ans: Here, Largest value = $L$ = 30
Smallest value = $S$ = 10
Range = $L - S$ = 30 - 10 = 20.
$$\text{Coefficient of Range} = \frac{L - S}{L + S} \times 100 = \frac{30 - 10}{30 + 10} \times 100 = \frac{20}{40} \times 100 = 50\%$$


b) For continuous frequency distribution table:


In case of continuous frequency distribution, range is the difference between upper limit of
highest class interval and lower limit of lowest class interval. It is also calculated as the difference
between the mid values of the highest and lowest class intervals.
Ex.1. Calculate range and coefficient of range:
Weights in kg. 50 – 55 55 - 60 60 - 65 65 - 70 70 - 75
No. Of students 12 18 23 10 3

Ans: Here, Highest class interval: 70-75 so, Largest value = $L$ = 75
Lowest class interval: 50-55 so, Smallest value = $S$ = 50
Range = $L - S$ = 75 - 50 = 25

$$\text{Coefficient of Range} = \frac{L - S}{L + S} \times 100 = \frac{75 - 50}{75 + 50} \times 100 = \frac{25}{125} \times 100 = 20\%$$
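
Range and its coefficient can be sketched in two one-line functions. The function names are assumptions for illustration. For a continuous distribution, L and S are taken from the class limits as in the example above.

```python
def value_range(largest, smallest):
    """Range = L - S."""
    return largest - smallest


def coefficient_of_range(largest, smallest):
    """Coefficient of range = (L - S) / (L + S) * 100, in percent."""
    return (largest - smallest) * 100 / (largest + smallest)


# Raw data: marks of 10 students
marks = [15, 16, 16, 29, 11, 23, 35, 25, 19, 20]
print(value_range(max(marks), min(marks)))            # 24

# Continuous distribution: lowest class 50-55, highest class 70-75
print(value_range(75, 50), coefficient_of_range(75, 50))   # 25 20.0
```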
Merits of Range:
(i) It is rigidly defined.
(ii) It gives rough but quick answer.
(iii) It is simple to understand and easy to calculate. It can be found by mere inspection.
(iv) It can be calculated from extreme values only. So we need not know the details of the series
to calculate the range.
Demerits of range:
(i) It is not representative since it is not based on all the observations of the series.
(ii) It is not capable of further algebraic treatment.
(iii) In case of open-end classes range cannot be determined exactly.
(iv) It is not a stable measure of dispersion and is very much affected by the fluctuations of
sampling.

2. Mean Deviation (MD):


Mean deviation is also known as average deviation. Mean deviation about any central value A, i.e. MD(A), is defined as the arithmetic mean of the absolute deviations of all observations from that measure of central tendency. While calculating mean deviation, the algebraic signs (+ or -) of the deviations are ignored and the deviations are taken as positive using the modulus $|x_i - A|$.

Coefficient of Mean Deviation: It is common for all the following three types of data and is given by,

$$\text{Coeff. of } MD(A) = \frac{MD(A)}{A} \times 100, \quad \text{where } A = \text{mean / median / mode}$$

a) For raw data:


Steps: If $x_1, x_2, \ldots, x_n$ are n observations then
1. Find the required measure of central tendency asked in the example.
2. Prepare a table of two columns and calculate the positive deviations in the second column as shown below.

$x_i$    $|x_i - A|$

3. Find the total of the deviations.
4. Then MD is calculated by,
$$MD(A) = \frac{\sum |x_i - A|}{n}, \quad \text{where } A = \text{mean / median / mode}$$

Ex.1. Calculate M.D. about median for: 1, 2, 3, 4, 5, 6, 7, 8, 9

Ans: Ascending order: 1, 2, 3, 4, 5, 6, 7, 8, 9
$n = 9$
Median = $A$ = 5

$x_i$    $|x_i - 5|$
1        4
2        3
3        2
4        1
5        0
6        1
7        2
8        3
9        4
Total    20

$$MD(5) = \frac{\sum |x_i - 5|}{9} = \frac{20}{9} = 2.222$$


Ex.2. Find mean deviation about mean and mode: 2, 5, 7, 8, 7, 6, 12, 3


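Ans: Mean for the given data,
$$\bar{x} = \frac{\sum x_i}{n} = \frac{50}{8} = 6.25 \cong 6$$
Mode for the given data is, mode = 7

$x_i$    $|x_i - 6|$    $|x_i - 7|$
2        4              5
5        1              2
7        1              0
8        2              1
7        1              0
6        0              1
12       6              5
3        3              4
Total    18             18

$$MD(6) = \frac{\sum |x_i - 6|}{8} = \frac{18}{8} = 2.25$$

$$MD(7) = \frac{\sum |x_i - 7|}{8} = \frac{18}{8} = 2.25$$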
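
The raw-data formula can be sketched in one function that accepts any central value. The function name is an assumption for illustration; the data and the central values 6 (rounded mean) and 7 (mode) are those of Ex.2.

```python
def mean_deviation(observations, centre):
    """MD(A) = sum(|x - A|) / n for raw data."""
    return sum(abs(x - centre) for x in observations) / len(observations)


data = [2, 5, 7, 8, 7, 6, 12, 3]
print(mean_deviation(data, 6))    # about the rounded mean: 2.25
print(mean_deviation(data, 7))    # about the mode: 2.25
```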

b) For discrete frequency distribution:


Steps: If $x_1, x_2, \ldots, x_n$ are n observations along with the corresponding frequencies $f_1, f_2, \ldots, f_n$ then
1. Find the required measure of central tendency asked in the example.
2. Prepare the frequency distribution table of four columns and calculate the positive deviations and their products with the corresponding frequencies as shown below.

$x_i$    $f_i$    $|x_i - A|$    $f_i |x_i - A|$

3. Find the total of the frequency column and of the last column ($f_i |x_i - A|$).
4. Then MD is calculated by
$$MD(A) = \frac{\sum f_i |x_i - A|}{\sum f_i}, \quad \text{where } A = \text{mean / median / mode}$$

Ex.1. Calculate M.D. from mean.


𝑥𝑖 10 11 12 13 14
𝑓𝑖 2 5 7 3 1

Ans:
$x_i$    $f_i$    $x_i \cdot f_i$    $|x_i - 12|$    $f_i |x_i - 12|$
10       2        20                 2               4
11       5        55                 1               5
12       7        84                 0               0
13       3        39                 1               3
14       1        14                 2               2
Total    18       212                                14

$$A = \bar{x} = \frac{\sum x_i f_i}{\sum f_i} = \frac{212}{18} = 11.78 \cong 12$$

$$MD(12) = \frac{\sum f_i |x_i - 12|}{\sum f_i} = \frac{14}{18} = 0.78$$

Ex.2. Calculate M.D. from median.


𝑥𝑖 1 2 3 4 5
𝑓𝑖 8 10 15 5 2
Ans:
$x_i$    $f_i$    c.f.    $|x_i - 3|$    $f_i |x_i - 3|$
1        8        8       2              16
2        10       18      1              10
3        15       33      0              0
4        5        38      1              5
5        2        40      2              4
Total    40                              35

$$\frac{N}{2} = \frac{\sum f_i}{2} = 20$$
c.f. just greater than 20 = 33
$A$ = median = 3

$$MD(3) = \frac{\sum f_i |x_i - 3|}{\sum f_i} = \frac{35}{40} = 0.875$$
Ex.3. Calculate M.D. from mode.
𝑥𝑖 10 12 15 17 19
𝑓𝑖 2 4 10 5 1

Ans:
$x_i$    $f_i$    $|x_i - 15|$    $f_i |x_i - 15|$
10       2        5               10
12       4        3               12
15       10       0               0
17       5        2               10
19       1        4               4
Total    22                       36

$A$ = mode = 15
$$MD(15) = \frac{\sum f_i |x_i - 15|}{\sum f_i} = \frac{36}{22} = 1.64$$

c) For continuous frequency distribution:


Steps: For mid-values $X_1, X_2, \ldots, X_n$ of the given class intervals with the corresponding frequencies $f_1, f_2, \ldots, f_n$
1. Find the required measure of central tendency asked in the example.
2. Prepare the frequency distribution table of five columns. Write the mid values ($X_i$) of the classes in the third column, then calculate the positive deviations of the mid values and their products with the corresponding frequencies as shown below.

Classes    $f_i$    Mid values ($X_i$)    $|X_i - A|$    $f_i |X_i - A|$

3. Find the total of the frequency column and of the last column ($f_i |X_i - A|$).
4. Then MD is calculated by
$$MD(A) = \frac{\sum f_i |X_i - A|}{\sum f_i}, \quad \text{where } A = \text{mean / median / mode}$$

Ex.1. Calculate M.D. from mean for following data.


Classes 0-10 10-20 20-30 30-40 40-50
Freq 5 8 15 16 6
Ans:
Classes    $f_i$    Mid values ($X_i$)    $f_i \cdot X_i$    $|X_i - 27|$    $f_i |X_i - 27|$
0-10       5        5                     25                 22              110
10-20      8        15                    120                12              96
20-30      15       25                    375                2               30
30-40      16       35                    560                8               128
40-50      6        45                    270                18              108
Total      50                             1350                               472

$$\bar{x} = \frac{\sum f_i X_i}{\sum f_i} = \frac{1350}{50} = 27$$

Hence,
$$MD(27) = \frac{\sum f_i |X_i - 27|}{\sum f_i} = \frac{472}{50} = 9.44$$

Ex.2. Calculate M.D. and coefficient of M.D about median for following data.
Classes 0-10 10-20 20-30 30-40 40-50
Freq 8 15 22 15 8

Ans:
Classes    $f_i$    C.F.    Mid values ($X_i$)    $|X_i - 25|$    $f_i |X_i - 25|$
0-10       8        8       5                     20              160
10-20      15       23      15                    10              150
20-30      22       45      25                    0               0
30-40      15       60      35                    10              150
40-50      8        68      45                    20              160
Total      68                                                     620

$$\frac{N}{2} = \frac{68}{2} = 34$$

C.F. just greater than 34 = 45
Median class = 20-30
$L = 20$, $h = 10$, $f = 22$, $c = 23$

$$\text{Median} = A = L + \frac{h}{f}\left(\frac{N}{2} - c\right) = 20 + \frac{10}{22}(34 - 23) = 20 + 5 = 25$$

Hence,
$$MD(25) = \frac{\sum f_i |X_i - 25|}{\sum f_i} = \frac{620}{68} = 9.12$$

$$\text{Coeff. of } MD(A) = \frac{MD(A)}{A} \times 100 = \frac{9.12}{25} \times 100 = 36.48\%$$


Ex.3. Calculate mean deviation (M.D.) about mean.


Classes 11-13 13-15 15-17 17-19 19-21 21-23 23-25
Freq 6 53 85 56 21 16 8

Ans:
Classes    $f_i$    Mid values ($X_i$)    $f_i \cdot X_i$    $|X_i - 16.92|$    $f_i |X_i - 16.92|$
11-13      6        12                    72                 4.92               29.52
13-15      53       14                    742                2.92               154.76
15-17      85       16                    1360               0.92               78.20
17-19      56       18                    1008               1.08               60.48
19-21      21       20                    420                3.08               64.68
21-23      16       22                    352                5.08               81.28
23-25      8        24                    192                7.08               56.64
Total      245                            4146                                  525.56

$$\bar{x} = \frac{\sum f_i X_i}{\sum f_i} = \frac{4146}{245} = 16.92$$

Hence,
$$MD(16.92) = \frac{\sum f_i |X_i - 16.92|}{\sum f_i} = \frac{525.56}{245} = 2.15$$

Ex.4. Calculate M.D. from median.


C.I. 25-30 30-35 35-40 40-45 45-50 50-55 55-60 60-65 65-70
Freq 6 12 17 30 10 10 8 5 2


Ans:
Classes  $f_i$  C.F.  Mid values ($X_i$)  $|X_i-42.5|$  $f_i|X_i-42.5|$
25-30 6 6 27.5 15 90
30-35 12 18 32.5 10 120
35-40 17 35 37.5 5 85
40-45 30 65 42.5 0 0
45-50 10 75 47.5 5 50
50-55 10 85 52.5 10 100
55-60 8 93 57.5 15 120
60-65 5 98 62.5 20 100
65-70 2 100 67.5 25 50
Total 100 715

$$\frac{N}{2} = \frac{100}{2} = 50$$

C.F. just greater than 50 = 65

Median class = 40 − 45
L = 40, h = 5, f = 30, c = 35
$$\text{Median} = A = L + \frac{h}{f}\left(\frac{N}{2} - c\right) = 40 + \frac{5}{30}(50 - 35) = 40 + 2.5 = 42.5$$
Hence,
$$MD(42.5) = \frac{\sum f_i |X_i - 42.5|}{\sum f_i} = \frac{715}{100} = 7.15$$

Ex.5. Calculate M.D. from median.


C.I. 20-25 25-30 30-35 35-40 40-45 45-50
Freq 10 14 20 36 30 24


Ans:
Classes  $f_i$  C.F.  Mid values ($X_i$)  $|X_i-38.5|$  $f_i|X_i-38.5|$
20-25 10 10 22.5 16 160
25-30 14 24 27.5 11 154
30-35 20 44 32.5 6 120
35-40 36 80 37.5 1 36
40-45 30 110 42.5 4 120
45-50 24 134 47.5 9 216
Total 134 806

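$$\frac{N}{2} = \frac{134}{2} = 67$$

C.F. just greater than 67 = 80

Median class = 35 − 40
L = 35, h = 5, f = 36, c = 44
$$\text{Median} = A = L + \frac{h}{f}\left(\frac{N}{2} - c\right) = 35 + \frac{5}{36}(67 - 44) = 35 + 3.19 = 38.19$$
The median is taken as 38.5 to simplify the deviations in the table.
Hence,
$$MD(38.5) = \frac{\sum f_i |X_i - 38.5|}{\sum f_i} = \frac{806}{134} = 6.015$$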

Merits of Mean Deviation:


(i) It is simple to understand and easy to calculate.
(ii) It is based on all the observations.
(iii) It shows the dispersion or scatter of the various items of a series from any measure of
central tendency.
(iv) It is not very much affected by the values of extreme items of a series.
(v) It facilitates comparison between different items of a series.
(vi) It truly represents the average of deviations of the items.


Demerits of Mean Deviation:


(i) It is not rigidly defined as it is calculated from any central value viz. Mean, Median, Mode etc.
and hence it can produce different results.
(ii) It violates the algebraic principle by ignoring the + and – signs while calculating the
deviations of the different items from the central value.
(iii) It is not capable of further algebraic treatment.
(iv) It is affected much by the fluctuations in sampling.
(v) It is difficult to calculate when the average comes out as a fraction or a recurring decimal,
because an approximate value must then be used.

3. Variance or Standard Deviation (SD):


Among all measures of dispersion Standard Deviation (or variance) is considered superior
because it possesses almost all the requisite characteristics of a good measure of dispersion.
Variance helps us in isolating the effects of various factors. To calculate variance, the deviations of
observations are squared and added. This addition is then divided by the total number of
observations. Thus, variance is defined as the arithmetic mean of squares of deviations taken from
the mean of given observations.
The positive square root of the variance is called the standard deviation i.e. the positive
square root of arithmetic mean of squares of deviations about mean is known as standard deviation.
$$\text{Variance} = SD^2 \quad \text{i.e.} \quad SD = \sqrt{\text{Variance}}$$

Coefficient of variation: It is the same for all three types of data and is given by

$$CV = \frac{\text{Standard Deviation}}{\text{Mean}} \times 100$$

a) For raw data:


Steps: If $x_1, x_2, \ldots, x_n$ are n observations then
1. Calculate the mean of the given observations.
2. Prepare a table of three columns: the observations in the first column, their deviations about the
mean in the second column and the squares of these deviations in the third column, as shown below.
$x_i$  $x_i-\bar{x}$  $(x_i-\bar{x})^2$

3. Find the total of the last column, $\sum (x_i - \bar{x})^2$.

4. Then SD is calculated as
$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}} \quad \text{where} \quad \bar{x} = \frac{\sum x_i}{n}$$
OR
$$\sigma = \sqrt{\frac{\sum x_i^2}{n} - (\bar{x})^2}$$
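
The two expressions for σ are algebraically the same. A short derivation, added here as a supporting step and using only the definition of the mean, expands the square and substitutes $\sum x_i / n = \bar{x}$:

```latex
\begin{aligned}
\sigma^2 &= \frac{1}{n}\sum (x_i - \bar{x})^2
          = \frac{1}{n}\sum \left(x_i^2 - 2\bar{x}\,x_i + \bar{x}^2\right) \\
         &= \frac{\sum x_i^2}{n} - 2\bar{x}\cdot\frac{\sum x_i}{n} + \bar{x}^2
          = \frac{\sum x_i^2}{n} - 2\bar{x}^2 + \bar{x}^2
          = \frac{\sum x_i^2}{n} - (\bar{x})^2
\end{aligned}
```

The second form is convenient when only n, $\sum x_i$ and $\sum x_i^2$ are known.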

Ex.1. Find S.D. and C.V.: 2, 5, 7, 4, 3, 9


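Ans: $$\text{mean} = \bar{x} = \frac{\sum x_i}{n} = \frac{30}{6} = 5$$

Prepare the table:

$x_i$  $|x_i-5|$  $(x_i-5)^2$
2 3 9
5 0 0
7 2 4
4 1 1
3 2 4
9 4 16
Total 34

$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}} = \sqrt{\frac{34}{6}} = \sqrt{5.67} = 2.38$$
$$CV = \frac{SD}{\bar{x}} \times 100 = \frac{2.38}{5} \times 100 = 47.6\%$$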
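
The same result can be reproduced in a few lines of Python. This is a sketch for checking hand calculations, with illustrative variable names. It computes σ from the definition and from the shortcut formula, and compares both with `statistics.pstdev`, the standard-library population standard deviation, which divides by n as this chapter does.

```python
import math
import statistics

data = [2, 5, 7, 4, 3, 9]
mean = statistics.fmean(data)

# Definition: root of the mean squared deviation (divide by n, not n - 1)
sd_definition = math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))
# Shortcut: root of (mean of squares minus square of mean)
sd_shortcut = math.sqrt(sum(x * x for x in data) / len(data) - mean ** 2)
# Library equivalent: population standard deviation
sd_library = statistics.pstdev(data)

cv = sd_definition / mean * 100
print(round(sd_definition, 2), round(cv, 1))  # 2.38 47.6
```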

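Ex.2. If $n = 10$, $\sum x_i = 40$ and $\sum x_i^2 = 520$, then find S.D.

Ans: Here, $$\bar{x} = \frac{\sum x_i}{n} = \frac{40}{10} = 4$$

Use formula,

$$\sigma = \sqrt{\frac{\sum x_i^2}{n} - (\bar{x})^2} = \sqrt{\frac{520}{10} - 4^2} = \sqrt{36} = 6$$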

Ex.3. For a certain distribution of 25 observations, the mean is 50 and the S.D. is 4. Find the coefficient of
variation (C.V.).

Ans: Use formula,

$$CV = \frac{SD}{\bar{x}} \times 100 = \frac{4}{50} \times 100 = 8\%$$


b) For discrete frequency distribution:


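If $x_1, x_2, \ldots, x_n$ are n observations with the corresponding frequencies $f_1, f_2, \ldots, f_n$, then the S.D. is
given by
$$\sigma = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i}} \quad \text{where} \quad \bar{x} = \frac{\sum f_i x_i}{\sum f_i}$$
OR
$$\sigma = \sqrt{\frac{\sum f_i x_i^2}{\sum f_i} - (\bar{x})^2}$$

The frequency distribution table is given by,

$x_i$  $f_i$  $x_i f_i$  $x_i-\bar{x}$  $(x_i-\bar{x})^2$  $f_i(x_i-\bar{x})^2$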

Ex.1. Find S.D. for following.


Age(Yr) 10 20 30 40 50
Freq 15 30 34 75 100

Ans: Prepare table as,


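$x_i$  $f_i$  $x_i f_i$  $|x_i-39|$  $(x_i-39)^2$  $f_i(x_i-39)^2$
10 15 150 29 841 12615
20 30 600 19 361 10830
30 34 1020 9 81 2754
40 75 3000 1 1 75
50 100 5000 11 121 12100
Total 254 9770 38374

$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{9770}{254} = 38.46$$
The mean is taken as 39 for ease of calculation.

$$\sigma = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i}} = \sqrt{\frac{38374}{254}} = \sqrt{151.08} = 12.29$$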
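
For a discrete frequency distribution the calculation can be sketched in Python as below. The function name `sd_frequency` is an illustrative assumption. The sketch uses the unrounded mean, so its result can differ slightly from a hand calculation that rounds the mean to a whole number first.

```python
import math


def sd_frequency(values, freqs):
    """Mean and standard deviation of values x_i with frequencies f_i.

    The sum of squared deviations is divided by the total frequency.
    """
    n = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, values)) / n
    variance = sum(f * (x - mean) ** 2 for f, x in zip(freqs, values)) / n
    return mean, math.sqrt(variance)


# Age data of Ex.1
ages = [10, 20, 30, 40, 50]
freqs = [15, 30, 34, 75, 100]
mean, sd = sd_frequency(ages, freqs)
print(round(mean, 2), round(sd, 2))
```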

c) For continuous frequency distribution:


For mid-values $X_1, X_2, \ldots, X_n$ of the given class intervals with the corresponding
frequencies $f_1, f_2, \ldots, f_n$, the S.D. is given by

$$\sigma = \sqrt{\frac{\sum f_i (X_i - \bar{x})^2}{\sum f_i}} \quad \text{where} \quad \bar{x} = \frac{\sum f_i X_i}{\sum f_i}, \quad X_i = \text{mid values}$$
OR
$$\sigma = \sqrt{\frac{\sum f_i X_i^2}{\sum f_i} - (\bar{x})^2}$$

The frequency distribution table is given by,

CI  $f_i$  $X_i$  $f_i X_i$  $X_i-\bar{x}$  $(X_i-\bar{x})^2$  $f_i(X_i-\bar{x})^2$

Ex.1. Find S.D.


Marks 0-20 20-40 40-60 60-80 80-100
Students 5 12 32 40 11
Ans: Prepare table as,

CI  $f_i$  $X_i$  $f_i X_i$  $|X_i-58|$  $(X_i-58)^2$  $f_i(X_i-58)^2$

0-20 5 10 50 48 2304 11520


20-40 12 30 360 28 784 9408
40-60 32 50 1600 8 64 2048
60-80 40 70 2800 12 144 5760
80-100 11 90 990 32 1024 11264
Total 100 5800 40000

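$$\bar{x} = \frac{\sum f_i X_i}{\sum f_i} = \frac{5800}{100} = 58$$

$$\sigma = \sqrt{\frac{\sum f_i (X_i - \bar{x})^2}{\sum f_i}} = \sqrt{\frac{40000}{100}} = \sqrt{400} = 20$$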
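
For grouped data the only change is that the class mid values stand in for the observations. A minimal Python sketch follows; the function name `grouped_sd` is an illustrative assumption. It returns the mean, the standard deviation and the coefficient of variation in one call.

```python
import math


def grouped_sd(classes, freqs):
    """Mean, SD and CV (%) of a grouped distribution, using class mid values."""
    mids = [(lo + hi) / 2 for lo, hi in classes]
    n = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, mids)) / n
    variance = sum(f * (x - mean) ** 2 for f, x in zip(freqs, mids)) / n
    sd = math.sqrt(variance)
    return mean, sd, sd / mean * 100


# Marks data of Ex.1
classes = [(0, 20), (20, 40), (40, 60), (60, 80), (80, 100)]
students = [5, 12, 32, 40, 11]
mean, sd, cv = grouped_sd(classes, students)
print(mean, sd, round(cv, 2))  # 58.0 20.0 34.48
```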

Ex.2. Calculate S.D. and C.V.


Class 0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80
Freq 5 10 20 40 30 20 10 4
Ans: Prepare table as,

CI  $f_i$  $X_i$  $f_i X_i$  $|X_i-40|$  $(X_i-40)^2$  $f_i(X_i-40)^2$

0-10 5 5 25 35 1225 6125


10-20 10 15 150 25 625 6250
20-30 20 25 500 15 225 4500
30-40 40 35 1400 5 25 1000
40-50 30 45 1350 5 25 750
50-60 20 55 1100 15 225 4500
60-70 10 65 650 25 625 6250
70-80 4 75 300 35 1225 4900
Total 139 5475 34275


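$$\bar{x} = \frac{\sum f_i X_i}{\sum f_i} = \frac{5475}{139} = 39.39 \approx 40$$

$$\sigma = \sqrt{\frac{\sum f_i (X_i - \bar{x})^2}{\sum f_i}} = \sqrt{\frac{34275}{139}} = \sqrt{246.58} = 15.70$$

$$CV = \frac{\sigma}{\bar{x}} \times 100 = \frac{15.70}{40} \times 100 = 39.25\%$$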

Ex.3. Calculate C.V.


Class 0-50 50-100 100-150 150-200 200-250 250-300
Freq 7 16 23 14 8 2
Ans: Prepare table as,

CI  $f_i$  $X_i$  $f_i X_i$  $|X_i-130|$  $(X_i-130)^2$  $f_i(X_i-130)^2$

0-50 7 25 175 105 11025 77175


50-100 16 75 1200 55 3025 48400
100-150 23 125 2875 5 25 575
150-200 14 175 2450 45 2025 28350
200-250 8 225 1800 95 9025 72200
250-300 2 275 550 145 21025 42050
Total 70 9050 268750
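
$$\bar{x} = \frac{\sum f_i X_i}{\sum f_i} = \frac{9050}{70} = 129.29 \approx 130$$

$$\sigma = \sqrt{\frac{\sum f_i (X_i - \bar{x})^2}{\sum f_i}} = \sqrt{\frac{268750}{70}} = \sqrt{3839.29} = 61.96$$

$$CV = \frac{\sigma}{\bar{x}} \times 100 = \frac{61.96}{130} \times 100 = 47.66\%$$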

Merits of Standard Deviation:


(i) It is rigidly defined.
(ii) It is based on all the observations of the series and hence it is representative.
(iii) It strictly follows algebraic principles: unlike the mean deviation, it never ignores the + and –
signs.
(iv) It is capable of further algebraic treatment.
(v) It is least affected by fluctuations of sampling.

Demerits of Standard Deviation:
(i) It is relatively difficult to calculate and understand.
(ii) It is more affected by extreme items.
(iii) It cannot be exactly calculated for a distribution with open-ended classes.
(iv) It cannot be used for comparing the dispersion of two, or more series given in different
units.


4. Quartile Deviation (QD):


Quartiles:
If we divide the given data into four equal parts then there are three values say Q1, Q2 and Q3
at four divisions which are known as quartiles i.e. quartiles are the three values which divide the
data into four equal parts. Each group is a quarter of given data. Q1 is called first quartile or lower
quartile, Q2 is second quartile or median and Q3 is the third quartile or upper quartile.
Quartile deviation is also known as the semi-interquartile range, because it is half the difference
between the third quartile and the first quartile. It gives the average amount by which the two
quartiles differ from the median, and its calculation is similar to that of the median. It is
given by

$$QD = \frac{Q_3 - Q_1}{2}$$
Coefficient of Quartile Deviation: It is the same for all the following three types of data and is given
by

$$\text{Coeff. of } QD = \frac{Q_3 - Q_1}{Q_3 + Q_1} \times 100$$

Calculation of Quartile Deviation:

a) For raw data:


Steps:
1. Arrange the observations in ascending order.
2. Find the positions of the quartiles.
(i) If the number of observations is divisible by 4 then
1st quartile = $Q_1 = \left(\frac{n}{4}\right)^{th}$ observation
3rd quartile = $Q_3 = \left(\frac{3n}{4}\right)^{th}$ observation

(ii) If the number of observations is not divisible by 4 then
1st quartile = $Q_1 = \left(\frac{n + 1}{4}\right)^{th}$ observation
3rd quartile = $Q_3 = \left(\frac{3(n + 1)}{4}\right)^{th}$ observation
Note: If a quartile lies between observations, the value of the quartile is the value of the lower
observation plus the specified fraction of the difference between the observations. For example, if
the position of a quartile is 20¼, it lies between the 20th and 21st observations, and its value is the
value of the 20th observation, plus ¼ the difference between the value of the 20th and 21st
observations.


3. Calculate quartile deviation using formula.


Ex.1. Calculate QD for the following data.
13, 7, 9, 15, 11, 5, 8, 4
Ans: Ascending order: 4, 5, 7, 8, 9, 11, 13, 15
Find the position of the 1st and 3rd quartiles.
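Since there are 8 observations, $n = 8$ (divisible by 4),
1st quartile = $Q_1 = \left(\frac{n}{4}\right)^{th}$ obs $= \left(\frac{8}{4}\right)^{th}$ obs = 2nd obs = 5
3rd quartile = $Q_3 = \left(\frac{3n}{4}\right)^{th}$ obs $= \left(\frac{3 \times 8}{4}\right)^{th}$ obs = 6th obs = 11

Hence, $$QD = \frac{Q_3 - Q_1}{2} = \frac{11 - 5}{2} = \frac{6}{2} = 3$$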
Ex.2. Following are the marks obtained by 10 students: 56, 48, 65, 35, 42, 75, 82, 60, 55, 50. Find the
quartile deviation and its coefficient.
Ans: Ascending order: 35, 42, 48, 50, 55, 56, 60, 65, 75, 82
Here, $n = 10$ (not divisible by 4)
1st quartile = $Q_1 = \left(\frac{n + 1}{4}\right)^{th}$ obs $= \left(\frac{10 + 1}{4}\right)^{th}$ obs
= 2.75th obs ... it lies between the 2nd & 3rd obs.

So, Q1 = 2nd obs + 0.75 (3rd obs – 2nd obs)
= 42 + 0.75 (48 – 42)
= 42 + 4.5
Q1 = 46.5
3rd quartile = $Q_3 = \left(\frac{3(n + 1)}{4}\right)^{th}$ obs $= \left(3 \times \frac{10 + 1}{4}\right)^{th}$ obs
= 8.25th obs ... it lies between the 8th & 9th obs.

So, Q3 = 8th obs + 0.25 (9th obs – 8th obs)
= 65 + 0.25 (75 – 65)
= 65 + 2.5
Q3 = 67.5
Hence,
$$QD = \frac{Q_3 - Q_1}{2} = \frac{67.5 - 46.5}{2} = 10.5$$
&
$$\text{Coeff. of } QD = \frac{Q_3 - Q_1}{Q_3 + Q_1} \times 100 = \frac{67.5 - 46.5}{67.5 + 46.5} \times 100 = \frac{21}{114} \times 100 = 18.42\%$$
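
The position rule and the interpolation note for raw data can be sketched in Python as follows. The function names are illustrative assumptions, and the sketch assumes at least four observations so that both quartile positions fall inside the data.

```python
def value_at_position(sorted_data, pos):
    """Value at a 1-based, possibly fractional, position (linear interpolation)."""
    lower = int(pos)      # whole part: rank of the lower observation
    frac = pos - lower    # fractional part of the position
    value = sorted_data[lower - 1]
    if frac > 0:
        value += frac * (sorted_data[lower] - sorted_data[lower - 1])
    return value


def quartile_deviation(data):
    """Return (Q1, Q3, QD, coefficient of QD in %) for raw data."""
    xs = sorted(data)
    n = len(xs)
    # Rule from the text: n/4 if n is divisible by 4, otherwise (n + 1)/4
    base = n / 4 if n % 4 == 0 else (n + 1) / 4
    q1 = value_at_position(xs, base)
    q3 = value_at_position(xs, 3 * base)
    return q1, q3, (q3 - q1) / 2, (q3 - q1) / (q3 + q1) * 100


print(quartile_deviation([13, 7, 9, 15, 11, 5, 8, 4])[:3])  # (5, 11, 3.0)
q1, q3, qd, coeff = quartile_deviation([56, 48, 65, 35, 42, 75, 82, 60, 55, 50])
print(q1, q3, qd, round(coeff, 2))  # 46.5 67.5 10.5 18.42
```

Statistical libraries use several different quartile conventions, so their results need not match this textbook rule exactly.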


b) For discrete frequency distribution:


Steps: It is obtained using cumulative frequency, as for the median.
1. Find the less-than cumulative frequencies and prepare the frequency table.
2. Calculate $\frac{N}{4}$, where $N = \sum f_i$.
3. Find the CF just greater than $\frac{N}{4}$; the corresponding observation is the 1st quartile Q1.
4. Now calculate $\frac{3N}{4}$.
5. Find the CF just greater than $\frac{3N}{4}$; the corresponding observation is the 3rd quartile Q3.
6. Calculate the quartile deviation using the formula.

Ex.1. Calculate quartile deviation.


No. of goals scored 0 1 2 3 4
No. of matches 1 9 7 5 3
Ans: Prepare table as,
𝑥𝑖 𝑓𝑖 c.f.
0 1 1
1 9 10
2 7 17
3 5 22
4 3 25
Total 25

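$$\frac{N}{4} = \frac{25}{4} = 6.25$$
C.f. just greater than 6.25 = 10
So, $Q_1$ = obs. corresponding to c.f. 10 = 1

$$\frac{3N}{4} = \frac{75}{4} = 18.75$$
C.f. just greater than 18.75 = 22
So, $Q_3$ = obs. corresponding to c.f. 22 = 3

Hence,
$$QD = \frac{Q_3 - Q_1}{2} = \frac{3 - 1}{2} = 1$$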
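
The cumulative-frequency lookup for a discrete distribution can be sketched in Python as below. The function name `discrete_quartile` is an illustrative assumption, and the values are assumed to be listed in ascending order.

```python
from itertools import accumulate


def discrete_quartile(values, freqs, k):
    """k-th quartile (k = 1, 2, 3): the first value whose cumulative
    frequency is just greater than k*N/4. Values must be ascending."""
    target = k * sum(freqs) / 4
    for x, cf in zip(values, accumulate(freqs)):
        if cf > target:
            return x
    return values[-1]


goals = [0, 1, 2, 3, 4]
matches = [1, 9, 7, 5, 3]
q1 = discrete_quartile(goals, matches, 1)
q3 = discrete_quartile(goals, matches, 3)
print(q1, q3, (q3 - q1) / 2)  # 1 3 1.0
```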

c) For continuous frequency distribution:


Steps: It is obtained using cumulative frequency, as for the median.
1. Find the less-than cumulative frequencies and prepare the frequency table.
2. Calculate $\frac{N}{4}$, where $N = \sum f_i$.
3. Find the CF just greater than $\frac{N}{4}$; the corresponding class is the Q1 class.
4. For the 1st quartile, use the formula $Q_1 = L + \frac{h}{f}\left(\frac{N}{4} - c\right)$
Where:
L = lower limit of Q1 class
h = width of Q1 class
f = frequency of Q1 class
c = cumulative frequency of the class preceding the Q1 class
5. Now calculate $\frac{3N}{4}$.
6. Find the CF just greater than $\frac{3N}{4}$; the corresponding class is the Q3 class.
7. For the 3rd quartile, use the formula $Q_3 = L + \frac{h}{f}\left(\frac{3N}{4} - c\right)$
Where:
L = lower limit of Q3 class
h = width of Q3 class
f = frequency of Q3 class
c = cumulative frequency of the class preceding the Q3 class
8. Calculate the quartile deviation using the formula.

Ex.1. Calculate quartile deviation and its coefficient for following data.
Classes 10–15 15–20 20-25 25-30 30-35 35-40 40-45 45-50
Freq 4 4 6 8 10 9 7 5

Ans: Prepare the following table


CI 𝑓𝑖 CF
10 – 15 4 4
15 – 20 4 8
20 – 25 6 14
25 – 30 8 22
30 – 35 10 32
35 – 40 9 41
40 – 45 7 48
45 - 50 5 53
Total $\sum f_i = 53$

For the 1st quartile,

$$\frac{N}{4} = \frac{\sum f_i}{4} = \frac{53}{4} = 13.25$$
Cf just greater than 13.25 = 14
Q1 class = 20 – 25
L = 20, f = 6, c = 8, h = 5
$$Q_1 = L + \frac{h}{f}\left(\frac{N}{4} - c\right) = 20 + \frac{5}{6}(13.25 - 8) = 20 + \frac{26.25}{6} = 20 + 4.375 = 24.375$$

For the 3rd quartile,

$$\frac{3N}{4} = 3 \times \frac{\sum f_i}{4} = 3 \times \frac{53}{4} = 3 \times 13.25 = 39.75$$
Cf just greater than 39.75 = 41
Q3 class = 35 – 40
L = 35, f = 9, c = 32, h = 5
$$Q_3 = L + \frac{h}{f}\left(\frac{3N}{4} - c\right) = 35 + \frac{5}{9}(39.75 - 32) = 35 + \frac{38.75}{9} = 35 + 4.305 = 39.305$$
Hence,
$$QD = \frac{Q_3 - Q_1}{2} = \frac{39.305 - 24.375}{2} = 7.465$$
&
$$\text{Coeff. of } QD = \frac{Q_3 - Q_1}{Q_3 + Q_1} \times 100 = \frac{39.305 - 24.375}{39.305 + 24.375} \times 100 = \frac{14.93}{63.68} \times 100 = 23.45\%$$
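
For a continuous distribution, the quartile class search and the interpolation formula can be combined in one helper. The following Python sketch assumes contiguous classes; the function name `grouped_quartile` is illustrative.

```python
def grouped_quartile(classes, freqs, k):
    """k-th quartile by L + (h / f) * (k*N/4 - c) for contiguous classes."""
    target = k * sum(freqs) / 4
    cum = 0  # cumulative frequency of the classes before the current one (c)
    for (lo, hi), f in zip(classes, freqs):
        if cum + f > target:  # first CF just greater than k*N/4
            return lo + (hi - lo) * (target - cum) / f
        cum += f
    raise ValueError("quartile class not found")


classes = [(10, 15), (15, 20), (20, 25), (25, 30),
           (30, 35), (35, 40), (40, 45), (45, 50)]
freqs = [4, 4, 6, 8, 10, 9, 7, 5]
q1 = grouped_quartile(classes, freqs, 1)
q3 = grouped_quartile(classes, freqs, 3)
qd = (q3 - q1) / 2
coeff = (q3 - q1) / (q3 + q1) * 100
print(round(q1, 3), round(q3, 3), round(qd, 3), round(coeff, 2))
```

Calling the helper with `k = 2` gives the grouped median by the same formula.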
Merits of Quartile Deviation:
(i) It is easy to calculate and simple to follow.
(ii) It is not affected by the extreme values and is, therefore, useful in skewed distributions.
(iii) It is the only method of dispersion applicable in case of ‘open-end classes’.
Demerits of Quartile Deviation:
(i) Since Quartile Deviation is based on Quartiles, sometimes it is not rigidly defined.
(ii) It is not based on all the observations in the series. Hence it is not representative.
(iii) It is not capable of further algebraic treatment.
(iv) It is not a stable measure of dispersion as it is affected very much by fluctuations of
sampling.


Exercise
1. What do you mean by statistics? Add note on statistical data.
2. Explain following terms in frequency distribution and data;
i. Individual data
ii. Discrete data
iii. Grouped data
iv. Classes
v. Frequency
vi. Class limits
vii. Class boundaries
viii. Class frequency
ix. Class interval

3. The given data contain the weights (in kg) of a group of 60 students. Prepare a frequency table taking
the class width as 10 kg and the first class as 40 to less than 50.
50 52 86 94 49 90 76 96 64 70
69 80 79 73 81 110 84 67 77 65
74 60 115 61 83 72 79 103 51 78
71 66 77 84 42 69 80 68 104 79
54 59 100 53 76 50 78 63 95 42
40 82 41 75 63 113 98 43 55 76

4. Prepare a discrete frequency table for following data containing number of defectives in a lot.
2, 3, 1, 0, 1, 2, 1, 0, 1, 4, 5, 3, 2, 1, 0, 1, 3, 4, 1 , 5, 4, 3, 1, 0, 0, 1, 0, 2, 3, 1, 2, 4, 5, 0, 1, 0, 1,
0, 2, 4, 3, 5, 0, 1, 3, 2, 1, 0, 2, 2, 3, 0, 1, 3, 4, 0, 1, 3, 2, 5, 0, 1, 2.
5. For each of the given frequency distribution draw Histogram, Frequency polygon and
Cumulative frequency polygon
i.
Weight in kg 80-90 90-100 100-110 110-120 120-130
No. of workers 07 11 15 08 04

ii. Classes 0-10 10-20 20-30 30-40 40-50 50-60 60-70


Frequency 03 04 06 10 11 09 05

6. Following table gives the birth rate per thousand of different countries over certain period.
Represent the given data by a suitable diagram plotting the countries against their birth rate.


Country Birth rate


India 41
Pakistan 35
Bangladesh 30
Srilanka 25
USA 20
UK 15

7. Draw a pie diagram for the following data of seventh five year plan of Government.
Agriculture 14%
Irrigation 13%
Health 27%
Education 15%
Social Development 16%
Employment 16%

Note: The angle at the centre is given by $\frac{\text{Percentage of item}}{100} \times 360$
8. Draw a pie diagram to represent the following data of population in a town;
Males 2000
Females 1800
Boys 4200
Girls 2000
Total 10000

9. Find the average from following data


52,69,93,72,56,85,73,66,94,85.
10. Calculate the average for given data
Marks 0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80 80-90
No. of Students 00 02 03 07 13 13 09 02 01

11. The given data contain the marks obtained by a batch of 10 students in a certain class test. Calculate
the median marks.
28, 35, 46, 47, 60, 30, 32, 62, 64, 63.
12. Calculate median weight from given weight in grams
68, 66, 35, 42, 26, 85, 44, 80, 33, 72.


13. Calculate the median


Value Below 100 100-200 200-300 300-400 400 and above
Frequency 50 90 158 68 134

14. Calculate mean, mode and median for given data


X: 12, 13, 17, 18, 19, 19, 21, 22, 21, 27, 24, 30, 31, 31
15. Find the mean and mode from table
Classes 10-25 25-40 40-55 55-70 70-85 85-100
No. of students 06 50 44 26 03 01

16. Calculate mode and median


Monthly wage 20-30 30-40 40-50 50-60 60-70 70-80 80-90
No. of employees 28 32 45 60 56 40 20
17. From the following cumulative frequency table find the mean, median and mode.
Size below 5 10 15 20 25 30 35
Frequency 1 3 13 17 27 36 38
18. From the index numbers given below calculate the range and its coefficient.
188, 178, 173, 164, 172, 183, 184, 185, 211, 217, 232, 240.
19. Calculate mean deviation about mean
X 10 11 12 13
f 04 11 20 15
20. Calculate mean deviation from median
X: 3484, 4572, 4124, 3682, 5624, 4388, 3680, 4308.
21. Calculate mean deviation from median and its coefficient.
Classes 25-30 30-35 35-40 40-45 45-50 50-55 55-60 60-65 65-70
Freq 06 12 17 30 10 10 08 05 02
22. Calculate standard deviation from the following data;
Marks 0-10 10-20 20-30 30-40 40-50 50-60 60-70
No. of Students 05 07 14 12 09 06 02
23. Find S.D. of the data: 1, 2, 3, 4, 5, 6, 7, 8, 9
24. Calculate S.D. for: 15, 12, 9, 18, 21, 15
25. The coefficient of variation of certain distribution is 4 and mean is 60. Find S.D.
26. Calculate mean and S. D. from given data
Monthly pension in Rs. 40 50 60 70 80 100
No. of persons 03 06 04 09 03 05


27. Calculate S.D. and C.V.


Class 0-10 10-20 20-30 30-40 40-50 50-60 60-70
Freq 4 6 10 20 10 6 4

28. Calculate S.D. and C.V.


Class 0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80
Freq 1 19 30 80 70 26 10 4

29. Calculate S.D. by step deviation method


Class 140-160 160-180 180-200 200-220 220-240 240-260
Freq 12 13 55 40 35 28

30. Calculate quartile deviation from given data


Age 0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80
No. of persons 15 15 25 22 25 10 05 00

31. The following are the goals scored by a team. Calculate Q.D.
No. of goals scored 0 1 2 3 4
No. of matches 1 9 7 5 3

32. Lives of two models of refrigerators in a recent survey are;


Life (in years) Model A Model B
0-2 05 02
2-4 16 07
4-6 13 12
6-8 07 19
8-10 05 09
10-12 04 01
An analysis of the monthly wages paid to workers in two firms M and N of the same industry gives
the following results,
Parameters Firm M Firm N
No. of wage earners 58.6 648
Avg. monthly wages 52.5 47.5
Variance 100 121


2. Probability

Introduction:
‘What is probability?’ Nobody has a really good answer to this question. It is the language
which we use to explain uncertainty. The theory of probability originated in games of chance
(gambling). The correspondence between two French mathematicians, Blaise Pascal and Pierre
de Fermat, gave rise to the study of probability. Throughout the 18th century, the application of
probability moved from games of chance to scientific problems. In the study of statistics, we are
concerned basically with the presentation and interpretation of chance outcomes that occur in a
planned study or scientific investigation. Statisticians use the word experiment to describe any
process that generates a set of data.
Probability is a part of our everyday lives. Modern research in probability theory is closely
related to the field of measure theory. The development of probability theory has been stimulated
by the variety of its applications. Statistics is one important branch of applied probability. One of
the difficulties in developing theory of probability is the definition of probability. The search for a
widely acceptable definition took centuries and was marked by controversy. The matter was finally
resolved in the 20th century by treating probability theory on an axiomatic basis.
Probability:
Probability is the branch of statistics that studies the possible outcomes of given events
together with their likelihoods and distributions. In common use, word probability is used to mean
the chance that a particular event will occur.
e.g. It is likely to rain, there is 60% chance that India will win the match, etc.
Probability is the measure of the likelihood that an event will occur. Probability is
quantified as a number between 0 and 1 (where 0 indicates impossibility and 1 indicates
certainty). The higher the probability of an event, the more certain we are that the event will occur. The
topic of probability is seen in many facets of the modern world. The theory of probability is not just
taught in mathematical courses, but can be seen in practical fields, such as insurance, industrial
quality control, study of genetics, quantum mechanics, and the kinetic theory of gases.
To make the concept of probability clear, we first need some basic terms:
1. Random Experiment:
It is a repeatable action in which all the possible results are known but the exact result is not
known in advance.
e.g. “Tossing of a coin” is a random experiment, because we know the possible results in advance, i.e. either
Head or Tail, but the exact one is not known.
2. Outcomes:
The results of the random experiments are called outcomes.


E.g. Consider the random experiment: A die is thrown.
We may get any one of the numbers 1, 2, 3, 4, 5 or 6 on the uppermost face of the die.
So this experiment has six outcomes.
3. Sample space:
The set of all possible outcomes of a random experiment is called the sample space. It is
denoted by ‘S’ or ‘Ω’. Each outcome in the sample space is an element or member of that sample space.
E.g. "Tossing of two coins at a time" has the sample space
$S = \{HH, HT, TH, TT\}$
4. Equiprobable (Equally likely) sample space.
When each outcome of the random experiment has an equal chance of happening, we say
that the sample space is equiprobable. It is signalled by words such as ‘unbiased’, ‘fair’ and ‘well shuffled’, as in the following statements.
(i) An unbiased coin is tossed.
(ii) A fair dice is thrown.
(iii) A card is drawn from the well shuffled pack of 52 cards.
(iv) A book is selected at once. Etc.
5. Event:
Any subset of a sample space is called an event. More than one event can occur in a random
experiment. Events are generally denoted by the capital letters like A, B, C ... etc.
E.g. The random experiment of “throwing a fair die” has the following events:
i) The number on the uppermost face of the die is odd.
ii) The number on the uppermost face of the die is divisible by 2.
Sample space, $S = \{1, 2, 3, 4, 5, 6\}$
Now,
Let A be the event that the no. on the upper face of the die is odd.
So, $A = \{1, 3, 5\}$
Let B be the event that the no. on the upper face is divisible by 2.
So, $B = \{2, 4, 6\}$
Here, both the sets A and B are subsets of S.
Types of Events:
i) Simple event: An event containing a single sample point is called a simple event.
E.g. If two coins are tossed then $S = \{HH, HT, TH, TT\}$
Let A be the event that both coins show tail;
so $A = \{TT\}$, a single point.
ii) Sure/certain event: An event which contains all the sample points of the sample space is called
as sure/certain event.
e.g. A card is drawn at random from the well shuffled pack of 52 cards.


Let B be the event that card drawn is Red or Black. This event has outcomes of all 52 cards; so it is a
sure event.
iii) Impossible event: An event which does not contain any sample point of the sample space is an
impossible event.
e.g. A die is thrown. 𝑆 = {1, 2, 3, 4, 5, 6}
Let C be the event that number on upper face is greater than 6.
So 𝐶 = { } = ∅
iv) Complementary event: Let A be an event of the sample space S. Then the complement of event
A is the set containing the points in S but not in A. It is denoted by $A'$ or $A^c$.
e.g. A die is thrown. 𝑆 = {1, 2, 3, 4, 5, 6}
Let D be the event of getting odd no. 𝐷 = {1, 3, 5}
Then complement of D is 𝐷 𝑐 = {2, 4, 6}.
v) Mutually exclusive events: Two events A and B are said to be mutually exclusive or disjoint
events if they have no common point i.e. $A \cap B = \emptyset$
e.g. Throwing of a die. $S = \{1, 2, 3, 4, 5, 6\}$
Let A be the event of getting an even no. on the upper face. $A = \{2, 4, 6\}$
Let B be the event of getting an odd no. on the upper face. $B = \{1, 3, 5\}$
Here $A \cap B = \emptyset$ i.e. they have no common elements.
So, A and B are mutually exclusive events.
vi) Exhaustive events: Two or more events are said to be exhaustive events if their union is the
sample space i.e. if A and B are events of S, then A and B are exhaustive if $A \cup B = S$
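
Because events are subsets of the sample space, the event types above map directly onto set operations. The following Python sketch uses the die example; the variable names are illustrative. It checks each definition with built-in set methods.

```python
S = {1, 2, 3, 4, 5, 6}               # sample space for one throw of a die
A = {x for x in S if x % 2 == 0}     # even number on the upper face
B = {x for x in S if x % 2 == 1}     # odd number on the upper face
C = {x for x in S if x > 6}          # number greater than 6

print(A.isdisjoint(B))   # True: A and B are mutually exclusive
print(A | B == S)        # True: A and B are exhaustive
print(S - A == B)        # True: B is the complement of A
print(C == set())        # True: C is an impossible event
```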

Permutations and combinations:


The central theme of theory of permutations and combinations is to solve the counting
problems without doing any actual counting, which is inconvenient and at times difficult within
human limitations when the number of logical possibilities of an event is large.
Factorial notation: For any natural number n, the product (multiplication) of the first n natural
numbers is denoted by n! and read as n factorial.
e.g. 5! = 5 × 4 × 3 × 2 × 1 = 120
Similarly, 100! = 1 × 2 × 3 × 4 × … × 100
Permutation:
A permutation is an arrangement in a definite order of number of objects taken some or all
at a time.
e.g. Consider the three-digit number 456. We want to make different numbers from these three digits
by taking two digits at a time, under the assumption that no digit is repeated.
In this case, the two-digit numbers formed are 45, 46, 54, 56, 64 and 65.

62
Applied Biostatistics: An Essential tool in Helathcare Profession

Hence, the number of permutations of n different objects taken r at a time is the total number of ways in
which r of the n objects can be arranged in a line, and it is given by
$$^nP_r = \frac{n!}{(n - r)!}$$

In particular, if $r = n$ then
$$^nP_n = n!$$

Ex.1. In how many ways 5 different objects can be arranged by taking 2 at a time?
Ans: Here, n = 5 and r = 2
$$^5P_2 = \frac{5!}{(5 - 2)!} = \frac{5 \times 4 \times 3 \times 2 \times 1}{3 \times 2 \times 1} = 20 \text{ ways}$$

Hence, there are 20 ways to arrange 5 different objects, taken 2 at a time.

Ex.2. Calculate the number of ways in which three people from a group of seven people can be
seated in a row.
Ans: This is a case of permutation since the order is important.
Here, 𝑛 = 7 𝑟 = 3
The number of possible ways is:
$$^7P_3 = \frac{7!}{(7 - 3)!} = \frac{7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1}{4 \times 3 \times 2 \times 1} = 210 \text{ ways}$$
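
These counts can be verified with Python's standard library. `math.perm(n, r)` returns n!/(n − r)! and `itertools.permutations` lists the arrangements themselves. The snippet is a small check, not part of the original text.

```python
import math
from itertools import permutations

print(math.perm(5, 2))   # 20
print(math.perm(7, 3))   # 210

# Two-digit numbers formed from the digits 4, 5 and 6 without repetition
two_digit = ["".join(p) for p in permutations("456", 2)]
print(two_digit)  # ['45', '46', '54', '56', '64', '65']
```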

Combinations:
A combination is a group of objects, irrespective of order, taken some or all at a time.
E.g. suppose there is a group of three students say X, Y and Z. We have to make different groups
containing two students in each group.
In this case, the groups formed are,
XY, XZ, YZ. (Here, XY is similar to YX.)
Hence, total number of combinations of n different objects taken r at a time is given by
$$^nC_r = \frac{n!}{r!\,(n - r)!}$$

In particular, if $r = n$ then
$$^nC_n = 1$$

Ex.1. Find the total number of combinations of 10 objects taken 5 at a time.


Ans: Here, n = 10 and r = 5
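$$^{10}C_5 = \frac{10!}{5!\,(10 - 5)!} = \frac{10 \times 9 \times 8 \times 7 \times 6 \times 5!}{5! \times 5!} = \frac{10 \times 9 \times 8 \times 7 \times 6}{5 \times 4 \times 3 \times 2 \times 1} = 252 \text{ combinations}$$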

Hence there are 252 combinations of 10 objects taken 5 at a time.


Ex.2. Calculate the number of combinations in which three people can be selected from a group of
seven.
Ans: Here the order is not important so it is case of combination.
Here, 𝑛 = 7 𝑟 = 3
The number of possible combinations is:
$$^7C_3 = \frac{7!}{3!\,(7 - 3)!} = \frac{7 \times 6 \times 5 \times 4!}{3! \times 4!} = \frac{7 \times 6 \times 5}{3 \times 2 \times 1} = 35 \text{ combinations}$$

Example to show the relationship between permutation and combination:

Objects (n)   Taken at a time (r)   Combinations (nCr)   Permutations (nPr)
P, Q          2                     PQ                   PQ, QP
P, Q, R       2                     PQ, PR, QR           PQ, PR, QP, QR, RP, RQ
P, Q, R       3                     PQR                  PQR, PRQ, QPR, QRP, RPQ, RQP

Thus, since $^nP_r = r! \times {}^nC_r$, the number of permutations is never less than the number of
combinations, and it is greater whenever r is at least 2.
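
The relationship in the table can be illustrated with a short Python check. Each combination of r objects can be ordered in r! ways, so nPr = r! × nCr, and the two counts coincide only when r is 0 or 1. The snippet uses only standard-library functions; the variable names are illustrative.

```python
import math
from itertools import combinations, permutations

objects = "PQR"
r = 2
combos = ["".join(c) for c in combinations(objects, r)]
perms = ["".join(p) for p in permutations(objects, r)]
print(combos)  # ['PQ', 'PR', 'QR']
print(perms)   # ['PQ', 'PR', 'QP', 'QR', 'RP', 'RQ']

# Each combination can be ordered in r! ways
print(len(perms) == math.factorial(r) * len(combos))  # True
print(math.comb(10, 5), math.comb(7, 3))  # 252 35
```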
Classical Definition of Probability:
Statistically, the term probability can be defined in the following way:
If S is the sample space with n outcomes of a random experiment and A is an event with m
outcomes, then the probability of event A is denoted by P(A) and is defined as
$$P(A) = \frac{\text{No. of outcomes in } A}{\text{No. of outcomes in } S} = \frac{m}{n}$$

In short, if
n(S) = no. of elements in sample space S
n(A) = no. of elements in event A
then
$$P(A) = \frac{n(A)}{n(S)}$$

Probability Axioms and simple Properties:


Probabilities, however assigned, must satisfy three specific axioms:
Axiom 1: for any event A, 𝑃(𝐴) ≥ 0.
Axiom 2: 𝑃(𝑆) = 1.
Axiom 3: For any sequence of disjoint events $A_1, A_2, \ldots, A_n$,
$$P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} P(A_i)$$


These axioms are all we need to develop a theory of probability, but there is a collection of
commonly used properties which follow directly from these axioms, and which we make extensive
use of when carrying out probability calculations.
Property A: Probability of Complementary event
𝑃 (𝐴𝑐 ) = 1 – 𝑃 (𝐴).
Property B: 𝑃(∅) = 0
Property C: If 𝐴 ⊆ 𝐵, then 𝑃(𝐴) ≤ 𝑃(𝐵).
Property D: Addition Property
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
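
Properties A to D can be checked on a small equiprobable sample space. The sketch below uses one throw of a fair die and exact fractions; the helper `prob` and the events chosen are illustrative assumptions.

```python
from fractions import Fraction

S = set(range(1, 7))  # one throw of a fair die


def prob(event):
    """Classical probability n(A) / n(S) on the equiprobable space S."""
    return Fraction(len(event & S), len(S))


A = {1, 3, 5}   # odd number
B = {1, 2, 3}   # number less than 4

print(prob(S - A) == 1 - prob(A))                        # Property A
print(prob(set()) == 0)                                  # Property B
print(prob({1, 3}) <= prob(A))                           # Property C
print(prob(A | B) == prob(A) + prob(B) - prob(A & B))    # Property D
print(prob(A | B))  # 2/3
```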

Ex.1. In a box, there are 5 Aspirin, 6 Analgin and 10 Paracetamol. If one tablet is chosen at random
find the probability that:
i) it is Analgin.
ii) it is Aspirin or Paracetamol.
Ans: There are 5 + 6 + 10 = 21 tablets in the box.
n(S) = 21

i) Let A be the event that the tablet chosen is Analgin.
n(A) = 6
$$P(A) = \frac{n(A)}{n(S)} = \frac{6}{21} = 0.2857$$

ii) Let B be the event that the tablet chosen is Aspirin or Paracetamol.
There are 5 Aspirin + 10 Paracetamol tablets
n(B) = 15
$$P(B) = \frac{n(B)}{n(S)} = \frac{15}{21} = 0.7143$$

Ex.2. A card is selected at random from well shuffled pack of 52 cards. Find the probability of
getting
i) a face card ii) a red card iii) not a club card.
Ans: There are total 52 cards in a pack.
𝑛(𝑆) = 52

i) Let A be the event of getting a face card.
Number of face cards = 3 (Jack, Queen, King) in each of the 4 suits = 12
n(A) = 12
$$P(A) = \frac{n(A)}{n(S)} = \frac{12}{52} = 0.2308$$

ii) Let B be the event of getting a red card.
Number of red cards = 13 Heart + 13 Diamond
n(B) = 26
$$P(B) = \frac{n(B)}{n(S)} = \frac{26}{52} = 0.50$$

iii) Let C be the event of getting a club card.
$C^c$ = complement of C i.e. not getting a club card
Number of club cards = 13
n(C) = 13
$$P(C) = \frac{n(C)}{n(S)} = \frac{13}{52} = 0.25$$

Now, $P(C^c) = 1 - P(C) = 1 - 0.25 = 0.75$

Ex.3. A pair of fair dice is thrown. Find the probability of getting,


(i) A number greater than 4 on each die.
(ii) Odd number on first die and 5 on second die.
(iii) Sum of points is 10
(iv) Same points on both dice.
Ans: A pair of fair dice is thrown.
𝑆 = {(1,1), (1,2), (1,3), (1,4), (1,5), (1,6), (2,1), (2,2), (2,3), (2,4), (2,5), (2,6),
(3,1), (3,2), (3,3), (3,4), (3,5), (3,6), (4,1), (4,2), (4,3), (4,4), (4,5), (4,6),
(5,1), (5,2), (5,3), (5,4), (5,5), (5,6), (6,1), (6,2), (6,3), (6,4), (6,5), (6,6)}
𝑛(𝑆) = 36.
i) Let event A: getting a number greater than 4 on each die.
$A = \{(5, 5), (5, 6), (6, 5), (6, 6)\}$ So n(A) = 4
$$P(A) = \frac{4}{36} = \frac{1}{9}$$

ii) Let event B: getting an odd number on the 1st die and 5 on the 2nd die.
$B = \{(1, 5), (3, 5), (5, 5)\}$ So n(B) = 3
$$P(B) = \frac{3}{36} = \frac{1}{12}$$

iii) Let event C: getting the sum of points as 10.
$C = \{(4, 6), (5, 5), (6, 4)\}$ So n(C) = 3
$$P(C) = \frac{3}{36} = \frac{1}{12}$$

iv) Let event D: getting the same points on both dice.
$D = \{(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)\}$ So n(D) = 6
$$P(D) = \frac{6}{36} = \frac{1}{6}$$
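
The sample space of 36 outcomes is small enough to enumerate directly. The following Python sketch counts the favourable outcomes for each event and returns exact fractions; the helper name `prob` is an illustrative assumption.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes (first die, second die)
S = list(product(range(1, 7), repeat=2))


def prob(condition):
    """Classical probability of the outcomes that satisfy `condition`."""
    favourable = [o for o in S if condition(o)]
    return Fraction(len(favourable), len(S))


p_a = prob(lambda o: o[0] > 4 and o[1] > 4)        # greater than 4 on each die
p_b = prob(lambda o: o[0] % 2 == 1 and o[1] == 5)  # odd on first, 5 on second
p_c = prob(lambda o: sum(o) == 10)                 # sum of points is 10
p_d = prob(lambda o: o[0] == o[1])                 # same points on both dice
print(p_a, p_b, p_c, p_d)  # 1/9 1/12 1/12 1/6
```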


Conditional probability:
Conditional probability is the likelihood of an event or outcome occurring given that
another event or outcome has occurred, i.e. the probability of an event A may change once it is
known that some other event B has occurred. This is known as the conditional probability of the
event A given that the event B has occurred. We write this as $P(A \mid B)$.
If A and B are any 2 events with $P(B) > 0$, then
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
Similarly, $$P(B \mid A) = \frac{P(A \cap B)}{P(A)}; \quad P(A) > 0$$

Ex.1. You toss a fair coin three times. Given that you have observed at least one heads, what is the
probability that you observe at least two heads?
Ans: A coin is tossed three times.
$S = \{TTT, TTH, THT, THH, HHH, HHT, HTH, HTT\}$, $n(S) = 8$
Let A be the event that at least one head is observed.
$A = \{TTH, THT, THH, HHH, HHT, HTH, HTT\}$
$P(A) = \frac{7}{8}$

Let B be the event that at least two heads are observed.
$B = \{THH, HHH, HHT, HTH\}$
$P(B) = \frac{4}{8}$

Every outcome in B also belongs to A, so $A \cap B = \{THH, HHH, HHT, HTH\} = B$ and $P(A \cap B) = P(B)$.
The probability of the event B given that the event A has occurred is
$P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{P(B)}{P(A)} = \frac{4}{8} \cdot \frac{8}{7} = \frac{4}{7}$
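The same conditional probability can be obtained by enumerating the outcomes. This is a minimal sketch under the assumption of equally likely outcomes; the function name `conditional` is hypothetical and chosen only for this illustration.

```python
from fractions import Fraction
from itertools import product

# All 8 equally likely outcomes of three tosses, written like "HTH".
S = ["".join(t) for t in product("HT", repeat=3)]

A = {s for s in S if s.count("H") >= 1}  # at least one head
B = {s for s in S if s.count("H") >= 2}  # at least two heads

def conditional(event, given, space):
    """P(event | given) = P(event and given) / P(given)."""
    p_given = Fraction(len(given), len(space))
    if p_given == 0:
        raise ValueError("conditioning event has probability zero")
    return Fraction(len(event & given), len(space)) / p_given

p_b_given_a = conditional(B, A, S)
print(p_b_given_a)  # 4/7
```

The guard on a zero-probability condition mirrors the requirement $P(A) > 0$ in the definition.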

Ex.2. Out of 50 people surveyed in a study, 35 smoke, and 20 of these smokers are male. If a
surveyed person is a smoker, what is the probability that the person is male?
Ans: Here, $n(S) = 50$
Let A be the event that the person is a smoker.
$P(A) = \frac{n(A)}{n(S)} = \frac{35}{50}$

Let B be the event that the person is male. Then $A \cap B$ is the event that the person is a male
smoker, with $n(A \cap B) = 20$.
$P(A \cap B) = \frac{n(A \cap B)}{n(S)} = \frac{20}{50}$

Then the probability that a person is male, given that the person is a smoker, is
$P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{20}{50} \cdot \frac{50}{35} = \frac{4}{7}$

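Multiplication Property:
If A and B are any two events of a given random experiment with $P(A) > 0$, then
$P(A \cap B) = P(A) \cdot P(B \mid A)$
If A and B are independent events, then $P(B \mid A) = P(B)$ and the rule reduces to
$P(A \cap B) = P(A) \cdot P(B)$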

Ex.4. Find the probability that a single toss of a die will result in a number less than 4 if it is given
that the toss resulted in an odd number.
Ans: For the toss of a die, $S = \{1, 2, 3, 4, 5, 6\}$, $n(S) = 6$
It is given that the toss resulted in an odd number.
Let event A: the toss resulted in an odd number.
$A = \{1, 3, 5\}$, $n(A) = 3$
$P(A) = \frac{3}{6} = \frac{1}{2}$

Let event B: the toss resulted in a number less than 4.
$B = \{1, 2, 3\}$, $n(B) = 3$
$P(B) = \frac{3}{6} = \frac{1}{2}$

$A \cap B = \{1, 3\}$, $n(A \cap B) = 2$
$P(A \cap B) = \frac{2}{6} = \frac{1}{3}$

Hence, the required probability is
P(number is less than 4 given that it is odd)
$= P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{1/3}{1/2} = \frac{2}{3}$

Ex.5. A bag contains 3 pink candies and 7 green candies. Two candies are taken out from the bag
with replacement. Find the probability that both candies are pink.
Ans: Here, $n(S) = 3 + 7 = 10$
Let A be the event that the first candy is pink and
B be the event that the second candy is pink.
$P(A) = P(B) = \frac{3}{10}$

Since the candies are taken out with replacement, the events A and B are independent, so
$P(B \mid A) = P(B) = \frac{3}{10}$

Hence, the probability that both candies are pink is
$P(A \cap B) = P(A) \cdot P(B \mid A) = \frac{3}{10} \cdot \frac{3}{10} = \frac{9}{100} = 0.09$

Random Variables:
Specifying a model for a random experiment via a complete description of sample space S
and probability P may not always be convenient or necessary. In practice we are only interested in
various observations (i.e., numerical measurements) of the experiment. We include these into our
modelling process via the introduction of random variables.
A random variable is a function that associates a real number with each element in the
sample space. Despite its name, a random variable is neither random nor a variable: it is a function
defined on a sample space. The values of the function can be anything at all, but for us they will
always be numbers.
E.g. consider the sample space for tossing a fair coin twice:
𝑆 = {𝐻𝐻, 𝐻𝑇, 𝑇𝐻, 𝑇𝑇}
These outcomes are equally likely. There are several random quantities we could associate
with this experiment. For example, we could count the number of heads, or the number of tails.
Formally, a random variable is a real-valued function which acts on the elements of the sample
space (outcomes), i.e. to each outcome the random variable assigns a real number. Random
variables are usually denoted by upper case letters.
In our example, if we let X be the number of heads, we have
$X(HH) = 2$;
$X(HT) = 1$;
$X(TH) = 1$;
$X(TT) = 0$.
Hence, 2, 1, 1, 0 are the values taken by the random variable X for the outcomes in sample space S.
In short,
Outcomes               HH    HT    TH    TT
Random variable (X)     2     1     1     0

There are two types of random variable:
a) A discrete random variable takes only isolated values, such as integers. Its set of possible
values is finite or countable.
E.g. marks obtained by the students, number of accidents in a year, etc.
b) A continuous random variable can take all possible values between certain limits or in an interval.
E.g. measurements of rainfall, lifetime of a component, height and weight of a person, etc.
Probability Distribution:
In probability and statistics, a probability distribution assigns a probability to
each measurable subset of the possible outcomes of a random experiment. Probability distributions
are used at both the theoretical and the practical level. A listing of all the values the random
variable can assume, together with their corresponding probabilities, makes up a probability distribution.
A discrete probability distribution is a table (or a formula) listing all possible values that
a discrete variable can take on, together with the associated probabilities. It is a function that
satisfies the following properties:
1. The probability that the discrete variable X takes the value x is given by
$P(X = x) = P(x) = p_x$
2. It is non-negative for all real x.
3. The sum of P(x) over all possible values of x is 1,
$\sum_{i=1}^{n} p_i = 1$
4. Discrete probability functions are referred to as probability mass functions.
A continuous probability distribution is a function f(x) that satisfies the following properties:
1. The probability that the continuous variable x lies between two points a and b is
$P(a \le x \le b) = \int_{a}^{b} f(x)\,dx$
2. It is non-negative for all real x.
3. The integral of the probability function over the whole real line is one,
$\int_{-\infty}^{\infty} f(x)\,dx = 1$
4. Continuous probability functions are referred to as probability density functions.
Some practical uses of probability distribution are:
(i) To calculate confidence interval for parameters and to calculate critical region for
hypothesis tests.
(ii) For univariate data, it is often useful to determine a reasonable distribution model for the
data.
(iii) Simulation studies with random numbers generated from a specific probability
distribution are often needed.


In general, if $e_1, e_2, \ldots, e_n$ are the n outcomes of a sample space and $x_1, x_2, \ldots, x_n$ are
the corresponding values of the random variable, with probabilities $p_1, p_2, \ldots, p_n$, then the
probability distribution table is given by,
Outcomes (S)                 $e_1$    $e_2$    ...    $e_n$
Random variable (X)          $x_1$    $x_2$    ...    $x_n$    Total
Probability $P(X = x_i)$     $p_1$    $p_2$    ...    $p_n$    1

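Ex.1. Suppose a coin is tossed twice. Find the probability distribution for the number of heads.
Ans: A coin is tossed twice, so $n(S) = 4$ and each outcome has probability 1/4. Let X be the number
of heads. The distribution is as follows,
Outcomes (S)               HH     HT     TH     TT
Random variable (X)         2      1      1      0     Total
Probability of outcome     1/4    1/4    1/4    1/4      1
Grouping the outcomes by the value of X gives $P(X = 0) = \frac{1}{4}$, $P(X = 1) = \frac{1}{2}$ and $P(X = 2) = \frac{1}{4}$.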

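Ex.2. Find the probability function corresponding to the random variable X, the number of heads,
assuming that a fair coin is tossed thrice.
Ans: $S = \{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT\}$, $n(S) = 8$, and each outcome has probability 1/8.
Outcomes (S)            HHH   HHT   HTH   HTT   THH   THT   TTH   TTT
Random variable (X)      3     2     2     1     2     1     1     0
Grouping the outcomes by the value of X gives the probability function:
$X = x_i$          0      1      2      3     Total
$P(X = x_i)$      1/8    3/8    3/8    1/8      1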
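The two coin-tossing distributions above can be generated by treating the random variable as a function on the sample space and counting how often each value occurs. The sketch below assumes a fair coin; the function name `heads_distribution` is an illustrative choice, not something defined in the text.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def heads_distribution(n_tosses):
    """Return {x: P(X = x)} where X counts heads in n_tosses fair tosses."""
    outcomes = list(product("HT", repeat=n_tosses))       # equally likely outcomes
    counts = Counter(outcome.count("H") for outcome in outcomes)
    return {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

two_tosses = heads_distribution(2)    # P(X=0)=1/4, P(X=1)=1/2, P(X=2)=1/4
three_tosses = heads_distribution(3)  # P(X=0)=1/8, P(X=1)=3/8, P(X=2)=3/8, P(X=3)=1/8

for x, p in three_tosses.items():
    print(x, p)
```

In each case the probabilities add up to 1, as required of a probability mass function.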

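Ex.3. Find the constant c such that
$f(x) = c x^2$ for $0 < x < 3$
$f(x) = 0$ otherwise
is a density function.
Ans: Since f(x) must satisfy the 3rd property of a density function,
$\int_{-\infty}^{\infty} f(x)\,dx = 1$
Now,
$\int_{-\infty}^{\infty} f(x)\,dx = \int_{0}^{3} c x^2\,dx = \left. \frac{c x^3}{3} \right|_{0}^{3} = 9c$
Hence,
$9c = 1 \Rightarrow c = \frac{1}{9}$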

Expected value and Variance:


For a given random variable X with the probability distribution,
Values of X ($x_i$)       $x_1$    $x_2$    ...    $x_n$
Probability $P(x_i)$      $p_1$    $p_2$    ...    $p_n$

the Expected value (or Mean) of the random variable is a number E(X) given by,
$E(X) = \sum_{i=1}^{n} x_i p_i$ (for discrete variables)
$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$ (for continuous variables)

Ex.1. A die is thrown. The random variable X is “the number of dots that appear”. Find the expected
value of this random variable.
Ans: For the throw of a die, the possible numbers of dots are $S = \{1, 2, 3, 4, 5, 6\}$
Hence, the probability distribution table is given by
No. of dots ($x_i$)       1      2      3      4      5      6     Total
$P(X = x_i) = p_i$       1/6    1/6    1/6    1/6    1/6    1/6      1
$x_i p_i$                1/6    1/3    1/2    2/3    5/6     1      3.5
$E(X) = \sum_{i=1}^{6} x_i p_i = \frac{21}{6} = \frac{7}{2} = 3.5$

Ex.2. A lot containing 7 components is sampled by a quality inspector; the lot contains 4 good
components and 3 defective components. A sample of 3 is taken by the inspector. Find the expected
value of the number of good components in this sample.
Ans: This is a case of combination where $n = 7$ and $r = 3$.
To find n(S),
$^{7}C_{3} = \frac{7!}{3!\,(7 - 3)!} = \frac{7 \times 6 \times 5 \times 4!}{3! \times 4!} = \frac{7 \times 6 \times 5}{3 \times 2 \times 1} = 35$ combinations
So, $n(S) = 35$
Using the formula $^{4}C_{r} \times {}^{3}C_{3-r}$, find the number of samples containing 0, 1, 2 or 3 good
components by substituting r = 0, 1, 2, 3 respectively.
The probability distribution table for the number of good components in a sample is,
No. of good comp. ($x_i$)      0        1        2        3      Total
$P(X = x_i) = p_i$           1/35    12/35    18/35     4/35       1
$x_i p_i$                      0     12/35    36/35    12/35     60/35

$E(X) = \sum_{i} x_i p_i = \frac{60}{35} = \frac{12}{7} \approx 1.7$


Thus, if a sample of size 3 is selected at random again and again from a lot of 4 good
components and 3 defective components, it will contain, on average, 1.7 good components.
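The distribution and the expected value in this example can be reproduced with the combination formula. The following is a small sketch using Python's `math.comb`; the variable names (`good`, `defective`, `sample_size`, `pmf`) are assumptions made for illustration.

```python
from fractions import Fraction
from math import comb

good, defective, sample_size = 4, 3, 3

# n(S): number of ways to choose 3 components out of 7.
total = comb(good + defective, sample_size)

# P(X = r): r good components and (3 - r) defective components in the sample.
pmf = {
    r: Fraction(comb(good, r) * comb(defective, sample_size - r), total)
    for r in range(sample_size + 1)
}

expected = sum(r * p for r, p in pmf.items())
print(total)     # 35
print(expected)  # 12/7
```

The exact value 12/7 is approximately 1.71, which the text rounds to 1.7.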
The Variance of a random variable X with the probability distribution $P(X = x_i)$ is a number
Var(X) or $\sigma^2$ given by,
$\sigma^2 = Var(X) = E[(X - E(X))^2] = \sum_{i} [x_i - E(X)]^2 \cdot p_i$ (for discrete variables)
$\sigma^2 = Var(X) = E[(X - E(X))^2] = \int_{-\infty}^{\infty} [x - E(X)]^2 f(x)\,dx$ (for continuous variables)
In both cases this simplifies to
$\sigma^2 = E(X^2) - [E(X)]^2$

Ex.3. Let the random variable X represent the number of automobiles that are used for official
business on any given workday. The probability distribution for a company is
$X = x_i$                0      1      2      3
$P(X = x_i) = p_i$      0.2    0.1    0.4    0.3
Calculate the variance of the random variable X.
Ans: To calculate the expected value, prepare the following table.

$X = x_i$                0      1      2      3     Total
$P(X = x_i) = p_i$      0.2    0.1    0.4    0.3      1
$x_i p_i$                0     0.1    0.8    0.9     1.8
$x_i^2$                  0      1      4      9
$x_i^2 \cdot p_i$        0     0.1    1.6    2.7     4.4

$E(X) = \sum_{i=0}^{3} x_i p_i = 1.8$
$E(X^2) = \sum_{i=0}^{3} x_i^2 p_i = 4.4$
Now, $Var(X) = \sigma^2 = E(X^2) - [E(X)]^2$
$= 4.4 - (1.8)^2$
$= 4.4 - 3.24$
$= 1.16$

Ex.4. Let the random variable X represent the number of defective parts for a machine when 3
parts are sampled from a production line and tested. The following is the probability distribution of
X. Calculate $\sigma^2$.
$x_i$      0       1       2       3
$p_i$     0.51    0.38    0.10    0.01

Ans: Prepare the following probability distribution table.

$x_i$                  0       1       2       3      Total
$p_i$                 0.51    0.38    0.10    0.01      1
$x_i p_i$              0      0.38    0.20    0.03     0.61
$x_i^2$                0       1       4       9
$x_i^2 \cdot p_i$      0      0.38    0.40    0.09     0.87

$E(X) = \sum_{i=0}^{3} x_i p_i = 0.61$
$E(X^2) = \sum_{i=0}^{3} x_i^2 p_i = 0.87$
Now, $Var(X) = \sigma^2 = E(X^2) - [E(X)]^2$
$= 0.87 - (0.61)^2$
$= 0.87 - 0.3721$
$= 0.4979$
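The tabular calculation of E(X), E(X²) and the variance used in Ex.3 and Ex.4 can be sketched as a short function. The name `mean_and_variance` is hypothetical; the probabilities for the automobile example are the ones used in the worked table (0.2, 0.1, 0.4, 0.3).

```python
def mean_and_variance(values, probs):
    """Return (E(X), Var(X)) for a discrete distribution, using Var = E(X^2) - E(X)^2."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    mean = sum(x * p for x, p in zip(values, probs))
    second_moment = sum(x * x * p for x, p in zip(values, probs))
    return mean, second_moment - mean ** 2

# Ex.3: number of automobiles used for official business.
cars_mean, cars_var = mean_and_variance([0, 1, 2, 3], [0.2, 0.1, 0.4, 0.3])

# Ex.4: number of defective parts among 3 sampled.
defect_mean, defect_var = mean_and_variance([0, 1, 2, 3], [0.51, 0.38, 0.10, 0.01])

print(round(cars_mean, 2), round(cars_var, 2))      # 1.8 1.16
print(round(defect_mean, 2), round(defect_var, 4))  # 0.61 0.4979
```

The check on the sum of the probabilities catches tables whose entries do not add up to 1 before any variance is computed.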
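Ex.5. Find the expected value for the density function of a random variable X given by
$f(x) = \frac{1}{2} x$ for $0 < x < 2$
$f(x) = 0$ otherwise
Ans: $E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$
$= \int_{0}^{2} x \left(\frac{1}{2} x\right) dx$
$= \int_{0}^{2} \frac{1}{2} x^2\,dx = \left. \frac{x^3}{6} \right|_{0}^{2} = \frac{8}{6} = \frac{4}{3}$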
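The integral in Ex.5 can also be approximated numerically as a check. The sketch below uses a simple midpoint rule rather than exact integration; the function names `integrate` and `density` are assumptions made for this illustration.

```python
def integrate(f, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

def density(x):
    """f(x) = x/2 on (0, 2) and 0 otherwise."""
    return 0.5 * x if 0 < x < 2 else 0.0

total_area = integrate(density, 0.0, 2.0)                 # should be close to 1
expected = integrate(lambda x: x * density(x), 0.0, 2.0)  # should be close to 4/3

print(round(total_area, 6), round(expected, 6))
```

Checking that the total area is 1 confirms that f is a valid density before its mean is computed.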

Following are some special Probability Distributions:


1. Binomial Distribution:
One of the most commonly encountered discrete distributions is the binomial distribution.
It is based on the Bernoulli trial or Bernoulli process; a single Bernoulli trial (n = 1) gives the
Bernoulli distribution. An experiment involving repeated trials, where only two complementary
outcomes are possible and can be labelled as a “success” or a “failure”, is called a Bernoulli process. The
most obvious application deals with the testing of items as they come off an assembly line, where
each trial may indicate a defective or a non-defective item. We may choose to define either outcome
as a success.
If p is the probability of success then $q = 1 - p$ is the probability of failure.
The Bernoulli process must possess the following properties:
1. The experiment consists of repeated trials.
2. Each trial results in an outcome that may be classified as a success or a failure.


3. The probability of success, denoted by p, remains constant from trial to trial.


4. The repeated trials are independent.
If n = number of independent Bernoulli trials of an experiment and
r = number of successes observed in the Bernoulli experiment,
then the number of combinations of n trials with r successes is given by:
$^{n}C_{r} = C(n, r) = \binom{n}{r} = \frac{n!}{r!\,(n - r)!}$
The number X of successes in n Bernoulli trials is called a binomial random variable. The
probability distribution of this discrete random variable is called the binomial distribution, and
its values will be denoted by B(x; n, p) or X ~ B(n, p) since they depend on the number of trials and
the probability of a success on a given trial.
Each particular sequence of r successes and (n–r) failures has probability $p^r q^{n-r}$, and there
are $\binom{n}{r}$ such sequences. Thus, in n trials,
Probability (r successes out of n trials) = $P(X = r)$
$P(r) = \binom{n}{r} p^r q^{n-r}$
i.e. $P(r) = \frac{n!}{r!\,(n - r)!} p^r q^{n-r}$

Where, 𝑛 = no. of independent trials


𝑟 = no. of success in n trials
𝑝 = Probability of success in one trial
𝑞 = 1 – 𝑝 = probability of failure
Note:
1. $P(r \le n) = \sum_{r=0}^{n} \binom{n}{r} p^r q^{n-r} = (p + q)^n = 1$ for r = 0, 1, 2, ..., n
Hence, $P(r \ge 1) = 1 - P(r = 0) = 1 - q^n$
2. If n independent trials constitute an experiment and if this experiment is repeated N times, the
expected frequencies are given by:
$f(r) = N \binom{n}{r} p^r q^{n-r}$ for r = 0, 1, 2, ..., n
3. The mean of the binomial distribution is $\bar{x} = np$
and the variance is $\sigma^2 = npq$, i.e. $SD = \sigma = \sqrt{npq}$
4. Binomial distribution: expresses the probability for r successes in an experiment with n trials
(0 ≤ 𝑟 ≤ 𝑛).
5. Geometric distribution: expresses the probability that the first success occurs on exactly
the r-th trial ($r \ge 1$).
6. Negative Binomial distribution: expresses the probability of having to wait exactly r trials
until k successes have occurred ($r \ge k$). This form is sometimes referred to as the Pascal
distribution. Sometimes this distribution is expressed as the number of failures n occurring
while waiting for k successes ($n \ge 0$).

Ex.1. If X is binomially distributed with 6 trials and a probability of success equal to $\frac{1}{4}$ at each
attempt, what is the probability of:
a) Exactly 4 successes, b) At least one success?
Ans: Here, $n = 6$, $p = \frac{1}{4}$, $q = 1 - \frac{1}{4} = \frac{3}{4}$

a) For exactly 4 successes, $r = 4$
$P(r) = \frac{n!}{r!\,(n - r)!} p^r q^{n-r}$
$P(X = 4) = \frac{6!}{4!\,(6 - 4)!} \left(\frac{1}{4}\right)^4 \left(\frac{3}{4}\right)^{6-4}$
$= 15 \times \frac{1}{256} \times \frac{9}{16}$
$= \frac{135}{4096} = 0.033$

b) For at least one success, $r \ge 1$, i.e. r is not zero
$P(r \ge 1) = 1 - P(r = 0)$, where $P(r = 0)$ is the probability of no success in 6 trials
$= 1 - \left(\frac{3}{4}\right)^6$
$= 1 - \frac{729}{4096}$
$= \frac{3367}{4096} = 0.822$
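The binomial formula used in this example translates directly into a few lines of code. This is a minimal sketch; the function name `binomial_pmf` is an illustrative assumption, and exact fractions are used so that the results match the hand calculation.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(r, n, p):
    """P(X = r) for X ~ B(n, p): nCr * p^r * q^(n-r) with q = 1 - p."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

p = Fraction(1, 4)
exactly_four = binomial_pmf(4, 6, p)      # part (a)
at_least_one = 1 - binomial_pmf(0, 6, p)  # part (b): complement of "no success"

print(exactly_four)  # 135/4096
print(at_least_one)  # 3367/4096
```

Passing a float such as 0.25 instead of a `Fraction` works the same way and returns decimal values.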

Ex.2. When an unbiased coin is tossed 8 times what is the probability of getting:
a) less than 4 heads b) more than 5 heads?
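Ans: Here, $n = 8$
Let p be the probability of getting a head.
$p = \frac{1}{2}$, $q = 1 - \frac{1}{2} = \frac{1}{2}$

a) For less than 4 heads, $r < 4$, i.e. $r \le 3$
$P(r \le 3) = \sum_{r=0}^{3} \binom{8}{r} p^r q^{8-r}$
$= P(r = 0) + P(r = 1) + P(r = 2) + P(r = 3)$
$= \left(\frac{1}{2}\right)^8 + {}^{8}C_{1} \left(\frac{1}{2}\right)^1 \left(\frac{1}{2}\right)^7 + {}^{8}C_{2} \left(\frac{1}{2}\right)^2 \left(\frac{1}{2}\right)^6 + {}^{8}C_{3} \left(\frac{1}{2}\right)^3 \left(\frac{1}{2}\right)^5$
$= \left(\frac{1}{2}\right)^8 + 8 \left(\frac{1}{2}\right)^8 + 28 \left(\frac{1}{2}\right)^8 + 56 \left(\frac{1}{2}\right)^8$
$= 93 \left(\frac{1}{2}\right)^8$
$= \frac{93}{256} = 0.3633$

b) For more than 5 heads, $r > 5$
$P(r > 5) = \sum_{r=6}^{8} \binom{8}{r} p^r q^{8-r}$
$= P(r = 6) + P(r = 7) + P(r = 8)$
$= {}^{8}C_{6} \left(\frac{1}{2}\right)^6 \left(\frac{1}{2}\right)^2 + {}^{8}C_{7} \left(\frac{1}{2}\right)^7 \left(\frac{1}{2}\right)^1 + {}^{8}C_{8} \left(\frac{1}{2}\right)^8 \left(\frac{1}{2}\right)^0$
$= 28 \left(\frac{1}{2}\right)^8 + 8 \left(\frac{1}{2}\right)^8 + \left(\frac{1}{2}\right)^8$
$= 37 \left(\frac{1}{2}\right)^8$
$= \frac{37}{256} = 0.1445$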

Ex.3. A biased die is thrown thirty times and the number of sixes seen is eight. If the die is thrown a
further twelve times, find:
a) the probability that a six will occur exactly twice;
b) the expected number of sixes;
c) the variance of number of sixes.
Ans: A biased die is thrown thirty times and the number of sixes seen is eight.
So the probability of a six is estimated as $p = \frac{8}{30} = \frac{4}{15}$, and $q = 1 - p = \frac{11}{15}$

Now, let X be defined as “the number of sixes seen in 12 throws”.
Here, n = 12
a) For the probability that a six will occur exactly twice, $r = 2$
$P(r) = \frac{n!}{r!\,(n - r)!} p^r q^{n-r}$
$P(2) = \frac{12!}{2!\,(12 - 2)!} \left(\frac{4}{15}\right)^2 \left(\frac{11}{15}\right)^{12-2}$
$= \frac{66 \times 4^2 \times 11^{10}}{15^{12}}$
$= 0.211$
b) Expected number of sixes = mean
$E(X) = \bar{x} = np = 12 \times \frac{4}{15} = 3.2$

c) Variance of the number of sixes,
$V(X) = \sigma^2 = npq = 12 \times \frac{4}{15} \times \frac{11}{15} = 2.347$

Ex.4. A random variable is binomially distributed with mean 6 and variance 4.2. Find 𝑃(𝑋 ≤ 6)
Ans: Since X follows a binomial distribution,
Mean $= np = 6$
Variance $= npq = 4.2$
$6 \times q = 4.2$
$q = \frac{4.2}{6} = 0.7$
Also,
$p = 1 - q = 1 - 0.7 = 0.3$
This gives,
$n \times 0.3 = 6$
$n = 20$
Now,
$P(r \le 6) = \sum_{r=0}^{6} \binom{20}{r} (0.3)^r (0.7)^{20-r}$
$= P(r = 0) + P(r = 1) + P(r = 2) + P(r = 3) + P(r = 4) + P(r = 5) + P(r = 6)$
$= {}^{20}C_{0} (0.3)^0 (0.7)^{20} + {}^{20}C_{1} (0.3)^1 (0.7)^{19} + {}^{20}C_{2} (0.3)^2 (0.7)^{18} + {}^{20}C_{3} (0.3)^3 (0.7)^{17}$
$+ {}^{20}C_{4} (0.3)^4 (0.7)^{16} + {}^{20}C_{5} (0.3)^5 (0.7)^{15} + {}^{20}C_{6} (0.3)^6 (0.7)^{14}$
$= 0.6080$
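Summing seven binomial terms by hand is tedious, so the calculation is a natural candidate for a short program. The sketch below recovers n and p from the given mean and variance and then adds up the terms; the function name `binomial_cdf` is a hypothetical choice for this illustration.

```python
from math import comb

def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p), as a sum of individual binomial terms."""
    return sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(k + 1))

mean, variance = 6.0, 4.2
q = variance / mean   # npq / np = q
p = 1 - q
n = round(mean / p)   # np = mean, rounded to the nearest whole number of trials

prob = binomial_cdf(6, n, p)
print(n, round(p, 2), round(prob, 4))
```

Rounding n guards against small floating-point errors in the division.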

Ex.5. Inland Revenue audits 5% of all companies every year. The companies selected for auditing in
any one year are independent of the previous year’s selection.
a) What is the probability that the company ‘Ross Waste Disposal’ will be selected for auditing
exactly twice in the next 5 years?
b) What is the probability that the company will be audited exactly twice in the next 2 years?
c) What is the exact probability that this company will be audited at least once in the next 4
years?
Ans: Here, $p = 0.05$, $q = 1 - p = 0.95$
a) For $n = 5$, $r = 2$
$P(X = r) = \frac{n!}{r!\,(n - r)!} p^r q^{n-r} = \frac{5!}{2!\,3!} (0.05)^2 (0.95)^3 = 0.0214$

b) For $n = 2$, $r = 2$
$P(X = r) = \frac{n!}{r!\,(n - r)!} p^r q^{n-r} = \frac{2!}{2!\,0!} (0.05)^2 (0.95)^0 = 0.0025$

c) For $n = 4$, $r \ge 1$
$P(X \ge 1) = 1 - P(X = 0)$
$= 1 - \frac{4!}{0!\,4!} (0.05)^0 (0.95)^4 = 1 - 0.8145 = 0.1855$

Ex.6. For a binomial distribution, the mean is 6 and the S.D. is $\sqrt{2}$. Find n, p, q.
Ans: Given $\bar{x} = 6$, $\sigma = \sqrt{2}$
We know,
$\bar{x} = np \Rightarrow 6 = np$
Now,
$\sigma^2 = npq \Rightarrow 2 = 6 \times q$
$q = \frac{1}{3}$
Again,
$p = 1 - q \Rightarrow p = 1 - \frac{1}{3}$
$p = \frac{2}{3}$
Hence, $\bar{x} = np \Rightarrow 6 = n \times \frac{2}{3}$
$n = 9$
Ex.7. Eight coins are tossed at a time 256 times. Number of heads at each throw is recorded and
results are given below. Find the expected frequencies and fit the Binomial distribution.
No. of Heads at a throw 0 1 2 3 4 5 6 7 8
Frequency 2 6 30 52 67 56 32 10 1
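Ans: The probability of getting a head in a single throw is
$P(H) = p = \frac{1}{2}$
Hence, $P(T) = q = 1 - p = \frac{1}{2}$
Given that $n = 8$ and $N = 256$.
The expected frequencies are given by the successive terms of the binomial expansion $N (p + q)^n$.
Hence the B.D. table is,
No. of heads (X = r)    Freq = $N \binom{n}{r} p^r q^{n-r}$                 Expected $f_i$    Observed frequency
0                       $256 \times \binom{8}{0} (1/2)^0 (1/2)^8$           1                 2
1                       $256 \times \binom{8}{1} (1/2)^1 (1/2)^7$           8                 6
2                       $256 \times \binom{8}{2} (1/2)^2 (1/2)^6$           28                30
3                       $256 \times \binom{8}{3} (1/2)^3 (1/2)^5$           56                52
4                       $256 \times \binom{8}{4} (1/2)^4 (1/2)^4$           70                67
5                       $256 \times \binom{8}{5} (1/2)^5 (1/2)^3$           56                56
6                       $256 \times \binom{8}{6} (1/2)^6 (1/2)^2$           28                32
7                       $256 \times \binom{8}{7} (1/2)^7 (1/2)^1$           8                 10
8                       $256 \times \binom{8}{8} (1/2)^8 (1/2)^0$           1                 1
Total                                                                       256               256
The expected frequencies are close to the observed frequencies, so the Binomial distribution fits the data well.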
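The expected frequencies in the table above can be generated term by term. This is an illustrative sketch only; the function name `expected_binomial_frequencies` is an assumption, and p = 1/2 is taken from the fair-coin setting of the example.

```python
from math import comb

def expected_binomial_frequencies(n, p, N):
    """Expected frequency N * nCr * p^r * q^(n-r) for r = 0, 1, ..., n."""
    return [N * comb(n, r) * p**r * (1 - p)**(n - r) for r in range(n + 1)]

observed = [2, 6, 30, 52, 67, 56, 32, 10, 1]  # heads 0..8 in 256 throws of 8 coins
expected = expected_binomial_frequencies(8, 0.5, sum(observed))

for r, (obs, exp_freq) in enumerate(zip(observed, expected)):
    print(r, obs, exp_freq)
```

Printing observed and expected values side by side makes it easy to judge by eye how close the fit is.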


2. Poisson’s Probability Distribution:


Poisson’s experiments are those that involve the number of outcomes of a random variable
X which occur per unit time. The Poisson distribution is a very important discrete probability
distribution, which arises in many different contexts in probability and statistics. Poisson’s
distribution is used in place of the binomial distribution in the following situations:
i) The number of trials n is very large (i.e. $n \to \infty$)
ii) The probability of success p in one trial is indefinitely small ($p \to 0$)
iii) The expectation or mean $np = \lambda$ is constant
A discrete random variable X is said to follow Poisson’s distribution if it assumes only
non-negative integer values r, with the probability distribution denoted by X ~ P(λ) and given by:
$P(X = r; \lambda) = \frac{\lambda^r}{r!} e^{-\lambda}$ for r = 0, 1, 2, ...
$= 0$ otherwise
where λ is the parameter of the Poisson’s distribution.

The probability distribution of the Poisson random variable X, representing the number of
outcomes occurring in a given time interval or specified region denoted by t, is
$P(X = r; \lambda t) = \frac{(\lambda t)^r}{r!} e^{-\lambda t}$ for r = 0, 1, 2, ...
$= 0$ otherwise
where λ is the average number of outcomes per unit time, distance, area or volume.
The Poisson distribution occurs in different situations, for example:
1. It gives the probabilities of a given number of phone calls in a certain time interval;
2. It gives the probabilities of a given number of flaws per unit length of a wire;
3. It gives the probabilities of a specific number of faults per unit area of a fabric;
4. It gives the probabilities of a specific number of bacteria per unit volume of a solution;
5. It gives the probabilities of a specific number of accidents per unit time.

Note:
1. For $X \sim P(\lambda)$,
Expectation = mean = $E(X) = \lambda$
Variance = $V(X) = \lambda$

2. $P(X \le n; \lambda) = \sum_{r=0}^{n} P(r; \lambda)$


Ex.1. During a laboratory experiment, the average number of radioactive particles passing through
a counter in 1 millisecond is 4. What is the probability that 6 particles enter the counter in a given
millisecond?
Ans: Here the number of outcomes is $r = 6$ and $\lambda t = 4$
Using Poisson’s distribution,
$P(X = r; \lambda t) = \frac{(\lambda t)^r}{r!} e^{-\lambda t}$
$P(6; 4) = \frac{4^6}{6!} e^{-4} = 0.1042$

Ex.2. The average number of planes landing at an airport each hour is 10 while the maximum
number it can handle is 15. What is the probability that on a given hour some planes will have to be
put on a holding pattern?
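Ans: Here, planes have to be held when $r > 15$, and $\lambda t = 10$
Using the cumulative form of Poisson’s distribution,
$P(X \le n; \lambda) = \sum_{r=0}^{n} P(r; \lambda)$
$P(X > 15; \lambda) = 1 - P(X \le 15; \lambda) = 1 - \sum_{r=0}^{15} P(r; \lambda)$
$= 1 - [P(r = 0) + P(r = 1) + P(r = 2) + \ldots + P(r = 15)]$
$= 1 - 0.9513$
$= 0.0487$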
Ex.3. The average number of accidents at a level-crossing every year is 5. Calculate the probability
that there are exactly 3 accidents this year.
Ans: Here, $r = 3$ and $\lambda t = 5$
$P(X = r; \lambda t) = \frac{(\lambda t)^r}{r!} e^{-\lambda t}$
$P(X = 3; 5) = \frac{5^3}{3!} e^{-5} = 0.1404$

i.e. there is a 14% probability of exactly 3 accidents this year.
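The three Poisson examples above use the same formula, so they can be checked with one small helper. The sketch below is illustrative; the names `poisson_pmf` and `poisson_cdf` are assumptions and do not come from the text.

```python
from math import exp, factorial

def poisson_pmf(r, lam):
    """P(X = r) for a Poisson variable with mean lam."""
    return lam**r * exp(-lam) / factorial(r)

def poisson_cdf(k, lam):
    """P(X <= k), obtained by summing the individual probabilities."""
    return sum(poisson_pmf(r, lam) for r in range(k + 1))

particles = poisson_pmf(6, 4)      # Ex.1: exactly 6 particles, mean 4
holding = 1 - poisson_cdf(15, 10)  # Ex.2: more than 15 planes, mean 10
accidents = poisson_pmf(3, 5)      # Ex.3: exactly 3 accidents, mean 5

print(round(particles, 4), round(holding, 4), round(accidents, 4))
```

The tail probability in Ex.2 is computed through the complement, exactly as in the hand calculation.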


Ex.4. Fit a Poisson’s Distribution to the following data, which gives the frequency of the number of
deaths due to cancer per army corps per annum, for 10 corps over twenty years.
Death         0     1     2     3     4    Total
Frequency    109    65    22     3     1     200

Ans.: Here, $N = 200$ (10 corps × 20 years)
$\lambda = \bar{x} = \frac{\sum f_i x_i}{\sum f_i}$
$= \frac{0 \times 109 + 1 \times 65 + 2 \times 22 + 3 \times 3 + 4 \times 1}{200}$
$\lambda = 0.61$
Therefore, the expected frequencies under the Poisson’s distribution are
$N \times P(X = r; \lambda) = N \times \frac{\lambda^r}{r!} e^{-\lambda}$

X    f = P(X = r)                     F = 200 × f    Frequency (approx)
0    $e^{-0.61}$                      108.7          109
1    $e^{-0.61} (0.61)$               66.3           66
2    $e^{-0.61} (0.61)^2 / 2!$        20.2           20
3    $e^{-0.61} (0.61)^3 / 3!$        4.1            4
4    $e^{-0.61} (0.61)^4 / 4!$        0.6            1

Total calculated frequency = 200
Comparing the observed and theoretical frequencies, the agreement is remarkably good, so the
Poisson’s distribution fits the data well.
Ex.5. Fit a Poisson Distribution to the following data, which give the number of doddens in samples of
clover seeds.
No. of doddens (x)         0     1     2     3     4     5     6     7     8
Observed frequency (f)    56   156   132    92    37    22     4     0     1

Ans.: Here, $N = \sum f_i = 500$
$\lambda = \bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{986}{500} = 1.972$
Therefore, the expected frequencies under the Poisson’s distribution are
$N \times P(X = r; \lambda) = N \times \frac{\lambda^r}{r!} e^{-\lambda} = 500 \times \frac{e^{-1.972} (1.972)^r}{r!}$
Calculation of theoretical frequencies is shown below;
X        F (theoretical)      Observed
0        69.6 ≈ 70            56
1        137.25 ≈ 137         156
2        135.32 ≈ 135         132
3        88.95 ≈ 89           92
4        43.85 ≈ 44           37
5        17.29 ≈ 17           22
6        5.68 ≈ 6             4
7        1.60 ≈ 2             0
8        0.39 ≈ 0             1
Total    500                  500
Hence, the theoretical frequencies follow the same overall pattern as the observed frequencies, which
suggests that the Poisson distribution is a reasonable fit.
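Fitting a Poisson distribution to a frequency table follows the same two steps each time: estimate λ by the sample mean, then multiply each Poisson probability by N. A minimal sketch of those steps is given below; the function name `fit_poisson` is hypothetical.

```python
from math import exp, factorial

def fit_poisson(values, frequencies):
    """Estimate lambda as the weighted mean and return (lambda, expected frequencies)."""
    N = sum(frequencies)
    lam = sum(x * f for x, f in zip(values, frequencies)) / N
    expected = [N * lam**x * exp(-lam) / factorial(x) for x in values]
    return lam, expected

observed = [56, 156, 132, 92, 37, 22, 4, 0, 1]  # clover seed data, x = 0..8
lam, expected = fit_poisson(range(9), observed)

print(round(lam, 3))
for x, (obs, exp_freq) in enumerate(zip(observed, expected)):
    print(x, obs, round(exp_freq, 2))
```

The same function applies to any frequency table of counts, for example the data in Ex.4.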


3. Normal Probability Distribution:


Normal distribution, also known as the Gaussian distribution, is the most important
continuous probability distribution in the study of statistics. Further, it is the parent distribution for
several important continuous distributions. It is used to model events which occur by chance such
as variation of dimensions of mass-produced items during manufacturing, experimental errors, and
variability in measurable biological characteristics such as people’s height or weight.
It arises as a limiting case of the Binomial distribution with the same values of mean and variance,
applicable when n is sufficiently large ($n > 30$). It is a two-parameter distribution denoted by
$N(\bar{x}, \sigma^2)$ and given by the density:
$f(r) = \frac{1}{\sigma \sqrt{2\pi}} \cdot e^{-\frac{1}{2} \left(\frac{r - \bar{x}}{\sigma}\right)^2}$, $-\infty < r < \infty$

where $\bar{x}$ and σ are the mean and standard deviation of the distribution respectively and
$z = \frac{r - \bar{x}}{\sigma}$ is called the standard normal variate.

Ex.1. Suppose a particular population has $\bar{x} = 4$ and $\sigma = 2$. Find the probability of a randomly
selected value being greater than 6.
Ans: The Z value corresponding to $r = 6$ is
$z = \frac{r - \bar{x}}{\sigma} = \frac{6 - 4}{2} = 1$
(Z = 1 means that the value r = 6 is 1 standard deviation above the mean.)
From the Z table, $P(0 < z < 1) = 0.3413$, so
$P(r > 6) = P(z > 1) = 0.5 - 0.3413 = 0.1587$

The normal distribution is of great importance for the following reasons:


(i) It is often suitable as a probability model for measurements of weight, length, strength, etc.
(ii) Non-normal data can often be transformed to normality.
(iii) The central limit theorem states that when we take a sample of size n from any distribution
with a mean 𝑥 and a variance σ2, the sample mean will have a distribution which gets closer
and closer to normality as n increases.
(iv) It can be used as an approximation to the binomial or the Poisson distributions when we
have large n or λ respectively (though this is less useful now that computers can be used to
evaluate binomial/Poisson probabilities).
(v) Many standard statistical techniques are based on the normal distribution.
(vi) A variable following the standard normal distribution is written $Z \sim N(0, 1)$.
(vii) The standard normal distribution is symmetric about 0.


Note:
1. The normal distribution curve (z curve) is “bell-shaped”, with two tails which theoretically never
meet the X-axis, and is symmetric about $X = \bar{x}$
2. In a standard normal distribution, $\bar{x} = 0$ and $\sigma^2 = 1$, denoted by N(0, 1)
For any normal distribution, mean = median = mode = $\bar{x}$
3. The area under the Z curve gives the probability,
i.e. $P(-\infty < r < \infty) = 1$
Since the curve is symmetric about $\bar{x}$, we have
$P(X < \bar{x}) = P(X > \bar{x}) = 0.5$
4. Any normal variable can be converted into the standard normal variate (SNV) Z using the
formula
For $X \sim N(\bar{x}, \sigma^2)$, $z = \frac{x - \bar{x}}{\sigma}$ [the Z table is then used for finding the probability]

5. The probability of the variate having a value within a certain interval [a, b] is calculated using
$P(a < X < b) = \int_{a}^{b} \frac{1}{\sigma \sqrt{2\pi}} \cdot e^{-\frac{1}{2} \left(\frac{x - \bar{x}}{\sigma}\right)^2} dx$

Process of finding z value:
• Draw a diagram and label it with the given values, i.e. $\bar{x}$ (population mean), σ (population S.D.) and r
  (raw score).
• Shade the area required as per the question.
• Convert the raw score r to the standard score Z using the formula.
• Use tables to find the probability, e.g. $P(0 < Z < z)$.
• Adjust this result to the required probability.

Ex.2. Wool fibre breaking strengths are normally distributed with mean 𝑥 = 23.56 Newton and
standard deviation 𝜎 = 4.55. What proportion of fibres would have a breaking strength of 14.45
or less?
Ans: Here, $\bar{x} = 23.56$, $\sigma = 4.55$, $r = 14.45$
Draw a diagram and label it with the given values.
Convert $r = 14.45$ to a Z value:
$z = \frac{14.45 - 23.56}{4.55} = -2.0$

That is, the raw score of 14.45 is equivalent to a standard score of -2.0. It is negative
because it is on the left hand side of the curve.
Use tables to find the probability and adjust this result to the required probability:
$P(r < 14.45) = P(z < -2.0)$
$= 0.5 - P(0 < z < 2)$
$= 0.5 - 0.4772$
$= 0.0228$
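Where a Z table is not at hand, the same area can be computed with Python's standard library. The sketch below uses `statistics.NormalDist` (available from Python 3.8); the variable names are illustrative assumptions.

```python
from statistics import NormalDist

# Breaking strength of wool fibres: mean 23.56 N, standard deviation 4.55 N.
strength = NormalDist(mu=23.56, sigma=4.55)

z = (14.45 - strength.mean) / strength.stdev  # standard score of the raw value
proportion = strength.cdf(14.45)              # P(r < 14.45), area to the left

print(round(z, 2), round(proportion, 4))
```

The computed proportion differs slightly in the fourth decimal place from the table-based answer, because the hand calculation rounds z to -2.0 before using the table.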
Inverse process: (to find a value for r corresponding to a given probability)
• Draw a diagram and label it.
• Shade the area given as per the question.
• Use probability tables to find the Z-score.
• Convert the standard score Z to the raw score r using the inverse formula
$r = z \times \sigma + \bar{x}$
Ex.3. Carrots entering a processing factory have an average length of 15.3 cm and a standard
deviation of 5.4 cm. If the lengths are approximately normally distributed, what is the maximum
length of the lowest 5% of the load (given tabulated z = 1.645 at 5%)?
Ans: Here, $\bar{x} = 15.3$, $\sigma = 5.4$, $r = ?$
Draw a diagram and label it.

Use standard Normal tables to find the Z-score corresponding to this area of probability.
Convert the standard score Z to a raw score r using the inverse formula
$r = z \times \sigma + \bar{x}$
Here, the Z value for the lowest 5% is -1.645 from the normal Z table (negative because it is below the mean).
Hence, $r = z \times \sigma + \bar{x}$
$= -1.645 \times 5.4 + 15.3$
$= 6.4$
The maximum length of the lowest 5% of the load is 6.4 cm.
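The inverse process corresponds to the inverse cumulative distribution function. As a hedged illustration, `statistics.NormalDist.inv_cdf` from the Python standard library (Python 3.8 or later) can stand in for the reverse table lookup; the variable names are assumptions.

```python
from statistics import NormalDist

# Carrot lengths: mean 15.3 cm, standard deviation 5.4 cm.
carrots = NormalDist(mu=15.3, sigma=5.4)

z_5 = NormalDist().inv_cdf(0.05)  # z-score with 5% of the area to its left
cutoff = carrots.inv_cdf(0.05)    # raw length below which 5% of the load falls

# The inverse formula r = z * sigma + mean gives the same cutoff.
by_formula = z_5 * carrots.stdev + carrots.mean

print(round(z_5, 3), round(cutoff, 1))
```

Both routes give the same length, which agrees with the table-based answer of about 6.4 cm.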


Ex.4. The finish times for marathon runners during a race are normally distributed with a mean of
195 minutes and a standard deviation of 25 minutes.
a) What is the probability that a runner will complete the marathon within 3 hours?
b) Calculate to the nearest minute, the time by which the first 8% runners have completed the
marathon.
c) What proportion of the runners will complete the marathon between 3 hours and 4 hours?

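Ans: Here, $\bar{x} = 195$, $\sigma = 25$

a) $r = 180 \Rightarrow z = \frac{180 - 195}{25} = -0.6$
$P(Z < -0.6) = 0.5 - P(0 < z < 0.6)$
$= 0.5 - 0.2257$
$= 0.2743$
[Figure: normal curve with $\bar{x} = 195$, σ = 25; area to the left of r = 180 shaded]

b) For $p = 0.08$, $Z = -1.41$
$-1.41 = \frac{r - 195}{25} \Rightarrow r = -1.41 \times 25 + 195 = 159.75 \approx 160$ min
Hence, the first 8% of runners have completed the marathon within 160 min.

c) $r = 180 \Rightarrow z = \frac{180 - 195}{25} = -0.6$
$P(Z < -0.6) = 0.2743$
$r = 240 \Rightarrow z = \frac{240 - 195}{25} = 1.8$
$P(Z < 1.8) = 0.9641$
Hence,
$P(-0.6 < z < 1.8) = 0.9641 - 0.2743 = 0.6898$
[Figure: normal curve with $\bar{x} = 195$, σ = 25; area between r = 180 and r = 240 shaded]

Hence, the proportion of runners taking between 3 hours and 4 hours is approximately 69%.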

Ex.5. For the following standard normal variates z, find the proportion (area) occupied by them as
measured from zero.
i) z = 1.98
ii) z = -0.5
iii) z = 1.35 to 2.18
iv) z = -1.98 to 0.5

Ans.: From the z table it is seen that (See table in appendix-….)
i) A = 0.4762 for z = 1.98

ii) A = 0.1915 for z = -0.5

iii) A = 0.4115 for z = 1.35
A = 0.4854 for z = 2.18
Both values lie on the same side of zero, so the areas are subtracted:
$P(1.35 \le z \le 2.18) = 0.4854 - 0.4115 = 0.0739$

iv) A = 0.4762 for z = -1.98
A = 0.1915 for z = 0.5
The values lie on opposite sides of zero, so the areas are added:
$P(-1.98 \le z \le 0.5) = 0.4762 + 0.1915 = 0.6677$

Note: In the accompanying figures, the shaded portion indicates the required area.


Exercise
Q.1. Suppose that a pair of fair dice are to be tossed, and let the random variable X denote the sum
of the points. Obtain the probability distribution for X
Q.2. Find the expected value for the density function of a random variable X given by
$f(x) = \frac{1}{2} x$ for $0 < x < 2$
$f(x) = 0$ otherwise
Q.3. Find the variance and standard deviation of the random variable of above question 2.
Q.4. The probability that a driver must stop at any one traffic light coming to Lincoln University is
0.2. There are 15 sets of traffic lights on the journey.
a) What is the probability that a student must stop at exactly 2 of the 15 sets of traffic lights?
b) What is the probability that a student will be stopped at 1 or more of the 15 sets of traffic
lights?
Q.5. The number of typing mistakes made by a secretary has a Poisson distribution. The mistakes
are made independently at an average rate of 1.65 per page. Find the probability that a three-page
letter contains no mistakes.
Q.6. The download time of a resource web page is normally distributed with a mean of 6.5 seconds
and a standard deviation of 2.3 seconds.
a) What proportion of page downloads take less than 5 seconds?
b) What is the probability that the download time will be between 4 and 10 seconds?
c) How many seconds will it take to complete 35% of the download?
Q.7. For a binomial distribution, mean is 5 and S.D. is16. Find n, p, q.
Q.8. Mean and S.D. of a binomial distribution are 3 and 2. Find n, p, q.
Q.9. For a binomial distribution, mean is 206 and S.D. is 4. Find n, p, q.
Q.10. Fit Poisson’s distribution to following.
Death 0 1 2 3 4
Freq 122 60 15 2 1


3. Sample and sampling techniques

Introduction:
In the first chapter, we have discussed the collection, distribution and analysis of collected
scientific data. The data in biostatistics are generally based on individual observations. Sampling is
very often used in our daily life. For example, while purchasing food grains from a shop we usually
examine a handful from the bag to assess the quality of the commodity. A doctor examines a few
drops of blood as sample and draws conclusion about the blood constitution of the whole body.
Thus, most of our investigations are based on samples. In this chapter, let us see the importance of
sampling and the various methods of sample selections from the population.
Population:
In a statistical enquiry, all the items, which fall within the range of enquiry, are known as
Population or Universe. In other words, the population is a set of all possible observations, which
are to be investigated, having at least one property in common. For example, the population of any
country has a common language, literature, geographic origin and genetic heritage which
distinguish them from people of different nationalities. All the students studying in a school or
college, all the books in a library and all the houses in a village or town are some examples of
populations.
The objects or individuals in the population are called members or elements and the
number of members in the population is called the population size. Depending on its size, a
population can be finite or infinite. If the number of members in the population is finite (countable),
it is a finite population, e.g. the students in a college, the workers in a factory or the articles
produced by a company on a particular day. If the number of members is infinite, or so large that it
cannot practically be counted, it is treated as an infinite population, e.g. the stars in a galaxy or the
people watching television programmes. Statisticians use the word population to refer not only to
people but to all items that have been chosen for study.
Census:
Sometimes it is possible and practical to examine and study every person or item in the
population. Such a complete enumeration is called a census. A census is the procedure of
systematically collecting and recording information about every member of a given population. It
provides true measures of the population. For example, if we study the average annual income of the
families of a particular area having 1000 families, then we must study the income of all 1000 families,
and no family should be left out.
The population census of India is taken every 10 years. The first census was
taken in 1871 – 72. The latest census was taken in 2011.

89
Bhumi Publishing, India

Merits of census:
1. The data is collected from each and every item of the population.
2. The results are more accurate and reliable.
3. Intensive study is possible.
4. The data collected may be used for various surveys, analyses etc.
Demerits of census:
1. It requires a large number of enumerators.
2. It requires more money, labour, time, energy etc.
3. It is not possible in some circumstances where the universe is infinite.
Sample:
If the population is infinite or very large, it is impossible to study each and every member.
In this case, a small group of members is selected from the population in such a way that all the
important properties or characteristics of the population are covered by the members of the group.
Such a group is known as a sample. Thus, a sample is a small group of finitely many members selected
from a statistical population so that all the important characteristics of the entire population are
covered by the members of the group. It is a representative subset of the people, events or items of a
larger population. To represent a population well, a sample should be randomly collected and
adequately large. The members of a sample, which cannot be further subdivided for sampling, are
known as sample points and the number of members in a sample is called the sample size. Often, it is
necessary to use samples for research, because it is impractical to study the whole population. For
example, to study the average height of 12-year-old boys in a country, we could not measure all of
the 12-year-old boys in that country, but we could measure a sample of 12-year-old boys.
Reasons for selecting a sample:
Sampling is inevitable in the following situations:
1. Complete enumerations are practically impossible when the population is infinite.
2. When the results are required in a short time.
3. When the area of survey is wide.
4. When resources for survey are limited particularly in respect of money and trained persons.
5. When the item or unit is destroyed under investigation.
Sampling frame:
For adopting any sampling procedure it is essential to have a list identifying each sampling
unit by a number. Such a list or map is called a sampling frame. A list of voters, a list of householders,
a list of villages in a district, a list of farmers etc. are a few examples of sampling frames.
Sampling Methods:
The method of selecting small groups, i.e. samples, from the population which represent the
characteristics (like height, weight, colour) of the population is called a sampling method. If we want
to draw reliable conclusions from our samples, we need to make sure that we choose them in the
right way.
The sampling process involves the following stages:
(i) Define the population of statistical analysis.
(ii) Specify a sampling frame, i.e. a set of items or events that it is possible to measure.
(iii) Specify a sampling method for selecting items or events from the frame.
(iv) Determine the sample size.
(v) Implement the sampling plan.
(vi) Carry out the sampling and collect the data.
Merits of Sampling:
There are many advantages of sampling methods over census method. They are as follows:
1. Under sampling, a statistical investigation is carried out speedily.
2. It results in reduction of cost, time, energy and labour.
3. Sampling can give more accurate results, because a smaller volume of data can be collected and
checked more carefully.
4. The size of the sample can be increased or decreased according to the size of the universe,
availability of resources and degree of accuracy desired.
5. It has greater scope.
Types of Sampling Techniques (Methods):
Following are the different types of sampling which are commonly used:
1. Simple Random Sampling
2. Systematic Sampling
3. Stratified Random Sampling
4. Cluster Sampling
5. Quota Sampling
1. Simple Random Sampling:
It is the most popular method of choosing a sample from a population for a wide range of
purposes. A simple random sample is a group of individuals selected from a larger population, using
either a random number table or a random number generator. Every individual of this sample is
selected randomly and has an equal chance (probability) of being selected. The process or technique of
selecting individuals with the same probability of being selected is known as simple random
sampling.
Suppose we have a population of size (N) 10,000 students in a university. Each of them is
known as a unit or member. To select a sample of the required size (n), say 200, we could use simple
random sampling. Students would be selected at random and sent a questionnaire for analysis.
Steps to create Simple Random Sample:
a) Define the population
b) Select the sample size (n)
c) List the population
d) Assign numbers to the units
e) Choose random numbers
f) Select sample of required size
a) Define the population
In the above example, the population size (N) is 10,000 students of a university. As we are
interested in all these university students, our sampling frame consists of all 10,000 students. If we
were interested only in male students, then females would be excluded, the population would be
defined for males only and N would be less than 10,000.
b) Select the sample size
Suppose we want to choose a sample size (n) of 200 students. In practice the sample size is limited
by the cost and time required to distribute the questionnaire to students.
c) List the population
To draw the sample of 200 students we have to identify all 10,000 students of the university. To
carry out the research we have to obtain permission from the student records office to view a list of
all students studying at the university.
d) Assign numbers to the units
Now assign consecutive numbers from 1 to N (population size) to the units of the
population. In our case, we have to assign the numbers from 1 to 10,000.
e) Choose random numbers
Next make a list of 200 random numbers to select the members of the sample from the total list of
10,000 students. These random numbers can be found using either random number tables or a
computer program that generates them.
f) Select your sample
Finally, we select the 200 students corresponding to the selected 200 random numbers.
Suppose the first three random numbers are 0007, 8182, 0576. It means we have selected the 7th,
8182nd and 576th students from the list of 10,000 students. Continue the process till we have a sample
of all 200 students.
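Steps d) to f) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `simple_random_sample` and the seed value are assumptions made for the example, and Python's standard `random` module stands in for the random number table.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Draw `sample_size` distinct unit numbers from 1..population_size."""
    rng = random.Random(seed)                   # seeded so the draw can be repeated
    units = range(1, population_size + 1)       # step d: units numbered 1..N
    return rng.sample(units, sample_size)       # steps e and f: equal chance for every unit

selected = simple_random_sample(10_000, 200, seed=1)   # seed value is arbitrary
print(len(selected), len(set(selected)))               # prints: 200 200
```

Each number in `selected` identifies one student on the numbered list, in the same way as the random numbers 0007, 8182 and 0576 above.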
a) Simple Random Sampling with Replacement:
In this method, the first element is selected at random from the population. Its characteristics
are studied and recorded, then it is replaced back into the population and the second element is
selected at random. This process is continued till a sample of the required size is selected. Because
every selected unit is returned, a unit may be selected more than once and such repeated selections
are retained in the sample. The population size in this case remains the same in every selection.


b) Simple Random Sampling without Replacement:


In simple random sampling without replacement, first element is selected at random from
population of size N. After studying its characteristics it is not replaced back into the population.
Then second element is selected at random from the remaining population of size N-1 and so on.
Thus the size of population goes on decreasing in each selection. In this method, element once
selected for sample cannot be repeated.
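The difference between the two schemes can be illustrated with a small sketch. The population of 10 numbered units, the sample size of 5 and the seed are assumptions chosen only for the illustration; `random.choices` draws with replacement and `random.sample` draws without replacement.

```python
import random

rng = random.Random(42)                 # arbitrary seed, only for repeatability
population = list(range(1, 11))         # a small population of N = 10 numbered units

# With replacement: every draw is made from the full population,
# so the same unit can appear more than once.
with_replacement = rng.choices(population, k=5)

# Without replacement: a selected unit is not returned,
# so all selected units are distinct.
without_replacement = rng.sample(population, k=5)

print(with_replacement)
print(without_replacement)
```

In the second list no unit can repeat, which matches the statement that the population available for selection shrinks from N to N-1, N-2 and so on.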
Merits of simple random sampling:
1) It provides a sample which is highly representative of the population being studied,
assuming that there is limited missing data.
2) It is a fair way of selecting sample from a population as every member is given equal
chance of being selected.
3) It is easy to make inferences about whole population from the results of the sample,
because of representativeness of a sample obtained from population.
4) It shows more accuracy when the size of both population and sample is very large.
Demerits of simple random sampling:
1) It can be expensive and time-consuming, especially for large populations.
2) One of the most obvious limitations of simple random sampling is the need for a complete
list of all members of the population.
3) It is not suitable for samples of very small size. In this case the sample may not be a true
representative of the population.
4) When there are large differences between the units of the population, a simple random
sample may not be representative.
2. Systematic Random Sampling:
It is also known as quasi-random sampling. Systematic random sampling is a little bit
different from simple random sampling. This method is frequently used when the population is
homogeneous (or belongs to the same subgroup) and a complete list of the population is available. In
this method all the members of the population are arranged in a systematic and definite order. The
complete list of the population may be arranged in alphabetical, geographical or numerical order. The
first unit of the sample is selected at random and then the remaining units are selected in a specific
manner. After selecting the first unit, the subsequent elements are selected by taking every kth
member from the list of the population till the sample of required size is completed, where k is the
ratio of population size (N) to required sample size (n), i.e.
$$k = \frac{\text{population size}}{\text{sample size}} = \frac{N}{n}$$

Suppose a researcher wants to study the career goals of students in an Institute
which has about 8,000 students. Thus, the population size (N) is 8,000. He wants to select a
sample of size (n) 100 students using systematic sampling. With systematic random sampling, every
student would have an equal chance of being selected for the required sample.
Steps to create Systematic Random Sample:
a) Define the population
b) Select the sample size (n)
c) List the population and arrange in specific or definite order
d) Calculate value of k
e) Select the first unit
f) Select sample of required size.
a) Define the population:
In the situation described above, the population size (N) is 8,000 students in the Institute
and we are interested in all the students of the Institute, both male and female. If we were
interested only in female students, the male students would be excluded and N would be less than
8,000.
b) Select the Sample Size (n):
Decide the number of members of the sample for the further study. Suppose we want to
choose a sample size (n) of 100 students. In practice the sample size is limited by the cost and
time required to distribute the questionnaire to students.
c) List the population and arrange in specific or definite order:
To draw the sample of 100 students we have to identify all 8,000 students of the Institute.
Collect the list of all the students studying in the Institute. Then arrange all the
students in a specific order, i.e. either assign numbers from 1 to N or arrange them alphabetically.
d) Calculate value of ‘k’:
Assuming that we have chosen a sample size of 100 students, we need to find the value of k,
which is the ratio of population size to sample size.
$$k = \frac{\text{population size}}{\text{sample size}} = \frac{N}{n} = \frac{8{,}000}{100} = 80$$
It tells us that we have to choose 1 student in every 80 students from the population of
8,000 students of the Institute.
e) Select the first unit:
After finding k, we need to select the first student at random. As we have assigned numbers
to the members of population, choose any student at random from 1 to 80 (k) and suppose it is 25th
student.
f) Select sample of required size:
We have the first member of our sample, i.e. the 25th student. So we can select the remaining 99
members easily using the value of k.
Now add k = 80 to the first member, 25, which will give the next member:
25 + 80 = 105, so the 105th student is the second member.
Then 105 + 80 = 185, so the 185th student is the third member.
Continue the process till the sample of required size is completed.
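The whole procedure can be sketched as follows. The function name `systematic_sample` and its parameters are assumptions made for the illustration, and the sketch assumes that N is an exact multiple of n so that k is a whole number.

```python
import random

def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Return the unit numbers chosen by systematic sampling from 1..population_size."""
    k = population_size // sample_size            # sampling interval k = N / n
    if start is None:
        # step e: choose the first unit at random between 1 and k (inclusive)
        start = random.Random(seed).randint(1, k)
    # step f: keep adding k to obtain the remaining members
    return [start + i * k for i in range(sample_size)]

sample = systematic_sample(8000, 100, start=25)   # first unit fixed at 25, as in the text
print(sample[:3], sample[-1])                     # prints: [25, 105, 185] 7945
```

With N = 8,000, n = 100 and a first unit of 25, the last selected student is number 25 + 99 × 80 = 7945, so the sample is spread evenly over the whole list.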
Merits of Systematic Random Sampling:
1) It is easy to construct, execute, compare and understand.
2) It reduces time and work.
3) It gives accurate results if properly performed.
4) It distributes the sample evenly over the population.
Demerits of Systematic Random Sampling:
1) It may not be possible to select the required sample size if the population is too small or
infinite.
2) A bad arrangement of the units, for example a list with a periodic pattern that matches k,
may produce an inefficient or biased sample.
3) It may not be representative of the whole population.
3. Stratified Random Sampling:
This technique is widely used and very useful when the population is heterogeneous with
respect to variables or characteristics. This heterogeneous population is then divided into several
smaller homogeneous groups. These groups are known as strata (singular Stratum). A simple
random sample of suitable size from each stratum is selected to constitute a required sample
known as stratified random sample. Since each stratum is more homogeneous than the original
population, we are able to get more precise estimates of the whole. Stratified random sampling is
sometimes also called proportional random sampling or quota random sampling, but it should not be
confused with quota sampling (method 5 in this chapter), which is a non-probability technique.
Generally, it is used in cases like males vs. females, houses vs. apartments, etc., where we are
interested in particular strata (groups) in a population.
For example, geographical regions can be stratified into similar regions by means of some
known variable such as habitat type, elevation or soil type. Another example might be to determine
the proportions of defective products being assembled in a factory. In this case sampling may be
stratified by production lines, factory, etc.
Suppose a researcher wants to study more about the career goals of students at University having
roughly 10,000 students (N) and he is interested in comparing the differences in career goals
between male and female students. Using following steps we will create stratified random sample.
Steps to create Stratified Random Sample:
a) Define the population
b) Select relevant stratification
c) List the population according to selected stratification
d) Select sample size (n)
e) Calculate proportionate stratification
f) Select sample of required size using simple random sampling or systematic random
sampling.


a) Define the population:


Here the population is of 10,000 students at the University which is population size (N).
Our sampling frame is all 10,000 students as we are interested in all these students.
b) Select relevant stratification:
We want to study the differences in male and female students, so gender is the
required stratification. And hence we will use gender male and female as our strata.
c) List the population according to selected stratification:
Assign consecutive numbers from 1 to N_k to the students in each stratum, where N_k is the
number of students in stratum k. This will result in two lists, one
detailing all male students and one detailing all female students.
d) Select sample size (n):
Decide the number of members for the sample for the further study. Suppose we want to
choose the sample size (n) of 200 students.
e) Calculate proportionate stratification:
Consider that out of 10,000 students, 60% (= 6,000) are male and 40% (= 4,000) are female. While
selecting the members for our sample, we have to ensure that the number of units from each
stratum is proportionate to the number of males and females in the population. To achieve this,
we multiply the desired sample size (n = 200) by the proportion of units (60% and 40%) in
each stratum.
Hence, number of males for required sample $= 200 \times 60\% = 200 \times \frac{60}{100} = 120$
And number of females for required sample $= 200 \times 40\% = 200 \times \frac{40}{100} = 80$
This means that we need to select 120 male students and 80 female students for our sample of
200 students.
f) Select sample of required size:
Finally we have to select 120 male students from 6,000 and 80 female students from 4,000
using either simple random sampling or systematic random sampling to complete the sample.
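The proportionate allocation in step e) and the selection in step f) can be sketched as below, assuming the 10,000 students split 60% / 40%, i.e. 6,000 male and 4,000 female. The function name `proportionate_allocation` and the seed are illustrative assumptions; simple random sampling is used inside each stratum.

```python
import random

def proportionate_allocation(strata_sizes, sample_size):
    """Share `sample_size` among the strata in proportion to their sizes."""
    total = sum(strata_sizes.values())
    # Rounding is harmless here; in general the rounded shares may need a small
    # adjustment so that they add up exactly to the sample size.
    return {name: round(sample_size * size / total)
            for name, size in strata_sizes.items()}

strata = {"male": 6000, "female": 4000}            # stratum sizes N_k
allocation = proportionate_allocation(strata, 200)
print(allocation)                                  # prints: {'male': 120, 'female': 80}

rng = random.Random(0)                             # arbitrary seed
# Step f: a simple random sample of the allocated size from each numbered stratum list.
sample = {name: rng.sample(range(1, strata[name] + 1), n_k)
          for name, n_k in allocation.items()}
```

The dictionary `sample` then holds 120 distinct numbers from the list of male students and 80 distinct numbers from the list of female students.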
The principal reasons for using stratified random sampling rather than simple random
sampling include:
1. Stratification may produce a smaller error of estimation than would be produced by a simple
random sample of the same size. This result is particularly true if measurements within strata
are very homogeneous.
2. The cost per observation in the survey may be reduced by stratification of the population
elements into convenient groupings.
3. Estimates of population parameters may be desired for subgroups of the population. These
subgroups should then be identified.


Merits of stratified random sampling:


1) It provides us with a sample that is highly representative of the population being studied,
assuming that there is limited missing data.
2) It allows us to make statistical conclusions from the data collected that will be considered to
be valid.
3) It improves the potential for the units to be more evenly spread over the population.
4) It improves the representation of particular strata (groups) within the population, as well as
ensuring that these strata are not over-represented.
5) Stratification gives a smaller error in estimation and greater precision than the simple
random sampling method.
Demerits of stratified random sampling:
1) It must be possible to classify every unit in the list of the population clearly into a stratum;
that is, each unit from the population must belong to one and only one stratum.
2) Even if a list is readily available, it may be challenging to gain access to that list. The list may
be protected by privacy policies or require a lengthy process to obtain permissions.
3) It may be difficult and time consuming to bring together numerous sub-lists to create a final
list from which you want to select your sample.
4) It can increase costs to carry out the research.
4. Cluster Sampling:
This technique is generally used when the population consists of groups that are similar to one
another. It is the sampling
technique in which the population is divided into separate groups known as clusters. A complete
list of clusters represents the sampling frame. Each element of the population can be assigned to
one, and only one, cluster. Then, a few clusters are chosen randomly as the source of primary data.
Elements in the chosen clusters are then sampled together to form the required sample. Cluster
sampling can be one-stage or two-stage sampling. This is a popular method in conducting marketing
research.
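A minimal sketch of one-stage and two-stage cluster sampling is given below. It assumes the usual meaning of the two terms: in one-stage sampling every element of the chosen clusters is taken, while in two-stage sampling a random sample is drawn inside each chosen cluster. The population of 20 clusters with 50 households each, the identifiers and the seed are all hypothetical.

```python
import random

# Hypothetical population: 20 clusters (e.g. villages), each with 50 household IDs.
clusters = {c: [f"c{c}-h{h}" for h in range(1, 51)] for c in range(1, 21)}

rng = random.Random(7)                                  # arbitrary seed
chosen_clusters = rng.sample(sorted(clusters), 4)       # choose 4 clusters at random

# One-stage cluster sampling: take every element of the chosen clusters.
one_stage = [unit for c in chosen_clusters for unit in clusters[c]]

# Two-stage cluster sampling: take a simple random sample of 10 inside each chosen cluster.
two_stage = [unit for c in chosen_clusters for unit in rng.sample(clusters[c], 10)]

print(len(one_stage), len(two_stage))                   # prints: 200 40
```

Only the 4 chosen clusters contribute to either sample; the other 16 clusters are not visited at all, which is where the saving in cost and time comes from.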
Merits of cluster sampling:
1) This technique is cheap, quick and easy. Instead of sampling an entire country, the researcher
can allocate his limited resources to the few randomly selected clusters or areas when using
cluster sample.
2) It increases the operational efficiency of sampling, because data collection is concentrated in a
few clusters.
3) This method is easy to be used from practicality viewpoint.
Demerits of cluster sampling:
1) This technique is the least representative of the population as compared to other sampling
techniques.
2) It is a sampling technique with the possibility of high sampling error.


The Difference between Stratified and Cluster sampling:
Strata and clusters are both non-overlapping subsets of the population, but they differ in several ways.
• In stratified sampling, individual elements within each stratum are the sampling units, while in
cluster sampling a whole cluster is treated as a sampling unit.
• With stratified sampling, the best survey results occur when elements within strata are
internally homogeneous. However, with cluster sampling, the best results occur when
elements within clusters are internally heterogeneous.
• The main difference between cluster sampling and stratified sampling lies in which groups are
included in the sample. In stratified random sampling, all the strata of the population are
sampled, while in cluster sampling the researcher randomly selects only a number of clusters from
the collection of clusters of the entire population. Therefore, only a number of clusters are
sampled and all the other clusters are left unrepresented.
Multi-stage sampling:
Multi-stage sampling (also known as multi-stage cluster sampling) is a more complex form
of cluster sampling which contains two or more stages in sample selection. In multi-stage sampling
large clusters of population are divided into smaller clusters in several stages in order to make
primary data collection more manageable. Note that multi-stage sampling is not as
effective as simple random sampling; however, it addresses certain disadvantages associated with
simple random sampling, such as being overly expensive and time-consuming.
Merits of Multi-stage sampling:
1) It is effective in primary data collection from geographically dispersed population.
2) It is cost-effective and time-effective.
3) This method has high level of flexibility.
Demerits of Multi-stage sampling:
1) This method is not highly representative of whole population.
2) It has high level of subjectivity.
3) Group-level information is required at each stage.
5. Quota sampling:
Quota sampling is a type of non-probability sampling technique and it is defined as the
sampling method of collecting representative data from groups. These sampling groups represent
certain characteristics of the population chosen by the researcher.
For example, suppose a researcher wants to evaluate the impact of cross-cultural differences
on 10,000 students in a University. He also needs to assess the effectiveness of students’ motivational
tools, taking into account gender differences among the students of the University.


Steps to create Quota sample:


a) Define population
b) Choose the relevant stratification and divide the population accordingly
c) Calculate quota from each stratum
d) Continue to invite cases until the quota for each stratum is fulfilled
a) Define population:
Here the population consists of 10,000 students at the University, which is the population size (N),
and we need to select 100 students, which is the sample size (n).
b) Choose the relevant stratification and divide the population accordingly:
Students in the university as the sampling frame need to be divided into following groups
(strata) according to their cultural background:
i) North Indian
ii) South Indian
iii) East Indian
iv) West Indian
c) Calculate quota from each stratum:
The number of cases that should be included in each stratum will vary depending on the
make-up of each stratum within the population. If we have to examine the differences in male and
female students then number of students from each group that we would include in the sample
would be based on the proportion of male and female students amongst the 10,000 university
students.
For example, if there were 6,000 male students (60% of the total) and 4,000 female
students (40% of the total), our sample would need to be made up of 60% males and 40% females.
If our desired sample size was 100 students, this would mean our sample should include 60 male
students and 40 female students.
d) Continue to invite cases until the quota for each stratum is fulfilled:
Once we have calculated the number of cases we need in each stratum, we simply need to
keep inviting participants to take part in the research until each of these quotas is filled.
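Step d) can be sketched as a simple loop that accepts participants in the order in which they respond until every quota is full. The function name `fill_quotas` and the list of 300 volunteers who alternate between male and female are hypothetical and serve only to make the example runnable.

```python
def fill_quotas(volunteers, quotas):
    """Accept volunteers in arrival order until the quota of every group is filled."""
    remaining = dict(quotas)                 # places still open in each group
    accepted = []
    for person_id, group in volunteers:
        if remaining.get(group, 0) > 0:      # accept only if the group still has room
            accepted.append((person_id, group))
            remaining[group] -= 1
        if not any(remaining.values()):      # stop once all quotas are filled
            break
    return accepted

quotas = {"male": 60, "female": 40}          # quotas calculated in step c)
# Hypothetical arrival order: volunteers 1..300, alternating male and female.
volunteers = [(i, "male" if i % 2 else "female") for i in range(1, 301)]
accepted = fill_quotas(volunteers, quotas)
print(len(accepted))                         # prints: 100
```

Nothing in this loop is random: whoever responds first is accepted. This is why quota sampling is a non-probability technique and why its sampling error cannot be determined.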
Merits of Quota sampling:
1) It is particularly useful when a probability sample cannot be obtained.
2) It is easier and quicker to carry out as it doesn’t require a sampling frame.
3) It improves representation of particular strata within the population.
Demerits of Quota sampling:
1) The sample is not selected at random and hence the sampling error cannot be determined.
2) It is not possible to make statistical inferences from the sample to the population.


Exercise
1. Define population. Explain need of sample in detail.
2. Differentiate census and sample.
3. Explain simple random sampling with replacement and without replacement.
4. Distinguish between: simple random sampling with replacement and without replacement.
5. Write short note on systematic sampling.
6. Distinguish between: stratified sampling and cluster sampling.
7. Explain quota sampling.
8. Differentiate between systematic sampling and stratified sampling.
9. Give advantages and disadvantages of simple random sampling.
10. Explain the advantages of sampling.


4. Correlation

Introduction:
In the previous lesson, we learned about the joint probability distribution of two random
variables X and Y. In this lesson, we'll extend our investigation of the relationship between two
random variables by learning how to quantify the extent or degree to which two random
variables X and Y are associated or correlated.
The term correlation is used by the common person in day-to-day life, knowingly or
unknowingly. For example, when parents advise their children to work hard so that they may get
good marks, they are correlating good marks with hard work.
In the previous lessons we have studied the characteristics, measures of central tendency
and measures of dispersion of one variable, i.e. univariate data. But there are variables which are
related to each other. E.g. the height and weight of persons are related to each other. Data
containing two variables which are related to each other are called bivariate data in statistical
analysis. Sometimes the variables may be interrelated, like blood pressure and age. The nature and
strength of the relationship may be studied by correlation and regression.
Correlation:
In statistical analysis, two sets of data or two random variables may depend on each other
in such a way that an increase or decrease in the values of one variable results in either an increase
or a decrease in the values of another variable. The extent of linear relationship between two or
more variables is called correlation.
E.g. correlation in demand for a product and its price
Correlation is a single number that describes the degree of linear relationship between two
variables. It is a statistical technique which shows how strongly pairs of variables are related. Two
variables are said to be correlated, if change in one of the variables results in a change in the other
variable.
Uses of correlation:
1. It is used in physical and social sciences.
2. Businessmen estimate costs, sales, price etc. using correlation.
3. It is useful for economists to study the relationship between variables like price, quantity
etc.
4. It is helpful in measuring the degree of relationship between variables like income and
expenditure, price and supply, supply and demand etc.
5. Sampling error can be calculated.
6. It is the basis for the concept of regression.


Scatter Diagram:
Scatter diagram is the diagrammatic representation of relationship between two variables.
It is the simplest method of studying correlation. In scatter diagram, one variable is taken along
horizontal axis and second variable is taken along vertical axis. Each pair of observations of two
variables is represented by dot in the plane of axes. There are as many dots in the plane as the
number of paired observations of two variables. The direction of dots shows the scattering or
concentration of given points which further helps to decide the type of correlation.
The following are the types of correlation:
1) Positive Correlation:
If a change in the values of one variable leads to a change in the same direction in the values of
another variable, then it is positive correlation. It is a relationship between two variables which move
in the same direction. In positive correlation, if the values of one variable decrease then the values of
the other variable also decrease, and vice versa.
E.g. Price and supply are two variables, which are positively correlated. When Price increases,
supply also increases; when price decreases, supply decreases.
The scatter diagram for positive correlation is shown below. The line corresponding to the
scatter plot is an increasing line.

[Figure: scatter diagram showing Positive Correlation]

2) Negative Correlation:
If a change in the values of one variable leads to a change in the opposite direction in the values
of another variable, then it is negative correlation. It is a relationship between two variables which
move in opposite directions. In negative correlation, if the values of one variable decrease then the
values of the other variable increase, and if the values of one variable increase then the values of the
other variable decrease.
E.g. Price and demand are two variables which are negatively correlated. When price increases,
demand decreases; when price decreases, demand increases.
The scatter diagram for negative correlation is shown below. The line corresponding to the
scatter plot is a decreasing line.


[Figure: scatter diagram showing Negative Correlation]

3) Zero Correlation:
When there does not exist any relationship between two variables then it is zero
correlation. The increase or decrease in values of one variable does not affect the other variable.
E.g. Consider the claim "the more weight I gain, the smarter I will be". Intelligence is not affected
by weight, i.e. there is no relation between these two variables.
The scatter diagram for zero correlation is shown below. Zero correlation occurs when there
is no linear dependency between the two variables.

[Figure: scatter diagram showing Zero Correlation]

Merits of Scatter diagram:
1. It is the simplest and an attractive method of finding the nature of correlation between the two
variables.
2. It is a non-mathematical method and easy to understand.
3. It is not affected by extreme items.
4. It is the first step in finding out the relation between the two variables.
5. We can have a rough idea at a glance whether it is a positive correlation or negative
correlation.
Demerits of Scatter diagram:
By this method we cannot get the exact degree of correlation between the two variables.
Correlation Coefficient:
The scatter diagram does not give an exact idea of the strength of the relationship
between two variables. Instead, a number can give a good idea about how closely one variable is
related to another variable. If there is any relationship between two variables, we need to measure
the degree of that relationship. This measure of correlation is called the correlation coefficient, i.e.
the numerical value that determines the degree to which two variables are related to each other in
unit-free terms is known as the correlation coefficient. It gives the strength and direction of a linear
relationship.
Covariance:
Before studying correlation coefficient we will start with covariance which computes the
dependence between two random variables say X and Y.
i.e. if X and Y are two random variables (discrete or continuous) with respective means $\bar{x}$ and $\bar{y}$ then
covariance of X and Y, denoted by Cov(X, Y), is defined as:
\[ \mathrm{Cov}(X, Y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n} = \frac{\sum x_i y_i}{n} - \bar{x}\,\bar{y} \]
where $x_i$ are observations in X and $y_i$ are observations in Y.
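The two forms of the covariance formula can be checked numerically. The following is a minimal sketch, not part of the original text: the function name `covariance` is an illustrative choice, and the data are the father/son heights used in Ex.2 of this chapter.

```python
def covariance(xs, ys):
    """Covariance of paired observations (divisor n), computed two ways."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Definition: average product of the deviations from the means.
    by_deviations = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    # Shortcut: mean of the products minus the product of the means.
    by_shortcut = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    return by_deviations, by_shortcut


# Heights of fathers and sons (data of Ex.2 in this chapter).
fathers = [64, 65, 66, 67, 68, 69, 70]
sons = [66, 67, 65, 68, 70, 68, 72]
cov_dev, cov_short = covariance(fathers, sons)
print(round(cov_dev, 4), round(cov_short, 4))
```

Both values should agree (25/7, about 3.5714), which illustrates that the two expressions in the definition are equivalent.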
Note:
1. The value of correlation coefficient lies in between -1 and +1.
2. If correlation coefficient = 1 then it is perfectly positive correlation.
3. If correlation coefficient = -1 then it is perfectly negative correlation.
4. If correlation coefficient = 0 then it is zero correlation i.e. there is no correlation.
5. If correlation coefficient >0 then variables are positively correlated.
6. If correlation coefficient <0 then variables are negatively correlated.
Coefficient of correlation can be measured using two methods:
1) Karl Pearson’s Correlation Coefficient (r)
2) Spearman’s Rank Correlation Coefficient (R)
1) Karl Pearson’s Correlation Coefficient (r):
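This is a simple and the most common way to measure the degree of correlation between two
variables. It is also known as the product-moment correlation coefficient. It is a measure of the
strength as well as the direction of a linear relationship between two variables. It indicates how
closely the points lie to the line of best fit through the data of the two variables.
If $x_1, x_2, x_3, \dots, x_n$ are n observations of variable X and $y_1, y_2, y_3, \dots, y_n$ are n observations of
variable Y then Karl Pearson's Correlation Coefficient, denoted by r, is defined as
\[ r = \frac{\mathrm{Cov}(X, Y)}{\sigma_x \, \sigma_y} \]
where
\[ \sigma_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}} \quad \text{and} \quad \sigma_y = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n}} \]

Thus,
\[ r = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left[n \sum x_i^2 - (\sum x_i)^2\right] \left[n \sum y_i^2 - (\sum y_i)^2\right]}} \]

OR
\[ r = \frac{n \sum dx\,dy - \sum dx \sum dy}{\sqrt{\left[n \sum dx^2 - (\sum dx)^2\right] \left[n \sum dy^2 - (\sum dy)^2\right]}} \]

where $dx = x_i - \bar{x}$ and $dy = y_i - \bar{y}$. The formula also holds when the deviations are taken
from any convenient assumed means instead of the exact means.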


Steps:
1. Find the means $\bar{x}$, $\bar{y}$ of the two variables X and Y.
2. Take the deviations dx, dy of the two series X and Y using the formula. Then take their squares
as $dx^2$ and $dy^2$ and prepare the table as shown below.
X    Y    dx    dy    dx²    dy²    dx·dy

3. Calculate the total of each column.
4. Substitute the values in the formula of r and find r.

Ex.1. Calculate the coefficient of correlation from the 7 pairs of observations, given that,
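$\sum x = 212$, $\sum y = 152$, $\sum x^2 = 6514$, $\sum y^2 = 3390$, $\sum xy = 4681$.
Ans:
\[ r = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left[n \sum x_i^2 - (\sum x_i)^2\right] \left[n \sum y_i^2 - (\sum y_i)^2\right]}} \]
\[ = \frac{7 \times 4681 - 212 \times 152}{\sqrt{\left[7 \times 6514 - (212)^2\right] \left[7 \times 3390 - (152)^2\right]}} \]
\[ = \frac{32767 - 32224}{\sqrt{[45598 - 44944] \, [23730 - 23104]}} \]
\[ = \frac{543}{\sqrt{654 \times 626}} = \frac{543}{639.8468} \]

r = 0.8486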

Ex.2. Find Karl Pearson’s coefficient of correlation from the following data between height of father
(x) and son (y).
X 64 65 66 67 68 69 70
Y 66 67 65 68 70 68 72
Ans:

X    Y    dx = x_i − 67    dy = y_i − 68    dx²    dy²    dx·dy


64 66 -3 -2 9 4 6
65 67 -2 -1 4 1 2
66 65 -1 -3 1 9 3
67 68 0 0 0 0 0
68 70 1 2 1 4 2
69 68 2 0 4 0 0
70 72 3 4 9 16 12
469 476 0 0 28 34 25

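\[ \bar{x} = \frac{\sum x}{n} = \frac{469}{7} = 67 \quad \& \quad \bar{y} = \frac{\sum y}{n} = \frac{476}{7} = 68 \]

\[ r = \frac{n \sum dx\,dy - \sum dx \sum dy}{\sqrt{\left[n \sum dx^2 - (\sum dx)^2\right] \left[n \sum dy^2 - (\sum dy)^2\right]}} \]
\[ = \frac{7 \times 25 - 0 \cdot 0}{\sqrt{\left[7 \times 28 - (0)^2\right] \left[7 \times 34 - (0)^2\right]}} \]
\[ = \frac{175}{\sqrt{196 \times 238}} = \frac{175}{215.9814} \]

r = 0.810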
Ex.3. Calculate the correlation coefficient for the following heights of fathers (x) and their sons (y).
x 65 66 67 67 68 69 70 72
y 67 68 65 68 72 72 69 71
Ans:

X    Y    dx = x_i − 68    dy = y_i − 69    dx²    dy²    dx·dy


65 67 -3 -2 9 4 6
66 68 -2 -1 4 1 2
67 65 -1 -4 1 16 4
67 68 -1 -1 1 1 1
68 72 0 3 0 9 0
69 72 1 3 1 9 3
70 69 2 0 4 0 0
72 71 4 2 16 4 8
544 552 0 0 36 44 24

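\[ \bar{x} = \frac{\sum x}{n} = \frac{544}{8} = 68 \quad \& \quad \bar{y} = \frac{\sum y}{n} = \frac{552}{8} = 69 \]

\[ r = \frac{n \sum dx\,dy - \sum dx \sum dy}{\sqrt{\left[n \sum dx^2 - (\sum dx)^2\right] \left[n \sum dy^2 - (\sum dy)^2\right]}} \]
\[ = \frac{8 \times 24 - 0 \cdot 0}{\sqrt{\left[8 \times 36 - (0)^2\right] \left[8 \times 44 - (0)^2\right]}} \]
\[ = \frac{192}{\sqrt{288 \times 352}} = \frac{192}{318.3959} \]

r = 0.6030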
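The hand calculations in Ex.1 to Ex.3 can be cross-checked with a short program. This is a minimal sketch of the raw-sum form of Karl Pearson's formula; the function names `pearson_r_from_sums` and `pearson_r` are illustrative and do not come from the text.

```python
import math


def pearson_r_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Pearson's r from the column totals used in the raw-sum formula."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


def pearson_r(xs, ys):
    """Pearson's r from raw paired observations."""
    return pearson_r_from_sums(
        len(xs), sum(xs), sum(ys),
        sum(x * x for x in xs), sum(y * y for y in ys),
        sum(x * y for x, y in zip(xs, ys)),
    )


# Ex.1 gives only the totals; Ex.2 and Ex.3 give the raw observations.
r_ex1 = pearson_r_from_sums(7, 212, 152, 6514, 3390, 4681)
r_ex2 = pearson_r([64, 65, 66, 67, 68, 69, 70], [66, 67, 65, 68, 70, 68, 72])
r_ex3 = pearson_r([65, 66, 67, 67, 68, 69, 70, 72],
                  [67, 68, 65, 68, 72, 72, 69, 71])
print(r_ex1, r_ex2, r_ex3)
```

The three results should agree with the worked answers (about 0.8486, 0.810 and 0.6030). Working from raw sums or from deviations gives the same r, because the formula is unaffected by a change of origin.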


Merits of correlation coefficient:


• This method not only indicates the presence or absence of correlation between any two
variables but also determines the exact extent or degree to which they are correlated.
• It makes it easy to identify the type of correlation between the two variables, i.e. positive or negative.
• It helps to estimate the value of a dependent variable with reference to a particular value of
an independent variable through regression equations.
Demerits of correlation coefficient:
• It is strongly affected by the values of extreme items.
• In comparison to the other methods, it takes more time to arrive at the results.
• It assumes a linear relationship between the variables even though it may not be there.
• It is liable to be misinterpreted, as a high degree of correlation does not necessarily mean a
very close relationship between the variables.
• It is tedious to calculate.
2) Spearman’s Rank Correlation Coefficient (R):
This non-parametric method is used to determine the degree of correlation when one or both of
the variables are qualitative in nature. In some cases the ranks of the variables are already
given; if they are not, the observer must assign the ranks.
If $x_1, x_2, x_3, \dots, x_n$ are n observations of variable X and $y_1, y_2, y_3, \dots, y_n$ are n observations of
variable Y then Spearman's Rank Correlation Coefficient, denoted by R, is defined as
\[ R = 1 - \frac{6 \sum D^2}{n(n^2 - 1)} \]
where D = Rx − Ry,
Rx = Ranks of data X
Ry = Ranks of data Y
Ex.1. Calculate Rank correlation coefficient from following data.
Marks by Judge A 81 72 60 33 29 11 56 42
Marks by Judge B 75 56 42 15 30 20 60 80
Ans:

X    Y    Rx    Ry    D = Rx − Ry    D²
81 75 1 2 -1 1
72 56 2 4 -2 4
60 42 3 5 -2 4
33 15 6 8 -2 4
29 30 7 6 1 1
11 20 8 7 1 1
56 60 4 3 1 1
42 80 5 1 4 16
Total 32

\[ R = 1 - \frac{6 \sum D^2}{n(n^2 - 1)} = 1 - \frac{6 \times 32}{8 \times 63} = 1 - 0.3810 = 0.6190 \]
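The ranking and the rank correlation of Ex.1 can be reproduced as follows. This is a minimal sketch under one stated assumption: all values within a series are distinct (tied values would need a correction that is not covered here). The function names are illustrative.

```python
def ranks_descending(values):
    """Rank 1 for the largest value; assumes all values are distinct."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman_rank_r(xs, ys):
    """Spearman's R = 1 - 6 * sum(D^2) / (n * (n^2 - 1))."""
    rank_x = ranks_descending(xs)
    rank_y = ranks_descending(ys)
    n = len(xs)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Marks given by Judge A and Judge B in Ex.1.
judge_a = [81, 72, 60, 33, 29, 11, 56, 42]
judge_b = [75, 56, 42, 15, 30, 20, 60, 80]
r_rank = spearman_rank_r(judge_a, judge_b)
print(ranks_descending(judge_a), round(r_rank, 4))
```

The ranks for Judge A should match the Rx column of the table, and the coefficient should agree with the hand calculation (about 0.619).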

Ex.2. Psychological tests of intelligence and arithmetical ability were applied to 10 candidates.
Results are given in table. Compute rank correlation coefficient between X and Y.
Intelligence ratio (X) 90 95 115 96 85 110 89 98 97 93

Arithmetical ratio (Y) 95 90 110 100 85 105 94 106 111 93

Ans.:
X    Y    Rx    Ry    D² = (Rx − Ry)²
90 95 8 6 4
95 90 6 9 9
115 110 1 2 1
96 100 5 5 0
85 85 10 10 0
110 105 2 4 4
89 94 9 7 4
98 106 3 3 0
97 111 4 1 9
93 93 7 8 1
Total 32
\[ R = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} = 1 - \frac{6 \times 32}{10 \, (99)} = 1 - 0.194 = 0.806 \]

Ex.3. In a dance competition, two judges rank 10 participants in the following order. From the
given data, calculate the coefficient of rank correlation.
Ranking by judge M 6 4 3 1 7 8 9 10 5 2
Ranking by judge N 4 1 6 7 8 7 10 3 2 5

Ans.:
Rank by M (Rx)    Rank by N (Ry)    D² = (Rx − Ry)²
6 4 4
4 1 9
3 6 9
1 7 36
7 8 1
8 7 1
9 10 1
10 3 49
5 2 9
2 5 9
Total 128

\[ R = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} = 1 - \frac{6 \times 128}{10 \, (99)} = 1 - 0.776 = 0.224 \]
Merits of Rank Correlation Coefficient:
Spearman's rank method is a convenient way of studying correlation between qualitative data
which cannot be measured in figures but can be arranged in serial order.
Demerits of Rank Correlation
1) The method cannot be used with two-way frequency tables or bivariate frequency
distributions.
2) It can be conveniently used only when n is small, say up to 30; otherwise the calculations
become tedious.


Exercise
1. Define correlation. Explain different methods of studying correlation.
2. What is Spearman's rank correlation? When can it be used?
3. Om electrical obtained 120 tube lights from two companies and tested their life in hours.
The following results were obtained. Calculate the coefficient of variation and find which
company’s tubes are more durable?
Life of tubes (hrs) Company A Company B
800-1000 12 15
1000-1200 20 22
1200-1400 38 40
1400-1600 12 13
1600-1800 15 28
1800-2000 03 02

4. From given data of height of father and daughter in centimeters, calculate the correlation
coefficient.
Father 165 168 160 163 170 175 173
Daughter 160 175 166 159 173 180 177

5. Following table shows ages (X) in years and blood pressure (Y). From given data calculate
correlation coefficient.
X 25 50 60 43 51 74 46 33 49 58
Y 120 135 140 115 130 133 126 139 125 136

6. In epidemiological study of glaucoma in urban and rural population following data was
made available by WHO. Find if there is any correlation between urban and rural area.
No. of cases per 1000
Urban 23 35 28 36 45 39 19
Rural 20 30 22 40 35 45 22

7. Examine correlation for given data containing erythrocytes sedimentation rate in mm/hr of
10 male and female.
Male 112 65 70 82 105 75 60
Female 85 100 90 63 78 105 90


8. From data given in table find out the value of Karl Pearson’s Coefficient of correlation.
Fertilizer used 15 19 22 27 35 40 50
Productivity 80 95 102 118 135 144 150

9. Calculate the coefficient of correlation for given data of marks obtained by students in
Pharmaceutics and Pharmacology.
Pharmaceutics 75 51 42 77 62 81 60 58 66 49

Pharmacology 69 48 64 45 71 42 64 70 40 65

10. Compute coefficient of correlation from following data of supply and price of goods.
Supply 182 160 152 169 158 166 179
Price 167 198 152 170 162 152 180

11. Table contains values of import of raw material and export of finished formulation in
suitable unit. Calculate coefficient of correlation.
Import 15 21 15 16 25 19 12 25 21 10
Export 12 16 14 14 22 17 10 23 19 09

12. In a vocal music contest, two judges rank 09 competitors in following order.
Judge A 5 10 8 9 7 5 4 6 7 3

Judge B 3 9 9 6 10 8 7 8 3 6

13. Calculate Karl Pearson’s coefficient of correlation between x and y. State its kind.
X 39 65 62 90 82 75 25 98 36 78
Y 47 53 58 86 62 68 60 91 51 84

14. Calculate correlation coefficient for the following data.


X 1 3 4 8 9 11 14
Y 1 2 4 5 7 8 9

15. Given the following values of x and y. Find the correlation coefficient.
X 3 5 6 8 9 11
Y 2 3 4 6 5 8

16. Calculate correlation coefficient for the following data.


X 12 9 8 10 11 13 7
Y 14 8 6 9 11 12 3


5. Regression

Introduction:
After studying the relationship between two variables, now in this chapter we are going to
estimate the values of one variable when values of another variable are given. The variable which is
to be estimated is called “dependent” variable and the other is “independent” variable. Thus, the
term regression is used when we want to predict value of a variable based on the value of another
variable. Correlation gives us the extent of linear relationship between two variables while
regression analysis gives the measure of average relationship between two or more variables in
terms of original units of data.
The statistical technique of determining unknown values of one variable from the known
values of another variable is called Regression analysis. The relationship between two variables like
rainfall and agricultural production, consumer expenditure and disposable income etc are examples
of regression. Regression analysis is also used to define and characterize dose-response
relationships, for fitting linear portions of pharmacokinetic data and in obtaining the best fit to
linear physical-chemical relationships.
Difference between correlation and regression:
Correlation is related to regression, but its application and interpretation are different. Both
deal with the linear relationship between two or more variables; the differences between them
are listed below.
Regression | Correlation
1. It predicts the values of the dependent variable based on the known values of the independent variable, assuming the average relationship between two or more variables. | 1. It gives the association or intensity of relationship between two variables (x and y).
2. It involves at least one independent variable which is under the researcher's control. | 2. There is no concept of dependent or independent variable.
3. It gives a method to describe the nature of the relationship. | 3. It simply describes the strength and direction of the relationship.
4. The regression coefficient predicts the value of y from the value of x, or vice versa. | 4. The correlation coefficient gives an idea of the relationship between two or more variables.

Types of Regression:
Regression analysis can be classified into following types:
1. Simple Regression: In regression analysis, if only two variables are studied at a time then it
is called simple regression i.e. there is only one independent variable.
2. Multiple Regression: In regression analysis, if more than two variables are studied at a
time then it is called multiple regression i.e. there are two or more independent variables.
3. Linear Regression: If the graphical representation of the given data gives a straight-line
pattern then it is linear regression.
4. Non-linear Regression: If the graphical representation of the given data gives a curved
pattern then it is non-linear or curvilinear regression.
In this chapter, we are going to study Simple linear regression and its equations.
Simple Linear Regression:
Simple linear regression uses only one independent variable and examines the linear
relationship between two continuous variables: dependent (y) and independent (x) using straight
line. When the two variables are related, it is possible to predict a response value from a predictor
value with better than chance accuracy.
Regression provides the line that "best" fits the data. This line can then be used to:
• Examine how the response variable changes as the predictor variable changes.
• Predict the value of a dependent variable (y) for a given value of the independent variable (x).
Regression Lines:
In regression analysis of two variables, a regression line is a smooth curve fitted to the set of
paired data of x and y; if the curve is a straight line then it is a line of linear regression. There are
as many regression lines as variables. In simple linear regression we take two variables
X and Y, so there are only two regression lines:
• Regression line of Y on X: This gives the most probable values of Y from the given values of X.
• Regression line of X on Y: This gives the most probable values of X from the given values of Y.
Properties of Regression Lines:
(i) For perfect correlation i.e. r = ±1, the two lines coincide. So there will be only
one straight line.
(ii) If r = 0 then the variables are uncorrelated and the two lines cut each other at right
angles.
(iii) If the regression lines are close to each other then there is a high degree of correlation.
(iv) If the regression lines are far away from each other then there is a low degree of correlation.
(v) The two regression lines intersect each other at the point ($\bar{x}$, $\bar{y}$) i.e. the means of X and Y.


Linear Regression Equation:


A regression analysis generates two equations which describe the statistical linear
relationship between two given variables x and y, such as
y = a + bx ...... (i)
or
x = c + dy ...... (ii)
Such algebraic expressions of the regression lines, which show the linear relationship between two
variables in the form of straight lines, are called Regression Equations.
From equation (i) we can estimate unknown values of Y from known values of X; it is known as the
regression equation of Y on X.
From equation (ii) we can estimate unknown values of X from known values of Y; it is known as the
regression equation of X on Y.
Methods of Linear Regression Analysis:
The following chart represents the various methods of regression analysis.
Regression Methods
    Graphic: Scatter Diagram
    Algebraic: Least Square Method

(i) Scatter Diagram:


Using this method the points of two variables X and Y are plotted on graph paper. If
$x_1, x_2, \dots, x_n$ are n observations of variable X and $y_1, y_2, \dots, y_n$ are n observations of variable Y then
we plot the pairs $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ in a diagram. A regression line is then drawn with a scale
or free hand so that it passes as close as possible to the maximum number of points. If the errors in
estimation of variable Y are minimised then we get the regression line of Y on X, and vice versa.
(ii) Least Square Method:
The most common method for fitting a regression line is the method of least squares, because
in a scatter diagram several different lines can be drawn through the given points. This method
calculates the best-fitting line for the observed data by minimizing the sum of the squares of the
vertical deviations from each data point to the line (if a point lies on the fitted line exactly, then its
vertical deviation is 0). Because the deviations are first squared, then summed, there are no
cancellations between positive and negative values. A line fitted by the method of least squares is
known as the line of best fit.

Suppose $x_1, x_2, \dots, x_n$ are n observations of variable X and $y_1, y_2, \dots, y_n$ are n observations of
variable Y then

(i) Equation of regression line Y on X is given by
\[ y - \bar{y} = b_{yx} (x - \bar{x}) \]
where $\bar{y} = \frac{\sum y_i}{n}$ and $\bar{x} = \frac{\sum x_i}{n}$,
$b_{yx}$ = Regression coefficient of Y on X
\[ b_{yx} = \frac{n \sum dx\,dy - \sum dx \sum dy}{n \sum dx^2 - (\sum dx)^2} \]
with $dx = x_i - \bar{x}$ and $dy = y_i - \bar{y}$.

(ii) Equation of regression line X on Y is given by
\[ x - \bar{x} = b_{xy} (y - \bar{y}) \]
where $\bar{y} = \frac{\sum y_i}{n}$ and $\bar{x} = \frac{\sum x_i}{n}$,
$b_{xy}$ = Regression coefficient of X on Y
\[ b_{xy} = \frac{n \sum dx\,dy - \sum dx \sum dy}{n \sum dy^2 - (\sum dy)^2} \]
with $dx = x_i - \bar{x}$ and $dy = y_i - \bar{y}$.

Another Form of Regression coefficient:


If $\sigma_x$ = standard deviation of variable X,
$\sigma_y$ = standard deviation of variable Y,
r = correlation coefficient between X and Y,
then
(i) Regression coefficient of Y on X is given by
\[ b_{yx} = r \cdot \frac{\sigma_y}{\sigma_x} \]
and
(ii) Regression coefficient of X on Y is given by
\[ b_{xy} = r \cdot \frac{\sigma_x}{\sigma_y} \]

Properties of Regression Coefficient:


(1) The algebraic signs of both regression coefficients must be same i.e. either positive (+) or
negative (-).
(2) The geometric mean of both regression coefficients is equal to the correlation coefficient i.e.
\[ r = \pm \sqrt{b_{xy} \cdot b_{yx}} \quad \text{or} \quad r^2 = b_{xy} \cdot b_{yx} \]


(3) The correlation coefficient will have same sign as that of the regression coefficients.
(4) If value of one regression coefficient is greater than one then value of other regression
coefficient must be less than one.
(5) The regression coefficients are independent of origin but not of scale.
(6) If regression line Y on X is of the form
𝑦 = 𝑎 + 𝑏𝑥
Then 𝑏 = 𝑏𝑦𝑥 i.e. regression coefficient of Y on X
(7) The two regression lines intersect each other at point (𝑥 , 𝑦) i.e. means of X and Y.
(8) If regression line X on Y is of the form
𝑥 = 𝑐 + 𝑑𝑦
Then 𝑑 = 𝑏𝑥𝑦 i.e. regression coefficient of X on Y
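(9) Angle between the two regression lines is given by
\[ \tan\theta = \frac{m_1 - m_2}{1 + m_1 m_2} \]
where $m_1$ and $m_2$ are gradients of the regression lines and
\[ m_1 = \frac{\sigma_y}{r \, \sigma_x}, \qquad m_2 = \frac{r \, \sigma_y}{\sigma_x} \]
Thus,
\[ \theta = \tan^{-1}\left[ \frac{(1 - r^2)}{r} \cdot \frac{\sigma_x \, \sigma_y}{\sigma_x^2 + \sigma_y^2} \right] \]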

(10) The angle between regression lines indicates the degree of dependence between the variables.
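Properties (9) and (10) can be explored numerically. The sketch below is illustrative only: the function name is hypothetical, the standard deviations 11 and 8 are example values, and the absolute value of r is used so that the acute angle is returned (an assumption made for convenience, not something stated above).

```python
import math


def angle_between_regression_lines(r, sd_x, sd_y):
    """Acute angle (in degrees) between the two regression lines."""
    numerator = (1 - r ** 2) * sd_x * sd_y
    denominator = abs(r) * (sd_x ** 2 + sd_y ** 2)
    # atan2 copes with r = 0 (denominator 0), where the lines are perpendicular.
    return math.degrees(math.atan2(numerator, denominator))


angle_perfect = angle_between_regression_lines(1.0, 11, 8)
angle_none = angle_between_regression_lines(0.0, 11, 8)
angle_moderate = angle_between_regression_lines(0.66, 11, 8)
print(angle_perfect, angle_none, round(angle_moderate, 1))
```

With r = 1 the angle is 0 (the lines coincide), with r = 0 it is 90 degrees, and intermediate values of r give intermediate angles, which is why the angle indicates the degree of dependence between the variables.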

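Ex.1. If the values of the two regression coefficients are 0.75 and 0.2, find the correlation coefficient.

Ans: Let $b_{xy} = 0.75$ and $b_{yx} = 0.2$.
\[ r = \pm \sqrt{b_{xy} \cdot b_{yx}} = \pm \sqrt{0.75 \times 0.2} = \pm \sqrt{0.15} = \pm 0.3873 \]
Since both regression coefficients are positive, r = +0.3873.

Ex.2. Find r, if $b_{xy} = 0.8$ and $b_{yx} = 0.46$.

Ans: Let $b_{xy} = 0.8$ and $b_{yx} = 0.46$.
\[ r = \pm \sqrt{b_{xy} \cdot b_{yx}} = \pm \sqrt{0.8 \times 0.46} = \pm \sqrt{0.368} = \pm 0.6066 \]
Since both regression coefficients are positive, r = +0.6066.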

Ex.3. From data given in table, calculate two lines of regression.


X 16 20 17 21 15
Y 50 60 58 60 55
i) Estimate value of Y when 𝑋 = 25
ii) Estimate value of X when 𝑌 = 50

Ans.: Prepare a table from the given data, taking the deviations from the rounded means 18 and 57.

x    y    dx = x − 18    dy = y − 57    dx·dy    dx²    dy²
16   50   -2             -7             14       4      49
20   60   2              3              6        4      9
17   58   -1             1              -1       1      1
21   60   3              3              9        9      9
15   55   -3             -2             6        9      4
Σx = 89   Σy = 283   Σdx = −1   Σdy = −2   Σdx·dy = 34   Σdx² = 27   Σdy² = 72

\[ \bar{x} = \frac{\sum x}{n} = \frac{89}{5} = 17.8 \cong 18 \]
\[ \bar{y} = \frac{\sum y}{n} = \frac{283}{5} = 56.6 \cong 57 \]
(i) Part A: Equation of regression line Y on X is given by
\[ y - \bar{y} = b_{yx} (x - \bar{x}) \]
where the regression coefficient of Y on X is given by
\[ b_{yx} = \frac{n \sum dx\,dy - \sum dx \sum dy}{n \sum dx^2 - (\sum dx)^2} = \frac{5 \times 34 - (-1)(-2)}{5 \times 27 - (-1)^2} = \frac{170 - 2}{135 - 1} \approx 1.25 \]
Hence the equation becomes,
y − 56.6 = 1.25(x − 17.8)
y − 56.6 = 1.25x − 22.25
y − 1.25x = 34.35
(ii) Part B: Equation of regression line X on Y is given by
\[ x - \bar{x} = b_{xy} (y - \bar{y}) \]
where the regression coefficient of X on Y is given by
\[ b_{xy} = \frac{n \sum dx\,dy - \sum dx \sum dy}{n \sum dy^2 - (\sum dy)^2} = \frac{5 \times 34 - (-1)(-2)}{5 \times 72 - (-2)^2} = \frac{170 - 2}{360 - 4} = 0.4719 \]
Now, regression line X on Y:
x − 17.8 = 0.4719(y − 56.6)
x − 17.8 = 0.4719y − 26.71
x − 0.4719y = −8.91
i) To estimate the value of y when x = 25, use the equation of regression line Y on X.
Therefore, y − 1.25(25) = 34.35
y − 31.25 = 34.35
y = 65.6
ii) To estimate the value of x when y = 50, use regression line X on Y.
Therefore, x − 0.4719(50) = −8.91
x − 23.60 = −8.91
x = 14.69
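The two regression coefficients and both estimates of Ex.3 can be verified with a short program. This is a minimal sketch: the function name is illustrative, and it works from raw sums rather than from deviations, which gives the same coefficients because they are independent of origin.

```python
def regression_coefficients(xs, ys):
    """Return (b_yx, b_xy, mean_x, mean_y) by the least squares formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    cross = n * sum_xy - sum_x * sum_y          # shared numerator
    b_yx = cross / (n * sum_x2 - sum_x ** 2)    # slope of Y on X
    b_xy = cross / (n * sum_y2 - sum_y ** 2)    # slope of X on Y
    return b_yx, b_xy, sum_x / n, sum_y / n


# Data of Ex.3.
xs = [16, 20, 17, 21, 15]
ys = [50, 60, 58, 60, 55]
b_yx, b_xy, mean_x, mean_y = regression_coefficients(xs, ys)

# Both lines pass through (mean_x, mean_y).
y_at_25 = mean_y + b_yx * (25 - mean_x)
x_at_50 = mean_x + b_xy * (50 - mean_y)
print(b_yx, b_xy, y_at_25, x_at_50)
```

The unrounded slopes are 168/134 and 168/356. The estimates differ slightly from the hand calculation only because the hand calculation rounds b_yx to 1.25 before substituting.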
Ex.3. Find the line of regression Y on X and line X on Y if
X Y
A.M. 36 85
S.D. 11 8
r 0.66

Ans.: Given $\bar{x} = 36$, $\bar{y} = 85$,
$\sigma_x = 11$, $\sigma_y = 8$, r = 0.66
Now,
Regression coefficient of Y on X is given by
\[ b_{yx} = r \cdot \frac{\sigma_y}{\sigma_x} = 0.66 \times \frac{8}{11} = 0.48 \]
And
Regression coefficient of X on Y is given by
\[ b_{xy} = r \cdot \frac{\sigma_x}{\sigma_y} = 0.66 \times \frac{11}{8} = 0.9075 \approx 0.91 \]
Hence,
Equation of regression line Y on X is
\[ y - \bar{y} = b_{yx} (x - \bar{x}) \]
y − 85 = 0.48(x − 36)
y − 85 = 0.48x − 17.28
y − 0.48x = 67.72
Equation of regression line X on Y is
\[ x - \bar{x} = b_{xy} (y - \bar{y}) \]
x − 36 = 0.91(y − 85)
x − 36 = 0.91y − 77.35
x − 0.91y = −41.35
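When only the means, standard deviations and r are known, both regression lines follow directly from the relations for the regression coefficients. The sketch below uses a hypothetical helper name and the summary values of the example above (A.M. 36 and 85, S.D. 11 and 8, r = 0.66).

```python
def lines_from_summary(mean_x, mean_y, sd_x, sd_y, r):
    """Slopes and intercepts of both regression lines from summary values."""
    b_yx = r * sd_y / sd_x
    b_xy = r * sd_x / sd_y
    intercept_y_on_x = mean_y - b_yx * mean_x   # a in y = a + b_yx * x
    intercept_x_on_y = mean_x - b_xy * mean_y   # c in x = c + b_xy * y
    return b_yx, intercept_y_on_x, b_xy, intercept_x_on_y


b_yx, a, b_xy, c = lines_from_summary(36, 85, 11, 8, 0.66)
print(b_yx, a, b_xy, c)
```

This gives y = 67.72 + 0.48x for the line of Y on X. For the line of X on Y the unrounded intercept is about −41.14; the hand calculation obtains −41.35 because it rounds the slope 0.9075 to 0.91 first.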
Note:
Suppose we are given the equations of two regression lines and it is not mentioned which
one is the regression equation of Y on X and which is that of X on Y.
In such a case, assume that the first equation is Y on X and then calculate the regression
coefficients $b_{yx}$ and $b_{xy}$.
If these two values satisfy the property of regression coefficients,
\[ r^2 = b_{xy} \cdot b_{yx} \le 1 \]
then our assumption is correct.
Otherwise, interchange the two equations.
Ex.4. Given the two regression lines as:
8𝑥 – 10𝑦 + 66 = 0
40𝑥 – 18𝑦 – 214 = 0
Find average of x and y as well as correlation coefficient between x and y.
Ans.:
Part A:
Solve two equations simultaneously to find average of X and Y.
8𝑥 – 10𝑦 = − 66 ……….. (1)
40𝑥 – 18𝑦 = 214 …..….. (2)
Multiply equation (1) by 5 and subtract equation (2) from the result.
40x − 50y = −330
−(40x − 18y = 214)
−32y = −544
y = 17
Now, substitute 𝑦 = 17 in equation 1.
8𝑥 – 10 (17) = −66
8𝑥 – 170 = −66
8𝑥 = 104
𝑥 = 13
Therefore, $\bar{x} = 13$ and $\bar{y} = 17$.

Part B: To find the correlation coefficient, we have two lines of regression, but it is not known
which line is the regression line of Y on X and which is that of X on Y.
Let's assume that
8x − 10y = −66 is the regression line Y on X
40x − 18y = 214 is the regression line X on Y

Express the regression line Y on X in the form y = a + bx:
\[ y = \frac{8}{10} x + \frac{66}{10} \]
Therefore, $b_{yx} = \frac{8}{10} = 0.8$

Express the regression line X on Y in the form x = c + dy:
\[ x = \frac{9}{20} y + \frac{107}{20} \]
Therefore, $b_{xy} = \frac{9}{20} = 0.45$

Now,
Check whether the stated assumption is correct or not.
Check 1: Signs: both regression coefficients are positive.
Check 2: Product of the two regression coefficients:
\[ b_{yx} \cdot b_{xy} = \frac{8}{10} \times \frac{9}{20} = \frac{72}{200} = 0.36 < 1 \]
Hence, our assumption is correct.
Now, \[ r = \pm \sqrt{b_{xy} \cdot b_{yx}} = \pm \sqrt{0.36} = \pm 0.6 \]
But the sign of the correlation coefficient is the same as the sign of both regression coefficients. So, r = +0.6
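The procedure of Ex.4 (solve for the means, guess which line is Y on X, check the product of the coefficients, then take the signed square root) can be sketched in code. The names below are hypothetical, each line is written as (a, b, c) for ax + by + c = 0, and the sketch assumes that all x and y coefficients are non-zero and that the two regression coefficients have the same sign.

```python
import math


def slopes(y_on_x, x_on_y):
    """Return (b_yx, b_xy) for lines given as (a, b, c) in a*x + b*y + c = 0."""
    a1, b1, _ = y_on_x
    a2, b2, _ = x_on_y
    return -a1 / b1, -b2 / a2


def analyse_pair(line1, line2):
    b_yx, b_xy = slopes(line1, line2)      # first guess: line1 is Y on X
    if b_yx * b_xy > 1:                    # violates r^2 <= 1, so swap roles
        b_yx, b_xy = slopes(line2, line1)
    (a1, b1, c1), (a2, b2, c2) = line1, line2
    det = a1 * b2 - a2 * b1
    mean_x = (b1 * c2 - b2 * c1) / det     # intersection by Cramer's rule
    mean_y = (a2 * c1 - a1 * c2) / det
    # r takes the common sign of the two regression coefficients.
    r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)
    return mean_x, mean_y, b_yx, b_xy, r


# 8x - 10y + 66 = 0 and 40x - 18y - 214 = 0 from Ex.4.
result = analyse_pair((8, -10, 66), (40, -18, -214))
print(result)
```

The result should reproduce the means 13 and 17, the coefficients 0.8 and 0.45, and r = +0.6. Passing the two lines in the opposite order triggers the swap and gives the same answer.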


Exercise
1. What is regression analysis? Explain the concepts of regression.
2. Comment on the properties of regression coefficient and lines.
3. Write a note on different methods to find regression coefficient.
4. For a certain data of two variables, the regression equations are 6𝑥 + 𝑦 – 31 = 0 𝑎𝑛𝑑 3𝑥 +
2𝑦 – 26 = 0. Find the means of x and y as well as coefficient of correlation r.
5. From given regression equations calculate coefficient correlation and 𝜎𝑦 2 ; where 𝜎𝑥 2 = 0.9

Regression equations: 8𝑥 – 10𝑦 + 66 = 0 and 40𝑥 – 18𝑦 – 214 = 0.


6. From given data find what will be the probable yield when the rainfall is 30’’. Find the regression
equations when r between rainfall and production = 0.8.
Parameters Rainfall Production (units /acre)
𝑥 25’’ 40
𝜎 3’’ 06
7. Following table contain aptitude test index and productivity indices of 10 workers selected
randomly. Calculate the two regression equations and estimate the productivity index of a
worker whose aptitude score is 92.
Aptitude index 60 62 65 70 72 48 53 73 65 82

Productivity index 68 60 62 80 65 40 52 62 60 81
8. From following data find the two regression lines.
X 7 6 10 14 13
Y 22 18 20 26 24
9. Calculate two lines of regression.
X 7 6 10 14 13
Y 22 18 20 26 24
10. Following data gives values of X and Y.

X 16 12 18 14 12 10 15 12
Y 87 88 89 86 87 80 85 83

(i) Calculate two lines of regression.


(ii) Find correlation coefficient.
(iii) Estimate y when x = 10.
11. Two lines of regression are given by 𝑥 + 2𝑦 = 5 and 2𝑥 + 3𝑦 = 8. Calculate the value of
𝑥 , 𝑦 , 𝑏𝑥𝑦 , 𝑏𝑦𝑥 and r.
12. Given are two linear regression equations 𝑥 − 4𝑦 = 5 and 𝑥 − 16𝑦 = −64. Find the value of
𝑥 , 𝑦 , 𝑏𝑥𝑦 , 𝑏𝑦𝑥 and correlation coefficient between X and Y.


6. Sampling Variability, Significance & Statistical inference

Introduction:
Recall that we typically cannot take a census of the entire population of interest, so we take a
sample from that population in order to make estimates and draw conclusions about the population.
A quantity computed from the values in a sample is called a statistic. Values such as the
mean $\bar{x}$, the standard deviation or the proportion of individuals in a sample are statistics, and
they vary from sample to sample of a population. This variability is called sampling variability.
For example, the average age at which a child learned to walk for one sample of 10 children would
be different from the average age for a different sample of 10 children.
Sampling Distribution:
The common statistics used for a sample of any population are the sample proportion (p) and
the sample mean ($\bar{x}$), which are random variables as they vary from sample to sample. This gives
rise to a distribution of the statistic called the sampling distribution. A sampling distribution is the
probability distribution of a statistic obtained from a large number of samples drawn from a specific
population. It is the distribution of the values of the statistic over all possible samples of a fixed size.
For example, to find the sampling distribution of GPAT scores for all graduate
students in a given year, take repeated random samples of graduate students from the general
population of students and then compute the average test score for each sample. The distribution of
those sample means would provide the sampling distribution for the average GPAT score.
The variability of a sampling distribution is measured by its variance or standard deviation.
Sampling Error:
Most of the time, the value of a statistic calculated from a sample is assigned to the population
from which the sample was drawn. But, in general, there is some difference between the value
calculated from the sample and the corresponding value of the population. This difference is called
sampling error.
Sampling is an analysis performed by selecting a specific number of observations from a
larger population. This analysis can produce some errors in the selection of samples. In statistics,
sampling error is the error that occurs when the sample does not represent the entire population
properly. As a result, the values obtained from the sample differ from those that would be obtained
from the entire population. Sampling error can be reduced by selecting a sufficiently large sample
and by ensuring that it represents the entire population.
Standard Error of the Mean:
The variability of a sampling distribution is measured by its variance or standard deviation.
The standard deviation of the sample means is a measure of sampling error, known as the
standard error or standard error of the mean (SEM). It is a measure of uncertainty.
The standard error of the mean is the standard deviation of the sampling distribution of the
sample mean. It measures how accurately a sample represents its population: a sample mean
generally deviates from the actual mean of the population, and the standard error of the mean
describes the typical size of this deviation. The standard error is inversely proportional to the
square root of the sample size; so the larger the sample size, the smaller the standard error.
For example, for an upcoming national election, 2000 voters are chosen at random and
asked if they will vote for candidate A or candidate B. Out of the 2000 voters, 1040 (52%) state that
they will vote for candidate A. The researchers report that candidate A is expected to receive 52%
of the final vote, with a margin of error of 2%.
In this situation, the 2000 voters are a sample from all the actual voters. The sample
proportion of 52% is an estimate of the true proportion who will vote for candidate A in the actual
election. The margin of error of 2% is a quantitative measure of the uncertainty – the possible
difference between the true proportion who will vote for candidate A and the estimate of 52%.
Significance of SEM:
The standard error of the mean (SEM) estimates the variability of sample means where the
samples are selected from the same population, while the standard deviation measures the variability
within a single sample. It is used to determine how precisely the mean of the sample estimates the
population mean. A smaller value of SEM indicates a more precise estimate of the population mean.
Thus, a larger sample size will result in a smaller standard error of the mean.
The standard error of mean is also used to calculate the confidence interval, which is a
range of values likely to include the population mean.
For example, a medical research team tests a new drug to lower cholesterol. They report
that, in a sample of 400 patients, the new drug lowers cholesterol by an average of 20 units
(mg/dL). The 95% confidence interval for the average effect of the drug is that it lowers cholesterol
by 18 to 22 units.
In this situation, the 400 patients are a sample of all patients who may be treated with the
drug. The confidence interval of 18 to 22 is a quantitative measure of the uncertainty – the possible
difference between the true average effect of the drug and the estimate of 20 mg/dL.
a) Standard error of the mean for one sample:
For a sample of size n and standard deviation σ, the SEM is given by,
\[ SEM = \frac{\sigma}{\sqrt{n}} \]
where, σ = S.D. of sample

b) Standard error of difference between two sample means:
If $\bar{x}_1$ and $\bar{x}_2$ are means of two samples with sizes $n_1$ and $n_2$ along with respective standard
deviations $\sigma_1$ and $\sigma_2$, then the SEM is given by
\[ SEM_{(\bar{x}_1 - \bar{x}_2)} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \]
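The two standard error formulas can be sketched as small functions. The example figures are partly hypothetical: the cholesterol example above gives n = 400 and a mean reduction of 20 units but no standard deviation, so a standard deviation of 20 is assumed here, and the multiplier 1.96 is the usual normal-approximation value for a 95% confidence interval.

```python
import math


def sem(sd, n):
    """Standard error of the mean for one sample."""
    return sd / math.sqrt(n)


def sem_difference(sd1, n1, sd2, n2):
    """Standard error of the difference between two sample means."""
    return math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)


# Assumed standard deviation of 20 units (not given in the text), n = 400.
standard_error = sem(sd=20, n=400)
half_width = 1.96 * standard_error          # 95% interval, normal approximation
lower, upper = 20 - half_width, 20 + half_width
print(standard_error, round(lower, 2), round(upper, 2))
```

Under this assumption the interval is roughly 18 to 22 units, in line with the confidence interval quoted in the cholesterol example. Quadrupling the sample size halves the standard error, since it falls with the square root of n.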

Standard error of the proportion (SEP):


It is similar to the standard deviation used in describing the dispersion of data about the mean.
a) Standard error of proportion for one sample:
If samples of the same size n are repeatedly and randomly drawn from a population and the
sample proportion is recorded as p, then the standard error of proportion is given by,
\[ SEP = \sqrt{\frac{p(1 - p)}{n}} \]

b) Standard error for difference between two sample proportions:
If $p_1$ and $p_2$ are two sample proportions of different or same populations with the
respective sizes $n_1$ and $n_2$, then the SEP is given by
\[ SEP_{(p_1 - p_2)} = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}} \]
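The election example given earlier (1040 of 2000 voters, i.e. p = 0.52) can be used to illustrate the standard error of a proportion. This is a minimal sketch with illustrative function names; the multiplier 1.96 for a 95% margin of error is an assumption, since the example does not state the confidence level behind its 2% margin.

```python
import math


def sep(p, n):
    """Standard error of a sample proportion."""
    return math.sqrt(p * (1 - p) / n)


def sep_difference(p1, n1, p2, n2):
    """Standard error of the difference between two sample proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


se = sep(p=1040 / 2000, n=2000)
margin = 1.96 * se      # assumed 95% level, normal approximation
print(round(se, 4), round(margin, 3))
```

The standard error is about 0.011, so the assumed 95% margin is about 0.022, which is consistent with the margin of error of roughly 2% reported in that example.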

Degree of freedom (df):


The concept of degrees of freedom is central to the principle of estimating statistics of a
population from the samples drawn. It represents how many values in a calculation have the freedom
to vary. It is calculated to ensure the statistical validity of the chi-square test, the t-test and the more
advanced F-test. These tests are commonly used to compare observed data with the data that would
be expected according to a specific hypothesis.
Degrees of freedom (df) are broadly defined as the number of observations in the data
that are free to vary when estimating statistical parameters.
The statistical formula to determine degrees of freedom is simple and given by,
𝑑𝑓 = 𝑛 − 1 where, n = sample size.
We also need to calculate the degrees of freedom for the difference between sample means.
When we assume that the population variances are equal or when both sample sizes 𝑛1 and 𝑛2 are
larger than 50 we use the following formula
𝑑𝑓 = 𝑛1 + 𝑛2 − 2
For example, imagine you are a fun-loving person who loves to wear hats and does not care
about the degrees of freedom for wearing different hats. Unfortunately, you have only 7 hats. Yet you
want to wear a different hat every day of the week.
On the 1st day, you can wear any of the 7 hats. On the 2nd day, you can choose from the remaining 6
hats, and so on. On the 6th day, you still have 2 choices. But after the 6th day, you have no choice for the hat
that you wear on Day 7. You must wear the last remaining hat.
Thus, you had 7 – 1 = 6 days of “hat” freedom, in which the hat you wore could vary.


Statistical Inference:
Statistical inference is the process through which inferences about a population are made
based on certain statistics calculated from a sample of data drawn from that population. It is
important in order to analyze data properly. Indeed, proper data analysis is necessary to interpret
research results and to draw appropriate conclusions. In this chapter, three basic statistical
concepts are presented: estimation, confidence interval, and P-value, and these concepts are
applied to the comparison of proportions and means.
Statistical inference is the act of generalisation or estimation about the larger sized
population from the sample information.
There are two common types of statistical inference:
 Estimation
 Testing of Hypothesis
Estimation:
Statistical estimation is concerned with the method by which population characteristics are
estimated from sample information. The objective of estimation is to approximate the value of a
population parameter on the basis of a sample statistic i.e. when the value of parameter is unknown
then it can be estimated on the basis of a random sample.
For example, the sample mean $\bar{x}$ is used to estimate the population mean $\mu$.
There are two types of estimation:
(1) Point Estimation (likely value for parameter)
(2) Interval Estimation (also called confidence interval for parameter)
(1) Point Estimates:
A point estimate of an unknown parameter of a population is a single value of a
sample statistic. It should always be reported together with its standard error, which is a measure
of the uncertainty associated with the estimation process.
For example, the sample mean $\bar{x}$ is a point estimate of the population mean $\mu$.

(2) Interval Estimates:


An interval estimate is defined by two numbers, between which the value of the population
parameter is said to lie. Here, we try to construct an interval that covers the true population
parameter with a specified probability.
For example, $a < \mu < b$ is an interval estimate of the population mean $\mu$. It indicates that
the population mean is greater than a but less than b.


Characteristics of Estimators:
The desirability of an estimator is judged by its characteristics. There are three important
criteria:
(i) Unbiasedness:
An unbiased estimator of a population parameter is an estimator whose expected value is
equal to the parameter value.
i.e. an estimator, say $\hat{\theta}$, for a population parameter $\theta$ is said to be unbiased if
$E(\hat{\theta}) = \theta$
For example, the sample mean $\bar{x}$ is an unbiased estimator of the population mean $\mu$, since
$E(\bar{x}) = \mu$
(ii) Consistency:
An unbiased estimator is said to be consistent if the difference between the estimator and
the target population parameter becomes smaller as we increase the sample size,
i.e. an unbiased estimator, say $\hat{\theta}$, for a parameter $\theta$ is said to be consistent if
$V(\hat{\theta}) \to 0$ as $n \to \infty$.
Note that, under this definition, unbiasedness is assumed before consistency is checked. More
general definitions also allow a biased estimator to be consistent, provided its bias vanishes as n grows.
For example, the variance of the sample mean $\bar{x}$ is $\sigma^2/n$, which decreases to zero as we
increase the sample size n.
(iii) Efficiency:
If we are given two unbiased estimators for a population parameter then the estimator with
a smaller variance is more efficient.
For example, for a normally distributed population, it can be shown that the sample median
is an unbiased estimator for µ. It can also be shown, however, that the sample median has a greater
variance than that of the sample mean, for the same sample size. Hence $\bar{x}$ is a more efficient
estimator than the sample median.
Confidence Interval:
In statistics, a confidence interval is used to describe the amount of uncertainty associated
with a sample estimate of a population parameter. It is an interval estimate combined with a
probability statement. A confidence interval is an interval within which the true value of the
population parameter is expected to lie with a stated level of confidence. The width of this
confidence interval depends on the properties of the population and the degree of probability considered.


Here we are going to construct confidence intervals for the population proportion (p) and the
population mean ($\mu$) using the sample proportion ($\hat{p}$) and the sample mean ($\bar{x}$) as point estimates, which will
be the centre of the confidence interval. The width of the confidence interval will depend on two things:
i. Level of confidence
ii. Standard error
Confidence level refers to the percentage of all possible samples that can be expected to
include the true population parameter. For example, a 95% confidence level implies that 95% of the
confidence intervals constructed in this way would include the true population parameter. The
confidence level determines the multiplier used in the confidence interval.
The general form of confidence interval is
𝑃𝑜𝑖𝑛𝑡 𝑒𝑠𝑡𝑖𝑚𝑎𝑡𝑒 ± 𝑚𝑢𝑙𝑡𝑖𝑝𝑙𝑖𝑒𝑟 (𝑠𝑡𝑎𝑛𝑑𝑎𝑟𝑑 𝑒𝑟𝑟𝑜𝑟)
a) Confidence Intervals for Population Proportions (p):
It is constructed by taking the sample proportion ($\hat{p}$) as the point estimate, with sample size n and
the standard error of the sample proportion. Here z is the multiplier corresponding to the chosen level
of confidence, found from the z table.
$confidence\ interval\ of\ p = \hat{p} \pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$

For example, in a survey, people are asked whether they wear a seatbelt while driving.
In a sample of 1356 males, 677 said that they wear a seatbelt while driving.
$\hat{p} = \frac{677}{1356} = 0.499$
Let’s construct a 95% confidence interval for the population proportion from which the sample
proportion was drawn.
To compute the confidence interval we need the z multiplier and the standard error.
From the z table, for the 95% confidence level, multiplier z = 1.96
Hence, $SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\frac{0.499(1-0.499)}{1356}} = 0.0136$

Thus, $confidence\ interval = 0.499 \pm 1.96(0.0136) = 0.499 \pm 0.027 = [0.472, 0.526]$
i.e. we are 95% confident that the population proportion of males who wear a seat belt while driving
is between 0.472 and 0.526.
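The seatbelt calculation can be reproduced with a short Python sketch. The function name `proportion_ci` is an assumption made for this illustration, and the z multiplier is taken from the standard normal distribution in Python's standard library instead of a printed z table.

```python
import math
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Confidence interval for a population proportion (normal approximation)."""
    p_hat = successes / n                               # point estimate
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    se = math.sqrt(p_hat * (1 - p_hat) / n)             # standard error of p_hat
    return p_hat - z * se, p_hat + z * se

# Seatbelt survey from the text: 677 of 1356 males.
low, high = proportion_ci(677, 1356)
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```

Because the sketch carries full precision through every step, its limits can differ in the third decimal place from a hand calculation that rounds the proportion and the standard error first.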
b) Confidence Intervals for Population Mean (𝝁):
It is constructed by taking the sample mean ($\bar{x}$) as the point estimate, with sample size n and
the standard error of the sample mean. Here t is the multiplier corresponding to the chosen level of
confidence, found from the t table with n − 1 degrees of freedom.
$confidence\ interval\ for\ \mu = \bar{x} \pm t\frac{\sigma}{\sqrt{n}}$


For example, in a class survey, students are asked how many hours they sleep per night. In a
sample of 22 students, the mean was 5.77 hours with a standard deviation of 1.572 hours.
Let’s construct a 95% confidence interval for the mean number of hours slept per night in the
population from which the sample was drawn.
To compute the confidence interval we need the t multiplier and the standard error.
From the t table, with degrees of freedom 22 – 1 = 21 and for the 95% confidence level,
multiplier t = 2.08
Hence, $SE = \frac{\sigma}{\sqrt{n}} = \frac{1.572}{\sqrt{22}} = 0.335$

Thus,
$confidence\ interval = 5.77 \pm 2.08(0.335) = 5.77 \pm 0.697 = [5.073, 6.467]$
i.e. we are 95% confident that the population mean number of hours students sleep per night is between
5.073 hours and 6.467 hours.
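A minimal Python sketch of the same interval is given below. The function name `mean_ci` is an assumption for illustration, and the t multiplier is passed in as a number read from a t table because the Python standard library has no t distribution.

```python
import math

def mean_ci(sample_mean, sd, n, t_multiplier):
    """Confidence interval for a population mean: mean +/- t * sd / sqrt(n)."""
    se = sd / math.sqrt(n)        # standard error of the sample mean
    margin = t_multiplier * se    # half-width of the interval
    return sample_mean - margin, sample_mean + margin

# Sleep survey from the text: n = 22, mean 5.77 h, S.D. 1.572 h, t (21 df) = 2.08.
low, high = mean_ci(5.77, 1.572, 22, t_multiplier=2.08)
print(f"95% CI: [{low:.3f}, {high:.3f}] hours")
```

A higher confidence level means a larger t multiplier and therefore a wider interval for the same data.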
P - value:
The p-value is the probability, calculated under the assumption that the null hypothesis is true,
of obtaining a result at least as extreme as the one actually observed. It measures the evidence
against the null hypothesis. At the time of hypothesis
testing, a p-value helps to determine the significance of the result. The p-value is a number between
0 and 1 and is interpreted in the following way:
 A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you
reject the null hypothesis.
 A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to
reject the null hypothesis.
 A p-value very close to the cut off (0.05) is considered to be marginal, i.e. it could go either way.
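As an illustration, a two-sided p-value for a z statistic can be computed from the standard normal distribution. The sketch below assumes a z-test, and the function name `two_sided_p_value` is hypothetical.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """Probability, under H0, of a z statistic at least as extreme as the one observed."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Illustrative z values: small, near the 5% cut off, and large.
for z in (0.5, 1.96, 3.0):
    p = two_sided_p_value(z)
    decision = "reject H0" if p <= 0.05 else "fail to reject H0"
    print(f"z = {z}: p = {p:.4f} -> {decision}")
```

A z value of 1.96 gives a p-value of almost exactly 0.05, which is why 1.96 appears as the table value for two-tailed tests at the 5% level.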


7. Testing of Hypothesis

Introduction:
In this chapter, we are going to study the second type of statistical inference, hypothesis
testing, using various statistical methods and probability distributions; the first type
was estimation by confidence intervals. The main purpose of this study is to make decisions and draw inferences
about a population using samples from that population. In pharmaceutical studies,
the purpose is often to demonstrate that “a new drug is effective, or possibly to show that it is more
effective than the existing drug”. While in clinical trials, the purpose is to demonstrate that “the new
drug is better than a placebo control”. In this chapter, we will focus only on numeric outcomes. This
chapter also introduces the critical (or rejection) region approach to hypothesis testing and
compares it to the p-value approach.
A statistical measure such as mean, standard deviation or variance which describes
population is known as parameter.
Estimation:
Statistical inference is one of the main objectives of statistics, in which
conclusions about a population are drawn and/or decisions are taken from the analysis of a
sample drawn from that population. Statistical inference includes:
1. Estimation theory
2. Tests of hypothesis
3. Non Parametric tests
4. Sequential analysis
In estimation theory, we estimate the unknown value of the population parameter based on
sample observations i.e. the statistical measure calculated from sample is assigned to the
population of that sample.
E.g. suppose we are given a sample of weights of 100 students in a school. So it is possible to
estimate the average weight of all students in that school using the weights of these 100 students.
But, in general, there is difference between the value calculated from sample and the
corresponding value of population. This difference is called Sampling error.
Tests of Hypothesis:
The sample is assumed to be a small representative of the total population. But in many
experiments, it happens that the sample is not a whole representative of population from which it is
selected. So the statistical conclusions and estimations about population go wrong. In such case,
certain statements are tested about population parameter to come to the conclusions like whether
or not the difference between sample value and parametric value is due to chance or otherwise.


This whole procedure is called testing of hypothesis. In this chapter, we will test whether the
difference between a sample mean and the population mean is significant, i.e. whether both are equal or not.
A hypothesis, in statistics, is a statement about a population which is supposed to be true
till it is proved to be false. It should be stated before conducting the statistical tests of hypothesis. It
is set up in two ways:
a) Null Hypothesis
b) Alternative Hypothesis
a) Null Hypothesis:
The null hypothesis is a statement which is actually tested for acceptance or rejection. It is
stated under the assumption that “there is no significant difference” between the sample result and
the population result. We assume that the null hypothesis is true, but in pharmaceutical research we
usually wish to prove it false.
Null hypothesis is generally denoted by H0.
b) Alternative Hypothesis:
When the null hypothesis is rejected then it is required to accept another statement called
as alternative hypothesis. It is research hypothesis which is generally believed to be true by
researcher.
Alternative is generally denoted by Ha or H1.
Note: The hypothesis we want to test is “likely” true. So there are two possible outcomes:
 Reject H0 and accept Ha because of sufficient evidence in favour of Ha.
 Fail to reject H0 (loosely, “accept H0”) because of insufficient evidence to support Ha.
Failure to reject H0 does not mean that the null hypothesis is true. It only means that we do not have
sufficient evidence to support Ha.
Elements necessary for Hypothesis testing:
1. Level of significance:
The level of significance, denoted by 𝛼, is the probability of rejecting the null hypothesis
when it is true. For example, a significance level 0.05 indicates a 5% risk of concluding that a
difference exists when there is no actual difference.
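Similarly, the p-value measures the strength of the evidence against the null hypothesis and is
compared with the level of significance: if the p-value is less than or equal to α, reject the null
hypothesis; otherwise do not reject it. Equivalently, the calculated value of the test statistic is
compared with the table (critical) value at the α level of significance: if the calculated value does
not exceed the table value, accept the null hypothesis; otherwise reject the null hypothesis.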
2. Region of Acceptance and Rejection:
The region of acceptance is the range of values of the test statistic which leads to acceptance of
the null hypothesis, while the set of values for which the null hypothesis is rejected is called the
region of rejection. The region of rejection is also known as the Critical
Region. The values which separate the critical region from the region of acceptance are called
critical values.


3. Power of test:
The power of a statistical test is the probability of rejecting the null hypothesis
when it is false. It ranges from 0 to 1. Thus, power is the ability of a test to correctly reject the
null hypothesis. Although a hypothesis test can be conducted without it, calculating the power of a test
beforehand helps to ensure that the sample size is large enough for the purpose of the test.
Types of error:
No hypothesis test is 100% correct, as it is based on probabilities. At the time of testing of
hypothesis, null hypothesis may be accepted or may be rejected. Depending on this acceptance or
rejection there are two types of errors:
a) Type one (I) error
b) Type two (II) error
a) Type I error:
Rejection of the null hypothesis when it is true is called type I error. The probability of
committing type I error is α, i.e. the level of significance, which is set up before testing the hypothesis. An
α of 0.05 means there is a 5% chance of rejecting the null hypothesis when it is actually true. To decrease the
chance of this error, use a lower value of α. However, using a lower value of α makes it less likely
that a true difference will be detected if one exists.
b) Type II error:
Accepting the null hypothesis when it is false is called type II error. The probability of making
type II error is β, and the power of the test is 1 − β. To decrease the chance of this error, ensure that
the test has enough power, i.e. the sample size should be large enough to detect a practical difference if one exists.
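The link between α, β, power and sample size can be illustrated with a small Python sketch. It assumes a two-tailed one-sample z-test with known S.D.; the function name `z_test_power` and all numerical inputs are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist

def z_test_power(mu0, mu1, sigma, n, alpha=0.05):
    """Power of a two-tailed one-sample z-test when the true mean is mu1."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)        # critical value, e.g. 1.96
    shift = (mu1 - mu0) / (sigma / math.sqrt(n))      # true difference in SE units
    upper = 1 - std_normal.cdf(z_crit - shift)        # reject in the upper tail
    lower = std_normal.cdf(-z_crit - shift)           # reject in the lower tail
    return upper + lower

# Hypothetical setting: H0 mean 100, true mean 105, S.D. 15.
for n in (10, 30, 100):
    power = z_test_power(mu0=100, mu1=105, sigma=15, n=n)
    print(f"n = {n}: power = {power:.3f}, beta = {1 - power:.3f}")
```

When the true mean equals the hypothesised mean, the function returns α itself, which is the type I error rate. For a fixed true difference, power rises and β falls as n increases.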
One Tailed test and Two Tailed test:
In statistical hypothesis testing, we need to judge whether a test is one-tailed or two-tailed
so that we can find the critical values in tables such as the Standard Normal z Distribution Table
and the t Distribution Table. The z curve and t curve are
bell shaped and symmetric about the vertical axis. Each end of the curve is a tail, which can serve as the
rejection area for the hypothesis.
One –tailed test: A test of a statistical hypothesis, where the region of rejection is on only one side
of the sampling distribution, is called a one-tailed test.
Two-tailed test: A test of a statistical hypothesis, where the region of rejection is on both sides of
the sampling distribution, is called a two-tailed test.
Steps to perform Hypothesis testing:
All hypothesis tests are conducted in the same way:
1) State the hypotheses.
2) Formulate an analysis plan: Choose significance level and test statistic.
3) Analyse sample data: Perform calculations.
4) Interpret result: Compare table value with calculated value.

z-test:
A z-test is a statistical test used to determine whether two population means are different
when the variances are known and the sample size is large. The test statistic is assumed to have
normal distribution and standard deviation should be known for an accurate z-test to be
performed. It is basically used for dealing with problems relating to large samples when 𝑛 ≥ 30.
There are different types of z-test each for different purpose. Some of the popular types are
outlined below:
a) For comparing two proportions:
Let $p_1$ and $p_2$ be two sample proportions with the respective sample sizes $n_1$ and $n_2$.
To test:
$H_0: p_1 = p_2$
Against:
$H_1: p_1 \neq p_2$
Test statistic:
$z = \frac{p_1 - p_2}{\sqrt{p(1-p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$

Where p is the proportion of successes in the two samples combined, given by
$p = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}$
At α% level,
If $|Cal\ z| \le Tab\ z$, then accept $H_0$.
Otherwise, reject $H_0$.
Ex.1. In random samples of 600 and 1000 men from two cities, 400 and 600 men respectively are found to
be literate. Do the data indicate that the populations are significantly different in the percentage
of literacy?
[Given at 5% level of significance, z = 1.96]
Ans: Let $p_1$ = Proportion of literate men from one city
$p_2$ = Proportion of literate men from another city

Here, $p_1 = \frac{400}{600} = 0.6667$, $n_1 = 600$ and $p_2 = \frac{600}{1000} = 0.6$, $n_2 = 1000$

To test:
$H_0: p_1 = p_2$
Against:
$H_1: p_1 \neq p_2$
Test statistic:
$z = \frac{p_1 - p_2}{\sqrt{p(1-p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$

Where
$p = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{600 \times \frac{400}{600} + 1000 \times \frac{600}{1000}}{600 + 1000} = \frac{400 + 600}{1600} = 0.625$
Hence,
$z = \frac{0.6667 - 0.6}{\sqrt{0.625 \times 0.375 \left(\frac{1}{600} + \frac{1}{1000}\right)}} = \frac{0.0667}{0.025} = 2.67$

Cal z = 2.67
At 5% level, Tab z = 1.96
Cal z > Tab z,
So reject $H_0$.
Therefore, there is a significant difference in the percentage of literacy of the two cities.
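The z-test for two proportions can be checked with a short Python sketch. The function name `two_proportion_z` is an assumption for illustration; it takes the counts of successes directly, so no intermediate rounding is involved.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """z statistic for H0: p1 = p2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)   # proportion in the combined samples
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Literacy data from the example: 400 of 600 and 600 of 1000 men.
z = two_proportion_z(400, 600, 600, 1000)
print(f"Cal z = {z:.2f}")
print("reject H0" if abs(z) > 1.96 else "accept H0")
```

For the literacy data the statistic comes out at about 2.67, which exceeds the table value 1.96, so H0 is rejected at the 5% level.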
b) For one sample mean:
Let $\bar{x}$ be the sample mean and $\mu$ be the population mean, with $\sigma$ as standard deviation and n as
sample size.
To test:
$H_0: \bar{x} = \mu$
Against:
$H_1: \bar{x} \neq \mu$
Test statistic:
$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$
Where,
$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$

At α% level,
If $|Cal\ z| \le Tab\ z$, then accept $H_0$.
Otherwise, reject $H_0$.


Ex.2. The mean plasma potassium level of 50 adult males with a certain disease was found to be
3.356 mEq/litre and the S.D. was 0.5 mEq/litre. The normal adult value of plasma potassium is
4.6 mEq/litre. Based on the above data, can it be concluded that the males with the disease have a lower
plasma potassium level than the normal level? (Given at 5%, z = 1.96)
Ans: Here, $\bar{x} = 3.356$, $\sigma = 0.5$, $n = 50$
To test:
$H_0: \mu = 4.6$
Against:
$H_1: \mu \neq 4.6$

Test statistic:
$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{3.356 - 4.6}{0.5/\sqrt{50}} = \frac{-1.244}{0.0707} = -17.59$
|Cal z| = 17.59
At 5% level, Tab z = 1.96
|Cal z| > Tab z,
Hence, reject $H_0$.
Thus, the mean differs significantly from 4.6 mEq/litre and, since the sample mean is below this value,
the males with the disease have a lower plasma potassium level than normal males.

Ex.2. A machine produces metal plates of thickness 1.5cm with S.D. 0.2cm. A sample of 100 plates
produced by machine has an average thickness of 1.52cm. Is the machine fulfilling the purpose for
which it is designed?

Ans: Here, $\bar{x} = 1.52$, $\sigma = 0.2$, $n = 100$

To test:
$H_0: \mu = 1.5$
Against:
$H_1: \mu \neq 1.5$

Test statistic:
$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{1.52 - 1.5}{0.2/\sqrt{100}} = \frac{0.02}{0.02} = 1$
At 5% level, Tab z = 1.96
Cal z ≤ Tab z,
Hence, accept $H_0$.
Thus, the machine is fulfilling its purpose.
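The one-sample z statistic used in these examples can be sketched in Python as below. The function name `one_sample_z` is hypothetical, and the table value 1.96 is the two-tailed 5% value quoted in the text.

```python
import math

def one_sample_z(sample_mean, mu0, sd, n):
    """z statistic for testing H0: population mean = mu0."""
    return (sample_mean - mu0) / (sd / math.sqrt(n))

# Metal plate example from the text: mean 1.52 cm, target 1.5 cm, S.D. 0.2 cm, n = 100.
z = one_sample_z(1.52, 1.5, 0.2, 100)
print(f"Cal z = {z:.2f}")
print("reject H0" if abs(z) > 1.96 else "accept H0")
```

The comparison uses the absolute value of z, so a sample mean below the hypothesised value is handled in the same way as one above it.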

Ex.3. Six bottles from a batch of suspension were assayed for paracetamol content by a
spectrophotometric method. Each 5 ml of suspension contains 500, 503, 509, 515, 502, 507 mg of
paracetamol. Test the hypothesis that the average content of paracetamol is 505 mg.

Ans: $\bar{x} = \frac{\sum x_i}{n} = \frac{500+503+509+515+502+507}{6} = 506$

To test:
$H_0: \mu = 505$
Against:
$H_1: \mu \neq 505$

Test statistic:
$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$ Where, $\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$

To find S.D.,
x_i      (x_i − x̄)     (x_i − x̄)²
500        -6             36
503        -3              9
509         3              9
515         9             81
502        -4             16
507         1              1
Total                    152

$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}} = \sqrt{\frac{152}{6}} = 5.033$

Hence,
$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{506 - 505}{5.033/\sqrt{6}} = 0.4866$
At 5% level, Tab z = 1.96
Cal z ≤ Tab z,
So accept $H_0$.
c) For two sample means:
Let $\bar{x}_1$ and $\bar{x}_2$ be two sample means with standard deviations $\sigma_1$ and $\sigma_2$ and sample sizes $n_1$ and
$n_2$ respectively.
To test:
$H_0: \bar{x}_1 = \bar{x}_2$
Against:
$H_1: \bar{x}_1 \neq \bar{x}_2$
Test statistic:
$z = \frac{\bar{x}_1 - \bar{x}_2}{S.E.}$
Where, Standard Error is
$S.E. = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$

At α% level,
If $|Cal\ z| \le Tab\ z$, then accept $H_0$.
Otherwise, reject $H_0$.

Ex.1. Random samples drawn from two places gave the following data relating to the wing length of
anopheles mosquitoes. Test at 5% level that the mean wing length is the same for mosquitoes at the
two places.
(Given z = 1.96)
          Place ‘A’    Place ‘B’
Mean      3.60         3.58
S.D.      1.8          1.6
Size      50           50

Ans: Here, $\bar{x}_1 = 3.60$, $\bar{x}_2 = 3.58$
$\sigma_1 = 1.8$, $\sigma_2 = 1.6$
$n_1 = 50$, $n_2 = 50$
To test:
$H_0: \bar{x}_1 = \bar{x}_2$
Against:
$H_1: \bar{x}_1 \neq \bar{x}_2$

Test statistic:
$z = \frac{\bar{x}_1 - \bar{x}_2}{S.E.}$
Where,
$S.E. = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = \sqrt{\frac{1.8^2}{50} + \frac{1.6^2}{50}} = \sqrt{0.116} = 0.34$

Hence,
$z = \frac{3.60 - 3.58}{0.34} = 0.058$

At 5% level, Tab z = 1.96
Cal z ≤ Tab z,
So accept $H_0$.
Thus, the mean wing length is the same for mosquitoes at the two places.

Ex.2. In two groups of infants of 6 months of age the following values were observed:
Group    No. of infants    Mean weight    S.D.
1        100               6.9 kg         1.10 kg
2        169               7.3 kg         0.91 kg
Test whether the mean weights are significantly different at 5% level.

Ans: Here, $\bar{x}_1 = 6.9$, $\bar{x}_2 = 7.3$
$\sigma_1 = 1.10$, $\sigma_2 = 0.91$
$n_1 = 100$, $n_2 = 169$
To test:
$H_0: \bar{x}_1 = \bar{x}_2$
Against:
$H_1: \bar{x}_1 \neq \bar{x}_2$

Test statistic:
$z = \frac{\bar{x}_1 - \bar{x}_2}{S.E.}$
Where,
$S.E. = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = \sqrt{\frac{1.10^2}{100} + \frac{0.91^2}{169}} = \sqrt{0.017} = 0.13$

Hence,
$z = \frac{6.9 - 7.3}{0.13} = -3.077$
|Cal z| = 3.077

At 5% level, Tab z = 1.96
|Cal z| > Tab z,
So reject $H_0$.
Thus, the mean weights are significantly different at 5% level.
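The z-test for two sample means can be sketched in Python as follows. The function name `two_sample_z` is an assumption for illustration; it works from the summary statistics (mean, S.D. and size) of each sample.

```python
import math

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """z statistic for comparing two sample means from large samples."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)   # S.E. of the difference
    return (mean1 - mean2) / se

# Mosquito wing lengths, places A and B.
z_wings = two_sample_z(3.60, 1.8, 50, 3.58, 1.6, 50)
# Infant weights, groups 1 and 2.
z_infants = two_sample_z(6.9, 1.10, 100, 7.3, 0.91, 169)

for label, z in (("wing length", z_wings), ("infant weight", z_infants)):
    decision = "reject H0" if abs(z) > 1.96 else "accept H0"
    print(f"{label}: Cal z = {z:.3f} -> {decision}")
```

Carrying full precision gives values that differ slightly from a hand calculation with a rounded standard error, but the decisions are the same: H0 is accepted for the wing lengths and rejected for the infant weights.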
Student’s t-Test:
Unfortunately, z-tests require one of two conditions: either the population is normally
distributed with a known variance, or the sample size is large.
In general hypothesis testing becomes very difficult if for normal population, sample size is
small with unknown population variance. In such case, student’s t-test is applied. A t-test is an
analysis of two population means through the use of statistical examination. A t-test looks at the t-
statistic, the t-distribution and degrees of freedom to determine the probability of difference
between populations.
Types of Student’s t-test:
There are two types of student’s t-test:
1. Paired t-test
2. Unpaired t-test
1. Paired t-test: It is used to compare the means of two populations when the data are paired. It is
also used for “before-after” measurements on the same individuals.
2. Unpaired t-test: It is used to compare the means of two independent groups of data and to
determine whether the data have come from the same population or not.
a) t-test for small sample size (n ≤ 30) to compare equality of sample mean and population
mean:
Let $\bar{x}$ be the mean of a sample of size n and $\mu$ be the population mean, along with the standard
deviation $\sigma$.
To test:
$H_0: \bar{x} = \mu$
Against:
$H_1: \bar{x} \neq \mu$

Test statistic:
$t = \frac{\bar{x} - \mu}{\sigma/\sqrt{n-1}}$
Where,
$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$
$n - 1$ is the degree of freedom

At α% level,
If $|Cal\ t| \le Tab\ t$, then accept $H_0$.
Otherwise, reject $H_0$.

Ex.1. A random sample of 20 sachets of powder containing a certain drug gives a mean API content of
42 mg and S.D. of 6 mg. Test the hypothesis that the population mean is 44 mg.
(Given at 5% level and 19 d.f., Tab t = 2.093)
Ans.: Given $n = 20$, $\bar{x} = 42$ mg, $\sigma = 6$ mg
To test:
$H_0: \mu = 44$ mg
Against:
$H_1: \mu \neq 44$ mg

Test statistic: Two tailed test
$t = \frac{\bar{x} - \mu}{\sigma/\sqrt{n-1}} = \frac{42 - 44}{6/\sqrt{19}} = -1.453$
|Cal t| = 1.453
At 5% level, Tab t = 2.093
|Cal t| ≤ Tab t,
Hence accept $H_0$.
Thus, the data are consistent with a population mean of 44 mg.
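The one-sample t statistic in the form used here can be sketched in Python. The function name `one_sample_t` is hypothetical. The sketch follows this section's formula, dividing by the square root of n − 1, which assumes the S.D. supplied was computed with divisor n.

```python
import math

def one_sample_t(sample_mean, mu0, sd, n):
    """t statistic and degrees of freedom for H0: population mean = mu0.

    Assumes sd was computed with divisor n, so the standard error
    is sd / sqrt(n - 1), as in the formula of this section.
    """
    t = (sample_mean - mu0) / (sd / math.sqrt(n - 1))
    return t, n - 1

# Sachet example from the text: n = 20, mean 42 mg, S.D. 6 mg, H0 mean 44 mg.
t, df = one_sample_t(42, 44, 6, 20)
print(f"Cal t = {t:.3f} with {df} d.f.")
print("reject H0" if abs(t) > 2.093 else "accept H0")   # 2.093: given table value
```

If the S.D. had instead been computed with divisor n − 1, the equivalent statistic would divide by the square root of n.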


b) t-test for large sample size (n > 30) to compare equality of two sample means:
Let $\bar{x}_1$ and $\bar{x}_2$ be two sample means of sizes $n_1$ and $n_2$ along with standard deviations $\sigma_1$ and
$\sigma_2$ respectively.

To test:
$H_0: \bar{x}_1 = \bar{x}_2$
Against:
$H_1: \bar{x}_1 \neq \bar{x}_2$

Test statistic: Unpaired t-test
$t = \frac{\bar{x}_1 - \bar{x}_2}{\sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
Where,
$\sigma = \sqrt{\frac{n_1\sigma_1^2 + n_2\sigma_2^2}{n_1 + n_2 - 2}}$
$n_1 + n_2 - 2$ is the degree of freedom

At α% level,
If $|Cal\ t| \le Tab\ t$, then accept $H_0$.
Otherwise, reject $H_0$.

Ex.2. A laboratory checks the shelf life of a formulation obtained from two different manufacturers (generic
product). The data are given in the table. Check whether there is any significant difference between
the same kind of product from the two manufacturers.
(Given at 5% level and for 25 df, Tab t = 2.06)
Product    Mean         S.D.    Sample Size
Brand X    2000 days    250     12
Brand Y    2230 days    300     15

Ans.: Given $\bar{x}_1 = 2000$, $\sigma_1 = 250$, $n_1 = 12$
$\bar{x}_2 = 2230$, $\sigma_2 = 300$, $n_2 = 15$
To test:
$H_0: \bar{x}_1 = \bar{x}_2$
Against:
$H_1: \bar{x}_1 \neq \bar{x}_2$

Test statistic: Unpaired t-test
$t = \frac{\bar{x}_1 - \bar{x}_2}{\sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$

Where,
$\sigma = \sqrt{\frac{n_1\sigma_1^2 + n_2\sigma_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{12(250)^2 + 15(300)^2}{12 + 15 - 2}} = \sqrt{84000} = 289.827$
Hence,
$t = \frac{2000 - 2230}{289.827\sqrt{\frac{1}{12} + \frac{1}{15}}} = \frac{-230}{112.25} = -2.0489$
|Cal t| = 2.0489

At 5% level, Tab t = 2.06
|Cal t| ≤ Tab t,
So accept $H_0$.
Thus, the two formulations do not differ significantly in average shelf life.
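The unpaired t-test with a pooled standard deviation can be sketched in Python as below. The function name `pooled_t` is an assumption for illustration, and the pooled S.D. follows the formula of this section, which weights each sample variance by its sample size.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Unpaired t statistic and degrees of freedom using a pooled S.D."""
    df = n1 + n2 - 2
    pooled_sd = math.sqrt((n1 * sd1 ** 2 + n2 * sd2 ** 2) / df)
    t = (mean1 - mean2) / (pooled_sd * math.sqrt(1 / n1 + 1 / n2))
    return t, df

# Shelf-life example from the text: Brand X versus Brand Y.
t, df = pooled_t(2000, 250, 12, 2230, 300, 15)
print(f"Cal t = {t:.4f} with {df} d.f.")
print("reject H0" if abs(t) > 2.06 else "accept H0")   # 2.06: given table value
```

The calculated value is close to the table value here, so the decision is sensitive to rounding and to the exact table value used.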
c) t-test for before-after condition of certain treatment:
Let x be the ‘before’ and y be the ‘after’ condition of a certain treatment, with sample size n and
standard deviation $\sigma$ of the differences.
To test:
$H_0: \bar{x} = \bar{y}$
Against:
$H_1: \bar{x} \neq \bar{y}$
Test statistic: Paired t-test
$t = \frac{\bar{d}}{\sigma/\sqrt{n}}$
Where,
$\bar{d} = \frac{\sum d}{n}$, with $d = x - y$ if $\sum (x - y) \ge 0$, otherwise take $d = y - x$
$\sigma = \sqrt{\frac{n\sum d^2 - (\sum d)^2}{n(n-1)}}$

$n - 1$ is the degree of freedom
At α% level,
If $Cal\ t \le Tab\ t$, then accept $H_0$.
Otherwise, reject $H_0$.

Ex.3. A test of 150 marks was taken before and after training for newly joined candidates in the
production department. The table contains the marks obtained by the candidates. Test whether there is any
change in the candidates after training.
(Given at 5% level and 4 df, Tab t = 2.776)
Candidate                         A     B     C     D     E
Marks obtained before training    110   120   123   132   125
Marks obtained after training     120   118   125   136   121

Ans.: Prepare the following table for the before-after condition.

Candidate    x      y      d = (y − x)    d²
A            110    120    10             100
B            120    118    -2             4
C            123    125    2              4
D            132    136    4              16
E            125    121    -4             16
Total        -      -      Σd = 10        Σd² = 140
To test:
$H_0: \bar{x} = \bar{y}$
Against:
$H_1: \bar{x} \neq \bar{y}$
Test statistic: Paired t-test
$t = \frac{\bar{d}}{\sigma/\sqrt{n}}$
Where,
$\bar{d} = \frac{\sum d}{n} = \frac{10}{5} = 2$

$\sigma = \sqrt{\frac{n\sum d^2 - (\sum d)^2}{n(n-1)}} = \sqrt{\frac{5(140) - (10)^2}{5(5-1)}} = \sqrt{30} = 5.477$
Hence,
$t = \frac{2}{5.477/\sqrt{5}} = \frac{2}{2.449} = 0.816$
At 5% level, 4 df, Tab t = 2.776
Cal t ≤ Tab t,
So, accept $H_0$.
Thus, training has not shown any significant effect on scores.
Ex.4. Applications of fertilizers were tested for the yield of rice grown in 10 plots. Another set of
10 plots of similar size and condition was taken as control. Test the effect of fertilizer.
(Given at 5% level and 9 df, Tab t = 2.262)

Fertilizer applied        16  14  18  15  13  17  16  15  14  13
Fertilizer not applied    10  12  11   9  13  13  12  14  13  11
Ans.: Prepare the following table.
x        y        d = (x − y)    d²
16       10       6              36
14       12       2              4
18       11       7              49
15       9        6              36
13       13       0              0
17       13       4              16
16       12       4              16
15       14       1              1
14       13       1              1
13       11       2              4
Total    -        Σd = 33        Σd² = 163
To test:
$H_0: \bar{x} = \bar{y}$
Against:
$H_1: \bar{x} \neq \bar{y}$

Test statistic: Paired t-test
$t = \frac{\bar{d}}{\sigma/\sqrt{n}}$
Where,
$\bar{d} = \frac{\sum d}{n} = \frac{33}{10} = 3.3$

$\sigma = \sqrt{\frac{n\sum d^2 - (\sum d)^2}{n(n-1)}} = \sqrt{\frac{10(163) - (33)^2}{10(10-1)}} = \sqrt{6.011} = 2.452$
Hence,
$t = \frac{3.3}{2.452/\sqrt{10}} = \frac{3.3}{0.775} = 4.256$
At 5% level, 9 df, Tab t = 2.262
Cal t > Tab t,
So, reject $H_0$.
Thus, the fertilizer has a significant effect on the yield of rice.
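The paired t statistic can be computed directly from the raw observations with a short Python sketch. The function name `paired_t` is hypothetical, and the standard deviation of the differences is computed with the usual subtraction form, n Σd² − (Σd)², divided by n(n − 1).

```python
import math

def paired_t(first, second):
    """Paired t statistic and degrees of freedom for d = first - second."""
    d = [a - b for a, b in zip(first, second)]
    n = len(d)
    sum_d = sum(d)
    sum_d2 = sum(v * v for v in d)
    mean_d = sum_d / n
    sd = math.sqrt((n * sum_d2 - sum_d ** 2) / (n * (n - 1)))  # S.D. of differences
    return mean_d / (sd / math.sqrt(n)), n - 1

# Training example: marks after and before training (d = after - before).
after = [120, 118, 125, 136, 121]
before = [110, 120, 123, 132, 125]
print(paired_t(after, before))

# Fertilizer example: yields with and without fertilizer (d = x - y).
applied = [16, 14, 18, 15, 13, 17, 16, 15, 14, 13]
control = [10, 12, 11, 9, 13, 13, 12, 14, 13, 11]
print(paired_t(applied, control))
```

For the training data the statistic is about 0.82 with 4 d.f., and for the fertilizer data about 4.26 with 9 d.f.; each is then compared with the t table value for its own degrees of freedom.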
F-test:
The F-test is closely related to Student’s t-test. It is used to compare the equality of two
variances by taking their ratio. The larger variance is taken as the numerator so that the
test becomes a right-tailed test, which is easier to carry out.
To calculate F value:
Let $\sigma_1^2$ and $\sigma_2^2$ be the two sample variances with sample sizes $n_1$ and $n_2$ respectively.
To test:
$H_0: \sigma_1^2 = \sigma_2^2$
Against:
$H_1: \sigma_1^2 \neq \sigma_2^2$ or $\sigma_1^2 > \sigma_2^2$ or $\sigma_1^2 < \sigma_2^2$
Test statistic: F-test
$F = \frac{\sigma_1^2}{\sigma_2^2}$ for $\sigma_1^2 > \sigma_2^2$
$F = \frac{\sigma_2^2}{\sigma_1^2}$ for $\sigma_1^2 < \sigma_2^2$
Where,
$\sigma_1^2 = \frac{\sum (x_i - \bar{x})^2}{n_1 - 1}$ and $\sigma_2^2 = \frac{\sum (y_i - \bar{y})^2}{n_2 - 1}$

At α% level,
If $Cal\ F \le Tab\ F$, then accept $H_0$.
Otherwise, reject $H_0$.

Ex.1. Two samples are drawn from two populations. From the data given below, test whether the
two samples have the same variance at 5% level of significance. (Given at 5%, F8,7 = 3.76)
                 Observations                            Total
Sample I (x)     60  65  71  74  76  82  85  87          600
Sample II (y)    61  66  67  85  78  63  85  88  91      684

Ans: To test:
$H_0: \sigma_1^2 = \sigma_2^2$
Against:
$H_1: \sigma_1^2 \neq \sigma_2^2$
Here,
$\bar{x} = \frac{\sum x_i}{n_1} = \frac{600}{8} = 75$
$\bar{y} = \frac{\sum y_i}{n_2} = \frac{684}{9} = 76$

x_i      (x_i − x̄)²     y_i      (y_i − ȳ)²
60       225            61       225
65       100            66       100
71       16             67       81
74       1              85       81
76       1              78       4
82       49             63       169
85       100            85       81
87       144            88       144
                        91       225
Total    636            Total    1110

$\sigma_1^2 = \frac{\sum (x_i - \bar{x})^2}{n_1 - 1} = \frac{636}{8-1} = \frac{636}{7} = 90.86$
$\sigma_2^2 = \frac{\sum (y_i - \bar{y})^2}{n_2 - 1} = \frac{1110}{9-1} = \frac{1110}{8} = 138.75$
Here, $\sigma_1^2 < \sigma_2^2$

$F = \frac{\sigma_2^2}{\sigma_1^2} = \frac{138.75}{90.86} = 1.527$

At 5% level, Tab F = 3.76
Cal F = 1.527
Cal F ≤ Tab F,
Accept $H_0$.
Therefore there is no significant difference between the variances of the two samples.
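The variance-ratio calculation can be reproduced with a short Python sketch. The function name `variance_ratio_f` is an assumption for illustration; it relies on `statistics.variance`, which uses the n − 1 divisor, and places the larger variance in the numerator.

```python
from statistics import variance

def variance_ratio_f(x, y):
    """F statistic with (numerator d.f., denominator d.f.); larger variance on top."""
    var_x, var_y = variance(x), variance(y)   # sample variances, divisor n - 1
    if var_x >= var_y:
        return var_x / var_y, len(x) - 1, len(y) - 1
    return var_y / var_x, len(y) - 1, len(x) - 1

# Data from the example.
sample_1 = [60, 65, 71, 74, 76, 82, 85, 87]
sample_2 = [61, 66, 67, 85, 78, 63, 85, 88, 91]

f_value, df_num, df_den = variance_ratio_f(sample_1, sample_2)
print(f"Cal F = {f_value:.3f} with ({df_num}, {df_den}) d.f.")
print("reject H0" if f_value > 3.76 else "accept H0")   # 3.76: given table value
```

Because Sample II has the larger variance, its degrees of freedom (8) come first, matching the F8,7 table value quoted in the example.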
Ex.2. Consider two methods of measuring particulate matter in water and assume that we want
to find out if one is more precise than the other. A more precise method has a lower standard
deviation and hence a lower variance. The data
obtained for 10 tests each for the two methods are given below:
Method    Mean    S.D.    Variance
A         10.3    0.6     0.36
B         11      0.7     0.49
(Given for 9 degrees of freedom and 5% level, Tab F = 3.179)
Ans: To test:
$H_0: \sigma_1^2 = \sigma_2^2$
Against:
$H_1: \sigma_1^2 \neq \sigma_2^2$
Here,
$\sigma_1^2 = 0.36$ and $\sigma_2^2 = 0.49$
$\sigma_1^2 < \sigma_2^2$

$F = \frac{\sigma_2^2}{\sigma_1^2} = \frac{0.49}{0.36} = 1.3611$

At 5% level, Tab F = 3.179
Cal F < Tab F
Accept $H_0$.
Therefore the standard deviations are not significantly different and hence neither method is
more precise than the other.


Exercise
1. In a sample of 400 parts manufactured by a factory, the number of defective parts was
found to be 30. The company, however, claimed that only 5% of their product is defective. Is
the claim tenable? (Given at 5%, 𝑧 = 1.96)
2. The mean life of a sample of optical lenses produced by a company is computed to be
1570hrs with S.D. of 120hrs after which it is presumed that the lenses do not maintain
accuracy. The company claims that the average life of the lenses produced by it is 1600hrs.
Using the level of significance of 0.05 can you say the claim ia acceptable?
3. In an investigation on Neonatal Blood Pressure in relation to maturity, the following results
were obtained:
Babies 9 days old Number Mean S.B.P S.D.
1. Normal 54 75 6
2. Neonatal asphyxia 14 69 5
Is the difference in mean S.B.P. between the two groups significant at 5 level?(Given at 5%
level, 𝑧 = 1.96)
4. Intelligence test on two groups of boys abd girls gave the following results.
Group Mean S.D. n
1. Boys 75 15 150
2. Girls 70 20 150
Is there any significant difference in the mean scores obtained by boys and girls at 5% level?
(Use z-test)
5. For a random sample of 10 persons fed on diet A, the increase in weight in pounds in a
certain period were : 10, 6, 16, 17, 13, 12, 8, 14, 15, 9
For another random sample of 12 persons fed on diet B the same is given as: 7, 13, 22, 15,
12, 14, 18, 8, 21, 23, 10, 17
Test whether the diets A and B are different as regards their effect on increase in weight.
6. Life time of batteries for a random sample of 10 from a large consignment gave the
following data. Can we accept the hypothesis that average life time of battery is 400 hrs?
(Note: use t-test at 5% level for df = 9).
Battery 1 2 3 4 5 6 7 8 9 10
Life (in hrs X 100) 5.6 4.2 4.6 4.1 5.2 3.8 3.9 4.3 4.4 3.9

7. Two laboratories M and N carry out independent estimation of fat in ice-cream made by
same industry. A sample is taken from each batch and their observations are given in table
below. Is there any significant difference between the mean fat content obtained by two labs
M and N?

147
Bhumi Publishing, India

Batch No.: 1 2 3 4 5 6 7 8 9 10
Lab M 6 6 9 6 7 6 7 8 8 4
Lab N 7 6 7 8 6 5 7 7 9 4

8. Ten incubation periods of polio cases are given below. Discuss the mean by using t-test.
Days: 66 69 70 65 69 71 70 68 63 62.

9. A drug given for treatment to 10 patients suffering from diabetes showed change in blood
pressure as given in table. Is it reasonable to believe that drug has no side effect as change
in B.P. at 5% level.
125 130 120 140 135 125 120 140 135 125

10. The weights at birth (in kg) of female children born in a hospital are given below. Do the data
suggest that the mean birth weight differs significantly from the population mean of 3 kg?
2.5 3.0 2.5 3.0 3.2 3.5 2.5 3.1 2.9 3.5


8. ANOVA

Introduction:
In the previous chapter, we learned to use the t-test to test the equality of the means of two
populations based on data from two independent samples. The t-test and z-test, developed in the
20th century, were used until 1918, when Ronald Fisher created the analysis of variance (ANOVA),
which is an extension of the t-test and the z-test. In this chapter, we are going to test the equality of
the means of three or more populations. This comparison of means is based on the division of the
total variation into its component parts, hence the method is called analysis of variance. This
method was introduced by Sir Ronald A. Fisher and has been used in many research fields.
ANOVA is a statistical tool used to test whether the means of three or more populations are
significantly different from each other when the variances are unknown. It checks the impact of one or
more factors by comparing the means of samples.
Assumptions to use the ANOVA:
To use the ANOVA test we make the following assumptions:
• Each group sample is drawn from a normally distributed population.
• All populations have a common/same variance.
• Within each sample, the observations are sampled randomly and independently of each
other.
• The sample sizes for the groups are equal and greater than 10.
• Factor effects are additive.
Types of ANOVA:
There are two types of ANOVA:
1. One-way ANOVA(unidirectional)
2. Two-way ANOVA
1. One-way ANOVA:
The one-way analysis of variance (ANOVA) is generally used to determine whether there
are any statistically significant differences between the means of three or more independent
(unrelated) groups using the F-distribution. It has only one independent variable affecting the dependent
variable, so it is also known as one-factor analysis of variance. The null hypothesis for the test is that
all the group means are equal. Therefore, a significant result means that at least two of the means are unequal.
Limitations of the One Way ANOVA
A one-way ANOVA will tell that at least two groups were different from each other, but it
cannot tell which groups were different.


2. Two-way ANOVA:
A two-way ANOVA is an extension of one-way ANOVA in which there are two independent
factors affecting the dependent variable. It is mostly used when there is quantitative as well as
qualitative data.
In two-way ANOVA with one observation in each cell, two null hypotheses are tested,
i.e. the hypotheses would be:
H01: For the column factor (the column means are equal).
H02: For the row factor (the row means are equal).
ANOVA Table:
This is the table that shows the output of the ANOVA analysis and whether there is a
statistically significant difference between our group means. The tabular arrangement of source,
sum of squares, degree of freedom, mean sum of square (MSS) and F-ratio is called ANOVA table.
The word "source" stands for source of variation. Some authors prefer to use "between" and
"within" instead of "treatments" and "error", respectively.
Steps to perform ANOVA:
Following are the steps to perform ANOVA:
Step 1: setup null hypothesis (𝑯𝟎 ) for given data.
a) For one way ANOVA there should be only one null hypothesis.
b) For two way ANOVA there should be two null hypotheses.
Step 2: Find column sums i.e. 𝑐1 , 𝑐2 , 𝑐3 …
In case of two way ANOVA along with column sums, find row sums i.e. 𝑟1 , 𝑟2 , 𝑟3 , …
Note: for large values, minimize data by subtracting smallest element of data from all observation
Step 3: Find grand total (GT):
𝐺𝑇 = 𝑐1 + 𝑐2 + 𝑐3 + … . = 𝑟1 + 𝑟2 + 𝑟3 + … .
Step 4: Find correction factor (C.F.)
\[ C.F. = \frac{(GT)^2}{N} \]
Where; N = Total number of observations
Step 5: Find column sum of squares (CSS)
\[ CSS = \frac{c_1^2}{n_1} + \frac{c_2^2}{n_2} + \frac{c_3^2}{n_3} + \cdots - C.F. \]
Where n1, n2, n3… are the number of observations in respective columns.
Similarly, for two way ANOVA along with CSS, find out row sum of squares (RSS).
\[ RSS = \frac{r_1^2}{n_1} + \frac{r_2^2}{n_2} + \frac{r_3^2}{n_3} + \cdots - C.F. \]
Where n1, n2, n3… are the number of observations in respective rows
Step 6: Calculate total sum of squares (TSS):
𝑻𝑺𝑺 = 𝑺𝒖𝒎 𝒐𝒇 𝒔𝒒𝒖𝒂𝒓𝒆𝒔 𝒐𝒇 𝒂𝒍𝒍 𝒐𝒃𝒔𝒆𝒓𝒗𝒂𝒕𝒊𝒐𝒏𝒔 − 𝑪. 𝑭.
Step 7: Calculate error sum of squares (ESS)
a) For one way ANOVA:
𝑬𝑺𝑺 = 𝑻𝑺𝑺 − 𝑪𝑺𝑺
b) For two way ANOVA
𝑬𝑺𝑺 = 𝑻𝑺𝑺 − (𝑪𝑺𝑺 + 𝑹𝑺𝑺)
Step 8: Find degree of freedom (df)
a) For one way ANOVA:
df for 𝐶𝑆𝑆 = 𝑐 − 1 (where c = No. of columns )
df for 𝐸𝑆𝑆 = 𝑐 (𝑟 − 1) (Where c and r = No. of columns and rows )
b) For two way ANOVA:
df for 𝐶𝑆𝑆 = 𝑐 − 1
df for 𝑅𝑆𝑆 = 𝑟 – 1
df for 𝐸𝑆𝑆 = (𝑐 − 1) (𝑟 − 1)
Step 9: ANOVA table
a) One way ANOVA table
Source             Sum of squares   df          Mean sum of squares (MSS)   F ratio
Between columns    CSS              c − 1       CSS / (c − 1)               F = (larger MSS) / (smaller MSS)
Error              ESS              c(r − 1)    ESS / [c(r − 1)]
b) Two way ANOVA table
Source             Sum of squares   df                Mean sum of squares (MSS)   F ratio
Between columns    CSS              c − 1             CSS / (c − 1)               F1 = (larger of column MSS and error MSS) / (smaller of the two)
Between rows       RSS              r − 1             RSS / (r − 1)               F2 = (larger of row MSS and error MSS) / (smaller of the two)
Error              ESS              (c − 1)(r − 1)    ESS / [(c − 1)(r − 1)]
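Step 10: Conclusion
Compare the calculated F with the tabulated F at the given level of significance. If Cal F ≤ Tab F,
accept H0; otherwise, reject H0.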
a) One way ANOVA:


Ex.1. Use ANOVA and determine whether the machines are significantly different in their mean
speed (Given; at 5% F2,12 = 3.89).
Machines
A1 A2 A3
25 31 24
30 39 30
36 38 28
38 42 25
31 35 38
Ans.:
1) Set up null Hypothesis;
𝐻0 = There is no significant difference in the average speed of three machines.
2) Minimize data by subtracting the smallest observation (i.e. 24) from all and calculate
column sum.
Data become;
A1 A2 A3
1 7 0
6 15 6
12 14 4
14 18 1
7 11 14
𝑐1 = 40 𝑐2 = 65 𝑐3 = 25

3) Calculate Grand Total


𝐺. 𝑇. = 𝑐1 +𝑐2 + 𝑐3 = 40 + 65 + 25 = 130
4) Find Correction Factor:
\[ C.F. = \frac{(G.T.)^2}{N} = \frac{(130)^2}{15} = 1126.67 \]

5) Calculate CSS
\[ CSS = \frac{c_1^2}{n_1} + \frac{c_2^2}{n_2} + \frac{c_3^2}{n_3} - C.F. \]
\[ = \frac{40^2}{5} + \frac{65^2}{5} + \frac{25^2}{5} - 1126.67 \]
= [320 + 845 + 125] – 1126.67
= 1290 – 1126.67
= 163.33

6) Calculate TSS
TSS = [Sum of squares of all observations] – C.F.
Squares of the observations:
         A1     A2     A3
         1      49     0
         36     225    36
         144    196    16
         196    324    1
         49     121    196
Total    426    915    249
TSS = 426 + 915 + 249 – 1126.67
= 1590 – 1126.67
= 463.33
7) Find ESS
ESS = TSS – CSS = 463.33 – 163.33 = 300
8) Degree of freedom
For CSS = 𝑐 − 1 = 3 − 1 = 2
For ESS = 𝑐(𝑟 − 1) = 3 (5 − 1) = 12

9) One Way ANOVA table
Source   Sum of squares   d.f.   MSS                  F ratio
CSS      163.33           2      163.33/2 = 81.665    F = 81.665/25 = 3.266
ESS      300              12     300/12 = 25

10) Given; at 5% level of significance; 𝑇𝑎𝑏 𝐹2,12 = 3.89


𝐶𝑎𝑙 𝐹2,12 = 3.266
𝐶𝑎𝑙 𝐹 ≤ 𝑇𝑎𝑏 𝐹
So, accept 𝐻0
Thus, there is no significant difference in the average speed of three machines.
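The steps used in Ex.1 can be sketched in Python as a check on the hand calculation. This is only a minimal illustration and not part of the original method: the function name `one_way_anova` and the dictionary keys are hypothetical, and F is computed here as the between-column MSS divided by the error MSS, which coincides with the larger-over-smaller ratio in this example. The sketch also compares the raw data with the data reduced by 24, to show that subtracting a constant leaves the sums of squares unchanged.

```python
def one_way_anova(columns):
    """One-way ANOVA sums of squares for a list of columns (groups).

    Hypothetical helper that follows Steps 2-9 of the text.
    """
    n_total = sum(len(col) for col in columns)        # N
    grand_total = sum(sum(col) for col in columns)    # G.T.
    cf = grand_total ** 2 / n_total                   # correction factor
    css = sum(sum(col) ** 2 / len(col) for col in columns) - cf
    tss = sum(x ** 2 for col in columns for x in col) - cf
    ess = tss - css
    df_css = len(columns) - 1
    df_ess = n_total - len(columns)   # equals c(r - 1) for equal column sizes
    ms_css = css / df_css
    ms_ess = ess / df_ess
    # Assumption: F is taken as between-column MSS over error MSS.
    return {"CF": cf, "CSS": css, "TSS": tss, "ESS": ess,
            "df": (df_css, df_ess), "MSS": (ms_css, ms_ess),
            "F": ms_css / ms_ess}


machines = [[25, 30, 36, 38, 31],   # A1
            [31, 39, 38, 42, 35],   # A2
            [24, 30, 28, 25, 38]]   # A3

raw = one_way_anova(machines)
shifted = one_way_anova([[x - 24 for x in col] for col in machines])

print(round(raw["CSS"], 2), round(raw["TSS"], 2), round(raw["ESS"], 2))
print(raw["df"], round(raw["F"], 3))
# The reduced data give the same sums of squares as the raw data.
print(abs(raw["CSS"] - shifted["CSS"]) < 1e-6)
```

The calculated F is then compared with the tabulated F for the same degrees of freedom, exactly as in step 10 of the example.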
Ex.2. Prepare ANOVA table for following data and test if the varieties differ significantly among
themselves.
(Given; at 5% level of significance F8,3 = 4.07 and F3,8 = 8.83).


Varieties
A B C D
20 25 24 23
29 23 20 20
21 21 22 20

Ans.:
1) Set up null Hypothesis.
𝐻0 = The four varieties do not differ significantly among themselves
2) Minimize data by subtracting the smallest observation (i.e. 20) from all and calculate
column sum.
Data become;
A1 A2 A3 A4
0 5 4 3
9 3 0 0
1 1 2 0
𝑐1 = 10 𝑐2 = 9 𝑐3 = 6 𝑐4 = 3
3) Calculate Grand Total
G.T. = c1 + c2 + c3 + c4 = 10 + 9 + 6 + 3 = 28
4) Find Correction Factor:
\[ C.F. = \frac{(G.T.)^2}{N} = \frac{(28)^2}{12} = 65.33 \]

5) Calculate CSS
\[ CSS = \frac{c_1^2}{n_1} + \frac{c_2^2}{n_2} + \frac{c_3^2}{n_3} + \frac{c_4^2}{n_4} - C.F. \]
\[ = \frac{10^2}{3} + \frac{9^2}{3} + \frac{6^2}{3} + \frac{3^2}{3} - 65.33 \]
= [33.33 + 27 + 12 + 3] – 65.33
= 75.33 – 65.33
= 10
6) Calculate TSS
TSS = [Sum of squares of all observations] – C.F.
Squares of the observations:
         A     B     C     D
         0     25    16    9
         81    9     0     0
         1     1     4     0
Total    82    35    20    9
TSS = [82 + 35 + 20 + 9] – 65.33
= 146 – 65.33
= 80.67
7) Find ESS
𝐸𝑆𝑆 = 𝑇𝑆𝑆 – 𝐶𝑆𝑆 = 80.67 – 10 = 70.67
8) Degree of freedom
For CSS = 𝑐 − 1 = 4 − 1 = 3
For ESS = 𝑐(𝑟 − 1) = 4(3 − 1) = 8

9) One Way ANOVA table
Source   Sum of squares   d.f.   MSS               F ratio
CSS      10               3      10/3 = 3.33       F = 8.83/3.33 = 2.65
ESS      70.67            8      70.67/8 = 8.83

10) Given; at 5% level of significance; Tab F8,3 = 4.07
Cal F8,3 = 2.65
𝐶𝑎𝑙 𝐹 ≤ 𝑇𝑎𝑏 𝐹
So, accept 𝐻0
Thus, the varieties do not differ significantly among themselves.
b) Two-way ANOVA:
Ex.3. The following are the number of defectives produced by four workmen operating in turn
three different machines. Perform two-way ANOVA and check the difference between workmen and
machines.
(Given; At 5% F6,2 = 5.14 and F3,6 = 1.94)
Machines    Workmen
            X1    X2    X3    X4
Y1          36    39    42    40
Y2          32    41    39    32
Y3          37    36    44    26
Ans.:
1) Set up null Hypothesis.
a) for columns (workmen)


𝐻0 = There is no significant difference between workmen.


b) for rows (machines)
𝐻0 = There is no significant difference between machines.
2) Minimize data by subtracting the smallest observation (i.e. 26) from all and calculate
column sums & row sums
Data become;

Machines (Y)     X1         X2         X3         X4         Row totals
Y1               10         13         16         14         r1 = 53
Y2               6          15         13         6          r2 = 40
Y3               11         10         18         0          r3 = 39
Column totals    c1 = 27    c2 = 38    c3 = 47    c4 = 20

3) Calculate Grand Total


𝐺. 𝑇. = 𝑐1 +𝑐2 + 𝑐3 +𝑐4 = 𝑟1 + 𝑟2 + 𝑟3
= 27 + 38 + 47 + 20 = 53 + 40 + 39
= 132
4) Find Correction Factor:
\[ C.F. = \frac{(G.T.)^2}{N} = \frac{(132)^2}{12} = 1452 \]
5) a) Calculate CSS
\[ CSS = \frac{c_1^2}{n_1} + \frac{c_2^2}{n_2} + \frac{c_3^2}{n_3} + \frac{c_4^2}{n_4} - C.F. \]
\[ = \frac{27^2}{3} + \frac{38^2}{3} + \frac{47^2}{3} + \frac{20^2}{3} - 1452 \]
= [243 + 481.33 + 736.33 + 133.33] – 1452
= 1593.99 – 1452
= 141.99
b) Calculate RSS
\[ RSS = \frac{r_1^2}{n_1} + \frac{r_2^2}{n_2} + \frac{r_3^2}{n_3} - C.F. \]
\[ = \frac{53^2}{4} + \frac{40^2}{4} + \frac{39^2}{4} - 1452 \]
= [702.25 + 400 + 380.25] – 1452
= 1482.5 – 1452
= 30.5
6) Calculate TSS
TSS = [Sum of squares of all observations] – C.F.
Squares of the observations:
         X1     X2     X3     X4
Y1       100    169    256    196
Y2       36     225    169    36
Y3       121    100    324    0
Total    257    494    749    232
TSS = [257 + 494 + 749 + 232] – 1452
= 1732 – 1452
= 280
7) Find ESS
ESS = TSS – (CSS + RSS) = 280 – (142 + 30.5) = 107.5
8) Degree of freedom
For CSS = 𝑐 − 1 = 4 − 1 = 3
For RSS = 𝑟 − 1 = 3 − 1 = 2
For ESS = (c − 1)(r − 1) = (4 − 1)(3 − 1) = 6
9) Two Way ANOVA table
Source   Sum of squares   d.f.   MSS                 F ratio
CSS      142              3      142/3 = 47.33       F3,6 = 47.33/17.91 = 2.64
RSS      30.5             2      30.5/2 = 15.25      F6,2 = 17.91/15.25 = 1.174
ESS      107.5            6      107.5/6 = 17.91
10) a) For columns (workmen):
Given at 5% level of significance; 𝑇𝑎𝑏 𝐹3,6 = 1.94
𝐶𝑎𝑙 𝐹3,6 = 2.64
𝐶𝑎𝑙 𝐹 > 𝑇𝑎𝑏 𝐹
So, reject 𝐻0
Thus, there is a significant difference between workmen.
b) For rows (machines):
Given at 5% level of significance; 𝑇𝑎𝑏 𝐹6,2 = 5.14
𝐶𝑎𝑙 𝐹6,2 = 1.174
𝐶𝑎𝑙 𝐹 ≤ 𝑇𝑎𝑏 𝐹
So, accept 𝐻0
Thus, there is no significant difference between machines.
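The two-way calculation in Ex.3 can be checked with a short Python sketch. It is a minimal illustration under stated assumptions: the function name `two_way_anova` and the returned keys are hypothetical, the table has one observation per cell, and only the sums of squares, degrees of freedom and mean sums of squares are returned so that the F ratios can be formed as described in the text.

```python
def two_way_anova(table):
    """Two-way ANOVA (one observation per cell) for a list of rows.

    Hypothetical helper that follows Steps 2-9 of the text.
    """
    r = len(table)            # number of rows
    c = len(table[0])         # number of columns
    n = r * c                 # total number of observations
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    cf = sum(row_sums) ** 2 / n                       # correction factor
    css = sum(s ** 2 for s in col_sums) / r - cf      # each column has r values
    rss = sum(s ** 2 for s in row_sums) / c - cf      # each row has c values
    tss = sum(x ** 2 for row in table for x in row) - cf
    ess = tss - (css + rss)
    df = {"CSS": c - 1, "RSS": r - 1, "ESS": (c - 1) * (r - 1)}
    mss = {"CSS": css / df["CSS"], "RSS": rss / df["RSS"],
           "ESS": ess / df["ESS"]}
    return {"CF": cf, "CSS": css, "RSS": rss, "TSS": tss, "ESS": ess,
            "df": df, "MSS": mss}


defectives = [[36, 39, 42, 40],   # machine Y1, workmen X1..X4
              [32, 41, 39, 32],   # machine Y2
              [37, 36, 44, 26]]   # machine Y3

result = two_way_anova(defectives)
print(round(result["CSS"], 2), round(result["RSS"], 2),
      round(result["TSS"], 2), round(result["ESS"], 2))
print(result["df"])
print({k: round(v, 2) for k, v in result["MSS"].items()})
```

The raw observations are used here; reducing every value by 26 as in the worked example gives the same sums of squares.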
ANOVA vs. t-test
A Student's t-test tells whether there is a significant difference between the means of two groups.
ANOVA extends this comparison to three or more groups by analyzing the variation between and
within the groups. ANOVA gives a single number (the F-statistic) and one p-value to help decide
whether to accept or reject the null hypothesis.


Exercise
1. What do you mean by ANOVA? Explain in short.
2. Describe various steps for ANOVA.
3. The following table gives the yields of 15 sample plots under three varieties of seed. Find
whether the average yields of land under the different varieties of seed show a
significant difference.
A 20 21 23 16 20
B 18 20 17 15 25
C 25 28 22 28 32

4. A particular characteristic was studied in four different persons for the blood groups A, B and AB.
Set up an analysis of variance table and find out whether there is any significant difference
between the persons and between the blood groups.
Persons    Blood group
           A     B     AB
1          7     9     10
2          4     7     6
3          7     5     7
4          6     6     9

5. The following data gives sales made by three MR. Perform analysis of variance to test
whether there is any difference between sales made by three MR.
A B C
300 600 700
400 300 300
300 300 400
500 400 600
0 - 500

6. The life time in hours for cells taken randomly from four individuals were observed as given
in table. On the basis of observation, analyze the values for their variance and study its
significance.

Persons    Cell life time (hrs)
           1     2     3     4
P          61    68    60    58
Q          68    61    57    50
R          70    55    64    59
S          60    59    62    55
7. Samples of peanut butter produced by a company in three different batches are tested for
aflatoxin content (p.p.b.) and the following results are obtained. Use the 0.05 level of significance to
test whether the differences among the three sample means are significant.
Batch A 1.0 2.2 4.8 0.4 1.5 3.3
Batch B 0.7 1.2 5.2 3.6 1.8 2.5
Batch C 4.3 5.5 2.7 1.1 0.3 0.5

8. Two random samples were drawn from normal populations. Set up ANOVA table for given
data.
Sample 1 20 15 14 16 18 10 12 17
Sample 2 16 15 8 28 10 14 10 8 12 16

9. The following table contains the % drug release of 15 samples containing different concentrations of
excipients. Check whether the three varieties have any impact on the different samples.
A 85 73 66 75 67
B 70 83 77 40 65
C 90 70 64 30 55

10. A company appoints four salesmen J, K, L, and M and observes their sales in three different
zones viz. A, B, C. Carry out ANOVA. (Note: figures are in lakhs)
Zones Salesmen
J K L M
A 55 41 48 56
B 52 51 56 49
C 49 49 46 48


9. Chi-square test

Introduction:
In the previous few chapters, statistical inference has concentrated on statistics such as the
mean and proportion, which have been used to obtain interval estimates and to test hypotheses
concerning population parameters. This chapter changes the approach to inferential statistics by
studying whole distributions and the relationship between two distributions. These inferences are
drawn using the chi-square test.
The chi-square test is one of the most useful and widely used tests in statistics because the
assumptions needed to perform it are minimal. Thus, this test can be used in most circumstances.
Chi-square test:
The chi-square test is a procedure for testing whether two categorical variables are related in some
population. The null hypothesis of the chi-square test is that no relationship exists between the
categorical variables in the population; they are independent. This test is based on the difference
between what is actually observed in the data and what would be expected if there was truly no
relationship between the variables.
The chi-square test statistic is denoted by \(\chi^2\) and given by,
\[ \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} \]
Where, \(O_i\) = Observed values
\(E_i\) = Expected values
Steps to perform chi-square test:
1) Set up the hypothesis as,
To test:
𝐻0 : 𝑂𝑖 = 𝐸𝑖
Against:
𝐻1 : 𝑂𝑖 ≠ 𝐸𝑖
2) Test statistics: Chi square
3) Formula:
\[ \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} \]
4) Inference:
At 𝛼% level,
If 𝐶𝑎𝑙 𝜒 2 ≤ 𝑇𝑎𝑏 𝜒 2 , then Accept 𝐻0 .
Otherwise, reject 𝐻0 .


Types of chi-square test:


There are two types of chi-square test.
1. Chi-square test of goodness of fit
2. Chi-square test for Independence
1. Chi-square test of goodness of fit:
It determines if the sample data match with certain population i.e. it tells if the sample data
represents the data we would expect to find in actual population. It can be used for the discrete
distributions like the Binomial distribution and Poisson distribution. The chi-square Goodness of fit
is to fit one categorical variable to a distribution.
2. Chi-square test for Independence:
It determines whether the two events are independent i.e. it allows researcher to determine
whether the variables are independent of each other or whether there is a pattern of dependence
between them. The chi-square test for independence compares two sets of data to see if there is a
relationship.
The hypothesis for the chi-square test of independence is stated as:
To test:
H0: The two categorical variables in the population are independent.
Against:
H1: The two categorical variables in the population are dependent.

Ex.1. During clinical trials of newly developed drug for treatment of diabetic retinopathy out of 120
volunteers, 76 persons were administered a new drug. Out of 76 persons, 24 persons showed
symptoms of DR. Amongst those not administered the new drug 12 persons were not affected by
DR. Find out whether the new drug is effective or not by using Chi-Square test.
(Given, at 5% level, 𝜒 2 = 3.84)
Ans.: From given data, prepare table as given below;
DR status        Drug administered    Not administered    Total
Developed        24                   32                  56
Not developed    52                   12                  64
Total            76                   44                  120

1) To test:
𝐻0 : 𝑂𝑖 = 𝐸𝑖 i.e. Drug is not effective
Against:
𝐻1 : 𝑂𝑖 ≠ 𝐸𝑖


2) Test statistics: Chi square


3) Formula:
\[ \chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]
Where,
\[ E_{ij} = \frac{c_i \times r_j}{\text{total}} \]
Now, calculate expected frequency as below;
for O1 = 24, E1 = (76 × 56)/120 = 35.47
for O2 = 32, E2 = (44 × 56)/120 = 20.53
for O3 = 52, E3 = (76 × 64)/120 = 40.53
for O4 = 12, E4 = (44 × 64)/120 = 23.47
Now, prepare following table;
Observed frequency (O)   Expected frequency (E)   O − E      (O − E)²    (O − E)²/E
24                       35.47                    −11.47     131.5609    3.709
32                       20.53                    11.47      131.5609    6.408
52                       40.53                    11.47      131.5609    3.246
12                       23.47                    −11.47     131.5609    5.605
TOTAL                                                                    18.968

\[ \chi^2 = \sum \frac{(O - E)^2}{E} = 18.968 \]
4) Inference:
At 5% level and 1 d.f., 𝑇𝑎𝑏 𝜒 2 = 3.84
𝐶𝑎𝑙 𝜒 2 > 𝑇𝑎𝑏 𝜒 2 ,
So, reject 𝐻0 .
Thus, drug is effective.
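The expected frequencies and the chi-square value of Ex.1 can be reproduced with a short Python sketch. This is a minimal illustration, not part of the original text: the function name `chi_square_independence` is hypothetical, and each expected frequency is taken as (row total × column total) / grand total, as in the formula above. Because the sketch does not round the expected frequencies, its result (about 18.96) differs slightly in the second decimal from the hand calculation.

```python
def chi_square_independence(observed):
    """Chi-square statistic for a contingency table given as a list of rows.

    Hypothetical helper; returns (chi-square, degrees of freedom, expected).
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    expected = []
    for i, row in enumerate(observed):
        expected_row = []
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total   # expected frequency
            expected_row.append(e)
            chi2 += (o - e) ** 2 / e
        expected.append(expected_row)
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df, expected


# Rows: DR developed / not developed; columns: drug / no drug.
dr_table = [[24, 32],
            [52, 12]]
chi2, df, expected = chi_square_independence(dr_table)
print(round(chi2, 3), df)
print([[round(e, 2) for e in row] for row in expected])
```

The same function works for larger tables, for example a 2 × 3 table, in which case the degrees of freedom are (rows − 1)(columns − 1).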

Ex.2. Following table shows the result of an experiment of study of effectiveness of vaccines on
resistance to a particular disease.


Attacked No attacked
Vaccinated 11 31
Non-Vaccinated 30 8

Using Chi-square test, analyze the results of experiments for independence between vaccination
and attack.
(Given, at 5% level, 𝜒 2 = 3.84)

Ans.: From given data, prepare table as given below;

Attacked No attacked Total


Vaccinated 11 31 42
Non-Vaccinated 30 8 38
Total 41 39 80

1) To test:
𝐻0 : 𝑂𝑖 = 𝐸𝑖 i.e. Vaccination and attack of disease are independent.
Against:
𝐻1 : 𝑂𝑖 ≠ 𝐸𝑖
2) Test statistics: Chi square
Formula:
\[ \chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]
Where,
\[ E_{ij} = \frac{c_i \times r_j}{\text{total}} \]
Now, calculate expected frequency as below;
for O1 = 11, E1 = (41 × 42)/80 = 21.525
for O2 = 31, E2 = (39 × 42)/80 = 20.475
for O3 = 30, E3 = (41 × 38)/80 = 19.475
for O4 = 8, E4 = (39 × 38)/80 = 18.525
Now, prepare following table;

(O)    (E)       O − E       (O − E)²    (O − E)²/E
11     21.525    −10.525     110.7756    5.146
31     20.475    10.525      110.7756    5.410
30     19.475    10.525      110.7756    5.688
8      18.525    −10.525     110.7756    5.980
TOTAL                                    22.224

\[ \chi^2 = \sum \frac{(O - E)^2}{E} = 22.224 \]
3) Inference:
At 5% level and 1 d.f., 𝑇𝑎𝑏 𝜒 2 = 3.84
𝐶𝑎𝑙 𝜒 2 > 𝑇𝑎𝑏 𝜒 2 ,
So, reject 𝐻0 .
Thus, effect of vaccination on attack of disease is not independent.
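As an arithmetic check for a 2 × 2 table, the standard shortcut formula can be used. It is not part of the original worked solution; the cell labels a, b (first row) and c, d (second row), with N the grand total, are introduced here only for illustration:

\[ \chi^2 = \frac{N\,(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)} \]

For the vaccination data, a = 11, b = 31, c = 30, d = 8 and N = 80, so

\[ \chi^2 = \frac{80\,(11 \times 8 - 31 \times 30)^2}{42 \times 38 \times 41 \times 39} = \frac{80 \times 708964}{2552004} \approx 22.2 \]

This is far above the tabulated value of 3.84, so the conclusion is the same.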

Ex.3. A certain drug is claimed to be effective in curing colds. In an experiment on 328 people with
colds, half of them were given sugar pills and half were given the drug. The patients' reactions to the
treatment were recorded. Test the hypothesis that the drug is no better than the sugar
pills.
(Given, at 5% level, 𝜒 2 = 3.84)

Helped Harmed No effect


Drug 104 20 40
Sugar pills 88 24 52

Ans.: From given data, prepare table as given below;

Helped Harmed No effect Total


Drug 104 20 40 164
Sugar pills 88 24 52 164
Total 192 44 92

1) To test:
𝐻0 : 𝑂𝑖 = 𝐸𝑖 i.e. drug is no better to the treatment than the sugar pills.
Against:
𝐻1 : 𝑂𝑖 ≠ 𝐸𝑖


2) Test statistics: Chi square
Formula:
\[ \chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]
Where,
\[ E_{ij} = \frac{c_i \times r_j}{\text{total}} \]
Now, calculate expected frequency as below;
for O1 = 104, E1 = (192 × 164)/328 = 96
for O2 = 20, E2 = (44 × 164)/328 = 22
for O3 = 40, E3 = (92 × 164)/328 = 46
for O4 = 88, E4 = (192 × 164)/328 = 96
for O5 = 24, E5 = (44 × 164)/328 = 22
for O6 = 52, E6 = (92 × 164)/328 = 46
Now, prepare following table;
(O)    (E)    O − E    (O − E)²    (O − E)²/E
104    96     8        64          0.6667
20     22     −2       4           0.1818
40     46     −6       36          0.7826
88     96     −8       64          0.6667
24     22     2        4           0.1818
52     46     6        36          0.7826
TOTAL                              3.2622

\[ \chi^2 = \sum \frac{(O - E)^2}{E} = 3.2622 \]
3) Inference:
At 5% level, the given Tab χ² = 3.84; for the 2 d.f. of this 2 × 3 table, Tab χ² = 5.99.
Cal χ² < Tab χ² in either case,
So, accept H0.
Thus, there is no evidence that the drug is better than the sugar pills.

Ex.4. In a mobile manufacturing company, different numbers of faulty mobiles were produced in each
shift, as shown in the table. Test the hypothesis that the number of faulty mobiles produced
is independent of the shift if the shifts are worked in the ratio 4:5:3.
(Given, at 5% level and 2 d.f. 𝜒 2 = 5.99)


Shift No. of faulty mobiles


Morning 10
Afternoon 24
Night 08
Ans.:
1) To test:
H0: Oi = Ei i.e. the number of faulty mobiles is independent of shift
Against:
H1: Oi ≠ Ei
2) Test statistics: Chi square
Formula:
\[ \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} \]
The total number of faulty mobiles is 42, which is expected in the ratio 4:5:3 (total for the ratio is
4 + 5 + 3 = 12).
So,
for O1 = 10, E1 = (4/12) × 42 = 14
for O2 = 24, E2 = (5/12) × 42 = 17.5
for O3 = 8, E3 = (3/12) × 42 = 10.5
Now; prepare following table,
Shift        (O)    (E)     O − E    (O − E)²    (O − E)²/E
Morning      10     14      −4       16          1.1428
Afternoon    24     17.5    6.5      42.25       2.4143
Night        08     10.5    −2.5     6.25        0.5952
TOTAL                                            4.1523

\[ \chi^2 = \sum \frac{(O - E)^2}{E} = 4.1523 \]
3) Inference:
At 5% level and 2 d.f., Tab χ² = 5.99
Cal χ² < Tab χ²,
So, accept H0.
Thus, the number of faulty mobiles is independent of shift.


Ex.5. In an industry, the number of accidents in three shifts was 12, 14, 19. Can we conclude that all
shifts are equally dangerous?
(Given at 5% level, χ² for 2 d.f. = 5.99 and χ² for 3 d.f. = 7.81)
Ans.:
1) To test:
𝐻0 : 𝑂𝑖 = 𝐸𝑖 i.e. All shifts are equally dangerous.
Against:
𝐻1 : 𝑂𝑖 ≠ 𝐸𝑖
2) Test statistics: Chi square
Formula:
\[ \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} \]
Under the null hypothesis, the expected number of accidents in each shift is equal to the average.
Average = (12 + 14 + 19)/3 = 15

Now, prepare a table as follows;
Shift        (O)    (E)    O − E    (O − E)²    (O − E)²/E
Morning      12     15     −3       9           0.600
Afternoon    14     15     −1       1           0.067
Night        19     15     4        16          1.067
TOTAL                                           1.733

\[ \chi^2 = \sum \frac{(O - E)^2}{E} = \frac{26}{15} = 1.733 \]
3) Inference:
At 5% level and 2 d.f., Tab χ² = 5.99
Cal χ² < Tab χ²,
So, accept H0.
Thus, all shifts are equally dangerous.
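The goodness-of-fit calculations of Ex.4 and Ex.5 follow the same pattern and can be sketched in Python. This is a minimal illustration under stated assumptions: the function name `chi_square_goodness_of_fit` is hypothetical, and the expected frequencies are obtained by splitting the total of the observed frequencies in the stated ratio (equal shares 1:1:1 for Ex.5).

```python
def chi_square_goodness_of_fit(observed, ratio):
    """Chi-square goodness of fit against a hypothesised ratio.

    Hypothetical helper; returns (chi-square, degrees of freedom, expected).
    """
    total = sum(observed)
    ratio_sum = sum(ratio)
    # Expected frequency = share of the total implied by the ratio.
    expected = [total * part / ratio_sum for part in ratio]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, len(observed) - 1, expected


# Ex.4: faulty mobiles per shift, shifts worked in the ratio 4:5:3.
chi2_shift, df_shift, exp_shift = chi_square_goodness_of_fit([10, 24, 8], [4, 5, 3])
print(exp_shift, round(chi2_shift, 4), df_shift)

# Ex.5: accidents per shift, all shifts assumed equally dangerous.
chi2_acc, df_acc, exp_acc = chi_square_goodness_of_fit([12, 14, 19], [1, 1, 1])
print(exp_acc, round(chi2_acc, 4), df_acc)
```

In both cases the calculated value is compared with the tabulated χ² for 2 d.f. (5.99 at the 5% level).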


Exercise
1. A newly discovered drug was administered to 800 volunteers out of a total of 3000 volunteers
during clinical trials. The numbers of fever cases are shown below. Discuss the usefulness of the drug
(use the χ² test at the 1% level of significance).

Treatment Fever No fever Total


Drug 20 780 800
Placebo 200 2000 2200
Total 220 2780 3000

2. Certain drug is claimed to be effective in treatment of migraine. During clinical trials, out of 400
people suffering from migraine, 200 were administered tablet containing drug and 200 were
administered placebo tablet. The patient responses were recorded as shown in table. Test the
hypothesis that drug is no better to treatment.
Treatment Cured Harmed No effect
Drug 135 20 45
Placebo 65 25 110

3. The following table shows the number of people with high blood pressure in different age groups.
Do the data reveal an association between age group and high blood pressure?

Age group    No. of people    High blood pressure cases
20-30        100              10
30-40        350              200
40-50        350              265
50-60        200              170

4. Genetic theory states that children having one parent of blood group A and the other of blood group B
will always possess one of the blood groups A, AB or B, and that the proportions of the three types
will on average be 1:2:1. A report states that out of 250 children having one A
parent and one B parent, 30% were found to be of type A, 45% of type AB and the remainder of type B.
Test the hypothesis by the χ² test.

5. Theory predicts that the proportion of mangoes in the four groups J, K, L and M should be 9:3:3:1. In
an experiment among 1600 mangoes, the numbers in the four groups were 880, 315, 280 and 125. Do the
experimental results support the theory?


6. From the data given below about the treatment of 500 patients suffering from a disease, state
whether the new treatment is superior to the conventional treatment.
Treatment Favorable Non-favorable Total
New 280 60 340
Conventional 120 40 160
Total 400 100 500
(Given for 1 d.f., χ² at the 0.05 level = 3.84)

7. In an experiment on pea breeding, Mendel obtained the following frequencies of seeds: 315 round
and yellow, 101 wrinkled and yellow, 108 round and green, 32 wrinkled and green. Theory
predicts that the frequencies should be in the proportion 9:3:3:1. Examine the correspondence
between theory and experiment. (Given at 5% level, χ² for 3 d.f. = 7.815)

8. From the following data use Chi-square test and conclude whether inoculation is effective in
preventing a disease.
Attacked Not attacked Total
Inoculated 31 469 500
Non-inoculated 185 1315 1500
Total 216 1784 2000
(Given for 1 d.f., χ² at the 0.05 level = 3.84)

9. In a mobile manufacturing company, different numbers of faulty mobiles were produced in each
shift, as shown in the table. Test the hypothesis that the number of faulty mobiles
produced is independent of the shift if the shifts are worked in the ratio 4:5:3.
(Given, at 5% level and 2 d.f. 𝜒 2 = 5.99)
Shift No. of faulty mobiles
Morning 10
Afternoon 24
Night 08

10. A certain drug is claimed to be effective in curing colds. In an experiment on 328 people with
colds, half of them were given sugar pills and half were given the drug. The patients' reactions to
the treatment were recorded. Test the hypothesis that the drug is no better than
the sugar pills. (Given, at 5% level, χ² = 3.84)

Helped Harmed No effect


Drug 104 20 40
Sugar pills 88 24 52


10. Non-parametric test

Introduction:
In the previous chapters, the parametric methods of hypothesis testing, namely the
z-test, Student's t-test, Karl Pearson's correlation coefficient and ANOVA, were based on some
underlying assumptions about the data such as "normally distributed data" and "population variances are
equal". These methods are referred to as parametric tests. But what if the assumptions on which the
methods are based do not hold? In such cases we need to use distribution-free
methods for hypothesis testing. Such distribution-free methods are called nonparametric tests.
Nonparametric tests are often used when the population has an unknown distribution
or when the sample size is small. Nonparametric statistics are typically used on qualitative data.
They are useful when the data have no clear numerical interpretation, and work best with
data that can be ranked. These methods can be used without estimating the mean,
standard deviation or any other population parameter when that
information is not available.
Common nonparametric tests include Chi square, Mann Whitney U-test (Wilcoxon rank-
sum test), Kruskal-Wallis test, Friedman test and Spearman's rank-order correlation.
1. Mann Whitney U-test (Wilcoxon rank-sum test):
A popular nonparametric test to compare outcomes between two independent groups is the Mann
Whitney U test. The Mann Whitney U test, sometimes called the Mann Whitney Wilcoxon Test or the
Wilcoxon Rank Sum Test, is used to test whether two samples are likely to derive from the same
population. This test also compares the medians between the two populations. This test is often
performed as a two-sided test.
Steps:
1) Set up null hypothesis as follows:
To test:
H0: The two populations are equal
Against:
H1: The two populations are not equal.
2) Test statistic:
If \(n_1\) and \(n_2\) are the two sample sizes, then the Mann Whitney test statistic, denoted by U, is
the smaller of \(U_1\) and \(U_2\) defined below.
\[ U_1 = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1 \]
\[ U_2 = n_1 n_2 + \frac{n_2 (n_2 + 1)}{2} - R_2 \]
Where, \(R_1 = R_x\) i.e. sum of ranks of data x.
\(R_2 = R_y\) i.e. sum of ranks of data y.
3) Compute the test statistic:
Find ranks of given data and prepare the table as follows.
Note: the ranking in this case is given from lowest observation to the highest observation as
1 to 𝑛1 + 𝑛2 . For repeated observations, average of ranks are assigned as ranks.
𝑥 𝑦 𝑅𝑥 𝑅𝑦

4) Set up decision rule and make conclusion:


For given 𝛼 level of significance,
If 𝐶𝑎𝑙 𝑈 ≤ 𝑇𝑎𝑏 𝑈 then Reject 𝐻0
Otherwise Accept 𝐻0 .
Ex.1. Consider a Phase II clinical trial designed to investigate the effectiveness of a new drug to
reduce symptoms of asthma in children. A total of n=10 participants are randomized to receive
either the new drug or a placebo. Participants are asked to record the number of episodes of
shortness of breath over a 1 week period following receipt of the assigned treatment. The data are
shown below.
Placebo 7 5 6 4 12

New Drug 3 6 4 2 1
Is there a difference in the number of episodes of shortness of breath over a 1 week period
in participants receiving the new drug as compared to those receiving the placebo? By inspection, it
appears that participants receiving the placebo have more episodes of shortness of breath, but is
this statistically significant? (Given: at the 5% level of significance, for n1 = n2 = 5, Tab U = 2).
Ans:
1) To test:
H0: The two populations are equal
Against:
H1: The two populations are not equal.

2) Test statistic:
\[ U_1 = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1 \]
\[ U_2 = n_1 n_2 + \frac{n_2 (n_2 + 1)}{2} - R_2 \]
3) Prepare the table as follows.
                               Ascending order          Ranks
Placebo (x)    New Drug (y)    Placebo    New Drug      Placebo (Rx)    New Drug (Ry)
7              3               -          1             -               1
5              6               -          2             -               2
6              4               -          3             -               3
4              2               4          4             4.5             4.5
12             1               5          -             6               -
-              -               6          6             7.5             7.5
-              -               7          -             9               -
-              -               12         -             10              -
Total                                                   37              18
\[ U_1 = 5 \times 5 + \frac{5(5 + 1)}{2} - 37 = 3 \]
\[ U_2 = 5 \times 5 + \frac{5(5 + 1)}{2} - 18 = 22 \]
Here, U1 < U2. So U = U1 = 3
4) Given at the 5% level of significance, for n1 = n2 = 5, Tab U = 2
Cal U > Tab U
So accept H0.
Thus, there is no statistically significant difference in the number of episodes of shortness of breath
over a 1 week period between participants receiving the new drug and those receiving the placebo.
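The ranking and the computation of U in Ex.1 can be sketched in Python. This is a minimal illustration, not part of the original text: the function names `average_ranks` and `mann_whitney_u` are hypothetical. The pooled observations are ranked from lowest to highest, tied observations receive the average of their ranks, and U is the smaller of U1 and U2 as defined in the steps above.

```python
def average_ranks(values):
    """Rank values from lowest (rank 1) to highest, averaging ranks of ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of observations tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U, U1, U2, R1, R2) for two independent samples x and y."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))   # rank the pooled data
    r1 = sum(ranks[:n1])                       # sum of ranks of x
    r2 = sum(ranks[n1:])                       # sum of ranks of y
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return min(u1, u2), u1, u2, r1, r2


placebo = [7, 5, 6, 4, 12]
new_drug = [3, 6, 4, 2, 1]
u, u1, u2, r1, r2 = mann_whitney_u(placebo, new_drug)
print(r1, r2, u1, u2, u)
```

The value of U obtained this way is then compared with the tabulated U for the given sample sizes, following the decision rule in step 4.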
Ex.2. A new approach to prenatal care is proposed for pregnant women living in a rural
community. The new program involves in-home visits during the course of pregnancy in addition to
the usual or regularly scheduled visits. A pilot randomized trial with 15 pregnant women is
designed to evaluate whether women who participate in the program deliver healthier babies than
women receiving usual care. The outcome is the APGAR score measured 5 minutes after birth.
Recall that APGAR scores range from 0 to 10 with scores of 7 or higher considered normal
(healthy), 4-6 low and 0-3 critically low. The data are shown below.
Usual Care 8 7 6 2 5 8 7 3
New Program 9 8 7 8 10 9 6

Is there statistical evidence of a difference in APGAR scores in women receiving the new and
enhanced versus usual prenatal care? (given at 5% level, for 𝑛1 = 8, 𝑛2 = 7 𝑇𝑎𝑏𝑈 = 10)
Ans:
1) To test:
H0: The two populations are equal
Against:
H1: The two populations are not equal.
2) Test statistic:
\[ U_1 = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1 \]
\[ U_2 = n_1 n_2 + \frac{n_2 (n_2 + 1)}{2} - R_2 \]
3) Prepare the table as follows.
                                    Ascending order                  Ranks
Usual Care (x)    New Program (y)   Usual Care    New Program        Usual Care (Rx)    New Program (Ry)
8                 9                 2             -                  1                  -
7                 8                 3             -                  2                  -
6                 7                 5             -                  3                  -
2                 8                 6             6                  4.5                4.5
5                 10                7             7                  7                  7
8                 9                 7             -                  7                  -
7                 6                 8             8                  10.5               10.5
3                 -                 8             8                  10.5               10.5
-                 -                 -             9                  -                  13.5
-                 -                 -             9                  -                  13.5
-                 -                 -             10                 -                  15
Total                                                                45.5               74.5

\[ U_1 = 8 \times 7 + \frac{8(8 + 1)}{2} - 45.5 = 46.5 \]
\[ U_2 = 8 \times 7 + \frac{7(7 + 1)}{2} - 74.5 = 9.5 \]
Here, U2 < U1. So U = U2 = 9.5
4) Given at the 5% level of significance, for n1 = 8 and n2 = 7, Tab U = 10
Cal U < Tab U
So reject H0.
Thus, the two populations of APGAR scores are not equal in women receiving usual prenatal
care as compared to the new program of prenatal care.
Ex.3. A clinical trial is run to assess the effectiveness of a new anti-retroviral therapy for patients
with HIV. Patients are randomized to receive a standard anti-retroviral therapy (usual care) or the
new anti-retroviral therapy and are monitored for 3 months. The primary outcome is viral load
which represents the number of HIV copies per milliliter of blood. A total of 30 participants are
randomized and the data are shown below.

Standard Therapy: 7500 8000 2000 550 1250 1000 2250 6800 3400 6300 9100 970 1040 670 400
New Therapy: 400 250 800 1400 8000 7400 1020 6000 920 1420 2700 4200 5200 4100 undetectable
Is there statistical evidence of a difference in viral load in patients receiving the standard versus the
new anti-retroviral therapy? (Given: at the 5% level, for n1 = n2 = 15, Tab U = 64)
Ans:
1) To test:


H0: The two populations are equal


Against:
H1: The two populations are not equal.
2) Test statistic:
Because viral load measures are not normally distributed (with outliers as well as limits of
detection (e.g., "undetectable")), we use the Mann-Whitney U test.
$$U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1$$

$$U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - R_2$$
3) Prepare the table as follows.
Observed data                  Ascending order                 Ranks
x (Std)    y (New)             x (Std)    y (New)              Rx      Ry
7500       400                 -          undetectable         -       1
8000       250                 -          250                  -       2
2000       800                 400        400                  3.5     3.5
550        1400                550        -                    5       -
1250       8000                670        -                    6       -
1000       7400                -          800                  -       7
2250       1020                -          920                  -       8
6800       6000                970        -                    9       -
3400       920                 1000       -                    10      -
6300       1420                -          1020                 -       11
9100       2700                1040       -                    12      -
970        4200                1250       -                    13      -
1040       5200                -          1400                 -       14
670        4100                -          1420                 -       15
400        undetectable        2000       -                    16      -
-          -                   2250       -                    17      -
-          -                   -          2700                 -       18
-          -                   3400       -                    19      -
-          -                   -          4100                 -       20
-          -                   -          4200                 -       21
-          -                   -          5200                 -       22
-          -                   -          6000                 -       23
-          -                   6300       -                    24      -
-          -                   6800       -                    25      -
-          -                   -          7400                 -       26
-          -                   7500       -                    27      -
-          -                   8000       8000                 28.5    28.5
-          -                   9100       -                    30      -
Total                                                          245     220


$$U_1 = 15 \times 15 + \frac{15(15 + 1)}{2} - 245 = 100$$

$$U_2 = 15 \times 15 + \frac{15(15 + 1)}{2} - 220 = 125$$
Here, 𝑈1 < 𝑈2 . So 𝑈 = 𝑈1 = 100
4) Given at 5% level of significance and 𝑛1 = 𝑛2 = 15 𝑇𝑎𝑏 𝑈 = 64
𝐶𝑎𝑙 𝑈 > 𝑇𝑎𝑏 𝑈
So accept 𝐻0 .
Thus, there is no statistical evidence that the treatment groups differ in viral load.

Tests for paired data: The following nonparametric tests are used for paired data, such as the
“before-after” case.
• Sign Test: The Sign Test is the simplest nonparametric test for matched or paired data. The
approach is to analyze only the signs of the difference scores. The test statistic for the Sign Test
is the number of positive signs or the number of negative signs, whichever is smaller.
Steps:
1) Set up null hypothesis.
To test:
H0: The median difference is zero.
Against:
H1: The median difference is not zero
2) Test statistic:
The test statistic for the Sign Test is the smaller of the number of positive or negative signs.
3) Compute the test statistic:
Calculate the difference (Before - After) for each pair. Count the number of positive
differences and the number of negative differences; the smaller count is the test statistic.
4) Set up decision rule and make conclusion:
For the given level of significance α,
if the smaller of the number of positive or negative signs ≤ Tab value, then reject H0;
otherwise accept H0.
Ex.1. A new chemotherapy treatment is proposed for patients with breast cancer. Investigators
are concerned with patient's ability to tolerate the treatment and assess their quality of life both
before and after receiving the new chemotherapy treatment. Quality of life (QOL) is measured on
an ordinal scale and for analysis purposes, numbers are assigned to each response category as
follows: 1=Poor, 2= Fair, 3=Good, 4= Very Good, 5 = Excellent.
The data are shown below.


Patient   QOL Before Chemotherapy Treatment   QOL After Chemotherapy Treatment
1         3                                   2
2         2                                   3
3         3                                   4
4         2                                   4
5         1                                   1
6         3                                   4
7         2                                   4
8         3                                   3
9         2                                   1
10        1                                   3
11        3                                   4
12        2                                   3
1) Test whether there is a difference in QOL after chemotherapy treatment as compared to
before. (Given: at the 5% level, for n = 12, the critical value (Tab value) = 2)
Ans:
2) Set up null hypothesis.
To test:
H0: The median difference is zero.
Against:
H1: The median difference is not zero
3) Test statistic:
The test statistic for the Sign Test is the smaller of the number of positive or negative signs.
4) Compute the test statistic:
Calculate the difference (After - Before) for each patient. Count the number of positive
differences and the number of negative differences for the test statistic.
Patient QOL Before QOL After Difference
Treatment Treatment (After-Before)
1 3 2 -1
2 2 3 1
3 3 4 1
4 2 4 2
5 1 1 0
6 3 4 1
7 2 4 2
8 3 3 0
9 2 1 -1
10 1 3 2
11 3 4 1
12 2 3 1
Now, here there are two zeros, so randomly assign one negative sign (i.e., "-" to patient 5) and one
positive sign (i.e., "+" to patient 8), as follows:


Patient   QOL Before Treatment   QOL After Treatment   Difference (After-Before)   Sign
1         3                      2                     -1                          -
2         2                      3                     1                           +
3         3                      4                     1                           +
4         2                      4                     2                           +
5         1                      1                     0                           -
6         3                      4                     1                           +
7         2                      4                     2                           +
8         3                      3                     0                           +
9         2                      1                     -1                          -
10        1                      3                     2                           +
11        3                      4                     1                           +
12        2                      3                     1                           +

There are 9 positive signs and 3 negative signs. The test statistic is the smaller count, i.e. the number of negative signs, which is equal to 3.

5) Given at the 5% level and n = 12, Tab value = 2.

The smaller of the number of positive or negative signs (3) > Tab value (2),
so accept H0.
Thus, there is no statistically significant evidence of a difference in QOL after chemotherapy treatment as compared to before.
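The sign counts of this example can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original text: the function names are illustrative, the `zero_as` argument simply records the signs that the worked example assigned to the two zero differences, and the exact binomial p-value is a supplementary check that the text itself does not compute.

```python
from math import comb


def sign_counts(before, after, zero_as=None):
    """Count positive and negative signs of (after - before).

    zero_as maps a 0-based pair index to '+' or '-' for pairs with no change.
    Zero differences without an entry in zero_as are ignored.
    """
    zero_as = zero_as or {}
    pos = neg = 0
    for i, (b, a) in enumerate(zip(before, after)):
        d = a - b
        sign = '+' if d > 0 else '-' if d < 0 else zero_as.get(i)
        if sign == '+':
            pos += 1
        elif sign == '-':
            neg += 1
    return pos, neg


def two_sided_p(pos, neg):
    """Exact two-sided binomial p-value (success probability 0.5 under H0)."""
    n = pos + neg
    k = min(pos, neg)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


before = [3, 2, 3, 2, 1, 3, 2, 3, 2, 1, 3, 2]
after = [2, 3, 4, 4, 1, 4, 4, 3, 1, 3, 4, 3]
# Patient 5 (index 4) gets "-" and patient 8 (index 7) gets "+", as in the example.
pos, neg = sign_counts(before, after, zero_as={4: '-', 7: '+'})
print(pos, neg, min(pos, neg))          # 9 3 3
print(round(two_sided_p(pos, neg), 4))  # 0.146
```

A p-value above 0.05 agrees with the table-based decision to accept H0.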

• Wilcoxon Signed Rank Test:


Steps:
1) Set up null hypothesis.
To test:
H0: The median difference is zero.
Against:
H1: The median difference is not zero
2) Test statistic:
The test statistic for the Wilcoxon Signed Rank Test is W, defined as the smaller of W+ and
W- which are the sums of the positive and negative ranks, respectively.
3) Compute the test statistic:
Calculate the difference (Before - After) for each pair.
Then rank the absolute values of the differences from 1 to n, ignoring their signs, and
attach a negative sign to the rank of each negative difference. Assign the mean of the ranks to
repeated (tied) differences.

4) Set up decision rule and make conclusion:


For given 𝛼 level of significance,
If 𝐶𝑎𝑙 𝑊 ≤ 𝑇𝑎𝑏 𝑊 then Reject 𝐻0
Otherwise Accept 𝐻0 .

Ex.1. A study is run to evaluate the effectiveness of an exercise program in reducing systolic blood
pressure in patients with pre-hypertension (defined as a systolic blood pressure between 120-139
mmHg or a diastolic blood pressure between 80-89 mmHg). A total of 15 patients with pre-
hypertension enroll in the study, and their systolic blood pressures are measured. Each patient then
participates in an exercise training program where they learn proper techniques and execution of a
series of exercises. Patients are instructed to do the exercise program 3 times per week for 6 weeks.
After 6 weeks, systolic blood pressures are again measured. The data are shown below.

Patient   Systolic Blood Pressure Before Exercise Program   Systolic Blood Pressure After Exercise Program
1         125                                               118
2         132                                               134
3         138                                               130
4         120                                               124
5         125                                               105
6         127                                               130
7         136                                               130
8         139                                               132
9         131                                               123
10        132                                               128
11        135                                               126
12        136                                               140
13        128                                               135
14        127                                               126
15        130                                               132

Is there a difference in systolic blood pressures after participating in the exercise program as
compared to before? (Given: at the 5% level and n = 15, Tab W = 25)
Ans:
1) Set up null hypothesis.
To test:
H0: The median difference is zero.
Against:
H1: The median difference is not zero
2) Test statistic:
The test statistic for the Wilcoxon Signed Rank Test is W, defined as the smaller of W+ and
W- which are the sums of the positive and negative ranks, respectively.

3) Compute the test statistic:

Patient   BP Before Exercise Program   BP After Exercise Program   Difference (Before-After)
1         125                          118                         7
2         132                          134                         -2
3         138                          130                         8
4         120                          124                         -4
5         125                          105                         20
6         127                          130                         -3
7         136                          130                         6
8         139                          132                         7
9         131                          123                         8
10        132                          128                         4
11        135                          126                         9
12        136                          140                         -4
13        128                          135                         -7
14        127                          126                         1
15        130                          132                         -2

Rank table (differences ordered by absolute value):
Observed Differences   Ordered Differences   Signed Ranks
7                      1                     1
-2                     -2                    -2.5
8                      -2                    -2.5
-4                     -3                    -4
20                     -4                    -6
-3                     -4                    -6
6                      4                     6
7                      6                     8
8                      -7                    -10
4                      7                     10
9                      7                     10
-4                     8                     12.5
-7                     8                     12.5
1                      9                     14
-2                     20                    15

Here, W+ = 89 and W- = 31.
W- < W+
Hence, W = W- = 31

4) Given at the 5% level and n = 15, Tab W = 25.
Cal W (31) > Tab W (25),
so accept H0.
Thus, there is no significant difference in systolic blood pressures after the exercise program as
compared to before.
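The signed ranks in the table above can be recomputed as follows. This is a minimal Python sketch, not part of the original text; the function name `signed_rank_sums` is illustrative, and dropping zero differences is an assumption (the example contains none).

```python
def signed_rank_sums(before, after):
    """Return (W_plus, W_minus) for the differences before - after."""
    # Assumption: pairs with a zero difference are dropped before ranking.
    diffs = [b - a for b, a in zip(before, after) if b != a]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of equal absolute differences.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        mean_rank = (i + j + 2) / 2  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


before = [125, 132, 138, 120, 125, 127, 136, 139, 131, 132, 135, 136, 128, 127, 130]
after = [118, 134, 130, 124, 105, 130, 130, 132, 123, 128, 126, 140, 135, 126, 132]
w_plus, w_minus = signed_rank_sums(before, after)
w = min(w_plus, w_minus)
print(w_plus, w_minus, w)  # 89.0 31.0 31.0
```

The two sums add up to n(n + 1)/2 = 120, which is a quick check that no rank was lost.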
2. The Kruskal-Wallis Test:
Like ANOVA in testing of hypothesis, the nonparametric Kruskal-Wallis test is used to
compare more than two independent groups, typically through their medians. It is sometimes
described as an ANOVA with the data replaced by their ranks.
Steps:
1) Set up null hypothesis:
To test:
H0: The k population medians are equal
Against:
H1: The k population medians are not all equal
2) Test statistic:
The Kruskal-Wallis test is denoted by H and given by,

$$H = \frac{12}{N(N + 1)}\left(\frac{R_1^2}{n_1} + \frac{R_2^2}{n_2} + \frac{R_3^2}{n_3} + \cdots + \frac{R_k^2}{n_k}\right) - 3(N + 1)$$

Where, k = number of groups (columns)
N = total number of observations
n1, n2, n3, … are the numbers of observations in the respective columns
R1, R2, R3, … are the sums of ranks in the respective columns.
3) Compute the test statistic:
Pool the data from all columns, arrange them in ascending order and assign ranks from 1 to N.
Take the average rank for repeated values. Add up the ranks in each column and calculate the test statistic.
4) Set up decision rule and make conclusion:
For given 𝛼 level of significance,
If 𝐶𝑎𝑙 𝐻 ≤ 𝑇𝑎𝑏 𝐻 then Accept 𝐻0
Otherwise Reject 𝐻0
Ex.1. A personal trainer is interested in comparing the anaerobic thresholds of elite athletes.
Anaerobic threshold is defined as the point at which the muscles cannot get more oxygen to sustain
activity or the upper limit of aerobic exercise. It is a measure also related to maximum heart rate.
The following data are anaerobic thresholds for distance runners, distance cyclists, distance
swimmers and cross-country skiers.


Distance Runners   Distance Cyclists   Distance Swimmers   Cross-Country Skiers
185                190                 166                 201
179                209                 159                 195
192                182                 170                 180
165                178                 183                 187
174                181                 160                 215
1) Is there a difference in anaerobic thresholds among the different groups of elite athletes? (Given:
at the 5% level and 3 d.f., Tab H = 7.81)
Ans: Set up null hypothesis:
To test:
H0: The four population medians are equal
Against:
H1: The four population medians are not all equal
2) Test statistic:
The Kruskal-Wallis test is denoted by H and given by,

$$H = \frac{12}{N(N + 1)}\left(\frac{R_1^2}{n_1} + \frac{R_2^2}{n_2} + \frac{R_3^2}{n_3} + \cdots + \frac{R_k^2}{n_k}\right) - 3(N + 1)$$

3) Compute the test statistic:


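Total Sample (Ascending order)                                   Ranks
Distance   Distance   Distance    Cross-Country
Runners    Cyclists   Swimmers    Skiers          R1    R2    R3    R4
-          -          159         -               -     -     1     -
-          -          160         -               -     -     2     -
165        -          -           -               3     -     -     -
-          -          166         -               -     -     4     -
-          -          170         -               -     -     5     -
174        -          -           -               6     -     -     -
-          178        -           -               -     7     -     -
179        -          -           -               8     -     -     -
-          -          -           180             -     -     -     9
-          181        -           -               -     10    -     -
-          182        -           -               -     11    -     -
-          -          183         -               -     -     12    -
185        -          -           -               13    -     -     -
-          -          -           187             -     -     -     14
-          190        -           -               -     15    -     -
192        -          -           -               16    -     -     -
-          -          -           195             -     -     -     17
-          -          -           201             -     -     -     18
-          209        -           -               -     19    -     -
-          -          -           215             -     -     -     20
Total                                             46    62    24    78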


Here, N = 20 and n1 = n2 = n3 = n4 = 5
R1 = 46, R2 = 62, R3 = 24, R4 = 78

$$H = \frac{12}{20(20 + 1)}\left(\frac{46^2}{5} + \frac{62^2}{5} + \frac{24^2}{5} + \frac{78^2}{5}\right) - 3(20 + 1)$$

$$= \frac{12}{420} \times 2524 - 63$$

$$= 72.114 - 63$$

$$= 9.114$$

4) Given at the 5% level and 3 d.f., Tab H = 7.81.

Cal H > Tab H,
so reject H0.
Thus, there is a difference in median anaerobic thresholds among the four different groups of elite
athletes.
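The rank sums and the H statistic of this example can be verified with the sketch below. It is plain Python and not part of the original text; the function name `kruskal_wallis_h` is illustrative. It uses the standard form of the statistic with the final term 3(N + 1), which is what the subtraction of 63 in the worked example corresponds to, and it applies no correction for ties (there are none in these data). Carrying 12/420 at full precision gives H ≈ 9.11; the conclusion against Tab H = 7.81 is unchanged.

```python
def kruskal_wallis_h(groups):
    """Return (rank_sums, H) with H = 12/(N(N+1)) * sum(R_j^2 / n_j) - 3(N+1)."""
    # Pool all observations, remembering which group each one came from.
    pooled = [(value, g) for g, group in enumerate(groups) for value in group]
    pooled.sort(key=lambda pair: pair[0])
    n_total = len(pooled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n_total:
        j = i
        # Extend j over the run of tied values.
        while j + 1 < n_total and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        mean_rank = (i + j + 2) / 2  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            rank_sums[pooled[k][1]] += mean_rank
        i = j + 1
    total = sum(r * r / len(g) for r, g in zip(rank_sums, groups))
    h = 12 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)
    return rank_sums, h


runners = [185, 179, 192, 165, 174]
cyclists = [190, 209, 182, 178, 181]
swimmers = [166, 159, 170, 183, 160]
skiers = [201, 195, 180, 187, 215]
rank_sums, h = kruskal_wallis_h([runners, cyclists, swimmers, skiers])
print(rank_sums)    # [46.0, 62.0, 24.0, 78.0]
print(round(h, 3))  # 9.114
```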

Summary:
Mann Whitney U Test:
Use: To compare a continuous outcome in two independent samples.
Null Hypothesis: H0: Two populations are equal
Test Statistic: The test statistic is U, the smaller of
$$U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1$$

$$U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - R_2$$

Where, R1 = Rx, i.e. the sum of ranks of data x,
R2 = Ry, i.e. the sum of ranks of data y.
Decision Rule: Reject H0 if U ≤ critical value from table
Sign Test
Use: To compare a continuous outcome in two matched or paired samples.
Null Hypothesis: H0: Median difference is zero
Test Statistic: The test statistic is the smaller of the number of positive or negative signs.
Decision Rule: Reject H0 if the smaller of the number of positive or negative signs ≤ critical
value from table.
Wilcoxon Signed Rank Test
Use: To compare a continuous outcome in two matched or paired samples.
Null Hypothesis: H0: Median difference is zero
Test Statistic: The test statistic is W, defined as the smaller of W+ and W- which are the sums
of the positive and negative ranks of the difference scores, respectively.
Decision Rule: Reject H0 if W ≤ critical value from table.


Kruskal Wallis Test


Use: To compare a continuous outcome in more than two independent samples.
Null Hypothesis: H0: k population medians are equal
Test Statistic: The test statistic is H,

$$H = \frac{12}{N(N + 1)}\left(\frac{R_1^2}{n_1} + \frac{R_2^2}{n_2} + \frac{R_3^2}{n_3} + \cdots + \frac{R_k^2}{n_k}\right) - 3(N + 1)$$

Where, k = number of groups (columns)
N = total number of observations
n1, n2, n3, … are the numbers of observations in the respective columns
R1, R2, R3, … are the sums of ranks in the respective columns.
Decision Rule: Reject H0 if H > critical value


11. Experimental Design

• Research and development departments in pharmacy continuously work on various
experiments for the betterment of human health. This involves a huge investment of time as
well as money. Therefore, it is very important that all experiments are planned properly so
that the final results are useful.
• In general, researchers formulate a hypothesis first and then verify it by
collecting data through different kinds of experiments. It is very important to collect data
that are relevant to the study, and this comes under the heading of design of experiments.
• Experiment: An experiment is a process or study that results in the collection of data.
• Experimental Design: It is a way of carefully planning experiments in advance so that the
results are both specific to the objective and valid.
• It means planning the experiment so that the information collected is relevant to the
stated problem.
• The terms ‘Experimental Design’ and ‘Design of Experiments’ are used interchangeably.
• Design of experiments involves complete planning of each and every step so that
appropriate data are generated in the most economical way possible.

Steps of experimental design:


Step 1: Define statement of problem
Step 2: Formulation of Hypothesis
Step 3: Design experimental method
Step 4: Examination of possible outcomes
Step 5: To observe necessary conditions
Step 6: Performance of experiment
Step 7: Collection of data
Step 8: Application of statistical techniques
Step 9: Drawing conclusion from result
Step 10: Evaluation of investigation

Purpose of an experimental design:


• To provide the maximum amount of information related to the problem under investigation.
• To save time, money, personnel and experimental material.
• To obtain relevant information.

An experimental design should satisfy the following points:

• It should be simple.
• It should minimize or eliminate confounding variables, which can offer alternative explanations for the
experimental results. (A confounding variable is an “extra” variable that was not accounted for. Such
variables can ruin an experiment and give false results.)
• It should allow the different variables involved in the experiment to be correlated.
• It should describe how samples are allocated to groups or selected for the experiment.

Design of experiments involves:


• The systematic collection of data.
• A focus on the design itself, rather than the results.
• Planning changes to independent (input) variables and observing the effect on dependent (response)
variables.
• Ensuring results are valid, easily interpreted, and definitive.

Basic principles of experimental designs:


Replication, randomization and local control are the three basic principles of experimental
design.
1. Replication: It is an essential feature of an experimental design. It involves repeating the experiment
a number of times in order to obtain a more reliable result. Replication is required to minimize
experimental error.
2. Randomization: It is a critical principle of experimental design which ensures that statistical
tests have a valid significance level and that bias in the results is kept to a
minimum. It helps to reduce errors in the results.
Randomization tends to make the study groups comparable with respect to known as well as unknown
factors affecting the results.
It makes the statistical test valid and the analysis of the data appropriate.
3. Local control: It is also known as error control. Local control refers to the amount of balancing,
blocking and grouping of experimental units. Replication together with local control helps to
reduce the experimental error. Local control makes the experimental design more efficient. The
main purpose of error control is to reduce the magnitude of the estimate of experimental error.

Experimental Study Design and sample selection:


Various types of study designs that are used in pharmacy for clinical trials, bioavailability and
bioequivalence studies are discussed below:

1. Completely randomized design:

In a completely randomized design, the treatments are allocated completely at random among all the
experimental subjects.
Method: Label all subjects with the same number of digits. For example, if there are 20 subjects,
number them 01-20. Randomly select non-repeating numbers from these labels for the first treatment
and then repeat for the other treatments.
Advantages:
1. It is the simplest design.
2. It can accommodate any number of subjects and treatments.
3. Although the sample size might not be the same for each treatment, this design is simple to analyze.
Limitations:
1. It is best suited to situations involving relatively few treatments.
2. All subjects must be as homogeneous as possible. Any extreme deviation results in error.
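The labelling-and-selection method of the completely randomized design can be sketched in Python as follows. This is a minimal illustration, not part of the original text; the function name `completely_randomized`, the treatment labels "A" and "B" and the seed are hypothetical, and the sketch assumes the groups should be as equal in size as possible.

```python
import random


def completely_randomized(subjects, treatments, seed=None):
    """Randomly split subjects into treatment groups of (nearly) equal size."""
    rng = random.Random(seed)  # a seed makes the allocation reproducible
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    allocation = {t: [] for t in treatments}
    # Deal the shuffled subjects out to the treatments in turn.
    for position, subject in enumerate(shuffled):
        allocation[treatments[position % len(treatments)]].append(subject)
    return allocation


# 20 subjects labelled with the same number of digits: "01" .. "20".
subjects = [f"{i:02d}" for i in range(1, 21)]
allocation = completely_randomized(subjects, ["A", "B"], seed=1)
for treatment, group in allocation.items():
    print(treatment, sorted(group))
```

Each subject appears in exactly one group, and no subject characteristic influences the allocation.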
2. Randomized block design:
In this design, subjects are first sorted into homogeneous groups called blocks, and
then within each block the treatments are assigned randomly, as in the completely randomized design.
Method: Subjects are classified into blocks depending on their characteristics, and the treatments
are randomized within each block. Each block is independent of the others.
Advantages:
1. It can produce more precise results than the completely randomized design.
2. It can accommodate any number of treatments or replications.
3. Blocking produces more comparable (homogeneous) groups.
Limitations:
1. Statistical data analysis is relatively difficult.
2. Missing observations within a block increase the complexity of the study.
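Randomization within blocks can be sketched in the same way. The following Python example is illustrative only and not part of the original text; the function name `randomized_block`, the age-based blocks, the subject labels and the treatment names are hypothetical, and each block is assumed to contain exactly one subject per treatment.

```python
import random


def randomized_block(blocks, treatments, seed=None):
    """Within each block, assign the treatments in an independent random order."""
    rng = random.Random(seed)
    plan = {}
    for block_name, members in blocks.items():
        if len(members) != len(treatments):
            raise ValueError("each block needs one subject per treatment")
        order = list(treatments)
        rng.shuffle(order)  # a fresh random order for every block
        plan[block_name] = dict(zip(members, order))
    return plan


# Hypothetical blocks of similar subjects (here grouped by age band).
blocks = {
    "age 20-39": ["S1", "S2", "S3"],
    "age 40-59": ["S4", "S5", "S6"],
    "age 60+": ["S7", "S8", "S9"],
}
plan = randomized_block(blocks, ["T1", "T2", "T3"], seed=7)
for block_name, assignment in plan.items():
    print(block_name, assignment)
```

Every treatment appears once in every block, so treatment comparisons are made among similar subjects.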

3. Repeated measures, cross-over and carry-over design:


It is essentially a randomized block design in which each subject serves as a block and each
block is used to study all planned treatments. Since this study involves use of the same subject several
times, it is called a repeated measures design.
This method may involve a single treatment or several treatments at different time points.
Administration of two or more treatments one after the other to the same group of patients is called a
cross-over design.
Method: Complete randomization is used to randomize the order of treatments for each subject,
independently of the other subjects.
Advantages:
1. This method is well suited to checking the effect of treatment over time, because it uses the
same subject for all treatments.
2. It minimizes experimental error, since it uses the same subject for the complete study.
Limitation:
It can produce a carry-over effect, which may lead to false results. A carry-over effect is
the presence of a residual effect of the previous treatment in the subject at the time of the current treatment.
This problem can be overcome by allowing enough time for wash-out of the drug from the body.

4. Latin square designs:


• In this method each subject receives each treatment during the experiment.
• It is a two-factor design, i.e. subject and treatment, with one observation in each cell.
• This design minimizes the problem of carry-over effect by providing enough time for wash-out
of the previous treatment.
• In this method rows represent subjects, columns represent study periods and the letters in the cells
represent treatments.
• An n × n Latin square design is a square with n rows and n columns such that each
of the n² cells contains one of n letters representing the treatments, and each letter appears only once in
every row and every column.
• As an example, consider a 3 × 3 Latin square design for 3 subjects to compare 3 different tablets,
each containing a different grade of an excipient affecting the disintegration time of the tablet. The design
will consist of 3 × 3 = 9 experiments in 3 rows and 3 columns. The treatments are represented as
X, Y, Z. Then the following is a particular 3 × 3 Latin square.
Subject No.   Study Period I   Washing period   Study Period II   Washing period   Study Period III   Washing period
1             X                                 Y                                  Z
2             Y                                 Z                                  X
3             Z                                 X                                  Y

• If the first row and first column contain the n letters in alphabetical order, the Latin square is
called a standard square.
• This design is used mainly in the pharmaceutical field for bioequivalence studies, as well as in
agriculture.
Advantages:
1. It minimizes carry-over effects.
2. It minimizes the inter-subject variability.
3. It makes it possible to study formulation variables, which are the most important part of a
bioequivalence study.
Limitations:
1. The study takes more time because of the washing periods.
2. Patient dropout rate is high.
3. Randomization is somewhat difficult.
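The 3 × 3 square shown above is a cyclic Latin square, and squares of any size can be built the same way. The following Python sketch is illustrative and not part of the original text; the function names are hypothetical. In practice the rows would usually also be assigned to subjects at random, which this sketch does not do.

```python
def cyclic_latin_square(treatments):
    """Build an n x n Latin square by shifting the treatment list one step per row."""
    n = len(treatments)
    return [[treatments[(row + col) % n] for col in range(n)] for row in range(n)]


def is_latin_square(square):
    """Check that every symbol appears exactly once in each row and each column."""
    n = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all({square[r][c] for r in range(n)} == symbols for c in range(n))
    return len(symbols) == n and rows_ok and cols_ok


square = cyclic_latin_square(["X", "Y", "Z"])
for subject, row in enumerate(square, start=1):
    print(subject, *row)
# 1 X Y Z
# 2 Y Z X
# 3 Z X Y
print(is_latin_square(square))  # True
```

Because the first row and first column are in alphabetical order, this is also a standard square in the sense defined above.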


Sample size for study:


• For an effective and economical clinical trial, it is essential that the sample size be
optimum.
• For calculation of sample size, responses can be classified into two classes:
1. Binomial response: e.g. success or failure
2. Continuous response: e.g. measurement of blood pressure
1. For binomial response:

$$2N = \frac{2\left[Z_\alpha\sqrt{2p(1 - p)} + Z_\beta\sqrt{P_c(1 - P_c) + P_d(1 - P_d)}\right]^2}{(P_c - P_d)^2}$$

Where;
2N: Total number of subjects = Nd + Nc
Nd: Number of subjects in drug group
Nc: Number of subjects in control group
Pc: Success rate in control group
Pd: Success rate in drug group
Zα: 1.645 (one-sided 5% level of significance)
Zβ: 1.282 (90% power)

$$p = \frac{R_d + R_c}{N_d + N_c}$$

Where Rd and Rc are the numbers of successes in the drug and control groups respectively.
2. For continuous response:

$$2N = \frac{4\sigma^2 (Z_\alpha + Z_\beta)^2}{\delta^2}$$

δ: the true difference which is to be detected between the control and drug groups.
Zα and Zβ: 1.96 (two-sided 5% level of significance) and 1.282 (90% power) respectively
σ²: variance of the response (σ is the standard deviation)
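The two sample-size formulas can be evaluated with a few lines of Python. This is a minimal sketch, not part of the original text: the function names are illustrative, the input values (σ = 10, δ = 5, success rates of 0.40 and 0.30) are hypothetical, and p is taken as the average of the two success rates, which assumes equal group sizes.

```python
from math import ceil, sqrt


def total_n_binomial(p_control, p_drug, z_alpha, z_beta):
    """Total sample size 2N for a binomial (success/failure) response."""
    p_bar = (p_control + p_drug) / 2  # assumption: equal group sizes
    bracket = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
               + z_beta * sqrt(p_control * (1 - p_control) + p_drug * (1 - p_drug)))
    return 2 * bracket ** 2 / (p_control - p_drug) ** 2


def total_n_continuous(sigma, delta, z_alpha, z_beta):
    """Total sample size 2N for a continuous response."""
    return 4 * sigma ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2


# Hypothetical inputs; results are rounded up to whole subjects.
print(ceil(total_n_continuous(sigma=10, delta=5, z_alpha=1.96, z_beta=1.282)))  # 169
print(ceil(total_n_binomial(0.40, 0.30, z_alpha=1.96, z_beta=1.282)))           # 953
```

Halving δ quadruples the required sample size, which is why the difference to be detected should be fixed before the study begins.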

Exercise
1. Define experimental design. Give its significance.
2. Enlist the steps involved in experimental design.
3. Explain in detail the various types of experimental designs used in clinical trial studies.
4. Add a note on randomization, replication and local control.
5. Give the formulae for calculation of sample size for clinical trials.


12. Application of Biostatistics in Pharmacy

What is Statistics?
Statistics is a branch of mathematics which deals with the collection, organization, analysis,
interpretation and presentation of data obtained from experiments or surveys.
Initially, the use of statistics was restricted to the collection of information related to health,
population, property, etc. by the respective governing bodies. Later, developments in the field
made statistics an essential part of many disciplines. Nowadays, almost all fields involve applications of
statistics, including pharmacy, medicine, political science, commerce, business, media and
many more. Moreover, every person uses statistics in daily life, although he or she may not be aware
that the practice is called statistics.
Progress in computing has made the use of statistics simpler. With
computers and newly developed software one can handle huge numerical data sets and
obtain results in a fraction of a minute. Even with smartphones, people can apply
statistics anywhere for data analysis and presentation.
What is biostatistics?
Biostatistics is the term used when the tools of statistics are applied to data obtained from
biological areas. In other words, biostatistics is the application of statistics to a wide range of
topics in biology.
Biostatistics involves the collection, summarization, analysis, interpretation and presentation of
data from various biological experiments, especially in medicine, pharmacy, agriculture and fishery. A
major branch of this is medical biostatistics, which is exclusively concerned with medicine and
health.
In medicine, whether in research, diagnosis or treatment, everything depends on measurement
or counting. For example, saying that a tablet disintegrates rapidly or slowly has no meaning unless it
is expressed in figures. Therefore, biostatistics is also called quantitative medicine.
Applications of Biostatistics:
1. In research and development of pharmaceutical industries:
Biostatistics plays a crucial role in research and development in the pharmaceutical industry.
Drug discovery and development, development of new formulations of existing drugs and development of
generic products all involve applications of biostatistics.
It takes about 10-15 years to develop one new medicine from the time it is discovered to
when it is available for treating patients. The average cost to research and develop each successful
drug is estimated to be $800 million to $1 billion. The overall process involves discovery of drug
molecules, screening of drug molecules, preclinical study, submission of an IND (Investigational
New Drug Application), clinical trials and submission of an NDA (New Drug Application) for approval of
the drug for marketing. Every stage, from the discovery phase to submission of the NDA, involves
application of biostatistics: experimental design, stating hypotheses, checking probability,
selection of sampling techniques, collection of data through different experiments, arrangement of
collected data, analysis and interpretation of results. The investigator needs to take prior permission from
the FDA for both preclinical and clinical trials, and has to submit data which prove the safety and
efficacy of the drug for further study. The FDA monitors all generated data very carefully and permits the
investigator to proceed to the next stage only if satisfied. All these data are numerical and involve
different calculations and representation of the generated data in a specific manner so that appropriate
conclusions can be drawn from them. As already stated, this process involves a huge investment of
money as well as time, so any error in the results will lead the investigator to tremendous loss. Such
errors can be avoided or minimized by application of biostatistics.
Similarly, in the development of generic products, the applicant has to prove bioequivalence of the
developed product with the innovator's product and submit data to the FDA in an ANDA (Abbreviated New
Drug Application), which again involves application of biostatistics.
2. In anatomy and physiology study:
i) Biostatistics is used to study various physiological and anatomical parameters and their correlation
with health, e.g. mean pulse rate, average glucose level, mean and variance of weight and height.
For example, the mean height of boys in Maharashtra is less than that in Punjab; whether this difference
is due to natural variation or to a difference in nutrition can be studied by
applying statistics.
ii) Biostatistics is also used for the study of the normal and healthy population and to set limits for
abnormality.
3. In pharmacology:
i) In pharmacology, biostatistics plays an important role in finding the action of a drug on animals or humans.
ii) It is also used to compare two different drugs, two different formulations of the same drug or two
identical dosage forms from different manufacturers.
4. In medicine:
i) Biostatistics is used to find the relation between two factors, like T.B. and smoking.
ii) It can be used to compare the efficacy of drugs.
iii) Signs and symptoms of a disease or syndrome are identified by statistical study. E.g. in
typhoid, fever is observed in almost all cases and cough is rare.
5. In community medicine and public health:
i) Biostatistics is used in community medicine to find the usefulness of sera and vaccines,
to compare vaccinated and unvaccinated groups, etc.
ii) It is used in epidemiological studies to find the role of causative factors, e.g. deficiency of calcium
in osteoporosis.
iii) In public health, it can be used to check the effectiveness of preventive measures. For example, a fall
in the death rate may be the result of the availability of modern facilities in hospitals, advancement in
medicines or an increase in public awareness.
Exercise
1. Define statistics and biostatistics. Why is biostatistics called quantitative medicine?
2. Explain the applications of biostatistics in pharmacy in detail.


