
BIOSTATISTICS

Dr. Anjil K Srivastava


Department of Biotechnology
NIT, Durgapur - 9
Biostatistics

Biostatistics has been defined as “the application of statistical methods to biological sciences”.
Biostatistics developed during the period of Sir Francis Galton (1822–1911).
• He applied statistical methods to the analysis of biological variation, correlation and regression.
• Karl Pearson (1857–1936), regarded as the father of modern statistics, was motivated by the researches of Sir Francis Galton. For measuring correlation, Karl Pearson’s method, popularly known as Pearson’s coefficient of correlation, is the most widely used in practice.
Central tendency

In any distribution, the values of the variable tend to congregate around a central value. This property of the distribution is known as central tendency.
There are five basic measures of central tendency: arithmetic mean, median, mode, geometric mean and harmonic mean.
Arithmetic Mean

• The most familiar and widely used measure of central tendency is the arithmetic mean.
• It represents the entire data by one value, obtained by adding together all the values and dividing this total by the number of observations.
The sample mean is the average of a set of data. It is computed as the sum of all the observed outcomes from the sample divided by the total number of observations. We use x̄ as the symbol for the sample mean.
Arithmetic Mean for series of individual observations

For observations $x_1, x_2, x_3, \ldots, x_n$:

$\bar{x} = \dfrac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \dfrac{\sum x}{n}$
Example

12, 15, 11, 11, 7, 13


First, find the sum of the data.
12 + 15 +11 + 11 + 7 + 13 = 69
Then divide by the number of data.
69 / 6 = 11.5
The mean is 11.5
Example

An electronics store sells CD players at the following prices: Rs. 3500, Rs. 2750, Rs. 5000, Rs. 3250, Rs. 1000, Rs. 3750, and Rs. 3000. What is the mean price?
3500 + 2750 + 5000 + 3250 + 1000 + 3750 + 3000 = 22250
22250 / 7 = 3178.57
The mean or average price of a CD player is Rs. 3178.57.
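The two worked examples can be checked with a few lines of Python. This is only a minimal sketch: the function name `arithmetic_mean` and the variable names are illustrative choices, not part of the lecture.

```python
def arithmetic_mean(values):
    """Return the sum of the observations divided by their count."""
    if not values:
        raise ValueError("the mean of an empty series is undefined")
    return sum(values) / len(values)


# Data from the two examples above.
observations = [12, 15, 11, 11, 7, 13]
prices = [3500, 2750, 5000, 3250, 1000, 3750, 3000]

mean_observations = arithmetic_mean(observations)
mean_price = round(arithmetic_mean(prices), 2)  # rounded to paise

print(mean_observations, mean_price)
```

Rounding is applied only at the end, so the total of 22250 is divided by 7 at full precision first.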
Arithmetic Mean for discrete series

$\bar{x} = \dfrac{\sum fx}{\sum f}$
where $\bar{x}$ = arithmetic mean
∑f = sum of the frequencies (the total number of observations, n)
∑fx = sum of the products of the values of the variable and their corresponding frequencies
Example

The data recorded on the number of chlorophyll deficient plants in a lentil population is given below. Calculate the mean.

| Number of chlorophyll deficient plants (x) | Number of plants (f) | fx |
|---|---|---|
| 0 | 34 | 0 |
| 1 | 14 | 14 |
| 2 | 20 | 40 |
| 3 | 24 | 72 |
| 4 | 25 | 100 |
| 5 | 33 | 165 |
| Total | ∑f = 150 | ∑fx = 391 |

$\bar{x} = \dfrac{\sum fx}{\sum f} = \dfrac{391}{150} = 2.61$
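For a discrete series the same calculation weights each value by its frequency. The sketch below reproduces the lentil example; `discrete_mean` is a hypothetical helper name chosen for illustration.

```python
def discrete_mean(values, frequencies):
    """Mean of a discrete series: sum of f*x divided by sum of f."""
    if len(values) != len(frequencies):
        raise ValueError("each value needs exactly one frequency")
    total_f = sum(frequencies)
    total_fx = sum(x * f for x, f in zip(values, frequencies))
    return total_fx / total_f


# x = number of chlorophyll deficient plants, f = number of plants
x = [0, 1, 2, 3, 4, 5]
f = [34, 14, 20, 24, 25, 33]

lentil_mean = discrete_mean(x, f)
print(round(lentil_mean, 2))
```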
Merits of Mean

• It is easy to understand and easy to calculate.
• It is rigidly defined.
• It is based on all the observations.
• It provides a good basis for comparison.
• It is amenable to further mathematical treatment.
• It is least affected by sampling fluctuations.
Demerits of Mean

• The mean is unduly affected by extreme items.
• It cannot be accurately determined if even one of the values is not known.
Median

The median is the middle value of the observations, or the value which divides a distribution so that an equal number of items occur on either side of it.
Example: 12, 15, 11, 11, 7, 13
First, arrange the data in numerical order.
7, 11, 11, 12, 13, 15
Then find the number in the middle or the average of the two numbers in the middle.
11 + 12 = 23
23 / 2 = 11.5
The median is 11.5
Median in a series of individual observations

• Arrange the data in ascending or descending order.
• The median is located by finding the size of the (n+1)/2 th item.
• $M = \text{size of the } \left(\dfrac{n+1}{2}\right)\text{th item}$
• where M = median and n = number of observations
Example

Find the median from the data recorded on the number of clusters per plant in a pulse crop.
Number of clusters = 10, 18, 17, 19, 10, 15, 11, 17, 12

| Sl. No. | Data arranged in ascending order |
|---|---|
| 1 | 10 |
| 2 | 10 |
| 3 | 11 |
| 4 | 12 |
| 5 | 15 |
| 6 | 17 |
| 7 | 17 |
| 8 | 18 |
| 9 | 19 |

M = size of the (n+1)/2 th item
Median = size of the (9+1)/2 th item
Median = size of the 5th item = 15
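The positional rule can be sketched in Python as follows. The function name `median_of` is an illustrative assumption; for an even number of observations it averages the two middle values, as in the first median example.

```python
def median_of(values):
    """Middle value of the sorted data, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty series is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the (n+1)/2 th item, which is index n // 2 when counting from 0.
        return ordered[mid]
    # Even n: average of the two middle items.
    return (ordered[mid - 1] + ordered[mid]) / 2


clusters = [10, 18, 17, 19, 10, 15, 11, 17, 12]   # odd number of observations
observations = [12, 15, 11, 11, 7, 13]            # even number of observations

median_clusters = median_of(clusters)
median_observations = median_of(observations)
print(median_clusters, median_observations)
```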
Merits of Median

• It is easy to define and easy to understand.
• It is also recommended for distributions with unequal classes.
• The median is not affected by the size of the values of extreme items.
• The value of the median can be determined graphically, whereas the value of the mean cannot be ascertained graphically.
Demerits of Median

• It is not based on all the observations, since it is a positional average.
• The median is affected more by sampling fluctuations than the mean.
• If the number of observations is even, the median cannot be located exactly. In this case the mean of the two middle values is taken as the estimate of the median.
• It may be unsuitable when large or small items need to be given weight.
Mode

The mode is another measure of central tendency which is conceptually very useful.
The mode is the most typical value of a distribution because it is repeated the highest number of times in the series.
Definition

“The most commonly occurring value”
According to Croxton and Cowden, “the mode of a distribution is the value at the point around which the items tend to be most heavily concentrated.”
A set of data may have a single mode, in which case it is said to be “unimodal”. When the concentration of data occurs at two or more points, the series is called bimodal or multimodal.
Example: 12, 15, 11, 11, 7, 13

The mode is 11.

Sometimes a set of data will have more than one mode.

For example, in the following set the numbers 5 and 7 both appear twice.

2, 9, 5, 7, 8, 6, 4, 7, 5

5 and 7 are both modes, and this set is said to be bimodal.
Sometimes there is no mode in a set of data.

3, 8, 7, 6, 12, 11, 2, 1

All the numbers in this set occur only once, therefore there is no mode in this set.
Example: Find the Mean, Median and Mode of Ungrouped Data

The weekly pocket money for 9 first-year pupils was found to be:

3, 12, 4, 6, 1, 4, 2, 5, 8

| Mean | Median | Mode |
|---|---|---|
| 5 | 4 | 4 |
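Python's standard `statistics` module offers all three measures. The sketch below checks the pocket money example and the earlier mode examples; the variable names are illustrative, and `multimode` requires Python 3.8 or later.

```python
from statistics import mean, median, multimode

pocket_money = [3, 12, 4, 6, 1, 4, 2, 5, 8]

summary = {
    "mean": mean(pocket_money),
    "median": median(pocket_money),
    "modes": multimode(pocket_money),  # list of the most frequent values
}
print(summary)

# multimode returns every value tied for the highest frequency.
bimodal = multimode([2, 9, 5, 7, 8, 6, 4, 7, 5])
all_unique = multimode([3, 8, 7, 6, 12, 11, 2, 1])
print(bimodal, len(all_unique))
```

Note that when every value occurs once, `multimode` returns all of the values; that is its way of signalling what the slides call "no mode".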
Mode of Grouped Data

$M_0 = L_1 + \dfrac{\Delta_1}{\Delta_1 + \Delta_2} \times h$
• $L_1$ = lower boundary of the modal class
• $\Delta_1$ = difference in frequency between the modal class and the class before it
• $\Delta_2$ = difference in frequency between the modal class and the class after it
• $h$ = class interval
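Calculate the mode

| Number of grains/panicle | 100-110 | 110-130 | 130-140 | 140-160 | 160-170 | 170-180 |
|---|---|---|---|---|---|---|
| Number of plants | 11 | 40 | 27 | 34 | 12 | 6 |

The class intervals are unequal, so the wider classes are first split into equal classes of width 10, with their frequencies divided equally:

| Number of grains/panicle | Number of plants |
|---|---|
| 100-110 | 11 |
| 110-120 | 20 |
| 120-130 | 20 |
| 130-140 | 27 |
| 140-150 | 17 |
| 150-160 | 17 |
| 160-170 | 12 |
| 170-180 | 6 |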
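The mode lies in the class 130-140.

$M_0 = L_1 + \dfrac{\Delta_1}{\Delta_1 + \Delta_2} \times h$

$L_1 = 130$; $\Delta_1 = 27 - 20 = 7$; $\Delta_2 = 27 - 17 = 10$; $h = 10$

$\text{Mode} = 130 + \dfrac{7}{7 + 10} \times 10 = 130 + \dfrac{70}{17} = 130 + 4.12 = 134.12$

Mode = 134.12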
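The grouped-data formula can be sketched in Python as below. It assumes equal class widths and a single modal class whose frequency differs from at least one neighbour; `grouped_mode` and its parameter names are hypothetical, chosen for illustration.

```python
def grouped_mode(lower_bounds, frequencies, width):
    """Mode of grouped data: L1 + d1 / (d1 + d2) * h, for equal class widths."""
    # Index of the modal class (the first class with the highest frequency).
    k = max(range(len(frequencies)), key=lambda i: frequencies[i])
    f_before = frequencies[k - 1] if k > 0 else 0
    f_after = frequencies[k + 1] if k < len(frequencies) - 1 else 0
    d1 = frequencies[k] - f_before   # difference from the class before
    d2 = frequencies[k] - f_after    # difference from the class after
    return lower_bounds[k] + d1 / (d1 + d2) * width


# Equal-width classes 100-110, 110-120, ..., 170-180 from the example.
lower_bounds = [100, 110, 120, 130, 140, 150, 160, 170]
plants = [11, 20, 20, 27, 17, 17, 12, 6]

mode_grains = grouped_mode(lower_bounds, plants, width=10)
print(round(mode_grains, 2))
```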
Merits of mode

• The mode is easy to calculate and can be determined by mere observation.
• It is not unduly affected by extreme items.
• It is simple and precise.
• It is the point of greatest concentration of frequencies.
Demerits of mode

• The mode is not based on all the observations.
• The value of the mode cannot be determined uniquely in a bimodal distribution.
• It is not a rigidly defined measure. Sometimes the modal class cannot be identified by inspection of the data.
• In that case it is necessary to prepare a grouping table and an analysis table to find the modal class.
Standard Deviation

• The standard deviation formula is very simple: it is the square root of the variance.
• It is the most commonly used measure of spread.
• It was first introduced by Karl Pearson in 1893.
• The problem of algebraic signs encountered in the mean deviation is overcome by squaring the deviations, thereby making them all positive.
Standard Deviation: $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$

$\bar{x}$ = arithmetic mean
n = number of observations
Variance

• Variance is also called the mean square deviation.
• The term was first coined by R. A. Fisher in 1918.
• The term “variance” is used to describe the square of the standard deviation.
• It helps us in isolating the effects of various factors.
The variance is defined as the mean of the squares of the deviations from the mean.

$s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$

$\bar{x}$ = arithmetic mean
n = number of observations
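As an illustration, the sample variance and standard deviation (with the n - 1 divisor) can be computed for the data set 12, 15, 11, 11, 7, 13 used in the mean example. The function names are hypothetical; the standard library's `statistics.variance` and `statistics.stdev` use the same n - 1 divisor and could be used instead.

```python
from math import sqrt


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_sd(values):
    """Standard deviation: the square root of the variance."""
    return sqrt(sample_variance(values))


data = [12, 15, 11, 11, 7, 13]   # mean = 11.5
variance = sample_variance(data)
sd = sample_sd(data)
print(round(variance, 2), round(sd, 2))
```

Here the squared deviations sum to 35.5, so the variance is 35.5 / 5 = 7.1.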
Probability

• In the nineteenth century, Pierre-Simon de Laplace compiled the first general theory of probability.
• R. A. Fisher and von Mises introduced the empirical approach to probability.
• The modern theory of probability was developed by Chebyshev, A. Markov and A. N. Kolmogorov.
Definition

Probability is the likelihood of occurrence of an event.
Example

| | Animals other than poultry | Poultry |
|---|---|---|
| Parents | Male (XY) and Female (XX) | Male (XX) and Female (XY) |
| Gametes | X or Y (male); X (female) | X (male); X or Y (female) |
| Progeny | XX (female), XY (male) | XX (male), XY (female) |


Statistical Explanation

• If an event can happen in “a” ways
• and the same event can fail to happen in “b” ways,
• then the probability of its happening, “p”, is

$p = \dfrac{\text{number of times the event occurs}}{\text{total number of trials}} = \dfrac{a}{a+b}$
Example

• If a surgeon transplants a kidney in 400 cases and succeeds in 160 cases, calculate the probability of survival after the operation.

$p = \dfrac{\text{number of survivals after the operation}}{\text{total number of patients operated}} = \dfrac{160}{400} = \dfrac{2}{5}$
Event

Any possible outcome of a random experiment is called an event.
Performing an experiment is called a trial, and the outcome is termed an event.
In simple terms, “an event is the occurrence of something”.
Ex. The occurrence of a head or a tail is an event.
• Events due to chance are grouped into two categories:
– Mutually exclusive events
– Independent events
Mutually Exclusive events

Events are said to be mutually exclusive if the occurrence of one excludes the possibility of the others. In other words, two events are mutually exclusive if both cannot occur simultaneously.
Examples: coin toss (head or tail), sex of a newborn baby
Independent Events

A set of events is said to be independent if the occurrence of any one event does not affect the chance of occurrence of any other event of the set.

Example: toss of two different coins


Theorems of Probability

There are two basic rules of chances:--

– Addition Rule

– Multiplication Rule
Addition Rule (for mutually exclusive events)

Suppose two events A and B are mutually exclusive. Then the probability of the occurrence of either A or B is the sum of their individual probabilities.
p (A or B) = p (A) + p (B)
The same rule can be extended to three or more events:
p (A or B or C) = p (A) + p (B) + p (C)
Example

• From a pack of 52 cards, one card is drawn at random. What is the probability that it is either a king or a queen?
• The events are mutually exclusive. There are 4 kings and 4 queens in a pack of 52 cards.
• So the probability of a king is 4/52, and the probability of a queen is also 4/52.
The probability that the card is either a king or a queen:
p (A or B) = p (A) + p (B) = 4/52 + 4/52 = 8/52 = 2/13
Addition Rule (for events that are not mutually exclusive)

When events A and B are not mutually exclusive, it is possible for both events to occur, so the rule must be modified:

p (A or B) = p (A) + p (B) - p (A and B)
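Both forms of the addition rule can be checked by counting cards in a 52-card pack. The sketch below uses exact fractions; the king-or-heart case is an added illustration that is not in the slides, and all names are hypothetical.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]
DECK = list(product(RANKS, SUITS))   # 52 (rank, suit) pairs


def probability(event):
    """Favourable cards divided by the total number of cards."""
    favourable = sum(1 for card in DECK if event(card))
    return Fraction(favourable, len(DECK))


def is_king(card):
    return card[0] == "K"


def is_queen(card):
    return card[0] == "Q"


def is_heart(card):
    return card[1] == "hearts"


# Mutually exclusive events: a card cannot be both a king and a queen.
p_king_or_queen = probability(lambda c: is_king(c) or is_queen(c))

# Not mutually exclusive: the king of hearts belongs to both events.
p_king_or_heart = probability(lambda c: is_king(c) or is_heart(c))
p_king_and_heart = probability(lambda c: is_king(c) and is_heart(c))

print(p_king_or_queen, p_king_or_heart)
```

Subtracting the joint probability removes the king of hearts, which would otherwise be counted twice.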


Multiplication Rule (For independent events)

• If two events “A” and “B” are independent, the probability of their joint occurrence is given by the product of their separate probabilities.

p (A and B) = p(A) x p(B)


Example

• What is the probability of heads on two or three successive tosses?
– p(A) = probability of a head on the first toss = ½ = 0.5
– p(B) = probability of a head on the second toss = ½ = 0.5
• Combined probability for two tosses:
p (A and B) = p(A) x p(B) = ½ x ½ = ¼ = 0.25
• For three tosses, the third toss contributes another factor of ½: ½ x ½ x ½ = ⅛ = 0.125
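The multiplication rule for independent events can be verified by listing every equally likely outcome of the tosses. This is a minimal sketch with a hypothetical function name, assuming a fair coin.

```python
from fractions import Fraction
from itertools import product


def p_all_heads(n_tosses):
    """Probability that every one of n fair tosses is a head, by enumeration."""
    outcomes = list(product("HT", repeat=n_tosses))
    favourable = [o for o in outcomes if all(side == "H" for side in o)]
    return Fraction(len(favourable), len(outcomes))


p_two = p_all_heads(2)     # outcomes: HH, HT, TH, TT
p_three = p_all_heads(3)   # eight outcomes, only HHH is favourable

print(p_two, float(p_two), p_three, float(p_three))
```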
Multiplication Rule (For Dependent events)

• If two events “A” and “B” are dependent, the probability of occurrence of one event depends on the occurrence of the other event.

p (A and B) = p(A) x p(B/A)

p (A, B and C) = p(A) x p(B/A) x p(C/AB)

where p(B/A) is the probability of B given that A has occurred.


Example

• A bag contains 7 red and 3 black balls. Two balls are drawn at random one after the other without replacement. What is the probability that both the balls drawn are black?
• Let A = the first ball is black and B = the second ball is black. Then
p (A and B) = p(A) x p(B/A)
• Probability of drawing a black ball first:
p(A) = 3/(7+3) = 3/10
• Probability of drawing a second black ball, given that the first was black:
p(B/A) = 2/(7+2) = 2/9
• The probability that both balls drawn are black:
p (A and B) = p(A) x p(B/A) = 3/10 x 2/9 = 1/5 x 1/3 = 1/15
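The result for drawing without replacement can be cross-checked by counting every ordered pair of draws. The sketch below is illustrative only; the variable names are assumptions.

```python
from fractions import Fraction
from itertools import permutations

balls = ["red"] * 7 + ["black"] * 3

# Multiplication rule for dependent events.
p_first_black = Fraction(3, 10)              # 3 black balls among 10
p_second_black_given_first = Fraction(2, 9)  # 2 black balls among the remaining 9
p_both_black = p_first_black * p_second_black_given_first

# Cross-check: enumerate all ordered draws of two distinct balls.
draws = list(permutations(range(len(balls)), 2))
both_black = [d for d in draws if balls[d[0]] == "black" and balls[d[1]] == "black"]
p_by_counting = Fraction(len(both_black), len(draws))

print(p_both_black, p_by_counting)
```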
Probability Application

• It is useful for predicting the results in the next generation.
• It helps us to find the probability of genetic diseases like albinism.
• We can also use probability in predicting the ratio of boys and girls.
• It can also be applied in solving Mendel’s problems of heredity.
• It also helps breeders in analyzing pedigrees.
Probability Distribution

When the frequency distribution of a population (and its summary measures, such as central tendency) needs to be derived mathematically rather than from observation, such distributions are called “Probability Distributions” or “Theoretical Distributions”.
They are not obtained by actual observation but are mathematically deduced from certain assumptions based on probability.
• There are three main types of probability distribution which are widely used in different studies. These distributions may be discrete or continuous.
– Discrete Probability Distribution
   • Binomial Distribution
   • Poisson Distribution
– Continuous Probability Distribution
   • Normal Distribution
Binomial Distribution
• It is one of the most widely used probability distributions of a discrete random variable.
• This distribution is also known as the “Bernoulli Distribution”, since it was introduced by the Swiss mathematician J. Bernoulli.
• It applies where only one of two mutually exclusive outcomes is possible, such as success or failure, dead or alive, male or female.
• That is, the binomial distribution describes the distribution of probabilities where there are only two possible outcomes for each trial or experiment.
• If a coin is tossed once, there are two possible outcomes: head or tail.
• The probability of obtaining a head (p) is ½, and the same ½ for a tail (q).
• Thus (p+q) = 1 and the binomial is $(p+q)^n$.
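Example

If two coins are tossed simultaneously, there will be four possible outcomes:

| First Coin | Second Coin | Probability |
|---|---|---|
| H | H | pp = p² |
| H | T | pq |
| T | H | qp |
| T | T | qq = q² |

The two mixed outcomes together give pq + qp = 2pq.
The binomial expansion is $(p+q)^2 = p^2 + 2pq + q^2$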
Assumptions of Binomial Distribution

• Each trial has only two possible outcomes, “success” or “failure”.
• The probabilities of success (p) and failure (q) remain constant for each experiment or trial.
• All trials must be independent of each other. There should not be any relation between two experiments or trials.
Formulation

• In “n” trials, the probability of obtaining “r” successes and (n - r) failures is:

$p(r) = \dfrac{n!}{r!\,(n-r)!} \times p^r q^{n-r}$
where p = probability of success and q = 1 - p = probability of failure
! = factorial, e.g. 5! = 5 x 4 x 3 x 2 x 1

The factorial of 0 is always 1 (0! = 1).
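The binomial formula translates directly into Python using `math.comb` (available from Python 3.8) for the n!/(r!(n-r)!) term. The function name `binomial_pmf` is a hypothetical choice; the two-coin case reproduces the terms p², 2pq and q² with p = q = ½.

```python
from math import comb


def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    if not 0 <= r <= n:
        return 0.0
    q = 1 - p   # probability of failure
    return comb(n, r) * p**r * q ** (n - r)


# Two fair coins: probabilities of 0, 1 and 2 heads.
two_coins = [binomial_pmf(r, 2, 0.5) for r in range(3)]
print(two_coins)

# The probabilities over all possible r add up to 1 (up to rounding).
total = sum(binomial_pmf(r, 10, 0.3) for r in range(11))
print(total)
```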


Poisson Distribution

• It is also a discrete probability distribution and is used very widely.
• It was derived by the Frenchman S. D. Poisson in 1837.
• It applies where the event is very rare, such as death due to a rare disease or the number of defective articles produced by a high quality machine. These are rare events in the sense that the probability of their happening is very small.
In these cases “p” is very small and “n”, the number of trials, is very large, so that “np” is a fixed finite number. The resulting distribution is known as the Poisson distribution.
It has a single parameter, the mean of the distribution, denoted by “m” = np, which remains constant.
Formulation

• Probability of “r” successes:

$p(r) = \dfrac{e^{-m} m^r}{r!}$

where p(r) = probability of r successes
r = 0, 1, 2, 3, …, n successes
m = np = mean of the distribution
e = 2.7183 (constant)
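A minimal Python sketch of the Poisson formula follows. The mean m = 2.0 is a made-up value used purely for illustration, and `poisson_pmf` is a hypothetical name.

```python
from math import exp, factorial


def poisson_pmf(r, m):
    """Probability of exactly r occurrences when the mean is m = np."""
    return exp(-m) * m**r / factorial(r)


m = 2.0   # hypothetical mean number of rare events
probabilities = [poisson_pmf(r, m) for r in range(6)]
print([round(p, 4) for p in probabilities])

# Summed over enough values of r, the probabilities approach 1.
total = sum(poisson_pmf(r, m) for r in range(50))
print(total)
```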
Normal Distribution

• The most important distribution dealing with continuous variables is the Normal Distribution.
• It is also called the Normal Probability Distribution.
• It is extremely useful in the analysis of agricultural and biological data.
• It was first discovered by De Moivre in 1733.
This technique helps us in drawing inferences about the population from the sample.
By this method we get a “curve” with a single peak and with items evenly distributed on either side of the peak. Such a curve, with important statistical properties, is called the “Normal Distribution Curve”; it denotes a normally distributed population.
Importance of the Normal Distribution

• In most biological analyses, values are often distributed in accordance with the normal distribution. As the sample size increases, the distribution of the mean of a random sample approaches the normal distribution.
• In large samples, it serves as a good approximation of discrete distributions such as the Binomial and Poisson.
Properties of Normal Distribution

• The normal curve is “bell shaped” and symmetrical in appearance, having a single peak.
• The mean of a normally distributed population lies at the centre of its normal curve.
• The mean, median and mode are all equal in a normal distribution.
• The height of the curve declines on either side of the peak, which occurs at the mean.
• The two tails never touch the base.
Formulation

Normal Distribution (for a sample): $z = \dfrac{x - \bar{x}}{s}$

where z = number of standard deviations between x and the mean
x = value of the random variable
$\bar{x}$ = mean of the distribution
s = standard deviation of the distribution
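The z value can be sketched in Python using the standard `statistics` module, whose `stdev` uses the n - 1 divisor. The data set is the one from the mean example, reused here as an assumption for illustration; `z_score` is a hypothetical name.

```python
from statistics import mean, stdev


def z_score(x, data):
    """Number of sample standard deviations between x and the sample mean."""
    return (x - mean(data)) / stdev(data)


data = [12, 15, 11, 11, 7, 13]   # mean 11.5, variance 7.1

z_high = z_score(15, data)      # above the mean, so positive
z_low = z_score(7, data)        # below the mean, so negative
z_centre = z_score(11.5, data)  # exactly at the mean

print(round(z_high, 2), round(z_low, 2), z_centre)
```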
Correlation

Correlation was first investigated by Sir Francis Galton.
Karl Pearson introduced a method of assessing correlation by means of the coefficient of correlation.
By this coefficient, we can measure the extent of the relationship between two sets of data.
Correlation measures the closeness of the relationship between two variables.
Example: heights of husbands and wives; 100-seed weight.
Such sets of variables may or may not show a relationship. When both variables move together, we say they are related.
If a relationship persists, it has to be expressed quantitatively, showing the degree of association between the sets of variables.

The statistical tool with the help of which this relationship between two variables is studied is called “Correlation”.

That is, the term correlation refers to the study of the relationship between two variables.
Reasons behind correlation

• The correlation may be due to pure chance.
• Influence of some external factor on both variables.
• Influence of the two variables on each other (mutual influence).
• Influence of one variable upon the other.
Types of Correlation

• Positive / Negative correlation
• Simple / Partial / Multiple correlation
• Linear / Non-linear correlation
Methods of studying Correlation

• Scatter diagram method
• Graphical method
• Correlation coefficient
Correlation Coefficient

The first two methods do not provide any numerical measure of correlation.
The degree of relationship can be established by calculating a coefficient called the Correlation Coefficient, which always gives a quantitative measure of the degree of closeness between the two attributes. Karl Pearson developed this theory, so it is also called the “Pearsonian Coefficient of Correlation”, denoted by “r”.
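The slides do not write out the formula for r. The sketch below assumes the standard product-moment form, r = ∑(x - x̄)(y - ȳ) / √(∑(x - x̄)² ∑(y - ȳ)²), and uses small made-up data sets purely for illustration; `pearson_r` is a hypothetical name.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's coefficient of correlation between two paired series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two paired series with at least two items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical paired observations (not from the lecture).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
r_perfect = pearson_r(xs, [2 * x + 1 for x in xs])   # exact straight line
r_inverse = pearson_r(xs, [10 - x for x in xs])      # exact falling line
print(round(r, 3), r_perfect, r_inverse)
```

The value of r always lies between -1 and +1; values near +1 or -1 indicate a close positive or negative linear relationship.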
Regression

Regression analysis is concerned with measuring the probable form of the relationship between two variables.
The term was first used by Sir Francis Galton while studying the relationship between the heights of fathers and sons.
The method which helps us to estimate the unknown value of one variable from the known value of a related variable is called regression.
• Galton studied the average relationship between two variables graphically and called the line describing the relationship the line of regression.
• The regression technique is applicable only where two or more related variables have a tendency to go back to the mean.
Test of Significance

• Two samples drawn from the same population will show differences in their mean values. This difference between the samples can be reduced but cannot be eliminated. A procedure to assess the significance of this difference is known as a “Test of Significance”.
• It helps us to determine whether the observed differences between two samples are due to chance or are really significant.
Procedure for significance test

• Laying down of hypotheses
– Null hypothesis
– Alternative hypothesis
• Level of significance
• One- or two-tailed hypothesis
Good Luck !
