
Basic Statistics for

Life Sciences

Surya Prasad Paudel


Basic Statistics for Life
Sciences

TABLE OF CONTENTS

INTRODUCTION AND IMPORTANCE, OBSERVATION AND DATA RECORDING

FREQUENCY DISTRIBUTION

MEASURES OF CENTRAL TENDENCY

MEASURES OF DISPERSION

ELEMENTARY PROBABILITY

NORMAL DISTRIBUTION

INTRODUCTION TO STATISTICAL HYPOTHESIS TESTING

STUDENT'S T-TEST

ANALYSIS OF VARIANCE

SIMPLE LINEAR REGRESSION

SIMPLE LINEAR CORRELATION

SP Paudel Page
Introduction and importance, Observation and data
recording

Definition:

Statistics refers to the analysis and interpretation of data with a view toward
objective evaluation of the reliability of the conclusions based on the data.
Statistics applied to biological problems is simply called biostatistics or biometry.

The primary objective of statistics is to infer the characteristics of a group of data
by analyzing the characteristics of a small sample drawn from the group.

Descriptive statistics: Once data are obtained, they are organized and
summarized in such a way as to arrive at an orderly and informative
presentation. Such procedures are termed descriptive statistics.

e.g. Mean, median, variance and standard deviation

Variable:
A characteristic that may differ from one biological entity to another is termed a
variable.

Types of biological data

There are four types of data:

Ratio:
Measurement scales having a constant interval size and a true zero point are said
to be ratio scales, e.g. weight (mg, lb), volume (cc, cubic ft), etc.

Interval:
Some measurement scales possess a constant interval size but not a true zero;
they are called interval scales, e.g. the two common temperature scales, whose
zero points are arbitrary.

Ordinal:
Data consisting of an ordering or ranking of measurements are said to be on an
ordinal scale, e.g. one biological entity being shorter, darker, faster or more
active than another.

Nominal scale:

Variables that are classified by some quality rather than by numerical
measurement are said to be on a nominal scale.

Continuous and discrete data

A variable that can take any conceivable value within an observed range is called
a continuous variable, e.g. height.
A variable that can take on only certain values is called a discrete variable, e.g.
number of leaves, litter size. It is also called a meristic variable.

Importance and scope of statistics

Statistics is a science that helps us to collect data systematically and
scientifically. It provides the tools and techniques to handle and analyze
data and to draw inferences and conclusions about a population.
Statistics is applicable in every field of study, from economics to biology and
from social science to astronomy. Statistics is useful in planning because it helps
to analyze data. Various economic problems can be solved by using statistics;
moreover, it facilitates the development of economic models. Statistics is useful
in any field of science.

Observation and data recording

The first step in learning from data is to set the objective for which the
data are collected. The design of data collection is crucial and includes the
following steps:
 Specifying the objective of the study
 Identifying the variables of interest
 Choosing the appropriate design for the survey or scientific study
 Collecting the data
The objective of the study is derived from the statement of the problem, and the
variables of interest are identified based on the objective. Once the objective and
variables are identified, a proper method for the collection of data is required.
Data collection processes include surveys, experiments, and the examination of
existing data from records, censuses and previous studies. The theory of sample
surveys and the theory of experimental design provide excellent methods for data
collection. Surveys are usually passive: the goal of a survey is to gather data on
existing conditions, behaviors and attitudes. Experiments and scientific
studies are more active: experimental conditions are varied to study the effect
of those conditions on the outcome of the experiment. In most scientific
experiments, as many as possible of the factors that affect the measurements are
under the control of the experimenters.

Survey:

Information from surveys affects almost every aspect of daily life. A crucial
element in any survey is the process by which the sample is selected from the
population. If a surveyor selects a sample for his or her convenience, there may
be bias in the sample survey. Because, in that case, the sample may not represent
the population properly, the statistics cannot precisely reflect the population
from which the sample is drawn. The following are the methods of drawing
samples from a population.

1. Simple random sampling
2. Stratified random sampling
3. Cluster sampling
4. Systematic sampling
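As a rough sketch (the sampling frame, sample size and seed are invented for illustration), the first and last of these schemes look like this in Python:

```python
import random

# Hypothetical sampling frame: 100 numbered units (invented for illustration).
population = list(range(1, 101))
random.seed(42)  # fixed seed so the sketch is reproducible

# 1. Simple random sampling: every subset of size 10 is equally likely.
srs = random.sample(population, 10)

# 4. Systematic sampling: pick a random start, then take every k-th unit.
k = len(population) // 10           # sampling interval
start = random.randrange(k)
systematic = population[start::k]

print(sorted(srs))
print(systematic)
```

Stratified and cluster sampling follow the same idea, but the random draw is made within each stratum or over whole clusters rather than over individual units.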

Data collection procedure

In a survey, once the samples are finalized, the next step is the collection of data
from those samples. The most common methods of data collection are

1. Personal interview
2. Telephone interview
3. Self administered questionnaires
4. Direct observation

Scientific studies

For scientific studies, there are various methods of experimental design under
which experiments are carried out. The major objective of these designs is to
minimize experimental error without introducing bias.
The following are examples of experimental designs:
I. Completely randomized design
II. Randomized block design
III. Factorial design
IV. Latin square design
V. Split plot design

Frequency distribution

When measurements on a variable have been collected, they are organized,
displayed and examined in various ways. One of the first methods used to
examine the data is the frequency distribution. It shows the pattern of the
variable of interest: how frequently each value occurs in the observations. When
observations, whether discrete or continuous, are available on a characteristic
from a large number of individuals, it becomes necessary to condense the data
into small groups without losing information. If we collected the weight of all
students from HICAST, we would have more than 1000 observations. Unless we
condensed the data on weight, it would be very difficult to give any information
about the weight of students. If we simply arranged the weights in ascending
order, we could at least state the minimum and maximum weight of the HICAST
students. Arranging data in a frequency table is the first step in exploring the
data. In this case, the weight of a student is the variable, and how often a value
is observed in the experiment is its frequency.
After collecting and summarizing a large amount of data, it is useful to arrange
them in the form of a frequency table. A frequency table consists of a list of all
the observed values of the variable under study in one column and how many
times each value is observed in another. The total observations are first
categorized into different categories. The distribution of the total number of
observations among the various categories is termed a frequency distribution.

Frequency table can be made for nominal as well as scale data.

Example: Frequency table of nominal data

The location of sparrow nests

Nest site Number of nest observed

A. Vine (V) 56
B. Building cave (B) 60
C. Low tree branches (LTB) 46
D. Tree and building cavities (TBV) 49

[Bar chart: counts of sparrow nests by nest site (B, LTB, TBV, V)]
Preparing a frequency table for discrete data and for continuous data follows
different procedures. For discrete data, each observed value can be a category,
and the number of times it is observed is taken as its frequency. Observations
can also be grouped into classes, with the number of observations falling under
each class taken as that class's frequency.

Example: a frequency table of discrete data.

Litter size  Frequency

3   10
4   27
5   22
6    4
7    1

[Bar chart: counts of litter sizes 3-7]

If the data create a long frequency table, it is common practice to group the data
and cast them in a grouped frequency table. Such grouping results in a loss of
information but is easier to read. There are several rules for deciding how many
groups should be made for a particular data set, but it is established that the
class intervals of the variable being measured should be of equal size.
In the case of continuous data, one always deals with a frequency distribution
tabulated by groups. For a continuous data set, the frequency distribution can be
displayed by a histogram.
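Such a grouped table can be sketched in a few lines of Python; the data and class width here are invented for illustration, not taken from the aphid example:

```python
from collections import Counter

# Invented observations; a class width of 4 gives equal-sized intervals.
data = [2, 5, 7, 7, 8, 11, 12, 12, 15, 18, 21, 23, 23, 24, 30]
width = 4

# Each observation falls in the class whose lower limit is a multiple of the width.
groups = Counter((x // width) * width for x in data)

for lower in sorted(groups):
    print(f"{lower}-{lower + width - 1}: {groups[lower]}")
```

The same grouping rule (equal interval width, every observation in exactly one class) is what underlies the aphid table that follows.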

Number of aphids observed per clover plant

Aphids per plant  Plants observed | Aphids per plant  Plants observed
0 3 20 17
1 1 21 18
2 1 22 23
3 1 23 17
4 2 24 19
5 3 25 18
6 5 26 19
7 7 27 21
8 8 28 18
9 11 29 13
10 10 30 10
11 11 31 14
12 13 32 9
13 12 33 10
14 16 34 8
15 13 35 5
16 14 36 4
17 16 37 1
18 15 38 2
19 14 39 1
40 0
41 1

A frequency table grouping the data

Number of aphids per plant  Number of plants observed
0-3 6
4-7 17
8-11 40
12-15 54
16-19 59
20-23 75
24-27 77
28-31 55
32-35 32
36-39 8
40-43 1

Pie chart:

It is used to display the percentage of the total number of measurements
falling into each category of the variable, as in the example given above.

[Pie chart: proportion of nests at sites B, LTB, TBV and V]

Histogram:

This graphical representation of data is applicable only to scale variables. The
horizontal axis is labeled with the class intervals of the variable and the vertical
axis with the frequency.


Measures of Central Tendency

Before we enter the measures of central tendency, let's define a few terms used
in statistics.

Population:

The basic interest of statistical analysis is to draw a conclusion about a group of
measurements of the variable studied. The entire collection of measurements
about which one wishes to draw a conclusion is the population. For example, if
we study the somatic cell count in the milk of Murrah buffalo from Chitwan
valley, then the somatic cell count per ml of milk from the entire Murrah buffalo
population of Chitwan valley is the population.

Sample from the population:

If the population is very large, it is not practical to take measurements of all
individuals. Samples are taken from the population so that the conclusions from
the sample study can be used to make inferences about the population. A sample
is a subset of the measurements of the population. Conclusions about the
population can be drawn from the characteristics of the sample drawn from it.

Parameter:

A quantity is called a parameter when it describes the population. A population
parameter is estimated by a statistic: μ is a population mean, which is estimated
by the statistic known as the sample mean (x̄).

Measures of central tendency:

In samples, as in populations, one finds a predominance of values somewhere
around the middle of the range of observed values. The description of this
concentration of values near the middle is an average, or measure of central
tendency. It is also called a measure of location. There are five measures of
central tendency in common use:
i. Arithmetic mean or mean
ii. Geometric mean
iii. Harmonic mean
iv. Median
v. Mode

These measures of central tendency describe the properties of a population.

A. Mean
a. Arithmetic mean:
The arithmetic mean of a set of observations is the value obtained by dividing
their sum by the number of observations. If there are n observations x1, x2,
x3, …, xn, the arithmetic mean (x̄) of these observations is

Mean x̄ = (x1 + x2 + … + xn)/n = (1/n) Σ xi, summed over i = 1 to n

In the case of an individual series, mean x̄ = (1/n) Σ xi.
In the case of a discrete series, when the variable takes values x1, x2, …, xn with
frequencies f1, f2, …, fn, the mean is calculated as

x̄ = (f1x1 + f2x2 + … + fnxn)/(f1 + f2 + … + fn) = (1/N) Σ fixi, where N = Σ fi

For a continuous series, the value of x is the mid-value of the corresponding
class. If the values of f and/or x are large, calculation using the formula given
above is tedious and time consuming. Two alternatives are:
i. the short-cut method, and
ii. the deviation method.
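As a quick check of the discrete-series formula, the litter-size table given earlier can be worked through in Python:

```python
# Discrete series: litter sizes x with frequencies f, from the litter-size table.
x = [3, 4, 5, 6, 7]
f = [10, 27, 22, 4, 1]

N = sum(f)                                    # N = f1 + f2 + ... + fn = 64
mean = sum(fi * xi for fi, xi in zip(f, x)) / N

print(mean)  # 4.359375
```

So the average litter size in that table is about 4.36, which agrees with the visual impression that most litters have 4 or 5 young.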

Characteristics of mean

1. It is the arithmetic average of the measurements in a data set.
2. There is only one mean for a data set.
3. Its value is influenced by extreme measurements (outliers); trimming can help
to reduce the degree of influence.
4. Means of subsets can be combined to determine the mean of the complete
data set.
5. It is applicable to quantitative data only.

Merits and demerits of arithmetic mean:

Merits
i. It is rigidly defined.
ii. It is easy to understand and easy to calculate.
iii. It is based on all the observations.
iv. Of all the averages, the arithmetic mean is affected least by fluctuations of
sampling.

Demerits
i. It cannot be determined by inspection, nor can it be located graphically.
ii. The arithmetic mean cannot be used if we deal with qualitative
characteristics.
iii. The arithmetic mean cannot be calculated if a single observation is missing
or lost, unless the arithmetic mean of the remaining observations is computed.
iv. The arithmetic mean is affected very much by extreme values.
v. In extremely asymmetrical distributions, the arithmetic mean is not a
suitable measure of location.

b. Geometric mean:

The geometric mean of a set of n observations is the nth root of their product.
Thus the geometric mean G of n observations xi, where i = 1, 2, 3, …, n, is given
by the following equation:

G = (x1.x2.x3 … xn)^(1/n)

log G = (1/n)(log x1 + log x2 + … + log xn)

log G = (1/n) Σ log xi

G = antilog[(1/n) Σ log xi]

In the case of a frequency distribution, the geometric mean is given by

G = antilog[(1/N) Σ fi log xi], where N = Σ fi
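A small sketch comparing the product form and the log form (the data values are invented for illustration):

```python
import math
from statistics import geometric_mean

data = [2.0, 8.0]  # invented values

# Product form: nth root of the product.
g_product = math.prod(data) ** (1 / len(data))

# Log form: antilog (exp) of the mean of the (natural) logs.
g_log = math.exp(sum(math.log(x) for x in data) / len(data))

print(g_product, g_log, geometric_mean(data))  # all 4.0
```

Both forms agree, and they match the library routine; the log form is the one used in practice because it avoids overflow when the product of many observations is large.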
Merit and demerit of geometric mean

Merits
1. It is rigidly defined
2. It is based on all observations
3. It is not affected by fluctuations of sampling

Demerits
1. The geometric mean is not easy to understand.
2. If any one of the observations is zero, the geometric mean becomes zero.

B. Median

The median of a distribution is the value of the variable which divides the
distribution into two equal parts when the observations are arranged in order. It
is the value such that the number of observations above it is equal to the number
of observations below it. Thus the median of a set of observations is defined as
the middle value when the observations are arranged from lowest to highest or
from highest to lowest.

The sample median is the best estimate of the population median. In a
symmetrical distribution, the sample median is also an unbiased estimate of the
population mean; however, the sample mean is more efficient than the median.

The median of a data set can be found by first arranging the measurements in
order of magnitude, either ascending or descending. In the case of an individual
series, the median is

Median = X((n+1)/2)

where n is the number of measurements.

If there are 11 measurements X1, X2, …, X11, then after arranging them in
ascending or descending order, the median will be X6:

Median = X((11+1)/2) = X6

If the sample size is odd, the subscript will be an integer and will indicate the
datum which is the middle measurement in the ordered sample.

If the sample size is even, the subscript will be a half-integer. There is then no
single middle value in the ordered data set; there are two middle values, and
the median is the midpoint between them.

If there are 12 observations in a data set, X1, X2, …, X12, then

Median = X((12+1)/2) = X(6.5)

Thus the median is the midpoint of X6 and X7.

The median has the same unit as each observation. If the data are plotted in a
frequency histogram, the median is the value of X that divides the area of the
histogram into two equal parts.

The estimation of the median for grouped data is different. We cannot use the
formula given above, because for grouped data we know the class interval
containing the median but cannot locate the median within this particular
interval. The following formula is used for grouped data:

Median = L + (w/fm)(0.5n − cfb)

Where,
L = lower class limit of the interval that contains the median
n = total frequency
cfb = cumulative frequency for all classes before the median class
fm = frequency of the class interval containing the median
w = interval width
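The grouped-data formula can be sketched as a small function; the class limits and frequencies below are invented for illustration:

```python
def grouped_median(lower_limits, freqs, width):
    """Median = L + (w / f_m) * (0.5*n - cf_b), as in the formula above."""
    n = sum(freqs)
    cum = 0  # cumulative frequency of all classes before the current one
    for L, f in zip(lower_limits, freqs):
        if cum + f >= 0.5 * n:       # this class contains the median
            return L + (width / f) * (0.5 * n - cum)
        cum += f

# Invented classes 0-9, 10-19, 20-29 with frequencies 4, 10, 6 (n = 20).
print(grouped_median([0, 10, 20], [4, 10, 6], 10))  # 16.0
```

Here the median class is 10-19 (cumulative frequency passes 0.5n = 10 inside it), so the formula interpolates within that interval.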
Characteristics of median
1. It is a central value.
2. Only one median value can be found for a data set.
3. It is not influenced by extreme values (outliers).
4. Medians of subsets cannot be combined to determine the median of the entire
data set.
5. For grouped data, its value is rather stable even when the data are organized
into different categories.
6. It is also applicable to quantitative data.

Merits of median
1. It is rigidly defined.
2. It is easily understood and easy to calculate.
3. It is not at all affected by extreme measurements.
4. It can be calculated for distributions with open-end classes.

Demerits of median
1. In the case of an even number of observations, the median cannot be
determined exactly.

C. Mode:

The mode is the value which occurs most frequently in a set of observations. It
is the measurement of relatively great concentration in the distribution. Some
distributions may have more than one such point of concentration. A
distribution in which each different measurement occurs with equal frequency
is said to have no mode. A distribution with two modes is called bimodal. The
mode is affected less by skewness than the mean and the median. In the case of
a discrete frequency distribution, the mode is the value of x corresponding to
the maximum frequency. But in any of the following cases, the value of the
mode is determined by the method of grouping:
I. If the maximum frequency is repeated
II. If the maximum frequency occurs at the very beginning or at the end
of the distribution
III. If there are irregularities in the distribution
In the method of grouping, six different columns are used to calculate
frequencies by grouping the observations. The original frequencies of the
observations are listed in the first column. The second column is obtained by
combining the frequencies two by two. Leaving out the first frequency and
combining the remaining frequencies two by two gives the third column. The
fourth column is obtained by combining frequencies three by three. Combining
frequencies three by three after leaving out the first frequency gives the fifth
column, and after leaving out the first two frequencies, the sixth column. After
making the six columns using the method given above, the maximum frequency
in each column is found, and the measurements contributing to the maximum
frequency in each column are counted. The measurement which appears most
frequently is the mode of the distribution.

In the case of a continuous frequency distribution, the mode is determined by
the formula

Mode = l + h(f1 − f0)/(2f1 − f0 − f2)

Where,
l = lower limit of the modal class
h = magnitude (width) of the class interval
f1 = frequency of the modal class
f0 = frequency of the class preceding the modal class
f2 = frequency of the class succeeding the modal class
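A minimal sketch of this formula (the modal-class values below are invented for illustration):

```python
def grouped_mode(l, h, f1, f0, f2):
    """Mode = l + h*(f1 - f0) / (2*f1 - f0 - f2), per the formula above."""
    return l + h * (f1 - f0) / (2 * f1 - f0 - f2)

# Invented example: modal class 10-20 (f1 = 12), preceded by f0 = 8,
# followed by f2 = 6.
print(grouped_mode(l=10, h=10, f1=12, f0=8, f2=6))  # 14.0
```

The result lies inside the modal class, pulled toward whichever neighbouring class has the larger frequency.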

Characteristics of mode
1. It is the most frequent measurement in the data set.
2. There can be more than one mode for a data set.
3. It is not influenced by extreme measurements.
4. Modes of subsets cannot be combined to determine the mode of the
complete data set.
5. For grouped data, its value can change depending on the categories used.
6. It is applicable to both qualitative and quantitative data.

Merits of mode
1. The mode is readily comprehensible and easy to calculate.
2. The mode is not affected by extreme values.

Demerits of mode
1. The mode is ill defined. It is not always possible to find a clearly defined
mode; some distributions are bimodal or multimodal.
2. It is not based on all the observations.
3. The mode is affected to a relatively great extent by fluctuations of sampling.

Measures of dispersion

Previously we discussed the measures of central tendency, which give us an idea
of the concentration of the observations about the central part of the
distribution. Information on the measures of central tendency alone cannot give
a complete picture of a distribution. A measure of dispersion is another way to
understand the nature of the distribution, and it complements the measures of
central tendency. A measure of dispersion is an indication of the spread of
measurements around the center of the distribution. The following are the
measures of dispersion:
I. Range
II. Quartile deviation
III. Mean deviation
IV. Standard deviation and variance
V. Coefficient of variation

Range
The range is the difference between the highest and lowest measurements in a
group of data. If the sample measurements are arranged in ascending order of
magnitude, then the range is given as

Range = Xn − X1

The range is a simple and crude measure of dispersion. It does not take into
account any observations except the highest and lowest values. It is based on
two extreme values, which are subject to chance fluctuations in sampling. It is
also unlikely that the highest and lowest measurements in the population will
occur in the sample.

Quartile deviation

The range is a biased and inefficient estimate, being very sensitive to only two
measurements, the largest and smallest observations. The quartile deviation is
another measure of dispersion that gives more reliable information than the
range. The quartile deviation, or semi-interquartile range, is given by

Semi-interquartile range = (Q3 − Q1)/2
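A small sketch of the semi-interquartile range using Python's statistics module; the data are invented, and note that quartile conventions vary between textbooks and software:

```python
from statistics import quantiles

data = list(range(1, 12))  # invented ordered measurements 1..11

q1, q2, q3 = quantiles(data, n=4)  # quartiles Q1, Q2, Q3 (default method)
semi_iqr = (q3 - q1) / 2

print(q1, q3, semi_iqr)  # 3.0 9.0 3.0
```

Unlike the range, this measure ignores the extreme quarter of the data at each end, so a single outlier leaves it unchanged.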

Mean deviation

The range is not a good estimate for understanding a distribution. Among the
measures of central tendency, the mean was found to be useful, and a measure
of dispersion will also be useful if it is expressed in terms of deviation from the
mean. The sum of all deviations from the mean, i.e. Σ(xi − x̄), will always be
zero, so that sum itself cannot measure dispersion. The sum of the absolute
values of the deviations from the mean, however, gives a quantity that can be
used as the dispersion from the mean. Thus, the mean deviation is the average
of the absolute values of the deviations from the mean:

sample mean deviation = Σ|xi − x̄| / n

For a frequency distribution,

sample mean deviation = (1/n) Σ fi|xi − x̄|
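Worked through on a small invented data set, the definition looks like this:

```python
# Invented observations.
data = [4, 7, 9, 10, 15]
mean = sum(data) / len(data)          # x̄ = 9.0

# Mean deviation: the average of the absolute deviations from the mean.
mean_dev = sum(abs(x - mean) for x in data) / len(data)

print(mean_dev)  # 2.8
```

Note that without the absolute value the deviations (−5, −2, 0, 1, 6) sum to zero, exactly as stated above.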

Standard deviation and the variance

The standard deviation is the positive square root of the arithmetic mean of the
squares of the deviations of the given values from their mean. The standard
deviation of a population is denoted by the Greek letter σ and is given by the
following formula:

σ = √[(1/N) Σ (xi − μ)²]

The best estimate of the population standard deviation is the sample standard
deviation, which is given by the following formula:

s = √[(1/(n−1)) Σ (xi − x̄)²]

For a frequency distribution, the population standard deviation is

σ = √[(1/N) Σ fi(xi − μ)²]

and the sample standard deviation is estimated by

s = √[(1/(n−1)) Σ fi(xi − x̄)²]

The square of the standard deviation is the variance. The sample variance is
estimated by

s² = (1/(n−1)) Σ fi(xi − x̄)²
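The sample formulas can be checked against Python's statistics module on a small invented data set:

```python
from statistics import stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]       # invented sample
n = len(data)
mean = sum(data) / n                   # x̄ = 5.0

# Sample variance and standard deviation use the n - 1 divisor.
s2 = sum((x - mean) ** 2 for x in data) / (n - 1)
s = s2 ** 0.5

print(round(s2, 4), round(variance(data), 4))  # 4.5714 4.5714
```

The library routines `variance` and `stdev` use the same n − 1 divisor, so the hand calculation and the library agree.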

Coefficient of variation

The coefficient of variation is given by

C.V. = (s/x̄) × 100

The C.V. is the standard deviation expressed as a percentage of the mean. We
calculate the C.V. to compare the variability of two series: the series having the
greater C.V. is said to be more variable, and the series with the lesser C.V. is
said to be more consistent.
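A sketch comparing two invented series with the same standard deviation but different means:

```python
from statistics import mean, stdev

# Two invented series with the same spread but different means.
series_a = [10, 12, 14, 16, 18]        # mean 14
series_b = [100, 102, 104, 106, 108]   # mean 104, same standard deviation

def cv(data):
    """C.V. = (s / mean) * 100: the SD as a percentage of the mean."""
    return stdev(data) / mean(data) * 100

print(round(cv(series_a), 2), round(cv(series_b), 2))  # 22.59 3.04
```

Both series have standard deviation √10, yet series_b has the much smaller C.V., so it is the more consistent series relative to its mean.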

Elementary Probability

In statistical procedures, statements are made about a population based on the
analysis of a sample. First, the sample data are described graphically and with
other descriptive techniques. This is just a method of summarizing and
describing the sample. We then need to assess the degree of accuracy with
which the sample statistics represent the population. Probability plays a role in
moving from sample results to conclusions about the population. Thus
probability is a tool that enables us to make inferences.

Basic terms used in probability theory

Random experiment

A random experiment is an experiment whose results cannot be predicted with
certainty. The result may be any one of various possible outcomes. For
example, tossing a coin, tossing two coins, rolling (throwing) an unbiased die,
throwing two unbiased dice, and drawing a card from a well-shuffled pack of
52 cards are examples of random experiments.

Trial and event

Performing a random experiment is called a trial, and a result or outcome of a
trial is called an event.
Examples
a) Tossing a coin is a trial, and getting ‘a head’ or ‘a tail’ is an event.
b) Throwing a die is a trial, and getting face 1 (or 2 or 3 or 4 or 5 or 6) is an
event.
c) Drawing a card from a pack of well-shuffled cards is a trial, and getting a
spade or a club is an event.
d) Tossing two coins is a trial, and getting both heads is an event.

An event is said to be simple if it corresponds to a single possible outcome of
an experiment. The joint occurrence of two or more events at a time is called a
compound or composite event.

Sample Point and Sample space

A possible result of a random experiment is known as a sample point or an
elementary event or an event or an outcome. The set of all possible outcomes of
a random experiment is known as a sample space and is denoted by S.
For example, if e1, e2, e3, . . ., en are n mutually exclusive outcomes or
sample points of a random experiment, then set
S = {e1, e2, . . ., en}
is the sample space of the random experiment.
The elements of sample space S have the following properties:
a) Each element ei of the sample space is an outcome or a sample point of
the random experiment.
b) Any result (outcome) of a trial in a random experiment corresponds to one
and only one element of sample space S.
Examples
i) In throwing a die, the sample space is
S = {1, 2, 3, 4, 5, 6}
ii) In tossing two fair coins, the sample space is
S = {HH, HT, TH, TT}
iii) In throwing two dice, the sample space is
S = {(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (2, 1) . . . (6, 1)
(6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}.
                        2nd die
             1      2      3      4      5      6
1st die  1  (1,1)  (1,2)  (1,3)  (1,4)  (1,5)  (1,6)
         2  (2,1)  (2,2)  (2,3)  (2,4)  (2,5)  (2,6)
         3  (3,1)  (3,2)  (3,3)  (3,4)  (3,5)  (3,6)
         4  (4,1)  (4,2)  (4,3)  (4,4)  (4,5)  (4,6)
         5  (5,1)  (5,2)  (5,3)  (5,4)  (5,5)  (5,6)
         6  (6,1)  (6,2)  (6,3)  (6,4)  (6,5)  (6,6)

Equally likely Events

The events are said to be equally likely events if they have equal chance of
occurrence in a random experiment.

For example,
a) In tossing of an unbiased or uniform coin, head (H) and tail (T) are equally
likely events.

b) In throwing of an unbiased or regular die, all the six faces are equally likely
events.

Mutually Exclusive Events

The events are said to be mutually exclusive if the occurrence of one event
excludes the occurrence of other event(s) at a time.
Thus, for mutually exclusive events, the success of an event necessitates the
failure of others in the same experiment. For example;
a) In throwing of a die, the six faces numbered 1, 2, . . . 6 are mutually
exclusive.
b) In tossing of a coin, head (H) and tail (T) are mutually exclusive.
c) Selecting a jack and selecting a spade are not mutually exclusive.

Independent and dependent events

Events are said to be independent of each other if the occurrence of one event
does not affect the occurrence of the other.
For example, in tossing two unbiased coins, the event of getting a head or tail
on one coin does not affect the event of getting a head or tail on the second coin.
Two events are said to be dependent if the occurrence of one event affects the
occurrence of the other.
For example, suppose that from a deck of 52 cards, two cards are drawn in
succession without replacement. If we would like to draw a heart on both draws,
the number of hearts available in the deck for the second draw depends on
whether the first draw is or is not a heart. Hence the second drawing is
dependent on the first.

Exhaustive cases

All possible outcomes of a random experiment are called exhaustive events or
cases. The number of exhaustive cases is denoted by n.
For example,
a) In tossing a fair coin, the exhaustive cases are head (H) and tail (T), so n = 2.
b) In tossing two fair coins at the same time, the exhaustive cases are HH, HT,
TH and TT, so n = 2² = 4.
c) In tossing three fair coins at the same time, the exhaustive cases are HHH,
HHT, HTH, THH, HTT, THT, TTH and TTT, so n = 2³ = 8.
d) In tossing n coins, the number of exhaustive cases is 2ⁿ.
e) In throwing a die, the exhaustive cases are faces 1, 2, 3, 4, 5 and 6, so n = 6.
f) In throwing two dice at the same time, there are 6² = 36 exhaustive cases.
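These counts can be checked by enumerating the outcomes directly (a sketch using itertools):

```python
from itertools import product

# Exhaustive cases enumerated directly.
coins3 = list(product("HT", repeat=3))        # 2**3 outcomes for 3 coins
dice2 = list(product(range(1, 7), repeat=2))  # 6**2 outcomes for 2 dice

print(len(coins3), len(dice2))  # 8 36
```

Enumerating the sample space like this also makes it easy to count favorable cases by filtering, which is the subject of the next section.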

Favorable cases

Successes of an event, or the outcomes in favor of an event, are called the
favourable cases of the event. The number of favourable cases is denoted by m.
For example,
a) In tossing two coins, the cases favourable to the event “getting exactly one
head” are HT and TH, so m = 2.
b) In throwing two dice, the cases favourable to the event “getting the sum 6”
are (1, 5), (2, 4), (3, 3), (4, 2) and (5, 1), so m = 5.

Probability

If an event can happen in m ways out of n equally likely, mutually exclusive and
exhaustive possibilities, i.e. for an event A the number of favorable cases is m
and n is the number of all possible outcomes, then the probability of the
occurrence of the event A is defined as the ratio of the number of favorable
cases to the number of all possible outcomes.
Thus,

Probability of event A = (number of favorable cases) / (number of all possible
outcomes) = m/n

In other words, if P(A) is the probability of the occurrence of the event A, then
P(A) = m/n.
Theorems of Probability
There are two theorems (or laws) of probability:
1. Additive law of probability (or total probability law)
2. Multiplicative law of probability (or Compound probability law).

Additive law of probability

If A and B are any two events of a sample space which are disjoint or mutually
exclusive, then
P (A or B) = P (A ∪ B) = P (A) + P (B)
For three mutually exclusive or disjoint events A, B and C of a sample space,
P (A or B or C) = P (A ∪ B ∪ C) = P (A) + P (B) + P (C)

Multiplicative law of probability

If A and B are two independent events, then the probability of their occurring
(happening) together is equal to the product of their respective probabilities:
P (A and B) = P (A ∩ B) = P (A) . P (B)
For three independent events A, B and C,
P (A and B and C) = P (A ∩ B ∩ C) = P (A) . P (B) . P (C)
Example
A box contains 2 red, 3 white and 4 black balls. A person draws one ball from
the box at random. Find the probability that the drawn ball is red or white.

Solution
Total number of balls = 2 + 3 + 4 = 9
Total number of possible outcomes (n) = 9C1 = 9
One ball is drawn at random from the box.
Let ‘A’ denote the event of drawing a red ball. Since there are 2 red balls in the
box, the number of favorable cases is m = 2C1 = 2, so
P (A) = 2/9
Again, let ‘B’ denote the event of drawing a white ball. Since there are 3 white
balls, the number of favorable cases is m = 3C1 = 3, so
P (B) = 3/9
A and B are two mutually exclusive events, so the required probability of
getting a red or a white ball can be computed by using the addition theorem of
probability:
P (A or B) = P (A) + P (B) = 2/9 + 3/9 = 5/9
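Checking the stated problem (2 red, 3 white, 4 black) with Python's exact fractions:

```python
from fractions import Fraction

n = 2 + 3 + 4                  # 9 exhaustive cases
p_red = Fraction(2, n)         # 2 favourable cases for red
p_white = Fraction(3, n)       # 3 favourable cases for white

# Red and white are mutually exclusive, so the additive law applies.
print(p_red + p_white)  # 5/9
```

Using Fraction avoids any floating-point rounding, so the answer comes out exactly as the ratio of favourable to exhaustive cases.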

Normal distribution

Continuous variable are those variable that flows without any break from one
individual to the next. The continuous variable ie interval and scale data follows
many distributions however one of the distribution is characterized generally that
many observations falls around the means and fewer observations are towards
the extremes . If the number of observations is large, the frequency of such
variable give a bell shaped curve which is called normal curve. Many variables of
the interest have frequency distribution which can be approximated by using the
normal curve. Such a distribution is called normal distribution. It means that
normal distribution of random variable gives the normal curve. For example, milk
yield for cattle of a particular breeds.
Since normal distribution has been tabulated, area under the normal curve can be
used to approximate the probabilities associated with the variables of the interest
in the experiment. The normal random variable and its associated distribution
play an important role in statistical inference. The relative frequency histogram
for the normal random variable is called the normal curve or normal probability
distribution. This distribution is one in which the height of the curve at xi is
expressed by the relation

yi = (1 / (σ√(2π))) e^(-(xi - µ)² / (2σ²))

Where,
The height of the curve, yi, is the normal density, and there are two parameters,
µ and σ. For a given µ, there are an infinite number of curves (one for each σ),
and similarly, for a given σ, there are an infinite number of curves.
A normal curve with µ = 0 and σ = 1 is called the standardized normal curve and
its distribution is called the standard normal distribution.

Its characteristics state that nearly all the measurements lie within 3 standard
deviations of the mean. If we select a measurement at random from a population
of measurements that possesses a mound-shaped distribution, the probability is
approximately 0.68 that the measurement lies within 1 standard deviation of the
mean. Similarly, the probability is 0.954 that a value will lie in the interval µ ± 2σ
and 0.997 in the interval µ ± 3σ.

Characteristics of the normal distribution and normal curve

1. The curve is bell shaped and symmetrical about the line x=µ
2. Mean, median and mode of the distribution coincide
3. As x deviates numerically from µ, f(x) decreases rapidly; the maximum
probability occurs at the point x = µ.

4. No portion of curves lies below the x-axis.


5. Linear combination of independent normal variate is also normal variate.

6. Areas property

P (µ-σ < X < µ+σ) = 0.6826


P (µ-2σ < X < µ+2σ) = 0.9544
P (µ-3σ < X < µ+3σ) = 0.9973.

It gives the rule of 68, 95 and 99.7.
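The areas property can be verified numerically; a minimal sketch using Python's scipy.stats (an assumption here — any statistical library with a normal CDF would do):

```python
# Verify the 68-95-99.7 rule by evaluating the standard normal CDF.
from scipy.stats import norm

# P(µ - kσ < X < µ + kσ) for a standard normal (µ = 0, σ = 1)
p1 = norm.cdf(1) - norm.cdf(-1)    # within 1 SD
p2 = norm.cdf(2) - norm.cdf(-2)    # within 2 SD
p3 = norm.cdf(3) - norm.cdf(-3)    # within 3 SD

print(round(p1, 4), round(p2, 4), round(p3, 4))  # 0.6827 0.9545 0.9973
```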

Importance of normal distribution


1. Most of the distributions found in practice can be approximated by the
normal distribution.
2. Most sampling distributions tend to normality for large samples.
3. Even if a variable is not normally distributed, it can in most cases be
brought into normal form by a simple transformation of the variable.
4. Many distributions of sample statistics tend to normality for large
samples, and in such cases they can be studied with the help of the normal
curve.
5. The entire theory of small sample tests (t, F and χ2) is based on the
assumption that the parent populations from which samples have been drawn
follow the normal distribution.

Symmetry and kurtosis:

Moments about the mean are often used in statistics. Ʃ(x - µ)^p / N is termed the
pth moment about the mean. The first moment about the mean, Ʃ(x - µ) / N, is
zero. The second moment about the mean, Ʃ(x - µ)² / N, is the population
variance. The third moment about the mean, Ʃ(x - µ)³ / N, tells us about the
symmetry of the distribution. The sample statistic based on the third moment is
given as

k3 = n Ʃ(x - x̄)³ / ((n - 1)(n - 2))

This statistic carries cubed units; to make it unit-free,

g1 = k3 / s³ = k3 / √((s²)³)

This statistic g1 measures the symmetry of the distribution. If g1 is not
significantly different from 0, then the distribution is symmetric around the
mean. In this case, the mean and median are identical. If g1 is significantly less
than zero, it indicates that the sample is taken from a population that is skewed
to the left. In a left-skewed distribution the mean is less than the median. If g1
is significantly higher than zero, the sample is taken from a population which is
skewed to the right. In this case, the mean is higher than the median.

Left skewed distribution Right skewed distribution

A distribution of continuous data is symmetric, left skewed or right skewed. If
the left part and right part are equal, or the mean halves the area under the
curve into two equal parts, then the distribution is called symmetric. If it is not
so, then the distribution is skewed.

Kurtosis is another statistic used to determine whether the distribution is
normal or not. Kurtosis is estimated from the fourth moment about the mean.
Before estimating the measure of kurtosis, the following will be estimated:

k4 = [n(n + 1) Ʃ(x - x̄)⁴ - 3(n - 1)(Ʃ(x - x̄)²)²] / ((n - 1)(n - 2)(n - 3))

Here also k4 has units to the power four. The following statistic has no unit and
measures the kurtosis.

g2 = k4 / s⁴

If g2 is less than zero, the distribution is called platykurtic; if g2 is higher than
zero, the distribution is leptokurtic.
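These measures can be computed directly in software; the sketch below uses scipy's bias-corrected skewness and kurtosis, which correspond to the g1 and g2 statistics described above (the data sets are made-up illustrations):

```python
# Skewness (g1) and kurtosis (g2) for small illustrative samples.
from scipy.stats import skew, kurtosis

right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 12]   # one large value: long right tail
symmetric = [1, 2, 3, 4, 5]

g1_right = skew(right_skewed, bias=False)   # > 0: skewed to the right, mean > median
g1_sym = skew(symmetric, bias=False)        # 0: symmetric, mean = median
g2_right = kurtosis(right_skewed, fisher=True, bias=False)  # > 0: leptokurtic
```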

Proportion of a normal distribution

We know that the normal curve depends on two parameters. These parameters
are the mean and standard deviation. This indicates that there is a variety of
normal curves, depending on the mean and standard deviation. The standard
table for the normal distribution is designed for the distribution with mean = 0
and standard deviation = 1, but in practice we have a normal distribution with
mean µ and standard deviation σ. Whenever we need to use the standard table
with µ = 0 and σ = 1, we need to rescale the distribution of interest so that the
mean becomes 0 and the standard deviation becomes 1. This rescaled
measurement is given by

Z = (xi - µ) / σ

Z is termed the standard normal deviate, normal deviate, or standard score. It
follows that the mean of the standard score is 0 and its standard deviation is 1.

Example

1. The daily milk production of a herd of Jersey cows has a normal
distribution with µ = 70 pounds and σ = 13 pounds.
a. What is the probability that the milk production of a cow chosen at random
will be less than 60 pounds? (0.2206)
b. What is the probability that the milk production of a cow chosen at random
will be greater than 90 pounds? (0.0618)
c. What is the probability that the milk production of a cow chosen at random
will be between 60 pounds and 90 pounds? (0.7176)

Solution:

Let's understand the distribution from the normal curve.

[Normal curve with µ = 70, showing the area below 60 shaded]

To answer part (a) we must compute the z value that corresponds to 60.

Z = (x - µ) / σ

z = (60 - 70) / 13 = -0.77

P(X < 60) = P(Z < -0.77) = 1 - 0.7794 = 0.2206

The probability of choosing a cow giving less than 60 lb of milk from that
population is 0.2206.
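All three parts of this example can be checked with the normal CDF; a minimal sketch using scipy.stats (the results differ slightly from the table-based values because z is not rounded):

```python
# P(X < 60), P(X > 90) and P(60 < X < 90) for X ~ N(µ = 70, σ = 13).
from scipy.stats import norm

mu, sigma = 70, 13
p_a = norm.cdf(60, mu, sigma)        # part (a): below 60 lb, about 0.221
p_b = 1 - norm.cdf(90, mu, sigma)    # part (b): above 90 lb, about 0.062
p_c = norm.cdf(90, mu, sigma) - norm.cdf(60, mu, sigma)  # part (c), about 0.717
```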

The distribution of means (Sampling distributions)

If random samples are taken from a normal population, the means of these
samples will follow a normal distribution. The distribution of means from a
non-normal population will not be exactly normal, but will be approximately
normal if the samples are large. This is called the Central Limit Theorem.

Moreover, the variance of the distribution of means will decrease as n increases.


The variance of population of all possible means of sample of size n from
population with variance σ2 is

σ²x̄ = σ² / n

This quantity is called variance of mean. The distribution of sample statistics is


called sampling distribution.
The square root of the variance of mean is the standard deviation of mean which
is also called the standard error of mean.

σx̄ = √(σ² / n)

i.e. σx̄ = σ / √n

Here, we must know σ, which is a population parameter. Since we seldom
calculate population parameters, we estimate σ from a random sample taken
from the population; the best estimate of σ is the sample standard deviation s.
Then

s²x̄ = s² / n

sx̄ = s / √n

This is the sample standard error of mean. The importance of SE lies on the
hypothesis testing. Its magnitude is helpful in determining the precision to which
the statistics and variability are given.
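The claim that the standard deviation of sample means equals σ/√n can be illustrated by simulation; a minimal sketch (the population parameters reuse the milk-yield example, µ = 70 and σ = 13, which are assumptions for illustration):

```python
# Simulate the sampling distribution of the mean for samples of size n = 25.
import numpy as np

rng = np.random.default_rng(42)
mu, sigma, n = 70, 13, 25

# 10,000 sample means, each computed from a random sample of n observations
sample_means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)

theoretical_se = sigma / np.sqrt(n)       # σ/√n = 2.6
empirical_se = sample_means.std(ddof=1)   # close to 2.6 by the theory above
```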

Introduction to Statistical hypothesis testing

The purpose of statistical analysis is to make inference about the population by


examining samples from that population. Basically there are two area of
statistical analysis.
a) Estimation
b) Test of hypothesis

Test of hypothesis involves series of analysis by which we arrive in a decision


about the population of interest after examining the samples from that particular
population. It includes five major steps. They are
1. Set null hypothesis

In hypothesis testing we should be neutral towards the outcome of the test.
The statistician adopts the principle of "no difference" to frame the hypothesis;
he or she should hold a neutral ("null") attitude toward the test. Such a
hypothesis is called the null hypothesis. It is abbreviated H0. Let's derive a null
hypothesis from an example.
Example:
Seed of berseem is one of the critical inputs for livestock development.
Currently, the seed production of berseem is 20 quintals per hectare on
average. The Nepal Agriculture Research Council has developed a new variety
of berseem. It is claimed that this variety gives higher seed production than the
current one.
In this example, the null hypothesis states that the average seed production of
the new variety is no different from 20 quintals per hectare. It can be expressed
mathematically as Ho: µ = 20.

2. Set alternate hypothesis

The hypothesis proposed by the researcher in conducting the research is called
the research hypothesis or alternative hypothesis. It is denoted by Ha. In the
given example, the alternative hypothesis is that the new variety gives more
than 20 quintals of seed per hectare. Mathematically, it can be written as
Ha: µ > 20.
Be careful: with the alternative hypothesis Ha: µ > 20, the null hypothesis would
be Ho: µ ≤ 20. Based on the researcher's interest, the test could be one tailed or
two tailed.
One tailed testing: As we know that alternative hypothesis is based on the
researcher's interest. Alternative hypothesis tells about the difference in the
population in a specified direction between the population parameters and

hypothesized value. Statistical testing that examines the differences in only one of
the possible directions is called one tailed testing.
In the example given above, the null and alternative hypotheses for one tailed
testing are as follows
Ho: µ ≤ 20 and Ha: µ > 20
Ho: µ ≥ 20 and Ha: µ < 20
Two tailed testing: The statistical testing that tells the differences in two
directions is called two tailed testing. In the given example, the null and
alternative hypotheses for two tailed testing are as follows
Ho: µ = 20 and Ha: µ ≠ 20

3. Estimating test statistic

Once the null and alternative hypotheses are ready, we need to take a sample
from the population of interest. The decision on whether the data support the
research hypothesis is based on a quantity obtained from the computation of
statistics; such statistics are called test statistics. The values of the mean, Z and
t are examples of test statistics.

4. Defining Critical/ rejection regions

We need a criterion for rejecting or not rejecting the null hypothesis in a
statistical test. With the mean as the test statistic, a very large or a very small
mean might be obtained, and a large value of Z might be obtained even when
the null hypothesis is true. The larger the absolute value of Z, the smaller the
probability that the null hypothesis is true. The probability used as a criterion
for the rejection of the null hypothesis is called the level of significance. A
probability of 5% or less is commonly used as the criterion for rejection of the
null hypothesis. This is denoted by α. The value of the test statistic
corresponding to α is termed the critical value of that test statistic. In the Z
distribution, the critical value for testing a hypothesis at the 5% level of
significance (two tailed) is 1.96. From this statement, it can be understood that
a true null hypothesis will be rejected with frequency α. This is an error
committed in drawing a conclusion. The rejection of the null hypothesis when it
is true is called a type I error. It is also called an α error. On the other hand,
when the null hypothesis is in fact false, the statistical test may be unable to
detect this. In that case we commit the error of not rejecting the null hypothesis
when it is false. The probability of committing this error is given by β. This
error is called a type II error (β error).

The two types of error can be summarized as below

                        If Ho is true    If Ho is false
If Ho is rejected       Type I error     No error
If Ho is not rejected   No error         Type II error

5. Conclusion and checking assumptions:

The final step of hypothesis testing is drawing a conclusion based on the test
and checking the assumptions. Most statistical tests require certain
assumptions to be met. If the assumptions are violated, the inferences of the
statistical test are not valid.

Student's t - test

In the previous part, we discussed the Z test, where we infer about populations
from samples under the following assumptions: the population variance is
known (if the population variance is not known, the sample variance is an
unbiased estimator of the population variance), and the sample size is large.
In certain situations, the sample size is not large enough and the population
variance is unknown; then we use the t test in place of the Z test. This test is
also called Student's test. It was invented by W.S. Gosset, who published an
article under the pen name "Student".

When he used Z = (ȳ - µ) / (σ/√n) for a small sample with σ replaced by s, he
was falsely rejecting the null hypothesis at a slightly higher rate than that
specified by α. He then set out to derive the distribution of the new statistic
given as

(ȳ - µ) / (s/√n) for n < 30

The quantity (ȳ - µ) / (s/√n) is called the t statistic and its distribution is called
Student's t distribution.

So

t = (ȳ - µ) / (s/√n) = (ȳ - µ) / (SE of ȳ)

Properties of t distribution

1. There are many different t distributions. We specify a particular one by a
parameter known as degrees of freedom. The degrees of freedom are the
pieces of information available for estimating parameters.
2. The t distribution is symmetrical with mean equal to zero.
3. The t distribution has variance df/(df - 2) for df > 2.
4. As the degrees of freedom increase, the t distribution approaches the Z
distribution.

Use of t test
1. One sample test of whether the mean of a normally distributed population
has the value specified in the null hypothesis.
2. Two sample test in which the null hypothesis states that the means of two
normally distributed populations are equal.
3. Paired sample t test, in which the data were taken from paired samples.
4. A test of whether a regression coefficient differs significantly from zero.
One sample t test

In one sample hypothesis testing, the t test is used to make an inference about
a population mean. The difference between the Z test and the t test with one
sample is that s replaces σ. The t test is used any time σ is unknown and the
distribution of the random variable is mound shaped.

Example

An outbreak of food-borne disease was attributed to Salmonella enteritidis.
Epidemiologists determined that the source of the illness was ice cream. They
sampled nine production runs from the company that had produced the ice
cream to determine the level of S. enteritidis in it. These levels (MPN/g) were
0.593, 0.142, 0.329, 0.691, 0.231, 0.793, 0.519, 0.392, 0.418.
Use these data to determine whether the average level of S. enteritidis in the
ice cream is greater than 0.3 MPN/g, a level that is considered to be dangerous.
Set α = 0.01.

Solution:

Set null hypothesis:

The null hypothesis is that the samples of the ice cream do not have more than
0.3 MPN S. enteritidis.

Ho: µ ≤ 0.3

Alternative hypothesis:

The ice cream produced by this factory is contaminated with S. enteritidis and
contain more than 0.3 MPN S. enteritidis.

Ha: µ > 0.3

Estimation of test statistics:

The one sample t test is the appropriate statistical procedure in this condition and
estimation of t value will be the best way to estimate the test statistics.

Where

t = (x̄ - µ) / (s/√n)

Below is the procedure to estimate the test statistic.

Variable (x)   x - x̄     (x - x̄)²
0.593           0.137    0.018769
0.142          -0.314    0.098596
0.329          -0.127    0.016129
0.691           0.235    0.055225
0.231          -0.225    0.050625
0.793           0.337    0.113569
0.519           0.063    0.003969
0.392          -0.064    0.004096
0.418          -0.038    0.001444
Total                    0.362422

variance = Ʃ(x - x̄)² / (n - 1)

= 0.362422 / 8 = 0.0453

SD = 0.2128

t = (0.456 - 0.3) / (0.2128/√9)

= 2.21

Rejection region:

The rejection region is obtained from the t distribution table using the level of
significance and the degrees of freedom. The level of significance in this case is
0.01 and the degrees of freedom are 9 - 1 = 8. From the t table, the critical
value of t is 2.896. Since this is a one tailed test, the rejection region lies in one
direction, beyond the t value of 2.896.

In other words, the rejection region is given as

t ≥ 2.896

Conclusion and checking assumptions:

The calculated t value does not lie in the rejection region. Thus we do not have
sufficient evidence to reject the null hypothesis. The conclusion is that the
average level of S. enteritidis in the ice cream sampled is not higher than 0.3
MPN/g.
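The same one-sample test can be reproduced in software; a sketch using scipy's ttest_1samp, where the alternative='greater' option matches the one-tailed Ha: µ > 0.3:

```python
# One-sample, one-tailed t test of H0: µ ≤ 0.3 vs Ha: µ > 0.3.
from scipy import stats

levels = [0.593, 0.142, 0.329, 0.691, 0.231, 0.793, 0.519, 0.392, 0.418]
res = stats.ttest_1samp(levels, popmean=0.3, alternative='greater')

# t is about 2.21 with 8 df; p is about 0.03, which is above α = 0.01,
# so H0 is not rejected, matching the table-based conclusion.
print(res.statistic, res.pvalue)
```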

Two sample hypothesis:

One of the common procedures in biostatistics is to compare two samples to
see whether there is a difference between the populations from which they are
drawn. The t test is used to compare the means of two populations. In this case
we draw independent samples from the two populations to compare the
population parameters. We select an independent random sample of n1
observations from one population and n2 observations from a second
population. We use the difference in sample means (x̄1 - x̄2) to make an
inference about the difference between the population means (µ1 - µ2).

Then the t statistic is given as

t = (x̄1 - x̄2) / SEx̄1-x̄2

The standard error of the mean difference (SEx̄1-x̄2) is estimated as follows:

SEx̄1-x̄2 = sp √(1/n1 + 1/n2)

where sp is the pooled standard deviation, which is calculated as

sp = √( ((n1 - 1)s1² + (n2 - 1)s2²) / (n1 + n2 - 2) )

Example:
An experiment was conducted to evaluate the effectiveness of a treatment for
tapeworm in the stomach of sheep. A random sample of 24 worm-infected
lambs of approximately the same age and health was randomly divided into 2
groups. 12 of them were injected with the drug and the remaining 12 were left
untreated. After 6 months, the lambs were slaughtered and the following worm
counts were recorded.

Drug treated 18 43 28 50 16 32 13 35 38 33 6 7
sheep
Untreated 40 54 26 63 21 37 39 23 48 58 28 39
sheep

Test whether the data provide sufficient evidence that the drug is effective
against the tapeworm. (Use α = 0.05)

Solution:

The null hypothesis for the test is as follows

Null hypothesis:

The new drug is not effective against the tapeworm and will not reduce the
number of tapeworms in the stomach of sheep after 6 months.

Ho: µ1 - µ2 ≤ 0

Alternative hypothesis:

The new drug is effective against the tapeworm and will reduce the number of
tapeworms in the stomach of the sheep.

Ha : µ 1 - µ 2 > 0

Test statistics:

The t value is calculated as follows

t = (x̄1 - x̄2) / SEx̄1-x̄2

Untreated (x1)   Treated (x2)   x1 - x̄1   x2 - x̄2   (x1 - x̄1)²   (x2 - x̄2)²
40               18               0.33      -8.58      0.1089      73.6164
54               43              14.33      16.42    205.3489     269.6164
26               28             -13.67       1.42    186.8689       2.0164
63               50              23.33      23.42    544.2889     548.4964
21               16             -18.67     -10.58    348.5689     111.9364
37               32              -2.67       5.42      7.1289      29.3764
39               13              -0.67     -13.58      0.4489     184.4164
23               35             -16.67       8.42    277.8889      70.8964
48               38               8.33      11.42     69.3889     130.4164
58               33              18.33       6.42    335.9889      41.2164
28                6             -11.67     -20.58    136.1889     423.5364
39                7              -0.67     -19.58      0.4489     383.3764
Totals: Ʃx1 = 476, Ʃx2 = 319, Ʃ(x1 - x̄1)² = 2112.6668, Ʃ(x2 - x̄2)² = 2268.9168
Means: x̄1 = 39.67, x̄2 = 26.58

Variance of sample 1 (s1²) = 2112.6668 / 11 = 192.06, SD1 = 13.86

Variance of sample 2 (s2²) = 2268.9168 / 11 = 206.27, SD2 = 14.36

Pooled standard deviation (sp) = 14.11

SEx̄1-x̄2 = 14.11 √(1/12 + 1/12) = 5.76

t = (x̄1 - x̄2) / SEx̄1-x̄2

= (39.67 - 26.58) / 5.76

= 2.27

Rejection region:

The rejection region can be specified from the t distribution table, taking into
consideration the level of significance and degrees of freedom. Here the level of
significance is 0.05 and the degrees of freedom are 12 + 12 - 2 = 22. The
critical value of t at the 5% level of significance with 22 DF is 1.717. The
rejection region lies above 1.717.

Conclusion:

The calculated value of t, 2.27, lies in the rejection region. Thus we have
sufficient evidence to reject the null hypothesis. We can conclude that the drug
reduces the number of tapeworms in the stomach of sheep.
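This pooled-variance test corresponds to scipy's ttest_ind with equal_var=True; a sketch of the one-tailed comparison (untreated minus treated):

```python
# Two-sample, one-tailed pooled t test: do untreated sheep carry more worms?
from scipy import stats

untreated = [40, 54, 26, 63, 21, 37, 39, 23, 48, 58, 28, 39]
treated = [18, 43, 28, 50, 16, 32, 13, 35, 38, 33, 6, 7]

res = stats.ttest_ind(untreated, treated, equal_var=True, alternative='greater')
# t is about 2.27 with 22 df; p < 0.05, so H0 is rejected.
```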

Paired sample hypothesis

Two sample testing can be applied under two conditions. In the first, the two
samples are independent, which simply means that each datum in one sample
is in no way associated with any specific datum in the other sample. The other
condition prevails when each observation in one sample is in some way
correlated with an observation in the other sample. The t test is also used to
compare the means of paired samples. The paired sample t test requires that
each datum in one sample is paired with one, and only one, datum in the other
sample. The paired sample t test doesn't have the normality and equality of
variance assumptions of the 2 sample t test, but assumes the differences come
from a normally distributed population of differences. The paired sample t
statistic is given as

t = d̄ / SEd̄

Example

Consider the pair data given below

Deer   Hind leg length   Foreleg length
1      142               138
2      140               136
3      144               147
4      144               139
5      142               143
6      146               141
7      149               143
8      150               145
9      142               136
10     148               146

Solution

Null hypothesis:
The length of hind leg and fore leg is same

Alternative hypothesis:
The length of hind leg and foreleg is not same

Test statistics

The t value is calculated as follows.

Deer   Hind leg   Foreleg   Difference (d)   d - d̄    (d - d̄)²
1      142        138         4               0.7       0.49
2      140        136         4               0.7       0.49
3      144        147        -3              -6.3      39.69
4      144        139         5               1.7       2.89
5      142        143        -1              -4.3      18.49
6      146        141         5               1.7       2.89
7      149        143         6               2.7       7.29
8      150        145         5               1.7       2.89
9      142        136         6               2.7       7.29
10     148        146         2              -1.3       1.69
Total                        33 (d̄ = 3.3)              84.10

Standard deviation of the differences = √(84.1/9) = 3.06

SEd̄ = 3.06 / √10 = 0.97

t = 3.3 / 0.97 = 3.41

Rejection region:

This is two tailed test. The critical value of t at 5% level of significance with 9
degree of freedom is 2.262. The rejection region lies above the 2.262 and below
the - 2.262.

Conclusion

The calculated t value lies in the rejection region, so we reject the null
hypothesis and conclude that hind leg and foreleg lengths differ.
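The paired test can be reproduced with scipy's ttest_rel, which works on the pairwise differences exactly as in the table above:

```python
# Paired (two-tailed) t test on hind leg vs foreleg length of 10 deer.
from scipy import stats

hind = [142, 140, 144, 144, 142, 146, 149, 150, 142, 148]
fore = [138, 136, 147, 139, 143, 141, 143, 145, 136, 146]

res = stats.ttest_rel(hind, fore)
# t is about 3.41 with 9 df; p < 0.05, so the null hypothesis is rejected.
```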

Analysis of Variance

The t-test is used to compare one or two sample means. When measurements
of a variable are obtained from three or more samples, there would be three or
more pairs of means to compare, and the t test is not valid for such
comparisons. The appropriate procedure to compare the means of three or
more samples is the analysis of variance (ANOVA). Analysis of variance is a
statistical process for testing more than two population means by taking
samples from the corresponding populations.
If we have to compare the means of three samples, our null hypothesis would
be H0: µ1 = µ2 = µ3. In ANOVA, the F statistic needs to be estimated. The F
statistic was formerly known as the "variance ratio", which is the ratio between
the treatment and error variances. The error variance is the variance common
to all groups of samples.
Considering the factors taken at a time, the analysis of variance can be divided
into single factor or one-way analysis of variance, for one factor at a time, and
factorial or two-way analysis of variance, for two or more factors taken at a
time. Controlled variables are called factors and are selected by the researcher
for comparison. On the other hand, response variables are the measurements
or observations that are recorded but not controlled by the researcher.
Treatment and control: Treatments are the conditions made from the factors and
control treatment is a special treatment to which the effectiveness of the other
treatment are compared.
An experimental unit is the physical entity to which treatments are randomly
assigned. When one treatment is assigned to more than one experimental unit,
each repetition of the treatment is called a replication. Experimental error is
the variation in response among experimental units which are assigned the
same treatment and are observed under the same experimental conditions.
Example: Nineteen pigs are assigned at random among four groups. Each group
is fed a different diet. The data are pig weights in kg after being raised on these
diets. We need to know whether the mean weights of pigs are the same for all
four diets.

       Feed 1   Feed 2   Feed 3   Feed 4
       60.8     68.7     102.6     87.9
       57.0     67.7     102.1     84.2
       65.0     74.0     100.2     83.1
       58.6     66.3      96.5     85.7
       61.7     69.8               90.3
Mean   60.62    69.30    100.35    86.24

Solution:
The null hypothesis:
The mean weights of pig on four diets are the same.
H0: µ1= µ2= µ3= µ4
Alternative hypothesis:
The mean weights of the pigs on the four diets are not all the same;
at least one µ is different from the others.
In order to compare the means of more than two populations, the F statistic
needs to be calculated. For this purpose, we do the analysis of variance.
First, we need to calculate the correction factor (CF), which is estimated as
follows:

CF = (ƩiƩj xij)² / N

CF = (1482.2)² / 19 = 115627.202

Now, we need to estimate the total sum of squares (TSS):

TSS = ƩiƩj xij² - CF
= 60.8² + 57² + 65² + ... + 85.7² + 90.3² - 115627.202
= 119981.900 - 115627.202
= 4354.698

Now, the sum of squares between treatments (between groups) is calculated as
follows:

SSB = Ʃi (Ʃj xij)² / ni - CF

= (303.1)²/5 + (346.5)²/5 + (401.4)²/4 + (431.2)²/5 - 115627.202
= 119853.55 - 115627.202
= 4226.348

The error sum of squares is calculated by subtracting the treatment sum of
squares from the total sum of squares:

SSE = TSS - SSB
= 4354.698 - 4226.348
= 128.350

After calculating these sums of squares, let's build the ANOVA table.

ANOVA TABLE
Source of variance   Degrees of freedom   Sum of squares   Mean sum of squares      F value
Total                19 - 1 = 18          4354.698
Treatment            4 - 1 = 3            4226.348         4226.348/3 = 1408.783    F = MStreatment/MSerror
Error                18 - 3 = 15          128.350          128.350/15 = 8.557       = 1408.783/8.557 ≈ 164.6

The calculated F value is compared with the tabulated F value, or the
probability of obtaining this F value at the given degrees of freedom is
estimated. In this case, the tabulated F value with 3 degrees of freedom in the
numerator and 15 in the denominator is 3.29. Since the calculated F value is
higher than the tabulated F value, we reject the null hypothesis. Thus the
conclusion is that we do not have sufficient evidence that all four diets are the
same for the growth of pigs.
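The whole ANOVA can be reproduced with scipy's f_oneway; note that group 4 uses the value 85.7, consistent with the group total of 431.2 used in the SSB calculation above:

```python
# One-way ANOVA on pig weights under four feeds.
from scipy import stats

feed1 = [60.8, 57.0, 65.0, 58.6, 61.7]
feed2 = [68.7, 67.7, 74.0, 66.3, 69.8]
feed3 = [102.6, 102.1, 100.2, 96.5]
feed4 = [87.9, 84.2, 83.1, 85.7, 90.3]

F, p = stats.f_oneway(feed1, feed2, feed3, feed4)
# F is about 164.6 with (3, 15) df; p is far below 0.05, so H0 is rejected.
```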

In the given example, the completely randomized design was used to perform the
analysis of variance. A statistical model can be used to express experimental
condition. In this case the statistical model is as follows

yij = µ + αi + eij

Where
yij = dependent variable
µ = overall mean
αi = effect of the ith level of the α factor
eij = random error

Assumptions of the model are

1. Observations are independent


2. Random part of the model is normally distributed with mean 0 and
variance σ2e.

Once it is proved that the treatments differ significantly, the next procedure is
to find out which treatment is different from the others, or which treatments
are the same. For this purpose, multiple comparison tests need to be
performed.

Multiple comparison tests

Multiple comparison tests are tests used to examine the differences between all
possible pairs of means. In ANOVA, we had H0: µ1 = µ2 = µ3 = µ4. On rejecting
the null hypothesis, at least one of the means is different from the others. There
are several alternative hypotheses, but we do not know which one is true or
valid. To find the best alternative hypothesis, we need to do multiple
comparison tests. The following are multiple comparison tests:
1. Tukey Test
2. Newman-Keuls Test
3. Duncan Multiple Range Test
4. Least Significant Difference (LSD) test.
Tukey test:

The Tukey test is used to test the null hypothesis H0: µA = µB against the
alternative hypothesis HA: µA ≠ µB, where µA and µB are the means of any
possible pair of treatments. For k treatments, there are k(k - 1)/2 different
pairs that need to be tested.
Taking the example from the ANOVA, the effects of four different feeds on pig
weights, the following are the steps to complete the Tukey test.

Step 1: Arrange the sample means in order of increasing magnitude. In this


example

Feed          Feed 1   Feed 2   Feed 4   Feed 3
Mean          60.62    69.30    86.24    100.35
Sample size   5        5        5        4

Step 2: Tabulate the pairwise differences
Step 2: Tabulate the pair wise difference

Comparison   Difference               SE     q       q0.05,15,4   Conclusion
3 vs 1       100.35 - 60.62 = 39.73   1.39   28.58   4.076        Reject H0
3 vs 2       100.35 - 69.30 = 31.05   1.39   22.34   4.076        Reject H0
3 vs 4       100.35 - 86.24 = 14.11   1.39   10.15   4.076        Reject H0
4 vs 1       86.24 - 60.62 = 25.62    1.31   19.56   4.076        Reject H0
4 vs 2       86.24 - 69.30 = 16.94    1.31   12.93   4.076        Reject H0
2 vs 1       69.30 - 60.62 = 8.68     1.31   6.63    4.076        Reject H0

Step 3: Determine the SE and divide the difference by the SE

Standard Error (SE) = √(s²/n) if nA = nB

Standard Error (SE) = √( (s²/2)(1/nA + 1/nB) ) if nA ≠ nB

In this case, when nA = nB, the standard error is

√(8.557/5) = 1.31

and when nA ≠ nB,

√( (8.557/2)(1/5 + 1/4) ) = 1.39

The value of q is calculated using the following equation:

q = (x̄B - x̄A) / SE

For the first pair,

q = 39.73 / 1.39 = 28.58. Similarly, the value of q for all the pairs
needs to be calculated and filled into the table above.

The value of q at the 0.05 level of significance with 15 degrees of freedom (and
k = 4 means) can be obtained from the table given in the annex of most
statistics books. The calculated value of q is compared with the tabulated value
of q to decide whether the null hypothesis for each compared pair is true or not.

In the given example, all feed samples are different from each other.

Least Significance Difference (LSD) Test:

LSD is one of the multiple comparison tests which also compare the pairs of
treatments. Following will be the steps in the LSD.

Step 1: Estimate the Least Significant Difference

LSD = tα/2 √( msw (1/n1 + 1/n2) ) for n1 ≠ n2

LSD = tα/2 √( 2 msw / n ) for n1 = n2

where msw is the error mean square and tα/2 is the critical t value with the
error degrees of freedom.

In the given example of the effect of four different feeds on pig weight,

LSD = 2.131 √( 8.557 (1/4 + 1/5) ) = 4.18 when n1 ≠ n2

LSD = 2.131 √( 2 × 8.557 / 5 ) = 3.94 when n1 = n2

Step 2: Rank the means from lowest to highest

Sample   1       2       4       3
Mean     60.62   69.30   86.24   100.35

Step 3: Compute the sample mean differences

Largest - smallest = 100.35 - 60.62 = 39.73, which is larger than the LSD value.
Second largest - smallest = 86.24 - 60.62 = 25.62, also larger than the LSD value.
Third largest - smallest = 69.30 - 60.62 = 8.68, also larger than the LSD value.
This tells us the first sample is different from the rest.

Similarly, the second sample is compared with the others:

Largest - second smallest = 100.35 - 69.30 = 31.05, larger than the LSD value.
Second largest - second smallest = 86.24 - 69.30 = 16.94, also larger than the
LSD value.
So the second sample is also different from the rest.

The fourth sample is now compared with the third sample:

Largest - third smallest = 100.35 - 86.24 = 14.11, larger than the LSD value.
So the third and fourth samples are also different.

In this way we come to the conclusion that each feed has a different effect on
pig weight.
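The LSD values can be computed directly from the error mean square; a minimal sketch using the numbers from the ANOVA above (msw = 8.557 with 15 error df; the critical t is taken from scipy rather than a printed table):

```python
# Least Significant Difference for the pig-feed example.
import math
from scipy.stats import t

msw, error_df = 8.557, 15          # error mean square and its df from the ANOVA
t_crit = t.ppf(0.975, error_df)    # two-sided α = 0.05, about 2.131

lsd_unequal = t_crit * math.sqrt(msw * (1/4 + 1/5))   # n1 = 4, n2 = 5, about 4.18
lsd_equal = t_crit * math.sqrt(2 * msw / 5)           # n1 = n2 = 5, about 3.94

# Any pair of feed means differing by more than the relevant LSD is declared
# significantly different; here every pairwise difference (smallest: 8.68)
# exceeds both LSD values.
```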

Simple Linear Regression

In the scientific world, prediction on the basis of other variables has great
importance and usefulness. For example, an increase in the temperature of the
medium will increase the growth of bacteria, so the growth of the bacteria can
be predicted from the temperature at which they are stored. The pH of the soil
can be predicted from the amount of lime used in the agricultural field. This
gives us a situation where one variable can be predicted based on the
magnitude of another variable. The former is called the dependent variable and
the latter the independent variable. The relationship between two variables
may be the functional dependence of one on another: the magnitude of the
dependent variable is a function of the magnitude of the independent variable.
The independent variable is also called the regressor or predictor, and the
dependent variable is called the response. The simplest form of functional
relationship of one variable to another in a population is the simple linear
regression, which can be stated as

Yi= β0+β1Xi

In this equation, β0 and β1 are the population parameters. In the population, however,
the data are unlikely to lie exactly on the straight line given by the above equation.
Thus the former equation becomes

Yi = β0 + β1Xi + εi


In this equation, β0 is the intercept, which gives the value of the dependent variable
when the magnitude of the independent variable is zero, and β1 is the slope of the
regression line, which gives the change in the dependent variable for a unit increase
in the independent variable. These population parameters are estimated from the
sample: the values b0 and b1 are unbiased estimators of β0 and β1. The value of b1 is
estimated as

b1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²

This b1 can be written as byx.

Example

The age and wing length of 13 birds were measured and presented as follows.

Age of birds (x):        3    4    5    6    8    9   10   11   12   14   15   16   17
Wing length (y, cm):   1.4  1.5  2.2  2.4  3.1  3.2  3.2  3.9  4.1  4.7  4.5  5.2  5.0

Here, the variable age of the birds is the independent variable and predicts the
length of the wing. Wing length is the dependent variable.
When we plot these variables, we find the following scatter plot diagram.

[Scatter plot of wing length (y) against age (x) for the 13 birds]

Looking at this plot, we can see that the dependent variable does not follow the
straight line exactly, so there is a random error component in the equation.

The estimation of the regression coefficient is obtained as follows

      Age of     Wing length
      birds (x)  of birds (y)   x - x̄   (x - x̄)²   y - ȳ    (x - x̄)(y - ȳ)
 1       3          1.4           -7        49      -2.015       14.105
 2       4          1.5           -6        36      -1.915       11.49
 3       5          2.2           -5        25      -1.215        6.075
 4       6          2.4           -4        16      -1.015        4.06
 5       8          3.1           -2         4      -0.315        0.63
 6       9          3.2           -1         1      -0.215        0.215
 7      10          3.2            0         0      -0.215        0
 8      11          3.9            1         1       0.485        0.485
 9      12          4.1            2         4       0.685        1.37
10      14          4.7            4        16       1.285        5.14
11      15          4.5            5        25       1.085        5.425
12      16          5.2            6        36       1.785       10.71
13      17          5.0            7        49       1.585       11.095
      Σx = 130   Σy = 44.4        Sum = 262          Sum = 70.8
      x̄ = 10     ȳ = 3.415
The value of the regression coefficient (b1) is calculated with the following equation:

b1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²

byx = 70.8/262 = 0.270

Now, the value of b0 (intercept) is calculated as follows

ȳ = b0 + b1x̄

3.415 = b0 + 0.270 × 10

b0 = 0.715

So the regression equation is

Y = 0.715 + 0.270 X

This is the equation we can use to predict the wing length of birds from their age.
Let's predict the wing length of bird with an age of 8 days.

Y = 0.715 + 0.27 x 8
Y = 0.715 + 2.16
Y = 2.875 cm

Here we can see a difference between the observed and predicted values. The difference
(between 3.1 and 2.875) is 0.225, which is called the random error. The sum of the
random errors is zero. The regression equation is fitted in such a way that the sum of
the squares of the differences between observed and predicted values is a minimum.
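As a check on the hand calculation, the least-squares fit can be reproduced in a few lines. This is a sketch using only the data above; note that with unrounded sums the intercept comes out as 0.713 (the 0.715 above reflects rounding ȳ and b1 before subtracting).

```python
# Least-squares fit of wing length on age for the 13 birds above.

ages = [3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17]
wings = [1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0]

n = len(ages)
x_bar = sum(ages) / n    # 10
y_bar = sum(wings) / n   # 3.415...

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, wings))  # 70.8
sxx = sum((x - x_bar) ** 2 for x in ages)                          # 262

b1 = sxy / sxx           # slope, 70.8 / 262 = 0.270...
b0 = y_bar - b1 * x_bar  # intercept, about 0.713 with unrounded values

predicted = b0 + b1 * 8  # predicted wing length at age 8, about 2.875 cm
print(b1, b0, predicted)
```

The predicted value at age 8 agrees with the hand calculation (2.875 cm) to within rounding.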

Interpretation of b0 and b1 values

In this example, the value of b0 is 0.715, which indicates that the wing length of a
bird in this particular population at zero days of age is 0.715 cm. The value of b1 is
0.270, which means that the wing length of the birds increases on average by 0.27 cm
per day in the population from which the sample was taken.

The value of b1 can be positive or negative.

Testing whether the regression coefficient is significantly different from zero

In this case the hypotheses are as follows:

H0: β1 = 0 and Ha: β1 ≠ 0

Analysis of variance is the way to test the above hypothesis and to find the
appropriate model. The analysis of variance will be as follows.

ANOVA TABLE

Source of variation   Degrees of freedom   Sum of squares
Total                 n - 1                Σ(y - ȳ)²
Regression            1                    [Σ(x - x̄)(y - ȳ)]² / Σ(x - x̄)²

The residual sum of squares is obtained by subtracting the regression sum of squares
from the total sum of squares, and it has n - 2 degrees of freedom. For the given
example the ANOVA table will be as follows.

Source of variation   Degrees of freedom   Sum of squares   Mean square
Total                 12                   19.66
Regression            1                    19.13            19.13
Error                 11                   0.52             0.047

F = 19.13 / 0.047 = 407

The calculated F value is far larger than the tabulated value, so from the analysis of
variance we reject the null hypothesis: the regression coefficient is significantly
different from zero.

The coefficient of determination is another statistic that needs to be estimated in
regression analysis. It gives the proportion of the total variance explained by the
fitted regression; it is denoted by r² and calculated by the following formula.

r² = Regression SS / Total SS

or equivalently

r² = 1 - Residual SS / Total SS

Residual SS and Error SS are used synonymously.


In this example, r² = 19.13 / 19.66 = 0.973. This value indicates that the fitted
regression equation explains 97.3 percent of the total variation in the dependent
variable.
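The ANOVA table and r² above can be reproduced directly from the data. A sketch follows; note that the table's F of 407 uses the rounded mean square 0.047, while the unrounded F is about 401, still far beyond any reasonable critical value.

```python
# Regression ANOVA for the bird data: total, regression and residual SS,
# the F statistic, and the coefficient of determination.

ages = [3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17]
wings = [1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(wings) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, wings))
sxx = sum((x - x_bar) ** 2 for x in ages)

total_ss = sum((y - y_bar) ** 2 for y in wings)  # about 19.66, df = n - 1
reg_ss = sxy ** 2 / sxx                          # about 19.13, df = 1
resid_ss = total_ss - reg_ss                     # about 0.52,  df = n - 2

f_stat = reg_ss / (resid_ss / (n - 2))  # about 401 with unrounded sums
r_squared = reg_ss / total_ss           # about 0.973
print(f_stat, r_squared)
```

Both statistics match the hand calculation up to rounding of the intermediate sums.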

Simple Linear Correlation

In regression analysis, it is assumed that there is a dependence relationship between
the variables. In correlation, there is a relationship between two variables, but one
is not necessarily functionally dependent on the other. The correlation coefficient
measures the strength of the linear relationship between two variables. If x and y are
two variables, the correlation coefficient is given as

r = Σ(x - x̄)(y - ȳ) / √[Σ(x - x̄)² Σ(y - ȳ)²]
In other words, the correlation coefficient can be stated as

r = Cov(x, y) / √(Var x · Var y)
From this equation, the correlation coefficient is defined as the covariance of the two
variables divided by the product of their standard deviations. The numerator may be
positive, negative or zero, but the denominator is always positive. A positive
numerator gives a positive correlation coefficient, indicating that the variables of
interest are positively correlated; a negative numerator gives a negative correlation
coefficient, indicating that they are negatively correlated.

Properties of correlation coefficient

1. The value of the correlation coefficient lies between -1 and +1.
2. The correlation coefficient has no unit.
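Property 2 can be demonstrated numerically: because the units cancel in the ratio, rescaling a variable (say, converting wing length from cm to mm) leaves r unchanged. A sketch using the wing and tail length data from the worked example later in this section:

```python
# Demonstration that r is unit-free: rescaling x by a constant factor
# (cm -> mm) does not change the correlation coefficient.

def corr(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

wing = [10.4, 10.8, 11.1, 10.2, 10.3, 10.2, 10.7, 10.5, 10.8, 11.2, 10.6, 11.4]
tail = [7.4, 7.6, 7.9, 7.2, 7.4, 7.1, 7.4, 7.2, 7.8, 7.7, 7.8, 8.3]

wing_mm = [10 * w for w in wing]  # change of unit only
print(corr(wing, tail), corr(wing_mm, tail))  # identical values
```

Both calls print the same value, confirming that r does not depend on the measurement units.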

[Scatter plot illustrating positive correlation]

Similar scatter plots can be made for the negative-correlation and no-correlation cases.

Relationship between regression and correlation coefficient

r = Cov(x, y) / √(Var x · Var y)

Multiplying and dividing by √Var x:

r = [Cov(x, y) / √(Var x · Var y)] × (√Var x / √Var x)

r = [Cov(x, y) / Var x] × (√Var x / √Var y)

r = byx × (√Var x / √Var y)

r = byx × (sd x / sd y)
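The identity r = byx × (sd x / sd y) can be verified numerically. A sketch using the wing and tail length data from the example below; any common divisor in the variances cancels in the ratio, so raw sums of squares are used directly.

```python
# Numerical check that r equals b_yx times sd_x / sd_y.

wing = [10.4, 10.8, 11.1, 10.2, 10.3, 10.2, 10.7, 10.5, 10.8, 11.2, 10.6, 11.4]
tail = [7.4, 7.6, 7.9, 7.2, 7.4, 7.1, 7.4, 7.2, 7.8, 7.7, 7.8, 8.3]

n = len(wing)
x_bar, y_bar = sum(wing) / n, sum(tail) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(wing, tail))
sxx = sum((x - x_bar) ** 2 for x in wing)
syy = sum((y - y_bar) ** 2 for y in tail)

r = sxy / (sxx * syy) ** 0.5          # correlation coefficient
b_yx = sxy / sxx                      # regression coefficient of y on x
r_from_b = b_yx * (sxx / syy) ** 0.5  # b_yx * (sd_x / sd_y); divisors cancel
print(r, r_from_b)  # the two values agree
```

The two computed values coincide, confirming the algebraic derivation above.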

Example:

Tail length and wing length of birds are considered correlated traits. The following
data set gives the wing length and tail length measured in 12 birds. Estimate the
correlation coefficient.

Wing length (x):   10.4  10.8  11.1  10.2  10.3  10.2  10.7  10.5  10.8  11.2  10.6  11.4
Tail length (y):    7.4   7.6   7.9   7.2   7.4   7.1   7.4   7.2   7.8   7.7   7.8   8.3

Solution:

The correlation coefficient is estimated using the equation given below.

r = Σ(x - x̄)(y - ȳ) / √[Σ(x - x̄)² Σ(y - ȳ)²]

      Wing        Tail
      length (x)  length (y)   x - x̄   (x - x̄)²   y - ȳ   (y - ȳ)²   (x - x̄)(y - ȳ)
 1      10.4         7.4        -0.2      0.04     -0.1      0.01         0.02
 2      10.8         7.6         0.2      0.04      0.1      0.01         0.02
 3      11.1         7.9         0.5      0.25      0.4      0.16         0.2
 4      10.2         7.2        -0.4      0.16     -0.3      0.09         0.12
 5      10.3         7.4        -0.3      0.09     -0.1      0.01         0.03
 6      10.2         7.1        -0.4      0.16     -0.4      0.16         0.16
 7      10.7         7.4         0.1      0.01     -0.1      0.01        -0.01
 8      10.5         7.2        -0.1      0.01     -0.3      0.09         0.03
 9      10.8         7.8         0.2      0.04      0.3      0.09         0.06
10      11.2         7.7         0.6      0.36      0.2      0.04         0.12
11      10.6         7.8         0.0      0.00      0.3      0.09         0.00
12      11.4         8.3         0.8      0.64      0.8      0.64         0.64
                                 Sum = 1.8          Sum = 1.4      Sum = 1.39
Mean x = 10.683    Mean y = 7.567
(The deviations in the table are taken from the means rounded to 10.6 and 7.5.)

r = 1.39 / √(1.8 × 1.4) = 1.39 / 1.587 = 0.876

The correlation coefficient is positive, which indicates that these two variables are
strongly and positively correlated.
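The hand calculation above takes deviations from means rounded to 10.6 and 7.5, which slightly inflates r; with unrounded means the exact value is about 0.870. A sketch of the unrounded computation:

```python
# Correlation between wing length and tail length of the 12 birds,
# computed without rounding the means.

wing = [10.4, 10.8, 11.1, 10.2, 10.3, 10.2, 10.7, 10.5, 10.8, 11.2, 10.6, 11.4]
tail = [7.4, 7.6, 7.9, 7.2, 7.4, 7.1, 7.4, 7.2, 7.8, 7.7, 7.8, 8.3]

n = len(wing)
x_bar, y_bar = sum(wing) / n, sum(tail) / n  # 10.683..., 7.566...

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(wing, tail))
sxx = sum((x - x_bar) ** 2 for x in wing)
syy = sum((y - y_bar) ** 2 for y in tail)

r = sxy / (sxx * syy) ** 0.5  # about 0.870, a strong positive correlation
print(r)
```

Either way the conclusion is the same: the two traits are strongly and positively correlated.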
