TABLE OF CONTENTS
FREQUENCY DISTRIBUTION
MEASURES OF DISPERSION
ELEMENTARY PROBABILITY
NORMAL DISTRIBUTION
STUDENT'S T-TEST
ANALYSIS OF VARIANCE
SP Paudel Page
Introduction and importance, Observation and data
recording
Definition:
Statistics refers to the analysis and interpretation of data with a view toward
objective evaluation of the reliability of the conclusions based on the data.
Statistics applied to biological problems is simply called biostatistics or biometry.
Descriptive statistics: Once data are obtained, they are organized and
summarized in such a way as to arrive at their orderly and informative
presentation. Such procedures are termed as descriptive statistics.
Variable:
A characteristic that may differ from one biological entity to another is termed a
variable.
Interval:
Some measurement scales possess a constant interval size but not a true zero;
they are called interval scales, e.g. the two common temperature scales (Celsius
and Fahrenheit). The zero point is arbitrary.
Ordinal:
Data that consist of an ordering or ranking of measurements are said to be on an
ordinal scale, e.g. one biological entity being shorter, darker, faster or more
active than another.
Nominal scale:
Variables that are classified by some quality rather than by numerical
measurement are said to be on a nominal scale.
The first step in learning from data is to set the objective for which the
data are collected. The design of data collection is crucial and includes the
following steps.
Specify the objective of the study
Identify the variables of interest
Choose the appropriate design for the survey or scientific study
Collect the data
The objective of the study is derived from the statement of the problem, and the
variables of interest are identified based on the objective. Once the objective and
variables are identified, a proper method for collecting the data is required.
Data collection processes include surveys, experiments, and the examination of
existing data from records, censuses and previous studies. The theory of sample
surveys and the theory of experimental design provide excellent methods for data
collection. Surveys are usually passive: the goal of a survey is to gather data on
existing conditions, behaviors and attitudes. Experiments (scientific studies)
are more active: the experimental conditions are varied to study the effect
of the conditions on the outcome of the experiment. In most scientific
experiments, as many as possible of the factors that affect the measurements are
under the control of the experimenter.
Survey:
Information from surveys affects almost every aspect of daily life. A crucial
element in any survey is the process by which the sample is selected from the
population. If a surveyor selects a sample for his or her own convenience, there
may be bias in the sample survey. In this case the sample may not represent
the population properly, so the sample statistics cannot precisely reflect the
population from which the sample is drawn. The following are the methods of
drawing samples from a population.
In a survey, once the sample is finalized, the next step is the collection of data
from it. The most common methods of data collection are
1. Personal interview
2. Telephone interview
3. Self-administered questionnaires
4. Direct observation
Scientific studies
For scientific studies, there are various experimental designs on
which experiments are carried out. The major objective of these designs is to
minimize the experimental error without introducing bias.
The following are examples of experimental designs
I. Completely randomized design
II. Randomized block design
III. Factorial design
IV. Latin square design
V. Split plot design
Frequency distribution
When measurements on a variable have been collected, they are organized,
displayed and examined in various ways. One of the first methods used to examine
the data is the frequency distribution. It shows the pattern of the variable of
interest, that is, how frequently each value occurs among the observations. When
observations, whether discrete or continuous, are available on a characteristic
from a large number of individuals, it becomes necessary to condense the data into
a small number of groups without losing much information. If we collect the weights
of all students from HICAST, we will have more than 1000 observations. Unless we
condense the data on weight, it is very difficult to give any information about
the weight of the students. Simply by arranging the observations in ascending
order, we can state the minimum and maximum weight of the HICAST students.
Arranging data in a frequency table is the first step in exploring the data. In
this case, the weight of a student is the variable, and how often a value of the
variable is observed is its frequency.
After collecting and summarizing a large amount of data, it is useful to arrange
them in the form of a frequency table. A frequency table consists of a list of all
the observed values of the variable under study in one column and the number of
times each value is observed in another. The observations are first sorted into
different categories. The distribution of the total number of observations among
the various categories is termed a frequency distribution.
Type                                   Count
A. Vine (V)                            56
B. Building cave (B)                   60
C. Low tree branches (LTB)             46
D. Tree and building cavities (TBV)    49
[Figure: bar chart of count by type (B, LTB, TBV, V)]
The procedure for preparing a frequency table differs between discrete and
continuous data. For discrete data, each observed value can be a category, and the
number of times it is observed is taken as its frequency. Alternatively,
observations can be grouped into classes, and the number of observations falling
in each class is taken as the frequency of that class.
[Figure: bar chart of count by litter size (3 to 7)]
If the data produce a long frequency table, it is common practice to group the
data and cast them in a grouped frequency table. Such grouping results in some
loss of information, but the table is easier to read. There are several rules for
deciding how many groups should be made for a particular data set, but it is well
established that the class intervals of the variable being measured should be of
equal size.
Continuous data are always presented as a frequency distribution tabulated by
groups. For a continuous data set, the frequency distribution can be
displayed by a histogram.
Number of aphids observed per clover plant
Number of aphids   Number of plants   Number of aphids   Number of plants
on a plant         observed           on a plant         observed
 0                  3                 20                 17
 1                  1                 21                 18
 2                  1                 22                 23
 3                  1                 23                 17
 4                  2                 24                 19
 5                  3                 25                 18
 6                  5                 26                 19
 7                  7                 27                 21
 8                  8                 28                 18
 9                 11                 29                 13
10                 10                 30                 10
11                 11                 31                 14
12                 13                 32                  9
13                 12                 33                 10
14                 16                 34                  8
15                 13                 35                  5
16                 14                 36                  4
17                 16                 37                  1
18                 15                 38                  2
19                 14                 39                  1
                                      40                  0
                                      41                  1
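As an illustration of grouping with equal class intervals, the sketch below condenses the aphid table above into classes. The class width of 4 and the function name `group_frequencies` are assumptions chosen for this example; the text does not prescribe a particular width.

```python
# Frequencies from the aphid table: list index = aphids on a plant,
# value = number of plants observed with that many aphids.
plants_observed = [
    3, 1, 1, 1, 2, 3, 5, 7, 8, 11, 10, 11, 13, 12, 16, 13, 14, 16, 15, 14,
    17, 18, 23, 17, 19, 18, 19, 21, 18, 13, 10, 14, 9, 10, 8, 5, 4, 1, 2, 1, 0, 1,
]


def group_frequencies(freqs, width):
    """Condense a list of frequencies into classes of equal width.

    Returns a list of ((lower, upper), class_frequency) tuples.
    """
    grouped = []
    for lower in range(0, len(freqs), width):
        upper = lower + width - 1
        grouped.append(((lower, upper), sum(freqs[lower:lower + width])))
    return grouped


# Class width 4 is an assumption made for this illustration.
grouped = group_frequencies(plants_observed, width=4)
for (lower, upper), frequency in grouped:
    print(f"{lower:2d}-{upper:2d}: {frequency}")
```

The grouped table is much shorter than the original, at the cost of no longer showing how the plants are distributed inside each class.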
Pie chart:
It is used to display the percentage of the total number of measurements
falling into each category of the variable. For the example given above:
[Figure: pie chart of types B, LTB, TBV, V]
Histogram:
[Figure: histogram of count by type (B, LTB, TBV, V)]
Before we turn to the measures of central tendency, let us define a few terms used
in statistics.
Population:
Parameter:
A. Mean
a. Arithmetic mean:
The arithmetic mean of a set of observations is the value obtained by dividing
their sum by the number of observations. If there are n observations
$x_1, x_2, x_3, \dots, x_n$, the arithmetic mean ($\bar{x}$) of these observations is
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
In case of an individual series, the mean is likewise $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
In case of a discrete series, where the variable takes the values $x_1, x_2, \dots, x_n$
with frequencies $f_1, f_2, \dots, f_n$, the mean is calculated as
$$\bar{x} = \frac{f_1 x_1 + f_2 x_2 + \dots + f_n x_n}{f_1 + f_2 + \dots + f_n} = \frac{1}{N}\sum_{i=1}^{n} f_i x_i$$
where $N = \sum_{i=1}^{n} f_i$ is the total frequency.
For a continuous series, the value of x is the mid value of the corresponding class.
If the values of f and/or x are large, calculation with the formula
given above is tedious and time consuming. Two alternative methods are then used:
i. Short cut method and
ii. Deviation method.
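The frequency-weighted formula for a discrete series can be sketched in a few lines of Python. The litter sizes and frequencies below are hypothetical values made up for illustration, and `frequency_mean` is a name chosen here, not one from the text.

```python
def frequency_mean(values, freqs):
    """Mean of a discrete series: sum(f_i * x_i) / sum(f_i)."""
    total_frequency = sum(freqs)  # N in the formula above
    weighted_sum = sum(f * x for x, f in zip(values, freqs))
    return weighted_sum / total_frequency


# Hypothetical data: litter size and number of litters of that size.
litter_size = [3, 4, 5, 6, 7]
litters = [10, 27, 22, 4, 1]

mean_litter = frequency_mean(litter_size, litters)
print(round(mean_litter, 3))
```

The result is the same as expanding the table into one value per litter and averaging those values directly.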
Characteristics of mean
Demerits
i. It cannot be determined by inspection, nor can it be located graphically.
ii. The arithmetic mean cannot be used if we deal with qualitative characteristics.
iii. The arithmetic mean cannot be calculated if a single observation is missing or
lost, unless the arithmetic mean of the remaining observations is computed.
iv. The arithmetic mean is affected very much by extreme values.
v. In an extremely asymmetrical distribution, the arithmetic mean is not a suitable
measure of location.
b. Geometric mean:
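The geometric mean of a set of n observations is the nth root of their product.
Thus the geometric mean G of n observations $x_i$, where i = 1, 2, 3, ..., n, is given by
the following equation
$$G = (x_1 \cdot x_2 \cdot x_3 \cdots x_n)^{1/n}$$
Taking logarithms of both sides,
$$\log G = \frac{1}{n}(\log x_1 + \log x_2 + \dots + \log x_n) = \frac{1}{n}\sum_{i=1}^{n} \log x_i$$
$$G = \operatorname{antilog}\left[\frac{1}{n}\sum_{i=1}^{n} \log x_i\right]$$
In case of a frequency distribution, the geometric mean is given by
$$G = \operatorname{antilog}\left[\frac{1}{N}\sum_{i=1}^{n} f_i \log x_i\right]$$
where $N = \sum_{i=1}^{n} f_i$ is the total frequency.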
Merit and demerit of geometric mean
Merits
1. It is rigidly defined
2. It is based on all observations
3. It is not affected by fluctuations of sampling
Demerits
1. Geometric mean is not easy to understand
2. If any one of the observations is zero, geometric mean becomes zero.
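A minimal sketch of the logarithmic route to the geometric mean is given below. The function name `geometric_mean`, the optional `freqs` argument and the sample values are assumptions made for illustration. Base-10 logarithms are used, so the antilog is a power of 10.

```python
import math


def geometric_mean(values, freqs=None):
    """Geometric mean via logarithms: antilog of the mean of log x_i.

    If freqs is given, each log x_i is weighted by its frequency f_i.
    """
    if freqs is None:
        freqs = [1] * len(values)
    if any(x < 0 for x in values):
        raise ValueError("geometric mean is not defined for negative values")
    if any(x == 0 for x in values):
        # A single zero makes the product, and so the geometric mean, zero.
        return 0.0
    total_frequency = sum(freqs)
    mean_log = sum(f * math.log10(x) for x, f in zip(values, freqs)) / total_frequency
    return 10 ** mean_log  # antilog


print(geometric_mean([2, 8]))          # square root of 2 * 8
print(geometric_mean([2, 8], [3, 1]))  # fourth root of 2 * 2 * 2 * 8
```

Working with logarithms avoids forming a very large product when n is large, which is the practical reason for the antilog form of the formula.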
B. Median
Median of a distribution is the value of variable which divides the distribution into
two equal parts when the observations are arranged in an order. It is the value
such that the number of observations above it is equal to the number of
observations below it. Thus median of a set of observation is defined as the
middle value when the observations are arranged from lowest to highest or from
highest to lowest.
The sample median is the best estimate of the population median. In a symmetrical
distribution, the sample median is also an unbiased estimate of the population
mean; however, the sample mean is a more efficient estimate than the median.
When the n measurements are arranged in order, the median is
$$\text{Median} = X_{(n+1)/2}$$
where n is the number of measurements. For example, with n = 11,
$$\text{Median} = X_{(11+1)/2} = X_6$$
If the sample size is odd, the subscript is an integer and indicates the
datum that is the middle measurement in the ordered sample.
If the sample size is even, the subscript is a half-integer. There is no single
middle value in the ordered data set; there are two middle values, and the median
is the midpoint between them. For example, if there are 12 observations in a data
set, $X_1, X_2, \dots, X_{12}$, the subscript is (12 + 1)/2 = 6.5 and the median
is the midpoint between $X_6$ and $X_7$.
The median has the same units as the observations. If the data are plotted as a
frequency histogram, the median is the value of X that divides the area of the
histogram into two equal parts.
The estimation of the median for grouped data is different. We cannot use the
formula given above because, for grouped data, we know the class interval that
contains the median but cannot locate the median within that interval.
The following formula is used for grouped data.
$$\text{Median} = L + \frac{w}{f_m}\,(0.5\,n - cf_b)$$
Where,
L = lower class limit of the interval that contains the median
n = total frequency
$cf_b$ = cumulative frequency of all classes before the median class
$f_m$ = frequency of the class interval containing the median
w = interval width
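The grouped-data formula can be sketched as follows. The class limits and frequencies are hypothetical, and `grouped_median` is a name introduced for this example; the function assumes equal class widths, as recommended earlier.

```python
def grouped_median(lower_limits, freqs, width):
    """Median of grouped data: L + (w / f_m) * (0.5 * n - cf_b)."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("no observations")
    cumulative_before = 0  # cf_b: cumulative frequency before the current class
    for lower, f in zip(lower_limits, freqs):
        if cumulative_before + f >= 0.5 * n:
            # This class contains the median.
            return lower + (width / f) * (0.5 * n - cumulative_before)
        cumulative_before += f
    raise ValueError("median class not found")


# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50 (width 10).
lower_limits = [0, 10, 20, 30, 40]
freqs = [5, 8, 12, 6, 4]

median = grouped_median(lower_limits, freqs, width=10)
print(median)
```

Here n = 35, so the median class is the first one whose cumulative frequency reaches 17.5, which is the class starting at 20 with $cf_b$ = 13 and $f_m$ = 12.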
Characteristics of median
1. It is central value;
2. Only one median value can be found for a data set
3. It is not influenced by extreme values (outlier)
4. Median of sub set cannot be combined to determine the median of entire
data set.
5. For group data, its value is rather stable even when the data are organized
into different categories
6. It is also applicable to quantitative data.
Merit of median
1. It is rigidly defined
2. It is easily understood and easy to calculate
3. It is not at all affected by extreme measurements
4. It can be calculated for distribution with open-end classes.
Demerits of median
1. In case of even number of observations median cannot be determined
exactly.
C. Mode:
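For grouped data, the mode is calculated as
$$\text{Mode} = l + \frac{h\,(f_1 - f_0)}{2 f_1 - f_0 - f_2}$$
Where,
l = lower limit of the modal class,
h = magnitude (width) of the modal class,
$f_1$ = frequency of the modal class,
$f_0$ = frequency of the class preceding the modal class,
$f_2$ = frequency of the class succeeding the modal class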
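A short sketch of the grouped-mode formula follows, using hypothetical classes and a helper name (`grouped_mode`) chosen for this example. It takes the first class with the highest frequency as the modal class and treats a missing neighbouring class as having frequency 0, both of which are assumptions of this sketch.

```python
def grouped_mode(lower_limits, freqs, width):
    """Mode of grouped data: l + h * (f1 - f0) / (2 * f1 - f0 - f2)."""
    modal_index = max(range(len(freqs)), key=lambda i: freqs[i])
    f1 = freqs[modal_index]
    f0 = freqs[modal_index - 1] if modal_index > 0 else 0
    f2 = freqs[modal_index + 1] if modal_index < len(freqs) - 1 else 0
    denominator = 2 * f1 - f0 - f2
    if denominator == 0:
        raise ValueError("modal class is not distinct from its neighbours")
    return lower_limits[modal_index] + width * (f1 - f0) / denominator


# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50 (width 10).
lower_limits = [0, 10, 20, 30, 40]
freqs = [5, 8, 12, 6, 4]

mode = grouped_mode(lower_limits, freqs, width=10)
print(mode)
```

Regrouping the same observations into different classes would generally change this value, which is the sensitivity noted in the characteristics of the mode.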
Characteristics of mode
1. It is most frequent measurement in the data set.
2. There can be more than one mode for a data set.
3. It is not influenced by extreme measurements.
4. Modes of subset cannot be combined to determine the mode of the
complete data set.
5. For grouped data, its value can change depending on the categories used.
6. It is applicable for both qualitative and quantitative data
Merit of mode
1. Mode is readily comprehensible and easy to calculate
2. Mode is not affected by extreme values
Demerits of mode
1. Mode is ill defined. It is not always possible to find a clearly defined mode.
In some distributions, mode could be bimodal or multimodal.
2. It is not based on all the observations
3. Mode is affected to a relatively greater extent by fluctuations of sampling.
Measures of dispersion
Range
The range is the difference between the highest and lowest measurements in a
group of data. If the sample measurements are arranged in ascending order of
magnitude, the range is given as
$$\text{Range} = X_n - X_1$$
The range is a simple and crude measure of dispersion. It does not take into
account any observations except the highest and lowest values. It is based on two
extreme values, which are subject to chance fluctuations from sample to sample. It
is unlikely that the highest and lowest measurements in the population occur in
the sample.
Quartile
The range is a biased and inefficient estimate and is very sensitive to only two
measurements, i.e. the largest and smallest observations. The quartile deviation
is another measure of dispersion that gives more reliable information than the
range. The quartile deviation, or semi-interquartile range, is given by
$$\text{Semi-interquartile range} = \frac{Q_3 - Q_1}{2}$$
Mean Deviation
$$\text{sample mean deviation} = \frac{1}{n}\sum f_i\,|x_i - \bar{x}|$$
Standard deviation is the positive square root of the arithmetic mean of the
squared deviations of the given values from their mean. The standard deviation of
a population is denoted by the Greek letter σ and is given by the following formula.
$$\sigma = \sqrt{\frac{1}{N}\sum (x_i - \mu)^2}$$
The best estimate of the population standard deviation is the sample standard
deviation, which is given by the following formula.
$$s = \sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2}$$
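For a frequency distribution, the corresponding formulas are
$$\sigma = \sqrt{\frac{1}{N}\sum_{i} f_i\,(x_i - \mu)^2}$$
$$s = \sqrt{\frac{1}{n-1}\sum_{i} f_i\,(x_i - \bar{x})^2}$$
where N (for a population) or n (for a sample) is the total frequency $\sum_i f_i$.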
Coefficient of variation
$$C.V. = \frac{s}{\bar{x}} \times 100$$
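The measures of dispersion described in this section can be computed with Python's standard `statistics` module, as sketched below. The data values are hypothetical. Note that quartile conventions differ between textbooks and software; `statistics.quantiles` uses its default ("exclusive") method here, which may not match a hand calculation exactly.

```python
import statistics

# Hypothetical sample of measurements.
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)                  # X_n - X_1
mean = statistics.mean(data)
mean_deviation = sum(abs(x - mean) for x in data) / len(data)

sigma = statistics.pstdev(data)   # population formula, divides by N
s = statistics.stdev(data)        # sample formula, divides by n - 1
cv = s / mean * 100               # coefficient of variation, in percent

q1, q2, q3 = statistics.quantiles(data, n=4)        # quartiles
semi_interquartile_range = (q3 - q1) / 2

print(data_range, mean_deviation, sigma)
print(round(s, 3), round(cv, 1), semi_interquartile_range)
```

The sample standard deviation is slightly larger than the population value for the same numbers because of the n − 1 divisor.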
Elementary Probability
In statistical procedures, statements are made about the population based
on the analysis of a sample. First, the sample data are described graphically and
with other descriptive techniques. These are just methods of summarizing and
describing the sample. We also need to assess the degree of accuracy with which
the sample statistics represent the population. Probability plays a role in
making inferences from sample results to conclusions about the population. Thus
probability is the tool that enables us to make inferences.
Random experiment
A possible result of a random experiment is known as a sample point or an
elementary event or an event or an outcome. The set of all possible outcomes of
a random experiment is known as a sample space and is denoted by S.
For example, if e1, e2, e3, . . ., en are n mutually exclusive outcomes or
sample points of a random experiment, then set
S = {e1, e2, . . ., en}
is the sample space of the random experiment.
The elements of sample space S have the following properties:
a) Each element ei of the sample space is an outcome or a sample point of
the random experiment.
b) Any result (outcome) of a trial in a random experiment corresponds to one
and only one element of sample space S.
Examples
i) In throwing a die, the sample space is
S = {1, 2, 3, 4, 5, 6}
ii) In tossing two fair coins, the sample space is
S = {HH, HT, TH, TT}
iii) In throwing two dice, the sample space is
S = {(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (2, 1) . . . (6, 1)
(6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}.
                            2nd die
               1       2       3       4       5       6
First die  1   (1, 1)  (1, 2)  (1, 3)  (1, 4)  (1, 5)  (1, 6)
           2   (2, 1)  (2, 2)  (2, 3)  (2, 4)  (2, 5)  (2, 6)
           3   (3, 1)  (3, 2)  (3, 3)  (3, 4)  (3, 5)  (3, 6)
           4   (4, 1)  (4, 2)  (4, 3)  (4, 4)  (4, 5)  (4, 6)
           5   (5, 1)  (5, 2)  (5, 3)  (5, 4)  (5, 5)  (5, 6)
           6   (6, 1)  (6, 2)  (6, 3)  (6, 4)  (6, 5)  (6, 6)
The events are said to be equally likely events if they have equal chance of
occurrence in a random experiment.
For example,
a) In tossing of an unbiased or uniform coin, head (H) and tail (T) are equally
likely events.
b) In throwing of an unbiased or regular die, all the six faces are equally likely
events.
The events are said to be mutually exclusive if the occurrence of one event
excludes the occurrence of other event(s) at a time.
Thus, for mutually exclusive events, the success of an event necessitates the
failure of others in the same experiment. For example;
a) In throwing of a die, the six faces numbered 1, 2, . . . 6 are mutually
exclusive.
b) In tossing of a coin, head (H) and tail (T) are mutually exclusive.
c) Selecting a jack and selecting a spade are not mutually exclusive.
The events are said to be independent of each other if the occurrence of one
event does not affect the occurrence of the other.
For example, in tossing two unbiased coins, the event of getting head or tail in
one coin does not affect the event of getting head or tail in the second coin.
Two events are said to be dependent if the occurrence of one event affects the
occurrence of the other.
For example, suppose that from a deck of 52 cards, two cards are drawn in
succession without replacement. If we would like to draw a heart on both draws,
the number of hearts available in the deck for the second draw depends on
whether the first draw is or is not a heart. Hence the second drawing is
dependent on the first.
Exhaustive Cases
Favorable Cases
Probability
If an event can happen in m ways out of n equally likely, mutually exclusive and
exhaustive possibilities, i.e. for an event A the number of favorable cases is m
and n is the number of all possible outcomes, then the probability of the
occurrence of the event A is defined as the ratio of the number of favorable cases
to the number of all possible outcomes.
Thus,
Probability of event A = (number of favorable cases) / (number of all possible outcomes)
= m / n
In other words, if P(A) is the probability of the occurrence of the event A, then
$$P(A) = \frac{m}{n}$$
Theorems of Probability
There are two theorems (or laws) of probability:
1. Additive law of probability (or total probability law)
2. Multiplicative law of probability (or Compound probability law).
Additive Law of Probability
If A and B are any two events of a sample space which are disjoint or mutually
exclusive, then
$$P(A \text{ or } B) = P(A \cup B) = P(A) + P(B)$$
For three mutually exclusive or disjoint events A, B and C of a sample space,
$$P(A \text{ or } B \text{ or } C) = P(A \cup B \cup C) = P(A) + P(B) + P(C)$$
Multiplicative Law of Probability
If A and B are two independent events, then the probability of their occurrence
(happening) together is equal to the product of their respective probabilities.
$$P(A \text{ and } B) = P(A \cap B) = P(A) \cdot P(B)$$
For three independent events A, B and C,
$$P(A \text{ and } B \text{ and } C) = P(A \cap B \cap C) = P(A) \cdot P(B) \cdot P(C)$$
Example
A box contains 2 red, 3 white and 4 black balls. A person draws one ball from the
box at random. Find the probability that the drawn ball is red or white.
Solution
Total number of balls = 2 + 3 + 4 = 9
One ball is drawn at random from the box, so the
total number of possible outcomes (n) = 9C1 = 9
Let 'A' denote the event of drawing a red ball. Since there are 2 red balls in the
box,
Number of favorable cases (m) = 2C1 = 2
P(A) = 2/9
Again, let 'B' denote the event of drawing a white ball. Since there are 3 white
balls, the number of favorable cases (m) = 3C1 = 3.
P(B) = 3/9 = 1/3
A and B are mutually exclusive events, so the required probability of getting a
red or white ball can be computed by using the addition theorem of probability.
P(A or B) = P(A) + P(B)
= 2/9 + 3/9 = 5/9
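The two laws can be checked with exact fractions. The sketch below reuses the box of balls from the example; the dictionary `box`, the helper `prob`, and the two-dice case at the end are names and an extra illustration introduced here.

```python
from fractions import Fraction

# Contents of the box from the example.
box = {"red": 2, "white": 3, "black": 4}
n = sum(box.values())  # number of equally likely outcomes


def prob(colour):
    """P(colour) = favorable cases / all possible cases."""
    return Fraction(box[colour], n)


# Additive law: red and white are mutually exclusive on a single draw.
p_red_or_white = prob("red") + prob("white")
print(p_red_or_white)

# Multiplicative law: two dice are independent, so
# P(six on first and six on second) = P(six) * P(six).
p_two_sixes = Fraction(1, 6) * Fraction(1, 6)
print(p_two_sixes)
```

Using `Fraction` keeps the results as exact ratios, which makes them easy to compare with the hand calculation.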
Normal distribution
Continuous variables are those variables that flow without any break from one
value to the next. Continuous variables, i.e. data measured on an interval or
ratio scale, may follow many distributions. One of these distributions is
characterized by many observations falling around the mean and fewer observations
lying towards the extremes. If the number of observations is large, the frequency
distribution of such a variable gives a bell-shaped curve, which is called the
normal curve. Many variables of interest have a frequency distribution that can be
approximated by the normal curve. Such a distribution is called a normal
distribution; that is, a normally distributed random variable gives the normal
curve. An example is the milk yield of cattle of a particular breed.
Since the normal distribution has been tabulated, areas under the normal curve can
be used to approximate the probabilities associated with the variables of interest
in an experiment. The normal random variable and its associated distribution play
an important role in statistical inference. The relative frequency histogram for a
normal random variable is called the normal curve or normal probability
distribution. In this distribution the height of the curve at $x_i$ is expressed
by the relation
$$y_i = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-(x_i - \mu)^2 / 2\sigma^2}$$
Where,
$y_i$, the height of the curve, is the normal density, and there are two
parameters, µ and σ. For a given µ there are infinitely many curves (one for each
value of σ) and similarly, for a given σ, there are infinitely many curves (one
for each value of µ).
A normal curve with µ = 0 and σ = 1 is called the standardized normal curve and
its distribution is called the standard normal distribution.
A characteristic of the normal distribution is that nearly all measurements lie
within 3 standard deviations of the mean. If we select a measurement at random
from a population of measurements that possesses a mound-shaped distribution, the
probability is approximately 0.68 that the measurement lies within 1 standard
deviation of the mean (μ ± σ). Similarly, the probability is 0.954 that a value
will lie in the interval μ ± 2σ and 0.997 that it will lie in the interval μ ± 3σ.
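These three probabilities can be reproduced without a printed table. The sketch below relies on the standard identity P(|Z| < k) = erf(k/√2) for a standard normal variable and Python's `math.erf`; the helper name `prob_within` is chosen for this example.

```python
import math


def prob_within(k):
    """Probability that a normal value lies within k standard deviations
    of the mean, i.e. P(mu - k*sigma < X < mu + k*sigma)."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
```

The result does not depend on µ or σ, which is why a single standard table serves every normal distribution.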
1. The curve is bell shaped and symmetrical about the line x=µ
2. Mean, median and mode of the distribution coincide
3. As x moves away from µ in either direction, f(x) decreases rapidly; the maximum
probability occurs at the point x = µ.
6. Areas property
Moments about the mean are often used in statistics. $\sum (x-\mu)^p / N$ is termed
the pth moment about the mean. The first moment about the mean, $\sum (x-\mu) / N$,
is zero. The second moment about the mean, $\sum (x-\mu)^2 / N$, is the population
variance. The third moment about the mean, $\sum (x-\mu)^3 / N$, tells us about the
symmetry of the distribution. The sample statistic based on the third moment is
given as
$$k_3 = \frac{n \sum (x_i - \bar{x})^3}{(n-1)(n-2)}$$
$$g_1 = \frac{k_3}{s^3} = \frac{k_3}{\sqrt{(s^2)^3}}$$
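A minimal sketch of the skewness statistic $g_1$ is shown below. It assumes the third-moment statistic takes the form $k_3 = n\sum(x_i-\bar{x})^3 / [(n-1)(n-2)]$; the function name `sample_skewness` and the three small data sets are hypothetical.

```python
def sample_skewness(data):
    """g1 = k3 / s**3, with k3 = n * sum((x - mean)**3) / ((n - 1) * (n - 2))."""
    n = len(data)
    if n < 3:
        raise ValueError("at least three observations are needed")
    mean = sum(data) / n
    s_squared = sum((x - mean) ** 2 for x in data) / (n - 1)  # sample variance
    k3 = n * sum((x - mean) ** 3 for x in data) / ((n - 1) * (n - 2))
    return k3 / s_squared ** 1.5


print(sample_skewness([1, 2, 3, 4, 5]))     # symmetric data
print(sample_skewness([1, 1, 1, 2, 10]))    # long tail to the right
print(sample_skewness([1, 9, 10, 10, 10]))  # long tail to the left
```

A value near zero indicates symmetry, a positive value a right-skewed distribution and a negative value a left-skewed distribution.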
[Figure: left-skewed distribution and right-skewed distribution]
Here also, k4 has units raised to the fourth power. The following statistic has no
units and measures the kurtosis.
$$g_2 = \frac{k_4}{s^4}$$
If g2 is less than zero, the distribution is called platykurtic; if g2 is
greater than zero, the distribution is leptokurtic.
Proportion of a normal distribution
We know that the normal curve depends on two parameters, the mean and the
standard deviation. This means that there is a whole family of normal curves, one
for each combination of mean and standard deviation. The standard table for the
normal distribution is designed for the distribution with mean = 0 and standard
deviation = 1. In practice, we have a normal distribution with mean µ and standard
deviation σ. Whenever we need to use the standard table, we must rescale the
distribution of interest so that the mean becomes 0 and the standard deviation
becomes 1. This rescaled measurement is given by
$$Z = \frac{x_i - \mu}{\sigma}$$
Z is termed the standard normal deviate, normal deviate or standard score.
The mean of the standard score is 0 and its standard deviation is 1.
Example
1. The daily milk production of cows in a Jersey herd is normally distributed
with mean μ = 70 pounds and standard deviation σ = 13 pounds.
a. What is the probability that the milk production for a cow chosen at random
will be less than 60 pounds? (0.2206)
b. What is the probability that the milk production for a cow chosen at random
will be greater than 90 pounds? (0.0618)
c. What is the probability that the milk production for a cow chosen at random
will be between 60 pounds and 90 pounds? (0.7176)
Solution:
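To answer part (a), we must compute the z value that corresponds to x = 60.
$$Z = \frac{x - \mu}{\sigma}$$
$$z = \frac{60 - 70}{13} = -0.77$$
From the standard normal table, P(Z < 0.77) = 0.7794, so
P(X < 60) = P(Z < −0.77) = 1 − 0.7794 = 0.2206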
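All three parts of the milk production example can be checked numerically. The sketch below computes the normal cumulative probability from `math.erf` rather than from a table; `normal_cdf` is a helper name introduced here. Because z is not rounded to two decimals, the results differ from the tabulated answers in the fourth decimal place.

```python
import math


def normal_cdf(x, mu, sigma):
    """P(X < x) for a normal variable with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma  # standard score
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


mu, sigma = 70, 13  # pounds, from the example

p_less_60 = normal_cdf(60, mu, sigma)        # part a
p_more_90 = 1 - normal_cdf(90, mu, sigma)    # part b
p_between = 1 - p_less_60 - p_more_90        # part c

print(round(p_less_60, 4), round(p_more_90, 4), round(p_between, 4))
```

Part (c) uses the fact that the three regions (below 60, between 60 and 90, above 90) together cover the whole area under the curve.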
The distribution of means (Sampling distributions)
If random samples are taken from a normal population, the means of these
samples will follow a normal distribution. The distribution of means from a
non-normal population will not be normal but will be approximately normal if the
samples are large. This is called the Central Limit Theorem.
The variance and standard deviation of the distribution of means are
$$\sigma_{\bar{x}}^2 = \frac{\sigma^2}{n}$$
$$\sigma_{\bar{x}} = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}}$$
They are estimated from a sample by
$$s_{\bar{x}}^2 = \frac{s^2}{n}$$
$$s_{\bar{x}} = \frac{s}{\sqrt{n}}$$
This is the sample standard error (SE) of the mean. The importance of the SE lies
in hypothesis testing. Its magnitude is helpful in determining the precision with
which the statistics and variability are reported.
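The relation between the variance of the means and σ²/n can be verified exactly on a small case. The sketch below enumerates every possible sample of size 2, drawn with replacement, from a hypothetical population consisting of the six faces of a die; the population and sample size are assumptions chosen to keep the enumeration small.

```python
import itertools
import math
import statistics

population = [1, 2, 3, 4, 5, 6]  # hypothetical population: faces of a die
n = 2                            # sample size

sigma_squared = statistics.pvariance(population)

# Every equally likely sample of size n drawn with replacement.
sample_means = [sum(sample) / n
                for sample in itertools.product(population, repeat=n)]

variance_of_means = statistics.pvariance(sample_means)
standard_error = math.sqrt(variance_of_means)

print(sigma_squared / n, variance_of_means)
print(math.sqrt(sigma_squared) / math.sqrt(n), standard_error)
```

The two numbers on each printed line agree, showing that the spread of the sample means shrinks with the square root of the sample size.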
Introduction to Statistical hypothesis testing
hypothesized value. Statistical testing that examines the differences in only one of
the possible directions is called one tailed testing.
In the example given above, the null and alternative hypotheses for one tailed
testing are as follows
Ho: µ ≤ 20 and Ha: µ > 20
Ho: µ ≥ 20 and Ha: µ < 20
Two tailed testing: The statistical testing that tells the differences in two
directions is called two tailed testing. In the given example, the null and
alternative hypotheses for two tailed testing are as follows
Ho: µ = 20 and Ha: µ ≠ 20
Once the null and alternative hypotheses are stated, we need to take a sample
from the population of interest. The decision on whether the data support the
research hypothesis is based on a quantity computed from the sample data; such a
quantity is called a test statistic. The mean, Z and t are examples of test
statistics.
We need a criterion for rejecting or not rejecting the null hypothesis in a
statistical test. When the mean is used as the test statistic, a very large or a
very small mean might be obtained. A large value of Z might be obtained even when
the null hypothesis is true. However, the larger the absolute value of Z, the
smaller the probability of obtaining such a value when the null hypothesis is
true. The probability used as the criterion for rejection of the null hypothesis
is called the level of significance and is denoted by α. A probability of 5% or
less is commonly used. The value of the test statistic corresponding to α is
termed the critical value of that test statistic. In the Z distribution, the
critical value for a two-tailed test at the 5% level of significance is 1.96. It
follows that a true null hypothesis will be rejected with frequency α. This is an
error committed in drawing a conclusion: the rejection of the null hypothesis
when it is true is called a type I error, also called an α error. On the other
hand, when the null hypothesis is in fact false, the statistical test may fail to
detect this. In that case we commit the error of not rejecting the null
hypothesis when it is false. The probability of committing this error is denoted
by β, and the error is called a type II error (β error).
                         If Ho is true    If Ho is false
If Ho is rejected        Type I error     No error
If Ho is not rejected    No error         Type II error
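The link between the level of significance and the critical value 1.96 can be illustrated numerically. The sketch below computes the two-tailed probability of a Z value from the standard identity P(|Z| ≥ z) = erfc(z/√2); the function names and the example Z values are assumptions made for illustration.

```python
import math


def two_tailed_p(z):
    """Probability of a standard normal value at least as extreme as z
    in either direction, P(|Z| >= |z|)."""
    return math.erfc(abs(z) / math.sqrt(2))


def reject_null(z, alpha=0.05):
    """Reject Ho when the two-tailed probability does not exceed alpha."""
    return two_tailed_p(z) <= alpha


print(round(two_tailed_p(1.96), 4))  # close to the 5% level of significance
print(reject_null(2.5), reject_null(1.0))
```

When Ho is true, Z values beyond ±1.96 still occur about 5% of the time, and each such occurrence leads to a type I error.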
The final step of hypothesis testing is drawing a conclusion based on the test
and checking the assumptions. Most statistical tests rest on certain
assumptions. If the assumptions are violated, inferences from the
statistical test are not valid.
Student's t - test
In the previous part, we discussed the Z test, in which we make inferences about
the population from the sample under the following assumptions.
The population variance is known; if the population variance is not known, the
sample variance is used as the unbiased estimator of the population variance.
Another assumption was that the sample size is large.
In certain situations the sample size is not large enough and the population
variance is unknown; we then use the t test in place of the Z test. This test is
also called Student's t test. It was devised by W.S. Gosset, who published it
under the pen name "Student".
When he used $Z = \frac{\bar{y} - \mu}{\sigma/\sqrt{n}}$ for a small sample, replacing σ by s, he was falsely
rejecting the null hypothesis at a slightly higher rate than that specified by α.
He then set out to derive the distribution of the new statistic
$$\frac{\bar{y} - \mu}{s/\sqrt{n}} \quad \text{for } n < 30$$
This quantity is called the t statistic and its distribution is called Student's t
distribution.
So
$$t = \frac{\bar{y} - \mu}{s/\sqrt{n}} = \frac{\bar{y} - \mu}{\text{SE of } \bar{y}}$$
Properties of t distribution
Use of t test
1. One sample test of whether the mean of a normally distributed population
has the value specified in the null hypothesis.
2. Two sample test in which the null hypothesis states that the means of two
normally distributed populations are equal.
3. Paired sample t test in which the data are taken from paired samples.
4. A test of whether a regression coefficient differs significantly from zero.
One sample t test
Example
Solution:
The null hypothesis is that the samples of the ice cream do not have more than
0.3 MPN S. enteritidis.
Ho: µ ≤ 0.3
Alternative hypothesis:
The ice cream produced by this factory is contaminated with S. enteritidis and
contains more than 0.3 MPN S. enteritidis.
Ha: µ > 0.3
The one sample t test is the appropriate statistical procedure in this situation,
and the t value is the test statistic,
where
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
The procedure for computing the test statistic is shown below.
Variable (x)    x − x̄      (x − x̄)²
0.593            0.137     0.018769
0.142           -0.314     0.098596
0.329           -0.127     0.016129
0.691            0.235     0.055225
0.231           -0.225     0.050625
0.793            0.337     0.113569
0.519            0.063     0.003969
0.392           -0.064     0.004096
0.418           -0.038     0.001444
Total                      0.362422
With n = 9 and $\bar{x}$ = 0.456,
$$\text{variance} = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{0.362422}{8} = 0.0453$$
$$SD = \sqrt{0.0453} = 0.2128$$
$$t = \frac{0.456 - 0.3}{0.2128/\sqrt{9}} = 2.21$$
Rejection region:
The rejection region is obtained from the t distribution table using the level of
significance and the degrees of freedom. The level of significance in this case is
0.01 and the degrees of freedom are 9 − 1 = 8. From the t table, the critical
value of t is 2.896. Since this is a one tailed test, the rejection region lies in
one direction only: it is the region at or beyond the t value of 2.896.
t ≥ 2.896
The calculated t value does not lie in the rejection region. Thus we do not have
sufficient evidence to reject the null hypothesis. The conclusion is that the
average level of S. enteritidis in the ice cream sampled has not been shown to be
higher than 0.3 MPN/g.
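The whole one sample calculation can be reproduced in a few lines. The sketch below uses the nine measurements from the table and the critical value 2.896 quoted from the t table in the text; the variable names are chosen for this example. Because the mean and standard deviation are not rounded at intermediate steps, the t value may differ from the hand calculation in the second decimal place.

```python
import math
import statistics

# MPN of S. enteritidis in the nine ice cream samples (from the table above).
mpn = [0.593, 0.142, 0.329, 0.691, 0.231, 0.793, 0.519, 0.392, 0.418]
mu_0 = 0.3          # hypothesized mean under Ho
t_critical = 2.896  # one-tailed critical value, alpha = 0.01, df = 8 (from the text)

n = len(mpn)
mean = statistics.mean(mpn)
s = statistics.stdev(mpn)              # sample SD, divisor n - 1
standard_error = s / math.sqrt(n)
t = (mean - mu_0) / standard_error

reject_null = t >= t_critical
print(round(mean, 3), round(s, 4), round(t, 2), reject_null)
```

The computed t is about 2.2, below the critical value, so the null hypothesis is not rejected, in agreement with the conclusion above.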
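For two independent samples, the t statistic is
$$t = \frac{\bar{x}_1 - \bar{x}_2}{SE_{\bar{x}_1 - \bar{x}_2}}$$
$$SE_{\bar{x}_1 - \bar{x}_2} = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$
$$s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}$$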
Example:
An experiment was conducted to evaluate the effectiveness of a treatment for
tapeworm in the stomach of sheep. A random sample of 24 worm-infected lambs
of approximately the same age and health was randomly divided into 2 groups.
Twelve of them were injected with the drug and the remaining 12 were left
untreated. After 6 months, the lambs were slaughtered and the following worm
counts were recorded.
Drug-treated sheep   18 43 28 50 16 32 13 35 38 33  6  7
Untreated sheep      40 54 26 63 21 37 39 23 48 58 28 39
Test whether the data provide sufficient evidence that the drug is effective
against the tapeworm. (Use α = 0.05)
Solution:
Let µ1 be the mean worm count of untreated sheep and µ2 the mean worm count of
drug-treated sheep.
Null hypothesis:
The new drug is not effective against the tapeworm and will not reduce the
number of tapeworms in the stomach of sheep after 6 months.
Ho: µ1 - µ2 ≤ 0
Alternative hypothesis:
The new drug is effective against the tapeworm and will reduce the number of
tapeworms in the stomach of the sheep.
Ha: µ1 - µ2 > 0
Test statistics:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{SE_{\bar{x}_1 - \bar{x}_2}}$$
x1 = untreated sheep, x2 = drug treated sheep
x1      x2      x1 - x̄1   x2 - x̄2   (x1 - x̄1)²   (x2 - x̄2)²
40      18        0.33     -8.58       0.1089      73.6164
54      43       14.33     16.42     205.3489     269.6164
26      28      -13.67      1.42     186.8689       2.0164
63      50       23.33     23.42     544.2889     548.4964
21      16      -18.67    -10.58     348.5689     111.9364
37      32       -2.67      5.42       7.1289      29.3764
39      13       -0.67    -13.58       0.4489     184.4164
23      35      -16.67      8.42     277.8889      70.8964
48      38        8.33     11.42      69.3889     130.4164
58      33       18.33      6.42     335.9889      41.2164
28       6      -11.67    -20.58     136.1889     423.5364
39       7       -0.67    -19.58       0.4489     383.3764
Total   476     319                  2112.6668    2268.9168
Mean    39.67   26.58
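\[ s_p = \sqrt{\frac{2112.6668 + 2268.9168}{12 + 12 - 2}} = 14.11 \]
\[ SE_{\bar{x}_1 - \bar{x}_2} = 14.11 \sqrt{\frac{1}{12} + \frac{1}{12}} = 5.76 \]
\[ t = \frac{\bar{x}_1 - \bar{x}_2}{SE_{\bar{x}_1 - \bar{x}_2}} = \frac{39.67 - 26.58}{5.76} = 2.27 \]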
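Rejection region:
This is a one-tailed test. With α = 0.05 and 12 + 12 - 2 = 22 degrees of freedom,
the critical value of t from the table is 1.717, so the rejection region is
t ≥ 1.717.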
Conclusion:
The calculated t value of 2.27 lies in the rejection region, so we have sufficient
evidence to reject the null hypothesis. We conclude that the drug reduces the
number of tapeworms in the stomach of sheep.
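The pooled two-sample calculation for the tapeworm example can be sketched in Python from the raw worm counts. The function name `pooled_t` is illustrative, and the critical value 1.717 is the standard one-tailed table value for α = 0.05 with 22 degrees of freedom.

```python
import math
from statistics import mean, variance

# Worm counts from the example (x1 = untreated, x2 = drug treated).
untreated = [40, 54, 26, 63, 21, 37, 39, 23, 48, 58, 28, 39]
treated = [18, 43, 28, 50, 16, 32, 13, 35, 38, 33, 6, 7]


def pooled_t(sample1, sample2):
    """Return (t, degrees of freedom) for the pooled-variance two-sample t test."""
    n1, n2 = len(sample1), len(sample2)
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / df
    standard_error = math.sqrt(pooled_var) * math.sqrt(1 / n1 + 1 / n2)
    return (mean(sample1) - mean(sample2)) / standard_error, df


t_value, df = pooled_t(untreated, treated)
critical_t = 1.717  # one-tailed, alpha = 0.05, 22 degrees of freedom (t table)
reject_null = t_value >= critical_t

print(f"t = {t_value:.2f} with {df} df, reject H0: {reject_null}")
```

Passing the untreated group first matches the hypotheses in the example, where a positive difference means fewer worms in treated sheep.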
The two-sample t test can be applied in two situations. In the first, the two
samples are independent, which means that no datum in one sample is associated
with any specific datum in the other sample. In the second, each observation in
one sample is in some way correlated with an observation in the other sample;
the t test can then be used to compare the means of the paired samples. The
paired-sample t test requires that each datum in one sample is correlated with
one and only one datum in the other sample. It does not make the normality and
equality of variance assumptions of the two-sample t test, but it assumes that
the differences come from a normally distributed population of differences. For
paired samples, the t statistic is given as
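\[ t = \frac{\bar{d}}{SE_{\bar{d}}} \]
where d̄ is the mean of the paired differences and SE_d̄ is its standard error,
the standard deviation of the differences divided by √n.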
Example
Deer   Hind leg length   Foreleg length
1      142               138
2      140               136
3      144               147
4      144               139
5      142               143
6      146               141
7      149               143
8      150               145
9      142               136
10     148               146
Solution
Null hypothesis:
The mean lengths of the hind leg and the foreleg are the same (Ho: µd = 0).
Alternative hypothesis:
The mean lengths of the hind leg and the foreleg are not the same (Ha: µd ≠ 0).
Test statistics
The t value is calculated as follows.
Deer   Hind leg length   Foreleg length   Difference (d)   d - d̄    (d - d̄)²
1      142               138                4               0.7      0.49
2      140               136                4               0.7      0.49
3      144               147               -3              -6.3     39.69
4      144               139                5               1.7      2.89
5      142               143               -1              -4.3     18.49
6      146               141                5               1.7      2.89
7      149               143                6               2.7      7.29
8      150               145                5               1.7      2.89
9      142               136                6               2.7      7.29
10     148               146                2              -1.3      1.69
Total                                      33                       84.10
Mean difference d̄ = 33/10 = 3.3
\[ s_d = \sqrt{\frac{84.1}{10 - 1}} = 3.057 \]
\[ SE_{\bar{d}} = \frac{3.057}{\sqrt{10}} = 0.967 \]
\[ t = \frac{3.3}{0.967} = 3.41 \]
Rejection region:
This is a two-tailed test. The critical value of t at the 5% level of significance
with 9 degrees of freedom is 2.262. The rejection region lies above 2.262 and
below -2.262.
Conclusion
The calculated t value lies in the rejection region, so we reject the null
hypothesis and conclude that the hind leg and foreleg lengths differ.
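A minimal Python sketch of the paired-sample calculation, using the deer leg lengths from the example. The variable names are illustrative; the critical value 2.262 is the two-tailed table value quoted in the text for 9 degrees of freedom.

```python
import math
from statistics import mean, stdev

# Leg lengths of the 10 deer in the example.
hind_leg = [142, 140, 144, 144, 142, 146, 149, 150, 142, 148]
foreleg = [138, 136, 147, 139, 143, 141, 143, 145, 136, 146]

# The paired test works on the within-animal differences only.
differences = [h - f for h, f in zip(hind_leg, foreleg)]
mean_difference = mean(differences)

# Standard error of the mean difference: sd of the differences / sqrt(n).
standard_error = stdev(differences) / math.sqrt(len(differences))
t_value = mean_difference / standard_error

critical_t = 2.262  # two-tailed, alpha = 0.05, 9 degrees of freedom (t table)
reject_null = abs(t_value) >= critical_t

print(f"mean d = {mean_difference:.1f}, SE = {standard_error:.3f}, t = {t_value:.2f}")
print(f"reject H0: {reject_null}")
```

Because the test is two-tailed, the absolute value of t is compared with the critical value.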
Analysis of Variance
The t test is used to compare one sample mean with a hypothesized value or two
sample means with each other. When measurements of a variable are obtained from
three samples to compare their means, there are three pairs to be compared, and
the number of pairs grows quickly with more samples. Repeated t tests are not
valid in this case, because performing several tests inflates the overall
probability of a Type I error. The appropriate procedure to compare the means of
three or more samples is analysis of variance (ANOVA). Analysis of variance is a
statistical procedure for testing the equality of more than two population means
by taking samples from the corresponding populations.
If we have to compare the means of three samples, our null hypothesis is
H0: μ1 = μ2 = μ3. In ANOVA, the F statistic needs to be calculated. The F
statistic was formerly known as the "variance ratio"; it is the ratio of the
treatment variance to the error variance. The error variance is the variance
common to all the groups of the samples.
Depending on the number of factors considered at a time, the analysis of variance
can be divided into single-factor or one-way analysis of variance, for one factor
at a time, and factorial or two-way analysis of variance, for two or more factors
taken at a time. The variables that are controlled and selected by the researcher
for comparison are called factors. On the other hand, response variables are the
measurements or observations that are recorded but not controlled by the
researcher.
Treatment and control: Treatments are the conditions made from the factors, and
the control treatment is a special treatment against which the effectiveness of
the other treatments is compared.
The experimental unit is the physical entity to which treatments are randomly
assigned. When one treatment is assigned to more than one experimental unit,
each repetition of that treatment is called a replication. Experimental error is
the variation in the response among experimental units that are assigned the same
treatment and are observed under the same experimental conditions.
Example: Nineteen pigs are assigned at random among four groups. Each group is
fed a different diet. The data are pig weights in kg after being raised on these
diets. We need to know whether the mean pig weights are the same for all four
diets.
Solution:
The null hypothesis:
The mean weights of pigs on the four diets are the same.
H0: µ1 = µ2 = µ3 = µ4
Alternative hypothesis:
The mean weights of the pigs on the four diets are not all the same.
At least one of the µ is different from the others.
In order to compare the means of three or more populations, the F statistic
needs to be calculated. For this purpose, we do an analysis of variance.
First, we need to calculate the correction factor (CF), which is estimated as
follows
\[ CF = \frac{\left( \sum_i \sum_j X_{ij} \right)^2}{N} \]
\[ CF = \frac{1482.2^2}{19} = 115627.202 \]
The sum of squares between groups (SSB) is then
\[ SSB = \sum_i \frac{\left( \sum_j X_{ij} \right)^2}{n_i} - CF \]
ANOVA TABLE
Source of    Degrees of     Sum of      Mean sum of square        F value
variance     freedom        square
Total        19 - 1 = 18    4354.698
Treatment    4 - 1 = 3      4226.348    4226.348/3 = 1408.783     F = MS groups / MS error
                                                                  = 1408.783/8.557 = 164.6
Error        18 - 3 = 15    128.35      128.35/15 = 8.557
The calculated F value is compared with the tabulated F value, or the probability
of getting this F value with the given degrees of freedom is estimated. In this
case, the tabulated F value at the 5% level of significance with 3 numerator and
15 denominator degrees of freedom is 3.29. Since the calculated F value is higher
than the tabulated F value, we reject the null hypothesis. Thus we conclude that
the four diets do not all have the same effect on the growth of pigs.
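The arithmetic of the ANOVA table can be reproduced from the total and treatment sums of squares quoted in the example. The sketch below is only a check of that arithmetic; the function name `anova_from_sums` and the dictionary keys are illustrative, and the raw pig weights are not needed.

```python
def anova_from_sums(ss_total, ss_treatment, n_total, n_groups):
    """Build one-way ANOVA quantities from the total and treatment sums of squares."""
    df_treatment = n_groups - 1
    df_error = n_total - n_groups
    ss_error = ss_total - ss_treatment
    ms_treatment = ss_treatment / df_treatment
    ms_error = ss_error / df_error
    return {
        "df_treatment": df_treatment,
        "df_error": df_error,
        "ss_error": ss_error,
        "ms_treatment": ms_treatment,
        "ms_error": ms_error,
        "f_value": ms_treatment / ms_error,
    }


# Sums of squares from the pig diet example: 19 pigs in 4 diet groups.
table = anova_from_sums(ss_total=4354.698, ss_treatment=4226.348, n_total=19, n_groups=4)

critical_f = 3.29  # tabulated F for 3 and 15 degrees of freedom
reject_null = table["f_value"] > critical_f

print(f"MS error = {table['ms_error']:.3f}, F = {table['f_value']:.1f}")
print(f"reject H0: {reject_null}")
```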
In the given example, a completely randomized design was used for the analysis
of variance. A statistical model can be used to express the experimental
conditions. In this case the statistical model is as follows
\[ y_{ij} = \mu + \alpha_i + e_{ij} \]
where
y_ij = dependent variable
µ = overall mean
α_i = effect of the ith level of the factor α
e_ij = random error
Once it is shown that the treatments differ significantly, the next step is to
find out which treatments differ from the others and which are the same. For this
purpose, multiple comparison tests need to be performed.
Multiple comparison tests are tests used to examine the differences between all
possible pairs of means. In ANOVA, we had H0: µ1 = µ2 = µ3 = µ4. If the null
hypothesis is rejected, at least one of the means is different from the others.
There are several possible alternative hypotheses, but we do not know which one
is true. To find out, we need to do multiple comparison tests. The following are
multiple comparison tests:
1. Tukey Test
2. Newman-Keuls Test
3. Duncan Multiple Range Test
4. Least Significant Difference (LSD) test.
Tukey test:
The Tukey test is used to test the null hypothesis H0: µA = µB against the
alternative hypothesis HA: µA ≠ µB, where µA and µB are the means of any possible
pair of treatments. For k treatments, there are k(k-1)/2 different pairs that need
to be tested.
Taking the ANOVA example of the effects of four different feeds on pig weights,
the following steps complete the Tukey test.
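Standard error: SE = √(s²/n) if nA = nB = n, and SE = √((s²/2)(1/nA + 1/nB)) if
nA ≠ nB, where s² is the error mean square from the ANOVA.
For two groups of 5 pigs: SE = √(8.557/5) = 1.31
For a group of 5 and a group of 4 pigs: SE = √((8.557/2)(1/5 + 1/4)) = 1.39
The test statistic q is the difference between two means divided by SE. For a
pair of means differing by 39.73, with unequal group sizes,
q = 39.73/1.39 = 28.58. Similarly, the value of q needs to be calculated for all
the pairs and entered in the comparison table.
In the given example, all the feed samples are different from each other.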
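The standard error and q calculations for the Tukey test can be sketched as follows. The function names are illustrative; the error mean square 8.557, the group sizes and the mean difference 39.73 come from the example. One general formula covers both cases, since it reduces to √(s²/n) when the two group sizes are equal.

```python
import math


def tukey_standard_error(ms_error, n_a, n_b):
    """SE for a Tukey comparison; equals sqrt(ms_error / n) when n_a == n_b."""
    return math.sqrt((ms_error / 2) * (1 / n_a + 1 / n_b))


def tukey_q(mean_difference, ms_error, n_a, n_b):
    """Studentized range statistic q for one pair of means."""
    return abs(mean_difference) / tukey_standard_error(ms_error, n_a, n_b)


MS_ERROR = 8.557  # error mean square from the ANOVA table

se_equal = tukey_standard_error(MS_ERROR, 5, 5)
se_unequal = tukey_standard_error(MS_ERROR, 5, 4)
q_value = tukey_q(39.73, MS_ERROR, 5, 4)

print(f"SE (5 vs 5) = {se_equal:.2f}, SE (5 vs 4) = {se_unequal:.2f}, q = {q_value:.2f}")
```

The q computed here uses the unrounded SE, so it can differ in the first decimal from a hand calculation that divides by the rounded value 1.39.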
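LSD is one of the multiple comparison tests, which also compares pairs of
treatments. The LSD value is calculated as follows.
\[ LSD = t_{\alpha/2} \sqrt{MS_W \left( \frac{1}{n_1} + \frac{1}{n_2} \right)} \quad \text{for } n_1 \neq n_2 \]
\[ LSD = t_{\alpha/2} \sqrt{\frac{2 \, MS_W}{n}} \quad \text{for } n_1 = n_2 = n \]
where MSW is the within-groups (error) mean square. The second formula is the
first with n1 = n2 = n.
In the given example of the effect of four different levels of feed on pig weight,
with t = 2.131 for 15 degrees of freedom, LSD will be
2.131 √(8.557 (1/4 + 1/5)) = 4.18 when n1 ≠ n2
and
2.131 √(2 × 8.557/5) = 3.94 when n1 = n2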
Samples ranked by increasing mean: 1 2 4 3
Largest - second smallest = 100.35 - 69.30 = 31.05 is larger than the LSD value.
Second largest - second smallest = 86.24 - 69.30 = 16.94 is also larger than the
LSD value.
So the second sample is also different from the rest.
In this way we come to the conclusion that each level of feed has a different
effect on pig weight.
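A short Python sketch of the LSD comparison, assuming the general formula t·√(MSW(1/n1 + 1/n2)); setting n1 = n2 = n in it gives t·√(2·MSW/n). The function name `lsd` is illustrative, and the means, the mean square 8.557 and t = 2.131 are the values quoted in the example.

```python
import math


def lsd(t_critical, ms_within, n1, n2):
    """Least significant difference for comparing two group means."""
    return t_critical * math.sqrt(ms_within * (1 / n1 + 1 / n2))


T_CRITICAL = 2.131  # two-tailed t for alpha = 0.05 and 15 degrees of freedom
MS_WITHIN = 8.557   # error mean square from the ANOVA table

lsd_unequal = lsd(T_CRITICAL, MS_WITHIN, 4, 5)
lsd_equal = lsd(T_CRITICAL, MS_WITHIN, 5, 5)

# Differences between means quoted in the example.
differences = {
    "largest - second smallest": 100.35 - 69.30,
    "second largest - second smallest": 86.24 - 69.30,
}
for label, diff in differences.items():
    significant = diff > max(lsd_unequal, lsd_equal)
    print(f"{label}: {diff:.2f}, significant: {significant}")
```

Comparing each difference with the larger of the two LSD values is a conservative shortcut for this sketch; strictly, each pair should use the LSD that matches its own group sizes.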
Simple Linear Regression
In science, predicting one variable on the basis of others is important and
useful. For example, an increase in the temperature of the medium will increase
the growth of bacteria, so the growth of the bacteria can be predicted from the
temperature at which they are stored. The pH of the soil can be predicted from
the amount of lime applied to the agricultural field. These are situations where
one variable can be predicted from the magnitude of another variable. The former
is called the dependent variable and the latter is called the independent
variable. The relationship between two variables may be the functional dependence
of one on the other: the magnitude of the dependent variable is a function of the
magnitude of the independent variable. The independent variable is also called
the regressor or predictor, and the dependent variable is called the response.
The simplest form of functional relationship of one variable to another in a
population is the simple linear regression, which can be stated as
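\[ Y_i = \beta_0 + \beta_1 X_i \]
In this equation, β0 and β1 are the population parameters. In a population,
however, the data are unlikely to lie exactly on the straight line given by the
above equation. Thus the equation becomes
\[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \]
where εi is the random error. The slope β1 is estimated from sample data by the
regression coefficient
\[ b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} \]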
Example
The age and wing length of 13 birds were measured and are presented as follows.
Age of birds (x)           3    4    5    6    8    9    10   11   12   14   15   16   17
Wing length of birds (y)   1.4  1.5  2.2  2.4  3.1  3.2  3.2  3.9  4.1  4.7  4.5  5.2  5.0
Here, the variable age of the birds is the independent variable and predicts the
length of the wing. Wing length is the dependent variable.
When we plot these variables, we get the following scatter plot.
[Figure: scatter plot of wing length of birds (y) against age of birds (x)]
Looking at this plot, we can see that the points do not lie exactly on a straight
line, so there is a random error term in the equation.
Age of      Wing length
birds (x)   of birds (y)   x - x̄   (x - x̄)²   y - ȳ     (x - x̄)(y - ȳ)
3           1.4            -7      49         -2.015    14.105
4           1.5            -6      36         -1.915    11.49
5           2.2            -5      25         -1.215     6.075
6           2.4            -4      16         -1.015     4.06
8           3.1            -2       4         -0.315     0.63
9           3.2            -1       1         -0.215     0.215
10          3.2             0       0         -0.215     0
11          3.9             1       1          0.485     0.485
12          4.1             2       4          0.685     1.37
14          4.7             4      16          1.285     5.14
15          4.5             5      25          1.085     5.425
16          5.2             6      36          1.785    10.71
17          5.0             7      49          1.585    11.095
Ʃx = 130    Ʃy = 44.4              262                  70.8
x̄ = 10      ȳ = 3.415
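The value of the regression coefficient (b1) is calculated with the following
equation
\[ b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{70.8}{262} = 0.270 \]
The intercept b0 follows from the fact that the fitted line passes through the
point of means:
\[ \bar{y} = b_0 + b_1 \bar{x} \]
\[ 3.415 = b_0 + 0.27 \times 10 \]
\[ b_0 = 0.715 \]
\[ Y = 0.715 + 0.270 X \]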
This is the equation we can use to predict the wing length of birds from their age.
Let's predict the wing length of bird with an age of 8 days.
Y = 0.715 + 0.27 × 8
Y = 0.715 + 2.16
Y = 2.875 cm
Here, we can see the difference between the observed and predicted values. The
difference between the observed value (3.1) and the predicted value (2.875) is
0.225, which is called the residual and is an estimate of the random error. The
sum of the residuals over all observations is zero. The regression equation is
fitted in such a way that the sum of the squares of the differences between the
observed and predicted values is a minimum.
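The least-squares fit for the bird data can be sketched in Python directly from the formulas for b1 and b0. The function name `fit_line` is illustrative. The sketch also checks numerically that the residuals sum to zero.

```python
# Age (days) and wing length (cm) of the 13 birds in the example.
ages = [3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17]
wing_lengths = [1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0]


def fit_line(x_values, y_values):
    """Return (b0, b1) of the least-squares line y = b0 + b1 * x."""
    n = len(x_values)
    mean_x = sum(x_values) / n
    mean_y = sum(y_values) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_values, y_values))
    sum_xx = sum((x - mean_x) ** 2 for x in x_values)
    b1 = sum_xy / sum_xx
    b0 = mean_y - b1 * mean_x  # the line passes through (mean_x, mean_y)
    return b0, b1


b0, b1 = fit_line(ages, wing_lengths)
predicted_at_8 = b0 + b1 * 8
residuals = [y - (b0 + b1 * x) for x, y in zip(ages, wing_lengths)]

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"predicted wing length at 8 days = {predicted_at_8:.3f} cm")
print(f"sum of residuals = {sum(residuals):.10f}")
```

The unrounded intercept differs slightly in the third decimal from the hand calculation, which uses b1 rounded to 0.27.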
In this example, the value of b0 is 0.715, which indicates that the predicted wing
length of a bird of that particular population at zero days of age is 0.715 cm.
The value of b1 in this example is 0.270, which means that the wing length of the
birds increases on average by 0.27 cm per day in the population from which the
sample was taken.
ANOVA TABLE
Source of     Degrees of   Sum of square                                        Mean sum of
variation     freedom                                                           square
Total SS      n-1          \sum (y - \bar{y})^2
Regression    1            \frac{[\sum (x - \bar{x})(y - \bar{y})]^2}{\sum (x - \bar{x})^2}
The residual sum of squares, which has n-2 degrees of freedom, is obtained by
subtracting the regression sum of squares from the total sum of squares. In the
given example the ANOVA table will be as follows
\[ r^2 = \frac{\text{regression SS}}{\text{total SS}} \]
\[ r^2 = 1 - \frac{\text{residual SS}}{\text{total SS}} \]
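The sums of squares and r² for the bird regression example can be computed with the sketch below. It assumes the usual formula for the regression sum of squares, [Σ(x − x̄)(y − ȳ)]² / Σ(x − x̄)², and checks that both expressions for r² agree; the variable names are illustrative.

```python
# Age (days) and wing length (cm) of the 13 birds in the regression example.
ages = [3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17]
wing_lengths = [1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(wing_lengths) / n

sum_xx = sum((x - mean_x) ** 2 for x in ages)
sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, wing_lengths))

# Total SS has n - 1 df, regression SS has 1 df, residual SS has n - 2 df.
total_ss = sum((y - mean_y) ** 2 for y in wing_lengths)
regression_ss = sum_xy ** 2 / sum_xx  # assumed standard formula
residual_ss = total_ss - regression_ss

r_squared = regression_ss / total_ss
r_squared_alt = 1 - residual_ss / total_ss

print(f"total SS = {total_ss:.3f}, regression SS = {regression_ss:.3f}")
print(f"residual SS = {residual_ss:.3f}, r^2 = {r_squared:.3f}")
```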
Simple Linear Correlation
\[ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} \]
In other words, the correlation coefficient can be stated as
\[ r = \frac{Cov(x, y)}{\sqrt{Var(x) \, Var(y)}} \]
From this equation, the correlation coefficient is defined as the covariance of
the two variables divided by the product of their standard deviations. The
numerator may be positive, negative or zero, but the denominator is always
positive. A positive numerator gives a positive correlation coefficient, which
indicates that the variables of interest are positively correlated. A negative
numerator gives a negative correlation coefficient, which indicates that the
variables of interest are negatively correlated.
[Figure: scatter plot showing positive correlation]
Similar scatter plots can be made for the negative correlation and no correlation
cases.
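The correlation coefficient is related to the regression coefficient b_yx:
\[ r = \frac{Cov(x, y)}{\sqrt{Var(x) \, Var(y)}} \]
\[ r = b_{yx} \times \frac{\sqrt{Var(x)}}{\sqrt{Var(y)}} \]
\[ r = b_{yx} \times \frac{sd_x}{sd_y} \]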
Example:
Tail and wing length of birds are considered correlated traits. The following
data set gives the wing length and tail length measured in 12 birds. Estimate the
correlation coefficient.
Wing length (x)   10.4  10.8  11.1  10.2  10.3  10.2  10.7  10.5  10.8  11.2  10.6  11.4
Tail length (y)    7.4   7.6   7.9   7.2   7.4   7.1   7.4   7.2   7.8   7.7   7.8   8.3
Solutions:
\[ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} \]
Mean x = 128.2/12 = 10.683
Mean y = 90.8/12 = 7.567
Bird   Wing length (x)   Tail length (y)   x - x̄    (x - x̄)²   y - ȳ    (y - ȳ)²   (x - x̄)(y - ȳ)
1      10.4              7.4               -0.283   0.0803     -0.167   0.0278      0.0472
2      10.8              7.6                0.117   0.0136      0.033   0.0011      0.0039
3      11.1              7.9                0.417   0.1736      0.333   0.1111      0.1389
4      10.2              7.2               -0.483   0.2336     -0.367   0.1344      0.1772
5      10.3              7.4               -0.383   0.1469     -0.167   0.0278      0.0639
6      10.2              7.1               -0.483   0.2336     -0.467   0.2178      0.2256
7      10.7              7.4                0.017   0.0003     -0.167   0.0278     -0.0028
8      10.5              7.2               -0.183   0.0336     -0.367   0.1344      0.0672
9      10.8              7.8                0.117   0.0136      0.233   0.0544      0.0272
10     11.2              7.7                0.517   0.2669      0.133   0.0178      0.0689
11     10.6              7.8               -0.083   0.0069      0.233   0.0544     -0.0194
12     11.4              8.3                0.717   0.5136      0.733   0.5378      0.5256
Sum                                                 1.717               1.347       1.323
\[ r = \frac{1.323}{\sqrt{1.717 \times 1.347}} \]
r = 0.870
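The correlation coefficient for the wing and tail data can be computed from unrounded deviations with the sketch below. The variable names are illustrative. It also checks the identity r = b_yx × sd_x / sd_y.

```python
import math

# Wing length (x) and tail length (y) of the 12 birds in the example.
wing = [10.4, 10.8, 11.1, 10.2, 10.3, 10.2, 10.7, 10.5, 10.8, 11.2, 10.6, 11.4]
tail = [7.4, 7.6, 7.9, 7.2, 7.4, 7.1, 7.4, 7.2, 7.8, 7.7, 7.8, 8.3]

n = len(wing)
mean_x = sum(wing) / n
mean_y = sum(tail) / n

# Sums of squares and cross products of the deviations from the means.
sum_xx = sum((x - mean_x) ** 2 for x in wing)
sum_yy = sum((y - mean_y) ** 2 for y in tail)
sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(wing, tail))

r = sum_xy / math.sqrt(sum_xx * sum_yy)

# Same value via the regression coefficient of y on x and the two standard deviations.
b_yx = sum_xy / sum_xx
sd_x = math.sqrt(sum_xx / (n - 1))
sd_y = math.sqrt(sum_yy / (n - 1))
r_from_slope = b_yx * sd_x / sd_y

print(f"sum_xx = {sum_xx:.3f}, sum_yy = {sum_yy:.3f}, sum_xy = {sum_xy:.3f}")
print(f"r = {r:.3f}")
```

Keeping full precision in the means matters here: rounding the means before taking deviations shifts the sums and changes r in the third decimal.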