
PRACTICAL MANUAL

STATISTICAL METHODS
(UG COURSE)

Compiled by

DEPARTMENT OF MATHEMATICS AND STATISTICS


Jawaharlal Nehru Krishi Vishwa Vidyalaya,
JABALPUR 482 004

Contents

1. Graphical Representation of Data (pp. 1-8)
   1. Construction of discrete and continuous frequency distributions
   2. Construction of bar diagram, histogram, pie diagram, frequency curve and frequency polygon
2. Measures of Central Tendency (pp. 9-21)
   1. Definition, formula and calculation of mean, median, mode, geometric mean and harmonic mean for grouped and ungrouped data
   2. Definition, formula and calculation of quartiles, deciles and percentiles for grouped and ungrouped data
3. Measures of Dispersion (pp. 22-29)
   1. Definition, formula and calculation of absolute measures of dispersion: range, quartile deviation, mean deviation, standard deviation
   2. Definition, formula and calculation of relative measures of dispersion: CD and CV for grouped and ungrouped data
4. Moments, Skewness and Kurtosis (pp. 30-40)
   1. Definition and types of moments, skewness and kurtosis
   2. Formula and calculation of raw moments, moments about origin, central moments and different types of coefficients of skewness and kurtosis
5. Correlation and Regression (pp. 41-49)
   1. Definition and types of correlation and regression
   2. Calculation of correlation and regression coefficients along with their tests of significance
6. Test of Significance (pp. 50-59)
   1. Definition of null and alternative hypothesis and different tests of significance
   2. Application of t-test for single mean, t-test for independent samples, paired t-test, F-test, chi-square test
7. Analysis of Variance (One-way and Two-way Classification) (pp. 60-79)
   1. Definition and steps of analysis of one-way and two-way classification
   2. Analysis of CRD and RBD as examples of one-way and two-way ANOVA
8. Sampling Methods (pp. 80-86)
   1. Definition of SRS, SRSWR and SRSWOR and difference between census and sampling
   2. Procedures for selecting a simple random sample

1. Graphical Representation of data
Mujahida Sayyed
Asst. Professor (Maths & Stat.), College of Agriculture, JNKVV, Ganjbasoda, 464221 (M.P.), India
Email id : mujahida.sayyed@gmail.com

Frequency Distribution: A tabular presentation of data in which the frequencies of the values of a
variable are shown against their classes is called a frequency distribution. Two types of frequency
distribution are available:
1. Discrete Frequency Distribution: a frequency distribution formed by the distinct values of a
discrete variable, e.g. 1, 2, 5, etc.
2. Continuous Frequency Distribution: a frequency distribution formed by class intervals of a
continuous variable, e.g. 0-10, 10-20, 20-30, etc.
Process: For construction of Discrete Frequency Distribution
Step I. Set the data in ascending order.
Step II. Make a blank table consisting of three columns with the title: Variable, Tally Marks and
Frequency.
Step III. Read off the observations one by one from the given data and record a tally mark against
the corresponding value. Every fifth tally is drawn as a diagonal stroke across the previous four
marks, so that the tallies form easily counted groups of five; the sixth tally then starts a new
group, and so on.
Step IV. In the end, count all the tally marks in a row and write their number in the frequency
column.
Step V. Write down the total frequency in the last row at the bottom.

Objective : Prepare a discrete frequency distribution from the following data


Kinds of data:
5 5 2 6 1 5 2 9 5 4
3 4 11 7 2 5 12 6
Solution : First arrange the data in ascending order
1 2 2 2 3 4 4 5 5 5
5 5 6 6 7 9 11 12
Prepare a table in the format described above in the process.
Counting the values by the tally method, we get the required discrete frequency distribution:
Variable (X) Tally Marks Frequency (f)
1 │ 1
2 │││ 3
3 │ 1
4 ││ 2
5 ││││ 5
6 ││ 2
7 │ 1
9 │ 1
11 │ 1
12 │ 1
Total 18
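The tally count above can be cross-checked with a short Python sketch (illustrative only, not part of the manual's procedure); `collections.Counter` performs the tallying and sorting gives the ascending table:

```python
from collections import Counter

# Raw observations from the example above
data = [5, 5, 2, 6, 1, 5, 2, 9, 5, 4, 3, 4, 11, 7, 2, 5, 12, 6]

# Counter tallies each distinct value; sorting the items gives the table rows
freq = dict(sorted(Counter(data).items()))
print(freq)                 # {1: 1, 2: 3, 3: 1, 4: 2, 5: 5, 6: 2, 7: 1, 9: 1, 11: 1, 12: 1}
print(sum(freq.values()))   # total frequency: 18
```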

Continuous Frequency Distribution:
A continuous frequency distribution is obtained by dividing the entire range of the given
observations on a continuous variable into groups (class intervals) and distributing the
frequencies over these groups. The classes can be formed by two methods:
1. Inclusive method of class intervals: both the lower and upper limits of a class interval are
included in that class.
2. Exclusive method of class intervals: the upper limit of a class interval is equal to the
lower limit of the next higher class interval, and an observation equal to the upper limit is
counted in the next class.
Process: For construction of Continuous Frequency Distribution
Step I. Set the data in ascending order.
Step II. Find the range= max value –min value.
Step III. Decide the approximate number of classes K by the formula K = 1 + 3.322 log10 N, where N
is the total frequency. Round the answer up to the next integer. Dividing the range by the number
of classes gives the width of each class interval.
Step IV. Classify the data by exclusive and/or inclusive method for the desired width of the class
intervals.
Step V. Make a blank table consisting of three columns with the title: Variable, Tally Marks and
Frequency.
Step VI. Read off the observations one by one from the data given and for each one record a tally
mark against each observation.
Step VII. In the end, count all the tally marks in a row and write their number in the frequency
column.
Step VIII. Write down the total frequency in the last row at the bottom.
********************************************************************************

Objective : Prepare a continuous grouped frequency distribution from the following data.
Kinds of data: 20 students appear in an examination. The marks obtained out of 50 maximum
marks are as follows:
5, 16, 17, 17, 20, 21, 22, 22, 22, 25, 25, 26, 26, 30, 31, 31, 34, 35, 42 and 48.
Prepare a frequency distribution taking 10 as the width of the class-intervals .

Solution: Arrange the data in the ascending order


5 16 17 17 20 21 22 22 22 25
25 26 26 30 31 31 34 35 42 48
Here the minimum value is 5 and the maximum value is 48.

Since it is given that the desired class interval is 10, so frequency distribution for Inclusive Method
of Class intervals:

Marks Tally Marks No. of students


1-10 │ 1
11-20 ││││ 4
21-30 │││││││ │ 9
31-40 ││││ 4
41-50 ││ 2
Total 20

Exclusive Method of Class intervals:

Marks Tally Marks No. of students


0-10 │ 1
10-20 │││ 3
20-30 │││││││ │ 9
30-40 │││ │ 5
40-50 ││ 2
Total 20
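Both groupings above can be reproduced programmatically. The sketch below (Python, added for illustration) applies the K = 1 + 3.322 log10 N rule from Step III and then counts the marks into exclusive classes of width 10:

```python
import math

marks = [5, 16, 17, 17, 20, 21, 22, 22, 22, 25,
         25, 26, 26, 30, 31, 31, 34, 35, 42, 48]

# Step III: approximate number of classes, K = 1 + 3.322*log10(N), rounded up
k = math.ceil(1 + 3.322 * math.log10(len(marks)))   # N = 20 -> K = 6

# Exclusive method with width 10: each class includes its lower limit and
# excludes its upper limit, matching the exclusive-method table above
width = 10
counts = {}
for lo in range(0, 50, width):
    counts[(lo, lo + width)] = sum(lo <= x < lo + width for x in marks)
print(k, counts)
```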

********************************************************************************
Conversion of Inclusive series to Exclusive series: To apply any statistical technique (mean,
median, etc.), the inclusive classes should first be converted to exclusive classes.
For this purpose we find the correction factor
(lower limit of second class - upper limit of first class) / 2,
add this amount to the upper limit of each class and subtract it from the lower limit of each class.
In the present example the correction factor = (11 - 10)/2 = 0.5. So we add 0.5 to each upper limit
and subtract 0.5 from each lower limit, finally getting the exclusive classes 0.5-10.5, 10.5-20.5, etc.
********************************************************************************

Graphical Representation of data: Graphical representation is a way of analysing numerical data.
It exhibits the relation between data, ideas, information and concepts in a diagram. It is easy to
understand and is one of the most important learning strategies. The choice of graph always depends
on the type of information in a particular domain. There are different types of graphical
representation; some of them are as follows:
• Bar Diagram – Bar Diagram is used to display the category of data and it compares the
data using solid bars to represent the quantities.
• Histogram – The graph that uses bars to represent the frequency of numerical data that are
organised into intervals. Since all the intervals are equal and continuous, all the bars have
the same width.
• Pie diagram – Shows the relationship of the parts to the whole. The circle represents 100%,
and each category occupies a sector whose size matches its percentage share, e.g. 15%, 56%, etc.
• Frequency Polygon – Shows the frequencies of class intervals by plotting each frequency
against the mid-value of its class and joining the points with straight line segments.
• Frequency curve – A graph of a frequency distribution in which the plotted points are joined
by a smooth curve.

Merits of Using Graphs


Some of the merits of using graphs are as follows:
• The graph is easily understood by everyone without any prior knowledge.
• It saves time.
• It allows one to relate and compare data for different time periods.
• It is used in statistics to determine the mean, median and mode for different data, as well as
in interpolation and extrapolation of data.

1. Simple Bar Diagram:
Bar graph is a diagram that uses bars to show comparisons between categories of
data. The bars can be either horizontal or vertical. Bar graphs with vertical bars are sometimes
called vertical bar graphs. A bar graph has two axes: one axis describes the categories being
compared, and the other carries the numerical scale for the values of the data. Either arrangement
may be used, and it determines the orientation of the bars: if the category descriptions are on the
horizontal axis the bars are drawn vertically, and if the values are along the horizontal axis the
bars are drawn horizontally.

Objective : Prepare a simple Bar diagram for the given data:


Kinds of data: Aggregated figures for merchandise export in India for eight years are as
Follows.
Years 1971 1972 1973 1974 1975 1976 1977 1978
Exports (million Rs.) 1962 2174 2419 3024 3852 4688 5555 5112

Solution: For Simple Bar Diagram


Step I: Draw the X and Y axes.
Step II: Take the years on the X axis.
Step III: Take a scale of 1000 units on the Y axis, which represents exports (million Rs.).
Step IV: Draw bars of equal width on the X axis.
[Figure: Bar diagram with Year (1971-1978) on the X axis and Export (million Rs.) on the Y axis,
scaled 0-6000 in steps of 1000.]

Results: The above figure shows the Bar diagram.
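As an illustration of Steps I-IV (not part of the original exercise), the bars can be sketched even in plain text; here each '#' stands for 250 million Rs., an arbitrary console scale, whereas a plotting library would draw the real diagram:

```python
years = [1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978]
exports = [1962, 2174, 2419, 3024, 3852, 4688, 5555, 5112]

# One horizontal bar per year; bar length is the export value scaled by 250
lines = [f"{y} | {'#' * round(e / 250)} {e}" for y, e in zip(years, exports)]
print("\n".join(lines))
```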


********************************************************************************
2. Histogram :
A histogram consists of contiguous (adjoining) boxes. It has both a horizontal axis and a
vertical axis. The horizontal axis is more or less a number line, labelled with what the data
represents. The vertical axis is labelled either frequency or relative frequency (or percent frequency
or probability). The histogram (like the stemplot) can give the shape of the data, the center, and the
spread of the data. The shape of the data refers to the shape of the distribution, whether normal,

approximately normal, or skewed in some direction, whereas the center is thought of as the middle
of a data set, and the spread indicates how far the values are dispersed about the center. In a skewed
distribution, the mean is pulled toward the tail of the distribution. In histogram the area of rectangle
is proportional to the frequency of the corresponding range of the variable.

Objective: Construction of a histogram for the given data:


Kinds of data: The following data are the numbers of books bought by 50 part-time college
students at a college:
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6
Eleven students buy one book. Ten students buy two books. Sixteen students buy three books. Six
students buy four books. Five students buy five books. Two students buy six books. Calculate the
width of each bar/bin size/interval size.
Solution:
Process:
Step I: The smallest data value is 1 and the largest is 6. To make sure every value falls inside an
interval, take 0.5 as the lower boundary and 6.5 as the upper boundary (subtracting and adding 0.5
to the extreme values). The range here is small, 6.5 - 0.5 = 6, so a small number of bins suffices;
say six. Dividing the range of 6 by six bins gives a bin size (or interval size) of one.
Step II: Notice that different rational numbers may be chosen to add to, or subtract from, the
maximum and minimum values when calculating the bin size.
Step III: Tally the data into the six intervals and draw a bar of the corresponding frequency over
each interval.

Result:
The histogram displays the number of books on the x-axis and the frequency on the y-axis.
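The bin-size calculation and the resulting frequencies can be verified with a short Python sketch (illustrative, not part of the exercise):

```python
data = [1]*11 + [2]*10 + [3]*16 + [4]*6 + [5]*5 + [6]*2   # the 50 students

# Boundaries 0.5 to 6.5 with six bins give a bin width of (6.5 - 0.5)/6 = 1
start, stop, bins = 0.5, 6.5, 6
width = (stop - start) / bins
edges = [start + i * width for i in range(bins + 1)]   # 0.5, 1.5, ..., 6.5

# Count the values falling in each half-open bin [lo, lo + width)
freq = [sum(lo <= x < lo + width for x in data) for lo in edges[:-1]]
print(width, freq)   # 1.0 [11, 10, 16, 6, 5, 2]
```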
********************************************************************************

3. PIE Diagram:
Pie charts are simple diagrams for displaying categorical or grouped data. These
charts are commonly used within industry to communicate simple ideas, for example market share.
They are used to show the proportions of a whole. They are best used when there are only a handful
of categories to display.
A pie chart consists of a circle divided into segments, one segment for each category. The size of
each segment is determined by the frequency of the category and is measured by the angle of the
segment. As the total number of degrees in a circle is 360, the angle given to a segment is 360°
times the fraction of the data in the category, that is
Angle = (number in category / total number in sample, n) × 360.
The segments can equivalently be expressed as percentages of the whole.

Objective: Draw a pie chart to display the information.


Kinds of data: A family's weekly expenditure on its house mortgage, food and fuel is as follows:
Expense Rupees
Mortgage 300
Food 225
Fuel 75

Solution: Process:
Step I: The total weekly expenditure = 300 + 225 + 75 = Rs. 600.
Step II: Percentage of weekly expenditure on:
Mortgage = (300/600) × 100% = 50%
Food = (225/600) × 100% = 37.5%
Fuel = (75/600) × 100% = 12.5%
Step III: To draw the pie chart, divide the circle into 100 percentage parts. Then allocate the
number of percentage parts required for each item.

Result: Above figure shows the pie diagram of the given data.
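The percentage and angle calculations generalise directly; a small Python sketch (added for illustration) computes both for each expense:

```python
expenses = {"Mortgage": 300, "Food": 225, "Fuel": 75}
total = sum(expenses.values())   # Rs. 600

# Each sector's share of the whole, as a percentage and as degrees of the circle
shares = {item: (amount / total * 100, amount / total * 360)
          for item, amount in expenses.items()}
for item, (pct, angle) in shares.items():
    print(f"{item}: {pct}% -> {angle} degrees")
# Mortgage: 50.0% -> 180.0, Food: 37.5% -> 135.0, Fuel: 12.5% -> 45.0
```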
*******************************************************************************

4. Frequency Polygon:
Frequency polygons are analogous to line graphs, and just as line graphs make continuous data
visually easy to interpret, so too do frequency polygons. A frequency polygon is obtained by
plotting the frequency of each class (y-axis) against the mid-point of its class interval (x-axis)
and joining the points by straight line segments.
Step I: Examine the data and decide on the number of intervals and resulting interval size, for both
the x-axis and y-axis.
Step II: The x-axis will show the lower and upper bound for each interval, containing the data
values, whereas the y-axis will represent the frequencies of the values.
Step III: Each data point represents the frequency for an interval and is plotted above the
mid-value of that interval.
Step IV: If an interval has three data values in it, the frequency polygon will show a 3 at the
mid-point of that interval.
Step V: After choosing the appropriate intervals, begin plotting the data points. After all the points
are plotted, draw line segments to connect them.

Objective: Construction of a frequency polygon from the frequency table.
Kinds of data:
Frequency Distribution for Calculus Final Test Scores

Lower Bound Upper Bound Mid Value Frequency

49.5 59.5 54.5 5

59.5 69.5 64.5 10

69.5 79.5 74.5 30

79.5 89.5 84.5 40

89.5 99.5 94.5 15


Solution:

Result: Above figure shows the frequency polygon diagram of the given data.
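The plotted points of the frequency polygon are simply (mid-value, frequency) pairs, which can be generated as follows (Python, added for illustration):

```python
lower = [49.5, 59.5, 69.5, 79.5, 89.5]
upper = [59.5, 69.5, 79.5, 89.5, 99.5]
freq  = [5, 10, 30, 40, 15]

# The polygon is drawn through (class mid-value, frequency); joining these
# points with straight segments gives the frequency polygon
mids = [(lo + up) / 2 for lo, up in zip(lower, upper)]
points = list(zip(mids, freq))
print(points)   # [(54.5, 5), (64.5, 10), (74.5, 30), (84.5, 40), (94.5, 15)]
```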
*******************************************************************************

5. Frequency curve: The frequency-curve for a distribution can be obtained by drawing a smooth
and free hand curve through the mid-points of the upper sides of the rectangles forming the
histogram.
[Figure: Frequency curve drawn through the mid-points 54.5, 64.5, 74.5, 84.5 and 94.5 with
frequencies 5, 10, 30, 40 and 15; Y axis scaled 0-50.]

Result: Above figure shows the frequency curve diagram of the given data.
****************************************************************************

Exercise:
Q1. Define graphical representation. Also write the advantages of graphical representation of data.

Q2. The following data gives the information of the number of children involved in different
activities.
Activities Dance Music Art Cricket Football
No. of Children 30 40 25 20 53
Draw Simple bar Diagram.
Q3. The percentage of total income spent under various heads by a family is given below.
Different Heads   Food  Clothing  Health  Education  House Rent  Miscellaneous
% age of Total    40%   10%       10%     15%        20%         5%
Represent the above data in the form of bar graph.
Q4. The following table shows the numbers of hours spent by a child on different events on a
working day. Represent the adjoining information on a pie chart.
Activity School Sleep Playing Study TV Others
No. of Hours 6 8 2 4 1 3

Q5. Make a frequency table and histogram of the following data:


3, 5, 8, 11, 13, 2, 19, 23, 22, 25, 3, 10, 21, 14, 9, 12, 17, 22, 23, 14
*******************************************************************************

Measures of Central Tendency
Umesh Singh
Assistant Professor (Statistics), College of Agriculture, Tikamgarh, 472001, India
Email id : umeshsingh0786@gmail.com

According to Professor Bowley, Averages are "statistical constants which enable us to comprehend
in a single effort the significance of the whole." They give us an idea about the concentration of the
values in the central part of the distribution. Plainly speaking, an average of a statistical series is the
value of the variable which is representative of the entire distribution.
The following are the five measures of central tendency that are in common use:
(i) Arithmetic Mean
(ii) Median
(iii) Mode
(iv) Geometric Mean
(v) Harmonic Mean
Requisites for an ideal Measure of Central Tendency
The following are the characteristics to be satisfied by an ideal measure of central tendency
(i) It should be rigidly defined.
(ii) It should be readily comprehensible and easy to calculate.
(iii) It should be based on all the observations.
(iv) It should be suitable for further mathematical treatment.
(v) It should be affected as little as possible by fluctuations of sampling.
(vi) It should not be affected much by extreme values.

1. Arithmetic Mean:

Arithmetic mean of a set of observations is their sum divided by the number of observations.
Arithmetic mean for ungrouped data: The arithmetic mean X̄ of n observations X1, X2, X3, ..., Xn
is given by

X̄ = (X1 + X2 + X3 + ... + Xn)/n = (1/n) Σ Xi, the sum running over i = 1 to n.

Arithmetic mean for grouped data:
In case of a frequency distribution Xi | fi, i = 1, 2, ..., n, where fi is the frequency of the
value Xi,

X̄ = (f1X1 + f2X2 + f3X3 + ... + fnXn)/(f1 + f2 + f3 + ... + fn) = (1/N) Σ fiXi, where N = Σ fi.
In case of grouped or continuous frequency distribution, X is taken as the mid-value of the
corresponding class.

Remark. The Greek Capital letter, Σ Sigma, is used to indicate summation of elements in a set or a
sample or a population. It is usually indexed by an index to show how many elements are to be
summed.
Properties of Arithmetic Mean
Property 1. The algebraic sum of the deviations of a set of values from their arithmetic mean is
zero: if Xi | fi, i = 1, 2, ..., n is the frequency distribution, then Σ fi(Xi − X̄) = 0, X̄ being
the mean of the distribution.
Property 2. The sum of the squares of the deviations of a set of values from their arithmetic mean
is always minimum.
Property 3. Mean of the composite series: if X̄i (i = 1, 2, ..., k) are the means of k component
series of sizes ni (i = 1, 2, ..., k) respectively, then the mean X̄ of the composite series
obtained on combining the component series is given by the formula:

X̄ = (n1X̄1 + n2X̄2 + n3X̄3 + ... + nkX̄k) / (n1 + n2 + n3 + ... + nk)
********************************************************************************
Objective: Find the arithmetic mean of the following ungrouped data:
Kinds of data : Suppose the data are 10, 7, 11, 9, 9, 10, 7, 9, 12.
Solution: We know that

Arithmetic mean X̄ = Σ Xi / n = (10 + 7 + 11 + 9 + 9 + 10 + 7 + 9 + 12)/9 = 84/9 = 9.33

*******************************************************************************
Objective: Find the arithmetic mean of the following discrete frequency distribution:
Kinds of data:
Xi 2 9 16 35 32 89 95 65 55
fi 8 2 5 7 6 8 9 6 2
Solution:

Xi    2   9   16   35   32   89   95   65   55   Total
fi    8   2   5    7    6    8    9    6    2    Σfi = 53
fiXi  16  18  80   245  192  712  855  390  110  2618

X̄ = Σ fiXi / Σ fi = 2618/53 = 49.40
********************************************************************************
Objective : Find the arithmetic mean of the following continuous grouped frequency distribution:
Kinds of data:
Xi 0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80 80-90
fi 8 2 5 7 6 8 9 6 2
Solution –
Class interval Xi (Mid-point) fi fiXi
0-10 5 8 40
10-20 15 2 30
20-30 25 5 125
30-40 35 7 245
40-50 45 6 270
50-60 55 8 440
60-70 65 9 585
70-80 75 6 450
80-90 85 2 170
Total            Σfi = 53  2355

X̄ = Σ fiXi / Σ fi = 2355/53 = 44.43
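The mid-value computation of the table can be sketched in Python (illustrative only):

```python
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50),
           (50, 60), (60, 70), (70, 80), (80, 90)]
freq = [8, 2, 5, 7, 6, 8, 9, 6, 2]

# The mid-value of each class stands in for every observation in that class
mids = [(lo + up) / 2 for lo, up in classes]
mean = sum(f * x for f, x in zip(freq, mids)) / sum(freq)
print(round(mean, 2))   # 44.43  (2355 / 53)
```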

Objective: Find the arithmetic mean of the pooled data.

Kinds of data: The average of 5 numbers (first series) is 40 and the average of another 4 numbers
(second series) is 50.

Solution: By the pooled mean formula,

X̄ = (n1X̄1 + n2X̄2)/(n1 + n2) = (5 × 40 + 4 × 50)/(5 + 4) = 400/9 = 44.44
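A quick check of the pooled (composite) mean formula in Python (illustrative):

```python
# Combined mean of two series from their sizes and means (Property 3)
n1, mean1 = 5, 40
n2, mean2 = 4, 50

pooled = (n1 * mean1 + n2 * mean2) / (n1 + n2)
print(round(pooled, 2))   # 44.44  (400 / 9)
```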
********************************************************************************

2. Median:
Median of a distribution is the value of the variable which divides it into two equal parts. It
is the value which exceeds and is exceeded by the same number of observations, i.e., it is the value
such that the number of observations above it is equal to the number of observations below it. The
median is thus a positional average.

Median for ungrouped data:


In case of ungrouped data, first arrange the values in ascending or descending order of magnitude.
If the number of observations is odd, the median is the middle value:

Median = ((n + 1)/2)th term

In case of an even number of observations, in fact any value lying between the two middle values
can be taken as the median, but conventionally we take it to be the mean of the two middle terms:

Median = [ (n/2)th term + ((n + 2)/2)th term ] / 2
In case of discrete frequency distribution median is obtained by considering the cumulative
frequencies. The steps for calculating median are given below:
(i) Find N/2, where N = ∑fi.
(ii) See the (less than) cumulative frequency (cf.) just greater than N/2.
(iii) The corresponding value of X is median.

********************************************************************************
Objective: Find the median of the ungrouped data when the number of observations is odd.
Kinds of data: The values are 5, 20,15,35,18, 25, 40.
Solution – Step 1 Arrange values in ascending order of their magnitude
5, 15, 18, 20, 25, 35, 40
Step 2: The number of observations is odd, i.e. n = 7.

So, Median = ((n + 1)/2)th term = ((7 + 1)/2)th term = 4th term, which is 20.
********************************************************************************

Objective: Find the median of the ungrouped data when the number of observations is even.
Kinds of data: The values are 8, 20, 50, 25, 15, 30
Solution – Step 1 Arrange values in ascending order of their magnitude
8, 15, 20, 25, 30, 50
Step 2: The number of observations is even, i.e. n = 6.
In case of an even number of observations,

Median = [ (n/2)th term + ((n + 2)/2)th term ] / 2
       = [ (6/2)th term + ((6 + 2)/2)th term ] / 2
       = (3rd term + 4th term)/2 = (20 + 25)/2 = 22.5
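Python's statistics module applies exactly these two conventions, so both examples can be verified at once (illustrative sketch):

```python
import statistics

odd  = [5, 20, 15, 35, 18, 25, 40]   # 7 values -> the middle value
even = [8, 20, 50, 25, 15, 30]       # 6 values -> mean of the two middle values

# statistics.median sorts internally, then picks the middle term (odd n)
# or averages the two middle terms (even n)
print(statistics.median(odd))    # 20
print(statistics.median(even))   # 22.5
```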
********************************************************************************
Median for grouped data
In the case of continuous frequency distribution, the class corresponding to the c.f. just greater than
N/2 is called the median class and the value of median is obtained by the following formula:

Median = l + ((N/2 − C)/f) × h

Where l is the lower limit of the median class,
f is the frequency of the median class,
h is the magnitude (width) of the median class,
C is the cumulative frequency of the class preceding the median class, and N = Σfi.
********************************************************************************

Objective: Find the Median of the following discrete grouped frequency distribution:
Kinds of data:
Xi 1 2 3 4 5 6 7 8 9 Total
fi 8 10 11 16 20 25 15 9 6 120

Solution : Here N = ∑fi = 120


→ So, N/2 = 120/2 = 60
Xi 1 2 3 4 5 6 7 8 9 Total
fi 8 10 11 16 20 25 15 9 6 120
C.f. 8 18 29 45 65 90 105 114 120

The cumulative frequency (c.f.) just greater than N/2 = 60 is 65, and the value of X corresponding
to 65 is 5. Therefore, the median is 5.
********************************************************************************

Objective: Find the Median wage of the following continuous grouped frequency distribution
Kinds of data:
Wages (in Rs.) 20-30 30-40 40-50 50-60 60-70 70-80 80-90
No. of labours 3 5 20 10 5 7 2

Solution:

Wages (in Rs.) 20-30 30-40 40-50 50-60 60-70 70-80 80-90

No. of labours 3 5 20 10 5 7 2
c.f. 3 8 28 38 43 50 52

Here N/2 = 52/2=26. Cumulative frequency just greater than 26 is 28 and corresponding class is 40-50. Thus
median class is 40-50.
Now Median = l + ((N/2 − C)/f) × h
           = 40 + ((26 − 8)/20) × 10 = 40 + 9 = 49

So, the median wage is Rs. 49.
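The interpolation formula can be coded directly; the sketch below (Python, illustrative) locates the median class from the cumulative frequencies and applies the formula:

```python
classes = [(20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80), (80, 90)]
freq = [3, 5, 20, 10, 5, 7, 2]

N = sum(freq)    # 52
half = N / 2     # 26.0
cum = 0
for (l, u), f in zip(classes, freq):
    if cum + f >= half:                        # first c.f. reaching N/2 -> median class
        median = l + (half - cum) / f * (u - l)  # l + ((N/2 - C)/f) * h
        break
    cum += f
print(median)   # 49.0
```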
********************************************************************************

3. Mode
Mode is the value which occurs most frequently in a set of observations and around which the other
items of the set cluster densely. In other words, mode is the value of the variable which is
predominant in the series. Thus, in the case of a discrete frequency distribution, the mode is the
value of X corresponding to the maximum frequency. For example, the mode of {4, 2, 4, 3, 2, 2, 1, 2}
is 2 because it occurs four times, which is more than any other number. Now look at the following
discrete series:

Variable 10 20 30 40 50 55 60 89 94
Frequency 2 3 12 30 25 11 9 7 3
Here, as you can see, the maximum frequency is 30, so the value of the mode is 40. In this case,
as there is a unique modal value, the data are unimodal. But the mode, unlike the arithmetic mean
and median, is not necessarily unique: data can have two modes (bi-modal) or more than two modes
(multi-modal). There may also be no mode at all if no value appears more frequently than the
others; for example, in the series 1, 1, 2, 2, 3, 3, 4, 4 there is no mode.

But in any one (or more) of the following cases:

(i) if the maximum frequency is repeated,
(ii) if the maximum frequency occurs at the very beginning or at the end of the distribution, or
(iii) if there are irregularities in the distribution,
the value of the mode is determined by the method of grouping. This is illustrated below by an
example.

Objective: Find the mode of the following frequency distribution:


Kinds of data:

Size ( X) 1 2 3 4 5 6 7 8 9 10 11 12
Frequency (f) 3 8 15 23 35 40 32 28 20 45 14 6

Solution: Here we see that the distribution is not regular, since the frequencies increase steadily
up to 40 and then decrease, but the frequency 45 coming after 20 does not seem consistent with the
distribution. So we cannot simply say that, since the maximum frequency is 45, the mode is 10.
Instead we locate the mode by the method of grouping, as explained below:

The frequencies in column (i) are the original frequencies. Column (ii) is obtained by combining
the frequencies two by two. If we leave out the first frequency and combine the remaining
frequencies two by two, we get column (iii). We proceed to combine the frequencies three by three
to obtain column (iv). Combining three by three after leaving out the first frequency gives
column (v), and after leaving out the first two frequencies gives column (vi).
To find the mode we form the following table:

Column Number Maximum Frequency Value or combination of values of X


giving max. frequency
(i) 45 10
(ii) 75 5, 6
(iii) 72 6, 7
(iv) 98 4, 5, 6
(v) 107 5, 6, 7
(vi) 100 6, 7, 8

We find that the value 6 is repeated the maximum number of times; hence the value of the mode is 6,
and not 10, which is an irregular item.

Mode for ungrouped data: In the case of discrete frequency distribution mode is the value of X
corresponding to maximum frequency.
***************************************************************************************

Objective: Find the Mode of the following ungrouped data:


Kinds of data: The values are 4, 2, 4, 3, 2, 2, 1, and 2.
Solution: Here the mode is 2 because it occurs four times, which is more than any other number.
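The same answer follows from a frequency count in Python (illustrative only):

```python
from collections import Counter

data = [4, 2, 4, 3, 2, 2, 1, 2]

# most_common(1) returns the (value, frequency) pair with the highest count
mode, count = Counter(data).most_common(1)[0]
print(mode, count)   # 2 4
```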
**************************************************************************************

Mode for grouped data


In case of continuous frequency distribution mode is given by the formula:

 f1 − f0 
Mode = l +   * h
 2f1 − f0 − f 2 

Where l is the lower limit of the modal class, h the magnitude of the modal class, f1 the frequency
of the modal class, and f0 and f2 the frequencies of the classes preceding and succeeding the modal
class respectively.
***************************************************************************************

Objective: Find the mode of the following continuous grouped frequency distribution:
Kinds of data:

Class interval 0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80
Frequency (f) 5 8 7 12 28 20 10 10

Solution: Here maximum frequency is 28. Thus the class 40-50 is the modal class.

So, l=40, f1=28, f0=12, f2=20, h=10,

 f1 − f0 
Mode = l +   * h
 2f1 − f0 − f 2 

 28 − 12 
Mode = 40 +   *10 =40+ 6.667= 46.667
 56 − 12 − 20 
**************************************************************************************

4. Geometric Mean (GM)

The Geometric mean of a set of n observation is the nth root of their product.
Geometric mean for grouped data-
The geometric mean G of n observations xi, i=1,2,3….n is

G = n ( x1.x2 x3 .......xn
This can be written as
G = ( x1.x2 x3 .......xn )1/ n

log G = log x1. log x2 . log x3 .......... log xn n


1

Taking log in both sides


log G =
1
log x1 + log x2 + log x3 ................ log xn 
n
1 n
logG =  logx i
n i =1

1 n 
G = Antilog   logx i 
 n i =1 
Geometric mean for grouped data-
In the case of grouped or continuous frequency distribution, X is taken to be the value corresponding
to the mid-point of the class-intervals.
In case of frequency distribution Xi / fi., (i = 1. 2 ...., n) geometric mean, G is given by

 
1
G = x1f1 .x2f 2 .x3f3 ..........xnf n N , where N = ∑fi.
Taking logarithm of both sides, we get
log G =
1
 f1 log x1 + f 2 log x2 + f 3 log x3 ................ + f n log xn 
N
1 n
log G =  f i log xi
N i =1
Thus we see that logarithm of G is the arithmetic mean of the logarithms of the given values.
1 n 
So G = Anti log  
 N i =1
f i log xi 

***************************************************************************************
Objective: Calculate Geometric mean from the following data: 3,13,11,15,5,4,2
Solution: Here the number of observations n = 7. By the definition of the geometric mean,

G = (3 × 13 × 11 × 15 × 5 × 4 × 2)^(1/7)

log G = (1/n)[log x1 + log x2 + ... + log xn]
      = (1/7)[log 3 + log 13 + log 11 + log 15 + log 5 + log 4 + log 2]
      = (1/7)[0.4771 + 1.1139 + 1.0414 + 1.1761 + 0.6990 + 0.6021 + 0.3010]
      = (1/7)(5.4106) = 0.7729

So, G = Antilog(0.7729) = 5.928
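The logarithmic computation can be verified in Python (illustrative); natural logarithms work just as well as common logarithms, since the same base is used for the antilog:

```python
import math

data = [3, 13, 11, 15, 5, 4, 2]

# G = antilog( (1/n) * sum of log x ), here with natural logs and exp()
log_mean = sum(math.log(x) for x in data) / len(data)
G = math.exp(log_mean)
print(round(G, 2))   # 5.93
```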
***************************************************************************************

Objective: Calculate Geometric mean from the following continuous grouped frequency data:
Kinds of data:
Class Interval 0-10 10-20 20-30 30-40 Total
Frequency 1 3 4 2 10
Solution: We know that in the case of grouped data

log G = (1/N) Σ fi log xi

Calculations are given below in the table:

Class Interval  Frequency  Mid Value Xi  log Xi  fi log Xi
0-10            1          5             0.699   0.699
10-20           3          15            1.176   3.528
20-30           4          25            1.398   5.592
30-40           2          35            1.544   3.088
Total           10                               12.907

After substituting the values in the formula, we get log G = 12.907/10 = 1.29.

Hence GM = Antilog(1.29) = 19.53
**************************************************************************************

5. Harmonic Mean
The harmonic mean of a number of observations is the reciprocal of the arithmetic mean of the
reciprocals of the given values.

Harmonic mean for ungrouped data:
The harmonic mean H of n observations Xi, i = 1, 2, ..., n is given by

H = 1 / [ (1/n) Σ (1/xi) ]

Harmonic mean for grouped data:
In case of a frequency distribution Xi | fi, i = 1, 2, ..., n,

H = 1 / [ (1/N) Σ (fi/xi) ], where N = Σfi.
***************************************************************************************

Objective: Find the harmonic mean for the following ungrouped data.
Kinds of data: Suppose the data are 10, 7, 11, 9, 9, 10, 7, 9, 12
Solution: Here n = 9, so
H = 1 / [ (1/n) Σ (1/xi) ] = 9 / (1/10 + 1/7 + 1/11 + 1/9 + 1/9 + 1/10 + 1/7 + 1/9 + 1/12) = 9/0.9933 = 9.06
********************************************************************************
Objective: Find the harmonic mean of the given class.
The table given below represents the frequency distribution of ages for Standard college students.

Ages (years)              19     20     21     22     23     24     25     26
Number of students        5      8      7      12     28     20     10     10

Solution:
Ages (xi)                 19     20     21     22     23     24     25     26
Number of students (fi)   5      8      7      12     28     20     10     10
1/xi                      0.053  0.050  0.048  0.045  0.043  0.042  0.040  0.038
fi (1/xi)                 0.263  0.400  0.333  0.545  1.217  0.833  0.400  0.385

Here N = Σfi = 100 and Σ(fi/xi) = 0.263 + 0.400 + 0.333 + 0.545 + 1.217 + 0.833 + 0.400 + 0.385 = 4.377

H = 1 / [ (1/N) Σ (fi/xi) ] = 1 / [ (1/100)(4.377) ] = 1/0.0438 = 22.85
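Both harmonic-mean cases can be folded into one routine, since the ungrouped case is just the grouped one with every frequency equal to 1 (a sketch; the function name is ours):

```python
def harmonic_mean(values, freqs=None):
    # H = N / sum(f_i / x_i), with N = sum(f_i)
    if freqs is None:
        freqs = [1] * len(values)   # ungrouped case
    N = sum(freqs)
    return N / sum(f / x for x, f in zip(values, freqs))

print(round(harmonic_mean([10, 7, 11, 9, 9, 10, 7, 9, 12]), 2))  # ≈ 9.06
ages = [19, 20, 21, 22, 23, 24, 25, 26]
students = [5, 8, 7, 12, 28, 20, 10, 10]
print(round(harmonic_mean(ages, students), 2))  # ≈ 22.85
```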
***************************************************************************************
Objective: Computation of average speed using harmonic mean.
Kinds of data: A cyclist pedals from his house to his college at a speed of 10 m.p.h. and back from the
college to his house at 15 m.p.h. Find the average speed.
Solution. Let the distance from the house to the college be x miles. In going from house to college, the
distance (x miles) is covered in x/10 hours, while in coming from college to house, the distance is
covered in x/15 hours. Thus a total distance of 2x miles is covered in (x/10 + x/15) hours.
Hence average speed = Total distance travelled / Total time taken = 2x / (x/10 + x/15)
= 2 / (1/10 + 1/15) = 12 m.p.h.
***************************************************************************************
Partition Values - These are the values which divide the series into a number of equal parts.
1. Quartiles: The three points which divide the series into four equal parts are called quartiles. The first, second and third points are known as the first, second and third quartiles respectively. The first quartile, Q1, is the value which exceeds 25% of the observations and is exceeded by 75% of the observations. The second quartile, Q2, coincides with the median. The third quartile, Q3, is the point which has 75% of the observations before it and 25% of the observations after it.
2. Deciles: The nine points which divide the series into ten equal parts are called deciles.
3. Percentiles: The ninety-nine points which divide the series into a hundred equal parts are called percentiles.
For example, D5, the fifth decile, has 50% of the observations before it, and P35, the thirty-fifth percentile, is the point which exceeds 35% of the observations. The methods of computing the partition values are the same as those of locating the median, for both grouped and ungrouped data.

Formula & Examples for ungrouped data set

Arrange the data in ascending order; then
1. Quartiles: Qi = [ i(n+1)/4 ]th value of the observation, where i = 1, 2, 3
2. Deciles: Di = [ i(n+1)/10 ]th value of the observation, where i = 1, 2, ..., 9
3. Percentiles: Pi = [ i(n+1)/100 ]th value of the observation, where i = 1, 2, ..., 99

*********************************************************************************************************************
Objective: Calculation of the first quartile, 3rd decile and 20th percentile from the given data.
Kinds of data: 3, 13, 11, 11, 5, 4, 2
Solution:
Arranging the observations in ascending order, we get:
2, 3, 4, 5, 11, 11, 13

Here, n = 7
Qi = [ i(n+1)/4 ]th value of the observation, where i = 1, 2, 3
For the first quartile, put i = 1
Q1 = [ 1(7+1)/4 ]th value of the observation
Q1 = [ 8/4 ]th value of the observation
Q1 = 2nd value of the observation, which is 3.

Di = [ i(n+1)/10 ]th value of the observation, where i = 1, 2, ..., 9
For the 3rd decile, put i = 3
D3 = [ 3(7+1)/10 ]th value of the observation
D3 = [ 24/10 ]th value of the observation
D3 = (2.4)th value of the observation
D3 = 2nd observation + 0.4 (3rd − 2nd)
D3 = 3 + 0.4 (4 − 3)
D3 = 3 + 0.4 (1)
D3 = 3.4

Pi = [ i(n+1)/100 ]th value of the observation, where i = 1, 2, ..., 99
For the 20th percentile, put i = 20
P20 = [ 20(7+1)/100 ]th value of the observation
P20 = [ 160/100 ]th value of the observation
P20 = (1.6)th value of the observation
P20 = 1st observation + 0.6 (2nd − 1st)
P20 = 2 + 0.6 (3 − 2)
P20 = 2.6
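The positional rule with linear interpolation used above can be written as one small function covering quartiles, deciles and percentiles alike (a sketch; the names are ours):

```python
def partition_value(data, i, parts):
    # The (i*(n+1)/parts)-th value of the ordered data, interpolating
    # linearly when the position is fractional (e.g. the 2.4th value).
    xs = sorted(data)
    pos = i * (len(xs) + 1) / parts
    lo = int(pos)                     # 1-based index of the lower neighbour
    frac = pos - lo
    if frac == 0 or lo >= len(xs):
        return xs[min(lo, len(xs)) - 1]
    return xs[lo - 1] + frac * (xs[lo] - xs[lo - 1])

data = [3, 13, 11, 11, 5, 4, 2]
print(partition_value(data, 1, 4))     # Q1  = 3
print(partition_value(data, 3, 10))    # D3  = 3.4
print(partition_value(data, 20, 100))  # P20 = 2.6
```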
********************************************************************************

Objective: Calculation of median, quartiles, 4th decile and 27th percentile.


Kinds of data: Eight coins were tossed together and the number of heads resulting was noted. The
operation was repeated 256 times and the frequencies (f) that were obtained for different values of
x, the number of heads, are shown in the following table.
x 0 1 2 3 4 5 6 7 8
f 1 9 26 59 72 52 29 7 1
Solution:
x 0 1 2 3 4 5 6 7 8
f 1 9 26 59 72 52 29 7 1
cf   1   10   36   95   167   219   248   255   256

Median: Here N/2 = 256/2 = 128. The cumulative frequency (c.f.) just greater than 128 is 167. Thus, median = 4.
Q1: Here N/4 = 64. The c.f. just greater than 64 is 95. Hence Q1 = 3.
Q3: Here 3N/4 = 192 and the c.f. just greater than 192 is 219. Thus Q3 = 5.
D4: Here 4N/10 = 4 × 256/10 = 102.4 and the c.f. just greater than 102.4 is 167. Hence D4 = 4.
P27: Here 27N/100 = 27 × 256/100 = 69.12 and the c.f. just greater than 69.12 is 95. Hence P27 = 3.
*******************************************************************************

Formula & Examples for grouped data set

The partition values may be determined from grouped data in the same way as the median. For calculating partition values from grouped data we form a cumulative frequency column. The partition values for grouped data are calculated from the following formulae:
1. Quartiles: Qi = l + [ (iN/4 − C) / f ] × h, where i = 1, 2, 3
2. Deciles: Di = l + [ (iN/10 − C) / f ] × h, where i = 1, 2, ..., 9
3. Percentiles: Pi = l + [ (iN/100 − C) / f ] × h, where i = 1, 2, ..., 99
where l is the lower limit of the class containing the quartile, decile or percentile, f is the frequency of that class, N = Σfi, h is the magnitude (width) of that class, and C is the cumulative frequency of the class preceding the class containing the quartile, decile or percentile.
********************************************************************************

Objective: Calculation of the 3rd quartile, 4th decile and 37th percentile from the grouped data.

Class       0-15  15-30  30-45  45-60  60-75  75-90  90-105  105-120  120-135  135-150
Frequency   1     4      17     28     25     18     13      6        5        3

Solution:
Class       0-15  15-30  30-45  45-60  60-75  75-90  90-105  105-120  120-135  135-150
Frequency   1     4      17     28     25     18     13      6        5        3
Cf          1     5      22     50     75     93     106     112      117      120
For the third quartile, 3N/4 = 3 × 120/4 = 90. The cumulative frequency just greater than 90 is 93 and the corresponding class is 75-90. Thus the Q3 class is 75-90. From the table we see that l = 75, h = 15, C = 75, f = 18.
Q3 = l + [ (3N/4 − C)/f ] × h = 75 + [ (90 − 75)/18 ] × 15 = 87.5

For the 4th decile, 4N/10 = 4 × 120/10 = 48. The cumulative frequency just greater than 48 is 50 and the corresponding class is 45-60. Thus the D4 class is 45-60. From the table we see that l = 45, h = 15, C = 22, f = 28.
D4 = l + [ (4N/10 − C)/f ] × h = 45 + [ (48 − 22)/28 ] × 15 = 58.93

For the 37th percentile, 37N/100 = 37 × 120/100 = 44.4. The cumulative frequency just greater than 44.4 is 50 and the corresponding class is 45-60. Thus the P37 class is 45-60. From the table we see that l = 45, h = 15, C = 22, f = 28.
P37 = l + [ (37N/100 − C)/f ] × h = 45 + [ (44.4 − 22)/28 ] × 15 = 57
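The same grouped formula serves quartiles, deciles and percentiles, so one function is enough; it also confirms D4 = 45 + (26/28) × 15 ≈ 58.93 (a sketch; the names are ours):

```python
def grouped_partition_value(classes, freqs, i, parts):
    # value = l + ((i*N/parts - C) / f) * h; classes are (lower, upper) pairs
    N = sum(freqs)
    target = i * N / parts
    C = 0                              # cumulative frequency before the class
    for (low, high), f in zip(classes, freqs):
        if C + f >= target:            # first class whose c.f. reaches target
            return low + (target - C) / f * (high - low)
        C += f

classes = [(0, 15), (15, 30), (30, 45), (45, 60), (60, 75),
           (75, 90), (90, 105), (105, 120), (120, 135), (135, 150)]
freqs = [1, 4, 17, 28, 25, 18, 13, 6, 5, 3]
print(round(grouped_partition_value(classes, freqs, 3, 4), 2))    # Q3  = 87.5
print(round(grouped_partition_value(classes, freqs, 4, 10), 2))   # D4  ≈ 58.93
print(round(grouped_partition_value(classes, freqs, 37, 100), 2)) # P37 = 57.0
```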
***************************************************************************************
Exercise:

Q1. Find the Arithmetic Mean, Median and Mode from the following distribution.
classes 10-14 15-19 20-24 25-29 30-34 35-39
frequency 22 35 52 40 32 19
(Ans: A.M.=24.05, Median=23.63, Mode=22.43)

Q2. Find the Arithmetic, Geometric and Harmonic mean of the following frequency distribution.
Marks 0-10 10-20 20-30 30-40
No. of students 5 8 3 4
(Ans: A.M.=18.00, GM=14.58, HM=11.31)

Q3. The average salary of male employees in a firm was Rs.5200 and that of females was Rs.4200. The
mean salary of all the employees was Rs.5000. Find the percentage of male and female employees.
(Ans: Male 80%, Female 20%)

Q4. The Median and Mode of the following wage distribution are known to be Rs. 33.50 and Rs. 34.00
respectively. Find the value of f3, f4 and f5.
Wages 0-10 10-20 20-30 30-40 40-50 50-60 60-70
Frequency 4 16 f3 f4 f5 6 4
(Ans: f3= 60, f4=100, f5=40)

Q5. Find the arithmetic mean of the following frequency distribution: (Ans:21.66)

Xi 1 4 7 13 19 25 28 22 81 16
fi 7 46 19 51 89 89 28 19 33 93

Q6. The strength of 7 colleges in a city are 385; 1748; 1343; 1935; 786; 2874 and 2108. Find its median.
(Ans:1748)

Q7. The mean mark of 100 students was given to be 40. It was found later that a mark 53 was read as 83.
What is the corrected mean mark? (Ans: 39.70)
Q8. Calculate 3rd Quartile, 6th Deciles and 45th Percentiles from the following data:-
81,96,76,108,85,80,100,83,70,95,32,33 (Ans: Q3= 102.5, D6= 84.6, P45=83.4)

Q9. Calculate D7 and P85 for the following data: 79, 82, 36, 38, 51, 72, 68, 70, 64, 63
(Ans: D7= 71.4, P85=81.45)
Q10. The following is frequency distribution of over time (per week) performed by various officers from a
certain software company. Determine the value of D5, Q1 and P45.

Overtime (in hours) 4-8 8-12 12-16 16-20 20-24 24-28


No. of officers 4 8 16 18 20 18
(Ans- D5= 19.11, Q1=15.3, P45=18.17)

3. Measures of Dispersion
Surabhi Jain
Assistant professor (Statistics), College of Agriculture , JNKVV, Jabalpur (M.P.) 482004,India
Email id : sur_812004@yahoo.com

Dispersion: The measures of central tendency give us a single value that represents the central part of the whole distribution, whereas dispersion gives us an idea about the scatteredness of the data. In statistics, dispersion (also called variability, scatter, or spread) is the extent to which a distribution is stretched or squeezed. Dispersion helps us to study the variability of the items. It indicates the extent to which all other values are dispersed about the central value in a particular distribution.
Measures of Dispersion: There are two types of measures of dispersion. The first is the absolute measure, which expresses the dispersion in the same unit as the data. The second is the relative measure of dispersion, which is expressed as a ratio or percentage. Dispersion also helps a researcher in comparing two or more series.

Characteristic of an ideal measure of Dispersion: To be an ideal measure, the measure of


dispersion should satisfy the following characteristics.
(1) It should be easy to calculate and easy to understand.
(2) It should be rigidly defined.
(3) It should be based upon all the observations.
(4) It should be suitable for further mathematical treatment.
(5) It should be affected as little as possible by fluctuations of sampling.
In statistics, there are many techniques that are applied to measure dispersion.

The absolute measures of dispersion are


(1) Range (2) Quartile Deviation (3) Mean Deviation (4)Standard Deviation

(1) Range: It is defined as the difference between the maximum and minimum value of any
dataset.
For ungrouped data Range = Maximum value – Minimum value
For grouped data Range = upper value of last class interval – lowest value of first class
interval
Characteristics: (1) It is the simplest but crudest measure of dispersion. (2) It takes less time to compute. (3) It is based on only the two extreme observations, so it is subject to chance fluctuations and cannot tell us anything about the character of the distribution. (4) Range cannot be computed in the case of an "open-end" distribution, i.e., a distribution where the lower limit of the first group or the upper limit of the last group is not given. (5) It is not suitable for further mathematical treatment.

(2) Quartile Deviation or Semi-interquartile Range: It is the difference between the third and first quartiles divided by 2. It is a better method when we are interested in knowing the range within which a certain proportion of the items falls.
Formula: Quartile Deviation = (Q3 − Q1) / 2
Characteristics:
(1)It is easy to calculate. (2) Since the Quartile deviation only makes the use of 50 % of data so it is
also not a reliable measure of dispersion but it is better than range. (3) The quartile deviation is not

affected by the extreme items. It is completely dependent on the central items. If these values are
irregular and abnormal the result is bound to be affected. (4)This method of calculating dispersion
can be applied generally in case of open end series where the importance of extreme values is not
considered.

(3) Mean Deviation: It is defined as the average of the absolute deviations of all the observations from their average A (A = mean, median or mode).
For ungrouped data MD = Σ |xi − A| / n, where A = mean, median or mode
For grouped data MD = Σ fi |xi − A| / Σfi, where A = mean, median or mode

Characteristics: (1) It is based on all the observations but the step of ignoring the signs of
deviations creates artificiality and makes it useless for further mathematical treatment. (2) Average
Deviation may be calculated either by taking deviations from Mean or Median or Mode. (3)
Average Deviation is not affected by extreme items. (4) It is easy to calculate and understand. (5) It
is illogical and mathematically unsound to assume all negative signs as positive signs. Because the
method is not mathematically sound, the results obtained by this method are not reliable. (6) This
method is unsuitable for making comparisons either of the series or structure of the series.

(4) Standard Deviation (Best Measure): It is defined as the square root of the average of the sum of squares of deviations of all the observations from their mean. The concept of standard deviation, introduced by Karl Pearson, has practical significance because it is free from the defects present in the range, quartile deviation and average deviation.
For ungrouped data SD = √[ Σ(xi − x̄)² / n ] = √[ Σxi²/n − (Σxi/n)² ]
For grouped data SD = √[ Σ fi(xi − x̄)² / Σfi ] = √[ Σ fixi²/Σfi − (Σ fixi/Σfi)² ]

Characteristics: (1) It is the best measure of dispersion among all. (2) It is difficult to compute. (3)
The step of squaring the deviations overcomes the drawback of Mean Deviation. (4) Standard
deviation is the best measure of dispersion because it takes into account all the items and is capable
of future algebraic treatment and statistical analysis. It is possible to calculate standard deviation
for two or more series.(5) This measure is most suitable for making comparisons among two or
more series about variability.(6) It assigns more weights to extreme items and less weight to items
that are nearer to mean. It is because of this fact that the squares of the deviations which are large in
size would be proportionately greater than the squares of those deviations which are comparatively
small.

Mathematical properties of standard deviation (σ)


(i) If different values are increased or decreased by a constant, the standard deviation will remain
the same. If different values are multiplied or divided by a constant than the standard deviation will
be multiplied or divided by that constant.
(ii) A combined standard deviation can be obtained for two or more series with the formula given below:
If n1 and n2 are the sizes, x̄1 and x̄2 the means, and σ1 and σ2 the standard deviations of the two series, then the standard deviation σ of the combined series of size n1 + n2 is given by
σ² = [ n1(σ1² + d1²) + n2(σ2² + d2²) ] / (n1 + n2), where d1 = x̄1 − x̄, d2 = x̄2 − x̄ and x̄ = (n1x̄1 + n2x̄2)/(n1 + n2) is the mean of the combined series.
(iii) Variance is independent of change of origin: if we use di = xi − A then σ² = σd². It is not independent of change of scale: if we use di = (xi − A)/h, then σ² = h²σd².

********************************************************************************
Relative Measures for comparison of two series:
(1) Coefficient of Dispersion (CD): To compare the variability of two series, the coefficient of dispersion is used. These are pure numbers independent of the unit of measurement. The coefficients of dispersion based upon different measures of dispersion are as follows:
(1) Based on Range: Coefficient of Dispersion = (Maximum value − Minimum value) / (Maximum value + Minimum value)
(2) Based on Quartile Deviation: Coefficient of Dispersion = (Q3 − Q1) / (Q3 + Q1)
(3) Based on Mean Deviation: Coefficient of Dispersion = Mean Deviation / (Average from which it is calculated)
(4) Based on Standard Deviation: CD = Standard Deviation / Mean
Characteristics: Used to compare the dispersion of two or more distributions. Selection of the appropriate measure depends upon the measures of central tendency and dispersion available.

(2) Coefficient of Variation (CV): 100 times the coefficient of dispersion based upon standard deviation is called the coefficient of variation (a unitless measure).

CV = (Standard Deviation / Mean) × 100

Characteristics: It is expressed in percentage. Lesser value of coefficient of variation indicates


more consistency.
********************************************************************************
Objective: Computation of measures of dispersion by all methods for ungrouped data.
Kinds of Data: Suppose the data are 10, 7, 5, 9, 9, 10, 7, 3, 12
Solution:
(1) Range = max. value − min. value = 12 − 3 = 9
(2) Quartile Deviation: the formula for the quartile deviation is QD = (Q3 − Q1)/2
First arrange the observations in ascending order:
3, 5, 7, 7, 9, 9, 10, 10, 12
Now the formula for the quartile is Qi = [ i(n+1)/4 ]th observation, where i = 1, 2, 3 is the number of the quartile and n is the number of observations.
Q1 = [ 1 × (9+1)/4 ]th = (10/4)th = 2.5th observation
So Q1 = 2nd term + 0.5 × (3rd term − 2nd term) = 5 + 0.5 × (7 − 5) = 6
Similarly Q3 = [ 3 × (9+1)/4 ]th = (30/4)th = 7.5th observation
So Q3 = 10 + 0.5 × (10 − 10) = 10
Now QD = (10 − 6)/2 = 2
(3) Mean Deviation: the formula for the mean deviation is MD = Σ|xi − A|/n, where A = mean, median or mode.
Here we first calculate the mean deviation about the mean.
Now Mean = (10 + 7 + 5 + 9 + 9 + 10 + 7 + 3 + 12)/9 = 8
Hence MD = (1/9)(|10−8| + |7−8| + |5−8| + |9−8| + |9−8| + |10−8| + |7−8| + |3−8| + |12−8|)
= (1/9)(2 + 1 + 3 + 1 + 1 + 2 + 1 + 5 + 4) = 20/9 = 2.22
(4) Standard Deviation:
Mean = (10 + 7 + 5 + 9 + 9 + 10 + 7 + 3 + 12)/9 = 8
SD = √[ ((10−8)² + (7−8)² + (5−8)² + (9−8)² + (9−8)² + (10−8)² + (7−8)² + (3−8)² + (12−8)²)/9 ]
= √[ (4 + 1 + 9 + 1 + 1 + 4 + 1 + 25 + 16)/9 ] = √(62/9) = 2.62
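All four ungrouped measures can be reproduced in a few lines (a minimal sketch, using the population formulas with divisor n as above):

```python
import math

data = [10, 7, 5, 9, 9, 10, 7, 3, 12]
n = len(data)
mean = sum(data) / n
xs = sorted(data)

def q(pos):
    # interpolated positional value, e.g. the 2.5th ordered observation
    lo, frac = int(pos), pos - int(pos)
    return xs[lo - 1] + frac * (xs[lo] - xs[lo - 1])

rng = max(data) - min(data)                             # Range
qd = (q(3 * (n + 1) / 4) - q(1 * (n + 1) / 4)) / 2      # Quartile deviation
md = sum(abs(x - mean) for x in data) / n               # Mean deviation about mean
sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)  # Standard deviation

print(rng, qd, round(md, 2), round(sd, 2))  # 9 2.0 2.22 2.62
```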
********************************************************************************
Objective: Computation of Measures of Dispersion by all methods for Grouped data.
Kinds of data: The age distribution of 542 members are given below

Age(in years) 20-30 30-40 40-50 50-60 60-70 70-80 80-90 Total

No. of members 3 61 132 153 140 51 2 542

Solution:
(1) Range = 90 − 20 = 70
(2) Quartile Deviation: first we will find the first and third quartiles.

Age (in years)  No. of members (fi)  Cumulative frequency  xi  fixi   (xi − x̄)  fi|xi − x̄|  (xi − x̄)²  fi(xi − x̄)²
20-30           3                    3                     25  75     −29.72     89.2        883.3       2649.8
30-40           61                   64                    35  2135   −19.72     1202.9      388.9       23721.6
40-50           132                  196                   45  5940   −9.72      1283.0      94.5        12471.1
50-60           153                  349                   55  8415   0.28       42.8        0.1         12.0
60-70           140                  489                   65  9100   10.28      1439.2      105.7       14795.0
70-80           51                   540                   75  3825   20.28      1034.3      411.3       20975.2
80-90           2                    542                   85  170    30.28      60.6        916.9       1833.8
Total           542                                            29660             5152.0                  76458.5

First we determine the first quartile class: N/4 = 542/4 = 135.5.
135.5 falls in the 40-50 cumulative frequency class, so the first quartile is
Q1 = 40 + [ (135.5 − 64)/132 ] × 10 = 40 + 5.42 = 45.42 years
Similarly, 3N/4 = 3 × 542/4 = 406.5.
406.5 falls in the 60-70 cumulative frequency class, so the third quartile is
Q3 = 60 + [ (406.5 − 349)/140 ] × 10 = 60 + 4.11 = 64.11 years
So the quartile deviation = (64.11 − 45.42)/2 = 18.69/2 = 9.345 years
(3) Mean Deviation: first calculate the mean.
Mean = 29660/542 = 54.72 years
From the above table, Mean Deviation = 5152/542 = 9.51 years
(4) Standard Deviation = √(76458.5/542) = √141.07 = 11.88 years

********************************************************************************

Objective: Computation of variability of two series by coefficient of variation.


Kinds of data : Goals scored by two teams A and B in a football season were as follows
No. of goals scored in a match   0    1   2   3   4
No. of matches, team A           27   9   8   5   4
No. of matches, team B           17   9   6   5   3

Solution: Here we have to calculate the CV of each team separately.

No. of goals (xi)  fA  fAxi  (xi − x̄)  (xi − x̄)²  fA(xi − x̄)²  fB  fBxi  (xi − ȳ)  (xi − ȳ)²  fB(xi − ȳ)²
0                  27  0     −1.05      1.10        29.77         17  0     −1.2       1.44        24.48
1                  9   9     −0.05      0.00        0.02          9   9     −0.2       0.04        0.36
2                  8   16    0.95       0.90        7.22          6   12    0.8        0.64        3.84
3                  5   15    1.95       3.80        19.01         5   15    1.8        3.24        16.20
4                  4   16    2.95       8.70        34.81         3   12    2.8        7.84        23.52
Total              53  56                           90.83         40  48                           68.40

First we calculate the mean and standard deviation of the first (A) series:
x̄A = 56/53 = 1.05, σA = √(90.83/53) = √1.714 = 1.31, so CVA = (σA/x̄A) × 100 = (1.31/1.05) × 100 = 124.76
Now we calculate the mean and standard deviation of the second (B) series:
x̄B = 48/40 = 1.2, σB = √(68.4/40) = √1.71 = 1.30, so CVB = (σB/x̄B) × 100 = (1.30/1.2) × 100 = 108.33
Comparing the coefficients of variation of series A and B, series B, with the lower CV, is the more consistent.
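A sketch of the comparison in code (note: carrying full precision instead of the two-decimal rounding used in the table gives CV values near 123.9 and 109.0, but the ordering, and hence the conclusion that team B is more consistent, is the same):

```python
import math

def cv(values, freqs):
    # CV = (SD / mean) * 100, computed from a frequency table
    N = sum(freqs)
    mean = sum(x * f for x, f in zip(values, freqs)) / N
    var = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs)) / N
    return math.sqrt(var) / mean * 100

goals = [0, 1, 2, 3, 4]
cv_a = cv(goals, [27, 9, 8, 5, 4])
cv_b = cv(goals, [17, 9, 6, 5, 3])
print(round(cv_a, 1), round(cv_b, 1),
      "B more consistent" if cv_b < cv_a else "A more consistent")
```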
********************************************************************************
Objective : Comparison of wage earners of two firms
Kinds of data : An analysis of monthly wages paid to workers in two firms A and B, belonging to
the same industry, gives the following results:
                                         Firm A            Firm B
Number of wage earners                   586 (nA)          648 (nB)
Average monthly wage                     Rs. 52.50 (x̄A)    Rs. 47.50 (x̄B)
Variance of the distribution of wages    100 (σA²)         121 (σB²)

(a) Which firm A or B pays out the larger amount as monthly wages? (Ans: Firm B)
(b) In which firm A or B, there is greater variability in individual wages? (Ans: Firm B)
(c) What are the measures of (i) the average monthly wage and (ii) the variance of the distribution of wages of all the workers in firms A and B taken together? (Ans: mean 49.87, variance 117.26)

Solution: (a) Here we have to find the total amount of monthly wages paid by firm A and firm B. Since the number of workers (nA) and the average monthly wage (x̄A) are given, we calculate ΣXA using the formula x̄A = ΣXA/nA, which gives ΣXA = nA × x̄A = 586 × 52.50 = 30765.
Similarly for firm B we get ΣXB = nB × x̄B = 648 × 47.50 = 30780.
Hence we find that firm B pays out the larger amount as monthly wages.
(b) We know that variability is compared by the coefficient of variation. Here we calculate the CV for both firms: CVA = (σA/x̄A) × 100 and CVB = (σB/x̄B) × 100.
Putting in the values, we get CVA = (10/52.50) × 100 = 19.04
and CVB = (11/47.50) × 100 = 23.15
Since CVB > CVA, there is greater variability in individual wages in firm B.
(c) (i) x̄ = (nAx̄A + nBx̄B)/(nA + nB) = (586 × 52.50 + 648 × 47.50)/(586 + 648) = (30765 + 30780)/1234 = 49.87
(ii) We know that the formula for the combined variance is
σ² = [ nA(σA² + dA²) + nB(σB² + dB²) ] / (nA + nB), where dA = x̄A − x̄ and dB = x̄B − x̄, with x̄ the mean of the combined series.
Here dA = 52.50 − 49.87 = 2.63 and dB = 47.50 − 49.87 = −2.37.
Putting in the values, we get
σ² = [ 586(100 + (2.63)²) + 648(121 + (−2.37)²) ] / (586 + 648)
= (62653.30 + 82047.75)/1234 = 117.26
The variance of the distribution of wages of all the workers in firms A and B taken together is 117.26.
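Parts (a)-(c) can be verified directly with the combined mean/variance formula (a sketch; the function name is ours):

```python
def combined_mean_variance(n1, m1, v1, n2, m2, v2):
    # sigma^2 = [n1(v1 + d1^2) + n2(v2 + d2^2)] / (n1 + n2)
    mean = (n1 * m1 + n2 * m2) / (n1 + n2)
    d1, d2 = m1 - mean, m2 - mean
    var = (n1 * (v1 + d1 ** 2) + n2 * (v2 + d2 ** 2)) / (n1 + n2)
    return mean, var

mean, var = combined_mean_variance(586, 52.50, 100, 648, 47.50, 121)
print(586 * 52.50, 648 * 47.50)       # total wage bills: 30765.0 30780.0
print(round(mean, 2), round(var, 2))  # 49.87 117.26
```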
********************************************************************************

Objective: Standard deviation of a combined sample.

Kinds of data: The first of two samples has 100 items with mean 15 and S.D. 3. The whole group has 250 items with mean 15.6 and S.D. √13.44. Find the S.D. of the second sample.
Solution: Here it is given that n1 = 100, x̄1 = 15, σ1 = 3 and n = 250, x̄ = 15.6, σ = √13.44.
We know the formula for the combined standard deviation:
σ² = [ n1(σ1² + d1²) + n2(σ2² + d2²) ] / (n1 + n2), where d1 = x̄1 − x̄, d2 = x̄2 − x̄ and x̄ = (n1x̄1 + n2x̄2)/(n1 + n2) is the mean of the combined series.
So, first we find the size of the second sample: n2 = n − n1 = 250 − 100 = 150.
Since the mean of the first sample and the combined mean are given, we find the mean of the second sample by substituting into x̄ = (n1x̄1 + n2x̄2)/(n1 + n2):
15.6 = (100 × 15 + 150 × x̄2)/(100 + 150); solving, we get x̄2 = 16.
Now d1 = 15 − 15.6 = −0.6 and d2 = 16 − 15.6 = 0.4.
Putting all these values into the formula for the combined variance,
13.44 = (1/(100 + 150)) [ 100(3² + (−0.6)²) + 150(σ2² + (0.4)²) ]
Solving, we get σ2 = 4.
********************************************************************************

Objective: Corrected mean and corrected standard deviation corresponding to the corrected figures.
Kinds of data: For a group of 200 candidates, the mean and standard deviation of scores were found to be 40 and 15 respectively. Later on it was discovered that the scores 43 and 35 were misread as 34 and 53 respectively. Find the corrected mean and corrected standard deviation corresponding to the corrected figures.

Solution: Here it is given that n = 200, mean = 40 and SD = 15.
Wrong scores are 34 and 53; the correct scores are 43 and 35.
(i) Corrected mean: to calculate the corrected mean, first find the total score using the formula x̄ = ΣX/n.
Putting in the values, we get ΣX = 200 × 40 = 8000.
Next we find the corrected total score = total score − wrong scores + correct scores = 8000 − (34 + 53) + (43 + 35) = 7991.
Hence the corrected mean = corrected total score / no. of candidates = 7991/200 = 39.95.
(ii) Corrected SD: we know that the formula for the SD is SD = √[ Σxi²/n − (Σxi/n)² ].
Since the SD is 15 and the mean is 40, we first calculate the sum of squares Σxi² using the formula for the SD:
Σxi² = n(σ² + x̄²) = 200 × (225 + 1600) = 365000
Now we calculate the corrected Σxi² = 365000 − (sum of squares of the wrong figures) + (sum of squares of the correct figures)
corrected Σxi² = 365000 − (34² + 53²) + (43² + 35²) = 365000 − 3965 + 3074 = 364109

Now corrected SD = √(corrected sum of squares / no. of candidates − corrected mean²) = √(364109/200 − 39.95²) = √224.54 = 14.98

Hence the corrected mean = 39.95 and the corrected standard deviation = 14.98.
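The same bookkeeping in code (a sketch; note that keeping the corrected mean at full precision, 39.955, gives an SD near 14.97, while rounding it to 39.95 first, as in the worked solution, gives 14.98):

```python
import math

def corrected_mean_sd(n, mean, sd, wrong, correct):
    # Rebuild sum(x) and sum(x^2) from n, mean and SD; swap misread scores.
    total = n * mean + sum(correct) - sum(wrong)
    sum_sq = (n * (sd ** 2 + mean ** 2)
              + sum(x ** 2 for x in correct) - sum(x ** 2 for x in wrong))
    new_mean = total / n
    return new_mean, math.sqrt(sum_sq / n - new_mean ** 2)

m, s = corrected_mean_sd(200, 40, 15, wrong=[34, 53], correct=[43, 35])
print(round(m, 3), round(s, 2))  # 39.955 14.97
```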
********************************************************************************

Important Points on Dispersion:


1. Range, QD, MD and SD are the absolute measures of dispersion.
2. CD and CV are the relative measures of dispersion.
3. Range is the crude measure of dispersion.
4. Standard deviation is the best measure of dispersion.
5. The coefficient of variation is a unitless measure of dispersion and was suggested by Karl Pearson.
6. A low standard deviation indicates that the data points tend to be close to the mean.

Exercise:
Q1. Calculate the variance of the following series. (i) 5,5,5,5,5 (ii) 4,5,6. (Ans. (i) 0, (ii)0.67)
Q2. Mean and Standard deviation of 10 figures are 50 and 10 respectively. What will be the mean
and SD if (i) every figure is increased by 4 (ii) every figure is multiplied by 2 (iii) if the figures
are multiplied by 2 and then diminished by 4? (Ans. (i)54,10 (ii) 100,20 (iii) 96,20).

Q3. Calculate mean deviation and standard deviation from following table:
Classes 0-5 5-10 10-15 15-20 20-25 25-30 30-35 35-40
Frequency 2 5 7 13 21 16 8 3
(Ans: Mean Deviation = 6.23, Standard Deviation=8.05)

Q4. If the mean of 100 observations is 50 and CV is 40 %. Calculate the Standard Deviation.
(Ans. SD=20)
Q5. The arithmetic mean and variance of a set of 10 figures are known to be 17 and 33
respectively. Out of 10 figures one figure (i.e. 26) was found inaccurate and was weeded out.
What are the resulting (a) arithmetic mean and (b) variance of the 9 figures?
(Ans: AM=16, variance = 26.67)

Q6. The means of two samples of size 50 and 100 respectively are 54.1 and 50.3 and the standard
deviations are 8 and 7. Obtain the mean and standard deviation of the sample of size 150
obtained by combining the two samples. (Ans: Combined mean=51.57, Combined S.D.= 7.5)

Q7. An analysis of monthly wages paid to workers in two firms A and B, belonging to the same
industry, gives the following results:
Firm A Firm B
Number of wage earners 500 600
Average monthly wage Rs. 186.00 Rs. 175.00
Variance of the distribution of wages 81 100

(a) Which firm A or B pays out the larger amount as monthly wages? (Ans: Firm B)
(b) In which firm A or B, there is greater variability in individual wages? (Ans: Firm B)
(c) What are the measures of (i) average monthly wage and (ii) the variance of the
distribution of wages of all the workers in the firms A and B taken together?
(Ans: Combined monthly wage: Rs. 180, Combined variance = 121.36)

4. Moments, Skewness and Kurtosis
R. S. Solanki

Assistant professor (Maths & Stat.) , College of Agriculture , Waraseoni, Balaghat (M.P.),India
Email id : ramkssolanki@gmail.com

1. Moments:
The word "moment" is borrowed from mechanics, where it measures the turning effect of a force about a point. In statistics, moments are the arithmetic means of the first, second, third, ..., rth powers of the deviations taken from either the mean or an arbitrary point of a distribution. In other words, moments are statistical measures that give certain characteristics of the distribution. Generally, in any frequency distribution, four moments are obtained, known as the first, second, third and fourth moments. These four moments describe the information about the mean, variance, skewness and kurtosis of a frequency distribution. Calculation of moments gives some features of a distribution which are of statistical importance.

Moments can be classified into raw and central moments. Raw moments are measured about any arbitrary point A (say). If A is taken to be zero, the raw moments are called moments about origin. When A is taken to be the arithmetic mean, we get central moments. The first raw moment about origin is the mean, whereas the first central moment is zero. The second raw and central moments are the mean square deviation and the variance, respectively. The third and fourth moments are useful in measuring skewness and kurtosis.

Methods of Calculation
1. Moments about Arbitrary Point i.e. raw moments
For Ungrouped Data
If x1, x2, ..., xN are N observations of a variable x, then their moments about an arbitrary point A are

Zero order moment     μ0' = (1/N) Σi (xi − A)⁰ = 1
First order moment    μ1' = (1/N) Σi (xi − A)
Second order moment   μ2' = (1/N) Σi (xi − A)²
Third order moment    μ3' = (1/N) Σi (xi − A)³
Fourth order moment   μ4' = (1/N) Σi (xi − A)⁴

In general the rth order moment about the arbitrary point A is given by
μr' = (1/N) Σi (xi − A)^r ; r = 0, 1, 2, ...
For Grouped Data
If x1, x2, ..., xk are k values (or mid values in case of class intervals) of a variable x with their corresponding frequencies f1, f2, ..., fk, then the moments about an arbitrary point A are

Zero order moment     μ0' = (1/N) Σi fi(xi − A)⁰ = 1 ; N = Σfi
First order moment    μ1' = (1/N) Σi fi(xi − A)
Second order moment   μ2' = (1/N) Σi fi(xi − A)²
Third order moment    μ3' = (1/N) Σi fi(xi − A)³
Fourth order moment   μ4' = (1/N) Σi fi(xi − A)⁴

In general the rth order moment about the arbitrary point A is given by
μr' = (1/N) Σi fi(xi − A)^r ; N = Σfi, r = 0, 1, 2, ...

2. Moments about origin: In raw moments if A is taken to be zero then raw moments are called
moments about origin and denoted by mr.
In general, for ungrouped data  mᵣ = Σᵢ (Xᵢ)ʳ / N , where N is the number of observations and r = 0, 1, 2, ...;

for grouped data  mᵣ = Σᵢ fᵢ(Xᵢ)ʳ / Σᵢ fᵢ .

3. Moments about arithmetic mean i.e. central moments


When we take the deviation from the arithmetic mean and calculate the moments, these are known as
moments about arithmetic mean or central moments.
For Ungrouped Data
If x1 , x2 , ..., x N are N observations of variable x, then their moments about arithmetic mean
x̄ = (1/N) Σᵢ xᵢ are

Zero order moment      μ₀ = (1/N) Σᵢ (xᵢ − x̄)⁰ = 1

First order moment     μ₁ = (1/N) Σᵢ (xᵢ − x̄) = 0

Second order moment    μ₂ = (1/N) Σᵢ (xᵢ − x̄)² = σ² (Variance)

Third order moment     μ₃ = (1/N) Σᵢ (xᵢ − x̄)³

Fourth order moment    μ₄ = (1/N) Σᵢ (xᵢ − x̄)⁴

In general the rth order moment about the arithmetic mean x̄ is given by

μᵣ = (1/N) Σᵢ (xᵢ − x̄)ʳ ;  r = 0, 1, 2, ...

For Grouped Data
If x1 , x2 , ..., xk are k values (or mid values in case of class intervals) of a variable x with their
corresponding frequencies f1 , f 2 , ..., f k then moments about arithmetic mean
x̄ = (1/N) Σᵢ fᵢxᵢ ;  N = Σᵢ fᵢ  are

Zero order moment      μ₀ = (1/N) Σᵢ fᵢ(xᵢ − x̄)⁰ = 1

First order moment     μ₁ = (1/N) Σᵢ fᵢ(xᵢ − x̄) = 0

Second order moment    μ₂ = (1/N) Σᵢ fᵢ(xᵢ − x̄)² = σ² (Variance)

Third order moment     μ₃ = (1/N) Σᵢ fᵢ(xᵢ − x̄)³

Fourth order moment    μ₄ = (1/N) Σᵢ fᵢ(xᵢ − x̄)⁴

In general the rth order moment about the arithmetic mean x̄ is given by

μᵣ = (1/N) Σᵢ fᵢ(xᵢ − x̄)ʳ ;  N = Σᵢ fᵢ ,  r = 0, 1, 2, ...
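The grouped central moments can be sketched in Python (an illustrative helper of my own, not part of the manual; the data is the milk-yield example worked out later in this chapter):

```python
def grouped_central_moments(mids, freqs, orders=(0, 1, 2, 3, 4)):
    """Central moments for grouped data from class mid-values and frequencies."""
    N = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, mids)) / N
    return [sum(f * (x - mean) ** r for f, x in zip(freqs, mids)) / N
            for r in orders]

# Milk-yield example: mid values 5..17, frequencies of 135 cows
mu = grouped_central_moments([5, 7, 9, 11, 13, 15, 17],
                             [8, 10, 27, 38, 25, 20, 7])
```

For this data μ₀ = 1, μ₁ = 0, μ₂ ≈ 9.05, μ₃ ≈ −3.02 and μ₄ ≈ 207.05, agreeing with the hand computation in the worked example.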

Relationship between central moments and raw moments:

μᵣ = μ′ᵣ − ʳC₁ μ′₁ μ′ᵣ₋₁ + ʳC₂ (μ′₁)² μ′ᵣ₋₂ − ... + (−1)ʳ (μ′₁)ʳ

In particular,  μ₂ = μ′₂ − (μ′₁)²

μ₃ = μ′₃ − 3 μ′₁ μ′₂ + 2(μ′₁)³
μ₄ = μ′₄ − 4 μ′₁ μ′₃ + 6 (μ′₁)² μ′₂ − 3(μ′₁)⁴

Important: (i) μ₀ = μ′₀ = 1  (ii) The first central moment is always zero.  (iii) μ₂ = SD² = variance
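The conversion formulas can be checked numerically (an illustrative sketch of my own, using the raw moments about A = 123 from the earnings example later in this chapter; exact arithmetic gives μ₃ ≈ −0.52 and μ₄ ≈ 12.71, versus the −0.51 and 12.70 obtained from the rounded table entries):

```python
def central_from_raw(m1, m2, m3, m4):
    """Convert raw moments about a point A into central moments mu2, mu3, mu4."""
    mu2 = m2 - m1 ** 2
    mu3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3
    mu4 = m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4
    return mu2, mu3, mu4

# Raw moments about A = 123 for the earnings data, kept as exact fractions
mu2, mu3, mu4 = central_from_raw(4 / 7, 20 / 7, 4.0, 116 / 7)
```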
********************************************************************************
2. Skewness:
The skewness of a distribution is defined as the lack of symmetry. In a symmetrical
distribution, the Mean, Median and Mode are equal to each other and the ordinate at mean divides
the distribution into two equal parts such that one part is mirror image of the other. If some
observations, of very high (low) magnitude, are added to such a distribution, its right (left) tail gets
elongated. These observations are also known as extreme observations. The presence of extreme
observations on the right hand side of a distribution makes it positively skewed and the three
averages, viz., mean, median and mode, will no longer be equal. We shall in fact have Mean >
Median > Mode when a distribution is positively skewed. On the other hand, the presence of
extreme observations on the left hand side of a distribution makes it negatively skewed and the
relationship between mean, median and mode is: Mean < Median < Mode (see following figure).

Measures of Skewness

1. Karl Pearson’s coefficient of skewness Sk, based on the mode, is given by

Sk = (Mean − Mode)/S.D.

The sign of Sk gives the direction and its magnitude gives the extent of skewness. If Sk > 0, the
distribution is positively skewed, and if Sk < 0 it is negatively skewed.

Karl Pearson’s coefficient of skewness Sk is defined in terms of the median as

Sk = 3(Mean − Median)/S.D.

The range of Karl Pearson’s coefficient of skewness is −3 ≤ Sk ≤ +3.

2. The Bowley’s coefficient of skewness (quartile coefficient of skewness)

Sb = [(Q₃ − Q₂) − (Q₂ − Q₁)] / [(Q₃ − Q₂) + (Q₂ − Q₁)] = (Q₃ + Q₁ − 2Q₂)/(Q₃ − Q₁) ,

where Q₁, Q₂ and Q₃ are the first, second and third quartiles respectively. The range of Bowley’s coefficient of skewness is −1 ≤ Sb ≤ +1.
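Bowley's coefficient can be sketched in Python for ungrouped data (an illustrative helper of my own, using the (N+1)/4-th term rule the manual applies; it is exact only when N+1 is divisible by 4, as in the earnings example below):

```python
def bowley_skewness(data):
    """Quartile coefficient of skewness via the (N+1)/4-th term rule."""
    xs = sorted(data)
    n = len(xs)
    q1 = xs[(n + 1) // 4 - 1]        # 1-based (N+1)/4-th term
    q2 = xs[(n + 1) // 2 - 1]        # median
    q3 = xs[3 * (n + 1) // 4 - 1]    # 1-based 3(N+1)/4-th term
    return (q3 + q1 - 2 * q2) / (q3 - q1)
```

For the seven earnings values 126, 121, 124, 122, 125, 124, 123 this gives Q₁ = 122, Q₂ = 124, Q₃ = 125 and Sb = −0.33, as in the worked example.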

3. Coefficient of skewness based on moments: The coefficient of skewness based on moments is given by

Sk = √β₁ (β₂ + 3) / [2(5β₂ − 6β₁ − 9)] ,  where β₁ = μ₃²/μ₂³ and β₂ = μ₄/μ₂² .

********************************************************************************

3. Kurtosis:

Kurtosis is another measure of the shape of a distribution. Whereas skewness measures


the lack of symmetry of the frequency curve of a distribution, kurtosis is a measure of the
relative peakedness of its frequency curve. Various frequency curves can be divided into three
categories depending upon the shape of their peak. The three shapes are termed as Leptokurtic,
Mesokurtic and Platykurtic as shown in following figure.

Measures of Kurtosis

Karl Pearson developed the Beta and Gamma coefficients (or Beta and Gamma measures) of kurtosis
based on the central moments, which are given respectively by

β₂ = μ₄/μ₂²  and  γ₂ = β₂ − 3

The value of β₂ = 3 (γ₂ = 0) for a mesokurtic (normal) curve. When β₂ > 3 (γ₂ > 0), the curve is
more peaked than the mesokurtic curve and is termed leptokurtic. Similarly, when β₂ < 3
(γ₂ < 0), the curve is less peaked than the mesokurtic curve and is called a platykurtic curve.

Objective: Moments, Measures of Skewness and Kurtosis (Ungrouped data).


Kinds of data: The daily earnings (in rupees) of a sample of 7 agriculture workers are: 126, 121,
124, 122, 125, 124, 123. Compute the first four raw (about the point 123) and central moments, the
coefficients of skewness and the coefficients of kurtosis.
Solution: Moments about any arbitrary value (A=123) i.e. raw moments

Table: Calculation for raw moments.

Sr. No.  x  (x − 123)  (x − 123)²  (x − 123)³  (x − 123)⁴


1 126 3 9 27 81
2 121 -2 4 -8 16
3 124 1 1 1 1
4 122 -1 1 -1 1
5 125 2 4 8 16
6 124 1 1 1 1
7 123 0 0 0 0
Total 865 4 20 28 116

The first raw moment    μ′₁ = (1/N) Σᵢ (xᵢ − A) = (1/7) × 4 = 0.57
The second raw moment   μ′₂ = (1/N) Σᵢ (xᵢ − A)² = (1/7) × 20 = 2.86
The third raw moment    μ′₃ = (1/N) Σᵢ (xᵢ − A)³ = (1/7) × 28 = 4
The fourth raw moment   μ′₄ = (1/N) Σᵢ (xᵢ − A)⁴ = (1/7) × 116 = 16.57 .
Moments about the Arithmetic Mean i.e. central moments
The arithmetic mean of daily earnings of agriculture workers is
x̄ = (1/N) Σᵢ xᵢ = (1/7) × 865 = 123.57
Table: Calculation for central moments.
Sr.  x  (x − 123.57)  (x − 123.57)²  (x − 123.57)³  (x − 123.57)⁴
1 126 2.43 5.90 14.35 34.87
2 121 -2.57 6.60 -16.97 43.62
3 124 0.43 0.18 0.08 0.03
4 122 -1.57 2.46 -3.87 6.08
5 125 1.43 2.04 2.92 4.18
6 124 0.43 0.18 0.08 0.03
7 123 -0.57 0.32 -0.19 0.11
Total 865  0.00 17.71 -3.60 88.92
The first central moment    μ₁ = (1/N) Σᵢ (xᵢ − x̄) = (1/7) × 0.00 = 0.00
The second central moment   μ₂ = (1/N) Σᵢ (xᵢ − x̄)² = (1/7) × 17.71 = 2.53
The third central moment    μ₃ = (1/N) Σᵢ (xᵢ − x̄)³ = (1/7) × (−3.60) = −0.51
The fourth central moment   μ₄ = (1/N) Σᵢ (xᵢ − x̄)⁴ = (1/7) × 88.92 = 12.70 .
Karl Pearson’s coefficient of skewness

The median ( M d ) of daily earnings of agriculture workers:

Arrange the data in ascending order

121,122,123,124,124,125,126

Total number of observations N = 7 (odd)

Hence the median

Md = ((N + 1)/2)th term = ((7 + 1)/2)th term = 4th term = 124.
The mode ( M o ) of daily earnings of agriculture workers:
Since the frequency of 124 is maximum (i.e. 2), hence
M o =124.
Standard deviation (σ):

σ = √( Σᵢ (xᵢ − x̄)² / N ) = √(17.71/7) = 1.59 .
Karl Pearson’s coefficient of skewness based on median
Sk = 3(x̄ − Md)/σ = 3(123.57 − 124)/1.59 = −0.81
Karl Pearson’s coefficient of skewness based on mode
Sk = (x̄ − Mo)/σ = (123.57 − 124)/1.59 = −0.27 .
Bowley’s coefficient of skewness
Arrange the data in ascending order
121,122,123,124,124,125,126
Total number of observations N = 7
Hence the first quartile Q₁

Q₁ = ((N + 1)/4)th term = ((7 + 1)/4)th term = 2nd term = 122

Second quartile Q₂ = Md = 124
Third quartile Q₃

Q₃ = (3(N + 1)/4)th term = (3(7 + 1)/4)th term = 6th term = 125

Hence Bowley’s coefficient of skewness
Sb = [(Q₃ − Q₂) − (Q₂ − Q₁)] / [(Q₃ − Q₂) + (Q₂ − Q₁)] = (Q₃ + Q₁ − 2Q₂)/(Q₃ − Q₁) = (125 + 122 − 2 × 124)/(125 − 122) = −0.33 .
Coefficients of kurtosis:

β₂ = μ₄/μ₂² = 12.70/(2.53)² = 1.98

and  γ₂ = β₂ − 3 = 1.98 − 3 = −1.02 .

Hence the curve is negatively skewed and platykurtic.

Objective: Moments, Measures of Skewness and Kurtosis (Grouped data).
Kinds of data: Compute first four raw (at A=11) and central moments and coefficients of
skewness and kurtosis for the following data on milk yield:

Milk yield (kg) 4-6 6-8 8-10 10-12 12-14 14-16 16-18
No. of Cows 8 10 27 38 25 20 7

Solution: Moments about any arbitrary value (A=11) i.e. raw moments

Table: Calculation for raw moments.

Sr.  Milk yield (Kg)  No. of Cows (f)  Mid Value (x)  f(x − A)  f(x − A)²  f(x − A)³  f(x − A)⁴
1 4-6 8 5 -48 288 -1728 10368
2 6-8 10 7 -40 160 -640 2560
3 8-10 27 9 -54 108 -216 432
4 10-12 38 11 0 0 0 0
5 12-14 25 13 50 100 200 400
6 14-16 20 15 80 320 1280 5120
7 16-18 7 17 42 252 1512 9072
Total N=135 30 1228 408 27952

The first raw moment    μ′₁ = (1/N) Σᵢ fᵢ(xᵢ − A) = (1/135) × 30 = 0.22
The second raw moment   μ′₂ = (1/N) Σᵢ fᵢ(xᵢ − A)² = (1/135) × 1228 = 9.10
The third raw moment    μ′₃ = (1/N) Σᵢ fᵢ(xᵢ − A)³ = (1/135) × 408 = 3.02
The fourth raw moment   μ′₄ = (1/N) Σᵢ fᵢ(xᵢ − A)⁴ = (1/135) × 27952 = 207.05 .

Moments about the Arithmetic Mean i.e. central moments:


The arithmetic mean of milk yield
x̄ = (1/N) Σᵢ fᵢxᵢ = (1/135) × 1515 = 11.22

Table: Calculation for central moments.


Sr. No.  Milk yield (kg)  No. of Cows (f)  Mid Value (x)  fx  f(x − x̄)  f(x − x̄)²  f(x − x̄)³  f(x − x̄)⁴
1 4-6 8 5 40 -49.78 309.73 -1927.20 11991.46
2 6-8 10 7 70 -42.22 178.27 -752.70 3178.08

3 8-10 27 9 243 -60.00 133.33 -296.30 658.44
4 10-12 38 11 418 -8.44 1.88 -0.42 0.09
5 12-14 25 13 325 44.44 79.01 140.47 249.72
6 14-16 20 15 300 75.56 285.43 1078.30 4073.57
7 16-18 7 17 119 40.44 233.68 1350.15 7800.84
Total N=135 1515 0.00 1221.33 -407.70 27952.20

The first central moment    μ₁ = (1/N) Σᵢ fᵢ(xᵢ − x̄) = (1/135) × 0.00 = 0.00
The second central moment   μ₂ = (1/N) Σᵢ fᵢ(xᵢ − x̄)² = (1/135) × 1221.33 = 9.05
The third central moment    μ₃ = (1/N) Σᵢ fᵢ(xᵢ − x̄)³ = (1/135) × (−407.70) = −3.02
The fourth central moment   μ₄ = (1/N) Σᵢ fᵢ(xᵢ − x̄)⁴ = (1/135) × 27952.20 = 207.05 .
Karl Pearson’s coefficient of skewness

Table: Calculation for median and mode.


Sr.  Milk yield (kg)  No. of Cows (f)  Mid Value (x)  cf
1  4-6  8  5  8
2  6-8  10  7  18
3  8-10  27  9  45
4  10-12  38  11  83
5  12-14  25  13  108
6  14-16  20  15  128
7  16-18  7  17  135
Total  N = 135

Median number = (N + 1)/2 = (135 + 1)/2 = 68  ⸫ Median class = (10-12).
Maximum frequency = 38  ⸫ Modal class = (10-12).

The median ( Md ) of milk yield:

L₁ = 10, i = 2, f = 38, N = 135, C = 45

Median = Md = L₁ + (i/f)(N/2 − C) = 10 + (2/38)(135/2 − 45) = 11.18 .

The mode ( Mo ) of milk yield:
L₁ = 10, f₁ = 38, f₀ = 27, f₂ = 25, i = 2

Mode = Mo = L₁ + [(f₁ − f₀)/(2f₁ − f₀ − f₂)] × i = 10 + [(38 − 27)/(2 × 38 − 27 − 25)] × 2 = 10.92 .
Standard deviation (σ):

σ = √( Σᵢ fᵢ(xᵢ − x̄)² / N ) = √(1221.33/135) = 3.01 .

Karl Pearson’s coefficient of skewness based on median
Sk = 3(x̄ − Md)/σ = 3(11.22 − 11.18)/3.01 = 0.04 .
Karl Pearson’s coefficient of skewness based on mode
Sk = (x̄ − Mo)/σ = (11.22 − 10.92)/3.01 = 0.10 .
Bowley’s coefficient of skewness:
The first quartile Q₁

Q₁ = (N/4)th term = (135/4)th term = 33.75 ≈ 34th term

The 34th term lies in the class interval “8-10”. Hence
L₁ = 8, i = 2, f = 27, N = 135, C = 18

Q₁ = L₁ + (i/f)(N/4 − C) = 8 + (2/27)(135/4 − 18) = 9.17

Second quartile Q₂ = Md = 11.18
Third quartile Q₃

Q₃ = (3N/4)th term = (405/4)th term = 101.25 ≈ 101st term

The 101st term lies in the class interval “12-14”. Hence
L₁ = 12, i = 2, f = 25, N = 135, C = 83

Q₃ = L₁ + (i/f)(3N/4 − C) = 12 + (2/25)(3 × 135/4 − 83) = 13.46

Hence Bowley’s coefficient of skewness
Sb = [(Q₃ − Q₂) − (Q₂ − Q₁)] / [(Q₃ − Q₂) + (Q₂ − Q₁)] = (Q₃ + Q₁ − 2Q₂)/(Q₃ − Q₁) = (13.46 + 9.17 − 2 × 11.18)/(13.46 − 9.17) = 0.06

Coefficients of kurtosis:

β₂ = μ₄/μ₂² = 207.05/(9.05)² = 2.53

and  γ₂ = β₂ − 3 = 2.53 − 3 = −0.47 .

Hence the curve is positively skewed and platykurtic.
********************************************************************************
Objective: Computation of Mean and variance when moments about arbitrary value is given .
Kinds of data: The first three moments of a distribution about the value 2 of a variable are 1, 16
and -40.
Solution: Here the arbitrary value A = 2 and the given moments are μ′₁ = 1, μ′₂ = 16 and μ′₃ = −40.
We know that μ′₁ = Σᵢ fᵢ(Xᵢ − 2)/Σᵢ fᵢ = 1, hence Σᵢ fᵢXᵢ/Σᵢ fᵢ − 2 = 1, which gives x̄ = Σᵢ fᵢXᵢ/Σᵢ fᵢ = 1 + 2 = 3.
Hence the mean is 3.
We know that μ₂ = μ′₂ − (μ′₁)²; by putting in the values we get
μ₂ = μ′₂ − (μ′₁)² = 16 − 1 × 1 = 15
Hence the variance is 15.
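The little computation above can be written out in Python (a minimal sketch of the same two identities):

```python
# Recovering mean and variance from raw moments about an arbitrary point A
A = 2
m1, m2 = 1, 16            # first two raw moments about A (from the problem)
mean = A + m1             # because mu1' = x_bar - A
variance = m2 - m1 ** 2   # because mu2 = mu2' - (mu1')^2
```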
********************************************************************************
Exercise:

Q1. The marks obtained by 46 students in an examination are as follows:

Marks 0-5 5-10 10-15 15-20 20-25 25-30


Students 5 7 10 16 4 4
Calculate Karl Pearson’s and Bowley’s coefficients of skewness.
(Ans.: Karl Pearson’s coefficient of skewness = -0.31 and
Bowley’s coefficient of skewness = -0.22)

Q2. Calculate Karl Pearson’s and Bowley’s coefficients of skewness for the following distribution:
Measurement 3.5 4.5 5.5 6.5 7.5 8.5 9.5
Frequency 3 7 22 60 85 32 8
(Ans.: Karl Pearson’s coefficient of skewness = -0.36 and
Bowley’s coefficient of skewness = -1)

Q3. Compute the first four raw and central moments with coefficient of kurtosis for the following
data:

Plant Height (cm) 30-35 35-40 40-45 45-50 50-55 55-60 60-65 65-70
No. of plants 5 14 16 25 14 12 8 6

(Ans.: μ′₁ = −3.65, μ′₂ = 98.25, μ′₃ = −766.25, μ′₄ = 20756.25

μ₁ = 0, μ₂ = 84.93, μ₃ = 212.34, μ₄ = 16890.14; γ₂ = −0.66.)

********************************************************************************

5. Correlation and Regression
Surabhi Jain
Assistant Professor (Statistics) , College of Agriculture , JNKVV, Jabalpur (M.P.) 482004,India
Email id : sur_812004@yahoo.com

Correlation: Correlation is a measure of linear relationship between two variables. It is a statistical


technique that can show whether and how strongly pairs of variables are related. For example,
height and weight are related; taller people tend to be heavier than shorter people. Correlation
works for quantifiable data. It cannot be used for purely categorical data, such as gender, brands
purchased, or favorite color.
It can be defined as a bi-variate analysis that measures the strength of association between
two variables and the direction of the relationship.
Karl Pearson correlation coefficient (or Product moment correlation
coefficient): Pearson r correlation is the most widely used correlation statistic to measure the
degree of the relationship between linearly related variables. The correlation coefficient between X
on Y and Y on X is same and calculated by Karl Pearson correlation formula
r(x,y) = cov(x,y)/(σₓσᵧ) = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ(xᵢ − x̄)² Σᵢ(yᵢ − ȳ)² ) = [nΣxᵢyᵢ − Σxᵢ Σyᵢ] / [√(nΣxᵢ² − (Σxᵢ)²) √(nΣyᵢ² − (Σyᵢ)²)] ,
here n=number of observation, xi = value of ith observation of x variable, yi = value of ith
observation of y variable.

Assumptions:
Normality: Both variables should be normally distributed (normally distributed variables have a
bell-shaped curve).
Linearity: It assumes a straight line relationship between each of the two variables.
Homoscedasticity: It assumes that data is equally distributed about the regression line. It basically
means that the variances along the line of best fit remain similar as you move along the line.
Type of correlation:

Positive Correlation: If two variables deviate in the same direction then the correlation is said to be
positive correlation. The line corresponding to the scatter plot is an increasing line sloping up from left to
right. Example height and weight of a group of persons, the income and expenditure etc.

Negative Correlation: If two variables deviate in the opposite direction, i.e. an increase in one results in a decrease
in the other, then the correlation is said to be negative correlation. The line corresponding to the scatter
plot is a decreasing line sloping down from left to right. Example: price and demand of a commodity.

No correlation: occurs when there is no linear dependency between the variables.

Range of Correlation coefficient (r): Correlation coefficient lies between -1 to +1. It is a pure
number and independent of unit of measurement.
Effect of change of origin and scale: The correlation coefficient is independent of change of
origin (Xᵢ → Xᵢ − A) and of scale (Xᵢ → Xᵢ/h).
Correlation between independent variables: Two independent variables are uncorrelated, but
two uncorrelated variables (if r=0 found) need not necessarily be independent.
Test of significance of correlation coefficient (Null hypothesis: r = 0): To test the significance of the
correlation coefficient the t test statistic is used as follows:

tcal = rcal √(n − 2) / √(1 − rcal²)  at (n − 2) d.f., here rcal is the calculated value of the correlation coefficient.

To test the null hypothesis we compare the calculated value of t with tabulated value of t at (n-2)
degree of freedom.
If tcal>ttab, the null hypothesis is rejected and we conclude that the correlation is significant.
If tcal<ttab, the null hypothesis is accepted and we conclude that the correlation is non-significant.
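The computation can be sketched in Python (illustrative helpers of my own; the data is the marks example worked out below, computed here with exact means, so the values differ very slightly from the hand computation that rounds the means):

```python
import math

def pearson_r(xs, ys):
    """Karl Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def t_for_r(r, n):
    """t statistic for testing r = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Marks of 8 students in Mathematics and Statistics (worked example below)
maths = [25, 30, 32, 35, 37, 40, 42, 45]
stats = [8, 10, 15, 17, 20, 22, 24, 25]
r = pearson_r(maths, stats)
t = t_for_r(r, len(maths))
```

Here r ≈ 0.98 and t ≈ 13, far above the tabulated t of 2.447 at 6 d.f., so the correlation is significant.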

**************************************************************************

Objective: Computation of correlation coefficient and test of significance of correlation coefficient


of the given data.
Kinds of data: The marks obtained by 8 students in Mathematics and Statistics are given below:
Student A B C D E F G H
Mathematics 25 30 32 35 37 40 42 45
Statistics 8 10 15 17 20 22 24 25

Solution: Let us assume that the marks in mathematics are X and marks in Statistics are Y.
We know that the formula for the correlation coefficient is
r = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ(xᵢ − x̄)² Σᵢ(yᵢ − ȳ)² ) ,
First we calculate the means of X and Y.
X̄ = ΣXᵢ/n = 286/8 = 35.75 ≈ 36  and  Ȳ = ΣYᵢ/n = 141/8 = 17.63 ≈ 18
Other calculations are presented below in the table.
Student  Mathematics (X)  Statistics (Y)  (Xᵢ − X̄)  (Yᵢ − Ȳ)  (Xᵢ − X̄)(Yᵢ − Ȳ)  (Xᵢ − X̄)²  (Yᵢ − Ȳ)²
A 25 8 -11 -10 110 121 100
B 30 10 -6 -8 48 36 64
C 32 15 -4 -3 12 16 9
D 35 17 -1 -1 1 1 1
E 37 20 1 2 2 1 4
F 40 22 4 4 16 16 16
G 42 24 6 6 36 36 36
H 45 25 9 7 63 81 49
total 286 141 -2 -3 288 308 279

By putting these values in the formula we get

r = 288/√(308 × 279) = 0.983

Test of significance of r=0.983
To test the significance of correlation coefficient the t test statistic is used as follows:

tcal = rcal √(n − 2) / √(1 − rcal²)  at (n − 2) d.f.
By putting the values in the formula we get
tcal = 0.983 × √(8 − 2) / √(1 − 0.983²) = 13.11
The table value of t at 6 degree of freedom at 5 % level of significance is 2.447.
Conclusion: Since the calculated value of t (13.11) is greater than the tabulated value of t (2.447)
at 6 degrees of freedom, the null hypothesis is rejected and we conclude that the correlation coefficient
is highly significant. This indicates that marks in mathematics are associated with marks in
statistics.
********************************************************************************

Objective : Corrected correlation coefficient corresponding to the corrected figures:


Kinds of data: In two set of variables X and Y with 50 observations each, the following data were
observed: 𝑋̅ =10, 𝜎𝑥 = 3, 𝑌̅ =6, 𝜎𝑦 = 2 and r(x,y)=0.3
But on subsequent verification it was found that one value of X (=10) and one value of (Y=6) were
inaccurate and hence weeded out. With the remaining 49 pair of values, how is original value of r
affected?
Solution: First we will find the corrected mean.
We know that X̄ = ΣXᵢ/n; here X̄ = 10 and n = 50, so ΣXᵢ = n × X̄ = 50 × 10 = 500.
Since one value X = 10 was found to be inaccurate, we remove it from X and get
ΣXᵢ = 500 − 10 = 490; the number of observations is now 49.
Hence the corrected mean X̄ = ΣXᵢ/n = 490/49 = 10.
We know that σₓ² = ΣXᵢ²/n − X̄², or ΣXᵢ² = n(σₓ² + X̄²).
Here n = 50, σₓ = 3 and X̄ = 10, so ΣXᵢ² = 50 × (9 + 100) = 5450.
Now we find the corrected ΣXᵢ² by removing 10²:
Corrected ΣXᵢ² = 5450 − 100 = 5350.
Now we find the corrected σₓ² by putting in the corrected ΣXᵢ², the corrected mean X̄ and n = 49:
corrected σₓ² = ΣXᵢ²/n − X̄² = 5350/49 − 10² = 109.18 − 100 = 9.18
Similarly we repeat the same procedure for variable Y and find the corrected mean 𝑌̅ and ∑ 𝑌𝑖 2
ΣYᵢ = n × Ȳ = 50 × 6 = 300.
Since one value Y = 6 was found to be inaccurate, we remove it from Y and get
ΣYᵢ = 300 − 6 = 294; the number of observations is now 49.
Hence the corrected mean Ȳ = ΣYᵢ/n = 294/49 = 6.
σᵧ² = ΣYᵢ²/n − Ȳ², or ΣYᵢ² = n(σᵧ² + Ȳ²) = 50 × (2² + 6²) = 2000.
Now we find the corrected ΣYᵢ² by removing 6²:
Corrected ΣYᵢ² = 2000 − 36 = 1964.
Now we find the corrected σᵧ² by putting in the corrected ΣYᵢ², the corrected mean Ȳ and n = 49:
corrected σᵧ² = ΣYᵢ²/n − Ȳ² = 1964/49 − 6² = 40.08 − 36 = 4.08

Since r = cov(x,y)/(σₓσᵧ) = (ΣXY/n − X̄Ȳ)/(σₓσᵧ) = 0.3, we have ΣXY/n − X̄Ȳ = r σₓ σᵧ.
By putting in the values, 0.3 × 3 × 2 = ΣXY/50 − 10 × 6, we get ΣXY = 50 × (1.8 + 60) = 3090.
Next we get the corrected value of ΣXY = 3090 − (wrong values) = 3090 − 10 × 6 = 3030.
Hence corrected r = (ΣXY/n − X̄Ȳ)/(σₓσᵧ) = (3030/49 − 10 × 6)/√(9.18 × 4.08) = 1.84/6.12 = 0.3
Hence we found that there is no change in correlation coefficient.
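The whole correction can be sketched in Python (an illustrative reconstruction of my own, rebuilding the raw sums from the given summary statistics and then weeding out the inaccurate pair):

```python
import math

# Summary statistics before correction (from the problem statement)
n, xbar, sx, ybar, sy, r = 50, 10.0, 3.0, 6.0, 2.0, 0.3

# Rebuild the raw sums from the summary statistics
sum_x, sum_x2 = n * xbar, n * (sx ** 2 + xbar ** 2)
sum_y, sum_y2 = n * ybar, n * (sy ** 2 + ybar ** 2)
sum_xy = n * (r * sx * sy + xbar * ybar)

# Weed out the inaccurate pair (10, 6)
n -= 1
sum_x, sum_x2 = sum_x - 10, sum_x2 - 10 ** 2
sum_y, sum_y2 = sum_y - 6, sum_y2 - 6 ** 2
sum_xy -= 10 * 6

mx, my = sum_x / n, sum_y / n
vx = sum_x2 / n - mx ** 2          # corrected variance of X (9.18)
vy = sum_y2 / n - my ** 2          # corrected variance of Y (4.08)
cov = sum_xy / n - mx * my
r_corrected = cov / math.sqrt(vx * vy)
```

Running this confirms that r is unchanged at 0.3 after the correction.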
*******************************************************************************

Exercise:
Q1. Define correlation coefficient. Also write the properties of correlation coefficient.
Q2. Calculate the correlation coefficient between the variable X and Y from the following bi-
variate data.
X 71 68 70 67 70 71 70 73
Y 69 67 65 63 65 62 65 64
(Ans: r=0)
Q3. Calculate the correlation coefficient between the variable X and Y from the following bi-
variate data.
X 1 3 4 5 7 8 10
Y 2 6 8 10 14 16 20
(Ans: r=1)

Q4. Calculate the correlation coefficient between the heights of father and son from the following
data:
Height of father (inches) 65 66 67 68 69 70 71
Height of Son (inches) 67 68 66 69 72 72 69
Apply t test to test the significance and interpret the result. (Ans: r=+0.67, tcal=2.02)

Q5. A computer while calculating correlation coefficient between two variables X and Y from 25
pairs of observation obtained the following results: n=25, ∑ 𝑋=125, ∑ 𝑋 2 = 650, ∑ 𝑌=100,
∑ 𝑌 2 = 460, ∑ 𝑋𝑌=508. But on subsequent verification it was found that he had copied down
two pairs as (6,14) and (8,6) while the correct values were (8,12) and (6,8). Obtain the correct
value of correlation coefficient?
(Ans: 0.67)
*************************************************************************

Regression: The term regression was given by a British biometrician Sir Francis Galton. It is a
mathematical measure of average relationship between two or more variables. Regression is a
technique used to model and analyze the relationships between variables, and how they jointly
contribute to producing a particular outcome. A linear regression refers to a
regression model that is made up entirely of linear terms.
Lines of Regression: The line of regression is the line which gives the best estimate to the value of
one variable for any specific value of the other variable. The line of regression is the line of best fit
and is obtained by the principle of least squares. Both the lines of regression pass through, i.e.
intersect at, the point (X̄, Ȳ).
In linear regression there are two lines of regression.
One is Y on X (Y = a + b·X), where X is the independent variable and Y is the dependent variable. By
applying the principle of least squares the regression line of Y on X is given by (y − ȳ) = b_yx (x − x̄),

where  b_yx = cov(x,y)/σₓ² = r σᵧ/σₓ = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / Σᵢ(xᵢ − x̄)²

The other one is X on Y (X = a + b·Y), where Y is the independent variable and X is the dependent variable. By
applying the principle of least squares the regression line of X on Y is given by (x − x̄) = b_xy (y − ȳ),

where  b_xy = cov(x,y)/σᵧ² = r σₓ/σᵧ = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / Σᵢ(yᵢ − ȳ)²

Here 𝑏𝑦𝑥 and 𝑏𝑥𝑦 are the regression coefficient and shows the change in dependent variable
with a unit change in independent variable.

Angle between two lines of Regression:


We know that the slopes of the two lines of regression are r σᵧ/σₓ and σᵧ/(r σₓ). If θ is the angle between the
two lines of regression then

tan θ = [(1 − r²)/r] × [σₓσᵧ/(σₓ² + σᵧ²)]
Properties of Regression coefficient and relationship between correlation and regression
coefficient
• Regression coefficient lies between -∞ to +∞.
• Regression coefficient is independent of change of origin and but not of scale.
• Correlation coefficient is the Geometric mean between the regression coefficients.
(r = ±√𝑏𝑦𝑥 ∗ 𝑏𝑥𝑦 )
• If one of the regression coefficients is greater than unity the other must be less than unity.
• The arithmetic mean of the regression coefficients is greater than or equal to the correlation coefficient r if r > 0:
  (1/2)(b_yx + b_xy) ≥ r
• If the two variables are uncorrelated, the lines of regression are perpendicular to each other.
  (If r = 0, θ = π/2)
• The two lines of regression coincide with each other if r = ±1; then θ = 0 or π.
• The sign of correlation coefficient and regression coefficients are same because each of them
depends on sign of cov(x,y).
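Several of these properties can be checked numerically (an illustrative sketch of my own, using the summary statistics of the cinchona-plant example solved later in this chapter):

```python
import math

def regression_coefficients(sx, sy, r):
    """b_yx and b_xy from the standard deviations and r."""
    return r * sy / sx, r * sx / sy

# Cinchona-plant example below: age X and dry-bark yield Y
xbar, ybar, sx, sy, r = 9.2, 16.5, 2.1, 4.2, 0.84
byx, bxy = regression_coefficients(sx, sy, r)

# r is the geometric mean of the two regression coefficients
assert abs(math.sqrt(byx * bxy) - abs(r)) < 1e-9

# Estimate Y for X = 8 from the line of regression of Y on X
y_at_8 = ybar + byx * (8 - xbar)
```

For this data b_yx = 1.68, b_xy = 0.42 and the estimated yield at age 8 is 14.48 ounces, matching the worked example.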

Test of significance of regression coefficients (Null hypothesis: b_yx = 0, b_xy = 0): To test the
significance of a regression coefficient the t test statistic is used as follows:

tcal = b_yx / S.E.(b_yx) = b_yx / √[ (Σ(y − ȳ)² − (Σ(x − x̄)(y − ȳ))²/Σ(x − x̄)²) / ((n − 2) Σ(x − x̄)²) ]   based on (n − 2) d.f. (for y on x)

tcal = b_xy / S.E.(b_xy) = b_xy / √[ (Σ(x − x̄)² − (Σ(x − x̄)(y − ȳ))²/Σ(y − ȳ)²) / ((n − 2) Σ(y − ȳ)²) ]   based on (n − 2) d.f. (for x on y)

To test the null hypothesis we compare the calculated value of t with tabulated value of t at (n-2)
degree of freedom.
If tcal>ttab, the null hypothesis is rejected and we conclude that the regression coefficient is
significant.
If tcal<ttab, the null hypothesis is accepted and we conclude that the coefficient is non-significant.
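The test statistic can be sketched in Python (an illustrative helper of my own; it takes the corrected sums of squares and products, and the data shown is the brother-sister example worked out later in this chapter):

```python
import math

def t_for_byx(sxx, syy, sxy, n):
    """t statistic for H0: b_yx = 0, with n - 2 degrees of freedom.
    sxx = sum((x - xbar)^2), syy = sum((y - ybar)^2),
    sxy = sum((x - xbar)(y - ybar))."""
    byx = sxy / sxx
    se = math.sqrt((syy - sxy ** 2 / sxx) / ((n - 2) * sxx))
    return byx / se

# Sums of squares/products from the brother-sister example below
t = t_for_byx(sxx=74, syy=66, sxy=39, n=11)
```

Note that swapping the roles of x and y gives the same t value; in simple linear regression the t statistics for b_yx, b_xy and r all coincide.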
*************************************************************************
Objective: Determination of the lines of regression of Y on X and X on Y and their explanation.
Kinds of data: The lines of regression are Y = 5 + 2.8X and X = 3 − 0.5Y.
Solution: Here it is given that the line of regression of
Y on X is Y = 5 + 2.8X, so the regression coefficient is b_yx = 2.8.
Similarly, X on Y is X = 3 − 0.5Y, so the regression coefficient is b_xy = −0.5.
Since we know that the signs of both regression coefficients must be the same, and here the signs of the two
coefficients are different from each other, this is not possible. Hence the equations are not the
estimated regression equations of Y on X and X on Y respectively.
*******************************************************************************

Objective: Determination of (i) line of regression of Y on X and X on Y (ii) mean of X and mean
of Y (iii) variance of Y when the variance of X is given?.
Kinds of data: The two lines of regression X+2Y-5=0 and 2X+3Y-8=0 and variance of X is 12 is
given.
Solution: (i) If we assume the line X + 2Y − 5 = 0 to be the regression line of Y on X, then the equation
can be written as 2Y = −X + 5 or Y = −0.5X + 2.5, so b_yx = −0.5.
Similarly, if we assume the line 2X + 3Y − 8 = 0 to be the regression line of X on Y, then the equation can
be written as 2X = −3Y + 8 or X = −1.5Y + 4, so b_xy = −1.5.
Here the signs of both regression coefficients are the same, and one regression
coefficient is greater than unity while the other is smaller than unity. We can also verify
r = −√(b_yx × b_xy) = −√(−0.5 × −1.5) = −√0.75 = −0.87 (taken negative because both regression coefficients are negative), which lies between −1 and +1.
So our estimation of line of regression of Y on X and X on Y is correct.
(ii) Since both the lines of regression pass through the point (X̄, Ȳ), the equations can be written
as
X̄ + 2Ȳ − 5 = 0 ......... (1)
2X̄ + 3Ȳ − 8 = 0 ......... (2); by solving these equations we get X̄ and Ȳ.
Multiplying equation (1) by 2 we get 2X̄ + 4Ȳ − 10 = 0 ..... (3)
Subtracting equation (2) from equation (3) we get 2X̄ + 4Ȳ − 10 − 2X̄ − 3Ȳ + 8 = 0
Solving, we get Ȳ = 2; by putting the value of Ȳ in equation (1) we get X̄ = 1.
Hence the means of X and Y are X̄ = 1 and Ȳ = 2.
(iii) Here σₓ² = 12 is given. We have to find σᵧ².

Since we know that b_yx = r σᵧ/σₓ, and the values of r, b_yx and σₓ = √12 = 3.46 are known,
by putting in these values, −0.5 = −0.87 × σᵧ/3.46, we get σᵧ = 1.99 ≈ 2.
Hence σᵧ² = 4.
*******************************************************************************

Objective: Construction of line of regression and estimation of dependent variable when mean,
standard deviation and correlation coefficient is given.
Kinds of data: The following results were obtained in the analysis of data on yield of dry bark in
ounces (Y) and age in years (X) of 200 cinchona plants:
X (age in Years) Y (Yield)
Average 9.2 16.5
Standard Deviation 2.1 4.2
Correlation coefficient +0.84
Estimate the yield of dry bark of a plant of age 8 years.
Solution: Here 𝑋̅ = 9.2, 𝜎𝑋 =2.1, 𝑌̅ = 16.5, 𝜎𝑌 =4.2 and r=0.84

(i) Construction of the lines of regression: we know that the line of regression of Y on X is given by
(Y − Ȳ) = b_yx (X − X̄), where b_yx = r σᵧ/σₓ = 0.84 × 4.2/2.1 = 1.68
By putting in the values we get (Y − 16.5) = 1.68 × (X − 9.2)
Y = 1.68X + 1.04
Similarly, the line of regression of X on Y is given by
(X − X̄) = b_xy (Y − Ȳ), where b_xy = r σₓ/σᵧ = 0.84 × 2.1/4.2 = 0.42
By putting in the values we get (X − 9.2) = 0.42 × (Y − 16.5)
X = 0.42Y + 2.27
(ii) Estimation of the yield (Y) of dry bark of a plant of age 8 years(X):
The line of regression of Y on X is Y=1.68X+1.04,
Put X=8 we get Y=1.68*8+1.04 = 14.48
Hence the yield (Y) of dry bark of a plant of age 8 years(X) is 14.48 ounce.

Objective: Computation of correlation coefficient and the equations of the line of regression of Y
on X and X on Y and the estimation of the value of Y when the value of X is known and the value
of X when the value of Y is known.

Kinds of data: The following table relate to the data of stature (inches) of brother and sister from
Pearson and Lee’s sample of 1,401 families.

Family
1 2 3 4 5 6 7 8 9 10 11
number
Brother,X 71 68 66 67 70 71 70 73 72 65 66
Sister,Y 69 64 65 63 65 62 65 64 66 59 62

Solution: First we calculate the means: X̄ = 759/11 = 69 and Ȳ = 704/11 = 64.
Family Number  Brother X  Sister Y  (Xᵢ − X̄)  (Yᵢ − Ȳ)  (Xᵢ − X̄)²  (Yᵢ − Ȳ)²  (Xᵢ − X̄)(Yᵢ − Ȳ)
1 71 69 2 5 4 25 10
2 68 64 -1 0 1 0 0
3 66 65 -3 1 9 1 -3
4 67 63 -2 -1 4 1 2
5 70 65 1 1 1 1 1
6 71 62 2 -2 4 4 -4
7 70 65 1 1 1 1 1
8 73 64 4 0 16 0 0
9 72 66 3 2 9 4 6
10 65 59 -4 -5 16 25 20
11 66 62 -3 -2 9 4 6
Total  759  704  0  0  74  66  39
Then by using the formula of the correlation coefficient, we have
r(x,y) = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ(xᵢ − x̄)² Σᵢ(yᵢ − ȳ)² ) = 39/√(74 × 66) = 0.558
Test of significance of the correlation coefficient
t = rcal √(n − 2) / √(1 − rcal²) = 0.558 × √(11 − 2) / √(1 − 0.558²) = 2.018
The table value of t at 9 d.f. at 5% level of significance is 2.26.
Since t calculated is less than t tabulated, the null hypothesis is accepted. The correlation coefficient
is not significant.
Calculation of Regression Coefficients
Using the formula of the regression coefficients of Y on X and X on Y, we have
b_yx = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / Σᵢ(xᵢ − x̄)² = 39/74 = 0.527,  b_xy = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / Σᵢ(yᵢ − ȳ)² = 39/66 = 0.591

Hence, the equation of regression line of Y on X is


Y- 64 = 0.527 (X-69)
Hence, the equation of regression line of X on Y is
X- 69 = 0.591 (Y-64)
Estimation of Y when X is given :
If we want to calculate the value of Y for X=70 then by putting X=70 in the line of regression of Y
on X we get Y - 64 =0.527*(70 -69)
Hence Y= 64 + 0.527 * 1 =64.527
Estimation of X when Y is given :
If we want to calculate the value of X for Y=62 then by putting Y=62 in the line of regression of X
on Y we get X - 69 =0.591(62 -64)
Hence X= 69 + 0.591 * (-2) =67.82
Test of significance of regression coefficient of Y on X
tyx = byx / √{[∑(y − ȳ)² − (∑(x − x̄)(y − ȳ))²/∑(x − x̄)²] / [(n − 2) ∑(x − x̄)²]}
    = 0.527 / √{[66 − (39)²/74] / [(11 − 2) × 74]} = 0.527/0.261 = 2.017

Test of significance of regression coefficient of X on Y
txy = bxy / √{[∑(x − x̄)² − (∑(x − x̄)(y − ȳ))²/∑(y − ȳ)²] / [(n − 2) ∑(y − ȳ)²]}
    = 0.591 / √{[74 − (39)²/66] / [(11 − 2) × 66]} = 0.591/0.293 = 2.017

Since the calculated values of t are less than the tabulated value, the regression coefficients are not significant.
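The whole worked example above can be cross-checked with a short script. The following is a minimal sketch in plain Python (variable names are our own, not from the manual) that reproduces r, byx, bxy and the t statistic from the brother and sister data:

```python
import math

# Stature data from Pearson and Lee's sample (11 families)
x = [71, 68, 66, 67, 70, 71, 70, 73, 72, 65, 66]  # brother, X
y = [69, 64, 65, 63, 65, 62, 65, 64, 66, 59, 62]  # sister, Y

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n               # 69.0 and 64.0
sxx = sum((xi - xbar) ** 2 for xi in x)           # sum of squares of X (74)
syy = sum((yi - ybar) ** 2 for yi in y)           # sum of squares of Y (66)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))  # cross product (39)

r = sxy / math.sqrt(sxx * syy)                    # correlation coefficient
byx = sxy / sxx                                   # regression of Y on X
bxy = sxy / syy                                   # regression of X on Y
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)  # test statistic with n-2 = 9 d.f.

print(round(r, 3), round(byx, 3), round(bxy, 3), round(t, 3))
```

Note that the t statistics for r, byx and bxy are algebraically identical, so a single t value serves as a check on all three hand calculations.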
*******************************************************************************

Exercise:

Q1. Define Regression Coefficient. Also write the properties of Regression coefficient.

Q2.The observations on X(Marks in Economics) and Y (Marks in Maths) for 10 students are given
below:
X 59 65 45 52 60 62 70 55 45 49
Y 75 70 55 65 60 69 80 65 59 61
Compute the least square regression equations of Y on X and X on Y. Also estimate the value
of Y for X=61. (Ans: Y−65.9 = 0.76(X−56.2), X−56.2 = 0.92(Y−65.9), Y = 69.54 for X = 61)

Q3. The following data pertain to the marks in subjects A and B in a certain examination
Subject A Subject B
Mean marks 39.5 47.5
Standard Deviation of marks 10.8 16.8
Correlation coefficient +0.42
Find the two lines of regression and estimate the marks in B for candidates who secured 50
marks in A. (Ans. Y=0.65X+21.82, X=0.27Y+26.67, Y=54.34 for X=50)

Q4. From the observations of the age (X) and the mean blood pressure (Y), the following quantities
were calculated: X̄ = 60, Ȳ = 141, ∑x² = 1000, ∑y² = 1936, ∑xy = 1380, where x = X − X̄ and
y = Y − Ȳ. Find the regression equation of Y on X and estimate the mean blood pressure for
women of age 35 years. (Ans: Y = 1.38X + 58.2, Y = 106.5 for X = 35)

6. Test of Significance
Mujahida Sayyed
Asst. professor (Maths & Stat.), College of Agriculture, JNKVV, Ganjbasoda, 464221(M.P.), India
Email id : mujahida.sayyed@gmail.com
Once sample data has been gathered through an experiment, statistical inference allows analysts to
assess some claim about the population from which the sample has been drawn. The methods of
inference used to support or reject claims based on sample data are known as tests of significance.

Null Hypothesis: Every test of significance begins with a null hypothesis H0. H0 represents a
theory that has been put forward, either because it is believed to be true or because it is to be used
as a basis for argument, but has not been proved. For example, in a clinical trial of a new drug, the
null hypothesis might be that the new drug is no better, on average, than the current drug.

Null Hypothesis H0: there is no difference between the two drugs on average.

Alternative Hypothesis: The alternative hypothesis, Ha, is a statement of what a statistical
hypothesis test is set up to establish. For example, in a clinical trial of a new drug, the alternative
hypothesis might be that the new drug has a different effect, on average, compared to that of the
current drug.

Alternative Hypothesis Ha: the two drugs have different effects, on average.

The alternative hypothesis might also be that the new drug is better, on average, than the current
drug. In this case Ha: the new drug is better than the current drug, on average.

The final conclusion once the test has been carried out is always given in terms of the null
hypothesis. "reject H0 in favor of Ha" or "do not reject H0"; we never conclude "reject Ha", or even
"accept Ha".

If we conclude "do not reject H0", this does not necessarily mean that the null hypothesis is true, it
only suggests that there is not sufficient evidence against H0 in favor of Ha; rejecting the null
hypothesis then, suggests that the alternative hypothesis may be true.

Hypotheses are always stated in terms of a population parameter, such as the mean µ. An alternative
hypothesis may be one-sided or two-sided. A one-sided hypothesis claims that a parameter is either
larger or smaller than the value given by the null hypothesis. A two-sided hypothesis claims that a
parameter is simply not equal to the value given by the null hypothesis; the direction does not
matter.

Hypotheses for a one-sided test for a population mean take the following form:
H0: µ = k
Ha: µ > k
or
H0: µ = k
Ha: µ < k.

Hypotheses for a two-sided test for a population mean take the following form:
H0: µ = k
Ha: µ ≠ k.

1. t TEST FOR SINGLE MEAN:
A t-test is any statistical hypothesis test in which the test statistic follows a Student's t distribution if the
null hypothesis is supported. It can be used to determine if two sets of data are significantly different from
each other, and is most commonly applied when the test statistic would follow a normal distribution if the
value of a scaling term in the test statistic were known. When the scaling term is unknown and is replaced
by an estimate based on the data, the test statistic (under certain conditions) follows a Student's t
distribution.
t = (x̄ − μ) / (s/√n)
Where
μ = Population Mean
x̄ = Sample Mean
s = Sample standard deviation = √[∑(xi − x̄)²/(n − 1)]
n = Number of sample observations
If tcal > ttab then the difference is significant and the null hypothesis is rejected at the 5% or 1% level of
significance.
If tcal < ttab then the difference is non-significant and the null hypothesis is accepted at the 5% or 1% level
of significance.
*******************************************************************************
Objective: Test the significance of difference between sample mean and population mean.
Kinds of data: Based on field experiments, a new variety of greengram is expected to give a
yield of 12 quintals per hectare. The variety was tested on 10 randomly selected farmers' fields. The
yields (quintal/hectare) recorded were 14.3, 12.6, 13.7, 10.9, 13.7, 12.0, 11.4, 12.0, 12.6 and 13.1.
Do the results conform to the expectation?
Solution: Here the null and alternative hypotheses are
H0: The average yield of the new variety of greengram is 12 q/ha.
Vs H1: The average yield of the new variety of greengram is not 12 q/ha.

We know that the t-test for a single mean is given by t = (x̄ − μ)/(s/√n)

It is given that population mean μ= 12 and n=10,


Then we calculate the sample mean x̄ = 126.3/10 = 12.63.
Next we have to calculate s = √[∑(xi − x̄)²/(n − 1)]

Yields (xi):  14.3   12.6   13.7   10.9   13.7   12.0   11.4   12.0   12.6   13.1   Total 126.3
(xi − x̄):     1.67  -0.03   1.07  -1.73   1.07  -0.63  -1.23  -0.63  -0.03   0.47
(xi − x̄)²:    2.79   0.00   1.14   2.99   1.14   0.40   1.51   0.40   0.00   0.22   Total 10.60

Standard deviation of sample s = √(10.60/9) = 1.085
By putting the values in the t statistic we get
t = (x̄ − μ)/(s/√n) = (12.63 − 12)/(1.0853/√10) = 1.836
and d.f. = 10 − 1 = 9
The table value of t at 9 d.f. and 5% level of significance is ttab = 2.262.
Since tcal < ttab, the difference is not significant and we accept the null hypothesis.
Result: Here we accept the null hypothesis; this means that the new variety of greengram will give an
average yield of 12 quintals per hectare.
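The one-sample t calculation above can be sketched in a few lines of plain Python (variable names are our own):

```python
import math

# Greengram yields (q/ha) on 10 farmers' fields; hypothesised mean = 12
yields = [14.3, 12.6, 13.7, 10.9, 13.7, 12.0, 11.4, 12.0, 12.6, 13.1]
mu = 12.0

n = len(yields)
xbar = sum(yields) / n                                          # sample mean
s = math.sqrt(sum((v - xbar) ** 2 for v in yields) / (n - 1))   # sample s.d.
t = (xbar - mu) / (s / math.sqrt(n))                            # t with n-1 = 9 d.f.

print(round(xbar, 2), round(s, 3), round(t, 3))
```

Compare |t| with the table value 2.262 at 9 d.f. to reach the same decision as above.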
*******************************************************************************

2. t TEST FOR TWO SAMPLE MEANS:

Comparison of two sample means x̄ and ȳ, assumed to have been obtained on the basis of random
samples of sizes n1 and n2 from the same population, which is assumed to be normal.
The approximate test is given by (under H0: μ1 = μ2 against H1: μ1 ≠ μ2)

t = (x̄ − ȳ) / [s√(1/n1 + 1/n2)], where x̄ = ∑Xi/n1 and ȳ = ∑Yi/n2,

s² = [∑(xi − x̄)² + ∑(yi − ȳ)²] / (n1 + n2 − 2)
   = (n1s1² + n2s2²) / (n1 + n2 − 2)

follows Student's t statistic with n1 + n2 − 2 d.f.

*******************************************************************************

Objective : To test the significance of difference between two treatment means.


Kinds of data: Two kinds of manure were applied to 15 plots of one acre each, other conditions
remaining the same. The yields (in quintals) are given below:
Manure I:  14 20 34 48 32 42 30 44
Manure II: 31 18 22 28 40 26 45
Examine the significance of the difference between the mean yields due to the application of the
different kinds of manure.

Solution: Here the null and alternative hypotheses are
H0: There is no significant difference between the two mean yields due to the application of
different kinds of manure. Vs
H1: There is a significant difference between the two mean yields due to the application of
different kinds of manure.
We use the t test for difference of means:
t = (x̄ − ȳ) / [s√(1/n1 + 1/n2)]
x̄ and ȳ are the sample means of sample I and sample II:
x̄ = ∑Xi/n1 = 264/8 = 33 and ȳ = ∑Yi/n2 = 210/7 = 30
Next we calculate s² = [∑(xi − x̄)² + ∑(yi − ȳ)²] / (n1 + n2 − 2)

Manure I   (x − x̄)   (x − x̄)²   Manure II   (y − ȳ)   (y − ȳ)²
  14         -19        361         31          +1          1
  20         -13        169         18         -12        144
  34          +1          1         22          -8         64
  48         +15        225         28          -2          4
  32          -1          1         40         +10        100
  42          +9         81         26          -4         16
  30          -3          9         45         +15        225
  44         +11        121
 264                    968        210                    554

By putting the values we get s² = (968 + 554)/(8 + 7 − 2) = 117.07
Then s = 10.82
By putting all the values in t = (x̄ − ȳ)/[s√(1/n1 + 1/n2)], we get
t = (33 − 30) / [10.82 × √(1/8 + 1/7)] = 0.54
d.f. = 𝑛1 + 𝑛2 − 2 = 13

The tabulated value of t for 13 d.f. at 5% level of significance is 2.16.

Since tcal < ttab, it is not significant and we accept the null hypothesis.
Result: Since tcal < ttab, we conclude that there is no significant difference between the two
mean yields due to the application of different kinds of manure.
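A minimal sketch of the pooled two-sample t calculation in plain Python (variable names are our own):

```python
import math

# Yields (quintals) under the two kinds of manure
manure1 = [14, 20, 34, 48, 32, 42, 30, 44]
manure2 = [31, 18, 22, 28, 40, 26, 45]

n1, n2 = len(manure1), len(manure2)
m1, m2 = sum(manure1) / n1, sum(manure2) / n2       # sample means 33 and 30
ss1 = sum((v - m1) ** 2 for v in manure1)           # within-sample SS = 968
ss2 = sum((v - m2) ** 2 for v in manure2)           # within-sample SS = 554
s2 = (ss1 + ss2) / (n1 + n2 - 2)                    # pooled variance estimate
t = (m1 - m2) / math.sqrt(s2 * (1 / n1 + 1 / n2))   # t with n1+n2-2 = 13 d.f.

print(round(t, 2))
```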
*******************************************************************************

3. t TEST FOR PAIRED OBSERVATION:


This test is used for testing whether two series of paired observations are generated from the
same population, on the basis of the difference in their sample means. The approximate test is given by

t = d̄/(s/√n), which follows Student's t-distribution with n − 1 d.f.

Here d̄ = ∑di/n and s² = ∑(di − d̄)²/(n − 1),
di = xi − yi being the difference of the ith observations in the two samples.
*******************************************************************************
Objective: To test the significance of difference between two treatment means, when observations
are paired.
Kinds of data: Two treatments A and B are assigned randomly to two animals from each of six
litters. The following increase in body weights(oz.) of the animals were observed at the end of the
experiment
Treatment Litter Number
1 2 3 4 5 6
A 28 32 29 36 29 34
B 25 24 27 30 30 29
Test the significance of the difference between treatments A and B.
Solution: Hypotheses
H0: There is no significant difference between treatments A and B.
Vs H1: There is a significant difference between treatments A and B.
Here, since the observations are paired, the Student's t statistic with n − 1 d.f. is t = d̄/(s/√n)
where d̄ = ∑di/n, s² = ∑(di − d̄)²/(n − 1), and di = xi − yi.

Litter    Treatment       di = xi − yi   (di − d̄)   (di − d̄)²
number   A(xi)   B(yi)
1          28      25           3          -0.83       0.69
2          32      24           8           4.17      17.36
3          29      27           2          -1.83       3.36
4          36      30           6           2.17       4.70
5          29      30          -1          -4.83      23.36
6          34      29           5           1.17       1.36
Total                    ∑di = 23                     50.83

d̄ = ∑di/n = 23/6 = 3.83 and s² = ∑(di − d̄)²/(n − 1) = 50.83/(6 − 1) = 10.17
then s = 3.19
By putting the values we get t = 3.833/(3.1885/√6) = 2.94
degree of freedom = 6-1 = 5
The table value of t at 5% level of significance and 5 degree of freedom is 2.571.
Since tcal > ttab, the difference is significant, hence the null hypothesis is rejected.
Result: Since the null hypothesis is rejected, there is a significant difference between
treatments A and B.
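The paired t calculation above can be sketched in plain Python (variable names are our own):

```python
import math

a = [28, 32, 29, 36, 29, 34]   # treatment A body-weight gains (oz.)
b = [25, 24, 27, 30, 30, 29]   # treatment B body-weight gains (oz.)

d = [x - y for x, y in zip(a, b)]             # paired differences di = xi - yi
n = len(d)
dbar = sum(d) / n                             # mean difference 23/6
s = math.sqrt(sum((di - dbar) ** 2 for di in d) / (n - 1))
t = dbar / (s / math.sqrt(n))                 # t with n-1 = 5 d.f.

print(round(t, 2))
```

Compare |t| with the table value 2.571 at 5 d.f. to reach the decision given above.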
*******************************************************************************
4. F TEST (VARIANCE RATIO TEST):
F distribution is applied in several tests of significance relating to the equality of two
sampling variances estimated on the basis of independent samples from normal populations. The
approximate test is
Variance Ratio (F) = Larger estimate of variance / Smaller estimate of variance = s1²/s2²
where s1² = ∑(xi − x̄)²/(n1 − 1) and s2² = ∑(yi − ȳ)²/(n2 − 1),
which follows the F distribution with n1 − 1 and n2 − 1 d.f.
******************************************************************************
Objective: To test the significance of equality of two sample variances.
Kinds of data: Two random samples are chosen from two normal populations
Sample I: 20 16 26 27 23 22 18 24 25 19
Sample II: 17 23 32 25 22 24 28 18 31 33 20 27
Obtain estimates of the variance of the population and test whether the two populations have the
same variance.
Solution: Here the null and alternative hypotheses are
H0: The two populations have the same variance.
Vs H1: The two populations do not have the same variance.
We know that Variance Ratio (F) = Larger estimate of variance / Smaller estimate of variance = s1²/s2²,
which follows the F distribution with n1 − 1 and n2 − 1 d.f.
Here s1² = ∑(xi − x̄)²/(n1 − 1) and s2² = ∑(yi − ȳ)²/(n2 − 1),
x̄ = ∑Xi/n1 = 220/10 = 22 and ȳ = ∑Yi/n2 = 300/12 = 25

Sample I Sample II
xi ̅)
(xi-𝒙 ̅)2
(xi-𝒙 yi ̅)
(yi-𝒚 ̅)2
(yi-𝒚
20 -2 4 17 -8 64
16 -6 36 23 -2 4
26 4 16 32 7 49
27 5 25 25 0 0
23 1 1 22 -3 9
22 0 0 24 -1 1
18 -4 16 28 3 9
24 2 4 18 -7 49
25 3 9 31 6 36
19 -3 9 33 8 64
20 -5 25
27 2 4
220 120 300 314
By putting the values we get s1² = ∑(xi − x̄)²/(n1 − 1) = 120/9 = 13.33 and
s2² = ∑(yi − ȳ)²/(n2 − 1) = 314/11 = 28.55
Hence we get F = s2²/s1² = 28.55/13.33 = 2.14
The tabulated value of F at 5% level of significance and 9 and 11 d.f. is 2.89.
Since Fcal < Ftab, it is not significant and the null hypothesis is accepted.
Result: Since Fcal < Ftab, the null hypothesis is accepted and we conclude that the two populations
have the same variance.
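The variance-ratio calculation can be sketched in plain Python (the helper function and variable names are our own):

```python
# Two random samples from the worked example above
s1 = [20, 16, 26, 27, 23, 22, 18, 24, 25, 19]
s2 = [17, 23, 32, 25, 22, 24, 28, 18, 31, 33, 20, 27]

def sample_var(data):
    """Unbiased sample variance: sum of squared deviations / (n - 1)."""
    m = sum(data) / len(data)
    return sum((v - m) ** 2 for v in data) / (len(data) - 1)

v1, v2 = sample_var(s1), sample_var(s2)
F = max(v1, v2) / min(v1, v2)   # larger estimate over smaller estimate
print(round(v1, 2), round(v2, 2), round(F, 2))
```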
******************************************************************************
5. CHI-SQUARE TEST (χ² TEST):
χ² test for goodness of fit:
Chi square is a measure to evaluate the difference between observed frequencies and
expected frequencies and to examine whether the difference so obtained is due to chance
(sampling fluctuation) or to a real divergence.
To test the goodness of fit the chi-square test statistic is given by
χ² = ∑(Oi − Ei)²/Ei, at (n − 1) d.f., where Oi = Observed Frequency,
Ei = Expected Frequency

𝝌𝟐 test for 2X2 contingency table: In a contingency table if each attribute is divided into two
classes it is known as 2×2 contingency table.
a b (a+b)
c d (c+d)
(a+c) (b+d) N=a+b+c+d
For such data, the statistical hypothesis under test is that the two attributes are independent of one
another. For the 2×2 contingency table, the χ² test is given by
χ² = N(ad − bc)² / [(a+b)(c+d)(a+c)(b+d)], where N = a + b + c + d, for 1 d.f.
Or alternatively we can calculate the expected frequency of each cell and then apply the chi-square
test of goodness of fit, e.g. E(a) = (a+b)(a+c)/(a+b+c+d), E(b) = (a+b)(b+d)/(a+b+c+d), and so on.
To test the goodness of fit the chi-square test statistic is given by
χ² = ∑(Oi − Ei)²/Ei, at 1 d.f.

Where, 𝑂𝑖 = Observed Frequency, 𝐸𝑖 = Expected Frequency


If 𝛘2𝑐𝑎𝑙 > 𝛘2𝑡𝑎𝑏 , at 1 d.f., we reject the null hypothesis.
*******************************************************************************
Yates' correction for continuity: F. Yates has suggested a correction for continuity in the χ² value
calculated in connection with a (2 × 2) table, particularly when cell frequencies are small (since no
cell frequency should be less than 5 in any case, though 10 is better as stated earlier) and χ² is just
on the significance level. The correction suggested by Yates is popularly known as Yates'
correction. It involves the reduction of the deviation of observed from expected frequencies, which
of course reduces the value of χ². The rule for correction is to adjust the observed frequency in
each cell of a (2 × 2) table in such a way as to reduce the deviation of the observed from the
expected frequency for that cell by 0.5, but this adjustment is made in all the cells without
disturbing the marginal totals. The formula for finding the value of χ² after applying Yates'
correction can be stated thus:

χ² = N(|ad − bc| − N/2)² / [(a+b)(c+d)(a+c)(b+d)]

It may again be emphasised that Yates’ correction is made only in case of (2 × 2) table and that too when
cell frequencies are small.
*******************************************************************************
Objective: Testing whether the frequencies are equally distributed in a given dataset.
Kinds of data: 200 digits were chosen at random from a set of tables. The frequencies of the digits
were as follows.
Digits 0 1 2 3 4 5 6 7 8 9
Frequency 22 21 16 20 23 15 18 21 19 25

Solution: We set up the null hypothesis H0: The digits were equally distributed in the given
dataset.
Under the null hypothesis the expected frequency of each digit would be
(sum of frequencies)/(no. of digits) = 200/10 = 20.
Then the value of χ² = [(22−20)² + (21−20)² + (16−20)² + (20−20)² + (23−20)² + (15−20)² + (18−20)²
+ (21−20)² + (19−20)² + (25−20)²]/20 = (4+1+16+0+9+25+4+1+1+25)/20 = 86/20 = 4.3
The tabulated value of χ² at 9 d.f. and 5% level of significance is 16.91. Since the calculated value
of χ² is less than the tabulated value, the null hypothesis is accepted. Hence we conclude that the
digits are equally distributed in the given dataset.
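The goodness-of-fit calculation above can be sketched in plain Python (variable names are our own):

```python
# Frequencies of the digits 0-9 in 200 random digits
observed = [22, 21, 16, 20, 23, 15, 18, 21, 19, 25]

# Under H0 every digit is equally likely, so each expected count is 200/10 = 20
expected = sum(observed) / len(observed)

# Chi-square goodness-of-fit statistic with (n - 1) = 9 d.f.
chi2 = sum((o - expected) ** 2 / expected for o in observed)
print(chi2)
```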
*****************************************************************************
Objective: Chi-square test for 2X2 contingency table
Kinds of data: The table given below shows the data obtained during an epidemic of cholera.
                 Attacked   Not attacked
Inoculated          31          469
Not inoculated     185         1315
Test the effectiveness of inoculation in preventing the attack of cholera.
Solution: Here the null and alternative hypotheses are
H0: Inoculation is not effective in preventing the attack of cholera, i.e. Oi = Ei, Vs
H1: Inoculation is effective in preventing the attack of cholera, i.e. Oi ≠ Ei.
Here we use the χ² test
χ² = ∑(Oi − Ei)²/Ei, where Oi = Observed Frequency, Ei = Expected Frequency
Observed frequencies are:
                 Attacked   Not attacked   Total
Inoculated          31          469          500
Not inoculated     185         1315         1500
Total              216         1784         2000
Calculation of expected frequencies:
For attacked: E(31) = 500×216/2000 = 54, E(185) = 1500×216/2000 = 162
For not attacked: E(469) = 500×1784/2000 = 446, E(1315) = 1500×1784/2000 = 1338
Expected frequencies are:
                 Attacked   Not attacked   Total
Inoculated          54          446          500
Not inoculated     162         1338         1500
Total              216         1784         2000
Next we calculate χ² = ∑(Oi − Ei)²/Ei:
Observed Expected Difference Square of
Frequencies(Oi) frequencies(Ei) (Oi-Ei) differences (Oi-Ei)2 (Oi-Ei)2/Ei
31 54 -23 529 9.796
469 446 23 529 1.186
185 162 23 529 3.265
1315 1338 -23 529 0.395
Total 14.642
Here we get χ²cal = 14.642
Degree of Freedom= (2-1)(2-1) = 1
Table values for 1 degree of freedom at 5% level of significance = 3.841
Since 𝛘2𝑐𝑎𝑙 = 14.642 and 𝛘2𝑡𝑎𝑏 = 3.841
𝛘2𝑐𝑎𝑙 > 𝛘2𝑡𝑎𝑏 , we reject the null hypothesis.
Result: χ²cal > χ²tab, so we reject the null hypothesis; that is, inoculation is effective in preventing the
attack of cholera.
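The shortcut 2×2 formula gives the same statistic without building the expected-frequency table. A minimal sketch in plain Python (variable names are our own):

```python
# 2x2 table: rows = inoculated / not inoculated, cols = attacked / not attacked
a, b = 31, 469
c, d = 185, 1315
N = a + b + c + d

# Shortcut chi-square for a 2x2 contingency table (1 d.f.):
# chi2 = N(ad - bc)^2 / [(a+b)(c+d)(a+c)(b+d)]
chi2 = N * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
print(round(chi2, 3))
```

This agrees with the 14.642 obtained above from the expected-frequency table.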
*******************************************************************************

Objective: Chi-square test for 2X2 contingency table when cell frequency is less than 5
Kinds of data: The following information was obtained in a sample of 50 small general shops. Can
it be said that there are relatively more women owners in villages than in town?

Shops
In Towns In Villages Total
Run by men 17 18 35
Run by women 3 12 15
Total 20 30 50
Test your result at 5% level of significance. (χ² for 1 d.f. is 3.841)
Solution: The null and alternative hypotheses are
H0: There are not relatively more women owners in villages than in towns. Vs.
H1: There are relatively more women owners in villages than in towns.
Here, since one cell frequency is less than 5, we apply the chi-square formula along with Yates'
correction as given below:

χ² = N(|ad − bc| − N/2)² / (C1 × C2 × R1 × R2)

where the cells of the 2×2 contingency table are
| a  b |
| c  d |
C1 = sum of first column, R1 = sum of first row,
C2 = sum of second column, R2 = sum of second row, and N = grand total.
By putting the values in the formula we get
χ² = 50 × (|17×12 − 18×3| − 50/2)² / (20 × 30 × 35 × 15) = 50 × (150 − 25)²/315000 = 2.48
The critical value of χ² for 1 d.f. and α = 0.05 is 3.841, i.e.
χ²cal = 2.48 and χ²tab = 3.841.
Since χ²cal < χ²tab, we accept the null hypothesis.
Result: Since χ²cal < χ²tab, we accept the null hypothesis. It may be concluded that there are not relatively
more women owners in villages than in towns.
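The Yates-corrected calculation can be sketched in plain Python (variable names are our own):

```python
# 2x2 table of 50 small general shops
a, b = 17, 18   # run by men: in towns, in villages
c, d = 3, 12    # run by women: in towns, in villages
N = a + b + c + d

# Yates-corrected chi-square for a 2x2 table (1 d.f.):
# chi2 = N(|ad - bc| - N/2)^2 / [(a+b)(c+d)(a+c)(b+d)]
chi2 = N * (abs(a * d - b * c) - N / 2) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
print(round(chi2, 2))
```

Since the result is below the critical value 3.841 at 1 d.f., the null hypothesis is accepted, matching the conclusion above.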
*******************************************************************************
Exercise:
Q1. Six boys are selected at random from a school and their marks in Mathematics are found to be
63, 63, 64, 66, 60, 68 out of 100. In the light of these marks, discuss the general observation
that the mean marks in Mathematics in the school were 66. (Ans. tcal =-1.78)
Q2. The summary of the results of a yield trial on onion with two methods of propagation is given
below. Determine whether the methods differ with regard to onion yield. The onion yield is
given in kg/plot.
Method I   n1 = 12   x̄1 = 25.25   Sum of squares = 186.25
Method II  n2 = 12   x̄2 = 28.83   Sum of squares = 737.67
(Ans. tcal = -1.35)
Q3. A certain stimulus administered to each of 12 patients resulted in the following changes in blood
pressure
di 5 2 8 -1 3 0 -2 1 5 0 4
Can it be concluded that the stimulus will in general be accompanied by an increase in blood
pressure? (Ans. Paired t test, tcal = 2.89)
Q4. The following table gives the number of units produced per day by two workers A and B for a
number of days:
A: 40 30 38 41 38 35
B: 39 38 41 23 32 39 40 34
Should these results be accepted as evidence that B is the more stable worker?
(Ans. 𝑆1 2 =16, 𝑆2 2 =31.44)
Q5. A certain type of surgical operation can be performed either with a local anesthetic or with a
general anesthetic. Results are given below
Alive Dead
Local 511 24
General 147 18
Use the χ² test for testing the difference in the mortality rates associated with the different types
of anesthetic. (Ans. χ²cal = 9.22)
Q6. Twenty two animals suffered from the same disease with the same severity. A serum was
administered to 10 of the animals and the remaining were uninoculated to serve as control. The
results were as follows:
Recovered Died Total
Inoculated 7 3 10
Uninoculated 3 9 12
Total 10 12 22
Apply the χ² test to test the association between inoculation and control of the disease.
Interpret the result. (Ans. χ²cal = 2.82)

7. Analysis of Variance (One way and Two way classification)
P.Mishra
Assistant Professor (Statistics), College of Agriculture, JNKVV, Powarkheda (M.P.) 461110, India
Email id : pradeepjnkvv@gmail.com

Analysis of Variance (ANOVA) :The ANOVA is a powerful statistical tool for tests of
significance. The test of significance based on t-distribution is an adequate procedure only for
testing the significance of the difference between two sample means. In a situation when we have
two or more samples to consider at a time, an alternative procedure is needed for testing the
hypothesis that all the samples have been drawn from the same population. For example, if three
fertilizers are to be compared to find their efficacy, this could be done by a field experiment, in
which each fertilizer is applied to 10 plots and then the 30 plots are later harvested with the crop
yield being calculated for each plot. Now we have 3 groups of ten figures and we wish to know if
there are any differences between these groups. The answer to this problem is provided by the
technique of ANOVA.
Assumptions of ANOVA
The ANOVA test is carried out based on the assumptions below:
• The observations are normally distributed
• The observations are independent of each other
• The variances of the populations are equal

Treatments: The objects of comparison in an experiment are defined as treatments


(1) Suppose an agronomist wishes to know the effect of different spacings on the yield of a crop;
the different spacings will be the treatments. Each spacing will be called a treatment.
(2) A teacher practices different teaching methods on different groups in his class to see which
yields the best results.
(3) A doctor treats a patient with a skin condition with different creams to see which is most
effective.
Experimental unit: An experimental unit is the object to which a treatment is applied to record the
observations. For example, if treatments are different varieties, then the objects to which treatments
are applied to make observations will be different plots of land. The plots will be called experimental units.
Blocks: In agricultural experiments, most of the time we divide the whole experimental area
(field) into relatively homogeneous sub-groups or strata. These strata, which are more uniform
amongst themselves than the field as a whole, are known as blocks.
Degrees of freedom: It is defined as the difference between the total number of items and the total
number of constraints. If “n” is the total number of items and “k” the total number of constraints
then the degrees of freedom (d.f.) is given by d.f. = n-k. In other words the number of degrees of
freedom generally refers to the number of independent observations in a sample minus the number
of population parameters that must be estimated from sample data.
Level of significance (LOS): The maximum probability at which we would be willing to risk a
Type-I error is known as the level of significance; that is, the size of the Type-I error is the level of
significance. The levels of significance usually employed in testing of hypotheses are 5% and 1%.
The level of significance is always fixed in advance, before collecting the sample information. An
LOS of 5% means the results obtained will be true in 95 out of 100 cases, and the results may be
wrong in 5 out of 100 cases.
Experimental error:
The variations in response among the different experimental units may be partitioned into two
components:

i) the systematic part / the assignable part and


ii) the non-systematic / non assignable part.

Variations in experimental units due to different treatments, blocking etc., which are known to the
experimenter, constitute the assignable part. On the other hand, the part of the variation which
cannot be assigned to specific reasons or causes is termed the experimental error. Thus it is often
found that experimental units receiving the same treatments under the same experimental conditions
provide differential responses. This type of variation in response may be due to inherent
differences among the experimental units, errors associated with measurement, etc.; these factors
are known as extraneous factors. So the variation in responses due to these extraneous factors is
termed experimental error.
The purpose of designing an experiment is to increase the precision of the experiment. For
reducing the experimental error, we adopt some techniques. These techniques form the 3 basic
Principles of experimental designs.

1. Replication: The repetition of treatments under investigation is known as replication.


A replication is used (i) to secure more accurate estimate of the experimental error, a term which
represents the differences that would be observed if the same treatments were applied several times
to the same experimental units;
(ii) To reduce the experimental error and thereby to increase precision, which is a measure of the
variability of the experimental error.
2. Randomization: Random allocation of treatments to different experimental units is known as
randomization.
3. Local control: It has been observed that all extraneous sources of variation are not removed by
randomization and replication. This necessitates a refinement in the experimental technique. For
this purpose, we make use of local control, a term referring to the grouping of homogeneous
experimental units. The main purpose of the principle of local control is to increase the efficiency
of an experimental design by decreasing the experimental error.
One-Way ANOVA
One-way ANOVA is an inferential statistical model for analyzing three or more variances at a
time to test the equality of the corresponding means. A test of hypothesis for several sample
means investigating only one factor at k levels, corresponding to k populations, is called one-way
ANOVA. The test is carried out by comparing the F-statistic (F0) estimated from the samples of the
populations with the critical value of F (Fe) at a stated level of significance (such as 1% or 5%)
from the F-distribution table. Only one factor can be analyzed at multiple levels by using
this method. This technique allows each group of samples to have a different number of observations.
It should satisfy replication and randomization in the design of the statistical experiment.

ANOVA Table for One-Way Classification


The ANOVA table for one-way classification shows the formulas and input parameters used
in the analysis of variance for one factor, which involves two or more treatment means, to check
whether the null hypothesis is accepted or rejected at a stated level of significance in statistical
experiments.
Sources of Variation     d.f.   SS    MSS                F-ratio

Between Treatments       k-1    SST   MST = SST/(k-1)    FT = MST/MSE

Error                    N-k    SSE   MSE = SSE/(N-k)

Total                    N-1    TSS

Notable Points for One-Way ANOVA Test

The below are the important notes of one-way ANOVA for test of hypothesis for a single factor
involves three or more treatment means together.
• The null hypothesis H0 : μ1 = μ2 = . . . = μk
Alternative hypothesis H1 : at least one of the means differs from the others
• State the level of significance α (1%, 5%, etc.)
• The sum of all N elements in all the sample data set is known as the Grand Total and is
represented by an English alphabet "G".
• The correction factor CF = G2/N
• The Total Sum of Squares all individual elements often abbreviated as TSS is obtained by
TSS = ∑∑xij2 - CF
• The Sum of Squares of all the class Totals often abbreviated as SST is obtained by
SST = ∑Ti2/ni - CF
• The Sum of Squares due to Error often abbreviated as SSE is obtained by
SSE = TSS - SST
• The degrees of freedom for Total Sum of Squares
TSS = N - 1
• The degrees of freedom for Sum of Squares of all the class Totals
SST = k - 1
• The degrees of freedom for Sum of Squares due to Error
SSE = N - k
• The Mean Sum of Squares of Treatment often abbreviated as MST is obtained by
MST = SST/(k - 1)
• The Mean Sum of Squares due to Error often abbreviated as MSE is obtained by
MSE = SSE/(N - k)
• The variance ratio F between the treatments is the higher variance to the lower variance:
F = MST/MSE or MSE/MST (the larger mean square should always be in the numerator)
• The critical value of F can be obtained by referring to the F distribution table for (k-1, N-k) d.f.
at a stated level of significance such as 1% or 5%.
• The difference between the treatments is not significant, if the calculated F value is lesser
than the value from the F table. Therefore, the null hypothesis H0 is accepted.
• The difference between the treatments is significant, if the calculated F value is greater
than the value from the F table. Therefore, the null hypothesis H0 is rejected.
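The steps above can be sketched directly in plain Python. The data below are hypothetical, invented only to illustrate the formulas; the variable names follow the notation of the list (G, CF, TSS, SST, SSE):

```python
# Hypothetical yields for three treatments (unequal group sizes are allowed)
groups = [
    [20, 21, 23, 16, 20],       # treatment 1
    [29, 25, 26, 24],           # treatment 2
    [22, 28, 27, 25, 24, 26],   # treatment 3
]

N = sum(len(g) for g in groups)                  # total number of observations
k = len(groups)                                  # number of treatments
G = sum(sum(g) for g in groups)                  # grand total
CF = G ** 2 / N                                  # correction factor
TSS = sum(x ** 2 for g in groups for x in g) - CF        # total sum of squares
SST = sum(sum(g) ** 2 / len(g) for g in groups) - CF     # treatment sum of squares
SSE = TSS - SST                                  # error sum of squares

MST = SST / (k - 1)                              # mean square of treatments
MSE = SSE / (N - k)                              # mean square of error
F = MST / MSE                                    # compare with F(k-1, N-k) table value
print(round(F, 2))
```

If the computed F exceeds the tabulated F at (k-1, N-k) d.f., the null hypothesis of equal treatment means is rejected.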
*******************************************************************************

COMPLETELY RANDOMIZED DESIGN (CRD)
Completely randomized design (CRD) is the simplest of all designs, where only two principles of
design of experiments, i.e. replication and randomization, have been used. The principle of local
control is not used in this design. The basic characteristics of this design are that the whole
experimental area (i) should be homogeneous in nature and (ii) should be divided into as many
experimental units as the sum of the numbers of replications of all the treatments. Let us
suppose there are five treatments A, B, C, D, E replicated 5, 4, 3, 3 and 5 times respectively; then
according to this design we require the whole experimental area to be divided into 20 experimental
units of equal size. Thus, completely randomized design is applicable only when the experimental
area is homogeneous in nature. Under laboratory conditions, where other conditions including the
environmental conditions are controlled, completely randomized design is the most accepted and
widely used design. Let there be t treatments replicated r1, r2, ..., rt times respectively. So in total
we require an experimental area of ∑(i=1 to t) ri homogeneous experimental units of equal size.

Randomization and Layout :

To facilitate easy understanding we shall demonstrate the layout and randomization procedure in a
field experiment conducted in CRD with 5 treatments A, B, C, D, E being replicated 5, 4, 3, 2, 6
times respectively. The steps are given as follows :

(i) The total number of experimental units required is 5+4+3+2+6 = 20. Divide the whole
experimental area into 20 experimental units of equal size. For laboratory experiments
the experimental units may be test tubes, petri dishes, beakers, pots, etc., depending upon
the nature of the experiment.
(ii) Number the experimental units 1 to 20.

Figure – 1: Experimental area

1 2 3 4 5

6 7 8 9 10

11 12 13 14 15

16 17 18 19 20

Figure – 2: Experimental area divided and numbered in to 20 experimental units

(iii) Assign the five treatments to the 20 experimental units randomly in such a way that the
treatments A, B, C, D, E are allotted 5, 4, 3, 2, 6 times respectively. For this we require
a random number table and follow the steps given below:
A) Method 1:
Start at any page and any point of row-column intersection of the random number table. Let the
starting point be the intersection of the 5th row and 4th column, and read vertically downward to
get 20 distinct two-digit random numbers. Since 80 is the highest two-digit number which is a
multiple of 20, we reject the numbers 81 to 99 and 00. If a random number is more than 20,
it is divided by 20 and the remainder taken; if the remainder is zero we take it as the last
number, i.e. 20. The process continues till we have 20 distinct random numbers.
a) In the process the random numbers obtained (after taking remainders) are 08, 12, 01, 18, 14,
18, 02, 12, 12, 20, 12, 10, 14, 00, 15, 07, 05, 16, 07, 18, 19, 03, 10, 08, 16, 09, 13, 04, 17,
18, 06, 17, 19, 08, 15 and 11.
b) Repeated random numbers appeared in the above list, so we shall discard the random
numbers which have appeared previously. Thus the selected random numbers will be 08,
12, 01, 18, 14, 02, 20, 10, 15, 07, 05, 16, 19, 03, 09, 13, 04, 17, 06, 11. These random
numbers correspond to the 20 experimental units.
c) The first 5 experimental units corresponding to the first 5 selected random numbers are
allotted the first treatment A, the next 4 experimental units corresponding to the next four
selected random numbers are allotted treatment B, and so on.
d) We demonstrate the whole process (a) to (c) in the following table:
Random number        Remainder   Selected random   Treatment
taken from table                 number            allotted
08                   08          08                A
32                   12          12                A
01                   01          01                A
58                   18          18                A
14                   14          14                A
18                   18          Not selected      -
02                   02          02                B
12                   12          Not selected      -
52                   12          Not selected      -
20                   20          20                B
12                   12          Not selected      -
10                   10          10                B
14                   14          Not selected      -
00                   00          Not selected      -
55                   15          15                B
07                   07          07                C
05                   05          05                C
16                   16          16                C
27                   07          Not selected      -
18                   18          Not selected      -
79                   19          19                D
03                   03          03                D
10                   10          Not selected      -
08                   08          Not selected      -
56                   16          Not selected      -
29                   09          09                E
13                   13          13                E
04                   04          04                E
17                   17          17                E
18                   18          Not selected      -
46                   06          06                E
37                   17          Not selected      -
59                   19          Not selected      -
08                   08          Not selected      -
15                   15          Not selected      -
11                   11          11                E

1 A 2 B 3 D 4 E 5 C
6 E 7 C 8 A 9 E 10 B
11 E 12 A 13 E 14 A 15 B
16 C 17 E 18 A 19 D 20 B
Figure – 3: Layout along with allocation of treatments
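The remainder rule of Method 1 can be mimicked in software. The sketch below (the helper name `unit_from` is our illustrative choice) reproduces the first few rows of the table above:

```python
# Sketch of Method 1's remainder rule: map a two-digit random number to one
# of 20 experimental units, rejecting 81-99 and 00 since 80 is the largest
# two-digit multiple of 20.

def unit_from(number):
    """Return the unit (1-20) for a 2-digit random number, or None if rejected."""
    if number == 0 or number > 80:   # reject 00 and 81-99
        return None
    rem = number % 20
    return 20 if rem == 0 else rem   # a zero remainder stands for unit 20

# first few draws from the table above
drawn = [8, 32, 1, 58, 14, 18, 2, 12, 52, 20, 12, 10]
selected, order = set(), []
for n in drawn:
    u = unit_from(n)
    if u is not None and u not in selected:   # discard repeats
        selected.add(u)
        order.append(u)
print(order)   # → [8, 12, 1, 18, 14, 2, 20, 10]
```

The printed sequence matches the "Selected random number" column of the table for these draws.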

B) Method 2:

Step 1: In the first method we took 2-digit random numbers, and in the process we had to reject a
lot of random numbers because of repetition. To avoid this, instead of taking 2-digit random
numbers one may take 3-digit random numbers, starting from any page and any point of row-column
intersection of the random number table. Let us use the same random number table and start at the
intersection of the 5th row and 2nd column, i.e. 208. We take 20 distinct random numbers of 3
digits, and the numbers are 208, 412, 480, 318, 094, 158, 082, 232, 252, 020, 392, 950, 394, 800,
435, 187, 851, 164, 273, 384. Interestingly, we do not discard any number because of repetition in
the process, i.e. the chance of ties is much smaller here.

Step 2: Rank the random numbers with smallest number getting the lowest rank 1. Thus the
random number along with their respective ranks are :

R No 208 412 480 318 94 158 82 232 252 20 392 950 394 800 435 187 851 164 273 384
Rank 7 15 17 11 3 4 2 8 9 1 13 20 14 18 16 6 19 5 10 12

These ranks correspond to the 20 numbered experimental units

Step 3: Allot the first treatment A to the first five units appearing in rank order, i.e. allot
treatment A to the 7th, 15th, 17th, 11th and 3rd experimental units. Allot treatment B to the next
four experimental units, i.e. the 4th, 2nd, 8th and 9th experimental units, and so on.

R No 208 412 480 318 94 158 82 232 252 20 392 950 394 800 435 187 851 164 273 384
Rank 7 15 17 11 3 4 2 8 9 1 13 20 14 18 16 6 19 5 10 12
Treat. A A A A A B B B B C C C D D E E E E E E

Layout :

1 C 2 B 3 A 4 B 5 E
6 E 7 A 8 B 9 B 10 E
11 A 12 E 13 C 14 D 15 A
16 E 17 A 18 D 19 E 20 C
Figure – 4: Layout along with allocation of treatments.
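Method 2's ranking step is easy to check in code. The sketch below (variable names are ours) reproduces the ranks and the allocation of Figure 4 from the same 20 three-digit random numbers:

```python
# Sketch of Method 2: rank distinct random numbers to obtain a random
# permutation of the experimental units, then allot treatments in blocks
# according to their replication counts (A x5, B x4, C x3, D x2, E x6).

random_numbers = [208, 412, 480, 318, 94, 158, 82, 232, 252, 20,
                  392, 950, 394, 800, 435, 187, 851, 164, 273, 384]

# rank 1 goes to the smallest number, as in Step 2
order = sorted(range(len(random_numbers)), key=lambda i: random_numbers[i])
ranks = [0] * len(random_numbers)
for r, i in enumerate(order, start=1):
    ranks[i] = r

treatments = "A" * 5 + "B" * 4 + "C" * 3 + "D" * 2 + "E" * 6
# the unit numbered `rank` receives the treatment in that position
layout = {rank: t for rank, t in zip(ranks, treatments)}
print(ranks)
```

Running this gives the rank row of Step 2 and a layout identical to Figure 4 (unit 1 → C, unit 2 → B, unit 3 → A, and so on).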

C) Method 3 :
The above two methods are applicable only when a random number table is available. But while
conducting experiments at a farmer's field a random number table may not be available. To overcome
this difficulty, we may opt for the 'drawing lots' technique of randomization. The procedure is as
follows:

a) According to this problem we are to allocate five treatments to twenty experimental units.
We take a sheet of paper and make 20 small pieces of equal size and shape.
b) The twenty pieces of paper, thus made, are then labeled according to the treatments and their
numbers of replications, such that five papers are marked with 'A', four with 'B',
three with 'C', two with 'D' and six with 'E'.
c) Fold the papers uniformly and place them in a bucket/basket/jar etc.
d) Draw one piece of paper at a time without replacement, stirring the container after every draw.
e) Note the sequence of appearance of the treatments.
f) Allot the treatments to the experimental units based on the treatment letter label and the
sequence; the sequence thus corresponds to the experimental units from one to twenty. Let
the appearance of the treatments for this case be as follows:
the appearance of the treatment for this case be as follows :
Sequence 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Treatment D B A A C D B E B C A E E C B E A E E A

Thus the treatment A is allotted to the experimental units 3, 4, 11, 17 and 20, treatment B to 2, 7,
9, 15 and so on. Ultimately the final layout will be as follows :
1 D 2 B 3 A 4 A 5 C
6 D 7 B 8 E 9 B 10 C
11 A 12 E 13 E 14 C 15 B
16 E 17 A 18 E 19 E 20 A
Figure – 5: Layout along with allocation of treatments.
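On a computer, drawing lots amounts to shuffling labeled slips. A minimal sketch, assuming Python's standard `random` module is an acceptable substitute for physical draws (the seed is fixed only to make the example reproducible):

```python
# Sketch of Method 3: label slips with the treatments according to their
# replication counts, shuffle, and read them off in sequence.
import random

slips = list("A" * 5 + "B" * 4 + "C" * 3 + "D" * 2 + "E" * 6)
random.seed(42)          # fixed seed for reproducibility of this sketch
random.shuffle(slips)    # one shuffle replaces the repeated stirred draws

# the slip in position j is the treatment for experimental unit j + 1
layout = {unit: t for unit, t in enumerate(slips, start=1)}
print(layout)
```

Whatever the shuffle order, the replication counts (5, 4, 3, 2, 6) are preserved, just as with the paper slips.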

Analysis :

Statistical Model: Let there be t treatments with r1, r2, r3, ..., rt replications
respectively in a completely randomized design. The model for the experiment will be:

yij = μ + αi + eij ;  i = 1, 2, 3, ..., t;  j = 1, 2, ..., ri

where yij = response corresponding to the j-th observation of the i-th treatment
μ = general effect
αi = additional effect due to the i-th treatment, with Σ ri αi = 0
eij = error associated with the j-th observation of the i-th treatment, i.i.d. N(0, σ²).

Assumptions of the model:
The above model is based on the assumptions that the effects are additive in nature and the error
components are identically, independently distributed as normal variates with mean zero and
constant variance.

Hypothesis to be tested:
H0 : α1 = α2 = α3 = ......... = αi = ....... = αt = 0 against the alternative hypothesis
H1 : all α's are not equal.
Let the level of significance be α. Let the observations of the total n = Σ ri (i = 1, 2, ..., t)
experimental units be as follows:

Replication                     Treatment
               1        2       ....     i        ....     t
1             y11      y21      ....    yi1       ....    yt1
2             y12      y22      ....    yi2       ....    yt2
:              :        :                :                 :
              y1r1     y2r2     ....    yiri      ....    ytrt
Total         y1o      y2o      ....    yio       ....    yto
Mean          ȳ1o      ȳ2o      ....    ȳio       ....    ȳto

The analysis for this type of data is the same as that of one-way classified data discussed in
chapter 1, section (1.2). From the above table we calculate the following quantities:

Grand total = sum of all observations
G = Σi Σj yij = y11 + y21 + y31 + ........ + ytrt

Correction factor CF = G²/n

Total Sum of Squares (TSS) = Σi Σj yij² − CF
= y11² + y21² + y31² + ........ + ytrt² − CF
Treatment Sum of Squares (TrSS) = Σi (yio²/ri) − CF,
where yio = Σj yij = sum of the observations for the i-th treatment
= y1o²/r1 + y2o²/r2 + y3o²/r3 + ....... + yio²/ri + ....... + yto²/rt − CF

Error Sum of Squares (by subtraction): ErSS = TSS − TrSS.


ANOVA table for Completely Randomized Design:

SOV         d.f.    SS      MS                    F-ratio      Tabulated   Tabulated
                                                               F (0.05)    F (0.01)
Treatment   t−1     TrSS    TrMS = TrSS/(t−1)     TrMS/ErMS
Error       n−t     ErSS    ErMS = ErSS/(n−t)
Total       n−1     TSS

The null hypothesis is rejected at the α level of significance if the calculated value of the F
ratio corresponding to treatment is greater than the table value at the same level of significance
with (t−1, n−t) degrees of freedom; that is, we reject H0 if Fcal > Ftab α; (t−1),(n−t), otherwise
one cannot reject the null hypothesis. When the test is non-significant we conclude that there
exist no significant differences among the treatments with respect to the particular characters
under consideration; all treatments are statistically at par.
When the test is significant i.e. when the null hypothesis is rejected then one should find out which
pair of treatments are significantly different and which treatment is either the best or the worst with
respect to the particular characters under consideration.

One of the ways to answer these queries is to use t – test to compare all possible pairs of treatment
means. This procedure is simplified with the help of least significant difference (critical difference)
value as per the given formula below :

LSD = t(α/2, n−t) × √[ErMS (1/ri + 1/ri′)]

where i and i′ refer to the treatments involved in the comparison, t(α/2, n−t) is the table value
of the t distribution at the α level of significance with (n−t) d.f., and √[ErMS (1/ri + 1/ri′)]
is the standard error of the difference (SEd) between the means of treatments i and i′. Thus if
the absolute value of the difference between two treatment means exceeds the corresponding CD
value then the two treatments are significantly different, and the better treatment is adjudged
from the mean values in keeping with the nature of the character under study.
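As a sketch, the CD/LSD formula can be wrapped in a small helper (the name `lsd` and the numeric values are illustrative; the t table value must still be looked up from a table):

```python
# Illustrative helper for the LSD (critical difference) formula above.
import math

def lsd(erms, r_i, r_j, t_tab):
    """LSD for treatments with r_i and r_j replications.

    erms  : error mean square from the ANOVA table
    t_tab : table value t(alpha/2, error d.f.)
    """
    sed = math.sqrt(erms * (1 / r_i + 1 / r_j))   # SE of the difference
    return t_tab * sed

# hypothetical values: ErMS = 3.32 on 8 d.f., 3 reps each, t(0.025, 8) = 2.306
print(round(lsd(3.32, 3, 3, 2.306), 2))
```

Any pair of treatment means whose absolute difference exceeds this value is declared significantly different.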

Advantages and disadvantages of CRD:

A) Advantages:
i) It is the simplest of all experimental designs.
ii) Flexibility in adopting different numbers of replications for different treatments: this is
the only design in which different numbers of replications can be used for different treatments.
In practical situations this is very useful, because the experimenter sometimes faces the
problem of varied availability of experimental material. Even if the response from a particular
experimental unit (or units) is not available, the data can still be analyzed if CRD
was adopted.

B) Disadvantage:
i) The basic assumption of homogeneity of experimental units is rarely met, particularly under
field conditions. That is why this design is suitable mostly under laboratory or greenhouse
conditions.
ii) The principle of "local control", which is very efficient in reducing the experimental error,
is not used in this design.
With an increase in the number of treatments, especially under field conditions, it becomes very
difficult to use this design because of the difficulty in getting a larger number of homogeneous
experimental units.

******************************************************************************

Objective : C.R.D analysis with unequal replication
Kinds of data: Mycelial growth in terms of diameter of the colony (mm) of R. solani isolates on
PDA medium after 14 hours of incubation
R. solani isolates       Mycelial growth              Treatment    Treatment
                     Repl. 1   Repl. 2   Repl. 3     total (Ti)   mean
RS 1                  29.0      28.0      29.0         86.0        28.67
RS 2                  33.5      31.5      29.0         94.0        31.33
RS 3                  26.5      30.0       -           56.5        28.25
RS 4                  48.5      46.5      49.0        144.0        48.00
RS 5                  34.5      31.0       -           65.5        32.75
Grand total          172.0     167.0     107.0        446.0
Grand mean                                                         34.31

Solution: Here we test whether the treatments differ significantly or not.


Grand total = 446
Correction factor = 446²/13 = 15301.23
Total sum of squares = (29² + 28² + ⋯ + 34.5² + 31²) − CF = 789.27
Treatment sum of squares = (86)²/3 + (94)²/3 + (56.5)²/2 + (144)²/3 + (65.5)²/2 − CF
= 16063.92 − CF = 762.69
Error sum of squares = Total sum of squares − treatment sum of squares
= 789.27 − 762.69 = 26.58

Source of variation   Degrees of   Sum of    Mean square   Computed F   Tabular F 5%
                      freedom      squares
Treatment             4            762.69    190.67        57.38*       3.84
Error                 8             26.58      3.32
Total                 12           789.27

Since Fcal is greater than Ftab, the treatments differ significantly. Next we calculate the
standard error and CD (LSD) as per the formula described above.
For example, to compare treatment 1 and treatment 2 we calculate

standard error = √[3.32 × (1/3 + 1/3)] = 1.49, and the t value at 5% and 8 degrees of
freedom = 2.306.

Now CD or LSD = 1.49 × 2.306 = 3.44

The difference between the treatment means of 1 and 2 = 2.66. Hence we find that treatments 1
and 2 do not differ significantly, as given in the table. The comparisons between all the
treatments, along with their significance, are given below (CD values in parentheses).

Treatment   RS 1     RS 2      RS 3      RS 4      RS 5

RS 1        0.00     2.66      0.42      19.33*    4.08*
                     (3.44)    (3.84)    (3.44)    (3.84)

RS 2                 0.00      3.08      16.67*    1.42
                               (3.84)    (3.44)    (3.84)

RS 3                           0.00      19.75*    4.50*
                                         (3.84)    (4.21)

RS 4                                     0.00      15.25*
                                                   (3.84)

RS 5                                               0.00
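The sums of squares in this example can be cross-checked with a short script (a sketch; the data are copied from the table above):

```python
# Cross-checking the R. solani CRD analysis with unequal replication.
groups = [
    [29.0, 28.0, 29.0],   # RS 1
    [33.5, 31.5, 29.0],   # RS 2
    [26.5, 30.0],         # RS 3 (only two replications)
    [48.5, 46.5, 49.0],   # RS 4
    [34.5, 31.0],         # RS 5 (only two replications)
]
N = sum(len(g) for g in groups)             # 13 observations in total
G = sum(map(sum, groups))                   # grand total
CF = G ** 2 / N                             # correction factor
tss = sum(x ** 2 for g in groups for x in g) - CF
trss = sum(sum(g) ** 2 / len(g) for g in groups) - CF
erss = tss - trss
F = (trss / 4) / (erss / 8)                 # (k-1, N-k) = (4, 8) d.f.
print(round(trss, 2), round(erss, 2), round(F, 2))
```

The printed values agree with the hand computation (762.69, 26.58 and F = 57.38).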

*******************************************************************************
Objective: Analysis of CRD with equal replication
Kinds of data: Grain yield of rice resulting from the use of different foliar and granular
insecticides for the control of brown plant hoppers and stem borers, from a CRD experiment with
4 replications (r) and 7 treatments (t).
Grain yield (kg/ha) Treatment Treatment
Treatment R1 R2 R3 R4 total (T) means
Dol- mix (1 kg) 2537 2069 2104 1797 8507 2127
Ferterra 3366 2591 2211 2544 10712 2678
DDT + Y-BHC 2536 2459 2827 2385 10207 2552
Standard 2387 2453 1556 2116 8512 2128
Dimecron-Boom 1997 1679 1649 1859 7184 1796
Dimecron-Knap 1796 1704 1904 1320 6724 1681
Control 1401 1516 1270 1077 5264 1316
Grand total (G) 57110
Grand mean 2040

Solution: Here we test whether the treatments differ significantly or not.

Grand total = 57110.
Correction factor = (57110)²/28 = 116484004
Total sum of squares = (2537² + 2069² + …. + 1077²) − CF
= 124061416 − CF = 7577412.4
Treatment sum of squares = (8507² + 10712² + …… + 5264²)/4 − CF
= 122071179 − CF = 5587174.9
Error sum of squares = 7577412.4 − 5587174.9 = 1990237.5
ANOVA (CRD with equal replication) of rice yield data

Source of Variation   DF   SS        Mean Square   Fcal   Tabular F
                                                          5%     1%
Treatment             6    5587174   931196        9.83   2.57   3.81
Error                 21   1990238   94773
Total                 27   7577412

Hence we find that the treatments differ significantly. After that we calculate the critical
difference.
The standard error of the difference between treatment means = √(2 × 94773/4) = 217.68, and
the t value at the 5% level of significance and 21 error d.f. = 2.08.
Now the CD or LSD at the 5% level of significance = 217.68 × 2.08 = 452.77 kg/ha.
The t value at the 1% level of significance and 21 error d.f. = 2.831.
Now the CD or LSD at the 1% level of significance = 217.68 × 2.831 = 616.25 kg/ha.
Comparison between mean yields of a control and each of the six insecticide treatments
using the LSD test are given in table below.
Treatment       Mean yield (kg/ha)   Difference from control

T7              2127                   811**
T6              2678                  1362**
T5              2552                  1236**
T4              2128                   812**
T3              1796                   480*
T2              1681                   365 ns
T1 (control)    1316                     -
* indicates significant difference at 5 %, ** indicates Significant difference at 1 % and ns
indicates non-significant difference
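The same cross-check works for the equal-replication case (a sketch; data copied from the table above):

```python
# Cross-checking the rice-yield CRD analysis (equal replication).
data = {
    "Dol-mix":       [2537, 2069, 2104, 1797],
    "Ferterra":      [3366, 2591, 2211, 2544],
    "DDT + Y-BHC":   [2536, 2459, 2827, 2385],
    "Standard":      [2387, 2453, 1556, 2116],
    "Dimecron-Boom": [1997, 1679, 1649, 1859],
    "Dimecron-Knap": [1796, 1704, 1904, 1320],
    "Control":       [1401, 1516, 1270, 1077],
}
r, t = 4, len(data)
G = sum(sum(v) for v in data.values())                 # grand total
CF = G ** 2 / (r * t)                                  # correction factor
tss = sum(x ** 2 for v in data.values() for x in v) - CF
trss = sum(sum(v) ** 2 / r for v in data.values()) - CF
erss = tss - trss
F = (trss / (t - 1)) / (erss / (r * t - t))            # (6, 21) d.f.
print(round(F, 2))
```

The computed F matches the ANOVA table value of 9.83, which exceeds the tabular F at both 5% and 1%.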

Two-Way ANOVA
Two-way ANOVA is an inferential statistical technique for analyzing data classified by two
factors at a time, to test the equality of, and inter-relationship between, their level means. It
is a test of hypothesis for several sample means, in which the observations are cross-classified
by treatments and by blocks (or classes). The ANOVA table tests the hypotheses (H0) for treatment
means and for block or class means at a stated level of significance with the help of the
F-distribution. In this analysis of variance, the observations drawn from the populations should
be of the same length, and the design of the experiment should satisfy replication, randomization
and local control.

ANOVA Table for Two-Way Classification
The ANOVA table for two-way classification shows the formulas and input parameters used in the
analysis of variance for two factors, which involves two or more treatment means together with
the null hypothesis at a stated level of significance.

Sources of variation    Df               SS    MSS                          F-ratio

Between treatments      k − 1            SSR   SSR/(k − 1) = MST            MST/MSE = FR

Between blocks          h − 1            SSC   SSC/(h − 1) = MSV            MSV/MSE = FC

Error                   (h − 1)(k − 1)   SSE   SSE/[(k − 1)(h − 1)] = MSE

Total                   N − 1

Notable Points for Two-Way ANOVA Test

Below are the important points of the two-way ANOVA test of hypothesis, for two factors
involving three or more treatment or subject means together.
• The null hypotheses are
H0 : μ1 = μ2 = . . . = μk
H0 : μ.1 = μ.2 = . . . = μ.h
i.e. there is no significant difference between the treatment means or between the class means.
The alternative hypotheses are
H1 : the treatment means μ1, μ2, . . ., μk are not all equal
H1 : the class means μ.1, μ.2, . . ., μ.h are not all equal
i.e. there is a significant difference among the means.
• State the level of significance α (1%, 5%, 10%, etc.)
• The sum of all N elements in all the sample data set is known as the Grand Total and is
represented by an English alphabet "G".
• The correction factor CF = G²/N = G²/(hk)
• The Total Sum of Squares of all individual elements, often abbreviated as TSS, is obtained by
TSS = ∑∑xij² − CF
• The sum of squares of all the treatment (row) totals in the two-way table (h × k), often
abbreviated as SST, is obtained by
SST = SSR = ∑ Ti.²/h − CF
h is the number of observations in each row
• The sum of squares between classes, or sum of squares between columns, is
SSV = SSC = ∑ T.j²/k − CF
k is the number of observations in each column
• The sum of squares due to error, often abbreviated as SSE, is obtained by
SSE = TSS − SSR − SSC
• The degrees of freedom for Total Sum of Squares
TSS = N - 1 = hk - 1
• The degrees of freedom for Sum of Squares between treatments
SST = k - 1
• The degrees of freedom for Sum of Squares between varieties
SSV = h - 1
• The degrees of freedom for error sum of squares
SSE = (k - 1)(h - 1)
• The Mean Sum of Squares of Treatments, often abbreviated as MST, is obtained by
MST = SST/(k − 1)
• The Mean Sum of Squares for varieties, often abbreviated as MSV, is obtained by
MSV = SSV/(h − 1)
• The Mean Sum of Squares due to Error, often abbreviated as MSE, is obtained by
MSE = SSE/[(h − 1)(k − 1)]
• The variance ratio for treatments is FR = MST/MSE
• The variance ratio for subjects or classes is FC = MSV/MSE
• The critical value of F for between treatments (rows) can be obtained by referring to the F
distribution table for (k − 1, (k − 1)(h − 1)) degrees of freedom at the stated level of
significance, such as 1%, 5% or 10%.
• The critical value of F for between varieties (columns) or subjects can be obtained by
referring to the F distribution table for (h − 1, (k − 1)(h − 1)) degrees of freedom at the
stated level of significance, such as 1%, 5% or 10%.
• The difference between the treatments (rows) is not significant if the calculated FR value is
less than the value from the F table. Therefore, the null hypothesis H0 is accepted.
• The difference between the treatments (rows) is significant if the calculated FR value is
greater than the value from the F table. Therefore, the null hypothesis H0 is rejected.
• The difference between the subjects or varieties (columns) is not significant if the calculated
FC value is less than the value from the F table. Therefore, the null hypothesis H0 is accepted.
• The difference between the subjects or varieties (columns) is significant if the calculated FC
value is greater than the value from the F table. Therefore, the null hypothesis H0 is rejected.
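The bullet formulas above can be collected into one sketch (the function name `two_way_anova` and the data are our illustrative choices; rows are the k treatments, columns the h blocks, one observation per cell):

```python
# Illustrative sketch of the two-way ANOVA formulas above.

def two_way_anova(table):
    """Return (FR, FC) for a k x h table with one observation per cell."""
    k, h = len(table), len(table[0])
    N = k * h
    G = sum(sum(row) for row in table)                 # grand total
    CF = G ** 2 / N                                    # correction factor
    tss = sum(x ** 2 for row in table for x in row) - CF
    ssr = sum(sum(row) ** 2 / h for row in table) - CF            # treatments
    ssc = sum(sum(col) ** 2 / k for col in zip(*table)) - CF      # blocks
    sse = tss - ssr - ssc
    mst = ssr / (k - 1)
    msv = ssc / (h - 1)
    mse = sse / ((k - 1) * (h - 1))
    return mst / mse, msv / mse

# hypothetical 3 x 3 table: 3 treatments (rows) in 3 blocks (columns)
fr, fc = two_way_anova([[5, 6, 8],
                        [8, 10, 9],
                        [12, 11, 13]])
print(round(fr, 2), round(fc, 2))
```

Each ratio is then compared with the tabulated F at its own pair of degrees of freedom, as described in the bullets above.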
*******************************************************************************
RANDOMIZED BLOCK DESIGN (RBD)
When the experimental material is heterogeneous, the principle of local control is adopted and
the experimental material is grouped into homogeneous sub-groups. Each sub-group is commonly
termed a block. The blocks are formed with units having common characteristics which may
influence the response under study.
Advantages and disadvantages of RBD:
A) Advantage:
1. The principle advantage of RBD is that it increases the precision of the experiment.
This is due to the reduction of experimental error by adoption of local control.
2. The amount of information obtained in RBD is more as compared to CRD. Hence, RBD is more
efficient than CRD. Since the layout of RBD involves equal replication of treatments, statistical
analysis is simple.

B) Disadvantage:
1. When the number of treatments is increased, the block size will increase.
2. If the block size is large maintaining homogeneity is difficult and hence when more number of
treatments is present this design may not be suitable.

Analysis:

Let us suppose that we have t treatments, each being replicated r times. The appropriate
statistical model for RBD will be:

yij = μ + αi + βj + eij ;  i = 1, 2, 3, ..., t;  j = 1, 2, ..., r

where yij = response corresponding to the j-th replication/block of the i-th treatment
μ = general effect
αi = additional effect due to the i-th treatment, with Σ αi = 0
βj = additional effect due to the j-th replication/block, with Σ βj = 0
eij = error associated with the j-th replication/block of the i-th treatment, i.i.d. N(0, σ²).

The above model is based on the assumptions that the effects are additive in nature and the error
components are identically, independently distributed as normal variates with mean zero and
constant variance.
Let the level of significance be α.
Hypotheses to be tested:
The null hypotheses to be tested are
H0 : (1) α1 = α2 = ...... = αi = ...... = αt = 0
(2) β1 = β2 = ...... = βj = ...... = βr = 0
against the alternative hypotheses
H1 : (1) the α's are not all zero
(2) the β's are not all zero
Let the observations of these n = rt units be as follows:

                          Replications/Blocks
Treatments    1      2     ....    j     ....    r      Total    Mean
1            y11    y12    ....   y1j    ....   y1r     y1o      ȳ1o
2            y21    y22    ....   y2j    ....   y2r     y2o      ȳ2o
:             :      :             :             :       :        :
i            yi1    yi2    ....   yij    ....   yir     yio      ȳio
:             :      :             :             :       :        :
t            yt1    yt2    ....   ytj    ....   ytr     yto      ȳto
Total        yo1    yo2    ....   yoj    ....   yor     yoo
Mean         ȳo1    ȳo2    ....   ȳoj    ....   ȳor

The analysis of this design is the same as that of two-way classified data with one observation
per cell discussed in chapter 1, section (1.3).
From the above table we calculate the following quantities:

Grand total = Σi,j yij = y11 + y21 + y31 + ........ + ytr = G

Correction factor CF = G²/(rt)

Total Sum of Squares (TSS) = Σi,j yij² − CF
= y11² + y21² + y31² + ........ + ytr² − CF

Treatment Sum of Squares (TrSS) = Σi yio²/r − CF
= (y1o² + y2o² + y3o² + ....... + yio² + ....... + yto²)/r − CF

Replication Sum of Squares (RSS) = Σj yoj²/t − CF
= (yo1² + yo2² + yo3² + ....... + yoj² + ....... + yor²)/t − CF

Error Sum of Squares (by subtraction) = TSS − TrSS − RSS

ANOVA table for RBD

Source of Variation   d.f.          SS     MS                        F-ratio     Tabulated   Tabulated
                                                                                 F (0.05)    F (0.01)
Treatment             t−1           TrSS   TrMS = TrSS/(t−1)         TrMS/ErMS
Replication (Block)   r−1           RSS    RMS = RSS/(r−1)           RMS/ErMS
Error                 (t−1)(r−1)    ErSS   ErMS = ErSS/[(t−1)(r−1)]
Total                 rt−1          TSS

The null hypotheses are rejected at the α level of significance if the calculated values of the
F ratio corresponding to treatment and replication are greater than the corresponding table
values at the same level of significance with (t−1), (t−1)(r−1) and (r−1), (t−1)(r−1) degrees of
freedom respectively. That is, we reject H0 if Fcal > Ftab; otherwise one cannot reject the null
hypothesis. When the test is non-significant we conclude that there exist no significant
differences among the treatments/replications with respect to the particular character under
consideration; all treatments/replications are statistically at par.

When the test(s) is (are) significant we reject the null hypothesis and try to find out which
replications or treatments are significantly different from each other. As in CRD, here also in
RBD we use the least significant difference (critical difference) value for comparing the
difference between a pair of means. The CD is calculated as follows:

LSD (CD) = t(α/2, (t−1)(r−1)) × √(2 ErMS / r)

where r is the number of replications and t(α/2, (t−1)(r−1)) is the table value of t at the α
level of significance and (t−1)(r−1) degrees of freedom. Note that the divisor is the number of
replications, since each treatment mean is based on r observations.
*******************************************************************************

Objective: Analysis of Randomized Block Design


Kinds of data: An experiment was conducted in RBD to study the comparative performance of
fodder sorghum under rainfed conditions. The rearranged data are given in the table below:
green matter yield of sorghum (kg/plot).

Variety I II III IV Total Mean


African Tall 22.9 25.9 39.1 33.9 121.8 30.45
Co-11 29.5 30.4 35.3 29.6 124.8 31.2
FS -1 28.8 24.4 32.1 28.6 113.9 28.475
K -7 47 40.9 42.8 32.1 162.8 40.7
Co-24 28.9 20.4 21.1 31.8 102.2 25.55
157.1 142.0 170.4 156.0 625.5
Total

Solution: Here we test whether the varieties differ significantly or not.


Correction factor = 625.5²/20 = 19562.51
Total sum of squares = (22.9² + 25.9² + ⋯ + 31.8²) − CF = 20514.95 − CF = 952.44
Block sum of squares = (157.1² + 142.0² + 170.4² + 156.0²)/5 − CF = 19643.31 − CF = 80.80
Variety sum of squares = (121.8² + 124.8² + 113.9² + 162.8² + 102.2²)/4 − CF
= 20083.04 − CF = 520.53
Error sum of squares = 952.44 − 80.80 − 520.53 = 351.11
By putting the values in the ANOVA table we get:

Source of variation   DF   SS       MSS      F cal    F tab
Replication           3     80.80    26.93   <1       3.490
Variety               4    520.53   130.13   4.448*   3.259
Error                 12   351.11    29.26
Total                 19   952.44

Here we found that the varieties differ significantly.


SE(d) = √(2 EMS / r) = √(2 × 29.2588 / 4) = √14.6294 = 3.8248

CD = t × SE(d) = 2.179 × 3.8248 = 8.33

Variety         Mean
K-7             40.70
Co-11           31.20
African Tall    30.45
FS-1            28.48
Co-24           25.55
[Bar chart: mean green matter yield (kg/plot) of the sorghum varieties — K-7 40.7, Co-11 31.2,
African Tall 30.45, FS-1 28.48, Co-24 25.55]

From the bar chart it can be concluded that the sorghum variety K-7 produces significantly more
green matter than all the other varieties. The remaining varieties are all on par.
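The sorghum RBD computations can be verified with a short sketch (data copied from the table above; variable names are ours):

```python
# Cross-checking the sorghum RBD analysis: t = 5 varieties, r = 4 blocks.
yields = {
    "African Tall": [22.9, 25.9, 39.1, 33.9],
    "Co-11":        [29.5, 30.4, 35.3, 29.6],
    "FS-1":         [28.8, 24.4, 32.1, 28.6],
    "K-7":          [47.0, 40.9, 42.8, 32.1],
    "Co-24":        [28.9, 20.4, 21.1, 31.8],
}
t, r = len(yields), 4
G = sum(sum(v) for v in yields.values())               # grand total
CF = G ** 2 / (t * r)                                  # correction factor
tss = sum(x ** 2 for v in yields.values() for x in v) - CF
vss = sum(sum(v) ** 2 / r for v in yields.values()) - CF            # varieties
bss = sum(sum(b) ** 2 / t for b in zip(*yields.values())) - CF      # blocks
ess = tss - vss - bss
F_variety = (vss / (t - 1)) / (ess / ((t - 1) * (r - 1)))
print(round(vss, 2), round(F_variety, 2))
```

The variety sum of squares (520.53) and F ratio (about 4.45) agree with the ANOVA table above.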

*******************************************************************************

Objective : Analysis of Randomized block design.


Kinds of data: The yields of 6 varieties of a crop in lbs., along with the plan of the
experiment, are given below. The number of blocks is 5, the plot size is 1/20 acre and the
varieties have been represented by A, B, C, D, E and F. Analyze the data and state your
conclusions.
B-I B E D C A F
12 26 10 15 26 62
B-II E C F A D B
23 16 56 30 20 10
B-III A B E F D C
28 9 35 64 23 14
B-IV F D E C B A
75 20 30 14 7 23
B-V D F A C B E
17 70 20 12 9 28
Solution:
Null hypothesis H01: There is no significant difference between the variety means:
α1 = α2 = α3 = α4 = α5 = α6

H02: There is no significant difference between the block means:
β1 = β2 = β3 = β4 = β5

Correction factor CF = (GT)²/(rk)

Variety sum of squares (VSS) = Σ vi²/r − CF

Block sum of squares (BSS) = Σ bj²/k − CF

Total sum of squares (TSS) = ΣΣ y² − CF

Error sum of squares (ESS) = TSS − VSS − BSS
First rearrange the given data
Blocks Varieties Block totals Means
A B C D E F
B1 26 12 15 10 26 62 ΣB1 = 151 25.17
B2 30 10 16 20 23 56 ΣB2 =155 25.83
B3 28 9 14 23 35 64 ΣB3 = 173 28.83
B4 23 7 14 20 30 75 ΣB4 = 169 28.17
B5 20 9 12 17 28 70 ΣB5 = 156 26.00
Variety ΣA = ΣB = ΣC = ΣD = ΣE = ΣF = GT = 804 -
totals 127 47 71 90 142 327
Means 25.4 9.4 14.2 18 28.4 65.4 - -

CF = 804²/30 = 21547.2

VSS = (127² + 47² + 71² + 90² + 142² + 327²)/5 − 21547.2 = 31714.4 − 21547.2 = 10167.2

BSS = (151² + 155² + 173² + 169² + 156²)/6 − 21547.2 = 21608.67 − 21547.2 = 61.47


TSS= (262 + 122 + 152 + ⋯ … . , +282 + 702 ) -21547.2

= 32194 – 21547.2

= 10646.8

ESS = TSS − BSS − VSS

= 10646.8 − 61.47 − 10167.2

= 418.13
ANOVA TABLE
Sources of Variation   d.f.            S.S.      M.S.      F-cal. Value   F-table Value
Blocks                 5 − 1 = 4       61.47     15.37     0.74           F0.05 (4, 20) = 2.87
Varieties              6 − 1 = 5       10167.2   2033.44   97.25          F0.05 (5, 20) = 2.71
Error                  29 − 4 − 5 = 20 418.13    20.91
Total                  30 − 1 = 29     10646.8

Since the calculated value of F (varieties) > the table value of F, H01 is rejected, and hence

we conclude that there is a highly significant difference between the variety means.

SEm = √(EMS/r) = √(20.91/5) = 2.04

SED = √2 × SEm = 1.414 × 2.04 = 2.88

Critical difference = SED × t-table value for error d.f. at 5% LOS
CD = 2.88 × 2.09
= 6.02

Coefficient of variation = (√EMS / grand mean) × 100 = (√20.91 / 26.8) × 100 = 17%

F       E       A       D       C       B
65.4    28.4    25.4    18.0    14.2    9.4

(i) Pairs whose difference exceeds the CD are significantly different.
(ii) Pairs underscored by a common line (E–A, D–C, C–B) are non-significant.
Variety F gives a significantly higher yield than all the other varieties; varieties E and A are
on par, as are D and C, and C and B; E and A give significantly higher yields than D, C and B.
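The underline diagram can be generated mechanically: compare every pair of means against the CD (a sketch; variable names are ours, and any CD value near 6 gives the same grouping):

```python
# Pairwise LSD comparison behind the underline diagram above.
means = {"F": 65.4, "E": 28.4, "A": 25.4, "D": 18.0, "C": 14.2, "B": 9.4}
cd = 6.02   # critical difference from the worked example

names = list(means)
pairs = [(names[i], names[j])
         for i in range(len(names)) for j in range(i + 1, len(names))]
# pairs whose mean difference is below the CD are "on par"
on_par = [(u, v) for u, v in pairs if abs(means[u] - means[v]) < cd]
print(on_par)   # → [('E', 'A'), ('D', 'C'), ('C', 'B')]
```

These are exactly the underscored (non-significant) pairs; every pair involving F is significant.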

Exercise:
Q1. Explain the analysis of one-way classification.
Q2. What do you understand by analysis of variance?
Q3. What are the assumptions of analysis of variance?
Q4.The yields of four varieties of wheat per plot (in lbs.) obtained from an experiment in randomized
block design are given below:
Variety Replication
1 II III IV V
V1 7 9 8 10 10
V2 12 13 15 11 13
V3 15 20 15 18 16
V4 8 10 12 10 8

Analyze the data and state your conclusion.(Ans. Variety Variance=66.13, Error variance=2.59)

Q5.The following table gives the yields in pounds per plot, of five varieties of wheat after being applied
to 4,3,2,4 and 3 plots respectively
Varieties Yield in lbs.
A 8 8 6 10
B 10 9 8
C 8 10
D 7 10 9 8
E 12 8 10
Analyze the data and state your conclusion.(Ans. Variety Variance=1.86, Error variance=2.28)
Q6. Write short notes on:
(a) Local control
(b) Replication
(c) Advantage of C.R.D.

8. Sampling Methods
R. S. Solanki
Assistant professor (Maths & Stat.), College of Agriculture , Waraseoni, Balaghat (M.P.),India
Email id : ramkssolanki@gmail.com

1. Introduction

The terminology "sampling" indicates the selection of a part of a group or an aggregate with
a view to obtaining information about the whole. This aggregate or the totality of all members is
known as Population although they need not be human beings. The selected part, which is used to
ascertain the characteristics of the population, is called Sample. While choosing a sample, the
population is assumed to be composed of individual units or members, some of which are included
in the sample. The total number of members of the population and the number included in the
sample are called Population Size and Sample Size respectively. The process of generalising on the
basis of information collected on a part is really a traditional practice. The annual production of a
certain crop in a region is computed on the basis of a sample. The quality of a product coming out
of a production process is ascertained on the basis of a sample. The government and its various
agencies conduct surveys from time to time to examine various economic and related issues
through samples. Sampling methodology can be used by an auditor or an accountant to estimate the
value of total inventory in the stores without actually inspecting all the items physically. Opinion
polls based on samples are used to forecast the result of a forthcoming election.

2. Advantage of sampling over census

The census or complete enumeration consists in collecting data from each and every unit
of the population. Sampling, by contrast, chooses only a part of the units of the population for the
same study. Sampling has a number of advantages over complete enumeration, for a variety of
reasons.

Less Expensive: The first obvious advantage of sampling is that it is less expensive. If we want to
study the consumer reaction before launching a new product it will be much less expensive to carry
out a consumer survey based on a sample rather than studying the entire population which is the
potential group of customers.

Less Time Consuming: The smaller size of the sample enables us to collect the data more quickly
than to survey all the units of the population even if we are willing to spend money. This is
particularly the case if the decision is time bound. An accountant may be interested to know the
total inventory value quickly to prepare a periodical report like a monthly balance sheet and a profit
and loss account. A detailed study on the inventory is likely to take too long to enable him to
prepare the report in time.

Greater Accuracy: It is possible to achieve greater accuracy by using appropriate sampling
techniques than by a complete enumeration of all the units of the population. Complete
enumeration may result in inaccuracies in the data. Consider an inspector who is visually inspecting
the quality of finish of a certain machinery. After observing a large number of such items he may
no longer be able to distinguish items with a defective finish from good ones. Once such inspection
fatigue develops, the accuracy of examining the population completely is considerably decreased. On the
other hand, if a small number of items is observed the basic data will be much more accurate.
Destructive Enumeration: Sampling is indispensable if the enumeration is destructive. If you are
interested in computing the average life of fluorescent lamps supplied in a batch the life of the
entire batch cannot be examined to compute the average since this means that the entire supply will
be wasted. Thus, in this case there is no other alternative than to examine the life of a sample of
lamps and draw an inference about the entire batch.

3. Simple Random Sampling

The representative character of a sample is ensured by allocating some probability to each
unit of the population for being included in the sample. A simple random sample assigns equal
probability to each unit of the population. A simple random sample can be chosen either with or
without replacement.

Simple Random Sampling with Replacement (SRSWR): Suppose the population consists of N
units and we want to select a sample of size n. In simple random sampling with replacement we
choose an observation from the population in such a manner that every unit of the population has
an equal chance of 1/N to be included in the sample. After the first unit is selected its value is
recorded and it is again placed back in the population. The second unit is drawn in exactly the
same manner as the first unit. This procedure is continued until the nth unit of the sample is selected.
Clearly, in this case each unit of the population has an equal chance of 1/N to be included in each of
the n units of the sample.

In this case the number of possible samples of size n selected from the population of size N is N^n.
The samples selected through this method need not all be distinct.

Simple Random Sampling without Replacement (SRSWOR): In this case when the first unit is
chosen, every unit of the population has a chance of 1/N to be included in the sample. After the first
unit is chosen it is no longer replaced in the population. The second unit is selected from the
remaining (N-1) members of the population so that each unit has a chance of 1/(N-1) to be included
in the sample. The procedure is continued till the nth unit of the sample is chosen with probability
1/(N-n+1).

In this case the number of possible samples of size n selected from the population of size N is
NCn = N!/[n!(N-n)!]. The samples selected through this method are all distinct.
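As a quick arithmetic check of the two counts just stated, a short Python sketch (the values N = 5 and n = 3 are illustrative only):

```python
from math import comb

# Number of possible samples of size n from a population of size N
# under the two schemes described above.
N, n = 5, 3

srswr_count = N ** n        # with replacement: N^n ordered samples
srswor_count = comb(N, n)   # without replacement: N choose n distinct samples

print(srswr_count)   # 125
print(srswor_count)  # 10
```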

Advantages and Disadvantages of Simple Random Sampling:

Advantages: It is a fair method of sampling and, if applied appropriately, it helps to reduce the bias
involved compared with other sampling methods. It is a very basic method of collecting data: no
technical knowledge is required beyond basic listing and recording skills. Simple random sampling
offers researchers an opportunity to perform data analysis in a way that creates a lower margin of
error within the information collected. It offers an equal chance of selection to everyone within the
population group.

Disadvantages: It is a costlier method of sampling as it requires a complete list of all potential
respondents to be available beforehand. It relies on the quality of the researchers performing the
work. It can require a sample size that is too large. It does not work well with widely diverse or
dispersed population groups.

4. Selection of Simple Random Sample

The concept of "randomness" implies that every item being considered has an equal chance
of being selected as part of the sample. To ensure randomness of selection one may adopt either the
Lottery Method or a table of random numbers.

Lottery Method: This is a very popular method of taking a random sample. Under this method, all
items of the universe are numbered or named on separate slips of paper of identical size and shape.
These slips are then folded and mixed up in a container or drum. A blindfold selection is then made
of the number of slips required to constitute the desired sample size. The selection of items thus
depends entirely on chance. The method would be quite clear with the help of an example. If we
want to take a sample of 10 persons out of a population of 100, the procedure is to write the names
of all the 100 persons on separate slips of paper, fold these slips, mix them thoroughly and then
make a blindfold selection of 10 slips. The lottery method is very popular in lottery draws where a
decision about prizes is to be made. However, while adopting lottery method it is absolutely
essential to see that the slips are of identical size, shape and colour, otherwise there is a lot of
possibility of personal prejudice and bias affecting the results. The process of writing N number of
slips is cumbersome and shuffling a large number of slips, where population size is very large, is
difficult. Also human bias may enter while choosing the slips. Hence the other alternative i.e.
random numbers can be used.
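The lottery method can also be simulated in software; the sketch below mimics drawing 10 folded slips out of 100. The slip labels and the fixed seed are illustrative assumptions, not part of the method itself:

```python
import random

# Lottery method sketch: 100 "slips", mixed and drawn blindly.
# random.sample draws without replacement, mimicking pulling
# 10 distinct folded slips from the drum.
random.seed(1)  # fixed seed so the draw is reproducible
slips = [f"Person-{i}" for i in range(1, 101)]
drawn = random.sample(slips, 10)
print(drawn)
```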

Random Number Table Method: A random number table is a table of digits. The digit given in
each position in the table was originally chosen randomly from the digits 1, 2, 3, 4, 5, 6, 7, 8, 9, 0
by a random process in which each digit is equally likely to be chosen, as demonstrated in the small
sample shown below.

Table of Random Numbers

36518 36777 89116 05542 29705
46132 81380 75635 19428 88048
31841 77367 40791 97402 27569
84180 93793 64953 51472 65358
78435 37586 07015 98729 76703
83775 21564 81639 27973 62413
08747 20092 12615 35046 67753
90184 02338 39318 54936 34641
23701 75230 47200 78176 85248
16224 97661 79907 06611 26501
85652 62817 57881 90589 74567
69630 10883 13683 93389 92725
95525 86316 87384 22633 68158

The table usually contains 5-digit numbers, arranged in rows and columns, for ease of reading.
Typically, a full table may extend over as many as four or more pages. The occurrence of any two
digits in any two places is independent of each other. Several standard tables of random numbers
are available, among which the following may be specially mentioned, as they have been tested
extensively for randomness:
• Tippett’s (1927) random number tables consisting of 41,600 random digits grouped into
10,400 sets of four-digit random numbers.
• Fisher and Yates (1938) table of random numbers with 15,000 random digits arranged into
1,500 sets of ten-digit random numbers.
• Kendall and Babington Smith (1939) table of random numbers consisting of 1,00,000
random digits grouped into 25,000 sets of four-digit random numbers.
• Rand Corporation (1955) table of random numbers consisting of 1,00,000 random digits
grouped into 20,000 sets of five-digit random numbers.
• C.R. Rao, Mitra and Mathai (1966) table of random numbers.

How to use a random number table: This is one of several methods of reading numbers from a
random number table.
i. Assume you have the test scores for a population of 200 students. Each student has been
assigned a number from 1 to 200. We want to randomly sample only 5 of the students.
ii. Since the population size is a three-digit number, we will use the first three digits of the
numbers listed in the table.
iii. Without looking, point to a starting spot in the above random number table. Assume we
land on 93793 (2nd column, 4th entry).
iv. This location gives the first three digits to be 937. This choice is too large (> 200), so we
choose the next number in that column. Keep in mind that we are looking for numbers
whose first three digits are from 001 to 200 (representing students).
v. The second choice gives the first three digits to be 375, also too large. Continue down the
column until you find 5 of the numbers whose first three digits are less than or equal to 200.
vi. From this table, we arrive at 200 (20092), 023 (02338), 108 (10883), 070 (07015), and 126
(12615).

Students 23, 70, 108, 126, and 200 will be used for our random sample. Our sample of students
has been randomly selected: each student had an equal chance of being selected, and the
selection of one student did not influence the selection of the others.
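The table-reading steps above can be sketched in code. This is only an illustration: the entries below are copied from the table in the order they are scanned in the worked steps (down the 2nd column starting at 93793, then continuing down the 3rd column):

```python
# Take the first three digits of each 5-digit entry, accept those
# between 001 and 200 (valid student numbers), and stop after 5.
entries = ["93793", "37586", "21564", "20092", "02338", "75230",
           "97661", "62817", "10883", "86316",           # 2nd column
           "89116", "75635", "40791", "64953", "07015",
           "81639", "12615"]                             # 3rd column

sample = []
for entry in entries:
    num = int(entry[:3])                 # first three digits as a number
    if 1 <= num <= 200 and len(sample) < 5:
        sample.append(num)

print(sample)  # [200, 23, 108, 70, 126]
```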
******************************************************************************

Objective: Selection of simple random sample using random number table.


Kinds of data: The number of diseased plants (out of 9) in each of 24 areas is given in the following table:

S.No. of areas 1 2 3 4 5 6 7 8 9 10 11 12
Diseased Plants 1 4 1 2 5 1 1 1 7 2 3 3
S.No. of areas 13 14 15 16 17 18 19 20 21 22 23 24
Diseased Plants 2 2 3 1 2 7 2 6 3 5 3 4

Select a simple random sample with and without replacement of size 6. Compute the average
diseased plants based on the sample. Compare this with the average diseased plants of the
population.
Solution:
Simple random sample with replacement:
We have the diseased plants of population of 24 areas. Each area has been assigned a number from
1 to 24. We want to select a random sample, with replacement, of 6 of the 24 areas.
Step 1: Since the population size is a two digit number, we will use the first two digits of the
numbers listed in the random number table (see appendix).
Step 2: Without looking, point to a starting spot in the random number table. Assume we land on
72918 (4th column, 21st entry). This location gives the first two digits to be 72. This choice is too
large (> 24), so we choose the next number in that column. Keep in mind that we are looking for
numbers whose first two digits are from 01 to 24 (representing areas).
Step 3: The second choice (12468) gives the first two digits to be 12 (≤ 24), so we accept it.
Step 4: Continue down the column until we find 6 of the numbers whose first two digits are less
than or equal to 24. From this table, we arrive at 12 (12468), 17 (17262), 02 (02401), 11 (11333),
10 (10631) and 17 (17220).
Areas 02, 10, 11, 12, 17 and 17 will be used for our random sample (area no. 17 appears twice
because our sample is drawn with replacement).
Average diseased plants based on simple random sample with replacement:

S.No. of areas 02 10 11 12 17 17
Diseased Plants 4 2 3 3 2 2

Average diseased plants = (4 + 2 + 3 + 3 + 2 + 2)/6 = 16/6 = 2.67 ≈ 3
Simple random sample without replacement:
We have the diseased plants of population of 24 areas. Each area has been assigned a number from
1 to 24. We want to select a random sample, without replacement, of 6 of the 24 areas.
Step 1: Since the population size is a two digit number, we will use the first two digits of the
numbers listed in the random number table (see appendix).
Step 2: Without looking, point to a starting spot in the random number table. Assume we land on
13211 (7th column, 17th entry). This location gives the first two digits to be 13. This choice is
acceptable (≤ 24), so we select this number. Keep in mind that we are looking for numbers whose first
two digits are from 01 to 24 (representing areas).
Step 3: Continue down the column until we find 6 of the numbers (repeated numbers not allowed in
SRSWOR) whose first two digits are less than or equal to 24. From this table, we arrive at 22
(22250), 12 (12944), 04 (04014), 19 (19386), 01 (01573) and 20 (20963). Areas 01, 04, 12, 19, 20
and 22 will be used for our random sample.
Average diseased plants based on simple random sample without replacement:

S.No. of areas 01 04 12 19 20 22
Diseased Plants 1 2 3 2 6 5
Average diseased plants = (1 + 2 + 3 + 2 + 6 + 5)/6 = 19/6 = 3.17 ≈ 3
Average diseased plants based on population:
Average diseased plants
= (1 + 4 + 1 + 2 + 5 + 1 + 1 + 1 + 7 + 2 + 3 + 3 + 2 + 2 + 3 + 1 + 2 + 7 + 2 + 6 + 3 + 5 + 3 + 4)/24
= 71/24 = 2.96 ≈ 3
Conclusion: From the above calculations we conclude that the average numbers of diseased plants
based on the simple random samples with and without replacement and on the population are all
approximately equal to 3.
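The three averages in this worked example can be re-checked with a short script; the data and the selected area numbers are copied from the solution above:

```python
# Population: diseased plants in 24 areas (indexed by area number 1..24).
plants = [1, 4, 1, 2, 5, 1, 1, 1, 7, 2, 3, 3,
          2, 2, 3, 1, 2, 7, 2, 6, 3, 5, 3, 4]

srswr_areas = [2, 10, 11, 12, 17, 17]   # with replacement (17 repeats)
srswor_areas = [1, 4, 12, 19, 20, 22]   # without replacement

srswr_mean = sum(plants[a - 1] for a in srswr_areas) / len(srswr_areas)
srswor_mean = sum(plants[a - 1] for a in srswor_areas) / len(srswor_areas)
pop_mean = sum(plants) / len(plants)

print(round(srswr_mean, 2), round(srswor_mean, 2), round(pop_mean, 2))
# 2.67 3.17 2.96 — all close to 3
```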
******************************************************************************

Objective: Selection of simple random sample under SRSWOR.


Kinds of data: The data relate to the hypothetical population whose units are 1, 2, 3, 4 and 5.
Draw a sample of size n = 3 using SRSWOR and show that the sample mean is an estimate of the
population mean.
Solution: The number of all possible samples of size n = 3 under SRSWOR is NCn = 5C3 = 10.
Population mean ȳN = Σyi/N = 15/5 = 3. Compute the mean ȳn = Σyi/n of each sample.
The 10 possible samples are given below in the table.
S.No.  Possible samples  Sample mean ȳn
1. 1,2,3 2.0
2. 2,3,4 3.0
3. 3,4,5 4.0
4. 4,5,1 3.33
5. 5,1,2 2.67
6. 1,3,4 2.67
7. 2,4,5 3.67
8. 3,5,1 3.0
9. 4,1,2 2.33
10. 5,2,3 3.33
Total 30.0

Now we have to check whether E(ȳn) = ȳN.

E(ȳn) = Σȳn / NCn = 30/10 = 3 = ȳN

Hence we can say that the sample mean ȳn is an unbiased estimate of the population mean ȳN.
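The enumeration above can be verified programmatically; a minimal sketch using itertools:

```python
from itertools import combinations

# Enumerate all C(5,3) = 10 SRSWOR samples from the population {1,...,5}
# and check that the mean of the sample means equals the population mean.
population = [1, 2, 3, 4, 5]
pop_mean = sum(population) / len(population)   # ȳN = 3.0

samples = list(combinations(population, 3))
sample_means = [sum(s) / 3 for s in samples]
# Exact average of the sample means, computed from integer sums:
expected = sum(sum(s) for s in samples) / (3 * len(samples))

print(len(samples))        # 10
print(expected, pop_mean)  # 3.0 3.0
```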

*******************************************************************************

Objective: Selection of simple random sample under SRSWR.


Kind of data: Consider a finite population of size N = 5 whose sampling units take the values
(1, 2, 3, 4, 5). Enumerate all possible samples of size n = 2 using SRSWR and check whether the
sample mean is an estimate of the population mean.
Solution: The number of all possible samples of size n = 2 under SRSWR is N^n = 5^2 = 25.
Population mean ȳN = Σyi/N = 15/5 = 3. Compute the mean ȳn = Σyi/n of each sample.

S.No.  Possible samples  Sample mean ȳn     S.No.  Possible samples  Sample mean ȳn
1      1,2               1.5                13     4,1               2.5
2      1,3               2.0                14     5,1               3.0
3      1,4               2.5                15     3,2               2.5
4      1,5               3.0                16     4,2               3.0
5      2,3               2.5                17     5,2               3.5
6      2,4               3.0                18     4,3               3.5
7      2,5               3.5                19     5,3               4.0
8      3,4               3.5                20     5,4               4.5
9      3,5               4.0                21     1,1               1.0
10     4,5               4.5                22     2,2               2.0
11     2,1               1.5                23     3,3               3.0
12     3,1               2.0                24     4,4               4.0
                                            25     5,5               5.0
                                            Total                    75.0

Now we have to check whether E(ȳn) = ȳN.

E(ȳn) = Σȳn / N^n = 75/25 = 3 = ȳN

Hence we can say that the sample mean ȳn is an unbiased estimate of the population mean ȳN.
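The SRSWR enumeration can be verified the same way, with ordered pairs generated by itertools.product:

```python
from itertools import product

# Enumerate all 5^2 = 25 ordered SRSWR samples of size 2 and check
# that E(sample mean) equals the population mean.
population = [1, 2, 3, 4, 5]
pop_mean = sum(population) / len(population)   # ȳN = 3.0

samples = list(product(population, repeat=2))
# Exact average of the sample means, computed from integer sums:
expected = sum(sum(s) for s in samples) / (2 * len(samples))

print(len(samples))        # 25
print(expected, pop_mean)  # 3.0 3.0
```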
*******************************************************************************
Exercise:
Q1. The data below indicate the number of workers in the factory for twelve factories

Factory         1     2     3    4    5    6     7    8    9    10    11    12
No. of Workers  2145  1547  745  215  784  3125  126  471  841  3215  2496  589

Select a simple random sample without replacement of size four with the help of random
number table (see Appendix). Compute the average number of workers per factory based on
the sample. Compare this number with the average number of workers per factory in the
population.

Q2. A class has 115 students. Select a simple random sample with replacement of size 15.

Q3. The following data are the yields (q/ha) of 30 varieties of paddy maintained in a research
station for breeding trials:
49 78 57 55 45 26 70 21 75 94 56 62 64 79 85
47 67 43 31 38 33 50 37 75 32 42 52 22 63 40

Select a simple random sample without replacement of size 8. Compute the average yield of
paddy based on the sample. Compare this yield with the average yield of paddy in the
population.

Q4. A population has 7 units: 1, 2, 3, 4, 5, 6 and 7. Write down all possible samples of size 2 (without
replacement) which can be drawn from the given population and verify that the sample mean
is an estimate of the population mean.

Q5. How many random samples of size 5 can be drawn from a population of size 10 if sampling is
done with replacement?
********************************************************************************

REFERENCES:

1. Practicals in Statistics, by H.L. Sharma.
2. Statistical Methods, by G.W. Snedecor.
3. Experimental Designs and Survey Sampling: Methods and Applications, by H.L. Sharma.
4. A Handbook of Agricultural Statistics, by S.R.S. Chandel.
5. The Theory of Sample Surveys and Statistical Decisions, by K.S. Kushwaha and Rajesh Kumar.
6. Fundamentals of Mathematical Statistics, by S.C. Gupta and V.K. Kapoor.
7. A Textbook of Agricultural Statistics, by R. Rangaswamy, New Age International (P) Limited.
8. Essentials of Statistics in Agriculture Sciences, edited by P. Mishra and F. Homa, Apple Academic Press / CRC Press (Taylor and Francis Group), New York.
9. Agriculture and Applied Statistics-I, by P.K. Sahu (2004), Kalyani Publishers.
