
PRACTICAL MANUAL

AGRICULTURAL STATISTICS

STATISTICAL METHODS, SAMPLING TECHNIQUES AND EXPERIMENTAL DESIGNS

( UG/PG COURSES)

Compiled by
SURABHI JAIN
(Asst. Prof./Scientist)
&
H. L. Sharma
Professor and Head,
DEPARTMENT OF MATHEMATICS AND STATISTICS
Jawaharlal Nehru Krishi Vishwa Vidyalaya,
JABALPUR 482 004 ( M.P.)
Contents
S. No. Chapter Name
1. Frequency Distribution
2. Graphical Representation of data
3. Curve fitting
4. Measures of Central Tendency
5. Measures of Dispersion
6. Skewness and Kurtosis
7. Probability
8. Discrete and Continuous Distribution
9. Correlation and Regression
10. Multiple and Partial Correlation
11. Multiple Regression Equation and Analysis Technique
12. Simple and Stratified Random sampling
13. Ratio and Regression Estimator
14. Large sample test
15. Small sample test
16. Chi-Square test
17. Experimental Design
18. Factorial Design
19. Confounding
References
1. Frequency distribution
Frequency distribution: a frequency distribution condenses a large amount of data into classes so that the information of interest can be read off easily.
To construct the frequency distribution, first find the range = maximum value − minimum value.
An approximate number k of classes can be determined from either of the following formulas:
k = 1 + 3.322 log10(N) (Sturges' rule) or k = log(N)/log(2), where N is the total frequency (number of observations). Round up the answer to the next integer.
The class interval is obtained by dividing the range by the number of classes.
Kinds of data: The list of 17 IQ scores is: 118, 123, 124, 125, 127, 128, 129, 130, 130, 133, 136, 138, 141, 142, 149, 150, 154.
Solution: Here range = 154 − 118 = 36.
The number of classes is k = 1 + 3.322 log10(17) = 5.09 ≈ 5, or log(17)/log(2) = 4.09, which rounds up to 5, so 5 classes can be considered.
Class interval = range / number of classes = 36/5 = 7.2 ≈ 8
Upper boundary of the first class = minimum value + class interval = 118 + 8 = 126
Since the data are discrete, we subtract 1 from this value, so the first class is 118-125. The next class starts from the next integer, 126.
So our frequency distribution table would be
I.Q. (class interval) Number or frequency
118-125 4
126-133 6
134-141 3
142-149 2
150-157 2
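The construction steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `frequency_table` is hypothetical, the class count is rounded to the nearest integer as in the worked example, and the class width is rounded up to a whole number.

```python
import math

iq_scores = [118, 123, 124, 125, 127, 128, 129, 130, 130,
             133, 136, 138, 141, 142, 149, 150, 154]

def frequency_table(scores):
    """Return (k, width, rows) where each row is (lower, upper, frequency)."""
    n = len(scores)
    lowest, highest = min(scores), max(scores)
    # Number of classes from k = 1 + 3.322 log10(N), rounded to the nearest integer
    k = round(1 + 3.322 * math.log10(n))
    # Class interval = range / number of classes, rounded up
    width = math.ceil((highest - lowest) / k)
    rows = []
    lower = lowest
    while lower <= highest:
        upper = lower + width - 1          # inclusive upper limit for discrete data
        count = sum(lower <= s <= upper for s in scores)
        rows.append((lower, upper, count))
        lower = upper + 1                  # next class starts at the next integer
    return k, width, rows

k, width, rows = frequency_table(iq_scores)
for lower, upper, count in rows:
    print(f"{lower}-{upper} {count}")
```

For the IQ data this reproduces the five inclusive classes and their frequencies shown in the table.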

Classes in which both the lower and the upper limits are included in the same class are called inclusive classes, whereas classes in which the upper limit of one class is the same as the lower limit of the next class, e.g. 10-15, 15-20, etc., are known as exclusive classes.

Remark: To apply any statistical technique, the inclusive classes should first be converted to exclusive classes. For this purpose we find the conversion factor

$\text{conversion factor} = \frac{\text{lower limit of second class} - \text{upper limit of first class}}{2}$

and add this amount to the upper limit of each class and subtract it from the lower limit of each class.
In the present example the conversion factor = $\frac{126 - 125}{2} = 0.5$
So we add 0.5 to each upper limit (e.g. 125) and subtract 0.5 from each lower limit (e.g. 126) and finally get the exclusive classes.

I.Q. (class interval) Number or frequency
117.5-125.5 4
125.5-133.5 6
133.5-141.5 3
141.5-149.5 2
149.5-157.5 2

Generally, the number of classes should lie between 5 and 15.
2. Graphical Representation of data
Graphical representation of data: graphs and diagrams are used to make data attractive and effective, to make them ready for comparison, and to save time and energy.
Type of graph: Simple bar diagram
Characteristics: Thick lines (bars), placed at equal distances, are used to represent the corresponding figures.
Example (bar diagram; y-axis: population in millions):
State UP MP AP
Population (in millions) 78 80 83

Type of graph: Histogram
Characteristics: In a histogram the area of each rectangle is proportional to the frequency of the corresponding class.
Example (histogram; x-axis: marks in maths, y-axis: number of students):
Marks in maths 0-10 10-20 20-30 30-40 40-50
No. of students 2 5 7 5 3

Type of graph: Pie diagram
Characteristics: In a pie diagram the sectors of the circle represent the components of a total quantity. Each component can be expressed as an angle (in degrees) or in %.
Example (pie diagram of monthly expenditure):
Items Food Cloth Edu. HRA Other Total
Exp. (in Rs.) 800 200 350 400 250 2000
Angle (in degrees) 144 36 63 72 45 360
Share (in %) 40 10 17.5 20 12.5 100
For % calculation: Food = $\frac{800}{2000} \times 100 = 40$ %
For angle calculation: Food = $\frac{800}{2000} \times 360 = 144$ degrees
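The percentage and angle calculations for a pie diagram can be checked with a short Python sketch. The dictionary keys and the function name `pie_shares` are illustrative only; the expenditure figures are those of the example.

```python
expenditure = {"food": 800, "cloth": 200, "education": 350, "HRA": 400, "other": 250}

def pie_shares(items):
    """Return {item: (percentage share, sector angle in degrees)}."""
    total = sum(items.values())
    return {name: (value * 100 / total, value * 360 / total)
            for name, value in items.items()}

shares = pie_shares(expenditure)
for name, (percent, angle) in shares.items():
    print(f"{name}: {percent} %  {angle} degrees")
```

The angles of all sectors add up to 360 degrees and the percentages to 100.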

Type of graph: Frequency polygon
Characteristics: It is obtained by plotting the mid-points of the class intervals on the x-axis against their corresponding frequencies on the y-axis and joining the points by straight lines (same data as the histogram).
Example (frequency polygon; x-axis: marks in maths, y-axis: number of students).
3. Curve fitting
Curve fitting: is used to find an analytic expression of the form y = f(x) for the functional relationship suggested by the given data.
Fitting of a straight line: Y = a + bX
In curve fitting by the principle of least squares we have to determine a and b so that
$E = \sum (y_i - a - b x_i)^2$ is minimum.
Differentiating E with respect to a and b and equating to zero gives the two normal equations
$\sum y_i = na + b\sum x_i$
$\sum x_i y_i = a\sum x_i + b\sum x_i^2$
By solving these two equations the values of a and b can be obtained.
Fitting of a second degree parabola: $Y = a + bX + cX^2$
In curve fitting by the principle of least squares we have to determine a, b and c so that
$E = \sum (y_i - a - b x_i - c x_i^2)^2$ is minimum.
Differentiating E with respect to a, b and c and equating to zero gives the three normal equations
$\sum y_i = na + b\sum x_i + c\sum x_i^2$
$\sum x_i y_i = a\sum x_i + b\sum x_i^2 + c\sum x_i^3$
$\sum x_i^2 y_i = a\sum x_i^2 + b\sum x_i^3 + c\sum x_i^4$
By solving these three equations the values of a, b and c can be obtained.

Objective: Fitting of a straight line.

Kinds of data: Fit a straight line to the following data, treating X as the independent variable.
X: 1 2 3 4 6 8
Y: 2.4 3.0 3.6 4 5 6
Solution: Let the line be Y = a + bX.
X Y X^2 XY
1 2.4 1 2.4
2 3.0 4 6.0
3 3.6 9 10.8
4 4.0 16 16.0
6 5.0 36 30.0
8 6.0 64 48.0
Total 24 24 130 113.2
Using the two normal equations of the straight line, we have
24 = 6a + 24b and 113.2 = 24a + 130b
On solving the two equations, we get a = 1.976 and b = 0.506.
Thus Ŷ = 1.976 + 0.506X is the fitted straight line.
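As a check on the arithmetic, the two normal equations can be solved directly in Python. This is a minimal sketch; the function name `fit_line` is hypothetical and the closed-form solution used is the standard one obtained by eliminating a from the two normal equations.

```python
def fit_line(xs, ys):
    """Least-squares line Y = a + bX from the two normal equations."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Eliminate a: b = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2), then a = (Sy - b*Sx) / n
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

a, b = fit_line([1, 2, 3, 4, 6, 8], [2.4, 3.0, 3.6, 4, 5, 6])
print(round(a, 3), round(b, 3))
```

Rounded to three decimals this gives the same coefficients as the hand calculation.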

Objective: Fitting of a quadratic curve to the following data, treating X as the independent variable.
Kinds of data:
X: 0 1 2 3 4
Y: 1 1.8 1.3 2.5 6.3
Solution: Let $Y = a + bX + cX^2$ be the equation of the quadratic curve.
X Y X^2 X^3 X^4 XY X^2Y
0 1 0 0 0 0 0
1 1.8 1 1 1 1.8 1.8
2 1.3 4 8 16 2.6 5.2
3 2.5 9 27 81 7.5 22.5
4 6.3 16 64 256 25.2 100.8
Total 10 12.9 30 100 354 37.1 130.3
Using the three normal equations for the quadratic curve, we have
12.9 = 5a + 10b + 30c
37.1 = 10a + 30b + 100c
130.3 = 30a + 100b + 354c
Solving the above three equations, we have a = 1.42, b = -1.07 and c = 0.55.
Thus the quadratic curve is fitted as
$\hat{Y} = 1.42 - 1.07X + 0.55X^2$
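The three normal equations can also be built and solved programmatically. The sketch below uses a small Gauss-Jordan elimination written for this purpose; the names `solve_linear` and `fit_parabola` are hypothetical helpers, not part of any library.

```python
def solve_linear(matrix, rhs):
    """Solve a small square system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Bring the largest remaining entry of this column onto the diagonal
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_parabola(xs, ys):
    """Least-squares parabola Y = a + bX + cX^2 from the three normal equations."""
    power_sums = [sum(x ** p for x in xs) for p in range(5)]              # sums of x^0 .. x^4
    moment_sums = [sum((x ** p) * y for x, y in zip(xs, ys)) for p in range(3)]
    matrix = [[power_sums[i + j] for j in range(3)] for i in range(3)]
    return solve_linear(matrix, moment_sums)

a, b, c = fit_parabola([0, 1, 2, 3, 4], [1, 1.8, 1.3, 2.5, 6.3])
print(round(a, 2), round(b, 2), round(c, 2))
```

For the data of the example the coefficient matrix is exactly the one written out in the solution, with rows (5, 10, 30), (10, 30, 100) and (30, 100, 354).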
Remark: If the values of X and Y are very large, the computation of ∑X, ∑X^2, ∑X^3, ∑XY, … becomes difficult and takes more time and energy. In that case the calculations may be reduced by a change of origin and scale of the data.
4. Measures of Central Tendency

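Measures of Central Tendency: give us an idea about the average/central value of the distribution.

Each measure is listed with its definition and its formulas for ungrouped and grouped data.

Mean: Average of the given data; the most stable measure of central tendency.
Ungrouped data: A.M. = sum of observations / number of observations = $\frac{x_1 + x_2 + \cdots + x_n}{n}$
Grouped data:
1. Ordinary method: $\bar{X} = \frac{\sum f_i x_i}{\sum f_i}$
2. Change of origin method: $\bar{X} = A + \frac{\sum f_i o_i}{\sum f_i}$, where $o_i = x_i - A$
3. Change of scale method: $\bar{X} = \frac{\sum f_i s_i}{\sum f_i} \times h$, where $s_i = \frac{x_i}{h}$
4. Change of origin and scale method: $\bar{X} = A + \frac{\sum f_i d_i}{\sum f_i} \times h$, where $d_i = \frac{x_i - A}{h}$

Median: Divides the whole data set into 2 equal parts; also used for qualitative data.
Ungrouped data: Arrange the observations in ascending order. If the number of observations n is
(1) odd: the $\left(\frac{n+1}{2}\right)$th term is the median;
(2) even: the mean of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th terms is the median.
Grouped data: $Md = L + \frac{\frac{N}{2} - F}{f} \times h$

Mode: Most frequently occurring observation; ill-defined.
Ungrouped data: Prepare the frequency table and find the observation with the highest frequency.
Grouped data: $Mo = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times h$

Geometric Mean: nth root of the product of the n observations; gives more weightage to small items.
Ungrouped data: $GM = \text{Antilog}\left(\frac{1}{n}\sum \log x_i\right)$
Grouped data: $GM = \text{Antilog}\left(\frac{1}{\sum f_i}\sum f_i \log x_i\right)$

Harmonic Mean: Reciprocal of the arithmetic mean of the reciprocals of the given values; gives more weightage to small items.
Ungrouped data: $HM = \frac{n}{\sum \frac{1}{x_i}}$
Grouped data: $HM = \frac{\sum f_i}{\sum \frac{f_i}{x_i}}$

Quartiles, Deciles and Percentiles
Quartiles: 3 in number; divide the series into 4 equal parts.
Ungrouped data: $Q_i = \frac{i(n+1)}{4}$th observation
Grouped data: $Q_i = L + \frac{\frac{iN}{4} - F}{f} \times h$, where i = 1, 2, 3
Deciles: 9 in number; divide the series into 10 equal parts.
Ungrouped data: $D_i = \frac{i(n+1)}{10}$th observation
Grouped data: $D_i = L + \frac{\frac{iN}{10} - F}{f} \times h$, where i = 1, 2, 3, ..., 9
Percentiles: 99 in number; divide the series into 100 equal parts.
Ungrouped data: $P_i = \frac{i(n+1)}{100}$th observation
Grouped data: $P_i = L + \frac{\frac{iN}{100} - F}{f} \times h$, where i = 1, 2, 3, ..., 99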
Notations: A = any arbitrary value of the variable x (generally a mid value), h = class interval, L = lower limit of the median/quartile/decile/percentile/modal class, N = total frequency, F = cumulative frequency of the class preceding the median/quartile/decile/percentile class, f = frequency of the median/quartile/decile/percentile class, f1 = frequency of the modal class, f0 and f2 = frequencies of the classes preceding and succeeding the modal class.
Important note:
• The unit of a measure of central tendency is the same as the unit of the given data set.
• Q2 = Median = D5 = P50
Objective: Computation of Measures of Central Tendency by all methods for Ungrouped data.
Data: Suppose the data are 10, 7, 11, 9, 9, 10, 7, 9, 12
Solution:
(1) Arithmetic mean = $\frac{10+7+11+9+9+10+7+9+12}{9} = \frac{84}{9} = 9.33$

(2) Median: Arrange the observations in ascending order:
7, 7, 9, 9, 9, 10, 10, 11, 12
Here the number of observations is odd, so the $\left(\frac{9+1}{2}\right)$th term, i.e. the 5th term, is the median.
The 5th term is 9. So the median is 9.

(3) Mode: Prepare the frequency table
Observation Frequency
7 2
9 3
10 2
11 1
12 1
The maximum frequency is 3. So the mode is 9.

(4) Geometric Mean = $(10 \times 7 \times 11 \times 9 \times 9 \times 10 \times 7 \times 9 \times 12)^{1/9}$
$\log_{10} GM = \frac{1}{9}(\log_{10}10 + \log_{10}7 + \log_{10}11 + \log_{10}9 + \log_{10}9 + \log_{10}10 + \log_{10}7 + \log_{10}9 + \log_{10}12)$
$= \frac{1}{9}(1.00+0.85+1.04+0.95+0.95+1.00+0.85+0.95+1.08) = \frac{8.67}{9} = 0.963$
GM = Antilog(0.963) = 9.19

(5) Harmonic Mean = $\frac{9}{\frac{1}{10}+\frac{1}{7}+\frac{1}{11}+\frac{1}{9}+\frac{1}{9}+\frac{1}{10}+\frac{1}{7}+\frac{1}{9}+\frac{1}{12}} = \frac{9}{0.99} = 9.06$

(6) Calculation of 3rd Quartile, 5th Decile and 60th Percentile (first arrange the data in ascending order)
(a) Quartile: $Q_i = \frac{i(n+1)}{4}$th observation, so $Q_3 = \frac{3(9+1)}{4}$th $= \frac{30}{4}$th observation = 7.5th observation
So Q3 = 7th observation + 0.5 × (8th observation − 7th observation) = 10 + 0.5 × (11 − 10) = 10.5
So 10.5 is the 3rd Quartile value.
(b) Decile: $D_i = \frac{i(n+1)}{10}$th observation, so $D_5 = \frac{5(9+1)}{10}$th = 5th observation = 9
So 9 is the 5th Decile value.
(c) Percentile: $P_i = \frac{i(n+1)}{100}$th observation, so $P_{60} = \frac{60(9+1)}{100}$th = 6th observation = 10
So 10 is the 60th Percentile.
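The ungrouped-data calculations above can be reproduced with the following Python sketch. The helper names `value_at` and `central_tendency` are hypothetical; fractional positions are handled by linear interpolation between neighbouring observations, which is the rule used for the 7.5th observation in the example.

```python
import math
from collections import Counter

data = [10, 7, 11, 9, 9, 10, 7, 9, 12]

def value_at(ordered, position):
    """Value at a 1-based, possibly fractional, position in sorted data."""
    whole = int(position)
    fraction = position - whole
    value = ordered[whole - 1]
    if fraction:
        # Interpolate towards the next observation, e.g. the 7.5th observation
        value += fraction * (ordered[whole] - value)
    return value

def central_tendency(values):
    ordered = sorted(values)
    n = len(ordered)
    return {
        "mean": sum(ordered) / n,
        "median": value_at(ordered, (n + 1) / 2),
        "mode": Counter(ordered).most_common(1)[0][0],
        "gm": math.exp(sum(math.log(v) for v in ordered) / n),
        "hm": n / sum(1 / v for v in ordered),
        "q3": value_at(ordered, 3 * (n + 1) / 4),
        "d5": value_at(ordered, 5 * (n + 1) / 10),
        "p60": value_at(ordered, 60 * (n + 1) / 100),
    }

result = central_tendency(data)
for name, value in result.items():
    print(name, round(value, 2))
```

Because the code keeps full precision in the logarithms, the geometric mean comes out slightly above the 9.19 obtained from two-decimal logarithms.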

Objective: Computation of Measures of Central Tendency by all methods for Grouped data.

Kinds of data: The following data relate to the percentage of marks obtained by 556 students in a certain examination.

Class intervals (Marks) 1-10 11-20 21-30 31-40 41-50 51-60 61-70 71-80 81-90 91-100
No. of Students 2 20 50 90 107 115 91 53 20 8

Solution: In the calculations the classes are taken as 0-10, 10-20, ..., 90-100. The columns of the table are: class interval (% marks); frequency Fi; cumulative frequency (CF); mid value Xi and FiXi (ordinary method); Oi = Xi − A and FiOi (change of origin method, A = 55); Si = Xi/h and FiSi (change of scale method, h = 10); di = (Xi − A)/h and Fidi (change of origin and scale method).
Class Fi CF Xi FiXi Oi FiOi Si FiSi di Fidi
0-10 2 2 5 10 -50 -100 0.5 1.0 -5 -10
10-20 20 22 15 300 -40 -800 1.5 30.0 -4 -80
20-30 50 72 25 1250 -30 -1500 2.5 125.0 -3 -150
30-40 90 162 35 3150 -20 -1800 3.5 315.0 -2 -180
40-50 107 (f0) 269 45 4815 -10 -1070 4.5 481.5 -1 -107
50-60 115 (f1) 384 55 6325 0 0 5.5 632.5 0 0
60-70 91 (f2) 475 65 5915 10 910 6.5 591.5 1 91
70-80 53 528 75 3975 20 1060 7.5 397.5 2 106
80-90 20 548 85 1700 30 600 8.5 170.0 3 60
90-100 8 556 95 760 40 320 9.5 76.0 4 32
Total 556 28200 -50 -2380 50 2820 -5 -238

Arithmetic Mean
(a) Ordinary method = $\frac{28200}{556} = 50.72$ %
(b) Change of origin method = $55 + \frac{-2380}{556} = 55 - 4.28 = 50.72$ %
(c) Change of scale method = $\frac{2820}{556} \times 10 = 50.72$ %
(d) Change of origin and scale method = $55 + \frac{-238}{556} \times 10 = 55 - 4.28 = 50.72$ %
Median
First we find the median class: $\frac{\sum f_i}{2} = \frac{556}{2} = 278$
278 falls under the cumulative frequency 384, so 50-60 is the median class.
So, $Md = 50 + \frac{\frac{556}{2} - 269}{115} \times 10 = 50 + \frac{90}{115} = 50.78$ %

Mode
The highest frequency is 115, so the modal class is 50-60.
$Mo = 50 + \frac{115 - 107}{2 \times 115 - 107 - 91} \times 10 = 50 + \frac{80}{32} = 50 + 2.5 = 52.5$ %

Geometric Mean: $GM = \text{Antilog}\left(\frac{1}{\sum f_i}\sum f_i \log x_i\right)$
Harmonic Mean: $HM = \frac{\sum f_i}{\sum \frac{f_i}{x_i}}$
xi log xi fi fi log xi fi/xi
5 0.70 2 1.40 0.40
15 1.18 20 23.52 1.33
25 1.40 50 69.90 2.00
35 1.54 90 138.97 2.57
45 1.65 107 176.89 2.38
55 1.74 115 200.14 2.09
65 1.81 91 164.98 1.40
75 1.88 53 99.38 0.71
85 1.93 20 38.59 0.24
95 1.98 8 15.82 0.08
Total 556 929.58 13.20
$GM = \text{Antilog}\left(\frac{929.58}{556}\right) = \text{Antilog}(1.67) = 46.98$ %
$HM = \frac{556}{13.20} = 42.12$ %

Quartiles, Deciles and Percentiles
Third quartile, sixth decile and 20th percentile.
Third Quartile: First we determine the quartile class: $\frac{3 \times 556}{4} = 417$.
417 falls in the 60-70 class (cumulative frequency 475). So the third quartile is
$Q_3 = 60 + \frac{\frac{3 \times 556}{4} - 384}{91} \times 10 = 60 + \frac{330}{91} = 60 + 3.63 = 63.63$ %
Sixth Decile: Here $\frac{6 \times 556}{10} = 333.6$, so the decile class is 50-60.
$D_6 = 50 + \frac{\frac{6 \times 556}{10} - 269}{115} \times 10 = 50 + \frac{646}{115} = 50 + 5.62 = 55.62$ %
20th Percentile: Here $\frac{20 \times 556}{100} = 111.2$, so the percentile class is 30-40.
$P_{20} = 30 + \frac{\frac{20 \times 556}{100} - 72}{90} \times 10 = 30 + \frac{39.2}{90} \times 10 = 30 + 4.36 = 34.36$ %
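The grouped-data formulas for the mean, mode and the positional measures all follow one pattern and can be sketched in Python as below. The function names are hypothetical, the classes are given as (lower limit, upper limit, frequency) with exclusive limits, and every class is assumed to have a non-zero frequency.

```python
# (lower limit, upper limit, frequency) for the marks distribution of the example
marks = [(0, 10, 2), (10, 20, 20), (20, 30, 50), (30, 40, 90), (40, 50, 107),
         (50, 60, 115), (60, 70, 91), (70, 80, 53), (80, 90, 20), (90, 100, 8)]

def grouped_mean(classes):
    total = sum(f for _, _, f in classes)
    return sum(f * (lo + hi) / 2 for lo, hi, f in classes) / total

def grouped_quantile(classes, fraction):
    """L + ((fraction*N - F) / f) * h; fraction = 0.5 gives the median, 0.75 gives Q3."""
    total = sum(f for _, _, f in classes)
    target = fraction * total
    cumulative = 0                      # F: cumulative frequency before the current class
    for lower, upper, freq in classes:
        if cumulative + freq >= target:
            return lower + (target - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("fraction must lie between 0 and 1")

def grouped_mode(classes):
    """L + (f1 - f0) / (2*f1 - f0 - f2) * h for the class with the highest frequency."""
    idx = max(range(len(classes)), key=lambda i: classes[i][2])
    lower, upper, f1 = classes[idx]
    f0 = classes[idx - 1][2] if idx > 0 else 0
    f2 = classes[idx + 1][2] if idx + 1 < len(classes) else 0
    return lower + (f1 - f0) / (2 * f1 - f0 - f2) * (upper - lower)

print(round(grouped_mean(marks), 2))            # arithmetic mean
print(round(grouped_quantile(marks, 0.5), 2))   # median
print(round(grouped_mode(marks), 2))            # mode
print(round(grouped_quantile(marks, 0.75), 2))  # third quartile
print(round(grouped_quantile(marks, 0.6), 2))   # sixth decile
print(round(grouped_quantile(marks, 0.2), 2))   # 20th percentile
```

Using one function for the median, quartiles, deciles and percentiles reflects the fact that they differ only in the fraction of the total frequency that is located.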

Objective: Computation of the algebraic sum of the deviations of a set of values from their arithmetic mean.
Kinds of data: The following data relate to the frequency distribution of the number of workers according to their wages in a certain factory.

Wages (in Rs) Below 10 Below 20 Below 30 Below 40 Below 50 Below 60 Below 70 Below 80
No. of workers 15 35 60 84 96 127 198 250
Solution: The data are cumulative frequencies of the less-than type. First we convert them to ordinary class frequencies by subtracting each cumulative frequency from the next one, and then compute the arithmetic mean. The data become:

Wages (in Rs) 0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80
No. of workers 15 20 25 24 12 31 71 52
Then the A.M. = $\frac{\sum f_i x_i}{N} = \frac{12600}{250}$ = Rs. 50.40.

Then the algebraic sum of the deviations of the values from their arithmetic mean is $\sum f_i (x_i - \bar{X}) = \sum f_i x_i - N\bar{X} = 12600 - 250 \times 50.40 = 0$.
Objective: Computation of the pooled mean of the given data.
Kinds of data: The average of 5 numbers (first series) is 40 and the average of another 4 numbers (second series) is 50.
Solution: We know that the pooled mean = $\frac{n_1\bar{X}_1 + n_2\bar{X}_2}{n_1 + n_2} = \frac{5 \times 40 + 4 \times 50}{5 + 4} = \frac{400}{9} = 44.44$

Objective: Computation of average speed using harmonic mean.

Kinds of data: A train covered the first 5 km of its journey at a speed of 30 km/h and the next 15 km at a speed of 45 km/h. Show that the average speed of the train was 40 km/h.
Solution: Since different distances are covered at different speeds, we calculate the weighted average speed, using the harmonic mean formula with the distances as weights in place of frequencies.
X (km/h) 30 45
w (km) 5 15

Now the formula becomes average speed = $\frac{\sum w_i}{\sum \frac{w_i}{x_i}} = \frac{20}{\frac{5}{30} + \frac{15}{45}} = \frac{20}{0.5} = 40$ km/h

Kinds of data: A man goes from place A to place B at a speed of 10 km/h and comes back from B to A at a speed of 15 km/h. Find his average speed if the distance between A and B is x.
Solution: We know that average speed = $\frac{\text{total distance}}{\text{total time taken}} = \frac{x + x}{\frac{x}{10} + \frac{x}{15}} = \frac{300x}{25x} = 12$ km/h, which is the harmonic mean of 10 and 15.
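Both average-speed problems are instances of the weighted harmonic mean, which can be sketched as follows. The function name `weighted_harmonic_mean` is illustrative; in the second call the two legs are given equal weights because the same distance is covered each way.

```python
def weighted_harmonic_mean(values, weights):
    """Sum of weights divided by the sum of weight/value (total distance / total time)."""
    return sum(weights) / sum(w / v for v, w in zip(values, weights))

train_speed = weighted_harmonic_mean([30, 45], [5, 15])      # speeds in km/h, distances in km
round_trip_speed = weighted_harmonic_mean([10, 15], [1, 1])  # equal distance each way
print(round(train_speed, 2), round(round_trip_speed, 2))
```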

Objective: Relationship between AM, GM and HM.

Kinds of data: For two positive values of X, arithmetic mean = 25 and harmonic mean = 9. Find the geometric mean.
Solution: We know that, for two positive observations, A.M. × H.M. = $(G.M.)^2$. Putting in the values we get
25 × 9 = $(G.M.)^2$, hence G.M. = $\sqrt{225} = 15$
Here we can also see that AM ≥ GM ≥ HM.

Effect of change of origin and scale on Arithmetic Mean:

Change of Origin: If the origin is shifted to another value A, then the new measurement will be X − A and the new mean will be $\bar{X} - A$.
Change of Scale: If each observation X is divided by a value h, then the new measurement will be $\frac{X}{h}$ and the new mean will be $\frac{\bar{X}}{h}$.
Change of Origin and Scale both: If both are changed, the new observation will be $\frac{X - A}{h}$ and the new mean will be $\frac{\bar{X} - A}{h}$.

Kinds of data: The mean of 100 observations is 50. What will be the new mean if
(i) 6 is added to each observation, (ii) each observation is multiplied by 3,
(iii) 5 is subtracted from each observation and the result is divided by 4?

Solution: (i) Since 6 is added to each observation, the new mean will be 50 + 6 = 56.
(ii) If each observation is multiplied by 3, the new mean will be 3 × 50 = 150.
(iii) The new variable will be $U = \frac{X - 5}{4}$, hence the mean $\bar{U} = \frac{\bar{X} - 5}{4} = \frac{50 - 5}{4} = 11.25$
5. Measures of Dispersion

Measures of Dispersion: give us an idea about the scatter (variability) of the data.

Each measure is listed with its definition and its formulas for ungrouped and grouped data.

Range: It is defined as the difference between the maximum and minimum value of any data set.
Ungrouped data: Range = maximum value − minimum value
Grouped data: Range = upper limit of the last class interval − lower limit of the first class interval

Quartile Deviation: It is the difference between the third and first quartiles divided by 2.
Ungrouped and grouped data: $QD = \frac{Q_3 - Q_1}{2}$

Mean Deviation: It is defined as the average of the absolute deviations of all the observations from their mean.
Ungrouped data: $MD = \frac{\sum |X_i - \bar{X}|}{n}$
Grouped data: $MD = \frac{\sum f_i |X_i - \bar{X}|}{\sum f_i}$

Standard Deviation (best measure): It is defined as the square root of the average of the squared deviations of all the observations from their mean.
Ungrouped data: $SD = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n}}$
Grouped data: $SD = \sqrt{\frac{\sum f_i (X_i - \bar{X})^2}{\sum f_i}}$

Measures for comparison of two series
Coefficient of Dispersion (CD): used to compare the variability of two series; it can be based upon
(1) Range: $CD = \frac{\text{Max} - \text{Min}}{\text{Max} + \text{Min}}$
(2) Quartile Deviation: $CD = \frac{Q_3 - Q_1}{Q_3 + Q_1}$
(3) Mean Deviation: $CD = \frac{MD}{\text{average from which it is calculated}}$
(4) Standard Deviation: $CD = \frac{SD}{\text{Mean}}$
Coefficient of Variation (CV): 100 times the coefficient of dispersion based upon standard deviation (unitless measure).
$CV = \frac{SD}{\text{Mean}} \times 100$
Objective: Computation of Measures of Dispersion by all methods for Ungrouped data.
Kinds of Data: Suppose the data are 10, 7, 5, 9, 9, 10, 7, 3, 12
Solution:
(1) Range = 12 − 3 = 9
(2) Quartile Deviation: Arrange the observations in ascending order:
3, 5, 7, 7, 9, 9, 10, 10, 12
$Q_1 = \frac{1 \times (9+1)}{4}$th $= \frac{10}{4}$th observation = 2.5th observation
So Q1 = 5 + 0.5 × (7 − 5) = 6
Similarly $Q_3 = \frac{3 \times (9+1)}{4}$th $= \frac{30}{4}$th observation = 7.5th observation
So Q3 = 10 + 0.5 × (10 − 10) = 10
Now $QD = \frac{10 - 6}{2} = 2$
(3) Mean Deviation:
Mean = $\frac{10+7+5+9+9+10+7+3+12}{9} = 8$
$MD = \frac{1}{9}(|10-8| + |7-8| + |5-8| + |9-8| + |9-8| + |10-8| + |7-8| + |3-8| + |12-8|)$
$= \frac{1}{9}(2+1+3+1+1+2+1+5+4) = \frac{20}{9} = 2.22$
(4) Standard Deviation:
Mean = $\frac{10+7+5+9+9+10+7+3+12}{9} = 8$
$SD = \sqrt{\frac{(10-8)^2+(7-8)^2+(5-8)^2+(9-8)^2+(9-8)^2+(10-8)^2+(7-8)^2+(3-8)^2+(12-8)^2}{9}}$
$= \sqrt{\frac{4+1+9+1+1+4+1+25+16}{9}} = \sqrt{\frac{62}{9}} = 2.62$
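The four measures for ungrouped data can be computed together, as in the following Python sketch. The helper names are hypothetical; quartile positions use the (n + 1) rule with linear interpolation, and the standard deviation divides by n, as in the formulas of this chapter.

```python
import math

data = [10, 7, 5, 9, 9, 10, 7, 3, 12]

def value_at(ordered, position):
    """Value at a 1-based, possibly fractional, position in sorted data."""
    whole = int(position)
    fraction = position - whole
    value = ordered[whole - 1]
    if fraction:
        value += fraction * (ordered[whole] - value)
    return value

def dispersion(values):
    ordered = sorted(values)
    n = len(ordered)
    mean = sum(ordered) / n
    q1 = value_at(ordered, (n + 1) / 4)
    q3 = value_at(ordered, 3 * (n + 1) / 4)
    return {
        "range": ordered[-1] - ordered[0],
        "qd": (q3 - q1) / 2,
        "md": sum(abs(v - mean) for v in ordered) / n,
        "sd": math.sqrt(sum((v - mean) ** 2 for v in ordered) / n),  # divisor n, not n - 1
    }

measures = dispersion(data)
for name, value in measures.items():
    print(name, round(value, 2))
```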

Objective: Computation of Measures of Dispersion by all methods for Grouped data.

Kinds of data: The age distribution of 542 members is given below.

Age (in years) 20-30 30-40 40-50 50-60 60-70 70-80 80-90 Total
No. of members 3 61 132 153 140 51 2 542
Solution:
(1) Range = 90 − 20 = 70
(2) Quartile Deviation: first we find the first and third quartiles. The columns of the working table are: age class (in years); number of members Fi; cumulative frequency (CF); mid value Xi; FiXi; deviation (Xi − X̄); Fi|Xi − X̄|; (Xi − X̄)^2; Fi(Xi − X̄)^2.
Age Fi CF Xi FiXi (Xi − X̄) Fi|Xi − X̄| (Xi − X̄)^2 Fi(Xi − X̄)^2
20-30 3 3 25 75 -29.7 89.2 883.3 2649.8
30-40 61 64 35 2135 -19.7 1202.9 388.9 23721.6
40-50 132 196 45 5940 -9.7 1283.0 94.5 12471.1
50-60 153 349 55 8415 0.3 42.8 0.1 12.0
60-70 140 489 65 9100 10.3 1439.2 105.7 14795.0
70-80 51 540 75 3825 20.3 1034.3 411.3 20975.2
80-90 2 542 85 170 30.3 60.6 916.9 1833.8
Totals: ∑Fi = 542, ∑FiXi = 29660, ∑Fi|Xi − X̄| = 5152, ∑Fi(Xi − X̄)^2 = 76458.49
First we determine the first quartile class: $\frac{1 \times 542}{4} = 135.5$
135.5 falls in the 40-50 class (cumulative frequency 196). So the first quartile is
$Q_1 = 40 + \frac{\frac{1 \times 542}{4} - 64}{132} \times 10 = 40 + \frac{715}{132} = 40 + 5.42 = 45.42$ years
Similarly for Q3: $\frac{3 \times 542}{4} = 406.5$,
406.5 falls in the 60-70 class (cumulative frequency 489). So the third quartile is
$Q_3 = 60 + \frac{\frac{3 \times 542}{4} - 349}{140} \times 10 = 60 + \frac{575}{140} = 60 + 4.11 = 64.11$ years
So the quartile deviation is $\frac{64.11 - 45.42}{2} = \frac{18.69}{2} = 9.345$
(3) Mean Deviation: first calculate the mean
Mean = $\frac{29660}{542} = 54.72$
From the above table, Mean Deviation = $\frac{5152}{542} = 9.51$
(4) Standard Deviation = $\sqrt{\frac{76458.49}{542}} = \sqrt{141.07} = 11.88$
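The grouped-data measures of dispersion can be verified with the Python sketch below. The function names are hypothetical, classes are given as (lower limit, upper limit, frequency), and the code works at full precision, so the mean deviation comes out as about 9.50 rather than the 9.51 obtained from the one-decimal deviations in the table.

```python
import math

# (lower limit, upper limit, frequency) for the age distribution of the example
ages = [(20, 30, 3), (30, 40, 61), (40, 50, 132), (50, 60, 153),
        (60, 70, 140), (70, 80, 51), (80, 90, 2)]

def grouped_quartile(classes, i):
    """L + ((i*N/4 - F) / f) * h for the i-th quartile."""
    total = sum(f for _, _, f in classes)
    target = i * total / 4
    cumulative = 0
    for lower, upper, freq in classes:
        if cumulative + freq >= target:
            return lower + (target - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("i must be 1, 2 or 3")

def grouped_dispersion(classes):
    total = sum(f for _, _, f in classes)
    mids = [((lo + hi) / 2, f) for lo, hi, f in classes]
    mean = sum(m * f for m, f in mids) / total
    return {
        "range": classes[-1][1] - classes[0][0],
        "qd": (grouped_quartile(classes, 3) - grouped_quartile(classes, 1)) / 2,
        "mean": mean,
        "md": sum(f * abs(m - mean) for m, f in mids) / total,
        "sd": math.sqrt(sum(f * (m - mean) ** 2 for m, f in mids) / total),
    }

summary = grouped_dispersion(ages)
for name, value in summary.items():
    print(name, round(value, 2))
```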

Objective: Computation of variability of two series by coefficient of variation.

Kinds of data: Goals scored by two teams A and B in a football season were as follows.
No. of goals scored in a match 0 1 2 3 4
No. of matches, team A 27 9 8 5 4
No. of matches, team B 17 9 6 5 3

Solution: Here we have to calculate the CV of both teams separately.

Team A (columns: no. of goals xi; matches fA; fA·xi; (xi − x̄); (xi − x̄)^2; fA(xi − x̄)^2)
0 27 0 -1.05 1.10 29.77
1 9 9 -0.05 0.00 0.02
2 8 16 0.95 0.90 7.22
3 5 15 1.95 3.80 19.01
4 4 16 2.95 8.70 34.81
Totals: ∑fA = 53, ∑fA·xi = 56, ∑fA(xi − x̄)^2 = 90.83

Team B (columns: no. of goals yi; matches fB; fB·yi; (yi − ȳ); (yi − ȳ)^2; fB(yi − ȳ)^2)
0 17 0 -1.2 1.44 24.48
1 9 9 -0.2 0.04 0.36
2 6 12 0.8 0.64 3.84
3 5 15 1.8 3.24 16.2
4 3 12 2.8 7.84 23.52
Totals: ∑fB = 40, ∑fB·yi = 48, ∑fB(yi − ȳ)^2 = 68.4

First we calculate the mean and standard deviation of the first (A) series:
$\bar{X}_A = \frac{56}{53} = 1.05$, $\sigma_A = \sqrt{\frac{90.83}{53}} = \sqrt{1.714} = 1.31$, then $CV = \frac{\sigma_A}{\bar{X}_A} \times 100 = \frac{1.31}{1.05} \times 100 = 124.76$
Now we calculate the mean and standard deviation of the second (B) series:
$\bar{X}_B = \frac{48}{40} = 1.2$, $\sigma_B = \sqrt{\frac{68.4}{40}} = \sqrt{1.71} = 1.30$, then $CV = \frac{\sigma_B}{\bar{X}_B} \times 100 = \frac{1.30}{1.2} \times 100 = 108.33$
Comparing the coefficients of variation of series A and B, we find that series B, because of its lower CV value, is more consistent.
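The comparison can be checked with a short Python sketch. The function name `coefficient_of_variation` is hypothetical. Because the code does not round the means and standard deviations at intermediate steps, it gives about 123.9 for team A and about 109.0 for team B instead of 124.76 and 108.33; the conclusion is unchanged.

```python
import math

goals = [0, 1, 2, 3, 4]
matches_a = [27, 9, 8, 5, 4]
matches_b = [17, 9, 6, 5, 3]

def coefficient_of_variation(values, freqs):
    """CV = SD / mean * 100 for a discrete frequency distribution."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    variance = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs)) / n
    return math.sqrt(variance) / mean * 100

cv_a = coefficient_of_variation(goals, matches_a)
cv_b = coefficient_of_variation(goals, matches_b)
print(round(cv_a, 2), round(cv_b, 2))
print("More consistent team:", "B" if cv_b < cv_a else "A")
```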

6. Skewness and Kurtosis

Skewness and Kurtosis: Skewness gives us an idea about the symmetry of the curve, whereas kurtosis gives us an idea about the peakedness or flatness of the curve.

Measures of the shape of a curve

Skewness
Definition: Means lack of symmetry (Mean ≠ Median ≠ Mode).
Types:
(i) Zero skewed or normal curve
(ii) Positively skewed (Mean > Median or Mean > Mode)
(iii) Negatively skewed (Mean < Median or Mean < Mode)
Coefficients:
(1) Absolute measures
Sk = Mean − Median
Sk = Mean − Mode
Sk = Q3 + Q1 − 2 Median
(2) Relative measures
(i) Karl Pearson's coefficient of skewness
$Sk = \frac{\text{Mean} - \text{Mode}}{SD}$ or $Sk = \frac{3(\text{Mean} - \text{Median})}{SD}$
Range of coefficient is −3 to +3
(ii) Prof. Bowley's coefficient of skewness
$Sk = \frac{Q_3 + Q_1 - 2\,\text{Median}}{Q_3 - Q_1}$
Range of coefficient is −1 to +1
(3) Based on moments
$Sk = \frac{\sqrt{\beta_1}(\beta_2 + 3)}{2(5\beta_2 - 6\beta_1 - 9)}$, where $\beta_1 = \frac{\mu_3^2}{\mu_2^3}$ and $\beta_2 = \frac{\mu_4}{\mu_2^2}$

Kurtosis
Definition: Flatness or peakedness of the curve.
Coefficients: $\beta_2 = \frac{\mu_4}{\mu_2^2}$ or $\gamma_2 = \beta_2 - 3$
Types:
Leptokurtic (highly peaked) if $\beta_2 > 3$ or $\gamma_2 > 0$
Mesokurtic (normal curve) if $\beta_2 = 3$ or $\gamma_2 = 0$
Platykurtic (flatter than normal) if $\beta_2 < 3$ or $\gamma_2 < 0$

Note: Here $\mu_2$, $\mu_3$ and $\mu_4$ are central moments, i.e. moments about the mean, defined as $\mu_r = \frac{\sum f_i (X_i - \bar{X})^r}{\sum f_i}$
Moments: refer to the average of the deviations from the mean or some other value, raised to a certain power.
Moments about any arbitrary value A: $\mu_r' = \frac{\sum f_i (X_i - A)^r}{\sum f_i}$, where A is any arbitrary value.
Moments about the origin: $m_r = \frac{\sum f_i X_i^r}{\sum f_i}$
Moments about the arithmetic mean: $\mu_r = \frac{\sum f_i (X_i - \bar{X})^r}{\sum f_i}$, where $\bar{X}$ is the arithmetic mean.
Relationship between $\mu_r$ and $\mu_r'$: $\mu_r = \mu_r' - {}^{r}C_{1}\,\mu_1'\,\mu_{r-1}' + {}^{r}C_{2}\,(\mu_1')^2\,\mu_{r-2}' - \cdots + (-1)^r (\mu_1')^r$
In particular, $\mu_2 = \mu_2' - (\mu_1')^2$
$\mu_3 = \mu_3' - 3\mu_1'\mu_2' + 2(\mu_1')^3$
$\mu_4 = \mu_4' - 4\mu_1'\mu_3' + 6(\mu_1')^2\mu_2' - 3(\mu_1')^4$

Important: (i) $\mu_0 = \mu_0' = 1$ (ii) The first central moment is always zero. (iii) $\mu_2 = SD^2$ = variance

Objective: Computation of mean and variance when moments about an arbitrary value are given.
Kinds of data: The first three moments of a distribution about the value 2 of a variable are 1, 16 and -40.
Solution: Here the arbitrary value A = 2 and the moments are $\mu_1' = 1$, $\mu_2' = 16$ and $\mu_3' = -40$.
We know that $\mu_1' = \frac{\sum f_i (x_i - 2)}{\sum f_i} = 1$, hence $\frac{\sum f_i x_i}{\sum f_i} - 2\frac{\sum f_i}{\sum f_i} = 1$, which gives $\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = 1 + 2 = 3$
Hence the mean is 3.
We know that $\mu_2 = \mu_2' - (\mu_1')^2$. Putting in the values we get
$\mu_2 = \mu_2' - (\mu_1')^2 = 16 - 1 \times 1 = 15$
Hence the variance is 15.

Objective: Computation of first four central moments, $\beta_1$ and $\beta_2$.

Kinds of data: The distribution of the data is given below.
x 0 1 2 3 4 5 6 7 8
f 1 8 28 56 70 56 28 8 1
Solution:
xi fi fi·xi (xi − x̄) fi(xi − x̄) fi(xi − x̄)^2 fi(xi − x̄)^3 fi(xi − x̄)^4
0 1 0 -4 -4 16 -64 256
1 8 8 -3 -24 72 -216 648
2 28 56 -2 -56 112 -224 448
3 56 168 -1 -56 56 -56 56
4 70 280 0 0 0 0 0
5 56 280 1 56 56 56 56
6 28 168 2 56 112 224 448
7 8 56 3 24 72 216 648
8 1 8 4 4 16 64 256
Total 256 1024 0 0 512 0 2816

First we calculate the mean of the series:
$\bar{X} = \frac{1024}{256} = 4$
Then $\mu_1 = \frac{\sum f_i (X_i - \bar{X})}{\sum f_i} = 0$, $\mu_2 = \frac{\sum f_i (X_i - \bar{X})^2}{\sum f_i} = \frac{512}{256} = 2$,
$\mu_3 = \frac{\sum f_i (X_i - \bar{X})^3}{\sum f_i} = \frac{0}{256} = 0$, $\mu_4 = \frac{\sum f_i (X_i - \bar{X})^4}{\sum f_i} = \frac{2816}{256} = 11$
Now we calculate $\beta_1$ and $\beta_2$:
$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{0^2}{2^3} = 0$, $\beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{11}{2^2} = 2.75$. Since $\beta_2 < 3$, the curve is platykurtic.
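The central moments and the coefficients β1 and β2 of this example can be computed with a few lines of Python. The function name `central_moment` is hypothetical; it implements the definition of the r-th moment about the mean given in this chapter.

```python
values = [0, 1, 2, 3, 4, 5, 6, 7, 8]
freqs = [1, 8, 28, 56, 70, 56, 28, 8, 1]

def central_moment(values, freqs, r):
    """r-th moment about the arithmetic mean of a frequency distribution."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    return sum(f * (x - mean) ** r for x, f in zip(values, freqs)) / n

mu1, mu2, mu3, mu4 = (central_moment(values, freqs, r) for r in range(1, 5))
beta1 = mu3 ** 2 / mu2 ** 3
beta2 = mu4 / mu2 ** 2
print(mu1, mu2, mu3, mu4, beta1, beta2)
print("platykurtic" if beta2 < 3 else "leptokurtic" if beta2 > 3 else "mesokurtic")
```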

Objective: Calculate the appropriate measure of skewness from the following cumulative frequency distribution.
Kinds of data: The distribution of the data is given below.
Age (under years) 20 30 40 50 60 70
Number of persons 12 29 48 75 94 106
Solution: Here the upper limits along with the cumulative frequencies are given in the data.
Now we find the lower limit and the frequency of each class.
Age (years) Cumulative frequency Number of persons (frequency)
Below 20 12 12
20-30 29 29 − 12 = 17
30-40 48 48 − 29 = 19
40-50 75 75 − 48 = 27
50-60 94 94 − 75 = 19
60-70 106 106 − 94 = 12
Total N = 106

Since the distribution is open-ended, the mean cannot be calculated, so none of the methods that require the mean can be used. Bowley's method, which is based on quartiles, can be used instead.
First we determine the first quartile class: $\frac{1 \times 106}{4} = 26.5$
26.5 falls in the 20-30 class (cumulative frequency 29). So the first quartile is
$Q_1 = 20 + \frac{\frac{1 \times 106}{4} - 12}{17} \times 10 = 20 + \frac{14.5}{17} \times 10 = 20 + 8.53 = 28.53$ years
Similarly for the median, $Q_2$: $\frac{2 \times 106}{4} = 53$,
53 falls in the 40-50 class (cumulative frequency 75). So the second quartile is
$Q_2 = 40 + \frac{\frac{2 \times 106}{4} - 48}{27} \times 10 = 40 + \frac{5}{27} \times 10 = 40 + 1.85 = 41.85$ years
Similarly for $Q_3$: $\frac{3 \times 106}{4} = 79.5$,
79.5 falls in the 50-60 class (cumulative frequency 94). So the third quartile is
$Q_3 = 50 + \frac{\frac{3 \times 106}{4} - 75}{19} \times 10 = 50 + \frac{4.5}{19} \times 10 = 50 + 2.37 = 52.37$ years
Prof. Bowley's coefficient of skewness: $Sk = \frac{Q_3 + Q_1 - 2\,\text{Median}}{Q_3 - Q_1}$
Putting the values in the formula we get
$Sk = \frac{52.37 + 28.53 - 2 \times 41.85}{52.37 - 28.53} = \frac{-2.8}{23.84} = -0.117$
Hence the coefficient of skewness is -0.117.
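Bowley's coefficient can be computed directly from the less-than cumulative frequencies, as in the sketch below. The function names are hypothetical and a common class width of 10 is assumed, as in the example. Working with unrounded quartiles gives about -0.118, which agrees with the hand calculation to two decimals.

```python
upper_limits = [20, 30, 40, 50, 60, 70]
cumulative = [12, 29, 48, 75, 94, 106]

def quartile_from_cumulative(upper_limits, cumulative, i, width=10):
    """i-th quartile from less-than cumulative frequencies (equal class width assumed)."""
    target = i * cumulative[-1] / 4
    for idx, (upper, cum) in enumerate(zip(upper_limits, cumulative)):
        if cum >= target:
            previous = cumulative[idx - 1] if idx else 0   # F: cumulative frequency before the class
            freq = cum - previous                          # f: frequency of the quartile class
            return (upper - width) + (target - previous) / freq * width
    raise ValueError("i must be 1, 2 or 3")

q1, q2, q3 = (quartile_from_cumulative(upper_limits, cumulative, i) for i in (1, 2, 3))
bowley = (q3 + q1 - 2 * q2) / (q3 - q1)
print(round(q1, 2), round(q2, 2), round(q3, 2), round(bowley, 3))
```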

7. Probability

Probability: is defined as the ratio of the number of cases favourable to an event to the total number of all possible outcomes.

Important terms and Laws of Probability

Terms and definitions (example experiment: tossing of a single coin)
Trial: An experiment is called a trial. Example: tossing of a single coin.
Event: Outcomes are known as events. Example: Head and Tail.
Exhaustive events: The total number of all possible outcomes of any experiment. Example: {H, T}.
Mutually exclusive events: If, of two or more events, only one can happen. Example: in a single throw either H or T will come.
Equally likely: If there is no reason to prefer one outcome to the other. Example: the chances of occurrence of H and of T are each 1/2.
Independent event: If the happening of an event is not affected by the happening of the other. Example: in tossing two coins, the occurrence of H or T on each coin is independent of the other.
Dependent event: If the happening of an event is affected by the happening of the other.

Laws of Probability
Law of Addition: If E is an event which includes the happening of any one of the n mutually exclusive events E1, E2, …, En, then P(E) = P(E1) + P(E2) + … + P(En)
Law of Multiplication: If E is the event that all of the n independent events E1, E2, …, En happen, then P(E) = P(E1) × P(E2) × … × P(En)
Law of Total Probability: For any two events A and B the probability of happening of either A or B is given by
P(A∪B) = P(A) + P(B) − P(A∩B)
Particular case: Probability that at least one event happens = 1 − P(all the events fail to happen)

Note:
• The sum of the probabilities of all possible outcomes is always equal to 1. If p is the probability of success of any event, then q = 1 − p is the probability of failure of that event.
• Sometimes the probability is based on combination (selection). The number of ways of selecting r things out of n things is given by $\binom{n}{r} = \frac{n!}{(n-r)!\,r!}$, where n! = n(n−1)(n−2)…3·2·1,
e.g. 5! = 5 × 4 × 3 × 2 × 1 = 120
1. Find the chance of throwing at least one ace in a single throw with two dice.
Solution: The probability of getting an ace in a single throw of one die is p = $\frac{\text{no. of favourable cases}}{\text{all possible outcomes}} = \frac{1}{6}$
So the probability of not getting an ace in a single throw is q = $1 - \frac{1}{6} = \frac{5}{6}$
We know that the probability that at least one event happens = 1 − P(all the events fail to happen)
$= 1 - \frac{5}{6} \times \frac{5}{6} = 1 - \frac{25}{36} = \frac{11}{36}$
2. From a bag containing 4 white and 5 black balls, 3 are drawn at random. What are the odds against these being all black?
Solution: The probability of selecting 3 black balls = $\frac{\text{no. of favourable cases for selection of 3 black balls}}{\text{total no. of all possible cases}}$
$= \frac{\binom{5}{3}}{\binom{9}{3}} = \frac{\frac{5!}{3!\,2!}}{\frac{9!}{3!\,6!}} = \frac{5}{42}$
And the probability of not selecting 3 black balls = $1 - \frac{5}{42} = \frac{37}{42}$
So the odds against these being all black are 37 : 5.

3. What is the chance of drawing a pie from a purse, one compartment of which contains 3 paise and 2 pies and the other 2 paise and 1 pie?
Solution: Here we know that the probability of selecting each compartment is $\frac{1}{2}$.
So the probability of drawing a pie from the purse = $\frac{1}{2} \times \frac{2}{5} + \frac{1}{2} \times \frac{1}{3} = \frac{1}{5} + \frac{1}{6} = \frac{11}{30}$

4. A bag contains 4 red balls and 3 blue balls. Two drawings of 2 balls are made. Find the chance that the first drawing gives 2 red balls and the second drawing 2 blue balls,
(a) if the balls are returned to the bag after the first draw;
(b) if the balls are not returned.
Solution: (a) In the first case, if the balls are returned to the bag after the first draw, the probability
= (no. of favourable cases for selection of 2 red balls / total no. of cases for selection of 2 balls out of 7) × (no. of favourable cases for selection of 2 blue balls / total no. of cases for selection of 2 balls out of 7)
$= \frac{\binom{4}{2}}{\binom{7}{2}} \times \frac{\binom{3}{2}}{\binom{7}{2}} = \frac{\frac{4!}{2!\,2!}}{\frac{7!}{5!\,2!}} \times \frac{\frac{3!}{2!\,1!}}{\frac{7!}{5!\,2!}} = \frac{6}{21} \times \frac{3}{21} = \frac{2}{49}$

(b) In the second case, if the balls are not returned to the bag after the first draw, the probability
= (no. of favourable cases for selection of 2 red balls / total no. of cases for selection of 2 balls out of 7) × (no. of favourable cases for selection of 2 blue balls / total no. of cases for selection of 2 balls from the remaining 5 balls)
$= \frac{\binom{4}{2}}{\binom{7}{2}} \times \frac{\binom{3}{2}}{\binom{5}{2}} = \frac{\frac{4!}{2!\,2!}}{\frac{7!}{5!\,2!}} \times \frac{\frac{3!}{2!\,1!}}{\frac{5!}{3!\,2!}} = \frac{6}{21} \times \frac{3}{10} = \frac{3}{35}$
5. If three coins are tossed, what is the chance of getting (a) exactly two heads, (b) at least two heads, (c) at most two heads? (Answers: 3/8, 1/2, 7/8)
Solution:
(a) p = p{H,H,T} + p{H,T,H} + p{T,H,H}
So, $P = \frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} + \frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} + \frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} = \frac{1}{8} + \frac{1}{8} + \frac{1}{8} = \frac{3}{8}$

(b) p = p{H,H,T} + p{H,T,H} + p{T,H,H} + p{H,H,H}
So, $P = \frac{1}{8} + \frac{1}{8} + \frac{1}{8} + \frac{1}{8} = \frac{4}{8} = \frac{1}{2}$

(c) p = p{H,H,T} + p{H,T,H} + p{T,H,H} + p{T,T,H} + p{T,H,T} + p{H,T,T} + p{T,T,T}
$P = \frac{1}{8} + \frac{1}{8} + \frac{1}{8} + \frac{1}{8} + \frac{1}{8} + \frac{1}{8} + \frac{1}{8} = \frac{7}{8}$

6. Four persons are chosen at random from a group consisting of 3 men, 2 women and 4 children. Show that the chance that exactly 2 of them will be children is 10/21.
Solution: Here p = (no. of cases favourable for selecting 2 children out of 4 and the remaining 2 from the 5 men and women) / (total no. of all possible cases for selecting 4 persons out of 9)
So, $P = \frac{\binom{4}{2}\binom{5}{2}}{\binom{9}{4}} = \frac{\frac{4!}{2!\,2!} \times \frac{5!}{2!\,3!}}{\frac{9!}{4!\,5!}} = \frac{60}{126} = \frac{30}{63} = \frac{10}{21}$

7. A card is drawn from a well shuffled pack of playing cards. What is the probability that it is either a spade or an ace?
Solution: Let A be the event of selecting a spade, B the event of selecting an ace and A∩B the event of selecting the ace of spades. Then by the law of total probability we can write
P(A∪B) = P(A) + P(B) − P(A∩B)
$= \frac{13}{52} + \frac{4}{52} - \frac{1}{52} = \frac{16}{52} = \frac{4}{13}$
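The answers to the seven problems above can be verified with exact fractions in Python. This is only a checking sketch; the variable names are illustrative, and `math.comb` (available from Python 3.8) is used for the combinations.

```python
from fractions import Fraction
from itertools import product
from math import comb

# 1. At least one ace with two dice
at_least_one_ace = 1 - Fraction(5, 6) ** 2

# 2. Odds against drawing 3 black balls from 4 white and 5 black
all_black = Fraction(comb(5, 3), comb(9, 3))
odds_against_all_black = (1 - all_black) / all_black       # 37/5, i.e. 37 : 5

# 3. Drawing a pie from a purse with two compartments
pie = Fraction(1, 2) * Fraction(2, 5) + Fraction(1, 2) * Fraction(1, 3)

# 4. Two red balls, then two blue balls
with_replacement = Fraction(comb(4, 2), comb(7, 2)) * Fraction(comb(3, 2), comb(7, 2))
without_replacement = Fraction(comb(4, 2), comb(7, 2)) * Fraction(comb(3, 2), comb(5, 2))

# 5. Three coins: enumerate the 8 equally likely outcomes
tosses = list(product("HT", repeat=3))
exactly_two = Fraction(sum(t.count("H") == 2 for t in tosses), len(tosses))
at_least_two = Fraction(sum(t.count("H") >= 2 for t in tosses), len(tosses))
at_most_two = Fraction(sum(t.count("H") <= 2 for t in tosses), len(tosses))

# 6. Exactly 2 children among 4 persons chosen from 3 men, 2 women and 4 children
two_children = Fraction(comb(4, 2) * comb(5, 2), comb(9, 4))

# 7. A spade or an ace
spade_or_ace = Fraction(13, 52) + Fraction(4, 52) - Fraction(1, 52)

print(at_least_one_ace, odds_against_all_black, pie, with_replacement,
      without_replacement, exactly_two, at_least_two, at_most_two,
      two_children, spade_or_ace)
```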

8. Discrete and Continuous Distribution
Distribution: The distribution of a statistical data set (or a population) is a listing or function showing all
the possible values (or intervals) of the data and how often they occur.

Distributions are of two kinds:
Discrete (for a discontinuous variable): Binomial distribution, Poisson distribution
Continuous (for a continuous variable): Normal distribution

Binomial Dist.
Prob. mass function is
$P(X = x) = \binom{n}{x} p^x q^{n-x}$, x = 0, 1, 2, …, n; 0 otherwise,
where n and p are the parameters of the dist. and q = 1 − p.
Here
• X represents the number of successes of the event.
• The probability of x is given by $p(x) = \binom{n}{x} p^x q^{n-x}$
• The frequency of x in N sets, each of n trials, = N·p(x)
• Mean = np
• Variance = npq
Conditions for binomial dist.
• Each trial results in two mutually disjoint outcomes, i.e. success and failure
• The number of trials n is finite
• The trials are independent of each other
• The probability of success is constant for each trial
• Examples of binomial dist.: tossing of a coin, throwing of a die, etc.
Fitting of Binomial Dist.
• Calculate the mean $\bar{X}$
• Equate $\bar{X} = np$, so $p = \bar{X}/n$
• Expand the binomial $N(q+p)^n = N\left[q^n + \binom{n}{1} q^{n-1} p + \dots + p^n\right]$
• Or use the multiplying factor $\dfrac{p(r)}{p(r-1)} = \dfrac{n-r+1}{r} \cdot \dfrac{p}{q}$
• Apply chi square to test the goodness of fit.

Poisson Dist.
Prob. mass function is
$P(X = x) = \dfrac{e^{-\lambda}\lambda^x}{x!}$, x = 0, 1, 2, …; 0 otherwise,
where λ > 0 is the parameter of the dist.
Here
• X represents the number of occurrences of the rare event, e.g. 0, 1, 2, …
• The probability of x is given by $p(x) = \dfrac{e^{-\lambda}\lambda^x}{x!}$
• The frequency of x out of N cases = N·p(x)
• Mean = Variance = λ
Condition:
The Poisson distribution arises for events which do not occur as outcomes of a definite number
of trials but occur at random points of time and space, and where our interest is only in
the number of occurrences of the event, e.g. the number of deaths from a disease, or the number of
faulty blades in a packet of 100.
Fitting of Poisson Dist.
• Calculate the mean $\bar{X}$
• Equate $\bar{X} = \lambda$
• Find $e^{-\lambda}$, then calculate $\dfrac{e^{-\lambda}\lambda^x}{x!}$ for each x
• Or the recurrence formula $p(x+1) = \dfrac{\lambda}{x+1}\,p(x)$ can also be used
• Apply chi square to test the goodness of fit.

Normal Dist.
Limiting form of the binomial distribution when n is large (n → ∞) and neither p nor q is very small.
Probability density function is
$f(x; \mu, \sigma) = \dfrac{1}{\sigma\sqrt{2\pi}} \exp\left[-\dfrac{1}{2}\left(\dfrac{x-\mu}{\sigma}\right)^2\right]$, −∞ < x < ∞, −∞ < µ < ∞, σ > 0
Here µ and σ² are the parameters of the dist.
Mean = µ
Variance = σ²
Properties of Normal Distribution:
• Curve is bell shaped and symmetrical about the mean
• Mean = median = mode
• As x moves away from the mean, f(x) decreases rapidly
• β1 = 0 and β2 = 3
• f(x) can never be negative
• X axis is an asymptote to the curve
• Mean deviation about mean = $\sigma\sqrt{2/\pi} \approx \frac{4}{5}\sigma$
• QD : MD : SD = 10 : 12 : 15
• Area property
P(µ − σ < X < µ + σ) = 0.6826
P(µ − 2σ < X < µ + 2σ) = 0.9544
P(µ − 3σ < X < µ + 3σ) = 0.9973
Fitting of Normal Dist.
• Calculate the mean µ and standard deviation σ from the given data.
• Then calculate the standard normal variate $z_i = \dfrac{x_i - \mu}{\sigma}$ corresponding to the lower limit of each
class interval. The areas under the normal curve to the left of the ordinate at z = z_i, say φ(z_i), are computed from the tables.
• The areas for successive class intervals are obtained by subtraction: φ(z_{i+1}) − φ(z_i), i = 1, 2, …
• By multiplying these areas by N we get the expected normal frequencies.
• Apply chi square to test the goodness of fit.

Objective: Fitting of binomial distribution
Kinds of data: The following data relate to the frequency distribution of the number of boys among the first seven
children in families of Swedish ministers.
No. of boys/family    0    1    2    3    4    5    6    7   Total
No. of families       6   57  206  362  365  256   69   13   1334
Solution:
Mean = 4800/1334 = 3.6. Equating np = 3.6 with n = 7, we get p = 3.6/7 = 0.51.

Then q = 1 − 0.51 = 0.49, and the expected frequencies N·P(x), with N = 1334, are calculated in the table.
Here C(7, x) denotes the binomial coefficient.

No. of boys/   No. of        fi·xi   P(x)                         Expected frequency   (Oi − Ei)²/Ei
family xi      families fi                                        F(x) = N·P(x)
0                   6            0   C(7,0)(0.51)^0(0.49)^7             9.05               1.03
1                  57           57   C(7,1)(0.51)^1(0.49)^6            65.9                1.21
2                 206          412   C(7,2)(0.51)^2(0.49)^5           205.8                0.00
3                 362         1086   C(7,3)(0.51)^3(0.49)^4           357.0                0.07
4                 365         1460   C(7,4)(0.51)^4(0.49)^3           371.6                0.12
5                 256         1280   C(7,5)(0.51)^5(0.49)^2           232.1                2.47
6                  69          414   C(7,6)(0.51)^6(0.49)^1            80.5                1.65
7                  13           91   C(7,7)(0.51)^7(0.49)^0            12.0                0.09
Total            1334         4800                                   1334              χ² = 6.62

Then we apply the χ² test for goodness of fit. The calculated χ² value, 6.62, is less than the
tabulated χ² value at 5 degrees of freedom (11.07 at the 5% level of significance), with p̂ = 0.51. It seems that the
binomial distribution fits the number of boys in the first seven children in families of Swedish ministers well.
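The fitting steps above (estimate p from the mean, compute N·P(x), then sum (O − E)²/E) can be sketched in Python. This is an illustrative sketch only: the function name `fit_binomial` is an assumption, and it uses the unrounded estimate of p, so its expected frequencies differ slightly from a hand calculation with p rounded to 0.51.

```python
from math import comb

def fit_binomial(observed, n_trials):
    """Fit a binomial distribution; observed[x] is the frequency of x successes, x = 0..n_trials."""
    total = sum(observed)
    mean = sum(x * f for x, f in enumerate(observed)) / total
    p = mean / n_trials              # equate the sample mean to n*p
    q = 1 - p
    expected = [total * comb(n_trials, x) * p**x * q**(n_trials - x)
                for x in range(n_trials + 1)]
    chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return p, expected, chi_square

boys = [6, 57, 206, 362, 365, 256, 69, 13]   # families with 0..7 boys
p_hat, expected, chi_square = fit_binomial(boys, 7)
print(round(p_hat, 3))                        # estimated probability of a boy
print([round(e, 1) for e in expected])        # expected frequencies
print(round(chi_square, 2))                   # goodness-of-fit statistic
```

The statistic is then compared with the tabulated χ² value for the chosen degrees of freedom, as in the worked solution.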

Problem: Ten coins are thrown simultaneously. Find the probability of getting at least 7 heads.
Solution: In tossing a coin, P(H) = P(T) = 1/2.
The probability of getting x heads in a random throw of 10 coins is
$P(X = x) = \binom{10}{x}\left(\tfrac12\right)^x\left(\tfrac12\right)^{10-x} = \binom{10}{x}\left(\tfrac12\right)^{10}$; x = 0, 1, 2, …, 10
The probability of getting at least seven heads is given by
$P(X \ge 7) = P(7) + P(8) + P(9) + P(10)$
$= \left(\tfrac12\right)^{10}\left[\binom{10}{7} + \binom{10}{8} + \binom{10}{9} + \binom{10}{10}\right] = \dfrac{120 + 45 + 10 + 1}{1024} = \dfrac{176}{1024} = \dfrac{11}{64}$

Objective: Fitting of Poisson distribution

Kinds of data: The following data relate to the number of α –particles emitted by a film of polonium in
2608 successive intervals of one-eighth of a minute.

No. of α-particles    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14
Observed frequency   57  203  383  525  532  408  273  139   45   27   10    4    0    1    1

Solution: Calculate the mean of the observed data and equate it to the theoretical mean λ.
Mean = 10097/2608 = 3.87
So λ̂ = 3.87.
To fit the distribution we find the value of $e^{-\lambda}$.
By putting x = 0 in $P[X = x] = e^{-\lambda}\lambda^x/x!$
we get $P[X = 0] = e^{-3.87}$
log10 P(0) = −3.87 log10 e = −3.87 × 0.4343 = −1.68
Then P(0) = antilog(−1.68) = 0.0208
Then we find the expected frequencies from the probability mass function, as shown in the table.

No. of α-   Observed       fi·Xi   Probability P(X)                         Expected    (Oi − Ei)²/Ei
particles   frequency fi                                                    frequency
 0               57            0   P(0) = e^(−3.87) = 0.0208                   54.3       0.1331
 1              203          203   P(1) = e^(−3.87) × 3.87^1/1! = 0.0806      210.3       0.2514
 2              383          766   P(2) = e^(−3.87) × 3.87^2/2! = 0.1560      407.0       1.4194
 3              525         1575   P(3) = 0.2014                              525.3       0.0002
 4              532         2128   P(4) = 0.1949                              508.4       1.0937
 5              408         2040   P(5) = 0.1509                              393.7       0.5214
 6              273         1638   P(6) = 0.0974                              254.0       1.4180
 7              139          973   P(7) = 0.0539                              140.5       0.0159
 8               45          360   P(8) = 0.0261                               68.0       7.7744
 9               27          243   P(9) = 0.0112                               29.2       0.1728
10               10          100   P(10) = 0.0043                              11.3       0.1547
11                4           44   P(11) = 0.0015                               4.0       0.0069
12                0            0   P(12) = 0.0005                               1.3
13                1           13   P(13) = 0.0001                               0.4
14                1           14   P(14) = 0.0000                               0.1
Total          2608        10097                                             2608    χ² = 12.9616

The classes 11 to 14 are pooled for the χ² calculation (observed 6, expected 5.8), which leaves 12 classes.
Then we apply the χ² test for goodness of fit. The observed and expected frequencies are given together in the
table above.
The calculated χ² value, 12.96, is less than the tabulated χ² value at 10 degrees of freedom, 18.307, with
λ̂ = 3.87. It seems that the Poisson distribution fits the data well.
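The Poisson fit can be reproduced with a short Python sketch. The names `fit_poisson` and `chi_square` are illustrative, and pooling the last four classes (11 to 14) into one is an assumption made here so that the small expected frequencies are combined; the unrounded mean is used, so the figures may differ slightly from a hand calculation with λ = 3.87.

```python
from math import exp, factorial

def fit_poisson(observed):
    """observed[x] is the frequency of the count x; returns (lambda estimate, expected frequencies)."""
    total = sum(observed)
    lam = sum(x * f for x, f in enumerate(observed)) / total   # sample mean estimates lambda
    expected = [total * exp(-lam) * lam**x / factorial(x) for x in range(len(observed))]
    return lam, expected

def chi_square(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

alpha_counts = [57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 10, 4, 0, 1, 1]
lam, expected = fit_poisson(alpha_counts)

# Pool classes 11..14 into one class (assumption: combine small expected frequencies).
obs_pooled = alpha_counts[:11] + [sum(alpha_counts[11:])]
exp_pooled = expected[:11] + [sum(expected[11:])]
chi2 = chi_square(obs_pooled, exp_pooled)

print(round(lam, 2))
print(round(chi2, 2))
```

With 12 classes after pooling and one estimated parameter, the statistic is referred to the χ² table at 12 − 1 − 1 = 10 degrees of freedom.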

Problem: In a book of 520 pages, 390 typographical errors occur. Assuming the Poisson law for the number of
errors per page, find the probability that a random sample of 5 pages will contain no error.
Solution: First we find the average number of typographical errors per page, λ = 390/520 = 0.75.
By using the Poisson probability law, $P(X = x) = \dfrac{e^{-\lambda}\lambda^x}{x!} = \dfrac{e^{-0.75}(0.75)^x}{x!}$
So the probability that a random sample of 5 pages will contain no error is
$[P(X = 0)]^5 = (e^{-0.75})^5 = e^{-3.75} \approx 0.0235$

Objective: Fitting of normal distribution


Kinds of data: The following data relate to the frequency distribution of the intelligence scores of a population
of 1000 persons.
Class intervals   60-65  65-70  70-75  75-80  80-85  85-90  90-95  95-100  Total
Observed freq.        3     21    150    335    326    135     26       4   1000

Solution: Set up null hypothesis H0: There is no significant difference between observed frequency and
expected frequency and H1: There is significant difference between these two.
Class        Fi      Xi      Fi·Xi     Lower class   Z = (X−μ)/σ   φ(z)     ∆φ(z) =               Expected   Rounded
intervals                              limit                                φ(z_{i+1}) − φ(z_i)   freq.      exp. freq.
Below 60                               −∞            −∞            0        0                     0.12       0
60-65          3     62.5     187.5    60            −3.66         0        0.003                 2.9        3
65-70         21     67.5    1417.5    65            −2.75         0.003    0.031                 31         31
70-75        150     72.5   10875      70            −1.83         0.034    0.148                 147.8      148
75-80        335     77.5   25962.5    75            −0.91         0.182    0.322                 322.1      322
80-85        326     82.5   26895      80             0.01         0.504    0.319                 319.3      319
85-90        135     87.5   11812.5    85             0.93         0.823    0.144                 144.1      144
90-95         26     92.5    2405      90             1.85         0.967    0.03                  29.8       30
95-100         4     97.5     390      95             2.76         0.997    0.003                 2.7        3
100 & over                             ∞              ∞            1
Total       1000            79945

Calculate the mean and standard deviation of the observed data: 79.945 and 5.445 respectively.

(i) Find the value of Z = (X − μ)/σ at the lower limit of each class.
(ii) Find the area under the normal curve to the left of each value of Z, φ(z).
(iii) Then determine ∆φ(z) by taking successive differences.
(iv) Finally, multiply ∆φ(z) by N to obtain the expected frequency.
(v) Apply the χ² test for goodness of fit.
(vi) Interpret the result about the fit of the normal distribution: the calculated χ² value, 4.614, is less than the
tabulated χ² value at 3 d.f. (7.815 at the 5% level of significance).

It seems that the normal distribution fits the data well.
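The steps for fitting a normal distribution can be sketched in Python as follows. This is a minimal sketch: the helper names `normal_cdf` and `fit_normal` are illustrative, the area φ(z) is computed with the error function instead of being read from tables, and the standard deviation is taken with divisor N.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def fit_normal(lower_limits, width, freqs):
    """Return mean, s.d. and expected class frequencies for grouped data with equal class width."""
    n = sum(freqs)
    mids = [lo + width / 2 for lo in lower_limits]
    mean = sum(f * x for f, x in zip(freqs, mids)) / n
    sd = sqrt(sum(f * (x - mean) ** 2 for f, x in zip(freqs, mids)) / n)
    bounds = lower_limits + [lower_limits[-1] + width]      # class boundaries
    areas = [normal_cdf((b - mean) / sd) for b in bounds]   # phi(z) at each boundary
    expected = [n * (areas[i + 1] - areas[i]) for i in range(len(freqs))]
    return mean, sd, expected

lower = [60, 65, 70, 75, 80, 85, 90, 95]
freqs = [3, 21, 150, 335, 326, 135, 26, 4]
mean, sd, expected = fit_normal(lower, 5, freqs)
print(round(mean, 3), round(sd, 3))
print([round(e, 1) for e in expected])
```

The expected frequencies for the open classes below 60 and above 100 are not included in the list; they are the small remaining tail areas multiplied by N.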

Problem : X is a normal variate with mean 30 and standard deviation 5. Find the probabilities that (i)
26 ≤ 𝑿 ≤ 𝟒𝟎 (ii) X≥ 𝟒𝟓 (iii) |𝑿 − 𝟑𝟎| > 5
Solution:
(i) Here it is given that µ = 30 and σ = 5.
So first we calculate the standard normal variate $z = \dfrac{X - \mu}{\sigma}$.
For X = 26, $z = \dfrac{26 - 30}{5} = -0.8$, and for X = 40, $z = \dfrac{40 - 30}{5} = 2$.

Now P(−0.8 ≤ Z ≤ 2) = P(−0.8 ≤ Z ≤ 0) + P(0 ≤ Z ≤ 2)
= P(0 ≤ Z ≤ 0.8) + P(0 ≤ Z ≤ 2) = 0.2881 + 0.4772 = 0.7653
(ii) P(X ≥ 45): here $Z = \dfrac{45 - 30}{5} = 3$, so
P(Z ≥ 3) = 0.5 − P(0 ≤ Z ≤ 3)
= 0.5 − 0.4986 = 0.0014
(iii) |𝑋 − 30| > 5 can also be written as
P(|𝑋 − 30| > 5) = 1 − P(|𝑋 − 30| ≤ 5)
=1- P (25 ≤ 𝑋 ≤ 35)
By applying standard normal variate rule we get
=1- P (−1 ≤ 𝑍 ≤ 1)=1-2*P(0≤ 𝑍 ≤ 1)
=1-2*0.3413=1-0.6826=0.3174
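The three probabilities can be cross-checked without tables. The sketch below assumes the normal area is computed from the error function; the helper name `normal_cdf` is illustrative.

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for a normal variate with mean mu and standard deviation sigma."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

mu, sigma = 30, 5
p_between = normal_cdf(40, mu, sigma) - normal_cdf(26, mu, sigma)        # (i)  P(26 <= X <= 40)
p_above = 1 - normal_cdf(45, mu, sigma)                                  # (ii) P(X >= 45)
p_outside = 1 - (normal_cdf(35, mu, sigma) - normal_cdf(25, mu, sigma))  # (iii) P(|X - 30| > 5)
print(round(p_between, 4), round(p_above, 4), round(p_outside, 4))
```

Small differences in the last decimal place from the table-based answers are due to rounding in the printed tables.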

9. Correlation and Regression
Comparison of correlation and regression:

(1) Correlation: Measure of the linear relationship between two variables.
    Regression: Measure of the average relationship between two or more variables.

(2) Correlation: The correlation coefficient between X and Y is the same whichever variable is taken first, and is
    calculated by Karl Pearson's correlation formula
    $r_{xy} = \dfrac{\mathrm{cov}(x,y)}{\sigma_x \sigma_y} = \dfrac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum (x_i-\bar{x})^2 \sum (y_i-\bar{y})^2}}$
    Regression: The regression lines of Y on X and of X on Y are different and are given by
    $(y-\bar{y}) = b_{yx}(x-\bar{x})$ for y on x
    $(x-\bar{x}) = b_{xy}(y-\bar{y})$ for x on y
    where $b_{yx} = \dfrac{\mathrm{cov}(x,y)}{\sigma_x^2} = \dfrac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2}$
    and $b_{xy} = \dfrac{\mathrm{cov}(x,y)}{\sigma_y^2} = \dfrac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (y_i-\bar{y})^2}$
    Here b_yx and b_xy are the regression coefficients; each shows the change in the dependent variable for a
    unit change in the independent variable.

(3) Correlation: The correlation coefficient lies between −1 and +1.
    Regression: A regression coefficient lies between −∞ and +∞.

(4) Correlation: The correlation coefficient is independent of change of origin and scale.
    Regression: A regression coefficient is independent of change of origin but not of scale.

(5) Correlation: Test of significance of the correlation coefficient (null hypothesis r = 0):
    $t_{cal} = \dfrac{r_{cal}\sqrt{n-2}}{\sqrt{1-r_{cal}^2}}$ at (n−2) d.f.
    Regression: Test of significance of a regression coefficient (null hypothesis b_yx = 0, b_xy = 0):
    $t_{cal} = \dfrac{b_{yx}}{\text{S.E. of } b_{yx}} = \dfrac{b_{yx}}{\sqrt{\dfrac{\sum (y-\bar{y})^2 - \frac{\left(\sum (x-\bar{x})(y-\bar{y})\right)^2}{\sum (x-\bar{x})^2}}{(n-2)\sum (x-\bar{x})^2}}}$ based on (n−2) d.f. (for y on x)
    $t_{cal} = \dfrac{b_{xy}}{\text{S.E. of } b_{xy}} = \dfrac{b_{xy}}{\sqrt{\dfrac{\sum (x-\bar{x})^2 - \frac{\left(\sum (x-\bar{x})(y-\bar{y})\right)^2}{\sum (y-\bar{y})^2}}{(n-2)\sum (y-\bar{y})^2}}}$ based on (n−2) d.f. (for x on y)

Relationship between correlation and regression coefficients:
• The correlation coefficient is the geometric mean of the regression coefficients: $r = \pm\sqrt{b_{yx} \cdot b_{xy}}$
• If one of the regression coefficients is greater than unity, the other must be less than unity: $r^2 \le 1$, so $b_{yx} \cdot b_{xy} \le 1$
• The arithmetic mean of the regression coefficients is greater than or equal to the correlation coefficient r if r > 0: $\frac12 (b_{yx} + b_{xy}) \ge r$
• If r = 0, θ = π/2: if the two variables are uncorrelated, the lines of regression are perpendicular to each other.
• If r = ±1, θ = 0 or π: the two lines of regression coincide with each other.

Spearman’s Rank correlation: is used to estimate the correlation between two characters on the basis of the
ranks of the individuals.

The formula for rank correlation is $\rho = 1 - \dfrac{6\sum d_i^2}{n(n^2-1)}$,
where d_i is the difference between the two ranks of the i-th individual and n is the number of individuals.
• If two or more individuals receive the same rank, the arithmetic mean of the ranks they would occupy is assigned to each of the
tied individuals, and the next individual is given its actual rank. In this case a correction
factor $\frac{1}{12}\sum (p^3 - p)$ is added to $\sum d_i^2$.
• Now $\rho = 1 - \dfrac{6\left[\sum d_i^2 + \frac{1}{12}\sum (p^3 - p)\right]}{n(n^2-1)}$, where p is the number of items whose ranks are common,
the sum being taken over every group of tied values.
• The limits of the rank correlation coefficient are −1 ≤ ρ ≤ +1

Objective: Computation of correlation coefficient and the equations of the line of regression of Y on X and
X on Y and the estimation of the value of Y when the value of X is known and the value of X when the
value of Y is known.

Kinds of data: The following table relate to the data of stature (inches) of brother and sister from Pearson
and Lee’s sample of 1,401 families.

Family number    1   2   3   4   5   6   7   8   9  10  11
Brother, X      71  68  66  67  70  71  70  73  72  65  66
Sister, Y       69  64  65  63  65  62  65  64  66  59  62
Solution: Calculation of correlation coefficient

First we calculate the means: X̄ = 759/11 = 69, Ȳ = 704/11 = 64.

Family   Brother   Sister   (Xi − X̄)   (Yi − Ȳ)   (Xi − X̄)²   (Yi − Ȳ)²   (Xi − X̄)(Yi − Ȳ)
number   X         Y
1        71        69         2           5           4           25           10
2        68        64        −1           0           1            0            0
3        66        65        −3           1           9            1           −3
4        67        63        −2          −1           4            1            2
5        70        65         1           1           1            1            1
6        71        62         2          −2           4            4           −4
7        70        65         1           1           1            1            1
8        73        64         4           0          16            0            0
9        72        66         3           2           9            4            6
10       65        59        −4          −5          16           25           20
11       66        62        −3          −2           9            4            6
Total    759       704                               74           66           39

Then by using the formula of the correlation coefficient, we have
$r_{xy} = \dfrac{39}{\sqrt{74 \times 66}} = 0.558$

Test of significance of correlation coefficient

$t = \dfrac{0.558\sqrt{11-2}}{\sqrt{1 - 0.558^2}} = 2.018$
The table value of t at 9 d.f. and the 5% level of significance is 2.26.
Since the calculated t is less than the tabulated t, the null hypothesis is accepted. The correlation coefficient is not
significant.
Calculation of Regression Coefficient
Using the formulae for the regression coefficients of Y on X and of X on Y, we have
$b_{yx} = \dfrac{39}{74} = 0.527$, $b_{xy} = \dfrac{39}{66} = 0.591$

Hence, the equation of the regression line of Y on X is
Y − 64 = 0.527 (X − 69)
and the equation of the regression line of X on Y is
X − 69 = 0.591 (Y − 64)
Estimation of Y when X is given:
If we want to calculate the value of Y for X = 70, then by putting X = 70 in the line of regression of Y on X we
get Y − 64 = 0.527 × (70 − 69)
Hence Y = 64 + 0.527 × 1 = 64.527
Estimation of X when Y is given:
If we want to calculate the value of X for Y = 62, then by putting Y = 62 in the line of regression of X on Y we
get X − 69 = 0.591 × (62 − 64)
Hence X = 69 + 0.591 × (−2) = 67.82
Test of significance of regression coefficient of y on x
$t_{yx} = \dfrac{0.527}{\sqrt{\dfrac{66 - \frac{(39)^2}{74}}{(11-2) \times 74}}} = \dfrac{0.527}{0.261} = 2.017$

Test of significance of regression coefficient of x on y
$t_{xy} = \dfrac{0.591}{\sqrt{\dfrac{74 - \frac{(39)^2}{66}}{(11-2) \times 66}}} = \dfrac{0.591}{0.292} = 2.02$

Both values agree, apart from rounding, with the t value obtained for the correlation coefficient.
Since the calculated values of t are less than the tabulated t (2.26), the regression coefficients are not significant.
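The correlation and regression calculations for the brother and sister data can be sketched in Python. The function name `correlation_and_regression` and the dictionary keys are illustrative choices, not part of the manual.

```python
from math import sqrt

def correlation_and_regression(x, y):
    """Return r, both regression coefficients and the t statistic for H0: rho = 0."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)                       # sum of squares of X deviations
    syy = sum((b - mean_y) ** 2 for b in y)                       # sum of squares of Y deviations
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))  # sum of products
    r = sxy / sqrt(sxx * syy)
    return {
        "r": r,
        "b_yx": sxy / sxx,                        # regression coefficient of Y on X
        "b_xy": sxy / syy,                        # regression coefficient of X on Y
        "t": r * sqrt(n - 2) / sqrt(1 - r * r),   # (n - 2) degrees of freedom
        "means": (mean_x, mean_y),
    }

brother = [71, 68, 66, 67, 70, 71, 70, 73, 72, 65, 66]
sister = [69, 64, 65, 63, 65, 62, 65, 64, 66, 59, 62]
result = correlation_and_regression(brother, sister)
print({k: (round(v, 3) if k != "means" else v) for k, v in result.items()})

# Estimate Y for X = 70 from the line of Y on X.
mean_x, mean_y = result["means"]
y_at_70 = mean_y + result["b_yx"] * (70 - mean_x)
print(round(y_at_70, 3))
```

In simple linear regression the t statistics for b_yx, b_xy and r are algebraically identical, so a single value of t is returned.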

Objective: Computation of rank correlation coefficient.


Kinds of data: The following two series of data are given. Compute the rank correlation coefficient by the
method of rank differences (after ranking them in proper order).
X 75 88 92 70 60 80 81 50
Y 120 124 150 115 110 140 142 100

Solution: The largest value in each series is given rank 1.
X     Y     Rank of X, Rx   Rank of Y, Ry   di = Rx − Ry   di²
75    120   5               5                0             0
88    124   2               4               −2             4
92    150   1               1                0             0
70    115   6               6                0             0
60    110   7               7                0             0
80    140   4               3                1             1
81    142   3               2                1             1
50    100   8               8                0             0
Total                                                      Σdi² = 6
Coefficient of rank correlation $= 1 - \dfrac{6 \times 6}{8(8^2-1)} = 1 - \dfrac{36}{8 \times 63} = 1 - 0.0714 = 0.929$
Thus, there is a high positive correlation.

Objective: Computation of rank correlation coefficient when ranks are being repeated.
Kinds of data: The following two series of data are given on the variables X and Y .
X 12 15 18 20 16 15 18 22 15 21 18 15
Y 10 18 19 12 15 19 17 19 16 14 13 17
Solution: The largest value in each series is given rank 1, and tied values share the average of their ranks.
X     Y     Rank of X, Rx   Rank of Y, Ry   di = Rx − Ry   di²
12    10    12              12               0              0.00
15    18     9.5             4               5.5           30.25
18    19     5               2               3              9.00
20    12     3              11              −8             64.00
16    15     7               8              −1              1.00
15    19     9.5             2               7.5           56.25
18    17     5               5.5            −0.5            0.25
22    19     1               2              −1              1.00
15    16     9.5             7               2.5            6.25
21    14     2               9              −7             49.00
18    13     5              10              −5             25.00
15    17     9.5             5.5             4             16.00
Total                                                      Σdi² = 258

$\rho = 1 - \dfrac{6\left[\sum d^2 + \frac{p^3-p}{12} + \frac{p^3-p}{12} + \dots\right]}{n(n^2-1)}$

In the X series, 18 is repeated 3 times after the third rank, thus the common rank assigned to each of these
values is the average (4+5+6)/3 = 5. The next value, 16, gets the next rank, 7. Again the value 15 occurs four
times; the common rank assigned to it is 9.5, which is the arithmetic mean of 8, 9, 10 and 11. The next number, 12,
gets rank 12. Similarly for the Y series, the value 19 occurs thrice and the common rank assigned to
each is 2, i.e. the arithmetic mean of 1, 2 and 3; the value 17 occurs twice and gets the common rank 5.5, the mean of 5 and 6;
the ranks of the other values are assigned accordingly. So here in the X series
p = 3 for 18 and p = 4 for 15, and in the Y series p = 3 for 19 and p = 2 for 17.
$\rho = 1 - \dfrac{6\left[258 + \frac{4^3-4}{12} + \frac{3^3-3}{12} + \frac{2^3-2}{12} + \frac{3^3-3}{12}\right]}{12(12^2-1)} = 1 - \dfrac{6 \times 267.5}{1716} = 1 - 0.935 = 0.065$
This indicates a very poor rank correlation.
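The rank correlation with tied ranks can be checked by recomputing the ranks directly from the data. The following Python sketch is illustrative: the function names are assumptions, ranks are assigned in descending order (largest value gets rank 1), and each rank is paired with its own observation before the differences are taken.

```python
def average_ranks(values):
    """Rank values in descending order; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                        # extend the group of tied values
        shared = (i + 1 + j + 1) / 2      # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def tie_correction(values):
    """Sum of (p**3 - p) / 12 over groups of p equal values."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return sum((p ** 3 - p) / 12 for p in counts.values())

def spearman_rho(x, y):
    n = len(x)
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    correction = tie_correction(x) + tie_correction(y)
    return 1 - 6 * (sum_d2 + correction) / (n * (n ** 2 - 1))

rho_no_ties = spearman_rho([75, 88, 92, 70, 60, 80, 81, 50],
                           [120, 124, 150, 115, 110, 140, 142, 100])
rho_ties = spearman_rho([12, 15, 18, 20, 16, 15, 18, 22, 15, 21, 18, 15],
                        [10, 18, 19, 12, 15, 19, 17, 19, 16, 14, 13, 17])
print(round(rho_no_ties, 3), round(rho_ties, 3))
```

When no values are tied the correction term is zero, so the same function covers both worked examples.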

Objective: Testing the significance of an observed sample correlation coefficient and determination of 95%
and 99% confidence limits.

Kinds of data: In a random sample of 27 pairs of observations from a bivariate population the correlation
coefficient is obtained as 0.6.

Solution: (i) Set up the null and alternative hypothesis as

Ho: 𝜌=0

H1:𝜌 ≠0

(ii) Choose a suitable level of significance 𝛼=0.05 (say)

(iii) Compute the ‘t’ statistic

$t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$ with (n−2) d.f.

$t = \dfrac{0.6\sqrt{27-2}}{\sqrt{1-0.36}} = \dfrac{3}{\sqrt{0.64}} = 3.75$ with 25 degrees of freedom.

(iv) The tabulated value of t for a two-tailed test at the 5% level of significance with 25 degrees of freedom is 2.06.

(v) Since the calculated value of t (3.75) is greater than the tabulated value of t, H0 is rejected at the 5% level of
significance. Hence it is concluded that the variables are correlated in the population.

95% confidence limits for ρ (population correlation coefficient):

$r \pm 1.96 \times \text{S.E.}(r) = r \pm 1.96 \times \dfrac{1-r^2}{\sqrt{n}}$

$= 0.6 \pm \dfrac{1.96(1-0.36)}{\sqrt{27}}$

= 0.6 ± 0.2414

= 0.3586 to 0.8414

Likewise, 99% confidence limits for ρ are

$r \pm 2.58 \times \text{S.E.}(r) = r \pm 2.58 \times \dfrac{1-r^2}{\sqrt{n}}$

$= 0.6 \pm 2.58 \times \dfrac{1-0.36}{\sqrt{27}}$

= 0.6 ± 0.3178

= 0.2822 to 0.9178
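The test statistic and the large-sample confidence limits can be computed with a few lines of Python. This sketch assumes the standard error (1 − r²)/√n used in the solution; the function names are illustrative.

```python
from math import sqrt

def t_for_correlation(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)

def confidence_limits(r, n, z):
    """Large-sample limits r -/+ z * (1 - r^2) / sqrt(n)."""
    half_width = z * (1 - r * r) / sqrt(n)
    return r - half_width, r + half_width

t_value = t_for_correlation(0.6, 27)
limits_95 = confidence_limits(0.6, 27, 1.96)
limits_99 = confidence_limits(0.6, 27, 2.58)
print(round(t_value, 2))
print([round(v, 4) for v in limits_95])
print([round(v, 4) for v in limits_99])
```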

Objective: Computation of correlation coefficient for the bivariate frequency distribution.

Kinds of data: The following data provides according to age the frequency of marks obtained by
100 students in an intelligence test.

Marks \ Age in years    18   19   20   21   Total
10-20                    4    2    2    -      8
20-30                    5    4    6    4     19
30-40                    6    8   10   11     35
40-50                    4    4    6    8     22
50-60                    -    2    4    4     10
60-70                    -    2    3    1      6
Total                   19   22   31   28    100

Solution: Let X denote the age in years and Y the mid-value of the marks class. Define the coded variables

U = X − 19 and V = (Y − 35)/10, and prepare the table as shown below. Each cell shows the frequency f(u,v), with the
product u·v·f(u,v) in brackets.

 v    Mid-value   Marks    Age 18    Age 19   Age 20   Age 21   f(v)   v·f(v)   v²·f(v)   Σ u·v·f(u,v)
      Y                    u = −1    u = 0    u = 1    u = 2
−2    15          10-20    4(8)      2(0)     2(−4)    -           8    −16       32          4
−1    25          20-30    5(5)      4(0)     6(−6)    4(−8)      19    −19       19         −9
 0    35          30-40    6(0)      8(0)     10(0)    11(0)      35      0        0          0
 1    45          40-50    4(−4)     4(0)     6(6)     8(16)      22     22       22         18
 2    55          50-60    -         2(0)     4(8)     4(16)      10     20       40         24
 3    65          60-70    -         2(0)     3(9)     1(6)        6     18       54         15
Total f(u)                 19        22       31       28        100     25      167         52
u·f(u)                     −19       0        31       56         68
u²·f(u)                    19        0        31       112       162
Σ u·v·f(u,v)               9         0        13       30         52

Mean of u $= \dfrac{\sum u f(u)}{\sum f(u)} = \dfrac{68}{100} = 0.68$. Mean of v $= \dfrac{\sum v f(v)}{\sum f(v)} = \dfrac{25}{100} = 0.25$

Cov(u,v) $= \dfrac{52}{100} - 0.68 \times 0.25 = 0.35$

Variance of u $= \dfrac{162}{100} - (0.68)^2 = 1.1576$

Variance of v $= \dfrac{167}{100} - (0.25)^2 = 1.6075$

$r(u,v) = \dfrac{0.35}{\sqrt{1.1576 \times 1.6075}} = 0.257$

Since the correlation coefficient is independent of change of origin and scale,

r(x,y) = r(u,v) = 0.257
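The calculation for a bivariate frequency table can be sketched in Python as follows. The function name `grouped_correlation` and the layout of the `freq` table (one row per value of v, one column per value of u, with empty cells entered as 0) are illustrative assumptions.

```python
from math import sqrt

def grouped_correlation(u_values, v_values, freq):
    """Correlation from a two-way frequency table; freq[i][j] belongs to (v_values[i], u_values[j])."""
    n = sum(sum(row) for row in freq)
    sum_u = sum_v = sum_uu = sum_vv = sum_uv = 0
    for v, row in zip(v_values, freq):
        for u, f in zip(u_values, row):
            sum_u += f * u
            sum_v += f * v
            sum_uu += f * u * u
            sum_vv += f * v * v
            sum_uv += f * u * v
    mean_u, mean_v = sum_u / n, sum_v / n
    cov = sum_uv / n - mean_u * mean_v
    var_u = sum_uu / n - mean_u ** 2
    var_v = sum_vv / n - mean_v ** 2
    return cov / sqrt(var_u * var_v)

u_values = [-1, 0, 1, 2]           # u = age - 19
v_values = [-2, -1, 0, 1, 2, 3]    # v = (mid-value of marks - 35) / 10
freq = [
    [4, 2, 2, 0],
    [5, 4, 6, 4],
    [6, 8, 10, 11],
    [4, 4, 6, 8],
    [0, 2, 4, 4],
    [0, 2, 3, 1],
]
r_uv = grouped_correlation(u_values, v_values, freq)
print(round(r_uv, 3))
```

Because r is unchanged by a change of origin and scale, the value obtained for the coded variables is also the correlation between age and marks.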

10. Multiple and Partial Correlation
Multiple Correlation Coefficient: measures the maximum degree of linear relationship between a single dependent
variable and two or more independent variables. It is a measure of how well a given variable can be
predicted using a linear function of a set of other variables. The multiple correlation coefficient R can never be
negative. Its square, R², represents the proportion of the variance of the dependent variable explained by all the
independent variables together.

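The multiple correlation coefficient is represented by R1.23, where X1 is the dependent variable and X2, X3 are the
independent variables, and the formula is given by

$R_{1.23}^2 = 1 - \dfrac{\omega}{\omega_{11}} = \dfrac{r_{12}^2 + r_{13}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{23}^2}$, where 0 ≤ R1.23 ≤ 1,

$\omega = \begin{vmatrix} 1 & r_{12} & r_{13} \\ r_{21} & 1 & r_{23} \\ r_{31} & r_{32} & 1 \end{vmatrix}$ and $\omega_{11} = \begin{vmatrix} 1 & r_{23} \\ r_{32} & 1 \end{vmatrix}$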

F-test for significance of multiple correlation coefficient:


The null and alternative hypotheses are H0: R1.23 = 0, H1: R1.23 ≠ 0.
The test statistic $F = \dfrac{R^2}{1-R^2} \times \dfrac{n-k-1}{k}$ follows the F distribution with (k, n−k−1) degrees of freedom, where k is the
number of independent variables. If the calculated value of F is less than the tabulated value, the null
hypothesis is accepted.

Partial Correlation Coefficient: measures the degree of association between two random variables X1 and
X2 after the effect of a set of controlling random variables, say X3, has been removed. For example, if we have
economic data on the consumption, income, and wealth of various individuals and we wish to see if there is a
relationship between consumption and income, failing to control for wealth when computing a correlation
coefficient between consumption and income would give a misleading result, since income might be
numerically related to wealth which in turn might be numerically related to consumption; a measured
correlation between consumption and income might actually be contaminated by these other correlations.
The use of a partial correlation avoids this problem. Like the correlation coefficient, the partial correlation
coefficient takes on a value in the range from −1 to 1. The partial correlation coefficient helps in deciding whether
or not to include an additional independent variable in regression analysis.
The correlation coefficient between X1 and X2 after the linear effect of X3 on each of them has been
eliminated is called the partial correlation coefficient, and the formula is given by
$r_{12.3} = \dfrac{r_{12} - r_{13} r_{23}}{\sqrt{(1-r_{13}^2)(1-r_{23}^2)}}$, −1 ≤ r12.3 ≤ +1

t-test for significance of partial correlation coefficient:


The null and alternative hypotheses are H0: ρ12.3 = 0, H1: ρ12.3 ≠ 0.
The test statistic $t = \dfrac{r_{12.3}\sqrt{n-k-2}}{\sqrt{1-r_{12.3}^2}}$ follows the t distribution with (n−k−2) degrees of freedom, where k is the
number of variables from which the effect of the common variable is eliminated. If the calculated value of t is less
than the tabulated value, the null hypothesis is accepted.
Relation between multiple, total and partial correlations: $1 - R_{1.23}^2 = (1 - r_{12}^2)(1 - r_{13.2}^2)$

Objective: Computation of multiple correlation coefficients from the tri-variate population.
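Kinds of data: Given r12 = 0.60, r13 = 0.70 and r23 = 0.65
Solution: We know that $R_{1.23} = \sqrt{\dfrac{r_{12}^2 + r_{13}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{23}^2}}$

$= \sqrt{\dfrac{0.6^2 + 0.7^2 - 2 \times 0.6 \times 0.7 \times 0.65}{1 - 0.65^2}} = \sqrt{\dfrac{0.36 + 0.49 - 0.546}{0.5775}} = \sqrt{0.526} = 0.725$

Thus, R1.23 = 0.725

Likewise, $R_{3.12} = \sqrt{\dfrac{r_{13}^2 + r_{23}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{12}^2}}$

$= \sqrt{\dfrac{0.7^2 + 0.65^2 - 2 \times 0.6 \times 0.7 \times 0.65}{1 - 0.60^2}} = \sqrt{\dfrac{0.49 + 0.4225 - 0.546}{0.64}} = \sqrt{0.573} = 0.757$

Lastly, $R_{2.13} = \sqrt{\dfrac{r_{12}^2 + r_{23}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{13}^2}}$

$= \sqrt{\dfrac{0.6^2 + 0.65^2 - 2 \times 0.6 \times 0.7 \times 0.65}{1 - 0.70^2}} = \sqrt{\dfrac{0.36 + 0.4225 - 0.546}{0.51}} = \sqrt{0.464} = 0.681$

In this way, we have all the values of the multiple correlation coefficients.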

Objective: Testing the significance of an observed multiple correlation coefficient.

Kinds of data: The value of R 1.23=0.725 from a tri-variate distribution for n=25.
Solution: (i) Set up the null and alternative hypothesis as
Ho: R 1.23=0, H1:R 1.23 ≠0
(ii) Choose a suitable level of significance 𝛼=0.05 (say)
(iii) Compute the ‘F’ statistic
$F = \dfrac{R^2}{1-R^2} \times \dfrac{n-k-1}{k}$, which follows the F distribution with (k, n−k−1) degrees of freedom, where k is the number of
independent variables.
Thus, $F = \dfrac{0.725^2}{1 - 0.725^2} \times \dfrac{25-2-1}{2} = \dfrac{0.5256}{0.4744} \times \dfrac{22}{2} = 1.1079 \times 11 = 12.187$

The tabulated value of F at (2, 22) d.f. and the 5% level of significance is 3.44. The calculated value of the F
statistic is greater than the tabulated value. Thus, we reject the null hypothesis and conclude that the multiple
correlation R1.23 is not zero, i.e. the observed multiple correlation coefficient is significant in the population.
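The multiple correlation coefficients and the F statistic can be verified with a short Python sketch. The function names `multiple_r` and `f_statistic` are illustrative; the first argument pair of `multiple_r` holds the correlations of the dependent variable with the two independent variables, and the third argument is the correlation between the independent variables.

```python
from math import sqrt

def multiple_r(r_dep_a, r_dep_b, r_ab):
    """Multiple correlation of a dependent variable on two independent variables a and b."""
    return sqrt((r_dep_a ** 2 + r_dep_b ** 2 - 2 * r_dep_a * r_dep_b * r_ab) / (1 - r_ab ** 2))

def f_statistic(big_r, n, k):
    """F = R^2 / (1 - R^2) * (n - k - 1) / k with (k, n - k - 1) degrees of freedom."""
    return big_r ** 2 / (1 - big_r ** 2) * (n - k - 1) / k

r12, r13, r23 = 0.60, 0.70, 0.65
R1_23 = multiple_r(r12, r13, r23)   # X1 on X2 and X3
R3_12 = multiple_r(r13, r23, r12)   # X3 on X1 and X2
R2_13 = multiple_r(r12, r23, r13)   # X2 on X1 and X3
print(round(R1_23, 3), round(R3_12, 3), round(R2_13, 3))

F = f_statistic(0.725, n=25, k=2)
print(round(F, 2))
```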

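Objective: Calculation of r23.1, b12.3, b13.2 and σ1.23
Kinds of data: In a tri-variate distribution σ1 = 2, σ2 = σ3 = 3, r12 = 0.7, r23 = r31 = 0.5
Solution: (i) We know that $r_{23.1} = \dfrac{r_{23} - r_{21} r_{31}}{\sqrt{(1-r_{21}^2)(1-r_{31}^2)}}$

By putting the values we get $r_{23.1} = \dfrac{0.5 - 0.7 \times 0.5}{\sqrt{(1-0.7^2)(1-0.5^2)}} = 0.2425$

(ii) $b_{12.3} = r_{12.3} \times \dfrac{\sigma_{1.3}}{\sigma_{2.3}}$ and $b_{13.2} = r_{13.2} \times \dfrac{\sigma_{1.2}}{\sigma_{3.2}}$

Now, $r_{12.3} = \dfrac{r_{12} - r_{13} r_{23}}{\sqrt{(1-r_{13}^2)(1-r_{23}^2)}} = \dfrac{0.7 - 0.5 \times 0.5}{\sqrt{(1-0.5^2)(1-0.5^2)}} = 0.6$

Similarly $r_{13.2} = \dfrac{r_{13} - r_{12} r_{32}}{\sqrt{(1-r_{12}^2)(1-r_{32}^2)}} = \dfrac{0.5 - 0.7 \times 0.5}{\sqrt{(1-0.7^2)(1-0.5^2)}} = 0.2425$

$\sigma_{1.3} = \sigma_1\sqrt{1 - r_{13}^2} = 2\sqrt{1 - 0.5^2} = 1.7320$

$\sigma_{2.3} = \sigma_2\sqrt{1 - r_{23}^2} = 3\sqrt{1 - 0.5^2} = 2.5980$

$\sigma_{1.2} = \sigma_1\sqrt{1 - r_{12}^2} = 2\sqrt{1 - 0.7^2} = 1.4282$

$\sigma_{3.2} = \sigma_3\sqrt{1 - r_{32}^2} = 3\sqrt{1 - 0.5^2} = 2.5980$

By putting these values we get

$b_{12.3} = 0.6 \times \dfrac{1.7320}{2.5980} = 0.4$ and $b_{13.2} = 0.2425 \times \dfrac{1.4282}{2.5980} = 0.1333$

(iii) We know that $\sigma_{1.23 \dots n}^2 = \sigma_1^2 \dfrac{\omega}{\omega_{11}}$, so $\sigma_{1.23}^2 = \sigma_1^2 \dfrac{\omega}{\omega_{11}}$, where

$\omega = \begin{vmatrix} 1 & r_{12} & r_{13} \\ r_{21} & 1 & r_{23} \\ r_{31} & r_{32} & 1 \end{vmatrix} = \begin{vmatrix} 1 & 0.7 & 0.5 \\ 0.7 & 1 & 0.5 \\ 0.5 & 0.5 & 1 \end{vmatrix}$

and $\omega_{11} = \begin{vmatrix} 1 & r_{23} \\ r_{32} & 1 \end{vmatrix} = \begin{vmatrix} 1 & 0.5 \\ 0.5 & 1 \end{vmatrix}$

Expanding the determinants, we get

ω = 1 × (1 × 1 − 0.5 × 0.5) − 0.7 × (0.7 × 1 − 0.5 × 0.5) + 0.5 × (0.7 × 0.5 − 1 × 0.5) = 0.36
and ω11 = (1 × 1 − 0.5 × 0.5) = 0.75

Hence $\sigma_{1.23}^2 = \sigma_1^2 \dfrac{\omega}{\omega_{11}} = 2^2 \times \dfrac{0.36}{0.75} = 1.92$,

Then σ1.23 = 1.385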

Objective: Testing the significance of an observed partial correlation coefficient.
Kinds of data: The value of r 12.3 = -0.60 from a trivariate distribution for n=29.
Solution: (i) Set up the null and alternative hypothesis as
Ho: 𝜌 12.3=0, H1:𝜌 12.3≠0
(ii) Choose a suitable level of significance 𝛼=0.05 (say)
(iii) Compute the ‘t’ statistic
$t = \dfrac{r_{12.3}\sqrt{n-k-2}}{\sqrt{1-r_{12.3}^2}} = \dfrac{-0.60 \times \sqrt{29-2-2}}{\sqrt{1-(0.6)^2}} = -3.75$

The table value of t at the 5% level of significance and 25 degrees of freedom is 2.06. As the absolute value of the
computed t (3.75) is greater than the table value of t, we reject the null hypothesis. Thus the observed partial correlation
coefficient is significant, i.e. the partial correlation in the population differs from zero.

11.Multiple Regression Equation and Analysis Technique

Equation of plane of regression:


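The equation of the plane of regression of Xi on the remaining variables Xj (j ≠ i = 1, 2, …, n) is given by
$\dfrac{X_1}{\sigma_1}\omega_{i1} + \dfrac{X_2}{\sigma_2}\omega_{i2} + \dots + \dfrac{X_i}{\sigma_i}\omega_{ii} + \dots + \dfrac{X_n}{\sigma_n}\omega_{in} = 0$; i = 1, 2, …, n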

Objective: Determination of regression equation when the values are given and also determine the value of
X3 when X1 =30 and X2 =45
Kinds of data: For a trivariate distribution
X̄1 = 40     X̄2 = 70     X̄3 = 90
σ1 = 3      σ2 = 6      σ3 = 7
r12 = 0.4   r23 = 0.5   r13 = 0.6

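Solution: The equation of the plane of regression of X1 on X2 and X3 is given by

$\dfrac{X_1}{\sigma_1}\omega_{11} + \dfrac{X_2}{\sigma_2}\omega_{12} + \dfrac{X_3}{\sigma_3}\omega_{13} = 0$

where the variables are measured from their means. Since the plane of regression passes through the means, it can be written as

$\dfrac{(X_1 - \bar{X}_1)}{\sigma_1}\omega_{11} + \dfrac{(X_2 - \bar{X}_2)}{\sigma_2}\omega_{12} + \dfrac{(X_3 - \bar{X}_3)}{\sigma_3}\omega_{13} = 0$

$\omega = \begin{vmatrix} 1 & r_{12} & r_{13} \\ r_{21} & 1 & r_{23} \\ r_{31} & r_{32} & 1 \end{vmatrix} = \begin{vmatrix} 1 & 0.4 & 0.6 \\ 0.4 & 1 & 0.5 \\ 0.6 & 0.5 & 1 \end{vmatrix}$ and

$\omega_{11} = \begin{vmatrix} 1 & r_{23} \\ r_{32} & 1 \end{vmatrix} = \begin{vmatrix} 1 & 0.5 \\ 0.5 & 1 \end{vmatrix}$ = (1 × 1 − 0.5 × 0.5) = 0.75

$\omega_{12} = -\begin{vmatrix} r_{21} & r_{23} \\ r_{31} & 1 \end{vmatrix} = -\begin{vmatrix} 0.4 & 0.5 \\ 0.6 & 1 \end{vmatrix}$ = −(0.4 × 1 − 0.5 × 0.6) = −0.10

$\omega_{13} = \begin{vmatrix} r_{21} & 1 \\ r_{31} & r_{32} \end{vmatrix} = \begin{vmatrix} 0.4 & 1 \\ 0.6 & 0.5 \end{vmatrix}$ = (0.4 × 0.5 − 1 × 0.6) = −0.4

By putting these values in the equation of the plane of regression we get
$\dfrac{(X_1 - 40)}{3} \times 0.75 + \dfrac{(X_2 - 70)}{6} \times (-0.10) + \dfrac{(X_3 - 90)}{7} \times (-0.4) = 0$

This is the required plane of regression.

By putting X1 = 30 and X2 = 45 in the equation we get
$\dfrac{(30 - 40)}{3} \times 0.75 + \dfrac{(45 - 70)}{6} \times (-0.10) + \dfrac{(X_3 - 90)}{7} \times (-0.4) = 0$

i.e. −2.50 + 0.417 − (X3 − 90) × 0.057 = 0

Hence −2.50 + 0.417 = (X3 − 90) × 0.057,
Then $X_3 = \dfrac{-2.083}{0.057} + 90 = 53.46$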

Objective: Find the multiple regression equation of X1 on X2 and X3 .
Kinds of data: The data relating to three variables are given below :
X1 4 6 7 9 13 15
X2 15 12 8 6 4 3
X3 30 24 20 14 10 4

Solution: The regression equation of X1 on X2 and X3 is given by


X1 = a + b12.3 X2 + b13.2 X3 + e
The values of the constants a, b12.3 and b13.2 are obtained by solving the following three normal equations.
ΣX1 = na + b12.3 ΣX2 + b13.2 ΣX3
ΣX1X2 = a ΣX2 + b12.3 ΣX2² + b13.2 ΣX2X3
ΣX1X3 = a ΣX3 + b12.3 ΣX2X3 + b13.2 ΣX3²
The sum of squares and sum of products required in above equations are obtained as below:
        X1    X2    X3    X1X2   X1X3   X2X3   X2²    X3²    X1²
         4    15    30      60    120    450   225    900     16
         6    12    24      72    144    288   144    576     36
         7     8    20      56    140    160    64    400     49
         9     6    14      54    126     84    36    196     81
        13     4    10      52    130     40    16    100    169
        15     3     4      45     60     12     9     16    225
Total   54    48   102     339    720   1034   494   2188    576

Substituting the values in the normal equations, we get

6a + 48 b12.3 + 102 b13.2 = 54 (i)
48a + 494 b12.3 + 1034 b13.2 = 339 (ii)
102a + 1034 b12.3 + 2188 b13.2 = 720 (iii)
Multiplying equation (i) by 8, we get 48a + 384 b12.3 + 816 b13.2 = 432 (iv)
Subtracting equation (iv) from (ii), we get 110 b12.3 + 218 b13.2 = −93 (v)
Multiplying equation (i) by 17, we get 102a + 816 b12.3 + 1734 b13.2 = 918 (vi)
Subtracting equation (vi) from equation (iii), we get 218 b12.3 + 454 b13.2 = −198 (vii)
Multiplying equation (v) by 109, we get 11990 b12.3 + 23762 b13.2 = −10137 (viii)
Multiplying equation (vii) by 55, we get 11990 b12.3 + 24970 b13.2 = −10890 (ix)
Subtracting equation (viii) from equation (ix), we get 1208 b13.2 = −753
Hence, b13.2 = −753/1208 = −0.623
Substituting the value of b13.2 in equation (v), we get 110 b12.3 = −218(−0.623) − 93
110 b12.3 = 135.814 − 93 = 42.814
Hence, b12.3 = 42.814/110 = 0.389
Substituting the values of b12.3 and b13.2 in equation (i), we get
6a + 48(0.389) + 102(−0.623) = 54
6a = 54 + 63.546 − 18.672 = 98.874
Hence, a = 16.479
Thus the required regression equation is X1 = 16.479 + 0.389 X2 − 0.623 X3
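The three normal equations can also be built and solved mechanically. The Python sketch below is illustrative: `normal_equations` and `solve_linear_system` are hypothetical helper names, and plain Gaussian elimination with partial pivoting is used in place of the step-by-step elimination done by hand, so the last decimal place may differ slightly from the rounded hand calculation.

```python
def solve_linear_system(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                   # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def normal_equations(x1, x2, x3):
    """Coefficient matrix and right-hand side for X1 = a + b12.3*X2 + b13.2*X3."""
    sp = lambda u, v: sum(p * q for p, q in zip(u, v))   # sum of products
    n = len(x1)
    lhs = [[n, sum(x2), sum(x3)],
           [sum(x2), sp(x2, x2), sp(x2, x3)],
           [sum(x3), sp(x2, x3), sp(x3, x3)]]
    rhs = [sum(x1), sp(x1, x2), sp(x1, x3)]
    return lhs, rhs

x1 = [4, 6, 7, 9, 13, 15]
x2 = [15, 12, 8, 6, 4, 3]
x3 = [30, 24, 20, 14, 10, 4]
lhs, rhs = normal_equations(x1, x2, x3)
a, b12_3, b13_2 = solve_linear_system(lhs, rhs)
print(round(a, 3), round(b12_3, 3), round(b13_2, 3))
```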

Multiple Regression Analysis: It is a technique used for predicting the unknown value of a variable from
the known values of two or more variables, also called the predictors. For example, the yield of rice per acre
depends upon quality of seed, fertility of soil, fertilizer used, temperature and rainfall. If we want to study the
joint effect of all these variables on rice yield, we can use this technique. An additional advantage of this
technique is that it also enables us to study the individual influence of these variables on yield.
In general, the multiple regression equation of Y on X1, X2, …, Xk is given by:
Y = b0 + b1 X1 + b2 X2 + … + bk Xk
Here b0 is the intercept and b1, b2, b3, …, bk are analogous to the slope in the linear regression equation and are
also called regression coefficients. They can be interpreted in the same way as a slope. Thus if bi = 2.5, it would
indicate that Y will increase by 2.5 units if Xi increases by 1 unit, the other variables being held constant.
ANOVA for Multiple Regression: is similar to ANOVA for linear regression except that the degrees of
freedom are adjusted to reflect the number of explanatory variables included in the model.
Analysis of variance table for simple regression analysis
Source of    Degrees of   Sum of squares        Mean sum of squares   Fcal       Ftab (5%) at (model, error) d.f.
variation    freedom
Model        1            SSM = Σ(Ŷi − Ȳ)²      MSM = SSM/1           MSM/MSE
Error        n−2          SSE = Σ(Yi − Ŷi)²     MSE = SSE/(n−2)
Total        n−1          SST = Σ(Yi − Ȳ)²
In simple regression we test the null hypothesis that β1 = 0; the test statistic F = MSM/MSE has an F distribution
with (1, n−2) d.f.
In multiple regression analysis with p explanatory variables, the model degrees of freedom are equal to p, the
error d.f. are equal to n−p−1 and the total d.f. are equal to n−1.
Analysis of variance table for multiple regression analysis
Source of    Degrees of   Sum of squares        Mean sum of squares   Fcal       Ftab (5%) at (model, error) d.f.
variation    freedom
Model        p            SSM = Σ(Ŷi − Ȳ)²      MSM = SSM/p           MSM/MSE
Error        n−p−1        SSE = Σ(Yi − Ŷi)²     MSE = SSE/(n−p−1)
Total        n−1          SST = Σ(Yi − Ȳ)²
In multiple regression we test the null hypothesis that β1 = β2 = … = βp = 0; the test statistic F = MSM/MSE has an
F distribution with (p, n−p−1) d.f.
Test of significance of model: The appropriateness of the multiple regression model as a whole can be
tested by the F-test in the ANOVA table. A significant F indicates a linear relationship between Y and at
least one of the X's. The suitability of the model for prediction is examined by the coefficient of determination
(R²). R² always lies between 0 and 1. The closer R² is to 1, the better is the model and its prediction.
t-test: To test whether the independent variables individually influence the dependent variable significantly
or not, we test the null hypothesis that the relevant regression coefficient is zero. This can be done using the
t-test. If the t-test of a regression coefficient is significant, it indicates that the variable in question influences
Y significantly while controlling for the other explanatory variables.

Objective: Fitting a straight line with two predictors by matrix approach and determination the value of R2.
Kinds of data: The twenty five observations of pounds of steam used per month in a plant along with
average atmospheric temperature in degrees Fahrenheit and number of operating days in the month.

Observation   Pounds of steam used   Average atmospheric   Number of operating
number        per month, Y           temperature, X1       days in the month, X2
1 10.98 35.3 20
2 11.13 29.7 20
3 12.51 30.8 23
4 8.4 58.8 20
5 9.27 61.4 21
6 8.73 71.3 22
7 6.36 74.4 11
8 8.5 76.7 23
9 7.82 70.7 21
10 9.14 57.5 20
11 8.24 46.4 20
12 12.19 28.9 21
13 11.88 28.1 21
14 9.57 39.1 19
15 10.94 46.8 23
16 9.58 48.5 20
17 10.09 59.3 22
18 8.11 70 22
19 6.83 70 11
20 8.88 74.5 23
21 7.68 72.1 20
22 8.47 58.1 21
23 8.86 44.6 20
24 10.36 33.4 20
25 11.08 28.6 22
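Solution: We know that the least squares estimates of $\beta_0$, $\beta_1$ and $\beta_2$ are given by

$$b = (X'X)^{-1}X'Y$$

where b is the vector of estimates of the elements of $\beta$, provided that X′X is a non-singular matrix.

Here
$$Y = \begin{bmatrix} 10.98 \\ 11.13 \\ 12.51 \\ 8.4 \\ \vdots \\ 10.36 \\ 11.08 \end{bmatrix}, \quad
X = \begin{bmatrix} 1 & 35.3 & 20 \\ 1 & 29.7 & 20 \\ 1 & 30.8 & 23 \\ 1 & 58.8 & 20 \\ \vdots & \vdots & \vdots \\ 1 & 33.4 & 20 \\ 1 & 28.6 & 22 \end{bmatrix}, \quad
\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}, \quad
\varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \vdots \\ \varepsilon_{24} \\ \varepsilon_{25} \end{bmatrix}$$

where Y is a (25x1) vector, X is a (25x3) matrix, $\beta$ is a (3x1) vector and $\varepsilon$ is a (25x1) vector.

Now,
$$b = \begin{bmatrix} b_0 \\ b_1 \\ b_2 \end{bmatrix}
= \left( \begin{bmatrix} 1 & 1 & 1 & \cdots & 1 \\ 35.3 & 29.7 & 30.8 & \cdots & 28.6 \\ 20 & 20 & 23 & \cdots & 22 \end{bmatrix}
\begin{bmatrix} 1 & 35.3 & 20 \\ 1 & 29.7 & 20 \\ 1 & 30.8 & 23 \\ \vdots & \vdots & \vdots \\ 1 & 28.6 & 22 \end{bmatrix} \right)^{-1}
\begin{bmatrix} 1 & 1 & 1 & \cdots & 1 \\ 35.3 & 29.7 & 30.8 & \cdots & 28.6 \\ 20 & 20 & 23 & \cdots & 22 \end{bmatrix}
\begin{bmatrix} 10.98 \\ 11.13 \\ 12.51 \\ \vdots \\ 11.08 \end{bmatrix}$$

$$b = \begin{bmatrix} b_0 \\ b_1 \\ b_2 \end{bmatrix}
= \begin{bmatrix} 25 & 1315 & 506 \\ 1315 & 76323.42 & 26353.30 \\ 506 & 26353.30 & 10460 \end{bmatrix}^{-1}
\begin{bmatrix} 1 & 1 & 1 & \cdots & 1 \\ 35.3 & 29.7 & 30.8 & \cdots & 28.6 \\ 20 & 20 & 23 & \cdots & 22 \end{bmatrix}
\begin{bmatrix} 10.98 \\ 11.13 \\ 12.51 \\ \vdots \\ 11.08 \end{bmatrix}$$

Thus,
$$b = \begin{bmatrix} b_0 \\ b_1 \\ b_2 \end{bmatrix} = \begin{bmatrix} 9.1266 \\ -0.0724 \\ 0.2029 \end{bmatrix}$$
Thus, the fitted least squares equation is
$$\hat{Y} = 9.1266 - 0.0724 X_1 + 0.2029 X_2$$
After the regression equation is estimated, we find the estimated value of Y for each pair (X1, X2) and then find
the total, regression and residual sums of squares.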
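The matrix computation above can be checked numerically. The following is a minimal sketch in plain Python, assuming the 25 observations of the table; the helper names `solve` and `fit` are illustrative only and are not part of the manual. It forms X′X and X′Y and solves the normal equations by Gauss-Jordan elimination instead of inverting X′X explicitly, which gives the same b.

```python
# Steam data from the table above: (Y, X1, X2) for the 25 observations.
DATA = [
    (10.98, 35.3, 20), (11.13, 29.7, 20), (12.51, 30.8, 23), (8.40, 58.8, 20),
    (9.27, 61.4, 21), (8.73, 71.3, 22), (6.36, 74.4, 11), (8.50, 76.7, 23),
    (7.82, 70.7, 21), (9.14, 57.5, 20), (8.24, 46.4, 20), (12.19, 28.9, 21),
    (11.88, 28.1, 21), (9.57, 39.1, 19), (10.94, 46.8, 23), (9.58, 48.5, 20),
    (10.09, 59.3, 22), (8.11, 70.0, 22), (6.83, 70.0, 11), (8.88, 74.5, 23),
    (7.68, 72.1, 20), (8.47, 58.1, 21), (8.86, 44.6, 20), (10.36, 33.4, 20),
    (11.08, 28.6, 22),
]


def solve(a, rhs):
    """Solve the square system a x = rhs by Gauss-Jordan elimination."""
    size = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(a)]
    for col in range(size):
        # Partial pivoting: bring the largest entry of this column to the diagonal.
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(size):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][size] / aug[i][i] for i in range(size)]


def fit(data):
    """Return X'X, X'Y and b for the model Y = b0 + b1*X1 + b2*X2."""
    rows = [(1.0, x1, x2) for _, x1, x2 in data]
    ys = [y for y, _, _ in data]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    return xtx, xty, solve(xtx, xty)


xtx, xty, b = fit(DATA)
y_bar = sum(y for y, _, _ in DATA) / len(DATA)
fitted = [b[0] + b[1] * x1 + b[2] * x2 for _, x1, x2 in DATA]
ss_total = sum((y - y_bar) ** 2 for y, _, _ in DATA)
ss_error = sum((y - f) ** 2 for (y, _, _), f in zip(DATA, fitted))
r_squared = 1 - ss_error / ss_total

print("b =", [round(v, 4) for v in b])
print("R2 =", round(r_squared, 4))
```

Solving the normal equations directly avoids forming the inverse, which is usually the more stable choice when only b is needed.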

Observation number   Pounds of steam used per month (Y)   Estimated value $\hat{Y}$   Total SS $= \sum(Y_i - \bar{Y})^2$   Regression/Model SS $= \sum(\hat{Y}_i - \bar{Y})^2$   Error SS $= \sum(Y_i - \hat{Y}_i)^2$
1 10.98 10.63 2.42 1.45 0.12
2 11.13 11.03 2.91 2.59 0.01
3 12.51 11.56 9.52 4.58 0.90
4 8.4 8.93 1.05 0.25 0.28
5 9.27 8.94 0.02 0.23 0.11
6 8.73 8.43 0.48 0.99 0.09
7 6.36 5.97 9.39 11.92 0.15
8 8.5 8.24 0.85 1.40 0.07
9 7.82 8.27 2.57 1.33 0.20
10 9.14 9.02 0.08 0.16 0.01
11 8.24 9.83 1.40 0.16 2.51
12 12.19 11.30 7.65 3.50 0.80
13 11.88 11.35 6.03 3.72 0.28
14 9.57 10.15 0.02 0.53 0.34
15 10.94 10.40 2.30 0.96 0.29
16 9.58 9.67 0.02 0.06 0.01
17 10.09 9.30 0.44 0.02 0.63
18 8.11 8.52 1.73 0.81 0.17
19 6.83 6.29 6.73 9.82 0.29
20 8.88 8.40 0.30 1.05 0.23
21 7.68 7.96 3.04 2.13 0.08
22 8.47 9.18 0.91 0.06 0.51
23 8.86 9.96 0.32 0.28 1.20
24 10.36 10.77 0.88 1.80 0.17
25 11.08 11.52 2.74 4.39 0.19
Total 63.84 54.21 9.63

The ANOVA table for this regression model is given below:


Source of variation   d.f.   SS   MS   Fcal   Ftab(2,22)
Regression 2 54.21 27.105 61.92 3.44
Residual 22 9.63 0.4377
Total 24 63.84

We can also split the S.S. due to regression into the S.S. due to X1 and the S.S. due to X2.
For this purpose we fit the simple line of regression of Y on X1 as Ŷ = 13.62 - 0.08 X1.
Now again we calculate the SS due to X1 $= \sum(\hat{Y}_i - \bar{Y})^2 = 45.79$.
Source of variation   d.f.   SS   MS   Fcal   Ftab (1,22)
Regression
Due to X1   1   45.79   45.79   104.6   4.30
Due to X2   1   8.42   8.42   19.23
Residual   22   9.63   0.4377
Total   24   63.84
Here, since 104.6 and 19.23 both exceed Ftab(1,22,0.95) = 4.30, the predictors X1 and X2 are found to be
significant. It is to be noted that the sum of squares due to X1 (45.79) is obtained by taking only the one variable
X1, and the sum of squares due to X2 is obtained as the regression SS minus the SS due to X1 (54.21 - 45.79 = 8.42).

To check the suitability of the model for prediction, the coefficient of determination R2 is calculated.
$$R^2 = \frac{\text{sum of squares due to regression}}{\text{total (corrected) sum of squares}} = \frac{45.79 + 8.42}{63.84} = 84.91\%$$

Hence it is found that 84.91 % of the variability in Y is explained by the explanatory variables.
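The ANOVA arithmetic can be reproduced from the sums of squares alone. The sketch below is illustrative (the function name `regression_anova` is an assumption, not from the manual) and takes the values 54.21, 9.63 and 45.79 from the tables above.

```python
def regression_anova(ss_regression, ss_residual, n, p):
    """Mean squares, F ratio and R2 for a regression with p predictors."""
    df_regression, df_residual = p, n - p - 1
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    return {
        "ms_regression": ms_regression,
        "ms_residual": ms_residual,
        "f": ms_regression / ms_residual,
        "r_squared": ss_regression / (ss_regression + ss_residual),
    }


table = regression_anova(54.21, 9.63, n=25, p=2)

# Sequential split: SS due to X1 alone, then the extra SS due to X2.
ss_x1 = 45.79
ss_x2 = 54.21 - ss_x1
f_x1 = ss_x1 / table["ms_residual"]
f_x2 = ss_x2 / table["ms_residual"]

print(round(table["f"], 2), round(table["r_squared"], 4))
print(round(f_x1, 1), round(f_x2, 2))
```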
t-test for regression coefficient $b_{yx_1} = -0.0724$
$$t_{cal} = \frac{b_{yx}}{\text{S.E. of } b_{yx}} = \frac{b_{yx}}{\sqrt{\dfrac{\sum(y-\bar{y})^2 - \dfrac{\left(\sum(x-\bar{x})(y-\bar{y})\right)^2}{\sum(x-\bar{x})^2}}{(n-2)\sum(x-\bar{x})^2}}}$$
based on (n-2) d.f. (for y on x).

Here the null hypothesis is $b_{yx_1} = 0$.

$\sum(y-\bar{y})^2 = 63.82$, $\sum(x-\bar{x})^2 = 7154.42$, $\left(\sum(x-\bar{x})(y-\bar{y})\right)^2 = 326187.2$
$$t_{cal} = \frac{-0.0724}{\sqrt{\dfrac{63.82 - \dfrac{326187.2}{7154.42}}{(25-2) \times 7154.42}}} = -6.87$$

The table value of t at 23 d.f. and 5 % level of significance is 2.068. Since the calculated absolute value of t is
greater than the tabulated value, the null hypothesis is rejected.
t-test for regression coefficient $b_{yx_2} = 0.2029$
$$t_{cal} = \frac{b_{yx}}{\text{S.E. of } b_{yx}} = \frac{b_{yx}}{\sqrt{\dfrac{\sum(y-\bar{y})^2 - \dfrac{\left(\sum(x-\bar{x})(y-\bar{y})\right)^2}{\sum(x-\bar{x})^2}}{(n-2)\sum(x-\bar{x})^2}}}$$
based on (n-2) d.f. (for y on x).

Here the null hypothesis is $b_{yx_2} = 0$.

$\sum(y-\bar{y})^2 = 63.82$, $\sum(x-\bar{x})^2 = 218.56$, $\left(\sum(x-\bar{x})(y-\bar{y})\right)^2 = 4008.916$
$$t_{cal} = \frac{0.2029}{\sqrt{\dfrac{63.82 - \dfrac{4008.916}{218.56}}{(25-2) \times 218.56}}} = 2.13$$

The table value of t at 23 d.f. and 5 % level of significance is 2.068. Since the calculated absolute value of t is
greater than the tabulated value, the null hypothesis is rejected.
Hence both the regression coefficients are found to be significant.

12. Simple and Stratified Random Sampling

Simple Random Sampling (SRS): It is the process of selecting a sample from a given population according
to some law of chance in which each unit of the population has an equal and independent chance of being
included in the sample.
SRSWR (With Replacement): A selection process in which the unit selected at any draw is returned to the
population before the next draw is known as simple random sampling with replacement. In this
case the number of possible samples of size n selected from a population of size N is $N^n$. The units in a
sample selected through this method need not be distinct.
SRSWOR (Without Replacement): A selection process in which the unit selected at any draw is not
returned to the population before the next draw, so that the next unit is selected from the
remaining population, is known as simple random sampling without replacement. In this case the number of
possible samples of size n selected from a population of size N is $^{N}C_{n}$. The units in a sample selected
through this method are distinct.
Note: The sample mean is an unbiased estimate of the population mean in both SRSWR and SRSWOR, whereas the
sample mean square s2 is an unbiased estimate of the population mean square S2 in the case of SRSWOR only.
SRSWOR is more efficient than SRSWR because $V(\bar{y}_n)_{SRSWOR} < V(\bar{y}_n)_{SRSWR}$.

Stratified Random Sampling: When the population is heterogeneous and we wish every section of the
population to be represented in the sample, we divide the whole population into a number of strata so
that the strata differ from one another as much as possible whereas the units within each stratum are as
homogeneous as possible. A simple random sample is then drawn from each stratum. This technique of
selecting a representative sample of the whole population is known as stratified random sampling.
In stratified random sampling the allocation of the sample size to different strata is based on the stratum sizes (Ni),
the variability within the strata (Si2) and the cost of surveying per sampling unit in the stratum (Ci).
Methods for allocation of sample size to different strata are
Equal Allocation: $n_i = \dfrac{n}{k}$
Proportional Allocation: $n_i = \dfrac{n N_i}{N}$
Neyman Allocation: $n_i = n \cdot \dfrac{N_i S_i}{\sum N_i S_i}$
Optimum Allocation (based on cost): $n_i = n \cdot \dfrac{N_i S_i / \sqrt{C_i}}{\sum N_i S_i / \sqrt{C_i}}$

Objective: To show, with the help of a hypothetical population, that in SRSWOR the sample mean and the sample
mean square are unbiased estimates of the population mean and the population mean square, and to determine
the variance and S.E. of the sample mean.
Kinds of data: The data relate to a hypothetical population whose units are 1, 2, 3, 4 and 5. Draw a
sample of size n=3 using SRSWOR.
Solution: The number of all possible samples of size n=3 under SRSWOR is given by $^{N}C_{n} = {}^{5}C_{3} = 10$.
Compute the mean of each sample $\bar{y}_n = \dfrac{\sum y_i}{n}$ and the sample mean square $s^2 = \dfrac{1}{n-1}\sum(y_i - \bar{y}_n)^2$.
Similarly, the mean of the population is $\bar{y}_N = \dfrac{\sum y_i}{N} = \dfrac{15}{5} = 3$ and the population mean square is $S^2 = \dfrac{1}{N-1}\sum(y_i - \bar{y}_N)^2$:
$$S^2 = \frac{1}{4}\left[(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2\right] = \frac{10}{4} = 2.5$$
The 10 possible samples are given below in the table.
S.No. | Possible samples | Sample mean $\bar{y}_n$ | Sample mean square ($s^2$) | Sampling error ($\bar{y}_n - \bar{y}_N$)
1. | 1,2,3 | 2.0 | 1.0 | -1.0
2. | 2,3,4 | 3.0 | 1.0 | 0.0
3. | 3,4,5 | 4.0 | 1.0 | 1.0
4. | 4,5,1 | 3.33 | 4.33 | 0.33
5. | 5,1,2 | 2.67 | 4.33 | -0.33
6. | 1,3,4 | 2.67 | 2.33 | -0.33
7. | 2,4,5 | 3.67 | 2.33 | 0.67
8. | 3,5,1 | 3.0 | 4.00 | 0.0
9. | 4,1,2 | 2.33 | 2.33 | -0.67
10. | 5,2,3 | 3.33 | 2.33 | 0.33
Total | | 30.0 | 24.98 ≈ 25 | 0.00
Now we have to check whether $E(\bar{y}_n) = \bar{y}_N$ and $E(s^2) = S^2$.
$$E(\bar{y}_n) = \frac{\sum \bar{y}_n}{^{N}C_{n}} = \frac{30}{10} = 3 = \bar{y}_N \quad \text{and} \quad E(s^2) = \frac{\sum s_i^2}{^{N}C_{n}} = \frac{25}{10} = 2.5 = S^2,$$
so we can say that the sample mean $\bar{y}_n$ and the sample mean square $s^2$ are unbiased estimators of the population
mean $\bar{y}_N$ and the population mean square $S^2$ respectively.
In order to find the variance of the sample mean in SRSWOR, we know that
$$V(\bar{y}_n)_{SRSWOR} = \frac{N-n}{Nn} S^2 = \frac{5-3}{5 \times 3} \times 2.5 = 0.333$$

We can verify that this variance is correct:
$$V(\bar{y}_n) = \frac{\sum\left(\bar{y}_n - E(\bar{y}_n)\right)^2}{^{N}C_{n}} = \frac{1}{^{N}C_{n}}\left[\sum \bar{y}_n^2 - \frac{(\sum \bar{y}_n)^2}{^{N}C_{n}}\right] = \frac{1}{10}\left[93.33 - 90\right] = 0.333$$
This shows that $V(\bar{y}_n)_{SRSWOR}$ is correct.
Standard error of $\bar{y}_n = \sqrt{V(\bar{y}_n)} = \sqrt{0.333} = 0.577$
We can also compare the two variances, one in SRSWOR and the other in SRSWR.
$$V(\bar{y}_n)_{SRSWR} = \frac{N-1}{Nn} S^2 = \frac{5-1}{5 \times 3} \times 2.5 = 0.667$$
Since $V(\bar{y}_n)_{SRSWOR} < V(\bar{y}_n)_{SRSWR}$,
we can say that SRSWOR is more efficient than SRSWR.
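The enumeration above can be reproduced mechanically. The following is a small sketch using only the Python standard library; `combinations` lists the 10 unordered samples, and the variable names are illustrative.

```python
from itertools import combinations
from statistics import mean, pvariance, variance

population = [1, 2, 3, 4, 5]
N, n = len(population), 3

# All possible SRSWOR samples of size n (order ignored, no repeated units).
samples = list(combinations(population, n))
sample_means = [mean(s) for s in samples]
sample_mean_squares = [variance(s) for s in samples]  # divisor n - 1

expected_mean = mean(sample_means)        # E(sample mean)
expected_s2 = mean(sample_mean_squares)   # E(s^2)
S2 = variance(population)                 # population mean square, divisor N - 1

# Variance of the sample mean: by enumeration and by the formula (N-n)/(Nn) * S^2.
var_by_enumeration = pvariance(sample_means)
var_by_formula = (N - n) / (N * n) * S2

print(len(samples), expected_mean, expected_s2)
print(round(var_by_enumeration, 4), round(var_by_formula, 4))
```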

Objective: To show, with the help of a hypothetical example, that in simple random sampling with replacement
(SRSWR) the sample mean is an unbiased estimator of the population mean whereas the sample mean square is a
biased estimator of the population mean square, and to determine the variance and standard error (S.E.) of the sample mean.
Kind of data: Consider a finite population of size N=5 with the values of the sampling units as (1,2,3,4,5).
Enumerate all possible samples of size n=2 using SRSWR. Find the estimate of $V(\bar{y}_n)$ from the 9th sample.
Solution: The number of all possible samples of size n=2 under SRSWR is given by $N^n = 5^2 = 25$.
Compute the mean of each sample $\bar{y}_n = \dfrac{\sum y_i}{n}$ and the sample mean square $s^2 = \dfrac{1}{n-1}\sum(y_i - \bar{y}_n)^2$.
Similarly, the mean of the population is $\bar{y}_N = \dfrac{\sum y_i}{N} = \dfrac{15}{5} = 3$ and the population mean square is $S^2 = \dfrac{1}{N-1}\sum(y_i - \bar{y}_N)^2$:
$$S^2 = \frac{1}{4}\left[(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2\right] = \frac{10}{4} = 2.5$$

S.No.   Possible samples   Sample mean $\bar{y}_n$   Sample mean square ($s^2$)   Sampling error ($\bar{y}_n - \bar{y}_N$)   S.No.   Possible samples   Sample mean $\bar{y}_n$   Sample mean square ($s^2$)   Sampling error ($\bar{y}_n - \bar{y}_N$)
1 1,2 1.5 0.50 -1.5 13 4,1 2.5 4.50 -0.5
2 1,3 2.0 2.00 -1.0 14 5,1 3.0 8.00 0.0
3 1,4 2.5 4.50 -0.5 15 3,2 2.5 0.50 -0.5
4 1,5 3.0 8.00 0.0 16 4,2 3.0 2.00 0.0
5 2,3 2.5 0.50 -0.5 17 5,2 3.5 4.50 0.5
6 2,4 3.0 2.00 0.0 18 4,3 3.5 0.50 0.5
7 2,5 3.5 4.50 0.5 19 5,3 4.0 2.00 1.0
8 3,4 3.5 0.50 0.5 20 5,4 4.5 0.50 1.5
9 3,5 4.0 2.00 1.0 21 1,1 1.0 0.00 -2.0
10 4,5 4.5 0.50 1.5 22 2,2 2.0 0.00 -1.0
11 2,1 1.5 0.50 -1.5 23 3,3 3.0 0.00 0.0
12 3,1 2.0 2.00 -1.0 24 4,4 4.0 0.00 1.0
25 5,5 5.0 0.00 2.0
Total 75.0 50.00

Now we have to check whether $E(\bar{y}_n) = \bar{y}_N$ and $E(s^2) = S^2$.
$$E(\bar{y}_n) = \frac{\sum \bar{y}_n}{N^n} = \frac{75}{25} = 3 = \bar{y}_N \quad \text{and} \quad E(s^2) = \frac{\sum s_i^2}{N^n} = \frac{50}{25} = 2 \neq S^2,$$
so we can say that the sample mean $\bar{y}_n$ is an unbiased estimate of the population mean, whereas the sample
mean square $s^2$ is not an unbiased estimate of the population mean square $S^2$ in the case of SRSWR.
In order to find the variance of the sample mean in SRSWR, we know that
$$V(\bar{y}_n) = \frac{\sigma^2}{n} = \frac{N-1}{Nn} S^2 = \frac{5-1}{5 \times 2} \times 2.5 = 1.0$$

Standard error of $\bar{y}_n = \sqrt{V(\bar{y}_n)} = \sqrt{1} = 1$
In order to find the estimate of $V(\bar{y}_n)$ based on the 9th sample, for which $s^2 = 2.0$, we have
$$\hat{V}(\bar{y}_n) = \frac{N-1}{Nn} s^2 = \frac{5-1}{5 \times 2} \times 2.0 = 0.8$$

Standard error of $\bar{y}_n = \sqrt{\hat{V}(\bar{y}_n)} = \sqrt{0.80} = 0.894$
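The 25 ordered samples of the SRSWR example can be enumerated in the same way. This is a minimal sketch using only the standard library; it checks that E(s2) equals the population variance with divisor N (σ2 = 2) and not S2 = 2.5.

```python
from itertools import product
from statistics import mean, pvariance, variance

population = [1, 2, 3, 4, 5]
N, n = len(population), 2

# All ordered samples with replacement: N**n of them.
samples = list(product(population, repeat=n))
sample_means = [mean(s) for s in samples]
sample_mean_squares = [variance(s) for s in samples]  # divisor n - 1

expected_mean = mean(sample_means)
expected_s2 = mean(sample_mean_squares)

sigma2 = pvariance(population)  # divisor N, equals (N-1)/N * S^2
S2 = variance(population)       # divisor N - 1

var_by_enumeration = pvariance(sample_means)
var_by_formula = sigma2 / n

print(len(samples), expected_mean, expected_s2, sigma2, S2)
print(var_by_enumeration, var_by_formula)
```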

Objective: Allocation of the sample to strata in stratified random sampling under different allocations, along with
determination of the variances and standard errors.
Kinds of data: A hypothetical population of N = 3000 is divided into four strata; their sizes and
standard deviations are given as follows:
Strata I II III IV
Size Ni 400 600 900 1100
SD Si 4 6 9 12
A stratified random sample of size 800 is to be selected from the population.
Solution:
(i) In equal allocation the sample sizes allocated to the different strata are the same. Hence
$n_i = \dfrac{n}{k} = \dfrac{\text{total sample size}}{\text{number of strata}} = \dfrac{800}{4} = 200$ units from each stratum.
(ii) In proportional allocation $n_i$ (i=1,2,3,4) is given by $n_i = n p_i$ where $p_i = N_i/N$, i.e. $n_i = \dfrac{n N_i}{N}$.
Hence $n_1 = \dfrac{800 \times 400}{3000} = 106.67 \approx 107$ units from stratum I
$n_2 = \dfrac{800 \times 600}{3000} = 160$ units from stratum II
$n_3 = \dfrac{800 \times 900}{3000} = 240$ units from stratum III
$n_4 = \dfrac{800 \times 1100}{3000} = 293.33 \approx 293$ units from stratum IV
Thus, n1 + n2 + n3 + n4 = 800 units constitute the sample required from all the strata.

(iii) The sample size in Neyman allocation is given by $n_i = n \cdot \dfrac{p_i S_i}{\sum p_i S_i} = n \cdot \dfrac{N_i S_i}{\sum N_i S_i}$
Here $\sum N_i S_i$ = 400*4 + 600*6 + 900*9 + 1100*12 = 26500

Hence, $n_1 = 800 \times \dfrac{400 \times 4}{26500} = 48.30 \approx 48$, $n_2 = 800 \times \dfrac{600 \times 6}{26500} = 108.68 \approx 109$,

$n_3 = 800 \times \dfrac{900 \times 9}{26500} = 244.53 \approx 245$, $n_4 = 800 \times \dfrac{1100 \times 12}{26500} = 398.49 \approx 398$.

In Neyman allocation, the sample sizes from the four strata are 48, 109, 245 and 398, which add up to the
required sample size of 800.
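Variance of $\bar{y}_{st}$ in equal allocation: $V(\bar{y}_{st}) = \dfrac{k \sum p_i^2 S_i^2}{n} - \dfrac{\sum p_i S_i^2}{N}$.
From the above data $\sum p_i S_i = 8.83$, $\sum p_i S_i^2 = 86.43$ and $\sum p_i^2 S_i^2 = 28.37$, so
$$V(\bar{y}_{st})_{equal} = \frac{4 \times 28.37}{800} - \frac{86.43}{3000} = 0.1419 - 0.0288 = 0.1130$$
Standard error of $(\bar{y}_{st})_{equal} = \sqrt{V(\bar{y}_{st})} = \sqrt{0.1130} = 0.336$
Variance of $\bar{y}_{st}$ in proportional allocation:
$$V(\bar{y}_{st})_{prop} = \left(\frac{1}{n} - \frac{1}{N}\right) \sum p_i S_i^2 = \left(\frac{1}{800} - \frac{1}{3000}\right) \times 86.43 = 0.0792$$
Standard error of $(\bar{y}_{st})_{prop} = \sqrt{V(\bar{y}_{st})} = \sqrt{0.0792} = 0.2815$
Variance of $\bar{y}_{st}$ in Neyman allocation:
$$V(\bar{y}_{st})_{ney} = \frac{\left(\sum p_i S_i\right)^2}{n} - \frac{\sum p_i S_i^2}{N} = \frac{8.83^2}{800} - \frac{86.43}{3000} = 0.0687$$
Standard error of $(\bar{y}_{st})_{ney} = \sqrt{V(\bar{y}_{st})} = \sqrt{0.0687} = 0.262$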
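The three allocations and their variances can be computed with one routine. The sketch below is illustrative (the function names `allocate` and `variance_of_mean` are assumptions, not from the manual). It uses the general expression V = Σ p_i² S_i² (1/n_i − 1/N_i), which reduces to the three formulas above when the unrounded n_i of each allocation are substituted.

```python
from math import sqrt


def allocate(n, sizes, sds, method):
    """Return the (unrounded) stratum sample sizes n_i for the chosen allocation."""
    k, N = len(sizes), sum(sizes)
    if method == "equal":
        return [n / k for _ in sizes]
    if method == "proportional":
        return [n * Ni / N for Ni in sizes]
    if method == "neyman":
        total = sum(Ni * Si for Ni, Si in zip(sizes, sds))
        return [n * Ni * Si / total for Ni, Si in zip(sizes, sds)]
    raise ValueError("unknown allocation method: " + method)


def variance_of_mean(alloc, sizes, sds):
    """V(y_st) = sum of p_i^2 S_i^2 (1/n_i - 1/N_i), with p_i = N_i / N."""
    N = sum(sizes)
    return sum((Ni / N) ** 2 * Si ** 2 * (1 / ni - 1 / Ni)
               for ni, Ni, Si in zip(alloc, sizes, sds))


sizes, sds, n = [400, 600, 900, 1100], [4, 6, 9, 12], 800
results = {}
for method in ("equal", "proportional", "neyman"):
    alloc = allocate(n, sizes, sds, method)
    v = variance_of_mean(alloc, sizes, sds)
    # Store rounded sample sizes, the variance and the standard error.
    results[method] = ([round(a) for a in alloc], v, sqrt(v))
    print(method, results[method][0], round(v, 4), round(sqrt(v), 4))
```

The ordering of the three variances (Neyman smallest, equal largest for these data) is what the comparison of allocations is meant to show.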

Objective: Determination of the sample sizes under different allocations and of the estimates of the population
mean and population total in stratified random sampling.
Kinds of data: A population of size N = 4000 has been divided into five strata; their sizes, S.D.'s and
sample means are given below.

Strata I II III IV V
Sizes Ni 300 600 900 1200 1000
Sample means $\bar{y}_{n_i}$ 8 10 15 18 13
Standard deviation Si 2 4 6 8 5

A stratified random sample of size 800 is to be drawn from the population.

Solution: (i) In equal allocation the sample sizes allocated to the different strata are the same. Hence
$n_i = \dfrac{n}{k} = \dfrac{\text{total sample size}}{\text{number of strata}} = \dfrac{800}{5} = 160$ units from each stratum.
(ii) In proportional allocation $n_i$ (i=1,2,3,4,5) is given by $n_i = n p_i$ where $p_i = N_i/N$, i.e. $n_i = \dfrac{n N_i}{N}$.
Hence $n_1 = \dfrac{800 \times 300}{4000} = 60$ units from stratum I
$n_2 = \dfrac{800 \times 600}{4000} = 120$ units from stratum II
$n_3 = \dfrac{800 \times 900}{4000} = 180$ units from stratum III
$n_4 = \dfrac{800 \times 1200}{4000} = 240$ units from stratum IV
$n_5 = \dfrac{800 \times 1000}{4000} = 200$ units from stratum V

Thus, n1 + n2 + n3 + n4 + n5 = 800 units constitute the sample required from all the strata.
(iii) The sample size in Neyman allocation is given by $n_i = n \cdot \dfrac{p_i S_i}{\sum p_i S_i} = n \cdot \dfrac{N_i S_i}{\sum N_i S_i}$
Here $\sum N_i S_i$ = 300*2 + 600*4 + 900*6 + 1200*8 + 1000*5 = 23000
Hence, $n_1 = 800 \times \dfrac{300 \times 2}{23000} = 20.87 \approx 21$, $n_2 = 800 \times \dfrac{600 \times 4}{23000} = 83.48 \approx 83$,

$n_3 = 800 \times \dfrac{900 \times 6}{23000} = 187.83 \approx 188$, $n_4 = 800 \times \dfrac{1200 \times 8}{23000} = 333.91 \approx 334$,
$n_5 = 800 \times \dfrac{1000 \times 5}{23000} = 173.91 \approx 174$.
In Neyman allocation, the sample sizes from the five strata are 21, 83, 188, 334 and 174, which add up to
the required sample size of 800.
We know that an unbiased estimate of the population mean $\bar{Y}_N$ can be worked out as
$$\hat{\bar{Y}}_N = \bar{y}_{st} = \frac{1}{N} \sum N_i \bar{y}_{n_i} = \frac{300 \times 8 + 600 \times 10 + 900 \times 15 + 1200 \times 18 + 1000 \times 13}{4000} = 14.125$$
An appropriate estimator of the population total is given by
$$\hat{Y} = N \bar{y}_{st} = 4000 \times 14.125 = 56500$$
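The weighted mean above is a one-line computation. A minimal sketch, with the illustrative function name `stratified_mean` (not from the manual):

```python
def stratified_mean(sizes, sample_means):
    """Estimate of the population mean: (1/N) * sum of N_i * (stratum sample mean)."""
    N = sum(sizes)
    return sum(Ni * yi for Ni, yi in zip(sizes, sample_means)) / N


sizes = [300, 600, 900, 1200, 1000]
sample_means = [8, 10, 15, 18, 13]

y_st = stratified_mean(sizes, sample_means)
estimated_total = sum(sizes) * y_st

print(y_st, estimated_total)
```

Note that the stratum weights are the population sizes N_i, not the sample sizes n_i, so the estimate is the same whichever allocation is used.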

13. Ratio and Regression Estimator

Ratio Estimator: The ratio method of estimation is based on the information available for an auxiliary variable.
When the correlation coefficient between the study variable and the auxiliary variable is positive and high,
the ratio method of estimation can be used to study the population parameters of the study variable Y.
The ratio estimator is given by $\bar{y}_R = \dfrac{\bar{y}_n}{\bar{x}_n} \bar{X}_N$, where $\bar{y}_n$ and $\bar{x}_n$ are the sample means of y and x
respectively and $\bar{X}_N$ is the population mean of x.

The ratio estimator is not an unbiased estimate of the population mean. To the first order of approximation
the bias is zero only when $\rho = C_x / C_y$, i.e. when the line of regression of y on x passes through the origin.
The bias of the ratio estimator to the first order of approximation is given by
$$B_1(\bar{y}_R) = \frac{N-n}{Nn} \bar{Y}_N \left(C_x^2 - \rho C_x C_y\right), \quad \text{where } C_x = \frac{S_X}{\bar{X}_N} \text{ and } C_y = \frac{S_Y}{\bar{Y}_N}$$
The variance of the ratio estimator is given by
$$V(\bar{y}_R) = \frac{N-n}{Nn} \left(S_y^2 + R^2 S_x^2 - 2R\rho S_x S_y\right), \quad \text{where } R = \frac{\bar{Y}_N}{\bar{X}_N}$$

Regression Estimator: The ratio estimator is used if y and x are linearly related and the line of regression
of y on x passes through the origin. When the relation is linear but the line does not pass through the
origin, the regression estimator is used.
The regression estimator is defined as $\bar{y}_{lr} = \bar{y}_n + b_{yx}(\bar{X}_N - \bar{x}_n)$.
The regression estimator is also a biased estimate of the population mean.
The variance of the regression estimator is given by $V(\bar{y}_{lr}) = \dfrac{N-n}{Nn} s_y^2 (1 - r_{xy}^2)$, where $r_{xy} = \dfrac{s_{xy}}{s_x s_y}$

and $b_{yx} = \dfrac{s_{xy}}{s_x^2}$, with $s_x^2 = \dfrac{1}{n-1}\left[\sum x_i^2 - \dfrac{(\sum x_i)^2}{n}\right]$ and $s_{xy} = \dfrac{1}{n-1}\left[\sum x_i y_i - \dfrac{\sum x_i \sum y_i}{n}\right]$.

• The regression estimator is more efficient than the ratio estimator: $V(\bar{y}_{lr}) < V(\bar{y}_R)$.
• If the correlation coefficient is equal to zero, we should not apply the regression estimator.

Objective : Estimation of the average number of bullocks per acre using ratio estimator and show that it is
a biased estimator of population mean. Compute bias and variance along with its standard error.
Kinds of data : A bivariate population of size N=6 is given below :
No. of bullocks(Y) 3 4 8 9 6 9
Farm Size (acre)(X) 15 20 40 45 25 42

Enumerate all possible samples of size n=2 using SRSWOR.


Solution: Here it is given that N=6 and n=2.
The total number of possible samples of size n=2 is $^{N}C_{n} = {}^{6}C_{2} = 15$.

S.No.   Possible samples ($y_i$)   Possible samples ($x_i$)   Sample mean $\bar{y}_n$   Sample mean $\bar{x}_n$   $\bar{y}_R = \dfrac{\bar{y}_n}{\bar{x}_n}\bar{X}_N$   $\hat{R} = \dfrac{\bar{y}_n}{\bar{x}_n}$

1. 3,4 15,20 3.5 17.5 6.233 0.20


2. 3,8 15,40 5.5 27.5 6.233 0.20
3. 3,9 15,45 6 30 6.233 0.20
4. 3,6 15,25 4.5 20 7.013 0.225
5. 3,9 15,42 6 28.5 6.561 0.211
6. 4,8 20,40 6 30 6.233 0.20
7. 4,9 20,45 6.5 32.5 6.233 0.20
8. 4,6 20,25 5 22.5 6.930 0.222
9. 4,9 20,42 6.5 31 6.535 0.210
10. 8,9 40,45 8.5 42.5 6.233 0.20
11. 8,6 40,25 7 32.5 6.713 0.215
12. 8,9 40,42 8.5 41 6.461 0.207
13. 9,6 45,25 7.5 35 6.679 0.214
14. 9,9 45,42 9 43.5 6.448 0.207
15. 6,9 25,42 7.5 33.5 6.978 0.224
Total 467.5 97.716 3.135

$$\bar{X}_N = \frac{\sum X_i}{N} = \frac{187}{6} = 31.17, \quad \bar{Y}_N = \frac{\sum Y_i}{N} = \frac{39}{6} = 6.50$$
$$E(\bar{y}_R) = \frac{\sum \bar{y}_R}{^{N}C_{n}} = \frac{97.716}{15} = 6.514$$
Since $E(\bar{y}_R) \neq \bar{Y}_N$, the ratio estimator is not an unbiased estimator of the population mean $\bar{Y}_N$.
The bias of the ratio estimator to the first order of approximation is given by
$$B_1(\bar{y}_R) = \frac{N-n}{Nn} \bar{Y}_N \left(C_x^2 - \rho C_x C_y\right), \quad \text{where } C_x = \frac{S_X}{\bar{X}_N} \text{ and } C_y = \frac{S_Y}{\bar{Y}_N}$$
Now, $S_X = \sqrt{\dfrac{1}{5}\left(6639 - \dfrac{187^2}{6}\right)} = 12.73$ and $S_Y = 2.588$,
$C_x = 0.408$, $C_y = 0.397$
To find the value of the correlation coefficient $\rho$ between X and Y, we need the following
values:
∑y=39, ∑x=187, ∑xy=1378, ∑x2=6639, ∑y2=287
$$\rho = \frac{\dfrac{1378}{6} - \dfrac{187 \times 39}{6 \times 6}}{\sqrt{\dfrac{6639}{6} - \left(\dfrac{187}{6}\right)^2} \times \sqrt{\dfrac{287}{6} - \left(\dfrac{39}{6}\right)^2}} = 0.9859$$
Hence $B_1(\bar{y}_R) = \dfrac{6-2}{6 \times 2} \times 6.50 \times (0.408^2 - 0.9859 \times 0.408 \times 0.397) = 0.014$.
The variance of the ratio estimator is given by
$$V(\bar{y}_R) = \frac{N-n}{Nn} \left(S_y^2 + R^2 S_x^2 - 2R\rho S_x S_y\right), \quad \text{where } R = \frac{\bar{Y}_N}{\bar{X}_N} = 0.208$$
$$= \frac{6-2}{6 \times 2} \left(2.58^2 + 0.208^2 \times 12.73^2 - 2 \times 0.208 \times 0.9859 \times 12.73 \times 2.58\right) = 0.066$$
The above formula of variance can be written in terms of the coefficients of variation as:
$$V(\bar{y}_R) = \frac{N-n}{Nn} \bar{Y}_N^2 \left(C_y^2 + C_x^2 - 2\rho C_x C_y\right)$$
$$= \frac{6-2}{6 \times 2} \times 6.50^2 \times (0.397^2 + 0.408^2 - 2 \times 0.9859 \times 0.408 \times 0.397) = 0.0660$$
Both values of the variance of the ratio estimator are approximately equal.
Standard error of the ratio estimator $\bar{y}_R = \sqrt{V(\bar{y}_R)} = \sqrt{0.0660} = 0.257$
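The table of 15 samples and the first-order approximations can be cross-checked by enumeration. The sketch below uses only the standard library; variable names are illustrative, and it works with unrounded intermediate values, so the last digits may differ slightly from the hand computation.

```python
from itertools import combinations
from statistics import mean, stdev

y = [3, 4, 8, 9, 6, 9]          # number of bullocks
x = [15, 20, 40, 45, 25, 42]    # farm size (acres)
N, n = len(y), 2
X_bar, Y_bar = mean(x), mean(y)

# Ratio estimate (sample mean of y / sample mean of x) * X_bar for every SRSWOR sample.
ratio_estimates = []
for idx in combinations(range(N), n):
    y_n = mean([y[i] for i in idx])
    x_n = mean([x[i] for i in idx])
    ratio_estimates.append(y_n / x_n * X_bar)

expected_ratio_estimate = mean(ratio_estimates)
exact_bias = expected_ratio_estimate - Y_bar

# First-order approximations of the bias and the variance.
S_x, S_y = stdev(x), stdev(y)
S_xy = sum((a - X_bar) * (b - Y_bar) for a, b in zip(x, y)) / (N - 1)
rho = S_xy / (S_x * S_y)
C_x, C_y = S_x / X_bar, S_y / Y_bar
f = (N - n) / (N * n)
approx_bias = f * Y_bar * (C_x ** 2 - rho * C_x * C_y)
R = Y_bar / X_bar
approx_var = f * (S_y ** 2 + R ** 2 * S_x ** 2 - 2 * R * rho * S_x * S_y)

print(round(expected_ratio_estimate, 3), round(exact_bias, 4))
print(round(rho, 4), round(approx_bias, 4), round(approx_var, 4))
```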

Objective: Determination of the regression estimator, comparison with the ratio estimator, and determination of
its sampling variance and standard error.
Kinds of data: From a bivariate population of size N=85 with population means $\bar{X}_N = 6.55$ and $\bar{Y}_N = 8.55$, a
random sample of size n=10 was drawn using the SRSWOR scheme and was recorded as
Y 11 8 7 6 4 5 3 2 9 10
X 10 7 6 5 3 4 2 1 8 9

Solution: First we calculate $\bar{y}_n$ and $\bar{x}_n$; $\bar{X}_N$ is given:
$\bar{y}_n = \dfrac{65}{10} = 6.5$, $\bar{x}_n = \dfrac{55}{10} = 5.5$, and $\bar{X}_N = 6.55$
The regression estimator is $\bar{y}_{lr} = \bar{y}_n + b_{yx}(\bar{X}_N - \bar{x}_n)$,
where
$b_{yx} = \dfrac{s_{xy}}{s_x^2}$, with $s_x^2 = \dfrac{1}{n-1}\left[\sum x_i^2 - \dfrac{(\sum x_i)^2}{n}\right]$ and $s_{xy} = \dfrac{1}{n-1}\left[\sum x_i y_i - \dfrac{\sum x_i \sum y_i}{n}\right]$
                                        Total
Y 11 8 7 6 4 5 3 2 9 10 65
X 10 7 6 5 3 4 2 1 8 9 55
YX 110 56 42 30 12 20 6 2 72 90 440
Y2 121 64 49 36 16 25 9 4 81 100 505
X2 100 49 36 25 9 16 4 1 64 81 385
By putting the values we get $s_x^2 = \dfrac{1}{9}\left(385 - \dfrac{55^2}{10}\right) = \dfrac{82.5}{9} = 9.16$, $s_y^2 = 9.16$ and $s_{xy} = 9.16$
So the value of $b_{yx} = \dfrac{s_{xy}}{s_x^2} = \dfrac{9.16}{9.16} = 1$
Now the regression estimator is $\bar{y}_{lr} = \bar{y}_n + b_{yx}(\bar{X}_N - \bar{x}_n) = 6.5 + 1 \times (6.55 - 5.5) = 7.55$
Estimate of the sampling variance of the estimator $\bar{y}_{lr}$:
$$V(\bar{y}_{lr}) = \frac{N-n}{Nn} s_y^2 (1 - r_{xy}^2), \quad \text{here } r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{9.16}{9.16} = 1$$
By putting the values we get $V(\bar{y}_{lr}) = \dfrac{85-10}{85 \times 10} \times 9.16 \times (1 - 1^2) = 0$
Hence we can say that in the case of perfect correlation the variance of the linear regression estimator $\bar{y}_{lr}$
of the population mean $\bar{Y}_N$ will always be equal to zero.

The variance of the ratio estimator is given by
$$V(\bar{y}_R) = \frac{N-n}{Nn} \left(s_y^2 + R^2 s_x^2 - 2R\rho s_x s_y\right), \quad \text{where } R = \frac{\bar{y}_n}{\bar{x}_n} = \frac{6.5}{5.5} = 1.18$$
$$V(\bar{y}_R) = \frac{85-10}{85 \times 10} \left(9.16 + 1.18^2 \times 9.16 - 2 \times 1.18 \times 1 \times 9.16\right) = 0.0262$$
The estimated sampling variance of the sample mean is given by
$$V(\bar{y}_n)_{SRSWOR} = \frac{N-n}{Nn} s_y^2 = \frac{85-10}{85 \times 10} \times 9.16 = 0.808$$

Here $V(\bar{y}_{lr}) < V(\bar{y}_R) < V(\bar{y}_n)_{SRSWOR}$.
This shows that the sample mean $\bar{y}_n$ is less efficient than the ratio and regression estimators.
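The three variances compared above can be computed from the sample directly. This is a minimal sketch with illustrative variable names; it keeps R unrounded (6.5/5.5 = 1.1818), so the ratio-estimator variance comes out near 0.0267 rather than the 0.0262 obtained with R rounded to 1.18.

```python
from statistics import mean

y = [11, 8, 7, 6, 4, 5, 3, 2, 9, 10]
x = [10, 7, 6, 5, 3, 4, 2, 1, 8, 9]
N, X_bar_N = 85, 6.55          # population size and known population mean of x
n = len(y)

y_n, x_n = mean(y), mean(x)
s_x2 = sum((a - x_n) ** 2 for a in x) / (n - 1)
s_y2 = sum((b - y_n) ** 2 for b in y) / (n - 1)
s_xy = sum((a - x_n) * (b - y_n) for a, b in zip(x, y)) / (n - 1)

# Regression estimator and its estimated variance.
b_yx = s_xy / s_x2
y_lr = y_n + b_yx * (X_bar_N - x_n)
r2 = s_xy ** 2 / (s_x2 * s_y2)
f = (N - n) / (N * n)
var_lr = f * s_y2 * (1 - r2)

# Ratio estimator variance (rho * s_x * s_y equals s_xy) and plain sample mean variance.
R_hat = y_n / x_n
var_ratio = f * (s_y2 + R_hat ** 2 * s_x2 - 2 * R_hat * s_xy)
var_mean = f * s_y2

print(round(y_lr, 2), round(var_lr, 4), round(var_ratio, 4), round(var_mean, 4))
```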

14. Large Sample Test

For a large sample size n (usually n > 30) the sampling distributions of most statistics are approximately
normal. To solve problems of large samples the normal variable X is transformed to a new variable $Z = \dfrac{X - \mu}{\sigma}$,
which is known as the standard normal variate with mean 0 and variance 1.
By the area property of the normal distribution the standard normal variate lies between -3 and +3 with probability 0.9973.
Hence, if |Z| > 3, the null hypothesis will always be rejected.
If |Z| ≤ 3, the null hypothesis will be tested for possible rejection at a certain level of significance.
For a two tailed test
if |Z| < 1.96, H0 is accepted at 5 % level of significance and
if |Z| < 2.58, H0 is accepted at 1 % level of significance
For a single tailed (right or left) test
if |Z| < 1.645, H0 is accepted at 5 % level of significance and
if |Z| < 2.33, H0 is accepted at 1 % level of significance

Objective: Testing the significance of a single mean based on a large sample.

Kinds of data: A random sample of 900 items has a mean of 3.4 cm and a standard deviation of 2.61 cm.
The population mean and standard deviation are given as 3.25 cm and 2.62 cm respectively.
Solution: We set up the null hypothesis H0: the sample has been drawn from the population with mean µ = 3.25
and S.D. σ = 2.62, against H1: µ ≠ 3.25.
Under H0: $Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$, which follows N(0,1).
Now $Z = \dfrac{3.4 - 3.25}{2.62/\sqrt{900}} = 1.72$
Since 1.72 < 1.96, the null hypothesis is accepted at the 5% level of significance.

Objective: Testing the significance of the difference of two means based on large samples.

Kinds of data: Random samples of 1000 and 2000 members have means of 67.5 and 68.0 inches
respectively; the population standard deviation for both samples is 2.5 inches.
Solution: We set up the null hypothesis H0: µ1=µ2, and H1: µ1≠ µ2.
Under H0: $Z = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\sigma^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$, which follows N(0,1).
Since $\sigma^2 = (2.5)^2 = 6.25$,
$$Z = \frac{67.5 - 68}{\sqrt{6.25\left(\dfrac{1}{1000} + \dfrac{1}{2000}\right)}} = -5.16$$

Hence Z = -5.16 and the absolute value of Z = 5.16.

The difference of these two sample means is highly significant, since the calculated value |Z| = 5.16 is larger
than the tabulated value 2.58 at the 1% level of significance (indeed |Z| > 3).
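Both Z statistics for means can be computed with two short functions. A minimal sketch (the function names are illustrative, not from the manual):

```python
from math import sqrt


def z_single_mean(sample_mean, mu, sigma, n):
    """Z for H0: population mean = mu, with known population S.D. sigma."""
    return (sample_mean - mu) / (sigma / sqrt(n))


def z_two_means(mean1, mean2, sigma, n1, n2):
    """Z for H0: mu1 = mu2, with a common known population S.D. sigma."""
    return (mean1 - mean2) / sqrt(sigma ** 2 * (1 / n1 + 1 / n2))


z_one = z_single_mean(3.4, 3.25, 2.62, 900)
z_two = z_two_means(67.5, 68.0, 2.5, 1000, 2000)

for label, z in (("single mean", z_one), ("two means", z_two)):
    decision = "reject H0" if abs(z) > 1.96 else "accept H0"
    print(label, round(z, 2), decision, "at the 5 % level (two tailed)")
```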

Objective: Testing the significance of a single proportion based on a large sample.
Kinds of data: In a sample of 1000 people in Maharashtra, 540 were found to be rice eaters and the rest wheat
eaters. Test whether rice and wheat are equally popular in this state at the 1 % level of significance.

Solution: We set up the null hypothesis H0: P=0.5, and H1: P≠0.5.

p = proportion of rice eaters in the sample $= \dfrac{540}{1000} = 0.54$; under H0, Q = 1 - P = 0.5
Under H0: $Z = \dfrac{p - P}{\sqrt{\dfrac{PQ}{n}}} = \dfrac{0.54 - 0.5}{\sqrt{\dfrac{0.5 \times 0.5}{1000}}} = 2.53$

Since Z = 2.53 < 2.58, the null hypothesis that rice and wheat are equally popular in the state is accepted
at the 1% level of significance.

Objective: Testing the significance of the difference of two proportions based on large samples.

Kinds of data: Random samples of 400 men and 600 women were asked whether they would like
to build a beautiful garden near their residence. Two hundred men and 325 women were in
favour of the proposal.

Solution: We set up the null hypothesis H0: P1=P2, and H1: P1≠ P2.
p1 = proportion of men in favour of the proposal = 200/400 = 0.50
p2 = proportion of women in favour of the proposal = 325/600 = 0.5417
Under H0: $Z = \dfrac{p_1 - p_2}{\sqrt{PQ\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$, which follows N(0,1),
where $P = \dfrac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \dfrac{200 + 325}{400 + 600} = 0.525$ and Q = 1 - P = 0.475
Hence Z = -1.29 and the absolute value of Z = 1.29.
The difference between the opinions of men and women in favour of building the garden near their residence is not
significant, since the calculated value |Z| = 1.29 is less than the tabulated value 1.96 at the 5% level of significance.

Objective: Testing the significance of the difference of two standard deviations based on large samples.

Kinds of data: Random samples of 1000 and 1200 members have standard deviations of 2.58 and
2.50 inches respectively.

Solution: We set up the null hypothesis H0: σ1=σ2, and H1: σ1≠ σ2.
Under H0: $Z = \dfrac{S_1 - S_2}{\sqrt{\dfrac{S_1^2}{2n_1} + \dfrac{S_2^2}{2n_2}}}$, which follows N(0,1).

Now $Z = \dfrac{2.58 - 2.50}{\sqrt{\dfrac{2.58^2}{2 \times 1000} + \dfrac{2.50^2}{2 \times 1200}}}$

Hence Z = 1.04

The two sample standard deviations do not differ significantly, since the calculated value
of Z (1.04) is less than the tabulated Z (1.96) at the 5% level of significance.
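The large sample tests for proportions and standard deviations follow the same pattern. The sketch below is illustrative (function names are assumptions); for the single proportion it uses the hypothesised P in the standard error, and for two proportions the pooled P.

```python
from math import sqrt


def z_single_proportion(successes, n, P0):
    """Z for H0: population proportion = P0, standard error based on P0."""
    p = successes / n
    return (p - P0) / sqrt(P0 * (1 - P0) / n)


def z_two_proportions(x1, n1, x2, n2):
    """Z for H0: P1 = P2, using the pooled proportion P = (x1 + x2) / (n1 + n2)."""
    p1, p2 = x1 / n1, x2 / n2
    P = (x1 + x2) / (n1 + n2)
    return (p1 - p2) / sqrt(P * (1 - P) * (1 / n1 + 1 / n2))


def z_two_sds(s1, n1, s2, n2):
    """Z for H0: sigma1 = sigma2 in large samples."""
    return (s1 - s2) / sqrt(s1 ** 2 / (2 * n1) + s2 ** 2 / (2 * n2))


z_rice = z_single_proportion(540, 1000, 0.5)
z_garden = z_two_proportions(200, 400, 325, 600)
z_sd = z_two_sds(2.58, 1000, 2.50, 1200)

print(round(z_rice, 2), round(z_garden, 2), round(z_sd, 2))
```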

15. Small Sample Test
Small Sample Test: If the sample size n is small, the distributions of various statistics, e.g. $z = \dfrac{\bar{X} - \mu}{S/\sqrt{n}}$, are far
from normal. In such cases the small sample tests developed by Student (W. S. Gosset) are used.

Student's t: Let $x_i$ (i = 1, 2, ..., n) be a random sample of size n from a normal population with mean $\mu$ and variance $\sigma^2$. Then
Student's t is defined by the statistic $t = \dfrac{\bar{X} - \mu}{S/\sqrt{n}}$, where $\bar{X}$ is the sample mean and $S^2 = \dfrac{1}{n-1}\sum(x_i - \bar{x})^2$ is an
unbiased estimate of the population variance $\sigma^2$; it follows Student's t distribution with (n-1) degrees of
freedom.

Assumptions of t test:

• The parent population from which the sample is drawn should be normal.
• The sample observations are independent, i.e. randomly selected.
• The population standard deviation $\sigma$ is unknown.

t test for single mean: is used to test whether the sample has been drawn from the population with mean $\mu_0$,
i.e. whether there is no significant difference between the sample mean $\bar{X}$ and the population mean $\mu_0$.
The test statistic $|t| = \dfrac{|\bar{X} - \mu_0|}{S/\sqrt{n}}$, where $\bar{X} = \dfrac{\sum X_i}{n}$ and $S^2 = \dfrac{1}{n-1}\sum(x_i - \bar{x})^2$, follows Student's t distribution with
(n-1) degrees of freedom.

Two independent sample t test: is used to test whether the two samples differ from one another
significantly in their means or whether they may belong to the same population. The test statistic is
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}, \quad \text{where } S^2 = \frac{\sum(X_i - \bar{X})^2 + \sum(Y_i - \bar{Y})^2}{n_1 + n_2 - 2} = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2},$$
which follows Student's t distribution with (n1+n2-2) degrees of freedom.

Paired t test: is used when the sample sizes are equal and the two samples are not independent but the
sample observations are paired together.
The test statistic is given by $|t| = \dfrac{|\bar{d}|}{S/\sqrt{n}}$,
where $d_i = X_i - Y_i$ and $S^2 = \dfrac{1}{n-1}\sum(d_i - \bar{d})^2$; it follows Student's t distribution with (n-1) degrees of
freedom.
Test of significance of null hypothesis: To test the significance of the null hypothesis, the calculated value of t is
compared with the table value of t at a certain level of significance, generally 5%. If the calculated value of |t| >
tabulated t, the null hypothesis is rejected; if the calculated value of |t| < tabulated t, the null hypothesis is
accepted.
Note: If you are unable to tell whether the samples are paired or independent, you can decide it
from the degrees of freedom of the tabulated value given in the question. In the case of independent samples the d.f. for
the table value is (n1 + n2 - 2), whereas for paired samples the d.f. is (n-1).

Objective: Test the significance of the difference between sample mean and population mean.
Kinds of data: The data relate to the IQs of ten randomly selected boys and are given below:
70, 120, 110, 101, 88, 83, 95, 98, 107 and 100. Given population mean μ=100.
Solution: Here the null hypothesis is H0: The data are consistent with the assumption of a mean
IQ of 100 in the population. First we find the sample mean
$$\bar{X} = \frac{70+120+110+101+88+83+95+98+107+100}{10} = 97.2$$
Now we calculate the sample variance $S^2 = \dfrac{1}{n-1}\sum(x_i - \bar{x})^2$

$X_i$ 70 120 110 101 88 83 95 98 107 100 Total
$X_i - \bar{X}$ -27.2 22.8 12.8 3.8 -9.2 -14.2 -2.2 0.8 9.8 2.8
$(X_i - \bar{X})^2$ 739.84 519.84 163.84 14.44 84.64 201.64 4.84 0.64 96.04 7.84 1833.60
By substituting the values in the formula we get $S^2 = \dfrac{1}{10-1} \times 1833.60 = 203.73$ and S = 14.27
Now we apply the t statistic $|t| = \dfrac{|\bar{X} - \mu_0|}{S/\sqrt{n}} = \dfrac{|97.2 - 100|}{14.27/\sqrt{10}} = \dfrac{2.8}{4.51} = 0.62$

Here the absolute value of t = 0.62 and the tabulated value of t at 9 d.f. at 0.05 level of significance = 2.26.
Since the calculated value of t is less than the tabulated value, the null hypothesis is accepted and we
conclude that the data are consistent with the assumption of a mean IQ of 100 in the population.
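The single mean t statistic can be verified with the standard library. A minimal sketch; the tabulated value 2.26 is taken from the text above, not computed.

```python
from math import sqrt
from statistics import mean, stdev

iq = [70, 120, 110, 101, 88, 83, 95, 98, 107, 100]
mu0 = 100
n = len(iq)

x_bar = mean(iq)
s = stdev(iq)                      # square root of S^2 with divisor n - 1
t_single = (x_bar - mu0) / (s / sqrt(n))

t_table = 2.26                     # t at 9 d.f., 5 % level (two tailed), from the text
decision = "reject H0" if abs(t_single) > t_table else "accept H0"
print(x_bar, round(s, 2), round(t_single, 2), decision)
```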

Objective: Testing the significance of the difference between two means in small samples.
Kinds of data: The data relate to two random samples drawn from two normal populations with the
following results:
n1=6, $\bar{X}_1$=25, S12=36, n2=8, $\bar{X}_2$=20, and S22=25, provided the variances of the two populations are equal,
i.e. σ12=σ22
Solution: The null hypothesis H0: There is no significant difference between the two means, i.e. μ1=μ2.
H1: There is a significant difference between the two means, i.e. μ1≠μ2.
Now we calculate the pooled sample mean square by $S^2 = \dfrac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$
$$S^2 = \frac{(6-1) \times 36 + (8-1) \times 25}{6 + 8 - 2} = \frac{355}{12} = 29.58$$
Apply the t-statistic
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{25 - 20}{\sqrt{29.58 \times \left(\dfrac{1}{6} + \dfrac{1}{8}\right)}} = \frac{5}{2.93} = 1.70$$

Here the absolute value of t = 1.70 and the tabulated value of t at 12 d.f. at 0.05 level of significance = 2.18.
Since the calculated value of t is less than the tabulated value, the null hypothesis is accepted and we
conclude that there is no significant difference between the two means.
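When only the summary figures are given, the pooled t statistic can be computed as below. A small sketch; the function name `pooled_t_from_summary` is illustrative.

```python
from math import sqrt


def pooled_t_from_summary(mean1, var1, n1, mean2, var2, n2):
    """Two-sample t with pooled variance; var1 and var2 use divisor n - 1."""
    df = n1 + n2 - 2
    s2 = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    t = (mean1 - mean2) / sqrt(s2 * (1 / n1 + 1 / n2))
    return t, s2, df


t_pooled, s2_pooled, df_pooled = pooled_t_from_summary(25, 36, 6, 20, 25, 8)
print(round(t_pooled, 2), round(s2_pooled, 2), df_pooled)
```

The degrees of freedom returned (n1 + n2 - 2 = 12) are the ones to use when reading the table value.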

Objective: Testing the significance of the difference of two means in small samples when the observations
of the two samples are paired together.
Kinds of data: The data relate to the increase in weights of pigs due to food A and food B and are as follows:

Pig number 1 2 3 4 5 6 7 8
Food A 49 53 51 52 47 50 52 53
Food B 52 55 52 53 50 54 54 53

Solution: H0: There is no significant difference between the two means of foods A and B, i.e. μ1=μ2.
H1: There is a significant difference between the two means, i.e. μ1≠μ2.
Take the difference of foods A and B as di = Ai - Bi and calculate $\bar{d} = \dfrac{\sum d_i}{n} = \dfrac{-16}{8} = -2$

Pig number 1 2 3 4 5 6 7 8
Food A 49 53 51 52 47 50 52 53
Food B 52 55 52 53 50 54 54 53
di=Ai-Bi -3 -2 -1 -1 -3 -4 -2 0
$(d_i - \bar{d})^2$ 1 0 1 1 1 4 0 4
Now we calculate the value of $S^2 = \dfrac{1}{n-1}\sum(d_i - \bar{d})^2 = \dfrac{12}{7} = 1.71$ and S = 1.31
Apply the t-statistic $|t| = \dfrac{|\bar{d}|}{S/\sqrt{n}} = \dfrac{2}{1.31/\sqrt{8}} = \dfrac{2}{0.46} = 4.32$

Here the absolute value of t = 4.32 and the tabulated value of t at 7 d.f. at 0.05 level of significance = 2.37.
Since the calculated value of t is greater than the tabulated value, the null hypothesis is rejected and
we conclude that the two foods taken into study for pigs differ significantly at the 5% level of significance.
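The paired t statistic for the pig data can be checked as follows. This is a minimal sketch with illustrative variable names.

```python
from math import sqrt
from statistics import mean, stdev

food_a = [49, 53, 51, 52, 47, 50, 52, 53]
food_b = [52, 55, 52, 53, 50, 54, 54, 53]

# Differences within each pair (same pig under both foods).
d = [a - b for a, b in zip(food_a, food_b)]
n = len(d)
d_bar = mean(d)
s_d = stdev(d)                     # divisor n - 1
t_paired = d_bar / (s_d / sqrt(n))

print(d, d_bar, round(s_d, 2), round(abs(t_paired), 2), "d.f. =", n - 1)
```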
Objective: Testing the significance of difference of two means in small samples when the observations
of the two samples are independent.
Kinds of data: The data relate to the increase in weights due to food A and food B in comparing the pigs,
assuming that the two samples of pigs are independent were given as follows:

Pig number 1 2 3 4 5 6 7 8
Food A 49 53 51 52 47 50 52 53
Food B 52 55 52 53 50 54 54 53
Solution: H0: There is no significant difference between two means of foods A and B. i.e.μ1=μ2.
H1: There is significant difference between two means i.e.μ1≠μ2.

Here $\bar{X} = \frac{407}{8} = 50.88$ and $\bar{Y} = \frac{423}{8} = 52.88$

Pig number   Food A (X)   Xi − X̄   (Xi − X̄)²   Food B (Y)   Yi − Ȳ   (Yi − Ȳ)²
1 49 -1.88 3.53 52 -0.88 0.77
2 53 2.12 4.49 55 2.12 4.49
3 51 0.12 0.01 52 -0.88 0.77
4 52 1.12 1.25 53 0.12 0.01
5 47 -3.88 15.05 50 -2.88 8.29
6 50 -0.88 0.77 54 1.12 1.25
7 52 1.12 1.25 54 1.12 1.25
8 53 2.12 4.49 53 0.12 0.01
Total 407 -0.04 30.88 423 -0.04 16.88
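The test statistic used here is $t = \frac{\bar{X} - \bar{Y}}{\sqrt{S^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$, where $S^2 = \frac{\sum (X_i - \bar{X})^2 + \sum (Y_i - \bar{Y})^2}{n_1 + n_2 - 2}$

Now $S^2 = \frac{30.88 + 16.88}{8 + 8 - 2} = \frac{47.75}{14} = 3.41$
Then $|t| = \frac{|50.88 - 52.88|}{\sqrt{3.41\left(\frac{1}{8} + \frac{1}{8}\right)}} = \frac{2}{0.923} = 2.165$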

Here the absolute value of t= 2.165 and tabulated value of t at 14 d.f. at 0.05 level of significance=2.15
Since the calculated value of t is greater than the tabulated value, the null hypothesis is rejected and
we conclude that the two foods taken into study for pigs differ significantly at 5% level of significance.
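The pooled-variance calculation for independent samples can be reproduced as follows. This is a minimal sketch in plain Python; `pooled_t` is a hypothetical helper name, and the data are those of the example above.

```python
import math

food_a = [49, 53, 51, 52, 47, 50, 52, 53]
food_b = [52, 55, 52, 53, 50, 54, 54, 53]


def pooled_t(x, y):
    """Return (pooled S^2, t, degrees of freedom) for two independent samples."""
    n1, n2 = len(x), len(y)
    x_bar, y_bar = sum(x) / n1, sum(y) / n2
    ss_x = sum((v - x_bar) ** 2 for v in x)          # sum of squared deviations of X
    ss_y = sum((v - y_bar) ** 2 for v in y)          # sum of squared deviations of Y
    s2 = (ss_x + ss_y) / (n1 + n2 - 2)               # pooled variance
    t = (x_bar - y_bar) / math.sqrt(s2 * (1 / n1 + 1 / n2))
    return s2, t, n1 + n2 - 2


s2, t, df = pooled_t(food_a, food_b)
print(f"S^2 = {s2:.2f}, |t| = {abs(t):.3f}, d.f. = {df}")
```

Because the calculated and tabulated values are close here, carrying more decimals than the hand calculation is a useful check.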

16. Chi-Square Test

Chi-Square Test : is used to test the null hypothesis based on some general law of nature or any reasoning.

Types of problem dealt with chi-square test:

• Whether a particular distribution is in agreement with the normal distribution or
• Whether the two given distributions are in agreement with each other or
• Whether the two observed frequencies are in any particular given ratio or
• Whether the two sets of classification are independent of each other and so on.

Conditions for the applicability of chi-square test:

• The sample observations should be independent.
• Constraints on the cell frequencies should be linear, e.g. ∑ 𝑂𝑖 = ∑ 𝐸𝑖.
• N, the total frequency, should be large (>50).
• No observed frequency should be less than 5.
• If any cell frequency is less than 5, it is pooled with the preceding or succeeding frequency so
that the pooled frequency is more than 5, and the degrees of freedom lost in pooling are adjusted accordingly.

χ² test of goodness of fit: The null hypothesis is that there is no difference between the experimental result
and theory. If Oi (i = 1, 2, …, n) is the set of observed frequencies and Ei is the corresponding set of expected
frequencies, Karl Pearson's χ² is given by
$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$, which follows the χ² distribution with (n−1) d.f.

χ² test for 2X2 contingency table: For the 2X2 contingency table

a        b        (a+b)
c        d        (c+d)
(a+c)    (b+d)    N = a+b+c+d

the χ² statistic is given by $\chi^2 = \frac{N(ad - bc)^2}{(a+c)(b+d)(a+b)(c+d)}$, where N = a+b+c+d, with 1 d.f.
Alternatively, we can calculate the expected frequency of each cell and then apply the chi-square test of
goodness of fit, e.g. $E(a) = \frac{(a+b)(a+c)}{a+b+c+d}$, $E(b) = \frac{(a+b)(b+d)}{a+b+c+d}$, and so on for the other cells.

Yates’ correction for continuity: If any one of the cell frequencies in a 2X2 contingency table is less than 5,
the pooling method would reduce the degrees of freedom to 0. In this case we apply Yates’ correction for
continuity, which consists of adding 0.5 to the cell frequency which is less than 5 and then adjusting the
remaining cell frequencies accordingly so that the marginal totals are not disturbed at all. After correction
we get
$\chi^2 = \frac{N\left[\left(a \mp \frac{1}{2}\right)\left(d \mp \frac{1}{2}\right) - \left(b \pm \frac{1}{2}\right)\left(c \pm \frac{1}{2}\right)\right]^2}{(a+c)(b+d)(a+b)(c+d)} = \frac{N\left[|ad - bc| - \frac{N}{2}\right]^2}{(a+c)(b+d)(a+b)(c+d)}$

Note: In the χ² test, if the calculated value of χ² is greater than the tabulated value, the null hypothesis is
rejected, which means there is a significant difference between the experimental result and theory.
For an m x n table the degrees of freedom are (m-1) x (n-1).

Objective: Testing whether the frequencies are equally distributed in a given dataset.
Kinds of data: 200 digits were chosen at random from a set of tables. The frequencies of the digits were as
follows.
Digits 0 1 2 3 4 5 6 7 8 9
Frequency 22 21 16 20 23 15 18 21 19 25
Solution: We set up the null hypothesis H0: The digits are equally distributed in the given dataset.
Under the null hypothesis the expected frequency of each digit would be
(sum of frequencies)/(number of digits) = 200/10 = 20
Then the value of $\chi^2 = \frac{(22-20)^2 + (21-20)^2 + (16-20)^2 + (20-20)^2 + (23-20)^2 + (15-20)^2 + (18-20)^2 + (21-20)^2 + (19-20)^2 + (25-20)^2}{20}$
$= \frac{1}{20}(4+1+16+0+9+25+4+1+1+25) = \frac{86}{20} = 4.3$
The tabulated value of χ² at 9 d.f. and 5 % level of significance is 16.92. Since the calculated value of χ² is
less than the tabulated value, the null hypothesis is accepted. Hence we conclude that the digits are equally
distributed in the given dataset.
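The same statistic can be obtained programmatically. The sketch below uses plain Python and assumes, as in the solution, that every digit has the same expected frequency; the tabulated value is not computed and must still be taken from a χ² table.

```python
# Observed frequencies of the digits 0 to 9 from the example.
observed = [22, 21, 16, 20, 23, 15, 18, 21, 19, 25]

# Under H0 every digit has the same expected frequency.
expected = sum(observed) / len(observed)

chi2 = sum((o - expected) ** 2 / expected for o in observed)
df = len(observed) - 1

print(f"expected = {expected}, chi-square = {chi2:.2f}, d.f. = {df}")
```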

Objective: Testing the goodness of fit between experimental result and theory.
Kinds of data: The theory predicts the proportion of beans in the four groups A, B, C and D should be in the
ratio 9:3:3:1. In an experiment having 1600 beans, the observed numbers in the four groups were found to be
882,313,287 and 118.
Solution: We set up the null hypothesis H0: There is no significant difference between experimental result
and theory.
The expected frequencies can be calculated as follows:
Total number of beans = 882 + 313 + 287 + 118 = 1600. The given ratio is 9:3:3:1.
E(882) = (9/16) x 1600 = 900 ;  E(313) = (3/16) x 1600 = 300
E(287) = (3/16) x 1600 = 300 ;  E(118) = (1/16) x 1600 = 100
Thus, χ² for testing the goodness of fit is
$\chi^2 = \sum_{i=1}^{4} \frac{(O_i - E_i)^2}{E_i} = \frac{(882-900)^2}{900} + \frac{(313-300)^2}{300} + \frac{(287-300)^2}{300} + \frac{(118-100)^2}{100}$
= 0.36 + 0.56 + 0.56 + 3.24 = 4.72
Now, d.f. = 4 - 1 = 3. The tabulated χ² at 5% level of significance and 3 d.f. = 7.815
Since the calculated value of χ² is less than the tabulated value, it is not significant. Hence, the null
hypothesis is accepted and we conclude that there is good correspondence between theory and experimental
result.
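When the expected frequencies follow a theoretical ratio, the computation can be sketched as below. The function name `chi_square_for_ratio` is an illustrative assumption; only plain Python is used.

```python
def chi_square_for_ratio(observed, ratio):
    """Return (expected frequencies, chi-square) for a hypothesised ratio."""
    total = sum(observed)
    parts = sum(ratio)
    expected = [total * r / parts for r in ratio]    # e.g. 9/16 of 1600
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return expected, chi2


expected, chi2 = chi_square_for_ratio([882, 313, 287, 118], [9, 3, 3, 1])
print(expected, round(chi2, 3))
```

The unrounded result differs slightly in the last decimal from the hand calculation, which sums terms already rounded to two places.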

Objective: Testing the significance of independence of two attributes using the χ² test.
Kinds of data: The data relate to the sample of married women according to their level of education and
their marriage adjustment score
                      Marriage adjustment score
Level of education    Very low    Low    High    Very High    Total
College               24          97     62      58           241
High School           22          28     30      41           121
Middle School         32          10     11      20           73
Total                 78          135    103     119          435
Solution: We set up the null hypothesis H0: The two attributes, level of education and marriage adjustment
score, are independent.
The expected frequencies can be calculated as follows:
We sum the rows and columns. The three row totals are 241, 121 and 73 respectively. The four column
totals are 78, 135, 103 and 119 respectively, giving the grand total as 435.
E(24) = (78 x 241)/435 = 43.2 ;   E(97) = (135 x 241)/435 = 74.8 ;   E(22) = (78 x 121)/435 = 21.7
E(28) = (135 x 121)/435 = 37.6 ;  E(32) = (78 x 73)/435 = 13.1 ;     E(10) = (135 x 73)/435 = 22.7
E(62) = (103 x 241)/435 = 57.1 ;  E(58) = (119 x 241)/435 = 65.9 ;   E(30) = (103 x 121)/435 = 28.7
E(41) = (119 x 121)/435 = 33.1 ;  E(11) = (103 x 73)/435 = 17.3 ;    E(20) = (119 x 73)/435 = 20.1
Thus, χ² for testing the independence of the two attributes is
$\chi^2 = \frac{(24-43.2)^2}{43.2} + \frac{(22-21.7)^2}{21.7} + \frac{(32-13.1)^2}{13.1} + \frac{(97-74.8)^2}{74.8} + \frac{(28-37.6)^2}{37.6} + \frac{(10-22.7)^2}{22.7} + \frac{(62-57.1)^2}{57.1} + \frac{(30-28.7)^2}{28.7}$
$+ \frac{(11-17.3)^2}{17.3} + \frac{(58-65.9)^2}{65.9} + \frac{(41-33.1)^2}{33.1} + \frac{(20-20.1)^2}{20.1} = 57.6$
Now, d.f. = (4-1) x (3-1) = 6. The tabulated value of χ² at 5% level of significance and 6 d.f. = 12.592
Since the calculated value of χ² is more than the tabulated value, it is significant. Hence, we reject the
null hypothesis and conclude that level of education and marriage adjustment score are not independent,
i.e. there is an association between them.
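For an m x n table the expected frequencies and the statistic follow mechanically from the marginal totals. A minimal plain-Python sketch, with `chi_square_independence` as a hypothetical helper name:

```python
# Observed frequencies: rows = College, High School, Middle School;
# columns = Very low, Low, High, Very High.
observed = [
    [24, 97, 62, 58],
    [22, 28, 30, 41],
    [32, 10, 11, 20],
]


def chi_square_independence(table):
    """Return (expected table, chi-square, d.f.) for an m x n contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected frequency of a cell = (row total x column total) / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return expected, chi2, df


expected, chi2, df = chi_square_independence(observed)
print(f"chi-square = {chi2:.1f} with {df} d.f.")
```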

Objective: Computation of the χ² value in the case of a contingency table to test the independence of attributes
where one of the cell frequencies is less than five.
Kinds of data: The data relate to the height of father and their youngest son at the age of 40 years.
Height of youngest Height of fathers
Sons Tall Short Total
Tall 8 2 10
Short 7 6 13
Total 15 8 23
Solution: Here the null hypothesis is H0: The height of the youngest son is independent of the height of the
father, and H1: They are dependent on each other.
Since one cell frequency is less than five, we apply Yates’ correction and correct the contingency
table as given below.
Height of youngest Height of fathers
Sons Tall Short Total
Tall 7.5 (a) 2.5 (b) 10
Short 7.5 (c) 5.5 (d) 13
Total 15 8 23
Now compute the value of $\chi^2 = \frac{N(ad - bc)^2}{(a+b)(a+c)(c+d)(b+d)}$ from the corrected table:
$\chi^2 = \frac{23(7.5 \times 5.5 - 2.5 \times 7.5)^2}{10 \times 13 \times 15 \times 8} = 0.746$
The table value of χ² at 5 % level of significance and 1 d.f. = 3.841
Since the calculated value of χ² is less than the tabulated value of χ², we do not reject the null hypothesis and
conclude that the height of the youngest sons is independent of the height of their fathers.
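The corrected statistic can also be computed directly from the uncorrected cell frequencies using the second form of Yates' formula given earlier in this chapter. A minimal sketch, with `yates_chi_square` as an illustrative name:

```python
def yates_chi_square(a, b, c, d):
    """Chi-square for a 2x2 table [[a, b], [c, d]] with Yates' continuity correction."""
    n = a + b + c + d
    numerator = n * (abs(a * d - b * c) - n / 2) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Uncorrected frequencies: tall sons (8, 2), short sons (7, 6).
chi2 = yates_chi_square(8, 2, 7, 6)
print(round(chi2, 3))
```

Both routes give the same value because |ad − bc| − N/2 = 34 − 11.5 = 22.5 equals 7.5 x 5.5 − 2.5 x 7.5 from the corrected table.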
17. Design of Experiment
(CRD, RBD, LSD, Split and Strip Plot Design)

It is the planning (sequence of steps taken well in time) of an experiment to obtain appropriate data with
respect to problem under investigation.
Principles of experimental design: There are 3 basic principles of experimental design:
Replication: Repetition of treatment under investigation is known as replication.
Randomization: The process of assigning treatment to various experimental unit in purely chance manner is
known as randomization.
Local Control: The process of reducing the experimental error by dividing the relatively heterogeneous
experimental area into homogeneous blocks is known as local control.
Type of Design:
Completely Randomized Design: This design is used when the experimental material is homogeneous eg.
laboratory or pot experiment. The principle of local control is absent in CRD.

Randomized Block Design: This design is used when the fertility gradient of the soil is in one direction only.
The whole field is divided into a number of equal blocks perpendicular to the direction of the fertility
gradient, and each block is then divided into a number of plots equal to the number of treatments. In RBD we
try to minimize the within-block variation while keeping the between-block variation as large as possible.

Latin Square Design: This design is used when the fertility gradient of the soil runs in two directions. The
field is then divided into homogeneous blocks in two ways. The blocks in one direction are known as rows
whereas the blocks in the other direction are known as columns. In LSD the number of replications must be
equal to the number of treatments. The numbers of rows, columns and treatments should be equal, and the
randomization is done in such a way that each treatment occurs once and only once in each row and column.

Split Plot Design: is used when there are two types of treatments and both are to be estimated with different
precision. The treatment which is to be estimated with greater precision is allotted as subplot treatment. In
split plot design the effect of the subplot treatments and the interactions with the main plot treatments can be
estimated more precisely.

Strip Plot Design: is used when there are two factors and both of them require large experimental units,
for example four levels of spacing and three methods of ploughing. In strip plot design the interaction effect
is estimated with greater precision. The experimental area is divided into three kinds of plots, namely
vertical strips, horizontal strips and intersection plots.

All the analysis of experimental design is based on the analysis of variance table.

ANOVA: It is a technique by which the total variation in any experiment may be split into several
physically assignable components. In ANOVA we determine the source of variation and check whether this
source of variation is significant or not. To check the significance of source of variation F-test is used.

Format of ANOVA:
Analysis of variance table
Source of variation    Degree of freedom    Sum of squares    Mean sum of squares    Fcal    Ftab (5 %) at (source, error) d.f.
Objective: Analyzing the data of completely randomized design with unequal number of replication per
treatment.
Kinds of data: The following data relate to a varietal trial on green gram (in coded form) that was
conducted using CRD having four varieties V1, V2, V3 and V4 with 3, 6, 5 and 4 replications
respectively. The results are given below (kg/plot):

Varieties Seed yield of greengram (kg/plot) Total Mean


V1 2 1 4 7 2.33
V2 3 4 2 1 5 6 17 2.83
V3 1 5 4 2 4 31 6.20
V4 4 6 3 5 18 4.50
Total 73

Solution: Here we test whether the varieties differ significantly or not.


Grand total = 73
Correction factor (CF) = 73²/18 = 296.06
Total sum of squares = (2² + 1² + ⋯ + 3² + 5²) − 296.06 = 80.94
Variety sum of squares = 7²/3 + 17²/6 + 31²/5 + 18²/4 − CF
= 337.70 − 296.06 = 41.64
Error sum of squares = Total sum of squares − Variety sum of squares
= 80.94 − 41.64 = 39.30

Analysis of variance table


Source of variation         Degree of freedom    Sum of squares    Mean sum of squares    Fcal        Ftab (5 %) at (3,14) d.f.
Between varieties           3                    41.64             13.88                  4.94 (S)    3.34
Within varieties (Error)    14                   39.30             2.80
Total                       17                   80.94

Since Fcal >Ftab, F test indicates that there are significant differences between the treatment means.
The individual varieties can be compared with the help of critical difference.
In case of unequal numbers of replications the standard error of the difference between treatment means varies
from pair to pair.
Standard error of difference between V1 and V2 = $\sqrt{2.80\left(\frac{1}{3} + \frac{1}{6}\right)} = 1.18$
∴ Critical Difference = (S.E.)diff x t 0.05 at 14 d.f. = 1.18 x 2.14 = 2.52
Similarly, we can compute the value of C.D. for the other treatment comparisons.
The following treatment comparisons can be made on the basis of the C.D. values, with the varieties arranged in
descending order of mean yield:
V3 (6.20) V4 (4.50) V2 (2.83) V1 (2.33)
Conclusions: Variety V3 gives the highest yield; it is at par with variety V4 but differs significantly from
variety V2 and variety V1. Varieties V4, V2 and V1 are at par with one another, since none of their mean
differences exceeds the corresponding C.D., and variety V1 gives the lowest yield of green gram.
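Because the critical difference changes from pair to pair, it is convenient to tabulate all comparisons at once. The sketch below works only from the summary values quoted above (means, replications, error mean square 2.80 and tabulated t = 2.14); the names `critical_difference` and `results` are illustrative assumptions.

```python
import math
from itertools import combinations

means = {"V1": 2.33, "V2": 2.83, "V3": 6.20, "V4": 4.50}
reps = {"V1": 3, "V2": 6, "V3": 5, "V4": 4}
EMS = 2.80      # error mean square from the ANOVA table
T_TAB = 2.14    # tabulated t at 14 d.f. and 5 % level, as used in the text


def critical_difference(r_i, r_j, ems=EMS, t_tab=T_TAB):
    """C.D. for two treatments replicated r_i and r_j times."""
    return t_tab * math.sqrt(ems * (1 / r_i + 1 / r_j))


results = {}
for vi, vj in combinations(means, 2):
    diff = abs(means[vi] - means[vj])
    cd = critical_difference(reps[vi], reps[vj])
    results[(vi, vj)] = (diff, cd, diff > cd)
    print(f"{vi} vs {vj}: difference = {diff:.2f}, C.D. = {cd:.2f}, significant = {diff > cd}")
```

A pair is declared significantly different only when its mean difference exceeds its own C.D.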

Objective: Analyzing the data of completely randomized design with equal number of replications per
treatment.
Kinds of data : The data relate to the five varieties of sesame using CRD conducted in a greenhouse with
four pots per variety.
Varieties Seed yield of sesame Total Mean
(gm/plot)
V1 25 22 22 18 87 21.75
V2 25 28 26 25 104 26
V3 24 24 18 21 87 21.75
V4 20 17 18 19 74 18.5
V5 14 15 15 11 55 13.75
Total 407

Solution: Here we test whether the varieties differ significantly or not.


The Grand total = 407.
Correction factor (CF) = 407²/20 = 8282.45
Total sum of squares = (25² + 22² + ⋯ + 11²) − CF
= 8685 − CF = 402.55
Variety sum of squares = (87² + 104² + ⋯ + 55²)/4 − CF
= 8613.75 − CF = 331.30
Error sum of squares = 402.55 − 331.30 = 71.25

Analysis of variance table


Source of variation         Degree of freedom    Sum of squares    Mean sum of squares    Fcal         Ftab (5 %) at (4,15) d.f.
Between varieties           4                    331.30            82.83                  17.44 (S)    3.06
Within varieties (Error)    15                   71.25             4.75
Total                       19                   402.55
Since Fcal > Ftab, the F test indicates that there are significant differences between the treatment means.
Standard error of difference between two treatments = $\sqrt{\frac{2 \times EMSS}{r}} = \sqrt{\frac{2 \times 4.75}{4}} = 1.541$
∴ Critical Difference = (S.E.)diff x t 0.05 at 15 d.f. = 1.541 x 2.13 = 3.28
As per F test the varieties differ significantly, the individual varieties can be compared with the help of
critical difference.
The varieties can be compared by setting them in the descending order of their yields in the following
manner.
V2(26) V1(21.75) V3(21.75) V4(18.5) V5(13.75)

If the difference between two varieties means is greater than the critical difference the varieties differ
significantly. The varieties which do not differ significantly have been underlined by a bar.
Conclusions: The variety V2 gives significantly higher yield than all other varieties. The varieties V1, V3
and V4 are at par but differ significantly from varieties V2 and V5. The variety V5 gives the lowest yield of
sesame.
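The sums of squares for a completely randomized design can be reproduced with the correction-factor method used above. This is a minimal plain-Python sketch; `one_way_anova` is a hypothetical helper, and the tabulated F and t values are still read from tables.

```python
import math

# Seed yield of sesame for five varieties, four pots each.
yields = {
    "V1": [25, 22, 22, 18],
    "V2": [25, 28, 26, 25],
    "V3": [24, 24, 18, 21],
    "V4": [20, 17, 18, 19],
    "V5": [14, 15, 15, 11],
}


def one_way_anova(groups):
    """Return the CRD analysis of variance quantities as a dict."""
    obs = [y for g in groups.values() for y in g]
    n, k = len(obs), len(groups)
    cf = sum(obs) ** 2 / n                                   # correction factor
    total_ss = sum(y * y for y in obs) - cf
    treat_ss = sum(sum(g) ** 2 / len(g) for g in groups.values()) - cf
    error_ss = total_ss - treat_ss
    ms_treat = treat_ss / (k - 1)
    ms_error = error_ss / (n - k)
    return {"cf": cf, "total_ss": total_ss, "treat_ss": treat_ss,
            "error_ss": error_ss, "ms_error": ms_error, "f": ms_treat / ms_error}


res = one_way_anova(yields)
se_diff = math.sqrt(2 * res["ms_error"] / 4)   # r = 4 replications per variety
print({key: round(val, 2) for key, val in res.items()}, round(se_diff, 3))
```

The same function also handles unequal replication, since each treatment total is divided by its own number of observations.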

Objective : Analyzing the data of randomized complete block design and the computation of efficiency as
compared to completely randomized design.
Kinds of data : The data relate to the yields of 6 wheat varieties(in rounded figures) in an experiment with
4 randomized blocks.
Block yield of Wheat varieties Total
V1 V2 V3 V4 V5 V6
1 27 30 27 16 16 24 140
2 27 28 22 15 17 22 131
3 28 31 34 14 17 22 146
4 38 39 36 19 15 26 173
Total 120 128 119 64 65 94 590
Mean 30 32 29.75 16 16.25 23.5

Solution: Here we test whether the varieties differ significantly or not.


Correction factor (CF) = 590²/24 = 14504.17
Total sum of squares = (27² + 30² + ⋯ + 26²) − CF = 15834 − CF = 1329.83
Block sum of squares = (140² + 131² + 146² + 173²)/6 − CF = 14667.67 − CF = 163.50
Variety sum of squares = (120² + 128² + 119² + 64² + 65² + 94²)/4 − CF = 15525.50 − CF = 1021.33
Error sum of squares = 1329.83 − 163.50 − 1021.33 = 145.00
Analysis of variance table
Source of variation         Degree of freedom    Sum of squares    Mean sum of squares    Fcal     Ftab (5 %)
Block                       3                    163.50            54.50                  5.64     3.29 at (3,15) d.f.
Between varieties           5                    1021.33           204.27                 21.12    2.90 at (5,15) d.f.
Within varieties (Error)    15                   145.00            9.67
Total                       23                   1329.83
Since Fcal > Ftab, the F test indicates that there are significant differences between the variety means.
The individual varieties can be compared with the help of the critical difference.
Standard error of difference between two treatments = $\sqrt{\frac{2 \times EMSS}{r}} = \sqrt{\frac{2 \times 9.67}{4}} = 2.19$
∴ Critical Difference = (S.E.)diff x t 0.05 at 15 d.f. = 2.19 x 2.13 = 4.66
The varieties can be compared by setting them in the descending order of their yields in the following
manner.
V2(32) V1(30) V3(29.75) V6(23.5) V5(16.25) V4(16.0)
If the difference between two varieties means is greater than the critical difference the varieties differ
significantly. The varieties which do not differ significantly have been underlined by a bar.
Conclusions: The variety V4 gives the lowest yield and does not differ significantly from V5, while V6
differs significantly from all the other varieties (V4, V5, V1, V3 and V2). The variety V2 gives the highest
yield; it is at par with V1 and V3 but differs significantly from V4, V5 and V6.
Relative Efficiency of RBD as compared to CRD:
$E = \frac{r(t-1)S_E^2 + (r-1)S_B^2}{(rt-1)S_E^2} = \frac{4(6-1)(9.67) + (4-1)(54.50)}{(4 \times 6 - 1)(9.67)} = 1.60$

Here the d.f. for error is less than 20, so we use the precision factor
$\frac{(n_2+1)(n_1+3)}{(n_1+1)(n_2+3)}$,
where n1 and n2 are the error degrees of freedom of the two experiments; it adjusts the relative
efficiency of the second experiment as compared to the first. With n1 = 18 (CRD) and n2 = 15 (RBD),
$\frac{(15+1)(18+3)}{(18+1)(15+3)} = 0.982$
The corrected relative efficiency is then given by
1.60 x 0.982 = 1.57
Therefore, the gain in efficiency in RBD is 57 % as compared to completely randomized design.
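The whole randomized block analysis, including the relative efficiency, can be sketched in plain Python as follows. The function names `rbd_anova` and `efficiency_vs_crd` are assumptions made for illustration.

```python
# Rows = blocks 1 to 4, columns = varieties V1 to V6.
yields = [
    [27, 30, 27, 16, 16, 24],
    [27, 28, 22, 15, 17, 22],
    [28, 31, 34, 14, 17, 22],
    [38, 39, 36, 19, 15, 26],
]


def rbd_anova(table):
    """Return sums of squares and mean squares for a randomized block design."""
    r, t = len(table), len(table[0])                 # blocks, treatments
    cf = sum(map(sum, table)) ** 2 / (r * t)
    total_ss = sum(y * y for row in table for y in row) - cf
    block_ss = sum(sum(row) ** 2 for row in table) / t - cf
    treat_ss = sum(sum(col) ** 2 for col in zip(*table)) / r - cf
    error_ss = total_ss - block_ss - treat_ss
    return {
        "total_ss": total_ss, "block_ss": block_ss,
        "treat_ss": treat_ss, "error_ss": error_ss,
        "ms_block": block_ss / (r - 1),
        "ms_treat": treat_ss / (t - 1),
        "ms_error": error_ss / ((r - 1) * (t - 1)),
    }


def efficiency_vs_crd(ms_block, ms_error, r, t):
    """Relative efficiency of RBD over CRD, before the precision-factor correction."""
    return (r * (t - 1) * ms_error + (r - 1) * ms_block) / ((r * t - 1) * ms_error)


res = rbd_anova(yields)
f_variety = res["ms_treat"] / res["ms_error"]
efficiency = efficiency_vs_crd(res["ms_block"], res["ms_error"], r=4, t=6)
precision = (15 + 1) * (18 + 3) / ((18 + 1) * (15 + 3))      # n1 = 18, n2 = 15
print(round(f_variety, 2), round(efficiency, 2), round(efficiency * precision, 2))
```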

Objective : Analyzing the data of Latin square design


Kinds of data : The yields of five varieties of wheat tried in a LSD along with the plan have been given
below in oz.
E (68) B(78) D(80) C (122) A(100)
D(72) E(73) A(70) B(58) C(129)
A(78) C(99) E(57) D(75) B(72)
C(113) A(69) B(60) E(73) D(64)
B(48) D (70) C(76) A(82) E (73)

Solution: To obtain the row, column and treatment totals, the following table is prepared.

Rows             Columns                                            Row       Treatment
                 I         II        III       IV        V          totals    Totals       Means
I                E(68)     B(78)     D(80)     C(122)    A(100)     448       TA = 399     79.8
II               D(72)     E(73)     A(70)     B(58)     C(129)     402       TB = 316     63.2
III              A(78)     C(99)     E(57)     D(75)     B(72)      381       TC = 539     107.8
IV               C(113)    A(69)     B(60)     E(73)     D(64)      379       TD = 361     72.2
V                B(48)     D(70)     C(76)     A(82)     E(73)      349       TE = 344     68.8
Column totals    379       389       343       410       438        GT = 1959
Now correction factor (CF) = 1959²/25 = 153507.24
Total sum of squares = (68² + 78² + ⋯ + 73²) − CF = 162941 − 153507.24 = 9433.76
Sum of squares (Rows) = (448² + 402² + 381² + 379² + 349²)/5 − CF = 154582.2 − 153507.24 = 1074.96
Sum of squares (Columns) = (379² + 389² + 343² + 410² + 438²)/5 − CF = 154511 − 153507.24 = 1003.76
Sum of squares (Treatments) = (399² + 316² + 539² + 361² + 344²)/5 − CF = 159647 − 153507.24 = 6139.76
Sum of squares (Error) = 9433.76 − 1074.96 − 1003.76 − 6139.76 = 1215.28

Analysis of variance table

Source of variation    Degree of freedom    Sum of squares    Mean sum of squares    Fcal     Ftab (5 %) at (4,12) d.f.
Row                    4                    1074.96           268.74                 -        -
Column                 4                    1003.76           250.94                 -        -
Treatment              4                    6139.76           1534.94                15.15    3.26
Error                  12                   1215.28           101.27
Total                  24                   9433.76
Since Fcal > Ftab, there are significant differences among the treatment means.
Standard error of difference between two treatments = $\sqrt{\frac{2 \times EMSS}{r}} = \sqrt{\frac{2 \times 101.27}{5}} = 6.36$
∴ Critical Difference = (S.E.)diff x t 0.05 at 12 d.f. = 6.36 x 2.179 = 13.86
The varieties can be compared by setting them in the descending order of their yields in the following
manner.
Tc(107.8) Ta(79.8) Td(72.2) Te(68.8) Tb(63.2)
If the difference between two varieties means is greater than the critical difference the varieties differ
significantly. The varieties which do not differ significantly can be underlined by a bar.
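Conclusions: The variety C gives the highest yield and differs significantly from all the other varieties.
The varieties A, D and E are at par with each other, as are D, E and B, but A (79.8) differs significantly
from B (63.2) because their difference of 16.6 exceeds the critical difference of 13.86.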
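Since row, column and treatment totals are easy to mix up by hand, a short script that derives all three sums of squares from the layout is a useful cross-check. This is a minimal sketch in plain Python; `lsd_sums_of_squares` is an illustrative name.

```python
from collections import defaultdict

# Layout of the 5 x 5 Latin square: (variety, yield in oz) for each row and column.
layout = [
    [("E", 68), ("B", 78), ("D", 80), ("C", 122), ("A", 100)],
    [("D", 72), ("E", 73), ("A", 70), ("B", 58), ("C", 129)],
    [("A", 78), ("C", 99), ("E", 57), ("D", 75), ("B", 72)],
    [("C", 113), ("A", 69), ("B", 60), ("E", 73), ("D", 64)],
    [("B", 48), ("D", 70), ("C", 76), ("A", 82), ("E", 73)],
]


def lsd_sums_of_squares(square):
    """Return total, row, column, treatment and error sums of squares."""
    m = len(square)
    values = [[y for _, y in row] for row in square]
    treat_totals = defaultdict(float)
    for row in square:
        for name, y in row:
            treat_totals[name] += y
    cf = sum(map(sum, values)) ** 2 / (m * m)
    total = sum(y * y for row in values for y in row) - cf
    rows = sum(sum(row) ** 2 for row in values) / m - cf
    cols = sum(sum(col) ** 2 for col in zip(*values)) / m - cf
    treats = sum(t * t for t in treat_totals.values()) / m - cf
    return {"total": total, "rows": rows, "columns": cols,
            "treatments": treats, "error": total - rows - cols - treats}


ss = lsd_sums_of_squares(layout)
f_treat = (ss["treatments"] / 4) / (ss["error"] / 12)
print({key: round(val, 2) for key, val in ss.items()}, round(f_treat, 2))
```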

Objective : Analyzing the data of Latin square design and the computation of efficiency as compared to
RCBD and CRD.
Kinds of data : The data relate to a Latin square design used to test the efficiency of methods of spacing:
A, 2"; B, 4"; C, 6"; D, 8"; E, 10". The yields in grams of plots of millet arranged in LSD and the layout of the
treatments are given below:
Rows             Columns                                            Row       Treatment
                 I         II        III       IV        V          totals    Totals        Means
I                B(257)    E(230)    A(279)    C(287)    D(202)     1255      TA = 1349     269.8
II               D(245)    A(283)    E(245)    B(280)    C(260)     1313      TB = 1314     262.8
III              E(182)    B(252)    C(280)    D(246)    A(250)     1210      TC = 1262     252.4
IV               A(203)    C(204)    D(227)    E(193)    B(259)     1086      TD = 1191     238.2
V                C(231)    D(271)    B(266)    A(334)    E(338)     1440      TE = 1188     237.6
Column totals    1118      1240      1297      1340      1309       GT = 6304
In order to test the significance of the differences among the treatment means, we analyze the above
data.
The computation of the sums of squares is given below:
Correction factor (CF) = 6304²/25 = 1589617
Total sum of squares = (257² + 230² + ⋯ + 338²) − CF = 1626188 − 1589617 = 36571
Sum of squares (Rows) = (1255² + ⋯ + 1440²)/5 − CF = 1603218 − 1589617 = 13601
Sum of squares (Columns) = (1118² + ⋯ + 1309²)/5 − CF = 1595763 − 1589617 = 6146
Sum of squares (Treatments) = (1349² + 1314² + 1262² + 1191² + 1188²)/5 − CF = 1593773 − 1589617 = 4156
Sum of squares (Error) = 36571 − 13601 − 6146 − 4156 = 12668

Analysis of variance table


Source of variation    Degree of freedom    Sum of squares    Mean sum of squares    Fcal    Ftab (5 %) at (4,12) d.f.
Row                    4                    13601             3400                   3.22    3.26
Column                 4                    6146              1536                   1.45    3.26
Treatment              4                    4156              1039                   0.98    3.26
Error                  12                   12668             1056
Total                  24                   36571

On comparison of the calculated and tabulated values of F, we find that rows, columns and spacings
are not significant. This is probably due to the shape of the plots. They were long and narrow; hence the
columns are narrow strips running the length of the rectangular area. Under these conditions, the Latin
square may have little advantage on the average over a randomized block plan.
In order to compare means of pairs of treatments, we look up t 0.05 at 12 degrees of freedom = 2.18.
For two means to differ significantly, their difference must be at least
$2.18 \times \sqrt{\frac{2 \times 1056}{5}} = 44.80$

Efficiency of LSD as compared to RBD, when rows are taken as blocks

$E = \frac{S_C^2 + (m-1)S_E^2}{m \, S_E^2} = \frac{1536 + (5-1)(1056)}{5 \times 1056} = 1.09$

Since the degrees of freedom for error are less than 20, the precision factor is computed
to obtain the corrected efficiency of LSD as compared to RCBD and CRD.
The precision factor is $\frac{(n_2+1)(n_1+3)}{(n_1+1)(n_2+3)} = \frac{(12+1)(16+3)}{(16+1)(12+3)} = 0.9686$

The corrected efficiency = 0.9686 x 1.09 = 1.06

The gain in efficiency is only 6%.
Efficiency of LSD as compared to RBD, when columns are taken as blocks
$E = \frac{S_R^2 + (m-1)S_E^2}{m \, S_E^2} = \frac{3400 + (5-1)(1056)}{5 \times 1056} = 1.44$
The precision factor is $\frac{(n_2+1)(n_1+3)}{(n_1+1)(n_2+3)} = \frac{(12+1)(16+3)}{(16+1)(12+3)} = 0.9686$, the same as when rows are taken as blocks.

The corrected efficiency = 0.9686 x 1.44 = 1.40

The gain in efficiency is 40%.
It reveals that the rows have not played any significant role to control the errors.

Efficiency of LSD as compared to CRD

The formula for relative efficiency is given by
$E = \frac{S_R^2 + S_C^2 + (m-1)S_E^2}{(m+1) \, S_E^2} = \frac{3400 + 1536 + (5-1)(1056)}{(5+1) \times 1056} = 1.45$
The precision factor = $\frac{(n_2+1)(n_1+3)}{(n_1+1)(n_2+3)} = \frac{(12+1)(20+3)}{(20+1)(12+3)} = 0.9492$

The corrected efficiency is given by 0.9492 x 1.45 = 1.37

The gain in efficiency is 37%.
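The three efficiency comparisons share the same ingredients, so they can be sketched as small functions. The names below are illustrative assumptions; the mean squares are the rounded values from the analysis of variance table (rows 3400, columns 1536, error 1056) with m = 5.

```python
def efficiency_vs_rbd(ms_removed, ms_error, m):
    """LSD relative to RBD; ms_removed is the mean square of the classification
    (rows or columns) that the RBD would NOT use as blocks."""
    return (ms_removed + (m - 1) * ms_error) / (m * ms_error)


def efficiency_vs_crd(ms_rows, ms_cols, ms_error, m):
    """LSD relative to a completely randomized design."""
    return (ms_rows + ms_cols + (m - 1) * ms_error) / ((m + 1) * ms_error)


def precision_factor(n1, n2):
    """Correction for error d.f.: n1 for the first design, n2 for the second."""
    return (n2 + 1) * (n1 + 3) / ((n1 + 1) * (n2 + 3))


MS_ROWS, MS_COLS, MS_ERROR, M = 3400, 1536, 1056, 5

rows_as_blocks = efficiency_vs_rbd(MS_COLS, MS_ERROR, M)
cols_as_blocks = efficiency_vs_rbd(MS_ROWS, MS_ERROR, M)
versus_crd = efficiency_vs_crd(MS_ROWS, MS_COLS, MS_ERROR, M)

print(round(rows_as_blocks * precision_factor(16, 12), 2))
print(round(cols_as_blocks * precision_factor(16, 12), 2))
print(round(versus_crd * precision_factor(20, 12), 2))
```

Here 16 and 20 are the error degrees of freedom that an RBD and a CRD of the same size would have, and 12 is the error d.f. of the Latin square.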

Objective: Analysis of data in relation to split plot experiment.
Kinds of data : The data relate to the yields of 3 varieties of Alfalfa obtained in a split plot experiment
with 4 dates of final cutting. The yields are reported in tons per acre.
Yields of Alfalfa in a split plot experiment.
Variety    Date      Block                                              Total
                     1        2        3        4        5        6
Ladak      A         2.17     1.88     1.62     2.34     1.58     1.66     11.25
           B         1.58     1.26     1.22     1.59     1.25     0.94     7.84
           C         2.29     1.60     1.67     1.91     1.39     1.12     9.98
           D         2.23     2.01     1.82     2.10     1.66     1.10     10.92
           Total     8.27     6.75     6.33     7.94     5.88     4.82     39.99
Cossack    A         2.33     2.01     1.70     1.78     1.42     1.35     10.59
           B         1.38     1.30     1.85     1.09     1.13     1.06     7.81
           C         1.86     1.70     1.81     1.54     1.67     0.88     9.46
           D         2.27     1.81     2.01     1.40     1.31     1.06     9.86
           Total     7.84     6.82     7.37     5.81     5.53     4.35     37.72
Ranger     A         1.75     1.95     2.13     1.78     1.31     1.30     10.22
           B         1.52     1.47     1.80     1.37     1.01     1.31     8.48
           C         1.55     1.61     1.82     1.56     1.23     1.13     8.90
           D         1.56     1.72     1.99     1.55     1.51     1.33     9.66
           Total     6.38     6.75     7.74     6.26     5.06     5.07     37.26
G.Total              22.49    20.32    21.44    20.01    16.47    14.24    114.97

Solution: First we prepare two way table of main plot treatment and replication. Here main plot is variety (3)
whereas subplot is dates(4).

Replication    Variety (RV)
               Ladak     Cossack    Ranger    Total
I              8.27      7.84       6.38      22.49
II             6.75      6.82       6.75      20.32
III            6.33      7.37       7.74      21.44
IV             7.94      5.81       6.26      20.01
V              5.88      5.53       5.06      16.47
VI             4.82      4.35       5.07      14.24
Total          39.99     37.72      37.26     114.97

Correction factor (CF) = 114.97²/72 = 183.58
Total sum of squares = (2.17² + 1.88² + ⋯ + 1.51² + 1.33²) − 183.58 = 9.12
Block sum of squares = (22.49² + 20.32² + 21.44² + 20.01² + 16.47² + 14.24²)/(4 x 3) − 183.58 = 4.15
Variety sum of squares = (39.99² + 37.72² + 37.26²)/(6 x 4) − 183.58 = 183.76 − 183.58 = 0.18
Total sum of squares from RV table = ∑RV²/b − CF = (8.27² + 7.84² + ⋯ + 5.07²)/4 − 183.58 = 189.27 − 183.58 = 5.69
Main plot Error S S or Error I = TSS of RV − BSS − VSS = 5.69 − 4.15 − 0.18 = 1.36
Next we prepare main plot x subplot table

Variety Date
A B C D Total
Ladak 11.25 7.84 9.98 10.92 39.99
Cossack 10.59 7.81 9.46 9.86 37.72
Ranger 10.22 8.48 8.9 9.66 37.26
Total 32.06 24.13 28.34 30.44 114.97
Dates sum of squares = (32.06² + 24.13² + 28.34² + 30.44²)/(6 x 3) − CF = 185.54 − 183.58 = 1.96
Sum of squares due to Interaction (VD) = ∑VD²/r − CF − SSV − SSD
= (11.25² + 7.84² + ⋯ + 9.66²)/6 − 183.58 − 0.18 − 1.96 = 185.93 − 183.58 − 0.18 − 1.96 = 0.21
Sub plot Error S S or Error II = Total Sum of squares – All other sum of squares
= 9.12 – 4.15 – 0.18 – 1.36 – 1.96 – 0.21 = 1.26
The complete analysis can now be set up :
Analysis of variance table
Source of variation    Degree of freedom    Sum of squares    Mean sum of squares    Fcal       Ftab (5 %)
Block                  5                    4.15              0.83
Variety                2                    0.18              0.09                   <1         F.05 at (2,10) = 4.10
Error I                10                   1.36              0.14
Dates                  3                    1.96              0.65                   23.21**    F.05 at (3,45) = 2.81
Interaction            6                    0.21              0.035                  1.25       F.05 at (6,45) = 2.31
Error II               45                   1.26              0.028
Total                  71                   9.12
The means for the dates of cutting are significantly different, but the other effects are found to be non-
significant.
Standard errors and the respective critical differences for specific comparisons among treatment means are
computed only when the F-test shows significant differences.
The standard errors of mean differences can be worked out according to the formulas given below, where
r = 6 is the number of blocks, a = 3 the number of varieties, b = 4 the number of dates, Ea = 0.14 the
Error I mean square and Eb = 0.028 the Error II mean square:
(i) Standard Error of difference (Variety) = $\sqrt{\frac{2E_a}{rb}}$ = 0.108
Critical difference for two variety means = SEdiff x t5% (10 d.f.) = 0.108 x 2.23 = 0.241
(ii) Standard Error of difference (Date of cutting) = $\sqrt{\frac{2E_b}{ra}}$ = 0.0558
Critical difference for two Date of cutting means = SEdiff x t5% (45 d.f.) = 0.0558 x 2.02 = 0.113
(iii) S. E. of difference between two dates of cutting at the same level of variety = $\sqrt{\frac{2E_b}{r}}$ = 0.0966
Critical difference for two Date of cutting means at the same level of variety = SEdiff x t5% (45 d.f.) = 0.0966 x 2.02 = 0.195
(iv) S. E. of difference between two variety means at the same or different level of date of cutting
= $\sqrt{\frac{2[(b-1)E_b + E_a]}{rb}}$ = 0.1366
For (iv) the standard error of the mean difference involves two error terms, so we use the following equation to
calculate a weighted t value:
$t = \frac{(b-1)E_b t_b + E_a t_a}{(b-1)E_b + E_a} = \frac{3 \times 0.028 \times 2.02 + 0.14 \times 2.23}{3 \times 0.028 + 0.14} = 2.151$, where ta and tb are the t values at the error d.f. of Ea and Eb
respectively.
Critical difference for two variety means at the same or different level of date of cutting = SEdiff x t
= 0.1366 x 2.151 = 0.294
Conclusions: There was no significant difference among variety means. Yield was significantly affected by
dates of final cutting. However the interaction between variety and final date of cutting was not significant
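Most of the split plot sums of squares depend only on the two-way tables of totals, so they can be verified without re-entering all 72 plot yields. The sketch below uses plain Python and the Replication x Variety and Variety x Date tables given above; the variable names are illustrative assumptions.

```python
# Replication x Variety totals (rows = blocks I-VI; columns = Ladak, Cossack, Ranger).
rv_table = [
    [8.27, 7.84, 6.38],
    [6.75, 6.82, 6.75],
    [6.33, 7.37, 7.74],
    [7.94, 5.81, 6.26],
    [5.88, 5.53, 5.06],
    [4.82, 4.35, 5.07],
]
# Variety x Date totals (rows = Ladak, Cossack, Ranger; columns = dates A-D).
vd_table = [
    [11.25, 7.84, 9.98, 10.92],
    [10.59, 7.81, 9.46, 9.86],
    [10.22, 8.48, 8.90, 9.66],
]
r, a, b = 6, 3, 4                      # blocks, varieties (main plots), dates (subplots)


def sum_sq(values):
    return sum(v * v for v in values)


cf = sum(map(sum, vd_table)) ** 2 / (r * a * b)
block_ss = sum_sq(sum(row) for row in rv_table) / (a * b) - cf
variety_ss = sum_sq(sum(row) for row in vd_table) / (r * b) - cf
rv_ss = sum_sq(v for row in rv_table for v in row) / b - cf
error1_ss = rv_ss - block_ss - variety_ss                 # main plot error
date_ss = sum_sq(sum(col) for col in zip(*vd_table)) / (r * a) - cf
vd_ss = sum_sq(v for row in vd_table for v in row) / r - cf
interaction_ss = vd_ss - variety_ss - date_ss

for name, value in [("Block", block_ss), ("Variety", variety_ss), ("Error I", error1_ss),
                    ("Dates", date_ss), ("Interaction", interaction_ss)]:
    print(f"{name}: {value:.2f}")
```

The subplot error additionally needs the total sum of squares, which requires the individual plot yields.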
Objective : Analysis of data in relation to strip plot experiment.
Kinds of data : The data relate to the four dates of optimum schedule for five different varieties of wheat
with three replications.
The layout plan and the yields in Kg/plot are given below :
S1 S2 S3 S4
Replication I V2 5.60 2.30 6.70 4.93
V5 5.46 5.87 2.63 6.78
V3 2.24 5.67 3.48 6.58
V1 5.67 6.89 2.56 3.78
V4 2.60 5.65 3.26 2.57
S3 S1 S4 S2
Replication II V4 3.50 6.45 4.80 6.90
V1 6.50 4.69 1.59 4.96
V5 5.32 6.89 2.45 5.36
V2 4.25 3.45 5.69 4.62
V3 2.86 4.39 4.68 2.90
S2 S3 S1 S4
Replication III V3 6.89 4.36 4.26 2.89
V4 4.89 4.58 5.69 5.36
V5 6.89 3.25 2.56 4.60
V2 2.68 4.89 8.90 6.09
V1 2.68 1.89 3.89 2.80

Solution : Prepare two way tables of Replication x Variety, Replication x spacing and Variety x spacing .
(a) Replication x Variety (each figure is a sum of 4 plots)
Replicate Variety
V1 V2 V3 V4 V5 Total
I 18.90 19.53 17.97 14.08 20.74 91.22
II 17.74 18.01 14.83 21.65 20.02 92.25
III 11.26 22.56 18.39 20.52 17.30 90.03
Total 47.90 60.10 51.19 56.25 58.06 273.50
(b) Replication x Spacing (each figure is a sum of 5 plots)
Replicate Spacing
S1 S2 S3 S4 Total
I 21.57 26.38 18.63 24.64 91.22
II 25.87 24.74 22.43 19.21 92.25
III 25.30 24.03 18.96 21.74 90.03
Total 72.74 75.15 60.02 65.59 273.50
(c) Variety x Spacing (each figure is a sum of 3 plots)
Variety Spacing
S1 S2 S3 S4 Total
V1 14.25 14.53 10.95 8.15 47.90
V2 17.95 9.60 15.84 16.71 60.10
V3 10.89 15.46 10.69 14.15 51.19
V4 14.74 17.44 11.34 12.73 56.25
V5 14.91 18.12 11.20 13.83 58.06
Total 72.74 75.15 60.02 65.59 273.50

Grand total of the observations = 273.50
Correction factor (CF) = 273.50²/60 = 1246.70
Total sum of squares = (5.60² + 2.30² + ⋯ + 2.80²) − CF = 158.92
Replicate sum of squares = (91.22² + 92.25² + 90.03²)/(4 x 5) − CF = 0.128
Variety S.S. = (47.90² + 60.10² + 51.19² + 56.25² + 58.06²)/(3 x 4) − CF = 8.45
Total sum of squares (1) = (18.90² + 19.53² + ⋯ + 17.30²)/4 − CF = 1278.19 − 1246.70 = 31.48
Error I = TSS(1) − Replicate S.S. − Variety S.S. = 31.48 − 0.128 − 8.45 = 22.91
Spacing sum of squares = (72.74² + 75.15² + 60.02² + 65.59²)/(3 x 5) − CF = 1256.20 − 1246.70 = 9.50
Total sum of squares (2) = (21.57² + 26.38² + ⋯ + 21.74²)/5 − CF = 1263.69 − 1246.70 = 16.99
Error II = TSS(2) − Replicate S.S. − Spacing S.S. = 16.99 − 0.128 − 9.50 = 7.36
Total sum of squares (3) = (14.25² + 14.53² + ⋯ + 13.83²)/3 − CF = 1298.84 − 1246.70 = 52.14
Interaction = TSS(3) − Variety S.S. − Spacing S.S. = 52.14 − 8.45 − 9.50 = 34.19
Error III = Total sum of squares − all other sums of squares = 76.38
Analysis of variance table
Source of variation    Degree of freedom    Sum of squares    Mean sum of squares    Fcal    Ftab (5 %)
Replicates             2                    0.128             0.06
Variety                4                    8.45              2.11                   0.74    F.05 at (4,8) = 3.84
Error I                8                    22.91             2.86
Spacing                3                    9.50              3.17                   2.58    F.05 at (3,6) = 4.76
Error II               6                    7.36              1.23
Variety x Spacing      12                   34.19             2.85                   0.90    F.05 at (12,24) = 2.18
Error III              24                   76.38             3.18
Total                  59                   158.92

None of the effects is found to be significant in the strip plot design. The standard errors for variety, spacing and their interaction can be determined along the same lines as in the split plot experiment, as given below. Here r is the number of replicates, a the number of varieties, b the number of spacings, and Ea, Eb and Ec are the mean squares of Error I, Error II and Error III.
(i) S.E. of difference between two variety means = $\sqrt{\frac{2 E_a}{r b}} = 0.690$
(ii) S.E. of difference between two spacing means = $\sqrt{\frac{2 E_b}{r a}} = 0.405$
(iii) S.E. of difference between two variety means at the same level of spacing = $\sqrt{\frac{2[(b-1) E_c + E_a]}{r b}} = 1.43$
(iv) S.E. of difference between two spacing means at the same level of variety = $\sqrt{\frac{2[(a-1) E_c + E_b]}{r a}} = 1.36$

The critical difference is obtained by multiplying the standard error by the table value of t at the respective error degrees of freedom for (i) and (ii). For (iii) and (iv) the following weighted values of t are used:
$t = \frac{(b-1) E_c t_c + E_a t_a}{(b-1) E_c + E_a}$ for (iii) and $t = \frac{(a-1) E_c t_c + E_b t_b}{(a-1) E_c + E_b}$ for (iv),
where ta, tb and tc are the table values of t at the error degrees of freedom of Ea, Eb and Ec respectively.
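
The four standard errors and the two weighted t values can be checked numerically. The sketch below plugs in the error mean squares from the ANOVA table; the two-sided 5 % t values for 8, 6 and 24 d.f. are an assumption taken from a standard t table, since the manual does not list them.

```python
from math import sqrt

r, a, b = 3, 5, 4                # replicates, varieties, spacings
Ea, Eb, Ec = 2.86, 1.23, 3.18    # mean squares of Error I, II and III
# Assumed two-sided 5% t values for 8, 6 and 24 d.f. (standard t table).
ta, tb, tc = 2.306, 2.447, 2.064

se_variety = sqrt(2 * Ea / (r * b))                                   # case (i)
se_spacing = sqrt(2 * Eb / (r * a))                                   # case (ii)
se_variety_same_spacing = sqrt(2 * ((b - 1) * Ec + Ea) / (r * b))     # case (iii)
se_spacing_same_variety = sqrt(2 * ((a - 1) * Ec + Eb) / (r * a))     # case (iv)

# Weighted t values for cases (iii) and (iv).
t_iii = ((b - 1) * Ec * tc + Ea * ta) / ((b - 1) * Ec + Ea)
t_iv = ((a - 1) * Ec * tc + Eb * tb) / ((a - 1) * Ec + Eb)

# Critical difference = standard error x appropriate t value.
cd = {
    "variety": se_variety * ta,
    "spacing": se_spacing * tb,
    "variety at same spacing": se_variety_same_spacing * t_iii,
    "spacing at same variety": se_spacing_same_variety * t_iv,
}
for name, value in cd.items():
    print(f"CD for {name}: {value:.2f}")
```

Each weighted t necessarily lies between the two table values it combines, which is a quick sanity check on the arithmetic.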

18. Factorial Design

Factorial Design: In a factorial experiment the effects of several factors are studied simultaneously, the treatments being the combinations of the levels of the different factors under study. In such an experiment we estimate the effect of each factor and also their interaction effects. A $2^n$ experiment has n factors, each at 2 levels.

Consider a $2^2$ experiment. There are 2 factors, each at 2 levels. The treatment combinations are $a_0 b_0$, $a_1 b_0$, $a_0 b_1$ and $a_1 b_1$.

A factorial experiment can be analysed in the usual manner as a CRD or RBD, but the treatment sum of squares is split into orthogonal components.

In factorial experiment factorial effect totals are given by the expression.

[A] = (a-1)(b+1) =[ab]-[b]+[a]-[1]


[B] = (a+1)(b-1) =[ab]+[b]-[a]-[1]
[AB] = (a-1)(b-1) = [ab]-[a]-[b]+[1]

Yates Method can also be used for computing factorial effect totals.
Treatment     Total   (1)         (2)                   Effect   SS = (Effect total)^2 / (2^2 r)
combination   yield                                     total
1             [1]     [1]+[a]     [1]+[a]+[b]+[ab]      GT       GT^2/(2^2 r) = CF
a             [a]     [b]+[ab]    [a]-[1]+[ab]-[b]      [A]      [A]^2/(2^2 r)
b             [b]     [a]-[1]     [b]+[ab]-[1]-[a]      [B]      [B]^2/(2^2 r)
ab            [ab]    [ab]-[b]    [ab]-[b]-[a]+[1]      [AB]     [AB]^2/(2^2 r)

Analysis of variance table for $2^2$ factorial experiment

Source of         Degrees of    Sum of squares                      Mean sum of    Fcal
variation         freedom                                           squares
Blocks            r-1           $S_R^2$                             SS / d.f.      MS / error MS
Main effect A     1             $S_A^2 = [A]^2/(2^2 r)$             SS / d.f.      MS / error MS
Main effect B     1             $S_B^2 = [B]^2/(2^2 r)$             SS / d.f.      MS / error MS
Interaction AxB   1             $S_{AB}^2 = [AB]^2/(2^2 r)$         SS / d.f.      MS / error MS
Error             3(r-1)        By subtraction                      SS / d.f.
Total             $2^2 r - 1$   $\sum\sum y_{ij}^2 - CF$

Similarly, in a $3^2$ experiment there are two factors each at three levels, and the total number of treatment combinations is 9.

Objective: Analysis of a $2^3$ factorial experiment.
Kinds of data: A $2^3$ experiment in eight randomized blocks was conducted in order to obtain an idea of the interactions among three factors N, P and K, each at two levels. The design and yield per plot are given below:
Replicate 1
Block 1: (1) 25, pk 24, nk 32, np 30
Block 2: n 30, k 32, npk 36, p 27
Replicate 2
Block 3: p 32, npk 42, n 46, k 39
Block 4: nk 34, (1) 44, np 30, pk 36
Replicate 3
Block 5: npk 30, k 32, n 28, p 26
Block 6: (1) 24, pk 20, nk 28, np 36
Replicate 4
Block 7: np 32, (1) 34, pk 39, nk 41
Block 8: npk 45, n 41, p 29, k 35

Solution: Null hypothesis H0: blocks as well as treatments are homogeneous.

First we calculate the block and treatment totals.
The eight block totals are: 111, 125, 159, 144, 116, 108, 146 and 150.
The eight treatment totals are: [1] = 127, [n] = 145, [p] = 114, [np] = 128, [k] = 138, [nk] = 135, [pk] = 119 and [npk] = 153.
Total number of observations = 32
Grand total = 1059
Correction factor (CF) = $\frac{1059^2}{32} = 35046.28$
Total sum of squares = $(25^2 + 24^2 + \cdots + 35^2) - CF = 36381 - 35046.28 = 1334.72$
Block sum of squares = $\frac{236^2 + 303^2 + 224^2 + 296^2}{8} - CF = 35662.13 - 35046.28 = 615.845$, where 236, 303, 224 and 296 are the totals of the four replicates (each the sum of its two blocks, i.e. 8 plots).
Treatment sum of squares = $\frac{127^2 + 145^2 + \cdots + 153^2}{4} - CF = 35343.25 - 35046.28 = 296.97$
Error sum of squares = TSS - BSS - TrSS = 1334.72 - 615.845 - 296.97 = 421.90
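
The totals and sums of squares above can be reproduced directly from the plot yields. The following Python sketch is only a check of the arithmetic; the variable names are illustrative. It evaluates the block expression exactly as printed, i.e. on the four replicate totals with divisor 8, and for comparison also evaluates the same kind of expression on the eight block totals with divisor 4, a figure the manual's table does not use.

```python
# Yields per block, keyed by treatment combination, as listed in the example.
blocks = [
    {"(1)": 25, "pk": 24, "nk": 32, "np": 30},
    {"n": 30, "k": 32, "npk": 36, "p": 27},
    {"p": 32, "npk": 42, "n": 46, "k": 39},
    {"nk": 34, "(1)": 44, "np": 30, "pk": 36},
    {"npk": 30, "k": 32, "n": 28, "p": 26},
    {"(1)": 24, "pk": 20, "nk": 28, "np": 36},
    {"np": 32, "(1)": 34, "pk": 39, "nk": 41},
    {"npk": 45, "n": 41, "p": 29, "k": 35},
]

yields = [y for blk in blocks for y in blk.values()]
cf = sum(yields) ** 2 / len(yields)
total_ss = sum(y * y for y in yields) - cf

block_totals = [sum(blk.values()) for blk in blocks]
# Consecutive pairs of blocks form one replicate.
replicate_totals = [block_totals[i] + block_totals[i + 1] for i in range(0, 8, 2)]

treatment_totals = {}
for blk in blocks:
    for trt, y in blk.items():
        treatment_totals[trt] = treatment_totals.get(trt, 0) + y
treatment_ss = sum(t * t for t in treatment_totals.values()) / 4 - cf

# Expression used in the worked solution: replicate totals, 8 plots each.
ss_from_replicate_totals = sum(t * t for t in replicate_totals) / 8 - cf
# Same expression on the eight block totals (4 plots each), for comparison only.
ss_from_block_totals = sum(t * t for t in block_totals) / 4 - cf

error_ss = total_ss - ss_from_replicate_totals - treatment_ss
print(block_totals, replicate_totals)
print(round(total_ss, 2), round(treatment_ss, 2), round(error_ss, 2))
```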

Now we break up the treatment S.S. with 7 d.f. into 7 orthogonal components, each with 1 d.f. For this we use Yates' method for computing the factorial effect totals and their sums of squares.

Treatment     Total   (1)    (2)    (3)    Effect   SS = (Effect total)^2 / 32
combination   yield                        total
1             127     272    514    1059   GT       35046.28
n             145     242    545      63   [N]        124.03
p             114     273     32     -31   [P]         30.03
np            128     272     31      33   [NP]        34.03
k             138      18    -30      31   [K]         30.03
nk            135      14     -1      -1   [NK]         0.03
pk            119      -3     -4      29   [PK]        26.28
npk           153      34     37      41   [NPK]       52.53
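
Yates' method is mechanical enough to express in a few lines: at each of the n steps the column is replaced by the sums of consecutive pairs followed by their differences (second minus first). The sketch below is one possible Python rendering, applied to the treatment totals of this example; the function name `yates` is illustrative.

```python
def yates(totals):
    """Yates' algorithm for 2^n treatment totals given in standard order."""
    n_factors = len(totals).bit_length() - 1
    assert 2 ** n_factors == len(totals), "need exactly 2^n totals"
    col = list(totals)
    for _ in range(n_factors):
        pairs = [(col[i], col[i + 1]) for i in range(0, len(col), 2)]
        sums = [x + y for x, y in pairs]
        diffs = [y - x for x, y in pairs]   # second member minus first
        col = sums + diffs
    return col


# Treatment totals in standard order: (1), n, p, np, k, nk, pk, npk.
labels = ["GT", "N", "P", "NP", "K", "NK", "PK", "NPK"]
totals = [127, 145, 114, 128, 138, 135, 119, 153]
r = 4   # replicates

effects = dict(zip(labels, yates(totals)))
ss = {k: v ** 2 / (2 ** 3 * r) for k, v in effects.items() if k != "GT"}

for name in labels[1:]:
    print(f"[{name}] = {effects[name]:4d}   SS = {ss[name]:.2f}")
```

The seven single-d.f. sums of squares add up to the treatment sum of squares, which is a useful check on the table.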

Analysis of variance table for $2^3$ factorial experiment
Source of    Degrees of   Sum of     Mean sum of   Fcal    Ftab (5 %)
variation    freedom      squares    squares
Blocks        7           615.84      87.98        3.54*   F.05 at (7,17) = 2.61
N             1           124.03     124.03        5.00*   F.05 at (1,17) = 4.45
P             1            30.03      30.03        1.21
NP            1            34.03      34.03        1.37
K             1            30.03      30.03        1.21
NK            1             0.03       0.03        0.00
PK            1            26.28      26.28        1.06
NPK           1            52.53      52.53        2.12
Error        17           421.90      24.82
Total        31          1334.72

Here blocks differ significantly. Among the treatment effects, only the main effect of N is significant; the other factorial effects are not significant.
Standard error of any factorial effect total = $\sqrt{r \cdot 2^3 \cdot S_E^2} = \sqrt{4 \times 8 \times 24.82} = 28.18$, where $S_E^2$ is the error mean square.
Significant value for any factorial effect total = (t at 5 % for 17 d.f.) x 28.18 = 2.109 x 28.18 = 59.45
Conclusion: Comparing this value with the factorial effect totals in the Yates table, we find that only the main effect N is significant; the others are non-significant.
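
The comparison of effect totals against the significant value can be sketched as follows. The t value 2.109 is the one quoted in the text; everything else comes from the Yates and ANOVA tables, and the variable names are illustrative.

```python
from math import sqrt

r, n_factors = 4, 3
error_ms = 24.82          # error mean square from the ANOVA table
t_5pct_17df = 2.109       # t value quoted in the text for 17 d.f.

se_effect_total = sqrt(r * 2 ** n_factors * error_ms)
significant_value = t_5pct_17df * se_effect_total

effect_totals = {"N": 63, "P": -31, "NP": 33, "K": 31, "NK": -1, "PK": 29, "NPK": 41}
# An effect is declared significant when its absolute total exceeds the limit.
significant = [name for name, total in effect_totals.items()
               if abs(total) > significant_value]

print(round(se_effect_total, 2), round(significant_value, 2), significant)
```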

Objective: Analysis of data in relation to a $3^2$ factorial experiment.

Kinds of data: Hypothetical data on two factors A and B, each at three levels, in four randomized blocks are given below:
Blocks   a2b2   a2b1   a2b0   a1b2   a1b1   a1b0   a0b2   a0b1   a0b0   Total
I         26     33     28     14     18     21     28     10     11     189
II        30     28     26     24     31     18     20     16     14     207
III       20     24     17     24     23     13      8     11     13     153
IV        24     27     17     22     12      8     24     19     18     171
Total    100    112     88     84     84     60     80     56     56     720

Solution: Null hypothesis H0: there is no significant difference among the three levels of factor A or among the three levels of factor B.
First we construct the two-way table of treatment totals over all the blocks.

A/B b2 b1 b0 Total
a2 100 112 88 300
a1 84 84 60 228
a0 80 56 56 192
Total 264 252 204 720

C.F. = $\frac{720^2}{36} = 14400$
Total sum of squares = $26^2 + 30^2 + \cdots + 18^2 - 14400 = 16028 - 14400 = 1628$
Block S.S. = $\frac{189^2 + 207^2 + 153^2 + 171^2}{9} - C.F. = 180$
Treatment S.S. = $\frac{100^2 + 112^2 + \cdots + 56^2}{4} - C.F. = 768$
Error S.S. = T.S.S. - Block S.S. - Treatment S.S. = 680
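
These figures can be verified from the data table with a short Python sketch (the layout of `data` mirrors the table above; names are illustrative):

```python
# Rows are blocks I-IV; columns are a2b2, a2b1, a2b0, a1b2, a1b1, a1b0, a0b2, a0b1, a0b0.
data = [
    [26, 33, 28, 14, 18, 21, 28, 10, 11],
    [30, 28, 26, 24, 31, 18, 20, 16, 14],
    [20, 24, 17, 24, 23, 13, 8, 11, 13],
    [24, 27, 17, 22, 12, 8, 24, 19, 18],
]
n_blocks, n_treatments = len(data), len(data[0])

grand_total = sum(sum(row) for row in data)
cf = grand_total ** 2 / (n_blocks * n_treatments)

total_ss = sum(y * y for row in data for y in row) - cf
block_ss = sum(sum(row) ** 2 for row in data) / n_treatments - cf
treatment_totals = [sum(col) for col in zip(*data)]   # column totals
treatment_ss = sum(t * t for t in treatment_totals) / n_blocks - cf
error_ss = total_ss - block_ss - treatment_ss

print(grand_total, cf, total_ss, block_ss, treatment_ss, error_ss)
```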

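Now let us compute the value and sum of squares of the eight contrasts. Subscript 1 denotes a linear component and subscript 2 a quadratic component; Z is the value of the contrast, D the sum of squares of its coefficients and r the number of blocks.
Contrast   Value of Z          Divisor (D x r)   Sum of squares = Z^2/(D x r)
A1         300 - 192 = 108       6 x 4           486
A2         492 - 456 = 36       18 x 4            18
B1         264 - 204 = 60        6 x 4           150
B2         468 - 504 = -36      18 x 4            18
A1B1       156 - 168 = -12       4 x 4             9
A1B2       300 - 360 = -60      12 x 4            75
A2B1       300 - 312 = -12      12 x 4             3
A2B2       660 - 624 = 36       36 x 4             9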
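
The contrast values, divisors and sums of squares can be generated from orthogonal polynomial coefficients. The sketch below assumes the usual coefficients for three equally spaced levels, linear (1, 0, -1) and quadratic (1, -2, 1) taken in the order highest level to lowest, which reproduces the Z values in the table; the helper `contrast` is illustrative.

```python
# Treatment totals over the 4 blocks: rows a2, a1, a0; columns b2, b1, b0.
totals = [
    [100, 112, 88],
    [84, 84, 60],
    [80, 56, 56],
]
r = 4   # blocks

ones = [1, 1, 1]          # used when a factor is summed over (main effects)
linear = [1, 0, -1]       # assumed linear coefficients, highest level first
quadratic = [1, -2, 1]    # assumed quadratic coefficients


def contrast(coef_a, coef_b):
    """Return (Z, D, SS) for the contrast with cell coefficients coef_a[i] * coef_b[j]."""
    z = sum(ca * cb * totals[i][j]
            for i, ca in enumerate(coef_a)
            for j, cb in enumerate(coef_b))
    d = sum((ca * cb) ** 2 for ca in coef_a for cb in coef_b)
    return z, d, z ** 2 / (d * r)


results = {
    "A1": contrast(linear, ones),
    "A2": contrast(quadratic, ones),
    "B1": contrast(ones, linear),
    "B2": contrast(ones, quadratic),
    "A1B1": contrast(linear, linear),
    "A1B2": contrast(linear, quadratic),
    "A2B1": contrast(quadratic, linear),
    "A2B2": contrast(quadratic, quadratic),
}
for name, (z, d, ss) in results.items():
    print(f"{name:5s} Z = {z:4d}  D x r = {d} x {r}  SS = {ss:g}")
```

Because the eight contrasts are mutually orthogonal, their sums of squares add up to the treatment sum of squares of 768.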

Analysis of variance table for $3^2$ factorial experiment

Source of    Degrees of   Sum of     Mean sum of   Fcal     Ftab (5 %)
variation    freedom      squares    squares
Blocks        3            180        60
Treatments    8            768        96            3.39*   2.36
A1            1            486       486           17.15*   4.26
A2            1             18        18            <1      4.26
B1            1            150       150            5.29*   4.26
B2            1             18        18            <1      4.26
A1B1          1              9         9            <1      4.26
A1B2          1             75        75            2.65    4.26
A2B1          1              3         3            <1      4.26
A2B2          1              9         9            <1      4.26
Error        24            680        28.33
Total        35           1628

The treatments are found to be significant, in particular the linear effects of factors A and B.

Remark: The $2^3$ factorial experiment analysed earlier in this chapter was laid out in four replicates, each divided into two blocks of four plots. That experiment can therefore also be analysed by the method of complete confounding, with the sum of squares due to NPK included in error, since NPK is completely confounded with blocks. This analysis is taken up in practical No. 48.

19. Confounding
Confounding is the technique by which the precision of the main effects and of certain interactions (generally those of lower order) is increased by sacrificing precision on certain higher-order interactions.

Complete confounding: When the same interaction is confounded in all the replicates, it is known as complete or total confounding.
The analysis procedure is the same as for a factorial experiment. One degree of freedom is lost from treatments and added to error, because the same interaction is confounded in all the replicates.

Partial confounding: This is used when we want to divide each replicate into smaller homogeneous blocks but do not want to lose all information on any interaction. In partial confounding a different effect is confounded in each replicate, so that each interaction can be estimated from the remaining replicates in which it is not confounded.
In partial confounding also the analysis procedure is the same as for a factorial experiment. The only change is in the calculation of the sums of squares of the partially confounded effects.

Objective: Layout of a completely confounded design

Kinds of data: A $2^3$ factorial experiment in which ABC is confounded in all three replicates.
Solution: The contrast for ABC is
ABC = (a-1)(b-1)(c-1) = (abc + a + b + c) - (ab + ac + bc + (1))
so in each replicate one block receives abc, a, b and c and the other receives ab, ac, bc and (1). The same block contents are repeated in the other replicates with a fresh randomization within blocks.
Thus, the layout of a $2^3$ experiment in which ABC is confounded is given below:
      Rep. I               Rep. II              Rep. III
Block 1   Block 2    Block 3   Block 4    Block 5   Block 6
a         (1)        c         ab         b         bc
abc       ab         b         ac         c         (1)
b         ac         abc       bc         a         ac
c         bc         a         (1)        abc       ab
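
The split into two blocks follows from the signs in the expanded contrast: a treatment combination carries one minus sign for every letter of the confounded effect that it lacks. A minimal Python sketch of that rule (function names are illustrative; randomization within blocks is left out):

```python
from itertools import product


def label(levels, factors="abc"):
    """Name of a treatment combination, e.g. (1, 0, 1) -> 'ac', (0, 0, 0) -> '(1)'."""
    name = "".join(f for f, lv in zip(factors, levels) if lv)
    return name or "(1)"


def split_blocks(effect, factors="abc"):
    """Split the 2^n combinations into two blocks by their sign in `effect`."""
    plus, minus = [], []
    for levels in product((0, 1), repeat=len(factors)):
        # One minus sign for each factor of the effect that is at its low level.
        lows = sum(1 for f, lv in zip(factors, levels)
                   if f in effect.lower() and lv == 0)
        (plus if lows % 2 == 0 else minus).append(label(levels, factors))
    return plus, minus


plus_block, minus_block = split_blocks("ABC")
print("positive sign:", plus_block)
print("negative sign:", minus_block)
```

Passing "AC" or "BC" instead of "ABC" gives the blocks needed when a different interaction is confounded.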

Objective: Layout of a partially confounded design

Kinds of data: A $2^3$ factorial experiment in which ABC, AC and BC are confounded, one in each of three replicates.
Solution: The contrasts for ABC, AC and BC are
ABC = (a-1)(b-1)(c-1) = (abc + a + b + c) - (ab + ac + bc + (1))
AC = (a-1)(b+1)(c-1) = (abc + b + ac + (1)) - (a + c + ab + bc)
BC = (a+1)(b-1)(c-1) = (abc + a + bc + (1)) - (ab + ac + b + c)
Thus, the layout of a $2^3$ experiment in which ABC, AC and BC are confounded in Rep. I, Rep. II and Rep. III respectively is given below:
      Rep. I               Rep. II              Rep. III
Block 1   Block 2    Block 3   Block 4    Block 5   Block 6
a         ac         abc       a          (1)       b
abc       (1)        b         ab         a         ab
b         ab         (1)       bc         bc        c
c         bc         ac        c          abc       ac
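
A given layout can be checked the other way round: an effect is confounded in a replicate exactly when every treatment combination in a block carries the same sign in that effect's contrast. The Python sketch below applies this test to the layout above; the helper names are illustrative.

```python
from itertools import combinations

FACTORS = "abc"


def sign(effect, combo):
    """Sign (+1 or -1) of treatment combination `combo` in the contrast for `effect`."""
    present = set() if combo == "(1)" else set(combo)
    lows = sum(1 for f in effect.lower() if f not in present)
    return 1 if lows % 2 == 0 else -1


def confounded_effects(block_one, block_two):
    """Effects whose contrast has a constant sign within each of the two blocks."""
    effects = ["".join(c).upper()
               for k in range(1, len(FACTORS) + 1)
               for c in combinations(FACTORS, k)]
    found = []
    for effect in effects:
        signs_one = {sign(effect, t) for t in block_one}
        signs_two = {sign(effect, t) for t in block_two}
        if len(signs_one) == 1 and len(signs_two) == 1:
            found.append(effect)
    return found


# Layout of the partially confounded design given above.
layout = {
    "Rep. I": (["a", "abc", "b", "c"], ["ac", "(1)", "ab", "bc"]),
    "Rep. II": (["abc", "b", "(1)", "ac"], ["a", "ab", "bc", "c"]),
    "Rep. III": (["(1)", "a", "bc", "abc"], ["b", "ab", "c", "ac"]),
}
confounded = {rep: confounded_effects(*blks) for rep, blks in layout.items()}
print(confounded)
```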

Objective: Analysis of data from a completely confounded experiment
Kinds of data: The following data relate to a $2^3$ experiment with factors A, B and C conducted in 4 replications. In each replicate the interaction ABC is confounded.
Replicate 1 (ABC confounded)
Block (i): (1) 19.1, ab 19.2, ac 18.8, bc 19.4
Block (ii): a 18.6, b 18.2, c 19.0, abc 20.4
Replicate 2 (ABC confounded)
Block (i): (1) 20.7, ab 22.1, ac 21.2, bc 20.1
Block (ii): a 25.9, b 23.0, c 24.9, abc 23.4
Replicate 3 (ABC confounded)
Block (i): (1) 23.4, ab 20.4, ac 23.2, bc 20.3
Block (ii): a 22.2, b 21.0, c 23.6, abc 21.6
Replicate 4 (ABC confounded)
Block (i): (1) 19.1, ab 21.9, ac 18.6, bc 21.5
Block (ii): a 23.6, b 23.7, c 21.0, abc 22.8

Solution: H0: The data are homogeneous with respect to blocks and treatments.

Since each replicate has been divided into 2 blocks, one effect has been confounded in each replicate. We find that the effect ABC has been confounded in every replicate.
Grand total of the observations, GT = 681.9
Correction factor (CF) = $\frac{681.9^2}{32} = 14530.863$
Raw S.S. = $19.1^2 + 18.6^2 + \cdots + 22.8^2 = 14658.87$
Total sum of squares (corrected) = 14658.87 - CF = 128.007
The eight block totals are 76.5, 76.2, 84.1, 97.2, 87.3, 88.4, 81.1 and 91.1.
So, Block sum of squares = $\frac{76.5^2 + 76.2^2 + \cdots + 91.1^2}{4} - CF = 92.03$

The sums of squares due to the 6 unconfounded factorial effects are obtained by Yates' technique.
Treatment     Total   (1)     (2)     (3)     Effect   SS = (Effect total)^2 / 32
combination   yield                           total
1             82.3    172.6   342.1   681.9   GT
a             90.3    169.5   339.8     5.9   [A]      1.09
b             85.9    170.3     5.7    -3.9   [B]      0.48
ab            83.6    169.5     0.2     3.3   [AB]     0.34
c             88.5      8.0    -3.1    -2.3   [C]      0.17
ac            81.8     -2.3    -0.8    -5.5   [AC]     0.95
bc            81.3     -6.7   -10.3     2.3   [BC]     0.17
abc           88.2      6.9    13.6    23.9   [ABC]    not estimable (confounded)

Put all the sums of squares in the ANOVA table and test the main effects and interactions, excluding the effect that is completely confounded with blocks.

Analysis of variance table
Source of    Degrees of   Sum of     Mean sum of   Fcal    Ftab (5 %)
variation    freedom      squares    squares
Blocks        7            92.03      13.15        7.22*   F.05 at (7,18) = 2.58
A             1             1.09       1.09        <1      F.05 at (1,18) = 4.41
B             1             0.48       0.48        <1
AB            1             0.34       0.34        <1
C             1             0.17       0.17        <1
AC            1             0.95       0.95        <1
BC            1             0.17       0.17        <1
Error        18            32.79       1.82
Total        31           128.00
From the above analysis of variance table, we find that none of the treatment effects is significant, as would be expected for data taken from a uniformity trial. The block effect, on the other hand, is significant; hence confounding is found to be effective.
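
The whole analysis can be retraced from the plot yields. The sketch below is a hedged check rather than the manual's own procedure: instead of Yates' technique it computes each effect total directly from the signs of its contrast, which gives the same totals, and then forms the error sum of squares by subtraction.

```python
# Yields from the example; each replicate is (block i, block ii).
replicates = [
    ({"(1)": 19.1, "ab": 19.2, "ac": 18.8, "bc": 19.4},
     {"a": 18.6, "b": 18.2, "c": 19.0, "abc": 20.4}),
    ({"(1)": 20.7, "ab": 22.1, "ac": 21.2, "bc": 20.1},
     {"a": 25.9, "b": 23.0, "c": 24.9, "abc": 23.4}),
    ({"(1)": 23.4, "ab": 20.4, "ac": 23.2, "bc": 20.3},
     {"a": 22.2, "b": 21.0, "c": 23.6, "abc": 21.6}),
    ({"(1)": 19.1, "ab": 21.9, "ac": 18.6, "bc": 21.5},
     {"a": 23.6, "b": 23.7, "c": 21.0, "abc": 22.8}),
]
blocks = [blk for rep in replicates for blk in rep]
yields = [y for blk in blocks for y in blk.values()]

cf = sum(yields) ** 2 / len(yields)
total_ss = sum(y * y for y in yields) - cf
block_ss = sum(sum(blk.values()) ** 2 for blk in blocks) / 4 - cf

totals = {}
for blk in blocks:
    for trt, y in blk.items():
        totals[trt] = totals.get(trt, 0.0) + y


def effect_total(effect):
    """Contrast total: a combination gets one minus sign per effect letter it lacks."""
    result = 0.0
    for trt, t in totals.items():
        present = "" if trt == "(1)" else trt
        lows = sum(1 for f in effect.lower() if f not in present)
        result += t if lows % 2 == 0 else -t
    return result


unconfounded = ["A", "B", "AB", "C", "AC", "BC"]   # ABC is confounded with blocks
effect_ss = {e: effect_total(e) ** 2 / 32 for e in unconfounded}

error_ss = total_ss - block_ss - sum(effect_ss.values())
error_ms = error_ss / 18
f_blocks = (block_ss / 7) / error_ms

for e in unconfounded:
    print(f"{e:3s} SS = {effect_ss[e]:.2f}  F = {effect_ss[e] / error_ms:.2f}")
print(f"Blocks F = {f_blocks:.2f}, error MS = {error_ms:.2f}")
```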

Objective: Analysis of data from a partially confounded experiment.

Kinds of data: The following data relate to a partially confounded $2^3$ experiment taken from a uniformity trial.
Replicate 1 (AB confounded)
Block (i): (1) 25.7, ab 21.1, c 17.6, abc 17.5
Block (ii): a 23.2, b 21.0, ac 18.6, bc 18.3
Replicate 2 (AC confounded)
Block (i): (1) 27.6, ac 26.7, b 26.2, abc 22.0
Block (ii): a 25.6, c 27.9, ab 28.5, bc 27.2
Replicate 3 (BC confounded)
Block (i): (1) 21.4, bc 18.6, a 18.8, abc 18.2
Block (ii): b 18.8, c 16.0, ab 16.4, ac 16.6
Replicate 4 (ABC confounded)
Block (i): (1) 23.9, ab 21.4, ac 20.6, bc 22.4
Block (ii): a 25.4, b 26.9, c 25.2, abc 30.1

Solution: H0: The data are homogeneous with respect to blocks and treatments. Since each replicate has been divided into 2 blocks, one effect has been confounded in each replicate. Replicate 1 confounds AB, replicate 2 confounds AC, replicate 3 confounds BC and replicate 4 confounds ABC. Hence, this is an example of partial confounding.
Grand total of the observations, GT = 715.4
Correction factor (CF) = $\frac{715.4^2}{32} = 15993.66$
Raw S.S. = $25.7^2 + 23.2^2 + \cdots + 30.1^2 = 16520.42$
Total sum of squares (corrected) = 16520.42 - CF = 526.76
The eight block totals are: 81.9, 81.1, 102.5, 109.2, 77.0, 67.8, 88.3 and 107.6.
So, Block sum of squares = $\frac{81.9^2 + 81.1^2 + \cdots + 107.6^2}{4} - CF = 410.39$

The factorial effect totals are obtained by Yates' technique. The sums of squares of the three unconfounded main effects follow directly; those of the four partially confounded interactions need an adjustment.

Treatment     Total   (1)     (2)     (3)     Effect   SS = (Effect total)^2 / 32
combination   yield                           total
1             98.6    191.6   371.9   715.4   GT
a             93.0    180.3   343.5   -14.0   [A]      6.13
b             92.9    169.2   -11.1    -6.2   [B]      1.20
ab            87.4    174.3    -2.9     5.6   [AB]     not directly estimable
c             86.7     -5.6   -11.3   -28.4   [C]      25.21
ac            82.5     -5.5     5.1     8.2   [AC]     not directly estimable
bc            86.5     -4.2     0.1    16.4   [BC]     not directly estimable
abc           87.8      1.3     5.5     5.4   [ABC]    not directly estimable

To estimate the sum of squares of a partially confounded effect, an adjustment factor (AF) is calculated for each such interaction from the replicate in which it is confounded:
AF = [Total of the block containing (1)] - [Total of the block not containing (1)]
when (1) enters the contrast with a positive sign; the two block totals are interchanged when (1) enters with a negative sign. The adjusted effect total is the Yates effect total minus AF. Since each of these effects is estimated from the 3 replicates in which it is not confounded, the divisor of its sum of squares is $2^3 \times 3 = 24$.

Interaction AB is estimated by $\frac{1}{4}(a-1)(b-1)(c+1)$; here the sign of (1) is positive. Hence
AF for AB = [25.7 + 21.1 + 17.6 + 17.5] - [23.2 + 21.0 + 18.6 + 18.3] = 0.8
Adjusted effect total for AB = 5.6 - 0.8 = 4.8
Sum of squares of AB = $\frac{[AB]^2}{24} = \frac{4.8^2}{24} = 0.96$

AF for AC = [27.6 + 26.7 + 26.2 + 22.0] - [25.6 + 27.9 + 28.5 + 27.2] = -6.7
Adjusted effect total for AC = 8.2 - (-6.7) = 14.9
Sum of squares of AC = $\frac{[AC]^2}{24} = \frac{14.9^2}{24} = 9.25$

AF for BC = 77.0 - 67.8 = 9.2
Adjusted effect total for BC = 16.4 - 9.2 = 7.2
Sum of squares of BC = $\frac{[BC]^2}{24} = \frac{7.2^2}{24} = 2.16$

Interaction ABC is estimated by $\frac{1}{4}(a-1)(b-1)(c-1)$; here the sign of (1) is negative. Hence
AF for ABC = [25.4 + 26.9 + 25.2 + 30.1] - [23.9 + 21.4 + 20.6 + 22.4] = 19.3
Adjusted effect total for ABC = 5.4 - 19.3 = -13.9
Sum of squares of ABC = $\frac{[ABC]^2}{24} = \frac{(-13.9)^2}{24} = 8.05$
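
The four adjustments follow one rule, which the Python sketch below makes explicit. For each effect it takes the unadjusted Yates total and the two block totals of the replicate in which that effect is confounded, ordered as (block entering the contrast with a plus sign, block entering with a minus sign). The dictionary names are illustrative.

```python
# Unadjusted effect totals from the Yates table.
unadjusted = {"AB": 5.6, "AC": 8.2, "BC": 16.4, "ABC": 5.4}

# Block totals of the replicate in which each effect is confounded:
# (block with positive sign in the contrast, block with negative sign).
confounding_blocks = {
    "AB": (81.9, 81.1),     # replicate 1: block with (1) is positive
    "AC": (102.5, 109.2),   # replicate 2: block with (1) is positive
    "BC": (77.0, 67.8),     # replicate 3: block with (1) is positive
    "ABC": (107.6, 88.3),   # replicate 4: block with (1) is negative
}

replicates_unconfounded = 3   # each effect is free of blocks in 3 of the 4 replicates
divisor = 2 ** 3 * replicates_unconfounded

adjusted, ss = {}, {}
for effect, total in unadjusted.items():
    plus_block, minus_block = confounding_blocks[effect]
    adjustment = plus_block - minus_block
    adjusted[effect] = total - adjustment
    ss[effect] = adjusted[effect] ** 2 / divisor

for effect in unadjusted:
    print(f"{effect:4s} adjusted total = {adjusted[effect]:6.1f}  SS = {ss[effect]:.2f}")
```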

S.S. due to error = Total S.S. - Block S.S. - Treatment S.S. = 526.76 - 410.39 - 52.95 = 63.42

Put all these sums of squares in an ANOVA table and test the main effects and their interactions.

Analysis of variance table for the partially confounded $2^3$ experiment
Source of    Degrees of   Sum of     Mean sum of   Fcal     Ftab
variation    freedom      squares    squares
Blocks        7           410.39      58.63        15.72*   F.05 at (7,17) = 2.61
Treatments    7            52.95
A             1             6.12       6.12         1.64    F.05 at (1,17) = 4.45
B             1             1.20       1.20         <1      F.01 at (1,17) = 8.40
AB            1             0.96       0.96         <1
C             1            25.205     25.205        6.76*
AC            1             9.25       9.25         2.48
BC            1             2.16       2.16         <1
ABC           1             8.05       8.05         2.16
Error        17            63.42       3.73
Total        31           526.76

From the above table, it is observed that only the main effect of factor C is significant at the 5% level. The other effects are found to be non-significant. The block effect is also found to be significant.
In comparing means it is important to keep in mind that the interactions are estimated from only 3/4 of the replications. Thus, the standard error of a mean interaction response is $\sqrt{\frac{2 \times 3.73}{3 \times 2^2}} = 0.788$.

Similarly, the standard error of a main effect will be $\sqrt{\frac{2 \times 3.73}{4 \times 2^2}} = 0.683$.
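
The two standard errors differ only in the number of replicates that carry information on the effect. A small Python sketch of this, with illustrative names and the error mean square taken from the ANOVA table:

```python
from math import sqrt

error_ms = 3.73          # error mean square from the ANOVA table
plots_per_sign = 2 ** 2  # plots per replicate on each side of a contrast


def se_effect(replicates_used):
    """Standard error of a mean effect estimated from `replicates_used` replicates."""
    return sqrt(2 * error_ms / (replicates_used * plots_per_sign))


se_interaction = se_effect(3)   # partially confounded: 3 of the 4 replicates
se_main = se_effect(4)          # main effects use all 4 replicates
print(round(se_interaction, 3), round(se_main, 3))
```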

REFERENCES:
1. Practicals in Statistics, by H. L. Sharma
2. Statistical Methods, by G. W. Snedecor
3. Experimental Designs and Survey Sampling: Methods and Applications, by H. L. Sharma
4. A Handbook of Agricultural Statistics, by S. R. S. Chandel
5. The Theory of Sample Surveys and Statistical Decisions, by K. S. Kushwaha and Rajesh Kumar
6. Fundamentals of Mathematical Statistics, by S. C. Gupta and V. K. Kapoor
