AGRICULTURAL STATISTICS
(UG/PG Courses)
Compiled by
SURABHI JAIN
(Asst. Prof./Scientist)
&
H. L. Sharma
(Professor and Head)
DEPARTMENT OF MATHEMATICS AND STATISTICS
Jawaharlal Nehru Krishi Vishwa Vidyalaya,
JABALPUR 482 004 (M.P.)
Contents
1. Frequency Distribution
2. Graphical Representation of Data
3. Curve Fitting
4. Measures of Central Tendency
5. Measures of Dispersion
6. Skewness and Kurtosis
7. Probability
8. Discrete and Continuous Distribution
9. Correlation and Regression
10. Multiple and Partial Correlation
11. Multiple Regression Equation and Analysis Technique
12. Simple and Stratified Random Sampling
13. Ratio and Regression Estimator
14. Large Sample Test
15. Small Sample Test
16. Chi-Square Test
17. Experimental Design
18. Factorial Design
19. Confounding
References
1. Frequency distribution
Frequency distribution: used to condense a large amount of data and to present the information of interest compactly.
To construct a frequency distribution, first find the range = maximum value − minimum value.
The following formulas give an approximate number k of classes:
k = 1 + 3.322 log10 N, or k = log N / log 2, where N is the total number of observations. Round the answer up to the next integer.
Dividing the range by the number of classes gives the class interval.
Kinds of data: The list of IQ scores is: 118, 123, 124, 125, 127, 128, 129, 130, 130, 133, 136, 138, 141,
142, 149, 150, 154.
Solution: Here range = 154 − 118 = 36.
The number of classes k = 1 + 3.322 log10(17) = 5.09, or log(17)/log(2) = 4.09, so k = 5 classes can be taken.
Class interval = range / no. of classes = 36/5 = 7.2 ≈ 8
The first class therefore runs from the minimum value 118 up to 118 + 8 = 126.
Since the data are discrete, we subtract 1 from the upper limit of the class, so the first class is 118-125, and the next class starts from the next integer.
So our frequency distribution table is:

I.Q. (class interval)   Number (frequency)
118-125                 4
126-133                 6
134-141                 3
142-149                 2
150-157                 2
Classes in which both the upper and lower limits are included in the same class are called inclusive classes, whereas classes in which the upper limit of one class is the same as the lower limit of the next class, e.g. 10-15, 15-20, etc., are known as exclusive classes.
Remark: Before applying any statistical technique, inclusive classes should first be converted to exclusive classes. For this purpose we find
(lower limit of second class − upper limit of first class) / 2
and add this amount to the upper limit of each class and subtract it from the lower limit of the next higher class.
In the present example the conversion factor = (126 − 125)/2 = 0.5
So we add 0.5 to 125, subtract 0.5 from 126, and so on, and finally get the exclusive classes.
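The whole construction can be sketched in code. This is a minimal sketch, not part of the original text: the function name is illustrative, and the default number of classes uses the text's second rule, k = log N / log 2 rounded up.

```python
import math

def frequency_table(data, k=None):
    """Build an inclusive-class frequency table for discrete data."""
    n = len(data)
    if k is None:
        # text's rule: k = log N / log 2, rounded up to the next integer
        k = math.ceil(math.log2(n))
    # class interval: range / k, rounded up (36/5 = 7.2 -> 8 in the example)
    width = math.ceil((max(data) - min(data)) / k)
    lo = min(data)
    table = []
    for i in range(k):
        lower = lo + i * width
        upper = lower + width - 1          # inclusive upper limit
        freq = sum(lower <= x <= upper for x in data)
        table.append(((lower, upper), freq))
    return table

iq = [118, 123, 124, 125, 127, 128, 129, 130, 130, 133,
      136, 138, 141, 142, 149, 150, 154]
tbl = frequency_table(iq)
```

For the IQ data this reproduces the five inclusive classes 118-125, …, 150-157 with frequencies 4, 6, 3, 2, 2.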
2. Graphical Representation of data
Graphical representation of data: used to make data attractive, effective and ready for comparison, and also to save time and effort.

Type of graph: Simple bar diagram — thick lines (bars) of equal width are used to represent the magnitudes of a single characteristic. Example data:

Marks in maths    0-10  10-20  20-30  30-40  40-50
No. of students   2     5      7      5      3

[Simple bar diagram: marks in maths on the X-axis, number of students on the Y-axis]
3. Curve fitting
Curve fitting: used to find an analytic expression of the form y = f(x) for the functional relationship suggested by the given data.
Fitting of a straight line: Y = a + bX
In curve fitting by the principle of least squares, we determine a and b so that
E = Σ(yi − a − bxi)² is minimum.
Differentiating w.r.t. a and b gives the two normal equations
Σyi = na + bΣxi
Σxiyi = aΣxi + bΣxi²
and solving these two equations gives the values of a and b.
Fitting of a second degree parabola: Y = a + bX + cX²
By the principle of least squares, we determine a, b and c so that
E = Σ(yi − a − bxi − cxi²)² is minimum.
Differentiating w.r.t. a, b and c gives the three normal equations
Σyi = na + bΣxi + cΣxi²
Σxiyi = aΣxi + bΣxi² + cΣxi³
Σxi²yi = aΣxi² + bΣxi³ + cΣxi⁴
and solving these three equations gives the values of a, b and c.
Objective: Fitting of a quadratic curve to the following data, treating X as the independent variable.
Kinds of data:
X: 0 1 2 3 4
Y: 1 1.8 1.3 2.5 6.3
Solution: Let Y = a + bX + cX² be the equation of the quadratic curve.

X       Y       X²     X³     X⁴     XY     X²Y
0       1       0      0      0      0      0
1       1.8     1      1      1      1.8    1.8
2       1.3     4      8      16     2.6    5.2
3       2.5     9      27     81     7.5    22.5
4       6.3     16     64     256    25.2   100.8
Total   12.9    30     100    354    37.1   130.3

Using the three normal equations for the quadratic curve, we have
12.9 = 5a + 10b + 30c
37.1 = 10a + 30b + 100c
130.3 = 30a + 100b + 354c
Solving the above three equations, we have a = 1.42, b = −1.07 and c = 0.55.
Thus the fitted quadratic curve is
Ŷ = 1.42 − 1.07X + 0.55X²
Remark: If the values of X and Y are very large, the computation of ΣX, ΣX², ΣX³, ΣXY, …, becomes tedious and time-consuming. The calculations may be reduced by using a change of origin and scale of the data.
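As a check on the hand computation, the normal equations can be assembled and solved programmatically. This is a pure-Python sketch: `fit_quadratic` and the small Cramer's-rule solver are illustrative names, not from the text.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a + b*x + c*x^2 via the three normal
    equations, solved with Cramer's rule."""
    n = len(xs)
    Sx = sum(xs)
    Sx2 = sum(x ** 2 for x in xs)
    Sx3 = sum(x ** 3 for x in xs)
    Sx4 = sum(x ** 4 for x in xs)
    Sy = sum(ys)
    Sxy = sum(x * y for x, y in zip(xs, ys))
    Sx2y = sum(x * x * y for x, y in zip(xs, ys))

    def det3(m):
        # determinant of a 3x3 matrix by cofactor expansion
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
              - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
              + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    A = [[n,   Sx,  Sx2],
         [Sx,  Sx2, Sx3],
         [Sx2, Sx3, Sx4]]
    rhs = [Sy, Sxy, Sx2y]
    D = det3(A)
    coeffs = []
    for col in range(3):                 # replace one column at a time
        M = [row[:] for row in A]
        for r in range(3):
            M[r][col] = rhs[r]
        coeffs.append(det3(M) / D)
    return coeffs                        # [a, b, c]

a, b, c = fit_quadratic([0, 1, 2, 3, 4], [1, 1.8, 1.3, 2.5, 6.3])
```

For the data above this recovers a ≈ 1.42, b ≈ −1.07, c ≈ 0.55, matching the worked example.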
4. Measures of Central Tendency
Measures of central tendency: give an idea of the average/central value of the distribution.
Median: divides the whole data set into 2 equal parts; it can also be used for qualitative data. Arrange the observations in ascending order. If the number of observations n is
(1) odd: the ((n+1)/2)th term is the median;
(2) even: the mean of the (n/2)th and (n/2 + 1)th terms is the median.
For grouped data, Md = L + ((N/2 − F)/f) × h, where L is the lower limit of the median class, N the total frequency, F the cumulative frequency preceding the median class, f the frequency of the median class and h the class width.
Mode: the most frequently occurring observation; it may be ill-defined. Prepare the frequency table and find the modal class; then
Mo = L + ((f1 − f0)/(2f1 − f0 − f2)) × h,
where f1 is the frequency of the modal class and f0, f2 are the frequencies of the preceding and following classes.
For the ungrouped data 10, 7, 11, 9, 9, 10, 7, 9, 12:
Geometric Mean = Antilog((1/9) Σ log xi) = Antilog(0.963) = 9.19
Harmonic Mean = 9 / (1/10 + 1/7 + 1/11 + 1/9 + 1/9 + 1/10 + 1/7 + 1/9 + 1/12) = 9/0.99 = 9.06
Calculation of the 3rd quartile, 5th decile and 60th percentile (first arrange the data in ascending order):
Quartile Qi = (i(n+1)/4)th observation, so Q3 = (3(9+1)/4)th = (30/4)th = 7.5th observation.
Objective: Computation of Measures of Central Tendency by all methods for Grouped data.
Kinds of data: The following data relate to the percentage of marks obtained by 556 students in a certain examination.

Class intervals (Marks)   1-10  11-20  21-30  31-40  41-50  51-60  61-70  71-80  81-90  91-100
No. of students           2     20     50     90     107    115    91     53     20     8
Solution:

Class      Fi       Cum.    Xi    FiXi    di=Xi−A   Fidi    Si=Xi/h   FiSi    oi=(Xi−A)/h   Fioi
(%Marks)            freq.                 (A=55)            (h=10)
0-10       2        2       5     10      -50       -100    0.5       1.0     -5            -10
10-20      20       22      15    300     -40       -800    1.5       30.0    -4            -80
20-30      50       72      25    1250    -30       -1500   2.5       125.0   -3            -150
30-40      90       162     35    3150    -20       -1800   3.5       315.0   -2            -180
40-50      107(f0)  269     45    4815    -10       -1070   4.5       481.5   -1            -107
50-60      115(f1)  384     55    6325    0         0       5.5       632.5   0             0
60-70      91(f2)   475     65    5915    10        910     6.5       591.5   1             91
70-80      53       528     75    3975    20        1060    7.5       397.5   2             106
80-90      20       548     85    1700    30        600     8.5       170.0   3             60
90-100     8        556     95    760     40        320     9.5       76.0    4             32
Total      556                    28200             -2380             2820                  -238
Arithmetic Mean
(a) Ordinary method = 28200/556 = 50.72 %
(b) Change of origin method = 55 + (−2380/556) = 55 − 4.28 = 50.72 %
(c) Change of scale method = (2820/556) × 10 = 50.72 %
(d) Change of origin and scale method = 55 + (−238/556) × 10 = 55 − 4.28 = 50.72 %
Median
First we find the median class: Σfi/2 = 556/2 = 278.
278 falls under the cumulative frequency 384, so 50-60 is the median class.
So, Md = 50 + ((556/2 − 269)/115) × 10 = 50 + 90/115 = 50.78 %
Mode
The highest frequency is 115, so the modal class is 50-60.
Mo = 50 + ((115 − 107)/(2×115 − 107 − 91)) × 10 = 50 + 80/32 = 50 + 2.5 = 52.5 %
Geometric Mean: GM = Antilog((1/Σfi) Σ fi log xi)

xi      log xi   fi     fi log xi   fi/xi
5       0.70     2      1.40        0.40
15      1.18     20     23.52       1.33
25      1.40     50     69.90       2.00
35      1.54     90     138.97      2.57
45      1.65     107    176.89      2.38
55      1.74     115    200.14      2.09
65      1.81     91     164.98      1.40
75      1.88     53     99.38       0.71
85      1.93     20     38.59       0.24
95      1.98     8      15.82       0.08
Total            556    929.58      13.20

GM = Antilog(929.58/556) = Antilog(1.67) = 46.98 %

Harmonic Mean: HM = Σfi / Σ(fi/xi) = 556/13.20 = 42.12 %

Quartiles, Deciles and Percentiles
Third quartile, sixth decile and 20th percentile:
First we determine the quartile class: 3×556/4 = 417.
417 falls in the 60-70 cumulative frequency class, so the third quartile is
Q3 = 60 + ((3×556/4 − 384)/91) × 10 = 60 + 330/91 = 60 + 3.63 = 63.63 %
Sixth decile: here the decile class = 6×556/10 = 333.6, so the decile class is 50-60.
D6 = 50 + ((6×556/10 − 269)/115) × 10 = 50 + 646/115 = 50 + 5.62 = 55.62 %
20th percentile: here the percentile class = 20×556/100 = 111.2, so the percentile class is 30-40.
P20 = 30 + ((20×556/100 − 72)/90) × 10 = 30 + (39.2/90) × 10 = 30 + 4.35 = 34.35 %
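The grouped-data interpolation formulas used above can be sketched together in one illustrative helper (assuming exclusive classes of equal width; function and variable names are not from the text):

```python
def grouped_stats(bounds, freqs):
    """Mean, median and mode for a grouped frequency distribution.
    `bounds` holds the lower class limits plus the final upper limit."""
    n = sum(freqs)
    mids = [(bounds[i] + bounds[i + 1]) / 2 for i in range(len(freqs))]
    mean = sum(f * m for f, m in zip(freqs, mids)) / n

    cum, c = [], 0                       # cumulative frequencies
    for f in freqs:
        c += f
        cum.append(c)

    # median class: first class whose cumulative frequency reaches N/2
    i = next(j for j, cf in enumerate(cum) if cf >= n / 2)
    L, h, f = bounds[i], bounds[i + 1] - bounds[i], freqs[i]
    F = cum[i - 1] if i else 0
    median = L + (n / 2 - F) / f * h

    # modal class: the class with the highest frequency
    i = freqs.index(max(freqs))
    L, h = bounds[i], bounds[i + 1] - bounds[i]
    f1 = freqs[i]
    f0 = freqs[i - 1] if i else 0
    f2 = freqs[i + 1] if i + 1 < len(freqs) else 0
    mode = L + (f1 - f0) / (2 * f1 - f0 - f2) * h
    return mean, median, mode

bounds = list(range(0, 101, 10))
freqs = [2, 20, 50, 90, 107, 115, 91, 53, 20, 8]
mean, med, mode = grouped_stats(bounds, freqs)
```

For the marks data this reproduces mean ≈ 50.72, median ≈ 50.78 and mode = 52.5.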
Objective: Computation of the algebraic sum of the deviations of a set of values from their arithmetic mean.
Kinds of data: The following data relate to the frequency distribution of the number of workers according to their wages in a certain factory.

Wages (in Rs)    Below 10  Below 20  Below 30  Below 40  Below 50  Below 60  Below 70  Below 80
No. of workers   15        35        60        84        96        127       198       250

Solution: First we convert the cumulative (less-than type) frequencies into ordinary class frequencies. The data become:

Wages (in Rs)    0-10  10-20  20-30  30-40  40-50  50-60  60-70  70-80
No. of workers   15    20     25     24     12     31     71     52

Then the A.M. = Σfixi/N = 12600/250 = Rs. 50.40.
Then the algebraic sum of the deviations of the values from their arithmetic mean, Σfi(xi − X̄), is approximately equal to zero.
Objective: Computation of the pooled mean of the given data.
Kinds of data: The average of 5 numbers (first series) is 40 and the average of another 4 numbers (second series) is 50.
Solution: We know that the pooled mean = (n1X̄1 + n2X̄2)/(n1 + n2) = (5×40 + 4×50)/(5 + 4) = 400/9 = 44.44

Kinds of data (the statement of this example is reconstructed from the computation that follows): distances of 5 km and 15 km are covered at speeds of 30 km/h and 45 km/h respectively.
Solution: Average speed = Σwi / Σ(wi/xi) = 20 / (5/30 + 15/45) = 20/0.5 = 40 km/h

Kinds of data: A man goes from place A to place B at a speed of 10 km/h and comes back from B to A at a speed of 15 km/h; the distance travelled each way is x.
Solution: Average speed = Total distance / Total time taken = (x + x)/(x/10 + x/15) = 2x/(x/6) = 12 km/h, which is the harmonic mean of 10 and 15.

Kinds of data: The mean of 100 observations is 50. What will be the new mean if
(i) 6 is added to each observation; (ii) each observation is multiplied by 3;
(iii) 5 is subtracted from each observation and the result is divided by 4?
Solution: (i) Since 6 is added to each observation, the new mean = 50 + 6 = 56.
(ii) If each observation is multiplied by 3, the new mean = 3 × 50 = 150.
(iii) The new variable is U = (X − 5)/4, hence the mean Ū = (X̄ − 5)/4 = (50 − 5)/4 = 11.25.
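Both averaging rules above can be sketched in a few lines (the helper names are hypothetical, not from the text):

```python
def pooled_mean(groups):
    """Weighted (pooled) mean of several group means.
    `groups` is a list of (n_i, mean_i) pairs."""
    return sum(n * m for n, m in groups) / sum(n for n, _ in groups)

def harmonic_mean_speed(legs):
    """Average speed over legs given as (distance, speed) pairs:
    total distance / total time, i.e. the weighted harmonic mean."""
    return sum(d for d, _ in legs) / sum(d / s for d, s in legs)

pm = pooled_mean([(5, 40), (4, 50)])            # 400/9 = 44.44...
v1 = harmonic_mean_speed([(5, 30), (15, 45)])   # the 40 km/h example
v2 = harmonic_mean_speed([(1, 10), (1, 15)])    # equal distances: 12 km/h
```

The equal-distance case reduces to the plain harmonic mean of the two speeds, as in the A-to-B example.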
5. Measures of Dispersion
Objective: Computation of Measures of Dispersion by all methods for Ungrouped data.
Kinds of Data: Suppose the data are 10, 7, 5, 9, 9, 10, 7, 3, 12
Solution:
(1) Range = 12 − 3 = 9
(2) Quartile Deviation: arrange the observations in ascending order:
3, 5, 7, 7, 9, 9, 10, 10, 12
Q1 = (1×(9+1)/4)th = (10/4)th observation = 2.5th observation
So Q1 = 5 + 0.5×(7 − 5) = 6
Similarly Q3 = (3×(9+1)/4)th = (30/4)th observation = 7.5th observation
So Q3 = 10 + 0.5×(10 − 10) = 10
Now QD = (10 − 6)/2 = 2
(3) Mean Deviation:
Mean = (10+7+5+9+9+10+7+3+12)/9 = 8
MD = (1/9)(|10−8| + |7−8| + |5−8| + |9−8| + |9−8| + |10−8| + |7−8| + |3−8| + |12−8|)
   = (1/9)(2+1+3+1+1+2+1+5+4) = 20/9 = 2.22
(4) Standard Deviation:
SD = √[((10−8)² + (7−8)² + (5−8)² + (9−8)² + (9−8)² + (10−8)² + (7−8)² + (3−8)² + (12−8)²)/9]
   = √[(4+1+9+1+1+4+1+25+16)/9] = √(62/9) = 2.62
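The four measures can be computed together; this sketch follows the text's i(n+1)/4 convention for quartile positions and the population form of the standard deviation (the function name is illustrative):

```python
def dispersion(data):
    """Range, quartile deviation, mean deviation and SD for ungrouped data."""
    xs = sorted(data)
    n = len(xs)

    def quartile(i):
        # Q_i at the i*(n+1)/4-th observation, interpolating fractions
        pos = i * (n + 1) / 4 - 1            # zero-based position
        lo = int(pos)
        frac = pos - lo
        if lo + 1 < n:
            return xs[lo] + frac * (xs[lo + 1] - xs[lo])
        return xs[lo]

    rng = xs[-1] - xs[0]
    qd = (quartile(3) - quartile(1)) / 2
    mean = sum(xs) / n
    md = sum(abs(x - mean) for x in xs) / n
    sd = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return rng, qd, md, sd

rng, qd, md, sd = dispersion([10, 7, 5, 9, 9, 10, 7, 3, 12])
```

For the data above this gives range 9, QD 2, MD ≈ 2.22 and SD ≈ 2.62, matching the hand computation.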
Kinds of data: The following age distribution of members is given.

Age (in years)    20-30  30-40  40-50  50-60  60-70  70-80  80-90  Total
No. of members    3      61     132    153    140    51     2      542
Solution:
(1) Range = 90 − 20 = 70
(2) Quartile Deviation: first we find the first and third quartiles.

Age        No. of    Cum.    Xi    FiXi    (Xi−X̄)   Fi|Xi−X̄|   (Xi−X̄)²   Fi(Xi−X̄)²
(years)    members   freq.
20-30      3         3       25    75      -29.7     89.2        883.3      2649.8
30-40      61        64      35    2135    -19.7     1202.9      388.9      23721.6
40-50      132       196     45    5940    -9.7      1283.0      94.5       12471.1
50-60      153       349     55    8415    0.3       42.8        0.1        12.0
60-70      140       489     65    9100    10.3      1439.2      105.7      14795.0
70-80      51        540     75    3825    20.3      1034.3      411.3      20975.2
80-90      2         542     85    170     30.3      60.6        916.9      1833.8
Total      542                     29660             5152.0                 76458.49
First we determine the first quartile class: 1×542/4 = 135.5.
135.5 falls in the 40-50 cumulative frequency class, so the first quartile is
Q1 = 40 + ((1×542/4 − 64)/132) × 10 = 40 + 715/132 = 40 + 5.42 = 45.42 years
Similarly for Q3: 3×542/4 = 406.5,
406.5 falls in the 60-70 cumulative frequency class, so the third quartile is
Q3 = 60 + ((3×542/4 − 349)/140) × 10 = 60 + 575/140 = 60 + 4.11 = 64.11 years
So the quartile deviation = (64.11 − 45.42)/2 = 18.69/2 = 9.345
(3) Mean Deviation: first calculate the mean,
Mean = 29660/542 = 54.72
From the above table, Mean Deviation = 5152/542 = 9.51
(4) Standard Deviation = √(76458.49/542) = √141.07 = 11.88
Kinds of data: The distributions of the number of goals scored by two teams A and B are given below. Which series is more consistent?

No. of       fA     fAxi   (xi−x̄A)   (xi−x̄A)²   fA(xi−x̄A)²   fB     fByi   (yi−ȳB)   (yi−ȳB)²   fB(yi−ȳB)²
goals (xi)
0            27     0      -1.05      1.10        29.77         17     0      -1.2       1.44        24.48
1            9      9      -0.05      0.00        0.02          9      9      -0.2       0.04        0.36
2            8      16     0.95       0.90        7.22          6      12     0.8        0.64        3.84
3            5      15     1.95       3.80        19.01         5      15     1.8        3.24        16.2
4            4      16     2.95       8.70        34.81         3      12     2.8        7.84        23.52
Total        53     56                            90.83         40     48                            68.4

First we calculate the mean and standard deviation of the first series (A):
X̄A = 56/53 = 1.05, σA = √(90.83/53) = √1.714 = 1.31, so CV = (σA/X̄A)×100 = (1.31/1.05)×100 = 124.76
Now we calculate the mean and standard deviation of the second series (B):
X̄B = 48/40 = 1.2, σB = √(68.4/40) = √1.71 = 1.30, so CV = (σB/X̄B)×100 = (1.30/1.2)×100 = 108.33
Comparing the coefficients of variation of series A and B, series B, having the lower CV value, is the more consistent.
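The comparison can be reproduced in code. Note that exact arithmetic gives slightly different CVs than the hand computation, which rounds the means to 1.05 and 1.2; the ordering, and hence the conclusion, is unchanged (the function name is illustrative):

```python
def coefficient_of_variation(values, freqs):
    """CV = (sigma / mean) * 100 for a discrete frequency distribution;
    the series with the smaller CV is the more consistent one."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    var = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs)) / n
    return (var ** 0.5 / mean) * 100

goals = [0, 1, 2, 3, 4]
cv_a = coefficient_of_variation(goals, [27, 9, 8, 5, 4])   # team A
cv_b = coefficient_of_variation(goals, [17, 9, 6, 5, 3])   # team B
```

This gives CV(A) ≈ 124 and CV(B) ≈ 109, so B is the more consistent series.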
6. Skewness and Kurtosis
Skewness and Kurtosis: Skewness gives us an idea about the symmetry of the curve whereas the Kurtosis
gives us an idea about the shape of the curve.
Measure: Skewness
Definition: means lack of symmetry (Mean ≠ Median ≠ Mode).
Types:
(i) Zero skewness: symmetrical or normal curve.
(ii) Positively skewed: Mean > Median, or Mean > Mode.
(iii) Negatively skewed: Mean < Median, or Mean < Mode.
Coefficients:
(1) Absolute measures
Sk = Mean − Median
Sk = Mean − Mode
Sk = Q3 + Q1 − 2 Median
(2) Relative measures
(i) Karl Pearson's coefficient of skewness
Sk = (Mean − Mode)/SD = 3(Mean − Median)/SD
Range of the coefficient: −3 to +3
(ii) Prof. Bowley's coefficient of skewness
Sk = (Q3 + Q1 − 2 Median)/(Q3 − Q1)
Range of the coefficient: −1 to +1
(3) Based on moments
Sk = √β1 (β2 + 3) / (2(5β2 − 6β1 − 9)), where β1 = μ3²/μ2³ and β2 = μ4/μ2².
Note: Here μ2, μ3 and μ4 are the central moments (moments about the mean), defined as μr = Σfi(Xi − X̄)^r / Σfi.
Moments: the averages of the deviations from the mean (or from some other value) raised to a given power.
Moments about any arbitrary value A: μr′ = Σfi(Xi − A)^r / Σfi, where A is any arbitrary value.
Moments about the origin: mr = Σfi Xi^r / Σfi
Moments about the arithmetic mean: μr = Σfi(Xi − X̄)^r / Σfi, where X̄ is the arithmetic mean.
Relationship between μr and μr′:
μr = μr′ − rC1 μ1′ μ(r−1)′ + rC2 (μ1′)² μ(r−2)′ − … + (−1)^r (μ1′)^r
In particular,
μ2 = μ2′ − (μ1′)²
μ3 = μ3′ − 3μ1′μ2′ + 2(μ1′)³
μ4 = μ4′ − 4μ1′μ3′ + 6(μ1′)²μ2′ − 3(μ1′)⁴
Important: (i) μ0 = μ0′ = 1; (ii) the first central moment is always zero; (iii) μ2 = SD² = variance.
Objective: Computation of the mean and variance when moments about an arbitrary value are given.
Kinds of data: The first three moments of a distribution about the value 2 of a variable are 1, 16 and −40.
Solution: Here the arbitrary value A = 2 and the moments are μ1′ = 1, μ2′ = 16 and μ3′ = −40.
We know that μ1′ = Σfi(Xi − 2)/Σfi = 1, hence Σfixi/Σfi − 2 = 1, which gives X̄ = Σfixi/Σfi = 1 + 2 = 3.
Hence the mean is 3.
We know that μ2 = μ2′ − (μ1′)²; putting in the values we get
μ2 = 16 − 1×1 = 15.
Hence the variance is 15.
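The conversion from moments about A to the mean and central moments can be checked directly (the helper name is illustrative):

```python
def mean_and_central_moments(A, raw):
    """Mean and central moments mu_2, mu_3 from the first three moments
    (m1, m2, m3) about an arbitrary value A, using the relations above."""
    m1, m2, m3 = raw
    mean = A + m1                      # mu_1' = X_bar - A
    mu2 = m2 - m1 ** 2                 # the variance
    mu3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3
    return mean, mu2, mu3

mean, var, mu3 = mean_and_central_moments(2, (1, 16, -40))
```

For the example above this gives mean 3 and variance 15, and additionally μ3 = −86 from the same relations.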
Objective: Calculation of an appropriate measure of skewness from the following cumulative frequency distribution.
Kinds of data: The distribution of the data is given below.

Age (under years)    20   30   40   50   60   70
Number of persons    12   29   48   75   94   106

Solution: Here upper limits along with cumulative frequencies are given, so we first recover the class limits and frequencies of the dataset.

Age (years)   Cumulative frequency   Number of persons (frequency)
Below 20      12                     12
20-30         29                     29 − 12 = 17
30-40         48                     48 − 29 = 19
40-50         75                     75 − 48 = 27
50-60         94                     94 − 75 = 19
60-70         106                    106 − 94 = 12
Total                                N = 106

Since the distribution is open-ended, the mean cannot be calculated, so all methods requiring the mean are ruled out. Bowley's method, which is based on quartiles, can be used here.
First we determine the first quartile class: 1×106/4 = 26.5.
26.5 falls in the 20-30 cumulative frequency class, so the first quartile is
Q1 = 20 + ((1×106/4 − 12)/17) × 10 = 20 + 145/17 = 20 + 8.53 = 28.53 years
Similarly for the median, Q2 = 2×106/4 = 53;
53 falls in the 40-50 cumulative frequency class, so the second quartile is
Q2 = 40 + ((2×106/4 − 48)/27) × 10 = 40 + 50/27 = 40 + 1.85 = 41.85 years
Similarly for Q3 = 3×106/4 = 79.5;
79.5 falls in the 50-60 cumulative frequency class, so the third quartile is
Q3 = 50 + ((3×106/4 − 75)/19) × 10 = 50 + 45/19 = 50 + 2.37 = 52.37 years
Prof. Bowley's coefficient of skewness: Sk = (Q3 + Q1 − 2 Median)/(Q3 − Q1)
Putting the values into the formula we get
Sk = (52.37 + 28.53 − 2×41.85)/(52.37 − 28.53) = −2.8/23.84 = −0.117
Hence the coefficient of skewness is −0.117.
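Bowley's quartile-based coefficient can be computed directly from the recovered frequencies. This is a sketch: the open "below 20" class is given a nominal lower bound of 0, which does not affect the result here since no quartile falls in that class.

```python
def grouped_quantile(bounds, freqs, p):
    """p-th quantile (0 < p < 1) of a grouped distribution via
    Q = L + (pN - F)/f * h."""
    target = p * sum(freqs)
    cum = 0
    for i, f in enumerate(freqs):
        if cum + f >= target:
            L, h = bounds[i], bounds[i + 1] - bounds[i]
            return L + (target - cum) / f * h
        cum += f
    return bounds[-1]

def bowley_skewness(bounds, freqs):
    # Sk = (Q3 + Q1 - 2*Q2) / (Q3 - Q1), needing only the quartiles
    q1, q2, q3 = (grouped_quantile(bounds, freqs, p) for p in (0.25, 0.5, 0.75))
    return (q3 + q1 - 2 * q2) / (q3 - q1)

bounds = [0, 20, 30, 40, 50, 60, 70]   # nominal 0 for the open first class
freqs = [12, 17, 19, 27, 19, 12]
sk = bowley_skewness(bounds, freqs)
```

For the age data this reproduces Sk ≈ −0.117, a mild negative skew.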
7. Probability
Probability: defined as the ratio of the number of cases favourable to an event to the total number of all possible outcomes.
Note:
• The sum of all probabilities is always equal to 1; if p is the probability of success of an event, then q = 1 − p is the probability of its failure.
• Sometimes the probability is based on combinations (selections). The number of ways of selecting r things out of n is
nCr = n! / ((n − r)! r!), where n! = n(n−1)(n−2)…3·2·1.
e.g. 5! = 5×4×3×2×1 = 120
1. Find the chance of throwing at least one ace in a single throw with two dice.
Solution: The probability of getting an ace on a single die, p = no. of favourable cases / all possible outcomes = 1/6.
So the probability of not getting an ace on a single die, q = 1 − 1/6 = 5/6.
We know that P(at least one event happens) = 1 − P(all the events fail to happen)
= 1 − (5/6)×(5/6) = 1 − 25/36 = 11/36
2. From a bag containing 4 white and 5 black balls, 3 are drawn at random. What are the odds against these being all black?
Solution: The probability of drawing 3 black balls
= no. of favourable cases for selecting 3 black balls / total no. of all possible cases
= 5C3 / 9C3 = (5!/(3!2!)) / (9!/(3!6!)) = 10/84 = 5/42
And the probability of not drawing all black balls = 1 − 5/42 = 37/42.
So the odds against these being all black are 37 : 5.
3. What is the chance of drawing a pie from a purse, one compartment of which contains 3 paise and 2 pies, and the other 2 paise and 1 pie?
Solution: Here the probability of selecting each compartment is 1/2.
So the probability of drawing a pie from the purse = (1/2)×(2/5) + (1/2)×(1/3) = 1/5 + 1/6 = 11/30
4. A bag contains 4 red balls and 3 blue balls. Two drawings of 2 balls are made. Find the chance that the first drawing gives 2 red balls and the second drawing 2 blue balls, when
(a) the balls are returned to the bag after the first draw;
(b) the balls are not returned.
Solution: (a) If the balls are returned to the bag after the first draw,
P = (no. of favourable cases for selecting 2 red balls / total no. of ways of selecting 2 balls from 7) × (no. of favourable cases for selecting 2 blue balls / total no. of ways of selecting 2 balls from 7)
= (4C2/7C2) × (3C2/7C2) = (6/21) × (3/21) = 2/49
(b) If the balls are not returned to the bag after the first draw, the second draw is from the remaining 5 balls:
P = (4C2/7C2) × (3C2/5C2) = (6/21) × (3/10) = 3/35
5. If three coins are tossed, what is the chance of getting (a) exactly two heads, (b) at least two heads, (c) at most two heads? [Answers: 3/8, 1/2, 7/8]
Solution:
(a) P = P{H,H,T} + P{H,T,H} + P{T,H,H}
= (1/2)(1/2)(1/2) + (1/2)(1/2)(1/2) + (1/2)(1/2)(1/2) = 1/8 + 1/8 + 1/8 = 3/8
(b) P(at least two heads) = P(exactly two) + P(three) = 3/8 + 1/8 = 1/2
(c) P(at most two heads) = 1 − P(three heads) = 1 − 1/8 = 7/8
6. Four persons are chosen at random from a group consisting of 3 men, 2 women and 4 children. Show that the chance that exactly 2 of them are children is 10/21.
Solution: Here
p = (no. of cases favourable for selecting 2 children out of 4 and the remaining 2 from the 5 men and women) / (total no. of all possible cases for selecting 4 persons out of 9)
So, P = (4C2 × 5C2)/9C4 = (6 × 10)/126 = 60/126 = 10/21
7. A card is drawn from a well-shuffled pack of playing cards. What is the probability that it is either a spade or an ace?
Solution: Let A be the event of drawing a spade, B the event of drawing an ace, and A∩B the event of drawing the ace of spades. Then by the addition theorem of probability,
P(A∪B) = P(A) + P(B) − P(A∩B)
= 13/52 + 4/52 − 1/52 = 16/52 = 4/13
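Several of the worked answers above can be verified with `math.comb` (a quick check, not part of the original text):

```python
from math import comb

# Problem 1: at least one ace with two dice
p1 = 1 - (5/6) ** 2
# Problem 2: all three drawn balls black, from 4 white + 5 black
p2 = comb(5, 3) / comb(9, 3)
# Problem 6: exactly 2 children among 4 chosen from 3 men, 2 women, 4 children
p6 = comb(4, 2) * comb(5, 2) / comb(9, 4)
# Problem 7: spade or ace from a 52-card pack
p7 = 13/52 + 4/52 - 1/52
```

Each value agrees with the hand computation: 11/36, 5/42, 10/21 and 4/13 respectively.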
8. Discrete and Continuous Distribution
Distribution: The distribution of a statistical data set (or a population) is a listing or function showing all
the possible values (or intervals) of the data and how often they occur.
Binomial distribution
Probability mass function: P(X = x) = nCx p^x q^(n−x), x = 0, 1, 2, …, n; 0 otherwise.
Here 0 ≤ x ≤ n, and n and p are the parameters of the distribution.
• X represents the number of successes of the event.
• The probability of x is given by p(x) = nCx p^x q^(n−x).
• The frequency of x in N sets of n trials each = N·p(x).
• Mean = np, Variance = npq.
Conditions for the binomial distribution:
• Each trial results in two mutually disjoint outcomes, success and failure.
• The number of trials n is finite.
• The trials are independent of each other.
• The probability of success is constant for each trial.
Examples: tossing of a coin, number of faulty blades in a packet of 100, throwing of a die, etc.
Fitting of a binomial distribution:
• Calculate the mean X̄ and equate X̄ = np, so p = X̄/n.
• Expand the binomial N(q + p)^n = N[q^n + nC1 q^(n−1) p + … + p^n], or use the recurrence multiplying factor ((n − r + 1)/r)·(p/q) to pass from one term to the next.
• Apply the chi-square test for goodness of fit.

Poisson distribution
Probability mass function: P(X = x) = e^(−λ) λ^x / x!, x = 0, 1, 2, …; 0 otherwise.
Here 0 ≤ x < ∞, λ ≥ 0, and λ is the parameter of the distribution.
• X represents the number of occurrences of the rare event, e.g. 0, 1, 2, …
• The probability of x is given by p(x) = e^(−λ) λ^x / x!.
• The frequency of x out of N cases = N·p(x).
• Mean = Variance = λ.
Condition: the Poisson distribution arises for events which do not occur as outcomes of a definite number of trials but occur at random points of time and space, where our interest is only in the number of occurrences of the event, e.g. the number of deaths from a disease.
Fitting of a Poisson distribution:
• Calculate the mean X̄ and equate X̄ = λ.
• Find e^(−λ), then calculate e^(−λ) λ^x / x!, or use the recurrence formula P(x+1) = (λ/(x+1))·P(x).
• Apply the chi-square test for goodness of fit.

Normal distribution
The limiting form of the binomial distribution when n is large (n → ∞) and neither p nor q is very small.
Probability density function: f(x; µ, σ) = (1/(σ√(2π))) exp[−(1/2)((x − µ)/σ)²],
−∞ < x < ∞, −∞ < µ < ∞, σ > 0.
Here µ and σ² are the parameters of the distribution; Mean = µ, Variance = σ².
Properties of the normal distribution:
• The curve is bell shaped and symmetrical.
• Mean = median = mode.
• As x moves away from the mean, f(x) decreases rapidly.
• β1 = 0 and β2 = 3.
• f(x) can never be negative.
• The x-axis is an asymptote to the curve.
• Mean deviation about the mean ≈ (4/5)σ and quartile deviation ≈ (2/3)σ, so QD : MD : SD = 10 : 12 : 15.
• Area property:
P(µ − σ < X < µ + σ) = 0.6826
P(µ − 2σ < X < µ + 2σ) = 0.9544
P(µ − 3σ < X < µ + 3σ) = 0.9973
Fitting of a normal distribution:
• Calculate the mean µ and standard deviation σ from the given data.
• Calculate the standard normal variate zi = (xi − µ)/σ corresponding to the lower limit of each class interval. The area under the normal curve to the left of the ordinate at z = zi, say Φ(zi), is read from the tables.
• The areas for successive class intervals are obtained by subtraction, Φ(z(i+1)) − Φ(zi), i = 1, 2, …
• Multiplying these areas by N gives the expected normal frequencies.
• Apply the chi-square test for goodness of fit.
Objective: Fitting of binomial distribution
Kinds of data: The following data relate to the frequency distribution of the number of boys among the first seven children in families of Swedish ministers.

No. of boys/family   0   1    2     3     4     5     6    7    Total
No. of families      6   57   206   362   365   256   69   13   1334

Solution:

xi      fi      fixi    P(x) = 7Cx (0.51)^x (0.49)^(7−x)   Expected freq. N·P(x)   (Oi−Ei)²/Ei
0       6       0       7C0 (0.51)^0 (0.49)^7              9.05                    1.03
1       57      57      7C1 (0.51)^1 (0.49)^6              65.9                    1.21
2       206     412     7C2 (0.51)^2 (0.49)^5              205.8                   0.00
3       362     1086    7C3 (0.51)^3 (0.49)^4              357.0                   0.07
4       365     1460    7C4 (0.51)^4 (0.49)^3              371.6                   0.12
5       256     1280    7C5 (0.51)^5 (0.49)^2              232.1                   2.47
6       69      414     7C6 (0.51)^6 (0.49)^1              80.5                    1.65
7       13      91      7C7 (0.51)^7 (0.49)^0              12.0                    0.09
Total   1334    4800                                       1334                    χ² = 6.62

Mean = 4800/1334 = 3.6; equating np = 3.6 with n = 7 gives p = 3.6/7 = 0.51.
Then q = 1 − 0.51 = 0.49, and the expected frequencies are calculated in the table.
Then we apply the χ² test for goodness of fit. The calculated χ² value, 6.62, is less than the tabulated χ² value at 6 degrees of freedom (12.59) at the 5% level, so the binomial distribution fits the number of boys among the first seven children in families of Swedish ministers well.
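The fitting procedure (estimate p from the mean, form the expected frequencies N·P(x), compare by chi-square) can be sketched as below. The exact p = 3.6/7 ≈ 0.514 is used, so the expected frequencies differ slightly from the table, which rounds p to 0.51 (function name illustrative):

```python
from math import comb

def fit_binomial(xs, freqs, n):
    """Fit a binomial distribution: estimate p from the mean, return p,
    the expected frequencies N*P(x) and the chi-square statistic."""
    N = sum(freqs)
    mean = sum(x * f for x, f in zip(xs, freqs)) / N
    p = mean / n                       # from X_bar = n*p
    q = 1 - p
    expected = [N * comb(n, x) * p ** x * q ** (n - x) for x in xs]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(freqs, expected))
    return p, expected, chi2

obs = [6, 57, 206, 362, 365, 256, 69, 13]
p, exp_freq, chi2 = fit_binomial(range(8), obs, n=7)
```

Since x runs over the full support 0…7, the expected frequencies sum back to N = 1334, and the chi-square statistic stays well below the 5% critical value.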
Problem: Ten coins are thrown simultaneously. Find the probability of getting at least 7 heads.
Solution: In the tossing of a coin, P(H) = P(T) = 1/2.
The probability of getting x heads in a random throw of 10 coins is
P(X = x) = 10Cx (1/2)^x (1/2)^(10−x) = 10Cx (1/2)^10, x = 0, 1, 2, …, 10
The probability of getting at least seven heads is
P(X ≥ 7) = P(7) + P(8) + P(9) + P(10)
= (1/2)^10 [10C7 + 10C8 + 10C9 + 10C10] = (120 + 45 + 10 + 1)/1024 = 176/1024
Objective: Fitting of Poisson distribution
Kinds of data: The following data relate to the number of α-particles emitted by a film of polonium in 2608 successive intervals of one-eighth of a minute.

No. of α-particles   0    1     2     3     4     5     6     7     8    9    10   11   12   13   14
Observed frequency   57   203   383   525   532   408   273   139   45   27   10   4    0    1    1

Solution: Calculate the mean of the observed data and equate it to the theoretical mean λ.
Mean = 10097/2608 = 3.87, so λ = 3.87.
To fit the distribution we find the value of e^(−λ).
Putting x = 0 in P[X = x] = e^(−λ) λ^x / x! gives P[X = 0] = e^(−3.87).
log10 P(0) = −3.87 log10 e = −3.87 × 0.4343 = −1.68
Then P(0) = Antilog(−1.68) = 0.0208
The remaining expected frequencies follow from the probability mass function, as shown in the table.

No. of        Observed        fixi     P(x)                                 Expected       (Oi−Ei)²/Ei
α-particles   frequency fi                                                  freq. N·P(x)
0             57              0        P(0) = e^(−3.87) = 0.0208            54.3           0.1331
1             203             203      P(1) = e^(−3.87)×3.87/1! = 0.0806    210.3          0.2514
2             383             766      P(2) = e^(−3.87)×3.87²/2! = 0.1560   407.0          1.4194
3             525             1575     P(3) = 0.2014                        525.3          0.0002
4             532             2128     P(4) = 0.1949                        508.4          1.0937
5             408             2040     P(5) = 0.1509                        393.7          0.5214
6             273             1638     P(6) = 0.0974                        254.0          1.4180
7             139             973      P(7) = 0.0539                        140.5          0.0159
8             45              360      P(8) = 0.0261                        68.0           7.7744
9             27              243      P(9) = 0.0112                        29.2           0.1728
10            10              100      P(10) = 0.0043                       11.3           0.1547
11            4               44       P(11) = 0.0015                       4.0            0.0069
12            0               0        P(12) = 0.0005                       1.3
13            1               13       P(13) = 0.0001                       0.4
14            1               14       P(14) = 0.0000                       0.1
Total         2608            10097                                         2608           χ² = 12.9616

Then we apply the χ² test for goodness of fit. The calculated χ² value, 12.96, is less than the tabulated χ² value at 10 degrees of freedom (18.307) at the 5% level, so the Poisson distribution fits the data well.
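The same recurrence-based fitting can be sketched in code (the function name is illustrative):

```python
from math import exp

def fit_poisson(xs, freqs):
    """Fit a Poisson distribution: lam = sample mean; the probabilities
    come from the recurrence P(x+1) = lam/(x+1) * P(x), starting at
    P(0) = exp(-lam)."""
    N = sum(freqs)
    lam = sum(x * f for x, f in zip(xs, freqs)) / N
    p = exp(-lam)                      # P(0)
    probs = []
    for x in xs:
        probs.append(p)
        p = p * lam / (x + 1)          # recurrence step
    expected = [N * q for q in probs]
    return lam, expected

obs = [57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 10, 4, 0, 1, 1]
lam, exp_freq = fit_poisson(range(15), obs)
```

This reproduces λ ≈ 3.87 and an expected frequency of about 54.3 for x = 0, as in the table.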
Problem: In a book of 520 pages, 390 typographical errors occur. Assuming the Poisson law for the number of errors per page, find the probability that a random sample of 5 pages will contain no error.
Solution: First we find the average number of typographical errors per page, λ = 390/520 = 0.75.
By the Poisson probability law, P(X = x) = e^(−λ) λ^x / x! = e^(−0.75) 0.75^x / x!
So the probability that a random sample of 5 pages will contain no error is
[P(X = 0)]^5 = (e^(−0.75))^5 = e^(−3.75)
Objective: Fitting of a normal distribution.
Solution: Set up the null hypothesis H0: there is no significant difference between the observed and expected frequencies, against H1: there is a significant difference between the two.
Calculate the mean and standard deviation of the observed data: 79.945 and 5.545 respectively.

Class       Fi     Xi     FiXi      Lower   Z=(X−μ)/σ   φ(z)    φ(z+1)−φ(z)   Expected   Rounded
intervals                           limit                                     freq.      exp. freq.
Below 60                            −∞      −∞          0
60-65       3      62.5   187.5     60      -3.66       0       0.003         2.9        3
65-70       21     67.5   1417.5    65      -2.75       0.003   0.031         31         31
70-75       150    72.5   10875     70      -1.83       0.034   0.148         147.8      148
75-80       335    77.5   25962.5   75      -0.91       0.182   0.322         322.1      322
80-85       326    82.5   26895     80      0.01        0.504   0.319         319.3      319
85-90       135    87.5   11812.5   85      0.93        0.823   0.144         144.1      144
90-95       26     92.5   2405      90      1.49        0.967   0.03          29.8       30
95-100      4      97.5   390       95      2.68        0.997   0.003         2.7        3
100 & over                          ∞                   1
Problem: X is a normal variate with mean 30 and standard deviation 5. Find the probabilities that (i) 26 ≤ X ≤ 40, (ii) X ≥ 45, (iii) |X − 30| > 5.
Solution: Here it is given that µ = 30 and σ = 5.
First we calculate the standard normal variate z = (X − µ)/σ.
(i) For X = 26, z = (26 − 30)/5 = −0.8, and for X = 40, z = (40 − 30)/5 = 2.
So P(26 ≤ X ≤ 40) = P(−0.8 ≤ Z ≤ 2) = Φ(2) − Φ(−0.8) = 0.9772 − 0.2119 = 0.7653
(ii) For X = 45, z = (45 − 30)/5 = 3, so P(X ≥ 45) = 1 − Φ(3) = 1 − 0.99865 = 0.00135
(iii) P(|X − 30| > 5) = P(|Z| > 1) = 2(1 − Φ(1)) = 2(1 − 0.8413) = 0.3174
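Instead of reading areas from tables, Φ can be computed from the error function; `phi` here is an illustrative helper, not part of the text:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF via the error function (no tables needed)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 30, 5

def z(x):
    return (x - mu) / sigma

p1 = phi(z(40)) - phi(z(26))        # P(26 <= X <= 40)
p2 = 1 - phi(z(45))                 # P(X >= 45)
p3 = 2 * (1 - phi(1))               # P(|X - 30| > 5) = P(|Z| > 1)
```

The three values agree with the table-based answers ≈ 0.765, 0.00135 and 0.317.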
9. Correlation and Regression
Correlation
• A measure of the linear relationship between two variables.
• The correlation coefficient between X and Y is the same for X on Y and Y on X, and is calculated by Karl Pearson's formula
r(x,y) = cov(x,y)/(σx σy) = Σ(xi − x̄)(yi − ȳ) / √[Σ(xi − x̄)² Σ(yi − ȳ)²]
• The correlation coefficient lies between −1 and +1.
• The correlation coefficient is independent of change of origin and of scale.
• Test of significance of the correlation coefficient (null hypothesis ρ = 0):
t = r√(n − 2) / √(1 − r²) at (n − 2) d.f.

Regression
• A measure of the average relationship between two or more variables.
• The regression lines for Y on X and X on Y are different and are given by
(y − ȳ) = byx (x − x̄) for y on x
(x − x̄) = bxy (y − ȳ) for x on y
where byx = cov(x,y)/σx² = Σ(xi − x̄)(yi − ȳ)/Σ(xi − x̄)²
and bxy = cov(x,y)/σy² = Σ(xi − x̄)(yi − ȳ)/Σ(yi − ȳ)².
Here byx and bxy are the regression coefficients; they show the change in the dependent variable per unit change in the independent variable.
• A regression coefficient can lie anywhere between −∞ and +∞.
• Regression coefficients are independent of change of origin but not of scale.
• Test of significance of the regression coefficients (null hypothesis byx = 0, bxy = 0):
t = byx / S.E.(byx), where S.E.(byx) = √[ (Σ(y − ȳ)² − (Σ(x − x̄)(y − ȳ))²/Σ(x − x̄)²) / ((n − 2) Σ(x − x̄)²) ], at (n − 2) d.f. (for y on x);
t = bxy / S.E.(bxy), where S.E.(bxy) = √[ (Σ(x − x̄)² − (Σ(x − x̄)(y − ȳ))²/Σ(y − ȳ)²) / ((n − 2) Σ(y − ȳ)²) ], at (n − 2) d.f. (for x on y).

Relationship between the correlation and regression coefficients
• The correlation coefficient is the geometric mean of the regression coefficients: r = ±√(byx × bxy).
• If one of the regression coefficients is greater than unity, the other must be less than unity, since r² ≤ 1 implies byx × bxy ≤ 1.
• The arithmetic mean of the regression coefficients is greater than the correlation coefficient when r > 0: (byx + bxy)/2 ≥ r.
• If r = 0, θ = π/2: when the two variables are uncorrelated, the lines of regression become perpendicular to each other.
• If r = ±1, θ = 0 or π: the two lines of regression coincide with each other.
Spearman's rank correlation: used to estimate the correlation between two characters on the basis of the ranks of the individuals.
The formula for the rank correlation is ρ = 1 − 6Σdi² / (n(n² − 1)),
where di is the difference between the ranks of the i-th individual.
• If two or more individuals share the same value, the arithmetic mean of the ranks they would occupy is assigned to each of the tied individuals, and the next individual receives the rank following those occupied ranks. In this case a correction factor (1/12)Σ(p³ − p) is added to Σdi², where p is the number of items whose ranks are common. Then
ρ = 1 − 6(Σdi² + (1/12)Σ(p³ − p)) / (n(n² − 1))
• Limits of the rank correlation coefficient: −1 ≤ ρ ≤ +1.
Objective: Computation of correlation coefficient and the equations of the line of regression of Y on X and
X on Y and the estimation of the value of Y when the value of X is known and the value of X when the
value of Y is known.
Kinds of data: The following table relates to the statures (inches) of brothers and sisters from Pearson and Lee’s sample of 1,401 families.
Family
1 2 3 4 5 6 7 8 9 10 11
number
Brother,X 71 68 66 67 70 71 70 73 72 65 66
Sister,Y 69 64 65 63 65 62 65 64 66 59 62
Solution: Calculation of correlation coefficient
Since the calculated value of t is less than the tabulated t, the regression coefficients are not significant.
Solution:
X Y Rank of x (Rx) Rank of y (Ry) di = Rx − Ry di²
75 120 5 5 0 0
88 124 2 4 −2 4
92 150 1 1 0 0
70 115 6 6 0 0
60 110 7 7 0 0
80 140 4 3 1 1
81 142 3 2 1 1
50 100 8 8 0 0
Total Σdi² = 6
Coefficient of rank correlation ρ = 1 − (6×6)/[8(8² − 1)] = 1 − 36/504 = 1 − 0.0714 = 0.929
Thus, there is a high positive rank correlation.
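The computation can be sketched as a small function (no ties are assumed here, so the plain formula applies; the ranks below are those of the X and Y values of this example):

```python
def spearman(rx, ry):
    """Rank correlation from two lists of ranks (no ties)."""
    n = len(rx)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))   # sum of squared rank differences
    return 1 - 6 * d2 / (n * (n * n - 1))

rx = [5, 2, 1, 6, 7, 4, 3, 8]   # ranks of X = 75, 88, 92, 70, 60, 80, 81, 50
ry = [5, 4, 1, 6, 7, 3, 2, 8]   # ranks of Y = 120, 124, 150, 115, 110, 140, 142, 100
print(round(spearman(rx, ry), 3))   # 0.929
```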
Objective: Computation of the rank correlation coefficient when ranks are repeated.
Kinds of data: The following two series of data are given on the variables X and Y.
X 12 15 18 20 16 15 18 22 15 21 18 15
Y 10 18 19 12 15 19 17 19 16 14 13 17
Solution:
X Y Rank of x (Rx) Rank of y (Ry) di = Rx − Ry di²
12 10 12 12 0 0.00
15 18 9.5 4 5.5 30.25
18 19 5 2 3 9.00
20 12 3 11 −8 64.00
16 15 7 8 −1 1.00
15 19 9.5 2 7.5 56.25
18 17 5 5.5 −0.5 0.25
22 19 1 2 −1 1.00
15 16 9.5 7 2.5 6.25
21 14 2 9 −7 49.00
18 13 5 10 −5 25.00
15 17 9.5 5.5 4 16.00
Total Σdi² = 258
ρ = 1 − 6[Σd² + (p³ − p)/12 + (p³ − p)/12 + …] / [n(n² − 1)]
In the X series, 18 is repeated 3 times after the third rank, so the common rank assigned to each of these values is the average (4 + 5 + 6)/3 = 5. The next value, 16, gets the next rank, 7. Again, the value 15 occurs four times, and the common rank assigned to it is 9.5, the arithmetic mean of 8, 9, 10 and 11. The next number, 12, gets rank 12. Similarly for the Y series, the value 19 occurs thrice and the common rank assigned to each is 2, i.e. the arithmetic mean of 1, 2 and 3; the other ranks are assigned accordingly. So in the X series p = 4 for 15 and p = 3 for 18, and in the Y series p = 3 for 19 and p = 2 for 17.
ρ = 1 − 6[258 + (4³ − 4)/12 + (3³ − 3)/12 + (2³ − 2)/12 + (3³ − 3)/12] / [12(12² − 1)]
  = 1 − 6(258 + 5 + 2 + 0.5 + 2)/1716 = 1 − 1605/1716 = 0.065,
a poor rank correlation.
Objective: Testing the significance of an observed sample correlation coefficient and determination of its 95% and 99% confidence limits.
Kinds of data: In a random sample of 27 pairs of observations from a bivariate population the correlation coefficient is obtained as 0.6.
Solution: (i) Set up the hypotheses H0: ρ = 0 against H1: ρ ≠ 0.
(ii) Choose a suitable level of significance, α = 0.05 (say).
(iii) Compute the test statistic
t = r√(n−2) / √(1 − r²) with (n−2) d.f.
  = 0.6√(27−2) / √(1 − 0.36) = 3/√0.64 = 3.75 with 25 degrees of freedom.
(iv) The tabulated value of t for a two-tailed test at the 5% level of significance with 25 degrees of freedom is 2.06.
(v) Since the calculated value of t (3.75) is greater than the tabulated value of t, H0 is rejected at the 5% level of significance. Hence it is concluded that the variables are correlated in the population.
The 95% confidence limits for ρ are
r ∓ 1.96(1 − r²)/√n = 0.6 ∓ 1.96(1 − 0.36)/√27 = 0.6 ∓ 0.2414, i.e. 0.3586 to 0.8414.
The 99% confidence limits are
r ∓ 2.58(1 − r²)/√n = 0.6 ∓ 2.58(1 − 0.36)/√27 = 0.6 ∓ 0.3178, i.e. 0.2822 to 0.9178.
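The test and the confidence limits can be checked with a short script (a sketch; r ± z(1 − r²)/√n is the large-sample approximation used above):

```python
from math import sqrt

def r_significance(r, n, z=1.96):
    """t statistic for H0: rho = 0 and large-sample confidence limits for rho."""
    t = r * sqrt(n - 2) / sqrt(1 - r**2)   # compared with tabulated t at (n-2) d.f.
    half = z * (1 - r**2) / sqrt(n)        # half-width of the confidence interval
    return t, (r - half, r + half)

t, (low, high) = r_significance(0.6, 27)
print(round(t, 2), round(low, 4), round(high, 4))   # 3.75 0.3586 0.8414
```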
Objective: Computation of correlation coefficient for the bivariate frequency distribution.
Kinds of data: The following data give, according to age, the frequency of marks obtained by 100 students in an intelligence test.
Marks \ Age in years 18 19 20 21 Total
10-20 4 2 2 - 8
20-30 5 4 6 4 19
30-40 6 8 10 11 35
40-50 4 4 6 8 22
50-60 - 2 4 4 10
60-70 - 2 3 1 6
Total 19 22 31 28 100
Taking u and v as the coded (step-deviation) values of marks and age, the marginal totals give
Mean of u = Σu f(u)/Σf(u) = 68/100 = 0.68 and Mean of v = Σv f(v)/Σf(v) = 25/100 = 0.25
Cov(u,v) = Σuv f(u,v)/N − ū v̄ = 52/100 − 0.68×0.25 = 0.35
Variance of u = 162/100 − (0.68)² = 1.1576
Variance of v = 167/100 − (0.25)² = 1.6075
r(u,v) = 0.35/√(1.1576 × 1.6075) = 0.257
Since the correlation coefficient is independent of change of origin and scale, r(x,y) = r(u,v) = 0.257.
10. Multiple and Partial Correlation
Multiple correlation coefficient: provides the maximum degree of linear relationship between two or more independent variables and a single dependent variable. It is a measure of how well a given variable can be predicted using a linear function of a set of other variables. The multiple correlation coefficient can never be negative, and its square R² represents the percentage of variance in the dependent variable explained by all the independent variables.
The multiple correlation coefficient is denoted by R1.23, where X1 is the dependent variable and X2, X3 are the independent variables, and its formula is
R²1.23 = 1 − ω/ω11 = (r12² + r13² − 2 r12 r13 r23)/(1 − r23²), where 0 ≤ R1.23 ≤ 1,
with ω = | 1 r12 r13 ; r21 1 r23 ; r31 r32 1 | (the determinant of the correlation matrix) and ω11 = | 1 r23 ; r32 1 |.
Partial correlation coefficient: measures the degree of association between two random variables X1 and X2 with the effect of a set of controlling random variables, say X3, removed. For example, if we have economic data on the consumption, income and wealth of various individuals and we wish to see whether there is a relationship between consumption and income, failing to control for wealth when computing the correlation coefficient between consumption and income would give a misleading result, since income might be numerically related to wealth, which in turn might be numerically related to consumption; a measured correlation between consumption and income might actually be contaminated by these other correlations. The use of a partial correlation avoids this problem. Like the correlation coefficient, the partial correlation coefficient takes a value in the range −1 to +1. The partial correlation coefficient also helps in deciding whether or not to include an additional independent variable in a regression analysis.
The correlation coefficient between X1 and X2 after the linear effect of X3 on each of them has been eliminated is called the partial correlation coefficient, and its formula is
r12.3 = (r12 − r13 r23) / √[(1 − r13²)(1 − r23²)], with −1 ≤ r12.3 ≤ +1.
Objective: Computation of multiple correlation coefficients from a tri-variate population.
Kinds of data: Given r12 = 0.60, r13 = 0.70 and r23 = 0.65.
Solution: We know that
R1.23 = √[(r12² + r13² − 2 r12 r13 r23)/(1 − r23²)] = √[(0.36 + 0.49 − 0.546)/0.5775] = √0.526 = 0.725
Similarly,
R3.12 = √[(r13² + r23² − 2 r12 r13 r23)/(1 − r12²)] = √[(0.49 + 0.4225 − 0.546)/0.64] = √0.573 = 0.757
R2.13 = √[(r12² + r23² − 2 r12 r13 r23)/(1 − r13²)] = √[(0.36 + 0.4225 − 0.546)/0.51] = √0.464 = 0.681
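The three coefficients can be reproduced with one small function (a sketch; the argument order picks out which variable is treated as dependent):

```python
from math import sqrt

def multiple_r(r_ab, r_ac, r_bc):
    """R_a.bc from the three pairwise correlations of a tri-variate population."""
    return sqrt((r_ab**2 + r_ac**2 - 2 * r_ab * r_ac * r_bc) / (1 - r_bc**2))

r12, r13, r23 = 0.60, 0.70, 0.65
print(round(multiple_r(r12, r13, r23), 3))   # R1.23, about 0.726 (the text rounds to 0.725)
print(round(multiple_r(r13, r23, r12), 3))   # R3.12 -> 0.757
print(round(multiple_r(r12, r23, r13), 3))   # R2.13 -> 0.681
```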
Objective: Testing the significance of an observed multiple correlation coefficient.
Kinds of data: The value R1.23 = 0.725 from a tri-variate distribution for n = 25.
Solution: (i) Set up the null and alternative hypotheses as H0: R1.23 = 0, H1: R1.23 ≠ 0.
(ii) Choose a suitable level of significance, α = 0.05 (say).
(iii) Compute the ‘F’ statistic
F = [R²/k] / [(1 − R²)/(n − k − 1)], which follows the F distribution with (k, n−k−1) degrees of freedom, where k is the number of independent variables.
Thus, F = [0.725²/(1 − 0.725²)] × (25 − 2 − 1)/2 = (0.5256/0.4744) × 11 = 1.1079 × 11 = 12.187
The tabulated value of F at (2,22) d.f. at the 5% level of significance is 3.44. Hence the calculated value of the F statistic is greater than the tabulated value. Thus we reject the null hypothesis and conclude that the multiple correlation R1.23 is not zero, i.e. the observed multiple correlation coefficient is significant in the population.
Objective: Calculation of r23.1, b12.3, b13.2 and σ1.23.
Kinds of data: In a tri-variate distribution σ1 = 2, σ2 = σ3 = 3, r12 = 0.7, r23 = r31 = 0.5.
Solution: (i) We know that r23.1 = (r23 − r21 r31)/√[(1 − r21²)(1 − r31²)]
By putting in the values we get r23.1 = (0.5 − 0.7×0.5)/√[(1 − 0.7²)(1 − 0.5²)] = 0.15/0.6185 = 0.2425
(ii) b12.3 = r12.3 × σ1.3/σ2.3 and b13.2 = r13.2 × σ1.2/σ3.2, where σi.j = σi√(1 − rij²).
Here r12.3 = (0.7 − 0.5×0.5)/√[(1 − 0.5²)(1 − 0.5²)] = 0.45/0.75 = 0.60 and r13.2 = (0.5 − 0.7×0.5)/√[(1 − 0.7²)(1 − 0.5²)] = 0.2425, so that
b12.3 = 0.60 × (2√0.75)/(3√0.75) = 0.40 and b13.2 = 0.2425 × (2√0.51)/(3√0.75) = 0.133.
(iii) ω = 1(1×1 − 0.5×0.5) − 0.7(0.7×1 − 0.5×0.5) + 0.5(0.7×0.5 − 1×0.5) = 0.36
and ω11 = (1×1 − 0.5×0.5) = 0.75
Hence σ²1.23 = σ1² (ω/ω11) = 2² × 0.36/0.75 = 1.92
Objective: Testing the significance of an observed partial correlation coefficient.
Kinds of data: The value r12.3 = −0.60 from a tri-variate distribution for n = 29.
Solution: (i) Set up the null and alternative hypotheses as H0: ρ12.3 = 0, H1: ρ12.3 ≠ 0.
(ii) Choose a suitable level of significance, α = 0.05 (say).
(iii) Compute the ‘t’ statistic
t = r12.3 √(n − k − 2) / √(1 − r12.3²) = −0.60 × √(29 − 2 − 2)/√(1 − 0.36) = −3.75
The table value of t at the 5% level of significance with 25 degrees of freedom is 2.06. As the absolute computed value of t is greater than the table value, we reject the null hypothesis. Thus the observed partial correlation coefficient differs significantly from zero in the population.
11. Multiple Regression Equation and Analysis Technique
Objective: Determination of the regression equation when the constants are given, and determination of the value of X3 when X1 = 30 and X2 = 45.
Kinds of data: For a tri-variate distribution
X̄1 = 40, X̄2 = 70, X̄3 = 90
σ1 = 3, σ2 = 6, σ3 = 7
r12 = 0.4, r23 = 0.5, r13 = 0.6
Since the plane of regression passes through the means, it can be written as
[(X1 − X̄1)/σ1] ω11 + [(X2 − X̄2)/σ2] ω12 + [(X3 − X̄3)/σ3] ω13 = 0
where ω11 = | 1 r23 ; r32 1 | = 1 − 0.5² = 0.75, ω12 = −| r21 r23 ; r31 1 | = −(0.4 − 0.5×0.6) = −0.10,
and ω13 = | r21 1 ; r31 r32 | = (0.4×0.5 − 1×0.6) = −0.4
By putting the values into the regression plane we get
[(X1 − 40)/3] × 0.75 + [(X2 − 70)/6] × (−0.10) + [(X3 − 90)/7] × (−0.4) = 0
Objective: Find the multiple regression equation of X1 on X2 and X3 .
Kinds of data: The data relating to three variables are given below :
X1 4 6 7 9 13 15
X2 15 12 8 6 4 3
X3 30 24 20 14 10 4
Substituting the values of b12.3 and b13.2 in equation (i), we get
6a + 48(0.389) +102(-0.623) = 54
6a = 54 +63.546 –18.672= 98.874
Hence, a = 16.479
Thus the required regression equation is X1= 16.479 + 0.389X2 –0.623 X3
Multiple Regression Analysis: It is a technique used for predicting the unknown value of a variable from the known values of two or more variables, also called the predictors. For example, the yield of rice per acre depends upon the quality of seed, the fertility of the soil, the fertilizer used, temperature and rainfall. If we want to study the joint effect of all these variables on rice yield, we can use this technique. An additional advantage of this technique is that it also enables us to study the individual influence of each of these variables on yield.
In general, the multiple regression equation of Y on X1, X2, …, Xk is given by:
Y = b0 + b1 X1 + b2 X2 + … + bk Xk
Here b0 is the intercept and b1, b2, …, bk are analogous to the slope in the simple linear regression equation and are also called regression coefficients. They can be interpreted in the same way as the slope. Thus if bi = 2.5, it would indicate that Y will increase by 2.5 units if Xi is increased by 1 unit.
ANOVA for Multiple Regression: is similar to the ANOVA for linear regression except that the degrees of freedom are adjusted to reflect the number of explanatory variables included in the model.
Analysis of variance table for simple regression analysis
Source of variation | d.f. | Sum of squares | Mean sum of squares | Fcal | Ftab (5%) at (model, error) d.f.
Model | 1 | SSM = Σ(Ŷi − Ȳ)² | MSM = SSM/1 | MSM/MSE |
Error | n−2 | SSE = Σ(Yi − Ŷi)² | MSE = SSE/(n−2) | |
Total | n−1 | SST = Σ(Yi − Ȳ)² | | |
In simple regression we test the null hypothesis that β1 = 0; the test statistic F = MSM/MSE has an F distribution with (1, n−2) d.f.
In multiple regression analysis with p explanatory variables, the model degrees of freedom are equal to p, the error d.f. are equal to n−p−1 and the total d.f. are equal to n−1.
Analysis of variance table for multiple regression analysis
Source of variation | d.f. | Sum of squares | Mean sum of squares | Fcal
Model | p | SSM = Σ(Ŷi − Ȳ)² | MSM = SSM/p | MSM/MSE
Error | n−p−1 | SSE = Σ(Yi − Ŷi)² | MSE = SSE/(n−p−1) |
Total | n−1 | SST = Σ(Yi − Ȳ)² | |
Test of significance of the model: The appropriateness of the multiple regression model as a whole can be tested by the F-test in the ANOVA table. A significant F indicates a linear relationship between Y and at least one of the X’s. The suitability of the model for prediction is examined by the coefficient of determination (R²). R² always lies between 0 and 1; the closer R² is to 1, the better the model and its prediction.
t-test: To test whether the independent variables individually influence the dependent variable significantly, we test the null hypothesis that the relevant regression coefficient is zero. This can be done using the t-test. If the t-test of a regression coefficient is significant, it indicates that the variable in question influences Y significantly while controlling for the other independent explanatory variables.
Objective: Fitting a straight line with two predictors by the matrix approach and determination of the value of R².
Kinds of data: Twenty-five observations of pounds of steam used per month in a plant, along with the average atmospheric temperature in degrees Fahrenheit and the number of operating days in the month.
36
b = (X′X)⁻¹X′Y
where b is the vector of estimates of the elements of β, provided that X′X is a non-singular matrix.
Here Y is the 25×1 vector of observations (10.98, 11.13, 12.51, 8.4, …, 10.36, 11.08)′, X is the 25×3 design matrix whose rows are (1, X1i, X2i), e.g. (1, 35.3, 20), (1, 29.7, 20), (1, 30.8, 23), …, (1, 28.6, 22), β = (β0, β1, β2)′ and ε = (ε1, …, ε25)′.
From the data,
X′X = [ 25 1315 506 ; 1315 76323.42 26353.30 ; 506 26353.30 10450 ]
and X′Y is obtained by multiplying X′ with the vector Y. Hence
b = (b0, b1, b2)′ = (X′X)⁻¹X′Y = (9.1266, −0.0724, 0.2029)′
Thus, the fitted least squares equation is
Ŷ = 9.1266 − 0.0724 X1 + 0.2029 X2
After the regression equation is estimated, we find the estimated value of Y for each (X1, X2) pair and then find the total, regression and residual sums of squares.
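The least squares step can be sketched with NumPy. The full 25-observation steam dataset is not reproduced in the text, so the data below are simulated stand-ins from a known model; solving the normal equations is preferred to inverting X′X explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 25
x1 = rng.uniform(25, 75, n)                   # stand-in for atmospheric temperature
x2 = rng.integers(18, 24, n).astype(float)    # stand-in for operating days
y = 9.1 - 0.07 * x1 + 0.2 * x2 + rng.normal(0, 0.3, n)   # assumed model plus noise

X = np.column_stack([np.ones(n), x1, x2])     # design matrix with intercept column
b = np.linalg.solve(X.T @ X, X.T @ y)         # normal equations (X'X) b = X'y

yhat = X @ b
sst = np.sum((y - y.mean()) ** 2)             # total (corrected) SS
sse = np.sum((y - yhat) ** 2)                 # residual SS
r2 = 1 - sse / sst                            # coefficient of determination
print(b.round(3), round(r2, 3))
```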
Observation Number | Pounds of steam used per month, Y | Estimated value Ŷ | Total SS = (Yi − Ȳ)² | Regression/Model SS = (Ŷi − Ȳ)² | Error SS = (Yi − Ŷi)²
1 10.98 10.63 2.42 1.45 0.12
2 11.13 11.03 2.91 2.59 0.01
3 12.51 11.56 9.52 4.58 0.90
4 8.4 8.93 1.05 0.25 0.28
5 9.27 8.94 0.02 0.23 0.11
6 8.73 8.43 0.48 0.99 0.09
7 6.36 5.97 9.39 11.92 0.15
8 8.5 8.24 0.85 1.40 0.07
9 7.82 8.27 2.57 1.33 0.20
10 9.14 9.02 0.08 0.16 0.01
11 8.24 9.83 1.40 0.16 2.51
12 12.19 11.30 7.65 3.50 0.80
13 11.88 11.35 6.03 3.72 0.28
14 9.57 10.15 0.02 0.53 0.34
15 10.94 10.40 2.30 0.96 0.29
16 9.58 9.67 0.02 0.06 0.01
17 10.09 9.30 0.44 0.02 0.63
18 8.11 8.52 1.73 0.81 0.17
19 6.83 6.29 6.73 9.82 0.29
20 8.88 8.40 0.30 1.05 0.23
21 7.68 7.96 3.04 2.13 0.08
22 8.47 9.18 0.91 0.06 0.51
23 8.86 9.96 0.32 0.28 1.20
24 10.36 10.77 0.88 1.80 0.17
25 11.08 11.52 2.74 4.39 0.19
Total 63.84 54.21 9.63
We can also split the SS due to regression into the SS due to X1 and X2.
For this purpose we fit the simple line of regression of Y on X1 as Ŷ = 13.62 − 0.08 X1.
Now we again calculate the SS due to X1 = Σ(Ŷi − Ȳ)² = 45.79
Source of variation | d.f. | SS | MS | Fcal | Ftab (1,22)
Regression:
  Due to X1 | 1 | 45.79 | 45.79 | 104.6 | 4.30
  Due to X2 | 1 | 8.42 | 8.42 | 19.23 | 4.30
Residual | 22 | 9.63 | 0.4377 | |
Total | 24 | 63.84 | | |
Here, since 104.6 and 19.23 exceed Ftab(1,22,0.95) = 4.30, the predictors X1 and X2 are both found to be significant. It is to be noted that the sum of squares due to X1 (45.79) is obtained by fitting X1 alone, and the sum of squares due to X2 is obtained as the regression SS minus the SS due to X1 (54.21 − 45.79 = 8.42).
To check the suitability of the model for prediction, the coefficient of determination R² is calculated:
R² = (sum of squares due to regression)/(total corrected sum of squares) = (45.79 + 8.42)/63.84 = 84.91%
Hence it is found that 84.91% of the variability in Y is explained by the explanatory variables.
t-test for regression coefficient byx1 = -.0724
𝑏𝑦𝑥 𝑏𝑦𝑥
𝑡𝑐𝑎𝑙 = 𝑆.𝐸.𝑜𝑓 𝑏 = based on (n-2) d.f.(for y on x )
𝑦𝑥
̅̅̅2 −(∑(𝑥−𝑥̅)(𝑦−𝑦 ̅ ))2 ̅̅̅2
√(∑(𝑦−𝑦) ̅̅̅2 )/(𝑛−2) ∑(𝑥−𝑥)
∑(𝑥−𝑥)
The table value of t at 23 d.f and 5 % level of significance is 2.068. Since calculated absolute value of t is
greater than tabulated null hypothesis is rejected.
t-test for regression coefficient byx2 =0.2029
𝑏𝑦𝑥 𝑏𝑦𝑥
𝑡𝑐𝑎𝑙 = 𝑆.𝐸.𝑜𝑓 𝑏 = based on (n-2) d.f.(for y on x )
𝑦𝑥
̅̅̅2 −(∑(𝑥−𝑥̅)(𝑦−𝑦 ̅ ))2 ̅̅̅2
√(∑(𝑦−𝑦) ̅̅̅2 )/(𝑛−2) ∑(𝑥−𝑥)
∑(𝑥−𝑥)
The table value of t at 23 d.f and 5 % level of significance is 2.068. Since calculated absolute value of t is
greater than tabulated null hypothesis is rejected.
Hence both the regression coefficient are found to be significant.
39
12. Simple and Stratified Random Sampling
Simple Random Sampling (SRS): It is the process of selecting a sample from a given population according to some law of chance in which each unit of the population has an equal and independent chance of being included in the sample.
SRSWR (With Replacement): A selection process in which the unit selected at any draw is returned to the population before the next draw is known as simple random sampling with replacement. In this case the number of possible samples of size n selected from a population of size N is N^n. The samples selected through this method are not necessarily distinct.
SRSWOR (Without Replacement): A selection process in which the unit selected at any draw is not returned to the population before the next draw, the next unit being selected from the remaining population, is known as simple random sampling without replacement. In this case the number of possible samples of size n selected from a population of size N is NCn. The samples selected through this method are distinct.
Note: The sample mean is an unbiased estimate of the population mean in both SRSWR and SRSWOR, whereas the sample mean square is an unbiased estimate of the population mean square in the case of SRSWOR only.
SRSWOR is more efficient than SRSWR because V(ȳn)SRSWOR < V(ȳn)SRSWR.
Stratified Random Sampling: When the population is heterogeneous and we wish every section of the population to be represented in the sample, we divide the whole population into a number of strata so that the strata differ from one another while the units within each stratum are more homogeneous. This technique of selecting a representative sample of the whole population is known as stratified random sampling.
In stratified random sampling the allocation of the sample size to the different strata is based on the stratum sizes (Ni), the variability within the stratum (Si²) and the cost of surveying per sampling unit in the stratum (Ci).
The methods for allocating the sample size n to k strata are:
Equal allocation: ni = n/k
Proportional allocation: ni = nNi/N
Neyman allocation: ni = n × NiSi / Σ NiSi
Optimum allocation (based on cost): ni = n × (NiSi/√Ci) / Σ (NiSi/√Ci)
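These allocation rules can be written as a small function; the stratum sizes, standard deviations and n = 800 below are those of the worked example that follows, and rounding to the nearest integer is assumed:

```python
def allocations(n, Ni, Si):
    """Equal, proportional and Neyman sample-size allocations to k strata."""
    k, N = len(Ni), sum(Ni)
    equal = [n // k] * k                                   # ni = n/k
    prop = [round(n * Ni_i / N) for Ni_i in Ni]            # ni = n*Ni/N
    w = [Ni_i * Si_i for Ni_i, Si_i in zip(Ni, Si)]        # Ni*Si weights
    neyman = [round(n * w_i / sum(w)) for w_i in w]        # ni = n*NiSi / sum(NiSi)
    return equal, prop, neyman

equal, prop, neyman = allocations(800, [400, 600, 900, 1100], [4, 6, 9, 12])
print(equal)    # [200, 200, 200, 200]
print(prop)     # [107, 160, 240, 293]
print(neyman)   # [48, 109, 245, 398]
```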
Objective: In simple random sampling without replacement (SRSWOR), show that the sample mean and sample mean square are unbiased estimates of the population mean and population mean square with the help of a hypothetical population, and determine the variance and S.E. of the sample mean.
Kinds of data: The data relate to a hypothetical population whose units are 1, 2, 3, 4 and 5. Draw a sample of size n = 3 using SRSWOR.
Solution: The number of all possible samples of size n = 3 under SRSWOR is given by NCn = 5C3 = 10.
Compute the mean of each sample, ȳn = Σyi/n, and the sample mean square, s² = Σ(yi − ȳn)²/(n−1).
Similarly, the population mean ȳN = Σyi/N = 15/5 = 3 and the population mean square S² = Σ(yi − ȳN)²/(N−1):
S² = (1/4)[(1−3)² + (2−3)² + (3−3)² + (4−3)² + (5−3)²] = 10/4 = 2.5
The 10 possible samples are given below in the table.
S.No. Possible samples Sample mean ȳn Sample mean square s² Sampling error (ȳn − ȳN)
1 1,2,3 2.00 1.000 −1.00
2 1,2,4 2.33 2.333 −0.67
3 1,2,5 2.67 4.333 −0.33
4 1,3,4 2.67 2.333 −0.33
5 1,3,5 3.00 4.000 0.00
6 1,4,5 3.33 4.333 0.33
7 2,3,4 3.00 1.000 0.00
8 2,3,5 3.33 2.333 0.33
9 2,4,5 3.67 2.333 0.67
10 3,4,5 4.00 1.000 1.00
Total 30.00 25.000
E(ȳn) = Σȳn/NCn = 30/10 = 3 = ȳN and E(s²) = Σs²/NCn = 25/10 = 2.5 = S²,
so the sample mean ȳn and the sample mean square s² are unbiased estimators of the population mean ȳN and the population mean square S² respectively.
In order to find the variance of the sample mean in SRSWOR, we know that
V(ȳn)SRSWOR = [(N−n)/Nn] S² = [(5−3)/(5×3)] × 2.5 = 0.33
Standard error of ȳn = √V(ȳn) = √0.33 = 0.577
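The enumeration above can be reproduced by listing all 5C3 samples (statistics.variance uses the n−1 divisor, i.e. the sample mean square):

```python
from itertools import combinations
from statistics import mean, variance

pop = [1, 2, 3, 4, 5]
N, n = len(pop), 3
samples = list(combinations(pop, n))     # all 5C3 = 10 SRSWOR samples

means = [mean(s) for s in samples]
s2s = [variance(s) for s in samples]     # sample mean square (divisor n-1)
S2 = variance(pop)                       # population mean square (divisor N-1) = 2.5

print(mean(means))                       # approx 3: unbiased for the population mean
print(mean(s2s))                         # approx 2.5: unbiased for S^2 under SRSWOR
print((N - n) / (N * n) * S2)            # V(ybar) approx 0.333
```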
Objective: Showing that the sample mean is an unbiased estimator of the population mean but the sample mean square is a biased estimator of the population mean square in simple random sampling with replacement (SRSWR), with the help of a hypothetical example, and determination of the variance and standard error (S.E.).
Kinds of data: Consider a finite population of size N = 5 with sampling units (1, 2, 3, 4, 5). Enumerate all possible samples of size n = 2 using SRSWR and find the estimate of V(ȳn) from the 9th sample.
Solution: The number of all possible samples of size n = 2 under SRSWR is given by N^n = 5² = 25.
Compute the mean of each sample, ȳn = Σyi/n, and the sample mean square, s² = Σ(yi − ȳn)²/(n−1).
Similarly, the population mean ȳN = Σyi/N = 15/5 = 3 and the population mean square S² = Σ(yi − ȳN)²/(N−1):
S² = (1/4)[(1−3)² + (2−3)² + (3−3)² + (4−3)² + (5−3)²] = 10/4 = 2.5
S.No. Possible samples Sample mean ȳn Sample mean square s² Sampling error (ȳn − ȳN)
1 1,2 1.5 0.50 −1.5
2 1,3 2.0 2.00 −1.0
3 1,4 2.5 4.50 −0.5
4 1,5 3.0 8.00 0.0
5 2,3 2.5 0.50 −0.5
6 2,4 3.0 2.00 0.0
7 2,5 3.5 4.50 0.5
8 3,4 3.5 0.50 0.5
9 3,5 4.0 2.00 1.0
10 4,5 4.5 0.50 1.5
11 2,1 1.5 0.50 −1.5
12 3,1 2.0 2.00 −1.0
13 4,1 2.5 4.50 −0.5
14 5,1 3.0 8.00 0.0
15 3,2 2.5 0.50 −0.5
16 4,2 3.0 2.00 0.0
17 5,2 3.5 4.50 0.5
18 4,3 3.5 0.50 0.5
19 5,3 4.0 2.00 1.0
20 5,4 4.5 0.50 1.5
21 1,1 1.0 0.00 −2.0
22 2,2 2.0 0.00 −1.0
23 3,3 3.0 0.00 0.0
24 4,4 4.0 0.00 1.0
25 5,5 5.0 0.00 2.0
Total 75.0 50.00
Now we check whether E(ȳn) = ȳN and E(s²) = S²:
E(ȳn) = Σȳn/N^n = 75/25 = 3 = ȳN and E(s²) = Σs²/N^n = 50/25 = 2 ≠ S²,
so in SRSWR the sample mean is unbiased for the population mean, but the sample mean square is a biased estimator of the population mean square.
V(ȳn)SRSWR = [(N−1)/Nn] S² = [(5−1)/(5×2)] × 2.5 = 1
Standard error of ȳn = √V(ȳn) = √1 = 1
In order to find the estimate of V(ȳn) based on the 9th sample (3,5), for which s² = 2.00, we have
V̂(ȳn) = [(N−1)/Nn] s² = [(5−1)/(5×2)] × 2.0 = 0.8
Standard error = √V̂(ȳn) = √0.80 = 0.894
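The SRSWR enumeration can be checked the same way with itertools.product:

```python
from itertools import product
from statistics import mean, variance

pop = [1, 2, 3, 4, 5]
N, n = len(pop), 2
samples = list(product(pop, repeat=n))   # all 5^2 = 25 SRSWR samples (ordered, with repeats)

print(mean(mean(s) for s in samples))       # approx 3: unbiased for the population mean
print(mean(variance(s) for s in samples))   # approx 2.0, not S^2 = 2.5: s^2 is biased in SRSWR
```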
Objective: Drawing of samples in stratified random sampling under different allocations, along with determination of their variances and standard errors.
Kinds of data: A hypothetical population of N = 3000 is divided into four strata; the stratum sizes and standard deviations are given as follows:
Strata I II III IV
Size Ni 400 600 900 1100
SD Si 4 6 9 12
A stratified random sample of size 800 is to be selected from the population.
Solution: In case of
(i) Equal allocation, the sample sizes allocated to the different strata are the same: ni = n/k = (total sample size)/(number of strata) = 800/4 = 200 units from each stratum.
(ii) In case of proportional allocation, ni (i = 1,2,3,4) is given by ni = npi, where pi = Ni/N, i.e. ni = nNi/N.
Hence n1 = 800×400/3000 = 106.67 ≈ 107 units from stratum I
n2 = 800×600/3000 = 160 units from stratum II
n3 = 800×900/3000 = 240 units from stratum III
n4 = 800×1100/3000 = 293 units from stratum IV
Thus n1 + n2 + n3 + n4 = 800 constitutes the sample required from all the strata.
(iii) The sample size in Neyman allocation is given by ni = n piSi/Σ piSi = n NiSi/Σ NiSi.
Here Σ NiSi = 400×4 + 600×6 + 900×9 + 1100×12 = 26500
Hence n1 = 800×(400×4)/26500 = 48, n2 = 800×(600×6)/26500 = 109,
n3 = 800×(900×9)/26500 = 245, n4 = 800×(1100×12)/26500 = 398.
In Neyman allocation, the sample sizes from the four strata are 48, 109, 245 and 398, which constitute the required sample size.
Variance of ȳst in equal allocation: V(ȳst) = (k/n) Σ pi²si² − (1/N) Σ pisi²
From the above data, Σ pisi = 8.83, Σ pisi² = 86.43 and Σ pi²si² = 28.37
V(ȳst) = 4×28.37/800 − 86.43/3000 = 0.141 − 0.028 = 0.113
Standard error of ȳst = √V(ȳst) = √0.113 = 0.336
Variance of ȳst in proportional allocation: V(ȳst) = (1/n − 1/N) Σ pisi² = (1/800 − 1/3000)×86.43 = 0.0792
Standard error of ȳst (prop) = √0.0792 = 0.2815
Variance of ȳst in Neyman allocation: V(ȳst) = (Σ pisi)²/n − (Σ pisi²)/N = 8.83²/800 − 86.43/3000 = 0.0975 − 0.0288 = 0.0687
Standard error of ȳst (Neyman) = √0.0687 = 0.262
Objective: Determination of the estimates of the population mean and population total in stratified random sampling, with samples drawn under different allocations.
Kinds of data: A population of size N = 4000 has been divided into five strata, with their sizes, standard deviations and sample means in stratified random sampling given below. A stratified sample of n = 800 is to be drawn.
Strata I II III IV V
Sizes Ni 300 600 900 1200 1000
Sample means ȳni 8 10 15 18 13
Standard deviations Si 2 4 6 8 5
Solution: (i) In equal allocation, the sample sizes allocated to the different strata are the same: ni = n/k = 800/5 = 160 units from each stratum.
(ii) In case of proportional allocation, ni (i = 1,…,5) is given by ni = npi, where pi = Ni/N, i.e. ni = nNi/N.
Hence n1 = 800×300/4000 = 60 units from stratum I
n2 = 800×600/4000 = 120 units from stratum II
n3 = 800×900/4000 = 180 units from stratum III
n4 = 800×1200/4000 = 240 units from stratum IV
n5 = 800×1000/4000 = 200 units from stratum V
Thus n1 + n2 + n3 + n4 + n5 = 800 constitutes the sample required from all the strata.
(iii) The sample size in Neyman allocation is given by ni = n piSi/Σ piSi = n NiSi/Σ NiSi.
Here Σ NiSi = 300×2 + 600×4 + 900×6 + 1200×8 + 1000×5 = 23000
Hence n1 = 800×600/23000 = 20.87 ≈ 21, n2 = 800×2400/23000 = 83.48 ≈ 83,
n3 = 800×5400/23000 = 187.82 ≈ 188, n4 = 800×9600/23000 = 333.91 ≈ 334,
n5 = 800×5000/23000 = 173.91 ≈ 174.
In Neyman allocation, the sample sizes from the five strata are 21, 83, 188, 334 and 174, which constitute the required sample size.
We know that an unbiased estimate of the population mean ȲN can be worked out as
Ȳ̂N = ȳst = (1/N) Σ Ni ȳni = (300×8 + 600×10 + 900×15 + 1200×18 + 1000×13)/4000 = 14.125
An appropriate estimator of the population total is
Ŷ = N ȳst = 4000×14.125 = 56500
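The stratified estimates can be verified directly:

```python
Ni = [300, 600, 900, 1200, 1000]     # stratum sizes
ybar = [8, 10, 15, 18, 13]           # stratum sample means

N = sum(Ni)
y_st = sum(n_i * y_i for n_i, y_i in zip(Ni, ybar)) / N   # stratified mean estimate
total = N * y_st                                          # population total estimate
print(y_st, total)   # 14.125 56500.0
```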
13. Ratio and Regression Estimator
Ratio Estimator: The ratio method of estimation is based on the information available for an auxiliary variable. When the correlation coefficient between the study variable and the auxiliary variable is positive and high, the ratio method of estimation can be used to study the population parameters of the study variable Y.
The ratio estimator of the population mean is given by ȳR = (ȳn/x̄n) X̄N, where ȳn and x̄n are the sample means of y and x respectively and X̄N is the population mean of x.
The ratio estimator is not an unbiased estimate of the population mean. The bias will be zero only when there is a perfect positive correlation between y and x.
The bias of the ratio estimator to the first order of approximation is given by
B1(ȲR) = [(N−n)/Nn] ȲN (Cx² − ρCxCy), where Cx = SX/X̄N and Cy = SY/ȲN.
The variance of the ratio estimator is given by
V(ȲR) = [(N−n)/Nn] (Sy² + R²Sx² − 2RρSxSy), where R = ȲN/X̄N.
Regression Estimator: The ratio estimator is appropriate when y and x are linearly related and the line of regression of y on x passes through the origin, i.e. y is approximately a constant multiple of x. When this is not the case, the regression estimator is used.
The regression estimator is defined as Ȳlr = ȳn + byx(X̄N − x̄n).
The regression estimator is also a biased estimate of the population mean.
The variance of the regression estimator is given by V(Ȳlr) = [(N−n)/Nn] sy²(1 − rxy²), where rxy = Sxy/(SxSy)
and byx = Sxy/Sx², with sx² = [Σxi² − (Σxi)²/n]/(n−1) and Sxy = [Σxiyi − (Σxi Σyi)/n]/(n−1).
Objective: Estimation of the average number of bullocks per farm using the ratio estimator; show that it is a biased estimator of the population mean, and compute its bias and variance along with the standard error.
Kinds of data : A bivariate population of size N=6 is given below :
No. of bullocks(Y) 3 4 8 9 6 9
Farm Size (acre)(X) 15 20 40 45 25 42
S.No. Sample (yi) Sample (xi) ȳn x̄n R̂ = ȳn/x̄n ȳR = (ȳn/x̄n) X̄N
1 3,4 15,20 3.5 17.5 0.2000 6.233
2 3,8 15,40 5.5 27.5 0.2000 6.233
3 3,9 15,45 6.0 30.0 0.2000 6.233
4 3,6 15,25 4.5 20.0 0.2250 7.013
5 3,9 15,42 6.0 28.5 0.2105 6.561
6 4,8 20,40 6.0 30.0 0.2000 6.233
7 4,9 20,45 6.5 32.5 0.2000 6.233
8 4,6 20,25 5.0 22.5 0.2222 6.926
9 4,9 20,42 6.5 31.0 0.2097 6.535
10 8,9 40,45 8.5 42.5 0.2000 6.233
11 8,6 40,25 7.0 32.5 0.2154 6.713
12 8,9 40,42 8.5 41.0 0.2073 6.462
13 9,6 45,25 7.5 35.0 0.2143 6.679
14 9,9 45,42 9.0 43.5 0.2069 6.448
15 6,9 25,42 7.5 33.5 0.2239 6.978
Total ΣȳR = 97.71
X̄N = ΣXi/N = 187/6 = 31.17 and ȲN = ΣYi/N = 39/6 = 6.50
E(ȳR) = ΣȳR/NCn = 97.71/15 = 6.514
Since E(ȳR) ≠ ȲN, the ratio estimator is not an unbiased estimator of the population mean ȲN.
The bias of the ratio estimator to the first order of approximation is given by
B1(ȲR) = [(N−n)/Nn] ȲN (Cx² − ρCxCy), where Cx = SX/X̄N and Cy = SY/ȲN.
Now SX = √[(6639 − 187²/6)/5] = 12.73 and SY = 2.588,
so Cx = 0.408 and Cy = 0.398.
To find the value of the correlation coefficient ρ between X and Y, we need the following values:
Σy = 39, Σx = 187, Σxy = 1378, Σx² = 6639, Σy² = 287
ρ = [1378/6 − (187×39)/36] / {√[6639/6 − (187/6)²] × √[287/6 − (39/6)²]} = 0.9859
Hence B1(ȲR) = [(6−2)/(6×2)] × 6.50 × (0.408² − 0.9859×0.408×0.398) = 0.014
The variance of the ratio estimator is given by
V(ȲR) = [(N−n)/Nn] (Sy² + R²Sx² − 2RρSxSy), where R = ȲN/X̄N = 0.208
  = [(6−2)/(6×2)] × (2.588² + 0.208²×12.73² − 2×0.208×0.9859×12.73×2.588) = 0.065
The above formula for the variance, in terms of the coefficients of variation, can be written as
V(ȲR) = [(N−n)/Nn] ȲN² (Cy² + Cx² − 2ρCxCy)
  = [(6−2)/(6×2)] × 6.50² × (0.398² + 0.408² − 2×0.9859×0.408×0.398) = 0.066
Both values of the variance of the ratio estimator are approximately equal.
Standard error of the ratio estimator ȳR = √V(ȳR) = √0.066 = 0.257
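The bias of the ratio estimator can be checked by enumerating all 6C2 = 15 samples:

```python
from itertools import combinations

y = [3, 4, 8, 9, 6, 9]        # number of bullocks
x = [15, 20, 40, 45, 25, 42]  # farm size (acres)
N, n = len(y), 2
Ybar, Xbar = sum(y) / N, sum(x) / N

# Ratio estimate ybar_R = (ybar_n / xbar_n) * Xbar for every possible sample
est = []
for i, j in combinations(range(N), 2):
    yn, xn = (y[i] + y[j]) / 2, (x[i] + x[j]) / 2
    est.append(yn / xn * Xbar)

print(round(sum(est) / len(est), 3))   # about 6.514, not 6.5: the estimator is biased
```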
Objective: Determination of the regression estimator, its comparison with the ratio estimator, and its sampling variance and standard error.
Kinds of data: From a bivariate population of size N = 85 with population means X̄N = 6.55 and ȲN = 8.55, a random sample of size n = 10 was drawn using the SRSWOR scheme and recorded as
Y 11 8 7 6 4 5 3 2 9 10
X 10 7 6 5 3 4 2 1 8 9
Solution: For comparison, the variance of the sample mean under SRSWOR is
V(Ȳn)SRSWOR = [(N−n)/Nn] sy² = [(85−10)/(85×10)] × 9.16 = 0.808
14. Large Sample Test
For a large sample size n (usually > 30) the distributions of almost all statistics are approximately normal. To solve large-sample problems the normal variable X is transformed to a new variable Z = (X − μ)/σ, which is known as the standard normal variate, with mean 0 and variance 1.
By the area property of the normal distribution, the standard normal variate should lie between −3 and +3. Hence, if |Z| > 3, the null hypothesis will always be rejected; if |Z| ≤ 3, the null hypothesis will be tested for possible rejection at a certain level of significance.
For a two-tailed test:
if |Z| < 1.96, H0 is accepted at the 5% level of significance, and
if |Z| < 2.58, H0 is accepted at the 1% level of significance.
For a one-tailed (right or left) test:
if |Z| < 1.645, H0 is accepted at the 5% level of significance, and
if |Z| < 2.33, H0 is accepted at the 1% level of significance.
Objective: Testing the significance of a single proportion based on a large sample.
Kinds of data: In a sample of 1000 people in Maharashtra, 540 were found to be rice eaters and the rest wheat eaters. Test whether rice and wheat eaters are equally popular in this state at the 1% level of significance.
Solution: We set up H0: P = 0.5 against H1: P ≠ 0.5. The sample proportion is p = 540/1000 = 0.54, and under H0
Z = (p − P)/√(PQ/n) = (0.54 − 0.50)/√(0.5×0.5/1000) = 2.53
Since Z = 2.53 < 2.58, the null hypothesis that rice and wheat eaters are equally popular in the state is accepted at the 1% level of significance.
Objective: Testing the significance of the difference between two proportions based on large samples.
Kinds of data: Out of 400 men, 200 are in favour of a proposal to build a garden near their residence, while out of 600 women, 325 are in favour.
Solution: We set up the null hypothesis H0: P1 = P2 and H1: P1 ≠ P2.
P1 = proportion of men in favour of the proposal = 200/400 = 0.50
P2 = proportion of women in favour of the proposal = 325/600 = 0.541
Under H0: Z = (P1 − P2)/√[PQ(1/n1 + 1/n2)], which follows N(0,1),
where P = (n1P1 + n2P2)/(n1 + n2) = (200 + 325)/(400 + 600) = 0.525 and Q = 1 − P = 0.475.
Hence Z = −1.27, so the absolute value of Z is 1.27.
Since the calculated value |Z| = 1.27 is less than the tabulated value of Z (1.96) at the 5% level of significance, the difference between the opinions of men and women in favour of building the garden near their residence is not significant.
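Both Z statistics can be reproduced directly (the sketch keeps full precision, so the two-proportion value comes out −1.29 rather than the rounded −1.27 above):

```python
from math import sqrt

# Single proportion: rice-eaters example
p, P, n = 0.54, 0.5, 1000
z1 = (p - P) / sqrt(P * (1 - P) / n)

# Two proportions with a pooled estimate: men vs women example
x1, n1, x2, n2 = 200, 400, 325, 600
P1, P2 = x1 / n1, x2 / n2
Pp = (x1 + x2) / (n1 + n2)                                 # pooled proportion
z2 = (P1 - P2) / sqrt(Pp * (1 - Pp) * (1 / n1 + 1 / n2))

print(round(z1, 2), round(z2, 2))   # 2.53 -1.29
```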
Objective: Testing the significance of the difference between two standard deviations based on large samples.
Kinds of data: Two random samples of 1000 and 1200 members have standard deviations of 2.58 and 2.50 inches respectively.
Solution: We set up the null hypothesis H0: σ1 = σ2 and H1: σ1 ≠ σ2.
Under H0: Z = (S1 − S2)/√(S1²/2n1 + S2²/2n2), which follows N(0,1).
Now Z = (2.58 − 2.50)/√(2.58²/2000 + 2.50²/2400) = 1.03
The two sample standard deviations do not differ significantly, since the calculated value of Z (1.03) is less than the tabulated Z (1.96) at the 5% level of significance.
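The computation, sketched in full precision (rounding gives 1.04 rather than the text’s 1.03):

```python
from math import sqrt

def z_two_sd(s1, n1, s2, n2):
    """Large-sample Z test for the difference of two standard deviations."""
    return (s1 - s2) / sqrt(s1**2 / (2 * n1) + s2**2 / (2 * n2))

print(round(z_two_sd(2.58, 1000, 2.50, 1200), 2))   # 1.04
```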
15. Small Sample Test
Small Sample Test: If the sample size n is small, the distributions of statistics such as Z = (X̄ − μ)/(S/√n) are far from normality. In such cases the small sample tests developed by Student (W. S. Gosset) are used.
Student's t: Let xi be a random sample of size n from a normal population with mean μ and variance σ². Then Student's t is defined by the statistic t = (X̄ − μ)/(S/√n), where X̄ is the sample mean and S² = (1/(n−1)) Σ(xi − x̄)² is an unbiased estimate of the population variance σ². It follows Student's t distribution with (n−1) degrees of freedom.
Assumptions of the t test: (i) the parent population is normal; (ii) the sample observations are independent and random; (iii) the population standard deviation σ is unknown.
t test for a single mean: is used to test whether the sample has been drawn from a population with mean μ, i.e. whether there is a significant difference between the sample mean X̄ and the population mean μ.
The test statistic is |t| = |X̄ − μ0|/(S/√n), where X̄ = ΣXi/n and S² = (1/(n−1)) Σ(xi − x̄)², and it follows Student's t distribution with (n−1) degrees of freedom.
Two independent sample t test: is used to test whether the two samples differ significantly in their means or whether they may belong to the same population. The test statistic is
t = (X̄1 − X̄2)/√(S²(1/n1 + 1/n2)), where S² = (Σ(Xi − X̄)² + Σ(Yi − Ȳ)²)/(n1 + n2 − 2) = ((n1 − 1)S1² + (n2 − 1)S2²)/(n1 + n2 − 2),
and it follows Student's t distribution with (n1 + n2 − 2) degrees of freedom.
Paired t test: is used when the sample sizes are equal and the two samples are not independent, the sample observations being paired together.
The test statistic is given by |t| = |d̄|/(S/√n),
where di = Xi − Yi and S² = (1/(n−1)) Σ(di − d̄)², and it follows Student's t distribution with (n−1) degrees of freedom.
Test of significance of the null hypothesis: the calculated value of t is compared with the tabulated value of t at a chosen level of significance, generally 5 %. If the calculated |t| > tabulated t, the null hypothesis is rejected; if the calculated |t| < tabulated t, the null hypothesis is accepted.
Note: if you are unable to understand whether the samples are paired or independent then you can decide it
by the degree of freedom of tabulated value given in question. In case of independent sample for table value
d.f. is (n1 + n2 -2) whereas in paired sample d.f. is (n-1).
Objective: Test the significance of difference between sample mean and population mean.
Kinds of data: The data relate to the IQs of ten randomly selected boys, given below:
70, 120, 110, 101, 88, 83, 95, 98, 107 and 100. The population mean is given as μ = 100.
Solution: Here the null hypothesis is H0: The data are consistent with the assumption of a mean
IQ of 100 in the population. First we will find the sample mean
X̄ = (70 + 120 + 110 + 101 + 88 + 83 + 95 + 98 + 107 + 100)/10 = 97.2
Now we calculate the sample variance S² = (1/(n−1)) Σ(xi − x̄)² = 1833.6/9 = 203.73, so S = 14.27
Then |t| = |X̄ − μ|/(S/√n) = |97.2 − 100|/(14.27/√10) = 2.8/4.51 = 0.62
Here absolute value of t=0.62 and tabulated value of t at 9 d.f. at 0.05 level of significance=2.26.
Since the calculated value of t is less than the tabulated value, the null hypothesis is accepted and we
conclude that the data are consistent with the assumption of a mean IQ of 100 in the population.
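The same computation takes a few lines of Python (a sketch; `statistics.stdev` uses the (n − 1) divisor, matching S above):

```python
from math import sqrt
from statistics import mean, stdev

iq = [70, 120, 110, 101, 88, 83, 95, 98, 107, 100]
mu0 = 100
n = len(iq)

x_bar = mean(iq)                       # 97.2
s = stdev(iq)                          # sample SD with (n-1) divisor, ≈ 14.27
t = abs(x_bar - mu0) / (s / sqrt(n))   # ≈ 0.62, compared with t(0.05, 9 d.f.) = 2.26
```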
Objective: Testing the significance of difference between two means in small samples
Kinds of data: The data relate to two random samples drawn from two normal populations, with the
following results:
n1 = 6, X̄1 = 25, S1² = 36, n2 = 8, X̄2 = 20 and S2² = 25, provided the variances of the two populations are equal,
i.e. σ1² = σ2²
Solution: The null hypothesis H0: There is no significant difference between the two means, i.e. μ1 = μ2.
H1: There is a significant difference between the two means, i.e. μ1 ≠ μ2.
Now we calculate the pooled mean square: S² = ((n1 − 1)S1² + (n2 − 1)S2²)/(n1 + n2 − 2)
S² = ((6 − 1) × 36 + (8 − 1) × 25)/(6 + 8 − 2) = 355/12 = 29.58
Applying the t-statistic: t = (X̄1 − X̄2)/√(S²(1/n1 + 1/n2)) = (25 − 20)/√(29.58 × (1/6 + 1/8)) = 5/2.93 = 1.70
Here the absolute value of t = 1.70 and the tabulated value of t at 12 d.f. at the 0.05 level of significance = 2.18.
Since the calculated value of t is less than the tabulated value, the null hypothesis is accepted and we
conclude that there is no significant difference between the two means.
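The pooled computation from summary statistics can be sketched as follows (the function name is our own):

```python
from math import sqrt

def pooled_t(mean1, var1, n1, mean2, var2, n2):
    """Two independent sample t with pooled variance; d.f. = n1 + n2 - 2."""
    s_sq = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)   # pooled mean square
    return (mean1 - mean2) / sqrt(s_sq * (1 / n1 + 1 / n2))

t = pooled_t(25, 36, 6, 20, 25, 8)   # ≈ 1.70, compared with t(0.05, 12 d.f.) = 2.18
```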
Objective: Testing the significance of difference of two means in small samples when the observations
of the two samples are paired together.
Kinds of data: The data relate to the increase in weights due to food A and food B fed to eight pigs, as follows:
Pig number 1 2 3 4 5 6 7 8
Food A 49 53 51 52 47 50 52 53
Food B 52 55 52 53 50 54 54 53
Solution: H0: There is no significant difference between the two means of foods A and B, i.e. μ1 = μ2.
H1: There is a significant difference between the two means, i.e. μ1 ≠ μ2.
Take the differences di = Ai − Bi and calculate d̄ = Σdi/n = −16/8 = −2
Pig number 1 2 3 4 5 6 7 8
Food A 49 53 51 52 47 50 52 53
Food B 52 55 52 53 50 54 54 53
di=Ai-Bi -3 -2 -1 -1 -3 -4 -2 0
(𝑑𝑖 − 𝑑̅)2 1 0 1 1 1 4 0 4
Now we calculate S² = (1/(n−1)) Σ(di − d̄)² = 12/7 = 1.71, so S = 1.31
Applying the t-statistic: |t| = |d̄|/(S/√n) = 2/(1.31/√8) = 4.32
Here the absolute value of t= 4.32 and tabulated value of t at 7 d.f. at 0.05 level of significance=2.37
Since the calculated value of t is greater than the tabulated value, the null hypothesis is rejected and
we conclude that the two foods fed to the pigs differ significantly at the 5 % level of significance.
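A short check of the paired computation (a sketch using the pig data above):

```python
from math import sqrt
from statistics import mean, stdev

food_a = [49, 53, 51, 52, 47, 50, 52, 53]
food_b = [52, 55, 52, 53, 50, 54, 54, 53]

d = [a - b for a, b in zip(food_a, food_b)]   # paired differences di = Ai - Bi
n = len(d)
t = abs(mean(d)) / (stdev(d) / sqrt(n))       # ≈ 4.32 on n - 1 = 7 d.f.
```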
Objective: Testing the significance of difference of two means in small samples when the observations
of the two samples are independent.
Kinds of data: The data relate to the increase in weights due to food A and food B in the same pigs,
now assuming that the two samples of pigs are independent:
Pig number 1 2 3 4 5 6 7 8
Food A 49 53 51 52 47 50 52 53
Food B 52 55 52 53 50 54 54 53
Solution: H0: There is no significant difference between two means of foods A and B. i.e.μ1=μ2.
H1: There is significant difference between two means i.e.μ1≠μ2.
Here X̄ = 407/8 = 50.88 and Ȳ = 423/8 = 52.88
S² = (Σ(Xi − X̄)² + Σ(Yi − Ȳ)²)/(n1 + n2 − 2) = (30.88 + 16.88)/14 = 3.41
|t| = |X̄ − Ȳ|/√(S²(1/n1 + 1/n2)) = 2/√(3.41 × (1/8 + 1/8)) = 2.165
Here the absolute value of t= 2.165 and tabulated value of t at 14 d.f. at 0.05 level of significance=2.15
Since the calculated value of t is greater than the tabulated value, the null hypothesis is rejected and
we conclude that the two foods fed to the pigs differ significantly at the 5 % level of significance.
16. Chi-Square Test
Chi-Square Test: is used to test a null hypothesis based on some general law of nature or other reasoning.
χ² test of goodness of fit: The null hypothesis is that there is no difference between the experimental result
and theory. If Oi (i = 1, 2, …, n) is the set of observed frequencies and Ei the set of expected frequencies,
the Karl Pearson χ² is given by
χ² = Σ (Oi − Ei)²/Ei, for i = 1, 2, …, n, which follows the χ² distribution with (n − 1) d.f.
χ² test for a 2×2 contingency table: for the 2×2 contingency table

    a       b       (a+b)
    c       d       (c+d)
    (a+c)   (b+d)   N = a+b+c+d

the test statistic is χ² = N(ad − bc)²/((a+c)(b+d)(a+b)(c+d)), with 1 d.f.
Or alternatively we can calculate the expected frequency of each cell and then apply the chi-square test of
goodness of fit, e.g. E(a) = (a+b)(a+c)/(a+b+c+d), E(b) = (a+b)(b+d)/(a+b+c+d), and so on.
Yates' correction for continuity: if any one of the cell frequencies in a 2×2 contingency table is less than 5,
then by the pooling method the degrees of freedom become 0. In this case we apply Yates' correction for
continuity, which consists of adding 0.5 to the cell frequency which is less than 5 and then adjusting the
remaining cell frequencies accordingly so that the marginal totals are not disturbed. After correction we get
χ² = N[(a ∓ 1/2)(d ∓ 1/2) − (b ± 1/2)(c ± 1/2)]²/((a+c)(b+d)(a+b)(c+d)) = N[|ad − bc| − N/2]²/((a+c)(b+d)(a+b)(c+d))
Note: In the χ² test, if the calculated value of χ² is greater than the tabulated value, the null hypothesis is
rejected, i.e. there is a significant difference between the experimental result and theory.
For an m×n contingency table the degrees of freedom are (m − 1)(n − 1).
Objective: Testing whether the frequencies are equally distributed in a given dataset.
Kinds of data: 200 digits were chosen at random from a set of tables. The frequencies of the digits were as
follows.
Digits 0 1 2 3 4 5 6 7 8 9
Frequency 22 21 16 20 23 15 18 21 19 25
Solution: We set up the null hypothesis H0: The digits are equally distributed in the given dataset.
Under the null hypothesis the expected frequency of each digit would be (sum of frequencies)/(number of digits) = 200/10 = 20
Then the value of χ² = (22−20)²/20 + (21−20)²/20 + (16−20)²/20 + (20−20)²/20 + (23−20)²/20 + (15−20)²/20 + (18−20)²/20 + (21−20)²/20 + (19−20)²/20 + (25−20)²/20
= (4 + 1 + 16 + 0 + 9 + 25 + 4 + 1 + 1 + 25)/20 = 86/20 = 4.3
The tabulated value of χ² at 9 d.f. and the 5 % level of significance is 16.91. Since the calculated value of χ² is
less than the tabulated value, the null hypothesis is accepted. Hence we conclude that the digits are equally
distributed in the given dataset.
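The χ² computation above is a one-liner to verify (a sketch):

```python
observed = [22, 21, 16, 20, 23, 15, 18, 21, 19, 25]
expected = sum(observed) / len(observed)   # 200/10 = 20 under H0

chi_sq = sum((o - expected) ** 2 / expected for o in observed)   # 86/20 = 4.3
```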
Objective: Testing the goodness of fit between experimental result and theory.
Kinds of data: The theory predicts the proportion of beans in the four groups A, B, C and D should be in the
ratio 9:3:3:1. In an experiment having 1600 beans, the observed numbers in the four groups were found to be
882,313,287 and 118.
Solution: We set up the null hypothesis H0: There is no significant difference between experimental result
and theory.
The expected frequencies can be calculated as follows:
Total number of beans = 882 + 313 + 287 + 118 = 1600. The given ratio is 9:3:3:1.
E(882) = (9/16) × 1600 = 900;  E(313) = (3/16) × 1600 = 300
E(287) = (3/16) × 1600 = 300;  E(118) = (1/16) × 1600 = 100
Thus χ² for testing the goodness of fit is
χ² = Σ (Oi − Ei)²/Ei = (882 − 900)²/900 + (313 − 300)²/300 + (287 − 300)²/300 + (118 − 100)²/100
= 0.36 + 0.56 + 0.56 + 3.24 = 4.72
Now, d.f. = 4 − 1 = 3. The tabulated χ² at the 5 % level of significance at 3 d.f. = 7.815
Since the calculated value of χ² is less than the tabulated value, it is not significant. Hence the null
hypothesis is accepted and we conclude that there is good correspondence between theory and experimental
result.
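Expected frequencies from a theoretical ratio generalize to any ratio; a minimal sketch (exact arithmetic gives 4.73, while the text's 4.72 comes from rounding each term to two decimals):

```python
observed = [882, 313, 287, 118]
ratio = [9, 3, 3, 1]            # theoretical 9:3:3:1 segregation

total = sum(observed)                                # 1600
expected = [r * total / sum(ratio) for r in ratio]   # [900, 300, 300, 100]
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```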
Objective: Computation of the χ² value in the case of a contingency table to test the independence of attributes
where one of the cell frequencies is less than five.
Kinds of data: The data relate to the heights of fathers and their youngest sons at the age of 40 years.
Height of youngest Height of fathers
Sons Tall Short Total
Tall 8 2 10
Short 7 6 13
Total 15 8 23
Solution: Here the null hypothesis is H0: The height of the youngest son is independent of the height of the
father, and H1: They are dependent on each other.
Since one cell frequency is less than five, we apply Yates' correction and correct the contingency
table as given below.
Height of youngest Height of fathers
Sons Tall Short Total
Tall 7.5 (a) 2.5 (b) 10
Short 7.5 (c) 5.5 (d) 13
Total 15 8 23
Now compute the value of χ² = N(ad − bc)²/((a+b)(a+c)(c+d)(b+d))
χ² = 23 × (7.5 × 5.5 − 2.5 × 7.5)²/(10 × 13 × 15 × 8) = 0.746
The tabulated value of χ² at the 5 % level of significance and 1 d.f. = 3.841
Since the calculated value of χ² is less than the tabulated value of χ², we do not reject the null hypothesis and
conclude that the height of the youngest sons is independent of the height of their fathers.
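The Yates-corrected χ² can equivalently be computed directly from the original cell frequencies with the N(|ad − bc| − N/2)² formula given earlier (a sketch; the function name is illustrative):

```python
def chi_sq_2x2_yates(a, b, c, d):
    """Yates-corrected chi-square for a 2x2 table with 1 d.f."""
    n = a + b + c + d
    numerator = n * (abs(a * d - b * c) - n / 2) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

chi_sq = chi_sq_2x2_yates(8, 2, 7, 6)   # ≈ 0.746, same as the adjusted-table computation
```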
17. Design of Experiment
(CRD, RBD, LSD, Split and Strip Plot Design)
It is the planning (a sequence of steps taken well in time) of an experiment to obtain appropriate data with
respect to the problem under investigation.
Principles of experimental design: There are 3 basic principles of experimental design:
Replication: Repetition of treatment under investigation is known as replication.
Randomization: The process of assigning treatment to various experimental unit in purely chance manner is
known as randomization.
Local Control: The process of reducing the experimental error by dividing the relatively heterogeneous
experimental area into homogeneous blocks is known as local control.
Type of Design:
Completely Randomized Design: This design is used when the experimental material is homogeneous eg.
laboratory or pot experiment. The principle of local control is absent in CRD.
Randomized Block Design: This design is used when the fertility gradient of the soil is in one direction only.
The whole field is divided into a number of equal blocks perpendicular to the direction of the fertility
gradient, and each block is then divided into a number of plots equal to the number of treatments. In RBD we try
to minimize the within-block variation while keeping the between-block variation as large as possible.
Latin Square Design: This design is used when the fertility gradient of the soil is in both directions. The
field is divided into homogeneous blocks in two ways: the blocks in one direction are known as rows,
and the blocks in the other direction are known as columns. In LSD the number of replications must be equal to
the number of treatments; the numbers of rows, columns and treatments should be equal, and randomization of treatments
is done in such a way that each treatment occurs once and only once in each row and each column.
Split Plot Design: is used when there are two types of treatments and the two are to be estimated with different
precisions. The treatment which is to be estimated with greater precision is allotted to the subplots. In a
split plot design the effects of the subplot treatments and their interactions with the main plot treatments are
estimated more precisely.
Strip Plot Design: is used when there are two factors and both of them require large experimental units, e.g.
four levels of spacing and three methods of ploughing. In a strip plot design the interaction effect is
estimated with greater precision. The experimental area is divided into three types of plots, namely the vertical
strip, the horizontal strip and the intersection plot.
All the analysis of experimental design is based on the analysis of variance table.
ANOVA: It is a technique by which the total variation in any experiment may be split into several
physically assignable components. In ANOVA we determine the source of variation and check whether this
source of variation is significant or not. To check the significance of source of variation F-test is used.
Format of ANOVA:
Analysis of variance table
Source of variation   Degrees of freedom   Sum of squares   Mean sum of squares   Fcal   Ftab (5 %) at (source, error) d.f.
Objective: Analyzing the data of a completely randomized design with unequal numbers of replications per
treatment.
Kinds of data: The following data relate to a varietal trial on green gram (in coded form) that was
conducted using CRD with four varieties V1, V2, V3 and V4 having 3, 6, 5 and 4 replications
respectively. The results are given below (kg/plot):
Since Fcal >Ftab, F test indicates that there are significant differences between the treatment means.
The individual varieties can be compared with the help of critical difference.
In case of unequal number of replication the standard error of difference between treatment means varies
from pair to pair .
Standard error of difference between V1 and V2 = √(EMS × (1/r1 + 1/r2)) = √(2.80 × (1/3 + 1/6)) = 1.18
∴ Critical difference = (S.E.)diff × t0.05 at 14 d.f. = 1.18 × 2.14 = 2.52
Similarly, we can compute the value of C.D. for the other treatment comparisons.
The following treatments comparisons can be made on the basis of C.D. values:
V3 V4 V2 V1
Conclusions : Variety V3 gives significantly higher yield as compared to other varieties, variety V3 and
variety V4 are at par but both of them differ significantly with variety V2 and variety V1. The variety
V2 and variety V1 are also at par and they are giving the lower yield of green gram.
Objective: Analyzing the data of completely randomized design with equal number of replications per
treatment.
Kinds of data : The data relate to the five varieties of sesame using CRD conducted in a greenhouse with
four pots per variety.
Varieties Seed yield of sesame Total Mean
(gm/plot)
V1 25 22 22 18 87 21.75
V2 25 28 26 25 104 26
V3 24 24 18 21 87 21.75
V4 20 17 18 19 74 18.5
V5 14 15 15 11 55 13.75
Total 407
If the difference between two variety means is greater than the critical difference, the varieties differ
significantly. The varieties which do not differ significantly have been underlined by a bar.
Conclusions: The variety V2 gives significantly higher yield than all other varieties. The varieties V1, V3
and V4 are at par but differ significantly with variety V2 and V5 .The variety V5 gives lower yield of
sesame.
Objective : Analyzing the data of randomized complete block design and the computation of efficiency as
compared to completely randomized design.
Kinds of data : The data relate to the yields of 6 wheat varieties(in rounded figures) in an experiment with
4 randomized blocks.
Block yield of Wheat varieties Total
V1 V2 V3 V4 V5 V6
1 27 30 27 16 16 24 140
2 27 28 22 15 17 22 131
3 28 31 34 14 17 22 146
4 38 39 36 19 15 26 173
Total 120 128 119 64 65 94 590
Mean 30 32 29.75 16 16.25 23.5
Here, the d.f. for error is less than 20, so we use the precision factor given below:
Precision factor = (n2 + 1)(n1 + 3)/((n1 + 1)(n2 + 3)), where n1 and n2 are the error degrees of freedom of the
two experiments; this is an expression for the relative efficiency of the second experiment as compared to the first.
= (15 + 1)(18 + 3)/((18 + 1)(15 + 3)) = 0.982
The corrected relative efficiency is then given by
1.60 × 0.982 = 1.57
Therefore, the gain in efficiency in RBD is 57 % as compared to completely randomized design.
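The precision-factor correction is a small computation to script (a sketch; here n1 = 18 and n2 = 15 are the error degrees of freedom of the two designs, as in the text):

```python
def precision_factor(n1, n2):
    """Precision factor applied when error d.f. are below 20."""
    return (n2 + 1) * (n1 + 3) / ((n1 + 1) * (n2 + 3))

pf = precision_factor(18, 15)        # ≈ 0.982
corrected_efficiency = 1.60 * pf     # ≈ 1.57, i.e. a 57 % gain for RBD over CRD
```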
Solution: To get the row, column and treatment totals, the following table is prepared.
Standard error of difference between two treatment means = √(2 × EMSS/r) = √(2 × 101.27/5) = 6.36
∴ Critical Difference = (S.E.)diff X t 0.05 at 12 d.f. = 6.36 x 2.179 = 13.86
The varieties can be compared by setting them in the descending order of their yields in the following
manner.
Tc(107.8) Ta(79.8) Td(72.2) Te(68.8) Tb(63.2)
If the difference between two variety means is greater than the critical difference, the varieties differ
significantly. The varieties which do not differ significantly can be underlined by a bar.
Conclusions: The variety Tc gives the highest yield and differs significantly from all the other varieties. The
varieties A, D, E and B are at par with each other.
Objective : Analyzing the data of Latin square design and the computation of efficiency as compared to
RCBD and CRD.
Kinds of data : The data relate to the Latin square design to test the efficiency of methods of spacing:
A,2’’;B,4’’;C,6’’;D,8’’;E,10’’. The yield in grams of plots of Millet arranged in LSD and layout of the
treatments are given below:
Rows Columns Row Treatment
I II III IV V totals Totals Means
I B(257) E(230) A(279) C(287) D(202) 1255 TA =1349 269.8
II D(245) A(283) E(245) B(280) C(260) 1313 TB =1314 262.8
III E(182) B(252) C(280) D(246) A(250) 1210 TC =1262 252.4
IV A(203) C(204) D(227) E(193) B(259) 1086 TD =1191 238.2
V C(231) D(271) B(266) A(334) E(338) 1440 TE =1188 237.6
Column
1118 1240 1297 1340 1309 GT=6304
Totals
Thus, in order to test the significant differences among the treatment means, we have to analyze the above
data.
The computation of the sums of squares is given below:
Now correction factor = 6304²/25 = 1589617
Total sum of squares = (257² + 230² + … + 338²) − CF = 1626188 − 1589617 = 36571
Sum of squares (Rows) = (1255² + … + 1440²)/5 − CF = 1603218 − 1589617 = 13601
Sum of squares (Columns) = (1118² + … + 1309²)/5 − CF = 1595763 − 1589617 = 6146
Sum of squares (Treatments) = (1349² + 1314² + 1262² + 1191² + 1188²)/5 − CF = 1593773 − 1589617 = 4156
Sum of squares (Error) = 36571 − 13601 − 6146 − 4156 = 12668
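These sums of squares can be reproduced from the yield table (a sketch; the exact values agree with the hand computation above up to rounding):

```python
# Yields (grams/plot) by row and column, from the Latin square table above.
yields = [
    [257, 230, 279, 287, 202],
    [245, 283, 245, 280, 260],
    [182, 252, 280, 246, 250],
    [203, 204, 227, 193, 259],
    [231, 271, 266, 334, 338],
]
treatment_totals = [1349, 1314, 1262, 1191, 1188]   # TA..TE

n = 5
grand_total = sum(sum(row) for row in yields)                 # 6304
cf = grand_total ** 2 / (n * n)                               # correction factor
total_ss = sum(x * x for row in yields for x in row) - cf     # ≈ 36571
row_ss = sum(sum(row) ** 2 for row in yields) / n - cf        # ≈ 13601
col_ss = sum(sum(col) ** 2 for col in zip(*yields)) / n - cf  # ≈ 6146
trt_ss = sum(t * t for t in treatment_totals) / n - cf        # ≈ 4157
error_ss = total_ss - row_ss - col_ss - trt_ss                # ≈ 12667
```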
On comparison of the calculated and tabulated values of F, we find that rows, columns and spacings
are not significant. This is probably due to the shape of the plots. They were long and narrow; hence the
columns are narrow strips running the length of the rectangular area. Under these conditions, the Latin
square may have little advantage on the average over a randomized block plan.
In order to compare pairs of treatment means, we look up t0.05 at 12 degrees of freedom = 2.18.
To be significant, the difference between two treatment means must exceed
2.18 × √(2 × 1056/5) = 44.80
Since the degrees of freedom for error are less than 20, the precision factor is to be computed
for the corrected efficiency of LSD as compared to RCBD and CRD.
The precision factor is (n2 + 1)(n1 + 3)/((n1 + 1)(n2 + 3)) = (12 + 1)(16 + 3)/((16 + 1)(12 + 3)) = 0.9686
Objective: Analysis of data in relation to split plot experiment.
Kinds of data : The data relate to the yields of 3 varieties of Alfalfa obtained in a split plot experiment
with 4 dates of final cutting. The yields are reported in tons per acre.
Yields of Alfalfa in a split plot experiment.
Variety Block
Date 1 2 3 4 5 6 Total
A 2.17 1.88 1.62 2.34 1.58 1.66 11.25
Ladak B 1.58 1.26 1.22 1.59 1.25 0.94 7.84
C 2.29 1.60 1.67 1.91 1.39 1.12 9.98
D 2.23 2.01 1.82 2.10 1.66 1.10 10.92
Total 8.27 6.75 6.33 7.94 5.88 4.82 39.99
A 2.33 2.01 1.70 1.78 1.42 1.35 10.59
Cossack B 1.38 1.30 1.85 1.09 1.13 1.06 7.81
C 1.86 1.70 1.81 1.54 1.67 0.88 9.46
D 2.27 1.81 2.01 1.40 1.31 1.06 9.86
Total 7.84 6.82 7.37 5.81 5.53 4.35 37.72
A 1.75 1.95 2.13 1.78 1.31 1.30 10.22
Ranger B 1.52 1.47 1.80 1.37 1.01 1.31 8.48
C 1.55 1.61 1.82 1.56 1.23 1.13 8.9
D 1.56 1.72 1.99 1.55 1.51 1.33 9.66
Total 6.38 6.75 7.74 6.26 5.06 5.07 37.26
G.Total 22.49 20.32 21.44 20.01 16.47 14.24 114.97
Solution: First we prepare the two-way table of main plot treatment and replication. Here the main plot
treatment is variety (3 levels) whereas the subplot treatment is date of cutting (4 levels).
Correction factor = 114.97²/72 = 183.58
Total sum of squares = (2.17² + 1.88² + … + 1.51² + 1.33²) − 183.58 = 9.12
Block sum of squares = (22.49² + 20.32² + 21.44² + 20.01² + 16.47² + 14.24²)/(4 × 3) − 183.58 = 4.15
Variety sum of squares = (39.99² + 37.72² + 37.26²)/(6 × 4) − 183.58 = 183.76 − 183.58 = 0.18
Total sum of squares from the RV table = Σ RV²/b − CF = (8.27² + 7.84² + … + 5.07²)/4 − 183.58 = 189.27 − 183.58 = 5.69
Main plot error SS (Error I) = TSS of RV − Block SS − Variety SS = 5.69 − 4.15 − 0.18 = 1.36
Next we prepare main plot x subplot table
Variety Date
A B C D Total
Ladak 11.25 7.84 9.98 10.92 39.99
Cossack 10.59 7.81 9.46 9.86 37.72
Ranger 10.22 8.48 8.9 9.66 37.26
Total 32.06 24.13 28.34 30.44 114.97
Dates sum of squares = (32.06² + 24.13² + 28.34² + 30.44²)/(6 × 3) − CF = 185.54 − 183.58 = 1.96
Sum of squares due to interaction (V × D) = Σ VD²/r − CF − SSV − SSD
= (11.25² + 7.84² + … + 9.66²)/6 − 183.58 − 0.18 − 1.96 = 185.93 − 183.58 − 0.18 − 1.96 = 0.21
Subplot error SS (Error II) = Total sum of squares − all other sums of squares = 1.34
The complete analysis can now be set up:
Analysis of variance table
Source of variation   Degrees of freedom   Sum of squares   Mean sum of squares   Fcal      Ftab (5 %)
Block                 5                    4.15             0.83
Variety               2                    0.18             0.09                  <1        F.05 at (2,10) = 4.10
Error I               10                   1.36             0.14
Dates                 3                    1.96             0.65                  23.21**   F.05 at (3,45) = 2.81
Interaction           6                    0.21             0.035                 1.21      F.05 at (6,45) = 2.31
Error II              45                   1.34             0.029
Total                 71                   9.20
The means for the dates of cutting are significantly different, but the other effects are found to be non-
significant.
Compute the standard errors and, to make specific comparisons among treatment means, compute the respective
critical differences, but only when the F-test shows significant differences.
The standard errors of mean differences can be worked out according to the formulas given below:
(i) Standard error of difference between two variety means = √(2Ea/(r × b)) = √(2 × 0.14/(6 × 4)) = 0.1086
Critical difference for two variety means = SEdiff × t5% (10 d.f.) = 0.1086 × 2.23 = 0.242
(ii) Standard error of difference between two date-of-cutting means = √(2Eb/(r × a)) = √(2 × 0.029/(6 × 3)) = 0.0567
Critical difference for two date-of-cutting means = SEdiff × t5% (45 d.f.) = 0.0567 × 2.02 = 0.1145
(iii) S.E. of difference between two dates of cutting at the same level of variety = √(2Eb/r) = 0.098
Critical difference for two date-of-cutting means at the same level of variety = SEdiff × t5% (45 d.f.) = 0.098 × 2.02 = 0.1979
(iv) S.E. of difference between two variety means at the same or different levels of date of cutting
= √(2[(b − 1)Eb + Ea]/(rb)) = 0.1375
For (iv) the standard error of the mean difference involves two error terms, so we use the following equation to
calculate a weighted t value:
t = [(b − 1)Eb·tb + Ea·ta]/[(b − 1)Eb + Ea] = (3 × 0.029 × 2.02 + 0.14 × 2.23)/(3 × 0.029 + 0.14) = 2.149,
where ta and tb are the t values at the Error I (Ea) and Error II (Eb) degrees of freedom respectively.
Critical difference for two variety means at the same or different levels of date of cutting = SEdiff × t
= 0.1375 × 2.149 = 0.2956
Conclusions: There was no significant difference among the variety means. Yield was significantly affected by
the date of final cutting. However, the interaction between variety and date of final cutting was not significant.
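The standard errors and weighted t above follow directly from the two error mean squares; a sketch using the values Ea = 0.14 and Eb = 0.029 from the analysis:

```python
from math import sqrt

r, a, b = 6, 3, 4        # blocks, main-plot varieties, subplot dates
Ea, Eb = 0.14, 0.029     # Error I and Error II mean squares
ta, tb = 2.23, 2.02      # t(5 %) at 10 d.f. (Error I) and 45 d.f. (Error II)

se_variety = sqrt(2 * Ea / (r * b))                          # ≈ 0.108
se_date = sqrt(2 * Eb / (r * a))                             # ≈ 0.057
se_date_within_variety = sqrt(2 * Eb / r)                    # ≈ 0.098
se_variety_mixed = sqrt(2 * ((b - 1) * Eb + Ea) / (r * b))   # ≈ 0.138
t_weighted = ((b - 1) * Eb * tb + Ea * ta) / ((b - 1) * Eb + Ea)   # ≈ 2.149
cd_variety_mixed = se_variety_mixed * t_weighted             # ≈ 0.296
```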
Objective : Analysis of data in relation to strip plot experiment.
Kinds of data : The data relate to the four dates of optimum schedule for five different varieties of wheat
with three replications.
The layout plan and the yields in Kg/plot are given below :
S1 S3 S2 S4
Replication I V2 5.60 2.30 6.70 4.93
V5 5.46 5.87 2.63 6.78
V3 2.24 5.67 3.48 6.58
V1 5.67 6.89 2.56 3.78
V4 2.60 5.65 3.26 2.57
S3 S1 S4 S2
Replication II V4 3.50 6.45 4.80 6.90
V1 6.50 4.69 1.59 4.96
V5 5.32 6.89 2.45 5.36
V2 4.25 3.45 5.69 4.62
V3 2.86 4.39 4.68 2.90
S2 S3 S1 S4
Replication III V3 6.89 4.36 4.26 2.89
V4 4.89 4.58 5.69 5.36
V5 6.89 3.25 2.56 4.60
V2 2.68 4.89 8.90 6.09
V1 2.68 1.89 3.89 2.80
Solution : Prepare two way tables of Replication x Variety, Replication x spacing and Variety x spacing .
(a) Replication x Variety (each figure is a sum of 4 plots)
Replicate Variety
V1 V2 V3 V4 V5 Total
I 18.90 19.53 17.97 14.08 20.74 91.22
II 17.74 18.01 14.83 21.65 20.02 92.25
III 11.26 22.56 18.39 20.52 17.30 90.03
Total 47.90 60.10 51.19 56.25 58.06 273.50
(b) Replication x Spacing (each figure is a sum of 5 plots)
Replicate Spacing
S1 S2 S3 S4 Total
I 21.57 26.38 18.63 24.64 91.22
II 25.87 24.74 22.43 19.21 92.25
III 25.30 24.03 18.96 21.74 90.03
Total 72.74 75.15 60.02 65.59 273.50
(c) Variety x Spacing (each figure is a sum of 3 plots)
Variety Spacing
S1 S2 S3 S4 Total
V1 14.25 14.53 10.95 8.15 47.90
V2 17.95 9.60 15.84 16.71 60.10
V3 10.89 15.46 10.69 14.15 51.19
V4 14.74 17.44 11.34 12.73 56.25
V5 14.91 18.12 11.20 13.83 58.06
Total 72.74 75.15 60.02 65.59 273.50
Grand total of the observations = 273.50
Correction factor = 273.50²/60 = 1246.70
Total sum of squares = (5.60² + 2.30² + … + 2.80²) − CF = 158.92
Replicate sum of squares = (91.22² + 92.25² + 90.03²)/(4 × 5) − CF = 0.128
Variety S.S. = (47.90² + 60.10² + 51.19² + 56.25² + 58.06²)/(3 × 4) − CF = 8.45
Total sum of squares (1) = (18.90² + 19.53² + … + 17.30²)/4 − CF = 1278.19 − 1246.70 = 31.48
Error I = TSS(1) − Replicate S.S. − Variety S.S. = 31.48 − 0.128 − 8.45 = 22.91
Spacing sum of squares = (72.74² + 75.15² + 60.02² + 65.59²)/(3 × 5) − CF = 1256.20 − 1246.70 = 9.50
Total sum of squares (2) = (21.57² + 26.38² + … + 21.74²)/5 − CF = 1263.69 − 1246.70 = 16.99
Error II = TSS(2) − Replicate SS − Spacing SS = 16.99 − 0.128 − 9.50 = 7.36
Total sum of squares (3) = (14.25² + 14.53² + … + 13.83²)/3 − CF = 1298.84 − 1246.70 = 52.14
Interaction SS = TSS(3) − Variety S.S. − Spacing S.S. = 52.14 − 8.45 − 9.50 = 34.19
Error III = Total sum of squares − all other sums of squares = 76.38
Analysis of variance table
Source of variation   Degrees of freedom   Sum of squares   Mean sum of squares   Fcal   Ftab (5 %)
Replicates            2                    0.128            0.06
Variety               4                    8.45             2.11                  0.74   F.05 at (4,8) = 3.34
Error I               8                    22.91            2.86
Spacing               3                    9.50             3.17                  2.58   F.05 at (3,6) = 4.76
Error II              6                    7.36             1.23
Variety × Spacing     12                   34.19            2.85                  0.90   F.05 at (12,24) = 2.18
Error III             24                   76.38            3.18
Total                 59                   158.92
None of the effects is found to be significant in the strip plot design. The standard errors of variety, spacing
and their interaction can be determined on parallel lines to the split plot experiment, as given below:
(i) S.E. of difference between two variety means = √(2Ea/(r × b)) = 0.690
(ii) S.E. of difference between two spacing means = √(2Eb/(r × a)) = 0.405
(iii) S.E. of difference between two variety means at the same level of spacing = √(2[(b − 1)Ec + Ea]/(rb)) = 1.43
(iv) S.E. of difference between two spacing means at the same level of variety = √(2[(a − 1)Ec + Eb]/(ra)) = 1.36
The critical difference is obtained by multiplying the standard error by the tabulated value of t at the
respective degrees of freedom for (i) and (ii). For (iii) and (iv) the following equations are used to compute
the weighted values of t:
t = [(b − 1)Ec·tc + Ea·ta]/[(b − 1)Ec + Ea] and t = [(a − 1)Ec·tc + Eb·tb]/[(a − 1)Ec + Eb],
where ta, tb and tc are the tabulated values of t at the error degrees of freedom of Ea, Eb and Ec respectively.
18. Factorial Design
Factorial Design: In a factorial experiment the effects of several factors are studied and investigated
simultaneously, the treatments being combinations of the different factors under study. In such an experiment we
estimate the effect of each factor and also their interaction effects. In a 2ⁿ experiment there are n
factors, each at 2 levels.
Consider a 2² experiment: there are 2 factors, each at 2 levels, and the treatment combinations are a0b0,
a1b0, a0b1 and a1b1.
In a factorial experiment the analysis can be done in the usual manner as in CRD or RBD, but the treatment sum
of squares is split into orthogonal components.
Yates' method can also be used for computing the factorial effect totals. The working table has the format:
Treatment combination   Total yields   (1)   (2)   Effect totals   SS = (Effect total)²/(2² × r)
Similarly in case of 32 experiment there are two factors each at three levels and the total combinations are 9.
Objective: Analysis of a 2³ factorial experiment.
Kinds of data: A 2³ experiment in eight randomized blocks was conducted in order to obtain an idea of
the interactions among the three factors N, P and K, each at two levels. The design and yield per plot are given
below:
Replicate 1
Block 1 (1) 25 pk 24 nk 32 np 30
Block2 n 30 k 32 npk 36 p 27
Replicate 2
Block 3 p 32 npk 42 n 46 k 39
Block4 nk 34 (1) 44 np 30 pk 36
Replicate 3
Block 5 npk 30 k 32 n 28 p 26
Block6 (1) 24 pk 20 nk 28 np 36
Replicate 4
Block 7 np 32 (1) 34 pk 39 nk 41
Block8 npk 45 n 41 p 29 k 35
Now we break up the treatment S.S. with 7 d.f. into 7 orthogonal components, each with 1 d.f. For this we
use Yates' method for computing the factorial effect totals and their sums of squares.
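Yates' procedure itself is mechanical: with the treatment totals in standard order, repeatedly form pairwise sums followed by pairwise differences, n times for a 2ⁿ experiment. A sketch (the treatment totals below are summed over the four replicates of the table above):

```python
def yates(totals, r):
    """Yates' algorithm for a 2^n factorial.

    `totals` must be treatment totals in standard order
    ((1), a, b, ab, c, ac, bc, abc, ...); returns the factorial
    effect totals and their sums of squares SS = total^2 / (2^n * r).
    """
    col = list(totals)
    m = len(col)
    for _ in range(m.bit_length() - 1):      # n passes for 2^n treatments
        sums = [col[i] + col[i + 1] for i in range(0, m, 2)]
        diffs = [col[i + 1] - col[i] for i in range(0, m, 2)]
        col = sums + diffs
    return col, [e * e / (m * r) for e in col]

# Treatment totals over the 4 replicates: (1), n, p, np, k, nk, pk, npk.
totals = [127, 145, 114, 128, 138, 135, 119, 153]
effects, ss = yates(totals, r=4)
# effects[0] is the grand total; effects[1:] are the N, P, NP, K, NK, PK, NPK totals.
```

Here ss[1] ≈ 124.03 reproduces the sum of squares for the main effect N in the analysis of variance table.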
Analysis of variance table for the 2³ factorial experiment
Source of variation   Degrees of freedom   Sum of squares   Mean sum of squares   Fcal    Ftab (5 %)
Blocks                7                    615.84           87.98                 3.54*   F.05 at (7,17) = 2.61
N                     1                    124.03           124.03                5.00*   F.05 at (1,17) = 4.45
P                     1                    30.03            30.03                 1.21
NP                    1                    34.03            34.03                 1.37
K                     1                    30.03            30.03                 1.21
NK                    1                    0.03             0.03                  0.00
PK                    1                    26.28            26.28                 1.06
NPK                   1                    52.53            52.53                 2.12
Error                 17                   421.90           24.82
Total                 31                   1334.72
Here the blocks differ significantly. Among the treatments, only the main effect of N is significant; the other
effects are not significant.
Standard error of any factorial effect total = √(r × 2³ × MSE) = √(4 × 8 × 24.82) = 28.18
Significant value for any factorial effect total = t5% for 17 d.f. × 28.18 = 2.109 × 28.18 = 59.45
Conclusion: Comparing this value with the factorial effect totals in the Yates table, we find that only the main
effect of N is significant; the others are non-significant.
Objective: Analysis of a 3² factorial experiment.
Solution: First we construct the two-way table of treatment totals over all the blocks.
A/B b2 b1 b0 Total
a2 100 112 88 300
a1 84 84 60 228
a0 80 56 56 192
Total 264 252 204 720
C.F. = 720²/36 = 14400
Total sum of squares = 26² + 30² + … + 18² − 14400 = 16028 − 14400 = 1628
Block S.S. = (189² + 207² + 153² + 171²)/9 − C.F. = 180
Treatment S.S. = (100² + 112² + … + 56²)/4 − C.F. = 768
Error S.S.=T.S.S.-Block S.S.-Treatment S.S.= 680
Now, let us compute the value and sum of squares of the eight contrasts.
Contrast   Value of Z   Divisor (D × r)   Sum of squares = Z²/(D × r)
A1 300-192=108 6x4 486
A2 492-456=36 18x4 18
B1 264-204=60 6x4 150
B2 468-504=-36 18x4 18
A1B1 156-168=-12 4x4 9
A1B2 300-360=-60 12x4 75
A2B1 300-312=-12 12x4 3
A2B2 660-624=36 36x4 9
The treatments are found to be significant, especially the linear effects of the factors A and B.
Remark: The 2³ factorial experiment above had four replicates, each replicate divided into two blocks of four
plots. It can therefore also be analyzed by the method of complete confounding, including the sum of squares due
to NPK in the error, since NPK has been completely confounded; this analysis is given in practical No. 48.
19. Confounding
Confounding is the technique by which the precision of the main effects and of certain interactions, generally those of lower order, is increased at the cost of precision on certain higher-order interactions.
Complete confounding: when the same interaction is confounded in all the replicates, it is known as complete (or total) confounding.
The analysis procedure is the same as for a factorial experiment, except that one degree of freedom is lost from the treatments: because the same interaction is confounded in all the replicates, its effect cannot be separated from the block differences.
Partial confounding is used when we want to divide each replicate into smaller homogeneous blocks without losing information on any of the interactions. In partial confounding a different effect is confounded in each replicate, so that every such effect can be estimated from the remaining replicates in which it was not confounded.
The analysis procedure is again the same as for a factorial experiment; only the calculation of the sums of squares of the partially confounded effects changes.
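The rule for splitting a 2³ replicate into blocks is worth making concrete. The sketch below is illustrative (not part of the text's procedure): a treatment combination goes into the block containing (1) when it has an even number of letters in common with the confounded interaction.

```python
from itertools import product

factors = "abc"
confounded = set("abc")   # letters of the confounded interaction ABC

def combinations():
    # All 2^3 treatment combinations: (1), a, b, ab, c, ac, bc, abc.
    for bits in product([0, 1], repeat=len(factors)):
        name = "".join(f for f, b in zip(factors, bits) if b)
        yield name or "(1)"

def letters(name):
    return set() if name == "(1)" else set(name)

# Even number of letters shared with ABC -> block containing (1).
block_i = [t for t in combinations() if len(letters(t) & confounded) % 2 == 0]
block_ii = [t for t in combinations() if len(letters(t) & confounded) % 2 == 1]
print(sorted(block_i))    # ['(1)', 'ab', 'ac', 'bc']
print(sorted(block_ii))   # ['a', 'abc', 'b', 'c']
```

This reproduces the block layout used in the complete-confounding example below, where every replicate places (1), ab, ac, bc in one block and a, b, c, abc in the other.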
Objective: Analysis of data of complete confounding
Kinds of data: The following data relate to a completely confounded 2³ experiment with the factors A, B and C, conducted in 4 replications. In each replicate the interaction ABC is confounded.
Effect confounded in every replicate: ABC

Replicate 1:  Block (i): (1) 19.1, ab 19.2, ac 18.8, bc 19.4    Block (ii): a 18.6, b 18.2, c 19.0, abc 20.4
Replicate 2:  Block (i): (1) 20.7, ab 22.1, ac 21.2, bc 20.1    Block (ii): a 25.9, b 23.0, c 24.9, abc 23.4
Replicate 3:  Block (i): (1) 23.4, ab 20.4, ac 23.2, bc 20.3    Block (ii): a 22.2, b 21.0, c 23.6, abc 21.6
Replicate 4:  Block (i): (1) 19.1, ab 21.9, ac 18.6, bc 21.5    Block (ii): a 23.6, b 23.7, c 21.0, abc 22.8
The sums of squares due to the 6 unconfounded factorial effects are obtained by Yates' technique.
Treatment    Total
combination  yield    (1)      (2)      (3)     Effect total   SS = (effect total)²/32
(1)           82.3   172.6    342.1    681.9    GT
a             90.3   169.5    339.8      5.9    [A]            1.09
b             85.9   170.3      5.7     -3.9    [B]            0.48
ab            83.6   169.5      0.2      3.3    [AB]           0.34
c             88.5     8.0     -3.1     -2.3    [C]            0.17
ac            81.8    -2.3     -0.8     -5.5    [AC]           0.95
bc            81.3    -6.7    -10.3      2.3    [BC]           0.17
abc           88.2     6.9     13.6     23.9    [ABC]          not estimable
Put all the sums of squares in the ANOVA table and test the main effects and interactions, excluding the effect that is completely confounded with blocks.
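Yates' technique is just repeated sums and differences; a generic 2^k sketch applied to the treatment totals of this example is given below (r = 4, so the SS divisor is r × 2³ = 32).

```python
# Treatment totals in standard order (1), a, b, ab, c, ac, bc, abc,
# taken from the complete-confounding example above.
totals = [82.3, 90.3, 85.9, 83.6, 88.5, 81.8, 81.3, 88.2]

def yates(y):
    """Yates' algorithm for a 2^k factorial: k passes, each replacing the
    column by the pairwise sums followed by the pairwise differences
    (second minus first)."""
    col, n = list(y), len(y)
    for _ in range(n.bit_length() - 1):
        col = ([col[i] + col[i + 1] for i in range(0, n, 2)] +
               [col[i + 1] - col[i] for i in range(0, n, 2)])
    return col

names = ["GT", "A", "B", "AB", "C", "AC", "BC", "ABC"]
for name, total in zip(names, yates(totals)):
    if name == "GT":
        print(f" GT: effect total = {total:6.1f}")
    else:
        # The SS printed for ABC is unusable here, since ABC is confounded.
        print(f"{name:>3}: effect total = {total:6.1f}, SS = {total * total / 32:5.2f}")
```

The output reproduces the Yates table above: GT = 681.9, [A] = 5.9, [B] = -3.9, and so on through [ABC] = 23.9.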
Analysis of variance table

Source of variation   d.f.   S.S.    M.S.S.   Fcal    Ftab (5%)
Blocks                  7    92.03   13.15    7.22*   F.05 at (7,18) = 2.58
A                       1     1.08    1.08    <1      F.05 at (1,18) = 4.41
B                       1     0.47    0.47    <1
AB                      1     0.34    0.34    <1
C                       1     0.16    0.16    <1
AC                      1     0.94    0.94    <1
BC                      1     0.16    0.16    <1
Error                  18    32.79    1.82
Total                  31   128.00
From the above analysis of variance table we find that none of the treatment effects is significant, as would be expected for data taken from a uniformity trial, whereas the block effect is significant; hence the confounding has proved effective.
Solution: H0: The data are homogeneous with respect to blocks and treatments.
Since each replicate has been divided into 2 blocks, one effect has been confounded in each replicate: replicate 1 confounds AB, replicate 2 confounds AC, replicate 3 confounds BC and replicate 4 confounds ABC. Hence this is an example of partial confounding.
Here the grand total of the observations is GT = 715.4.
Correction Factor = 715.4²/32 = 15993.66
The effect totals of all the factorial effects are obtained by Yates' technique, but sums of squares can be computed directly only for the unconfounded effects A, B and C; the partially confounded effects AB, AC, BC and ABC require adjustment.
Treatment    Total
combination  yield    (1)      (2)      (3)     Effect total   SS = (effect total)²/32
(1)           98.6   191.6    371.9    715.4    GT
a             93.0   180.3    343.5    -14.0    [A]            6.13
b             92.9   169.2    -11.1     -6.2    [B]            1.20
ab            87.4   174.3     -2.9      5.6    [AB]           not directly estimable
c             86.7    -5.6    -11.3    -28.4    [C]            25.21
ac            82.5    -5.5      5.1      8.2    [AC]           not directly estimable
bc            86.5    -4.2      0.1     16.4    [BC]           not directly estimable
abc           87.8     1.3      5.5      5.4    [ABC]          not directly estimable
In order to estimate the sum of squares of a partially confounded effect, an adjustment factor (AF) is calculated for each such interaction:
AF = [Total of the block containing (1) in the replicate where the effect is confounded] - [Total of the block not containing (1) in that replicate]
The interaction AB is estimated by (1/4)(a - 1)(b - 1)(c + 1); here the sign of (1) is positive, hence the adjustment factor is subtracted from the AB effect total. The adjusted total, being based only on the 3 replicates in which AB is not confounded, yields the sum of squares with divisor 3 × 2³ = 24. The interactions AC, BC and ABC are adjusted in the same way.
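The adjustment can be made mechanical with a small helper. Note that the block totals in the demonstration call below are hypothetical, since the per-block totals of this experiment are not reproduced in the text; they were chosen only to show the mechanics.

```python
# Sketch of the adjustment for a partially confounded 2^3 effect.  The
# divisor 24 = 3 * 2^3 reflects that the adjusted total is informed only
# by the 3 replicates in which the effect is not confounded.
def adjusted_effect_ss(effect_total, block_with_1, block_without_1):
    """AF = (block containing (1)) - (block not containing (1)), both from
    the replicate where the effect is confounded.  With the sign of (1)
    positive in the contrast, the adjusted total is effect_total - AF."""
    af = block_with_1 - block_without_1
    adjusted = effect_total - af
    return adjusted, adjusted ** 2 / 24

# Hypothetical replicate-1 block totals (AB is confounded in replicate 1);
# [AB] = 5.6 is the effect total from the Yates table above.
adj, ss = adjusted_effect_ss(5.6, 100.0, 99.2)
print(round(adj, 2), round(ss, 3))   # 4.8 0.96
```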
Analysis of variance table for the partially confounded 2³ experiment

Source of variation   d.f.   S.S.     M.S.S.    Fcal     Ftab
Blocks                  7   410.39    58.63    15.72*    F.05 at (7,17) = 2.61
Treatments              7    52.95
A                       1     6.12     6.12     1.64     F.05 at (1,17) = 4.45
B                       1     1.20     1.20     <1       F.01 at (1,17) = 8.40
AB                      1     0.96     0.96     <1
C                       1    25.205   25.205    6.76*
AC                      1     9.25     9.25     2.48
BC                      1     2.16     2.16     <1
ABC                     1     8.05     8.05     2.16
Error                  17    63.42     3.73
Total                  31   526.76
From the above table it is observed that only the main effect of the factor C is significant at the 5% level; the other effects are non-significant. The block effects are also found to be significant.
In comparing means it is important to keep in mind that the interactions are estimated from only 3/4 of the replications. Thus the standard error of a mean interaction response is √(2 × 3.73/(3 × 2²)) = 0.788, while the standard error of a main-effect mean response is √(2 × 3.73/(4 × 2²)) = 0.683.
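The two standard errors can be verified directly; the error mean square 3.73 comes from the ANOVA table, interactions are estimated from only 3 of the 4 replicates, and 2² observations enter each mean at a given level of the remaining factors.

```python
import math

ems = 3.73   # error mean square from the partially confounded ANOVA table

# Interactions use 3 of the 4 replicates; main effects use all 4.
se_interaction = math.sqrt(2 * ems / (3 * 2 ** 2))
se_main = math.sqrt(2 * ems / (4 * 2 ** 2))
print(round(se_interaction, 3))   # 0.788
print(round(se_main, 3))          # 0.683
```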
REFERENCES:
1. Practicals in Statistics, by H. L. Sharma.
2. Statistical Methods, by G. W. Snedecor.
3. Experimental Designs and Survey Sampling: Methods and Applications, by H. L. Sharma.
4. A Handbook of Agricultural Statistics, by S. R. S. Chandel.
5. The Theory of Sample Surveys and Statistical Decisions, by K. S. Kushwaha and Rajesh Kumar.
6. Fundamentals of Mathematical Statistics, by S. C. Gupta and V. K. Kapoor.