
Statistical Techniques (BCS-040)

DEFINITION OF STATISTICS (9013873713 Dr. Prabhat Kumar Sangal)


“The science which deals with the collection, tabulation, analysis and
interpretation of numerical data.” – Croxton and Cowden.
TYPES OF DATA
The values of different objects collected in a survey or otherwise recorded are
called data in Statistics. Each value in the data is known as an observation.
Statistical data may be classified by scale of measurement as follows:

(1) Nominal Scale or Categorical


The word nominal comes from the Latin word ‘nomen’, which means name.
Therefore, under nominal scale we divide the observations under study into two
or more categories by giving them unique names.
For example, a population can be divided into two categories, male ‘M’ and
female ‘F’:

Category Name/Code
Male M
Female F

Here we have named male as ‘M’ and female as ‘F’. This is not the only way:
we can also code male by ‘0’ and female by ‘1’, or use any other convenient
symbols. The main point is that each category must be given a unique name.
Note 1: In nominal scale we have just coded the objects, so the signs of less
than or greater than do not make any sense. For example, if we code Hindu and
Muslim by ‘1’ and ‘2’ respectively, then Hindu > Muslim or Muslim > Hindu does
not make any sense. Similarly, comparing the codes ‘M’ and ‘F’ in the table
above does not make any sense.
That is, we cannot talk about the order between two categories in case of nominal
scale.
(2) Ordinal Scale
As the name ordinal suggests, in addition to the names or codes given to the
different categories, this scale also provides the order among the categories.
That is, we can place the objects in a series based on the orders or ranks given
by the ordinal scale. But here we cannot find the actual difference between two
categories. Example: grades in an examination, A+, A, B+, B, C+, C, etc.
(3) Interval Scale
Nominal scale gives only names to the different categories. Ordinal scale, moving
one step further, also provides the concept of order between the categories.
Interval scale, moving one step ahead of ordinal scale, also provides the
difference between any two categories.
Interval scale is used when we want to measure years/historical time/calendar
time, temperature (except in the Kelvin scale), sea level, marks in tests where
there is negative marking, etc. Mathematically, this scale includes +, – in
addition to >, < and =.

(4) Ratio Scale

Ratio scale is the highest level of measurement. Nominal scale gives only names
to the different categories; ordinal scale adds order between the categories;
interval scale adds the difference between categories; and ratio scale, in
addition to names, orders and differences, also provides a natural zero (absolute
zero). In ratio measurement scale the values of the characteristic cannot be
negative. Ratio scale is used when we want to measure temperature in Kelvin,
weight, height, length, age, mass, time, plane angle, etc.
Quantitative Data
As the name quantitative suggests, it is related to quantity. Data are said to
be quantitative if a numerical quantity (which exactly measures the
characteristic under study) is associated with each observation.
Generally, interval or ratio scales are used as the scale of measurement in case
of quantitative data. Characteristics such as weight, height, age, length, area,
volume, money, temperature, humidity, size, etc. generally give quantitative
data.
Qualitative Data
As the name qualitative suggests, it is related to the quality of an
object/thing. Quality cannot be measured numerically in exact terms. Thus, if
the characteristic/attribute under study is such that it is measured only on the
basis of presence or absence, then the data thus obtained are known as
qualitative data.
Generally, nominal and ordinal scales are used as the scale of measurement in
case of qualitative data. Characteristics such as gender, marital status,
qualification, colour, religion, satisfaction, types of trees, beauty, honesty,
etc. generally give qualitative data.
Discrete Data
If the nature of the characteristic under study is such that the values of
observations are at most countable between two certain limits, then the
corresponding data are known as discrete data.
For example,
(i) The number of books on the shelf of an almirah in a library forms discrete
data, because the number of books may be 0 or 1 or 2 or 3, …, but it cannot
take non-integer real values such as 0.8, 1.32, 1.53245, etc.
Continuous Data
Data are said to be continuous if the measurement of the observations of a
characteristic under study may be any real value between two certain limits.
For example,
(i) Data obtained by measuring the heights of the students of a class of, say,
30 students form continuous data, because if the minimum and maximum heights
are 152 cm and 175 cm then the heights of the students may take any value
between 152 cm and 175 cm. For example, it may be 152.2375 cm,
160.31326… cm, etc.
(ii) Data obtained by measuring the weights of the students of a class also
form continuous data, because weights of students may be 48.25796… kg,
50.275 kg, 42.314314314… kg, etc.
Primary Data
Data which are collected by an investigator, agency or institution for a specific
purpose, and which these people are the first to use, are called primary data.
That is, these data are originally collected by these people and they are the
first to use them.
For example, suppose a research scholar wants to know the mean age of students
of M.Sc. Chemistry of a particular university. If he collects data on the age
of each student of M.Sc. Chemistry of that university by contacting each student
personally, then the data so obtained are an example of primary data for that
research scholar.
Secondary Data
Data obtained/gathered by an investigator, agency or institution from a source
which already exists are called secondary data. That is, these data were
originally collected by some investigator, agency or institution and have been
used by them at least once, and now they are going to be used at least a second
time.
For example, consider the same example as discussed in case of primary data. If
the research scholar collects the ages of the students from the records of that
university, then the data thus obtained are an example of secondary data.
Note that in both cases the data remain the same; only the way of collecting the
data differs.
Methods of collection of primary data
(1) Direct Personal Investigation Method
(2) Telephone Method
(3) Indirect Oral Interviews Method
(4) Local Correspondents Method
(5) Mailed Questionnaires Method
(6) Schedules Method
Methods of collection of secondary data
(1) Published Sources
(a) International Publications
(b) Government Publications in India
(c) Published Reports of Commissions and Committees
(d) Research Publications
(e) Reports of Trade and Industry Associations
(f) Published Printed Sources
(g) Published Electronic Sources
(2) Unpublished Sources
Difference between primary and secondary data
Factor of difference   Primary data                     Secondary data
Definition             Data collected by an             Data obtained/gathered by an
                       investigator, agency or          investigator, agency or
                       institution for a specific       institution from a source
                       purpose, who are the first       which already exists
                       to use these data
Time                   Long time is required for        Less time is required for
                       collection                       collection
Money                  Needs more money for             Needs less money for
                       collection                       collection
Reliability            More reliable                    Less reliable
Hand                   First-hand data                  Second-hand data
Manpower               Needs more manpower              Needs less manpower
Adequacy               More adequate                    Less adequate
Suitability            More suitable                    Less suitable

Frequency Distribution
When observations (raw data) are large in number, it is not easy to handle the
data in this form. So it becomes necessary to condense the data as far as
possible without losing any information of interest. We do this with the help of
a frequency distribution.
An arrangement of the frequencies corresponding to the values of the variable is
called a frequency distribution.
Let us consider the ages of 30 students selected at random from among those
studying in a certain class.
20, 22, 25, 22, 21, 22, 25, 24, 23, 22, 21, 20, 21, 22, 23, 25, 23, 24, 22, 24, 21, 20,
23, 21, 22, 21, 20, 21, 22, 25.
Age of Tally Frequency
students Mark
20 |||| 04
21 |||| || 07
22 |||| ||| 08
23 |||| 04
24 ||| 03
25 |||| 04
Total 30
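The tally procedure above can also be sketched in code. The following is a minimal illustration, not part of the original notes; the function name frequency_table is an assumption chosen for this example, and the data are the 30 ages listed above.

```python
from collections import Counter

# Ages of the 30 students, copied from the example above.
ages = [20, 22, 25, 22, 21, 22, 25, 24, 23, 22, 21, 20, 21, 22, 23,
        25, 23, 24, 22, 24, 21, 20, 23, 21, 22, 21, 20, 21, 22, 25]


def frequency_table(values):
    """Return (value, frequency) pairs sorted by value (hypothetical helper)."""
    return sorted(Counter(values).items())


table = frequency_table(ages)
for age, freq in table:
    print(f"{age:>3}  {freq:>2}")
print("Total", sum(freq for _, freq in table))
```

Each printed row corresponds to one row of the tally table, and the total should agree with the number of observations.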

Discrete Frequency Distribution


A frequency distribution in which the information is distributed in different
classes on the basis of a discrete variable is known as a discrete frequency
distribution. For example, the frequency distribution of ages given above is a
discrete frequency distribution.
Continuous Frequency Distribution
A distribution in which the information is distributed in different classes on the
basis of a continuous variable is known as continuous frequency distribution.
Example: The marks of 30 students in statistics are given below:
10, 12, 25, 32, 27, 32, 38, 43, 39, 55, 29, 38, 57, 08, 06, 13, 27, 25, 29, 53,
55, 45, 35, 48, 47, 59, 15, 19, 48, 55
Classify the above data by taking a suitable class interval.
Method of Continuous Frequency Distribution
Exclusive Method
Under this method, each upper class limit is excluded from its class interval.
The class intervals are fixed so that the upper limit of one class is the lower
limit of the next class. In the following example there are 24 students who have
secured marks between 0 and 50. A student who secured 20 marks would be included
in class 20-30, not in 10-20. This method is widely followed in practice.
Example 3: 24 students appeared in an entrance test where all questions are
objective type. The marks obtained out of 50 maximum marks are as follows:
17, 16, 7, 30, 21, 42, 44, 36, 22, 22, 25, 31, 31, 34, 30, 36, 35, 45, 25, 15,
20, 42, 40, 30
Prepare a frequency distribution by using exclusive method.
Min = 7, Max = 45
Solution: Frequency distribution of marks obtained by the above 24 students is
given below in Table 13.8 using the exclusive method:
Classes   Tally bar   No. of Students
0-10      |           1
10-20     |||         3
20-30     |||| |      6
30-40     |||| ||||   9
40-50     ||||        5
Total                 24
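As a rough check of the exclusive method, the classification can be sketched in code. This is only an illustration under the assumption that classes are half-open intervals [lower, upper); the function name exclusive_frequencies is hypothetical.

```python
# Marks of the 24 students from Example 3.
marks = [17, 16, 7, 30, 21, 42, 44, 36, 22, 22, 25, 31, 31, 34, 30, 36,
         35, 45, 25, 15, 20, 42, 40, 30]


def exclusive_frequencies(values, start, stop, width):
    """Count values in classes [lower, lower + width); the upper limit is excluded."""
    counts = {}
    for lower in range(start, stop, width):
        upper = lower + width
        counts[(lower, upper)] = sum(lower <= v < upper for v in values)
    return counts


exclusive = exclusive_frequencies(marks, 0, 50, 10)
for (lower, upper), count in exclusive.items():
    print(f"{lower}-{upper}: {count}")
```

Because the comparison is lower <= v < upper, a mark of 20 is counted in 20-30 and not in 10-20, as the method requires.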
Inclusive Method
Under the inclusive method of classification both the lower class limit and the
upper class limit of a class are included in that class itself. The following
frequency distribution is formed using the inclusive method for the data of
Example 3 given above.
Table 13.9: Frequency Distribution of 24 Students by Inclusive Method
Class   Tally bar   No. of Students   Relative frequency
0-9     |           1                 1/24
10-19   |||         3                 3/24
20-29   |||| |      6                 6/24
30-39   |||| ||||   9                 9/24
40-49   ||||        5                 5/24
Total               24                1

That means if data are classified in such a way that the lower as well as the
upper class limits are included in the same class interval, it is called an
inclusive class interval.
For converting data from inclusive form to exclusive form, we first find half of
the difference between the lower limit of a class and the upper limit of the
preceding class. This value is then subtracted from the lower limit of each class
and added to the upper limit of each class. In the above example, the difference
is 10 – 9 = 1 and half of the difference is (10 – 9)/2 = 0.5. So the class
intervals become –0.5-9.5, 9.5-19.5, …, 39.5-49.5. If all the observations are
positive then the lower limit of the first class can be taken as 0, and the
class intervals are 0-9.5, 9.5-19.5, …, 39.5-49.5:
Inclusive class   Calculation               Exclusive class
0-9               0-(9 + 0.5)               0-9.5
10-19             (10 – 0.5)-(19 + 0.5)     9.5-19.5
20-29             (20 – 0.5)-(29 + 0.5)     19.5-29.5
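The conversion rule can be sketched as follows. This is a minimal illustration assuming all classes have the same gap between consecutive limits; the function name to_exclusive is hypothetical, and the optional step of raising the first lower limit to 0 is left out.

```python
def to_exclusive(classes):
    """Convert inclusive (lower, upper) class limits to class boundaries."""
    # Half the gap between the upper limit of one class and the
    # lower limit of the next class, e.g. (10 - 9) / 2 = 0.5.
    half_gap = (classes[1][0] - classes[0][1]) / 2
    return [(lower - half_gap, upper + half_gap) for lower, upper in classes]


inclusive_classes = [(0, 9), (10, 19), (20, 29), (30, 39), (40, 49)]
boundaries = to_exclusive(inclusive_classes)
print(boundaries)
```

After the conversion the upper boundary of each class coincides with the lower boundary of the next class, which is what the graphs of continuous frequency distributions need.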
Relative Frequency Distribution
A relative frequency corresponding to a class is the ratio of the frequency of that
class to the total frequency. The corresponding frequency distribution is called
relative frequency distribution. If we multiply each relative frequency by 100, we
get the percentage frequency corresponding to that class and the corresponding
frequency distribution is called “Percentage frequency distribution”.
Example 1: A frequency distribution of marks of 50 students in a subject is as
given below:
Class (Marks): 0-10 10-20 20-30 30-40 40-50
Frequency: 6 10 14 18 2
Prepare relative and percentage frequency distributions.
Solution: The relative and percentage frequency distributions can be formed as
given in the following table:
Class (Marks)   Frequency (f)   Relative frequency (f/N)   Percentage frequency (f/N) × 100
0-10            6               6/50 = 0.12                0.12 × 100 = 12 %
10-20           10              10/50 = 0.20               0.20 × 100 = 20 %
20-30           14              14/50 = 0.28               0.28 × 100 = 28 %
30-40           18              18/50 = 0.36               0.36 × 100 = 36 %
40-50           2               2/50 = 0.04                0.04 × 100 = 4 %
Total           50              1.00                       100 %
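The same table can be reproduced with a few lines of code. This is a small sketch using the frequencies of Example 1; the variable names are illustrative only.

```python
classes = ["0-10", "10-20", "20-30", "30-40", "40-50"]
frequencies = [6, 10, 14, 18, 2]

total = sum(frequencies)                      # N = 50
relative = [f / total for f in frequencies]   # f / N
percentage = [100 * r for r in relative]      # (f / N) x 100

for c, f, r, p in zip(classes, frequencies, relative, percentage):
    print(f"{c:>6}  {f:>3}  {r:.2f}  {p:.0f} %")
```

The relative frequencies should add up to 1 and the percentage frequencies to 100, apart from floating-point rounding.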

Cumulative Frequency Distribution


The cumulative frequency of a class is the total of all the frequencies up to and
including that class. A cumulative frequency distribution is a frequency
distribution which shows the number of observations ‘less than’ or ‘more than’ a
specific value of the variable.
The number of observations less than the upper class limit of a given class is
called the less than cumulative frequency, and the corresponding distribution is
called the ‘less than’ cumulative frequency distribution.
Similarly, the number of observations at or above the lower class limit of a
given class is called the more than cumulative frequency, and the corresponding
distribution is called the ‘more than’ cumulative frequency distribution. In the
following example both ‘less than’ and ‘more than’ cumulative frequency
distributions are obtained.
Example 2: For the following frequency distribution of marks of 50 students in a
subject, form both types of cumulative frequency distributions.
Class (Marks) 0-10 10-20 20-30 30-40 40-50
No. of Students 7 11 15 12 5

Solution: Cumulative frequency distributions are formed as given in the
following table:
Given Frequency          Less Than Cumulative         More Than Cumulative
Distribution             Frequency Distribution       Frequency Distribution
Classes   No. of         Marks       No. of           Marks       No. of
          Students       Less than   Students         More than   Students
0-10      07             10          07               0           50
10-20     11             20          7+11=18          10          50-7=43
20-30     15             30          18+15=33         20          43-11=32
30-40     12             40          33+12=45         30          32-15=17
40-50     05             50          45+5=50          40          17-12=05
Total     50
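Both cumulative columns can be computed from the class frequencies alone. The sketch below uses the frequencies of Example 2; it assumes exclusive classes listed in increasing order, and the variable names are illustrative.

```python
from itertools import accumulate

frequencies = [7, 11, 15, 12, 5]   # classes 0-10, 10-20, ..., 40-50

# 'Less than' cumulative frequencies, read against the upper class limits.
less_than = list(accumulate(frequencies))

# 'More than' cumulative frequencies, read against the lower class limits:
# the total minus everything that lies below the lower limit.
total = less_than[-1]
more_than = [total - below for below in [0] + less_than[:-1]]

print(less_than)
print(more_than)
```

The first ‘more than’ value and the last ‘less than’ value should both equal the total frequency.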

Components of a Table
The various components of a table may vary case to case depending upon the
given data. But a good table must contain at least the following components:
1. Table Number
2. Table Heading
3. Caption
4. Stub
5. Body of Table
6. Head Note
7. Foot Note
Table 2.6: The values of the K-S, C-V M and A-D of fitted models based on the dataset of the
time to tumour appearance.
Fitted Distribution                K-S       C-V M     A-D
Exponential Power                  0.0894*   0.0402*   0.3902*
Exponentiated Weibull              0.1699    0.3486    2.0316
Exponentiated Exponential          0.1451    0.2409    1.4753
Generalized Rayleigh               0.1106    0.1449    0.9441
Generalized Power Weibull          0.6055    4.5657    20.7211
Generalized Inverted Exponential   0.0996    0.1360    1.0644
* indicates the minimum value.

Bar Diagram: It is used for categorical (Nominal or Ordinal) data


Graphs of Frequency Distributions
The graphical presentation of frequency distributions is drawn for discrete as well
as continuous frequency distributions.
Let us first consider the frequency distribution of a discrete variable.
To represent a discrete frequency distribution graphically, we take the values
of the variable on the X-axis and the corresponding frequencies on the Y-axis.
The different values of the variable are located as points on the horizontal
axis. At each of these points, a perpendicular bar is drawn to represent the
corresponding frequency. Such a diagram is called a ‘Frequency Bar Diagram’. For
example, consider the frequency distribution of the number of peas per pod for
198 pods as given in Table 15.1:
No. of peas per pod            1    2    3    4    5    6    7
Frequency (number of pods)     14   23   66   40   26   18   11

Graphs for continuous frequency distributions:


Important: The graph of a continuous frequency distribution is drawn only for
exclusive classes. If the data are given with inclusive class intervals, we
first convert them to exclusive classes and then construct the graph.

(i) Histogram (equal class interval)


A histogram is drawn by constructing adjacent rectangles over the class intervals
such that the areas of the rectangles are proportional to the corresponding class
frequencies.
The class boundaries are located on the X-axis (horizontal axis) and the
corresponding frequencies on the Y-axis (vertical axis). When all class intervals
have equal width, the height of each rectangle is therefore proportional to the
frequency of that class.
Histogram (unequal class interval)
If the class intervals are not of equal size, then we first calculate the width
of each class and then find the height of its rectangle by the formula
Height of rectangle = (Frequency of the class / Width of the class) × Width of the smallest class   … (15.1)
Let us first draw a histogram for the equal-width frequency distribution given
below in Table 15.2.
Class Intervals 0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80

Frequency 2 3 13 18 9 7 6 2
Histogram for the above data is given below.

Let us now consider a frequency distribution with unequal class intervals:
Class       0-10   10-20   20-30   30-40   40-70   70-80   80-100
Frequency   20     32      8       2       60      35      10
As the class intervals are unequal, we have to adjust the frequencies of the
classes 40-70 and 80-100 by the formula suggested in equation 15.1. These
calculations are shown in Table 15.4 given below:
Class Interval (CI)   Frequency   Width of CI   Height of the rectangle
                                                (adjusted frequency)
0-10                  20          10            20
10-20                 32          10            32
20-30                 8           10            8
30-40                 2           10            2
40-70                 60          30            (60/30) × 10 = 20
70-80                 35          10            35
80-100                10          20            (10/20) × 10 = 5
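The adjustment in Table 15.4 can be sketched in code. This illustration assumes the rule height = frequency × (smallest class width) / (class width), which is what the worked rows (60/30) × 10 = 20 and (10/20) × 10 = 5 use; the function name adjusted_heights is hypothetical.

```python
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 70), (70, 80), (80, 100)]
frequencies = [20, 32, 8, 2, 60, 35, 10]


def adjusted_heights(classes, frequencies):
    """Heights of histogram rectangles when class widths are unequal."""
    widths = [upper - lower for lower, upper in classes]
    base = min(widths)  # width of the smallest class
    return [f * base / w for f, w in zip(frequencies, widths)]


heights = adjusted_heights(classes, frequencies)
print(heights)
```

Classes that already have the smallest width keep their original frequency as height; only the wider classes are scaled down.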

(ii) Frequency Polygon
Another method of presenting a frequency distribution graphically, other than
the histogram, is the frequency polygon. To draw a frequency polygon, first the
mid values of all the class intervals and the corresponding frequencies are
plotted as points with the help of the rectangular co-ordinate axes. Secondly,
we join these plotted points by line segments. The graph thus obtained is known
as a frequency polygon. One important point to keep in mind is that whenever a
frequency polygon is required, we take two imaginary class intervals, each with
frequency zero, one just before the first class interval and the other just
after the last class interval. Addition of these two class intervals ensures
the property that
Area under the polygon = Area of the histogram
For example, if we take the frequency distribution given in Table 15.2 then we
first plot the points (5, 2), (15, 3), …, (75, 2) on graph paper along with
the horizontal bars. Then we join the successive points (including the mid points
of the two imaginary class intervals, each with zero frequency) by line segments
to get a frequency polygon. The frequency polygon for the frequency distribution
given in Table 15.2 is shown in Fig. 15.4.

CI        Mid Point   Frequency   Point
-10-0     -5          0           (-5, 0)
0-10      5           2           (5, 2)
10-20     15          3           (15, 3)
20-30     25          13          (25, 13)
30-40     35          18          (35, 18)
40-50     45          9           (45, 9)
50-60     55          7           (55, 7)
60-70     65          6           (65, 6)
70-80     75          2           (75, 2)
80-90     85          0           (85, 0)
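The construction of the polygon points, and the area property for equal class widths, can be checked with a short sketch. The function names polygon_points and area_under are hypothetical, and the data are those of Table 15.2.

```python
classes = [(0, 10), (10, 20), (20, 30), (30, 40),
           (40, 50), (50, 60), (60, 70), (70, 80)]
frequencies = [2, 3, 13, 18, 9, 7, 6, 2]


def polygon_points(classes, frequencies):
    """Mid points with frequencies, plus two imaginary zero-frequency classes."""
    width = classes[0][1] - classes[0][0]   # assumes equal class widths
    mids = [(lower + upper) / 2 for lower, upper in classes]
    return ([(mids[0] - width, 0)]
            + list(zip(mids, frequencies))
            + [(mids[-1] + width, 0)])


def area_under(points):
    """Trapezoidal area under the line segments joining the points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


points = polygon_points(classes, frequencies)
histogram_area = sum((u - l) * f for (l, u), f in zip(classes, frequencies))
print(points)
print(area_under(points), histogram_area)
```

With equal class widths, the two printed areas agree, which is the reason for adding the two imaginary classes.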


Note 3: In some cases, first class interval does not start from zero. In such
situations we mark a kink on the horizontal axis, which will indicate the
continuity of the scale starting from zero. Let us take an example of this type.

Example 1: Draw a frequency polygon for the following frequency distribution:


Class Interval   30-40   40-50   50-60   60-70   70-80   80-90   90-100   100-110   110-120   120-130
Mid point        35      45      55      65      75      85      95       105       115       125
Frequency        0       4       10      11      13      18      14       11        5         0

(iii) Frequency Curve
In simple words, a frequency curve is a smooth curve obtained by joining the
points (not necessarily all points) of the frequency polygon such that
(a) Like the frequency polygon, it starts from the base line (horizontal axis)
and ends at the base line.
(b) The area under the frequency curve remains approximately equal to the area
under the frequency polygon.
Let us also explain the concept theoretically. Suppose we draw a sample of size
n from a large population. A frequency curve is the graph of a continuous
variable. Theoretically, continuity of the variable implies that however small a
class interval we take, there will be some observations in that class interval.
That is, in this case there will be a large number of line segments, and the
frequency polygon tends to coincide with the smooth curve passing through these
points as the sample size (n) increases. This smooth curve is known as the
frequency curve.
In the following example we have drawn both the frequency polygon and the
frequency curve to make the idea clear.
Example 2: Draw frequency polygon and frequency curve for the following
frequency distribution.

Class Intervals   10-20   20-30   30-40   40-50   50-60   60-70   70-80   80-90
Frequency         2       5       8       15      18      10      3       1

Solution: Frequency polygon and frequency curve for the above data is given
below in Fig. 15.6.
(iv) Cumulative Frequency Curves
For drawing a less than cumulative frequency curve (or less than ogive), the
cumulative frequencies are first plotted against the values (upper limits of the
class intervals) up to which they correspond, and the points are then joined by
line segments. Similarly, a more than cumulative frequency curve (more than
ogive) is obtained by plotting the more than cumulative frequencies against the
lower limits of the class intervals.
In other words, we may define less than ogive and more than ogive as follows:
Less Than Ogive: If we plot the points with the upper limits of the classes as
abscissae and the cumulative frequencies corresponding to the values less than
the upper limits as ordinates, and join the points so plotted by line segments,
the curve thus obtained is known as the “less than cumulative frequency curve”
or “less than ogive”. It is a rising curve.
More Than Ogive: If we plot the points with the lower limits of the classes as
abscissae and the cumulative frequencies corresponding to the values more than
the lower limits as ordinates, and join the points so plotted by line segments,
the curve thus obtained is known as the “more than cumulative frequency curve”
or “more than ogive”. It is a falling curve.
Let us draw both the ogives (‘less than’ and ‘more than’) for the following
frequency distribution of the weekly wages of a number of workers.

Weekly wages     0-10   10-20   20-30   30-40   40-50
No. of workers   45     55      70      40      10

Before drawing the ogives, we make a cumulative frequency distribution as given
in Table 15.6.
Weekly   No. of     Less than cumulative          More than cumulative
wages    workers    frequency distribution        frequency distribution
                    Wages       Number of         Wages       Number of
                    less than   workers           more than   workers
0-10     45         10          45                0           220
10-20    55         20          45+55=100         10          220-45=175
20-30    70         30          70+100=170        20          175-55=120
30-40    40         40          40+170=210        30          120-70=50
40-50    10         50          10+210=220        40          50-40=10
Total    220

Note 4: Median may also be obtained by drawing a dotted vertical line through the
point of intersection of both the ogives, when drawn on a single figure. The
point where this line meets the horizontal axis gives the median.
Example 10: Determine median graphically from the following data:
Marks:             0–5   5–10   10–15   15–20   20–25   25–30   30–35   35–40
No. of Students:   7     10     20      13      17      10      14      9
Here N = 100, so N/2 = 50.
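The graphical reading can be imitated numerically. When both ogives are drawn as line segments over the same class limits, they cross where the less than cumulative frequency equals N/2, so the sketch below interpolates the less than ogive at that height. The function name median_from_ogive is hypothetical and non-negative frequencies are assumed.

```python
def median_from_ogive(first_lower_limit, upper_limits, frequencies):
    """x-value at which the 'less than' ogive reaches N/2 (linear interpolation)."""
    total = sum(frequencies)
    if total <= 0:
        raise ValueError("total frequency must be positive")
    half = total / 2
    x_prev, c_prev = first_lower_limit, 0
    for x, f in zip(upper_limits, frequencies):
        c = c_prev + f
        if c >= half:
            # Interpolate along the ogive segment that crosses the height N/2.
            return x_prev + (half - c_prev) * (x - x_prev) / f
        x_prev, c_prev = x, c


upper_limits = [5, 10, 15, 20, 25, 30, 35, 40]
frequencies = [7, 10, 20, 13, 17, 10, 14, 9]
median = median_from_ogive(0, upper_limits, frequencies)
print(median)
```

For these data the less than cumulative frequencies are 7, 17, 37, 50, …, so the height N/2 = 50 is reached exactly at the upper limit 20.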

PIE DIAGRAMS
A pie diagram/chart is used when we want to know the relationship between the
whole of a thing and its parts, i.e. a pie chart shows how the entire thing is
divided up into different parts. For example, suppose the total monthly
expenditure of a family is Rs 1000, out of which Rs 250 is spent on food, Rs 200
on education, Rs 100 on rent, Rs 150 on transport and Rs 300 on miscellaneous
items. This tells us that 25%, 20%, 10%, 15% and 30% of the total expenditure of
the family are spent on food, education, rent, transport and miscellaneous items
respectively. Here we note that if the money spent on food (say) increases from
25% to 30% then the percentages of the other head(s) must shrink so that the
total remains 100%. Similarly, if the money spent on any one of the heads
decreases then the percentages of the other head(s) must expand so that the total
remains 100%. That is why a pie chart gives the relationship between the whole
and its parts.
Steps used for constructing a pie chart.
Step 1 Find the total of the different parts.
Step 2 Find the sector angle (in degrees) of each part, keeping in mind that the
total angle around the centre of a circle is 360°.
Step 3 Find the percentage of each part, taking the total obtained in Step 1 as
100 percent.
Step 4 Draw a circle and divide it into sectors, where each sector (or area of
the sector) with the corresponding angle obtained in Step 2 represents
the size of the corresponding part. The diagram thus obtained is the pie
chart for the given data.
Example 10: A company is started by four persons A, B, C and D, who distribute
the profit or loss between them in the proportion 4 : 3 : 2 : 1. In the year 2010
the company earned a profit of Rs 14400. Represent their shares of the profit in
a pie chart.
Solution: Given ratio is 4 : 3 : 2 : 1, so
sum of ratios = 4 + 3 + 2 + 1 = 10
Partner   Profit (in Rs)            Sector Angle (in degrees)   Percentage
A         (4/10) × 14400 = 5760     (4/10) × 360 = 144          40
B         (3/10) × 14400 = 4320     (3/10) × 360 = 108          30
C         (2/10) × 14400 = 2880     (2/10) × 360 = 72           20
D         (1/10) × 14400 = 1440     (1/10) × 360 = 36           10
Total     14400                     360                         100

Note:
(i) In drawing the components on the pie diagram it is advisable to follow some
logical arrangement, pattern or sequence, for example according to size,
with the largest on top and the others in sequence running clockwise.
(ii) Pie chart is used only when
(a) total of the parts make a meaningful whole. For example, total of the
expenditures of a family on different items make a meaningful whole,
but if in a city there are 100 doctors, 40 engineers, 50 milkmen, 80
businessmen then total of these do not make a meaningful whole so pie
chart should not be used here.
(b) observations of the different parts are observed at the same time.
We have discussed the method of drawing pie diagram, in this section. Let us
discuss some limitations of the pie diagram.
E9) Represent the following data of utilization of 100 paise of income by
XYZ company in year 2009-10.
Item/Head Money spent (in paise)
Manufacturing Expenses 42
Salaries of employees 14
Selling and distribution Expenses 8
Interest Charges 6
Advertisement Expenses 15
Excise duty of sales 5
Taxation 10
E10) Draw a pie diagram to represent the expenditure of Rs 100 over different
budget heads as given below of a family
Item Expenditure (in Rs.)
Food 25
Clothing 15
Education 20
Transport 10
Outing 10
Miscellaneous 5
Saving 15
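Working for E10:
Item            Expenditure   Sector Angle (in degrees)   Percentage
Food            25            (25/100) × 360 = 90         (25/100) × 100 = 25%
Clothing        15            (15/100) × 360 = 54         15%
Education       20            (20/100) × 360 = 72         20%
Transport       10            (10/100) × 360 = 36         10%
Outing          10            (10/100) × 360 = 36         10%
Miscellaneous   5             (5/100) × 360 = 18          5%
Saving          15            (15/100) × 360 = 54         15%
Total           100           360                         100%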
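Steps 1 to 3 of the pie chart construction can be sketched in code, here with the data of E10. The function name pie_sectors is hypothetical; it returns the sector angle and the percentage for each part and leaves the drawing itself aside.

```python
def pie_sectors(parts):
    """Map each part to (sector angle in degrees, percentage of the whole)."""
    total = sum(parts.values())                       # Step 1
    return {name: (value * 360 / total,               # Step 2
                   value * 100 / total)               # Step 3
            for name, value in parts.items()}


expenditure = {"Food": 25, "Clothing": 15, "Education": 20, "Transport": 10,
               "Outing": 10, "Miscellaneous": 5, "Saving": 15}
sectors = pie_sectors(expenditure)
for item, (angle, percent) in sectors.items():
    print(f"{item:<14} {angle:6.1f} degrees  {percent:5.1f} %")
```

The sector angles should add up to 360 degrees and the percentages to 100, which is a quick check before drawing the chart.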
Measures of Central Tendency
The term average in Statistics refers to a one-figure summary of a distribution.
It gives a value around which the distribution is concentrated. For this reason
the average is also called a measure of central tendency. For example, suppose
Mr. X drives his car at an average speed of 60 km/hr. We get an idea that he
drives fast (on Indian roads of course!). To compare the performance of two
classes, we can compare the average scores in the same test given to these two
classes. Thus, calculation of an average condenses a distribution into a single
value that is supposed to represent the distribution. This helps both in the
individual assessment of a distribution and in its comparison with another
distribution.
I II
0 2 3
1 5 6
2 10 8
3 5 3
4 1 1
The following are the various measures of central tendency:
Arithmetic Mean (Mean or average), Median, Mode, Geometric Mean,
Harmonic Mean

Properties of a Good Average
The following are the properties of a good measure of average:
1. It should be simple to understand
2. It should be easy to calculate
3. It should be clearly/rigidly defined
4. It should be based on all the observations
5. It should be least affected by sampling fluctuations
6. It should be possible to calculate even for open-end class intervals
7. It should not be affected by extremely small or extremely large observations
Open-end class: a class in which one of the limits is not specified, for example
‘below 20000’ or ‘100000 and above’.
Effect of an extreme observation: for 200, 220, 230, 200, 200, 10000,
Mean = (200+220+230+200+200+10000)/6 = 11050/6 = 1841.67
ARITHMETIC MEAN (AM)
Arithmetic mean (also called mean) is defined as the sum of all observations
divided by the number of observations. Arithmetic mean may be calculated for
the following three types of data:

1. For Ungrouped Data (raw data)
Mathematically, if x1, x2,…,xn are the n observations then their mean is
$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$

2. For Discrete Data
If fi is the frequency of xi (i =1, 2,…,k) the formula for arithmetic mean would be
$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$
For example:
Class (x)   frequency (f)   xf
0           2               0
1           3               3
2           4               8
3           2               6
4           1               4
Total       12              21
Mean = (0+0+1+1+1+2+2+2+2+3+3+4)/12 = (0×2 + 1×3 + 2×4 + 3×2 + 4×1)/12 = 21/12 = 1.75

3. For Continuous Data
If fi is the frequency of xi (i=1, 2,…, k) where xi is the mid value of the ith class
interval, the formula for arithmetic mean would be
$\bar{x} = \frac{1}{N}\sum_{i=1}^{k} f_i x_i$, where $N = \sum_{i=1}^{k} f_i$
For example:
CI        f    Mid Point (x)       xf
0-100     2    (0+100)/2 = 50      50 × 2 = 100
100-200   3    (100+200)/2 = 150   150 × 3 = 450
200-300   5    250                 250 × 5 = 1250
300-400   2    350                 350 × 2 = 700
Total     12                       2500
Mean = 2500/12 = 208.33

Problem 1: Calculate mean of the weights of five students
54, 56, 70, 45, 50 (in kg)
Solution: If we denote the weight of students by x then mean is obtained by
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{54 + 56 + 70 + 45 + 50}{5} = \frac{275}{5} = 55$
Thus, average weight of students is 55 kg.
Problem 3: Calculate arithmetic mean for the following data
x   20   30   40
f   5    6    4
Solution:
x       f            fx
20      5            100
30      6            180
40      4            160
Total   ∑f = 15      ∑fx = 440
Mean = ∑fx / ∑f = 440/15 = 29.33

Problem 4: For the following data, calculate arithmetic mean
Class Interval   0-10   10-20   20-30   30-40   40-50
Frequency        3      5       7       9       4
Solution:
Class Interval   Mid Value x   Frequency f    fx
0-10             05            03             15
10-20            15            05             75
20-30            25            07             175
30-40            35            09             315
40-50            45            04             180
Total                          ∑f = N = 28    ∑fx = 760
Mean = ∑fx / N = 760/28 = 27.143
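The grouped-data formula used in Problems 3 and 4 can be sketched as follows. The helper names grouped_mean and midpoints are hypothetical; for continuous data the mid values of the classes stand in for the observations, as in the solution above.

```python
def grouped_mean(values, frequencies):
    """Arithmetic mean sum(f * x) / sum(f) for values x with frequencies f."""
    total = sum(frequencies)
    return sum(f * x for x, f in zip(values, frequencies)) / total


def midpoints(classes):
    """Mid value (lower + upper) / 2 of each class interval."""
    return [(lower + upper) / 2 for lower, upper in classes]


# Problem 3: discrete data.
problem3 = grouped_mean([20, 30, 40], [5, 6, 4])

# Problem 4: continuous data, using class mid values.
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
problem4 = grouped_mean(midpoints(classes), [3, 5, 7, 9, 4])

print(round(problem3, 2), round(problem4, 3))
```

The same function serves both cases because the only difference is how the x values are obtained.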

Merits and Demerits of Arithmetic Mean


Merits of Arithmetic Mean
1. It utilizes all the observations;
2. It is rigidly defined;
3. It is easy to understand and compute;
4. It can be used for further mathematical treatment; and
5. It is least affected by sampling fluctuations.
Demerits of Arithmetic Mean
1. It is badly affected by extremely small or extremely large values. For
example, for 200, 220, 230, 200, 200, 10000,
Mean = (200+220+230+200+200+10000)/6 = 11050/6 = 1841.67;
2. It cannot be calculated for open end class intervals; and
3. It is generally not preferred for highly skewed distributions.

Three algebraic properties of mean:


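1. ∑(x – Mean) = 0, i.e. the sum of deviations of observations from their mean
is zero.
For example, for 10, 20, 30, Mean = (10+20+30)/3 = 20:
x     x – 20   (x – 20)²   x – 9   (x – 9)²
10    –10      100         1       1
20    0        0           11      121
30    10       100         21      441
Sum   0        200                 563
2. Sum of squares of deviations taken from mean is least in comparison to
the same taken from any other value, i.e. ∑(x – Mean)² is minimum. In the
example above it is 200 about the mean 20 but 563 about the value 9.
3. Arithmetic mean is affected by both the change of origin and scale.
If $u = \frac{x - A}{h}$ then $\bar{x} = A + h\bar{u}$, where A and h are constants.
For example, for 200, 220, 230, 200, 200, Mean = (200+220+230+200+200)/5 = 210.
Subtracting A = 200 from each observation gives
200–200, 220–200, 230–200, 200–200, 200–200 = 0, 20, 30, 0, 0
with mean (0+20+30+0+0)/5 = 10, and 200 + 10 = 210.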
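The three properties can be verified numerically on the small data sets used above. This is only a sketch; the helper names mean and sum_sq_dev are hypothetical, and the scale constant h = 10 is an assumption chosen for illustration (the worked example above uses only a change of origin, A = 200).

```python
def mean(values):
    return sum(values) / len(values)


def sum_sq_dev(values, a):
    """Sum of squared deviations of the values from the point a."""
    return sum((x - a) ** 2 for x in values)


data = [10, 20, 30]
m = mean(data)

# Property 1: deviations from the mean add up to zero.
deviation_sum = sum(x - m for x in data)

# Property 2: squared deviations are smaller about the mean than about 9.
ss_about_mean = sum_sq_dev(data, m)
ss_about_nine = sum_sq_dev(data, 9)

# Property 3: with u = (x - A) / h, the mean of x equals A + h * mean(u).
x = [200, 220, 230, 200, 200]
A, h = 200, 10   # h = 10 is an illustrative choice, not from the notes
u = [(xi - A) / h for xi in x]
recovered = A + h * mean(u)

print(deviation_sum, ss_about_mean, ss_about_nine, recovered)
```

Changing the origin and scale makes hand calculation easier, and the original mean is recovered from the mean of the transformed values.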
Combined Mean
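If the arithmetic means and the numbers of observations of two or more related
groups are known, we can calculate the combined mean of these groups. The
combined mean formula for two related groups is as under:
$\bar{x}_{12} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$
Here, $\bar{x}_{12}$ = combined mean of the two groups,
$n_1$ = no. of observations in the first group, $n_2$ = no. of observations in
the second group,
$\bar{x}_1$ = mean of the first group, $\bar{x}_2$ = mean of the second group.
Similarly, the formula can also be extended for k groups as
$\bar{x}_{12 \dots k} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + \dots + n_k \bar{x}_k}{n_1 + n_2 + \dots + n_k}$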

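Example: The mean marks of 60 students in section A is 40 and the mean marks of
40 students in section B is 45. Find the combined mean of the 100 students in
both the sections.
Solution: Here, $n_1 = 60$, $\bar{x}_1 = 40$, $n_2 = 40$, $\bar{x}_2 = 45$.
Using the formula, the combined mean of all the 100 students will be
$\bar{x}_{12} = \frac{60 \times 40 + 40 \times 45}{60 + 40} = \frac{2400 + 1800}{100} = 42$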

Example: The mean wage of 100 workers in a factory running two shifts of 60
and 40 workers is Rs. 38. The mean wage of 60 workers in the morning shift is Rs.
40. Find the mean wage of 40 workers in the evening shift.
Solution: Here, n1 = 60, x̄1 = 40, n2 = 40, x̄12 = 38. We are required to find the
value of x̄2. Using the formula for combined mean, i.e.,
x̄12 = (n1x̄1 + n2x̄2)/(n1 + n2)
We have,
38 = (60 × 40 + 40 × x̄2)/100
Or 3800 – 2400 = 40 x̄2, so x̄2 = 1400/40 = 35
So the mean wage of 40 workers in the evening shift is Rs. 35.


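Example: Find the combined mean from the following data:
Group:   1    2    3
Number:  200  250  300
Mean:    25   20   15
Solution: Here, we are given data related to three groups, which can be symbolically
put as
n1 = 200, n2 = 250, n3 = 300, x̄1 = 25, x̄2 = 20, x̄3 = 15
For combined mean, we put these values in the formula and get
x̄123 = (200 × 25 + 250 × 20 + 300 × 15)/(200 + 250 + 300) = 14500/750 = 19.33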
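The three combined-mean examples can be reproduced with a short Python sketch. The function name combined_mean and the variable names are illustrative assumptions, not part of the notes; the function simply weights each group mean by its group size.

```python
def combined_mean(sizes, means):
    """Weighted average of group means, using group sizes as weights."""
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)


# Sections A and B: 60 students with mean 40, 40 students with mean 45.
two_sections = combined_mean([60, 40], [40, 45])

# Three groups of sizes 200, 250, 300 with means 25, 20, 15.
three_groups = combined_mean([200, 250, 300], [25, 20, 15])

# Reverse problem: combined mean 38 for 100 workers, morning shift of 60
# workers has mean 40, so solve 38 * 100 = 60 * 40 + 40 * evening_mean.
evening_mean = (38 * 100 - 60 * 40) / 40

print(two_sections, round(three_groups, 2), evening_mean)
```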


Correcting the Arithmetic Mean
Remark: For correcting the incorrect value of mean, first we find the corrected
∑x (or ∑fx in case of discrete or continuous series). For this, subtract the wrong
items from the incorrect ∑x and add to it the correct items. Finally, on
dividing the corrected ∑x by the number of observations, we get the corrected
mean.
Example: The average marks of 80 students were found to be 40. Later, it was
discovered that a score of 54 was misread as 84. Find the corrected mean of the
80 students.

Solution: We are given N = 80 and incorrect Mean = 40.
Since Mean = ∑x/N, the incorrect ∑x = 80 × 40 = 3200.
But due to the error discovered, this ∑x is not correct.
The correct ∑x = incorrect ∑x – misread observation + correct observation
= 3200 – 84 + 54 = 3170.
The corrected average = 3170/80 = 39.625

Example: Mean of 100 items is found to be 30. If at the time of calculation, two
items are wrongly taken as 32 and 12 instead of 23 and 11, find the correct mean.
Solution: Given that N = 100 and incorrect Mean = 30, so
incorrect ∑x = 100 × 30 = 3000 (Incorrect total of 100 items)
Corrected ∑x = incorrect ∑x – wrong observations + correct observations
= 3000 – (32 + 12) + (23 + 11) = 2990
Corrected Mean = 2990/100 = 29.9
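The correction procedure (rebuild the total, swap the wrong items for the right ones, divide again) can be sketched in Python as follows. The helper corrected_mean is a hypothetical name introduced only for illustration.

```python
def corrected_mean(n, wrong_mean, wrong_values, right_values):
    """Correct a mean that was computed with some misread observations."""
    total = n * wrong_mean                              # incorrect sum of x
    total = total - sum(wrong_values) + sum(right_values)
    return total / n


# 80 students, mean 40, a score of 54 misread as 84.
marks_mean = corrected_mean(80, 40, [84], [54])

# 100 items, mean 30, items 23 and 11 wrongly taken as 32 and 12.
items_mean = corrected_mean(100, 30, [32, 12], [23, 11])

print(marks_mean, items_mean)
```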
MEDIAN
Median is that value of the variable which divides the whole distribution into two
equal parts. Here it may be noted that the data should be arranged in ascending or
descending order of magnitude.
Median for Ungrouped Data
Mathematically, if x1, x2, …, xn are the n observations then for obtaining the
median first of all we have to arrange these n values either in ascending order or
in descending order. If n is odd, the median is the ((n+1)/2)th observation; if n is
even, the median is the mean of the (n/2)th and (n/2 + 1)th observations.

Problem 5: Find median of following observations
6, 4, 3, 7, 8
Solution: First we arrange the given data in ascending order as
3, 4, 6, 7, 8
Since the number of observations, i.e. 5, is odd, the median is the middle value, that is 6.
Problem 6: Calculate median for the following data:
7, 8, 9, 3, 4, 10
Solution: First we arrange given data in ascending order as 3, 4, 7, 8, 9, 10
Here, number of observations (n) = 6 (even). So we get the median by
Median = mean of the 3rd and 4th observations = (7 + 8)/2 = 7.5

For Ungrouped Data (when frequencies are given)

If x1, x2, …, xn are the different values of the variable with frequencies f1, f2, …, fn, then we calculate
cumulative frequencies from f1, f2, …, fn; then the median is defined by

Median = value of the variable corresponding to the (N/2)th cumulative
frequency, where N = ∑f.
Note: If N/2 is not an exact cumulative frequency then the value of the variable
corresponding to the next cumulative frequency is the median.

Problem 7: Find Median from the given frequency distribution
x   20   40   60   80
f   7    5    4    3
Solution: First we find the cumulative frequencies. Here N = 19, so N/2 = 19/2 = 9.5.
x     f    c.f.
20    7    7
40    5    7+5=12
60    4    12+4=16
80    3    16+3=19

Median = value of the variable corresponding to the (N/2)th cumulative frequency
= value of the variable corresponding to 9.5. Since 9.5 is not among the c.f.,
the next cumulative frequency is 12, and the value of the variable against
cumulative frequency 12 is 40. So the median is 40.
2. Median for Grouped Data
For class intervals, first we find cumulative frequencies from the given frequencies
and use the following formula for calculating the median:

Median = L + ((N/2 – C)/f) × h

where, L = lower limit of the median class,
N = total frequency,
C = cumulative frequency of the pre-median class,
f = frequency of the median class, and
h = width of the median class.
Median class is the class in which the (N/2)th observation falls. If N/2 is not
equal to any cumulative frequency, then the class with the next higher cumulative frequency is taken as the
median class.
E6) Find Median for the following frequency distribution
Marks             0-10   10-20   20-30   30-40   40-50   50-60   60-70
No. of students   5      10      15      20      12      10      8

Solution: First we shall calculate the cumulative frequency distribution

Marks    f          Cumulative Frequency
0-10     5          5
10-20    10         5+10=15
20-30    15         15+15=30 (= C)
30-40    20 (= f)   30+20=50 (median class, L = 30)
40-50    12         50+12=62
50-60    10         62+10=72
60-70    8          72+8=80

Here N = 80, N/2 = 80/2 = 40.
Since 40 is not among the cumulative frequencies, the class corresponding to the
next cumulative frequency 50 is the median class. Thus 30-40 is the median class.

L = 30, f = 20, C = 30, h = 40 – 30 = 10

Median = L + ((N/2 – C)/f) × h = 30 + ((40 – 30)/20) × 10 = 30 + 5 = 35
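As a check on the grouped-median formula Median = L + ((N/2 – C)/f) × h, here is a minimal Python sketch. The function grouped_median and its argument names are our own assumptions, and it assumes all classes have the same width.

```python
def grouped_median(lower_limits, width, freqs):
    """Median of a grouped distribution with equal class widths."""
    half = sum(freqs) / 2          # N/2
    cum = 0                        # cumulative frequency before the class (C)
    for lower, f in zip(lower_limits, freqs):
        if cum + f >= half:        # first class whose c.f. reaches N/2
            return lower + (half - cum) / f * width
        cum += f
    raise ValueError("empty frequency distribution")


lower_limits = [0, 10, 20, 30, 40, 50, 60]
students = [5, 10, 15, 20, 12, 10, 8]
median_marks = grouped_median(lower_limits, 10, students)
print(median_marks)
```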

Merits of Median
1. It is rigidly defined;
2. It is easy to understand and compute;
3. It is not affected by extremely small or extremely large values; and
4. It can be calculated even for open end classes (like “less than 10” or
“50 and above”).
Demerits of Median
1. In case of even number of observations we get only an estimate of the
median by taking the mean of the two middle values. We don’t get its
exact value;
2. It does not utilize all the observations. The median of 1, 2, 3 is 2. If the
observation 3 is replaced by any number higher than or equal to 2 and
if the number 1 is replaced by any number lower than or equal to 2, the
median value will be unaffected. This means 1 and 3 are not being
utilized;
3. It is not amenable to algebraic treatment; and
4. It is affected by sampling fluctuations.

MODE
Mode is that observation in a distribution which has the maximum frequency. For
example, when we say that the average size of shoes sold in a shop is 7 it is the
modal size which is sold most frequently.
For Ungrouped Data
Mathematically, if x1, x2, …, xn are the n observations and if some of the
observations are repeated in the data, say xi is repeated the highest number of times, then
xi is the mode value.
Problem 9: Find mode value for the given data
2, 2, 3, 4, 7, 7, 7, 7, 9, 10, 12, 12
Solution: Since 7 has the maximum frequency (it occurs 4 times), the mode is 7.
For Ungrouped Data (when frequencies are given)
If x1, x2, …, xn are the different values of the variable with frequencies f1, f2, …, fn, then the mode is the value of x
corresponding to the maximum frequency.
X    Frequency
2    2
5    8
9    3
Since 5 has the maximum frequency, the mode is 5.
For Grouped Data:
For data where several classes are given, the following formula of the mode is used

Mode = L + ((f1 – f0)/(2f1 – f0 – f2)) × h

where L = lower limit of the modal class,
f1 = frequency of the modal class,
f0 = frequency of the pre-modal class,
f2 = frequency of the post-modal class, and
h = width of the modal class.
Modal class is that class which has the maximum frequency.
Q7: Find Mode for the following frequency distribution
Marks             0-10   10-20   20-30   30-40   40-50   50-60   60-70
No. of students   5      10      15      20      12      10      8

CI       f
0-10     5
10-20    10
20-30    15 (= f0)
30-40    20 (= f1) (modal class, L = 30)
40-50    12 (= f2)
50-60    10
60-70    8

L = 30, f1 = 20, f0 = 15, f2 = 12, h = 40 – 30 = 10
Mode = L + ((f1 – f0)/(2f1 – f0 – f2)) × h = 30 + (20 – 15) × 10/(2 × 20 – 15 – 12)
= 30 + 50/13 = 30 + 3.85 = 33.85
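The grouped-mode formula Mode = L + ((f1 – f0)/(2f1 – f0 – f2)) × h can be sketched in Python as below. The function grouped_mode is a hypothetical helper; treating a missing neighbouring class as frequency 0 is our own assumption, and the sketch does not handle the case 2f1 – f0 – f2 = 0.

```python
def grouped_mode(lower_limits, width, freqs):
    """Mode of a grouped distribution with equal class widths."""
    i = max(range(len(freqs)), key=freqs.__getitem__)   # modal class index
    f1 = freqs[i]
    f0 = freqs[i - 1] if i > 0 else 0                   # pre-modal frequency
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0      # post-modal frequency
    return lower_limits[i] + (f1 - f0) / (2 * f1 - f0 - f2) * width


lower_limits = [0, 10, 20, 30, 40, 50, 60]
students = [5, 10, 15, 20, 12, 10, 8]
mode_marks = grouped_mode(lower_limits, 10, students)
print(round(mode_marks, 2))
```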
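Relationship between Mean, Median and Mode
Mode = 3 Median – 2 Mean
Note: This is an empirical relationship that holds approximately for moderately
skewed distributions. Using this formula, we can calculate any one of mean, median and mode if the other two of
them are known.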

Merits of Mode
1. Mode is the easiest average to understand and also easy to calculate;
2. It is not affected by extreme values;
3. It can be calculated for open end classes;
4. As long as the modal class, the pre-modal class and the post-modal
class are of equal width, mode can be calculated even if the other classes are of unequal width.
Demerits of Mode
1. It is not rigidly defined. A distribution can have more than one mode;
2. It does not utilize all the observations;
3. It is not amenable to algebraic treatment; and
4. It is greatly affected by sampling fluctuations.
Example: An incomplete distribution is given below:
Variable    10-20   20-30   30-40   40-50   50-60   60-70   70-80   Total
Frequency   12      30      ?       65      ?       25      18      229
You are told that the median value is 46. Find the missing frequencies.
Solution: Let the missing frequencies be x and y.
CI       f          CF
10-20    12         12
20-30    30         42
30-40    x          42+x (= C)
40-50    65 (= f)   107+x (median class, L = 40)
50-60    y          107+x+y
60-70    25         132+x+y
70-80    18         150+x+y = 229

N/2 = 229/2 = 114.5. Since the median 46 lies in the class 40-50,
L = 40, f = 65, C = 42+x, h = 10
Median = L + (N/2 – C) × h/f
46 = 40 + (114.5 – 42 – x) × 10/65
46 – 40 = 6 = (72.5 – x) × 10/65
6 × 65/10 = 72.5 – x
39 = 72.5 – x
x = 33.5 ≈ 34
x + y = 229 – 150 = 79, so y = 79 – 34 = 45

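Harmonic mean (H.M.) of a set of observations is the reciprocal of the arithmetic
mean of the reciprocals of the observations. Thus,

H.M. = n/(1/x1 + 1/x2 + … + 1/xn) = n/∑(1/x)

Example: for 2, 3, 4, Mean = (2+3+4)/3 = 3

The reciprocals are 1/2, 1/3, 1/4

H.M. = 3/(1/2 + 1/3 + 1/4) = 3/1.0833 = 2.77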

Weighted Harmonic Mean

In some cases it becomes necessary to calculate weighted mean. Suppose an
automobile covers different distances with different speeds; the average speed can
be obtained by using weighted harmonic mean. The formula for calculating
weighted harmonic mean is

H.M. = ∑w/∑(w/x) = (w1 + w2 + … + wn)/(w1/x1 + w2/x2 + … + wn/xn)

Here, w1, w2, …, wn stand for the respective weights of x1, x2, …, xn.

Example: A cyclist covers his first five km at an average speed of 10 km p.h.,
another three km at 8 km p.h. and the last two at 5 km p.h. Find the average
speed of the entire journey and verify your answer.

Solution: Since the cyclist covers different distances with different speeds, the
weighted harmonic mean will be appropriate for computing average speed. Using
the weighted harmonic mean formula with

w1 = 5, x1 = 10, w2 = 3, x2 = 8, w3 = 2, x3 = 5

Speed (X)   Distance (W)   W/X
10          5              0.500
8           3              0.375
5           2              0.400
Total       10             1.275

H.M. = ∑W/∑(W/X) = 10/1.275 = 7.84
Thus, the average speed for the entire journey is 7.84 km p.h.
Verification: the total time taken is 0.500 + 0.375 + 0.400 = 1.275 hours for 10 km,
so average speed = total distance/total time = 10/1.275 = 7.84 km p.h.

Example: A train goes at a speed of 20 miles per hour for the first 16 miles and at a
speed of 40 m.p.h. for the next 20 miles. It covers the last 10 miles at a speed of 15 m.p.h.
Find out its average speed.
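Both average-speed examples can be worked with the weighted harmonic mean ∑w/∑(w/x), taking distances as weights and speeds as values. The sketch below is illustrative only; the function name weighted_harmonic_mean is our own.

```python
def weighted_harmonic_mean(weights, values):
    """Sum of weights divided by the sum of weight/value."""
    return sum(weights) / sum(w / x for w, x in zip(weights, values))


# Cyclist: 5 km at 10 km p.h., 3 km at 8 km p.h., 2 km at 5 km p.h.
cyclist_speed = weighted_harmonic_mean([5, 3, 2], [10, 8, 5])

# Train: 16 miles at 20 m.p.h., 20 miles at 40 m.p.h., 10 miles at 15 m.p.h.
train_speed = weighted_harmonic_mean([16, 20, 10], [20, 40, 15])

print(round(cyclist_speed, 2), round(train_speed, 2))
```

Each w/x term is the time spent on one leg, so the result is total distance divided by total time.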
Example: In the given frequency distribution two frequencies are missing and its
mean is found to be 1.46.
Number of Accidents (X):   0    1   2   3    4    5   Total
Frequency (f):             46   ?   ?   25   10   5   200
Find the missing frequencies.

Mean = (x1*f1 + x2*f2 + x3*f3 + …)/(f1 + f2 + f3 + …)

Solution: Let the missing frequencies be f1 and f2.

X       f      fX
0       46     0
1       f1     f1
2       f2     2f2
3       25     75
4       10     40
5       5      25
Total   200    f1 + 2f2 + 140

Then 46 + f1 + f2 + 25 + 10 + 5 = 200
Or f1 + f2 = 114 … (i)
Also, since Mean = ∑fX/∑f = (f1 + 2f2 + 140)/200 = 1.46 (Given)
Or f1 + 2f2 = 292 – 140 = 152 … (ii)
Solving (i) and (ii), we get f2 = 38 and f1 = 76.

Q Find Mean , Median and Mode of the following distribution


CI 5-15 15-25 25-35 35-45 45-55
F 5 10 12 8 5
Measures of Dispersion

Different measures of central tendency give a value around which the data is
concentrated. But it gives no idea about the nature of scatter or spread. For
example, the observations 10, 30 and 50 have mean 30 while the observations 28,
30, 32 also have mean 30. Both the distributions are spread around 30. But it is
observed that the variability among units is more in the first than in the second. In
other words, there is greater variability or dispersion in the first set of
observations. Measure of dispersion is calculated to get an idea about the
variability in the data. Following are the different measures of dispersion:
1. Range
2. Standard Deviation
Range
Range is defined as the difference between the maximum value of the variable
and the minimum value of the variable in the distribution.
Variance (σ²) and Standard Deviation (σ)
Variance is the average of the squares of deviations of the values taken from the
mean. Standard deviation (SD) is defined as the positive square root of variance.
The formula is

Var(x) = σ² = (1/n) ∑(x – x̄)², SD = σ = √[(1/n) ∑(x – x̄)²]

and for a frequency distribution, the formula is

σ² = (1/N) ∑f(x – x̄)², where N = ∑f
or

When mean is not integer (second method)

σ² = ∑fx²/N – (∑fx/N)²
 CI X f fX fX2
0-10 5 10 50 250
10-20 15 15 225 3375
20-30 25 25 625 15625
30-40 35 25 875 30625
40-50 45 10 450 20250
50-60 55 10 550 30250
60-70 65 5 325 21125
Total   100 3100 121500
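The totals in the table above can be turned into a mean and SD with the shortcut form σ² = ∑fx²/N – (∑fx/N)². The following Python sketch does this; the variable names are our own, and the population form (division by N) is assumed, as in these notes.

```python
from math import sqrt

# Mid values and frequencies copied from the table above.
mid_values = [5, 15, 25, 35, 45, 55, 65]
frequencies = [10, 15, 25, 25, 10, 10, 5]

n = sum(frequencies)                                             # N
sum_fx = sum(f * x for x, f in zip(mid_values, frequencies))     # sum of fX
sum_fx2 = sum(f * x * x for x, f in zip(mid_values, frequencies))  # sum of fX^2

mean = sum_fx / n
variance = sum_fx2 / n - mean ** 2
sd = sqrt(variance)

print(mean, variance, round(sd, 2))
```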

Question: Find mean and SD

Class Interval   Mid Value x   Frequency f   fx     fx²
0-10             5             3             15     75
10-20            15            5             75     1125
20-30            25            7             175    4375
30-40            35            9             315    11025
40-50            45            6             270    12150
Total                          N = 30        850    28750

Mean = ∑fx/N = 850/30 = 28.33
σ² = ∑fx²/N – (∑fx/N)² = 28750/30 – (850/30)² = 958.33 – 802.78 = 155.56
SD = √155.56 = 12.47

E8) Calculate standard deviation for the following data:

Class 0-10 10-20 20-30 30-40 40-50


Frequency 5 8 15 16 6

Coefficient of Variation (CV)

It is a relative measure of variability. If we are comparing two data series, the
data series having the smaller CV will be more consistent (reliable). It is defined as
CV = (SD/Mean) × 100
Problem 10: Suppose batsman A has mean 50 with SD 10. Batsman B has mean
30 with SD 3. What do you infer about their performance?

Solution: A has higher mean than B. This means A is a better run maker.
However, B has lower CV (3/30 × 100 = 10%) than A (10/50 × 100 = 20%) and is consequently
more consistent.
Example: The sum of squares of 100 observations was calculated as 7961. Later,
it was found that two values, 53 and 42, were wrongly read as 35 and 24 at the
time of calculation. Find the corrected sum of squares.
Solution: Given the incorrect ∑x² = 7961,
Corrected ∑x² = incorrect ∑x² – (squares of wrong observations) +
(squares of correct observations)
= 7961 – (35² + 24²) + (53² + 42²)
= 7961 – (1225 + 576) + (2809 + 1764) = 10733.
Question1: The sum of square of 20 observations was worked out as 5100. But
while calculating it, an observation 31 was wrongly considered as 13. Find the
corrected sum of squares.
Question2: The sum of squares of 50 observations is 4122. An observation 39 was
wrongly included in the series. Find the sum of squares of the remaining 49
observations.
Question3: The arithmetic mean and the S.D. of a series of 20 items were
calculated as 20 cm and 5 cm respectively. But while calculating them, an item 13
was measured as 30. Find the correct arithmetic mean and standard deviation.
Question4:The mean and S.D. of 20 items are found to be 10 and 2 respectively.
At the time of checking, it was observed that one item 8 was incorrect. Find the
mean and the S.D. if (i) the wrong item is omitted (ii) it is replaced by 12.
Properties of Standard Deviation
1. The value of S.D. of a series remains unchanged if each variate value is
increased or decreased by the same constant value. In other words, we can say
that the S.D is independent of change in origin.
Symbolically,
Let Y = X + b, where b is a constant.
Then σY = σX, i.e., the S.D.’s of the variables X and Y will be equal.
Example: Suppose 5, 8, 17, 12 and 7 are five observations on a variable X. A
new variable Y is obtained by adding 2 (a constant) to each observation on X.
Further, let Z be another variable defined by subtracting 3 from each value on X.
Find the standard deviations of the variables X, Y and Z, say σX, σY and σZ
respectively. (Ans: 4.26, 4.26, 4.26)
2. If the values of variable X are multiplied (or divided) by a constant, the S.D.
of the new observations can be obtained by multiplying (or dividing) the initial
S.D. by the same constant. Symbolically,
If Y = kX, where k is a constant,
Then σY = |k| σX
In other words, we can say that S.D. is affected by change in scale.
Example: Suppose 2, 6, 9, 5, 4 are five observations on a variable X. A new
variable Y is obtained by multiplying each observation on X by 3 (a constant).
Further, another variable Z is obtained by dividing each observation on X by 2.
Then we find the S.D.’s of the variables X, Y and Z, say σX, σY and σZ respectively.
(Ans: 2.32, 6.95, 1.16)
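The two properties can be checked numerically on the data of the two examples. This is a minimal sketch; population_sd is a name we introduce here, and it divides by n as the notes do.

```python
from math import sqrt


def population_sd(values):
    """Square root of the mean squared deviation from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((v - mean) ** 2 for v in values) / n)


# Change of origin: adding or subtracting a constant leaves the SD unchanged.
x1 = [5, 8, 17, 12, 7]
sd_x1 = population_sd(x1)
sd_plus_2 = population_sd([v + 2 for v in x1])
sd_minus_3 = population_sd([v - 3 for v in x1])

# Change of scale: multiplying by k multiplies the SD by |k|.
x2 = [2, 6, 9, 5, 4]
sd_x2 = population_sd(x2)
sd_times_3 = population_sd([3 * v for v in x2])
sd_half = population_sd([v / 2 for v in x2])

print(round(sd_x1, 2), round(sd_x2, 2), round(sd_times_3, 2), round(sd_half, 2))
```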

Question: Below are given the number of runs scored by two batsmen in eight
matches:
Batsman A 27 16 39 45 101 80 40 52
Batsman B 0 100 80 5 60 40 10 121
Who is the better run scorer? Also find which of the two batsmen is more consistent
in scoring. (Ans: Mean = 50, 52, so Batsman B is the better
run scorer; CV = 52.13%, 82.53%, so Batsman A is more consistent)
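The comparison of the two batsmen can be sketched in Python using CV = (SD/Mean) × 100. The helper mean_and_cv is a hypothetical name, and the population SD (division by n) is assumed.

```python
from math import sqrt


def mean_and_cv(scores):
    """Return (mean, coefficient of variation in percent)."""
    n = len(scores)
    mean = sum(scores) / n
    sd = sqrt(sum((s - mean) ** 2 for s in scores) / n)
    return mean, sd / mean * 100


batsman_a = [27, 16, 39, 45, 101, 80, 40, 52]
batsman_b = [0, 100, 80, 5, 60, 40, 10, 121]

mean_a, cv_a = mean_and_cv(batsman_a)
mean_b, cv_b = mean_and_cv(batsman_b)

# The higher mean marks the better scorer, the lower CV the more consistent one.
print(mean_a, round(cv_a, 2), mean_b, round(cv_b, 2))
```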
CORRELATION
When two variables are related in such a way that change in the value of one
variable affects the value of another variable, then variables are said to be
correlated or there is correlation between these two variables.
In many practical applications, we might come across the situation where
observations are available on two or more variables. The following examples will
illustrate the situations clearly
Heights and weights of persons of a certain group; Sales revenue and advertising
expenditure in business; and Time spent on study and marks obtained by students
in exam.
If data are available for two variables, say x and y, it is called bivariate
distribution.
Let us consider the example of sales revenue and expenditure on advertising in
business. A natural question is whether there is any connection between
sales revenue and expenditure on advertising: does sales revenue increase or
decrease as expenditure on advertising increases or decreases?
Similarly, in the example of time spent on study and marks obtained by students, a
natural question is whether marks increase or decrease as time spent on
study increases or decreases.
Types of Correlation
(a) Positive Correlation
Correlation between two variables is said to be positive if the values of the
variables deviate in the same direction i.e. if the values of one variable increase
(or decrease) then the values of other variable also increase (or decrease). Some
examples of positive correlation are correlation between
Heights and weights of a group of persons; Household income and expenditure;
Amount of rainfall and yield of crops; and Expenditure on advertising and sales
revenue.
In the last example, it is observed that as the expenditure on advertising increases,
sales revenue also increases. Thus the change is in the same direction. Hence the
correlation is positive.
In remaining three examples, usually value of the second variable increases (or
decreases) as the value of the first variable increases (or decreases).
(b) Negative Correlation
Correlation between two variables is said to be negative if the values of variables
deviate in opposite direction i.e. if the values of one variable increase (or
decrease) then the values of other variable decrease (or increase). Some examples
of negative correlations are correlation between
Volume and pressure of perfect gas; Price and demand of goods; Literacy and
poverty in a country; and Time spent on mobile and marks obtained by students in
examination.
In the first example pressure decreases as the volume increases or pressure
increases as the volume decreases. Thus the change is in opposite direction.
Therefore, the correlation between volume and pressure is negative.
In remaining three examples also, values of the second variable change in the
opposite direction of the change in the values of first variable.

Scatter Diagram
A scatter diagram is a statistical tool for judging whether a correlation may exist
between a dependent variable and an independent variable. A scatter diagram does not
give the exact relationship between two variables, but it indicates whether they
are correlated or not.
Let (xi, yi), i = 1, 2, …, n be the bivariate distribution. If the values of the
dependent variable y are plotted against the corresponding values of the
independent variable x in the xy-plane, such a diagram of dots is called a scatter
diagram or dot diagram. It is to be noted that a scatter diagram is not suitable for
a large number of observations.

Coefficient of Correlation
Coefficient of correlation measures the intensity or degree of linear relationship
between two variables.
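If X and Y are two random variables then the correlation coefficient between X and Y
is denoted by r and defined as

r = Cov(X, Y)/(σX σY) = ∑(x – x̄)(y – ȳ)/√[∑(x – x̄)² × ∑(y – ȳ)²]

When mean is not integer, the following equivalent form is convenient:

r = [n∑xy – (∑x)(∑y)]/√{[n∑x² – (∑x)²] × [n∑y² – (∑y)²]}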

Properties of Correlation Coefficient


1. Correlation coefficient lies between -1 and +1.
2. Correlation coefficient is independent of change of origin and scale.
Description: Correlation coefficient is independent of change of origin and scale,
which means that if a quantity is subtracted from the original variables and the result is divided by another quantity
(greater than zero), i.e. U = (X – a)/h and V = (Y – b)/k with h > 0 and k > 0, then the
correlation coefficient between the new variables U and V is the same as the correlation
coefficient between X and Y, i.e. r(U, V) = r(X, Y).
3. If X and Y are two independent variables then the correlation coefficient
between X and Y is zero, i.e. r(X, Y) = 0.
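Problem 1: Find the correlation coefficient between advertisement expenditure
and profit from the following data: (Ans: 0.27)
Advertisement expenditure   30   44   45   43   34   44
Profit                      56   55   60   64   62   63
Solution: Here x̄ = 240/6 = 40 and ȳ = 360/6 = 60.

X       Y       x – x̄        y – ȳ       (x – x̄)²   (y – ȳ)²   (x – x̄)(y – ȳ)
30      56      30-40=-10    56-60=-4    100        16         40
44      55      4            -5          16         25         -20
45      60      5            0           25         0          0
43      64      3            4           9          16         12
34      62      -6           2           36         4          -12
44      63      4            3           16         9          12
Total   240     360          0           0          202        70         32

Here X = advertisement expenditure and Y = profit; the Total row gives
∑X = 240, ∑Y = 360, ∑(x – x̄)² = 202, ∑(y – ȳ)² = 70 and ∑(x – x̄)(y – ȳ) = 32.
r = ∑(x – x̄)(y – ȳ)/√[∑(x – x̄)² × ∑(y – ȳ)²] = 32/√(202 × 70) = 32/118.91 = 0.27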
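The deviation-from-mean form of the correlation coefficient, r = ∑(x – x̄)(y – ȳ)/√[∑(x – x̄)² ∑(y – ȳ)²], can be sketched in Python as follows. The function name pearson_r is our own choice for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation coefficient from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


advertisement = [30, 44, 45, 43, 34, 44]
profit = [56, 55, 60, 64, 62, 63]
r_adv_profit = pearson_r(advertisement, profit)

print(round(r_adv_profit, 2))
```

The same function can be applied to any other pair of equal-length series in these notes.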

Question: From the following data calculate Karl Pearson’s coefficient of


correlation

Height of Father 66 68 69 72 65 59 62 67 61 71
Height of Son 65 64 67 69 64 60 59 68 60 64

Ans = 0.829
Question: The coefficient of correlation between two variates X and Y is 0.8 and
their covariance is 20. If the variance of X is 16, find the SD of the Y series. (Ans: 6.25)
Question: From the data given below find the number of items

Where X and Y are deviations from the arithmetic mean. (ANS: 10)

Concept of Rank Correlation


When the characters are not measurable then we use rank correlation. This type of
situation occurs when we deal with a qualitative study such as honesty, beauty,
voice, etc. We denote the rank correlation coefficient by rs,

rs = 1 – 6∑d²/[n(n² – 1)]

where d = difference between the two ranks of an item and n = number of paired observations.
The value of rs also lies between -1 and 1.

This formula was given by Spearman and hence it is known as Spearman’s rank
correlation coefficient formula.
Problem 1: Suppose we have ranks of 8 students of B.Sc. in Statistics and
Mathematics. On the basis of rank we would like to know that to what extent the
knowledge of the student in statistics and mathematics is related.
Rank in Statistics 1 2 3 4 5 6 7 8
Rank in Mathematics 2 4 1 5 3 8 7 6
Solution:
Rank in Statistics   Rank in Mathematics   Difference of ranks (d)   d²
1                    2                     1-2=−1                    1
2                    4                     2-4=−2                    4
3                    1                     3-1=2                     4
4                    5                     4-5=−1                    1
5                    3                     5-3=2                     4
6                    8                     −2                        4
7                    7                     0                         0
8                    6                     2                         4
Total                                                                ∑d² = 22

Here, n = number of paired observations = 8
rs = 1 – 6∑d²/[n(n² – 1)] = 1 – (6 × 22)/(8 × 63) = 1 – 132/504 = 0.74
Thus there is a positive association between ranks of Statistics and Mathematics.

Question: The marks obtained by 9 students in Mathematics and


accountancy are as follows:
Question: Calculate rank correlation coefficient from the following marks given
out of 200 by two judges X and Y in a music competition to 8 participants:

Participant No.       1     2     3     4     5    6     7     8
Marks awarded by X    74    98    110   70    65   85    88    59
Marks awarded by Y    121   133   170   102   90   152   160   85

X      Y      Rank of X   Rank of Y   Difference of ranks (d)   d²
74     121    5           5           0                         0
98     133    2           4           -2                        4
110    170    1           1           0                         0
70     102    6           6           0                         0
65     90     7           7           0                         0
85     152    4           3           1                         1
88     160    3           2           1                         1
59     85     8           8           0                         0
Total                                                           ∑d² = 6

rs = 1 – 6∑d²/[n(n² – 1)] = 1 – (6 × 6)/(8 × 63) = 1 – 36/504 = 0.93
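Spearman's formula rs = 1 – 6∑d²/[n(n² – 1)] can be sketched in Python for both rank problems. The helper names are our own, and ranks_descending assumes there are no tied marks (ties would need averaged ranks, which this sketch does not do).

```python
def ranks_descending(values):
    """Rank 1 for the largest value; assumes all values are distinct."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman_from_ranks(rank_x, rank_y):
    """Spearman's rank correlation from two lists of ranks."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


# Ranks of 8 students in Statistics and Mathematics.
rs_students = spearman_from_ranks([1, 2, 3, 4, 5, 6, 7, 8],
                                  [2, 4, 1, 5, 3, 8, 7, 6])

# Marks awarded by judges X and Y, converted to ranks first.
marks_x = [74, 98, 110, 70, 65, 85, 88, 59]
marks_y = [121, 133, 170, 102, 90, 152, 160, 85]
rs_judges = spearman_from_ranks(ranks_descending(marks_x),
                                ranks_descending(marks_y))

print(round(rs_students, 2), round(rs_judges, 2))
```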

CONCEPT OF LINEAR REGRESSION (Important)


Correlation coefficient measures the strength of the linear relationship and the direction
of the correlation, whether it is positive or negative.
Regression analysis is the process of constructing a mathematical model or
function that can be used to predict or determine one variable by another
variable.
The variable to be predicted is called the dependent variable and it is denoted by
Y. The predictor is called the independent variable, or explanatory variable, and is
denoted by X. In simple regression analysis, only a straight-line relationship
between two variables is examined. Nonlinear relationships and regression models
with more than one independent variable can be explored by using multiple
regression models. Regression analysis is a statistical technique which is used to
investigate the relationship between variables. For example, we might be
interested in estimating the production of a crop for a particular amount of rainfall,
in predicting demand on the basis of price, or in predicting marks on the basis of study
hours of students.

Definition: Regression analysis is a mathematical measure of the average
relationship between two or more variables.
There are two types of variables in regression analysis:
(a) Independent variable
(b) Dependent variable
The variable which is used for prediction is called independent variable. It is
also known as regressor or predictor or explanatory variable.
The variable whose value is predicted by the independent variable is called
dependent variable. It is also known as regressed or explained variable.

Lines of Regression
The regression line of Y on X is used when X is the independent variable and Y is the dependent
variable.
Let the equation of the line of regression of Y on X be

Y = a + bX

where a is known as the intercept,
b is known as the slope of the regression line,

b = [n∑XY – (∑X)(∑Y)]/[n∑X² – (∑X)²]
and
a = Ȳ – bX̄

Problem 1: Height of fathers and sons in inches are given below:

Height of father 65 66 67 67 68
Height of son 66 68 65 69 74
Find line of regression and calculate the estimated average height of son when the
height of father is 68.5 inches.
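Solution: Let the height of father (X) be the independent variable and height of son (Y)
be the dependent variable. So let the regression line be Y = a + bX.

X       Y       XY       X²
65      66      4290     4225
66      68      4488     4356
67      65      4355     4489
67      69      4623     4489
68      74      5032     4624
Total   333     342      22788    22183

The Total row gives ∑X = 333, ∑Y = 342, ∑XY = 22788 and ∑X² = 22183.
Mean of X = 333/5 = 66.6
Mean of Y = 342/5 = 68.4
b = [n∑XY – (∑X)(∑Y)]/[n∑X² – (∑X)²] = (5 × 22788 – 333 × 342)/(5 × 22183 – 333²) = 54/26 = 2.0769
a = Ȳ – bX̄ = 68.4 – 2.0769 × 66.6 = –69.92

So the regression line is Y = –69.92 + 2.0769X

When X = 68.5, the estimated height of son is Y = –69.92 + 2.0769 × 68.5 = 72.35 inches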
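The line of regression of Y on X can be fitted with a short Python sketch that uses the usual least-squares expressions b = [n∑XY – (∑X)(∑Y)]/[n∑X² – (∑X)²] and a = Ȳ – bX̄. The function fit_line and the variable names are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Return (a, b) of the least-squares line Y = a + bX."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b


father = [65, 66, 67, 67, 68]
son = [66, 68, 65, 69, 74]

a, b = fit_line(father, son)
estimated_son = a + b * 68.5   # estimated height of son for a 68.5 inch father

print(round(a, 2), round(b, 4), round(estimated_son, 2))
```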


E) Using the regression equation Y = 90 + 50X, fill up the values in the table
below.
Sample No. (i)    12     21     15     1      24
x                 0.96   1.28   1.65   1.84   2.35
y                 138    160    178    190    210
ŷ = 90 + 50x      138
Error (y – ŷ)     0

Solution

Sample No.   x      y      ŷ = 90 + 50x           Error = y – ŷ
12           0.96   138    90+50*0.96=138         138-138=0
21           1.28   160    90+50*1.28=154         160-154=6
15           1.65   178    90+50*1.65=172.5       178-172.5=5.5
1            1.84   190    90+50*1.84=182         190-182=8
24           2.35   210    90+50*2.35=207.5       210-207.5=2.5

E) A hosiery mill wants to estimate how its monthly costs are related to its
monthly output rate. For that the firm collects a data regarding its costs and
output for a sample of nine months as given in Table below:

Output (tons) Production cost


(thousands of dollars)
1 2
2 3
4 4
8 7
6 6
5 5
8 8
9 8
7 6
1) Construct a scatter diagram for the data given above.
2) Calculate the best linear regression line, where the monthly output is the
dependent variable and the production cost is the independent variable.
3) Use this regression line to predict the firm’s monthly costs if they decide
to produce 4 tons per month.
Solution
(i) Scatter Diagram

(ii) Let production cost (X) be the independent variable and output (Y) be the
dependent variable. So let the regression line be Y = a + bX.


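Y (Output)   X (Production cost)   XY     X²
1            2                     2      4
2            3                     6      9
4            4                     16     16
8            7                     56     49
6            6                     36     36
5            5                     25     25
8            8                     64     64
9            8                     72     64
7            6                     42     36
Total 50     49                    319    303

Mean of X = 49/9 = 5.44
Mean of Y = 50/9 = 5.56
b = [n∑XY – (∑X)(∑Y)]/[n∑X² – (∑X)²] = (9 × 319 – 49 × 50)/(9 × 303 – 49²) = 421/326 = 1.291
a = Ȳ – bX̄ = 50/9 – 1.291 × 49/9 = –1.475
So the regression line is Y = –1.475 + 1.291X

(iii) Putting X = 4 in this line gives
Y = –1.475 + 1.291 × 4 = 3.69
Here X is the production cost, so this is the estimated output (in tons) for a production cost of 4 thousand dollars.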

Problem 2: Regression line of y on x and x on y respectively are


(1)
(2)
Then, find
(i) the mean values of x and y,
(ii) coefficient of correlation between x and y, and
(iii) the standard deviation of y for given variance of x = 5.
Solution:
(i) Since the regression lines of y on x and x on y both pass through the point (x̄, ȳ),
the point (x̄, ȳ) is their point of intersection. Thus to get the mean values of variables x
and y, we solve the given simultaneous equations
(3)
(4)
Multiplying equation (4) by 3 then

By solving these equations as simultaneous equations we get mean of x =2 and


mean of y = 4 .

(ii) To find the correlation coefficient, we assume equation (1) as y on x and


equation (2) as x on y. therefore we can write the equations as

Therefore

Similarly
Therefore,

By the property of regression coefficients

Thus, correlation coefficient r = 0.37

(iii) By the definition of the regression coefficient of y on x, i.e.

byx = r σy/σx = 0.67

Variance of x, i.e., σx² = 5
Now,
σy = byx σx/r,
Thus, the variance of y is 16.45.

E1) Marks of 6 students of a class in paper I and paper II of statistics are given
below:

Paper I 45 55 66 75 85 100
Paper II 56 55 45 65 62 71
Find
(i) both regression coefficients,
(ii) both regression lines, and
(iii) correlation coefficient.

E2) We have data on variables x and y as


x 5 4 3 2 1
y 9 8 10 11 12
Calculate
(i) both regression coefficients,
(ii) correlation coefficient,
(iii) regression lines of y on x and x on y, and
(iv) estimate y for x =4.5.

E3) If two regression lines are

,
Then, calculate
(i) correlation coefficient, and
(ii) mean values of x and y.
E) You are given the following information about advertising expenditure
and sales:
         Adv. Exp. (X) (Rs. Lakhs)   Sales (Y) (Rs. Lakhs)
Mean     10                          90
S.D.     3                           12
Correlation coefficient = 0.8
(i) Obtain the two regression equations.
(ii) Find the likely sales when advertisement budget is Rs. 15 lakhs.
(iii) What should be the advertisement budget if the company wants to attain
sales target of Rs. 120 lakhs?
E) In a partially destroyed laboratory record of an analysis of correlation data, only the
following results are legible:
Variance of X = 9
Regression equation 8X – 10Y + 66 = 0
40X – 18Y = 214
Find on the basis of the above information:
(i) The mean values of X and Y,
(ii) Coefficient of correlation between X and Y, and
(iii) Standard deviation of Y.
Forecasting and Time Series
Forecasting
A forecast is a prediction of future conditions based on an analysis of data
recorded over a period of time.
Time Series
A time series is a set of observations taken at specified times, usually at equal
intervals. In other words, the data on any characteristic collected with
respect to time is called a time series. Examples of time series are the
population of a country recorded at the ten-yearly censuses, the annual
production of a crop, say wheat, over a number of years, and so on. In fact, data
related to business and economic activities, in general, recorded over time, give
rise to a time series.
Components of Time Series
The characteristic movements of a time series may be classified into four different
categories called components of time series. In a long time series, generally, we
have the following four components:
1. Trend or Long term movements.
2. Seasonal Component
3. Cyclic Component
4. Irregular or Random Component


1. Trend
The general tendency of values of the data to increase or decrease during a long
period of time is called “trend”. Some time series show an upward trend while
some show a downward trend. For example, upward trends are seen in data of
population, number of passengers in Metro, etc., while data of deaths show a downward
trend.
A time series may show a linear or non-linear trend. If the time series data are
plotted on a graph paper and the points lie more or less around a
straight line, then the tendency shown by the data is called a linear trend. But if the points
do not lie more or less around a straight line, the tendency shown by the data is called
non-linear.
2. Seasonal Component (Variations)
The variations in the values of the data which occur in a regular and periodic
manner within one year are called seasonal variations. Seasonal variations may
be quarterly, monthly, weekly, daily, etc., depending on the type of data available.
For example, the sale of ice cream increases in the summer season and the sale of raincoats
increases in the rainy season. The amplitudes of seasonal variation are different for
different periods. There are two types of seasonal variations: those due to natural
forces and those due to man-made conventions.
Seasonal variations due to natural forces:
Variations in a time series that arise due to changes in seasons, weather
conditions and climate are known as seasonal variations due to natural
forces. For example, sales of umbrellas and raincoats increase very fast in the rainy
season, and the demand for ACs goes up in summer.
Seasonal variations due to man-made conventions:
Variations in a time series that arise due to changes in fashions, habits, tastes and customs
of people in any society are called seasonal variations due to man-made
conventions. For example, in our country sales of gold and clothes go up in the marriage
season and during festivals.
3. Cyclic Component (Variations)
The oscillatory variations in the values of time series data with a period of
oscillation of more than one year are called cyclic variations or the cyclic
component of a time series. One oscillation period is called one cycle. Unlike
seasonal variations, the length (or duration) of a cycle in a cyclic variation
is not fixed.
Cyclic variations generally occur in commercial and economic time
series, in which the length of a cycle could vary from 2 to 10 years. So cyclic
variations are also called the “business cycle”.

4. Random or Irregular Movements (Variations)


The variations in a time series which do not repeat in a definite pattern are called
irregular variations or the irregular component of a time series. We cannot predict their
time of occurrence, direction and magnitude. These variations usually occur due
to earthquakes, floods, wars, accidents, etc.
Moving Average Method
Example: Calculate the 3-yearly moving average for the given time series.

Year    Output    3-yearly Moving Average
1976    17        -
1977    22        (17+22+18)/3 = 19
1978    18        (22+18+26)/3 = 22
1979    26        (18+26+16)/3 = 20
1980    16        (26+16+27)/3 = 23
1981    27        -
Four-yearly moving average

Year    Output    4-yearly Moving Average       Centered 4-yearly Moving Average
1976    17        -                             -
1977    22                                      -
                  (17+22+18+26)/4 = 20.75
1978    18                                      (20.75+20.5)/2 = 20.625
                  (22+18+26+16)/4 = 20.5
1979    26                                      (20.5+21.75)/2 = 21.125
                  (18+26+16+27)/4 = 21.75
1980    16                                      -
1981    27        -                             -
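
The moving-average calculations above can be checked with a short Python sketch. The function names `moving_average` and `centered_moving_average` are illustrative choices, not part of the original notes; the data are the six output values for 1976 to 1981.

```python
def moving_average(values, k):
    """Return the simple k-period moving averages of a sequence."""
    return [sum(values[i:i + k]) / k for i in range(len(values) - k + 1)]


def centered_moving_average(values, k):
    """Centre an even-period moving average by averaging adjacent pairs."""
    ma = moving_average(values, k)
    return [(ma[i] + ma[i + 1]) / 2 for i in range(len(ma) - 1)]


output = [17, 22, 18, 26, 16, 27]  # outputs for 1976..1981

three_yearly = moving_average(output, 3)            # aligned with 1977..1980
four_yearly = moving_average(output, 4)             # falls between years
four_centered = centered_moving_average(output, 4)  # aligned with 1978, 1979

print(three_yearly)
print(four_yearly)
print(four_centered)
```

An odd period gives averages that sit directly against a year, while an even period needs the extra centering step, which is why the four-yearly table has two columns.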

E) Use three-yearly and four-yearly moving averages to determine the trend.

Year:                     1991  1992  1993  1994  1995  1996  1997  1998  1999  2000
Production ('000 tons):   21    22    23    25    24    22    25    26    27    26

Also plot the data and the moving average trend.
Exponential Smoothing Method

Example: Find the exponentially smoothed values, using weight 0.2, for the
following data.

Year    Production (in million)    Exponential smoothing
2010    5                          5
2011    6                          0.2*6+(1-0.2)*5 = 1.2+4 = 5.2
2012    8                          0.2*8+(1-0.2)*5.2 = 1.6+4.16 = 5.76
2013    10                         0.2*10+(1-0.2)*5.76 = 2+4.608 = 6.608
2014    10                         0.2*10+(1-0.2)*6.608 = 2+5.2864 = 7.2864
2015    11
2016    12
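
The recursion used in the table (new smoothed value = weight × current observation + (1 − weight) × previous smoothed value, starting from the first observation) can be sketched in Python as follows. The function name `exponential_smoothing` is an illustrative choice; the weight 0.2 and the data are those of the example.

```python
def exponential_smoothing(values, w):
    """Smooth a series: s[0] = y[0], s[t] = w*y[t] + (1-w)*s[t-1]."""
    smoothed = [values[0]]
    for y in values[1:]:
        smoothed.append(w * y + (1 - w) * smoothed[-1])
    return smoothed


production = [5, 6, 8, 10, 10, 11, 12]  # 2010..2016, in million
smoothed = exponential_smoothing(production, 0.2)

for year, s in zip(range(2010, 2017), smoothed):
    print(year, round(s, 4))
```

Running the loop to the end also gives the smoothed values for 2015 and 2016, which can be used to complete the table.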

Method of Least Squares

This is one of the most widely used methods for obtaining the trend values.
It gives a convenient basis for calculating the line of best fit for the
time series. It is a mathematical method for measuring trend.
The trend line is given by
Y = a + bt
We find the values of the constants a and b from the time series data
in the same way as in regression analysis. Generally time is given in
years, so we take
X = t - central time
and the trend line becomes
Y = a + bX (1)
Then we find a and b as

$a = \bar{Y} - b\bar{X}$

and

$b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2}$

When X is measured from the central time so that $\sum X = 0$, these reduce to
$a = \sum Y / n$ and $b = \sum XY / \sum X^2$.
After that we put the values of a and b in equation (1).

Example: Fit a straight-line trend by the method of least squares and
tabulate the trend values.

Year(t):                    2000  2001  2002  2003  2004  2005  2006
Production (in million) Y:  40    45    46    42    47    50    46

Solution:
[Figure: scatter diagram of the data]

The straight-line trend equation is given by
Y = a + bt
We take X = t - 2003.
Then the required equation of the straight-line trend is given by
Y = a + bX
We find the values of a and b using the formulae

$a = \bar{Y} - b\bar{X}$

and

$b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2}$

Year(t)    Production (in million) Y    X = t-2003       X^2    XY
2000       40                           2000-2003 = -3   9      -120
2001       45                           2001-2003 = -2   4      -90
2002       46                           -1               1      -46
2003       42                           0                0      0
2004       47                           1                1      47
2005       50                           2                4      100
2006       46                           3                9      138
Total      316                          0                28     29

Mean of X = 0/7 = 0
Mean of Y = 316/7 = 45.14
Since the sum of X is 0, b = 29/28 = 1.036 and a = 45.14 - 1.036*0 = 45.14.

Therefore, the equation of the straight-line trend is given by
Y = 45.14 + 1.036*(t-2003)
Therefore the trend values are

Year(t)    Production (in million) Y    Trend values
2000       40                           45.14+1.036(2000-2003) = 45.14+1.036*(-3) = 42.032
2001       45                           45.14+1.036(2001-2003) = 45.14+1.036*(-2) = 43.068
2002       46                           45.14+1.036(2002-2003) = 45.14+1.036*(-1) = 44.104
2003       42                           45.14+1.036(2003-2003) = 45.14
2004       47                           45.14+1.036(2004-2003) = 45.14+1.036*(1) = 46.176
2005       50                           45.14+1.036(2005-2003) = 45.14+1.036*(2) = 47.212
2006       46                           45.14+1.036(2006-2003) = 45.14+1.036*(3) = 48.248
Total      316

The trend value estimated for the year 2007 is
45.14+1.036(2007-2003) = 45.14+1.036*(4) = 49.284
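
As a cross-check of the hand calculation, the least-squares coefficients can be computed directly in Python. This is a minimal sketch: the function name `fit_linear_trend` is hypothetical, and it uses the general regression formulas for a and b, which reduce to mean of Y and (sum of XY)/(sum of X^2) when the shifted time values sum to zero.

```python
def fit_linear_trend(years, values, origin):
    """Fit Y = a + b*X by least squares, where X = t - origin."""
    xs = [t - origin for t in years]
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(values)
    sum_xy = sum(x * y for x, y in zip(xs, values))
    sum_xx = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


years = list(range(2000, 2007))
production = [40, 45, 46, 42, 47, 50, 46]  # in million

a, b = fit_linear_trend(years, production, origin=2003)
trend = [a + b * (t - 2003) for t in years]

print(round(a, 2), round(b, 3))
print([round(v, 3) for v in trend])
```

Because the code keeps a and b unrounded, its trend values differ from the table in the third decimal place, where the table uses the rounded coefficients 45.14 and 1.036.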
E) The following are the annual profits (in thousands of rupees) of a business:

Year:                   1993  1994  1995  1996  1997  1998
Profits (in '000 Rs.):  60    72    75    65    80    85

a) Use the method of least squares to fit a straight line to the above data.
b) Plot the above figures and draw the line.
c) Also make an estimate of the profits for the year 2000.
Forecasting Models
The Additive Model
One of the most widely used models is the additive forecasting model. In this
model it is assumed that at any time t, the time series is the sum of all the
components. Symbolically, the model is

$Y_t = T_t + C_t + S_t + I_t$

where
$Y_t$ = the value of the time series at time t
$T_t$ = the long-term trend at time t
$C_t$ = the cyclic variation at time t
$S_t$ = the seasonal variation at time t
$I_t$ = the irregular or random variation at time t
In the additive model, it is assumed that the effect of the cyclic component
remains the same for all cycles and that the effect of any seasonal variation
remains the same during any year (or corresponding period). Similarly, it is
assumed that the irregular component has the same effect throughout.
The Multiplicative Model
In the additive model, we have assumed that the time series is the sum of the
trend, cyclical, seasonal and random components. From practical experience,
scientists have found that additive models are appropriate when the seasonal
variations remain unchanged, that is, the seasonal variations do not depend on the
trend of the time series.
However, in practice, there are a number of situations where the seasonal
variations change over time. When the seasonal variations exhibit an increasing or
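decreasing trend, we can try the multiplicative model. In the multiplicative
model it is assumed that the time series is obtained as a product of the four time
series components, that is,

$Y_t = T_t \times C_t \times S_t \times I_t$

where $Y_t$ is the value of the time series at time t and $T_t$, $C_t$, $S_t$ and $I_t$ are
the trend, cyclic, seasonal and irregular components at time t.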
Multiplicative models are found to be appropriate for many economic time series
data such as data related to production of electricity, number of passengers going
abroad, consumption of cold drinks, etc.
INTRODUCTION TO PROBABILITY
Random Experiment
An experiment in which all the possible outcomes are known in advance, but we
cannot predict which of them will occur when we perform the experiment, is called
a random experiment, e.g. tossing a coin is a random experiment, as the possible
outcomes head and tail are known in advance but which one will turn up is not known.
Similarly, ‘throwing a die’ and ‘drawing a card from a well-shuffled pack of 52
playing cards’ are examples of random experiments.
Trial
Performing an experiment is called a trial, e.g. tossing a coin is a trial, throwing a
die is a trial.
Sample Space
Set of all possible outcomes of a random experiment is known as sample space
and is usually denoted by S, and the total number of elements in the sample space
is known as size of the sample space and is denoted by n(S), e.g.
(i) If we toss a coin then the sample space is S = {H, T}, where H and T denote
head and tail respectively and n(S) = 2.
(ii) If a die is thrown, then the sample space is S = {1, 2, 3, 4, 5, 6} and n(S) = 6.

(iii) If a coin and a die are thrown simultaneously, then the sample space is
S = {H1, H2, H3, H4, H5, H6, T1, T2, T3, T4, T5, T6} and n(S) = 12.
(iv) If a coin is tossed twice or two coins are tossed simultaneously then the
sample space is
S = {HH, HT, TH, TT}, Here, n(S) = 4.
(v) If a die is thrown twice or a pair of dice is thrown simultaneously, then the
sample space is
S = {(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6),
(2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6),
(3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6),
(4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6),
(5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6),
(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}. Here, n(S) = 36.
(viii) If a family contains two children, then the sample space is
S = {B1B2, B1G2, G1B2, G1G2},
where Bi denotes that the i-th birth is of a boy and Gi denotes that the i-th
birth is of a girl, i = 1, 2.
This sample space can also be written as S = {BB, BG, GB, GG}.
(ix) If a bag contains 2 red and 4 black balls and
(a) one ball is drawn from the bag, then the sample space is
S = {R1, R2, B1, B2, B3, B4},
where R1, R2 denote the two red balls and B1, B2, B3, B4 denote the four black
balls in the bag.

Remark 1: If a random experiment with x possible outcomes is performed n
times, then the total number of elements in the sample space is $x^n$, i.e. n(S) = $x^n$,
e.g. if a coin is tossed twice, then n(S) = $2^2$ = 4; if a die is thrown thrice, then
n(S) = $6^3$ = 216.
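
The sample spaces listed above, and the counting rule in Remark 1, can be reproduced by enumeration. The sketch below uses `itertools.product` from the Python standard library; the variable names are illustrative.

```python
from itertools import product

coin = ["H", "T"]
die = [1, 2, 3, 4, 5, 6]

# A coin tossed twice: every ordered pair of coin outcomes.
two_coins = ["".join(p) for p in product(coin, repeat=2)]

# A coin and a die thrown together: outcomes such as "H1" or "T6".
coin_and_die = [f"{c}{d}" for c, d in product(coin, die)]

# A die thrown twice and thrice: ordered tuples of faces.
two_dice = list(product(die, repeat=2))
three_dice = list(product(die, repeat=3))

print(two_coins)
print(len(coin_and_die), len(two_dice), len(three_dice))
```

Repeating an experiment with x outcomes n times corresponds to `repeat=n`, so the length of the result is x raised to the power n.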
Sample Point
Each outcome of an experiment is visualised as a sample point in the sample
space. e.g.
(i) If a coin is tossed then getting head or tail is a sample point.
(ii) If a die is thrown twice, then getting (1, 1) or (1, 2) or (1, 3) or…or (6, 6) is
a sample point.
Event
Set of one or more possible outcomes of an experiment constitutes what is known
as event. Thus, an event can be defined as a subset of the sample space, e.g.
i) In a die throwing experiment, the event of getting a number less than 5 is the set
{1, 2, 3, 4}, which refers to the combination of 4 outcomes and is a subset of the
sample space S = {1, 2, 3, 4, 5, 6}.
ii) If a card is drawn from a well-shuffled pack of playing cards, then the event
of getting a card of the spade suit is
$\{1_S, 2_S, 3_S, 4_S, 5_S, 6_S, 7_S, 8_S, 9_S, 10_S, J_S, Q_S, K_S\}$,
where the suffix S under each character in the set denotes that the card is a
spade and J, Q and K represent jack, queen and king respectively.

Exhaustive Cases
The total number of possible outcomes in a random experiment is called the
exhaustive cases. In other words, the number of elements in the sample space is
known as number of exhaustive cases, e.g.
(i) If we toss a coin, then the number of exhaustive cases is 2 and the sample
space in this case is {H, T}.
(ii) If we throw a die then number of exhaustive cases is 6 and the sample space
in this case is {1, 2, 3, 4, 5, 6}
Favourable Cases
The cases which favour the happening of an event are called favourable cases,
e.g.
(i) For the event of drawing a card of spade from a pack of 52 cards, the
number of favourable cases is 13.
(ii) For the event of getting an even number in throwing a die, the number of
favourable cases is 3 and the event in this case is {2, 4, 6}.
Mutually Exclusive Cases
Cases are said to be mutually exclusive if the happening of any one of them
prevents the happening of all others in a single experiment, e.g.
(i) In a coin tossing experiment head and tail are mutually exclusive as there
cannot be simultaneous occurrence of head and tail.
Equally Likely Cases
Cases are said to be equally likely if we do not have any reason to expect one in
preference to the others. If there is some reason to expect one in preference to
the others, then the cases will not be equally likely. For example,
(i) Head and tail are equally likely in an experiment of tossing an unbiased
coin. This is because if someone is expecting say head, he/she does not have
any reason as to why he/she is expecting it.
(ii) All the six faces in an experiment of throwing an unbiased die are equally
likely.
You will become more familiar with the concept of “equally likely cases” from
the following examples, where the non-equally likely cases have been taken into
consideration:
(i) Cases of “passing” and “not passing” a candidate in a test are not equally
likely. This is because a candidate has some reason(s) to expect “passing” or
“not passing” the test. If he/she prepares well for the test, he/she will pass the
test and if he/she does not prepare for the test, he/she will not pass. So, here
the cases are not equally likely.
(ii) Cases of “a ceiling fan falling” and “not falling” are not equally likely. This
is because we can give some reason(s) for its not falling if the bolts and other
parts are in good condition.
CLASSICAL OR MATHEMATICAL DEFINITION OF
PROBABILITY
Let there be ‘n’ exhaustive cases in a random experiment which are mutually
exclusive as well as equally likely. Let ‘m’ out of them be favourable for the
happening of an event A (say), then the probability of happening of event A
(denoted by P(A)) is defined as

$P(A) = \frac{\text{Number of favourable cases for event } A}{\text{Number of exhaustive cases}} = \frac{m}{n}$ … (1)

The probability of non-happening of the event A is denoted by $P(\bar{A})$ and is defined as

$P(\bar{A}) = \frac{\text{Number of favourable cases for event } \bar{A}}{\text{Number of exhaustive cases}} = \frac{n-m}{n} = 1 - \frac{m}{n} = 1 - P(A)$

So, $P(A) + P(\bar{A}) = 1$.

Therefore, we conclude that the sum of the probabilities of happening of an event
and that of its complementary event is 1.
Also, since $0 \le m \le n$, we have $0 \le P(A) \le 1$.
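
The classical definition translates directly into a counting function. The sketch below is illustrative: the helper `classical_probability` is a hypothetical name, and it assumes that the outcomes in the sample space are equally likely, as the definition requires. `Fraction` keeps the answers exact.

```python
from fractions import Fraction


def classical_probability(event, sample_space):
    """P(A) = favourable cases / exhaustive cases, for equally likely outcomes."""
    sample_space = set(sample_space)
    favourable = len(set(event) & sample_space)
    return Fraction(favourable, len(sample_space))


S = {1, 2, 3, 4, 5, 6}   # throwing a die
A = {2, 4, 6}            # getting an even number

p_A = classical_probability(A, S)
p_not_A = classical_probability(S - A, S)

print(p_A, p_not_A, p_A + p_not_A)
```

The complement is obtained as the set difference `S - A`, so the two probabilities always add up to 1.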
Example 3: A bag contains 4 red, 5 black and 2 green balls. One ball is drawn
from the bag. Find the probability that
(i) it is a red ball
(ii) it is not black
(iii) it is green or black
Solution: Let R1, R2, R3, R4 denote the 4 red balls in the bag. Similarly,
B1, B2, B3, B4, B5 denote the 5 black balls and G1, G2 denote the two green balls
in the bag. Then the sample space for drawing a ball is given by
S = {R1, R2, R3, R4, B1, B2, B3, B4, B5, G1, G2}
(i) Let A be the event of getting a red ball, then A = {R1, R2, R3, R4}
∴ P(A) = Number of favourable cases / Number of exhaustive cases = 4/11
(ii) Let B be the event that the drawn ball is not black, then
B = {R1, R2, R3, R4, G1, G2}
∴ P(B) = Number of favourable cases / Number of exhaustive cases = 6/11
(iii) Let C be the event that the drawn ball is green or black, then
C = {B1, B2, B3, B4, B5, G1, G2}
∴ P(C) = Number of favourable cases / Number of exhaustive cases = 7/11

Example 4: Three unbiased coins are tossed simultaneously. Find the probability
of getting
(i) at least two heads
(ii) at most two heads
(iii) all heads
(iv) exactly one head
(v) exactly one tail
Solution: The sample space in this case is
S = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}
(i) Let E1 be the event of getting at least 2 heads, then
E1 = {HHT, HTH, THH, HHH}
∴ P(E1) = Number of favourable cases / Number of exhaustive cases = 4/8 = 1/2
(ii) Let E2 be the event of getting at most 2 heads, then
E2 = {TTT, TTH, THT, HTT, HHT, HTH, THH}
∴ P(E2) = Number of favourable cases / Number of exhaustive cases = 7/8
(iii) Let E3 be the event of getting all heads, then
E3 = {HHH}
∴ P(E3) = Number of favourable cases / Number of exhaustive cases = 1/8
(iv) Let E4 be the event of getting exactly one head, then
E4 = {HTT, THT, TTH}
∴ P(E4) = Number of favourable cases / Number of exhaustive cases = 3/8
(v) Let E5 be the event of getting exactly one tail, then
E5 = {HHT, HTH, THH}
∴ P(E5) = Number of favourable cases / Number of exhaustive cases = 3/8

Example 5: A fair die is thrown. Find the probability of getting
(i) a prime number
(ii) an even number
(iii) a number multiple of 2 or 3
(iv) a number multiple of 2 and 3
(v) a number greater than 4
Solution: The sample space in this case is
S = {1, 2, 3, 4, 5, 6}
(i) Let E1 be the event of getting a prime number, then
E1 = {2, 3, 5}
∴ P(E1) = Number of favourable cases / Number of exhaustive cases = 3/6 = 1/2
(ii) Let E2 be the event of getting an even number, then
E2 = {2, 4, 6}
∴ P(E2) = Number of favourable cases / Number of exhaustive cases = 3/6 = 1/2
(iii) Let E3 be the event of getting a multiple of 2 or 3, then
E3 = {2, 3, 4, 6}
∴ P(E3) = Number of favourable cases / Number of exhaustive cases = 4/6 = 2/3
(iv) Let E4 be the event of getting a number multiple of 2 and 3, then
E4 = {6}
∴ P(E4) = Number of favourable cases / Number of exhaustive cases = 1/6
(v) Let E5 be the event of getting a number greater than 4, then
E5 = {5, 6}
∴ P(E5) = Number of favourable cases / Number of exhaustive cases = 2/6 = 1/3

Example 6: In an experiment of throwing two fair dice, find the probability of
getting
(i) a doublet
(ii) sum 7
(iii) sum greater than 8
(iv) 3 on the first die and a multiple of 2 on the second die
(v) a prime number on the first die and an odd prime on the second die.
Solution: The sample space has already been given in item (v) of the Sample Space
section. Here, the sample space contains 36 elements, i.e. the number of exhaustive
cases is 36.
(i) Let E1 be the event of getting a doublet (i.e. the same number on both dice), then
E1 = {(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)}
∴ P(E1) = Number of favourable cases / Number of exhaustive cases = 6/36 = 1/6
(ii) Let E2 be the event of getting sum 7, then
E2 = {(1, 6), (6, 1), (2, 5), (5, 2), (3, 4), (4, 3)}
∴ P(E2) = Number of favourable cases / Number of exhaustive cases = 6/36 = 1/6
(iii) Let E3 be the event of getting a sum greater than 8, then
E3 = {(3, 6), (6, 3), (4, 5), (5, 4), (4, 6), (6, 4), (5, 5), (5, 6), (6, 5), (6, 6)}
∴ P(E3) = Number of favourable cases / Number of exhaustive cases = 10/36 = 5/18
(iv) Let E4 be the event of getting 3 on the first die and a multiple of 2 on the second die, then
E4 = {(3, 2), (3, 4), (3, 6)}
∴ P(E4) = Number of favourable cases / Number of exhaustive cases = 3/36 = 1/12
(v) Let E5 be the event of getting a prime number on the first die and an odd prime on
the second die, then
E5 = {(2, 3), (2, 5), (3, 3), (3, 5), (5, 3), (5, 5)}
∴ P(E5) = Number of favourable cases / Number of exhaustive cases = 6/36 = 1/6
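
The five answers for the two-dice experiment can be verified by listing all 36 outcomes and counting the favourable ones. This is a sketch only; the helper `prob` and the variable names are illustrative choices.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes (first die, second die).
S = list(product(range(1, 7), repeat=2))


def prob(condition):
    """Count the outcomes satisfying the condition and divide by n(S)."""
    favourable = [outcome for outcome in S if condition(*outcome)]
    return Fraction(len(favourable), len(S))


primes = {2, 3, 5}
odd_primes = {3, 5}

p_doublet = prob(lambda a, b: a == b)
p_sum_7 = prob(lambda a, b: a + b == 7)
p_sum_gt_8 = prob(lambda a, b: a + b > 8)
p_three_then_even = prob(lambda a, b: a == 3 and b % 2 == 0)
p_prime_then_odd_prime = prob(lambda a, b: a in primes and b in odd_primes)

print(p_doublet, p_sum_7, p_sum_gt_8, p_three_then_even, p_prime_then_odd_prime)
```

Writing each event as a condition on the pair (a, b) avoids listing the favourable cases by hand, which is where counting slips usually occur.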

Example 7: Out of 52 well shuffled playing cards, one card is drawn at random.
Find the probability of getting
(i) a red card
(ii) a face card
(iii) a card of spade
(iv) a card other than club
(v) a king

Solution: Here, the number of exhaustive cases is 52 and a pack of playing cards
contains 13 cards of each suit (spade, club, diamond, heart).
(i) Let A be the event of getting a red card. We know that there are 26 red
cards,
∴ P(A) = Number of favourable cases / Number of exhaustive cases = 26/52 = 1/2
(ii) Let B be the event of getting a face card. We know that there are 12 face
cards (jack, queen and king in each suit),
∴ P(B) = Number of favourable cases / Number of exhaustive cases = 12/52 = 3/13
(iii) Let C be the event of getting a card of spade.
We know that there are 13 cards of spade,
∴ P(C) = Number of favourable cases / Number of exhaustive cases = 13/52 = 1/4
(iv) Let D be the event of getting a card other than club.
As there are 39 cards other than those of club,
∴ P(D) = Number of favourable cases / Number of exhaustive cases = 39/52 = 3/4
(v) Let E be the event of getting a king.
We know that there are 4 kings,
∴ P(E) = Number of favourable cases / Number of exhaustive cases = 4/52 = 1/13

Example 8: In a family, there are two children. Write the sample space and find
the probability that
(i) the elder child is a girl
(ii) younger child is a girl
(iii) both are girls
(iv) both are of opposite sex
Solution: Let Gi denote that the i-th birth is of a girl (i = 1, 2) and Bi denote that the i-th birth is of a boy (i = 1, 2).
∴ S = {G1G2, G1B2, B1G2, B1B2}
(i) Let A be the event that the elder child is a girl
∴ A = {G1G2, G1B2}
and P(A) = Number of favourable cases / Number of exhaustive cases = 2/4 = 1/2
(ii) Let B be the event that the younger child is a girl
∴ B = {G1G2, B1G2}
and P(B) = Number of favourable cases / Number of exhaustive cases = 2/4 = 1/2
(iii) Let C be the event that both the children are girls
∴ C = {G1G2}
and P(C) = Number of favourable cases / Number of exhaustive cases = 1/4
(iv) Let D be the event that both children are of opposite sex
∴ D = {G1B2, B1G2}
and P(D) = Number of favourable cases / Number of exhaustive cases = 2/4 = 1/2
Example 5: If two dice are thrown, what is the probability that the sum is
a) greater than 9, and b) neither 10 nor 12?
Solution: Total number of cases = 36:
(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6),
(2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6),
(3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6),
(4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6),
(5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6),
(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)
a) P[sum > 9] = P[sum = 10 or sum = 11 or sum = 12] = P[sum = 10] + P[sum = 11] + P[sum = 12]
= 3/36 + 2/36 + 1/36 = 6/36 = 1/6
[For sum = 10, there are three favourable cases (4, 6), (5, 5) and (6, 4). Similarly, for sum = 11 and 12, there
are two and one favourable cases respectively.]
b) P[sum is neither 10 nor 12] = 1 - P[sum = 10] - P[sum = 12] = 1 - 3/36 - 1/36 = 32/36 = 8/9

E5) Fourteen balls are serially numbered and placed in a bag. Find the probability that a ball drawn from the bag
bears a number that is a multiple of 3 or 5.
Example 14: In a lottery, one has to choose six numbers at random out of the numbers from 1 to 30. He/she
will get the prize only if all the six chosen numbers match the six numbers already decided by the
lottery committee. Find the probability of winning the prize.
Solution: Out of 30 numbers, 6 can be drawn in $\binom{30}{6} = \frac{30!}{6!\,24!} = 593775$ ways.
∴ Number of exhaustive cases = 593775
Out of these 593775 ways, there is only one way to win the prize (i.e. to choose those six numbers that are already
fixed by the committee).
Here, the number of favourable cases is 1.
Hence, P(winning the prize) = 1/593775.
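
The count of exhaustive cases in the lottery example is a number of combinations, which Python's standard library computes with `math.comb`. The variable names below are illustrative.

```python
from fractions import Fraction
from math import comb

# Ways of choosing 6 numbers out of 30 when order does not matter.
exhaustive_cases = comb(30, 6)

# Only one selection matches the committee's six numbers.
p_win = Fraction(1, exhaustive_cases)

print(exhaustive_cases, p_win, float(p_win))
```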
Concept of odds in favour of and against the happening of an event
Let n be the number of exhaustive cases in a random experiment which are mutually exclusive and equally
likely as well. Let m out of these n cases be favourable to the happening of an event A (say). Thus, the number
of cases against A is n - m.
Then odds in favour of event A are m : (n - m) (i.e. the ratio m/(n - m)) and odds against A are (n - m) : m (i.e.
the ratio (n - m)/m).
Example 15: If odds in favour of event A are a : b, what is the probability of happening of A?
Solution: As odds in favour of A are a : b, there are a cases favourable to A out of a + b cases in all, so the
probability of happening of A, i.e. P(A) = a/(a + b).

Example 16: Find the probability of event A if
(i) odds in favour of event A are 4 : 3 (ii) odds against event A are 5 : 8
Solution: (i) We know that if odds in favour of A are a : b, then P(A) = a/(a + b).
Here, a = 4 and b = 3, therefore P(A) = 4/(4 + 3) = 4/7.
(ii) We know that if odds against the happening of an event A are a : b, then P(A) = b/(a + b).
Here, a = 5 and b = 8, therefore P(A) = 8/(5 + 8) = 8/13.

Example 17: If P(A) = 3/5, then find
(i) odds in favour of A; (ii) odds against the happening of event A.
Solution: (i) As P(A) = 3/5, odds in favour of A in this case are 3 : (5 - 3) = 3 : 2.
(ii) We know that if P(A) = m/n, then odds against the happening of A are (n - m) : m.
In this case odds against the happening of event A are (5 - 3) : 3 = 2 : 3.
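
Conversions between odds and probabilities are easy to get backwards, so a small sketch may help. The three function names below are hypothetical helpers, not standard library functions; each simply encodes the ratio definitions given above.

```python
from fractions import Fraction


def prob_from_odds_in_favour(a, b):
    """Odds in favour a : b mean a favourable cases out of a + b."""
    return Fraction(a, a + b)


def prob_from_odds_against(a, b):
    """Odds against a : b mean b favourable cases out of a + b."""
    return Fraction(b, a + b)


def odds_in_favour(p):
    """Return (favourable, unfavourable) for a probability p = m/n."""
    p = Fraction(p)
    return p.numerator, p.denominator - p.numerator


p1 = prob_from_odds_in_favour(4, 3)   # odds in favour 4 : 3
p2 = prob_from_odds_against(5, 8)     # odds against 5 : 8
favour = odds_in_favour(Fraction(3, 5))
against = favour[::-1]                # odds against are the reversed pair

print(p1, p2, favour, against)
```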

Relative Frequency Approach And Statistical Probability

The classical definition of probability fails if
i) the possible outcomes of the random experiment are not equally likely or/and
ii) the number of exhaustive cases is infinite.
In such cases, we obtain the probability by observing the data. This approach to probability is called the relative
frequency approach and it defines the statistical probability.
Statistical (or Empirical) Probability
If an event A (say) happens m times in n trials of an experiment which is performed repeatedly under
essentially homogeneous and identical conditions (e.g. if we perform an experiment of tossing a coin in a room,
then it must be performed in the same room and all other conditions for tossing the coin should also be identical
and homogeneous in all the tosses), then the probability of happening of A is defined as:

$P(A) = \lim_{n \to \infty} \frac{m}{n}$.

As an illustration, suppose we tossed a coin 200 times and observed 50 heads. Then the probability of head is
estimated by the proportion of heads, i.e. 50/200 = 1/4.
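
The relative frequency idea can be illustrated with a simulation. The sketch below is an assumption-laden illustration rather than a real experiment: it uses Python's pseudo-random generator with a fixed seed to imitate tossing a fair coin, and the function name `relative_frequency` is hypothetical.

```python
import random


def relative_frequency(event, experiment, trials, seed=0):
    """Estimate P(event) as (number of occurrences) / (number of trials)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(trials) if event(experiment(rng)))
    return hits / trials


def toss(rng):
    """Simulated toss of a fair coin."""
    return rng.choice("HT")


estimate = relative_frequency(lambda outcome: outcome == "H", toss, trials=10_000)
print(estimate)
```

As the number of trials grows, the estimate settles near the underlying probability, which is the sense in which the statistical probability is defined through repeated trials.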

AXIOMATIC APPROACH TO PROBABILITY


Let S be a sample space for a random experiment and A be an event which is a subset of S, then P(A) is called
a probability function if it satisfies the following axioms:
(i) P(A) is real and P(A) ≥ 0
(ii) P(S) = 1
(iii) If A1, A2, … is any finite or infinite sequence of disjoint (mutually exclusive) events in S, then
P(A1 ∪ A2 ∪ …) = P(A1) + P(A2) + …
Now, let us give some results using probability function. But before taking up these results, we discuss some
statements with their meanings in terms of set theory. If A and B are two events, then in terms of set theory, we
write
i) ‘At least one of the events A or B occurs’ as $A \cup B$
ii) ‘Both the events A and B occur’ as $A \cap B$
iii) ‘Neither A nor B occurs’ as $\bar{A} \cap \bar{B}$
iv) ‘Event A occurs and B does not occur’ as $A \cap \bar{B}$
v) ‘Exactly one of the events A or B occurs’ as $(A \cap \bar{B}) \cup (\bar{A} \cap B)$.
Similarly, you can write the meanings in terms of set theory for such statements in the case of three or more events,
e.g. in the case of three events A, B and C, the happening of at least one of the events is written as $A \cup B \cup C$.
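1. Prove that the probability of the impossible event is zero, i.e. $P(\emptyset) = 0$.
2. The probability of non-happening of an event A, i.e. of the complementary event of A, is given by $P(\bar{A}) = 1 - P(A)$.
3. (i) $P(A \cap \bar{B}) = P(A) - P(A \cap B)$
(ii) $P(\bar{A} \cap B) = P(B) - P(A \cap B)$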


Example 4: A, B and C are three mutually exclusive and exhaustive events associated with a random
experiment. Find P(A) given that :

and

Addition Theorem on Probability for Two Events


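Let S be the sample space of a random experiment and A and B be two events in S, then
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
If events A and B are mutually exclusive, then P(A ∪ B) = P(A) + P(B).
Similarly, for three non-mutually exclusive events A, B and C, we have
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) - P(A ∩ B) - P(A ∩ C) - P(B ∩ C) + P(A ∩ B ∩ C)
and for three mutually exclusive events A, B and C, we have
P(A ∪ B ∪ C) = P(A) + P(B) + P(C)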
Example 1: From a pack of 52 playing cards, one card is drawn at random. What is the probability that it is a
jack of spade or queen of heart?
Solution: Let A and B be the events of drawing a jack of spade and a queen of heart, respectively.
∴ P(A) = 1/52 and P(B) = 1/52
Here, a card cannot be both the jack of spade and the queen of heart, hence A and B are mutually exclusive.
∴ applying the addition theorem for mutually exclusive events,
the required probability = P(A ∪ B) = P(A) + P(B) = 1/52 + 1/52 = 2/52 = 1/26.

Example 2: 25 lottery tickets are marked with first 25 numerals. A ticket is drawn at random. Find the
probability that it is a multiple of 5 or 7.
Solution: Let A be the event that the drawn ticket bears a number multiple of 5 and B be the event that it bears
a number multiple of 7.
Therefore,
A = {5, 10, 15, 20, 25}, B = {7, 14, 21}. Here, as A ∩ B = ∅,
∴ A and B are mutually exclusive, and hence
P(A ∪ B) = P(A) + P(B) = 5/25 + 3/25 = 8/25.

Example 3: Find the probability of getting either a number multiple of 3 or a prime number when a fair die is
thrown.
Solution: When a die is thrown, the sample space is S = {1, 2, 3, 4, 5, 6}.
Let A be the event of getting a number multiple of 3 and B be the event of getting a prime number.
∴ A = {3, 6}, B = {2, 3, 5}, A ∩ B = {3}. Here, as A ∩ B is not the empty set,
∴ A and B are non-mutually exclusive and hence the required probability
= P(A ∪ B) = P(A) + P(B) - P(A ∩ B) = 2/6 + 3/6 - 1/6 = 4/6 = 2/3.
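
The addition theorem can be checked against a direct count using Python sets, where `|` is union and `&` is intersection. This is a sketch with illustrative names, using the die example of a multiple of 3 or a prime number.

```python
from fractions import Fraction


def prob(event, sample_space):
    """Classical probability of an event within a finite sample space."""
    return Fraction(len(event), len(sample_space))


S = set(range(1, 7))                 # a fair die
A = {n for n in S if n % 3 == 0}     # multiples of 3: {3, 6}
B = {2, 3, 5}                        # prime numbers

direct = prob(A | B, S)                                  # count the union directly
by_theorem = prob(A, S) + prob(B, S) - prob(A & B, S)    # addition theorem

print(direct, by_theorem)
```

Subtracting P(A ∩ B) removes the outcome 3, which would otherwise be counted once in A and once in B.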

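Example 4: There are 40 pages in a book. A page is opened at random. Find the probability that the number of
this opened page is a multiple of 3 or 5.
Solution: Let A be the event that the page number is a multiple of 3 and B be the event that it is a multiple of 5.
Among the numbers 1 to 40 there are 13 multiples of 3, 8 multiples of 5 and 2 multiples of both 3 and 5
(namely 15 and 30).
∴ P(A) = 13/40, P(B) = 8/40 and P(A ∩ B) = 2/40
∴ the required probability = P(A ∪ B)
= P(A) + P(B) - P(A ∩ B)
= 13/40 + 8/40 - 2/40 = 19/40.
The same approach applies when a card is drawn at random from a pack of 52 playing cards and we want the
probability that it is an ace or a red card. Let A be the event that the drawn card is an ace and B be the event
that it is a red card. There are four aces and 26 red cards in a pack of 52 playing cards. Also, 2 cards in the
pack are red aces.
∴ P(A) = 4/52, P(B) = 26/52 and P(A ∩ B) = 2/52
∴ the required probability = P(A ∪ B)
= P(A) + P(B) - P(A ∩ B)
= 4/52 + 26/52 - 2/52 = 28/52 = 7/13.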

E1) A card is drawn from a pack of 52 playing cards. Find the probability that it is either a king or a red card.

E2) Two dice are thrown together. Find the probability that the sum of the numbers turned up is either 6 or 8.

Conditional Probability
We have discussed earlier that P(A) represents the probability of happening event A for which the number of
exhaustive cases is the number of elements in the sample space S. P(A) dealt earlier was the unconditional
probability. Here, we are going to deal with conditional probability.
Let us start with taking the following example:
Suppose a card is drawn at random from a pack of 52 playing cards. Let A be the event of drawing a black
colour face card. Then A = {Js, Qs, Ks, Jc, Qc, Kc} and hence P(A) = 6/52 = 3/26.
Let B be the event of drawing a card of spade, i.e. B = {1s, 2s, 3s, 4s, 5s, 6s, 7s, 8s, 9s, 10s, Js, Qs, Ks}.
If, after a card is drawn from the pack of cards, we are given the information that a card of spade has been drawn,
i.e. B has happened, then the probability of event A will no more be 3/26, because in this case we have
the information that the card drawn is a spade (i.e. from amongst 13 cards) and hence there are 13 exhaustive
cases and not 52. From amongst these 13 cards of spade, there are 3 black colour face cards and hence the
probability of having a black colour face card given that it is a card of spade is P(A|B) = 3/13, which is the
conditional probability of A given that B has already happened.
Note: Here, the symbol ‘|’ used in P(A|B) should be read as ‘given’ and not ‘upon’. P(A|B) is the conditional
probability of happening of A given that B has already happened, i.e. here A happens depending on the condition
of B.
So, the conditional probability P(A|B) is also the probability of happening of A, but here the information is given
that the event B has already happened. P(A|B) refers to the sample space B and not S.
Remark 1: P(A|B) is meaningful only when P(B) ≠ 0, i.e. when the event B is not an impossible event.
Multiplication Law of Probability
Statement: For two events A and B,
P(A ∩ B) = P(A) P(B|A), P(A) > 0 … (1)
= P(B) P(A|B), P(B) > 0, … (2)
where P(B|A) is the conditional probability of B given that A has already happened and P(A|B) is the
conditional probability of A given that B has already happened.
Similarly, for three events A, B and C,
P(A ∩ B ∩ C) = P(A) P(B|A) P(C|A ∩ B),
where P(C|A ∩ B) represents the probability of happening of C given that A and B both have already happened.
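
The card illustration and the multiplication law can be checked together by building the pack as a set of (rank, suit) pairs. The representation of the pack and the helper names `prob` and `cond_prob` are assumptions made for this sketch; the ace is written "A" here, whereas the text above writes it as 1.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["S", "H", "D", "C"]  # spade, heart, diamond, club
deck = set(product(ranks, suits))

# A: black face card; B: card of spade.
A = {(r, s) for r, s in deck if r in {"J", "Q", "K"} and s in {"S", "C"}}
B = {(r, s) for r, s in deck if s == "S"}


def prob(event):
    """Unconditional probability: the sample space is the whole pack."""
    return Fraction(len(event), len(deck))


def cond_prob(event, given):
    """Conditional probability: the sample space shrinks to `given`."""
    return Fraction(len(event & given), len(given))


p_A = prob(A)
p_A_given_B = cond_prob(A, B)
p_B_given_A = cond_prob(B, A)

# Multiplication law, both forms.
law_form_1 = prob(A) * cond_prob(B, A)
law_form_2 = prob(B) * cond_prob(A, B)

print(p_A, p_A_given_B, prob(A & B), law_form_1, law_form_2)
```

The function `cond_prob` divides by the size of the given event rather than by 52, which is the point made in the text: P(A|B) refers to the sample space B and not S.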
INDEPENDENT EVENTS
Before defining the independent events, let us again consider the concept of conditional probability taking the
following example:
Suppose we draw a card from a pack of 52 playing cards; then the probability of drawing a card of spade is 13/52.
Now, suppose we do not replace the card and draw the next card. Then the probability that the second
card is a spade, given that the first card was a spade, would be 12/51, and it is a conditional
probability. If the first card had been replaced, this conditional probability would have been
13/52. So, if sampling is done without replacement, the probability of the second draw and that of subsequent
draws made in the same way is affected, but if it is done with replacement, then the probability of the
second draw and subsequent draws made in the same way remains unaltered.
So, in the above example, if the next draw is made with replacement, then the happening or non-happening of
any draw is not affected by the preceding draws. Let us now define independent events.
Independent Events
Events are said to be independent if happening or non-happening of any one event is not affected by the
happening or non-happening of other events. For example, if a coin is tossed certain number of times, then
happening of head in any trial is not affected by any other trial i.e. all the trials are independent.

Two events A and B are independent if and only if P(B|A) = P(B), i.e. there is no relevance of giving any
information. Here, if A has already happened, even then it does not alter the probability of B, e.g. let A be the
event of getting head in one toss of a coin and B be the event of getting head in the next toss of the coin. Then
the probability of getting head in the next toss is 1/2, irrespective of whether we know or don’t know the
outcome of the previous toss, i.e. P(B|A) = P(B).

Multiplicative Law for Independent Events:

If A and B are independent events, then

P(A ∩ B) = P(A) P(B).

This is because if A and B are independent then P(B|A) = P(B), and hence equation (1) of the multiplication
law of probability becomes P(A ∩ B) = P(A) P(B).

Similarly, if A, B and C are three independent events, then

P(A ∩ B ∩ C) = P(A) P(B) P(C).
Remark 2: Mutually exclusive events with non-zero probabilities can never be independent.
Result: If events A and B are independent then prove that
(i) A and $\bar{B}$ are independent (ii) $\bar{A}$ and B are independent (iii) $\bar{A}$ and $\bar{B}$ are independent
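
The remark and the result can be illustrated (not proved) on the sample space of two coin tosses. The sketch below checks the product rule for a pair of independent events and their complements, and shows that two mutually exclusive events of positive probability fail it. The function names are illustrative.

```python
from fractions import Fraction
from itertools import product

S = set(product("HT", repeat=2))  # two tosses of a fair coin


def prob(event):
    return Fraction(len(event), len(S))


def independent(e, f):
    """Events are independent when P(E and F) = P(E) * P(F)."""
    return prob(e & f) == prob(e) * prob(f)


A = {o for o in S if o[0] == "H"}   # head on the first toss
B = {o for o in S if o[1] == "H"}   # head on the second toss
not_A, not_B = S - A, S - B

complement_checks = [
    independent(A, B),
    independent(A, not_B),
    independent(not_A, B),
    independent(not_A, not_B),
]

# Two mutually exclusive events, each with positive probability.
C = {("H", "H")}
D = {("T", "T")}
exclusive_pair_independent = independent(C, D)

print(complement_checks, exclusive_pair_independent)
```

For C and D the intersection is empty, so P(C ∩ D) = 0 while P(C) P(D) = 1/16, which is why the product rule cannot hold.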
Example 6: A die is rolled. If the outcome is a number greater than 3, what is the probability that it is a prime
number?
Solution: The sample space of the experiment is
S= {1, 2, 3, 4, 5, 6}
Let A be the event that the outcome is a number greater than 3 and B be the event that it is a prime number.
∴ A = {4, 5, 6}, B = {2, 3, 5} and hence A ∩ B = {5}.
∴ P(A) = 3/6, P(B) = 3/6, P(A ∩ B) = 1/6.

Now, the required probability = P(B|A) = P(A ∩ B)/P(A) = (1/6)/(3/6) = 1/3.

Example 7: A couple has 2 children. What is the probability that both the children are boys, if it is known that

(i) the younger child is a boy (ii) the older child is a boy (iii) at least one of them is a boy?

Solution: Let Bi and Gi denote that the i-th birth is of a boy and of a girl respectively, i = 1, 2.

Then for a couple having two children, the sample space is
S = {B1B2, B1G2, G1B2, G1G2}

Let A be the event that both children are boys, then A = {B1B2}.

(i) Let B be the event of getting the younger child as a boy, i.e.
B = {B1B2, G1B2}. Hence P(A|B) = P(A ∩ B)/P(B) = (1/4)/(2/4) = 1/2.

(ii) Let C be the event of getting the older child as a boy, then C = {B1B2, B1G2}
and hence P(A|C) = P(A ∩ C)/P(C) = (1/4)/(2/4) = 1/2.

(iii) Let D be the event of getting at least one of the children as a boy, then
D = {B1B2, B1G2, G1B2} and hence P(A|D) = P(A ∩ D)/P(D) = (1/4)/(3/4) = 1/3.

Example 8: An urn contains 4 red and 7 blue balls. Two balls are drawn one by one without replacement. Find
the probability of getting 2 red balls.
Solution: Let A be the event that the first ball drawn is red and B be the event that the second ball drawn is red.

∴ P(A) = 4/11 and P(B|A) = 3/10

∴ required probability = P(A and B) = P(A) P(B|A) = 4/11 × 3/10 = 12/110 = 6/55.

Example 9: Three cards are drawn one by one without replacement from a well shuffled pack of 52 playing
cards. What is the probability that first card is jack, second is queen and the third is again a jack.

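Solution: Define the following events:

E1 be the event of getting a jack in the first draw, E2 be the event of getting a queen in the second draw, and
E3 be the event of getting a jack in the third draw.

∴ Required probability = P(E1 ∩ E2 ∩ E3) = P(E1) P(E2|E1) P(E3|E1 ∩ E2) = 4/52 × 4/51 × 3/50 = 48/132600 = 2/5525.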
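
Draws without replacement can be handled either with the multiplication law or by brute-force enumeration of ordered draws. The sketch below does both for the urn with 4 red and 7 blue balls, and applies the multiplication law to the jack, queen, jack sequence; variable names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

# Urn example: multiplication law P(A) * P(B|A).
p_two_red = Fraction(4, 11) * Fraction(3, 10)

# Same probability by listing every ordered pair of distinct balls.
urn = ["R"] * 4 + ["B"] * 7
ordered_draws = list(permutations(range(len(urn)), 2))
both_red = sum(1 for i, j in ordered_draws if urn[i] == "R" and urn[j] == "R")
p_two_red_enumerated = Fraction(both_red, len(ordered_draws))

# Cards example: jack first, queen second, jack third.
# After one jack and one queen are removed, 3 jacks remain among 50 cards.
p_jack_queen_jack = Fraction(4, 52) * Fraction(4, 51) * Fraction(3, 50)

print(p_two_red, p_two_red_enumerated, p_jack_queen_jack)
```

Each factor after the first uses the reduced pack or urn, which is what distinguishes sampling without replacement from sampling with replacement.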

Example 10: (i) If A and B are independent events with P(A ∪ B) = 0.8 and P(B) = 0.4 then find P(A).
(ii) If A and B are independent events with P(A) = 0.2, P(B) = 0.5 then find P(A ∪ B).
(iii) If A and B are independent events and P(A) = 0.4 and P(B) = 0.3, then find P(A|B) and P(B|A).
(iv) If A and B are independent events with P(A) = 0.4 and P(B) = 0.2, then find P(Ā ∩ B) and P(A ∩ B̄).

Solution:
(i) We are given P(A ∪ B) = 0.8 and P(B) = 0.4.
P(A ∪ B) = P(A) + P(B) − P(A) P(B) [∵ events A and B are independent]
⇒ 0.8 = P(A) + 0.4 − 0.4 P(A) ⇒ 0.8 − 0.4 = (1 − 0.4) P(A) ⇒ 0.4 = 0.6 P(A) ⇒ P(A) = 2/3

(ii) We are given that P(A) = 0.2, P(B) = 0.5.
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
= 0.2 + 0.5 − (0.2)(0.5) [∵ events A and B are independent]
= 0.7 – 0.10 = 0.6

(iii) We are given that P(A) = 0.4, P(B) = 0.3.
Now, as A and B are independent events,
P(A ∩ B) = P(A) P(B) = 0.4 × 0.3 = 0.12
And hence from conditional probability, we have
P(A|B) = P(A ∩ B)/P(B) = 0.12/0.3 = 0.4, and P(B|A) = P(A ∩ B)/P(A) = 0.12/0.4 = 0.3.

(iv) We are given that P(A) = 0.4, P(B) = 0.2.
We know that if two events A and B are independent then Ā and B, and A and B̄, are also independent events.
P(Ā ∩ B) = P(Ā) P(B) [Using the concept of independent events]
= (1 − P(A)) P(B) = (1 − 0.4)(0.2) = (0.6)(0.2) = 0.12
P(A ∩ B̄) = P(A) P(B̄) [∵ A and B̄ are independent]
= (0.4)(1 – 0.2) = 0.32

Example 12: Two cards are drawn from a pack of cards in succession with replacement of first card. Find the probability that both are the cards of ‘heart’.

Solution: Let A be the event that the first card drawn is a heart card and B be the event that second card is a heart card.

As the cards are drawn with replacement, A and B are independent with P(A) = P(B) = 13/52 = 1/4, and hence the required probability

= P(A ∩ B) = P(A) P(B) = (1/4)(1/4) = 1/16.

Example 13: A class consists of 10 boys and 40 girls. 5 of the students are rich and 15 students are brilliant. Find the probability of selecting a brilliant rich boy.

Solution: Let A be the event that the selected student is brilliant, B be the event that he/she is rich and C be the event that the student is boy.

∴ P(A) = 15/50, P(B) = 5/50, P(C) = 10/50 and hence, treating A, B and C as independent events,

the required probability = P(A ∩ B ∩ C) = P(A) P(B) P(C) = (15/50)(5/50)(10/50) = 3/500 = 0.006

E3) A card is drawn from a well-shuffled pack of cards. If the card drawn is a face card, what is the
probability that it is a king?
E4) Two cards are drawn one by one without replacement from a well shuffled pack of 52 cards. What is the
probability that both the cards are red?
E5) A bag contains 10 good and 4 defective items, two items are drawn one by one without replacement.
What is the probability that first drawn item is defective and the second one is good?
E6) The odds in favour of passing driving test by a person X are and odds in favour of passing the same
test by another person Y are . What is the probability that both will pass the test?

PROBABILITY OF HAPPENING AT LEAST ONE OF THE INDEPENDENT EVENTS

If A and B are two independent events, then the probability of happening of at least one of the events is

P(A ∪ B) = 1 − P(Ā ∩ B̄) = 1 − P(Ā) P(B̄)

Similarly, if we have n independent events A1, A2, …, An, then the probability of happening of at least one of the events is

P(A1 ∪ A2 ∪ … ∪ An) = 1 − P(Ā1) P(Ā2) … P(Ān)

i.e. probability of happening at least one of the independent events
= 1 – probability of happening none of the events.
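The rule "1 minus the probability that none of the events happens" translates directly into a few lines of code. This is a minimal sketch; the function name p_at_least_one is hypothetical, and the values 4/5 and 2/3 are the hit probabilities used in Example 14 below.

```python
from math import prod

def p_at_least_one(probs):
    """P(at least one of several independent events occurs).

    probs: individual probabilities of the independent events.
    Returns 1 - product of the complements.
    """
    return 1 - prod(1 - p for p in probs)

# Two marksmen with hit probabilities 4/5 and 2/3 (Example 14).
print(p_at_least_one([4 / 5, 2 / 3]))
```

The same function accepts any number of independent events, which is what the n-event formula above requires.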

Example 14: A person is known to hit the target in 4 out of 5 shots whereas another person is known to hit 2 out of 3 shots. Find the probability of the target being hit when they both try.

Solution: Let A be the event that first person hits the target and B be the event that second person hits the target.

∴ P(A) = 4/5 and P(B) = 2/3

Now, as both the persons try independently,

the required probability = probability that the target is hit

= probability that at least one of the persons hits the target

= P(A ∪ B) = 1 − P(Ā) P(B̄) = 1 − (1/5)(1/3) = 14/15

E7) A problem in statistics is given to three students A, B and C whose chances of solving it are 0.3, 0.5 and 0.6 respectively. What is the probability that the problem is solved?


Example 15: Husband and wife appear in an interview for two vacancies for the same post. The probabilities
of husband’s and wife’s selections are Find the probability that

(i) Exactly one of them is selected (ii)At least one of them is selected (iii)None is selected.
Solution: Let H be the event that husband is selected and W be the event that wife is selected. Then,

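(i) The required probability = P[(H ∩ W̄) ∪ (H̄ ∩ W)] = P(H) P(W̄) + P(H̄) P(W) [∵ the two selections are independent]

(ii) The required probability = P(H ∪ W) = 1 − P(H̄) P(W̄)

(iii) The required probability = P(H̄ ∩ W̄) = P(H̄) P(W̄)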

Example 16: A person X speaks the truth in 80% cases and another person Y speaks the truth in 90% cases.
Find the probability that they contradict each other in stating the same fact.
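Solution: Let A, B be the events that person X and person Y speak truth respectively, then
P(A) = 0.8, P(Ā) = 0.2, P(B) = 0.9, P(B̄) = 0.1.

They contradict each other when exactly one of them speaks the truth. Thus, the required probability = P(A ∩ B̄) + P(Ā ∩ B) = P(A) P(B̄) + P(Ā) P(B)

= 0.8 × 0.1 + 0.2 × 0.9 = 0.08 + 0.18 = 0.26 = 26%.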


E 8) Two cards are drawn from a pack of cards in succession presuming that drawn cards are replaced. What
is the probability that both drawn cards are of the same suit?

LAW OF TOTAL PROBABILITY


There are experiments which are conducted in two stages for completion. Such experiments are termed as
two-stage experiments. At the first stage, the experiment involves selection of one of the given numbers of
possible mutually exclusive events. At the second stage, the experiment involves happening of an event
which is a sub-set of at least one of the events of first stage.
As an illustration for a two-stage experiment, let us consider the following example:
Suppose there are two urns – Urn I and Urn II. Suppose Urn I contains 4 white, 6 blue and Urn II contains 4
white, 5 blue balls. One of the urns is selected at random and a ball is drawn. Here, the first stage is the
selection of one of the urns and second stage is the drawing of a ball of particular colour.
If we are interested in finding the probability of the event of second stage, then it
is obtained using law of total probability, which is stated as under:
Law of Total Probability
Statement: Let S be the sample space and E1, E2, …, En be n mutually exclusive
and exhaustive events with P(Ei) ≠ 0; i = 1, 2, …, n. Let A be any event which is a
sub-set of E1 ∪ E2 ∪ … ∪ En (i.e. at least one of the events E1, E2, …, En) with P(A) > 0,
then P(A) = P(E1) P(A|E1) + P(E2) P(A|E2) + … + P(En) P(A|En)
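As an illustration of the statement, the two-urn experiment described above (Urn I with 4 white and 6 blue balls, Urn II with 4 white and 5 blue balls, each urn chosen with probability 1/2) can be worked through in code. This is a sketch only; the function name total_probability and the list-based interface are assumptions made for the example.

```python
from fractions import Fraction

def total_probability(priors, likelihoods):
    """P(A) = sum of P(Ei) * P(A|Ei) over mutually exclusive, exhaustive Ei.

    priors:      [P(E1), ..., P(En)], which must sum to 1
    likelihoods: [P(A|E1), ..., P(A|En)]
    """
    if sum(priors) != 1:
        raise ValueError("the events Ei must be exhaustive: priors must sum to 1")
    return sum(p * l for p, l in zip(priors, likelihoods))

# First stage: pick an urn at random. Second stage: draw a white ball.
priors = [Fraction(1, 2), Fraction(1, 2)]
white_given_urn = [Fraction(4, 10), Fraction(4, 9)]

p_white = total_probability(priors, white_given_urn)
print(p_white)
```

Exact fractions are used so that the check on the priors is not affected by rounding.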

Example 1: There are two bags. First bag contains 5 red, 6 white balls and the second bag contains 3 red, 4
white balls. One bag is selected at random and a ball is drawn from it. What is the probability that it is
(i) red, (ii) white.
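Solution: Let E1 be the event that first bag is selected and E2 be the event that second bag is selected.
∴ P(E1) = P(E2) = 1/2

(i) Let R be the event of getting a red ball from the selected bag.
∴ P(R|E1) = 5/11 and P(R|E2) = 3/7.

Thus, the required probability is given by
P(R) = P(E1) P(R|E1) + P(E2) P(R|E2) = (1/2)(5/11) + (1/2)(3/7) = 5/22 + 3/14 = 68/154 = 34/77

(ii) Let W be the event of getting a white ball from the selected bag.
∴ P(W|E1) = 6/11 and P(W|E2) = 4/7.

Thus, the required probability is given by
P(W) = P(E1) P(W|E1) + P(E2) P(W|E2) = (1/2)(6/11) + (1/2)(4/7) = 3/11 + 2/7 = 43/77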

Example 2: A factory produces certain type of output by 3 machines. The respective daily production figures
are-machine X : 3000 units, machine Y: 2500 units and machine Z: 4500 units. Past experience shows that 1%
of the output produced by machine X is defective. The corresponding fractions of defectives for the other two
machines are 1.2 and 2 percent respectively. An item is drawn from the day’s production. What is the
probability that it is defective?
Solution: Let E1, E2 and E3 be the events that the drawn item is produced by machine X, machine Y and
machine Z, respectively. Let A be the event that the drawn item is defective.
As the total number of units produced by all the machines is 3000 + 2500 + 4500 = 10000,

∴ P(E1) = 3000/10000 = 0.30, P(E2) = 2500/10000 = 0.25, P(E3) = 4500/10000 = 0.45,
and P(A|E1) = 0.01, P(A|E2) = 0.012, P(A|E3) = 0.02.

Thus, the required probability = Probability that the drawn item is defective
= P(A) = P(E1) P(A|E1) + P(E2) P(A|E2) + P(E3) P(A|E3)

= (0.30)(0.01) + (0.25)(0.012) + (0.45)(0.02) = 0.003 + 0.003 + 0.009 = 0.015.

Example 3: There are two coins-one unbiased and the other two- headed, otherwise they are identical. One of
the coins is taken at random without seeing it and tossed. What is the probability of getting head?
Solution: Let E1 and E2 be the events of selecting the unbiased coin and the two-headed coin respectively. Let
A be the event of getting head on the tossed coin.

∴ P(E1) = P(E2) = 1/2 [∵ selection of each of the coins is equally likely]

P(A|E1) = 1/2 [∵ if it is unbiased coin, then head and tail are equally likely]

P(A|E2) = 1 [∵ if it is two-headed coin, then getting the head is certain]

Thus, the required probability = P(A) = P(E1) P(A|E1) + P(E2) P(A|E2) = (1/2)(1/2) + (1/2)(1) = 3/4.

Example 4: The probabilities of selection of 3 persons for the post of a principal in a newly started college are
in the ratio . The probabilities that they will introduce co-education in the college are 0.2, 0.3 and 0.5,
respectively. Find the probability that co-education is introduced in the college.
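Solution: Let E1, E2 and E3 be the events of selection of first, second and third person for the post of a principal
respectively. Let A be the event that co-education is introduced.

∴ P(A|E1) = 0.2, P(A|E2) = 0.3, P(A|E3) = 0.5, with P(E1), P(E2) and P(E3) obtained from the given ratio.

Thus, the required probability = P(A)

= P(E1) P(A|E1) + P(E2) P(A|E2) + P(E3) P(A|E3)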

E1) A person gets a construction job and agrees to undertake it. The completion of the job in time depends on
whether there happens to be strike or not in the company. There are 40% chances that there will be a
strike. Probability that job is completed in time is 30% if the strike takes place and is 70% if the strike
does not take place. What is the probability that the job will be completed in time?
E2) What is the probability that a year selected at random will contain 53 Sundays?
E3) There are two bags, first bag contains 3 red, 5 black balls and the second bag contains 4 red, 5 black
balls. One ball is drawn from the first bag and is put into the second bag without noticing its colour.
Then two balls are drawn from the second bag. What is the probability that balls are of opposite
colours?
BAYES’ THEOREM
If we are interested in finding the probability of the event of second stage, then it is obtained using law of total
probability. But if the happening of the event of second stage is given to us and on this basis we find the
probability of an event of first stage, then this probability is the revised (or posterior)
probability and is obtained using an important theorem known as Bayes’ theorem, given by Thomas Bayes.
This theorem is also known as ‘Inverse probability theorem’, because here, having moved from first stage to second
stage, we again find the (revised) probabilities of the events of first stage, i.e. we move inversely. Thus, using
this theorem, probabilities can be revised on the basis of some related new information.
Statement: Let S be the sample space and E1, E2, …, En be n mutually exclusive and exhaustive events with
P(Ei) ≠ 0; i = 1, 2, …, n. Let A be any event which is a sub-set of E1 ∪ E2 ∪ … ∪ En (i.e. at least one of the
events E1, E2, …, En) with P(A) > 0 [Notice that up to this line the statement is same as that of law of total
probability], then

P(Ei|A) = P(Ei) P(A|Ei) / P(A), i = 1, 2, …, n,

where P(A) = P(E1) P(A|E1) + P(E2) P(A|E2) + … + P(En) P(A|En).
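The revision of first-stage probabilities can be sketched in code as a division of each term P(Ei) P(A|Ei) by their total. The function name posteriors is hypothetical, and the numbers reuse the two-urn illustration from the law of total probability (Urn I: 4 white, 6 blue; Urn II: 4 white, 5 blue; a white ball is observed).

```python
from fractions import Fraction

def posteriors(priors, likelihoods):
    """Return [P(E1|A), ..., P(En|A)] by Bayes' theorem.

    priors:      [P(E1), ..., P(En)] for mutually exclusive, exhaustive Ei
    likelihoods: [P(A|E1), ..., P(A|En)]
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(Ei) P(A|Ei)
    p_a = sum(joint)                                      # law of total probability
    if p_a == 0:
        raise ValueError("P(A) must be positive")
    return [j / p_a for j in joint]

priors = [Fraction(1, 2), Fraction(1, 2)]
white_given_urn = [Fraction(4, 10), Fraction(4, 9)]

urn_given_white = posteriors(priors, white_given_urn)
print(urn_given_white)
```

The posterior probabilities always sum to 1, which is a convenient check on a hand calculation.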

Example 5: There are two bags. First bag contains 5 red, 6 white balls and the second bag contains 3 red, 4
white balls. One bag is selected at random and a ball is drawn from it and it is found to be red. What is the
probability of
i) selecting the first bag ii) selecting the second bag?
Solution: First, we have to give the solution exactly as given for Example 1 under the law of total probability, which gives
P(E1) P(R|E1) = 5/22, P(E2) P(R|E2) = 3/14 and P(R) = 34/77. After that, we are to proceed as follows:
i) Probability of selecting the first bag given that the ball drawn is red

P(E1|R) = P(E1) P(R|E1) / P(R) = (5/22)/(34/77) = 35/68

ii) Probability of selecting the second bag given that the ball drawn is red

P(E2|R) = P(E2) P(R|E2) / P(R) = (3/14)/(34/77) = 33/68

Example 6: A factory produces certain type of output by 3 machines. The respective daily production figures
are-machine X : 3000 units, machine Y: 2500 units and machine Z: 4500 units. Past experience shows that 1%
of the output produced by machine X is defective. The corresponding fractions of defectives for the other two
machines are 1.2 and 2 percent respectively. An item is drawn from the day’s production and if the drawn item
is found to be defective, what is the probability that it has been produced by machine Y?
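Solution: Proceed exactly in the manner Example 2 has been solved, which gives P(E2) P(A|E2) = 0.003 and P(A) = 0.015, and then as under:
Probability that the drawn item has been produced by machine Y given that it is defective

= P(E2|A) = P(E2) P(A|E2) / P(A) = 0.003/0.015 = 1/5 = 0.2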

Example 7: The probabilities of selection of 3 persons for the post of a principal in a newly started college are
in the ratio . The probabilities that they will introduce co-education in the college are 0.2, 0.3 and 0.5,
respectively. If the co-education is introduced by the candidate selected for the post of principal, what is the
probability that the first candidate was selected?
Solution: First give the solution of Example 4 then proceed as under:

The required probability = P(E1|A) = P(E1) P(A|E1) / P(A).

E4) A bag contains 4 red and 5 white balls. Another bag contains 2 red and 3
white balls. A ball is drawn from the first bag and is transferred to the second
bag. A ball is then drawn from the second bag and is found to be red, what is
the probability that red ball was transferred from first to second bag?
E5) An insurance company insured 1000 scooter drivers, 3000 car drivers and 6000 truck drivers. The
probabilities that scooter, car and truck drivers meet with an accident are 0.02, 0.04 and 0.25 respectively. One of
the insured persons meets with an accident. What is the probability that he is a
(i) car driver (ii) truck driver?
E6) By examining the chest X-ray, the probability that T.B is detected when a person is actually suffering
from T.B. is 0.99. The probability that the doctor diagnoses incorrectly that a person has T.B. on the basis
of X-ray is 0.002. In a certain city, one in 1000 persons suffers from T.B. A person is selected at random
and is diagnosed to have T.B., what is the chance that he actually has T.B.?
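E7) A person speaks truth 3 out of 4 times. A die is thrown. She reports that there is five. What is the chance
there was five?
Solution: Let E1 be the event that the person speaks truth, E2 be the event that she tells a lie and A be the event that she
reports a five.
∴ P(E1) = 3/4, P(E2) = 1/4, P(A|E1) = 1/6, P(A|E2) = 5/6.

By law of total probability, we have

P(A) = P(E1) P(A|E1) + P(E2) P(A|E2) = (3/4)(1/6) + (1/4)(5/6) = 3/24 + 5/24 = 8/24 = 1/3

Thus, the required probability = P(E1|A) = P(E1) P(A|E1) / P(A) = (3/24)/(8/24) = 3/8.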

Q1: A construction company is bidding for two contracts, A and B. The probability that the company will get
contract A is 3/5, will get contract B is 1/4 and the probability that the company gets both the contracts is 1/8.
What is the probability that the company will get contract A or B?
Ans. P(A ∪ B) = 3/5 + 1/4 − 1/8 = 29/40
Q2: Items produced by a certain process may each have one or both of two types of defects, A and B. It is
known that 22% of the items have type A defects and 12% have type B defects. Further, 8% are known to have
both types of defects. What is the probability that a randomly selected item will be defective?
Ans. 0.26
Q3: In a class 40% students read statistics, 25% Mathematics and 15% both Mathematics and Statistics. One
student is selected at random. Find the probability,
(i) that he reads Statistics, if it is known that he reads Mathematics = 0.15/0.25 = 3/5

(ii) that he reads Mathematics, if it is known that he reads Statistics = 0.15/0.40 = 3/8

Q4: The probabilities of A, B C solving a problem are respectively. If all the three try to solve the
problem simultaneously, find the probability that the problem will be solved.
Ans.
(ii) In the above example, find the probability that exactly one of them will solve the problem.
Ans.
Q5: Three critics review a book. Odds in favour of the book are 5 : 2, 4 : 3 and 3 : 4 respectively for the three
critics. Find the probability that majority are in favour of the book.
Ans. 209/343

Q6: Three balls are drawn successively from a box containing 6 red, 4 white and 5 blue balls. Find the
probability that they are of different colours if each ball is (i) not replaced (ii) replaced.
Ans. (i) 24/91 (ii) 16/75

Q7: The probability that at least one of the two independent events occurs is 0.5. Probability that the first event
occurs but not the second is 3/25. Also the probability that the second event occurs but not the first is 8/25.
Find the probability that none of the two events occurs.
Q8: Suppose that A and B are two independent events associated with a random experiment. If the probability
that A or B occurs equals to 0.6, while probability that A occurs equals 0.4. Determine the probability that B
occurs.
Q9: In a certain college, the geographical distribution of male students is as follows: 50% come from the East, 30%
come from the Midwest and 20% come from the Far West. The following proportions of the male students wear
ties: 80% of the Easterners, 60% of the Midwesterners and 40% of the Far Westerners. What is the probability
that a student who wears a tie comes from the East?

Random Experiment
An experiment in which all the possible outcomes are known in advance but we cannot predict as to which of
them will occur when we perform the experiment is called random experiment, e.g. Experiment of tossing a
coin is random experiment as the possible outcomes head and tail are known in advance but which one will turn
up is not known.
Similarly, ‘Throwing a die’ and ‘Drawing a card from a well shuffled pack of 52 playing cards’ are
examples of random experiments.
Variable
A quantity which takes different values is called a variable. Variables are of two types:
Fixed (Deterministic) variable
A variable whose values are known in advance is called a fixed variable. For example, the month of the year is a
fixed variable, since it is always known which month comes next.
Random Variable
A variable whose possible values are known in advance but we cannot predict as to which of them will occur
when we perform the experiment is called random variable.
A random variable is a numerical valued function defined on the sample space of a random experiment. Random
variables are denoted by capital letters such as X, Y, Z. For example, in tossing a coin, let X = 1 if the
coin falls with head and X = 0 if the coin falls with tail. Then X is a random variable. A random variable has the
following properties:
i) Each particular value of the random variable can be assigned some probability.
ii) The probabilities associated with all the different values of the random variable add up to 1 (unity).
Discrete Random Variable
A random variable is said to be discrete if it takes either a finite or a countably infinite number of values.
A countable number of values means values which can be arranged in a sequence, i.e. on the basis of three or four
successive known terms, we can identify a rule and hence can write the subsequent terms. For example,
suppose X is a random variable taking the values 2, 5, 8, 11, …; then we can write the fifth, sixth, …
values (14, 17, …). So, X in this example is a discrete random variable. The number of students present each day in a class
during an academic session is an example of a discrete random variable, as the number cannot take a fractional
value.
Continuous Random Variable
A random variable is said to be continuous if it can take all possible real (i.e. integer as well as fractional)
values between two certain limits. For example, the temperature of a city at various points of time during a day is
a continuous random variable, as the temperature can take uncountably many values, i.e. it can take
fractional values also.
E2) Which of the random variables given below are discrete? Give reasons for your answer.
1. The daily measurement of snowfall at Shimla. Ans. Continuous
2. The number of industrial accidents in each month. Ans. Discrete
3. The number of defective goods in a shipment (lot) of goods from a manufacturer. Ans. Discrete
Probability Mass Function

Let X be a discrete random variable (r.v.) which takes values x1, x2, ... and let P[X = xi] = p(xi). This function
p(xi), i = 1, 2, … defined for the values x1, x2, … is called the probability mass function of X if
(i) p(xi) ≥ 0 and
(ii) Σ p(xi) = 1, the sum being taken over all i.
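The two conditions can be turned into a small check. The sketch below is illustrative only: the function name is_pmf and the three sets of probabilities are hypothetical values chosen to show one violation of each condition and one valid case.

```python
from math import isclose

def is_pmf(probs):
    """True if probs satisfies (i) every p(xi) >= 0 and (ii) sum of p(xi) = 1."""
    non_negative = all(p >= 0 for p in probs)
    sums_to_one = isclose(sum(probs), 1.0)
    return non_negative and sums_to_one

# Hypothetical probability assignments, for illustration only.
print(is_pmf([0.5, 0.75]))         # fails (ii): total exceeds 1
print(is_pmf([0.75, -0.5, 0.75]))  # fails (i): one probability is negative
print(is_pmf([0.3, 0.2, 0.5]))     # satisfies both conditions
```

A tolerance-based comparison (isclose) is used for the total because decimal probabilities are stored as floating-point numbers.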

Probability Distribution

The set {(xi, p(xi)) : i = 1, 2, …} specifies the probability distribution of a discrete r.v. X. The probability
distribution of r.v. X can also be exhibited in the following manner:

X      x1      x2      x3      x4 …
p(x)   p(x1)   p(x2)   p(x3)   p(x4) …

Now, let us take up some examples concerning probability mass function:
Example 1: State, giving reasons, which of the following are not probability distributions:
(i)
X 0 1
p( )

(ii)
X 0 1 2
p( )

(iii)
X 0 1 2
p( )

Solution:
(i) Here p(xi) ≥ 0, i = 1, 2; but the sum Σ p(xi) = p(x1) + p(x2) = p(0) + p(1) is greater than 1.
So, the given distribution is not a probability distribution.

(ii) It is not a probability distribution, as p(x2) = p(1) is negative.

(iii) Here, p(xi) ≥ 0, i = 1, 2, 3 and Σ p(xi) = 1.
∴ The given distribution is a probability distribution.


Example 2: For the following probability distribution of a discrete r.v. X, find
i) the constant c, ii) P[X ≤ 3] and iii) P[1 < X < 4].
X      0   1   2   3    4    5
p(x)   0   c   c   2c   3c   c
Solution:

i) As the given distribution is a probability distribution, Σ p(xi) = 1

⇒ 0 + c + c + 2c + 3c + c = 1 ⇒ 8c = 1 ⇒ c = 1/8

ii) P[X ≤ 3] = P[X = 3] + P[X = 2] + P[X = 1] + P[X = 0] = 2c + c + c + 0 = 4c = 4/8 = 1/2.

iii) P[1 < X < 4] = P[X = 2] + P[X = 3] = c + 2c = 3c = 3/8.
Example 3: Find the probability distribution of the number of heads when three fair coins are tossed
simultaneously.
Solution: Let X be the number of heads in the toss of three fair coins.
The random variable “the number of heads” in a toss of three coins may be 0 or 1 or 2 or 3, associated with
the sample space
{HHH, HHT, HTH, THH, HTT, THT, TTH, TTT},
∴ X can take the values 0, 1, 2, 3, with
P[X = 0] = P[TTT] = 1/8, P[X = 1] = P[HTT, THT, TTH] = 3/8, P[X = 2] = P[HHT, HTH, THH] = 3/8,
P[X = 3] = P[HHH] = 1/8.
Probability distribution of X, i.e. the number of heads when three coins are tossed simultaneously, is
X      0     1     2     3
p(x)   1/8   3/8   3/8   1/8
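The distribution in Example 3 can be reproduced by listing the eight equally likely outcomes and counting heads in each. This is a minimal sketch; the variable names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2^3 equally likely outcomes of tossing three fair coins.
outcomes = list(product("HT", repeat=3))

# Count how many outcomes give 0, 1, 2 or 3 heads.
head_counts = Counter(outcome.count("H") for outcome in outcomes)

# Probability distribution of X = number of heads.
distribution = {
    x: Fraction(n, len(outcomes)) for x, n in sorted(head_counts.items())
}

for x, p in distribution.items():
    print(x, p)
```

Changing repeat=3 to another number of coins gives the corresponding distribution in the same way.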

Probability Density Function


Let X be a continuous random variable which takes on values in the interval (a, b) [i.e. all values between a
and b, a < b]. A function f(x) defined on X is called the probability density function of X if
(i) f(x) is non-negative for a ≤ x ≤ b, i.e. f(x) ≥ 0 for all x lying between a and b.
(ii) the area under the graph of f(x) and above the interval (a, b) is 1, i.e. ∫[a to b] f(x) dx = 1.
Example 5: A continuous random variable X has the probability density function:
f(x) = Ax³, 0 ≤ x ≤ 1.
Determine
(i) A, (ii) P[0.2 < X < 0.5]
Solution:
(i) As f(x) is a probability density function,

∫[0 to 1] f(x) dx = 1 ⇒ ∫[0 to 1] Ax³ dx = 1 ⇒ A [x⁴/4] from 0 to 1 = 1 ⇒ A/4 = 1 ⇒ A = 4

(ii) P[0.2 < X < 0.5] = ∫[0.2 to 0.5] 4x³ dx = [x⁴] from 0.2 to 0.5 = (0.5)⁴ – (0.2)⁴ = 0.0625 – 0.0016 = 0.0609
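The density f(x) = Ax³ on 0 ≤ x ≤ 1 can be checked numerically: with A = 4 the total area should come out as 1, and the area between 0.2 and 0.5 as about 0.0609. The sketch below uses a simple midpoint rule written for this illustration (the helper name integrate and the number of sub-intervals are assumptions); it is an approximation, not an exact integration.

```python
def integrate(f, a, b, n=10_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def f(x):
    """Density of Example 5 with A = 4, valid for 0 <= x <= 1."""
    return 4 * x**3

total_area = integrate(f, 0.0, 1.0)   # should be close to 1
p_interval = integrate(f, 0.2, 0.5)   # P[0.2 < X < 0.5]

print(round(total_area, 6), round(p_interval, 4))
```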

EXPECTATION OF A RANDOM VARIABLE


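Expected value of a discrete random variable X is E(X) = Σ xi p(xi), the sum being taken over all values xi of X.

But, if X is a continuous random variable having the probability density function f(x), then in place of
summation we will use integration and in this case, the expected value of X is defined as

E(X) = ∫ x f(x) dx, the integral being taken over the range of X.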
Example 3: Find the expectation of the number on an unbiased die when thrown.
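Solution: Let X be a random variable representing the number on a die when thrown.

X can take the values 1, 2, 3, 4, 5, 6 with P[X = 1] = P[X = 2] = … = P[X = 6] = 1/6.

Thus, the probability distribution of X is given by
X      1     2     3     4     5     6
p(x)   1/6   1/6   1/6   1/6   1/6   1/6

Hence, the expectation of number on the die when thrown is
E(X) = Σ x p(x) = (1 + 2 + 3 + 4 + 5 + 6)(1/6) = 21/6 = 7/2 = 3.5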

Example 2: A player tosses two unbiased coins. He wins Rs 5 if 2 heads appear, Rs 2 if one head appears and Rs 1 if no head appears. Find the
expected value of the amount won by him.
Solution: In tossing two unbiased coins, the sample space is

S = {HH, HT, TH, TT}

Let X be the amount in rupees won by him.

∴ X can take the values 5, 2 and 1 with

P[X = 5] = P[HH] = 1/4, P[X = 2] = P[HT, TH] = 2/4 and P[X = 1] = P[TT] = 1/4.

∴ Probability distribution of X is
X      5     2     1
p(x)   1/4   2/4   1/4

Expected value of X is given as

E(X) = Σ x p(x) = 5(1/4) + 2(2/4) + 1(1/4) = 10/4 = 2.5

Thus, the expected value of amount won by him is Rs 2.5.
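For a discrete random variable, the expected value is the probability-weighted sum of its values, which is a one-line computation. The sketch below applies it to the coin game above and to a fair die; the function name expectation and the dictionary representation of a distribution are assumptions made for illustration.

```python
from fractions import Fraction

def expectation(distribution):
    """E(X) = sum of x * p(x) for a distribution given as {value: probability}."""
    if sum(distribution.values()) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in distribution.items())

# Amount won (in Rs) when two unbiased coins are tossed.
coin_game = {5: Fraction(1, 4), 2: Fraction(1, 2), 1: Fraction(1, 4)}

# Number shown by an unbiased die.
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}

print(expectation(coin_game), expectation(fair_die))
```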


Example 5: For a continuous random variable (X) whose probability density function is given by:

find the expected value of X.

Solution: Expected value of a continuous random variable X is given by

E(X) = ∫ x f(x) dx, the integral being taken over the range of X.

Binomial Distribution (used when n is small, i.e. less than 30, and p > 0.05)
1) It involves a repetition of n identical trials.
2) The trials are independent of each other.
3) Each trial has two possible outcomes.
A discrete random variable X is said to follow binomial distribution with parameters n and p if its probability
mass function is given by

P[X = x] = C(n, x) p^x q^(n – x), x = 0, 1, 2, …, n

where, n is the number of independent trials,
x is the number of successes in n trials,
p is the probability of success in each trial, and
q = 1 – p is the probability of failure in each trial.
Mean = np and Variance = npq, and hence Mean > Variance (as q < 1)
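The binomial probability mass function can be evaluated directly from its formula. The following is a minimal sketch; the function name binomial_pmf is hypothetical, and the two illustrative calls use a fair coin tossed six times and the case n = 5, p = 0.6 that appears in a later problem.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P[X = x] = C(n, x) * p^x * q^(n - x) with q = 1 - p."""
    if not 0 <= x <= n:
        return 0.0  # impossible number of successes
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 6, 0.5
print(binomial_pmf(3, n, p))                          # exactly 3 heads in 6 tosses
print(sum(binomial_pmf(x, n, p) for x in range(3)))   # fewer than 3 heads
print(binomial_pmf(4, 5, 0.6))                        # 4 successes in 5 trials, p = 0.6

mean, variance = n * p, n * p * (1 - p)
print(mean, variance)                                 # mean exceeds variance
```

Returning 0 outside the range 0 to n mirrors the "impossible event" case, such as more than 6 heads in 6 tosses.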
Example 2: An unbiased coin is tossed six times. Find the probability of obtaining
(i) exactly 3 heads (ii) less than 3 heads (iii) more than 3 heads (iv) at most 3 heads
(v) at least 3 heads (vi) more than 6 heads

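Solution: Let p be the probability of getting head (success) in a toss of the coin and n be the number of trials.
∴ n = 6, p = 1/2 and hence q = 1 – p = 1 – 1/2 = 1/2.
Let X be the number of successes in n trials,
∴ by binomial distribution, we have

P[X = x] = C(6, x) (1/2)^x (1/2)^(6 – x) = C(6, x) (1/2)⁶ = C(6, x)/64, x = 0, 1, 2, …, 6.

Therefore,
(i) P[exactly 3 heads] = P[X = 3] = C(6, 3)/64 = 20/64 = 5/16 [Recall C(6, 3) = 6!/(3! 3!) = 20]

(ii) P[less than 3 heads] = P[X < 3] = P[X = 0] + P[X = 1] + P[X = 2]

= (1 + 6 + 15)/64 = 22/64 = 11/32.

(iii) P[more than 3 heads] = P[X > 3] = P[X = 4 or X = 5 or X = 6]

= (15 + 6 + 1)/64 = 22/64 = 11/32.

(iv) P[at most 3 heads] = P[3 or less than 3 heads] = P[X ≤ 3]

= (1 + 6 + 15 + 20)/64 = 42/64 = 21/32.

(v) P[at least 3 heads] = P[3 or more heads] = P[X ≥ 3] = (20 + 15 + 6 + 1)/64 = 42/64 = 21/32
or
P[X ≥ 3] = 1 – P[X < 3] = 1 – 11/32 = 21/32.
(vi) P[more than 6 heads] = P[7 or more heads] = P[an impossible event] = 0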

Example 3: The chances of catching cold by workers working in an ice factory during winter are 25%. What is
the probability that out of 5 workers 4 or more will catch cold?
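Solution: Let catching cold be the success and p be the probability of success for each worker.
∴ Here, n = 5, p = 0.25, q = 0.75 and by binomial distribution

P[X = x] = C(5, x) (0.25)^x (0.75)^(5 – x), x = 0, 1, …, 5.

Therefore, the required probability = P[X ≥ 4] = P[X = 4] + P[X = 5]
= C(5, 4) (0.25)⁴ (0.75) + C(5, 5) (0.25)⁵ = 15/1024 + 1/1024 = 16/1024 = 1/64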

E1) The probability of a man hitting a target is 1/4. He fires 5 times. What is the probability of his hitting the
target at least twice?
E2) A policeman fires 6 bullets on a dacoit. The probability that the dacoit will be killed by a bullet is 0.6.
What is the probability that the dacoit is still alive?

Example 2: You are sitting in a plane waiting for its take off. The pilot announces a delay until some incoming
planes land. State whether each of the following random variables is discrete or continuous:
i) How long it will be before take off.
ii) How many incoming planes there are.
Problem 3: It has been claimed that in 60% of all solar heat installations, the utility bill is reduced by at least
one-third. Accordingly, what are the probabilities that the utility bill will be reduced by one-third in
i) four of five installations?
ii) at least four of the five installations?
Solution: Here the random variable follows binomial distribution with p = 0.6, x = 4 and n = 5.
To find (i), we have to calculate P[X = 4], which is given by
P[X = 4] = C(5, 4) (0.6)⁴ (0.4) = 0.259
Now to find (ii), we have to find the probability that X is at least 4. This probability is the sum of the
probabilities that X = 4 and X = 5 because ‘at least 4’ means ‘4 or more’.
Thus we have to find P[X = 4] + P[X = 5].
P[X = 5] = C(5, 5) (0.6)⁵ = 0.078
∴ the required probability = 0.259 + 0.078 = 0.337.

Poisson Distribution (used when n is large and p < 0.05)

Poisson distribution is a limiting case of binomial distribution under the following conditions:
i) n, the number of trials, is indefinitely large, i.e. n → ∞.
ii) p, the constant probability of success for each trial, is very small, i.e. p → 0.
iii) np is a finite quantity, say ‘λ’.
Definition: A random variable X is said to follow Poisson distribution if its probability mass function is given
by:

P[X = x] = e^(–λ) λ^x / x!, x = 0, 1, 2, …; λ > 0

Mean = λ and Variance = λ, and hence Mean = Variance
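The Poisson probability mass function is equally direct to compute. This is a minimal sketch with a hypothetical function name poisson_pmf; the illustrative value λ = 2 is the one used in the truck-arrival example later in this section, and "at least two" is obtained through the complementary probability.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P[X = x] = e^(-lam) * lam^x / x! for x = 0, 1, 2, ..."""
    return exp(-lam) * lam**x / factorial(x)

lam = 2  # mean number of arrivals per hour
p_none = poisson_pmf(0, lam)
p_at_least_two = 1 - poisson_pmf(0, lam) - poisson_pmf(1, lam)

print(round(p_none, 4), round(p_at_least_two, 4))
```

Because x has no upper limit, upper-tail probabilities are computed as 1 minus a finite sum rather than by summing indefinitely.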

When the arrivals are counted over t units of time, we use

P[X = x] = e^(–λt) (λt)^x / x!, x = 0, 1, 2, …

where λ is the average arrival rate per unit of time and t is the number of units of time.
For example, suppose λ = 72 arrivals per hour is a constant for the situation and the question concerns a period of 3 minutes. Since λ is given per
‘hour’, to standardise the unit, we have to find ‘t’ in hours,

i.e. 60 minutes = 1 hour ⇒ 3 minutes = 3/60 hour ⇒ t = 1/20 hour

Note: In most of the cases for Poisson distribution, if we are to compute the probabilities of the type
P[X > a] or P[X ≥ a], we write them as P[X > a] = 1 – P[X ≤ a] and P[X ≥ a] = 1 – P[X < a],
because n may not be definite and hence we cannot go up to the last value, and hence
the probability is written in terms of its complementary probability.
Example 2: If the probability that an individual suffers a bad reaction from an injection of a given serum is
0.001, determine the probability that out of 1500 individuals
i) exactly 3, ii) more than 2 individuals suffer from bad reaction.
Solution: Let X be the Poisson variate, “Number of individuals suffering from bad reaction”. Then,
n = 1500, p = 0.001, ∴ λ = np = (1500)(0.001) = 1.5
∴ By Poisson distribution,

P[X = x] = e^(–1.5) (1.5)^x / x!, x = 0, 1, 2, …

i) The desired probability = P[X = 3] = e^(–1.5) (1.5)³ / 3! = 0.1255

ii) The desired probability = P[X > 2] = 1 – P[X ≤ 2] = 1 – [P[X = 0] + P[X = 1] + P[X = 2]]

= 1 – e^(–1.5) (1 + 1.5 + 1.125) = 1 – 0.8087 = 0.1913
Example 3: If the mean of a Poisson distribution is 1.44, find the value of the variance.
Solution: Here, mean = 1.44 ⇒ λ = 1.44
Hence, Variance = λ = 1.44

Example 4: If a Poisson variate X is such that P[X = 1] = 2P[X = 2], find the mean and variance of the
distribution.
Solution: Let λ be the mean of the distribution, hence by Poisson distribution,

P[X = x] = e^(–λ) λ^x / x!, x = 0, 1, 2, …

Now, P[X = 1] = 2 P[X = 2] ⇒ e^(–λ) λ / 1! = 2 e^(–λ) λ² / 2!

⇒ λ = λ² ⇒ λ² – λ = 0 ⇒ λ(λ – 1) = 0 ⇒ λ = 0, 1

But λ = 0 is rejected
[∵ if λ = 0 then either n = 0 or p = 0, which implies that Poisson distribution
does not exist in this case.]
∴ λ = 1
Hence mean = λ = 1, and Variance = λ = 1.

Example 1: It is known that the number of heavy trucks arriving at a railway station follows the Poisson
distribution. If the average number of truck arrivals during a specified period of an hour is 2, find the
probabilities that during a given hour
a) no heavy truck arrive, b)at least two trucks will arrive.
Solution: Here, the average number of truck arrivals is 2,
i.e. mean = 2 ⇒ λ = 2
Let X be the number of trucks arriving during a given hour,

∴ by Poisson distribution, we have P[X = x] = e^(–2) 2^x / x!, x = 0, 1, 2, …

(a) P[arrival of no heavy truck] = P[X = 0] = e^(–2) = 0.1353

(b) P[arrival of at least two trucks] = P[X ≥ 2] = 1 – [P[X = 0] + P[X = 1]] = 1 – e^(–2)(1 + 2) = 1 – 3e^(–2) ≈ 0.594

E1) Assume that the chance of an individual coal miner being killed in a mine accident during a year is .
Use the Poisson distribution to calculate the probability that in a mine employing 350 miners, there will be
at least one fatal accident in a year. (use )
E2) The mean and standard deviation of a Poisson distribution are 6 and 2
respectively. Test the validity of this statement.
E3) For a Poisson distribution, it is given that P[X = 1] = P[X = 2], find the value of mean of distribution.
Hence find P[X = 0] and P[X = 4].

1. A policeman fires 6 bullets on a dacoit. The probability that the dacoit will be killed by a bullet is 0.6.
What is the probability that the dacoit is still alive?
2. It has been claimed that in 60% of all solar heat installations, the utility bill is reduced by at least one-
third. Accordingly, what are the probabilities that the utility bill will be reduced by one-third in
i) four of five installations?
ii) at least four of the five installations?

3. An oil exploration firm plans to drill six holes. It is believed that the probability that each hole will yield
oil is 0.1. Since the holes are in quite different locations, the outcome of drilling one hole is statistically
independent of that of drilling any of the other holes.
(a) If the firm will be able to stay in business only if two or more holes produce oil, what is the
probability of its staying in business?
(b) Give the expected value of the number of holes that result in oil.
4. If a bank receives on an average λ = 6 bad checks per day, what is the probability that it will receive 4
bad checks on any given day?
5. A hospital has 20 kidney dialysis machines and the chance of any one of them malfunctioning
during any day is 0.02. We want to find the probability that exactly 3 machines will be out of service on
the same day. Then,
i) Can we use the binomial formula to find this probability? If yes, calculate the probability.
ii) Can we use the Poisson formula to find this? If yes, calculate the probability.
(λ = np = 20 × 0.02 = 0.4)

Uniform Distribution
Definition: A random variable X is said to follow the uniform distribution in the interval [a, b], where a < b, if
its probability density function (pdf) is given by:

f(x) = 1/(b – a), if a ≤ x ≤ b
f(x) = 0, otherwise

Mean = (a + b)/2 and Variance = (b – a)²/12
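Because the density is constant, a uniform probability is just the length of the overlap between the interval of interest and [a, b], divided by b − a. The sketch below illustrates this; the function names are hypothetical, and the values a = −2 and b = 30 (minutes relative to a due time) are chosen to match the train exercise that follows.

```python
def uniform_prob(lo, hi, a, b):
    """P[lo < X < hi] for X uniformly distributed on [a, b]."""
    left, right = max(lo, a), min(hi, b)   # clip the interval to [a, b]
    return max(right - left, 0) / (b - a)

def uniform_mean(a, b):
    return (a + b) / 2

def uniform_variance(a, b):
    return (b - a) ** 2 / 12

a, b = -2, 30
print(uniform_prob(-2, 10, a, b))   # less than 10 minutes late (early arrivals included)
print(uniform_prob(0, 16, a, b))    # late, but less than 16 minutes late
print(uniform_mean(a, b), uniform_variance(a, b))
```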

E 14) Suppose that the weight of sugar obtained processing a tank of sugar cane juice is uniformly
distributed with a mean of 10 kg. and range of 1.8 kg. Then

i) What are the largest and smallest weights of sugar obtained from a tank of sugar can juice?

ii) What is the probability that a tank of juice will yield sugar weighing between 9 kg. and 10.5
kg.?

E 15) A train is due to arrive at 5.30 p.m. but in practice is equally likely to arrive at any time between 2
minutes early and 30 minutes late. Let the time of arrival (expressed as minutes from due time) be X.
Sketch the pdf f(x) of the r.v. X and shade the areas given bellow

1) The probability that the train is less than 10 minutes late.

2) The probability that the train is late, but less than 16 minutes late.

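Solution: Here $a = -2$ and $b = 30$, so $f(x) = \dfrac{1}{30 - (-2)} = \dfrac{1}{32}$ for $-2 \le x \le 30$ and 0 otherwise.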
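
The two shaded areas in E 15 can be checked with a short sketch. The helper `uniform_prob` is a hypothetical name, and part 1 is read here as P(X < 10), which includes early arrivals; that reading is an assumption.

```python
def uniform_prob(lo, hi, a, b):
    """P(lo < X < hi) for X ~ Uniform[a, b]: interval length divided by (b - a)."""
    lo, hi = max(lo, a), min(hi, b)   # clip the interval to the support [a, b]
    return max(hi - lo, 0) / (b - a)

a, b = -2, 30                              # minutes from the due time
p_less_10 = uniform_prob(a, 10, a, b)      # less than 10 minutes late: 12/32
p_late_16 = uniform_prob(0, 16, a, b)      # late, but less than 16 minutes: 16/32

print(p_less_10, p_late_16)
```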

Normal Distribution
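Definition: A continuous random variable X is said to follow normal distribution with parameters µ
($-\infty < \mu < \infty$) and σ² (> 0) if its probability density function (pdf) is given by

$f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}, \quad -\infty < x < \infty$

In short, we write normal distribution as $X \sim N(\mu, \sigma^2)$ and read it as X follows normal distribution with
mean µ and variance σ².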

CHIEF CHARACTERISTICS OF NORMAL DISTRIBUTION


i) The curve of the normal distribution is bell-shaped as shown

ii) The curve of the distribution is completely symmetrical about x = µ, i.e. if we fold the curve at x = µ, both
the parts of the curve are the mirror images of each other.
iii) For normal distribution, Mean = Median = Mode
iv)
v) It is a continuous distribution

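Example 1: If the r.v. X is normally distributed with mean 80 and standard deviation 5, then find

(i) $P[X > 95]$, (ii) $P[X < 72]$, (iii) $P[60.5 < X < 90]$

Solution: Here we are given that X is normally distributed with mean 80 and standard deviation (S.D.) 5,
i.e. Mean = µ = 80 and S.D. = σ = 5.

If Z is the S.N.V. (standard normal variate), then $Z = \dfrac{X - \mu}{\sigma} = \dfrac{X - 80}{5}$

Now

(i) X = 95, $Z = \dfrac{95 - 80}{5} = 3$

$P[X > 95] = P[Z > 3] = 0.5 - P[0 < Z < 3] = 0.5 - 0.4987 = 0.0013$ [Using table area under normal curve]

(ii) X = 72, $Z = \dfrac{72 - 80}{5} = -1.6$

$P[X < 72] = P[Z < -1.6] = P[Z > 1.6] = 0.5 - P[0 < Z < 1.6]$
= 0.5 – 0.4452 [Using table area under normal curve]
= 0.0548

(iii) X = 60.5, $Z = \dfrac{60.5 - 80}{5} = -3.9$

X = 90, $Z = \dfrac{90 - 80}{5} = 2$

$P[60.5 < X < 90] = P[-3.9 < Z < 2] = P[0 < Z < 3.9] + P[0 < Z < 2]$
= 0.5000 + 0.4772
= 0.9772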
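
The table look-ups in Example 1 can be cross-checked numerically. The sketch below uses Python's standard `statistics.NormalDist`; the variable names are illustrative, and small differences from the table values are due to rounding in the table.

```python
from statistics import NormalDist

# X ~ N(mean = 80, S.D. = 5), as given in Example 1
X = NormalDist(mu=80, sigma=5)

p_i = 1 - X.cdf(95)              # P[X > 95]
p_ii = X.cdf(72)                 # P[X < 72]
p_iii = X.cdf(90) - X.cdf(60.5)  # P[60.5 < X < 90]

print(f"(i)   {p_i:.4f}")
print(f"(ii)  {p_ii:.4f}")
print(f"(iii) {p_iii:.4f}")
```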
6. A filling machine is set to pour 952 ml (millilitres) of oil into bottles. The amounts of fill are normally
distributed with a mean of 952 ml and a standard deviation of 4 ml. Use the standard normal table to
find the probability that a bottle contains between 952 and 956 ml of oil.
7. For each of these write down the equivalent standard normal probability.
a) The number of people who visit a historic monument in a week is normally distributed with a mean
of 10,500 and a standard deviation of 600. Consider the probability that fewer than 9000 people visit
in a week.
b) The number of cheques processed by a bank each day is normally distributed with a mean of 30,100
and a standard deviation of 2450. Consider the probability that the bank processes more than 32,000
cheques in a day.
Sampling Distribution
Population
A group of elements or units under study by an analyst is called Population. For example, the collection of
books in a library, the particles in a salt bag, the rivers in India, the students in a classroom, etc. are considered
as populations in Statistics.
The total number of elements / items / units / observations in a population is known as population size and
denoted by N. The characteristic under study may be denoted by X or Y.
Sample
A sample is a part / fraction / subset of the population. The procedure of drawing a sample from the population
is called sampling. The number of units selected in a sample is known as sample size and it is denoted by n.
Complete Survey and Sample Survey
(1) Complete Survey or Complete Enumeration or Census
When each and every element or unit of the population is investigated or studied for the characteristics under
study then we call it complete survey or census. For example, suppose we want to find out the average height
of the students of a study centre then if we measure the height of each and every student of this study centre to
find the average height of the students then such type of survey is called complete survey.
(2) Sample Survey or Sample Enumeration
When only a part or a small number of elements or units (i.e. sample) of population are investigated or studied
for the characteristics under study then we call it sample survey or sample enumeration. In the above
example, if we select some students of this study centre and measure the height to find average height of the
students then such type of survey is called sample survey.
Simple Random Sampling or Random Sampling
A sampling technique is said to be simple random sampling if the sample is drawn in such a way that each
element or unit of the population has an equal and independent chance of being included in the sample. If a
sample is drawn by this method then it is known as a simple random sample or random sample. The random
sample of size n is denoted by $X_1, X_2, \ldots, X_n$ and the observed value of this sample is denoted
by $x_1, x_2, \ldots, x_n$.

(1) Simple Random Sampling without Replacement (SRSWOR)


In simple random sampling, if the elements or units are selected or drawn one by one in such a way that an
element or unit drawn at a time is not replaced back to the population before the subsequent draws, the method is called
SRSWOR. If we draw a sample of size n from a population of size N without replacement then the total number of
possible samples is $\binom{N}{n}$ (also written ${}^{N}C_{n}$). For example, consider a population that consists of three elements, A, B and C.
Suppose we wish to draw a random sample of two elements then N = 3 and n = 2. The total number of possible
random samples without replacement is $\binom{3}{2} = 3$, namely (A, B), (A, C) and (B, C).
(2) Simple Random Sampling with Replacement (SRSWR)
In simple random sampling, if the elements or units are selected or drawn one by one in such a way that a unit
drawn at a time is replaced back to the population before the subsequent draw, the method is called SRSWR. In this
method, the same element or unit can appear more than once in the sample and the probability of selection of a
unit at each draw remains the same, i.e. 1/N. In this method, the total number of possible samples is $N^n$. In the above
example, the total number of possible random samples with replacement is $3^2 = 9$, namely (A, A), (A, B), (A,
C), (B, A), (B, B), (B, C), (C, A), (C, B) and (C, C).
Note: In SRSWR, at each of the 1st, 2nd, ..., nth draws the number of elements remains the same, N, due to
replacement, so by the rule of multiplication the total number of possible samples is $N^n$.
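
The two counting rules can be illustrated by listing the samples directly. This is a small sketch using Python's standard `itertools`; the variable names are illustrative only.

```python
from itertools import combinations, product

population = ["A", "B", "C"]   # N = 3
n = 2                          # sample size

# SRSWOR: order ignored, no unit repeated
srswor_samples = list(combinations(population, n))

# SRSWR: every draw can pick any of the N units
srswr_samples = list(product(population, repeat=n))

print(len(srswor_samples), srswor_samples)
print(len(srswr_samples), srswr_samples)
```

The same `product` call with a population of 6 units and `repeat=4` would list the $6^4$ samples of the next example, although for counting alone the formula is enough.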
Example 2: If population size is 6 then how many samples of size 4 are possible with replacement?
Solution: Here, we are given that Population size = N = 6 and Sample size = n = 4.
Since we know that all possible samples of size n taken from a population of size N with replacement are $N^n$, so
in our case $N^n = 6^4 = 1296$.
Parameter
A parameter is a function of population values which is used to represent the certain characteristic of the
population. For example, population mean, population variance, population coefficient of variation, population
correlation coefficient, etc. are all parameters. The population mean is usually denoted by µ and the population
variance by σ².
Statistic
Any quantity calculated from sample values that does not contain any unknown parameter is known as a statistic.
In other words, a statistic is a function of sample values and does not contain any unknown population parameter.
For example, if $X_1, X_2, \ldots, X_n$ is a random sample of size n taken from a population with mean µ and variance
σ² (both are unknown) then the sample mean $\bar{X}$ is a statistic whereas $\bar{X} - \mu$ and $\bar{X}/\sigma$ are not statistics
because both are functions of unknown parameters.
Sample Mean and Sample Variance

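If $X_1, X_2, \ldots, X_n$ is a random sample of size n taken from a population whose probability density (mass)
function is f(x, θ) then the sample mean is defined as

$\bar{X} = \dfrac{1}{n}\sum_{i=1}^{n} X_i$

and the sample variance is defined as

$S^2 = \dfrac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$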

Statistical Inference
Generally population parameters are unknown, and when the population is too large, or the units of the
population are destructive in nature, or there are limited resources and manpower available, it is not
possible practically to examine each and every unit of the population to obtain the population parameters. In
such situations, one can draw sample from the population under study and utilize sample observations to draw
reliable conclusions about the population parameters.
The technique of drawing the reliable conclusions about the population on the basis of the sample drawn from
the population is known as statistical inference. The statistical inference may be divided into two areas or
parts:
(i) The population parameters are unknown and we may want to guess the true value of the unknown
parameters on the basis of a random sample drawn from the population. This type of problem is known as
“Estimation”.
(ii) Some information is available about the population or parameter and we may like to verify whether the
information is true or not on the basis of a random sample drawn from the population. This type of
problem is known as “Testing of hypothesis”.
Sampling Distribution of Mean
A list of all possible values for a sample mean with probability associated with each value is called a sampling
distribution of the mean.
Consider a population comprising four typists who type the sample page of a manuscript. The number of errors
made by each typist is shown below:
Typist    Number of Errors
A         4
B         2
C         3
D         1

i) Calculate the population mean

ii) How many samples of size 2 are possible with replacement?

iii) Write all samples and calculate mean of each sample.

iv) Construct the sampling distribution of means.

v) Calculate the mean of the sampling distribution and compare it with the population mean.

Solution: The population mean (average number of errors) can be obtained as

$\mu = \dfrac{4 + 2 + 3 + 1}{4} = \dfrac{10}{4} = 2.5$

The number of possible samples of size 2 with replacement is $N^n = 4^2 = 16$.


The possible samples and the sample mean of each sample are shown in the following table:
Sample No.  Sample (Typists)  Sample Observation  Sample Mean
1 (A, A) (4, 4) 4.0
2 (A, B) (4, 2) 3.0
3 (A, C) (4, 3) 3.5
4 (A, D) (4, 1) 2.5
5 (B, A) (2, 4) 3.0
6 (B, B) (2, 2) 2.0
7 (B, C) (2, 3) 2.5
8 (B, D) (2, 1) 1.5
9 (C, A) (3, 4) 3.5
10 (C, B) (3, 2) 2.5
11 (C, C) (3, 3) 3.0
12 (C, D) (3, 1) 2.0
13 (D, A) (1, 4) 2.5
14 (D, B) (1, 2) 1.5
15 (D, C) (1, 3) 2.0
16 (D, D) (1, 1) 1.0

The sampling distribution of the sample mean is shown below:

S. No.  Sample Mean ($\bar{X}$)  Tally  Frequency (f)  Probability (p)
1       1.0                      1      1              1/16 = 0.0625
2       1.5                      11     2              2/16 = 0.1250
3       2.0                      111    3              3/16 = 0.1875
4       2.5                      1111   4              4/16 = 0.2500
5       3.0                      111    3              3/16 = 0.1875
6       3.5                      11     2              2/16 = 0.1250
7       4.0                      1      1              1/16 = 0.0625
Total                                   16

Mean of the sampling distribution
= (1.0×1 + 1.5×2 + 2.0×3 + 2.5×4 + 3.0×3 + 3.5×2 + 4.0×1)/16 = (1 + 3 + 6 + 10 + 9 + 7 + 4)/16 = 40/16 = 2.5,
which is equal to the population mean.
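
The whole construction for the typists' example can be reproduced by brute-force enumeration. The following is a sketch under the same setting (samples of size 2 with replacement); the variable names are illustrative, and exact fractions are used to avoid rounding.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

errors = {"A": 4, "B": 2, "C": 3, "D": 1}   # population of four typists
n = 2                                        # sample size, with replacement

samples = list(product(errors, repeat=n))                     # all N^n samples
means = [Fraction(sum(errors[t] for t in s), n) for s in samples]

# Sampling distribution: each distinct mean with its probability
freq = Counter(means)
dist = {m: Fraction(f, len(samples)) for m, f in sorted(freq.items())}

mean_of_means = sum(m * p for m, p in dist.items())
population_mean = Fraction(sum(errors.values()), len(errors))

for m, p in dist.items():
    print(float(m), freq[m], p)
print("mean of sampling distribution:", float(mean_of_means))
print("population mean:", float(population_mean))
```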

E) The ages of six executives of a company are

Name Age
Mr. Ravi 54
Mrs. Veena 50
Mrs. Shanti 52
Mr. Suresh 48
(i) How many samples of size 3 are possible without replacement?

(ii) Construct the sampling distribution of means by taking samples of size 3 and organise the data.

(iii) Calculate the mean of the sampling distribution and compare it with the population mean.

E4) If lives of 3 Televisions of certain company are 8, 6 and 10 years then construct the sampling distribution
of average life of Televisions by taking all samples of size 2.

Note: The mean of the sampling distribution of the mean is equal to the population mean, that is, $E(\bar{X}) = \mu$.
If the samples are drawn from a normal population with mean µ and variance σ² then the sampling distribution of
the mean is also a normal distribution with mean µ and variance σ²/n, that is, $\bar{X} \sim N(\mu, \sigma^2/n)$.

STANDARD ERROR
The standard deviation of a sampling distribution of a statistic is known as standard error and it is denoted by
SE. If $X_1, X_2, \ldots, X_n$ is a random sample of size n taken from a population with mean µ and variance σ² then
the standard error of the sample mean ($\bar{X}$) is given by

$SE(\bar{X}) = \dfrac{\sigma}{\sqrt{n}}\sqrt{\dfrac{N-n}{N-1}}$

where N is the population size, n is the sample size and σ is the population standard deviation.

If the population is infinite or very large then the SE of the sample mean ($\bar{X}$) is

$SE(\bar{X}) = \dfrac{\sigma}{\sqrt{n}}$

Example 3: Diameter of a steel ball bearing produced by a semi-automatic machine is known to be distributed
normally with mean 12 cm and standard deviation 0.1 cm. If we take a random sample of size 10 with
replacement then find standard error of sample mean for estimating the population mean of diameter of steel
ball bearing for whole population.
Solution: Here, we are given that µ = 12, σ = 0.1, n = 10.
Since the sampling is done with replacement, the standard error of the sample mean for estimating the
population mean is given by

$SE(\bar{X}) = \dfrac{\sigma}{\sqrt{n}} = \dfrac{0.1}{\sqrt{10}} \approx 0.0316$ cm
Example 1: Diameter of a steel ball bearing produced on a semi-automatic machine is known to be distributed
normally with mean 12 cm and standard deviation 0.1 cm. If we take a random sample of size 10 then find
(i) Mean and variance of sampling distribution of mean.
(ii) The probability that the sample mean lies between 11.95 cm and 12.05 cm.
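
A possible way to work the ball-bearing example numerically is sketched below. It assumes the result stated in the note above, that the sample mean of a normal population is itself normal with mean µ and variance σ²/n; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 12.0, 0.1, 10     # population mean, population S.D., sample size

mean_xbar = mu                   # mean of the sampling distribution of the mean
var_xbar = sigma**2 / n          # variance of the sampling distribution
se_xbar = sqrt(var_xbar)         # standard error = sigma / sqrt(n)

xbar = NormalDist(mean_xbar, se_xbar)
p_between = xbar.cdf(12.05) - xbar.cdf(11.95)   # P(11.95 < sample mean < 12.05)

print(f"mean = {mean_xbar}, variance = {var_xbar:.4f}, SE = {se_xbar:.4f}")
print(f"P(11.95 < mean < 12.05) = {p_between:.4f}")
```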
Sampling Distribution of Proportion
A list of all possible values of a sample proportion, with the probability associated with each value, is called a
sampling distribution of the proportion.
Example: Suppose there is a lot of 3 cartons A, B and C of electric bulbs and each carton contains 20 bulbs. The
number of defective bulbs in each carton is given below:
Carton Number of Defective Bulbs
A 2
B 4
C 1

(i) Calculate the population proportion of defective bulbs


(ii) How many samples of size 2 are possible with replacement?
(iii) Construct the sampling distribution of proportion by taking samples of size 2 and organise the data.
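(iv) Calculate the mean of the sampling distribution of proportion and compare it with the population proportion.
Solution: The population proportion of defective bulbs can be obtained as

$P = \dfrac{2 + 4 + 1}{20 + 20 + 20} = \dfrac{7}{60} \approx 0.1167$

The number of possible samples of size 2 with replacement is $N^n = 3^2 = 9$. Each sample of 2 cartons contains 40 bulbs: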

Sample Sample Carton Sample Observation Sample Proportion(p)


1 (A, A) (2, 2) 4/40
2 (A, B) (2, 4) 6/40
3 (A, C) (2, 1) 3/40
4 (B, A) (4, 2) 6/40
5 (B, B) (4, 4) 8/40
6 (B, C) (4, 1) 5/40
7 (C, A) (1, 2) 3/40
8 (C, B) (1, 4) 5/40
9 (C, C) (1, 1) 2/40

Since there are 9 possible samples therefore the probability of selecting a sample is 1/9. Then we arrange the
possible sample proportion with their respective probability in Table 2.3 given in next page:

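S.No.  Sample Proportion (p)  Frequency (f)  Probability  pf
1      2/40                   1              1/9          2*1/40 = 2/40
2      3/40                   2              2/9          3*2/40 = 6/40
3      4/40                   1              1/9          4*1/40 = 4/40
4      5/40                   2              2/9          5*2/40 = 10/40
5      6/40                   2              2/9          6*2/40 = 12/40
6      8/40                   1              1/9          8*1/40 = 8/40
Total                         9                           42/40
Mean of the sampling distribution = (42/40)/9 = 42/360 = 7/60 ≈ 0.1167
= Population proportion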
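
The carton example can also be verified by enumeration. The sketch below assumes samples of 2 cartons drawn with replacement, as in the example; names are illustrative, and `Fraction` reduces values such as 5/40 to 1/8 automatically.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

defectives = {"A": 2, "B": 4, "C": 1}   # defective bulbs per carton
bulbs_per_carton = 20
n = 2                                    # cartons per sample, with replacement

samples = list(product(defectives, repeat=n))
props = [Fraction(sum(defectives[c] for c in s), n * bulbs_per_carton)
         for s in samples]

freq = Counter(props)
dist = {p: Fraction(f, len(samples)) for p, f in sorted(freq.items())}

mean_of_props = sum(p * pr for p, pr in dist.items())
population_prop = Fraction(sum(defectives.values()),
                           bulbs_per_carton * len(defectives))

print("mean of sampling distribution:", mean_of_props)
print("population proportion:", population_prop)
```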
This distribution is called the sampling distribution of sample proportion. Thus, we can define the sampling
distribution of sample proportion as:

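Note: The mean of the sampling distribution of the proportion is equal to the population proportion, that is, $E(p) = P$.

If the sample size is large (n ≥ 30), the sampling distribution of the proportion is also approximately normal with mean
P and variance $\dfrac{PQ}{n}$, that is, $p \sim N\left(P, \dfrac{PQ}{n}\right)$, where Q = 1 − P.

The standard error of the sample proportion p is given by

$SE(p) = \sqrt{\dfrac{PQ}{n}}\sqrt{\dfrac{N-n}{N-1}}$

where N is the population size, n is the sample size and P is the population proportion.

If the population is infinite or very large then the SE of the sample proportion is given by

$SE(p) = \sqrt{\dfrac{PQ}{n}}$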

Example 3: A machine produces a large number of items of which 15% are found to be defective. If a random
sample of 200 items is taken from the population and sample proportion is calculated then find
(i) Mean and standard error of sampling distribution of proportion.

(ii) The probability that less than or equal to 12% defectives are found in the sample.
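
One way to approach this example is through the large-sample normal approximation stated above. The sketch assumes an effectively infinite population (so no finite population correction) and applies no continuity correction; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

P, n = 0.15, 200                 # population proportion and sample size
Q = 1 - P

mean_p = P                       # mean of the sampling distribution of p
se_p = sqrt(P * Q / n)           # standard error of the sample proportion

# Normal approximation: p ~ N(P, PQ/n)
p_le_12 = NormalDist(mean_p, se_p).cdf(0.12)

print(f"mean = {mean_p}, SE = {se_p:.4f}")
print(f"P(p <= 0.12) = {p_le_12:.4f}")
```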
Estimator and Estimate
Generally, population parameters are unknown and the whole population is too large to find out the parameters.
Since the sample drawn from a population always contains some or more information about the population,
therefore in such situations, we guess or estimate the value of the parameter under study based on a random
sample drawn from that population.
Any statistic which is used to estimate an unknown population parameter is known as an estimator, and the
value of the estimator based on the observed value of the sample is known as an estimate of the parameter. For example,
if we want to estimate the average height (µ) of students in a college with the help of the sample mean then $\bar{X}$
is the estimator and its particular value, say, 165 cm is the estimate of the population average height µ.

Estimation (Short Note)


In many real-life problems, the population parameter(s) (a population characteristic such as the average income of the
persons of a state) is (are) unknown and someone is interested in obtaining the value of the parameter. But, if the
whole population is too large to study, or the units of the population are destructive in nature, or there are
limited resources and manpower available, then it is not practically convenient to examine each and every unit
of the population to find the value of the parameter. In such situations, we can draw a sample from the population
under study and utilize sample observations to find/estimate the parameter.
The technique of finding the unknown parameter with the help of sample observations is called Estimation.
Estimation is categorised into two categories namely:
Point estimation
If we find a single value with the help of sample observations which is taken as the estimate value of unknown
parameter then this value is known as point estimate and the technique of estimating the unknown parameter
with a single value is known as “point estimation”. For example, if we want to estimate the average height (µ)
of students in a college with the help of the sample mean then $\bar{X}$ is the estimator and its particular value,
say, 165 cm is the estimate of the population average height µ.
Interval estimation
If we compute an interval on the basis of sample observations, which will contain the parameter with certain
probability (confidence) then this interval is known as interval estimate of the parameter and this technique of
estimating is known as “interval estimation”. This is also called Confidence interval.
For example, if we estimate the average weight of men living in a colony on the basis of sample mean, say, 62
kg then 62 kg is called point estimate of average weight of men in the colony and this procedure is called as
point estimation. If we estimate the average weight of men by an interval, say, [50, 100] with 90% confidence
that the true value of the weight lies in this interval then this interval is called an interval estimate and this procedure is
called as interval estimation.
Criteria (properties) of Good Estimator
For a parameter there may exist more than one estimator. For example, for estimating the population mean, the sample
mean $\bar{X}$, (Xmax + Xmin)/2, the sample median, etc. are all estimators. So the question may arise which one is a good
estimator. An estimator is said to be good if it has the following properties: (i) unbiasedness and (ii) efficiency.
Unbiasedness
An estimator (T) is said to be unbiased for the population parameter (θ) if and only if the average or mean of
the sampling distribution of the estimator is equal to the true value of the parameter, that is, if

$E(T) = \theta$

This property of an estimator is called unbiasedness.

But if the expected value of the estimator is not equal to the true value of the parameter, then the estimator is
said to be a “biased estimator”, that is, if

$E(T) \ne \theta$

then the estimator T is called a biased estimator of θ.


Note:
1.Sample mean is an unbiased estimator for the population mean.
2.Sample proportion is an unbiased estimator for the population proportion.
Example 2: A random sample of 10 cadets of a centre is selected and their weights (in kg) are measured, which are
given below:

48, 50, 62, 75, 80, 60, 70, 56, 52, 78

Determine an unbiased estimate of the average weight of cadets of the centre.


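Solution: We know that the sample mean $\bar{X}$ is an unbiased estimator of the population mean and its particular
value is the unbiased estimate of the population mean, therefore,

$\bar{X} = \dfrac{1}{n}\sum_{i=1}^{n} X_i = \dfrac{48 + 50 + 62 + 75 + 80 + 60 + 70 + 56 + 52 + 78}{10} = \dfrac{631}{10} = 63.10$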

Hence, an unbiased estimate of the average weight of cadets of the centre is 63.10 kg.
Efficiency
An unbiased estimator T1 of a parameter θ is said to be more efficient than another estimator T2 of θ if
$Var(T_1) < Var(T_2)$ for all n.
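Confidence interval for the mean when variance is known

$\bar{X} \pm z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}$

Confidence interval for the mean when variance is unknown

$\bar{X} \pm t_{(n-1),\,\alpha/2}\,\dfrac{S}{\sqrt{n}}$

Confidence interval for Population proportion

$p \pm z_{\alpha/2}\sqrt{\dfrac{p(1-p)}{n}}$

Notation:

             Population   Sample
Mean         µ            $\bar{X}$
SD           σ            S
Proportion   P            p
Size         N            n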

Example 1: The mean life of the tyres manufactured by a company follows normal distribution with standard
deviation 3200 kms. A sample of 250 tyres is taken and it is found that the average life of the tyres is 50000
kms with a standard deviation of 3500 kms. Establish the 99% confidence interval within which the mean life
of tyres of the company is expected to lie.
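Solution: Here, we are given that n = 250, $\bar{X}$ = 50000, σ = 3200 and S = 3500.
Since the population standard deviation, i.e., the population variance σ², is known, we use the confidence limits

$\bar{X} \pm z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}$

For 99% confidence interval, we have 1 − α = 0.99, i.e. α = 0.01. For α = 0.01 we have $z_{\alpha/2} = z_{0.005} = 2.58$.

Therefore, by putting the values of n, $\bar{X}$ and σ, the 99% confidence limits are

$50000 \pm 2.58 \times \dfrac{3200}{\sqrt{250}} = 50000 \pm 2.58 \times 202.39 \approx 50000 \pm 522.2$

or 49477.8 and 50522.2.

Hence, the 99% confidence interval within which the mean life of tyres of the company is expected to lie is
[49477.8, 50522.2].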
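
The known-variance interval for the tyre example can be computed as sketched below. The helper `z_confidence_interval` is a hypothetical name; it takes the critical value from `statistics.NormalDist` (about 2.576) rather than the rounded table value 2.58, so the limits may differ slightly from a hand calculation.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(xbar, sigma, n, confidence):
    """Confidence limits for the mean when the population S.D. is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    margin = z * sigma / sqrt(n)
    return xbar - margin, xbar + margin

# Tyre example: sample mean 50000 km, sigma 3200 km, n = 250, 99% confidence
lower, upper = z_confidence_interval(50000, 3200, 250, 0.99)
print(f"[{lower:.1f}, {upper:.1f}]")
```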

Example 2: It is known that the average weight of students of a Study Centre of IGNOU follows normal
distribution. To estimate the average weight, a sample of 10 students is taken from this Study Centre and
obtained mean and SD as 63 and 11.79, respectively. Compute the 95% confidence interval for the average
weight of students of Study Centre of IGNOU.
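Solution: Here, we are given that n = 10, $\bar{X}$ = 63 and S = 11.79. Since the population variance is unknown, the
confidence limits for the average weight of students of the Study Centre are given by

$\bar{X} \pm t_{(n-1),\,\alpha/2}\,\dfrac{S}{\sqrt{n}}$

For 95% confidence interval, we have 1 − α = 0.95, i.e. α = 0.05. Also from the t-table, we have $t_{(9),\,0.025} = 2.262$.

Thus, the 95% confidence limits are

$63 \pm 2.262 \times \dfrac{11.79}{\sqrt{10}} = 63 \pm 2.262 \times 3.728 \approx 63 \pm 8.43$

Hence, the required 95% confidence interval for the average weight of students of the Study Centre of IGNOU is
[54.57, 71.43].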
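
A short sketch of the unknown-variance interval follows. Python's standard library has no t-distribution quantile function, so the critical value 2.262 (9 degrees of freedom, two-sided 95%) is entered by hand from a t-table; the function name is hypothetical.

```python
from math import sqrt

T_CRIT_DF9_95 = 2.262   # t-table value for 9 degrees of freedom, alpha/2 = 0.025

def t_confidence_interval(xbar, s, n, t_crit):
    """Confidence limits for the mean when the population variance is unknown."""
    margin = t_crit * s / sqrt(n)
    return xbar - margin, xbar + margin

# Study Centre example: n = 10, sample mean 63, sample SD 11.79
lower, upper = t_confidence_interval(63, 11.79, 10, T_CRIT_DF9_95)
print(f"[{lower:.2f}, {upper:.2f}]")
```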

Example 4: A sample of 200 voters is chosen at random from all voters in a given city. 60% of them were in
favour of a particular candidate. If large number of voters cast their votes then find 99% and 95% confidence
intervals for the proportion of voters in favour of a particular candidate.
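Solution: Here, we are given
n = 200, p = 0.60, so q = 1 − p = 0.40.

Confidence limits for the proportion are

$p \pm z_{\alpha/2}\sqrt{\dfrac{pq}{n}}$, where $\sqrt{\dfrac{pq}{n}} = \sqrt{\dfrac{0.60 \times 0.40}{200}} \approx 0.0346$

For 99% confidence interval, we have α = 0.01, and for 95% confidence interval, α = 0.05. For α = 0.01, we have
$z_{\alpha/2} = 2.58$ and for α = 0.05, $z_{\alpha/2} = 1.96$.
Therefore, 99% confidence limits for the proportion of voters in favour of a particular candidate are

$0.60 \pm 2.58 \times 0.0346 = 0.60 \pm 0.089$

Hence, the required 99% confidence interval is [0.51, 0.69].

Similarly, 95% confidence limits are

$0.60 \pm 1.96 \times 0.0346 = 0.60 \pm 0.068$

Hence, the 95% confidence interval is [0.53, 0.67].
(If the standard error is rounded to 0.03, the intervals come out slightly narrower: [0.52, 0.68] and [0.54, 0.66].)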
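
The proportion intervals for the voter example can be recomputed as below. This is a sketch: `proportion_confidence_interval` is an illustrative name, the standard error is not rounded, and the critical values come from `statistics.NormalDist`, so the limits can differ in the second decimal place from hand calculations that round the standard error.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(p, n, confidence):
    """Large-sample confidence limits for a population proportion."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    margin = z * sqrt(p * (1 - p) / n)
    return p - margin, p + margin

# Voter example: n = 200, sample proportion 0.60
lo99, hi99 = proportion_confidence_interval(0.60, 200, 0.99)
lo95, hi95 = proportion_confidence_interval(0.60, 200, 0.95)
print(f"99%: [{lo99:.2f}, {hi99:.2f}]")
print(f"95%: [{lo95:.2f}, {hi95:.2f}]")
```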

TESTING OF HYPOTHESIS

Hypothesis
In our day-to-day life, we see different commercial advertisements on television and in newspapers, magazines, etc.,
and if someone is interested in testing such claims or statements then we come across the problem of
testing of hypothesis. For example,
(i) a customer of a motorcycle wants to test whether the claim that a motorcycle of a certain brand gives an average
mileage of 60 km/liter is true or false,
(ii) the businessman of banana wants to test whether the average weight of a banana of Kerala is more than
200 gm,
(iii) a doctor wants to test whether new medicine is really more effective for controlling high blood pressure
than old medicine,
(iv) an economist wants to test whether the variability in incomes differ in two populations,
(v) a psychologist wants to test whether the proportion of literates between two groups of people is same, etc.
In all the cases discussed above, the decision maker is interested in making inference about the population
parameter(s). Here we are interested in testing a claim or statement or assumption about the value of population
parameter(s). Such claim or statement is postulated in terms of hypothesis.
In statistics, a hypothesis is a statement or a claim or an assumption about the value of a population
parameter (e.g., mean, median, variance, proportion, etc.).
Similarly, in case of two or more populations a hypothesis is comparative statement or a claim or an
assumption about the values of population parameters. (e.g., means of two populations are equal, variance of
one population is greater than other, etc.). The plural of hypothesis is hypotheses.
In hypothesis testing problems, first of all we should begin by identifying the claim or statement or assumption or
hypothesis to be tested and write it in words. Once the claim has been identified then we write it in
symbolic form if possible. As in the above examples,
(i) The customer of the motorcycle may write the claim or postulate the hypothesis “the motorcycle of a certain brand
gives an average mileage of 60 km/liter.” Here, we are concerned with the average mileage of the motorcycle,
so let µ represent the average mileage; then our hypothesis becomes µ = 60 km/liter.
(ii) Similarly, the banana businessman may write the statement or postulate the hypothesis “the average
weight of a banana of Kerala is greater than 200 gm.” So our hypothesis becomes µ > 200 gm.
(iii) The doctor may write the claim or postulate the hypothesis “the new medicine is really more effective for
controlling blood pressure than the old medicine.” Here, we are concerned with the average effect of the
medicines, so let µ1 and µ2 represent the average effect of the new and old medicines respectively on
controlling blood pressure; then our hypothesis becomes µ1 < µ2.
(iv) The economist may write the statement or postulate the hypothesis “the variability in incomes differs in two
populations.” Here, we are concerned with the variability in income, so let $\sigma_1^2$ and $\sigma_2^2$ represent the variability
in incomes in the two populations respectively; then our hypothesis becomes $\sigma_1^2 \ne \sigma_2^2$.
(v) The psychologist may write the statement or postulate the hypothesis “the proportion of literates between two
groups of people is the same.” Here, we are concerned with the proportion of literates, so let P1 and P2 represent
the proportions of literates of the two groups of people respectively; then our hypothesis becomes P1 = P2 or P1
– P2 = 0.
The hypothesis is classified according to its nature and usage as we will discuss in subsequent subsections.
Null and Alternative Hypotheses
As discussed earlier, in hypothesis testing problems we first identify the claim or
statement to be tested and write it in symbolic form. After that we write the complement or opposite of the
claim or statement in symbolical form. In our example of motorcycle, the claim is µ = 60 km/liter then its
complement is µ ≠ 60 km/liter. In (ii) the claim is µ > 200 gm then its complement is µ ≤ 200 gm. If the claim

is µ < 200 gm then its complement is µ ≥ 200 gm. The claim and its complement are formed in such a way that
they cover all possibility of the value of population parameter.
Once the claim and its complement have been established, we decide which of these two is the null
hypothesis and which is the alternative hypothesis. The thumb rule is that the statement containing equality is
the null hypothesis. That is, the hypothesis which contains the symbols =, ≤ or ≥ is taken as the null hypothesis and
the hypothesis which does not contain equality, i.e. contains ≠, < or >, is taken as the alternative hypothesis. The
null hypothesis is denoted by H0 and the alternative hypothesis is denoted by H1 or HA.
Note: We state the null and alternative hypotheses in such a way that they cover all possibilities of the
value of the population parameter.
In our example of motorcycle, the claim is µ = 60 km/liter and its complement is µ ≠ 60 km/liter. Since the claim µ
= 60 km/liter contains the equality sign, we take it as the null hypothesis and the complement µ ≠ 60 km/liter as the
alternative hypothesis, that is,
H0: µ = 60 km/liter and H1: µ ≠ 60 km/liter
In our second example of banana, the claim is µ > 200 gm and its complement is µ ≤ 200 gm. Since
complement µ ≤ 200 gm contains equality sign so we take complement as a null hypothesis and claim µ > 200
gm as an alternative hypothesis, that is,
H0: µ ≤ 200 gm and H1: µ > 200 gm
Formally these hypotheses are defined as
The hypothesis which we wish to test is called as the null hypothesis.
According to Prof. R.A. Fisher,
“A null hypothesis is a hypothesis which is tested for possible rejection under the assumption that it is
true.”
The hypothesis which complements to the null hypothesis is called alternative hypothesis.
Note 1: Some authors use equality sign (=) in null hypothesis instead of ≥ and ≤ signs.
TYPE-I AND TYPE-II ERRORS
We have a rule that if the value of test statistic falls in rejection (critical) region then we reject the null
hypothesis and if it falls in the non-rejection region then we do not reject the null hypothesis. A test statistic is
calculated on the basis of observed sample observations. But a sample is a small part of the population about
which decision is to be taken. A random sample may or may not be a good representative of the population.
Sometimes a sample misleads conclusion relating to the null hypothesis. So we can commit two kinds of errors
while testing a hypothesis which are summarised in the following table:
Decision H0 True H1 True
Reject H0 Type-I Error Correct Decision
Do not reject H0 Correct Decision Type-II Error

Let us take a situation where a patient suffering from high fever reaches to a doctor. And suppose the doctor
formulates the null and alternative hypotheses as
H0: The patient is a malaria patient
H1: The patient is not a malaria patient
Then following cases arise:
Case I: Suppose that the hypothesis H0 is really true, that is, the patient is actually a malaria patient, and after
observation, pathological and clinical examination, the doctor rejects H0, that is, he / she declares him
/ her a non-malaria patient. It is not a correct decision and he / she commits an error in decision
known as type-I error.
Case II: Suppose that the hypothesis H0 is actually false, that is, the patient is actually a non-malaria patient, and
after observation, the doctor rejects H0, that is, he / she declares him / her a non-malaria patient. It is
a correct decision.
Case III: Suppose that the hypothesis H0 is really true, that is, the patient is actually a malaria patient, and after
observation, the doctor does not reject H0, that is, he / she declares him / her a malaria patient. It is a
correct decision.
Case IV: Suppose that the hypothesis H0 is actually false, that is, the patient is actually a non-malaria patient, and
after observation, the doctor does not reject H0, that is, he / she declares him / her a malaria patient. It
is not a correct decision and he / she commits an error in decision known as type-II error.

Z-test (important for short note)

Z-test is used for testing the hypothetical value of a mean or the difference of two means when the sample size is large
(n ≥ 30) or the population variance is known.

This test has the following steps:

Step I: First of all, we have to set up the null hypothesis H0 and the alternative hypothesis H1 as

$H_0: \mu = \mu_0$ and $H_1: \mu \ne \mu_0$ (two-tailed test)

or

$H_0: \mu \le \mu_0$ and $H_1: \mu > \mu_0$, or $H_0: \mu \ge \mu_0$ and $H_1: \mu < \mu_0$ (one-tailed test)

Step II: After setting the null and alternative hypotheses, we decide the level of significance (α), at which
we want to test our hypothesis. Generally, it is taken as 5% or 1% (α = 0.05 or 0.01).
Step III: For testing the null hypothesis, the test statistic is given below:

$Z = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$

Step IV: Calculate the value of the test statistic described in Step III on the basis of observed sample
observations.
Step V: Obtain the critical (or cut-off) or tabulated value using Z-table.
Step VI: After that, compare the calculated value of the test statistic obtained from Step IV with the tabulated or
critical value(s) obtained in Step V. If the calculated value falls in the rejection region (for a two-tailed test,
if $|Z_{cal}| \ge z_{\alpha/2}$) then we reject the null hypothesis, otherwise
we may accept the null hypothesis.
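
The steps above can be sketched for the two-tailed case as follows. The function name `z_test_two_tailed` is illustrative, and the numbers are those of the light bulb example given later in these notes (claimed mean 1200 hours, σ = 100 hours, n = 50, sample mean 1180 hours).

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_tailed(xbar, mu0, sigma, n, alpha=0.05):
    """Return (test statistic, critical value, reject H0?) for H0: mu = mu0."""
    z = (xbar - mu0) / (sigma / sqrt(n))          # Step III and IV
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # Step V: z_{alpha/2}
    return z, z_crit, abs(z) >= z_crit            # Step VI

z, z_crit, reject = z_test_two_tailed(xbar=1180, mu0=1200, sigma=100, n=50)
print(f"Z = {z:.3f}, critical value = {z_crit:.2f}, reject H0: {reject}")
```

For a one-tailed test only the comparison in the last step changes: the calculated Z is compared with $z_\alpha$ (right-tailed) or $-z_\alpha$ (left-tailed) instead of using the absolute value.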

t-test (important for short note)

t-test is used for testing the hypothetical value of a mean or the difference of two means when the sample size is small
(n < 30) or the population variance is unknown.

This test has the following steps:

Step I: First of all, we have to set up the null hypothesis H0 and the alternative hypothesis H1 as

$H_0: \mu = \mu_0$ and $H_1: \mu \ne \mu_0$ (two-tailed test)

or

$H_0: \mu \le \mu_0$ and $H_1: \mu > \mu_0$, or $H_0: \mu \ge \mu_0$ and $H_1: \mu < \mu_0$ (one-tailed test)

Step II: After setting the null and alternative hypotheses, we decide the level of significance (α), at which
we want to test our hypothesis. Generally, it is taken as 5% or 1% (α = 0.05 or 0.01).
Step III: For testing the null hypothesis, the test statistic is given below:

$t = \dfrac{\bar{X} - \mu_0}{S/\sqrt{n}}$, which follows the t-distribution with (n − 1) degrees of freedom,

where $\bar{X}$ is the sample mean and S is the sample SD.

Step IV: Calculate the value of the test statistic described in Step III on the basis of observed sample
observations.
Step V: Obtain the critical (or cut-off) or tabulated value using t-table.
Step VI: After that, compare the calculated value of the test statistic obtained from Step IV with the tabulated or
critical value(s) obtained in Step V. If the calculated value falls in the rejection region (for a two-tailed test,
if $|t_{cal}| \ge t_{(n-1),\,\alpha/2}$) then we reject the null hypothesis, otherwise we
may accept the null hypothesis.
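
A minimal sketch of the one-sample t-test follows. It reuses the cadet weights from the unbiasedness example; the hypothesised mean of 60 kg is purely hypothetical, and the critical value 2.262 (9 degrees of freedom, two-tailed 5% level) is entered by hand from a t-table because the standard library has no t quantile function.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    """t statistic for H0: mu = mu0, using the sample SD with divisor n - 1."""
    n = len(sample)
    return (mean(sample) - mu0) / (stdev(sample) / sqrt(n))

weights = [48, 50, 62, 75, 80, 60, 70, 56, 52, 78]   # cadet weights in kg
MU0 = 60          # hypothetical claimed mean, for illustration only
T_CRIT = 2.262    # t-table value for n - 1 = 9 d.f., alpha/2 = 0.025

t_stat = one_sample_t(weights, MU0)
reject = abs(t_stat) >= T_CRIT
print(f"t = {t_stat:.4f}, reject H0: {reject}")
```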

PAIRED t-TEST
The paired t-test is used when two samples are not independent and observations are recorded on the same individuals or items.
Generally, such types of observations are recorded to assess the effectiveness of a particular training, diet,
treatment, medicine, etc. In such situations, the observations are recorded "before and after" the insertion of
training, treatment, etc. as the case may be. For that we use paired t-test.
Let (X1, Y1), (X2, Y2), …, (Xn, Yn) be a paired random sample of size n and the difference between paired
observations Xi & Yi be denoted by Di, that is,

$D_i = X_i - Y_i$, i = 1, 2, …, n

This test has following steps:


Step I: First of all, we have to setup null hypothesis H0 and alternative hypothesis H1 as

$H_0: \mu_D = 0$

and the alternative hypothesis

$H_1: \mu_D \ne 0$ (two-tailed)

or

$H_1: \mu_D > 0$ or $H_1: \mu_D < 0$ (one-tailed)

Step II: After setting the null and alternative hypotheses, we decide the level of significance (α), at which
we want to test our hypothesis. Generally, it is taken as 5% or 1% (α = 0.05 or 0.01).
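Step III: For testing the null hypothesis, the test statistic is given below:

$t = \dfrac{\bar{D}}{S_D/\sqrt{n}} \sim t_{(n-1)}$ under $H_0$

where,

$\bar{D} = \dfrac{1}{n}\sum_{i=1}^{n} D_i$ and $S_D^2 = \dfrac{1}{n-1}\sum_{i=1}^{n}(D_i - \bar{D})^2$
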
Step IV: Calculate the value of the test statistic described in Step III on the basis of observed sample
observations.
Step V: Obtain the critical (or cut-off) or tabulated value using t-table.
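Step VI: After that, compare the calculated value of test statistic obtained from Step IV, with the tabulated or
critical value(s) obtained in Step V. If the calculated value falls in the rejection region (for a two-tailed test,
if $|t| \ge t_{(n-1),\,\alpha/2}$) then we reject the null hypothesis, otherwise we may accept the null hypothesis.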

Critical Values for Z-test

Level of Significance (α)    Two-Tailed Test    Right-Tailed Test    Left-Tailed Test
α = 0.05 (= 5%)              zα/2 = 1.96        zα = 1.645           −zα = −1.645
α = 0.01 (= 1%)              zα/2 = 2.58        zα = 2.33            −zα = −2.33
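
The tabulated values above can be reproduced from the standard normal distribution. The following is a minimal sketch using only the Python standard library; the function name `z_critical` and its `tail` argument are illustrative assumptions, not part of the course material.

```python
from statistics import NormalDist


def z_critical(alpha, tail="two"):
    """Critical value of the standard normal distribution for a given level."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "two":
        return std_normal.inv_cdf(1 - alpha / 2)  # reject H0 if |z| >= this value
    if tail == "right":
        return std_normal.inv_cdf(1 - alpha)      # reject H0 if z >= this value
    if tail == "left":
        return std_normal.inv_cdf(alpha)          # reject H0 if z <= this value
    raise ValueError("tail must be 'two', 'right' or 'left'")


for alpha in (0.05, 0.01):
    print(alpha,
          round(z_critical(alpha, "two"), 3),
          round(z_critical(alpha, "right"), 3),
          round(z_critical(alpha, "left"), 3))
```

Rounded to the precision used in the table, the computed values agree with the tabulated ones; note that the left-tailed critical values are negative.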

Example 1: A light bulb company claims that the 100-watt light bulb it sells has an average life of 1200 hours
with a standard deviation of 100 hours. For testing the claim 50 new bulbs were selected randomly and allowed
to burn out. The average lifetime of these bulbs was found to be 1180 hours. Is the company's claim true at
5% level of significance?
Solution: Here, we are given that
Specified value of population mean = µ0 = 1200 hours,
Population standard deviation = σ = 100 hours,
Sample size = n = 50
Sample mean = $\bar{X}$ = 1180 hours.
In this example, the population parameter being tested is population mean i.e. average life of a bulb (µ) and we
want to test the company's claim that average life of a bulb is 1200 hours. So our claim is µ = 1200 and its
complement is µ ≠ 1200. Since claim contains the equality sign so we can take the claim as the null hypothesis
and complement as the alternative hypothesis. So
$H_0: \mu = \mu_0 = 1200$ (claim)

$H_1: \mu \ne 1200$ (two tailed)
Also the alternative hypothesis is two-tailed so the test is two-tailed test.
Here, we want to test the hypothesis regarding mean when population SD (variance) is known and sample size n
= 50(> 30) is large. So we will go for Z-test.
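Thus, for testing the null hypothesis the test statistic is given by

$Z = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} = \dfrac{1180 - 1200}{100/\sqrt{50}} = \dfrac{-20}{14.14} = -1.41$

The critical (tabulated) values for two-tailed test at 5% level of significance are ± zα/2 = ± z0.025 = ± 1.96.

Since the calculated value of test statistic Z (= −1.41) lies between the critical values −1.96 and 1.96, we do not
reject the null hypothesis. Since the null hypothesis is the claim so we support the claim at 5% level of significance.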
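
The arithmetic of Example 1 can be checked with a few lines of Python. This is only a sketch: the helper name `z_statistic` is an assumption made for illustration, and the critical value 1.96 is taken from the table of critical values for the Z-test.

```python
import math


def z_statistic(sample_mean, mu0, sigma, n):
    """Z = (sample mean - hypothesised mean) / (sigma / sqrt(n))."""
    return (sample_mean - mu0) / (sigma / math.sqrt(n))


# Figures from Example 1 (light bulbs)
z = z_statistic(sample_mean=1180, mu0=1200, sigma=100, n=50)
z_crit = 1.96                  # two-tailed critical value at the 5% level
reject_h0 = abs(z) >= z_crit   # two-tailed decision rule

print(round(z, 2), reject_h0)
```

Because |Z| is smaller than 1.96, `reject_h0` is `False`, in line with the conclusion of the example.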
Example 2: A manufacturer claims that a special type of projector bulb has an average life 160 hours. To check
this claim an investigator takes a sample of 20 such bulbs, puts on the test, and obtains an average life 167 hours
with standard deviation 16 hours. Assuming that the life time of such bulbs follows normal distribution; does
the investigator accept the manufacturer’s claim at 5% level of significance?
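
One possible way to set up Example 2 is the one-sample t-test described earlier. The sketch below assumes that the quoted standard deviation (16 hours) is the sample SD computed with divisor n − 1, and it takes the two-tailed critical value 2.093 for 19 degrees of freedom from a standard t-table; the function name `t_statistic` is hypothetical.

```python
import math


def t_statistic(sample_mean, mu0, sample_sd, n):
    """t = (sample mean - hypothesised mean) / (S / sqrt(n)), with n - 1 df."""
    return (sample_mean - mu0) / (sample_sd / math.sqrt(n))


# Figures from Example 2 (projector bulbs); H0: mu = 160 against H1: mu != 160
n = 20
t = t_statistic(sample_mean=167, mu0=160, sample_sd=16, n=n)
df = n - 1
t_crit = 2.093                 # t-table value for 19 df, two-tailed, 5% level
reject_h0 = abs(t) >= t_crit

print(round(t, 3), df, reject_h0)
```

Under these assumptions the calculated t is below the tabulated value, so the manufacturer's claim would not be rejected at the 5% level.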

Example 3: The mean share price of companies of Pharma sector is Rs.70. The share prices of all companies
change from time to time. After a month, a sample of 10 Pharma companies was taken and their share prices
were noted as below:
70, 76, 75, 69, 70, 72, 68, 65, 75, 72
Assuming that the distribution of share prices follows normal distribution, test whether mean share price is still
the same at 1% level of significance?

Example 4: A manufacturer of ball point pens claims that a certain pen manufactured by him has a mean
writing-life of at least 460 A-4 size pages. A purchasing agent selects a sample of 100 pens and puts them on
test. The mean writing-life of the sample is found to be 453 A-4 size pages with standard deviation 25 A-4 size pages.
Should the purchasing agent reject the manufacturer's claim at 1% level of significance?
Example 5: In two samples of women from Punjab and Tamilnadu, the mean height of 1000 and 2000 women
are 67.6 and 68.0 inches respectively. If population standard deviation of Punjab and Tamilnadu are same and
equal to 5.5 inches then, can the mean heights of Punjab and Tamilnadu women be regarded as same at 1%
level of significance?
Example 6: In a large population 30% of a random sample of 1200 persons had blue eyes and 20% of a random
sample of 900 persons had blue eyes in another population. Test whether the proportion of blue-eyed persons is
the same in the two populations at 5% level of significance.

Example 7: Out of 200 patients who are given a particular injection 180 survived. Test the hypothesis that the
survival rate is more than 80% at 5% level of significance.

Example 7: A tyre manufacturer claims that the average life of a particular category of his tyre is 18000 km
when used under normal driving conditions. A random sample of 16 tyres was tested. The mean and SD of life
of the tyres in the sample were 20000 km and 6000 km respectively. Assuming that the life of the tyres is
normally distributed, test the claim of the manufacturer at 1% level of significance using appropriate test.
Example 8: In a random sample of 10 pigs fed by diet A, the gain in weights (in pounds) in a certain period were
12, 8, 14, 16, 13, 12, 8, 14, 10, 9
In another random sample of 10 pigs fed by diet B, the gain in weights (in pounds) in the same period were
14, 13, 12, 15, 16, 14, 18, 17, 21, 15
Assuming that gain in the weights due to both foods follows normal distributions with equal variances, test
whether diets A and B differ significantly regarding their effect on increase in weight at 5% level of
significance.
Example 9: Two different types of drugs A and B were tried on some patients for increasing their weights. Six
persons were given drug A and other 7 persons were given drug B. The gain in weights (in pounds) is given
below:
Drug A    5    8     7     10    9     6    −
Drug B    9    10    15    12    14    8    12

Assuming that increment in the weights due to both drugs follows normal distributions with equal variances, do
the two drugs differ significantly with regard to their mean weight increment at 5% level of significance?
(Paired t-test) Example 10: A group of 12 children was tested to find out how many digits they would repeat
from memory after hearing them once. They were given practice session for this test. Next week they were
retested. The results obtained were as follows:

Child Number 1 2 3 4 5 6 7 8 9 10 11 12

Recall Before 6 4 5 7 6 4 3 7 8 4 6 5

Recall After 6 6 4 7 6 5 5 9 9 7 8 7

Assuming that the memories of the children before and after the practice session follow normal distributions, does
the memory practice session improve the performance of children?

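Solution: First of all, we formulate null and alternative hypotheses. If $\mu_1$ and $\mu_2$ denote the mean digit
recall before and after the practice session and $\mu_D = \mu_1 - \mu_2$, then

$H_0: \mu_D \ge 0$ and $H_1: \mu_D < 0$ (left-tailed)

Test statistic

$t = \dfrac{\bar{D}}{S_D/\sqrt{n}}$

where, $\bar{D} = \dfrac{1}{n}\sum D$ and $S_D^2 = \dfrac{1}{n-1}\left[\sum D^2 - \dfrac{(\sum D)^2}{n}\right]$

Before (X)    After (Y)    D = X − Y    D²
6             6             0           0
4             6            −2           4
5             4             1           1
7             7             0           0
6             6             0           0
4             5            −1           1
3             5            −2           4
7             9            −2           4
8             9            −1           1
4             7            −3           9
6             8            −2           4
5             7            −2           4
Total                      −14          32

Mean $\bar{D} = \dfrac{-14}{12} = -1.17$, $S_D^2 = \dfrac{1}{11}\left[32 - \dfrac{(-14)^2}{12}\right] = 1.42$, so $S_D = 1.19$ and

$t = \dfrac{-1.17}{1.19/\sqrt{12}} = \dfrac{-1.17}{0.34} = -3.44$

Since the calculated value of t (= −3.44) is less than the critical value $-t_{(11),\,0.05} = -1.796$, we reject the null
hypothesis at 5% level of significance, so we may assume that the memory practice session improves the
performance of children.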


Example 11: Ten students were given a test in Statistics and after one month’s coaching they were again given
a test of the similar nature and the increase in their marks in the second test over the first are shown below:
Roll No. 1 2 3 4 5 6 7 8 9 10
Increase in Marks 6 −2 8 −4 10 2 5 −4 6 0

Assuming that the increment in marks follows a normal distribution, do the data indicate that students have gained
knowledge from the coaching at 1% level of significance?
Example 12: A machine produces a large number of items out of which 25% are found to be defective. To
check this, the company manager takes a random sample of 100 items and finds 35 items defective. Is there
evidence of further deterioration of quality at 5% level of significance?
Example 13: In a random sample of 100 persons from town A, 60 are found to be high consumers of wheat. In
another sample of 80 persons from town B, 40 are found to be high consumers of wheat. Do these data reveal a
significant difference between the proportions of high wheat consumers in town A and town B ( at α = 0.05 )?

Example 14: Two brands of electric bulbs are quoted at the same price. A buyer tested a random sample of
200 bulbs of each brand and found the following information:
           Mean Life (hrs.)    SD (hrs.)
Brand A    1300                41
Brand B    1280                46
Is there a significant difference in the mean lives of the two brands of electric bulbs at 1% level of
significance?
E 5) The following data are collected during a test to determine consumer preference among five leading brands
of bath soaps:
Brand Preferred A B C D E Total
Number of Customers 194 205 204 196 201 1000

Test that the preference is uniform over the five brands at 5% level of significance.
E 6) The following table gives the numbers of road accidents that occurred during the various days of the week:
Days Mon Tue Wed Thu Fri Sat Sun
Number of Accidents 14 15 8 20 11 9 14

Test whether the accidents are uniformly distributed over the week by chi-square test at 1% level of
significance.
E7) A cigarette manufacturer claims that the variance of nicotine content of its cigarettes is 0.62. Nicotine
content is measured in milligrams and is normally distributed. A sample of 25 cigarettes has a variance
of 0.65. Test the manufacturer’s claim at 5% level of significance.

E8) The 12 measurements of the same object on an instrument are given below:
1.6, 1.5, 1.3, 1.5, 1.7, 1.6, 1.5, 1.4, 1.6, 1.3, 1.5, 1.5
If the measurement of the instrument follows normal distribution then carry out the test at 1% level of
significance that variance in the measurement of the instrument is less than 0.016.
E10) The variance of a certain dimension of an article produced by a machine is 7.2 over a long period. A random
sample of 20 articles gave a variance of 8. Is it justifiable to conclude that variability has increased at 5% level of
significance, assuming that the measurement of the dimension is normally distributed?

1. The mean marks obtained by the students of a mathematics course in IGNOU is 54.5 with a standard
deviation 8.0. At one of the study centres, where 100 students took the examination, the mean marks
were 55.9. Are the students of this study centre significantly 1) different from, 2) better than, the rest of
the students of that course in IGNOU at 0.01 level?
2. A consumer magazine, when comparing various brands of paints, stated that the drying time of one
particular brand was found to be four hours. The manufacturer was not particularly pleased with this and
consequently modified the paint to try to reduce the drying time. The paint was then tested by a random
sample of 40 customers all of whom were decorating their living rooms. For this sample the mean
drying time in hours was found to be 3.85 and the sample standard deviation was 0.55.
a) Analyse the sample data using the one-sided z-test.
b) Find a 95% confidence interval for the population mean of the drying times for the modified paint.
3. The breaking strengths of cables made by a company had mean of 1800 N. The company then adopted a
new technique which is believed to increase the breaking strengths. 50 cables made by the new
technique were tested to see if the belief is justified or not. The mean breaking strength of these 50 is
found to be 1850 N with a standard deviation of 100 N. Is the belief justified at a) 5% level b) 1% level.
4. As part of a survey on drivers’ reaction times for a driving magazine, 300 drivers were subjected to the
following test: each driver was asked to press a lever with his/her foot in response to a flashing light.
The reaction times (in seconds) were recorded and the sample mean was found to be 0.83. The sample
standard deviation was 0.31. What can you conclude about drivers’ reaction times?
5. A machine manufactures standard weights to be used in weighing scales. To check if the machine is
working properly, a random sample of five 2-kg. weights was taken. Each 2-kg. weight was weighed on
a special scale and the actual weights were found to have a mean of 1.962 kg. and a standard deviation
of 0.038 kg. If α = 0.05, can you say that the machine is in proper working order?
6. A management school claims that the starting salaries for its graduates average Rs. 10,000 or more per
month. A random sample of 7 students who had recently graduated, showed an average salary of Rs.
9700 with a standard deviation of Rs. 306. At a 5% level of significance would you accept the claim?
7. The specifications for the production of a certain alloy call for 23.2% copper. In 10 analyses, the mean
copper content was found to be 23.5% with a standard deviation of 0.24%. Can we conclude that the product
meets the specifications if α = 0.05?
8. The diameters of bolts manufactured by a machine are known to have a standard deviation of 0.0002
cm. A random sample of 10 bolts has an average diameter of 0.5046 cm. Test the hypothesis that the
true mean diameter of bolts is 0.51 cm, using α = 0.01.
9. A new teaching technique is to be tested. A group of 22 students were taught in the traditional way.
Another group of 18 students was taught with the help of the new technique. The two groups were then
given a standardised test which is known to have a standard deviation of 25. The mean score of the
traditional group was 127 and that of the experimental group was 136. If α = 0.1, do you think that the
new technique is significantly better?
10. A psychologist gave a test to decide if male students are as smart as female students. The sample of 40
female students had a mean score of 131 and the sample of 36 males had a mean score of 126. The test
has a standard deviation of 16. Is there a difference at 0.01 level of significance?
We have been considering cases where σ1 and σ2 are known. If they are not known, they have to be
estimated from the sample. If the samples are large, then these estimates are quite close to the real values
and so we can use them in forming the test statistic Z. In the next exercise you see one such situation.
11. A sample of 100 electric light bulbs produced by manufacturer A showed a mean life-time of 1190h and
a standard deviation of 90h. A sample of 75 bulbs produced by manufacturer B showed mean life-time of 1230h
and a standard deviation of 120h. a) Is there a difference between the two brands of bulbs at a
significance level of 0.05? b) Are the bulbs of manufacturer B superior to those of manufacturer A at the
same level?
12. We want to test the effect of a new fertiliser on wheat production. For this, 24 plots of land of equal area
were chosen. Half of these were treated with the new fertiliser and the other half were treated with old
one. With the new fertiliser, the mean yield was 48 kg. with a standard deviation of 4 kg . With the old
fertiliser, the mean yield was 51 kg, with a standard deviation of 3.6 kg. Can we say at 5% level of
significance that there is an improvement in the yield because of the new fertiliser? What will be your
conclusion at 1% level?
13. A botanist was interested in knowing if there was a difference in the time fruits matured on different
parts of a plant, and recorded the day of the first fruit on the top and on the bottom for 15 plants. All the
fruits came out during the same month.
Top 3 6 7 5 8 9 10 10 7 8 6 9 10 12 4
Bottom 7 9 5 8 8 10 11 12 6 9 7 13 8 13 8
Is there a significant difference in the time to mature at the 1% significance level?
14. The pulse rates of 12 people were recorded before and after taking a new drug.
Before 68 71 84 93 67 74 82 77 71 83 62 66
After 71 70 81 97 73 80 90 76 80 79 80 67
Using 10% level, can you say that there is a significant increase in the pulse rate?
15. A random sample of size 1000 from machine 1 contained 20 defectives, and a random sample of size
1500 from machine 2 contained 40 defectives. If α = 0.05, can you say that machine 1 is better than
machine 2?
16. A flu vaccine was given to 125 of a total of 200 employees of a firm. Thirty employees who had
received the vaccine were down with flu, while 25 of those who did not, also were stricken. At 1%
level of significance would you say that the vaccine was effective?

CHI-SQUARE TEST FOR INDEPENDENCE OF ATTRIBUTES (important)
This test is used to test the independence of two attributes.
Null and alternative hypotheses
H0: The two attributes (characteristics) are independent
H1: They are not independent
Suppose there are two attributes, say, A and B. Also let the characteristic A be assumed to have ‘r’ categories
A1, A2, …, Ar and characteristic B be assumed to have ‘c’ categories B1, B2, …, Bc. The various observed
frequencies in different classes can be expressed in the form of a table known as contingency table.
A \ B    B1     B2     …    Bj     …    Bc     Total
A1       O11    O12    …    O1j    …    O1c    R1
A2       O21    O22    …    O2j    …    O2c    R2
⋮        ⋮      ⋮           ⋮           ⋮      ⋮
Ai       Oi1    Oi2    …    Oij    …    Oic    Ri
⋮        ⋮      ⋮           ⋮           ⋮      ⋮
Ar       Or1    Or2    …    Orj    …    Orc    Rr
Total    C1     C2     …    Cj     …    Cc     N
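Test statistic:

$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}} \sim \chi^2_{(r-1)(c-1)}$ under $H_0$

where $O_{ij}$ is the observed frequency and $E_{ij} = \dfrac{R_i \times C_j}{N}$ is the expected frequency of the (i, j)th cell.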

Take the decision about the null hypothesis as:

If calculated value of test statistic is greater than tabulated value then we reject the null hypothesis otherwise we
may accept the null hypothesis.

Example 5: 1000 students at college level were graded according to their IQ level and the economic condition
of their parents.
Economic IQ level
Condition High Low Total
Poor 240 160 400
Rich 460 140 600
Total 700 300 1000

Test that IQ level of students is independent of the economic condition of their parents at 5% level of
significance.
Solution: H0 : IQ level and economic condition are independent
H1 : IQ level and economic condition are not independent
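For testing the null hypothesis, the test statistic is

$\chi^2 = \sum_{i}\sum_{j} \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}}$, where $E_{ij} = \dfrac{R_i \times C_j}{N}$

Therefore,

$E_{11} = \dfrac{400 \times 700}{1000} = 280$, $E_{12} = \dfrac{400 \times 300}{1000} = 120$,
$E_{21} = \dfrac{600 \times 700}{1000} = 420$, $E_{22} = \dfrac{600 \times 300}{1000} = 180$

Calculations for $(O - E)^2/E$

Cell     Observed         Expected         (O − E)    (O − E)²    (O − E)²/E
         Frequency (O)    Frequency (E)
(1,1)    240              280              −40        1600        5.71
(1,2)    160              120               40        1600        13.33
(2,1)    460              420               40        1600        3.81
(2,2)    140              180              −40        1600        8.89
Total    1000             1000                                    31.74

Therefore, from above calculations, we have

$\chi^2 = \sum \dfrac{(O - E)^2}{E} = 31.74$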
The degrees of freedom will be (r –1)(c –1) = (2 – 1)(2 – 1) = 1.


The critical value of χ2 with 1 degree of freedom at 5% level of significance is 3.84.
Since calculated value of test statistic (= 31.74) is greater than critical value (= 3.84) so we reject the null
hypothesis i.e. we reject the claim at 5% level of significance.
Thus, we conclude that sample provides us sufficient evidence against the claim so IQ level of students is not
independent of the economic condition of their parents.
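
The expected frequencies and the chi-square statistic of Example 5 can be verified with a short script. This is a minimal sketch in plain Python; the function names are illustrative assumptions, and the critical value 3.84 is the tabulated value quoted in the solution.

```python
def expected_frequencies(observed):
    """E_ij = (row total i) * (column total j) / N for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_independence(observed):
    """Return (chi-square statistic, degrees of freedom, expected table)."""
    expected = expected_frequencies(observed)
    chi2 = sum((o - e) ** 2 / e
               for o_row, e_row in zip(observed, expected)
               for o, e in zip(o_row, e_row))
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df, expected


# Example 5: rows = Poor, Rich; columns = High IQ, Low IQ
observed = [[240, 160],
            [460, 140]]
chi2, df, expected = chi_square_independence(observed)
chi2_crit = 3.84               # tabulated value, 1 df, 5% level
reject_h0 = chi2 > chi2_crit

print(expected, round(chi2, 2), df, reject_h0)
```

Without rounding the individual cell contributions the statistic is about 31.75; the value 31.74 in the table comes from adding the rounded entries, and the decision is unaffected.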

Example 6: Calculate the expected frequencies for the following data, presuming that the two attributes are
independent, and check whether condition of home and condition of the child are independent at 5% level of
significance.

Condition of Child    Condition of Home
                      Clean    Dirty
Clean                 70       50
Fairly Clean          80       20
Dirty                 35       45

Solution: H0 : Condition of home and condition of child are independent
H1 : Condition of home and condition of child are not independent
For testing the null hypothesis, test statistic is

$\chi^2 = \sum_{i}\sum_{j} \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}}$

Now, under H0, the expected frequencies can be obtained as:

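Condition of Child    Condition of Home    Total
                      Clean    Dirty
Clean                 70       50          120
Fairly Clean          80       20          100
Dirty                 35       45          80
Total                 185      115         300

$E_{ij} = \dfrac{R_i \times C_j}{N}$

Therefore,

$E_{11} = \dfrac{120 \times 185}{300} = 74.00$, $E_{12} = \dfrac{120 \times 115}{300} = 46.00$,
$E_{21} = \dfrac{100 \times 185}{300} = 61.67$, $E_{22} = \dfrac{100 \times 115}{300} = 38.33$,
$E_{31} = \dfrac{80 \times 185}{300} = 49.33$, $E_{32} = \dfrac{80 \times 115}{300} = 30.67$

Calculations for $(O - E)^2/E$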
Observed         Expected         (O − E)    (O − E)²    (O − E)²/E
Frequency (O)    Frequency (E)
70               74.00            −4.00      16.00       0.22
50               46.00             4.00      16.00       0.35
80               61.67            18.33      335.99      5.45
20               38.33           −18.33      335.99      8.77
35               49.33           −14.33      205.35      4.16
45               30.67            14.33      205.35      6.70
Total = 300      300                                     25.64

Therefore, from above calculations, we have

$\chi^2 = \sum \dfrac{(O - E)^2}{E} = 25.64$

The degrees of freedom will be (r –1)(c –1) = (3 – 1)(2 – 1) = 2.
The critical value of χ2 with 2 degrees of freedom at 5% level of significance is 5.99.
Since calculated value of test statistic (= 25.64) is greater than critical value (= 5.99) so we reject the null
hypothesis i.e. we reject the claim at 5% level of significance.
Thus, we conclude that the sample provides us sufficient evidence against the claim so condition of home and
condition of the child are not independent.

E4) A group of 1650 school children were classified according to their performance in school tests and
family economic level. Test if there is any association between these two attributes (Given
)
Economic      Performance
Level         Very Good    Good    Average    Poor    Total
Very Rich     4            7       16         25      52
Rich          13           37      79         73      202
Average       105          372     298        175     950
Poor          35           213     75         123     446
Total         157          629     468        396     1650
E5) The following contingency table presents the analysis of 300 persons according to hair colour and eye
colour:
Hair Eye Colour
Colour Blue Grey Brown Total
Fair 30 10 40 80
Brown 40 20 40 100
Black 50 30 40 120
Total 120 60 120 300

Test the hypothesis that there is an association between hair colour and eye colour at 1% level of
significance.
PAIRED t-TEST
The paired t-test is used when two samples are not independent and observations are recorded on the same individuals or items.
Generally, such types of observations are recorded to assess the effectiveness of a particular training, diet,
treatment, medicine, etc. In such situations, the observations are recorded "before and after" the insertion of
training, treatment, etc. as the case may be. For that we use paired t-test.
Let (X1, Y1), (X2, Y2), …, (Xn, Yn) be a paired random sample of size n and the difference between paired
observations Xi & Yi be denoted by Di, that is,

$D_i = X_i - Y_i$, i = 1, 2, …, n

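Here, we want to test that there is an effect of a diet, training, treatment, medicine, etc. So we can take the null
hypothesis as

$H_0: \mu_D = 0$

and the alternative hypothesis

$H_1: \mu_D \ne 0$ (two-tailed)

or

$H_1: \mu_D > 0$ or $H_1: \mu_D < 0$ (one-tailed)

For testing the null hypothesis, the test statistic t is given by

$t = \dfrac{\bar{D}}{S_D/\sqrt{n}} \sim t_{(n-1)}$ under $H_0$

where,

$\bar{D} = \dfrac{1}{n}\sum_{i=1}^{n} D_i$ and $S_D^2 = \dfrac{1}{n-1}\sum_{i=1}^{n}(D_i - \bar{D})^2$
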
Example 5: A group of 12 children was tested to find out how many digits they would repeat from memory
after hearing them once. They were given practice session for this test. Next week they were retested. The
results obtained were as follows:

Child Number 1 2 3 4 5 6 7 8 9 10 11 12

Recall Before 6 4 5 7 6 4 3 7 8 4 6 5

Recall After 6 6 4 7 6 5 5 9 9 7 8 7

Assuming that the memories of the children before and after the practice session follow normal distributions, does
the memory practice session improve the performance of children?
Solution: Here, we want to test that memory practice session improves the performance of children. If $\mu_1$ and
$\mu_2$ denote the mean digit repetition before and after the practice so our claim is $\mu_1 < \mu_2$ and its
complement is $\mu_1 \ge \mu_2$. Since complement contains the equality sign so we can take the complement as the
null hypothesis and the claim as the alternative hypothesis. Thus,

$H_0: \mu_1 \ge \mu_2$ or $\mu_D = \mu_1 - \mu_2 \ge 0$
$H_1: \mu_1 < \mu_2$ or $\mu_D < 0$

Since the alternative hypothesis is left-tailed so the test is left-tailed test.


It is a situation of before and after. Also, it is given that the memories of the children before and after the
practice session follow normal distributions. So, population of differences will also be normal. Also all the
assumptions of paired t-test meet so we can go for paired t-test.
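For testing the null hypothesis, the test statistic t is given by

$t = \dfrac{\bar{D}}{S_D/\sqrt{n}} \sim t_{(n-1)}$ under $H_0$ … (3)

where, $\bar{D}$ and $S_D$ are the mean and standard deviation of the sample differences.

Child     Digit recall                d = (X − Y)    d²
Number    Before (X)    After (Y)
1         6             6              0             0
2         4             6             −2             4
3         5             4              1             1
4         7             7              0             0
5         6             6              0             0
6         4             5             −1             1
7         3             5             −2             4
8         7             9             −2             4
9         8             9             −1             1
10        4             7             −3             9
11        6             8             −2             4
12        5             7             −2             4
Total                                 −14            32

From above calculations, we have

$\bar{D} = \dfrac{-14}{12} = -1.17$, $S_D^2 = \dfrac{1}{11}\left[32 - \dfrac{(-14)^2}{12}\right] = 1.42$, $S_D = 1.19$

$t = \dfrac{-1.17}{1.19/\sqrt{12}} = \dfrac{-1.17}{0.34} = -3.44$

The critical value of test statistic t for left-tailed test corresponding (n-1) = 11 df at 5% level of significance is
$-t_{(11),\,0.05} = -1.796$.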
Since calculated value of test statistic t (= −3.44) is less than the critical value (=−1.796), that means calculated
value of t lies in rejection region, so we reject the null hypothesis and support the alternative hypothesis i.e.
support the claim at 5% level of significance.

Thus, we conclude that samples fail to provide us sufficient evidence against the claim so we may assume that
memory practice session improves the performance of children.
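
The paired t computation for the digit-recall data can be sketched in Python as follows. The variable names are illustrative assumptions; the critical value −1.796 is the tabulated value used in the solution. Carried out without rounding the intermediate results, the statistic comes to about −3.39 rather than −3.44 (the latter arises from rounding the mean, the SD and the standard error to two decimals); the conclusion is the same either way.

```python
import math

# Digit recall before (X) and after (Y) the practice session
before = [6, 4, 5, 7, 6, 4, 3, 7, 8, 4, 6, 5]
after = [6, 6, 4, 7, 6, 5, 5, 9, 9, 7, 8, 7]

d = [x - y for x, y in zip(before, after)]   # paired differences D = X - Y
n = len(d)
d_bar = sum(d) / n                           # mean of the differences
s_d = math.sqrt(sum((di - d_bar) ** 2 for di in d) / (n - 1))  # SD, divisor n - 1
t = d_bar / (s_d / math.sqrt(n))             # paired t statistic, n - 1 df

t_crit = -1.796                # left-tailed critical value, 11 df, 5% level
reject_h0 = t <= t_crit        # left-tailed decision rule

print(sum(d), round(d_bar, 2), round(s_d, 2), round(t, 2), reject_h0)
```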

CHI-SQUARE TEST FOR GOODNESS OF FIT (important for short note)

This test is used to test that a random variable under study follows a specified distribution such as uniform,
binomial, Poisson, normal, etc. Here, we compare observed frequencies in each category with theoretically
expected frequencies. This test is known as “goodness of fit test” because we test how well an observed
frequency distribution fit to the theoretical distribution such as normal, uniform, binomial, etc.
Assumptions
This test works under the following assumptions:
(i) The sample observations are random and independent.
(ii) The sample size is large.
(iii) The observations may be classified into non-overlapping categories.

(iv) The expected frequency of each class is greater than five.
(v) Sum of observed frequencies is equal to sum of expected frequencies, i.e., $\sum O_i = \sum E_i$.
We can take the null and alternative hypotheses as
H0: Data follow a specified distribution
H1: Data do not follow a specified distribution
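Test statistic:
The test statistic is given by

$\chi^2 = \sum_{i=1}^{k} \dfrac{(O_i - E_i)^2}{E_i} \sim \chi^2_{(k-1)}$ under $H_0$

where $O_i$ is the observed frequency and $E_i$ is the expected frequency of the ith category.

Expected frequencies are obtained as
$E_i = N p_i$ for all i = 1, 2, …, k
where $N$ is the total number of observations and $p_i$ (i = 1, 2, …, k) is the probability that an observation falls in the ith category.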
We calculate value of the chi square test statistic using the given data and obtain the tabulated value of the chi
square.
Take the decision about the null hypothesis as:
If calculated value of $\chi^2$ is greater than tabulated value then we reject the null hypothesis otherwise we do not
reject the null hypothesis.
E2) The following table gives the numbers of road accidents that occurred during the various days of the week:
Days Mon Tue Wed Thu Fri Sat Sun
Number of Accidents 14 15 8 20 11 9 14

Test whether the accidents are uniformly distributed over the week by chi-square test at 1% level of
significance.
Solution:
H0: The accidents are uniformly distributed over the week
H1: The accidents are not uniformly distributed over the week
Since the data are given in the categorical form and we are interested to fit a distribution, so we can go for chi-
square goodness of fit test.

The uniform distribution is one in which all outcomes considered have equal or uniform probability.
Therefore, the probability that the accident occurs in any day is same. Thus,

$p_1 = p_2 = \dots = p_7 = p = \dfrac{1}{7}$

The theoretical or expected frequency for each day is obtained by multiplying the appropriate probability by the
total number of accidents, that is, sample size N. Therefore,

$E_1 = E_2 = \dots = E_7 = Np = 91 \times \dfrac{1}{7} = 13$

Calculations for $(O - E)^2/E$

Days     Observed         Expected         (O − E)    (O − E)²    (O − E)²/E
         Frequency (O)    Frequency (E)
Mon      14               13                1          1          0.0769
Tue      15               13                2          4          0.3077
Wed      8                13               −5          25         1.9231
Thu      20               13                7          49         3.7692
Fri      11               13               −2          4          0.3077
Sat      9                13               −4          16         1.2308
Sun      14               13                1          1          0.0769
Total    91               91                                      7.6923

From the above calculation, we have

$\chi^2 = \sum \dfrac{(O - E)^2}{E} = 7.6923$

The critical value of chi-square with k − 1 = 7 − 1 = 6 degrees of freedom at 1% level of significance is 16.81.
Since calculated value of test statistic (= 7.6923) is less than critical value (= 16.81) so we do not reject the null
hypothesis i.e. we support the claim at 1% level of significance.
Thus, we conclude that the sample fails to provide us sufficient evidence against the claim so we may assume
that the accidents are uniformly distributed over the week.
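
The goodness-of-fit calculation for the accident data can be reproduced with the sketch below. It is written in plain Python; the variable names are illustrative, and the critical value 16.81 is the tabulated value quoted in the solution.

```python
# Observed accidents, Monday to Sunday
observed = [14, 15, 8, 20, 11, 9, 14]

k = len(observed)            # number of categories (days)
n = sum(observed)            # total number of accidents
expected = [n / k] * k       # under H0 (uniform), E_i = N * (1/k)

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = k - 1

chi2_crit = 16.81            # tabulated value, 6 df, 1% level
reject_h0 = chi2 > chi2_crit

print(expected[0], round(chi2, 4), df, reject_h0)
```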
Example 1: The following data are collected during a test to determine consumer preference among five
leading brands of bath soaps:
Brand Preferred A B C D E Total
Number of Customers 194 205 204 196 201 1000

Test that the preference is uniform over the five brands at 5% level of significance.
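
The notes do not give a solution for this example. As a sketch of how the same goodness-of-fit procedure could be applied (taking the tabulated value 9.49 for 4 degrees of freedom at the 5% level from a standard chi-square table), the computation would run as follows.

```latex
\begin{aligned}
E_i &= N p_i = 1000 \times \tfrac{1}{5} = 200, \qquad i = 1, 2, \dots, 5 \\
\chi^2 &= \frac{(194-200)^2 + (205-200)^2 + (204-200)^2 + (196-200)^2 + (201-200)^2}{200} \\
       &= \frac{36 + 25 + 16 + 16 + 1}{200} = \frac{94}{200} = 0.47 \\
\text{df} &= k - 1 = 5 - 1 = 4, \qquad \chi^2_{(4),\,0.05} = 9.49
\end{aligned}
```

Since 0.47 is less than 9.49, the null hypothesis of uniform preference over the five brands would not be rejected at 5% level of significance.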

Test for two Population Variances (F-Test)


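We can take our null and alternative hypotheses as

$H_0: \sigma_1^2 = \sigma_2^2$ and $H_1: \sigma_1^2 \ne \sigma_2^2$ (two-tailed)

or

$H_0: \sigma_1^2 \le \sigma_2^2$ and $H_1: \sigma_1^2 > \sigma_2^2$, or $H_0: \sigma_1^2 \ge \sigma_2^2$ and $H_1: \sigma_1^2 < \sigma_2^2$ (one-tailed)

For testing the null hypothesis, the test statistic F is given by

$F = \dfrac{S_1^2}{S_2^2} \sim F_{(n_1-1,\; n_2-1)}$ under $H_0$

where,

$S_1^2 = \dfrac{1}{n_1-1}\sum_{i=1}^{n_1}(X_i - \bar{X})^2$ and $S_2^2 = \dfrac{1}{n_2-1}\sum_{i=1}^{n_2}(Y_i - \bar{Y})^2$
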
Example 1: Two sources of raw materials are under consideration by a bulb manufacturing company. Both
sources seem to have similar characteristics but the company is not sure about their respective uniformity. A
sample of 12 lots from source A yields a variance of 125 and a sample of 10 lots from source B yields a
variance of 112. Is it likely that the variance of source A is greater than that of source B at significance level α = 0.01?
Solution: Here, we are given that

$n_1 = 12$, $S_1^2 = 125$, $n_2 = 10$, $S_2^2 = 112$

Here, we want to test that variance of source A significantly differs to the variances of source B. If
$\sigma_1^2$ and $\sigma_2^2$ denote the variances in the raw materials of sources A and B respectively so our claim is
$\sigma_1^2 \ne \sigma_2^2$ and its complement is $\sigma_1^2 = \sigma_2^2$. Since complement contains the equality sign
so we can take the complement as the null hypothesis and the claim as the alternative hypothesis. Thus,

$H_0: \sigma_1^2 = \sigma_2^2$ and $H_1: \sigma_1^2 \ne \sigma_2^2$

Since the alternative hypothesis is two-tailed so the test is two-tailed test.

For testing this, the test statistic is given by

$F = \dfrac{S_1^2}{S_2^2} = \dfrac{125}{112} = 1.12$ with $(n_1 - 1,\; n_2 - 1) = (11, 9)$ degrees of freedom

Since calculated value of test statistic (= 1.12) is less than the critical value (= 3.10), so we do not reject the null
hypothesis and reject the alternative hypothesis i.e. we reject the claim at 5% level of significance.
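
The variance ratio in this example is simple enough to check by hand, but a short sketch makes the degrees of freedom explicit. The function name `f_statistic` is a hypothetical helper, and the critical value 3.10 is simply the figure quoted in the solution above, not one computed here.

```python
def f_statistic(var1, n1, var2, n2):
    """Return F = S1^2 / S2^2 and its (numerator, denominator) degrees of freedom."""
    return var1 / var2, (n1 - 1, n2 - 1)


# Source A: 12 lots, variance 125; Source B: 10 lots, variance 112
f_value, df = f_statistic(var1=125, n1=12, var2=112, n2=10)

f_crit = 3.10                  # critical value as quoted in the solution
reject_h0 = f_value > f_crit

print(round(f_value, 2), df, reject_h0)
```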
E2) Two sources of raw materials are under consideration by a bulb manufacturing company. Both sources
seem to have similar characteristics but the company is not sure about their respective uniformity. A sample of
12 lots from source A yields a variance of 125 and a sample of 10 lots from source B yields a variance of 112.
Is it likely that the variance of source A significantly differs to the variance of source B at significance level α
= 0.01?

ANALYSIS OF VARIANCE(ANOVA) [most important topic]


Analysis of variance is used for testing of equality of means of more than two populations.

According to Professor R. A. Fisher, Analysis of Variance (ANOVA) is a method of splitting the total
variation in data into two components of variation: one due to assignable causes (between-group
variability) and the other due to chance causes (within-group variability).

UTILITY OF ANOVA
The t-test is used to test the hypothesis about the means of two populations. But there are many situations
where we have to test the hypothesis about the equality of more than two means. For example, one may be
interested to test whether there is a significant difference between three teaching methods of statistical
techniques on the basis of sample data; in agriculture, the experimenter wants to compare three or more
fertilizers; in the medical field, an investigator wishes to know whether four drugs are equally efficient in the control
of blood pressure.
In such situations, we can use the t-test for testing the hypothesis about the means of more than two populations, but
we have to apply the t-test many times, once for each pair of means. Due to this the overall type I error increases.
In such situations, we use Analysis of variance (ANOVA).
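
To see roughly how fast the type I error grows with repeated t-tests, the sketch below counts the pairwise comparisons among k means and computes the chance of at least one false rejection. It assumes, as a simplification, that the individual tests are independent; the function name `familywise_error` is illustrative.

```python
from math import comb


def familywise_error(k, alpha=0.05):
    """Number of pairwise t-tests among k means and the probability of at
    least one type I error, assuming the tests are independent."""
    m = comb(k, 2)                       # number of pairs of means
    return m, 1 - (1 - alpha) ** m


for k in (3, 4, 5):
    m, error = familywise_error(k)
    print(k, m, round(error, 3))
```

With three means there are three pairwise tests, and the overall error rate at α = 0.05 is already about 0.14 under this assumption, which is the motivation for a single ANOVA F-test.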
Assumptions of ANOVA
1. Dependent variable measured at least on interval scale;
2. The samples are independently and randomly drawn from the population;
3. Population under study follows the normal distribution;
4. The samples have approximately equal variance;
5. Various effects are additive in nature; and
6. Errors ($e_{ij}$) are independently identically distributed normal with mean zero and variance $\sigma_e^2$.
One Way Classification
If the observations in an experiment are classified on the basis of a single criterion, then the classification is
called one way classification. For example, if we consider the yield of four varieties of wheat then we divide the
whole plots into four groups. In this case the observations (yields) are classified on the basis of a single
criterion, the variety of wheat. So the classification is called one-way classification.
Two Way Classification
If the observations in an experiment are classified on the basis of two criteria/factors, then the classification is
called two-way classification. For example, we may consider the yields of four varieties of wheat using four
different types of fertilizers. In such experiment, the observations are classified according to two criteria (the
wheat variety and the type of fertilizer). So it is called a two-way classification.
Model for One Way Classification
Suppose there are k normal populations with means $\mu_1, \mu_2, \dots, \mu_k$ and common variance $\sigma^2$. Further, let us draw
k random samples (one from each population) from these populations. Let $n_i$ be the size of the
sample from the ith population. Using the sample information, we wish to test the null hypothesis

$H_0: \mu_1 = \mu_2 = \dots = \mu_k$

against the alternative hypothesis

$H_1$: At least two means are not equal.

Let $y_{ij}$ ($i = 1, 2, \dots, k$; $j = 1, 2, \dots, n_i$) be the jth observation of ith sample, then one-way classified data can be arranged as
shown in the following table:
Level of Factor/ Treatment 1 2 k
Observations y11 y21 ... yk1
y12 y22 ... yk2
y13 y23 ... yk3
... ... ... ...
...
Total T1 T2 ... Tk

The linear mathematical model for one-way classified data can be written as
i = 1, 2, . . ., k & j = 1, 2, . . ., n
where - represents the general mean(effect) and

represents the effect of ith treatment.


eij - represent the errors due to random fluctuations. It is independently identically distributed normal with
mean zero and variance σe2.
In one-way classification we split the total variation as
Total sum of squares(TSS) = Sum of squares due to treatment(SST) + sum of squares due to error( SSE)
Model ANOVA table for One Way Classification
ANOVA Table for One-way Classified Data
Source of Variation            Degrees of      Sum of Squares   Mean Sum of Squares   Variance Ratio    FTab
                               Freedom (df)    (SS)             (MSS)                 (Fcal)
Treatment (Between Samples)    k−1             SST              MSST = SST/(k−1)      F = MSST/MSSE     F with (k−1), (N−k) df
Error (Within Samples)         N−k             SSE              MSSE = SSE/(N−k)
Total                          N−1             TSS
Procedure for one way analysis of variance for k independent sample:
Step 1: The first step of the procedure is to set up the null and alternative hypotheses.
We want to test the equality of the population means. Hence, the null hypothesis is given by
H0: μ1 = μ2 = . . . = μk
against the alternative hypothesis
H1: At least two means are not equal
Step 2: Calculate the correction factor (CF) as
CF = G²/N
where Grand total (G) = sum of all values and N = total number of observations = n1 + n2 + … + nk = Σ ni
Step 3: Find the sum of squares of all the observations as
RSS = Σi Σj yij²
This is also known as the Raw Sum of Squares (RSS).
Step 4: After that find the Total Sum of Squares (TSS) as
TSS = RSS − CF
Step 5: Find the Sum of Squares due to Treatment or Factor (SST) as
SST = Σi (Ti²/ni) − CF
where Ti is the sum or total of the observations of the ith treatment or factor
Step 6: After that find the Error Sum of Squares (SSE) as
SSE = TSS − SST
The above analysis is presented in the following table:

ANOVA Table for One-way Classified Data
Source of      Sum of Squares   Degrees of      Mean Sum of Squares   Variance Ratio    FTab
Variation      (SS)             Freedom (df)    (MSS)                 (Fcal)
Treatment      SST              k−1             MSST = SST/(k−1)      F = MSST/MSSE     F with (k−1), (N−k) df
Error          SSE              N−k             MSSE = SSE/(N−k)
Total          TSS              N−1

Thus, if the observed value of F is greater than the tabulated value of F for {(k−1), (N−k)} df at the specified level
of significance (usually 5% or 1%), then H0 is rejected; otherwise, H0 is not rejected.
Example 1: An investigator is interested in comparing the level of knowledge about the history of India among students
of 4 different schools in a city. A test is given to 5, 6, 7 and 6 students of 8th class of the 4 schools. Their scores
out of 10 are given below:
School I (S1) 8 6 7 5 9
School II (S2) 6 4 6 5 6 7
School III(S3) 6 5 5 6 7 8 5
School IV(S4) 5 6 6 7 6 7

Solution: Let μ1, μ2, μ3, μ4 denote the average scores of students of 8th class of schools I, II, III, IV respectively.
Then
Null hypothesis H0: μ1 = μ2 = μ3 = μ4
Alternative hypothesis H1: The differences among μ1, μ2, μ3, μ4 are significant.

S1       S2       S3       S4       S1²    S2²    S3²    S4²
8        6        6        5        64     36     36     25
6        4        5        6        36     16     25     36
7        6        5        6        49     36     25     36
5        5        6        7        25     25     36     49
9        6        7        6        81     36     49     36
         7        8        7               49     64     49
                  5                               25
T1=35    T2=34    T3=42    T4=37    255    198    260    231

Grand Total G = 35 + 34 + 42 + 37 = 148
Correction Factor (CF) = G²/N = (148)²/24 = 912.6667, since N = n1 + n2 + n3 + n4 = 5 + 6 + 7 + 6 = 24
Raw Sum of Squares (RSS) = 255 + 198 + 260 + 231 = 944
Total Sum of Squares (TSS) = RSS – CF = 944 – 912.6667 = 31.3333
Sum of Squares due to Treatments (SST) = Σ (Ti²/ni) − CF = (35)²/5 + (34)²/6 + (42)²/7 + (37)²/6 − 912.6667
= 245 + 192.6667 + 252 + 228.1667 − 912.6667 = 5.1667
Sum of Squares due to Errors (SSE) = TSS − SST = 31.3333 − 5.1667 = 26.1666
Now, MSST = SST/(k−1) = 5.1667/3 = 1.7222
MSSE = SSE/(N−k) = 26.1666/20 = 1.3083
ANOVA Table
Source of Variation   Sum of Squares (SS)   Degrees of Freedom (df)   MSS      Fcal
Between schools       5.1667                4−1 = 3                   1.7222   F = 1.7222/1.3083 = 1.3164
Within schools        26.1666               N−k = 24−4 = 20           1.3083
Total                 31.3333               24−1 = 23
Calculated F = 1.3164
Tabulated F at 5% level of significance with (3, 20) degrees of freedom is 3.10.
Conclusion: Since calculated F < tabulated F, we do not reject H0 and conclude that the level of knowledge of
students of schools I, II, III and IV does not differ significantly.
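The six-step procedure can be checked numerically. The following Python sketch (the function name and the dictionary keys are illustrative choices, not part of the course material) follows the steps CF, RSS, TSS, SST, SSE and is applied to the scores of Example 1.

```python
def one_way_anova(groups):
    """Return the main one-way ANOVA quantities for a list of samples."""
    k = len(groups)                                        # number of treatments
    N = sum(len(g) for g in groups)                        # total observations
    G = sum(sum(g) for g in groups)                        # grand total
    cf = G ** 2 / N                                        # Step 2: correction factor
    rss = sum(y ** 2 for g in groups for y in g)           # Step 3: raw sum of squares
    tss = rss - cf                                         # Step 4
    sst = sum(sum(g) ** 2 / len(g) for g in groups) - cf   # Step 5
    sse = tss - sst                                        # Step 6
    msst = sst / (k - 1)
    msse = sse / (N - k)
    return {"CF": cf, "TSS": tss, "SST": sst, "SSE": sse,
            "MSST": msst, "MSSE": msse, "F": msst / msse}


schools = [
    [8, 6, 7, 5, 9],            # School I
    [6, 4, 6, 5, 6, 7],         # School II
    [6, 5, 5, 6, 7, 8, 5],      # School III
    [5, 6, 6, 7, 6, 7],         # School IV
]
result = one_way_anova(schools)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

The printed F value should agree with the calculated F of the example up to rounding; the tabulated F still has to be read from the F table.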
Example 2: Suppose we have three fertilizers and we have to compare their efficacy. This could be done by a field
experiment in which each fertilizer is applied to 10 plots; the 30 plots are later harvested and the crop
yield is calculated for each plot. The data were recorded in the following table:
Fertilizer Yields (in tonnes) from the 10 plots allocated to that fertilizer
1 6.27 5.36 6.39 4.85 5.99 7.14 5.08 4.07 4.35 4.95
2 3.07 3.29 4.04 4.19 .41 0.75 04.87 3.94 6.49 3.15
3 4.04 3.79 4.56 4.55 4.53 3.53 3.71 7.00 4.61 4.55
Solution:
H0: Mean effect of the Ist fertilizer = Mean effect of the IInd fertilizer = Mean effect of the IIIrd fertilizer, i.e.,
H0: μ1 = μ2 = μ3
H1: At least one mean is different
Steps for calculating the different sums of squares
Grand Total (G) = Total of all observations = 139.20
Correction Factor (CF) = G²/N = 139.20 × 139.20/30 = 645.89

Raw Sum of Square (RSS) = = 6385.3249


Total Sum of Squares (TSS) = RSS − CF = 36.4449

Sum of Squares due to Fertilizers (SST) = Σ (Ti²/ni) − CF = (54.5)²/10 + (40)²/10 + (44.9)²/10 – CF = 10.8227

Sum of Squares due to Error (SSE) = TSS − SST = 36.4449 − 10.8227 = 25.6222

Mean Sum of Squares due to Treatment (MSST) = SST/df = 10.8227/2 = 5.4114
Mean Sum of Squares due to Error (MSSE) = SSE/df = 25.6222/27 = 0.9490
Variance ratio F(2, 27) = MSST/MSSE = 5.4114/0.9490 = 5.70
Tabulated F(2, 27) at 5% level of significance = 3.35
Since the calculated value of F (5.70) is greater than the tabulated value (3.35) at 5% level of significance, we
reject H0. It means that there is a significant difference among the effects of these three fertilizers.
E 1) Three varieties A, B and C of wheat are sown in five plots each and the following yields per plot are
obtained:
Plots    A     B     C
1        8     7     12
2        10    5     9
3        7     10    13
4        14    9     12
5        11    9     14
Total    50    40    60

Set up a table of analysis of variance and find out whether there is a significant difference between the
yields of these varieties.

Solution: Null hypothesis H0: μA = μB = μC
Alternative hypothesis H1: At least two means are not equal.

N = Total no. of observations = 15
k = no. of groups = 3
Plots    A     B     C     A²     B²     C²
1        8     7     12    64     49     144
2        10    5     9     100    25     81
3        7     10    13    49     100    169
4        14    9     12    196    81     144
5        11    9     14    121    81     196
Total    50    40    60    530    336    734
Grand total = G = 50 + 40 + 60 = 150
Correction factor (CF) = G²/N = 150 × 150/15 = 1500
Raw sum of squares (RSS) = 530 + 336 + 734 = 1600
Total sum of squares (TSS) = RSS − CF = 1600 − 1500 = 100
Sum of squares due to treatment (SST) = Σ (Ti²/ni) − CF
= 50 × 50/5 + 40 × 40/5 + 60 × 60/5 − 1500 = 500 + 320 + 720 − 1500 = 40
Sum of squares due to error (SSE) = TSS − SST = 100 − 40 = 60

ANOVA Table for One-way classified data
Sources of Variation (SV)             Degrees of Freedom   Sum of Squares   Mean Sum of Squares (MSS)   F-Statistic or Variance Ratio
Due to three varieties (treatments)   3−1 = 2              40 (SST)         MSST = 40/2 = 20            F = 20/5 = 4
Due to error (within groups)          15−3 = 12            60 (SSE)         MSSE = 60/12 = 5
Total                                 15−1 = 14            TSS = 100

For (2, 12) df, the table value of F at 5% level of significance is 3.89, which can be seen from the
statistical table. Since the calculated value F = 4 is greater than the table value of F at 5% level of significance,
we reject the null hypothesis and conclude that the difference between the mean yields of the
three varieties is significant.
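E 2) The following figures relate to production in kg of three varieties P, Q, R of wheat sown in 12 plots:
P 14 16 18
Q 14 13 15 22
R 18 16 19 19 20
Is there any significant difference in the production of these varieties?
Solution:
Sources of Variation (SV)   Degrees of Freedom   Sum of Squares (SS)   Mean Sum of Squares (MSS)   F-statistic or Variance Ratio
Between Varieties           3−1 = 2              16.8                  8.4                         F = 8.4/7.467 = 1.125
Due to Error                12−3 = 9             67.20                 7.467
Total                       12−1 = 11            84

F tabulated at 5% level of significance with (2, 9) df = 4.26
Since calculated F < tabulated F, we do not reject H0 and conclude that there is no significant difference in the
production of these varieties.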
Set up the ANOVA table for the following per hectare yields of three varieties of wheat, each grown on four plots.
Variety of wheat
A1 A2 A3
16 15 15
17 15 14
13 13 13
18 17 12

Sampling
Population
Population is the collection or group of individuals/items/units/observations under study. The books in a
library, the particles in a room, the rivers in India, the students in a classroom, etc. are examples of populations.
Sample
A sample is a fraction, part or subset of the population, drawn through a valid statistical procedure and regarded as
representative of the whole population.
The valid statistical procedure of drawing a sample from the population is called sampling.
Complete Enumeration and Sample Survey
(1) Complete Enumeration or Census
When each and every unit of the population is investigated or studied for the characteristics under study then we
call it complete enumeration or census. For example, checking at border of a country, census of population of
a country, census of import and export, etc.
(2) Sample Survey or Sample Enumeration
When only a part or a small number of units of population are investigated or studied for the characteristics
under study then we call it sample enumeration or sample survey.
Advantages of Sampling Survey over Census or Complete Enumeration
Reduced Cost
Since in a sample survey we study only a part of the population, the cost of the survey in terms of money and
manpower is considerably smaller than that of complete enumeration.
Saving of Time
Sampling results can be analysed more quickly than those of complete enumeration.
Greater scope
In certain cases, complete enumeration is not possible and we are bound to use a sample enquiry. For example, when
the population units are destroyed in the course of investigation, as in testing bombs, bullets, the life time of
electric bulbs, etc., we are bound to use a sample survey.
Greater Accuracy
A sample survey may give data of better quality than a complete survey because in a sample survey it may be
possible to use better resources, such as trained field workers and better equipment, than in complete enumeration.
Sampling and Non-Sampling Error
(1) Sampling Error
The error which arises due to the fact that only a part of the population, called the sample, is used to estimate the
population parameters and draw inferences is known as sampling error. Whatever degree of accuracy is used in
selecting a sample, there will always be a difference between the population value and its
corresponding estimate. This error is present in every sampling scheme. A sample with the smallest sampling
error will always be considered a good representative of the population. The error can be reduced by increasing
the size of the sample.
(2) Non-Sampling Error
When all the units of the population are studied, one would expect that there is no error. However, in
practice it is not so. It is difficult to avoid errors of observation or ascertainment completely. Therefore, the
error which arises at the stages of observation, classification, tabulation, analysis, etc. is known as non-sampling
error. Non-sampling error is present in both the census and the sample survey.

Types of Sampling
Subjective or judgment or purposive sampling
Any type of sampling in which the selection of units in the sample depends on the personal discretion or judgment
of the investigator is called subjective or judgment sampling. This type of sampling is used with a definite
purpose in view and as such is not used for general purposes. The investigator includes those items in the sample
which he thinks are most typical of the population with respect to the characteristics under study. For example,
if we want to draw a sample of patients suffering from Tuberculosis (TB), it is not possible to ascertain the
complete population of TB patients; therefore, people who are suffering from TB are selected in the sample. This
sampling method is not preferred because, if the investigator is biased, the sample does not give a true picture.
Simple Random Sampling or Random Sampling
The simplest and most common method of sampling is simple random sampling. In simple random sampling
the sample is drawn in such a way that each unit of the population has an equal and independent chance of
being included in the sample. Simple random sampling may be classified as:
(1) Simple Random Sampling with Replacement (SRSWR)
In simple random sampling, if the units are selected or drawn one by one in such a way that a unit drawn at a
time is replaced back in the population before the subsequent draw, the procedure is called SRSWR. In this type of
sampling from a population of size N, the probability of selection of a unit at each draw remains 1/N. This sampling
is used when the population is homogeneous.
(2) Simple Random Sampling without Replacement (SRSWOR)
In simple random sampling, if the units are selected or drawn one by one in such a way that a unit drawn at a
time is not replaced back in the population before the subsequent draws, the procedure is called SRSWOR. The number
of possible samples of size n in SRSWOR is
NCn = N!/[n!(N − n)!]
If N = 4, n = 3, then the no. of samples = 4C3 = 4
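The count for N = 4 and n = 3 can be verified by listing the possible samples. In this small Python sketch the unit labels A, B, C, D are hypothetical names chosen for illustration.

```python
from itertools import combinations
from math import comb

units = ["A", "B", "C", "D"]              # hypothetical population of N = 4 units
samples = list(combinations(units, 3))    # all unordered samples of size n = 3

for s in samples:
    print(s)
print(len(samples), comb(4, 3))           # both counts equal NCn
```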


There are two methods of selecting a simple random sample
1. Lottery Method
2. Use of random numbers tables

Lottery Method
This is the simplest method of drawing a simple random sample, under which all units of the population are
numbered. In this method we collect as many identical cards (of the same size, same colour and same shape) as the
no. of population units, one card for each unit number, and these cards are put in a rotating drum or container in
which they are well mixed. If we want to draw a sample of size n with replacement, then we draw a card from the drum,
note the number on this card and put the card back before the next draw; then the drum is rotated and another card
is drawn and its number noted. This procedure continues until we get n numbers, and the units of the population
corresponding to these numbers are selected. If a drawn card is not put back before the next draw, then we get a
SRSWOR.
Use of Random Numbers table
The lottery method, discussed above, becomes quite cumbersome to use if the size of the population is very
large. An alternative method of random selection is that of using a table of random numbers. A random
number table is an arrangement of the digits 0 to 9. A table of random numbers is so constructed that all the digits
0, 1, 2, …, 9 appear independently of each other and with approximately the same frequency. If we have to
select a sample from a population of size N (≤ 99), then we read the digits in pairs, i.e. as two-digit numbers. The
method of drawing the random sample consists of the following steps:

1. Identify the N units in the population with the numbers from 1 to N.
2. Select at random any page of the random number tables and pick up the numbers in any row, column or
diagonal at random; discard any number which is greater than N, or replace it by its remainder on
division by N.
3. The population units corresponding to the numbers selected in step 2 constitute the random sample.
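The three steps above can be imitated on a computer. The following Python sketch is only an illustration: Python's random module stands in for the printed random number table, the function name is hypothetical, and the "discard" rule of step 2 is used (numbers outside 1 to N, and repeats, are skipped so that the result is a SRSWOR).

```python
import random


def draw_srswor(N, n, seed=None):
    """Draw n distinct unit numbers from 1..N by reading blocks of random digits."""
    if n > N:
        raise ValueError("sample size cannot exceed population size")
    rng = random.Random(seed)          # stands in for the random number table
    digits = len(str(N))               # e.g. two-digit numbers when N <= 99
    chosen = []
    while len(chosen) < n:
        number = rng.randrange(10 ** digits)   # one block of `digits` random digits
        if 1 <= number <= N and number not in chosen:
            chosen.append(number)              # keep it; otherwise discard and read on
    return chosen


print(draw_srswor(N=50, n=5, seed=1))
```

Allowing repeated numbers instead of skipping them would give a sample with replacement (SRSWR).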

STRATIFIED RANDOM SAMPLING


When the units of the population are scattered and not completely homogeneous in nature with respect to the
characteristic under study, a simple random sample does not give proper representation of the population.
In stratified random sampling the whole population is divided into some homogeneous groups or classes
with respect to the characteristic under study, which are known as strata. The auxiliary information related to the
character under study may be used to divide the population into various groups or strata such that the units within
each stratum are as homogeneous as possible and the strata are as widely different as possible.
Thus, all the strata together comprise the population. Then from each stratum a sample is drawn, and lastly all
the samples are combined to get the ultimate sample. For example, let us consider that a population consists of
N units and these are distributed in a heterogeneous structure. First of all we divide the population into ‘k’
non-overlapping strata of sizes N1, N2, N3, ..., Nk such that each stratum becomes homogeneous. Evidently N =
N1 + N2 + N3 + ... + Nk. Then from the first stratum a sample of size n1 is drawn by the simple random sampling
method. Similarly, from the second stratum a sample of n2 units is drawn, and so on, up to the kth stratum.
Now all these k samples are combined to get the ultimate sample. So, the ultimate size of the sample is
n = n1 + n2 + ... + nk. This method of sampling is known as Stratified Random Sampling because here
stratification is done first to make the population homogeneous within strata, and then samples are drawn randomly
by simple random sampling from each stratum.
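Allocation of Sample Size
The total sample size n may be allocated to the k strata in the following ways:
1. Equal allocation: ni = n/k
2. Proportional allocation: ni = n × Ni/N
3. Optimum allocation: ni = n × NiSi/(N1S1 + N2S2 + ... + NkSk), where Si is the standard deviation of the ith stratum

Problem: Suppose three small towns are under study, having populations N1 = 50000, N2 = 30000 and N3 =
40000, respectively. A stratified random sample is to be taken with a total sample size of n = 500. Determine
the sample size to be taken from each town individually using the method of (a) proportional, and (b) optimal
allocation. It is (roughly) known from a previous survey that S1 = 30, S2 = 15 and S3 = 20.
Solution: Here N = N1 + N2 + N3 = 120000.
(a) Under proportional allocation:
n1 = 500 × 50000/120000 = 208.33 ≈ 208
n2 = 500 × 30000/120000 = 125
n3 = 500 × 40000/120000 = 166.67 ≈ 167
(b) Under optimal allocation: N1S1 + N2S2 + N3S3 = 1500000 + 450000 + 800000 = 2750000, so
n1 = 500 × 1500000/2750000 = 272.73 ≈ 273
n2 = 500 × 450000/2750000 = 81.82 ≈ 82
n3 = 500 × 800000/2750000 = 145.45 ≈ 145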
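A short Python sketch of the two allocations for the three towns is given below. It assumes the standard formulas ni = n·Ni/N for proportional allocation and ni = n·NiSi/ΣNiSi for optimum allocation with equal cost per unit; the function names are illustrative.

```python
def proportional_allocation(n, sizes):
    """n_i = n * N_i / N for each stratum."""
    N = sum(sizes)
    return [n * Ni / N for Ni in sizes]


def optimum_allocation(n, sizes, sds):
    """n_i = n * N_i * S_i / sum(N_i * S_i) for each stratum."""
    weights = [Ni * Si for Ni, Si in zip(sizes, sds)]
    total = sum(weights)
    return [n * w / total for w in weights]


sizes = [50000, 30000, 40000]     # N1, N2, N3
sds = [30, 15, 20]                # S1, S2, S3
prop = [round(v) for v in proportional_allocation(500, sizes)]
opt = [round(v) for v in optimum_allocation(500, sizes, sds)]
print(prop, sum(prop))
print(opt, sum(opt))
```

Rounding each ni separately happens to preserve the total of 500 here; in general the rounded values may need a small adjustment so that they add up to n.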

E) A sample of 60 persons is to be drawn from a population consisting of 600 persons belonging to two villages A
and B. The means and SDs of their monthly wages are given below:

Village    Size    Mean    SD
A          400     60      20
B          200     120     80

Draw the samples using proportional and optimum allocation.

Systematic sampling

In systematic sampling, one unit is selected randomly and the subsequent units are selected according to a
pre-determined pattern. It is used in surveys of timber in a forest, books in a library, etc.

Advantages of systematic sampling
1. Systematic sampling is very simple.
2. It is not very expensive.
3. The systematic sample is uniformly distributed over the whole population.
4. Systematic sampling is often more efficient than simple random sampling.

Linear Systematic Sampling
Suppose we have a population of size N and we have to draw a sample of size n. This method is applicable if
the population size N is a multiple of the sample size n, i.e., N = nk or N/n = k, where k is an integer.
Step I: In this method, first of all we assign the numbers 1 to N to the population units.
Step II: We select a random number r between 1 and k from the random number table, where r is
called the random start and k is called the sampling interval.
Step III: Then we select every kth unit of the population in the sample. In this way we get the sample of size n as
r, r + k, r + 2k, . . ., r + (n − 1)k
This technique generates k possible systematic samples, each with equal probability. This method is known as linear
systematic sampling.
Example: Suppose there are 20 units in a population serially numbered 1 to 20 and we have to draw a
systematic sample of size 4. Here N = 20 and n = 4, so k = N/n = 20/4 = 5.
So first we select a random number between 1 and 5 from the random number table. Suppose this number is 3; then
we select the rest of the sample units in a systematic way as
3, 8, 13, 18

Circular Systematic Sampling

The main drawback of linear systematic sampling is that it can be used only when N is a multiple of n, i.e. N = kn.
But in general N need not be a multiple of n. For example, if N = 15 and n = 4, then N/n = 3.75 is not an
integer. In such situations we use circular systematic sampling. This method has the following steps:
Step I: In this method, first of all we assign the numbers 1 to N to the population units and regard the N units
as arranged around a circle.
Step II: We take k by rounding N/n off to the nearest integer.
Step III: We select a random number from 1 to N. Suppose the number is i.
Step IV: Then, starting from the ith unit, we select every kth unit in a circular manner until n units are obtained.
For example, suppose we have a population of 14 households from which we have to draw a sample of size 5.
Here N = 14, n = 5, so N/n = 2.8 and k = 3.
First we select a random number from 1 to N, i.e. 1 to 14. Let it be 7; then the selected sample is
7, 7+k, 7+2k, 7+3k, 7+4k
7, 10, 13, 16, 19
7, 10, 13, 2, 5 (taking the remainders of 16 and 19 on division by 14)
If the random number selected is, say, 9, then the sample is
9, 12, 1, 4, 7
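Both systematic schemes are easy to express in code. The sketch below (function names are illustrative) takes the random start as an argument so that the worked examples can be reproduced; in practice the start would be chosen at random.

```python
def linear_systematic(N, n, r):
    """Units r, r+k, ..., r+(n-1)k with k = N/n; requires N to be a multiple of n."""
    if N % n != 0:
        raise ValueError("N must be a multiple of n for linear systematic sampling")
    k = N // n
    if not 1 <= r <= k:
        raise ValueError("random start r must lie between 1 and k")
    return [r + j * k for j in range(n)]


def circular_systematic(N, n, start):
    """Every k-th unit from `start`, wrapping around the circle of N units."""
    k = int(N / n + 0.5)                       # N/n rounded to the nearest integer
    return [((start - 1 + j * k) % N) + 1 for j in range(n)]


print(linear_systematic(20, 4, 3))     # random start 3, interval 5
print(circular_systematic(14, 5, 7))   # random start 7, interval 3
print(circular_systematic(14, 5, 9))   # random start 9, interval 3
```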
E) The information regarding the production of wheat in 25 districts is collected for a particular season. Select a
systematic random sample of 7 units from the following data:
23,20,30,37,76,25,16,24,54,45,21,14,54,20,26,19,12,16,28,32,41,35,22,41,15

Test 1
Q1: A computer chip manufacturer claims that at most 2 percent of the chips it produces are defective. An
electronics company, impressed by that claim, has purchased a large quantity of chips. To check the claim of the
manufacturer, the company has decided to test a sample of 250 of these chips. If there are eight defective chips
among these 250, does this disprove the manufacturer’s claim at 5% level of significance?
Q2: A researcher would like to test whether there is any significant difference between the safety consciousness of
men and women while driving a car. In a sample of 300 men, 130 said that they used seat belts. In a sample of
300 women, 90 said that they used seat belts. Test the claim that there is no significant difference between the
safety consciousness of men and women while driving a car at 5% level of significance.
Q3: A company manufactures two types of bulbs (A and B). The manager of the company tests a random
sample of 50 bulbs of type A and 60 bulbs of type B. She obtains the following information:
          Mean Life (in hours)    SD (in hours)
Type A    1300                    50
Type B    1200                    60

Test whether there is a significant difference in the average life of the two types of bulbs.

Test 2
Q1: Two sources of raw materials are under consideration by a bulb manufacturing company. Both sources
seem to have similar characteristics but the company is not sure about their respective uniformity. A sample of
12 lots from source A yields a variance of 125 and a sample of 10 lots from source B yields a variance of 112.
Is it likely that the variance of source A is greater than that of source B at significance level α = 0.01?
Q2: The pulse rates of 6 people were recorded before and after taking a new drug.
Before 68 71 84 93 67 74
After 71 70 81 97 73 80
Using 1% level, can you say that there is a significant increase in the pulse rate?

Q3: A computer chip manufacturer claims that at most 2 percent of the chips it produces are defective. An
electronics company, impressed by that claim, has purchased a large quantity of chips. To check the claim of the
manufacturer, the company has decided to test a sample of 250 of these chips. If there are eight defective chips
among these 250, does this disprove the manufacturer’s claim at 5% level of significance?

Example: In the given frequency distribution two frequencies are missing and its mean is found to be 1.46.
Number of Accidents (X): 0 1 2 3 4 5 Total
Frequency (f): 46 ? ? 25 10 5 200
Find the missing frequencies.
Solution: Let the missing frequencies be f1 and f2.
X        f      fX
0        46     0
1        f1     f1
2        f2     2f2
3        25     75
4        10     40
5        5      25
Total    200    140 + f1 + 2f2
Then 46 + f1 + f2 + 25 + 10 + 5 = 200
Or f1 + f2 = 114 … (i)
Also, since mean = ΣfX/N = 1.46 (Given), we have (140 + f1 + 2f2)/200 = 1.46
Or f1 + 2f2 = 152 … (ii)
Solving (i) and (ii), we get f2 = 38 and f1 = 76.

Example: The following data relate to the marks of 70 students in statistics. Find the mean.
Marks (More than): 20 30 40 50 60 70
No. of students : 70 63 40 30 18 7
Solution: In this example, a ‘more than’ cumulative frequency distribution is given. For computing the mean, the
given distribution is converted into a simple frequency distribution as shown in the table:
Computing Arithmetic Mean
Cumulative classes   Classes   f               X     d’ = (X – 55)/10   fd’
More than 20         20-30     70 – 63 = 7     25    –3                 –21
More than 30         30-40     63 – 40 = 23    35    –2                 –46
More than 40         40-50     40 – 30 = 10    45    –1                 –10
More than 50         50-60     30 – 18 = 12    55    0                  0
More than 60         60-70     18 – 7 = 11     65    1                  11
More than 70         70-80     7 – 0 = 7       75    2                  14
Total                          N = 70                                   Σfd’ = –52
Mean = A + (Σfd’/N) × h = 55 + (–52/70) × 10 = 55 – 7.43 = 47.57
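The conversion from a ‘more than’ distribution to simple frequencies, and the resulting mean, can be checked directly. The sketch below assumes, as in the table, that the last class is 70-80 with the same width of 10 as the others.

```python
lower_limits = [20, 30, 40, 50, 60, 70]
more_than = [70, 63, 40, 30, 18, 7]       # 'more than' cumulative frequencies
width = 10                                # common class width (last class taken as 70-80)

# Frequency of a class = its cumulative frequency minus that of the next class.
next_values = more_than[1:] + [0]
freqs = [c - nxt for c, nxt in zip(more_than, next_values)]

mid_points = [low + width / 2 for low in lower_limits]
mean = sum(f * x for f, x in zip(freqs, mid_points)) / sum(freqs)

print(freqs)
print(round(mean, 2))
```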

Example: Determine the modal value in the following series.
Value:      10  12  14  16  18  20  22  24  26  28  30  32
Frequency:  7   15  21  38  34  34  11  19  10  38  5   2
(Ans = 18)

Example: The sum of squares of 100 observations was calculated as 7961. Later, it was found that two values,
53 and 42 were wrongly read as 35 and 24 at the time of calculation. Find the corrected sum of squares.
Solution: Given the incorrect ΣX² = 7961
Corrected ΣX² = incorrect ΣX² – (Squares of wrong observations) + (Squares of correct observations)
Corrected ΣX² = 7961 – (35² + 24²) + (53² + 42²)
= 7961 – (1225 + 576) + (2809 + 1764) = 10733.
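The same correction rule can be written as a small helper. The function name is illustrative; the rule it applies is exactly the one used in the solution above.

```python
def corrected_sum_of_squares(incorrect_total, wrong_values, correct_values):
    """Remove the squares of the wrong values and add the squares of the correct ones."""
    return (incorrect_total
            - sum(x ** 2 for x in wrong_values)
            + sum(x ** 2 for x in correct_values))


print(corrected_sum_of_squares(7961, wrong_values=[35, 24], correct_values=[53, 42]))
```

When an observation is only removed and not replaced, as in some of the questions that follow, `correct_values` can be passed as an empty list.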
Question1: The sum of squares of 20 observations was worked out as 5100. But while calculating it, an
observation 31 was wrongly considered as 13. Find the corrected sum of squares.
Question2: The sum of squares of 50 observations is 4122. An observation 39 was wrongly included in the
series. Find the sum of squares of the remaining 49 observations.
Question3: The arithmetic mean and the S.D. of a series of 20 items were calculated as 20 cm and 5 cm
respectively. But while calculating them, an item 13 was measured as 30. Find the correct arithmetic mean and
standard deviation.
Question4: The mean and S.D. of 20 items are found to be 10 and 2 respectively. At the time of checking, it was
observed that one item 8 was incorrect. Find the mean and the S.D. if (i) the wrong item is omitted (ii) it is
replaced by 12.
Properties of Standard Deviation
1. The value of the S.D. of a series remains unchanged if each variate value is increased or decreased by the same
constant value. In other words, we can say that the S.D. is independent of change of origin.
Symbolically,
Let Y = X + b, where b is a constant.
Then σY = σX, i.e., the S.D.’s of the variables X and Y will be equal.
Example: Suppose 5, 8, 17, 12 and 7 are five observations on a variable X. A new variable Y is obtained by
adding 2 (a constant) to each observation on X. Further, let Z be another variable defined by subtracting 3 from
each value of X. Find the standard deviations of the variables X, Y and Z, say σX, σY and σZ respectively.
(Ans: 4.26, 4.26, 4.26)
2. If the values of a variable X are multiplied (or divided) by a constant, the S.D. of the new observations can be
obtained by multiplying (or dividing) the initial S.D. by the absolute value of the same constant. Symbolically,
If Y = kX, where k is a constant
Then σY = |k| σX
In other words, we can say that the S.D. is affected by change of scale.
Example: Suppose 2, 6, 9, 5, 4 are five observations on a variable X. A new variable Y is obtained by
multiplying each observation on X by 3 (a constant). Further, another variable Z is obtained by dividing each
observation on X by 2. Then we find the S.D.’s of the variables X, Y and Z, say σX, σY and σZ respectively.
(Ans: 2.32, 6.95, 1.16)
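Both properties can be confirmed numerically. The sketch below uses `statistics.pstdev` from the Python standard library, which divides by n (the population form of the S.D.); this appears to be the form behind the answers quoted above.

```python
from statistics import pstdev

# Property 1: change of origin leaves the S.D. unchanged.
x = [5, 8, 17, 12, 7]
y = [v + 2 for v in x]
z = [v - 3 for v in x]
print(round(pstdev(x), 2), round(pstdev(y), 2), round(pstdev(z), 2))

# Property 2: change of scale multiplies the S.D. by the same constant.
u = [2, 6, 9, 5, 4]
u_times_3 = [3 * v for v in u]
u_half = [v / 2 for v in u]
print(round(pstdev(u), 2), round(pstdev(u_times_3), 2), round(pstdev(u_half), 2))
```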

3. Combined standard deviation can be calculated if the standard deviations, means and numbers of items in the
different groups are given. The formula used for computing the combined standard deviation of two groups is as under:
σ12 = √[(n1σ1² + n2σ2² + n1d1² + n2d2²)/(n1 + n2)]
Where
σ12 = Combined S.D. of the two groups.
σ1 = Standard deviation of the first group.
σ2 = Standard deviation of the second group.
n1 = No. of observations in the first group.
n2 = No. of observations in the second group.
X̄12 = (n1X̄1 + n2X̄2)/(n1 + n2) = combined mean of the two groups.
X̄1 = mean of the first group.
X̄2 = mean of the second group.
d1 = X̄1 − X̄12 and d2 = X̄2 − X̄12.


Question5: The standard deviation of 5 items is found to be What will be the standard deviation if
the values of all the items are increased by 5? (Ans )
Question6: A sample of 35 values has mean 80 and S.D. 4. A second sample of 65 values from the same
population has mean 70 and S.D. 3. Find the mean and standard deviation of the combined sample of 100
values. (Ans: 73.5, 5.85)

Question7:Find the mean and the standard deviation of the two groups taken together:
Group Number Mean S.D.
A 113 160 22
B 120 150 20

(Ans: 154.85,21.58)
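The answers to the two combined-sample questions can be checked with a short function. This is a sketch of the usual two-group formula, σ12² = [n1(σ1² + d1²) + n2(σ2² + d2²)]/(n1 + n2), with d1 and d2 the deviations of the group means from the combined mean; the function name is illustrative.

```python
from math import sqrt


def combined_mean_sd(n1, mean1, sd1, n2, mean2, sd2):
    """Combined mean and S.D. of two groups from their sizes, means and S.D.s."""
    n = n1 + n2
    mean12 = (n1 * mean1 + n2 * mean2) / n
    d1 = mean1 - mean12
    d2 = mean2 - mean12
    var12 = (n1 * sd1 ** 2 + n2 * sd2 ** 2 + n1 * d1 ** 2 + n2 * d2 ** 2) / n
    return mean12, sqrt(var12)


m6, s6 = combined_mean_sd(35, 80, 4, 65, 70, 3)       # two samples of 35 and 65 values
m7, s7 = combined_mean_sd(113, 160, 22, 120, 150, 20)  # groups A and B
print(round(m6, 2), round(s6, 2))
print(round(m7, 2), round(s7, 2))
```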

Example: A computer, while calculating the correlation coefficient between two variables X and Y from 25
pairs of observations, obtained the following results:
n = 25, ΣX = 125, ΣY = 100, ΣX² = 650, ΣY² = 460, ΣXY = 508
It was, however, discovered at the time of checking that two pairs of observations were not correctly copied.
They were taken as (6, 14) and (8, 6) while the correct values were (8, 12) and (6, 8). Find the corrected value
of r.
Solution:
        Incorrect values                      Correct values
        X     Y     X²     Y²     XY          X     Y     X²     Y²     XY
        6     14    36     196    84          8     12    64     144    96
        8     6     64     36     48          6     8     36     64     48
Total   14    20    100    232    132         14    20    100    208    144
Thus,
Corrected ΣX = 125 – 14 + 14 = 125
Corrected ΣY = 100 – 20 + 20 = 100
Corrected ΣX² = 650 – 100 + 100 = 650
Corrected ΣY² = 460 – 232 + 208 = 436
Corrected ΣXY = 508 – 132 + 144 = 520
Therefore, the corrected value of r is
r = (nΣXY – ΣXΣY)/√{[nΣX² – (ΣX)²][nΣY² – (ΣY)²]} = (25 × 520 – 125 × 100)/√[(16250 – 15625)(10900 – 10000)] = 500/750 = 0.67
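The corrected value of r follows from the corrected sums by the usual product-moment formula. The Python sketch below applies that formula to the corrected totals; the function and argument names are illustrative.

```python
from math import sqrt


def pearson_r_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Corrected totals from the 25 pairs of observations
r = pearson_r_from_sums(n=25, sum_x=125, sum_y=100,
                        sum_x2=650, sum_y2=436, sum_xy=520)
print(round(r, 2))
```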

Question: With 10 observations each on two variables X and Y, the following data were observed:
However, on subsequent verification it was found that one value
of X( = 15) and one value of Y(= 13) were wrongly taken as 16 and 18. Find the correct value of correlation
coefficient.

Since,

or
or
or
or
= 1860
Corrected
Corrected

Now considering

Or
Similarly,

Corrected
Corrected
Therefore, the corrected value of r is

Random Numbers

Random numbers are generated through a probabilistic mechanism. The numbers have the following
properties:

i) The probability that each digit 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 will appear at any particular place is the same,
namely 1/10.
ii) The occurrences of any two digits, in any two places, are independent of each other.
Pseudo-Random Numbers
Random numbers are called pseudo-random numbers when they are generated by some deterministic process. Such
numbers may be repeated after a certain period in a cycle, there may be a correlation between successive numbers,
or several high numbers may follow or precede several low numbers. The generated numbers can be used effectively
in simulation if the period of the cycle is very large.

Use of Random Number Generation


In probability and statistical applications, we require large quantities of random numbers. Reading the
random numbers from tables and using them in the analysis is very slow, and sometimes we need more random
numbers than are published in the tables. So it is necessary to devise a mechanism through which we can
generate random numbers automatically. Also, in different situations we need random numbers from
different distributions such as Poisson, binomial, normal, gamma, etc. In such situations, we use generated
random numbers.
Generation of Random Numbers
There are various techniques for generating random numbers. Some important methods of generating random
numbers are
1) Lottery Method
2) Middle Square Method
3) Congruential Method
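The last two methods are deterministic and can be sketched in a few lines of Python. The versions below are common textbook forms and only an illustration: the middle square method squares the current number and keeps its middle digits, and the (linear) congruential method uses x(next) = (a·x + c) mod m. The seed and the constants a, c, m are arbitrary illustrative choices, not recommended values.

```python
def middle_square(seed, count, digits=4):
    """Square the current number, pad to 2*digits digits, keep the middle `digits` digits."""
    numbers = []
    value = seed
    for _ in range(count):
        squared = str(value ** 2).zfill(2 * digits)
        start = digits // 2
        value = int(squared[start:start + digits])
        numbers.append(value)
    return numbers


def linear_congruential(seed, a, c, m, count):
    """Generate x_next = (a * x + c) mod m, starting from the seed."""
    numbers = []
    value = seed
    for _ in range(count):
        value = (a * value + c) % m
        numbers.append(value)
    return numbers


print(middle_square(5735, 3))                     # 5735^2 = 32890225 -> 8902, ...
print(linear_congruential(7, a=5, c=3, m=16, count=4))
```

Both sequences eventually repeat in a cycle, which is why such numbers are called pseudo-random; the small modulus m = 16 is used only to keep the example readable.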

Control Charts
The control chart is the most important tool for process control. With the help of control chart, we quickly
detect the occurrence of assignable causes of variation and can take corrective action to eliminate them.
Control charts are broadly classified into two categories:
1. Control charts for variables (control charts for measurable characteristics): X̄-chart and R-chart
2. Control charts for attributes (control charts for non-measurable characteristics): p-chart, np-chart and c-chart

CONTROL CHART FOR MEAN


The main steps for the construction of the X̄-chart are as follows:
Step 1: First of all, we calculate the sample mean and range of each sample as
X̄ = (X1 + X2 + . . . + Xn)/n and R = Xmax − Xmin
where n is the sample size.
Step 2: After that, we find the grand mean X̿, that is, the mean of all sample means, and the mean R̄ of all sample
ranges as
X̿ = (X̄1 + X̄2 + . . . + X̄k)/k and R̄ = (R1 + R2 + . . . + Rk)/k
where k is the number of samples.
Step 3: We calculate the centre line and control limits using the following formulae
Centre line (CL) = X̿
Lower control limit (LCL) = X̿ − A2R̄
Upper control limit (UCL) = X̿ + A2R̄
where A2 is a constant read from the table for the sample size n.
Step 4: After that we plot the control chart taking the sample number on the X-axis and the sample means on the
Y-axis.
Step 5: If any point lies outside the upper control limit or the lower control limit, the process is said to be out of
statistical control.
Example 3: A new process of producing ball bearings is started. For monitoring the outside diameter of the ball
bearings, the quality controller takes the sample of five ball bearings at 10.00 AM, 12.00 PM, 2.00 PM, 4.00 PM
and 6.00 PM and measures the outside diameter (in mm) of each selected ball bearing (Fig. 2.3). The results
of the test over a 4-day production period are as follows:

Sample        Observations                 X̄        R
Number (k)    X1    X2    X3    X4    X5
1             52    52    50    51    51   51.2     2 (= 52 − 50)
2             50    53    52    53    51   51.8     3
3             54    51    50    52    53   52.0     4
4             56    55    53    55    53   54.4     3
5             51    52    50    53    53   51.8     3
6             50    52    51    50    51   50.8     2
7             50    50    52    51    53   51.2     3
8             52    51    53    50    50   51.2     3
9             52    53    52    55    53   53.0     3
10            51    51    50    51    52   51.0     2
11            52    52    54    52    52   52.4     2
12            49    48    50    50    51   49.6     3
13            52    53    54    49    52   52.0     5
14            52    51    54    51    54   52.4     3
15            51    51    52    52    51   51.4     1
16            50    50    51    52    51   50.8     2
17            50    51    53    51    53   51.6     3
18            52    50    49    53    50   50.8     4
19            52    51    54    51    51   51.8     3
20            51    51    50    52    52   51.2     2
Total                                      1032.4   56

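Calculate and draw the centre line and control limits of the $\bar{X}$-chart. Draw the conclusion about the process.
Solution: To calculate the centre line and control limits, we first calculate the values of $\bar{\bar{X}}$ and $\bar{R}$ as follows:

$\bar{\bar{X}} = \frac{1}{k}\sum \bar{X}_i = \frac{1032.4}{20} = 51.62, \qquad \bar{R} = \frac{1}{k}\sum R_i = \frac{56}{20} = 2.8$

From Table I given at the end of this block, we have A2 = 0.577 for n = 5.

The CL, UCL and LCL for the $\bar{X}$-chart when μ and σ are unknown are

CL = $\bar{\bar{X}}$ = 51.62
UCL = $\bar{\bar{X}} + A_2\bar{R}$ = 51.62 + 0.577 × 2.8 = 53.2356
LCL = $\bar{\bar{X}} - A_2\bar{R}$ = 51.62 - 0.577 × 2.8 = 50.0044

We now construct the $\bar{X}$-chart by taking the sample number on the X-axis and the average diameter of the ball bearing on the Y-axis as shown in Fig. 1.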

Fig. 1: The $\bar{X}$-chart for the average diameter of the ball bearings.

Interpretation of the result


From Fig. 1, we observe that the points corresponding to samples 4 and 12 lie outside the control limits.
Therefore, the process is out-of-control and some assignable causes are present in the process.
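
The arithmetic of Example 3 can be checked with a short Python sketch. The function name `xbar_chart_limits` and the variable names are illustrative assumptions, not part of the original notes; the observations and the constant A2 = 0.577 are taken from the example.

```python
# Observations from Example 3: 20 samples, each of n = 5 diameters (mm).
samples = [
    [52, 52, 50, 51, 51], [50, 53, 52, 53, 51], [54, 51, 50, 52, 53],
    [56, 55, 53, 55, 53], [51, 52, 50, 53, 53], [50, 52, 51, 50, 51],
    [50, 50, 52, 51, 53], [52, 51, 53, 50, 50], [52, 53, 52, 55, 53],
    [51, 51, 50, 51, 52], [52, 52, 54, 52, 52], [49, 48, 50, 50, 51],
    [52, 53, 54, 49, 52], [52, 51, 54, 51, 54], [51, 51, 52, 52, 51],
    [50, 50, 51, 52, 51], [50, 51, 53, 51, 53], [52, 50, 49, 53, 50],
    [52, 51, 54, 51, 51], [51, 51, 50, 52, 52],
]
A2 = 0.577  # constant for n = 5, as quoted in the example


def xbar_chart_limits(samples, a2):
    """Return (sample means, CL, LCL, UCL) for an X-bar chart (hypothetical helper)."""
    means = [sum(s) / len(s) for s in samples]        # Step 1: sample means
    ranges = [max(s) - min(s) for s in samples]       # Step 1: sample ranges
    grand_mean = sum(means) / len(means)              # Step 2: mean of sample means
    mean_range = sum(ranges) / len(ranges)            # Step 2: mean of sample ranges
    lcl = grand_mean - a2 * mean_range                # Step 3: lower control limit
    ucl = grand_mean + a2 * mean_range                # Step 3: upper control limit
    return means, grand_mean, lcl, ucl


means, cl, lcl, ucl = xbar_chart_limits(samples, A2)

# Step 5: sample numbers (1-based) whose mean falls outside the limits.
out_of_control = [i for i, m in enumerate(means, start=1) if m < lcl or m > ucl]

print(f"CL = {cl:.4f}, LCL = {lcl:.4f}, UCL = {ucl:.4f}")
print("Out-of-control samples:", out_of_control)
```

Under these assumptions the flagged sample numbers should match the ones read off the chart in the interpretation.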

CONTROL CHART FOR RANGE (R-CHART)

The CL, UCL and LCL for the R-chart are

CL = $\bar{R}$
UCL = $D_4\bar{R}$
LCL = $D_3\bar{R}$

where $D_3$ and $D_4$ are constants read from the table for the sample size n.
Example 5: A milk company uses automatic machines to fill 500 ml milk packets. A quality control inspector
inspected four packets for each sample at given time intervals and measured the weight of each filled packet.
The data for 20 samples are shown in the following table:

Sample    Weight of Filled Milk Packet (in ml)    Mean        Range
Number                                            $\bar{X}$   R
1         500   520   500   500                   506.67      20
2         500   490   520   530                   503.33      40
3         490   550   570   540                   536.67      80
4         510   520   500   520                   510.00      20
5         510   480   490   490                   493.33      30
6         520   500   520   500                   513.33      20
7         520   510   530   510                   520.00      20
8         530   490   520   500                   513.33      40
9         510   490   500   510                   500.00      20
10        520   520   490   520                   510.00      30
11        520   500   510   500                   510.00      20
12        480   500   520   510                   500.00      40
13        530   510   520   510                   520.00      20
14        500   510   510   500                   506.67      10
15        490   520   500   510                   503.33      30
16        520   500   530   500                   516.67      30
17        520   560   490   510                   523.33      70
18        500   490   500   510                   496.67      20
19        520   500   530   500                   516.67      30
20        500   490   500   490                   496.67      10
Total                                             10196.67    600

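Using the R-chart, draw the conclusion about the process by assuming assignable causes for any out-of-control points.
Solution: From Table I, we have $D_3 = 0$ and $D_4 = 2.282$ for n = 4.

We first calculate the value of $\bar{R}$ for the centre line and the control limits from the given data as follows:

$\bar{R} = \frac{1}{k}\sum R_i = \frac{600}{20} = 30$

The CL, UCL and LCL for the R-chart when σ is unknown are

CL = $\bar{R}$, UCL = $D_4\bar{R}$, LCL = $D_3\bar{R}$

Substituting the values of $\bar{R}$, $D_3$ and $D_4$, we get

CL = 30
UCL = 2.282 × 30 = 68.46
LCL = 0 × 30 = 0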

We now construct the R-chart by taking the sample number on the X-axis and the sample range (R) of the milk
packets on the Y-axis as shown in Fig. 2.5.

Fig. 2.5: The R-chart for milk packets.
Interpretation of the result
From Fig. 2.5, we observe that the points corresponding to samples 3 and 17 lie outside the control limits.
Therefore, the process variability is out-of-control and some assignable causes are present in the process.
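
The R-chart calculation for the milk packet data can be sketched in Python as follows. The names used are illustrative assumptions. The constants D3 = 0 and D4 = 2.282 are the commonly tabulated values for a sample size of 4 and are assumed here, since Table I is not reproduced in these notes.

```python
# Four packet weights per sample, taken from the milk packet example (20 samples).
samples = [
    [500, 520, 500, 500], [500, 490, 520, 530], [490, 550, 570, 540],
    [510, 520, 500, 520], [510, 480, 490, 490], [520, 500, 520, 500],
    [520, 510, 530, 510], [530, 490, 520, 500], [510, 490, 500, 510],
    [520, 520, 490, 520], [520, 500, 510, 500], [480, 500, 520, 510],
    [530, 510, 520, 510], [500, 510, 510, 500], [490, 520, 500, 510],
    [520, 500, 530, 500], [520, 560, 490, 510], [500, 490, 500, 510],
    [520, 500, 530, 500], [500, 490, 500, 490],
]

# Assumed standard control chart constants for n = 4.
D3 = 0.0
D4 = 2.282

# Range of each sample: largest observation minus smallest observation.
ranges = [max(s) - min(s) for s in samples]

r_bar = sum(ranges) / len(ranges)   # centre line of the R-chart
ucl = D4 * r_bar                    # upper control limit
lcl = D3 * r_bar                    # lower control limit

# Sample numbers (1-based) whose range falls outside the control limits.
out_of_control = [i for i, r in enumerate(ranges, start=1) if r < lcl or r > ucl]

print(f"CL = {r_bar:.2f}, LCL = {lcl:.2f}, UCL = {ucl:.2f}")
print("Out-of-control samples:", out_of_control)
```

Only the sample ranges enter this calculation, so the column of sample means in the table plays no part in the R-chart.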

CONTROL CHART FOR ATTRIBUTES

CONTROL CHART FOR PROPORTION DEFECTIVE (p-CHART)


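When the process fraction defective is not known, the centre line and control limits of the p-chart are

CL = $\bar{p}$

UCL = $\bar{p} + 3\sqrt{\frac{\bar{p}(1-\bar{p})}{n}}$

LCL = $\bar{p} - 3\sqrt{\frac{\bar{p}(1-\bar{p})}{n}}$

where $\bar{p} = \frac{\sum d_i}{nk}$ is the average fraction defective, $d_i$ is the number of defectives in the i-th sample, n is the sample size and k is the number of samples.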

Example 3: The following data are found during the inspection of the first 15 samples of size 100 each from a
lot of two-wheelers manufactured by an automobile company:
Sample Number          1   2   3   4   5    6   7   8   9   10   11   12   13   14   15
Number of Defectives   3   4   6   2   12   5   3   6   3   5    4    15   5    2    3

Draw the chart for fraction/proportion defective (p) and comment on the state of control.
Solution: To draw the p-chart, we need to calculate the centre line and control limits. Here, the process fraction
defective is not known. So in this case we use equations (8a to 8c).
To calculate the control limits, we first calculate the fraction defective for each sample and then the average fraction defective $\bar{p}$.
Sample    Sample Size   Number of        Proportion Defective
Number    (n)           Defectives (d)   (p = d/n)
1         100           3                0.03
2         100           4                0.04
3         100           6                0.06
4         100           2                0.02
5         100           12               0.12
6         100           5                0.05
7         100           3                0.03
8         100           6                0.06
9         100           3                0.03
10        100           5                0.05
11        100           4                0.04
12        100           15               0.15
13        100           5                0.05
14        100           2                0.02
15        100           3                0.03
Total                   78

From the above table, we have

$\bar{p} = \frac{\sum d}{nk} = \frac{78}{100 \times 15} = 0.052$

Therefore, we can calculate the centre line and control limits as follows:
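CL = $\bar{p}$ = 0.052

UCL = $\bar{p} + 3\sqrt{\frac{\bar{p}(1-\bar{p})}{n}}$ = 0.052 + 3 × $\sqrt{\frac{0.052 \times 0.948}{100}}$ = 0.052 + 0.0666 = 0.1186

LCL = $\bar{p} - 3\sqrt{\frac{\bar{p}(1-\bar{p})}{n}}$ = 0.052 - 0.0666 = -0.0146, which is negative. Since a proportion cannot be negative, we take LCL = 0.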

We now draw the p-chart by taking the sample number on the X-axis and the proportion defective (p) on the Y-
axis as shown in Fig. 3.2.

Fig. 3.2: The p-chart for fraction defective of two-wheelers.
Interpretation of the result
From the p-chart shown in Fig. 3.2, we observe that the points corresponding to sample numbers 5 and 12 lie
outside the upper control limit. Therefore, the process is out-of-control. It means that some assignable causes
are present in the process.
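
A minimal Python sketch of the p-chart calculation for the two-wheeler data is given below. The variable names are illustrative assumptions. Setting a negative lower limit to zero follows the usual convention that a proportion cannot be negative.

```python
import math

# Number of defectives in each of the 15 samples (from the example).
defectives = [3, 4, 6, 2, 12, 5, 3, 6, 3, 5, 4, 15, 5, 2, 3]
n = 100  # size of every sample

# Fraction defective of each sample, p = d / n.
proportions = [d / n for d in defectives]

# Average fraction defective: total defectives over total items inspected.
p_bar = sum(defectives) / (n * len(defectives))

# Three-sigma limits based on the binomial standard error of a proportion.
sigma_p = math.sqrt(p_bar * (1 - p_bar) / n)
ucl = p_bar + 3 * sigma_p
lcl = max(0.0, p_bar - 3 * sigma_p)  # a negative limit is replaced by zero

# Sample numbers (1-based) whose proportion falls outside the limits.
out_of_control = [
    i for i, p in enumerate(proportions, start=1) if p < lcl or p > ucl
]

print(f"CL = {p_bar:.4f}, LCL = {lcl:.4f}, UCL = {ucl:.4f}")
print("Out-of-control samples:", out_of_control)
```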

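CONTROL CHART FOR NUMBER OF DEFECTS (c-CHART)

CL = $\bar{c}$

UCL = $\bar{c} + 3\sqrt{\bar{c}}$

LCL = $\bar{c} - 3\sqrt{\bar{c}}$

where c is the number of defects per unit and $\bar{c} = \frac{1}{k}\sum c_i$ is the average number of defects over the k samples.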
Example 2: The number of scratch marks on a particular piece of furniture is recorded. The data for 20 samples
are given below:
Sample Number    1    2    3    4    5    6    7    8    9    10
Scratch Marks    6    3    14   7    2    5    12   4    7    3

Sample Number    11   12   13   14   15   16   17   18   19   20
Scratch Marks    2    7    6    8    4    10   5    4    13   9

Draw the appropriate control chart and write the comments about the state of the process.
Solution. Here total number of defects = 6 + 3 + … + 13 + 9 = 131.

Therefore, we calculate the centre line and control limits of the c-chart as follows:
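CL = $\bar{c}$ = 131/20 = 6.55
UCL = $\bar{c} + 3\sqrt{\bar{c}}$ = 6.55 + 3 × $\sqrt{6.55}$ = 6.55 + 7.68 = 14.23

LCL = $\bar{c} - 3\sqrt{\bar{c}}$ = 6.55 - 7.68 = -1.13, which is negative. Since the number of defects cannot be negative, we take LCL = 0.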
We construct the c-chart by taking the sample number on the X-axis and the number of defects (c) on the Y-
axis as shown in Fig. 4.2.

Fig. 4.2: The c-chart for number of defects


Interpretation of the result
From the c-chart (shown in Fig. 4.2), we observe that no point lies outside the control limits. Therefore, the
process is under statistical control and no assignable causes are present in the process.
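
The c-chart limits for the scratch mark data can be checked with the short Python sketch below. The variable names are illustrative assumptions, and a negative lower limit is replaced by zero because a count of defects cannot be negative.

```python
import math

# Number of scratch marks recorded on each of the 20 samples (from the example).
defects = [6, 3, 14, 7, 2, 5, 12, 4, 7, 3,
           2, 7, 6, 8, 4, 10, 5, 4, 13, 9]

# Average number of defects per unit: the centre line of the c-chart.
c_bar = sum(defects) / len(defects)

# Three-sigma limits; for a Poisson count the variance equals the mean.
ucl = c_bar + 3 * math.sqrt(c_bar)
lcl = max(0.0, c_bar - 3 * math.sqrt(c_bar))  # a negative limit is replaced by zero

# Sample numbers (1-based) whose defect count falls outside the limits.
out_of_control = [i for i, c in enumerate(defects, start=1) if c < lcl or c > ucl]

print(f"CL = {c_bar:.2f}, LCL = {lcl:.2f}, UCL = {ucl:.2f}")
print("Out-of-control samples:", out_of_control)
```

The largest count in the data is 14 (sample 3), which lies just below the upper limit, so no sample is flagged.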
