
KARNATAKA STATE OPEN UNIVERSITY

MUKTHAGANGOTHRI, MYSORE - 570 006.

DEPARTMENT OF STUDIES AND RESEARCH IN MANAGEMENT

M.B.A I SEMESTER
MBHC - 1.4

STATISTICS AND OPTIMIZATION TECHNIQUES

BLOCK - 1: INTRODUCTION TO STATISTICS

Unit - 1
Introduction to Business Statistics
Unit - 2
Analysis of Data
Unit - 3
Measures of Central Tendency
Unit - 4
Measures of Dispersion

BLOCK - 2: CORRELATION AND REGRESSION
Unit - 5
Correlation
Unit - 6
Methods of Computing Correlation
Unit - 7
Regression
Unit - 8
Multiple Correlation and Regression

BLOCK - 3: PROBABILITY
Unit - 9
Introduction to Probability and Probability Types
Unit - 10
Theoretical Probability Distributions, Normal Distribution
Unit - 11
Introduction to Operations Research
Unit - 12
Game Theory

BLOCK - 4: APPLICATION OF OPERATION RESEARCH IN BUSINESS
Unit - 13
Decision Analysis
Unit - 14
Network Analysis
Unit - 15
Solution to Network Problems
Unit - 16
Different Time Estimates - P E R T

BLOCK -1 : INTRODUCTION TO STATISTICS
The science of statistics is indispensable for a clear appreciation of any problem
affecting any branch of human knowledge. It covers all the fields of enquiry in which a
grasp of the significance of large numbers is looked for. It is applicable to all disciplines.
This block gives you the fundamental aspects of statistics. Basic concepts of central
tendency such as mean, median and mode are discussed here. This block also gives you an
insight into data collection, tabulation and analysis of data. Methods of calculating
dispersion are also taught in this block.
You are expected to understand these concepts and work out the problems given at the
end of each unit.
This block consists of four units.
Unit 1: Provides an introduction to business statistics, i.e. meaning, scope, importance and
limitations, and statistics in Business Management.
Unit 2: Gives an idea of how to analyse data, i.e. introduction, sources of data,
collection, classification, tabulation and depiction of data.
Unit 3: Describes various Measures of Central Tendency, i.e. arithmetic mean, weighted mean,
geometric mean, harmonic mean, median and mode.
Unit 4: Explains the Measures of Dispersion, i.e. range, quartile deviation, mean deviation,
standard deviation, variance, coefficient of variation, skewness and kurtosis.

BLOCK - 2 : CORRELATION AND REGRESSION
In the previous block you gained basic knowledge about data. You
learnt how to collect data, how to tabulate data and how to depict data. You also
learnt data analysis through various measures of central tendency such as mean, median
and mode. All of this was limited to a single set of data.
In this block let us try to understand the relationship between two sets of data. Some
sets of data depend directly or inversely on other sets of data. For example, crop yield over
the years depends on the amount of rainfall in that place over the same years. Such a relation
is called correlation. Once you are able to correlate two sets of data, you can try to find the
exact relationship between them, which is called regression.
In this block you will study 4 units.
Unit 5: Concept and definition of correlation, significance, types, properties of correlation,
methods of correlation analysis: graphic method.
Unit 6: Scatter diagrams, Karl Pearson’s correlation coefficient, rank correlation
coefficient.
Unit 7: Regression analysis: meaning and definition of regression, application
of regression analysis, difference between correlation and regression analysis, types of
regression models, standard error and regression coefficients.
Unit 8: Multiple correlation and regression: concept of multiple regression and
multiple correlation, concept of partial correlation, correlation coefficient, method of
least squares.

BLOCK -3 : PROBABILITY
In the previous blocks you studied statistics. In this block let us study an
important mathematical concept with significant application in business: probability.
The word probability is not new to you. You may be using it in your daily life, for example:
“it may rain today”, “I may go there”, etc. Probability is closely related to business. Earning
profit in any business depends on numerous factors, some controllable and some not, and
the outcome is therefore always probabilistic.
There are quantitative techniques which aid profit calculation. To find a measure
of profitability, it is necessary to perform certain experiments and visualize their possible
outcomes. Many situations or trials result in one of only two possible outcomes; the
distribution of such outcomes is called the binomial distribution. Sometimes the possibilities
follow a pattern in which extreme values are rare and values near the average are common.
For example, construction of a house may typically take one year; occasionally it is finished
in 6 months and occasionally it takes 3 years. A distribution in which the chances of the
extremes are low and the chances of values near the average are high is called the normal
distribution.
In this block you will study 4 units, as stated below.
Unit 9: Basic definitions, events, sample space and probabilities, basic rules of probability.
Unit 10: Conditional probability, independence of events, combinatorial concepts, law of
total probability, Bayes’ theorem.
Unit 11: Theoretical probability distributions – Binomial, Poisson and Normal – simple
problems applied to business. Probability distribution, discrete random variable,
binomial probability distribution.
Unit 12: Normal distribution.

BLOCK - 4 : APPLICATION OF OPERATION RESEARCH IN
BUSINESS
Decision making is an important management activity. The significance of a decision
varies with its type, such as routine or one-time, and with the decision making environment,
such as risk, certainty, uncertainty and so on. In this block you will understand how to take
decisions based on quantitative inputs in various situations.
Further, in this block you will also study network analysis. Network analysis mainly
helps you to keep pace with a project. Whenever you take up a project, you can identify
the various components of the project, delineate the sequence of those activities, determine the
time and cost required for each activity and then draw a chart. Based on this chart you
can control the project. You will come to know where the project is lagging or where the
project is exceeding the budget.
This block is divided into four units.
Unit 13: Decision making, decision making environment, various criteria used for decision
making, decision tree analysis.
Unit 14: Network scheduling using PERT and CPM, basic concepts of network analysis,
construction of network.
Unit 15: Different time estimates.
Unit 16: Cost considerations in CPM, probability of project completion.

CREDIT PAGE
Programme Name : MBA Year/Semester : 1st Year, 1st Semester Block No: 1 to 4

Course Name : Statistics & Optimization Techniques Credit : 04 Units No: 1 to 16


Course Design Expert Committee
Prof. Vidya Shankar, Chairman
Vice-Chancellor, Karnataka State Open University, Mukthagangothri, Mysore-06
Prof. Kamble Ashok, Member
Dean (Academic), Karnataka State Open University, Mukthagangothri, Mysore-06

Course Designer/Course Co-ordinator
Prof. C. Mahadevamurthy, BOS Chairman & Member
Professor, DOS & R in Management, KSOU, Mukthagangothri, Mysore-06
Prof. C. Mahadevamurthy, Department Chairman & Member Convener
Professor, DOS & R in Management, KSOU, Mukthagangothri, Mysore-06

Course Writer and Course Editor
Block - 1 (Units 1 to 4)
Course Writer: Smt. Prathiba Jennifer, Assistant Professor, Department of Management, Mahajana’s PG Centre, Mysore.
Course Editor: Dr. Rajeshwari H., Assistant Professor, DOS & R Management, KSOU, Mysore.

Block - 2 (Units 5 to 8)
Course Writer: Prof. B.H. Suresh, Department of Commerce, University of Mysore, Mysore.
Course Editor: Dr. Rajeshwari H., Assistant Professor, DOS & R Management, KSOU, Mysore.

Block - 3 (Units 9 to 12)
Course Writer: Dr. S.J. Manjunath, Associate Professor, BIMS, Mysore.
Course Editor: Dr. Rajeshwari H., Assistant Professor, DOS & R Management, KSOU, Mysore.

Block - 4 (Units 13 to 16)
Course Writer: Dr. M.S. Yathisha Chandra, Associate Professor, Department of Management, UBDTCE-VTU, Davanagere.
Course Editor: Dr. Rajeshwari H., Assistant Professor, DOS & R Management, KSOU, Mysore.

Copy Right
Registrar
Karnataka State Open University
Mukthagangothri, Mysuru - 570006
Developed by the Department of Studies and Research in Management, KSOU, under
the guidance of Dean (Academic), KSOU, Mysuru
Karnataka State Open University, January-2021
All rights reserved. No part of this work may be reproduced in any form, or any other means,
without permission in writing from the Karnataka State Open University.
Further information on the Karnataka State Open University Programmes may be obtained from
the University’s office at Mukthagangothri, Mysuru-570006
Printed and Published on behalf of Karnataka State Open University. Mysuru-570006 by
Registrar (Administration)-2021

BLOCK - 1

INTRODUCTION TO STATISTICS
UNIT -1: INTRODUCTION TO BUSINESS STATISTICS

STRUCTURE

1.0 Objectives
1.1 Introduction
1.2 Definitions and Meaning of Statistics
1.3 Scope of Statistics
1.4 Importance of Statistics
1.5 Limitations of Statistics
1.6 Statistics in Business and Management
1.7 Descriptive and Inferential Statistics in Business Decisions
1.8 Strengths and Weaknesses of Statistics
1.9 Language of Statistics
1.10 Summary
1.11 Key Words
1.12 Self Assessment Questions
1.13 Check Your Progress
1.14 References

1.0 OBJECTIVES
After studying this unit you should be able to :
• Define statistics and explain its scope, limitations and importance;
• Appreciate the use of statistics in business decisions and
• Analyze the strengths and weaknesses of statistics and the various terminologies of
statistics.

1.1 INTRODUCTION
Statistics is not a new discipline. It is as old as human society itself. The word
statistics is derived from the Latin word ‘status’, the Italian word ‘statista’ or the German
word ‘statistik’, each of which means a political state. In ancient times, the scope of
statistics was limited to the collection of data on the age and gender-wise population, property
and wealth of the country for framing military and fiscal policies.
The development of statistics can be traced back to the time of the Pharaohs of Egypt,
well over 2000 years ago, and traces of its application are seen in Kautilya’s Arthashastra
and during the time of Chandragupta Maurya. Systematic development of the field began in
the 16th, 17th and 18th centuries, when the theory of probability and of games of chance took
shape. Regression and correlation analysis and goodness of fit tests such as Chi-square,
the t-test etc. followed in the late 19th and early 20th centuries. R. A. Fisher, who is often
called the father of modern statistics, pioneered the application of statistics to genetics,
biometry, psychology, education, agriculture etc., which was a remarkable step in introducing
statistics to other fields. Thus statistics became a full-fledged science. Today statistics
provides solutions in various complicated fields like Economics, Business, Management,
Accountancy, Social Science, Industry, Biology and Medical Sciences.
1.2 DEFINITION AND MEANING OF STATISTICS
Statistics has been defined differently by different writers. It has been expressed both as
numerical data and as statistical methods.
Statistics as Numerical data:
“Statistics are numerical statements of facts in any department of enquiry placed
in relation to each other.” – Bowley
“By statistics we mean quantitative data affected to a marked extent by multiplicity
of causes.” – Yule and Kendall
“Statistics are the classified facts representing the conditions of the people in a
state, such as those facts which can be represented in tables as numbers in any classified
arrangement” – Webster
Statistics as Statistical Methods:
“Statistics is the science of estimates and probabilities” – Boddington
“Statistics may be called as the science of counting or averages” – Bowley
“Statistics is a method of decision making in the face of uncertainty on the basis of
numerical data and calculated risks” – Prof. Ya-Lun-Chou

Thus we can summarize the characteristics of statistics as follows:

• It is an aggregate of facts
• It is affected by a multiplicity of causes
• It is numerically expressed
• It is enumerated or estimated according to a reasonable standard of accuracy
• It is collected in a systematic manner
• It is collected for a predetermined purpose and is comparable
• It is both a science and an art
• It helps to arrive at a valid decision
• It evaluates the various alternatives
So Statistics is a subject which consists of the processes, techniques and methods
of collecting, organizing, presenting, analyzing and interpreting data for decision making under
uncertainty.

1.3 SCOPE OF STATISTICS


In olden days statistics was regarded as the science of statecraft. Today its scope has widened
to various phenomena, from social aspects to economics. It is used not only to collect
numerical data but also for handling and analysing them and drawing inferences from them. It
embraces all fields of science and provides solutions in various disciplines such
as industry, business, biometry, economic planning, sociology, insurance etc. Today it has
become indispensable in the life of a citizen.

1.4 IMPORTANCE OF STATISTICS


The importance of statistics can be understood from its wide applicability in the following fields:
• Planning at the Government level, such as five year plans, budgets etc.
• Statistics in statecraft – collecting data relating to manpower, military, crime,
income and wealth etc. to formulate suitable policies.
• Statistics in Economics – it gives solutions to various problems of pricing, production,
supply, consumption, distribution of wealth and income, savings, profits, investments,
expenditure etc. Various laws of economics are developed through statistics. Powerful
tools (trend analysis, time series, forecasting techniques) are used in the analysis of
economic data.
• Business and Management – Business deals with uncertainty and frequently lands in
dilemmas. Statistics offers a scientific way out of the ambiguities of business
and helps widely in taking decisions.
• Accountancy and Auditing – Today Chartered Accountants and ICWAs have statistics
as a vital subject in their syllabus, since its use in accounting has become inevitable.
It is widely used in profit analysis, dividend decisions, asset and liability analysis, etc.
In auditing, sampling techniques are used for test checking of voluminous data related
to business transactions.
• Statistics in Industry – it is used intensively in quality control of the production process.
• Statistics in Physical Sciences – physical sciences like astronomy, geology, engineering
and meteorology expect accurate results in their applications. Statistical techniques
like the least squares method help to provide them.
• Statistics in Social Science – the study of demographic features, mortality, fertility,
population growth, poverty etc. requires statistics to analyze the data.
• Statistics is also used in Biology and Medical Sciences to study cause and effect, and
in Psychology and Education, for example in scaling mental health and determining
I Q levels.

1.5 LIMITATIONS OF STATISTICS


Statistics is indispensable for almost all human activities. But it has a few
limitations, such as:
1. It does not study qualitative phenomena
2. It does not study individuals
3. Statistical laws are not exact
4. Statistics is liable to be misused
5. Expert knowledge is required to analyze the data and interpret it.

1.6 STATISTICS IN BUSINESS AND MANAGEMENT


Today business has grown along many dimensions, both in size and in market
competition. Every step in business or management has become
complicated. Personal evaluation based on observation is not enough. A team of specialized
managerial executives is indispensable for running today’s business activities such as
sales, purchases, production, control, finance, marketing etc. Here statistical tools and
theories such as forecasting techniques, estimation theory, sampling, probability, least squares,
game theory etc. play an indispensable role. “Statistics is a method of decision making in
the face of uncertainty on the basis of numerical data and calculated risks” – Prof.
Ya-Lun-Chou. According to Wallis and Roberts, “Statistics may be regarded as a
body of methods for making wise decisions in the face of uncertainty.” These definitions
reflect the application of statistics in modern business, which depends on accurate
estimation and forecasting of future demand for the product, market trends and so
on. Statistical information related to the business also acts as a guide to future economic
events. Some of the uses of statistics in business decisions are:
• To estimate the probable trends in demand for goods
• To order the right quantity of materials
• To study seasonal and cyclical movements of the business
• To study the relationship between supply and demand
• To know the purchasing power of money
• To apply statistical quality control so as to produce without waste
• To promote new businesses
• To conduct customer surveys and collect demographic information to fix the
target segments
• To understand consumer expectations and the level of product awareness
• To launch new products through sample surveys
• To optimize profits and investments and minimize expenses
• To forecast the future and balance the uncertainties through probability and estimation
theories
Thus the use of statistics has become indispensable in all branches of business
activity.
1.7 DESCRIPTIVE AND INFERENTIAL STATISTICS IN BUSINESS
DECISIONS
There are two major divisions in statistics: Descriptive Statistics and Inferential
Statistics.
Descriptive Statistics: Descriptive statistics deals with collecting, summarizing and
simplifying data which are otherwise very voluminous. Through this, meaningful
conclusions can be drawn readily from the data. The method thus facilitates an understanding
of the data and systematic reporting, which makes the data useful for further discussion,
analysis and interpretation. A well thought out data classification facilitates easy description
and a variety of summary measures. These include Measures of Central Tendency, Dispersion,
Skewness and Kurtosis, which constitute the essential scope of descriptive statistics.
Inferential Statistics: This is also called inductive statistics. It goes beyond describing a
given problem situation by collecting, summarizing and presenting related data. It
consists of the methods used for drawing inferences and making broad generalizations
about the total set of observations on the basis of a part of it. That is, obtaining a particular
value from the sample information and using it to draw an inference about the entire
population is inferential statistics.
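The distinction can be illustrated with a short Python sketch. It first describes a sample (descriptive statistics) and then uses the sample mean as an estimate for the whole population (inferential statistics). The daily sales figures and all variable names are hypothetical and are used only for illustration.

```python
import statistics

# Hypothetical daily sales (in units) observed on 10 sampled days.
sample_sales = [42, 38, 51, 45, 40, 47, 39, 44, 50, 44]

# Descriptive statistics: summarize the data that were actually collected.
sample_mean = statistics.mean(sample_sales)
sample_median = statistics.median(sample_sales)
sample_sd = statistics.stdev(sample_sales)  # sample standard deviation (n - 1 divisor)

# Inferential statistics: use the value obtained from the sample to say
# something about the whole population (all trading days), which was not observed.
estimated_population_mean = sample_mean

print("Sample mean:", sample_mean)
print("Sample median:", sample_median)
print("Sample standard deviation:", round(sample_sd, 2))
print("Estimated population mean:", estimated_population_mean)
```

The first three figures only describe the ten days observed; the last line is an inference, and it may differ from the true population mean.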
In business, decisions have to be taken under uncertainty, and most of the time total coverage
of the information through the Census method is not possible; it may not be feasible
or practical for various reasons. In such situations it is inferential statistics which is
used in taking business decisions.
Risk evaluation and Statistics:
Inferential statistics helps to evaluate the risk involved in drawing inferences or generalizations
about an unknown population on the basis of sample information. There is always a risk that an
inference about a population is incorrect when it is based on knowledge limited to a
sample. The remedy lies in evaluating that risk. Probability distributions help us in
drawing statistical inferences and in estimating the degree of reliability of these inferences.
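One common way of expressing this risk is a confidence interval around the sample mean. The sketch below is a minimal illustration under stated assumptions: the sample values are hypothetical, the 95% confidence level is chosen arbitrarily, and the normal distribution is used as an approximation (for so small a sample a t-distribution would usually be preferred).

```python
import math
import statistics

# Hypothetical sample of daily sales (in units).
sample = [42, 38, 51, 45, 40, 47, 39, 44, 50, 44]

n = len(sample)
mean = statistics.mean(sample)
std_error = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean

confidence = 0.95            # assumed confidence level
risk = 1 - confidence        # chance that the interval misses the true mean

# z value of the standard normal distribution for the chosen confidence level
z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)

lower = mean - z * std_error
upper = mean + z * std_error

print(f"Estimated population mean: {mean}")
print(f"{confidence:.0%} interval: {lower:.2f} to {upper:.2f}")
print(f"Risk of an incorrect inference: {risk:.0%}")
```

A narrower interval at the same confidence level indicates a more reliable inference; a larger sample generally narrows it.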
Check Your Progress
1. Inferential statistics is also called ............... statistics.
2. Statistics is an aggregate of ...............
3. Statistics helps to make ............... in the face of uncertainty.

1.8 STRENGTHS AND WEAKNESSES OF STATISTICS


Strengths: Statistics develops a statistical mode of thinking. It collects and compiles massive
sets of data, which are systematically and carefully analyzed to seek useful insights and reach
valid conclusions for sound decision making. Statistics creates a flexible mind with a judicious
sense that helps to understand the dangers.
Weaknesses: The general feeling of distrust is an important weakness of statistics.
The accuracy and reliability of the data are often questioned. The whole process, from the
collection of data to the interpretation of results, is porous and allows a number of errors at each
step. The data can be manipulated easily by the collector himself. These flaws in statistics
can be minimized but cannot be eliminated. Thus inaccuracy, unreliability and manipulation
of data create distrust in statistics.

1.9 LANGUAGE OF STATISTICS


Statistics uses some common concepts or notations while analyzing and interpreting
data. Notations are shorthand expressions of concepts and statements; together they are
called the language of statistics. The concepts used include variables, observed data or values of
the variables, samples, sample size, population, sample statistics etc.

1.10 SUMMARY
A layman knows statistics as data, that is, numerical information expressed in
quantitative terms. It can also be called the science of data. This unit has thrown light on the
basics of statistics. Today statistics has grown into a separate subject because of its importance in
every aspect of human life and its relevance to almost all disciplines. Today business
decisions are based on statistical inferences, as the future is very uncertain. Though there are
many advantages of using statistical tools in decision making, distrust still prevails
because of inadequacy, inaccuracy and manipulation of data, and the lack of expert skill and
knowledge to interpret the results.

1.11 KEY WORDS


Statistics: The aggregate of facts which are numerically expressed collected and classified
Uncertainty: Insecurity about the future.

1.12 SELF ASSESSMENT QUESTIONS


a. Explain the utility of statistics in Business and Management in the current scenario.
b. Explain the importance of statistics in detail.
c. “Statistics is a method of decision making in the face of uncertainty on the basis of the
numerical data and calculated risks.” Explain with suitable illustrations.
d. Define statistics as a discipline. Also bring out its scope.
e. What are the causes of distrust in statistics?
f. Differentiate between descriptive and inferential statistics. How is inferential statistics
useful in business decisions?

1.13 ANSWERS TO CHECK YOUR PROGRESS
1. Inductive
2. Facts
3. Wise decisions

1.14 REFERENCES
1. Gupta S.P. Business Statistics, New Delhi: S Chand and Sons Publishers, 2000
2. Shahsi Kumar. Quantitative Techniques and Methods, Mysuru: Chetana Book House, 2010
3. Vignanesh Prajapathi. Big Data Analysis with R and Hadoop, Mumbai: Packt Publishing, 2013
4. Sharma S.D. Operation Research, Delhi: Discovery Publishing House, 1997
5. Srinath L.S. PERT and CPM, Delhi: East West Press, 2001
6. Kalavathy. Operation Research, New Delhi: Vikas Publishing House, 2010
7. Richard I. Levin. Statistics for Management, New Delhi: Pearson Education India, 2008

UNIT 2 : ANALYSIS OF DATA

STRUCTURE
2.0 Objectives
2.1 Introduction
2.2 Types of Data
2.3 Data Collection and Sources of Primary and Secondary data
2.4 Classification and Tabulation of data
2.5 Summarizing the data – Frequency Distribution
2.6 Diagrammatic and Graphic Representation of Data
2.7 Summary
2.8 Key Words
2.9 Self Assessment Questions
2.10 Answer to Check Your Progress
2.11 Suggested references

2.0 OBJECTIVES
After studying this unit you should be able to :
• Define statistical data;
• Explain the types of data and the sources for collecting data;
• Demonstrate the ability to present data in tabular form and
• Describe the graphic presentation of data.

2.1 INTRODUCTION
Data is the information that is collected. Statistical data are the basic raw material of
statistics. Data may relate to an activity of interest, a problem, a phenomenon or a
situation under study. Data result from the process of measuring, counting or observing.
Statistical data therefore refer to those aspects of a problem situation that can be measured,
quantified, counted or classified. In any statistical investigation, the collection of
numerical data is the first and most important matter to be attended to. Often the
investigator has to collect the data from the actual field of inquiry. For this he may
issue suitable questionnaires to get the necessary information or he may conduct interviews;
personal interviews are more effective than questionnaires, which may not evoke an adequate
response. Sometimes the data may already be available in the
publications of Government bodies or other public or private organizations. Such data,
however, are often so numerous that one’s mind can hardly comprehend their significance in the
form in which they are shown. It therefore becomes necessary to tabulate and summarize the
data in an easily manageable form. In doing so we may overlook some details, but this is not a
serious loss, because statistics is not interested in an individual but in the properties of
aggregates. For a layman, presentation of the raw data in the form of tables or diagrams is
always more effective.
Prerequisites of statistical data:
a. It should be unambiguous
b. It should be specific and as per the objectives and scope of the study
c. It should be stable
d. It should be appropriate to the enquiry
e. It should be uniform
f. A reasonable degree of accuracy should be aimed at.
g. It should be apt and reliable.

2.2 TYPES OF DATA
A. Based on the characteristics measured, data can be classified into two broad
categories:
a. Quantitative data
b. Qualitative data
Quantitative data: Data that can be quantified in definite units of measurement are
called quantitative data; that is, successive measurements yield
quantifiable observations. Depending on the nature of the variable
observed or measured, quantitative data can be further categorized as
continuous data and discrete data. Continuous data represent the numerical
values of a continuous variable. A continuous variable is one that can assume
any value between any two points on a line segment and thus represents
an interval of values. For example: temperature, thickness, velocity,
height, weight etc. Discrete data are the values assumed by a discrete
variable, that is, one whose outcomes are counted in whole numbers. Ex:
the number of customers visiting a store every day, the number of trains
arriving at a station, the number of defects in one consignment etc.
Qualitative data: Data related to the qualitative characteristics of a subject or object
are qualitative data. Such data are gathered through attributes.
They are classified as nominal data and rank data. The count
data obtained from classification are called nominal data. For example:
classification of students according to gender, division of workers
by skill, education etc. Rank data are the result of assigning ranks.
For example: ranking by performance in an interview or an exam,
ranking by quality etc.
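The four kinds of data described above can be sketched in a few lines of Python. All names and figures in this example are hypothetical and serve only to illustrate the distinction.

```python
from collections import Counter

# Quantitative, continuous: measurements can take any value in an interval.
temperatures_c = [36.6, 37.15, 38.02]

# Quantitative, discrete: outcomes are counted in whole numbers.
customers_per_day = [120, 95, 130]

# Qualitative, nominal: counts obtained by classifying students by gender.
genders = ["F", "M", "F", "F", "M", "M", "F"]
nominal_counts = Counter(genders)

# Qualitative, rank: ranks assigned by interview performance (1 = best).
interview_scores = {"Asha": 82, "Ravi": 91, "Meena": 76, "Kiran": 88}
ordered_names = sorted(interview_scores, key=interview_scores.get, reverse=True)
ranks = {name: position for position, name in enumerate(ordered_names, start=1)}

print("Nominal counts:", dict(nominal_counts))
print("Ranks:", ranks)
```

Note that the ranks record only the order of the candidates; the size of the gap between two scores is lost once the data are reduced to ranks.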
B. Based on the sources of data the data can be categorized as :
a. Primary data
b. Secondary data
Primary data: These are data that do not already exist in any form and thus have to be
collected for the first time from primary sources. Since they are
collected for the first time, they are fresh data. They are collected either
from the whole population or from a sample drawn from it.
Secondary data: These data already exist in some form, published or unpublished, in an
identifiable secondary source.
2.3 DATA COLLECTION AND SOURCES OF PRIMARY AND
SECONDARY DATA
Data collection is the act of assembling and gathering the needed numerical information
for a research. The process of counting or measuring together with the systematic recording
of results is called collection of statistical data. Collection can be from primary or secondary
source.
Preliminaries of Data Collection:
• Objectives and scope of enquiry
• Statistical units to be used
• Sources of information
• Method of data collection
• Degree of accuracy aimed at in the final results
• Type of enquiry
Since primary data do not already exist in any form, the only way to collect them is
through field surveys of the population or of a sample drawn from it. Primary data sources
can be internal or external.
Methods of collecting primary data are:
1. Personal Interviews
2. Direct personal Investigation
3. Indirect oral interviews
4. Information received through local agencies
5. Mailed questionnaire method
6. Schedules through enumerators
Secondary data sources may be broadly classified into two groups:
(i) Published sources: Official publications of the Central Government, publications of
semi-government statistical organizations, publications of research institutions and of
commercial and financial institutions, reports of various committees and commissions
appointed by the Government, newspapers and periodicals, and international publications
are the sources of secondary data in published form.
(ii) Unpublished sources: Records maintained by private firms, research carried
out by individuals, records of business concerns etc.
Sources can also be classified as internal and external, depending on whether the user of
the data is an insider or an outsider. As a caution, secondary data should be carefully
examined before use to see that they suit the objectives of the research or study.
These data are, however, convenient to use, especially as supporting evidence.
The other advantages of secondary data are that they are easily available and convenient to
access, and that not much effort is needed to classify and tabulate them. The precaution
to be taken is to check whether the data are reliable, suitable and adequate.
2.4 CLASSIFICATION AND TABULATION OF DATA
Classification
“Classified and arranged facts speak for themselves; unarranged, unorganised they
are dead as mutton.” This quote is attributed to J.R. Hicks. The process of dividing the data into
different groups (viz. classes) which are homogeneous within but heterogeneous between
themselves is called classification. It helps in understanding the salient features of the data
and in comparing them with similar data. In the final analysis, it is the best friend of a
statistician.
Methods Of Classification
The data is classified in the following ways:
1. According to attributes or qualities (qualitative classification). This is divided into two parts:
(A) Simple classification
(B) Multiple classification
2. According to variable or quantity, that is, classification according to class intervals.
Qualitative Classification: When facts are grouped according to qualities (attributes)
like religion, literacy, business etc., the classification is called qualitative classification.
(A) Simple Classification: It is also known as classification according to dichotomy.
When data (facts) are divided into two groups according to the presence or absence of a
quality, the classification is called ‘Simple Classification’. Qualities are denoted by capital
letters (A, B, C, D ......) while the absence of these qualities is denoted by lower case letters
(a, b, c, d, .... etc.). For example, a population may be divided into males and females.
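(B) Manifold or multiple classification: In this method data is classified using more
than one quality. First, the data is divided into two groups (classes) using one of the qualities.
Then, using the remaining qualities, the data is divided into different subgroups. For example,
the population of a country can be classified using three attributes: sex, literacy and business.
The population is first divided into males and females; each of these groups is divided into
literate and illiterate; and each of those is divided into persons engaged in business and
persons not engaged in business.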

Classification according to class intervals or variables: Data which is expressed in
numbers (quantitative data) is classified according to class-intervals. While forming
class-intervals one should bear in mind that each and every item must be covered. After finding the
least value and the highest value, classify the items into different
class-intervals. For example, if the ages of 100 persons, ranging from 2 years to 47
years, are given, then the classification of this data can be done in this way:
Table - 1

In classification according to class-intervals the following terms are used :
Class-limits : A class is formed within two values. These values are known as the class-limits of that class.
Magnitude of the class-interval : The difference between the upper and lower limits of a class is called the magnitude or length or width of a class and is denoted by ‘i’ or ‘c’.
Mid-value or class-mark : The arithmetical average of the two class limits (i.e. the lower limit and the upper limit) is called the mid-value or the class-mark of that class-interval.
Class frequency : Each unit of the data belongs to one of the groups or classes. The total number of units in a class is known as the frequency of that class and is denoted by fi or simply f.
Classification is of two types according to the class-intervals - (i) Exclusive Method (ii)
Inclusive Method.
Exclusive Method : In this method the upper limit of a class becomes the lower limit of
the next class. It is called ‘Exclusive’ because an item that is equal to the upper
limit of a class is not put in that class; it is put in the next class, i.e. the upper limits of classes
are excluded from them. See Table 1. For example, a person of age 20 years will not be
included in the class-interval (10 - 20) but taken in the next class (20 - 30), since the
class-interval (10 - 20) includes only values from 10 up to, but not including, 20.
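The exclusive rule can be sketched in a few lines of Python. This is only an illustration: the starting point 0 and the width 10 are assumed so that the classes match the (10 - 20) and (20 - 30) intervals mentioned above, and the function names are hypothetical.

```python
def exclusive_class(value, start, width):
    """Return (lower, upper) of the exclusive class with lower <= value < upper."""
    k = (value - start) // width      # number of whole classes lying below the value
    lower = start + k * width
    return lower, lower + width


def class_mark(lower, upper):
    """Mid-value (class-mark): the average of the two class limits."""
    return (lower + upper) / 2


# Assumed layout: classes 0 - 10, 10 - 20, 20 - 30, ... of width 10.
for age in (19, 20, 29):
    lower, upper = exclusive_class(age, start=0, width=10)
    print(age, "falls in", f"{lower} - {upper}",
          "| width:", upper - lower,
          "| class-mark:", class_mark(lower, upper))
```

An age of exactly 20 lands in (20 - 30), not in (10 - 20), which is the behaviour the exclusive method requires.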
Inclusive Method : In this method the upper limit of any class-interval is kept in the same
class-interval. The upper limit of a class is less by 1 than the lower
limit of the next class-interval. In short, this method allows a class-interval to include both its
lower and upper limits within it. For example :
Table - 2
Open-end Class Intervals : When the lower limit of the first class-interval
or the upper limit of the last class-interval is not given, the missing limit is worked out from
the adjacent class. Subtract the class length of the next immediate class-interval from the upper
limit of the first class-interval. This gives the lower limit of the first class-interval.
Similarly, add the class length of the immediately preceding class-interval to the lower limit
of the last class-interval to obtain its upper limit. Always make sure that the lower limit of
the first class (i.e. the lowest class) is not negative or less than 0. For example :
Table - 3

Tabulation
It is the process of condensation of the data for convenience in statistical processing,
presentation and interpretation of the information. A good table is one which meets the following
requirements:
1. It should present the data clearly, highlighting important details.
2. It should save space but be attractively designed.
3. The table number and title of the table should be given.
4. Row and column headings must explain the figures therein.
5. Averages or percentages should be close to the data.
6. Units of measurement should be clearly stated along with the titles or headings.
7. Abbreviations and symbols should be avoided as far as possible.
8. Sources of the data should be given at the bottom of the table.
9. In case irregularities creep into the table or any feature is not sufficiently explained,
references and footnotes must be given.
10. The rounding of figures should be unbiased.

Types of Tables: The important types of statistical table are as follows:
1. Single column or Single Row Tables
2. Multiple column or multiple row tables
3. Reference and Summary tables
Components of a Table: the structure or the components of the table should have:
1. Table Number
2. Title of the table
3. Head Notes
4. Stub and Stub Heads
5. Box Head and Sub Heads
6. Body of the table
7. Footnote
8. Source

2.5 SUMMARIZING THE DATA – FREQUENCY DISTRIBUTION


The frequency distribution is the outcome of a process of classification of individual
observations of a set of data into an appropriate number of classes. It is also called grouped
data. The frequency distribution can be constructed through
a. Tally Method and b. Entry form Method

Relative Frequency: The relative frequency of a class is the frequency of the class divided
by the total frequency of all the classes, and is generally expressed as a percentage.
Cumulative Frequency: Many a time the frequencies of different classes are not given;
only their cumulative frequencies are given. The total frequency of all values less than or
equal to the upper class boundary of a given class-interval is called the cumulative frequency
up to and including that class-interval. Frequencies cumulated from the lowest class upwards are
called ‘less than’ cumulative frequencies, and those cumulated from the highest class downwards
are called ‘more than’ cumulative frequencies. For example,
Class-interval    0-10    10-20    20-30    30-40    40-50
Frequency            4        9        5       12       15

Table - 5
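For the five classes above, the relative and cumulative frequencies can be worked out mechanically. The following Python sketch is only an illustration; the variable names are chosen here and are not part of the text.

```python
from itertools import accumulate

class_intervals = ["0-10", "10-20", "20-30", "30-40", "40-50"]
frequencies = [4, 9, 5, 12, 15]
total = sum(frequencies)

# Relative frequency of each class, expressed as a percentage of the total.
relative_pct = [round(100 * f / total, 1) for f in frequencies]

# "Less than" type: running total from the first class downwards.
less_than_cf = list(accumulate(frequencies))

# "More than" type: running total from the last class upwards.
more_than_cf = list(accumulate(reversed(frequencies)))[::-1]

print("Class     f   rel.%   less-than c.f.   more-than c.f.")
for row in zip(class_intervals, frequencies, relative_pct, less_than_cf, more_than_cf):
    print("{:<8}{:>3}{:>8}{:>12}{:>17}".format(*row))
```

The last "less than" value and the first "more than" value both equal the total frequency, which is a convenient check on the arithmetic.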

Preparation of Frequency Distribution


Consider the data collected by one of the surveyors, interviewing about 50 people. This is as
follows :
Size of the shoes : 2, 5, 6, 8, 2, 5, 6, 7, 6, 8, 7, 4, 3, .. This is called the raw data. Here some
values repeat themselves. For instance the size 5 is repeated 10 times in 50 people. We say
that the value of 5 of the variate has the frequency of 10. Frequency means the number of
times a value of the variate or an attribute, as the case may be, is repeated in the data. A
table which shows each value of the characteristic with its corresponding frequency, is known
as a Frequency Distribution. The procedure of preparing such a table is explained as below:
Discrete variate : Consider the raw data which gives the size of shoes of 30 persons
2, 5, 6, 4, 5, 7, 4, 4, 6, 2
3, 5, 5, 4, 5, 6, 5, 4, 3, 2
4, 4, 5, 4, 5, 5, 3, 2, 4, 4
The least value is 2 and the highest is 7. All sizes are integers between 2 and 7 (both inclusive).
We can prepare a frequency distribution table as follows :
Table - 6
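The tally for the 30 shoe sizes can also be reproduced with a short Python sketch. It uses the standard-library `Counter` as a stand-in for the manual tally method; this is an illustration, not a prescribed procedure.

```python
from collections import Counter

shoe_sizes = [
    2, 5, 6, 4, 5, 7, 4, 4, 6, 2,
    3, 5, 5, 4, 5, 6, 5, 4, 3, 2,
    4, 4, 5, 4, 5, 5, 3, 2, 4, 4,
]

# Count how many times each value of the variate occurs.
frequency = Counter(shoe_sizes)

print("Size  Frequency")
for size in sorted(frequency):
    print(f"{size:>4}  {frequency[size]:>9}")
print("Total", sum(frequency.values()))
```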

In this example the size difference from 2 to 7 is very small. If the range of a variate
is very large, it is inconvenient to prepare a frequency distribution for each value of the
variate. In such a case we divide the variate into convenient groups and prepare a table showing
the groups and their corresponding frequencies. Such a table is called a grouped frequency
distribution.
Consider the marks (out of 100 ) of 50 students as below :
40, 39, 43, 62, 30, 47, 33, 31, 17, 28
36, 29, 40, 32, 39, 24, 57, 42, 15, 30
50, 52, 47, 65, 31, 07, 37, 47, 17, 20
25, 53, 65, 85, 89, 56, 55, 41, 43, 10
44, 40, 69, 22, 40, 65, 39, 36, 71, 12
The range of the variate (marks) is very large. Also we are eager to know the
performance of the students. The passing limit is 35 and above. Marks between 35 and 44
form the third class ( or grade). Marks ranging between 45 - 59 are considered as second
class and 60 - 100 form the first class. Thus we have a grouped frequency distribution as:
Table
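The grouping described above can be sketched in Python as follows. The label "Fail (below 35)" is an assumption made here for marks under the passing limit; the other three bands are the ones stated in the text.

```python
marks = [
    40, 39, 43, 62, 30, 47, 33, 31, 17, 28,
    36, 29, 40, 32, 39, 24, 57, 42, 15, 30,
    50, 52, 47, 65, 31, 7, 37, 47, 17, 20,
    25, 53, 65, 85, 89, 56, 55, 41, 43, 10,
    44, 40, 69, 22, 40, 65, 39, 36, 71, 12,
]

# (label, lowest mark, highest mark), both limits included.
bands = [
    ("Fail (below 35)", 0, 34),      # assumed label for marks under the passing limit
    ("Third class", 35, 44),
    ("Second class", 45, 59),
    ("First class", 60, 100),
]

grouped = {label: sum(low <= m <= high for m in marks) for label, low, high in bands}

for label, count in grouped.items():
    print(f"{label:<16} {count}")
print(f"{'Total':<16} {sum(grouped.values())}")
```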

Concerns in constructing frequency distribution:


The factors or issues that have to be kept in mind before constructing a frequency
distribution are:
1. Number of Classes
2. Width of the class Intervals
3. Establishing the initial class
4. Stated and real Class Limits
Check Your Progress
1. The method in which upper limit of lower class becomes lower limit of next class is
called ....................
2. Primary data sources can be .................... & ....................
3. The data related to characteristics of a subject is called ....................

2.6 DIAGRAMMATIC AND GRAPHIC REPRESENTATION OF DATA
It is not always easy for a layman to understand figures, nor is it interesting for
him. Apart from that, too many figures are often confusing. One of the most convincing and
appealing ways in which statistical results may be represented is through graphs and diagrams.
It is for this reason that diagrams are often used by businessmen, newspapers, magazines,
journals and government agencies, and also for advertising and educating people. The various
graphic presentations of data can be done through:
1. Bar Diagrams
1) Simple ‘Bar diagram’:- It represents only one variable. For example sales, production,
population figures etc. for various years may be shown by simple bar charts. Since these are
of the same width and vary only in heights ( or lengths ), it becomes very easy for readers to
study the relationship. Simple bar diagrams are very popular in practice. A bar chart can be
either vertical or horizontal; vertical bars are more popular.
2) Sub - divided Bar Diagram:- While constructing such a diagram, the various components
in each bar should be kept in the same order. A common and helpful arrangement is that of
presenting each bar in the order of magnitude with the largest component at the bottom and
the smallest at the top. The components are shown with different shades or colors with a
proper index.
Illustration:- During 1968 - 71, the number of students in University ‘X’ was as follows.
Represent the data by a similar diagram.

Year         Arts      Science    Law      Total
1968 - 69    20,000    10,000     5,000    35,000
1969 - 70    26,000     9,000     7,000    42,000
1970 - 71    31,000     9,500     7,500    48,000

3) Multiple Bar Diagram:- This method can be used for data which is made up of two or
more components. In this method the components are shown as separate adjoining bars. The
height of each bar represents the actual value of the component. The components are shown
by different shades or colors. Where changes in actual values of component figures only are
required, multiple bar charts are used.
Illustration:- The table below gives data relating to the exports and imports of a certain
country X ( in thousands of dollars ) during the four years ending in 1930 - 31.
Year Export Import
1927 - 28 319 250
1928 - 29 339 263
1929 - 30 345 258
1930 - 31 308 206

Represent the data by a suitable diagram

2. Pie Chart
Geometrically it can be seen that the area of a sector of a circle, taken radially, is
proportional to the angle at its center. It is therefore sufficient to draw angles at the center,
proportional to the original figures. This will make the areas of the sectors proportional to
the basic figures.
For example, let the total be 1000 and one of the components be 200; then the angle will be
$$\frac{200}{1000} \times 360^\circ = 72^\circ$$
As an example, consider the yearly expenditure of Mr. Ted, a college undergraduate.

Tuition fees            $ 6000
Books and lab.          $ 2000
Clothes / cleaning      $ 2000
Room and boarding       $ 12000
Transportation          $ 3000
Insurance               $ 1000
Sundry expenses         $ 4000

Total expenditure     = $ 30000

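Now, as explained above, we calculate the angles corresponding to the various items
(components).

Tuition fees = $\frac{6000}{30000} \times 360^\circ = 72^\circ$

Books and lab. = $\frac{2000}{30000} \times 360^\circ = 24^\circ$

Clothes / cleaning = $\frac{2000}{30000} \times 360^\circ = 24^\circ$

Room and boarding = $\frac{12000}{30000} \times 360^\circ = 144^\circ$

Transportation = $\frac{3000}{30000} \times 360^\circ = 36^\circ$

Insurance = $\frac{1000}{30000} \times 360^\circ = 12^\circ$

Sundry expenses = $\frac{4000}{30000} \times 360^\circ = 48^\circ$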
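The angle for each item is its share of the total multiplied by 360 degrees. A minimal Python sketch of that calculation for Mr. Ted's expenditure follows; the dictionary layout is simply one convenient way to hold the figures.

```python
expenditure = {
    "Tuition fees": 6000,
    "Books and lab.": 2000,
    "Clothes / cleaning": 2000,
    "Room and boarding": 12000,
    "Transportation": 3000,
    "Insurance": 1000,
    "Sundry expenses": 4000,
}

total = sum(expenditure.values())

# Angle at the center of the pie for each component, in degrees.
angles = {item: amount * 360 / total for item, amount in expenditure.items()}

for item, angle in angles.items():
    print(f"{item:<20} {angle:6.1f} degrees")
print(f"{'Total':<20} {sum(angles.values()):6.1f} degrees")
```

The angles must add up to 360 degrees; if they do not, one of the component figures or the total has been copied wrongly.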

Uses:- A pie diagram is useful when we want to show relative positions ( proportions ) of the
figures which make the total. It is also useful when the components are many in number.
3. Graphs
A graph is a visual representation of data by a continuous curve on a squared ( graph )
paper. Like diagrams, graphs are also attractive, and eye-catching, giving a bird’s eye-view of
data and revealing their inner pattern.
Graphs of Frequency Distributions:-
The methods used to represent a grouped data are :-
1. Histogram
2. Frequency Polygon
3. Frequency Curve
4. Ogive or Cumulative Frequency Curve
1. Histogram :- It is defined as a pictorial representation of a grouped frequency
distribution by means of adjacent rectangles, whose areas are proportional to the
frequencies.

For example, in a book sale, you want to determine which books were most popular:
the high priced books, the low priced books, the books most neglected, etc. Let us say you sold
a total of 31 books at this book-fair at the following prices.
Rs. 2, Rs. 1, Rs. 2, Rs. 2, Rs. 3, Rs. 5, Rs. 6, Rs. 17, Rs. 17, Rs. 7, Rs. 15, Rs. 7, Rs. 7, Rs. 18,
Rs. 8, Rs. 10, Rs. 10, Rs. 9, Rs. 13, Rs. 11, Rs. 12, Rs. 12, Rs. 12, Rs. 14, Rs. 16, Rs. 18,
Rs. 20, Rs. 24, Rs. 21, Rs. 22, Rs. 25.
The prices range from Rs. 1 to Rs. 25. Divide this range into a number of groups (class-intervals).
Typically, there should be no fewer than 5 and no more than 20 class-intervals
for a frequency histogram. We now have the following distribution of books at the book-fair:

Class-interval       Frequency
Rs. 1 - Rs. 5            6
Rs. 6 - Rs. 10           8
Rs. 11 - Rs. 15         10
Rs. 16 - Rs. 20          3
Rs. 21 - Rs. 25          4
Total             n = Σ fi = 31

Note that each class-interval is of equal width, i.e. Rs. 5 inclusive. Now we draw the
frequency histogram as under.

Relative Frequency Histogram:- It uses the same data. The only difference is that it
compares each class-interval with the total number of items, i.e. instead of the frequency of
each class-interval, its relative frequency is used. Naturally the vertical axis
(i.e. y-axis) uses the relative frequencies in place of the frequencies.
2. Frequency Polygon:- Here the frequencies are plotted against the mid-points of the
class-intervals and the points thus obtained are joined by line segments.
Example : -
Height in cm.      150 - 154   154 - 158   158 - 162   162 - 166   166 - 170
No. of children        10          15          20          12           8
The polygon is closed at the base by extending it on both its sides ( ends ) to the
midpoints of two hypothetical classes, at the extremes of the distribution, with zero
frequencies.
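The points to be plotted for the polygon in this example can be listed with a short Python sketch. It only computes coordinates (no drawing library is assumed), and the variable names are illustrative.

```python
classes = [(150, 154), (154, 158), (158, 162), (162, 166), (166, 170)]
children = [10, 15, 20, 12, 8]
width = classes[0][1] - classes[0][0]          # common class width (4 cm)

# Mid-point (class-mark) of each class-interval.
mid_points = [(low + high) / 2 for low, high in classes]

# Close the polygon with two hypothetical zero-frequency classes,
# one before the first class and one after the last.
points = ([(mid_points[0] - width, 0)]
          + list(zip(mid_points, children))
          + [(mid_points[-1] + width, 0)])

for x, y in points:
    print(f"mid-point {x:6.1f} cm  frequency {y}")
```

Joining these points in order by straight line segments gives the closed frequency polygon described above.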

On comparing the Histogram and a frequency polygon, you will notice that, in frequency
polygons the points replace the bars ( rectangles ). Also, when several distributions are to be
compared on the same graph paper, frequency polygons are better than Histograms.
3) Frequency Distribution (Curve):- Frequency distribution curves are like frequency
polygons. In frequency distribution, instead of using straight line segments, a smooth curve
is used to connect the points. The frequency curve for the above data is shown as:

4. Ogives or Cumulative Frequency Curves:- When frequencies are added, they are called
cumulative frequencies. The curve obtained by plotting cumulative frequencies is called a
cumulative frequency curve or an ogive (pronounced ojive).
To construct an Ogive:-
1) Add up the progressive totals of frequencies, class by class, to get the cumulative
frequencies.
2) Plot classes on the horizontal (x-axis) and cumulative frequencies on the vertical
(y-axis).
3) Join the points by a smooth curve. Note that ogives start at zero on the vertical axis
and end at the outer class limit of the last class. In most cases the curve looks like an ‘S’. Note
that cumulative frequencies are plotted against the ‘limits’ of the classes to which
they refer.
(A) Less than Ogive:- To plot a less than ogive, the data is arranged in ascending order of
magnitude and the frequencies are cumulated starting from the top. It starts from zero
on the y-axis and the lower limit of the lowest class interval on the x-axis.
(B) Greater than Ogive:- To plot this ogive, the data are arranged in ascending order
of magnitude and the frequencies are cumulated from the bottom. This curve ends at zero
on the y-axis and at the upper limit of the highest class-interval on the x-axis.
Illustrations:- On a graph paper, draw the two ogives for the data given below of the I.Q. of
160 students.

Class-intervals :   60 - 70    70 - 80    80 - 90    90 - 100   100 - 110
No. of students :      2          7         12          28          42

Class-intervals :  110 - 120  120 - 130  130 - 140  140 - 150  150 - 160
No. of students :     36         18         10           4           1
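Before drawing, the two sets of cumulative frequencies for the I.Q. data have to be tabulated. The Python sketch below computes them and pairs each with the class limit it is plotted against; it stops short of plotting, and the names are illustrative.

```python
from itertools import accumulate

lower_limits = list(range(60, 160, 10))            # 60, 70, ..., 150
upper_limits = [low + 10 for low in lower_limits]  # 70, 80, ..., 160
students = [2, 7, 12, 28, 42, 36, 18, 10, 4, 1]

# Less-than ogive: cumulate from the top, plot against upper limits.
less_than = list(accumulate(students))
less_than_points = [(lower_limits[0], 0)] + list(zip(upper_limits, less_than))

# Greater-than ogive: cumulate from the bottom, plot against lower limits.
more_than = list(accumulate(students[::-1]))[::-1]
more_than_points = list(zip(lower_limits, more_than)) + [(upper_limits[-1], 0)]

print("less than   :", less_than_points)
print("greater than:", more_than_points)
```

The less-than curve rises from (60, 0) to (160, 160) and the greater-than curve falls from (60, 160) to (160, 0).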

Uses : - Certain values like the median, quartiles, deciles, quartile deviation, coefficient of
skewness etc. can be located using ogives. They can also be used to find the percentage of items
having values less than or greater than a certain value. Ogives are helpful in the comparison of
two distributions.
2.7 SUMMARY
Data is the information that is collected, and it is the raw material for statistics. This
data has to be collected in a systematic manner from the right source, primary or secondary.
The collected data should then be classified and tabulated for further processing. This statistical
data can be presented through frequency distributions or through graphs so that it can be understood
and processed easily.
2.8 ANSWER TO CHECK YOUR PROGRESS
1. Exclusive
2. External & Internal
3. Qualitative data

2.9 KEY WORDS


Data – the information collected and compiled
Population – the totality of observations made
Sample – the part of the totality actually observed in order to collect the data,
analyse it and generalise about the population
Reference Tables – the ones which present extensive information on any subject
Ordered Array – a convenient order in which to arrange the data. It can be in ascending order
or in descending order.
Class interval – the width of the class

2.10 SELF ASSESSMENT QUESTIONS


1. Why should a given set of data be presented in an organised form? Explain.
2. List the advantages of converting the data to a frequency distribution.
3. What are the sources of data ? Explain in detail
4. What are merits and demerits of using secondary data?

5. What problems do unequal class intervals create ? Explain
6. What do you understand by classification and tabulation of data? Discuss the modes of
classification.
7. Prepare a frequency distribution from the following figures relating to bonus paid to
workers
BONUS IN (Rs.)
86 62 58 73 101 90 84 90 76 61 84 63 56 88
72 92 60 83 102 76 99 54 64 87 103 61 88 55

Take a class interval of 5


8. What are the merits and demerits of diagrammatic representation of the data?
9. Represent the following in a Pie chart
Tea – 3260 tons, Cocoa – 1850 tons, Coffee – 900 tons
Total – 6010 tons
10. For the frequency distribution given below obtain the Less than and More than cumulative
frequencies. Also draw ogives for each.

L1 - L2      03-05   06-08   09-11   12-14   15-17   18-20
Frequency      5       8      11      15       7       4

UNIT -3 : MEASURES OF CENTRAL TENDENCY

STRUCTURE
3.0 Objectives
3.1 Introduction
3.2 Arithmetic mean and its computation
3.3 Weighted Arithmetic mean and its computation
3.4 Geometric mean and its computation
3.5 Harmonic mean and its computation
3.6 Median
3.7 Mode
3.8 Relationship between Mean , Median and Mode
3.9 Answer to Check Your Progress
3.10 Summary
3.11 Key Words
3.12 Self-Assessment Questions
3.13 References

3.0 OBJECTIVES
After studying this unit you should be able to :
• Explain the measures of central tendency;
• Compute the Mean, Median and Mode;
• Identify the merits and demerits of computing the Mean, Median and Mode; and
• Analyze the relationship between the three measures.

3.1 INTRODUCTION
In the previous unit, we studied how to collect raw data and how to classify and
tabulate it in a useful form, which contributes to solving many problems of statistical concern.
Yet this is not sufficient, for in practice there is a need for further condensation,
particularly when we want to compare two or more different distributions. We may reduce
the entire distribution to one number which represents the distribution.
A single value which can be considered as typical or representative of a set of observations
and around which the observations can be considered as centered is called an ‘Average’ (or
average value) or a center of location. Since such typical values tend to lie centrally within
a set of observations when arranged according to magnitude, averages are called measures
of central tendency. In fact, a distribution has a typical value (average) about which the
observations are more or less symmetrically distributed. This is of great importance, both
theoretically and practically. Dr. A.L. Bowley correctly stated, “Statistics may rightly be
called the science of averages.” The word average is commonly used in day-to-day
conversations. For example, we may say that Abert is an average boy of my class; we may
talk of an average American, average income, etc. When it is said, “Abert is an average student,”
it means that he is neither very good nor very bad, but a mediocre student. However, in
statistics the term average has a different meaning.
There is a peculiar tendency of the data to cluster or centre around a specific value. On
the whole the values tend to be closer to one particular value than to others. This peculiar tendency
of the data is called central tendency. Thus a measure of central tendency of a set of data
lies in obtaining this central value.
The fundamental measures of central tendency are:
(1) Arithmetic mean
(2) Median
(3) Mode
(4) Geometric mean
(5) Harmonic mean
(6) Weighted averages
However, the most common measures of central tendency or location are the
arithmetic mean, median and mode.

3.2 ARITHMETIC MEAN AND ITS COMPUTATION


This is the most commonly used measure of central tendency, popularly called the
average or mean.
Horace Secrist : “Arithmetic mean is the amount secured by dividing the sum of values of
the items in a series by their number”.
W.I. King : “The arithmetic average may be defined as the sum or aggregate of a series of
items divided by their number”.
Thus, the student should add all observations (values of all items) together and divide this
sum by the number of observations (or items).
Ungrouped Data
Suppose we have ‘n’ observations (or measures) x1, x2, x3, ......., xn; then the arithmetic
mean is

$$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

This method is known as the “Direct Method”.

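Example : A variable takes the values given below. Calculate the arithmetic mean of 110,
117, 129, 195, 95, 100, 100, 175, 250 and 750.

Solution: By the direct method,

$$\sum x_i = 110 + 117 + 129 + 195 + 95 + 100 + 100 + 175 + 250 + 750 = 2021$$

and n = 10, therefore

$$\bar{x} = \frac{\sum x_i}{n} = \frac{2021}{10} = 202.1$$

Indirect Method (Assumed Mean Method)

Let A be an assumed mean and $u_i = x_i - A$ the deviation of each observation from it. Then

$$\bar{x} = A + \frac{\sum u_i}{n}$$

Let A = 175; then

u_i = -65, -58, -46, +20, -80, -75, -75, 0, +75, +575

$$\sum u_i = 670 - 399 = 271$$

$$\frac{\sum u_i}{n} = \frac{271}{10} = 27.1$$

$$\bar{x} = 175 + 27.1 = 202.1$$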
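Both routes to the mean can be checked with a short Python sketch. The function name `assumed_mean_method` is chosen here for illustration only; the assumed mean 175 is the one used in the example.

```python
values = [110, 117, 129, 195, 95, 100, 100, 175, 250, 750]


def assumed_mean_method(data, assumed):
    """Mean = A + (sum of deviations from A) / n."""
    deviations = [x - assumed for x in data]
    return assumed + sum(deviations) / len(data)


direct = sum(values) / len(values)            # direct method
shortcut = assumed_mean_method(values, 175)   # indirect method with A = 175

print("Sum of values      :", sum(values))
print("Direct method      :", direct)
print("Assumed mean method:", shortcut)
```

Any other choice of the assumed mean gives the same answer; a value near the middle of the data merely keeps the deviations small for hand calculation.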

Discrete Series :

The formulae for the arithmetic mean by the direct method and by the short-cut method are
as follows:

Direct method:

$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$$

Short-cut method:

$$\bar{x} = A + \frac{\sum f_i u_i}{\sum f_i}, \quad \text{where } u_i = x_i - A$$
Example Find the mean of the following 50 observations.
19, 19, 20, 20, 20, 19, 20, 18, 21, 19,
20, 20, 19, 19, 20, 19, 21, 19, 19, 21,
18, 20, 18, 18, 17, 20, 20, 22, 20, 20,
20, 20, 20, 21, 20, 17, 23, 18, 17, 21,
20, 21, 20, 20, 20, 18, 21, 19, 21, 19
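Solution: We may tabulate the given observations as follows.

Value (x)        17    18    19    20    21    22    23    Total
Frequency (f)     3     6    11    20     8     1     1       50
f x              51   108   209   400   168    22    23      981

The arithmetic mean is

$$\bar{x} = \frac{\sum f x}{\sum f} = \frac{981}{50} = 19.62$$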
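The tabulation and the mean for the 50 observations can be verified with the following Python sketch, which builds the discrete frequency table with `Counter` and then applies the direct formula (sum of f times x divided by sum of f).

```python
from collections import Counter

observations = [
    19, 19, 20, 20, 20, 19, 20, 18, 21, 19,
    20, 20, 19, 19, 20, 19, 21, 19, 19, 21,
    18, 20, 18, 18, 17, 20, 20, 22, 20, 20,
    20, 20, 20, 21, 20, 17, 23, 18, 17, 21,
    20, 21, 20, 20, 20, 18, 21, 19, 21, 19,
]

frequency = Counter(observations)                       # value -> frequency
total_f = sum(frequency.values())                       # sum of f
total_fx = sum(x * f for x, f in frequency.items())     # sum of f * x
mean = total_fx / total_f

for x in sorted(frequency):
    print(f"x = {x}  f = {frequency[x]:>2}  fx = {x * frequency[x]}")
print("Sum f =", total_f, " Sum fx =", total_fx, " Mean =", mean)
```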

Mean for Grouped data


Continuous series: The procedure for finding the arithmetic mean in this series is the same
as that used in the discrete series. The only difference is that in this series we are given
class-intervals, whose mid-values (class-marks) are to be calculated first.

Formula: Arithmetic mean

$$\bar{x} = \frac{\sum f x}{\sum f}$$

where x = mid-value
Example The weights (in gms) of 30 articles are given below :
14, 16, 16, 14, 22, 13, 15, 24, 23, 14, 20, 17, 21, 18, 18, 19, 20, 17, 16, 15, 11, 22, 21, 20,
17, 18, 19, 12, 23,11.
Form a grouped frequency table, by dividing the variate range into intervals of equal width,
one class being 11-13 and then compute the arithmetic mean.
Solution:
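One way to work this example is sketched below in Python. It assumes the classes are the inclusive intervals 11-13, 14-16, 17-19, 20-22 and 23-25 (only the class 11-13 is stated in the question; the rest follow from equal width), and it uses the mid-value of each class in the grouped-data formula.

```python
from collections import Counter

weights = [
    14, 16, 16, 14, 22, 13, 15, 24, 23, 14,
    20, 17, 21, 18, 18, 19, 20, 17, 16, 15,
    11, 22, 21, 20, 17, 18, 19, 12, 23, 11,
]

first_lower, width = 11, 3      # assumed inclusive classes 11-13, 14-16, ...

# Index of the class each weight falls in, then the frequency per class.
class_counts = Counter((w - first_lower) // width for w in weights)

table = []
for k in sorted(class_counts):
    low = first_lower + k * width
    high = low + width - 1
    mid = (low + high) / 2
    table.append((low, high, mid, class_counts[k]))

sum_f = sum(f for _, _, _, f in table)
sum_fx = sum(mid * f for _, _, mid, f in table)
grouped_mean = sum_fx / sum_f

for low, high, mid, f in table:
    print(f"{low}-{high}  mid-value {mid:4.1f}  f = {f}  fx = {mid * f}")
print("Arithmetic mean =", grouped_mean, "gms")
```

The grouped mean differs slightly from the mean of the raw figures because every item in a class is treated as if it were equal to the class mid-value.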

Example Find the arithmetic mean for the following :
Properties Of Arithmetic Mean
1. The sum of the deviations, of all the values of x, from their arithmetic mean, is zero.
2. The product of the arithmetic mean and the number of items gives the total of all
items.
3. If $\bar{x}_1$ and $\bar{x}_2$ are the arithmetic means of two samples of sizes n1 and n2 respectively, then
the arithmetic mean of the distribution combining the two can be calculated as
$$\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$$

Merits of Arithmetic Mean : -


1. It is rigidly defined. Its value is always definite.
2. It is easy to calculate and easy to understand. Hence it is very popular.
3. It is based on all the observations; so that it becomes a good representative.
4. It can be easily used for comparison.
5. It is capable of further algebraic treatment such as finding the sum of the values of
the observations, if the mean and the total number of the observations are given;
finding the combined arithmetic mean when different groups are given etc.
6. It is not affected much by sampling fluctuations.
Demerits of Arithmetic Mean : -
1. It is affected by outliers or extreme values. In such a case the arithmetic mean is not a good
representative of the given data.
2. It is a value which may not be present in the given data.
3. Many a time it gives absurd results like 4.4 children per family.
4. It is not suitable for averaging ratios and percentages.
5. We cannot calculate it when open-end class intervals are present in the data.

3.3 WEIGHTED ARITHMETIC MEAN AND ITS COMPUTATION
When individual observations vary in importance, they are assigned weights according
to the level of importance of each in the computation of their mean. The arithmetic mean of
a set of observations computed by taking into account their corresponding weights is
known as the weighted arithmetic mean or weighted average.

$$\text{Weighted AM} = \frac{\sum w_i x_i}{\sum w_i}$$

3.4 GEOMETRIC MEAN AND ITS COMPUTATION


The geometric mean of a set of ‘n’ sample observations is the nth root of their product.
That is

$$GM = \sqrt[n]{x_1 \cdot x_2 \cdots x_n}$$

When n is more than 2, it is convenient to use logarithms:

$$GM = \text{antilog}\left(\frac{\sum \log x_i}{n}\right)$$

$$\text{Weighted GM} = \text{antilog}\left(\frac{\sum w_i \log x_i}{\sum w_i}\right)$$

GM is particularly used in averaging ratios and percentages, and rates of change in one period
over the other.

3.5 HARMONIC MEAN AND ITS COMPUTATION


It is defined as the reciprocal of the arithmetic mean of the reciprocals of a given set
of observations. The harmonic mean is denoted as

$$HM = \frac{n}{\sum \frac{1}{x_i}}, \qquad \text{Weighted } HM = \frac{\sum w_i}{\sum \frac{w_i}{x_i}}$$

Harmonic mean is particularly useful in averaging rates and ratios. It is the appropriate
average where the unit of observation (such as per hour, per day, etc.) remains the same and
the act being performed (for example, the distance covered) is constant.
It is used to calculate the average of rates or ratios. For three observations a, b and c (n = 3),

$$HM = \frac{n}{\frac{1}{a} + \frac{1}{b} + \frac{1}{c}}$$

Example : Ravi drives a car at 20 km/h for the first half of a journey and at 30 km/h for the
second half. What is his average speed?
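Taking "half of the journey" to mean half of the distance (an assumption, since the question does not say so explicitly), the average speed is the harmonic mean of the two speeds. The sketch below computes it with Python's standard `statistics.harmonic_mean` and cross-checks it from first principles using a hypothetical 60 km journey.

```python
from statistics import harmonic_mean

speeds = [20, 30]                       # km/h on each half of the distance

# HM = n / (1/a + 1/b) for n = 2 observations.
average_speed = harmonic_mean(speeds)

# Cross-check: total distance divided by total time.
total_distance = 60                     # km, hypothetical figure for the check
half = total_distance / 2
total_time = half / speeds[0] + half / speeds[1]   # hours
check = total_distance / total_time

print("Harmonic mean of speeds:", average_speed, "km/h")
print("Distance / time        :", check, "km/h")
print("Arithmetic mean (for comparison):", sum(speeds) / len(speeds), "km/h")
```

The arithmetic mean (25 km/h) overstates the average speed because more time is spent driving at the slower speed.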

3.6 MEDIAN
It is the value of the central item of the arranged data (data arranged in
ascending or descending order). Thus, it is the value of the middle item and divides the
series into two equal parts. In Connor’s words - “The median is that value of the variable which
divides the group into two equal parts, one part comprising all values greater and the other
all values lesser than the median.” For example, the daily wages of 7 workers are 5, 7, 9, 11,
12, 14 and 15 dollars. This series contains 7 terms. The fourth term i.e. $11 is the median.
Median In Individual Series (ungrouped Data)
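1. Set the individual series either in the ascending (increasing) or in the descending
(decreasing) order of the size of its items or observations.
2. If the total number of observations be ‘n’ then

A. If ‘n’ is odd, the median = size of the $\left(\frac{n+1}{2}\right)$th observation

B. If ‘n’ is even, the median = $\frac{1}{2}\left[\text{size of the } \left(\frac{n}{2}\right)\text{th observation} + \text{size of the } \left(\frac{n}{2}+1\right)\text{th observation}\right]$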

Example The following figures represent the number of books issued at the counter of a
Statistics library on 11 different days. 96, 180, 98, 75, 270, 80, 102, 100, 94, 75 and 200.
Calculate the median.

Solution:
Arrange the data in the ascending order as 75, 75, 80, 94, 96, 98, 100, 102, 180, 200, 270.
Now the total number of items ‘n’ = 11 (odd)

Therefore, the median = size of the $\left(\frac{n+1}{2}\right)$th item
= size of the $\left(\frac{11+1}{2}\right)$th item
= size of 6th item
= 98 books per day

Example The population (in thousands) of 36 metropolitan cities are as follows :

248, 591, 437, 20, 131, 143, 1490, 407, 384, 176, 263, 193, 181, 777, 387, 302, 213, 204, 153,
733, 391, 176, 178, 142, 522, 360, 65, 260, 193, 92, 672, 258, 239, 160, 147, 151. Calculate the
median.
Solution:
Arranging the terms in the ascending order as :
20, 65, 92, 131, 142, 143, 147, 151, 153, 160, 176, 176, 178, 181, 193, 193, 204, 213,
239, 248, 258, 260, 263, 302, 360, 384, 387, 391, 407, 437, 522, 591, 672, 733, 777, 1490.

Since the total number of items n = 36 (even),

the median = 1/2 [size of 18th item + size of 19th item]
= 1/2 (213 + 239)
= 1/2 (452)
= 226 thousands
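The odd and even rules can be checked on both examples with a small Python sketch. The function below is written out by hand to mirror the rule; it is an illustration rather than a required method. For the 36 cities the 18th and 19th items are 213 and 239, whose sum is 452, so the median is 226 thousands.

```python
def median(values):
    """Median of an individual series using the odd / even rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)th item (list positions start at 0).
        return ordered[(n + 1) // 2 - 1]
    # Even n: mean of the (n / 2)th and (n / 2 + 1)th items.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


books = [96, 180, 98, 75, 270, 80, 102, 100, 94, 75, 200]
cities = [
    248, 591, 437, 20, 131, 143, 1490, 407, 384, 176, 263, 193,
    181, 777, 387, 302, 213, 204, 153, 733, 391, 176, 178, 142,
    522, 360, 65, 260, 193, 92, 672, 258, 239, 160, 147, 151,
]

print("Books issued, n =", len(books), " median =", median(books))
print("City populations, n =", len(cities), " median =", median(cities), "thousands")
```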

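Median In Discrete Series : Steps :
1. Arrange the data in the ascending or descending order.
2. Find the cumulative frequencies.
3. Apply the formula (here n is the total frequency) :

A. If ‘n’ is odd then,

Median = size of the $\left(\frac{n+1}{2}\right)$th item

B. If ‘n’ is even then,

Median = $\frac{1}{2}\left[\text{size of the } \left(\frac{n}{2}\right)\text{th item} + \text{size of the } \left(\frac{n}{2}+1\right)\text{th item}\right]$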
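Example Locate the median in the following distribution.
Size      :  8   10   12   14   16   18   20
Frequency :  7    7   12   24   10    9    6
Solution:

Size                 :  8   10   12   14   16   18   20
Frequency            :  7    7   12   24   10    9    6
Cumulative frequency :  7   14   26   50   60   69   75

Here n = 75 (odd). Therefore, the median = size of the $\left(\frac{n+1}{2}\right)$th item

= size of the $\left(\frac{75+1}{2}\right)$th item = size of 38th item

In the order of the cumulative frequency, the 38th item is covered by the cumulative
frequency 50, whose size is 14.
Therefore, the median = 14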
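The steps for a discrete series can be sketched in Python as below. The frequency of size 14 is taken as 24, which is the value implied by the 38th item and the cumulative frequency of 50 in the solution; the helper `item_at` is a name introduced here for illustration.

```python
from bisect import bisect_left
from itertools import accumulate

sizes = [8, 10, 12, 14, 16, 18, 20]
frequencies = [7, 7, 12, 24, 10, 9, 6]   # 24 for size 14, as implied by the solution

cumulative = list(accumulate(frequencies))
n = cumulative[-1]


def item_at(rank):
    """Size of the rank-th item: first size whose cumulative frequency reaches rank."""
    return sizes[bisect_left(cumulative, rank)]


if n % 2 == 1:
    median = item_at((n + 1) // 2)
else:
    median = (item_at(n // 2) + item_at(n // 2 + 1)) / 2

print("Cumulative frequencies:", cumulative)
print("n =", n, " median rank =", (n + 1) // 2, " median =", median)
```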
Median In Continuous Series (grouped Data)
Steps :

1. Determine the particular class in which the value of the median lies. Use $\frac{N}{2}$ as the
rank of the median and not $\frac{N+1}{2}$.

2. After ascertaining the class in which the median lies, the following formula is used for
determining the exact value of the median.

$$\text{Median} = l_1 + \frac{\frac{N}{2} - c.f.}{f} \times (l_2 - l_1)$$

where
l1 = lower limit of the median class, the class in which the middle item of the distribution lies
l2 = upper limit of the median class
N = total frequency
c.f. = cumulative frequency of the class preceding the median class
f = simple frequency of the median class

It should be noted that while interpolating the median value of frequency distribution it is
assumed that the variable is continuous and that there is an orderly and even distribution of
items within each class.
Example Calculate the median for the following and verify it graphically.
Age (years) : 20-25 25-30 30-35 35-40 40-45
No. of person : 70 80 180 150 20

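Solution:

The cumulative frequencies are 70, 150, 330, 480 and 500, so N = 500 and N/2 = 250. The 250th
item lies in the class 30-35, which is therefore the median class.

$$\text{Median} = l_1 + \frac{\frac{N}{2} - c.f.}{f} \times (l_2 - l_1)$$

Here l1 = 30, l2 = 35, N/2 = 250, c.f. = 150 and f = 180

Therefore, Median $= 30 + \frac{250 - 150}{180} \times (35 - 30) = 30 + 2.78 = 32.78$ years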
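The interpolation for grouped data can be sketched in Python as follows. The function name and argument layout are illustrative; the classes are taken in the exclusive form given in the example, and the rank used is N/2.

```python
def grouped_median(classes, frequencies):
    """Median = l1 + ((N/2 - c.f.) / f) * (l2 - l1) for exclusive class-intervals."""
    half = sum(frequencies) / 2          # rank N/2 of the median
    cf_before = 0                        # cumulative frequency before the current class
    for (l1, l2), f in zip(classes, frequencies):
        if cf_before + f >= half:        # first class whose c.f. reaches N/2
            return l1 + (half - cf_before) / f * (l2 - l1)
        cf_before += f
    raise ValueError("no observations")


ages = [(20, 25), (25, 30), (30, 35), (35, 40), (40, 45)]
persons = [70, 80, 180, 150, 20]

print("Median age =", round(grouped_median(ages, persons), 2), "years")
```

With N/2 = 250, c.f. = 150 and f = 180 the function returns 30 + (100 / 180) × 5, i.e. about 32.78 years.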

Note that, while calculating the median of a series, it must be put in the ‘exclusive class-
interval’ form. If the original series is in inclusive type, first convert it into the exclusive
type and then find its median.
Merits Of Median
1. It is rigidly defined.
2. It is easy to calculate and understand.
3. Unlike the arithmetic mean, it is not affected by extreme values.
4. It can be found by mere inspection.
5. It is fully representative and can be computed easily.
6. It can be used for qualitative studies.
7. Even if the extreme values are unknown, median can be calculated if one knows the
number of items.
8. It can be obtained graphically.
Demerits of Median
1. It may not be representative if the distribution is irregular and abnormal.
2. It is not capable of further algebraic treatment.
3. It is not based on all observations.
4. It is affected by sampling fluctuations.
5. The arrangement of the data in the order of magnitude is absolutely necessary.
Check Your Progress
1. Before determining the median, the data has to be arranged in ............... order.
2. The three different types of mean are ..............
3. represents ......................

3.7 MODE
It is the size of that item which possesses the maximum frequency. According to
Professor Kenney and Keeping, the value of the variable which occurs most frequently in a
distribution is called the mode. It is the most common value. It is the point of maximum
density.
Ungrouped Data
Individual series: The mode of this series can be obtained by mere inspection. The number
which occurs most often is the mode.
Example Locate mode in the data 7, 12, 8, 5, 9, 6, 10, 9, 4, 9, 9
Solution : On inspection, it is observed that the number 9 has maximum frequency. Therefore
9 is the mode.
Grouped Data: Steps :
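1. Determine the modal class, which has the maximum frequency.
2. By interpolation the value of the mode can be calculated as -

Mode = l1 + ((f1 - f0) / (2f1 - f0 - f2)) × (l2 - l1)

where l1 and l2 are the lower and upper limits of the modal class, f1 is the frequency of
the modal class, f0 is the frequency of the class preceding it and f2 is the frequency of
the class succeeding it.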

Example Calculate the modal wages.

Daily wages in $ : 20-25  25-30  30-35  35-40  40-45  45-50
No. of workers   :   1      3      8     12      7      5
Verify it graphically.

Solution:

Here the maximum frequency is 12, corresponding to the class interval (35 - 40), which is the
modal class.

Therefore l1 = 35, l2 = 40, f1 = 12, f0 = 8 and f2 = 7.

By interpolation

Mode = l1 + ((f1 - f0) / (2f1 - f0 - f2)) × (l2 - l1)

     = 35 + ((12 - 8) / (24 - 8 - 7)) × 5

     = 35 + (4 / 9) × 5

     = 35 + 2.22

Modal wages is $37.22
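The same interpolation can be sketched in Python. This is an illustrative check only, assuming equal-width classes and a single modal class. The name grouped_mode is hypothetical, and a missing neighbouring class is treated as having frequency zero.

```python
def grouped_mode(lower_limits, width, freqs):
    """Mode of a grouped distribution by interpolation within the modal class."""
    i = max(range(len(freqs)), key=lambda k: freqs[k])    # index of the modal class
    f1 = freqs[i]                                         # modal frequency
    f0 = freqs[i - 1] if i > 0 else 0                     # preceding class
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0        # succeeding class
    return lower_limits[i] + (f1 - f0) / (2 * f1 - f0 - f2) * width


# Daily wages data from the example
mode = grouped_mode([20, 25, 30, 35, 40, 45], 5, [1, 3, 8, 12, 7, 5])
print(round(mode, 2))  # 37.22
```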

Merits of mode
1. It is simple to calculate.
2. In individual or discrete distribution it can be located by mere inspection.
3. It is easy to understand. Everyone is used to the idea of average size of a garment, an
average American etc.
4. It is not isolated like the median as it is the most common item.
5. Unlike the arithmetic mean, it is not a value which may be absent from the series.
6. It is not necessary to know all the items. What we need is the point of maximum
frequency density.
7. It is not affected by sampling fluctuations.
Demerits
1. It is ill defined.
2. It is not based on all observations.
3. It is not capable of further algebraic treatment.
4. It is not a good representative of the data.
5. Sometimes there are more than one values of mode.

3.8 RELATIONSHIP BETWEEN MEAN, MEDIAN AND MODE


In a symmetrical distribution Mean, Median and Mode have the same value. The relationship
is Mean = Median = Mode.
But usually the frequency distribution deviates from symmetry and becomes skewed. Skewness
affects the mean more than the median and the mode, and the following situations arise.
Positively skewed : the distribution is skewed to the right. As a result the mean has the
highest value, followed in descending order by the median and the mode (Mean > Median > Mode).
Negatively skewed : the distribution is skewed to the left. As a result the mean has the
lowest value, followed in ascending order by the median and the mode (Mean < Median < Mode).
All the three are empirically related as:
Mean – Mode = 3(Mean – Median)
Mode = mean - 3(Mean – Median)
Mean - Median = 1/3(Mean – Mode)
Mode = 3 Median - 2 Mean
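The empirical relation can be tried out numerically. The sketch below is only an illustration: the function name empirical_mode and the figures 50 and 48 are hypothetical, chosen to represent a positively skewed distribution.

```python
def empirical_mode(mean, median):
    """Estimate the mode from the empirical relation Mode = 3 Median - 2 Mean."""
    return 3 * median - 2 * mean


mean, median = 50, 48                 # hypothetical values, with Mean > Median
mode = empirical_mode(mean, median)
print(mode)                           # 44

# Both sides of Mean - Mode = 3(Mean - Median) agree
print(mean - mode, 3 * (mean - median))   # 6 6
```

The estimate is approximate and is meant for moderately skewed distributions.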

Comparison of the three measures:
1. Mean is the most familiar and widely used measure of central tendency as it takes into
account all observations in its computation. The presence of extreme values affect
Mean more than the Median and the Mode. Mean is used more in symmetric
distributions.
2. Median is easier to understand and compute when the data set is relatively small. Extreme
values do not affect the median as much as the mean, and therefore it is frequently
used as the best measure of central tendency in asymmetric distributions.
3. Mode is the least used measure of central tendency. It is very easy to compute. It can
be used for both quantitative and qualitative data. A little care should be taken in
computing Mode because, every distribution may not have mode and there may be two
modes present in one distribution.
3.9 ANSWER TO CHECK YOUR PROGRESS
1. Ascending
2. Arithmetic, Geometric and Harmonic
3. Sum of Frequencies

3.10 SUMMARY
Measures of central tendency are the basics of statistics. It is an attempt to find out
the central value of a given set of data. The idea behind determining such a typical value is to
use it as representative of the entire data. There are three measures of central tendency:
Mean, Median and Mode. The other measures are Geometric Mean and harmonic Mean.
Mean is also called the Average. Median is a positional average: it is the middle value of an
ordered array of a set of observations. Mode is also a positional average and it is that value
which appears the maximum number of times.

3.11 KEY WORDS


Weighted Average - A mean or average value calculated to take into account the
importance of each value to the overall total
Bimodal Distribution - A distribution in which two values occur more frequently
than the others in a set of values.
Mean – the Arithmetic Average of the given set of observations
Median Class – A frequency distribution class interval denoting the median
value of the observations

3.12 SELF ASSESSMENT QUESTIONS
1. What are the various Measures of Central Tendency? Explain each in detail.
2. Calculate the average value of age for a class of 10 students with their ages as under
11,12, 13, 13, 10, 13, 12, 11, 10, 12.
3. From the following calculate the average level of marks of the class.
Marks: 0 2 3 4 5 6 7 8 9
Number of students: 11 10 9 21 12 17 8 22 15
4. Given below is the distribution of marks obtained by 60 students in final exams.
Compute a. Mean, b. Median, c. Mode
Marks: 20 30 40 50 60 70
Number of students: 8 12 20 10 6 4
5. From the frequency distribution given below find Mean, Median and Mode
Class intervals : 50-52 53-55 56-58 59-61 62-64
Frequencies: 5 10 21 8 6
6. The average sales of a product for a particular week, excluding Sunday, were 150 units.
On Sunday there was a rush of sales which inflated the average sales for the entire week
to 210 units. Find the sales for Sunday.
7. Find the GM of 5 sample observations: 28, 45, 50, 65, and 90.
8. Obtain the HM of 5 samples: 4, 20, 12, 10 and 15
9. Calculate the Arithmetic Mean by Step Deviation Method for the following data:
Class – Intervals Mid Points (X) Frequency (f)
0-10 5 7
10-20 15 9
20-30 25 15
30-40 35 11
40-50 45 27
50-60 55 18
60-70 65 5

10. Given the following distribution calculate Mean Median and Mode and also show
their empirical relationship.
Pay Scale (Rs) No. of employees
Less than 2000 14
Less than 3000 19
Less than 4000 26
Less than 5000 35
5000 and above 42

3.13 REFERENCES
1. Gupta S.P. Business Statistics, New Delhi: S Chand and Sons Publishers, 2000
2. Shahsi Kumar. Quantitative Techniques and methods, Mysuru: Chetana Book House,
2010
3. Vignanesh Prajapathi, Big data Analysis With R and Hadoop, Mumbai: Packt Publishing,
2013
4. SD Sharma, Operation Research, Delhi: Discovery Publishing House, 1997
5. Srinath L. S, PERT and CPM, Delhi: East West Press,2001
6. Kalavathy, Operation Research , New Delhi: Vikas Publishing House, 2010

7. Richard I. Levin. Statistics for Management, New Delhi: Pearson education India, 2008

UNIT 4 : MEASURES OF DISPERSION

STRUCTURE
4.0 Objectives
4.1 Introduction
4.2 Range
4.3 Quartile Deviation and Computations
4.4 Mean Deviation
4.5 Variance
4.6 Standard Deviation
4.7 Coefficient of Variation
4.8 Skewness
4.9 Kurtosis
4.10 Check Your Progress
4.11 Summary
4.12 Key Words
4.13 Self-Assessment Questions
4.14 References

4.0 OBJECTIVES
After studying this unit you should be able to :
∗ Explain the measures of dispersion;
∗ Analyse and compute each measure of dispersion; and
∗ Identify the relationship and significance of each of the measures

4.1 INTRODUCTION
The measures of central tendency, such as the mean, indicate the general magnitude of the
data and locate only the centre of a distribution. They do not establish the
degree of variability, i.e. the spread or scatter of the individual items and their deviation
from (or difference with) the mean.
i) According to Neiswanger, “Two distributions of statistical data may be symmetrical
and have common means, medians and modes and identical frequencies in the modal
class. Yet with these points in common they may differ widely in the scatter or in their
values about the measures of central tendencies.”
ii) Simpson and Kafka said, “An average alone does not tell the full story. It is hardly
fully representative of a mass, unless we know the manner in which the individual
items scatter around it .... a further description of a series is necessary, if we are to
gauge how representative the average is.”
From this discussion we now focus our attention on the scatter or variability which is
known as dispersion. Let us take the following three sets.

Students    Group X    Group Y    Group Z

1             50         45         30
2             50         50         45
3             50         55         75

Mean          50         50         50

Thus, the three groups have the same mean, i.e. 50. In fact the medians of groups X and Y are
also equal. If one were to conclude from this that the students of the three groups are of
equal capability, the conclusion would be totally wrong. Close examination reveals that in
group X all students have marks equal to the mean, in group Y the marks are very close to the
mean, but in the third group Z the marks are widely scattered. It is thus clear that a measure
of central tendency alone is not sufficient to describe the data.
Definition of dispersion : Dispersion is the arithmetic mean of the deviations of the values
of the individual items from the particular measure of central tendency used. Thus ‘dispersion’
is also known as the “average of the second degree.” Prof. Griffin and Dr. Bowley described
dispersion in the same way.
In simple terms Dispersion is the variability among individual observations comprising
a set of data. It describes the spread characteristics of the data. A measure of dispersion lies
in quantifying the variability among individual observations and their scatter around the central
value.
Characteristics of ideal measure of dispersion:
1. It should be rigidly defined
2. It should be easy to calculate
3. It should be based on all the observations
4. It should be amenable for further mathematical treatment
5. It should be affected as little as possible by fluctuations of sampling and by extreme
observations
Methods of Computing Dispersion: the various measures of dispersions are
1. Range
2. Quartile Deviation
3. Mean Deviation
4. Variance
5. Standard Deviation
In measuring dispersion, it is imperative to know the amount of variation (absolute
measure) and the degree of variation (relative measure). In the former case we consider the
range, mean deviation, standard deviation etc. In the latter case we consider the coefficient
of range, the coefficient of mean deviation, the coefficient of variation etc.

The methods of measuring dispersion may also be grouped as follows:
(I) Method of limits:
(1) The range (2) Inter-quartile range (3) Percentile range
(II) Method of Averages:
(1) Quartile deviation (2) Mean deviation
(3) Standard Deviation and (4) Other measures.

4.2 RANGE
In any statistical series, the difference between the largest and the smallest values is
called as the range.

Thus Range (R) = L - S, where L is the largest value and S is the smallest value.

Coefficient of Range : The relative measure of the range. It is used in the comparative
study of dispersion.

Co-efficient of Range = (L - S) / (L + S)

Example ( Individual series ) Find the range and the co-efficient of the range of the following
items :
110, 117, 129, 197, 190, 100, 100, 178, 255, 790.
Solution: R = L - S = 790 - 100 = 690

Co-efficient of Range = (L - S) / (L + S) = 690 / 890 = 0.78
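The range and its co-efficient for the individual series above can be verified with a few lines of Python. This is a minimal sketch, assuming the usual definition Co-efficient of Range = (L - S) / (L + S). The variable names are illustrative.

```python
items = [110, 117, 129, 197, 190, 100, 100, 178, 255, 790]

largest, smallest = max(items), min(items)      # L and S
value_range = largest - smallest                # R = L - S
coefficient_of_range = (largest - smallest) / (largest + smallest)

print(value_range)                       # 690
print(round(coefficient_of_range, 2))    # 0.78
```

Because only the two extreme items enter the calculation, changing the single value 790 changes the range completely, which illustrates why the range is a crude measure.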
Example (Continuous series) Find the range and its co-efficient from the following data.

6 60 - 70

70-80 80-90 90-100


8 7 8

Solution: R = L - S = 100 - 10 = 90

Co-efficient of range = (L - S) / (L + S) = (100 - 10) / (100 + 10) = 90 / 110 = 0.82

Merits and Demerits of range:


It is the simplest but crude method of dispersion. It is rigidly defined and readily
comprehensible and easiest to compute. But it is not based on all the observations and cannot
be used for further mathematical treatment. It is based on only two extreme values and not
on entire set of data. It is affected by the fluctuations of sampling. Range cannot be used if
we are dealing with open end class. It is very sensitive to the size of the sample. Therefore
Range is regarded as too indefinite to be used as a practical measure of dispersion.

4.3 QUARTILE DEVIATION AND COMPUTATIONS


Quartiles and Interquartile Range
It is the measure of dispersion based on the upper quartile Q3 and the lower quartile
Q1. Quartile deviation is obtained by dividing the interquartile range by 2. Hence it is also
called the Semi Inter Quartile Range. Therefore Q. D. ( SI QR ) = (Q3 - Q1) / 2

If we concentrate on the two extreme values ( as in the case of range ), we don’t get any
idea about the scatter of the data within the range ( i.e. between the two extreme values ).
If we discard the extreme parts of the data, the limited range thus available might be more
informative. For this reason the concept of interquartile range is developed. It is the range
which includes the middle 50% of the distribution. Here 1/4 ( one quarter ) of the observations
at the lower end and 1/4 ( one quarter ) at the upper end are excluded.

Now the lower quartile ( Q1 ) is the 25th percentile and the upper quartile ( Q3 ) is the
75th percentile. It is interesting to note that the 50th percentile is the middle quartile ( Q2 ),
which is in fact what you have studied under the title ‘Median’. Thus symbolically
Inter quartile range = Q3 - Q1
If we divide (Q3 - Q1) by 2 we get what is known as the Semi-Inter quartile range,
i.e. (Q3 - Q1) / 2. It is known as Quartile deviation ( Q. D. or SI QR ).

Therefore Q. D. ( SI QR ) = (Q3 - Q1) / 2

Coefficient of QD = (Q3 - Q1) /(Q3 + Q1)

Example: Find the following from the distribution.

1. Interquartile range
2. Quartile deviation
3. Coefficient of Q D

Class Interval     F     Less than C F
0-15               8          8
15-30             26         34
30-45             30         64
45-60             45        109
60-75             20        129
75-90             17        146
90-105             4        150
Total          N = 150

Position of Q1 = N/4 = 150/4 = 37.5

The first CF greater than 37.5 is 64. Therefore Q1 lies in the corresponding class 30-45.

Q1 = l + (h/f)(N/4 – C), where l is the lower limit of the quartile class, h is its width,
f is its frequency and C is the cumulative frequency of the preceding class.

Q1 = 30 + (15/30)(37.5 - 34) = 31.75

Position of Q3 = 3N/4 = 3(150)/4 = 112.5

The first CF greater than 112.5 is 129. Therefore Q3 lies in the corresponding class 60-75.

Q3 = l + (h/f)(3N/4 – C)

Q3 = 60 + (15/20)(112.5 - 109) = 62.625

a. Inter quartile Range = Q3 - Q1 = 62.625 – 31.75 = 30.875
b. Quartile deviation = (Q3 - Q1) / 2 = 30.875 / 2 = 15.44
c. Coefficient of QD = (Q3 - Q1) / (Q3 + Q1) = (62.625 – 31.75) / (62.625 + 31.75) = 0.33
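The quartile calculations in this example follow one pattern, which can be sketched in Python. The helper below is an assumption-laden illustration: it takes equal-width exclusive classes and applies the interpolation l + (h/f)(pN – C), with p = 1/4 for Q1 and p = 3/4 for Q3. The name grouped_quantile is hypothetical.

```python
def grouped_quantile(lower_limits, width, freqs, p):
    """Value below which a fraction p of a grouped distribution lies."""
    target = p * sum(freqs)        # N/4 for Q1, 3N/4 for Q3
    cum = 0                        # cumulative frequency of preceding classes (C)
    for lower, f in zip(lower_limits, freqs):
        if cum + f >= target:      # class containing the required position
            return lower + (width / f) * (target - cum)
        cum += f
    raise ValueError("empty distribution")


lower_limits = [0, 15, 30, 45, 60, 75, 90]
freqs = [8, 26, 30, 45, 20, 17, 4]

q1 = grouped_quantile(lower_limits, 15, freqs, 0.25)
q3 = grouped_quantile(lower_limits, 15, freqs, 0.75)
iqr = q3 - q1                      # interquartile range
qd = iqr / 2                       # quartile deviation
coeff_qd = (q3 - q1) / (q3 + q1)   # coefficient of QD

print(q1, q3, iqr)                       # 31.75 62.625 30.875
print(round(qd, 2), round(coeff_qd, 2))  # 15.44 0.33
```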
Merits and Demerits of QD:
It is easy to understand and calculate. It uses the middle 50% of the data and is thus a
better measure than range. It is not affected by extreme values as it excludes the lowest 25%
and the highest 25% of the data. It can also be calculated for open end classes and it is
the only measure of dispersion that can deal with open end classes.
It is not based on all observations since it ignores 25% at the beginning and 25% at
the end. It is affected by fluctuations of sampling and is not suitable for further mathematical
treatment.
Check Your Progress
1. The difference between the smallest and the largest value is called .................
2. The arithmetic mean of the deviations of individual values from the central value is called ........
3. The formula to calculate the semi inter quartile range is ..............
4.4 MEAN DEVIATION
Average deviations (mean deviation ) is the average amount of variations (scatter) of
the items in a distribution from either the mean or the median or the mode, ignoring the
signs of these deviations.

Individual Series: Steps to calculate MD :

1. Find the average A (usually the arithmetic mean) of the distribution by the usual methods.
2. Take the deviation d = X - A of each observation from the average.
3. Ignore the negative signs of the deviations, taking all the deviations to be positive, to
obtain the absolute deviations |d| = |X - A|.
4. Obtain the sum of the absolute deviations obtained in step 3.
5. Divide the total obtained in step 4 by n (the number of observations).
The result gives the value of mean deviation about the average A.

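In case of a frequency distribution MD is obtained as :

M D (about the Average A) = Σ f |X - A| / N

M D (about Mean ) = Σ f |X - Mean| / N

M D (about Median ) = Σ f |X - Median| / N

M D (about Mode) = Σ f |X - Mode| / N

where N = Σ f is the total frequency.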
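Example (Continuous series) Calculate the mean deviation and the coefficient of mean
deviation from the following data using the mean. The data show the difference in ages
between boys and girls of a class.

Diff. in years     No. of students
0 - 5                   449
5 - 10                  705
10 - 15                 507
15 - 20                 281
20 - 25                 109
25 - 30                  52
30 - 35                  16
35 - 40                   4

Calculation (taking the mid-values X = 2.5, 7.5, ..., 37.5 and N = Σ f = 2123):

1) Mean = Σ f X / N = 22217.5 / 2123 = 10.47

2) M. D. = Σ f |X - Mean| / N = 11333.55 / 2123 = 5.34

3) Co-efficient of M. D. = M. D. / Mean = 5.34 / 10.47 = 0.51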

Merits and demerits of Mean deviations:


M D is rigidly defined and is easy to understand and calculate. It is based on all the
observations. MD removes the irregularities in the distribution and provides an accurate and
true measure of dispersion. It is less affected by extreme observations.
The major demerit is that we take the absolute values and neglect the signs of the deviations,
which is mathematically unsound and illogical. This makes it unsuitable for further mathematical
treatment. It is not a satisfactory measure when taken about the Mode or the Median. It cannot be
computed with open end classes. It also tends to increase in size as the size of the sample
increases.

4.5 VARIANCE
The term variance was used by R. A. Fisher in 1918 to describe the square of the standard
deviation. The concept of variance is of great importance in advanced work where it is
possible to split the total variance into several parts, each attributable to one of the
factors causing variation in the original series. Variance is defined as follows:

Variance = σ² = Σ fi (xi - x̄)² / n, where n = Σ fi

4.6 STANDARD DEVIATION (S. D.)


It is the square root of the arithmetic mean of the squared deviations of the various
values from their arithmetic mean. It is denoted by s.d. or σ.

Thus, s.d.( x ) = σ = √( Σ fi (xi - x̄)² / n )

where n = Σ fi

Merits :
a. It is rigidly defined and based on all observations.
b. It is amenable to further algebraic treatment.
c. It is not affected by sampling fluctuations.
d. It is less erratic.
e. It is the most widely used measure of dispersion
Demerits :
a. It is difficult to understand and calculate.
b. It gives greater weight to extreme values.

Note that variance V(x) = σ² = (1/n) Σ fi (xi - x̄)²

and s. d. ( x ) = σ = √V(x), with x̄ = (1/n) Σ fi xi.

Then V ( x ) = (1/n) Σ fi xi² - x̄²

4.7 CO-EFFICIENT OF VARIATION ( C. V. )


To compare the variations ( dispersion ) of two different series, a relative measure of
standard deviation must be calculated. This is known as the co-efficient of variation or the
co-efficient of s. d. Its formula is CV = (S.D. / Mean) × 100

Thus it is defined as the ratio of the s. d. to the mean, expressed as a percentage.

Remark: It is given as a percentage and is used to compare the consistency or variability of
two or more series. The higher the C. V., the higher the variability, and the lower the C. V., the
higher is the consistency of the data.

Example Calculate the standard deviation and its co-efficient from the following data.

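A    B    C    D    E    F    G    H    I    J
10   12   16   8    25   30   14   11   13   11

Solution :

No     xi     (xi - x̄)    (xi - x̄)²

A      10       -5           25
B      12       -3            9
C      16       +1            1
D       8       -7           49
E      25      +10          100
F      30      +15          225
G      14       -1            1
H      11       -4           16
I      13       -2            4
J      11       -4           16

n = 10    Σ xi = 150    Σ (xi - x̄)² = 446

Calculations :

i) x̄ = Σ xi / n = 150 / 10 = 15

ii) s.d. = √( Σ (xi - x̄)² / n ) = √(446 / 10) = √44.6 = 6.68

iii) C. V. = (s.d. / x̄) × 100 = (6.68 / 15) × 100 = 44.5%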
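The three calculations of this example can be reproduced in Python. This is a minimal sketch, assuming the s.d. is computed with divisor n (as in the definition of Section 4.6) and C. V. = (s.d. / mean) × 100. The variable names are illustrative.

```python
import math

values = [10, 12, 16, 8, 25, 30, 14, 11, 13, 11]    # items A to J

n = len(values)
mean = sum(values) / n                                  # i) arithmetic mean
sum_sq_dev = sum((x - mean) ** 2 for x in values)       # sum of squared deviations
sd = math.sqrt(sum_sq_dev / n)                          # ii) standard deviation
cv = sd / mean * 100                                    # iii) co-efficient of variation (%)

print(mean, sum_sq_dev)             # 15.0 446.0
print(round(sd, 2), round(cv, 1))   # 6.68 44.5
```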

Example Calculate s.d. of the marks of 100 students.

Marks    No. of students (fi)    Mid-values (xi)    fi xi    fi xi²

0-2            10                      1              10        10
2-4            20                      3              60       180
4-6            35                      5             175       875
6-8            30                      7             210      1470
8-10            5                      9              45       405

           n = 100                            Σ fi xi = 500    Σ fi xi² = 2940

Solution

1) x̄ = Σ fi xi / n = 500 / 100 = 5

2) σ = √( Σ fi xi² / n - x̄² ) = √(2940 / 100 - 25) = √4.4

σ = 2.0976 ≈ 2.10
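The grouped calculation can be checked with the short-cut form V(x) = Σ fi xi² / n - x̄². The Python sketch below assumes that form and uses the class mid-values of the marks table. The variable names are illustrative.

```python
import math

mids = [1, 3, 5, 7, 9]            # mid-values xi of the classes 0-2, ..., 8-10
freqs = [10, 20, 35, 30, 5]       # number of students fi

n = sum(freqs)
sum_fx = sum(f * x for f, x in zip(freqs, mids))         # sum of fi xi
sum_fx2 = sum(f * x * x for f, x in zip(freqs, mids))    # sum of fi xi squared

mean = sum_fx / n
variance = sum_fx2 / n - mean ** 2
sd = math.sqrt(variance)

print(n, sum_fx, sum_fx2)    # 100 500 2940
print(round(sd, 4))          # 2.0976
```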
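Combined Standard deviation : If two sets containing n1 and n2 items, having means x̄1 and
x̄2 and standard deviations σ1 and σ2 respectively, are taken together then,

(1) Mean of the combined data is x̄12 = (n1 x̄1 + n2 x̄2) / (n1 + n2)

(2) s.d. of the combined set is σ12 = √( [ n1 (σ1² + d1²) + n2 (σ2² + d2²) ] / (n1 + n2) ),
where d1 = x̄1 - x̄12 and d2 = x̄2 - x̄12.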
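One way to gain confidence in the combined formulas is to compare them with a direct calculation on the pooled data. The Python sketch below assumes the standard combined-s.d. formula σ12 = √([n1(σ1² + d1²) + n2(σ2² + d2²)] / (n1 + n2)) with d1 and d2 the deviations of the set means from the combined mean. The two small data sets and the function name are hypothetical.

```python
import math
import statistics


def combined_mean_sd(n1, mean1, sd1, n2, mean2, sd2):
    """Mean and s.d. of two sets taken together, from their summary figures."""
    mean12 = (n1 * mean1 + n2 * mean2) / (n1 + n2)
    d1 = mean1 - mean12
    d2 = mean2 - mean12
    var12 = (n1 * (sd1 ** 2 + d1 ** 2) + n2 * (sd2 ** 2 + d2 ** 2)) / (n1 + n2)
    return mean12, math.sqrt(var12)


set1 = [2, 4, 6, 8]       # hypothetical data
set2 = [10, 20, 30]       # hypothetical data

mean12, sd12 = combined_mean_sd(
    len(set1), statistics.fmean(set1), statistics.pstdev(set1),
    len(set2), statistics.fmean(set2), statistics.pstdev(set2),
)

pooled = set1 + set2
print(mean12, statistics.fmean(pooled))    # the two means agree
print(sd12, statistics.pstdev(pooled))     # the two s.d. values agree up to rounding
```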

Example The score of two teams A and B in 10 matches are as :

A 40 32 0 40 30 7 13 25 14 5
B 21 14 29 13 5 12 10 13 30 0

Find the variance for both the series. Which team is more consistent ?
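A possible way to work this example is sketched below in Python. It assumes the variance is taken with divisor n and that consistency is judged by the co-efficient of variation, the team with the lower C. V. being the more consistent. The function name is illustrative.

```python
import statistics

team_a = [40, 32, 0, 40, 30, 7, 13, 25, 14, 5]
team_b = [21, 14, 29, 13, 5, 12, 10, 13, 30, 0]


def variance_and_cv(scores):
    """Population variance and co-efficient of variation (in %) of a series."""
    mean = statistics.fmean(scores)
    variance = statistics.pvariance(scores)
    cv = math_sqrt(variance) / mean * 100
    return variance, cv


def math_sqrt(value):
    # Small helper so the sketch needs only the statistics module
    return value ** 0.5


var_a, cv_a = variance_and_cv(team_a)
var_b, cv_b = variance_and_cv(team_b)

print(round(var_a, 2), round(var_b, 2))    # 194.44 82.41
more_consistent = "A" if cv_a < cv_b else "B"
print(more_consistent)                     # B
```

Team B has both the smaller variance and the smaller C. V. in this sketch, so it is the more consistent team under these assumptions.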

4.8 SKEWNESS
We study Skewness to have an idea about the shape of the curves which we can draw
with the help of the given frequency distribution. It helps us to understand the nature of the
concentration of observations towards higher and lower values of the variable. A distribution
is said to be skewed if :
1. The frequency curve of the distribution is not a symmetric bell shaped curve but is
stretched more to one side than the other. If it has a longer tail towards the right it is
said to be positively skewed, and if the tail is longer towards the left it is
negatively skewed.
2. The values of Mean, median and Mode fall at different points.
3. Quartiles Q1 and Q3 are not equidistant from the median.
Measures of Skewness:
1. Sk = Mean – Median
2. Sk = Mean – Mode
3. Sk = (Q3–Md) – (Md - Q1)= Q3 + Q1 – 2Md

Karl Pearson’s Coefficient of Skewness:

Sk = (Mean – Mode) / S.D.
Or, where the Mode is ill defined,
Sk = 3(Mean – Median) / S.D.
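Karl Pearson's coefficient can be computed directly from a series. The sketch below, in Python, reuses the ten items of the standard deviation example in Section 4.7 and assumes the two standard forms Sk = (Mean – Mode) / S.D. and Sk = 3(Mean – Median) / S.D. The variable names are illustrative.

```python
import statistics

values = [10, 12, 16, 8, 25, 30, 14, 11, 13, 11]

mean = statistics.fmean(values)
median = statistics.median(values)
mode = statistics.mode(values)       # 11 is the only repeated value
sd = statistics.pstdev(values)       # s.d. with divisor n

sk_mode = (mean - mode) / sd             # form based on the mode
sk_median = 3 * (mean - median) / sd     # form used when the mode is ill defined

print(mean, median, mode)                      # 15.0 12.5 11
print(round(sk_mode, 2), round(sk_median, 2))  # 0.6 1.12
```

Both values are positive, which agrees with the long right tail produced by the items 25 and 30.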

4.9 KURTOSIS

Kurtosis tells us more about the shape of a distribution than its variability alone. Prof.
Karl Pearson called it the ‘convexity of the curve’. Kurtosis enables us to have an idea about
the shape and nature of the hump (middle part) of a frequency distribution. Therefore kurtosis
is concerned with the flatness or peakedness of the frequency curve. The normal curve is called
mesokurtic (β2 = 3). Curves which are more peaked than the normal curve are called
leptokurtic; they have β2 > 3, i.e. positive excess kurtosis. Curves which are flatter than
the normal curve are called platykurtic; they have β2 < 3, i.e. negative excess kurtosis.
As a measure of kurtosis Karl Pearson described the coefficient β2 as

β2 = μ4 / μ2²

where μ2 and μ4 are the second and fourth moments about the mean.

4.10 ANSWERS TO CHECK YOUR PROGRESS
1. Range
2. Dispersion
3. (Q3 - Q1) /2

4.11 SUMMARY
The choice among the various measures discussed above is based on their merits and demerits.
Range is the simplest of all but it is based on only the two extreme values. Quartile deviation
is also not adequately representative as it uses only 50% of the data, but it suits open-end
classes well. The Variance and Standard Deviation are the two most objective measures of
dispersion as they cover all the observations of the data. SD is the most widely used measure
of variability.
4.12 KEY WORDS
Dispersion – Variability among the observations made or
deviations from the expected one.
MAD – Mean Absolute Deviation
Skewness – Lack of symmetry
Percentile range - this is a measure of dispersion based on the
difference between certain percentiles.
Lorenz Curve – it is a graphic measure of studying the dispersion.
This curve is used in business to study the
disparities of the distribution of wages, profits,
turnover, production, population etc.
4.13 SELF-ASSESSMENT QUESTIONS
1. Explain the validity of the statement “An Average when published should be accompanied
by a measure of dispersion for significant interpretation”.
2. What is dispersion? Explain each measure in detail.
3. ‘The Standard Deviation is the best measure of dispersion.’ Why?
4. Standard Deviation can never be negative – comment
5. Differentiate SD and MD
6. Calculate the mean deviation from the following:
X: 5 15 25 35 45 55 65
f: 8 12 10 8 3 2 7
7. Find the Median and Mean deviation from the following data;
size Frequency
0-10 7
10-20 12
20-30 18
30-40 25
40-50 16
50-60 14
60-70 8

8. Find mean deviation from Mean and Median for the following:
Score No. of Students
140-150 4
150-160 6
160-170 10
170-180 10
180-190 9
190-200 3

9. Find out Mean and Standard Deviation from the following:


Age: 10 20 30 40 50 60 70 80

Death :15 30 53 75 100 110 115 125

10. Explain the relative measures of dispersion.
11. The coefficient of variation is 60% and the SD is 12. Find the Mean.
12. During 10 weeks of a session, the marks are as follows.
Ramesh:58 59 60 54 65 66 52 75 69 52

Suresh: 87 89 78 71 73 84 65 66 56 46
a. Who is the better scorer?
b. Who is more consistent?

13. Write short notes on
a. Skewness
b. Kurtosis
c. Percentile
d. Lorenz Curve

4.14 REFERENCES
1. Gupta S.P. Business Statistics, New Delhi: S Chand and Sons Publishers, 2000
2. Shahsi Kumar. Quantitative Techniques and methods, Mysuru: Chetana Book House,
2010
3. Vignanesh Prajapathi, Big data Analysis With R and Hadoop, Mumbai: Packt Publishing,
2013
4. SD Sharma, Operation Research, Delhi: Discovery Publishing House, 1997
5. Srinath L. S, PERT and CPM, Delhi: East West Press,2001
6. Kalavathy, Operation Research , New Delhi: Vikas Publishing House, 2010

7. Richard I. Levin. Statistics for Management, New Delhi: Pearson education India, 2008

BLOCK 2
INTRODUCTION

Dear Student,
In the previous block you gained basic knowledge about data science. You learnt how to
collect data, how to tabulate data and how to depict data. You also learnt data analysis through
various measures of central tendency such as Mean, Median and Mode. All the information you
studied was limited to one set of data.
In this block let us try to understand the relationship between two sets of data. Some sets of data depend
directly or inversely on other sets of data. For example, crop yield over the years depends on the amount of rainfall
in that place over those years. Such a relation is called correlation. Once you are able to correlate two sets of
data, you can try to find the exact relationship between them, which is called regression.
In this block you will study 4 units.
Unit 5: Concept and definition of correlation, significance, types, Properties of Correlation Methods of
correlation analysis: Graphic method,
Unit 6: Scatter diagrams, Karl Pearson’s correlation co-efficient, Rank correlation coefficient,
Unit 7: Regression: Regression analysis: meaning and definition of regression, application of regression
analysis, difference between correlation & regression analysis, Types of regression models, standard
error and Regression coefficients.
Unit 8: Multiple correlation and regression: Concept of multiple regression and multiple correlation,
Concept of partial correlation, Correlation co-efficient, Method of least squares.

MODULE 2

UNIT 5 : CORRELATION

STRUCTURE
5.0 Objectives
5.1 Introduction
5.2 Concept and Definition of Correlation
5.3 Correlation and Causation
5.4 Types of Correlation
5.5 Significance of the Study of Correlation
5.6 Measures to Describe Correlation
5.7 Solved Problems
5.8 Summary
5.9 Keywords
5.10 Self Assessment Questions
5.11 References

5.0 OBJECTIVES
After studying this unit you should be able to :
∗ Analyse the importance, as also the limitations of correlation analysis and
∗ Distinguish between
a. linear and non-linear correlation,
b. positive and negative correlation, and
c. simple, partial and multiple correlation

5.1 INTRODUCTION
The existence of a relationship between two or more variables constitutes vital
information for decision making in any given situation. For example, it is important for a
manufacturer to know how product sales are related to expenses incurred on advertising.
Similarly, it is useful for a farmer to know the relationship between crop yield and the quantum
of fertilizer applied. In all such cases, knowledge of the mechanism by which the variables
are related is beneficial for taking appropriate decisions.

5.2 CONCEPT AND DEFINITION OF CORRELATION


The concept of ‘Correlation’ is a statistical tool which studies the relationship between
two variables and correlation analysis involves various methods and techniques used for
studying and measuring the extent of the relationship between the two variables.
“Two variables are said to be in correlation if the change in one of the variables results in
a change in the other variable”.
Correlation analysis is used as a statistical tool to ascertain the association between
two variables. The correlation analysis refers to the techniques used in measuring the
closeness of the relationship between the variables. The degree of relationship between the
variables under consideration is measured through the correlation analysis.

The important definitions of correlation are given as follows:


1. “When the relationship is of a quantitative nature, the appropriate statistical tool for
discovering and measuring the relationship and expressing it in form of formula is
known as correlation” – Croxton & Cowden
2. “Correlation analysis attempts to determine the ‘degree of relationship’ between
variables”. – Ya Lun Chou
3. “Correlation is an analysis of the co-variation between two or more variables”. – A.M
Tuttle
The problem in analyzing the association between two variables can be broken down into
three steps.
1. We try to know whether the two variables are related or independent of each other.
2. If we find that there is relationship between the two variables, we try to know its nature
and strength. This means whether these variables have a positive or negative relationship
and how close that relationship is.
3. We may like to know if there is a causal relationship between them. This means that
the variation in one variable causes variation in another.

5.3 CORRELATION AND CAUSATION


When we use correlation analysis and establish a relationship between two variables,
then we confront a major question: does this relationship indicate the existence of cause
and effect relationship? It may be noted that there may be a very high degree of relationship
between two variables, but they may just show similar movements and causal relationship is
non-existent. Let us see when such a situation arises.
∗ The correlation may be due to chance particularly when the data pertain to a small sample.
A small sample bivariate series may show the relationship but such a relationship may
not exist in the universe.
∗ It is possible that both the variables are influenced by one or more other variables. For
example, expenditure on food and entertainment for a given number of households show
a positive relationship because both have increased over time. But, this is due to rise in
family incomes over the same period. In other words, the two variables have been
influenced by another variable – increase in family incomes.
∗ There may be another situation where both the variables may be influencing each other
so that we cannot say which is the cause and which is the effect. For example, take the
case of price and demand. The rise in price of a commodity may lead to a decline in the
demand for it. Here, price is the cause and the demand is the effect. But a rise in demand
may itself push up the price, in which case demand is the cause and price is the effect.

5.4 TYPES OF CORRELATION


Correlation may be of different types. Some of the most important types are:
1. Positive and negative
2. Linear and non-linear
3. Simple, partial and multiple

We briefly describe these types of correlation.
1. Positive and Negative Correlation : Positive correlation indicates that the movement
of the two variables is in the same direction, that is, both the variables are either increasing
or decreasing. In contrast, if the movement of the two variables is in the opposite direction,
that is, one variable is increasing and the other is decreasing, then the correlation is
negative. Some examples of series of positive correlation are :
1. Heights and weights;
2. Household income and expenditure;
3. Price and supply of commodities;
4. Amount of rainfall and yield of crops.
Suppose we are given sets of data relating to heights and weights of students in a class.
They can be plotted on the coordinate plane using x-axis to represent heights and y-axis to
represent weights.
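The graph shown below illustrates positive correlation.
[Figure: scatter diagram with X on the horizontal axis and Y on the vertical axis; the
points rise from the lower left to the upper right. Positive Correlation]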

Correlation between two variables is said to be negative or inverse if the variables deviate
in opposite direction. That is, if the increase (or decrease) in the values of one variable
results on an average, in corresponding decrease (or increase) in the values of other variable.
Some examples of series of negative correlation are :
1. Volume and pressure of perfect gas;
2. Current and resistance
3. Price and demand of goods.

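The graph shown below illustrates negative correlation.
[Figure: scatter diagram with X on the horizontal axis and Y on the vertical axis; the
points fall from the upper left to the lower right. Negative Correlation]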

2. Linear and Non-Linear Correlation : If the extent of change in one variable tends to
have a constant ratio to the extent of change in another variable, then the correlation is
said to be linear. This will be clear from the following example.

X     5    10    15    20    25
Y    30    60    90   120   150

In this case, we find that the ratio of change from one figure to the next is the same in
the two series. Thus, it gives a linear correlation. In contrast, in a non-linear correlation,
this consistency in the ratio of change does not exist. If a couple of figures in either series X
or Y are changed, it may give a non-linear correlation.
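The idea of a constant ratio of change can be checked mechanically. The Python sketch below computes the ratio of the change in Y to the change in X between successive pairs of the example. The variable names and the tolerance used for the comparison are assumptions made for illustration.

```python
x = [5, 10, 15, 20, 25]
y = [30, 60, 90, 120, 150]

# Ratio of the change in Y to the change in X between successive observations
ratios = [(y[i + 1] - y[i]) / (x[i + 1] - x[i]) for i in range(len(x) - 1)]
is_linear = all(abs(r - ratios[0]) < 1e-9 for r in ratios)

print(ratios)       # [6.0, 6.0, 6.0, 6.0]
print(is_linear)    # True

# Changing one figure in Y breaks the constant ratio
y_changed = [30, 60, 95, 120, 150]
ratios_changed = [(y_changed[i + 1] - y_changed[i]) / (x[i + 1] - x[i])
                  for i in range(len(x) - 1)]
print(ratios_changed)   # [6.0, 7.0, 5.0, 6.0]
```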

3. Simple, Partial and Multiple Correlation : The distinction amongst these three types
of correlation depends upon the number of variables involved in a study. If only two
variables are involved in a study, then the correlation is said to be simple. When three or
more variables are involved, it is a problem of either partial or multiple correlation.
In multiple correlation, three or more variables are studied simultaneously. But in partial
correlation we consider only two variables influencing each other while the effect of the other
variables is held constant.

5.5 SIGNIFICANCE OF THE STUDY OF CORRELATION
The study of correlation is of immense use in practical life because of the following reasons:
a. Once we know that two variables are closely related, we can estimate the value of one
variable given the value of another. This is done with the help of regression analysis.
b. Correlation analysis contributes to the understanding of economic behavior, aids in
locating the critically important variables on which others depend, may reveal to the
economist the connections by which disturbances spread, and may suggest to him the paths
through which stabilizing forces may become effective.
c. Most of the variables show some kind of relationship. For example, there is relationship
between price and supply, income and expenditure, etc. With the help of correlation
analysis we can measure in one figure the degree of relationship existing between the
variables.
The correlation analysis enables the executive to estimate costs, sales, prices and other
variables on the basis of some other series with which these costs, sales, or prices may be
functionally related.
However, it should be noted that the coefficient of correlation is one of the most widely
used and also one of the most widely abused statistical measures. It is abused in the sense
that one sometimes overlooks the fact that correlation measures nothing but the strength
of a linear relationship and that it does not necessarily imply a cause-effect relationship.
d. The effect of correlation is to reduce the range of uncertainty. A prediction based
on correlation analysis is likely to be more valuable and nearer to reality.
e. Progressive development in the methods of science and philosophy has been
characterized by an increase in the knowledge of relationships or correlations. In nature
also one finds a multiplicity of interrelated forces.

5.6 MEASURES TO DESCRIBE CORRELATION


Statisticians have developed two measures for describing the correlation between two
variables:
1. The Co-efficient of Determination
2. Coefficient of correlation
1. The Co-efficient of Determination (r²) : The co-efficient of determination is the
primary way we can measure the extent, or strength, of the association that exists between
two variables, X and Y. The co-efficient of determination is the square of the co-efficient of
correlation. It is a more useful and readily comprehensible measure for indicating the
percentage of variation in the dependent variable which is accounted for by the independent
variable. In other words, the co-efficient of determination gives the ratio of the explained
variance to the total variance.
Thus the co-efficient of determination is expressed as follows :
r² = Explained Variance / Total Variance
The co-efficient of determination is more useful and a better measure for interpreting the
value of r.

2. The Co-efficient of Correlation: The co-efficient of correlation is the second measure
that we can use to describe how well one variable is explained by another. One of the
most widely used statistics is the coefficient of correlation ‘r’, which measures the degree
of association between the two sets of values of related variables given in the data set. It
takes values from +1 to -1. If two sets of data have r = +1, they are said to be perfectly
positively correlated; if r = -1, they are said to be perfectly negatively correlated; and if r = 0
they are uncorrelated.
The coefficient of correlation ‘r’ is given by the formula
\[ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2]\,[n\sum y^2 - (\sum y)^2]}} \]
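
As an illustration of how the formula for r is evaluated from the raw sums, here is a minimal Python sketch. The function name `pearson_r` and the two small data sets are assumptions made for this example; they do not come from the text.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from the raw sums n, Σx, Σy, Σxy, Σx² and Σy²."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 pairs")
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    numerator = n * sxy - sx * sy
    denominator = sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    if denominator == 0:
        raise ValueError("r is undefined when either series is constant")
    return numerator / denominator

# Hypothetical data: one perfectly rising and one perfectly falling series.
r_rising = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_falling = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(r_rising, r_falling)
```

The two calls show the limiting cases described above: a perfectly linear rising series gives r = +1 and a perfectly linear falling series gives r = -1.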

5.7 SOLVED PROBLEMS


Example – 1 :
The following data relate to advertising expenditure (in lakhs of rupees) and sales
(in crores of rupees) of a firm :
Advertising Expenditure : 10 12 15 23 20
(in lakh of Rs.)
Sales : 14 17 23 25 21
(in crores of Rs.)

a. Estimate the sales target for an advertising expenditure of Rs.25 lakhs.


b. Calculate the coefficient of correlation (r) and coefficient of determination of
advertising expenditure and sales.
Solution :
a. To estimate the sales target for a given advertising expenditure we need the
regression equation of y on x, which is as follows :
\[ y_c = a + bx \]
where
\[ b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} \quad \text{and} \quad a = \bar{y} - b\bar{x} \]

Adv. Exp.   Sales
    x         y        x²       y²       xy
   10        14       100      196      140
   12        17       144      289      204
   15        23       225      529      345
   23        25       529      625      575
   20        21       400      441      420
∑x = 80   ∑y = 100   ∑x² = 1398   ∑y² = 2080   ∑xy = 1684

\[ b = \frac{5(1684) - (80)(100)}{5(1398) - (80)^2} = \frac{8420 - 8000}{6990 - 6400} = \frac{420}{590} = 0.712 \]
\[ a = \frac{100}{5} - 0.712\left(\frac{80}{5}\right) = 20 - 11.392 = 8.608 \]
Hence the line of regression of y on x would be
yc = a + bx
yc = 8.608 + 0.712x

Thus, if the advertising expenditure is Rs. 25 lakhs, that is x = 25, the computed sales
target yc would be
yc = 8.608 + 0.712 (25)
= 8.608 + 17.80
= Rs. 26.408 crores
b. Coefficient of determination :
\[ r^2 = \frac{\text{Explained variation}}{\text{Total variation}} = \frac{\sum (y_c - \bar{y})^2}{\sum (y - \bar{y})^2} = \frac{a\sum y + b\sum xy - \frac{(\sum y)^2}{n}}{\sum y^2 - \frac{(\sum y)^2}{n}} \]
\[ = \frac{8.608(100) + 0.712(1684) - \frac{(100)^2}{5}}{2080 - \frac{(100)^2}{5}} = \frac{860.8 + 1199 - 2000}{2080 - 2000} = \frac{59.8}{80} = 0.747 \]
Co-efficient of correlation r = √0.747 = 0.864

The co-efficient of correlation and the co-efficient of determination can also be calculated
by another method, which is as follows :
\[ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2]\,[n\sum y^2 - (\sum y)^2]}} \]
\[ = \frac{5 \times 1684 - 80 \times 100}{\sqrt{[5 \times 1398 - (80)^2]\,[5 \times 2080 - (100)^2]}} = \frac{8420 - 8000}{\sqrt{(6990 - 6400)(10400 - 10000)}} = \frac{420}{\sqrt{590 \times 400}} \]
\[ = \frac{420}{\sqrt{236000}} = \frac{420}{485.8} = 0.864 \]
Co-efficient of determination r² = (0.864)² = 0.747
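
The arithmetic of Example 1 can be reproduced with a few lines of Python. This is only a sketch: the variable names are assumptions, and because the code keeps b unrounded, its values for a and for the forecast differ slightly in the third decimal place from the hand calculation, which rounds b to 0.712.

```python
from math import sqrt

adv = [10, 12, 15, 23, 20]     # x: advertising expenditure (Rs. lakhs)
sales = [14, 17, 23, 25, 21]   # y: sales (Rs. crores)

n = len(adv)
sx, sy = sum(adv), sum(sales)
sxy = sum(x * y for x, y in zip(adv, sales))
sxx = sum(x * x for x in adv)
syy = sum(y * y for y in sales)

# Regression line of y on x: yc = a + b*x
b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
a = sy / n - b * (sx / n)
forecast = a + b * 25          # sales target for x = 25

# Coefficient of correlation and coefficient of determination
r = (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
r_squared = r ** 2

print(round(b, 3), round(a, 3), round(forecast, 3))
print(round(r, 3), round(r_squared, 3))
```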

Example 2 :
Calculate the co-efficient of correlation for the ages of husband and wife.
Age of Husband 23 27 28 29 30 31 33 35 36 39
Age of wife 18 22 23 24 25 26 28 29 30 32

Solution :
CALCULATIONS FOR CORRELATION CO-EFFICIENT
   x      y    u=x-31   v=y-25    u²     v²     uv
  23     18      -8       -7      64     49     56
  27     22      -4       -3      16      9     12
  28     23      -3       -2       9      4      6
  29     24      -2       -1       4      1      2
  30     25      -1        0       1      0      0
  31     26       0        1       0      1      0
  33     28       2        3       4      9      6
  35     29       4        4      16     16     16
  36     30       5        5      25     25     25
  39     32       8        7      64     49     56
∑x=311  ∑y=257  ∑u=1   ∑v=7   ∑u²=203  ∑v²=163  ∑uv=179

Karl Pearson’s correlation coefficient between u and v is given by

\[ r_{uv} = \frac{n\sum uv - (\sum u)(\sum v)}{\sqrt{[n\sum u^2 - (\sum u)^2]\,[n\sum v^2 - (\sum v)^2]}} \]
\[ = \frac{10 \times 179 - 1 \times 7}{\sqrt{[10 \times 203 - (1)^2]\,[10 \times 163 - (7)^2]}} = \frac{1790 - 7}{\sqrt{(2030 - 1)(1630 - 49)}} = \frac{1783}{\sqrt{2029 \times 1581}} \]
\[ = \frac{1783}{45.04 \times 39.76} = \frac{1783}{1790.79} = 0.9956 \]

Since Karl Pearson’s correlation co-efficient r is independent of change of origin, we get
r_xy = r_uv = 0.9956.

Example 3

Find Karl Pearson’s co-efficient of correlation between sales and expenses of the
following ten firms
Firm 1 2 3 4 5 6 7 8 9 10
Sale in thousand 50 50 55 60 65 65 65 60 60 50
units
Expenses in 11 13 14 16 16 15 15 14 13 13
thousand rupees

Solution :

Let sales (in thousand units) of a firm be denoted by x and expenses (in ’000 rupees) be
denoted by y. It may be noted that we can take out the factor 5 as common in the x series.
Hence it will be convenient to change the scale in x as well. Taking 65 and 13 as working
means for x and y respectively, let us take

u = (x - 65) / 5;  v = y - 13

CALCULATIONS FOR CORRELATION CO-EFFICIENT

Firm     x      y    u=(x-65)/5   v=y-13    u²    v²    uv
  1     50     11       -3          -2       9     4     6
  2     50     13       -3           0       9     0     0
  3     55     14       -2           1       4     1    -2
  4     60     16       -1           3       1     9    -3
  5     65     16        0           3       0     9     0
  6     65     15        0           2       0     4     0
  7     65     15        0           2       0     4     0
  8     60     14       -1           1       1     1    -1
  9     60     13       -1           0       1     0     0
 10     50     13       -3           0       9     0     0
      ∑x=580  ∑y=140  ∑u=-14     ∑v=10   ∑u²=34  ∑v²=32  ∑uv=0

Karl Pearson’s correlation co-efficient between u and v is given by :

\[ r_{uv} = \frac{n\sum uv - (\sum u)(\sum v)}{\sqrt{[n\sum u^2 - (\sum u)^2]\,[n\sum v^2 - (\sum v)^2]}} \]
\[ = \frac{10 \times 0 - (-14)(10)}{\sqrt{[10 \times 34 - (-14)^2]\,[10 \times 32 - (10)^2]}} = \frac{140}{\sqrt{(340 - 196)(320 - 100)}} \]
\[ = \frac{140}{\sqrt{144 \times 220}} = \frac{140}{\sqrt{31680}} = \frac{140}{177.99} = 0.7866 \]

Since the correlation co-efficient is independent of change of origin and scale, we finally
have r_xy = r_uv = 0.7866.

Aliter. We have

\[ \bar{x} = \frac{\sum x}{n} = \frac{580}{10} = 58; \qquad \bar{y} = \frac{\sum y}{n} = \frac{140}{10} = 14 \]

Since the means of x and y are integers, it will be convenient to compute r by taking the
deviations from the means directly, i.e. by taking

dx = x - 58;  dy = y - 14.

CALCULATIONS FOR CORRELATION CO-EFFICIENT

   x      y    dx=x-58   dy=y-14    dx²    dy²   dx.dy
  50     11      -8        -3        64      9     24
  50     13      -8        -1        64      1      8
  55     14      -3         0         9      0      0
  60     16       2         2         4      4      4
  65     16       7         2        49      4     14
  65     15       7         1        49      1      7
  65     15       7         1        49      1      7
  60     14       2         0         4      0      0
  60     13       2        -1         4      1     -2
  50     13      -8        -1        64      1      8
∑x=580  ∑y=140  ∑dx=0    ∑dy=0   ∑dx²=360  ∑dy²=22  ∑dxdy=70

\[ r_{xy} = \frac{\sum dx\,dy}{\sqrt{\sum dx^2 \cdot \sum dy^2}} = \frac{70}{\sqrt{360 \times 22}} = \frac{70}{\sqrt{7920}} = \frac{70}{88.99} = 0.7866 \]
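
The statement that r is unaffected by a change of origin and scale can be checked directly on the data of Example 3. The sketch below is illustrative; the function name `pearson_r` is an assumption, and it uses the deviations-from-means form of the formula.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

sales = [50, 50, 55, 60, 65, 65, 65, 60, 60, 50]      # x
expenses = [11, 13, 14, 16, 16, 15, 15, 14, 13, 13]   # y

# Change of origin and scale used in the worked solution
u = [(x - 65) / 5 for x in sales]
v = [y - 13 for y in expenses]

r_xy = pearson_r(sales, expenses)
r_uv = pearson_r(u, v)
print(round(r_xy, 4), round(r_uv, 4))
```

Both calls give the same value, which is why the working means and the common factor may be chosen purely for arithmetic convenience.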

Example 4

Find Karl Pearson’s co-efficient of correlation between the age and the playing habit of
the people from the following information.
Age group (years) No. of people No. of players
15 and less than 20 200 150
20 and less than 25 270 162
25 and less than 30 340 170
30 and less than 35 360 180
35 and less than 40 400 180
40 and less than 45 300 120

Also mention what does your calculated ‘r’ indicate

Solution :

We want to find Karl Pearson’s correlation coefficient between the age and the playing
habit of the people. To do this, we first express the number of players out of a fixed 1000 or
some other convenient figure. Here we express the number of players as a percentage of the
total number of people in each age group.

Now we compute Karl Pearson’s correlation co-efficient between age (x) and the
percentage of players in each age group (y).
Age group (yrs.)   No. of people   No. of players   Percentage of players (y)
15-20                  200              150          (150/200) x 100 = 75
20-25                  270              162          (162/270) x 100 = 60
25-30                  340              170          (170/340) x 100 = 50
30-35                  360              180          (180/360) x 100 = 50
35-40                  400              180          (180/400) x 100 = 45
40-45                  300              120          (120/300) x 100 = 40

CALCULATIONS FOR CORRELATION CO-EFFICIENT
Age group   Mid value (x)    y    u=(x-27.5)/5   v=(y-50)/5    u²    v²    uv
15-20           17.5        75        -2              5         4    25   -10
20-25           22.5        60        -1              2         1     4    -2
25-30           27.5        50         0              0         0     0     0
30-35           32.5        50         1              0         1     0     0
35-40           37.5        45         2             -1         4     1    -2
40-45           42.5        40         3             -2         9     4    -6
Total                              ∑u=3          ∑v=4      ∑u²=19  ∑v²=34  ∑uv=-20

Since the correlation coefficient is independent of change of origin and scale, we have :

\[ r_{xy} = r_{uv} = \frac{n\sum uv - (\sum u)(\sum v)}{\sqrt{[n\sum u^2 - (\sum u)^2]\,[n\sum v^2 - (\sum v)^2]}} \]
\[ = \frac{6 \times (-20) - (3)(4)}{\sqrt{[6 \times 19 - (3)^2]\,[6 \times 34 - (4)^2]}} = \frac{-120 - 12}{\sqrt{(114 - 9)(204 - 16)}} = \frac{-132}{\sqrt{105 \times 188}} \]
\[ = \frac{-132}{\sqrt{19740}} = \frac{-132}{140.4991} = -0.9395 \]
Thus we conclude that there is a very high degree of negative correlation, (almost perfect negative
correlation) between age (x) and playing habit (y). This implies that with advancement in
age, the people’s interest in playing goes on decreasing and the scatter diagram of the (x,y)
values gives points clustering almost around a straight line starting from left top and going
to right bottom.
5.8 SUMMARY
In this unit the concept of correlation or the association between two variables has been
discussed. Correlation analysis helps us to determine the strength of the linear relationship
between the two variables. Correlation analysis is used as a starting point for selecting
useful independent variables for regression analysis.
A scatter plot of the variables may suggest that the two variables are related but the value
of the Pearson Correlation coefficient r quantifies this association. The correlation
coefficient r may assume values between -1 and +1. The sign indicates whether the association
is direct (+ve) or inverse (-ve). A numerical value of r equal to unity indicates perfect
association while a value of zero indicates no association.
5.9 KEYWORDS
Correlation : Degree of association between two variables.
Correlation Co-efficient : A number lying between -1 and +1 that quantifies the association
between two variables.
Covariance : This is the joint variation between the variables X and Y

5.10 SELF ASSESSMENT QUESTIONS


1. Define the term correlation. Explain the concept of positive and negative correlation
with examples.
2. Distinguish between positive and negative correlation with the help of scatter diagram.
3. Explain the usefulness of Correlation.
4. Explain the different types of Correlation.
5. Distinguish between
a. Simple correlation and Multiple Correlation.
b. Linear and non linear correlation
c. Simple, Partial and multiple correlation
6. Discuss the business applications of correlation with examples.
7. Find if there is any significant correlation between the heights and weights given below:
Height in inches 57 59 62 63 64 65 55 58 57
Weight in lbs 113 117 126 126 130 129 111 116 112
8. The following are the marks obtained by the students of a class in Statistics and
Accountancy:
Roll No. of Marks in Marks in Roll No. of Marks in Marks in
students Statistics Accountancy students Statistics Accountancy
1 15 13 13 14 11
2 0 1 14 9 3
3 1 2 15 8 5
4 3 7 16 13 4
5 16 8 17 10 10
6 2 9 18 13 11
7 18 12 19 11 14
8 5 9 20 11 7
9 4 17 21 12 18
10 17 16 22 18 15
11 6 6 23 9 15
12 19 18 24 7 3
Prepare a correlation table taking the magnitude of each class interval as four marks and the first

9. From the following data examine whether input of oil and output of electricity can be
said to be correlated:
Input of 6.9 8.2 7.8 4.8 9.6 8.0 7.7
oil
Output of 1.9 3.5 6.5 1.3 5.5 3.5 2.2
electricity

10. Calculate correlation between X and Y


X 23 27 28 28 29 30 31 33 35 36
Y 18 20 22 27 21 29 27 29 28 29

11. Calculate correlation from the following data:


X 200-300 300-400 400-500 500-600 600-700
Y
10-15 - - - 3 7
15-20 - 4 9 4 3
20-25 7 6 12 5 -
25-30 3 10 19 8 -

12. Find the correlation for the following data


Marks in Mathematics
Marks in 10 20 30 40 50
Statistics 5 2 4 1 4 1
10 8 2 5 1 -
15 - 3 2 1 -
20 - 1 3 2 4
25 - - 4 2 -

13. The following table gives the frequency, according to groups of marks obtained by
67 students in an intelligence test. Measure the degree of relationship between age
and intelligence test:
Age in years Total
Test marks 18 19 20 21
200-250 4 4 2 1 11
250-300 3 5 4 2 14
300-350 2 6 8 5 21
350-400 1 4 6 10 21
Total 10 19 20 18 67

5.11 REFERENCES
1. Gupta S. P., Business Statistics, S Chand and Sons Publishers, Delhi, 2017.
2. Quantitative Techniques for Business Decisions, Chetana Book House, Mysore, 2015.
3. Vignesh Prajapathi, Big data Analysis With R and Hadoop, Packet Publishing, 2016.
4. Sharma S. D., Operation Research, Discovery Publishing House, Delhi, 2016.
5. Srinath L. S., PERT and CPM, East West Press, Delhi, 2002.
6. Kalavathy, Operation Research, Vikas Publishing House, Delhi, 2008.

UNIT 6 : METHODS OF COMPUTING CORRELATION

STRUCTURE
6.0 Objectives
6.1 Introduction
6.2 Scatter diagram
6.3 Karl Pearson’s Co-efficient of Correlation
6.4 Rank Correlation
6.5 Computation of ‘r’ from a Cross Classification Table
6.6 Solved Problems
6.7 Summary
6.8 Key words
6.9 Self Assessment Questions
6.10 References

6.0 OBJECTIVES
After studying this unit you should be able to :
∗ Define scatter diagram;
∗ Explain the Karl Pearson’s Co-efficient of Correlation;
∗ Recognize when a scatter diagram suggests relationship between two variables;
∗ Calculate and interpret coefficient of correlation for individual observations as well
as for bivariate grouped data;
∗ Calculate Rank Correlation and
∗ Compute Correlation from a Cross classification Table.

6.1 INTRODUCTION
It is known that correlation analysis deals with the association between two or more
variables. Business is a complex phenomenon influenced by many variables which are
related in one way or another. It is essential for a businessman to know the factors
influencing business and their relationships in order to take decisions. In this unit, methods
of analysing correlation are discussed. These methods will help a businessman to analyse the
relationship between two or more variables.
The commonly used methods for studying the correlation between two variables are :
1. Scatter diagram method
2. Karl Pearson’s coefficient of correlation (Co-variance method)
3. Rank method.
4. Two-way frequency table /Bivariate correlation method/ Cross Classification Table

6.2 SCATTER DIAGRAM


Scatter diagram is a simple and attractive method of diagrammatic representation of a
bivariate distribution for ascertaining the nature of correlation between two variables. A
scatter diagram can give us two types of information. Visually, we can look for patterns that
indicate that the variables are related. Then, if the variables are related, we can see what kind
of line, or estimating equation, describes this relationship.
Scatter diagram of student scores on entrance examinations plotted against
cumulative grade-point averages.

Student                        A     B     C     D     E     F     G     H
Entrance examination scores    74    69    85    63    82    60    79    91
Cumulative grade points        2.6   2.2   3.4   2.3   3.1   2.1   3.2   3.8

The nature of the distribution of points will indicate the existence of correlation and the
nature of association between the two variables. If the points lie along a straight
line sloping diagonally upward, correlation may be taken as perfect positive. If they lie along a
straight line sloping downward, correlation may be taken as perfect negative. Whether the degree of
correlation is high or low can be known from how closely the points cluster about such a line. If they
are distributed all over the diagram or cluster around a small area, the evidence of correlation
is very remote. With the help of this diagram, it is also possible to know whether
correlation is linear or not. The scatter diagram method, however, is only a rough method of
finding the presence of correlation; it cannot give us any measure like the coefficient of
correlation.

6.3 KARL PEARSON’S CO-EFFICIENT OF CORRELATION (CO-VARIANCE METHOD)
A mathematical method for measuring the intensity or the magnitude of the linear relationship
between two variable series was suggested by Karl Pearson (1857 – 1936). The Pearson
coefficient of correlation is denoted by the symbol r. It is one of the very few symbols that
are used universally for describing the degree of correlation between two series.
The formula for computing Pearsonian r is :
\[ r = \frac{\sum xy}{N\sigma_x\sigma_y} \]

Where : x = (X – X̄);
y = (Y – Ȳ)

σx = Standard deviation of series X

σy = Standard deviation of series Y

N = Number of pairs of observations
r = the (product moment) correlation coefficient.

The value of the coefficient of correlation as obtained by the above formula shall always
lie between -1 and +1. When r = +1, there is perfect positive correlation between the
variables. When r = -1, there is perfect negative correlation between the variables.
When r = 0, there is no linear relationship between the two variables. However, in practice
such values of r as +1, -1 and 0 are rare. We normally get values which lie between +1 and
-1. For example, r = +0.6 means that the correlation is positive, because the sign of r is +, and
that the magnitude of correlation is 0.6. Similarly, r = -0.46 means a low degree of negative
correlation.

A simple form of the above formula for application to practical problems is given
as follows :
\[ r = \frac{\sum xy}{\sqrt{\sum x^2} \cdot \sqrt{\sum y^2}} \]
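
The simple form follows from the first formula once the standard deviations are written out. A short derivation is sketched below, assuming that σx and σy are computed with the divisor N and that x and y are deviations from the respective means:

```latex
\[ \sigma_x = \sqrt{\frac{\sum x^2}{N}}, \qquad \sigma_y = \sqrt{\frac{\sum y^2}{N}} \]
\[ r = \frac{\sum xy}{N\sigma_x\sigma_y}
     = \frac{\sum xy}{N\sqrt{\dfrac{\sum x^2}{N}}\sqrt{\dfrac{\sum y^2}{N}}}
     = \frac{\sum xy}{\sqrt{\sum x^2}\cdot\sqrt{\sum y^2}} \]
```

The factor N cancels, so only the three sums ∑xy, ∑x² and ∑y² are needed in practice.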

Example 1 :
The following table gives indices of industrial production and the number of registered
unemployed (in hundred thousand). Calculate the value of the coefficient of correlation.
Year 1991 1992 1993 1994 1995 1996 1997 1998

Index of production 100 102 104 107 105 112 103 99

Number unemployed 15 12 13 11 12 12 19 26

Solution :
Calculation of Karl Pearson’s Correlation Coefficient
Year   Production   x = (X – X̄)    x²    Unemployed   y = (Y – Ȳ)    y²     xy
           X                                  Y
1991      100           -4         16        15            0          0      0
1992      102           -2          4        12           -3          9     +6
1993      104            0          0        13           -2          4      0
1994      107           +3          9        11           -4         16    -12
1995      105           +1          1        12           -3          9     -3
1996      112           +8         64        12           -3          9    -24
1997      103           -1          1        19           +4         16     -4
1998       99           -5         25        26          +11        121    -55
       ∑X = 832      ∑x = 0   ∑x² = 120   ∑Y = 120     ∑y = 0   ∑y² = 184  ∑xy = -92

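\[ r = \frac{\sum xy}{\sqrt{\sum x^2} \cdot \sqrt{\sum y^2}} \]
where x = (X – X̄); y = (Y – Ȳ)

\[ \bar{X} = \frac{\sum X}{N} = \frac{832}{8} = 104; \qquad \bar{Y} = \frac{\sum Y}{N} = \frac{120}{8} = 15 \]

∑xy = -92, ∑x² = 120, ∑y² = 184

\[ r = \frac{-92}{\sqrt{120} \times \sqrt{184}} = -0.619 \]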
Example 2:
Find out the coefficient of correlation between the sales and expenses of
the following 10 firms (figures in ’000 Rs.)
Firms 1 2 3 4 5 6 7 8 9 10

Sales 50 50 55 60 65 65 65 60 60 50

Expenses 11 13 14 16 16 15 15 14 13 13

Solution : Computation of Correlation Coefficient

Sales   Deviations from   Square of    Expenses   Deviations from   Square of    Product of
  X     mean (58)         deviations      Y       mean (14)         deviations   deviations
             x               x²                        y               y²           xy
 50         -8               64           11          -3               9           +24
 50         -8               64           13          -1               1            +8
 55         -3                9           14           0               0             0
 60         +2                4           16          +2               4            +4
 65         +7               49           16          +2               4           +14
 65         +7               49           15          +1               1            +7
 65         +7               49           15          +1               1            +7
 60         +2                4           14           0               0             0
 60         +2                4           13          -1               1            -2
 50         -8               64           13          -1               1            +8
∑X = 580   ∑x = 0        ∑x² = 360     ∑Y = 140     ∑y = 0         ∑y² = 22      ∑xy = 70

The correlation coefficient between sales and expenses is :
\[ r = \frac{\sum xy}{\sqrt{\sum x^2} \cdot \sqrt{\sum y^2}} = \frac{70}{\sqrt{360} \cdot \sqrt{22}} = 0.787 \]

Example : 3
Calculate the coefficient of correlation by Karl Pearson’s method from the following
data relating to overhead expenses and cost of production :

Overheads (in ’000 Rs.) X    80   90   100   110   120   130   140   150   160

Cost (in ’000 Rs.) Y         15   15    16    19    17    18    16    18    19

Solution : Computation of Correlation Co-efficient

Overheads   Deviations from   Square of    Cost   Deviations from   Square of    Product of
    X       mean (120)        deviations     Y    mean (17)         deviations   deviations
                 x               x²                    y               y²           xy
   80           -40             1600        15        -2               4           +80
   90           -30              900        15        -2               4           +60
  100           -20              400        16        -1               1           +20
  110           -10              100        19        +2               4           -20
  120             0                0        17         0               0             0
  130           +10              100        18        +1               1           +10
  140           +20              400        16        -1               1           -20
  150           +30              900        18        +1               1           +30
  160           +40             1600        19        +2               4           +80
∑X = 1080                   ∑x² = 6000   ∑Y = 153                  ∑y² = 20     ∑xy = 240

\[ r = \frac{\sum xy}{\sqrt{\sum x^2} \cdot \sqrt{\sum y^2}} = \frac{240}{\sqrt{6000} \cdot \sqrt{20}} = 0.693 \]
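
The working table of Example 3 can be rebuilt programmatically, which is a convenient way of checking each row. The following is a sketch under the assumption that deviations are taken from the actual means (120 and 17); the variable names are illustrative.

```python
from math import sqrt

overheads = [80, 90, 100, 110, 120, 130, 140, 150, 160]   # X, in '000 Rs.
cost = [15, 15, 16, 19, 17, 18, 16, 18, 19]               # Y, in '000 Rs.

n = len(overheads)
mean_x = sum(overheads) / n
mean_y = sum(cost) / n

# One row per observation: X, x, x², Y, y, y², xy
rows = []
for X, Y in zip(overheads, cost):
    x, y = X - mean_x, Y - mean_y
    rows.append((X, x, x * x, Y, y, y * y, x * y))

sum_x2 = sum(row[2] for row in rows)
sum_y2 = sum(row[5] for row in rows)
sum_xy = sum(row[6] for row in rows)
r = sum_xy / (sqrt(sum_x2) * sqrt(sum_y2))

for row in rows:
    print(*row, sep="\t")
print(sum_x2, sum_y2, sum_xy, round(r, 3))
```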

6.4 RANK CORRELATION


Rank correlation may be defined as the correlation between the ranks assigned to individuals
in two characters. It is measured by Spearman’s Rank Correlation Coefficient (ρ). The
formula is :
\[ \rho = 1 - \frac{6\sum d_i^2}{N(N^2 - 1)} \]
where di stands for the difference between the ranks of the i-th individual in the two
characters and N stands for the number of paired observations. The value of the rank correlation
coefficient varies between -1 and +1. -1 implies complete disagreement in the order of
ranks while +1 implies complete agreement in the order of ranks. The above formula is
used when ranks are not repeated. Rank correlation is especially used in the study of
qualitative characteristics such as honesty, efficiency, beauty, performance, etc. This is the
only method that can be applied to data in which the order of the items is known and not the
actual values.
As compared with Pearson’s method of studying correlation, this method is simple to
understand and easy to apply. It is however less accurate than Pearson’s method. Moreover,
it cannot be applied to bivariate frequency distribution.

Example : 4
Calculate the coefficient of correlation from the following data by Spearman’s rank
difference method:
Price of Tea Price of Coffee Price of Tea Price of Coffee
(Rs) (Rs) (Rs) (Rs)
75 120 60 110
88 134 80 140
95 150 81 142
70 115 50 100

Solution : Calculation of Spearman’s Correlation Coefficient
Price of Tea    R1    Price of Coffee    R2    D² = (R1 - R2)²
   (Rs.)                  (Rs.)
    75           4         120            4          0
    88           7         134            5          4
    95           8         150            8          0
    70           3         115            3          0
    60           2         110            2          0
    80           5         140            6          1
    81           6         142            7          1
    50           1         100            1          0
                                                  ∑D² = 6

\[ R = 1 - \frac{6\sum D^2}{N^3 - N} = 1 - \frac{6 \times 6}{8^3 - 8} = 1 - \frac{36}{512 - 8} = 1 - 0.071 = +0.929 \]
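
The rank difference method of Example 4 can be sketched in Python as follows. The helper names `ranks` and `spearman_rho` are assumptions made for illustration; the sketch ranks the smallest value as 1, as in the solution, and deliberately rejects repeated values, since the simple formula applies only when ranks are not repeated.

```python
def ranks(values):
    """Rank 1 = smallest value; raises if any value is repeated."""
    ordered = sorted(values)
    if len(set(ordered)) != len(ordered):
        raise ValueError("repeated values need the tied-rank method")
    return [ordered.index(v) + 1 for v in values]

def spearman_rho(xs, ys):
    """Spearman's rank correlation: 1 - 6*Σd² / (N*(N² - 1))."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

tea = [75, 88, 95, 70, 60, 80, 81, 50]             # price of tea (Rs.)
coffee = [120, 134, 150, 115, 110, 140, 142, 100]  # price of coffee (Rs.)

rho = spearman_rho(tea, coffee)
print(ranks(tea), ranks(coffee), round(rho, 3))
```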


6.5 COMPUTATION OF ‘R’ FROM A CROSS CLASSIFICATION TABLE
Cross classification tables are constructed when the number of observations on the X and Y
variables is very large. In such cases, the data can be made more manageable by expressing
them in the form of a cross classification table. In these tables, the data on each of the
two variables are classified in an appropriate number of columns and rows, and the figures
indicated in the body of the table are known as cell frequencies. For example, table 13.4 gives a
cross classification of the data on consumption of electricity and the number of persons
employed in 75 companies. In the columns are given the number of persons employed, and
in the rows, consumption of electricity in units.

Computation of ‘r’ from a cross classification table
Table 13.4
Consumption of                      Persons employed (x)
electricity in      10-14   15-19   20-24   25-29   30-34   35-39   Total
units (y)
50-59                  5       7       -       -       -       -      12
60-69                 10       4       5       -       -       -      19
70-79                  -       2       6       3       -       -      11
80-89                  -       -       5       8       5       -      18
90-99                  -       -       -       3       6       2      11
100-109                -       -       -       -       3       1       4
Total                 15      13      16      14      14       3      75
The cross classification table is read as follows.
Consider the first column and the first row. Out of the 15 companies which employ between
10 and 14 persons, 5 companies consume between 50 and 59 units of electricity and 10 between
60 and 69 units. Similarly, out of the 12 companies consuming between 50 and 59 units, 5
companies employ between 10 and 14 persons and 7 between 15 and 19 persons; likewise for
the other rows and columns.
We may find the co-efficient of correlation between the number of persons employed (x)
and the electricity consumed (y) for the 75 companies using the equation
\[ r = \frac{N\sum f d_x d_y - (\sum f d_x)(\sum f d_y)}{\sqrt{[N\sum f d_x^2 - (\sum f d_x)^2]\,[N\sum f d_y^2 - (\sum f d_y)^2]}} \]

where
dx = (x - a) / Cx (a is the assumed mean and Cx the size of the class interval for the x
series), dy = (y - b) / Cy (b is the assumed mean and Cy the size of the class interval for the y
series), and x and y are the mid points of the various classes in the x and y series, respectively.

Computation procedure : The use of the equation consists of the following steps in the order
listed :

1) Find the mid points, x and y, for each class of both the x and y series.
2) Decide the assumed means, a and b, for the x and y series respectively.
3) Obtain dx and dy.
4) Multiply each dx and dy by the corresponding column / row total frequency f to get
fdx and fdy, and find the sums ∑fdx and ∑fdy.

5) Take the square of each dx and dy to get dx² and dy², then multiply each dx² and dy² by
the corresponding class frequency f to get fdx² and fdy², and obtain the sums ∑fdx² and
∑fdy².

6) Obtain the product of each dx and dy, multiply it by the frequency indicated in the
appropriate cell, and write the result in a square in the left hand corner of the concerned cell.
Add these product values over all rows and columns to get fdxdy, and obtain the sum
∑fdxdy. This sum added over all columns should be the same as the one added over all
rows.
All the values so computed may be substituted in the equation to obtain ‘r’. For the data of
table 13.4, the six steps give the following values :
∑fdxdy = 145, ∑fdx = 8, ∑fdy² = 165, ∑fdy = 9, and ∑fdx² = 170

With these values and N = 75, we have

\[ r = \frac{(75)(145) - (8)(9)}{\sqrt{[(75)(170) - (8)^2]\,[(75)(165) - (9)^2]}} \]

Therefore r = 0.87
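
The six steps can be traced in code for table 13.4. The sketch below rests on stated assumptions: the assumed means are taken as a = 22 (mid point of 20-24) and b = 74.5 (mid point of 70-79), the cells are placed in the columns implied by the row and column totals, and the cell for 80-89 units and 20-24 persons is taken as 5, the value that agrees with those totals. All variable names are illustrative.

```python
from math import sqrt

x_mid = [12, 17, 22, 27, 32, 37]                  # persons employed (class mid points)
y_mid = [54.5, 64.5, 74.5, 84.5, 94.5, 104.5]     # electricity units (class mid points)

# freq[i][j]: companies in y class i (row) and x class j (column)
freq = [
    [5, 7, 0, 0, 0, 0],
    [10, 4, 5, 0, 0, 0],
    [0, 2, 6, 3, 0, 0],
    [0, 0, 5, 8, 5, 0],
    [0, 0, 0, 3, 6, 2],
    [0, 0, 0, 0, 3, 1],
]

a, cx = 22, 5       # assumed mean and class width for x (assumption)
b, cy = 74.5, 10    # assumed mean and class width for y (assumption)
dx = [(x - a) / cx for x in x_mid]
dy = [(y - b) / cy for y in y_mid]

col_totals = [sum(row[j] for row in freq) for j in range(len(x_mid))]
row_totals = [sum(row) for row in freq]
N = sum(row_totals)

sum_fdx = sum(f * d for f, d in zip(col_totals, dx))
sum_fdy = sum(f * d for f, d in zip(row_totals, dy))
sum_fdx2 = sum(f * d * d for f, d in zip(col_totals, dx))
sum_fdy2 = sum(f * d * d for f, d in zip(row_totals, dy))
sum_fdxdy = sum(freq[i][j] * dx[j] * dy[i]
                for i in range(len(y_mid)) for j in range(len(x_mid)))

r = (N * sum_fdxdy - sum_fdx * sum_fdy) / sqrt(
    (N * sum_fdx2 - sum_fdx ** 2) * (N * sum_fdy2 - sum_fdy ** 2))
print(N, sum_fdx, sum_fdy, sum_fdx2, sum_fdy2, sum_fdxdy, round(r, 2))
```

Under these assumptions the sums agree with the values quoted in the text, and r comes to about 0.865, which rounds to 0.87.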

6.6 SOLVED PROBLEMS


1. Sixteen industries of some state have been ranked as follows, according to profits earned
in 2002-2003 and the working capital for that year :

Industry                  A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P
Rank in profit            1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
Rank in working capital  13  16  14  15  10  12   4  11   5   9   8   3   1   6   7   2

Calculate the rank correlation co-efficient.
Solution : Since only ranks are given, we have to calculate Spearman’s rank correlation
co-efficient (ρ), which is given by the formula
\[ \rho = 1 - \frac{6\sum d^2}{N(N^2 - 1)} \]
where d denotes the difference between the ranks of the same industry and N denotes the number
of pairs of observations.

Calculations for Rank correlation co-efficient

Industry   Rank in profit (x)   Rank in working capital (y)   d = x-y    d²
   A               1                       13                   -12     144
   B               2                       16                   -14     196
   C               3                       14                   -11     121
   D               4                       15                   -11     121
   E               5                       10                    -5      25
   F               6                       12                    -6      36
   G               7                        4                     3       9
   H               8                       11                    -3       9
   I               9                        5                     4      16
   J              10                        9                     1       1
   K              11                        8                     3       9
   L              12                        3                     9      81
   M              13                        1                    12     144
   N              14                        6                     8      64
   O              15                        7                     8      64
   P              16                        2                    14     196
 Total             -                        -                     0    1236

\[ \rho = 1 - \frac{6(1236)}{16(16^2 - 1)} = 1 - \frac{7416}{16(256 - 1)} = 1 - \frac{7416}{4080} = 1 - 1.82 \]

ρ = -0.82

2. Ten competitors in a debate contest are ranked by three judges in the following order :

1st Judge 1 6 5 10 3 2 4 9 7 8
2nd Judge 3 5 8 4 7 10 2 1 6 9
3rd Judge 6 4 9 8 1 2 3 10 5 7

Use the rank correlation co-efficient to determine, which pair of judges has the nearest
approach to common taste in debate.
Solution : Calculations for Rank correlation co-efficient

      Ranks by              d12        d13        d23       d12²   d13²   d23²
First   Second   Third    = R1-R2    = R1-R3    = R2-R3
Judge   Judge    Judge
 R1      R2       R3
  1       3        6        -2         -5         -3          4     25      9
  6       5        4        +1         +2         +1          1      4      1
  5       8        9        -3         -4         -1          9     16      1
 10       4        8        +6         +2         -4         36      4     16
  3       7        1        -4         +2         +6         16      4     36
  2      10        2        -8          0         +8         64      0     64
  4       2        3        +2         +1         -1          4      1      1
  9       1       10        +8         -1         -9         64      1     81
  7       6        5        +1         +2         +1          1      4      1
  8       9        7        -1         +1         +2          1      1      4
                                                     ∑d12² = 200   ∑d13² = 60   ∑d23² = 214

\[ \rho_{12} = 1 - \frac{6\sum d_{12}^2}{N(N^2 - 1)} = 1 - \frac{6 \times 200}{10(10^2 - 1)} = 1 - \frac{1200}{990} = 1 - 1.21 = -0.21 \]
\[ \rho_{13} = 1 - \frac{6\sum d_{13}^2}{N(N^2 - 1)} = 1 - \frac{6 \times 60}{10(10^2 - 1)} = 1 - \frac{360}{990} = 1 - 0.36 = 0.64 \]
\[ \rho_{23} = 1 - \frac{6\sum d_{23}^2}{N(N^2 - 1)} = 1 - \frac{6 \times 214}{10(10^2 - 1)} = 1 - \frac{1284}{990} = 1 - 1.297 = -0.297 \]

Since ρ13 is the maximum, we conclude that the pair of judges I and III has the nearest
approach to common tastes in debates.
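
The comparison of the three pairs of judges lends itself to a short loop. This is an illustrative sketch; the function name `rho_from_ranks` and the dictionary layout are assumptions, and the ranks are those given in the problem.

```python
from itertools import combinations

def rho_from_ranks(r1, r2):
    """Spearman's coefficient for two lists of untied ranks."""
    n = len(r1)
    d_squared = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

judges = {
    "I":   [1, 6, 5, 10, 3, 2, 4, 9, 7, 8],
    "II":  [3, 5, 8, 4, 7, 10, 2, 1, 6, 9],
    "III": [6, 4, 9, 8, 1, 2, 3, 10, 5, 7],
}

# Coefficient for every pair of judges
results = {(p, q): rho_from_ranks(judges[p], judges[q])
           for p, q in combinations(judges, 2)}
closest_pair = max(results, key=results.get)

for pair, rho in results.items():
    print(pair, round(rho, 2))
print("nearest approach to common taste:", closest_pair)
```

The pair with the largest coefficient is the one whose rankings agree most closely.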

3. Quotations of index numbers of equity share prices of a certain joint stock company and
of prices of preference shares are given below :

Years               1971   1972   1973   1974   1975   1976   1977
Equity shares       97.5   99.4   98.6   95.2   95.1   98.4   97.1
Preference shares   75.1   75.9   77.1   78.2   79.0   74.8   76.2

Use the method of rank correlation to determine the relationship between equity shares and
preference share prices.
Solution : Calculations for Rank correlation co-efficient.
Year     Equity shares        Preference shares      d = R1-R2    d²
          X     Rank (R1)      Y     Rank (R2)
1971     97.5      4          75.1      2                2         4
1972     99.4      7          75.9      3                4        16
1973     98.6      6          77.1      5                1         1
1974     95.2      2          78.2      6               -4        16
1975     95.1      1          79.0      7               -6        36
1976     98.4      5          74.8      1                4        16
1977     97.1      3          76.2      4               -1         1
Total                                                    0     ∑d² = 90

\[ \rho = 1 - \frac{6\sum d^2}{N(N^2 - 1)} = 1 - \frac{6 \times 90}{7(7^2 - 1)} = 1 - \frac{540}{7(49 - 1)} = 1 - \frac{540}{336} = 1 - 1.607 \]
Therefore ρ = -0.607

4. Ten students were ranked on the basis of two attributes : Beauty (x) and intelligence (y).
The co – efficient of rank correlation between x and y was found to be 0.5. It was later
discovered that the difference in ranks in the two attributes obtained by one of the students
was wrongly taken as 3 instead of 7. Find the correct co-efficient of correlation.
Solution : Given N = 10 and ρ = 0.5, we have

\[ \rho = 1 - \frac{6\sum d^2}{N(N^2 - 1)} \]
\[ 0.5 = 1 - \frac{6\sum d^2}{10(10^2 - 1)} = 1 - \frac{6\sum d^2}{10(99)} = 1 - \frac{6\sum d^2}{990} \]

Therefore
\[ \frac{6\sum d^2}{990} = 1 - 0.5 = 0.5 \]
6∑d² = 0.5 x 990 = 495
∑d² = 495 / 6 = 82.5
Since one difference was wrongly taken as 3 instead of 7, the correct value of ∑d² is given by :
Corrected ∑d² = 82.5 – 3² + 7²
= 82.5 – 9 + 49
= 122.5
Therefore the corrected co-efficient of correlation is

\[ \rho = 1 - \frac{6(122.5)}{10(10^2 - 1)} = 1 - \frac{735}{990} = 1 - 0.742 \]
Therefore ρ = 0.258
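
The correction procedure used in this problem (recover ∑d² from the reported coefficient, replace the wrong squared difference by the right one, and recompute) can be sketched as a pair of small functions. The names `sum_d2_from_rho` and `corrected_rho` are assumptions introduced for illustration.

```python
def sum_d2_from_rho(rho, n):
    """Invert rho = 1 - 6*Σd² / (n*(n² - 1)) to recover Σd²."""
    return (1 - rho) * n * (n ** 2 - 1) / 6

def corrected_rho(rho, n, wrong_d, right_d):
    """Recompute rho after replacing one wrongly recorded rank difference."""
    sum_d2 = sum_d2_from_rho(rho, n) - wrong_d ** 2 + right_d ** 2
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

original_sum_d2 = sum_d2_from_rho(0.5, 10)
new_rho = corrected_rho(0.5, 10, wrong_d=3, right_d=7)
print(original_sum_d2, round(new_rho, 3))
```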

5. A psychologist wanted to compare two methods, A and B, of teaching. He selected a
random sample of 22 students. He grouped them into 11 pairs, so that the students in a
pair had approximately equal scores on an intelligence test. In each pair one student was
taught by method A and the other by method B, and both were examined after the course.
The marks obtained by them are tabulated below :
Pair 1 2 3 4 5 6 7 8 9 10 11
A 48 58 38 28 60 38 54 60 40 56 22
B 74 70 32 52 46 54 38 40 32 22 42

Find the rank correlation co efficient.


Solution : Let the variable x denote the scores of students taught by method A and y denote
the scores of students taught by method B. Ranks are assigned in descending order of score.

In the x-series, we see that the value 60 occurs twice. The common rank assigned to each of
these values is (1 + 2) / 2 = 1.5. The other repeated values are treated in the same way.

  x     y    Rank of x   Rank of y   d = Rank of x – Rank of y     d²
 48    74       6           1                  5                  25.00
 58    70       3           2                  1                   1.00
 38    32       8.5         9.5               -1                   1.00
 28    52      10           4                  6                  36.00
 60    46       1.5         5                 -3.5                12.25
 38    54       8.5         3                  5.5                30.25
 54    38       5           8                 -3                   9.00
 60    40       1.5         7                 -5.5                30.25
 40    32       7           9.5               -2.5                 6.25
 56    22       4          11                 -7                  49.00
 22    42      11           6                  5                  25.00
                                             ∑d = 0            ∑d² = 225

Here, we see that in the x-series the items 38 and 60 are repeated, each occurring
twice, and in the y-series the item 32 is repeated. Thus, in each of the three cases m = 2.
Hence, on applying the correction factor m(m² − 1)/12 for each repeated item, we get

\[ \rho = 1 - \frac{6\left[\sum d^2 + \frac{2(2^2 - 1)}{12} + \frac{2(2^2 - 1)}{12} + \frac{2(2^2 - 1)}{12}\right]}{11(11^2 - 1)} \]
\[ = 1 - \frac{6 \times 226.5}{11 \times 120} = 1 - 1.0295 \]
Therefore ρ = -0.0295
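
The tied-rank procedure of this problem can be sketched as follows. The helper names are assumptions; the sketch ranks the largest score as 1, gives tied scores the average of the positions they occupy, and adds the correction factor m(m² − 1)/12 for every value repeated m times in either series.

```python
from collections import Counter

def average_ranks(values):
    """Rank 1 = largest value; tied values share the mean of their positions."""
    positions = {}
    for pos, v in enumerate(sorted(values, reverse=True), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def tie_correction(values):
    """Sum of m*(m² - 1)/12 over every value repeated m times."""
    return sum(m * (m * m - 1) / 12 for m in Counter(values).values() if m > 1)

def spearman_rho_ties(xs, ys):
    n = len(xs)
    rx, ry = average_ranks(xs), average_ranks(ys)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    adjusted = d_squared + tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * adjusted / (n * (n * n - 1))

method_a = [48, 58, 38, 28, 60, 38, 54, 60, 40, 56, 22]   # x
method_b = [74, 70, 32, 52, 46, 54, 38, 40, 32, 22, 42]   # y

rho = spearman_rho_ties(method_a, method_b)
print(average_ranks(method_a), average_ranks(method_b), round(rho, 4))
```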

6.7 SUMMARY
In this unit the concept of correlation or the association between two variables has been
discussed. Correlation analysis helps us to determine the strength of the linear relationship
between the two variables. Correlation analysis is used as a starting point for selecting
useful independent variables for regression analysis.
A scatter plot of the variables may suggest that the two variables are related but the value
of the Pearson Correlation coefficient r quantifies this association. The correlation
coefficient r may assume values between -1 and +1. The sign indicates whether the association
is direct (+ve) or inverse (-ve). A numerical value of r equal to unity indicates perfect
association while a value of zero indicates no association.

6.8 KEYWORDS
Karl Pearson’s Coefficient of Correlation : A mathematical method for measuring the
intensity or magnitude of the linear relationship between two variable series, suggested
by Karl Pearson (1857-1936).

Scatter Diagram : A scatter diagram is one of the simplest ways of diagrammatic
representation of a bivariate distribution.

Rank Correlation : Charles Edward Spearman, a British psychologist, developed a formula
in 1904 which consists in obtaining the correlation coefficient between the ranks of n
individuals in the two attributes under study.

6.9 SELF ASSESSMENT QUESTIONS
1. What do you mean by the term Coefficient of correlation?
2. Distinguish between Coefficient of correlation and coefficient of variation.
3. What is a scatter diagram? How does it help in studying the correlation between two
variables, in respect of both their nature and extent?
4. Define Karl Pearson’s coefficient of correlation. What is it intended to measure?
5. What are the advantages of Spearman’s rank correlation coefficient over Karl Pearson’s
correlation coefficient?
6. Draw a correlation graph from the following data.

Period Jan. Feb. Mar. April May June


Variable 1 15 18 22 20 25 20
Variable 2 30 35 43 41 51 40

7. Find Karl Pearson’s coefficient of correlation from the following data:

Wages 100 101 102 102 100 99 97 98 96 95


Cost of living 98 99 99 97 95 92 95 94 90 91

8. Ten Competitors in a beauty contest are ranked by three judges in the following order:
I Judge 1 5 4 8 9 6 10 7 3 2
II Judge 4 8 7 6 5 9 10 3 2 1
III Judge 6 7 8 1 5 10 9 2 3 4

Use the rank correlation coefficient to discuss which pair of judges has the nearest
approach to common tastes in beauty.

9. Calculate the rank coefficient of correlation of the following data:


X 80 78 75 75 68 67 60 59
Y 12 13 14 14 14 16 15 17

10. A random sample of 5 college students is selected and their grades in Kannada and
English are found to be:

1 2 3 4 5
Kannada 85 60 73 40 90
English 93 75 65 50 80

Calculate Spearman’s rank correlation coefficient.


11. Calculate the coefficient of correlation between the age of 100 husbands and wives
from the following data:
Age of husbands Age of wives in years
in years 10-20 20-30 30-40 40-50 50-60 Total
15-25 6 3 - - - 9
25-35 3 16 10 - - 29
35-45 - 10 15 7 - 32
45-55 - - 7 10 4 21
55-65 - - - 4 5 9
9 29 32 21 9 100

12. Calculate the Karl Pearson’s Coefficient of correlation between age and playing habits
from the data given below. Also calculate probable error and comment on the value:
Age 20 21 22 23 24 25
No. of students 500 400 300 240 200 160
Regular players 400 300 180 96 60 24

6.10 REFERENCES
1. Gupta S.P., Business Statistics, S Chand and Sons Publishers, Delhi, 2017
2. Quantitative Techniques for Business Decisions, Chetana Book House, Mysore, 2015
3. Vignesh Prajapathi, Big data Analysis With R and Hadoop, Packt Publishing, 2016
4. Sharma S.D., Operation Research, Discovery Publishing House, Delhi, 2016
5. Srinath L.S., PERT and CPM, East West Press, Delhi, 2002
6. Kalavathy, Operation Research, Vikas Publishing House, Delhi, 2008

UNIT 7 : REGRESSION

STRUCTURE

7.0 Objectives

7.1 Introduction
7.2 Concept and Definition of Regression
7.3 Distinction between Correlation and Regression
7.4 Regression Analysis
7.5 Advantages of Regression Analysis
7.6 Types of Regression Analysis

7.7 Solved Problems on Regression


7.8 Summary
7.9 Key Words
7.10 Self Assessment Questions
7.11 References

7.0 OBJECTIVES
After studying this unit you should be able to :
∗ Define the concept of regression;
∗ Distinguish between Correlation and Regression;
∗ Explain regression analysis;
∗ Differentiate between types of Regression Analysis and
∗ Solve the different problems on Regression.

7.1 INTRODUCTION
Regression analysis is a very powerful tool in the field of statistical analysis in predicting
the value of one variable on the basis of the given value of another variable, when these two
variables are related to each other. Examples of regression problems can be found in the
study of the yields of crops grown with different amounts of fertilizer, the length of life of
certain animals exposed to different amounts of radiation, the hardness of plastics which are
heat-treated for different periods of time and so on. In these problems the variation in one
measurement is studied for particular levels of the other variable selected by the
experimenter.

7.2 CONCEPT AND DEFINITION OF REGRESSION


‘Regression’ means returning or stepping back to the average value. With the help of
the values of one (independent) variable we can establish the most likely values of the other
(dependent) variable. On the basis of two correlated variables, we can forecast future
data, events or values.
In statistics, the term ‘Regression’ simply means the “Average Relationship”. We
can predict or estimate the values of the dependent variable from the given related values of
the independent variable with the help of a regression technique. The measure of regression
studies the nature of the co-relationship to estimate the most probable values. It establishes a
functional relationship between the ‘independent’ and ‘dependent’ variables.
‘Regression’ succeeds ‘correlation’. Once the co-relationship between the
two variables is established, regression analysis proceeds with the estimation of probable
values. Sir Francis Galton, a British biometrician, introduced the concept of ‘regression’ for
the first time in 1877, while studying the correlation between the heights of sons and their
fathers. He concluded in his studies, “Tall fathers tend to have tall sons and short fathers
short sons. The average height of the sons of a group of tall fathers is less than that of the
fathers, while the average height of the sons of a group of short fathers is greater than that
of the fathers”. This means that the coming generations of tall or short parents tend to step
back to the average height of the population. The line showing this tendency was called by
Galton a ‘Regression Line’.

Nowadays, modern statisticians prefer to use the term ‘Regression’ in the sense of
‘Estimation’, which is an important statistical tool in economics and business. Estimation or
prediction of economic activities is essential in planning. Estimating the relationship
among economic variables constitutes the essence of modern business management.
That is why modern statisticians use the term ‘estimating line’ instead of ‘regression line’.
Today, the term ‘Regression’ is used in a much broader sense to imply
‘functional relationships’. It means the estimation or prediction of the unknown value of
one variable from the known value of the other variable. The closer the relationship between
the two variables, the greater the confidence that may be placed in the estimates.

7.3 DISTINCTION BETWEEN CORRELATION AND REGRESSION


Both correlation and regression analysis help us in studying the relationship between
two variables, yet they differ in their approach and objectives.

     Correlation                                   Regression

1.   It precedes regression.                       It succeeds correlation.
2.   It tests the closeness between the two        It studies the closeness between the two
     variables.                                    variables and estimates the values.
3.   It measures the degree of co-variation.       It measures the nature of co-variation.
4.   It is merely a tool for ascertaining the      It is also a tool for studying the cause
     degree of relationship.                       and effect relationship.
5.   The relationship may be purely a matter       There is a perfect relationship and it
     of chance and it may not have practical       has practical relevance.
     relevance.
6.   There is no question of independent and       There is an identification of independent
     dependent variables.                          and dependent variables.
7.   It is a two-way average of relationship.      It is a directional relationship with
                                                   cause and effect.
8.   It establishes just a relationship.           It studies the functional relationship
                                                   with the two equations of lines.

Both the techniques are based on different sets of assumptions. In practice, the choice
between the two techniques depends upon the purpose of investigation. The presence of co-
relationship does not imply causation, but the presence of causation certainly implies co-
relationship. The association (correlation) need not imply causation (regression) because a
close association may be the result of pure chance. The causation (regression) definitely
implies association (correlation), because cause and effect are based on relationship.

7.4 REGRESSION ANALYSIS


Regression analysis refers to the methods by which estimates are made of the values of
a dependent variable from the values of an independent variable. It is a technique of predicting
the unknown values on the basis of the ‘average relationship’. It studies the pattern of
relationship and the closeness of the relationship in ‘absolute terms’. Thus it is a statistical
tool to deal with the formulation of mathematical models depicting relationship between
the two variables.
In the words of M. M Blair, ‘Regression analysis is a mathematical measure of the average
relationship between two or more variables in terms of the original units of data’.
In regression analysis we deal with two types of variables, independent and dependent.
i. Independent Variable : It is the variable whose value influences the values of the other
(dependent) variable. It is called the ‘Regressor’, ‘Predictor’ or ‘Explanator’. The known
variable is called the independent variable.
ii. Dependent Variable : It is the variable whose values are influenced by the values of the
other (independent) variable. It is called the ‘Regressed’, ‘Predicted’ or ‘Explained’
variable. The variable we are trying to predict is the dependent variable.
In regression, we can have only one dependent variable in our estimating equation.
However, we can use more than one independent variable. Often when we add independent
variables, we improve the accuracy of our prediction.

7.5 ADVANTAGES OF REGRESSION ANALYSIS


The following are some important advantages of regression analysis :
1. Regression analysis helps in developing a regression equation by which the value of a
dependent variable can be estimated given a value of an independent variable.
2. Regression analysis helps to determine the standard error of estimate, which measures the
variability or spread of values of a dependent variable with respect to the regression line.
The smaller the variance and error of estimate, the closer the pairs of values (x, y) fall
about the regression line and the better the line fits the data; that is, a good estimate can
be made of the value of variable y. When all the points fall on the line, the standard error
of estimate equals zero.
3. When the sample size is large (df ≥ 29), the interval estimation for predicting the value of
a dependent variable based on the standard error of estimate is considered to be acceptable
by changing the values of either x or y. The magnitude of r² remains the same regardless
of the values of the two variables.
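
The behaviour of the standard error of estimate described in point 2 can be illustrated with a minimal Python sketch. It assumes the usual definition S_yx = sqrt(Σ(y - ŷ)² / (n - 2)), which this unit does not state explicitly, and it uses two small hypothetical data sets chosen only for illustration.

```python
import math


def fit_line(x, y):
    """Least-squares line y = a + b*x computed from the raw sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b


def standard_error_of_estimate(x, y):
    """Assumed definition: sqrt(sum of squared residuals / (n - 2))."""
    a, b = fit_line(x, y)
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (len(x) - 2))


x = [1, 2, 3, 4, 5]
on_line = [3, 5, 7, 9, 11]     # hypothetical: every point lies exactly on y = 1 + 2x
scattered = [3, 6, 6, 10, 11]  # hypothetical: same trend with some scatter

print(standard_error_of_estimate(x, on_line))              # 0.0
print(round(standard_error_of_estimate(x, scattered), 3))  # 0.966
```

The first data set lies exactly on a line, so the standard error is zero; the scatter in the second set produces a positive value.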

7.6 TYPES OF REGRESSION MODELS


The primary objective of regression analysis is the development of a regression model
to explain the association between two or more variables in the given population. A regression
model is the mathematical equation that provides prediction of value of dependent variable
based on the known values of one or more independent variables.
The particular form of regression model depends upon the nature of the problem under
study and the type of data available. However, each type of association or relationship can
be described by an equation relating a dependent variable to one or more independent
variables.
1. Simple and Multiple Regression Models
2. Linear and Nonlinear regression models
3. Time series and cross-sectional regression models

1. Simple and Multiple regression Models : A regression model that includes only one
independent variable is called a simple regression model. If it includes more than one
independent variable, it is called a multiple regression model.

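2. Linear and Nonlinear Regression Models : Linear regression models assume that
the relationship between the independent and dependent variables is linear. Nonlinear
regression models are used when this relationship cannot be described by a straight line.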

3. Time Series and Cross section Regression Models : When a regression model is
estimated on the basis of time series data ( historic data), it is a time series regression
model. A cross sectional model is estimated on the basis of data at a given point in time
across the spectrum of a population or of a region.

1. The following table gives the age of cars of a certain make and their annual maintenance
costs. Obtain the regression equation for costs related to age:
Age of cars (in years)                    : 2    4    6    8
Maintenance costs (in hundreds of Rs.)    : 10   20   25   30

Solution: Let the variable x denote the age of cars and y the maintenance costs of cars.

Calculations for the regression line

           x      x²      y      xy
           2       4     10      20
           4      16     20      80
           6      36     25     150
           8      64     30     240
Total     20     120     85     490

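\[ \bar{x} = \frac{20}{4} = 5, \qquad \bar{y} = \frac{85}{4} = 21.25 \]

\[ b_{yx} = \frac{N\sum xy - (\sum x)(\sum y)}{N\sum x^2 - (\sum x)^2} = \frac{4 \times 490 - 20 \times 85}{4 \times 120 - (20)^2} = \frac{260}{80} = 3.25 \]

The equation of the line of regression of y on x is given by

\[ Y - \bar{y} = b_{yx}(x - \bar{x}) \]

Y - 21.25 = 3.25 (x - 5)
Y = 3.25 x + 5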
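
As a check on the arithmetic, the same slope and intercept can be computed in Python directly from the four observations. This is a minimal sketch; the function name `regression_y_on_x` is illustrative only.

```python
def regression_y_on_x(x, y):
    """Return (byx, intercept) of the least-squares line of y on x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    byx = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = sum_y / n - byx * sum_x / n  # y-bar minus byx times x-bar
    return byx, intercept


age = [2, 4, 6, 8]        # age of cars in years
cost = [10, 20, 25, 30]   # maintenance cost in hundreds of Rs.
byx, intercept = regression_y_on_x(age, cost)
print(byx, intercept)     # 3.25 5.0
```

The output corresponds to the line Y = 3.25x + 5: each additional year of age adds about Rs. 325 to the annual maintenance cost.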

2. Compute the appropriate regression equation for the following data:
Solution: Since y is the dependent variable, the appropriate regression line is that
of y on x.
Calculation of regression line

           x      y      x²      xy
           2     18       4      36
           4     12      16      48
           5     10      25      50
           6      8      36      48
           8      7      64      56
          11      5     121      55
Total     36     60     266     293

\[ \bar{x} = \frac{\sum x}{N} = \frac{36}{6} = 6, \qquad \bar{y} = \frac{\sum y}{N} = \frac{60}{6} = 10 \]

\[ b_{yx} = \frac{N\sum xy - (\sum x)(\sum y)}{N\sum x^2 - (\sum x)^2} = \frac{6 \times 293 - 36 \times 60}{6 \times 266 - (36)^2} = \frac{-402}{300} = -1.34 \]

The regression equation of y on x is:

\[ Y - \bar{y} = b_{yx}(x - \bar{x}) \]

Y - 10 = -1.34 (x - 6)
Therefore Y = -1.34 x + 18.04

3. Obtain the equations of the two lines of regression for the data given below :

X: 1 2 3 4 5 6 7 8 9
Y: 9 8 10 12 11 13 14 16 15

Solution: Calculation for regression lines

   x    (x - x̄)   (x - x̄)²     y    (y - ȳ)   (y - ȳ)²   (x - x̄)(y - ȳ)
   1      -4         16        9      -3         9           12
   2      -3          9        8      -4        16           12
   3      -2          4       10      -2         4            4
   4      -1          1       12       0         0            0
   5       0          0       11      -1         1            0
   6       1          1       13       1         1            1
   7       2          4       14       2         4            4
   8       3          9       16       4        16           12
   9       4         16       15       3         9           12
Total 45   0         60      108       0        60           57

\[ \bar{x} = \frac{\sum x}{N} = \frac{45}{9} = 5, \qquad \bar{y} = \frac{\sum y}{N} = \frac{108}{9} = 12 \]

Regression coefficients

x on y:
\[ b_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (y - \bar{y})^2} = \frac{57}{60} = 0.95 \]

y on x:
\[ b_{yx} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{57}{60} = 0.95 \]

Equations of the lines of regression

x on y:
X - x̄ = bxy (y - ȳ)
X - 5 = 0.95 (y - 12)
X = 0.95 y - 6.4

y on x:
Y - ȳ = byx (x - x̄)
Y - 12 = 0.95 (x - 5)
Y = 0.95 x + 7.25

4. From the data given below, find:
a. the two regression equations;
b. the coefficient of correlation between the marks in economics and statistics;
c. the most likely marks in statistics when marks in economics are 30.
Marks in economics  : 25 28 35 32 31 36 29 38 34 32
Marks in statistics : 43 46 49 41 36 32 31 30 33 39

Solution:
a) Let us denote the marks in economics by the variable x and the marks in statistics by the
variable y.
Calculation for regression equations

    x     (x - x̄)   (x - x̄)²     y     (y - ȳ)   (y - ȳ)²   (x - x̄)(y - ȳ)
   (1)      (2)       (3)       (4)      (5)       (6)           (7)
    25      -7         49       43       +5         25           -35
    28      -4         16       46       +8         64           -32
    35      +3          9       49      +11        121           +33
    32       0          0       41       +3          9             0
    31      -1          1       36       -2          4            +2
    36      +4         16       32       -6         36           -24
    29      -3          9       31       -7         49           +21
    38      +6         36       30       -8         64           -48
    34      +2          4       33       -5         25           -10
    32       0          0       39       +1          1             0
Total 320    0        140      380        0        398           -93

\[ \bar{x} = \frac{\sum x}{N} = \frac{320}{10} = 32, \qquad \bar{y} = \frac{\sum y}{N} = \frac{380}{10} = 38 \]

Regression coefficients

x on y:
\[ b_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (y - \bar{y})^2} = \frac{-93}{398} = -0.234 \]

y on x:
\[ b_{yx} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{-93}{140} = -0.664 \]

Equations of the lines of regression

x on y:
X - x̄ = bxy (y - ȳ)
X - 32 = -0.234 (y - 38)
X = -0.234 y + 40.892

y on x:
Y - ȳ = byx (x - x̄)
Y - 38 = -0.664 (x - 32)
Y = -0.664 x + 59.248

b) We have

\[ r^2 = b_{yx} \cdot b_{xy} = (-0.664) \times (-0.234), \qquad r = \pm\sqrt{0.234 \times 0.664} = \pm 0.394 \]

Since both the regression coefficients are negative, r must be negative; hence r = -0.394.

c) When x = 30,
Y = -0.664 x 30 + 59.248 = 39.328, or approximately 39.

Hence when the marks in economics are 30, the most likely marks in statistics are 39.
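
The three parts of this example can be reproduced with a short Python sketch that works from deviations about the means. The marks are taken as tabulated in the solution; the helper name `deviation_sums` is illustrative. The sign of r is copied from the regression coefficients, since both always share the sign of the sum of cross-products.

```python
import math


def deviation_sums(x, y):
    """Means and sums of products of deviations taken from the actual means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return mean_x, mean_y, s_xy, s_xx, s_yy


economics = [25, 28, 35, 32, 31, 36, 29, 38, 34, 32]
statistics_marks = [43, 46, 49, 41, 36, 32, 31, 30, 33, 39]

mean_x, mean_y, s_xy, s_xx, s_yy = deviation_sums(economics, statistics_marks)
byx = s_xy / s_xx  # regression coefficient of y on x
bxy = s_xy / s_yy  # regression coefficient of x on y
# r is the geometric mean of the two coefficients and carries their common sign.
r = math.copysign(math.sqrt(byx * bxy), byx)
predicted = mean_y + byx * (30 - mean_x)  # most likely statistics marks when x = 30

print(round(byx, 3), round(bxy, 3), round(r, 3), round(predicted))  # -0.664 -0.234 -0.394 39
```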

5. The quantity of a raw material purchased by ABC Ltd. at the specified prices during
the 12 months of 1982 is given below:

Month       Price/kg   Quantity      Month        Price/kg   Quantity
            (in Rs.)   (in kg)                    (in Rs.)   (in kg)
January        96        250         July           112        220
February      110        200         August         112        220
March         100        250         September      108        200
April          90        280         October        116        210
May            86        300         November        86        300
June           92        300         December        92        250

a. Find the regression equations based on the above data.
b. Can you estimate the approximate quantity likely to be purchased if the price shoots
up to Rs. 124/kg?
c. Hence or otherwise obtain the coefficient of correlation between the price prevailing
and the quantity demanded.

Solution: Calculation for regression equations

Price (Rs.) x   dx = x - 100    dx²    Quantity y   dy = y - 248     dy²      dx·dy
     96             -4           16       250            2             4        -8
    110             10          100       200          -48          2304      -480
    100              0            0       250            2             4         0
     90            -10          100       280           32          1024      -320
     86            -14          196       300           52          2704      -728
     92             -8           64       300           52          2704      -416
    112             12          144       220          -28           784      -336
    112             12          144       220          -28           784      -336
    108              8           64       200          -48          2304      -384
    116             16          256       210          -38          1444      -608
     86            -14          196       300           52          2704      -728
     92             -8           64       250            2             4       -16
Total 1200           0         1344      2980            4         16768     -4360

\[ \bar{x} = \frac{1200}{12} = 100, \qquad \bar{y} = \frac{2980}{12} = 248.33 \]
a) Regression coefficients:

x on y:
\[ b_{xy} = \frac{\sum dx\,dy - \frac{(\sum dx)(\sum dy)}{N}}{\sum dy^2 - \frac{(\sum dy)^2}{N}} = \frac{-4360 - 0}{16768 - \frac{4^2}{12}} = -0.26 \]

y on x:
\[ b_{yx} = \frac{\sum dx\,dy - \frac{(\sum dx)(\sum dy)}{N}}{\sum dx^2 - \frac{(\sum dx)^2}{N}} = \frac{-4360 - 0}{1344 - 0} = -3.244 \]

Equations of the lines of regression:

x on y:
X - x̄ = bxy (y - ȳ)
x - 100 = -0.26 (y - 248.33)
Therefore x = -0.26 y + 164.57

y on x:
Y - ȳ = byx (x - x̄)
y - 248.33 = -3.244 (x - 100)
Therefore y = -3.244 x + 572.73

b) For x = 124, y = -3.244 x 124 + 572.73 = 170.474. Thus an estimated 170.5 kg would
be bought at the price of Rs. 124.

c) \[ r = -\sqrt{(-0.26) \times (-3.244)} = -0.92 \]
The negative sign is taken for r because both regression coefficients are negative.

6. In trying to evaluate the efficiency of its advertising campaign, a firm compiled
the following information.

Year                            : 1980  1981  1982  1983  1984  1985  1986  1987
Adv. expenditure ('000 Rs.)     :  12    15    15    23    24    38    42    48
Sales (lakh Rs.)                :  5.0   5.6   5.8   7.0   7.2   8.8   9.2   9.5

Calculate the regression equation of sales on advertising expenditure. Estimate the probable
sales when advertisement expenditure is Rs. 60,000.
Calculations for the regression equation of sales on advertising expenditure

Year    Adv. expenditure   dx = x - 25    dx²    Sales (lakh   dy = (y - 7)/0.1    dy²    dx·dy
        (in '000 Rs.) x                          Rs.) y
1980         12               -13         169       5.0             -20            400     260
1981         15               -10         100       5.6             -14            196     140
1982         15               -10         100       5.8             -12            144     120
1983         23                -2           4       7.0               0              0       0
1984         24                -1           1       7.2               2              4      -2
1985         38                13         169       8.8              18            324     234
1986         42                17         289       9.2              22            484     374
1987         48                23         529       9.5              25            625     575
Total                          17        1361                        21           2177    1701

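The line of regression of sales (y) on advertising expenditure (x) is Y - ȳ = byx (x - x̄),
where Ax = 25 and Ay = 7 are the assumed means and cx = 1 and cy = 0.1 are the scale factors.

\[ \bar{x} = A_x + \frac{\sum dx}{N} \times c_x = 25 + \frac{17}{8} \times 1 = 27.125 \]

\[ \bar{y} = A_y + \frac{\sum dy}{N} \times c_y = 7 + \frac{21}{8} \times 0.1 = 7.26 \]

\[ b_{yx} = \frac{\sum dx\,dy - \frac{(\sum dx)(\sum dy)}{N}}{\sum dx^2 - \frac{(\sum dx)^2}{N}} \times \frac{c_y}{c_x} = \frac{1701 - \frac{(17)(21)}{8}}{1361 - \frac{(17)^2}{8}} \times \frac{0.1}{1} = \frac{1701 - 44.625}{1361 - 36.125} \times \frac{0.1}{1} = 0.125 \]

Y - 7.26 = 0.125 (x - 27.125) = 0.125 x - 3.39

Y = 0.125 x + 3.87

When x = 60, y = 0.125 x 60 + 3.87 = 11.37

Hence the probable sales are 11.37 lakh rupees when the advertisement expenditure is
Rs. 60,000.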
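
The step-deviation calculation above can be sketched in Python as follows. The function name and parameter names (`ax`, `ay` for the assumed means, `cx`, `cy` for the scale factors) are illustrative; the key step is rescaling the slope by cy/cx after working with the coded deviations.

```python
def step_deviation_regression(x, y, ax, ay, cx, cy):
    """Regression of y on x using deviations dx = (x - ax)/cx and dy = (y - ay)/cy."""
    n = len(x)
    dx = [(v - ax) / cx for v in x]
    dy = [(v - ay) / cy for v in y]
    s_dx, s_dy = sum(dx), sum(dy)
    s_dxdy = sum(a * b for a, b in zip(dx, dy))
    s_dx2 = sum(a * a for a in dx)
    # Slope in coded units, converted back to original units with cy/cx.
    byx = (s_dxdy - s_dx * s_dy / n) / (s_dx2 - s_dx ** 2 / n) * (cy / cx)
    x_bar = ax + cx * s_dx / n
    y_bar = ay + cy * s_dy / n
    return byx, y_bar - byx * x_bar


adv = [12, 15, 15, 23, 24, 38, 42, 48]            # '000 Rs.
sales = [5.0, 5.6, 5.8, 7.0, 7.2, 8.8, 9.2, 9.5]  # lakh Rs.

byx, intercept = step_deviation_regression(adv, sales, ax=25, ay=7, cx=1, cy=0.1)
estimate = intercept + byx * 60  # probable sales for an expenditure of Rs. 60,000

print(round(byx, 3), round(intercept, 2), round(estimate, 2))  # 0.125 3.87 11.37
```

Because the slope and means do not depend on the choice of assumed mean or scale, any convenient values of `ax`, `ay`, `cx` and `cy` give the same line.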
7.7 SOLVED PROBLEMS ON REGRESSION
PROBLEM 1 :
A researcher wants to find out if there is any relationship between the ages of husbands
and the ages of their wives. In other words, do older husbands have older wives and younger
husbands younger wives? He took a random sample of 7 couples whose respective ages are
given below:

Age of husband (x)    Age of wife (y)
       25                   18
       27                   20
       29                   20
       32                   25
       35                   25
       37                   30
       39                   37

a) For this data compute the regression line.
b) Based upon the correlation between their ages, what would be the age of the wife if the
husband’s age is 36 years?
Solution :
a) The regression line is given by
Yc = a + bx

\[ \text{where } b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} \quad \text{and} \quad a = \bar{y} - b\bar{x} \]

Let us make a table to calculate all these values:

     x       y       x²       xy       y²
     25      18      625      450      324
     27      20      729      540      400
     29      20      841      580      400
     32      25     1024      800      625
     35      25     1225      875      625
     37      30     1369     1110      900
     39      37     1521     1443     1369
Σx = 224   Σy = 175   Σx² = 7334   Σxy = 5798   Σy² = 4643

\[ b = \frac{7(5798) - (224 \times 175)}{7(7334) - (224 \times 224)} = \frac{40586 - 39200}{51338 - 50176} = \frac{1386}{1162} = 1.193 \]

\[ a = \frac{175}{7} - 1.193\left(\frac{224}{7}\right) = 25 - 38.176 = -13.176 \]

Hence, the regression equation would be
Yc = -13.176 + 1.193 x
b) If the husband’s age is 36 years, i.e. if x = 36, then the computed age of the wife, Yc,
would be
Yc = -13.176 + 1.193 (36)
= -13.176 + 42.948 = 29.772
= 30 years (approximately).
Short cut Method :
Calculations become much easier if, instead of taking the actual values of X
and Y, we take deviations from their respective means. The regression equation
would then be

\[ (Y - \bar{Y}) = b_{yx}(X - \bar{X}) \]

\[ \text{where } b_{yx} = \frac{\sum xy}{\sum x^2} = r\frac{\sigma_y}{\sigma_x}, \quad x = (X - \bar{X}), \quad y = (Y - \bar{Y}) \]

a) Taking the same problem, we find the regression equation as follows:

     X       Y     x = (X - X̄)   y = (Y - Ȳ)     xy      x²      y²
     25      18        -7            -7          49      49      49
     27      20        -5            -5          25      25      25
     29      20        -3            -5          15       9      25
     32      25         0             0           0       0       0
     35      25        +3             0           0       9       0
     37      30        +5            +5          25      25      25
     39      37        +7           +12          84      49     144
ΣX = 224   ΣY = 175   Σx = 0   Σy = 0   Σxy = 198   Σx² = 166   Σy² = 268

\[ \bar{X} = \frac{224}{7} = 32, \qquad \bar{Y} = \frac{175}{7} = 25 \]

The regression equation of Y on X is
Y - 25 = byx (X - 32)

\[ \text{where } b_{yx} = \frac{\sum xy}{\sum x^2} = \frac{198}{166} = 1.193 \]

Therefore Y - 25 = 1.193 (X - 32)
Or Y - 25 = 1.193 X - 38.176
Or Y = 1.193 X - 38.176 + 25
Or Y = 1.193 X - 13.176
b) If the husband’s age is 36 years, i.e. if X = 36, then the computed age of the wife, Yc,
would be
Yc = 1.193 (36) - 13.176
= 42.948 - 13.176 = 29.772
= 30 years (approximately)

Problem 2
On the basis of the data given in problem 1, calculate the most probable age of husband
if the age of wife is 28 years.
Solution : For calculating the probable age of a husband for the given age of a wife, we
require the following regression equation of x on y, based upon deviations of x and y from
their respective means:

\[ X - \bar{X} = b_{xy}(Y - \bar{Y}) \]

\[ \text{where } b_{xy} = r\frac{\sigma_x}{\sigma_y} = \frac{\sum xy}{\sum y^2} \]

From the data given in Problem 1, we have

\[ \bar{X} = \frac{224}{7} = 32, \qquad \bar{Y} = \frac{175}{7} = 25 \]

Σxy = 198
Σy² = 268

Hence the regression equation of x on y is:

X - 32 = bxy (Y - 25)

\[ \text{where } b_{xy} = \frac{\sum xy}{\sum y^2} = \frac{198}{268} = 0.739 \]

Therefore X - 32 = 0.739 (Y - 25)

Or X - 32 = 0.739 Y - 18.475

Or X = 0.739 Y + 32 - 18.475

Or X = 0.739 Y + 13.525

Thus if the wife’s age is 28 years, i.e. Y = 28, then the age of the husband would be

X = 0.739 (28) + 13.525

= 20.692 + 13.525 = 34.217 = 34 years (approximately)

Problem 3 :
A researcher wants to find out if there is any relationship between the heights of sons
and the heights of their fathers. He took a random sample of six fathers and their six sons.
Their heights in inches are given below in an ordered array:

Height of father in inches (x)    Height of son in inches (y)
            63                              66
            65                              68
            66                              65
            67                              67
            67                              69
            68                              70

Using the short cut method:
i. Fit a regression line of y on x, and hence predict the height of the son if the father’s
height is 70 inches.
ii. Fit a regression line of x on y, and hence predict the height of the father if the son’s
height is 65 inches.
iii. Calculate Karl Pearson’s coefficient of correlation.
Solution : Fitting the Regression Lines

     x       y     dx = (x - 66)   dy = (y - 67)    dxdy     dx²     dy²
     63      66         -3              -1            3        9       1
     65      68         -1              +1           -1        1       1
     66      65          0              -2            0        0       4
     67      67         +1               0            0        1       0
     67      69         +1              +2            2        1       4
     68      70         +2              +3            6        4       9
Σx = 396   Σy = 405   Σdx = 0   Σdy = 3   Σdxdy = 10   Σdx² = 16   Σdy² = 19

i) Regression equation of Y on X : Y - ȳ = byx (x - x̄)

\[ \bar{x} = \frac{\sum x}{N} = \frac{396}{6} = 66, \qquad \bar{y} = \frac{\sum y}{N} = \frac{405}{6} = 67.5 \]

When deviations are taken from the actual means, byx = Σxy / Σx², where x and y here stand
for the deviations (x - x̄) and (y - ȳ).

If, however, the mean of x or of y (or both) is not a whole number, as with ȳ = 67.5 here,
it is simpler to take deviations from some assumed mean than from the actual means. In such
a case:

\[ b_{yx} = \frac{N\sum dx\,dy - \sum dx \sum dy}{N\sum dx^2 - (\sum dx)^2} = \frac{6 \times 10 - 0 \times 3}{6 \times 16 - (0)^2} = \frac{60}{96} = 0.625 \]

where dx = x - Ax and dy = y - Ay, with assumed means Ax = 66 and Ay = 67.

Substituting the values in the equation:

Yc - 67.5 = 0.625 (x - 66)

Yc = 0.625 x - 41.25 + 67.5

Yc = 0.625 x + 26.25

Hence, if the height of the father is 70 inches, i.e. x = 70, the height of the son, Yc,
would be

Yc = 0.625 (70) + 26.25

= 43.75 + 26.25 = 70 inches

ii) Regression equation of x on y : X - x̄ = bxy (y - ȳ)

\[ b_{xy} = \frac{N\sum dx\,dy - \sum dx \sum dy}{N\sum dy^2 - (\sum dy)^2} = \frac{6 \times 10 - 0 \times 3}{6 \times 19 - (3)^2} = \frac{60}{105} = 0.571 \]

Substituting the values in the equation:
X - 66 = 0.571 (y - 67.5)
X = 0.571 y - 38.542 + 66
Xc = 0.571 y + 27.458
Hence if the height of the son is 65 inches, i.e. y = 65, the height of the father, Xc, would
be
Xc = 0.571 (65) + 27.458
= 37.115 + 27.458
= 64.573 inches

iii) Karl Pearson’s coefficient of correlation: since byx = r σy/σx and bxy = r σx/σy, their
product is r², so

\[ r = \sqrt{b_{xy} \cdot b_{yx}} = \sqrt{0.571 \times 0.625} = \sqrt{0.356875} = +0.597 \]
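
The assumed-mean formulas used in this problem can be checked with a brief Python sketch. The function name `coefficients_from_assumed_means` is illustrative; the assumed means 66 and 67 are the ones used in the table. Note that the unrounded bxy is 0.5714, so r comes out as 0.598 rather than 0.597; the small difference is due only to rounding bxy to 0.571 in the hand calculation.

```python
import math


def coefficients_from_assumed_means(x, y, ax, ay):
    """Return (byx, bxy) using deviations dx = x - ax and dy = y - ay."""
    n = len(x)
    dx = [v - ax for v in x]
    dy = [v - ay for v in y]
    numerator = n * sum(a * b for a, b in zip(dx, dy)) - sum(dx) * sum(dy)
    byx = numerator / (n * sum(a * a for a in dx) - sum(dx) ** 2)
    bxy = numerator / (n * sum(b * b for b in dy) - sum(dy) ** 2)
    return byx, bxy


father = [63, 65, 66, 67, 67, 68]
son = [66, 68, 65, 67, 69, 70]

byx, bxy = coefficients_from_assumed_means(father, son, ax=66, ay=67)
# r is the geometric mean of the two coefficients, with their common sign.
r = math.copysign(math.sqrt(byx * bxy), byx)

print(byx, round(bxy, 3), round(r, 3))  # 0.625 0.571 0.598
```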

Problem 4 :
A department store gives in-service training to its salesmen, which is followed by a
test. The following data give the test scores and sales made by nine salesmen during a
certain period.
Test Scores       : 14 19 24 21 26 22 15 20 19
Sales (‘000 Rs.)  : 31 36 48 37 50 45 33 41 39

Calculate Karl Pearson’s coefficient of correlation between the test scores and sales.
Solution : Calculation of the coefficient of correlation:

Test scores   Sales (‘000 Rs.)   x = (X - X̄)   y = (Y - Ȳ)     xy      x²      y²
    (X)            (Y)
    14             31               -6            -9           54      36      81
    19             36               -1            -4            4       1      16
    24             48               +4            +8           32      16      64
    21             37               +1            -3           -3       1       9
    26             50               +6           +10           60      36     100
    22             45               +2            +5           10       4      25
    15             33               -5            -7           35      25      49
    20             41                0            +1            0       0       1
    19             39               -1            -1            1       1       1
ΣX = 180   ΣY = 360   Σx = 0   Σy = 0   Σxy = 193   Σx² = 120   Σy² = 346

Karl Pearson’s coefficient of correlation is

\[ r = \frac{\sum xy}{\sqrt{\sum x^2 \cdot \sum y^2}} \]

where x = X - X̄ and y = Y - Ȳ.

\[ \bar{X} = \frac{180}{9} = 20, \qquad \bar{Y} = \frac{360}{9} = 40 \]

Σxy = 193

Σx² = 120

Σy² = 346

\[ r = \frac{193}{\sqrt{120 \times 346}} = \frac{193}{\sqrt{41520}} = \frac{193}{203.76} = 0.947 \]

Therefore r = 0.947
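
The same coefficient can be obtained with a few lines of Python. This is a minimal sketch of the deviation formula used above; the function name `pearson_r` is illustrative.

```python
import math


def pearson_r(x, y):
    """Karl Pearson's r from deviations about the actual means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    return sum_xy / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


test_scores = [14, 19, 24, 21, 26, 22, 15, 20, 19]
sales = [31, 36, 48, 37, 50, 45, 33, 41, 39]  # '000 Rs.

r = pearson_r(test_scores, sales)
print(round(r, 3))  # 0.947
```

A value of r this close to +1 indicates a strong direct association between test scores and sales.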

A leading company engaged in the production of detergents has 10 vacancies for salesmen,
for which N = 15 persons have been called for interview. The interview board consists of the
sales manager and a psychologist. The rankings given by the sales manager and the
psychologist to the 15 candidates who attended the interview, listed according to their
serial numbers in the interview list, are given in col. (2) and col. (3) respectively of
table 13.6 below :

Computation of Rank Correlation

Sr. No. in the    Ranking by the    Ranking by the    d_i = x_i - y_i    d_i²
interview list    sales manager     psychologist
                  (x_i)             (y_i)
(1)               (2)               (3)               (4)                (5)
1 1 2 -1 1
2 3 3 0 0
3 2 1 1 1
4 4 5 -1 1
5 6 4 2 4
8 5 6 -1 1
9 7 8 -1 1
10 9 7 2 4
11 8 9 -1 1
13 11 10 1 1
14 10 12 -2 4
15 12 11 1 1
17 14 13 1 1
18 13 14 -1 1
19 15 15 0 0
20
Σ d_i² = 22
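
The table stops at Σd² = 22. Assuming the usual rank correlation formula ρ = 1 - 6Σd² / (N(N² - 1)), which applies when there are no tied ranks, the coefficient can be obtained with the following Python sketch. The function name is illustrative; the two rank lists are copied from columns (2) and (3) of the table.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Spearman's rho for untied ranks: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    rho = 1 - 6 * sum_d2 / (n * (n * n - 1))
    return sum_d2, rho


sales_manager = [1, 3, 2, 4, 6, 5, 7, 9, 8, 11, 10, 12, 14, 13, 15]
psychologist = [2, 3, 1, 5, 4, 6, 8, 7, 9, 10, 12, 11, 13, 14, 15]

sum_d2, rho = spearman_from_ranks(sales_manager, psychologist)
print(sum_d2, round(rho, 4))  # 22 0.9607
```

A coefficient of about 0.96 would indicate that the two interviewers ranked the candidates in very nearly the same order.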

Regression
The percentage marks obtained in graduation and an MBA entrance test of 10 students
were as follows.
Graduation 50 52 55 60 62 65 65 66 70 75
Entrance test 52 50 57 65 65 62 65 65 71 78

from these data find


a) The two regression equations and
b) The co-efficient of correlation between marks in graduation and those in the entrance
test
Solution : Let the marks in graduation be denoted by x and those in the entrance test by y.

Computations for the Regression Equations

     x       y       xy       x²       y²
     50      52     2600     2500     2704
     52      50     2600     2704     2500
     55      57     3135     3025     3249
     60      65     3900     3600     4225
     62      65     4030     3844     4225
     65      62     4030     4225     3844
     65      65     4225     4225     4225
     66      65     4290     4356     4225
     70      71     4970     4900     5041
     75      78     5850     5625     6084
Σx = 620   Σy = 630   Σxy = 39630   Σx² = 39004   Σy² = 40322

a) The regression equation of x on y is
xc = a’ + b’y
with the two normal equations
∑x = Na’ + b’∑y
∑xy = a’∑y + b’∑y²
Substituting the values we get
620 = 10a’ + 630b’            (i)
39630 = 630a’ + 40322b’       (ii)
Solving (i) and (ii) for a’ and b’:
a’ = 5.180
and b’ = 0.902
Hence the regression equation of x on y is
xc = 5.180 + 0.902 y
The regression equation of y on x is
yc = a + bx
with the two normal equations
∑y = Na + b∑x
∑xy = a∑x + b∑x²
Substituting the values we get
630 = 10a + 620b              (iii)
39630 = 620a + 39004b         (iv)
Solving (iii) and (iv) for a and b:
a = 0.340
and b = 1.011
Hence the regression equation of y on x is yc = 0.340 + 1.011 x

b) The coefficient of correlation between x and y is given by

\[ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} \]

Substituting the values,

\[ r = \frac{(10)(39630) - (620)(630)}{\sqrt{[10(39004) - (620)^2][10(40322) - (630)^2]}} = \frac{5700}{\sqrt{(5640)(6320)}} = 0.955 \]
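
Recomputing every total from the ten pairs of marks is a useful guard against slips in hand addition before the normal equations are solved. The sketch below does this in Python and solves each pair of normal equations by Cramer's rule; the function name `solve_normal_equations` and its parameter names are illustrative.

```python
def solve_normal_equations(n, sum_u, sum_v, sum_uv, sum_u2):
    """Solve  sum_v = n*a + b*sum_u  and  sum_uv = a*sum_u + b*sum_u2  for (a, b)."""
    det = n * sum_u2 - sum_u ** 2
    a = (sum_v * sum_u2 - sum_u * sum_uv) / det
    b = (n * sum_uv - sum_u * sum_v) / det
    return a, b


graduation = [50, 52, 55, 60, 62, 65, 65, 66, 70, 75]  # x
entrance = [52, 50, 57, 65, 65, 62, 65, 65, 71, 78]    # y

n = len(graduation)
sum_x, sum_y = sum(graduation), sum(entrance)
sum_xy = sum(x * y for x, y in zip(graduation, entrance))
sum_x2 = sum(x * x for x in graduation)
sum_y2 = sum(y * y for y in entrance)
print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)

# y on x: the predictor u is x. x on y: the predictor u is y.
a, b = solve_normal_equations(n, sum_x, sum_y, sum_xy, sum_x2)
a_prime, b_prime = solve_normal_equations(n, sum_y, sum_x, sum_xy, sum_y2)
print(round(a, 3), round(b, 3), round(a_prime, 3), round(b_prime, 3))
```

The same helper serves both lines because the two sets of normal equations differ only in which variable plays the role of predictor.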
The following data relate to marketing expenditure (in Rs. lakh) and the corresponding sales
(in Rs. crores):
Marketing Expenditure   10  12  15  20  23
Sales                   14  17  23  21  25

Estimate the marketing expenditure required to obtain a sales target of Rs. 40 crores.


Solution
Let marketing expenditure be denoted by x and sales by y.
Computations for the Regression Equation

     x       y       xy       x²       y²
     10      14      140      100      196
     12      17      204      144      289
     15      23      345      225      529
     20      21      420      400      441
     23      25      575      529      625
Σx = 80   Σy = 100   Σxy = 1684   Σx² = 1398   Σy² = 2080

The regression equation of x on y is
xc = a’ + b’y
The regression coefficient b’ of x on y is given by

\[ b' = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum y^2 - (\sum y)^2} \qquad \text{(i)} \]

and the value of a’ can be obtained as

\[ a' = \bar{x} - b'\bar{y} \qquad \text{(ii)} \]

Substituting the values in (i) above,

\[ b' = \frac{5(1684) - (80)(100)}{5(2080) - (100)^2} = \frac{420}{400} = 1.05 \]

Substituting the required values in (ii) above,

\[ a' = \frac{80}{5} - (1.05)\left(\frac{100}{5}\right) = 16 - (1.05)(20) = -5.0 \]

Again substituting for a’ and b’ in the regression equation
xc = a’ + b’y
we have
xc = (-5.0) + (1.05) y
When y = 40 (Rs. crores), the corresponding x value is
xc = (-5) + (1.05) 40
= (-5) + 42
= 37
That is, to achieve a sales target of Rs. 40 crores there is a need to spend Rs. 37 lakh on
marketing.
From the data given in the preceding problem, find the coefficient of correlation between
marketing expenditure and sales.
Solution : The coefficient of correlation r is given by

\[ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} \]

The required values being

n = 5        ∑xy = 1684
∑x = 80      ∑y = 100
∑x² = 1398   ∑y² = 2080

which when substituted yield

\[ r = \frac{5(1684) - (80)(100)}{\sqrt{[5(1398) - (80)^2][5(2080) - (100)^2]}} = \frac{420}{\sqrt{(590)(400)}} = 0.865 \]

The data for 10 years on sales (y) and advertisement expenditure (x) of a particular
product yielded the following summated values (in Rs. lakh):
∑x = 15, ∑y = 110, ∑xy = 400, ∑x² = 250 and ∑y² = 3200. Find the following:
a) Regression coefficient b of y on x and then the y-intercept a
b) x-intercept a’ and then the regression coefficient b’ of x on y
c) Most appropriate value of y for x = 5 and that of x for y = 25
d) Standard errors of estimate Syx and Sxy

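Solution :
a) The regression coefficient b of y on x is given by

\[ b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} = \frac{10(400) - (15)(110)}{10(250) - (15)^2} = \frac{2350}{2275} = 1.033 \]

and then the y-intercept a is obtained as

\[ a = \bar{y} - b\bar{x} = \frac{\sum y}{N} - b\,\frac{\sum x}{N} = \frac{110}{10} - (1.033)\left(\frac{15}{10}\right) = 9.451 \]

b) The x-intercept a’ is directly computed as

\[ a' = \frac{(\sum x)(\sum y^2) - (\sum y)(\sum xy)}{n\sum y^2 - (\sum y)^2} = \frac{(15)(3200) - (110)(400)}{10(3200) - (110)^2} = \frac{4000}{19900} = 0.201 \]

and then the regression coefficient b’ of x on y follows from
a’ = x̄ - b’ȳ

\[ b' = \frac{\bar{x} - a'}{\bar{y}} = \frac{(\sum x / N) - a'}{\sum y / N} = \frac{(15/10) - 0.201}{(110/10)} = 0.118 \]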
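
Parts (a) and (b) need only the five summated values, so they are easy to verify with a few lines of Python. The sketch below also evaluates the two fitted lines at x = 5 and y = 25 as a tentative answer to part (c); those two predictions are not worked in the text and are shown here only as what the fitted equations would give.

```python
n, sum_x, sum_y, sum_xy, sum_x2, sum_y2 = 10, 15, 110, 400, 250, 3200

cross = n * sum_xy - sum_x * sum_y        # common numerator: 2350

# (a) y on x: slope b, then intercept a = y-bar - b * x-bar
b = cross / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

# (b) x on y: slope b', and intercept a' = x-bar - b' * y-bar
b_prime = cross / (n * sum_y2 - sum_y ** 2)
a_prime = sum_x / n - b_prime * sum_y / n

# (c) tentative predictions from the two fitted lines (not worked in the text)
y_at_x5 = a + b * 5
x_at_y25 = a_prime + b_prime * 25

print(round(b, 3), round(a, 3))              # 1.033 9.451
print(round(b_prime, 3), round(a_prime, 3))  # 0.118 0.201
print(round(y_at_x5, 2), round(x_at_y25, 2))
```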

7.8 SUMMARY
In this unit the fundamentals of linear regression have been highlighted. Broadly speaking, the
fitting of any chosen mathematical function to given data is termed regression analysis.
The estimation of the parameters of this model is accomplished by the least squares criterion,
which tries to minimize the sum of squares of the errors for all the data points. Regression
is thus a potent device for establishing relationships between variables from the given data.
The discovered relationship can be used for predictive purposes.

7.9 KEY WORDS


Dependent Variable : The variable of interest or focus which is influenced by one or more
independent variables.
Estimate : A value obtained from data for a certain parameter of the assumed model, or a
forecast value obtained from the model.
Independent Variable : A variable that can be set either to a desirable value or takes values
that can be observed but not controlled.
Linear Regression : Fitting of any chosen mathematical model, linear in unknown
parameters, to given data.
Non-linear Regression : Fitting of any chosen mathematical model, non-linear in unknown
parameters, to given data.

7.10 SELF ASSESSMENT QUESTIONS


1. What do you mean by regression?
2. Differentiate between ‘correlation’ and ‘regression’.
3. Explain clearly why there are usually two lines of regression. Point out the case when
there is one line of regression. Illustrate your answer by diagram.
4. Explain the uses of regression analysis.
5. The following data relate to the scores obtained by 9 salesmen of a company in an
intelligence test and their weekly sales in thousand rupees:

Salesman              A   B   C   D   E   F   G   H   I
Intelligence score   50  60  50  60  80  50  80  40  70
Weekly sales         30  60  40  50  60  30  70  50  60

a. Obtain the regression equation of sales on intelligence test scores of the salesmen.
b. If the intelligence test score of a salesman is 65, what would be his expected
weekly sales?

6. In a correlation study the following values are obtained:


X Y
Arithmetic mean 36 85
Standard Deviation 11 8

Coefficient of Correlation-0.8

Find the two regression equations that are associated with the above values.

7. You are given the following data:

Price index of cotton (X)   78  77  85  88  87  82  81  77  76  83  97  93
Price index of wool (Y)     84  82  82  85  89  90  88  92  83  89  98  99

Correlation Coefficient between X and Y=0.66


a. Find the two regression equations
b. Estimate the value of X when Y=75

8. Following is the distribution of students according to their height and weight:

X Y
Mean 65 67
Standard Deviation 2.5 3.5

Obtain two regression equations

9. Price indices of cotton and wool are given below for the 12 months of a year. Obtain the equations
of lines of regressions between the indices
Height in              Weight in lbs
inches       90-100  100-110  110-120  120-130
50-55           4        7        5        2
55-60           6       10        7        4
60-65           6       12       10        7
65-70           3        8        6        3

10. The following table shows the age (in years) of 10 children and a quantitative measure
of their aggressive behaviour (measured on a scale of 0 to 10)
Age 6 6 6.7 7 7.4 7.9 8 8.2 8.5 8.9
Aggressive 9 6 7 8 7 4 2 3 3 1
behaviour

a. Determine the regression line of aggressive behaviour according to age.


b. From that line, determine the value of aggressive behaviour that would correspond to a
child of 7.2 years.

7.11 REFERENCES
1. Gupta S.P., Business Statistics, S Chand and Sons Publishers, Delhi, 2017.
2. Quantitative Techniques for Business Decisions, Chetana Book House, Mysore, 2015.
3. Vignesh Prajapathi, Big Data Analysis with R and Hadoop, Packt Publishing, 2016.
4. Sharma S.D., Operation Research, Discovery Publishing House, Delhi, 2016.
5. Srinath L.S., PERT and CPM, East West Press, Delhi, 2002.
6. Kalavathy, Operation Research, Vikas Publishing House, Delhi, 2008.

UNIT 8 : MULTIPLE CORRELATION AND REGRESSION

STRUCTURE

8.0 Objectives
8.1 Introduction
8.2 Concept of Multiple Regression and Multiple Correlation
8.3 Concept of Partial Correlation
8.4 The Purpose of Multiple Correlation Co-efficient
8.5 Partial Regression Co-efficient – Least Square Normal Equations
8.6 Solved Problems on Partial Regression Co-efficient and Correlation Co-efficient
8.7 Solved Problems on Multiple Regression
8.8 Summary
8.9 Key Words
8.10 Self Assessment questions
8.11 References

8.0 OBJECTIVES
After studying this unit you should be able to:
∗ Explain the concept of Partial and Multiple Correlation and Regression ;
∗ Define the Partial Regression co-efficient;
∗ Analyse the relationship between partial regression co-efficient and Correlation co-
efficient;
∗ Describe the concept of Multiple Regression and
∗ Solve the different problems on Partial and Multiple Regression.

8.1 INTRODUCTION
The correlation and regression coefficients discussed earlier measure the degree and
nature of the effect of one variable on another. It is useful to know how one phenomenon
is influenced by another, but in practice a phenomenon is usually influenced by several
factors at the same time.
Multiple Regression is an advanced statistical tool, and it is extremely powerful
when you are trying to develop a ‘model’ for predicting a wide variety of outcomes. The
previous units discussed simple relations, namely linear correlation and linear
regression between two variables. But most economic and business phenomena cannot be
described in such a simplistic manner.
8.2 CONCEPT OF MULTIPLE REGRESSION AND MULTIPLE CORRELATION
Multiple Regression is a statistical tool that allows you to examine how multiple
independent variables are related to a dependent variable. Multiple regression analysis
represents a logical extension of two-variable regression analysis. Instead of a single
independent variable, two or more independent variables are used to estimate the values of a
dependent variable. However, the fundamental concept in the analysis remains the same.
The term multiple correlation refers to the theory of correlation involving more than
two variables. Multiple correlation is used to find the degree of inter-relationship among
three or more variables. For example the yield of crop in a year may depend upon rainfall,
manure, the average temperature and average humidity during the period between sowing
and harvesting of the crop; the results of houses may depend upon tax rates as well as upon
building costs and upon other variables also; general intelligence in schools may be related to
grades in mathematics and grades in English and so on. Thus the aim of the theory of multiple
correlation is to know how far the dependent variable is influenced by the independent
variables.

8.3 CONCEPT OF PARTIAL CORRELATION
It is often important to measure the correlation between a dependent variable and one
particular independent variable when all other variables involved are kept constant i.e. when
the effects of all other variables are removed. The partial correlation analysis measures the
strength of the relationship between Y and one independent variable in such a way that
variations in the other independent variables are taken into account. A partial correlation
coefficient is analogous to a partial regression coefficient in that all other factors are ‘held
constant’. Simple correlation, on the other hand, ignores the effect of all other variables
even though these variables might be quite closely related to the independent variable or to
one another.
Partial correlation is also called ‘net correlation’. It is a study of the relationship between
one dependent variable and one independent variable by keeping the other independent
variables constant. In simple correlation the effect of other independent variables was ignored.
a) Zero Order co-efficient : Simple correlation between two variables is called the
Zero Order co-efficient, as in simple correlation no factor is held constant.
b) First order co-efficient : If a partial correlation is studied between two variables by
keeping a third variable constant it would be called a first order co-efficient as one
variable is kept constant.
c) Second order co-efficient : If a partial correlation is studied between two variables by
keeping two other variables constant it would be called a second order co-efficient as
two variables are kept constant.
The partial correlation co-efficient provides a measure of the relationship between the
dependent variable and another variable, with the effect of the other variables eliminated.
The function of partial correlation analysis is the measurement of the relationship between
two factors with the effects of one or more other factors eliminated. If the assumptions of
the method are true for a series of data, the power of partial analysis is great.

8.4 THE PURPOSE OF MULTIPLE CORRELATION CO-EFFICIENT


In multiple correlation we study three or more variables at a time. In partial
correlation we study the relationship between two variables while holding the other factors
constant, whereas in multiple correlation the effect of all the independent factors on a
dependent factor is studied.
The co-efficient of multiple correlation serves the following purposes:

1. It serves as a measure of the degree of association between one variable taken as the
dependent variable and a group of other variables taken as the independent variables.
2. It also serves as a measure of goodness of fit of the calculated plane of regression and
consequently as a measure of the general degree of accuracy of estimates made by
reference to equation for the plane of regression.

8.5 PARTIAL REGRESSION CO-EFFICIENT – LEAST SQUARE NORMAL EQUATIONS
To demonstrate the calculation of partial regression co-efficients we consider the
particular form of regression equation involving two independent variables x2 and x3 and a
dependent variable x1:
$$ \hat{x}_1 = a + b_1 x_2 + b_2 x_3 $$
or, in the subscript notation used in this unit,
$$ \hat{x}_1 = a + b_{12.3}\,x_2 + b_{13.2}\,x_3 $$
where
x̂1 = estimated value of the dependent variable;
a = a regression constant representing the intercept on the axis of the dependent variable; its
value is zero when the regression equation passes through the origin;
b12.3, b13.2 = partial regression co-efficients;
b12.3 = the change in x1 for each unit change in x2, while x3 is held constant;
b13.2 = the change in x1 for each unit change in x3, while x2 is held constant.

8.6 SOLVED PROBLEMS ON PARTIAL REGRESSION CO-EFFICIENT AND CORRELATION CO-EFFICIENT
1) Given the following, determine the regression equation of
a) x1 on x2 and x3.
b) x2 on x1 and x3.
r12 = 0.8 σ1 = 10
r13 = 0.6 σ2 = 8
r23 = 0.5 σ3 = 5

Solution : (a) The regression equation of x1 on x2 and x3 is given by
x1 = b12.3 x2 + b13.2 x3
where
$$ b_{12.3} = \frac{\sigma_1}{\sigma_2}\cdot\frac{r_{12} - r_{13}\,r_{23}}{1 - r_{23}^2} = \frac{10}{8}\cdot\frac{0.8 - (0.6)(0.5)}{1 - (0.5)^2} = 1.25 \times \frac{0.5}{0.75} = 0.833 $$
$$ b_{13.2} = \frac{\sigma_1}{\sigma_3}\cdot\frac{r_{13} - r_{12}\,r_{23}}{1 - r_{23}^2} = \frac{10}{5}\cdot\frac{0.6 - (0.8)(0.5)}{1 - (0.5)^2} = 2 \times \frac{0.2}{0.75} = 0.533 $$
Therefore the required regression equation is
x1 = 0.833 x2 + 0.533 x3.

b) The regression equation of x2 on x1 and x3 is given by
x2 = b21.3 x1 + b23.1 x3
where
$$ b_{21.3} = \frac{\sigma_2}{\sigma_1}\cdot\frac{r_{12} - r_{23}\,r_{13}}{1 - r_{13}^2} = \frac{8}{10}\cdot\frac{0.8 - (0.5)(0.6)}{1 - (0.6)^2} = 0.8 \times \frac{0.5}{0.64} = 0.625 $$
$$ b_{23.1} = \frac{\sigma_2}{\sigma_3}\cdot\frac{r_{23} - r_{12}\,r_{13}}{1 - r_{13}^2} = \frac{8}{5}\cdot\frac{0.5 - (0.8)(0.6)}{1 - (0.6)^2} = 1.6 \times \frac{0.02}{0.64} = 1.6 \times 0.03125 = 0.05 $$
Note that both co-efficients in part (b) hold x1 or x3 constant against the other, so the
denominator is 1 - r13² in each case.
Therefore x2 = b21.3 x1 + b23.1 x3, i.e.
x2 = 0.625 x1 + 0.05 x3.

2) In a trivariate distribution :
σ1 = 3 r23 = 0.4
σ2 = 4 r13 = 0.6
σ3 = 5 r12 = 0.7
Determine the regression equation of x1 on x2 and x3, if the variables are measured from
their means.
Solution : The required regression equation of x1 on x2 and x3 is given by
x1 = b12.3 x2 + b13.2 x3
where
$$ b_{12.3} = \frac{\sigma_1}{\sigma_2}\cdot\frac{r_{12} - r_{13}\,r_{23}}{1 - r_{23}^2} = \frac{3}{4}\cdot\frac{0.7 - (0.6)(0.4)}{1 - (0.4)^2} = 0.75 \times \frac{0.46}{0.84} = 0.75 \times 0.5476 = 0.4107 $$
$$ b_{13.2} = \frac{\sigma_1}{\sigma_3}\cdot\frac{r_{13} - r_{12}\,r_{23}}{1 - r_{23}^2} = \frac{3}{5}\cdot\frac{0.6 - (0.7)(0.4)}{1 - (0.4)^2} = 0.6 \times \frac{0.32}{0.84} = 0.6 \times 0.3810 = 0.2286 $$
Hence the required equation is
x1 = 0.4107 x2 + 0.2286 x3.
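The formula for a partial regression co-efficient can be written once and reused for any pair of variables. The following Python sketch is only an illustration under that reading; the function name partial_regression_coefficient and its argument names are hypothetical. It is applied to the data of problem 2.

```python
def partial_regression_coefficient(s_i, s_j, r_ij, r_ik, r_jk):
    """b_ij.k = (s_i / s_j) * (r_ij - r_ik * r_jk) / (1 - r_jk ** 2).

    i is the dependent variable, j the variable whose co-efficient is wanted,
    and k the variable held constant.
    """
    return (s_i / s_j) * (r_ij - r_ik * r_jk) / (1 - r_jk ** 2)


# Data of problem 2 (variables measured from their means).
s1, s2, s3 = 3, 4, 5
r12, r13, r23 = 0.7, 0.6, 0.4

b12_3 = partial_regression_coefficient(s1, s2, r12, r13, r23)
b13_2 = partial_regression_coefficient(s1, s3, r13, r12, r23)
print(f"x1 = {b12_3:.4f} x2 + {b13_2:.4f} x3")
```

The order of the correlation arguments matters: the third argument is always the correlation between the dependent variable and the variable whose co-efficient is being computed.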

3) An instructor of mathematics wishes to determine the relationship of grades in the final
examination to grades on two quizzes given during the semester. Let x1, x2 and x3 be the
grades of a student in the first quiz, second quiz and final examination, respectively. The
instructor made the following computations for a total of 120 students :
x̄1 = 6.80 S1 = 1.00 r12 = 0.6
x̄2 = 7.00 S2 = 0.80 r13 = 0.7
x̄3 = 74.00 S3 = 9.00 r23 = 0.65
a. Find the least squares regression equation of x3 on x1 and x2.
b. Estimate the final grades of two students who scored respectively 9 and 7, and 4 and 8,
marks in the two quizzes.
Solution : a) The regression equation of x3 on x1 and x2 can be written as
$$ x_3 - \bar{x}_3 = \left(\frac{r_{23} - r_{13}\,r_{12}}{1 - r_{12}^2}\right)\left(\frac{S_3}{S_2}\right)(x_2 - \bar{x}_2) + \left(\frac{r_{13} - r_{23}\,r_{12}}{1 - r_{12}^2}\right)\left(\frac{S_3}{S_1}\right)(x_1 - \bar{x}_1) $$
Substituting the values we have
$$ x_3 - 74 = \left(\frac{0.65 - (0.7)(0.6)}{1 - (0.6)^2}\right)\left(\frac{9}{0.8}\right)(x_2 - 7) + \left(\frac{0.7 - (0.65)(0.6)}{1 - (0.6)^2}\right)\left(\frac{9}{1}\right)(x_1 - 6.8) $$
$$ x_3 - 74 = \left(\frac{0.23}{0.64}\right)(11.25)(x_2 - 7) + \left(\frac{0.31}{0.64}\right)(9)(x_1 - 6.8) $$
$$ x_3 - 74 = (0.3594)(11.25)(x_2 - 7) + (0.4844)(9)(x_1 - 6.8) $$
$$ x_3 - 74 = 4.04\,(x_2 - 7) + 4.36\,(x_1 - 6.8) $$
x3 = 74 + 4.04x2 - 28.28 + 4.36x1 - 29.648
x3 = 4.36x1 + 4.04x2 + 16.072

b) The final grade of the student who scored 9 and 7 marks is obtained by substituting x1 = 9
and x2 = 7 in the regression equation :
x3 = 4.36 (9) + 4.04 (7) + 16.072
x3 = 39.24 + 28.28 + 16.072
x3 = 83.592, or about 84.
Similarly, the final grade of the student who scored 4 and 8 marks is obtained by
substituting x1 = 4 and x2 = 8 in the regression equation :
x3 = 4.36 (4) + 4.04 (8) + 16.072
x3 = 17.44 + 32.32 + 16.072
x3 = 65.832, or about 66.
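The same estimates can be obtained programmatically. The sketch below is a minimal illustration, not part of the original solution: the function names are hypothetical, and the co-efficients are kept unrounded, so intermediate values differ slightly from the hand calculation while the grades agree to the nearest mark.

```python
def regression_on_two_predictors(mean, sd, r12, r13, r23):
    """Return (constant, c1, c2) such that x3 = constant + c1*x1 + c2*x2."""
    m1, m2, m3 = mean
    s1, s2, s3 = sd
    c1 = (s3 / s1) * (r13 - r23 * r12) / (1 - r12 ** 2)  # co-efficient of x1
    c2 = (s3 / s2) * (r23 - r13 * r12) / (1 - r12 ** 2)  # co-efficient of x2
    constant = m3 - c1 * m1 - c2 * m2
    return constant, c1, c2


constant, c1, c2 = regression_on_two_predictors(
    mean=(6.80, 7.00, 74.00), sd=(1.00, 0.80, 9.00), r12=0.6, r13=0.7, r23=0.65
)


def predict_final_grade(quiz1, quiz2):
    """Estimated final grade for given marks in the two quizzes."""
    return constant + c1 * quiz1 + c2 * quiz2


for quiz1, quiz2 in [(9, 7), (4, 8)]:
    print(quiz1, quiz2, round(predict_final_grade(quiz1, quiz2)))
```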

8.7 SOLVED PROBLEMS ON MULTIPLE REGRESSION

MULTIPLE REGRESSION AND CORRELATION
1) Given the following data
x1 → 20 25 15 20 26 24
x2 → 3.2 6.5 2.0 0.5 4.5 1.5
x3 → 4.0 5.2 7.5 2.5 3.4 1.5
a) Obtain the least squares equation to predict x1 values from those of x2 and x3, and
b) Predict x1 when x2 = 3.2 and x3 = 3.0.
Solution :
The least squares regression equation of x1 on x2 and x3 is
x1c = a + b12.3 x2 + b13.2 x3
where x1c denotes the estimated value of x1 corresponding to given values of x2 and x3;
a is a constant representing the value of x1 when both x2 and x3 are zero; and
b12.3 and b13.2 are two other constants known as the partial regression co-efficients.
b12.3 is the partial regression co-efficient of x1 on x2 keeping x3 constant. It measures
the average change in x1 as a result of a unit change in x2 with no change in x3.
Similarly, b13.2 is the partial regression co-efficient of x1 on x3 keeping x2 constant. It
measures the average change in x1 as a result of a unit change in x3 with no change in x2.
To determine the values of the three constants, we are required to solve
simultaneously the following three normal equations.
∑x1 = Na + b12.3 ∑x2 + b13.2 ∑x3 . . . . . . .(1)
∑x1x2 = a∑x2 + b12.3 ∑x2² + b13.2 ∑x2x3 . . . . . . .(2)
∑x1x3 = a∑x3 + b12.3 ∑x2x3 + b13.2 ∑x3² . . . . . . .(3)

Computations for the multiple regression equation

x1     x2     x3     x1x2    x1x3    x2x3    x1²    x2²     x3²
20     3.2    4.0    64.0    80.0    12.80   400    10.24   16.00
25     6.5    5.2    162.5   130.0   33.80   625    42.25   27.04
15     2.0    7.5    30.0    112.5   15.00   225    4.00    56.25
20     0.5    2.5    10.0    50.0    1.25    400    0.25    6.25
26     4.5    3.4    117.0   88.4    15.30   676    20.25   11.56
24     1.5    1.5    36.0    36.0    2.25    576    2.25    2.25
∑x1    ∑x2    ∑x3    ∑x1x2   ∑x1x3   ∑x2x3   ∑x1²   ∑x2²    ∑x3²
=130   =18.2  =24.1  =419.5  =496.9  =80.40  =2902  =79.24  =119.35

Substituting the needed information from the above computations in the three normal
equations, we have
130 = 6a + 18.2 b12.3 + 24.1 b13.2 . . . . . . .(1)
419.5 = 18.2a + 79.24 b12.3 + 80.40 b13.2 . . . . . . .(2)
496.9 = 24.1a + 80.40 b12.3 + 119.35 b13.2 . . . . . . .(3)

Step 1 : Multiply equation No.1 by 18.2 and equation No.2 by 6, and then deduct
equation No.2 from equation No.1, as shown below :
2366 = 109.2a + 331.24 b12.3 + 438.62 b13.2
2517 = 109.2a + 475.44 b12.3 + 482.40 b13.2
- 151 = - 144.20 b12.3 - 43.78 b13.2 . . . . . . .(4)

Step 2 : Multiply equation No.1 and equation No.3 by 24.1 and 6 respectively, and
then deduct equation No.3 from equation No.1 :
3133.0 = 144.6a + 438.62 b12.3 + 580.81 b13.2
2981.4 = 144.6a + 482.40 b12.3 + 716.10 b13.2
151.6 = - 43.78 b12.3 - 135.29 b13.2 . . . . . . .(5)

Step 3 : Multiply equation No.4 and equation No.5 by 43.78 and 144.2 respectively, and
deduct equation No.5 from equation No.4, as shown below :
- 6610.78 = - 6313.08 b12.3 - 1916.69 b13.2
21860.72 = - 6313.08 b12.3 - 19508.82 b13.2
- 28471.50 = 17592.13 b13.2
Therefore b13.2 = - 28471.50 / 17592.13 = - 1.618
Substituting b13.2 = - 1.618 in equation No.4, we get
- 151 = - 144.20 b12.3 - 43.78 (- 1.618)
- 151 = - 144.20 b12.3 + 70.84
144.20 b12.3 = 151 + 70.84 = 221.84
Therefore b12.3 = 221.84 / 144.20 = 1.538
Similarly, we can find the value of ‘a’ by substituting for b12.3 and b13.2 in equation No.1,
as shown under :
130 = 6a + 18.2 b12.3 + 24.1 b13.2
130 = 6a + 18.2 (1.538) + 24.1 (- 1.618)
130 = 6a + 27.992 - 38.994
Therefore 6a = 141.002
a = 141.002 / 6 = 23.500
Substituting the values of the three constants in the regression equation :
x1c = a + b12.3 x2 + b13.2 x3 . . . . . . .(6)
x1c = 23.500 + 1.538 x2 - 1.618 x3

(b) The most likely value of x1 against x2 = 3.2 and x3 = 3.0 can be predicted by
substituting these values in equation No.6 :
x1c = 23.500 + 1.538 (3.2) - 1.618 (3.0)
x1c = 23.500 + 4.922 - 4.854
x1c = 23.568
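Most mistakes in problems of this kind arise while filling in the computation table, so it is worth recomputing the column totals directly from the raw data. The following Python sketch does only that; the helper name sum_products and the dictionary keys are hypothetical labels chosen for illustration.

```python
# Raw data of problem 1.
x1 = [20, 25, 15, 20, 26, 24]
x2 = [3.2, 6.5, 2.0, 0.5, 4.5, 1.5]
x3 = [4.0, 5.2, 7.5, 2.5, 3.4, 1.5]


def sum_products(u, v):
    """Sum of element-wise products, e.g. the total of the x1*x2 column."""
    return sum(p * q for p, q in zip(u, v))


totals = {
    "sum_x1": sum(x1),
    "sum_x2": sum(x2),
    "sum_x3": sum(x3),
    "sum_x1x2": sum_products(x1, x2),
    "sum_x1x3": sum_products(x1, x3),
    "sum_x2x3": sum_products(x2, x3),
    "sum_x1_sq": sum_products(x1, x1),
    "sum_x2_sq": sum_products(x2, x2),
    "sum_x3_sq": sum_products(x3, x3),
}

# Print each total rounded to two decimals for comparison with the table.
for name, value in totals.items():
    print(name, round(value, 2))
```

Each printed total should agree with the corresponding entry in the last row of the computation table before the normal equations are formed.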
Problem No.2 :
The following table gives the weights (to the nearest pound), heights and ages of twelve
boys.
Weight (x1)   64  53  71  67  55  58  77  57  56  51  76  68
Height (x2)   57  49  59  62  51  50  55  48  52  42  61  57
Age (x3)       8   6  10  11   8   7  10   9  10   6  12   9

a) Find the least squares regression equation of x1 on x2 and x3.
b) Estimate the weight of a boy who is 10 years old and 56 inches tall.
Solution :
The least squares regression equation of x1 on x2 and x3 is
x1c = a + b12.3 x2 + b13.2 x3
and the constants are obtained from the normal equations
∑x1 = Na + b12.3 ∑x2 + b13.2 ∑x3 . . . . . . .(1)
∑x1x2 = a∑x2 + b12.3 ∑x2² + b13.2 ∑x2x3 . . . . . . .(2)
∑x1x3 = a∑x3 + b12.3 ∑x2x3 + b13.2 ∑x3² . . . . . . .(3)

Computations for the multiple regression equation

x1     x2     x3     x1x2    x1x3    x2x3    x1²     x2²     x3²
64     57     8      3648    512     456     4096    3249    64
53     49     6      2597    318     294     2809    2401    36
71     59     10     4189    710     590     5041    3481    100
67     62     11     4154    737     682     4489    3844    121
55     51     8      2805    440     408     3025    2601    64
58     50     7      2900    406     350     3364    2500    49
77     55     10     4235    770     550     5929    3025    100
57     48     9      2736    513     432     3249    2304    81
56     52     10     2912    560     520     3136    2704    100
51     42     6      2142    306     252     2601    1764    36
76     61     12     4636    912     732     5776    3721    144
68     57     9      3876    612     513     4624    3249    81
∑x1    ∑x2    ∑x3    ∑x1x2   ∑x1x3   ∑x2x3   ∑x1²    ∑x2²    ∑x3²
=753   =643   =106   =40830  =6796   =5779   =48139  =34843  =976

Substituting these totals in the normal equations, we have
753 = 12a + 643 b12.3 + 106 b13.2 . . . . . . .(1)
40830 = 643a + 34843 b12.3 + 5779 b13.2 . . . . . . .(2)
6796 = 106a + 5779 b12.3 + 976 b13.2 . . . . . . .(3)

Step 1 : Multiply equation No.1 and equation No.2 by 643 and 12 respectively, and
deduct equation No.2 from equation No.1 :
484179 = 7716a + 413449 b12.3 + 68158 b13.2
489960 = 7716a + 418116 b12.3 + 69348 b13.2
- 5781 = - 4667 b12.3 - 1190 b13.2 . . . . . . .(4)

Step 2 : Multiply equation No.1 and equation No.3 by 106 and 12 respectively, and
deduct equation No.3 from equation No.1 :
79818 = 1272a + 68158 b12.3 + 11236 b13.2
81552 = 1272a + 69348 b12.3 + 11712 b13.2
- 1734 = - 1190 b12.3 - 476 b13.2 . . . . . . .(5)

Step 3 : Multiply equation No.4 and equation No.5 by 1190 and 4667 respectively
(changing signs throughout), and deduct the second result from the first :
5553730 b12.3 + 1416100 b13.2 = 6879390
5553730 b12.3 + 2221492 b13.2 = 8092578
- 805392 b13.2 = - 1213188
Therefore b13.2 = 1213188 / 805392 = 1.506
Placing the value of b13.2 in equation No.5 :
1190 b12.3 + 476 b13.2 = 1734
1190 b12.3 + 476 (1.506) = 1734
1190 b12.3 = 1734 - 716.856
1190 b12.3 = 1017.144
Therefore b12.3 = 1017.144 / 1190 = 0.855
Similarly, we can find the value of ‘a’ by substituting for b12.3 and b13.2 in equation No.1,
as shown under :
753 = 12a + 643 b12.3 + 106 b13.2
753 = 12a + 643 (0.855) + 106 (1.506)
753 = 12a + 549.765 + 159.636
12a = 753 - 709.401 = 43.599
a = 43.599 / 12 = 3.633
Substituting the values of the three constants in the regression equation :
x1c = a + b12.3 x2 + b13.2 x3
x1c = 3.633 + 0.855 x2 + 1.506 x3

(b) Hence we can predict the weight of a boy whose height is 56 inches and age is 10
years as follows :
x1c = 3.633 + 0.855 (56) + 1.506 (10)
x1c = 3.633 + 47.88 + 15.06
x1c = 66.573
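The three normal equations can also be solved in one pass rather than by successive elimination. The sketch below uses Cramer's rule with exact fractions; this is a swapped-in technique for checking the hand calculation, not the method of the text, and the function names det3 and solve_3x3 are hypothetical. The co-efficients are the column totals of the computation table.

```python
from fractions import Fraction


def det3(m):
    """Determinant of a 3x3 matrix given as a list of three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def solve_3x3(coefficients, constants):
    """Solve a 3x3 linear system by Cramer's rule, returning exact fractions."""
    d = det3(coefficients)
    solution = []
    for col in range(3):
        replaced = [row[:] for row in coefficients]
        for row_index in range(3):
            replaced[row_index][col] = constants[row_index]
        solution.append(Fraction(det3(replaced), d))
    return solution


# Normal equations (1)-(3): unknowns are a, b12.3 and b13.2.
coefficients = [
    [12, 643, 106],
    [643, 34843, 5779],
    [106, 5779, 976],
]
constants = [753, 40830, 6796]

a, b12_3, b13_2 = solve_3x3(coefficients, constants)
estimate = a + b12_3 * 56 + b13_2 * 10  # boy 56 inches tall, 10 years old

print(round(float(a), 4), round(float(b12_3), 4), round(float(b13_2), 4))
print(round(float(estimate), 2))
```

Because no intermediate rounding takes place, the constant term comes out near 3.65 rather than 3.633; the two partial regression co-efficients and the estimated weight agree with the hand calculation to the precision shown there.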

8.8 SUMMARY
The principal advantage of multiple regression is that it allows us to use more of the
information available to us to estimate the dependent variable. Sometimes the correlation
between two variables may be insufficient to determine a reliable estimating equation.
Multiple regression will also enable us to fit curves as well as lines. The three main objectives
of multiple regression and correlation analysis are (a) to derive an equation which provides
estimates of the dependent variable from values of the two or more independent variables;
(b) to obtain a measure of the error involved in using this regression equation as a basis for
estimation; and (c) to obtain a measure of the proportion of variance in the dependent variable
accounted for by the independent variables.

8.9 KEYWORDS
Multiple Regression Analysis : Represents a logical extension of two-variable regression
analysis. Instead of a single independent variable, two or more independent variables are
used to estimate the values of a dependent variable.
Multiple Regression Equation : The multiple regression equation describes the average
relationship between the variables and this relationship is used to predict or control the
dependent variable.
Partial Correlation Coefficient : Provides a measure of the relationship between the
dependent variable and other variables.

8.10 SELF ASSESSMENT QUESTIONS


1. Define Multiple Correlations and Multiple Regressions.
2. Differentiate between Partial correlation and Multiple Correlation.
3) The following constants are obtained from measurements on length (x1) in mm, volume
(x2) in cc, and weight (x3) in gms of 300 eggs :
x̄1 = 55.95 S1 = 2.26 r12 = 0.578
x̄2 = 51.48 S2 = 4.39 r13 = 0.581
x̄3 = 56.03 S3 = 4.41 r23 = 0.974
a) Obtain the linear regression equation of egg weight on egg length and egg volume,
b) Estimate the weight of an egg whose length is 58 mm and volume is 52.5 cc.
[Ans : (a) : x3 = 3.43 + 0.053x1 + 0.964x2.
Ans : (b) x3 = 57.11 gms]

4) In a trivariate distribution :
σ1 = 3 r23 = 0.4
σ2 = 4 r13 = 0.6
σ3 = 5 r12 = 0.7
Determine the regression equation of x1 on x2 and x3, if the variables are measured from
their means.
[Ans : x1 = 0.411x2 + 0.229x3]

8.11 REFERENCES
1. Gupta S.P., Business Statistics, S Chand and Sons Publishers, Delhi, 2017.
2. Quantitative Techniques for Business Decisions, Chetana Book House, Mysore, 2015.
3. Vignesh Prajapathi, Big Data Analysis with R and Hadoop, Packt Publishing, 2016.
4. Sharma S.D., Operation Research, Discovery Publishing House, Delhi, 2016.
5. Srinath L.S., PERT and CPM, East West Press, Delhi, 2002.
6. Kalavathy, Operation Research, Vikas Publishing House, Delhi, 2008.

BLOCK -3
PROBABILITY

UNIT 9 : INTRODUCTION TO PROBABILITY AND PROBABILITY TYPES

STRUCTURE
9.0 Objectives
9.1 Introduction
9.2 Definitions
9.3 Three approaches to Probability
9.4 Set of Mutually Exclusive events
9.5 Probability Axioms
9.6 Theorems of Probability
9.7 Marginal Probability
9.8 Joint Probability
9.9 Conditional Probabilities
9.10 Bayes’ theorem
9.11 Problems solved
9.12 Summary
9.13 Key Words
9.14 Self Assessment Questions
9.15 References

9.0 OBJECTIVES
After studying this unit you should be able to :
* Explain the basic concepts of probability;
* Describe probability axioms;
* Discuss theorems of probability;
* Define marginal probability;
* Analyze joint probability;
* Describe conditional probability and
* Explain Bayes’ Theorem.

9.1 INTRODUCTION
Probability is the chance that something will or will not happen.
In statistics it is denoted by the capital letter P and is measured on an inclusive numerical
scale of 0 to 1. If we are using percentages, then the scale is from 0% to 100%. If the
probability is 0% then there is absolutely no chance that an outcome will occur; if it is
100% the outcome is certain.
The opposite of a probabilistic situation is a deterministic one, where the outcome is certain on
the assumption that the input data is reliable. With probability something happens or it does
not happen, that is, the situation is binomial: there are only two possible outcomes.
However that does not mean that there is a 50/50 chance of being right or wrong or a 50/50
chance of winning. If you toss a fair coin, one that has not been “fixed”, you have
a 50% chance of obtaining heads and a 50% chance of obtaining tails. If you buy one ticket in
a fund-raising raffle you will also either win or lose, but the two outcomes are not equally likely.

Conditional probabilities are contingent on a previous result. For example, suppose a bag
contains three marbles, one red, one blue and one green, and you draw them one at a time
without replacement. Each marble has an equal chance of being drawn. What is the conditional
probability of drawing the red marble after already drawing the blue one? First, the probability
of drawing the blue marble is about 33%, because it is one possible outcome out of three.
Assuming this first event occurs, there will be two marbles remaining, each having a 50%
chance of being drawn. So the conditional probability of drawing the red marble, given that
the blue one has already been drawn, is 50%, and the joint probability of drawing the blue
marble and then the red one is about 16.7% (1/3 x 1/2 = 1/6).
Definition of ‘Conditional Probability’
The probability of an event or outcome given the occurrence of a previous event or
outcome. Multiplying the probability of the preceding event by the conditional probability
of the succeeding event gives the joint probability that both events occur.
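The marble example can be checked by listing every equally likely ordered draw of two marbles. The Python sketch below is an illustration only (the helper name probability is hypothetical); it keeps the joint probability and the conditional probability as separate quantities.

```python
from fractions import Fraction
from itertools import permutations

marbles = ["red", "blue", "green"]
# All ordered draws of two marbles without replacement: 3 x 2 = 6 outcomes.
outcomes = list(permutations(marbles, 2))


def probability(event):
    """Classical probability: favourable outcomes / all equally likely outcomes."""
    favourable = [outcome for outcome in outcomes if event(outcome)]
    return Fraction(len(favourable), len(outcomes))


p_blue_first = probability(lambda o: o[0] == "blue")
p_blue_then_red = probability(lambda o: o == ("blue", "red"))  # joint probability
p_red_given_blue = p_blue_then_red / p_blue_first              # conditional probability

print(p_blue_first, p_blue_then_red, p_red_given_blue)
```

The three printed fractions are 1/3, 1/6 and 1/2: the joint probability is the product of the first probability and the conditional probability.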

Probability under Conditions of Statistical Independence:
When a statistically independent event occurs, it does not have any effect on the happening
of another event. There are three types of probabilities under statistical independence: 1.
Marginal, 2. Joint and 3. Conditional.

9.2 BASIC DEFINITIONS


Before we give definitions of the word probability as looked upon by various schools of
thought it is necessary that we familiarise ourselves with certain terms that are used in this
context.
Random Experiment : It is an experiment which if conducted repeatedly under
homogeneous conditions does not give the same result. The result may be anyone of the
various possible ‘outcomes’. Here the result is not unique (or the same every time). For
example if an unbiased dice is thrown it will not always fall with any particular number up.
Any of the six numbers on the dice can come up.
Trial and Event : The performance of a random experiment is called a trial and the outcome
an event. Thus throwing of a dice would be called a trial and the result (falling of anyone of
the six numbers 1, 2, 3, 4, 5, 6) an event.
Events could be either simple or compound (also called composite). An event is called
simple if it corresponds to a single possible outcome. Thus in tossing a dice, the chance of
getting 3 is a simple event (because 3 occur in the dice only once). However the chance of
getting an odd number is a compound event (because odd numbers are more than one i.e. 1,
3 and 5).
A compound event can further be decomposed into simple events, e.g. if a die is thrown,
getting an odd number is a compound event, but it can be decomposed into the simple events of
getting 1, 3 or 5, i.e. there are three simple events for the above stated compound event.
Exhaustive Cases: All possible outcomes of an event are known as exhaustive cases. In the
throw of a single dice the exhaustive cases are 6 as the dice has only six faces each marked
with a different number. However if 2 dice are thrown the exhaustive cases would be 36
(6 x 6) as there are 36 ways in which two dice can fall. Similarly the number of exhaustive
cases in the throw of 2 coins would be four (2 x 2) i.e. HH, TT, HT and TH, (where H stands
for head and T for tail).
Favourable Cases : The number of outcomes which result in the happening of a desired
event are called favourable cases. Thus in a single throw of a dice, the number of favourable
cases of getting an odd number are three i.e. 1, 3 and 5. Similarly in drawing a card from a
pack, the cases favourable to getting a spade are 13 (as there are 13 spade cards in the pack).
Mutually Exclusive Events: Two or more events are said to be mutually exclusive if the
happening of anyone of them excludes the happening of all others in a single (i.e. same)
experiment. Thus in the throw of a single dice the event 5 and 6 are mutually exclusive
because if the event 5 happens no other event is possible in the same experiment. Here one
and only one of the events can take place at a time excluding all others.
Equally Likely Events: Two or more events are said to be equally likely if the chance of
their happening is equal i.e., there is no preference of anyone event over the other. Thus in a
throw of an unbiased die, the coming up of 1, 2, 3, 4, 5 or 6 is equally likely. In the throw of
an unbiased coin the coming up of head or tail is equally likely.
Independent and Dependent Events: An event is said to be independent if its happening is
not affected by the happening of other events and if it does not affect the happening of other
events. Thus in the throw of a dice repeatedly, coming up of 5 on the first throw is independent
of coming up of 5 again in the second throw.
However if we are successively drawing cards from a pack (without replacement) the
events would be dependent. The chance of getting a King on the first draw is 4/52 (as there
are 4 Kings in a pack). If this card is not replaced before the second draw, the chance of
getting a King again is 3/51 as there are now only 51 cards left and they contain only 3
Kings.

If however the card is replaced after the first draw i.e. before the second draw the events
would remain independent. In each of the two successive draws the chance of getting a King
would be 4/52.

(i) The number of permutations of n dissimilar things taken all at a time is n!. Thus if
there are 3 letters A, B and C, they can be arranged as ABC,
ACB, BAC, BCA, CAB and CBA, i.e. in 3! = 3 x 2 x 1 = 6 ways.
Factorial n (written as n!) is equal to the continued product of the first n natural numbers,
i.e.
$$ n! = 1 \times 2 \times 3 \times \cdots \times (n-1) \times n = n\,(n-1)\,(n-2) \cdots 3 \cdot 2 \cdot 1 $$
so that n! = n (n-1)! = n (n-1) (n-2)!.
(ii) The number of permutations of n dissimilar things taken r at a time is
$$ {}^{n}P_{r} = \frac{n!}{(n-r)!} $$
Thus if we are to make arrangements of any two letters out of three letters A, B, C, then
the different arrangements will be AB, BA, AC, CA, BC, CB, i.e. 6 arrangements, which
in factorial notation can be represented as
$$ {}^{3}P_{2} = \frac{3!}{(3-2)!} = 3! = 3 \times 2 \times 1 = 6 $$
Hence three letters taken 2 at a time can be arranged in 6 ways. Similarly, the number of
arrangements of any two letters out of 4 is
$$ {}^{4}P_{2} = \frac{4!}{2!} = 4 \times 3 = 12 $$
(iii) The number of permutations of n things when n1 of them are of one kind and n2 of
another kind is
$$ \frac{n!}{n_1!\,n_2!} $$
Thus if we have to find the permutations of the letters of the
word FARIDABAD (where A occurs 3 times and D occurs 2 times) the answer would be
$$ \frac{9!}{3!\,2!} = \frac{9 \times 8 \times 7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1}{(3 \times 2 \times 1)(2 \times 1)} = 30{,}240 $$
(iv) The fundamental rule of counting is that if an operation can be performed in ‘m’
ways and, having been performed in any one of these ways, a second operation can be
performed in ‘n’ ways, the total number of ways of performing the two operations
together is m x n.
(v) The number of combinations of n different things taken r at a time is
$$ {}^{n}C_{r} = \frac{n!}{r!\,(n-r)!} $$
Thus if we have to pick two letters out of three, A, B and
C, we can pick AB, AC or BC, i.e. in 3 ways:
$$ {}^{3}C_{2} = \frac{3!}{2!\,(3-2)!} = \frac{3!}{2!} = \frac{3 \times 2 \times 1}{2} = 3 $$

9.3 THREE APPROACHES TO PROBABILITY
Subjective probability
One type of probability is subjective probability, which is qualitative, sometimes
emotional, and simply based on the belief or the “gut” feeling of the person making the
judgment.
Subjective probability may be a function of a person’s experience with a situation. For
example, Salesperson A says that he is 80% certain of making a sale with a certain
client, as he knows the client well. However, Salesperson B may give only a 50% probability
level of making that sale. Both are basing their arguments on subjective probability.
Relative frequency probability
A probability based on information or data collected from situations that have occurred
previously is relative frequency probability.
Relative frequency probabilities have use in many business situations. For example,
data taken from a certain country indicate that in a sample of 3,000 married couples under
study, one-third were divorced within 10 years of marriage. Again, on the assumption that
future conditions will be similar to past conditions, we can say that in this country, the
probability of being divorced before 10 years of marriage is 1/3 or 33.33%. This
demographic information can then be extended to estimate needs of such things as legal
services, new homes, and childcare.
Classical probability
A probability measure that is also the basis for gambling or betting, and thus useful if you frequent casinos, is classical probability. Classical probability is also known as simple probability or marginal probability and is defined by the following ratio:

$$P(\text{event}) = \frac{\text{number of outcomes favourable to the event}}{\text{total number of possible outcomes}}$$

For this expression to be valid, all the outcomes, including those counted in the numerator (upper part of the ratio), must be equally likely.

9.4 SET OF MUTUALLY EXCLUSIVE EVENTS


To cover all possibilities among a set of mutually exclusive events, add up all their probabilities. The probabilities of all these events together add up to 1.
P(A) + P(B) + P(C) + …… + P(N) = 1
Exhaustive Events
If A either happens or does not happen, then "A happens" and "A does not happen" are exhaustive events.
P(A happens) + P(A does not happen) = 1
The sum of the probabilities of all mutually exclusive and collectively exhaustive events is always equal to 1. That is,
P(A) + P(B) + P(C) = 1
if A, B, C are mutually exclusive and collectively exhaustive events.
Example 1 :
P(you pass) = 0.9
P(you fail) = 1 – 0.9 = 0.1

EXAMPLE 1 – EXHAUSTIVE EVENTS


A production line uses 3 machines. The chance that the 1st machine breaks down in any week is 1/10. The chance for the 2nd machine is 1/20, and for the 3rd machine 1/40. What is the chance that at least one machine is not working in any week?
Solution
P(at least one not working) + P(all three working) = 1
P(at least one not working) = 1 – P(all three working)
Assuming the machines break down independently of one another,
P(all three working) = P(1st working) x P(2nd working) x P(3rd working)
P(1st working) = 1 – P(1st not working) = 1 – 1/10 = 9/10
P(2nd working) = 19/20
P(3rd working) = 39/40
P(all three working) = 9/10 x 19/20 x 39/40 = 6669/8000
P(at least one not working) = 1 – 6669/8000 = 1331/8000
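The complement calculation for the three machines can be reproduced with exact fractions. This is a sketch only; it assumes, as the solution implicitly does, that the machines break down independently, and the variable names are illustrative.

```python
from fractions import Fraction

# Weekly breakdown chances for the three machines, taken from the example.
breakdown = [Fraction(1, 10), Fraction(1, 20), Fraction(1, 40)]

# Assumption: the machines break down independently of one another.
p_all_working = Fraction(1)
for p in breakdown:
    p_all_working *= 1 - p

# Complement rule: at least one machine not working.
p_at_least_one_down = 1 - p_all_working

print(p_all_working, p_at_least_one_down)  # 6669/8000 1331/8000
```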

9.5 PROBABILITY AXIOMS


We now look at probability axioms. These are general probability rules that hold regardless of the particular situation or kind of probability (objective or subjective). Given an experiment:
1. Each elementary event, or combination of elementary events, must have associated with it a probability greater than or equal to zero but less than or equal to 1. Thus, if A is an event within a sample space, then
$0 \le P(A) \le 1$
2. The probability of an entire sample space is 1. Thus, if S represents an entire sample space, then
P (S) = 1
3. The probability that one or the other of two mutually exclusive events will occur is equal to the sum of the individual probabilities of these events. Thus,
P(A or B) = P(A) + P(B)
when A and B are mutually exclusive events.
4. The probability that an event does not occur is equal to 1 minus the probability that it occurs. Thus,
$P(\bar{A}) = 1 - P(A)$
where $\bar{A}$ is the non-occurrence of event A.

Example 1: Suppose we have a box with 3 red, 2 black and 5 white balls. Each time a ball is
drawn, it is returned to the box. What is the probability of drawing:
a) Either a red or a black ball?
b) Either a white or a black ball?
Solution: The probabilities of drawing a ball of each specific colour are
P (red) = 0.3 P (black) = 0.2 P (white) = 0.5
Applying rule 2, we find
P (red) + P (black) + P (white) = 0.3 + 0.2 + 0.5 = 1
As the colours are mutually exclusive, rule 3 gives the probability of drawing either a red or a black ball as P (red) + P (black) = 0.3 + 0.2 = 0.5. Likewise, the probability of getting either a white ball or a black ball is
P (white) + P (black) = 0.5 + 0.2 = 0.7

9.6 THEOREMS OF PROBABILITY


There are two important theorems of probability viz.,
(i) The addition theorem and
(ii) The multiplication theorem

Addition theorem
If A and B are any two events, then the probability that at least one of them occurs is denoted by $P(A \cup B)$ and is given by

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

where P(A) = probability of the occurrence of event A,
P(B) = probability of the occurrence of event B,
$P(A \cap B)$ = probability of the simultaneous occurrence of events A and B.

Mutually exclusive events have no sample point in common. Therefore, if A and B are two mutually exclusive events, then $A \cap B = \varnothing$, i.e. the intersection of two mutually exclusive events is a null set, and in this case $P(A \cap B) = 0$.
∴ In the case of mutually exclusive events

$$P(A \cup B) = P(A) + P(B)$$

If there are three events A, B and C, the probability of the occurrence of at least one of them is given by

$$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(B \cap C) - P(A \cap C) + P(A \cap B \cap C)$$

If the events are mutually exclusive, then

$$P(A \cup B \cup C) = P(A) + P(B) + P(C)$$

In the case of a finite number, say n, of mutually exclusive events

$$P(A_1 \cup A_2 \cup A_3 \cup \cdots \cup A_n) = P(A_1) + P(A_2) + \cdots + P(A_n)$$

Note : (i) If a number of events $A_1, A_2, \ldots, A_n$ are mutually exclusive and exhaustive, then the sum of the individual probabilities of their happening is equal to 1, i.e.

$$P(A_1) + P(A_2) + \cdots + P(A_n) = 1$$

(ii) If the events are finite in number and mutually exclusive, then the probability of the occurrence of at least one of them is equal to the sum of their individual probabilities.

(iii) The event A and its complement $\bar{A}$ can be considered as mutually exclusive and exhaustive.

$$\therefore\ P(A) + P(\bar{A}) = 1 \ \Rightarrow\ P(\bar{A}) = 1 - P(A)$$

9.7 PROBLEM SOLVED


Example 1
If A, B, C are mutually exclusive and exhaustive events, find P(B) if

$$\frac{1}{3}P(C) = \frac{1}{2}P(A) = P(B)$$

Solution
Since the events are mutually exclusive and exhaustive,
P(A) + P(B) + P(C) = 1

Let $\frac{1}{3}P(C) = \frac{1}{2}P(A) = P(B) = k$

∴ P(C) = 3k, P(A) = 2k, P(B) = k

$$\therefore\ k + 2k + 3k = 1 \ \Rightarrow\ k = \frac{1}{6}$$

$$\therefore\ P(B) = \frac{1}{6},\quad P(A) = \frac{1}{3},\quad P(C) = \frac{1}{2} \quad \text{Ans.}$$

Example 2
One ticket is drawn at random from a bag containing 30 tickets numbered from 1 to 30. Find the probability that
(a) it is a multiple of 5 or 7
(b) it is a multiple of 3 or 5
Solution
One ticket can be drawn out of 30 in $^{30}C_1 = 30$ ways. This is the total number of ways in which the event can take place, i.e. the exhaustive number of cases.
(a) Multiples of 5 are 5, 10, 15, 20, 25, 30.
Multiples of 7 are 7, 14, 21, 28.
Thus there are 6 multiples of 5 and 4 multiples of 7. None of these are common, so the events are mutually exclusive. The probability of having a multiple of 5 or 7 is

$$\frac{6}{30} + \frac{4}{30} = \frac{10}{30} = \frac{1}{3}$$

(b) Multiples of 3 are 3, 6, 9, 12, 15, 18, 21, 24, 27, 30.
Multiples of 5 are 5, 10, 15, 20, 25, 30.
Thus there are 10 multiples of 3 and 6 multiples of 5. However, two numbers, 15 and 30, are common to both sets. Therefore the events are not mutually exclusive. The required probability is

$$\frac{10}{30} + \frac{6}{30} - \left(\frac{10}{30} \times \frac{6}{30}\right) = \frac{16}{30} - \frac{2}{30} = \frac{14}{30} = \frac{7}{15} \quad \text{Ans.}$$
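Both parts of the ticket example can also be verified by listing the favourable tickets directly. The sketch below does this by brute force; the helper name `prob_multiple_of_either` is hypothetical and chosen only for illustration.

```python
from fractions import Fraction

tickets = range(1, 31)  # tickets numbered 1 to 30

def prob_multiple_of_either(a, b):
    """Probability that a randomly drawn ticket is a multiple of a or of b."""
    favourable = [t for t in tickets if t % a == 0 or t % b == 0]
    return Fraction(len(favourable), len(tickets))

p_5_or_7 = prob_multiple_of_either(5, 7)  # mutually exclusive within 1..30
p_3_or_5 = prob_multiple_of_either(3, 5)  # 15 and 30 are counted only once

print(p_5_or_7, p_3_or_5)  # 1/3 7/15
```

Counting each favourable ticket once is equivalent to subtracting the overlap in the addition theorem.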

Example 3
The probability that A will live up to 60 years is $\frac{3}{4}$ and the probability that B will live up to 60 years is $\frac{2}{3}$. What is the probability (i) that both A and B will live up to sixty years, (ii) that both die before reaching 60 years?
Solution
The events in question are independent of each other and the rule of multiplication would be applied.
(i) The probability that both A and B live up to 60 years is

$$P(A \text{ and } B) = P(A) \times P(B) = \frac{3}{4} \times \frac{2}{3} = \frac{6}{12} = \frac{1}{2}$$

(ii) Since the probabilities of the survival of A and B up to 60 years are $\frac{3}{4}$ and $\frac{2}{3}$ respectively, the probabilities of their death before 60 are $\left(1 - \frac{3}{4}\right) = \frac{1}{4}$ and $\left(1 - \frac{2}{3}\right) = \frac{1}{3}$ respectively. The probability that both A and B die before reaching 60 years is therefore

$$\frac{1}{4} \times \frac{1}{3} = \frac{1}{12} \quad \text{Ans.}$$
Example 4
A bag contains 4 white and 6 red balls. Two draws of one ball each are made without replacement. What is the probability that (i) one is red and the other white, (ii) both the balls are red?
Solution
(i) Probability of drawing a red ball in the first draw, $P(A) = \frac{6}{10}$

Probability of drawing a white ball in the second draw given that the first draw has given a red ball, $P(B/A) = \frac{4}{9}$ (since only 9 balls are left in the bag and the four white balls are still there)

Probability of the combined event

$$P(AB) = P(A \cap B) = P(A) \times P(B/A) = \frac{6}{10} \times \frac{4}{9} = \frac{24}{90}$$

But it could also happen that a white ball was drawn in the first draw. Then,
Probability of drawing a white ball in the first draw, $P(A) = \frac{4}{10}$, and
Probability of drawing a red ball in the second draw given that the first draw gave a white ball, $P(B/A) = \frac{6}{9}$

The combined probability of the two events is

$$P(AB) = P(A \cap B) = P(A) \times P(B/A) = \frac{4}{10} \times \frac{6}{9} = \frac{24}{90}$$

Now either of the two situations (drawing a red ball first or drawing a white ball first) would satisfy the conditions of the problem. These two events are mutually exclusive, so the probability that either of the two happens is the sum of the two probabilities:

$$\frac{24}{90} + \frac{24}{90} = \frac{48}{90} = \frac{8}{15} \quad \text{Ans.}$$

Here we have applied both the rule of multiplication and the rule of addition of probability.

However, such problems can be solved very easily with the rules of permutations and combinations. Thus:
(i) Two balls can be drawn out of 10 balls in $^{10}C_2 = \frac{10!}{2!\,8!} = \frac{10 \times 9}{2} = 45$ ways.
(ii) One white ball can be drawn out of 4 white balls in $^{4}C_1 = \frac{4!}{1!\,3!} = 4$ ways.
(iii) One red ball can be drawn out of 6 red balls in $^{6}C_1 = 6$ ways.
(iv) The total number of ways of drawing a white and a red ball is $^{4}C_1 \times {}^{6}C_1 = 4 \times 6 = 24$.
(v) The required probability would be

$$\frac{\text{No. of cases favourable to the event}}{\text{Total no. of ways in which the event can happen}} = \frac{24}{45} = \frac{8}{15}$$

(ii) For both balls to be red,

$$\text{Required Probability} = \frac{^{6}C_2}{^{10}C_2} = \frac{15}{45} = \frac{1}{3}$$
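As a cross-check on both methods, the two draws without replacement can be enumerated exhaustively. This is a minimal sketch, treating every ordered pair of distinct balls as equally likely; the labels `"W"` and `"R"` are illustrative.

```python
from fractions import Fraction
from itertools import permutations

balls = ["W"] * 4 + ["R"] * 6  # 4 white and 6 red balls

# Every ordered pair of distinct balls is equally likely without replacement.
draws = list(permutations(range(len(balls)), 2))

mixed = sum(1 for i, j in draws if balls[i] != balls[j])
both_red = sum(1 for i, j in draws if balls[i] == balls[j] == "R")

p_mixed = Fraction(mixed, len(draws))        # one red and one white
p_both_red = Fraction(both_red, len(draws))  # both red

print(p_mixed, p_both_red)  # 8/15 1/3
```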
Example 5
(a) A committee of 4 persons is to be appointed from 3 officers of the production department, 3 officers of the sales department, 2 officers of the purchase department and 1 cost accountant. Find the probability of forming a committee in the following manner.
(i) There must be one from each category
(ii) It should have at least one from the purchase department
(b) If P(A) = 0.4, P(B) = 0.7 and P(at least one of A and B) = 0.8, find P(only one of A and B).
Solution:
(a) Total number of ways of forming a committee of 4 persons = $^{9}C_4$

(i) Required Probability

$$= \frac{^{3}C_1 \times {}^{3}C_1 \times {}^{2}C_1 \times 1}{^{9}C_4} = \frac{18}{\dfrac{9 \times 8 \times 7 \times 6}{4 \times 3 \times 2 \times 1}} = \frac{18}{126} = \frac{1}{7}$$

(ii) Required Probability
= 1 – Probability that nobody is taken from the purchase department

$$= 1 - \frac{^{7}C_4}{^{9}C_4} = 1 - \frac{\dfrac{7!}{4!\,3!}}{\dfrac{9!}{4!\,5!}} = 1 - \frac{7!}{3!} \times \frac{5!}{9!} = 1 - \frac{20}{72} = \frac{52}{72} = \frac{13}{18}$$

Second method
Required Probability
= Prob. of taking one from the purchase department and three others + Prob. of taking two from the purchase department and two others

$$= \frac{^{2}C_1 \times {}^{7}C_3}{^{9}C_4} + \frac{^{2}C_2 \times {}^{7}C_2}{^{9}C_4} = \frac{70}{126} + \frac{21}{126} = \frac{13}{18}$$

(b) Since $P(A \cup B) = P(A) + P(B) - P(A \cap B)$,
∴ 0.8 = 0.4 + 0.7 – $P(A \cap B)$, so $P(A \cap B) = 0.3$

Required Probability = P(A – B) + P(B – A)
= $P(A) + P(B) - 2P(A \cap B)$
= 0.4 + 0.7 – 2 x 0.3 = 0.5 Ans.

Note that $P(A \cap B) = 0.3 \neq P(A)\,P(B) = 0.28$. Hence the events are not independent.
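The committee counts and the "only one of A and B" result can be confirmed with a short calculation. The sketch below uses `math.comb` and exact fractions; all variable names are illustrative.

```python
from fractions import Fraction
from math import comb

total = comb(9, 4)  # committees of 4 chosen from 9 people

# (a)(i) One member from each category: 3 production, 3 sales,
# 2 purchase officers and 1 cost accountant.
p_one_each = Fraction(comb(3, 1) * comb(3, 1) * comb(2, 1) * comb(1, 1), total)

# (a)(ii) At least one purchase officer, via the complement and directly.
p_complement = 1 - Fraction(comb(7, 4), total)
p_direct = Fraction(comb(2, 1) * comb(7, 3) + comb(2, 2) * comb(7, 2), total)

# (b) Only one of A and B, from P(A), P(B) and P(A or B).
p_a, p_b, p_union = Fraction(4, 10), Fraction(7, 10), Fraction(8, 10)
p_both = p_a + p_b - p_union
p_only_one = p_a + p_b - 2 * p_both

print(p_one_each, p_complement, p_direct, p_only_one)  # 1/7 13/18 13/18 1/2
```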

Example 6
A husband and a wife appear in an interview for two vacancies in the same post. The probability of the husband's selection is 1/7 and that of the wife's selection is 1/5. What is the probability that:
(a) Both of them will be selected;
(b) Only one of them will be selected;
(c) None of them will be selected
Solution:
Let A and B denote the events of the husband's and the wife's selection respectively.
Let P (A) = p1; P(B) = p2

$$\therefore\ p_1 = \frac{1}{7},\quad p_2 = \frac{1}{5}$$

$P(\bar{A})$ = Probability of the husband's rejection

$$= 1 - p_1 = 1 - \frac{1}{7} = \frac{6}{7} = q_1$$

$P(\bar{B})$ = Probability of the wife's rejection

$$= 1 - p_2 = 1 - \frac{1}{5} = \frac{4}{5} = q_2$$

(a) Probability that husband and wife are both selected = P(A) P(B)

$$= \frac{1}{7} \times \frac{1}{5} = \frac{1}{35} \quad \text{Ans.} \quad [\because \text{the events are independent}]$$

(b) Probability that only one of them will be selected

$$= P(A)\,P(\bar{B}) + P(\bar{A})\,P(B) = \frac{1}{7} \times \frac{4}{5} + \frac{6}{7} \times \frac{1}{5} = \frac{10}{35} = \frac{2}{7} \quad \text{Ans.}$$

(c) Probability that both are rejected

$$= P(\bar{A})\,P(\bar{B}) = \frac{6}{7} \times \frac{4}{5} = \frac{24}{35} \quad \text{Ans.}$$

Additional Examples
Example 7: Assume that a card is randomly selected from a deck of 52 playing cards. Find the probability in each of the following cases :
a. Card drawn is a king
b. Either a heart or the queen of spades
c. Card drawn is a "diamond".
Solution :
a. In a deck of playing cards, there are four kings. Hence, the probability of getting a king is 4/52 or 1/13.
b. There are 13 hearts and 1 queen of spades. Hence, the required probability is (13 + 1)/52 or 7/26.
c. Here, the probability of drawing a "diamond" is 13/52 or ¼.

Example 8 : Determine the probability P for each of the following events :


a. At least one head appears in two tosses of a fair coin.
b. The sum 8 appears in a single toss of a pair of fair dice.
c. An ace, a king, a queen or the ten of "hearts" appears in drawing a single card from a deck of 52 playing cards.
Solution :
a. In two tosses of a fair coin, there can be four possibilities : HH, HT, TH and TT. At least one head appears in 3 of these 4 equally likely outcomes. Hence, P = ¾.
b. When a pair of fair dice is tossed, we can get 8 as follows :
(2, 6), (3, 5), (4, 4), (5, 3) and (6, 2)
Each of the six faces of one die can be associated with each of the six faces of the second die, resulting in 6 x 6 = 36 cases. Hence, P = 5/36.
c. The aces, kings, queens and the ten of "hearts" add up to 4 + 4 + 4 + 1 = 13 cards. Hence, the probability is 13/52 = ¼.
Example 9: A letter is chosen at random from the word 'PROFESSOR'.
a. What is the probability that it is a vowel?
b. What is the probability that it is an 'S'?
Solution :
a. The word PROFESSOR contains in all 9 letters, of which 3 are vowels : O, E and O. Hence, the probability that the letter is a vowel is 3/9 = 1/3.
b. The letter S occurs twice among the 9 letters. Hence, the probability that the letter is an 'S' is 2/9.
Example 10: Suppose a fair die has its even-numbered faces painted red and its odd-numbered faces painted white. Consider the experiment of rolling the die and the events
A = (2 or 3 shows up)
B = (A red face shows up)
Find the following probabilities
(a) P(A) (b) P(B) (c) P(AB) (d) P(A/B) (e) P(A or B)
Solution : A fair die has six numbers, 1 to 6, so each number has probability $\frac{1}{6}$ of occurrence.

a. Hence P(A), the probability of 2 or 3 showing up, is $\frac{1}{6} + \frac{1}{6} = \frac{2}{6}$, i.e. $\frac{1}{3}$.
b. The total number of faces painted red is 3 out of the 6 faces of the die. Hence P(B), the probability that a red face shows up, is $\frac{3}{6} = \frac{1}{2}$.
c. P(AB) is the joint probability of A and B. Only the face 2 belongs to both events, so P(AB) = $\frac{1}{6}$. Here this equals the product

$$P(A) \times P(B) = \frac{1}{3} \times \frac{1}{2} = \frac{1}{6}$$

d. $$P(A/B) = \frac{P(AB)}{P(B)} = \frac{1/6}{1/2} = \frac{1}{6} \times \frac{2}{1} = \frac{1}{3}$$

e. P(A or B) = P(A) + P(B) – P(A and B)

$$= \frac{1}{3} + \frac{1}{2} - \frac{1}{6} = \frac{2 + 3 - 1}{6} = \frac{4}{6} = \frac{2}{3}$$
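The five probabilities in the painted-die example can be read off from the sample space using set operations. The following sketch assumes a fair die, so that each face is equally likely; the helper `prob` is a hypothetical name.

```python
from fractions import Fraction

faces = {1, 2, 3, 4, 5, 6}
A = {2, 3}                            # 2 or 3 shows up
B = {f for f in faces if f % 2 == 0}  # a red (even-numbered) face shows up

def prob(event):
    """Classical probability: favourable faces over total faces."""
    return Fraction(len(event), len(faces))

p_a, p_b = prob(A), prob(B)
p_ab = prob(A & B)        # joint probability: only face 2 is in both events
p_a_given_b = p_ab / p_b  # conditional probability P(A/B)
p_a_or_b = prob(A | B)    # at least one of A and B

print(p_a, p_b, p_ab, p_a_given_b, p_a_or_b)  # 1/3 1/2 1/6 1/3 2/3
```

Since `p_ab` equals `p_a * p_b` here, the events happen to be independent, which is why the product rule gives the right joint probability in this example.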
Example 11: A, B and C are bidding for a contract. It is believed that A has exactly half the chance that B has; B, in turn, is 4/5th as likely as C to gain the contract. What is the probability for each to win the contract ?
Solution : Let the probability of C gaining the contract be x. Then,
Probability of B to win is 4/5 of x = 4x/5
Probability of A to win is ½ of 4x/5 = 4x/10

Now 4x/10 + 4x/5 + x = 1 (since the total of the three probabilities should be 1).
Or (20x + 40x + 50x)/50 = 1
Or 110x = 50
∴ x = 50/110 or 5/11
Hence, the probabilities to win for C, B and A are
C = x = 5/11
B = 4x/5 = (4 x 5/11)/5 = (20/11)/5 = 4/11
A = 4x/10 = (4 x 5/11)/10 = (20/11)/10 = 2/11

In decimals, these probabilities are approximately C = 0.455, B = 0.364, A = 0.182

Example 12: Three salesmen, A, B and C have been given a target of selling 10,000 units of
a particular product, the probabilities of their achieving their targets being respectively 0.25,
0.30 and 0.50. If these three salesmen try to sell the product, find the probability of success
of only one salesman and failure of the other two.
Solution: Probabilities are
A B C
0.25 0.30 0.50

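When A succeeds and B and C do not succeed, then
P = 0.25 x (1 – 0.30) x (1 – 0.50)
= 0.25 x 0.70 x 0.50
= 0.0875
Similarly, when B succeeds and A and C do not succeed,
P = (1 – 0.25) x 0.30 x (1 – 0.50) = 0.1125
and when C succeeds and A and B do not succeed,
P = (1 – 0.25) x (1 – 0.30) x 0.50 = 0.2625
Hence, the required probability that one of them succeeds and the other two do not succeed is
P = 0.0875 + 0.1125 + 0.2625
= 0.4625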
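The "exactly one salesman succeeds" total can be reproduced by enumerating all success/failure patterns. This is a sketch that assumes the three salesmen succeed independently, which the multiplication in the solution presupposes; the dictionary name is illustrative.

```python
from itertools import product

# Success probabilities of the three salesmen, from the example.
p_success = {"A": 0.25, "B": 0.30, "C": 0.50}

p_exactly_one = 0.0
for outcome in product([True, False], repeat=3):
    if sum(outcome) != 1:  # keep only patterns with exactly one success
        continue
    p = 1.0
    for succeeded, name in zip(outcome, p_success):
        p *= p_success[name] if succeeded else 1 - p_success[name]
    p_exactly_one += p

print(round(p_exactly_one, 4))  # 0.4625
```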
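Example 13: A sub-committee of 6 members is to be formed out of a group consisting of 7 men and 4 women. Calculate the probability that the sub-committee will consist of
(i) exactly 2 women; and (ii) at least 2 women.

Solution:
(i) Out of 11 persons (7 men and 4 women), a sub-committee of 6 persons can be formed in

$$^{11}C_6 = \frac{11!}{(11-6)!\,6!} = \frac{11 \times 10 \times 9 \times 8 \times 7}{5 \times 4 \times 3 \times 2 \times 1} = 462 \text{ ways}$$

This is the exhaustive number of ways a sub-committee can be formed.

Number of ways for the sub-committee to consist of 4 men and 2 women is

$$^{7}C_4 \times {}^{4}C_2 = \frac{7!}{(7-4)!\,4!} \times \frac{4!}{(4-2)!\,2!} = \frac{7 \times 6 \times 5}{3 \times 2} \times \frac{4 \times 3}{2} = 35 \times 6 = 210$$

Hence, probability is 210/462 or 5/11

(ii) For 3 men and 3 women

$$^{7}C_3 \times {}^{4}C_3 = \frac{7!}{(7-3)!\,3!} \times \frac{4!}{(4-3)!\,3!} = \frac{7 \times 6 \times 5}{3 \times 2} \times 4 = 140$$

Hence, probability is 140/462 or 10/33

For 2 men and 4 women

$$^{7}C_2 \times {}^{4}C_4 = \frac{7!}{(7-2)!\,2!} \times \frac{4!}{4!} = \frac{7 \times 6}{2 \times 1} \times 1 = 21$$

Hence, probability is 21/462 = 1/22

Probability of having at least 2 women is

$$\frac{5}{11} + \frac{10}{33} + \frac{1}{22} = \frac{30 + 20 + 3}{66} = \frac{53}{66}$$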
Example 14: Two computers A and B are to be marketed. A salesman who is assigned a job
of finding customers for them has 60 percent and 40 percent chances respectively of
succeeding in case of computers A and B. The computers can be sold independently. Given
that he was able to sell at least one computer, what is the probability that the computer A has
been sold?
Solution
Let A be the event that the salesman is able to sell computer A.
Let B be the event that the salesman is able to sell computer B.
Given P(A) = 0.60 and P(B) = 0.40, and that the two events A and B are independent,
P(AB) = P(A) . P(B)
= 0.60 x 0.40 = 0.24
Now, the probability of selling at least one computer is given by
P(A or B) = P(A) + P(B) – P(AB)
= 0.60 + 0.40 – 0.24 = 0.76
We have to find the probability of A given (A or B). Since A is contained in the event (A or B),

$$P(A / (A \text{ or } B)) = \frac{P(A)}{P(A \text{ or } B)} = \frac{0.60}{0.76} = 0.7895$$
Example 15 : A manufacturing firm receives shipments of machine parts from two
suppliers A and B. Currently, 65 percent of parts are purchased from supplier A and the
remaining from supplier B. The past record shows that 2 percent of the parts supplied by A
are found defective, whereas 5 percent of the parts supplied by B are found defective. On a
particular day the machine breaks down because a defective part is fitted to it.

9.7 MARGINAL PROBABILITY
Marginal probability is the simple probability of the occurrence of an event. Right in the
beginning, we have given such examples pertaining to the tossing of a coin. When a coin is
tossed, the probability of getting a head is 0.5, so also in the case of getting a tail. These are
known as marginal probabilities as a toss of a fair coin is a statistically independent event.

9.8 JOINT PROBABILITIES


The probability of two or more independent events occurring together is the product of their marginal probabilities.
Symbolically, P (AB) = P(A) x P(B)
Where P (AB) = probability of events A and B occurring together or in succession

P (A) = marginal probability of event A


P (B) = marginal probability of event B
It will be seen that this is the multiplication rule for joint, independent events.
Example: Suppose we toss a fair coin twice. What is the probability of getting two successive
heads?
Solution: P(H1H2) = P(H1) x P(H2) = 0.5 x 0.5 = 0.25
Obviously, the probability of getting two successive tails is also the same. Since P(T) =
P(H) = 0.5. If there are three tosses of a fair coin, then the joint probability of getting three
successive heads will be
P(H1H2H3) = P(H1) x P(H2) x P(H3) = 0.5 x 0.5 x 0.5 = 0.125
Example : Suppose we have an unfair coin that has P(H) = 0.7 and P(T) = 0.3. What is the
probability of getting three successive heads on tossing the coin three times?
Solution : P(H1H2H3) = P(H1) x P(H2) x P(H3) = 0.7 x 0.7 x 0.7 = 0.343
Again, we ask : What is the probability of getting three successive tails on tossing the coin
three times?
P(T1 T2 T3) = P(T1) x P(T2) x P(T3) = 0.3 x 0.3 x 0.3 = 0.027
It may be noted that the two joint probabilities add up to 0.343 + 0.027 = 0.37 only and not to 1. This is because the events H1 H2 H3 and T1 T2 T3 do not form a collectively exhaustive list, though they are mutually exclusive in the sense that if one event occurs then the other cannot occur.
For three tosses of a fair coin, the complete list of eight possible outcomes is:

H1 H2 H3                 H1 T2 H3
H1 T2 T3 (two tails)     T1 H2 H3
T1 T2 H3 (two tails)     T1 T2 T3 (three tails)
H1 H2 T3                 T1 H2 T3 (two tails)

Out of the total eight outcomes, we find that at least two tails occur four times. As the probability of each outcome of three successive tosses of a fair coin is 0.5 x 0.5 x 0.5 = 0.125, the probability of getting at least two tails is
P(H1 T2 T3) + P(T1 T2 H3) + P(T1 H2 T3) + P(T1 T2 T3) = 0.125 + 0.125 + 0.125 + 0.125 = 0.5
We will get the same answer in the case of the joint probability of at least two heads
in three successive tosses.
Example: What is the probability of getting three heads or three tails on three successive
tosses?
Solution: P(H1 H2 H3 or T1 T2 T3) = P(H1 H2 H3) + P(T1 T2 T3)
= 0.125 + 0.125 = 0.25
Since there can be only eight outcomes, of which only one is three successive heads and one is three successive tails, and the eight equally likely outcomes must have probabilities adding up to 1, each outcome has a joint probability of 0.125.
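The eight outcomes of three tosses of a fair coin can be generated mechanically, which makes the joint probabilities above easy to check. This is a sketch assuming a fair coin with independent tosses; the variable names are illustrative.

```python
from itertools import product

# All outcomes of three tosses, e.g. ('H', 'T', 'T').
outcomes = list(product("HT", repeat=3))
p_each = 0.5 ** 3  # each outcome is equally likely for a fair coin

# At least two tails in three tosses.
p_at_least_two_tails = sum(p_each for o in outcomes if o.count("T") >= 2)

# Three heads or three tails (all tosses alike).
p_all_same = sum(p_each for o in outcomes if len(set(o)) == 1)

print(len(outcomes), p_at_least_two_tails, p_all_same)  # 8 0.5 0.25
```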

Probability Tree Diagrams: In order to have a better understanding of these examples, we may construct a probability tree. Figure 1 shows the outcome of tossing a fair coin once. We can have only two possible outcomes – a head or a tail.

Fig 1 : Example of Probability Tree
Fig 2 : Probability Tree of a Partial Second Toss

Assuming that the first toss is a head, then in the second toss the outcome could be either a head or a tail. This is shown in Fig 2.
Now, we assume that the outcome of the first toss is a tail. In this situation, the second toss must originate from the tail. This provides two more branches to the tree, as shown in Fig 3.
We may further extend the tree to depict the outcomes of the third toss. We repeat the same process, and as a result we get what is depicted in Fig 4.
It may be noted that when we toss a coin once, we have two possible outcomes; when we toss it twice, we have four possible outcomes; and when we toss it thrice, we have eight possible outcomes.

Fig 3 : Probability Tree of Two Tosses

Fig 4 : Probability Tree of Three Tosses
9.9 CONDITIONAL PROBABILITIES
The discussion so far was confined to two types of probabilities, marginal or unconditional probability and joint probability. Under statistical independence, only one type of probability remains to be discussed. This is known as the conditional probability. Symbolically, conditional probability is written as P(A/B), which means the probability of event A, given that event B has occurred.
Under independence this may appear contradictory. It may be recalled that independent events are those events whose probabilities are not affected by the occurrence of each other. This means that P(A/B) = P(A). Let us take an example to explain this.
Example: Suppose we are asked : what is the probability that the second toss of a fair coin will result in a tail, given that a tail resulted on the first toss?
Solution : This can be written as P(T2/T1). It should be noted that the outcome of the first event has no influence whatsoever on the outcome of the second event, since the two events are independent. The probability of a tail on the second toss is 0.5. Thus, we can write P(T2/T1) = 0.5.
We may now summarize the three types of probabilities under statistical independence as follows:

Type of Probability    Symbol    Formula
Marginal               P(A)      P(A)
Joint                  P(AB)     P(A) x P(B)
Conditional            P(B/A)    P(B)

In order to understand conditional probability under statistical dependence, let us take an example.
Example : Suppose we have an urn containing ten balls of different colours and patterns such that
2 balls are red and dotted
1 ball is green and dotted
4 balls are red and striped
3 balls are green and striped
What is the probability of drawing any particular ball from this urn?
Solution : In all there are ten balls, each with equal probability of being drawn. The probability of drawing any particular ball from this urn is 0.1. To facilitate our discussion further, the above information is shown in the following table.

Table 1 Colour and Pattern of Ten Balls
Event    Probability of Event    Colour and Pattern
1        0.1                     Red and dotted
2        0.1                     Red and dotted
3        0.1                     Green and dotted
4        0.1                     Red and striped
5        0.1                     Red and striped
6        0.1                     Red and striped
7        0.1                     Red and striped
8        0.1                     Green and striped
9        0.1                     Green and striped
10       0.1                     Green and striped
Example : Suppose we draw a ball from the urn and find it is red, what is the probability that
it is striped?
Solution : Since our problem relates to red balls, we ignore the green balls completely. In all, there are six red balls, of which two are dotted and four are striped. Our problem now boils down to finding the simple probabilities of dotted and striped balls among the red ones. These are as follows :
P(S/R) = 4/6 = 2/3
P(D/R) = 2/6 = 1/3
Total = 1.0
It will be seen that each category of red ball has been divided by the total number of red balls. Since our problem is regarding the striped red balls, the answer is 2/3. This can be shown symbolically:

$$P(S/R) = \frac{P(SR)}{P(R)}$$

Thus, to calculate the probability of a striped ball given a red ball, we divide the probability of red and striped balls by the probability of red balls. The same can be written in a generalized form as

$$P(A/B) = \frac{P(AB)}{P(B)}$$
This is the formula for calculating conditional probability under statistical dependence.
Example : What is the probability of getting a dotted ball given that it is green?
Solution : We know that the total probability of green balls is 0.4 because there are four green balls out of the ten balls. To find the probability of the ball being dotted given that it is green, we have to divide the probability of green and dotted by the probability of green. Thus,

$$P(D/G) = \frac{P(DG)}{P(G)} = \frac{0.1}{0.4} = \frac{1}{4}$$

Similarly, we can determine the probability of drawing a striped ball given that it is green :

$$P(S/G) = \frac{P(SG)}{P(G)} = \frac{0.3}{0.4} = \frac{3}{4}$$

It may be noted again that the two probabilities, ¼ and ¾, taken together add to 1.
Joint Probabilities:
The formula that we used to determine conditional probability under statistical dependence is

$$P(A/B) = \frac{P(AB)}{P(B)}$$

We know that it contains one term, P(AB), which in fact denotes joint probability. We may now rewrite this formula to determine joint probability. This can easily be done by cross multiplication. Thus,
P(AB) = P(A/B) x P(B)
This can be expressed as: the joint probability of events A and B is equal to the probability of event A, given that event B has already occurred, multiplied by the probability of event B.
Example: We now use this formula in our previous example of green and red balls. Suppose we have to find the probability of a red and striped ball.
Solution: P(SR) = P(S/R) x P(R) = 2/3 x 6/10 = 0.4
Similarly, we can calculate the joint probability of other events as well.
P(DR) = P(D/R) x P(R) = 1/3 x 6/10 = 0.2

This shows that the joint probability of dotted and red balls is equal to the product of the probability of dotted balls, given a red ball, and the probability of red balls. This comes to 0.2.
P(DG) = P(D/G) x P(G) = ¼ x 4/10 = 0.1
This is the joint probability of a dotted and green ball.
P(SG) = P(S/G) x P(G) = ¾ x 4/10 = 0.3
This is the joint probability of a striped and green ball.
Marginal Probabilities:
The marginal probability of the event green ball can be determined by adding the probabilities of the joint events in which green ball is contained.
Symbolically, P(G) = P(GD) + P(GS) = 0.1 + 0.3 = 0.4
In the same manner, we can determine the marginal probability of the event red ball by adding the probabilities of the joint events in which red ball is contained.
Symbolically, P(R) = P(RD) + P(RS) = 0.2 + 0.4 = 0.6
So far, we have determined the marginal probabilities of red balls and green balls. Likewise, we can determine the marginal probabilities of dotted balls and striped balls regardless of their colours :
P(D) = P(RD) + P(GD) = 0.2 + 0.1 = 0.3
P(S) = P(RS) + P(GS) = 0.4 + 0.3 = 0.7
It should be noted that these two probabilities add to 1.0, as was also the case in the earlier two calculations. The following table summarizes the probabilities under statistical dependence.

Type of Probability          Symbol    Formula
Marginal or unconditional    P(A)      P(A)
Joint                        P(AB)     P(A/B) x P(B)
Conditional                  P(A/B)    P(AB)/P(B)
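The joint, marginal and conditional probabilities for the urn of ten balls can all be derived from the four counts given in the example. The sketch below shows one way to do so; the function names `joint`, `marginal_colour` and `conditional` are hypothetical.

```python
from fractions import Fraction

# (colour, pattern) -> number of balls in the urn, from the example.
urn = {("red", "dotted"): 2, ("green", "dotted"): 1,
       ("red", "striped"): 4, ("green", "striped"): 3}
total = sum(urn.values())

def joint(colour, pattern):
    """Joint probability of drawing a ball of this colour and pattern."""
    return Fraction(urn[(colour, pattern)], total)

def marginal_colour(colour):
    """Marginal probability: sum of the joint events containing the colour."""
    return sum(joint(c, p) for (c, p) in urn if c == colour)

def conditional(pattern, colour):
    """P(pattern / colour) = P(pattern and colour) / P(colour)."""
    return joint(colour, pattern) / marginal_colour(colour)

print(conditional("striped", "red"), conditional("dotted", "green"))  # 2/3 1/4
print(marginal_colour("red"), marginal_colour("green"))               # 3/5 2/5
```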

9.10 BAYES’ THEOREM
Bayes’ theorem is an important statistical method, which is used in evaluating new
information as well as in revising prior estimates of the probability in the light of that
information. Bayes’ theorem may be viewed as a means of transforming our prior probability
of an event into a posterior probability of that event. Bayes’ theorem, if properly used, makes
it unnecessary to collect huge data over a long period in order to make good decisions on the
basis of probabilities.
Example : Suppose we have two machines, I and II, which are used in the manufacture of
shoes. Let E1 be the event of shoes produced by machine I and E2 be the event that they are
produced by machine II. Machine I produces 60 percent of the shoes and machine II 40
percent. It is also reported that 10 percent of the shoes produced by machine I are
defective as against the 20 percent by machine II. What is the probability that a non-
defective shoe was manufactured by machine I?
Solution : If E1 is the event of the shoe being produced by machine I and A is the event of a non-defective shoe, our problem in symbolic terms is : P(E1/A). That is, given a non-defective shoe, what is the probability that it was produced by machine I?
From our conditional probability formula, the probability P(E1/A) is
P(E1/A) = P(E1A) / P(A)     (i)
But from the theorem on total probabilities, P(A) becomes

$$P(A) = P(AE_1) + P(AE_2) = P(A/E_1)\,P(E_1) + P(A/E_2)\,P(E_2) = \sum_i P(A/E_i)\,P(E_i)$$

Substituting this result in (i) above, we get

$$P(E_1/A) = \frac{P(E_1 A)}{\sum_i P(A/E_i)\,P(E_i)}$$

which may also be written as

$$P(E_1/A) = \frac{P(A/E_1)\,P(E_1)}{\sum_i P(A/E_i)\,P(E_i)}$$

This is called Bayes' theorem.
It may be noted that P(E1) is the probability of a shoe being manufactured by machine I, whereas P(E1/A) is the probability of a shoe being produced by machine I, given that it is a non-defective shoe. The probability P(E1) is called the prior probability and P(E1/A) is called the posterior probability.
Let us set up a table to calculate the probability that a non-defective shoe was produced by machine I.
Computation of Posterior Probabilities :
Event              Prior P(Ei)    Conditional P(A/Ei)    Joint P(EiA)    Posterior P(Ei/A)
(1)                (2)            (3)                    (4)             (5) = (4)/P(A)
Machine I (E1)     0.6            0.9                    0.54            0.54/0.86 = 0.63
Machine II (E2)    0.4            0.8                    0.32            0.32/0.86 = 0.37
Total              1.0                                   P(A) = 0.86     1.00

On the basis of the above table we can say that, given a non-defective shoe, the probability that it was produced by machine I is 0.63 and the probability that it was produced by machine II is 0.37. We can see that there is some revision in the prior probabilities when we apply Bayes' theorem.
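The tabular procedure (prior times conditional gives joint; joint divided by the total gives posterior) translates directly into a few lines of code. The following is a minimal sketch; the function name `posterior` is hypothetical and not taken from the text.

```python
def posterior(priors, likelihoods):
    """Bayes' theorem: posterior P(Ei/A) from priors P(Ei) and P(A/Ei)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(Ei A)
    total = sum(joint)                                    # P(A), total probability
    return [j / total for j in joint]

# Shoe example: machines I and II make 60% and 40% of the shoes,
# and 90% and 80% of their shoes respectively are non-defective.
shoes = posterior([0.6, 0.4], [0.9, 0.8])

print([round(p, 2) for p in shoes])  # [0.63, 0.37]
```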

A Problem with more than Two Elementary Events: The foregoing problem related
to two elementary events. Let us take a problem having three elementary events.

Example : A manufacturing firm is engaged in the production of steel pipes in its three
plants with a daily production of 1,000, 1,500 and 2,500 units respectively. According
to the past experience, it is known that the fractions of defective pipes produced by the
three plants are respectively 0.04, 0.09 and 0.07. If a pipe is selected from a day’s total
production and found to be defective, find out (a) from which plant the defective pipe
has come, and (b) what is the probability that it has come from the second plant?

Solution : Let the probabilities of the possible events be
P(E1) = 1,000/(1,000 + 1,500 + 2,500) = 0.2 – probability that a pipe is manufactured in plant A.

P(E2) = 1,500/(1,000 + 1,500 + 2,500) = 0.3 – probability that a pipe is manufactured in plant B.

P(E3) = 2,500/(1,000 + 1,500 + 2,500) = 0.5 – probability that a pipe is manufactured in plant C.

Let P(D) be the probability that a defective pipe is drawn. Given that the proportions of
the defective pipes coming from the three plants are 0.04, 0.09 and 0.07 respectively,
these are, in fact, the conditional probabilities : P(D/E1) = 0.04; P(D/E2) = 0.09; and
P(D/E3) = 0.07.
Now we can multiply prior probabilities and conditional probabilities in order to obtain
the joint probabilities.
Joint probabilities are
Plant A 0.04 x 0.2 = 0.008
Plant B 0.09 x 0.3 = 0.027
Plant C 0.07 x 0.5 = 0.035
Now we can obtain posterior probabilities by the following calculations :
Plant A = 0.008 / (0.008 + 0.027 + 0.035) = 0.114
Plant B = 0.027 / (0.008 + 0.027 + 0.035) = 0.386
Plant C = 0.035 / (0.008 + 0.027 + 0.035) = 0.500
Table 3 : Computation of Posterior Probabilities
Event   Prior P(Ei)   Conditional P(D/Ei)   Joint P(EiD)          Posterior P(Ei/D)
(1)     (2)           (3)                   (4)                   (5) = (4)/P(D)
E1      0.2           0.04                  0.04 x 0.2 = 0.008    0.008/0.07 = 0.11
E2      0.3           0.09                  0.09 x 0.3 = 0.027    0.027/0.07 = 0.39
E3      0.5           0.07                  0.07 x 0.5 = 0.035    0.035/0.07 = 0.50
Total   1.0                                 P(D) = 0.07           1.00
On the basis of these calculations, we can say that (a) most probably the defective
pipe has come from plant C, and (b) the probability that the defective pipe has come from
the second plant is 0.39.
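
As a cross-check, the pipe example can be worked directly from the daily production figures. The sketch below is illustrative only; the dictionary and variable names are assumptions made for this example, while the production counts and defect fractions are those given in the problem.

```python
# Daily output and fraction of defective pipes for each plant (from the example)
daily_output = {"A": 1000, "B": 1500, "C": 2500}
defect_rate = {"A": 0.04, "B": 0.09, "C": 0.07}

total_output = sum(daily_output.values())

# Prior P(Ei): share of the day's production coming from each plant
priors = {plant: n / total_output for plant, n in daily_output.items()}

# Joint P(EiD) = P(D/Ei) * P(Ei)
joint = {plant: priors[plant] * defect_rate[plant] for plant in priors}

# P(D): overall probability that a randomly selected pipe is defective
p_defective = sum(joint.values())

# Posterior P(Ei/D) by Bayes' theorem
posterior = {plant: joint[plant] / p_defective for plant in joint}

# Plant with the highest posterior probability
most_likely = max(posterior, key=posterior.get)

for plant in posterior:
    print(f"Plant {plant}: prior = {priors[plant]:.1f}, posterior = {posterior[plant]:.3f}")
print(f"P(D) = {p_defective:.2f}; most likely source: plant {most_likely}")
```

An equivalent way to read the result: out of 5,000 pipes, the plants are expected to produce 40, 135 and 175 defective pipes respectively, so the second plant accounts for 135 of the 350 expected defectives.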

9.12 SUMMARY
Probability measures the likelihood of occurrence of an event. The outcomes may have equal
chances in some cases, whereas in other cases they may not.
This unit focuses on conditional probability and joint probability. It also discusses
Bayes’ theorem.
Given the information that the part was bad, using Bayes’ theorem find the probability that it was
supplied by supplier B.

Solution : We have to use Bayes’ theorem to work out the required probability. The necessary
calculations are shown in the following table.

Calculation of Probability

Supplier   Prior         Conditional   Joint             Posterior (Revised)
           Probability   Probability   Probability       Probability
(1)        (2)           (3)           (4) = (2) x (3)   (5) = (4)/0.0305
A          0.65          0.02          0.0130            0.0130/0.0305 = 0.43
B          0.35          0.05          0.0175            0.0175/0.0305 = 0.57
Total                                  0.0305            1.00

9.13 KEY WORDS


Random Experiment
Trial & Event
Exhaustive Outcomes
Mutually Exclusive Events
Marginal Probability
Joint Probability
Conditional Probability

9.14 SELF ASSESSMENT QUESTIONS
1. A sub-committee of 6 members is to be formed out of a group consisting of 7 men and 4 ladies.
Calculate the probability that the sub-committee will consist of (i) exactly 2 ladies, and (ii) at least
2 ladies.
2. There are 3 economists, 4 engineers, 2 statisticians and 1 doctor. A committee of 4 from among
them is to be formed. Find the probability that the committee
(i) Consists of one of each kind
(ii) Has at least one economist
(iii) Has the doctor as a member and three others.
3. (a) A bag contains 6 white, 4 red and 10 black balls. Two balls are drawn at random. Find
the probability that they will both be black.
(b) A bag contains 8 white and 4 red balls. Five balls are drawn at random. What is the
probability that 2 of them are red and 3 white?
4. From a pack of 52 cards, two cards are drawn at random. Find the probability that one is a king
and the other a queen.
5. One bag contains 4 white and 2 black balls. Another contains 3 white and 5 black balls. If one
ball is drawn from each bag, find the probability that
(a) both are white, (b) both are black, and (c) one is white and one is black.
6. A jar contains black and white marbles. Two marbles are chosen without replacement. The
probability of selecting a black marble and then a white marble is 0.34, and the probability of selecting
a black marble on the first draw is 0.47. What is the probability of selecting a white marble on the
second draw, given that the first marble drawn was black?
7. The probability that it is Friday and that a student is absent is 0.03. Since there are 5 school days
in a week, the probability that it is Friday is 0.2. What is the probability that a student is absent
given that today is Friday?
8. A bag contains red and blue marbles. Two marbles are drawn without replacement. The
probability of selecting a red marble and then a blue marble is 0.28. The probability of selecting a red
marble on the first draw is 0.5. What is the probability of selecting a blue marble on the second
draw, given that the first marble drawn was red?
9. A committee consists of four women and three men. The committee will randomly select two
people to attend a conference in Hawaii. Find the probability that both are women.

10. What is the probability that the total of two dice will be greater than 9, given that the first die is a
5?

Problems on probability are solved to provide a better understanding of the concepts.

UNIT 10 : THEORETICAL PROBABILITY DISTRIBUTIONS
AND NORMAL DISTRIBUTION

STRUCTURE
10.0 Objectives
10.1 Introduction
10.2 Basic Definitions
10.3 Properties of Normal Distribution
10.4 The standard Normal Curve
10.5 Equation of the Standard Normal Distribution
10.6 Normal Distribution Problems
10.7 Random Variable
10.8 Types of Probability Distributions
10.9 Binomial Distribution
10.10 Condition necessary for Binomial Distribution
10.11 Problems in Binomial Distribution
10.12 Poisson Distribution
10.13 Problems in Poisson Distribution
10.14 Summary
10.15 Key Words
10.16 Self Assessment Questions
10.17 References

10.0 OBJECTIVES
After studying this unit you should be able to:
* Define Random variable;
* Identify types of probability distribution;
* Explain Binomial distribution;
* Solve problems on Poisson distribution;
* Draw Normal curve;
* Solve problems on Normal distribution and
* Appreciate Excel application of probability distribution.

10.1 INTRODUCTION
Here we will be looking at probability distributions which portray not the frequency
with which values of a distribution actually occur but the probability with which we predict
they will occur. Probability distributions are very important tools for modeling or
representing processes that occur at random, such as customers visiting a website or
accidents on a building site. A probability distribution is a table or an equation that links
each outcome of a statistical experiment with its probability of occurrence.
In studying probability distributions we will look at how they can be derived and how
we can model or represent the chances of different combinations of outcomes using the
same sort of approach as we use to arrange data into frequency distributions.
A probability distribution is very similar to a frequency distribution. Like a frequency
distribution, a probability distribution has a series of categories, but instead of categories
of values it has categories of types of outcomes. The other difference is that each category
has a probability instead of a frequency.
In the same way as a frequency distribution tells us how frequently each type of value
occurs, a probability distribution tells us how probable each type of outcome is.
The Normal Probability Distribution is very common in the field of statistics.
Whenever you measure things like people’s height, weight, salary, opinions or votes, the
graph of the results is very often a normal curve.

Probability Distribution Prerequisites


To understand probability distributions, it is important to understand variables, random
variables, and some notation.

P(X = x) refers to the probability that the random variable X is equal to a particular value,
denoted by x. As an example, P(X = 1) refers to the probability that the random variable X is
equal to 1.

10.2 BASIC DEFINITIONS


The normal distribution refers to a family of continuous probability distributions
described by the normal equation.
Normal Probability Distribution

The Normal Equation


The normal distribution is defined by the following equation:

Normal equation. The value of the random variable Y is:

Y = { 1/[ σ * sqrt(2π) ] } * e^( -(x - μ)^2 / (2σ^2) )

where X is a normal random variable taking the value x, μ is the mean, σ is the standard
deviation, π is approximately 3.14159, and e is approximately 2.71828.

The random variable X in the normal equation is called the normal random variable. The normal
equation is the probability density function for the normal distribution.
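
The normal equation can be evaluated term by term in Python. This is a sketch only: the function name normal_pdf and the sample values (μ = 300, σ = 50) are assumptions chosen for illustration, and the standard library class statistics.NormalDist is used merely as an independent check.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """Height Y of the normal curve at x, computed from the normal equation."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)


# Height of the standard normal curve (mu = 0, sigma = 1) at its centre
print(round(normal_pdf(0.0, 0.0, 1.0), 4))

# Illustrative values: compare the formula with the library implementation
mu, sigma, x = 300.0, 50.0, 350.0
print(normal_pdf(x, mu, sigma))
print(NormalDist(mu, sigma).pdf(x))
```

Note that Y is a density (the height of the curve), not a probability. Probabilities are obtained as areas under the curve.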

The Normal Curve


The graph of the normal distribution depends on two factors - the mean and the standard
deviation. The mean of the distribution determines the location of the center of the graph,
and the standard deviation determines the height and width of the graph. When the standard
deviation is large, the curve is short and wide; when the standard deviation is small, the curve
is tall and narrow. All normal distributions look like a symmetric, bell-shaped curve, as
shown below.

The curve on the left is shorter and wider than the curve on the right, because the curve on
the left has a bigger standard deviation.

Probability and the Normal Curve


The normal distribution is a continuous probability distribution. This has several
implications for probability.
* The total area under the normal curve is equal to 1.
* The probability that a normal random variable X equals any particular value is 0.
* The probability that X is greater than a equals the area under the normal curve bounded
by a and plus infinity (as indicated by the non-shaded area in the figure below).
* The probability that X is less than a equals the area under the normal curve bounded
by a and minus infinity (as indicated by the shaded area in the figure below).

Additionally, every normal curve (regardless of its mean or standard deviation) conforms to
the following “rule”.
* About 68% of the area under the curve falls within 1 standard deviation of the mean.
* About 95% of the area under the curve falls within 2 standard deviations of the mean.
* About 99.7% of the area under the curve falls within 3 standard deviations of the mean.
Collectively, these points are known as the empirical rule or the 68-95-99.7 rule.
Clearly, given a normal distribution, most outcomes will be within 3 standard deviations of
the mean.
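
The three percentages of the empirical rule can be verified numerically. The sketch below relies on a standard identity that is not stated in the text: for any normal curve, the area within k standard deviations of the mean equals erf(k / sqrt(2)). The function name area_within is an assumption made for this example.

```python
import math


def area_within(k):
    """Area under any normal curve within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {area_within(k):.4f}")
```

The result does not depend on the mean or the standard deviation, which is why the rule holds for every normal curve.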

To find the probability associated with a normal random variable, use a graphing
calculator, an online normal distribution calculator, or a normal distribution table. In the
examples below, we illustrate the use of an online tool, Stat Trek’s Normal Distribution
Calculator. The use of normal distribution tables is demonstrated in Sections 10.4 and 10.5.
Example 1
An average light bulb manufactured by the Acme Corporation lasts 300 days with a
standard deviation of 50 days. Assuming that bulb life is normally distributed, what is the
probability that an Acme light bulb will last at most 365 days?
Solution: Given a mean life of 300 days and a standard deviation of 50 days, we want to
find the cumulative probability that bulb life is less than or equal to 365 days. Thus, we know
the following:
* The value of the normal random variable is 365 days.
* The mean is equal to 300 days.
* The standard deviation is equal to 50 days.
We enter these values into the Normal Distribution Calculator and compute the
cumulative probability. The answer is: P( X ≤ 365 ) = 0.90. Hence, there is a 90% chance
that a light bulb will burn out within 365 days.

Example 2
Suppose scores on an IQ test are normally distributed. If the test has a mean of 100
and a standard deviation of 10, what is the probability that a person who takes the test will
score between 90 and 110?
Solution: Here, we want to know the probability that the test score falls between 90 and 110.
The “trick” to solving this problem is to realize the following:
P( 90 < X < 110 ) = P( X < 110 ) - P( X < 90 )

We use the Normal Distribution Calculator to compute both probabilities on the
right side of the above equation.
* To compute P( X < 110 ), we enter the following inputs into the calculator: The value
of the normal random variable is 110, the mean is 100, and the standard deviation is
10. We find that P( X < 110 ) is 0.84.
* To compute P( X < 90 ), we enter the following inputs into the calculator: The value
of the normal random variable is 90, the mean is 100, and the standard deviation is 10.
We find that P( X < 90 ) is 0.16.
We use these findings to compute our final answer as follows:
P( 90 < X < 110 ) = P( X < 110 ) - P( X < 90 )
P( 90 < X < 110 ) = 0.84 - 0.16
P( 90 < X < 110 ) = 0.68
Thus, about 68% of the test scores will fall between 90 and 110.
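
Where no online calculator is at hand, the two examples above can be checked with Python's standard library. This is a sketch under the assumption that statistics.NormalDist (available from Python 3.8) is an acceptable substitute for the calculator; the variable names are illustrative.

```python
from statistics import NormalDist

# Example 1: bulb life with mean 300 days and standard deviation 50 days
bulb_life = NormalDist(mu=300, sigma=50)
p_at_most_365 = bulb_life.cdf(365)  # cumulative probability P(X <= 365)

# Example 2: IQ scores with mean 100 and standard deviation 10
iq = NormalDist(mu=100, sigma=10)
p_between = iq.cdf(110) - iq.cdf(90)  # P(90 < X < 110) = P(X < 110) - P(X < 90)

print(f"P(X <= 365)      = {p_at_most_365:.2f}")
print(f"P(90 < X < 110)  = {p_between:.2f}")
```

The method cdf returns the area under the curve to the left of its argument, which is the cumulative probability used in both examples.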

10.3 PROPERTIES OF NORMAL DISTRIBUTION


Figure 1 shows the normal probability distribution :

Fig 1 : The Normal Probability Distribution


Let us see what this figure indicates in terms of the characteristics of the normal distribution.
It indicates the following characteristics.
1. The curve is bell-shaped, that is, it has the same shape on either side of the vertical
line from mean.
2. It has a single peak. As such it is unimodal.
3. The mean is located at the centre of the distribution.
4. The distribution is symmetrical
5. The two tails of the distribution extend indefinitely but never touch the horizontal
axis.
6. Since the normal curve is symmetrical, the lower and upper quartiles are equidistant
from the median, that is, Q3- Median = Median – Q1.
7. The mean, median and mode have the same value, that is, mean = median = mode.
8. The percentage distribution of area under the standard normal curve is broadly as follows
: ±1σ 68.27%; ±2σ 95.44% and ±3σ 99.73%. This was also shown in Fig 7.1.
The units for the standard normal distribution curve are denoted by Z and are called
the Z values or Z scores. They are also called standard units or standard scores. The Z score
is known as a ‘standardized’ variable because it has a zero mean and a standard deviation of
one.
As can be seen from Fig. 10.3, the horizontal axis is labeled Z. The Z values on the right
side of the mean are positive while those on its left side are negative. The Z for a point on the
horizontal axis gives the distance between the mean and that point in terms of the standard
deviation. For example, a point with a value of Z = 1 is one standard deviation to the right
of the mean. Likewise, a
point with a value of Z = -1 is one standard deviation to the left of the mean. It can be seen
that the mean is at the centre and its value has been shown as zero. The area on either side of
the mean is 0.5. Thus, the total area under the curve is 1.

10.4 THE STANDARD NORMAL CURVE

When the area of the standard normal curve is divided into sections by standard
deviations above and below the mean, the area in each section is a known quantity (see Figure
2). As explained earlier, the area in each section is the same as the probability of randomly
drawing a value in that range.

Figure 2. The normal curve and the area under the curve between σ units.

For example, 0.3413 of the curve falls between the mean and one standard deviation
above the mean, which means that about 34 percent of all the values of a normally distributed
variable are between the mean and one standard deviation above it. It also means that there is
a 0.3413 chance that a value drawn at random from the distribution will lie between these
two points.
Sections of the curve above and below the mean may be added together to find the
probability of obtaining a value within (plus or minus) a given number of standard deviations
of the mean (see Figure 3). For example, the amount of curve area between one standard
deviation above the mean and one standard deviation below is 0.3413 + 0.3413 = 0.6826,
which means that approximately 68.26 percent of the values lie in that range. Similarly,
about 95 percent of the values lie within two standard deviations of the mean, and 99.7
percent of the values lie within three standard deviations.
Figure 3. The normal curve and the area under the curve between σ units.

In order to use the area of the normal curve to determine the probability of occurrence
of a given value, the value must first be standardized, or converted to a z-score. To convert
a value to a z-score is to express it in terms of how many standard deviations it is above or
below the mean. After the z-score is obtained, you can look up its corresponding probability
in a table. The formula to compute a z-score is

z = (x - μ) / σ

where x is the value to be converted, μ is the population mean, and σ is the population
standard deviation.

Example 1
A normal distribution of retail-store purchases has a mean of $14.31 and a standard
deviation of $6.40. What percentage of purchases were under $10? First, compute the z-score:

z = (10 - 14.31) / 6.40 = -0.67

The next step is to look up the z-score in the table of standard normal probabilities.
The standard normal table lists the probabilities (curve areas) associated with given z-scores.
The table used here gives the area of the curve below z, in other words, the probability
of obtaining a value of z or lower. Not all standard normal tables use the same format,
however. Some list only positive z-scores and give the area of the curve between the mean
and z. Such a table is slightly more difficult to use, but the fact that the normal curve is
symmetric makes it possible to use it to determine the probability associated with any
z-score, and vice versa.

To use the table, first look up the z-score in the left column, which lists z to the first
decimal place. Then look along the top row for the second decimal place. The intersection
of the row and column is the probability. In the example, you first find –0.6 in the left
column and then 0.07 in the top row. Their intersection is 0.2514. The answer, then, is that
about 25 percent of the purchases were under $10.

What if you had wanted to know the percentage of purchases above a certain amount?
Because the table gives the area of the curve below a given z, to obtain the area of the curve
above z, simply subtract the tabled probability from 1. The area of the curve above a z of
–0.67 is 1 – 0.2514 = 0.7486. Approximately 75 percent of the purchases were above $10.

Figure 4. Finding a probability using a z-score on the normal curve.

Example 2

Using the previous example, what purchase amount marks the lower 10 percent of the
distribution?
Locate in the table the probability of 0.1000, or as close as you can find, and read off the
corresponding z-score. The figure that you seek lies between the tabled probabilities of 0.0985
and 0.1003, but closer to 0.1003, which corresponds to a z-score of –1.28. Now, use the z formula,
this time solving for x:

-1.28 = (x - 14.31) / 6.40
x = 14.31 - (1.28)(6.40) = 6.12

Approximately 10 percent of the purchases were below $6.12.
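
Both retail-store examples can be reproduced without a printed table. The following is a sketch only, assuming Python's statistics.NormalDist as a stand-in for the table; the constant and variable names are illustrative. Because the code does not round z to two decimal places, its results differ slightly from the table-based figures.

```python
from statistics import NormalDist

MEAN, SD = 14.31, 6.40        # mean and standard deviation of purchases
std_normal = NormalDist()     # standard normal curve: mean 0, standard deviation 1

# Example 1: convert $10 to a z-score, then find the area below it
z = (10 - MEAN) / SD
p_under_10 = std_normal.cdf(z)

# Example 2: find the z-score with 10 percent of the area below it,
# then solve z = (x - MEAN) / SD for x
z_10 = std_normal.inv_cdf(0.10)
cutoff = MEAN + z_10 * SD

print(f"z for $10            = {z:.2f}")
print(f"P(purchase < $10)    = {p_under_10:.4f}")
print(f"z for the lowest 10% = {z_10:.2f}")
print(f"cut-off amount       = {cutoff:.2f}")
```

The method inv_cdf performs the reverse table look-up: given an area, it returns the z-score.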

10.5 EQUATION OF THE STANDARD NORMAL DISTRIBUTION
The standard normal distribution is a special case of the normal distribution. It
is the distribution that occurs when a normal random variable has a mean of zero and a
standard deviation of one.

Standard Score (aka, z Score)

The normal random variable of a standard normal distribution is called a standard score or
a z-score. Every normal random variable X can be transformed into a z score via the following
equation:
z = (X - μ) / σ
where X is a normal random variable, μ is the mean of X, and σ is the standard deviation
of X.

Standard Normal Distribution Table


A standard normal distribution table shows a cumulative probability associated
with a particular z-score. Table rows show the whole number and tenths place of the z-score.
Table columns show the hundredths place. The cumulative probability (often from minus
infinity to the z-score) appears in the cell of the table.
For example, a section of the standard normal table is reproduced below. To find the
cumulative probability of a z-score equal to -1.31, cross-reference the row of the table
containing -1.3 with the column containing 0.01. The table shows that the probability that a
standard normal random variable will be less than -1.31 is 0.0951; that is, P(Z < -1.31) =
0.0951.

z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09

-3.0 0.0013 0.0013 0.0013 0.0012 0.0012 0.0011 0.0011 0.0011 0.0010 0.0010
... ... ... ... ... ... ... ... ... ... ...
-1.4 0.0808 0.0793 0.0778 0.0764 0.0749 0.0735 0.0722 0.0708 0.0694 0.0681
-1.3 0.0968 0.0951 0.0934 0.0918 0.0901 0.0885 0.0869 0.0853 0.0838 0.0823
-1.2 0.1151 0.1131 0.1112 0.1093 0.1075 0.1056 0.1038 0.1020 0.1003 0.0985
... ... ... ... ... ... ... ... ... ... ...
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
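
Any row of the table above can be regenerated by computing the cumulative probability for each hundredths column. This is an illustrative sketch; the function name table_row is an assumption, and statistics.NormalDist is used to supply the cumulative probabilities.

```python
from statistics import NormalDist


def table_row(row_label):
    """Cumulative probabilities for one table row, columns 0.00 to 0.09.

    For a negative row label the hundredths digit moves z further below
    zero (row -1.3, column 0.01 is z = -1.31); for a positive row label
    it moves z further above zero.
    """
    std_normal = NormalDist()
    sign = -1 if row_label < 0 else 1
    return [round(std_normal.cdf(row_label + sign * h / 100), 4) for h in range(10)]


for label in (-1.4, -1.3, -1.2):
    cells = "  ".join(f"{p:.4f}" for p in table_row(label))
    print(f"{label:5.1f}  {cells}")
```

Reading the second cell of the row labelled -1.3 gives the cumulative probability for z = -1.31 discussed above.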

Of course, you may not be interested in the probability that a standard normal random
variable falls between minus infinity and a given value. You may want to know the probability
that it lies between a given value and plus infinity. Or you may want to know the probability
that a standard normal random variable lies between two given values. These probabilities
are easy to compute from a normal distribution table. Here’s how.
* Find P(Z > a). The probability that a standard normal random variable (Z) is greater
than a given value (a) is easy to find. The table shows P(Z < a), and P(Z > a) = 1 -
P(Z < a).
Suppose, for example, that we want to know the probability that a z-score will be
greater than 3.00. From the table (see above), we find that P(Z < 3.00) = 0.9987.
Therefore, P(Z > 3.00) = 1 - P(Z < 3.00) = 1 - 0.9987 = 0.0013.

* Find P(a < Z < b). The probability that a standard normal random variable lies between
two values is also easy to find: P(a < Z < b) = P(Z < b) - P(Z < a).
For example, suppose we want to know the probability that a z-score will be greater
than -1.40 and less than -1.20. From the table (see above), we find that P(Z < -1.20)
= 0.1151; and P(Z < -1.40) = 0.0808. Therefore, P(-1.40 < Z < -1.20) =
P(Z < -1.20) - P(Z < -1.40) = 0.1151 - 0.0808 = 0.0343.

The Normal Distribution as a Model for Measurements
Often, phenomena in the real world follow a normal (or near-normal) distribution. This
allows researchers to use the normal distribution as a model for assessing probabilities
associated with real-world phenomena. Typically, the analysis involves two steps.
* Transform raw data. Usually, the raw data are not in the form of z-scores. They need to
be transformed into z-scores, using the transformation equation presented earlier:
z = (X - μ) / σ.
* Find probability. Once the data have been transformed into z-scores, you can use standard
normal distribution tables, online calculators (e.g., Stat Trek’s free normal distribution
calculator), or handheld graphing calculators to find probabilities associated with the
z-scores.
The problem in the next section demonstrates the use of the normal distribution as a
model for measurement.
Problem 1

Molly earned a score of 940 on a national achievement test. The mean test score was
850 with a standard deviation of 100. What proportion of students had a higher score than
Molly? (Assume that test scores are normally distributed.)
(A) 0.10
(B) 0.18
(C) 0.50
(D) 0.82
(E) 0.90

Solution
The correct answer is B. As part of the solution to this problem, we assume that test
scores are normally distributed. In this way, we use the normal distribution as a model for
measurement. Given an assumption of normality, the solution involves three steps.
* First, we transform Molly’s test score into a z-score, using the z-score transformation
equation.
z = (X - μ) / σ = (940 - 850) / 100 = 0.90
* Then, using an online calculator (e.g., Stat Trek’s free normal distribution calculator),
a handheld graphing calculator, or the standard normal distribution table, we find the
cumulative probability associated with the z-score. In this case, we find P(Z < 0.90)
= 0.8159.
* Therefore, the P(Z > 0.90) = 1 - P(Z < 0.90) = 1 - 0.8159 = 0.1841.
Thus, we estimate that 18.41 percent of the students tested had a higher score than
Molly.
10.6 NORMAL DISTRIBUTION PROBLEMS
Problems and applications on normal distributions are presented below, each followed by
its answer. X is a normally distributed variable
with mean μ = 30 and standard deviation σ = 4. Find
a) P(x < 40)
b) P(x > 21)
c) P(30 < x < 35)
ANS: Note: What is meant here by area is the area under the standard normal curve.
a) For x = 40, the z-value z = (40 - 30) / 4 = 2.5
Hence P(x < 40) = P(z < 2.5) = [area to the left of 2.5] = 0.9938

b) For x = 21, z = (21 - 30) / 4 = -2.25


Hence P(x > 21) = P(z > -2.25) = [total area] - [area to the left of -2.25]
= 1 - 0.0122 = 0.9878
c) For x = 30 , z = (30 - 30) / 4 = 0 and for x = 35, z = (35 - 30) / 4 = 1.25
Hence P(30 < x < 35) = P(0 < z < 1.25) = [area to the left of z = 1.25] -
[area to the left of 0]
= 0.8944 - 0.5 = 0.3944
1. A radar unit is used to measure the speed of cars on a motorway. The speed is normally
distributed with a mean of 90 km/hr and a standard deviation of 10 km/hr. What is the
probability that a car picked at random is travelling at more than 100 km/hr?
Ans: Let x be the random variable that represents the speed of cars. x has μ = 90 and σ =
10. We have to find the probability that x is higher than 100 or P(x > 100)
For x = 100 , z = (100 - 90) / 10 = 1
P(x > 100) = P(z > 1) = [total area] - [area to the left of z = 1]
= 1 - 0.8413 = 0.1587
The probability that a car selected at a random has a speed greater than 100 km/hr is
equal to 0.1587
2. For a certain type of computer, the length of time between charges of the battery is
normally distributed with a mean of 50 hours and a standard deviation of 15 hours. John
owns one of these computers and wants to know the probability that the length of time
will be between 50 and 70 hours.
ANS: Let x be the random variable that represents the length of time. It has a mean
of 50 and a standard deviation of 15. We have to find the probability that x is between 50
and 70 or P( 50< x < 70)
For x = 50 , z = (50 - 50) / 15 = 0
For x = 70 , z = (70 - 50) / 15 = 1.33 (rounded to 2 decimal places)
P( 50< x < 70) = P( 0< z < 1.33) = [area to the left of z = 1.33] - [area to the left
of z = 0]
= 0.9082 - 0.5 = 0.4082
The probability that John’s computer has a length of time between 50 and 70 hours
is equal to 0.4082.
3. Entry to a certain University is determined by a national test. The scores on this test
are normally distributed with a mean of 500 and a standard deviation of 100. Tom
wants to be admitted to this university and he knows that he must score better than at
least 70% of the students who took the test. Tom takes the test and scores 585. Will
he be admitted to this university?
Ans: Let x be the random variable that represents the scores. x is normally distributed with
a mean of 500 and a standard deviation of 100. The total area under the normal curve
represents the total number of students who took the test. If we multiply the values
of the areas under the curve by 100, we obtain percentages.
For x = 585 , z = (585 - 500) / 100 = 0.85

The proportion P of students who scored below 585 is given by

P = [area to the left of z = 0.85] = 0.8023 = 80.23%

Tom scored better than 80.23% of the students who took the test and he will be admitted
to this University.
4. The annual salaries of employees in a large company are approximately normally
distributed with a mean of $50,000 and a standard deviation of $20,000.
a) What percent of people earn less than $40,000?
b) What percent of people earn between $45,000 and $65,000?
c) What percent of people earn more than $70,000?
Ans: a) For x = 40000, z = -0.5
Area to the left (less than) of z = -0.5 is equal to 0.3085 = 30.85% earn less than
$40,000.
b) For x = 45000 , z = -0.25 and for x = 65000, z = 0.75
Area between z = -0.25 and z = 0.75 is equal to 0.3720 = 37.20% earn
between $45,000 and $65,000.
c) For x = 70000, z = 1
Area to the right (higher) of z = 1 is equal to 0.1587 = 15.87% earn more than $70,000.
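
The answers in this section can be checked with a small helper. This is a hedged sketch: the function prob_between and its parameter names are hypothetical, and statistics.NormalDist takes the place of the printed table. Because the code does not round z to two decimals, the battery answer comes out marginally higher than the table-based 0.4082.

```python
from statistics import NormalDist


def prob_between(mu, sigma, low=None, high=None):
    """P(low < X < high) for a normal variable X; omit a bound for an open tail."""
    dist = NormalDist(mu, sigma)
    lower_area = 0.0 if low is None else dist.cdf(low)    # area to the left of low
    upper_area = 1.0 if high is None else dist.cdf(high)  # area to the left of high
    return upper_area - lower_area


speed = prob_between(90, 10, low=100)                 # car faster than 100 km/hr
battery = prob_between(50, 15, low=50, high=70)       # battery life of 50 to 70 hours
tom = prob_between(500, 100, high=585)                # share of scores below 585
salary_mid = prob_between(50_000, 20_000, low=45_000, high=65_000)

print(f"P(speed > 100)            = {speed:.4f}")
print(f"P(50 < hours < 70)        = {battery:.4f}")
print(f"P(score < 585)            = {tom:.4f}")
print(f"P(45,000 < pay < 65,000)  = {salary_mid:.4f}")
```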

10.7 RANDOM VARIABLE


A random variable is one that can take a number of different values, but it is not
possible to know which value it takes until some experiment is performed. Usually, the
experiment takes the form of drawing a sample from a population. Flipping a coin, rolling a
die, and taking a sample of 600 households are some examples. In other words, a
random variable is a description of the numeric values that the outcomes from an experiment
can take. Like ‘ordinary’ variables, random variables can be classified into two categories:
discrete and continuous.
A random variable is said to be a discrete random variable if its possible
values proceed in steps, with either a finite or an infinite number of steps. It can take a finite
or countable number of possible values. In contrast, if a random variable represents a measurement
on a continuous scale so that all values in an interval are possible, it is called a continuous
random variable. Examples of a continuous random variable are the price of a car and the daily
consumption of milk.

10.8 TYPES OF PROBABILITY DISTRIBUTIONS
There are two types of probability distributions:
* Discrete probability distributions
The probability distribution of a discrete random variable is a list of probabilities
associated with each of its possible values. It is also sometimes called the probability function
or the probability mass function.
More formally, the probability distribution of a discrete random variable X is a function
which gives the probability f(x) that the random variable equals x, for each value x:
f(x) = P(X=x)

It satisfies the following conditions:

a. 0 ≤ f(x) ≤ 1

b. Σ f(x) = 1
* Continuous probability distributions
A continuous probability distribution describes an “unbroken” continuum of possible
occurrences. A random variable is
continuous if it can take any value in an interval. The number of possible values in a range is
infinite, so the probability of any single value is 0.
Example 1: Discrete probability distribution
The number of successful treatments out of 2 patients is discrete, because the random
variable representing the number of successes can be only 0, 1, or 2. The probabilities of all possible
occurrences, P(0 successes), P(1 success) and P(2 successes), constitute the probability
distribution for this discrete random variable.
Example 2: Continuous probability distribution
A birth weight can be anything from 3 lbs to more than 10 lbs.
Thus, the random variable of birth weight is continuous, with an infinite number of possible
points between any two values.
Probability Distribution of a Discrete Random Variable:
The probability distribution of a discrete random variable, say x, is a list of the distinct
numerical values of x along with their associated probabilities. Let us take an example.
Example : Let us take x as the number of heads obtained in three tosses of a fair coin. We
are required to list the numerical values of x along with the corresponding outcomes.

Solution : These values along with the corresponding outcomes are shown in Table 1
Table 1 : List of Outcomes
Outcome Value of x
TTT 0
TTH 1
THT 1
THH 2
HTT 1
HTH 2
HHT 2
HHH 3
x is a variable since in three tosses of the coin, it can take any value 0, 1, 2, or 3.
Further, x is a random variable in the sense that we could not have predicted the value of
the outcome before tossing the coin. It may be noted that for each elementary outcome,
there is only one value of x. However, as we can see, two or more elementary outcomes may
give the same value.
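The listing in Table 1 can also be generated mechanically. The following is a minimal sketch, assuming a fair coin so that all eight outcomes are equally likely; the variable names are illustrative only and do not come from the text.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2**3 = 8 equally likely outcomes of three tosses, e.g. ('T', 'T', 'H').
outcomes = list(product("TH", repeat=3))

# x = number of heads in each outcome; count how many outcomes give each x.
heads_count = Counter(outcome.count("H") for outcome in outcomes)

# Probability distribution of x: P(x) = (number of outcomes giving x) / 8.
distribution = {
    x: Fraction(count, len(outcomes)) for x, count in sorted(heads_count.items())
}

for x, prob in distribution.items():
    print(f"x = {x}: P(x) = {prob}")
```

The counts 1, 3, 3 and 1 show how several elementary outcomes (for example TTH, THT and HTT) collapse onto the same value of x.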

The Expected Value of a Random Variable:


The mean of a discrete variable x is, in fact, the mean of its probability distribution.
The mean of a discrete random variable is also called its expected value. It is denoted by
E(x). When we perform an experiment a number of times, then what is our expectation from
that experiment? The mean is the value that we expect to observe per repetition.
Example : Suppose we are given the following data relating to breakdown of a machine in a
certain company during a given week, wherein x represents the number of breakdowns of a
machine and P(x) represents the probability value of x.
Find out the mean number of breakdowns per week for this machine.
Table 2 : Probability Distribution of Number of Breakdowns
x      0      1      2      3      4
P(x)   0.12   0.20   0.25   0.30   0.13
Solution : In order to attempt this problem, we have to multiply each value of x by its
probability and then add all these products. This has been shown in table given below.
Table 3 : Calculation of Mean for the Probability Distribution of Breakdowns
x      P(x)    x.P(x)
0      0.12    0
1      0.20    0.20
2      0.25    0.50
3      0.30    0.90
4      0.13    0.52
Σ x.P(x)       2.12

Concept of Expected Value : Thus, we find that the sum of these products gives
Σ [x.P(x)] = 2.12, which is the mean. This can be written as µ = Σ [x.P(x)] = 2.12.
On the basis of this calculation, we can say that, on average, this particular
machine is expected to break down 2.12 times per week over a period of time. In
other words, if this machine is used for several weeks, there may be no breakdown
in some weeks, only one breakdown in some other weeks, and so on, but the mean
number of breakdowns is expected to be 2.12 per week over the entire period.
This is the concept of expected value.
Symbolically, E(x) = Σ [x.P(x)], where E(x) = Expected value of a discrete
variable x and x.P(x) = Product of a value of variable x with its probability.
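The calculation of the mean can be sketched in a few lines of code. This is an illustrative sketch only: the function name `expected_value` is an assumption introduced here, and the data are the breakdown probabilities from the example above.

```python
# Probability distribution of weekly breakdowns (values from the example).
breakdowns = [0, 1, 2, 3, 4]
probabilities = [0.12, 0.20, 0.25, 0.30, 0.13]


def expected_value(values, probs):
    """Return E(x) = sum of x * P(x) after checking the probabilities are valid."""
    if any(p < 0 or p > 1 for p in probs):
        raise ValueError("each probability must lie between 0 and 1")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must add up to 1")
    return sum(x * p for x, p in zip(values, probs))


mean_breakdowns = expected_value(breakdowns, probabilities)
print(round(mean_breakdowns, 2))  # 2.12
```

The two checks mirror the conditions a probability distribution must satisfy: every probability lies between 0 and 1, and the probabilities add up to 1.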

It may be noted that expected value can be derived subjectively as well. On the basis
of the experimenter’s own experience and judgment, one may assign probability that the
random variable will take on certain values.
Let us take another example.
Example : An accountant of a company is hoping to receive payment from two outstanding
accounts during the current month. He estimates that there is 0.6 probability of receiving
Rs. 15,000 due from A and 0.75 probability of receiving Rs. 40,000 due from B. What is the
expected cash flow from these two accounts?
Solution
Table 4 : Calculation of Expected Cash Flow
Account   Amount x_i (Rs)   Probability p_i   Expected amount p_i x_i (Rs)
A         15,000            0.60              9,000
B         40,000            0.75              30,000
Total expected value                          39,000
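Written out as a single expression, the expected cash flow is the sum of each amount weighted by its probability. The subscripts A and B are labels introduced here for the two accounts.

```latex
E(\text{cash flow}) = p_A x_A + p_B x_B
                    = 0.60 \times 15{,}000 + 0.75 \times 40{,}000
                    = 9{,}000 + 30{,}000
                    = \text{Rs. } 39{,}000
```

Note that Rs. 39,000 is an average over many such months; in any single month the actual receipt would be Rs. 0, 15,000, 40,000 or 55,000.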

Importance of Expected Value: The concept of expected value is of considerable importance
to management in decision-making. This is because the criteria in decision problems involving
uncertainties are usually the maximization of expected profits, or utility, and the minimization
of expected costs. In Unit 13 on Decision Analysis, we shall discuss these criteria in
detail giving suitable examples.
With this introduction we now turn to the binomial distribution.

10.9 BINOMIAL DISTRIBUTION


The binomial distribution is also known as the Bernoulli distribution in honour of the
Swiss mathematician Jacob Bernoulli (1654-1705) who derived it.
To begin with, we go back to our frequently used example of a fair coin from the last
unit. Assuming that the coin is tossed once, there can be two possibilities: either head
(or success) or tail (or failure). The sum of the probabilities is p + q = 1, where p is the
probability of success and q that of failure. Instead of success and failure we may also say 1 and 0.
Now, assume two coins are tossed together. Then, we can have four possibilities.
1. Both coins falling heads
2. The first coin falling head and the second falling tail
3. The first coin falling tail and the second falling head
4. Both coins falling tails.

Thus, the probability of 2 heads (or 2 successes) = p × p = $p^2$
Probability of one head and one tail = p × q = pq
Probability of one tail and one head = q × p = qp
Probability of 2 tails (or 2 failures) = q × q = $q^2$
Thus, the probabilities of 0, 1 and 2 successes are given by $q^2$, $2qp$ and $p^2$
respectively, that is, by the successive terms of the expansion of the binomial
$(q + p)^2$. In the same manner, if three coins are tossed simultaneously, the probabilities of
0, 1, 2 and 3 successes will respectively be given by the terms $q^3$, $3q^2p$, $3qp^2$ and $p^3$,
being the successive terms of the binomial $(q + p)^3$.

Let us put these results in the following form:
For one coin or event: $(q + p)^1$, that is, $q + p$
For two coins or events: $(q + p)^2$, that is, $q^2 + 2qp + p^2$
For three coins or events: $(q + p)^3$, that is, $q^3 + 3q^2p + 3qp^2 + p^3$
Hence, for n coins or events: $(q + p)^n$, where
$$(q + p)^n = q^n + n q^{n-1} p + \frac{n(n-1)}{2!} q^{n-2} p^2 + \dots + p^n$$
This is known as the binomial distribution.
To analyze a problem using the binomial distribution you have to know the probability
of each outcome, and it must be the same for every trial. In addition, the results of the
trials must be independent of each other.
Words like ‘experiment’ and ‘trial’ are used to describe binomial situations
because of the origins and widespread use of the binomial distribution in science.
Although the distribution has become widely used in many other fields, these scientific
terms have stuck.
It is a discrete probability distribution. Its probability mass function is given by
$$P(X = x) = {}^{n}C_{x}\, q^{n-x} p^{x}, \quad x = 0, 1, 2, \dots, n$$
The binomial distribution is given by
$$(q + p)^n = q^n + {}^{n}C_{1}\, q^{n-1} p + {}^{n}C_{2}\, q^{n-2} p^2 + \dots + p^n$$
The successive terms of the expansion give the probabilities of 0, 1, 2, ..., n successes.
The mean and variance of the distribution are np and npq respectively; n and p are its parameters. It
is a uni-modal distribution. For fixed n, as p increases (or for fixed p, as n increases), the
distribution shifts from left to right.
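The probability mass function and the results for the mean and variance can be checked numerically. The sketch below is illustrative: the function name `binomial_pmf` is an assumption introduced here, and the parameters n = 3, p = 0.5 correspond to the three fair coins discussed earlier.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) = nCx * q**(n - x) * p**x, with q = 1 - p."""
    q = 1 - p
    return comb(n, x) * q ** (n - x) * p ** x


# Three tosses of a fair coin (illustrative choice of parameters).
n, p = 3, 0.5
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

# Mean and variance computed directly from the distribution.
mean = sum(x * prob for x, prob in enumerate(pmf))
variance = sum((x - mean) ** 2 * prob for x, prob in enumerate(pmf))

print(pmf)       # [0.125, 0.375, 0.375, 0.125]
print(mean)      # 1.5, which equals n * p
print(variance)  # 0.75, which equals n * p * q
```

The four probabilities are the terms $q^3$, $3q^2p$, $3qp^2$ and $p^3$ evaluated at p = q = 0.5, and they add up to 1.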
Assumptions under which the binomial distribution can be applied:
i. The experiment should be of dichotomous nature.
ii. The probability of success should remain the same from experiment to experiment.
iii. Experiments should be conducted under identical conditions.

10.10 CONDITIONS NECESSARY FOR BINOMIAL DISTRIBUTION
At this stage, we should know that there are certain conditions that must be fulfilled
by a distribution if it is to be a binomial distribution. These conditions are:
1. It is necessary that each observation is classified in two categories such as success
and failure. For example, if raw material is obtained by a firm from its suppliers, it
may be classified as defective or non-defective on the basis of its normal quality.
Similarly, if a die is thrown, we may call 4, 5 or 6 success and getting 1, 2 or 3 a
failure.
2. It is necessary that the probability of success (or failure) remains the same for each
observation in each trial. Thus the probability of getting head (or tail) must remain
the same in each toss of the experiment. In other words, if the probability of success
(or failure) changes from trial to trial or if the results of each trial are classified in
more than two categories, then it is not possible to use the binomial distribution.
3. The trials or individual observations must be independent of each other. In other words,
no trial should influence the outcome of another trial.

10.11 PROBLEMS IN BINOMIAL DISTRIBUTION

Let us take an example. The general term of the binomial distribution $(q + p)^n$ is
$${}^{n}C_{r}\, q^{n-r} p^{r}, \quad \text{where } {}^{n}C_{r} = \frac{n!}{r!\,(n-r)!}$$
Here ${}^{n}C_{r}$ is the number of ways in which we can get r successes and n − r failures
out of n trials.
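Example : Find the chance of getting 3 successes in 5 trials when the chance of
getting a success in one trial is 2/3.
Solution : Here, n = 5, p = 2/3, q = 1 − p = 1 − 2/3 = 1/3 and r = 3.
Substituting these values in the general term, the required chance is
$${}^{n}C_{r}\, q^{n-r} p^{r} = {}^{5}C_{3} \left(\frac{1}{3}\right)^{5-3} \left(\frac{2}{3}\right)^{3}$$
$$= \frac{5!}{3!\,(5-3)!} \times \frac{1}{3} \times \frac{1}{3} \times \frac{2}{3} \times \frac{2}{3} \times \frac{2}{3}$$
$$= \frac{5 \times 4 \times 3 \times 2 \times 1}{(3 \times 2 \times 1)(2 \times 1)} \times \frac{1}{3} \times \frac{1}{3} \times \frac{2}{3} \times \frac{2}{3} \times \frac{2}{3} = 10 \times \frac{8}{243} = \frac{80}{243} \approx 0.33$$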
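The arithmetic of this example can be checked with exact fractions, which avoids any rounding. This is a small sketch using only the Python standard library; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

n, r = 5, 3
p = Fraction(2, 3)  # chance of success in one trial
q = 1 - p           # chance of failure, 1/3

# General term of the binomial expansion: nCr * q**(n - r) * p**r
chance = comb(n, r) * q ** (n - r) * p ** r

print(chance)         # 80/243
print(float(chance))  # roughly 0.329
```

Working with `Fraction` keeps the result as 80/243 rather than a rounded decimal, which makes it easy to compare with a hand calculation.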

Fitting a Binomial Distribution : On the basis of some given information, if a binomial
distribution is to be fitted, then the following procedure needs to be adopted.
1. Find the values of p and q. When one value is given to us, the other value can be easily
obtained by subtracting the first value from 1.
2. Expand the binomial $(p + q)^n$. It may be noted that the power n will be one less than
the number of terms in the expanded binomial. For example, when n = 5, there will be
6 terms.
3. Multiply each of the expanded binomial terms by the total frequency (N) so that the
expected frequency in each category can be obtained. Let us take an example.
Example : Fit a binomial distribution to following data :

x 0 1 2 3 4
f 28 62 46 10 4
Solution :
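x        f      fx
0        28     0
1        62     62
2        46     92
3        10     30
4        4      16
Total    150    200

$$\text{Mean} = \frac{\sum fx}{\sum f} = \frac{200}{150} = np$$

$$\therefore\; p = \frac{200}{150 \times n} = \frac{200}{150 \times 4} = \frac{200}{600} = \frac{1}{3} \quad (n = 4)$$

The expected binomial frequencies can be obtained from

$$f(r) = N \cdot p(r) = N \times {}^{n}C_{r}\, p^{r} q^{n-r} = 150 \times {}^{4}C_{r} \left(\frac{1}{3}\right)^{r} \left(\frac{2}{3}\right)^{4-r}$$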

Now, to get the binomial frequencies, we have to put r = 0, 1, 2, 3 and 4 in the above
equation. These calculations are shown in the following table.
Table 5 : Calculation of Binomial Frequencies
r    $f(r) = 150 \times {}^{4}C_{r} (1/3)^{r} (2/3)^{4-r}$                        Frequency
0    $f(0) = 150 \times {}^{4}C_{0} (1/3)^{0} (2/3)^{4} = 150 \times 16/81$       30
1    $f(1) = 150 \times {}^{4}C_{1} (1/3)^{1} (2/3)^{3} = 150 \times 32/81$       59
2    $f(2) = 150 \times {}^{4}C_{2} (1/3)^{2} (2/3)^{2} = 150 \times 24/81$       44
3    $f(3) = 150 \times {}^{4}C_{3} (1/3)^{3} (2/3)^{1} = 150 \times 8/81$        15
4    $f(4) = 150 \times {}^{4}C_{4} (1/3)^{4} (2/3)^{0} = 150 \times 1/81$        2

The frequencies of the binomial distribution are shown in the extreme right of the
above table.
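The fitting procedure can be sketched as a short routine. This is a minimal sketch, not a general-purpose implementation: the function name `fit_binomial` is an assumption introduced here, and it assumes the observed values of x run from 0 up to n without gaps, as they do in the example.

```python
from math import comb


def fit_binomial(values, frequencies):
    """Return (p, expected frequencies) for a binomial fitted by matching the mean."""
    n = max(values)              # number of trials, taken as the largest x
    total = sum(frequencies)     # N, the total frequency
    mean = sum(x * f for x, f in zip(values, frequencies)) / total
    p = mean / n                 # from mean = n * p
    q = 1 - p
    expected = [total * comb(n, r) * p ** r * q ** (n - r) for r in values]
    return p, expected


x_values = [0, 1, 2, 3, 4]
observed = [28, 62, 46, 10, 4]

p, expected = fit_binomial(x_values, observed)
print(round(p, 4))                   # 0.3333
print([round(f) for f in expected])  # [30, 59, 44, 15, 2]
```

The rounded expected frequencies add up to 150, the same total as the observed data, which is a quick sanity check on the fit.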
Mean and Standard Deviation of Binomial Distribution:
The mean and standard deviation of such theoretical frequency distributions, where
we know the number of independent events and the probability of the happening of the event
in question, can be very easily calculated. If M stands for the mean of such a distribution, n for
the number of independent events and p for the probability of the happening of the event in a
single trial, then M = np. The value of the standard deviation of the expected frequencies in
such cases is
$$\sigma = \sqrt{npq}$$
Meeting the Conditions for using the Bernoulli Process
Before closing our discussion on the binomial distribution, it must be emphasized
that one should be careful in using the binomial probability. It is necessary to ensure that
conditions specified earlier for binomial distribution are satisfied, particularly conditions 2
and 3. Condition 2 requires that the probability of the outcome of any trial should remain
unchanged for each trial. While this condition is fully met in experiments involving tossing
a coin or rolling a die, in real life it may be difficult to ensure the compliance of this
condition.

Condition 3 requires that the trials of a Bernoulli process must be independent of
each other. This means that the outcome of one trial must not influence in any way the
outcome of any other trial. This condition, too, may not be satisfied in real-life situations. For
example, take the case of interviewing candidates for a certain post in a company. The expert
who is interviewing the candidates may find that the first three candidates are far below the
standard expected. In view of this, he may not remain impartial (as he was earlier) while
interviewing the fourth candidate. This means a violation of condition 3. One can find several
situations of this type in everyday life where compliance with condition 3 becomes extremely
difficult.

10.12 POISSON DISTRIBUTION


Having discussed the binomial distribution in the preceding section, we now turn to the
Poisson distribution, which is also a discrete probability distribution. It was developed by the
French mathematician S.D. Poisson (1781-1840) and is named after him.
Along with the normal and binomial distributions, the Poisson distribution is one of
the most widely used distributions. It is used in quality control statistics to count the number
of defective items, in insurance problems to count the number of casualties, and in
waiting-time problems to count the number of incoming telephone calls, incoming customers
or patients arriving to consult a doctor in a given time period, and so forth. All
these examples have a common feature: they can be described by a discrete random variable
which takes on integer values (0, 1, 2, 3 and so on).
The Characteristics of the Poisson distribution are:
1. The events occur independently. This means that the occurrence of a subsequent
event is not at all influenced by the occurrence of an earlier event.
2. Theoretically, there is no upper limit on the number of occurrences of an event
during a specified time period.
3. The probability of a single occurrence of an event within a specified time period is
proportional to the length of that time period.
4. In an extremely small portion of the time period, the probability of two or more
occurrences of an event is negligible.

10.13 PROBLEMS IN POISSON DISTRIBUTION
Let us take an example to show how Poisson probabilities can be calculated.
Example : Suppose, we have a production process of some item that is manufactured in
large quantities. We find that, in general, the proportion of defective items is p = 0.01. A
random sample of 100 items is selected. What is the probability that there are 2 defective
items in this sample?
Solution :
The Poisson formula is

$$P(x) = \frac{\lambda^{x} e^{-\lambda}}{x!}$$

Where
P(x) = Probability of x occurrences
$\lambda^{x}$ = Lambda (i.e. the mean number of occurrences per interval of time)
raised to the x power
$e^{-\lambda}$ = 2.71828 (being the base of the natural logarithm system), raised to
the negative lambda power
x! = x factorial
Here λ = np = 100 × 0.01 = 1.0
Applying the above formula to the data given

$$P(2) = \frac{(1)^{2} \times (2.71828)^{-1}}{2 \times 1} = \frac{(1)^{2} \times 0.36788}{2} = 0.18394$$
Suppose we want to know the probability of having up to 2 defective items
in that sample of 100 items. We simply add the three figures:
P(0) 0.368
P(1) 0.368
P(2) 0.184
Total 0.92

The answer is 0.92.
Again, if we are interested in knowing the probability of having more than
2 defective items, the answer will be
1 − 0.92 = 0.08
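The three probabilities in the defective-items example can be reproduced with a short function. This is an illustrative sketch; the name `poisson_pmf` is an assumption introduced here, not a library function.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """P(x) = lam**x * e**(-lam) / x!"""
    return lam ** x * exp(-lam) / factorial(x)


lam = 100 * 0.01  # lambda = n * p = 1.0

p_exactly_2 = poisson_pmf(2, lam)
p_up_to_2 = sum(poisson_pmf(x, lam) for x in range(3))  # P(0) + P(1) + P(2)
p_more_than_2 = 1 - p_up_to_2

print(round(p_exactly_2, 5))    # 0.18394
print(round(p_up_to_2, 2))      # 0.92
print(round(p_more_than_2, 2))  # 0.08
```

The probability of "more than 2" is obtained as a complement because the Poisson variable has no upper limit, so the remaining probabilities cannot be added one by one.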
Example : Suppose the probability of dialing a wrong number is 0.05. Then, what
is the probability of dialing exactly 3 wrong numbers in 100 dials?
Solution
p = 0.05
n = 100
λ = np
= 100 x 0.05 = 5
Applying the Poisson formula,

$$P(3) = \frac{(5)^{3} \times (2.71828)^{-5}}{3!} = \frac{125 \times 0.0067}{6} \approx 0.14$$
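Since the 100 dials are in fact 100 trials with a fixed chance of a wrong number, the same question can also be answered with the binomial formula. The sketch below compares the two results; treating the dials as independent trials is an assumption made here for illustration.

```python
from math import comb, exp, factorial

n, p, x = 100, 0.05, 3
lam = n * p  # lambda = 5

# Poisson probability of exactly 3 wrong numbers.
poisson = lam ** x * exp(-lam) / factorial(x)

# Exact binomial probability for the same event.
binomial = comb(n, x) * p ** x * (1 - p) ** (n - x)

print(round(poisson, 4))   # 0.1404
print(round(binomial, 4))  # 0.1396
```

The two values agree to two decimal places, which illustrates why the Poisson formula is a convenient substitute for the binomial when n is large and p is small.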

Example : Fit a Poisson distribution to the following data, which relate to the
number of deaths due to the kick of a horse in 10 corps per army per annum over
20 years.
Deaths 0 1 2 3 4 Total (f)
Frequency 109 65 22 3 1 200

Solution : Calculate the theoretical frequencies.
The theoretical expected frequencies are given by the formula

$$N \times \frac{\lambda^{x} e^{-\lambda}}{x!}$$

Where x = 0, 1, 2, 3 and 4
N = total frequency
λ = mean
e = 2.71828
In order to find the value of λ, we have to calculate the arithmetic mean.
Table : Worksheet for the Data in the Example
Deaths (x) Frequency (f) fx
0 109 0
1 65 65
2 22 44
3 3 9
4 1 4
Total 200 122

Mean = Σfx/N = 122/200 = 0.61

$$N \times \frac{\lambda^{x} e^{-\lambda}}{x!} = \frac{200 \times (0.61)^{x} \times (2.71828)^{-0.61}}{x!}$$

$e^{-0.61}$ = 0.5434
Now for each value of x from 0 to 4, we have to calculate the frequency. This is
shown below :
x    Calculation                                                  f
0    $200 \times 0.5434$                                          108.7
1    $200 \times 0.61 \times 0.5434$                              66.3
2    $200 \times (0.61)^{2} \times 0.5434 / 2$                    20.2
3    $200 \times (0.61)^{3} \times 0.5434 / (3 \times 2)$         4.1
4    $200 \times (0.61)^{4} \times 0.5434 / (4 \times 3 \times 2)$   0.6

Thus, the theoretical frequencies are
Table : Theoretical Frequencies of Data in Example
x   Theoretical frequency (Tf)   Observed frequency (f)
0 109 109
1 66 65
2 20 22
3 4 3
4 1 1
Total 200 200
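The whole fitting procedure for this example can be sketched in a few lines. This is a minimal sketch using the data given above; the variable names are illustrative.

```python
from math import exp, factorial

deaths = [0, 1, 2, 3, 4]
observed = [109, 65, 22, 3, 1]

total = sum(observed)                                       # N = 200
lam = sum(x * f for x, f in zip(deaths, observed)) / total  # mean = 0.61

# Theoretical frequency for each x: N * lam**x * e**(-lam) / x!
theoretical = [total * lam ** x * exp(-lam) / factorial(x) for x in deaths]

print(lam)                                 # 0.61
print([round(f, 1) for f in theoretical])  # [108.7, 66.3, 20.2, 4.1, 0.6]
print([round(f) for f in theoretical])     # [109, 66, 20, 4, 1]
```

The theoretical frequencies lie close to the observed ones, which suggests that the Poisson distribution describes these data well.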

10.14 SUMMARY
This unit deals with the definition and explanation of a random variable. It also describes
the types of probability distribution and gives a note on the binomial and Poisson distributions.
Problems on the binomial and Poisson distributions are solved to give a better understanding of
the concepts.
This unit also covers the normal distribution. The details of the normal curve are given
here. Problems on the normal distribution are also solved in this unit.
10.15 KEY WORDS
Binomial Distribution
Poisson Distribution
Random Variable
Normal Curve
Normal Distribution
Normal Distribution table
10.16 SELF ASSESSMENT QUESTIONS
1. From past experience, a manager of an upscale shoe store knows that 85% of her
customers will use a credit card when making purchases. Suppose three customers
are in line to make a purchase.
a) Does this example satisfy the conditions of a Bernoulli process?
b) Construct a probability tree that delineates all possible values and their associated
probabilities.
c) Using the probability tree, derive the binomial probability distribution.
2. Approximately 20% of U.S. workers are afraid that they will never be able to retire
(bankrate.com, June 23, 2008). Suppose 10 workers are randomly selected.

a) What is the probability that none of the workers is afraid that they will never be
able to retire?
b) What is the probability that at least two of the workers are afraid that they will
never be able to retire?
c) What is the probability that no more than two of the workers are afraid that they
will never be able to retire?
d) Calculate the expected value, the variance, and the standard deviation of this binomial
probability distribution.
3. A small life insurance company has determined that on the average it receives 6 death
claims per day. Find the probability that the company receives at least seven death
claims on a randomly selected day.
4. The number of traffic accidents that occurs on a particular stretch of road during a
month follows a Poisson distribution with a mean of 9.4. Find the probability that
less than two accidents will occur on this stretch of road during a randomly selected
month.
5. An average light bulb manufactured by the Acme Corporation lasts 300 days with a
standard deviation of 50 days. Assuming that bulb life is normally distributed, what
is the probability that an Acme light bulb will last at most 365 days?
6. Suppose scores on an IQ test are normally distributed. If the test has a mean of 100
and a standard deviation of 10, what is the probability that a person who takes the test
will score between 90 and 110?

10.17 REFERENCES
1. Statistics for Management by Richard I. Levin and David S. Rubin
2. Statistical methods by S.P. Gupta
3. Fundamentals of Statistics by S.C. Gupta
4. Advanced Practical Statistics by S.P. Gupta
5. Statistics Theory and Practice by M.S. Shukla and S.S. Gulshan
6. Statistical Methods by J. Medhi

UNIT 11 : INTRODUCTION TO OPERATIONS RESEARCH

STRUCTURE
11.0 Objectives
11.1 Introduction
11.2 Concept of Operation Research (OR)
11.3 Scope of OR
11.4 Phases of OR
11.5 Application of OR
11.6 Limitations of OR
11.7 Summary
11.8 Key words
11.9 Self Assessment Questions
11.10 Reference

11.0 OBJECTIVES
After studying this unit you should be able to:
* Explain the concept of OR;
* Assess the scope and phases of OR;
* Identify the applications of OR and
* Distinguish between models of OR.

11.1 INTRODUCTION
Operations research is research done on operations. It can be viewed as a method, a
tool, a set of techniques or an activity that aids the manager in making a decision. It makes use of
quantitative techniques to provide a better solution to the problem.
The present-day business scenario has changed vastly owing to competition and
increased complexity on all fronts. Globalization has pushed domestic businesses, too,
towards globally set benchmarks. Hence decision making has become a very complex
and dynamic activity. It has to consider the various alternatives both qualitatively and
quantitatively. The various techniques that are used to facilitate decision making are called
quantitative methods, optimization techniques, decision science, operations analysis
or operations research.
The concept of operations research evolved during the Second World War in
England. The major concern at that time was the efficient or optimum use of scarce
war resources, including human resources. As the technique was mainly evolved
for military operations, it is known as operations research. With the conclusion of the
war, the application of OR spread to business to facilitate the optimum use of resources
such as men, machines, money and material.

Meaning and Definitions : -


Operations research attempts to solve many business problems in a quantitative
way. Many authors have defined operations research in their own way.
You can go through the following definitions of operations research.

Operation Research – Meaning and Definitions


(a) Pocock stresses that OR is an applied science; he states “OR is scientific
methodology - analytical, experimental, quantitative - which by assessing the overall
implication of various alternative courses of action in a management system, provides

(b) Morse and Kimball have stressed the quantitative approach of OR and have described
it as “a scientific method of providing executive departments with a quantitative basis
for decisions regarding the operations under their control”.
(c) Miller and Starr see OR as applied decision theory. They state “OR is applied decision
theory. It uses any scientific, mathematical or logical means to attempt to cope with
the problems that confront the executive, when he tries to achieve a thorough-going
rationality in dealing with his decision problem”.
(d) Saaty considers OR as a tool for improving the quality of answers to problems. He says,
“OR is the art of giving bad answers to problems which otherwise have worse answers”.

A few other definitions of OR are as follows:


• “OR is concerned with scientifically deciding how to best design and operate
man-machine system usually requiring the allocation of scarce resources.”
– Operations Research Society of America
• “OR is essentially a collection of mathematical techniques and tools which in
conjunction with system approach, are applied to solve practical decision problems
of an economic or engineering nature’’.
– Daellenbach and George
• “OR utilizes the planned approach (updated scientific method) and an interdisciplinary
team in order to represent complex functional relationships as mathematical models
for the purpose of providing a quantitative analysis’’.
– Thierauf and Klekamp
• “OR is a scientific knowledge through interdisciplinary team effort for the purpose of
determining the best utilization of limited resources.”
– H.A. Taha
• “OR is a scientific approach to problem solving for executive management”.
– H.M. Wagner

By going through the above definitions, we can interpret that Operations Research is a
scientific decision-making tool leading to optimal use of resources. Operations research
helps in taking business decisions more effectively and objectively. Game theory, for
example, considers competition. The transportation problem tries to identify the least-cost
route for transporting items from different sources to different destinations. The assignment

complete the task would be minimized. Operations research also helps in taking an objective
decision which would either maximize the profit or minimize the cost by choosing a good
alternative among the various alternatives available.

11.2 CONCEPT OF OPERATIONS RESEARCH


Operations research attempts to give mathematics-based solutions to the various
problems that are encountered in day-to-day business. The concept of operations research
dates back to the 19th century. In fact, many of the operations research topics being discussed
today evolved well before the term itself came into use. The term
Operations Research was coined during the Second World War. During that war, which
started in 1939, about two decades after the First World War (1914-1918), the countries
involved faced a severe crunch of resources such as men, war materials and money. Hence
they had to make optimum use of the existing resources. Much research was carried out in
this area to find ways to gain the most output with the least input and hence obtain optimal use
of resources. With the conclusion of the Second World War, operations research techniques
found their own place in the business field, where managers are interested in increasing
productivity with the minimum available resources.

11.3 SCOPE OF OPERATIONS RESEARCH


Operations research has applications in various fields. The scope of operations research
is not confined to business; it extends to medicine, research, the military, computers and so
on.
i. In Defence Operations :
In modern warfare, defence operations are carried out by three major independent
components, namely the Air Force, Army and Navy. The activities in each of these
components can be further divided into four sub-components, namely administration,
intelligence, operations and training, and supply. The application of modern warfare
techniques in each of these components of military organisations requires expert
knowledge in the respective fields.
ii. Business Functions :
Each component in a business unit is an independent unit having its own goals. For
example, the production department minimises the cost of production but maximises
output. The marketing department maximises the output but minimises the cost of unit sales.
The finance department tries to optimise the capital investment, and the personnel department
appoints good people at minimum cost.

iii. Planning:
In modern times, it has become necessary for every government to have careful
planning, for economic development of the country. OR techniques can be fruitfully
applied to maximise the per capita income, with minimum sacrifice and time. A
government can thus use OR for framing future economic and social policies.

iv. Agriculture:
With the increase in population, there is a need to increase agricultural output. But this
cannot be done arbitrarily. There are several restrictions.

v. In Industry:
The system of modern industries is so complex that the optimum point of operation
in its various components cannot be intuitively judged by an individual. The business
environment is always changing and any decision useful at one time may not be so
good some time later. There is always a need to check the validity of decisions
continuously against the situations.

vi. In Hospitals:
OR methods can solve waiting problems in the out-patient departments of big hospitals
and administrative problems of hospital organisations. Techniques such as queuing
theory help in determining the number of work stations required, for example the number
of petrol pumps and attendants required in a petrol bunk, or the number of out-patient
counters required in a hospital, based on the estimated arrival of service seekers
and the time required to provide the service.

vii. In Transport:
You can apply different OR methods to regulate the arrival of trains and their processing
times, minimise passengers' waiting time and reduce congestion, and formulate a suitable
transportation policy, thereby reducing the costs and time of trans-shipment.

viii. Research and Development:


You can apply OR methodologies in the field of R&D for several purposes, such as
controlling and planning product introductions.

11.4 PHASES OF OR
Any business problem that needs the intervention of operations research can be solved
in a series of steps. These steps, or rather phases, help in identifying the nature of the business
problem and the appropriate tools or techniques required to solve it. In
many cases the data provided may not be sufficient, and more data can be sought on the
issue. The collected data is used to formulate the objective function, which shows clearly
the objective of solving the problem. Operations research also deals with dynamic programming,
where decisions have to be taken in a dynamic environment. In such cases data collection
and analysis may become redundant. The various phases in solving problems using operations
research are:
1. Judgment Phase: In this first phase, the problem encountered in the real-life
situation is identified. Using proper judgment, the problem is structured in such a way
that the solution supports the organization's objective and gives the decision maker
all the possible information.
2. Research Phase: In this second phase, further data is collected on the basis of the
objective and the given data. Hypotheses are framed and tested. The data collected is
analyzed and verified. Suitable assumptions may be made wherever required.
3. Action Phase: In this last phase, suitable recommendations are made on the basis of
the solution arrived at. Before implementing the solution suggested by the relevant OR
model, one may have to check its compatibility with environmental constraints and
other qualitative issues.

11.5 APPLICATIONS OF OR
Operations research has many applications in various fields of business management.
1. In making decisions about the product to be produced and competitive
strategies
2. In scheduling salesmen's activities such as time, territory and frequency of visits
3. In deciding the type of promotion strategies to be employed
4. In forecasting the business direction
5. In deciding the price of the product with reference to competitors
6. In developing the framework for market research
7. In deciding the product mix and product proportioning
8. In production planning, sequencing and scheduling

9. In transportation, warehousing and physical distribution
10. In material handling and facility planning
11. In assembly line balancing
12. In maintenance and replacement of machines
13. In project planning and scheduling
14. In designing queuing systems for serving customers
15. In maintaining the right size of inventory
16. In planning profit and selecting the optimum dividend policy
17. In portfolio analysis, i.e. product portfolio and investment portfolio
18. In determining the optimum organizational structure
Apart from these, OR also has applications in the fields of defence, public systems such
as government, and industry.

11.6 LIMITATIONS OF OPERATION RESEARCH


Operations Research has a number of applications; similarly, it also has certain
limitations. These limitations are mostly related to model building and to the money and time
involved in its application. Some of them are given below:
i) Distance between O.R. specialist and Manager
An operations researcher's job needs a mathematician or statistician, who might not be
aware of the business problems. Similarly, a manager may be unable to understand the complex
nature of Operations Research. Thus there is a big gap between the two.
ii) Magnitude of Calculations
The aim of O.R. is to find the optimal solution taking all the factors into
consideration. In the modern world these factors are numerous, and expressing them in a
quantitative model and establishing relationships among them requires voluminous
calculations, which can be handled only by machines.
iii) Money and Time Costs
The basic data are subject to frequent changes, and incorporating these changes into
operations research models is very expensive. Moreover, a fairly good solution at present
may be more desirable than a perfect operations research solution available in future
or after some time.

iv) Non-quantifiable Factors
Operations research provides a solution only when all the factors related to a problem
can be quantified. Non-quantifiable factors are not incorporated in O.R. models.
Importantly, O.R. models do not take into account emotional or qualitative factors.
v) Implementation
Once the decision has been taken, it should be implemented. The implementation of
decisions is a delicate task. It must take into account the complexities of human
relations and behaviour, and sometimes only the psychological factors.

11.7 SUMMARY
Operations Research is a relatively new discipline, which originated in World War II
and became very popular throughout the world. India was one of the first few countries in the
world to start using operations research. Operations Research is used successfully not
only in military operations but also in business, government and industry. Nowadays
operations research is used in almost all fields. Proposing a definition of operations
research is difficult, because its boundary and content are not fixed. The tools of
operations research are drawn from subjects such as economics, engineering, mathematics,
statistics, psychology, etc., which help in choosing among possible alternative courses of action.
The operations research tools/techniques include linear programming, non-linear
programming, dynamic programming, integer programming, Markov processes, queuing theory,
etc. Operations Research has a number of applications. Similarly it has a number of limitations,
which basically relate to the time, the money, and the problems involved in model building.
Day by day operations research is gaining more and more acceptance because it improves
the decision making effectiveness of managers. Almost all areas of business use
operations research for decision making.

11.8 KEY WORDS


Operations Research
Dynamic programming
Optimization
Objective functions
Constraints

11.9 SELF ASSESSMENT QUESTIONS
1. Define operations research and explain the concept of operations research
2. Discuss the limitations of OR
3. Describe the scope and applications of OR in the present context
4. Outline the phases of OR
5. Give an account on various models of OR

11.10 REFERENCES
1. Gupta S.P. Business Statistics, New Delhi: S Chand and Sons Publishers, 2000
2. Shahsi Kumar. Quantitative Techniques and methods, Mysuru: Chetana Book House,
2010
3. Vignanesh Prajapathi, Big data Analysis With R and Hadoop, Mumbai: Packt Publishing,
2013
4. SD Sharma, Operation Research, Delhi: Discovery Publishing House, 1997
5. Srinath L. S, PERT and CPM, Delhi: East West Press,2001
6. Kalavathy, Operation Research , New Delhi: Vikas Publishing House, 2010
7. Richard I. Levin. Statistics for Management, New Delhi: Pearson education India, 2008

UNIT 12 : GAME THEORY

STRUCTURE

12.0 Objectives

12.1 Introduction

12.2 Game theory

12.3 The Maximin or Minimax Principle

12.4 Game without saddle Point

12.5 Minimax or Maximin Principle for mixed strategy Games

12.6 Principle of Dominance

12.7 2X2 mixed strategy using arithmetical method

12.8 Summary

12.9 Key Words

12.10 Self Assessment Questions

12.11 References

12.0 OBJECTIVES

After reading this unit you should be able to:


* Interpret the meaning of game theory;
* Solve 2 X 2 matrix problems;
* Identify the dominance property; and
* Solve 2 X m and n X 2 problems graphically.

12.1 INTRODUCTION

Game theory is a branch of applied mathematics and economics that studies situations
where players choose different actions in an attempt to maximize their returns. First developed
as a tool for understanding economic behavior and then by the RAND Corporation to define
nuclear strategies, game theory is now used in many diverse academic fields, ranging from
biology and psychology to sociology and philosophy. Beginning in the 1970s, game theory
has been applied to animal behavior, including species’ development by natural selection.
Because of games like the prisoner’s dilemma, in which rational self-interest hurts everyone,
game theory has been used in political science, ethics and philosophy. Finally, game theory
has recently drawn attention from computer scientists because of its use in artificial
intelligence and cybernetics.

Although similar to decision theory, game theory studies decisions that are made in
an environment where various players interact. In other words, game theory studies choice
of optimal behaviour when costs and benefits of each option are not fixed, but depend upon
the choices of other individuals.

12.2 GAME THEORY

Game theory is a branch of mathematical analysis developed to study decision making
in conflict situations. Such a situation exists when two or more decision makers who have
different objectives act on the same system or share the same resources, for example,
companies selling the same product. There are two-person and multi-person games. Game theory
provides a mathematical process for selecting an optimum strategy (that is, an optimum
decision or a sequence of decisions) in the face of an opponent who has a strategy of his own.
The games studied by game theory are well-defined mathematical objects. A game consists of a
set of players, a set of moves (or strategies) available to those players, and a specification of
payoffs (results) for each combination of strategies.
Game theory is a general theory of rational behaviour for situations in which (1) two
(two-person games) or more (multi-person games) decision makers (players) have available to
them (2) a finite number of courses of action (plays), each leading to (3) a well-defined
outcome or end with gains and losses expressed in terms of numerical payoffs associated with
each combination of courses of action and for each decision maker. The decision makers have
perfect knowledge of the rules of the game, i.e., (1), (2) and (3), but no knowledge about the
opponents' moves, and are rational in the sense of making decisions that optimize their
individual gains. The matrix of payoffs can represent various conflicts. In a zero-sum game
one person wins what the other loses. In other situations, gains and losses may be unequally
distributed, which allows the representation of numerous competitive and conflict situations.
The theory proposes several solutions: for example, in a minimax strategy each participant
minimizes the maximum loss the other can impose on him, while a mixed strategy involves
probabilistic choices. Experiments with such games have revealed conditions for cooperation,
defection and the persistence of conflict. The theory and some of its results have found
applications in economics, management science, bargaining and conflict resolution, among
many areas of interest.

12.2.1 History of game theory


The first known discussion of game theory occurred in a letter written by James
Waldegrave in 1713. In this letter, Waldegrave provides a minimax mixed strategy solution
to a two-person version of the card game le Her. It was not until the publication of Antoine
Augustin Cournot's Researches into the Mathematical Principles of the Theory of Wealth in
1838 that a general game theoretic analysis was pursued. In this work Cournot considers a
duopoly and presents a solution that is a restricted version of the Nash equilibrium.
Although Cournot's analysis is more general than Waldegrave's, game theory did not
really exist as a unique field until John von Neumann published a series of papers in 1928.
While the French mathematician Borel did some earlier work on games, von Neumann can
rightfully be credited as the inventor of game theory. Von Neumann was a brilliant
mathematician whose work was far-reaching, from set theory, to his calculations that were
key to the development of both the atom and hydrogen bombs, and finally to his work developing
computers. Von Neumann's work culminated in the 1944 book Theory of Games and
Economic Behavior by von Neumann and Oskar Morgenstern. This profound work contains the
method for finding optimal solutions for two-person zero-sum games. During this time period,
work on game theory was primarily focused on cooperative game theory, which analyzes optimal
strategies for groups of individuals, presuming that they can enforce agreements between them
about proper strategies.

In 2005, game theorists Thomas Schelling and Robert Aumann won the Bank of Sweden
Prize in Economic Sciences. Schelling worked on dynamic models, early examples of
evolutionary game theory. Aumann contributed more to the equilibrium school, developing
an equilibrium coarsening, correlated equilibrium, and an extensive analysis of the
assumption of common knowledge.

12.2.2 Assumptions of Game theory

In game theory one usually makes the following assumptions:


(1) Each decision maker ("PLAYER") has available to him two or more well-specified
choices or sequences of choices (called "PLAYS").
(2) Every possible combination of plays available to the players leads to a well-defined
end-state (win, loss, or draw) that terminates the game.
(3) A specified payoff for each player is associated with each end-state (a ZERO-SUM
game means that the sum of payoffs to all players is zero in each end-state).
(4) Each decision maker has perfect knowledge of the game and of his opposition; that is,
he knows in full detail the rules of the game as well as the payoffs of all other players.
(5) All decision makers are rational; that is, each player, given two alternatives, will select
the one that yields him the greater payoff.

The last two assumptions, in particular, restrict the application of game theory in real-
world conflict situations. Nonetheless, game theory has provided a means for analyzing
many problems of interest in economics, management science, and other fields.

12.2.3 Uses of game theory

Games in one form or another are widely used in many different academic disciplines.
· Economics and business
· Applications in Biology

· Computer science and logic
· Political science
· Philosophy
· Sociology

12.2.4 Terminologies in Game theory

Player: The competitors in the game are known as players

Strategy: A strategy for a player is defined as a set of rules or alternative courses of action
available to him in advance, by which the player decides the course of action he should adopt.
Strategies can be of two types:
a) Pure strategy: If the players select the same strategy each time, it is referred to as
a pure strategy. In this case each player knows exactly what the other player is going to do.
The objective of the players is to maximize gains or minimize losses.
b) Mixed strategy: When the players use a combination of strategies and each player
is always kept guessing as to which course of action will be selected by the other
player on a particular occasion, this is known as a mixed strategy.

Optimum strategy: A course of action or play which puts the player in the most preferred
position, irrespective of the strategy of his competitors, is called an optimum strategy.

Value of the game: It is the expected payoff of play when all the players of the game
follow their optimum strategies. The game is called fair if the value of the game is
zero and unfair if it is non-zero.

Payoff matrix: When the players select their particular strategies, the payoffs (gains or
losses) can be represented in the form of a matrix called the payoff matrix.

Let player A have m strategies A1, A2, ..., Am and player B have n strategies B1, B2, ..., Bn.
The payoff matrix is written in terms of A, i.e. positive values reflect gains to A and negative
values reflect losses to A. Let aij be the payoff which player A gains from B if player A chooses
strategy Ai and player B chooses strategy Bj. Then the payoff matrix is

                              Player B

                   B1     B2     ...    Bj     ...    Bn

             A1    a11    a12    ...    a1j    ...    a1n
             A2    a21    a22    ...    a2j    ...    a2n
Player A     ...
             Ai    ai1    ai2    ...    aij    ...    ain
             ...
             Am    am1    am2    ...    amj    ...    amn

12.3 THE MAXIMIN OR MINIMAX PRINCIPLE

For player A, the minimum value in each row represents the least gain (payoff) to him if he
chooses that particular strategy. These values are written next to the matrix as row minima.
He will then select the strategy that maximizes his minimum gain.

The choice of player A is called the maximin strategy, and the corresponding gain is the
maximin value (lower value) of the game. It is denoted by $\underline{V}$.

Player B minimizes his maximum loss. The maximum value in each column represents
the maximum loss to him if he chooses that particular strategy. These values are written
below the matrix as column maxima. He will then select the strategy that minimizes his
maximum loss. The choice of player B is called the minimax strategy, and the corresponding
loss is the minimax value (upper value) of the game.

It is denoted by $\overline{V}$.

If the maximin value equals the minimax value, the game has a saddle point and this common
value is the value of the game. Using this principle, it is easy to determine a saddle point
of a matrix.

Rules for determining saddle point


1. Select the minimum element of each row of the payoff matrix
2. Select the greatest element of each column of the payoff matrix
3. If the same element has been identified as row minima as well as column maxima then
it is the saddle point.
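
The three rules above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `find_saddle_points` and the 3 x 3 matrix are assumptions made for the example and do not come from the text. The payoffs are written from player A's point of view.

```python
def find_saddle_points(payoff):
    """Return (row, column, value) for every entry that is both the minimum
    of its row and the maximum of its column (indices start at 0)."""
    row_minima = [min(row) for row in payoff]            # rule 1
    col_maxima = [max(col) for col in zip(*payoff)]      # rule 2
    return [
        (i, j, value)
        for i, row in enumerate(payoff)
        for j, value in enumerate(row)
        if value == row_minima[i] and value == col_maxima[j]   # rule 3
    ]


# Hypothetical payoff matrix (not taken from the text).
example = [
    [1, 3, 1],
    [0, -4, -3],
    [1, 5, -1],
]
saddles = find_saddle_points(example)

# A matrix with no saddle point gives an empty list.
no_saddle = find_saddle_points([[2, -1], [-1, 0]])
```

For the hypothetical matrix, the row minima are 1, -4, -1 and the column maxima are 1, 5, 1, so the two entries equal to 1 in the first row are saddle points and the value of the game is 1.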

Consider the matrix below.

[Payoff matrix not recoverable from the source; rows are player A's options and columns are
player B's options.]

In this matrix, player A's options are listed on the left, while player B's options are
listed on top. We think of A as playing the rows and B as playing the columns. Positive
numbers indicate a win for the row player, while negative numbers indicate a loss for the row
player. Thus, for example, the (p, s) entry represents the outcome if A plays p and B plays s.

In each round of the game, each player's choice is called a strategy. Thus, if A chooses
p, we refer to the p row as player A's strategy.

[Example matrix not recoverable from the source.]
Here we have four saddle points, i.e. payoff values which are row minima as well as
column maxima.

[Example matrix not recoverable from the source.]
Here there is no saddle point.

12.4 GAMES WITHOUT SADDLE POINT

ALGEBRAIC METHOD

If a game does not have a saddle point, it can be solved by either the algebraic or the
arithmetical method. Consider the following example.
1. Determine the optimum strategies and the value of the game for the following payoff
matrix.

                  B
               H      T
       A   H   2     -1
           T  -1      0

The payoff matrix does not have a saddle point. Let player A play H with probability
x and T with probability 1 - x, so that

x + (1 - x) = 1

If player B plays H all the time, then A's expected gain will be

E(A, H) = 2x + (-1)(1 - x) = 3x - 1

Similarly, if player B plays T all the time, then A's expected gain will be

E(A, T) = (-1)x + 0(1 - x) = -x

The best strategy for A is the one which gives an equal gain whether player B
selects H or T. Then

3x - 1 = -x

or 4x = 1, or x = 1/4

The value of the game is 2 x 1/4 + (-1) x 3/4 = -1/4

Similarly, let player B play H with probability y and T with probability 1 - y. Equating
B's expected losses against A's two strategies gives 3y - 1 = -y, so we get

y = 1/4

The solution is
1. Player A should play H and T with probabilities 1/4 and 3/4 respectively. The optimal
strategy for A is {1/4, 3/4}.
2. Player B should play H and T with probabilities 1/4 and 3/4 respectively. The optimal
strategy for B is {1/4, 3/4}.
3. The expected value of the game for A is -1/4.
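
The algebraic method above generalizes to any 2 X 2 game without a saddle point: equating A's expected gains against B's two columns gives closed-form expressions for x, y and the value. The following is a minimal sketch of that calculation; the function name `solve_2x2` and its argument names are illustrative assumptions, and exact fractions are used so the results match the hand calculation.

```python
from fractions import Fraction


def solve_2x2(a11, a12, a21, a22):
    """Mixed-strategy solution of a 2x2 zero-sum game WITHOUT a saddle point.

    Returns (x, y, v):
      x = probability that A plays row 1,
      y = probability that B plays column 1,
      v = value of the game to A.
    """
    denom = Fraction(a11 + a22 - a12 - a21)
    if denom == 0:
        raise ValueError("degenerate game: the equalising equations have no unique solution")
    # From a11*x + a21*(1 - x) = a12*x + a22*(1 - x)
    x = (a22 - a21) / denom
    # From a11*y + a12*(1 - y) = a21*y + a22*(1 - y)
    y = (a22 - a12) / denom
    v = (a11 * a22 - a12 * a21) / denom
    return x, y, v


# The worked example: rows H, T for A; columns H, T for B.
x, y, v = solve_2x2(2, -1, -1, 0)
print(x, y, v)  # 1/4 1/4 -1/4
```

The formulas are valid only when the game has no saddle point; if it has one, the pure strategies at the saddle point are optimal instead.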

12.5 MINIMAX OR MAXIMIN PRINCIPLE FOR MIXED STRATEGY GAMES

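If a game does not have a saddle point, the two players cannot use the maximin-minimax
(pure) strategies as their optimal strategies. Hence the concept of mixed strategy, i.e. instead
of selecting pure strategies only, each player selects among his strategies with certain
probabilities.

Let player A play strategy i with probability x_i and player B play strategy j with
probability y_j. The payoff matrix is as below.

                              B

                   y1     y2     ...    yj     ...    yn
                   1      2      ...    j      ...    n

         x1   1    v11    v12    ...    v1j    ...    v1n
         x2   2    v21    v22    ...    v2j    ...    v2n
   A     ...
         xi   i    vi1    vi2    ...    vij    ...    vin
         ...
         xm   m    vm1    vm2    ...    vmj    ...    vmn

Player A chooses the probabilities $x_i$ ($x_i \ge 0$, $\sum_{i=1}^{m} x_i = 1$), which give
the lower value of the game:

$$\underline{V} = \max_{x} \left[ \min \left( \sum_{i=1}^{m} v_{i1} x_i,\ \sum_{i=1}^{m} v_{i2} x_i,\ \ldots,\ \sum_{i=1}^{m} v_{in} x_i \right) \right]$$

Similarly, player B chooses the probabilities $y_j$ ($y_j \ge 0$, $\sum_{j=1}^{n} y_j = 1$),
which give the upper value:

$$\overline{V} = \min_{y} \left[ \max \left( \sum_{j=1}^{n} v_{1j} y_j,\ \sum_{j=1}^{n} v_{2j} y_j,\ \ldots,\ \sum_{j=1}^{n} v_{mj} y_j \right) \right]$$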
12.6 PRINCIPLE OF DOMINANCE
If the number of strategic alternatives available to a player is more than 2 or 3, it is
difficult to obtain the value of the game and also the saddle point. In that case the principle
of dominance is used to reduce the size of the payoff matrix.

12.6.1 Inferior & Superior Strategies


Consider two strategies a and b for player A whose payoffs are given by (a1, a2, a3, ..., an)
and (b1, b2, b3, ..., bn). If ai >= bi for every i, then a is said to be the superior strategy
and b is said to be the inferior strategy.

Inferior strategies can be removed from the given payoff matrix so that a smaller payoff
matrix is obtained.
12.6.2 Problem
1. Solve this problem using dominance.

                  B
            -4     6     3
      A     -3    -3     4
             2    -3     4

For player B, delete the columns which have higher or equal values when compared to the
corresponding elements of another column.
Similarly, for player A, delete the rows which have lower or equal values when compared to
the corresponding elements of another row.
In this case column 3 is dominated by column 1, so neglect that column.

            -4     6
            -3    -3
             2    -3

Now row 2 is inferior compared to row 3, so neglect row 2.

            -4     6
             2    -3

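Now calculate the row minima and column maxima.

                               Row minima
                 -4      6        -4
                  2     -3        -3
Column maxima     2      6

The maximin value (-3) differs from the minimax value (2), so there is no saddle point. Let us
use mixed strategies. Let player A play row 1 with probability x1 and row 3 with probability x2.

V = -4x1 + 2x2

V = 6x1 - 3x2

x1 + x2 = 1

Equating the two expressions for V:

-4x1 + 2x2 = 6x1 - 3x2

10x1 = 5x2 => x1 = 0.5x2

Substituting in x1 + x2 = 1:

0.5x2 + x2 = 1

1.5x2 = 1

=> x2 = 2/3

=> x1 = 1/3

Player A chooses the mixed strategy

(1/3, 0, 2/3)

Similarly, for player B with probabilities y1 and y2 on columns 1 and 2,
-4y1 + 6y2 = 2y1 - 3y2 and y1 + y2 = 1 give y1 = 3/5 and y2 = 2/5.

Player B chooses the mixed strategy

(3/5, 2/5, 0)

The value of the game is

6x1 - 3x2

= 6 x 1/3 - 3 x 2/3

= 2 - 2 = 0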
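
The deletion of dominated rows and columns in the problem above can be sketched as a small loop. This is a hedged illustration only: the function name `reduce_by_dominance` is an assumption, and the sketch handles dominance by a single row or column only, not dominance by an average of strategies.

```python
def reduce_by_dominance(payoff):
    """Repeatedly delete dominated rows and columns of a payoff matrix
    written from the row player's (maximiser's) point of view.
    Returns the indices (starting at 0) of the surviving rows and columns."""
    rows = list(range(len(payoff)))
    cols = list(range(len(payoff[0])))
    changed = True
    while changed:
        changed = False
        # A row is inferior if some other row is >= it in every remaining column.
        for r in rows:
            if any(k != r and all(payoff[k][c] >= payoff[r][c] for c in cols)
                   for k in rows):
                rows.remove(r)
                changed = True
                break
        # A column is inferior (for the minimising column player) if some other
        # column is <= it in every remaining row.
        for c in cols:
            if any(k != c and all(payoff[r][k] <= payoff[r][c] for r in rows)
                   for k in cols):
                cols.remove(c)
                changed = True
                break
    return rows, cols


game = [
    [-4, 6, 3],
    [-3, -3, 4],
    [2, -3, 4],
]
kept_rows, kept_cols = reduce_by_dominance(game)
reduced = [[game[r][c] for c in kept_cols] for r in kept_rows]
print(kept_rows, kept_cols, reduced)  # [0, 2] [0, 1] [[-4, 6], [2, -3]]
```

The sketch removes row 2 before column 3, which is the reverse of the order used in the text, but the reduced matrix is the same.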

12.7 2X2 MIXED STRATEGY USING ARITHMETICAL METHOD
A 2 X 2 matrix game without a saddle point can also be solved using the arithmetical
method.

Consider the following example.


1. Solve the following problem for mixed strategy using the arithmetical method.

                  H      T

            H     8     -3

            T    -3      1

Solution:

Step 1: Try to obtain a saddle point. The row minima are -3 and -3 and the column maxima
are 8 and 1. No element is both a row minimum and a column maximum, so there is no
saddle point.

Step 2: Compute the differences and the probabilities.

                           B
                       H        T      Difference    Probability

      A     H          8       -3          4         4/(11+4) = 4/15
            T         -3        1         11         11/(11+4) = 11/15

      Difference       4       11
      Probability     4/15    11/15

Take the difference between the two numbers in column I and put it under column II,
i.e., 8 - (-3) = 11
Take the difference between the two numbers in column II and put it under column I,
i.e., -3 - 1 = -4, written as 4 (neglect sign)
Similarly,
Take the difference between the values of row I and put it next to row II, i.e., 8 - (-3) = 11
Take the difference between the values of row II and put it next to row I,
i.e., -3 - 1 = -4, written as 4 (neglect sign)
So,

Player A must use strategy H with probability 4/(4+11) = 4/15 and strategy T with probability
11/15, while player B uses strategy H with probability 4/15 and strategy T with probability 11/15.

The value of the game is obtained from A's probabilities against either strategy of B.

If B plays H:

V = (4 x 8 + 11 x (-3)) / 15 = -1/15

If B plays T:

V = (4 x (-3) + 11 x 1) / 15 = -1/15

The same value is obtained from B's probabilities against either strategy of A.

If A plays H:

V = Rs. (4 x 8 + 11 x (-3)) / 15 = -1/15

If A plays T:

V = Rs. (4 x (-3) + 11 x 1) / 15 = -1/15
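
The arithmetical method can be written out as a short calculation on the four payoffs. The sketch below is illustrative only; the function name `oddments_2x2` is an assumption (the differences are often called oddments), and like the method itself it applies only to 2 X 2 games without a saddle point.

```python
from fractions import Fraction


def oddments_2x2(a11, a12, a21, a22):
    """Arithmetical method for a 2x2 game without a saddle point.

    Returns (p, q, value): p = A's probabilities for rows 1 and 2,
    q = B's probabilities for columns 1 and 2, value = value of the game."""
    # The difference of row 2 is written next to row 1 and vice versa (sign neglected).
    row_diff = (abs(a21 - a22), abs(a11 - a12))
    # The difference of column 2 is written under column 1 and vice versa.
    col_diff = (abs(a12 - a22), abs(a11 - a21))
    p = tuple(Fraction(d, sum(row_diff)) for d in row_diff)
    q = tuple(Fraction(d, sum(col_diff)) for d in col_diff)
    # Expected payoff to A when B plays column 1.
    value = p[0] * a11 + p[1] * a21
    return p, q, value


p, q, value = oddments_2x2(8, -3, -3, 1)
print(p, q, value)
```

For the example this reproduces the probabilities 4/15 and 11/15 for both players and the value -1/15.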

3. Use Dominance to solve the following problem

A B C D E F

A 0 0 0 0 0 0

B 4 2 0 2 1 1

C 4 3 1 3 2 2

D 4 3 7 -5 1 2

E 4 3 4 -1 2 2

F 4 3 3 -2 2 2

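Every element of column E is less than or equal to the corresponding elements of columns A
and B, so column E is superior to columns A and B for player B, and columns A and B are
deleted. Every element of row C is greater than or equal to the corresponding elements of
rows A and B, so row C is superior to rows A and B, and rows A and B are deleted. The
reduced matrix is

                  C      D      E      F

            C     1      3      2      2

            D     7     -5      1      2

            E     4     -1      2      2

            F     3     -2      2      2

Column E dominates column F, so column F is deleted.


                  C      D      E

            C     1      3      2

            D     7     -5      1

            E     4     -1      2

            F     3     -2      2

Row E dominates row F, so row F is deleted.


                  C      D      E

            C     1      3      2

            D     7     -5      1

            E     4     -1      2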

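Now take the average of player B's strategies C and D:

((1 + 3)/2, (7 - 5)/2, (4 - 1)/2)

= (2, 1, 3/2)

Each of these values is less than or equal to the corresponding element of column E,
(2, 1, 2). The average is therefore superior to strategy E for player B, so column E can be
eliminated.

Similarly,

the average of strategies C and D for player A is

((1 + 7)/2, (3 - 5)/2)

= (4, -1)

which is equal to row E, so row E can be eliminated.

Finally we land up with


                  C      D

            C     1      3

            D     7     -5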

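For player A, let x1 and x2 be the probabilities of playing strategies C and D. Then

x1 + 7x2 = V and 3x1 - 5x2 = V

=> x1 + 7x2 = 3x1 - 5x2

=> 2x1 = 12x2

=> x1 = 6x2

Substituting in

x1 + x2 = 1

=> 6x2 + x2 = 1 => 7x2 = 1

=> x2 = 1/7

=> x1 = 6/7

For player A the strategy is (0, 0, 6/7, 1/7, 0, 0).

Similarly, for player B,

y1 + 3y2 = V and 7y1 - 5y2 = V

Solving these we get y1 = 4/7 and y2 = 3/7, i.e. the strategy (0, 0, 4/7, 3/7, 0, 0).

The value of the game is V = x1 + 7x2 = 6/7 + 7/7 = 13/7.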
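
A solution obtained by dominance can be checked against the original 6 X 6 matrix using the lower-value and upper-value expressions of Section 12.5: with A's probabilities fixed, the smallest expected payoff over B's columns is what A can guarantee, and with B's probabilities fixed, the largest expected payoff over A's rows is the most B can lose. The sketch below is a check under these definitions; the function names are illustrative assumptions.

```python
from fractions import Fraction as F

# Original payoff matrix of the problem (rows and columns A to F).
payoff = [
    [0, 0, 0, 0, 0, 0],
    [4, 2, 0, 2, 1, 1],
    [4, 3, 1, 3, 2, 2],
    [4, 3, 7, -5, 1, 2],
    [4, 3, 4, -1, 2, 2],
    [4, 3, 3, -2, 2, 2],
]
x = [F(0), F(0), F(6, 7), F(1, 7), F(0), F(0)]   # player A's mixed strategy
y = [F(0), F(0), F(4, 7), F(3, 7), F(0), F(0)]   # player B's mixed strategy


def guaranteed_gain(payoff, x):
    """Minimum over B's columns of A's expected payoff when A plays x."""
    n_cols = len(payoff[0])
    return min(sum(x[i] * payoff[i][j] for i in range(len(x))) for j in range(n_cols))


def maximum_loss(payoff, y):
    """Maximum over A's rows of B's expected loss when B plays y."""
    return max(sum(row[j] * y[j] for j in range(len(y))) for row in payoff)


lower = guaranteed_gain(payoff, x)
upper = maximum_loss(payoff, y)
print(lower, upper)  # 13/7 13/7
```

Both quantities equal 13/7, which agrees with the value obtained from the reduced 2 X 2 game, so neither player can improve by switching to any of the deleted strategies.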

12.8 SUMMARY

Game theory is a type of decision theory in which one's choice of action is determined
after taking into consideration all possible alternatives available to an opponent playing the
same game.

Game theory is capable of analyzing simple competitive situations. However, there
is a great gap between what the theory can handle and actual situations.

12.9 KEY WORDS

Game Theory
Strategy
Saddle point
Maximin or Minimax
Dominance

12.10 SELF ASSESSMENT QUESTIONS
1. Solve the following game
1 2 3

1 -3 -2 6

2 2 0 2

3 5 -2 -4

2. Solve the following problem both arithmetically & algebraically.


P2
I II

P1 I 1 3
II 4 2

3.
A B C D E
A 4 4 2 -4 -6

B 8 6 8 -4 0

C 10 2 4 10 12

4. Solve this problem


I 2 4

II 2 3

III 3 2

IV -2 6

12.11 REFERENCE
1. Gupta S.P. Business Statistics, New Delhi: S Chand and Sons Publishers, 2000
2. Shahsi Kumar. Quantitative Techniques and methods, Mysuru: Chetana Book House,
2010
3. Vignanesh Prajapathi, Big data Analysis With R and Hadoop, Mumbai: Packt Publishing,
2013
4. SD Sharma, Operation Research, Delhi: Discovery Publishing House, 1997
5. Srinath L. S, PERT and CPM, Delhi: East West Press,2001
6. Kalavathy, Operation Research , New Delhi: Vikas Publishing House, 2010
7. Richard I. Levin. Statistics for Management, New Delhi: Pearson education India, 2008

BLOCK -4
APPLICATION OF OPERATION RESEARCH IN BUSINESS
UNIT-13 : DECISION ANALYSIS

STRUCTURE

13.0 Objectives

13.1 Introduction

13.2 Decision Making

13.3 Decision making under certainty

13.4 Decision making under uncertainty

13.5 Decision making under Risk

13.6 Decision making under Conflict

13.7 Decision Tree Analysis

13.8 Decision making Under Utility

13.9 Summary

13.10 Key Words

13.11 Self Assessment Questions

13.12 References

13.0 OBJECTIVES
After studying this unit, you should be able to :
* Explain role of decision analysis in real life situations;
* Discuss the importance of decision making;
* Examine different conditions under which decision are made and
* Construct decision trees

13.1 INTRODUCTION
Decision making is not a new function. In our daily life, we take decisions on one
occasion or another. These decisions may have a short-term or long-term effect on our lives.
Similarly, the decisions taken by companies or their representatives would certainly affect
the company in the long run.

These decisions may include


* Acquiring another company
* Making or buying parts
* Buying or selling shares
* Introducing a new product into the market
* Fixing wages, etc.

13.2 DECISION MAKING


A decision can be defined as the selection by the decision maker of an act which he/
she considers as best according to some pre-designated standard from among several available
options.
Decision making is a process of selecting the best amongst the available alternatives.
Decision making problem
A decision making problem includes the following elements:
1. Courses of action
A decision is made from a set of alternative courses of action, also called acts or
strategies.
2. States of nature
The states of nature are the events or consequences affecting the decision which are beyond
the control of the decision maker.
3. Uncertainty
In some cases of decision making, all the influencing factors may not be known to the
decision maker. This is called uncertainty.
4. Payoff
Each combination of a course of action and a state of nature yields a result, which is
called the payoff. It may be a gain or a loss.
The payoffs of the alternative courses of action under the different states of nature can
be listed in the form of a table, which is called the payoff table.

State of nature                 Alternative course of action

                   A1      A2      A3      ...     Aj      ...     An

      C1           P11     P12     P13     ...     P1j     ...     P1n
      C2           P21     P22     P23     ...     P2j     ...     P2n
      C3           P31     P32     P33     ...     P3j     ...     P3n
      ...
      Ci           Pi1     Pi2     Pi3     ...     Pij     ...     Pin
      ...
      Cm           Pm1     Pm2     Pm3     ...     Pmj     ...     Pmn

Decision making process


There are different steps to be followed while taking a decision.
Step 1 : Determine the various alternative courses of action.
Step 2 : Identify the possible outcomes.
Step 3 : Determine the payoff function, which describes the consequences resulting from
each decision alternative.
Step 4 : Construct the opportunity loss table. An opportunity loss occurs due to the failure
to adopt the best available course of action.

Consider a fixed state of nature Ci (i = 1, 2, ..., m) for which the payoffs corresponding
to the n courses of action are given by Pi1, Pi2, ..., Pin.

Let Mi be the payoff of the best possible course of action under Ci, i.e. the maximum of
these payoffs. The opportunity loss table is created as below.

State of nature                 Conditional opportunity loss

                   A1        A2        A3        ...     Aj        ...     An

      C1           M1-P11    M1-P12    M1-P13    ...     M1-P1j    ...     M1-P1n
      C2           M2-P21    M2-P22    M2-P23    ...     M2-P2j    ...     M2-P2n
      C3           M3-P31    M3-P32    M3-P33    ...     M3-P3j    ...     M3-P3n
      ...
      Ci           Mi-Pi1    Mi-Pi2    Mi-Pi3    ...     Mi-Pij    ...     Mi-Pin
      ...
      Cm           Mm-Pm1    Mm-Pm2    Mm-Pm3    ...     Mm-Pmj    ...     Mm-Pmn

Decision making Environment


There are mainly 4 situations under which decisions are taken.
a. Decision under certainty
Here all the variables that support the decision are known to the decision maker.
Hence the decision maker can come out with the one decision which is the best.
For example, if we know the transportation cost from each source to each destination,
and the supply and demand of each source and destination, then we can decide how much to
transport from which source to which destination.
Similarly, if we know the cost of manufacturing an item within the plant and the cost of
procuring the same item from outside, then we can decide whether to make or buy the part.
b. Decision under uncertainty
Here not all the factors that would have an effect on the decision are known to the decision
maker, so decision making becomes difficult.
Decision making under uncertainty is like shooting at an enemy in the dark without knowing
whether the enemy is there or not.
Another example of decision making under uncertainty is the telecasting
of advertisements at different times without knowing exactly at what time prospective
customers would watch television.

c. Decision making under risk
This refers to a situation where the decision maker chooses from among several possible
outcomes whose probabilities of occurrence can be stated.
Taking a decision under risk is like shooting at an enemy in the dark knowing that the enemy
is there.
Decision making under risk is involved in launching a new product after a preliminary
survey has been made.
d. Decision under conflict
In many situations the states of nature are neither completely known nor completely
uncertain. Partial knowledge is available, and therefore it may be termed decision
making under partial uncertainty. An example of this is a situation of conflict involving
two or more competitors marketing the same product.
Probabilities may be based on the decision maker's personal opinions about future events,
or on data obtained from market surveys, expert opinions, etc.
When the probability of occurrence of each state of nature can be assessed, the problem
environment is called decision making under risk.
Examples:
* The probability of being dealt a club from a deck of cards is 1/4.
* The probability of rolling a 5 on a die is 1/6.

13.3 DECISION MAKING UNDER CERTAINTY


Certainty is when you have information about the event, you know the outcome of the
event and you are confident of the events happening. We experience certainty about a specific
question when we have a feeling of complete belief or complete confidence in a single
answer to the question.
Decisions such as deciding on a new carpet for the office or installing a new piece of
equipment or promoting an employee to a supervisory position are made with a high level of
certainty.
While there is always some degree of uncertainty about the eventual outcome of such
decisions there is enough clarity about the problem, the situation and the alternatives to
consider the conditions to be certain.

Decision making under certainty is the term used for a situation where, for each decision
alternative, there is only one event and therefore only one outcome for each action. For
example, there is only one possible event for the two possible actions: "do nothing" at a
future cost of $3.00 per unit for 10,000 units, or "rearrange" a facility at a future cost of
$2.80 per unit for the same number of units. A decision matrix (or payoff table) would look
as follows:

Actions            State of Nature (with probability of 1.0)

Do nothing         $30,000 (10,000 units x $3.00)
Rearrange          $28,000 (10,000 units x $2.80)

Note that there is only one State of Nature in the matrix because there is only one
possible outcome for each action (with certainty). The decision is obviously to choose the
action that will result in the most desirable outcome (least cost), that is, to "rearrange".

13.4 DECISION MAKING UNDER UNCERTAINTY


Under this condition the probabilities associated with the occurrence of the different states
of nature are not given, but the payoffs are given.
Different approaches are used here to determine the solution.
13.4.1 Laplace Criterion
This criterion assigns equal probability to all the states of nature and then selects the
alternative which corresponds to the maximum expected payoff (or the minimum expected cost).
For example, suppose you want to buy a lottery ticket worth Rs. 10, and 3 different tickets
are available with rewards of 10,000, 1,00,000 and 10,00,000. Then you would buy the last
one.
Problem :
A super bazaar expects its sales to fall in one of four categories: 300, 350, 400 or 450.
The costs are as given below; a deviation from the ideal inventory level results in additional
cost, either of stocking or of unfulfilled demand.

Customer                  Inventory level
category            A1      A2      A3      A4

E1 - 300             7      12      20      27
E2 - 350            10       9      10      25
E3 - 400            23      20      14      23
E4 - 450            32      24      21      17

The Laplace criterion assumes equal probabilities, i.e., 1/4 for each category.


The expected costs are
E (A1) = 1/4 (7 + 10 + 23 + 32) = 18
E (A2) = 1/4 (12 + 9 + 20 + 24) = 16.25
E (A3) = 1/4 (20 + 10 + 14 + 21) = 16.25
E (A4) = 1/4 (27 + 25 + 23 + 17) = 23
The best inventory level is either A2 or A3, which have the minimum expected cost.
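
The Laplace calculation above can be reproduced with a few lines of Python. This is a minimal sketch; the dictionary layout and the function name `laplace` are assumptions made for illustration, and the payoffs are treated as costs, so the smallest average is best.

```python
# Cost of each inventory level A1..A4 under sales categories E1..E4.
costs = {
    "A1": [7, 10, 23, 32],
    "A2": [12, 9, 20, 24],
    "A3": [20, 10, 14, 21],
    "A4": [27, 25, 23, 17],
}


def laplace(costs):
    """Average the costs with equal probabilities and return the
    averages together with the alternative(s) of least expected cost."""
    expected = {alt: sum(c) / len(c) for alt, c in costs.items()}
    least = min(expected.values())
    return expected, [alt for alt, e in expected.items() if e == least]


expected, best = laplace(costs)
print(expected)  # {'A1': 18.0, 'A2': 16.25, 'A3': 16.25, 'A4': 23.0}
print(best)      # ['A2', 'A3']
```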
13.4.2 The Maximin or Minimax Criterion
It assumes that the worst possible outcome is going to happen, and selects the alternative
which maximizes the minimum payoff.
For example, suppose there are 3 schemes of mutual funds, the cost being the same.

Scheme          Guaranteed Income     Expected Income


Scheme 1             1000                 100000
Scheme 2             1200                  50000
Scheme 3             1100                  75000

Then according to this principle Scheme 2 is selected, as it has the highest guaranteed
(minimum) income.


The method followed is as below.
Step 1 : Determine the minimum payoff of each alternative.
Step 2 : Choose the maximum out of these.
The criterion is maximin (maximum of the minimum) for profits,
or minimax (minimum of the maximum) for costs.

Applying the minimax criterion to the inventory cost table:

          A1      A2      A3      A4
E1         7      12      20      27
E2        10       9      10      25
E3        23      20      14      23
E4        32      24      21      17

Here the maximum cost of each alternative is


A1 - 32, A2 - 24, A3 - 21, A4 - 27
The minimum of these is 21, i.e., A3.
Therefore A3 is selected.
13.4.3 The Maximax or Minimin Criterion
This is based on optimism. Here the decision maker selects the maximum of the maximum
payoffs, or the minimum of the minimum costs. Taking the same example:

          A1      A2      A3      A4

E1         7      12      20      27
E2        10       9      10      25
E3        23      20      14      23
E4        32      24      21      17

Here the minimum cost of each alternative is


A1 - 7, A2 - 9, A3 - 10, A4 - 17
The minimum of all these is 7.
So A1 is selected.
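
The pessimistic (minimax) and optimistic (minimin) rules for a cost table differ only in whether the worst or the best cost of each alternative is compared. The following sketch, with illustrative variable names that are not from the text, applies both to the inventory example.

```python
# Cost of each inventory level A1..A4 under sales categories E1..E4.
costs = {
    "A1": [7, 10, 23, 32],
    "A2": [12, 9, 20, 24],
    "A3": [20, 10, 14, 21],
    "A4": [27, 25, 23, 17],
}

# Pessimistic view: look at the highest cost each alternative can incur.
worst_case = {alt: max(c) for alt, c in costs.items()}
minimax_choice = min(worst_case, key=worst_case.get)

# Optimistic view: look at the lowest cost each alternative can incur.
best_case = {alt: min(c) for alt, c in costs.items()}
minimin_choice = min(best_case, key=best_case.get)

print(worst_case, minimax_choice)  # {'A1': 32, 'A2': 24, 'A3': 21, 'A4': 27} A3
print(best_case, minimin_choice)   # {'A1': 7, 'A2': 9, 'A3': 10, 'A4': 17} A1
```

For a profit table the same idea applies with the roles reversed: maximin takes the largest of the row-wise smallest profits, and maximax the largest of the largest.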
13.4.4 Savage Criterion
This is based on regret or opportunity loss, and calls for selecting the course of action that
minimizes the maximum regret.
Step 1 : Determine the amount of regret for the payoff of each alternative for a particular
event, which is calculated as below.
Regret = (maximum payoff for the ith event - payoff) if the payoffs represent profit
Regret = (payoff - minimum payoff for the ith event) if the payoffs represent cost
Step 2 : Determine the maximum regret for each alternative.
Step 3 : Select the minimum out of these.
Considering the same example:

          A1      A2      A3      A4      Min payoff

E1         7      12      20      27          7

E2        10       9      10      25          9

E3        23      20      14      23         14

E4        32      24      21      17         17

Deduct this minimum payoff from all the elements of that row.

Regret table
                A1      A2      A3      A4
E1               0       5      13      20
E2               1       0       1      16
E3               9       6       0       9
E4              15       7       4       0
Max regret      15       7      13      20

The minimum of these is 7.
So A2 is chosen.
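
The three steps of the Savage criterion can be sketched as follows for a cost table. The function name `savage` and the data layout are illustrative assumptions; for a profit table the regret would instead be the best payoff of the event minus the payoff.

```python
# Cost of each inventory level A1..A4 under sales categories E1..E4.
costs = {
    "A1": [7, 10, 23, 32],
    "A2": [12, 9, 20, 24],
    "A3": [20, 10, 14, 21],
    "A4": [27, 25, 23, 17],
}


def savage(costs):
    """Minimax-regret choice for a table of costs."""
    n_events = len(next(iter(costs.values())))
    # Step 1: regret = cost minus the least cost attainable under that event.
    least = [min(c[e] for c in costs.values()) for e in range(n_events)]
    regret = {alt: [c[e] - least[e] for e in range(n_events)]
              for alt, c in costs.items()}
    # Step 2: maximum regret of each alternative.
    max_regret = {alt: max(r) for alt, r in regret.items()}
    # Step 3: alternative with the smallest maximum regret.
    return regret, max_regret, min(max_regret, key=max_regret.get)


regret, max_regret, choice = savage(costs)
print(max_regret, choice)  # {'A1': 15, 'A2': 7, 'A3': 13, 'A4': 20} A2
```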
13.4.5 Hurwicz Criterion
It stipulates that a decision maker should be neither completely optimistic nor completely
pessimistic.
Steps
1. Choose α as the degree of optimism and 1 - α as the degree of pessimism.
2. Determine the maximum (I) and minimum (II) payoff for each alternative.
3. Calculate N = α x I + (1 - α) x II for each alternative.

Consider the same example.

          E1      E2      E3      E4      I (max)    II (min)

A1         7      10      23      32        32          7

A2        12       9      20      24        24          9

A3        20      10      14      21        21         10

A4        27      25      23      17        27         17

Let us take α = 0.5


N = α x I + (1 - α) x II
A1 : 0.5 x 32 + (1 - 0.5) x 7 = 19.5
A2 : 0.5 x 24 + (1 - 0.5) x 9 = 16.5
A3 : 0.5 x 21 + (1 - 0.5) x 10 = 15.5
A4 : 0.5 x 27 + (1 - 0.5) x 17 = 22
So choose A3, as it has the minimum value (the payoffs here are costs).
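
The Hurwicz index is a weighted average of the largest and smallest payoff of each alternative. The sketch below follows the formula used in the text, N = α x I + (1 - α) x II, with I the maximum and II the minimum; the function name `hurwicz` is an assumption, and the alternative with the smallest index is chosen because the payoffs are costs.

```python
# Cost of each inventory level A1..A4 under sales categories E1..E4.
costs = {
    "A1": [7, 10, 23, 32],
    "A2": [12, 9, 20, 24],
    "A3": [20, 10, 14, 21],
    "A4": [27, 25, 23, 17],
}


def hurwicz(costs, alpha):
    """Weighted average of the maximum (weight alpha) and the
    minimum (weight 1 - alpha) cost of each alternative."""
    index = {alt: alpha * max(c) + (1 - alpha) * min(c) for alt, c in costs.items()}
    return index, min(index, key=index.get)


index, choice = hurwicz(costs, alpha=0.5)
print(index)   # {'A1': 19.5, 'A2': 16.5, 'A3': 15.5, 'A4': 22.0}
print(choice)  # A3
```

With α = 0.5 the maximum and minimum are weighted equally; other values of α may change the ranking.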
13.5 DECISION MAKING UNDER RISK
When a decision maker chooses from several possible alternatives whose probabilities
of occurrence are stated, the decision taken is called a decision under risk.
Here the EMV, or Expected Monetary Value, is calculated.
The expected monetary value for a given course of action is the expected value of the
conditional payoffs for that action.
Step 1 : List the conditional profit for each act/event combination.
Step 2 : Determine the expected conditional profit (conditional profit x probability of the event).
Step 3 : Determine the EMV for each act (the sum of its expected conditional profits).
Step 4 : Choose the act that corresponds to the optimal EMV.
Problem :
A man has the choice of running either a hot snack stall or an ice-cream stall at a seaside
resort. If it is a fairly cool summer he should make Rs. 5000 by running the hot snack stall. If
it is hot he can make a profit of Rs. 1000. On the other hand, by running the ice-cream stall his profit is
Rs. 6,500 for a hot summer and Rs. 1000 if it is cool. There is a 40% chance of the summer
being hot. What should be his choice?
Solution
The pay off table can be constructed as below:
Conditional pay off
Event          Probability    Hot snack    Ice-cream
Cool summer        0.6           5000         1000
Hot summer         0.4           1000         6500

Expected conditional pay off (probability x conditional pay off)

Event    Probability    Hot snack    Ice-cream
Cool         0.6           3000          600
Hot          0.4            400         2600
EMV                        3400         3200

Since the expected monetary value of selling hot snacks is higher, he should opt for the hot snack stall.
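The four EMV steps can be sketched in Python as follows. The function name `expected_monetary_values` and the dictionary layout are assumptions made for this illustration; the figures are those of the stall problem.

```python
def expected_monetary_values(probabilities, payoffs):
    """payoffs[act][event] is the conditional pay off; returns the EMV of each act."""
    return {
        act: sum(probabilities[event] * value for event, value in by_event.items())
        for act, by_event in payoffs.items()
    }


probabilities = {"cool": 0.6, "hot": 0.4}
payoffs = {
    "hot snack": {"cool": 5000, "hot": 1000},
    "ice-cream": {"cool": 1000, "hot": 6500},
}

emv = expected_monetary_values(probabilities, payoffs)
best_act = max(emv, key=emv.get)  # profit, so the larger EMV is preferred
print(emv, best_act)
```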

13.6 DECISION MAKING UNDER CONFLICT


Conflict and choice are closely related, in that choice produces conflict and conflict is
resolved by making a choice. Although conflict has been invoked in psychological approaches
to decision making, no generally accepted measure of conflict strength has been established.
Here the interests of two opposing parties come into the picture.
In decision making under conflict many outcomes are possible: an opponent is trying to
outwit you and foil your strategy.
Usually these types of problems are solved using game theory, which is discussed in
Unit 12.

13.7 DECISION TREE ANALYSIS
It is a graphic display of the various decision alternatives and their sequence, as if they were branches
of a tree.
A decision point is conventionally drawn as a square and an event (chance point) as a circle.
Problem
Amar company is currently working with a process which, after paying for materials, labour,
etc., brings a profit of Rs. 12,000. The following alternatives are available to the
company.
(i) The company can conduct research (R1), which is expected to cost Rs. 10,000 and has a
90% chance of success. If it proves a success the company gets a gross income of
Rs. 25,000.
(ii) The company can conduct research (R2), which is expected to cost Rs. 8,000 and has a
60% probability of success. The gross income will be Rs. 25,000.
(iii) The company can pay Rs. 6,000 as royalty for a new process which will bring a
gross income of Rs. 20,000.
(iv) The company can continue the current process.

Draw the decision tree as follows.

The net EMV is highest for the alternative of paying royalty for the new process, so the
optimal decision would be to procure the process on a royalty basis.
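The net EMV of each branch can be checked with a small Python sketch. The problem statement does not say what the company earns if research fails, so the `fallback` parameter is an assumption introduced here: it is tried both as zero income and as reverting to the current process (Rs. 12,000). The names `net_emv` and `evaluate` are hypothetical.

```python
def net_emv(gross_income, cost, p_success=1.0, fallback=0.0):
    """Expected gross income of a branch minus its up-front cost."""
    expected_gross = p_success * gross_income + (1 - p_success) * fallback
    return expected_gross - cost


def evaluate(fallback):
    """Net EMV of the four alternatives for an assumed income after failed research."""
    return {
        "R1": net_emv(25_000, 10_000, p_success=0.9, fallback=fallback),
        "R2": net_emv(25_000, 8_000, p_success=0.6, fallback=fallback),
        "Royalty": net_emv(20_000, 6_000),
        "Current": net_emv(12_000, 0),
    }


results = {fallback: evaluate(fallback) for fallback in (0, 12_000)}
best = {fallback: max(emv, key=emv.get) for fallback, emv in results.items()}
print(best)
```

Under either assumption the royalty alternative has the highest net EMV (Rs. 14,000), which agrees with the conclusion stated above.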

13.8 DECISION MAKING UNDER UTILITY
Money cannot always be the sole criterion. In many cases decisions have to be taken on other grounds. Also,
even when expected values are calculated in money terms, money may not mean everything.
Some people prefer to take risks; some take pride in starting a new venture.
A rational decision maker will choose the alternative which optimises the expected utility
rather than the expected monetary value. Once we know the individual's utility function, along with
the probabilities assigned to the outcomes in a particular situation, the total expected utility for
each course of action can be obtained by multiplying the utility values by their probabilities and adding the products.
13.9 SUMMARY
Decision making is an integral part of the planning, organizing, controlling and motivating
processes. The decision maker selects one strategy or course of action over others depending
upon criteria such as utility, sales, cost or rate of return. Decision theory provides a method for rational
decision making when the consequences are not fully deterministic. It provides a framework
for better understanding of the decision situation and for evaluating alternatives.

13.10 KEY WORDS

* Decision making under Certainty

* Decision making under Uncertainty

13.11 SELF ASSESSMENT QUESTIONS


1. Define decision making. Explain decision making under various circumstances.
2. A food company is thinking of launching a new product at a high price (S1), or a moderate
change in an existing product with a small increase in price (S2), or a small change in
composition with negligible price change (S3), as a result of which
(i) there can be an increase in sales (I),
(ii) no change in sales (N), or
(iii) a decrease in sales (D),
which would have an impact on profit as shown.

State of Nature       S1          S2          S3
I                  7,00,000    5,00,000    3,00,000
N                  3,00,000    4,50,000    3,00,000
D                  1,50,000        0       3,00,000

Solve this problem using all the different criteria.


4. A farmer is attempting to decide which of three crops he should plant on his one hundred
acre farm. The profit from each crop strongly depends on rainfall. He categorizes the
amount of rainfall as substantial, moderate or light. His profit is estimated below.

Rainfall    Crop A    Crop B    Crop C
S            7000      2500      4000
M            3500      3500      4000
L            1000      4000      3000

The estimated probability of substantial rainfall is 0.2, moderate is 0.4 & light is 0.5.
5. A glass factory specializing in crystal is developing a substantial backlog and the firm's
management is considering 3 courses of action:
S1 arrange for sub-contracting, S2 arrange overtime, S3 construct new facilities. The future
demand may be low, medium or high, with probabilities 0.1, 0.5 & 0.4 respectively.
The profit matrix is as shown.

Profit      S1      S2      S3
Low         10     -20    -150
Medium      50      60      20
High        50     100     200

Construct a decision tree and indicate the preferred action.

13.12 REFERENCES
1. Gupta S.P. Business Statistics, New Delhi: S Chand and Sons Publishers, 2000
2. Shahsi Kumar. Quantitative Techniques and methods, Mysuru: Chetana Book House,
2010
3. Vignanesh Prajapathi, Big data Analysis With R and Hadoop, Mumbai: Packt Publishing,
2013
4. SD Sharma, Operation Research, Delhi: Discovery Publishing House, 1997
5. Srinath L. S, PERT and CPM, Delhi: East West Press,2001
6. Kalavathy, Operation Research , New Delhi: Vikas Publishing House, 2010
7. Richard I. Levin. Statistics for Management, New Delhi: Pearson education India, 2008

UNIT 14 : NETWORK ANALYSIS

STRUCTURE
14.0 Objectives
14.1 Introduction
14.2 Network Analysis
14.3 Application of PERT and CPM techniques to Business problems.
14.4 Distinction between PERT and CPM.
14.5 Basic concepts of Network Analysis
14.6 Rules of Network Construction
14.7 Fulkerson’s Rule (i-j Rule) of Numbering Events
14.8 Illustrations
14.9 Summary
14.10 Key Words
14.11 Self Assessment Questions
14.12 References

14.0 OBJECTIVES

After studying this unit, you should be able to:

* Explain the importance of network analysis;
* Examine the various applications of network analysis;
* Distinguish between PERT and CPM;
* Explain the basic concepts of network analysis;
* Identify the rules of network construction;
* Get acquainted with Fulkerson’s Rule of numbering events; and
* Construct the network for simple problems.

14.1 INTRODUCTION

A project such as construction of a bridge, highway, flyover, power plant, repair and
maintenance of oil refineries or an airplane; design, development and marketing of a new
product; research and development work, etc. may be defined as a collection of interrelated
activities (tasks) which must be completed in a specified time according to a specified
sequence (or order) and require resources such as personnel, money, materials, facilities.
The process of dividing the project into these activities is called the work breakdown structure
(WBS). The activity or a unit of work, also called work content, is an identifiable and
manageable work unit. The main objective before starting such projects is: How to schedule
the required activities so as to:
a. Complete the given project on or before a specified time limit.
b. Minimize the cost of completion of the project on or before a specified time limit.
c. Minimize the total project completion time for a given cost.

Hence, before starting any project, it is essential to devise an adequate plan for
scheduling and controlling the various activities (tasks) of the given project. The class of
operations research techniques used for planning, scheduling and controlling large and
complex projects is often referred to as network analysis, network planning, or network
planning and scheduling techniques. PERT and CPM are two well known techniques used for
network analysis.

14.2 NETWORK ANALYSIS

A network is a graphical diagram consisting of a certain configuration of arrows and
nodes for showing the logical sequence of various activities (tasks) to be performed to
achieve project objectives. Network analysis is quite useful for designing, planning,
coordinating, controlling and decision making so that the project can be economically
completed in the minimum possible time with the limited available resources. The two most
popular forms of this technique now used in many scheduling situations are the Programme
Evaluation and Review Technique (PERT) and the Critical Path Method (CPM).

Programme Evaluation and Review Technique (PERT)

PERT was developed in 1956-58 by the US Navy Special Projects Office in
cooperation with the management consulting firm of Booz, Allen and Hamilton to aid in the
planning and scheduling of the US Navy’s Polaris Missile Programme, which involved over
three thousand different contracting organizations. The objective of the team was to
efficiently plan and produce the Polaris missile system. Since then this technique has proved
to be useful for all jobs or projects which have an element of uncertainty in the
estimation of duration, as is the case with new types of projects at both the Government and
industry level. In PERT we usually assume that the time to perform each activity is uncertain,
and as such three time estimates, i.e. the optimistic, the pessimistic and the most likely
time estimates, are used.

Critical Path Method (CPM)

CPM was developed in 1957 by J. E. Kelley of Remington Rand and M. R. Walker of
E. I. DuPont to aid in the scheduling of routine plant overhaul, maintenance and construction
work. This method differentiates between planning and scheduling. Planning refers to the
determination of activities that must be accomplished and the order in which such activities
should be performed to achieve the objectives of the project. Scheduling refers to the
introduction of time into the plan, thereby creating a time table for the various activities to
be performed. CPM uses two time and two cost estimates for each activity. CPM operates
on the assumption that the time taken by each activity in the project is already known
precisely.

14.3 APPLICATION OF NETWORK ANALYSIS TO BUSINESS

A few management applications of PERT and CPM are to plan, schedule, monitor and
control projects such as:
i. Construction of buildings, bridges, factories, highways, stadiums, irrigation projects,
etc.
ii. Budget and auditing procedures.
iii. Missile development programmes.
iv. Installation of complex new equipment such as computers or large machinery.
v. Advertising programmes and for development and launching of new products.
vi. Planning of political campaigns.
vii. Strategic and tactical military planning.
viii. Research and development of new products.
ix. Finding the best traffic flow pattern in large cities.
x. Maintenance and overhauling of complicated equipment in the chemical, power,
steel and petroleum industries.
xi. Long range planning and developing staffing plans.
xii. Organising of big conferences, public works, etc.
xiii. Shifting of manufacturing plant from one site to another.
xiv. Preparation of bids and proposals for projects of large size.
xv. Launching space programmes.

14.4 DISTINCTION BETWEEN P E R T AND C P M

The basic differences between the two techniques are summarized below;

PERT

1. A probability model with uncertainty in activity duration. The duration of each activity
is normally computed from multiple time estimates with a view to taking into account
time uncertainty. These estimates are ultimately used to arrive at the probability of
achieving any given scheduled date of project completion.

2. It is said to be an event-oriented network because in the analysis of the network emphasis is
given to important stages of completion of tasks rather than the activities required to
reach them.

3. PERT is normally used for projects involving activities of non-repetitive nature in
which time estimates are uncertain.

4. It helps in pin pointing critical areas in a project so that necessary adjustments can be
made to meet the scheduled completion date of the project.

5. PERT analysis does not usually consider costs.

CPM

1. A deterministic model with well known activity times based upon past experience. It,
therefore, does not deal with uncertainty in time.

2. CPM is suitable for establishing a trade-off for optimum balancing between scheduled
time and cost of the project.

3. CPM is used for projects involving activities of repetitive nature.

4. CPM deals with costs of project schedules and their minimization. The concept of
crashing is applied mainly to CPM models.

5. It is difficult to use CPM as a controlling device for the simple reason that one must
repeat the entire evaluation of the project each time changes are introduced into
the network.

14.5 BASIC CONCEPTS OF NETWORK ANALYSIS

A fundamental ingredient in both PERT and CPM is the use of network systems as a
means of graphically depicting the current problem or proposed project. A PERT or CPM
network consists of two major components, as discussed below:

Event

Events of the network represent project milestones, such as the start or the completion
of an activity (task) or activities, and occur at a particular instant of time at which some
specific part of the project has been or is to be achieved; therefore events do not consume
time or resources. Events are commonly represented by circles. The event circles are called
nodes in the network diagram. Events can be further classified into the following types:

Merge Event

When more than one activity comes and joins, the event is known as merge event.

Figure 3.1 Merge Event

Burst Event

When more than one activity leaves an event, the event is known as a burst event.

Figure 3.2 Burst Event

Merge and Burst Event

An event may be a merge event and a burst event simultaneously: with respect to some
activities it is a merge event and with respect to other activities it is a burst event.

Figure 3.3 Merge and Burst Event


Activity

Activities of the network represent project operations or tasks to be conducted.
Activities require the expenditure of time and resources for their accomplishment, whereas dummy
activities do not consume time or resources. An arrow is commonly used to represent an
activity, with its head indicating the direction of progress in the project. The activity
arrow is called an arc. The activity arrow is not scaled; the length of the arrow is a matter of
convenience.
Activities are identified by the numbers of their starting (tail) event and ending (head)
event. For an arrow (i, j) extended between two events, the tail event ‘i’ represents the start of the
activity and the head event ‘j’ represents the completion of the activity.

                    Activity
      (i) -------------------------> (j)
Starting Event                  Completion Event

Figure 3.4 Activity Node Relationships

The activities are further classified into the following types:

Predecessor Activity

An activity which must be completed before one or more other activities start is
known as predecessor activity.

Successor Activity

An activity which starts immediately after one or more other activities are
completed is known as a successor activity.

Concurrent activity

Activities which can be accomplished concurrently/simultaneously are known as
concurrent activities. It can be observed that an activity can be a predecessor or successor to
an event, or may be concurrent with one or more of the other activities.

Dummy Activity

An activity which does not consume any resource or time is known as a dummy
activity. A dummy activity is added to the network only to represent the given
precedence relationships among activities of the project, and is depicted by a dotted line
in the network diagram. It is needed when:

1. Two or more parallel activities in a project have same head and tail events or

2. Two or more activities have some (but not all) of their immediate predecessor activities
in common.
Sequencing

The first prerequisite in the development of a network is to maintain the precedence
relationships. In order to construct the network, the following points have to be taken note of:
* What job or jobs precede an activity?
* What job or jobs could run concurrently?
* What job or jobs follow it?
* What controls the start and finish of a job?

14.6 RULES OF NETWORK CONSTRUCTION


There are a number of rules in connection with the handling of events and activities of a
project network that should be followed:
1. Try to avoid arrows which cross each other.
2. Use straight arrows.
3. No event can occur until every activity preceding it has been completed.
4. An event cannot occur twice.
5. An activity succeeding an event cannot be started until that event has occurred.
6. Use the arrows from left to right. Avoid mixing two directions; vertical and standing
arrows may be used if necessary.
7. Dummies should be introduced only if absolutely necessary.
8. The network has only one entry point called the start event and one point of emergence
called the end or terminal event.

14.7 FULKERSON’S RULE (i-j RULE) OF NUMBERING EVENTS


After the network is drawn in a logical sequence, every event is assigned a number. The
number sequence must reflect the flow of the network. The following rules
are followed when numbering the events.
1. Number the start node, which has no predecessor activity, as 1.
2. Delete all the activities emerging from node 1.
3. Number all the resulting start nodes without any predecessor as 2, 3, ...
4. Delete all the activities originating from the nodes numbered in step 3.
5. Number all the resulting new start nodes without any predecessor, continuing from the last
number used in step 3.
6. Repeat the process until the terminal node, which has no successor activity, is reached.
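
The numbering rule above can be sketched in Python as a repeated "number the start nodes, then delete their activities" loop. This is a minimal illustration, not a prescribed implementation: the function name `fulkerson_numbering` and the small five-event network with labels s, p, q, r, t are hypothetical.

```python
def fulkerson_numbering(activities):
    """Number events so that every activity runs from a lower to a higher number.

    activities: list of (tail, head) pairs using arbitrary event labels.
    """
    events = []
    for tail, head in activities:
        for event in (tail, head):
            if event not in events:
                events.append(event)

    remaining = list(activities)
    numbers = {}
    next_number = 1
    while len(numbers) < len(events):
        # Start nodes: unnumbered events with no remaining incoming activity.
        heads = {head for _, head in remaining}
        start_nodes = [e for e in events if e not in numbers and e not in heads]
        if not start_nodes:
            raise ValueError("the network contains a loop")
        for event in start_nodes:
            numbers[event] = next_number
            next_number += 1
        # Delete all activities emerging from the nodes just numbered.
        remaining = [(t, h) for t, h in remaining if t not in start_nodes]
    return numbers


# Hypothetical network, used only to illustrate the rule.
activities = [("s", "p"), ("s", "q"), ("p", "r"), ("q", "r"), ("q", "t"), ("r", "t")]
numbers = fulkerson_numbering(activities)
print(numbers)  # {'s': 1, 'p': 2, 'q': 3, 'r': 4, 't': 5}
```

Because every activity then satisfies i < j, later calculations can visit events simply in increasing order of their numbers.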

14.8 ILLUSTRATIONS
After learning the concepts and rules of constructing networks, let us try to solve some
simple problems.
Problem-1:
Draw a network for the following project and number the events according to Fulkerson’s
rule.
1. A is the start activity and K is the end activity.
2. J is the successor activity to F.
3. C and D are successor activities to B.
4. D is the preceding activity to G.
5. E and F occur after event C.
6. E precedes J.
7. Restrains the occurrence of J and G precedes H.
8. K succeeds activity.

Solution:

Figure 3.5

Problem-2:

Draw a network for the following data.


Event No.                 1    2    3    4      5    6      7

Immediate Predecessors    -    1    1    2,3    3    4,5    5,6

Solution

Figure 3.6

Problem- 3:

Construct a network for the project whose activities and their precedence
relationship are given below:

Activity                  A    B    C    D    E    F        G    H    I

Immediate Predecessors    -    A    A    -    D    B,C,E    F    D    G,H

Solution

Figure 3.7
Problem- 4:

Draw the network using the data given below.

A < C, D, I; B < G, F ; D < G, F ; F < H, K ; G, H < J ; I, J, K < E

Solution:

Given A < C which means that C cannot be started until A is completed i.e. A is the
preceding activity to C. The above constraints can be given in the following table.

Activity        A    B    C    D    E        F      G      H    I    J      K

Predecessors    -    -    A    A    I,J,K    B,D    B,D    F    A    G,H    F

Figure 3.8

Problem-5:

The sequence of activities together with their predecessors for manufacturing an item are
given below, draw the network diagram.
Activity                A    B    C    D    E      F    G      H

Predecessor Activity    -    A    A    B    B,C    E    D,F    G

Solution

Figure 3.9

Problem -6:

Given in the table are the activities and sequence necessary for a maintenance job on
material handling equipment in a factory. Draw a network.

Activity                A    B    C    D    E    F    G    H      I        J

Predecessor Activity    -    A    B    B    B    C    C    F,G    D,E,H    I

Solution:

Figure 3.10

Problem - 7:
Activity                  A    B    C    D    E    F    G      H    I

Pre-requisite Activity    -    -    -    A    B    C    D,E    B    D,E,H

Solution

Figure 3.11

14.9 SUMMARY

Network analysis helps in monitoring a project to ensure that the project is
completed within the specified time.

14.10 KEY WORDS
Activity

Event

Merge event

Burst event

Predecessor

Successor

Sequencing

Network

14.11 SELF ASSESSMENT QUESTIONS

Case Study-1:

Sigma Ltd., an airplane manufacturing company, wants to draw a network for its
project. The following information is available. Draw the network for Sigma Ltd.

Activity                  A    B    C    D    E      F      G    H    I      J      K

Pre-requisite Activity    -    A    A    C    B,C    D,E    E    G    D,F    I,H    J

Case Study - 2:

Zen Limited has listed the activities and sequence requirements necessary for
machine maintenance in their company. Draw a network diagram.

Activity    Pre-requisite        Activity    Pre-requisite
A           -                    K           J,H
B           -                    L           J
C           B                    M           L
D           A                    N           M,K
E           C                    O           N
F           B                    P           O
G           A,F                  Q           P
H           G                    R           E
I           A,F                  S           Q,D,I,R
J           G                    T           R

14.12 REFERENCES
1. Gupta S.P. Business Statistics, New Delhi: S Chand and Sons Publishers, 2000
2. Shahsi Kumar. Quantitative Techniques and methods, Mysuru: Chetana Book House,
2010
3. Vignanesh Prajapathi, Big data Analysis With R and Hadoop, Mumbai: Packt Publishing,
2013
4. SD Sharma, Operation Research, Delhi: Discovery Publishing House, 1997
5. Srinath L. S, PERT and CPM, Delhi: East West Press,2001
6. Kalavathy, Operation Research , New Delhi: Vikas Publishing House, 2010
7. Richard I. Levin. Statistics for Management, New Delhi: Pearson education India, 2008

UNIT 15 : SOLUTION TO NETWORK PROBLEMS

STRUCTURE

15.0 Objectives

15.1 Introduction

15.2 Notations used for basic Scheduling Computation

15.3 Forward Pass Method

15.4 Backward Pass Method

15.5 Critical Path

15.6 Determination of Float and Slack Time

15.7 Summary

15.8 Key Words

15.9 Self Assessment Questions

15.10 References

15.0 OBJECTIVES

After studying this unit, you should be able to:

* Explain the objectives of critical path analysis.
* Compute the three time estimates.
* Prepare the project scheduling with the help of networks.
* Discuss forward and backward pass methods.
* Determine the float and slack time.

15.1 INTRODUCTION
The objective of critical path analysis is to estimate the total project duration and to
assign starting and finishing times to all the activities involved in the project. This helps in
checking actual progress against the scheduled duration of the project.
The duration of individual activities may be uniquely determined (in case of CPM) or
may involve the three-time estimates (in case of PERT) out of which the expected duration
of an activity is computed. Having computed this, the following factors should be known to
prepare project scheduling:
1. Total completion time of the project.
2. Earliest and latest start time of each activity.
3. Float of each activity, i.e. the amount of time by which the completion of an activity
can be delayed without delaying the total project completion time.
4. Critical activities and the critical path.

15.2 NOTATIONS USED FOR BASIC SCHEDULING COMPUTATION


The following notations are used for the basic scheduling computation:
1. TE or Ei = Earliest occurrence time of event i. It is the earliest time at which an event
can occur without affecting the total project time.
2. TL or Li = Latest occurrence time of event i. It is the latest time at which an event can
occur without affecting the total project time.
3. ESij = Earliest start time for activity (i,j). It is the earliest time at which the activity
can start without affecting the total project time.
4. LSij = Latest start time for activity (i,j). It is the latest possible time by which an activity
must start without affecting the total project time.
5. EFij = Earliest finish time for activity (i,j). It is the earliest possible time at which an
activity can finish without affecting the total project time.
6. LFij = Latest finish time for activity (i,j). It is the latest time by which an activity must
get completed without delaying project completion.
7. tij = Duration of activity (i,j).
For calculating the above mentioned times, two methods, namely forward pass and
backward pass, are employed.

15.3 FORWARD PASS METHOD


This method is used to compute the earliest times. In this method calculations begin
from the initial event 1, proceed through the network visiting events in increasing order
of event number, and end at the final event. At each event we calculate the earliest occurrence
time of the event and the earliest start and finish times for each activity that begins at that event. When
calculations end at the final event, its earliest occurrence time gives the earliest possible
completion time of the entire project.
1. Set the earliest occurrence time of the initial event to zero, i.e. E1 = 0.
2. Calculate the earliest start time for each activity that begins at event i (= 1). This is equal
to the earliest occurrence time of event i (the tail event), that is, ESij = Ei.
3. Calculate the earliest finish time of each activity that begins at event i. This is equal to
the earliest start time of the activity plus the duration of the activity, that is, EFij = ESij
+ tij = Ei + tij for all activities (i,j) beginning at event i.
4. Proceed to the next event, say j, where j > i.
5. Calculate the earliest occurrence time for event j. This is the maximum of the earliest
finish times of all activities ending at that event, that is,
Ej = Max (EFij) = Max (Ei + tij) for all immediate predecessor activities.
6. If j = N, that is, the final event, then the earliest finish time for the project, that is, the
earliest occurrence time EN of the final event, is given by
EN = Max (EFiN) for all terminal activities
= Max (Ei + tiN), and the method stops.
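
The forward pass can be sketched in Python as below. It is a minimal illustration that assumes events are numbered so that i < j for every activity (as Fulkerson's rule ensures); the function name `forward_pass` is hypothetical, and the event pairs and durations are those used in the solution of Problem 1 of this unit.

```python
from collections import defaultdict


def forward_pass(durations):
    """Earliest occurrence time E_j of every event.

    durations maps (i, j) -> t_ij, with events numbered so that i < j.
    """
    incoming = defaultdict(list)
    events = set()
    for (i, j), t in durations.items():
        incoming[j].append((i, t))
        events.update((i, j))

    earliest = {}
    for j in sorted(events):
        # E_j = Max (E_i + t_ij); the initial event has no predecessors, so E_1 = 0.
        earliest[j] = max((earliest[i] + t for i, t in incoming[j]), default=0)
    return earliest


# Activities (i, j) and durations in weeks, as used in Problem 1 of this unit.
durations = {
    (1, 2): 6, (1, 3): 2, (1, 4): 13, (3, 4): 10, (2, 5): 4, (2, 6): 9, (5, 6): 7,
    (2, 7): 2, (7, 8): 4, (4, 9): 6, (6, 9): 3, (8, 10): 10, (9, 10): 5,
}
earliest = forward_pass(durations)
print(earliest[10])  # 25
```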
15.4 BACKWARD PASS METHOD
This method is used to compute the latest allowable times. In this method calculations
begin from the final event N, proceed through the network visiting events in decreasing
order of event number, and end at the initial event 1. At each event we calculate the latest
occurrence time (L) for the corresponding event and the latest finish and start times for each activity
that terminates at the event, such that the earliest finish time for the project remains the
same. The method may be summarized as follows:
1. Set the latest occurrence time of the last event N equal to its earliest occurrence time
(known from the forward pass method), that is, LN = EN, j = N.

2. Calculate the latest finish time of each activity which ends at event j. This is equal to
the latest occurrence time of event j, that is, LFij = Lj for all activities (i,j)
ending at event j.

3. Calculate the latest start time of all activities ending at j. It is obtained by subtracting
the duration of the activity from the latest finish time of the activity, that is,
LFij = Lj
and LSij = LFij - tij
= Lj - tij for each activity (i,j) ending at event j.

4. Proceed backward to the next event in the sequence, decreasing j by 1.

5. Calculate the latest occurrence time of event i (i < j). This is the minimum of the latest
start times of all activities starting from that event, that is, Li = Min (LSij) for all immediate
successor activities = Min (Lj - tij).

6. If i = 1 (the initial event), then the latest occurrence time
L1 of the initial event is given by

L1 = Min (LS1j) for all immediate successor activities

= Min (Lj - t1j)

and the method stops.
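
A matching Python sketch of the backward pass is given below. As before, it assumes events are numbered so that i < j for every activity; the function name `backward_pass` is hypothetical, the network is the one used in the solution of Problem 1 of this unit, and the project time of 25 weeks is the earliest occurrence time of its final event.

```python
from collections import defaultdict


def backward_pass(durations, project_time):
    """Latest occurrence time L_i of every event.

    durations maps (i, j) -> t_ij, with events numbered so that i < j;
    project_time is E_N from the forward pass.
    """
    outgoing = defaultdict(list)
    events = set()
    for (i, j), t in durations.items():
        outgoing[i].append((j, t))
        events.update((i, j))

    latest = {}
    for i in sorted(events, reverse=True):
        # L_i = Min (L_j - t_ij); the final event has no successors, so L_N = E_N.
        latest[i] = min((latest[j] - t for j, t in outgoing[i]), default=project_time)
    return latest


# Activities (i, j) and durations in weeks, as used in Problem 1 of this unit.
durations = {
    (1, 2): 6, (1, 3): 2, (1, 4): 13, (3, 4): 10, (2, 5): 4, (2, 6): 9, (5, 6): 7,
    (2, 7): 2, (7, 8): 4, (4, 9): 6, (6, 9): 3, (8, 10): 10, (9, 10): 5,
}
latest = backward_pass(durations, project_time=25)
print(latest[1])  # 0
```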


15.5 CRITICAL PATH
Certain activities in a network diagram of a project are called critical activities because
a delay in their execution will delay the project completion time. All
activities having zero total float are identified as critical activities.
The critical path is the continuous chain of critical activities in a network. It is the longest
path from the first to the last event and is shown by a thick line or double lines in the
network diagram.

The length of the critical path is the sum of the individual times of all the critical activities
lying on it and defines the minimum time required to complete the project.

The critical path on a network diagram can be identified as follows:

1. For all activities (i,j) lying on the critical path, the E values and the L values of the tail
and head events are equal, that is, Ei = Li and Ej = Lj.

2. On the critical path, Ej - Ei = Lj - Li = tij.

15.6 DETERMINATION OF FLOAT AND SLACK TIME

Float is defined as the difference between the latest and the earliest activity time.
Slack is defined as the difference between the latest and earliest event time. There are three
types of floats:

Total Float : It refers to the amount of time by which the completion of an activity could be
delayed beyond the earliest expected completion time without affecting the overall project
duration time.

Mathematically, the total float of an activity (i,j) is the difference between the latest start time
and the earliest start time of that activity.
Total float (TFij) = LSij - ESij
= (Lj - Ei) - tij

Free Float : The time by which the completion of an activity can be delayed beyond the
earliest finish time without affecting the earliest start time of a succeeding activity. Free
float is calculated as Free float (FFij) = (Ej - Ei) - tij

Independent Float : The amount of time by which the start of an activity can be delayed
without affecting the earliest start time of any immediately following activity, assuming that
the preceding activity has finished at its latest finish time. A negative value of independent
float is considered as zero. Independent float is calculated as Independent float (IFij) =
(Ej - Li) - tij
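
The three float formulas can be checked with a short Python sketch. The function name `floats` is hypothetical; the event times and durations are taken from three activities of Problem 1 of this unit.

```python
def floats(e_i, l_i, e_j, l_j, t_ij):
    """Total, free and independent float of activity (i, j)."""
    total = (l_j - e_i) - t_ij
    free = (e_j - e_i) - t_ij
    # A negative independent float is taken as zero.
    independent = max(0, (e_j - l_i) - t_ij)
    return total, free, independent


# (E_i, L_i, E_j, L_j, t_ij) for three activities of Problem 1 of this unit.
examples = {
    "1-3": (0, 0, 2, 4, 2),
    "2-6": (6, 6, 17, 17, 9),
    "7-8": (8, 11, 12, 15, 4),
}
results = {name: floats(*values) for name, values in examples.items()}
print(results)  # {'1-3': (2, 0, 0), '2-6': (2, 2, 2), '7-8': (3, 0, 0)}
```

Activity 7-8 shows the rule for negative values: (Ej - Li) - tij = 12 - 11 - 4 = -3, so its independent float is taken as zero.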

Problem - 1: Kaizen Limited has decided to add a new product to its line. It will outsource
the product from another firm, package it, and sell it to a number of distributors.
The following are the activities to be completed to implement the above
project.

Activity    Description                        Time (weeks)
A           Organize sales office                   6
B           Hire salesmen                           4
C           Train salesmen                          7
D           Select advertising agency               2
E           Plan advertising campaign               4
F           Conduct advertising campaign           10
G           Design package                          2
H           Set up packaging facilities            10
I           Package initial stocks                  6
J           Order stock from manufacturer          13
K           Select distributors                     9
L           Sell to distributors                    3
M           Ship stock                              5
The precedence relation between the activities is shown in the following
diagram:

[Figure 4.1: precedence diagram from START to END linking the activities Order stock,
Design package, Set up packing facility, Package stock, Ship stock to distributors,
Organize sales office, Select distributors, Sell to distributors, Hire salesmen,
Train salesmen, Select advertising agency, Plan advertising campaign and
Conduct advertising campaign]

1. Draw the network diagram.

2. Indicate the critical path.

3. For each non critical activity, find the total and free float.

Solution:

Forward Pass Method:

E1 = 0
E2 = E1 + t1,2 = 0 + 6 = 6
E3 = E1 + t1,3 = 0 + 2 = 2
E4 = Max (Ei + ti,4) = Max (E1 + t1,4; E3 + t3,4)
   = Max (0 + 13; 2 + 10) = 13
E5 = E2 + t2,5 = 6 + 4 = 10
E6 = Max (Ei + ti,6) = Max (E2 + t2,6; E5 + t5,6)
   = Max (6 + 9; 10 + 7) = 17
E7 = E2 + t2,7 = 6 + 2 = 8
E8 = E7 + t7,8 = 8 + 4 = 12
E9 = Max (Ei + ti,9) = Max (E4 + t4,9; E6 + t6,9)
   = Max (13 + 6; 17 + 3) = 20
E10 = Max (Ei + ti,10) = Max (E8 + t8,10; E9 + t9,10)
    = Max (12 + 10; 20 + 5) = 25

Backward Pass Method:

L10 = E10 = 25
L9 = L10 - t9,10 = 25 - 5 = 20
L8 = L10 - t8,10 = 25 - 10 = 15
L7 = L8 - t7,8 = 15 - 4 = 11
L6 = L9 - t6,9 = 20 - 3 = 17
L5 = L6 - t5,6 = 17 - 7 = 10
L4 = L9 - t4,9 = 20 - 6 = 14
L3 = L4 - t3,4 = 14 - 10 = 4
L2 = Min (Lj - t2,j) = Min (L5 - t2,5; L6 - t2,6; L7 - t2,7)
   = Min (10 - 4; 17 - 9; 11 - 2) = 6
L1 = Min (Lj - t1,j) = Min (L2 - t1,2; L3 - t1,3; L4 - t1,4)
   = Min (6 - 6; 4 - 2; 14 - 13) = 0

2. The critical path in the network has been shown by the double line by joining all those
events where the two values E. and L. are equal. The critical path of the project is 1 -2-
5-6-9-10 and the critical activities are A, B, C, L and M. The total project time is 25
weeks
3. For each non-critical activity, the total float and free float calculations are listed in the
table below.
Activity  Duration  Earliest     Earliest       Latest         Latest    Total float        Free float
(i-j)     (ti,j)    start (Ei)   finish         start          finish    (Lj - ti,j) - Ei   (Ej - Ei) - ti,j
                                 (Ei + ti,j)    (Lj - ti,j)    (Lj)

1-3       2         0            2              2              4         2                  0
1-4       13        0            13             1              14        1                  0
2-6       9         6            15             8              17        2                  2
2-7       2         6            8              9              11        3                  0
4-9       6         13           19             14             20        1                  1
7-8       4         8            12             11             15        3                  0
8-10      10        12           22             15             25        3                  3
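
The forward pass, backward pass and float calculations above can be checked with a short script. The following is a minimal sketch, not part of the original solution: the names `durations`, `event_times` and `floats` are illustrative, the durations are read off the forward-pass lines of this problem, and the code assumes events are numbered so that every activity runs from a lower to a higher event number.

```python
# Activity durations (weeks), read off the forward-pass calculations above.
durations = {
    (1, 2): 6, (1, 3): 2, (1, 4): 13, (3, 4): 10,
    (2, 5): 4, (2, 6): 9, (5, 6): 7, (2, 7): 2,
    (7, 8): 4, (4, 9): 6, (6, 9): 3, (8, 10): 10, (9, 10): 5,
}


def event_times(durations):
    """Return (E, L): earliest and latest occurrence time of every event."""
    events = sorted({e for arc in durations for e in arc})
    # Forward pass: E_j = max over incoming activities of (E_i + t_ij).
    E = {events[0]: 0}
    for j in events[1:]:
        E[j] = max(E[i] + t for (i, k), t in durations.items() if k == j)
    # Backward pass: L_i = min over outgoing activities of (L_j - t_ij).
    L = {events[-1]: E[events[-1]]}
    for i in reversed(events[:-1]):
        L[i] = min(L[j] - t for (k, j), t in durations.items() if k == i)
    return E, L


def floats(durations, E, L):
    """Return {(i, j): (total float, free float)} for every activity."""
    result = {}
    for (i, j), t in durations.items():
        total = L[j] - E[i] - t
        free = E[j] - E[i] - t
        result[(i, j)] = (total, free)
    return result


E, L = event_times(durations)
table = floats(durations, E, L)
critical = [arc for arc, (total, _) in table.items() if total == 0]

print("Project duration:", E[10])
print("Critical activities:", critical)
for arc, (total, free) in table.items():
    if total > 0:
        print(arc, "total float =", total, "free float =", free)
```

Activities with zero total float form the critical path; all other activities are printed with their total and free float.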
Problem - 2: A banking company has decided to modernize one of its branch offices. The major activities of the project, along with their durations and preceding activities in the renovation process, are listed in the table below:

Activity            A     B     C     D     E     F     G     H     I     J     K     L     M
Preceding activity  E     A     B     K     -     E     F     F     F     I,L   C,G,H D     I,L
Duration (weeks)    4     2     1     12    14    2     3     2     4     3     4     2     2

1. Draw the network diagram

2. Find the minimum time in which the renovation is completed.

Solution:

Figure 4.3

2. The critical path of the project is 1-2-3-4-7-8-9-10-12 and the critical activities are E, A, B, C, K, D, L and J. The total project time is 42 weeks.
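
Since Figure 4.3 is not reproduced here, the stated result can be checked directly from the activity table. The sketch below is an illustration under stated assumptions, not the original method: it works on the predecessor lists rather than on numbered events, and the names `activities`, `earliest_finish` and `critical_chain` are hypothetical.

```python
# (duration in weeks, immediate predecessors), from the Problem 2 table.
activities = {
    "A": (4, ["E"]), "B": (2, ["A"]), "C": (1, ["B"]), "D": (12, ["K"]),
    "E": (14, []), "F": (2, ["E"]), "G": (3, ["F"]), "H": (2, ["F"]),
    "I": (4, ["F"]), "J": (3, ["I", "L"]), "K": (4, ["C", "G", "H"]),
    "L": (2, ["D"]), "M": (2, ["I", "L"]),
}


def earliest_finish(activities):
    """Earliest finish time of each activity, computed recursively with a cache."""
    cache = {}

    def finish(name):
        if name not in cache:
            duration, preds = activities[name]
            # An activity starts when its latest-finishing predecessor ends.
            cache[name] = duration + max((finish(p) for p in preds), default=0)
        return cache[name]

    for name in activities:
        finish(name)
    return cache


def critical_chain(activities, finish):
    """Walk back from the last-finishing activity along the binding predecessors."""
    name = max(finish, key=finish.get)
    chain = [name]
    while activities[name][1]:
        name = max(activities[name][1], key=finish.get)
        chain.append(name)
    return chain[::-1]


finish = earliest_finish(activities)
print("Minimum completion time:", max(finish.values()), "weeks")
print("Critical activities:", critical_chain(activities, finish))
```

The chain returned by `critical_chain` lists the critical activities in order of execution.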

Problem-3: Listed in the table are the activities and sequencing requirements necessary for the completion of a research work.

Activity            A     B     C     D     E     F     G     H     I     J     K     L     M
Preceding activity  -     -     B     C     A,D   D     A,D   E     G,H   I     G     J,K   L
Duration (weeks)    6     5     2     2     2     1     6     5     6     2     4     3     1

1. Draw the network diagram.

2. Find the critical path and duration of the project.

3. Find the total, free and independent floats for the various activities.

Solution:

Figure 4.4

Forward Pass Method:

E1 = 0

E2 = E1 + t1,2 = 0 + 5 = 5

E3 = E2 + t2,3 = 5 + 2 = 7

E4 = E3 + t3,4 = 7 + 2 = 9

E5 = Max (Ei + ti,5) = Max (E1 + t1,5; E4 + t4,5)

   = Max (0 + 6; 9 + 0) = 9

E6 = E5 + t5,6 = 9 + 2 = 11

E7 = E5 + t5,7 = 9 + 6 = 15

E8 = Max (E7 + t7,8; E6 + t6,8)

   = Max (15 + 0; 11 + 5) = 16

E9 = E8 + t8,9 = 16 + 6 = 22

E10 = Max (E7 + t7,10; E9 + t9,10)

    = Max (15 + 4; 22 + 2) = 24

E11 = E10 + t10,11 = 24 + 3 = 27

E12 = Max (E11 + t11,12; E4 + t4,12)

    = Max (27 + 1; 9 + 1) = 28

Backward Pass Method:

L12 = E12 = 28

L11 = L12 - t11,12 = 28 - 1 = 27

L10 = L11 - t10,11 = 27 - 3 = 24

L9 = L10 - t9,10 = 24 - 2 = 22

L8 = L9 - t8,9 = 22 - 6 = 16

L7 = Min (L10 - t7,10; L8 - t7,8)

   = Min (24 - 4; 16 - 0) = 16

L6 = L8 - t6,8 = 16 - 5 = 11

L5 = Min (L7 - t5,7; L6 - t5,6)

   = Min (16 - 6; 11 - 2) = 9

L4 = Min (L12 - t4,12; L5 - t4,5)

   = Min (28 - 1; 9 - 0) = 9

L3 = L4 - t3,4 = 9 - 2 = 7

L2 = L3 - t2,3 = 7 - 2 = 5

L1 = Min (L5 - t1,5; L2 - t1,2)

   = Min (9 - 6; 5 - 5) = 0

The critical path is 1-2-3-4-5-6-8-9-10-11-12, and the critical activities are B, C, D, E, H, I, J, L and M.

The various floats of each activity are calculated in the table below:

Activity  Duration  Earliest  Earliest  Latest  Latest  Total  Free   Independent
                    Start     Finish    Start   Finish  Float  Float  Float

A         6         0         6         3       9       3      3      3
B         5         0         5         0       5       0      0      0
C         2         5         7         5       7       0      0      0
D         2         7         9         7       9       0      0      0
E         2         9         11        9       11      0      0      0
G         6         9         15        10      16      1      0      0
H         5         11        16        11      16      0      0      0
I         6         16        22        16      22      0      0      0
J         2         22        24        22      24      0      0      0
K         4         15        19        20      24      5      5      4
L         3         24        27        24      27      0      0      0
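
The three floats follow directly from the event times. As a small worked check, the sketch below recomputes the rows for the non-critical activities A, G and K. It assumes, from the durations used in the forward pass, that A runs from event 1 to 5, G from 5 to 7 and K from 7 to 10 (Figure 4.4 is not reproduced, so this mapping is an inference); the function name `three_floats` is illustrative, and a negative independent float is taken as zero.

```python
# Earliest (E) and latest (L) event times from the passes above,
# restricted to the events needed for activities A, G and K.
E = {1: 0, 5: 9, 7: 15, 10: 24}
L = {1: 0, 5: 9, 7: 16, 10: 24}


def three_floats(i, j, duration):
    """Return (total, free, independent) float of activity i-j."""
    total = L[j] - E[i] - duration
    free = E[j] - E[i] - duration
    independent = max(0, E[j] - L[i] - duration)
    return total, free, independent


print("A:", three_floats(1, 5, 6))
print("G:", three_floats(5, 7, 6))
print("K:", three_floats(7, 10, 4))
```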

Problem - 4:

A project schedule has the following features:

Activity          1-2   1-3   2-4   3-4   3-5   4-9   5-6   5-7   6-8   7-8   8-10  9-10
Duration (weeks)  4     1     1     1     6     5     4     8     1     2     5     7

1. Construct the network diagram.

2. Compute the total float, free float and independent float for each activity.

3. Find the critical path and the total duration of the project.

Solution:

Figure 4.5

2. The critical path is 1-3-5-7-8-10 and the project duration is 22 weeks.

3. The various floats of each activity, computed from the durations given in the problem, are shown in the table below:

Activity  Duration  Earliest  Earliest  Latest  Latest  Total  Free   Independent
                    Start     Finish    Start   Finish  Float  Float  Float

1-2       4         0         4         5       9       5      0      0
1-3       1         0         1         0       1       0      0      0
2-4       1         4         5         9       10      5      0      0
3-4       1         1         2         9       10      8      3      3
3-5       6         1         7         1       7       0      0      0
4-9       5         5         10        10      15      5      0      0
5-6       4         7         11        12      16      5      0      0
5-7       8         7         15        7       15      0      0      0
6-8       1         11        12        16      17      5      5      0
7-8       2         15        17        15      17      0      0      0
8-10      5         17        22        17      22      0      0      0
9-10      7         10        17        15      22      5      5      0
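
The float calculations for this problem can be recomputed from the activity durations stated in the problem. The sketch below is illustrative only: the names are hypothetical, events are assumed to be numbered in increasing order along every activity, and a negative independent float is reported as zero.

```python
# Durations (weeks) from the Problem 4 schedule.
durations = {
    (1, 2): 4, (1, 3): 1, (2, 4): 1, (3, 4): 1, (3, 5): 6, (4, 9): 5,
    (5, 6): 4, (5, 7): 8, (6, 8): 1, (7, 8): 2, (8, 10): 5, (9, 10): 7,
}
events = sorted({e for arc in durations for e in arc})

# Forward pass: earliest event times.
E = {}
for j in events:
    E[j] = max((E[i] + t for (i, k), t in durations.items() if k == j), default=0)

# Backward pass: latest event times, starting from the project duration.
L = {}
for i in reversed(events):
    L[i] = min((L[j] - t for (k, j), t in durations.items() if k == i),
               default=E[events[-1]])


def three_floats(i, j):
    """Return (total, free, independent) float of activity i-j."""
    t = durations[(i, j)]
    total = L[j] - E[i] - t
    free = E[j] - E[i] - t
    independent = max(0, E[j] - L[i] - t)
    return total, free, independent


critical = [arc for arc in durations if three_floats(*arc)[0] == 0]
print("Project duration:", E[events[-1]], "weeks")
print("Critical activities:", critical)
for arc in durations:
    print(arc, three_floats(*arc))
```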

15.8 SUMMARY

A network diagram is constructed from the time required for each activity together with its preceding and succeeding activities. The free time, or slack, of each activity can then be calculated from the network. Managers can plan resources by checking the available free time. For example, if activity 5-6 has 3 days of free time and activity 5-7 is critical, labourers can be deputed to 5-7 instead of 5-6.

15.9 KEY WORDS


Free float
Independent float
Critical Path
Total Float

15.10 SELF ASSESSMENT QUESTIONS

Case Study-1: A project has the following schedule:


Activity 1-2 1-3 1-4 2-5 3-6 3-7 4-6 5-8 6-9 7-8 8-9

Duration (months) 2 2 1 4 8 5 3 1 5 4 3

1. Draw the network.


2. Calculate the total, free and independent float for each activity
3. Find the critical Path and the duration of the project.

Case Study-2: The R&D of SONY is developing a new power supply for a high definition
television. The job is broken down into following form:
Job  Description                             Predecessor job  Expected time (days)

A    Determine output voltage                -                5
B    Determine to use solid state rectifier  A                7
C    Choose Rectifier                        B                2
D    Choose Filter                           B                3
E    Choose Transformer                      C                1
F    Choose Chassis                          D                2
G    Choose Mounting                         C                1
H    Layout Chassis                          E,F              3
I    Build and Test                          G,H              10

1. Draw a critical path scheduling arrow diagram, identify the critical path
2. What is the minimum time for completion of the job?

15.11 REFERENCES
1. Gupta S.P. Business Statistics, New Delhi: S Chand and Sons Publishers, 2000
2. Shahsi Kumar. Quantitative Techniques and methods, Mysuru: Chetana Book House,
2010
3. Vignanesh Prajapathi, Big data Analysis With R and Hadoop, Mumbai: Packt Publishing,
2013
4. SD Sharma, Operation Research, Delhi: Discovery Publishing House, 1997
5. Srinath L. S, PERT and CPM, Delhi: East West Press,2001
6. Kalavathy, Operation Research , New Delhi: Vikas Publishing House, 2010
7. Richard I. Levin. Statistics for Management, New Delhi: Pearson education India, 2008

UNIT 16 : DIFFERENT TIME ESTIMATES – P E R T
STRUCTURE

16.0 Objectives
16.1 Introduction
16.2 PERT with three time estimates
16.3 PERT Procedure
16.4 Illustrations
16.5 Summary
16.6 Key Words
16.7 Self Assessment Questions

16.8 References

16.0 OBJECTIVES
After studying this unit, you should be able to:
• Discuss the computation of the three time estimates.
• Explain the procedure for computing PERT.
• Examine the steps involved in the computation.
• Determine the expected time of project completion.
• Solve problems related to PERT and CPM.

16.1 INTRODUCTION
PERT (Programme Evaluation and Review Technique) is essentially a management technique and, if tailored properly, can be used with advantage for responsibility accounting in addition to attaining other well-defined objectives. Managers have found this technique to be of immense value where it is adopted judiciously, the configurations of events and activities are correctly assessed, and their times are realistically worked out.
PERT is designed for scheduling complex projects that involve many inter-related tasks. It improves the planning process because:
1. It helps the planner to define the project's various component activities and events logically.
2. It provides a basis for normal time estimates, and yet allows for some measure of optimism or pessimism in estimating the completion dates.
3. It shows the effects of changes to the overall plan as they are contemplated.
4. It provides a built-in means for on-going evaluation of the plan.
5. It facilitates communication between planners and management, either by adhering to organizational lines or by crossing over them. In essence, PERT makes the clear-cut assignment of responsibility possible.

16.2 PERT WITH THREE-TIME ESTIMATES


If the duration of the activities in a project is uncertain, activity scheduling calculations are done using the expected value of the durations. However, a directly estimated expected duration may not give an accurate answer. Thus, rather than estimating the expected completion time of an activity directly, three values are considered, and from these a single value is estimated for further consideration. These are called the three time estimates in PERT. The three time estimates are listed below:

Optimistic Time (to or a)
This is the shortest (minimum) possible time to perform an activity, assuming that everything goes well.

Pessimistic Time (tp or b)


This is the maximum time that is required to perform an activity, under extremely bad
conditions.

Most Likely Time (tm or m)


It refers to the estimate of the normal time the activity would take. This assumes
normal delays. It is the mode of the probability distribution.
From these three time estimates, the expected time of an activity is calculated. It is
given by the weighted average of the three time estimates.

$$t_e = \frac{t_o + 4t_m + t_p}{6}$$

The variance of the activity is given by

$$\sigma^2 = \left(\frac{t_p - t_o}{6}\right)^2$$

16.3 PERT PROCEDURE

Step-1 : Draw the project network.

Step-2: Compute the expected time duration of each activity using te = (to + 4tm + tp) / 6.

Step-3: Compute the expected variance σ 2 of each activity.

Step-4: Compute the earliest start, earliest finish, latest start, latest finish and total float
of each activity.

Step-5: Determine the critical path and identify the critical activities.

Step-6: Compute the expected variance of the project length, σ², which is the sum of the variances of all the critical activities, and hence find the standard deviation of the project length, σ.

Step-7: Compute the standard normal deviate Z = (Ts - Te) / σ

Where Ts = Specified or scheduled time to complete the project.

Te = Normal expected duration of the project.

σ = Expected standard deviation of the project length.

The probability of completing the project within the scheduled time Ts is then read from the normal distribution table for this value of Z.
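
Step-7 can also be evaluated without a printed table. The following is a minimal sketch, assuming the project length is approximately normally distributed; it obtains the cumulative normal probability from Python's standard `math.erf` function. The function name `completion_probability` and the sample figures (an expected length of 17 weeks, a standard deviation of 3 weeks and a scheduled time of 20 weeks) are hypothetical.

```python
import math


def completion_probability(scheduled, expected, std_dev):
    """Return (Z, probability of finishing within the scheduled time)."""
    z = (scheduled - expected) / std_dev
    # Cumulative standard normal probability, Phi(z), via the error function.
    probability = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return z, probability


z, p = completion_probability(scheduled=20, expected=17, std_dev=3)
print("Z =", round(z, 2), "probability =", round(p, 4))
```

A negative Z (a scheduled time shorter than the expected duration) gives a probability below 0.5, and a positive Z gives a probability above 0.5.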

16.4 ILLUSTRATIONS

Problem-1:

A small project comprises activities whose time estimates are given in the table below:

Activity  Estimated Duration (Weeks)
          Optimistic  Most Likely  Pessimistic

1-2       1           1            7
1-3       1           4            7
1-4       2           2            8
2-5       1           1            1
3-5       2           5            14
4-6       2           5            8
5-6       3           6            15

1. Draw the project network


2. Find the expected duration and variance of each activity. What is the expected project length?
3. Calculate the variance and standard deviation of the project length. What is the probability that the project will be completed

1. At least 4 weeks earlier than expected?

2. No more than 4 weeks later than expected time?

Solution:

The expected time and variance of each activity are calculated in the following table:

Activity  Optimistic  Most Likely  Pessimistic  te =                 Variance
          Time (to)   Time (tm)    Time (tp)    (to + 4tm + tp) / 6  σ² = [(tp - to) / 6]²

1-2       1           1            7            2                    1
1-3       1           4            7            4                    1
1-4       2           2            8            3                    1
2-5       1           1            1            1                    0
3-5       2           5            14           6                    4
4-6       2           5            8            5                    1
5-6       3           6            15           7                    4

a) By examining all the paths we see that the critical path is 1-3-5-6.

b) The expected project length is the sum of the durations of the critical activities. That is, duration of project = 4 + 6 + 7 = 17 weeks.

c) The variance of the project length is the sum of the variances of the critical activities. That is,

Variance of project length, σ² = 1 + 4 + 4 = 9

Standard deviation, σ = √9 = 3 weeks

1. The probability that the project will be completed at least 4 weeks earlier than the expected time of 17 weeks is given by

$$P\left(Z \le \frac{T_s - T_e}{\sigma}\right) = P\left(Z \le \frac{(17 - 4) - 17}{3}\right) = P(Z \le -1.33)$$

From the normal distribution table, the value for Z = -1.33 is 1 - 0.9082 = 0.0918. Thus the probability of completing the project within 13 weeks (that is, 4 weeks earlier) is 0.0918 = 9.18%.

2. The probability that the project will be completed no more than 4 weeks later than the expected time of 17 weeks is given by

$$P\left(Z \le \frac{T_s - T_e}{\sigma}\right) = P\left(Z \le \frac{(17 + 4) - 17}{3}\right) = P(Z \le 1.33)$$

From the normal distribution table, the value for Z = 1.33 is 0.9082. Thus the probability of completing the project within 21 weeks (that is, 4 weeks later) is 0.9082 = 90.82%.
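
The whole of Problem-1 can be reproduced in a few lines. The sketch below is an illustration rather than the original working: it enumerates every path from event 1 to event 6 (practical only for a network this small), picks the longest as the critical path, and evaluates the two probabilities with `math.erf` instead of a printed table. Because Z is not rounded to two decimals, the results differ from the table values in the fourth decimal place. All names are hypothetical.

```python
import math

# (to, tm, tp) in weeks for each activity of Problem-1.
estimates = {
    (1, 2): (1, 1, 7), (1, 3): (1, 4, 7), (1, 4): (2, 2, 8), (2, 5): (1, 1, 1),
    (3, 5): (2, 5, 14), (4, 6): (2, 5, 8), (5, 6): (3, 6, 15),
}
te = {arc: (a + 4 * m + b) / 6 for arc, (a, m, b) in estimates.items()}
var = {arc: ((b - a) / 6) ** 2 for arc, (a, m, b) in estimates.items()}


def paths(node, end):
    """List every path from node to end as a list of activities."""
    if node == end:
        return [[]]
    return [[(i, j)] + rest
            for (i, j) in estimates if i == node
            for rest in paths(j, end)]


def phi(z):
    """Cumulative standard normal probability."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


critical = max(paths(1, 6), key=lambda p: sum(te[a] for a in p))
Te = sum(te[a] for a in critical)
sigma = math.sqrt(sum(var[a] for a in critical))

early = phi(((Te - 4) - Te) / sigma)   # finished at least 4 weeks early
late = phi(((Te + 4) - Te) / sigma)    # finished no more than 4 weeks late

print("Critical path:", critical)
print("Te =", Te, "sigma =", sigma)
print("P(4 weeks early) =", round(early, 4))
print("P(within 4 weeks late) =", round(late, 4))
```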

Problem-2: The following table gives the three time estimates (in days) for the activities of a project. Draw the network of the project and calculate the slack for each event. Find the critical path and the probability of completing the project in 35 days.

Activity  1-2   1-3   2-5   3-4   4-5   5-8   4-6   4-7   6-9   8-9   7-10  9-10
to        3     1     6     8     0     5     6     3     1     3     5     2
tm        5     2     8     12    0     7     9     6     2     5     14    5
tp        7     3     12    17    0     9     12    8     3     8     17    6

The expected time and variance of each activity are calculated in the following table:

Activity  Optimistic  Most Likely  Pessimistic  te =                 Variance
          Time (to)   Time (tm)    Time (tp)    (to + 4tm + tp) / 6  σ²

1-2       3           5            7            5                    0.44
1-3       1           2            3            2                    0.11
2-5       6           8            12           8.33                 1
3-4       8           12           17           12.17                2.25
4-5       0           0            0            0                    0
5-8       5           7            9            7                    0.44
4-6       6           9            12           9                    1
4-7       3           6            8            5.83                 0.69
6-9       1           2            3            2                    0.11
8-9       3           5            8            5.17                 0.69
7-10      5           14           17           13                   4
9-10      2           5            6            4.67                 0.44


The critical path is 1-3-4-7-10.

The project duration is 33 days. That is Te = 33 days,

The expected variance of the critical path is σ² = 0.11 + 2.25 + 0.69 + 4 = 7.05

Standard deviation, σ = √7.05 = 2.65 days

The probability of completing the project work in 35 days is given by

$$Z = \frac{T_s - T_e}{\sigma} = \frac{35 - 33}{2.65} = 0.75$$

For Z = 0.75 the normal distribution table gives a value of 0.7734. Thus the probability of completing the project in 35 days is 0.7734 = 77.34%.

The various floats are calculated as below:

Activity  Duration  Earliest  Earliest  Latest  Latest  Total  Free   Independent
                    Start     Finish    Start   Finish  Float  Float  Float

1-2       5         0         5         2.84    7.84    2.84   0      0
1-3       2         0         2         0       2       0      0      0
2-5       8.33      5         13.33     7.84    16.17   2.84   0.84   0
3-4       12.17     2         14.17     2       14.17   0      0      0
4-5       0         14.17     14.17     16.17   16.17   2      0      0
5-8       7         14.17     21.17     16.17   23.17   2      0      0
4-6       9         14.17     23.17     17.34   26.34   3.17   0      0
4-7       5.83      14.17     20        14.17   20      0      0      0
6-9       2         23.17     25.17     26.34   28.34   3.17   1.17   0
8-9       5.17      21.17     26.34     23.17   28.34   2      0      0
7-10      13        20        33        20      33      0      0      0
9-10      4.67      26.34     31        28.33   33      2      2      0

Problem - 3: A project is represented by the following three time estimates (in weeks):

Activity     A     B     C     D     E     F     G     H     I
to           5     18    26    16    15    6     7     7     3
tp           10    22    40    20    25    12    12    9     5
tm           8     20    33    18    20    9     10    8     4
Predecessor  -     -     -     A     A     B     C     D     E,F

1) Draw the network and determine the critical path.
2) Find the expected task times and their variances.
3) Find the probability of completing the project in 41.5 weeks.

Solution:

Activity  Optimistic  Pessimistic  Most Likely  te =                 Variance
          Time (to)   Time (tp)    Time (tm)    (to + 4tm + tp) / 6  σ² = [(tp - to) / 6]²

1-2       5           10           8            7.8                  0.696
1-3       18          22           20           20                   0.444
1-4       26          40           33           33                   5.429
2-5       16          20           18           18                   0.443
2-6       15          25           20           20                   2.780
3-6       6           12           9            9                    1.000
4-7       7           12           10           9.8                  0.694
5-7       7           9            8            8                    0.111
6-7       3           5            4            4                    0.111

3. The earliest and latest expected times for each event are calculated from the expected time of each activity.

Forward Pass:

E1 = 0

E2 = E1 + t1,2 = 0 + 7.8 = 7.8

E3 = E1 + t1,3 = 0 + 20 = 20

E4 = E1 + t1,4 = 0 + 33 = 33

E5 = E2 + t2,5 = 7.8 + 18 = 25.8

E6 = Max (Ei + ti,6) = Max (E2 + t2,6; E3 + t3,6)

   = Max (7.8 + 20; 20 + 9) = 29

E7 = Max (Ei + ti,7) = Max (E5 + t5,7; E6 + t6,7; E4 + t4,7)

   = Max (25.8 + 8; 29 + 4; 33 + 9.8) = 42.8

Backward Pass:

L7 = E7 = 42.8

L6 = L7 - t6,7 = 42.8 - 4 = 38.8

L5 = L7 - t5,7 = 42.8 - 8 = 34.8

L4 = L7 - t4,7 = 42.8 - 9.8 = 33.0

L3 = L6 - t3,6 = 38.8 - 9 = 29.8

L2 = Min (Lj - t2,j) = Min (L6 - t2,6; L5 - t2,5)

   = Min (38.8 - 20; 34.8 - 18) = 16.8

L1 = Min (Lj - t1,j) = Min (L3 - t1,3; L4 - t1,4; L2 - t1,2)

   = Min (29.8 - 20; 33 - 33; 16.8 - 7.8) = 0

[Figure: Network diagram for Problem 3 with events 1 to 7, showing activities A (7.8), B (20), C (33), D (18), E (20), F (9), G (9.8), H (8) and I (4), with the earliest time (Es) and latest time (Lf) marked at each event.]

4. The last event, 7, will occur only after 42.8 weeks. To find the standard deviation of the time of occurrence of this last event, we require only the critical activities.

Expected length of the critical path = 33 + 9.8 = 42.8 weeks

Variance of the critical path = 5.429 + 0.694 = 6.123

Standard deviation, σ = √6.123 = 2.474 weeks

The probability of finishing the project in 41.5 weeks is given by

$$Z = \frac{T_s - T_e}{\sigma} = \frac{41.5 - 42.8}{2.474} = -0.52$$

From the normal distribution table, the value for Z = -0.52 is 1 - 0.70 = 0.30; that is, the probability of completing the project in 41.5 weeks is 30%.
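
The Problem 3 result can be cross-checked from the predecessor table alone. The sketch below is an assumption-laden illustration, not the original working: it finds the longest chain of activities by expected time, sums the variances along it and evaluates the normal probability with `math.erf`. It keeps full precision (for example, te of A is 7.83 rather than 7.8), so Z and the variance differ slightly from the rounded figures above while the probability still comes out at about 30%. All names are hypothetical.

```python
import math

# (to, tm, tp, predecessors) in weeks, from the Problem 3 table.
tasks = {
    "A": (5, 8, 10, []), "B": (18, 20, 22, []), "C": (26, 33, 40, []),
    "D": (16, 18, 20, ["A"]), "E": (15, 20, 25, ["A"]), "F": (6, 9, 12, ["B"]),
    "G": (7, 10, 12, ["C"]), "H": (7, 8, 9, ["D"]), "I": (3, 4, 5, ["E", "F"]),
}


def expected(name):
    a, m, b, _ = tasks[name]
    return (a + 4 * m + b) / 6


def variance(name):
    a, _, b, _ = tasks[name]
    return ((b - a) / 6) ** 2


def longest_chain(name):
    """Return (length, chain) of the longest expected-time chain ending at name."""
    preds = tasks[name][3]
    best = max((longest_chain(p) for p in preds), default=(0.0, []))
    return best[0] + expected(name), best[1] + [name]


length, chain = max(longest_chain(n) for n in tasks)
sigma = math.sqrt(sum(variance(n) for n in chain))
z = (41.5 - length) / sigma
prob = 0.5 * (1 + math.erf(z / math.sqrt(2)))

print("Critical activities:", chain)
print("Expected length:", round(length, 2), "weeks")
print("Z =", round(z, 2), "probability =", round(prob, 2))
```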

Problem - 4: Consider the following project.

Activity  Three Time Estimates (weeks)  Predecessor
          to     tm     tp

A         3      6      9               -
B         2      5      8               -
C         2      4      6               A
D         2      3      10              B
E         1      3      11              B
F         4      6      8               C,D
G         1      5      15              E

Find the critical path and the standard deviation of the project length. Also find the probability of completing the project by 18 weeks.

Solution: The expected time and the variance of each activity are calculated as shown in the table below:

Activity  Optimistic  Most Likely  Pessimistic  te =                 Variance
          Time (to)   Time (tm)    Time (tp)    (to + 4tm + tp) / 6  σ² = [(tp - to) / 6]²

A         3           6            9            6                    1
B         2           5            8            5                    1
C         2           4            6            4                    0.444
D         2           3            10           4                    1.777
E         1           3            11           4                    2.777
F         4           6            8            6                    0.444
G         1           5            15           6                    5.444

Figure 5.4

The critical path is 1-2-4-6, or A-C-F.

The project length = 6 + 4 + 6 = 16 weeks

Project length variance, σ² = 1 + 0.444 + 0.444 = 1.888

Standard deviation, σ = √1.888 = 1.374 weeks

The probability of completing the project in 18 weeks is given by

$$Z = \frac{T_s - T_e}{\sigma} = \frac{18 - 16}{1.374} = 1.456$$

From the normal distribution table we have the value 0.92647 for Z = 1.456; that is, the probability of completing the project by 18 weeks is 92.65%.

16.5 SUMMARY

In this block you have gained a fair knowledge of network analysis. You have learnt to find the total time required to complete a project, and you are able to compute the expected time from different time estimates. Further, you can learn about crashing of projects, wherein you try to squeeze the project time by incurring extra cost. The normal distribution tables you learnt in Block 3 are also used here.

16.6 KEY WORDS

PERT

Activity

Optimistic

Probability

Pessimistic

Projects

16.7 SELF ASSESSMENT QUESTIONS

Case Study-1:

A project has the following activities and other characteristics


Activity A B C D E F G H I

Preceding Activity - - A A C D B E,F G

Optimistic Time 4 1 6 2 5 3 3 1 4

Most likely Time 7 5 12 5 11 6 9 4 19

Pessimistic Time 16 15 30 8 17 15 27 7 28

1) Draw the PERT network diagram


2) Prepare the activity schedule for the project.
3) Identify the critical path
4) Determine the project completion time
5) Find the probability that the project is completed in 36 weeks.

Case Study-2: The owner of a retail outlet is considering a new computer system for transaction and inventory management. A computer company has sent the following instructions with regard to the installation of the system.

Activity  Activity Description              Immediate    Three time estimates
                                            Predecessor  Optimistic  Most Likely  Pessimistic

A         Select the computer               -            4           6            8
B         Design input/output system        A            5           7            15
C         Design monitoring system          A            4           8            12
D         Assemble computer hardware        B            15          20           25
E         Develop the main program          B            10          18           26
F         Develop the input/output routine  C            8           9            16
G         Create database                   E            4           8            12
H         Install the system                D,F          1           2            3
I         Test and implement                G,J          6           7            8

1. Construct the network diagram.
2. Determine the critical path and expected completion time.
3. Determine the probability of completing the work in 55 days.

16.8 REFERENCES
1. Gupta S.P. Business Statistics, New Delhi: S Chand and Sons Publishers, 2000
2. Shahsi Kumar. Quantitative Techniques and methods, Mysuru: Chetana Book House,
2010
3. Vignanesh Prajapathi, Big data Analysis With R and Hadoop, Mumbai: Packt Publishing,
2013
4. SD Sharma, Operation Research, Delhi: Discovery Publishing House, 1997
5. Srinath L. S, PERT and CPM, Delhi: East West Press,2001
6. Kalavathy, Operation Research , New Delhi: Vikas Publishing House, 2010
