M.B.A I SEMESTER
MBHC - 1.4
BLOCK - 2
CORRELATION AND REGRESSION
Unit - 5
Correlation
Unit - 6
Methods of Computing Correlation
Unit - 7
Regression
Unit - 8
Multiple Correlation and Regression
BLOCK - 3
PROBABILITY
Unit - 9
Introduction to Probability and Probability Types
Unit - 10
Theoretical Probability Distributions, Normal Distribution
Unit - 11
Introduction to Operations Research
Unit - 12
Game Theory
BLOCK - 4
APPLICATION OF OPERATION RESEARCH IN BUSINESS
Unit - 13
Decision Analysis
Unit - 14
Network Analysis
Unit - 15
Solution to Network Problems
Unit - 16
Different Time Estimates - PERT
BLOCK -1 : INTRODUCTION TO STATISTICS
The science of statistics is indispensable for a clear appreciation of problems
in every branch of human knowledge. It covers all fields of enquiry in which a
grasp of the significance of large numbers is required, and it is applicable to all disciplines.
This block presents the fundamental aspects of statistics. Basic measures of central
tendency such as the mean, median and mode are discussed here. The block also gives you an
insight into the collection, tabulation and analysis of data, and covers methods of calculating
dispersion.
You are expected to understand these concepts and work out the problems given at the
end of each unit.
This block consists of four units.
Unit 1: Provides an introduction to business statistics i.e. Meaning, scope, importance and
limitations, statistics in Business Management.
Unit 2: Gives an idea of how to analyse data, i.e. introduction, sources of data,
collection, classification, tabulation and depiction of data.
Unit 3: Describes various measures of central tendency, i.e. arithmetic mean, weighted mean,
geometric mean, harmonic mean, median and mode.
Unit 4: Explains the Measures of Dispersion i.e. Range, Quartile deviation, Mean deviation,
Standard deviation, variance, Coefficient of variation, Skewness and Kurtosis.
BLOCK - 2 : CORRELATION AND REGRESSION
In the previous block you gained basic knowledge of the science of data. You
learnt how to collect, tabulate and depict data. You also
learnt to analyse data through measures of central tendency such as the mean, median
and mode. All of this was limited to a single set of data.
In this block let us try to understand the relationship between two sets of data. Some
sets of data depend directly or inversely on other sets of data. For example, crop yield over
the years depends on the amount of rainfall in that place over the same years. Such a relationship is called correlation.
Once you are able to correlate two sets of data, you can try to find the exact relationship
between them, which is called regression.
In this block you will study 4 units.
Unit 5: Concept and definition of correlation, significance, types, Properties of Correlation
Methods of correlation analysis: Graphic method,
Unit 6: Scatter diagrams, Karl Pearson’s correlation co-efficient, Rank correlation
coefficient,
Unit 7: Regression: Regression analysis: meaning and definition of regression, application
of regression analysis, difference between correlation & regression analysis, Types of
regression models, standard error and Regression coefficients.
Unit 8: Multiple correlation and regression: Concept of multiple regression and
multiple correlation, Concept of partial correlation, Correlation co-efficient, Method of
least squares.
BLOCK -3 : PROBABILITY
In the previous blocks you studied statistics. In this block let us study probability, an
important mathematical concept with significant applications in business.
The word probability is not new to you; you use the idea in daily life in statements such as "it
may rain today" or "I may go there". Probability is closely related to business. Earning a profit in
any business depends on numerous factors, some controllable and some not, so the outcome is
always probabilistic.
There are quantitative techniques that aid the calculation of profit. To measure
profitability, it is necessary to perform certain experiments and visualize their possible
outcomes. In many situations each trial can result in only one of two possible
outcomes; the number of times one of them occurs in a fixed number of independent trials
follows a binomial distribution. Sometimes the possibilities follow a
pattern in which extreme values are rare and values near the average are common. For example, construction of a house may
typically take one year, but occasionally it is finished in 6 months or takes 3 years. A
distribution in which extreme values are less likely and values near the average are more likely is
called a normal distribution.
In this block you will study the four units listed below.
Unit 9: Basic definitions, events, sample space and probabilities, Basic rules of probability,
Unit 10: Conditional probability, independence of events, combinatorial concepts, law of
total probability, Bayes' theorem,
Unit 11: Theoretical probability distributions – Binomial, Poisson and Normal – Simple
problems applied to business. Probability distribution, Discrete random variable,
Binomial probability distribution,
Unit 12: Normal distribution.
BLOCK - 4 : APPLICATION OF OPERATION RESEARCH IN
BUSINESS
Decision making is an important management activity. The significance of decision
making varies with the type of decision, such as routine or one-time, and with the decision-making environment,
such as risk, certainty or uncertainty. In this block you will understand how to take
decisions based on quantitative inputs in various situations.
Further, in this block you will also study network analysis. Network analysis mainly
helps you to keep pace with a project. Whenever you take up a project, you can identify
its various components, delineate the sequence of the activities, determine the
time and cost required for each activity, and then draw a chart. Based on this chart you
can control the project: you will come to know where the project is lagging and where it
is exceeding the budget.
This block is divided into four units.
Unit 13: Decision making, Decision making environment, various criteria used for decision
making, Decision tree analysis,
Unit 14: Network scheduling using PERT and CPM, Basic concepts of network analysis,
Construction of network,
Unit 15: Different time Estimates,
Unit 16: Cost considerations in CPM, Probability of project completions
CREDIT PAGE
Programme Name : MBA
Year/Semester : 1st Year, 1st Semester
Block No : 1 to 4
Block - 4 (Units 13 to 16)
Dr. M.S. Yathisha Chandra, Associate Professor, Department of Management, UBDTCE-VTU, Davanagere.
Dr. Rajeshwari H., Assistant Professor, DOS & R Management, KSOU, Mysore.
Copy Right
Registrar
Karnataka State Open University
Mukthagangothri, Mysuru - 570006
Developed by the Department of Studies and Research in Management, KSOU, under
the guidance of Dean (Academic), KSOU, Mysuru
Karnataka State Open University, January-2021
All rights reserved. No part of this work may be reproduced in any form, or any other means,
without permission in writing from the Karnataka State Open University.
Further information on the Karnataka State Open University Programmes may be obtained from
the University’s office at Mukthagangothri, Mysuru-570006
Printed and Published on behalf of Karnataka State Open University. Mysuru-570006 by
Registrar (Administration)-2021
BLOCK - 1
INTRODUCTION TO STATISTICS
UNIT -1: INTRODUCTION TO BUSINESS STATISTICS
STRUCTURE
1.0 Objectives
1.1 Introduction
1.2 Definitions and Meaning of Statistics
1.3 Scope of Statistics
1.4 Importance of Statistics
1.5 Limitations of Statistics
1.6 Statistics in Business and Management
1.7 Descriptive and Inferential Statistics in Business Decisions
1.8 Strengths and Weaknesses of Statistics
1.9 Language of Statistics
1.10 Summary
1.11 Key Words
1.12 Self Assessment Questions
1.13 Check Your Progress
1.14 References
1.0 OBJECTIVES
After studying this unit you should be able to :
Define statistics and explain scope, limitations and importance;
Appreciate use of statistics in Business decisions and
Analyze the strengths and weaknesses of statistics and the various terminologies of
statistics.
1.1 INTRODUCTION
Statistics is not a new discipline. It is as old as human society itself. The word
statistics is derived from the Latin word ‘Status’, the Italian word ‘Statista’ or the German
word ‘Statistik’, each of which means a political state. In ancient times, the scope of
statistics was limited to the collection of data on the age and gender-wise population, property
and wealth of a country for framing military and fiscal policies.
The use of statistics can be traced to the time of the Pharaohs of Egypt,
more than 2000 years ago, and traces of its application are seen in Kautilya’s Arthashastra
and during the time of Chandragupta Maurya. Systematic
development of the field began during the 16th, 17th and 18th centuries, when the theory of probability
and the theory of games and chance took shape. Regression and correlation analysis and tests
such as the Chi-square test of goodness of fit and the t-test were developed later, in the late
19th and early 20th centuries. R. A. Fisher, who is called the Father of
Statistics, pioneered the application of statistics to genetics, biometry, psychology, education,
agriculture etc., which was a remarkable step in introducing statistics to other fields. Thus
statistics became a full-fledged science. Today statistics provides solutions in various
complex fields such as Economics, Business, Management, Accountancy, Social Science,
Industry, Biology and Medical Sciences.
1.2 DEFINITION AND MEANING OF STATISTICS
Statistics has been defined differently by different writers. It has been expressed both as
numerical data and as statistical methods.
Statistics as Numerical data:
“Statistics are numerical statements of facts in any department of enquiry placed
in relation to each other.”– Bowley
“By statistics we mean quantitative data affected to a marked extent by multiplicity
of causes.” – Yule and Kendall
“Statistics are the classified facts representing the conditions of the people in a
state, such as those facts which can be represented in tables as numbers in any classified
arrangement” - Webster
Statistics as Statistical Methods:
“Statistics is the science of estimates and probabilities” – Boddington
“Statistics may be called as the science of counting or averages” – Bowley
“Statistics is a method of decision making in the face of uncertainty on the basis of
numerical data and calculated risks” – Prof. Ya-Lun-Chou
supply, consumption, distribution of wealth and income, savings, profits, investments,
expenditure etc. Various laws of economics are developed through statistics. Powerful
tools (trend analysis, time series, forecasting techniques) are used in the analysis of
economic data.
Business and Management – Business deals with uncertainty and frequently lands in
dilemmas. Statistics offers a scientific way out of the ambiguities of business
and helps widely in taking decisions.
Accountancy and Auditing – Today Chartered Accountants and ICWAs have statistics
as a vital subject in their syllabus, since its use in accounting has become inevitable.
It is widely used in profit analysis, dividend decisions, analysis of assets and liabilities, etc.
In auditing, sampling techniques are used for test checking of voluminous data related
to business transactions.
Statistics in Industry – It is used intensively in quality control of the production process.
Statistics in Physical Sciences – Physical sciences like astronomy, geology, engineering
and meteorology expect accurate results in their applications. Statistical techniques
like the method of least squares provide solutions here.
Statistics in Social Science – The study of demographic features, mortality, fertility,
population growth, poverty etc. requires statistics to analyze the data.
Statistics is also used in Biology and Medical Sciences to study cause and effect, and
in Psychology and Education for purposes such as scaling of mental health and determining
IQ levels.
complicated. Personal evaluation based on observations is not enough. A team of specialized
managerial executives is essential for running today’s business activities such as
sales, purchases, production, control, finance and marketing. Here statistical tools and
theories such as forecasting techniques, estimation theory, sampling, probability, least squares
and game theory play an indispensable role. “Statistics is a method of decision making in
the face of uncertainty on the basis of numerical data and calculated risks” – Prof.
Ya-Lun-Chou. According to Wallis and Roberts, “Statistics may be regarded as a
body of methods for making wise decisions in the face of uncertainty.” These definitions
reflect the application of statistics in modern business, which depends on accurate
estimation and forecasting of future demand for the product, market trends and so
on. Statistical information related to the business also acts as a guide to future economic
events. Some of the uses of statistics in business decisions are:
To estimate the probable trends in demand of the goods
Ordering the right quantity of Materials
Seasonal and cyclical movements of the business
Relationship between supply and demand
To know the purchasing power of money
Statistical quality control of production to produce without waste
To promote the new businesses
To conduct the customer surveys and collect the demographic information to fix the
target segments.
To understand the consumer expectations and level of product awareness
To launch the new products through sample surveys
Optimization of Profits and Investments and minimizing the expenses
To forecast the future and balance the uncertainties through probability and estimation
theories
Thus the use of statistics has become indispensable in all branches of business
activity.
1.7 DESCRIPTIVE AND INFERENTIAL STATISTICS IN BUSINESS
DECISIONS
There are two major divisions in statistics: Descriptive Statistics and Inferential
Statistics.
Descriptive Statistics: Descriptive statistics deals with collecting, summarizing and
simplifying data which are otherwise very voluminous. Through this, meaningful
conclusions can be drawn readily from the data. This method thus facilitates an understanding
of the data and systematic reporting, which makes the data useful for further discussion,
analysis and interpretation. A well thought out data classification facilitates easy description
and a variety of summary measures. These include measures of Central Tendency, Dispersion,
Skewness and Kurtosis, which constitute the essential scope of descriptive statistics.
Inferential Statistics: This is also called inductive statistics. It goes beyond describing a
given problem situation by collecting, summarizing and presenting related data. It
consists of the methods used for drawing inferences and making broad generalizations
about the totality of observations on the basis of a part of it. That is, obtaining a particular value
from sample information and using it to draw an inference about the entire population
is inferential statistics.
In business, decisions have to be taken under uncertainty, and most of the time total coverage
of the information through the census method is not possible, feasible
or practical for various reasons. In such situations it is inferential statistics that is
used in taking business decisions.
Risk evaluation and Statistics:
Inferential statistics helps to evaluate the risk involved in drawing inferences or generalizations
about an unknown population on the basis of sample information. There is always a risk of an
inference about a population being incorrect when it is based on knowledge limited to a
sample. The remedy lies in evaluating the risk. Probability distributions help us in
drawing statistical inferences and estimating the degree of reliability of these inferences.
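As a rough illustration of drawing an inference about a population from a sample and attaching a measure of reliability to it, the following Python sketch estimates a population mean from a sample. The sample values are hypothetical, and the multiplier 1.96 assumes an approximately normal sampling distribution (a 95% interval); neither comes from this unit.

```python
import math
import statistics

# Hypothetical daily sales (units) observed on a sample of 10 days; not from the text.
sample = [52, 48, 55, 60, 47, 51, 58, 49, 53, 57]

n = len(sample)
sample_mean = statistics.mean(sample)      # point estimate of the population mean
sample_sd = statistics.stdev(sample)       # sample standard deviation (n - 1 divisor)
standard_error = sample_sd / math.sqrt(n)  # variability of the sample mean across samples

# Assumption: normal approximation at the 95% confidence level (z = 1.96).
z = 1.96
lower = sample_mean - z * standard_error
upper = sample_mean + z * standard_error

print(f"Estimated population mean: {sample_mean:.2f}")
print(f"Approximate 95% interval: {lower:.2f} to {upper:.2f}")
```

The width of the interval is one way of expressing the risk attached to the inference: a larger sample or less variable data would narrow it. For a sample as small as this one, a multiplier from the t distribution would usually be preferred to 1.96.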
Check Your Progress
1. Inferential statistics is also called ............... statistics.
2. Statistics is an aggregate of ...............
3. Statistics helps to make ............... in the face of uncertainty.
collection of data till interpreting the results is porous and allows a number of errors in each
step. The data can be manipulated easily by the collector himself. These flaws in statistics
can be minimized but cannot be eliminated. Thus the inaccuracy, unreliability and manipulation
of data create distrust in statistics.
1.10 SUMMARY
A layman knows statistics as data: numerical information expressed in
quantitative terms. It can also be called the science of data. This unit has thrown light on the
basics of statistics. Today statistics has grown into a separate subject because of its importance in
every aspect of human life and its relevance to almost all disciplines. Today business
decisions are based on statistical inferences, as the future is very uncertain. Though there are
many advantages in using statistical tools for decision making, distrust still prevails
because of the inadequacy, inaccuracy and manipulation of data, and the lack of expertise, skill and
knowledge to interpret the results.
1.13 ANSWERS TO CHECK YOUR PROGRESS
1. Inductive
2. Facts
3. Wise decisions
1.14 REFERENCES
1. Gupta S.P., Business Statistics, New Delhi: S Chand and Sons Publishers, 2000
2. Shahsi Kumar, Quantitative Techniques and Methods, Mysuru: Chetana Book House, 2010
3. Vignanesh Prajapathi, Big Data Analysis with R and Hadoop, Mumbai: Packt Publishing, 2013
4. S.D. Sharma, Operation Research, Delhi: Discovery Publishing House, 1997
5. Srinath L.S., PERT and CPM, Delhi: East West Press, 2001
6. Kalavathy, Operation Research, New Delhi: Vikas Publishing House, 2010
7. Richard I. Levin, Statistics for Management, New Delhi: Pearson Education India, 2008
UNIT 2 : ANALYSIS OF DATA
STRUCTURE
2.0 Objectives
2.1 Introduction
2.2 Types of Data
2.3 Data Collection and Sources of Primary and Secondary data
2.4 Classification and Tabulation of data
2.5 Summarizing the data – Frequency Distribution
2.6 Diagrammatic and Graphic Representation of Data
2.7 Summary
2.8 Key Words
2.9 Self Assessment Questions
2.10 Answer to Check Your Progress
2.11 Suggested references
2.0 OBJECTIVES
After studying this unit you should be able to :
Define statistical data;
Explain types of data and the sources of collecting the data;
Demonstrate the ability to present data in tabular form and
Describe Graphic presentation of data.
2.1 INTRODUCTION
Data is the information that is collected. Statistical data are the basic raw material of
statistics. Data may relate to an activity of interest, a problem, a phenomenon or a
situation under study. Data result from the process of measuring, counting or observing.
Statistical data therefore refer to those aspects of a problem situation that can be measured,
quantified, counted or classified. In any statistical investigation, the collection of
numerical data is the first and most important matter to be attended to. Often the
investigator will have to collect the data from the actual field of inquiry. For this he may
issue suitable questionnaires to get the necessary information, or he may conduct
personal interviews; interviews are more effective than questionnaires, which may not evoke an adequate
response. Sometimes the data may already be available in publications of Government
bodies or other public or private organizations. Such data,
however, are often so numerous that one’s mind can hardly comprehend their significance in the
form in which they are shown. It therefore becomes necessary to tabulate and summarize the
data into an easily manageable form. In doing so we may overlook details. But this is not a
serious loss, because statistics is interested not in an individual but in the properties of
aggregates. For a layman, presentation of the raw data in the form of tables or diagrams is
always more effective.
Prerequisites of statistical data:
a. It should be unambiguous
b. It should be specific and as per the objectives and scope of the study
c. It should be stable
d. It should be appropriate to the enquiry
e. It should be uniform
f. A definite degree of accuracy should be aimed at.
g. It should be apt and reliable.
2.2 TYPES OF DATA
A. Based on the characteristics, measured data can be classified into two broad
categories.
a. Quantitative data
b. Qualitative data
Quantitative data: Data that can be quantified in definite units of measurement are
called quantitative data. That is, successive measurements yield
quantifiable observations. Depending on the nature of the variable
observed or measured, these can be further categorized as continuous
data and discrete data. Continuous data represent the numerical values
of a continuous variable. A continuous variable is one that can assume
any value between any two points on a line segment and thus represents
an interval of values. For example: temperature, thickness, velocity,
height, weight etc. Discrete data are the values assumed by discrete
variables, that is, variables whose outcomes are counted in whole numbers. For example:
the number of customers visiting a store every day, the number of trains
arriving at a station, the number of defects in one consignment etc.
Qualitative data: Data related to the qualitative characteristics of a subject or object
are qualitative data. They are gathered through attributes
and are classified as nominal data and rank data. The count
data obtained from classification are called nominal data. For example:
classification of students according to gender, division of workers
by skill, education etc. Rank data are the result of assigning ranks.
For example: ranking by performance in an interview or an examination,
ranking by quality etc.
B. Based on the sources of data the data can be categorized as :
a. Primary data
b. Secondary data
Primary data: These are data that do not yet exist in any form and thus have to be
collected for the first time from primary sources. Since they are
collected for the first time, they are fresh data. They are
collected from the population or from a sample drawn from it.
Secondary data: These data already exist in some form, published or unpublished, in an
identifiable secondary source.
2.3 DATA COLLECTION AND SOURCES OF PRIMARY AND
SECONDARY DATA
Data collection is the act of assembling and gathering the needed numerical information
for a research. The process of counting or measuring together with the systematic recording
of results is called collection of statistical data. Collection can be from primary or secondary
source.
Preliminaries of Data Collection:
Objectives and scope of enquiry
Statistical units to be used
Sources of information
Method of data collection
Degree of accuracy aimed at in the final results
Type of enquiry
Since primary data do not exist in any form, the only source from which they can be collected is
a field survey of the population or of a sample drawn from the population. Primary data sources
can be internal or external.
Methods of collecting primary data are:
1. Personal Interviews
2. Direct personal Investigation
3. Indirect oral interviews
4. Information received through local agencies
5. Mailed questionnaire method
6. Schedules through enumerators
Secondary data sources may be broadly classified into two groups:
(i) Published sources : Official publications of the Central Government, publications of semi-
government statistical organizations, publications of research institutions and of commercial
and financial institutions, reports of various committees and commissions appointed
by the Government, newspapers and periodicals, and international publications are the
sources of secondary data in published form.
(ii) Unpublished sources : Records maintained by private firms, research carried
out by individuals, records of business concerns etc.
Sources may also be classified as internal and external, depending on whether the user of the data
is an insider or an outsider. As a caution, secondary data should be carefully
examined before use to see that they suit the objectives of the research or study.
These data are convenient to use, especially as supporting evidence.
Other advantages of secondary data are that they are easily available and convenient to
access, and that not much effort is needed to classify and tabulate them. The precaution
to be taken is to check that the data are reliable, suitable and adequate.
2.4 CLASSIFICATION AND TABULATION OF DATA
Classification: “Classified and arranged facts speak for themselves; unarranged, unorganised they
are as dead as mutton.” This quote is attributed to J.R. Hicks. The process of dividing the data into
different groups (viz. classes) which are homogeneous within but heterogeneous between
themselves is called classification. It helps in understanding the salient features of the data
and in comparing them with similar data. In the final analysis it is the best friend of a
statistician.
Methods Of Classification
The data is classified in the following ways:
1. According to attributes or qualities this is divided into two parts :
(A) Simple classification
(B) Multiple classifications.
2. According to variable or quantity, i.e. classification according to class intervals.
Qualitative Classification : When facts are grouped according to qualities (attributes)
like religion, literacy, business etc., the classification is called qualitative classification.
(A) Simple Classification : It is also known as classification according to dichotomy.
When data (facts) are divided into two groups according to the presence or absence of a quality, the classification is
called ‘Simple Classification’. Qualities are denoted by capital letters (A, B, C, D ......)
while the absence of these qualities is denoted by lower case letters (a, b, c, d, .... etc.). For
example,
(B) Manifold or multiple classification : In this method data are classified using two or
more qualities. First, the data are divided into two groups (classes) using one of the qualities.
Then, using the remaining qualities, the data are divided into different subgroups. For example,
the population of a country is classified using three attributes: sex, literacy and business as,
In classification according to class-intervals the following terms are used :
Class-limits : A class is formed within two values. These values are known as the class-limits of that class.
Magnitude of the class-interval : The difference between the upper
and lower limits of a class is called the magnitude or length or width of the class and is denoted
by ‘ i ’ or ‘ c ’.
Mid-value or class-mark : The arithmetical average of the two class limits
(i.e. the lower limit and the upper limit) is called the mid-value or the class mark of that
class-interval.
Class frequency : The units of the data belong to one or another of the groups or
classes. The number of units belonging to a class is known as the frequency of that class and is denoted
by fi or simply f.
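The terms above can be made concrete with a small Python sketch. The class-interval 10 - 20 and the frequency 7 are hypothetical values chosen only for illustration.

```python
# Hypothetical class-interval 10 - 20 containing 7 observations (illustrative values).
lower_limit = 10
upper_limit = 20
class_frequency = 7  # f: number of observations that fall in this class

width = upper_limit - lower_limit            # magnitude of the class-interval (i or c)
mid_value = (lower_limit + upper_limit) / 2  # class mark: average of the two class limits

print(f"Class {lower_limit} - {upper_limit}: width = {width}, "
      f"mid-value = {mid_value}, f = {class_frequency}")
```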
Classification is of two types according to the class-intervals - (i) Exclusive Method (ii)
Inclusive Method.
Exclusive Method : In this method the upper limit of a class becomes the lower limit of
the next class. It is called ‘exclusive’ because an item equal to the upper
limit of a class is not put in that class but in the next class, i.e. the upper limits of classes
are excluded from them. See table 1. For example, a person of age 20 years will not be
included in the class-interval ( 10 - 20 ) but in the next class ( 20 - 30 ), since the
class-interval ( 10 - 20 ) includes only values from 10 up to, but not including, 20.
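The exclusive rule can be sketched in Python as follows. The function name `exclusive_class` and the list of ages are hypothetical; the classes 10 - 20 and 20 - 30 follow the example above.

```python
def exclusive_class(value, lower_limits, width):
    """Return the (lower, upper) class that holds value under the exclusive method.

    A value equal to an upper limit goes to the next class, so each class
    covers lower <= value < upper.
    """
    for lower in lower_limits:
        upper = lower + width
        if lower <= value < upper:
            return (lower, upper)
    return None  # the value lies outside every class


# Classes 10 - 20, 20 - 30 and 30 - 40; the ages below are hypothetical.
lower_limits = [10, 20, 30]
for age in [10, 19, 20, 29, 30]:
    print(age, "->", exclusive_class(age, lower_limits, 10))
```

A person aged 20 is therefore placed in ( 20 - 30 ), not in ( 10 - 20 ).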
Inclusive Method : In this method the upper limit of a class-interval is included in that
class-interval, and the upper limit of each class is 1 less than the lower
limit of the next class-interval. In short, this method allows a class-interval to include both its
lower and upper limits. For example :
Table - 2
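The inclusive rule can be sketched in Python in the same way. The classes 10 - 19, 20 - 29 and 30 - 39 and the function name `inclusive_class` are assumptions made for illustration, and the sketch assumes whole-number data.

```python
def inclusive_class(value, classes):
    """Return the (lower, upper) class holding value under the inclusive method.

    Both limits belong to the class: lower <= value <= upper.
    """
    for lower, upper in classes:
        if lower <= value <= upper:
            return (lower, upper)
    return None  # the value lies outside every class


# Hypothetical classes; each upper limit is 1 less than the next lower limit.
classes = [(10, 19), (20, 29), (30, 39)]
for value in [10, 19, 20, 29, 39]:
    print(value, "->", inclusive_class(value, classes))
```

With fractional data a value such as 19.5 would fall between two classes, which is why this sketch is restricted to whole numbers.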
Open-end Class Intervals : When the lower limit of the first class-interval
or the upper limit of the last class-interval is not given, it has to be estimated. Subtract the class length of the
next class-interval from the upper limit of the first class-interval; this gives the lower limit of the
first class-interval. Similarly, add the class length of the preceding class-interval to the lower limit of the last
class-interval to obtain its upper limit. Always ensure that the lower limit of the first class (i.e. the lowest class) is
not negative. For example :
Table - 3
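One possible reading of the open-end rule is sketched below in Python. The function name `close_open_ends`, the use of `None` for a missing limit and the sample classes are all assumptions; the sketch borrows the width of the adjacent class and does not let the first lower limit fall below 0.

```python
def close_open_ends(classes):
    """Fill a missing first lower limit or last upper limit (given as None).

    Assumed reading of the rule: borrow the width of the adjacent class,
    and never let the first lower limit fall below 0.
    """
    closed = [list(c) for c in classes]
    if closed[0][0] is None:
        next_width = closed[1][1] - closed[1][0]
        closed[0][0] = max(0, closed[0][1] - next_width)
    if closed[-1][1] is None:
        prev_width = closed[-2][1] - closed[-2][0]
        closed[-1][1] = closed[-1][0] + prev_width
    return [tuple(c) for c in closed]


# Hypothetical distribution: "below 10", 10 - 20, 20 - 30, "30 and above".
open_classes = [(None, 10), (10, 20), (20, 30), (30, None)]
print(close_open_ends(open_classes))
```

For these classes the open ends become 0 - 10 and 30 - 40.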
Tabulation
It is the process of condensing the data for convenience in statistical processing,
presentation and interpretation of the information. A good table meets the following
requirements:
1. It should present the data clearly, highlighting important details.
2. It should save space yet be attractively designed.
3. The table number and title of the table should be given.
4. Row and column headings must explain the figures therein.
5. Averages or percentages should be close to the data.
6. Units of measurement should be clearly stated along with the titles or headings.
7. Abbreviations and symbols should be avoided as far as possible.
8. Sources of the data should be given at the bottom of the table.
9. In case irregularities creep into the table or any feature is not sufficiently explained,
references and footnotes must be given.
10. The rounding of figures should be unbiased.
Types of Tables: The important types of statistical table are as follows:
1. Single column or Single Row Tables
2. Multiple column or multiple row tables
3. Reference and Summary tables
Components of a Table: the structure or the components of the table should have:
1. Table Number
2. Title of the table
3. Head Notes
4. Stub and Stub Heads
5. Box Head and Sub Heads
6. Body of the table
7. Footnote
8. Source
Relative Frequency: The relative frequency of a class is the frequency of the class divided
by the total frequency of all the classes and is generally expressed as a percentage.
Cumulative Frequency: Many a time the frequencies of the different classes are not given;
only their cumulative frequencies are given. The total frequency of all values less than or
equal to the upper class boundary of a given class-interval is called the cumulative frequency
up to and including that class-interval. Frequencies cumulated in this way are called less than
cumulative frequencies; frequencies cumulated from the last class backwards are called more than
cumulative frequencies. For example,
Class-interval   0-10   10-20   20-30   30-40   40-50
Frequency           4       9       5      12      15
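Relative and cumulative frequencies for the frequency table in the example above can be worked out with a short Python sketch. The variable names are illustrative only.

```python
from itertools import accumulate

# Class-intervals and frequencies from the example above.
classes = ["0-10", "10-20", "20-30", "30-40", "40-50"]
frequencies = [4, 9, 5, 12, 15]

total = sum(frequencies)                                   # total frequency
relative = [100 * f / total for f in frequencies]          # relative frequency in percent
less_than = list(accumulate(frequencies))                  # cumulated from the first class
more_than = list(accumulate(reversed(frequencies)))[::-1]  # cumulated from the last class

print(f"{'Class':<8}{'f':>4}{'Rel. %':>9}{'Less than':>11}{'More than':>11}")
for row in zip(classes, frequencies, relative, less_than, more_than):
    print(f"{row[0]:<8}{row[1]:>4}{row[2]:>9.1f}{row[3]:>11}{row[4]:>11}")
```

The total frequency is 45, the less than cumulative frequencies are 4, 13, 18, 30 and 45, and the more than cumulative frequencies are 45, 41, 32, 27 and 15.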
Table -5
In this example the size difference from 2 to 7 is very small. If the range of a variate
is very large, it is inconvenient to prepare a frequency distribution for each value of the
variate. In such a case we divide the variate into convenient groups and prepare a table showing
the groups and their corresponding frequencies. Such a table is called a grouped frequency
distribution.
Consider the marks (out of 100 ) of 50 students as below :
40, 39, 43, 62, 30, 47, 33, 31, 17, 28
36, 29, 40, 32, 39, 24, 57, 42, 15, 30
50, 52, 47, 65, 31, 07, 37, 47, 17, 20
25, 53, 65, 85, 89, 56, 55, 41, 43, 10
44, 40, 69, 22, 40, 65, 39, 36, 71, 12
The range of the variate (marks) is very large, and we want to know the
performance of the students. The pass mark is 35. Marks from 35 to 44
form the third class (or grade), marks from 45 to 59 the second
class and marks from 60 to 100 the first class. Thus we have a grouped frequency distribution as:
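Table: Grouped frequency distribution of marks
Marks        Class          No. of students
Below 35     Fail           18
35 - 44      Third class    15
45 - 59      Second class   9
60 - 100     First class    8
Total                       50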
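The tally behind such a grouped frequency distribution can be sketched in Python. This is only an illustration: the function name `grade` and the label "Fail" for marks below 35 are assumptions, while the marks and class limits are those stated above.

```python
from collections import Counter

# Marks (out of 100) of the 50 students listed above.
marks = [
    40, 39, 43, 62, 30, 47, 33, 31, 17, 28,
    36, 29, 40, 32, 39, 24, 57, 42, 15, 30,
    50, 52, 47, 65, 31, 7, 37, 47, 17, 20,
    25, 53, 65, 85, 89, 56, 55, 41, 43, 10,
    44, 40, 69, 22, 40, 65, 39, 36, 71, 12,
]

def grade(mark):
    """Map a mark to its class using the limits given in the text."""
    if mark < 35:
        return "Fail"
    if mark <= 44:
        return "Third class"
    if mark <= 59:
        return "Second class"
    return "First class"

table = Counter(grade(m) for m in marks)

for label in ["Fail", "Third class", "Second class", "First class"]:
    print(f"{label:<14}{table[label]}")
print(f"{'Total':<14}{sum(table.values())}")
```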
2.6 DIAGRAMMATIC AND GRAPHIC REPRESENTATION OF DATA
It is not always easy for a layman to understand figures, nor is it interesting for
him. Apart from that, too many figures are often confusing. One of the most convincing and
appealing ways in which statistical results may be represented is through graphs and diagrams.
It is for this reason that diagrams are often used by businessmen, newspapers, magazines,
journals and government agencies, and also for advertising and educating people. The
graphic presentation of data can be done through:
1. Bar Diagrams
1) Simple ‘Bar diagram’:- It represents only one variable. For example sales, production,
population figures etc. for various years may be shown by simple bar charts. Since these are
of the same width and vary only in heights ( or lengths ), it becomes very easy for readers to
study the relationship. Simple bar diagrams are very popular in practice. A bar chart can be
either vertical or horizontal; vertical bars are more popular.
2) Sub - divided Bar Diagram:- While constructing such a diagram, the various components
in each bar should be kept in the same order. A common and helpful arrangement is that of
presenting each bar in the order of magnitude with the largest component at the bottom and
the smallest at the top. The components are shown with different shades or colors with a
proper index.
Illustration:- During 1968 - 71, the number of students in University ‘ X ‘ are as follows.
Represent the data by a similar diagram.
3) Multiple Bar Diagram:- This method can be used for data which is made up of two or
more components. In this method the components are shown as separate adjoining bars. The
height of each bar represents the actual value of the component. The components are shown
by different shades or colors. Where changes in actual values of component figures only are
required, multiple bar charts are used.
Illustration:- The table below gives data relating to the exports and imports of a certain
country X ( in thousands of dollars ) during the four years ending in 1930 - 31.
Year Export Import
1927 - 28 319 250
1928 - 29 339 263
1929 - 30 345 258
1930 - 31 308 206
Represent the data by a suitable diagram
2. Pie Chart
Geometrically, the area of a sector of a circle, taken radially, is
proportional to the angle at its center. It is therefore sufficient to draw angles at the center
proportional to the original figures. This makes the areas of the sectors proportional to
the basic figures.
For example, if the total is 1000 and one of the components is 200, then the angle will be
$\frac{200}{1000} \times 360^\circ = 72^\circ$.
iii) As an example, consider the yearly expenditure of Mr. Ted, a college undergraduate.
Item                  Expenditure     Angle at the center
Tuition fees          $ 6000          (6000 / 30000) x 360 = 72 degrees
Books and lab.        $ 2000          (2000 / 30000) x 360 = 24 degrees
Clothes / cleaning    $ 2000          (2000 / 30000) x 360 = 24 degrees
Room and boarding     $ 12000         (12000 / 30000) x 360 = 144 degrees
Transportation        $ 3000          (3000 / 30000) x 360 = 36 degrees
Insurance             $ 1000          (1000 / 30000) x 360 = 12 degrees
Sundry expenses       $ 4000          (4000 / 30000) x 360 = 48 degrees
Total expenditure     $ 30000         360 degrees
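The sector angles for Mr. Ted's expenditure follow from the rule that each angle is the component's share of the total multiplied by 360 degrees. A minimal Python sketch, with illustrative variable names, is:

```python
# Yearly expenditure of Mr. Ted, from the example above (in dollars).
expenditure = {
    "Tuition fees": 6000,
    "Books and lab.": 2000,
    "Clothes / cleaning": 2000,
    "Room and boarding": 12000,
    "Transportation": 3000,
    "Insurance": 1000,
    "Sundry expenses": 4000,
}

total = sum(expenditure.values())

# Angle at the center = (component / total) * 360 degrees.
angles = {item: 360 * amount / total for item, amount in expenditure.items()}

for item, angle in angles.items():
    print(f"{item:<20}{angle:6.1f} degrees")
```

The angles should add up to 360 degrees; if they do not, one of the components has been mistyped.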
Uses:- A pie diagram is useful when we want to show relative positions ( proportions ) of the
figures which make the total. It is also useful when the components are many in number.
3. Graphs
A graph is a visual representation of data by a continuous curve on a squared ( graph )
paper. Like diagrams, graphs are also attractive, and eye-catching, giving a bird’s eye-view of
data and revealing their inner pattern.
Graphs of Frequency Distributions:-
The methods used to represent a grouped data are :-
1. Histogram
2. Frequency Polygon
3. Frequency Curve
4. Ogive or Cumulative Frequency Curve
1. Histogram :- It is defined as a pictorial representation of a grouped frequency
distribution by means of adjacent rectangles, whose areas are proportional to the
frequencies.
For example, in a book sale, you want to determine which books were most popular:
the high priced books, the low priced books, the books most neglected etc. Let us say you sold
a total of 31 books at this book-fair at the following prices.
Rs. 2, Rs. 1, Rs. 2, Rs. 2, Rs. 3, Rs. 5, Rs. 6, Rs. 17, Rs. 17, Rs. 7, Rs. 15, Rs. 7, Rs. 7, Rs. 18,
Rs. 8, Rs. 10, Rs. 10, Rs. 9, Rs. 13, Rs. 11, Rs. 12, Rs. 12, Rs. 12, Rs. 14, Rs. 16, Rs. 18,
Rs. 20, Rs. 24, Rs. 21, Rs. 22, Rs. 25.
The prices range from Rs. 1 to Rs. 25. Divide this range into a number of groups, or class
intervals. Typically, between 5 and 20 class-intervals work best for a frequency
histogram. We then have the following distribution of books at the book-fair.
Class-interval      Frequency
Rs. 1 - Rs. 5       6
Rs. 6 - Rs. 10      8
Rs. 11 - Rs. 15     7
Rs. 16 - Rs. 20     6
Rs. 21 - Rs. 25     4
Total               n = $\sum f_i$ = 31
Note that each class-interval is of equal width, i.e. Rs. 5 inclusive. Now we draw the
frequency Histogram as under.
Relative Frequency Histogram:- It uses the same data. The only difference is that it
compares each class-interval with the total number of items i.e. instead of the frequency of
each class-interval, their relative frequencies are used. Naturally the vertical axis
(i.e. y-axis) uses the relative frequencies in places of frequencies.
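As a rough sketch of how the heights of the two histograms are obtained, the listed book prices can be tallied into the five inclusive class-intervals and then divided by the total. The variable names below are illustrative assumptions.

```python
# Prices of the 31 books sold at the book-fair, as listed above.
prices = [
    2, 1, 2, 2, 3, 5, 6, 17, 17, 7, 15, 7, 7, 18,
    8, 10, 10, 9, 13, 11, 12, 12, 12, 14, 16, 18,
    20, 24, 21, 22, 25,
]

# Inclusive class-intervals of equal width 5.
intervals = [(1, 5), (6, 10), (11, 15), (16, 20), (21, 25)]

# Frequency: number of prices falling in each class-interval.
counts = [sum(1 for p in prices if low <= p <= high) for low, high in intervals]

# Relative frequency: frequency divided by the total number of items.
n = len(prices)
relative = [c / n for c in counts]

for (low, high), c, r in zip(intervals, counts, relative):
    print(f"{low:>2} - {high:<2}  f = {c:<3} relative f = {r:.3f}")
```

The relative frequencies always sum to 1, so the two histograms have the same shape and differ only in the scale of the vertical axis.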
2 Frequency Polygon:- Here the frequencies are plotted against the mid-points of the
class-intervals and the points thus obtained are joined by line segments.
Example : -
Height in cm. 150 - 154 154 - 158 158 - 162 162 - 166 166 - 170
No. of children 10 15 20 12 8
The polygon is closed at the base by extending it on both its sides ( ends ) to the
midpoints of two hypothetical classes, at the extremes of the distribution, with zero
frequencies.
On comparing the Histogram and a frequency polygon, you will notice that, in frequency
polygons the points replace the bars ( rectangles ). Also, when several distributions are to be
compared on the same graph paper, frequency polygons are better than Histograms.
3) Frequency Distribution (Curve):- Frequency distribution curves are like frequency
polygons. In frequency distribution, instead of using straight line segments, a smooth curve
is used to connect the points. The frequency curve for the above data is shown as:
4. Ogives or Cumulative Frequency Curves:- When frequencies are added, they are called
cumulative frequencies. The curve obtained by plotting cumulative frequencies is called a
cumulative frequency curve or an ogive (pronounced ‘ojive’).
To construct an Ogive:-
1) Add up the progressive totals of frequencies, class by class, to get the cumulative
frequencies.
2) Plot classes on the horizontal ( x-axis ) and cumulative frequencies on the vertical
( y-axis).
3) Join the points by a smooth curve. Note that a less than ogive starts at zero on the
vertical axis and ends at the outer class limit of the last class. In most cases it looks
like an ‘S’. Note also that cumulative frequencies are plotted against the ‘limits’ of the
classes to which they refer.
(A) Less than Ogive:- To plot a less than ogive, the data is arranged in ascending order of
magnitude and the frequencies are cumulated starting from the top. It starts from zero
on the y-axis and the lower limit of the lowest class interval on the x-axis.
(B) Greater than Ogive:- To plot this ogive, the data are arranged in ascending order
of magnitude and the frequencies are cumulated from the bottom. This curve ends at zero
on the y-axis and at the upper limit of the highest class interval on the x-axis.
Illustrations:- On a graph paper, draw the two ogives for the data given below of the I.Q. of
160 students.
Class -intervals :60 - 70 70 - 80 80 - 90 90 - 100 100 - 110
No. of students : 2 7 12 28 42
110 - 120 120 - 130 130 - 140 140 - 150 150 - 160
36 18 10 4 1
Uses : - Certain values like the median, quartiles, deciles, quartile deviation and coefficient of
skewness can be located using ogives. An ogive can also be used to find the percentage of items
having values less than or greater than a certain value. Ogives are helpful in the comparison of
two distributions.
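Reading a value off a less than ogive amounts to linear interpolation between the plotted points. The sketch below applies this to the I.Q. data of the illustration above; the helper name `value_at` is an assumption, and straight-line segments stand in for the smooth curve.

```python
from itertools import accumulate

# Class boundaries and frequencies for the I.Q. of 160 students.
boundaries = [60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160]
frequencies = [2, 7, 12, 28, 42, 36, 18, 10, 4, 1]

# Less than cumulative frequencies, starting from zero at the first boundary.
cumulative = [0] + list(accumulate(frequencies))
total = cumulative[-1]

def value_at(rank):
    """Read the x-value at a given cumulative frequency off the ogive."""
    for i in range(1, len(cumulative)):
        if cumulative[i] >= rank:
            low, high = boundaries[i - 1], boundaries[i]
            below = cumulative[i - 1]
            return low + (rank - below) / frequencies[i - 1] * (high - low)
    raise ValueError("rank exceeds the total frequency")

median = value_at(total / 2)
q1 = value_at(total / 4)
q3 = value_at(3 * total / 4)
print(round(q1, 2), round(median, 2), round(q3, 2))
```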
2.7 SUMMARY
Data is the information that is collected; it is the raw material of statistics. Data
has to be collected in a systematic manner from the right source, primary or secondary.
The collected data should then be classified and tabulated for further processing. Statistical
data can be presented through frequency distributions or through graphs so that it can be
understood and processed easily.
2.8 ANSWER TO CHECK YOUR PROGRESS
1. Exclusive
2. External & Internal
3. Qualitative data
5. What problems do unequal class intervals create ? Explain
6. What do you understand by classification and tabulation of data ? Discuss the modes of
classification.
7. Prepare a frequency distribution from the following figures relating to bonus paid to
workers
BONUS IN (Rs.)
86 62 58 73 101 90 84 90 76 61 84 63 56 88
72 92 60 83 102 76 99 54 64 87 103 61 88 55
2.11 REFERENCES
1. Gupta S.P. Business Statistics, New Delhi: S Chand and Sons Publishers, 2000
2. Shahsi Kumar. Quantitative Techniques and methods, Mysuru: Chetana Book House,
2010
3. Vignanesh Prajapathi, Big data Analysis With R and Hadoop, Mumbai: Packt Publishing,
2013
4. SD Sharma, Operation Research, Delhi: Discovery Publishing House, 1997
5. Srinath L. S, PERT and CPM, Delhi: East West Press,2001
6. Kalavathy, Operation Research , New Delhi: Vikas Publishing House, 2010
7. Richard I. Levin. Statistics for Management, New Delhi: Pearson education India, 2008
UNIT -3 : MEASURES OF CENTRAL TENDENCY
STRUCTURE
3.0 Objectives
3.1 Introduction
3.2 Arithmetic mean and its computation
3.3 Weighted Arithmetic mean and its computation
3.4 Geometric mean and its computation
3.5 Harmonic mean and its computation
3.6 Median
3.7 Mode
3.8 Relationship between Mean , Median and Mode
3.9 Answer to Check Your Progress
3.10 Summary
3.11 Key Words
3.12 Self-Assessment Questions
3.13 References
3.0 OBJECTIVES
After studying this unit you should be able to :
Explain the measures of central tendency;
Compute Mean , Median and Mode;
Identify the merits and demerits of computing the Mean, Median and Mode and
Analyze the relationship between the three measures
3.1 INTRODUCTION
In the previous unit, we studied how to collect raw data and how to classify and
tabulate it in a useful form, which contributes to solving many problems of statistical concern.
Yet this is not sufficient, for in practice there is need for further condensation,
particularly when we want to compare two or more different distributions. We may reduce
the entire distribution to one number which represents the distribution.
A single value which can be considered as typical or representative of a set of observations
and around which the observations can be considered as centered is called an ‘Average’ (or
average value) or a center of location. Since such typical values tend to lie centrally within
a set of observations when arranged according to magnitude, averages are called measures
of central tendency. In fact a distribution has a typical value (average) about which the
observations are more or less symmetrically distributed. This is of great importance, both
theoretically and practically. Dr. A.L. Bowley correctly stated, “Statistics may rightly be
called the science of averages.” The word average is commonly used in day-to-day
conversations. For example, we may say that Abert is an average boy of my class; we may
talk of an average American, average income, etc. When it is said, “Abert is an average student,”
it means that he is neither very good nor very bad, but a mediocre student. However, in
statistics the term average has a different meaning.
There is a peculiar tendency of the data to cluster or center around a specific value. On
the whole the observations tend to be closer to one particular value than to others. This
tendency of the data is called central tendency. Thus a measure of central tendency of a
set of data lies in obtaining this central value.
The fundamental measures of central tendency are:
(1) Arithmetic mean
(2) Median
(3) Mode
(4) Geometric mean
(5) Harmonic mean
(6) Weighted averages
However the most common measures of central tendencies or Locations are
Arithmetic mean, median and mode.
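3.2 ARITHMETIC MEAN AND ITS COMPUTATION
Individual Series : The arithmetic mean of n observations $x_1, x_2, \ldots, x_n$ is
$\bar{x} = \frac{\sum x_i}{n}$ by the direct method, or $\bar{x} = A + \frac{\sum u_i}{n}$ by the
short-cut method, where A is an assumed mean and $u_i = x_i - A$.
Example : A variable takes the values given below. Calculate the arithmetic mean of 110,
117, 129, 195, 95, 100, 100, 175, 250 and 750.
Solution: By the direct method,
$\sum x_i$ = 110 + 117 + 129 + 195 + 95 + 100 + 100 + 175 + 250 + 750 = 2021
Arithmetic mean = $\bar{x} = \frac{\sum x_i}{n} = \frac{2021}{10} = 202.1$
By the short-cut method, taking the assumed mean A = 175,
$u_i = x_i - A$ = -65, -58, -46, +20, -80, -75, -75, 0, +75, +575, so $\sum u_i$ = 670 - 399 = 271
$\bar{x} = A + \frac{\sum u_i}{n} = 175 + \frac{271}{10} = 175 + 27.1 = 202.1$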
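The direct and short-cut calculations for this example can be checked with a few lines of Python. This is a sketch: the assumed mean A = 175 is inferred from the listed deviations, and the variable names are illustrative.

```python
# Values of the variable from the example above.
values = [110, 117, 129, 195, 95, 100, 100, 175, 250, 750]
n = len(values)

# Direct method: sum of the values divided by their number.
direct_mean = sum(values) / n

# Short-cut method: deviations u = x - A from an assumed mean A.
A = 175
deviations = [x - A for x in values]
shortcut_mean = A + sum(deviations) / n

print(sum(values), sum(deviations))
print(direct_mean, shortcut_mean)
```

Both methods must give the same mean whatever value of A is chosen; the assumed mean only makes the hand arithmetic lighter.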
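Discrete Series :
The formulae for the arithmetic mean by the direct method and by the short-cut method are
as follows:
Direct method : $\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$
Short-cut method : $\bar{x} = A + \frac{\sum f_i u_i}{\sum f_i}$, where A is the assumed mean
and $u_i = x_i - A$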
Example Find the mean of the following 50 observations.
19, 19, 20, 20, 20, 19, 20, 18, 21, 19,
20, 20, 19, 19, 20, 19, 21, 19, 19, 21,
18, 20, 18, 18, 17, 20, 20, 22, 20, 20,
20, 20, 20, 21, 20, 17, 23, 18, 17, 21,
20, 21, 20, 20, 20, 18, 21, 19, 21, 19
Solution: We may tabulate the given observations as follows.
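A sketch of the tabulation in Python is given below: it builds the frequency table of the 50 observations and applies the direct method for a discrete series. `Counter` is from the standard library; the other names are illustrative.

```python
from collections import Counter

# The 50 observations from the example above.
observations = [
    19, 19, 20, 20, 20, 19, 20, 18, 21, 19,
    20, 20, 19, 19, 20, 19, 21, 19, 19, 21,
    18, 20, 18, 18, 17, 20, 20, 22, 20, 20,
    20, 20, 20, 21, 20, 17, 23, 18, 17, 21,
    20, 21, 20, 20, 20, 18, 21, 19, 21, 19,
]

# Frequency table: each distinct value x with its frequency f.
frequency = Counter(observations)

total_f = sum(frequency.values())
total_fx = sum(x * f for x, f in frequency.items())

for x in sorted(frequency):
    print(x, frequency[x], x * frequency[x])

# Direct method for a discrete series: sum(f * x) / sum(f).
mean = total_fx / total_f
print("Mean =", mean)
```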
Example The weights (in gms) of 30 articles are given below :
14, 16, 16, 14, 22, 13, 15, 24, 23, 14, 20, 17, 21, 18, 18, 19, 20, 17, 16, 15, 11, 22, 21, 20,
17, 18, 19, 12, 23,11.
Form a grouped frequency table, by dividing the variate range into intervals of equal width,
one class being 11-13 and then compute the arithmetic mean.
Solution:
Example Find the arithmetic mean for the following :
Properties Of Arithmetic Mean
1. The sum of the deviations, of all the values of x, from their arithmetic mean, is zero.
2. The product of the arithmetic mean and the number of items gives the total of all
items.
3. If $\bar{x}_1$ and $\bar{x}_2$ are the arithmetic means of two samples of sizes $n_1$ and $n_2$ respectively, then
the arithmetic mean of the distribution combining the two can be calculated as
$\bar{x}_{12} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$
3.3 WEIGHTED ARITHMETIC MEAN AND ITS COMPUTATION
When individual observations vary in importance, they are assigned weights according
to the level of importance of each in the computation of their mean. The arithmetic mean of
a set of observations computed by taking their corresponding weights into account is
known as the weighted arithmetic mean or weighted average.
Weighted A.M. = $\frac{\sum w_i x_i}{\sum w_i}$
3.4 GEOMETRIC MEAN AND ITS COMPUTATION
GM = $(x_1 x_2 \cdots x_n)^{1/n}$ = antilog $\left( \frac{\sum \log x_i}{n} \right)$
GM is particularly used in averaging ratios and percentages and rates of change in one period
over the other.
3.5 HARMONIC MEAN AND ITS COMPUTATION
HM = $\frac{n}{\sum (1/x_i)}$ and Weighted HM = $\frac{\sum w_i}{\sum (w_i / x_i)}$
Harmonic mean is particularly useful in averaging rates and ratios. It is the appropriate
average where the unit of observation (per hour, per day etc.) remains the same and
the act being performed, such as covering a distance, is constant.
For three observations a, b and c, for example,
HM = $\frac{3}{\frac{1}{a} + \frac{1}{b} + \frac{1}{c}}$
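Example : Ravi drives a car at 20 km/h for the first half of a journey and at 30 km/h for the
second half. What is his average speed?
Solution: Since equal distances are covered at the two speeds, the average speed is the
harmonic mean: HM = $\frac{2}{\frac{1}{20} + \frac{1}{30}} = \frac{2 \times 60}{5}$ = 24 km/h.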
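The three averages of this part of the unit can be sketched as small Python functions. The function names are illustrative assumptions; the marks, weights and the pair 4, 9 in the usage lines are hypothetical figures chosen only to show the calls, while 20 and 30 are Ravi's two speeds.

```python
import math

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w * x) / sum(w)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def geometric_mean(values):
    """Geometric mean via logarithms: antilog of the mean of the logs."""
    return math.exp(sum(math.log(x) for x in values) / len(values))

def harmonic_mean(values):
    """Harmonic mean: n divided by the sum of the reciprocals."""
    return len(values) / sum(1 / x for x in values)

# Hypothetical marks 60 and 80 with weights 1 and 3.
print(weighted_mean([60, 80], [1, 3]))

# Hypothetical growth ratios 4 and 9.
print(geometric_mean([4, 9]))

# Ravi's speeds over two equal distances.
print(harmonic_mean([20, 30]))
```

Natural logarithms are used here instead of the log tables implied by ‘antilog’; the base does not affect the result as long as the same base is used in both directions.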
3.6 MEDIAN
It is the value of the size of the central item of the arranged data (data arranged in
ascending or descending order). Thus, it is the value of the middle item and divides the
series into two equal parts. In Connor’s words, “The median is that value of the variable which
divides the group into two equal parts, one part comprising all values greater and the other
all values lesser than the median.” For example, the daily wages of 7 workers are 5, 7, 9, 11,
12, 14 and 15 dollars. This series contains 7 terms. The fourth term, i.e. $11, is the median.
Median In Individual Series (ungrouped Data)
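1. Set the individual series either in ascending (increasing) or in descending
(decreasing) order of the size of its items or observations.
2. If the total number of observations is ‘n’, then
A. If ‘n’ is odd, the median = size of the $\left( \frac{n+1}{2} \right)$th item.
B. If ‘n’ is even, the median = $\frac{1}{2}$ [ size of the $\left( \frac{n}{2} \right)$th item + size of the $\left( \frac{n}{2} + 1 \right)$th item ].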
Example The following figures represent the number of books issued at the counter of a
Statistics library on 11 different days. 96, 180, 98, 75, 270, 80, 102, 100, 94, 75 and 200.
Calculate the median.
Solution:
Arrange the data in ascending order: 75, 75, 80, 94, 96, 98, 100, 102, 180, 200, 270.
The total number of items ‘n’ = 11 (odd).
Median = size of the $\left( \frac{11+1}{2} \right)$th item
= size of the 6th item
= 98 books per day
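When ‘n’ is even, the median is the mean of the two middle items. For a series of 36 items
(values in thousands) whose 18th and 19th items are 213 and 239,
Median = $\frac{1}{2}$ [ size of 18th item + size of 19th item ]
= $\frac{1}{2}$ ( 213 + 239 )
= $\frac{1}{2}$ ( 452 )
= 226 thousands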
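The odd and even rules for an individual series can be sketched as one short Python function. The function name is illustrative; the first call uses the library data from the example, and the second uses a hypothetical series of six wages to show the even case.

```python
def median(values):
    """Median of an individual series by the odd/even rule."""
    ordered = sorted(values)          # step 1: arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the ((n + 1) / 2)th item (index mid, counting from 0).
        return ordered[mid]
    # n even: mean of the (n / 2)th and (n / 2 + 1)th items.
    return (ordered[mid - 1] + ordered[mid]) / 2

books = [96, 180, 98, 75, 270, 80, 102, 100, 94, 75, 200]
print(median(books))

# Hypothetical even-sized series, for illustration only.
print(median([5, 7, 9, 11, 12, 14]))
```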
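Median In Discrete Series : Steps :
1. Arrange the data in ascending or descending order.
2. Find the cumulative frequencies.
3. Apply the formula :
Median = size of the $\left( \frac{N+1}{2} \right)$th item, where $N = \sum f$
Example Locate the median in the following distribution.
Size : 8 10 12 14 16 18 20
Frequency : 7 7 12 28 10 9 6
Solution:
Size : 8 10 12 14 16 18 20
Frequency : 7 7 12 28 10 9 6
Cumulative frequency : 7 14 26 54 64 73 79
Here N = 79. Therefore, the median = size of the $\left( \frac{79+1}{2} \right)$th item = size of the 40th item.
In the order of the cumulative frequencies, the 40th item is included in the cumulative
frequency 54, whose size is 14.
Therefore, the median = 14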
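The three steps for a discrete series can be sketched in Python as follows. The variable names are illustrative assumptions; the sizes and frequencies are those of the example.

```python
from itertools import accumulate

# Sizes (already in ascending order) and their frequencies.
sizes = [8, 10, 12, 14, 16, 18, 20]
frequencies = [7, 7, 12, 28, 10, 9, 6]

# Step 2: cumulative frequencies.
cumulative = list(accumulate(frequencies))
n = cumulative[-1]

# Step 3: the median is the size of the ((N + 1) / 2)th item, i.e. the
# first size whose cumulative frequency reaches that rank.
rank = (n + 1) / 2
median = next(size for size, cf in zip(sizes, cumulative) if cf >= rank)

print(cumulative)
print("Median =", median)
```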
Median In Continuous Series (grouped Data)
Steps :
1. Determine the particular class in which the value of the median lies. Use $\frac{N}{2}$ as the
rank of the median item, where N is the total frequency.
2. After ascertaining the class in which the median lies, the following formula is used for
determining the exact value of the median.
Median = $l_1 + \frac{\frac{N}{2} - c.f.}{f} \times (l_2 - l_1)$
where
$l_1$ = lower limit of the median class, the class in which the middle item of the distribution lies
$l_2$ = upper limit of the median class
c.f. = cumulative frequency of the class preceding the median class
f = simple frequency of the median class
It should be noted that while interpolating the median value of a frequency distribution it is
assumed that the variable is continuous and that there is an orderly and even distribution of
items within each class.
Example Calculate the median for the following and verify it graphically.
Age (years) : 20-25 25-30 30-35 35-40 40-45
No. of person : 70 80 180 150 20
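Solution:
Age (years) : 20-25 25-30 30-35 35-40 40-45
No. of persons : 70 80 180 150 20
Cumulative frequency : 70 150 330 480 500
Here N = 500 and $\frac{N}{2}$ = 250, so the median lies in the class 30-35.
Median = $l_1 + \frac{\frac{N}{2} - c.f.}{f} \times (l_2 - l_1) = 30 + \frac{250 - 150}{180} \times 5$
Therefore, Median = 30 + 2.78 = 32.78 years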
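The interpolation for a continuous series can be sketched as a small Python function applied to the age distribution of the example. The function name and argument layout are assumptions, and equal class widths are assumed.

```python
def grouped_median(lower_limits, width, frequencies):
    """Median of a continuous series by interpolation in the median class."""
    half = sum(frequencies) / 2       # N / 2, the rank of the median item
    cf = 0                            # cumulative frequency of preceding classes
    for lower, f in zip(lower_limits, frequencies):
        if cf + f >= half:
            # Median = l1 + (N/2 - c.f.) / f * class width
            return lower + (half - cf) / f * width
        cf += f
    raise ValueError("empty distribution")

lower_limits = [20, 25, 30, 35, 40]   # ages 20-25, 25-30, ..., 40-45
frequencies = [70, 80, 180, 150, 20]

median_age = grouped_median(lower_limits, 5, frequencies)
print(round(median_age, 2))
```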
Note that, while calculating the median of a series, it must be put in the ‘exclusive class-
interval’ form. If the original series is in inclusive type, first convert it into the exclusive
type and then find its median.
Merits Of Median
1. It is rigidly defined.
2. It is easy to calculate and understand.
3. Unlike the arithmetic mean, it is not affected by extreme values.
4. It can be found by mere inspection.
5. It is fully representative and can be computed easily.
6. It can be used for qualitative studies.
7. Even if the extreme values are unknown, median can be calculated if one knows the
number of items.
8. It can be obtained graphically.
Demerits of Median
1. It may not be representative if the distribution is irregular and abnormal.
2. It is not capable of further algebraic treatment.
3. It is not based on all observations.
4. It is affected by sampling fluctuations.
5. The arrangement of the data in the order of magnitude is absolutely necessary.
Check Your Progress
1. Before determining the median, the data has to be arranged in ............... order.
2. The three different types of mean are ..............
3. $\sum f$ represents ......................
3.7 MODE
It is the size of that item which possesses the maximum frequency. According to
Professor Kenney and Keeping, the value of the variable which occurs most frequently in a
distribution is called the mode. It is the most common value. It is the point of maximum
density.
Ungrouped Data
Individual series: The mode of this series can be obtained by mere inspection. The number
which occurs most often is the mode.
Example Locate mode in the data 7, 12, 8, 5, 9, 6, 10, 9, 4, 9, 9
Solution : On inspection, it is observed that the number 9 has maximum frequency. Therefore
9 is the mode.
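Grouped Data: Steps :
1. Determine the modal class, which has the maximum frequency.
2. By interpolation the value of the mode can be calculated as
Mode = $l_1 + \frac{f_1 - f_0}{2 f_1 - f_0 - f_2} \times (l_2 - l_1)$
where $l_1$ and $l_2$ are the lower and upper limits of the modal class, $f_1$ is the frequency of
the modal class, and $f_0$ and $f_2$ are the frequencies of the classes preceding and succeeding it.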
Solution:
Here the maximum frequency is 12, corresponding to the class interval (35 - 40), which is the
modal class.
Therefore, by interpolation, the mode is 35 plus the interpolated part of the class width 5.
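A minimal Python sketch of the interpolation for the mode is given below. Because the frequency table of this example is not reproduced here, the sketch reuses the age distribution from the median example in Section 3.6; the function name is an assumption, and equal class widths are assumed.

```python
def grouped_mode(lower_limits, width, frequencies):
    """Mode of a continuous series by interpolation in the modal class."""
    i = frequencies.index(max(frequencies))     # position of the modal class
    f1 = frequencies[i]                         # frequency of the modal class
    f0 = frequencies[i - 1] if i > 0 else 0     # preceding class
    f2 = frequencies[i + 1] if i < len(frequencies) - 1 else 0  # succeeding class
    # Mode = l1 + (f1 - f0) / (2*f1 - f0 - f2) * class width
    # (the denominator is zero only if all three frequencies are equal)
    return lower_limits[i] + (f1 - f0) / (2 * f1 - f0 - f2) * width

# Age distribution from the median example: 20-25, 25-30, ..., 40-45.
lower_limits = [20, 25, 30, 35, 40]
frequencies = [70, 80, 180, 150, 20]

modal_age = grouped_mode(lower_limits, 5, frequencies)
print(round(modal_age, 2))
```

For this distribution the modal class is 30-35, and the mode falls nearer its upper limit because the succeeding class is more frequent than the preceding one.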
Merits of mode
1. It is simple to calculate.
2. In individual or discrete distribution it can be located by mere inspection.
3. It is easy to understand. Everyone is used to the idea of the average size of a garment, an
average American etc.
4. It is not isolated like the median, as it is the most common item.
5. Unlike the arithmetic mean, it is a value which can actually be found in the series.
6. It is not necessary to know all the items. What we need is the point of maximum
frequency density.
7. It is not affected by sampling fluctuations.
Demerits
1. It is ill defined.
2. It is not based on all observations.
3. It is not capable of further algebraic treatment.
4. It is not a good representative of the data.
5. Sometimes there is more than one value of the mode.
Comparison of the three measures:
1. The Mean is the most familiar and widely used measure of central tendency as it takes into
account all observations in its computation. The presence of extreme values affects the
Mean more than the Median and the Mode. The Mean is used more in symmetric
distributions.
2. The Median is easier to understand and compute when the data set is relatively small.
Extreme values do not affect the Median as much as they affect the Mean, and therefore it is
frequently used as the best measure of central tendency in asymmetric distributions.
3. The Mode is the least used measure of central tendency. It is very easy to compute. It can
be used for both quantitative and qualitative data. A little care should be taken in
computing the Mode because not every distribution has a mode, and there may be two
modes present in one distribution.
3.9 ANSWER TO CHECK YOUR PROGRESS
1. Ascending
2. Arithmetic, Geometric & Harmonic
3. Sum of Frequencies
3.10 SUMMARY
Measures of central tendency are basic to statistics. They are an attempt to find
the central value of a given set of data. The idea behind determining such a typical value is to
use it as a representative of the entire data. There are three main measures of central tendency:
Mean, Median and Mode. The other measures are the Geometric Mean and the Harmonic Mean.
The Mean is also called the Average. The Median is a positional average: the middle value of an
ordered array of observations. The Mode is also a positional average; it is the value
which appears the maximum number of times.
3.12 SELF ASSESSMENT QUESTIONS
1. What are the various Measures of Central Tendency? Explain each in detail.
2. Calculate the average value of age for a class of 10 students with their ages as under
11,12, 13, 13, 10, 13, 12, 11, 10, 12.
3. From the following calculate the average level of marks of the class.
Marks: 0 2 3 4 5 6 7 8 9
Number of students: 11 10 9 21 12 17 8 22 15
4. Given below is the distribution of marks obtained by 60 students in final exams.
Compute a. Mean, b. Median, c. Mode
Marks: 20 30 40 50 60 70
Number of students: 8 12 20 10 6 4
5. From the frequency distribution given below find Mean, Median and Mode
Class intervals : 50-52 53-55 56-58 59-61 62-64
Frequencies: 5 10 21 8 6
6. The average sales of a product for a particular week, excluding Sunday, were 150 units.
On Sunday there was a rush of sales which raised the average sales for the entire week
to 210 units. Find the sales for Sunday.
7. Find the GM of 5 sample observations: 28, 45, 50, 65, and 90.
8. Obtain the HM of 5 samples: 4, 20, 12, 10 and 15
9. Calculate the Arithmetic Mean by Step Deviation Method for the following data:
Class – Intervals Mid Points (X) Frequency (f)
0-10 5 7
10-20 15 9
20-30 25 15
30-40 35 11
40-50 45 27
50-60 55 18
60-70 65 5
10. Given the following distribution calculate Mean Median and Mode and also show
their empirical relationship.
Pay Scale (Rs) No. of employees
Less than 2000 14
Less than 3000 19
Less than 4000 26
Less than 5000 35
5000 and above 42
3.13 REFERENCES
1. Gupta S.P. Business Statistics, New Delhi: S Chand and Sons Publishers, 2000
2. Shahsi Kumar. Quantitative Techniques and methods, Mysuru: Chetana Book House,
2010
3. Vignanesh Prajapathi, Big data Analysis With R and Hadoop, Mumbai: Packt Publishing,
2013
4. SD Sharma, Operation Research, Delhi: Discovery Publishing House, 1997
5. Srinath L. S, PERT and CPM, Delhi: East West Press,2001
6. Kalavathy, Operation Research , New Delhi: Vikas Publishing House, 2010
7. Richard I. Levin. Statistics for Management, New Delhi: Pearson education India, 2008
UNIT 4 : MEASURES OF DISPERSION
STRUCTURE
4.0 Objectives
4.1 Introduction
4.2 Range
4.3 Quartile Deviation and Computations
4.4 Mean Deviation
4.5 Variance
4.6 Standard Deviation
4.7 Coefficient of Variation
4.8 Skewness
4.9 Kurtosis
4.10 Check Your Progress
4.11 Summary
4.12 Key Words
4.13 Self-Assessment Questions
4.14 References
4.0 OBJECTIVES
After studying this unit you should be able to :
Explain measures of Dispersion ;
Analyse and compute each measure of dispersion and
Identify the relationship and significance of each of the measures
4.1 INTRODUCTION
The measures of central tendency, such as the mean, indicate the general magnitude of the
data and locate only the center of a distribution. They do not establish the
degree of variability, spread or scatter of the individual items, or their deviation
from (or difference with) the mean.
i) According to Neiswanger, “Two distributions of statistical data may be symmetrical
and have common means, medians and modes and identical frequencies in the modal
class. Yet with these points in common they may differ widely in the scatter or in their
values about the measures of central tendencies.”
ii) Simpson and Kafka said, “An average alone does not tell the full story. It is hardly
fully representative of a mass, unless we know the manner in which the individual
items scatter around it .... a further description of a series is necessary, if we are to
gauge how representative the average is.”
From this discussion we now focus our attention on the scatter or variability which is
known as dispersion. Let us take the following three sets of marks.
Student    Group X    Group Y    Group Z
1          50         45         30
2          50         50         45
3          50         55         75
Mean       50         50         50
Thus, the three groups have the same mean, i.e. 50. In fact the medians of groups X and Y are
also equal. If one were to conclude that the students in the three groups are of equal
capability, the conclusion would be totally wrong. Close examination reveals that in group X
every student's marks equal the mean, in group Y the marks are very close to the mean, but
in the third group Z the marks are widely scattered. It is thus clear that a measure of
central tendency alone is not sufficient to describe the data.
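The point can be illustrated numerically: the three groups share a mean but differ in spread. The sketch below uses Python's standard `statistics` module; the range and the population standard deviation (`pstdev`) are used here simply as two convenient measures of scatter.

```python
from statistics import mean, pstdev

# Marks of the three groups of students discussed above.
groups = {
    "X": [50, 50, 50],
    "Y": [45, 50, 55],
    "Z": [30, 45, 75],
}

summary = {}
for name, marks in groups.items():
    summary[name] = {
        "mean": mean(marks),
        "range": max(marks) - min(marks),
        "sd": pstdev(marks),
    }
    print(name, summary[name])
```

The means coincide, but the range and the standard deviation grow from X to Y to Z, which is exactly what the average alone fails to show.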
Definition of dispersion : The arithmetic mean of the deviations of the values of the
individual items from the measure of a particular central tendency used. Thus the ’dispersion’
is also known as the “average of the second degree.” Prof. Griffin and Dr. Bowley said
the same about the dispersion.
In simple terms Dispersion is the variability among individual observations comprising
a set of data. It describes the spread characteristics of the data. A measure of dispersion lies
in quantifying the variability among individual observations and their scatter around the central
value.
Characteristics of ideal measure of dispersion:
1. It should be rigidly defined
2. It should be easy to calculate
3. It should be based on all the observations
4. It should be amenable for further mathematical treatment
5. It should be affected as little as possible by fluctuations of sampling and by extreme
observations
Methods of Computing Dispersion: the various measures of dispersions are
1. Range
2. Quartile Deviation
3. Mean Deviation
4. Variance
5. Standard Deviation
In measuring dispersion, it is imperative to know the amount of variation (absolute
measure) and the degree of variation (relative measure). In the former case we consider the
range, mean deviation, standard deviation etc. In the latter case we consider the coefficient
of range, the coefficient of mean deviation, the coefficient of variation etc.
(I) Method of limits:
(1) The range (2) Inter-quartile range (3) Percentile range
(II) Method of Averages:
(1) Quartile deviation (2) Mean deviation
(3) Standard Deviation and (4) Other measures.
4.2 RANGE
In any statistical series, the difference between the largest and the smallest values is
called as the range.
Coefficient of Range : The relative measure of the range. It is used in the comparative
study of dispersion. Co-efficient of Range = $\frac{L - S}{L + S}$, where L is the largest and
S the smallest value.
Example ( Individual series ) Find the range and the co-efficient of the range of the following
items :
110, 117, 129, 197, 190, 100, 100, 178, 255, 790.
Solution: R = L - S = 790 - 100 = 690
Co-efficient of Range = $\frac{L - S}{L + S} = \frac{790 - 100}{790 + 100} = \frac{690}{890} \approx 0.775$
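The range and its coefficient for the individual series above can be checked with a short Python sketch (variable names are illustrative):

```python
# Items of the individual series from the example above.
items = [110, 117, 129, 197, 190, 100, 100, 178, 255, 790]

largest = max(items)    # L
smallest = min(items)   # S

# Range R = L - S; coefficient of range = (L - S) / (L + S).
value_range = largest - smallest
coefficient = value_range / (largest + smallest)

print(value_range, round(coefficient, 3))
```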
Example (Continuous series) Find the range and its co-efficient from the following data.
6 60 - 70
Solution: R = L - S = 100 - 10 = 90
Co-efficient of range = $\frac{L - S}{L + S} = \frac{100 - 10}{100 + 10} = \frac{90}{110} \approx 0.818$
4.3 QUARTILE DEVIATION AND COMPUTATIONS
If we concentrate on the two extreme values (as in the case of the range), we don’t get any
idea about the scatter of the data within the range (i.e. between the two extreme values). If we
discard the extremes, the limited range thus available might be more informative. For this reason
the concept of the interquartile range was developed. It is the range which includes the middle
50% of the distribution. Here 1/4 (one quarter) of the observations at the lower end and
1/4 (one quarter) at the upper end are excluded.
Now the lower quartile ( Q1 ) is the 25th percentile and the upper quartile ( Q3 ) is the
75th percentile. It is interesting to note that the 50th percentile is the middle quartile ( Q2 ),
which is in fact what you have studied under the title ‘Median’. Thus symbolically
Inter quartile range = Q3 - Q1
If we divide (Q3 - Q1) by 2 we get what is known as the Semi-Inter quartile range or Quartile Deviation.
Therefore Q. D. ( SI QR ) = $\frac{Q_3 - Q_1}{2}$
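Here N/4 = 37.5. The first cumulative frequency greater than 37.5 is 64. Therefore Q1 lies in
the corresponding class 30-45.
Q1 = $l + \frac{\frac{N}{4} - C}{f} \times h$
where l is the lower limit of the quartile class, C is the cumulative frequency of the preceding
class, f is the frequency of the quartile class and h is the class width.
Similarly 3N/4 = 112.5. The first cumulative frequency greater than 112.5 is 129. Therefore Q3
lies in the corresponding class 60-75.
Q3 = $l + \frac{\frac{3N}{4} - C}{f} \times h$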
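The quartile formulas can be sketched as one Python function that interpolates any fraction of the total frequency. Since the frequency table of this example is not reproduced here, the sketch is applied to the age distribution from the median example in Section 3.6; the function name is an assumption, and equal class widths are assumed.

```python
def grouped_quantile(lower_limits, width, frequencies, fraction):
    """Value below which the given fraction of a grouped distribution lies."""
    target = fraction * sum(frequencies)   # N/4 for Q1, 3N/4 for Q3
    cf = 0                                 # C: cumulative frequency so far
    for lower, f in zip(lower_limits, frequencies):
        if cf + f >= target:
            # Q = l + (target - C) / f * h
            return lower + (target - cf) / f * width
        cf += f
    raise ValueError("fraction must lie between 0 and 1")

# Age distribution: 20-25, 25-30, 30-35, 35-40, 40-45.
lower_limits = [20, 25, 30, 35, 40]
frequencies = [70, 80, 180, 150, 20]

q1 = grouped_quantile(lower_limits, 5, frequencies, 0.25)
q3 = grouped_quantile(lower_limits, 5, frequencies, 0.75)
quartile_deviation = (q3 - q1) / 2

print(q1, q3, quartile_deviation)
```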
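4.4 MEAN DEVIATION
In case of a frequency distribution, the mean deviation (M.D.) is obtained as :
M.D. (about Mean) = $\frac{\sum f_i |x_i - \bar{x}|}{N}$
M.D. (about Median) = $\frac{\sum f_i |x_i - M_d|}{N}$
M.D. (about Mode) = $\frac{\sum f_i |x_i - M_o|}{N}$
where $N = \sum f_i$, and $M_d$ and $M_o$ denote the median and the mode.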
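Example (Continuous series) Calculate the mean deviation and the coefficient of mean
deviation from the following data using the mean. The data give the difference in ages
between boys and girls of a class.
Diff. in years      No. of students
0 - 5               449
5 - 10              705
10 - 15             507
15 - 20             281
20 - 25             109
25 - 30             52
30 - 35             16
35 - 40             4
Calculation: taking the class mid-points 2.5, 7.5, ..., 37.5 as $x_i$ and N = 2123,
1) $\bar{x} = \frac{\sum f_i x_i}{N} = \frac{22217.5}{2123} \approx 10.47$ years
2) M.D. = $\frac{\sum f_i |x_i - \bar{x}|}{N} \approx \frac{11333.5}{2123} \approx 5.34$ years
3) Co-efficient of M.D. = $\frac{M.D.}{\bar{x}} \approx \frac{5.34}{10.47} \approx 0.51$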
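The three steps of this calculation can be sketched in Python. Class mid-points stand in for the individual values, which is the usual assumption for a continuous series; the variable names are illustrative.

```python
# Class lower limits (width 5) and frequencies from the example above.
lower_limits = [0, 5, 10, 15, 20, 25, 30, 35]
frequencies = [449, 705, 507, 281, 109, 52, 16, 4]
width = 5

# Mid-point of each class represents all items in that class.
midpoints = [low + width / 2 for low in lower_limits]
n = sum(frequencies)

# 1) Arithmetic mean of the grouped data.
mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n

# 2) Mean deviation about the mean: sum(f * |x - mean|) / N.
mean_deviation = sum(f * abs(x - mean) for f, x in zip(frequencies, midpoints)) / n

# 3) Coefficient of mean deviation: M.D. divided by the mean.
coefficient = mean_deviation / mean

print(n, round(mean, 2), round(mean_deviation, 2), round(coefficient, 2))
```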
4.5 VARIANCE
The term variance was used by R. A. Fisher in 1918 to describe the square of the standard
deviation. The concept of variance is of great importance in advanced work where it is
possible to split the total into several parts, each attributable to one of the factors causing
variation in the original series. Variance is defined as follows:
Variance = $\sigma^2 = \frac{\sum f_i (x_i - \bar{x})^2}{n}$
where n = $\sum f_i$
Merits :
a. It is rigidly defined and based on all observations.
b. It is amenable to further algebraic treatment.
c. It is not affected by sampling fluctuations.
d. It is less erratic.
e. It is the most widely used measure of dispersion
Demerits :
a. It is difficult to understand and calculate.
b. It gives greater weight to extreme values.
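Note that variance V(x) = $\frac{\sum f_i (x_i - \bar{x})^2}{n}$
Then V ( x ) = $\frac{\sum f_i x_i^2}{n} - \bar{x}^2$, and the standard deviation is $\sigma = \sqrt{V(x)}$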
Example Calculate the standard deviation and its co-efficient from the following data.
A B C D E F G H I J
10 12 16 8 25 30 14 11 13 11
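Solution :
No      $x_i$     $(x_i - \bar{x})$     $(x_i - \bar{x})^2$
A       10        -5                    25
B       12        -3                    9
C       16        +1                    1
D       8         -7                    49
E       25        +10                   100
F       30        +15                   225
G       14        -1                    1
H       11        -4                    16
I       13        -2                    4
J       11        -4                    16
Total   150       0                     446
Calculations :
i) $\bar{x} = \frac{\sum x_i}{n} = \frac{150}{10} = 15$
ii) $\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}} = \sqrt{\frac{446}{10}} = \sqrt{44.6} \approx 6.68$
iii) Co-efficient of s.d. = $\frac{\sigma}{\bar{x}} = \frac{6.68}{15} \approx 0.445$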
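The three calculations for this example can be checked with the following Python sketch. It divides by n, as the variance formula of this unit does; the variable names are illustrative.

```python
from math import sqrt

# Values A to J from the example above.
values = [10, 12, 16, 8, 25, 30, 14, 11, 13, 11]
n = len(values)

# i) Arithmetic mean.
mean = sum(values) / n

# ii) Standard deviation: square root of the mean squared deviation.
squared_deviations = [(x - mean) ** 2 for x in values]
variance = sum(squared_deviations) / n
sigma = sqrt(variance)

# iii) Coefficient of standard deviation: sigma divided by the mean.
coefficient = sigma / mean

print(mean, sum(squared_deviations), round(sigma, 3), round(coefficient, 3))
```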
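Example Calculate s.d. of the marks of 100 students.
Marks     f      Mid-point (m)     fm     fm²
0-2       10     1                 10     10
2-4       20     3                 60     180
8-10      5      9                 45     405
Solution
1) $\bar{x} = \frac{\sum fm}{N}$
2) $\sigma = \sqrt{\frac{\sum fm^2}{N} - \bar{x}^2}$
σ = 2.09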
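Combined Standard deviation : If two sets containing $n_1$ and $n_2$ items having means $\bar{x}_1$ and
$\bar{x}_2$ and standard deviations $\sigma_1$ and $\sigma_2$ respectively are taken together then,
$\sigma_{12} = \sqrt{\frac{n_1 (\sigma_1^2 + d_1^2) + n_2 (\sigma_2^2 + d_2^2)}{n_1 + n_2}}$
where $d_1 = \bar{x}_1 - \bar{x}_{12}$, $d_2 = \bar{x}_2 - \bar{x}_{12}$ and $\bar{x}_{12}$ is the combined mean.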
Example The score of two teams A and B in 10 matches are as :
A 40 32 0 40 30 7 13 25 14 5
B 21 14 29 13 5 12 10 13 30 0
Find the variance for both the series. Which team is more consistent ?
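One way to approach this example is sketched below in Python: compute the variance of each series and then compare consistency through the coefficient of variation (standard deviation as a percentage of the mean), the measure named in Section 4.7 of the unit structure. The helper names are assumptions.

```python
from math import sqrt

# Scores of the two teams in 10 matches.
team_a = [40, 32, 0, 40, 30, 7, 13, 25, 14, 5]
team_b = [21, 14, 29, 13, 5, 12, 10, 13, 30, 0]

def variance(scores):
    """Variance with divisor n, as defined in this unit."""
    m = sum(scores) / len(scores)
    return sum((x - m) ** 2 for x in scores) / len(scores)

def coefficient_of_variation(scores):
    """Standard deviation expressed as a percentage of the mean."""
    m = sum(scores) / len(scores)
    return 100 * sqrt(variance(scores)) / m

var_a, var_b = variance(team_a), variance(team_b)
cv_a, cv_b = coefficient_of_variation(team_a), coefficient_of_variation(team_b)

print(round(var_a, 2), round(var_b, 2))
print(round(cv_a, 2), round(cv_b, 2))
print("More consistent team:", "A" if cv_a < cv_b else "B")
```

The team with the smaller coefficient of variation is the more consistent one, since its scores vary less relative to its own average.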
4.8 SKEWNESS
We study Skewness to have an idea about the shape of the curves which we can draw
with the help of the given frequency distribution. It helps us to understand the nature of the
concentration of observations towards higher and lower values of the variable. A distribution
is said to be skewed if :
1. The frequency curve of the distribution is not a symmetric bell shaped curve but is
stretched more to one side than the other. If it has a longer tail towards the right it is
said to be positively skewed, and if the tail is longer towards the left it is
negatively skewed.
2. The values of Mean, median and Mode fall at different points.
3. Quartiles Q1 and Q3 are not equidistant from the median.
Measures of Skewness:
1. Sk = Mean – Median
2. Sk = Mean – Mode
3. Sk = (Q3–Md) – (Md - Q1)= Q3 + Q1 – 2Md
4.9 KURTOSIS
Kurtosis tells us more about the shape of a distribution than its variability alone.
Prof. Karl Pearson called it the "convexity of the curve". Kurtosis gives us an idea about the
shape and nature of the hump (middle part) of a frequency distribution. Kurtosis is therefore
concerned with the flatness or peakedness of the frequency curve. The normal curve is called
mesokurtic. Curves which are more peaked than the normal curve are called leptokurtic and
have positive (excess) kurtosis. Curves which are flatter than the normal curve are called
platykurtic and have negative kurtosis.
As a measure of kurtosis, Karl Pearson defined the co-efficient β2 as
$\beta_2 = \dfrac{\mu_4}{\mu_2^2}$
where μ2 and μ4 are the second and fourth moments about the mean. For the normal curve
β2 = 3; β2 > 3 indicates a leptokurtic curve and β2 < 3 a platykurtic curve.
4.10 ANSWERS TO CHECK YOUR PROGRESS
1. Range
2. Dispersion
3. (Q3 - Q1) /2
4.11 SUMMARY
The choice among the measures discussed above depends on their merits and demerits.
Range is the simplest of all, but it is based only on the two extreme values. Quartile deviation
is also not adequately representative, as it uses only the middle 50% of the data, although it is
well suited to open-end classes. Variance and standard deviation are the two most objective
measures of dispersion, as they take every observation into account. SD is the most widely
used measure of variability.
4.12 KEY WORDS
Dispersion – Variability among the observations made or
deviations from the expected one.
MAD – Mean Absolute Deviation
Skewness – Lack of symmetry
Percentile range - this is a measure of dispersion based on the
difference between certain percentiles.
Lorenz Curve – it is a graphic measure of studying the dispersion.
This curve is used in business to study the
disparities of the distribution of wages, profits,
turnover, production, population etc.
4.13 SELF-ASSESSMENT QUESTIONS
1. Explain the validity of the statement “An Average when published should be accompanied
by a measure of dispersion for significant interpretation”.
2. What is dispersion? Explain each measure in detail.
3. "The Standard Deviation is the best measure of dispersion." Why?
4. "Standard Deviation can never be negative." Comment.
5. Differentiate between SD and MD.
6. Calculate the mean deviation from the following:
X: 5 15 25 35 45 55 65
f: 8 12 10 8 3 2 7
7. Find the Median and Mean deviation from the following data;
size Frequency
0-10 7
10-20 12
20-30 18
30-40 25
40-50 16
50-60 14
60-70 8
8. Find mean deviation from Mean and Median for the following:
Score No. of Students
140-150 4
150-160 6
160-170 10
170-180 10
180-190 9
190-200 3
Suresh: 87 89 78 71 73 84 65 66 56 46
a. Who is the better scorer?
b. Who is more consistent?
13. Write short notes on
a. Skewness
b. Kurtosis
c. Percentile
d. Lorenz Curve
BLOCK 2
INTRODUCTION
Dear Student,
In the previous block you gained basic knowledge of statistics. You learnt how to collect data,
how to tabulate data and how to depict data. You also learnt to analyse data through various
measures of central tendency such as mean, median and mode. Everything you have studied so far
was limited to a single set of data.
In this block let us try to understand the relationship between two sets of data. Some sets of data
depend directly or inversely on other sets of data. For example, the yield of a crop over the years
depends on the amount of rainfall in that place over the same years. Such a relationship is called
correlation. Once you are able to correlate two sets of data, you can try to find the exact
relationship between them, which is called regression.
In this block you will study 4 units.
Unit 5: Concept and definition of correlation, significance, types and properties of correlation;
methods of correlation analysis: graphic method.
Unit 6: Scatter diagrams, Karl Pearson's correlation co-efficient, rank correlation co-efficient.
Unit 7: Regression analysis: meaning and definition of regression, applications of regression
analysis, difference between correlation and regression analysis, types of regression models,
standard error and regression co-efficients.
Unit 8: Multiple correlation and regression: concept of multiple regression and multiple
correlation, concept of partial correlation, correlation co-efficient, method of least squares.
MODULE 2
UNIT 5 : CORRELATION
STRUCTURE
5.0 Objectives
5.1 Introduction
5.2 Concept and Definition of Correlation
5.3 Correlation and Causation
5.4 Types of Correlation
5.5 Significance of the Study of Correlation
5.6 Measures to Describe Correlation
5.7 Solved Problems
5.8 Summary
5.9 Keywords
5.10 Self Assessment Questions
5.11 References
5.0 OBJECTIVES
After studying this unit you should be able to :
∗ Analyse the importance, as also the limitations of correlation analysis and
∗ Distinguish between
a. linear and non-linear correlation,
b. positive and negative correlation, and
c. simple, partial and multiple correlation
5.1 INTRODUCTION
The existence of a relationship between two or more variables constitutes vital
information for decision making in any given situation. For example, it is important for a
manufacturer to know how product sales are related to expenses incurred on advertising.
Similarly, it is useful for a farmer to know the relationship between crop yield and the quantum
of fertilizer applied. In all such cases, knowledge of the mechanism by which the variables
are related is beneficial for taking appropriate decisions.
We briefly describe these types of correlation.
1. Positive and Negative Correlation : Positive correlation indicates that the movement
of the two variables is in the same direction, that is, both the variables are either increasing
or decreasing. In contrast, if the movement of the two variables is in the opposite direction,
that is, one variable is increasing and the other is decreasing, then the correlation is
negative. Some examples of series of positive correlation are :
1. Heights and weights;
2. Household income and expenditure;
3. Price and supply of commodities;
4. Amount of rainfall and yield of crops.
Suppose we are given sets of data relating to heights and weights of students in a class.
They can be plotted on the coordinate plane using x-axis to represent heights and y-axis to
represent weights.
The graph shown below illustrates positive correlation.
[Figure: scatter diagram, "Positive Correlation"; X on the horizontal axis, Y on the vertical axis, points rising from lower left to upper right]
Correlation between two variables is said to be negative or inverse if the variables deviate
in opposite direction. That is, if the increase (or decrease) in the values of one variable
results on an average, in corresponding decrease (or increase) in the values of other variable.
Some examples of series of negative correlation are :
1. Volume and pressure of perfect gas;
2. Current and resistance
3. Price and demand of goods.
The graph shown below illustrates negative correlation.
[Figure: scatter diagram, "Negative Correlation"; x on the horizontal axis, y on the vertical axis, points falling from upper left to lower right]
2. Linear and Non-Linear Correlation : If the change in one variable tends to bear a
constant ratio to the change in the other variable, then the correlation is said to be
linear. This will be clear from the following example.
X 5 10 15 20 25
Y 30 60 90 120 150
In this case, the ratio of change from one figure to the next is the same in the two series
(each increase of 5 in X is accompanied by an increase of 30 in Y). Thus, it gives a linear
correlation. In contrast, in a non-linear correlation this constant ratio of change does not
exist. If a couple of figures in either series X or Y are changed, it may give a non-linear
correlation.
3. Simple, Partial and Multiple Correlation : The distinction amongst these three types
of correlation depends upon the number of variables involved in a study. If only two
variables are involved in a study, then it is a problem of simple correlation. If three or
more variables are involved, it is a problem of either partial or multiple correlation.
In multiple correlation, three or more variables are studied simultaneously. In partial
correlation we consider only two variables influencing each other while the effect of the
other variables is held constant.
5.5 SIGNIFICANCE OF THE STUDY OF CORRELATION
The study of correlation is of immense use in practical life because of the following reasons:
a. Once we know that two variables are closely related, we can estimate the value of one
variable given the value of another. This is known with the help of regression analysis.
b. Correlation analysis contributes to the understanding of economic behavior, aids in
locating the critically important variables on which other depend, may reveal to the
economist the connection by which disturbances spread and suggest to him the paths
through which stabilizing forces may become effective.
c. Most of the variables show some kind of relationship. For example, there is relationship
between price and supply, income and expenditure, etc. With the help of correlation
analysis we can measure in one figure the degree of relationship existing between the
variables.
The correlation analysis enables the executive to estimate costs, sales, prices and other
variables on the basis of some other series with which these costs, sales, or prices may be
functionally related.
However, it should be noted that the coefficient of correlation is one of the most widely
used and also one of the most widely abused statistical measures. It is abused in the sense
that one sometimes overlooks the fact that correlation measures nothing but the strength
of a linear relationship and does not necessarily imply a cause-and-effect relationship.
d. The effect of correlation is to reduce the range of uncertainty. A prediction based
on correlation analysis is likely to be more reliable and nearer to reality.
e. Progressive development in the methods of science and philosophy has been
characterized by an increase in the knowledge of relationships or correlations. In nature
also one finds a multiplicity of interrelated forces.
correlation. It is a more useful and readily comprehensible measure for indicating the
percentage variation in the dependent variable which is accounted for by the independent
variable. In other words, the co-efficient of determination gives the ratio of the explained
variance to the total variance.
Thus the co-efficient of determination is expressed as follows :
r² = Explained Variance / Total Variance
The co-efficient of determination is more useful and a better measure for interpreting the
value of r.
regression equation of y on x, which is as follows :
$Y_c = a + bx$
where $b = \dfrac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}$
and $a = \bar{y} - b\bar{x}$
Adv. Exp. (x)   Sales (y)   x²     y²     xy
10              14          100    196    140
12              17          144    289    204
15              23          225    529    345
23              25          529    625    575
20              21          400    441    420
Σx = 80         Σy = 100    Σx² = 1398   Σy² = 2080   Σxy = 1684
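ii) Co-efficient of determination:
$r^2 = \dfrac{\text{Explained variation}}{\text{Total variation}} = \dfrac{\sum (y_c - \bar{y})^2}{\sum (y - \bar{y})^2} = \dfrac{a\sum y + b\sum xy - \dfrac{(\sum y)^2}{n}}{\sum y^2 - \dfrac{(\sum y)^2}{n}}$
$= \dfrac{8.608(100) + 0.712(1684) - \dfrac{(100)^2}{5}}{2080 - \dfrac{(100)^2}{5}} = \dfrac{860.8 + 1199 - 2000}{2080 - 2000} = \dfrac{59.8}{80} = 0.747$
Co-efficient of correlation $r = \sqrt{0.747} = 0.864$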
The co-efficient of correlation and the co-efficient of determination can also be calculated by
another method, which is as follows :
$r = \dfrac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}$
$= \dfrac{5 \times 1684 - 80 \times 100}{\sqrt{(5 \times 1398 - (80)^2)\,(5 \times 2080 - (100)^2)}}$
$= \dfrac{8420 - 8000}{\sqrt{(6990 - 6400)(10400 - 10000)}} = \dfrac{420}{\sqrt{590 \times 400}} = \dfrac{420}{\sqrt{236000}} = \dfrac{420}{485.8} = 0.864$
Co-efficient of determination $r^2 = (0.864)^2 = 0.747$
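The least-squares constants and the two co-efficients of the advertising example can be reproduced with the sketch below. It is illustrative only: the function name `least_squares` is hypothetical, and the fourth sales figure is taken as 25, the value that agrees with the column totals Σy = 100, Σy² = 2080 and Σxy = 1684.

```python
import math

adv = [10, 12, 15, 23, 20]      # advertising expenditure (x)
sales = [14, 17, 23, 25, 21]    # sales (y); 25 assumed so that the totals match

def least_squares(x, y):
    """Return intercept a, slope b and correlation r for the line y = a + b*x."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    syy = sum(v * v for v in y)
    sxy = sum(p * q for p, q in zip(x, y))
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a = sy / n - b * sx / n
    r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return a, b, r

a, b, r = least_squares(adv, sales)
print(f"a={a:.3f} b={b:.3f} r={r:.3f} r^2={r * r:.3f}")
```

Small differences in the third decimal place from the hand calculation arise because the text rounds b to 0.712 before computing a.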
Example 2 :
Calculate the co-efficient of correlation for the ages of husband and wife.
Age of Husband 23 27 28 29 30 31 33 35 36 39
Age of wife 18 22 23 24 25 26 28 29 30 32
Solution :
CALCULATIONS FOR CORRELATION CO-EFFICIENT
x y u=x-31 v=y-25 u2 v2 uv
23 18 -8 -7 64 49 56
27 22 -4 -3 16 9 12
28 23 -3 -2 9 4 6
29 24 -2 -1 4 1 2
30 25 -1 0 1 0 0
31 26 0 1 0 1 0
33 28 2 3 4 9 6
35 29 4 4 16 16 16
36 30 5 5 25 25 25
39 32 8 7 64 49 56
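Σx = 311   Σy = 257   Σu = 1   Σv = 7   Σu² = 203   Σv² = 163   Σuv = 179
$r_{uv} = \dfrac{n\sum uv - (\sum u)(\sum v)}{\sqrt{[n\sum u^2 - (\sum u)^2]\,[n\sum v^2 - (\sum v)^2]}} = \dfrac{10(179) - (1)(7)}{\sqrt{[10(203) - 1^2]\,[10(163) - 7^2]}} = \dfrac{1783}{\sqrt{2029 \times 1581}} = 0.9955$
Since Karl Pearson's correlation co-efficient r is independent of origin, we get r_xy = r_uv
= 0.9955.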
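The claim that r does not depend on the origin can be checked numerically. The sketch below computes r once from the original ages and once from the deviations u = x - 31 and v = y - 25 used in the table; `pearson_r` is an illustrative helper name, not a library function.

```python
import math

husband = [23, 27, 28, 29, 30, 31, 33, 35, 36, 39]
wife = [18, 22, 23, 24, 25, 26, 28, 29, 30, 32]

def pearson_r(x, y):
    """Karl Pearson's co-efficient from raw sums."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(p * q for p, q in zip(x, y))
    sxx = sum(p * p for p in x)
    syy = sum(q * q for q in y)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

r_xy = pearson_r(husband, wife)

# Shift the origin as in the worked table.
u = [x - 31 for x in husband]
v = [y - 25 for y in wife]
r_uv = pearson_r(u, v)

print(f"r_xy={r_xy:.4f} r_uv={r_uv:.4f}")
```

Both calls give the same value, because subtracting a constant from every observation changes neither the deviations from the mean nor their products.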
Example 3
Find Karl Pearson’s co-efficient of correlation between sales and expenses of the
following ten firms
Firm                         1   2   3   4   5   6   7   8   9   10
Sales (thousand units)       50  50  55  60  65  65  65  60  60  50
Expenses (thousand rupees)   11  13  14  16  16  15  15  14  13  13
Solution :
Let the sales (in thousand units) of a firm be denoted by x and the expenses (in '000 rupees)
by y. A factor of 5 is common to the x series, so it is convenient to change the scale of x
as well as the origin. Taking 65 and 13 as working means for x and y respectively, let
$u = \dfrac{x - 65}{5}, \quad v = y - 13$
Then Σu = -14, Σv = 10, Σu² = 34, Σv² = 32 and Σuv = 0, so that
$r_{uv} = \dfrac{n\sum uv - (\sum u)(\sum v)}{\sqrt{[n\sum u^2 - (\sum u)^2]\,[n\sum v^2 - (\sum v)^2]}} = \dfrac{140}{\sqrt{144 \times 220}} = \dfrac{140}{\sqrt{31680}} = \dfrac{140}{177.99} = 0.7866$
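Aliter. We have
$\bar{x} = \dfrac{\sum x}{n} = \dfrac{580}{10} = 58; \quad \bar{y} = \dfrac{\sum y}{n} = \dfrac{140}{10} = 14$
Since x̄ and ȳ are integers, it will be convenient to compute r by taking the deviations from
the means directly, i.e. by taking
$dx = x - \bar{x} = x - 58; \quad dy = y - \bar{y} = y - 14$
$r_{xy} = \dfrac{\sum dx\,dy}{\sqrt{\sum dx^2 \cdot \sum dy^2}} = \dfrac{70}{\sqrt{360 \times 22}} = \dfrac{70}{\sqrt{7920}} = \dfrac{70}{88.99} = 0.7866$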
Example 4
Find Karl Pearson’s co-efficient of correlation between the age and the playing habit of
the people from the following information.
Age group (years) No. of people No. of players
15 and less than 20 200 150
20 and less than 25 270 162
25 and less than 30 340 170
30 and less than 35 360 180
35 and less than 40 400 180
40 and less than 45 300 120
Solution :
We want to find Karl Pearson's correlation co-efficient between the age and the playing
habit of the people. To do this, we first express the number of players out of a fixed 1000 or
some other convenient figure. Here we express the number of players as a percentage of the
total people in each age group.
Now we compute Karl Pearson's correlation co-efficient between age (x) and the
percentage of players in each age group (y).
Age group (yrs.)   No. of people   No. of players   Percentage of players (y)
15-20              200             150              (150/200) x 100 = 75
20-25              270             162              (162/270) x 100 = 60
25-30              340             170              (170/340) x 100 = 50
30-35              360             180              (180/360) x 100 = 50
35-40              400             180              (180/400) x 100 = 45
40-45              300             120              (120/300) x 100 = 40
CALCULATIONS FOR CORRELATION CO-EFFICIENT
Age group   Mid value (x)   y    u = (x - 27.5)/5   v = (y - 50)/5   u²   v²   uv
15-20       17.5            75   -2                 5                4    25   -10
20-25       22.5            60   -1                 2                1    4    -2
25-30       27.5            50   0                  0                0    0    0
30-35       32.5            50   1                  0                1    0    0
35-40       37.5            45   2                  -1               4    1    -2
40-45       42.5            40   3                  -2               9    4    -6
Total                            Σu = 3             Σv = 4           Σu² = 19   Σv² = 34   Σuv = -20
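The final step, substituting the column totals into Karl Pearson's formula, can be sketched in Python as follows. The variable names are illustrative, and the code simply rebuilds u and v from the age mid-values and the player percentages given in the example.

```python
import math

mid_age = [17.5, 22.5, 27.5, 32.5, 37.5, 42.5]
people = [200, 270, 340, 360, 400, 300]
players = [150, 162, 170, 180, 180, 120]

# Percentage of players in each age group (y)
pct = [100 * p / n for p, n in zip(players, people)]

# Change of origin and scale, as in the calculation table
u = [(x - 27.5) / 5 for x in mid_age]
v = [(y - 50) / 5 for y in pct]

n = len(u)
su, sv = sum(u), sum(v)
suu = sum(a * a for a in u)
svv = sum(b * b for b in v)
suv = sum(a * b for a, b in zip(u, v))

r = (n * suv - su * sv) / math.sqrt((n * suu - su ** 2) * (n * svv - sv ** 2))
print(f"r = {r:.4f}")
```

The result is a strong negative co-efficient, in line with the falling percentage of players as age increases.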
association while a value of zero indicates no association.
5.9 KEYWORDS
Correlation : Degree of association between two variables.
Correlation Co-efficient : A number lying between -1 to +1, to quantify the association
between two variables.
Covariance : This is the joint variation between the variables X and Y
9. From the following data, examine whether the input of oil and the output of electricity
can be said to be correlated:
Input of oil            6.9  8.2  7.8  4.8  9.6  8.0  7.7
Output of electricity   1.9  3.5  6.5  1.3  5.5  3.5  2.2
13. The following table gives the frequency, according to groups of marks obtained by
67 students in an intelligence test. Measure the degree of relationship between age
and intelligence test:
Age in years Total
Test marks 18 19 20 21
200-250 4 4 2 1 11
250-300 3 5 4 2 14
300-350 2 6 8 5 21
350-400 1 4 6 10 21
Total 10 19 20 18 67
UNIT 6 : METHODS OF COMPUTING CORRELATION
STRUCTURE
6.0 Objectives
6.1 Introduction
6.2 Scatter diagram
6.3 Karl Pearson’s Co-efficient of Correlation
6.4 Rank Correlation
6.5 Computation of ‘r’ from a Cross Classification Table
6.6 Solved Problems
6.7 Summary
6.8 Key words
6.9 Self Assessment Questions
6.10 References
6.0 OBJECTIVES
After studying this unit you should be able to :
∗ Define scatter diagram;
∗ Explain the Karl Pearson’s Co-efficient of Correlation;
∗ Recognize when a scatter diagram suggests relationship between two variables;
∗ Calculate and interpret coefficient of correlation for individual observations as well
as for bivariate grouped data;
∗ Calculate Rank Correlation and
∗ Compute Correlation from a Cross classification Table.
6.1 INTRODUCTION
It is known that correlation analysis deals with the association between two or more
variables. Business is a complex phenomenon, influenced by many variables which are
related in one way or another. It is essential for a businessman to know the factors
influencing business, and their relationships, in order to take decisions. In this unit, methods
of analysing correlation are discussed. These methods will help a businessman to analyse the
relationship between two or more variables.
The commonly used methods for studying the correlation between two variables are :
1. Scatter diagram method
2. Karl Pearson’s coefficient of correlation ( Co –variance method)
3. Rank method.
4. Two-way frequency table /Bivariate correlation method/ Cross Classification Table
Student A B C D E F G H
Entrance examination scores 74 69 85 63 82 60 79 91
Cumulative grade points 2.6 2.2 3.4 2.3 3.1 2.1 3.2 3.8
The nature of the distribution of points will indicate the existence of correlation and the
nature of association between the two variables. If the pattern is distributed along a straight
line diagonally upward, correlation may be taken as perfect positive. If it is a straight line
sloping downward, correlation may be taken as perfect negative. Whether the degree of
correlation is high or low can be known from the nature of the distribution of points. If they
are distributed all over the diagram or cluster around a small area, the evidence of
correlation is very remote. With the help of this diagram, it is also possible to know whether
the correlation is linear or not. The scatter diagram method, however, is only a rough method
of finding the presence of correlation; it cannot give us any measure like the coefficient of
correlation.
where $x = (X - \bar{X})$; $y = (Y - \bar{Y})$
The value of the coefficient of correlation as obtained by the above formula shall always
lie between ± 1. When r = +1, it means there is perfect positive correlation between the
variables. When r=-1, it means there is perfect negative correlation between the variables.
When r = 0, it means there is no linear relationship between the two variables. However, in
practice such values of r as +1, -1 and 0 are rare. We normally get values which lie between +1
and -1. For example, +0.6 means that the correlation is positive, because the sign of r is +,
and that its magnitude is 0.6. Similarly, -0.46 means a low degree of negative correlation.
A simple form of the above formula for application to practical problems is given
as follows :
$r = \dfrac{\sum xy}{\sqrt{\sum x^2}\,\sqrt{\sum y^2}}$
Example 1 :
The following table gives indices of industrial production and registered unemployed
(in hundred thousand). Calculate the value of the coefficient so obtained.
Year 1991 1992 1993 1994 1995 1996 1997 1998
Number unemployed 15 12 13 11 12 12 19 26
Solution :
Calculation of Karl Pearson’s Correlation Coefficients
Year   Production (X)   x = (X - X̄)   x²   Unemployed (Y)   y = (Y - Ȳ)   y²   xy
1991 100 -4 16 15 0 0 0
1992 102 -2 4 12 -3 9 +6
1993 104 0 0 13 -2 4 0
1995 105 +1 1 12 -3 9 -3
1997 103 -1 1 19 +4 16 -4
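$r = \dfrac{\sum xy}{\sqrt{\sum x^2}\,\sqrt{\sum y^2}}$, where $x = (X - \bar{X})$; $y = (Y - \bar{Y})$
$\bar{X} = \dfrac{\sum X}{N} = \dfrac{832}{8} = 104; \quad \bar{Y} = \dfrac{\sum Y}{N} = \dfrac{120}{8} = 15$
$r = \dfrac{\sum xy}{\sqrt{120} \times \sqrt{184}}$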
Example 2:
Find the coefficient of correlation between the sales and expenses of
the following 10 firms (figures in '000 Rs.)
Firms 1 2 3 4 5 6 7 8 9 10
Sales 50 50 55 60 65 65 65 60 60 50
Expenses 11 13 14 16 16 15 15 14 13 13
Sales (X)   x = (X - 58)   x²   Expenses (Y)   y = (Y - 14)   y²   xy
50 -8 64 11 -3 9 +24
50 -8 64 13 -1 1 +8
55 -3 9 14 0 0 0
60 +2 4 16 +2 4 +4
65 +7 49 16 +2 4 +14
65 +7 49 15 +1 1 +7
65 +7 49 15 +1 1 +7
60 +2 4 14 0 0 0
60 +2 4 13 -1 1 -2
50 -8 64 13 -1 1 +8
The correlation coefficient between sales and expenses is :
$r = \dfrac{\sum xy}{\sqrt{\sum x^2 \cdot \sum y^2}} = \dfrac{70}{\sqrt{360 \times 22}} = 0.787$
Example : 3
Calculate the coefficient of correlation by Karl Pearson’s method from the following
data relating to overhead expenses and cost of production :
90 -30 900 15 -2 4 60
120 0 0 17 0 0 0
r= ∑xy = 240 = 0.693
where di stands for the difference between the ranks of the i-th individual in the two
characters and N stands for the number of paired observations. The value of the rank
correlation coefficient varies between -1 and +1: -1 implies complete disagreement in the
order of ranks, while +1 implies complete agreement in the order of ranks. The above formula
is used when ranks are not repeated. Rank correlation is especially used in the study of
qualitative characteristics such as honesty, efficiency, beauty, performance, etc. This is the
only method that can be applied to data in which the order of the items is known but not the
actual values.
As compared with Pearson’s method of studying correlation, this method is simple to
understand and easy to apply. It is however less accurate than Pearson’s method. Moreover,
it cannot be applied to bivariate frequency distribution.
Example : 4
Calculate the coefficient of correlation from the following data by Spearman's rank
difference method:
Price of Tea Price of Coffee Price of Tea Price of Coffee
(Rs) (Rs) (Rs) (Rs)
75 120 60 110
88 134 80 140
95 150 81 142
70 115 50 100
Solution : Calculation of Spearman’s Correlation Coefficient
Price of Tea (Rs.)   R1   Price of Coffee (Rs.)   R2   D² = (R1 - R2)²
75                   4    120                     4    0
88                   7    134                     5    4
95                   8    150                     8    0
70                   3    115                     3    0
60                   2    110                     2    0
80                   5    140                     6    1
81                   6    142                     7    1
50                   1    100                     1    0
                                                       ΣD² = 6
$R = 1 - \dfrac{6\sum D^2}{N^3 - N} = 1 - \dfrac{6 \times 6}{8^3 - 8} = 1 - \dfrac{36}{512 - 8} = 1 - 0.071 = +0.929$
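The ranking and the rank-difference formula used in this example can be sketched in Python as below. The helpers `ranks` and `spearman` are illustrative names; the ranking assumes no repeated values and gives rank 1 to the smallest price, as in the solution table.

```python
tea = [75, 88, 95, 70, 60, 80, 81, 50]
coffee = [120, 134, 150, 115, 110, 140, 142, 100]

def ranks(values):
    """Rank 1 = smallest value; valid only when no value is repeated."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman(x, y):
    """Spearman's rank correlation: 1 - 6*sum(D^2) / (N^3 - N)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n ** 3 - n)

rho = spearman(tea, coffee)
print(f"R1 = {ranks(tea)}")
print(f"R2 = {ranks(coffee)}")
print(f"rank correlation = {rho:.3f}")
```

Ranking from the largest value downwards instead would give the same co-efficient, provided both series are ranked in the same direction.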
Computation of 'r' from a cross classification table
Table 13.4
Consumption of                     Persons employed (x)
electricity in units (y)   10-14   15-19   20-24   25-29   30-34   35-39   Total
50-59                      5       7       -       -       -       -       12
60-69                      10      4       5       -       -       -       19
70-79                      -       2       6       3       -       -       11
80-89                      -       -       5       8       5       -       18
90-99                      -       -       -       3       6       2       11
100-109                    -       -       -       -       3       1       4
Total                      15      13      16      14      14      3       75
See how the cross classification table is read.
Consider the first column and the first row. Out of 15 companies which employ between
10 and 14 persons, 5 companies consume between 50 and 59 units of electricity and 10
consume between 60 and 69 units. Similarly, out of 12 companies consuming between 50 and
59 units, 5 companies employ between 10 and 14 persons and 7 employ between 15 and 19
persons; likewise for the other rows and columns.
We may find the co-efficient of correlation between the number of persons employed (x)
and the electricity consumed (y) by the 75 companies using the equation
$r = \dfrac{N\sum f d_x d_y - (\sum f d_x)(\sum f d_y)}{\sqrt{[N\sum f d_x^2 - (\sum f d_x)^2]\,[N\sum f d_y^2 - (\sum f d_y)^2]}}$
where
dx = (x - a)/Cx (a is the assumed mean and Cx the size of the class interval for the x
series), dy = (y - b)/Cy (b is the assumed mean and Cy the size of the class interval for the y
series), and x and y are the mid-points of the various classes in the x and y series, respectively.
Computation procedure : The use of the equation consists of the following steps, in the order
listed :
1) Find the mid-points, x and y, of each class for both the x and y series.
2) Decide the assumed means, a and b, for the x and y series, respectively.
3) Obtain dx and dy.
4) Multiply each dx and dy by the corresponding column / row total frequency to get
fdx and fdy, and find the sums Σfdx and Σfdy.
5) Take the square of each dx and dy to get dx² and dy², then multiply each dx² and dy² by
the corresponding class frequency f to get fdx² and fdy², and obtain the sums Σfdx² and
Σfdy².
6) Obtain the product of each dx and dy and multiply it by the frequency indicated in the
appropriate cell, writing the result in the left-hand corner of the cell concerned.
Add these products over all rows and columns to get fdxdy, and obtain the sum
Σfdxdy. This sum added over all columns should be the same as the one added over all
rows.
All the values so computed may be substituted in the equation to obtain r. The six steps in the
computation of r are illustrated by the following computations.
Σfdxdy = 145, Σfdx = 8, Σfdy = 9, Σfdx² = 170 and Σfdy² = 165
Therefore r = 0.87
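The six steps can be followed in code. The sketch below is illustrative and rests on two assumptions not stated in the text: the position of each cell frequency is inferred from the row and column totals of the table, and the assumed means are taken as a = 22 and b = 74.5, which reproduce the sums quoted above.

```python
import math

# Cell frequencies: rows are electricity classes (y), columns are persons employed (x).
freq = [
    [5, 7, 0, 0, 0, 0],    # 50-59
    [10, 4, 5, 0, 0, 0],   # 60-69
    [0, 2, 6, 3, 0, 0],    # 70-79
    [0, 0, 5, 8, 5, 0],    # 80-89
    [0, 0, 0, 3, 6, 2],    # 90-99
    [0, 0, 0, 0, 3, 1],    # 100-109
]
dx = [-2, -1, 0, 1, 2, 3]  # (x - 22) / 5 for mid-points 12, 17, 22, 27, 32, 37
dy = [-2, -1, 0, 1, 2, 3]  # (y - 74.5) / 10 for mid-points 54.5, 64.5, ..., 104.5

N = sum(sum(row) for row in freq)
col_tot = [sum(row[j] for row in freq) for j in range(len(dx))]
row_tot = [sum(row) for row in freq]

s_fdx = sum(f * d for f, d in zip(col_tot, dx))
s_fdy = sum(f * d for f, d in zip(row_tot, dy))
s_fdx2 = sum(f * d * d for f, d in zip(col_tot, dx))
s_fdy2 = sum(f * d * d for f, d in zip(row_tot, dy))
s_fdxdy = sum(freq[i][j] * dx[j] * dy[i]
              for i in range(len(dy)) for j in range(len(dx)))

r = (N * s_fdxdy - s_fdx * s_fdy) / math.sqrt(
    (N * s_fdx2 - s_fdx ** 2) * (N * s_fdy2 - s_fdy ** 2))
print(N, s_fdx, s_fdy, s_fdx2, s_fdy2, s_fdxdy, round(r, 2))
```

Because r is unaffected by a change of origin and scale, any other choice of assumed means would lead to the same co-efficient.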
Industry A B C D E F G H I J K L M N O P
Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Profit rank 13 16 14 15 10 12 04 11 5 9 8 3 1 6 7 2
31
Calculate the rank correlation co-efficient.
Solution : Since only ranks are given, we have to calculate Spearman's rank correlation
co-efficient (ρ), which is given by the formula
$\rho = 1 - \dfrac{6\sum d^2}{N(N^2 - 1)}$
where d denotes the difference between the ranks for the same industry and N denotes the
number of industries. Here Σd² = 1236 and N = 16, so
$\rho = 1 - \dfrac{6(1236)}{16(16^2 - 1)} = 1 - \dfrac{7416}{16(256 - 1)} = 1 - \dfrac{7416}{4080} = 1 - 1.82$
ρ = -0.82
Calculations for Rank correlation co-efficient
2. Ten competitors in a debate contest are ranked by three judges in the following order :
1st Judge 1 6 5 10 3 2 4 9 7 8
2nd Judge 3 5 8 4 7 10 2 1 6 9
3rd Judge 6 4 9 8 1 2 3 10 5 7
Use the rank correlation co-efficient to determine, which pair of judges has the nearest
approach to common taste in debate.
Solution : Calculations for Rank correlation co-efficient
First Judge   Second Judge   Third Judge   d12 = R1-R2   d13 = R1-R3   d23 = R2-R3   d12²   d13²   d23²
(R1)          (R2)           (R3)
1             3              6             -2            -5            -3            4      25     9
6             5              4             +1            +2            +1            1      4      1
5             8              9             -3            -4            -1            9      16     1
10            4              8             +6            +2            -4            36     4      16
3             7              1             -4            +2            +6            16     4      36
2             10             2             -8            0             +8            64     0      64
4             2              3             +2            +1            -1            4      1      1
9             1              10            +8            -1            -9            64     1      81
7             6              5             +1            +2            +1            1      4      1
8             9              7             -1            +1            +2            1      1      4
                                                                        Σd12² = 200   Σd13² = 60   Σd23² = 214
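$\rho_{12} = 1 - \dfrac{6\sum d_{12}^2}{N(N^2 - 1)} = 1 - \dfrac{6 \times 200}{10(10^2 - 1)} = 1 - \dfrac{1200}{10(99)} = 1 - \dfrac{1200}{990} = 1 - 1.21 = -0.21$
$\rho_{13} = 1 - \dfrac{6\sum d_{13}^2}{N(N^2 - 1)} = 1 - \dfrac{6 \times 60}{10(10^2 - 1)} = 1 - \dfrac{360}{10(99)} = 1 - \dfrac{360}{990} = 1 - 0.36 = 0.64$
$\rho_{23} = 1 - \dfrac{6\sum d_{23}^2}{N(N^2 - 1)} = 1 - \dfrac{6 \times 214}{10(10^2 - 1)} = 1 - \dfrac{1284}{990} = 1 - 1.30 = -0.30$
Since ρ13 is the largest of the three co-efficients, the first and third judges have the nearest
approach to common taste in debate.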
3. Quotations of index numbers of equity share prices of a certain joint stock company and
of prices of preference shares are given below.
Use the method of rank correlation to determine the relationship between equity shares and
preference share prices.
Solution : Calculations for Rank correlation co-efficient.
Year    Equity shares (X)   Rank (R1)   Preference shares (Y)   Rank (R2)   d = R1-R2   d²
1971    97.5                4           75.1                    2           2           4
1972    99.4                7           75.9                    3           4           16
1973    98.6                6           77.1                    5           1           1
1974    95.2                2           78.2                    6           -4          16
1975    95.1                1           79.0                    7           -6          36
1976    98.4                5           74.8                    1           4           16
1977    97.1                3           76.2                    4           -1          1
Total                                                                       Σd = 0      Σd² = 90
$\rho = 1 - \dfrac{6\sum d^2}{N(N^2 - 1)} = 1 - \dfrac{6 \times 90}{7(7^2 - 1)} = 1 - \dfrac{540}{7(49 - 1)} = 1 - \dfrac{540}{336} = 1 - 1.607$
Therefore ρ = -0.607
4. Ten students were ranked on the basis of two attributes : beauty (x) and intelligence (y).
The co-efficient of rank correlation between x and y was found to be 0.5. It was later
discovered that the difference in ranks in the two attributes obtained by one of the students
was wrongly taken as 3 instead of 7. Find the correct co-efficient of correlation.
Solution : Given N = 10 and ρ = 0.5,
$\rho = 1 - \dfrac{6\sum d^2}{N(N^2 - 1)}$
$0.5 = 1 - \dfrac{6\sum d^2}{10(10^2 - 1)} = 1 - \dfrac{6\sum d^2}{10(99)} = 1 - \dfrac{6\sum d^2}{990}$
Therefore $\dfrac{6\sum d^2}{990} = 1 - 0.5 = 0.5$
$6\sum d^2 = 0.5 \times 990 = 495$
$\sum d^2 = \dfrac{495}{6} = 82.5$
Since one difference was wrongly taken as 3 instead of 7, the correct value of Σd² is given by :
Corrected Σd² = 82.5 - 3² + 7²
= 82.5 - 9 + 49
Corrected Σd² = 122.5
Therefore the corrected co-efficient of correlation is
$\rho = 1 - \dfrac{6(122.5)}{10(10^2 - 1)} = 1 - \dfrac{735}{990} = 1 - 0.742$
Therefore ρ = 0.258
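The correction procedure above (recover Σd² from the reported co-efficient, replace the wrong squared difference, then recompute) can be written as a small Python sketch. The function names are hypothetical and chosen only for this illustration.

```python
def sum_d2_from_rho(rho, n):
    """Invert rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)) to recover sum(d^2)."""
    return (1 - rho) * n * (n * n - 1) / 6

def corrected_rho(rho, n, wrong_d, right_d):
    """Replace one wrongly recorded rank difference and recompute rho."""
    d2 = sum_d2_from_rho(rho, n) - wrong_d ** 2 + right_d ** 2
    return 1 - 6 * d2 / (n * (n * n - 1))

original_sum = sum_d2_from_rho(0.5, 10)
fixed = corrected_rho(0.5, 10, wrong_d=3, right_d=7)
print(f"original sum of d^2 = {original_sum}, corrected rho = {fixed:.3f}")
```

The same two functions apply to any single mis-recorded difference, since only the squared difference enters the formula.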
5. A psychologist wanted to compare two methods, A and B, of teaching. He selected a
random sample of 22 students and grouped them into 11 pairs, so that the students in a
pair had approximately equal scores on an intelligence test. In each pair, one student was
taught by method A and the other by method B, and both were examined after the course.
The marks obtained by them are tabulated below :
Pair 1 2 3 4 5 6 7 8 9 10 11
A 48 58 38 28 60 38 54 60 40 56 22
B 74 70 32 52 46 54 38 40 32 22 42
In the x-series, we see that the value 60 occurs twice. The common rank assigned to each of
these values is (1 + 2)/2 = 1.5
x    y    Rank of x (Rx)   Rank of y (Ry)   d = Rx - Ry   d²
48 74 6 1 5 25
58 70 3 2 1 1
38 32 8.5 9.5 -1 1
28 52 10 4 6 36
60 46 1.5 5 -3.5 12.25
38 54 8.5 3 5.5 30.25
54 38 5 8 -3 9.00
60 40 1.5 7 -5.5 30.25
40 32 7 9.5 -2.5 6.25
56 22 4 11 -7 49.00
22 42 11 6 5 25.00
Σd = 0   Σd² = 225
Hence, we see that in the x-series the items 38 and 60 are repeated, each occurring
twice, and in the y-series the item 32 is repeated. Thus, in each of the three cases m = 2.
On applying the correction factor $\dfrac{m(m^2 - 1)}{12}$ for each repeated item, we get
$\rho = 1 - \dfrac{6\left[\sum d^2 + 3 \times \dfrac{2(2^2 - 1)}{12}\right]}{N(N^2 - 1)} = 1 - \dfrac{6 \times 226.5}{11 \times 120} = 1 - 1.0295$
Therefore ρ = -0.0295
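A possible Python sketch of the tied-rank procedure is shown below. The helper names are illustrative; ranks are assigned from the highest mark downwards (rank 1 = largest), tied values share the mean of the positions they occupy, and the correction m(m² - 1)/12 is added once for every repeated value in either series.

```python
from collections import Counter

method_a = [48, 58, 38, 28, 60, 38, 54, 60, 40, 56, 22]
method_b = [74, 70, 32, 52, 46, 54, 38, 40, 32, 22, 42]

def average_ranks(values):
    """Rank 1 = largest value; tied values share the mean of their positions."""
    positions = {}
    for pos, v in enumerate(sorted(values, reverse=True), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def tie_correction(values):
    """Sum of m(m^2 - 1)/12 over every value that occurs m > 1 times."""
    return sum(m * (m * m - 1) / 12 for m in Counter(values).values() if m > 1)

def spearman_with_ties(x, y):
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(average_ranks(x), average_ranks(y)))
    d2 += tie_correction(x) + tie_correction(y)
    return 1 - 6 * d2 / (n * (n * n - 1))

rho = spearman_with_ties(method_a, method_b)
print(f"rank correlation with tie correction = {rho:.4f}")
```

A value this close to zero suggests almost no association between the ranks obtained under the two teaching methods.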
6.7 SUMMARY
In this unit the concept of correlation or the association between two variables has been
discussed. Correlation analysis helps us to determine the strength of the linear relationship
between the two variables. Correlation analysis is used as a starting point for selecting
useful independent variables for regression analysis.
A scatter plot of the variables may suggest that the two variables are related but the value
of the Pearson Correlation coefficient r quantifies this association. The correlation
coefficient r may assume values between -1 and +1. The sign indicates whether the association
is direct (+ve) or inverse (-ve). A numerical value of r equal to unity indicates perfect
association while a value of zero indicates no association.
6.8 KEYWORDS
Karl Pearson's Coefficient of Correlation : A mathematical method for measuring the
intensity or magnitude of the linear relationship between two variable series, suggested
by Karl Pearson (1857-1936).
Scatter Diagram : A scatter diagram is one of the simplest ways of diagrammatic
representation of a bivariate distribution.
6.9 SELF ASSESSMENT QUESTIONS
1. What do you mean by the term Coefficient of correlation?
2. Distinguish between Coefficient of correlation and coefficient of variation.
3. What is a scatter diagram? How does it help in studying the correlation between two
variables, in respect of both their nature and extent?
4. Define Karl Pearson’s coefficient of correlation. What is it intended to measure?
5. What are the advantages of Spearman’s rank correlation coefficient over Karl Pearson’s
correlation coefficient?
6. Draw a correlation graph from the following data.
8. Ten Competitors in a beauty contest are ranked by three judges in the following order:
I Judge 1 5 4 8 9 6 10 7 3 2
II Judge 4 8 7 6 5 9 10 3 2 1
III Judge 6 7 8 1 5 10 9 2 3 4
Use the rank correlation coefficient to discuss which pair of judges has the nearest
approach to common tastes in beauty.
10. A random sample of 5 college students is selected and their grades in Kannada and
English are found to be:
1 2 3 4 5
Kannada 85 60 73 40 90
English 93 75 65 50 80
12. Calculate the Karl Pearson’s Coefficient of correlation between age and playing habits
from the data given below. Also calculate probable error and comment on the value:
Age 20 21 22 23 24 25
No. of students 500 400 300 240 200 160
Regular players 400 300 180 96 60 24
6.10 REFERENCES
1. Gupta S.P. Business Statistics –– S Chand and Sons Publishers, Delhi 2017
2. Quantitative Techniques for Business Decisions , Chetana Book House, Mysore 2015
3. Vignesh Prajapathi Big data Analysis With R and Hadoop Packet Publishing 2016
4. Operation Research SD Sharma Discovery Publishing House Delhi 2016
5. Srinath L. S PERT and CPM East West Press Delhi 2002
6. Kalavathy, Operation Research Vikas Publishing House, Delhi 2008
UNIT 7 : REGRESSION
STRUCTURE
7.0 Objectives
7.1 Introduction
7.2 Concept and Definition of Regression
7.3 Distinction between Correlation and Regression
7.4 Regression Analysis
7.5 Advantages of Regression Analysis
7.6 Types of Regression Analysis
7.0 OBJECTIVES
After studying this unit you should be able to :
∗ Define the concept of regression ;
∗ Distinguish between Correlation and Regression;
∗ Explain regression analysis;
∗ Differentiate between types of Regression Analysis and
∗ Solve the different problems on Regression.
7.1 INTRODUCTION
Regression analysis is a very powerful tool in the field of statistical analysis for predicting
the value of one variable on the basis of the given value of another variable, when the two
variables are related to each other. Examples of regression problems can be found in the
study of the yields of crops grown with different amounts of fertilizer, the length of life of
certain animals exposed to different amounts of radiation, the hardness of plastics which are
heat-treated for different periods of time and so on. In these problems the variation in one
measurement is studied for particular levels of the other variable selected by the
experimenter.
fathers, while the average height of the sons of a group of short fathers are greater than that
of the fathers”. It means the coming generations of tall or short parents tend to step back to
average height of population. The line showing this tendency was called by Galton a
‘Regression Line’.
Nowadays, modern statisticians prefer to use the term ‘Regression’ in the sense of
‘Estimation’, which is an important statistical tool in economics and business. Estimation or
prediction of economic activities is essential in planning. Estimating the relationship
among economic variables constitutes the essence of modern business management.
That is why modern statisticians use the term ‘estimating line’ instead of ‘regression line’.
Today, the term ‘Regression’ is used in a much broader sense to imply
‘functional relationships’. It means the estimation or prediction of the unknown value of
one variable from the known value of the other variable. The closer the relationship between
the two variables, the greater the confidence that may be placed in the estimates.
Both techniques are based on different sets of assumptions. In practice, the choice
between the two techniques depends upon the purpose of the investigation. The presence of
correlation does not imply causation, but the presence of causation implies correlation.
Association (correlation) need not imply causation (regression), because a close
association may be the result of pure chance. Causation (regression) implies
association (correlation), because cause and effect are based on a relationship.
regression line and better the line fits the data, that is, a good estimate can be made of the
value of variable y. When all the points fall on the line, the standard error of estimate
equals zero.
3. When the sample size is large (df=29), the interval estimation for predicting the value of
a dependent variable based on standard error of estimate is considered to be acceptable
by changing the values of either x or y. The magnitude of r2 remains the same regardless
of the values of the two variables.
1. Simple and Multiple regression Models : A regression model that includes only one
independent variable is called a simple regression model. If it includes more than one
independent variable, it is called a multiple regression model.
2. Linear and Non linear Regression Models : Linear regression models assume that
the relationship between the independent and dependent variable is linear.
3. Time Series and Cross section Regression Models : When a regression model is
estimated on the basis of time series data ( historic data), it is a time series regression
model. A cross sectional model is estimated on the basis of data at a given point in time
across the spectrum of a population or of a region.
1. The following table gives the age of cars of a certain make and their annual maintenance
costs. Obtain the regression equation for costs related to age :
Age of cars (in years) : 2 4 6 8
Maintenance costs (in hundreds of Rs.) : 10 20 25 30
Solution : Let the variable x denote the age of cars and y the maintenance costs of cars.
Here N = 4, Σx = 20, Σy = 85, Σxy = 490 and Σx² = 120, so that
$$\bar{x} = \frac{20}{4} = 5, \qquad \bar{y} = \frac{85}{4} = 21.25$$
$$b_{yx} = \frac{N\Sigma xy - (\Sigma x)(\Sigma y)}{N\Sigma x^2 - (\Sigma x)^2} = \frac{4 \times 490 - 20 \times 85}{4 \times 120 - (20)^2} = \frac{260}{80} = 3.25$$
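The slope computation above can be checked numerically. The following is a minimal Python sketch, not part of the original text; `fit_line` is a hypothetical helper name introduced only for illustration. It applies the same raw-sum formula for b_yx and then takes the intercept as a = ȳ − b_yx x̄.

```python
def fit_line(xs, ys):
    """Least-squares line y = a + b*x computed from raw sums (illustrative helper)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope: b = (N*Sxy - Sx*Sy) / (N*Sxx - Sx^2)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: the fitted line passes through the point of means
    a = sum_y / n - b * sum_x / n
    return a, b

age = [2, 4, 6, 8]          # age of cars in years
cost = [10, 20, 25, 30]     # maintenance cost in hundreds of Rs.
a, b = fit_line(age, cost)
print(a, b)  # 5.0 3.25
```

With these data the fitted line is y = 5 + 3.25x, with y in hundreds of rupees.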
2. Compute the appropriate regression equation of the following data :
Solution : Since y is the dependent variable, the appropriate regression line is that
of y on x.
Calculation of regression line
x y x² xy
2 18 4 36
4 12 16 48
5 10 25 50
6 8 36 48
8 7 64 56
11 5 121 55
Total 36 60 266 293
Therefore
$$\bar{x} = \frac{\Sigma x}{N} = \frac{36}{6} = 6, \qquad \bar{y} = \frac{\Sigma y}{N} = \frac{60}{6} = 10$$
$$b_{yx} = \frac{N\Sigma xy - (\Sigma x)(\Sigma y)}{N\Sigma x^2 - (\Sigma x)^2} = \frac{6 \times 293 - 36 \times 60}{6 \times 266 - (36)^2} = \frac{-402}{300} = -1.34$$
Regression equation of y on x is :
$$y - \bar{y} = b_{yx}(x - \bar{x})$$
y – 10 = -1.34 (x – 6)
Therefore y = -1.34 x + 18.04
3. Obtain the equations of the two lines of regression for the data given below :
X: 1 2 3 4 5 6 7 8 9
Y: 9 8 10 12 11 13 14 16 15
Solution : Calculation for regression lines, where x = (X − X̄) and y = (Y − Ȳ) denote
deviations from the means
X x = (X − X̄) x² Y y = (Y − Ȳ) y² xy
1 -4 16 9 -3 9 12
2 -3 9 8 -4 16 12
3 -2 4 10 -2 4 4
4 -1 1 12 0 0 0
5 0 0 11 -1 1 0
6 1 1 13 1 1 1
7 2 4 14 2 4 4
8 3 9 16 4 16 12
9 4 16 15 3 9 12
ΣX = 45 Σx = 0 Σx² = 60 ΣY = 108 Σy = 0 Σy² = 60 Σxy = 57
$$\bar{X} = \frac{\Sigma X}{N} = \frac{45}{9} = 5, \qquad \bar{Y} = \frac{\Sigma Y}{N} = \frac{108}{9} = 12$$
Regression co-efficients
X on Y :
$$b_{xy} = \frac{\Sigma xy}{\Sigma y^2} = \frac{57}{60} = 0.95$$
Y on X :
$$b_{yx} = \frac{\Sigma xy}{\Sigma x^2} = \frac{57}{60} = 0.95$$
Equations of lines of regression
X on Y :
X – X̄ = bxy (Y – Ȳ)
X – 5 = 0.95 (Y – 12)
X = 0.95 Y – 6.4
Y on X :
Y – Ȳ = byx (X – X̄)
Y – 12 = 0.95 (X – 5)
Y = 0.95 X + 7.25
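Both regression lines can be obtained from the same three sums of deviations. The sketch below is an illustrative Python check, assuming nothing beyond the data of this problem; the function name `regression_lines` is hypothetical.

```python
def regression_lines(xs, ys):
    """Return (byx, a_yx, bxy, a_xy) for y = a_yx + byx*x and x = a_xy + bxy*y."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sums of products and squares of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    byx = sxy / sxx   # regression coefficient of y on x
    bxy = sxy / syy   # regression coefficient of x on y
    return byx, mean_y - byx * mean_x, bxy, mean_x - bxy * mean_y

xs = list(range(1, 10))
ys = [9, 8, 10, 12, 11, 13, 14, 16, 15]
byx, a_yx, bxy, a_xy = regression_lines(xs, ys)
print(round(byx, 2), round(a_yx, 2), round(bxy, 2), round(a_xy, 2))
```

Here Σx² and Σy² happen to be equal, which is why the two regression coefficients coincide; in general they differ.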
4. From the data given below, find ;
a. the two regression equations.
b. The co-efficient of correlation between the marks in economics and statistics.
c. The most likely marks in statistics when marks in economics are 30 :
Marks in : 25 28 35 32 31 36 29 38 34 32
economics
Marks in : 43 46 49 41 36 32 31 30 33 39
statistics
Solution :
a) Let us denote the marks in economics by the variable x and the marks in statistics by the
variable y.
Calculation for regression equations.
x (1) x − x̄ (2) (x − x̄)² (3) y (4) y − ȳ (5) (y − ȳ)² (6) (x − x̄)(y − ȳ) (7)
25 -7 49 43 5 25 -35
28 -4 16 46 8 64 -32
35 +3 9 49 11 121 +33
32 0 0 41 3 9 0
31 -1 1 36 -2 4 2
36 4 16 32 -6 36 -24
29 -3 9 31 -7 49 21
38 6 36 30 -8 64 -48
34 2 4 33 -5 25 -10
32 0 0 39 1 1 0
Total 320 0 140 380 0 398 -93
$$\bar{x} = \frac{\Sigma x}{N} = \frac{320}{10} = 32, \qquad \bar{y} = \frac{\Sigma y}{N} = \frac{380}{10} = 38$$
Regression co-efficients
x on y :
$$b_{xy} = \frac{\Sigma (x - \bar{x})(y - \bar{y})}{\Sigma (y - \bar{y})^2} = \frac{-93}{398} = -0.234$$
y on x :
$$b_{yx} = \frac{\Sigma (x - \bar{x})(y - \bar{y})}{\Sigma (x - \bar{x})^2} = \frac{-93}{140} = -0.664$$
Equations of lines of regression
x on y :
x – x̄ = bxy (y – ȳ)
x – 32 = -0.234 (y – 38)
x = -0.234 y + 40.892
y on x :
y – ȳ = byx (x – x̄)
y – 38 = -0.664 (x – 32)
y = -0.664 x + 59.248
b) We have
$$r^2 = b_{yx} \cdot b_{xy} = (-0.664) \times (-0.234) = 0.1554$$
Since both the regression co-efficients are negative, r must be negative; hence r = -0.394.
c) When x = 30,
y = -0.664 x 30 + 59.248 = 39.328, or about 39.
Hence when the marks in economics are 30, the most likely marks in statistics are 39.
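The rule used in part (b), that r is the square root of the product of the two regression coefficients and carries their common sign, can be written as a short Python sketch. The function name `correlation_from_coefficients` is a hypothetical helper, and the inputs are the rounded coefficients of this problem.

```python
import math

def correlation_from_coefficients(byx, bxy):
    """r = +/- sqrt(byx * bxy), taking the common sign of the two coefficients."""
    if byx * bxy < 0:
        # The two regression coefficients always share a sign for real data
        raise ValueError("regression coefficients must have the same sign")
    return math.copysign(math.sqrt(byx * bxy), byx)

r = correlation_from_coefficients(-0.664, -0.234)
print(round(r, 3))  # -0.394
```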
5. The quantity of a raw material purchased by ABC Ltd., at the specified prices during
the 12 months of 1982 is given below :
Month Price/kg Quantity Month Price/kg Quantity
(in Rs) (in kg) in (Rs.) (in k.g.)
January 96 250 July 112 220
February 110 200 August 112 220
March 100 250 September 108 200
April 90 280 October 116 210
May 86 300 November 86 300
June 92 300 December 92 250
a. Find the regression equations based on the above data.
b. Can you estimate the approximate quantity likely to be purchased if the price shoots
up to Rs. 124/kg?
c. Hence or otherwise obtain the coefficient of correlation between the price prevailing
and the quantity demanded.
Solution : Let x denote the price and y the quantity.
$$\bar{x} = \frac{1200}{12} = 100, \qquad \bar{y} = \frac{2980}{12} = 248.33$$
a) Regression co-efficients, with deviations dx = x − 100 and dy = y − 248 taken from
assumed means, so that Σdx = 0, Σdy = 4, Σdx² = 1344, Σdy² = 16768 and Σdxdy = −4360 :
x on y :
$$b_{xy} = \frac{\Sigma dxdy - \frac{(\Sigma dx)(\Sigma dy)}{N}}{\Sigma dy^2 - \frac{(\Sigma dy)^2}{N}} = \frac{-4360 - 0}{16768 - \frac{4^2}{12}} = -0.26$$
y on x :
$$b_{yx} = \frac{\Sigma dxdy - \frac{(\Sigma dx)(\Sigma dy)}{N}}{\Sigma dx^2 - \frac{(\Sigma dx)^2}{N}} = \frac{-4360 - 0}{1344 - 0} = -3.244$$
Equations of lines of regression :
x on y :
x – x̄ = bxy (y – ȳ)
x – 100 = -0.26 (y – 248.33)
Therefore x = -0.26 y + 164.57
y on x :
y – ȳ = byx (x – x̄)
y – 248.33 = -3.244 (x – 100)
Therefore y = -3.244 x + 572.73
b) For x = 124, y = -3.244 x 124 + 572.73 = 170.474. Thus, an estimated 170.5 kg would
be bought at the price of Rs.124.
c) $r = \pm\sqrt{b_{xy} \cdot b_{yx}} = -\sqrt{(-0.26) \times (-3.244)} = -0.92$. The negative sign is taken for r because both regression
co-efficients are negative.
Calculate the regression equation of sales on advertising expenditure. Estimate the probable
sales when advertisement expenditure is Rs.60,000/-
Calculations for regression equation of sales on advertising expenditure.
Year Adv. expenditure x (in ‘000 Rs.) dx = x − 25 dx² Sales y (lakh Rs.) dy = (y − 7)/0.1 dy² dxdy
1980 12 -13 169 5.0 -20 400 260
1981 15 -10 100 5.6 -14 196 140
1982 15 -10 100 5.8 -12 144 120
1983 23 -2 4 7.0 0 0 0
1984 24 -1 1 7.2 2 4 -2
1985 38 13 169 8.8 18 324 234
1986 42 17 289 9.2 22 484 374
1987 48 23 529 9.5 25 625 575
Total 17 1361 21 2177 1701
The line of regression of sales (y) on advertising expenditure (x) is $y - \bar{y} = b_{yx}(x - \bar{x})$, where
$$\bar{x} = A_x + \frac{\Sigma dx}{N} \times c_x = 25 + \frac{17}{8} \times 1 = 27.125$$
$$\bar{y} = A_y + \frac{\Sigma dy}{N} \times c_y = 7 + \frac{21}{8} \times 0.1 = 7.26$$
$$b_{yx} = \frac{\Sigma dxdy - \frac{(\Sigma dx)(\Sigma dy)}{N}}{\Sigma dx^2 - \frac{(\Sigma dx)^2}{N}} \times \frac{c_y}{c_x} = \frac{1701 - \frac{(17)(21)}{8}}{1361 - \frac{(17)^2}{8}} \times \frac{0.1}{1} = \frac{1701 - 44.625}{1361 - 36.125} \times \frac{0.1}{1} = 0.125$$
Therefore y – 7.26 = 0.125 (x – 27.125), that is,
Y = 0.125 x + 3.87
When x = 60, Y = 0.125 x 60 + 3.87 = 11.37.
Hence probable sales is 11.37 lakh rupees when the advertisement expenditure is
Rs.60,000.
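The step-deviation calculation above can be reproduced in a few lines. This is a minimal Python sketch under the same choices as the worked solution (assumed means 25 and 7, common factors 1 and 0.1); the function name `step_deviation_fit` is hypothetical and introduced only for illustration.

```python
def step_deviation_fit(xs, ys, ax, ay, cx, cy):
    """Fit y = a + b*x using coded deviations dx = (x - ax)/cx, dy = (y - ay)/cy."""
    n = len(xs)
    dx = [(x - ax) / cx for x in xs]
    dy = [(y - ay) / cy for y in ys]
    s_dx, s_dy = sum(dx), sum(dy)
    s_dxdy = sum(u * v for u, v in zip(dx, dy))
    s_dx2 = sum(u * u for u in dx)
    # Slope in coded units, rescaled back to original units by cy/cx
    b = (s_dxdy - s_dx * s_dy / n) / (s_dx2 - s_dx ** 2 / n) * cy / cx
    mean_x = ax + s_dx / n * cx
    mean_y = ay + s_dy / n * cy
    return mean_y - b * mean_x, b

adv = [12, 15, 15, 23, 24, 38, 42, 48]            # '000 Rs.
sales = [5.0, 5.6, 5.8, 7.0, 7.2, 8.8, 9.2, 9.5]  # lakh Rs.
a, b = step_deviation_fit(adv, sales, ax=25, ay=7, cx=1, cy=0.1)
estimate = a + b * 60
print(round(a, 2), round(b, 3), round(estimate, 2))
```

The assumed means and common factors only simplify hand arithmetic; any other choice gives the same fitted line.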
7.7 SOLVED PROBLEMS ON REGRESSION
PROBLEM : 1 :
A researcher wants to find out if there is any relationship between the ages of the husbands
and the ages of the wives. In other words, do old husbands have old wives and young husbands
young wives? He took a random sample of 7 couples whose respective ages are given below :
Age of Husband (x) Age of wife (y)
25 18
27 20
29 20
32 25
35 25
37 30
39 37
c) For this data compute the regression line.
d) Based upon the correlation between their ages, what would be the age of the wife, if the
husband’s age is 36 years.
Solution :
c) The regression line is identified by :
$$Y_c = a + bx$$
where
$$b = \frac{n\Sigma xy - (\Sigma x)(\Sigma y)}{n(\Sigma x^2) - (\Sigma x)^2} \qquad \text{and} \qquad a = \bar{y} - b\bar{x}$$
$$b = \frac{40586 - 39200}{51338 - 50176} = \frac{1386}{1162} = 1.193$$
$$a = \frac{175}{7} - 1.193 \times \frac{224}{7} = 25 - 38.176 = -13.176$$
Hence, the regression equation would be
Yc = - 13.176 + 1.193 x
d) If the husband’s age is 36 years, i.e. if x = 36, then the computed age of the wife or Yc
would be
Yc = -13.176 + 1.193 (36)
= - 13.176 + 42.948 = 29.772
= 30 years.
Short cut Method :
Calculations can become much easier if, instead of taking the actual values of X
and Y, we take deviations from their respective means. Then the regression equation
would be
$$(Y - \bar{Y}) = b_{yx}(X - \bar{X})$$
$$\text{where } b_{yx} = \frac{\Sigma xy}{\Sigma x^2} = r\frac{\sigma_y}{\sigma_x}, \qquad x = (X - \bar{X}), \; y = (Y - \bar{Y})$$
a) Taking the same problem, we shall find the regression equation as follows :
X Y x = (X − X̄) y = (Y − Ȳ) xy x² y²
25 18 -7 -7 49 49 49
27 20 -5 -5 25 25 25
29 20 -3 -5 15 9 25
32 25 0 0 0 0 0
35 25 +3 0 0 9 0
37 30 +5 +5 25 25 25
39 37 +7 +12 84 49 144
ΣX = 224 ΣY = 175 Σx = 0 Σy = 0 Σxy = 198 Σx² = 166 Σy² = 268
$$\bar{X} = \frac{224}{7} = 32, \qquad \bar{Y} = \frac{175}{7} = 25$$
Regression equation of Y on X is
Y – 25 = byx (X – 32)
$$\text{where } b_{yx} = \frac{\Sigma xy}{\Sigma x^2} = \frac{198}{166} = 1.193$$
so that Y = 1.193 X – 13.176, as before.
Problem 2
On the basis of the data given in problem 1, calculate the most probable age of husband
if the age of wife is 28 years.
Solution : For calculating the probable age of a husband for the given age of a wife, we
require the following regression equation based upon deviations of X and Y from their
respective means
$$X - \bar{X} = b_{xy}(Y - \bar{Y})$$
$$\text{where } b_{xy} = r\frac{\sigma_x}{\sigma_y} = \frac{\Sigma xy}{\Sigma y^2}$$
From the data given in problem 1, we have
$$\bar{X} = \frac{224}{7} = 32, \qquad \bar{Y} = \frac{175}{7} = 25$$
Σxy = 198
Σy² = 268
X – 32 = bxy (Y – 25)
$$\text{where } b_{xy} = \frac{\Sigma xy}{\Sigma y^2} = \frac{198}{268} = 0.739$$
Or X = 0.739 Y + 32 – 18.475
Or X = 0.739 Y + 13.525
Thus if the wife’s age is 28 years or Y = 28, then the age of the husband would be
X = 0.739 (28) + 13.525 = 34.217, or about 34 years.
Problem 3 :
A researcher wants to find out if there is any relationship between the heights of the sons
and the heights of the fathers. He took a random sample of six fathers and their six sons.
Their heights in inches are given below in an ordered array :
Height of Father in inches (x) Height of son in inches (y)
63 66
65 68
66 65
67 67
67 69
68 70
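$$\bar{x} = \frac{\Sigma x}{N} = \frac{396}{6} = 66, \qquad \bar{y} = \frac{\Sigma y}{N} = \frac{405}{6} = 67.5$$
i) Regression equation of y on x : $y - \bar{y} = b_{yx}(x - \bar{x})$
$$b_{yx} = \frac{\Sigma xy}{\Sigma x^2}, \qquad \text{where } x = X - \bar{X}, \; y = Y - \bar{Y}$$
If however either x̄ or ȳ or both are not whole numbers, then it is simpler to take
deviations from some assumed mean than from actual means. Taking dx = x − 66 and
dy = y − 67, we have Σdx = 0, Σdy = 3, Σdxdy = 10, Σdx² = 16 and Σdy² = 19. In such a case :
$$b_{yx} = \frac{N\Sigma dxdy - \Sigma dx \, \Sigma dy}{N\Sigma dx^2 - (\Sigma dx)^2} = \frac{6 \times 10 - 0 \times 3}{6 \times 16 - (0)^2} = \frac{60}{96} = 0.625$$
Substituting the value in the equation
Y – 67.5 = .625 (x – 66)
Yc = .625 x + 26.25
Hence, if the height of the father is 70 inches or x = 70, the height of the son or Yc
would be
Yc = .625 (70) + 26.25
= 43.75 + 26.25 = 70 inches
ii) Regression equation of x on y : $x - \bar{x} = b_{xy}(y - \bar{y})$
$$b_{xy} = \frac{N\Sigma dxdy - \Sigma dx \, \Sigma dy}{N\Sigma dy^2 - (\Sigma dy)^2} = \frac{6 \times 10 - 0 \times 3}{6 \times 19 - (3)^2} = \frac{60}{105} = 0.571$$
Substituting the value in the equation
X – 66 = .571 (y – 67.5)
X = .571 y – 38.542 + 66
Xc = .571 Y + 27.458
Hence if the height of the son is 65 inches or Y = 65, the height of the father or Xc would
be
Xc = .571 (65) + 27.458
= 37.115 + 27.458
= 64.573 inches
Karl Pearson’s co-efficient of correlation :
$$r = \sqrt{b_{xy} \cdot b_{yx}}, \qquad \text{since } b_{xy} \cdot b_{yx} = r\frac{\sigma_x}{\sigma_y} \times r\frac{\sigma_y}{\sigma_x} = r^2$$
$$r = \sqrt{.571 \times .625} = \sqrt{.356875}$$
r = + .597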
Problem 4 :
A department store gives in-service training to its salesmen, which is followed by a
test. The following data give the test scores and sales made by nine salesmen during a
certain period.
Test Scores : 14 19 24 21 26 22 15 20 19
Sales (‘000 Rs.) 31 36 48 37 50 45 33 41 39
Calculate Karl Pearson’s co-efficient of correlation between the test scores and sales.
Solution : Calculation of co-efficient of correlation :
Test scores (X) Sales (‘000 Rs.) (Y) x = X − X̄ y = Y − Ȳ xy x² y²
14 31 -6 -9 54 36 81
19 36 -1 -4 4 1 16
24 48 +4 +8 32 16 64
21 37 +1 -3 -3 1 9
26 50 +6 +10 60 36 100
22 45 +2 +5 10 4 25
15 33 -5 -7 35 25 49
20 41 0 +1 0 0 1
19 39 -1 -1 1 1 1
ΣX = 180 ΣY = 360 Σx = 0 Σy = 0 Σxy = 193 Σx² = 120 Σy² = 346
where x = X − X̄ and y = Y − Ȳ, with
$$\bar{X} = \frac{180}{9} = 20, \qquad \bar{Y} = \frac{360}{9} = 40$$
Σxy = 193
Σx² = 120
Σy² = 346
$$r = \frac{\Sigma xy}{\sqrt{\Sigma x^2 \times \Sigma y^2}} = \frac{193}{\sqrt{120 \times 346}} = \frac{193}{\sqrt{41520}} = \frac{193}{203.76} = 0.947$$
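The deviation form of Karl Pearson's coefficient used above translates directly into code. The following Python sketch is an illustration only; `pearson_r` is a hypothetical helper name, and the data are the test scores and sales of this problem.

```python
import math

def pearson_r(xs, ys):
    """Karl Pearson's r from deviations about the means: Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

scores = [14, 19, 24, 21, 26, 22, 15, 20, 19]
sales = [31, 36, 48, 37, 50, 45, 33, 41, 39]   # '000 Rs.
r = pearson_r(scores, sales)
print(round(r, 3))  # 0.947
```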
A leading company engaged in the production of detergents has 10 vacancies for salesmen,
for which N = 15 persons have been called for interview. The interview board consists of the
Sales Manager and a Psychologist. The rankings given by the Sales Manager and the Psychologist
to the 15 candidates who attended the interview, according to their serial numbers in the
interview list, are given in col (2) and col (3) respectively of table 13.6 below :
Regression
The percentage marks obtained in graduation and an MBA entrance test of 10 students
were as follows.
Graduation 50 52 55 60 62 65 65 66 70 75
Entrance test 52 50 57 65 65 62 65 65 71 78
x y xy x² y²
50 52 2600 2500 2704
52 50 2600 2704 2500
55 57 3135 3025 3249
60 65 3900 3600 4225
62 65 4030 3844 4225
65 62 4030 4225 3844
65 65 4225 4225 4225
66 65 4290 4356 4225
70 71 4970 4900 5041
75 78 5850 5625 6084
Σx = 620 Σy = 630 Σxy = 39630 Σx² = 39004 Σy² = 40322
The regression equation of x on y is
xc = a’ + b’y
with the two normal equations as
Σx = Na’ + b’Σy (i)
Σxy = a’Σy + b’Σy² (ii)
Substituting the values we get
620 = 10a’ + 630b’
39630 = 630a’ + 40322b’
Solving (i) and (ii) for a’ and b’
a’ = 5.180
and b’ = 0.902
Hence the regression equation of x on y is
xc = 5.180 + 0.902y
The regression equation of y on x is
yc = a + bx
with the two normal equations as
Σy = Na + bΣx (iii)
Σxy = aΣx + bΣx² (iv)
Substituting the values we get
630 = 10a + 620b
39630 = 620a + 39004b
Solving (iii) and (iv) for a and b
a = 0.340
and b = 1.011
Hence the regression equation of y on x is yc = 0.340 + 1.011x
The coefficient of correlation is
$$r = \frac{N\Sigma xy - (\Sigma x)(\Sigma y)}{\sqrt{[N\Sigma x^2 - (\Sigma x)^2][N\Sigma y^2 - (\Sigma y)^2]}} = \frac{5700}{\sqrt{(5640)(6320)}} = 0.955$$
The following data relate to marketing expenditure (Rs. Lac) and the corresponding sales
(in Rs. Crores)
Marketing Expenditure (x) 10 12 15 20 23
Sales (y) 14 17 23 21 25
Regression equation of x on y is
xc = a’ + b’y
The regression coefficient b’ of x on y is given by
$$b' = \frac{n\Sigma xy - (\Sigma x)(\Sigma y)}{n\Sigma y^2 - (\Sigma y)^2} = \frac{5 \times 1684 - 80 \times 100}{5 \times 2080 - (100)^2} = \frac{420}{400} = 1.05 \qquad (i)$$
and the intercept by
$$a' = \bar{x} - b'\bar{y} \qquad (ii)$$
Substituting the required values in (ii) above
$$a' = \frac{80}{5} - (1.05)\frac{100}{5} = 16 - (1.05)(20) = -5.0$$
Again substituting for a’ and b’ in the regression equation
Xc = a’ + b’ y
We have
Xc = (-5.0) + (1.05) y
When y = 40 (Rs. Crores) the corresponding x value is
Xc = (-5) + (1.05) 40
= (-5) + 42
= 37
That is, to achieve a sales target of Rs. 40 crore there is a need to spend Rs. 37 lakh on marketing.
From the data given in problem 13.2 above find the coefficient of correlation between
marketing expenditure and sales.
Solution : The co-efficient of correlation r is given by
$$r = \frac{n\Sigma xy - (\Sigma x)(\Sigma y)}{\sqrt{[n\Sigma x^2 - (\Sigma x)^2][n\Sigma y^2 - (\Sigma y)^2]}} = \frac{420}{\sqrt{590 \times 400}} = 0.865$$
The data for 10 years on sales (y) and advertisement expenditure (x) of a particular
product yielded the following summated values (Rs. Lac)/
∑x = 15, ∑y = 110, ∑xy = 400 ∑x2 = 250 and ∑y2 = 3200 find the following
a) Regression co-efficient b of y on x and then the y-intercept a
b) x-intercept a’ and then the regression coefficient b’ of x on y.
c) Most approximate value of y for x = 5 and that of x for y = 25
d) Standard error of estimate Syx and Sxy.
Solution :
a) Regression coefficient b of y on x is given by
$$b = \frac{n\Sigma xy - (\Sigma x)(\Sigma y)}{n\Sigma x^2 - (\Sigma x)^2} = \frac{10(400) - (15)(110)}{10(250) - (15)^2} = \frac{2350}{2275} = 1.033$$
$$b' = \frac{\bar{x} - a'}{\bar{y}}$$
or
$$b' = \frac{(\Sigma x / N) - a'}{\Sigma y / N} = \frac{(15 / 10) - 0.201}{(110 / 10)} = 0.118$$
7.8 SUMMARY
In this unit the fundamentals of linear regression have been highlighted. Broadly speaking, the
fitting of any chosen mathematical function to given data is termed regression analysis.
The estimation of the parameters of this model is accomplished by the least squares criterion,
which minimizes the sum of squares of the errors over all the data points. Regression
is thus a potent device for establishing relationships between variables from the given data.
The discovered relationship can be used for predictive purposes.
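The least squares criterion mentioned above can be illustrated with a small numerical check. The Python sketch below is only an illustration: it borrows the car age and maintenance cost data from the first worked example of this unit, for which the fitted line is y = 5 + 3.25x, and the helper name `sum_squared_errors` is hypothetical. It compares the sum of squared errors of the fitted line with that of a few slightly perturbed lines.

```python
def sum_squared_errors(a, b, xs, ys):
    """Sum of squared vertical distances between the data and the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

xs = [2, 4, 6, 8]        # age of cars in years
ys = [10, 20, 25, 30]    # maintenance cost in hundreds of Rs.

# Error of the least-squares line
best = sum_squared_errors(5.0, 3.25, xs, ys)

# Errors of nearby lines obtained by shifting intercept and slope slightly
nearby = [
    sum_squared_errors(5.0 + da, 3.25 + db, xs, ys)
    for da in (-0.5, 0.0, 0.5)
    for db in (-0.1, 0.0, 0.1)
    if (da, db) != (0.0, 0.0)
]
print(best, min(nearby))
```

Every perturbed line gives a larger sum of squared errors than the fitted line, which is what the criterion requires.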
Salesmen A B C D E F G H I
Intelligence 50 60 50 60 80 50 80 40 70
Scores
Weekly 30 60 40 50 60 30 70 50 60
Sales
a. Obtain the regression equation of sales on intelligence test scores of the salesmen.
b. If the intelligence test scores of a salesman in 65, what would be his expected
weekly sales?
Coefficient of Correlation-0.8
Find the two regression equations that are associated with the above values.
Price
Index of 78 77 85 88 87 82 81 77 76 83 97 93
cotton(X)
Price
index of 84 82 82 85 89 90 88 92 83 89 98 99
wool(Y)
X Y
Mean 65 67
Standard Deviation 2.5 3.5
9. Price indices of cotton and wool are given below for the 12 months of a year. Obtain the equations
of lines of regressions between the indices
Height in Weight in lbs
inches
90-100 100-110 110-120 120-130
50-55 4 7 5 2
55-60 6 10 7 4
60-65 6 12 10 7
65-70 3 8 6 3
10. The following table shows the age (in years) of 10 children and a quantitative measure
of their aggressive behaviour (measured on a scale of 0 to 10)
Age 6 6 6.7 7 7.4 7.9 8 8.2 8.5 8.9
Aggressive 9 6 7 8 7 4 2 3 3 1
behaviour
7.11 REFERENCES
1. Gupta S.P. Business Statistics –– S Chand and Sons Publishers, Delhi 2017
2. Quantitative Techniques for Business Decisions , Chetana Book House, Mysore 2015
3. Vignesh Prajapathi Big data Analysis With R and Hadoop Packet Publishing 2016
4. Operation Research SD Sharma Discovery Publishing House Delhi 2016
5. Srinath L. S PERT and CPM East West Press Delhi 2002
6. Kalavathy, Operation Research Vikas Publishing House, Delhi 2008
UNIT 8 : MULTIPLE CORRELATION AND REGRESSION
STRUCTURE
8.0 Objectives
8.1 Introduction
8.2 Concept of Multiple Regression and Multiple Correlation
8.3 Concept of Partial Correlation
8.4 The Purpose of Multiple Correlation Co-efficient
8.5 Partial Regression Co-efficient – Least Square Normal Equations
8.6 Solved Problems on Partial Regression Co-efficient and Correlation Co- efficient
8.7 Solved Problems on Multiple Regression
8.8 Summary
8.9 Key Words
8.10 Self Assessment questions
8.11 References
8.0 OBJECTIVES
After studying this unit you should be able to:
∗ Explain the concept of Partial and Multiple Correlation and Regression ;
∗ Define the Partial Regression co-efficient;
∗ Analyse the relationship between partial regression co-efficient and Correlation co-
efficient;
∗ Describe the concept of Multiple Regression and
∗ Solve the different problems on Partial and Multiple Regression.
8.1 INTRODUCTION
The correlation and regression coefficients discussed earlier measure the degree and
nature of the effect of one variable on another. While it is useful to know how one phenomenon
is influenced by another, it is also important to know how one phenomenon is affected by
several other variables.
Multiple Regression is a very advanced statistical tool and it is extremely powerful
when you are trying to develop a ‘model’ for predicting a wide variety of outcomes. In the
previous units simple relations were discussed, these were linear correlation and linear
regression between two variables. But most economic and business phenomena cannot be
described in such a simplistic manner.
8.2 CONCEPT OF MULTIPLE REGRESSION AND MULTIPLE CORRELATION
Multiple Regression is a statistical tool that allows you to examine how multiple
independent variables are related to a dependent variable. Multiple regression analysis
represents a logical extension of two-variable regression analysis. Instead of a single
independent variable, two or more independent variables are used to estimate the values of a
dependent variable. However, the fundamental concept in the analysis remains the same.
The term multiple correlation refers to the theory of correlation involving more than
two variables. Multiple correlation is used to find the degree of inter-relationship among
three or more variables. For example the yield of crop in a year may depend upon rainfall,
manure, the average temperature and average humidity during the period between sowing
and harvesting of the crop; the results of houses may depend upon tax rates as well as upon
building costs and upon other variable also; general intelligence in schools may be related to
grades in mathematics and grades in English and so on. Thus the aim of the theory of multiple
correlation is to know how far the dependent variable is influenced by the independent
variables.
8.3 CONCEPT OF PARTIAL CORRELATION
It is often important to measure the correlation between a dependent variable and one
particular independent variable when all other variables involved are kept constant i.e. when
the effects of all other variables are removed. The partial correlation analysis measures the
strength of the relationship between Y and one independent variable in such a way that
variations in the other independent variables are taken into account. A partial correlation
coefficient is analogous to a partial regression coefficient in that all other factors are ‘held
constant’. Simple correlation, on the other hand, ignores the effect of all other variables
even though these variables might be quite closely related to the dependent variable or to
one another.
Partial correlation is also called ‘net correlation’. It is a study of the relationship between
one dependent variable and one independent variable by keeping the other independent
variables constant. In simple correlation the effect of other independent variables was ignored.
a) Zero Order co-efficient : Simple correlation between two variables is called the
Zero Order co-efficient, as in simple correlation no factor is held constant.
b) First order co-efficient : If a partial correlation is studied between two variables by
keeping a third variable constant it would be called a first order co-efficient as one
variable is kept constant.
c) Second order co-efficient : If a partial correlation is studied between two variables by
keeping two other variables constant it would be called a second order co-efficient as
two variables are kept constant.
The partial correlation co-efficient provides a measure of the relationship between the
dependent variable and one other variable, with the effect of the rest of the variables eliminated.
The function of partial correlation analysis is the measurement of the relationship between
two factors, with the effects of one or more other factors eliminated. If the assumptions of
the method are true for a series of data, the power of partial analysis is great.
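A first order partial correlation co-efficient can be computed from the three zero order co-efficients with the standard formula r12.3 = (r12 − r13 r23) / √[(1 − r13²)(1 − r23²)]. The formula is not stated in this section and is supplied here as a standard result; the Python sketch below is illustrative, the function name `partial_correlation` is hypothetical, and the input values are borrowed from a trivariate problem later in this unit.

```python
import math

def partial_correlation(r12, r13, r23):
    """First order partial correlation of variables 1 and 2, holding variable 3 constant."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

# Zero order (simple) correlations, used here purely as sample inputs
r12, r13, r23 = 0.7, 0.6, 0.4
r12_3 = partial_correlation(r12, r13, r23)
print(round(r12_3, 3))  # 0.627
```

When variable 3 is uncorrelated with both other variables, the partial co-efficient reduces to the simple co-efficient r12.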
1. It serves as a measure of the degree of association between one variable taken as the
dependent variable and a group of other variables taken as the independent variables.
2. It also serves as a measure of goodness of fit of the calculated plane of regression and
consequently as a measure of the general degree of accuracy of estimates made by
reference to equation for the plane of regression.
a = A regression constant representing the intercept on the y-axis; its value is zero when the
regression equation passes through the origin.
b12.3, b13.2 = partial regression co-efficients
b12.3 = represents the change in x1 for each unit change in x2, while x3 is held constant;
b13.2 = represents the change in x1 for each unit change in x3, while
x2 is held constant.
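Solution : (a) The regression equation of x1 on x2 and x3 is given by
x1 = b12.3 x2 + b13.2 x3
$$\text{where } b_{12.3} = \frac{\sigma_1}{\sigma_2} \cdot \frac{r_{12} - r_{13} r_{23}}{1 - r_{23}^2} = \frac{10}{8} \cdot \frac{0.8 - (0.6)(0.5)}{1 - (0.5)^2} = 0.833$$
$$b_{13.2} = \frac{\sigma_1}{\sigma_3} \cdot \frac{r_{13} - r_{12} r_{23}}{1 - r_{23}^2} = \frac{10}{5} \cdot \frac{0.6 - (0.8)(0.5)}{1 - (0.5)^2} = 0.533$$
Therefore x1 = 0.833 x2 + 0.533 x3.
(b) The regression equation of x2 on x1 and x3 is given by
x2 = b21.3 x1 + b23.1 x3
$$\text{where } b_{21.3} = \frac{\sigma_2}{\sigma_1} \cdot \frac{r_{12} - r_{23} r_{13}}{1 - r_{13}^2} = \frac{8}{10} \cdot \frac{0.8 - (0.5)(0.6)}{1 - (0.6)^2} = 0.625$$
$$b_{23.1} = \frac{\sigma_2}{\sigma_3} \cdot \frac{r_{23} - r_{12} r_{13}}{1 - r_{13}^2} = 1.6 \times \frac{0.5 - 0.48}{1 - 0.36} = 1.6 \times \frac{0.02}{0.64} = 1.6 \times 0.03125 = 0.05$$
Therefore x2 = b21.3 x1 + b23.1 x3, that is,
x2 = 0.625 x1 + 0.05 x3.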
2) In a trivariate distribution :
σ1 = 3 r23 = 0.4
σ2 = 4 r13 = 0.6
σ3 = 5 r12 = 0.7
Determine the regression equation of x1 on x2 and x3, if the variables are measured from
their means.
Solution : The required regression equation of x1 on x2 and x3 is given by
x1 = b12.3 x2 + b13.2 x3
$$\text{where } b_{12.3} = \frac{\sigma_1}{\sigma_2} \cdot \frac{r_{12} - r_{13} r_{23}}{1 - r_{23}^2} = \frac{3}{4} \cdot \frac{0.7 - (0.6)(0.4)}{1 - (0.4)^2} = 0.75 \times \frac{0.7 - 0.24}{1 - 0.16} = 0.75 \times \frac{0.46}{0.84} = 0.75 \times 0.5476$$
b12.3 = 0.4107
$$b_{13.2} = \frac{\sigma_1}{\sigma_3} \cdot \frac{r_{13} - r_{12} r_{23}}{1 - r_{23}^2} = \frac{3}{5} \cdot \frac{0.6 - (0.7)(0.4)}{1 - (0.4)^2} = 0.6 \times \frac{0.6 - 0.28}{1 - 0.16} = 0.6 \times \frac{0.32}{0.84} = 0.6 \times 0.3810$$
b13.2 = 0.2286
Hence the required equation is
x1 = b12.3 x2 + b13.2 x3
x1 = 0.4107 x2 + 0.2286 x3.
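The two partial regression co-efficients of this problem can be verified with a short Python sketch. It is an illustration under the formulas used in the solution above; the function name `partial_regression_coefficients` is hypothetical.

```python
def partial_regression_coefficients(s1, s2, s3, r12, r13, r23):
    """Return (b12.3, b13.2) for the regression of x1 on x2 and x3 (deviation form)."""
    denom = 1 - r23 ** 2
    b12_3 = (s1 / s2) * (r12 - r13 * r23) / denom
    b13_2 = (s1 / s3) * (r13 - r12 * r23) / denom
    return b12_3, b13_2

# Standard deviations and simple correlations given in the problem
b12_3, b13_2 = partial_regression_coefficients(
    s1=3, s2=4, s3=5, r12=0.7, r13=0.6, r23=0.4
)
print(round(b12_3, 4), round(b13_2, 4))  # 0.4107 0.2286
```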
An instructor of mathematics wishes to determine the relationship of grades in the final
examination to grades on two quizzes given during the semester. Let x1, x2 and x3 be the
grades of a student in the first quiz, second quiz, and final examination, respectively. The
instructor made the following computations for a total of 120 students :
x̄1 = 6.80 S1 = 1.00 r12 = 0.6
x̄2 = 7.00 S2 = 0.8 r13 = 0.7
x̄3 = 74.00 S3 = 9.00 r23 = 0.65
a. Find the least – squares regression equation of x3 on x1 and x2.
b. Estimate the final grades of two students who scored respectively 9 and 7 and 4 and 8
marks in the two quizzes.
Solution : a) The regression equation of x3 on x2 and x1 can be written as
$$(x_3 - \bar{x}_3) = \frac{r_{23} - r_{13} r_{12}}{1 - r_{12}^2} \cdot \frac{S_3}{S_2}(x_2 - \bar{x}_2) + \frac{r_{13} - r_{23} r_{12}}{1 - r_{12}^2} \cdot \frac{S_3}{S_1}(x_1 - \bar{x}_1)$$
$$x_3 - 74.00 = \frac{0.65 - 0.42}{1 - 0.36}(11.25)(x_2 - 7) + \frac{0.7 - 0.39}{1 - 0.36}(9)(x_1 - 6.8)$$
$$= \frac{0.23}{0.64}(11.25)(x_2 - 7) + \frac{0.31}{0.64}(9)(x_1 - 6.8)$$
= (0.3594) [11.25] [x2 – 7] + [0.4844] [9] [x1 – 6.8]
= 4.04 [x2 – 7] + 4.36 [x1 – 6.8]
[x3 – 74] = 4.04 [x2 – 7] + 4.36 [x1 – 6.8]
x3 = 74 + 4.04x2 – 28.28 + 4.36x1 – 29.648
x3 = 4.36x1 + 4.04x2 + 16.072
b) The final grade of the student who scored 9 and 7 marks is obtained by substituting x1 = 9
and x2 = 7 in the regression equation :
x3 = 4.36 (9) + 4.04 (7) + 16.072
x3 = 39.24 + 28.28 + 16.072
x3 = 83.59 or 84
Similarly, the final grade of the student who scored 4 and 8 marks can be obtained by
substituting x1 = 4, x2 = 8 in the regression equation.
x3 = 4.36 [4] + 4.04 [8] + 16.072
x3 = 17.44 + 32.32 + 16.072
x3 = 65.832 or 66
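The regression of x3 on x1 and x2 can also be built directly from the summary statistics. The Python sketch below is illustrative only; the function name `fit_x3_on_x1_x2` is hypothetical, and because it does not round the co-efficients before substituting, its estimates may differ from the hand calculation in the second decimal place.

```python
def fit_x3_on_x1_x2(means, sds, r12, r13, r23):
    """Return (a, b1, b2) so that x3 = a + b1*x1 + b2*x2, from means, SDs and correlations."""
    m1, m2, m3 = means
    s1, s2, s3 = sds
    denom = 1 - r12 ** 2
    b1 = (r13 - r23 * r12) / denom * s3 / s1   # co-efficient of x1
    b2 = (r23 - r13 * r12) / denom * s3 / s2   # co-efficient of x2
    a = m3 - b1 * m1 - b2 * m2                 # plane passes through the means
    return a, b1, b2

a, b1, b2 = fit_x3_on_x1_x2(
    means=(6.80, 7.00, 74.00), sds=(1.00, 0.8, 9.00),
    r12=0.6, r13=0.7, r23=0.65,
)
grade_first = a + b1 * 9 + b2 * 7    # quizzes 9 and 7
grade_second = a + b1 * 4 + b2 * 8   # quizzes 4 and 8
print(round(b1, 2), round(b2, 2), round(grade_first, 1), round(grade_second, 1))
```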
a = is a constant representing the value of x1 when both x2 and x3 are zero, and
b12.3 and b13.2 are two other constants known as the partial regression co-efficients.
b12.3 is known as the partial regression co-efficient of x1 on x2 keeping x3 constant. It measures
the average change in x1 as a result of a unit change in x2 with no change in x3.
Substituting the needed information from the above computations in the above three
equations, we have
Step : 1 : Multiply equation No.1 by 18.2 and equation No.2 by 6, and then deduct
equation No.2 from equation No.1, as shown below :
Step : 2: Multiply equation No.1 and equation No.3 by 24.1, and 6.0 respectively and
then deduct equation No.3 from equation No.1,
Step 3 : Multiply equation No.4 and equation No.5 by 736.78 and 144.2 respectively and
deduct equation No.5 from equation No.4, as shown below :
- 151 = - 144.20 b12.3 – 736.78 b13.2 - - - - -* 736.78
- 316.6 = - 736.78b12.3 - 135.29b13.2 - - - -* 144.2
- 111253.78 = 106243b12.3 + 542874.24b13.2\
+ 42221.2 = 106243b12.3 + 19598.82b13.2.
(-) (+) (+)
- 156474.90 = - 523365.42b13.2 - - - - -
Therefore – 156474.90 = -523365.42 b13.2.
− 156474.90
Therefore =
− 523,365.42
b13.2 = 0.299
Substituting b13.2 = 0.299 in equation No.4, we get
- 151 = - 144.20 b12.3 - 736.78 b13.2.
- 151 = 144.20 b12.3 – 736.78 (0.299)
+ 144.20b12.3 = - 220.3 + 151
+ 144.20 b12.3 = - 69.30
− 69.30
Therefore =
+ 144.2
(b) The most likely value of x1 against x2 = 3.2 and x2 = 3.0 can be predicted by
substituting all relevant values in equation No.5.
753 = 12a + 643 b12.3 + 106 b13.2 - - - - - (1)
40830 = 643a + 34843 b12.3 + 5779 b13.2 - - - - - (2)
6796 = 106a + 5779 b12.3 + 976 b13.2 - - - - - (3)
Step 1 : Multiply equation No.1 and equation No.2 by 643 and 12 respectively, and
deduct the first result from the second. This eliminates a and gives
4667 b12.3 + 1190 b13.2 = 5781 - - - - - (4)
Step 2 : Multiply equation No.1 and equation No.3 by 106 and 12 respectively, and
deduct the first result from the second. This gives
1190 b12.3 + 476 b13.2 = 1734 - - - - - (5)
Step 3 : Multiply equation No.4 and equation No.5 by 1190 and 4667 respectively, and
deduct the second result from the first.
5553730 b12.3 + 1416100 b13.2 = 6879390
5553730 b12.3 + 2221492 b13.2 = 8092578
(-) (-) (-)
- 805392 b13.2 = - 1213188
$$\text{Therefore } b_{13.2} = \frac{1213188}{805392} = 1.506$$
Placing the value of b13.2 in Equation No. 5.
1190 b12.3 + 476 b13.2 = 1734
1190 b12.3 + 476 [1.506] = 1734
1190 b12.3 = 1734 – 716.856
1190 b12.3 = 1017.144
$$\text{Therefore } b_{12.3} = \frac{1017.144}{1190}$$
b12.3 = 0.855
Similarly, we can find the value of ‘a’ by substituting for b12.3 and b13.2 in equation No.1,
as shown under :
753 = 12a + 643 b12.3 + 106 b13.2
753 = 12a + 643 (0.855) + 106 (1.506)
753 = 12a + 549.765 + 159.636
12a = 753 – 709.401
$$a = \frac{43.599}{12}$$
a = 3.633
Substituting the values of the three constants in the regression equation;
x1c = a + b12.3 x2 + b13.2 x3
x1c = 3.633 + 0.855 x2 + 1.506 x3.
(b) Hence we can predict the weight of a boy, whose height is 56 inches and age is 10
years as follows :
x1c = 3.633 + 0.855 (56) + 1.506 (10)
x1c = 3.633 + 47.88 + 15.06
x1c = 66.573
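The three normal equations can also be solved mechanically instead of by step-wise elimination. The following Python sketch is an illustration only: `solve_linear_system` is a hypothetical helper that performs Gauss-Jordan elimination with exact fractions, using the co-efficients of equations (1) to (3) above. Because it does not round intermediate values, the intercept comes out slightly different from the hand-computed 3.633 (which uses the rounded values 0.855 and 1.506), while the prediction agrees to two decimals.

```python
from fractions import Fraction

def solve_linear_system(matrix, rhs):
    """Solve a square linear system by Gauss-Jordan elimination with exact fractions."""
    n = len(rhs)
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(matrix, rhs)]
    for col in range(n):
        # Find a row with a non-zero pivot and move it into place
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Eliminate this column from every other row
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [float(aug[i][n] / aug[i][i]) for i in range(n)]

# Normal equations (1), (2), (3): unknowns are a, b12.3, b13.2
coefficients = [
    [12, 643, 106],
    [643, 34843, 5779],
    [106, 5779, 976],
]
totals = [753, 40830, 6796]
a, b12_3, b13_2 = solve_linear_system(coefficients, totals)

# Predicted weight for height 56 inches and age 10 years
predicted = a + b12_3 * 56 + b13_2 * 10
print(round(a, 3), round(b12_3, 3), round(b13_2, 3), round(predicted, 2))
```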
8.8 SUMMARY
The principal advantage of multiple regression is that it allows us to use more of the
information available to us to estimate the dependent variable. Sometimes the correlation
between two variables may be insufficient to determine a reliable estimating equation.
Multiple regression will also enable us to fit curves as well as lines. The three main objectives
of multiple regression and correlation analysis are (a) To derive an equation which provides
estimates of the dependent variable from values of the two or more independent variables.
(b) To obtain a measure of the error involved in using this regression equation as a basis for
estimation. (c) To obtain a measure of the proportion of variance in the dependent variable
accounted for by the independent variables.
8.9 KEYWORDS
Multiple Regression Analysis : Represents a logical extension of two-variable regression
analysis. Instead of a single independent variable, two or more independent variables are
used to estimate the values of a dependent variable.
Multiple Regression Equation : The multiple regression equation describes the average
relationship between the variables and this relationship is used to predict or control the
dependent variable.
Partial Correlation Coefficient : Provides a measure of the relationship between the
dependent variable and one independent variable when the other independent variables
are held constant.
4) In a trivariate distribution :
σ1 = 3    r23 = 0.4
σ2 = 4    r13 = 0.6
σ3 = 5    r12 = 0.7
Determine the regression equation of x1 on x2 and x3, if the variables are measured from
their means.
(Ans : x1 = 0.410x2 + 0.229x3)
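The answer to this exercise can be checked with the standard formulas that express the partial regression coefficients in terms of the standard deviations and the simple correlation coefficients. The Python sketch below is illustrative only; the function name and argument names are assumptions.

```python
def partial_regression_coefficients(s1, s2, s3, r12, r13, r23):
    """Return (b12.3, b13.2) for the regression of x1 on x2 and x3.

    s1, s2, s3 are standard deviations; r12, r13, r23 are simple correlations.
    """
    denom = 1 - r23 ** 2
    b12_3 = (s1 / s2) * (r12 - r13 * r23) / denom
    b13_2 = (s1 / s3) * (r13 - r12 * r23) / denom
    return b12_3, b13_2

b12_3, b13_2 = partial_regression_coefficients(3, 4, 5, r12=0.7, r13=0.6, r23=0.4)
print(f"x1 = {b12_3:.3f} x2 + {b13_2:.3f} x3")
```

The computed coefficients agree with the stated answer up to rounding in the third decimal place.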
BLOCK -3
PROBABILITY
STRUCTURE
9.0 Objectives
9.1 Introduction
9.2 Definitions
9.3 Three approaches to Probability
9.4 Set of Mutually Exclusive events
9.5 Probability Axioms
9.6 Theorems of Probability
9.7 Marginal Probability
9.8 Joint Probability
9.9 Conditional Probabilities
9.10 Bayes’ theorem
9.11 Problems solved
9.12 Summary
9.13 Key Words
9.14 Self Assessment Questions
9.15 References
9.0 OBJECTIVES
After studying this unit you should be able to :
* Explain the basic concepts of probability;
* Describe probability axioms;
* Discuss theorems of probability;
* Define marginal probability;
* Analyze joint probability;
* Describe conditional probability; and
* Explain Bayes’ theorem.
9.1 INTRODUCTION
Probability is the chance that something will or will not happen.
In statistics it is denoted by the capital letter P and is measured on an inclusive numerical
scale of 0 to 1. If we are using percentages, then the scale is from 0% to 100%. If the
probability is 0% then there is absolutely no chance that an outcome will occur; if it is
100% the outcome is certain.
The opposite of a probabilistic situation is a deterministic one, where the outcome is certain
on the assumption that the input data is reliable. With probability something happens or it
does not happen, that is, the situation is binomial: there are only two possible outcomes.
However, that does not mean that there is a 50/50 chance of being right or wrong, or a 50/50
chance of winning. If you toss a fair coin, one that has not been “fixed”, you have
a 50% chance of obtaining heads and a 50% chance of throwing tails. If you buy one ticket in
a fund raising raffle then you will either win or lose, but the two outcomes are not equally
likely.
Conditional probabilities are contingent on a previous result. For example, suppose a bag
contains three marbles - red, blue and green - and you draw them one at a time without
replacement. Each marble has an equal chance of being drawn. What is the conditional
probability of drawing the red marble after already drawing the blue one? First, the
probability of drawing the blue marble is about 33% because it is one possible outcome out
of three. Assuming this first event occurs, there will be two marbles remaining, with each
having a 50% chance of being drawn. So the conditional probability of drawing the red marble,
given that the blue one has already been drawn, is 50%, and the chance of drawing the blue
marble and then the red marble is about 16.7% (1/3 x 1/2).
Definition of ‘Conditional Probability’
Probability of an event or outcome based on the occurrence of a previous event or
outcome. The probability that both events occur is calculated by multiplying the probability
of the preceding event by the conditional (updated) probability of the succeeding event.
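The marble example can be checked by listing every ordered draw of two marbles. The following Python sketch (variable names are illustrative assumptions) keeps the conditional probability of red given blue separate from the joint probability of drawing blue and then red.

```python
from fractions import Fraction
from itertools import permutations

marbles = ["red", "blue", "green"]
# Every ordered pair of distinct marbles is an equally likely (first, second) draw.
draws = list(permutations(marbles, 2))

blue_first = [d for d in draws if d[0] == "blue"]
blue_then_red = [d for d in blue_first if d[1] == "red"]

p_blue_first = Fraction(len(blue_first), len(draws))
p_red_given_blue = Fraction(len(blue_then_red), len(blue_first))
p_blue_then_red = Fraction(len(blue_then_red), len(draws))

print(p_blue_first, p_red_given_blue, p_blue_then_red)
```

The joint probability equals the product of the other two values, which is the multiplication rule used throughout this unit.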
Probability under Conditions of Statistical Independence:
When a statistically independent event occurs, it does not have any effect on the happening
of another event. There are three types of probabilities under statistical independence: 1.
Marginal, 2. Joint and 3. Conditional.
pack, the cases favourable to getting a spade are 13 (as there are 13 spade cards in the pack).
Mutually Exclusive Events: Two or more events are said to be mutually exclusive if the
happening of any one of them excludes the happening of all others in a single (i.e. same)
experiment. Thus in the throw of a single die the events 5 and 6 are mutually exclusive
because if the event 5 happens no other event is possible in the same experiment. Here one
and only one of the events can take place at a time excluding all others.
Equally Likely Events: Two or more events are said to be equally likely if the chance of
their happening is equal i.e., there is no preference of any one event over the other. Thus in a
throw of an unbiased die, the coming up of 1, 2, 3, 4, 5 or 6 is equally likely. In the throw of
an unbiased coin the coming up of head or tail is equally likely.
Independent and Dependent Events: An event is said to be independent if its happening is
not affected by the happening of other events and if it does not affect the happening of other
events. Thus in the repeated throw of a die, coming up of 5 on the first throw is independent
of coming up of 5 again in the second throw.
However if we are successively drawing cards from a pack (without replacement) the
events would be dependent. The chance of getting a King on the first draw is 4/52 (as there
are 4 Kings in a pack). If this card is not replaced before the second draw, the chance of
getting a King again is 3/51 as there are now only 51 cards left and they contain only 3
Kings.
If however the card is replaced after the first draw i.e. before the second draw the events
would remain independent. In each of the two successive draws the chance of getting a King
would be 4/52.
(i) The number of permutations of n dissimilar things taken all at a time is n!. Thus if
there are 3 letters A, B and C, the total number of ways in which they can be arranged is ABC,
ACB, BAC, BCA, CAB and CBA i.e. 3! = 3x 2x 1 = 6.
Factorial n (written as n!) is equal to the continued product of the first n natural numbers
starting from 1, i.e.
n! = 1 x 2 x 3 x .......... x (n – 1) x n
= n (n – 1) (n – 2) .......... 3 . 2 . 1
= n (n – 1)! = n (n – 1) (n – 2)!
(ii) The number of permutations of n dissimilar things taken r at a time is
nPr = n! / (n – r)!
Thus if we are to make arrangements of any two letters out of three letters A, B, C, then
the different arrangements will be AB, BA, AC, CA, BC, CB i.e. 6 arrangements, which
in factorial notation can be represented as
3P2 = 3! / (3 – 2)! = 3! / 1! = 3 x 2 x 1 = 6
(iii) The number of permutations of n things when n1 of them are of one kind and n2 of
another kind is n! / (n1! n2!). Thus if we have to find out the permutations of the letters of the
word FARIDABAD (where A occurs 3 times and D occurs 2 times) the answer would
be 9! / (3! 2!) = (9 x 8 x 7 x 6 x 5 x 4 x 3 x 2 x 1) / (3 x 2 x 1 x 2 x 1) or 30,240.
(iv) The fundamental rule of counting is that if an operation can be performed in ‘m’
ways and having been performed in any one of these ways a second operation can be
performed in ‘n’ ways, the total number of ways of performing the two operations
together is m x n.
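The counting rules (i) to (iii) can be verified by brute-force enumeration for small cases. The sketch below uses only the Python standard library; the helper name distinct_arrangements is an assumption introduced for illustration.

```python
import math
from collections import Counter
from itertools import permutations

# (i) n dissimilar things taken all at a time: n! arrangements.
arrangements = ["".join(p) for p in permutations("ABC")]

# (ii) n dissimilar things taken r at a time: n! / (n - r)! arrangements.
two_letter = ["".join(p) for p in permutations("ABC", 2)]

def distinct_arrangements(word):
    """(iii) n! divided by the factorial of each letter's repeat count."""
    total = math.factorial(len(word))
    for count in Counter(word).values():
        total //= math.factorial(count)
    return total

print(len(arrangements), len(two_letter), distinct_arrangements("FARIDABAD"))
```

Enumeration is practical only for short words; for longer ones the factorial formula should be used directly, as distinct_arrangements does.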
9.3 THREE APPROACHES TO PROBABILITY
Subjective probability
One type of probability is subjective probability, which is qualitative, sometimes
emotional, and simply based on the belief or the “gut” feeling of the person making the
judgment.
Subjective probability may be a function of a person’s experience with a situation. For
example, Salesperson A says that he is 80% certain of making a sale with a certain
client, as he knows the client well. However, Salesperson B may give only a 50% probability
level of making that sale. Both are basing their arguments on subjective probability.
Relative frequency probability
A probability based on information or data collected from situations that have occurred
previously is relative frequency probability.
Relative frequency probabilities have use in many business situations. For example,
data taken from a certain country indicate that in a sample of 3,000 married couples under
study, one-third were divorced within 10 years of marriage. Again, on the assumption that
future conditions will be similar to past conditions, we can say that in this country, the
probability of being divorced within 10 years of marriage is 1/3 or 33.33%. This
demographic information can then be extended to estimate needs of such things as legal
services, new homes, and childcare.
Classical probability
A probability measure that is also the basis for gambling or betting, and thus useful if
you frequent casinos, is classical probability. Classical probability is also known as simple
probability or marginal probability and is defined by the following ratio:
P(event) = Number of outcomes favourable to the event / Total number of possible outcomes
In order for this expression to be valid, the outcomes counted in the ratio must be
equally likely.
P(A happens) + P(A does not happen) = 1
The sum of the probabilities of all mutually exclusive and collectively exhaustive events
is always equal to 1. That is,
P(A) + P(B) + P(C) = 1
if A, B and C are mutually exclusive and collectively exhaustive events.
Example 1 :
P(you pass) = 0.9
P(you fail) = 1 – 0.9 = 0.1
2. The probability of an entire sample space is 1. Thus, if S represents an entire sample
space, then
P (S) = 1
3. The probability that one or the other of two mutually exclusive events will occur
is equal to the sum of the individual probabilities of these events. Thus,
P(A or B) = P(A) + P(B)
when A and B are mutually exclusive events.
4. The probability that an event does not occur is equal to 1 minus the probability that
the event occurs. Thus
P(Ā) = 1 – P(A)
where Ā is the non-occurrence of event A.
Example 1: Suppose we have a box with 3 red, 2 black and 5 white balls. Each time a ball is
drawn, it is returned to the box. What is the probability of drawing:
a) Either a red or a black ball?
b) Either a white or a black ball?
Solution: The probabilities of drawing the specific colour ball are
P (red) = 0.3 P (black) = 0.2 P (white) = 0.5
Applying rule 2, we find
P (red) + P (black) + P (white) = 0.3 + 0.2 + 0.5 = 1
As we want to know the probability of drawing either a red or a black ball, the
answer will be P (red) + P (black) = 0.3 + 0.2 = 0.5. Likewise, the probability of
getting either a white ball or a black ball will be
P (white) + P (black) = 0.5 + 0.2 = 0.7
Addition theorem
If A and B are any two events then the probability that at least one of them occurs is
denoted by P(A ∪ B) and is given by
P(A ∪ B) = P(A) + P(B) – P(A ∩ B)
Mutually exclusive events have no sample point common to them, therefore if A and B
are two mutually exclusive events then A ∩ B = ∅, i.e. the intersection of two mutually
exclusive events is a null set, and in this case P(A ∩ B) = 0.
In case of mutually exclusive events
P(A ∪ B) = P(A) + P(B)
If there are three events A, B and C, the probability of the occurrence of at least
one of them is given by
P(A ∪ B ∪ C) = P(A) + P(B) + P(C)
– P(A ∩ B) – P(B ∩ C) – P(A ∩ C)
+ P(A ∩ B ∩ C)
If the events are mutually exclusive then
P(A ∪ B ∪ C) = P(A) + P(B) + P(C)
In case of a finite number, say n, of mutually exclusive events
P(A1 ∪ A2 ∪ A3 ∪ ... ∪ An) = P(A1) + P(A2) + ... + P(An)
Note : (i) If a number of events A1, A2, ..., An are mutually exclusive and
exhaustive then the sum of the individual probabilities of their happenings is equal
to 1 i.e.
P(A1) + P(A2) + ... + P(An) = 1
(ii) If the events are finite and mutually exclusive then the probability of the
occurrence of at least one of them is equal to the sum of their individual
probabilities.
(iii) The event A and its complement Ā can be considered as mutually exclusive
and exhaustive.
P(A) + P(Ā) = 1, so that P(A) = 1 – P(Ā)
Solution
Since the events are mutually exclusive and exhaustive,
P(A) + P(B) + P(C) = 1
Let (1/3) P(C) = (1/2) P(A) = P(B) = k
Then P(C) = 3k, P(A) = 2k, P(B) = k
k + 2k + 3k = 1, so k = 1/6
P(B) = 1/6, P(A) = 1/3, P(C) = 1/2 Ans.
Example 2
One ticket is drawn at random from a bag containing 30 tickets numbered
from 1 to 30. Find the probability that
(a) It is a multiple of 5 or 7
(b) It is a multiple of 3 or 5
Solution
One ticket can be drawn out of 30 in 30C1 = 30 ways. This is the total
number of ways in which the event can take place, or it is the exhaustive number
of cases.
(a) Multiples of 5 are 5, 10, 15, 20, 25, 30.
Multiples of 7 are 7, 14, 21, 28.
Thus there are 6 multiples of 5 and 4 multiples of 7. None of these are
common. So the events are mutually exclusive. The probability of having a
multiple of 5 or 7 would be
6/30 + 4/30 = 10/30 = 1/3 Ans.
(b) Multiples of 3 are 3, 6, 9, 12, 15, 18, 21, 24, 27, 30.
Multiples of 5 are 5, 10, 15, 20, 25, 30.
Thus there are 10 multiples of 3 and 6 multiples of 5, of which 15 and 30 are
common. So the events are not mutually exclusive. The probability of having a
multiple of 3 or 5 would be
10/30 + 6/30 – 2/30 = 16/30 – 2/30 = 14/30 = 7/15 Ans.
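Both parts of the ticket example can be confirmed by direct counting. In the Python sketch below the function name is an illustrative assumption; counting each favourable ticket once handles the overlap automatically, which is what the addition theorem does by subtracting the common cases.

```python
from fractions import Fraction

tickets = range(1, 31)   # tickets numbered 1 to 30

def prob_multiple_of_either(a, b):
    """Probability that a randomly drawn ticket is a multiple of a or of b."""
    favourable = [t for t in tickets if t % a == 0 or t % b == 0]
    return Fraction(len(favourable), len(tickets))

p_5_or_7 = prob_multiple_of_either(5, 7)
p_3_or_5 = prob_multiple_of_either(3, 5)
print(p_5_or_7, p_3_or_5)
```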
Example 3
The probability that A will live up to 60 years is 3/4 and the probability that B will
live up to 60 years is 2/3. What is the probability (i) that both A and B will live up to
sixty years (ii) that both die before reaching 60 years.
Solution
The events in question are independent of each other and the rule of multiplication
would be applied.
(i) The probability that both A and B live up to 60 years or P(A and B) = P(A) x P(B)
= 3/4 x 2/3 = 6/12 = 1/2
(ii) Since the probabilities of the survival of A and B up to 60 years are 3/4 and 2/3
respectively, the probabilities of their death would be 1 – 3/4 = 1/4
and 1 – 2/3 = 1/3 respectively.
Now the probability that both A and B would die before reaching 60 years
would be
1/4 x 1/3 = 1/12 Ans.
Example 4
A bag contains 4 white and 6 red balls. Two draws of one ball each are
made without replacement. What is the probability that (i) one is red and the other
white (ii) both the balls are red.
Solution
(i) Probability of drawing a red ball in the first draw or P(A) = 6/10
Probability of drawing a white ball in the second draw given that the first draw has
given a red ball or P(B/A) = 4/9 (since only 9 balls are left in the bag and four
white balls are still there)
Probability of the combined event
P(AB) or P(A ∩ B) = P(A) x P(B/A) = 6/10 x 4/9 = 24/90
But it could also happen that in the first draw a white ball was drawn; then
Probability of drawing a white ball in the first draw or P(A) = 4/10 and
Probability of drawing a red ball in the second draw given that the first draw gave
a white ball or P(B/A) = 6/9, so that P(A) x P(B/A) = 4/10 x 6/9 = 24/90
Now any one of the two situations (when we draw a red ball first or we
draw a white ball first) would satisfy the conditions of the problem. These two
events are mutually exclusive. So the probability that any one of the two happens is
the sum of the two probabilities.
24/90 + 24/90 = 48/90 = 8/15 Ans.
Here we have applied both the rule of multiplication and the rule of
addition of probability.
However such problems can be very easily solved with rules of permutations and
combinations.
Thus:
(i) Two balls can be drawn out of 10 balls in 10C2 or 10! / (2! 8!) or (10 x 9) / 2
or 45 ways.
(ii) One white ball can be drawn out of 4 white balls in 4C1 or 4! / (1! 3!) or 4 ways.
(iii) One red ball can be drawn out of 6 red balls in 6C1 or 6 ways.
(iv) The total number of ways of drawing a white and a red ball is 4C1 x 6C1 or 4 x 6
= 24
(v) The required probability would be
No. of cases favourable to the event / Total No. of ways in which the event can happen
= 24/45 = 8/15
For part (ii) of the question (both balls red),
Required Probability = 6C2 / 10C2 = 15/45 = 1/3
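The combination counts used in this solution can be reproduced with the standard library function math.comb (available from Python 3.8). The constant names below are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

WHITE, RED = 4, 6
total_pairs = comb(WHITE + RED, 2)   # 10C2: ways to pick any two balls

# One white and one red ball: 4C1 x 6C1 favourable pairs.
p_one_each = Fraction(comb(WHITE, 1) * comb(RED, 1), total_pairs)
# Both balls red: 6C2 favourable pairs.
p_both_red = Fraction(comb(RED, 2), total_pairs)

print(total_pairs, p_one_each, p_both_red)
```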
Example 5
(a) A committee of 4 persons is to be appointed from 3 officers of the
production department, 3 officers of the sales department and 2 officers of
the purchase department and 1 cost accountant. Find the probability of
forming a committee in the following manner.
(i) There must be one from each category
(ii) It should have at least one from the purchase department
(b) If P(A) = 0.4, P(B) = 0.7 and P (at least one of A and B) = 0.8, find P (only
one of A and B).
Solution:
(a) (i) Required Probability = (3C1 x 3C1 x 2C1 x 1) / 9C4
= 18 / [(9 x 8 x 7 x 6) / (4 x 3 x 2 x 1)] = 18/126 = 1/7
(ii) Required Probability
= 1 – Probability that nobody is taken from the purchase department
= 1 – 7C4 / 9C4 = 1 – [7! / (4! 3!)] / [9! / (4! 5!)]
= 1 – (7! x 5!) / (3! x 9!) = 1 – 20/72 = 52/72 = 13/18
Second method
Required Probability
= Prob. of taking one from Purchase deptt. and three others + Prob. of taking 2
from Purchase deptt. and two others
= (2C1 x 7C3) / 9C4 + (2C2 x 7C2) / 9C4 = 70/126 + 21/126 = 91/126 = 13/18
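As a cross-check on part (a), every possible committee of four can be listed and counted. The sketch below is a hedged illustration: the category labels and the helper probability are assumptions, and each of the nine persons is treated as distinct.

```python
from fractions import Fraction
from itertools import combinations

# One entry per person, labelled by department (labels are illustrative).
people = ["production"] * 3 + ["sales"] * 3 + ["purchase"] * 2 + ["cost"]
committees = list(combinations(range(len(people)), 4))   # 9C4 committees

def probability(condition):
    """Share of committees whose list of categories satisfies the condition."""
    hits = sum(1 for c in committees if condition([people[i] for i in c]))
    return Fraction(hits, len(committees))

p_one_each = probability(lambda cats: len(set(cats)) == 4)
p_purchase = probability(lambda cats: "purchase" in cats)
print(len(committees), p_one_each, p_purchase)
```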
Example 6
A husband and a wife appear in an interview for two vacancies in the same post.
The probability of husband’s selection is 1/7 and that of wife’s selection is 1/5. What is
the probability that:
(a) Both of them will be selected;
(b) Only one of them will be selected;
(c) None of them will be selected
Solution:
Let A and B denote the events of husband’s and wife’s selection respectively.
Let P(A) = p1; P(B) = p2
p1 = 1/7, p2 = 1/5
(a) Probability that husband and wife both are selected = P(A) P(B)
= 1/7 x 1/5 = 1/35 Ans. [since the events are independent]
(b) Probability that only one of them is selected
= P(A) P(B̄) + P(Ā) P(B)
= 1/7 x 4/5 + 6/7 x 1/5 = 10/35 = 2/7 Ans.
(c) Probability that none of them is selected = P(Ā) P(B̄)
= 6/7 x 4/5 = 24/35 Ans.
Additional Examples :
Example 7: Assume that a card is randomly selected from a deck of 52 playing cards.
Find the probability in each of the following cases :
a. Card drawn is a king
b. Either a heart or the queen of spades
c. Card drawn is a “diamond”.
Solution :
a. In a pack of playing cards, there are four kings. Hence, the probability of getting a king is
4/52 or 1/13.
b. There are 13 cards of “heart” and the queen of spades is 1. Hence, the required
probability is (13 + 1)/52 or 7/26.
c. Here, the probability of drawing a card with “diamond” is 13/52 or ¼.
Example 10: Suppose a fair die has its even numbered faces painted red, and the odd
number faces painted white. Consider the experiment of rolling the die and the events.
A = (2 or 3 shows up)
B = (A red face shows up)
Find the following probabilities
(a) P(A) (b) P(B) (c) P(AB) (d) P(A/B) (e) P(A or B)
Solution : A fair die has six faces numbered 1 to 6.
As such each number has probability 1/6 of occurrence.
a. Hence P(A), being 2 or 3 showing up, is 1/6 + 1/6 = 2/6, i.e., 1/3
b. The total number of faces painted red is 3 as the die has 6 numbers.
Hence, P(B), i.e., where a red face shows up, is 3/6 = 1/2
c. P(AB) is the joint probability of A and B. Only the face 2 belongs to both events,
so P(AB) = 1/6. As A and B are independent here, the same value is given by
P(AB) = P(A) x P(B)
= 1/3 x 1/2 = 1/6
d. P(A/B) = P(AB) / P(B) = (1/6) / (1/2) or 1/6 x 2/1 or 1/3
e. P(A or B) = P(A) + P(B) – P(A and B)
= 1/3 + 1/2 – 1/6
= (2 + 3 – 1)/6 = 4/6 = 2/3
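The five probabilities of this example can be obtained by treating the events as sets of faces. The Python sketch below is illustrative; the helper p and the variable names are assumptions.

```python
from fractions import Fraction

faces = {1, 2, 3, 4, 5, 6}
A = {2, 3}                               # 2 or 3 shows up
B = {f for f in faces if f % 2 == 0}     # a red (even-numbered) face shows up

def p(event):
    """Classical probability: favourable faces over total faces."""
    return Fraction(len(event), len(faces))

p_a, p_b = p(A), p(B)
p_ab = p(A & B)              # intersection: only the face 2 is in both events
p_a_given_b = p_ab / p_b     # conditional probability P(A/B)
p_a_or_b = p(A | B)          # union: at least one of A and B occurs

print(p_a, p_b, p_ab, p_a_given_b, p_a_or_b)
```

Here P(AB) happens to equal P(A) x P(B), so A and B are independent; for events that are not independent the intersection must be counted directly, as the sketch does.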
Example 11: A, B and C are bidding for a contract. It is believed that A has exactly half the
chance that B has; B, in turn, is 4/5th as likely as C to gain the contract. What is
the probability for each to win the contract ?
Solution : Assume that the probability of C gaining the contract is x.
Then,
Probability of B to win is 4/5 of x = 4x/5
Probability of A to win is ½ of 4x/5 = 4x/10
Now 4x/10 + 4x/5 + x = 1 (Since the total of three probabilities should be 1).
Or (20x + 40x + 50x)/50 = 1
Or 110x = 50
x = 50/110 or 5/11
Hence, the probabilities to win for C, B and A are
C (x) = 5/11
B (4x/5) = (4 x 5/11)/5
= (20/11)/5 = 4/11
A (4x/10) = (4 x 5/11)/10
= (20/11)/10 = 2/11
Example 12: Three salesmen, A, B and C have been given a target of selling 10,000 units of
a particular product, the probabilities of their achieving their targets being respectively 0.25,
0.30 and 0.50. If these three salesmen try to sell the product, find the probability of success
of only one salesman and failure of the other two.
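Solution: The probabilities of success and failure are
            A      B      C
Success    0.25   0.30   0.50
Failure    0.75   0.70   0.50
Taking the three salesmen to succeed or fail independently of one another,
P(only A succeeds) = 0.25 x 0.70 x 0.50 = 0.0875
P(only B succeeds) = 0.75 x 0.30 x 0.50 = 0.1125
P(only C succeeds) = 0.75 x 0.70 x 0.50 = 0.2625
Hence, the required probability that one of them succeeds and the other two do not succeed
is
P = 0.0875 + 0.1125 + 0.2625
= 0.4625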
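The same calculation can be written so that it works for any number of salesmen. The Python sketch below assumes, as the solution does, that the salesmen succeed or fail independently; the function name prob_exactly_k is an illustrative assumption.

```python
from itertools import product

success = {"A": 0.25, "B": 0.30, "C": 0.50}

def prob_exactly_k(k):
    """Probability that exactly k salesmen succeed, assuming independence."""
    total = 0.0
    # Each outcome is a tuple of True/False flags, one per salesman.
    for outcome in product([True, False], repeat=len(success)):
        if sum(outcome) != k:
            continue
        p = 1.0
        for won, p_success in zip(outcome, success.values()):
            p *= p_success if won else 1 - p_success
        total += p
    return total

print(round(prob_exactly_k(1), 4))
```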
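Example 13: A sub-committee of 6 members is to be formed out of a group consisting of 7
men and 4 women. Calculate the probability that the sub-committee will consist of
(i) Exactly 2 women; and (ii) at least 2 women.
Solution:
(i) Out of 11 persons (7 men and 4 women) a sub-committee of 6 persons can be formed
in
11C6 = 11! / [(11 – 6)! 6!] = (11 x 10 x 9 x 8 x 7) / (5 x 4 x 3 x 2 x 1) = 462 ways
A sub-committee with exactly 2 women (and 4 men) can be formed in
7C4 x 4C2 = 35 x 6 = 210 ways
so the required probability is 210/462 = 5/11.
(ii) A sub-committee with exactly 3 women (and 3 men) can be formed in
7C3 x 4C3 = 7! / [(7 – 3)! 3!] x 4! / [(4 – 3)! 3!] = 35 x 4 = 140 ways
and one with exactly 4 women (and 2 men) in
7C2 x 4C4 = 7! / [(7 – 2)! 2!] x 4! / 4! = 21 x 1 = 21 ways
The probability of at least 2 women is therefore
210/462 + 140/462 + 21/462 = 5/11 + 10/33 + 1/22 = (30 + 20 + 3)/66 = 53/66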
Example 14: Two computers A and B are to be marketed. A salesman who is assigned a job
of finding customers for them has 60 percent and 40 percent chances respectively of
succeeding in case of computers A and B. The computers can be sold independently. Given
that he was able to sell at least one computer, what is the probability that the computer A has
been sold?
Solution
Let A be the event that the salesman is able to sell computer A.
Let B be the event that the salesman is able to sell computer B.
Given P(A) = 0.60 and P(B) = 0.40 and that the two events A and B are independent,
P(AB) = P(A) . P(B)
= 0.60 x 0.40 = 0.24
Now, probability of selling at least one computer is given by
P(A or B) = P(A) + P(B) – P(AB)
= 0.60 + 0.40 – 0.24 = 0.76
We have to find P(A) given (A or B). Since A can occur only when (A or B) occurs,
P[A/(A or B)] = P(A) / P(A or B)
= 0.60 / 0.76
= 0.7895
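A short numerical check of this result is sketched below in Python. The variable names are illustrative assumptions; the key step is that the event "A is sold" is contained in the event "at least one computer is sold", so their joint probability is simply P(A).

```python
p_a, p_b = 0.60, 0.40

p_both = p_a * p_b                      # independence: P(AB) = P(A) x P(B)
p_at_least_one = p_a + p_b - p_both     # addition theorem

# A sold implies at least one sold, so P(A and (A or B)) equals P(A).
p_a_given_at_least_one = p_a / p_at_least_one

print(round(p_a_given_at_least_one, 4))
```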
Example 15 : A manufacturing firm receives shipments of machine parts from two
suppliers A and B. Currently, 65 percent of parts are purchased from supplier A and the
remaining from supplier B. The past record shows that 2 percent of the parts supplied by A
are found defective, whereas 5 percent of the parts supplied by B are found defective. On a
particular day the machine breaks down because a defective part is fitted to it.
9.7 MARGINAL PROBABILITY
Marginal probability is the simple probability of the occurrence of an event. Right in the
beginning, we have given such examples pertaining to the tossing of a coin. When a coin is
tossed, the probability of getting a head is 0.5, so also in the case of getting a tail. These are
known as marginal probabilities as a toss of a fair coin is a statistically independent event.
H1 H2 H3                  H1 T2 H3
H1 T2 T3 (two tails)      T1 H2 H3
T1 T2 H3 (two tails)      T1 T2 T3 (three tails)
H1 H2 T3                  T1 H2 T3 (two tails)
Out of the total eight outcomes, we find that at least two tails occur four times. As
the probability of each outcome of three successive tosses is 0.5 x 0.5 x 0.5 = 0.125, the
probability of getting at least two tails is
P(H1 T2 T3) + P(T1 T2 H3) + P(T1 H2 T3) + P(T1 T2 T3) = 0.125 + 0.125 + 0.125 + 0.125 = 0.5
We will get the same answer in the case of the joint probability of at least two heads
in three successive tosses.
Example: What is the probability of getting three heads or three tails on three successive
tosses?
Solution: P(H1 H2 H3 or T1 T2 T3) = P(H1 H2 H3) + P(T1 T2 T3)
= 0.125 + 0.125 = 0.25
Since there can be only eight outcomes, of which only one can be three successive
heads and one can be three successive tails, each outcome has a joint probability of 0.125 as
the probabilities of the eight outcomes must add up to 1.
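The eight outcomes of three successive tosses can be generated mechanically, which makes joint probabilities of this kind easy to verify. The Python sketch below is illustrative; the helper p is an assumption.

```python
from fractions import Fraction
from itertools import product

# All sequences of three tosses of a fair coin: 8 equally likely outcomes.
outcomes = list(product("HT", repeat=3))

def p(condition):
    """Share of the equally likely outcomes that satisfy the condition."""
    return Fraction(sum(1 for o in outcomes if condition(o)), len(outcomes))

p_two_tails = p(lambda o: o.count("T") >= 2)   # at least two tails
p_all_same = p(lambda o: len(set(o)) == 1)     # three heads or three tails

print(len(outcomes), p_two_tails, p_all_same)
```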
Fig 1 : Example of Probability Tree
Fig 2 : Probability Tree of a Partial Second Toss
Assuming that the first toss is a head, then in the second toss the outcome could be
either a head or a tail. This is shown in Fig 2.
Now, we assume that the outcome of first toss is tail. In this situation, the second
toss must originate from tail. This provides two more branches to the tree as shown in
Fig 3.
We may further extend the tree to depict the outcomes of the third toss. We repeat
the same process, as a result we get what is depicted in Fig 4
It may be noted that when we toss once, we have two possible outcomes, when we
toss a coin twice, we have four possible outcomes and when we toss it thrice, then we have
eight possible outcomes.
9.9 CONDITIONAL PROBABILITIES
The discussion so far was confined to two types of probabilities, marginal or
unconditional probability and joint probability. Under statistical independence, only one
type of probability remains to be discussed. This is known as the conditional probability.
Symbolically, conditional probability is written as P(A/B) which means that probability
of event A, given that event B has occurred.
This appears to be contradictory. It may be recalled that independent events are those
events whose probabilities are not affected by the occurrence of each other. This means that
P(A/B) = P(A). Let us take an example to explain this.
Example: Suppose we are asked : what is the probability that the second toss of a fair coin
will result in tail, given that tail resulted on the first toss?
Solution : This can be written as P(T2/T1). It should be noted that the outcome of the first
event has no influence whatsoever on the outcome of the second event since the two events
are independent. The probability of a tail on the second toss is 0.5. Thus, we can write
P(T2/T1) = 0.5.
We may now summarize the three type of probabilities under statistical independence
as follows:
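Table 1 Colour and Pattern of Ten Balls
Event   Probability of Event   Colour and Pattern
1       0.1                    Red and dotted
2       0.1                    Red and dotted
3       0.1                    Green and dotted
4       0.1                    Red and striped
5       0.1                    Red and striped
6       0.1                    Red and striped
7       0.1                    Red and striped
8       0.1                    Green and striped
9       0.1                    Green and striped
10      0.1                    Green and striped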
Example : Suppose we draw a ball from the urn and find it is red, what is the probability that
it is striped?
Solution : Since our problem relates to red balls, we ignore the green balls completely. In
all, there are six red balls of which two are dotted and four are striped. Our problem now
boils down to finding the simple probabilities of dotted and striped balls. These are shown as
follows :
P(S/R) = 4/6 = 2/3
P(D/R) = 2/6 = 1/3
Total = 1.0
It will be seen that each category of red ball has been divided by the total number of
red balls. Since our problem is regarding the striped red balls, the answer is 2/3. This can be
shown symbolically.
P(S/R) = P(SR) / P(R)
In general, for two events A and B,
P(A/B) = P(AB) / P(B)
This is the formula for calculating conditional probability under statistical
dependence.
Example : What is the probability of getting a dotted ball given that it is green?
Solution : We know that the total probability of green balls is 0.4 because there
are four green balls out of ten balls. To find the probability of the ball being
dotted given that it is green, we have to divide the probability of green and dotted
by the probability of green. Thus,
P(D/G) = P(DG) / P(G) = 0.1 / 0.4 = 1/4
It may be noted again that the two probabilities P(D/G) = ¼ and P(S/G) = ¾ taken
together add to 1.
Joint Probabilities:
The formula that we used to determine conditional probability under statistical
dependence is
P(A/B) = P(AB) / P(B)
We know that it contains one term P(AB) which, in fact, denotes joint
probability. We may now rewrite this formula to determine joint probability. This can
be easily done by cross multiplication.
Thus,
P(AB) = P(A/B) x P(B)
This can be expressed as: the joint probability of events A and B is equal to the
probability of event A, given that event B has already occurred, multiplied by the
probability of event B.
Example: We now use this formula in our previous examples of green and red balls.
Suppose we have to find the probability of red and striped ball.
Solution: P(SR) = P(S/R) x P(R) = 2/3 x 6/10 = 0.4
Similarly, we can calculate the joint probability of other events as well.
P(DR) = P(D/R) x P(R) = 1/3 x 6/10 = 0.2
This shows that the joint probability of dotted and red balls is equal to the
product of the probability of dotted balls, given a red ball, and the probability of red
balls. This comes to 0.2.
P(DG) = P(D/G) x P(G) = ¼ x 4/10 = 0.1
This is the joint probability of dotted and green ball.
P(SG) = P(S/G) x P(G) = ¾ x 4/10 = 0.3
This is the joint probability of striped and green ball.
Marginal Probabilities:
The marginal probability of the event green ball can be determined by
adding the probabilities of the joint events in which green ball is contained.
Symbolically, P(G) = P(GD) + P(GS) = 0.1 + 0.3 = 0.4
In the same manner, we can determine the marginal probability of the event
red ball by adding the probabilities of the joint events in which red ball is
contained.
Symbolically, P(R) = P(RD) + P(RS) = 0.2 + 0.4 = 0.6
So far, we have determined marginal probabilities of red balls and green
balls. Likewise, we can determine the marginal probability of dotted balls and
striped balls regardless of their colours. This has been attempted below ;
P(D) = P(RD) + P(GD) = 0.2 + 0.1 = 0.3
P(S) = P(RS) + P(GS) = 0.4 + 0.3 = 0.7
It should be noted that these two probabilities add to 1.0 as was also in the case of
the earlier two calculations. The following table summarizes the probabilities
under statistical dependence.
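The joint, marginal and conditional probabilities for the ten balls can all be derived from the four ball counts. The following Python sketch is illustrative only; the dictionary layout and function names are assumptions.

```python
from fractions import Fraction

# Ball counts from Table 1: (colour, pattern) -> number of balls.
balls = {("red", "dotted"): 2, ("red", "striped"): 4,
         ("green", "dotted"): 1, ("green", "striped"): 3}
total = sum(balls.values())

def joint(colour, pattern):
    """Joint probability, e.g. P(RS) for a red and striped ball."""
    return Fraction(balls[(colour, pattern)], total)

def marginal_colour(colour):
    """Marginal probability: sum of the joint events containing the colour."""
    return sum(joint(c, p) for (c, p) in balls if c == colour)

def conditional_pattern(pattern, colour):
    """P(pattern / colour) = P(pattern and colour) / P(colour)."""
    return joint(colour, pattern) / marginal_colour(colour)

print(marginal_colour("red"), conditional_pattern("striped", "red"),
      conditional_pattern("dotted", "green"))
```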
9.10 BAYES’ THEOREM
Bayes’ theorem is an important statistical method, which is used in evaluating new
information as well as in revising prior estimates of the probability in the light of that
information. Bayes’ theorem may be viewed as a means of transforming our prior probability
of an event into a posterior probability of that event. Bayes’ theorem, if properly used, makes
it unnecessary to collect huge data over a long period in order to make good decisions on the
basis of probabilities.
Example : Suppose we have two machines, I and II, which are used in the manufacture of
shoes. Let E1 be the event of shoes produced by machine I and E2 be the event that they are
produced by machine II. Machine I produces 60 percent of the shoes and machine II 40
percent. It is also reported that 10 percent of the shoes produced by machine I are
defective as against the 20 percent by machine II. What is the probability that a non-
defective shoe was manufactured by machine I?
Solutions : If E1 be the event of the shoe being produced by machine I and A be the event
of a non-defective shoe, our problem in symbolic terms is : P(E1/A). That is, given a non-
defective shoe, what is the probability that it was produced by machine I?
From our conditional probability formulas, the probability P(E1/A) is
P(E1/A) = P(E1A) / P(A)          ... (i)
But from the theorem on total probabilities, P(A) becomes
P(A) = P(AE1) + P(AE2) = P(A/E1) P(E1) + P(A/E2) P(E2)
= Σ P(A/Ei) P(Ei)
Substituting this result in (i) above, we get
P(E1/A) = P(E1A) / Σ P(A/Ei) P(Ei)
Which may also be written as
P(E1/A) = P(A/E1) P(E1) / Σ P(A/Ei) P(Ei)
This is called Bayes’ theorem.
It may be noted that P(E1) is the probability of a shoe being manufactured by
machine I, whereas P(E1/A) is the probability of a shoe being produced by machine I,
given that it is a non-defective shoe. The probability P(E1) is called prior probability
and P(E1/A) is called posterior probability.
Let us set up a table to calculate the probability that a non-defective shoe was produced
by machine I.
Computation of Posterior Probabilities :
Event             Prior P(Ei)   Conditional P(A/Ei)   Joint P(EiA)   Posterior P(Ei/A)
(1)               (2)           (3)                   (4)            (5) = (4)/P(A)
Machine I (E1)    0.6           0.9                   0.54           0.54/0.86 = 0.63
Machine II (E2)   0.4           0.8                   0.32           0.32/0.86 = 0.37
Total             1.0                                 P(A) = 0.86    1.00
On the basis of the above table we can say that given a non-defective shoe, the
probability that it was produced by machine I is 0.63 and the probability that it was
produced by machine II is 0.37. We can see that there is some revision in the prior
probabilities when we apply Bayes’ theorem.
A Problem with more than Two Elementary Events: The foregoing problem related
to two elementary events. Let us take a problem having three elementary events.
Example : A manufacturing firm is engaged in the production of steel pipes in its three
plants with a daily production of 1,000, 1,500 and 2,500 units respectively. According
to the past experience, it is known that the fractions of defective pipes produced by the
three plants are respectively 0.04, 0.09 and 0.07. If a pipe is selected from a day’s total
production and found to be defective, find out (a) from which plant the defective pipe
has come, and (b) what is the probability that it has come from the second plant?
Solution : Let the probabilities of the possible events be
P(E1) = 1,000/(1,000 + 1,500 + 2,500) = 0.2 – probability that a pipe is manufactured in plant A.
P(E2) = 1,500/(1,000 + 1,500 + 2,500) = 0.3 – probability that a pipe is manufactured in plant B.
P(E3) = 2,500/(1,000 + 1,500 + 2,500) = 0.5 – probability that a pipe is manufactured in plant C.
Let P(D) be the probability that a defective pipe is drawn. Given that the proportions of
the defective pipes coming from the three plants are 0.04, 0.09 and 0.07 respectively,
these are, in fact, the conditional probabilities : P(D/E1) = 0.04; P(D/E2) = 0.09; and
P(D/E3) = 0.07.
Now we can multiply prior probabilities and conditional probabilities in order to obtain
the joint probabilities.
Joint probabilities are
Plant A 0.04 x 0.2 = 0.008
Plant B 0.09 x 0.3 = 0.027
Plant C 0.07 x 0.5 = 0.035
Now we can obtain posterior probabilities by the following calculations :
Plant A : 0.008 / (0.008 + 0.027 + 0.035) = 0.114
Plant B : 0.027 / (0.008 + 0.027 + 0.035) = 0.386
Plant C : 0.035 / (0.008 + 0.027 + 0.035) = 0.500
Table 3 : Computation of Posterior Probabilities
Event   Prior P(Ei)   Conditional P(D/Ei)   Joint P(EiD)          Posterior P(Ei/D)
(1)     (2)           (3)                   (4)                   (5) = (4)/P(D)
E1      0.2           0.04                  0.04 x 0.2 = 0.008    0.008/0.07 = 0.11
E2      0.3           0.09                  0.09 x 0.3 = 0.027    0.027/0.07 = 0.39
E3      0.5           0.07                  0.07 x 0.5 = 0.035    0.035/0.07 = 0.50
Total   1.0                                 P(D) = 0.07           1.00
On the basis of these calculations, we can say that (a) most probably the defective
pipe has come from plant C, and (b) the probability that the defective pipe has come from
the second plant is 0.39.
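The tabular procedure used for the pipes example can also be sketched in a few lines of Python. This is only an illustration and not part of the original solution; the function name `posterior` is a hypothetical helper, and the inputs are the priors and defect fractions of the three plants.

```python
def posterior(priors, conditionals):
    """Return the posterior probabilities P(Ei/D) using Bayes' theorem."""
    # Joint probability of each event with the evidence: P(Ei) * P(D/Ei)
    joints = [p * c for p, c in zip(priors, conditionals)]
    # Total probability of the evidence, P(D)
    total = sum(joints)
    # Each posterior is the joint probability divided by P(D)
    return [j / total for j in joints]

# Plants A, B and C: share of daily output and fraction of defective pipes
result = posterior([0.2, 0.3, 0.5], [0.04, 0.09, 0.07])
print([round(r, 3) for r in result])
```

The largest value in `result` identifies the most probable source of the defective pipe, and the second entry is the posterior probability for the second plant.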
9.12 SUMMARY
Probability measures the likelihood of occurrence of an event. The outcomes may have equal
chances in some cases, whereas in other cases they may not.
This unit focuses on conditional probability and joint probability. It also discusses
Bayes’ theorem.
Given the information that the part was bad, using Bayes’ theorem find the probability that it was
supplied by supplier B.
Solution : We have to use Bayes’ theorem to work out the required probability. The necessary
calculations are shown in the following table.
Calculation of Probability
Supplier   Prior   Conditional   Joint    Posterior
A          0.65    0.02          0.0130   0.0130/0.0305 = 0.43
B          0.35    0.05          0.0175   0.0175/0.0305 = 0.57
9.14 SELF ASSESSMENT QUESTIONS
1. A sub-committee of 6 members is to be formed out of a group consisting of 7 men and 4 ladies.
Calculate the probability that the sub-committee will consist of (i) exactly 2 ladies, and (ii) at least
2 ladies.
2. There are 3 economists, 4 engineers, 2 statisticians and 1 doctor. A committee of 4 from among
them is to be formed. Find the probability that the committee
(i) Consists of one of each kind
(ii) Has at least one economist
(iii) Has the doctor as a member and three others.
3. (a) A bag contains 6 white, 4 red and 10 black balls. Two balls are drawn at random. Find
the probability that they will both be black.
(b) A bag contains 8 white and 4 red balls. Five balls are drawn at random. What is the
probability that 2 of them are red and 3 white?
4. From a pack of 52 cards, two cards are drawn at random. Find the probability that one is a king and the other a
queen.
5. One bag contains 4 white and 2 black balls. Another contains 3 white and 5 black balls. If one
ball is drawn from each bag, find the probability that
(a) both are white, (b) both are black, and (c) one is white and one is black.
6. A jar contains black and white marbles. Two marbles are chosen without replacement. The
probability of selecting a black marble and then a white marble is 0.34, and the probability of selecting
a black marble on the first draw is 0.47. What is the probability of selecting a white marble on the
second draw, given that the first marble drawn was black?
7. The probability that it is Friday and that a student is absent is 0.03. Since there are 5 school days
in a week, the probability that it is Friday is 0.2. What is the probability that a student is absent
given that today is Friday?
8. A bag contains red and blue marbles. Two marbles are drawn without replacement. The
probability of selecting a red marble and then a blue marble is 0.28. The probability of selecting a red
marble on the first draw is 0.5. What is the probability of selecting a blue marble on the second
draw, given that the first marble drawn was red?
9. A committee consists of four women and three men. The committee will randomly select two
people to attend a conference in Hawaii. Find the probability that both are women.
10. What is the probability that the total of two dice will be greater than 9, given that the first die is a
5?
UNIT 10 : THEORETICAL PROBABILITY DISTRIBUTIONS
AND NORMAL DISTRIBUTION
STRUCTURE
10.0 Objectives
10.1 Introduction
10.2 Basic Definitions
10.3 Properties of Normal Distribution
10.4 The standard Normal Curve
10.5 Equation of the Standard Normal Distribution
10.6 Normal Distribution Problems
10.7 Random Variable
10.8 Types of Probability Distributions
10.9 Binomial Distribution
10.10 Condition necessary for Binomial Distribution
10.11 Problems in Binomial Distribution
10.12 Poisson Distribution
10.13 Problems in Poisson Distribution
10.14 Summary
10.15 Key Words
10.16 Self Assessment Questions
10.17 References
10.0 OBJECTIVES
After studying this unit you should be able to:
* Define random variable;
* Identify types of probability distribution;
* Explain binomial distribution;
* Solve problems on Poisson distribution;
* Draw the normal curve;
* Solve problems on normal distribution; and
* Appreciate Excel applications of probability distributions.
10.1 INTRODUCTION
Here we will be looking at probability distributions which portray not the frequency
with which values of a distribution actually occur but the probability with which we predict
they will occur. Probability distributions are very important tools for modeling or
representing processes that occur at random, such as customers visiting a website or
accidents on a building site. A probability distribution is a table or an equation that links
each outcome of a statistical experiment with its probability of occurrence.
In studying probability distributions we will look at how they can be derived and how
we can model or represent the chances of different combinations of outcomes using the
same sort of approach as we use to arrange data into frequency distributions.
A probability distribution is very similar to a frequency distribution. Like a frequency
distribution, a probability distribution has a series of categories, but instead of categories
of values it has categories of types of outcomes. The other difference is that each category
has a probability instead of a frequency.
In the same way as a frequency distribution tells us how frequently each type of value
occurs, a probability distribution tells us how probable each type of outcome is.
The Normal Probability Distribution is very common in the field of statistics.
Whenever you measure things like people’s height, weight, salary, opinions or votes, the
graph of the results is very often a normal curve.
P(X = x) refers to the probability that the random variable X is equal to a particular value,
denoted by x. As an example, P(X = 1) refers to the probability that the random variable X is
equal to 1.
Y = { 1 / [ σ * sqrt(2π) ] } * e^( -(x - μ)^2 / (2σ^2) )
The random variable X in the normal equation is called the normal random variable. The normal
equation is the probability density function for the normal distribution.
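As an illustration, the normal equation can be evaluated directly. The sketch below is not from the original text; the function name `normal_pdf` is a hypothetical helper and the sample inputs (mean 0, standard deviation 1) are chosen only for demonstration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height Y of the normal curve at x for mean mu and standard deviation sigma."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

# The curve is highest at the mean and symmetric about it.
peak = normal_pdf(0.0, 0.0, 1.0)
left = normal_pdf(-1.0, 0.0, 1.0)
right = normal_pdf(1.0, 0.0, 1.0)
print(peak, left, right)
```

Note that Y is a density (the height of the curve), not a probability; probabilities correspond to areas under the curve.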
The curve on the left is shorter and wider than the curve on the right, because the curve on
the left has a bigger standard deviation.
Additionally, every normal curve (regardless of its mean or standard deviation) conforms to
the following “rule”.
* About 68% of the area under the curve falls within 1 standard deviation of the mean.
* About 95% of the area under the curve falls within 2 standard deviations of the mean.
* About 99.7% of the area under the curve falls within 3 standard deviations of the mean.
Collectively, these points are known as the empirical rule or the 68-95-99.7 rule.
Clearly, given a normal distribution, most outcomes will be within 3 standard deviations of
the mean.
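The 68-95-99.7 rule can be checked numerically. The following sketch is an illustration only; it relies on the standard identity P(|Z| < k) = erf(k / sqrt(2)) for a normal variable, and `area_within` is a hypothetical helper name.

```python
import math

def area_within(k):
    """Area under a normal curve within k standard deviations of the mean."""
    # For a normal variable, P(|Z| < k) = erf(k / sqrt(2))
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
```

The result does not depend on the mean or the standard deviation, which is why the rule holds for every normal curve.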
To find the probability associated with a normal random variable, use a graphing
calculator, an online normal distribution calculator, or a normal distribution table. In the
examples below, we illustrate the use of Stat Trek’s Normal Distribution Calculator, a free
online tool. The use of normal distribution tables is demonstrated later in this unit.
Example 1
An average light bulb manufactured by the Acme Corporation lasts 300 days with a
standard deviation of 50 days. Assuming that bulb life is normally distributed, what is the
probability that an Acme light bulb will last at most 365 days?
Solution: Given a mean of 300 days and a standard deviation of 50 days, we want to
find the cumulative probability that bulb life is less than or equal to 365 days. Thus, we know
the following:
* The value of the normal random variable is 365 days.
* The mean is equal to 300 days.
* The standard deviation is equal to 50 days.
We enter these values into the Normal Distribution Calculator and compute the
cumulative probability. The answer is: P( X < 365) = 0.90. Hence, there is a 90% chance
that a light bulb will burn out within 365 days.
Example 2
Suppose scores on an IQ test are normally distributed. If the test has a mean of 100
and a standard deviation of 10, what is the probability that a person who takes the test will
score between 90 and 110?
Solution: Here, we want to know the probability that the test score falls between 90 and 110.
The “trick” to solving this problem is to realize the following:
P( 90 < X < 110 ) = P( X < 110 ) - P( X < 90 )
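* To compute P( X < 110 ), we enter the following inputs into the calculator: The value
of the normal random variable is 110, the mean is 100, and the standard deviation is 10.
We find that P( X < 110 ) is 0.84.
* To compute P( X < 90 ), we enter the following inputs into the calculator: The value
of the normal random variable is 90, the mean is 100, and the standard deviation is 10.
We find that P( X < 90 ) is 0.16.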
We use these findings to compute our final answer as follows:
P( 90 < X < 110 ) = P( X < 110 ) - P( X < 90 )
P( 90 < X < 110 ) = 0.84 - 0.16
P( 90 < X < 110 ) = 0.68
Thus, about 68% of the test scores will fall between 90 and 110.
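Instead of an online calculator, the cumulative probabilities in Examples 1 and 2 can be reproduced with Python's standard library. This is a sketch under the assumption that `statistics.NormalDist` (available from Python 3.8) is an acceptable substitute for the calculator; the variable names are illustrative.

```python
from statistics import NormalDist

# Example 1: bulb life with mean 300 days and standard deviation 50 days
bulb = NormalDist(mu=300, sigma=50)
p_bulb = bulb.cdf(365)            # P(X <= 365)

# Example 2: IQ scores with mean 100 and standard deviation 10
iq = NormalDist(mu=100, sigma=10)
p_iq = iq.cdf(110) - iq.cdf(90)   # P(90 < X < 110)

print(round(p_bulb, 2), round(p_iq, 2))
```

Rounded to two decimal places, the two values should agree with the answers obtained from the calculator.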
When the area of the standard normal curve is divided into sections by standard
deviations above and below the mean, the area in each section is a known quantity (see Figure
2). As explained earlier, the area in each section is the same as the probability of randomly
drawing a value in that range.
Figure 2. The normal curve and the area under the curve between σ units.
For example, 0.3413 of the curve falls between the mean and one standard deviation
above the mean, which means that about 34 percent of all the values of a normally distributed
variable are between the mean and one standard deviation above it. It also means that there is
a 0.3413 chance that a value drawn at random from the distribution will lie between these
two points.
Sections of the curve above and below the mean may be added together to find the
probability of obtaining a value within (plus or minus) a given number of standard deviations
of the mean (see Figure 3). For example, the amount of curve area between one standard
deviation above the mean and one standard deviation below is 0.3413 + 0.3413 = 0.6826,
which means that approximately 68.26 percent of the values lie in that range. Similarly,
about 95 percent of the values lie within two standard deviations of the mean, and 99.7
percent of the values lie within three standard deviations.
Figure 3. The normal curve and the area under the curve between σ units.
In order to use the area of the normal curve to determine the probability of occurrence
of a given value, the value must first be standardized, or converted to a z-score . To convert
a value to a z-score is to express it in terms of how many standard deviations it is above or
below the mean. After the z-score is obtained, you can look up its corresponding probability
in a table. The formula to compute a z-score is
z = (x – μ) / σ
where x is the value to be converted, μ is the population mean, and σ is the population
standard deviation.
Example 1
A normal distribution of retail-store purchases has a mean of $14.31 and a standard
deviation of $6.40. What percentage of purchases were under $10? First, compute the
z-score:
z = (10 – 14.31) / 6.40 = –0.67
The next step is to look up the z-score in the table of standard normal probabilities.
The standard normal table lists the probabilities (curve areas) associated with given z-scores.
“Statistics Tables” gives the area of the curve below z; in other words, the probability
of obtaining a value of z or lower. Not all standard normal tables use the same format,
however. Some list only positive z-scores and give the area of the curve between the mean
and z. Such a table is slightly more difficult to use, but the fact that the normal curve is
symmetric makes it possible to use it to determine the probability associated with any
z-score, and vice versa.
To use the table, first look up the z-score in the left column, which lists z to the first
decimal place. Then look along the top row for the second decimal place. The intersection
of the row and column is the probability. In the example, you first find –0.6 in the left
column and then 0.07 in the top row. Their intersection is 0.2514. The answer, then, is that
about 25 percent of the purchases were under $10.
What if you had wanted to know the percentage of purchases above a certain amount?
Because the table gives the area of the curve below a given z, to obtain the area of the curve
above z, simply subtract the tabled probability from 1. The area of the curve above a z of
–0.67 is 1 – 0.2514 = 0.7486. Approximately 75 percent of the purchases were above $10.
Example 2
Using the previous example, what purchase amount marks the lower 10 percent of the
distribution?
Locate in the table the probability of 0.1000, or as close as you can find, and read off the corresponding
z-score. The figure that you seek lies between the tabled probabilities of 0.0985 and 0.1003,
but closer to 0.1003, which corresponds to a z-score of –1.28. Now, use the z formula, this
time solving for x:
x = μ + zσ = 14.31 + (–1.28)(6.40) = 14.31 – 8.19 = 6.12
About 10 percent of the purchases were below $6.12.
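Both directions of the table lookup in the retail-purchase examples (from a value to a probability, and from a probability back to a value) can be sketched in Python. This is an illustration only, assuming `statistics.NormalDist` in place of the printed table; the variable names are hypothetical.

```python
from statistics import NormalDist

mean, sd = 14.31, 6.40
standard_normal = NormalDist()                # mean 0, standard deviation 1

z = (10 - mean) / sd                          # z-score of a $10 purchase
below_10 = standard_normal.cdf(z)             # area of the curve below z
z_10pct = standard_normal.inv_cdf(0.10)       # z-score marking the lower 10 percent
cutoff = mean + z_10pct * sd                  # solve z = (x - mean) / sd for x

print(round(z, 2), round(below_10, 4), round(z_10pct, 2), round(cutoff, 2))
```

Because the code does not round z to two decimal places before the lookup, its results can differ slightly in the later decimals from values read off a printed table.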
10.5 EQUATION OF THE STANDARD NORMAL DISTRIBUTION
The standard normal distribution is a special case of the normal distribution. It
is the distribution that occurs when a normal random variable has a mean of zero and a
standard deviation of one.
Standard Score (aka z Score)
The normal random variable of a standard normal distribution is called a standard score or
a z-score. Every normal random variable X can be transformed into a z-score via the following
equation:
z = (X - μ) / σ
where X is a normal random variable, μ is the mean of X, and σ is the standard deviation
of X.
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
-3.0 0.0013 0.0013 0.0013 0.0012 0.0012 0.0011 0.0011 0.0011 0.0010 0.0010
... ... ... ... ... ... ... ... ... ... ...
-1.4 0.0808 0.0793 0.0778 0.0764 0.0749 0.0735 0.0722 0.0708 0.0694 0.0681
-1.3 0.0968 0.0951 0.0934 0.0918 0.0901 0.0885 0.0869 0.0853 0.0838 0.0823
-1.2 0.1151 0.1131 0.1112 0.1093 0.1075 0.1056 0.1038 0.1020 0.1003 0.0985
... ... ... ... ... ... ... ... ... ... ...
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
Of course, you may not be interested in the probability that a standard normal random
variable falls between minus infinity and a given value. You may want to know the probability
that it lies between a given value and plus infinity. Or you may want to know the probability
that a standard normal random variable lies between two given values. These probabilities
are easy to compute from a normal distribution table. Here’s how.
Find P(Z > a). The probability that a standard normal random variable (z) is greater
than a given value (a) is easy to find. The table shows the P(Z < a). The P(Z > a) = 1 -
P(Z < a).
Suppose, for example, that we want to know the probability that a z-score will be
greater than 3.00. From the table (see above), we find that P(Z < 3.00) = 0.9987.
Therefore, P(Z > 3.00) = 1 - P(Z < 3.00) = 1 - 0.9987 = 0.0013.
Find P(a < Z < b). The probability that a standard normal random variable lies between
two values is also easy to find. The P(a < Z < b) = P(Z < b) - P(Z < a).
For example, suppose we want to know the probability that a z-score will be greater
than -1.40 and less than -1.20. From the table (see above), we find that P(Z < -1.20)
= 0.1151; and P(Z < -1.40) = 0.0808. Therefore, P(-1.40 < Z < -1.20) =
P(Z < -1.20) - P(Z < -1.40) = 0.1151 - 0.0808 = 0.0343.
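The two rules, P(Z > a) = 1 - P(Z < a) and P(a < Z < b) = P(Z < b) - P(Z < a), can be written as small functions. The sketch below is illustrative; `prob_greater` and `prob_between` are hypothetical names, and `statistics.NormalDist` stands in for the printed table.

```python
from statistics import NormalDist

Z = NormalDist()   # standard normal: mean 0, standard deviation 1

def prob_greater(a):
    """P(Z > a) = 1 - P(Z < a)."""
    return 1.0 - Z.cdf(a)

def prob_between(a, b):
    """P(a < Z < b) = P(Z < b) - P(Z < a)."""
    return Z.cdf(b) - Z.cdf(a)

print(round(prob_greater(3.00), 4))
print(round(prob_between(-1.40, -1.20), 4))
```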
The Normal Distribution as a Model for Measurements
Often, phenomena in the real world follow a normal (or near-normal) distribution. This
allows researchers to use the normal distribution as a model for assessing probabilities
associated with real-world phenomena. Typically, the analysis involves two steps.
* Transform raw data. Usually, the raw data are not in the form of z-scores. They need to
be transformed into z-scores, using the transformation equation presented earlier:
z = (X - μ) / σ.
* Find probability. Once the data have been transformed into z-scores, you can use standard
normal distribution tables, online calculators (e.g., Stat Trek’s free normal distribution
calculator), or handheld graphing calculators to find probabilities associated with the
z-scores.
The problem in the next section demonstrates the use of the normal distribution as a
model for measurement.
Problem 1
Molly earned a score of 940 on a national achievement test. The mean test score was
850 with a standard deviation of 100. What proportion of students had a higher score than
Molly? (Assume that test scores are normally distributed.)
(A) 0.10
(B) 0.18
(C) 0.50
(D) 0.82
(E) 0.90
Solution
The correct answer is B. As part of the solution to this problem, we assume that test
scores are normally distributed. In this way, we use the normal distribution as a model for
measurement. Given an assumption of normality, the solution involves three steps.
* First, we transform Molly’s test score into a z-score, using the z-score transformation
equation:
z = (X - μ) / σ = (940 - 850) / 100 = 0.90
* Then, using an online calculator (e.g., Stat Trek’s free normal distribution calculator),
a handheld graphing calculator, or the standard normal distribution table, we find the
cumulative probability associated with the z-score. In this case, we find P(Z < 0.90)
= 0.8159.
* Therefore, P(Z > 0.90) = 1 - P(Z < 0.90) = 1 - 0.8159 = 0.1841.
Thus, we estimate that 18.41 percent of the students tested had a higher score than
Molly.
10.6 NORMAL DISTRIBUTION PROBLEMS
Problems and applications on normal distributions are presented below, together with their
answers. X is a normally distributed variable
with mean μ = 30 and standard deviation σ = 4. Find
a) P(x < 40)
b) P(x > 21)
c) P(30 < x < 35)
ANS: Note: What is meant here by area is the area under the standard normal curve.
a) For x = 40, the z-value z = (40 - 30) / 4 = 2.5
Hence P(x < 40) = P(z < 2.5) = [area to the left of 2.5] = 0.9938
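All three parts of this problem (μ = 30, σ = 4) can be checked with a short script. This is a sketch that assumes `statistics.NormalDist` in place of the standard normal table; the names `p_a`, `p_b` and `p_c` are illustrative labels for parts a), b) and c).

```python
from statistics import NormalDist

x = NormalDist(mu=30, sigma=4)

p_a = x.cdf(40)                 # a) P(x < 40): area to the left of z = 2.5
p_b = 1.0 - x.cdf(21)           # b) P(x > 21): area to the right of z = -2.25
p_c = x.cdf(35) - x.cdf(30)     # c) P(30 < x < 35): area between z = 0 and z = 1.25

print(round(p_a, 4), round(p_b, 4), round(p_c, 4))
```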
For x = 100 , z = (100 - 90) / 10 = 1
P(x > 100) = P(z > 1) = [total area] - [area to the left of z = 1]
= 1 - 0.8413 = 0.1587
The probability that a car selected at a random has a speed greater than 100 km/hr is
equal to 0.1587
2. For a certain type of computer, the length of time between charges of the battery is
normally distributed with a mean of 50 hours and a standard deviation of 15 hours. John
owns one of these computers and wants to know the probability that the length of time
will be between 50 and 70 hours.
ANS: Let x be the random variable that represents the length of time. It has a mean
of 50 and a standard deviation of 15. We have to find the probability that x is between 50
and 70 or P( 50< x < 70)
For x = 50 , z = (50 - 50) / 15 = 0
For x = 70 , z = (70 - 50) / 15 = 1.33 (rounded to 2 decimal places)
P( 50< x < 70) = P( 0< z < 1.33) = [area to the left of z = 1.33] - [area to the left
of z = 0]
= 0.9082 - 0.5 = 0.4082
The probability that John’s computer has a length of time between 50 and 70 hours
is equal to 0.4082.
3. Entry to a certain University is determined by a national test. The scores on this test
are normally distributed with a mean of 500 and a standard deviation of 100. Tom
wants to be admitted to this university and he knows that he must score better than at
least 70% of the students who took the test. Tom takes the test and scores 585. Will
he be admitted to this university?
Ans: Let x be the random variable that represents the scores. x is normally distributed with
a mean of 500 and a standard deviation of 100. The total area under the normal curve
represents the total number of students who took the test. If we multiply the values
of the areas under the curve by 100, we obtain percentages.
For x = 585 , z = (585 - 500) / 100 = 0.85
The proportion of students who scored below 585 is [area to the left of z = 0.85] = 0.8023 = 80.23%.
Tom scored better than 80.23% of the students who took the test and he will be admitted
to this University.
4. The annual salaries of employees in a large company are approximately normally
distributed with a mean of $50,000 and a standard deviation of $20,000.
a) What percent of people earn less than $40,000?
b) What percent of people earn between $45,000 and $65,000?
c) What percent of people earn more than $70,000?
Ans: a) For x = 40000, z = -0.5
Area to the left (less than) of z = -0.5 is equal to 0.3085, so 30.85% earn less than
$40,000.
b) For x = 45000, z = -0.25 and for x = 65000, z = 0.75
Area between z = -0.25 and z = 0.75 is equal to 0.3721, so 37.21% earn
between $45,000 and $65,000.
c) For x = 70000, z = 1
Area to the right (higher) of z = 1 is equal to 0.1587, so 15.87% earn more than $70,000.
10.8 TYPES OF PROBABILITY DISTRIBUTIONS
There are two types of probability distributions:
* Discrete probability distributions
The probability distribution of a discrete random variable is a list of probabilities
associated with each of its possible values. It is also sometimes called the probability function
or the probability mass function.
More formally, the probability distribution of a discrete random variable X is a function
which gives the probability f(x) that the random variable equals x, for each value x:
f(x) = P(X = x)
It satisfies the following conditions:
a. 0 ≤ f(x) ≤ 1
b. Σ f(x) = 1
* Continuous probability distributions
A continuous probability distribution describes an “unbroken” continuum of possible occurrences. A random variable is
continuous if it can take any value in an interval. The number of possible values in a range is
infinite, so the probability of any single value is 0.
Example 1: Discrete probability distribution
The number of successful treatments out of 2 patients is discrete, because the random
variable representing the number of successes can be only 0, 1, or 2. The probabilities of all possible
occurrences, P(0 successes), P(1 success) and P(2 successes), constitute the probability
distribution for this discrete random variable.
Example 2: Continuous probability distribution
A birth weight can be anything from 3 lbs to more than 10 lbs.
Thus, the random variable of birth weight is continuous, with an infinite number of possible
points between any two values.
Probability Distribution of a Discrete Random Variable:
The probability distribution of a discrete random variable, say x, is a list of the distinct
numerical values of x along with their associated probabilities. Let us take an example.
Example : Let us take x as the number of heads obtained in three tosses of a fair coin. We
are required to list the numerical values of x along with the corresponding outcomes.
Solution : These values along with the corresponding outcomes are shown in Table 1
Table 1 : List of Outcomes
Outcome    Value of x
TTT 0
TTH 1
THT 1
THH 2
HTT 1
HTH 2
HHT 2
HHH 3
x is a variable since in three tosses of the coin, it can take any value 0, 1, 2, or 3.
Further, x is a random variable in the sense that we could not have predicted the value of
the outcome before tossing the coin. It may be noted that for each elementary outcome,
there is only one value of x. However, as we can see, two or more elementary outcomes may
give the same value.
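The listing of outcomes for three tosses can be generated mechanically, and grouping the outcomes by their value of x gives the probability distribution of x. The sketch below is an illustration only and assumes a fair coin, so that all eight outcomes are equally likely; the variable names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2 * 2 * 2 = 8 equally likely outcomes of three tosses, e.g. ('T', 'T', 'H')
outcomes = list(product("TH", repeat=3))

# x = number of heads in each outcome; count how many outcomes give each x
counts = Counter(outcome.count("H") for outcome in outcomes)

# P(x) = (number of outcomes giving x) / (total number of outcomes)
distribution = {x: Fraction(counts[x], len(outcomes)) for x in sorted(counts)}

for x, p in distribution.items():
    print(x, p)
```

Using `Fraction` keeps the probabilities exact, so that they can be seen to add up to 1.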
Concept of Expected Value : Thus, we find that the sum of these products gives
Σ [x.P(x)] = 2.12, which is the mean. This can be written as µ = Σ [x.P(x)] = 2.12.
On the basis of this calculation, we can say that, on an average, this particular
machine is expected to break down 2.12 times per week over a period of time. In
other words, if this machine is used for several weeks, then in some weeks there may not be any
breakdown, in some other weeks there may be only one breakdown, and
so on. The mean number of breakdowns is expected to be 2.12 per week for the
entire period. This is the concept of expected value.
Symbolically, E(x) = Σ [x.Prob.(x)], where E(x) = Expected value of a discrete
variable x and x.Prob.(x) = Product of the value of variable x with its probability.
It may be noted that expected value can be derived subjectively as well. On the basis
of the experimenter’s own experience and judgment, one may assign probability that the
random variable will take on certain values.
Let us take another example.
Example : An accountant of a company is hoping to receive payment from two outstanding
accounts during the current month. He estimates that there is 0.6 probability of receiving
Rs. 15,000 due from A and 0.75 probability of receiving Rs. 40,000 due from B. What is the
expected cash flow from these two accounts?
Solution
Table 4 : Calculation of Expected Cash Flow
Account   Amount (xi) (Rs)   Probability (pi)   Expected amount (xi pi) (Rs)
A         15,000             0.60               9,000
B         40,000             0.75               30,000
Total expected value                            39,000
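The same calculation can be sketched in Python. This is only an illustration; `expected_value` is a hypothetical helper that multiplies each amount by its probability and adds the products. Here the two receipts are separate events, so their probabilities need not add up to 1.

```python
def expected_value(values, probabilities):
    """E(x) = sum of x * P(x) over the listed values."""
    return sum(x * p for x, p in zip(values, probabilities))

# Accounts A and B: amounts due (Rs) and probabilities of receiving them
cash_flow = expected_value([15000, 40000], [0.60, 0.75])
print(round(cash_flow))
```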
Importance of Expected Value: The concept of expected value is of considerable importance
to management in decision-making. This is because the criteria in decision problems involving
uncertainties are usually the maximization of expected profits, or utility, and the minimization
of expected costs. In Chapter 22 on Decision Theory, we shall discuss these criteria in
detail giving suitable examples.
With this introduction we now turn to the binomial distribution.
Let us put these results in the following form:
For one coin or event : (q + p)^1, that is, q + p
For two coins or events : (q + p)^2, that is, q^2 + 2qp + p^2
For three coins or events : (q + p)^3, that is, q^3 + 3q^2 p + 3qp^2 + p^3
Hence, for n coins or events : (q + p)^n
(q + p)^n = q^n + n q^(n-1) p + [n(n - 1)/2!] q^(n-2) p^2 + .... + p^n
This is known as the binomial distribution.
To analyze a problem using the binomial distribution you have to know the probability
of each outcome and it must be the same for every trial. In other words, the results of the
trials must be independent of each other.
Words like ‘experiment’ and ‘trial’ are used to describe binomial situations
because of the origins and widespread use of the binomial distribution in science.
Although the distribution has become widely used in many other fields, these scientific
terms have stuck.
It is a discrete probability distribution. Its probability mass function is given by
P(X = x) = nCx q^(n-x) p^x, x = 0, 1, 2, ..., n
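The probability mass function can be evaluated term by term. The following is a minimal sketch, not part of the original text; `binomial_pmf` is a hypothetical helper, and the example uses three tosses of a fair coin (n = 3, p = 0.5) as an assumed illustration.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) = nCx * p**x * q**(n - x), where q = 1 - p."""
    q = 1.0 - p
    return comb(n, x) * p ** x * q ** (n - x)

# Number of heads in three tosses of a fair coin (n = 3, p = 0.5)
probabilities = [binomial_pmf(x, 3, 0.5) for x in range(4)]
print(probabilities)
```

The four values correspond to the four terms of the expansion of (q + p)^3 and add up to 1.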
10.10 CONDITIONS NECESSARY FOR BINOMIAL DISTRIBUTION
At this stage, we should know that there are certain conditions that must be fulfilled
by a distribution if it is to be a binomial distribution. The conditions are:
1. It is necessary that each observation is classified in two categories such as success
and failure. For example, if raw material is obtained by a firm from its suppliers, it
may be classified as defective or non-defective on the basis of its normal quality.
Similarly, if a die is thrown, we may call 4, 5 or 6 success and getting 1, 2 or 3 a
failure.
2. It is necessary that the probability of success (or failure) remains the same for each
observation in each trial. Thus the probability of getting head (or tail) must remain
the same in each toss of the experiment. In other words, if the probability of success
(or failure) changes from trial to trial, or if the results of each trial are classified in
more than two categories, then it is not possible to use the binomial distribution.
3. The trials or individual observations must be independent of each other. In other words,
no trial should influence the outcome of another trial.
Fitting a Binomial Distribution : On the basis of some given information, if a binomial
distribution is to be fitted, then the following procedure needs to be adopted.
1. Find the values of p and q. When one value is given to us, the other value can be easily
obtained by subtracting the first value from 1.
2. Expand the binomial (p + q)^n. It may be noted that the power n will be one less than
the number of terms in the expanded binomial. For example, when n=5, there will be
6 terms.
3. Multiply each of the expanded binomial terms by the total frequency (N) so that the
expected frequency in each category can be obtained. Let us take an example.
Example : Fit a binomial distribution to following data :
x 0 1 2 3 4
f 28 62 46 10 4
Solution :
x        f      fx
0        28     0
1        62     62
2        46     92
3        10     30
4        4      16
Total    150    200
Mean = Σfx / Σf = 200 / 150 = 4/3 = np
With n = 4, p = (4/3)/4 = 1/3 and q = 1 - p = 2/3.
f(r) = N . p(r) = N × nCr p^r q^(n-r)
     = 150 × 4Cr (1/3)^r (2/3)^(4-r)
Now, to get the binomial frequencies, we have to put r = 0, 1, 2, 3 and 4 in the above
equation. These calculations are shown in the following table.
Table 5 : Calculation of Binomial Frequencies
r    f(r) = 150 × 4Cr (1/3)^r (2/3)^(4-r)                Frequency
0    f(0) = 150 × 4C0 (1/3)^0 (2/3)^4 = 150 × 16/81      30
1    f(1) = 150 × 4C1 (1/3)^1 (2/3)^3 = 150 × 32/81      59
2    f(2) = 150 × 4C2 (1/3)^2 (2/3)^2 = 150 × 24/81      44
3    f(3) = 150 × 4C3 (1/3)^3 (2/3)^1 = 150 × 8/81       15
4    f(4) = 150 × 4C4 (1/3)^4 (2/3)^0 = 150 × 1/81       2
The frequencies of the binomial distribution are shown in the extreme right of the
above table.
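The fitting procedure (find the mean, derive p from mean = np, then multiply each binomial term by N) can be sketched as follows. This is an illustration only; it assumes that n equals the largest observed value of x, as in the worked example, and the variable names are hypothetical.

```python
from math import comb

x_values = [0, 1, 2, 3, 4]
observed = [28, 62, 46, 10, 4]

N = sum(observed)                                          # total frequency
mean = sum(x * f for x, f in zip(x_values, observed)) / N  # mean of the data
n = max(x_values)                                          # number of trials (assumed)
p = mean / n                                               # since mean = np
q = 1.0 - p

# Expected frequency for each r: N * nCr * p^r * q^(n - r)
expected = [N * comb(n, r) * p ** r * q ** (n - r) for r in x_values]
print([round(f) for f in expected])
```

The rounded expected frequencies can then be compared with the observed frequencies to judge how well the binomial distribution fits the data.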
Mean and Standard Deviation of Binomial Distribution:
The mean and standard deviation of such theoretical frequency distributions where
we know the number of independent events and the probability of the happening of the event
in question, can be very easily calculated. If M stands for the mean of such distribution, n for
the number of independent events and p for the probability of the happening of the event in a
single trial, then M = np. The value of the standard deviation of the expected frequencies in
such cases is
sqrt(npq), where q = 1 - p
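As a short illustration, the two formulas can be evaluated for the values n = 4 and p = 1/3 used in the fitting example. The function name `binomial_mean_sd` is a hypothetical helper, not part of the original text.

```python
import math

def binomial_mean_sd(n, p):
    """Return (mean, standard deviation) = (np, sqrt(npq)) with q = 1 - p."""
    q = 1.0 - p
    return n * p, math.sqrt(n * p * q)

mean, sd = binomial_mean_sd(4, 1 / 3)
print(round(mean, 3), round(sd, 3))
```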
Meeting the Conditions for using the Bernoulli Process
Before closing our discussion on the binomial distribution, it must be emphasized
that one should be careful in using the binomial probability. It is necessary to ensure that
conditions specified earlier for binomial distribution are satisfied, particularly conditions 2
and 3. Condition 2 requires that the probability of the outcome of any trial should remain
unchanged for each trial. While this condition is fully met in experiments involving tossing
a coin or rolling a die, in real life it may be difficult to ensure the compliance of this
condition.
Condition 3 requires that the trials of a Bernoulli process must be independent of
each other. This means that the outcome of one trial must not influence in any way the
outcome of any other trial. This condition, too, may not be satisfied in real-life situations. For
example, take the case of interviewing candidates for a certain post in a company. The expert,
who is interviewing the candidates, may find that the first three candidates are far below the
standard expected. In view of this, he may not remain impartial (as he was earlier) while
interviewing the fourth candidate. This means violation of condition 3. One can find several
situations of this type in everyday life where compliance of condition 3 becomes extremely
difficult.
10.13 PROBLEMS IN POISSON DISTRIBUTION
Let us take an example to show how Poisson probabilities can be calculated.
Example : Suppose, we have a production process of some item that is manufactured in
large quantities. We find that, in general, the proportion of defective items is p = 0.01. A
random sample of 100 items is selected. What is the probability that there are 2 defective
items in this sample?
Solution :
The Poisson formula is
P(x) = (λ^x e^(-λ)) / x!
Where
P(x) = Probability of x occurrences
λ^x = Lambda (i.e. the mean number of occurrences per interval of time)
raised to the x power
e^(-λ) = 2.71828 (being the base of the natural logarithm system), raised to
the negative lambda power
x! = x factorial
Here λ = np = 100 x 0.01 = 1.0
Applying the above formula to the data given
P(2) = [(1)^2 (2.71828)^(-1)] / (2 x 1)
P(2) = [(1)^2 x 0.36788] / 2 = 0.18394
Suppose we want to know the probability of having up to 2 defective items
in that sample of 100 items. We simply add the three figures:
P(0)    0.368
P(1)    0.368
P(2)    0.184
Total   0.92
The answer is 0.92.
Again, if we are interested in knowing the probability of having more than
2 defective items, the answer will be
1 - 0.92 = 0.08
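The arithmetic of this example can be checked with a short Python sketch. The function name poisson_pmf and the variable names are illustrative choices, not part of the text; only the standard library is assumed.

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson distribution with mean lam."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

lam = 100 * 0.01                                         # lambda = n * p
p_exactly_2 = poisson_pmf(2, lam)                        # P(2)
p_up_to_2 = sum(poisson_pmf(k, lam) for k in range(3))   # P(0) + P(1) + P(2)
p_more_than_2 = 1 - p_up_to_2                            # complement rule

print(round(p_exactly_2, 5), round(p_up_to_2, 2), round(p_more_than_2, 2))
```

The same function can be reused for any other value of λ and x.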
Example : Suppose the probability of dialing a wrong number is 0.05. Then, what
is the probability of dialing exactly 3 wrong numbers in 100 dials?
Solution
p = 0.05
n = 100
λ = np
= 100 x 0.05 = 5
Applying the Poisson formula,
\[ P(3) = \frac{(5)^{3}\,(2.71828)^{-5}}{3!} = \frac{125 \times 0.0067}{6} = 0.14 \]
Example : Fit a Poisson distribution to the following data, which relate to the
number of deaths due to the kick of a horse in 10 corps per army per annum over
20 years.
Deaths 0 1 2 3 4 Total (f)
Frequency 109 65 22 3 1 200
Solution : Calculate the theoretical frequencies.
The theoretical expected frequencies are given by the formula
\[ N \times \frac{\lambda^{x} e^{-\lambda}}{x!} \]
Where x = 0, 1, 2, 3 and 4
N = total frequency
λ = mean
e = 2.71828
In order to find the value of λ, we have to calculate the arithmetic mean.
Table : Worksheet for Data in Example 10.14
Deaths (x)    Frequency (f)    fx
0             109              0
1             65               65
2             22               44
3             3                9
4             1                4
Total         200              122
\[ \lambda = \frac{\sum fx}{N} = \frac{122}{200} = 0.61 \]
\[ N \times \frac{\lambda^{x} e^{-\lambda}}{x!}, \qquad e^{-0.61} = 0.5434 \]
Now for each value of x from 0 to 4, we have to calculate the expected frequency. This is
shown below :
x    Expected frequency
0    200 x 0.5434 = 108.7
1    200 x 0.61 x 0.5434 = 66.3
2    200 x (0.61)^2 x 0.5434 / 2 = 20.2
3    200 x (0.61)^3 x 0.5434 / (3 x 2) = 4.1
4    200 x (0.61)^4 x 0.5434 / (4 x 3 x 2) = 0.6
Thus, the theoretical frequencies (rounded to whole numbers) are
Table : Theoretical Frequencies of Data in Example
x        Theoretical frequency    Observed frequency (f)
0        109                      109
1        66                       65
2        20                       22
3        4                        3
4        1                        1
Total    200                      200
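The fitting procedure above (estimate λ by the arithmetic mean, then multiply each Poisson probability by N) can be sketched in Python as follows. The variable names are illustrative, and the data are the observed frequencies from the example.

```python
import math

deaths = [0, 1, 2, 3, 4]
observed = [109, 65, 22, 3, 1]

n_total = sum(observed)                                          # N = 200
lam = sum(x * f for x, f in zip(deaths, observed)) / n_total     # mean = 122 / 200

# Expected frequency for each x: N * lambda^x * e^(-lambda) / x!
expected = [n_total * lam ** x * math.exp(-lam) / math.factorial(x) for x in deaths]

for x, f, e in zip(deaths, observed, expected):
    print(x, f, round(e, 1))
```

Rounding the expected frequencies to whole numbers gives the theoretical frequency column of the table.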
10.14 SUMMARY
This unit defines and explains the random variable and describes the types of probability
distribution. It also gives a note on the Binomial and Poisson distributions. Problems on the
Binomial and Poisson distributions are solved to give a better understanding of the concept.
The unit also focuses on the normal distribution. The details of the normal curve are given
here, and problems on the normal distribution are solved.
10.15 KEY WORDS
Binomial Distribution
Poisson Distribution
Random Variable
Normal Curve
Normal Distribution
Normal Distribution table
10.16 SELF ASSESSMENT QUESTIONS
1. From past experience, a manager of an upscale shoe store knows that 85% of her
customers will use a credit card when making purchases. Suppose three customers
are in line to make a purchase.
a) Does this example satisfy the conditions of a Bernoulli process?
b) Construct a probability tree that delineates all possible values and their associated
probabilities.
c) Using the probability tree, derive the binomial probability distribution.
2. Approximately 20% of U.S. workers are afraid that they will never be able to retire
(bankrate.com, June 23, 2008). Suppose 10 workers are randomly selected.
a) What is the probability that none of the workers is afraid that they will never be
able to retire?
b) What is the probability that at least two of the workers are afraid that they will
never be able to retire?
c) What is the probability that no more than two of the workers are afraid that they
will never be able to retire?
d) Calculate the expected value, the variance, and the standard deviation of this binomial
probability distribution.
3. A small life insurance company has determined that on the average it receives 6 death
claims per day. Find the probability that the company receives at least seven death
claims on a randomly selected day.
4. The number of traffic accidents that occurs on a particular stretch of road during a
month follows a Poisson distribution with a mean of 9.4. Find the probability that
less than two accidents will occur on this stretch of road during a randomly selected
month.
5. An average light bulb manufactured by the Acme Corporation lasts 300 days with a
standard deviation of 50 days. Assuming that bulb life is normally distributed, what
is the probability that an Acme light bulb will last at most 365 days?
6. Suppose scores on an IQ test are normally distributed. If the test has a mean of 100
and a standard deviation of 10, what is the probability that a person who takes the test
will score between 90 and 110?
10.17 REFERENCES
1. Statistics for Management by Richard I. Levin and David S. Rubin
2. Statistical methods by S.P. Gupta
3. Fundamentals of Statistics by S.C. Gupta
4. Advanced Practical Statistics by S.P. Gupta
5. Statistics Theory and Practice by M.S. Shukla and S.S. Gulshan
6. Statistical Methods by J. Medhi
UNIT 11 : INTRODUCTION TO OPERATIONS RESEARCH
STRUCTURE
11.0 Objectives
11.1 Introduction
11.2 Concept of Operation Research (OR)
11.3 Scope of OR
11.4 Phases of OR
11.5 Application of OR
11.6 Limitations of OR
11.7 Summary
11.8 Key words
11.9 Self Assessment Questions
11.10 Reference
11.0 OBJECTIVES
After studying this unit you should be able to:
* Explain the concept of OR;
* Assess the scope and phases of OR;
* Identify the applications of OR and
* Distinguish between models of OR.
11.1 INTRODUCTION
Operations research is research done on operations. It can be visualized as a method, a
tool, a set of techniques or an activity that aids the manager in making a decision. It makes use of
quantitative techniques to provide a better solution to the problem.
The present-day business scenario has changed vastly owing to competition and
increased complexity on all fronts. Globalization has made domestic businesses also
move towards globally set benchmarks. Hence decision making has become a very complex
and dynamic activity. It has to consider the various alternatives both qualitatively and
quantitatively. The various techniques that are used to facilitate decision making are called
quantitative methods, optimization techniques, decision science, operations analysis
or operations research.
The concept of operations research evolved during the Second World War in
England. The major concern at that time was the efficient or optimum use of the scarce
resources of war material, including human resources. As this technique was mainly evolved
for military operations, it is known as operations research. With the conclusion of the World
War, the application of OR spread to business to facilitate optimum use of resources
such as men, machines, money and material.
(b) Morse and Kimball have stressed the quantitative approach of OR and have described
it as “a scientific method of providing executive departments with a quantitative basis
for decisions regarding the operations under their control”.
(c) Miller and Starr see OR as applied decision theory. They state “OR is applied decision
theory. It uses any scientific, mathematical or logical means to attempt to cope with
the problems that confront the executive, when he tries to achieve a thorough-going
rationality in dealing with his decision problem”.
(d) Saaty considers OR as a tool for improving the quality of answers to problems. He says,
“OR is the art of giving bad answers to problems which otherwise have worse answers”.
By going through the above definitions, we can interpret that Operations Research is a
scientific decision-making tool leading to optimal use of resources. Operations research
helps in taking business decisions more effectively and objectively. Game theory, for
example, considers competition. The transportation problem tries to identify the least-cost
route for transporting items from different sources to different destinations. The assignment
complete the task would be minimized. Operations research also helps in taking an objective
decision which would either maximize the profit or minimize the cost by choosing a good
alternative among the various alternatives available.
iii. Planning:
In modern times, it has become necessary for every government to have careful
planning, for economic development of the country. OR techniques can be fruitfully
applied to maximise the per capita income, with minimum sacrifice and time. A
government can thus use OR for framing future economic and social policies.
iv. Agriculture:
With an increase in population, there is a need to increase agricultural output. But this
cannot be done arbitrarily. There are several restrictions.
v. In Industry:
The system of modern industries is so complex that the optimum point of operation
in its various components cannot be intuitively judged by an individual. The business
environment is always changing and any decision useful at one time may not be so
good some time later. There is always a need to check the validity of decisions
continuously against the situation.
vi. In Hospitals:
OR methods can solve waiting problems in the out-patient departments of big hospitals
and administrative problems of hospital organisations. Techniques such as queuing
theory help in determining the number of work stations required, for example the number
of petrol pumps and attendants required in a petrol bunk, or the number of out-patient
counters required in a hospital, based on the estimated arrival of service seekers
and the time required to provide the service.
vii. In Transport:
You can apply different OR methods to regulate the arrival of trains and processing
times, minimise passengers' waiting time, reduce congestion and formulate a suitable
transportation policy, thereby reducing the costs and time of trans-shipment.
11.4 PHASES OF OR
Any business problem that needs the intervention of operations research can be solved
in a series of steps. These steps, or rather phases, help in identifying the nature of the business
problem and the appropriate tools or techniques required to solve it. In
many cases the data provided may not be sufficient, so more data may be sought on the
given issue. The collected data is used to formulate the objective function, which shows clearly
the objective of solving the problem. Operations research also deals with dynamic programming,
where decisions have to be taken in a dynamic environment. In such cases data collection
and analysis may become redundant. The various phases in solving problems using operations
research are:
1. Judgment phase: In this first phase, the problem encountered in the real-life situation
is identified. The problem is structured in such a way that the solution
supports the organization's objective by using proper judgment. The problem is
structured to provide the decision maker with all the possible information.
2. Research phase: In this second phase, further data is collected on the basis of the
objective and the given data. Hypotheses are framed and tested. The data collected is analyzed
and verified. Suitable assumptions may be made wherever required.
3. Action phase: In this last phase, suitable recommendations are made for the
solution that has been arrived at by solving the problem. Before implementing the solution
suggested by the relevant OR model, one may have to check the compatibility of the
solution with environmental constraints and other qualitative issues.
11.5 APPLICATIONS OF OR
Operations research has many applications in various fields of business management.
1. In making decisions about the products to be produced and competitive
strategies
2. In scheduling salesmen's activities such as time, territory and frequency of visits
3. In deciding the type of promotion strategies to be employed
4. In forecasting the business direction
5. In deciding the price of the product with reference to competitors
6. In developing the framework for market research
7. In deciding the product mix and product proportioning
8. In production planning, sequencing and scheduling
9. In transportation, warehousing and physical distribution
10. In material handling and facility planning
11. In assembly line balancing
12. In maintaining and replacing machines
13. In project planning and scheduling
14. In designing queuing systems for serving customers
15. In maintaining the right size of inventory
16. In planning profit and selecting the optimum dividend policy
17. In portfolio analysis, i.e. product portfolio and investment portfolio
18. In determining the optimum organizational structure
Apart from these, OR also has applications in defence, in public systems such
as government, and in industry.
iv) Non-quantifiable Factors
Operations research provides a solution only when all the factors related to a problem
can be quantified. The non-quantifiable factors are not
incorporated in O.R. models. Importantly, O.R. models do not take into account
emotional or qualitative factors.
v) Implementation
Once the decision has been taken it should be implemented. The implementation of
decisions is a delicate task. This task must take into account the complexities of human
relations and behaviour and, at times, the psychological factors.
11.7 SUMMARY
Operations Research is a relatively new discipline, which originated in World War II
and became very popular throughout the world. India was one of the first few countries in the
world to start using operations research. Operations Research is used successfully not
only in military operations but also in business, government and industry. Nowadays
operations research is used in almost all fields. Proposing a definition of operations
research is difficult, because its boundary and content are not fixed. The tools of
operations research are drawn from subjects such as economics, engineering, mathematics,
statistics and psychology, which help in choosing among possible alternative courses of action.
The operations research tools and techniques include linear programming, non-linear
programming, dynamic programming, integer programming, Markov processes, queuing theory,
etc. Operations Research has a number of applications. Similarly it has a number of limitations,
which are basically related to the time, money and problems involved in model building.
Day by day, operations research is gaining more and more acceptance because it improves
the decision-making effectiveness of managers. Almost all areas of business use
operations research for decision making.
11.9 SELF ASSESSMENT QUESTIONS
1. Define operations research and explain the concept of operations research
2. Discuss the limitations of OR
3. Describe the scope and applications of OR in the present context
4. Outline the phases of OR
5. Give an account on various models of OR
11.10 REFERENCES
1. Gupta S.P. Business Statistics, New Delhi: S Chand and Sons Publishers, 2000
2. Shahsi Kumar. Quantitative Techniques and methods, Mysuru: Chetana Book House,
2010
3. Vignanesh Prajapathi, Big data Analysis With R and Hadoop, Mumbai: Packt Publishing,
2013
4. SD Sharma, Operation Research, Delhi: Discovery Publishing House, 1997
5. Srinath L. S, PERT and CPM, Delhi: East West Press,2001
6. Kalavathy, Operation Research , New Delhi: Vikas Publishing House, 2010
7. Richard I. Levin. Statistics for Management, New Delhi: Pearson education India, 2008
UNIT 12 : GAME THEORY
STRUCTURE
12.0 Objectives
12.1 Introduction
12.8 Summary
12.11 References
12.0 OBJECTIVES
12.1 INTRODUCTION
Game theory is a branch of applied mathematics and economics that studies situations
where players choose different actions in an attempt to maximize their returns. First developed
as a tool for understanding economic behavior and then by the RAND Corporation to define
nuclear strategies, game theory is now used in many diverse academic fields, ranging from
biology and psychology to sociology and philosophy. Beginning in the 1970s, game theory
has been applied to animal behavior, including species’ development by natural selection.
Because of games like the prisoner’s dilemma, in which rational self-interest hurts everyone,
game theory has been used in political science, ethics and philosophy. Finally, game theory
has recently drawn attention from computer scientists because of its use in artificial
intelligence and cybernetics.
Although similar to decision theory, game theory studies decisions that are made in
an environment where various players interact. In other words, game theory studies choice
of optimal behaviour when costs and benefits of each option are not fixed, but depend upon
the choices of other individuals.
set of players, a set of moves (or strategies) available to those players, and a specification of
payoffs (result) for each combination of strategies.
A general theory of rational behaviour for situations in which (1) two (two-person
games) or more (multi-person games) decision makers (players) have available to them (2)
a finite number of courses of action (plays), each leading to (3) a well-defined outcome or
end, with gains and losses expressed in terms of numerical payoffs associated with each
combination of courses of action and for each decision maker. The decision makers have
perfect knowledge of the rules of the game, i.e. (1), (2) and (3), but no knowledge about the
opponents' moves, and are rational in the sense of making decisions that optimize their
individual gains. The matrix of payoffs can represent various conflicts. In a zero-sum game
one person wins what the other loses. In other situations, gains and losses may be unequally
distributed, which allows the representation of numerous competitive and conflict situations.
The theory proposes several solutions: in a minimax strategy each participant minimizes
the maximum loss the other can impose on him, while a mixed strategy involves probabilistic
choices. Experiments with such games revealed conditions for cooperation, defection and
the persistence of conflict. The theory and some of the results have found applications in
economics, management science, bargaining and conflict resolution, among many areas of
interest.
Neumann and Oskar Morgenstern. This profound work contains the method for finding
optimal solutions for two-person zero-sum games. During this time period, work on game
theory was primarily focused on cooperative game theory, which analyzes optimal strategies
for groups of individuals, presuming that they can enforce agreements between them about
proper strategies.
In 2005, game theorists Thomas Schelling and Robert Aumann won the Bank of Sweden
Prize in Economic Sciences. Schelling worked on dynamic models, early examples of
evolutionary game theory. Aumann contributed more to the equilibrium school, developing
an equilibrium coarsening, correlated equilibrium, and developing extensive analysis of the
assumption of common knowledge.
Games in one form or another are widely used in many different academic disciplines.
· Economics and business
· Applications in Biology
· Computer science and logic
· Political science
· Philosophy
· Sociology
Strategy: A strategy for a player is defined as a set of rules or alternative courses of action
available to him in advance, by which the player decides the course of action he should adopt.
Strategies can be of two types:
a) Pure strategy: If the players select the same strategy each time, it is referred to as a
pure strategy. In this case each player knows exactly what the other player is going to do.
The objective of the players is to maximize gains or minimize losses.
b) Mixed strategy: When the players use a combination of strategies and each player
is always kept guessing as to which course of action will be selected by the other
player on a particular occasion, this is known as a mixed strategy.
c) Optimum strategy: A course of action or play which puts the player in the most preferred
position, irrespective of the strategy of his competitors, is called an optimum strategy.
d) Value of the game: It is the expected payoff of play when all the players of the game
follow their optimum strategies. The game is called fair if the value of the game is
zero and unfair if it is non-zero.
e) Payoff matrix: When the players select their particular strategies, the payoffs (gains or
losses) can be represented in the form of a matrix called the payoff matrix.
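Let player A have m strategies A1, A2, ....., Am and player B have n strategies B1, B2, ....., Bn.
The payoff matrix is written in terms of A, i.e. positive values reflect gains to A and negative
values reflect losses to A. Let a_ij be the payoff which player A gains from B if player A chooses
strategy Ai and player B chooses strategy Bj. Then the payoff matrix is
\[
\begin{array}{c|cccccc}
 & \multicolumn{6}{c}{\text{Player B}} \\
\text{Player A} & B_1 & B_2 & \cdots & B_j & \cdots & B_n \\ \hline
A_1 & a_{11} & a_{12} & \cdots & a_{1j} & \cdots & a_{1n} \\
A_2 & a_{21} & a_{22} & \cdots & a_{2j} & \cdots & a_{2n} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots \\
A_i & a_{i1} & a_{i2} & \cdots & a_{ij} & \cdots & a_{in} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots \\
A_m & a_{m1} & a_{m2} & \cdots & a_{mj} & \cdots & a_{mn}
\end{array}
\]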
For player A, the minimum value in each row represents the least gain (payoff) to him if he
chooses that particular strategy. These values are written next to the matrix as row minima. He will
then select the strategy that maximizes his minimum gain.
The choice of player A is called the maximin strategy and the corresponding gain is the
maximin value of the game. It is denoted by $\underline{V}$.
Player B minimizes his maximum losses. The maximum value in each column represents
the maximum loss to him if he chooses that particular strategy. These values are written below the
matrix as column maxima. He will then select the strategy that minimizes his maximum
loss. The choice of player B is called the minimax strategy and the corresponding loss is the
minimax value of the game.
It is denoted by $\overline{V}$.
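The row-minima and column-maxima procedure can be sketched in Python. The function name maximin_minimax and the 3 x 3 payoff matrix are hypothetical illustrations, not taken from the text; a saddle point exists when the maximin and minimax values coincide.

```python
def maximin_minimax(payoff):
    """Return (maximin value, minimax value, list of saddle-point positions)."""
    row_minima = [min(row) for row in payoff]
    col_maxima = [max(col) for col in zip(*payoff)]
    lower = max(row_minima)   # player A maximizes his minimum gain
    upper = min(col_maxima)   # player B minimizes his maximum loss
    saddle = [
        (i, j)
        for i, row in enumerate(payoff)
        for j, v in enumerate(row)
        if v == row_minima[i] and v == col_maxima[j]
    ]
    return lower, upper, saddle

# Hypothetical payoff matrix (rows: player A, columns: player B).
example = [
    [3, -1, 4],
    [6, 5, 7],
    [2, 0, 1],
]
lower, upper, saddle = maximin_minimax(example)
print(lower, upper, saddle)
```

For this illustrative matrix both values equal 5, so the entry in the second row and second column is a saddle point and both players use pure strategies.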
Consider the below matrix
Player B
1 21 11 1 1
Player A 1 00 1 0
1
33 11 11 1
3 1 1
In this matrix, player A’s options are listed on the left, while player B’s options are
listed on top. We think of A as playing the rows and B as playing the columns. Positive
numbers indicate a win for the row player, while negative numbers indicate a loss for the row
player. Thus, for example, the p,s entry represents the outcome if A plays p and B plays s
In each round of the game, each player’s choice is called a strategy. Thus, if A chooses
p, we refer to the p row as player A’s strategy.
Here we have four saddle points, i.e. the payoff values which are row minima as well as
column maxima.
12.4 GAMES WITHOUT SADDLE POINT
ALGEBRAIC METHOD
If a game does not have a saddle point, it can be solved by either the
algebraic or the arithmetical method. Consider the following example.
1. Determine the optimum strategy and the value of the game for the following payoff
matrix.
            B
          H     T
A    H    2    -1
     T   -1     0
The payoff matrix does not have any saddle point. Let player A play H with probability
x and T with probability 1-x, so that
x + (1-x) = 1
If player B plays H all the time, then A's expected gain will be
2x - (1-x) = 3x - 1
Similarly, if player B plays T all the time, then A's expected gain will be
-x + 0(1-x) = -x
The best strategy for A is naturally the one which gives equal gain whether player B
selects H or T. Then
3x - 1 = -x
Or 4x = 1 or x = ¼
Similarly, if player B plays H with probability y and T with probability 1-y, then 3y - 1 = -y, so
y = ¼
The solution is
1. The player A should play H or T with probability ¼ & ¾ respectively. The optimal
strategy for A is { ¼, ¾ }
2. The player B should play H or T with probability ¼ & ¾ respectively. The optimal
strategy for B is { ¼ , ¾ }
3. The expected value of the game for A is -1/4
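The algebraic method for a 2 x 2 game without a saddle point reduces to closed-form expressions obtained by equating the expected gains. The following Python sketch assumes a payoff matrix [[a, b], [c, d]] written from player A's point of view; the function name solve_2x2 is illustrative.

```python
from fractions import Fraction

def solve_2x2(a, b, c, d):
    """Mixed-strategy solution of the game [[a, b], [c, d]] with no saddle point.

    Equating A's expected gains against each column gives x;
    equating B's expected losses against each row gives y.
    """
    denom = Fraction(a + d - b - c)
    x = (d - c) / denom              # probability that A plays the first row
    y = (d - b) / denom              # probability that B plays the first column
    value = (a * d - b * c) / denom  # expected value of the game for A
    return x, y, value

x, y, value = solve_2x2(2, -1, -1, 0)
print(x, 1 - x, y, 1 - y, value)
```

Exact fractions are used so that the result can be compared directly with the hand calculation.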
If a game does not have a saddle point, two players cannot use the maximin -minimax
(pure) strategies as their optimal strategies. Hence the concept of mixed strategy i,e instead
of selecting pure strategies only, each
              Y1     Y2    ...   Yj    ...   Yn
              1      2     ...   j     ...   n
X1    1       V11    V12   ...   V1j   ...   V1n
X2    2       V21    V22   ...   V2j   ...   V2n
...
Xm    m       Vm1    Vm2   ...   Vmj   ...   Vmn
Inferior strategies can be removed from the given pay off matrix so that a smaller pay
off matrix is obtained.
2.6.2 Problem :
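1. Solve this problem using dominance
             B
       -4    6    3
A      -3   -3    4
        2   -3    4
Delete the columns which have higher or equal values when compared to the
corresponding elements of another column.
Similarly delete the rows which have lower or equal values when compared to the
corresponding elements of another row.
In this case column 3 is dominated by column 1, so neglect that column.
       -4    6
       -3   -3
        2   -3
Row 2 is now dominated by row 3, so neglect that row.
       -4    6
        2   -3
Now calculate row minima and column maxima
                         Row minima
       -4    6           -4
        2   -3           -3
Column maxima   2    6
The maximin value (-3) differs from the minimax value (2), so there is no saddle point.
Let player A play rows 1 and 3 with probabilities x1 and x2. Then
V = -4x1 + 2x2
V = 6x1 - 3x2
x1 + x2 = 1
-4x1 + 2x2 = 6x1 - 3x2
=> 10x1 = 5x2, i.e. x1 = 0.5x2
x1 + x2 = 1
0.5x2 + x2 = 1
1.5x2 = 1
=> x2 = 2/3
=> x1 = 1/3
The optimal strategy for A is (1/3, 0, 2/3).
Similarly, the optimal strategy for B is
(3/5, 2/5, 0)
The value of the game is
V = 6x1 - 3x2
= 6 x 1/3 - 3 x 2/3
= 2 - 2 = 0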
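The dominance rules used in this problem can be sketched in Python. The function reduce_by_dominance is an illustrative helper that applies only pure-strategy dominance (a row or column compared against one other row or column); it is a minimal sketch, not a general game solver, and it assumes the reduced game is 2 x 2 with no saddle point.

```python
from fractions import Fraction

def reduce_by_dominance(payoff):
    """Return the row and column indices that survive pure-strategy dominance."""
    rows = list(range(len(payoff)))
    cols = list(range(len(payoff[0])))
    changed = True
    while changed:
        changed = False
        # Row player (maximizer): drop a row that is <= some other row everywhere.
        for r in rows:
            if any(o != r and all(payoff[o][c] >= payoff[r][c] for c in cols)
                   for o in rows):
                rows.remove(r)
                changed = True
                break
        # Column player (minimizer): drop a column that is >= some other column.
        for c in cols:
            if any(o != c and all(payoff[r][o] <= payoff[r][c] for r in rows)
                   for o in cols):
                cols.remove(c)
                changed = True
                break
    return rows, cols

payoff = [[-4, 6, 3], [-3, -3, 4], [2, -3, 4]]
rows, cols = reduce_by_dominance(payoff)

# Solve the remaining 2 x 2 game [[a, b], [c, d]] algebraically.
(a, b), (c, d) = [[payoff[r][k] for k in cols] for r in rows]
denom = Fraction(a + d - b - c)
x1 = (d - c) / denom              # probability of A's first surviving row
y1 = (d - b) / denom              # probability of B's first surviving column
value = (a * d - b * c) / denom   # value of the game
print(rows, cols, x1, y1, value)
```

The surviving indices are zero-based, so rows [0, 2] correspond to the first and third strategies of player A.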
12.7 2X2 MIXED STRATEGY USING ARITHMETICAL METHOD
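A 2 X 2 matrix game without a saddle point can also be solved using the arithmetical
method.
            B
          H     T
A    H    8    -3
     T   -3     1
Solution:
                    B
                  H        T        Oddments    Probability
A         H       8       -3        4           4/(11+4) = 4/15
          T      -3        1        11          11/(11+4) = 11/15
Oddments          4        11
Probability       4/15     11/15
Player A must use strategy H with probability 4/(4+11) = 4/15 and strategy T with probability 11/15,
while player B uses strategy H with probability 4/15 and strategy T with probability 11/15.
The value of the game is the same whichever pure strategy the opponent plays.
If B plays H,
\[ V = \frac{4 \times 8 + 11 \times (-3)}{15} = -\frac{1}{15} \]
If B plays T,
\[ V = \frac{4 \times (-3) + 11 \times 1}{15} = -\frac{1}{15} \]
If A plays H,
\[ V = \text{Rs. } \frac{4 \times 8 + 11 \times (-3)}{15} = -\frac{1}{15} \]
If A plays T,
\[ V = \text{Rs. } \frac{4 \times (-3) + 11 \times 1}{15} = -\frac{1}{15} \]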
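The arithmetical (oddments) method can be sketched in Python as below. The function name oddments_2x2 is illustrative, and the sketch assumes a 2 x 2 game [[a, b], [c, d]] without a saddle point: the oddment of each row is the absolute difference of the entries in the other row, and likewise for columns.

```python
from fractions import Fraction

def oddments_2x2(a, b, c, d):
    """Arithmetical (oddments) solution of the 2 x 2 game [[a, b], [c, d]]."""
    row_odds = (abs(c - d), abs(a - b))   # oddments for row 1 and row 2
    col_odds = (abs(b - d), abs(a - c))   # oddments for column 1 and column 2
    p = [Fraction(o, sum(row_odds)) for o in row_odds]   # A's probabilities
    q = [Fraction(o, sum(col_odds)) for o in col_odds]   # B's probabilities
    value = p[0] * a + p[1] * c           # expected payoff when B plays column 1
    return p, q, value

p, q, value = oddments_2x2(8, -3, -3, 1)
print(p, q, value)
```

The value is computed against B's first column only; for a game without a saddle point the second column gives the same result.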
       A    B    C    D    E    F
A      0    0    0    0    0    0
B      4    2    0    2    1    1
C      4    3    1    3    2    2
D      4    3    7   -5    1    2
E      4    3    4   -1    2    2
F      4    3    3   -2    2    2
Column E is superior to columns A and B, and row C is superior to rows A and B. Now the
reduced matrix is
       C    D    E    F
C      1    3    2    2
D      7   -5    1    2
E      4   -1    2    2
F      3   -2    2    2
Column E is superior to column F, so column F is deleted.
       C    D    E
C      1    3    2
D      7   -5    1
E      4   -1    2
F      3   -2    2
Row E is superior to row F, so row F is deleted.
       C    D    E
C      1    3    2
D      7   -5    1
E      4   -1    2
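Now take the average of player B's strategies C and D.
\[ \left( \frac{1+3}{2},\ \frac{7-5}{2},\ \frac{4-1}{2} \right) = (2,\ 1,\ 3/2) \]
This average is superior to column E, so column E is deleted.
Similarly, take the average of player A's strategies C and D.
\[ \left( \frac{1+7}{2},\ \frac{3-5}{2} \right) = (4,\ -1) \]
This average equals row E, so row E is deleted. The reduced matrix is
       C    D
C      1    3
D      7   -5
For A: X1 + 7X2 = V and 3X1 - 5X2 = V
=> X1 + 7X2 = 3X1 - 5X2
=> 2X1 = 12X2
X1 = 6X2
Substituting in
X1 + X2 = 1
=> 6X2 + X2 = 1 => 7X2 = 1
=> X2 = 1/7
=> X1 = 6/7
For B: Y1 + 3Y2 = V and 7Y1 - 5Y2 = V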
12.8 SUMMARY
Game theory is a type of decision theory in which one's choice of action is determined
after taking into consideration all possible alternatives available to an opponent playing the
same game.
Game theory is capable of analyzing a single competitive situation. However, there
is a great gap between what the theory can handle and actual situations.
Game Theory
Strategy
Saddle point
Maximin or Minimax
Dominance
12.10 SELF ASSESSMENT QUESTIONS
1. Solve the following game
1 2 3
1 -3 -2 6
2 2 0 2
3 5 -2 -4
P1 I 1 3
II 4 2
3.
A B C D E
A 4 4 2 -4 -6
B 8 6 8 -4 0
C 10 2 4 10 12
II 2 3
III 3 2
IV -2 6
12.11 REFERENCE
1. Gupta S.P. Business Statistics, New Delhi: S Chand and Sons Publishers, 2000
2. Shahsi Kumar. Quantitative Techniques and methods, Mysuru: Chetana Book House,
2010
3. Vignanesh Prajapathi, Big data Analysis With R and Hadoop, Mumbai: Packt Publishing,
2013
4. SD Sharma, Operation Research, Delhi: Discovery Publishing House, 1997
5. Srinath L. S, PERT and CPM, Delhi: East West Press,2001
6. Kalavathy, Operation Research , New Delhi: Vikas Publishing House, 2010
7. Richard I. Levin. Statistics for Management, New Delhi: Pearson education India, 2008
BLOCK -4
APPLICATION OF OPERATION RESEARCH IN BUSINESS
UNIT-13 : DECISION ANALYSIS
STRUCTURE
13.0 Objectives
13.1 Introduction
13.9 Summary
13.12 References
13.0 OBJECTIVES
After studying this unit, you should be able to :
* Explain role of decision analysis in real life situations;
* Discuss the importance of decision making;
* Examine different conditions under which decisions are made and
* Construct decision trees
13.1 INTRODUCTION
Decision making is not a new function. In our daily life, we take decisions on one
occasion or another. These decisions may have a short-term or long-term effect on our lives.
Similarly, the decisions taken by companies or their representatives would certainly affect
the company in the long run.
2. State or strategies
The state of nature refers to consequences of decision making which are beyond the control
of the decision maker.
3. Uncertainty
In some cases of decision making, all the influencing factors may not be known to the
decision maker. This is called uncertainty.
4. Payoff:
Decision making yields outcomes which are called payoffs. A payoff may be a gain or a loss.
The payoffs under different states of nature for the alternative courses of action can
be listed in the form of a table, which is called a pay-off table.
Consider a fixed state of nature E_i (i = 1, 2, ....., m) for which the payoffs corresponding
to the n courses of action are given by P_i1, P_i2, ....., P_in.
Let M_i be the payoff of the best possible course of action for the state of nature E_i. The
opportunity loss table is created as below.
State of nature    Conditional opportunity loss
                   A1          A2          A3          ...   Aj          ...   An
E1                 M1 - P11    M1 - P12    M1 - P13    ...   M1 - P1j    ...   M1 - P1n
c. Decision making under risk.
This refers to a situation where the decision maker chooses from several possible outcomes
for which the probabilities of occurrence can be stated.
Decision making under risk is like shooting at an enemy in the dark, knowing that the enemy
is there.
Launching a new product after only a preliminary survey has been made involves decision
making under risk.
d. Decision Under Conflict
In many situations the states of nature are neither completely known nor completely
uncertain. Partial knowledge is available, and therefore it may be termed decision
making under partial uncertainty. An example of this is the situation of conflict involving
two or more competitors marketing the same product.
Probabilities may be based on decision maker’s personal opinions about future events,
or on data obtained from market surveys, expert opinions, etc.
When probability of occurrence of each state of nature can be assessed, problem
environment is called decision making under risk.
Examples: -
Probability of being dealt club from deck of cards is 1/4.
Probability of rolling 5 on die is 1/6.
It is the term used in a situation where for each decision alternative there is only one
event and therefore only one outcome for each action. For example, there is only one possible
event for the two possible actions: “Do nothing” at a future cost of $3.00 per unit for 10,000
units, or “rearrange” a facility at a future cost of $2.80 for the same number of units. A
decision matrix (or payoff table) would look as follows:
Note that there is only one State of Nature in the matrix because there is only one
possible outcome for each action (with certainty). The decision is obviously to choose the
action that will result in the most desirable outcome (least cost), that is, to “rearrange”.
             A1    A2    A3    A4
E1 - 300      7    12    20    27
E2 - 350     10     9    10    25
E3 - 400     23    20    14    23
E4 - 450     32    24    21    17
A1 A2 A3 A4
E1 7 12 20 27
E2 10 9 10 25
E3 23 20 14 23
E4 32 24 21 17
A1 A2 A3 A4
E1 7 12 20 27
E2 10 9 10 25
E3 23 20 14 23
E4 32 24 21 17
ith regret = (max. payoff - ith payoff) for the ith event, if the payoff represents profit
           = (ith payoff - min. payoff) for the ith event, if the payoff represents cost
Step 2 : Determine the maximum regret for each alternative
Step 3 : Select the minimum out of these
Considering the same example, with the payoffs treated as costs:
        A1    A2    A3    A4    Minimum payoff
E1       7    12    20    27     7
E2      10     9    10    25     9
E3      23    20    14    23    14
E4      32    24    21    17    17
Deduct this minimum payoff from all the elements of that row.
Regret payoff amount
              A1    A2    A3    A4
E1             0     5    13    20
E2             1     0     1    16
E3             9     6     0     9
E4            15     7     4     0
Max regret    15     7    13    20
The minimum of these is 7.
So A2 is chosen.
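The regret calculation above can be reproduced with a few lines of Python. The variable names are illustrative, and the payoffs are treated as costs, as in the worked example (the row minimum is deducted from each row).

```python
alternatives = ["A1", "A2", "A3", "A4"]
cost = [            # rows are the events E1..E4, columns the alternatives A1..A4
    [7, 12, 20, 27],
    [10, 9, 10, 25],
    [23, 20, 14, 23],
    [32, 24, 21, 17],
]

# Regret for a cost table: payoff minus the smallest payoff in the same row.
regret = [[c - min(row) for c in row] for row in cost]

# Maximum regret of each alternative (column), then the minimum of those.
max_regret = [max(col) for col in zip(*regret)]
best = alternatives[max_regret.index(min(max_regret))]
print(max_regret, best)
```

For a profit table the regret line would instead use the row maximum minus each payoff.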
13.4.5 Hurwicz Criterion
It stipulates that a decision maker should be both optimistic and pessimistic.
Steps
1. Choose α as the degree of optimism and 1 - α as the degree of pessimism
2. Determine the maximum (I) and minimum (II) payoff for each alternative
3. Calculate h = α x I + (1 - α) x II for each alternative
Consider the same example
       E1    E2    E3    E4    I (max)    II (min)
A1      7    10    23    32    32          7
A2     12     9    20    24    24          9
A3     20    10    14    21    21         10
A4     27    25    23    17    27         17
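The Hurwicz calculation can be sketched in Python. The text does not state a value for the degree of optimism, so alpha = 0.6 below is purely a hypothetical choice for illustration; the payoffs are treated as profits, so that the maximum is the optimistic outcome, and the alternative with the largest weighted value is selected.

```python
payoffs = {   # payoffs of each alternative under the events E1..E4
    "A1": [7, 10, 23, 32],
    "A2": [12, 9, 20, 24],
    "A3": [20, 10, 14, 21],
    "A4": [27, 25, 23, 17],
}

alpha = 0.6   # hypothetical degree of optimism, not given in the text

# Hurwicz value: alpha * (maximum payoff) + (1 - alpha) * (minimum payoff)
hurwicz = {name: alpha * max(v) + (1 - alpha) * min(v) for name, v in payoffs.items()}
best = max(hurwicz, key=hurwicz.get)
print({name: round(h, 1) for name, h in hurwicz.items()}, best)
```

Changing alpha can change the selected alternative, which is the point of the criterion: alpha = 1 gives the fully optimistic choice and alpha = 0 the fully pessimistic one.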
Problem :
A man has the choice of running either a hot snack stall or an ice-cream stall at a seaside
resort. If it is a fairly cool summer, he should make Rs. 5000 by running the hot snack stall. If
it is hot, he can make a profit of Rs. 1000. On the other hand, his profit is Rs. 6,500 for a hot
summer and Rs. 1000 if it is cool by running the ice-cream stall. There is a 40% chance of the summer
being hot. What should be his choice?
Solution
The pay-off table can be constructed as below:
                               Conditional pay off
Event          Probability     Hot snack    Ice-cream
Cool summer    0.6             5000         1000
Hot summer     0.4             1000         6500
The expected pay-offs are obtained by multiplying each conditional pay-off by the probability
of the event:
Event    Probability    Hot snack    Ice-cream
Cool     0.6            3000         600
Hot      0.4            400          2600
EMV                     3400         3200
Since the expected monetary value of selling hot snacks is higher, he should opt for the hot snack stall.
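The expected monetary value (EMV) computation for this problem can be sketched in Python. The dictionary layout and names are illustrative; the probabilities and pay-offs are those given in the problem.

```python
probability = {"cool": 0.6, "hot": 0.4}
payoff = {
    "hot snack": {"cool": 5000, "hot": 1000},
    "ice-cream": {"cool": 1000, "hot": 6500},
}

# EMV of an action = sum over events of probability * conditional pay-off.
emv = {
    action: sum(probability[event] * value for event, value in outcomes.items())
    for action, outcomes in payoff.items()
}
best = max(emv, key=emv.get)
print({action: round(v) for action, v in emv.items()}, best)
```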
13.7 DECISION TREE ANALYSIS
It is a graphic display of the various decision alternatives and their sequence, as if they were
branches of a tree.
[Figure: symbols used in a decision tree for a decision point and an event]
Problem
Amar company is currently working with a process which, after paying for materials, labour,
etc., brings a profit of Rs. 12,000. The following alternatives are made available to the
company.
(i) The company can conduct research (R1) which is expected to cost Rs. 10,000 and has a
90% chance of success. If it proves a success, the company gets a gross income of
Rs. 25,000.
(ii) The company can conduct research (R2) which is expected to cost Rs. 8,000 and has a
60% probability of success. The gross income will be Rs. 25,000.
(iii) The company can pay Rs. 6,000 as royalty for a new process which will bring a
gross income of Rs. 20,000.
(iv) The company can continue the current process.
Since the net EMV is highest for the alternative of paying royalty for the new process, the
optimal decision is to procure the process on a royalty basis.
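The roll-back of such a tree can be sketched in Python. The surviving text does not say what happens when research fails, so the sketch makes an explicit assumption: after a failure the company still chooses the better of paying royalty or continuing the current process, with the research cost already spent. The function and variable names are illustrative.

```python
CURRENT_PROFIT = 12_000          # net profit of the existing process
ROYALTY_NET = 20_000 - 6_000     # gross income less royalty


def research_emv(cost, p_success, gross_success=25_000):
    """Net EMV of a research branch under the fallback assumption stated above."""
    net_if_success = gross_success - cost
    # ASSUMPTION: on failure, pick the better fallback; the research cost is sunk.
    net_if_failure = max(ROYALTY_NET, CURRENT_PROFIT) - cost
    return p_success * net_if_success + (1 - p_success) * net_if_failure


emv = {
    "Research R1": research_emv(10_000, 0.9),
    "Research R2": research_emv(8_000, 0.6),
    "Pay royalty": float(ROYALTY_NET),
    "Continue current process": float(CURRENT_PROFIT),
}
best = max(emv, key=emv.get)
for name, value in emv.items():
    print(f"{name}: {value:.0f}")
print("Best alternative:", best)
```

Under this assumption the royalty alternative has the highest net EMV, which agrees with the conclusion stated in the text.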
13.8 DECISION MAKING UNDER UTILITY
Money cannot always be the sole criterion for a decision. Even when expected values are
calculated in money terms, money may not mean everything. Some people prefer to take
risks; some take pride in starting a new venture.
A rational decision maker will choose the alternative which optimises the expected utility
rather than the expected monetary value. Once we know the individual's utility function, along
with the probabilities assigned to the outcomes in a particular situation, the total expected
utility for each course of action can be obtained by multiplying the utility values by their
probabilities and adding the products.
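As an illustration, the stall problem solved earlier in this unit can be re-evaluated in utility terms. The square-root utility function used here is purely hypothetical (a common textbook stand-in for a risk-averse decision maker); the text does not specify any utility function.

```python
from math import sqrt

probability = {"cool": 0.6, "hot": 0.4}
payoff = {
    "hot snack": {"cool": 5000, "hot": 1000},
    "ice-cream": {"cool": 1000, "hot": 6500},
}


def utility(rupees):
    """HYPOTHETICAL risk-averse utility function, chosen only for illustration."""
    return sqrt(rupees)


# Expected utility = sum over events of probability x utility of the pay off.
expected_utility = {
    action: sum(probability[event] * utility(value) for event, value in outcomes.items())
    for action, outcomes in payoff.items()
}
best = max(expected_utility, key=expected_utility.get)
print(expected_utility)
print("Best choice by expected utility:", best)
```

With this particular utility function the ranking of the two stalls is the same as under EMV; a different utility function could reverse it.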
13.9 SUMMARY
Decision making is an integral part of most planning, organizing, controlling and motivating
processes. The decision maker selects one strategy or course of action over others depending
upon utility, sales, cost or rate of return. Decision theory provides a method for rational
decision making when the consequences are not fully deterministic. It provides a framework
for better understanding of the decision situation and for evaluating alternatives.
which would have an impact on profit as shown.
State of Nature S1 S2 S3
I 7,00,000 5,00,000 3,00,000
N 3,00,000 4,50,000 3,00,000
D 1,50,000 0 3,00,000
The estimated probability of substantial rainfall is 0.2, moderate is 0.4 & light is 0.5.
5. A glass factory specializing in crystal is developing a substantial backlog, and the firm's
management is considering 3 courses of action:
S1 arrange for subcontracting, S2 arrange overtime, S3 construct new facilities. The future
demand may be low, medium or high, with probabilities 0.1, 0.5 and 0.4 respectively.
The profit matrix is as shown.
Profit S1 S2 S3
Medium 50 60 20
13.12 REFERENCES
1. Gupta S.P. Business Statistics, New Delhi: S Chand and Sons Publishers, 2000
2. Shahsi Kumar. Quantitative Techniques and methods, Mysuru: Chetana Book House,
2010
3. Vignanesh Prajapathi, Big data Analysis With R and Hadoop, Mumbai: Packt Publishing,
2013
4. SD Sharma, Operation Research, Delhi: Discovery Publishing House, 1997
5. Srinath L. S, PERT and CPM, Delhi: East West Press,2001
6. Kalavathy, Operation Research , New Delhi: Vikas Publishing House, 2010
7. Richard I. Levin. Statistics for Management, New Delhi: Pearson education India, 2008
UNIT 14 : NETWORK ANALYSIS
STRUCTURE
14.0 Objectives
14.1 Introduction
14.2 Network Analysis
14.3 Application of PERT and CPM techniques to Business problems.
14.4 Distinction between PERT and CPM.
14.5 Basic concepts of Network Analysis
14.6 Rules of Network Construction
14.7 Fulkerson’s Rule (i-j Rule) of Numbering Events
14.8 Illustrations
14.9 Summary
14.10 Key Words
14.11 Self Assessment Questions
14.12 References
14.0 OBJECTIVES
14.1 INTRODUCTION
A project such as construction of a bridge, highway, flyover, power plant, repair and
maintenance of oil refineries or an air plane; design, development and marketing of a new
product; research and development work, etc. may be defined as a collection of interrelated
activities (tasks) which must be completed in a specified time according to a specified
sequence (or order) and require resources such as personnel, money, materials, facilities.
The process of dividing the project into these activities is called the work breakdown structure
(WBS). The activity or a unit of work, also called work content, is an identifiable and
manageable work unit. The main objective before starting such projects is: How to schedule
the required activities so as to:
a. Complete the given project on or before a specified time limit.
b. Minimize the cost of completion of the project on or before a specified time limit.
c. Minimize the total project completion time for a given cost.
Hence, before starting any project, it is essential to devise an adequate plan for
scheduling and controlling the various activities (tasks) of the given project. The class of
operations research techniques used for planning, scheduling and controlling large and
complex projects are often referred to as network analysis, network planning, or network
planning and scheduling techniques. PERT and CPM are two well known techniques used for
network analysis.
14.2 NETWORK ANALYSIS
PERT was developed in 1956-58 by the US Navy Special Projects Office in cooperation
with the management consulting firm of Booz, Allen and Hamilton to aid in the
planning and scheduling of the US Navy’s Polaris Missile Programme, which involved over
three thousand different contracting organizations. The objective of the team was to
efficiently plan and produce the Polaris missile system. Since then this technique has proved
to be useful for all jobs or projects which have an element of uncertainty in the estimation
of duration, as is the case with new types of projects both at the Government and
Industry level. In PERT we usually assume that the time to perform each activity is uncertain,
and as such three time estimates, i.e. the optimistic, the pessimistic and the most likely
time estimates, are used.
CPM was developed in 1957 by J.E. Kelley of Remington Rand and M.R. Walker of
E.I. DuPont to aid in the scheduling of routine plant overhaul, maintenance and construction
work. This method differentiates between planning and scheduling. Planning refers to the
determination of activities that must be accomplished and the order in which such activities
should be performed to achieve the objectives of the project. Scheduling refers to the
introduction of time into the plan, thereby creating a timetable for the various activities to
be performed. CPM uses two time and two cost estimates for each activity. CPM operates
on the assumption that the time taken by each activity in the project is already known
precisely.
14.3 APPLICATION OF NETWORK ANALYSIS TO BUSINESS
A few management applications of PERT and CPM are to plan, schedule, monitor and
control projects such as:
i. Construction of buildings, bridges, factories, highways, stadiums, irrigation projects,
etc.
ii. Budget and auditing procedures.
iii. Missile development programmes.
iv. Installation of complex new equipment such as computers or large machinery.
v. Advertising programmes and for development and launching of new products.
vi. Planning of political campaigns.
vii. Strategic and tactical military planning.
viii. Research and development of new products.
ix. Finding the best traffic flow pattern in large cities.
x. Maintenance and overhauling of complicated equipment in the chemical, power plant,
steel and petroleum industries.
xi. Long range planning and developing staffing plans.
xii. Organising of big conferences, public works, etc.
xiii. Shifting of manufacturing plant from one site to another.
xiv. Preparation of bids and proposals for projects of large size.
xv. Launching space programmes.
The basic differences between the two techniques are summarized below;
PERT
1. A probability model with uncertainty in activity duration. The duration of each activity
is normally computed from multiple time estimates with a view to take into account
time uncertainty. These estimates are ultimately used to arrive at the probability of
achieving any given scheduled date of project completion.
3. PERT is normally used for projects involving activities of non-repetitive nature in
which time estimates are uncertain.
4. It helps in pin pointing critical areas in a project so that necessary adjustments can be
made to meet the scheduled completion date of the project.
CPM
1. A deterministic model with well known activity times based upon past experience. It,
therefore, does not deal with uncertainty in time.
2. CPM is suitable for establishing a trade-off for optimum balancing between scheduled
time and cost of the project.
4. CPM deals with costs of project schedules and their minimization. The concept of
crashing is applied mainly to CPM models.
5. It is difficult to use CPM as a controlling device for the simple reason that one must
repeat the entire evaluation of the project each time changes are introduced into
the network.
A fundamental ingredient in both PERT and CPM is the use of network systems as a
means of graphically depicting the current problems or proposed project. A PERT or CPM
network consists of two major components as discussed below:
Event
Events of the network represent project milestones, such as the start or the completion
of an activity (task) or activities, and occur at a particular instant of time at which some
specific part of the project has been or is to be achieved; therefore events do not consume
time or resources. Events are commonly represented by circles. The event circles are called
nodes in the network diagram. Events can be further classified into following types:
Merge Event
When more than one activity comes and joins an event, the event is known as a merge event.
Burst Event
When more than one activity leaves an event, the event is known as a burst event.
An event may be a merge and a burst event simultaneously: with respect to some
activities it is a merge event, and with respect to other activities it is a burst event.
Activities are identified by the numbers of their starting (tail) event and ending (head)
event. For an arrow (i, j) extended between two events, the tail event 'i' represents the start
of the activity and the head event 'j' represents the completion of the activity.
       Activity
(i) -------------> (j)
Predecessor Activity
An activity which must be completed before one or more other activities start is
known as predecessor activity.
Successor Activity
An activity which starts immediately after one or more other activities are
completed is known as a successor activity.
Concurrent activity
Dummy Activity
An activity which does not consume any resource or time is known as a dummy
activity. A dummy activity is added to the network only to represent the given precedence
relationships among activities of the project, and is depicted by dotted lines in the network
diagram. It is needed when:
1. Two or more parallel activities in a project have the same head and tail events, or
2. Two or more activities have some (but not all) of their immediate predecessor activities
in common.
Sequencing
14.8 ILLUSTRATIONS
After learning the concepts and rules of constructing networks, let us try to solve some
simple problems.
Problem-1:
Draw a network for the following project and number the events according to Fulkerson’s
rule.
1. A is the start activity and K is the end activity.
2. J is the successor activity to F.
3. C and D are successor activity to B.
4. D is the preceding activity to G.
5. E and F occur after event C.
6. E precedes J.
7. Restrains the occurrence of J and G precedes H.
8. K succeeds activity.
Solution:
Figure 3.5
Problem-2:
Solution
Figure 3.6
Problem- 3:
Construct a network for the project whose activities and their precedence
relationship are given below:
Activity No A B C D E F G H I
Solution
Figure 3.7
Problem- 4:
Solution:
Given A < C which means that C cannot be started until A is completed i.e. A is the
preceding activity to C. The above constraints can be given in the following table.
Event No A B C D E F G H I J K
Figure
3.8
Problem-5:
The sequence of activities together with their predecessors for manufacturing an item are
given below, draw the network diagram.
Activity A B C D E F G H
Solution
Figure 3.9
Problem -6:
Given in the table are the activities and sequence necessary for a maintenance job on the
material handling equipment in a factory. Draw a network.
Activity A B C D E F G H I J
Solution:
Figure 3.10
Problem - 7:
Activity A B C D E F G H I
Solution
Figure 3.11
14.9 SUMMARY
14.10 KEY WORDS
Activity
Event
Merge event
Burst event
Predecessor
Successor
Sequencing
Network
Case Study-1:
Sigma Ltd., an Airplane manufacturing company wants to draw a network for their
project. The following information is available. Draw the project for Sigma Ltd.
Activity A B C D E F G H I J K
Case Study - 2:
Zen Limited has listed the activities and sequence requirements necessary for the
machine maintenance in their company. Draw a network diagram.
Activity        A   B   C   D   E   F   G     H   I     J   K     L   M   N     O   P   Q   R   S         T
Pre-requisite   -   -   B   A   C   B   A,F   G   A,F   G   J,H   J   L   M,K   N   O   P   E   Q,D,I,R   R
UNIT 15 : SOLUTION TO NETWORK PROBLEMS
STRUCTURE
15.0 Objectives
15.1 Introduction
15.7 Summary
15.10 References
15.0 OBJECTIVES
15.1 INTRODUCTION
The objective of critical path analysis is to estimate the total project duration and to
assign starting and finishing times to all the activities involved in the project. This helps in
checking actual progress against the scheduled duration of the project.
The duration of individual activities may be uniquely determined (in case of CPM) or
may involve the three-time estimates (in case of PERT) out of which the expected duration
of an activity is computed. Having computed this, the following factors should be known to
prepare project scheduling:
1. Total completion time of the project.
2. Earliest and latest start time of each activity.
3. Float of each activity, i.e. the amount of time by which the completion of an activity
can be delayed without delaying the total project completion time.
4. Critical activities and the critical path.
activity can finish without affecting the total project time.
6. LFij = Latest finish time for activity (i,j). It is the latest time by which an activity must
get completed without delaying project completion.
7. tij= Duration of activity (i,j).
For calculating the above mentioned times, two methods namely forward pass and
backward pass are employed.
2. Calculate the latest finish time of each activity which ends at event j. This is equal to
the latest occurrence time of event j, that is LFij = Lj for all activities (i,j)
ending at event j.
3. Calculate the latest start time of all activities ending at j. It is obtained by subtracting
the duration of the activity from the latest finish time of the activity, that is
LFij = Lj
and LSij = LFij - tij
= Lj - tij for each activity (i,j) ending at event j.
5. Calculate the latest occurrence time of event i (i<j). This is the minimum of the latest
start times of all activities starting from that event, that is Li = Min (LSij) = Min (Lj - tij)
over all immediate successor activities (i,j).
6. If i = 1 (initial event), then the latest occurrence time L1 of the initial event is given by
L1 = Min (Lj - t1j)
The length of the critical path is the sum of the individual times of all the critical activities
lying on it and defines the minimum time required to complete the project.
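The forward and backward passes can be sketched in Python as follows. The five-event network used here is a hypothetical example, not one of the problems in this unit, and the function name `cpm_event_times` is illustrative. The sketch assumes events are numbered so that i < j for every activity (i, j), as Fulkerson's rule ensures.

```python
def cpm_event_times(activities):
    """activities maps (i, j) -> duration, with i < j for every activity.

    Returns (E, L): earliest and latest occurrence time of every event.
    """
    events = sorted({event for arc in activities for event in arc})

    # Forward pass: Ej = Max (Ei + tij) over all activities ending at j.
    E = {event: 0 for event in events}
    for j in events:
        incoming = [E[i] + t for (i, head), t in activities.items() if head == j]
        if incoming:
            E[j] = max(incoming)

    # Backward pass: Li = Min (Lj - tij) over all activities starting at i.
    L = {event: E[events[-1]] for event in events}
    for i in reversed(events):
        outgoing = [L[j] - t for (tail, j), t in activities.items() if tail == i]
        if outgoing:
            L[i] = min(outgoing)
    return E, L


# Hypothetical network (durations in weeks), used only to illustrate the method.
activities = {(1, 2): 3, (1, 3): 2, (2, 4): 4, (3, 4): 6, (4, 5): 1}
E, L = cpm_event_times(activities)

# An activity is critical when both its events have E = L and it fills the gap exactly.
critical = [
    (i, j) for (i, j), t in activities.items()
    if E[i] == L[i] and E[j] == L[j] and E[j] - E[i] == t
]
print("E:", E)
print("L:", L)
print("Critical activities:", critical)
```

For this hypothetical network the critical path is 1-3-4-5 and the project duration is the earliest time of the last event.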
The critical path on a network diagram can be identified as follows:
1. For all activities (i,j) lying on the critical path, the E values and the L values for the tail
and head events are equal, that is Ei = Li and Ej = Lj.
Float is defined as the difference between the latest and the earliest activity time.
Slack is defined as the difference between the latest and earliest event time. There are three
types of floats:
Total Float : It refers to the amount of time by which the completion of an activity could be
delayed beyond the earliest expected completion time without affecting the overall project
duration time.
Mathematically, the total float of an activity (i,j) is the difference between the latest start
time and the earliest start time of that activity.
Total float (TFij) = LSij - ESij
= (Lj - Ei) - tij
Free Float : The time by which the completion of an activity can be delayed beyond the
earliest finish time without affecting the earliest start time of a succeeding activity. Free
float is calculated as Free float (FFij) = (Ej - Ei) - tij
Independent Float : The amount of time by which the start of an activity can be delayed
without affecting the earliest start time of any immediately following activity, assuming that
the preceding activity has finished at its latest finish time. A negative value of independent
float is taken as zero. Independent float is calculated as Independent float (IFij) =
(Ej - Li) - tij
Problem - 1: Kaizen Limited has decided to add a new product to its line. It will outsource
the product from another firm, package it, and sell it to a number of distributors.
The following are the activities to be completed to implement the above
project.
B Hire Salesmen 4
C Train Salesmen 7
G Design package 2
K Select distributors 9
L Sell to distributors 3
M Ship stock 5
The precedence relation between each activity is shown in the following diagram:
Figure 4.1 (network from START to END showing the activities Organize Sales Office, Hire Salesmen, Train Salesmen, Select Advertising Agency, Plan Advertising Campaign, Conduct Advertising Campaign, Order Stock, Design package, Set up Packing Facility, Package stock, Select Distributors, Sell to Distributors and Ship stock to Distributors)
3. For each non critical activity, find the total and free float.
Solution:
Forward Pass Method:
E1 = 0
E2 = E1 + t1,2 = 0 + 6 = 6
E3 = E1 + t1,3 = 0 + 2 = 2
E5 = E2 + t2,5 = 6 + 4 = 10
E6 = Max (E2 + t2,6; E5 + t5,6)
= Max (6 + 9; 10 + 7) = 17
E7 = E2 + t2,7 = 6 + 2 = 8
E8 = E7 + t7,8 = 8 + 4 = 12
E10 = Max (Ei + ti,10) = Max (E8 + t8,10; E9 + t9,10)
= Max (12 + 10; 20 + 5) = 25
Backward Pass Method:
L10 = E10 = 25
L9 = L10 - t9,10 = 25 - 5 = 20
L1 = Min (L2 - t1,2; L3 - t1,3; L4 - t1,4)
= Min (6 - 6; 4 - 2; 14 - 13) = 0
2. The critical path in the network has been shown by the double line joining all those
events where the two values Ei and Li are equal. The critical path of the project is
1-2-5-6-9-10 and the critical activities are A, B, C, L and M. The total project time is 25
weeks.
3. For each non-critical activity, the total float and free float calculations are listed in the
table below.
Activity  Duration  Earliest Time          Latest Time             Float
(i,j)     (tij)     Start    Finish        Start       Finish     Total             Free
                    (Ei)     (Ei + tij)    (Lj - tij)  (Lj)       (Lj - tij) - Ei   (Ej - Ei) - tij
1-3        2         0        2             2           4          2                 0
1-4       13         0       13             1          14          1                 0
2-6        9         6       15             8          17          2                 2
2-7        2         6        8             9          11          3                 0
4-9        6        13       19            14          20          1                 1
7-8        4         8       12            11          15          3                 0
8-10      10        12       22            15          25          3                 3
Problem - 2: A banking company has decided to modernize one of its branch offices. The
major activities of the project, along with the durations and preceding activities involved in the
renovation process are listed in the table below:
Activity A B C D E F G H I J K L M
Duration (weeks) 4 2 1 12 14 2 3 2 4 3 4 2 2
Solution:
Figure 4.3
2. The critical path of the project is 1-2-3-4-7-8-9-10-12 and the critical activities are E, A,
B, C, K, D, L and J. The total project time is 42 weeks.
Problem-3: Listed in the table are the activities and sequencing requirements necessary
for the completion of a research work.
Activity A B C D E F G H I J K L M
Duration (weeks) 6 5 2 2 2 1 6 5 6 2 4 3 1
3. Find the total, free and independent floats for the various activities.
Solution:
Figure 4.4
Forward Pass Method:
E1 = 0
E2 = E1 + t1,2 = 0 + 5 = 5
E3 = E2 + t2,3 = 5 + 2 = 7
E4 = E3 + t3,4 = 7 + 2 = 9
E5 = Max (E1 + t1,5; E4 + t4,5)
= Max (0 + 6; 9 + 0) = 9
E6 = E5 + t5,6 = 9 + 2 = 11
E7 = E5 + t5,7 = 9 + 6 = 15
E8 = Max (E7 + t7,8; E6 + t6,8)
= Max (15 + 0; 11 + 5) = 16
E9 = E8 + t8,9 = 16 + 6 = 22
E10 = Max (E7 + t7,10; E9 + t9,10)
= Max (15 + 4; 22 + 2) = 24
E11 = E10 + t10,11 = 24 + 3 = 27
E12 = Max (E11 + t11,12; E4 + t4,12)
= Max (27 + 1; 9 + 1) = 28
Backward Pass Method:
L12 = 28
L11 = L12 - 1 = 28 - 1 = 27
L10 = L11 - 3 = 27 - 3 = 24
L9 = L10 - 2 = 24 - 2 = 22
L8 = L9 - 6 = 22 - 6 = 16
L7 = Min (L10 - 4; L8 - 0)
= Min (24 - 4; 16 - 0) = 16
L6 = L8 - 5 = 16 - 5 = 11
L5 = Min (L7 - 6; L6 - 2)
= Min (16 - 6; 11 - 2) = 9
L4 = Min (L12 - 1; L5 - 0)
= Min (28 - 1; 9 - 0) = 9
L3 = L4 - 2 = 9 - 2 = 7
L2 = L3 - 2 = 7 - 2 = 5
L1 = Min (L5 - 6; L2 - 5)
= Min (9 - 6; 5 - 5) = 0
Activity  Duration  Earliest        Latest          Total   Free    Independent
                    Start  Finish   Start  Finish   Float   Float   Float
A          6         0      6        3      9        3       3       3
B          5         0      5        0      5        0       0       0
C          2         5      7        5      7        0       0       0
D          2         7      9        7      9        0       0       0
E          2         9     11        9     11        0       0       0
G          6         9     15       10     16        1       0       0
H          5        11     16       11     16        0       0       0
I          6        16     22       16     22        0       0       0
J          2        22     24       22     24        0       0       0
K          4        15     19       20     24        5       5       4
L          3        24     27       24     27        0       0       0
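The float columns of this table can be checked from the event times of the forward and backward passes. The sketch below takes the E and L values of the worked solution as given and applies the three float formulas to a few activities; the mapping of activity letters to event pairs (A = 1-5, G = 5-7, H = 6-8, K = 7-10) is inferred from the pass calculations and should be treated as an assumption.

```python
# Earliest (E) and latest (L) event times from the worked solution above.
E = {1: 0, 2: 5, 3: 7, 4: 9, 5: 9, 6: 11, 7: 15, 8: 16, 9: 22, 10: 24, 11: 27, 12: 28}
L = {1: 0, 2: 5, 3: 7, 4: 9, 5: 9, 6: 11, 7: 16, 8: 16, 9: 22, 10: 24, 11: 27, 12: 28}

# activity name -> (tail event i, head event j, duration tij); mapping is an assumption.
activities = {"A": (1, 5, 6), "G": (5, 7, 6), "H": (6, 8, 5), "K": (7, 10, 4)}


def floats(i, j, t):
    """Return (total, free, independent) float of activity (i, j) with duration t."""
    total = (L[j] - E[i]) - t
    free = (E[j] - E[i]) - t
    independent = max(0, (E[j] - L[i]) - t)  # negative values are taken as zero
    return total, free, independent


for name, (i, j, t) in activities.items():
    print(name, floats(i, j, t))
```

Activities with zero total float (H in this selection) lie on the critical path.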
Problem - 4:
Activity 1-2 1-3 2-4 3-4 3-5 4-9 5-6 5-7 6-8 7-8 8-10 9-10
Duration (weeks) 4 1 1 1 6 5 4 8 1 2 5 7
2. Compute the total float free float and independent float for each activity
3. Find the critical path and the total duration of the project.
Solution:
Figure 4.5
Activity  Duration  Earliest        Latest          Total   Free    Independent
                    Start  Finish   Start  Finish   Float   Float   Float
1-2        4         0      4        7     11        7       0       0
1-3        1         0      1        0      1        0       0       0
2-4        1         4      5       11     12        7       2       0
3-4        1         1      2       11     12       10       5       5
3-5        6         1      7        1      7        0       0       0
4-9        8         7     15        7     15        0       0       0
5-6        4         7     11       12     16        5       0       0
5-7        2        15     17       15     17        0       0       0
6-8        1        11     12       16     17        5       5       0
7-8        5         5     10       12     17        7       0       0
8-10       7        10     17       15     22        5       5       0
9-10       5        17     22       17     22        0       0       0
15.8 SUMMARY
Network diagrams are constructed from the time required for each activity together with its
preceding and succeeding activities. The free time or slack time can be calculated using the
network. Managers can plan resources by checking the available free time. For example, if
activity 5-6 has 3 days of free time and activity 5-7 is critical, then the labourers can be
deputed to 5-7 instead of 5-6.
15.10 SELF ASSESSMENT QUESTIONS
Duration (months) 2 2 1 4 8 5 3 1 5 4 3
Case Study-2: The R&D of SONY is developing a new power supply for a high definition
television. The job is broken down into following form:
Job Description Predecessor Expected
job time (days)
C Choose Rectifier B 2
D Choose Filter B 3
E Choose Transformer C 1
F Choose Chassis D 2
G Choose Mounting C 1
1. Draw a critical path scheduling arrow diagram, identify the critical path
2. What is the minimum time for completion of the job?
UNIT 16 : DIFFERENT TIME ESTIMATES – P E R T
STRUCTURE
16.0 Objectives
16.1 Introduction
16.2 PERT with three time estimates
16.3 PERT Procedure
16.4 Illustrations
16.5 Summary
16.6 Key Words
16.7 Self Assessment Questions
16.8 References
16.0 OBJECTIVES
After studying this unit, you should be able to ;
Discuss the three time estimates computation.
Explain the procedure of computing PERT
Examine the steps involved in the computing
Determine the expected time of project completion.
Solve the problems related to PERT and CPM
16.1 INTRODUCTION
PERT (Programme Evaluation and Review Technique) is essentially a management
technique and, if tailored properly, can be used with advantage for responsibility accounting
in addition to attaining other well defined objectives. Managers have found this technique
of immense value where adopted judiciously and when configurations of events and activities
are correctly assessed and their times are realistically worked out.
PERT is designed for scheduling complex projects that involve many inter-related
tasks. It improves the planning process because:
1. It helps the planner to define the project’s various component activities and events
logically.
2. It provides a basis for normal time estimates, and yet allows for some measure of optimism
or pessimism in estimating the completion dates.
3. It shows the effects of changes to the overall plan as they are contemplated.
4. It provides a built-in means for on-going evaluation of the plan.
5. It facilitates the process of communication between planners and management by either
adhering to organizational lines or crossing over them. In essence, PERT makes the
clear-cut assignment of responsibility possible.
Optimistic Time (to or a)
This is the shortest (minimum) possible time to perform an activity, assuming that
everything goes well.
With tm as the most likely time and tp as the pessimistic time, the expected time of an activity is given by
$$t_e = \frac{t_o + 4t_m + t_p}{6}$$
Variance of the activity is given by
$$\sigma^2 = \left(\frac{t_p - t_o}{6}\right)^2$$
Step-2: Compute the expected time duration of each activity using te = (to + 4tm + tp) / 6
Step-4: Compute the earliest start, earliest finish, latest start, latest finish and total float
of each activity.
Step-5: Determine the critical path and identify the critical activities.
Step-6: Compute the expected variance of the project length σ², which is the sum of
the variances of all the critical activities, and hence find the standard deviation of
the project length σ.
Step-7: Compute the standard normal deviate Z = (Ts - Te) / σ, where Ts is the scheduled
completion time and Te is the expected project length, and read the corresponding
probability from the normal distribution table.
16.4 ILLUSTRATIONS
Problem-1:
A small project comprises activities whose time estimates (in weeks) are given in the table below:
Activity   to   tm   tp
1-2         1    1    7
1-3         1    4    7
1-4         2    2    8
2-5         1    1    1
3-5         2    5   14
4-6         2    5    8
5-6         3    6   15
Solution:
The expected time and variance of each activity is calculated in the following table:
Activity   Optimistic   Most Likely   Pessimistic   te =                  Variance σ² =
           Time (to)    Time (tm)     Time (tp)     (to + 4tm + tp)/6     [(tp - to)/6]²
1-2         1            1             7             2                     1
1-3         1            4             7             4                     1
1-4         2            2             8             3                     1
2-5         1            1             1             1                     0
3-5         2            5            14             6                     4
4-6         2            5             8             5                     1
5-6         3            6            15             7                     4
a) By examining all the paths we see that the critical path is 1-3-5-6.
b) The expected project length is the sum of the duration of each critical activity. That is
duration of project = 4 + 6 + 7=17 weeks.
c) Variance of project length is the sum of the variances of the critical activities, that is,
σ² = 1 + 4 + 4 = 9, so the standard deviation of the project length is σ = 3 weeks.
1. The probability that the project will be completed at least 4 weeks earlier than the expected
time of 17 weeks is given by
$$P\left(Z \le \frac{T_s - T_e}{\sigma}\right) = P\left(Z \le \frac{(17 - 4) - 17}{3}\right) = P(Z \le -1.33)$$
From the normal distribution table, the value for Z = -1.33 is 1 - 0.9082 = 0.0918. Thus the
probability of completing the project within 13 weeks (that is, 4 weeks earlier) is
0.0918 = 9.18%.
2. The probability that the project will be completed no more than 4 weeks later than the
expected time of 17 weeks is given by
$$P\left(Z \le \frac{T_s - T_e}{\sigma}\right) = P\left(Z \le \frac{(17 + 4) - 17}{3}\right) = P(Z \le 1.33)$$
From the normal distribution table, the value for Z = 1.33 is 0.9082. Thus the probability of
completing the project within 21 weeks (that is, 4 weeks later) is 0.9082 = 90.82%.
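The whole calculation for this problem can be sketched in Python. The helper names are illustrative; the normal probability is computed with `math.erf` instead of a printed table, so the results differ slightly from the table values, which use Z rounded to two decimals.

```python
from math import erf, sqrt

# (to, tm, tp) in weeks for each activity of Problem-1.
estimates = {
    (1, 2): (1, 1, 7), (1, 3): (1, 4, 7), (1, 4): (2, 2, 8), (2, 5): (1, 1, 1),
    (3, 5): (2, 5, 14), (4, 6): (2, 5, 8), (5, 6): (3, 6, 15),
}


def expected_time(to, tm, tp):
    return (to + 4 * tm + tp) / 6


def variance(to, tm, tp):
    return ((tp - to) / 6) ** 2


def normal_cdf(z):
    """P(Z <= z) for a standard normal variable."""
    return 0.5 * (1 + erf(z / sqrt(2)))


critical_path = [(1, 3), (3, 5), (5, 6)]  # taken from the solution above
Te = sum(expected_time(*estimates[a]) for a in critical_path)
sigma = sqrt(sum(variance(*estimates[a]) for a in critical_path))


def probability_within(Ts):
    """Probability of finishing the project within Ts weeks."""
    return normal_cdf((Ts - Te) / sigma)


print("Expected length:", Te, "Standard deviation:", sigma)
print("P(within 13 weeks):", round(probability_within(13), 4))
print("P(within 21 weeks):", round(probability_within(21), 4))
```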
Problem-2: The following table gives the three estimates draw the network of the project and
calculate the slack for each event. Find the critical path and the probability of completing the
project in 35 days.
Activity 1-2 1-3 2-5 3-4 4-5 5-8 4-6 4-7 6-9 8-9 7-10 9-10
to 3 1 6 8 0 5 6 3 1 3 5 2
tm 5 2 8 12 0 7 9 6 2 5 14 5
tp 7 3 12 17 0 9 12 8 3 8 17 6
The expected time and variance of each activity is calculated in the following table:
Activity   Optimistic   Most Likely   Pessimistic   te =                  Variance
           Time (to)    Time (tm)     Time (tp)     (to + 4tm + tp)/6     σ²
1-2         3            5             7             5                     0.44
1-3         1            2             3             2                     0.11
2-5         6            8            12             8.33                  1
3-4         8           12            17            12.17                  2.25
4-5         0            0             0             0                     0
5-8         5            7             9             7                     0.44
4-6         6            9            12             9                     1
4-7         3            6             8             5.83                  0.69
6-9         1            2             3             2                     0.11
8-9         3            5             8             5.17                  0.69
7-10        5           14            17            13                     4
9-10        2            5             6             4.67                  0.44
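The critical path is 1-3-4-7-10, with an expected project length of Te = 2 + 12.17 + 5.83 + 13 = 33 days.
The expected variance of the critical path is σ² = 0.11 + 2.25 + 0.69 + 4 = 7.05, so σ = 2.65 days.
$$Z = \frac{T_s - T_e}{\sigma} = \frac{35 - 33}{2.65} = 0.75$$
For Z = 0.75 the normal distribution table gives a value of 0.7734. Thus the probability of
completing the project in 35 days = 0.7734 = 77.34%.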
The various floats are calculated as below :
Activity Duration Earliest Latest Total Free Independent
Activity       A    B    C    D    E    F    G    H    I
to             5   18   26   16   15    6    7    7    3
tp            10   22   40   20   25   12   12    9    5
tm             8   20   33   18   20    9   10    8    4
Predecessor    -    -    -    A    A    B    C    D    E,F
Solution:
Activity   Optimistic   Pessimistic   Most Likely   te =                  Variance σ² =
           Time (to)    Time (tp)     Time (tm)     (to + 4tm + tp)/6     [(tp - to)/6]²
1-2         5           10             8             7.8                   0.696
1-3        18           22            20            20                     0.444
1-4        26           40            33            33                     5.429
2-5        16           20            18            18                     0.443
2-6        15           25            20            20                     2.780
3-6         6           12             9             9                     1.000
4-7         7           12            10             9.8                   0.694
5-7         7            9             8             8                     0.111
6-7         3            5             4             4                     0.111
3. The earliest and latest expected time for each event will be calculated by
considering the expected time of each activity.
Forward Pass:
E1 = 0
E2 = E1 + t1,2 = 0 + 7.8 = 7.8
E3 = E1 + t1,3 = 0 + 20 = 20
E4 = E1 + t1,4 = 0 + 33 = 33
E5 = E2 + t2,5 = 7.8 + 18 = 25.8
E6 = Max (E2 + t2,6; E3 + t3,6)
= Max (7.8 + 20; 20 + 9) = 29
E7 = Max (Ei + ti,7)
= Max (E5 + t5,7; E6 + t6,7; E4 + t4,7)
= Max (25.8 + 8; 29 + 4; 33 + 9.8)
= 42.8
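Backward Pass:
L7 = E7 = 42.8
L6 = L7 - t6,7 = 42.8 - 4 = 38.8
L5 = L7 - t5,7 = 42.8 - 8 = 34.8
L4 = L7 - t4,7 = 42.8 - 9.8 = 33
L3 = L6 - t3,6 = 38.8 - 9 = 29.8
L2 = Min (Lj - t2,j)
= Min (L6 - t2,6; L5 - t2,5)
= Min (38.8 - 20; 34.8 - 18) = 16.8
L1 = Min (Lj - t1,j)
= Min (L2 - t1,2; L3 - t1,3; L4 - t1,4)
= Min (16.8 - 7.8; 29.8 - 20; 33 - 33) = 0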
4. The last event 7 will occur only after 42.8 weeks. For this we require only the
durations of the critical activities. This will help us in calculating the standard
deviation of the duration of the last event.
The probability of finishing the project in 41.5 weeks is obtained from
$$Z = \frac{T_s - T_e}{\sigma} = \frac{41.5 - 42.8}{2.474} = -0.52$$
From the normal distribution table, the value for Z = -0.52 is 1 - 0.70 = 0.30, that is,
the probability of completing the project in 41.5 weeks is 30%.
Problem -4: Consider the following project.
Activity Three Time Estimates Predecessor
to tm tp
A 3 6 9 -
B 2 5 8 -
C 2 4 6 A
D 2 3 10 B
E 1 3 11 B
F 4 6 8 C,D
G 1 5 15 E
Find the critical path and the standard deviation of the project length. Also find the
probability of completing the project within 18 weeks.
Solution: The expected time and the variance of each activity is calculated as shown in
the table below:
Activity   Optimistic   Most Likely   Pessimistic   te =                  Variance σ² =
           Time (to)    Time (tm)     Time (tp)     (to + 4tm + tp)/6     [(tp - to)/6]²
A           3            6             9             6                     1
B           2            5             8             5                     1
C           2            4             6             4                     0.444
D           2            3            10             4                     1.778
E           1            3            11             4                     2.778
F           4            6             8             6                     0.444
G           1            5            15             6                     5.444
Figure 5.4
$$Z = \frac{T_s - T_e}{\sigma} = \frac{18 - 16}{1.374} = 1.456$$
From the normal distribution table, the value for Z = 1.456 is 0.92647, that is, the
probability of completing the project by 18 weeks is 92.65%.
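When activities are given with predecessor lists, as in this problem, the critical path can be found as the longest chain of expected times. The sketch below is one way to do this; it assumes every activity is listed after its predecessors, and the variable names are illustrative.

```python
from math import sqrt

# activity -> (to, tm, tp, predecessors), listed so that predecessors come first.
tasks = {
    "A": (3, 6, 9, []),
    "B": (2, 5, 8, []),
    "C": (2, 4, 6, ["A"]),
    "D": (2, 3, 10, ["B"]),
    "E": (1, 3, 11, ["B"]),
    "F": (4, 6, 8, ["C", "D"]),
    "G": (1, 5, 15, ["E"]),
}

te = {name: (to + 4 * tm + tp) / 6 for name, (to, tm, tp, _) in tasks.items()}
var = {name: ((tp - to) / 6) ** 2 for name, (to, tm, tp, _) in tasks.items()}

# Earliest finish of each activity, remembering which predecessor finishes last.
finish, latest_pred = {}, {}
for name, (_, _, _, preds) in tasks.items():
    latest_pred[name] = max(preds, key=finish.get) if preds else None
    start = finish[latest_pred[name]] if preds else 0.0
    finish[name] = start + te[name]

# Trace the critical path back from the activity that finishes last.
node = max(finish, key=finish.get)
project_length = finish[node]
path = []
while node is not None:
    path.append(node)
    node = latest_pred[node]
path.reverse()

sigma = sqrt(sum(var[name] for name in path))
print("Critical path:", "-".join(path))
print("Expected length:", project_length, "Standard deviation:", round(sigma, 3))
```

The resulting expected length and standard deviation are the values that enter the Z calculation for this problem.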
16.5 SUMMARY
In this block you have gained a fair knowledge of network analysis. You have learnt to find
the total time required to complete a project. You are also able to compute the expected time
from different time estimates. Further, you can learn about crashing of projects, wherein you
try to squeeze the project time by incurring extra cost. The normal distribution tables
you learnt in block three are also used here.
16.6 KEY WORDS
PERT
Activity
Optimistic
Probability
Pessimistic
Projects
Case Study-1:
Optimistic Time 4 1 6 2 5 3 3 1 4
Pessimistic Time 16 15 30 8 17 15 27 7 28
Case Study-2: The owner of a retail outlet is considering a new computer system for
transaction and inventory management. A computer company has sent the following
instructions with regard to the installation of the system.