BUSINESS MANAGEMENT (PGDBM)

Editor & Coordinator: Prof. B. Raja Shekhar

Contributors:
2. Prof. A. K. Pujari, Dean, School of Mathematics & Computer and Information Sciences, University of Hyderabad
5. Mr. Sumit Dey, Supply Chain Manager, VST Industries Ltd, Hyderabad
PREFACE
In this dynamic and complex business environment, the role of the manager is becoming critical. Among the various managerial functions, decision making is gaining importance. The success or failure of a business is closely related to the decision-making skills of its managers. The purpose of this course is to introduce students to various quantitative and research tools for understanding the decision-making process. The course exposes students to the statistical tools used in decision making and research. It is divided into two blocks: Block I deals with Statistics and Block II with Research Methodology. Block I is divided into ten units, whereas Block II comprises six units. The course has been written on the assumption that students are familiar with the fundamentals of school-level mathematics.
Quantitative techniques play a vital role in the business decision-making process. The area can be broadly divided into two parts: Statistics and Operations Research. This course introduces various topics in statistics to familiarize students with the decision-making process. Statistics is a broad subject that cannot be covered exhaustively in a distance-education programme. Within this limitation, an attempt is made to cover everything from the fundamentals of statistics to advanced tools such as non-parametric tests. The first unit in Block I introduces decision making and statistics, discussing the decision-making process and various definitions of statistics at length. The second unit, Measures of Central Tendency and Dispersion, covers measures such as the arithmetic mean, geometric mean, harmonic mean, median, and mode, with solved problems; the averages are followed by measures of dispersion such as the range, mean deviation, quartiles, and standard deviation. After the introduction and central tendency, the most important topic in statistics is probability. The entire third unit is devoted to probability, from definitions to applications, followed by three probability distributions: the binomial, Poisson, and normal distributions.
In business and management, the effect of one variable on another determines profitability, stability, and performance. Various aspects of correlation are discussed in order to quantify the association between two or more variables. The unit ends with a discussion of rank correlation, where ranks are given instead of actual scores. Once an association between two variables has been established by correlation, one will want to measure the nature of the relationship and the change in one variable in response to a change in the other. Unit five covers this in the form of regression and regression equations. A common phenomenon in any business is the fluctuation of sales and demand, which arises for various reasons. These include several components of a time series: seasonal, cyclical, secular, and irregular. Such trend analysis is useful in forecasting sales.
The seventh unit, Concept of Sampling and Estimation, deals with various methods of sampling, both probabilistic and non-probabilistic, and discusses the problems and advantages of each sampling technique. Sampling is followed by estimation, which covers point estimates and interval estimates. The primary objective of a sample is to get a feel for the whole population with the help of a representative sample. In most business situations, population characteristics are estimated from sample data, and in the process one needs to make some hypothesis about the population. Unit eight, Testing of Hypotheses, deals with these topics and also exposes students to various tests of hypotheses, such as the z-test and t-test for single samples, two samples, and proportions. The entire ninth unit is devoted to analysis of variance (ANOVA), used to test whether two samples are drawn from the same population or from different populations with the same mean. The last unit on quantitative methods deals with non-parametric tests, used where the population is not normal. One of the popular non-parametric tests is the chi-square test, and a detailed description with examples is given in this unit. Besides the chi-square test, other non-parametric tests, namely the sign test and the rank-sum test, are also covered. Although the syllabus of the course is heavily loaded with content, necessary steps have been taken to make the concepts simple and clear with the help of a good number of examples and exercise problems.
It is a difficult task to condense two important courses into a single one, but efforts have been made to cover all essential concepts with relevant applications. Even after my best efforts, there may be some errors and deficiencies in the course. I will be extremely thankful for your valuable suggestions and comments, which will be addressed in the next print.
CONTENTS
Preface
4) Correlation
5) Regression
6) Time Series
8) Testing of Hypotheses
9) Analysis of Variance
1. LEARNING OBJECTIVES:
1.1. INTRODUCTION:
Whether it is a factory, a farm, or a domestic kitchen, resources of men, machines, and money have to be coordinated against time and space constraints to achieve given objectives in the most efficient manner. The manager has to constantly analyze the existing situation, determine objectives, seek alternatives, implement, coordinate, control, and evaluate. The common thread of these activities is the capability to evaluate information and make decisions. Managerial activities become complex as the organizational settings in which they are performed become complex. As complexity increases, management becomes more of a science than an art, and the manager by birth yields place to the manager by profession. There is an increasing realization of the importance of Statistics in various quarters, reflected in its growing use in government, industry, business, agriculture, mining, transport, education, medicine, and so on.
Quantitative analysis uses a scientific approach to decision making. It consists of defining a problem, developing a model, acquiring input data, developing a solution, testing the solution, analyzing the results, and implementing the results. Quantitative analysis has been in existence since the beginning of recorded history, but it was Frederick W. Taylor who, in the early 1900s, pioneered the principles of the scientific approach to management. During World War II, many new scientific and quantitative techniques were developed to assist the military. These developments were so successful that after the war many companies began using similar techniques in managerial decision making and planning. Today, many organizations employ a staff of Operations Research or Management Science personnel, or consultants, to apply the principles of scientific management to problems and opportunities.
To a great extent, the successes or failures that a person experiences in life depend on the decisions that he or she makes. One decision may make the difference between a successful career and an unsuccessful one. "Decision theory is an analytic and systematic way to tackle problems." A good decision is one that is based on logic, considers all available data and possible alternatives, and applies the quantitative approach. Although occasionally good decisions yield bad results, in the long run the use of decision theory results in successful outcomes.
The first step in the quantitative approach is to develop a clear, concise statement of the problem; this is the most important and most difficult step. It is essential to go beyond the symptoms of the problem and identify its true causes. When the problem is difficult to quantify, only a few areas can be taken into consideration.
Once the problem is defined, the next step is to develop a model. A model is a representation (usually mathematical) of a situation. Types of models include physical, scale, schematic, and mathematical models.
(Figure: the steps of the quantitative analysis approach)
1. Defining the Problem
2. Developing a Model
3. Acquiring Input Data
4. Developing a Solution
5. Testing the Solution
6. Analyzing the Results
7. Implementing the Results
Once the model is developed, accurate data must be obtained. Even if the model is a perfect representation of reality, improper data will produce misleading results; this situation is called "garbage in, garbage out". A number of sources can be used in collecting data: company reports and documents, interviews, and statistical sampling procedures. Developing a solution involves manipulating the model to arrive at the best (optimal) solution. The input data and the model determine the accuracy of the solution.
Testing the input data and the model includes determining the accuracy and completeness of the data used by the model. There are several ways to test the data. One method is to collect additional data from a different source: if the original data were collected using interviews, additional data can be collected by direct measurement or sampling and compared with the original data, and statistical tests can be employed to determine whether there are differences between the original data and the additional data.
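The comparison of two data sources described above can be sketched in a few lines of Python. The cost figures and the choice of Welch's two-sample t statistic are illustrative, not from the text:

```python
from statistics import mean, stdev

def welch_t(sample_a, sample_b):
    """Two-sample t statistic (Welch's form, unequal variances).

    A large |t| suggests the two data sources disagree and the
    input data should be re-examined before modelling.
    """
    na, nb = len(sample_a), len(sample_b)
    va, vb = stdev(sample_a) ** 2, stdev(sample_b) ** 2
    return (mean(sample_a) - mean(sample_b)) / (va / na + vb / nb) ** 0.5

# Hypothetical cost figures: the same quantity collected by interview
# and again by direct measurement
interview = [102, 98, 105, 99, 101]
measured = [100, 97, 104, 98, 100]
t = welch_t(interview, measured)  # small |t|: no evidence of a difference
```

Comparing the statistic against a t table (or a chosen critical value) would complete the test; here the statistic alone already shows the two sources broadly agree.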
Analyzing the results starts with determining the implications of the solution. In most cases, a solution will result in some kind of action or change in the way an organization operates. The implications of these actions or changes must be determined and analyzed before the results are implemented. Sensitivity analysis determines how the solution changes with a different model or different input data.
The final step is to implement the results: the process of incorporating the solution into the company.
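Sensitivity analysis can be illustrated with a small sketch. The model below is a stand-in (the classic economic order quantity formula, not a model from the text); the point is only how a ±10% change in one input moves the solution:

```python
def optimal_order_quantity(demand, order_cost, holding_cost):
    """Stand-in model: economic order quantity Q* = sqrt(2*D*Co/Ch)."""
    return (2 * demand * order_cost / holding_cost) ** 0.5

base = optimal_order_quantity(demand=1000, order_cost=50, holding_cost=2)

# Perturb one input and observe how the solution changes.
for change in (-0.10, 0.0, 0.10):
    q = optimal_order_quantity(1000 * (1 + change), 50, 2)
    print(f"demand {change:+.0%}: Q* = {q:.1f} ({q / base - 1:+.1%})")
```

Here a 10% error in the demand estimate moves the solution by only about 5%, which is exactly the kind of robustness information a manager needs before acting on the result.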
Developing a model is an important part of the quantitative analysis approach. The following mathematical model represents profit:
Profit = Revenue - Expenses
Revenue is expressed as price per unit multiplied by the number of units sold. Expenses can be determined by summing fixed costs and variable costs, where variable cost is expressed as variable cost per unit multiplied by the number of units sold. Thus, per the mathematical model:
Profit = (selling price per unit)(number of units sold) - [fixed cost + (variable cost per unit)(number of units sold)]
Profit = SX - [F + VX]
Profit = SX - F - VX
where
S = selling price per unit
F = fixed cost
V = variable cost per unit
X = number of units sold
The parameters in this model are S, F, and V, as these are inputs inherent in the model. The number of units sold (X) is the decision variable of interest.
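The model translates directly into code. The numbers below are illustrative, and the break-even calculation is a natural extension of the model rather than something worked out in the text:

```python
def profit(s, f, v, x):
    """Profit = SX - F - VX, for selling price s, fixed cost f,
    variable cost per unit v, and units sold x."""
    return s * x - f - v * x

def break_even_units(s, f, v):
    """Units sold at which profit is zero: X = F / (S - V)."""
    return f / (s - v)

# Illustrative figures: S = 10, F = 1000, V = 6
print(profit(10, 1000, 6, 400))       # 10*400 - 1000 - 6*400 = 600
print(break_even_units(10, 1000, 6))  # 1000 / (10 - 6) = 250.0
```

Selling 400 units at these figures yields a profit of 600, and the firm must sell at least 250 units before it makes any profit at all.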
If all the values are known with certainty, the model is deterministic. Models that involve risk or chance, often measured as a probability value, are called probabilistic models. For example, the market for a new product might be "good" with a chance of 60% (a probability of 0.6) or "not good" with a chance of 40% (a probability of 0.4).
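The probabilistic model above can be carried one step further to an expected-value calculation; the payoff figures below are hypothetical additions, only the probabilities come from the example:

```python
# Each outcome carries a probability (from the example) and a
# hypothetical payoff.
outcomes = {
    "good": (0.6, 50_000),       # 60% chance
    "not good": (0.4, -20_000),  # 40% chance
}

# Expected profit = sum over outcomes of probability * payoff:
# 0.6 * 50000 + 0.4 * (-20000) = 22000
expected_profit = sum(p * payoff for p, payoff in outcomes.values())
```

A positive expected profit is one quantitative argument for launching the product, although a manager would also weigh the 40% chance of a loss.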
The following are possible problems in the quantitative analysis approach. In the worlds of business, government, and education, problems are, unfortunately, not easily identified. There are four roadblocks that quantitative analysts face in defining a problem.
• Beginning assumptions
The third difficulty is that people have a tendency to state problems in terms of solutions. From an implementation standpoint, "a good solution to the right problem is much better than an optimal solution to the wrong problem".
• Solution outdated
Even with the best of problem statements, there is a fourth danger: the problem can change while the model is being developed. In a rapidly changing business environment, it is not unusual for problems to appear or disappear virtually overnight.
Gathering the data to be used in the quantitative approach to problem solving is often very difficult.
• Using accounting data
One problem is that most data generated in a firm come from basic accounting reports. The accounting department collects its inventory data, for example, in terms of cash flows and turnover. But quantitative analysts tackling an inventory problem need data on holding costs and ordering costs. If they ask for such data, they will often find that the data were simply never collected for those specific costs.
• Validity of data
Data must often be distilled and manipulated before being used in a model. Unfortunately, the
validity of the results of a model is no better than the validity of the data that go into the model.
The next problem is that quantitative models usually give just one answer to a problem. Most managers would like a range of options and not be put in a take-it-or-leave-it position. A more appropriate strategy is to present a range of options, indicating the effect each solution has on the objective function.
The results of quantitative analysis often take the form of predictions of how things will work in the future. To get a preview of how well the solutions will really work, managers are often asked how good the solutions look to them. The problem is that complex models tend to give solutions that are not intuitively obvious, and such solutions tend to be rejected by managers. The quantitative analyst then has to work through the model and convince the manager of the validity of the results. In the process, the analyst will have to review every assumption that went into the model; if the manager can be convinced that the model is valid, there is a good chance that the solution results are also valid.
Once the solution has been tested, the results must be analyzed in terms of how they will affect the total organization. Even small changes in organizations are often difficult to bring about, and if the results indicate large changes in organizational policy, the quantitative analyst can expect resistance. In analyzing the results, the analyst should ascertain who must change and by how much, whether the people who must change will be better or worse off, and who has the power to direct the change.
Implementation is not just another step that takes place after the modeling process is over; each of the earlier steps greatly affects the chances of implementing the results of a quantitative study.
Even though many business decisions can be made intuitively, based on hunches and experience, there are more and more situations in which quantitative models can assist. Some managers, however, fear that the use of a formal analysis process will reduce their decision-making power. Others fear that it may expose some previous intuitive decisions as inadequate. Still others just feel uncomfortable about having to formalize their decision making. These managers often argue against the use of quantitative models.
When the quantitative analyst is not an integral part of the department facing the problem, he or she sometimes tends to treat the modeling activity as an end in itself: the analyst accepts the problem as stated by the manager and builds a model to solve only that problem. When the results are computed, he or she hands them back to the manager and considers the job done. Such an analyst does not care whether the results help to make the final decision and is not concerned with their implementation. Successful implementation requires that the analyst not tell the users what to do, but work with them and take their feelings into account.
In solving problems using the quantitative approach, the essential tools are Quantitative Techniques, which can be subdivided into two parts: Statistics and Operations Research. The entire content of this course is statistics only.
In the early years of Operations Research, an OR group consisted of specialists in mathematics, physics, chemistry, engineering, statistics, and economics, which helped greatly in developing OR models with an interdisciplinary approach. The production scheduling problem, for example, is quite complex when it cuts across the entire firm. Thus, it is necessary to look at the problem in many different ways in order to determine which one (or which combination) of the various disciplinary approaches is best. The interdisciplinary approach recognizes that most business problems have accounting, biological, economic, engineering, mathematical, physical, psychological, sociological, and statistical aspects. The solution to a mathematical model or equation can be thought of as a function of controlled and uncontrolled variables, related in some precise mathematical manner. The objective function developed (utilizing controlled and uncontrolled variables) may have to be supplemented by a set of restrictive statements on the possible values of the controlled variables. Though Operations Research is an interesting subject, its components are beyond the scope of this course material. However, interested students can choose the Decision Analysis course as an elective in the second semester, which covers the content of Operations Research.
Since the complexity of the business environment makes the process of decision making difficult, the decision maker cannot rely entirely on observation, experience, or evaluation to make a decision. Decisions have to be based on data that show relationships, indicate trends, and show rates of change in the relevant variables. The field of statistics provides methods for collecting, presenting, analyzing, and meaningfully interpreting data. Statistical methodology in the collection, analysis, and interpretation of data for better decision making is a basic input for managerial decision making and for research in both the physical and social sciences, particularly in business and economics.
At the outset, it may be noted that the word 'Statistics' is used, rather curiously, in two senses: plural and singular. In the plural sense, it refers to a set of figures, e.g. production and sales of textiles, television sets, and so on. In the singular sense, Statistics refers to the whole body of analytical tools used to collect figures, organise and interpret them, and finally draw conclusions from them.
Statistics has been defined differently by various authors. Some of the definitions are extremely narrow. This is understandable, since Statistics has developed over several decades, and in the earlier days its role was confined to a limited sphere. The following are a few definitions of statistics.
(i) "Statistics are the classified facts representing the conditions of the people in a State, specially those facts which can be stated in number or in tables of numbers or in any tabular or classified arrangement." - Webster
(ii) "Statistics are numerical statements of facts in any department of enquiry placed in relation to each other." - Bowley
(iii) "By Statistics we mean quantitative data affected to a marked extent by multiplicity of causes." - Yule and Kendall
(iv) "Statistics may be defined as the aggregate of facts affected to a marked extent by multiplicity of causes, numerically expressed, enumerated or estimated according to a reasonable standard of accuracy, collected in a systematic manner, for a predetermined purpose and placed in relation to each other." - Prof. Horace Secrist
Statistics came to be considered much wider in scope, and, accordingly, experts gave wider definitions of it. Spiegel, for instance, defines Statistics by highlighting its role in decision making, particularly under uncertainty, as follows: "Statistics is concerned with scientific method for collecting, organizing, summarizing, presenting and analysing data as well as drawing valid conclusions and making reasonable decisions on the basis of such analysis." The figure below shows the classification of the subject matter of the field of statistics.
(Figure: STATISTICS, classified into descriptive statistics, inductive statistics, and statistical decision theory)
Statistical data constitute the basic raw material of the statistical method. These data are either readily available or collected by the analyst. The manager may face four types of situations:
(i) when data need to be presented in a form that aids easy grasping (for example, presentation of performance data in graphs, charts, and tables in the annual report of a company);
(ii) when no specific action is contemplated but it is intended to test some hypotheses and draw inferences;
(iii) when some unknown quantities have to be estimated, or relationships established, through observed data; and
(iv) when a decision has to be made under uncertainty regarding a course of action to be followed.
As indicated in the figure, situation (i) falls in the realm of descriptive statistics, situations (ii) and (iii) fall in the area of inductive statistics, and situation (iv) is dealt with by statistical decision theory. Thus, descriptive statistics refers to the analysis and synthesis of data so that a better description of the situation can be made, thereby promoting a better understanding of the facts. The classification of production and sales in different locations, changes in the values of relevant variables, and the average value of data are also part of descriptive statistics.
Inductive statistics is concerned with the development of scientific criteria so that the values of a group may be meaningfully estimated by examining only a small portion of that group. The group is known as the 'population' or 'universe' and the portion as the 'sample'. Further, values computed from the sample are known as statistics, and values in the population are known as parameters. Thus, inductive statistics is concerned with estimating population parameters from sample statistics. The term inductive statistics derives from the inductive process, which tries to arrive at information about the general (the universe) from knowledge of the particular (the sample).
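The distinction between a parameter (a value of the population) and a statistic (a value computed from the sample) can be demonstrated with simulated data; the population below is entirely hypothetical:

```python
import random
from statistics import mean

random.seed(1)

# Hypothetical "universe": 10,000 values with true mean near 100
population = [random.gauss(100, 15) for _ in range(10_000)]

# A sample of 100; its mean is a *statistic* that estimates the
# population *parameter* (the true mean).
sample = random.sample(population, 100)
print(f"parameter (population mean): {mean(population):.1f}")
print(f"statistic  (sample mean):    {mean(sample):.1f}")
```

The sample mean lands close to, but not exactly at, the population mean; quantifying how close is the subject of estimation and hypothesis testing in later units.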
Samples are drawn instead of a complete enumeration for the following reasons: (i) a full enumeration takes so much time that the data are often obsolete by the time it is completed; (ii) sampling cuts down cost substantially; and (iii) sometimes securing information is a destructive process; for example, in quality control, pieces may have to be broken in order to be tested.
The application of statistical techniques to managerial decision problems depends on the availability and reliability of statistical data. Statistical data can be broadly grouped into two categories: (i) published data, which have already been collected and are readily available in published form, and (ii) unpublished data, which have not yet been collected and which the analyst must collect himself. Data are also classified as primary or secondary: all original data collected by the analysts themselves fall in the category of primary data, while secondary data are those available for use from other sources. Data are also generated by the internal operations of an economic unit; for instance, sales, labour data, financial statements, production schedules, cash flow data, budget data, etc., pertaining to an industrial unit constitute internal data and are found in the internal records of the company. Finally, data are classified as micro or macro: micro data relate to one unit or one region, and macro data relate to the entire economy or an entire industry.
Whether a given problem pertains to business or to some other field, there are some well-defined steps that need to be followed in order to reach meaningful conclusions.
1. Formulation of the problem: The statistical study begins with the formulation of the problem on which the study is to be done. The problem should be understood clearly, and one should neither go beyond its scope nor exclude any relevant aspect.
2. Objectives of the study: After formulating the problem, the objectives of the study are determined. The objectives should not be extremely ambitious, because it may be difficult to achieve them owing to limitations of time, finance, or even the competence of those conducting the study.
3. Determining sources of data: Once the problem and the objectives are formulated, determine the data required to conduct the study. Data can be collected from primary and secondary sources.
4. Designing data collection forms: Once the decision in favour of collecting primary data is taken, one has to decide the mode of collection. The two methods available are (i) the observational method and (ii) the survey method. A suitable questionnaire has to be designed to collect data from respondents in a field survey.
5. Conducting the field survey: While the data collection forms are being designed, one has to decide whether a census survey or a sample survey is to be conducted. For the latter, a suitable sample design and sample size must be chosen. The field survey is then conducted by interviewing the sample respondents. Sometimes the survey is done by mailing questionnaires to the respondents instead of contacting them personally.
6. Organising the data: The field survey provides raw data from the respondents, so it is necessary to organize the data in the form of suitable tables and charts to bring out their salient features.
7. Analyzing the data: On the basis of a preliminary examination of the data collected, as well as the nature and scope of the problem, analyze the data by selecting the most appropriate statistical technique.
8. Reaching statistical findings: The analysis will result in some statistical findings. Interpret these findings in terms of the concrete problem with which the investigation was started.
9. Presentation of findings: The last step is presenting the findings of the study, properly interpreted, in a suitable form. The choice is between an oral presentation and a written one. In an oral presentation, one has to be extremely selective in choosing the material, as in limited time one must convey a broad idea of the study and its major findings so that the audience can place them in proper perspective. For a written presentation, a report has to be prepared; it should be reasonably comprehensive and should include graphs and diagrams to help the reader understand it in all its ramifications.
A detailed discussion of the above topics is available in Block II of the course material.
It is perhaps difficult to imagine a field of knowledge which can do without Statistics. It is a tool of all sciences, indispensable to research and intelligent judgment, and has become a recognised discipline in its own right. There is hardly any field, whether it be trade, industry, commerce, economics, biology, botany, astronomy, physics, chemistry, education, medicine, sociology, psychology, or meteorology, where statistical tools are not applicable. The importance of statistics has been summarized by A. L. Bowley thus: "A knowledge of statistics is like a knowledge of foreign languages or of algebra; it may prove of use at any time under any circumstances."
There are three major functions in any business enterprise in which statistical methods are useful.
1. The planning of operations: This may relate either to special projects or to the recurring activities of a firm over a specified period.
2. The setting up of standards: This may relate to the size of employment, the volume of sales, the fixation of quality norms for the manufactured product, norms for daily output, and so forth.
3. The function of control: This involves comparing the actual production achieved against the norm or target set earlier. In case production has fallen short of the target, it suggests remedial measures so that such a deficiency does not occur again.
Different authors have highlighted the importance of Statistics in business. For instance, Croxton and Cowden give numerous uses of Statistics in business, such as project planning, budgetary planning and control, inventory planning and control, quality control, marketing, production, and personnel administration.
In the sphere of production, for example, Statistics can be useful in various ways. Statistical quality control methods are used to ensure the production of quality goods, by identifying and rejecting defective or substandard goods. Sales targets can be fixed on the basis of sales forecasts, which are made using various forecasting methods.
Statistics is widely used in studying seasonal behaviour. A business firm engaged in the sale of a certain product has to decide how much stock of that product to keep. If the product is subject to seasonal fluctuations, the firm must know the nature of the seasonal fluctuations in demand. For this purpose, a seasonal index of consumption may be required. If the firm can obtain such data, or construct a seasonal index of its own, it can keep limited stock of the product in the lean months and larger stocks in the remaining months. In this way, it avoids blocking funds in large stocks during the lean months, and it will not miss any opportunity to sell the product in the busy season, since it maintains adequate stock during that period.
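A simple-average seasonal index of the kind described here can be computed in a few lines; the quarterly sales figures below are hypothetical:

```python
from statistics import mean

# Hypothetical quarterly sales (units) for three successive years
sales = {
    "Q1": [120, 126, 132],
    "Q2": [200, 210, 220],
    "Q3": [150, 155, 160],
    "Q4": [310, 325, 340],
}

grand_mean = mean(v for values in sales.values() for v in values)

# Seasonal index: each quarter's average as a percentage of the
# overall average (100 = an average quarter)
index = {q: round(100 * mean(values) / grand_mean, 1)
         for q, values in sales.items()}
# Q4 is the busy season (index far above 100); Q1 is the lean season.
```

With these figures, Q4's index of about 159 tells the firm to hold roughly 1.6 times the average stock in that quarter, while Q1's index of about 62 says stock can be cut back sharply.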
Statistics can be very useful in export marketing. Developing countries have begun giving considerable importance to their exports. Here, too, quality is an important factor on which exports depend. Apart from this, the firm concerned must know the probable countries to which its product can be exported. Before that, it must select the right product, one that has considerable demand in overseas markets. This is possible by carefully analysing the statistics of imports and exports. It may also be necessary to undertake a detailed survey of overseas markets to know more precisely the export potential of a given product.
Although statistics is very widely used in all spheres of human activity, it has some limitations, which restrict its scope and utility.
1. There are certain phenomena or concepts to which Statistics cannot be applied, because they are not amenable to measurement. For example, beauty, intelligence, and courage cannot be quantified. Statistics has no place in cases where quantification is not possible.
2. Statistics reveal the average behaviour, the normal or the general trend. An application of the
ÂaverageÊ concept if applied to an individual or a particular situation may lead to a wrong
conclusion and sometimes may be disastrous. For example, one may be misguided when told
that the average depth of a river from one bank to the other is four feet, when there may be
some points in between where its depth is far more than four feet. On this understanding, one
may enter those points having greater depth, which may be hazardous.
3. Since statistics are collected for a particular purpose, such data may not be relevant or useful
in other situations or cases. For example, secondary data (i.e., data originally collected by
someone else) may not be useful for the other person.
4. Statistics is not 100 per cent precise as is Mathematics or Accountancy. Those who use
Statistics should be aware of this limitation.
5. In Statistical surveys, sampling is generally used as it is not physically possible to cover all the
units or elements comprising the universe. The results may not be appropriate as far as the
universe is concerned. Moreover, different surveys based on the same size of sample show
different results.
6. At times, association or relationship between two or more variables is studied in Statistics, but
such a relationship does not indicate a 'cause and effect' relationship. It simply shows the
similarity or dissimilarity in the movement of the two variables.
7. A major limitation of Statistics is that it does not reveal everything pertaining to a certain
phenomenon. There is some background information that Statistics does not cover.
Apart from the limitations of Statistics mentioned above, there are misuses of it. Many people,
knowingly or unknowingly, use Statistical data in a wrong manner. The misuse of Statistics may
take several forms, some of which are explained below.
1. Sources of data not given: At times, the source of data is not given. In the absence of the
source, the reader does not know how far the data are reliable. Further, if he wants to refer to
the original source, he is unable to do so.
2. Defective data: Another misuse is that sometimes one gives inaccurate data. This may be
done knowingly in order to defend one's position or to prove a particular point.
3. Unrepresentative sample: In Statistics, several times one has to conduct a survey, which
necessitates choosing a sample from the given population or universe. The sample may turn
out to be unrepresentative of the universe.
4. Unwarranted conclusions: This may be as a result of making false assumptions. For example,
while making projections of population for the next five years, one may assume a lower rate
of growth though the past two years indicate otherwise.
5. Confusion of correlation and causation: In Statistics, several times one has to examine the
relationship between two variables. A close relationship between the two variables may not
establish a cause-and-effect relationship in the sense that one variable is the cause and the
other is the effect.
6. Mistakes in arithmetic: Finally, one may come across certain mistakes in calculations or in
the application of a wrong formula. This human error may result in grossly wrong figures,
leading to wrong conclusions.
The foregoing discussion on misuses of Statistics clearly indicates the pitfalls in which one is
likely to be trapped if one does not exercise sufficient care in the collection, analysis and
interpretation of data.
1.5 SUMMARY:
Decision making is not practiced only in business areas. Decision making is a requirement for all
humans at all times. Decisions are taken mainly on the basis of our knowledge and experience, but when a
problem becomes complicated, with a large volume of input data, the analysis becomes complex and an
effective, systematic approach is needed. This has created the necessity of scientific methods of
decision making for business. Quantitative analysis uses a scientific approach to decision making. This
approach consists of defining a problem, developing a model, acquiring input data, developing a
solution, analyzing the results, and implementing the results. In using the quantitative approach,
however, there can be potential problems, including conflicting viewpoints, the impact of quantitative
analysis models on other departments, beginning assumptions, outdated solutions, fitting textbook
models, understanding the models, acquiring input data, hard-to-understand mathematics, obtaining
only one answer, testing the solutions, and analysing the results. Decisions have to be based upon data
which show relationships, indicate trends, and show rates of change in various relevant variables.
The field of statistics provides methods for collecting, presenting, analyzing and meaningfully
interpreting data. The entire statistical study can be explained in the following steps: formulation of
the problem, determining the objectives of the study, determining the sources of data, designing data
collection forms, conducting the field survey, organising the data, analysing the data, reaching
statistical findings and presentation of findings. Statistics is very important, and it is rather impossible
to think of any sphere of human activity where statistics does not creep in. Statistics has assumed
unprecedented dimensions these days, and statistical thinking is becoming more and more
indispensable for able citizenship. In fact, to a very striking degree, modern culture has become
a statistical culture, and the subject of statistics has made tremendous progress in the recent past, so
much so that an elementary knowledge of statistical methods has become a part of the general
education curricula of many universities all over the world. Although statistics is very widely used in
all spheres of human activity, it is not without limitations which restrict its scope and utility. Statistics
does not study qualitative phenomena, statistical methods do not give any recognition to an object or
a person or an event in isolation, statistical laws are not exact, and Statistics is liable to be misused.
1.6 GLOSSARY:
Data: Data refers to any group of measurements that happen to interest us. These
measurements provide information the decision maker uses.
Descriptive Statistics: A collection of methods that enable us to organise, display and describe
data using such devices as tables, graphs and summary measures.
1.7 REFERENCES:
1) Barry Render, Ralph M. Stair, Jr. and Michael E. Hanna, Quantitative Analysis for Management, Pearson
Education, Delhi, 2003.
2) S. P. Gupta, Statistical Methods, Sultan Chand & Sons, New Delhi, 2000.
3) S. C. Gupta, Fundamentals of Statistics, Himalaya Publishing House, Mumbai, 2004.
1. "All statistics are numerical statements but all numerical statements are not statistics." Examine.
2. Give some examples of various types of models. What is a mathematical model? Develop two
examples of mathematical models.
3. Discuss important applications of Statistics within modern business and industry.
4. "Statisticians at times misuse Statistics." Elucidate this statement.
5. Explain how quantitative analysis helps in making better decisions.
by
Research Scholar
University of Hyderabad.
2.1 INTRODUCTION:
One of the important objectives of statistical analysis is to determine various numerical measures,
which describe the inherent characteristics of a frequency distribution. The first of such measures is
average. The averages are the measures, which condense a huge unwieldy set of numerical data into
single numerical values which are representative of the entire distribution. Averages provide the
gist and give a bird's eye view of the huge mass of unwieldy numerical data. Averages are the
typical values around which other items of the distribution congregate. They are the values which
lie between the two extreme observations (i.e., the smallest and the largest observations), of the
distribution and give us an idea about the concentration of the values in the central part of the
distribution. Accordingly they are also sometimes referred to as the Measures of Central Tendency.
4. It should be rigidly defined. There must be uniformity in its interpretation by different decision-
makers or investigators.
5. It should be based on all the observations. Entire data set should be taken into consideration. It
should be easy to understand and calculate.
6. It should have sampling stability.
7. It should be capable of further algebraic treatment.
8. It should not be unduly affected by extreme values.
The following are the various measures of central tendency or measures of location, which are
commonly used in practice.
1. Mathematical Averages
a) Arithmetic Mean
i) Simple Mean
ii) Weighted mean
b) Geometric Mean
c) Harmonic Mean
2. Averages of position:
a) Median
b) Mode
c) Quartiles
d) Deciles
e) Percentiles
The measures computed for a sample are called statistics and are denoted by lower-case letters,
e.g. n = number of observations, x̄ = sample mean.
Measures computed for the entire population are called parameters and are denoted by Greek or capital letters,
e.g. N = size of the population, µ = population mean.
(1)Direct method
(2)Indirect or short-cut method:
Above two methods can be applied on three different data series.
They are,
a. Ungrouped Data
b. Grouped- Discrete series
c. Grouped- Continuous series
2.3.1.1 DIRECT METHOD: UNGROUPED DATA
Arithmetic mean:
In this method A.M is calculated by adding the values of all observations and dividing the total by
the no. of observations.
X̄ = (X1 + X2 + X3 + … + Xn) / n = ∑X / n
Example:
The arithmetic mean of the numbers 8, 3, 5, 12, 10 is
X̄ = (8 + 3 + 5 + 12 + 10) / 5 = 38 / 5 = 7.6
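This calculation can be sketched in Python; a minimal illustration using the example values above:

```python
# Arithmetic mean of ungrouped data (direct method): X̄ = ΣX / n.
values = [8, 3, 5, 12, 10]

mean = sum(values) / len(values)
print(mean)  # 7.6
```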
2.3.1.2 DIRECT METHOD: GROUPED DATA- DISCRETE SERIES
X̄ = ∑fX / ∑f
Where,
f = frequency
X = value of the variable
X̄ = ∑fX / ∑f
∑f = 3 + 2 + 4 + 1 = 10
X̄ = [(3)(5) + (8)(2) + (6)(4) + (2)(1)] / (3 + 2 + 4 + 1) = (15 + 16 + 24 + 2) / 10 = 57 / 10 = 5.7
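The frequency-weighted calculation above can be sketched as follows. The pairing X = 5, 8, 6, 2 with f = 3, 2, 4, 1 is inferred from the products in the worked example:

```python
# Mean of a discrete (frequency) series: X̄ = Σfx / Σf.
xs = [5, 8, 6, 2]   # values of the variable X
fs = [3, 2, 4, 1]   # corresponding frequencies f

mean = sum(f * x for f, x in zip(fs, xs)) / sum(fs)
print(mean)  # 5.7
```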
The following assumptions are made to calculate Arithmetic Mean from grouped data.
i) The class intervals must be closed
ii) The width of each class interval should be equal
iii) The values of the observation in each class interval must be uniformly distributed
between its lower and upper limits.
iv) The mid-value of each class interval must represent the average of all values in that
class interval.
X̄ = ∑fm / ∑f, where m is the mid-value of the class interval.
Solution:
No. of accidents   No. of weeks (f)   Mid value (m)   fm
0-4                5                  2               10
5-9                22                 7               154
10-14              13                 12              156
15-19              8                  17              136
20-24              2                  22              44
                   ∑f = 50                            ∑fm = 500
X̄ = ∑fm / ∑f = 500 / 50 = 10
AM= 10 accidents per week.
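A short sketch of the mid-value method, reproducing the accidents table above:

```python
# Mean of grouped (continuous) data: X̄ = Σfm / Σf, with m the class mid-value.
classes = [(0, 4), (5, 9), (10, 14), (15, 19), (20, 24)]
freqs = [5, 22, 13, 8, 2]

mids = [(lo + hi) / 2 for lo, hi in classes]  # mid-values m
mean = sum(f * m for f, m in zip(freqs, mids)) / sum(freqs)
print(mean)  # 10.0
```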
2.3.1.4 SHORT-CUT METHOD: UNGROUPED DATA
In this method an arbitrary assumed mean is used as a basis for calculating deviations from individual
values in the data set.
Let A be the assumed mean and let
d = X − A, or X = A + d
∴ X̄ = ∑X/n = ∑(A + d)/n = A + ∑d/n
Example:
The arithmetic Mean of the numbers 8,3,5,12,10 is
Let assumed mean A= 8
X d = X-A
8 0
3 -5
5 -3
12 4
10 2
Total ∑d = −2
n = 5
∴ X̄ = A + ∑d/n = 8 + (−2)/5 = 8 − 0.4 = 7.6
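The short-cut method can be sketched as follows, with the same example values and A = 8 as the assumed mean:

```python
# Short-cut (assumed-mean) method: X̄ = A + Σd/n, where d = X − A.
values = [8, 3, 5, 12, 10]
A = 8                           # assumed mean

devs = [x - A for x in values]  # deviations d
mean = A + sum(devs) / len(values)
print(mean)  # 7.6
```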
For grouped data, the same idea gives
X̄ = A + ∑fd / ∑f
Example:
The daily earnings (in rupees) of employees working on a daily basis in a firm are:
No.of employees 3 6 10 14 24 42 75
Solution:
X̄ = A + ∑fd / ∑f = 160 + 6040/175 = Rs. 194.51
Average daily earnings = Rs. 194.51
Solution:
Let A = 12
No. of accidents   No. of weeks (f)   Mid value (m)   d = (m − A)/i = (m − 12)/5   fd
0-4                5                  2               −2                           −10
5-9                22                 7               −1                           −22
10-14              13                 12              0                            0
15-19              8                  17              1                            8
20-24              2                  22              2                            4
                   ∑f = 50                                                        ∑fd = −20
X̄ = A + (∑fd / ∑f) × i = 12 + (−20/50) × 5 = 12 − 2 = 10 accidents per week (same as earlier)
One of the limitations of the AM discussed above is that it gives equal importance to all the items. But
there are cases where the relative importance of the different items is not the same. When this is so,
the weighted AM is calculated. The weighted mean enables us to calculate an average that takes into
account the importance of each value relative to the overall total.
X̄w = ∑(w × X) / ∑w
where
∑(w × X) → sum of the weight of each element times that element
∑w → sum of all the weights
X̄w → weighted mean
w → weight assigned to each observation
Example: A contractor employs three types of workers, male, female and children. To a male worker
he pays Rs.100 per day, to female worker is Rs.80 and to a child worker is Rs.50 per day. What is
the average wage per day, if there are 20 male, 15 female and 5 children.
Solution:
X      w      wX
100    20     2000
80     15     1200
50     5      250
-------------------------------------------------------------
       ∑w = 40    ∑wX = 3450
-------------------------------------------------------------
X̄w = ∑(w × X) / ∑w = 3450 / 40 = Rs. 86.25 per day
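A minimal sketch of the weighted mean for the contractor example:

```python
# Weighted mean: X̄w = Σ(wX) / Σw.
wages = [100, 80, 50]    # X: daily wage in Rs. (male, female, child)
workers = [20, 15, 5]    # w: number of workers of each type

weighted_mean = sum(w * x for w, x in zip(workers, wages)) / sum(workers)
print(weighted_mean)  # 86.25
```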
The arithmetic mean may be affected by extreme values that are not representative of the rest of the
data. It is also impossible to compute the mean for a data set that has open-ended classes at either
the high or the low end of the scale.
G.M. = (x1 · x2 · x3 · … · xn)^(1/n), i.e. the nth root of the product of the n values.
For grouped data,
GM = Antilog [ ∑f log x / ∑f ]
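Both forms of the geometric mean can be sketched as follows; the sample values here are hypothetical, chosen for a clean result:

```python
# Geometric mean: GM = (x1·x2·…·xn)**(1/n), or equivalently the
# antilog of the mean of the logarithms.
import math

values = [2, 4, 8]  # hypothetical sample values

gm_direct = math.prod(values) ** (1 / len(values))
gm_logs = math.exp(sum(math.log(x) for x in values) / len(values))

print(round(gm_direct, 4))  # 4.0
```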
Merits:
- It is rigidly defined.
- It is useful in averaging ratios and percentages and in determining rates of increase or decrease.
- It is capable of algebraic manipulation.
- It gives less weight to large items and more to small ones than does the arithmetic average.
Demerits:
- It is difficult to understand.
- It is difficult to compute and interpret.
- It cannot be computed when there are negative values in a series or when one or more of the values
are zero.
The harmonic mean is based on the reciprocals of the numbers averaged. It is defined as the
reciprocal of the arithmetic mean of the reciprocal of the individual observations
For a set of n positive values x1, x2, …, xn, the harmonic mean is equal to
HM = N / ∑(1/X)
Example:
The harmonic mean of the numbers 2, 4 and 8 is
HM = 3 / (1/2 + 1/4 + 1/8) = 3 / (7/8) = (3 × 8) / 7 = 3.43
Calculation of harmonic mean, discrete series:
HM = ∑f / ∑(f/x)
For a continuous series:
HM = ∑f / ∑(f/m)
where
∑f = total frequency
f = individual frequencies
m = mid-point of the class
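A sketch of the harmonic mean in both forms; the frequencies in the discrete-series call are hypothetical:

```python
# Harmonic mean: HM = n / Σ(1/x) for raw values,
# and HM = Σf / Σ(f/x) for a discrete frequency series.
values = [2, 4, 8]
hm = len(values) / sum(1 / x for x in values)
print(round(hm, 2))  # 3.43

xs = [2, 4, 8]
fs = [1, 1, 1]  # hypothetical frequencies
hm_freq = sum(fs) / sum(f / x for f, x in zip(fs, xs))
```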
Merits:
Demerits:
- It is difficult to understand and compute.
- Its value cannot be computed when there are both positive and negative items in a series.
- It gives the largest weight to the smallest items. This is generally not a desirable feature, and as
such this average is not very useful for the analysis of economic data.
The median is a single value that measures the central item in the data. This single item is the middle
most or most central item in the set of numbers. Half of the items lie above this point, and the other
half lie below it.
To find the median of a data set, first array the data in ascending or descending order. If the data set
contains an odd number of items, the middle item of the array is the median. If there is an even
number of items, the median is the average of the two middle items.
Median = (n + 1)/2 th item in a data array.
Example:
Item 1 2 3 4 5 6 7
Time in minutes 4.2 4.3 4.7 4.8 5.0 5.1 9.0
Median = (n + 1)/2 th item = (7 + 1)/2 th item = 4th item = 4.8
2 2
Example:
Item 1 2 3 4 5 6 7 8
Time in minutes 86 52 49 43 35 31 30 11
Median = (n + 1)/2 th item = (8 + 1)/2 th item = 4.5th item = average of the 4th and 5th items
= (43 + 35)/2 = 39
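The array-and-pick-the-middle rule can be sketched as a small function (the name `median` is our own), reproducing both examples above:

```python
# Median of an ungrouped array: sort, then take the (n+1)/2-th item,
# averaging the two middle items when n is even.
def median(data):
    s = sorted(data)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

print(median([4.2, 4.3, 4.7, 4.8, 5.0, 5.1, 9.0]))  # 4.8
print(median([86, 52, 49, 43, 35, 31, 30, 11]))     # 39.0
```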
2.3.5.2 CALCULATING THE MEDIAN FROM GROUPED DATA
To find the median value, first identify the class interval which contains the median value, i.e. the
(n/2)th observation of the data set. To identify this class interval, find the cumulative frequency of each
class until you reach the class for which the cumulative frequency is equal to or greater than n/2.
The value of the median within that class is then found by interpolation:
Median = l + ((n/2 − cf) / f) × i
l → lower class limit (or boundary) of the median class interval
cf → Cumulative frequency of the class prior to the median class interval
f → frequency of the median class
i → width of the median class
n → total no. of observations in the distribution
Example: In a factory employing 3000 persons, 5 per cent earn less than Rs.150 per day, 580 earn
from Rs.151-200 per day, 30 per cent earn from Rs.201-250 per day, 500 earn from Rs.251-300
per day, 20 per cent earn from Rs.301-350 per day, and the rest earn Rs.351 or more per day. What
is the median wage?
Median = l + ((n/2 − cf) / f) × i
= 201 + ((1500 − 730) / 900) × 50
= 201 + 42.78 = Rs. 243.78
The median wage is Rs. 243.78 per day.
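The interpolation formula can be sketched as a function (the name `grouped_median` is our own); the parameters follow the wage example, with l = 201, cf = 730, f = 900, i = 50:

```python
# Median from grouped data: Median = l + ((n/2 − cf) / f) × i.
def grouped_median(l, n, cf, f, i):
    return l + (n / 2 - cf) / f * i

print(round(grouped_median(l=201, n=3000, cf=730, f=900, i=50), 2))  # 243.78
```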
Advantages of Median
- Extreme values do not affect the median as strongly as they do the mean.
- The median is easy to understand and can be calculated from any kind of data, even for grouped
data with open-ended classes, unless the median falls in the open-ended class.
- The median can be found even for data of qualitative descriptions such as colour or sharpness,
rather than numbers. Suppose there are five runs of a printing press, the results from which must
be rated according to sharpness of image. We can array the results from best to worst and pick the middle one.
Disadvantages of Median
Applications of Median
2.3.6 MODE
The mode of a set of numbers is that value which occurs with the greatest frequency. It is the most
common value. In some contexts, the mode may not exist, and even if it does exist it may not be
unique.
Example: The set 2, 2, 5, 7, 9, 9, 9, 10, 10, 11, 12, 18 has mode 9.
Example: The set 3, 5, 8, 10, 12, 15, 16 has no mode.
Example: The set 2, 3, 4, 4, 4, 5, 5, 7, 7, 7, 9 has two modes, 4 and 7, and is called "bimodal", since it has
two modes.
A distribution having only one mode is called unimodal distribution.
Mode for grouped data
In the case of grouped data when a frequency curve has been constructed to fit in data, the mode
will be the value (or values) of x corresponding to the maximum point (or points) on the curve.
From a frequency distribution (or histogram) the mode can be obtained from the following formula:
Mode = L1 + (∆1 / (∆1 + ∆2)) × i
Where L1 = lower class boundary of the modal class (i.e. the class containing the mode)
∆1 = excess of the modal frequency over the frequency of the preceding class
∆2 = excess of the modal frequency over the frequency of the succeeding class
i = width of the modal class
Example:
X Frequency
Less than 150 150
150-200 580
200-250 900
250-300 500
300-350 600
350 & above 270
Total 3000
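Applying the grouped-data mode formula to this table (the modal class is 200-250, so L1 = 200 and i = 50) can be sketched as:

```python
# Mode from grouped data: Mode = L1 + Δ1/(Δ1 + Δ2) × i.
freqs = {"150-200": 580, "200-250": 900, "250-300": 500}

L1 = 200  # lower boundary of the modal class
i = 50    # class width
d1 = freqs["200-250"] - freqs["150-200"]  # excess over preceding class
d2 = freqs["200-250"] - freqs["250-300"]  # excess over succeeding class

mode = L1 + d1 / (d1 + d2) * i
print(round(mode, 2))  # 222.22
```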
Merits of Mode:
Demerits of Mode:
- The value of the mode cannot always be determined; sometimes there may be a bi-modal series.
- It is not capable of algebraic manipulation.
- It is not based on each and every item of the series.
- It is not a rigidly defined measure. There are several formulas for calculating the mode, which
usually give somewhat different answers.
The various measures of central tendency (mean, median and mode) give us one single figure
that represents the entire data. But the average alone cannot adequately describe a set of
observations unless all the observations are the same. It is hence necessary to describe the
variability or dispersion of the observations. In two or more distributions the central value may be
the same, but there can still be wide disparities in the formation of the distributions.
The following are the important measures of dispersion:
1. Range
2. Quartile Deviation
3. Mean Deviation
4. Standard Deviation
Of these, the first two, namely the range and the quartile deviation, are positional measures because they
depend on the values at a particular position in the distribution. The other two, the mean
deviation and the standard deviation, are called calculation measures of dispersion because all the
values are employed in their calculation.
2.5.1 RANGE:
It is the simplest method of studying dispersion. It is defined as the difference between the value of
the smallest item and the value of the largest item included in the distribution.
Range=L-S
Where L=Largest item
S=smallest item
Coefficient of Range = (L − S) / (L + S)
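A minimal sketch of the range and its coefficient; the sample data here are hypothetical:

```python
# Range = L − S and Coefficient of Range = (L − S)/(L + S).
data = [35, 30, 45, 20, 25]

L, S = max(data), min(data)
rng = L - S
coeff = (L - S) / (L + S)

print(rng, round(coeff, 3))  # 25 0.385
```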
Uses of Range in Business:
(i) Quality control: The object of quality control is to keep a check on the quality of the
product without 100% inspection. The idea basically is that if the range, i.e. the difference between
the largest and smallest mass produced items, increases beyond a certain point, the production
machinery should be examined to find out why the items produced have not followed their usual
consistent pattern.
(ii) Fluctuations in the share prices: range is useful in studying the variations in the prices of
stocks and shares and other commodities that are sensitive to price changes from one period to
another.
(iii) Weather Forecasts: The meteorological department does make use of the range in
determining say the difference between the minimum temperature and the maximum, temperature.
This information is of great concern to the general public
2.5.2 QUARTILE DEVIATION (QD): Quartile deviation represents the difference between the third
quartile and the first quartile.
Inter Quartile range= Q3-Q1
Quartile Deviation = (Q3 − Q1) / 2
Quartile Deviation is an absolute measure of dispersion. The relative measure corresponding to this
measure, called the coefficient of quartile deviation.
Coefficient of Quartile Deviation = (Q3 − Q1) / (Q3 + Q1)
2.5.3 MEAN DEVIATION (MD): The Mean Deviation is also known as the average deviation. It is
the average difference between the items in a distribution and the Median or Mean of that series.
For individual observations,
M.D. = (1/N) ∑|X − A| = ∑|D| / N, where D = X − A and A is the mean or the median.
It is especially effective in reports presented to the general public or to groups not familiar with
statistical methods. This measure is useful for small sample with no elaborate analysis required.
Incidentally it may be mentioned that the National Bureau of Economic Research has found, in its
work on forecasting business cycle, that the average deviation is the most practical measure of
dispersion to use for this purpose.
For individual observations (direct method):
σ = √(∑x² / N), where x = (X − X̄)
X = individual observation
X̄ = Mean
From an assumed mean, σ = √(∑d²/n − (∑d/n)²), where d = X − A
For a discrete series:
σ = √(∑fx² / ∑f), where x = (X − X̄)
From an assumed mean,
σ = √(∑fd²/∑f − (∑fd/∑f)²)
Where d=(x-A)
A= assumed Mean
Continuous series:
Direct Method:
σ = √(∑fx² / ∑f), where x = (m − x̄)
Short-cut method:
σ = √(∑fd²/∑f − (∑fd/∑f)²) × i
where d = (m − A) / i
i = class interval
i= class interval
Example: Find out the standard Deviation for the following data of sales per day.
35, 30,45,20,25
Direct Method:
σ = √(∑(X − X̄)² / n)
X̄ = (35 + 30 + 45 + 20 + 25)/5 = 31, so ∑(X − X̄)² = 370
σ = √(370 / 5) = √74 = 8.6
Assumed Mean method:
Let assumed mean A = 35
σ = √(∑d²/n − (∑d/n)²)
∑d = (35−35) + (30−35) + (45−35) + (20−35) + (25−35) = 0 − 5 + 10 − 15 − 10 = −20
∑d² = (35−35)² + (30−35)² + (45−35)² + (20−35)² + (25−35)² = 0 + 25 + 100 + 225 + 100 = 450
σ = √(450/5 − (−20/5)²) = √(90 − 16) = √74 = 8.6
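Both methods can be sketched and compared in Python, using the sales data from the example:

```python
# Standard deviation by the direct method and the assumed-mean
# (short-cut) method; both give the same answer.
import math

data = [35, 30, 45, 20, 25]
n = len(data)

mean = sum(data) / n
sd_direct = math.sqrt(sum((x - mean) ** 2 for x in data) / n)

A = 35                        # assumed mean
d = [x - A for x in data]     # deviations from A
sd_shortcut = math.sqrt(sum(v * v for v in d) / n - (sum(d) / n) ** 2)

print(round(sd_direct, 2), round(sd_shortcut, 2))  # 8.6 8.6
```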
Example: Suppose that a prospective buyer tests the bursting pressure of samples of rubber tubes
received from the manufacturer. Find out the standard deviation of the bursting pressure of the
tubes.
Solution:
Direct Method:
σ = √(∑f(m − x̄)² / ∑f)
x̄ = ∑fm / ∑f = 2015 / 100 = 20.15
σ = √(1673 / 100) = √16.73 = 4.09
Short-cut method:
σ = √(∑fd²/∑f − (∑fd/∑f)²) × i
where d = (m − A)/i and i = class interval
σ = √(89/100 − (47/100)²) × 5 = √0.6691 × 5 = 4.09
1. Combined S.D.:
σ12 = √[(n1σ1² + n2σ2² + n1d1² + n2d2²) / (n1 + n2)]
σ12 = combined S.D.
σ1 = S.D. of the 1st group
σ2 = S.D. of the 2nd group
d1 = x̄1 − x̄12
d2 = x̄2 − x̄12, where x̄12 is the combined mean
2. S.D. of the first N natural numbers:
σ = √((N² − 1) / 12)
2.5.6 VARIANCE AND COEFFICIENT OF VARIATION: The variance is the square of the standard
deviation: Variance = σ². The standard deviation is an absolute measure of dispersion; the corresponding
relative measure is known as the coefficient of variation. It is used in such problems where we want
to compare the variability of two or more series. The series for which the coefficient of
variation is greater is said to be more variable or, conversely, less consistent, less uniform, less stable
or less homogeneous.
Coefficient of variation = (σ / x̄) × 100
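Comparing two series by coefficient of variation can be sketched as follows; both series here are hypothetical:

```python
# Coefficient of variation = (σ / X̄) × 100; the series with the larger
# CV is the more variable (less consistent) one.
import math

def cv(data):
    n = len(data)
    m = sum(data) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in data) / n)
    return sd / m * 100

a = [35, 30, 45, 20, 25]  # hypothetical series A
b = [31, 30, 32, 29, 33]  # hypothetical series B
print(cv(a) > cv(b))  # True: series A is less consistent
```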
2.6 SUMMARY:
Measure of Central Tendency and variability play a vital role in characterizing data. Several types of
averages are used, the most common being the arithmetic mean or the Mean, the Median, the Mode,
the geometric Mean and the harmonic Mean. A measure of central tendency summarizes the
distribution of a variable into a single figure, which can be regarded as its representative. This
measure alone, however, is not sufficient to describe a distribution because there may be a situation
where two or more distributions have the same central value. Conversely, it is possible that the pattern
of distribution in two or more situations is the same but the values of their central tendency are
different. Hence, measures of dispersion are used to represent the characteristics of a distribution.
The concept of dispersion is related to the extent of scatter or variability in observations. Some
important measures of dispersion are Range, Quartile Deviation, Mean Deviation and Standard
Deviation.
2.7 GLOSSARY:
Arithmetic mean: A measure of central tendency calculated by dividing the sum of all
observations by the number of observations in the data set.
Coefficient of A measure of relative variability that expresses the standard deviation as a
variation: percentage of the mean.
Geometric Mean: A measure of central tendency used to measure the average rate of change or
growth for some quantity, computed by taking the nth root of the product of
n values representing change.
Measures of Measures that describe the centre of a distribution. The mean, median and
central tendency: mode are three of the measures of central tendency.
Mode: The value that has the maximum frequency in the data set
Range: Difference between the largest and the smallest values in a data set.
Standard The square root of the variance in a series. It shows how the
Deviation: data are spread out.
Variance: Averaged squared deviation between the mean and each item in a series.
2.8 REFERENCES:
1. "Every average has its own peculiar characteristics. It is difficult to say which average is the best."
Explain with examples.
2. What do you mean by Dispersion? What are the different measures of dispersion?
4. The number of solar heating systems available to the public is quite large, and their heat-storage
capacities are quite varied. Here is a distribution of the heat-storage capacity (in days) of 28 systems
that were tested recently by University Laboratories, Inc.
Days Frequency
0-0.99 2
1-1.99 4
2-2.99 6
3-3.99 7
4-4.99 5
5-5.99 3
6-6.99 1
University Laboratories, Inc. knows that its report on the tests will be widely circulated and used as
the basis for tax legislation on solar-heat allowances. It therefore wants the measures it uses to be as
reflective of the data as possible.
5. In a small company, two typists are employed. Typist A types one page in ten minutes while typist
B takes twenty minutes for the same.
(a)Both are asked to type 10 pages. What is the average time taken for typing one page?
(b)Both are asked to type for one hour. What is the average time taken by them for typing one page?
6. The following data gives the saving bank accounts balances of nine sample households selected in
a survey. The figures are in rupees.
745 2,000 1,500 68,000 461 549 3,750 1,800 4,795
Find the mean and the median for these data.
7. The prices of a Tea Company shares in Mumbai and Kolkata markets during the last ten months
are recorded below:
Month(2000) Mumbai Kolkata
January 105 108
February 120 117
March 115 120
April 118 130
May 130 100
June 127 125
July 109 125
August 110 120
September 104 110
October 112 135
Determine the Arithmetic Mean and Standard Deviation of the price of shares. In which market are
the share prices more stable?
8. The Casual Life Insurance Company is considering purchasing a new fleet of company cars. The
financial departmentÊs director, Tom Dawkins, sampled 40 employees to determine the number of
miles each drove over a 1-year period. The results of the study follow. Calculate the range and inter
quartile range.
9. Southeastern Stereos, a wholesaler, was contemplating becoming the supplier to three retailers, but
inventory shortages have forced Southeastern to select only one. Southeastern credit manager is
evaluating the credit record of these three retailers. Over the past 5 years, these retailers, accounts
receivable have been outstanding for the following average number of days. The credit manager
feels that consistency, in addition to lowest average, is important. Based on relative dispersion,
which retailer would make the best customer?
10. Realistic Stereo Shops marks up its merchandise 35 per cent above the cost of its latest additions
to stock. Until 4 months ago, the Dynamic 400-S VHS recorder had been Rs.300. During the last 4
months Realistic has received 4 monthly shipments of this recorder at these unit costs: Rs.275,
Rs.250, Rs.240, and Rs.225. At what average rate per month has Realistic's retail price for this unit
been decreasing during these 4 months?
by
Dr.B.Raja Shekhar
Reader
School of Management Studies
University of Hyderabad.
3.1 INTRODUCTION:
We live in a world in which we are unable to forecast the future with complete certainty. Our need to
cope with uncertainty leads us to the study and use of probability theory. By organizing the
information and considering it systematically, we will be able to recognize our assumptions,
communicate our reasoning to others, and make a sounder decision than we could by using a shot-in-
the-dark approach. Probability is a part of our everyday lives. In personal and managerial decisions, we
face uncertainty and use probability theory whether or not we admit the use of something so
sophisticated. When we hear a weather forecast of a 70 per cent chance of rain, we change our plans
from a picnic to a pool game. Managers who deal with inventories of highly styled women's
clothing must wonder about the chances that sales will reach or exceed a certain level. Probability
deals with many uncertainties in business.
3.2.1 OUTCOME:
An outcome is the result of an experiment or other situation involving uncertainty. The set of all
possible outcomes of a probability experiment is called a sample space.
3.2.2 SAMPLE SPACE:
The sample space is an exhaustive list of all the possible outcomes of an experiment. Each possible
result of such a study is represented by one and only one point in the sample space.
Example 1:
Experiment: Rolling a die once. Sample space = {1, 2, 3, 4, 5, 6}
Experiment: Tossing a coin. Sample space = {Heads, Tails}
3.2.3 EVENT:
In probability theory, an event is one or more of the possible outcomes of doing something. If we toss
a coin, getting a tail would be an event, and getting a head would be another event. Similarly, if we
are drawing from a deck of cards, selecting the ace of spades would be an event. An example of an
event closer to your life is picking a student from a class of 100 to answer a question.
3.2.4 RELATIVE FREQUENCY:
Example 2:
Experiment: Tossing a fair coin 50 times (n = 50). Event E = 'heads'. Result: 30 heads, 20 tails, so r =
30. Relative frequency = r/n = 30/50 = 3/5 = 0.6
If an experiment is repeated many, many times without changing the experimental conditions, the
relative frequency of any particular event will settle down to some value. For example, in the above
experiment, the relative frequency of the event 'heads' will settle down to a value of approximately 0.5
if the experiment is repeated many more times.
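The settling of the relative frequency toward 0.5 can be sketched with a simple simulation; the seed and toss count here are arbitrary choices:

```python
# Relative frequency r/n of 'heads' settling near 0.5 as the number
# of tosses grows (seeded RNG for repeatability).
import random

random.seed(42)
n = 100_000
heads = sum(random.random() < 0.5 for _ in range(n))  # count of heads, r

rel_freq = heads / n
print(abs(rel_freq - 0.5) < 0.01)  # True
```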
3.2.5 PROBABILITY:
The probability of an event has been defined as its long-run relative frequency. It has also been
thought of as a personal degree of belief that a particular event will occur (subjective probability). A
probability provides a quantitative description of the likely occurrence of a particular event.
Probability is conventionally expressed on a scale from 0 to 1; a rare event has a probability close to 0,
a very common event has a probability close to 1.
In some experiments, all outcomes are equally likely. For example if you were to choose one winner
in a raffle from a hat, all raffle ticket holders are equally likely to win, that is, they have the same
probability of their ticket being chosen. This is the equally likely outcomes model, under which the
probability of an event is defined to be the number of outcomes favourable to the event divided by the
total number of equally likely outcomes.
Example 3:
Two events are independent if the occurrence of one of the events gives us no information about
whether or not the other event will occur; that is, the events have no influence on each other.
In probability theory, we say that two events, A and B, are independent if the probability that they
both occur is equal to the product of the probabilities of the two individual events, i.e.
P( A ∩ B) = P( A).P( B)
The idea of independence can be extended to more than two events. For example, A, B and C are
independent if:
a. A and B are independent, A and C are independent, and B and C are independent (pairwise
independence);
b. P( A ∩ B ∩ C ) = P( A).P( B).P(C )
If two events (each with non-zero probability) are independent, then they cannot be mutually
exclusive, and vice versa: mutually exclusive events cannot be independent.
Example 4:
Suppose that a man and a woman each have a pack of 52 playing cards. Each draws a card from
his/her pack. Find the probability that they each draw the ace of clubs.
We define the events:
A = the man draws the ace of clubs, so P(A) = 1/52
B = the woman draws the ace of clubs, so P(B) = 1/52
Clearly events A and B are independent, so: P( A ∩ B ) = P( A).P( B ) = (1/52)(1/52)
= 0.00037
That is, there is a very small chance that the man and the woman will both draw the ace of clubs.
Two events are mutually exclusive if it is impossible for them to occur together.
Example 5:
- Experiment: Rolling a die once. Sample space S = {1, 2, 3, 4, 5, 6}. Events A = 'observe an odd
number' = {1, 3, 5}, B = 'observe an even number' = {2, 4, 6}. A ∩ B = φ, the empty set, so
A and B are mutually exclusive.
- A subject in a study cannot be both male and female, nor can they be aged 20 and 30.
The addition rule is a result used to determine the probability that event A or event B occurs or both
occur.
Example 6:
What is the probability of drawing an ace from a well-shuffled pack of 52 playing cards?
Solution
In this case there are four favourable outcomes:
(1) the ace of spades
(2) the ace of hearts
(3) the ace of diamonds
(4) the ace of clubs.
Since there are 52 possible outcomes in all, the probability is 4/52 = 1/13.
The same principle can be applied to the problem of determining the probability of obtaining
different totals from a pair of dice.
Example 8:
What is the probability that when a pair of six-sided dice are thrown, the sum of the numbers
equals 5?
Solution:
There are 36 possible outcomes when a pair of dice is thrown. Consider that if one of the dice
rolled is a 1, there are six possibilities for the other die. If one of the dice rolled a 2, the same
is still true. And the same is true if one of the dice is a 3, 4, 5, or 6. If this is still confusing,
look at the following (abbreviated) list of outcomes:
(1,1), (1,2), (1,3), (1,4), (1,5), (1,6); (2,1), (2,2), (2,3), ...; (3,1), (3,2), (3,3), ...; (4,1), ...; (5,1), ...; (6,1), ...
The total number of outcomes is 6 × 6 = 36. Since four of the outcomes have a total of 5
[(1,4),(4,1),(2,3),(3,2)], the probability of the two dice adding up to 5 is 4/36 = 1/9.
Example 9:
What is the probability that when a pair of six-sided dice is thrown, the sum of the number
equals 12?
Solution
We already know the total number of possible outcomes is 36, and since there is only one
outcome that sums to 12, (6,6--you need to roll double sixes), the probability is simply 1/36.
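Examples 8 and 9 can be checked by enumerating all 36 outcomes directly; a minimal Python sketch:

```python
from itertools import product

# Enumerate all 36 equally likely outcomes of rolling two six-sided dice
outcomes = list(product(range(1, 7), repeat=2))

def p_sum(target):
    """Probability that the two dice add up to `target`."""
    favourable = [o for o in outcomes if sum(o) == target]
    return len(favourable) / len(outcomes)

print(p_sum(5))    # 4/36 = 1/9 ≈ 0.111
print(p_sum(12))   # 1/36 ≈ 0.028
```

The enumeration confirms four favourable outcomes for a total of 5 and a single outcome, (6,6), for a total of 12.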
The multiplication rule is a result used to determine the probability that two events, A and B, both
occur.
In many situations, once more information becomes available, we are able to revise our estimates for
the probability of further outcomes or events happening. For example, suppose you go out for lunch at
the same place and time every Friday and you are served lunch within 15 minutes with probability 0.9.
However, given that you notice that the restaurant is exceptionally busy, the probability of being
served lunch within 15 minutes may reduce to 0.7. This is the conditional probability of being served
lunch within 15 minutes given that the restaurant is exceptionally busy.
The usual notation for "event A occurs given that event B has occurred" is "A | B" (A given B). The
symbol | is a vertical line and does not imply division. P(A | B) denotes the probability that event A
will occur given that event B has occurred already.
A rule that can be used to determine a conditional probability from unconditional probabilities is:

P(A | B) = P(A ∩ B) / P(B)

Where: P(A | B) = the (conditional) probability that event A will occur given that event B has occurred already; P(A ∩ B) = the (unconditional) probability that event A and event B both occur; P(B) = the (unconditional) probability that event B occurs.
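The conditional probability rule can be illustrated with the die-rolling sample space used earlier; the event B below ('observe a number of at least 3') is an illustrative choice, not from the text:

```python
from fractions import Fraction

# Sample space for one roll of a die
S = {1, 2, 3, 4, 5, 6}
A = {1, 3, 5}       # observe an odd number
B = {3, 4, 5, 6}    # observe a number of at least 3 (illustrative event)

p_b = Fraction(len(B), len(S))
p_a_and_b = Fraction(len(A & B), len(S))   # A ∩ B = {3, 5}

# Conditional probability rule: P(A | B) = P(A ∩ B) / P(B)
p_a_given_b = p_a_and_b / p_b
print(p_a_given_b)   # 1/2
```

Knowing the roll is at least 3 leaves four outcomes, two of which are odd, so P(A | B) = 1/2.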
Bayes' Theorem is a result that allows new information to be used to update the conditional probability
of an event.
Using the multiplication rule, the following gives Bayes' Theorem in its simplest form:

P(A | B) = P(A ∩ B) / P(B) = P(B | A).P(A) / P(B)

Expanding the denominator using the rule of total probability:

P(A | B) = P(B | A).P(A) / [P(B | A).P(A) + P(B | A').P(A')]
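Bayes' Theorem in the expanded form above can be sketched as a small function. The numbers in the illustration below are assumed for demonstration only (they do not come from the text):

```python
def bayes(p_a, p_b_given_a, p_b_given_not_a):
    """P(A | B) via Bayes' theorem with the total-probability denominator."""
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    return p_b_given_a * p_a / p_b

# Hypothetical illustration: A = item comes from a faulty machine,
# B = item fails inspection (all probabilities assumed)
posterior = bayes(p_a=0.01, p_b_given_a=0.95, p_b_given_not_a=0.05)
print(round(posterior, 3))   # 0.161
```

Even with a very reliable inspection, the revised probability P(A | B) stays modest because A itself is rare; this is exactly the kind of revision Bayes' Theorem formalizes.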
• What is the probability that the student who will sit next to you on the bus is a man and a biology major?
Define: C = man; B = biology major
P(C ∩ B) = P(C) × P(B | C) = 0.90 × 0.90 = 0.81
The outcome of an experiment need not be a number, for example, the outcome when a coin is tossed
can be 'heads' or 'tails'. However, we often want to represent outcomes as numbers. A random variable
is a function that associates a unique numerical value with every outcome of an experiment. The value
of the random variable will vary from trial to trial as the experiment is repeated.
A random variable has either an associated probability distribution (discrete random variable) or
probability density function (continuous random variable).
Example 11:
• A coin is tossed ten times. The random variable X is the number of tails that are noted. X can only take the values 0, 1, ..., 10, so X is a discrete random variable.
• A light bulb is burned until it burns out. The random variable Y is its lifetime in hours. Y can take any positive real value, so Y is a continuous random variable.
3.7.2 VARIANCE:
The (population) variance of a random variable is a non-negative number which gives an idea of how
widely spread the values of the random variable are likely to be; the larger the variance, the more
scattered the observations on average.
Stating the variance gives an impression of how closely concentrated round the expected value the
distribution is; it is a measure of the 'spread' of a distribution about its average value.
a. The larger the variance, the further that individual values of the random variable
(observations) tend to be from the mean, on average;
b. The smaller the variance, the closer that individual values of the random variable
(observations) tend to be to the mean, on average;
c. Taking the square root of the variance gives the standard deviation, i.e.: σ = √σ²
d. The variance and standard deviation of a random variable are always non-negative.
3.8 THEORETICAL PROBABILITY DISTRIBUTIONS:
The important theoretical probability distributions discussed in this chapter are:
1. Binomial Distribution
2. Poisson Distribution
3. Normal Distribution
From the following example, we can see how the binomial distribution arises.
If a coin is tossed once, there are two outcomes, tail or head. The probability of obtaining a head is p = 1/2 and the probability of obtaining a tail is q = 1/2. These are the terms of the binomial (q + p), where (q + p) = 1.
In general, in 'n' tosses of a coin, the probabilities of the various possible events are given by the successive terms of the binomial expansion:

(q + p)^n = q^n + nC1 q^(n−1) p + nC2 q^(n−2) p² + ... + nCr q^(n−r) p^r + ... + p^n
Example 12:
To assure quality of a product, a random sample of size 25 is drawn from a process. The number of
defects (X) found in the sample is recorded. The random variable X follows a binomial distribution
with n = 25 and p = P(product is defective).
Example 13:
Records at a local blood bank show that in any year about 20% of the population donate blood.
In a queue of 8 people at a checkout:
1. what is the probability that at most 3 people are donors?
Our random variable represents the number of people in the queue who are blood donors, i.e.
x = 0 or 1 or 2 or 3 or 4 or 5 or 6 or 7 or 8
We have a set number of people being looked at: n = 8.
The people in the queue can either be blood donors (p) or not (q).
The likelihood of any one person donating blood is 20%: p = 0.2 for all people looked at.
Each person can be assumed to be a donor, or not, independently of the other people in the queue.
Solutions
P(5 people are donors) = P(x = 5): can use the tables with some adjustment.
P(at least 2 are donors) = P(x ≥ 2): can use the tables with some adjustment.
P(x ≥ 2) = 1 − P(complement of x ≥ 2)
x ≥ 2 = 2 or 3 or 4 or 5 or 6 or 7 or 8 Complement of this is x = 0 or 1
P ( x ≥ 2) = 1 − P ( x ≤ 1) = 1 − 0.503 = 0.497
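The table values used above can be reproduced from the binomial formula; a short Python sketch:

```python
from math import comb

n, p = 8, 0.2    # 8 people in the queue, donor probability 0.2
q = 1 - p

def binom_pmf(x):
    """P(X = x) for a binomial(n, p) random variable."""
    return comb(n, x) * p**x * q**(n - x)

p_at_most_1 = binom_pmf(0) + binom_pmf(1)
p_at_least_2 = 1 - p_at_most_1
print(round(p_at_most_1, 3))    # 0.503
print(round(p_at_least_2, 3))   # 0.497

# Summary measures: mean = np = 1.6, variance = npq = 1.28
mean, var = n * p, n * p * q
```

This reproduces P(x ≤ 1) = 0.503 and P(x ≥ 2) = 0.497 from the worked solution.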
The expected value, variance and standard deviation of a binomial distribution are found by using the following formulae:

E(X) = np;  V(X) = npq;  σ = √(npq)
The Poisson distribution is also a discrete probability distribution. It was developed by the French mathematician S.D. Poisson (1781–1840) and hence named after him.
Along with the normal and binomial distributions, the Poisson distribution is one of the most widely
used distributions. It is used in quality control statistics, to count the number of defective items or in
insurance problems to count the number of casualties or in waiting-time problems to count the number
of incoming telephone calls or incoming customers or the number of patients arriving to consult a
doctor in a given time period, and so forth. All these examples have a common feature: they can be
described by a discrete random variable, which takes on integer values (0, 1, 2, 3 and so on).
The characteristics of the Poisson distribution are:
1. The events occur independently. This means that the occurrence of a subsequent event is not at
all influenced by the occurrence of an earlier event.
2. Theoretically, there is no upper limit on the number of occurrences of an event during a
specified time period.
3. The probability of a single occurrence of an event within a specified time period is
proportional to the length of the time period or interval.
4. In an extremely small portion of the time period, the probability of two or more occurrences of
an event is negligible.
The Poisson distribution is used for modeling rates of occurrence. The Poisson distribution has one parameter, λ:

P(X) = e^(−λ) λ^X / X!

Mean = Variance = λ
Example 14:
During an off-peak period, passengers arrive at an airport check-in at the average rate of 3 per minute.
1. What is the probability that in the next minute less than 5 passengers will arrive?
3. What is the probability that in the next five minutes at least 12 passengers will arrive?
Our random variable represents the number of passengers arriving at the check-in
i.e. x = 0 or 1 or 2 or 3 or 4 or 5 or ...
We will assume each passenger arrives independently of any other passenger. We are given the
average arrivals and the time period this average holds for, i.e. λ = 3 per minute.
Solutions:
1. What is the probability that in the next minute less than 5 passengers will arrive?
Using the Poisson formula with λ = 3 and x = 5:

P(x = 5) = e^(−λ) λ^x / x! = e^(−3) × 3⁵ / 5! = 0.100818813 ≈ 0.101
3. What is the probability that in the next five minutes at least 12 passengers will arrive?
x ≥ 12 means x = 12 or 13 or 14 or ...
The complement of this set is x = 0 or 1 or 2 or ... or 11, so we need 1 − P(x ≤ 11).
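Both parts of Example 14 can be computed from the Poisson formula directly; a Python sketch, assuming the arrival rate scales to λ = 3 × 5 = 15 over the five-minute interval:

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson random variable with rate lam."""
    return exp(-lam) * lam**x / factorial(x)

# Part 1: a single minute, lam = 3
print(round(poisson_pmf(5, 3), 3))   # 0.101

# Part 3: over five minutes the rate scales to lam = 15
lam5 = 15
p_at_most_11 = sum(poisson_pmf(x, lam5) for x in range(12))
p_at_least_12 = 1 - p_at_most_11
print(round(p_at_least_12, 3))
```

Summing the pmf from 0 to 11 gives the complement P(x ≤ 11), exactly as in the worked solution.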
The preceding two distributions discussed in this chapter were discrete probability distributions. We
shall now take up another distribution in which the random variable can take on any value within a
given range. This is the normal distribution, which is an important continuous probability distribution.
This distribution is also known as the Gaussian distribution after the mathematician-astronomer
Carl Friedrich Gauss (1777–1855), whose contribution to the development of the normal
distribution was very considerable. As a vast number of phenomena have approximately normal
distributions, it is possible to make inferences about them by drawing samples. The normal distribution has certain
characteristics, which make it applicable to such situations.
P(X) = [1 / (σ√(2π))] e^(−(x−µ)² / (2σ²))
Probability for different X values can be obtained from normal distribution table available at the end
of all the units.
Let us see what this figure indicates in terms of characteristics of the normal distribution. It indicates
the following characteristics.
1. The curve is bell shaped, that is, it has the same shape on either side of the vertical line from
mean.
2. It has a single peak. As such it is uni modal.
3. The mean is located at the centre of the distribution.
4. The area under the curve between µ ± 1σ, µ ± 2σ and µ ± 3σ covers approximately 68.3%, 95.4% and 99.7% of the observations respectively.
2) All the statistical tables are limited by the size of their parameters (mostly). However, when these
parameters are large enough one may use the normal distribution for calculating the critical
values for these tables. For example, the F-statistic with (1, ∞) degrees of freedom is related to the standard normal Z-statistic as follows:
F = Z²;
3) The approximation to the Binomial is made by taking µ = np; σ² = npq.
Application: The probability of a defective item coming off a certain assembly line is
p = 0.25. A sample of 400 items is selected from a large lot of these items. What is the
probability of getting 90 or fewer defective items?
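The application above can be sketched with the normal approximation; the continuity correction of 0.5 used below is a standard refinement assumed here, not stated in the text:

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """Φ((x − µ)/σ) computed via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

n, p = 400, 0.25
mu = n * p                      # µ = np = 100
sigma = sqrt(n * p * (1 - p))   # σ = √(npq) = √75 ≈ 8.66

# P(X <= 90), using a continuity correction of 0.5
prob = normal_cdf(90.5, mu, sigma)
print(round(prob, 3))   # ≈ 0.136
```

Without the continuity correction the answer would be slightly smaller (about 0.124); either is acceptable as an approximation to the exact binomial probability.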
4) If the mean and S.D. of a normal distribution are known, it is easy to convert back and forth
from raw scores to percentiles.
5) It has been proven that the underlying distribution is normal if and only if the sample mean is
independent of the sample variance; this characterizes the normal distribution. Therefore many
effective transformations can be applied to convert almost any shape of distribution into a
normal one.
6) The most important reason for popularity of normal distribution is the CENTRAL LIMIT
THEOREM (CLT). The distribution of the sample averages of a large number of independent
variables (random) will be approximately normal regardless of the distributions of the
individual random variables. The CLT is useful especially when you are dealing with a
population with an unknown distribution.
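The CLT can be illustrated with a small simulation; the exponential distribution below is an arbitrary choice of a clearly non-normal population:

```python
import random

random.seed(0)   # reproducible run

# Draws from a (non-normal) exponential population with mean 1
def sample_mean(n):
    return sum(random.expovariate(1.0) for _ in range(n)) / n

# 1000 sample means, each from a sample of size 30
means = [sample_mean(30) for _ in range(1000)]
grand_mean = sum(means) / len(means)
print(round(grand_mean, 2))   # close to the population mean 1.0

# By the CLT, roughly 95% of sample means lie within ±2 standard errors
se = 1.0 / 30 ** 0.5
within = sum(1 for m in means if abs(m - 1.0) < 2 * se) / len(means)
print(within)
```

Although each individual observation is heavily skewed, the distribution of the sample means clusters tightly and symmetrically around the population mean, as the theorem predicts.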
7) The normality condition is required by almost all kinds of parametric statistical tests. Using
most statistical tables, such as the T-table and the F-table, requires the normality condition of the
population.
8) This condition should be tested before using the tables, otherwise the conclusion will be
wrong.
A normal curve with mean X̄ and standard deviation σ can be converted into a standard normal
distribution by performing a change of scale and origin. The original X̄ and σ will be converted
to 0 and 1 respectively. The units for the standard normal distribution curve are denoted by Z and are
called the Z values or Z scores. They are also called standard units or standard scores. The Z score is
known as a 'standardized' variable because it has a zero mean and a standard deviation of one.
z = (X − X̄) / σ
This transformation will be used widely in Z test in hypothesis testing.
Example 15:
The Rural Bank is reviewing its service charges and interest paying policies on cheque accounts. The
bank has found that the average daily balance on personal cheque accounts is Rs.550.00, with a
standard deviation of Rs.150.00. In addition, the average daily balances have been found to be
normally distributed.
1. What percentage of personal cheque account customers carry average daily balances in excess of
Rs.800.00
3. The bank is considering paying interest to customers carrying average daily balances in excess of a
certain amount. If the bank does not want to pay interest to more than 5% of its customers, what is the
minimum daily balance it should pay interest on?
Our random variable represents the daily balance on cheque accounts held with the bank. We know
the balances are normally distributed with a mean of Rs.550 and a standard deviation of Rs.150.
The mean has been calculated on all balances so is the population mean µ = 550
The standard deviation has been calculated on all balances so is a population standard deviation
σ = 150
Solution:
1. What percentage of personal cheque account customers carry average daily balances in excess of
Rs.800.00?
First convert Rs.800 to a z score: z = (800 − 550)/150 = 1.67
Second, look up the table to find the area between the centre and the z score 1.67: = 0.4525
Third, find P(x > 800) = P(z > 1.67) by subtracting 0.4525 from 0.5: = 0.0475, i.e. about 4.75% of customers.
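The table lookup can be checked numerically with the error function; a Python sketch (the small difference from 0.0475 comes from rounding z to 1.67 in the table):

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """Φ((x − µ)/σ) computed via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

mu, sigma = 550, 150
p_over_800 = 1 - normal_cdf(800, mu, sigma)
print(round(p_over_800, 4))   # ≈ 0.0478, close to the 4.75% from the table
```

Using the exact z = 250/150 rather than the rounded 1.67 gives essentially the same percentage of customers.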
3. The bank is considering paying interest to customers carrying average daily balances in excess of a
certain amount. If the bank does not want to pay interest to more than 5% of its customers, what is
the minimum daily balance it should pay interest on?
The bank will only want to pay interest to a few customers so the daily balance limit will be in the top
end of the balances. 5% represents an area under the bell curve – in this case the area in the right hand
tail.
First find the rest of the area on the right hand side of the bell: 50% - 5% = 45% = 0.4500
Second look up the area 0.4500 (or nearest area to it) in table 3 to find the z score that corresponds
with this area: z = 1.645
Rearranging z = (x − µ)/σ to find x: x = zσ + µ = (1.645 × 150) + 550 = Rs.796.75
So the minimum daily balance on which the bank should pay interest is about Rs.797.
3.9 SUMMARY:
This Chapter presents the fundamental concepts of probability and probability distributions. The
topics of random variables, discrete probability distributions (such as Poisson and binomial), and
continuous probability distributions (such as normal) have been discussed. A probability distribution
is any statement of a probability function having a set of collectively exhaustive and mutually
exclusive events. All probability distributions follow the basic probability rules. Basic probability
concepts and distributions are useful in decision theory, inventory control, Markov analysis, project
management, simulation, and statistical quality control.
3.10 GLOSSARY:
Bayes' theorem: A rule that is used for revising the probability of events after obtaining more information.
Bernoulli process: One repetition of a binomial experiment. Also called a trial.
Binomial Distribution : The probability distribution that gives the probability of x successes in n trials when
the probability of a success is p for each trial of a binomial experiment
Event: One of the possible outcomes of an experiment
Normal Distribution: A symmetrical distribution that is mounded up about the mean and is bell shaped and
becomes sparse at the extremes. The two tails never touch the horizontal axis.
Poisson Distribution: Like the binomial, but unlike the normal, it is a discrete probability distribution that
gives the probability of x (successes) in an interval. It is appropriate as an approximation to the binomial when
the probability of a success is very small and n is large.
Probability distribution: A distribution of the probabilities associated with each of the values of a random variable.
Probability: A numerical measure of the likelihood that a specific event will occur.
3.11 REFERENCES:
1. Levin, Richard I. and Rubin, David S., "Statistics for Management", PHI, New Delhi, 2000.
2. Sancheti, D.C. and Kapoor, V.K., "Business Statistics", New Delhi.
3.12 SELF-ASSESSMENT QUESTIONS:
3. Two computers A and B are to be marketed. A salesman who is assigned a job of finding
customers for them has 60 percent and 40 percent chances respectively of succeeding in case
of computers A and B. The computers can be sold independently. Given that he was able to
sell at least one computer, what is the probability that the computer A has been sold?
4. Two factories manufacture the same machine part. Each part is classified as having either
0,1,2,or 3 manufacturing defects; the joint probability distribution for this is given below:
Number of Defects
0 1 2 3
Manufacturer A 0.1250 0.0625 0.1875 0.1250
Manufacturer B 0.0625 0.0625 0.1250 0.2500
(i) A part is observed to have no defects. What is the conditional probability that it was
produced by manufacturer A?
(ii) A part is known to have been produced by manufacturer A. What is the conditional
probability that the part has no defects?
(iii) A part is known to have two or more defects. What is the conditional probability that
it was manufactured by A?
(iv) A part is known to have one or more defects. What is the conditional probability that
it was manufactured by B?
5. A company has three plants to manufacture 8,000 scooters in a month. Out of 8,000 scooters,
plant I manufactures 4, 000, plant II manufacturers 3,000 and plant III manufacturers 1,000
scooters. At plant I, 85 out of 100 scooters are rated of standard quality or better, at plant II
only 65 out of 100 scooters are rated of standard quality or better and at plant III 60 out of 100
scooters are rated of standard quality or better. What is the probability that a scooter
selected at random comes from (i) plant I, (ii) plant II, and (iii) plant III, if it is known that the scooter is
of standard quality?
6. In a certain locality, half of the households are known to use a particular brand of soap. In a
household survey, sample of 10 households are allotted to each investigator and 2048
investigators are appointed for the survey. How many investigators are likely to report:
(i) 3 users; (ii) not more than 3 users; and (iii) at least 4 users?
7. A manufacturer finds that the average demand per day for the mechanics to repair new
products is 2, over a period of one year and the demand per day is distributed as Poisson
variate. He employs 3 mechanics. On how many days in one year will (i) all the mechanics
be free, and (ii) some demand be refused?
8. Five hundred televisions are inspected as they come off the production line and the number of
defects per set is recorded below:
No. of defects(X) 0 1 2 3 4
No. of sets 368 72 52 7 1
Estimate the average number of defects per set and the expected frequencies of 0, 1, 2, 3 and 4
defects assuming a Poisson distribution.
[Given e −0.408 = 0.6649 ]
9. Six hundred candidates appeared for an entrance test for admission to a management course.
The marks obtained by the candidates were found to be normally distributed with a mean of
152 marks and a standard deviation of 18 marks.
If the top 60 performers were given confirmed admission, what are the minimum marks (to the
nearest integer) above which a candidate would be sure of being admitted?
Further, those obtaining at least 170 marks, but not qualified for confirmed admission were
included in a provisional list. How many candidates were included in this list? (Answer to the
nearest integer.)
10. The customer accounts at a certain departmental store have an average balance of Rs.480 and
a standard deviation of Rs.160. Assuming that the account balances are normally distributed
4. CORRELATION
LEARNING OBJECTIVES:
4.1. INTRODUCTION:
Managers of today often need to understand and make decisions depending upon the numerical
data on two or more variables simultaneously. For example,
i) Cost of production and volume of Production,
ii) Expenditure on Advertising and Sales of a Product,
iii) Number of Vehicles on Road and Number of Accidents,
iv) Number of Colleges offering MBA Programme and number of MBA Graduates,
v) Number of Counters at an e - Seva Kendra and the waiting time of customers
vi) Number of Telephone calls and Rate per Call and so on.
In other words, one of the basic functions of a manager is to understand the relationship between
these variables and make appropriate decisions keeping the future in mind, known as
'Forecasting' or 'Prediction'. The part which deals with understanding the behaviour of the variables is
Correlation, and the part which deals with forecasting is Regression.
4.2 CORRELATION:
Correlation is a statistical tool for studying the relationship between two or more variables and
correlation analysis involves various methods and techniques used for studying and measuring the
extent of relationship between the two variables. Two variables are said to be correlated if a change in
one variable results in a corresponding change in the other.
A scatter diagram is a diagrammatic representation of a bivariate distribution and provides a simple way of
understanding the relationship between two variables. It is simply the graphical
presentation of pairs of observed bivariate data in the form of 'dots'. Different scatter diagrams are
presented in the following discussion.
Broadly speaking, there are four types of correlation, namely, a)Positive correlation, b)Negative
correlation, c)Linear correlation and d)Non-Linear Correlation.
4.3.1 POSITIVE CORRELATION:
If the values of two variables deviate in the same direction i.e., if increase in the values of one variable
results, on an average, in a corresponding increase in the values of the other variable or if a
decrease in the values of one variable results, on an average, in a corresponding decrease in the
values of the other variable, the corresponding correlation is said to be positive or direct.
Examples:
i) Sales revenue of a product and expenditure on Advertising.
ii) Amount of rain fall and yield of a crop (up to a point)
iii) Price of a commodity and quantity of supply of a commodity.
iv) Height of the Parent and the height of the Child.
v) Number of patients admitted into a Hospital and Revenue of the Hospital.
vi) Number of workers and output of a factory.
Fig. 4.1: Positive correlation (Y variable plotted against X variable)
If the variables X and Y are related to each other with a very high degree of positive relationship then
we can notice a graph as in Fig.4.2.
Fig. 4.2: High degree of positive correlation (Y variable plotted against X variable)
If the variables X and Y are related to each other with a very low degree of positive relationship then
we can notice a graph as in Fig.4.3.
Fig. 4.3: Low degree of positive correlation (Y variable plotted against X variable)
4.3.2 NEGATIVE CORRELATION:
Correlation is said to be negative or inverse if the variables deviate in the opposite direction
i.e., if the increase (decrease) in the values of one variable results, on the average, in a
corresponding decrease (increase) in the values of the other variable.
Examples:
1. Price and demand of a commodity
2. Sale of Woolen garments and the day temperature.
Fig. 4.4: Negative correlation (Y variable plotted against X variable)
If the variables X and Y are related to each other with a very high degree of negative relationship then
we can notice a graph as in Fig.4.5.
Fig. 4.5: High degree of negative correlation (Y variable plotted against X variable)
Fig. 4.6: Low degree of negative correlation (Y variable plotted against X variable)
If the variables X and Y are related to each other with a very low degree of negative relationship then
we can notice a graph as in Fig.4.6.
4.3.3 NO CORRELATION:
If the scatter diagram shows points which are highly spread out and exhibit no trend or pattern, we
can say that there is no correlation between the variables. Refer to Fig. 4.7.
Fig. 4.7. NO CORRELATION (r = 0): Y variable plotted against X variable
Two variables are said to be linearly related if corresponding to a unit change in one
variable there is a constant change in the other variable over the entire range of the values.
If two variables are related linearly, then we can express the relationship as
Y=a+bX
Where 'a' is called the "intercept" (if X = 0, then Y = a) and 'b' is called the "rate of change" or
slope.
If we plot the values of X and the corresponding values of Y on a graph, then the graph would be a
straight line as shown in Fig.4.8.
Example:
X 1 2 3 4 5
Y 6 8 10 12 14
For a unit change in the value of x, a constant 2 units change in the value of y can be noticed.
The above can be expressed as : Y=4+2X
Fig. 4.8: Straight-line graph of Y = 4 + 2X (Y variable plotted against X variable)
If corresponding to a unit change in one variable, the other variable does not change in a constant rate,
but change at varying rates, then the relationship between two variables is said to be non-linear or
curvilinear as shown in Fig. 4.9. In this case, if the data are plotted on the graph, we do not get a
straight line curve. Mathematically, the correlation is non-linear if the slope of the plotted curve is not
constant. Data relating to Economics, Social Science and Business Management do exhibit often non-
linear relationship. We confine ourselves to linear correlation only.
Example:
X -6 -4 -2 0 2 4 6
Y 36 16 4 0 4 16 36
Fig. 4.9: Non-linear (curvilinear) relationship between X and Y (here Y = X²)
To measure the degree of association between two variables X and Y, Karl Pearson defined the
coefficient of correlation 'γ' as below. In this method, the coefficient of correlation is calculated as
the ratio of the covariance of the two variables to the square root of the product of their variances.
Correlation coefficient (γ) = Cov(Xi, Yi) / √[V(Xi) V(Yi)]

where Cov(Xi, Yi) = Σ(Xi − X̄)(Yi − Ȳ) / n

V(Xi) = Σ(Xi − X̄)² / n

V(Yi) = Σ(Yi − Ȳ)² / n

This reduces to:

γ = Σxy / √(Σx² Σy²)

where x = Xi − X̄ and y = Yi − Ȳ.
Solved Example 1:
Following is the data on two variables Xi and Yi. We find the sums and squares of the products as shown below.
Xi    Yi    x = Xi − X̄    y = Yi − Ȳ    x²    y²    xy
2     7     −2            −4            4     16    8
3     9     −1            −2            1     4     2
4     10    0             −1            0     1     0
5     14    1             3             1     9     3
6     15    2             4             4     16    8
20    55                                Σx² = 10    Σy² = 46    Σxy = 21

X̄ = ΣXi/n = 20/5 = 4,  Ȳ = ΣYi/n = 55/5 = 11

γ = Σxy / √(Σx² Σy²) = 21 / √(10 × 46) = 21 / (3.16 × 6.78) = 0.98
The value of γ = 0.98 shows that the two series X and Y have almost perfect positive correlation.
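The direct (deviation) method of Solved Example 1 can be verified with a short Python sketch:

```python
from math import sqrt

X = [2, 3, 4, 5, 6]
Y = [7, 9, 10, 14, 15]
n = len(X)

x_bar = sum(X) / n
y_bar = sum(Y) / n

# Deviations from the means, exactly as in the worked table
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))   # Σxy = 21
sxx = sum((x - x_bar) ** 2 for x in X)                       # Σx² = 10
syy = sum((y - y_bar) ** 2 for y in Y)                       # Σy² = 46

r = sxy / sqrt(sxx * syy)
print(round(r, 2))   # 0.98
```

The intermediate sums match the table (Σxy = 21, Σx² = 10, Σy² = 46), giving γ = 0.98.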
4.4.2. PEARSON'S METHOD – WITHOUT DEVIATIONS (SHORT-CUT METHOD):
When the arithmetic means of both sets of numerical items are not whole numbers and involve
decimals, calculating the coefficient of correlation by direct method becomes tedious. To overcome
this difficulty the following modified short-cut method formula is used.
Cov(Xi, Yi) = ΣXiYi/n − X̄Ȳ

V(Xi) = ΣXi²/n − X̄²;  V(Yi) = ΣYi²/n − Ȳ²

γ = Cov(Xi, Yi) / √[V(Xi) V(Yi)]

which simplifies to:

γ = [nΣXiYi − ΣXiΣYi] / √{[nΣXi² − (ΣXi)²][nΣYi² − (ΣYi)²]}
Solved Example 2:
Calculate the Karl Pearson's coefficient of correlation for the following data between sales and
advertising expenditure.
Let sales be the Xi variable and advertising expenditure the Yi variable, and calculate the
correlation coefficient using the following formula:

γ = [nΣXiYi − ΣXiΣYi] / √{[nΣXi² − (ΣXi)²][nΣYi² − (ΣYi)²]}
Xi    Yi    Xi²    Yi²    XiYi
1     3     1      9      3
2     15    4      225    30
3     6     9      36     18
4     20    16     400    80
5     9     25     81     45
6     25    36     625    150
ΣXi = 21   ΣYi = 78   ΣXi² = 91   ΣYi² = 1376   ΣXiYi = 326
γ = [(6 × 326) − (21 × 78)] / √{[(6 × 91) − (21)²][(6 × 1376) − (78)²]}

γ = 318 / (10.247 × 46.605) = 0.667

This suggests a fairly high degree of correlation between the X and Y series, i.e. between sales and advertising expenditure.
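The short-cut formula of Solved Example 2 can likewise be checked in Python:

```python
from math import sqrt

X = [1, 2, 3, 4, 5, 6]
Y = [3, 15, 6, 20, 9, 25]
n = len(X)

sx, sy = sum(X), sum(Y)                   # 21, 78
sxx = sum(x * x for x in X)               # 91
syy = sum(y * y for y in Y)               # 1376
sxy = sum(x * y for x, y in zip(X, Y))    # 326

# Short-cut formula: no deviations from the mean are needed
r = (n * sxy - sx * sy) / sqrt((n * sxx - sx**2) * (n * syy - sy**2))
print(round(r, 2))   # 0.67
```

The raw sums reproduce the table totals, and the formula gives the same γ ≈ 0.67 as the hand calculation.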
In case the magnitude of the data is large, using the two methods explained above causes a lot of
inconvenience while calculating the correlation coefficient by Karl Pearson's method. So we take
deviations from some convenient numbers to reduce the magnitude of the data. There will be no change in
the value of the correlation coefficient even if deviations are taken. We define ui = Xi − A and vi = Yi − B,
where A and B can be any arbitrary assumed values. The formulae are given below:
V(ui) = Σui²/n − ū²;  V(vi) = Σvi²/n − v̄²;  Cov(ui, vi) = Σuivi/n − ūv̄

γ = Cov(ui, vi) / √[V(ui) V(vi)]

which simplifies to:

γ = [nΣuivi − ΣuiΣvi] / √{[nΣui² − (Σui)²][nΣvi² − (Σvi)²]}
Solved Example 3:
Using the short-cut method, we calculate γ for the following data of Xi = advertising expenditure
(Rupees in thousands) and Yi = sales (Rupees in lakhs). Let us define A = 60 and B = 70 as the two
assumed values, so that ui = Xi − 60 and vi = Yi − 70. For the n = 10 pairs of observations, the
column totals work out to:

Σui = 50, Σvi = −40, Σui² = 5648, Σvi² = 2384, Σuivi = 2540

ū = Σui/n = 50/10 = 5;  v̄ = Σvi/n = −40/10 = −4

γ = [nΣuivi − ΣuiΣvi] / √{[nΣui² − (Σui)²][nΣvi² − (Σvi)²]}

γ = [(10 × 2540) − (50 × −40)] / √{[(10 × 5648) − (50)²][(10 × 2384) − (−40)²]}

γ = 27400 / √(53980 × 22240) = 27400 / 34648.4 = 0.79

Hence the correlation between the X and Y series is fairly high, as the coefficient of correlation is 0.79.
i) The value of the correlation coefficient γ varies between [−1, +1]. This indicates that the absolute value of γ does not exceed unity.
ii) The sign of γ depends on the sign of the covariance.
iii) If γ = −1, the variables are perfectly negatively correlated.
iv) If γ = +1, the variables are perfectly positively correlated.
v) If γ = 0, the variables are not correlated in a linear fashion. There may be a nonlinear relationship between the variables.
vi) The correlation coefficient is independent of change of scale and shifting of origin. In other words, shifting the origin and changing the scale do not have any effect on the value of the correlation.
Let us see the following example to understand the concept 'if γ = 0, the variables are not correlated in a linear fashion; there may be a nonlinear relationship between the variables'.
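The invariance property under change of origin and scale can be demonstrated numerically; the shift and scale factors below (×10 + 100, ×5 − 30) are arbitrary illustrative choices:

```python
from math import sqrt

def pearson(X, Y):
    """Karl Pearson's coefficient of correlation (short-cut formula)."""
    n = len(X)
    sx, sy = sum(X), sum(Y)
    num = n * sum(x * y for x, y in zip(X, Y)) - sx * sy
    den = sqrt((n * sum(x * x for x in X) - sx**2)
               * (n * sum(y * y for y in Y) - sy**2))
    return num / den

X = [2, 3, 4, 5, 6]
Y = [7, 9, 10, 14, 15]

r1 = pearson(X, Y)
# Shift the origin and change the scale of both variables
r2 = pearson([10 * x + 100 for x in X], [5 * y - 30 for y in Y])
print(round(r1, 6) == round(r2, 6))   # True: γ is unaffected
```

However the data are coded, γ comes out the same, which is exactly why the step-deviation method of 4.4.2 is legitimate.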
Solved Example 4:
Consider the following data, in which Yi = Xi²:
Xi Yi Xi 2 Yi 2 X i Yi
-3 9 9 81 -27
-2 4 4 16 -8
-1 1 1 1 -1
0 0 0 0 0
1 1 1 1 1
2 4 4 16 8
3 9 9 81 27
Totals: ΣXi = 0, ΣYi = 28, ΣXi² = 28, ΣYi² = 196, ΣXiYi = 0
γ = [nΣXiYi − ΣXiΣYi] / √{[nΣXi² − (ΣXi)²][nΣYi² − (ΣYi)²]}

γ = [(7 × 0) − (0 × 28)] / √{[(7 × 28) − (0)²][(7 × 196) − (28)²]}

γ = 0 / √(196 × 588) = 0
The result γ = 0 does not mean that the variables Xi and Yi are unrelated; it can only be said that the
variables are linearly uncorrelated. In fact, if we look closely at the data of Xi and Yi, it can be
observed that Yi = Xi² is the relationship existing between Xi and Yi. This is a nonlinear relationship
between the variables. Karl Pearson's coefficient of correlation cannot measure a nonlinear relationship
between the variables.
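That γ vanishes exactly for this perfectly determined but nonlinear data can be checked with integer arithmetic:

```python
X = [-3, -2, -1, 0, 1, 2, 3]
Y = [x * x for x in X]   # perfect nonlinear relation Y = X²
n = len(X)

# Numerator of Pearson's short-cut formula, in exact integer arithmetic
numerator = n * sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y)
print(numerator)   # 0: γ = 0 even though Y is completely determined by X
```

The numerator is exactly zero because the symmetric positive and negative X values cancel, so γ = 0 despite the perfect functional dependence.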
When the number of observations is large, the data are often classified into a two-way frequency
distribution, i.e. a table wherein the values of one variable (X) are represented in the rows while those of the
other variable (Y) are in the columns. These values can be either discrete or continuous. The frequencies in each
class are shown in the cells in the body of the table.
Solved Example 5:
The following bivariate frequency table relates advertising expenditure (X) to sales revenue (Y). The class mid-points (mp) are coded as step deviations dx and dy, and the figures in brackets are the cell values of f.dx.dy.

              X:       5–15    15–25   25–35   35–45
              mp:      10      20      30      40
Y             dx:      −1      0       1       2      | f     f.dy    f.dy²   f.dx.dy
75–125   mp 100, dy −1:  3(3)    4(0)    4(−4)   8(−16) | 19    −19     19      −17
125–175  mp 150, dy 0:   8(0)    6(0)    5(0)    7(0)   | 26    0       0       0
175–225  mp 200, dy 1:   2(−2)   2(0)    3(3)    4(8)   | 11    11      11      9
225–275  mp 250, dy 2:   3(−6)   3(0)    2(4)    2(8)   | 10    20      40      6
Total    f:              16      15      14      21     | N = 66  Σf.dy = 12  Σf.dy² = 70  Σf.dx.dy = −2
         f.dx:           −16     0       14      42     | Σf.dx = 40
         f.dx²:          16      0       14      84     | Σf.dx² = 114
         f.dx.dy:        −5      0       3       0      | Σf.dx.dy = −2
γ = [Σf.dx.dy − (Σf.dx)(Σf.dy)/N] / √{[Σf.dx² − (Σf.dx)²/N][Σf.dy² − (Σf.dy)²/N]}

γ = [−2 − (40 × 12)/66] / √{[114 − (40)²/66][70 − (12)²/66]}

γ = −9.27 / √(89.76 × 67.82) = −9.27 / (9.47 × 8.24) = −0.119
This shows a very low degree of negative correlation between advertising expenditure (X) and sales
revenue (Y).
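The step-deviation computation for the grouped data can be replayed as below (a Python sketch, not part of the text; the column totals ∑fdx = 40 and ∑fdx² = 114 are reconstructed values, since those totals are partly garbled in the printed table):

```python
from math import sqrt

# Step-deviation form of Pearson's r for a two-way frequency table.
# Totals taken from the worked example; ∑fdx and ∑fdx² are assumed
# reconstructions consistent with the printed intermediate value -9.27.
N = 66
sfdx, sfdy = 40, 12
sfdx2, sfdy2 = 114, 70
sfdxdy = -2

num = sfdxdy - sfdx * sfdy / N                  # -2 - 480/66 = -9.27
den = sqrt(sfdx2 - sfdx ** 2 / N) * sqrt(sfdy2 - sfdy ** 2 / N)
r = num / den
print(round(r, 3))  # -0.119
```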
It is not possible to express attributes such as character, conduct, honesty, beauty, morality, intellectual
integrity, etc. in numerical terms. Such attributes can, however, be ranked: for example, it is easy for a
class teacher to arrange the students in his class in ascending or descending order of intelligence, i.e.,
to rank them according to their intelligence. Hence, in problems that involve attributes of the type
mentioned above, the coefficient of correlation is based entirely on the rank differences between
corresponding items.
We may have two types of numerical problems in rank correlation:
a) When actual ranks are given
b) When ranks are not given
rs = 1 − [6∑d²] / [N(N² − 1)]
where rs denotes Spearman's rank correlation coefficient and N denotes the number of pairs of observations.
In the second case, when the ranks are not given and only the actual data are available, we have to
assign ranks ourselves. We may do so by taking either the highest value or the lowest value as rank 1.
When two observations are equal, the normal practice is to assign the average of their ranks to both
observations.
rs = 1 − [6∑d²] / [N(N² − 1)]

∑d² = 24

rs = 1 − (6 × 24) / [10(10² − 1)]

rs = 1 − 144/990 = 0.855
The rank correlation coefficient (0.855) shows that there is a very high degree of correlation between
ranks obtained in subject A and Subject B of the ten students.
When the ranks are not given:
Solved Example 7:
Compute Spearman's coefficient of correlation between marks assigned to ten students by Judges
X and Y in a certain competitive test, as shown below:
Student No 1 2 3 4 5 6 7 8 9 10
Marks by Judge X 43 56 29 81 96 34 73 62 48 76
Marks by Judge Y 15 26 34 86 19 29 83 67 51 58
∑d² = 106

rs = 1 − [6∑d²] / [N(N² − 1)]

rs = 1 − (6 × 106) / [10(10² − 1)]

rs = 1 − 636/990 = 0.36
The rank correlation coefficient (0.36) shows that there is a low degree of correlation between marks
assigned by Judge X and Judge Y to the ten students.
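The ranking and the rank-difference formula of Solved Example 7 can be reproduced in a few lines of Python (an illustrative sketch; the code is not part of the text). The highest mark is taken as rank 1, and there are no ties in either series:

```python
# Spearman's rank correlation for Solved Example 7 (marks by two judges).
def ranks(values):
    # Rank 1 = highest value; valid here because all values are distinct.
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

x = [43, 56, 29, 81, 96, 34, 73, 62, 48, 76]
y = [15, 26, 34, 86, 19, 29, 83, 67, 51, 58]
d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
n = len(x)
rs = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(d2, round(rs, 2))  # 106 0.36
```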
Solved Example 8:
Obtain the rank correlation between variables X( Price of commodity A in Rs) and Y( Price of
commodity B in Rs) from the following pairs of observed values.
X 24 29 23 38 46 52 41 36 68 56
Y 110 126 145 131 163 158 131 129 154 140
Solution:
  X    Rank of X     Y    Rank of Y    Difference       Squared difference
         (R1)               (R2)      (d = R1 − R2)          (d²)
 24        9        110      10           -1                  1
 29        8        126       9           -1                  1
 23       10        145       4            6                 36
 38        6        131       6.5         -0.5                0.25
 46        4        163       1            3                  9
 52        3        158       2            1                  1
 41        5        131       6.5         -1.5                2.25
 36        7        129       8           -1                  1
 68        1        154       3           -2                  4
 56        2        140       5           -3                  9
                                                        ∑d² = 64.5
In the data, there are two equal values in the Y series, i.e., 131, which tie for the ranks 6 and 7.
The average of ranks 6 and 7, i.e., 6.5, is therefore assigned to both observations.
In this data we find common (tied) ranks in the second series (Y). Therefore the formula for the
coefficient of correlation through the rank-differences method has to be modified as given below:

rs = 1 − 6[∑d² + (m1³ − m1)/12 + (m2³ − m2)/12 + (m3³ − m3)/12 + ...] / [N(N² − 1)]

where m1, m2, m3, ... stand for the number of items in the respective groups with common ranks. In this
problem there is only one group, having two tied items (two common ranks), hence m1 = 2.
rs = 1 − 6[∑d² + (m1³ − m1)/12] / [N(N² − 1)]

rs = 1 − 6[64.5 + (2³ − 2)/12] / [10(10² − 1)]

rs = 1 − 6[64.5 + 0.5]/990 = 1 − 390/990 = 0.61
The rank correlation coefficient (0.61) shows that there is a moderate correlation between X and Y.
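The tie-corrected computation of Solved Example 8 can be sketched in Python as follows (illustrative code, not from the text; tied values receive the average of the ranks they occupy, and each tied group of size m contributes (m³ − m)/12 to ∑d²):

```python
from collections import Counter

# Tie-corrected Spearman's rs for Solved Example 8.
def avg_ranks(values):
    order = sorted(values, reverse=True)        # rank 1 = highest value
    first = {}
    for i, v in enumerate(order):
        first.setdefault(v, i + 1)              # first rank of each value
    counts = Counter(values)
    # A group of m equal values starting at rank p gets rank p + (m-1)/2.
    return [first[v] + (counts[v] - 1) / 2 for v in values]

x = [24, 29, 23, 38, 46, 52, 41, 36, 68, 56]
y = [110, 126, 145, 131, 163, 158, 131, 129, 154, 140]
d2 = sum((a - b) ** 2 for a, b in zip(avg_ranks(x), avg_ranks(y)))
correction = sum((m ** 3 - m) / 12 for m in Counter(y).values() if m > 1)
n = len(x)
rs = 1 - 6 * (d2 + correction) / (n * (n ** 2 - 1))
print(d2, round(rs, 2))  # 64.5 0.61
```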
When γ = 1, −1, or 0, the interpretation of γ does not pose any problem. When γ = 1 or −1, all the
points lie on a straight line in a graph, showing a perfect positive or negative correlation. When the
points are extremely scattered on a graph, then it becomes evident that there is almost no relationship
between the two variables. However, when it comes to other values of γ, we have to be careful in its
interpretation. Suppose we get a correlation of γ = 0.9; we may be tempted to say that γ = 0.9 is 'twice
as good' or 'twice as strong' as a correlation of γ = 0.45. This comparison is wrong. The strength of γ
is judged by the coefficient of determination, γ². For γ = 0.9, γ² = 0.81; multiplying by 100 gives 81
per cent. Thus, when γ = 0.9, we can say that 81 per cent of the total variation in the Y series can be
attributed to the relationship with X.
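A two-line check (Python, illustrative only) makes the point concrete: measured by explained variation, γ = 0.9 is about four times as strong as γ = 0.45, not twice:

```python
# Compare r^2 (per cent of variation explained), not r itself.
r_high, r_low = 0.9, 0.45
exp_high = r_high ** 2 * 100   # about 81% of variation explained
exp_low = r_low ** 2 * 100     # about 20% of variation explained
print(exp_high / exp_low)      # about 4, not 2
```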
Correlation helps us in having an idea about the degree and direction of the relationship between two
variables under study. It fails to reflect upon the cause-and-effect relationship between the variables. If
the variables have a cause-and-effect relationship, they are bound to vary in sympathy with each
other and hence there is a high degree of correlation. In other words, causation always implies
correlation, but the converse is not true, i.e., even a fairly high degree of correlation between two
variables need not imply a cause-and-effect relationship between them. For example, if X = number of
patients admitted into a super-specialty hospital and Y = number of space shuttles launched, one may
get a numerical value for γ, but it does not signify any relationship between these variables.
4.10. SUMMARY:
The importance and the concept of correlation, and the utility of the scatter diagram which suggests a
relationship between two variables, were discussed. Pearson's coefficient of correlation, a measure
of the degree of association between variables, was presented. The distinction between linear and non-
linear correlation, positive and negative correlation, and simple, partial and multiple correlation
has been brought out. Different types of correlation and their applications are covered. Finally, the
coefficient of correlation and Spearman's rank correlation for bivariate and grouped data have been
explained.
4.11. GLOSSARY:
Correlation Analysis: Analysis of statistical data that is concerned with the question of whether there is
a relationship between two variables.
Covariance: A joint variation between the variables X and Y. The covariance of X and Y
is the average of the products of the deviations from the means for the
n pairs of the X and Y series.
Rank Correlation: A method to determine correlation when the data are not available in numerical
form and, as an alternative, the method of ranking is used.
Scatter Diagram: An ungrouped plot of two variables on the X and Y axes. A plot of the paired
observations of X and Y that shows a broad pattern of the relationship
between the two variables.
4.12. REFERENCES:
1. Gupta, S.C., and Kapoor, V.K., "Fundamentals of Mathematical Statistics", Sultan Chand and
Sons, New Delhi, 1997, 9th Edition.
2. Gupta, S.C., "Fundamentals of Statistics", Himalaya Publishing House, New Delhi, 2004.
3. Murray R. Spiegel and Larry J. Stephens, "Statistics - Schaum's Outlines", Third edition,
McGraw-Hill International Editions, 1999.
4. Levin, Richard I. and Rubin, David S., "Statistics for Management", PHI, New Delhi, 2000.
5. Sancheti, D.C., and Kapoor, V.K., "Business Statistics", New Delhi.
6. Gupta, S.P. and Gupta, M.P., "Business Statistics", Sultan Chand & Sons, New Delhi, 2001.
4.13. REVIEW QUESTIONS:
1. Why is rank correlation important in business statistics? How does it differ from Karl Pearson's
coefficient of correlation?
2. What are the different methods of finding correlation between the two variables?
3. Find the coefficient of correlation between the sales and expenses of the ten firms given below and
comment.
Sales 50 50 55 60 65 65 65 60 60 50
Expenses 11 13 14 16 16 15 15 14 13 13
X 26 29 31 34 42 48 51 54 63 57
Y 19 21 27 31 36 42 49 51 59 60
X 78 89 96 69 59 79 68 61
6. Find the product moment coefficient of correlation between X and Y from the following data.
Marks in English (X)      65  66  67  67  68  69  70  72
Marks in Mathematics (Y)  67  68  65  68  72  72  69  71
7. Find the product moment coefficient of correlation between X and Y from the following data.
8. Two persons were asked to watch ten specified TV programmes and offer their evaluation by
ranking them 1 to 10. These ranks are given below. Compute the rank correlation coefficient.
TV Programme A B C D E F G H I J
Ranks given by X 4 6 3 9 1 5 2 7 10 8
Ranks given by Y 2 3 4 9 5 7 1 10 8 6
9. Campus Stores has been selling the Believe It or Not: Wonders of Statistics Study Guide for 12
semesters and would like to estimate the relationship between sales and number of sections of
elementary statistics taught in each semester. The following data have been collected:
Sales (Units) 33 38 24 61 52 45
Number of sections 3 7 6 6 10 12
Sales (units) 65 82 29 63 50 79
Number of sections 12 13 12 13 14 15
(a) Develop the estimating equation that best fits the data.
(b) Calculate the sample coefficient of determination and the sample coefficient of correlation.
10. The director of a management training programme is interested to know whether there
is a positive association between a trainee's score prior to his/her joining the programme and the same
trainee's score after the completion of the training. The director has obtained the scores of 10 trainees
as follows:
Trainee 1 2 3 4 5 6 7 8 9 10
Rank score1 1 4 10 8 5 7 3 2 6 9
Rank score2 2 3 9 10 3 6 1 6 7 8
5. REGRESSION ANALYSIS
5.1. INTRODUCTION:
Literally, the word "regression" means 'stepping back or returning to the average value'. Regression
was first used by Sir Francis Galton, a British biometrician, in the 19th century in studies on
estimating the extent to which the stature of the sons of tall parents regresses back to the mean stature of
the population. The interesting features of the study were:
i) Tall fathers have tall sons and short fathers have short sons,
ii) The average height of the sons of a group of tall fathers is less than that of the fathers and the
average height of the sons of a group of short fathers is more than that of the fathers.
In today's world, regression is used in varied fields with diversified areas of application. It is
especially used in Business Management and Economics to study the relationship between two or more
economic variables that are causally related, and for the estimation of
• Demand and supply curves,
• Cost functions,
• Production and consumption functions
• The effect of Advertising on sales and
• Prediction of sales with given advertising expenditure and so on.
Prediction (estimation) is one of the major areas of interest in most spheres of human activity.
Estimation of future production, consumption, prices, sales, income, profits, investments, demand, etc.
is of paramount importance to businessmen. Population estimates and projections are indispensable
for efficient planning of an economy.
In Pharmaceutical industry, it is important and necessary to estimate the effect of a new drug on
patient, its ill effects, if any, in short as well as in long run.
5.2 REGRESSION:
According to M.M Blair, "Regression analysis is a mathematical measure of the average relationship
between two or more variables in terms of the original units of the data". If regression analysis is
confined only to two variables then it is termed as Simple regression.
Business examples:
1. Expenditure of a person depends on his income,
2. Yield of a crop depends on the rainfall,
3. The demand of a product depends on its price etc.
If we plot the given bivariate data on a graph paper, and obtain the scatter diagram, the points on the
diagram will more or less concentrate around a curve called "curve of regression". Often, such a curve
is not distinct, is quite confusing and is sometimes complicated too. A graph (Fig. 5.1) is given below
for the purpose of illustration.
If the regression curve is a straight line, we say that the variables under study are related in linear
fashion. The regression equation for such a relationship is yi = a + b xi, a straight line. In case of a
linear regression, the values of the dependent variable increase or decrease by a constant amount for a
unit change in the independent variable.
If the regression curve is not a straight line, then the regression is called non-linear (curvilinear)
regression. The regression equation will involve square terms like x², y² and/or product terms like xy,
xz and so on. We confine ourselves to linear regression between two variables only.
Line of regression is the line, which gives the best estimate of one variable for any given value of the
other variable. In case of two variables x and y, we shall have two lines of regression, called, a)
Regression of y on x and b) Regression of x on y.
Definition:
• Line of regression of y on x is the line which gives the best estimate of the value of y for
any specified (or given) value of x.
• Line of regression of x on y is the line which gives the best estimate of the value of x for
any specified (or given) value of y.
The term best fit is interpreted in accordance with the "Principle of Least Squares" which consists of
minimizing the sum of the squares of the residuals estimated. The residual is also known as error and
is defined as "the deviation between the given observed value and the value of the estimate as given by
the line of best fit".
Let (x1,y1),(x2,y2),...,(xn,yn) be n pairs of observations on the variables x and y under study. Let yi = a
+ bxi be the line of best fit (regression) of y on x.
For the purpose of illustration, consider the scatter diagram in the figure below, which shows the
observed points P1(x2, y2) and Q1(x5, y5) together with the corresponding points P2(x2, a + bx2) and
Q2(x5, a + bx5) on the line of best fit y = a + bx. P1 and P2 have the same X co-ordinate; the Y
co-ordinate of P1 is P1M1, whereas that of P2 is P2M2.

[Figure: scatter diagram of the Y variable against the X variable, showing the line of best fit
y = a + bx, the observed points P1(x2, y2) and Q1(x5, y5), and the fitted points P2(x2, a + bx2)
and Q2(x5, a + bx5).]
The difference P1M1 − P2M2 = P1P2 is called the error. In particular, P2M2 = a + bx2
and P1M1 = y2. Hence the error is: e2 = y2 − (a + bx2).
Similarly, for Q1(x5, y5) and Q2(x5, a + bx5), we can define the error as: e5 = y5 − (a + bx5).
Here, e2 is positive and e5 is negative. As explained, an error can be negative, positive or zero
(zero if y2 = a + bx2).
The squared errors are defined as:
e2² = (y2 − a − bx2)²
e5² = (y5 − a − bx5)²
and the total squared error is:
E = Σ(yi − a − bxi)²
Sometimes Σ(yi − ye)² is used to represent E, where ye denotes the estimated value of y. By using the
maxima and minima principles of differential calculus we get two equations, called "NORMAL
EQUATIONS", as given below:
Σ yi = n a + b Σ xi
Σ xi yi = a Σ xi + b Σ xi²
Using these two normal equations, we can solve uniquely for the values of 'a' and 'b'.
Similarly, the normal equations for obtaining the regression equation of x on y, xi = a + b yi, are
Σ xi = n a + b Σ yi
Σ xi yi = a Σ yi + b Σ yi²
The values of 'a' and 'b' are obtained by using the above formulae, interchanging the variables
x and y.
Solved Example 1:
(From the data, n = 4, Σxi = 20, Σyi = 85, Σxi² = 120 and Σxiyi = 490.)
Σ yi = na + b Σ xi
Σ xi yi = a Σ xi + b Σ xi²
85 = 4a + 20 b --------- (1)
490 = 20 a + 120 b --------- (2)
To solve the above equations, we adopt the procedure of elimination. To eliminate 'a', we multiply the
first equation by 5 to get
425 = 20 a + 100 b
The second equation is taken as it is. Then, subtracting the second equation from the first equation:
425 = 20 a + 100 b
490 = 20 a + 120 b
(-)    (-)     (-)
−65 = 0 − 20 b
⇒ 20 b = 65 or b = 65/20 = 3.25
Substituting b in equation (1):
85 = 4 a + 20 (3.25)
⇒ 4 a = 85 − 65.00 = 20
⇒ a = 20/4 = 5
yi = 5 + 3.25 xi
This equation is very useful to predict the value of yi (maintenance cost) if we know xi, the age
(in years) of the car.
For example, we predict the annual maintenance cost for a car with 9 years age. This is obtained by
substituting xi = 9 in the equation yi = 5 + 3.25 xi. Annual maintenance cost of a car with xi = 9 years
age is
yi = 5 + 3.25 x 9
= 5 + 29.25 = 34.25 (hundreds of Rs.)
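The elimination done by hand above can be cross-checked with the closed-form least-squares formulas. The sketch below (Python, not part of the text) works directly from the sums of the example, since the individual data pairs are not listed:

```python
# Solving for a and b in Solved Example 1 from the given sums:
# n = 4, Σx = 20, Σy = 85, Σx² = 120, Σxy = 490.
n, sx, sy, sxx, sxy = 4, 20, 85, 120, 490

b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # slope: 260/80
a = (sy - b * sx) / n                           # intercept from eq. (1)
print(a, b)       # 5.0 3.25
print(a + b * 9)  # predicted maintenance cost at age 9: 34.25
```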
Σ xi yi = a Σ yi + b Σ yi²
By substituting the sums and products obtained from the above table, we have
20 = 4a + 85 b
490 = 85a + 2025 b
To eliminate 'a', we multiply the first equation by '85' and the second by '4'. The result is
1700 = 340 a + 7225 b
1960 = 340 a + 8100 b
Subtracting the first from the second gives 260 = 875 b, so
b = 260/875 = 0.297 ≈ 0.30
The value of 'a' can be obtained from
20 = 4a + 85(0.30)
4a = 20 − 25.5 = −5.5
a = −5.5/4 = −1.375
The regression equation of x on y is
xi = - 1.375 + 0.30 yi
Slope (b): It is the coefficient of the xi term in the equation yi = a + b xi. It shows how much the
dependent variable changes for a one-unit change in the independent variable. When positive, it gives
the increase in yi per unit increase in xi.
y-intercept (a): It is the value of yi where the line yi = a + b xi cuts the Y-axis. It is the constant term
in the estimation equation.
The values of 'a' (y-intercept) and 'b' (slope) are calculated using the following formulae:
a = [(∑xi²)(∑yi) − (∑xi)(∑xiyi)] / [n∑xi² − (∑xi)²]   and

b = [n∑xiyi − (∑xi)(∑yi)] / [n∑xi² − (∑xi)²]
Solved Example 2:

a = [(∑xi²)(∑yi) − (∑xi)(∑xiyi)] / [n∑xi² − (∑xi)²]

a = [(120)(85) − (20)(490)] / [4(120) − (20)²]

  = (10200 − 9800) / (480 − 400)

  = 400/80 = 5

b = [n∑xiyi − (∑xi)(∑yi)] / [n∑xi² − (∑xi)²]

b = [(4)(490) − (20)(85)] / [4(120) − (20)²]

  = (1960 − 1700) / (480 − 400)
  = 260/80 = 3.25
The values of 'a' and 'b' obtained are the same as those obtained using the normal equations. Similarly, we
can estimate 'a' and 'b' for the regression equation of x on y by interchanging x and y in the above
formulae. Let us look at the two regression equations obtained above.
Regression equation of y on x: yi = 5 + 3.25 xi
The regression co-efficient 3.25 is called as the regression co-efficient of y on x and is denoted by
byx = 3.25.
Similarly the regression co-efficient 0.30 is called as the regression co-efficient of x on y and is
denoted by bxy = 0.30.
On the basis of regression equation of Y on X, we can find out the value of Y for any value of X.
Similarly, we can find out the value of X for any value of Y based on the regression equation of X on
Y.
1. Both regression coefficients will take the same sign, i.e., byx and bxy will
be either both positive or both negative.
2. The correlation coefficient is the G.M. of the regression coefficients, i.e.,
γ = ±√(bxy · byx)
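Applying this property to the coefficients obtained in the worked example above (byx = 3.25, bxy = 0.30) gives a quick estimate of γ. A small Python sketch (not part of the text):

```python
from math import sqrt

# r as the geometric mean of the two regression coefficients.
byx, bxy = 3.25, 0.30
r = sqrt(byx * bxy)   # both coefficients positive, so r is positive
print(round(r, 3))    # 0.987
```

Note that the product byx · bxy can never exceed 1, since γ² ≤ 1.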
Solved Example 3:
Given x̄ = 15, ȳ = 110, V(x) = 25, V(y) = 625 and γ(x, y) = 0.81,
we use the following procedure to get the regression equations.
i) Regression equation of y on x:
(yi − ȳ) = γ √[V(y)/V(x)] (xi − x̄)
or
(yi − ȳ) = γ (σy/σx) (xi − x̄)
Since byx = γ σy/σx,
(yi − ȳ) = byx (xi − x̄)
(yi − 110) = (0.81) (25/5) (xi − 15)
yi = 49.25 + 4.05 xi
ii) Regression equation of x on y:
(xi − x̄) = γ √[V(x)/V(y)] (yi − ȳ)
or
(xi − x̄) = γ (σx/σy) (yi − ȳ)
Since bxy = γ σx/σy,
(xi − x̄) = bxy (yi − ȳ)
(xi − 15) = (0.81) (5/25) (yi − 110)
xi = 15 + 0.162 (yi − 110)
xi = −2.82 + 0.162 yi
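Solved Example 3 can be replayed from the summary figures alone (a Python sketch, not part of the text), using byx = γ·σy/σx and bxy = γ·σx/σy:

```python
from math import sqrt

# Regression lines from means, variances and r (Solved Example 3).
xbar, ybar = 15, 110
sdx, sdy = sqrt(25), sqrt(625)
r = 0.81

byx = r * sdy / sdx          # slope of y on x
bxy = r * sdx / sdy          # slope of x on y
a_yx = ybar - byx * xbar     # y on x: y = 49.25 + 4.05 x
a_xy = xbar - bxy * ybar     # x on y: x = -2.82 + 0.162 y
print(round(a_yx, 2), round(byx, 2))   # 49.25 4.05
print(round(a_xy, 2), round(bxy, 3))   # -2.82 0.162
```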
In certain situations we need to estimate the regression coefficients, means and variances of the
variables directly from the given regression equations.
Solved Example 4:
Let the two regression equations be
3X + 2Y – 26 = 0
6X + Y – 31 = 0
i) To find the means, we solve these two simultaneous linear equations by the process of elimination.
For this, we multiply the 2nd equation by '2' to get
12X + 2Y – 62 = 0
Subtract the 1st equation from the 2nd equation, we get
9X – 36 = 0
or X = 36/9 = 4.
Substitute X = 4 in the 1st equation, to get
3(4) + 2Y – 26 = 0
or 2Y = 26 – 12 = 14
or Y = 7
Since the regression equations will pass through their respective means, we can say that
X = 4 and Y = 7
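The same elimination can be written as a 2x2 solve (Cramer's rule); the sketch below (Python, not part of the text) recovers the means as the intersection of the two lines:

```python
# Means as the intersection of the two regression lines of Example 4:
# 3X + 2Y = 26 and 6X + Y = 31, solved by Cramer's rule.
a1, b1, c1 = 3, 2, 26
a2, b2, c2 = 6, 1, 31
det = a1 * b2 - a2 * b1          # 3*1 - 6*2 = -9
x = (c1 * b2 - c2 * b1) / det    # (26 - 62) / -9 = 4
y = (a1 * c2 - a2 * c1) / det    # (93 - 156) / -9 = 7
print(x, y)  # 4.0 7.0
```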
ii) The regression equations given above are used to get 'γ', the correlation coefficient.
From the first equation (regression of Y on X), byx = −3/2; from the second (regression of X on Y),
bxy = −1/6.
γ = ±√(byx · bxy)
γ = ±√[(−3/2) × (−1/6)]
γ = ±1/2
Since byx and bxy are negative, γ will be negative. So
γ = −1/2 = −0.5
iii) We can use the regression equations for estimating the unknown standard deviation (variance)
of one of the variables, given that of the other. Suppose σx = 5 is given. Since
bxy = γ (σx/σy),
−1/6 = (−0.5) × (5/σy)
σy = (−6) × (−0.5) × 5 = 15
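Parts (ii) and (iii) can be checked together in Python (an illustrative sketch, not part of the text; the printed statement of part (iii) is partly garbled, so σx = 5 is taken here as the given value, which is what the printed figures imply):

```python
from math import sqrt

# byx and bxy read off the two regression lines of Solved Example 4.
byx = -3 / 2      # from 3X + 2Y - 26 = 0 rewritten as Y on X
bxy = -1 / 6      # from 6X + Y - 31 = 0 rewritten as X on Y
r = -sqrt(byx * bxy)          # both coefficients negative, so r < 0
sigma_x = 5                   # assumed given (see note above)
sigma_y = r * sigma_x / bxy   # from bxy = r * sigma_x / sigma_y
print(round(r, 1), round(sigma_y, 1))  # -0.5 15.0
```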
Solved Example 5:
Compute the two regression equations on the basis of the following information:
Parameters x y
Mean 40 45
Standard deviation 10 9
Karl Pearson's correlation coefficient γ = 0.5. Also calculate the value of y for x = 48, using the
appropriate regression equation.
Regression equation of y on x:
(yi − ȳ) = γ (σy/σx) (xi − x̄)
Substituting the values of x̄, ȳ, σx, σy and γ given in the problem into the above equation, we get:
(yi − 45) = 0.5 (9/10) (xi − 40)
yi = 45 + 0.45 (xi − 40)
yi = 45 + 0.45 xi − 18
yi = 27 + 0.45 xi
Regression Equation of x on y:
(xi − x̄) = γ (σx/σy) (yi − ȳ)
Substituting the values of x̄, ȳ, σx, σy and γ given in the problem into the above equation, we get:
(xi − 40) = 0.5 (10/9) (yi − 45)
xi = 40 + 0.556 (yi − 45)
xi = 40 + 0.556 yi − 25.02
xi = 14.98 + 0.556 yi
In order to estimate the value of y for x = 48, we have to use the regression equation of y on x:
yi = 27 + 0.45 xi
yi = 27 + (0.45 × 48)
y = 27 + 21.6
y = 48.6
Therefore, the value of y is 48.6 when the value of x = 48.
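Solved Example 5 can likewise be recomputed from the summary statistics (a Python sketch, not part of the text), including the estimate at x = 48:

```python
# Both regression lines of Solved Example 5 from the summary figures.
xbar, ybar, sdx, sdy, r = 40, 45, 10, 9, 0.5

byx = r * sdy / sdx          # 0.45
bxy = r * sdx / sdy          # about 0.556

def y_on_x(x):               # y = 27 + 0.45 x
    return ybar + byx * (x - xbar)

def x_on_y(y):               # x = 14.98 + 0.556 y (approximately)
    return xbar + bxy * (y - ybar)

print(round(y_on_x(48), 1))  # 48.6
```

As a sanity check, each line passes through the point of means: x_on_y(45) returns 40.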
The following are the points of difference between correlation and regression:
i) Whereas the coefficient of correlation is a measure of the degree of co-variability between X and Y,
the objective of regression analysis is to study the nature of the relationship between the variables
so that we may be able to predict the value of one on the basis of another.
ii) Correlation is merely a tool for ascertaining the degree of relationship between two variables
and, therefore, we cannot say that one variable is the cause and the other the effect. For example, a
high degree of correlation between price and demand for a certain commodity at a particular
point of time may not suggest which is the cause and which is the effect. However, in
regression analysis one variable is taken as independent while the other is taken as dependent, thus
making it possible to study the cause-and-effect relationship.
iii) In correlation analysis, the coefficient of correlation is a measure of the direction and degree of
linear relationship between two variables X and Y. It is symmetric: the correlation between X and Y
is the same as that between Y and X. It is also immaterial which of X and Y is the dependent
variable and which is the independent variable. In regression analysis, the regression coefficients of
Y on X and X on Y are not symmetric. Hence it definitely makes a difference as to which variable is
dependent and which variable is independent.
iv) There may be nonsense correlation between two variables which is purely due to chance and
has no practical relevance, such as an increase in income and an increase in the weight of a group of
people. However, there is nothing like nonsense regression.
v) The correlation coefficient is independent of change of scale and origin. Regression coefficients
are independent of change of origin but not of scale.
5.11. SUMMARY:
Correlation measures the degree of relationship between two variables. Regression helps in exploiting
the relationship existing between two variables so that they can be linked through a
regression equation. The regression equation helps in predicting the future behaviour of the dependent
variable given the value of the independent variable. Prediction plays a key role in formulating a
strategic decision framework for management. The coefficient of correlation is a measure of the degree
of co-variability between X and Y, while the objective of regression analysis is to study the nature of
the relationship between the variables so that we may be able to predict the value of one on the basis
of the other.
5.12. GLOSSARY:
Dependent Variable: The variable of interest or focus, which is influenced by
one or more independent variable(s). The variable that is being
predicted or explained. It is denoted by Y in the regression equation.
Independent (or explanatory) Variable: A variable that can be set either to a desired value or takes values
that can be observed but not controlled. The variable that does
the predicting or explaining. It is denoted by X in
the regression equation.
Regression Line: A line of best fit, which can always be found for a scatter diagram
by using the method of least squares.
5.13. REFERENCES
1) Gupta, S.C., "Fundamentals of Statistics", Himalaya Publishing House, New Delhi, 2004.
2) Murray R. Spiegel and Larry J. Stephens, "Statistics - Schaum's Outlines", Third edition,
McGraw-Hill International Editions, 1999.
3) Sancheti, D.C., and Kapoor, V.K., "Business Statistics", New Delhi.
5.14. REVIEW QUESTIONS:
2. State some of the important properties of regression coefficients. How are these helpful in analysing
the regression lines?
3. An investigation into the demand for television sets in 7 towns has resulted in the following data:
Town A B C D E F G
Population (lakh) (x) 11 14 14 17 17 21 25
No. of TV sets demanded(Â000)(y) 15 27 27 30 34 38 46
Fit a linear regression of y on x and estimate the demand for TV sets for a city with population of
(a) 20 lakh and (b) 32 lakh
X (Age of a Car) 1 3 5 7 9
Y (Maintenance Cost) 15 26 34 40 52
5. An enquiry into 50 families to study the relationship between Expenditure on Housing (Xi) and
Expenditure on food (Yi) give the following results.
∑ Xi = 8500, ∑ Yi = 8500, σx = 60, σy = 60, γ = + 0.6
i) Obtain the regression lines of X on Y and Y on X
ii) When X = 200, determine the value of Y
Obtain two regression equations and estimate i) the yield of crops when the rainfall is 22 cms
and ii) the rainfall when the yield is 600 kgs.
X = Yield (in Kg) Y=Rainfall (in cms)
Mean 508.4 26.7
S.D 36.8 4.6
γ = + 0.52
6. Estimate the sales for a given number of advertisements from the following data for
X = 10
X (No of Ads) 3 7 4 2 0 4 1 2
Y (Sales) 11 18 9 4 7 6 3 8
8. The following data relate to the scores obtained by 9 salesman of a company in an intelligence test
and their weekly sales in thousand rupees:-
Salesman:                 A   B   C   D   E   F   G   H   I
Intelligence Test Scores  50  60  50  60  80  50  80  40  70
Weekly Sales              30  60  40  50  60  30  70  50  60
(A) Obtain the regression equation of sales on intelligence test scores of the salesmen.
(B) If the intelligence test score of a salesman is 65, what would be his expected weekly sales?
9. The following table gives the aptitude test scores and productivity indices of 10 workers selected at
random:
Aptitude scores (X) 60 62 65 70 72 48 53 73 65 82
Productivity index(Y) 68 60 62 80 85 40 52 62 60 81
Calculate the two regression equations and estimate (i) the productivity index of a worker whose test
score is 92, and (ii) the test score of a worker whose productivity index is 75.
10. Bank of Lincoln is interested in reducing the amount of time people spend waiting to see a
personal banker. The bank is interested in the relationship between waiting time (Y) in minutes and
number of bankers on duty (X).
X 2 3 5 4 2 6 1 3 4 3 3 2 4
Y 12.8 11.3 3.2 6.4 11.6 3.2 8.7 10.5 8.2 11.3 9.4 12.8 8.2
6. TIME SERIES
6.1 INTRODUCTION:
A time series is a set of statistical observations arranged in chronological order. Time series are
usually used with reference to economic data, and economists are largely responsible for the
development of the techniques of time series analysis. A time series can also be used for different
purposes, such as managing production, studying the variation of different variables, or the growth or
decline of the profits of an organization. Businessmen forecast the future demand for their products in
order to plan production, so that overstocking or inadequate production can be avoided. Economists
estimate the future population on the basis of time series, so that the supply of goods, jobs, vehicles,
etc. required in the future can be analysed.
• Firstly, the analysis of a time series enables us to understand past behaviour or
performance. It is possible to know how the data have changed over time and to find out the
probable reasons responsible for such changes. If the past performance, say of a
company, has been poor, it can take corrective measures to arrest the poor performance.
• Secondly, a time series analysis helps directly in business planning. A firm can know the
long-term trend in the sales of its products. It can find out at what rate sales have been
increasing over the years. This may help it in making projections of its sales for the next
few years and in planning the procurement of raw material, equipment and manpower
accordingly.
• Thirdly, a time series analysis enables one to study such movements as cycles that
fluctuate around the trend. Knowledge of the cyclical pattern in certain series of data will be
helpful in making generalizations about the concerned business or industry.
• Finally, a time series analysis enables one to make meaningful comparisons between two or
more series regarding the rate or type of growth. For example, growth in consumption at the
national level can be compared with that in the national income over a specified period.
Such comparisons are of considerable importance to business and industry.
A time series may contain one or more of the following four components:
a) Secular Trend
b) Seasonal variations
c) Cyclical variations
d) Irregular variations
We can observe a steady increase or decrease in the variable over a period of time. The rate of
increase or decrease can also vary and, after a period of growth or decline, reverse itself, so that the
series enters a period of decline or growth. The various types of trend can be broadly classified into
- linear or straight line trend
- non-linear trend
Generally, the longer the period covered, the more significant the trend. When the period is short,
the secular movements cannot be expected to reveal themselves clearly, and the general drift of the
series may be unduly influenced by cyclical fluctuations. This can be stated as: the minimum
period to be considered in trend analysis is two to three cycles. It is not necessary that the rise or
fall must continue in the same direction throughout the period. As long as the period as a whole is
characterized by a persistent rise or fall (excepting a year here and there with a different tendency),
we can say that a secular trend is present.
The recurrent variations in a time series that usually last longer than a year and are regular neither in
amplitude nor in length are called cyclical variations. There are four well-defined periods or phases
in a cycle: prosperity, decline, depression and improvement. Knowledge of cyclical variations is
useful in framing suitable policies for stabilizing the level of business activity, i.e., for avoiding
booms and depressions, as both are bad for an economy.
Given any long-term series, we wish to determine and present the direction which it takes: is it
growing or declining? There are two important objectives of trend measurement:
In studying trend in and of itself, we need to ascertain the growth factor. The growth factor helps us in
predicting the future behaviour of the data. If a trend can be determined, the rate of change can also be
ascertained and tentative estimates concerning future can be made accordingly.
Example: The growth in the textile industry can be compared with the growth in the economy as a
whole or with the growth of other industries.
The elimination of trend leaves us with the seasonal, cyclical and irregular factors. These three
relatively short-term elements can then be studied divorced from the long-term factor.
Trend is measured in order to find out the trend characteristics in and of themselves and to enable us
to eliminate trend in order to study the other elements. The various methods that can be used for
determining trend are as follows:
• Free-hand or graphical method
• Semi-averages method
• Moving average method
• Method of least squares
This is the simplest method of studying the trend. The procedure of obtaining a straight line trend is
given below:-
[Figure: free-hand trend line fitted to annual sales (in thousands) for the years 1998 to 2005, with
sales on the Y axis (0 to 2500) and years on the X axis.]
If sales in the year 2006 are to be predicted, extend the graph free-hand and the corresponding sales
can be noted on the Y axis.
Merits:
• This is the simplest method of measuring trend.
• This method is very flexible in that it can be used regardless of whether the trend is a
straight line or a curve.
• The trend line drawn by a statistician experienced in computing trends and having
knowledge of the economic history of the concern or the industry under analysis may
be a better expression of the secular movement than a trend fitted by the use of a rigid
mathematical formula, which, while providing a good fit to the points, may have no
logical justification.
Limitations:
• This method is highly subjective because the trend line depends on the personal judgment
of the investigator, and therefore different persons may draw different trend lines from the
same set of data.
• Since free-hand curve fitting is subjective, it cannot have much value if it is used as a basis
for predictions.
• It is very time consuming to construct a free-hand trend if a careful and conscientious job
is done.
6.4.2 SEMI-AVERAGES METHOD:
[Figure: semi-averages trend line for "Prediction of sales", with sales in thousands (0–2500) on the
Y-axis plotted against the years 1996–2006 on the X-axis.]
If sales for the year 2006 are to be predicted, the graph is extended free-hand and the corresponding
sales figure is read off the Y-axis.
Merits:
• This method is simple to understand compared with the moving average method and the method of
least squares.
• It is an objective method of measuring trend, as everyone who applies it is bound to get the same
result (leaving aside arithmetic mistakes).
Limitations:
• This method assumes a straight-line relationship between the plotted points, regardless of
whether that relationship actually exists.
• The limitations of the arithmetic average automatically apply: if there are extremes in either
half of the series, or in both, the trend line will not give a true picture of the growth factor.
6.4.3 MOVING AVERAGE METHOD: The effect of averaging is to give a smoother curve, lessening the
influence of the fluctuations that pull annual figures away from the general trend.
Note: Since the moving average method is most commonly applied to data which are characterized by
cyclical movements, it is necessary to select a period for moving average which coincides with the
length of the cycle. Otherwise the cycle will not be entirely removed.
Ordinarily, the necessary period will range between three and ten years for general business series but
even longer periods are required for certain types of data.
The 3 yearly moving averages shall be computed as follows:
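The computation can be sketched in a few lines of code. This is a minimal illustration with hypothetical sales figures (the original table for the example is not reproduced in this extract): each trend value is the average of a year and its two neighbours, so the first and last years get no trend value.

```python
# 3-yearly moving averages over a hypothetical series of annual sales.
sales = [21, 22, 23, 25, 24, 22, 25, 26, 27, 26]  # hypothetical sales figures

def moving_average(values, period=3):
    """Centred moving averages: the first and last (period-1)/2 years
    get no trend value, just as in the worked tables."""
    half = (period - 1) // 2
    return [round(sum(values[i - half:i + half + 1]) / period, 2)
            for i in range(half, len(values) - half)]

trend = moving_average(sales)
print(trend)  # 8 trend values for the 10 years
```

Note how the smoothed series pulls the dips (e.g. the 22 in year six) back toward the general level of its neighbours.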
Example 2:
If the sales for 2005 are to be predicted, the graph can be extended and the corresponding value on the
Y-axis is the required sales figure.
Merits:
Demerits:
• Trend values cannot be computed for all years.
• One has to use one's own judgment in choosing the period of the averages.
• Since the moving average is not represented by a mathematical function, this method cannot
be used in forecasting.
• It is appropriate only when the trend is linear and the cyclical variations are regular in both
period and amplitude.
6.4.4 METHOD OF LEAST SQUARES: This method is the most widely used in practice. It is a
mathematical method, and with its help a trend line is fitted to the data in such a manner that the
following two conditions are satisfied:
(1) ∑(y − yc) = 0
i.e. the sum of the deviations of the actual values of y from the computed values yc is zero.
(2) ∑(y − yc)² is least
i.e. the sum of the squares of the deviations of the actual values from the computed values is least for
this line; hence the name method of least squares. The line obtained by this method is known as the
line of best fit.
yc = a + bx
where yc designates the trend value,
x represents time,
a is the computed trend figure of the y variable when x = 0, and
b is the slope of the trend line, i.e. the amount of change in the y variable associated with a
change of one unit in the x variable.
In order to determine the values of the constants a and b, the following two normal equations are to be
solved:
∑y = Na + b∑x
∑xy = a∑x + b∑x²
where N represents the number of years (months or any other period) for which data are given.
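The normal equations simplify greatly when time is coded so that ∑x = 0 (an odd number of years, with the middle year as origin): a becomes ∑y/N and b becomes ∑xy/∑x². A minimal sketch, with hypothetical sales figures:

```python
# Least-squares trend fit with time coded so that sum(x) = 0.
years = [1999, 2000, 2001, 2002, 2003]
y = [35, 42, 46, 51, 56]                  # hypothetical sales in thousands

n = len(y)
x = [i - (n - 1) // 2 for i in range(n)]  # deviations from middle year: -2..2

# With sum(x) = 0 the normal equations reduce to:
a = sum(y) / n                                                       # a = sum(y) / N
b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)  # b = sum(xy) / sum(x^2)

trend = [a + b * xi for xi in x]          # yc = a + bx for each year
print(a, b)
```

Here a is the trend value at the middle year (2001) and b the average yearly change, so future years are predicted by extending x beyond the observed range.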
[Figure: least-squares trend line, with sales in thousands (0–2000) on the Y-axis plotted against the
years 1994–2006 on the X-axis.]
Merits:
• This is a mathematical method of measuring trend, and as such there is no room for
subjectivity.
• The line obtained is the one from which the sum of the positive and negative deviations is zero
and the sum of the squares of the deviations is least,
i.e. ∑(y − yc) = 0 and ∑(y − yc)² is least.
• Trend values can be obtained for all the given time periods in the series
Limitations:
• Great care has to be taken in selecting the type of trend curve to be fitted, i.e. linear, parabolic,
or some other type.
• It is more tedious and time consuming than other methods.
• Predictions are based only on the long-term variation, i.e. the trend; the impact of cyclical,
seasonal, and irregular variations is ignored.
• Being a mathematical method it is not flexible: the addition of even one more observation
makes it necessary to redo all the computations.
Applications:
¾ Used by businessmen in different situations such as production, sales, exports, and imports.
¾ Economists use it for forecasting population trends.
¾ Unfavorable situations indicated by a time series can be avoided, since we can predict (at
least to some extent) based on the trend, so that uncertainty is reduced.
¾ Certain seasonal trends help in planning production in proportion to the trend.
¾ Sudden changes in demand or rapid technological progress cannot be known in advance, but
prediction to some extent makes it possible for the firm to cope with such contingencies.
Example 3:
Fit a trend line to the following data.

Year   Production of steel   Year   Production of steel
1995   20                    2000   25
1996   22                    2001   23
1997   24                    2002   26
1999   23                    2003   25
Solution:
Moving averages:
[Figure: moving-average trend, with values from 21.5 to 25 on the Y-axis plotted against the years
1994–2004 on the X-axis.]
yc = a + bX
a = ∑Y / N = 209 / 9 = 23.22
b = ∑XY / ∑X² = 34 / 60 = 0.56
Yc = 23.22 + 0.56X
Example 4:
Fit a straight-line trend by the method of least squares to the following data. Assuming the same rate
of change continues, what would be the predicted sales for the year 1998?
Solution:
yc = a + bX
Since ∑X = 0,
a = ∑Y / N = 1052 / 8 = 131.5
b = ∑XY / ∑X² = 616 / 42 = 14.67
Y = 131.5 + 14.67X
Y(1989) = 131.5 + 14.67(−3.5) = 131.5 − 51.33 = 80.16
Thus
Y(1998) = 131.5 + 14.67(5.5) = 212.16
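Example 4 can be reproduced from its reported totals alone (the year-by-year table does not appear in this extract): N = 8, ∑Y = 1052, ∑XY = 616, ∑X² = 42, with X coded in half-year deviations so that ∑X = 0.

```python
# Reproducing Example 4 from its reported sums only.
n, sum_y = 8, 1052
sum_xy, sum_x2 = 616, 42

a = sum_y / n            # a = sum(Y) / N = 131.5
b = sum_xy / sum_x2      # b = sum(XY) / sum(X^2) = 14.666...

y_1989 = a + b * (-3.5)  # first year of the series (X = -3.5)
y_1998 = a + b * 5.5     # predicted sales for 1998 (X = 5.5)
print(round(a, 2), round(b, 2), round(y_1989, 2), round(y_1998, 2))
```

Carrying full precision in b gives 80.17 and 212.17; the slightly different figures in the worked text (80.16, 212.163) come from rounding b at intermediate steps.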
6.5 SUMMARY:
In this chapter the importance of time series analysis and the four components of a time series,
i.e. secular trend, seasonal variations, cyclical variations, and irregular variations, have been
discussed. The measurement of trend, its objectives, and the various methods that can be used for
determining trend (the free-hand or graphical method, semi-averages method, moving average method,
and method of least squares), along with their advantages and limitations, have been brought out. The
mechanical approach to time series analysis is subject to considerable error and chance. It is therefore
necessary for management to combine these simple procedures with knowledge of other factors in
order to develop workable forecasts.
6.6 GLOSSARY:
Cyclical component or fluctuations: In a time series, fluctuations around the trend line
that last for more than one year.
Secular trend or trend component: One of the components in a time series indicating
the long-term movement of an item or variable.
6.7 REFERENCES:
1) Murray R. Spiegel and Larry J. Stephens, "Statistics: Schaum's Outlines", Third Edition,
McGraw-Hill International Editions, 1999.
2) Levin, Richard I. and Rubin, David S., "Statistics for Management", Prentice Hall of India,
New Delhi, 2000.
3) Sancheti, D.C. and Kapoor, V.K., "Business Statistics", New Delhi.
6.8 EXERCISES:
1. What is a time series? Mention its important components. Explain them briefly.
2. What are the advantages of undertaking a time series analysis to a business firm?
3. Suppose we are given time series data for 12 years (1989 to 2000) relating to the sales of a certain
business firm. These data are given below.
You are asked to find out the three-year moving averages, starting from 1989.
4. Fit a straight line trend by the method of least squares for the following and draw it on the graph
paper:
6. Fit a trend line by the method of semi-averages to the data given below. Estimate the sales for
1997. If the actual sale for that year is Rs.520 lakhs account for the difference between the two
figures.
7. Given below are the figures of production (in million tones) of wheat:
Find the trend values using the linear trend method. Using the trend equation, estimate the
profit for 2001.
9. Assuming an additive model, apply 3-year moving averages to obtain the trend- free series for
years 2 to 6 from the following data.
Year 1 2 3 4 5 6 7
Exports 126 130 137 141 145 155 159
(Rs.lakh)
10. The trend equation for annual sales of a product is- Y=102+36 X
With 1st January 1990 as origin.
(i) Determine the monthly trend equation with 1st July 1992 as origin.
(ii) Compute the trend values of sales in August 1991 and October 1994.
by
Dr. B. Raja Shekhar,
Reader,
University of Hyderabad.
7.1 INTRODUCTION:
Once the problem has been formulated and a research design is developed, the researcher has to decide
whether the information is to be collected from the whole population or only from a part of it.
When the data are collected from every unit of the population of interest, the study is known as a census
survey. If, on the other hand, data are collected only from some units of the population, it is known as a
sample survey. Statistical analysis frequently involves making inferences from a sample to a
population.
In Statistics, a population is an entire set of objects or units of observation of one sort or another, while
a sample is a subset of a population selected for a particular study (usually because it is impractical to
study the whole population). The numerical characteristics of a population are called parameters.
Generally the values of the parameters of interest remain unknown to the researcher; we calculate the
"corresponding" numerical characteristics of the sample (known as statistics) and use these to
estimate, or make inferences about, the unknown parameter values.
A standard notation is often used to keep straight the distinction between population and sample. Some
commonly used symbols are shown below.
Population: size N, mean µ, variance σ²
Sample: size n, mean x̄, variance s²
• Usually, the population is too large to study, i.e. hundreds of thousands or millions of cases.
• Time, money, and patience limit studying an entire population.
• The population may change by the time the generalization is made, making the generalization
invalid.
• SAMPLING ERROR: A sample, even a random one, is likely to vary somewhat from the
population. The differences between the sample and the population are sampling errors. There
is an inverse relationship between sampling error and sample size, and sampling error increases
as the standard deviation of the population increases.
There are two broad techniques available for drawing samples: probability and non-probability
sampling.
Probability Sampling:
Every unit of the population has a known (and, in simple random sampling, equal) chance of
being selected. There are various probability sampling techniques, such as simple random
sampling, stratified sampling, cluster sampling, and systematic sampling.
Non-Probability Sampling:
The units of the population do not have a known, equal chance of being selected; the selection
process is, at least partially, subjective. There are various non-probability sampling techniques,
such as judgment sampling, convenience sampling, and quota sampling.
Simple Random Sampling:
The sample is randomly drawn from the population (usually accomplished using a table of random
numbers). This is one of the most basic forms of sampling, based on the probability that the
random selection of names from a sampling frame will produce a sample that is representative of the
target population. In this respect, a simple random sample is similar to a lottery.
An example of a simple random sample you could easily construct would be to take the names of
every student in your class from the register, write all the names on separate pieces of paper, and put
them in a box. If you then draw out a certain percentage of names at random, you will have constructed
your simple random sample.
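The register-and-box procedure can be sketched directly; the names below are hypothetical, and `random.sample` plays the role of drawing slips from the box without replacement, giving every student an equal chance of selection.

```python
# A 20% simple random sample from a hypothetical class register of 30 names.
import random

register = [f"Student{i:02d}" for i in range(1, 31)]  # hypothetical sampling frame
random.seed(42)                 # fixed seed only so the sketch is reproducible
sample = random.sample(register, k=len(register) // 5)  # 20% of 30 = 6 names
print(sample)
```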
• Advantages:
– Simple.
– Sampling error easily measured.
• Disadvantages:
– Needs a complete list of units.
– Does not always achieve the best representativeness.
– Units may be scattered.
Stratified Sampling:
Randomly select the sample from different subgroups (strata) of the population, in either equal or
proportional numbers.
Example: From a school with 60 students in its first year (40 males and 20 females) and 60 students in
its second year (10 males and 50 females), select a sample of 24 students.
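The school example works out neatly under proportional allocation: the overall sampling fraction is 24/120 = 1/5, so each stratum contributes a fifth of its members. A minimal sketch:

```python
# Proportional stratified sampling for the school example (120 students, n = 24).
import random

strata = {"Y1-male": 40, "Y1-female": 20, "Y2-male": 10, "Y2-female": 50}
total = sum(strata.values())          # 120 students in all
sample_size = 24                      # overall sampling fraction = 24/120 = 1/5

# Each stratum's share of the sample is proportional to its size:
allocation = {name: size * sample_size // total for name, size in strata.items()}
print(allocation)  # {'Y1-male': 8, 'Y1-female': 4, 'Y2-male': 2, 'Y2-female': 10}

# Each quota is then drawn at random from within its own stratum:
random.seed(0)
sample = {name: random.sample(range(size), allocation[name])
          for name, size in strata.items()}
```

Within each stratum the draw is a simple random sample, so the method combines the representativeness of the strata with the fairness of random selection.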
Cluster Sampling:
Randomly select intact groups and sample all members of those groups. This form of sampling is
usually done when a target population is spread over a wide geographic area.
• For example, an opinion poll into voting behaviour may involve a sample of 1000 people to
represent the 35 million people eligible to vote in a General Election. A researcher can use a multi-
stage / cluster sample that firstly, divides the country into smaller units (in this example, electoral
constituencies) and then into small units within constituencies (for example, local boroughs).
Boroughs could then be selected which, based on past research, show a representative cross-section
of voters and a sample of electors could be taken from a relatively small number of boroughs
across the country. There are three stages: 1) the target population is divided into many regional
clusters (groups), e.g. London, Berlin, Rome; 2) a few clusters are randomly selected for study; 3)
a few subjects are randomly chosen from within each selected cluster.
Uses
1. This type of sample saves the researcher time and money.
2. Once a relatively reliable sample has been established, the researcher can use the same or a
similar sample again and again (as with political opinion polling).
Limitations
1. Unless great care is taken by the researcher, it is possible that the cluster samples will not be
representative of the target population.
2. Although it is considered a relatively cheap form of sampling, this is not necessarily the case. A
sample that seeks to represent the whole of Britain, for example, is still going to be too
expensive for many researchers.
Systematic Sampling:
Select a sample by drawing every nth case from a list of the entire population. Divide the number in
the population by the number desired in the sample, start near the top of the list, and then select every
nth case thereafter. A variation on simple random sampling is thus to select the names for your sample
systematically rather than on a simple random basis: instead of putting all the names on your sampling
frame individually into a box, it is less trouble to select your sample from the sampling frame itself.
For example, if you were constructing a 20% sample of a target population containing 100 names, a
systematic sample would involve choosing every fifth name from your sampling frame.
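The 20%-of-100 example can be sketched as follows: the sampling interval is k = 100 / 20 = 5, and a random start within the first interval decides which of the five possible systematic samples is drawn.

```python
# Systematic selection: every fifth unit from a hypothetical frame of 100.
import random

frame = list(range(1, 101))   # hypothetical sampling frame of 100 units
k = 5                         # interval for a 20% (20-unit) sample
random.seed(3)
start = random.randrange(k)   # random start within the first interval
sample = frame[start::k]      # every kth unit thereafter
print(len(sample))            # 20 units, evenly spread through the frame
```

The random start is what makes the sample "random enough": each unit still has a 1-in-5 chance of inclusion, even though only five distinct samples are possible.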
Uses:
1. Both are relatively quick and easy ways of selecting samples (if the target population is
reasonably small).
2. They are random or near-random, which means that everyone in the target population has an
equal chance of appearing in the sample (this is not quite true of systematic sampling, but such
samples are "random enough" for most research purposes).
3. They are both reasonably inexpensive to construct. Both simply require a sampling frame
that is accurate for the target population.
4. Other than some means of identifying people in the target population (a name and address,
for example), the researcher does not require any other knowledge about this population (an
idea that will become more significant when we consider some other forms of sampling)
Limitations:
1. The fact that these types of sample always need a sampling frame means that, in some
cases, it may not be possible to use them. For example, a study into "underage drinking" could
not be based on a simple random or systematic sample because no sampling frame exists for the
target population.
2. In many cases a researcher will want to get the views of different categories of people
within a target population, and it is not always certain that these types of sampling will produce
a sample that is representative of all shades of opinion.
For example, in a classroom it might be important to get the views of both the teacher and
the students about some aspect of education. A simple random or systematic sample may not
include the teacher, because this category is a very small percentage of the overall
class; there is a high probability that the teacher would not be chosen by any sample that is
simply based on chance.
Convenience Sampling:
Convenience samples are prone to bias by their very nature: selecting population elements that are
convenient to choose almost always makes them special or different from the rest of the elements in
the population in some way. Hence the results obtained by the convenience sampling method can
hardly be representative of the population; they are generally biased and unsatisfactory. However,
convenience sampling is often used for pilot studies.
Judgment Sampling:
In this method of sampling, the choice of sample items depends exclusively on the judgment of the
investigator. In other words, the investigator exercises his judgment in the choice and includes in the
sample those items which he thinks are most typical of the universe with regard to the characteristics
under investigation. For example, if a sample of ten students is to be selected from a class of sixty for
analysing the spending habits of students, the investigator would select the 10 students who, in his
opinion, are representative of the class.
Merits:
Though the principles of sampling theory are not applicable to judgment sampling, the method
is sometimes used in solving many types of economic and business problems. The use of
judgment sampling is justified under a variety of circumstances:
(i) When only a small number of sampling units is in the universe, simple random selection may
miss the more important elements, whereas judgment selection would certainly include them
in the sample.
(ii) When we want to study some unknown traits of a population, some of whose characteristics
are known, we may stratify the population according to these known properties and select
sampling units from each stratum on the basis of judgment. This method is used to obtain a
more representative sample.
(iii) In solving everyday business problems and making public policy decisions, executives and
public officials are often pressed for time and cannot wait for probability sample designs.
Judgment sampling is then the only practical method of arriving at solutions to their urgent
problems.
Limitations:
(i) This method is not scientific because the population units to be sampled may be affected by
the personal prejudice or bias of the investigator.
Quota Sampling:
The sample is selected so that certain characteristics are in proportion to the population. In quota
sampling, the population is first segmented into mutually exclusive sub-groups, just as in stratified
sampling. Then judgment is used to select the subjects or units from each segment based on a specified
proportion. For example, an interviewer may be told to sample 200 females and 300 males between
the ages of 45 and 60.
It is this second step which makes the technique one of non-probability sampling. In quota sampling
the selection of the sample is non-random; for example, interviewers might be tempted to interview
those people in the street who look most helpful. The problem is that these samples may be biased
because not everyone gets a chance of selection. This non-random element is the technique's greatest
weakness, and quota versus probability sampling has been a matter of controversy for many years.
7.8 ESTIMATION:
Having discussed probability theory, the normal distribution, and the concept of the sampling distribution
of a statistic, we now turn to the topic of estimation. In business, several situations arise in which
managers have to make quick estimates. Since their estimates have an impact on the success or failure
of their enterprises, they have to take sufficient care to ensure that their estimates are not far
from the final outcome. The point to note is that such estimates are made without complete
information and with a great deal of uncertainty about the eventual outcome.
In all such situations, it is the theory of probability that forms the basis for statistical inference. The
term 'statistical inference' means making a probability judgment concerning a population on the basis
of one or more samples. Based on probability theory, statistical inferences are made as a basis for
making decisions. For example, an investor wants to know whether he should subscribe to an
investment consultancy service. If, on the basis of a sample, he finds that selecting investments on the
advice of the consultancy service has been more profitable than selecting them at random, he may go
in for the service.
Let us first examine the concept of 'estimate' as used in Statistics. According to some dictionaries,
an estimate is a valuation based on opinion or roughly made from imperfect or incomplete data. This
definition may apply, for example, when an individual has an opinion about the competence of
one of his colleagues. But in Statistics the term estimate is not used in this sense. In Statistics too,
estimates are made when the information available is incomplete or imperfect. However, such
estimates are made only when they are based on sound judgment or experience and when the samples
are scientifically selected.
There are two types of estimates that we can make about a population: a point estimate and an interval
estimate. A point estimate is a single number, which is used to estimate an unknown population
parameter. Although a point estimate may be the most common way of expressing an estimate, it
suffers from a major limitation since it fails to indicate how close it is to the quantity it is supposed to
estimate. In other words, a point estimate does not give any idea about the reliability or precision of
the method of estimation used.
The second type of estimate is known as the interval estimate. It is a range of values used to estimate
an unknown population parameter. In the case of an interval estimate, the error is indicated in two ways:
first, by the extent of its range; and second, by the probability of the true population parameter lying
within that range.
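A common form of interval estimate is a confidence interval for the population mean. A minimal sketch with hypothetical sample figures, using the large-sample formula x̄ ± z·s/√n with z = 1.96 for the 95% level:

```python
# A 95% confidence interval for a population mean (hypothetical sample summary).
import math

n, sample_mean, sample_sd = 100, 50.0, 10.0   # hypothetical sample of 100 items
z = 1.96                                      # 95% level, from the normal table
se = sample_sd / math.sqrt(n)                 # standard error of the mean = 1.0
lower, upper = sample_mean - z * se, sample_mean + z * se
print(round(lower, 2), round(upper, 2))       # roughly 48.04 to 51.96
```

The width of the interval (here about ±1.96) is the first error indicator; the 95% confidence level is the second.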
When we make an estimate of a population parameter, we use a sample statistic. This sample statistic
is an estimator.
For example, the sample mean x̄ = ∑x / n is a point estimator of the population mean. Many different
statistics can be used to estimate the same parameter: we may use the sample mean, the sample median,
or even the range to estimate the population mean. The question here is: how can we evaluate the
properties of these estimators, compare them with one another, and finally decide which is the 'best'?
The answer is possible only when we have certain criteria that a good estimator must satisfy. These
criteria are briefly discussed below.
There are four criteria by which we can evaluate the quality of a statistic as an estimator. These are:
unbiasedness, efficiency, consistency, and sufficiency.
7.9.2.1 UNBIASEDNESS:
This is a very important property that an estimator should possess. If we take all possible samples of
the same size from a population and calculate their means, the mean of all these means will be equal to
the mean of the population. This means that the sample mean is an unbiased estimator of the
population mean. When the expected value (or mean) of a sample statistic is equal to the value of the
corresponding population parameter, the sample statistic is said to be an unbiased estimator.
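The unbiasedness claim can be verified exactly on a tiny population by averaging the means of every possible sample of a given size and comparing with the population mean. The population values below are purely illustrative:

```python
# Exhaustive check that the sample mean is unbiased, on a 4-element population.
from itertools import combinations

population = [2, 4, 6, 8]
pop_mean = sum(population) / len(population)          # population mean = 5.0

# Means of all 6 possible samples of size 2, drawn without replacement:
sample_means = [sum(s) / 2 for s in combinations(population, 2)]
mean_of_sample_means = sum(sample_means) / len(sample_means)
print(pop_mean, mean_of_sample_means)                 # both come out to 5.0
```

The individual sample means (3, 4, 5, 5, 6, 7) scatter around 5, but their expected value equals the population mean exactly, which is what unbiasedness requires.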
7.9.2.2 CONSISTENCY:
Another important characteristic that an estimator should possess is consistency. Let us take the case
of the standard deviation of the sampling distribution of the mean:
σx̄ = σ / √n
The formula states that the standard deviation of the sampling distribution decreases as the sample size
increases, and vice versa. When the sample size n increases, the population standard deviation is
divided by a larger denominator, which results in a reduced standard error of the sample mean.
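The shrinkage is easy to see numerically. Taking a hypothetical population standard deviation of 12:

```python
# Standard error sigma / sqrt(n) for increasing sample sizes (sigma is hypothetical).
import math

sigma = 12.0
standard_errors = {n: sigma / math.sqrt(n) for n in (9, 36, 144)}
print(standard_errors)   # quadrupling n halves the standard error
```

Because √n appears in the denominator, halving the standard error requires four times the sample size, not twice.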
7.9.2.3 EFFICIENCY:
Another desirable property of a good estimator is that it should be efficient. Efficiency is measured in
terms of size of the standard error of the statistic. Since an estimator is a random variable, it is
necessarily characterized by a certain amount of variability. This means that some estimates may be
more variable than others. Just as bias is related to the expected value of the estimator, so efficiency
can be defined in terms of the variance.
7.9.2.4 SUFFICIENCY:
The fourth property of a good estimator is that it should be sufficient. A sufficient statistic is an
estimator that utilizes all the information a sample contains about the parameter to be estimated.
The sample mean x̄, for example, is a sufficient estimator of the population mean µ: no other estimator
of µ, such as the sample median, can provide any additional information about the parameter.
Likewise, we can say that the sample proportion p is a sufficient estimator of the population
proportion π.
7.10 SUMMARY:
At the outset, the chapter gave the definitions of some basic terms used in sampling. The
distinction between probability and non-probability samples was brought out. This was followed
by a discussion of simple random sampling, systematic random sampling, and stratified random
sampling. Other sampling designs, such as cluster sampling, judgment sampling, and quota sampling,
with their advantages and limitations, were also discussed. Finally, the four criteria used to evaluate the
quality of a statistic as an estimator, namely unbiasedness, efficiency, consistency, and sufficiency,
were explained.
7.11 GLOSSARY:
7.12 REFERENCES:
Gupta, S.C., "Fundamentals of Statistics", Himalayan Publishing House, New Delhi, 2004.
Levin, Richard I. and Rubin, David S., "Statistics for Management", Prentice Hall of India, New Delhi, 2000.
7.13 EXERCISE:
1. What are the advantages and limitations of sampling?
2. What is meant by 'representativeness' in a sample? Explain in what sense a simple random
sample is representative of the population.
3. Distinguish between a probability sample and a non-probability sample.
4. Describe the various steps involved in the sampling process.
5. What are the criteria of a good estimator? Explain each of the criteria briefly.
By
Reader
University of Hyderabad.
8. TESTING OF HYPOTHESIS
8.1 INTRODUCTION:
Statistical inference is that branch of statistics which is concerned with using probability concepts to
deal with uncertainty in decision making. The field of statistical inference has had a fruitful
development since the latter half of the 19th century.
It refers to the process of selecting and using a sample statistic to draw inferences about a population
parameter based on the sample drawn from the population. Statistical inference treats two different
classes of problems:
(a) Hypothesis testing, i.e. testing some hypothesis about the parent population from which the sample is
drawn.
(b) Estimation, i.e. using the 'statistics' obtained from the sample as estimates of the unknown
'parameters' of the population from which the sample is drawn.
Hypothesis testing begins with an assumption, called a hypothesis, that is made about a population
parameter. Sample data are then collected to produce sample statistics, and this information is used
to decide how likely it is that the hypothesized parameter value is correct.
Hence, it can be said that a hypothesis is a supposition made as a basis for reasoning.
(1) A coin may be tossed 200 times, coming up heads 180 times. In this case it is interesting to
test the hypothesis that the coin is unbiased.
(2) The average weight of 100 students of a particular college may be studied and found to be 110 lb.
In this case it is interesting to test the hypothesis that the sample has been drawn from
a population with average weight 115 lb.
(3) Hypothesis testing can also be used to test the hypothesis that the variables in the population
are uncorrelated.
Example:
• When people need to determine whether the parameters of two populations are alike or
different.
• Whether female employees receive lower salaries than male employees for the same
work.
• To determine whether the proportion of promotable employees at one government
installation is different from that at another.
The first step in hypothesis testing is to set up a hypothesis about a population parameter. We then collect
sample data, produce sample statistics, and use this information to decide how likely it is that our
hypothesized population parameter is correct. Say we assume a certain value for the mean of a population.
To test the validity of our assumption, we gather sample data and determine the difference between the
hypothesized value and the actual value of the sample mean. Then we judge whether the difference is
significant: the smaller the difference, the greater the likelihood that our hypothesized value for the
mean is correct; the larger the difference, the lesser the likelihood.
The conventional approach to hypothesis testing is not to construct a single hypothesis about the
population parameter, but rather to set up two different hypotheses. These hypotheses must be so
constructed that if one is accepted the other is rejected, and vice versa. They are:
(a) the null hypothesis (Ho)
(b) the alternate hypothesis (Ha)
Null hypothesis:
It is a very useful tool in testing the significance of difference. This hypothesis asserts that there is no
real difference in the sample and the population in the particular matter under consideration and that
the difference found is accidental and unimportant arising out of fluctuations of sampling. The null
hypothesis is akin to the legal principle that a man is innocent until he is proved to be guilty.
Alternate hypothesis:
As against the null hypothesis, the alternate hypothesis specifies those values that the researcher believes to
hold true. The alternate hypothesis may embrace a whole range of values rather than a single point.
Example: if we want to know whether a drug is effective in curing malaria or not:
Ho: the drug is not effective in curing malaria.
Ha: the drug is effective in curing malaria.
The next step is to test the validity of Ho against Ha at a certain level of significance. The
confidence with which an experimenter rejects or accepts the null hypothesis depends upon the
significance level adopted.
This involves selecting an appropriate probability distribution for the particular test, i.e. a probability
distribution that can properly be applied. Some commonly used test statistics are Z, t, F,
and chi-square. The test criterion must employ an appropriate probability distribution; for example, if
only small-sample information is available, the use of the normal distribution would be inappropriate.
8.4.6 Doing computations and making decisions: These calculations include the test statistic and
its standard error. A statistical conclusion or statistical decision is a decision
either to reject or to accept the null hypothesis. The decision depends on whether the computed
value of the test criterion falls in the region of rejection or the region of acceptance.
Type I error:
The hypothesis is true but our test rejects it, i.e. rejecting Ho when it is true.
Type II error:
The hypothesis is false but the test accepts it, i.e. accepting Ho when it is false.
α = P(Type I error) = P(rejecting Ho when it is true)
β = P(Type II error) = P(not rejecting Ho when it is false)
A two-tailed test of hypothesis will reject the null hypothesis if the sample statistic is significantly higher or lower than the hypothesized population parameter.
[Figure: three sketches of the sampling distribution about the mean — a two-tailed test with rejection zones in both tails, a left-tailed test with the rejection zone in the left tail, and a right-tailed test with the rejection zone in the right tail; the acceptance zone lies around the mean in each case.]
The standard deviation of the sampling distribution is called the standard error. (1) It is used as an instrument in hypothesis testing; the standard error gives an idea of the unreliability of a sample: the greater the standard error, the greater the departure of the observed frequencies from the expected ones, and hence the greater the unreliability of the sample. (2) The reciprocal of the S.E., i.e., 1/S.E., is a measure of the reliability of the sample. (3) With the help of the S.E. we can determine limits within which the parameter values are expected to lie.
8.8 Z-test for attributes:
In this test we can find only the presence or absence of a particular characteristic; hence we check whether attribute A is present or not by drawing samples. The sampling distribution of the number of successes follows the binomial probability distribution. Hence the S.E. is:
S.E. = √(npq)
Ex: Q) A coin was tossed 400 times and head turned up 216 times. Test the hypothesis that the coin is
unbiased.
Ans: Let us take the hypothesis that the coin is unbiased. On the basis of this hypothesis the probability of getting a head or a tail would be equal, i.e., ½. Hence in 400 throws of a coin we should expect 200 heads and 200 tails.
Observed no. of heads=216
Difference between observed and expected number of heads= 216-200=16
S.E. = √(400 × ½ × ½) = 10
Zcal = Difference / S.E. = 16 / 10 = 1.6
at α = 5% level of significance
Ztable=1.96
Zcal< Ztable
Hence the hypothesis is accepted: the coin is unbiased.
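The coin-toss test above can be reproduced with a short Python sketch (standard library only; 1.96 is the two-tailed 5% critical value of the normal distribution):

```python
import math

# Z-test for the number of successes (coin-toss example above).
# Under H0 (unbiased coin), p = q = 1/2.
n, heads, p = 400, 216, 0.5
expected = n * p                     # 200
se = math.sqrt(n * p * (1 - p))     # S.E. = sqrt(npq) = 10
z = (heads - expected) / se         # 16 / 10 = 1.6
unbiased = abs(z) < 1.96            # accept H0 at the 5% level
```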
Other applications:
1) To test, from hospital statistics, whether the number of male babies born in a particular month equals the number of female babies born.
2) In a census, whether the number of males equals the number of females.
3) In a class, whether the performance of boys and girls differs.
4) In a class, whether the maths students or the arts students perform better.
5) Whether a drug is effective or ineffective.
Instead of recording the number of successes in each sample, we might record the proportion of successes, that is, 1/n-th of the number of successes in each sample. As this amounts to dividing all the figures of the record by n, the mean proportion of successes must be p.
Standard deviation of the proportion of successes = √(npq)/n = √(pq/n)
S.E.p = √(pq/n)
Ex: Q: 500 apples are taken at random from a basket and 50 are found to be bad. Estimate the
proportion of bad apples in the basket and assign the percentage limits.
A: The proportion of bad apples in the given sample = 50/500 = 0.1
p = 0.1, q = 0.9, n = 500
S.E. = √(pq/n) = √(0.1 × 0.9 / 500) = 0.013
Limits within which the percentage of bad apples lies:
(p ± 3 S.E.) × 100
= (0.1 ± 3 × 0.013) × 100
= 6.1% and 13.9%
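The same arithmetic can be sketched in Python; note the text rounds the S.E. to 0.013, so its limits of 6.1% and 13.9% differ slightly from the unrounded values below:

```python
import math

# Standard error of a sample proportion and 3-S.E. limits (bad-apples example).
n, bad = 500, 50
p = bad / n                         # 0.1
q = 1 - p                           # 0.9
se = math.sqrt(p * q / n)           # about 0.0134
lower = (p - 3 * se) * 100          # about 6.0 percent
upper = (p + 3 * se) * 100          # about 14.0 percent
```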
Other applications:
1. To assign limits to the variation of the sample mean (or, for that matter, any sample statistic) compared to the population parameter.
2. To find the number of defective items produced by a machine at a particular level of significance, and to state the limits with accuracy.
3. To estimate what proportion of a machine's output is defective.
4. In a class, to estimate, with upper and lower limits, the number of students who will be in the first grade.
If two samples are drawn from different populations, we may be interested in finding out whether the difference between the proportions of successes is significant or not. In such a case we take the difference between the proportions of successes in the two samples.
S.E.(p1 − p2) = √( pq (1/n1 + 1/n2) )
where p = (n1 p1 + n2 p2)/(n1 + n2)  or  p = (x1 + x2)/(n1 + n2)
Zcal = (p1 − p2) / S.E.
Ztable is read at the α = 5% (or 1%) level of significance.
Ex 1: In a simple random sample of 600 men taken from a big city, 400 are found to be smokers. In another simple random sample of 900 men taken from another city, 450 are smokers. Do the data indicate that there is a significant difference in the habit of smoking in the two cities?
Sol:
Ho: no significant difference in the habit of smoking in the two cities.
HA: significant difference in the habit of smoking in the two cities.
p1 = 400/600 = 0.667, q1 = 0.333
p2 = 450/900 = 0.5, q2 = 0.5
p = (x1 + x2)/(n1 + n2) = (400 + 450)/(600 + 900) = 17/30
q = 1 − p = 13/30
S.E.(p1 − p2) = √( pq (1/n1 + 1/n2) ) = √( (17/30)(13/30)(1/600 + 1/900) ) = 0.026
Zcal = (p1 − p2)/S.E. = (0.667 − 0.5)/0.026 = 6.42
[Figure: standard normal curve with the acceptance zone between −2.58 and +2.58 and rejection zones beyond; the calculated value 6.42 falls in the rejection zone.]
Zcal > Ztable → Null Hypothesis rejected
So, there is a significant difference between the smoking habits of people in the two cities.
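A quick Python sketch of the two-proportion Z-test above (values agree with the hand computation up to rounding):

```python
import math

# Two-sample Z-test for the difference between proportions (smokers example).
x1, n1 = 400, 600
x2, n2 = 450, 900
p1, p2 = x1 / n1, x2 / n2
p = (x1 + x2) / (n1 + n2)            # pooled proportion = 17/30
se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se                   # about 6.4
reject = abs(z) > 2.58               # 1% level, two-tailed
```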
Applications:
1. Whether a machine's productivity has improved after overhauling or not.
2. Whether the students benefited from extra classes or not.
3. Whether a particular government policy has affected the buying behaviour of the people or not.
4. Whether a particular product is successful or not.
Ex:-
It is assumed that the mean life span of human beings in a particular state is 67 years. The mean age obtained from a random sample of size 100 is 64 years. The standard deviation of the distribution of life of the population is 3 years. Test whether the mean life of the population has decreased over a period of time. Test at the 5% level of significance.
Solution:
Ho: µ = 67
HA: µ < 67
x̄ = 64
S.E.(x̄) = σ/√n = 3/√100 = 0.3
Zcal = (64 − 67)/0.3 = −10
At α = 5% (one-tailed), Ztable = 1.645
Since |Zcal| is greater than Ztable, Ho is rejected: the mean life of the population has decreased.
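The one-sample Z-test above, as a Python sketch (1.645 is the one-tailed 5% normal critical value):

```python
import math

# One-tailed Z-test for a population mean (life-span example).
mu0, xbar, sigma, n = 67, 64, 3, 100
se = sigma / math.sqrt(n)       # 0.3
z = (xbar - mu0) / se           # -10
decreased = z < -1.645          # reject H0: mean life has decreased
```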
Applications:-
1) Whether the claims that firms make in advertisements about the quality, lifetime, performance or speed of their products match the product lots they actually produce.
2) For test marketing, i.e., whether the product will click or not, judged from the survey results.
8.9.4 Two-tailed test for the difference between the means of two samples:
1) Two independent random samples of sizes n1 and n2 (each greater than 30) are drawn, with σ1 and σ2 their respective standard deviations and x̄1, x̄2 their respective means. Then
S.E. = √( σ1²/n1 + σ2²/n2 )
Ho: µ1 = µ2
HA: µ1 ≠ µ2
2) If both samples come from the same population whose standard deviation is σ (and n1 > 30, n2 > 30), then the S.E. of the difference between the sample means is
S.E. = σ √( 1/n1 + 1/n2 )
Ex: Intelligence test on two groups of boys and girls gave the following results:
Mean S.D N
Girls 75 15 150
Boys 70 20 250
Ho: µGirls = µBoys
HA: µGirls ≠ µBoys
σ1 = 15, σ2 = 20, n1 = 150, n2 = 250
S.E.(x̄1 − x̄2) = √( σ1²/n1 + σ2²/n2 ) = √( (15)²/150 + (20)²/250 ) = 1.761
Zcal = (75 − 70)/1.761 = 2.84
At the α = 1% level of significance, Ztable = 2.58
Zcal > Ztable → Ho rejected.
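The girls-versus-boys computation can be verified with a few lines of Python:

```python
import math

# Two-sample Z-test for the difference between means (intelligence test).
m1, s1, n1 = 75, 15, 150        # girls
m2, s2, n2 = 70, 20, 250        # boys
se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)   # about 1.761
z = (m1 - m2) / se                            # about 2.84
reject_at_1_percent = abs(z) > 2.58
```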
8.10 The t-distribution:
This distribution is used when the sample size is less than 30 and the population standard deviation is unknown.
Definition:
t = ((x̄ − µ)/s) × √n
where s = √( Σ(x − x̄)² / (n − 1) )
Properties of the t-distribution
The t-table:
It gives, over a range of degrees of freedom, the probability of exceeding a given value of 't' by chance at different levels of significance. When the degrees of freedom are infinitely large, the t-distribution is equivalent to the normal distribution.
In determining whether the mean of a sample drawn from a normal population deviates significantly from a stated value:
t = ((x̄ − µ)/S) × √n
where
S = √( Σ(x − x̄)² / (n − 1) ) = √( (Σd² − n(d̄)²) / (n − 1) ) = √( (Σd² − (Σd)²/n) / (n − 1) )
with d denoting deviations from an assumed mean.
Eg: The manufacturer of a certain make of electric bulbs claims that his bulbs have a mean life of 25 months with a standard deviation of 5 months. A random sample of 6 such bulbs gave the following values.
Life in months: 24, 26, 30, 20, 20, 18. Can you regard the producer's claim to be valid at the 1% level of significance?
Solution:
Ho: there is no significant difference between the mean life of the bulbs in the sample and in the population.
Ha: there is a significant difference.
t = ((x̄ − µ)/S) × √n, with µ = 25

X      (X − X̄)    (X − X̄)²
24     +1         1
26     +3         9
30     +7         49
20     −3         9
20     −3         9
18     −5         25
ΣX = 138          Σ(X − X̄)² = 102

X̄ = ΣX/n = 138/6 = 23
S = √( Σ(X − X̄)²/(n − 1) ) = √(102/5) = √20.4 = 4.517
t = (|23 − 25|/4.517) × √6 = (2 × 2.449)/4.517 = 1.084
ν = n − 1 = 6 − 1 = 5
From the t-table, for d.f. = 5, t0.01 = 4.032.
The calculated value of 't' is less than the table value, so the null hypothesis is accepted: the producer's claim can be regarded as valid at the 1% level of significance.
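The bulb-life t-test can be reproduced with the standard library (`statistics.stdev` uses the n − 1 divisor, matching the formula for S):

```python
import math
import statistics

# One-sample t-test for the electric-bulb example.
lives = [24, 26, 30, 20, 20, 18]
n = len(lives)
xbar = statistics.mean(lives)          # 23
s = statistics.stdev(lives)            # about 4.517 (n - 1 divisor)
t = (xbar - 25) / s * math.sqrt(n)     # about -1.084
claim_valid = abs(t) < 4.032           # table t(0.01, 5 d.f.)
```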
         Sample I    Sample II
Mean:    x̄1          x̄2
S.D.:    S1          S2

Testing the hypothesis that the samples come from the same population:
t = ((x̄1 − x̄2)/S) × √( n1 n2 / (n1 + n2) )
where
S = √( ( Σ(X1 − X̄1)² + Σ(X2 − X̄2)² ) / (n1 + n2 − 2) )
Example:
Two types of drugs were used on 5 and 7 patients, respectively, for reducing their weight. Drug A was imported; drug B was indigenous. The decrease in weight after using the drugs for six months was:
A: 10 12 13 11 14
B: 8 9 12 14 15 10 9
Is there a significant difference in the efficacy of the two drugs? If not, which drug should you buy?
Solution:
Null hypothesis: there is no significant difference in the efficacy of the two drugs.
t = ((x̄1 − x̄2)/S) × √( n1 n2 / (n1 + n2) )
x̄1 = 12
x̄2 = 11
X1    (X1 − X̄1)   (X1 − X̄1)²   X2    (X2 − X̄2)   (X2 − X̄2)²
10    −2          4            8     −3          9
12    0           0            9     −2          4
13    +1          1            12    +1          1
11    −1          1            14    +3          9
14    +2          4            15    +4          16
                               10    −1          1
                               9     −2          4
ΣX1 = 60   Σ(X1 − X̄1)² = 10    ΣX2 = 77   Σ(X2 − X̄2)² = 44
S = √( ( Σ(x1 − x̄1)² + Σ(x2 − x̄2)² ) / (n1 + n2 − 2) ) = √( (10 + 44)/(5 + 7 − 2) ) = √(54/10) = 2.324
t = ((x̄1 − x̄2)/S) × √( n1 n2/(n1 + n2) )
= ((12 − 11)/2.324) × √(35/12) = 1.708/2.324 = 0.735
ν = n1 + n2 − 2 = 5 + 7 − 2 = 10
For ν = 10, t0.05 = 2.228
Since the calculated 't' is less than the table 't', the hypothesis is accepted: there is no significant difference in the efficacy of the two drugs. Also, the indigenous drug can be bought; since there is no real difference, there is no need to import the drug.
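A Python sketch of the pooled two-sample t-test used for the drugs example:

```python
import math

# Pooled two-sample t-test (imported vs indigenous drug).
a = [10, 12, 13, 11, 14]
b = [8, 9, 12, 14, 15, 10, 9]
n1, n2 = len(a), len(b)
m1, m2 = sum(a) / n1, sum(b) / n2                    # 12 and 11
ss1 = sum((x - m1) ** 2 for x in a)                  # 10
ss2 = sum((x - m2) ** 2 for x in b)                  # 44
s = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))           # about 2.324
t = (m1 - m2) / s * math.sqrt(n1 * n2 / (n1 + n2))   # about 0.735
accept = abs(t) < 2.228                              # table t(0.05, 10 d.f.)
```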
Paired (dependent) samples:
Two samples are said to be dependent when the elements in one sample are related to those in the other in some significant manner.
t = ((d̄ − 0)/S) × √n, i.e., t = (d̄ √n)/S
where d̄ is the mean difference and S is the standard deviation of the differences:
S = √( Σ(d − d̄)²/(n − 1) ) = √( (Σd² − n(d̄)²)/(n − 1) )
Eg:
A drug is given to 10 persons and the increments in their blood pressure are recorded below. Is it reasonable to believe that the drug has no effect on the change of B.P.? Test at the 5% level of significance.
(For ν = 9, t0.05 = 2.26.)
Solution:
d:   3   6   −2   4   −3   4   6   0   0   2     Σd = 20
d²:  9   36  4    16  9    16  36  0   0   4     Σd² = 130
d̄ = Σd/n = 20/10 = 2
S = √( (Σd² − n(d̄)²)/(n − 1) ) = √( (130 − 10 × 2²)/(10 − 1) ) = √(90/9) = 3.162
t = (d̄ √n)/S = (2 × √10)/3.162 = (2 × 3.162)/3.162 = 2
ν = n − 1 = 10 − 1 = 9
Since the calculated t = 2 is less than the table value 2.26, the hypothesis is accepted: the drug has no significant effect on the change of blood pressure.
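The paired test above, sketched in Python:

```python
import math

# Paired t-test on the blood-pressure increments.
d = [3, 6, -2, 4, -3, 4, 6, 0, 0, 2]
n = len(d)
dbar = sum(d) / n                                          # 2
s = math.sqrt(sum((x - dbar) ** 2 for x in d) / (n - 1))   # about 3.162
t = dbar * math.sqrt(n) / s                                # 2
no_effect = abs(t) < 2.26                                  # t(0.05, 9 d.f.)
```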
Given a random sample from a bivariate normal population, to test the hypothesis that the correlation coefficient of the population is zero, we use:
t = ( r / √(1 − r²) ) × √(n − 2)
Eg:
A random sample of 27 pairs of observations from a normal population gives a correlation coefficient of 0.42. Is it likely that the variables in the population are uncorrelated?
Solution:
H0: there is no significant correlation in the population, i.e., the population correlation is zero.
t = ( r / √(1 − r²) ) × √(n − 2), with r = 0.42, n = 27
t = ( 0.42 / √(1 − (0.42)²) ) × √(27 − 2) = (0.42/0.908) × 5 = 2.31
ν = n − 2 = 27 − 2 = 25
For ν = 25, t0.05 = 1.708
Since the calculated 't' exceeds the table value, H0 is rejected: it is unlikely that the variables in the population are uncorrelated.
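In Python, the significance test of r = 0.42 with n = 27:

```python
import math

# t-test for the significance of a sample correlation coefficient.
r, n = 0.42, 27
t = r / math.sqrt(1 - r ** 2) * math.sqrt(n - 2)   # about 2.31
significant = t > 1.708                            # table t for 25 d.f.
```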
Managerial Applications:
1. Whether the mean life of a product is up to the mark or not.
2. Whether machines A and B have their mean output at the same level or not.
3. Whether employees A and B have the same average performance or not.
8.11 SUMMARY:
This chapter first dealt with the concept of a hypothesis and the steps involved in testing it. It has been pointed out that two types of error, Type I and Type II, are likely to be made in testing a hypothesis. A Type I error arises when the hypothesis is true but is rejected. A Type II error arises when the hypothesis, though false, is accepted. Hypothesis testing is explained with examples wherever necessary. Finally, Student's t-distribution, its properties and its applications have been discussed.
8.12 GLOSSARY:
Critical region: The set of values of the test statistic that will cause us to reject the null hypothesis.
Decision rule: If the calculated test statistic falls within the critical region, the null hypothesis H0 is rejected. In contrast, if the calculated test statistic does not fall within the critical region, the null hypothesis is not rejected.
Hypothesis: An unproven proposition or supposition that tentatively explains a phenomenon.
Significance level: The value of α that gives the probability of rejecting the null hypothesis when it is true. This gives rise to Type I error.
Test criteria: Criteria consisting of (i) specifying a level of significance α, (ii) determining a test statistic, (iii) determining the critical region(s), and (iv) determining the critical value(s).
Test statistic: The value of Z or 't' calculated for a sample statistic such as the sample mean or the sample proportion.
Two-tail test: A statistical hypothesis test in which the alternative hypothesis is stated in such a way that it includes both higher and lower values of a parameter than the value specified in the null hypothesis.
Type I error: An error caused by rejecting a null hypothesis that is true.
Type II error: An error caused by failing to reject a null hypothesis that is not true.
8.13 REFERENCES:
1. Gupta, S.C., and Kapoor, V.K., "Fundamentals of Mathematical Statistics", Sultan Chand and Sons, New Delhi, 1997, 9th Edition.
2. Gupta, S.C., "Fundamentals of Statistics", Himalaya Publishing House, New Delhi, 2004.
3. Levin, Richard I. and Rubin, David S., "Statistics for Management", PHI, New Delhi, 2000.
4. Gupta, S.P. and Gupta, M.P., "Business Statistics", Sultan Chand & Sons, New Delhi, 2001.
3. An insurance agent has claimed that the average age of policyholders who insure through him is less
than the average for all agents, which is 30.5 years. A random sample of 100 policy holders who had
insured through him gave the following age distribution.
Calculate the arithmetic mean and standard deviation of this distribution and use these values to
test his claim at the 5 percent level of significance.
4. Two types of batteries are tested for their length of life and the following data are obtained:
5. A radio shop sells, on an average, 200 radios per day with a standard deviation of 50 radios.
After an extensive advertising campaign, the management will compute the average sales for the
next 25 days to see whether an improvement has occurred. Assume that the daily sales of radios
are normally distributed.
6. You obtain a large number of components to an identical specification from two sources. You may notice that some of the components are from the supplier's own plant in Pune and some are from the plant located in Bangalore. You would like to know whether the proportions of defective components are the same or there is a difference between the two. You take a random sample of 600 components from each plant and find that the rejection rate p1 is 0.015 for Pune components as compared to p2 = 0.017 for Bangalore components. Set up the null hypothesis and test it at the 5 percent level of significance.
7. Out of 20,000 customer ledger accounts, a sample of 600 was taken to test the accuracy of posting and balancing, and 45 mistakes were found. Assign limits within which the number of mistakes can be expected to lie at the 95% level of confidence.
8. In a village 'A', out of a random sample of 1,000 persons, 100 were found to be vegetarians, while in another village 'B', out of 1,500 persons, 180 were found to be vegetarians. Do you find a significant difference in the food habits of the people of the two villages?
9. In a hospital, 480 female and 520 male babies were born in a week. Do these figures confirm the hypothesis that males and females are born in equal numbers?
10. A strength test carried out on samples of two yarns spun to the same count gave the following results:
The strengths are expressed in pounds. Is the difference in mean strengths between the sources from which the samples are drawn significant?
by
Dr.B.Raja Shekhar
Reader
University of Hyderabad.
• Understand how "Analysis of variance" can be used to test for the equality of three or more population means.
• Learn how to summarize the F-ratio in the form of an ANOVA table.
9.1 INTRODUCTION:
2. The populations from which the samples are drawn have equal variances, i.e., σ1² = σ2² = … = σk².
3. Each sample is drawn randomly and is independent of other samples.
1. A response variable
2. A factor or criterion
3. A treatment
Take an example to understand the three key terms.
Ex: Production volume in different shifts in a factory.
• In the above example there are two variables: the day of the week and the volume of production in each shift. If one of the objectives is to determine whether the mean production volume is the same during the days of the week, then the variable of interest, i.e., the response, is the mean production volume.
• The variables, quantitative or qualitative, that are related to the response variable are called factors; i.e., the day of the week is the factor or independent variable.
• The value assumed by a factor in an experiment is called a level.
• The combinations of levels of the factors for which the response will be observed are called treatments. In this example the days of the week are the treatments.
1) The amount of variation among the sample means or the variation attributable to the
difference among sample means. This variation is due to assignable causes.
2) The amount of variation within the sample observations. This difference is considered
due to chance causes or experimental (random) errors.
The observations in the sample data may be classified according to one factor or two factors. The classifications according to one factor and two factors are called one-way classification and two-way classification, respectively.
Many business applications involve experiments in which different populations (or groups) are
classified with respect to only one attribute of interest such as:
1st step: Calculate the mean of each sample and the grand mean (x̄) of all the observations.
2nd step: Calculate the sum of squares between the samples (S.S.B.):
S.S.B. = Σ nj (x̄j − x̄)²
where x̄ is the grand mean and nj is the number of observations in the corresponding sample.
Then calculate the sum of squares within the samples (S.S.W.):
S.S.W. = Σ Σ (x − x̄j)²
(The above formulas are given in their simplest form.)
3rd step:
Find out degrees of freedom associated with S.S.B and S.S.W., and then find out total degree of
freedom (df)
Total d.f= between samples df + Within samples (df)
Since K independent samples are being compared, therefore (k-1) degrees of freedom are associated
with the sum of the squares among samples.
As each of the K samples contributes nj-1 degrees of freedom for each independent sample within
itself, therefore there are (n-k) degrees of freedom associated with the sum of the squares within
samples.
So, total d.f, (n-1) = (k-1) + (n-k)
4th step:
a) Find out Mean sum of squares between the samples (M.S.B)
S .S .B
M.S.B=
k −1
b) Find out Mean sum of squares within the samples (M.S.W)
S .S .W
M.S.W=
n−k
c) Find out Mean, sum of squares for total (M.S.T)
SST
M.S.T=
n −1
5th step:
Apply the F-test statistic with (k − 1) degrees of freedom for the numerator and (n − k) degrees of freedom for the denominator:
F-ratio = σ²between / σ²within = [SSB/(k − 1)] / [SSW/(n − k)] = MSB/MSW
6th step (shortcut calculations):
Find the grand total of all the observations, T = Σx1 + Σx2 + … + Σxk, and the correction factor C.F. = T²/n.
Then S.S.T. = (Σx1² + Σx2² + … + Σxk²) − C.F.
Find S.S.B. through the formula:
S.S.B. = Σ ( (Σxj)²/nj ) − C.F.
F-ratio = (SSB/v1) / (SSW/v2), where v1 = k − 1 and v2 = n − k.
Examples:
There are three main brands of a certain powder. A sample of 120 packets sold is examined and found
to be allocated among four groups A, B, C, D and Brands I, II, III as shown below:
Brand Group
A B C D
I 0 4 8 15
II 5 8 13 6
III 18 19 11 13
Use the three methods and find out whether there is any difference in brand preferences?
Solution:
H0: µ1 = µ2 = µ3
H1: at least one of the means is not equal to the others.
x1     x1²     x2     x2²     x3     x3²
0      0       5      25      18     324
4      16      8      64      19     361
8      64      13     169     11     121
15     225     6      36      13     169
Σx1 = 27, Σx1² = 305;  Σx2 = 32, Σx2² = 294;  Σx3 = 61, Σx3² = 975
x̄1 = 6.75, x̄2 = 8, x̄3 = 15.25
∴ T = Σx1 + Σx2 + Σx3 = 120
Correction factor, C.F. = T²/n = (120)²/12 = 1200
SSB = sum of squares between the samples
= (Σx1)²/n1 + (Σx2)²/n2 + (Σx3)²/n3 − C.F.
= (27)²/4 + (32)²/4 + (61)²/4 − 1200
= (182.25 + 256 + 930.25) − 1200
= 168.5
And SST = total sum of squares
= (Σx1² + Σx2² + Σx3²) − C.F.
=(305+294+975)-1200
=1574-1200
=374
∴ SSW=SST-SSB
=374-168.5
=205.5
The ANOVA table gives MSB = SSB/(k − 1) = 168.5/2 = 84.25 and MSW = SSW/(n − k) = 205.5/9 = 22.83, so Fcal = 84.25/22.83 = 3.69, against a table value of F(2, 9) = 4.26 at the 5% level. So the null hypothesis is accepted: we can say that there is no difference in brand preferences.
Second method:
SSB = Σ nj (x̄j − x̄)²
So we have to find the grand mean x̄:
x̄ = (x̄1 + x̄2 + x̄3)/3 = (6.75 + 8 + 15.25)/3 = 30/3 = 10
So, S.S.B. = 4 × [ (6.75 − 10)² + (8 − 10)² + (15.25 − 10)² ]
= 4 × [ (−3.25)² + (−2)² + (5.25)² ]
= 168.5
And SSW = Σ Σ (x − x̄j)²
= {(0 − 6.75)² + (4 − 6.75)² + (8 − 6.75)² + (15 − 6.75)²} + {(5 − 8)² + (8 − 8)² + (13 − 8)² + (6 − 8)²} + {(18 − 15.25)² + (19 − 15.25)² + (11 − 15.25)² + (13 − 15.25)²}
= (45.5625 + 7.5625 + 1.5625 + 68.0625) + (9 + 0 + 25 + 4) + (7.5625 + 14.0625 + 18.0625 + 5.0625)
= 122.75 + 38 + 44.75
= 205.5
F = (Between samples MS)/(Within samples MS) = 84.25/22.83 = 3.69
V1=k-1=3-1=2
V2= n-k=12-3=9
As the calculated F value is less than the table F value, there is no difference in brand preference, and this we can say at the 5% level of significance.
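The whole one-way ANOVA for the brand data can be sketched in a few lines of Python (shortcut method):

```python
# One-way ANOVA for the brand-preference example (shortcut method).
groups = [[0, 4, 8, 15], [5, 8, 13, 6], [18, 19, 11, 13]]
n = sum(len(g) for g in groups)                       # 12 observations
k = len(groups)                                       # 3 brands
T = sum(sum(g) for g in groups)                       # grand total 120
cf = T ** 2 / n                                       # correction factor 1200
sst = sum(x * x for g in groups for x in g) - cf      # 374
ssb = sum(sum(g) ** 2 / len(g) for g in groups) - cf  # 168.5
ssw = sst - ssb                                       # 205.5
f = (ssb / (k - 1)) / (ssw / (n - k))                 # about 3.69
```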
In the one way classification we partition the total variation into two components: variation among the
samples and variation within the samples (due to random error)
However, there is a possibility that some of the variation left in the random error from the one-way analysis was not due to random error or chance but due to some other measurable factor. If so, this accountable variation has been included in the SSW and has therefore caused the mean sum of squares within samples (MSW) to be somewhat large. Consequently, the F-value would be small, and this could be responsible for the acceptance of a null hypothesis that should have been rejected.
So, in the two-way classification we can partition total variation in the sample data as shown below:
(In this table 'c' stands for columns and 'r' stands for rows.)
Example:
To study the performance of three detergents at three different water temperatures, the following 'whiteness' readings were obtained with specially designed equipment:
                    Detergent
Water temperature   A     B     C
Cold water          57    55    67
Warm water          49    52    68
Hot water           54    46    58

Coding the data by subtracting 50 from each reading:

Water temperature   A (x1)   x1²   B (x2)   x2²   C (x3)   x3²   Row sum
Cold water          +7       49    +5       25    17       289   29
Warm water          −1       1     +2       4     18       324   19
Hot water           +4       16    −4       16    8        64    8
Column sum          10       66    3        45    43       677   56
C.F. = T²/n = (56)²/9 = 348.44
SSC = sum of squares between columns (detergents)
= (10)²/3 + (3)²/3 + (43)²/3 − C.F.
= 33.33 + 3 + 616.33 − 348.44
= 304.22
SSR = sum of squares between rows (water temperatures)
= (29)²/3 + (19)²/3 + (8)²/3 − C.F.
= 280.33 + 120.33 + 21.33 − 348.44
= 73.55
SST = total sum of squares
= (Σx1² + Σx2² + Σx3²) − C.F.
= (66 + 45 + 677) − 348.44
= 439.56
SSE = SST − (SSC + SSR) = 439.56 − (304.22 + 73.55) = 61.79
Thus, MSC = SSC/(c − 1) = 304.22/2 = 152.11
MSR = SSR/(r − 1) = 73.55/2 = 36.775
MSE = SSE/((c − 1)(r − 1)) = 61.79/4 = 15.447
F1 = MSC/MSE = 152.11/15.447 = 9.847 and F2 = MSR/MSE = 36.775/15.447 = 2.380
a) Now, the F table value for df1 = 2, df2 = 4 and α = 0.05 is 6.94, which is less than the calculated value, i.e., 9.847. Hence, we conclude that there is a significant difference between the performances of the three detergents.
b) Since the calculated value F2 = 2.380 at df1 = 2, df2 = 4 and α = 0.05 is less than its table value 6.94, the null hypothesis is accepted. Hence we conclude that water temperature does not make a significant difference in the performance of the detergents.
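The two-way computation can be checked in Python working directly on the raw whiteness readings (coding by subtracting 50 leaves every sum of squares unchanged):

```python
# Two-way ANOVA without replication for the detergent example.
rows = [[57, 55, 67], [49, 52, 68], [54, 46, 58]]   # temperatures x detergents
r, c = len(rows), len(rows[0])
n = r * c
T = sum(sum(row) for row in rows)
cf = T ** 2 / n
sst = sum(x * x for row in rows for x in row) - cf
ssr = sum(sum(row) ** 2 for row in rows) / c - cf                       # rows
ssc = sum(sum(row[j] for row in rows) ** 2 for j in range(c)) / r - cf  # cols
sse = sst - ssr - ssc
mse = sse / ((r - 1) * (c - 1))
f_detergents = (ssc / (c - 1)) / mse    # about 9.85
f_temperature = (ssr / (r - 1)) / mse   # about 2.38
```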
Example: In a certain factory, production can be accomplished by four different workers on five different types of machines. A sample study, in the context of a two-way design without repeated values, is being made with the twofold objective of examining whether the four workers differ with respect to mean productivity and whether the mean productivity is the same for the five different machines. The researcher involved in this study reports the following while analyzing the gathered data:
Source of variation          Sum of squares   Degrees of freedom    Mean square       F-ratio
Between machines (columns)   35.2             (c − 1) = 4           35.2/4 = 8.8      F1 = 8.8/7.1 = 1.24
Between workmen (rows)       53.8             (r − 1) = 3           53.8/3 = 17.93    F2 = 17.93/7.1 = 2.53
Residual error               85.2             (c − 1)(r − 1) = 12   85.2/12 = 7.1
Total                        174.2            (n − 1) = 19
Level of significance, α = 5% (given)
9.6 SUMMARY:
This chapter discusses the important concepts of the analysis of variance. ANOVA is essentially a procedure for testing the difference among more than two sample means at the same time. This technique is an important tool in the hands of a researcher and is extremely useful in research in the fields of economics, biology, education, sociology, business and industry, and in several other disciplines. One-way and two-way classifications to test the equality of population means have been explained.
9.7 GLOSSARY:
Mean Square between Samples (MSB): A measure of the variation among means of
samples taken from different populations.
Mean square within Samples (MSW): A measure of the variation within data of all
samples taken from different populations.
Two-way analysis of variance: a procedure used to analyze two-factor experiments.
9.8 REFERENCES:
1. Gupta, S.C., and Kapoor, V.K., "Fundamentals of Mathematical Statistics", Sultan Chand and Sons, New Delhi, 1997, 9th Edition.
2. Gupta, S.C., "Fundamentals of Statistics", Himalaya Publishing House, New Delhi, 2004.
3. Murray R. Spiegel and Larry J. Stephens, "Statistics - Schaum's Outlines", Third Edition, McGraw-Hill International Editions, 1999.
4. Levin, Richard I. and Rubin, David S., "Statistics for Management", PHI, New Delhi, 2000.
3. Three varieties of wheat, A, B and C, were treated with four different fertilizers, 1, 2, 3 and 4, and the yields of wheat per acre were as follows:
Varieties of wheat
Fertilizers A B C Total
1 55 72 47 174
2 64 66 53 183
3 58 57 74 189
4 59 57 58 174
Perform an analysis of variance on the above data and interpret the results.
4. The following table gives the data on the performance of three different detergents at three different water temperatures. The performance was obtained as 'whiteness' readings based on specially designed equipment for nine loads of washing:
5. Consider the following ANOVA table, based on information obtained for three randomly selected
samples from three independent populations, which are normally distributed with equal variances.
b) Test the null hypothesis that the means of the three populations are all equal, using
0.01 level of significance.
6. The following represent the number of units of production per day turned out by four different
workers using five different types of machines:
Machine Type
Worker A B C D E Total
1 4 5 3 7 6 25
2 5 7 7 4 5 28
3 7 6 7 8 8 36
4 3 5 4 8 2 22
Total 19 23 21 27 21 111
On the basis of this information, can it be concluded that (i) the mean productivity is the same for the different machines, and (ii) the workers don't differ with regard to productivity?
7. Set up an analysis of variance table for the following per acre production data for three varieties of
wheat, each grown on 4 plots and state if the variety differences are significant.
8. Three varieties of wheat W1, W2 and W3 are treated with four different fertilizers viz., f1, f2, f3 and
f4. The yields of wheat per acre were as under:
Set up a table for the analysis of variance and work out the F-ratios in respect of the above. Are the F-
ratios significant?
9. Apply the technique of Analysis of Variance to the following data, relating to yields of four
varieties of wheat in three blocks:
Varieties Blocks
1 2 3
I 10 9 8
II 7 7 6
III 8 5 4
IV 5 4 4
10.Three different methods of teaching Statistics are used on three groups of students. Random
samples of size 5 are taken from each group and the results are shown below. The grades are on a 10-
point scale.
Determine on the basis of the above data whether there is a difference in the teaching methods.
by
Dr.B.Raja Shekhar
Reader
University of Hyderabad.
10.0 OBJECTIVES:
10.1 INTRODUCTION:
The majority of hypothesis tests so far have made inferences about population parameters, such as the mean and the proportion. These parametric tests have used the parametric statistics of samples that came from the population being tested. Formulating these tests required restrictive assumptions about the populations. But populations are not always normal. And even if a goodness-of-fit test indicates that a population is approximately normal, there are certain situations in which the use of the normal curve is not appropriate.
In recent times statisticians have developed useful techniques that do not make restrictive assumptions about the shape of population distributions. These are known as distribution-free or, more commonly, non-parametric tests. The hypotheses of a non-parametric test are concerned with something other than the value of a population parameter. A large number of non-parametric tests exist. The more widely used ones are:
1. Chi-square test (χ² test)
2. The sign test
a) One-sample sign test
b) Paired-sample sign test
3. A rank sum test
a) The Mann-Whitney U test
b) The kruskal-Wallis test.
From what has been stated above in respect of important non-parametric tests, we can say that these
tests share in main the following characteristics:
1. They do not suppose any particular distribution and the consequential assumptions.
2. They are rather quick and easy to use i.e., they do not require laborious computations since in
many cases the observations are replaced by their rank order and in many others we simply
use signs.
3. They are often not as efficient or 'sharp' as parametric tests of significance. An interval estimate with 95% confidence may be twice as large with the use of non-parametric tests as with regular standard methods. The reason is that these tests do not use all the available information but rather use groupings or rankings, and the price we pay is a loss in efficiency. In fact, when we use non-parametric tests, we make a trade-off: we lose sharpness in estimating intervals, but we gain the ability to use less information and to calculate faster.
4. When our measurements are not as accurate as is necessary for standard tests of significance,
then non-parametric methods come to our rescue which can be used fairly satisfactorily.
5. Parametric tests cannot apply to ordinal or nominal scale data but non-parametric tests do not
suffer from any such limitation.
6. The parametric tests of difference like ÂtÊ or ÂFÊ make assumption about the homogeneity of
the variances whereas this is not necessary for non-parametric tests of difference.
1. Non-parametric tests are distribution-free, i.e., they do not require any assumption that the population follows a normal or any other distribution.
2. Generally they are simple to understand and easy to apply when the sample sizes are small.
2. Generally they are simple to understand and easy to apply when the sample sizes are small
3. Sometimes even formal ordering or ranking is not required.
4. Many non-parametric methods make it possible to work with very small samples.
5. Non-parametric methods make fewer and less stringent assumptions than do the classical
procedures.
The χ² test (pronounced chi-square test) is one of the simplest and most widely used non-parametric tests in statistical work. The symbol χ is the Greek letter chi. The χ² test was first used by Karl Pearson in the year 1900. The quantity χ² describes the magnitude of the discrepancy between theory and observation. It is defined as
χ² = Σ (O − E)²/E
where O refers to the observed frequencies and E refers to the expected frequencies.
(i) Calculate the expected frequencies.
(ii) Take the difference between the observed and expected frequencies and obtain the squares of these differences, i.e., (O − E)².
(iii) Divide each value of (O − E)² obtained in step (ii) by the respective expected frequency and obtain the total:
χ² = Σ (O − E)²/E
This gives the value of χ², which can range from zero to infinity. If χ² is zero, it means that the observed and expected frequencies completely coincide. The greater the discrepancy between the observed and expected frequencies, the greater shall be the value of χ².
The calculated value of χ² is compared with the table value of χ² for the given degrees of freedom at a certain specified level of significance. If at the stated level (generally the 5% level is selected) the calculated value of χ² is more than the table value of χ², the difference between theory and observation is considered significant, i.e., it could not have arisen due to fluctuations of simple sampling. If, on the other hand, the calculated value of χ² is less than the table value, the difference between theory and observation is not considered significant, i.e., it is regarded as due to fluctuations of simple sampling and hence ignored.
The main characteristics of the χ² test are:
(i) The test (as a non-parametric test) is based on frequencies and not on parameters like the mean and standard deviation.
(ii) The test is used for testing hypotheses and is not useful for estimation.
(iii) The test possesses the additive property, as has already been explained.
(iv) The test can also be applied to a complex contingency table with several classes and as such is a very useful test in research work.
(v) The test is an important non-parametric test, as no rigid assumptions are necessary regarding the type of population, no parameter values are needed, and relatively little mathematical detail is involved.
While comparing the calculated value of χ² with the table value, we have to determine the degrees of freedom. By degrees of freedom we mean the number of classes to which values can be assigned arbitrarily, or at will, without violating the restrictions or limitations placed.
For example, if we are to choose any five numbers whose total is 100, we can exercise independent choice for only four of them; the fifth is fixed by virtue of the total being 100, as it must equal 100 minus the sum of the four numbers selected. In general,

v = n − k

where v is the number of degrees of freedom, n is the number of classes, and k is the number of independent constraints.
The following points about the χ² test are worth noting:
1. The sum of the differences between the observed and expected frequencies is always zero.
Symbolically, Σ(O − E) = ΣO − ΣE = N − N = 0
The χ² test is one of the most popular statistical inference procedures today. It is applicable to a very large number of practical problems, which can be summed up under the following heads:
1) χ² test as a test of independence
Expectation of (AB) = (A) × (B) / N = (240 × 812) / 3248 = 60

i.e., E₁, the expected frequency corresponding to the first row and first column, is 60. The table of expected frequencies is:

60     752     812
180    2256    2456
240    3008    3248
O       E       (O − E)²    (O − E)²/E
20      60      1600        26.667
220     180     1600        8.889
792     752     1600        2.128
2216    2256    1600        0.709
                Σ(O − E)²/E = 38.393

χ² = Σ (O − E)² / E = 38.393
v = (r − 1)(c − 1) = (2 − 1)(2 − 1) = 1
For 1 degree of freedom, χ²₀.₀₅ = 3.84.
Since the calculated value of χ² is greater than the table value, the hypothesis is rejected. Hence, quinine is useful in checking malaria.
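The arithmetic of this worked example can be checked with a short Python sketch (a verification aid, not part of the original solution; the observed and expected frequencies are those tabulated above):

```python
# Chi-square statistic from observed and expected frequencies
# (figures from the quinine/malaria worked example).
observed = [20, 220, 792, 2216]
expected = [60, 180, 752, 2256]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_sq, 3))  # ~38.392 (the text's 38.393 rounds each term first)

df = (2 - 1) * (2 - 1)          # degrees of freedom for a 2x2 table
critical_5pct = 3.84            # chi-square table value for 1 d.f. at 5%
print(chi_sq > critical_5pct)   # True -> reject the hypothesis
```

The small discrepancy in the third decimal place arises only from when the rounding is done; the conclusion is unchanged.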
2. In an agro-economic survey, information was collected on 1000 randomly selected fields about the tenancy status of their cultivation and the use of fertilizers. The following classification was noted.
Would you conclude that owner cultivators are more inclined towards the use of fertilizers, at the 5% level? Carry out the Chi-Square test as per the testing procedure.
Solution:
Let us take the hypothesis that ownership of fields and the use of fertilizers are independent attributes.
Expectation of (AB) = (A) × (B) / N = (480 × 600) / 1000 = 288

Expected frequencies:

288    312    600
192    208    400
480    520    1000
Applying the χ² test:

O      E      (O − E)²    (O − E)²/E
416    288    16,384      56.889
64     192    16,384      85.333
184    312    16,384      52.513
336    208    16,384      78.769

χ² = Σ (O − E)² / E = 273.504
v = (2 − 1)(2 − 1) = 1
Since the calculated χ² of 273.504 far exceeds the table value of 3.84 for 1 degree of freedom at the 5% level, the hypothesis of independence is rejected: ownership of fields and the use of fertilizers are associated, and owner cultivators are more inclined towards the use of fertilizers.
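The full procedure, including the computation of expected frequencies from the marginal totals, can be sketched in Python as follows (only the cell counts and totals come from the example; the row labels are assumptions about which attribute is which):

```python
# Chi-square test of independence for the fertilizer example:
# each expected cell frequency is (row total x column total) / N.
observed = [[416, 184],   # assumed: owner-cultivated fields
            [64, 336]]    # assumed: tenant-cultivated fields

row_totals = [sum(row) for row in observed]        # [600, 400]
col_totals = [sum(col) for col in zip(*observed)]  # [480, 520]
n = sum(row_totals)                                # 1000

expected = [[r * c / n for c in col_totals] for r in row_totals]
# [[288.0, 312.0], [192.0, 208.0]] -- matching the table above

chi_sq = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
             for i in range(2) for j in range(2))
print(round(chi_sq, 1))  # 273.5, far above the 5% table value 3.84 for 1 d.f.
```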
The sign test is the simplest of the non-parametric tests. It is known as the sign test because it is based on the direction of the plus or minus signs of observations in a sample rather than on their numerical values. The sign test can be of two types:
1. The one-sample sign test
2. The paired-sample sign test
For large samples, the test statistic is

Z = (X − np) / √(npq)

and the Z table is used for interpretation.
In a one-sample sign test we test the null hypothesis that the population median equals some value µ₀ against an appropriate alternative, on the basis of a random sample of size n. Replace each sample value greater than µ₀ with a plus sign and each sample value less than µ₀ with a minus sign, and discard sample values exactly equal to µ₀. Then test the null hypothesis that these plus and minus signs are values of a random variable having the binomial distribution with p = 0.5.
Example:
It is required to test the hypothesis that the mean sales per day (µ) is 20 against the alternative hypothesis µ ≠ 20. Fifteen observations were taken, with the following results:
18, 19, 25, 21, 16, 15, 19, 22, 24, 21, 18, 17, 15, 26 and 24.
Level of significance= 0.05.
Solution:
n = 15
Replace each value greater than 20 with a plus (+) sign and each value less than 20 with a minus (−) sign. This gives 7 plus signs and 8 minus signs; no observation equals 20, so X = 7, taking X as the number of plus signs.

Z = (X − np) / √(npq) = (7 − 15 × 0.5) / √(15 × 0.5 × 0.5)

Z = −0.26
Since the calculated Z = −0.26 lies between −1.96 and 1.96 (the critical values of Z at the 0.05 level of significance), the null hypothesis is accepted.
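The whole one-sample sign test for this example can be sketched in a few lines of Python (the data and hypothesized value come from the example above; variable names are ours):

```python
import math

# One-sample sign test for the sales example: H0: median = 20.
sales = [18, 19, 25, 21, 16, 15, 19, 22, 24, 21, 18, 17, 15, 26, 24]
median_0 = 20

kept = [v for v in sales if v != median_0]  # values equal to 20 are discarded
n = len(kept)                               # 15
x = sum(1 for v in kept if v > median_0)    # number of plus signs -> 7

p = q = 0.5
z = (x - n * p) / math.sqrt(n * p * q)
print(round(z, 2))  # -0.26: inside (-1.96, 1.96), so H0 is not rejected
```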
The paired-sample sign test involves paired data, such as data on the collection of accounts receivable before and after a new collection policy, or the responses of father and son towards ideal family size. In these problems, each pair of sample values is replaced with a plus sign if the first value is greater than the second and with a minus sign if the first value is smaller than the second; we then proceed in the same manner as in the one-sample sign test.
Example:
The following data relate to the downtimes (periods, in minutes, in which the computers were inoperative on account of failures) of two different computers. Test whether the downtime of the two computers is the same or different.

Computer A: 58 60 42 62 65 59 60 52 50 75 59 52 57 30 46 66 40 78 55 52 58 44
Computer B: 32 48 50 41 45 40 43 43 70 60 80 45 36 56 40 70 50 53 50 30 42 45
Solution:
These data are shown in the table below, along with the + or − sign applicable to each pair of values. A plus sign is assigned when the downtime for computer A is greater than that for computer B, and a minus sign when the downtime for computer B is greater than that for computer A.

Computer A: 58 60 42 62 65 59 60 52 50 75 59
Computer B: 32 48 50 41 45 40 43 43 70 60 80
Sign:        +  +  −  +  +  +  +  +  −  +  −

Computer A: 52 57 30 46 66 40 78 55 52 58 44
Computer B: 45 36 56 40 70 50 53 50 30 42 45
Sign:        +  +  −  +  −  −  +  +  +  +  −
There are 15 plus signs and 7 minus signs, so X = 15 and n = 22.

Z = (X − np) / √(npq) = (15 − 22 × 0.5) / √(22 × 0.5 × 0.5)

Z = 1.71
Since the calculated Z = 1.71 lies between −1.96 and 1.96 (the critical values of Z at the 0.05 level of significance from the normal distribution table), the null hypothesis is accepted. This indicates that the downtime of the two computers is the same.
Rank sum tests are a whole family of tests. Although there are a number of rank sum tests, only the two most widely used, the Mann-Whitney U test and the Kruskal-Wallis test, are discussed here. When only two populations are involved, the Mann-Whitney U test is used; the Kruskal-Wallis test is used when more than two populations are involved. These are nonparametric methods that enable us to determine whether independent samples have been drawn from populations with the same distribution.
a) Mann-Whitney test (or U-test): This is a very popular test amongst the rank sum tests. It is used to determine whether two independent samples have been drawn from the same population or not. It uses more information than the sign test. The test applies under very general conditions and requires only that the populations sampled are continuous; in practice, even violation of this assumption does not affect the results very much.
To perform this test, rank the data jointly, taking them as belonging to a single sample, in either increasing or decreasing order of magnitude. We usually adopt the low-to-high ranking process: assign rank 1 to the item with the lowest value, rank 2 to the next higher item, and so on. In case of ties, assign each of the tied observations the mean of the ranks they jointly occupy. For example, if the sixth, seventh and eighth values are identical, we assign each the rank (6 + 7 + 8)/3 = 7. After this we find the sum of the ranks assigned to the values of one of the samples, and then work out the test statistic U, a measure of the difference between the ranked observations of the two samples, as under:
U = n1·n2 + n1(n1 + 1)/2 − R1

or

U = n1·n2 + n2(n2 + 1)/2 − R2
where n1 and n2 are the sample sizes and R1 is the sum of the ranks assigned to the values of the first sample. (In practice, whichever rank sum can be conveniently obtained may be taken, since it is immaterial which sample is called the first sample.)
In applying the U-test we take the null hypothesis that the two samples come from identical populations. If this hypothesis is true, it seems reasonable to suppose that the means of the ranks assigned to the values of the two samples should be more or less the same. Under the alternative hypothesis, the means of the two populations are not equal; if this is so, most of the smaller ranks will go to the values of one sample while most of the higher ranks will go to those of the other.
If the null hypothesis that the n1 + n2 observations came from identical populations is true, the U statistic has a sampling distribution with

Mean: µu = n1·n2 / 2

and standard deviation (or standard error):

σu = √( n1·n2(n1 + n2 + 1) / 12 )
If n1 and n2 are sufficiently large (i.e., both greater than 10), the sampling distribution of U can be approximated closely by the normal distribution, and the limits of the acceptance region can be determined in the usual way at a given level of significance. But if either n1 or n2 is so small that the normal curve approximation to the sampling distribution of U cannot be used, exact tests may be based on special tables, such as the one given in the appendix, showing selected values of Wilcoxon's (unpaired) distribution.
Example:
Suppose that a state university wants to test the hypothesis that the mean scores of students at two branches are equal. The board keeps statistics on all students at all branches of the system. A random sample of 15 students from each branch has produced the data below.
Branch A: 1000 1120 800 750 1300 950 1050 1280 1400 850 1150 1200 1500 600 775
Branch B: 920 1120 850 1360 650 725 890 1600 900 1140 1550 550 1240 925 500
Solution:
To apply the Mann-Whitney U test to this problem we begin ranking all the scores in order from
lowest to highest.
Using the values of n1 and n2 and the rank sums (the joint ranking gives R1 = 247 for Branch A), determine the U statistic:

U = n1·n2 + n1(n1 + 1)/2 − R1 = (15)(15) + (15)(16)/2 − 247 = 98

The mean of the U statistic is

µu = n1·n2 / 2 = (15)(15)/2 = 112.5

and the standard error of the U statistic is

σu = √( n1·n2(n1 + n2 + 1)/12 ) = √( (15)(15)(15 + 15 + 1)/12 ) = √581.25 = 24.1
Z = (U − µu) / σu = (98 − 112.5) / 24.1 = −0.602

The table value of Z at the 0.05 level of significance is ±1.96. Since the sample statistic lies within the acceptance region, we accept the null hypothesis: the mean scores at the two branches are the same.
b) Kruskal-Wallis test (or H test): This test is conducted in a way similar to the U test described above. It is used to test the null hypothesis that k independent random samples come from identical universes against the alternative hypothesis that the means of these universes are not equal. The test is analogous to one-way analysis of variance but, unlike the latter, it does not require the assumption that the samples come from approximately normal populations or from universes having the same standard deviation; it is a nonparametric version of ANOVA.
To perform this test, we rank all scores jointly, without regard to the groups to which they belong, and then find the sum of ranks Ri for each sample. The H statistic is calculated from the formula

H = 12/(N(N + 1)) × ( R1²/n1 + R2²/n2 + … + Rk²/nk ) − 3(N + 1)

where N = n1 + n2 + … + nk is the total number of observations.
Example:
Use the Kruskal-Wallis test at the 5% level of significance to test the null hypothesis that a professional bowler performs equally well with four bowling balls, given the following results:
Solution:
To apply the H test (Kruskal-Wallis test) to this problem, we begin by ranking all the given figures from the highest to the lowest, indicating beside each the name of the ball:

Bowling result    Rank    Ball
302               1       B
297               2       D
282               3       A
297               4       D
276               5       B
275               6       B
271               7       A
270               8       D
268               9       B
266               10      C
262               11      A
260               12      C
258               13      D
257               14      A
255               15      C
252               16      B
248               17      A
246               18      C
242               19      D
239               20      C
Summing the ranks for each ball gives:
R_A = 3 + 7 + 11 + 14 + 17 = 52
R_B = 1 + 5 + 6 + 9 + 16 = 37
R_C = 10 + 12 + 15 + 18 + 20 = 75
R_D = 2 + 4 + 8 + 13 + 19 = 46
H = 12/(N(N + 1)) × Σ Ri²/ni − 3(N + 1)
  = 12/(20 × 21) × ( 52²/5 + 37²/5 + 75²/5 + 46²/5 ) − 3(20 + 1)
  = (0.02857)(2362.8) − 63 = 67.51 − 63 = 4.51
As the four samples have five items each, the sampling distribution of H approximates closely to the χ² distribution. Taking the null hypothesis that the bowler performs equally well with the four balls, the table value is χ² = 7.815 for (k − 1) = 4 − 1 = 3 degrees of freedom at the 5% level of significance. Since the calculated value of H is only 4.51 and does not exceed the χ² value of 7.815, we accept the null hypothesis and conclude that the bowler performs equally well with the four bowling balls.
10.6 SUMMARY:
There are many situations in which the various assumptions required for standard tests of significance (such as that the population is normal, samples are independent, the standard deviation is known, etc.) cannot be met; in such cases we can use non-parametric methods. Moreover, they are easier to explain and easier to understand, which is why such tests have become popular. But one should not forget that they are usually less efficient/powerful, as they are based on no assumptions (or virtually none), and we all know that the less one assumes, the less one can infer from a set of data. The other side must also be kept in view: the more one assumes, the more one limits the applicability of one's methods.
10.7 GLOSSARY:
Chi-Square test: A statistical technique used to test significance in the analysis of frequency distributions.
Kruskal-Wallis test: A nonparametric method for testing the null hypothesis that k independent random samples come from identical populations. It is a direct generalization of the Mann-Whitney test.
Mann-Whitney U test: A nonparametric test used to determine whether two samples come from identical populations or whether the populations have different means.
Non-parametric tests: Tests that rely less on parameter estimation and/or assumptions about the shape of the population distribution.
Run: A sequence of identical occurrences that may be preceded and followed by different occurrences, or, at times, by no occurrences at all.
Sign test: A nonparametric test that takes into account the differences between paired observations, where plus (+) and minus (−) signs are substituted for quantitative values.
10.8 REFERENCES:
1) Levin, Richard, Statistics for Management (Prentice Hall of India, 1999)
2) Harnett, Introduction to Statistical Methods (Addison-Wesley Publishing Co.)
3) Gupta, S. P., Statistical Methods (Sultan Chand & Sons, New Delhi, 2005)
2. What are the major advantages of nonparametric methods over parametric methods?
3. A company has collected the following data relating to the average weekly loss of man-hours on account of accidents in 8 plants over a period of six months, before and after an industrial safety programme was put into operation (the data were obtained to ascertain the programme's effectiveness): 72 and 59, 26 and 24, 125 and 120, 39 and 35, 54 and 43, 39 and 35, 13 and 15, 12 and 18.
Test the null hypothesis that the safety programme is not effective, using the two-sample sign test at the α = 0.05 level of significance.
4. A company used three different methods of advertising its product in three cities. It later found the
increased sales (in thousand rupees) in identical retail outlets in the three cities as follows:
City A 70 58 60 45 55 62 89 72
City B 65 57 48 55 75 68 45 52
City C 53 59 71 70 63 60 58 75
Use the Kruskal-Wallis method to test the hypothesis that the mean increase in sales on account of the three different methods of advertising was the same in the retail outlets in cities A, B and C. Use a 5 percent level of significance.
5. The following data relate to the costs of building comparable lots in the two Resorts A and B (in
million rupees):
The company owning resort area A claimed that the median price of building lots was less in area A than in resort area B. You are asked to test this claim, using a nonparametric test at a 1 percent level of significance.
6. 1000 students at the college level were graded according to their I.Q. and the economic conditions of their homes. Use the χ²-test to find out whether there is any association between economic condition at home and I.Q.

Economic conditions    High I.Q.    Low I.Q.    Total
Rich                   460          140         600
Poor                   240          160         400
Total                  700          300         1000
8. A survey of 320 families with five children each revealed the following distribution:

No. of boys:       5    4    3    2    1    0
No. of girls:      0    1    2    3    4    5
No. of families:   14   56   110  88   40   12

Is this distribution consistent with the hypothesis that male and female births are equally probable? Apply the Chi-square test.
9. Suppose that, playing four rounds of golf at the City Club, 11 professionals totalled 280, 282, 290, 273, 283, 283, 275, 284, 282, 279 and 281. Use the sign test at the 5% level of significance to test the null hypothesis that professional golfers average µ = 284 for four rounds against an appropriate alternative.
10. The following are the numbers of artifacts dug up by two archaeologists at an ancient cliff dwelling on 30 days:

By X: 1 0 2 3 1 0 2 2 3 0 1 1 4 1 2 1 3 5 2 1 3 2 4 1 3 2 0 2 4 2
By Y: 0 0 1 0 2 0 0 1 1 2 0 1 2 1 1 0 2 2 6 0 2 3 0 2 1 0 1 0 1 0

Use the sign test at the 1% level of significance to test the null hypothesis that the two archaeologists, X and Y, are equally good at finding artifacts against the alternative hypothesis that X is better.
by Dr. B. Raja Shekhar, Reader, University of Hyderabad.
Standard Normal Distribution Table (area under the curve between 0 and z)
z    0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
2.4 0.4918 0.4920 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932 0.4934 0.4936
2.5 0.4938 0.4940 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 0.4951 0.4952
2.6 0.4953 0.4955 0.4956 0.4957 0.4959 0.4960 0.4961 0.4962 0.4963 0.4964
2.7 0.4965 0.4966 0.4967 0.4968 0.4969 0.4970 0.4971 0.4972 0.4973 0.4974
2.8 0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979 0.4980 0.4981
2.9 0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985 0.4986 0.4986
3.0 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990
Student's t Table
F Distribution Tables
Upper 10% points (α = 0.10)
df2/df1 1 2 3 4 5 6 7 8 9 10 12 15 20 INF
1 39.86 49.5 53.6 55.8 57.2 58.2 58.9 59.4 59.86 60.19 60.7 61.22 61.74 63.3
2 8.526 9 9.16 9.24 9.29 9.326 9.35 9.37 9.381 9.392 9.41 9.425 9.441 9.49
3 5.538 5.46 5.39 5.34 5.31 5.285 5.27 5.25 5.24 5.23 5.22 5.2 5.184 5.13
4 4.545 4.32 4.19 4.11 4.05 4.01 3.98 3.95 3.936 3.92 3.9 3.87 3.844 3.76
5 4.06 3.78 3.62 3.52 3.45 3.405 3.37 3.34 3.316 3.297 3.27 3.238 3.207 3.11
6 3.776 3.46 3.29 3.18 3.11 3.055 3.01 2.98 2.958 2.937 2.9 2.871 2.836 2.72
7 3.589 3.26 3.07 2.96 2.88 2.827 2.78 2.75 2.725 2.703 2.67 2.632 2.595 2.47
8 3.458 3.11 2.92 2.81 2.73 2.668 2.62 2.59 2.561 2.538 2.5 2.464 2.425 2.29
9 3.36 3.01 2.81 2.69 2.61 2.551 2.51 2.47 2.44 2.416 2.38 2.34 2.298 2.16
10 3.285 2.92 2.73 2.61 2.52 2.461 2.41 2.38 2.347 2.323 2.28 2.244 2.201 2.06
11 3.225 2.86 2.66 2.54 2.45 2.389 2.34 2.3 2.274 2.248 2.21 2.167 2.123 1.97
12 3.177 2.81 2.61 2.48 2.39 2.331 2.28 2.24 2.214 2.188 2.15 2.105 2.06 1.9
13 3.136 2.76 2.56 2.43 2.35 2.283 2.23 2.2 2.164 2.138 2.1 2.053 2.007 1.85
14 3.102 2.73 2.52 2.39 2.31 2.243 2.19 2.15 2.122 2.095 2.05 2.01 1.962 1.8
15 3.073 2.7 2.49 2.36 2.27 2.208 2.16 2.12 2.086 2.059 2.02 1.972 1.924 1.76
16 3.048 2.67 2.46 2.33 2.24 2.178 2.13 2.09 2.055 2.028 1.99 1.94 1.891 1.72
17 3.026 2.64 2.44 2.31 2.22 2.152 2.1 2.06 2.028 2.001 1.96 1.912 1.862 1.69
18 3.007 2.62 2.42 2.29 2.2 2.13 2.08 2.04 2.005 1.977 1.93 1.887 1.837 1.66
19 2.99 2.61 2.4 2.27 2.18 2.109 2.06 2.02 1.984 1.956 1.91 1.865 1.814 1.63
20 2.975 2.59 2.38 2.25 2.16 2.091 2.04 2 1.965 1.937 1.89 1.845 1.794 1.61
inf 2.706 2.3 2.08 1.94 1.85 1.774 1.72 1.67 1.632 1.599 1.55 1.487 1.421 1
Upper 5% points (α = 0.05)
df2/df1 1 2 3 4 5 6 7 8 9 10 12 15 20 INF
1 161 200 215.7 224.6 230.2 234 236.8 239 241 242 244 246 248 254.31
2 18.5 19 19.16 19.25 19.3 19.33 19.35 19.4 19.4 19.4 19.4 19.4 19.4 19.496
3 10.1 9.55 9.277 9.117 9.014 8.941 8.887 8.85 8.81 8.79 8.74 8.7 8.66 8.5264
4 7.71 6.94 6.591 6.388 6.256 6.163 6.094 6.04 6 5.96 5.91 5.86 5.8 5.6281
5 6.61 5.79 5.41 5.192 5.05 4.95 4.876 4.82 4.77 4.74 4.68 4.62 4.56 4.365
6 5.99 5.14 4.757 4.534 4.387 4.284 4.207 4.15 4.1 4.06 4 3.94 3.87 3.6689
7 5.59 4.74 4.347 4.12 3.972 3.866 3.787 3.73 3.68 3.64 3.57 3.51 3.44 3.2298
8 5.32 4.46 4.066 3.838 3.688 3.581 3.501 3.44 3.39 3.35 3.28 3.22 3.15 2.9276
9 5.12 4.26 3.863 3.633 3.482 3.374 3.293 3.23 3.18 3.14 3.07 3.01 2.94 2.7067
10 4.96 4.1 3.708 3.478 3.326 3.217 3.136 3.07 3.02 2.98 2.91 2.85 2.77 2.5379
11 4.84 3.98 3.587 3.357 3.204 3.095 3.012 2.95 2.9 2.85 2.79 2.72 2.65 2.4045
12 4.75 3.89 3.49 3.259 3.106 2.996 2.913 2.85 2.8 2.75 2.69 2.62 2.54 2.2962
13 4.67 3.81 3.411 3.179 3.025 2.915 2.832 2.77 2.71 2.67 2.6 2.53 2.46 2.2064
14 4.6 3.74 3.344 3.112 2.958 2.848 2.764 2.7 2.65 2.6 2.53 2.46 2.39 2.1307
15 4.54 3.68 3.287 3.056 2.901 2.791 2.707 2.64 2.59 2.54 2.48 2.4 2.33 2.0658
16 4.49 3.63 3.239 3.007 2.852 2.741 2.657 2.59 2.54 2.49 2.42 2.35 2.28 2.0096
17 4.45 3.59 3.197 2.965 2.81 2.699 2.614 2.55 2.49 2.45 2.38 2.31 2.23 1.9604
18 4.41 3.56 3.16 2.928 2.773 2.661 2.577 2.51 2.46 2.41 2.34 2.27 2.19 1.9168
19 4.38 3.52 3.127 2.895 2.74 2.628 2.544 2.48 2.42 2.38 2.31 2.23 2.16 1.878
20 4.35 3.49 3.098 2.866 2.711 2.599 2.514 2.45 2.39 2.35 2.28 2.2 2.12 1.8432
inf 3.84 3 2.605 2.372 2.214 2.099 2.01 1.94 1.88 1.83 1.75 1.67 1.57 1
Upper 1% points (α = 0.01)
df2/df1 1 2 3 4 5 6 7 8 9 10 12 15 20 INF
1 4052 5000 5403 5625 5764 5859 5928 5981 6022 6056 6106 6157 6209 6365.9
2 98.5 99 99.17 99.25 99.3 99.33 99.36 99.4 99.4 99.4 99.4 99.4 99.4 99.499
3 34.1 30.8 29.46 28.71 28.24 27.91 27.67 27.5 27.3 27.2 27.1 26.9 26.7 26.125
4 21.2 18 16.69 15.98 15.52 15.21 14.98 14.8 14.7 14.5 14.4 14.2 14 13.463
5 16.3 13.3 12.06 11.39 10.97 10.67 10.46 10.3 10.2 10.1 9.89 9.72 9.55 9.02
6 13.7 10.9 9.78 9.148 8.746 8.466 8.26 8.1 7.98 7.87 7.72 7.56 7.4 6.88
7 12.2 9.55 8.451 7.847 7.46 7.191 6.993 6.84 6.72 6.62 6.47 6.31 6.16 5.65
8 11.3 8.65 7.591 7.006 6.632 6.371 6.178 6.03 5.91 5.81 5.67 5.52 5.36 4.859
9 10.6 8.02 6.992 6.422 6.057 5.802 5.613 5.47 5.35 5.26 5.11 4.96 4.81 4.311
10 10 7.56 6.552 5.994 5.636 5.386 5.2 5.06 4.94 4.85 4.71 4.56 4.41 3.909
11 9.65 7.21 6.217 5.668 5.316 5.069 4.886 4.74 4.63 4.54 4.4 4.25 4.1 3.602
12 9.33 6.93 5.953 5.412 5.064 4.821 4.64 4.5 4.39 4.3 4.16 4.01 3.86 3.361
13 9.07 6.7 5.739 5.205 4.862 4.62 4.441 4.3 4.19 4.1 3.96 3.82 3.67 3.165
14 8.86 6.52 5.564 5.035 4.695 4.456 4.278 4.14 4.03 3.94 3.8 3.66 3.51 3.004
15 8.68 6.36 5.417 4.893 4.556 4.318 4.142 4 3.9 3.81 3.67 3.52 3.37 2.868
16 8.53 6.23 5.292 4.773 4.437 4.202 4.026 3.89 3.78 3.69 3.55 3.41 3.26 2.753
17 8.4 6.11 5.185 4.669 4.336 4.102 3.927 3.79 3.68 3.59 3.46 3.31 3.16 2.653
18 8.29 6.01 5.092 4.579 4.248 4.015 3.841 3.71 3.6 3.51 3.37 3.23 3.08 2.566
19 8.19 5.93 5.01 4.5 4.171 3.939 3.765 3.63 3.52 3.43 3.3 3.15 3 2.489
20 8.1 5.85 4.938 4.431 4.103 3.871 3.699 3.56 3.46 3.37 3.23 3.09 2.94 2.421
inf 6.64 4.61 3.782 3.319 3.017 2.802 2.639 2.51 2.41 2.32 2.19 2.04 1.88 1
Chi-Square Table
Area in the right tail of a Chi-Square Distribution
Critical Values of U in the Mann-Whitney Test (two-tailed test at α = 0.05)
n1 \ n2   9 10 11 12 13 14 15 16 17 18 19 20
2 0 0 0 1 1 1 1 1 2 2 2 2
3 2 3 3 4 4 5 5 6 6 7 7 8
4 4 5 6 7 8 9 10 11 11 12 13 13
5 7 8 9 11 12 13 14 15 17 18 19 20
6 10 11 13 14 16 17 19 21 22 24 25 27
7 12 14 16 18 20 22 24 26 28 30 32 34
8 15 17 19 22 24 26 29 31 34 36 38 41
9 17 20 23 26 28 31 34 37 39 42 45 48
10 20 23 26 29 33 36 39 42 45 48 52 55
11 23 26 30 33 37 40 44 47 51 55 58 62
12 26 29 33 37 41 45 49 53 57 61 66 69
13 28 33 37 41 45 50 54 59 63 67 72 76
14 31 36 40 45 50 55 59 64 67 74 78 83
15 34 39 44 49 54 59 64 70 75 80 85 90
16 37 42 47 53 59 64 70 75 81 86 92 98
17 39 45 51 57 63 67 75 81 87 93 99 105
18 42 48 55 61 67 74 80 86 93 99 106 112
Critical Values of U in the Mann-Whitney Test (two-tailed test at α = 0.10)
n1 \ n2   9 10 11 12 13 14 15 16 17 18 19 20
1 (n2 = 19, 20 only): 0 0
2 1 1 1 2 2 2 3 3 3 4 4 4
3 3 4 5 5 6 7 7 8 9 9 10 11
4 6 7 8 9 10 11 12 14 15 16 17 18
5 9 11 12 13 15 16 18 19 20 22 23 25
6 12 14 16 17 19 21 23 25 26 28 30 32
7 15 17 19 21 24 26 28 30 33 35 37 39
8 18 20 23 26 28 31 33 36 39 41 44 47
9 21 24 27 30 33 36 39 42 45 48 51 54
10 24 27 31 34 37 41 44 48 51 55 58 62
11 27 31 34 38 42 46 50 54 57 61 65 69
12 30 34 38 42 47 51 55 60 64 68 72 77
13 33 37 42 47 51 56 61 65 70 75 80 84
14 36 41 46 51 56 61 66 71 77 82 87 92
15 39 44 50 55 61 66 72 77 83 88 94 100
16 42 48 54 60 65 79 77 83 89 95 101 107