
POST GRADUATE DIPLOMA IN

BUSINESS MANAGEMENT
(PGDBM)

CENTRE FOR DISTANCE AND


VIRTUAL LEARNING (CDVL)
UNIVERSITY OF HYDERABAD
Golden Threshold, Nampally Station Road
Abids, Hyderabad – 500 001
Ph : 040 - 24600264, Fax : 040 - 24600266
E-mail: cde@uohyd.ernet.in

Copyright reserved © University of Hyderabad, CDVL


Post Graduate Diploma in Business Management
(PGDBM)

416: Quantitative and Research Methods

Block-I: Quantitative Methods

Editor & Coordinator
Prof. B. Raja Shekhar

Centre for Distance and Virtual Learning


University of Hyderabad
Hyderabad-500 001

PROGRAMME ADVISORY COMMITTEE

1. Prof. V. Venkata Ramana, Dean, School of Management Studies, University of Hyderabad

2. Prof. A. K. Pujari, Dean, School of Mathematics & Computer and Information Sciences, University of Hyderabad

3. Prof. A. Vidyadhar Reddy, Dean, Dept. of Business Management, Osmania University, Hyderabad

4. Prof. M. S. Bhat, School of Management Studies, Jawaharlal Nehru Technological University, Hyderabad

5. Mr. Sumit Dey, Supply Chain Manager, VST Industries Ltd, Hyderabad

6. Mr. K. A. Ramnath, Head, Training & Development, GATI Ltd, Secunderabad

7. Prof. B. Raja Shekhar, Reader, School of Management Studies, University of Hyderabad & Coordinator, PGDBM

8. Dr. S. Jeelani, Director, Centre for Distance and Virtual Learning, University of Hyderabad


Contributors

1. Prof. B. Raja Shekhar, School of Management Studies, University of Hyderabad.

2. Dr. C. R. Rao, Reader, Department of Mathematics & Statistics, School of Mathematics & Computer/Information Sciences, University of Hyderabad.

3. Dr. G. V. R. K. Acharyulu, Reader, School of Management Studies, University of Hyderabad.

4. Ms. P. Umamaheswari Devi, Asst. Professor, Adikavi Nannaya University, Rajahmundry.


PREFACE

In this dynamic and complex business environment the role of the manager is becoming very critical. Among the various managerial functions, decision making is gaining a great deal of importance, and the success or failure of a business is closely related to the decision-making skills of its managers. The purpose of this course is to introduce students to various quantitative and research tools used to support the decision-making process. The course exposes students to the statistical tools used in decision making and research. The entire course is divided into two blocks: Block-I deals with Statistics and Block-II deals with the Research Methodology component. Block-I is further divided into ten units, whereas Block-II comprises six units. The course has been written with the assumption that students are familiar with the fundamentals of mathematics at the school level.
Quantitative techniques play a vital role in the business decision-making process. The quantitative techniques area can be broadly divided into two parts, viz. Statistics and Operations Research. An attempt is made in this course to introduce various topics of statistics to make students more familiar with the decision-making process. Statistics is a broad subject which cannot be covered exhaustively in a distance education programme. Keeping this limitation in mind, an attempt is made to cover material ranging from the fundamentals of statistics to advanced statistical tools such as non-parametric tests. The first unit in Block-I deals with the introduction to decision making and statistics, wherein the decision-making process and various definitions of statistics are discussed at length. The second unit, titled Measures of Central Tendency and Dispersion, deals with measures such as the arithmetic mean, geometric mean, harmonic mean, median and mode, with solved problems. Following the measures of average, various dispersion tools such as the range, mean deviation, quartiles and standard deviation are explained. Next to the introduction and central tendency, the most important topic of statistics is probability. The entire third unit is dedicated to describing various issues of probability, starting from definitions to applications, followed by three probability distributions, namely the binomial, Poisson and normal distributions.

In business and management, the effect of one variable on another decides profitability, stability and performance. An attempt is made to discuss various issues of correlation in order to quantify the association between two or more variables. At the end of that unit there is a discussion of rank correlation, where ranks are given instead of actual scores. After the association between two variables is established by correlation, one will be interested in measuring the nature of the relationship and the change in one variable in response to a change in the other. Unit five enlightens the students on this in the form of regression and regression equations. A common phenomenon in any business is the fluctuation of sales and demand, which arises from various causes. These include several kinds of trend, namely seasonal, cyclical, secular and irregular variations. This trend analysis is useful in forecasting sales.

The seventh unit, titled Concept of Sampling and Estimation, deals with various methods of sampling, both probabilistic and non-probabilistic. This unit also discusses the problems and advantages in choosing each of the sampling techniques. Sampling is followed by estimation, which covers estimates such as the point estimate and the interval estimate. The primary objective of a sample is to get a feel of the whole population with the help of a representative sample. In most business situations, population characteristics are estimated on the basis of sample data, and in the process one needs to make some hypothesis about the population. Unit eight, titled Testing of Hypotheses, deals with these topics and also exposes students to various tests of hypotheses such as the z-test and t-test for single samples, double samples and proportions. The entire ninth unit is dedicated to analysis of variance (ANOVA), which tests whether samples are drawn from the same population or from different populations with the same mean. The last unit of Quantitative Methods deals with non-parametric tests, used where the population is not normal. One of the popular non-parametric tests is the Chi-square test, and a detailed description with the help of examples is given in this unit. The Chi-square test is followed by other non-parametric tests, namely the sign test and the rank sum test. Although the syllabus of the course is heavily loaded with content, necessary steps have been taken to make the concepts simple and clear with the help of a good number of examples and exercise problems.

Though it is a difficult task to condense two important courses into a single course, efforts have been made to cover all essential concepts with relevant applications. Even after putting in my best efforts, there might be some errors and deficiencies in the course. I will be extremely thankful if you give your valuable suggestions and comments on the course, which will be rectified in the next print.

Prof. B. RAJA SHEKHAR


CONTENTS

416: Quantitative and Research Methods

Block-I: Quantitative Methods

Preface

1) Introduction to Decision-making and Statistics

2) Measures of Central tendency and Dispersion

3) Concepts of Probability and Probability Distributions

4) Correlation

5) Regression

6) Time Series

7) Concept of sampling and Estimation

8) Testing of Hypotheses

9) Analysis of Variance

10) Non-parametric tests


1. INTRODUCTION TO DECISION MAKING AND STATISTICS

1. LEARNING OBJECTIVES:

After reading this lesson, you should be able to


• List the steps of the decision-making process
• Develop a quantitative analysis model
• Understand the meaning and definition of Statistics
• Know the nature of a statistical study
• Recognize the importance of Statistics as well as its limitations

1.1. INTRODUCTION:

Whether it is a factory, a farm, or a domestic kitchen, resources of men, machines and money have to be coordinated against time and space constraints to achieve the given objectives in the most efficient manner.
The manager has to constantly analyze the existing situation, determine the objectives, seek
alternatives, implement, coordinate, control and evaluate. The common thread of these activities is the
capability to evaluate information and make decisions. Managerial activities become complex as the
organizational settings in which they have to be performed become complex. As the complexity
increases, management becomes more of a science than an art and a manager by birth yields place to a
manager by profession. There is an increasing realization of the importance of Statistics in various
quarters. This is reflected in the increasing use of Statistics in the government, industry, business,
agriculture, mining, transport, education, medicine, and so on.

1.2. THE QUANTITATIVE ANALYSIS APPROACH:

Quantitative analysis uses a scientific approach to decision making. It consists of defining a problem,
developing a model, acquiring input data, developing a solution, analyzing the results and
implementing the results. Quantitative analysis has been in existence since the beginning of recorded
history, but it was Frederick W. Taylor who, in the early 1900s, pioneered the principles of the scientific approach to management. During World War II, many new scientific and quantitative techniques were developed to assist the military. These new developments were so successful that
after World War II many companies started using similar techniques in managerial decision making
and planning. Today, many organizations employ a staff of Operations Research or Management
Science personnel or consultants to apply the principles of scientific management to problems and
opportunities.

1.3 DECISION THEORY:

To a great extent, the successes or failures that a person experiences in life depend on the decisions that he or she makes. One decision may make the difference between a successful career and an unsuccessful one. "Decision theory is an analytic and systematic way to tackle problems." A good decision is one that is based on logic, considers all available data and possible alternatives, and applies the quantitative approach. Although occasionally good decisions yield bad results, in the long run using decision theory will result in successful outcomes.

1.3.1 THE SIX STEPS IN DECISION THEORY:

The six steps involved in decision making are as follows.


1. Clearly define and analyse the problem
2. Search for the possible alternatives
3. List out possible outcomes or states of nature
4. Evaluate the payoff or profit of each combination of alternatives and outcomes.
5. Choose one of the best decision theory models
6. Apply the model and make your decision.

1.3.1.1 Defining the problem:

The first step in the quantitative approach is to develop a clear, concise statement of the problem; this is the most important and most difficult step. It is essential to go beyond the symptoms of the problem and identify its true causes. When the problem is difficult to quantify, it may be necessary to concentrate on only a few areas.


1.3.1.2 Developing a model:

Once the problem is selected, the next step is to develop a model. A model is a representation (usually
mathematical) of a situation. The types of models include physical, scale, schematic and mathematical models.

Figure: The quantitative analysis approach (Defining the Problem → Developing a Model → Acquiring Input Data → Developing a Solution → Testing the Solution → Analyzing the Results → Implementing the Results).

1.3.1.3 Acquiring Input Data:

Once the model is developed, accurate data must be obtained. Even if the model is a perfect representation of reality, improper data will produce misleading results; this situation is called garbage in, garbage out. There are a number of sources that can be used in collecting data: company reports and documents, interviews, and sampling and statistical procedures.

1.3.1.4 Developing a solution:

Developing a solution involves manipulating the model to arrive at the best (optimal) solution. The
input data and model determine the accuracy of the solution.

1.3.1.5 Testing the Solution:

Testing the input data and the model includes determining the accuracy and completeness of the data used by the model. There are several ways to test the data. One method is to collect additional data from a different source. If the original data were collected using interviews, some additional data can be collected by direct measurement or sampling; these can be compared with the original data, and statistical tests can be employed to determine whether there are differences between the original data and the additional data.

1.3.1.6 Analyzing the Results and Sensitivity Analysis:

Analyzing the results starts with determining the implications of the solution. In most cases, a solution to a problem will result in some kind of action or change in the way an organisation is
operating. The implications of these actions or changes must be determined and analyzed before the
results are implemented. Sensitivity Analysis determines how the solutions will change with a
different model or input data.

1.3.1.7 Implementing the results:

The final step is to implement the results. This is the process of incorporating the solution into the
company.


1.3.2 DEVELOPING A QUANTITATIVE ANALYSIS MODEL:

Developing a model is an important part of the quantitative analysis approach. The following
mathematical model represents profits.
Profits =Revenue-Expenses
Revenues are expressed as the selling price per unit multiplied by the number of units sold. Expenses can be determined by summing fixed costs and variable costs, where variable cost is expressed as the variable cost per unit multiplied by the number of units sold. Thus, as per the mathematical model:

Profit = Revenue - (fixed cost + variable cost)

Profit = (selling price per unit)(number of units sold) - [fixed cost + (variable cost per unit)(number of units sold)]

Profit = SX - [F + VX]
Profit = SX - F - VX

where
S = selling price per unit
F = fixed cost
V = variable cost per unit
X = number of units sold

The parameters in this model are S, F and V, as these are inputs inherent in the model. The number of units sold (X) is the decision variable of interest.
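To make the model concrete, the following minimal Python sketch (not part of the original text; the values chosen for S, F and V are hypothetical) evaluates the profit equation and the implied break-even quantity F/(S - V):

    def profit(units_sold, selling_price, fixed_cost, variable_cost_per_unit):
        # Profit = S*X - F - V*X
        return (selling_price - variable_cost_per_unit) * units_sold - fixed_cost

    # Hypothetical inputs: S = Rs.10 per unit, F = Rs.1000, V = Rs.6 per unit
    S, F, V = 10.0, 1000.0, 6.0
    print(profit(500, S, F, V))   # profit of Rs.1000 at X = 500 units
    print(F / (S - V))            # break-even quantity: 250.0 units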

1.3.3 THE ADVANTAGES OF MATHEMATICAL MODELING:

There are a number of advantages of using mathematical models:


1. Models accurately represent reality; if properly formulated, a model can be extremely accurate
2. Models help a decision maker in formulating the problems
3. Models give us insight and information
4. Models save time and money in decision making and problem solving
5. A model may be the only way to solve some large or complex problems in a timely fashion
6. A model can be used to communicate problems and solutions to others.


1.3.4 MATHEMATICAL MODELS CATEGORIZED BY RISK:

If all the values are known with certainty, the model is deterministic. Models that involve risk or chance, often measured as a probability value, are called probabilistic models. For example, the market for a new product might be "good" with a chance of 60% (a probability of 0.6) or "not good" with a chance of 40% (a probability of 0.4).

1.3.5 POSSIBLE PROBLEMS IN THE QUANTITATIVE ANALYSIS APPROACH:

Listed below are possible problems in the quantitative analysis approach.

1.3.5.1 Defining the problem:

In the world of business, government and education, problems are, unfortunately, not easily identified. There are four roadblocks that quantitative analysts face in defining a problem.

• Conflicting view points:


The first difficulty is that quantitative analysts must often consider conflicting view points in
defining the problem.

• Impact on other departments


The next difficulty is that problems do not exist in isolation and are not owned by just one
department of a firm. The problem statement should be as broad as possible and include inputs
from all departments that have a stake in the solution.

• Beginning assumptions
The third difficulty is that people have a tendency to state problems in terms of solutions. From an implementation standpoint, "a good solution to the right problem is much better than an 'optimal' solution to the wrong problem."


• Solution Outdated
Even with the best of problem statements, however, there is a fourth danger: the problem can change as the model is being developed. In a rapidly changing business environment, it is not unusual for problems to appear or disappear virtually overnight.

1.3.5.2 Developing a Model:

• Fitting the textbook models


One of the problems in developing quantitative models is that a manager's perception of a problem won't always match the textbook approach.

• Understanding the model


A second major concern involves the trade-off between the complexity of the model and ease of understanding. Managers will not use the results of a model they do not understand. This can be
overcome by simplifying assumptions in order to make the model easier to understand. The model
loses some of its reality but gains acceptance by management.

1.3.5.3 Acquiring Input Data

Gathering the data to be used in the quantitative approach to problem solving is often very difficult.
• Using Accounting Data
One problem is that most data generated in a firm come from basic accounting reports. The accounting department collects its inventory data, for example, in terms of cash flows and turnover. But quantitative analysts tackling an inventory problem need to collect data on holding costs and ordering costs. If they ask for such data, they will find that the data were simply never collected for those specified costs.
• Validity of data
Data must often be distilled and manipulated before being used in a model. Unfortunately, the
validity of the results of a model is no better than the validity of the data that go into the model.


1.3.5.4 Developing a Solution

• Hard-to-Understand Mathematics


The first concern in developing solutions is that although the mathematical models used are
complex and powerful, they may not be completely understood. Fancy solutions to problems may
have faulty logic or data. The fear of mathematics often causes managers to remain silent when
they should be critical.

• Only one answer is limiting

The next problem is that quantitative models usually give just one answer to a problem. Most managers would like to have a range of options and not be put in a take-it-or-leave-it position. A more appropriate strategy is to present a range of options, indicating the effect that each solution has on the objective function.

1.3.5.5 Testing the Solution

The results of quantitative analysis often take the form of predictions of how things will work in the future if certain changes are made now. To get a preview of how well the solutions will really work, managers are often asked how good the solutions look to them. The problem is that complex models tend to give solutions that are not intuitively obvious, and such solutions tend to be rejected by managers. The quantitative analyst then has to work through the model and convince the manager of the validity of the results. In the process of convincing the manager, the analyst will have to review every assumption that went into the model; if the manager can be convinced that the model is valid, there is a good chance that the solution results are also valid.

1.3.5.6 Analyzing the results

Once the solution has been tested, the results must be analyzed in terms of how they will affect the total organization. Even small changes in organisations are often difficult to bring about, and if the results indicate large changes in organizational policy, the quantitative analyst can expect resistance when analyzing the results. The analyst should ascertain who must change and by how much, whether the people who must change will be better or worse off, and who has the power to direct the change.

1.3.5.7 Implementation –not just the final step

Implementation is not just another step that takes place after the modeling process is over. Each one of
these steps greatly affects the chances of implementing the results of a quantitative study.

• Lack of commitment and resistance to change

Even though many business decisions can be made intuitively, based on hunches and experience, there are more and more situations in which quantitative models can assist. Some managers, however, fear that the use of a formal analysis process will reduce their decision-making power. Others fear that it may expose some previous intuitive decisions as inadequate. Still others just feel uncomfortable about formalising their decision making. These managers often argue against the use of quantitative models.

• Lack of commitment by quantitative analysts

When the quantitative analyst is not an integral part of the department facing the problem, he or she sometimes tends to treat the modeling activity as an end in itself. That is, the analyst accepts the problem as stated by the manager and builds a model to solve only that problem. When the results are computed, he or she hands them back to the manager and considers the job done. Such an analyst does not care whether the results help to make the final decision and is not concerned with their implementation. Successful implementation requires that the analyst not tell the users what to do, but work with them and take their feelings into account.

1.4 QUANTITATIVE TECHNIQUES:

In solving problems using the quantitative approach, the essential tools are quantitative techniques. Quantitative techniques can be subdivided into two parts: Statistics and Operations Research. The content of this block of the course is statistics only. Operations Research is a scientific approach to solving problems for executive management. An application of operations research involves:
(a) Constructing mathematical, economic and statistical descriptions or models of decision and
control problems to treat situations of complexity and uncertainty.
(b) Analysing the relationships that determine the probable future consequences of decision choices,
and devising appropriate measure of effectiveness in order to evaluate the relative merit of
alternative actions.

The essential characteristics of Operations Research (OR) are:


(i) Examination of functional relationship from a system overview;
(ii) Utilization of the interdisciplinary approach;
(iii) Adoption of the planned approach;
(iv) Uncovering of new problems for study.

In the early years of Operations Research, OR group consisted of specialists in mathematics, physics,
chemistry, engineering, statistics, and economics and this helped very much to develop OR models
with an interdisciplinary approach. The production scheduling problem, for example, is quite complex
when it cuts across the entire firm. Thus, it is necessary to look at the problem in many different ways
in order to determine which one (or which combination) of the various disciplinary approaches is the
best. The interdisciplinary approach recognizes that most business problems have accounting,
biological, economic, engineering, mathematical, physical, psychological, sociological and statistical
aspects. The solution to a mathematical model or equation can be thought of as a function of
controlled and uncontrolled variables, related in some precise mathematical manner. The objective
function developed (utilizing controlled and uncontrolled variables) may have to be supplemented by a
set of restrictive statements on the possible values of the controlled variables. Though Operations
Research is an interesting subject, the components of it are beyond the scope of this course material.
However, the interested students can choose Decision Analysis course as the elective in second
semester, which discusses the content of Operations Research.

1.4.1 STATISTICS AND MANAGERIAL DECISIONS:

Since the complexity of the business environment makes the process of decision making difficult, the decision-maker cannot rely entirely upon his observation, experience or evaluation to make a decision.


Decisions have to be based upon data which show relationships, indicate trends, and show rates of change in various relevant variables. The field of statistics provides methods for collecting, presenting, analyzing and meaningfully interpreting data. This statistical methodology of collection, analysis and interpretation of data for better decision-making is a basic input for managerial decision-making and research in both the physical and social sciences, particularly in business and economics.

1.4.2 MEANING AND DEFINITIONS OF STATISTICS

At the outset, it may be noted that the word 'Statistics' is used rather curiously in two senses, plural and singular. In the plural sense, it refers to a set of figures, i.e. production and sale of textiles, television sets, and so on. In the singular sense, Statistics refers to the whole body of analytical tools that are used to collect the figures, organise and interpret them and finally draw conclusions from them.

Statistics has been defined differently by various authors. Some of the definitions are extremely narrow. This is understandable, since Statistics has developed over the past several decades and in the earlier days the role of Statistics was confined to a limited sphere. The following are a few definitions of statistics.
(i) "Statistics are the classified facts representing the conditions of the people in a State, especially those facts which can be stated in numbers or in tables of numbers or in any tabular or classified arrangement." - Webster
(ii) "Statistics are numerical statements of facts in any department of enquiry placed in relation to each other." - Bowley
(iii) "By Statistics we mean quantitative data affected to a marked extent by multiplicity of causes." - Yule and Kendall
(iv) "Statistics may be defined as the aggregate of facts affected to a marked extent by multiplicity of causes, numerically expressed, enumerated or estimated according to a reasonable standard of accuracy, collected in a systematic manner, for a predetermined purpose and placed in relation to each other." - Prof. Horace Secrist

Statistics has come to be considered much wider in its scope and, accordingly, experts have given wider definitions of it. Spiegel, for instance, defines Statistics highlighting its role in decision-making, particularly under uncertainty, as follows: "Statistics is concerned with scientific method for collecting, organizing, summarizing, presenting and analysing data as well as drawing valid conclusions and making reasonable decisions on the basis of such analysis." The figure below shows the classification of subject matter in the field of statistics.

Figure: Classification of the subject matter of Statistics
• Descriptive Statistics: data collection and presentation
• Inductive Statistics: statistical inference (hypothesis testing) and estimation (regression and correlation)
• Statistical Decision Theory: decision problems (alternatives, uncertainties, consequences, criterion of choice)

Statistical data constitute the basic raw material of the statistical method. These data are either readily
available or collected by the analyst. The manager may face four types of situations:
(i) When data need to be presented in a form which helps in easy grasping (for example, presentation of performance data in graphs, charts, and tables in the annual report of a company);
(ii) Where no specific action is contemplated but it is intended to test some hypotheses and
draw inferences;
(iii) When some unknown quantities have to be estimated or relationships established through
observed data; and
(iv) When a decision has to be made under uncertainty regarding a course of action to be
followed.

As indicated in the figure, situation (i) falls in the realm of descriptive statistics, situations (ii) and (iii) fall in the area of inductive statistics, and situation (iv) is dealt with by statistical decision theory. Thus, descriptive statistics refers to the analysis and synthesis of data so that a better description of the situation can be made and a better understanding of the facts promoted. Classification of production and sales in different locations and changes in the values of relevant variables, or the average value of data, are also a part of descriptive statistics.

Inductive statistics is concerned with the development of scientific criteria so that the values of a group may be meaningfully estimated by examining only a small portion of that group. The group is known as the 'population' or 'universe' and the portion is known as the 'sample'. Further, values computed from the sample are known as statistics and values in the population are known as parameters. Thus, inductive statistics is concerned with estimating universe parameters from sample statistics. The term inductive statistics is derived from the inductive process, which tries to arrive at information about the general (universe) from knowledge of the particular (sample).

Samples are drawn instead of a complete enumeration for the following reasons: (i) a complete enumeration takes so much time that the data often become obsolete by the time it is finished; (ii) sampling cuts down cost substantially; and (iii) sometimes securing information is a destructive process, for example, in a quality control situation pieces have to be broken down for testing.

1.4.3 STATISTICAL DATA:

Application of statistical techniques to managerial decision problems depends on the availability and
reliability of statistical data. Statistical data can be broadly grouped into two categories: (i) Published
data (that have already been collected and are readily available in the published form) and (ii)
unpublished data (that have not yet been collected and the analyst himself will have to collect them)

Data are also classified as primary or secondary. All original data collected by the analysts themselves fall in the category of primary data. Secondary data are those which are available for use
from other sources. Data are also generated by internal operations of the economic unit. For instance,
sales, labour data, financial statements, production schedules, cash flow data, budget data, etc.,
pertaining to an industrial unit constitute internal data and are found in the internal records of the
company. Data are also classified as micro and macro. Micro data relate to one unit or one region and
macro data relate to the entire economy or entire industry.


1.4.4 THE NATURE OF A STATISTICAL STUDY:

Whether a given problem pertains to business or to some other field, there are some well-defined steps that need to be followed in order to reach meaningful conclusions.

1. Formulation of the problem: The statistical study begins with the formulation of the problem
on which the study is to be done. The problem should be understood clearly and one should
not go beyond the scope of it or exclude some relevant aspect.
2. Objectives of the study: After formulating the problem the objectives of the study are
determined. The objectives should not be extremely ambitious because it may be difficult to
achieve them because of limitations of time, finance or even competence of those conducting
the study.
3. Determining sources of data: Once the problem and the objectives are formulated, determine the data required to conduct the study. Data can be collected from primary and secondary sources.
4. Designing data collection forms: Once the decision in favour of collection of primary data is taken, one has to decide the mode of collection. The two methods available are (i) the observational method, and (ii) the survey method. A suitable questionnaire has to be designed to collect data from respondents in a field survey.
5. Conducting the field survey: Side by side when the data collection forms are being designed,
one has to decide whether a census survey or a sample survey is to be conducted. For the
latter, a suitable sample design and the sample size are to be chosen. The field survey is then
conducted by interviewing sample respondents. Sometimes, the survey is done by mailing
questionnaires to the respondents instead of contacting them personally.
6. Organising the data: The field survey provides raw data from the respondents so it is
necessary to organize the data in the form of suitable tables and charts to know their salient
features.
7. Analyzing the data: On the basis of a preliminary examination of the data collected, as well as the nature and scope of the problem, analyze the data by selecting the most appropriate statistical technique.
8. Reaching statistical findings: The analysis will result in some statistical findings. Interpret these findings in terms of the concrete problem with which the investigation was started.


9. Presentation of findings: The last step is presenting the findings of the study, properly interpreted, in a suitable form. The choice is between an oral presentation and a written one. In the case of an oral presentation, one has to be extremely selective in choosing the material, as in a limited time one has to provide a broad idea of the study as well as its major findings so that the audience understands them in proper perspective. In the case of a written presentation, a report has to be prepared. It should be reasonably comprehensive and should have graphs and diagrams to facilitate the reader in understanding it in all its ramifications.
The detailed discussion on above topics is available in Block II of the course material.

1.4.5 IMPORTANCE OF STATISTICS IN BUSINESS:

It is perhaps difficult to imagine a field of knowledge which can do without Statistics. It is a tool of all sciences, indispensable to research and intelligent judgment, and has become a recognised discipline in its own right. There is hardly any field, whether it be trade, industry or commerce, economics, biology, botany, astronomy, physics, chemistry, education, medicine, sociology, psychology or meteorology, where statistical tools are not applicable. The importance of statistics has been summarized by A. L. Bowley as: "A knowledge of statistics is like a knowledge of foreign languages or of algebra. It may prove of use at any time under any circumstances."

There are three major functions in any business enterprise in which statistical methods are useful.
1. The planning of operations: This may relate either to special projects or to the recurring activities of a firm over a specified period.
2. The setting up of standards: This may relate to the size of employment, volume of sales,
fixation of quality norms for the manufactured produce, norms for the daily output, and so forth.
3. The function of control: This involves comparison of the actual production achieved against the norm or target set earlier. In case production has fallen short of the target, remedial measures are suggested so that such a deficiency does not occur again.

Different authors have highlighted the importance of Statistics in business. For instance, Croxton
and Cowden give numerous uses of Statistics in business such as project planning, budgetary
planning and control, inventory planning and control, quality control, marketing, production and
personnel administration.


In the sphere of production, for example, Statistics can be useful in various ways. Statistical
quality control methods are used to ensure the production of quality goods. This is achieved by
identifying and rejecting defective or substandard goods. The sale targets can be fixed on the
basis of sale forecasts, which are done by using varying methods of forecasting.

Statistics can also be widely used in studying seasonal behaviour. A business firm engaged in the sale of a certain product has to decide how much stock of that product should be kept. If the product is subject to seasonal fluctuations, then it must know the nature of the seasonal fluctuations in demand. For this purpose, a seasonal index of consumption may be required. If the firm can obtain
such data or construct a seasonal index on its own, then it can keep a limited stock of the product
in lean months and large stocks in the remaining months. In this way, it will avoid the blocking of
funds in maintaining large stocks in the lean months. It will also not miss any opportunity to sell
the product in the busy season by maintaining adequate stock of the product during such a period.

Statistics can be very useful in export marketing. Developing countries have started giving
considerable importance to their exports. Here, too, quality is an important factor on which
exports depend. This apart, the concerned firm must know the probable countries where its
product can be exported. Before that, it must select the right product, which has considerable
demand in the overseas markets. This is possible by carefully analysing the Statistics of imports
and exports. It may also be necessary to undertake a detailed survey of overseas markets to know
more precisely the export potential of a given product.

1.4.6 LIMITATIONS OF STATISTICS:

Although statistics is very widely used in all spheres of human activity, it has some limitations, which restrict its scope and utility.

1. There are certain phenomena or concepts where Statistics cannot be used. This is because
these phenomena or concepts are not amenable to measurement. For example, beauty,
intelligence, courage cannot be quantified. Statistics has no place in all such cases where
quantification is not possible.


2. Statistics reveal the average behaviour, the normal or the general trend. The 'average' concept, if applied to an individual or a particular situation, may lead to a wrong conclusion and may sometimes be disastrous. For example, one may be misguided when told that the average depth of a river from one bank to the other is four feet, when there may be some points in between where the depth is far more than four feet. On this understanding, one may enter at those points of greater depth, which may be hazardous.

3. Since statistics are collected for a particular purpose, such data may not be relevant or useful
in other situations or cases. For example, secondary data (i.e., data originally collected by
someone else) may not be useful for the other person.

4. Statistics is not 100 per cent precise as is Mathematics or Accountancy. Those who use
Statistics should be aware of this limitation.

5. In Statistical surveys, sampling is generally used as it is not physically possible to cover all the
units or elements comprising the universe. The results may not be appropriate as far as the
universe is concerned. Moreover, different surveys based on the same size of sample show
different results.

6. At times, the association or relationship between two or more variables is studied in Statistics, but such a relationship does not indicate a 'cause and effect' relationship. It simply shows the similarity or dissimilarity in the movement of the two variables.

7. A major limitation of Statistics is that it does not reveal everything pertaining to a certain phenomenon. There is some background information that Statistics does not cover.


1.4.7 MISUSES OF STATISTICS:

Apart from the limitations of Statistics mentioned above, there are also misuses of it. Many people, knowingly or unknowingly, use statistical data in a wrong manner. The misuse of Statistics may take several forms, some of which are explained below.

1. Sources of data not given: At times, the source of data is not given. In the absence of the
source, the reader does not know how far the data are reliable. Further, if he wants to refer to
the original source, he is unable to do so.

2. Defective data: Another misuse is that sometimes one gives inaccurate data. This may be done knowingly in order to defend one's position or to prove a particular point.

3. Unrepresentative sample: In Statistics, one often has to conduct a survey, which necessitates choosing a sample from the given population or universe. The sample may turn out to be unrepresentative of the universe.

4. Inadequate sample: A sample that is unrepresentative of the universe is a major misuse of Statistics. Apart from this, at times one may conduct a survey based on an extremely inadequate sample.

5. Unfair comparisons: An important misuse of Statistics is making unfair comparisons from the data collected.

6. Unwarranted conclusions: This may be as a result of making false assumptions. For example,
while making projections of population for the next five years, one may assume a lower rate
of growth though the past two years indicate otherwise.

7. Confusion of correlation and causation: In Statistics, one often has to examine the relationship between two variables. A close relationship between the two variables may not establish a cause-and-effect relationship in the sense that one variable is the cause and the other is the effect.


8. Suppression of unfavorable results: Another wrong use of Statistics may be on account of suppressing results that are unfavorable to the organisation or to an individual.

9. Mistakes in arithmetic: Finally, one may come across certain mistakes in calculations or in
the application of a wrong formula. This human error may result in grossly wrong figures,
leading to wrong conclusions.

The foregoing discussion on misuses of Statistics clearly indicates the pitfalls in which one is
likely to be trapped if one does not exercise sufficient care in the collection, analysis and
interpretation of data.

1.5 SUMMARY:

Decision making is not practiced only in business; it is a requirement for all humans at all times. Decisions are taken mainly on the basis of our knowledge and experience, but when a problem becomes complicated, with a large amount of input data, the analysis becomes complex and an effective, systematic approach is needed. This has created the necessity of scientific methods of decision making for business. Quantitative analysis uses a scientific approach to decision making. This approach consists of defining a problem, developing a model, acquiring input data, developing a solution, analyzing the results, and implementing the results. In using the quantitative approach, however, there can be potential problems, including conflicting viewpoints, the impact of quantitative analysis models on other departments, beginning assumptions, outdated solutions, fitting textbook models, understanding the models, acquiring input data, hard-to-understand mathematics, obtaining only one answer, testing the solutions, and analysing the results. Decisions have to be based upon data which show relationships, indicate trends, and show rates of change in various relevant variables.

The field of statistics provides methods for collecting, presenting, analyzing and meaningfully interpreting data. The entire statistical study can be explained in the following steps: formulation of the problem, determining the objectives of the study, determining sources of data, designing data collection forms, conducting the field survey, organising the data, analysing the data, reaching statistical findings, and presentation of findings. Statistics is very important, and it is rather impossible to think of any sphere of human activity where statistics does not creep in. Statistics has assumed unprecedented dimensions these days, and statistical thinking is becoming more and more indispensable for able citizenship. In fact, to a very striking degree, modern culture has become a statistical culture, and the subject of statistics has made tremendous progress in the recent past, so much so that an elementary knowledge of statistical methods has become a part of the general education curricula of many universities all over the world. Although statistics is very widely used in all spheres of human activity, it is not without limitations, which restrict its scope and utility. Statistics does not study qualitative phenomena, statistical methods do not give any recognition to an object, a person or an event in isolation, statistical laws are not exact, and statistics is liable to be misused.

1.6 GLOSSARY:

Data: Any group of measurements that happen to interest us. These measurements provide information that the decision maker uses.

Descriptive Statistics: A collection of methods that enable us to organise, display and describe data using such devices as tables, graphs and summary measures.

Inferential Statistics: A collection of methods that enable us to make decisions about a population based on sample results.

Model: A representation of reality or of a real-life situation.

Quantitative Analysis or Management Science: A scientific approach using quantitative techniques as a tool in decision making.

Statistics: The use of data to help the decision-maker reach better decisions.

1.7 REFERENCES:

1) Barry Render, Ralph M. Stair, Jr., and Michael E. Hanna, Quantitative Analysis for Management, Pearson Education, Delhi, 2003.
2) S. P. Gupta, Statistical Methods, Sultan Chand & Sons, New Delhi, 2000.
3) S. C. Gupta, Fundamentals of Statistics, Himalaya Publishing House, Mumbai, 2004.


1.8 REVIEW QUESTIONS:

1. "All statistics are numerical statements but all numerical statements are not statistics." Examine.
2. Give some examples of various types of models. What is a mathematical model? Develop two examples of mathematical models.
3. Discuss important applications of Statistics within modern business and industry.
4. "Statisticians at times misuse Statistics." Elucidate this statement.
5. Explain how quantitative analysis helps in making better decisions.
5. Explain how quantitative analysis helps in making better decisions.

by

Ms. P. Umamaheswari Devi

Research Scholar

School of Management Studies

University of Hyderabad.


2. MEASURES OF CENTRAL TENDENCY AND DISPERSION

2.0 LEARNING OBJECTIVES:

After reading this lesson, you should be able to


• Calculate various types of averages: simple arithmetic mean, weighted mean, geometric
mean, harmonic mean, median and mode.
• Know the main properties of each measure of central tendency and select the most
appropriate one for use with a given set of data.
• Calculate different measures of dispersion, the range, quartile deviation, mean deviation,
standard deviation and the coefficient of variation.
• Select the most appropriate measure of dispersion for a given set of data.

2.1 INTRODUCTION:

One of the important objectives of statistical analysis is to determine various numerical measures which describe the inherent characteristics of a frequency distribution. The first of such measures is the average. Averages are measures which condense a huge, unwieldy set of numerical data into single numerical values that are representative of the entire distribution. They provide the gist and give a bird's-eye view of the huge mass of unwieldy numerical data. Averages are the typical values around which other items of the distribution congregate. They are values which lie between the two extreme observations (i.e., the smallest and the largest observations) of the distribution and give us an idea about the concentration of the values in the central part of the distribution. Accordingly, they are also sometimes referred to as measures of central tendency.

2.2 PROPERTIES OF CENTRAL VALUE OR AVERAGE

The properties of a central value or average are as follows.


1. It is useful to extract and summarize the characteristics of the entire data set in precise form.
2. Since an average represents the entire data set, it facilitates comparison between two or more
data sets. Such comparison can be made either at a point of time or over a period of time.
3. It offers a base for computing various other measures such as dispersion, skewness, kurtosis that
help in many other phases of statistical analysis.


4. It should be rigidly defined. There must be uniformity in its interpretation by different decision-
makers or investigators.
5. It should be based on all the observations. Entire data set should be taken into consideration. It
should be easy to understand and calculate.
6. It should have sampling stability.
7. It should be capable of further algebraic treatment.
8. It should not be unduly affected by extreme values.

2.3 VARIOUS MEASURES OF CENTRAL TENDENCY:

The following are the various measures of central tendency or measures of location, which are
commonly used in practice.
1. Mathematical Averages
a) Arithmetic Mean
i) Simple Mean
ii) Weighted mean
b) Geometric Mean
c) Harmonic Mean
2. Averages of position:
a) Median
b) Mode
c) Quartiles
d) Deciles
e) Percentiles

The measures computed for a sample are called statistics and are denoted by lower-case letters, e.g. n for the number of observations and x̄ for the sample mean. Measures computed for the entire population are called parameters and are denoted by Greek or capital letters, e.g. N for the size of the population and µ for the population mean.

2.3. 1 ARITHMETIC MEAN (AM):


Arithmetic Mean is the first average and is most widely used in statistics.
There are two methods to calculate Arithmetic Mean.


(1) Direct method
(2) Indirect or short-cut method
The above two methods can be applied to three types of data series:
a. Ungrouped data
b. Grouped data: discrete series
c. Grouped data: continuous series
2.3.1.1 DIRECT METHOD: UNGROUPED DATA
The arithmetic mean is

X̄ = ∑X / n = (sum of the values of all observations) / (number of observations)

In this method the A.M. is calculated by adding the values of all observations and dividing the total by the number of observations:

X̄ = (X₁ + X₂ + X₃ + ... + Xₙ) / n = ∑X / n

Example:
The arithmetic mean of the numbers 8, 3, 5, 12, 10 is

X̄ = (8 + 3 + 5 + 12 + 10) / 5 = 38 / 5 = 7.6
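A quick Python check of this calculation (an illustrative sketch, not part of the original text):

    values = [8, 3, 5, 12, 10]

    # Arithmetic mean: sum of all observations divided by their number
    mean = sum(values) / len(values)
    print(mean)   # 7.6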
2.3.1.2 DIRECT METHOD: GROUPED DATA- DISCRETE SERIES

If data are grouped as a frequency distribution, then the mean is

X̄ = ∑fX / ∑f

where
f = frequency
X = value of the variable
∑f = sum of the frequencies

Example: Calculate the mean for the following data.

X: 5  8  6  2
f: 3  2  4  1

∑f = 3 + 2 + 4 + 1 = 10

X̄ = ∑fX / ∑f = [(3)(5) + (2)(8) + (4)(6) + (1)(2)] / 10 = (15 + 16 + 24 + 2) / 10 = 5.7
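An illustrative Python sketch of the same frequency-weighted calculation (not part of the original text):

    x = [5, 8, 6, 2]   # values of the variable
    f = [3, 2, 4, 1]   # corresponding frequencies

    # Mean of a discrete frequency distribution: sum(f*x) / sum(f)
    mean = sum(fi * xi for fi, xi in zip(f, x)) / sum(f)
    print(mean)   # 5.7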

2.3.1.3 DIRECT METHOD: GROUPED DATA- CONTINUOUS SERIES

The following assumptions are made to calculate Arithmetic Mean from grouped data.
i) The class intervals must be closed
ii) The width of each class interval should be equal
iii) The values of the observation in each class interval must be uniformly distributed
between its lower and upper limits.
iv) The mid-value of each class interval must represent the average of all values in that
class interval.

X̄ = ∑fm / ∑f

where
f = frequency of the class
m = mid-value of the class
∑f = sum of the frequencies
Example: A company is planning to improve plant safety. For this, accident data for the last 50
weeks was compiled. These data are grouped into frequency distribution as shown below.
Calculate A.M. of the number of accidents per week.


No. of accidents 0-4 5-9 10-14 15-19 20-24


No. of weeks 5 22 13 8 2

Solution:

No. of accidents | No. of weeks (f) | Mid value (m) | fm
0-4              | 5                | 2             | 10
5-9              | 22               | 7             | 154
10-14            | 13               | 12            | 156
15-19            | 8                | 17            | 136
20-24            | 2                | 22            | 44
Total            | ∑f = 50          |               | ∑fm = 500

X̄ = ∑fm / ∑f = 500 / 50 = 10
AM= 10 accidents per week.
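The same mid-value calculation can be sketched in Python as follows (illustrative only; the class limits and frequencies are taken from the example above):

    # Accident classes (per week) and their frequencies (number of weeks)
    classes = [(0, 4), (5, 9), (10, 14), (15, 19), (20, 24)]
    weeks = [5, 22, 13, 8, 2]

    # Each class is represented by its mid-value m = (lower + upper) / 2
    mids = [(lo + hi) / 2 for lo, hi in classes]
    mean = sum(f * m for f, m in zip(weeks, mids)) / sum(weeks)
    print(mean)   # 10.0 accidents per week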
2.3.1.4 SHORT-CUT METHOD: UNGROUPED DATA

In this method an arbitrary assumed mean is used as a basis for calculating deviations from individual
values in the data set.
Let A be the assumed mean and let

d = X - A, i.e. X = A + d

Then

X̄ = ∑X / n = ∑(A + d) / n = A + ∑d / n
Example:
Find the arithmetic mean of the numbers 8, 3, 5, 12, 10.
Let the assumed mean be A = 8.

X     | d = X - A
8     | 0
3     | -5
5     | -3
12    | 4
10    | 2
Total | ∑d = -2

n = 5

X̄ = A + ∑d / n = 8 + (-2) / 5 = 7.6 (same as earlier)
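A short Python sketch of the assumed-mean (short-cut) calculation for the same numbers (illustrative only):

    values = [8, 3, 5, 12, 10]
    A = 8                                   # assumed mean

    # Deviations d = X - A; mean = A + (sum of d) / n
    d = [x - A for x in values]
    mean = A + sum(d) / len(values)
    print(mean)   # 7.6, same as the direct method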

2.3.1.5 SHORT-CUT METHOD: GROUPED DATA- DISCRETE SERIES

In this case also an arbitrary mean A is assumed.

X̄ = A + ∑fd / ∑f


Example:
The daily earnings (in rupees) of employees working on a daily basis in a firm are:

Daily earnings 100 120 140 160 180 200 220

No.of employees 3 6 10 14 24 42 75

Calculate average daily earnings.

Solution:

Take the assumed mean A = 160.

Daily earnings (x) | No. of employees (f) | d = x - A | fd
100                | 3                    | -60       | -180
120                | 6                    | -40       | -240
140                | 10                   | -20       | -200
160                | 14                   | 0         | 0
180                | 24                   | 20        | 480
200                | 42                   | 40        | 1680
220                | 75                   | 60        | 4500
Total              | ∑f = 175             |           | ∑fd = 6040

X̄ = A + ∑fd / ∑f = 160 + 6040 / 175 = Rs. 194.51
Average daily earnings = Rs. 194.51
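An illustrative Python sketch of the short-cut calculation for this discrete frequency distribution (not part of the original text):

    earnings = [100, 120, 140, 160, 180, 200, 220]   # daily earnings (Rs.)
    employees = [3, 6, 10, 14, 24, 42, 75]           # frequencies
    A = 160                                          # assumed mean

    # Mean = A + sum(f*d) / sum(f), where d = x - A
    fd = sum(f * (x - A) for f, x in zip(employees, earnings))
    mean = A + fd / sum(employees)
    print(round(mean, 2))   # 194.51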


2.3.1.6 SHORT-CUT METHOD: GROUPED DATA – CONTINUOUS SERIES

The difference from the earlier problem is the use of the class width.


x̄ = A + (∑fd / ∑f) × i

where
A = assumed value of the A.M.
i = width of the class
d = (m - A) / i, the step deviation from the assumed mean
m = mid-value of the class
Example: A company is planning to improve plant safety. For this, accident data for the last 50
weeks was compiled. These data are grouped into frequency distribution as shown below.
Calculate A.M. of the number of accidents per week.

No.of accidents 0-4 5-9 10-14 15-19 20-24


No. of weeks 5 22 13 8 2

Solution:
Let A = 12.

No. of accidents | Mid value (m) | d = (m - 12)/5 | No. of weeks (f) | fd
0-4              | 2             | -2             | 5                | -10
5-9              | 7             | -1             | 22               | -22
10-14            | 12            | 0              | 13               | 0
15-19            | 17            | 1              | 8                | 8
20-24            | 22            | 2              | 2                | 4
Total            |               |                | ∑f = 50          | ∑fd = -20

x̄ = A + (∑fd / ∑f) × i = 12 + (-20 / 50) × 5 = 10 accidents per week (same as earlier)

2.3.2 WEIGHTED ARITHMETIC MEAN:

One of the limitations of the AM discussed above is that it gives equal importance to all items. But there are cases where the relative importance of the different items is not the same. When this is so, the weighted AM is calculated. The weighted mean enables us to calculate an average that takes into account the importance of each value to the overall total.

X̄w = ∑(w × X) / ∑w

where
X̄w = weighted mean
w = weight assigned to each observation
∑(w × X) = sum of each weight multiplied by its corresponding value
∑w = sum of all the weights

Example: A contractor employs three types of workers: male, female and children. He pays a male worker Rs.100 per day, a female worker Rs.80 per day and a child worker Rs.50 per day. What is the average wage per day if there are 20 male workers, 15 female workers and 5 children?
Solution:

X (wage) | w (workers) | wX
100      | 20          | 2000
80       | 15          | 1200
50       | 5           | 250
Total    | ∑w = 40     | ∑wX = 3450

Weighted AM: X̄w = ∑(w × X) / ∑w = 3450 / 40 = Rs. 86.25
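The weighted-mean calculation can be sketched in Python as follows (illustrative only, using the figures from the example):

    wages = [100, 80, 50]    # daily wage for male, female and child workers (Rs.)
    workers = [20, 15, 5]    # number of workers of each type (the weights)

    # Weighted mean: sum(w*x) / sum(w)
    weighted_mean = sum(w * x for w, x in zip(workers, wages)) / sum(workers)
    print(weighted_mean)     # 86.25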

♦ Advantages of the Arithmetic Mean

• The concept is familiar to most people and intuitively clear.
• The mean is useful for performing statistical procedures such as comparing means.
• It is rigidly defined by an algebraic formula.
• It is capable of further algebraic treatment.

♦ Disadvantages of the Arithmetic Mean

• The arithmetic mean may be affected by extreme values that are not representative of the rest of the data.
• The mean cannot be computed for a data set that has open-ended classes at either the high or the low end of the scale.

2.3.3 THE GEOMETRIC MEAN:


The geometric mean G of a set of n numbers x₁, x₂, x₃, ..., xₙ is the n-th root of the product of the numbers:

G.M. = (x₁ x₂ x₃ ... xₙ)^(1/n)

The geometric mean of the numbers 2, 4, 8 is (2 × 4 × 8)^(1/3) = ∛64 = 4.

Calculation of the geometric mean for a discrete series:

GM = antilog( ∑f log x / ∑f )

Calculation of the geometric mean for a continuous series:

GM = antilog( ∑f log m / ∑f )

where
f = frequency
x = value of the variable
m = mid-value of the class
∑f = sum of the frequencies
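An illustrative Python sketch of the geometric mean for the numbers 2, 4, 8, using both the direct product form and the logarithmic form (requires Python 3.8+ for math.prod; not part of the original text):

    import math

    values = [2, 4, 8]

    # Geometric mean: n-th root of the product of the n values
    gm = math.prod(values) ** (1 / len(values))
    print(round(gm, 4))       # approximately 4.0

    # Equivalent logarithmic (antilog) form, as used for frequency distributions
    gm_log = 10 ** (sum(math.log10(x) for x in values) / len(values))
    print(round(gm_log, 4))   # approximately 4.0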

Merits:

• It is rigidly defined.
• It is useful in averaging ratios and percentages and in determining rates of increase or decrease.
• It is capable of algebraic manipulation.
• It gives less weight to large items and more to small ones than does the arithmetic average.

Demerits:

• It is difficult to understand.
• It is difficult to compute and interpret.
• It cannot be computed when there are negative values in a series or when one or more of the values are zero.

2.3.4 THE HARMONIC MEAN:

The harmonic mean is based on the reciprocals of the numbers averaged. It is defined as the reciprocal of the arithmetic mean of the reciprocals of the individual observations. For a set of n positive values x₁, x₂, ..., xₙ, the harmonic mean is

HM = n / ∑(1/X)


Example:

The harmonic mean of the numbers 2, 4, 8 is

HM = 3 / (1/2 + 1/4 + 1/8) = 3 / (7/8) = (3 × 8) / 7 = 3.43
Calculation of the harmonic mean for a discrete series:

HM = ∑f / ∑(f/x)

Calculation of the harmonic mean for a continuous series:

HM = ∑f / ∑(f/m)

where
∑f = total frequency
f = individual class frequency
x = value of the variable
m = mid-point of the class
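A minimal Python sketch of the harmonic mean of 2, 4, 8 (illustrative only, not part of the original text):

    values = [2, 4, 8]

    # Harmonic mean: n divided by the sum of reciprocals
    hm = len(values) / sum(1 / x for x in values)
    print(round(hm, 2))   # 3.43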

Merits:

• Its value is based on every item of the series.
• It lends itself to algebraic manipulation.
• In problems relating to time and rates, it gives better results than other averages.

Demerits:

• It is difficult to understand and compute.
• Its value cannot be computed when there are both positive and negative items in a series.
• It gives the largest weight to the smallest items. This is generally not a desirable feature, and as such this average is not very useful for the analysis of economic data.

Relation between Arithmetic, Geometric and Harmonic Mean


The geometric mean of a set of positive numbers x1, x2, ..., xn is less than or equal to their arithmetic mean but greater than or equal to their harmonic mean.
In symbols,
HM ≤ GM ≤ AM
They will all be equal only if the values x1, x2, x3, ..., xn are identical.
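
A quick numerical check of this inequality, written as a small Python sketch using the values 2, 4 and 8 from the earlier examples (illustrative only):

import math

data = [2, 4, 8]
n = len(data)

am = sum(data) / n                     # arithmetic mean = 4.67
gm = math.prod(data) ** (1 / n)        # geometric mean = nth root of the product = 4
hm = n / sum(1 / x for x in data)      # harmonic mean = 3.43

assert hm <= gm <= am                  # HM <= GM <= AM holds for positive values
print(am, gm, hm)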

2.3.5 THE MEDIAN

The median is a single value that measures the central item in the data. This single item is the middle
most or most central item in the set of numbers. Half of the items lie above this point, and the other
half lie below it.

2.3.5.1 CALCULATING THE MEDIAN FROM UNGROUPED DATA

To find the median of a data set, first array the data in ascending or descending order. If the data set
contains an odd number of items, the middle item of the array is the median. If there is an even
number of items, the median is the average of the two middle items.
Median = (n + 1)/2 th item in a data array.
Example:
Item 1 2 3 4 5 6 7
Time in minutes 4.2 4.3 4.7 4.8 5.0 5.1 9.0


Median = (n + 1)/2 th item = (7 + 1)/2 th item = 4th item = 4.8

Example:
Item 1 2 3 4 5 6 7 8
Time in minutes 86 52 49 43 35 31 30 11

Median = (n + 1)/2 th item = (8 + 1)/2 th item = 4.5th item = average of the 4th and 5th items
       = (43 + 35)/2 = 39
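
The same rule (middle item for an odd number of items, average of the two middle items for an even number) can be sketched in Python; the data are those of the two examples above and the function name is illustrative only:

def median_ungrouped(values):
    # Sort the data, then take the (n + 1)/2-th item; for even n this is
    # the average of the two middle items.
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2

print(median_ungrouped([4.2, 4.3, 4.7, 4.8, 5.0, 5.1, 9.0]))   # 4.8
print(median_ungrouped([86, 52, 49, 43, 35, 31, 30, 11]))      # 39.0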
2.3.5.2 CALCULATING THE MEDIAN FROM GROUPED DATA

To find the median value, first identify the class interval which contains the median value, i.e. the (n/2)th observation of the data set. To identify this class interval, find the cumulative frequency of each class until reaching the class for which the cumulative frequency is equal to or greater than n/2. The value of the median within that class is then found by interpolation:

Median = l + ((n/2 − cf) / f) × i
l → lower class limit (or boundary) of the median class interval
cf → Cumulative frequency of the class prior to the median class interval
f → frequency of the median class
i → width of the median class
n → total no. of observations in the distribution

Example: In a factory employing 3000 persons, 5 percent earn less than Rs.150 per day, 580 earn from Rs.151-200 per day, 30 percent earn from Rs.201-250 per day, 500 earn from Rs.251-300 per day, 20 percent earn from Rs.301-350 per day, and the rest earn Rs.351 or more per day. What is the median wage?


Earnings % of workers No.of persons Cumulative frequency

Less than 150 5 150 150


151-200 - 580 730
201-250 30 900 1630-Median class
251-300 - 500 2130
301-350 20 600 2730
351 & above - 270 3000
3000

Median observation = (n/2)th = 3000/2 = 1500th item.


This observation lies in the class 201-250

Median = l + ((n/2 − cf) / f) × i
       = 201 + ((1500 − 730) / 900) × 50
       = 201 + 42.77 = Rs.243.77
The Median wage is Rs.243.77 per day.
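
The interpolation formula can be coded directly. The Python sketch below is illustrative only (the function name is not standard) and uses the figures of the wage example:

def median_grouped(l, cf, f, i, n):
    # Median = l + ((n/2 - cf) / f) * i for the median class
    return l + (n / 2 - cf) / f * i

# Median class 201-250: l = 201, cf of earlier classes = 730, f = 900, i = 50, n = 3000
print(median_grouped(l=201, cf=730, f=900, i=50, n=3000))   # 243.77..., as in the hand calculation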

2.3.5.3 PROPERTIES OF MEDIAN

Advantages of Median

¾ Extreme values do not affect the median as strongly as they do the mean.
¾ The median is easy to understand and can be calculated from any kind of data- even for grouped
data with open-ended classes-unless the median falls in the open-ended class.
¾ The median can be found even for qualitative data such as colour or sharpness, rather than numbers. Suppose there are five runs of a printing press, the results of which must be rated according to sharpness of image; we can array the results from best to worst and pick the middle one.


Disadvantages of Median

¾ Statistical procedures based on the median are more complex than those based on the mean.


¾ The data must be arranged before the median can be calculated; for a large data set this sorting is time-consuming.

Applications of Median

The median is helpful in understanding the characteristics of data when


i) Observations are qualitative in nature
ii) Extreme values are present in the data set
iii) A quick estimate of average is desired.

2.3.6 MODE
The mode of a set of numbers is that value which occurs with the greatest frequency. It is the most
common value. In some contexts, the mode may not exist, and even if it does exist it may not be
unique.
Example: The set 2,2,5,7,9,9,9,10,10,11,12,18 has mode 9
Example: The set 3, 5,8,10,12,15,16 has no mode
Example: The set 2,3,4,4,4,5,5,7,7,7,9 has two modes, 4 and 7, and is called "bimodal" since it has two modes.
A distribution having only one mode is called unimodal distribution.
Mode for grouped data
In the case of grouped data when a frequency curve has been constructed to fit in data, the mode
will be the value (or values) of x corresponding to the maximum point (or points) on the curve.
From a frequency distribution (or histogram) the mode can be obtained from the following formula:

Mode = L1 + (∆1 / (∆1 + ∆2)) × i

Where L1= Lower boundary class of modal class (i.e. class containing the mode)
∆ 1 = Excess of modal frequency over frequency of preceding class.
∆ 2 = Excess of modal frequency over frequency of succeeding class.


i= Size of modal class


Mode = l + ((f1 − f0) / (2f1 − f0 − f2)) × i
l= Lower limit of the modal class
f1=Frequency of the modal class
f0= frequency of the class preceding the modal class
f2= frequency of the class succeeding the modal class
i= class interval

Example:

X Frequency
Less than 150 150
150-200 580
200-250 900
250-300 500
300-350 600
350 & above 270
Total 3000

Since the highest frequency is in the 200-250 class, it is called the modal class.


l=200
f1 = 900
f0 = 580
f2 = 500
i=50
Mode = l + ((f1 − f0) / (2f1 − f0 − f2)) × i
     = 200 + ((900 − 580) / (2 × 900 − 580 − 500)) × 50
     = 200 + 22.22 = 222.22
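
The same calculation, as an illustrative Python sketch (the function name is an assumption for the illustration):

def mode_grouped(l, f1, f0, f2, i):
    # Mode = l + (f1 - f0) / (2*f1 - f0 - f2) * i for the modal class
    return l + (f1 - f0) / (2 * f1 - f0 - f2) * i

# Modal class 200-250: f1 = 900, preceding f0 = 580, succeeding f2 = 500, i = 50
print(mode_grouped(l=200, f1=900, f0=580, f2=500, i=50))   # 222.22...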


Merits of Mode:

¾ By definition, mode is the most typical value or representative value of a distribution.


¾ It is not unduly affected by extreme values.
¾ Its value can be determined in open-ended distributions without ascertaining the class-limits.
¾ It can be used to describe qualitative phenomenon. For example, if we want to compare the
consumer preferences for different types of products, we should take into account the modal
preferences expressed by different groups of people

Demerits of Mode:

¾ The value of the mode cannot always be determined; sometimes there may be a bi-modal series.
¾ It is not capable of algebraic manipulations.
¾ It is not based on each and every item of the series
¾ It is not a rigidly defined measure. There are several formulas for calculating the mode, which usually give somewhat different answers.

Relation between Mean, Median and Mode


For unimodal frequency curves which are moderately skewed (asymmetrical), the empirical relation is
Mean − Mode = 3(Mean − Median), or
Mode = 3 Median − 2 Mean

2.4 MEASURES OF DISPERSION

The various measures of central tendency like the mean, median and mode give us one single figure that represents the entire data. But an average alone cannot adequately describe a set of observations unless all the observations are the same. It is hence necessary to describe the variability or dispersion of the observations. In two or more distributions the central value may be the same, yet there can be wide disparities in the formation of the distributions.


Important definitions of dispersion:


• Dispersion is the measure of the variation of the items. -A.L.Bowley
• The degree to which numerical data tend to spread about an average value is called the
variation or dispersion of the data - Spiegel
• Dispersion is the degree of the scatter or variation of the variable about a central value
- Brooks & Dick
• The measurement of the scatteredness of the mass of figures in a series about an average is called measure of variation or dispersion. -Simpson & Kafka

Significance of measuring variation:


Measures of dispersion are needed for four basic purposes-
1) To determine the reliability of an Average
2) To serve as a basis for the control of the variability
3) To compare two or more series with regard to their variability
4) To facilitate the use of other statistical Measures.

Properties of a good Measure of variation:


A good measure of dispersion should possess as far as possible the following properties-
1) It should be simple to understand
2) It should be easy to compute
3) It should be rigidly defined
4) It should be based on each and every item of the distribution.
5) It should be amenable to further algebraic treatment
6) It should have sampling stability
7) It should not be unduly affected by extreme items.

2.5 METHODS OF STUDYING VARIATION:

The following are the important methods of studying variation:


1. The Range
2. Quartile Deviation


3. Mean Deviation
4. Standard Deviation

Of these, the first two, namely the range and the quartile deviation, are positional measures because they depend on the values at a particular position in the distribution. The other two, the mean deviation and the standard deviation, are called calculation measures of deviation because all the values are employed in their calculation.

2.5.1 RANGE:
It is the simplest method of studying dispersion. It is defined as the difference between the value of the largest item and the value of the smallest item included in the distribution.
Range=L-S
Where L=Largest item
S=smallest item
Coefficient of Range = (L − S) / (L + S)
Uses of Range in Business:

(i) Quality control: The object of quality control is to keep a check on the quality of the
product without 100% inspection. The idea basically is that if the range, i.e. the difference between
the largest and smallest mass produced items, increases beyond a certain point, the production
machinery should be examined to find out why the items produced have not followed their usual
consistent pattern.
(ii) Fluctuations in the share prices: range is useful in studying the variations in the prices of
stocks and shares and other commodities that are sensitive to price changes from one period to
another.
(iii) Weather forecasts: The meteorological department makes use of the range in determining, say, the difference between the minimum and the maximum temperature. This information is of great concern to the general public.

2.5.2 QUARTILE DEVIATION (QD): Quartile deviation represents the difference between the third
quartile and the first quartile.
Inter Quartile range= Q3-Q1


Quartile Deviation = (Q3 − Q1) / 2

Quartile deviation is an absolute measure of dispersion. The relative measure corresponding to it is called the coefficient of quartile deviation.

Coefficient of Quartile Deviation = (Q3 − Q1) / (Q3 + Q1)
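
For raw data the quartiles themselves must be found first. A small Python sketch (the data values are arbitrary illustrations, and the 'inclusive' quantile method is an assumption; textbooks use slightly different quartile conventions):

import statistics

data = [62.2, 61.8, 63.4, 63.0, 61.7]                     # illustrative observations
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")

inter_quartile_range = q3 - q1
quartile_deviation = (q3 - q1) / 2
coefficient_of_qd = (q3 - q1) / (q3 + q1)
print(inter_quartile_range, quartile_deviation, coefficient_of_qd)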

2.5.3 MEAN DEVIATION (MD): The mean deviation is also known as the average deviation. It is the average of the absolute differences between the items in a distribution and the median or mean of that series.
For individual observations,

M.D. = (1/N) ∑|X − A| = ∑|D| / N,  where D = X − A

Where A=Any Average


D=Deviation from average
For discrete series,
M.D. = ∑f|D| / N

where |D| denotes deviations from the average, ignoring signs.
For continuous series,

M.D. = ∑f|D| / N

where |D| denotes deviations of the mid values from the average, ignoring signs.
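
A short Python sketch of the mean deviation for individual observations (illustrative only; the sales figures are the ones used in the standard deviation example below):

def mean_deviation(values, about=None):
    # Average of the absolute deviations |X - A| about a chosen average A
    # (the mean by default; a median could be passed in instead).
    if about is None:
        about = sum(values) / len(values)
    return sum(abs(x - about) for x in values) / len(values)

print(mean_deviation([35, 30, 45, 20, 25]))   # 7.2, deviations taken about the mean 31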
Uses of M.D in Business

It is especially effective in reports presented to the general public or to groups not familiar with
statistical methods. This measure is useful for small sample with no elaborate analysis required.
Incidentally it may be mentioned that the National Bureau of Economic Research has found, in its
work on forecasting business cycle, that the average deviation is the most practical measure of
dispersion to use for this purpose.


2.5.4 STANDARD DEVIATION (SD)


Standard deviation was introduced by Karl Pearson in 1893. It is by far the most important and widely used measure of studying dispersion. It is the square root of the mean of the squared deviations from the arithmetic mean.
Calculation of Standard Deviation:
Individual Series:

σ = √( ∑x² / N ),  where x = (X − X̄)

X = individual observation
X̄ = mean

 
From the assumed mean,  σ = √( ∑d²/n − (∑d/n)² )

where d = deviations of the items from an assumed mean.


Discrete series:

σ = √( ∑fx² / ∑f ),  where x = (X − X̄)
From Assumed Mean,

 
σ = √( ∑fd²/∑f − (∑fd/∑f)² )

where d = (X − A)
A = assumed mean
Continuous series:
Direct Method:


σ = √( ∑fx² / ∑f ),  where x = (m − x̄)

From Assumed Mean

 
σ = √( ∑fd²/∑f − (∑fd/∑f)² ) × i

where d = (m − A) / i
i = class interval
Example: Find out the standard Deviation for the following data of sales per day.
35, 30,45,20,25
Direct Method:
σ = √( ∑(X − X̄)² / n )

Mean = (35 + 30 + 45 + 20 + 25)/5 = 31 = X̄

∑(X − X̄)² = (35−31)² + (30−31)² + (45−31)² + (20−31)² + (25−31)²
          = 16 + 1 + 196 + 121 + 36 = 370

σ = √(370/5) = √74 = 8.6
Assumed Mean method:
Let Assumed mean=35

 
σ = √( ∑d²/n − (∑d/n)² )

∑d  = (35−35) + (30−35) + (45−35) + (20−35) + (25−35) = 0 − 5 + 10 − 15 − 10 = −20
∑d² = (35−35)² + (30−35)² + (45−35)² + (20−35)² + (25−35)² = 0 + 25 + 100 + 225 + 100 = 450


 450  − 20  
2

σ =  −  
 5  5  

SD= σ = 90 − 16

Standard Deviation = 74 = 8.6 (same as earlier)
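
Both methods can be checked with a few lines of Python (illustrative only):

import math

sales = [35, 30, 45, 20, 25]
n = len(sales)
mean = sum(sales) / n                                            # 31

# Direct method: sqrt( sum((X - mean)^2) / n )
sd_direct = math.sqrt(sum((x - mean) ** 2 for x in sales) / n)

# Assumed-mean method: sqrt( sum(d^2)/n - (sum(d)/n)^2 ) with d = X - A
A = 35
d = [x - A for x in sales]
sd_assumed = math.sqrt(sum(v ** 2 for v in d) / n - (sum(d) / n) ** 2)

print(round(sd_direct, 2), round(sd_assumed, 2))                 # 8.6 8.6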

Example: Suppose that a prospective buyer tests the bursting pressure of samples of rubber tubes
received from the manufacturer. Find out the standard deviation of the bursting pressure of the
tubes.

Bursting Pressure in lbs : 5-10 10-15 15-20 20-25 25-30


Number of tubes : 2 9 29 54 6

Solution:
Direct Method:

Bursting Pressure   Number of Tubes (f)   m      fm       m − x̄    (m − x̄)²   f(m − x̄)²
5-10                 2                    7.5    15       -12.65   160        320
10-15                9                    12.5   112.5    -7.65    58.5       527
15-20                29                   17.5   507.5    -2.65    7          204
20-25                54                   22.5   1215     2.35     5.5        298
25-30                6                    27.5   165      7.35     54         324
Total                100                         2015                         1673

σ = √( ∑f(m − x̄)² / ∑f )

x̄ = ∑fm / ∑f = 2015/100 = 20.15

σ = √(1673/100) = 4.09


Assumed mean method:

Let assumed mean =22.5

Bursting Pressure   Number of Tubes (f)   m      d = (m − 22.5)/5   fd     fd²
5-10                 2                    7.5    -3                 -6     18
10-15                9                    12.5   -2                 -18    36
15-20                29                   17.5   -1                 -29    29
20-25                54                   22.5   0                  0      0
25-30                6                    27.5   1                  6      6
Total                100                                            ∑fd = -47   ∑fd² = 89

 
σ = √( ∑fd²/∑f − (∑fd/∑f)² ) × i

where d = (m − A) / i and i = class interval

σ = √( 89/100 − (−47/100)² ) × 5

σ = 4.09 (same as earlier)
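
The step-deviation calculation for the grouped data can likewise be verified with a short Python sketch (illustrative only):

import math

mids = [7.5, 12.5, 17.5, 22.5, 27.5]                 # mid values of the classes
freqs = [2, 9, 29, 54, 6]                             # frequencies
A, i = 22.5, 5                                        # assumed mean and class interval

d = [(m - A) / i for m in mids]                       # step deviations
N = sum(freqs)
sum_fd = sum(f * v for f, v in zip(freqs, d))         # -47
sum_fd2 = sum(f * v ** 2 for f, v in zip(freqs, d))   # 89

sigma = math.sqrt(sum_fd2 / N - (sum_fd / N) ** 2) * i
print(round(sigma, 2))                                # 4.09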


2.5.5 MATHEMATICAL PROPERTIES OF S.D:

1. Combined S.D:

σ12 = √( (n1σ1² + n2σ2² + n1d1² + n2d2²) / (n1 + n2) )

where
σ12 = combined S.D
σ1 = S.D of the 1st group
σ2 = S.D of the 2nd group
d1 = x̄1 − x̄12
d2 = x̄2 − x̄12
x̄12 = combined mean of the two groups

2. Standard deviation of the first 'n' natural numbers:

σ = √( (N² − 1) / 12 )

2.5.6 VARIANCE AND COEFFICIENT OF VARIATION: The square of the standard deviation is called the variance. The standard deviation is an absolute measure of dispersion; the corresponding relative measure is known as the coefficient of variation. It is used in problems where we want to compare the variability of two or more series. The series for which the coefficient of variation is greater is said to be more variable or, conversely, less consistent, less uniform, less stable or less homogeneous.

Variance = σ²

Coefficient of variation = (σ / x̄) × 100
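
Because the coefficient of variation is unit-free, it is the usual yardstick for comparing the consistency of two series. A small illustrative Python sketch (the figures are arbitrary examples, not taken from the text):

import math

def coefficient_of_variation(values):
    # CV = (standard deviation / mean) * 100
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return sd / mean * 100

series_a = [105, 120, 115, 118, 130]
series_b = [108, 117, 120, 130, 100]
print(coefficient_of_variation(series_a))   # the series with the smaller CV
print(coefficient_of_variation(series_b))   # is the more consistent one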

2.6 SUMMARY:

Measure of Central Tendency and variability play a vital role in characterizing data. Several types of
averages are used, the most common being the arithmetic mean or the Mean, the Median, the Mode,
the geometric mean and the harmonic mean. A measure of central tendency summarizes the distribution of a variable into a single figure, which can be regarded as its representative. This measure alone, however, is not sufficient to describe a distribution because there may be situations where two or more distributions have the same central value. Conversely, it is possible that the pattern of distribution in two or more situations is the same but the values of their central tendency are different. Hence, measures of dispersion are used to represent the characteristics of a distribution.
The concept of dispersion is related to the extent of scatter or variability in observations. Some


important measures of dispersion are Range, Quartile Deviation, Mean Deviation and Standard
Deviation.

2.7 GLOSSARY:

Arithmetic mean: A measure of central tendency calculated by dividing the sum of all
Observations by the number of observations in the data set.

Coefficient of      A measure of relative variability that expresses the standard deviation as a
variation:          percentage of the mean.

Geometric Mean: A measure of central tendency used to measure the average rate of change or
growth for some quantity, computed by taking the nth root of the product of
n values representing change.

Measures of Measures that describe the centre of a distribution. The mean, median and
central tendency: mode are three of the measures of central tendency.

Measures of         Measures that give the spread of a distribution.
dispersion:
Median: The value of the middle item in a data set arranged in an ascending or a
descending order. It divides the data set into two equal parts.

Mode: The value that has the maximum frequency in the data set

Parameters:         Numerical values that describe the characteristics of a whole population,
                    commonly represented by Greek letters.

Range:              Difference between the largest and the smallest values in a data set.


Standard The square root of the variance in a series. It shows how the
Deviation: data are spread out.

Statistics:         Numerical measures describing the characteristics of a sample, commonly
                    represented by Roman letters.

Variance: Averaged squared deviation between the mean and each item in a series.

2.8 REFERENCES:

1) G. C. Beri, Statistics for Management, Tata McGraw Hill Publishing Company Ltd., Delhi, 2003.
2) Barry Render, Ralph M. Stair, Jr. and Michael E. Hanna, Quantitative Analysis for Management, Pearson Education, Delhi, 2000.
3) S. P. Gupta, Statistical Methods, Sultan Chand & Sons, New Delhi, 1995.
4) S. C. Gupta, Fundamentals of Statistics, Himalaya Publishing House, Mumbai.
5) Richard I. Levin and David S. Rubin, Statistics for Management, Prentice Hall of India Private Limited, New Delhi, 1999.

2.9 REVIEW EXERCISE:

1. "Every average has its own peculiar characteristics. It is difficult to say which average is the best." Explain with examples.

2. What do you mean by Dispersion? What are the different measures of dispersion?

3. Bennett Distribution Company, a subsidiary of a major appliance manufacturer, is forecasting


regional sales for next year. The Atlantic branch, with current yearly sales of Rs.193.8 million, is
expected to achieve a sales growth of 7.25 percent; the Midwest branch, with current sales of
Rs.79.3 million is expected to grow by 8.20 percent; and the Pacific branch, with sales of Rs.57.5
million, is expected to increase sales by 7.15 percent. What is the average rate of sales growth
forecasted for next year?


4. The number of solar heating systems available to the public is quite large, and their heat-storage capacities are quite varied. Here is a distribution of heat-storage capacity (in days) of 28 systems that were tested recently by University Laboratories, Inc.

Days Frequency
0-0.99 2
1-1.99 4
2-2.99 6
3-3.99 7
4-4.99 5
5-5.99 3
6-6.99 1

University Laboratories, Inc. knows that its report on the tests will be widely circulated and used as the basis for tax legislation on solar-heat allowances. It therefore wants the measures it uses to be as reflective of the data as possible.

(a)Compute the mean for these data.


(b) Compute the mode for these data.
(c) Compute the median for these data.
(d)Select the answer among parts (a) (b), and (c) that best reflects the central tendency of the test data
and justify your choice.

5. In a small company, two typists are employed. Typist A types one page in ten minutes while typist
B takes twenty minutes for the same.
(a)Both are asked to type 10 pages. What is the average time taken for typing one page?
(b)Both are asked to type for one hour. What is the average time taken by them for typing one page?

6. The following data gives the saving bank accounts balances of nine sample households selected in
a survey. The figures are in rupees.
745 2,000 1,500 68,000 461 549 3,750 1,800 4,795
Find the mean and the median for these data.


7. The prices of a Tea Company shares in Mumbai and Kolkata markets during the last ten months
are recorded below:
Month(2000) Mumbai Kolkata
January 105 108
February 120 117
March 115 120
April 118 130
May 130 100
June 127 125
July 109 125
August 110 120
September 104 110
October 112 135

Determine the Arithmetic Mean and Standard Deviation of the price of shares. In which market are
the share prices more stable?

8. The Casual Life Insurance Company is considering purchasing a new fleet of company cars. The financial department's director, Tom Dawkins, sampled 40 employees to determine the number of miles each drove over a 1-year period. The results of the study follow. Calculate the range and inter-quartile range.

3600 4,200 4,700 4,900 5,300 5,700 6,700 7,300


7,700 8,100 8,300 8,400 8,700 8,700 8,900 9,300
9,500 9,500 9,700 10,000 10,300 10,500 10,700 10,800
11,000 11,300 11,300 11,800 12,100 12,700 12,900 13,100
13,500 13,800 14,600 14,900 16,300 17,200 18,500 20,300

9. Southeastern Stereos, a wholesaler, was contemplating becoming the supplier to three retailers, but
inventory shortages have forced Southeastern to select only one. Southeastern credit manager is
evaluating the credit record of these three retailers. Over the past 5 years, these retailers, accounts
receivable have been outstanding for the following average number of days. The credit manager


feels that consistency, in addition to lowest average, is important. Based on relative dispersion,
which retailer would make the best customer?

Lee 62.2 61.8 63.4 63.0 61.7


Forrest 62.5 61.9 62.8 63.0 60.7
Davis 62.0 61.9 63.0 63.9 61.5

10. Realistic Stereo Shops marks up its merchandise 35 percent above the cost of its latest additions to stock. Until 4 months ago, the Dynamic 400-S VHS recorder had been Rs.300. During the last 4 months Realistic has received 4 monthly shipments of this recorder at these unit costs: Rs.275, Rs.250, Rs.240 and Rs.225. At what average rate per month has Realistic's retail price for this unit been decreasing during these 4 months?

by

Dr.B.Raja Shekhar
Reader
School of Management Studies
University of Hyderabad.


3. CONCEPTS OF PROBABILITY AND PROBABILITY DISTRIBUTIONS

3.0 LEARNING OBJECTIVES:

After reading this lesson, you should be able to


• Understand some basic terminology in probability and define a probability in a given
situation using suitable empirical methods.
• Recognise problems that can be modeled by the binomial, Poisson and normal distributions.
• Solve such problems with the use of appropriate tables.

3.1 INTRODUCTION:

We live in a world in which we are unable to forecast the future with complete certainty. Our need to cope with uncertainty leads us to the study and use of probability theory. By organizing the information and considering it systematically, we will be able to recognize our assumptions, communicate our reasoning to others, and make a sounder decision than we could by using a shot-in-the-dark approach. Probability is a part of our everyday lives. In personal and managerial decisions, we face uncertainty and use probability theory whether or not we admit the use of something so sophisticated. When we hear a weather forecast of a 70 percent chance of rain, we change our plans from a picnic to a pool game. Managers who deal with inventories of highly styled women's clothing must wonder about the chances that sales will reach or exceed a certain level. Probability deals with many uncertainties in business.

3.2 BASIC PROBABILITY CONCEPTS

3.2.1 OUTCOME:
An outcome is the result of an experiment or other situation involving uncertainty. The set of all
possible outcomes of a probability experiment is called a sample space.
3.2.2 SAMPLE SPACE:

The sample space is an exhaustive list of all the possible outcomes of an experiment. Each possible
result of such a study is represented by one and only one point in the sample space.


Example1:
Experiment Rolling a die once: Sample space = {1, 2, 3, 4, 5, 6}
Experiment Tossing a coin: Sample space = {Heads, Tails}

3.2.3 EVENT:
In probability theory, an event is one or more of the possible outcomes of doing something. If we toss
a coin, getting a tail would be an event, and getting a head would be another event. Similarly, if we
are drawing from a deck of cards, selecting the ace of spades would be an event. An example of an
event closer to your life is picking a student from a class of 100 to answer a question.

3.2.4 RELATIVE FREQUENCY:


Relative frequency is another term for proportion; it is the value calculated by dividing the number of
times an event occurs by the total number of times an experiment is carried out. The probability of an
event can be thought of as its long-run relative frequency when the experiment is carried out many
times.

Example 2:
Experiment: Tossing a fair coin 50 times (n = 50). Event E = 'heads'. Result: 30 heads, 20 tails, so r = 30. Relative frequency = r/n = 30/50 = 3/5 = 0.6
If an experiment is repeated many, many times without changing the experimental conditions, the
relative frequency of any particular event will settle down to some value. For example, in the above
experiment, the relative frequency of the event 'heads' will settle down to a value of approximately 0.5
if the experiment is repeated many more times.

3.2.5 PROBABILITY:

The probability of an event has been defined as its long-run relative frequency. It has also been
thought of as a personal degree of belief that a particular event will occur (subjective probability). A
probability provides a quantitative description of the likely occurrence of a particular event.
Probability is conventionally expressed on a scale from 0 to 1; a rare event has a probability close to 0,
a very common event has a probability close to 1.


In some experiments, all outcomes are equally likely. For example if you were to choose one winner
in a raffle from a hat, all raffle ticket holders are equally likely to win, that is, they have the same
probability of their ticket being chosen. This is the equally likely outcomes model and is defined to be:

P(E) = (Number of outcomes corresponding to the event) / (Total number of outcomes)

Example3:

¾ The probability of drawing a spade from a pack of 52 well-shuffled playing cards is


13/52 = 1/4 = 0.25 since event E = 'a spade is drawn'; the number of outcomes
corresponding to E = 13 (spades); the total number of outcomes = 52 (cards).
¾ When tossing a coin, we assume that the results 'heads' or 'tails' each have equal
probabilities of 0.5.

3.2.6 INDEPENDENT EVENTS:

Two events are independent if the occurrence of one of the events gives us no information about
whether or not the other event will occur; that is, the events have no influence on each other.

In probability theory, we say that two events, A and B, are independent if the probability that they
both occur is equal to the product of the probabilities of the two individual events, i.e.
P( A ∩ B) = P( A).P( B)

The idea of independence can be extended to more than two events. For example, A, B and C are
independent if:

a. A and B are independent; A and C are independent and B and C are independent (pair wise
independence);
b. P( A ∩ B ∩ C ) = P( A).P( B).P(C )

If two events (each with non-zero probability) are independent, then they cannot be mutually exclusive, and vice versa.


Example 4:
Suppose that a man and a woman each have a pack of 52 playing cards. Each draws a card from
his/her pack. Find the probability that they each draw the ace of clubs.
We define the events:
A = probability that man draws ace of clubs = 1/52
B = probability that woman draws ace of clubs = 1/52
Clearly events A and B are independent, so: P(A ∩ B) = P(A).P(B) = (1/52) × (1/52) = 0.00037
That is, there is a very small chance that the man and the woman will both draw the ace of clubs.

3.2.7 MUTUALLY EXCLUSIVE EVENTS:

Two events are mutually exclusive, if it is impossible for them to occur together.

Formally, two events A and B are mutually exclusive if and only if A ∩ B = φ, i.e. P(A ∩ B) = 0.

If two events are mutually exclusive, they cannot be independent.

Example 5:

¾ Experiment: Rolling a die once Sample space S = {1, 2, 3, 4, 5, 6} Events A = 'observe an odd
number' = {1, 3, 5}. B = 'observe an even number' = {2, 4, 6}. A ∩ B = φ = the empty set, so
A and B are mutually exclusive.
¾ A subject in a study cannot be both male and female, nor can they be aged 20 and 30.

3.3 ADDITION RULE:

The addition rule is a result used to determine the probability that event A or event B occurs or both
occur.

The result is often written as follows, using set notation: P ( A ∪ B ) = P ( A) + P ( B ) − P ( A ∩ B )


where:
P(A) = probability that event A occurs
P(B) = probability that event B occurs


P( A ∪ B) = probability that event A or event B occurs


P( A ∩ B) = probability that event A and event B both occur
For mutually exclusive events, that is events, which cannot occur together: P ( A ∩ B ) = 0

The addition rule therefore reduces to P ( A ∪ B ) = P ( A) + P ( B ) .


For independent events, that is events, which have no influence on each other:
P( A ∩ B) = P( A).P( B)
The addition rule therefore reduces to P ( A ∪ B ) = P ( A) + P ( B ) − P ( A).P ( B )
Example 6:
Suppose we wish to find the probability of drawing either a king or a spade in a single draw from a
pack of 52 playing cards.
We define the events
A = 'draw a king' and
B = 'draw a spade'
Since there are 4 kings in the pack and 13 spades, but 1 card is both a king and a spade,
we have:
P( A ∪ B) = P( A) + P( B) − P( A ∩ B)
= 4/52 + 13/52 - 1/52 = 16/52
So, the probability of drawing either a king or a spade is 16/52 (= 4/13).
Example7:
What is the probability that a card drawn at random from a deck of cards will be an ace?

Solution
In this case there are four favorable outcomes:
(1) the ace of spades
(2) the ace of hearts
(3) the ace of diamonds
(4) the ace of clubs.

Since each of the 52 cards in the deck represents a possible outcome,


there are 52 possible outcomes. Therefore, the probability is 4/52 or 1/13.

The same principle can be applied to the problem of determining the probability of obtaining
different totals from a pair of dice.


Example8:

What is the probability that when a pair of six-sided dice are thrown, the sum of the numbers
equals 5?

Solution:
There are 36 possible outcomes when a pair of dice is thrown. Consider that if one of the dice
rolled is a 1, there are six possibilities for the other die. If one of the dice rolled a 2, the same
is still true. And the same is true if one of the dice is a 3, 4, 5, or 6. If this is still confusing,
look at the following (abbreviated) list of outcomes:
(1,1),(1,2),(1,3),(1,4),(1,5),(1,6); (2,1),(2,2),(2,3) ...; (3,1),(3,2),(3,3) ...; (4,1) ...; (5,1) ...; (6,1) ...
The total number of outcomes is 6 × 6 = 36. Since four of the outcomes have a total of 5
[(1,4),(4,1),(2,3),(3,2)], the probability of the two dice adding up to 5 is 4/36 = 1/9.

Example 9:

What is the probability that when a pair of six-sided dice is thrown, the sum of the number
equals 12?

Solution
We already know the total number of possible outcomes is 36, and since there is only one
outcome that sums to 12, (6,6--you need to roll double sixes), the probability is simply 1/36.
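
Under the equally likely outcomes model these answers can also be obtained by simply enumerating the sample space, as in the following illustrative Python sketch:

from itertools import product

# All 36 equally likely outcomes of throwing a pair of six-sided dice
outcomes = list(product(range(1, 7), repeat=2))

p_sum_5 = sum(1 for a, b in outcomes if a + b == 5) / len(outcomes)
p_sum_12 = sum(1 for a, b in outcomes if a + b == 12) / len(outcomes)

print(p_sum_5)     # 4/36 = 0.111...
print(p_sum_12)    # 1/36 = 0.0277...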

3.4 MULTIPLICATION RULE:

The multiplication rule is a result used to determine the probability that two events, A and B, both
occur.

The multiplication rule follows from the definition of conditional probability.

The result is often written as follows, using set notation:


P( A ∩ B) = P( A / B).P( B) = P( A).P( B / A)
where:
P(A) = probability that event A occurs
P(B) = probability that event B occurs,


P( A ∩ B) = probability that event A and event B occur,


P (A | B) = the conditional probability that event A occurs given that event B has occurred already,
P (B | A) = the conditional probability that event B occurs given that event A has occurred already
For independent events, that are events which have no influence on one another, the rule simplifies to:
P( A ∩ B) = P( A).P( B)
That is, the probability of the joint events A and B is equal to the product of the individual
probabilities for the two events.

3.5 CONDITIONAL PROBABILITY:

In many situations, once more information becomes available, we are able to revise our estimates for
the probability of further outcomes or events happening. For example, suppose you go out for lunch at
the same place and time every Friday and you are served lunch within 15 minutes with probability 0.9.
However, given that you notice that the restaurant is exceptionally busy, the probability of being
served lunch within 15 minutes may reduce to 0.7. This is the conditional probability of being served
lunch within 15 minutes given that the restaurant is exceptionally busy.

The usual notation for "event A occurs given that event B has occurred" is "A | B" (A given B). The
symbol | is a vertical line and does not imply division. P(A | B) denotes the probability that event A
will occur given that event B has occurred already.

A rule that can be used to determine a conditional probability from unconditional probabilities is:
P( A ∩ B)
P( A / B) =
P( B )
Where: P(A | B) = the (conditional) probability that event A will occur given that event B has occurred
already P ( A ∩ B ) = the (unconditional) probability that event A and event B both occur
P (B) = the (unconditional) probability that event B occurs

3.6 BAYES' THEOREM:

Bayes' Theorem is a result that allows new information to be used to update the conditional probability
of an event.


Using the multiplication rule, the following gives Bayes' Theorem in its simplest form:

P(A | B) = P(A ∩ B) / P(B) = P(B | A).P(A) / P(B)

P(A | B) = P(B | A).P(A) / [ P(B | A).P(A) + P(B | A').P(A') ]

Where: P(A) = probability that event A occurs


P(B) = probability that event B occurs P(A') = probability that event A does not occur
P(A | B) = probability that event A occurs given that event B has occurred already
P(B | A) = probability that event B occurs given that event A has occurred already
P(B | A') = probability that event B occurs given that event A has not occurred already
Example 10:
¾ What is the probability that the student who will sit next to you on the bus is a woman and a biology major?
Define: A = woman; B = biology major. Suppose P(A) = 0.10 and P(B | A) = 0.10.
P(A ∩ B) = P(A) × P(B | A) = 0.10 × 0.10 = 0.01
¾ What is the probability that the student who will sit next to you on the bus is a man and a biology major?
Define: C = man; B = biology major. Suppose P(C) = 0.90 and P(B | C) = 0.90.
P(C ∩ B) = P(C) × P(B | C) = 0.90 × 0.90 = 0.81
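
Bayes' theorem itself is easy to code. The sketch below is illustrative; the plant figures are invented for the illustration and are not from the text:

def bayes(p_a, p_b_given_a, p_b_given_not_a):
    # P(A|B) = P(B|A)P(A) / [ P(B|A)P(A) + P(B|A')P(A') ]
    p_not_a = 1 - p_a
    p_b = p_b_given_a * p_a + p_b_given_not_a * p_not_a
    return p_b_given_a * p_a / p_b

# Suppose 40% of output comes from plant A, 5% of plant A's items are defective
# and 10% of the remaining output is defective. Given a defective item (event B),
# the revised probability that it came from plant A is:
print(bayes(p_a=0.4, p_b_given_a=0.05, p_b_given_not_a=0.10))   # 0.25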


3.7 RANDOM VARIABLES AND PROBABILITY DISTRIBUTIONS:

3.7.1 RANDOM VARIABLE:

The outcome of an experiment need not be a number, for example, the outcome when a coin is tossed
can be 'heads' or 'tails'. However, we often want to represent outcomes as numbers. A random variable
is a function that associates a unique numerical value with every outcome of an experiment. The value
of the random variable will vary from trial to trial as the experiment is repeated.

There are two types of random variable - Discrete and Continuous.

A random variable has either an associated probability distribution (discrete random variable) or
probability density function (continuous random variable).

Example 11:

¾ A coin is tossed ten times. The random variable X is the number of tails that are noted. X can
only take the values 0, 1,..., 10, so X is a discrete random variable.
¾ A light bulb is burned until it burns out. The random variable Y is its lifetime in hours. Y can
take any positive real value, so Y is a continuous random variable.

3.7.2 VARIANCE:

The (population) variance of a random variable is a non-negative number which gives an idea of how
widely spread the values of the random variable are likely to be; the larger the variance, the more
scattered the observations on average.

Stating the variance gives an impression of how closely concentrated round the expected value the
distribution is; it is a measure of the 'spread' of a distribution about its average value.

Variance is symbolized by V(X) or Var(X) or σ 2

a. The larger the variance, the further that individual values of the random variable
(observations) tend to be from the mean, on average;
b. The smaller the variance, the closer that individual values of the random variable
(observations) tend to be to the mean, on average;


c. Taking the square root of the variance gives the standard deviation, i.e. √(σ²) = σ;
d. The variance and standard deviation of a random variable are always non-negative.

3.8 PROBABILITY DISTRIBUTIONS:

The following are popular distributions.

1. Binomial Distribution
2. Poisson Distribution
3. Normal Distribution

3.8.1 BINOMIAL DISTRIBUTION

The binomial distribution is one of the theoretical or expected frequency distributions in probability. It is also known as the "Bernoulli distribution", since it is associated with the name of the Swiss mathematician James Bernoulli, who is also known as Jacques or Jacob. The binomial distribution expresses the probability of one set of dichotomous alternatives, i.e., success or failure.

In Bernoulli process, an experiment is performed repeatedly, yielding either a success or a failure in


each trial and where there is absolutely no pattern in the occurrence of success and failures. The trials
are independent in this process and the mathematical model for this process is developed under a very
specific set of assumptions involving the concept of a series of experimental trials. These assumptions
are:
(i) An experiment, under the same conditions, is performed for 'n' number of trials (where 'n' is fixed).
(ii) There are only two possible outcomes of the experiment in each trial, and the sample space for each experiment is denoted by S, where S = {success, failure}.
(iii) The probability of success is denoted by 'p', which remains constant from trial to trial, and the probability of failure is denoted by q = (1 − p).
(iv) The trials, which are statistically independent, do not affect the outcomes of subsequent
trials.


From the following example, we can see how the binomial distribution arises.

If a coin is tossed once, there are two outcomes, tail or head. The probability of obtaining a head is p = 1/2 and the probability of obtaining a tail is q = 1/2. These are the terms of the binomial (q + p), where (q + p) = 1.
In general, in 'n' tosses of a coin, the probabilities of the various possible events are given by the successive terms of the binomial expansion

(q + p)^n = q^n + nC1 q^(n-1) p + nC2 q^(n-2) p² + ... + nCr q^(n-r) p^r + ... + p^n

∴ The general form of the binomial distribution is

P(r) = nCr q^(n-r) p^r

where p = probability of success in a single trial
q = 1 - p
n = number of trials
r = number of successes in 'n' trials
Mean of the binomial distribution=np
Variance of the distribution = npq

Example12:

To assure the quality of a product, a random sample of size 25 is drawn from a process. The number of defects (X) found in the sample is recorded. The random variable X follows a binomial distribution with n = 25 and p = P(product is defective).

Example 13:

Records at a local blood bank show that in any year about 20% of the population donate blood.
In a queue of 8 people in a checkout:
1. what is the probability that at most 3 people are donors

2. what is the probability that 5 people are donors

3. what is the probability that at least 2 people are donors

4. how many people would you expect to be donors


Identifying the problem

Our random variable represents the number of people in the queue who are blood donors ie
x = 0 or 1 or 2 or 3 or 4 or 5 or 6 or 7 or 8

We have a set number of people – 8 that are being looked at: n=8
The people in the queue can either be blood donors (p) or not (q).
The likelihood of any one donating blood is 20%: p = 0.2 for all people looked at
Each person can be assumed to be a donor or not independently of the other people in the queue.

This problem fits the binomial properties.

Solutions

1. What is the probability that at most 3 people are donors?

P( at most 3 are donors ) = P ( x ≤ 3) found directly from tables = 0.944

2. What is the probability that 5 people are donors?

P (5 people are donors) = P ( x = 5) can use the tables with some adjustment.

P( x = 5) = P( x ≤ 5) − P ( x ≤ 4) = 0.999 − 0.990 = 0.009

Or using the binomial formula: nCx × p^x × q^(n-x) = 8C5 × 0.2^5 × 0.8^3 = 0.0092

remember that n = 8, we are finding x = 5, so n - x = 8 - 5 = 3, and p = 0.2 so q = 0.8

3. What is the probability that at least 2 people are donors?

P (at least 2 are donors) = P ( x ≥ 2) can use the tables with some adjustment.

P ( x ≥ 2) = 1 − P ( complement of x ≥ 2)


x ≥ 2 = 2 or 3 or 4 or 5 or 6 or 7 or 8 Complement of this is x = 0 or 1

P ( x ≥ 2) = 1 − P ( x ≤ 1) = 1 − 0.503 = 0.497

4. How many people would you expect to be donors?

The expected value and the variance and standard deviation of a binomial distribution are found by
using the following formula:

Expected value: E ( x) = np = 8 × 0.2 = 1.6

Variance: V ( x) = npq = 8 × 0.2 × 0.8 = 1.28

Standard deviation: S.D. = √variance = √1.28 = 1.13
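
The same answers can be obtained without tables by coding the binomial formula directly; the following Python sketch (illustrative only) uses n = 8 and p = 0.2 from the blood-donor example:

from math import comb

def binomial_pmf(r, n, p):
    # P(r) = nCr * p^r * q^(n - r), with q = 1 - p
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

n, p = 8, 0.2

p_at_most_3 = sum(binomial_pmf(r, n, p) for r in range(4))        # about 0.944
p_exactly_5 = binomial_pmf(5, n, p)                               # about 0.0092
p_at_least_2 = 1 - sum(binomial_pmf(r, n, p) for r in range(2))   # about 0.497
expected = n * p                                                  # 1.6

print(p_at_most_3, p_exactly_5, p_at_least_2, expected)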

3.8.2 POISSON DISTRIBUTION:

The Poisson distribution is also a discrete probability distribution. It was developed by the French mathematician S. D. Poisson (1781-1840) and hence is named after him.
Along with the normal and binomial distributions, the Poisson distribution is one of the most widely
used distributions. It is used in quality control statistics, to count the number of defective items or in
insurance problems to count the number of casualties or in waiting-time problems to count the number
of incoming telephone calls or incoming customers or the number of patients arriving to consult a
doctor in a given time period, and so forth., All these examples have a common feature: they can be
described by a discrete random variable, which takes on integer values (0, 1, 2, 3 and so on).
The characteristics of the Poisson distribution are:
1. The events occur independently. This means that the occurrence of a subsequent event is not at all influenced by the occurrence of an earlier event.
2. Theoretically, there is no upper limit to the number of occurrences of an event during a specified time period.
3. The probability of a single occurrence of an event within a specified time period is
proportional to the length of the time period of interval.


4. In an extremely small portion of the time period, the probability of two or more occurrences of
an event is negligible.

The Poisson distribution is used for modeling rates of occurrence. The Poisson distribution has one
parameter:

P(X) = (e^(-λ) × λ^X) / X!

λ = the rate (mean).

Mean = Variance = λ

Example14:

During an off-peak period, passengers arrive at an airport check-in at the average rate of 3 per minute.

1. What is the probability that in the next minute less than 5 passengers will arrive?

2. What is the probability that exactly 5 passengers will arrive?

3. What is the probability that in the next five minutes at least 12 passengers will arrive?

Identifying the problem

Our random variable represents the number of passengers arriving at the check-in
i.e. x = 0 or 1 or 2 or 3 or 4 or 5 or ...

We will assume each passenger arrives independently of any other passenger. We are given the
average arrivals and the time period this average holds for ie λ = 3 per minute.


This problem fits the Poisson properties.

Solutions:
1. What is the probability that in the next minute less than 5 passengers will arrive?

P( less than 5 passengers arrive) = P ( x < 5) = P ( x ≤ 4) = 0.815

2. What is the probability that exactly 5 passengers will arrive?

P( 5 people arrive) = P ( x = 5) : can use the tables with some adjustment.

P(x=5) = P(x ≤ 5) – P(x ≤ 4) =0.916 – 0.815 =0.101

or using the Poisson formula P(x) = (e^(-λ) × λ^x) / x!  with λ = 3 and x = 5:

P(x = 5) = (e^(-3) × 3^5) / 5! = 0.100818813 ≈ 0.101

3. What is the probability that in the next five minutes at least 12 passengers will arrive?

Time period has changed so adjust the average first.

λ = 3 for 1 minute so λ = 3 × 5 = 15 for 5 minute period

P(at least 12 passengers arrive) = P ( x ≥ 12) = 1 − P (complement of ≥ 12)

x ≥ 12 means x = 12 or 13 or 14 or ...
The complement of this set is x = 0 or 1 or 2 or ... or 11, i.e. P(x ≤ 11)

P ( x ≥ 12) = 1 − P ( x ≤ 11) = 1 − 0.185 = 0.815
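
The Poisson answers can be checked in the same way; the Python sketch below (illustrative only) uses λ = 3 per minute from the airport example:

from math import exp, factorial

def poisson_pmf(x, lam):
    # P(X = x) = e^(-lambda) * lambda^x / x!
    return exp(-lam) * lam ** x / factorial(x)

lam = 3                                                        # arrivals per minute

p_less_than_5 = sum(poisson_pmf(x, lam) for x in range(5))     # about 0.815
p_exactly_5 = poisson_pmf(5, lam)                              # about 0.101

lam_5min = lam * 5                                             # rescale the rate to a 5-minute period
p_at_least_12 = 1 - sum(poisson_pmf(x, lam_5min) for x in range(12))   # about 0.815

print(p_less_than_5, p_exactly_5, p_at_least_12)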


3.8.3 NORMAL DISTRIBUTION:

The preceding two distributions discussed in this chapter were discrete probability distributions. We
shall now take up another distribution in which the random variable can take on any value within a
given range. This is the normal distribution, which is an important continuous probability distribution.
This distribution is also known as the Gaussian distribution after the name of the eighteenth-century mathematician-astronomer Karl Gauss, whose contribution to the development of the normal distribution was very considerable. As a vast number of phenomena have an approximately normal distribution, it is widely used when making inferences by drawing samples. The normal distribution has certain characteristics which make it applicable to such situations.

P(X) = (1 / (σ√(2π))) × e^( -(x - µ)² / (2σ²) )
Probability for different X values can be obtained from normal distribution table available at the end
of all the units.

3.8.3.1 CHARACTERISTICS OF NORMAL PROBABILITY DISTRIBUTION:

(Figure: the normal probability distribution curve)

Let us see what this figure indicates in terms of characteristics of the normal distribution. It indicates
the following characteristics.
1. The curve is bell shaped, that is, it has the same shape on either side of the vertical line from
mean.
2. It has a single peak. As such it is unimodal.
3. The mean is located at the centre of the distribution.


4. The distribution is symmetrical


5. The two tails of the distribution extend indefinitely but never touch the horizontal axis.
6. Since the normal curve is symmetrical, the lower and upper quartiles are equidistant from the median, that is, Q3 − Median = Median − Q1.
7. The mean, median, and mode have the same value, that is, mean = median = mode.


Area under the normal curve:


Area between µ+1σ to µ-1σ is 68.27%
Area between µ+2σ to µ-2σ is 95.45%
Area between µ+3σ to µ-3σ is 99.73%

3.8.3.2 THE MOST IMPORTANT REASONS FOR ITS APPLICABILITY:

1) Normal distribution is important because, a wide variety of naturally occurring random


variables, such as the heights and weights of all creatures, are distributed evenly around a central value, average, or norm (hence the name normal distribution). Although these distributions are only approximately normal, they are quite close. Whenever many factors influence the outcome of a random variable, the underlying distribution tends to be normal.
Ex: The height of a tree is determined by the "sum" of such factors as rain, soil quality, sunshine, disease, etc.


2) Most statistical tables are limited by the size of their parameters. However, when these parameters are large enough, one may use the normal distribution for calculating the critical values for these tables. Ex: the F-statistic is related to the standard normal Z-statistic as follows: F = Z².
3) An approximation to the binomial is made by taking µ = np and σ² = npq.
Application: The probability of a defective item coming off a certain assembly line is p = 0.25. A sample of 400 items is selected from a large lot of these items. What is the probability of getting 90 or fewer defective items?
4) If the mean and S.D of a normal distribution are known, it is easy to convert back and forth from raw scores to percentiles.
5) It has been proven that the underlying distribution is normal if and only if the sample mean is independent of the sample variance; this characterizes the normal distribution. Therefore many effective transformations can be applied to convert almost any shaped distribution into a normal one.
6) The most important reason for popularity of normal distribution is the CENTRAL LIMIT
THEOREM (CLT). The distribution of the sample averages of a large number of independent
variables (random) will be approximately normal regardless of the distributions of the
individual random variables. The CLT is useful especially when you are dealing with a
population with an unknown distribution.
7) The normality condition is required by almost all kinds of parametric statistical tests. Using
most statistical tables, such as T-table, and F-table all require the normality condition of the
population.
8) This normality condition should be tested before using such tables, otherwise the conclusions may be wrong.

A normal curve with mean X̄ and standard deviation σ can be converted into a standard normal distribution by performing a change of scale and origin. The original X̄ and σ will be converted to 0 and 1 respectively. The units for the standard normal distribution curve are denoted by Z and are called the Z values or Z scores. They are also called standard units or standard scores. The Z score is known as a 'standardized' variable because it has a zero mean and a standard deviation of one.


z = (X - X̄) / σ
This transformation will be used widely in Z test in hypothesis testing.

Example 15:

The Rural Bank is reviewing its service charges and interest paying policies on cheque accounts. The
bank has found that the average daily balance on personal cheque accounts is Rs.550.00, with a
standard deviation of Rs.150.00. In addition, the average daily balances have been found to be
normally distributed.

1. What percentage of personal cheque account customers carry average daily balances in excess of
Rs.800.00

2. What percentage of these customers carry a balance of Rs.600.00 or lower.

3. The bank is considering paying interest to customers carrying average daily balances in excess of a
certain amount. If the bank does not want to pay interest to more than 5% of itÊs customers, what is the
minimum daily balance it should pay interest on?

Identifying the problem

Our random variable represents the daily balance on cheque accounts held with the bank. We know the balances are normally distributed with a mean of Rs.550 and a standard deviation of Rs.150.
The mean has been calculated on all balances, so it is the population mean µ = 550.
The standard deviation has been calculated on all balances, so it is the population standard deviation σ = 150.

Solution:


1. What percentage of personal cheque account customers carry average daily balances in excess of
Rs.800.00

First convert Rs.800 to a z score by the formula z = (x - µ)/σ = (800 - 550)/150 = 1.67

Second, look up the table to find the area between the centre and the z score 1.67: 0.4525

Third, find P(x > 800) = P(z > 1.67) by subtracting 0.4525 from 0.5:

P(x > 800) = P(z > 1.67) = 0.5 - 0.4525 = 0.0475

2. What percentage of these customers carry a balance of Rs.600.00 or lower.


z = (x - µ)/σ = (600 - 550)/150 = 0.33

So P(x < 600) = P(z < 0.33) = 0.5 + 0.1293 = 0.6293

3. The bank is considering paying interest to customers carrying average daily balances in excess of a
certain amount. If the bank does not want to pay interest to more than 5% of itÊs customers, what is
the minimum daily balance it should pay interest on?

The bank will only want to pay interest to a few customers so the daily balance limit will be in the top
end of the balances. 5% represents an area under the bell curve – in this case the area in the right hand
tail.

First find the rest of the area on the right hand side of the bell: 50% - 5% = 45% = 0.4500

Second look up the area 0.4500 (or nearest area to it) in table 3 to find the z score that corresponds
with this area: z = 1.645

Third rearrange the z formula to find the value of x


Rearrange z = (x - µ)/σ to find x:  x = zσ + µ

x = 1.645 × 150 + 550 = 796.75

Any accounts with balances over Rs.796.75 will earn interest.
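
The three answers can also be checked without a printed z-table. The Python sketch below (illustrative only) uses the standard library normal distribution with the bank's mean and standard deviation; the tiny differences from the hand calculation come from rounding in the table:

from statistics import NormalDist

balances = NormalDist(mu=550, sigma=150)

p_over_800 = 1 - balances.cdf(800)       # about 0.0478 (the table gives 0.0475)
p_below_600 = balances.cdf(600)          # about 0.6306
cutoff = balances.inv_cdf(0.95)          # balance exceeded by only 5% of accounts

print(p_over_800, p_below_600, cutoff)   # cutoff is about 796.7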

3.9 SUMMARY:
This Chapter presents the fundamental concepts of probability and probability distributions. The
topics of random variables, discrete probability distributions (such as Poisson and binomial), and
continuous probability distributions (such as normal) have been discussed. A probability distribution
is any statement of a probability function having a set of collectively exhaustive and mutually
exclusive events. All probability distributions follow the basic probability rules. Basic probability
concepts and distributions are useful in decision theory, inventory control, Markov analysis, project management, simulation, and statistical quality control.

3.10 GLOSSARY:
Bayes' theorem: A rule that is used for revising the probability of events after having more information.
Bernoulli process: One repetition of a binomial experiment; also called a trial.

Binomial Distribution : The probability distribution that gives the probability of x successes in n trials when
the probability of a success is p for each trial of a binomial experiment
Event: One of the possible outcomes of an experiment
Normal Distribution: A symmetrical distribution that is mounded up about the mean and is bell shaped and
becomes sparse at the extremes. The two tails never touch the horizontal axis.

Poisson Distribution: Like the binomial, but unlike the normal, it is a discrete probability distribution that gives the probability of x successes in an interval. It is appropriate when the probability of success is very small and n is large.

Probability distribution: A distribution of the probabilities associated with each of the values of a random variable. It is a theoretical distribution and is used to represent the population.

Probability: A numerical measure of the likelihood that a specific event will occur.

3.11 REFERENCES:
1. Levin, Richard I. and Rubin, David S., "Statistics for Management", PHI, New Delhi, 2000.
2. Sancheti, D.C. and Kapoor, V.K., "Business Statistics", New Delhi.

3.12 REVIEW EXERCISE:


1. What do you understand by the term probability? Discuss its importance in business decision
making.
2. A letter is chosen at random from the word 'STATISTICIAN'.
a) What is the probability that it is a vowel?
b) What is the probability that it is a T?

3. Two computers A and B are to be marketed. A salesman who is assigned a job of finding
customers for them has 60 percent and 40 percent chances respectively of succeeding in case
of computers A and B. The computers can be sold independently. Given that he was able to
sell at least one computer, what is the probability that the computer A has been sold?
4. Two factories manufacture the same machine part. Each part is classified as having either
0,1,2,or 3 manufacturing defects; the joint probability distribution for this is given below:

Number of Defects
0 1 2 3
Manufacturer A 0.1250 0.0625 0.1875 0.1250
Manufacturer B 0.0625 0.0625 0.1250 0.2500

(i) A part is observed to have no defects. What is the conditional probability that it was
produced by manufacturer A?
(ii) A part is known to have been produced by manufacturer A. What is the conditional
probability that the part has no defects?


(iii) A part is known to have two or more defects. What is the conditional probability that
it was manufactured by A?
(iv) A part is known to have one or more defects. What is the conditional probability that
it was manufactured by B?
5. A company has three plants to manufacture 8,000 scooters in a month. Out of the 8,000 scooters, plant I manufactures 4,000, plant II manufactures 3,000 and plant III manufactures 1,000 scooters. At plant I, 85 out of 100 scooters are rated of standard quality or better, at plant II only 65 out of 100 scooters are rated of standard quality or better, and at plant III 60 out of 100 scooters are rated of standard quality or better. What is the probability that a scooter selected at random comes from (i) plant I, (ii) plant II, and (iii) plant III, if it is known that the scooter is of standard quality?

6. In a certain locality, half of the households are known to use a particular brand of soap. In a household survey, a sample of 10 households is allotted to each investigator and 2048 investigators are appointed for the survey. How many investigators are likely to report:
(i) 3 users; (ii) not more than 3 users; and (iii) at least 4 users?
7. A manufacturer finds that the average demand per day for the mechanics to repair new
products is 2, over a period of one year and the demand per day is distributed as Poisson
variate. He employs 3 mechanics. On how many days in one year: (i) both the mechanics will
be free, and (ii) some demand is refused?
8. Five hundred televisions are inspected as they come off the production line and the number of
defects per set is recorded below:

No. of defects(X) 0 1 2 3 4
No. of sets 368 72 52 7 1
Estimate the average number of defects per set and the expected frequencies of 0, 1, 2, 3 and 4 defects, assuming a Poisson distribution.
[Given e −0.408 = 0.6649 ]
9. Six hundred candidates appeared for an entrance test for admission to a management course.
The marks obtained by the candidates were found to be normally distributed with a mean of
152 marks and a standard deviation of 18 marks.
If the top 60 performers were given confirmed admission, what are the minimum marks (to the
nearest integer) above which a candidate would be sure of being admitted?


Further, those obtaining at least 170 marks, but not qualified for confirmed admission were
included in a provisional list. How many candidates were included in this list? (Answer to the
nearest integer.)

10. The customer accounts at a certain departmental store have an average balance of Rs.480 and
a standard deviation of Rs.160. Assuming that the account balances are normally distributed

(i) What proportion of the accounts is over Rs.600?


(ii) What proportion of the accounts is between Rs.400 and Rs.600?
(iii) What proportion of the accounts is between Rs.240 and Rs.360?

By
Dr. C.R. Rao, Reader,
Department of Mathematics & Statistics,
School of Mathematics & Computer/Information Sciences,
University of Hyderabad.


4. CORRELATION

LEARNING OBJECTIVES:

After studying this chapter, you should be able to

• Understand the importance and the concept of correlation.
• Distinguish between (i) linear and non-linear correlation, (ii) positive and negative correlation, and (iii) simple, partial and multiple correlation.
• Appreciate the utility of the scatter diagram, which suggests a relationship between two variables.
• Calculate and interpret the coefficient of correlation and Spearman's rank correlation for bivariate and grouped data.

4.1. INTRODUCTION:

Managers today often need to understand, and make decisions based on, numerical data on two or more variables simultaneously. For example,
i) Cost of production and volume of production,
ii) Expenditure on advertising and sales of a product,
iii) Number of vehicles on the road and number of accidents,
iv) Number of colleges offering an MBA programme and number of MBA graduates,
v) Number of counters at an e-Seva Kendra and the waiting time of customers,
vi) Number of telephone calls and rate per call, and so on.

In other words, one of the basic functions of a manager is to understand the relationship between such variables and make appropriate decisions keeping the future in mind, a task known as 'forecasting' or 'prediction'. The part which deals with understanding the behaviour of the variables is Correlation, and the part which deals with forecasting is Regression.


4.2 CORRELATION:

Correlation is a statistical tool for studying the relationship between two or more variables, and correlation analysis involves the various methods and techniques used for studying and measuring the extent of the relationship between two variables. Two variables are said to be correlated if a change in one variable results in a corresponding change in the other.

4.2.1 SCATTER DIAGRAM:

It is a diagrammatic representation of a bivariate distribution and provides the simplest way of understanding the relationship between two variables. A scatter diagram is simply the graphical presentation of pairs of observed bivariate data in the form of 'dots'. Different scatter diagrams are presented in the following discussion.

4.3. TYPES OF CORRELATION:

Broadly speaking, there are four types of correlation, namely, a) positive correlation, b) negative correlation, c) linear correlation and d) non-linear correlation.

4.3.1 POSITIVE CORRELATION:

If the values of two variables deviate in the same direction i.e., if increase in the values of one variable
results, on an average, in a corresponding increase in the values of the other variable or if a
decrease in the values of one variable results, on an average, in a corresponding decrease in the
values of the other variable, the corresponding correlation is said to be positive or direct.
Examples:
i) Sales revenue of a product and expenditure on Advertising.
ii) Amount of rainfall and yield of a crop (up to a point)
iii) Price of a commodity and quantity of supply of a commodity.
iv) Height of the Parent and the height of the Child.
v) Number of patients admitted into a Hospital and Revenue of the Hospital.
vi) Number of workers and output of a factory.


Perfect positive Correlation:


If the variables X and Y are perfectly positively related to each other then, we get a graph as shown in
Fig.4.1.

Fig. 4.1: PERFECT POSITIVE CORRELATION (r = +1) [scatter diagram: Y variable against X variable]

Very High Positive Correlation:

If the variables X and Y are related to each other with a very high degree of positive relationship then
we can notice a graph as in Fig.4.2.

Fig. 4.2: VERY HIGH POSITIVE CORRELATION (r nearly +1) [scatter diagram: Y variable against X variable]


Very low Positive Correlation:

If the variables X and Y are related to each other with a very low degree of positive relationship then
we can notice a graph as in Fig.4.3.

Fig. 4.3: VERY LOW POSITIVE CORRELATION (r near 0 but positive) [scatter diagram: Y variable against X variable]

4.3.2 NEGATIVE CORRELATION:

Correlation is said to be negative or inverse if the variables deviate in the opposite direction
i.e., if the increase (decrease) in the values of one variable results, on the average, in a
corresponding decrease (increase) in the values of the other variable.
Examples:
1. Price and demand of a commodity
2. Sale of Woolen garments and the day temperature.

Perfect Negative Correlation:


If the variables X and Y are perfectly negatively related to each other then, we get a graph as shown in
Fig.4.4.


Fig. 4.4: PERFECT NEGATIVE CORRELATION (r = −1) [scatter diagram: Y variable against X variable]

Very High Negative Correlation:

If the variables X and Y are related to each other with a very high degree of negative relationship then
we can notice a graph as in Fig.4.5.

Fig. 4.5: VERY HIGH NEGATIVE CORRELATION (r near −1) [scatter diagram: Y variable against X variable]


Fig. 4.6: VERY LOW NEGATIVE CORRELATION (r near 0 but negative) [scatter diagram: Y variable against X variable]

Very low Negative Correlation:

If the variables X and Y are related to each other with a very low degree of negative relationship then
we can notice a graph as in Fig.4.6.

4.3.3 NO CORRELATION:

If the scatter diagram shows points that are widely scattered and exhibit no trend or pattern, we can say that there is no correlation between the variables. Refer to Fig. 4.7.


Fig. 4.7: NO CORRELATION (r = 0) [scatter diagram: Y variable against X variable]

4.3.4 LINEAR CORRELATION:

Two variables are said to be linearly related if, corresponding to a unit change in one variable, there is a constant change in the other variable over the entire range of values. If two variables are related linearly, then we can express the relationship as

Y = a + bX

where 'a' is called the "intercept" (if X = 0, then Y = a) and 'b' is called the "rate of change" or slope. If we plot the values of X and the corresponding values of Y on a graph, the graph will be a straight line, as shown in Fig. 4.8.

Example:
X 1 2 3 4 5
Y 6 8 10 12 14

For a unit change in the value of x, a constant 2 units change in the value of y can be noticed.
The above can be expressed as : Y=4+2X


Fig. 4.8: [graph of Y variable against X variable for the above data; the plotted points lie on a straight line]

4.3.5 NON LINEAR (CURVILINEAR) CORRELATION:

If, corresponding to a unit change in one variable, the other variable does not change at a constant rate but changes at varying rates, then the relationship between the two variables is said to be non-linear or curvilinear, as shown in Fig. 4.9. In this case, if the data are plotted on a graph, we do not get a straight line. Mathematically, the correlation is non-linear if the slope of the plotted curve is not constant. Data relating to economics, social science and business management often exhibit non-linear relationships. We confine ourselves to linear correlation only.

Example:
X -6 -4 -2 0 2 4 6
Y 36 16 4 0 4 16 36

Fig. 4.9: Non-Linear Correlation [graph of Y variable against X variable for the above data; the plotted points lie on a curve, not a straight line]


4.4. KARL PEARSON'S COEFFICIENT OF CORRELATION:

To measure the degree of association between two variables X and Y, Karl Pearson defined the coefficient of correlation 'γ' as below. In this method, the coefficient of correlation is calculated as the ratio of the covariance of the two variables to the square root of the product of their variances.

Correlation coefficient (γ) = Cov(Xi, Yi) / √[V(Xi) · V(Yi)]

where

Cov(Xi, Yi) = Σ(Xi − X̄)(Yi − Ȳ) / n
V(Xi) = Σ(Xi − X̄)² / n
V(Yi) = Σ(Yi − Ȳ)² / n

Equivalently,

γ = Σxy / √(Σx² · Σy²)

where x = Xi − X̄ and y = Yi − Ȳ.

4.4.1. PEARSON'S METHOD – DIRECT METHOD:

We apply the direct formula to find Karl Pearson's coefficient of correlation.

Solved Example 1:

Following is the data on two variables Xi and Yi. We find the sums of squares and of products as shown in the table below.


Xi    Yi    x = Xi − X̄    y = Yi − Ȳ    x²    y²    xy
 2     7        −2             −4         4    16     8
 3     9        −1             −2         1     4     2
 4    10         0             −1         0     1     0
 5    14         1              3         1     9     3
 6    15         2              4         4    16     8
20    55                                 Σx² = 10   Σy² = 46   Σxy = 21

X̄ = ΣXi / n = 20/5 = 4,    Ȳ = ΣYi / n = 55/5 = 11

γ = Σxy / √(Σx² · Σy²) = 21 / √(10 × 46) = 21 / (3.16 × 6.78) = 0.98

The value of γ = 0.98 shows that the two series X and Y have almost perfect positive correlation.
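The arithmetic above can be cross-checked on a computer. The following is a minimal sketch (Python with NumPy is assumed here; it is not part of the prescribed material) that applies the direct formula to the five pairs of Solved Example 1:

import numpy as np

# Data of Solved Example 1
x = np.array([2, 3, 4, 5, 6], dtype=float)
y = np.array([7, 9, 10, 14, 15], dtype=float)

# Direct method: gamma = sum(xy) / sqrt(sum(x^2) * sum(y^2)) on deviations from the means
dx = x - x.mean()
dy = y - y.mean()
gamma = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(round(gamma, 2))      # 0.98, matching the worked example
# np.corrcoef(x, y)[0, 1] gives the same value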
4.4.2. PEARSON'S METHOD – WITHOUT DEVIATIONS (SHORT-CUT METHOD):

When the arithmetic means of both sets of numerical items are not whole numbers and involve decimals, calculating the coefficient of correlation by the direct method becomes tedious. To overcome this difficulty the following modified short-cut formulae are used.

Cov(Xi, Yi) = ΣXiYi / n − X̄Ȳ

V(Xi) = ΣXi² / n − X̄²;    V(Yi) = ΣYi² / n − Ȳ²

γ = Cov(Xi, Yi) / √[V(Xi) · V(Yi)]

which simplifies to

γ = [n ΣXiYi − ΣXi ΣYi] / √{[n ΣXi² − (ΣXi)²] [n ΣYi² − (ΣYi)²]}

Solved Example 2:

Calculate Karl Pearson's coefficient of correlation for the following data on sales and advertising expenditure. Let sales be the Xi variable and advertising expenditure the Yi variable, and use the formula

γ = [n ΣXiYi − ΣXi ΣYi] / √{[n ΣXi² − (ΣXi)²] [n ΣYi² − (ΣYi)²]}

Xi    Yi    Xi²    Yi²    XiYi
 1     3     1      9      3
 2    15     4    225     30
 3     6     9     36     18
 4    20    16    400     80
 5     9    25     81     45
 6    25    36    625    150
ΣXi = 21   ΣYi = 78   ΣXi² = 91   ΣYi² = 1376   ΣXiYi = 326

γ = [(6 × 326) − (21 × 78)] / √{[(6 × 91) − (21)²] [(6 × 1376) − (78)²]}

γ = 318 / (10.247 × 46.605)

γ = 0.667

This suggests a fairly high degree of correlation between the X and Y series, i.e. between sales and advertising expenditure.

4.4.3. PEARSON'S METHOD – SHIFTING ORIGIN:

When the magnitude of the data is large, using the two methods explained above is inconvenient for calculating the correlation coefficient by Karl Pearson's method. So we take deviations from some convenient numbers to reduce the magnitude of the data. There is no change in the value of the correlation coefficient when such deviations are taken. We define ui = Xi − A and vi = Yi − B, where A and B can be any arbitrarily assumed values. The formulae are given below:

V(ui) = Σui² / n − ū²;    V(vi) = Σvi² / n − v̄²;    Cov(ui, vi) = Σuivi / n − ūv̄

γ = Cov(ui, vi) / √[V(ui) · V(vi)]

which simplifies to

γ = [n Σuivi − Σui Σvi] / √{[n Σui² − (Σui)²] [n Σvi² − (Σvi)²]}

Solved Example 3:
Using the short-cut method, we calculate 'r' for the following data, where Xi = advertising expenditure (rupees in thousands) and Yi = sales (rupees in lakhs). Let us define A = 60 and B = 70, two values chosen arbitrarily. Then ui = Xi − 60 and vi = Yi − 70.

Xi    Yi    ui    vi    ui²    vi²    uivi
39    47   −21   −23    441    529    483
65    53     5   −17     25    289    −85
62    58     2   −12      4    144    −24
90    86    30    16    900    256    480
82    62    22    −8    484     64   −176
75    68    15    −2    225      4    −30
25    60   −35   −10   1225    100    350
98    91    38    21   1444    441    798
36    51   −24   −19    576    361    456
78    84    18    14    324    196    252
Total      Σui = 50   Σvi = −40   Σui² = 5648   Σvi² = 2384   Σuivi = 2504

ū = Σui / n = 50/10 = 5;    v̄ = Σvi / n = −40/10 = −4

γ = [n Σuivi − Σui Σvi] / √{[n Σui² − (Σui)²] [n Σvi² − (Σvi)²]}

γ = [(10 × 2504) − (50 × −40)] / √{[(10 × 5648) − (50)²] [(10 × 2384) − (−40)²]}

γ = 27040 / √(53980 × 22240) = 27040 / 34647.373

γ = 0.78

Hence the correlation between the X and Y series is fairly high, as the coefficient of correlation is 0.78.
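The fact (stated as property vi in the next section) that γ is unchanged when the origin is shifted can be verified numerically. A small sketch, again assuming Python/NumPy, computes γ from the raw Xi, Yi of Solved Example 3 and from the shifted values ui = Xi − 60, vi = Yi − 70; both give 0.78:

import numpy as np

x = np.array([39, 65, 62, 90, 82, 75, 25, 98, 36, 78], dtype=float)
y = np.array([47, 53, 58, 86, 62, 68, 60, 91, 51, 84], dtype=float)

def pearson(a, b):
    # Karl Pearson's coefficient: covariance divided by the product of standard deviations
    da, db = a - a.mean(), b - b.mean()
    return (da * db).sum() / np.sqrt((da ** 2).sum() * (db ** 2).sum())

u, v = x - 60, y - 70               # deviations from the arbitrary origins A = 60 and B = 70
print(round(pearson(x, y), 2))      # 0.78
print(round(pearson(u, v), 2))      # 0.78 -- shifting the origin leaves gamma unchanged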

4.5. PROPERTIES OF COEFFICIENTS OF CORRELATION:

i) The value of the correlation coefficient γ lies in the interval [−1, +1]. This indicates that the absolute value of γ does not exceed unity.
ii) The sign of γ depends on the sign of the covariance.
iii) If γ = −1, the variables are perfectly negatively correlated.
iv) If γ = +1, the variables are perfectly positively correlated.
v) If γ = 0, the variables are not correlated in a linear fashion. There may be a nonlinear relationship between the variables.
vi) The correlation coefficient is independent of change of scale and shifting of origin. In other words, shifting the origin and changing the scale do not have any effect on the value of the correlation coefficient.

Let us see the following example to understand the statement 'if γ = 0, the variables are not correlated in a linear fashion; there may be a nonlinear relationship between the variables'.


Solved Example 4:

If Xi and Yi are as given below, we calculate the correlation coefficient.

Xi    Yi    Xi²    Yi²    XiYi
−3     9     9     81    −27
−2     4     4     16     −8
−1     1     1      1     −1
 0     0     0      0      0
 1     1     1      1      1
 2     4     4     16      8
 3     9     9     81     27
ΣXi = 0   ΣYi = 28   ΣXi² = 28   ΣYi² = 196   ΣXiYi = 0

γ = [n ΣXiYi − ΣXi ΣYi] / √{[n ΣXi² − (ΣXi)²] [n ΣYi² − (ΣYi)²]}

γ = [(7 × 0) − (0 × 28)] / √{[(7 × 28) − (0)²] [(7 × 196) − (28)²]}

γ = 0 / √(196 × 588) = 0

The result γ = 0 does not mean that the variables Xi and Yi are unrelated; it only means that they are linearly uncorrelated. In fact, if we look closely at the data, it can be observed that Yi = Xi² is the relationship between Xi and Yi. This is a nonlinear relationship between the variables. Karl Pearson's coefficient of correlation cannot measure a nonlinear relationship between the variables.
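The same point can be seen numerically: for the data of Solved Example 4, where Yi = Xi², the linear correlation is zero even though the two variables are exactly related. A one-line check (Python/NumPy assumed):

import numpy as np

x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)
y = x ** 2                         # exact nonlinear relationship Y = X^2
print(np.corrcoef(x, y)[0, 1])     # 0.0 -- no linear association is detected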

4.6. CORRELATION OF GROUPED DATA:

When the number of observations is large, the data are often classified into a two-way frequency distribution, i.e., a table in which the values of one variable (X) are represented in the rows and those of the other variable (Y) in the columns. These values can be either discrete or continuous (class intervals). The frequency in each class is shown in the corresponding cell in the body of the table.

Steps for calculating the correlation coefficient for grouped data:

i) Record the mid-points (mp) of the class intervals for both the X and Y variables.
ii) Choose an assumed mean in the X series and calculate the deviations (dx) from it. The same procedure is used for the Y series to calculate the deviations (dy).
iii) To simplify the calculations, step deviations can be taken by dividing the deviations by a common factor.
iv) Calculate f.dx, f.dx² and f.dx.dy for the X series, and f.dy, f.dy² and f.dx.dy for the Y series.
v) Substitute all the values obtained in the following formula:

γ = [n Σfdxdy − Σfdx Σfdy] / √{[n Σfdx² − (Σfdx)²] [n Σfdy² − (Σfdy)²]}
Solved Example 5:
Calculate Karl Pearson's coefficient of correlation for the following grouped data.

Sales Revenue        Advertising Expenditure (Rs lakh)        Total
(Rs lakh)          5-15    15-25    25-35    35-45
75-125               3        4        4        8              19
125-175              8        6        5        7              26
175-225              2        2        3        4              11
225-275               3        3        2        2              10
Total               16       15       14       21              66


The working table below uses mid-points and step deviations. For X (advertising expenditure) the mid-points are 10, 20, 30, 40 with step deviations dx = −1, 0, 1, 2; for Y (sales revenue) the mid-points are 100, 150, 200, 250 with step deviations dy = −1, 0, 1, 2. The value in brackets in each cell shows f.dx.dy.

Y \ X      5-15     15-25    25-35    35-45       f      fdy     fdy²    fdxdy
75-125     3 (3)    4 (0)    4 (−4)   8 (−16)     19     −19      19      −17
125-175    8 (0)    6 (0)    5 (0)    7 (0)       26       0       0        0
175-225    2 (−2)   2 (0)    3 (3)    4 (8)       11      11      11        9
225-275    3 (−6)   3 (0)    2 (4)    2 (8)       10      20      40        6
f          16       15       14       21          66   Σfdy=12  Σfdy²=70  Σfdxdy=−2
fdx       −16        0       14       42         Σfdx = 40
fdx²       16        0       14       84         Σfdx² = 114
fdxdy      −5        0        3        0         Σfdxdy = −2

γ = [n Σfdxdy − Σfdx Σfdy] / √{[n Σfdx² − (Σfdx)²] [n Σfdy² − (Σfdy)²]}

γ = [(66 × −2) − (40 × 12)] / √{[66 × 114 − (40)²] [66 × 70 − (12)²]}

γ = −612 / √(5924 × 4476) = −9.27 / (9.47 × 8.24)    (dividing numerator and denominator by n = 66)

γ = −0.119

This shows a very low degree of negative correlation between advertising expenditure (X) and sales revenue (Y).
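The step-deviation arithmetic for grouped data can also be scripted. The sketch below (illustrative only; Python/NumPy and the variable names are assumptions) rebuilds the working table of Solved Example 5 from the cell frequencies and reproduces γ ≈ −0.119:

import numpy as np

# Cell frequencies: rows = sales revenue classes (Y), columns = advertising classes (X)
f = np.array([[3, 4, 4, 8],
              [8, 6, 5, 7],
              [2, 2, 3, 4],
              [3, 3, 2, 2]], dtype=float)
dx = np.array([-1, 0, 1, 2], dtype=float)   # step deviations of the X mid-points
dy = np.array([-1, 0, 1, 2], dtype=float)   # step deviations of the Y mid-points

n     = f.sum()
fdx   = (f.sum(axis=0) * dx).sum()          # sum of f.dx over the columns
fdy   = (f.sum(axis=1) * dy).sum()          # sum of f.dy over the rows
fdx2  = (f.sum(axis=0) * dx ** 2).sum()
fdy2  = (f.sum(axis=1) * dy ** 2).sum()
fdxdy = (f * np.outer(dy, dx)).sum()        # sum of f.dx.dy over every cell

gamma = (n * fdxdy - fdx * fdy) / np.sqrt((n * fdx2 - fdx ** 2) * (n * fdy2 - fdy ** 2))
print(round(gamma, 3))                      # -0.119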

4.7. RANK CORRELATION (SPEARMAN'S METHOD):

It is not possible to express attributes such as character, conduct, honesty, beauty, morality, intellectual integrity, etc. in numerical terms. For example, it is easy for a class teacher to arrange the students in his class in ascending or descending order of intelligence; that is, he can rank them according to their intelligence. Hence, in problems that involve attributes of the type mentioned above, the coefficient of correlation is based entirely on the rank differences between corresponding items. We may have two types of numerical problems in rank correlation:
a) When actual ranks are given
b) When ranks are not given

Calculation of Rank Correlation:

i) In the first case, when actual ranks are given, the difference of the two ranks (R1 − R2) is taken for each pair and denoted by 'd'.
ii) The differences are squared and their total (Σd²) is obtained.
iii) Then the following formula is applied to calculate the rank correlation coefficient:

rs = 1 − [6 Σd² / {N(N² − 1)}]

where rs denotes Spearman's rank correlation coefficient and N denotes the number of pairs of observations.
iv) In the second case, when the ranks are not given but the actual data are given, we have to assign ranks. We may do so by taking either the highest value or the lowest value as rank 1. When two observations are equal, the normal practice is to assign the average of their ranks to both observations.


When the ranks are given:


Solved Example.6:
The ranking of 10 students in two subjects A and B are as follows:
Student 1 2 3 4 5 6 7 8 9 10
Ranks in Subject A 4 6 1 3 9 7 10 2 8 5
Ranks in Subject B 5 8 3 1 7 6 9 2 10 4

Calculate coefficient of rank correlation and comment on the result


Solution:
In order to calculate the rank correlation, we have to find Σd² and then apply the formula

rs = 1 − [6 Σd² / {N(N² − 1)}]

The following table shows the calculations:

Student Ranks in Ranks in Difference Squared


No Subject A Subject B ( R1 – R2) difference
(R1 ) (R2) (d) ( d2)
1 4 5 -1 1
2 6 8 -2 4
3 1 3 -2 4
4 3 1 2 4
5 9 7 2 4
6 7 6 1 1
7 10 9 1 1
8 2 2 0 0
9 8 10 -2 4
10 5 4 1 1

∑d2 = 24

rs = 1 − [6 Σd² / {N(N² − 1)}]

rs = 1 − [(6 × 24) / {10 × (10² − 1)}]

rs = 1 − 144/990 = 0.855

The rank correlation coefficient (0.855) shows that there is a very high degree of correlation between
ranks obtained in subject A and Subject B of the ten students.
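A minimal sketch of the same calculation (Python assumed; not part of the original text) applies Spearman's formula to the two sets of ranks:

# Ranks of the 10 students in Subjects A and B (Solved Example 6)
r1 = [4, 6, 1, 3, 9, 7, 10, 2, 8, 5]
r2 = [5, 8, 3, 1, 7, 6, 9, 2, 10, 4]

n = len(r1)
d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))   # sum of squared rank differences
rs = 1 - 6 * d2 / (n * (n ** 2 - 1))             # Spearman's rank correlation

print(d2, round(rs, 3))                          # 24  0.855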
When the ranks are not given:
Solved Example.7:

Compute Spearman's coefficient of correlation between the marks assigned to ten students by Judges X and Y in a certain competitive test, as shown below.

Student No 1 2 3 4 5 6 7 8 9 10
Marks by Judge X 43 56 29 81 96 34 73 62 48 76
Marks by Judge Y 15 26 34 86 19 29 83 67 51 58

Student Marks by Ranks by Marks by Ranks by Difference Squared


No Judge X Judge X Judge Y Judge Y ( R1 – R2) difference
(R1 ) (R2) (d) ( d2)
1 43 8 15 10 -2 4
2 56 6 26 8 -2 4
3 29 10 34 6 4 16
4 81 2 86 1 1 1
5 96 1 19 9 -8 64
6 34 9 29 7 2 4
7 73 4 83 2 2 4
8 62 5 67 3 2 4
9 48 7 51 5 2 4
10 76 3 58 4 -1 1

∑d2 = 106

rs = 1 − [6 Σd² / {N(N² − 1)}]

rs = 1 − [(6 × 106) / {10 × (10² − 1)}]

rs = 1 − 636/990 = 0.36

The rank correlation coefficient (0.36) shows that there is a low degree of correlation between marks
assigned by Judge X and Judge Y to the ten students.

Solved Example.8:

Obtain the rank correlation between variables X( Price of commodity A in Rs) and Y( Price of
commodity B in Rs) from the following pairs of observed values.

X 24 29 23 38 46 52 41 36 68 56

Y 110 126 145 131 163 158 131 129 154 140

Solution:
X Ranks of Y Ranks of Difference Squared
X Y ( R1 – R2) difference
(R1 ) (R2) (d) ( d2)
24 9 110 10 -1 1
29 8 126 9 -1 1
23 10 145 4 6 36
38 6 131 6.5 -0.5 0.25
46 4 163 1 3 9
52 3 158 2 1 1
41 5 131 6.5 -1.5 2.25
36 7 129 8 -1 1
68 1 154 3 -2 4
56 2 140 5 -3 9

∑d2 = 64.5

In the data there are two equal values in the Y series (131), which tie for ranks 6 and 7. The average of ranks 6 and 7, i.e. 6.5, is therefore assigned as the rank of both observations.


Since there are common (tied) ranks in the second series (Y), the formula for the coefficient of correlation by the rank-differences method has to be modified as given below:

rs = 1 − 6[Σd² + (m1³ − m1)/12 + (m2³ − m2)/12 + (m3³ − m3)/12 + ...] / {N(N² − 1)}

where m1, m2, m3, ... stand for the number of items in the respective groups with common ranks. In this problem there is only one such group, containing two items (two common ranks), hence m1 = 2.

rs = 1 − 6[Σd² + (m1³ − m1)/12] / {N(N² − 1)}

rs = 1 − 6[64.5 + (2³ − 2)/12] / {10(10² − 1)}

rs = 1 − 6[64.5 + 0.5] / 990 = 1 − 390/990 = 0.61

The rank correlation coefficient (0.61) shows that there is a moderate correlation between X and Y.

4.8 COEFFICIENT OF DETERMINATION (γ²):

When γ = 1, −1 or 0, the interpretation of γ does not pose any problem. When γ = 1 or −1, all the points lie on a straight line in the graph, showing a perfect positive or negative correlation. When the points are widely scattered on the graph, it is evident that there is almost no relationship between the two variables. However, for other values of γ we have to be careful in the interpretation. Suppose we get a correlation of γ = 0.9; we may be tempted to say that γ = 0.9 is 'twice as good' or 'twice as strong' as a correlation of γ = 0.45. This comparison is wrong. The strength of γ is judged by the coefficient of determination, γ². For γ = 0.9, γ² = 0.81; multiplying by 100 gives 81 per cent. This suggests that when γ = 0.9, 81 per cent of the total variation in the Y series can be attributed to the relationship with X.
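For instance, for γ = 0.45 the coefficient of determination is γ² = (0.45)² = 0.2025, i.e. only about 20 per cent of the variation in Y is explained, against 81 per cent for γ = 0.9. In terms of variation explained, a correlation of 0.9 is therefore roughly four times as strong as a correlation of 0.45, not twice as strong.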

4.9. CORRELATION AND CAUSATION:

Correlation helps us form an idea about the degree and direction of the relationship between two variables under study. It does not, however, reflect the cause-and-effect relationship between the variables. If the variables have a cause-and-effect relationship, they are bound to vary in sympathy with each other and hence there is a high degree of correlation. In other words, causation always implies correlation, but the converse is not true: even a fairly high degree of correlation between two variables need not imply a cause-and-effect relationship between them. For example, if X = number of patients admitted into a super-speciality hospital and Y = number of space shuttles launched, one may get a numerical value for r, but it does not signify any real relationship between these variables.

4.10. SUMMARY:

The importance and concept of correlation, and the utility of the scatter diagram which suggests a relationship between two variables, were discussed. Pearson's coefficient of correlation, a measure of the degree of association between variables, was presented. The distinction between linear and non-linear correlation, positive and negative correlation, and simple, partial and multiple correlation has been brought out, and different types of correlation and their applications were covered. Finally, the coefficient of correlation and Spearman's rank correlation for bivariate and grouped data were explained.

4.11. GLOSSARY:

Correlation: Degree of association between two variables

Correlation Analysis: Analysis of statistical data that is concerned with the question of whether there is a relationship between two variables.

Correlation coefficient: A number lying between −1 (perfect negative correlation) and +1 (perfect positive correlation) that quantifies the association between two variables.

Covariance: A measure of the joint variation between the variables X and Y. The covariance of X and Y is the average of the products of the deviations from the means for the n pairs of the X and Y series.

Rank Correlation: A method to determine correlation when the data are not available in numerical form and, as an alternative, the method of ranking is used.

Scatter Diagram: An ungrouped plot of two variables on the X and Y axes. A plot of the paired observations of X and Y that shows the broad pattern of relationship between the two variables.

4.12. REFERENCES:

1. Gupta, S.C. and Kapoor, V.K., "Fundamentals of Mathematical Statistics", Sultan Chand and Sons, New Delhi, 1997, 9th Edition.
2. Gupta, S.C., "Fundamentals of Statistics", Himalaya Publishing House, New Delhi, 2004.
3. Murray R. Spiegel and Larry J. Stephens, "Statistics – Schaum's Outlines", Third Edition, McGraw-Hill International Editions, 1999.
4. Levin, Richard I. and Rubin, David S., "Statistics for Management", PHI, New Delhi, 2000.
5. Sancheti, D.C. and Kapoor, V.K., "Business Statistics", New Delhi.
6. Gupta, S.P. and Gupta, M.P., "Business Statistics", Sultan Chand & Sons, New Delhi, 2001.


4.13. REVIEW EXERCISE:

1. Why is rank correlation important in business statistics? How does it differ from Karl Pearson's coefficient of correlation?
2. What are the different methods of finding correlation between the two variables?

3. Find the coefficient of correlation between the sales and expenses of the ten firms given below and
comment.

Sales 50 50 55 60 65 65 65 60 60 50
Expenses 11 13 14 16 16 15 15 14 13 13

4. Find the Coefficient of correlation for the following data.

X 26 29 31 34 42 48 51 54 63 57

Y 19 21 27 31 36 42 49 51 59 60

5. Find Karl Pearson's coefficient of correlation and comment.

X 78 89 96 69 59 79 68 61

Y 125 137 156 112 107 136 123 108

6. Find Product moment coefficient of correlation between X and Y from the following data.

Marks in
65 66 67 67 68 69 70 72
English (X)
Marks in
67 68 65 68 72 72 69 71
Mathematics (Y)


7. Find Product moment coefficient of correlation between X and Y from the following data.

Price in Rs (X) 100 90 85 92 90 84 88 90


Sales in Units (Y) 500 610 700 630 670 800 800 750

8. Two persons were asked to watch ten specified TV programmes and offer their evaluation by
rating them 1 to 10. These ratings are given below.

TV Programme A B C D E F G H I J
Ranks given by X 4 6 3 9 1 5 2 7 10 8
Ranks given by Y 2 3 4 9 5 7 1 10 8 6

Calculate Spearman's coefficient of correlation for the two ratings.

9. Campus Stores has been selling the Believe It or Not: Wonders of Statistics Study Guide for 12
semesters and would like to estimate the relationship between sales and number of sections of
elementary statistics taught in each semester. The following data have been collected:

Sales (Units) 33 38 24 61 52 45
Number of sections 3 7 6 6 10 12
Sales (units) 65 82 29 63 50 79
Number of sections 12 13 12 13 14 15

(a) Develop the estimating equation that best fits the data.
(b) Calculate the sample coefficient of determination and the sample coefficient of correlation.


10. The director of a management training programme is interested to know whether there is a positive association between a trainee's score prior to his/her joining the programme and the same trainee's score after the completion of the training. The director has obtained the scores of 10 trainees as follows:

Trainee 1 2 3 4 5 6 7 8 9 10
Rank score1 1 4 10 8 5 7 3 2 6 9
Rank score2 2 3 9 10 3 6 1 6 7 8

Determine the degree of association between pre-training and post-training scores.


by
Dr.G.V.R.K. Acharyulu
Associate Professor
Apollo Institute of Health Management
Hyderabad.


5. REGRESSION ANALYSIS

5.0 LEARNING OBJECTIVES:

After reading this lesson, you should be able to:

• Understand the concept of regression
• Understand the concept of error
• Understand and use the least squares method to calculate the equation of a regression line for a given set of data
• Use alternative methods to obtain a regression line

5.1. INTRODUCTION:

Literally, the word "Regression" means 'stepping back or returning to the average value'. Regression was first used by Sir Francis Galton, a British biometrician, in the 19th century in studies estimating the extent to which the stature of the sons of tall parents regresses back to the mean stature of the population. The interesting features of the study were:

i) Tall fathers have tall sons and short fathers have short sons;
ii) The average height of the sons of a group of tall fathers is less than that of the fathers, and the average height of the sons of a group of short fathers is more than that of the fathers.

In today's world, regression is used in many fields with diversified areas of application. It is especially used in business management and economics to study the relationship between two or more economic variables that are related causally, and for the estimation of
• Demand and supply curves,
• Cost functions,
• Production and consumption functions,
• The effect of advertising on sales, and
• Prediction of sales for a given advertising expenditure, and so on.


Prediction (estimation) is one of the major areas of interest in most spheres of human activity. Estimation of future production, consumption, prices, sales, income, profits, investments, demand, etc. is of paramount importance to businessmen. Population estimates and projections are indispensable for efficient planning of an economy.
In the pharmaceutical industry, it is important and necessary to estimate the effect of a new drug on patients and its ill effects, if any, in the short as well as the long run.

5.2 REGRESSION:

According to M.M Blair, "Regression analysis is a mathematical measure of the average relationship
between two or more variables in terms of the original units of the data". If regression analysis is
confined only to two variables then it is termed as Simple regression.
Business examples:
1. Expenditure of a person depends on his income,
2. Yield of a crop depends on the rainfall,
3. The demand of a product depends on its price etc.

5.2.1 MULTIPLE REGRESSION:


The regression analysis for studying more than two variables at a time is known as "Multiple
Regression". However, we confine ourselves to only Simple regression. In regression analysis, we
come across two types of variables.
Dependent Variable:
The variable whose value is influenced or is to be predicted is termed as dependent variable.
Dependent variable is also known as regressed (or explained) variable.
Independent Variable:
The variable which influences the dependent variable or which is used for prediction is called the independent variable. The independent variable is also known as the regressor (or predictor or explanatory) variable.

5.2.2 LINEAR REGRESSION:

If we plot the given bivariate data on graph paper and obtain the scatter diagram, the points on the diagram will more or less concentrate around a curve called the "curve of regression". Often such a curve is not distinct, is quite confusing and is sometimes complicated too. A graph (Fig. 5.1) is referred to for the purpose of illustration.

If the regression curve is a straight line, we say that the variables under study are related in linear
fashion. The regression equation for such a relationship is yi = a + b xi, a straight line. In case of a
linear regression, the values of the dependent variable increase or decrease by a constant amount for a
unit change in the independent variable.
If the regression curve is not a straight line, then the regression is called non-linear (curvilinear) regression. The regression equation will involve squared terms like x², y² and/or product terms like xy, xz and so on. We confine ourselves to linear regression between two variables only.

5.3. LINES OF REGRESSION:

Line of regression is the line, which gives the best estimate of one variable for any given value of the
other variable. In case of two variables x and y, we shall have two lines of regression, called, a)
Regression of y on x and b) Regression of x on y.
Definition:
• The line of regression of y on x is the line which gives the best estimate of the value of y for any specified (given) value of x.
• The line of regression of x on y is the line which gives the best estimate of the value of x for any specified (given) value of y.


The term best fit is interpreted in accordance with the "Principle of Least Squares" which consists of
minimizing the sum of the squares of the residuals estimated. The residual is also known as error and
is defined as "the deviation between the given observed value and the value of the estimate as given by
the line of best fit".

5.4. REGRESSION EQUATION OF Y ON X:

Let (x1, y1), (x2, y2), ..., (xn, yn) be n pairs of observations on the variables x and y under study. Let yi = a + b xi be the line of best fit (regression) of y on x.

For the purpose of illustration, consider two points P2(x2, a + b x2) and Q2(x5, a + b x5) on the line of best fit y = a + b x, and the corresponding observed points P1(x2, y2) and Q1(x5, y5) on the scatter diagram.

[Scatter diagram: Y variable against X variable, showing the observed points P1(x2, y2) and Q1(x5, y5), the fitted points P2(x2, a + b x2) and Q2(x5, a + b x5) on the line of best fit, and the heights M1 and M2 of P1 and P2 above the X axis.]

Since P1 and P2 have the same X co-ordinate, the difference of their heights, P1M1 − P2M2 = P1P2, is called the error for that observation. In particular, P2M2 = a + b x2 and P1M1 = y2, hence the error is e2 = y2 − (a + b x2). Similarly, for Q1(x5, y5) and Q2(x5, a + b x5), the error is e5 = y5 − (a + b x5).

Here e2 is positive and e5 is negative. As explained, an error can be negative, positive or zero (zero if y2 = a + b x2).

The squared errors are defined as:

e2² = (y2 − a − b x2)²
e5² = (y5 − a − b x5)²

Let E be the sum of the squares of the errors for the n pairs; then

E = Σ(yi − a − b xi)²

Sometimes Σ(yi − ye)² is used to represent E, where ye denotes the estimated value of y. By applying the maxima and minima principles of differential calculus, we get two equations, called the "NORMAL EQUATIONS":

Σyi = n a + b Σxi
Σxi yi = a Σxi + b Σxi²

Using these two normal equations, we can solve uniquely for the values of 'a' and 'b'.

5.5. REGRESSION EQUATION OF X ON Y:

Similarly, the normal equations for obtaining the regression equation of x on y, xi = a + b yi, are

Σxi = n a + b Σyi
Σxi yi = a Σyi + b Σyi²

The values of 'a' and 'b' are obtained from the above formulae by interchanging the variables x and y.


Solved Example 1:

Let us illustrate the application of the above formulae by an example:


The following table gives the age (in years) of cars of a certain company and the annual maintenance cost.

Age of car (Xi)   Maintenance cost (Yi)    XiYi    Xi²    Yi²
  (years)            (in Rs. '00)
    2                    10                 20      4     100
    4                    20                 80     16     400
    6                    25                150     36     625
    8                    30                240     64     900
   20                    85                490    120    2025

The normal equations for the regression equation of y on x, yi = a + b xi, are

Σyi = n a + b Σxi
Σxi yi = a Σxi + b Σxi²

We have n = 4, Σxi = 20, Σyi = 85, Σxiyi = 490, Σxi² = 120.

By substituting these values in the above equations, we get

85 = 4a + 20b .......... (1)
490 = 20a + 120b ...... (2)

To solve the above equations, we adopt the procedure of elimination. To eliminate 'a', we multiply the
first equation by 5 to get


425 = 20a + 100b

The second equation is taken as it is. Subtracting the second equation from the first:

 425 = 20a + 100b
 490 = 20a + 120b
 (−)     (−)     (−)
 −65 =   0  −  20b

⇒ 20b = 65, or b = 65/20 = 3.25

Substituting b = 3.25 in the first equation:

85 = 4a + 20(3.25)

⇒ 4a = 85 − 65 = 20

⇒ a = 20/4 = 5

Hence the regression equation of y on x is

yi = 5 + 3.25 xi

This equation is very useful to predict the value of yi (maintenance cost) if we know xi, the age
(in years) of the car.
For example, we predict the annual maintenance cost for a car with 9 years age. This is obtained by
substituting xi = 9 in the equation yi = 5 + 3.25 xi. Annual maintenance cost of a car with xi = 9 years
age is
yi = 5 + 3.25 x 9
= 5 + 29.25 = 34.25 (hundreds of Rs.)

Similarly, we can fit the regression equation of x on y, xi = a + b yi, for which the normal equations are

Σxi = n a + b Σyi
Σxi yi = a Σyi + b Σyi²

Substituting the sums and products obtained from the above table, we have

20 = 4a + 85b
490 = 85a + 2025b

To eliminate 'a', we multiply the first equation by 85 and the second by 4. The result is

1700 = 340a + 7225b
1960 = 340a + 8100b
 (−)     (−)     (−)
−260 =   0  −  875b

b = 260/875 = 0.297 ≈ 0.30

The value of 'a' can be obtained from

20 = 4a + 85(0.30)
4a = 20 − 25.5 = −5.5
a = −5.5/4 = −1.375

The regression equation of x on y is

xi = −1.375 + 0.30 yi
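The fit can be verified programmatically. The sketch below (illustrative only; Python/NumPy assumed) solves the two normal equations for the car-maintenance data and recovers a = 5, b = 3.25 for y on x:

import numpy as np

x = np.array([2, 4, 6, 8], dtype=float)       # age of car (years)
y = np.array([10, 20, 25, 30], dtype=float)   # maintenance cost (Rs '00)

def fit_line(u, v):
    # Solve the normal equations  sum(v) = n*a + b*sum(u)  and
    # sum(u*v) = a*sum(u) + b*sum(u^2)  for the regression of v on u.
    n = len(u)
    A = np.array([[n, u.sum()], [u.sum(), (u * u).sum()]])
    rhs = np.array([v.sum(), (u * v).sum()])
    a, b = np.linalg.solve(A, rhs)
    return a, b

print(fit_line(x, y))   # (5.0, 3.25)  ->  y = 5 + 3.25 x
print(fit_line(y, x))   # approx (-1.31, 0.297); the text obtains x = -1.375 + 0.30 y
                        # because it rounds b to 0.30 before back-substituting for a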

5.6. ESTIMATION OF 'a' AND 'b' DIRECTLY:

Slope (b): the coefficient of the xi term in the equation yi = a + b xi. It shows how much the dependent variable changes for a one-unit change in the independent variable. When positive, it gives the increase in yi per unit increase in xi.

y-intercept (a): the value of yi where the line yi = a + b xi cuts the Y axis. It is the constant term in the estimating equation.

The values of 'a' (y-intercept) and 'b' (slope) are calculated from the following formulae:

a = [(Σxi²)(Σyi) − (Σxi)(Σxiyi)] / [n Σxi² − (Σxi)²]

b = [n Σxiyi − Σxi Σyi] / [n Σxi² − (Σxi)²]

Then we substitute the values of a and b obtained above into yi = a + b xi, which gives the regression equation of y on x.

Solved Example 2:

We have n = 4, Σxi = 20, Σyi = 85, Σxiyi = 490, Σxi² = 120.

a = [(Σxi²)(Σyi) − (Σxi)(Σxiyi)] / [n Σxi² − (Σxi)²]
  = [(120)(85) − (20)(490)] / [4(120) − (20)²]
  = (10200 − 9800) / (480 − 400)
  = 400/80 = 5

b = [n Σxiyi − Σxi Σyi] / [n Σxi² − (Σxi)²]
  = [(4)(490) − (20)(85)] / [4(120) − (20)²]
  = (1960 − 1700) / (480 − 400)
  = 260/80 = 3.25

The values of 'a' and 'b' obtained are the same as those obtained using the normal equations. Similarly, we can estimate 'a' and 'b' for the regression equation of x on y by interchanging x and y in the above formulae. Let us look at the two regression equations obtained above.
Regression equation of y on x: yi = 5 + 3.25 xi

Regression equation of x on y: xi = - 1.375 + 0.30 yi

The regression co-efficient 3.25 is called as the regression co-efficient of y on x and is denoted by
byx = 3.25.
Similarly the regression co-efficient 0.30 is called as the regression co-efficient of x on y and is
denoted by bxy = 0.30.

On the basis of regression equation of Y on X, we can find out the value of Y for any value of X.
Similarly, we can find out the value of X for any value of Y based on the regression equation of X on
Y.

5.7. PROPERTIES OF REGRESSION CO-EFFICIENTS:

1. Both regression coefficients take the same sign, i.e., byx and bxy will both be either positive or negative.
2. The correlation coefficient is the geometric mean of the regression coefficients, i.e.,

   γ = ± √(bxy · byx)

   γ is positive if byx is positive, and negative if byx is negative.
3. If one of the regression coefficients is greater than unity (one), then the other must be less than unity.
4. The regression equations pass through their respective means.
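As an illustration of property 2, take the coefficients obtained in Solved Example 1: byx = 3.25 and bxy ≈ 0.297, so γ = +√(3.25 × 0.297) ≈ √0.966 ≈ 0.98, the positive root being taken because both regression coefficients are positive.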


5.8. REGRESSION EQUATIONS USING 'γ':

Sometimes we are given data on x and y as below.

Solved Example 3:
x̄ = 15, ȳ = 110, V(x) = 25, V(y) = 625 and γ(x, y) = 0.81.
Then we use the following procedure to get the regression equations.

i) Regression equation of y on x:

(yi − ȳ) = γ √[V(y)/V(x)] (xi − x̄),   or   (yi − ȳ) = γ (σy/σx)(xi − x̄)

Since byx = γ σy/σx,

(yi − ȳ) = byx (xi − x̄)

(yi − 110) = 0.81 √(625/25) (xi − 15)

yi = 110 + (0.81)(5)(xi − 15) = 110 + 4.05(xi − 15) = 110 − 4.05(15) + 4.05 xi

yi = 49.25 + 4.05 xi


ii) Regression equation of x on y:

(xi − x̄) = γ √[V(x)/V(y)] (yi − ȳ),   or   (xi − x̄) = γ (σx/σy)(yi − ȳ)

Since bxy = γ σx/σy,

(xi − x̄) = bxy (yi − ȳ)

(xi − 15) = 0.81 √(25/625) (yi − 110)

xi = 15 + 0.81(0.20)(yi − 110) = 15 + 0.162(yi − 110)

xi = −2.82 + 0.162 yi
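When only the means, variances and γ are given, as in Solved Example 3, both lines can be produced mechanically. A small sketch (Python assumed; the function name is illustrative):

import math

def regression_lines(x_bar, y_bar, var_x, var_y, r):
    # b_yx = r * sd_y / sd_x and b_xy = r * sd_x / sd_y;
    # each line passes through the point of means (x_bar, y_bar).
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)
    byx, bxy = r * sd_y / sd_x, r * sd_x / sd_y
    return (y_bar - byx * x_bar, byx), (x_bar - bxy * y_bar, bxy)

print(regression_lines(15, 110, 25, 625, 0.81))
# ((49.25, 4.05), (-2.82, 0.162)) -- the two equations of Solved Example 3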

5.9. ESTIMATION – REGRESSION COEFFICIENTS USING REGRESSION EQUATIONS:

In certain situations we need to estimate the regression coefficients, means and variances of the
variables directly from the given regression equations.


Solved Example 4:
Let the two regression equations be

3X + 2Y − 26 = 0
6X + Y − 31 = 0

i) To find the means, we solve these two simultaneous linear equations by elimination. Multiplying the 2nd equation by 2, we get

12X + 2Y − 62 = 0

Subtracting the 1st equation from this, we get

9X − 36 = 0, or X = 36/9 = 4.

Substituting X = 4 in the 1st equation,

3(4) + 2Y − 26 = 0, or 2Y = 26 − 12 = 14, or Y = 7.

Since the regression equations pass through their respective means, we can say that X̄ = 4 and Ȳ = 7.

ii) The given regression equations are used to obtain 'γ', the correlation coefficient.

Let us assume 3X + 2Y − 26 = 0 to be the regression equation of Y on X. Hence

2Y = 26 − 3X, or Y = 13 − (3/2)X

Then byx = −3/2 = −1.5.

The 2nd equation, 6X + Y − 31 = 0, will be the regression equation of X on Y. Hence

6X = 31 − Y, or X = 31/6 − (1/6)Y

Then bxy = −1/6.

The value of γ is given by


γ = ± √(byx · bxy)

γ = ± √[(−3/2) × (−1/6)] = ± √(1/4) = ± 1/2

Since byx and bxy are both negative, γ is negative. So γ = −1/2 = −0.5.

iii) We can also use the regression coefficients for estimating the unknown variance (standard deviation) of one of the variables, given the other. Since

byx = γ (σy/σx),

if σy = 5 then σx can be calculated as

−1.5 = −0.5 × (5/σx)

σx = (−0.5 × 5) / (−1.5) = 5/3 ≈ 1.67


Solved Example 5:
Compute the two regression equations on the basis of the following information:

Parameters x y
Mean 40 45
Standard deviation 10 9

Karl Pearson's correlation coefficient = 0.5. Also calculate the value of y for x = 48, using the appropriate regression equation.

The regression equation of y on x is:

(yi − ȳ) = γ (σy/σx)(xi − x̄)

Substituting the values of x̄, ȳ, σx, σy and γ given in the problem, we get:

(yi − 45) = 0.5 × (9/10) × (xi − 40)

yi = 45 + 0.45(xi − 40) = 45 + 0.45 xi − 18

yi = 27 + 0.45 xi

The regression equation of x on y is:

(xi − x̄) = γ (σx/σy)(yi − ȳ)

Substituting the given values, we get:

(xi − 40) = 0.5 × (10/9) × (yi − 45)

xi = 40 + 0.556(yi − 45) = 40 + 0.556 yi − 25.02

xi = 14.98 + 0.556 yi

In order to estimate the value of y for x = 48, we have to use regression equation of y on x:

yi = 27 + 0.45 xi

yi = 27 +( 0.45 x 48)

y = 27 + 21.6

y = 48.6
Therefore, the value of y is 48.6 when the value of x = 48

5.10. Difference between Correlation and Regression

The following are the points of difference between correlation and regression:
i) Whereas the coefficient of correlation is a measure of the degree of co-variability between X and Y, the objective of regression analysis is to study the nature of the relationship between the variables so that we may be able to predict the value of one on the basis of the other.
ii) Correlation is merely a tool for ascertaining the degree of relationship between two variables, and therefore we cannot say that one variable is the cause and the other the effect. For example, a high degree of correlation between price and demand for a certain commodity at a particular point of time may not suggest which is the cause and which is the effect. In regression analysis, however, one variable is taken as independent and the other as dependent, thus making it possible to study the cause-and-effect relationship.
iii) In correlation analysis, the coefficient of correlation is a measure of the direction and degree of the linear relationship between two variables X and Y. It is symmetric: the correlation between X and Y is the same as that between Y and X, and it is immaterial which of X and Y is the dependent variable and which is the independent variable. In regression analysis, the regression coefficients of Y on X and of X on Y are not symmetric, so it definitely makes a difference which variable is dependent and which is independent.
iv) There may be nonsense correlation between two variables which is purely due to chance and has no practical relevance, such as an increase in income and an increase in the weight of a group of people. However, there is nothing like nonsense regression.
v) The correlation coefficient is independent of change of scale and origin. Regression coefficients are independent of change of origin but not of scale.

5.11. SUMMARY:

Correlation measures the degree of relationship between two variables. Regression exploits the relationship between two variables so that they can be linked through a regression equation. The regression equation helps in predicting the future behaviour of the dependent variable for a given value of the independent variable, and such prediction plays a key role in formulating the strategic decision framework of management. While the coefficient of correlation is a measure of the degree of co-variability between X and Y, the objective of regression analysis is to study the nature of the relationship between the variables so that we may be able to predict the value of one on the basis of the other.


5.12. GLOSSARY:
Dependent Variable: The variable of interest or focus, which is influenced by one or more independent variable(s). It is the variable being predicted or explained, denoted by Y in the regression equation.

Independent (or explanatory) Variable: A variable that can either be set to a desired value or take values that can be observed but not controlled. It is the variable doing the predicting or explaining, denoted by X in the regression equation.

Regression Line: A line of best fit, which can always be found for a scatter diagram by using the method of least squares.

Regression: A method that uses past data to estimate the relationship between two variables, relating a dependent (response) variable to a number of independent variables on the basis of a set of data.

5.13. REFERENCES:
1) Gupta, S.C., "Fundamentals of Statistics", Himalaya Publishing House, New Delhi, 2004.
2) Murray R. Spiegel and Larry J. Stephens, "Statistics – Schaum's Outlines", Third Edition, McGraw-Hill International Editions, 1999.
3) Sancheti, D.C. and Kapoor, V.K., "Business Statistics", New Delhi.


5.14 REVIEW QUESTIONS:

1. Explain the concept of 'regression'. How is it important in economic analysis?

2. State some of the important properties of regression coefficients. How are these helpful in analysing
the regression lines?

3. An investigation into the demand for television sets in 7 towns has resulted in the following data:

Town A B C D E F G
Population (lakh) (x) 11 14 14 17 17 21 25
No. of TV sets demanded(Â000)(y) 15 27 27 30 34 38 46

Fit a linear regression of y on x and estimate the demand for TV sets for a city with population of
(a) 20 lakh and (b) 32 lakh

4. Fit a straight line to the following data:

X (Age of a Car) 1 3 5 7 9
Y (Maintenance Cost) 15 26 34 40 52
.
5. An enquiry into 50 families to study the relationship between Expenditure on Housing (Xi) and
Expenditure on food (Yi) give the following results.
∑ Xi = 8500, ∑ Yi = 8500, σx = 60, σy = 60, γ = + 0.6
i) Obtain the regression lines of X on Y and Y on X
ii) When X = 200, determine the value of Y
Obtain two regression equations and estimate i) the yield of crops when the rainfall is 22 cms
and ii) the rainfall when the yield is 600 kgs.
X = Yield (in Kg) Y=Rainfall (in cms)
Mean 508.4 26.7
S.D 36.8 4.6

γ = + 0.52


6. Estimate the sales for a given number of advertisements from the following data for
X = 10

X (No of Ads) 3 7 4 2 0 4 1 2
Y (Sales) 11 18 9 4 7 6 3 8

Develop the other regression equation also.

7. When X = 3, Y = 6, n = 4, ∑XiYi = 78, ∑Xi2 = 44, ∑Yi2 = 150:


Find both the regression equations and γ.

8. The following data relate to the scores obtained by 9 salesman of a company in an intelligence test
and their weekly sales in thousand rupees:-

Salesman: A B C D E F G H I
Intelligence 50 60 50 60 80 50 80 40 70
Test Scores
Weekly 30 60 40 50 60 30 70 50 60
Sales

(A) Obtain the regression equation of sales on intelligence test scores of the salesmen
(B) If the intelligence test scores of a salesman in 65, what would be his expected weekly sales?

9. The following table gives the aptitude test scores and productivity indices of 10 workers selected at
random:
Aptitude scores (X) 60 62 65 70 72 48 53 73 65 82
Productivity index(Y) 68 60 62 80 85 40 52 62 60 81

Calculate the two regression equations and estimate (i) the productivity index of a worker whose test score is 92, and (ii) the test score of a worker whose productivity index is 75.


10. Bank of Lincoln is interested in reducing the amount of time people spend waiting to see a
personal banker. The bank is interested in the relationship between waiting time (Y) in minutes and
number of bankers on duty (X).

X 2 3 5 4 2 6 1 3 4 3 3 2 4
Y 12.8 11.3 3.2 6.4 11.6 3.2 8.7 10.5 8.2 11.3 9.4 12.8 8.2

• Calculate the regression equation that best fits the data.
• Calculate the sample coefficient of determination and the sample coefficient of correlation.
by
Dr.G.V.R.K. Acharyulu
Associate Professor
Apollo Institute of Health Management
Hyderabad.


6. TIME SERIES

6.0 LEARNING OBJECTIVES:

After reading this lesson, you should be able to


• Recognise and define different components of a time series
• Obtain trend, seasonal index, cyclical and irregular movements by using appropriate methods.

6.1 INTRODUCTION:

A time series is a set of statistical observations arranged in chronological order. Time series are usually used with reference to economic data, and economists are largely responsible for the development of the techniques of time series analysis. Time series can also be used for different purposes, such as managing production, studying the variation of different variables, and tracking the growth or decline of the profits of an organisation. Businessmen forecast the future demand for their products in order to plan production so that overstocking or inadequate production can be avoided. Economists estimate the future population on the basis of time series, so that the supply of goods, jobs, vehicles, etc. required in future can be analysed.

1. "A time series is a set of statistical observations arranged in chronological order." – Morris Hamburg
2. "A time series consists of statistical data which are collected, recorded or observed over successive increments." – Patterson
3. "A time series may be defined as a collection of magnitudes belonging to different time periods, of some variable or composite of variables, such as production of steel, per capita income, gross national product, price of tobacco, or index of industrial production." – Ya-Lun-Chou

6.2 IMPORTANCE OF TIME SERIES ANALYSIS:

There are several reasons for undertaking a time series analysis.

• Firstly, the analysis of a time series enables us to understand past behaviour or performance. It is possible to know how the data have changed over time and to find out the probable reasons responsible for such changes. If the past performance, say of a company, has been poor, it can take corrective measures to arrest the poor performance.

• Secondly, a time series analysis helps directly in business planning. A firm can know the long-term trend in the sales of its products. It can find out at what rate sales have been increasing over the years. This may help it in making projections of its sales for the next few years and in planning the procurement of raw material, equipment and manpower accordingly.

• Thirdly, a time series analysis enables one to study such movements as cycles that fluctuate around the trend. Knowledge of the cyclical pattern in certain series of data will be helpful in making generalisations in the concerned business or industry.

• Finally, a time series analysis enables one to make meaningful comparisons between two or more series regarding the rate or type of growth. For example, growth in consumption at the national level can be compared with that in the national income over a specified period. Such comparisons are of considerable importance to business and industry.

6.3 COMPONENTS OF A TIME SERIES:

A time series may contain one or more of the following four components:
a) Secular Trend
b) Seasonal variations
c) Cyclical variations
d) Irregular variations

6.3.1 SECULAR TREND(T):

We can observe a steady increase or decrease in the variable over a period of time. The rate of increase or decrease can also vary, and after a period of growth or decline the series may reverse itself and enter a period of decline or growth. Trends can be broadly classified into
- linear or straight-line trend
- non-linear trend

Generally, the longer the period covered, the more significant the trend. When the period is short, the secular movements cannot be expected to reveal themselves clearly, and the general drift of the series may be unduly influenced by cyclical fluctuations. As a working rule, the minimum period to be considered in trend analysis is two to three cycles. It is not necessary that the rise or fall must continue in the same direction throughout the period. As long as the period as a whole is characterised by a predominant upward or downward movement (excepting years which show a different tendency), we can say that a secular trend is present.

6.3.2 SEASONAL VARIATIONS(S):


Seasonal variations are those periodic movements in business activity which occur regularly every year and have their origin in the nature of the year itself.
The factors that cause seasonal variations are
(i) climate and weather conditions
(ii) customs, traditions and habits

6.3.3 CYCLICAL VARIATIONS(C):

The recurrent variations in a time series that usually last longer than a year and are regular neither in amplitude nor in length are called cyclical variations. There are four well-defined periods or phases in a cycle: prosperity, decline, depression and improvement (recovery). Knowledge of cyclical variations is useful in framing suitable policies for stabilising the level of business activity, i.e., for avoiding booms and depressions, as both are bad for an economy.

6.3.4 IRREGULAR VARIATIONS(I):


Also called erratic, accidental random. These variations refer to such variations in business activity
which do not repeat in particular definite pattern.
The result of four components (Y):
If the relationship is assumed to be multiplicative,
Y= T X S X C X I
If the relationship is assumed to be additive,
Y= T + S + C + I


6.4 MEASUREMENT OF TREND:

Given any long-term series, we wish to determine and present the direction which it takes: is it growing or declining? There are two important objectives of trend measurement:

i) To find out the trend characteristics in and of themselves:

In studying the trend in and of itself, we need to ascertain the growth factor. The growth factor helps us in predicting the future behaviour of the data. If a trend can be determined, the rate of change can also be ascertained and tentative estimates concerning the future can be made accordingly.

Example: The growth in the textile industry can be compared with the growth in the economy as a whole or with the growth of other industries.

ii) To enable us to eliminate the trend in order to study the other elements:

The elimination of the trend leaves us with the seasonal, cyclical and irregular factors. These three relatively short-term elements can then be studied divorced from the long-term factor.

Thus trend is measured both to find out the trend characteristics in and of themselves and to enable us to eliminate the trend in order to study the other elements. The various methods that can be used for determining trend are as follows:

• Free-hand or graphical method
• Semi-averages method
• Moving average method
• Method of least squares

6.4.1 FREE HAND OR GRAPHICAL METHOD:

This is the simplest method of studying the trend. The procedure of obtaining a straight line trend is
given below:-


• Plot the time series.
• Examine carefully the direction of the trend based on the plotted information.
• Draw a straight line which best fits the data. The line drawn should conform to the following conditions:
  a. The line should be smooth.
  b. The sum of the vertical deviations of the observations above the trend line should be equal to the sum of the vertical deviations of the observations below it.
  c. The trend line should bisect the cycles so that the area above the trend equals the area below it.
  d. The sum of the squares of the vertical deviations from the trend should be a minimum.

[Figure: Prediction of sales – line chart of sales in thousands (0 to 2500) against the years 1998 to 2005, with a free-hand trend line]

If sales in the year 2006 are to be predicted, extend the trend line by free hand and read the corresponding sales off the Y axis.
Merits:
¾ This is the simplest method of measuring trend.
¾ This method is very flexible in that it can be used regardless of whether the trend is a
straight line or a curve
¾ The trend line drawn by a statistician experienced in computing the trend and having
knowledge of the economic history of the concern or the industry under analysis may
be a better expression of the secular movement than a trend fitted by the use of a rigid
mathematical formula which, while providing a good fit to the points, may have no other
logical justification.


Limitations:

• This method is highly subjective because the trend line depends on the personal judgment
of the investigator, and therefore different persons may draw different trend lines from the
same set of data.
• Since freehand curve fitting is subjective it cannot have much value if it is used as a basis
for predictions.
• It is very time consuming to construct a freehand trend if a careful and conscientious job
is done.

6.4.2 METHOD OF SEMI-AVERAGES:

The procedure of obtaining the trend by semi average method is as follows:-


a) The given data are divided into two parts, preferably with the same number of years. (For e.g.:
if data are given from 1995 to 2004, i.e. over a period of 10 years, the two equal parts will be
of five years each, i.e. from 1995-1999 and from 2000-2004.) In case of an odd number of years, two equal
parts can be made by simply omitting the middle year.
b) After the data have been divided into two parts, (i.e 1995-1999 and 2000-2004, an average of
each part is obtained to get exactly two points).
c) Each point is then plotted at the mid point of the class interval covered by respective part and
then the two points are joined by a straight line which gives us the required trend line.
Example:1
Year    Sales in thousands    Total of five years    Semi averages
1995 500
1996 550
1997 600 3650 730
1998 800
1999 1200
2000 750
2001 1300
2002 1500 6900 1380
2003 1650
2004 1700


(Figure: Prediction of sales by the semi-averages method. The two semi-averages are plotted at the mid-years of the two halves and joined by a straight trend line; X-axis: years 1996 to 2006, Y-axis: sales in thousands.)

If sales for the year 2006 are to be predicted, extend the trend line and the corresponding sales can be read off on the Y-axis.
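For readers who wish to verify the arithmetic, the following short Python sketch (illustrative only; the variable names and the 2006 extrapolation step are our own, not part of the text) reproduces the two semi-averages of Example 1 and extends the trend line:

    # Semi-averages method for the data of Example 1 (illustrative sketch)
    years = list(range(1995, 2005))
    sales = [500, 550, 600, 800, 1200, 750, 1300, 1500, 1650, 1700]

    half = len(sales) // 2
    first_avg = sum(sales[:half]) / half      # 3650 / 5 = 730, plotted at the mid-year 1997
    second_avg = sum(sales[half:]) / half     # 6900 / 5 = 1380, plotted at the mid-year 2002

    # Joining the two points gives the trend line; its slope is the yearly change.
    slope = (second_avg - first_avg) / (2002 - 1997)        # 130 (thousand) per year
    sales_2006 = second_avg + slope * (2006 - 2002)         # about 1900 thousand
    print(first_avg, second_avg, slope, sales_2006)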

Merits:
• This method is simple to understand as compared to the moving average and the method of
least squares.
• This is an objective method of measuring trend, as everyone who applies the method is bound
to get the same result (leaving aside arithmetic mistakes).

Limitations:

• This method assumes a straight-line relationship between the plotted points regardless of
whether that relationship actually exists.
• The limitations of the arithmetic average automatically apply: if there are extreme values in either
half or in both halves of the series, the trend line will not give a true picture of the growth factor.

6.4.3 METHOD OF MOVING-AVERAGES:


When a trend is to be determined by the method of moving averages, the average value for a number
of years (or months or weeks) is secured and this average is taken as the normal or trend value for the
unit of time falling at the middle of the period covered in the calculation of the averages. The effect of


averaging is to give a smoother curve, lessening the influence of the fluctuations that pull the annual
figures away from the general trend.

Note: Since the moving average method is most commonly applied to data which are characterized by
cyclical movements, it is necessary to select a period for moving average which coincides with the
length of the cycle. Otherwise the cycle will not be entirely removed.

Ordinarily, the necessary period will range between three and ten years for general business series but
even longer periods are required for certain types of data.
The 3 yearly moving averages shall be computed as follows:

(a + b + c)/3, (b + c + d)/3, (c + d + e)/3, (d + e + f)/3, ...

and for a 5-yearly moving average

(a + b + c + d + e)/5, (b + c + d + e + f)/5, (c + d + e + f + g)/5, ...

where a, b, c, d, e, ... are the data for the given periods.

Example:2

Year    Sales in thousands    Three-yearly moving total    Moving average
1995 500
1996 550 1650 550
1997 600 1950 650
1998 800 2600 867
1999 1200 2750 917
2000 750 3250 1083
2001 1300 3550 1183
2002 1500 4450 1483
2003 1650 4850 1617
2004 1700


If the sales for 2005 are to be predicted, the graph of the moving averages can be extended, and the corresponding value on the Y-axis
is the required sales value.
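A small Python sketch (illustrative only; the names used are our own) that reproduces the 3-yearly moving averages of Example 2:

    # 3-yearly moving averages for the sales data of Example 2 (illustrative sketch)
    years = list(range(1995, 2005))
    sales = [500, 550, 600, 800, 1200, 750, 1300, 1500, 1650, 1700]

    period = 3
    for i in range(len(sales) - period + 1):
        window = sales[i:i + period]
        total = sum(window)
        average = total / period
        centre_year = years[i + period // 2]        # the value is placed against the middle year
        print(centre_year, total, round(average))   # e.g. 1996 1650 550, 1997 1950 650, ...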

Merits:

• It is simple compared to method of least square


• It is a flexible method of measuring trend, for the reason that if a few more figures are added to
the data the entire calculations are not changed; we only get some more trend values.
• If the period of moving average happens to coincide with the period of cycle fluctuations in
the data, such fluctuations are automatically eliminated.
• It is particularly effective if the trend of a series is very irregular.

Limitations:

• Trend values can not be computed for all the years.


• Great care has to be exercised in selecting the period of moving average.
• The length of the various cycles in any series will usually vary considerably and therefore no
moving average can completely remove the cycle.
• The moving average is appropriate for trend computation only when:-
a) The purpose of investigation does not call for current analysis or forecasting
b) The trend is linear and
c) The cyclical variations are regular both in period and amplitudes.


Unfortunately, these conditions are encountered very infrequently.


Centering:
If the moving average is an even-period moving average, the moving totals are placed at the centre of the time span.
Since the values so placed do not coincide with an original time period, we synchronize the moving averages with the
original data. This centering consists of taking a two-period moving average of the moving
averages.
Merits:
• Simple as compared to the method of least squares.
• If a few more figures are added, the entire calculations are not changed.
• It follows the general movements of the data and its shape is determined by the data.
• Effective when the trend is irregular.

Demerits:
• Trend values cannot be computed for all years.
• One has to use one's own judgment for the choice of the period of averages.
• Since the moving average is not represented by a mathematical function, this method can't
be used in forecasting.
• Appropriate only when the trend is linear and the cyclical variations are regular both in period and
amplitude.

6.4.4 METHOD OF LEAST SQUARES: This method is most widely used in practice. It is a
mathematical method and with its help a trend line is fitted to the data in such a manner that the
following two conditions are satisfied:-
(1) ∑(y − yc) = 0

i.e. the sum of the deviations of the actual values of 'y' from the computed values of 'y' is zero.

(2) ∑(y − yc)² is least

i.e. the sum of squares of the deviations of the actual values from the computed values is least from this line, and
hence the name method of least squares. The line obtained by this method is known as the line of best fit.

The straight line is represented by the equation


yc = a + bx
Where yc is used to designate the trend value
x represents time
a is the computed trend figure of the y variable when x = 0
b is the slope of the trend line, or the amount of change in the 'y' variable that is associated with a
change of one unit in the 'x' variable.

In order to determine the values of the constants a and b, the following two normal equations are to be
solved.
∑y = Na + b∑x
∑xy = a∑x + b∑x²

N represents the number of years (months or any other period) for which data are given.

(Figure: the line of best fit, yc = a + bx, drawn through the plotted sales (in thousands) for the years 1994 to 2006.)

Merits:
• This is a mathematical method of measuring trend and as such there is no possibility of
subjectiveness.
• It is the line from which the sum of the positive and negative deviations is zero and the sum of
the squares of the deviations is least,


i.e. ∑(y − yc) = 0

and ∑(y − yc)² is least.

• Trend values can be obtained for all the given time periods in the series

Limitations:

• Great care has to be taken in selecting the type of trend curve to be fitted. i.e. linear, parabolic
or some other type
• It is more tedious and time consuming than other methods.
• Predictions are based only on the long-term variation, i.e. the trend, and the impact of cyclical, seasonal
and irregular variations is ignored.
• Being a mathematical method it is not flexible- the addition of even one more observation
makes it necessary to do all the computations again.
Applications:
¾ Used in different situations like production, sales, exports, imports, etc., by business men
¾ Economists use it for forecasting population trends.
¾ Avoidance of unfavorable situations based on time series can be done as we can predict ( at
least to some extent) based on trend so that uncertainty is reduced.
¾ Certain seasonal trends help in proportionate production based on trend.
¾ Sudden changes in demand or rapid technological progress cannot be known in advance, but prediction to a
certain extent makes it possible for the firm to cope with such contingencies.
Example:3
Fit a trend line to the following data.
Year    Production of steel    Year    Production of steel
1995    20                     2000    25
1996    22                     2001    23
1997    24                     2002    26
1998    21                     2003    25
1999    23


Solution:
Moving averages:

Year    Production    3-yearly total    3-yearly average
1995 20
1996 22 66 22
1997 24 67 22.33
1998 21 68 22.66
1999 23 69 23
2000 25 71 23.66
2001 23 74 24.66
2002 26 74 24.66
2003 25

(Figure: plot of the 3-yearly moving averages against the years; the smoothed values rise from about 22 to about 24.7 between 1996 and 2002.)


Method of least squares:

Year    Production Y    X    XY    X²    Trend values Yc
1995    20    -4    -80    16    20.98
1996    22    -3    -66    9     21.54
1997    24    -2    -48    4     22.10
1998    21    -1    -21    1     22.66
1999    23    0     0      0     23.22
2000    25    1     25     1     23.78
2001    23    2     46     4     24.34
2002    26    3     78     9     24.90
2003    25    4     100    16    25.46
N = 9    ∑y = 209    ∑x = 0    ∑xy = 34    ∑x² = 60    ∑yc ≈ 209

yc = a + bX

a = ∑Y / N = 209 / 9 = 23.22

b = ∑XY / ∑X² = 34 / 60 = 0.56

Yc = 23.22 + 0.56X
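The computation can be checked with a short Python sketch (illustrative only; the variable names are our own). It codes the years as deviations from the middle year so that the x values sum to zero, exactly as in the table above:

    # Least-squares trend line for Example 3 (illustrative sketch)
    years = list(range(1995, 2004))            # 1995 to 2003, N = 9
    y = [20, 22, 24, 21, 23, 25, 23, 26, 25]   # production of steel

    mid = years[len(years) // 2]               # middle year, 1999
    x = [yr - mid for yr in years]             # coded deviations -4 ... +4, so sum(x) = 0

    a = sum(y) / len(y)                                                  # 209 / 9 = 23.22
    b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)  # 34 / 60 = 0.567
    trend = [round(a + b * xi, 2) for xi in x]

    print(round(a, 2), round(b, 2))
    print(trend)    # close to the Yc column above, which uses b rounded to 0.56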

Example: 4
Fit a straight line trend by the method of least squares to the following data. Assuming the same rate
of change continues what would be the predicted sales for the year 1998?

Year 1989 1990 1991 1992 1993 1994 1995 1996


Sales 76 80 130 144 138 120 174 190
(Rs.Lakh)

Calculate trend values from 1989 to 1996.


Solution.

Fitting straight line trend by the method of least squares

Year    Sales Y    Deviations from 1992.5 (X)    XY    X²    Yc
1989    76     -3.5    -266    12.25    80.16
1990    80     -2.5    -200    6.25     94.83
1991    130    -1.5    -195    2.25     109.5
1992    144    -0.5    -72     0.25     124.2
1993    138    +0.5    69      0.25     138.8
1994    120    +1.5    180     2.25     153.5
1995    174    +2.5    435     6.25     168.2
1996    190    +3.5    665     12.25    182.8
N = 8    ΣY = 1052    ΣX = 0    ∑XY = 616    ∑X² = 42

yc = a + bX

Since ΣX = 0,

a = ΣY / N = 1052 / 8 = 131.5

b = ΣXY / ΣX² = 616 / 42 = 14.67

Y = 131.5 + 14.67X
Y1989 = 131.5 + 14.67(-3.5) = 131.5 − 51.35 ≈ 80.16, and so on for the other years.

Thus
Y1998 = 131.5 + 14.67(5.5) ≈ 212.2

Thus the predicted sales for 1998 are about Rs. 212.2 lakh.


6.5 SUMMARY:
In this chapter the importance of time series analysis and the four components of time series analysis
i.e. Secular Trend, Seasonal variations, cyclical variations and Irregular variations have been
discussed. The Measurement of trend its objectives and the various methods that can be used for
determining trend- Free-hand or graphical method, Semi-averages method, moving average method,
Method of least squares their advantages and limitations have been brought out. The mechanical
approach of time series analysis is subject to considerable error and change. It is therefore necessary
for management to combine these simple procedures with knowledge of other factors in order to
develop workable forecasts.

6.6 GLOSSARY:

Cyclical component or fluctuations: In a time series, fluctuations around the trend line
that last for more than one year.

Deseasonalisation: A statistical process by which the seasonal


variation from a time series is eliminated.

Forecasting: Predicting the expected value of an item or


variable of interest.

Irregular component: Variation in a time series that occurs due to


chance.

Ratio-to-moving-Average method: A statistical technique used to measure seasonal


fluctuations.


Relative cyclical residual: A measure of cyclical variation wherein the


percentage deviation from the trend for each
observation in the series is used.

Residual method: A method of describing the cyclical component of


a time series. It is based on the assumption that
most of the variation in a time series not
explained by the secular trend is cyclical
variation.

Seasonal index: A measure used to adjust monthly or quarterly


data for seasonal fluctuations.

Seasonal variation: One of the components in a time series that


occurs during different seasons or periods of less
than one year duration.

Second degree equation: An equation in time series analysis to describe a


non-linear trend.

Secular trend or trend component: One of the components in a time series indicating
the long-term movement of an item or variable.

Time series: Data arranged in relation to time. Such data have


four components-trends, cycle, seasonal and
irregular movements.


6.7 REFERENCES:
1) Murray R. Spiegel and Larry J. Stephens, "Statistics: Schaum's Outlines", Third edition, McGraw-Hill International Editions, 1999.
2) Levin, Richard I. and Rubin, David S., "Statistics for Management", Prentice Hall of India, New Delhi, 2000.
3) Sancheti, D.C. and Kapoor, V.K., "Business Statistics", New Delhi.

6.8 EXERCISES:

1. What is a time series? Mention its important components. Explain them briefly.

2. What are the advantages of undertaking a time series analysis to a business firm?

3. Suppose we are given a time series data for 12 years-1989 to 2000 relating to sales of a certain
business firm. These data are given below.

Year Sales Year Sales


(Million Rs.) (Million Rs.)
1989 10 1995 15
1990 15 1996 24
1991 20 1997 15
1992 25 1998 21
1993 15 1999 15
1994 12 2000 24

You are asked to find out the three-year moving averages, starting from 1989.
4. Fit a straight line trend by the method of least squares for the following and draw it on the graph
paper:

Year Reserves Year Reserves


1988-89 612 1992-93 1001


1989-90 719 1993-94 1106


1990-91 820 1994-95 1231
1991-92 907

5. Calculate the three-monthly moving averages from the following data:

Jan. Feb March April May June


57 65 63 72 69 78
July Aug. Sep. Oct. Nov. Dec.
82 81 90 92 95 97

6. Fit a trend line by the method of semi-averages to the data given below. Estimate the sales for
1997. If the actual sale for that year is Rs.520 lakhs account for the difference between the two
figures.

Year Sales Year Sales


(Rs.lakhs) (Rs.lakhs)
1989 412 1993 470
1990 438 1994 482
1991 444 1995 490
1992 454 1996 500

7. Given below are the figures of production (in million tones) of wheat:

Year 1989 1990 1991 1992 1993 1994 1995


Production 80 90 92 83 94 99 92

Fit a straight-line trend to these figures.


8. The following data refer to annual profits of a certain business.

Year    1991    1992    1993    1994    1995    1996    1997

Profits Rs.'000    60    72    75    65    80    85    95

Find the trend values using the linear trend method. Using the trend equation, estimate the
profit for 2001.
9. Assuming an additive model, apply 3-year moving averages to obtain the trend- free series for
years 2 to 6 from the following data.

Year 1 2 3 4 5 6 7
Exports 126 130 137 141 145 155 159
(Rs.lakh)

10. The trend equation for annual sales of a product is- Y=102+36 X
With 1st January 1990 as origin.

(i) Determine the monthly trend equation with 1st July 1992 as origin.
(ii) Compute the trend values of sales in August 1991 and October 1994.

by

Dr.B.Raja Shekhar

Reader

School of Management Studies

University of Hyderabad.


7. CONCEPT OF SAMPLING AND ESTIMATION

7.0 LEARNING OBJECTIVES:

After reading this lesson, you should be able to


• Differentiate amongst some major sample designs.
• Understand the sampling distributions of the sample mean and the sample proportion.
• Determine an appropriate sample size to estimate a population mean or proportion for a given
level of accuracy and with a prescribed level of confidence.
• Recognise the need for making estimates.

7.1 INTRODUCTION:

Once the problem has been formulated and a research design developed, the researcher has to decide
whether the information is to be collected from the entire population or only from a part of it.
When data are collected from every unit of the population of interest, it is known as a census survey.
If, on the other hand, data are collected only from some units of the population, it is known as a
sample survey. Statistical analysis frequently involves making inferences from a sample to a
population.

7.2 POPULATION AND SAMPLE

In Statistics, a population is an entire set of objects or units of observation of one sort or another, while
a sample is a subset of a population, selected for particular study (usually because it is impractical to
study the whole population). The numerical characteristics of a population are called parameters.
Generally the values of the parameters of interest remain unknown to the researcher; we calculate the
"corresponding" numerical characteristics of the sample (known as statistics) and use these to
estimate, or make inferences about, the unknown parameter values.


• POPULATION: All members of a particular group of people, event, or objects.

• SAMPLE: A relatively small group that is observed in the research study.

7.3 STANDARD NOTATION:

A standard notation is often used to keep straight the distinction between population and sample. The
below sets show some commonly used symbols.

               Size    Mean    Variance

Population:    N       µ       σ²

Sample:        n       x̄       s²

Why study a sample? Why not the entire population?

• Usually, the population is too large to study, i.e. hundreds of thousands or millions of cases.
• Time, money, and patience limit studying an entire population.
• The population may change by the time the generalization is made, making the generalization
invalid.

7.4 DEFINITION OF SAMPLING TERMS:

• SAMPLING: Procedure by which some members of a given population are selected as


representatives of the entire population.

• SAMPLING UNIT: Subject under observation on which information is collected.


Example: Children less than 5 years, hospital discharges, health events.


• SAMPLING ERROR: A sample, even a random one, is likely to vary somewhat from the
population. The differences between the sample and the population are sampling errors. There
is an inverse relationship between sampling error and sample size. Sampling error increases as
standard deviation increases.

7.5 CRITERIA FOR SAMPLING UNITS:

Sample units must meet a number of criteria if they are to be useful.


¾ Each sample unit should have an equal chance of selection.
¾ A sample unit should have stability.
¾ The sampling unit must be easily described.
¾ The number and location of sample units should be selected according to the purpose of the
sampling.

7.6 SAMPLING TECHNIQUES:

There are two broad sampling approaches available for drawing samples: probabilistic
and non-probabilistic sampling techniques.

Probability Sampling:

Every unit of the population has an equal chance of being selected. There are various
probabilistic sampling techniques, like Simple random sampling, stratified sampling, Cluster
sampling, and Systematic sampling.

Non-Probability sampling techniques:

The units of the population do not all have an equal (or even known) chance of being selected; the selection process is, at
least partially, subjective. There are various non-probabilistic sampling techniques, like
judgment sampling, convenience sampling and quota sampling.


PROBABILITY SAMPLING TECHNIQUES:

7.6.1. SIMPLE RANDOM SAMPLING:

The sample is randomly drawn from the population (usually accomplished using a table of random
numbers). This is one of the most basic (simple) forms of sampling, based on the probability that the
random selection of names from a sampling frame will produce a sample that is representative of a
target population. In this respect, a simple random sample is similar to a lottery:

• Everyone in the target population is identified on a sampling frame.


• The sample is selected by randomly choosing names from the frame until the sample is
complete.

An example of a simple random sample you could easily construct would be to take the names of
every student in your class from the register, write all the names on separate pieces of paper and put
them in a box. If you then draw out a certain percentage of names at random you will have constructed
your simple random sample.
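A minimal Python sketch of this lottery idea (illustrative only; the frame of names is hypothetical), using the standard library's random.sample, which gives every name an equal chance of selection:

    # Simple random sample drawn from a class register (illustrative sketch)
    import random

    register = ["student_%02d" % i for i in range(1, 61)]   # hypothetical frame of 60 names
    sample_size = 12                                        # e.g. a 20% sample of the class

    srs = random.sample(register, sample_size)              # selection without replacement
    print(srs)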

•Advantages:
–Simple
–Sampling error easily measured.

•Disadvantages:
–Need complete list of units.
–Does not always achieve the best representativeness.
–Units may be scattered.

7.6.2 STRATIFIED SAMPLING:

Randomly select the sample from different subgroups of the population, either equal numbers or
proportional.


Example: From a school with 60 students in its first year (40 males and 20 females) and 60 students in
its second year (10 males and 50 females), select a sample of 24 students.

A. First and second year students.


To achieve this, we create two sampling frames.
Frame 1 consists of the names of 60 first year students.
Frame 2 consists of the names of 60 second year students.

B. Males and females.


The first sampling frame (first year students) is split into two further sampling frames (males and
females). The second sampling frame (second year students) is also split into two further sampling
frames (males and females). We then choose the names for each sample on a random basis and, when
this has been done, we simply combine the four samples to give us the overall sample we require that
will be representative of the target population.
Year One: Sample 1 consists of 8 first year males; Sample 2 consists of 4 first year females
Year Two: Sample 3 consists of 2 second year males; Sample 4 consists of 10 second year females.
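A short Python sketch of the same allocation (illustrative only; the student identifiers are hypothetical). Proportional allocation over the four strata reproduces the sub-sample sizes 8, 4, 2 and 10 used in the example:

    # Proportionate stratified sample of 24 students from four strata (illustrative sketch)
    import random

    strata = {
        "year1_male":   ["y1_m_%d" % i for i in range(1, 41)],   # 40 first-year males
        "year1_female": ["y1_f_%d" % i for i in range(1, 21)],   # 20 first-year females
        "year2_male":   ["y2_m_%d" % i for i in range(1, 11)],   # 10 second-year males
        "year2_female": ["y2_f_%d" % i for i in range(1, 51)],   # 50 second-year females
    }

    population = sum(len(frame) for frame in strata.values())    # 120 students in all
    sample_size = 24
    sample = []
    for name, frame in strata.items():
        k = round(sample_size * len(frame) / population)         # proportional allocation: 8, 4, 2, 10
        sample.extend(random.sample(frame, k))

    print(len(sample))   # 24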

7.6.3 CLUSTER SAMPLING:

Randomly select intact groups and sample all members of those groups. This form of sampling is
usually done when a target population is spread over a wide geographic area.
• For example, an opinion poll into voting behaviour may involve a sample of 1000 people to
represent the 35 million people eligible to vote in a General Election. A researcher can use a multi-
stage / cluster sample that firstly, divides the country into smaller units (in this example, electoral
constituencies) and then into small units within constituencies (for example, local boroughs).
Boroughs could then be selected which, based on past research, show a representative cross-section
of voters and a sample of electors could be taken from a relatively small number of boroughs
across the country. There are three stages: 1) the target population is divided into many regional
clusters (groups) eg: London, Berlin, Rome etc 2) a few clusters are randomly selected for study 3)
A few subjects are randomly chosen from within a cluster.


Uses
1. This type of sample saves the researcher time and money.
2. Once a relatively reliable sample has been established, the researcher can use the same or a
similar sample again and again (as with political opinion polling).

Limitations
1. Unless great care is taken by the researcher it is possible that the cluster samples will not be
representative of the target population.
2. Even though it is regarded as a relatively cheap form of sampling, this is not necessarily the case. A
sample that seeks to represent the whole of Britain, for example, is still going to be too
expensive for many researchers.

7.6.4 SYSTEMATIC SAMPLING:

Select a sample by drawing every nth case from a list of the entire population. Divide the number in
the population by the number desired in the sample, start near the top of the list and then select every
nth thereafter. A variation on the above is to select the names for your sample systematically rather
than on a simple random basis. Thus, instead of putting all the names on your sampling frame
individually into a box, it's less trouble to select your sample from the sampling frame itself.

For example, if you were constructing a 20% sample of a target population containing 100 names, a
systematic sample would involve choosing every fifth name from your sampling frame.
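A brief Python sketch of this procedure (illustrative only; the frame is hypothetical), taking every k-th name after a random start:

    # Systematic sample: every k-th name from the sampling frame (illustrative sketch)
    import random

    frame = ["name_%03d" % i for i in range(1, 101)]   # hypothetical frame of 100 names
    sample_size = 20                                   # a 20% sample
    k = len(frame) // sample_size                      # sampling interval, k = 5

    start = random.randint(0, k - 1)                   # random start near the top of the list
    systematic_sample = frame[start::k]                # every fifth name thereafter
    print(len(systematic_sample), systematic_sample[:3])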

Uses:
1. Both are relatively quick and easy ways of selecting samples (if the target population is
reasonably small).
2. They are random / near random, which means that everyone in the target population has an
equal chance of appearing in the sample (this is not quite true of systematic sampling, but such
samples are "random enough" for most research purposes).
3. They are both reasonably inexpensive to construct. Both simply require a sampling frame
that is accurate for the target population.


4. Other than some means of identifying people in the target population (a name and address,
for example), the researcher does not require any other knowledge about this population (an
idea that will become more significant when we consider some other forms of sampling)

Limitations:
1. The fact that these types of sample always need a sampling frame means that, in some
cases, it may not be possible to use these types of sampling. For example, a study into
"underage drinking" could not be based on a simple random or systematic sample because no
sampling frame exists for the target population.

2. In many cases a researcher will want to get the views of different categories of people
within a target population and it is not always certain that these types of sampling will produce
a sample that is representative of all shades of opinion.

For example, in a classroom it might be important to get the views of both the teacher and
their students about some aspect of education. A simple random / systematic sample may not
include a teacher because this category is likely to be a very small percentage of the overall
class; there is a high level of probability that the teacher would not be chosen by any
sample that is simply based on chance.

7.7 NON-PROBABILITY TECHNIQUES :

7.7.1 CONVENIENCE SAMPLING :

A convenience sample is obtained by selecting 'convenient' population units. A sample obtained from
already available lists, such as automobile registrations, telephone directories, etc., is a convenience
sample and not a random sample, even if the sample is drawn at random from the lists.

Convenience samples are prone to bias by their very nature: selecting population elements which are
convenient to choose almost always makes them special or different from the rest of the elements in
the population in some way. Hence the results obtained by following the convenience sampling method can
hardly be representative of the population; they are generally biased and unsatisfactory. However, a
convenience sample is often used for making pilot studies.


7.7.2 JUDGMENT SAMPLING (PURPOSIVE SAMPLING):

In this method of sampling the choice of sample items depends exclusively on the judgment of the
investigator. In other words, the investigator exercises his judgment in the choice and includes those
items in the sample which he thinks are most typical of the universe with regard to the characteristics
under investigation. For example, if sample of ten students is to be selected from a class of sixty for
analysing the spending habits of students, the investigator would select 10 students who, in his opinion
are representative of the class.

Merits:

Though the principles of sampling theory are not applicable to judgment sampling, the method
is sometimes used in solving many types of economic and business problems. The use of
judgment sampling is justified under a variety of circumstances.

(i) When only a small number of sampling units is in the universe, simple random selection may
miss the more important elements, whereas judgment selection would certainly include them
in the sample.

(ii) When we want to study some unknown traits of a population, some of whose characteristics
are known, we may then stratify the population according to these known properties and select
sampling units from each stratum on the basis of judgment .This method is used to obtain a
more representative sample.

(iii) In solving every day business problems and making public policy decisions, executives and
public officials are often pressed for time and cannot wait for probability sample designs.
Judgment sampling is then the only practical method to arrive at solutions to their urgent
problems.


Limitations:

(i) This method is not scientific because the population units to be sampled may be affected by
the personal prejudice or bias of the investigator.

(ii) There is no objective way of evaluating the reliability of sample results.

7.7.3 QUOTA SAMPLING:

The sample is selected so that certain characteristics are in proportion to the population. In quota
sampling, the population is first segmented into mutually exclusive sub-groups, just as in stratified
sampling. Then judgment is used to select the subjects or units from each segment based on a specified
proportion. For example, an interviewer may be told to sample 200 females and 300 males between
the age of 45 and 60.

It is this second step which makes the technique one of non-probability sampling. In quota sampling
the selection of the sample is non-random. For example interviewers might be tempted to interview
those people in the street who look most helpful. The problem is that these samples may be biased
because not everyone gets a chance of selection. This non-random element is its greatest weakness and
quota versus probability has been a matter of controversy for many years.

7.8 ESTIMATION:

Having discussed probability theory, the normal distribution, and the concept of sampling distribution
of a statistic, we now turn to the topic estimation. In business, there arise several situations when
managers have to make quick estimates. Since their estimates have an impact on the success or failure
of their enterprises, they have to take sufficient care to ensure that their estimates are not far away
from the final outcome. The point to note is that such estimates are made without complete
information and with a great deal of uncertainty about the eventual outcome.

In all such situations, it is the theory of probability that forms the basis for statistical inference. The
term 'statistical inference' means making a probability judgment concerning a population on the basis


of one or more samples. Based on probability theory, statistical inferences are made as a basis for
making decisions. For example, an investor is interested to know whether he should subscribe to an
investment consultancy service or not. On the basis of a sample, he has to examine whether the
selection of his investments on the advice of the investment consultancy service has been more
profitable than a selection made randomly; if so, he may go in for this service.

7.9 TYPES OF ESTIMATES:

Let us first understand the concept of 'estimate' as used in Statistics. According to some dictionaries, an
estimate is a valuation based on opinion or roughly made from imperfect or incomplete data. This
definition may apply, for example, when an individual has an opinion about the competence of
one of his colleagues. But in Statistics the term estimate is not used in this sense. In statistics too, the
estimates are made when the information available is incomplete or imperfect. However, such
estimates are made only when they are based on sound judgment or experience and when the samples
are scientifically selected.

There are two types of estimates that we can make about a population: a point estimate and an interval
estimate. A point estimate is a single number, which is used to estimate an unknown population
parameter. Although a point estimate may be the most common way of expressing an estimate, it
suffers from a major limitation since it fails to indicate how close it is to the quantity it is supposed to
estimate. In other words, a point estimate does not give any idea about the reliability or precision of
the method of estimation used.

The second type of estimate is known as the interval estimate. It is a range of values used to estimate
an unknown population parameter. In case of an interval estimate, the error is indicated in two ways:
first by the extent of its range: and second, by the probability of the true population parameter lying
within that range.

7.9.1 ESTIMATOR AND ESTIMATE:

When we make an estimate of a population parameter, we use a sample statistic. This sample statistic
is an estimator.


For example, the sample mean x̄ = ∑x / n

x̄ is a point estimator of the population mean. Many different statistics can be used to estimate the
same parameter. For example, we may use the sample mean or the sample median or even the range
to estimate the population mean. The question here is: how can we evaluate the properties of these
estimators, compare them with one another, and finally decide which is the 'best'? The answer to this
question is possible only when we have certain criteria that a good estimator must satisfy. These
criteria are briefly discussed below.

7.9.2 CRITERIA OF GOOD ESTIMATOR:

There are four criteria by which we can evaluate the quality of a statistic as an estimator. These are:
unbiasedness, efficiency, consistency and sufficiency.

7.9.2.1 UNBIASEDNESS:

This is a very important property that an estimator should possess. If we take all possible samples of
the same size from a population and calculate their means, the mean of all these means will be equal to
the mean of the population. This means that the sample mean is an unbiased estimator of the
population mean. When the expected value (or mean) of a sample statistic is equal to the value of the
corresponding population parameter, the sample statistic is said to be an unbiased estimator.

7.9.2.2 CONSISTENCY:

Another important characteristic that an estimator should possess is consistency. Let us take the case
of the standard deviation of the sampling distribution of the mean:

σx̄ = σ / √n

The formula states that the standard deviation of the sampling distribution decreases as the sample size
increases, and vice versa. When the sample size n increases, the population standard deviation is
divided by a larger √n, which results in a smaller standard error of the sample mean.
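A short Python sketch (illustrative; the value of σ is assumed) makes the point numerically: quadrupling the sample size halves the standard error of the mean.

    # Standard error of the mean, sigma / sqrt(n), for increasing sample sizes (illustrative sketch)
    import math

    sigma = 10.0                        # assumed population standard deviation
    for n in (25, 100, 400, 1600):
        print(n, sigma / math.sqrt(n))  # 2.0, 1.0, 0.5, 0.25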


7.9.2.3 EFFICIENCY:

Another desirable property of a good estimator is that it should be efficient. Efficiency is measured in
terms of size of the standard error of the statistic. Since an estimator is a random variable, it is
necessarily characterized by a certain amount of variability. This means that some estimates may be
more variable than others. Just as bias is related to the expected value of the estimator, so efficiency
can be defined in terms of the variance.

7.9.2.4 SUFFICIENCY:

The fourth property of a good estimator is that it should be sufficient. A sufficient statistic is
an estimator that utilizes all the information a sample contains about the parameter to be estimated.

x̄, for example, is a sufficient estimator of the population mean µ. It implies that no other estimator of
µ, such as the sample median, can provide any additional information about the parameter µ.
Likewise, we can say that the sample proportion p is a sufficient estimator for the population
proportion.

7.10 SUMMARY:

At the outset, the chapter has given the definitions of some basic terms used in sampling. The
distinction between the probability and non-probability sample has been brought out. This is followed
by a discussion of simple random sampling, systematic random sampling, and stratified random
sampling. Other sampling designs such as cluster sampling, judgment sampling and quota sampling,
with their advantages and limitations, were also discussed. Finally, the four criteria used to evaluate the
quality of a statistic as an estimator, namely unbiasedness, efficiency, consistency and sufficiency,
have also been explained.


7.11 GLOSSARY:

Census: A measurement of each element in a group or


population of interest
Sample: A subset or some part of a population

Sampling error: The difference between the population parameter


and the observed probability statistic

Standard error: The standard deviation of the sampling


distribution of a statistic

Statistic: A measure or characteristic of a sample.

Non-sampling error: An error that occurs in the collection, recording,


tabulation and computation of data.

Parameter: A quantity that remains constant in each case


considered but varies in different cases.

Precision: The desired size of the confidence interval when a


population parameter is to be estimated. The
concept is also useful in determining the sample
size.

7.12 REFERENCES:
Gupta, S.C., "Fundamentals of Statistics", Himalaya Publishing House, New Delhi, 2004.
Levin, Richard I. and Rubin, David S., "Statistics for Management", Prentice Hall of India, New Delhi, 2000.


7.13 EXERCISE:
1. What are the advantages and limitations of sampling?
2. What is meant by 'representativeness' in a sample? Explain in what sense a simple random
sample is representative of the population.
3. Distinguish between a probability sample and a non-probability sample.
4. Describe the various steps involved in the sampling process.
5. What are the criteria of a good estimator? Explain each of the criteria briefly.

By

Dr. C.R. Rao

Reader

Department of Mathematics & Statistics

School of Mathematics & Computer/Information


Sciences

University of Hyderabad.


8. TESTING OF HYPOTHESIS

8.0 LEARNING OBJECTIVES:

After reading this lesson, you should be able to


• Understand the concept of hypothesis and the procedure involved in testing them.
• Distinguish between two types of errors and the relationship between them.
• Understand the power of a hypothesis test.
• Specify the most appropriate test of hypothesis in a given situation.

8.1 INTRODUCTION:

Statistical inference is that branch of statistics which is concerned with using probability concepts to
deal with uncertainty in decision making. The field of statistical inference has had a fruitful
development since the latter half of the 19th century.

It refers to the process of selecting and using a sample statistic to draw inferences about a population
parameter based on the sample drawn from the population. Statistical inference treats two different
classes of problems:

(a) Hypothesis testing: i.e. to test some hypothesis about parent population from which sample is
drawn.
(b) Estimation: i.e. to use the 'statistics' obtained from the sample as estimates of the unknown
'parameters' of the population from which the sample is drawn.

8.2 HYPOTHESIS TESTING:

Hypothesis testing begins with an assumption, called a hypothesis, that is made about a population
parameter. Sample data are then collected to produce sample statistics, and this information is used
to decide how likely it is that the hypothesized parameter value is correct.
Hence, it can be said that a hypothesis is a supposition made as a basis for reasoning.


According to Prof. Morris Hamburg:

'A hypothesis in statistical inference is a quantitative statement about a population.' For a
researcher the hypothesis is a formal question that he intends to resolve. Thus a hypothesis is
defined as a proposition, or a set of propositions, set forth as an explanation for the occurrence of
some specified group of phenomena, to guide investigation.

There can be several types of hypothesis:

(1) A coin may be tossed 200 times and 180 times it may be heads. In this case it is interesting to
test the hypothesis that the coin is unbiased.
(2) The average weight of 100 students of a particular college can be studied and may get the result
of 110lb. In this case it is interesting to test the hypothesis that the sample has been drawn from
the population with average weight 115lb.
(3) Hypothesis testing can also be done for testing the hypothesis that the variables in the population
are uncorrelated

8.3 APPLICATIONS OF HYPOTHESIS TESTING:

Hypothesis testing helps in many decision situations.

Example:

• When people need to determine whether the parameters of two populations are alike or
different.
• Whether female employees receive lower salaries than male employees for the same
work.
• To determine whether the proportion of promotable employees at one government
installation is different from that at another.


8.4 PROCEDURE FOR TESTING HYPOTHESIS:

8.4.1 Set up a hypothesis:

The first step in hypothesis testing is to set up a hypothesis about a population parameter. We then collect
sample data, produce sample statistics and use this information to decide how likely it is that our
hypothesized population parameter is correct. Say we assume a certain value for the mean of the population;
to test the validity of our assumption we gather sample data and determine the difference between the
hypothesized value and the actual value of the sample mean. Then we judge whether the difference is
significant: the smaller the difference, the greater the likelihood that our hypothesized value for the
mean is correct; the larger the difference, the smaller the likelihood.
The conventional approach to hypothesis testing is not to construct a single hypothesis about the
population parameter, but rather to set up two different hypotheses. These hypotheses must be so
constructed that if one is accepted the other is rejected, and vice versa. They are:
(a) Null hypothesis(Ho)
(b) Alternate hypothesis(Ha)

8.4.2 Null hypothesis:

It is a very useful tool in testing the significance of difference. This hypothesis asserts that there is no
real difference between the sample and the population in the particular matter under consideration, and that
the difference found is accidental and unimportant, arising out of fluctuations of sampling. The null
hypothesis is akin to the legal principle that a man is innocent until he is proved guilty.

8.4.3 Alternate hypothesis:

As against null hypothesis alternate hypothesis specifies those values that the researcher believes to
hold true. The alternate hypothesis may embrace the whole range of values rather than a single point.
Ex: if we want to know whether drug A is effective in curing malaria or not:
Ho: drug A is not effective in curing malaria.
Ha: drug A is effective in curing malaria.


8.4.4 Set up a suitable significance level:

The next step is to test the validity of Ho against that of Ha at a certain level of significance. The
confidence with which an experimenter rejects (or) accepts null hypothesis depends upon the
significance level adopted.

Significance level: it is usually expressed as a percentage, like 1%, 5%, 10%, etc. This implies that

there is a risk of committing an error to the extent of 1% (or 5% or 10%) depending upon the level
chosen.

8.4.5 Selecting a test criterion:

This involves selecting an appropriate probability distribution for the particular test, i.e. a probability
distribution that can properly be applied. Some probability distributions commonly used are the Z, t, F
and chi-square distributions. The test criterion must employ an appropriate probability distribution. For example, if only small-sample
information is available, the use of the normal distribution would be inappropriate.

8.4.6 Doing computations and making decisions: These calculations include computing the test statistic and
the standard error of the test statistic. A statistical conclusion or statistical decision is a decision
either to reject or to accept the null hypothesis. The decision will depend on whether the computed
value of the test criterion falls in the region of rejection or the region of acceptance.

8.5 TYPES OF ERRORS IN TESTING OF HYPOTHESIS:

Type 1 error:
The hypothesis is true but our test rejects it i.e. rejecting Ho when it is true.
Type II error:
The hypothesis is false but test accepts it i.e. accepting Ho when it is false.
Alpha = prob (type1 error)
= prob (rejecting Ho, when it is true)
Beta = prob (type II error)
= prob (not rejecting Ho, when it is false)


8.6 Two tailed & One tailed Tests of Hypothesis:

A two tailed test of hypothesis will reject the null hypothesis, if the sample statistic is significantly
higher than or lower than the hypothesized population parameter

(Figure: two-tailed test. The acceptance zone lies around the mean, with rejection zones in both tails.)


One tailed test is so called because the rejection region will be located in only one tail which may be
either left or right depending upon the alternative hypothesis formulated.

(Figures: left-tailed test, with the rejection zone in the left tail and the acceptance zone to its right; and right-tailed test, with the rejection zone in the right tail and the acceptance zone to its left.)


8.7 STANDARD ERROR AND SAMPLING DISTRIBUTION:

The standard deviation of the sampling distribution is called the standard error. (1) It is used as an
instrument in hypothesis testing: the standard error provides an idea about the unreliability of a sample.
The greater the standard error, the greater the departure of the actual frequencies from the expected ones,
and hence the greater the unreliability of the sample. (2) The reciprocal of the S.E., i.e. 1/S.E., is a measure of the reliability of
the sample. (3) With the help of the S.E. we can determine the limits within which the parameter values are
expected to lie.

8.8 Z-table:

Conditions for use:

1) Tests for randomness.
2) Tests concerning means for one population (or) two populations, based on large random samples (say over 30),
so as to invoke the central limit theorem. This includes tests concerning proportions, with a large random
sample size 'n' (say over 30), so as to invoke distribution convergence results.
3) To compare two correlation coefficients.

8.9 TEST OF SIGNIFICANCE FOR ATTRIBUTES:

In this we can only find out the presence (or) absence of a particular characteristic. Hence we will
check whether attribute A is present (or) not by drawing samples.

8.9.1. Test for number of successes:

The sampling distribution of the number of successes follows the Binomial probability distribution.
Hence the S.E. is:

S.E. of the number of successes = √(npq)

Where n = size of the sample
p = probability of success in each trial
q = probability of failure (1 − p)


Ex: Q) A coin was tossed 400 times and head turned up 216 times. Test the hypothesis that the coin is
unbiased.
Ans: Let us take the hypothesis that the coin is unbiased. On the basis of this hypothesis the
probability of getting a head or a tail would be equal, i.e. ½. Hence in 400 throws of a coin we should expect
200 heads and 200 tails.
Observed no. of heads=216
Difference between observed and expected number of heads= 216-200=16

S.E. of the number of heads = √(npq)

n = 400; p = 1/2; q = 1/2

S.E. = √(400 × 1/2 × 1/2) = √100 = 10

Zcal = Difference / S.E. = 16 / 10 = 1.6

At α = 5% level of significance,
Ztable = 1.96

Zcal < Ztable
Hence the hypothesis is accepted. So the coin is unbiased.
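The same calculation can be written as a short Python sketch (illustrative only; the helper names are our own):

    # Z test for the number of successes: the coin example (illustrative sketch)
    import math

    n, heads = 400, 216
    p = q = 0.5                               # hypothesis: the coin is unbiased
    expected = n * p                          # 200
    se = math.sqrt(n * p * q)                 # sqrt(400 * 0.5 * 0.5) = 10

    z = (heads - expected) / se               # 16 / 10 = 1.6
    print(z, abs(z) < 1.96)                   # 1.6 < 1.96, so the hypothesis is accepted at 5%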

Other applications:

1) To find from a hospital statistics the no. of male babies born in a particular month is equal to
no. of female babies born.
2) In census, whether no. males is equal to no. of females
3) In a class, whether the performance of boys and girls differs
4) In a class, whether the maths students perform better (or) the arts students perform better
5) Whether a drug is effective or ineffective.


8.9.2. Tests for proportion of successes:

Instead of recording the number of successes in each sample, we might record the proportion of
successes, that is, 1/n-th of the number of successes in each sample. As this would amount to dividing all the
figures of the record by n, the mean proportion of successes must be p.

Standard deviation of the proportion of successes = √(npq) / n = √(pq/n)

S.E.p = √(pq/n)

Ex: Q: 500 apples are taken at random from a basket and 50 are found to be bad. Estimate the
proportion of bad apples in the basket and assign the percentage limits.

A: The proportion of bad apples in the given sample = 50/500 = 0.1

p = 0.1, q = 0.9, n = 500

S.E. = √(pq/n) = √(0.1 × 0.9 / 500) = 0.013

Limits within which the percentage of bad apples lies:
(p ± 3 S.E.) × 100
= (0.1 ± 3 × 0.013) × 100
= 6.1% and 13.9%


Other applications:

1. To assign limits to the variation of the sample mean, or for that matter any statistic of
the sample, compared to the population parameters.
2. We can find the number of defective items produced by a machine at a particular level of
significance and state the limits with accuracy.
3. We can find what proportion of a machine's output is defective.
4. In a class, the number of students expected to be in the first grade can be estimated, and
the upper and lower limits stated with accuracy.

8.9.3 TEST FOR DIFFERENCE BETWEEN PROPORTIONS:

If two samples are drawn from different populations, we may be interested in finding out whether the
difference between the proportions of successes is significant or not. In such a case we take the
difference between the proportions of successes in the two samples.

S.E.(p1 − p2) = √[ pq (1/n1 + 1/n2) ]

p = (n1 p1 + n2 p2) / (n1 + n2)   or   p = (x1 + x2) / (n1 + n2)

Where p1 = proportion of successes in the sample of size n1
p2 = proportion of successes in the sample of size n2
q1 = (1 − p1) = proportion of failures in the first sample
q2 = (1 − p2) = proportion of failures in the second sample

Zcal = (p1 − p2) / S.E.

Ztable is read at α = level of significance, 5% (or) 1%.

Zcal < Ztable → accept the Null Hypothesis

Zcal > Ztable → reject the Null Hypothesis


Ex: 1. In a simple random sample of 600 men taken from a big city, 400 are found to be smokers. In
another simple random sample of 900 men taken from another city, 450 are smokers. Do the data
indicate that there is a significant difference in the habit of smoking in the two cities?
Sol:
Ho: no significant difference in the habit of smoking in the two cities.
HA: significant difference in the habit of smoking in the two cities.

p1 = 400/600 = 0.667, q1 = 0.333
p2 = 450/900 = 0.5, q2 = 0.5

p = (x1 + x2) / (n1 + n2) = (400 + 450) / (600 + 900) = 17/30

q = 1 − p = 13/30

S.E.(p1 − p2) = √[ pq (1/n1 + 1/n2) ] = √[ (17/30)(13/30)(1/600 + 1/900) ] = 0.026

Zcal = (p1 − p2) / S.E. = (0.667 − 0.5) / 0.026 = 6.42

Ztable (at α = 1%) = 2.58


(Figure: the calculated Z value of 6.42 lies far beyond the critical values ±2.58, i.e. in the rejection zone.)
Zcal > Ztable → Null Hypothesis rejected

So, there is a significant difference between the smoking habits of people in the two cities.
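A short Python sketch of the same two-proportion Z test (illustrative only; variable names are our own):

    # Z test for the difference between two proportions: the smoking example (illustrative sketch)
    import math

    x1, n1 = 400, 600          # smokers and sample size, city 1
    x2, n2 = 450, 900          # smokers and sample size, city 2

    p1, p2 = x1 / n1, x2 / n2                   # 0.667 and 0.5
    p = (x1 + x2) / (n1 + n2)                   # pooled proportion, 17/30
    q = 1 - p
    se = math.sqrt(p * q * (1 / n1 + 1 / n2))   # about 0.026

    z = (p1 - p2) / se                          # about 6.4
    print(round(z, 2), abs(z) > 2.58)           # far beyond 2.58, so H0 is rejected at 1%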

Applications:
1. Whether a machine's productivity has improved after overhauling or not
2. Whether the students benefited from extra classes or not
3. Whether a particular policy of the govt. has affected the buying behaviour of the people or not
4. Whether a particular product is successful or not

In case of large samples (n > 30):

S.E. of the mean = S.E.x̄ = σp / √n

σp = standard deviation of the population
n = number of observations in the sample

Ex:-

It is assumed that the mean life span of human beings in a particular state is 67 years. The mean
age obtained from a random sample of size 100 is 64 years. The standard deviation of the distribution
of life of the population is 3 years. Test whether the mean life of the population has decreased over a
period of time. Test at 5% level of significance.


Solution:
Ho: µ = 67
HA: µ < 67
x̄ = 64

S.E.x̄ = σ / √n = 3 / √100 = 0.3

Zcal = (64 − 67) / 0.3 = −10

At α = 5% (one-tailed test), Ztable = 1.645.

Zcal falls in the rejection zone, so the Null Hypothesis is rejected.

Hence the mean life of the population has decreased from 67 years.
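A minimal Python sketch of this large-sample Z test for a mean (illustrative only):

    # Z test for a single mean with a large sample: the life-span example (illustrative sketch)
    import math

    mu0 = 67            # hypothesized population mean (years)
    xbar = 64           # sample mean
    sigma = 3           # population standard deviation
    n = 100

    se = sigma / math.sqrt(n)     # 0.3
    z = (xbar - mu0) / se         # -10
    print(z, z < -1.645)          # deep in the left-tail rejection zone at 5%, so H0 is rejected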

Applications:
1) Whether the claims made by firms in advertisements about the quality, lifetime, performance or speed
of their products hold for the product lots they actually produce.
2) For test marketing, i.e. whether the product will click or not, judged from the survey results.


8.9.4 Two-tailed test for the difference between the means of two samples:

1) Two independent random samples of sizes n1 and n2 (each greater than 30) are drawn, with σ1, σ2
their respective standard deviations and x̄1, x̄2 their respective means. Then

S.E. = √( σ1²/n1 + σ2²/n2 )

Ho: µ1 = µ2
HA: µ1 ≠ µ2

2) If n1, n2 are two random samples with n1 > 30 and n2 > 30, and the S.D. of the (common) population is σ,
then the S.E. of the difference between the sample means is

S.E. = √[ σ² (1/n1 + 1/n2) ]

Ex: Intelligence test on two groups of boys and girls gave the following results:

Mean S.D N
Girls 75 15 150
Boys 70 20 250

Is there a significant difference between the mean scores obtained by the boys and the girls?

Ho: µGirls = µBoys
HA: µGirls ≠ µBoys


σ1 = 15, n1 = 150
σ2 = 20, n2 = 250

S.E.(x̄1 − x̄2) = √( σ1²/n1 + σ2²/n2 ) = √( (15)²/150 + (20)²/250 ) = 1.761

Zcal = (75 − 70) / 1.761 = 2.84

At α = 1% level of significance, Ztable = 2.58.
Zcal > Ztable → Ho rejected

There is a significant difference in the means of the populations.

8.10 Student's 't' distribution:

This distribution is used when the sample size is less than 30 and the population standard deviation is
unknown.

Definition:

t = (x̄ − µ) √n / s

where s = √[ ∑(x − x̄)² / (n − 1) ]


Properties of the t-distribution

1. The variable t ranges from −∞ to +∞.
2. The constant 'C' in its density is actually a function of 'ϑ' (the degrees of freedom), so that for a particular value of ϑ
the distribution f(t) is completely specified. Thus f(t) is a family of functions, one for each value of ϑ.
3. Like the standard normal distribution, the t-distribution is symmetrical and has a mean of zero.
4. The variance of the t-distribution is greater than one, but approaches one as the sample size
becomes large.

The t-table:

It gives, over a range of values of degrees of freedom, the values of 't' that would be exceeded by chance
at different levels of significance. When the degrees of freedom are infinitely large, the t-distribution
is equivalent to the normal distribution.

8.10.1 Applications of the t-distribution:

To test the significance of the mean of a random sample:

In determining whether the mean of a sample drawn from a normal population deviates
significantly from a stated value,

t = (x̄ − µ) √n / S

Where x̄ = the mean of the sample
µ = the actual (or hypothetical) mean of the population
n = the sample size
S = the standard deviation of the sample

S = √[ Σ(x − x̄)² / (n − 1) ]  ⇒  S = √[ (Σd² − n d̄²) / (n − 1) ]

or S = √[ ( ∑d² − (∑d)²/n ) / (n − 1) ]

where d = deviation from the assumed mean.

Eg: The manufacturer of a certain make of electric bulbs claims that his bulbs have a mean life of 25
months with a standard deviation of 5 months. A random sample of 6 such bulbs gave the following
values.

Life in months: 24, 26, 30, 20, 20, 18. Can you regard the producer's claim to be valid at 1% level of
significance?

Solution:

Ho: there is no significant difference between the mean life of the bulbs in the sample and that of the population as claimed.
Ha: there is a significant difference.

t = (x̄ − µ) √n / S

µ = 25

X      (X − X̄)    (X − X̄)²
24     +1         1
26     +3         9
30     +7         49
20     -3         9
20     -3         9
18     -5         25
ΣX = 138          Σ(X − X̄)² = 102

X̄ = ΣX / n = 138 / 6 = 23

S² = Σ(X − X̄)² / (n − 1) = 102 / 5 = 20.4


S = 4.517

t = (23 − 25) √6 / 4.517 = (−2 × 2.449) / 4.517 = −1.084, i.e. |t| = 1.084

ϑ = n − 1 = 6 − 1 = 5
From the t-table, for D.F. = 5, t0.01 = 4.032.
The calculated value of 't' is less than the table value.

The null hypothesis is accepted; the producer's claim is true.
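The bulb example can be reproduced with a short Python sketch (illustrative only; it uses the standard library's statistics module):

    # One-sample t test for the bulb example (illustrative sketch, hand computation)
    import math
    from statistics import mean, stdev

    life = [24, 26, 30, 20, 20, 18]     # life in months for the 6 sampled bulbs
    mu0 = 25                            # claimed mean life

    xbar = mean(life)                   # 23
    s = stdev(life)                     # sample S.D. with (n - 1) in the denominator, about 4.52
    t = (xbar - mu0) * math.sqrt(len(life)) / s     # about -1.08

    print(round(xbar, 2), round(s, 3), round(t, 3))
    # |t| = 1.08 is below the table value 4.032 for 5 d.f. at 1%, so the claim is accepted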

Testing difference between Means of two samples: (Independent samples):

            Sample I    Sample II
Mean:       x̄1          x̄2
S.D.:       S1          S2

Testing the hypothesis that the samples come from the same population:

t = (X̄1 − X̄2) / S × √[ n1 n2 / (n1 + n2) ]

Where

S = √[ ( Σ(X1 − X̄1)² + Σ(X2 − X̄2)² ) / (n1 + n2 − 2) ]

⇒ S = √[ ( (n1 − 1)S1² + (n2 − 1)S2² ) / (n1 + n2 − 2) ]


Example:

Two types of drugs were used on 5 and 7 patients respectively for reducing their weight.

Drug A was imported; B was indigenous. The decrease in weight after using the drugs for six months was:

A: 10 12 13 11 14
B: 8 9 12 14 15 10 9

Is there a significant difference in the efficiency of the two drugs? If not, which drug should you buy?

For D.F. = 10, t0.05 = 2.228.

Solution:
Null Hypothesis: there is no significant difference in efficiency of two drugs.

t = (X̄1 − X̄2) / S × √[ n1 n2 / (n1 + n2) ]

X̄1 = 12

X̄2 = 11

X1    (X1 − X̄1)    (X1 − X̄1)²    X2    (X2 − X̄2)    (X2 − X̄2)²

10 -2 4 8 -3 9
12 0 0 9 -2 4
13 +1 1 12 1 1
11 -1 1 14 3 9
14 +2 4 15 4 16
10 -1 1
9 -2 4
ΣX1 = 60    Σ(X1 − X̄1)² = 10    ΣX2 = 77    Σ(X2 − X̄2)² = 44


S = √[ ( Σ(x1 − x̄1)² + Σ(x2 − x̄2)² ) / (n1 + n2 − 2) ] = √[ (10 + 44) / (5 + 7 − 2) ] = √5.4 = 2.324

t = (X̄1 − X̄2) / S × √[ n1 n2 / (n1 + n2) ] = (12 − 11) / 2.324 × √(35/12) = 1.708 / 2.324 = 0.735

ϑ = n1 + n2 − 2 = 5 + 7 − 2 = 10
For ϑ = 10, t0.05 = 2.228.
Since the calculated 't' is less than the table 't', the hypothesis is accepted.
⇒ There is no significant difference in the efficiency of the two drugs.
Also, since there is no significant difference, the indigenous drug (B) may be bought; there is no need to import drug A.
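A short Python sketch of the pooled two-sample t test used above (illustrative only; variable names are our own):

    # Pooled two-sample t test for the two drugs (illustrative sketch, hand computation)
    import math
    from statistics import mean

    drug_a = [10, 12, 13, 11, 14]
    drug_b = [8, 9, 12, 14, 15, 10, 9]

    n1, n2 = len(drug_a), len(drug_b)
    m1, m2 = mean(drug_a), mean(drug_b)                  # 12 and 11
    ss1 = sum((x - m1) ** 2 for x in drug_a)             # 10
    ss2 = sum((x - m2) ** 2 for x in drug_b)             # 44

    s = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))           # pooled S.D., about 2.324
    t = (m1 - m2) / s * math.sqrt(n1 * n2 / (n1 + n2))   # about 0.735

    print(round(s, 3), round(t, 3))   # t is below 2.228 (table, 10 d.f., 5%), so H0 is accepted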

8.10.2 Testing difference between means of two samples (Dependent samples):

Two samples are said to be dependent when the elements in one sample are related to those in the other
in some significant manner.

t = (d̄ − 0) √n / S,  i.e.  t = d̄ √n / S

d̄ = mean of the differences
S = S.D. of the differences

S = √[ Σ(d − d̄)² / (n − 1) ] = √[ (Σd² − n d̄²) / (n − 1) ]

Eg:

A drug is given to 10 persons and the increments in their blood pressure were as recorded below. Is it reasonable to
believe that the drug has no effect on the change of B.P.?

Use a 5% level of significance; for ϑ = 9, t = 2.26.


Solution:

H0 : the drug has no effect on the B.P


HA: the drug has an effect on the B.P.

d    3   6   −2   4   −3   4   6   0   0   2     Σd = 20
d²   9  36    4  16    9  16  36   0   0   4     Σd² = 130

d̄ = Σd / n = 20 / 10 = 2

S = √[ (Σd² − n(d̄)²) / (n − 1) ] = √[ (130 − 10(2)²) / (10 − 1) ] = √10 = 3.162

t = d̄ √n / S = (2 × √10) / 3.162 = (2 × 3.162) / 3.162 = 2

ν = n − 1 = 10 − 1 = 9; the table value of t for ν = 9 at α = 0.05 is 2.26.

t_cal < t_table ⇒ the hypothesis is accepted.

Hence, it is reasonable to believe that the drug has no effect on B.P.
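Since the paired test reduces to a one-sample test on the differences, it can be verified with the same tools. A minimal sketch, assuming NumPy and SciPy are installed, is given below; d is simply the vector of blood-pressure increments from the example.

import numpy as np
from scipy import stats

# Increments in blood pressure for the 10 persons
d = np.array([3, 6, -2, 4, -3, 4, 6, 0, 0, 2])

# Manual computation: t = d_bar * sqrt(n) / S, with S the S.D. of the differences
d_bar = d.mean()
s = d.std(ddof=1)
t_manual = d_bar * np.sqrt(len(d)) / s

# Equivalent library call: a one-sample t-test of the differences against zero
t_scipy, p_value = stats.ttest_1samp(d, 0)

print(round(d_bar, 3), round(s, 3))            # 2.0, 3.162
print(round(t_manual, 3), round(t_scipy, 3))   # 2.0, 2.0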


8.10.3 Testing the significance of observed correlation co-efficient:

Given a random sample from a bivariate normal population, to test the hypothesis that the
correlation coefficient of the population is zero, the test statistic is

t = [ r / √(1 − r²) ] × √(n − 2),  with (n − 2) degrees of freedom.

Eg:

A random sample of 27 pairs of observations from a normal population gives a correlation coefficient
of 0.42. Is it likely that the variables in the population are uncorrelated?.

Solution:

H0: the correlation coefficient in the population is zero, i.e. the variables are uncorrelated.

t = [ r / √(1 − r²) ] × √(n − 2)

r = 0.42; n = 27

t = [ 0.42 / √(1 − (0.42)²) ] × √(27 − 2) = (0.42 × 5) / 0.908 = 2.31

ν = n − 2 = 27 − 2 = 25
For ν = 25, t0.05 = 1.708

t_cal > t_table

The hypothesis is rejected.


Hence, it is unlikely that the variables in the population are uncorrelated; the observed correlation is significant.
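This test needs only r and n, so it is a one-line computation. The sketch below, assuming NumPy is available, evaluates the formula for the values in the example.

import numpy as np

r, n = 0.42, 27                        # sample correlation and number of pairs

# t = r * sqrt(n - 2) / sqrt(1 - r^2), with (n - 2) degrees of freedom
t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
df = n - 2

print(round(t_stat, 2), df)            # 2.31, 25 -> exceeds the table value 1.708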

Managerial Applications:

1. Market surveys and decision making about possible demand.
2. Error estimation.
3. Demand analysis and the possible volume of sales.
4. Authentication of the data collected.

Applications:

1. Whether the mean life of a product is up to the mark or not.
2. Whether machines A and B have their mean output at the same level or not.
3. Whether employees A and B have the same average performance or not.

8.11 SUMMARY:
This chapter has first dealt with the concept of hypothesis and the steps involved in testing it. It has
been pointed out that two types of error- Type I error and type II error- are likely to be made in testing
a hypothesis. Type I error arises when the hypothesis is true but it is rejected. Type II error arises
when the hypothesis, though false, is accepted. Hypothesis testing is explained with examples
wherever necessary. Finally, Student's 't' distribution, its properties, and its applications have been
discussed.

8.12 GLOSSARY:

Alpha ( α ): The significance level of a test of hypothesis that


denotes the probability of rejecting a null
hypothesis when it is actually true. In other
words, it is the probability of committing a Type I
error.


Alternative hypothesis: A hypothesis that takes a value of a population


parameter different from that used in the null
hypothesis.

Beta ( β ): The probability of not rejecting a null hypothesis


when it actually is false. In other words, it is the
probability of committing a Type II error.

Critical region: The set of values of the test statistic that will
cause us to reject the null hypothesis

Critical value: The 'first' (or boundary) value in the critical
region.

Decision rule:
If the calculated test statistic falls within the
critical region, the null hypothesis H0 is rejected.
In contrast, if the calculated test statistic does not
fall within the critical region, the null hypothesis
is not rejected.

Hypothesis:
An unproven proposition or supposition that
tentatively explains a phenomenon.

Null hypothesis: A statement about the status quo of a population
parameter that is being tested.
One-tail test:
A statistical hypothesis test in which the
alternative hypothesis is specified such that only
one direction of the possible distribution of values
is considered.


Power of the hypothesis test:


The probability of rejecting the null hypothesis
when it is false.

Significance level:
The value of α that gives the probability of
rejecting the null hypothesis when it is true. This
gives rise to Type I error.

Test criteria
Criteria consisting of (i) specifying a level of
significance α , (ii) determining a test statistic (iii)
determining the critical region(s), and (iv)
determining the critical value (s).

Test statistic:
The value of Z or 't' calculated for a sample
statistic such as the sample mean or the sample
proportion.

Two-tail test:
a statistical hypothesis test in which the
alternative hypothesis is stated in such a way that
it includes both the higher and the lower values of
a parameter than the value specified in the null
hypothesis.
Type I error:
An error caused by rejecting a null hypothesis
that is true.
Type II error:
An error caused by failing to reject a null
hypothesis that is not true.


8.13 REFERENCES:
1. Gupta, S.C., and Kapoor, V.K., "Fundamentals of Mathematical Statistics", Sultan Chand and Sons, New Delhi, 1997, 9th Edition.
2. Gupta, S.C., "Fundamentals of Statistics", Himalaya Publishing House, New Delhi, 2004.
3. Levin, Richard I. and Rubin, David S., "Statistics for Management", PHI, New Delhi, 2000.
4. Gupta, S.P. and Gupta, M.P., "Business Statistics", Sultan Chand & Sons, New Delhi, 2001.

8.14 REVIEW EXERCISE:


1. What is a Hypothesis? What steps are involved in statistical testing of a hypothesis?

2. Explain how hypothesis testing is useful to management.

3. An insurance agent has claimed that the average age of policyholders who insure through him is less
than the average for all agents, which is 30.5 years. A random sample of 100 policy holders who had
insured through him gave the following age distribution.

Age last birthday No. of persons


(years) insured
16-20 12
21-25 22
26-30 20
31-35 30
36-40 16
Total 100

Calculate the arithmetic mean and standard deviation of this distribution and use these values to
test his claim at the 5 percent level of significance.

4. Two types of batteries are tested for their length of life and the following data are obtained:


Sample size Mean Life Variance (hours)


Type A 9 600 121
Type B 8 640 144

Is there a significant difference in the two means?

5. A radio shop sells, on an average, 200 radios per day with a standard deviation of 50 radios.
After an extensive advertising campaign, the management will compute the average sales for the
next 25 days to see whether an improvement has occurred. Assume that the daily sales of radios
are normally distributed.

(i) Write down the null and alternative hypothesis


(ii) test the hypothesis at 5 percent level of significance if x = 216
(iii) How large must x be in order that the null hypothesis is rejected at 5 percent level of
significance?

6. You obtain a large number of components to an identical specification from two sources. You may
notice that some of the components are from the supplierÊs own plant in Pune and some are from the
plant located in Bangalore. You would like to know whether the proportions of defective components
are the same or there is a difference between the two. You take a random sample of 600 components
from each plant and find that the rejection rate p1 is 0.015 for Pune components as compared to
p2=0.017 for Bangalore components. Set up the null hypothesis and test it at 5 percent level of
significance.

7. Out of 20,000 customers ledger accounts, a sample of 600 was taken to test the accuracy of posting
and balancing and 45 mistakes were found. Assign limits which the number of mistakes can be
expected at 95% level of confidence.

8. In a village 'A', out of a random sample of 1,000 persons, 100 were found to be vegetarians, while in
another village 'B', out of 1,500 persons, 180 were found to be vegetarians. Do you find a significant
difference in the food habits of the people of the two villages?


9. In a hospital, 480 female and 520 male babies were born in a week. Do these figures confirm the
hypothesis that males and females are born in equal numbers?

10. A strength test carried out on samples of two yarns spun to the same count gave the
following results:

Sample Size Sample Mean Sample variance


Yarn A 4 50 42
Yarn B 9 42 56

The strengths are expressed in pounds. Is the difference in the mean strengths of the sources from
which the samples are drawn significant?

by

Dr.B.Raja Shekhar

Reader

School of Management Studies

University of Hyderabad.


9. ANALYSIS OF VARIANCE (ANOVA)

9.0 LEARNING OBJECTIVES:

After reading this lesson, you should be able to

• Understand how "Analysis of variance" can be used to test for the equality of three or more
  population means (μ).
• Learn how to summarize the F-ratio in the form of an ANOVA table.

9.1 INTRODUCTION:

Analysis of variance (abbreviated as ANOVA) is an extremely useful technique concerning researches


in the fields of economics, biology, education, sociology, business/industry and in researches of
several other disciplines. This technique is used when multiple sample cases are involved.
Significance of difference between the means of two samples can be judged through either z-test or the
t-test, but the difficulty arises when there is a need to examine the significance of the difference
amongst more than two sample means at the same time. The ANOVA technique enables us to
perform those simultaneous tests and as such is considered to be an important tool of analysis in the
hands of a researcher. ANOVA is essentially a procedure for testing the difference among different
groups of data for homogeneity. The basic principle of ANOVA is to test for differences among the
means of the populations by examining the amount of variation within each of these samples, relative
to the amount of variation between the samples

Examples involving more than two populations:-

• Effectiveness of different promotional devices in terms of sales

• Production volume in different shifts in a factory


• Yield from plot of land due to varieties of seeds, fertilizers, and cultivation methods.


Assumptions for analysis of variance:


1. Each population has a normal distribution.

2. The populations from which the samples are drawn have equal variances, i.e.,
   σ1² = σ2² = … = σk²
3. Each sample is drawn randomly and is independent of the other samples.

9.2 IMPORTANT TERMS USED IN ANOVA:

1. Response variable
2. Factor (or criterion)
3. Treatment
Take an example to understand the three key terms.
Ex: Production volume in different shifts in a factory.
• In the above example there are two variables: the day of the week and the volume of production in
  each shift. If one of the objectives is to determine whether the mean production volume is the same
  during the days of the week, then the variable of interest, i.e. the response, is the mean production volume.
• The variables, quantitative or qualitative, that are related to the response variable are called
  factors; i.e. the day of the week is the factor, or independent variable.
• The value assumed by a factor in an experiment is called a level.
• The combinations of levels of the factors for which the response will be observed are
  called treatments. In this example the days of the week are the treatments.

9.3 APPROACH TO ANALYSIS OF VARIANCE:

It involves two steps:

1) The amount of variation among the sample means or the variation attributable to the
difference among sample means. This variation is due to assignable causes.
2) The amount of variation within the sample observations. This difference is considered
due to chance causes or experimental (random) errors.


The observations in the sample data may be classified according to one factor or two factors.
The classifications according to one factor and two factors are called one-way classification
and two-way classification, respectively.

9.4 ONE WAY CLASSIFICATION TO TEST EQUALITY OF POPULATION MEANS:

Many business applications involve experiments in which different populations (or groups) are
classified with respect to only one attribute of interest such as:

(i) Percentage of marks secured by students in a course.


(ii) Flavor preference of ice-cream by customers.

9.4.1 STEPS FOR TESTING NULL HYPOTHESIS:

1st step:

State hypotheses to test the equality of population means as:


H0: μ1 = μ2 = … = μk (Null hypothesis)
H1: Not all μ's are equal (Alternative hypothesis)
α = level of significance

2nd step:

Calculate the total variation (SST):

SST = SSB + SSW

where S.S.B. = sum of squares between the samples and S.S.W. = sum of squares within the samples.

• S.S.B. (sum of squares between the samples):

  S.S.B. = Σ n (x̄ − x̿)²

  where x̄ is the mean of a particular sample, x̿ is the grand mean, and n is the number of
  observations in the corresponding sample.

• S.S.W. (sum of squares within the samples):

  S.S.W. = Σ Σ (x − x̄)²

(The above formulae are in their simplest form.)

3rd step:

Find out degrees of freedom associated with S.S.B and S.S.W., and then find out total degree of
freedom (df)
Total d.f= between samples df + Within samples (df)
Since k independent samples are being compared, (k − 1) degrees of freedom are associated with the
sum of squares between samples. As each of the k samples contributes (nj − 1) degrees of freedom
within itself, there are (n − k) degrees of freedom associated with the sum of squares within samples.
So, total d.f, (n-1) = (k-1) + (n-k)

4th step:
a) Find the mean sum of squares between the samples: M.S.B = S.S.B / (k − 1)
b) Find the mean sum of squares within the samples: M.S.W = S.S.W / (n − k)
c) Find the mean sum of squares for the total: M.S.T = S.S.T / (n − 1)
5th step:


Apply F-test statistic with (k-1) degrees of freedom for the numerator and (n-k) degrees of freedom for
the denominator.
F-ratio = σ²(between) / σ²(within) = [ SSB / (k − 1) ] / [ SSW / (n − k) ] = MSB / MSW

6th step:

Make decision regarding Null Hypothesis


If calculated value of F-test statistic is more than its critical value- reject the Ho otherwise accept Ho

Rejection region for Null hypothesis using ANOVA


General arrangement of the ANOVA table for one factor analysis of variance

Source of variation    Sum of squares    Degrees of freedom    Mean squares          Test statistic (F-value)
Between samples        SSB               k − 1                 MSB = SSB/(k − 1)     F = MSB/MSW
Within samples         SSW               n − k                 MSW = SSW/(n − k)
Total                  SST               n − 1

9.4.2 Short-cut method:


The values of SSB and SSW can be calculated by applying the following short-cut method:
• Calculate the total T of the observations in all of the k samples:
  T = Σx1 + Σx2 + … + Σxk
• Calculate the correction factor: C.F. = T² / n,  where n = n1 + n2 + … + nk.
• Find the sum of the squares of all the observations in the k samples and subtract C.F. from this sum
  to obtain the total sum of squares of deviations, SST:
  S.S.T. = (ΣX1² + ΣX2² + … + ΣXk²) − C.F.
• Find S.S.B. through the formula
  S.S.B. = Σ [ (Σxj)² / nj ] − C.F.   (summed over the k samples)
• And S.S.W. = S.S.T. − S.S.B.

Now find the F-ratio as

F-ratio = (SSB / v1) / (SSW / v2),  where v1 = k − 1 and v2 = n − k.

Examples:

There are three main brands of a certain powder. A sample of 120 packets sold is examined and found
to be allocated among four groups A, B, C, D and Brands I, II, III as shown below:

Brand Group
A B C D
I 0 4 8 15
II 5 8 13 6
III 18 19 11 13

Using analysis of variance, find out whether there is any difference in brand preferences.
Solution:
H0: μ1 = μ2 = μ3
H1: At least one of these is not equal to the others.


Brand I                 Brand II                Brand III
x1        x1²           x2        x2²           x3        x3²
0         0             5         25            18        324
4         16            8         64            19        361
8         64            13        169           11        121
15        225           6         36            13        169
Σx1 = 27  Σx1² = 305    Σx2 = 32  Σx2² = 294    Σx3 = 61  Σx3² = 975
x̄1 = 6.75               x̄2 = 8                  x̄3 = 15.25

∴ T = Σx1 + Σx2 + Σx3 = 27 + 32 + 61 = 120

Correction factor, C.F. = T² / n = (120)² / 12 = 1200

SSB = sum of squares between the samples
    = [ (Σx1)²/n1 + (Σx2)²/n2 + (Σx3)²/n3 ] − C.F.
    = [ (27)²/4 + (32)²/4 + (61)²/4 ] − 1200
    = (182.25 + 256 + 930.25) − 1200
    = 168.5

And SST = total sum of squares
    = (Σx1² + Σx2² + Σx3²) − C.F.
    = (305 + 294 + 975) − 1200
    = 1574 − 1200
    = 374

∴ SSW = SST − SSB = 374 − 168.5 = 205.5


Degrees of freedom: v1 = k − 1 = 3 − 1 = 2;  v2 = n − k = 12 − 3 = 9

∴ MSB = SSB / v1 = 168.5 / 2 = 84.25
  MSW = SSW / v2 = 205.5 / 9 = 22.83

∴ ANOVA table can be formed as follows:

Source of variance    Sum of squares    Degrees of freedom    Mean squares    Test statistic
Between samples       168.5             2                     84.25           F = 84.25/22.83 = 3.69
Within samples        205.5             9                     22.83
Total                 374               11

So, the calculated value for F=3.69

The table value of F for α = 5% and v1 = 2, v2 = 9 is 4.26.

Here F_cal < F_critical.

So the null hypothesis is accepted, and we can say that there is no difference in brand preferences.
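The whole one-way ANOVA can be checked with a few lines of code. The sketch below assumes NumPy and SciPy are available and reuses the brand data; the names brand_1, brand_2 and brand_3 are illustrative only.

import numpy as np
from scipy import stats

brand_1 = np.array([0, 4, 8, 15])
brand_2 = np.array([5, 8, 13, 6])
brand_3 = np.array([18, 19, 11, 13])

# Manual sums of squares, following the short-cut method described above
samples = [brand_1, brand_2, brand_3]
n = sum(len(s) for s in samples)
k = len(samples)
cf = sum(s.sum() for s in samples)**2 / n                 # correction factor T^2 / n
sst = sum((s**2).sum() for s in samples) - cf
ssb = sum(s.sum()**2 / len(s) for s in samples) - cf
ssw = sst - ssb
f_manual = (ssb / (k - 1)) / (ssw / (n - k))

# Library check
f_scipy, p_value = stats.f_oneway(brand_1, brand_2, brand_3)

print(round(ssb, 1), round(ssw, 1))                       # 168.5, 205.5
print(round(f_manual, 2), round(f_scipy, 2))              # 3.69, 3.69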


Second method:

SSB = Σ n (x̄ − x̿)², so we first have to find the grand mean x̿:

x̿ = (x̄1 + x̄2 + x̄3) / 3 = (27/4 + 32/4 + 61/4) / 3 = 120 / 12 = 10

So, S.S.B = 4 × [ (27/4 − 10)² + (32/4 − 10)² + (61/4 − 10)² ]
          = 4 × [ (−3.25)² + (−2)² + (5.25)² ]
          = 168.5

And SSW = Σ Σ (x − x̄)²
  = {(0 − 6.75)² + (4 − 6.75)² + (8 − 6.75)² + (15 − 6.75)²}
  + {(5 − 8)² + (8 − 8)² + (13 − 8)² + (6 − 8)²}
  + {(18 − 15.25)² + (19 − 15.25)² + (11 − 15.25)² + (13 − 15.25)²}
  = (45.56 + 7.56 + 1.56 + 68.06) + (9 + 0 + 25 + 4) + (7.56 + 14.06 + 18.06 + 5.06)
  = 122.74 + 38 + 44.74
  = 205.48 ≈ 205.5

Now the ANOVA table


Source of variance    Sum of squares    Degrees of freedom    Mean squares           Test statistic
Between samples       168.5             2                     168.5/2 = 84.25        F = 84.25/22.83 = 3.69
Within samples        205.5             9                     205.5/9 = 22.83
Total                 374               11

v1 = k − 1 = 3 − 1 = 2;  v2 = n − k = 12 − 3 = 9

As the calculated F value is less than the table F value, there is no difference in brand preference.
And this we can say at the 5% level of significance.

9.5 TWO-WAY CLASSIFICATION TO TEST EQUALITY OF POPULATION MEANS:

In the one way classification we partition the total variation into two components: variation among the
samples and variation within the samples (due to random error)

However, there is a possibility that some of the variation assigned to random error in the one-way
analysis was not due to random error or chance but to some other measurable factor. If so, that
accountable variation is included in SSW and makes the mean sum of squares within samples (MSW)
a little too large. Consequently, the F-value becomes smaller than it should be, and the null hypothesis
may fail to be rejected even when real differences exist.

So, in the two-way classification we can partition total variation in the sample data as shown below:

Total variation (SST) is partitioned into:
• Variation between samples (SSB)
• Variation within samples (SSW), which is further split into:
  - unwanted variation due to differences between block means: the block sum of squares (SSR), and
  - actual variation due to random error: the sum of squares of error (SSE).


ANOVA table for two-way classification:-

Source of variance    Sum of squares    Degrees of freedom    Mean squares                 Test statistic
Between columns       SSC               c − 1                 MSC = SSC/(c − 1)            F1 = MSC/MSE
Between rows          SSR               r − 1                 MSR = SSR/(r − 1)            F2 = MSR/MSE
Residual error        SSE               (c − 1)(r − 1)        MSE = SSE/[(c − 1)(r − 1)]
Total                 SST               n − 1

(In this table 'c' stands for columns and 'r' stands for rows.)

The test statistic F for analysis of variance is given by

F1= MSC/MSE (if MSC>.MSE)


= MSE/MSC (if MSE> MSC)

F2= MSR/MSE (if MSR>MSE)


= MSE/MSR (if MSE>MSR)

So, if F_cal < F_table, accept the null hypothesis (H0); otherwise reject H0.

ANOVA with more than one dependent variable is called MANOVA.

Example:

To study the performance of three detergents and three different water temperature, the following
ÂwhitenessÊ reading were obtained with specially designed equipment:


Water temperature    Detergent A    Detergent B    Detergent C
Cold water           57             55             67
Warm water           49             52             68
Hot water            54             46             58

Perform a two-way analysis of variance using 5 per cent level of significance.

Ans: H0: There is no significant difference in the performance of the 3 detergents due to water
temperature, and vice-versa.

Coded data (on subtraction of 50)

Water temperature    A (x1)    x1²    B (x2)    x2²    C (x3)    x3²    Row sum
Cold water           +7        49     +5        25     17        289    29
Warm water           −1        1      +2        4      18        324    19
Hot water            +4        16     −4        16     8         64     8
Column sum           10        66     3         45     43        677    56

Here, T=sum of all observations in three samples of detergents=56

Correction factor (C.F.) = (56)² / 9 = 348.44

SSC = sum of squares between detergents (columns)
    = [ (10)²/3 + (3)²/3 + (43)²/3 ] − C.F.
    = 33.33 + 3 + 616.33 − 348.44
    = 304.22

SSR = sum of squares between water temperatures (rows)
    = [ (29)²/3 + (19)²/3 + (8)²/3 ] − C.F.
    = 280.33 + 120.33 + 21.33 − 348.44
    = 73.55

SST = total sum of squares
    = (Σx1² + Σx2² + Σx3²) − C.F.
    = (66 + 45 + 677) − 348.44
    = 439.56

SSE = SST − (SSC + SSR) = 439.56 − (304.22 + 73.55) = 61.79

Thus, MSC = SSC/(c − 1) = 304.22/2 = 152.11
      MSR = SSR/(r − 1) = 73.55/2 = 36.775
      MSE = SSE/[(c − 1)(r − 1)] = 61.79/4 = 15.447

Source of variation              Sum of squares    Degrees of freedom    Mean squares    Variance ratio
Between detergents (columns)     304.22            2                     152.11          F1 = 152.11/15.447 = 9.847
Between temperatures (rows)      73.55             2                     36.775          F2 = 36.775/15.447 = 2.38
Residual error                   61.79             4                     15.447
Total                            439.56            8


a) The F table value for df1 = 2, df2 = 4 and α = 0.05 is 6.94, which is less than the calculated value
F1 = 9.847. Hence, we conclude that there is a significant difference between the performance of the
three detergents.
b) Since the calculated value F2 = 2.38 at df1 = 2, df2 = 4 and α = 0.05 is less than its table value
6.94, the null hypothesis is accepted. Hence we conclude that water temperature does not make a
significant difference to the performance of the detergents.
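The two-way sums of squares can be computed directly from the data matrix. The following NumPy sketch (an illustration only, assuming NumPy is installed) reproduces the table above from the raw whiteness readings; subtracting 50 as in the coded table does not change any sum of squares, so the raw values are used here.

import numpy as np

# Whiteness readings: rows = water temperature, columns = detergents A, B, C
x = np.array([[57, 55, 67],
              [49, 52, 68],
              [54, 46, 58]], dtype=float)

n = x.size
cf = x.sum()**2 / n                                   # correction factor
sst = (x**2).sum() - cf
ssc = (x.sum(axis=0)**2 / x.shape[0]).sum() - cf      # between columns (detergents)
ssr = (x.sum(axis=1)**2 / x.shape[1]).sum() - cf      # between rows (temperatures)
sse = sst - ssc - ssr

dfc, dfr = x.shape[1] - 1, x.shape[0] - 1
dfe = dfc * dfr
f1 = (ssc / dfc) / (sse / dfe)                        # detergents
f2 = (ssr / dfr) / (sse / dfe)                        # water temperatures

print(round(ssc, 2), round(ssr, 2), round(sse, 2))    # 304.22, 73.56, 61.78 (matches the table up to rounding)
print(round(f1, 2), round(f2, 2))                     # approximately 9.85 and 2.38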

Example: In a certain factory production can be accomplished by four different workers on five
different types of machines. A sample study in the context of a two-way design without repeated
values is being made with two fold objectives of examining whether the four workers differ with
respect to mean productivity and whether the mean productivity is the same for the five different
machines. The researcher involved in this study reports while analyzing the gathered data as under:

a) Sum of squares for variance between machines =(SSC)=35.2


b) Sum of squares of variance between workmen (SSR)=53.8
c) Sum of squares for total variance (SST)=174.2

∴ Residual error (SSE) = SST − (SSC + SSR)
                       = 174.2 − (35.2 + 53.8)
                       = 174.2 − 89.0
                       = 85.2
For SSC:
Degrees of freedom (df1) = (c − 1) = 5 − 1 = 4
Degrees of freedom (df2) = (c − 1)(r − 1) = 4 × 3 = 12

For SSR:
Degrees of freedom (df1) = (r − 1) = 4 − 1 = 3
Degrees of freedom (df2) = (c − 1)(r − 1) = 4 × 3 = 12

Source of variation           Sum of squares    Degrees of freedom    Mean squares      Variance ratio
Between machines (columns)    35.2              (c − 1) = 4           35.2/4 = 8.8      F1 = 8.8/7.1 = 1.24
Between workmen (rows)        53.8              (r − 1) = 3           53.8/3 = 17.93    F2 = 17.93/7.1 = 2.53
Residual error                85.2              (c − 1)(r − 1) = 12   85.2/12 = 7.1
Total                         174.2             (n − 1) = 19
Level of significance, α =5% (given)

The table value of F for (4, 12) d.f. at the 5% level is 3.26, which is greater than the calculated
F1 = 1.24, so the null hypothesis is accepted: mean productivity does not differ for the five different
machines. Similarly, the table value of F for (3, 12) d.f. is 3.49, which is greater than the calculated
F2 = 2.53, so the null hypothesis is also accepted: there is no difference between the workers with
respect to their productivity.

9.6 SUMMARY:
This chapter discusses the important concepts of analysis of variance. ANOVA is
essentially a procedure for testing the difference amongst more than two sample means at the
same time. This technique is an important tool in the hands of a researcher and is
extremely useful for research in the fields of economics, biology, education, sociology,
business and industry, and in several other disciplines. One-way and two-way classifications
to test the equality of population means have been explained.

9.7 GLOSSARY:


Analysis of variance (ANOVA): A statistical technique used to test the equality of


three or more sample means.

Between-column variance: An estimate of the population variance derived


from the variance among the sample means.

F-distribution: A continuous distribution that has two parameters
(df for the numerator and df for the denominator).
It is mainly used to test hypotheses concerning
variances.

F-ratio: In ANOVA, it is the ratio of between-column


variance to within column variance.

Grand mean: In ANOVA, it is the mean of all observations


across all treatment groups.

Mean Square between Samples (MSB): A measure of the variation among means of
samples taken from different populations.

Mean square within Samples (MSW): A measure of the variation within data of all
samples taken from different populations.

One-way ANOVA: The analysis of variance technique that analyses


one variable only.

SSB: The sum of squares between samples. Also


called the sum of squares of the factor or
treatment

Two-way ANOVA: The analysis of variance technique that involves


two-factor experiments.

Within-column variance: An estimate of the population variance based on


the variances within the k samples, using a
weighted average of the k sample variances.

.
9.8 REFERENCES:
1. Gupta, S.C., and Kapoor, V.K., "Fundamentals of Mathematical Statistics", Sultan Chand and Sons, New Delhi, 1997, 9th Edition.
2. Gupta, S.C., "Fundamentals of Statistics", Himalaya Publishing House, New Delhi, 2004.
3. Murray R. Spiegel and Larry J. Stephens, "Statistics - Schaum's Outlines", Third Edition, McGraw-Hill International Editions, 1999.
4. Levin, Richard I. and Rubin, David S., "Statistics for Management", PHI, New Delhi, 2000.

9.9 REVIEW EXERCISE:

1. Describe the procedure involved in the analysis of variance.

2. How is analysis of variance useful to business?

3. Three varieties of wheat, A, B and C, were treated with four different fertilizers, 1, 2, 3 and 4, and
the yields of wheat per acre were as follows:

Varieties of wheat
Fertilizers A B C Total
1 55 72 47 174
2 64 66 53 183
3 58 57 74 189
4 59 57 58 174

Perform an analysis of variance on the above data and interpret the results.


4. The following table gives the data on the performance of three different detergents at three different
water temperatures. The performance was obtained on the ÂwhitenessÊ readings based on specially
designed equipment for nine loads of washing:

detergent A detergent B detergent C


Cold water 45 43 55
Warm Water 37 40 56
Hot water 42 44 46

Perform a two-way analysis of variance, using the level of significance α =0.05

5. Consider the following ANOVA table, based on information obtained for three randomly selected
samples from three independent populations, which are normally distributed with equal variances.

Source of variation    Sum of squares (SS)    Degrees of freedom (df)    Mean squares (MS)    Value of the test statistic
Between samples        60                                                20                   F =
Within samples
Total                                         14

a) Complete the ANOVA table by filling in missing values.

b) Test the null hypothesis that the means of the three populations are all equal, using
0.01 level of significance.


6. The following represent the number of units of production per day turned out by four different
workers using five different types of machines:

Machine Type
Worker A B C D E Total
1 4 5 3 7 6 25
2 5 7 7 4 5 28
3 7 6 7 8 8 36
4 3 5 4 8 2 22
Total 19 23 21 27 21 111

On the basis of this information, can it be concluded that (i) the mean productivity is the same for
the different machines, and (ii) the workers don't differ with regard to productivity?

7. Set up an analysis of variance table for the following per acre production data for three varieties of
wheat, each grown on 4 plots and state if the variety differences are significant.

Per acre production data


Plot of land Variety of wheat
A B C
1 6 5 5
2 7 5 4
3 3 3 3
4 8 7 4


8. Three varieties of wheat W1, W2 and W3 are treated with four different fertilizers viz., f1, f2, f3 and
f4. The yields of wheat per acre were as under:

Fertilizer treatment Varieties of wheat


W1 W2 W3 Total
f1 55 72 47 174
f2 64 66 53 183
f3 58 57 74 189
f4 59 57 58 174
Total 236 252 232 720

Set up a table for the analysis of variance and work out the F-ratios in respect of the above. Are the F-
ratios significant?

9. Apply the technique of Analysis of Variance to the following data, relating to yields of four
varieties of wheat in three blocks:

Varieties Blocks
1 2 3
I 10 9 8
II 7 7 6
III 8 5 4
IV 5 4 4

10.Three different methods of teaching Statistics are used on three groups of students. Random
samples of size 5 are taken from each group and the results are shown below. The grades are on a 10-
point scale.


Group A GroupB Group C


7 3 4
6 6 7
7 5 5
7 4 4
8 7 8

Determine on the basis of the above data whether there is a difference in the teaching methods.

by

Dr.B.Raja Shekhar

Reader

School of Management Studies

University of Hyderabad.


10. NON-PARAMETRIC TESTS

10.0 OBJECTIVES:

After reading this lesson, you should be able to


• Understand the relevance of nonparametric tests in data analysis
• Understand the procedure involved in carrying out nonparametric tests
• Design and conduct some selected nonparametric tests
• Understand the meaning and uses of the Chi-square test

10.1 INTRODUCTION:

The majority of hypothesis tests discussed so far have made inferences about population parameters, such as the mean
and the proportion. These parametric tests use the parametric statistics of samples that came
from the population being tested. To formulate these tests, restrictive assumptions were made about the
populations. But populations are not always normal, and even if a goodness-of-fit test indicates that a
population is approximately normal, there are certain situations in which the use of the normal curve is
not appropriate.

In recent times statisticians have developed useful techniques that do not make restrictive assumptions
about the shape of population distributions. These are known as distribution-free or, more commonly,
non-parametric tests. The hypotheses of a non-parametric test are concerned with something other
than the value of a population parameter. A large number of non-parametric tests exist. The more
widely used ones are:
1. Chi-square test ( χ 2 test)
2. The sign test
a) One-sample sign test
b) Paired-sample sign test
3. A rank sum test
a) The Mann-Whitney U test
b) The kruskal-Wallis test.


10.2 Characteristics of Distribution-free or Nonparametric Tests:

From what has been stated above in respect of important non-parametric tests, we can say that these
tests share in main the following characteristics:
1. They do not suppose any particular distribution and the consequential assumptions.
2. They are rather quick and easy to use i.e., they do not require laborious computations since in
many cases the observations are replaced by their rank order and in many others we simply
use signs.
3. They are often not as efficient or 'sharp' as parametric tests of significance. An
   interval estimate with 95% confidence may be twice as large with a nonparametric
   test as with regular standard methods. The reason is that these tests do not use all the
   available information but rather use groupings or rankings, and the price we pay is a loss in
   efficiency. In fact, when we use non-parametric tests, we make a trade-off: we lose
   sharpness in estimating intervals, but we gain the ability to use less information and to
   calculate faster.
4. When our measurements are not as accurate as is necessary for standard tests of significance,
then non-parametric methods come to our rescue which can be used fairly satisfactorily.
5. Parametric tests cannot apply to ordinal or nominal scale data but non-parametric tests do not
suffer from any such limitation.
6. The parametric tests of difference, like 't' or 'F', make an assumption about the homogeneity of
   the variances, whereas this is not necessary for non-parametric tests of difference.

10.2.1 ADVANTAGES OF NON -PARAMETRIC TESTS.

1. Non-parametric tests are distribution free, i.e. they do not require any assumption to be made
about population following normal or any other distribution.
2. Generally they are simple to understand and easy to apply when the sample sizes are small
3. Sometimes even formal ordering or ranking is not required.
4. Many non-parametric methods make it possible to work with very small samples.
5. Non-parametric methods make fewer and less stringent assumptions than do the classical
procedures.


10.2.2 DISADVANTAGES OF NON-PARAMETRIC TESTS:

1. They ignore a certain amount of information.


2. They are often not as efficient or „sharp‰ as parametric tests.

10.3 Chi-square test ( χ 2 TEST):

The χ² test (pronounced chi-square test) is one of the simplest and most widely used non-parametric
tests in statistical work. The symbol χ is the Greek letter chi. The χ² test was first used
by Karl Pearson in the year 1900. The quantity χ² describes the magnitude of the discrepancy between
theory and observation. It is defined as

χ² = Σ [ (O − E)² / E ]
Where O refers to the observed frequencies and E refers to the expected frequencies.

10.3.1 Steps involved in applying Chi-square test:


The various steps involved are as follows:
(i) Calculate the expected frequencies. In general frequency for any cell can be calculated from the
following equation.
RT × CT
E=
N
E= Expected frequency
RT= the row total for the row containing the cell
CT= the column total for the column containing the cell.
N= the total number of observations.

(ii) Take the differences between the observed and expected frequencies and obtain the squares of
these differences, i.e. obtain the values of (O − E)².

(iii) Divide each value of (O − E)² obtained in step (ii) by the respective expected frequency and
obtain the total:


χ² = Σ [ (O − E)² / E ]

This gives the value of χ², which can range from zero to infinity. If χ² is zero, it means that the
observed and expected frequencies completely coincide. The greater the discrepancy between the
observed and expected frequencies, the greater the value of χ².

The calculated value of χ 2 is compared with the table value of χ 2 for given degree of freedom at a
certain specified level of significance. If at the stated level (generally 5% levels is selected) the
calculated value of χ 2 is more than the table value of χ 2 the difference between theory and
observation is considered to be significant, (i.e). it could not have arisen due to fluctuations of simple
sampling. If on the other hand, the calculated value of χ 2 is less than the table value, the difference
between theory and observation is not considered as significant, (i.e) it is regarded as due to
fluctuations of simple sampling and hence ignored.

10.3.2 Conditions for the Application of χ 2 Test:

The following conditions should be satisfied before χ 2 test can be applied


(i) Observations recorded and used are collected on a random basis
(ii) All the items in the sample must be independent.
(iii) No group should contain very few items, say less than 10. In case where the frequencies
are less than 10, regrouping is done by combining the frequencies of adjoining groups so
that the new frequencies become greater than 10. Some statisticians take this number as 5,
but 10 is regarded as better by most of the statisticians.
(iv) The overall number of items must also be reasonably large. It should normally be at least
50,howsoever small the number of groups may be.
(v) The constraints must be linear. Constraints which involve linear equations in the cell
frequencies of a contingency table (i.e equations containing no squares or higher powers
of the frequencies) are known as linear constraints.


10.3.3 Important Characteristics of χ 2 Test:

(i) This test (as a non-parametric test) is based on frequencies and not on the parameters like
mean and standard deviation.
(ii) The test is used for testing the hypothesis and is not useful for estimation.
(iii) The test possesses the additive property as has already been explained
(iv) The test can also be applied to a complex contingency table with several classes and as
such is a very useful test in research work.
(v) This test is an important non-parametric test as no rigid assumptions are necessary in
regard to the type of population, no need of parameter values and relatively less
mathematical details are involved.

10.3.4 DEGREES OF FREEDOM:

While comparing the calculated value of χ 2 with the table value we have to determine the degrees of
freedom. By degrees of freedom we mean the number of classes to which the values can be assigned
arbitrarily or at will without violating the restrictions or limitations placed.

For example: If we are to choose any five numbers whose total is 100, we can exercise our
independent choice for any four numbers only, the fifth number is fixed by virtue of the total being
100 as it must be equal to 100 minus the total of the four numbers selected.
V=n-k
V → df
K → Number of independent constraints.
The following points about the χ 2 test are worth noting:
1. The sum of the differences between the observed and expected frequencies is always zero.
Symbolically, Σ(O − E) = ΣO − ΣE = N − N = 0

2. χ 2 Distribution is a limiting approximation of the multinomial distribution.


3. Though χ 2 distribution is essentially a continuous distribution the χ 2 test can be applied to
discrete random variables whose frequencies can be counted and tabulated with or without grouping.


10.3.5 Uses of χ 2 Test:

The χ 2 test is one of the most popular statistical inference procedures today. It is applicable to a
very large number of problems in practice which can be summed up under the following heads:
1) χ 2 test as a test of independence

2) χ 2 as a test of goodness of fit


3) χ 2 as a test of homogeneity
Examples:
1. In an anti-malarial campaign in a certain area, quinine was administered to 812 persons out of
a total population of 3,248. The number of fever cases is shown below:

Treatment Fever No Fever Total


Quinine 20 792 812
No Quinine 220 2216 2436
Total 240 3008 3248

Discuss the usefulness of quinine in checking malaria.


Solution:-
Let us take the hypothesis that quinine is not effective in checking malaria.
Applying χ 2 test:

Expectation of (AB) = (A) × (B) / N = (240 × 812) / 3248 = 60

i.e. E1, the expected frequency corresponding to the first row and first column, is 60.
The table of expected frequencies shall be:

60     752     812
180    2256    2436
240    3008    3248


O       E       (O − E)²    (O − E)²/E
20      60      1600        26.667
220     180     1600        8.889
792     752     1600        2.128
2216    2256    1600        0.709
                            Σ(O − E)²/E = 38.393

χ² = Σ [ (O − E)² / E ] = 38.393
ν = (r − 1)(c − 1) = (2 − 1)(2 − 1) = 1
For d.f. = 1, χ²0.05 = 3.84

Since the calculated value of χ² is greater than the table value, the hypothesis is rejected. Hence,
quinine is useful in checking malaria.
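The same contingency-table test is available in SciPy. The sketch below assumes NumPy and SciPy are installed; note that SciPy applies Yates' continuity correction to 2 × 2 tables by default, so correction=False is used to reproduce the plain chi-square computed above.

import numpy as np
from scipy import stats

# Observed 2 x 2 table: rows = quinine / no quinine, columns = fever / no fever
observed = np.array([[20, 792],
                     [220, 2216]])

chi2, p_value, dof, expected = stats.chi2_contingency(observed, correction=False)

print(expected)                 # [[60, 752], [180, 2256]]
print(round(chi2, 2), dof)      # 38.39, 1
print(p_value)                  # very small p-value -> reject independence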

2. In an agro-economic survey, information was collected on 1,000 randomly selected fields about the
tenancy status of their cultivation and the use of fertilizers. The following classification was noted:

Owned Rented Total


Using fertilizers 416 184 600
Not using fertilizers 64 336 400
Total 480 520 1000


Would you conclude that owner cultivators are more inclined towards the use of fertilizers at 5%
level? Carry out Chi-Square test as per testing procedure.

Solution:-

Let us take the hypothesis that ownership of fields and the use of fertilizers are independent attributes.

Expectation of (AB) = (A) × (B) / N = (480 × 600) / 1000 = 288

Expected frequencies:
288    312    600
192    208    400
480    520    1000

Applying the χ² test:
O      E      (O − E)²    (O − E)²/E
416    288    16,384      56.889
64     192    16,384      85.333
184    312    16,384      52.513
336    208    16,384      78.769

χ² = Σ [ (O − E)² / E ] = 273.504
ν = (2 − 1)(2 − 1) = 1
For ν = 1, χ²0.05 = 3.84

Since the calculated value of χ² is far greater than the table value, the hypothesis of independence is
rejected: owner cultivators are more inclined towards the use of fertilizers.

10.4 THE SIGN TEST:

The sign test is the simplest of the non-parametric tests. The test is known as the sign test as it is based
on the direction of the plus or minus signs of observations in a sample instead of their numerical
values. The sign test can be of two types:
1. The one-sample sign test
2. The paired-sample sign test.


In any problem in which the sign test is used, we count:

the number of + signs,
the number of − signs, and
the number of zeros (which are discarded).

H0: P = 0.5 (Null hypothesis)
HA: P ≠ 0.5 (Alternative hypothesis)

For a two-sided alternative at α = 0.05, the test statistic can conveniently be found from the expression

Z = (X − np) / √(npq)

where X is the number of plus signs; the Z table is used for interpretation.

10.4.1 One-sample sign test :

In a one-sample sign test we test the null hypothesis μ = μ0 against an appropriate alternative on the
basis of a random sample of size n. Each sample value greater than μ0 is replaced with a plus sign and
each sample value less than μ0 with a minus sign; sample values exactly equal to μ0 are discarded. We
then test the null hypothesis that these plus and minus signs are values of a random variable having the
binomial distribution with p = 0.5.
Example:
It is required to test the hypothesis that the mean sales per day (μ) are 20 against the alternative
hypothesis μ ≠ 20. Fifteen observations were taken and the following results were obtained:
18, 19, 25, 21, 16, 15, 19, 22, 24, 21, 18, 17, 15, 26 and 24.
Level of significance= 0.05.

Solution:

n=15

Replace each value greater than 20 with a plus (+) sign and each value less than 20 with a

minus (-) sign,


−  −  +  +  −  −  −  +  +  +  −  −  −  +  +
Number of plus signs=7=X


Number of minus signs=8

H0: P= 0.5 (Null hypothesis)


HA: PÆ 0.5 (Alternative hypothesis)

Z = (X − np) / √(npq)
  = (7 − 15 × 0.5) / √(15 × 0.5 × 0.5)
Z = −0.26
Since calculated Z= -0.26 lies between Z=-1.96 and Z=1.96 (the critical value of Z at 0.05 level of
significance), the null hypothesis is accepted.
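The sign counting and the normal approximation are easy to script. The sketch below is illustrative only and assumes NumPy and SciPy; the exact binomial check at the end uses scipy.stats.binomtest, which is available in recent SciPy versions.

import numpy as np
from scipy import stats

sales = np.array([18, 19, 25, 21, 16, 15, 19, 22, 24, 21, 18, 17, 15, 26, 24])
mu0 = 20

plus = np.sum(sales > mu0)                 # 7
minus = np.sum(sales < mu0)                # 8
n = plus + minus                           # values equal to 20 would be discarded

# Normal approximation used in the text: Z = (X - np) / sqrt(npq) with p = q = 0.5
z = (plus - n * 0.5) / np.sqrt(n * 0.5 * 0.5)
print(int(plus), int(minus), round(z, 2))  # 7 8 -0.26

# An exact alternative: the binomial test on the number of plus signs
p_exact = stats.binomtest(int(plus), n=int(n), p=0.5).pvalue
print(p_exact)                             # well above 0.05 -> do not reject H0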

10.4.2 Paired-sample sign test:

The paired-sample sign test involves paired data, such as data on the collection of accounts
receivable before and after a new collection policy, or the responses of father and son towards ideal
family size. In these problems, each pair of sample values is replaced with a plus sign if the first
value is greater than the second and a minus sign if the first value is smaller than the second; we then
proceed in the same manner as in the one-sample sign test.

Example:
The following data relate to the downtimes (periods, in minutes, during which the computers were
inoperative on account of failures) of two different computers. Test whether the downtimes of the two
computers are the same or different.

Computer A: 58 60 42 62 65 59 60 52 50 75 59 52 57 30 46 66 40 78 55 52 58 44
Computer B: 32 48 50 41 45 40 43 43 70 60 80 45 36 56 40 70 50 53 50 30 42 45


Solution:
These data are shown in the below table, along with + or - sign as may be applicable in case of each
pair of values. A plus sign is assigned when the downtime for computer A is greater than that for
computer B and a minus sign is given when the down time for computer B is greater than that for
computer A.
Computer 58 60 42 62 65 59 60 52 50 75 59
A
Computer 32 48 50 41 45 40 43 43 70 60 80
B
Sign + + - + + + + + - + -

Computer 52 57 30 46 66 40 78 55 52 58 44
A
Computer 45 36 56 40 70 50 53 50 30 42 45
B
Sign + + - + - - + + + + -

Number of plus signs=15=X


Number of minus signs=7
H0: P= 0.5 (Null hypothesis)
HA: PÆ 0.5 (Alternative hypothesis)

Z = (X − np) / √(npq)
  = (15 − 22 × 0.5) / √(22 × 0.5 × 0.5)
Z = 1.71
Since the calculated Z = 1.71 lies between Z = −1.96 and Z = +1.96 (the critical values of Z at the 0.05
level of significance from the normal distribution table), the null hypothesis is accepted. This indicates
that the downtime of the two computers is the same.
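A sketch of the paired version follows directly from the one-sample case: form the differences, count the signs, and apply the same normal approximation. It assumes only NumPy; computer_a and computer_b hold the data from the example.

import numpy as np

computer_a = np.array([58, 60, 42, 62, 65, 59, 60, 52, 50, 75, 59,
                       52, 57, 30, 46, 66, 40, 78, 55, 52, 58, 44])
computer_b = np.array([32, 48, 50, 41, 45, 40, 43, 43, 70, 60, 80,
                       45, 36, 56, 40, 70, 50, 53, 50, 30, 42, 45])

diff = computer_a - computer_b
plus = np.sum(diff > 0)                    # 15
minus = np.sum(diff < 0)                   # 7
n = plus + minus                           # ties would be discarded

z = (plus - n * 0.5) / np.sqrt(n * 0.5 * 0.5)
print(int(plus), int(minus), round(z, 2))  # 15 7 1.71 -> inside (-1.96, 1.96), accept H0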


10.5 A RANK SUM TEST:

Rank sum tests are a whole family of tests. Although there are a number of rank sum tests, only the
two widely used ones, the Mann-Whitney U test and the Kruskal-Wallis test, are discussed here. When
only two populations are involved, the Mann-Whitney U test is used; the Kruskal-Wallis test is used
when more than two populations are involved. These tests enable us to determine whether
independent samples have been drawn from the same population or not.

10.5.1 THE MANN-WHITNEY U TEST:

A nonparametric method used to determine whether two independent samples have been drawn from
populations with the same distribution.

a) Mann-Whitney test (or U-test): This is a very popular test amongst the rank sum tests. This test is
used to determine whether two independent samples have been drawn from the same population or
not. It uses more information than the sign test. This test applies under very general conditions and
requires only that the populations sampled are continuous. However, in practice even the violation of
this assumption does not affect the results very much.

To perform this test, rank the data jointly, taking them as belonging to a single sample in either an
increasing or decreasing order of magnitude. We usually adopt low to high ranking process which
means; assign rank 1 to an item with lowest value, rank 2 to the next higher item and so on. In case
there are ties, then we would assign each of the tied observation the mean of the ranks, which they
jointly occupy. For example, if sixth, seventh and eighth values are identical, we would assign each
the rank (6+7+8)/3=7. After this we find the sum of the ranks assigned to the values of the second
sample (and call it R2). Then we work out the test statistic i.e., U which is a measurement of the
difference between the ranked observations of the two samples as under:
U = n1 n2 + n1(n1 + 1)/2 − R1

or

U = n1 n2 + n2(n2 + 1)/2 − R2
Where n1 and n2 are the sample sizes and R1 is the sum of ranks assigned to the values of the first
sample. (In practice, whichever rank sum can be conveniently obtained can be taken as R1, since it is
immaterial which sample is called the first sample.)

In applying U-test we take the null hypothesis that the two samples come from identical populations.
If this hypothesis is true, it seems reasonable to suppose that the means of the ranks assigned to the
values of the two samples should be more or less the same. Under the alternative hypothesis, the
means of the two populations are not equal and if this is so, then most of the smaller ranks will go to
the values of one sample while most of the higher ranks will go to those of the other sample.
If the null hypothesis that the n1 + n2 observations came from identical populations is true, the 'U'
statistic has a sampling distribution with

Mean: μU = n1 n2 / 2

and standard deviation (standard error):

σU = √[ n1 n2 (n1 + n2 + 1) / 12 ]
If n1 and n2 are sufficiently large (i.e., both greater than 10), the sampling distribution of U can be
approximated closely with normal distribution and the limits of the acceptance region can be
determined in the usual way at a given level of significance. But if either n1 or n2 is so small that the
normal curve approximation to the sampling distribution of U cannot be used, then exact tests may
be based on special tables, such as the one given in the appendix, showing selected values of Wilcoxon's
(unpaired) distribution.

Example:

Suppose that a state university wants to test the hypothesis that the mean scores of students at two
branches are equal. The board keeps statistics on all students at all branches of the system. A random
sample of 15 students from each branch produced the data below.


Branch A 1000 1120 800 750 1300 950 1050 1280 1400 850
1150 1200 1500 600 775

Branch B 920 1,120 850 1,360 650 725 890 1,600 900 1140
1550 550 1240 925 500

Solution:

To apply the Mann-Whitney U test to this problem we begin ranking all the scores in order from
lowest to highest.

Rank Score Branch Rank Score Branch


1 500 B 16 1100 A
2 550 B 17 1050 A
3 600 A 18 1100 A
4 650 B 19 1120 B
5 725 B 20 1140 B
6 750 A 21 1150 A
7 775 A 22 1200 A
8 800 A 23 1240 B
9 830 B 24 1250 A
10 850 A 25 1300 A
11 890 B 26 1360 B
12 900 B 27 1400 A
13 920 B 28 1500 A
14 925 B 29 1550 B
15 950 A 30 1600 B

n1=number of items in sample1, that is at Branch A


n2= number of items in sample 2, at Branch B
R1=Sum of the ranks of the item in sample 1.


R2= sum of the ranks of the item in sample 2


n1=15
n2=15
R1= 247
R2= 218

Using the values of n1 and n2 and of R1 and R2, determine the U statistic:

U = n1 n2 + n1(n1 + 1)/2 − R1
  = (15)(15) + (15)(16)/2 − 247 = 98  ← U statistic
The critical value of 'U' for n1 = 15, n2 = 15 at the 0.05 level of significance is U(table) = 64.
Since the critical value (64) is less than the calculated value (98), we accept the null hypothesis that
there is no difference between the scores of the two branches.
If a U table is not available, the problem can be solved by the normal approximation.

The Normal Approximation:

Mean of the sampling distribution of U:

μU = n1 n2 / 2 = (15)(15) / 2 = 112.5  ← mean of the U statistic

And the standard error of the U statistic:

σU = √[ n1 n2 (n1 + n2 + 1) / 12 ] = √[ (15)(15)(15 + 15 + 1) / 12 ] = √581.25 = 24.1  ← standard error

Z = (U − μU) / σU = (98 − 112.5) / 24.1 = −0.602
Table value of Z at 0.05 level of significance=1.96
Since the sample statistic lies within the acceptance region, the mean scores at the two branches are the
same (the same conclusion as earlier).
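A ready-made version of this test is available in SciPy. The sketch below is an illustration only, assuming SciPy is installed; note that SciPy assigns average ranks to tied scores (for example the two 850s and the two 1120s), so its U value can differ slightly from the hand ranking above, although the conclusion is the same.

from scipy import stats

branch_a = [1000, 1120, 800, 750, 1300, 950, 1050, 1280, 1400, 850,
            1150, 1200, 1500, 600, 775]
branch_b = [920, 1120, 850, 1360, 650, 725, 890, 1600, 900, 1140,
            1550, 550, 1240, 925, 500]

# Two-sided Mann-Whitney U test on the two independent samples
u_stat, p_value = stats.mannwhitneyu(branch_a, branch_b, alternative='two-sided')

print(u_stat, round(p_value, 3))   # large p-value -> no evidence of a difference between branches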

10.5.2 KRUSKAL-WALLIS TEST:

This test is conducted in a way similar to the U test described above. It is used to test the null
hypothesis that 'k' independent random samples come from identical universes against the alternative
hypothesis that the means of these universes are not equal. This test is analogous to one-way
analysis of variance but, unlike the latter, it does not require the assumption that the samples come
from approximately normal populations or from universes having the same standard deviation. It is a
nonparametric version of ANOVA.

To perform this test, we rank all the scores without regard to the groups to which they belong, and
the H statistic is calculated from the formula

H = [ 12 / (n(n + 1)) ] × [ R1²/n1 + R2²/n2 + … + Rk²/nk ] − 3(n + 1)

where n1, n2, …, nk = the number of observations in each of the 'k' samples,
n = n1 + n2 + … + nk, and
R1, R2, …, Rk = the rank sums of the samples.

Example:
Use the Kruskal-Wallis test at 5% level of significance to test the null hypothesis that a professional
bowler performs equally well with the four bowling balls, given the following results:

Bowling results in Five Games


With Ball No.A 271 282 257 248 262
With Ball No.B 252 275 302 268 276
With Ball No.C 260 255 239 246 266


With Ball No.D 279 242 297 270 258

Solution

To apply the H test or the Kruskal-Wallis test to this problem, we begin by ranking all the given
figures from the highest to the lowest, indicating besides each the name of the ball as under:
Bowling results Rank Name of the ball
associated
302 1 B
297 2 D
282 3 A
279 4 D
276 5 B
275 6 B
271 7 A
270 8 D
268 9 B
266 10 C
262 11 A
260 12 C
258 13 D
257 14 A
255 15 C
252 16 B
248 17 A
246 18 C
242 19 D
239 20 C

For finding the values of Ri, we arrange the above table as under:


Bowling results with the different balls and corresponding ranks:

Ball A    Rank    Ball B    Rank    Ball C    Rank    Ball D    Rank
271       7       252       16      260       12      279       4
282       3       275       6       255       15      242       19
257       14      302       1       239       20      297       2
248       17      268       9       246       18      270       8
262       11      276       5       266       10      258       13
n1=5  R1=52       n2=5  R2=37       n3=5  R3=75       n4=5  R4=46

Now we calculate the H statistic as under:

H = [ 12 / (n(n + 1)) ] Σ (Ri²/ni) − 3(n + 1)
  = [ 12 / (20 × 21) ] × [ 52²/5 + 37²/5 + 75²/5 + 46²/5 ] − 3(20 + 1)
  = (0.02857)(2362.8) − 63 = 67.51 − 63 = 4.51

As the four samples have five items each, the sampling distribution of H approximates closely to the
χ² distribution. Taking the null hypothesis that the bowler performs equally well with the four balls,
the table value of χ² for (k − 1) = 4 − 1 = 3 degrees of freedom at the 5% level of significance is 7.815.
Since the calculated value of H is only 4.51 and does not exceed the χ² value of 7.815, we accept the
null hypothesis and conclude that the bowler performs equally well with the four bowling
balls.
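SciPy provides this test directly. The sketch below assumes SciPy is available; since there are no tied scores in this data set, the library result matches the hand calculation of H exactly.

from scipy import stats

ball_a = [271, 282, 257, 248, 262]
ball_b = [252, 275, 302, 268, 276]
ball_c = [260, 255, 239, 246, 266]
ball_d = [279, 242, 297, 270, 258]

# Kruskal-Wallis H statistic and its chi-square p-value (3 degrees of freedom)
h_stat, p_value = stats.kruskal(ball_a, ball_b, ball_c, ball_d)

print(round(h_stat, 2), round(p_value, 3))   # 4.51, p ≈ 0.21 > 0.05 -> do not reject H0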


10.6 SUMMARY:
There are many situations in which the various assumptions required for standard tests of significance
(such as that population is normal, samples are independent, standard deviation is known, etc.) cannot
be met, and then we can use non-parametric methods. Moreover, they are easier to explain and easier
to understand. This is the reason why such tests have become popular. But one should not forget the
fact that they are usually less efficient/powerful as they are based on no assumption (or virtually no
assumption) and we all know that the less one assumes, the less one can infer from a set of data. But
then the other side must also be kept in view: the more one assumes, the more one limits the
applicability of one's methods.
10.7 GLOSSARY:
Chi-Square test:
A statistical technique used to test significance in
the analysis of frequency distribution.

Degrees Of Freedom: The number of elements that can be chosen


freely.

Kruskal-Wallis test:
A nonparametric method for testing the null
hypothesis that K independent random samples
come from identical populations. It is a direct
generalization of the Mann-Whitney test.

Mann-Whitney U test:
A non parametric test that is used to determine
whether two different samples come from
identical populations or whether these
populations have different means.

Non-parametric tests:
Tests that rely less on parameter estimation
and/or assumptions about the shape of a
population distribution.


One-Sample Runs test:


A nonparametric test used for determining
whether the items in a sample have been selected
randomly.

Rank Sum tests:


A group of nonparametric tests that are based on
the sum of ranks assigned to the observations in
samples.

Run:
A sequence of identical occurrences that may be
preceded and followed by different occurrences.
At times, they may not be preceded or followed
by any occurrences.

Sign test:
a non parametric test that takes into account the
difference between paired observations where (+)
and minus (-) signs are substituted for
quantitative values.

10.8 REFERENCES:
1) Levin, Richard, Statistics for Management (Prentice Hall of India, 1999)
2) Harnett, Introduction to Statistical Methods (Addison-Wesley Publishing Co.)
3) Gupta, S.P., Statistical Methods (Sultan Chand & Sons, New Delhi, 2005)

10.9 REVIEW EXERCISE:


1. Enumerate the different nonparametric tests and explain any two of them.

2. What are the major advantages of nonparametric methods over parametric methods?


3. A company has collected the following data on the average weekly loss of man-hours on account of
accidents in 8 plants over a period of six months, before and after an industrial safety programme was
put into operation, in order to ascertain the effectiveness of the programme:
72 and 59, 26 and 24, 125 and 120, 39 and 35, 54 and 43, 39 and 35, 13 and 15, 12 and 18.
Test the null hypothesis that the safety programme is not effective, using the two-sample sign test at
α = 0.05 level of significance.

4. A company used three different methods of advertising its product in three cities. It later found the
increased sales (in thousand rupees) in identical retail outlets in the three cities as follows:

City A 70 58 60 45 55 62 89 72
City B 65 57 48 55 75 68 45 52
City C 53 59 71 70 63 60 58 75

Use Kruskal-Wallis method to test the hypothesis that the mean increase in sales on account of three
different methods of advertising was the same in the retail outlets in A,B and C cities. Use 5 percent
level of significance.
5. The following data relate to the costs of building comparable lots in the two Resorts A and B (in
million rupees):

Resort A: 30.9 32.5 44.3 39.5 35.0 48.9


Resort B: 53.9 61.0 36.0 42.5 40.9 47.9

The company owning the resort area A claimed that the median price of building lots was less in area
A as compared to resort area B. You are asked to test this claim, using a nonparametric test with a 1
percent level of significance.

6. 1000 students at college level were graded according to their I.Q. and the economic conditions of
their homes. Use χ 2 -test to find out whether there is any association between economic condition at
home and I.Q.


Economic I.Q
conditions High Low Total
Rich 460 140 600
Poor 240 160 400
Total 700 300 1000

7. An experiment was conducted to test the efficacy of chloromycetin in checking typhoid. In a
certain hospital chloromycetin was given to 285 out of the 392 patients suffering from
typhoid. The numbers of typhoid cases were as follows:

Typhoid No Typhoid Total


Chloromycetin 35 250 285
No chloromycetin 50 57 107
Total 85 307 392

With the help of χ 2 , test the effectiveness of chloromycetin in checking typhoid.

8. A survey of 320 families with five children each revealed the following distribution:

No.of boys 5 4 3 2 1 0
No.of girls 0 1 2 3 4 5
No.of families 14 56 110 88 40 12

Is this distribution consistent with the hypothesis that male and female births are equally probable?
Apply Chi-square test.

9. Suppose that, playing four rounds of golf at the City Club, 11 professionals totalled
280, 282, 290, 273, 283, 283, 275, 284, 282, 279 and 281. Use the sign test at the 5% level of significance to
test the null hypothesis that professional golfers average μ = 284 for four rounds against the
alternative hypothesis μ < 284.


10. The following are the numbers of artifacts dug up by two archaeologists at an ancient cliff
dwelling on 30 days.

By X 1 0 2 3 1 0 2 2 3 0 1 1 4 1 2 1
By Y 0 0 1 0 2 0 0 1 1 2 0 1 2 1 1 0

By X 3 5 2 1 3 2 4 1 3 2 0 2 4 2
By Y 2 2 6 0 2 3 0 2 1 0 1 0 1 0

Use the sign test at the 1% level of significance to test the null hypothesis that the two archaeologists, X and Y, are equally good at finding artifacts against the alternative hypothesis that X is better.
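The paired sign test again reduces to counting signs of the daily differences, dropping the days on which the two found the same number of artifacts. A sketch assuming SciPy:

# Paired sign test: is archaeologist X better than Y?
from scipy.stats import binom

x = [1, 0, 2, 3, 1, 0, 2, 2, 3, 0, 1, 1, 4, 1, 2, 1,
     3, 5, 2, 1, 3, 2, 4, 1, 3, 2, 0, 2, 4, 2]
y = [0, 0, 1, 0, 2, 0, 0, 1, 1, 2, 0, 1, 2, 1, 1, 0,
     2, 2, 6, 0, 2, 3, 0, 2, 1, 0, 1, 0, 1, 0]

diffs = [a - b for a, b in zip(x, y)]
plus  = sum(d > 0 for d in diffs)            # days on which X found more
n     = sum(d != 0 for d in diffs)           # ties (zero differences) dropped

# H1: X is better, so the p-value is P(X >= plus) for X ~ Binomial(n, 0.5).
p_value = 1 - binom.cdf(plus - 1, n, 0.5)
print(plus, n, round(p_value, 4))            # 20 pluses out of 26, p ≈ 0.0047 < 0.01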

by

Dr. B. Raja Shekhar

Reader

School of Management Studies

University of Hyderabad.


Standard Normal (Z) Table

Area between 0 and z

z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767


2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
2.4 0.4918 0.4920 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932 0.4934 0.4936
2.5 0.4938 0.4940 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 0.4951 0.4952
2.6 0.4953 0.4955 0.4956 0.4957 0.4959 0.4960 0.4961 0.4962 0.4963 0.4964
2.7 0.4965 0.4966 0.4967 0.4968 0.4969 0.4970 0.4971 0.4972 0.4973 0.4974
2.8 0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979 0.4980 0.4981
2.9 0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985 0.4986 0.4986
3.0 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990
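
For readers who wish to verify or extend the table, the entries (the area under the standard normal curve between 0 and z) can be reproduced numerically; a minimal sketch assuming SciPy:

# Reproducing "area between 0 and z" entries of the standard normal table.
from scipy.stats import norm

for z in (0.55, 1.00, 1.96, 2.58):
    area = norm.cdf(z) - 0.5       # area from 0 up to z
    print(z, round(area, 4))       # e.g. 1.96 -> 0.4750, as in the table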

Student's t Table

t table with right-tail probabilities

df\p 0.05 0.025 0.01 0.005


1 6.314 12.706 31.82 63.657
2 2.92 4.3027 6.965 9.9248
3 2.353 3.1825 4.541 5.8409
4 2.132 2.7765 3.747 4.6041
5 2.015 2.5706 3.365 4.0321
6 1.943 2.4469 3.143 3.7074
7 1.895 2.3646 2.998 3.4995
8 1.86 2.306 2.896 3.3554
9 1.833 2.2622 2.821 3.2498
10 1.812 2.2281 2.764 3.1693
11 1.796 2.201 2.718 3.1058
12 1.782 2.1788 2.681 3.0545
13 1.771 2.1604 2.65 3.0123
14 1.761 2.1448 2.624 2.9768
15 1.753 2.1315 2.602 2.9467
16 1.746 2.1199 2.583 2.9208
17 1.74 2.1098 2.567 2.8982
18 1.734 2.1009 2.552 2.8784
19 1.729 2.093 2.539 2.8609


20 1.725 2.086 2.528 2.8453


21 1.721 2.0796 2.518 2.8314
22 1.717 2.0739 2.508 2.8188
23 1.714 2.0687 2.5 2.8073
24 1.711 2.0639 2.492 2.7969
25 1.708 2.0595 2.485 2.7874
26 1.706 2.0555 2.479 2.7787
27 1.703 2.0518 2.473 2.7707
28 1.701 2.0484 2.467 2.7633
29 1.699 2.0452 2.462 2.7564
inf 1.645 1.96 2.326 2.5758
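
The tabulated values are the right-tail percentage points of the t distribution, and they can be reproduced in the same way; a minimal sketch assuming SciPy:

# Reproducing right-tail critical values of the t table.
from scipy.stats import t

for df in (1, 10, 29):
    row = [round(t.ppf(1 - p, df), 4) for p in (0.05, 0.025, 0.01, 0.005)]
    print(df, row)                 # e.g. df = 10 -> [1.8125, 2.2281, 2.7638, 3.1693]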

F Distribution Tables

F Table for α = 0.10

df2/df1 1 2 3 4 5 6 7 8 9 10 12 15 20 INF
1 39.86 49.5 53.6 55.8 57.2 58.2 58.9 59.4 59.86 60.19 60.7 61.22 61.74 63.3
2 8.526 9 9.16 9.24 9.29 9.326 9.35 9.37 9.381 9.392 9.41 9.425 9.441 9.49
3 5.538 5.46 5.39 5.34 5.31 5.285 5.27 5.25 5.24 5.23 5.22 5.2 5.184 5.13
4 4.545 4.32 4.19 4.11 4.05 4.01 3.98 3.95 3.936 3.92 3.9 3.87 3.844 3.76
5 4.06 3.78 3.62 3.52 3.45 3.405 3.37 3.34 3.316 3.297 3.27 3.238 3.207 3.11
6 3.776 3.46 3.29 3.18 3.11 3.055 3.01 2.98 2.958 2.937 2.9 2.871 2.836 2.72
7 3.589 3.26 3.07 2.96 2.88 2.827 2.78 2.75 2.725 2.703 2.67 2.632 2.595 2.47
8 3.458 3.11 2.92 2.81 2.73 2.668 2.62 2.59 2.561 2.538 2.5 2.464 2.425 2.29
9 3.36 3.01 2.81 2.69 2.61 2.551 2.51 2.47 2.44 2.416 2.38 2.34 2.298 2.16
10 3.285 2.92 2.73 2.61 2.52 2.461 2.41 2.38 2.347 2.323 2.28 2.244 2.201 2.06
11 3.225 2.86 2.66 2.54 2.45 2.389 2.34 2.3 2.274 2.248 2.21 2.167 2.123 1.97


12 3.177 2.81 2.61 2.48 2.39 2.331 2.28 2.24 2.214 2.188 2.15 2.105 2.06 1.9
13 3.136 2.76 2.56 2.43 2.35 2.283 2.23 2.2 2.164 2.138 2.1 2.053 2.007 1.85
14 3.102 2.73 2.52 2.39 2.31 2.243 2.19 2.15 2.122 2.095 2.05 2.01 1.962 1.8
15 3.073 2.7 2.49 2.36 2.27 2.208 2.16 2.12 2.086 2.059 2.02 1.972 1.924 1.76
16 3.048 2.67 2.46 2.33 2.24 2.178 2.13 2.09 2.055 2.028 1.99 1.94 1.891 1.72
17 3.026 2.64 2.44 2.31 2.22 2.152 2.1 2.06 2.028 2.001 1.96 1.912 1.862 1.69
18 3.007 2.62 2.42 2.29 2.2 2.13 2.08 2.04 2.005 1.977 1.93 1.887 1.837 1.66
19 2.99 2.61 2.4 2.27 2.18 2.109 2.06 2.02 1.984 1.956 1.91 1.865 1.814 1.63
20 2.975 2.59 2.38 2.25 2.16 2.091 2.04 2 1.965 1.937 1.89 1.845 1.794 1.61
inf 2.706 2.3 2.08 1.94 1.85 1.774 1.72 1.67 1.632 1.599 1.55 1.487 1.421 1

F Table for α = 0.05

df2/df1 1 2 3 4 5 6 7 8 9 10 12 15 20 INF
1 161 200 215.7 224.6 230.2 234 236.8 239 241 242 244 246 248 254.31
2 18.5 19 19.16 19.25 19.3 19.33 19.35 19.4 19.4 19.4 19.4 19.4 19.4 19.496
3 10.1 9.55 9.277 9.117 9.014 8.941 8.887 8.85 8.81 8.79 8.74 8.7 8.66 8.5264
4 7.71 6.94 6.591 6.388 6.256 6.163 6.094 6.04 6 5.96 5.91 5.86 5.8 5.6281
5 6.61 5.79 5.41 5.192 5.05 4.95 4.876 4.82 4.77 4.74 4.68 4.62 4.56 4.365
6 5.99 5.14 4.757 4.534 4.387 4.284 4.207 4.15 4.1 4.06 4 3.94 3.87 3.6689
7 5.59 4.74 4.347 4.12 3.972 3.866 3.787 3.73 3.68 3.64 3.57 3.51 3.44 3.2298
8 5.32 4.46 4.066 3.838 3.688 3.581 3.501 3.44 3.39 3.35 3.28 3.22 3.15 2.9276
9 5.12 4.26 3.863 3.633 3.482 3.374 3.293 3.23 3.18 3.14 3.07 3.01 2.94 2.7067
10 4.96 4.1 3.708 3.478 3.326 3.217 3.136 3.07 3.02 2.98 2.91 2.85 2.77 2.5379
11 4.84 3.98 3.587 3.357 3.204 3.095 3.012 2.95 2.9 2.85 2.79 2.72 2.65 2.4045


12 4.75 3.89 3.49 3.259 3.106 2.996 2.913 2.85 2.8 2.75 2.69 2.62 2.54 2.2962
13 4.67 3.81 3.411 3.179 3.025 2.915 2.832 2.77 2.71 2.67 2.6 2.53 2.46 2.2064
14 4.6 3.74 3.344 3.112 2.958 2.848 2.764 2.7 2.65 2.6 2.53 2.46 2.39 2.1307
15 4.54 3.68 3.287 3.056 2.901 2.791 2.707 2.64 2.59 2.54 2.48 2.4 2.33 2.0658
16 4.49 3.63 3.239 3.007 2.852 2.741 2.657 2.59 2.54 2.49 2.42 2.35 2.28 2.0096
17 4.45 3.59 3.197 2.965 2.81 2.699 2.614 2.55 2.49 2.45 2.38 2.31 2.23 1.9604
18 4.41 3.56 3.16 2.928 2.773 2.661 2.577 2.51 2.46 2.41 2.34 2.27 2.19 1.9168
19 4.38 3.52 3.127 2.895 2.74 2.628 2.544 2.48 2.42 2.38 2.31 2.23 2.16 1.878
20 4.35 3.49 3.098 2.866 2.711 2.599 2.514 2.45 2.39 2.35 2.28 2.2 2.12 1.8432
inf 3.84 3 2.605 2.372 2.214 2.099 2.01 1.94 1.88 1.83 1.75 1.67 1.57 1

F Table for α = 0.01

df2/df1 1 2 3 4 5 6 7 8 9 10 12 15 20 INF
1 4052 5000 5403 5625 5764 5859 5928 5981 6022 6056 6106 6157 6209 6365.9
2 98.5 99 99.17 99.25 99.3 99.33 99.36 99.4 99.4 99.4 99.4 99.4 99.4 99.499
3 34.1 30.8 29.46 28.71 28.24 27.91 27.67 27.5 27.3 27.2 27.1 26.9 26.7 26.125
4 21.2 18 16.69 15.98 15.52 15.21 14.98 14.8 14.7 14.5 14.4 14.2 14 13.463
5 16.3 13.3 12.06 11.39 10.97 10.67 10.46 10.3 10.2 10.1 9.89 9.72 9.55 9.02
6 13.7 10.9 9.78 9.148 8.746 8.466 8.26 8.1 7.98 7.87 7.72 7.56 7.4 6.88
7 12.2 9.55 8.451 7.847 7.46 7.191 6.993 6.84 6.72 6.62 6.47 6.31 6.16 5.65
8 11.3 8.65 7.591 7.006 6.632 6.371 6.178 6.03 5.91 5.81 5.67 5.52 5.36 4.859


9 10.6 8.02 6.992 6.422 6.057 5.802 5.613 5.47 5.35 5.26 5.11 4.96 4.81 4.311
10 10 7.56 6.552 5.994 5.636 5.386 5.2 5.06 4.94 4.85 4.71 4.56 4.41 3.909
11 9.65 7.21 6.217 5.668 5.316 5.069 4.886 4.74 4.63 4.54 4.4 4.25 4.1 3.602
12 9.33 6.93 5.953 5.412 5.064 4.821 4.64 4.5 4.39 4.3 4.16 4.01 3.86 3.361
13 9.07 6.7 5.739 5.205 4.862 4.62 4.441 4.3 4.19 4.1 3.96 3.82 3.67 3.165
14 8.86 6.52 5.564 5.035 4.695 4.456 4.278 4.14 4.03 3.94 3.8 3.66 3.51 3.004
15 8.68 6.36 5.417 4.893 4.556 4.318 4.142 4 3.9 3.81 3.67 3.52 3.37 2.868
16 8.53 6.23 5.292 4.773 4.437 4.202 4.026 3.89 3.78 3.69 3.55 3.41 3.26 2.753
17 8.4 6.11 5.185 4.669 4.336 4.102 3.927 3.79 3.68 3.59 3.46 3.31 3.16 2.653
18 8.29 6.01 5.092 4.579 4.248 4.015 3.841 3.71 3.6 3.51 3.37 3.23 3.08 2.566
19 8.19 5.93 5.01 4.5 4.171 3.939 3.765 3.63 3.52 3.43 3.3 3.15 3 2.489
20 8.1 5.85 4.938 4.431 4.103 3.871 3.699 3.56 3.46 3.37 3.23 3.09 2.94 2.421
inf 6.64 4.61 3.782 3.319 3.017 2.802 2.639 2.51 2.41 2.32 2.19 2.04 1.88 1

Chi-Square Table
Area in the right tail of a Chi-Square Distribution

df\area 0.99 0.975 0.95 0.9 0.1 0.05 0.025 0.01


1 0 0 0 0.02 2.71 3.84 5.02 6.63
2 0 0.05 0.1 0.21 4.61 5.99 7.38 9.21
3 0.1 0.22 0.35 0.58 6.25 7.81 9.35 11.3
4 0.3 0.48 0.71 1.06 7.78 9.49 11.1 13.3
5 0.6 0.83 1.15 1.61 9.24 11.1 12.8 15.1
6 0.9 1.24 1.64 2.2 10.6 12.6 14.4 16.8
7 1.2 1.69 2.17 2.83 12 14.1 16 18.5
8 1.6 2.18 2.73 3.49 13.4 15.5 17.5 20.1
9 2.1 2.7 3.33 4.17 14.7 16.9 19 21.7
10 2.6 3.25 3.94 4.87 16 18.3 20.5 23.2
11 3.1 3.82 4.57 5.58 17.3 19.7 21.9 24.7
12 3.6 4.4 5.23 6.3 18.5 21 23.3 26.2
13 4.1 5.01 5.89 7.04 19.8 22.4 24.7 27.7


14 4.7 5.63 6.57 7.79 21.1 23.7 26.1 29.1


15 5.2 6.26 7.26 8.55 22.3 25 27.5 30.6
16 5.8 6.91 7.96 9.31 23.5 26.3 28.8 32
17 6.4 7.56 8.67 10.1 24.8 27.6 30.2 33.4
18 7 8.23 9.39 10.9 26 28.9 31.5 34.8
19 7.6 8.91 10.1 11.7 27.2 30.1 32.9 36.2
20 8.3 9.59 10.9 12.4 28.4 31.4 34.2 37.6
21 8.9 10.3 11.6 13.2 29.6 32.7 35.5 38.9
22 9.5 11 12.3 14 30.8 33.9 36.8 40.3
23 10 11.7 13.1 14.8 32 35.2 38.1 41.6
24 11 12.4 13.8 15.7 33.2 36.4 39.4 43
25 12 13.1 14.6 16.5 34.4 37.7 40.6 44.3
26 12 13.8 15.4 17.3 35.6 38.9 41.9 45.6
27 13 14.6 16.2 18.1 36.7 40.1 43.2 47
28 14 15.3 16.9 18.9 37.9 41.3 44.5 48.3
29 14 16 17.7 19.8 39.1 42.6 45.7 49.6
30 15 16.8 18.5 20.6 40.3 43.8 47 50.9

Partial Table of Critical Values of U in the Mann-Whitney Test


Critical values for a one-tail test at α = .025 or a two-tail test at α = .05

n1\n2  9  10  11  12  13  14  15  16  17  18  19  20

2 0 0 0 1 1 1 1 1 2 2 2 2

3 2 3 3 4 4 5 5 6 6 7 7 8

4 4 5 6 7 8 9 10 11 11 12 13 13

5 7 8 9 11 12 13 14 15 17 18 19 20

6 10 11 13 14 16 17 19 21 22 24 25 27

7 12 14 16 18 20 22 24 26 28 30 32 34

8 15 17 19 22 24 26 29 31 34 36 38 41

9 17 20 23 26 28 31 34 37 39 42 45 48

10 20 23 26 29 33 36 39 42 45 48 52 55

11 23 26 30 33 37 40 44 47 51 55 58 62


12 26 29 33 37 41 45 49 53 57 61 66 69

13 28 33 37 41 45 50 54 59 63 67 72 76

14 31 36 40 45 50 55 59 64 67 74 78 83

15 34 39 44 49 54 59 64 70 75 80 85 90

16 37 42 47 53 59 64 70 75 81 86 92 98

17 39 45 51 57 63 67 75 81 87 93 99 105

18 42 48 55 61 67 74 80 86 93 99 106 112

19 45 52 58 65 72 78 85 92 99 106 113 119

20 48 55 62 69 76 83 90 98 105 112 119 127

Partial Table of Critical Values of U in the Mann-Whitney Test


Critical values for a one-tail test at α = .05 or a two-tail test at α = .10

n1\n2  9  10  11  12  13  14  15  16  17  18  19  20

1  -  -  -  -  -  -  -  -  -  -  0  0

2 1 1 1 2 2 2 3 3 3 4 4 4

3 3 4 5 5 6 7 7 8 9 9 10 11

4 6 7 8 9 10 11 12 14 15 16 17 18

5 9 11 12 13 15 16 18 19 20 22 23 25

6 12 14 16 17 19 21 23 25 26 28 30 32

7 15 17 19 21 24 26 28 30 33 35 37 39

8 18 20 23 26 28 31 33 36 39 41 44 47

9 21 24 27 30 33 36 39 42 45 48 51 54

10 24 27 31 34 37 41 44 48 51 55 58 62

11 27 31 34 38 42 46 50 54 57 61 65 69


12 30 34 38 42 47 51 55 60 64 68 72 77

13 33 37 42 47 51 56 61 65 70 75 80 84

14 36 41 46 51 56 61 66 71 77 82 87 92

15 39 44 50 55 61 66 72 77 83 88 94 100

16 42 48 54 60 65 71 77 83 89 95 101 107

17 45 51 57 64 70 77 83 89 96 102 109 115

18 48 55 61 68 75 82 88 95 102 109 116 123

19 51 58 65 72 80 87 94 101 109 116 123 130


20 54 62 69 77 84 92 100 107 115 123 130 138
