
CHAPTER 1

Statistics
Statistics is a science dealing with the collection, analysis, interpretation and presentation
of data.

Portfolio Management
You manage a portfolio of investments and are presented with a research report that
shows that over the past five years, technology stocks have outperformed utility stocks by
13.7% to 10.1%.

Marketing Management
You manage a line of consumer products and want to change your marketing campaign. You
decide that you want to test a new campaign against the existing one to see if the new one
is any better. How would you design the test? What sort of data would you collect? In this
course, you will learn how to design research and how to determine if the results are
significant.

Marketing Manager
An online research survey shows you that 60% of your competitor's customers would
switch to your product if you offered a coupon. Should you create a coupon? What factors
would you consider? Can you trust an online survey? In this course, you will learn how to
read survey results and ask the right questions about them.

Business Analyst
Your company's data warehouse collects sales data and reports it to all the company's
directors and vice-presidents on a daily basis. At the next director's meeting, it will be your
job to make some recommendations based on these numbers. How will you analyze the
data?

With its eyes on inflation, RBI may halt rate cuts for the foreseeable future

Interest rate cuts may be elusive during the rest of the year as all six members of the RBI's
Monetary Policy Committee harped on the need to keep inflation within the mandated
range, and voted in favour of holding back policy tools till the picture is clearer

Retail inflation measured by the Consumer Price Index (CPI) rose to 6.93% in July on
account of higher food prices, breaching the RBI's upper tolerance level of 6% for two
consecutive months.

TYPES OF STATISTICAL METHODS


Statistical methods, broadly, fall into the following two categories:

(i) Descriptive statistics


(ii) Inferential statistics

(i) Descriptive statistics includes statistical methods involving the collection, presentation,
and characterization of a set of data in order to describe the various features of that set of
data. In general, methods of descriptive statistics include graphic methods and numeric
measures. Bar charts, line graphs, and pie charts comprise the graphic methods, whereas
numeric measures include measures of central tendency, dispersion, skewness, and
kurtosis.

(ii) Inferential statistics includes statistical methods which facilitate estimating the
characteristics of a population or making decisions concerning a population on the basis of
sample results. Sample and population are two relative terms. The larger group of units
about which inferences are to be made is called the population or universe, while a sample
is a fraction, subset, or portion of that universe.

Statistics in Business Management


According to Wallis and Roberts, ‘Statistics may be regarded as a body of methods for making wise
decisions in the face of uncertainty.’ Ya-Lun Chou gave a modified definition over this, saying that
‘Statistics is a method of decision-making in the face of uncertainty on the basis of numerical data and
calculated risks.’ These definitions reflect the applications of statistics in the development of general
principles for dealing with uncertainty.

Marketing Before a product is launched, the market research team of an organization, through a
pilot survey, makes use of various techniques of statistics to analyse data on population, purchasing
power, habits of the consumers, competitors, pricing, and a host of other aspects. Such studies
reveal the possible market potential for the product.
Analysis of sales volume in relation to the purchasing power and concentration of population is
helpful in establishing sales territories, routing of salesmen, and advertising strategies to improve
sales.
Production Statistical methods are used to carry out R&D programmes for improvement in the
quality of the existing products and setting quality control standards for new ones. Decisions about
the quantity and time of either self-manufacturing or buying from outside are based on statistically
analysed data.
Finance A statistical study through correlation analysis of profits and dividends helps to predict and
decide probable dividends for future years. Statistics applied to analysis of data on assets and liabilities
and income and expenditure helps to ascertain the financial results of various operations.
Financial forecasts, break-even analysis, investment decisions under uncertainty—all involve the
application of relevant statistical methods for analysis.
Personnel In the process of manpower planning, a personnel department makes statistical studies of
wage rates, incentive plans, cost of living, labour turnover rates, employment trends, accident rates,
performance appraisal, and training and development programmes. Employer-employee relationships
are studied by statistically analysing various factors such as wages, grievance handling, welfare, and delegation of authority.

CHAPTER 2

Data Sources
Primary data: Data collected afresh and for the first time, and thus original in character, are
known as primary data.

Secondary data : The data which have been collected by someone else and which have
already been passed through the statistical process are known as Secondary data.

Methods of Collecting Primary Data


Direct personal observation

Indirect Oral Interviews

Questionnaire method

Schedule Method
From Local Agents

Focus Groups

Sources of Secondary Data

Secondary data are obtained from two broad groups of sources: published sources and
unpublished sources.

Secondary Data – Examples


COVID-19 Statistics

County health departments

Census – birth, death etc.

Hospital, clinic, school nurse records

Various reports of the companies

National Statistics

State government programs

Reports of NSSO

WHO

Unpublished Sources
There are certain records properly maintained by government agencies, private offices, and
firms. These data are not published.
Nominal Scale
What is your gender?

Male

Female

Your place of work

Finance

Production

Sales

HR

TIME SERIES
Tables and graphs
Classification of Data
Classification of data is the process of arranging data in groups/classes on the basis of
certain properties.

Basis of Classification

Geographical Classification

Chronological Classification

Qualitative Classification

Quantitative Classification

Organizing data using Array


ARRAYS AND TALLIES

Exclusive and Inclusive Class intervals

Frequency Distributions

Cumulative Frequency

Relative Frequency Distribution

Percentage Frequency Distribution

Bivariate Frequency Distribution

TYPES OF CLASS INTERVALS


Exclusive Class Intervals Inclusive Class Intervals
Tabulation
Tables are a means of recording in permanent form the analysis that is made through
classification and by placing in juxtaposition things that are similar and should be
compared.

Objectives of Tabulation
To simplify the complex data

To economize space

To depict trend

To facilitate comparison

To facilitate statistical comparison

To help reference

Parts of Table
SIMPLE TABLE

TWO WAY TABLE

THREE WAY TABLE


Diagrammatic Presentation of Data
According to King: ‘One of the chief aims of statistical science is to render the meaning of
masses of figures clear and comprehensible at a glance.’ This is often best accomplished
by presenting the data in a pictorial (or graphical) form.

According to M. J. Moroney: Diagrams help us to see the pattern and shape of any
complex situation. Just as a map gives us a bird's eye-view of the wide stretch of a country,
so diagrams help us visualise the whole meaning of the numerical complex at a single
glance.

Advantages and Limitations of Diagrams (Graphs)


Diagrams give an attractive presentation

Diagrams leave good visual impact

Diagrams facilitate comparison

Diagrams save time

Diagrams simplify complexity and depict the characteristics of the data


General Rules for Constructing Diagrams
Title

Proportion between width and height

Selection of scale

Footnotes

Index

Neatness and cleanliness

Simplicity

Types of Diagrams
One dimensional diagrams: bar diagrams

Two- dimensional diagrams: rectangles, squares and circles

Three dimensional diagrams: cubes, cylinders and spheres

Pictograms and cartograms

One-dimensional diagrams (charts)

Histogram

Frequency polygon

Frequency curve

Cumulative frequency distribution (Ogive)

Pie diagram

Simple Bar Graph


A bar graph is made up of columns plotted on a graph.

The columns are positioned over a label that represents a categorical variable.

The height of the column indicates the size of the group defined by the column label.

Multiple Bar Charts


Multiple bar chart is also known as grouped (or compound) bar chart. Such charts are
useful for direct comparison between two or more sets of data.

The technique of drawing such a chart is the same as that of a simple bar chart, except that
the bars for the different data sets are drawn side by side for each category and
distinguished by different shades or colours.
Deviation Bar Charts


Deviation bar charts are suitable for presentation of net quantities in excess or deficit such
as profit, loss, imports, or exports.
The excess (or positive) values and deficit (or negative) values are shown above and below
the base line.

Subdivided Bar Chart


Subdivided bar charts are suitable for expressing information in terms of ratios or
percentages.

For example, exports and Home (domestic) sales.

Percentage Bar Chart


When the relative proportions of components of a bar are more important than their
absolute values, then each bar can be constructed to the same height to represent 100%.

The component values are then expressed in terms of percentage of the total to obtain the
necessary length for each of these in the full length of the bars.

Frequency Polygon
The frequency polygon is formed as a closed figure with the horizontal axis; therefore, a
series of straight lines is drawn from the midpoints of the top bases of the first and the last
rectangles to the midpoints, falling on the horizontal axis, of the adjacent outlying intervals
with zero frequency.

Pie Diagram
These diagrams are normally used to show the total number of observations of different
types in the data set on a percentage basis rather than on an absolute basis through a
circle.

The pie chart has two distinct advantages: (i) it is aesthetically pleasing and (ii) it shows
that the total for all categories or slices of the pie adds to 100%.

Pictograms or Ideographs
A pictogram is another form of pictorial bar chart. Such charts are useful in presenting
data to people who cannot understand charts.

Small symbols or simplified pictures are used to represent the size of the data.
CHAPTER 3

MEASURES OF CENTRAL TENDENCY


Central Tendency

According to Clark and Schkade,

“Average is an attempt to find one single figure to describe the whole group of figures”.

Objectives of averaging
• It is useful to extract and summarize the characteristics of the entire data set in a
precise form.

• Since an ‘average’ represents the entire data set, it facilitates comparison between
two or more data sets.

• It offers a base for computing various other measures such as dispersion,


skewness, kurtosis that help in many other phases of statistical analysis.

Requisites of a measure
• It should be rigidly defined

• It should be based on all the observations

• It should be easy to understand and calculate

• It should have sampling stability

• It should be capable of further algebraic treatment

• It should not be unduly affected by extreme observations

Mathematical Averages
(a) Arithmetic Mean commonly called the mean or average

• Simple

• Weighted

(b) Geometric Mean

(c) Harmonic Mean


Averages of Position
(a) Median

(b) Quartiles

(c) Deciles

(d) Percentiles

(e) Mode

Measures of central tendency
The mean is the average of all values.

– If the data set represents a sample from some larger population, this measure is
called the sample mean and is denoted by X̄ (X-bar).

– If the data set represents the entire population, it is called the

population mean and is denoted by μ.

Changing the mean


• In Excel, the mean can be calculated with the AVERAGE function.

• Because the calculation of the mean involves every score in the distribution,
changing the value of any score will change the value of the mean.

• Modifying a distribution by discarding scores or by adding new scores will usually


change the value of the mean.

• To determine how the mean will be affected for any specific situation you must
consider: 1) how the number of scores is affected, and 2) how the sum of the scores is
affected

• If a constant value is added to every score in a distribution, then the same constant
value is added to the mean. Also, if every score is multiplied by a constant value, then the
mean is also multiplied by the same constant value.
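A quick numerical check of these two properties (a minimal Python sketch; the scores used here are made up):

import numpy as np

scores = np.array([4, 6, 8, 10, 12])   # hypothetical scores
print(scores.mean())                   # 8.0
print((scores + 5).mean())             # 13.0 -> the mean shifts by the constant 5
print((scores * 2).mean())             # 16.0 -> the mean is multiplied by the constant 2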

In a survey of 5 cement companies, the profit (in Rs lakh) earned during a year was 15, 20,
10, 35, and 32. Find the arithmetic mean of the profit earned.
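Solution: Mean profit = (15 + 20 + 10 + 35 + 32) / 5 = 112 / 5 = Rs 22.4 lakh.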
From the following information on the number of defective components in 1000 boxes;

Calculate the arithmetic mean of defective components for the whole of the production line
and comment on the results.

Combined mean
There are two units of an automobile company in two different cities employing 760 and 800
persons, respectively. The arithmetic means of monthly salaries paid to persons in these
two units are Rs 18,750 and Rs 16,950 respectively. Find the combined arithmetic mean of
salaries of the employees in both the units.
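Solution: the combined mean weights each unit's mean by its number of employees:

Combined mean = (760 × 18,750 + 800 × 16,950) / (760 + 800) = 27,810,000 / 1,560 ≈ Rs 17,826.92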
WEIGHTED MEAN
WHEN IT WON’T WORK
When a distribution contains a few extreme values (or is very skewed) i.e. outliers, the
mean will be pulled toward the extremes (displaced toward the tail). In this case, the mean
will not provide a "central" value.

• The mean cannot be calculated for qualitative characteristics such as intelligence,


honesty, beauty, or loyalty.

• The mean cannot be calculated for a data set that has open-ended classes at either
the high or low end of the scale.

MEDIAN
• The median is the middle observation when the data are sorted from smallest to
largest.

– If the number of observations is odd, the median is literally the middle observation.

– If the number of observations is even, the median is usually defined as the average
of the two middle observations.

• In Excel, the median can be calculated with the MEDIAN function

Extreme values do not affect the calculation of the median.

The median is considered the best statistical technique for studying the qualitative
attribute of an observation in the data set.

The median value may be calculated for an open-end distribution.


Calculate the median of the following data that relates to the service time (in minutes) per
customer for 7 customers at a railway reservation counter:
3.5, 4.5, 3, 3.8, 5.0, 5.5, 4

Arrange the data in ascending order: 3, 3.5, 3.8, 4, 4.5, 5, 5.5. Since n = 7 is odd, the median
is the middle (4th) value, i.e. 4 minutes.

CONTINUOUS SERIES

The mode is that value of an observation which occurs most frequently in the data
set, that is, the point (or class mark) with the highest frequency.

In Excel, the mode can be calculated with the MODE function.
ADVANTAGES OF MODE
Mode class can also be located by inspection.

The mode is not affected by the extreme values in the distribution.

The mode value can also be calculated for open-end frequency distributions.

The mode can be used to describe quantitative as well as qualitative data

FORMULA
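For a continuous (grouped) frequency distribution, the formula usually quoted for the mode is

Mode = L + [(f1 - f0) / (2f1 - f0 - f2)] × h

where L = lower limit of the modal class, f1 = frequency of the modal class, f0 and f2 =
frequencies of the classes preceding and succeeding the modal class, and h = width of the
class interval.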

RELATIONSHIP BETWEEN MEDIAN, MEAN AND MODE


Mode = 3 Median - 2 Mean
UNIT 4
REGRESSION ANALYSIS AND FORECASTING
Simple Regression Analysis and Correlation

CORRELATION
Correlation is a measure of the degree of relatedness of variables. It can help a business
researcher determine, for example, whether the stocks of two airlines rise and fall in any
related manner. For a sample of pairs of data, correlation analysis can yield a numerical
value that represents the degree of relatedness of the two stock prices over time. In the
transportation industry, is a correlation evident between the price of transportation and the
weight of the object being shipped? If so, how strong are the correlations? In economics,
how strong is the correlation between the producer price index and the unemployment
rate? In retail sales, are sales related to population density, number of competitors, size of
the store, amount of advertising, or other variables? Several measures of correlation are
available, the selection of which depends mostly on the level of data being analyzed. Ideally,
researchers would like to solve for ρ, the population coefficient of correlation. However,
because researchers virtually always deal with sample data, this section introduces a
widely used sample coefficient of correlation, r.

This measure is applicable only if both variables being analyzed have at least an interval
level of data.

The following data are the claims (in $ millions) for BlueCross BlueShield benefits for nine
states, along with the surplus (in $ millions) that the company had in assets in those states.

State           Claims    Surplus
Alabama         $1,425       $277
Colorado           273        100
Florida            915        120
Illinois         1,687        259
Maine              234         40
Montana            142         25
North Dakota       259         57
Oklahoma           258         31
Texas              894        141

Use the data to compute a correlation
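A minimal Python sketch of this computation (assuming SciPy is available; the figures are the claims and surplus values listed above):

from scipy import stats

claims  = [1425, 273, 915, 1687, 234, 142, 259, 258, 894]    # claims in $ millions
surplus = [ 277, 100, 120,  259,  40,  25,  57,  31, 141]    # surplus in $ millions

r, p_value = stats.pearsonr(claims, surplus)   # sample coefficient of correlation r
print(round(r, 3))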


Regression analysis is the process of constructing a mathematical model or function that
can be used to predict or determine one variable by another variable or other variables.
The most elementary regression model is called simple regression or bivariate regression
involving two variables in which one variable is predicted by another variable. In simple
regression, the variable to be predicted is called the dependent variable and is designated
as y. The predictor is called the independent variable, or explanatory variable, and is
designated as x. In simple regression analysis, only a straight-line relationship between
two variables is examined.

DETERMINING THE EQUATION OF THE REGRESSION LINE

The first step in determining the equation of the regression line that passes through the
sample data is to establish the equation’s form. Several different types of equations of lines
are discussed in algebra, finite math, or analytic geometry courses. Recall that among
these equations of a line are the two-point form, the point-slope form, and the slope-intercept
form. In regression analysis, researchers use the slope-intercept equation of a line. In math
courses, the slope-intercept form of the equation of a line often takes the form

y = mx + b

where

m = slope of the line

b = y intercept of the line

In statistics, the slope-intercept form of the equation of the regression line through the
population points is

ŷ = β0 + β1x

where

ŷ = the predicted value of y

β0 = the population y intercept

β1 = the population slope

STANDARD ERROR OF THE ESTIMATE


Residuals represent errors of estimation for individual points. With large samples of data,
residual computations become laborious. Even with computers, a researcher sometimes
has difficulty working through pages of residuals in an effort to understand the error of the
regression model. An alternative way of examining the error of the model is the standard
error of the estimate, which provides a single measurement of the regression error.

Because the sum of the residuals is zero, attempting to determine the total amount of

error by summing the residuals is fruitless. This zero-sum characteristic of residuals can
be avoided by squaring the residuals and then summing them.
SUM OF SQUARES OF ERROR

SSE = Σ(y - ŷ)²
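For simple regression, the standard error of the estimate described above is then typically
computed as

se = √(SSE / (n - 2))

where n is the number of pairs of data; dividing by n - 2 reflects the two parameters (slope
and intercept) estimated from the sample.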

COEFFICIENT OF DETERMINATION
A widely used measure of fit for regression models is the coefficient of determination, or

r 2. The coefficient of determination is the proportion of variability of the dependent variable

(y) accounted for or explained by the independent variable (x).

The coefficient of determination ranges from 0 to 1. An r 2 of zero means that the predictor
accounts for none of the variability of the dependent variable and that there

is no regression prediction of y by x. An r 2 of 1 means perfect prediction of y by x and that


100% of the variability of y is accounted for by x. Of course, most r 2 values are between the
extremes. The researcher must interpret whether a particular r 2 is high or low, depending
on the use of the model and the context within which the model was developed.

In exploratory research where the variables are less understood, low values of r 2 are
likely to be more acceptable than they are in areas of research where the parameters are

more developed and understood. One NASA researcher who uses vehicular weight to
predict mission cost searches for the regression models to have an r² of .90 or
higher. However,

a business researcher who is trying to develop a model to predict the motivation level of
employees might be pleased to get an r 2 near .50 in the initial research.

Relationship Between r and r 2

Is r, the coefficient of correlation (introduced in Section 12.1), related to r 2, the coefficient of


determination in linear regression? The answer is yes: r 2 equals (r)2. The coefficient of
determination is the square of the coefficient of correlation. In Demonstration Problem 12.1,
a regression model was developed to predict FTEs by number of hospital beds. The r 2
value for the model was .886. Taking the square root of this value yields r = .941, which is
the correlation between the sample number of beds and FTEs. A word of caution here:

Because r² is always positive, solving for r by taking the square root gives the correct magnitude of r but
may give the wrong sign. The researcher must examine the sign of the slope of the
regression line to determine whether a positive or negative relationship exists between the
variables and then assign the appropriate sign to the correlation value.

USING REGRESSION TO DEVELOP A FORECASTING TREND LINE


Business researchers often use historical data with measures taken over time in an effort
to forecast what might happen in the future. A particular type of data that often lends itself

well to this analysis is time-series data defined as data gathered on a particular


characteristic over a period of time at regular intervals. Some examples of time-series
data are 10 years of weekly Dow Jones Industrial Averages, twelve months of daily oil
production, or monthly consumption of coffee over a two-year period. To be useful to
forecasters, time-series measurements need to be made in regular time intervals and
arranged according to time of occurrence.

INTERPRETING THE OUTPUT


Although manual computations can be done, most regression problems are analyzed by
using a computer. In this section, computer output from both Minitab and Excel will be
presented and discussed.

At the top of the Minitab regression output, shown in Figure 12.20, is the regression

equation. Next is a table that describes the model in more detail. “Coef ” stands for
coefficient of the regression terms. The coefficient of Number of Passengers, the x variable,
is

0.040702. This value is equal to the slope of the regression line and is reflected in the
regression equation. The coefficient shown next to the constant term (1.5698) is the value of
the constant, which is the y intercept and also a part of the regression equation.

The “T” values are a t test for the slope and a t test for the intercept or constant. (We
generally do not interpret the t test for the constant.) The t value for the slope, t = 9.44 with

an associated probability of .000, is the same as the value obtained manually in section

12.7. Because the probability of the t value is given, the p-value method can be used to
interpret the t value.

Multiple Regression Analysis

THE MULTIPLE REGRESSION MODEL

Multiple regression analysis is similar in principle to simple regression analysis. However,


it is more complex conceptually and computationally. Recall from Chapter 12 that the

equation of the probabilistic simple regression model is

y = β0 + β1x + ε

where

β0 = the population y intercept

β1 = the population slope

ε = the error of prediction

Extending this notion to multiple regression gives the general equation for the probabilistic
multiple regression model:

y = β0 + β1x1 + β2x2 + β3x3 + ... + βkxk + ε

where

y = the value of the dependent variable

β0 = the regression constant

βi = the partial regression coefficient for independent variable i (i = 1, 2, ..., k)

k = the number of independent variables

ε = the error of prediction


In multiple regression analysis, the dependent variable, y, is sometimes referred to as the
response variable. The partial regression coefficient of an independent variable, ,
represents the increase that will occur in the value of y from a one-unit increase in that
independent variable if all other variables are held constant. The “full” (versus partial)
regression coefficient

of an independent variable is a coefficient obtained from the bivariate model (simple


regression) in which the independent variable is the sole predictor of y. The partial
regression coefficients occur because more than one predictor is included in a model.

In actuality, the partial regression coefficients and the regression constant of a multiple
regression model are population values and are unknown. In virtually all research, these
values are estimated from sample data, and the resulting sample regression constant and
coefficients are denoted b0, b1, b2, ..., bk.

SIGNIFICANCE TESTS OF THE REGRESSION MODEL AND ITS COEFFICIENTS
Multiple regression models can be developed to fit almost any data set if the level of
measurement is adequate and enough data points are available. Once a model has been
constructed, it is important to test the model to determine whether it fits the data well and
whether the assumptions underlying regression analysis are met. Assessing the adequacy of
the regression model can be done in several ways, including testing the overall significance
of the model, studying the significance tests of the regression coefficients, computing the
residuals, examining the standard error of the estimate, and observing the coefficient of
determination. In this section, we examine significance tests of the regression model and of
its coefficients.

Testing the Overall Model


With simple regression, a t test of the slope of the regression line is used to determine
whether the population slope of the regression line is different from zero—that is, whether
the independent variable contributes significantly in linearly predicting the dependent
variable.

RESIDUALS, STANDARD ERROR OF THE ESTIMATE, AND R2

Three more statistical tools for examining the strength of a regression model are the
residuals, the standard error of the estimate, and the coefficient of multiple determination.

Residuals

The residual, or error, of the regression model is the difference between the y value and
the predicted value.

The residuals for a multiple regression model are solved for in the same manner as they
are with simple regression. First, a predicted value, ŷ, is determined by entering the value
for each independent variable for a given set of observations into the multiple regression
equation and solving for ŷ. Next, the value of y - ŷ is computed for each set of observations.

Shown here are the calculations for the residuals of the first set of observations from Table
13.1. The predicted value of y for x1 = 1605 and x2 = 35 is

ŷ = 57.4 + .0177(1605) - .666(35) = 62.499

Actual value of y = 63.0

Residual = y - ŷ = 63.0 - 62.499 = 0.501

An examination of the residuals in Table 13.2 can reveal some information about the
fit of the real estate regression model. The business researcher can observe the residuals
and decide whether the errors are small enough to support the accuracy of the model. The
house price figures are in units of $1,000. Two of the 23 residuals are more than 20.00, or
more than $20,000 off in their prediction. On the other hand, two residuals are less than 1,
or $1,000 off in their prediction.

Residuals are also helpful in locating outliers. Outliers are data points that are apart,
or far, from the mainstream of the other data. They are sometimes data points that were
mistakenly recorded or measured. Because every data point influences the regression
model, outliers can exert an overly important influence on the model based on their
distance from other points. In examining the residuals in Table 13.2 for outliers, the eighth residual listed

Building Multiple Regression Models

NONLINEAR MODELS: MATHEMATICAL TRANSFORMATION

The regression models presented thus far are based on the general linear regression
model, which has the form

y = β0 + β1x1 + β2x2 + ... + βkxk + ε

where

β0 = the regression constant

β1, β2, ..., βk are the partial regression coefficients for the k independent variables

x1, ..., xk are the independent variables

k = the number of independent variables

In this general linear model, the parameters, βi, are linear. It does not mean, however, that
the dependent variable, y, is necessarily linearly related to the predictor variables. Scatter
plots sometimes reveal a curvilinear relationship between x and y.

Multiple regression response surfaces are not restricted to linear surfaces and may be
curvilinear.

To this point, the variables, xi , have represented different predictors. For example, in the
real estate example presented in Chapter 13, the variables—x1, x2—represented two
predictors: number of square feet in the house and the age of the house, respectively.

Certainly, regression models can be developed for more than two predictors. For example,
a marketing site location model could be developed in which sales, as the response
variable, is predicted by population density, number of competitors, size of the store, and
number

of salespeople. Such a model could take the form

ŷ = b0 + b1x1 + b2x2 + b3x3 + b4x4

This regression model has four xi variables, each of which represents a different predictor.

Polynomial Regression

Regression models in which the highest power of any predictor variable is 1 and in which
there are no interaction terms—cross products (xi xj)—are referred to as first-order
models. Simple regression models like those presented in Chapter 12 are first-order
models with one independent variable. The general model for simple regression is

y = β0 + β1x1 + ε

If a second independent variable is added, the model is referred to as a first-order model
with two independent variables and appears as

y = β0 + β1x1 + β2x2 + ε

Polynomial regression models are regression models that are second- or higher-order
models. They contain squared, cubed, or higher powers of the predictor variable(s) and
contain response surfaces that are curvilinear.

Consider a regression model with one independent variable where the model includes

a second predictor, which is the independent variable squared. Such a model is referred to
as a second-order model with one independent variable because the highest power among
the predictors is 2, but there is still only one independent variable.
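A minimal Python sketch of such a second-order model with one independent variable, y = b0 + b1x + b2x² (the data values here are hypothetical):

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)       # hypothetical predictor
y = np.array([2.1, 4.8, 9.5, 16.2, 24.9, 36.1])     # hypothetical response showing curvature

# polyfit returns the coefficients of the highest power first: [b2, b1, b0]
b2, b1, b0 = np.polyfit(x, y, deg=2)
y_hat = b0 + b1 * x + b2 * x ** 2                   # fitted (predicted) values
print(b0, b1, b2)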

INDICATOR (DUMMY) VARIABLES

Some variables are referred to as qualitative variables (as opposed to quantitative


variables) because qualitative variables do not yield quantifiable outcomes. Instead,
qualitative

variables yield nominal- or ordinal-level information, which is used more to categorize


items. These variables have a role in multiple regression and are referred to as indicator,
or

dummy variables. In this section, we will examine the role of indicator, or dummy, variables
as predictors or independent variables in multiple regression analysis.

Indicator variables arise in many ways in business research. Mail questionnaire or

personal interview demographic questions are prime candidates because they tend to
generate qualitative measures on such items as sex, geographic region, occupation, marital

status, level of education, economic class, political affiliation, religion, management/


nonmanagement status, buying/leasing a home, method of transportation, or type of broker.
In one business study, business researchers were attempting to develop a multiple
regression model to predict the distances shoppers drive to malls in the greater Cleveland
area.

One independent variable was whether the mall was located on the shore of Lake Erie. In a
second study, a site location model for pizza restaurants included indicator variables for

(1) whether the restaurant served beer and (2) whether the restaurant had a salad bar.

MODEL-BUILDING: SEARCH PROCEDURES


We have explored various types of multiple regression models.

We evaluated the strengths of regression models and learned how to understand more
about the output from multiple regression computer packages. In this section we examine

procedures for developing several multiple regression model options to aid in the
decision-making process.

Suppose a researcher wants to develop a multiple regression model to predict the world
production of crude oil. The researcher realizes that much of the world crude oil market is
driven by variables related to usage and production in the United States. The researcher
decides to use as predictors the following five independent variables.

1. U.S. energy consumption

2. Gross U.S. nuclear electricity generation

3. U.S. coal production

4. Total U.S. dry gas (natural gas) production

5. Fuel rate of U.S.-owned automobiles

The researcher measured data for each of these variables for the year preceding each

data point of world crude oil production, figuring that the world production is driven by the
previous year’s activities in the United States. It would seem that as the energy
consumption of the United States increases, so would world production of crude oil. In
addition, it makes sense that as nuclear electricity generation, coal production, dry gas
production, and fuel rates increase, world crude oil production would decrease if energy
consumption stays approximately constant.

Table 14.6 shows data for the five independent variables along with the dependent variable,
world crude oil production. Using the data presented in Table 14.6, the researcher

attempted to develop a multiple regression model using five different independent


variables. The result of this process was the Minitab output in Figure 14.9. Examining the
output, the researcher can reach some conclusions about that particular model and its
variables.

The output contains an R2 value of 92.1%, a standard error of the estimate of 1.215,

and an overall significant F value of 46.62. Notice from Figure 14.9 that the t ratios indicate
that the regression coefficients of four of the predictor variables, nuclear, coal, dry gas,

and fuel rate, are not significant at α = .05.

MULTICOLLINEARITY
One problem that can arise in multiple regression analysis is multicollinearity.
Multicollinearity occurs when two or more of the independent variables of a multiple regression
model are highly correlated. Technically, if two of the independent variables are correlated,
we have collinearity; when three or more independent variables are correlated, we have
multicollinearity. However, the two terms are frequently used interchangeably.

The reality of business research is that most of the time some correlation between
predictors (independent variables) will be present. The problem of multicollinearity arises

when the intercorrelation between predictor variables is high. This relationship causes
several other problems, particularly in the interpretation of the analysis.

1. It is difficult, if not impossible, to interpret the estimates of the regression


coefficients.

2. Inordinately small t values for the regression coefficients may result.

3. The standard deviations of regression coefficients are overestimated.

4. The algebraic sign of estimated regression coefficients may be the opposite of what
would be expected for a particular predictor variable.

The problem of multicollinearity can arise in regression analysis in a variety of business


research situations. For example, suppose a model is being developed to predict

salaries in a given industry. Independent variables such as years of education, age, years in
management, experience on the job, and years of tenure with the firm might be considered
as predictors. It is obvious that several of these variables are correlated (virtually all of
these

variables have something to do with number of years, or time) and yield redundant
information.

Suppose a financial regression model is being developed to predict bond market

rates by such independent variables as Dow Jones average, prime interest rates, GNP,
producer price index, and consumer price index. Several of these predictors are likely to be
intercorrelated.

Time-Series Forecasting and Index Numbers

INTRODUCTION TO FORECASTING

Virtually all areas of business, including production, sales, employment, transportation,


distribution, and inventory, produce and maintain time-series data.

SMOOTHING TECHNIQUES
Several techniques are available to forecast time-series data that are stationary or that
include no significant trend, cyclical, or seasonal effects. These techniques are often
referred
to as smoothing techniques because they produce forecasts based on “smoothing out” the
irregular fluctuation effects in the time-series data. Three general categories of smoothing
techniques are presented here: (1) naïve forecasting models, (2) averaging models, and

(3) exponential smoothing.

Naïve Forecasting Models

Naïve forecasting models are simple models in which it is assumed that the more recent
time periods of data represent the best predictions or forecasts for future outcomes. Naïve
models do not take into account data trend, cyclical effects, or seasonality. For this reason,
naïve models seem to work better with data that are reported on a daily or weekly basis or
in situations that show no trend or seasonality. The simplest of the naïve forecasting
methods is the model in which the forecast for a given time period is the value for the
previous time period.

Ft = xt-1

where

Ft = the forecast value for time period t

xt-1 = the value for time period t - 1
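A minimal Python sketch of this naïve model, Ft = xt-1 (the series values are hypothetical):

series = [142, 148, 151, 147, 155, 160]      # hypothetical time-series values

# the forecast for each period is simply the observed value of the previous period
forecasts = [None] + series[:-1]             # no forecast exists for the first period
for actual, forecast in zip(series, forecasts):
    print(actual, forecast)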

TREND ANALYSIS

There are several ways to determine trend in time-series data and one of the more
prominent is regression analysis. In Section 12.9, we explored the use of simple regression
analysis in determining the equation of a trend line. In time-series regression trend
analysis, the response variable, Y, is the variable being forecast, and the independent
variable, X, represents time.

Many possible trend fits can be explored with time-series data. In this section we examine
only the linear model and the quadratic model because they are the easiest to

understand and simplest to compute. Because seasonal effects can confound trend
analysis, it is assumed here that no seasonal effects occur in the data or they were
removed prior to determining the trend.

SEASONAL EFFECTS

Earlier in the chapter, we discussed the notion that time-series data consist of four
elements: trend, cyclical effects, seasonality, and irregularity. In this section, we examine
techniques for identifying seasonal effects. Seasonal effects are patterns of data behavior
that occur in periods of time of less than one year. How can we separate out the seasonal
effects?
Decomposition

One of the main techniques for isolating the effects of seasonality is decomposition. The
decomposition methodology presented here uses the multiplicative model as its basis. The
multiplicative model is:

T × C × S × I

where

T = trend

C = cyclicality

S = seasonality

I = irregularity

INDEX NUMBERS

One particular type of descriptive measure that is useful in allowing comparisons of data
over time is the index number. An index number is, in part, a ratio of a measure taken
during one time frame to that same measure taken during another time frame, usually
denoted

as the base period. Often the ratio is multiplied by 100 and is expressed as a percentage.
When expressed as a percentage, index numbers serve as an alternative to comparing raw
numbers. Index number users become accustomed to interpreting measures for a given
time period in light of a base period on a scale in which the base period has an index of
100(%). Index numbers are used to compare phenomena from one time period to another.
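As an illustration (hypothetical figures): if a commodity cost Rs 50 in the base period and
Rs 65 in the current period, its index number for the current period is (65 / 50) × 100 = 130,
i.e. a 30% increase over the base period, whose index is 100.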

UNIT 5
NONPARAMETRIC STATISTICS AND QUALITY

CHI-SQUARE GOODNESS-OF-FIT TEST



The chi-square goodness-of-fit test compares the expected, or theoretical, frequencies of


categories from a population distribution to the observed, or actual, frequencies from a
distribution to determine whether there is a difference between what was expected and
what was observed. For example, airline industry officials might theorize that the ages of

airline ticket purchasers are distributed in a particular way. To validate or reject this
expected distribution, an actual sample of ticket purchaser ages can be gathered randomly,
and the observed results can be compared to the expected results with the chi-square
goodness-of-fit test.

CHI-SQUARE GOODNESS-OF-FIT TEST

χ² = Σ (fo - fe)² / fe

df = k - 1 - c

where

fo = frequency of observed values

fe = frequency of expected values

k = number of categories

c = number of parameters being estimated from the sample data
CONTINGENCY ANALYSIS: CHI-SQUARE TEST OF INDEPENDENCE

The chi-square goodness-of-fit test is used to analyze the distribution of frequencies for
categories of one variable, such as age or number of bank arrivals,

to determine whether the distribution of these frequencies is the same as some


hypothesized or expected distribution. However, the goodness-of-fit test cannot be used to
analyze two variables simultaneously. A different chi-square test, the

chi-square test of independence, can be used to analyze the frequencies of two variables
with multiple categories to determine whether the two variables are independent. Many
times this type of analysis is desirable. For example, a market researcher might want to
determine whether the type of soft drink preferred by a consumer is independent of the
consumer’s age.

Nonparametric Statistics

RUNS TEST
The one-sample runs test is a nonparametric test of randomness. The runs test is used to
determine whether the order or sequence of observations in a sample is random. The runs
test examines the number of “runs” of each of two possible characteristics that sample
items

may have. A run is a succession of observations that have a particular one of the
characteristics.

For example, if a sample of people contains both men and women, one run could be

a continuous succession of women. In tossing coins, the outcome of three heads in a row
would constitute a run, as would a succession of seven tails.

Suppose a researcher takes a random sample of 15 people who arrive at a Wal-Mart to


shop. Eight of the people are women and seven are men. If these people arrive randomly at

the store, it makes sense that the sequence of arrivals would have some mix of men and
women, but not probably a perfect mix. That is, it seems unlikely (although possible) that
the sequence of a random sample of such shoppers would be first eight women and then
seven men. In such a case, there are two runs. Suppose, however, the sequence of
shoppers is woman, man, woman, man, woman, and so on all the way through the sample.
This

would result in 15 “runs.” Each of these cases is possible, but neither is highly likely in a
random scenario. In fact, if there are just two runs, it seems possible that a group of
women

came shopping together followed by a group of men who did likewise. In that case, the
observations would not be random. Similarly, a pattern of woman-man all the way through

may make the business researcher suspicious that what has been observed is not really
individual random arrivals, but actually random arrivals of couples.

In a random sample, the number of runs is likely to be somewhere between these


extremes. What number of runs is reasonable? The one-sample runs test takes into
consideration the size of the sample, n, the number of observations in the sample having each

characteristic, n1, n2 (man, woman, etc.), and the number of runs in the sample, R, to reach
conclusions about hypotheses of randomness. The following hypotheses are tested by the
one-sample runs test.

H0: The observations in the sample are randomly generated. Ha: The observations in the
sample are not randomly generated.

MANN-WHITNEY U TEST

The Mann-Whitney U test is a nonparametric counterpart of the t test used to compare the
means of two independent populations. This test was developed by Henry B. Mann and
D. R. Whitney in 1947. Recall that the t test for independent samples presented in Chapter 10
can be used when data are at least interval in measurement and the populations are
normally

distributed. However, if the assumption of a normally distributed population is invalid or if


the data are only ordinal in measurement, the t test should not be used. In such cases, the
Mann-Whitney U test is an acceptable option for analyzing the data. The following

assumptions underlie the use of the Mann-Whitney U test.

1. The samples are independent.

2. The level of data is at least ordinal.

The two-tailed hypotheses being tested with the Mann-Whitney U test are as follows.

H0: The two populations are identical. Ha: The two populations are not identical.

Computation of the U test begins by arbitrarily designating two samples as group 1 and
group 2. The data from the two groups are combined into one group, with each data value
retaining a group identifier of its original group. The pooled values are then ranked

from 1 to n, with the smallest value being assigned a rank of 1. The sum of the ranks of
values from group 1 is computed and designated as W1 and the sum of the ranks of values

from group 2 is designated as W2.

The Mann-Whitney U test is implemented differently for small samples than for large
samples. If both n1 and n2 are ≤ 10, the samples are considered small. If either n1 or n2 is greater
than 10, the samples are considered large.

Small-Sample Case

With small samples, the next step is to calculate a U statistic for W1 and for W2 as

U1 = n1·n2 + n1(n1 + 1)/2 - W1   and   U2 = n1·n2 + n2(n2 + 1)/2 - W2

The test statistic is the smaller of these two U values. Both values do not need to be
calculated; instead, one value of U can be calculated and the other can be found by using
the transformation

U' = n1·n2 - U

Table A.13 contains p-values for U. To determine the p-value for a U from the table, let

n1 denote the size of the smaller sample and n2 the size of the larger sample. Using the
particular table in Table A.13 for n1, n2, locate the value of U in the left column. At the
intersection

of the U and n1 is the p-value for a one-tailed test. For a two-tailed test, double the p-value
shown in the table.
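A minimal Python sketch (assuming SciPy is available; the two small samples are hypothetical ordinal scores):

from scipy import stats

group1 = [11, 13, 13, 14, 17, 19, 21]     # hypothetical sample 1
group2 = [20, 22, 24, 25, 27, 28, 30]     # hypothetical sample 2

# SciPy reports a U statistic and a p-value; for a two-tailed test of
# "the two populations are identical" use alternative='two-sided'
u_stat, p_value = stats.mannwhitneyu(group1, group2, alternative='two-sided')
print(u_stat, p_value)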

WILCOXON MATCHED-PAIRS SIGNED RANK TEST


The Mann-Whitney U test presented in Section 17.2 is a nonparametric alternative to the

t test for two independent samples. If the two samples are related, the U test is not
applicable.

A test that does handle related data is the Wilcoxon matched-pairs signed rank test, which
serves as a nonparametric alternative to the t test for two related samples. Developed by
Frank Wilcoxon in 1945, the Wilcoxon test, like the t test for two related samples, is used

to analyze several different types of studies when the data of one group are related to the
data in the other group, including before-and-after studies, studies in which measures are
taken on the same person or object under two different conditions, and studies of twins or
other relatives.

The Wilcoxon test utilizes the differences of the scores of the two matched groups in a
manner similar to that of the t test for two related samples. After the difference scores

have been computed, the Wilcoxon test ranks all differences regardless of whether the
difference is positive or negative. The values are ranked from smallest to largest, with a
rank

of 1 assigned to the smallest difference. If a difference is negative, the rank is given a


negative sign. The sum of the positive ranks is tallied along with the sum of the negative

ranks. Zero differences representing ties between scores from the two groups are ignored,
and the value of n is reduced accordingly. When ties occur between ranks, the ranks are

averaged over the values. The smallest sum of ranks (either + or -) is used in the analysis
and is represented by T. The Wilcoxon matched-pairs signed rank test procedure for
determining statistical significance differs with sample size. When the number of matched
pairs, n, is greater than 15, the value of T is approximately normally distributed and a z
score is computed to test the null hypothesis. When the sample size is small (n ≤ 15), a
different procedure is followed.

Two assumptions underlie the use of this technique.

1. The paired data are selected randomly.

2. The underlying distributions are symmetrical. The following hypotheses are being
tested.

For two-tailed tests:

H0: Md = 0     Ha: Md ≠ 0

For one-tailed tests:

H0: Md = 0     Ha: Md > 0     or     H0: Md = 0     Ha: Md < 0

where Md is the median.

Small-Sample Case (n ≤ 15)


When sample size is small, a critical value against which to compare T can be found in
Table A.14 to determine whether the null hypothesis should be rejected. The critical value is
located by using n and α. Critical values are given in the table for α = .05, .025, .01, and .005
for one-tailed tests and α = .10, .05, .02, and .01 for two-tailed tests. If the observed
value of T is less than or equal to the critical value of T, the decision is to reject the null
hypothesis.

As an example, consider the survey by American Demographics that estimated the average
annual household spending on healthcare. The U.S. metropolitan average was

$1,800. Suppose six families in Pittsburgh, Pennsylvania, are matched demographically with
six families in Oakland, California, and their amounts of household spending on healthcare
for last year are obtained. The data follow on the next page.


KRUSKAL-WALLIS TEST

The nonparametric alternative to the one-way analysis of variance is the Kruskal-Wallis


test, developed in 1952 by William H. Kruskal and W. Allen Wallis. Like the one-way analysis
of variance, the Kruskal-Wallis test is used to determine whether c 3 samples come from

the same or different populations.Whereas the one-way ANOVA is based on the


assumptions of normally distributed populations, independent groups, at least interval level
data,

and equal population variances, the Kruskal-Wallis test can be used to analyze ordinal data
and is not based on any assumption about population shape. The Kruskal-Wallis test is
based on the assumption that the c groups are independent and that individual items are
selected randomly.

The hypotheses tested by the Kruskal-Wallis test follow.

H0: The c populations are identical.

Ha: At least one of the c populations is different.

This test determines whether all of the groups come from the same or equal populations or
whether at least one group comes from a different population.

The process of computing a Kruskal-Wallis K statistic begins with ranking the data in all
the groups together, as though they were from one group. The smallest value is awarded

a 1. As usual, for ties, each value is given the average rank for those tied values. Unlike
oneway ANOVA, in which the raw data are analyzed, the Kruskal-Wallis test analyzes the
ranks

of the data. The following formula is used to compute a Kruskal-Wallis K statistic.

KRUSKAL-WALLIS TEST

K = [12 / (n(n + 1))] Σ (Tj² / nj) - 3(n + 1)

where

c = number of groups

n = total number of items

Tj = total of ranks in a group

nj = number of items in a group

K ≈ χ², with df = c - 1
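A minimal Python sketch (assuming SciPy is available; the c = 3 groups contain hypothetical values):

from scipy import stats

group1 = [19, 21, 29, 34, 36]     # hypothetical group 1
group2 = [30, 38, 39, 41, 46]     # hypothetical group 2
group3 = [39, 44, 48, 53, 57]     # hypothetical group 3

# K is approximately chi-square distributed with df = c - 1
k_stat, p_value = stats.kruskal(group1, group2, group3)
print(k_stat, p_value)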

FRIEDMAN TEST
The Friedman test, developed by M. Friedman in 1937, is a nonparametric alternative to the
randomized block design discussed in Chapter 11. The randomized block design has the
same assumptions as other ANOVA procedures, including observations are drawn from
normally distributed populations. When this assumption cannot be met or when the
researcher has ranked data, the Friedman test provides a nonparametric alternative.

Three assumptions underlie the Friedman test.

1. The blocks are independent.

2. No interaction is present between blocks and treatments.

3. Observations within each block can be ranked.

The hypotheses being tested are as follows.

H0: The treatment populations are equal.

Ha: At least one treatment population yields larger values than at least one other treatment
population.

The first step in computing a Friedman test is to convert all raw data to ranks (unless the
data are already ranked). However, unlike the Kruskal-Wallis test where all data are
ranked together, the data in a Friedman test are ranked within each block from smallest to largest.

SPEARMAN’S RANK CORRELATION


The Pearson product-moment correlation coefficient, r, was presented and

discussed as a technique to measure the amount or degree of association between two


variables.

The Pearson r requires at least an interval level of measurement for the data. When only
ordinal-level data or ranked data are available, Spearman’s rank correlation, rs, can be

used to analyze the degree of association of two variables. Charles E. Spearman (1863–
1945) developed this correlation coefficient.
The formula for calculating a Spearman’s rank correlation is as follows:

SPEARMAN’S RANK CORRELATION

rs = 1 - (6 Σd²) / (n(n² - 1))

where

n = number of pairs being correlated

d = the difference in the ranks of each pair

The Spearman’s rank correlation formula is derived from the Pearson product-moment
formula and utilizes the ranks of the n pairs instead of the raw data. The value of

d is the difference in the ranks of each pair.

The process begins by the assignment of ranks within each group. The difference in ranks
between each group (d) is calculated by subtracting the rank of a member of one group
from the rank of its associated member of the other group. The differences (d) are then
squared and summed. The number of pairs in the groups is represented by n.
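A minimal Python sketch of the computation, both from the formula above and with SciPy's built-in function (the paired values are hypothetical):

from scipy import stats

x = [23, 41, 37, 29, 52, 48]          # hypothetical group 1 values
y = [201, 259, 234, 240, 289, 275]    # hypothetical group 2 values

# rank each group, take the differences in ranks, then apply rs = 1 - 6*sum(d^2)/(n(n^2 - 1))
rank_x = stats.rankdata(x)
rank_y = stats.rankdata(y)
d = rank_x - rank_y
n = len(x)
rs_formula = 1 - 6 * sum(d ** 2) / (n * (n ** 2 - 1))

rs_scipy, p_value = stats.spearmanr(x, y)
print(rs_formula, rs_scipy)           # the two values agree when there are no ties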

Statistical Quality Control

INTRODUCTION TO QUALITY CONTROL

Quality control (sometimes referred to as quality assurance) is the collection

of strategies, techniques, and actions taken by an organization to assure itself that it is


producing a quality product.

From this point of view, quality control begins with product planning and design, where
attributes of the product or service are determined and specified, and continues through
product production or service operation until feedback from the final consumer is looped
back through the institution for product improvement.

PROCESS ANALYSIS

A process is “a series of actions, changes, or functions that bring about a result.” Processes
usually involve the manufacturing, production, assembling, or development of some output
from given input. Generally, in a


meaningful system, value is added to the input as part of the process. In the area of
production,
processes are often the main focus of decision makers. Production processes abound

in the chemical, steel, automotive, appliance, computer, furniture, and clothing manufacture
industries, as well as many others. Production layouts vary, but it is not difficult to picture
an assembly line with its raw materials, parts, and supplies being processed into a finished
product that becomes worth more than the sum of the parts and materials that went into it.

However, processes are not limited to the area of production. Virtually all other areas of
business involve processes. The processing of a check from the moment it is used for a
purchase, through the financial institution, and back to the user is one example. The hiring
of new

employees by a human resources department involves a process that might begin with a
job description and end with the training of a new employee. Many different processes
occur

within healthcare facilities. One process involves the flow of a patient from check-in at a
hospital through an operation to recovery and release. Meanwhile, the dietary and foods
department prepares food and delivers it to various points in the hospital as part of another
process.

The patient’s paperwork follows still another process, and central supply processes
medical supplies from the vendor to the floor supervisor.

There are many tools that have been developed over the years to assist managers and
workers in identifying, categorizing, and solving problems in the continuous
quality-improvement process. Among these are the seven basic tools of quality developed by
Kaoru Ishikawa in the 1960s. Ishikawa believed that 95% of all quality-related problems
could be solved using these basic tools, which are sometimes referred to as the “seven old
tools.”

The seven basic tools are as follows:

1. Flowchart or process map

2. Pareto chart

3. Cause-and-effect diagram (Ishikawa or fishbone chart)

4. Control chart

5. Check sheet or checklist

6. Histogram

7. Scatter chart or scatter diagram

Flowcharts

One of the first activities that should take place in process analysis is the flowcharting of
the process from beginning to end. A flowchart is a schematic representation of all the
activities and interactions that occur in a process. It includes decision points, activities,
input/output, start/stop, and a flowline. Figure 18.1 displays some of the symbols used in
flowcharting.

The parallelogram represents input into the process or output from the process. In the
case of the dietary/foods department at the hospital, the input includes uncooked food,
utensils, plates, containers, and liquids. The output is the prepared meal delivered to the
patient’s room. The processing symbol is a rectangle that represents an activity. For the
dietary/foods department, that activity could include cooking carrots or loading food carts.

The decision symbol, a diamond, is used at points in the process where decisions are made
that can result in different pathways. In some hospitals, the dietary/foods department
supports a hospital cafeteria as well as patient meals.

CONTROL CHARTS

The use of control charts in the United States failed to gain momentum after World War II
because the success of U.S. manufacturers in the world market reduced the apparent need

for such a tool. As the Japanese and other international manufacturers became more
competitive by using such tools, however, interest in control charts was renewed in the
United States.

Decision Analysis

THE DECISION TABLE AND DECISION MAKING UNDER CERTAINTY

Decision alternatives are the various choices or options available to the decision maker in
any given problem situation. On most days, financial managers face the choices of whether
to invest in blue chip stocks, bonds, commodities, certificates of deposit, money markets,
annuities, and other investments. Construction decision makers must decide whether to
concentrate on one building job today, spread out workers and equipment to several jobs,
or not work today. In virtually every possible business scenario, decision alternatives are
available. A good decision maker identifies many options and effectively evaluates them.

States of nature are the occurrences of nature that can happen after a decision is made
that can affect the outcome of the decision and over which the decision maker has little or
no

control. These states of nature can be literally natural atmospheric and climatic conditions
or they can be such things as the business climate, the political climate, the worker
climate, or the condition of the marketplace, among many others. The financial investor
faces such states of nature as the prime interest rate, the condition of the stock market,
the international monetary exchange rate, and so on. A construction company is faced with
such states
of nature as the weather, wildcat strikes, equipment failure, absenteeism, and supplier
inability to deliver on time. States of nature are usually difficult to predict but are important
to identify in the decision-making process.

Decision Table

The concepts of decision alternatives, states of nature, and payoffs can be examined jointly
by using a decision table, or payoff table. Table 19.1 shows the structure of a decision table.
On the left side of the table are the various decision alternatives, denoted by di . Along the
top row are the states of nature, denoted by sj . In the middle of the table are the various
payoffs for each decision alternative under each state of nature, denoted by Pij .

Maximax Criterion

The maximax criterion approach is an optimistic approach in which the decision maker
bases action on a notion that the best things will happen. The decision maker isolates the
maximum payoff under each decision alternative and then selects the decision alternative
that produces the highest of these maximum payoffs. The name “maximax” means selecting
the maximum overall payoff from the maximum payoffs of each decision alternative.
Consider the $10,000 investment problem. The maximum payoff is $2,200 for stocks, $900
for

bonds, $750 for CDs, and $1,300 for the mixture of investments. The maximax criterion
approach requires that the decision maker select the maximum of these four payoffs, which
is the $2,200 offered by stocks.

Decision Trees
Another way to depict the decision process is through the use of decision trees. Decision
trees have a ■ node to represent decision alternatives and a ● node to represent states of
nature. If probabilities are available for states of nature, they are assigned to the line
segment following the state-of-nature node symbol, ●. Payoffs are displayed at the ends of
the decision tree limbs. Figure 19.2 is a decision tree for the financial investment example
given in Table 19.5.

Expected Monetary Value (EMV)


One strategy that can be used in making decisions under risk is the expected monetary
value (EMV) approach. A person who uses this approach is sometimes referred to as an

EMVer. The expected monetary value of each decision alternative is calculated by


multiplying the probability of each state of nature by the state’s associated payoff and
summing
these products across the states of nature for each decision alternative, producing an
expected monetary value for each decision alternative. The decision maker compares the
expected monetary values for the decision alternatives and selects the alternative with the
highest expected monetary value.

As an example, we can compute the expected monetary value for the $10,000 investment
problem displayed in Table 19.5 and Figure 19.2 with the associated probabilities. We

use the following calculations to find the expected monetary value for the decision
alternative Stocks.

Expected Value for Stagnant Economy = (.25)(-$500) = -$125

Expected Value for Slow-Growth Economy = (.45)($700) = $315

Expected Value for Rapid-Growth Economy = (.30)($2,200) = $660

The expected monetary value of investing in stocks is

-$125 + $315 + $660 = $850
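A minimal Python sketch of the same EMV calculation for the Stocks alternative (the probabilities and payoffs are those given above; the remaining alternatives in Table 19.5 would be handled the same way):

probabilities = [0.25, 0.45, 0.30]     # stagnant, slow-growth, rapid-growth economy
stock_payoffs = [-500, 700, 2200]      # payoffs (in $) for the Stocks alternative

emv_stocks = sum(p * payoff for p, payoff in zip(probabilities, stock_payoffs))
print(emv_stocks)                      # 850.0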
