
Business Statistics Courseware

CHAPTER 1
1 Business Statistics Certification Program

Congratulations on deciding to take the ExpertRating Business Statistics Program! As businesses become increasingly complex, decision-making is no longer a layman's job; it requires training to make decisions in situations marked by uncertainty and complexity. Needless to say, the advantages of Business Statistics are huge. This is thus an opportune time to take a course in Business Statistics and reap its rewards through a well-directed and dedicated effort.

The ExpertRating Business Statistics Program has been developed by ExpertRating Ltd, a leader in online testing and certification with over 200,000 certified professionals in over 40 countries and in over 100 different skill areas. This certification program is one of the most comprehensive programs available anywhere to date.

You have made the right decision!

Not only have you chosen a business statistics program that is well recognized, you have also made an important and wise
decision to equip yourself with all the skills needed to make informed decisions in the world of business. Read on to know
more about the program.

You will proceed through the courseware according to the following list of chapters:

Chapter 1: Introduction to Business Statistics Courseware

This chapter will explore data collection methods that are relevant in today’s modern business environment. From sales
figures to employee attendance, collecting and analyzing the right data is the backbone of a successful company.

Chapter 2: Organizing and Presenting your Business Data

This chapter will cover the different ways that data can be presented visually. From graphs to charts and tables, a simplified yet engaging presentation of data to stakeholders is important.

Chapter 3: Measures of Central Tendency and Dispersion

A successful business professional wants to make sure that goods and services always stay within a narrow and consistent distance from his or her goals. Dispersion, or variability in quality, is a solid measure of how well a business is doing. This chapter will explore dispersion analysis and its practical implications in the business environment.

Chapter 4: Introduction to Probability

Will the new marketing campaign increase holiday sales? The answer to this question is covered through probabilities in
business decision-making. All business decisions carry a risk of failure or under-performance. Understanding probabilities
and how to calculate them is an essential skill for today’s business professional.

Chapter 5: Research Methods in Business

This chapter will explore research methodology used by a number of business units within a company such as marketing and
sales to improve business outcomes.

Chapter 6: Sampling Methods in Business Research

Do you have the right customer demographic for your new product? The right sample in a business study or on-going
research is crucial in the success of the research study and the relevance of the results. This chapter explores sampling
techniques and their importance in business research and decisions.

Chapter 7: Testing Your Business Hypothesis

This chapter will explore an introduction to testing hypotheses in business settings.

Chapter 8: Correlations between Business Variables

It is a popular adage that correlation and causation are two different things. This chapter will explore what correlations are
and how to calculate them between business variables.

Chapter 9: Making Business Predictions: Linear Regression Analysis - Part 1

This chapter will explore the prediction of business outcomes through linear regression, that is, how one or more variables affect an outcome.

Chapter 10: Making Business Predictions: Linear Regression Analysis - Part 2

This chapter will continue to explore regression principles by examining how to calculate a random error term and how to
conduct multiple regression analysis.

Chapter 11: Making Effective Comparisons through Analysis of Variance (ANOVA)

Customers form populations that business professionals have to understand. ANOVA provides a way to compare the averages of different populations.

Chapter 12: Using Excel In Business Statistics

1.1 The Final Certification Test

After you have gone through the complete program material and revised it, you can appear for the final test. You must appear for the test without referring to the text material, and it is advisable that you be well prepared for the test. The specifications of the test are mentioned below:

The ExpertRating Exam Format

Type of Exam - Multiple choice with one or more correct answers

Duration - 45 minutes to 1 hour


Number of Questions - 40-60

Question Weightage - All questions carry equal marks.

Navigation - You can go back and answer unanswered questions.

Answer Reviews - You can review the questions at the end of the exam by going back and answering marked
questions.

Exhibits - Some exams will require you to answer a question based upon an exhibit.

Pass marks - 50%

Retake Policy - You can retake the test any number of times by paying the required retake fee.

Note: Some exams may follow a different format. Please read the exam details carefully before registering.

All successful candidates will receive a hardcopy Certificate of Accomplishment stating that they have completed all
the requirements of the Business Statistics Certification process. It will take about 3 weeks to get your certificate
through registered post. You will also get an online transcript that you can immediately use to display your test
marks and highlight the areas you are proficient in. You can link to the online transcript page from your website or
ask friends, relatives or business associates to look it up on the internet.

1.3 Understanding the How, What, and Why of Business Data

Overview

The business world is full of uncertainty and projections. As a business professional, it is your job to make sense of all the
numbers and data that the business generates. Communicating this data to your employees and other stakeholders is just as
important as how you collect and analyze this data.

In normal day-to-day business activities, you might find yourself asking the following questions: Did your employees’
performance output improve over the past 5 years? If yes, by how much? What are the chances that your business will sell
20% more products next year than this year? Will 2010 sales figures be higher than each of 2007 through 2009? Your
business presentation or report to the senior executives and other stakeholders will most likely include a bar graph like this
one.
As a business professional, you will find data in a number of places that include:

 Business reports
 Newspaper and magazine articles
 Technical reports
 Business research

In this Chapter, we will explore what business data is and how it is developed. The objectives of this chapter are as follows:

 What is Statistics?
 Different types of business statistics
 Business data collection methods

Keywords

Statistics
Descriptive
Inferential
Estimation
Hypothesis Testing
Population
Sample
Observational Studies
Surveys
Designed Experiments

1.4 What is Statistics?

Statistics is a branch of mathematics that takes numbers and turns them into information for decision makers. You must remember the last time you heard a news reporter say that 70% of people will experience a certain event or that a football team has a 30% chance to win a game.

Raw data – Raw data is the numbers that go with an observation or a study of a phenomenon.

In business, you are likely to hear that profit is 3% higher this year than the previous year. You may also hear that American workers, on average, put in 10 more hours of work per week than European workers.

Statistics is designed to make sense of these numbers. If, after analyzing data, you don't gain much information or knowledge, or the information is difficult and confusing, then your statistical analysis didn't do its job. Business statistics has to make your data more useful and easier to understand. In general, statistics is divided into 2 broad sections as follows:

Descriptive Statistics

Descriptive analysis involves collecting, summarizing and analyzing raw data.

Example 1: Collect data

To find out if your customers are satisfied with your products, you give out a survey to every customer who visits you. The
survey will most likely have questions such as these:

Question 1: How satisfied are you with the speed of service at our store?

Example 2: Summarize data

Now imagine that you wish to summarize the results of the satisfaction survey. Let’s say that 200 customers fill out the survey
and you wish to know how they’ve answered Question 1. You will need to add up how many customers have chosen each of
the choices from Very Dissatisfied (0) to Very Satisfied (5).

Example 3: Analyze data

Analyzing data is the next step after preparing summaries. Let’s say that you wish to know how many of your customers
answered Question 1 as 4 or 5 (which indicates high satisfaction with the speed of your services). Your analysis may show that
100 customers marked either a 4 or a 5 on Question 1 on your survey.
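The three steps above (collect, summarize, analyze) can be sketched in a few lines of Python. The responses below are hypothetical stand-ins for survey answers; the tallying logic is the point:

```python
from collections import Counter

# Hypothetical answers to Question 1, coded 0 (Very Dissatisfied) to 5 (Very Satisfied)
responses = [5, 4, 3, 5, 2, 4, 4, 5, 1, 4, 5, 3]

# Summarize: count how many customers chose each rating
counts = Counter(responses)

# Analyze: how many customers marked a 4 or a 5 (high satisfaction)?
highly_satisfied = counts[4] + counts[5]
print(highly_satisfied)                             # 8 of the 12 hypothetical customers
print(round(highly_satisfied / len(responses), 2))  # 0.67, as a proportion
```

With 200 real surveys, the same code would run unchanged; only the `responses` list would grow.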

At this stage, you must decide the best way to show the results of your data in graphic or chart form. We will cover data
presentation in greater detail in Chapter 2; however, your display of the analysis may look like this:
Inferential Statistics

Inferential analysis involves making decisions or drawing conclusions about a population based on results from a sample.

Population – The total group of people or things that you wish to analyze.
Sample – A subset of the population you are studying.

An example of an inferential analysis is when you wish to learn more about a larger population from a small sample.

Example 1 – Population and Sample


Population: In business settings, all the employees in your company are a population.

Sample: Those working in the sales department are considered a sample of this larger population. Another sample of this larger population is all workers over the age of 50, or workers who have been with the company for less than 5 years.

Inferential analysis is divided into 2 parts:

1. Estimation – Imagine that you wish to estimate the average height of a population. You can do so by choosing a sample (marked in red). You take the height of the people in your sample. Once you have done so, you can estimate what the average height of the population will be.
2. Hypothesis testing – Imagine that you want to test a claim that the average weight of a population you are studying is 150 lbs. Hypothesis testing allows you to analyze a sample of your larger population and then decide whether your hypothesis that the larger population weighs 150 lbs., on average, is true.
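Both parts can be illustrated with a short, self-contained Python sketch. The population values below are simulated, and the check is the informal "within two standard errors" rule rather than a formal t-test:

```python
import random
import statistics

random.seed(0)  # makes this sketch reproducible

# Hypothetical population of 5,000 weights (lbs); in practice you never measure all of it
population = [random.gauss(150, 20) for _ in range(5000)]

# Estimation: draw a sample and use its mean to estimate the population mean
sample = random.sample(population, 100)
estimate = statistics.mean(sample)
print(round(estimate, 1))  # close to the true population mean of about 150

# Hypothesis testing (informal sketch): is the sample consistent with a claimed
# population mean of 150 lbs? Compare the difference to the standard error.
se = statistics.stdev(sample) / (len(sample) ** 0.5)
z = (estimate - 150) / se
print(abs(z) < 2)  # True roughly means the claim is plausible at the ~95% level
```

Note how the estimate comes from only 100 of the 5,000 values, which is exactly the cost and time argument made below.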

You must be wondering: why not just test all of the population and get your information that way? Surely, the results will be
more accurate if you collect data on every person or thing in your population. While this is true, there are two important
reasons that samples are used instead in statistical analysis:

 Cost - It is very expensive to collect data from an entire population. Imagine that your company has 50,000 employees. The cost to collect data from each and every employee whenever information is needed would be much too high.
 Time-consuming - Testing an entire population can be too time-consuming. The US Gallup poll usually asks 3,000 people about current political events to gauge what the American public thinks. Can you guess how long it would take to ask 200 million American adults what they think? Testing a small sample saves time and, as we all know, Time is Money in everyday business.

Remember, the type of sample that you choose to collect data from is very important. In fact, there is an entire branch of statistics that is concerned with developing and improving sampling methods. Think of the last time a telephone marketer interrupted your dinner to ask questions about the type of household cleaning products you use. If you do answer the phone, they will most likely ask for your age and maybe your occupation.

Probability – The likelihood that something will happen based on information that you already have.

An important part of business statistics is asking customers the right questions and learning who is buying your products.
Marketing managers and professionals are aware of the importance of advertising to the right demographic. But who is in
your right demographic?

One way of finding out is to do a focus group (which is one method of picking the right sample). If you are launching a new
iPod application, then advertising to the 60+ year old demographic is not your likely population. You will most likely target the
much younger demographic of 18-34 year olds who are major users of iPods. We will discuss focus groups in depth a little
later.
The difference between a population and a sample is a very important concept in Business Statistics. It is important that you
master it as you proceed in the course. One of the important distinctions that you will need to remember is the following:

Parameter: Numerical data or measures that describe a population. For example, if you calculate the average age of ALL
employees in your company (a population), the measure you get is called a Parameter.

Statistic: Numerical data or measures that describe a sample. If you calculate the average age of HR employees ONLY in your
company, the measure you receive is called a Statistic.
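The distinction is easy to see in code. The ages below are hypothetical; the same `mean` function yields a parameter when applied to the whole population and a statistic when applied to a sample:

```python
import statistics

# Hypothetical ages of ALL employees in a small company (the population),
# grouped by department
ages = {
    "HR":    [29, 34, 41, 38],
    "Sales": [25, 31, 45, 27, 36],
    "IT":    [30, 33, 39],
}

all_ages = [a for dept in ages.values() for a in dept]

parameter = statistics.mean(all_ages)    # describes the entire population
statistic = statistics.mean(ages["HR"])  # describes the HR sample only

print(parameter)  # 34.0
print(statistic)  # 35.5
```

The two numbers differ because the HR sample is not a perfect miniature of the whole company, which is why inferential statistics is needed to bridge the gap.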

As we discuss descriptive statistics in Chapter 2, we will also introduce Probability as an important concept that is related to
Business Statistics. Probability is the likelihood that an event will occur based on information that you already have.

In our coverage of inferential statistics and their importance in business decision-making, we will discuss probabilities and
their importance.
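At its simplest, a probability of this kind is a proportion computed from data you already have. The survey figures below are hypothetical:

```python
# Hypothetical example: estimating a probability from information you already have.
# Suppose 140 of the last 200 surveyed customers said they would buy again.
repeat_buyers = 140
surveyed = 200

probability = repeat_buyers / surveyed
print(probability)  # 0.7, i.e., a 70% likelihood a customer will buy again
```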

Understanding Business Data

There are many reasons why you will need to collect data in day-to-day activities to address business needs. Some of those
reasons are as follows:

Table 1.2.2-1 Business Data

Data is at the core of all the statistical analyses you will conduct in Business Statistics. Many of your statistical calculations are
based on observations – including observations that can be made directly by people such as:

 Counting the number of cars that pass through a certain intersection every hour.
 Counting the number of customers who donate a dollar to a charity at the cashier
Other types of data are gathered through formal research such as:

 A survey about customers' preferred cell phone applications.


 A focus group on customers’ preference for product packaging
 A research study on the effectiveness of a new brand of household cleaner
 A safety testing study of a new vehicle

As you will notice, the person doing an observation is collecting the data firsthand. This source of data is referred to as a Primary Source; an example would be giving out surveys.

A Secondary Source of data is one where the person conducts analysis on data that he or she has NOT collected. For
example, as part of your marketing research, you will need to use Census data collected by the government to find out how
much your targeted customers make annually. While you are analyzing the data, you have not collected it yourself. This is
why it is called secondary.

Remember that one of the more important distinctions between primary data and secondary data is how much control you have as a business professional. If your city census collects demographic data in a certain way, you are essentially limited to that way. For example, if the annual census of incomes doesn't include those who are self-employed with a payroll of less than $100,000, you will not have access to that information from a secondary source such as your local census office.

If you decide to gain that information as part of your marketing research, you will have to choose a sample, build a survey questionnaire, and collect the data yourself. This will be primary data. Unlike secondary data, it is more expensive and time-consuming to obtain, as you must use personnel and administrative resources to collect it.

Business Data Collection Methods

As discussed earlier, there are 2 major sources of data: primary and secondary. They differ in the way that data is collected, the time and cost spent, and how much control you have over what types of data to collect. The following are four data collection methods:

Surveys

In business, surveys are useful in collecting information on client preferences, experiences, and practices. If you wish to find
out how satisfied your clients are, or what colors of sofa fabrics they like, or how much they intend to spend during the
coming Christmas season, a survey is the best way to get that information.
Surveys are relatively straightforward to design and they give you much control on the types of data you wish to collect.
However, they can be costly and time consuming as you will need personnel and resources to administer the surveys, collect
the data, and then analyze it.

Figure 1.4.1-1 – Sample Client Satisfaction Survey

Data Collected By Outside Agencies

In your business activities, you will sometimes analyze data collected elsewhere, usually by governmental or non-profit
agencies. Private research firms also conduct research, the results of which they make available to the general public. For
example, the US Census Bureau collects information on behavior patterns such as the total days spent commuting to work.

In Figure 1.4.2-1 below, you will notice that the Bureau has collected information about major cities in the United States by taking a sample from each city and asking them about their commute times. The times are then totaled and averaged as days.

As we have discussed previously, data collected by governmental or private agencies, including data available in business reports and journals, is important to a business professional. It is both time- and cost-efficient and can provide valuable information across a number of business functions including finance, marketing, and operations. The downside to this type of data is that it may be insufficient and provide little control.

Figure 1.4.2-1 Sample Government Data

Observational Studies

In business, observational studies are important to add a measure of objectivity to the data that is collected. Unlike surveys, which ask for a personal taste or preference, an observational study means that your customer or client is directly observed by you or your associates.

For example, in Table 1.4.3-1, you may wish to compare customer traffic at a given time of the day for 3 different stores owned by the same company. These figures could give the manager ideas on why Store A has more traffic on Saturdays than Stores B & C.

Table 1.4.3-1 Number of Customers Who Enter Store between 11 and Noon

It is important to remember that observational studies are designed to observe natural behavior (such as purchasing
patterns) of your customers or employees. These types of studies are more reliable than simply surveying. As a business
professional, you may use observational studies sometimes rather than rely on an estimate prepared by a regional manager
of an individual store. Like surveys, observational studies can be costly and time consuming. They require staff to observe
and collect data, and then analyze this data.

Designed Experiments

Designed experiments are important in a number of industries, including pharmaceuticals and manufacturing. A drug company will design experiments to determine whether a new painkiller is effective in controlling pain. Producers of fertilizers and other chemicals will also design similar experiments to test the effectiveness of their products.

The car manufacturing industry is also one that depends on designed experiments. For example, new cars are tested for a
number of things such as:

 Efficiency and speed
 Gas usage
 Safety

Designed experiments are usually expensive and time-consuming. They must be designed and carried out by specially trained professionals, and the results often have to be duplicated. In the business world, however, they represent the most reliable and rigorous sources of data.

Business Statistics and Computer Programs

Many business professionals will use computer programs to conduct data analysis. These programs provide an efficient and
quick way to perform multifunctional analyses. The challenge for you is to determine when and how to use these data
analysis tools. In both descriptive and inferential analyses, computer programs such as Excel can turn data into useful
information that is important in everyday business decision making.

When using Excel or any other computer program to analyze business data, it is important to make sure that you:

 Understand the functional capabilities of the program
 Understand the underlying statistical concepts
 Organize and present information in the clearest way
 Learn how to evaluate your analysis results for errors

Figure 1.4.4-1 Sample Excel Output


1.5 Summary of Key Concepts

 Statistics is a branch of mathematics that takes numbers and turns them into information for decision makers.
 Data can be found in a number of places that include business reports, newspaper and magazine articles, technical reports,
and business research.
 Descriptive analysis involves collecting, summarizing and analyzing raw data.
 Inferential analysis involves making decisions or drawing conclusions about a population based on results from a sample.
 Population is the total group of people or things that you wish to analyze while a sample is a subset of the population that you
are studying.
 There are two sources of data: primary and secondary. They differ in the types of data they produce and how much time and
cost must be invested to collect the data.
 Data collection methods include surveys, designed experiments, observational studies, and external data prepared by
governmental and private agencies.

Glossary of Terms

Descriptive statistics: Statistics involved in collecting, summarizing and analyzing raw data.

Designed experiments: A data collection method that uses experimentation to test effectiveness of a product or a
process

Inferential statistics: Inferential analysis involves making decisions or drawing conclusions about a population based
on results from a sample.

Observational studies: A data collection method that involves observing natural behavior and recording it.

Parameters: Measurements that result from the study of an entire population.

Population: Total group of people or things you wish to study.

Probability: The likelihood that something will happen based on information that you already have.

Raw data: Numbers that go with an observation or a study of a phenomenon.

Sample: A subset of the population that you wish to study.

Statistic: Measurements that result from the study of a sample.

Survey: A data collection method that involves developing and administering questionnaires.

1.6 Chapter Review Questions

Descriptive Questions

 Describe the difference between a Primary and a Secondary source of data.
 List and describe 3 data collection methods.
 What is Inferential Analysis? Identify and describe the two parts of Inferential Analysis.

Multiple Choice Questions

Mark the correct answer

1: ______________ analysis involves collecting, summarizing and analyzing raw data

a) Reflective
b) Descriptive
c) Directive
d) Inferential

2: Probability is the likelihood that an event ____________ based on information that you already have?

a) Will happen
b) May happen
c) Has happened
d) May not happen

3: What are all your statistical calculations based on?

a) Raw data
b) Relative data
c) Observations
d) Basic mathematical functions

4: _____________ are important to add a measure of objectivity to data that is collected.

a) Observational studies
b) Anonymous surveys
c) Multiple locations
d) Brief personal surveys

5: Surveys are useful in collecting information on the following EXCEPT:

a) Client preferences
b) Client accessibility
c) Client experiences
d) Client practices

6: What is the numerical data or measure that describes a sample?

a) Parameters
b) Statistic
c) Survey
d) Observational study

Answer Key

1-b, 2-a, 3-c, 4-a, 5-b, 6-b

References

Anderson, D.R., Sweeney, D.J., and Williams, T.A. (2007). Statistics for Business and Economics. South-Western College Publication. 10th Edition.

Freedman, D., Purves, R., and Pisani, R. (2007). Statistics. W.W. Norton & Company. 4th Edition.

CHAPTER 2

2 Organizing and Presenting your Business Data

Overview

In your day-to-day business activities, you are expected to present your results in a clear and understandable manner. It is
important to remember that business data has different audiences as follows:

 Internal employees
 Investors and stakeholders
 Government officials and legislators
 Business analysts and reporters
 General public

Transforming raw numbers and observations into data through statistical operations is only part of presenting your business
results. After you have transformed your data, you must present it using a number of methods that include tables, charts,
and graphs. However, before conducting meaningful business research and analyses, it is important to recognize and
properly organize your data into categories.

In this Chapter, we will explore what business data categories are and how best to present data results. The objectives of this chapter are as follows:

 Data organization
 Presenting your data
 Cross-tabulating your data
 Best practices in data presentation

Keywords

Nominal
Ordinal
Interval
Ratio
Categorical
Numerical
Parametric
Summary Table
Bar Chart
Pie Chart
Pareto Diagrams
Ordered Array
Frequency Distribution
Stem-and-Leaf Display
Histogram
Scatter Plot
Side-By-Side Bar Chart

2.1 Data Organization

In Chapter 1, we discussed the importance of data in Business Statistics and how you, as a business professional, have a
number of methods at your disposal to collect this data. From marketing research to human resources, statistics is a tool to
understand business needs better. Organizing your data is the first challenge in presenting it.

To organize your data, you must categorize data properly. Observations, whether they are measures, tests, or counts, are the
bases of your data categories. Most of your statistical observations will involve numerical data of some kind; for example, the
number of customers purchasing a product or the number of hours worked by part time staff. This numerical data will be
based on observations that you have recorded or on data that others have recorded.

No matter who records these observations, it is important that you categorize and organize data in a way that makes them
meaningful in making business decisions. Before you attempt any analyses, organizing your data into categories is the first
step.

Assume that your company is introducing a new power drill into the market. Your challenge is to show that your drill will
outperform others on the market in a number of ways. How would you do that? The main way is to conduct observations and
gather data that show the drill’s performance.

The first step is to categorize your observations into different data types. These types are as follows:
Nominal Data

Nominal data uses numbers to indicate a classification. For example, you might have two groups of people who choose to try the drill in your store display: single men and married men. When you are gathering data on the number of people who tried the drill, you would categorize (or code, as it is known in statistics) them as such:

Table 2.2.1-1 Data Classification into Nominal Categories

In the category of Other, you could include such groups as women or adolescent boys. Nominal data provides general
categories for your observations and is especially useful in marketing and demographics studies. For example, knowing who
purchases or uses your products and services is extremely important in any marketing campaign. A good business
professional understands that product placement and advertising depends on the demographic of the consumer. Stores and
product manufacturers understand that power drills and other repair equipment are largely used by men. However, this
doesn’t mean that only men will purchase them as many women are likely to buy them for male relatives on their birthdays
or during holidays.
In the example of the power drill, your research results may appear as follows:

Table 2.2.1-2 Number of Customers Who Enter Store between 11 and Noon

As you can see in Table 2.2.1-2, married men (categorized in your data observation as 0) were much more likely to try the power drill at the store display. They are followed by single men, while all others are much less likely to try the drill. In fact, the data shows that 5 out of 6 married men tried the drill while only 1 out of 6 in the Other category did.
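The coding described above can be sketched in Python. The observation records are hypothetical, constructed to match the 5-of-6 and 1-of-6 figures in the text:

```python
# Nominal coding, as in the text: 0 = married men, 1 = single men, 2 = other
codes = {0: "Married men", 1: "Single men", 2: "Other"}

# Hypothetical store observations recorded as (category code, tried_drill),
# where tried_drill is 1 if the person tried the drill and 0 if not
observations = [(0, 1), (0, 1), (0, 1), (0, 1), (0, 1), (0, 0),
                (1, 1), (1, 1), (1, 1), (1, 0), (1, 0), (1, 0),
                (2, 1), (2, 0), (2, 0), (2, 0), (2, 0), (2, 0)]

for code, label in codes.items():
    group = [tried for c, tried in observations if c == code]
    print(f"{label}: {sum(group)} of {len(group)} tried the drill")
```

Notice that the codes 0, 1, and 2 carry no numerical meaning; they are labels only, which is exactly what makes this data nominal.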

Ordinal Data

Ordinal numbers or data are concerned with rankings – first, second, third, fourth, and so on. In the example of the power
drill, you might be interested to know how this new drill compares to the other 4 types of drills in that price category. To
provide a basis for comparison, you choose weight. You test all the drills and find the new drill is the lightest and for that you
give it the highest ranking of 1.

Remember that ordinal data provides a straightforward general comparison amongst products that is slightly more
informative than nominal data. Despite this, it is not very detailed and may not provide all the information that you need to
make effective business decisions. You must remember the last time you heard a toothpaste commercial on television that
claims a certain brand is the number one toothpaste chosen by dentists for their patients. The challenge in hearing this
ranking is to find out the basis of comparison. Organizing your data based on rankings is one way to prepare it for
presentation. However, to get more detailed information, you will need to use interval and ratio data.

Interval And Ratio Data

Interval and ratio data are called parametric data, and they are the basis of the statistical techniques that we will cover in detail in coming chapters. Unlike nominal and ordinal data, these two types of data organization allow specific measures of comparison between observations. For example, saying that your new power drill is ranked number one in lightness means very little unless you can quantify the differences amongst the top 5 drills. Interval data allows for this type of quantification, as seen in Table 2.2.3-1.

Table 2.2.3-1 – Example of Interval Data


You will notice that although your newest power drill (A) is the lightest of the five drills in its price category, it is only 0.05 pounds lighter than its closest competitor. Interval data also allows for a more detailed examination of your observations through the calculation of an average (total divided by the number of observations) as follows:

What does this average tell us about our observations? It gives us a general sense of how our data is dispersed (which we will cover in greater detail in Chapter 3).

Ratio data is similar to interval data in that it provides more detailed information on how observations differ; however, this type of data is measured on an absolute scale with a zero included so that the comparison is more meaningful.
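As a sketch, the average in the drill example can be computed directly. The weights below are hypothetical, chosen so that drill A is the lightest by 0.05 pounds as described in the text:

```python
# Hypothetical weights (lbs) for the five drills in the price category;
# drill A is the lightest, 0.05 lbs ahead of its closest competitor (B)
weights = {"A": 3.10, "B": 3.15, "C": 3.40, "D": 3.55, "E": 3.80}

# Average = total divided by the number of observations
average = sum(weights.values()) / len(weights)
print(round(average, 2))  # 3.4
```

Because these are interval measurements, the 0.05-pound gap between A and B is a meaningful quantity, not just a ranking.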

Presenting Your Data

There are different ways to present your business data depending on its type. For example, nominal data is referred to as categorical data, while interval/ratio data is referred to as numerical data.

Categorical Data

There are two ways to present categorical data: tables and graphs.
Tabulating Data

A summary table is used to display categorical data. Assume that you are a Manager at a local cable company. You conduct a
survey of your clients to find out their preference in paying their monthly service bill.

As seen in Table 2.3.1-1, clients preferred ABM and Internet at 30% and 28%, respectively, while the other three methods were
chosen less often. When presenting your data in a summary table such as this one, you can choose to list the categories in the
left column in alphabetical order, or to sort the percentages in the right column in ascending or descending order.

Table 2.3.1-1 Bill Payment Preference
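The percentages in such a summary table come from dividing each category's count by the total number of responses. A minimal Python sketch follows; the raw counts and the three category names other than ABM and Internet are hypothetical, chosen only so that ABM and Internet come out at 30% and 28% as in the text:

```python
from collections import Counter

# Hypothetical survey counts; ABM and Internet are named in the text,
# the remaining categories are assumptions for illustration.
counts = Counter({"ABM": 75, "Internet": 70, "Mail": 40,
                  "Phone": 35, "In person": 30})
total = sum(counts.values())  # 250 responses in all

# Summary table: each method's share of the total, in descending order.
summary = {method: round(100 * n / total, 1)
           for method, n in counts.most_common()}
print(summary)
```

Sorting with `most_common()` gives the descending order mentioned above; iterating over `sorted(counts)` instead would give the alphabetical ordering.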

Graphing Data

Both bar and pie charts provide a graphical representation of categorical data that is easy to create and informative for
your readers. In a bar graph, the length of the bars is proportionate to the percentage or frequency of the data. For
example, the bar graph of the example above will look like this:
Notice that the Y-Axis of your bar graph represents the value of the categorical data while the X-Axis represents
the name/title of each category.

Similarly, the slices in a pie chart are proportionate to the percentage or frequency of the data. For example, the pie chart of
the example above will look like this:

Finally, a Pareto chart is similar to a bar chart in that every bar is proportionate to the percentage or frequency of the data.
However, in a Pareto chart, the data is organized in descending order, which separates the significant 'few' from the trivial
'many'. In other words, if your data shows many observations that are very small but a handful of important, large ones, this
type of chart will emphasize the latter. A Pareto chart also includes a cumulative percentage line that shows how the
categories accumulate toward the total.
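The descending order and the cumulative line of a Pareto chart can be computed directly from the raw counts. In this sketch the defect categories and their counts are invented purely for illustration:

```python
# Hypothetical defect counts for a batch of power drills.
defect_counts = {"Scratches": 8, "Misaligned chuck": 45, "Loose grip": 30,
                 "Dead battery": 12, "Other": 5}

# Pareto ordering: largest category first.
ordered = sorted(defect_counts.items(), key=lambda kv: kv[1], reverse=True)
total = sum(defect_counts.values())

# The cumulative percentage line rises toward 100% as bars are added.
pareto = []
running = 0
for name, count in ordered:
    running += count
    pareto.append((name, count, round(100 * running / total, 1)))
```

The first one or two rows of `pareto` typically account for most of the total, which is exactly the 'significant few' the chart is meant to highlight.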

Numerical Data

While there are many ways to present numerical data, we will concentrate in this chapter on two major types: ordered array
and frequency distribution. An example of an ordered array is a stem-and-leaf chart while an example of a frequency
distribution is a histogram.

A stem-and-leaf display is a way to organize numerical data into chunks (called stems) so that the values within each of these
chunks (called leaves) are lined up as leaves on the right side of a row. For example, assume that you collect information on
the ages of the married and single men who try your new power drill. You select a small sample (12 men) of each category
(Married and Single) and you ask for their ages. After collecting all the data, you enter it into a table such as the one that
follows.

Table 2.3.2-1 Ages (in years) of Married and Single Customers Who Tried the Power Drill
By simply looking at the data, you can notice a couple of things: first, the range of ages for the Married sample is 29-52 years
while that of the Single sample is 18-45 years. This does not, however, give you an ordered array that shows where the ages
lie within these ranges. For such an array, you would need to use a stem-and-leaf display.

In your stem-and-leaf display, the ages of your two groups would be shown as follows:

Figure 2.3.2-1 Age of Men Trying the Power Drill

So, how do we arrive at the figures on the right column labeled ‘Leaf’? If you look at the first row, the Stem is labeled ‘2’ which
corresponds to the decade of the two ages (29, 29) observed in Table 2.3.2-1. The Leaf column on the right
corresponds to the second digit of each age, which is a '9' in both cases.

If you notice, the Stem labeled '2' has two leaves (9 & 9). This indicates that the count of married men in their 20s who tried
the power drill is 2. Can you guess the count of married men in their 30s who tried the power drill? The Stem labeled '3'
indicates these married men in their 30s and to find their count, you simply count the number of leaves or the figure on the
right column which is 7. This stem-and-leaf array shows that more than half of the married men (7 out of 12) who have tried
the power drill were in their 30s.
In the second stem-and-leaf display, we can see that the age array is more evenly distributed among single men. The display
tells us that there are 4 men in their teens, 4 men in their 20s, 3 men in their 30s, and 2 men in their 40s who have tried the
power drill. If we compare the two groups, we will notice from the distribution that the single men tend to be younger than
the married men.

As a business professional, what implications can you draw from these two stem-and-leaf displays about the type of
customer who is showing enough interest in your new power drill? For one, the earlier numbers showed that married men
were much more likely to try the drill than single men. With these displays, you learn another layer of information simply by
organizing your data in a certain way. You have learned that married men in their 30s are more likely to try your new power
drill.
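The grouping behind a stem-and-leaf display is easy to automate. The 12 ages below are hypothetical, but they match the facts given above for the married sample: a range of 29-52, two men in their 20s, and seven in their 30s:

```python
from collections import defaultdict

# Hypothetical ages for the 12 married men (consistent with the text).
married_ages = [29, 29, 31, 32, 33, 35, 36, 37, 39, 41, 47, 52]

def stem_and_leaf(values):
    """Group each value by its tens digit (the stem); the units digits
    become the leaves listed beside that stem."""
    display = defaultdict(list)
    for v in sorted(values):
        display[v // 10].append(v % 10)
    return dict(display)

married_display = stem_and_leaf(married_ages)
# Stem 2 -> leaves [9, 9]; stem 3 -> seven leaves; and so on.
```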

A good manager will make use of that information in his or her marketing campaign, whether it is print or television. While
these age figures are based on a small sample, your observations will likely include a sizable sample from which you can
make conclusions about your larger target population. We will examine in later chapters how you can choose the right
sample from which to collect your business data.

Histograms

Table 2.3.2-2 Monthly Power Drill Sales in 2008


As we have discussed, you can record your observations in a frequency table. For example, assume that after the launch of
your new power drill, you track the total number of sales over a one year period across all regional stores. You can collect
data on sales figures on a daily, weekly, monthly, or yearly basis. If you record your monthly sales into a frequency table, it
will resemble Table 2.3.2-2.

Based on this frequency table, a number of charts and graphs can be prepared including a Histogram such as this one:
Similarly, you can develop a histogram of the same data that is grouped in quarterly sales figures by designating Q1 to be the
period of January – March, Q2 to be the period of April – June, and so on as follows:
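Re-binning monthly figures into quarters is a simple regrouping of the frequency table. The monthly sales numbers below are hypothetical stand-ins for Table 2.3.2-2:

```python
# Hypothetical monthly power drill sales for 2008.
monthly_sales = {"Jan": 120, "Feb": 135, "Mar": 150, "Apr": 160,
                 "May": 155, "Jun": 170, "Jul": 180, "Aug": 175,
                 "Sep": 165, "Oct": 150, "Nov": 140, "Dec": 130}

# Group consecutive months three at a time: Q1 = Jan-Mar, Q2 = Apr-Jun, ...
months = list(monthly_sales)
quarterly_sales = {
    f"Q{q + 1}": sum(monthly_sales[m] for m in months[q * 3:(q + 1) * 3])
    for q in range(4)
}
```

Either dictionary can then be handed to a charting tool to draw the monthly or quarterly histogram.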
Cross Tabulating Your Data

In some instances, you will be required to compare two sets of data graphically. For example, you could be asked to compare
the breakdown of married/single men who tried the power drill and their preference of product price over performance. As a
business professional, you have to pay attention to changes in your customers’ preferences over time as that affects decision-
making, marketing, and sales outcomes.

Cross tabulation allows you to compare two sets of observations through a tabular form (contingency tables) and a graphic
form (side-by-side bar chart). Another example of graphic cross tabulation is a scatter plot.

Contingency Tables

A contingency table presents data from two sets of observations or variables. Assume that you conduct a survey of the
customers who try the power drill in the store. In your survey, you ask them to rank Price or Performance as their number one
reason to consider purchasing the drill in the future. Your presentation of results, broken down into Married or Single men,
will be as seen in Table 2.4.1-1.

Table 2.4.1-1 Customers Preference (Price vs. Performance)

As you can see, there are a total of 290 customers who fill out the survey. Of the 190 married men who try out the power drill, 134
say that price will be the number one determinant of the possibility of buying the product in the future while 56 choose
performance. Of the 100 single men who were surveyed, approximately 2/3 (a percentage similar to the married men) say
that price will be the number one determinant of the possibility of buying the product in the future.

Following analysis, these results indicate that while your previous results showed that your married customers were slightly
older than your single customers, they are both likely to buy the power drill in the future based on the price.
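The married figures (134 price, 56 performance) are given above; the single men's split is stated only as 'approximately 2/3', so the 67/33 breakdown below is an assumption. With that caveat, the contingency table and its row percentages can be computed as:

```python
# Contingency table for Table 2.4.1-1. The single men's 67/33 split is
# an assumption based on the "approximately 2/3" figure in the text.
table = {
    "Married": {"Price": 134, "Performance": 56},
    "Single": {"Price": 67, "Performance": 33},
}

row_totals = {group: sum(prefs.values()) for group, prefs in table.items()}
grand_total = sum(row_totals.values())  # 290 surveyed customers

# Fraction of each group that ranked price first.
price_share = {group: round(prefs["Price"] / row_totals[group], 2)
               for group, prefs in table.items()}
```

Comparing `price_share` across rows (0.71 vs. 0.67) is what shows the two groups' preferences are roughly similar despite the different group sizes.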

Side-By-Side Bar Chart

Given these results, you can graphically display the results in your contingency table as shown below. The goal behind such a
chart is to give a simple yet engaging representation of the importance of price over performance for your two sets of
categorical variables, single and married.

It is important to remember that the purpose of a side-by-side bar chart is to offer a clear graphic that illustrates the
comparison between your variables. From the outset, it is apparent that the proportion of single to married men who prefer
price and those who prefer performance is approximately similar. So, while the overall number of married men who filled out
the survey was greater than the single men, their general preferences were similar.

Scatter Plots

The final cross tabulation method we will cover is the scatter plot. A scatter plot is a graphic representation of data based on paired
observations. For example, if you are giving out a survey to a random group of people who apply to work in your company,
you may choose to pair their level of education (counted as number of years beyond a high school education) and years of
experience in their chosen fields. For example, someone with a one year diploma will be marked as ‘1’ while someone with a
PhD will be marked as ‘10’.

A scatter plot places one variable or observation on the vertical axis while placing the other variable on the horizontal axis.
The main purpose of plotting these two variables against one another is to discover the relationship between them. Assume
that approximately 30 people have completed applications to join your office team. As the Manager, you wish to plot level of
education against years of experience, with the Y-Axis representing Level of Education (in Yrs.) and the X-Axis representing
Years of Experience (in Yrs.).

Table 2.4.3-1 Applicant Survey that Shows Level of Education vs. Years of Experience
The main purpose behind the scatter plot is to visually display the relationship between the two variables. Although we will
cover relationships between variables in greater depth in coming chapters, the scattering of data points shows the strength
of the relationship. With variables that are very closely related, the points are scattered very closely together.
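Before plotting, the paired observations are simply two equal-length lists. The eight applicant pairs below are hypothetical; the sample covariance computed from them is one quick numeric check of whether the two variables rise together (relationships between variables are covered in later chapters):

```python
# Hypothetical paired observations: years of post-secondary education
# and years of experience for eight applicants.
education = [1, 2, 2, 3, 4, 4, 6, 10]
experience = [2, 3, 4, 4, 6, 7, 8, 14]

n = len(education)
mean_x = sum(education) / n
mean_y = sum(experience) / n

# Sample covariance: positive when the variables tend to rise together.
covariance = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(education, experience)) / (n - 1)
```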
2.2 Summary of Key Concepts

 Observations, whether they are measures, tests, or counts, are the bases of your data categories.
 It is important that you categorize and organize data in a way that makes them meaningful in making business decisions.
 Nominal data uses numbers to indicate a classification.
 Ordinal numbers or data are concerned with rankings – first, second, third, fourth, and so on.
 Interval and ratio data are called parametric data and they are the basis of statistical techniques.
 Both bar and pie charts provide a graphical representation of categorical data.
 A Pareto chart is similar to a bar chart in that every bar is proportionate to the percentage or frequency of the data.
 A stem-and-leaf display is a way to organize numerical data into chunks (called stems) so that the values within each of these
chunks (called leaves) are lined up as leaves on the right side of a row.
 Cross tabulation allows you to compare two sets of observations through a tabular form (contingency tables) and a graphic
form (side-by-side bar chart).
 A scatter plot is a graphic representation of data based on paired observations.

Glossary of Terms

Bar chart: A bar chart is a graphic way of summarizing a set of categorical data.

Categorical data: Values or observations that can be organized into discrete categories such as female and male.

Descriptive statistics: Statistics involved in collecting, summarizing and analyzing raw data.

Frequency table: A frequency table is a way of summarizing a set of observations or data.

Histogram: A histogram is a graphic way of summarizing and presenting data.

Nominal data: A set of observations or data that can be organized into distinct labels.

Ordinal data: A set of observations or data that can be ranked according to certain criteria.

Pareto chart: A graphic display of data that includes both bars and a line graph.
Scatter plot: A graphic representation of data based on paired observations.

2.3 Chapter Review Questions

Descriptive Questions

 List and describe the different Data types.


 List and describe the two (2) ways to present categorical data.
 What is parametric data? Provide at least two (2) examples of parametric data.

Multiple Choice Questions

Mark the correct answer.

1: The following are data types EXCEPT:


a) Nominal data
b) Interval data
c) Ordinal data
d) Rational data

2: Nominal data uses _____________ to indicate a classification.


a) Numbers
b) Pie graphs
c) Bar graphs
d) Tables

3: A good business professional understands that product placement and advertising depends on the _____________ of the
consumer.
a) Exposure
b) Demographic
c) Need
d) None of the above

4: What are Ordinal numbers or data concerned with?


a) Variety
b) Rankings
c) Graphs
d) Charts

5: Which of the following is referred to as categorical data?


a) Rational data
b) Nominal data
c) Interval data
d) Ordinary data

6: When graphing data, you can use all the following methods EXCEPT:
a) Pareto diagrams
b) Bar charts
c) Summary tables
d) Pie charts

Answer Key

1-d, 2-a, 3-b, 4-b, 5-b, 6-c


References

Anderson, D.R., Sweeney, D.J., and Williams, T.A. (2007). Statistics for Business and Economics. South-Western College
Publishing. 10th Edition.

Freedman, D., Purves, R., and Pisani, R. (2007). Statistics. W.W. Norton & Company. 4th Edition.

CHAPTER 3

3 Measures of Central Tendency and Dispersion

Overview

In statistics, you are most concerned about gathering useful data and interpreting it in a way that helps in business decision-
making. Measures of central tendency and dispersion provide you with a clearer understanding of what your data means.

Assume that you are a city planner tasked with providing the local government with information on city inhabitants. You are
specifically asked to comment on areas of high population. You are also asked to comment on the number of suburbs and their
distance from the city core.

Think of the areas of high population as measures of central tendency. In your statistical analysis, central tendency points to
the clustering of your data around a typical or a central value. Measures of dispersion, on the other hand, display the scatter
of your data. The number and reach of city suburbs are the scatter in your data.

In this Chapter, we will explore the clustering and scattering of business data. These are referred to as Measures of Central
Tendency and Measures of Dispersion.

The objectives of this chapter are as follows:

 To understand measures of central tendency.
 To understand which measures of central tendency are appropriate to use.
 To understand measures of dispersion.
 To understand which measures of dispersion are appropriate to use.

Keywords

Mean
Median
Mode
Range
Standard Deviation
Variance
Outlier Data Points

3.1 What is Central Tendency?

In Business Statistics, central tendency is the extent to which all your observations of an event center on a typical or central
value or range. A very common central tendency that is used in everyday life is a mean or average, as it is widely known. For
example, a car dealership owner may notice that the average age of the buyers of his new sports vehicles is 32 years. Other
values whose central tendency you may be interested in include:
 Mean income of potential customers
 The midway point in annual sales of 20 national stores
 Most common brand of detergent used by stay-at-home parents

Assume that you are the owner of a car dealership. Since the launch of a new sports vehicle in the past week, you have
gathered data on the customers who have purchased a vehicle. As seen in Table 3.2.1-1, you have collected the gender (male vs.
female) and age of each of the 15 customers who have bought the vehicle.

Table 3.2.1-1 – Gender and Age of Sports Vehicle Clients

There are three main ways to calculate central tendency from data as follows:

 Mean
 Median
 Mode
Mean

Equation 3.2.1-1 – Calculation of Means
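Written out, the mean described above (total divided by the number of observations) is:

```latex
\bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n}
```

where n is the number of observations and X_i is the i-th observation.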

If we are to calculate the mean of the data in Table 3.2.1-1, we will follow these steps:
Thus, the mean or average age of customers who have bought the new sports vehicle within one week is 32.1 years.

Statistical means are very sensitive to extreme values. In the above example, the average age of the sports
vehicle buyers was 32.1 years, with data ranging from 20 to 42 years. Now assume that this data set has one observation that
is far higher than the range, such as a customer who is 65 years old. Let’s look at what an extreme value or an outlier, as it is
known in statistics, changes your overall mean:

With the addition of only 1 extra observation, the mean age of customers who are purchasing the sports vehicle is now 34.1
years. Furthermore, as you can see from the range of ages, the outlier value of 65 years gives the false impression that
the new sports vehicle is purchased by a wide range of customers, including a middle-aged demographic in their
40s, 50s, and 60s, when the majority of purchasing customers are in their 20s and 30s. As a business professional, you must
be aware of this limitation of means and make sure that you have other measures of central tendency that could provide
better information on potential customers.

The effects of an outlier are especially significant when the sample size is small. For example, assume that you are calculating
the mean of 5 numbers. The first set of five numbers is (1,2,3,4,5) and the second set of five numbers is (1,2,3,4,10). The first
set of numbers is close in values while the second set of numbers has an outlier (10) which is much higher than its closest
value. The means of the two sets are significantly different at 3 and 4: the single outlier (10) raises the mean by
one-third.
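The two five-number sets above can be checked in a couple of lines:

```python
# The two sets from the text: identical except for the outlier.
without_outlier = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 10]

def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)

# The single outlier (10) pulls the mean up from 3 to 4.
print(mean(without_outlier), mean(with_outlier))  # 3.0 4.0
```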
Median

Median is another measure of central tendency that is widely used in business statistics. If all observations or data points are
lined up in an ordered array from the lowest value to the highest value, a median point is at that point where 50% of the data
are below and the other 50% are above. In the case of the data sets seen below, the median of both sets is 3. Unlike the
mean of a data set, a median is not affected by outlier data values.

Can you determine the median of the car dealership customer data in Table 3.2.1-1? To arrive at this value manually, you
must first arrange the data into an ordered array (from the smallest value to the largest value). Remember that you must find
a middle point in this data array whereby half of the data values fall below and the other half are above. There are 15 data
points in this table so the midway point will be the 8th value when they are ordered in an array. To calculate the midpoint of a
data array, use the following equation:

Equation 3.2.2-1 – Calculating the Median Position in an Ordered Data Array
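Written out, the median position in an ordered array of n values is:

```latex
\text{Median position} = \frac{n + 1}{2}
```

For the 15 customers here, (15 + 1) / 2 = 8, so the median is the 8th value in the ordered array.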

Remember that this equation only gives you the POSITION of the median point, the actual VALUE of the point depends on
your data. From this result, we can conclude the median age of the customers who have purchased the sports vehicle in one
week is 34 years. In other words, 50% of the customers were older than 34 while 50% were younger.

In business, you will usually see median as a measure of central tendency used in marketing research. For example, a
company that sells high end furniture may wish to determine the median income of their target population. If the median
income in that population is over $100,000 per year, the company owners and managers will design their marketing
campaigns with the knowledge that 50% of their target population makes more than $100,000.

Similarly, many companies can use median sales figures as performance goals upon which employees and departments are
evaluated. For example, a car dealership manager may use the previous year’s median sales point as a benchmark for
performance in the current year. The sales team may wish to sell 50% of their lot by the same time as they had the previous
year.

Remember that the median point is the middle point when data is arranged in an array from the smallest to the largest value.
As we have seen in the examples above, this middle point is easy to identify when the total data points are odd-numbered. To
calculate the median point in an even-numbered data array, you must take the average of the two middle points in the
ordered data array. The median value in the example below is the average of the two middle points (33 & 34) which gives you
33.5.
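Python's standard library handles both the odd and even cases. The odd-numbered list below is hypothetical apart from its 8th ordered value (34, the median given in the text); the even-numbered list uses the 33 and 34 middle points from the example above:

```python
from statistics import median

# 15 hypothetical ages; the 8th ordered value (34) is the median.
odd_ages = [20, 22, 25, 27, 29, 31, 33, 34, 35, 36, 38, 39, 40, 41, 42]

# 8 hypothetical ages; the median averages the two middle values (33, 34).
even_ages = [25, 28, 30, 33, 34, 36, 39, 42]

print(median(odd_ages), median(even_ages))  # 34 33.5
```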

As a measure of central tendency, the median is a useful addition to the mean of a data set because it is not affected by
extreme value or rare occurrences. For example, in a city hit by a natural disaster, many residents have difficulty finding
employment. Mean salary figures from the city reflect a decrease in the earnings of city residents, given the number of people who
are unemployed. However, the median salary may not be affected as the city's occupational profile remains the same.

It is good practice to provide BOTH mean and median when presenting information about data. If you are promoting new
dishwashing tablets, it enhances the performance of your product to show that the average family needs to replace their box
only once a month. However, your product seems even more efficient if you say that 1 in 2 families (or 50%) can keep their
single box for a period of two months.

Mode

The third measure of central tendency is the Mode. The mode is the most frequent value that is present in your data. Similar
to the median, the mode is not affected by outliers. However, a mode also has a number of qualities as follows:
1. A mode can be determined from both categorical and numerical data. For example, in categorical data for marital status
(single, married, divorced, widowed), you can find out the mode of given data, or which of the four is the most frequent
status. In this example, you survey 10 people and ask them about their marital status, and the results show that 4 out of the 10
are presently divorced. The mode in this set of data, then, is Divorced as it is the most frequently repeated value.

2. In both categorical and numerical data, there may or may not be a mode in your data set, as seen in the example below. In
the first set of values to the left, there are 3 data points that carry a value of 9 each. In other words, this is the value in the
data set that repeats most often. In the second set of values to the right, each data point appears only once. In this
instance, there is no mode in the data.

3. In both categorical and numerical data, there could be more than one mode. For example, assume that the first data set also
contains another value that repeats 3 times. In that case, this set of data will have 2 modes.
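The three situations above — a single mode, no mode, and a mode in categorical data — can be explored with `statistics.multimode`, which returns every value tied for the highest count:

```python
from statistics import multimode

# One mode: the value 9 appears three times.
with_mode = [2, 5, 9, 9, 9, 7, 3, 1]

# No mode: every value appears exactly once, so all values are tied.
no_mode = [1, 2, 3, 4, 5]

# Categorical data: 4 of 10 respondents are divorced, as in the survey
# example above (the other three counts are assumptions).
status = ["Single"] * 3 + ["Married"] * 2 + ["Divorced"] * 4 + ["Widowed"]

print(multimode(with_mode), multimode(status))  # [9] ['Divorced']
```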

So, which of the three measures of central tendency will you most likely use in your day-to-day data analysis? These are
useful guidelines:

1. The mean is generally used to provide an idea about the middle point in your data set. Means are sensitive to outliers so
you must be careful to examine the data before calculation.
2. The median of your data set is used often and complements the mean, as it is not sensitive to the presence of outlier data
points. For example, median home prices are often more useful than mean home prices.
3. For many of your reports, you will need to include both the mean and median in your data analysis and presentation to
give an idea on how the data cluster around a central value. The mode is also important in cases where you need to find out a
repeating value that could prove to be significant for other reasons.

Measures of Dispersion

So far, we have discussed in depth measures of central tendency that show how your data points are clustered around a
typical or central value. In Business Statistics, you will be interested in not only how your data clusters but how it scatters as
well. Assume that you are tasked with studying crime patterns in the city. You will be asked to comment on those areas where
violence is clustered and happens frequently. Measures of central tendency will shed light on how the crime data cluster.

You may be interested in how far crime reaches into the city. Your analysis will examine how crimes are scattered into far
reaching neighborhoods and suburbs. This type of analysis is referred to as Measures of Dispersion. The figure below shows
two distributions with similar centers; however, you can see from the tails of the curves that they have different ranges or dispersions.

Measures of dispersion give you information on the spread or scatter of your data. In other words, they tell you how far
reaching your data values are. There are three measures of dispersion that we will explain as follows:

 Range
 Variance
 Standard deviation
Range

Range is the simplest way to calculate the dispersion of your data set. To calculate a range, you must determine the
difference between the largest value and the smallest value.

To calculate the range of our car dealership sales data, we will proceed as follows:

What this figure tells us is that the range of ages of customers who have bought the new sports vehicles is 22 years.

Range, however, as a measure of data dispersion has a number of shortcomings as follows:

1. Range does not indicate how the data is distributed. As seen in this example, both sets of data have 6 values ranging from
4 to 9, so they both have a range of ‘5’. However, the data set on the left is evenly distributed while that on the right is not.

2. Range is also sensitive to outlier values. In the two data sets below, all but the final value are identical. With a single change
in a data value, the range is widely different.
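Both the calculation and its shortcomings are easy to demonstrate. The two six-value sets below are hypothetical, but each runs from 4 to 9 as in the text, so both have a range of 5 despite very different distributions:

```python
def data_range(values):
    """Range: largest value minus smallest value."""
    return max(values) - min(values)

# Dealership example: customer ages run from 20 to 42, a range of 22.
age_range = data_range([20, 42])

# Same range (5), very different distributions (values are hypothetical).
evenly_spread = [4, 5, 6, 7, 8, 9]
unevenly_spread = [4, 9, 9, 9, 9, 9]
```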
Variance

Variance is another measure of data dispersion. It produces a number that takes into account how far every data point differs
from the mean calculated for the entire data set. The formula for calculating variance is as follows:
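Written out, the standard sample-variance formula (note the division by n - 1 rather than n) is:

```latex
s^2 = \frac{\sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2}{n - 1}
```

where X̄ is the sample mean and n is the number of observations.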

Let’s use our car dealership example to illustrate how to calculate the variance in a data set. Remember that n refers to the
number of observations or data points, which in this case is 15 (the original sample, before the outlier was added).

Also, X refers to the mean of the sample as previously calculated:

Table 3.3.2-1 Calculation of Variance


When calculating the variance as seen in the above equation, you must subtract 1 from n as follows:
The variance of this data, then, is 43.2. While the mean of the data is useful in locating its center, the variance describes its
spread. Furthermore, the variance provides a value that describes the spread around the mean as we will see in the next
section.
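As a sketch, the same computation on a small made-up data set (not the dealership data) looks like this:

```python
def sample_variance(values):
    """Sample variance: sum of squared deviations from the mean,
    divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

# Worked check: the mean is 3, the squared deviations are 4+1+0+1+4 = 10,
# and 10 / (5 - 1) gives a variance of 2.5.
print(sample_variance([1, 2, 3, 4, 5]))  # 2.5
```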

Standard Deviation

While the mean of a data set gives you information on its center when distributed, it does not give you information on how
data is scattered around the mean. Standard deviation is the most common measure of dispersion and provides information
about the distribution of the data around the mean.

When the data values are close together and the bell-shaped curve is steep, the standard deviation is expected to be small.
When the data values are widely dispersed and the bell curve is relatively flat, that shows that you have a fairly large standard
deviation.

Standard deviation is calculated by taking the square root of the variance through this formula:

Standard Deviation Equation:
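Written out, the standard deviation is the square root of the variance:

```latex
s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2}{n - 1}}
```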


To calculate the standard deviation from our variance example, you must follow this step:

Remember that the standard deviation carries the same unit as the original data values; in this case, the standard deviation is
6.6 years. So what does this standard deviation value say about the data set? Let us look at what a
bell curve distribution with a mean will typically look like:

As you can see, the middle point of the data is the mean value (X) and the rest of the values fall to the right and left of this
point. Within one standard deviation on the right side of the mean, you will find 34.1% of the data. Within one standard
deviation on the left side, you will also find another 34.1% of the data.

This means that within one standard deviation to the right and left side of the mean, you will find 68.2% of the values in your
data set. Similarly, another standard deviation on the right side adds 13.6% of the data and the same percentage to the left
side. What does this mean for our car dealership example? Let’s determine how standard deviation explains the dispersion
around the mean as follows:
To calculate the dispersion of data around the mean for a single standard deviation, you must:

1. First step, you add 6.6 yrs to the mean as follows


= Mean + 6.6 yrs
= 32.1 + 6.6 yrs = 38.7 yrs.
According to the graph above, this figure means that 34.1% of the customers who purchased a sports vehicle are between the
ages of 32.1 yrs and 38.7 yrs.

2. Second step, you subtract 6.6 yrs from the mean as follows:
= Mean - 6.6 yrs
= 32.1 - 6.6 yrs = 25.5 yrs.
This figure means that another 34.1% of the customers who purchased a sports vehicle are between the ages of 25.5 yrs and
32.1 yrs.

3. Third step, you determine that 68.2% of customers who purchased a sports vehicle (34.1% + 34.1%) are between the ages
of 25.5 yrs and 38.7 yrs. This is also the data scatter of one standard deviation around the mean. What if you wanted to know
how many of the data values fall within 2 standard deviations of the mean? You must follow the same previous steps
for 1 standard deviation in calculating the ages on both the right and left sides of the mean on the bell curve.

1. First step, you add 13.2 yrs to the mean as follows:


= Mean + 13.2 yrs = 32.1 + 13.2 yrs
= 45.3 yrs.
According to the graph above, this figure means that 47.7% (a figure calculated by adding 34.1% and 13.6%) of the
customers who purchased a sports vehicle are between the ages of 32.1 yrs and 45.3 yrs.

2. Second step, you subtract 13.2 yrs from the mean as follows:
= Mean - 13.2 yrs = 32.1 - 13.2 yrs =
18.9 yrs.
This figure means that 47.7% of the customers who purchased a sports vehicle are between the ages of 18.9 yrs and
32.1 yrs.

3. Third step, you determine that 95.4% of customers who purchased a sports vehicle (47.7% + 47.7%) are between
the ages of 18.9 yrs and 45.3 yrs. This is also the scatter of two standard deviations around the mean.

As you can see from these calculations, standard deviation as a measure of dispersion is an excellent complement to
the mean of a data set. In day-to-day business decision-making, you need to determine the central value around
which your data is clustered, but you also need to determine how the individual data values scatter or disperse
around this central value. These are the functions of the mean and the standard deviation, respectively.
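All of the steps above can be reproduced from just the variance (43.2) and the mean (32.1 years) given in the example:

```python
import math

variance = 43.2
mean_age = 32.1

# Standard deviation: square root of the variance, about 6.6 years.
sd = round(math.sqrt(variance), 1)

# Roughly 68.2% of values fall within one standard deviation of the
# mean, and roughly 95.4% within two (assuming a bell-shaped curve).
one_sd = (round(mean_age - sd, 1), round(mean_age + sd, 1))
two_sd = (round(mean_age - 2 * sd, 1), round(mean_age + 2 * sd, 1))
print(one_sd, two_sd)  # (25.5, 38.7) (18.9, 45.3)
```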

It is important to remember the following about measures of dispersion:

1. The more your data values are spread out, the greater your range, variance and standard deviation.
2. The more your data are clustered closely, the smaller your range, variance and standard deviation.
3. If all the values in your data set are the same, there will be no variation and measures of data dispersion will be
zero.

3.2 Summary of Key Concepts

 Measures of central tendency and dispersion provide you with a clearer understanding of what your data means.
 In your statistical analysis, central tendency points to the clustering of your data around a typical or a central value. Measures
of dispersion, on the other hand, display the scatter of your data.
 A mean is the most common measure of central tendency and it involves the addition of observation values and dividing them
by the total number of observations.
 An outlier is an extreme data value that could affect the mean.
 A median point is a measure of central tendency whereby 50% of the data fall below and the other 50% are above.
 The mode is the third measure of central tendency and the most frequent value that is present in a data set.
 The median of a data set complements the mean and it is not sensitive to the presence of outlier data points.
 Measures of dispersion give you information on the spread or scatter of your data.
 Range is the simplest way to calculate the dispersion of your data set. To calculate a range, you must determine the difference
between the largest value and the smallest value.
 Variance is another measure of data dispersion which takes into account how far every data point differs from the mean
calculated for the entire data set.
 Standard deviation is the most common measure of dispersion and provides information about the distribution of the data
around the mean.

Glossary of Terms

Mean: The most common measure of central tendency; calculated by adding the observation values and dividing the sum by the total number of observations.

Median: A measure of central tendency whereby 50% of the data fall below and the other 50% are above.

Mode: The most frequent value that is present in a data set.

Outlier: An extreme data value that could affect the mean.

Range: The difference between the largest value and the smallest value in a data set.

Standard deviation: The square root of variance and a measure of the scatter of data around the mean.

Variance: A measure of data dispersion which takes into account how far every data point differs from the mean calculated for the entire data set.

3.3 Chapter Review Questions

Descriptive Questions

 A data set has a number of outlier points. Define ‘outlier’. Which of the three measures of central tendency is affected by outliers? Explain.
 How do the mean, median, and mode compare in describing the central tendency of data?
 What is the statistical formula for variance? How does variance influence the calculation of standard deviation?

Multiple Choice Questions

Mark the correct answer.

1: An outlier data point affects the following measure:


a)Mode
b)Median
c)Variance
d)Mean

2: As a measure of central tendency, the median point is:


a)Located at the midway point of outlier points
b)Located at the midway point of the upper 50% and lower 50% of data
c)Located at the midway point of the most frequent data
d)Located at the sum of differences between the mean and individual data points

3: In which of the following data sets is the mode 100:


a)100, 150, 100, 100, 250
b)100, 200, 200, 400, 500
c)100, 250, 350, 400, 400
d)0, 100, 100, 200, 500

4: Your car sales figures across a number of dealerships show a range of 1200 cars. You have calculated the range by:
a)Subtracting the lowest value from the mean
b)Subtracting the highest value from the mean
c)Subtracting the lowest value from the highest value
d)Subtracting the highest value from the median

5: Measures of dispersion show how the data values scatter around:


a)The median
b)The standard deviation
c)The variance
d)The mean

6: The percentage of data scattered around the mean at one standard deviation is:
a)68.2%
b)50% above the mean
c)50% below the mean
d)95.4%

Answer Key
1-d , 2-b , 3-a , 4-c , 5-d , 6-a

References

Anderson, D.R., Sweeney, D.J., and Williams, T.A. (2007). Statistics for Business and Economics. South-Western College Publication. 10th Edition.

Freedman, D., Purves, R., and Pisani, R. (2007). Statistics. W.W. Norton & Company. 4th Edition.

CHAPTER 4

4 Introduction to Probability

Overview

Probability is the chance that an uncertain event will happen. In business, you will need to understand how probable it is that an event or a phenomenon can happen that could affect your productivity, profits, and general position in the market. For example, assume that you are producing microchips for a national computer hardware producer. You may want to know how probable it is to have production defects during a given period of time.

Or you might want to determine whether an economic recession is probable in the next fiscal year and how it would affect your sales figures. While it is not possible to answer these questions with 100 percent certainty every time, it is possible to calculate their probability. Understanding and determining probabilities is essential in effectively using data to make business decisions.

It is important to remember that with statistics, in general, you are attempting to deduce larger implications from given data. With probability, on the other hand, you are starting from larger implications to examine their impact on your business question, as seen in the figure below.

In this Chapter, we will explore probability theory which is also known as the ‘law of chance’.

The objectives of this chapter are as follows:


 Probability concepts
 Visualizing probability events

Keywords

Probability
Prior Knowledge
Empirical Evidence
Subjective Experience
Contingency Table
Decision Tree

4.1 Probability Concepts

Assume that you toss a coin in the air. What is the chance that you will get a ‘tails’? What about a ‘heads’? You can determine the probability of an event as simple as a coin toss by understanding the possible outcomes. Before we look at this coin toss, let us examine some basic probability concepts as follows:

 Probability: The chance that an uncertain event will happen (probability = between 0 and 1)
 Impossibility: The chance that an event will NEVER happen (probability = 0)
 Certainty: The chance that an event will ALWAYS happen (probability = 1)

In today’s business settings, probability calculations are performed in sales and profit projections, real estate profit estimates, and similar activities. There are three ways to assess probability as follows:
Prior Knowledge

Determining probability based on prior knowledge is often used in day-to-day business decision making.

For example, a paper printing machine in a factory is known to produce 995 correct color compositions for every 1000 papers produced. In this case, the probability of producing a correct composition is:

P(correct composition) = 995/1000 = 0.995

Similarly, we can determine the probability of producing a paper that does NOT have the correct color composition as follows:

P(incorrect composition) = 5/1000 = 0.005
When calculating probability, you must divide the number of ways that an event can occur with the total number of possible
outcomes. By having previous knowledge about the paper printing machine, you can determine the probability of producing
the correct or incorrect color composition paper.
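The printing-machine probabilities above can be sketched as a small helper; the figures (995 correct papers per 1000) come from the example, while the function name `probability` is an illustrative choice.

```python
# Basic probability: ways an event can occur divided by all possible outcomes.
def probability(favorable: int, total: int) -> float:
    if total <= 0 or not 0 <= favorable <= total:
        raise ValueError("need 0 <= favorable <= total and total > 0")
    return favorable / total

p_correct = probability(995, 1000)           # 0.995
p_incorrect = probability(1000 - 995, 1000)  # 0.005

print(p_correct, p_incorrect)  # 0.995 0.005
# The two complementary probabilities always sum to 1.
```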

Let’s return to the example of the coin toss. What are the chances that you will receive a ‘heads’ after one toss?

 First step is to determine the number of ways that the event can occur and the total number of outcomes. Since a coin has only
2 faces, you can get a ‘heads’ once with every toss and the total number of outcomes is 2 (heads, tails).
 Second step is to calculate the probability of receiving a ‘heads’ after one coin toss according to the following formula:

P(heads) = 1/2 = 0.5

This result indicates that the probability of getting a ‘heads’ with a single coin toss is 0.5 (remember that probability is always between 0 and 1). Think of this probability as a proportion, ½. Another way of stating this chance of occurrence is to say that
there is a 50% chance of a ‘heads’ after a single coin toss. Remember that we assume that all outcomes (heads or tails) have
the same probability of occurrence.

If they do not, then this calculated probability is not correct. For example, if the coin is somehow damaged or tampered with
to favor one side over the other when it is tossed in the air, then the probability of a ‘heads’ or ‘tails’ will not equal 0.5.

It is important to note that in probability, there is something termed an Independent Event. Such events are outcomes that
are not affected by other outcomes. For example, if you toss the same coin in the air a second time, the probability of getting
a ‘heads’ is also 0.5 and not affected by the previous toss.
Similarly, assume that you are rolling a die. What is the probability of it settling on any one of its six faces? Assuming that all six outcomes (1, 2, 3, 4, 5, 6) have the same probability of occurrence, we can calculate the probability of getting a ‘1’, as an example. Let’s call the probability of getting a ‘1’, A, as follows:

P(A) = 1/6 ≈ 0.167

What this figure tells us is that there is one outcome out of the six possible outcomes for every roll of the die. Can you determine the probability of getting a ‘4’? Remember that two rolls of a die are independent events, which means that the second roll is independent of the first roll. Let’s call the probability of getting a ‘4’, B, as follows:

P(B) = 1/6 ≈ 0.167

The above examples illustrate occurrences for which you have prior knowledge of the odds. Assuming that there is an equal chance of occurrence, we know the odds of a single outcome when rolling a die or tossing a coin. In most business decisions, however, the odds of an event occurring are not so straightforward. They usually must be calculated from given information, as we will see in the next section.

Empirical Evidence

Determining probabilities from collected evidence follows the same principles. An important distinction between prior knowledge and empirical evidence is that you may not always know if events have an equal chance of occurrence in everyday situations. In a business setting, you will have to make assumptions about the equal likelihood of events occurring. In Table
4.2.2-1, we have the recycling patterns of males and females in a given business office.

Table 4.2.2-1 Recycling Patterns by Gender


Let’s determine how to calculate the probability of selecting a Female who does not recycle from the population illustrated
in the table.

The probability of selecting a female who does not recycle from this population is 0.18 or 18%. Similarly, the probability of
picking a Male (who either recycles or does not recycle) is calculated as follows:

The probability of selecting a male from the above population is 0.49 or 49%. In both instances, we have to assume that all
events have the same probability of occurrence.
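Since the recycling table itself is not reproduced here, the counts below are hypothetical, chosen only to be consistent with the two probabilities stated in the text (0.18 and 0.49). The calculation pattern, dividing a cell or row total by the grand total, is the one described above.

```python
# Hypothetical recycling observations for 100 employees; only the two
# resulting probabilities (0.18 and 0.49) are taken from the text.
counts = {
    ("male", "recycles"): 30,
    ("male", "does_not"): 19,
    ("female", "recycles"): 33,
    ("female", "does_not"): 18,
}
total = sum(counts.values())  # 100 observed employees

p_female_no_recycle = counts[("female", "does_not")] / total
p_male = (counts[("male", "recycles")] + counts[("male", "does_not")]) / total

print(p_female_no_recycle)  # 0.18
print(p_male)               # 0.49
```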

Subjective Evidence

A consequence of extensive business experience and expertise is the development of business acumen. Highly successful businessmen such as Bill Gates and Steve Jobs are known for their business sense or acumen. They have built multibillion-dollar companies through gained insights and experience.

In many day-to-day business decisions such as the launch of a new product or entry into a new market, a business
professional does not have enough information to calculate the probability of an event occurring. In those cases, the
following are important in making business decisions in the absence of information:

1. Prior experience

Prior experience allows a business professional to estimate the probabilities of success when information is not complete.
For example, introducing a new product into select Asian markets for a North American producer can be a challenge. In
addition to development, testing, and distribution challenges, there are other social and cultural ones. These intangible
challenges make it difficult for the business professional to determine the probabilities of market penetration. Instead, he or
she must rely on past experiences with similar markets to make business decisions.

2. Situational analysis

Situational analysis can also provide insights into business decision making when determining probabilities is difficult. An example of a frequently used situational analysis is what is termed a SWOT (Strengths, Weaknesses, Opportunities, and Threats) analysis. If you are conducting a SWOT analysis when introducing your new technology product to markets in China,
for example, you may discover the following:

Strength: The product is new to the market so competition is minimized

Weakness: The product is initially high priced for the majority of its intended customers

Opportunity: The new product is desired among middle class, technologically savvy college students.

Threat: A rivaling company is intending to launch a similar but cheaper product in the coming year.

As a business professional, this type of situational analysis can guide your decision making, especially when examining the
probability of success, in the absence of prior knowledge or empirical evidence.

3. Personal opinions

When neither prior experience nor a formal situational analysis is available, a business professional may fall back on informed personal opinion: an educated judgment about the likelihood of an event.

Visualizing Probability Events

In business publications such as annual reports, you may have to present your probability results and their implications
visually. This type of visual presentation is important to engage your readers and communicate your results in a clear, easy to
understand manner.

Contingency Tables

Assume that you are choosing cards from a full deck (52 cards). You are especially interested in the probability of choosing a
Jack from this deck (4 cards: 2 red and 2 black). To display this information visually, you must construct a contingency table
which shows the probability of choosing a Jack of either color every time you choose a card from the deck. The table is
constructed as follows:

Table 4.3.1-1 Contingency Table Example

Assume that you are reaching into a full deck (52 cards), what is the probability that you will choose a Red Jack? To answer this, you must follow the same process of dividing the desired outcome by all possible outcomes (assuming that they have an equal likelihood of occurrence), as we have seen earlier in the chapter:

P(Red Jack) = 2/52 ≈ 0.04

The probability of choosing a red Jack from your deck of cards is approximately 0.04 or 4%. Similarly, what is the probability of choosing a black Jack from only the black cards in the deck? We know that there are 26 black cards in a full deck and that 2 of them are Jacks:

P(Black Jack from black cards) = 2/26 ≈ 0.08
A contingency table shows the relationship between two variables in a graphic manner. One variable is represented in a column while the other variable is represented in a row. With each cell entry, we have an instance where the two happen
together. As we will cover in later chapters, a contingency table examines the relationship between two variables which is also
known as a joint probability occurrence. Values in a contingency table are not based on prior knowledge but on observations.
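The card probabilities worked out just above reduce to one division each. The counts (2 Jacks of each color, 52 cards in a full deck, 26 black cards) come from the example:

```python
# P(red Jack from the full deck) and P(black Jack from the black cards).
RED_JACKS, BLACK_JACKS = 2, 2
FULL_DECK, BLACK_CARDS = 52, 26

p_red_jack = RED_JACKS / FULL_DECK        # 2/52
p_black_jack = BLACK_JACKS / BLACK_CARDS  # 2/26

print(round(p_red_jack, 2))    # 0.04
print(round(p_black_jack, 2))  # 0.08
```

Note that the same kind of event (choosing a Jack) has a different probability depending on the pool of possible outcomes you divide by.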

For example, assume that you are the manufacturer of a brand of television sets. The sets come in three different screen
sizes (small, medium, large). You are interested in assessing the rate of damage to the screens following a production cycle by
recording your observations as seen in Table 4.3.1-2.

Table 4.3.1-2 Damaged vs. Not Damaged Television Sets, by Size

In this example, the variables are as follows:

Variable 1 – Damage or Not Damaged

Variable 2 – Screen size (Small, Medium, and Large)

We determine the probability of the presence of a damaged television set after each production cycle by dividing the number of damaged sets by the total number of sets produced:

P(damaged) = number of damaged sets / total sets produced = 0.04

Similarly, the presence of an undamaged television set after each production cycle is determined as follows:

P(not damaged) = number of undamaged sets / total sets produced = 0.96

While probabilities of damage to television sets give a general idea about production quality assurance, they may not be detailed enough for everyday business decision making. Despite the low incidence of damage to the sets (4%) per production cycle, is the damage rate equal among screen sizes? This is a question that you can answer by analyzing the values in the contingency table further. This analysis generates more insights about damage probabilities within screen sizes.

As seen in Table 4.3.1-3, probabilities of damage by screen size must be calculated in order to gain more information about
production quality assurance. By finding out which of the screen sizes has the highest probability of damage per production
cycle, a manager can implement quality assurance steps to understand why more television sets are being damaged.

To calculate the probability of damage among small screen television sets, you divide the number of damaged small sets by the total number of small sets produced, and repeat the calculation for each screen size:

Table 4.3.1-3 Damage Probability by Screen Size

Small: 0.02 | Medium: 0.05 | Large: 0.06

A deeper analysis of the data shows that small screen television sets are the least likely of the three sizes to have damage (0.02) while the large screen sets are the most likely (0.06). In fact, the large sets are 3 times more likely to show damage than the small sets after a single production cycle. The medium screen sets (0.05) are also 2.5 times more likely to show damage than the small sets. These probability results may prompt the manufacturing manager to follow a course of action that includes:

1. Analyze manufacturing processes to determine the reasons behind the high probability of damage when producing
medium and large television sets.

2. Correct those processes that could contribute to these probabilities such as improving employee skills or replacing old
machines.
Please note that contingency tables are based on observations (empirical evidence) that you collect. When constructing the
tables, make sure that you:

1. Identify the two variables that you wish to study clearly.

2. Place one variable in the columns and the other variable in the rows.

3. Enter those observations or data values where the variables occur jointly. For example, you can track production output (a
variable) against employee attendance (a variable).

4. Calculate probabilities based on the chance of each paired occurrence against all possible outcomes. For example, the
number of damaged television sets is calculated against the total number of sets that are manufactured in a single
production cycle.
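The four construction steps can be sketched for the television-set example. The unit counts below are hypothetical (the original table is not reproduced here); only the method, pairing two variables and dividing each cell count by its row total, comes from the text.

```python
# Hypothetical production observations: sets produced and sets damaged.
produced = {"small": 1000, "medium": 750, "large": 500}
damaged = {"small": 20, "medium": 38, "large": 30}

# Step 3: joint observations, damaged vs. not damaged, by screen size.
table = {
    size: {"damaged": damaged[size],
           "not_damaged": produced[size] - damaged[size]}
    for size in produced
}

# Step 4: probability of damage within each screen size.
for size, cells in table.items():
    p = cells["damaged"] / (cells["damaged"] + cells["not_damaged"])
    print(f"{size}: P(damaged) = {p:.2f}")
```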

Decision Trees

A decision tree is a visual decision making tool that can prove useful across a number of business settings. The main principle behind a decision tree is the presentation and comparison of a number of competing alternatives.

A decision tree can be made from our previous example of choosing a Jack out of a full deck of 52 cards. Similar to the
contingency table, a simple decision tree such as the one below shows possible outcomes from a set of probabilities. For
example, in a full deck, half are black cards while the other half is red. The black cards branch out further into the type of card
(e.g. Jack, Queen, etc.) or its number (8, 9, 10, etc.).

A decision tree, however, does not present outcomes only but it presents the consequences of these outcomes given a
certain business criterion such as cost. In the above figure, we can see the probability of choosing a Black Jack out of the full
deck of 52 cards is 2/52 while out of the black cards is 2/26.

However, a decision tree in business activities is not only used to display information about probabilities. It is also used to assess financial and productivity impacts of these probabilities. Let’s examine our example of television sets. Assume that you are
the manager of the manufacturing company and wish to minimize financial losses due to damaged television sets in the
coming production cycle. While engaged in long term quality assurance processes, you are interested in short term savings.

Your past production cycle resulted in the following number of units and their production costs:

 Small television sets: 1000 units @ $100
 Medium television sets: 750 units @ $150
 Large television sets: 500 units @ $200

The production costs of television sets reflect supplies, machinery, personnel, and all other incidental fees. Given the above figures, the production costs per cycle are as follows:

= (1000 × $100) + (750 × $150) + (500 × $200)
= $100,000 + $112,500 + $100,000
= $312,500

Similarly, you calculate the potential losses per production cycle based on the probabilities of damage to television sets as shown in Table 4.3.1-3. For the small screen sets:

= 1000 × $100 × 0.02
= $2,000

This figure tells you that in every production cycle that results in 1000 small screen television sets, a financial loss of $2,000 is incurred. Similar calculations show that financial losses due to damage to medium and large screen television sets are $5,625 and $6,000, respectively. The cumulative financial loss per production cycle is then $13,625.

To minimize financial losses in the coming production cycle, you propose an alternative production goal and wish to compare it to the past production goal. As the manager responsible for production costs, you suggest that the company shift production towards more small television sets and fewer large sets to reduce the impact of production damage.

As you have empirical evidence about the probability of damage to television sets during the production process, you propose to produce 50% more small television sets and 50% fewer large sets at the same overall cost as follows:

 Small television sets: 1500 units @ $100
 Medium television sets: 750 units @ $150
 Large television sets: 250 units @ $200

Given this alternative production output, you calculate your new predicted financial losses as follows:

Predicted financial loss due to damage to small screen television sets:


= 1500*$100*.02
= $3,000

Predicted financial loss due to damage to large television sets:


= 250*$200*.06
= $3,000

A decision tree that incorporates the alternative production output will detail the predicted financial losses as follows:

The decision tree shows that a change in the production cycle will result in financial savings as follows:

= $13,625 - $11,625
= $2,000
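The comparison the decision tree makes can be sketched directly; the unit costs, damage probabilities, and both production plans are the figures given in this section.

```python
# Expected loss per production cycle = units × unit cost × P(damage),
# summed over the three screen sizes.
COST = {"small": 100, "medium": 150, "large": 200}
P_DAMAGE = {"small": 0.02, "medium": 0.05, "large": 0.06}

def expected_loss(units: dict) -> float:
    return sum(units[s] * COST[s] * P_DAMAGE[s] for s in units)

current = {"small": 1000, "medium": 750, "large": 500}
alternative = {"small": 1500, "medium": 750, "large": 250}

print(round(expected_loss(current), 2))      # 13625.0
print(round(expected_loss(alternative), 2))  # 11625.0
print(round(expected_loss(current) - expected_loss(alternative), 2))  # 2000.0
```

Each additional branch of the tree would simply be another candidate plan passed to `expected_loss`.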

It is important to remember that decision trees are dependent on the accuracy of the information that you are
using. For example, the probabilities of damage to television sets during the production process are assumed to be
equal despite the alternative output.

Furthermore, business decisions to alter production so that you are increasing one product’s output and decreasing another must be based on sound sales and marketing information. In other words, you would increase small television set production only if you knew that the extra units would sell briskly. While you may have information on past probabilities, future decisions depend on comprehensive information about your business. As such, the above decision tree can have many more branches with production output alternatives.

4.2 Summary of Key Concepts

 Probability is the chance that an uncertain event will happen.


 In business, the probability that an event or a phenomenon will happen can affect productivity, profits, and general position in
the market.
 Probability allows for prior knowledge or empirical evidence to influence future business decisions.
 In probability calculations, the number of ways that an event can occur is divided by the total number of possible outcomes.
 In the absence of prior knowledge or empirical evidence, probability can be based on the subjective experience of the business
professional. Subjective experience is built on past experiences, situational analyses, and personal opinions.
 A contingency table shows the relationship between two variables in a graphic manner.
 A decision tree is a visual decision making tool that can prove useful across a number of business settings. The main principle
behind a decision tree is the presentation and comparison of a number of competing alternatives.

Glossary of Terms

Probability: The chance that an uncertain event will happen.

Contingency table: A table that shows the relationship between two variables, each entry is an event where they
occur together.

Decision tree: A graphic representation of alternative business decisions depicted as branches.

4.3 Chapter Review Questions

Descriptive Questions

 What are the three (3) bases of probability calculations? Explain.


 Explain the purpose of a contingency table.
 Contrast ‘statistics’ and ‘probability’. Provide an example from everyday business situations that illustrates this difference.

Multiple Choice Questions

Mark the correct answer.

1: An example of a ‘Situational Analysis’ is determining:


a)Strengths, Weaknesses, Opportunities, Threats
b)Strengths, Weakness, Empirical Evidence
c)Strengths, Weakness, Prior Knowledge
d)Strengths, Weakness, Subjective Experience

2: When calculating probability, an assumption is made that all outcomes have:


a)A greater value than individual outcomes
b)A smaller value than individual outcomes
c)An equal likelihood of occurrence
d)An unequal likelihood of occurrence

3: Values in a contingency table are based on:


a)Prior knowledge
b)Empirical evidence
c)Personal experiences
d)Impossible occurrences

4: A printing machine produces 20 incorrect color combinations for every 3000 papers. The probability of getting a correct combination paper is:
a)20/3000
b)2980/3000
c)2/2980
d)Cannot be calculated from the given information

5: Branches in a decision tree represent:


a)Prior observations
b)Unlikely decisions
c)Personal experiences
d)Alternative decisions

Answer Key

1-a , 2-c , 3-b , 4-b , 5-d.

References

Anderson, D.R., Sweeney, D.J., and Williams, T.A. (2007). Statistics for Business and Economics. South-Western College Publication. 10th Edition.

Freedman, D., Purves, R., and Pisani, R. (2007). Statistics. W.W. Norton & Company. 4th Edition.

CHAPTER 5

5 Research Methods I: Introduction

Overview

Business in the 21st century is intensely competitive, especially given its global nature. Making solid decisions in the face of
this competition is important for a company’s survival and position in its industry. Companies such as Google, Apple, and
Herman Miller have understood this fact in the face of global competition. Therefore, executives in these companies invest in
research and development of products and services on a continual basis. Even when markets are facing recessions and contraction, research is an important investment.

Solid research for successful companies means asking the right questions about their products and services. Designing good
studies to answer these questions is an integral step. We have covered research methods in the previous chapter. We now
understand the difference between quantitative and qualitative research. We also understand the empirical method of study
whereby variables are identified.

In this Chapter, we will expand on the design of studies in business. Specifically, how study design is important in business
decision making.

The objectives of this chapter are as follows:

 Choosing the right study design


 Descriptive study design
 Causal study design
 Business decision-making

Keywords
Exploratory
Descriptive
Causal
Surveys
Questionnaires
Laboratory experiment
Field experiment
Cross-sectional design
Longitudinal design
5.1 Fundamental Study Designs

Typically, there are a number of steps that are undertaken to design a research study. Figure 5.2 shows the steps needed to
design a good business study. This is considered a master plan of investigation. While the researcher can change some
details, this plan provides guidance from the needs assessment to the presentation of the results.

Figure 5.2 Research Study Design Steps

Step 1: Define study information needed

In this step, the researcher defines what information is needed. For example, a marketing specialist may be interested in the
purchasing patterns of elderly shoppers in a district. Purchasing information could be organized by:

 Type (home goods, groceries, etc.)


 Shopping frequency (daily, weekly, monthly)
 Shopping venues (malls, online, etc.), and
 Purchase amounts

Step 2: Design the exploratory, descriptive, and/or causal phases of the research


In this step, the researcher designs one (or more) of the research phases that may be necessary. The marketing specialist
above may choose an exploratory method of gathering information. He or she can design a survey or questionnaire and
distribute them to elderly shoppers. If the company wishes to learn more about new markets beyond the information from
exploratory research, it can use descriptive research.

Step 3: Specify how data will be measured

As we have covered in earlier chapters, data can be gathered and measured in a number of ways. Depending on the nature of
the data (qualitative or quantitative), the business researcher can specify ahead how it will be measured. For example,
demographic data can be measured as follows:

 Age – In years (e.g. exact numbers) or a range (e.g. 18-24). Research subjects will then be given surveys that ask for their exact
age or to choose a range.
 Income levels – In specific dollar amounts or a range (e.g. $24,000-$30,000 per annum).

The choice of data measurement depends on the purpose of the research. It also depends on the nature of the business
problem or challenge being studied.

Step 4: Design surveys and/or questionnaires

In this step, the researcher designs surveys or questionnaires that address all the information that is needed from the study. As we have covered in the Research Methods chapter, an ideal survey or questionnaire is short and written in a simple, easy-to-understand language. This makes it convenient for research subjects to fill out and complete.

Step 5: Identify the sampling frame and sample size

In this step, the researcher identifies the sampling frame from the available/desired population. If a population of target
customers is 250,000 (identified inhabitants of a city), a sampling frame would identify a sample representative of this
population. As seen in the Research Methods chapter, sample sizes depend on the type of business study such as product
testing or marketing.
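The sampling step above can be sketched with the standard library. The 250,000-person frame is from the example; the sample size of 1,000 and the seed are assumptions for illustration only.

```python
# Draw a simple random sample, without replacement, from a sampling frame.
import random

random.seed(7)  # reproducible for this sketch
frame = range(250_000)                  # one ID per inhabitant in the frame
sample = random.sample(frame, k=1_000)  # assumed sample size of 1,000

print(len(sample))       # 1000
print(len(set(sample)))  # 1000 (no inhabitant is selected twice)
```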

Step 6: Outline data analysis plan

The final step in business study design is to outline how the data will be analyzed. Data can be analyzed and
presented in a number of ways. As we have covered in previous chapters, descriptive data can be organized visually in forms such as histograms and line diagrams. More involved presentations can include calculations of central tendency such
as median and mode. If the research design includes causal investigation, more advanced calculations such as
correlation and regression analyses are conducted.

5.2 Business Research

In business, research is conducted to answer questions or solve problems. Research is also conducted to address concerns
and queries that management may have about day-to-day business activities. Some examples of research concerns are:

 How do we market our new household product to a middle-age demographic?


 Are frontline employees satisfied with the new flexible work schedule?
 Is the new car oil in development more efficient than the one on the market?

Conducting research to answer everyday business questions is a systematic, multi-step process. In other words, you must
follow a number of sequential steps to come up with answers to your research concerns and problems. In most instances,
the process will resemble Figure 5.2.1-1 as follows:
Figure 5.2.1-1 Business Research Steps

1. Choose the research questions

A research question is simply a business concern or problem for which you are seeking answers. From marketing to human
resources, business decisions need evidence that they will work. Assume that you are a human resources manager who is
considering introducing a mentoring program for new hires.

Your goal is to find out if the new program reduces the training time needed for these hires. Your research question(s) will
be: Does the new mentoring program reduce the number of days that new hires are in official training? If yes, by how many days?

2. Review current knowledge

In many of your business concerns or problems, you can be sure that someone has looked at the same business problem or
something similar. Businesses and research institutes publish their studies in academic and professional journals. Your task
in this stage of your research development is to review what has been studied so far. For example, as the human resources manager above, you can look at past research on mentorship programs and whether or not they have been successful. Reviewing past research and current knowledge is also important to guide you in developing your research tools, conducting the study, and analyzing the data.

3. Design your study

In this step, you must figure out the HOW of the study. This means you will develop testing tools and design the manner in
which you will collect data. For example, you may choose to develop a questionnaire (a quantitative study) or conduct detailed
interviews with your employees (a qualitative study). In some research studies, you may use a mixed design of quantitative
and qualitative tools to answer your research questions.

4. Conduct the study

In this step, you carry out the actual study and gather your data. This usually involves giving out questionnaires, conducting
interviews, or completing case studies. Once you have gathered your data, you enter it into a computer program for analysis.

In many research studies, this is often the most costly and time-consuming step. Collecting data can take from days to
months, depending on your study design. As the human resources manager investigating mentorship programs, your study is
limited to employees to whom you can give questionnaires during the work day. This task may take a few days or weeks to
complete. However, if you are conducting marketing research by sending out a city-wide survey to individual homes, it will
likely take weeks for you to receive them back.

5. Analyze results and draw conclusions

In this final step of the research process, you analyze the data you’ve collected. The depth of the analysis and the tools that
you will need depend on your research design. For example, quantitative data often requires a computer program to
complete analyses such as measures of central tendency (mean, median, and mode) and dispersion (standard deviation).
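As a minimal sketch of these analyses, Python's standard statistics module can compute each measure directly; the training-day figures below are invented for illustration:

```python
# Measures of central tendency and dispersion for a small,
# invented sample of training days per new hire.
import statistics

training_days = [12, 9, 11, 9, 14, 10, 9, 13]

mean = statistics.mean(training_days)      # average
median = statistics.median(training_days)  # middle value of the sorted data
mode = statistics.mode(training_days)      # most frequent value
stdev = statistics.stdev(training_days)    # sample standard deviation

print(mean, median, mode, round(stdev, 2))
```

With real survey or records data, the same four calls give a quick first summary before any deeper analysis.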

An important aspect of data analysis is making sure that you draw the appropriate conclusion from the observations. These
conclusions are important in answering the initial research questions that prompted the study in the first place.

Formulating Research Questions

Asking the right questions is the basis for effective research in business. The questions must be directly relevant to the
business concern or problem and amenable to a cost- and time-effective study. Let’s examine a business problem and how
appropriate research questions can be formulated to address and solve the problem.

Assume that you are a manager at a luxury resort who is interested in what makes clients satisfied. At present, you offer a number of
promotions to resort guests and are especially interested in finding out if they make your clients more satisfied.

From this scenario, we can develop a number of research questions as follows:

1. Are resort clients satisfied with their overall experience? If yes, what particular aspects, such as customer service or room
view, contribute to their satisfaction?

2. Are resort clients who are receiving a promotion more satisfied than those who are not? If yes, how much more satisfied
are these clients?
It is important to remember that measuring the satisfaction of clients can be done in a number of ways including surveys and
in-depth interviews. The nature of the tools that you use to collect information about client satisfaction determines how well it
answers your research questions. These research tools also impact how well you address your business problem.
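Research question 2 above amounts to comparing group averages. Below is a minimal sketch using invented 1-to-5 satisfaction scores; it is a simple comparison of means, not a formal significance test:

```python
# Hypothetical satisfaction scores (1 = very unsatisfied, 5 = very satisfied)
# for resort clients who received a promotion versus those who did not.
from statistics import mean

promo_scores = [4, 5, 4, 3, 5, 4]
no_promo_scores = [3, 4, 3, 3, 4, 2]

diff = mean(promo_scores) - mean(no_promo_scores)
print(f"Promotion group is more satisfied by {diff:.2f} points on average")
```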

Reviewing Current Knowledge

As mentioned previously, reviewing current knowledge in the area of interest such as marketing, finance, or human
resources, is important in building a research study. Reviewing current studies has a number of important uses that include:

 Examining theoretical concepts that are related to the business problem


 Identifying the variables that cause the business problem (independent) and the variables that are affected by them
(dependent). For example, in our luxury resort the dependent variable is the satisfaction of the resort clients while the
independent variables can be a number of factors such as customer service and room prices
 Selecting the right research study tools
 Choosing the right sample of people on whom to conduct the study
 Choosing the right type of statistical analyses to conduct on the data collected
 Reaching the right conclusions that answer the research questions.

Depending on the nature of your business problem, you may need to review one or more of these resources to determine
what others have found and to design your own study. In most instances, reviewing current knowledge is done through a
number of sources as follows:

 Academic and professional journals


 Books
 Internet websites
 Archives
 Interviews and observations
 Private and public databases

Research Design

Think of research design as the master plan of your research. A good design details how to collect the information that you
need and how to conduct the appropriate statistical analyses after collection. While every business problem has unique
circumstances, there are similarities that allow for similar research designs to be used.

A good research design provides the following benefits to the researcher:

 Addresses business problems and issues before they result in complications such as employee turnover, reduced sales, and
decreasing profits
 Saves time and money in addressing these problems while providing a reliable and valid way of handling them

When completed properly, a research plan based on solid design should include these five (5) important components:

1. Research questions

2. Research design

3. Participants

4. Variables

5. Statistical tools

Participants make up your sample and they provide information to answer your research questions. For
example, assume that you are marketing a new product to the college age demographic or those aged 18-22 years. You
choose a sample of 100 people in that age group whom you give a survey about their product preferences. From the results
of your survey, you decide which features to keep and which to discard. These 100 people who fill out the surveys are your
study participants.
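Drawing such a sample can be sketched with Python's standard random module; the participant IDs and population size below are invented:

```python
# Draw a simple random sample of 100 participants from an invented
# list of 2,000 candidates in the target demographic.
import random

population = [f"person_{i}" for i in range(1, 2001)]
random.seed(42)                            # fixed seed so the draw is reproducible
sample = random.sample(population, k=100)  # 100 participants, no repeats

print(len(sample), len(set(sample)))
```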

There are essentially three types of research design, as seen in Figure 5.2.3-1. The choice of design that you use depends on
the business problem or concern and how much information you have. In some business studies, you may use one or more
designs.

Figure 5.2.3-1 Types of Research Design
1. Exploratory

Exploratory research design is usually used when the researcher doesn’t know much about the problem and wants
to find out more information. For example, assume that you want to conduct a wide marketing research study that
looks into the buying habits of people in a new city. This type of research is exploratory in nature because it is
designed to simply give you basic information about potential customers in this new city. Another important reason
to conduct an exploratory study is to actually come up with research questions for future studies. Exploratory
studies use a number of tools to collect information such as surveys, focus groups, and case analyses.

2. Descriptive

Descriptive research design is more narrow-focused than exploratory research and is designed to provide answers to
questions of Who, What, How, When, and Where. For example, a descriptive marketing study may look at the
spending habits of adults over the age of 65 in one area of the new city. Descriptive studies use a number of methods
such as interviews, questionnaires, and surveys.

It is important to remember that descriptive studies do not uncover the Why behind collected information.

3. Causal

Causal research design is much more rigorous than exploratory or descriptive research. As a business professional,
you may use causal design if interested in why one variable causes another. Experiments are the most common tool
when conducting causal research. For example, if a pharmaceutical company is testing a new drug, it must
conduct experiments to determine whether and why it is effective.
It is important to remember that the type of research design that you will use will determine how you will collect
data and whom you will include in your sample.

We will cover Sampling in greater length in coming chapters. Research design will also influence whether your
research uses quantitative research tools or qualitative research tools.

5.3 Quantitative Research Methods

Quantitative research methods usually involve measurement and analysis of numerical data. Once you have identified your
business problem or concern, you will develop research questions and hypotheses. Hypotheses are essentially what you think
might be happening. For example, you may notice that your company’s fourth quarter sales figures are significantly lower
than in past years. You suspect that sales figures decrease because the fourth quarter is the period with the least promotional
effort. This suspicion is your hypothesis, which you can test to see whether it is true or false by conducting a quantitative
research study. There are two quantitative methods that are mainly used, as follows:
1. Surveys and questionnaires

2. Experimental studies
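The fourth-quarter sales hypothesis above can be turned into a quick numeric check before any formal testing; the quarterly figures below are invented:

```python
# Compare invented Q4 sales (in thousands) against the average
# of the other three quarters to quantify the suspected shortfall.
from statistics import mean

sales = {"Q1": 420, "Q2": 455, "Q3": 440, "Q4": 310}

other_quarters = mean(v for k, v in sales.items() if k != "Q4")
shortfall = other_quarters - sales["Q4"]
print(f"Q4 trails the other quarters by {shortfall:.1f} thousand on average")
```

A check like this only describes the gap; whether the gap is explained by reduced promotional effort still requires a proper study.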

Surveys and Questionnaires

Surveys and questionnaires usually represent a cost-effective method of gathering information. They are self-administered
methods of data collection that can be done in person, through the phone, and through the Internet. The goal of using a
survey or a questionnaire is to obtain a ‘snapshot’ idea of habits, preferences, and needs. Unlike other methods of data
collection, surveys and questionnaires are:

 Time-saving and cost-effective


 Easy to administer – These can be administered by non-specialists who simply give out the surveys or questionnaires

To create surveys or questionnaires that explore everyday business problems, the following process is followed:

1. Examine the purpose of the data to be gathered

2. Create appropriate questions that address the problem

3. Administer the survey/questionnaire

4. Gather and analyze data

5. Report the results and draw conclusions
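Step 4 of this process (gather and analyze data) can be sketched for a single multiple-choice question using Python's standard library; the responses below are invented:

```python
# Tally invented answers to one survey question and rank them
# by frequency, most common answer first.
from collections import Counter

responses = ["yes", "no", "yes", "yes", "undecided", "no", "yes"]
tally = Counter(responses)

print(tally.most_common())
```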

Experimental Studies
Experimental studies are another type of quantitative study that involves the manipulation of variables and the
measurement of outcomes. Unlike surveys or questionnaires, experiments are costly and time-consuming.
Experiments also require the development, administration and analysis of results to be done by specialists in their
business field.

In experiments, you are interested in finding out how one set of variables affects another set of variables. For
example, you may design an experiment that explores how your company’s newly developed brake system performs
in comparison to the older systems. When conducting experiments, you will follow the following steps:

1. Define the desired population which you wish to study (e.g. new car owners)

2. Select a sample from this population to represent the population

3. Assign groups within your sample to test variables upon

4. Administer the experiment and collect data

5. Analyze the data and draw conclusions.
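Step 3 above, assigning groups within your sample, can be sketched with Python's standard random module; the participant IDs are invented:

```python
# Randomly split an invented sample of 20 new car owners into
# a treatment group (new brake system) and a control group.
import random

participants = [f"driver_{i}" for i in range(1, 21)]
random.seed(7)                # fixed seed so the split is reproducible
random.shuffle(participants)  # randomize the order before splitting

midpoint = len(participants) // 2
new_brakes_group = participants[:midpoint]  # tests the new brake system
old_brakes_group = participants[midpoint:]  # control group, old system

print(len(new_brakes_group), len(old_brakes_group))
```

Random assignment like this helps ensure that differences measured between the groups can be attributed to the variable being tested.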

5.4 Qualitative Research Methods

Qualitative research methods do not usually involve numerical data. Instead, information is collected using an interview-type
method to collect in-depth information. For example, many movie producers are interested in how audiences will receive
their latest film. Before the film is released widely, they hold viewings with small groups of 7-10 people to watch the film and
give their opinions. This is an example of a focus group. Another method of collecting qualitative data is to complete a case
study.

Focus Groups & Case Studies

Focus groups are an important data collection method in business research. We have seen that quantitative collection
methods such as surveys or questionnaires provide ‘snapshot’ information about habits, needs, and patterns. Focus groups,
on the other hand, are designed to provide in-depth information that cannot be gathered from surveys or questionnaires.

Focus groups are typically organized and then moderated by a facilitator. So, when are focus groups important in collecting
information? The following are instances where focus groups are important in exploring business problems or needs:

 Marketing research – When a company wishes to learn about the experiences of customers using their products, they are likely
to hold focus groups. For example, companies producing video games often hold focus groups with their dedicated consumers
in order to find out what works well and what needs improvement. The results from these focus groups are used to improve
products that are already on the market and in the development of new products.
 Fundraising research – Non-profit organizations also use focus groups to find out how effective their fundraising campaigns are
in soliciting donations. The focus group of 8-10 people is used to elicit their opinions on the campaign; for example, how well
the message in the campaign motivates them to donate money.
 Human resources research – Companies have used focus groups to explore employee satisfaction and motivation.

The process of conducting a focus group is as follows:

1. Choose a sample of 8-10 people who share a number of qualities that include:
a. Usage of products or services
b. Appropriate demographic information
c. Willingness to engage in the focus group

2. Choose a moderator for the informal, roundtable discussion who is experienced in facilitating and guiding a discussion.
Good moderators will elicit detailed and important information that is relevant to the business problem at hand.
3. Facilitate a relaxed physical setting to encourage several hours of discussion and talking. This includes providing snacks and
possibly an honorarium to compensate participants for the time spent in the focus group.

4. Document and prepare all the opinions shared during the discussion into a report to share with management.

Case Studies

According to researchers, a case study is an in-depth study of a business problem or phenomenon that is usually limited to
the activities of 1 or 2 companies. A case study is a unique type of research study in business because although it is largely a
qualitative exploration of a problem or phenomenon, it may contain quantitative information.

The process of preparing a case study is as follows:

1. Define the business problem and its related research questions such as long-term marketing strategies and introduction of
new products.

2. Determine the type of research methods (qualitative and quantitative) that will be used to collect information about the
case. In some studies, archival information may be all that is necessary. In other studies, focus groups and structured
interviews may be required to gather information.

3. Collect the data necessary to explore the business case.

4. Evaluate the data and prepare a report detailing all findings. An excerpt from a detailed case study done on the airplane-
manufacturing company, Boeing, is seen in Figure 5.4.1-1. The Boeing Company is an aerospace corporation that is often of
interest to business professionals. In this excerpt, the researchers conduct a case study on the company’s continued
innovation in its industry.

Figure 5.4.1-1 A Sample Case Study on the Company Boeing

Sources: Seperich, Woolverton, Beierlin and Hahn, 1996; MacCormack and Forbath, “Learning the Fine Art of Global
Collaboration”. Harvard Business Review (2008).

A case study is an in-depth research study that has the following characteristics:

 Based on ‘real life’ business problems or concerns


 Focuses on a small sample, usually 1 or 2 companies although there are many case studies that examine more companies.
 Follows changing trends over a period of time
 Uses a number of methods to collect data

It is important to remember that case studies provide an in-depth complement to quantitative studies. Case studies also
provide historical information analysis which cannot be done with surveys or questionnaires.

Structured Interviews

Structured interviews are another type of qualitative research methods. In a structured interview, the interviewer prepares a
set of questions ahead of time, and if there are multiple interviewees, they are all asked the same set of questions. Structured
interviews are particularly used in human resources management, as they are relevant in the hiring, retention, and promotion
of employees.

The key to a structured interview is to make the questions as relevant as possible to the information that you are seeking. While
questions in a structured interview tend to be very formal, they are also open-ended to allow the person being interviewed to
share their thoughts and experiences in their own language. The interviewer must try to record what is being said as closely
as possible, even requesting time from the interviewee to capture as much of the interview as possible. An example of a
structured interview is shown in Figure 5.4.2-1, which is an excerpt from a company hiring interview.

Figure 5.4.2-1 Sample Structured Interview Questions

Business Research Ethics

When conducting research in a business setting, there are certain ethical considerations that have to be made. Ethics in
business research is no different from ethics in other fields of research. Assume that you manage a division in your company
that deals with weekly newsletters and updates mailed to regular customers. At your disposal is general customer
information that includes their names, addresses, and credit card information. In addition to this information, you have a
database of their purchases in the previous two years based on surveys that you have conducted. You are also planning to
conduct several surveys in the near future.

Given this information that you hold about your customers, there are a number of ethical considerations that you must be
aware of as you gather more information as follows:

 Transparency – As a researcher, you must provide clear, accurate and truthful descriptions of why you are collecting data and
how you will handle the information. You must also be ready to provide information on what benefits you will draw from the
results of your research. An important part of transparency is making sure that the data you gathered is only used for those
purposes that you share with your participants. For example, it is unethical to gather information about your customers and
then sell it to online marketers.
 Informed consent – It is crucial that you gain the full consent of your research participants before you collect data. This consent
must be based on clear and accurate information that you give the participants about your research study.
 Confidentiality/ Anonymity – Gathering sensitive information about buying habits and behaviors requires that you protect the
confidentiality and anonymity of your participants. For example, you must remove all identifying information such as name,
age, and address from survey results that are compiled.
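The confidentiality point above can be sketched as a simple de-identification step; the record layout and field names below are assumptions for illustration:

```python
# Drop identifying fields from an invented survey record before
# results are compiled, keeping only the response data.
IDENTIFYING_FIELDS = {"name", "age", "address"}

def anonymize(record):
    """Return a copy of a survey record without identifying fields."""
    return {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}

survey_row = {"name": "A. Client", "age": 34, "address": "12 Elm St",
              "satisfaction": 4, "would_recommend": True}
print(anonymize(survey_row))
```

In practice, de-identification is only one part of a confidentiality policy; secure storage and access controls matter just as much.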
Note: Ethics is a code of conduct imposed on researchers by their professional associations and/or their employers.

5.5 Summary of Key Concepts

 Probability is the chance that an uncertain event will happen.


 Research is an important part of solving everyday business problems and addressing common concerns.
 Conducting research to answer everyday business questions is a systematic, multi-step process.
 A research question is simply a business concern or problem for which you are seeking answers.
 Review of past research and current knowledge is important to guide the development of research tools, conducting the study,
and analyzing the data.
 Asking the right questions is the basis for effective research in business. The questions must be directly relevant to the
business concern or problem and amenable to a cost- and time-effective study.
 Research design details how to collect the information that you need and how to conduct the appropriate statistical analyses
after collection.
 There are three types of research design: exploratory, descriptive, and causal.
 The choice of research design depends on the business problem or concern and information available.
 Quantitative research methods usually involve measurement and analysis of numerical data. Surveys, questionnaires, and
experiments are examples of quantitative research methods.
 Qualitative research methods do not usually involve numerical data but in-depth information. Focus groups, case studies, and
structured interviews are examples of qualitative research methods.
 There are important ethical considerations to make when conducting business research that include transparency, informed
consent, confidentiality, and anonymity.

Glossary of Terms

Business ethics: A code of practice that is imposed on researchers by their professional affiliations.

Case Studies: A qualitative method of data collection that usually involves historical analysis and concentrates on
one or two companies.
Dependent variable: A condition such as sales figures that is affected by one or more conditions such as marketing
expenditure and customer preferences.

Experiment: A quantitative method of data collection that involves manipulating variables and measuring the
outcome.

Focus groups: A qualitative method of data collection that involves a moderated, roundtable discussion of a business
problem or concern.

Independent variable: A condition that affects business outcomes.

Surveys/Questionnaires: Quantitative methods of data collection that result in a ‘snapshot’ view of habits,
preferences, and behaviors.

5.6 Chapter Review Questions

Descriptive Questions

1. Business research follows a five-step linear process. List and explain these steps.

2. List at least three sources of current knowledge that you can use in your research design.

3. Compare quantitative and qualitative research methods and provide one example of each.

Multiple Choice Questions

Mark the correct answer.

1: The following represent different research designs EXCEPT:


a) Descriptive
b) Qualitative
c) Exploratory
d) Causal

2: A solid research design represents:


a) A source of current business knowledge
b) A blueprint of your research
c) A master plan of your research
d) A list of possible research questions

3: Quantitative research methods involve the measurement and analysis of data. All of the following are examples of
quantitative methods EXCEPT:
a) Case study
b) Survey
c) Questionnaire
d) Experiment

4: In business ethics, a researcher must make sure that participants agree to a research study based on clear, accurate, and
complete information. This is referred to as:
a) Anonymity
b) Confidentiality
c) Transparency
d) Informed consent

5: A researcher wishes to study the effect of salary bonuses on employee performance. Employee performance, in this case,
represents the:
a) Research question
b) Independent variable
c) Dependent variable
d) Research design

6: Research questions are important in a study because:


a) They guide the type of data that must be collected
b) They determine the confidentiality of the study
c) They determine the transparency of the study
d) They guide the moderator of a focus group

Answer Key

1-b, 2-c, 3-a, 4-d, 5-c, 6-a.

References

Anderson, D.R., Sweeney, D.J., and Williams, T.A. (2007). Statistics for Business and Economics. South-Western College
Publishing. 10th Edition.

Freedman, D., Purves, R., and Pisani, R. (2007). Statistics. W.W. Norton & Company. 4th Edition.

Shipper, F., Manz, K.P., Adams, S.B., & Manz, C.C. (2010). “Herman Miller: A Case of Reinvention and Renewal” in Crafting &
Executing Strategy: The Quest for Competitive Advantage: Concepts and Cases (18th Ed.), Arthur A. Thompson, Jr., A.J. (Lonnie)
Strickland, & John E. Gamble (Forthcoming), New York: McGraw-Hill.

CHAPTER 6

6 Research Methods II: Designing Studies

Overview

Research allows professionals and organizations to make better business decisions. The more accurate the data and the better
the research design, the better the final decisions made in business settings. For today’s managers and decision-makers, the
questions that they need to ask are: What kind of information do we need? How do we design studies to gather this information?

In Research Methods I: An Introduction, we have covered the basics of research methodology. A solid methodology allows a
business professional to ask the right questions. Following this, the right type of study design is essential to answer these
questions. In this chapter, we will expand on the design of studies in business. Specifically, we will look at how study design is
important in conducting different analyses, such as sales and market-share analyses, distribution and competitive analyses, and
image and advertising analyses.

The objectives of this chapter are as follows:

 Differences between cross-sectional vs. longitudinal study designs


 Finding evidence of causality among variables
 Practical descriptive research designs

Keywords

Pilot study
Cross-sectional design
Longitudinal design
Concomitant variation
Time order variation
Elimination of alternative causation
6.1 Choosing the Right Study Design

How do you choose the right study design for your business research question?

First, you must ask a question that addresses your business problem. You may want to test the market of a new district and
open a franchise for your restaurant there. A businessman may wonder about increasing monthly sales figures in a clothing
store.

If a company is interested in improving its grills, it must ask the question: “how do we make our grills better?” As we have seen
in the previous chapter, there are three (3) broad study designs. In a business study, the three designs serve different
purposes. They also represent a step-wise approach to seeking answers for the business problem or challenge.

1. Exploratory
2. Descriptive
3. Causal

As seen in Figure 6.1, exploratory research prompts the researcher to ask the right questions. This is the first step in using
research to improve business. Typically, a company or an organization has a lot of data at its disposal. Some of that data it
already collects as part of its processes and operations. Other data can be collected from readily available databases that the
company either has or can access, such as city census data. In some cases, the company must collect data through
exploration of a sample in a population.

Figure 6.1 Types of Research Design


In different fields of business, this is what exploratory research may lead to:

 Marketing: How do we increase sales teams' quarterly figures?


 Production: How do we produce a more efficient microwave oven?
 Operations: How do we limit waiting times for our customer service calls to 2 minutes?
 Finance: How do we make accounts receivable more efficient across the organization?

In all these instances, the questions are broad enough to provide a general view of the problems or challenges facing the
company. How is exploratory research conducted? There are a number of methods that are used:

Surveys & Questionnaires

Companies develop short, simple-to-fill-out surveys to gain a broad view of the problem. For example, a human resources
manager at a hospital wishes to explore patient complaints at shift changes. To do so, he develops a short survey of 4-5
questions. The survey is given to employees and patients. Examples of questions on this survey are:

Telephone Interviews

Companies conduct phone interviews with clients and customers to explore a challenge or a problem. For example, a cellular
phone company can make random calls to customers about a recently added feature. The company representative can ask a
few simple questions about usability and customer satisfaction. This also affords a good opportunity to gain information
about problems. The company representative can ask at the end of the call if there are any problems or complaints. This
allows the customer to express any issues in their own words.

Secondary Data

Secondary data refers to data that is already available through government, private, and other public sources. For example,
census data is useful in collecting demographic data such as ages, income levels, and education. Internal company records of
employees can also serve as secondary data, as they are collected for other purposes.
Pilot Studies

A company or an organization can develop and run a pilot study to test a new idea for a product or a service. This product is
usually tested on a select group of users who can provide detailed feedback. For example, a pilot marketing study can
promote a new kitchen appliance to a limited market such as a small city. Users can then be followed through a survey that
asks about their satisfaction with the appliance. From a pilot study, a company can determine whether the product will
appeal to a wider market.

Case Studies

Case studies are deep examinations of a single organizational problem or challenge. This problem is studied from a number
of angles such as human resources, finance, and management. For example, a company investigating a high turnover rate
amongst its frontline workers will do a combination of:

 Staff interviews
 Pay scale comparisons, and
 Exit interviews

Data from these interviews and analyses will be used to answer the question: Why is there a high employee turnover
amongst frontline workers? Sometimes a case study does not answer the main problem. However, it provides
guidance to conduct a wider study across the organization.

6.2 Descriptive Study Design

Descriptive research involves gathering both qualitative and quantitative data. Similar to exploratory research, there are a
number of methods that can help in collecting this data. These methods include the following:

 Written surveys & questionnaires


 Personal interviews
 Telephone interviews
 Observations
 Portfolios

These data collecting methods are similar to those used in exploratory research. However, the actual nature of surveys,
questionnaires, and interviews used in descriptive research is designed to provide more insights into the business problem.
In business, descriptive studies can be used in:

Sales studies

Market potential – In this type of study, a company examines a potential market to find whether it is ready for its products
or services. Market researchers develop interest surveys to test this readiness. They can also analyze demographic
information about the population available through census data.

Market share – In this type of study, an overall market view is examined. If a company believes that the market is big
enough, given present competitors, for its products and services, it can initiate market potential studies.

Sales analyses – Sales data can be analyzed in a descriptive study. Quantitative analyses are especially useful in sales.

Consumer behaviour & attitude studies

Image – Consumer perceptions of image are studied using descriptive questionnaires. A company can conduct these
periodically to analyze how the image of their company and its products changes.
Product usage – Whether a company is providing Internet service or selling cleaning supplies, it is always interested in
product usage. Company representatives can conduct phone interviews to find out usage patterns.

Advertising – Companies analyze the impact of advertising of all kinds on sales and brand recognition. Descriptive research
studies to analyze this impact include interviews, questionnaires, and surveys.

Pricing – Descriptive studies are used to develop pricing strategies in a company’s marketing plan. Market researchers
compare prices internally and with competitors to manage sales projections.

Market Studies

Distribution – Descriptive studies are important to analyze distribution routes, efficiencies, and costs.

Competitive analysis – With competition and market placement important in company survival, analysis of competitors’
performance is necessary. Descriptive analysis is used to compare the company’s productivity, advertising, and market
presence to its competitors.

There are two types of descriptive research designs, as seen in Figure 6.2.

Figure 6.2 Different Types of Descriptive Research Designs

1) Cross-Sectional designs

This is the most common form of descriptive research design. A sample is chosen from a desired population and its attitudes,
preferences, and behaviours are recorded in the current time.

2) Longitudinal designs

In this type of descriptive study, a business researcher tests the same population over time. This type of design is used in a
number of business applications such as:

 Trend analysis
 Attitude and behaviour changes over time
 Brand switching
 Awareness of new products and services
 Turnover of staff
 Long-term effects on market shares and prices
 Market volatility.

6.3 Causal Studies in Business

What is the impact of a company’s new jeans campaign on sales?

How does the organization’s new mentorship program affect team productivity?

Causal studies in business research are designed to analyze how one variable (the advertising campaign) affects another variable
(jeans sales). In this particular case, X is the first variable while Y is the second variable. If a change in X results in a change in
Y, we say that the two variables have a causal relationship. Sometimes the relationship only seems causal because X and Y are both
affected by a third variable. In that case, the results of a causal study would show whether it is a direct causal relationship or
not.

There are a number of ways to find out if there is evidence of causality as follows:

1. Concomitant variation

If the variable X changes, then the variable Y also changes. For example, if factory productivity (Y) increases 10% for every
month of training (X), then we can say that changes in X cause changes in Y. This is a causal relationship.

If the variable X does not change, then the variable Y does not change. For example, if monetary incentives (X) are not given,
then employee satisfaction (Y) does not change. We can conclude that X and Y have a causal relationship.
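Concomitant variation is often quantified with a correlation coefficient. The sketch below, using hypothetical training and productivity figures, computes a Pearson correlation by hand with the standard library:

```python
# Sketch: quantifying concomitant variation with a Pearson correlation
# coefficient. The data is hypothetical: months of training (X) and
# factory productivity (Y, units per worker per day).
from statistics import mean, pstdev

training_months = [1, 2, 3, 4, 5, 6]            # X, the suspected cause
productivity = [102, 110, 121, 133, 146, 161]   # Y, roughly +10% per month

mx, my = mean(training_months), mean(productivity)
cov = sum((x - mx) * (y - my)
          for x, y in zip(training_months, productivity)) / len(training_months)
r = cov / (pstdev(training_months) * pstdev(productivity))
print(f"Pearson r = {r:.3f}")  # values near +1 suggest concomitant variation
```

A correlation near +1 or -1 is evidence that X and Y vary together; on its own, however, it cannot rule out a third variable affecting both.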

2. Time order

If a variable X (shift changes) always occurs before an effect Y (late medicine dispensing), we can conclude that X and Y have a
causal relationship. In other words, X (the cause) will always occur before Y (the effect).

3. Elimination of alternative explanation

Causal research can also be used to eliminate factors that could be producing an effect, Y. For example, either a company’s
financial restructuring (X1) or its introduction of a new product (X2) could be the cause of an increase in profit, Y. Research
can eliminate one or the other as the reason behind the company’s increase in profit.

Causal research involves the manipulation of variables through experimentation. As seen in Figure 5.4., a researcher can
conduct either a laboratory experiment or a field experiment. In a laboratory experiment, the researcher conducts
manipulations in a controlled environment. For example, a car company can conduct safety experiments on their cars in their
specialized laboratories.

In a field experiment, the conditions are less controlled than in a laboratory setting. Test marketing is an example of a field
experiment. A company can introduce several designs of the same product into different samples and record sales figures. In
the field, however, there are other factors that can influence consumer purchasing behaviour.

In these experiments, the following are outlined in a research plan:


 Independent variable – This is the variable that is manipulated (X, the price of a product)
 Dependent variable – This is the variable that is measured (Y, the sale of a product)
 Treatment group – This is the sample that is exposed to manipulation
 Control group – This is the sample for which the independent variable is not changed.
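A minimal sketch of how such an experiment's outcome might be analyzed, with entirely hypothetical weekly sales figures for a treatment group (price was manipulated) and a control group:

```python
# Sketch of a treatment/control comparison in a field experiment.
# Hypothetical weekly sales (the dependent variable Y) for stores where
# the price (independent variable X) was cut, versus control stores.
from statistics import mean

treatment_sales = [520, 545, 610, 580, 555]   # exposed to the manipulation
control_sales = [480, 470, 495, 500, 465]     # independent variable unchanged

effect = mean(treatment_sales) - mean(control_sales)
print(f"Average difference in sales: {effect:.1f} units")  # -> 80.0 units
```

In practice a researcher would also test whether this difference is statistically significant rather than relying on the raw means alone.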

Figure 5.4 Causal Research Types

6.4 Business Decision-Making

Research allows professionals and organizations to make better business decisions. The better the research design, the better
the data. High quality data and targeted statistical analyses lead to better decision-making within the organization. Whether a
company is experiencing transformation such as a merger, or is considering entry into new markets, it needs to make
decisions based on good data.

Familiarity with statistical research designs and data analysis makes you a better analyst of information. Every single day, you
will run into figures that indicate some aspect of performance in a business setting. Think of the following examples:

 A Deloitte study finds that 45% of women and 63% of men are satisfied with their initial compensation package at hiring.
 Apple, Inc. posted a 5% increase in profits in Q3 compared to Q2. This is the company’s sixth consecutive profit increase.
 China’s commercial real estate development has seen an increase of 10% in 5 straight years. The trend is not expected to slow
down.
 A survey of 500 college students who graduated at the height of the global recession in 2008 finds that nearly half have
not found full-time employment in their fields 36 months after graduation.
 A study conducted by the Department of Health has found that overweight children are 3 times as likely to suffer an asthma
attack as children of normal weight.

All of these examples are based on studies that are probably a mix of exploratory, descriptive, and causal research. Some of
the data is primary (collected by the researchers) while some is secondary (based on information available from other
organizations).

When you read a technical report with extensive data, charts, and other visual representations of information, there are
questions that you may ask such as:
 Is the data presented in the technical report based on exact calculations or on estimates?
 Are the statistical calculations conducted in the study based on publicly available, secondary data?
 What kind of causal research have the researchers conducted? Was the research done in a laboratory or in the field?
 In collecting data for this report, should other data have been gathered as well?
 If the information collected is secondary, is it recent enough to be useful to managers and executives in their decision-making?
 How can managers and other employees use the technical information found in this report?

These questions are the first step in determining whether solid decision-making is possible with the data available. If the data
is lacking, the business researcher designs more studies and gathers more data for better decision-making.

Practical Design: Marketing Research

Marketing researchers often study new markets before a product or a service is launched. The Omega Company launched a
Test Market study in 2009 in a Houston district called Sugar Land. A Test Market is a physical segment of a city or a
demographic group where a company conducts a study on its product or service before presenting it to a larger population.

The Omega Company chose Sugar Land as a test market before promoting its frozen pizza products in the wider market of
the state of Texas. As seen in Figure 5.5., the company launched a 6-month Print and Television advertisement campaign in
two different parts of the district. Section 1 was only exposed to the Print campaign while Section 2 was only exposed to
the Television campaign.

This is an example of a causal research design whereby a field experiment is carried out. The independent variable, X,
(print/television campaign) is studied as the cause of the dependent variable, Y, (increased brand recognition).

The marketing researchers have to make sure that people in Section 1 are not exposed to TV advertisement while those in
Section 2 are not exposed to print advertisement. This allows them to compare the effectiveness of each type of campaign on
how well people recognize the company’s pizza products.

Researchers must create surveys or questionnaires to collect data on how much exposure people had to advertisement and
how well they recognize the company brands.
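Once the survey responses are in, the researchers can tally recognition rates per section. The sketch below uses entirely hypothetical counts of respondents and recognitions:

```python
# Sketch: comparing brand-recognition rates between the two test-market
# sections. All figures are hypothetical; Section 1 saw only the print
# campaign, Section 2 only the television campaign.
surveyed = {"print (Section 1)": 400, "tv (Section 2)": 400}
recognized = {"print (Section 1)": 132, "tv (Section 2)": 176}

rates = {}
for campaign in surveyed:
    rates[campaign] = recognized[campaign] / surveyed[campaign] * 100
    print(f"{campaign}: {rates[campaign]:.1f}% recognition")
```

The difference between the two rates gives an estimate of the relative effectiveness of each campaign, subject to the field-experiment caveats noted above.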

Figure 5.5 Test Market – Omega Company


Practical Design: Operations Research

The Herman Miller Company opened its doors at the beginning of the 20th century. Today, it is one of the largest furniture
companies in the world. With a share of nearly 8% of the total stock value of office furniture, the company has over 100
offices and outlets all over the world.

Management wished to study trends in net sales and operating earnings. This was especially important as many companies
suffered major losses during the 2008 recession. The table below shows how Herman Miller’s net sales and other financial
indicators changed from 2006 to 2010. The data can also be shown graphically. This analysis is an example of descriptive
research using secondary data collected by the company as part of its operations.

Table 5.5.2 Herman Miller’s Net Sales Figures, 2006-2010

Year    Net Sales ($, in millions)    Operating Earnings ($, in millions)

2006    1737.2                        53.6
2007    1918.9                        122.8
2008    2012.1                        246.6
2009    1630.0                        198.1
2010    1318.8                        157.7

The table shows that the company’s net sales survived the 2008 market crash. However, in the subsequent 2 years, the
company’s net sales dropped 19% from 2008 to 2009, and 19% again from 2009 to 2010. These sales drops are indicative of
the market weakness following the crash, as purchases of goods such as office furniture were reduced by many organizations.
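The year-over-year percentage changes quoted above can be reproduced directly from the table's net sales figures:

```python
# Sketch: year-over-year percentage change in Herman Miller's net sales,
# using the figures from Table 5.5.2 (net sales in $ millions).
net_sales = {2006: 1737.2, 2007: 1918.9, 2008: 2012.1,
             2009: 1630.0, 2010: 1318.8}

years = sorted(net_sales)
for prev, curr in zip(years, years[1:]):
    change = (net_sales[curr] - net_sales[prev]) / net_sales[prev] * 100
    print(f"{prev} -> {curr}: {change:+.1f}%")
```

The last two lines of output confirm the roughly 19% drops in 2009 and 2010 described in the text.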
Source: Herman Miller Company, Review of Operations.

6.5 Summary of Key Concepts

 A company or an organization can develop and run a pilot study to test a new idea for a product or a service.
 Cross-sectional studies are the most common form of descriptive research design and are considered a snapshot in time while
longitudinal research studies test the same population over time.
 Causal research involves the manipulation of variables through experimentation. In a laboratory experiment, the researcher
conducts manipulations in a controlled environment while in a field experiment, the conditions are less controlled.
 Concomitant variation is when one variable changes with another and can provide evidence of causality.
 Time order variation is when one variable always occurs before another. It is also evidence of causality.
 Elimination of alternative causation is when research studies test a number of variables for evidence of causality.
 A test market is a geographic area or demographic group where a company conducts a study on its product or service before
presenting it to a larger population.
Glossary of Terms

Concomitant variation: When one variable changes with another and can provide evidence of causality.

Cross-sectional studies: Most common form of descriptive research design and are considered a snapshot in time.

Elimination of alternative causation: When research studies test a number of variables for evidence of causality.

Longitudinal studies: Descriptive studies that test the same population over time.

Pilot study: A study to test a new idea for a product or a service.

Secondary data: Data that is already available through government, private, and other public sources.

Test market: A geographic area or demographic group where a company conducts a study on its product or service
before presenting it to a larger population.

Time order variation: When one variable always occurs before another. It is also evidence of causality.

6.6 Chapter Review Questions

Descriptive Questions

1. There are three (3) different ways that business researchers can investigate causality relationships between variables.
Explain what they are and provide an example of each.

2. Explain the difference between cross-sectional and longitudinal studies. Provide a practical example of business studies
that could be conducted for each.

3. Descriptive research designs provide several practical designs in a business setting. Provide 3 such studies and explain
their applications.

Multiple Choice Questions

Mark the correct answer.

1: A business researcher is conducting a test market study in one of the boroughs of New York City. This type of research
design is an example of:
a)Causal
b)Qualitative
c)Quantitative
d)Descriptive

2: A business statistician determines that employees maximize their retirement fund contributions every time the company
provides a bonus at evaluations. To find evidence of causality, the statistician searches for:
a)Disparate variation
b)Concomitant variation
c)Time variation
d)Market variation

3: The Herman Miller Company is interested in conducting an in-depth examination of the company’s performance in 2009 to
understand their 19% reduction in net sales. The best kind of study for this examination is:
a)A causal analysis
b)A Time order analysis
c)A Histogram
d)A case study

4: A field experiment provides ___________ control compared to a laboratory experiment.


a)Equal
b)More
c)Less
d)Not enough information

5: Companies such as Apple and Google conduct ________ studies to introduce new products and receive feedback from a
select group of users.
a)Pilot
b)Concomitant
c)Exploratory
d)Causal

6: A business researcher tests college students’ social networking preferences at the beginning of the school year and again at
the end of the school year. This is an example of a(an) ________ study.
a)Alternative
b)Time-order
c)Cross-sectional
d)Longitudinal

Answer Key

1-a, 2-b, 3-d, 4-c, 5-a, 6-d.

References

Anderson, D.R., Sweeney, D.J., and Williams, T.A. (2007). Statistics for Business and Economics. South-Western College
Publishing. 10th Edition.

Freedman, D., Purves, R., and Pisani, R. (2007). Statistics. W.W. Norton & Company. 4th Edition.

Shipper, F., Manz, K.P., Adams, S.B., & Manz, C.C. (2010). "Herman Miller: A Case of Reinvention and Renewal" in Crafting &
Executing Strategy: The Quest for Competitive Advantage: Concepts and Cases (18th Ed.), Arthur A. Thompson, Jr., A.J. (Lonnie)
Strickland, & John E. Gamble, (Forthcoming), New York: McGraw-Hill.

CHAPTER 7

7 Sampling Methods in Business Research

Overview

In business research, collecting data to turn into useful information requires a strong understanding of the business problem.
Data collection also requires a method to access this data from the customer population. Choosing the right population and
the right testing sample is important in collecting this useful information.

It is rare for a researcher to be able to collect information from an entire population of interest. Assume that you wish to
study the consumer behavior of all elderly people in your city of residence. There are serious limitations to giving out surveys
to each and every elderly person in this population. In general, a sample is chosen out of a population because of the
following:

1. Time limits – It is much less time-consuming to test a sample than an entire population. This is why research studies are
conducted frequently while a population census may be done once every 10 or 20 years.

2. Cost limits – It is much less costly to test a sample than an entire population.
3. Practicality – It is much more practical to test a sample that is limited in number than an entire population.

A population is not always made up of people. A manufacturer may consider a week’s production output to be an entire
population. A baker may consider an entire batch of 5000 cookies to be a population. For each of these populations, the
producer may be interested in the quality of the product before selling it to customers. Testing all the products is time-consuming and
costly. Instead, samples will be chosen and tested for high quality before shipping out.

In this Chapter, we will explore different sampling techniques and methods. The objectives of this chapter are to cover the
following:

 Sampling Process
 Non-Probability Samples
 Probability Samples
 Strengths and weakness of different sampling methods

Keywords

Sample Size
Probability Sampling
Non-probability Sampling
Sampling Frame
Sampling Process
Simple Random
Systematic
Stratified
Cluster
Convenience
Judgment

7.1 Sampling Process

The sampling process is concerned with finding the best sample for your research study. Assume that in the study of the
consumer behavior of elderly people in your city, the population is defined as those above 65 years of age. While this may
seem like an easily defined population, it is very challenging to actually access it. How would you reach each and every person
in the city who is older than 65 years of age? That is the practical question that, as a business professional, you must ask
yourself.

In addition to logistic problems of reaching the entire population, there are time and cost limitations. Therefore, you must
choose a sample carefully from this population. This sample must be as closely representative of your population as possible
to provide accurate results. If it is not, you risk collecting data that does not yield useful information to your business.

For example, if your sample is mostly comprised of elderly individuals who make more than $100,000 per year, and yet most
of the population is known to make less than half of that, the data you collect will not be useful. A main reason is that your
sample has much more disposable income than the majority of the population you are targeting. As someone interested in
what this population buys, your sample’s consumer behavior does not match the majority of the population.

Sampling methods are important in making sure that you choose the right population, and then the right sample. This often
requires a multi-step process as follows:

Figure 7.1.1-1 Sampling Process:


1. Choose the right population
2. Identify the sampling frame
3. Identify the sampling method
4. Determine your sample size
5. Sampling and data collection
6. Review your sampling process

Choose the Right Population

As written before, a population is not always just people. It is the collection of ALL items whose characteristics you want to
study and understand, including manufactured products. It can also be events. For example, all house fire cases that
are filed in city hall over a 20-year period can be considered a population. Similarly, all commercial buildings constructed over a
50-year period can be considered a population.
In business research, the population that you are interested in studying is called a Target Population. This means a
population that you define and target according to your research needs and questions. In our example of elderly consumers
in a city, we defined the population as all individuals above the age of 65 years in the city. This is the target population. What
if you are a marketing manager who wishes to launch a product to the elderly in a single district of this city? Then, your target
population will NOT be the entire city but the district of interest.

Choosing and defining the right target population is not as straightforward as it seems. As a business professional, you must
consider your research question carefully before choosing the target population. This is because in human populations, for
example, there are a number of qualities that will impact your research including:

 Demographics
 Employment and earnings
 Socioeconomic status
 Consumer behavior

You must pay close attention to these potential qualities when choosing your target population.

Identify the Sampling Frame

After choosing your right target population, the next step is to identify the sampling frame. A Sampling Frame is a list of
common elements that allows you to access your population. Think of sampling frames as data sources such as phone
directories, geographic maps, or customer/client directories.

For example, assume that you work for a telemarketing company. You want to conduct a phone survey about consumer
purchases. Your sampling frame is the telephone directory or book, which lists all the members of the population (such as a
city).

Other examples of sampling frames are electoral databases where you can have lists of voters and their contact information,
or university enrollment databases for lists of current and past students. Both databases will contain information such as
name, age, and gender.

A sampling frame in itself can provide you with valuable insights such as demographic information for your research. It is also
a guide that allows you to choose one of the several sampling methods that are available in your research.

Identify the Sampling Method

The third step in the sampling process is to identify the right sampling method for your research. The type of sampling
method to use will depend on a number of business conditions such as:

 The type of business problem that you are analyzing


 Cost and time limits
 Accuracy requirements and expectations. In other words, are you conducting a survey to check if your customers are pleased
with the latest detergent or are you conducting safety studies for a new vehicle? The answer to this question determines
whether the sampling method used will fulfill accuracy requirements and standards.

Once you’ve understood your business problem, there are two major sampling methods, covered in detail later in this
chapter, that you can use as follows:

Figure 7.1.3-1 Sampling Methods


1. Probability sampling – In this method, sample items (such as people or products) are chosen from a larger population
based on known probabilities. For example, if there is a population of 100 people, each one of them will have a 1/100 chance
of being included in a testing sample if there is an equal likelihood of inclusion. There are four types of probability sampling:

a. Simple Random sampling


b. Systematic sampling
c. Stratified sampling
d. Cluster sampling

2. Non-Probability Sampling – In this method, sample items are chosen from a larger population without any regard for
their probabilities or occurrences.

a. Convenience sampling
b. Judgment sampling

Determine Your Sample Size

The fourth step in the sampling process is to determine your sample size. In probability sampling, there are statistical sample
size tables that you can use to determine sample size. These tables are based on the number of variables that you are testing
among other things. In general, the more participants or items in your study, the more accurate your results will be. Large
sample sizes also yield results that are more representative of the population from which they are drawn.

These are some of the considerations you must make when determining your sample size:

 The population parameters you wish to estimate


 The overall cost of choosing a sample
 Your current knowledge about the population
 Spread (variability) of the population
 Logistics or the difficulty in collecting data
 The final purpose and audience of your research. For example, your research results may be limited to sharing information
with other employees or you may want to publish them for a greater audience.
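Statistical sample-size tables like those mentioned above are typically built from standard formulas. As an illustration, a common textbook formula for estimating a population proportion is n = z² · p(1 - p) / e², where z is the z-score for the desired confidence level, p the expected proportion, and e the margin of error; the values below are illustrative:

```python
# Sketch: a common textbook sample-size formula for estimating a
# population proportion. The specific inputs below are illustrative
# assumptions, not figures from this chapter.
import math

def sample_size(z: float, p: float, e: float) -> int:
    """Round up, since a sample cannot contain a fraction of a person."""
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)

# 95% confidence (z ~ 1.96), worst-case p = 0.5, 5% margin of error:
print(sample_size(1.96, 0.5, 0.05))  # -> 385, a widely quoted figure
```

Note how the worst-case choice p = 0.5 maximizes p(1 - p), so this figure is conservative; tightening the margin of error to 1% would push the required sample into the thousands.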

Sampling and Data Collection

The final step in the sampling process is to actually collect the data from the sample. Good practices in data collection
include:

 Making sure that data is labeled in time order
 Making detailed notes of missing data

Review your Sampling Process


Once data collection is completed, it is important to review the sampling process for errors and problems. When conducting
survey research, such as the ones in marketing studies, non-response rates may be high. Non-response means that the
customers/participants in your sample to whom you’ve sent surveys simply don’t return them. In our example of the elderly
population, if the sample contains returned surveys mostly from high income individuals, then your sample is not
representative of the population you are studying. Some of your surveys will also come back missing information or
incorrectly filled. A review will help you investigate whether questions in your survey were too vague or difficult to understand
by your sample participants.

Probability Sampling

Probability sampling means that participants or items in your sample are chosen out of the population according to some
known probability. These probabilities can range from simple to more complicated ways to choose sample participants or
items out of a population. The following are four probability sampling methods:

Simple Random Sampling

Assume that there are 100 people in a group (a population) and you need a sample of 10 of them for testing. If each one
of the 100 has an equal chance of being chosen, then any person has a 1/100 or 1% chance of being chosen. This is
called Simple Random Sampling.

In this population of 100 participants, the process to conduct simple random sampling is:

1. The first step is to give each person a label from 001 to 100 based on the total number of people. There are two reasons
for this labeling as follows: to protect the participants’ privacy and to ease the generation of random numbers.
2. The second step is to use a random number-generating table or a computer program such as Excel to generate random
numbers from the numbered list from 001 to 100. This is similar to writing all those numbers on small pieces of paper and
picking them out of a container randomly. A computer program is faster and more efficient, especially when the population is
in the thousands or higher. In this example, 10 numbers are generated randomly to identify the sample members.

3. The third step is to distribute your study materials (surveys or questionnaires) to this randomly selected sample of
10 people.
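The labeling-and-drawing process above can be sketched with Python's standard library; `random.sample` draws without replacement, so each label has an equal chance of selection:

```python
# Sketch: simple random sampling with Python's standard library.
# Labels 001-100 stand in for the population of 100 people; we draw
# 10 without replacement, so everyone has an equal chance of selection.
import random

population = [f"{i:03d}" for i in range(1, 101)]  # labels 001..100

random.seed(42)               # fixed seed so the draw is reproducible
sample = random.sample(population, k=10)
print(sample)                 # 10 distinct labels, e.g. ['082', '015', ...]
```

In a real study the seed would be omitted (or logged) and the drawn labels mapped back to the people they identify.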

7.2 Systematic Sampling

Systematic sampling is a method that arranges your target population into an ordered scheme from which you can choose
participants or items at regular intervals. A simple example would be to choose every 7th name in a list of telephone numbers.
The process of conducting systematic sampling out of a population of 40 people as seen in Figure 7.2.2-1 is as follows:
1. In the first step, you decide on a sample size, which is termed n. The desired sample size in this figure is 4.

2. In the second step, you divide your population of 40 people into groups, which are termed k. In this case, there are four
groups of 10 people each.

3. In the third step, you randomly select one person from the first group. In Figure 7.2.2-1, the 7th person is chosen in the first
group.

4. In the fourth step, you select the 7th person from each of the other groups. These four (4) now make up your sample.

Figure 7.2.2-1 Systematic Sampling

Assume that you want to survey 10 homes as a sample in a 100-home neighborhood. To conduct systematic sampling, you
must begin at the first home (#1) in an ordered array. You then select every 10th home to include in your sample. By the time
you reach the 100th home, you will have your 10 samples.
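The neighborhood example can be sketched as follows; picking the starting home at random within the first interval is the standard way to keep the draw probabilistic:

```python
# Sketch: systematic sampling of 10 homes from a 100-home neighborhood.
# The sampling interval is k = N / n = 100 / 10 = 10; we pick a random
# start within the first interval, then take every 10th home after it.
import random

homes = list(range(1, 101))   # home numbers 1..100
n = 10
k = len(homes) // n           # interval of 10

random.seed(7)
start = random.randrange(k)   # random start within the first interval
sample = homes[start::k]
print(sample)                 # 10 evenly spaced home numbers
```

Every home in the sample is exactly 10 positions from its neighbor, which is what makes the method "systematic".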

Stratified Sampling

Stratified sampling is the third type of probability sampling method. In this method, your population is divided into two or
more subgroups/subpopulations (or strata) according to characteristics that they share. For example, you can group them
according to income, age, or political affiliation.
The following is the process to conduct stratified sampling:

1. The first step is to assign your population to different strata. The cut-off points between one stratum to another may not
be clear and you will have to identify them. For example, if you are interested in dividing your population into low, middle,
and high income earners, you must make a cut-off point between income groups.

2. The second step is to select a simple random sample from each stratum. The samples have to correspond to the size of
the stratum. For example, if the low income group has 100 people while the middle income earners are 50 people, your
random sample sizes may be 10 and 5 people, respectively.

3. The third step is to combine the samples from the strata into the study samples. As seen in Figure 7.2.3-1, the darkened
samples from each stratum constitute the desired sample.

Figure 7.2.3-1 Stratified Sampling

Stratified sampling is a method used frequently in choosing samples from populations that differ in ethnic and
socioeconomic status.
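The three steps above can be sketched as follows, with hypothetical stratum sizes matching the example (100 low-income and 50 middle-income earners, sampled at 10% each):

```python
# Sketch: proportional stratified sampling. Stratum sizes are
# hypothetical; a simple random sample is drawn from each stratum
# in proportion to its size (10% here).
import random

strata = {
    "low":    [f"L{i}" for i in range(100)],   # 100 low-income earners
    "middle": [f"M{i}" for i in range(50)],    # 50 middle-income earners
}
fraction = 0.10

random.seed(1)
sample = []
for name, members in strata.items():
    k = round(len(members) * fraction)         # 10 from low, 5 from middle
    sample.extend(random.sample(members, k))   # step 2: SRS within stratum

print(len(sample))  # step 3: combined sample of 15
```

Sampling each stratum at the same fraction keeps the combined sample's composition matched to the population's.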

Cluster Sampling

Cluster sampling is similar to stratified sampling in that it divides a population into several clusters. However, these clusters
may not be representative of the population and don’t necessarily have common characteristics. For example, a city is
clustered into districts strictly based on geography. These districts represent subpopulations from which samples are drawn.

The following is the process to conduct cluster sampling:

1. The first step is to assign your population to different clusters.

2. The second step is to randomly select clusters for the sample as seen in Figure 7.2.4-1.

3. The third step is to either use all the members or items in the cluster or to use another sampling method such as
simple random or systematic sampling to draw a sample for study.

Cluster sampling is also known as a ‘two-staged’ or ‘multi-staged’ type of sampling method because once a population is
clustered, a second sampling method is needed. This method is especially useful in conducting research on electoral votes as
polling stations can be used as clusters.

Marketing researchers use cluster sampling also by studying geographical clusters such as districts and their consumer
behavior. Within these clusters, they divide them according to other characteristics such as socioeconomic status or other
demographics. In other words, the population is sorted once by location and within locations, it is sorted again by different
characteristics.
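The two-stage process above can be sketched as follows; the district names and resident counts are hypothetical:

```python
# Sketch of two-stage cluster sampling: a city's districts (clusters)
# are selected at random first, then a simple random sample of
# residents is drawn within each chosen cluster. All names hypothetical.
import random

clusters = {
    "North": [f"N{i}" for i in range(30)],
    "South": [f"S{i}" for i in range(40)],
    "East":  [f"E{i}" for i in range(25)],
    "West":  [f"W{i}" for i in range(35)],
}

random.seed(3)
chosen = random.sample(list(clusters), k=2)    # stage 1: pick 2 districts

sample = []
for name in chosen:                            # stage 2: SRS within each
    sample.extend(random.sample(clusters[name], k=5))

print(chosen, len(sample))  # 2 district names and a combined sample of 10
```

Unlike strata, the clusters here are defined purely by geography, so the researcher relies on the random selection of clusters (rather than shared characteristics) for representativeness.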
Figure 7.2.4-1 Cluster Sampling

7.3 Non-Probability Sampling

Unlike probability sampling, non-probability sampling doesn’t rely on known selection probabilities to determine whether a person or
an item is drawn as a sample from a population. In other words, in a non-probability sample, a person or item may never
have a chance to be chosen to participate in the study from the population. As seen in Figure 7.3.4-2, there are two types of
this sampling method: judgment and convenience.

Figure 7.3.4-2 Non-Probability Samples

Convenience Sampling

Convenience sampling means that the researcher simply grabs the first available participants or items. For example, a
shopping mall manager interested in measuring the satisfaction of customers prepares a survey. From opening hours until
midday, the manager gives out the survey to people who are shopping in the mall. This ready and convenient sample is
chosen because they are present.

Convenience sampling is widely used by business marketers and social science researchers interested in measuring
consumer behaviors and attitudes. This sampling method can be useful as an initial step toward a larger study because it
gives the researcher an idea about trends. This is useful to the researcher in developing more detailed research questions in
a bigger study that uses probability sampling.

Judgment Sampling

Judgment sampling is another non-probability method that is used widely in business and other disciplines. In this method,
the opinions of pre-selected experts are solicited in a study. For example, in a study about the hiring practices of Chief
Executive Officers (CEOs) in large corporations, five or six CEOs may be chosen to be interviewed for the study. Who is chosen
to participate in the study depends on the judgment of the researcher. A researcher may be interested in a CEO who has
been in a position for at least 5 years. In this case, the chosen CEOs will all have that in common.
Judgment sampling is frequently used to survey the opinions of people who are known to be experts in their fields. This
expertise is the main criterion by which they are chosen for a research study, and it gives the researcher a wide margin of
judgment on whom to include in his or her study.

Sampling Methods' Strengths and Weaknesses

Each of the sampling methods presented in the chapter has its own strengths and weaknesses. The choice of which method
to use in your own business research will depend on a number of issues that include:

 How much you know about your research questions


 What goals you wish to achieve from the study. In other words, are you studying a new market to improve current products or
introduce new products?
 Who the audience for your report will be. The accuracy and detail of the sampling method will be dictated by who will read and
benefit from your research study. In other words, will the results be shared with employees in your department only or are you
planning to publish the research in the Harvard Business Review?

Once you have answered these questions, you can determine which of the sampling methods is appropriate for your
research. Despite the nature of your research questions and your intended purpose and audience, there are limitations to
which sampling method you choose:

1. Limited prior knowledge of the business problem. If you are in the preliminary stages of investigation into your business
problem or need, the sampling method will have to be exploratory in nature at the beginning. For example, the cluster
sampling method is a good choice for an investigative research study.

2. Costs and operational limitations. Business research will always be limited by the cost and practicality of conducting a
study.

3. Time limitations.

Table 7.3.1-1 Comparison of the Strengths and Weaknesses of Different Sampling Methods
7.4 Summary of Key Concepts

 Choosing the right population and the right testing sample is important in gaining insights into business needs and problems.
 The sampling process is concerned with finding the best sample for a research study.
 In business research, the population that you are interested in studying is called a Target Population.
 A Sampling Frame is a list of common elements that allows access to a population such as phone directories, geographic maps,
or customer/client directories.
 The type of sampling method to use will depend on a number of business conditions such as the type of business problem,
cost and time limits, and accuracy requirements and expectations.
 In probability sampling, sample items (such as people or products) are chosen from a larger population based on known
probabilities.
 Probability sampling can be simple random sampling, systematic sampling, stratified sampling, or cluster sampling.
 In non-probability sampling, sample items are chosen from a larger population without regard for their probabilities of
occurrence.
 Non-probability sampling can be convenience sampling or judgment sampling.
 Simple random sampling is a method that ensures that population items have an equal chance of being included in a sample.
 Systematic sampling is a method that arranges your target population into an ordered scheme from which you can choose
participants or items at regular intervals.
 Stratified sampling is a method that divides a population into two or more subgroups (or strata) according to characteristics
that they share such as income, age, or political affiliation.
 Cluster sampling is similar to stratified sampling by dividing a population into several clusters such as geographic location.
 Convenience sampling means that the researcher simply grabs the first available participants or items.
 Judgment sampling is a method where the opinions of pre-selected experts are solicited in a study.
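The sampling methods recapped above can each be sketched in a few lines of Python. The population here is a made-up list of 1,000 customer IDs spread across 10 districts; the IDs, districts, and stratum definitions are illustrative assumptions, not part of the courseware.

```python
import random

random.seed(7)  # reproducible draws

# A made-up population: 1,000 customer IDs spread across 10 districts
population = list(range(1000))
district = {cid: cid % 10 for cid in population}

# Simple random sampling: every ID has an equal chance of inclusion
simple = random.sample(population, k=50)

# Systematic sampling: order the population, then take every 20th ID
systematic = population[::20]

# Stratified sampling: split into strata (even vs. odd IDs stand in for a
# shared characteristic such as income bracket), then sample each stratum
even_ids = [c for c in population if c % 2 == 0]
odd_ids = [c for c in population if c % 2 == 1]
stratified = random.sample(even_ids, k=25) + random.sample(odd_ids, k=25)

# Cluster sampling: randomly pick 2 whole districts, keep everyone in them
chosen_districts = random.sample(range(10), k=2)
cluster = [c for c in population if district[c] in chosen_districts]

# Convenience sampling: simply grab the first available IDs
convenience = population[:50]
```

Judgment sampling has no random mechanism to sketch: the researcher hand-picks the items (for example, experts meeting a chosen criterion), so it does not reduce to a one-liner.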

Glossary of Terms

Cluster sampling: A probability sampling method based on dividing a population into clusters, most often geographic.

Convenience sampling: A non-probability sampling method that uses the first available items or participants.

Judgment sampling: A non-probability sampling method that solicits the opinions of pre-selected experts.

Non-probability sampling: A sampling method that draws items from a population without regard for their
probabilities of occurrence.

Probability sampling: A sampling method that draws items from a population based on known probabilities.

Sample size: Number of items (people, products, or events) that is drawn from a population.

Sampling frame: A source list of populations based on common elements.

Sampling process: A multi-step process that ensures the choosing of an appropriate sample from a larger population.

Simple random sampling: A probability sampling method based on equal opportunities at inclusion.

Stratified sampling: A probability sampling method based on dividing a population into strata based on common
characteristics.

Systematic sampling: A probability sampling method based on arranging a population into an ordered scheme.

7.5 Chapter Review Questions

Descriptive Questions

 List and describe the six (6) steps in the sampling process.
 Detail the strengths and weaknesses of stratified and cluster sampling.
 What are the limitations facing a business researcher when studying a population?

Multiple Choice Questions

Mark the correct answer.

1. A business researcher usually cannot survey an entire population due to all of the following EXCEPT:
a) Time limitation
b) Cost limitation
c) Probability
d) Practicality

2. The first step in the sampling process is to:

a) Identify the sampling frame
b) Identify the sampling method
c) Choose the right population
d) Determine the sample size

3. An example of a sampling frame is:

a) District cluster
b) Sample size table
c) Socioeconomic status
d) Telephone directory

4. In stratified sampling, the population is divided into:

a) Frames
b) Strata
c) Tables
d) Surveys

5. In judgment sampling, the researcher draws a sample of:

a) Pre-selected experts
b) Population items
c) Chief Executive Officers
d) Random participants

6. An ideal sampling method to use in the initial stages of understanding a business problem is:
a) Simple random
b) Stratified
c) Judgment
d) Cluster

Answer Key

1- c, 2- c, 3- d, 4- b, 5- a, 6- d

References

Anderson, D.R., Sweeney, D. J., and Williams, T.A. (2007). Statistics for Business and Economics. South-Western College
Publication. 10th Edition.

Freedman, D., Purves, R., and Pisani, R. (2007). Statistics. W.W. Norton & Company. 4th Edition.

CHAPTER 8

8 Testing Your Business Hypothesis

Overview
For more than 20 years, scientists, business people, and government officials have been debating global warming. Many
people believe that humans’ increased use of cars and factories contributes to global warming, which has catastrophic
effects on the environment. A few scientists and business people argue that global warming is simply a natural occurrence
and that environmental changes are happening by chance.

The challenge for any researcher is to determine whether something is happening by chance or because of the effects of a
variable. This is the basis of inferential statistics, which we have covered in earlier chapters. The initial goal of any research
is to look at a set of data and determine the patterns within it.

The next goal is to determine whether the patterns are being caused by one or more factors or by chance. In other
words, proper research should tell you whether an outcome is caused just by chance or if there is something else that goes
beyond chance that may be causing it.

One of the basic methods of inferential statistics that looks at data patterns is hypothesis testing. Hypothesis testing helps a
researcher determine the difference between a real and a random pattern in a data set. It helps determine whether two
variables that are occurring have a real relationship or are simply happening at the same time.

In this Chapter, we will explore hypothesis testing and its implications in business. The objectives of this chapter are to cover
the following:

 To understand the difference between a population and a sample


 To understand the standardization of a sample statistic
 To understand the hypothesis testing process

Keywords

Alternative Hypothesis
Null Hypothesis
Z-score
Population
Sample
Hypothesis Testing
Type I Error
Type II Error

8.1 Populations and Samples

We've covered in previous chapters the differences between descriptive and inferential statistics where we determined that
descriptive statistics describe a sample, not the population. We’ve also learned that inferential analysis involves making
decisions or drawing conclusions about a population based on results from a sample. For example, assume that your
company’s marketing department claims that the population in your city eats at restaurants 2 times per week, on average.
Since it is not possible to question every person in this population about their restaurant dining habits due to time, cost, and
logistics limitations, a sample is tested instead. A sample is drawn from this population using one of the sampling methods to
find out if it is indeed true the claim that city residents eat at restaurants 2 times per week, on average.

Remember that, in research, a population is not only people but can be products or events. When conducting research in
business, you will see problems or needs formulated as research questions:

 As a percentage of production, what percentage of the car seats manufactured in one week (a population) is defective?
 How many employees will pass the mandatory safety examination based on a sample?
 How many more units are produced during the evening shift compared to the morning shift?
 Is there a significant difference between the car purchasing behavior of people with graduate education and those without
such an education?

In everyday business activities, an answer to these questions may begin with an ‘educated guess’. In statistics, an educated
guess is called a ‘hypothesis’. A hypothesis, therefore, is a claim about a population characteristic (or, as it is termed, a
parameter). For a hypothesis to guide your research and help you answer questions, it must be clear and relevant to the
business problem or need.

Hypothesis – A claim or an educated guess about a population parameter such as the mean, variance, or standard
deviation.

Population Parameters

A population parameter can be as follows:

 Population mean
 Population variance
 Population standard deviation.

These population parameters will be very difficult to calculate because testing an entire population is usually too expensive
and time consuming. Instead, researchers have to make educated guesses based on prior information or research. For
example, assume that you live in a city of 100,000 people. To calculate the mean age, you must average all the ages of the
entire city population as follows:

Equation 8.1.1 Population Mean: μ = (x₁ + x₂ + … + x_N) / N

If you notice, the convention used for the population mean (μ) is different from that of the sample mean (X̄) that we have
covered in previous chapters. Similarly, the population variance is calculated by averaging the squared deviations of each
value from the mean of the population. The equation of the population variance is as follows:

Equation 8.1.2 Population Variance: σ² = Σ(xᵢ − μ)² / N

Finally, the population standard deviation (σ) is similar to the sample standard deviation in that it is the most commonly used
measure of dispersion around the mean. The population standard deviation is calculated by taking the square root of
the variance, and it is in the same unit as the original data. For example, if the population age is measured in years, the
standard deviation will also be in years. The formula for population standard deviation is as follows:

Equation 8.1.3 Population Standard Deviation: σ = √( Σ(xᵢ − μ)² / N )
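As a minimal sketch, the three population parameters can be computed directly when the full population is available. The ages below are a small made-up population for illustration, not data from the courseware.

```python
import math
import statistics

# A small, made-up "population" of eight ages (in years)
ages = [23, 35, 41, 29, 52, 47, 31, 38]
N = len(ages)

# Population mean: the sum of all values divided by N
mu = sum(ages) / N                                   # 37.0

# Population variance: the average squared deviation from the mean
variance = sum((x - mu) ** 2 for x in ages) / N      # 80.25

# Population standard deviation: the square root of the variance,
# expressed in the same unit as the data (years)
sigma = math.sqrt(variance)

# The standard library's population formulas agree with the hand-rolled ones
assert math.isclose(sigma, statistics.pstdev(ages))
```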

We can now compare population parameters and sample statistics and their conventions as seen in Table 8.1.1.

Table 8.1.1 Statistical Measures for Populations and Samples


Given how expensive and time-consuming it is to conduct research on an entire population, the next best strategy is to take a
representative sample from that population. This sample must be specially chosen to represent the population in variables
that matter. For example, in the city of 100,000 people, a sample drawn from this population has to represent all ages and
not just one group such as teenagers or senior citizens.

Once you have chosen this sample, the gap between the sample and the larger population is referred to as the Sampling
Error. A sampling error is simply the difference between the population parameter and the sample statistic.

Hypothesis Testing Process

A good hypothesis translates a research question or problem into a format that can be tested using any one of the available
research methodologies. Hypotheses, however, are claims or educated guesses about a population parameter such as a
population mean. Examples of hypotheses are:

 The average age of cell phone users city-wide is 40 years


 The average salary of teachers in the entire school district is $50,000
 A government agency claims that 20% of all companies file an amendment to their taxes
 A battery manufacturer believes that 3% of all products are a safety hazard
 A public health official believes that 60% of all elderly people have received a flu examination in the previous year.

Each of the above hypotheses is a claim about a population parameter. Most research, however, cannot test an
entire population because of limitations in time and cost. Instead, a researcher has to choose a representative sample from
the population and test whether the hypothesis is true. A representative sample doesn’t mean a big sample but one that
reflects the important characteristics of the population. For example, if wishing to test whether the average salary of teachers
in a school district is $50,000, a sample of teachers has to be drawn from a number of schools and not a single school or two.

Now that we have drawn a representative sample from the population, what happens next? The next step is to test the
hypothesis using one or more of the available research methods, such as surveys, questionnaires, or interviews.

The hypothesis testing process is comprised of three major steps:


 Formulate the statistical hypothesis (Null & Alternative Hypothesis)
 Compute the sample test statistic
 Make a statistical decision

Formulate the Statistical Hypothesis

Formulating a Null Hypothesis (Ho) is the first step in testing your hypothesis. A null hypothesis is a claim or guess that must
be tested. Assume that you wish to test the hypothesis that the average age of cell phone users city-wide is 40 years.

The null hypothesis is stated this way:

H0: μ = 40 years

Remember that the null hypothesis is always about testing a population parameter and not a sample statistic. In other
words, we are making a guess about the average age of cell phone users in the entire city.

In this step of the hypothesis testing process, you must begin with an assumption that the null hypothesis is true and that t he
average age of a cell phone user is 40 years. Think of the null hypothesis as the notion that a person is innocent until proven
guilty. In that sense, the average age of cell phone users is assumed to be 40 years until there is evidence that proves that it is
not.

Next, you must formulate the alternative hypothesis that you test against the null hypothesis. In other words, this is when you
try to prove whether the person is actually innocent or guilty. This is called the Alternative Hypothesis (H1) and, unlike the
null hypothesis, it is a test of a sample statistic. An alternative hypothesis is always framed as the negation of the null
hypothesis, as follows:

H1: μ ≠ 40 years
It is important to note the following about the alternative hypothesis, H1, in your research:
 The alternative hypothesis is the opposite of the null hypothesis
 The alternative hypothesis challenges the status quo by aiming to prove that it is incorrect. In the above example of cell phone
users, the alternative hypothesis aims to show that the mean age is either higher or lower than 40 years, the value claimed by
the null hypothesis.

Given this presentation of statistical hypotheses, we can apply them to some of our examples above by formulating null and
alternative hypotheses as follows:

 The average salary of teachers in the entire school district is $50,000 H0: μ = $50,000 H1: μ ≠ $50,000
 A government agency claims that 20% of all companies file an amendment to their taxes H0: μ = 20% H1: μ ≠ 20%
 A battery manufacturer believes that 3% of all products are a safety hazard H0: μ = 3% H1: μ ≠ 3%

Compute the Sample Statistic

We’ve covered that the null hypothesis is a claim about a population parameter while the alternative hypothesis is a research
study into a sample statistic. In this second step of the hypothesis process, a test is done on a sample chosen out of the
population. The results of this test are then generalized back to the population. This is the essence of inferential statistics
which we have covered in previous chapters. Characteristics of the larger population are inferred from the test results of the
smaller sample. For example, assume that you draw a sample of 15 individuals from your population of city dwellers who use
cell phones. As seen in Table 8.1.2, the ages range from 20 to 42.

Table 8.1.2 Age of a Sample of Cell Phone Users


In this example, the sample statistic that we will use is the mean (X̄). The mean of this sample of cell phone users will help us
test the validity of the null hypothesis. However, as with any sample statistic used in hypothesis testing, the mean has to be
standardized in order to be a valid test. Standardization of the sample statistic (X̄) means that each data value in Table 8.1.2
has to be transformed into a number that takes into account the sample mean and standard deviation.

One method of standardizing data values in a sample is to calculate the Z-Score. The z-score is calculated by subtracting
the sample mean from a data value and then dividing the result by the standard deviation, according to this formula:

Equation 8.1.4 Z-Score Calculation: z = (x − X̄) / s

To calculate the z-score for a data set, there are 3 steps to follow:

 Calculate the mean of the data


 Calculate the standard deviation of the data
 Subtract the mean from each individual data value in the sample and then divide the result by the standard deviation.

Step 1

The mean of data in Table 8.1.2 is as follows:

Thus, the mean or average age of cell phone users is 32.1 years.

Step 2

To calculate the standard deviation of data in Table 8.1.2, you must calculate the variance first as follows:

Once you have calculated the variance, the standard deviation is simply the square root of the figure as follows:
Step 3

Now that we have calculated both the mean and standard deviation of our sample data set, we must calculate the z-score for
each data value according to Equation 8.1.4. To calculate the z-score for the first data value (20 yrs.), you must follow these
steps:

Table 8.1.3 Age of a Sample of Cell Phone Users


So, what do these z-scores tell us about the sample and its individual data values? These z-scores tell us how the data in this
sample is scattered around the mean. For example, the first data point, 20, is 1.8 standard deviations below the mean (hence
the negative sign). The data point 40 is 1.2 standard deviations above the mean.
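Because Table 8.1.2 is not reproduced in this text, the ages below are illustrative stand-ins for the chapter's sample, chosen so that the mean matches the chapter's 32.1 years; the three steps mirror Equation 8.1.4.

```python
import statistics

# Illustrative ages for a sample of cell phone users, chosen so the
# sample mean matches the chapter's figure of 32.1 years
ages = [20, 25, 27, 30, 31, 33, 35, 38, 40, 42]

# Step 1: calculate the sample mean
mean = statistics.mean(ages)          # 32.1

# Step 2: calculate the standard deviation (population formula, / n)
sd = statistics.pstdev(ages)

# Step 3: z = (x - mean) / sd for each data value
z_scores = [(x - mean) / sd for x in ages]

# The age 20 comes out about 1.8 standard deviations below the mean
# (hence the negative sign), and 40 about 1.2 above it
```

Standardized values always have mean 0 and standard deviation 1, which is what makes z-scores comparable across data sets.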

Z-scores give you information on how the data is skewed. In other words, they give a picture of whether the distribution is
symmetric (as seen in the middle image below), left-skewed, or right-skewed. Remember that the calculation of the mean of
a data set is sensitive to outliers. It is possible to have the data skewed to either side of the mean, depending on the data set
you’ve chosen to sample.

Figure 8.1.1 Data Scattering around the Mean


Make a Statistical Decision

We calculated our sample mean to be 32.1 years. Remember that our statistical hypotheses were:

H0: μ = 40 years    H1: μ ≠ 40 years

The sample mean is smaller than the hypothesized population mean by 7.9 years. The next question that we ask ourselves is:
how likely is it to get a sample with a mean of 32.1 years if we assume that the mean age of the population of cell phone users
is 40 years? If the answer is that it is very likely, then we accept the null hypothesis. As seen in Figure 8.1.2, the sample mean
is close to the hypothesized population mean.

Figure 8.1.2 Accepting the Null Hypothesis

If the answer is that it is very unlikely, then we reject the null hypothesis and assume that the larger population has a
different mean that is closer to the sample mean. For example, assume that you draw a sample from your population of cell
phone users that shows an average age of 18 years. Given that this mean of 18 is much lower than the hypothesized
population mean of 40 years, we can reject the null hypothesis and conclude that the population from which this sample is
drawn cannot have a mean of 40 years.
Figure 8.1.3 Rejecting the Null Hypothesis

But how far from the population mean is ‘too far away’ to reject the null hypothesis? The decision to accept or reject a null
hypothesis depends on regions, or statistical borders, that you must set in order to make a business decision. These regions
are called ‘critical regions’ and, on the sampling distribution of the test statistic (the mean, in this case), they lie on the
opposite tail ends of the curve as seen in Figure 8.1.4.

Figure 8.1.4 Regions of Hypothesis Rejection on a Sampling Distribution

Critical values that make up the regions of rejection on either side of the sampling distribution depend on the level of
risk that you are willing to take when making a business decision. There are two types of risk, or errors as they are called,
that must be considered when determining the regions of rejection:

Type I Error

a. The probability of a type I error occurring is referred to as the level of significance and is symbolized as alpha (α).
b. This is an error that leads you to reject a null hypothesis when it is in fact true. In other words, assume that a sample mean
accurately approximates a population mean. In deciding on the regions of rejection on the sampling distribution, you
mistakenly set the region so wide that it leads to the rejection of the null hypothesis based on the sampling statistic. For
example, you would have committed a type I error if you reject the null hypothesis in our sample above where the mean age
of cell phone users is 32.1 years when it is, in fact, a close approximation of the population mean of 40 years.

c. A type I error is considered a serious error to make when conducting statistical analyses as it may lead to the rejection of
the results of a sample that could be representative of the larger population.

d. The level of significance, or the value of the type I error, is set by the researcher before the beginning of a research study.
The value of α signifies the risk of making such a mistake that a researcher is willing to accept. Typically, this value is set at
2.5%, 5%, or 10%.

Type II Error

a. The probability of a type II error occurring is referred to as the probability of accepting a null hypothesis when it should be
rejected. A type II error is symbolized as beta or ‘β’.

b. This is an error that leads you to accept a faulty null hypothesis. Assume that you choose a sample of cell phone users with
a mean age of 20 years when in fact, your population mean is 40 years. By accepting the null hypothesis, you believe that
your population is much younger than it actually is. This has serious implications for your business problem or need as you
cannot target a population for which you are missing basic demographic characteristics.

c. The probability of a type II error increases as you set the probability of a type I error lower.

Hypothesis Testing in a Nutshell

To summarize, hypothesis testing is a multi-step process as follows:

 Formulate your null hypothesis (H0) and alternative hypothesis (H1).


 Choose the level of significance (α) and the sample size
 Determine the sampling test statistic that you wish to test (such as mean).
 Determine the critical regions of rejection appropriate to your research.
 Collect data from the sample and complete necessary analyses.
 Make your statistical decision by accepting or rejecting the null hypothesis. If the sample statistic falls in the rejection region,
reject the null hypothesis. If the sample statistic does not fall in the rejection region, accept the null hypothesis. Make your
business decisions based on these conclusions.
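The steps above can be sketched end to end as a one-sample test of a mean. The sample ages are illustrative, and the 5% significance level with its two-tailed critical value of 1.96 are conventional choices, not values fixed by the courseware; with a sample this small a t-test would normally be preferred, but a z-style statistic mirrors the chapter.

```python
import math
import statistics

# Step 1: hypotheses about the population mean age of cell phone users
# H0: mu = 40 years    H1: mu != 40 years
mu_0 = 40

# Step 2: choose the significance level and an illustrative sample
alpha = 0.05
sample = [20, 25, 27, 30, 31, 33, 35, 38, 40, 42]
n = len(sample)

# Steps 3-5: collect the data and standardize the sample mean
x_bar = statistics.mean(sample)       # 32.1
s = statistics.stdev(sample)          # sample standard deviation (n - 1)
z = (x_bar - mu_0) / (s / math.sqrt(n))

# Step 6: compare against the two-tailed critical value for alpha = 0.05
z_critical = 1.96
decision = "reject H0" if abs(z) > z_critical else "fail to reject H0"
```

Here the standardized statistic falls deep in the rejection region, so the sample gives evidence against the claim that the population mean is 40 years.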

Remember that population parameters such as the mean or standard deviation are never truly known. Unless there are
massive budgets to test an entire population, these parameters will always be estimated as best as possible from a
number of sources. For example, a population’s average age can be approximated from the last census figures. Even then,
a census usually doesn’t gather information from 100% of the people it should, and a population changes in the meantime.

Other sources of educated guesses about these parameters are previous research studies. Given that population
parameters are usually unknown, formulating a null hypothesis requires careful gathering of information during the
early stages of research design.

8.2 Summary of Key Concepts


 Hypothesis testing helps a researcher determine the difference between a real and a random pattern in a data set.
 A hypothesis is a claim or an educated guess about a population parameter or a sample statistic such as mean, variance or
standard deviation.
 A sampling error is the difference between the population parameter and the sample statistic.
 A good hypothesis translates a research question or problem into a format that can be tested using any one of the available
research methodologies.
 The hypothesis testing process is comprised of three major steps: formulate the statistical hypothesis, compute the sample test
statistic, and make a statistical decision.
 The null hypothesis (H0) is always about testing a population parameter and not a sample statistic.
 The alternative hypothesis (H1) is a test of a sample statistic.
 A z-score is the standardization of a data value, calculated by subtracting the mean from the value and dividing by the
standard deviation.
 Z-scores give you information on how the data is skewed.
 Critical values that make up the regions of rejection on either side of the sampling distribution depend on the level of risk that
a researcher is willing to take when making a business decision.
 There are two types of risk, or errors as they are called, that must be considered when determining the regions of rejection: a
Type I error and a Type II error.
 A Type I error is an error that leads you to reject a null hypothesis when it is in fact true.
 A Type II error is an error that leads to the acceptance of a faulty null hypothesis.

Glossary of Terms

Alternative hypothesis (H1): A hypothesis about testing a sample statistic.

Hypothesis: A claim or an educated guess about a population parameter or a sample statistic such as mean or
standard deviation.

Null hypothesis (H0): A hypothesis about testing a population parameter.

Sampling error: The difference between the population parameter and the sample statistic.

Type I error: An error that leads you to reject a null hypothesis when it is in fact true.

Type II error: An error that leads to the acceptance of a faulty null hypothesis.

Z-score: Standardization of a data value by subtracting the mean from it and dividing by the standard deviation.

8.3 Chapter Review Questions

Descriptive Questions

 List the steps in the hypothesis testing process.


 What is the purpose of standardizing data values? How do you calculate a z-score?
 Differentiate between a Type I error and a Type II error. Which one has more serious implications in a research study?

Multiple Choice Questions

Mark the correct answer.

1: ______________ is the formula for standardizing a data value


2: A null hypothesis is based on an:
a) Educated guess about a population study
b) Educated guess about a sample study
c) Educated guess about a population parameter
d) Educated guess about a sample statistic

3: A sampling error is calculated by:

a) Subtracting the sample statistic from the population parameter
b) Subtracting the population parameter from the sample statistic
c) Subtracting the sample mean from the data value
d) Subtracting the data value from the sample mean

4: A Type I error is the probability of:

a) Rejecting the alternative hypothesis when it is true
b) Rejecting the null hypothesis when it is true
c) Accepting the null hypothesis when it is true
d) Accepting the alternative hypothesis when it is true

5: The probability of a type II error increases as you:

a) Set the probability of the wrong population parameter higher
b) Set the probability of the wrong population parameter lower
c) Set the probability of a Type I error higher
d) Set the probability of a Type I error lower

Answer Key

1- a, 2- c, 3-a, 4- b, 5- d.

References

Anderson, D.R., Sweeney, D. J., and Williams, T.A. (2007). Statistics for Business and Economics. South-Western College
Publication. 10th Edition.

Freedman, D., Purves, R., and Pisani, R. (2007). Statistics. W.W. Norton & Company. 4th Edition.

CHAPTER 9

9 Correlations between Business Variables


Overview

Inferential statistics allows business professionals to make guesses about the populations they are interested in. A car
manufacturer may introduce a new line of luxury cars in a large city. By studying the city’s upper income earners, the
manufacturer can target them for advertising.

However, in most business cases, a professional may not have all the information he or she needs about the desired
population. Due to constraints in time and budgets, as well as logistic issues, business professionals use samples to estimate
the qualities of the population of interest. Confidence interval estimation is one such efficient and popular method used to
make inferences about a population based on a sample.
make inferences about a population based on a sample.

In most business situations, the researcher does not have population parameters such as its standard deviation and mean.
To estimate confidence intervals when only sample specific information is available, the researcher must use distribution
tables to make inferences about the population.

In this Chapter, we will cover confidence interval estimation; specifically, how these estimates are used in making inferences
about a larger population.

The objectives of this chapter are as follows:

 Point estimations
 Confidence interval estimations
 Interval values
 Known vs. unknown standard deviation

Keywords

Confidence interval
Population standard deviation
Sample standard deviation
Mean
Point estimate

9.1 Point Estimation

Point estimation is a type of statistical inference in which a business researcher estimates an unknown parameter using
a single value (thus, the name point estimation). This estimation is done on sample data drawn from a population.

The goal of point estimation is to make an inference about a parameter such as the mean or proportion of a population as
seen in Figure 9.1.

Figure 9.1 Point Estimation


A point estimate is only as good as the sample from which it is taken. If a number of point estimates are taken from a number
of samples of the same population, the point estimates will vary. The ability to make inferences in business decision making
depends on how much you can trust the sample drawn from a population.

Point Estimation Example

A manager at a mechanical engineering firm is interested in studying the test scores of summer engineering interns. As the
firm hires some of those interns as full-time employees following graduation, the manager is interested in estimating their
mean test score, µ.

To estimate μ, the population mean test score, the firm manager would use the sample mean as a single number, referred to
as a point estimator. If the sample mean test score is 230, for example, we say that 230 is the point estimate for μ. The
following figure summarizes this example:
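In code, the point estimate is simply the sample mean. The intern scores below are made up, chosen so that the mean comes out to the 230 used in the example.

```python
import statistics

# Hypothetical test scores for a sample of five summer interns
scores = [210, 225, 230, 235, 250]

# The point estimate of the population mean test score (mu) is the
# single value given by the sample mean
point_estimate = statistics.mean(scores)   # 230
```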

9.2 Confidence Interval Estimation

Interval Estimation

In business, point estimation of a population parameter is associated with significant uncertainty. For example, if the
engineering firm manager above is to make conclusions on which interns to hire as employees, a single mean from a sample
will not be a good estimator. The manager will need better estimates so he can make a good decision on new hires.

In most cases, a point estimate is followed by interval estimation. An interval estimate provides better information than a
point estimate. Such interval estimates are referred to as confidence intervals. As seen in Figure 9.2.1, the width of a
confidence interval is substantial; the chance that a population parameter falls within that width is much larger than the
chance of a point estimate matching it exactly.

Figure 9.2.1 Interval Estimation

So what is a confidence interval?

A confidence interval is a range of values that you believe a population parameter falls within. For example, suppose that
you are a marketing manager interested in the mean salary of middle-aged women buying your company’s sports utility
vehicles. Since your population is a little over 250,000 women, you need to estimate the mean salary of this large
population. Based on a sample that you draw, you calculate, with a certain level of confidence (e.g. 95%), that the salary
mean falls within a range of values.

Imagine that the sample indicates that the salary mean is between $50,000 and $60,000 per annum at a 95% confidence
level. This means that you are 95% confident that the actual salary mean of the larger population falls between these two
values, and you accept a 5% chance that it lies below $50,000 or above $60,000. Such an interval can be either
one-tailed or two-tailed.

To better understand confidence interval estimation, we have to recall sampling distributions. A sampling distribution is the
distribution of all the possible values of a sample statistic for samples of a given size selected from a population of
interest. For example, imagine the engineering manager samples 25 interns one summer and records their test scores, then
repeats the process to obtain many different samples of 25.

Figure 9.2.2 Sampling Distribution of Test Scores


For each sample of 25 drawn from the large population of interns, the manager calculates a mean. From all these sample
means, the manager can construct a sampling distribution. Figure 9.2.2 above shows how the distribution of test score
means would look. As the sample size is increased, the distribution becomes closer to normal.
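The idea behind Figure 9.2.2 can be sketched in a few lines of Python. This is a simulation of my own, not part of the courseware: it assumes a hypothetical intern population with mean 230 and standard deviation 40, draws many samples of 25, and inspects the distribution of their means.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

POP_MEAN, POP_SD = 230, 40   # hypothetical intern test-score population
sample_means = []
for _ in range(2000):        # draw 2000 samples of n = 25 each
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(25)]
    sample_means.append(statistics.mean(sample))

# The sample means cluster around the population mean of 230, and their
# spread is close to the theoretical standard error sigma/sqrt(n) = 40/5 = 8.
print(round(statistics.mean(sample_means), 1))
print(round(statistics.stdev(sample_means), 1))
```

Plotting a histogram of sample_means would reproduce the bell shape sketched in Figure 9.2.2.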

Interval Values

As covered above, a confidence interval estimate provides more accurate predictions about a population parameter than a
point estimate. Such an interval gives values that have the following properties:

 Accounts for variation in sample statistics from sample to sample, as seen in the engineering firm example: a sample in one
office may have a mean test score of 230 while a sample at another office has a mean test score of 250.
 Is computed from a single sample.
 Gives good information on how accurately the interval approximates the population parameter; an interval is far more likely
than a point estimate to capture that parameter.
 Is described in terms of a level of confidence, e.g. 95% confident, 99% confident, etc.
 Can never be stated with 100% confidence, because sampling never tests the population in its entirety.

As seen in Figure 9.2.3, confidence intervals can be calculated for a population mean or a population proportion.

Figure 9.2.3 Confidence Intervals and Population Parameters


For these calculations, there are a number of assumptions that must be observed:

 The population standard deviation (σ) is known to the business researcher.
 The population is normally distributed, as discussed above.
 If the researcher realizes that the population is not normal, a larger sample is necessary to calculate the confidence interval.

Remember that regardless of the underlying distribution of the population, as the sample size increases the sampling
distribution of the mean approximates a normal distribution. The result of a 95% confidence interval estimation in the
engineering firm example can be summarized as follows:

'If all possible samples of size n are taken from the engineering firm's databases and a confidence interval for the mean test
score is constructed from each, then 95% of all those intervals will include the true value of the unknown
population mean'.

A confidence interval takes the general form:

point estimate ± (critical value) × (standard error of the point estimate)

What does this formula mean? The point estimate is the sample statistic that has been calculated; in the engineering firm
example, the sample mean test score of 230 (n = 200) is the point estimate. How confident you want to be depends on the
level that you choose, as seen in Table 9.1.

Table 9.1 Confidence Levels and Zα/2 Values

Someone may want to be 80% confident that the interval values they've calculated will contain the population parameter of
interest. The engineering firm manager may want to be 90% confident that the population mean test score falls within the
interval calculated. In most business research, researchers set the confidence level at 95% or 99% to increase the
likelihood that the calculated interval will contain the population parameter.

An important step in calculating confidence intervals is to find the critical value, Zα/2. This critical value reflects the fact that
a confidence interval estimate is a two-tailed inference. In the standard normal distribution, the center (corresponding to the
point estimate) is at zero, and the two tails mark the limits set by the critical value.

As seen in Figure 9.2.4, depending on the confidence level chosen, the lower and upper confidence limits can be calculated.
The two limits are at an equal distance from zero (the point estimate). In the engineering firm example, the point estimate
would sit in the middle of the limits.

The margin of error is added to and subtracted from the point estimate to obtain the upper and lower confidence limits.

Figure 9.2.4 Calculation of Critical Value, Zα/2


To calculate the critical value, the following steps are followed:

1. Determine the desired confidence level, for example 90%, 95% or 99%.

2. Subtract the confidence level from 1 to obtain α.

3. Divide α by two, since a confidence interval estimate is a two-tailed inference test.

4. Determine the Lower Confidence Limit and the Upper Confidence Limit from Table 9.2 (at the end of this chapter).

If the manager at the engineering firm has set the confidence level at 95%, calculating the critical value will follow
these steps:

1. Since the confidence is set at 95% (0.95), it is subtracted from 1 = 1-0.95 = 0.05

2. This figure is divided by two for a two-tailed test: 0.05/2 = 0.025

3. Refer to Table 9.2 and find the number 0.025 (written as 0.0250). Tracing it to the left-hand column gives 1.9;
tracing it to the top row gives 0.06. Adding the two yields ±1.96. This is
the critical value of the confidence interval estimation.

If the manager at the engineering firm has set the confidence level at 99%, calculating the critical value will follow
these steps:

1. Since the confidence is set at 99% (0.99), it is subtracted from 1 = 1-0.99 = 0.01

2. This figure is divided by two for a two-tailed test: 0.01/2 = 0.005

3. Refer to Table 9.2; the number 0.005 falls between 0.0051 and 0.0049. Tracing these to the left-hand
column gives 2.5; tracing them to the top row gives 0.07 and 0.08, whose average of 0.075
is rounded up to 0.08. Adding the row and column values yields ±2.58. This is the critical value of the confidence interval
estimation.
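The table-lookup steps above can be cross-checked in code. A minimal sketch using Python's standard-library NormalDist, which computes Zα/2 directly from the confidence level via the inverse CDF (no Z-table needed):

```python
from statistics import NormalDist

def z_critical(confidence: float) -> float:
    """Two-tailed critical value Z_{alpha/2} for a given confidence level."""
    alpha = 1 - confidence                       # step 2: subtract from 1
    return NormalDist().inv_cdf(1 - alpha / 2)   # step 3: split over two tails

print(round(z_critical(0.95), 2))   # 1.96, matching the first worked example
print(round(z_critical(0.99), 2))   # 2.58, matching the second
```

Note that the exact value for 99% is 2.5758..., which the Z-table procedure rounds to 2.58.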

9.3 Estimation Process

So far, we have discussed confidence interval estimation as an inference about a population parameter based on a sample.
Population means (µ) are usually unknown, as it is difficult to test a large population due to:
 The high cost of testing every item or individual in the population.
 The time required; wide-scale testing is so time consuming that census data, for example, is collected only once a decade.
 The difficulty of finding all members or individuals of a population, especially a human one.

To avoid the time and high cost of this type of research, business researchers rely on inferential statistics such as
confidence interval estimation to estimate the population parameter. They draw a sample from the population
and calculate its statistic (such as the mean).

For example, suppose a marketer samples a group of elderly shoppers in a mall. He finds out that their average age is 65
years. After setting a confidence level and calculating its upper and lower limits, the marketer concludes that he is 95%
confident that the population of elderly shoppers at the mall have a mean age that falls between 55 and 75 years.

Similarly, think of a police officer monitoring drivers travelling at speeds of up to 20 km/h above the city's limit (which is set at
80 km/h). Since the officer cannot test the entire city population that drives above the limit, she chooses a sample. The
sample mean driving speed is shown to be 90 km/h. The officer sets a confidence level and calculates the upper and lower
limits. She concludes, with 95% confidence, that drivers in the city who break the speed limit have a mean speed between
85 km/h and 95 km/h.

These examples illustrate that confidence interval estimation is a useful inferential tool when only a sample of a
population can be tested. There are two cases of interval estimation:

1. The researcher knows the standard deviation of the population (σ).


2. The researcher does not know the standard deviation of the population.

Known Standard Deviation

One of the assumptions of confidence interval estimation is that the population standard deviation (σ) is known to the
business researcher. Additionally, it is assumed that the population is normally distributed. In many cases, a business
researcher may increase the size of a sample drawn from a population so that the distribution is normal. To calculate the
confidence interval with a known population standard deviation, the following formula is used:

x̄ ± Zα/2 σ/√n

Example 1

A sample of 25 collectors' edition books from a national library (a large normal population) has a mean weight of 1.15 kgs.
Based on previous studies of this national library, it is known that the population standard deviation is 0.20 kgs.

Calculate a 95% confidence interval for the mean weight of the population of books.
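The courseware does not show the worked answer to Example 1, so here is a minimal computation of x̄ ± Zα/2 σ/√n with the values stated above:

```python
from math import sqrt
from statistics import NormalDist

x_bar, sigma, n = 1.15, 0.20, 25      # sample mean, population sd, sample size
z = NormalDist().inv_cdf(0.975)       # 95% two-tailed critical value, ~1.96
margin = z * sigma / sqrt(n)          # standard error is 0.20 / 5 = 0.04
lower, upper = x_bar - margin, x_bar + margin
print(round(lower, 4), round(upper, 4))   # 1.0716 1.2284
```

So we are 95% confident that the mean weight of the population of books is between about 1.07 kg and 1.23 kg.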

Example 2

A used car salesman draws a sample of 15 cars from his lot (a large normal population). He calculates that the mean repair
cost of the sample has been $350 before showcasing the cars to potential customers. From previous knowledge, it is known
that the population standard deviation is $75.

Calculate a 99% confidence interval for the mean repair cost of the population of cars.

The 99% confidence interval for the mean repair cost of the population is 300.0385 ≤ µ ≤ 399.9615. In other words, we are 99%
confident that the population mean, µ, is between 300.04 and 399.96.
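The same computation can verify Example 2. One caveat: the interval quoted above uses the rounded Z-table value 2.58, while the exact critical value is about 2.5758, so the interval below comes out marginally narrower; either is acceptable in practice.

```python
from math import sqrt
from statistics import NormalDist

x_bar, sigma, n = 350, 75, 15
z = NormalDist().inv_cdf(0.995)       # exact 99% critical value, ~2.5758
margin = z * sigma / sqrt(n)          # standard error is 75 / sqrt(15)
lower, upper = x_bar - margin, x_bar + margin
print(round(lower, 2), round(upper, 2))   # 300.12 399.88
```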

Unknown Standard Deviation

In most real life business situations, the population standard deviation (σ) is not known. In those cases where the population
standard deviation is not available, the researcher can use the sample standard deviation, S, instead. However, it is
important to remember that S varies from sample to sample, which introduces extra uncertainty into the estimate of
the population parameter.

There are certain assumptions that are made:

 The population standard deviation (σ) is NOT known to the business researcher.
 The population is normally distributed, as discussed above.
 If the researcher realizes that the population is not normal, a larger sample is necessary to calculate the confidence interval.

To calculate the confidence interval with an unknown population standard deviation, the following formula is used:

x̄ ± tα/2 S/√n

(where tα/2 is the critical value of the t distribution with n - 1 degrees of freedom and an area of α/2 in each tail)

Instead of using the normal Z distribution (Table 9.2) in this section’s examples, we use the Student t distribution (Table 9.3) to
calculate the confidence interval estimate with an unknown population standard deviation (σ).

What is the t distribution? The t distribution is a family of distributions in which the tα/2 value in the formula above
depends on the calculated degrees of freedom (df). The degrees of freedom, as seen in earlier chapters, is the number of
observations that are free to vary after the sample mean has been determined. The formula is df = n – 1. For example, a
data set of 10 numbers with a mean of 50 has df = n – 1 = 10 – 1 = 9.

As seen in Figure 9.3.1, the t distribution curve approaches the shape of a Z distribution as the sample size (n) increases. The
curve with df = 7 (indicating a sample of 8) is relatively flat. As the df increases (indicating an increase in sample size),
the curve becomes less flat and approximates a Z distribution.

This is an important thing to consider when choosing a sample size for confidence interval estimation. It is always better to
choose a bigger sample so there is more confidence in the interval values that are calculated. As the sample size increases,
the degrees of freedom increase which means that the critical values for the confidence level approximate the Z distribution.

Figure 9.3.1 t-Distribution Curves


This comparison is another illustration that as the sample size (and with it the degrees of freedom) increases, a t
distribution more closely approximates the values of a Z distribution.

Example 1

A business school administrator chooses a random sample of 20 students with a mean age of 28 years and a sample standard
deviation (S) of 3.5 years. The administrator wishes to calculate a 95% confidence interval for the population mean, µ.

Since the sample size is 20, df = n – 1 = 19, so tα/2 = t0.025 = 2.093. (To get this number, use Table 9.3, look for df = 19 in
the left-most column, and read the figure under 0.025 (1 tail) or 0.05 (2 tail).)

The 95% confidence interval for the mean age of the population is 26.362 ≤ µ ≤ 29.638. In other words, we are 95% confident
that the population mean, µ, is between 26.36 and 29.64 years.
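This example can be verified with a short script. Python's standard library has no t distribution, so the critical value 2.093 is hard-coded from Table 9.3 rather than computed:

```python
from math import sqrt

x_bar, s, n = 28, 3.5, 20
t_crit = 2.093                        # t(0.025, df = 19) taken from Table 9.3
margin = t_crit * s / sqrt(n)         # estimated standard error is 3.5 / sqrt(20)
lower, upper = x_bar - margin, x_bar + margin
print(round(lower, 3), round(upper, 3))   # 26.362 29.638
```

The output matches the 26.362 ≤ µ ≤ 29.638 interval stated above.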
Example 2

A grocery store chain manager is interested in studying the average distance a customer travels to a selected store. She
chooses a random sample of 25 customers who travel a mean distance of 8 kilometers, with a sample standard deviation of 4
kilometers. The manager wants to calculate a 99% confidence interval for the population mean, µ.

Since the sample size is 25, df = n – 1 = 24, so tα/2 = t0.005 = 2.797. (To get this number, use Table 9.3, look for df = 24 in
the left-most column, and read the figure under 0.005 (1 tail) or 0.01 (2 tail).)

The 99% confidence interval for the mean distance travelled of the population is 5.7624 ≤ µ ≤ 10.2376. In other words, we are
99% confident that the population mean, µ, is between 5.76 and 10.24 kilometers.

Summary approach

To summarize, the estimation of confidence intervals follows a number of steps:

1. Determine whether the population standard deviation is available. If not, use the sample standard deviation, S, to
estimate the confidence interval.

2. Set the desired confidence level, such as 90%, 95% or 99%, and obtain α = 1 – confidence level.

3. Calculate the degrees of freedom if using the sample standard deviation.

4. Divide α by two, since a confidence interval estimate is a two-tailed inference test.

5. Determine the Lower Confidence Limit and the Upper Confidence Limit from a Z distribution if the population standard
deviation is available. Use a t distribution if the population standard deviation is unknown.

Depending on whether the population standard deviation is known or unknown, a number of assumptions are made
before estimating confidence intervals. The first assumption is that the population (or sample) standard deviation is available;
one or the other must be known for the formulas above to be used. The second assumption is that the population is
normally distributed. Finally, it is understood that if the population is not normally distributed, the researcher will have to
increase the sample size used in the calculations.

Table 9.2 Normal Distribution Z-Table


Table 9.3 Student’s T Distribution

9.4 Correlation

We’ve covered in earlier chapters that measures of central tendency such as the mean or median give information on how
well data center around a common value. We’ve also learned that measures of dispersion such as standard deviation tell us
how data values are dispersed around the mean of the data set. An important descriptor of data is the correlation between
variables. In business settings, you will at times be interested in relationships between variables. For example, assume that
you are launching a new television set in a city. You note that over the 90-day promotional period, the higher the average
income of a district (variable X), the higher the sales figures (variable Y). What does this tell us? It tells us that average income
and sales figures for a given district have a relationship. In other words, these two variables are correlated.

Correlation analysis of variables such as sales and income gives us a numerical figure. This figure gives a business
professional two important pieces of information about the relationship between the variables:

1. The type of relationship


2. The strength of the relationship

Type Of Relationship

Assume that you are interested in the relationship between these sets of variables such as the following:

 Length of education in years and beginning salary


 Company financial ratios and growth index
 Customer satisfaction scores and number of staff training hours

For each set of these variables, we do NOT designate one an independent variable and the other a dependent variable.
Instead, we are simply interested in the nature of the relationship between them. For example, for the first set of variables,
we may be interested in whether more years of education mean a higher salary. For the second set, we may be
interested in finding out whether high financial ratios are correlated with a low growth index.

The type of a correlation describes whether an increase in one variable (such as length of education) is accompanied by no
change, an increase or a decrease in the second variable. The relationship between the two variables can be one of two types,
as seen in Figure 9.4.1:

1. Linear - In a linear relationship, as the variable X values increase, the variable Y values change in one direction only and can
be shown graphically as a straight line

2. Curvilinear - In a non-linear or curvilinear relationship, as the variable X values increase, the variable Y values do not tend to
only increase or only decrease: the Y values change their direction of change.

Figure 9.4.1 Nature of Correlations between Two Variables


Strength Of The Relationship

Correlation analysis between variables is concerned not only with the type of the relationship but with its strength as
well. Assume that you are a restaurant owner who is interested in how wine prices purchased by patrons are related to
overall expenditure at the restaurant. You hypothesize that people who dine with the most expensive wines also spend more
money overall. However, to get an idea of how strong the relationship is, you must analyze and compare values from
individuals who have ordered both wine and dinner.

The strength of the relationship between two variables is quantified through a number referred to as a correlation
coefficient. The value of this descriptive statistic falls between -1 and +1. The type of correlation covered in this chapter,
and the foundation of closely related topics in coming chapters, is the Pearson correlation. The Pearson correlation is
sometimes referred to as bivariate correlation and is named after its developer, Karl Pearson. It is the most widely used
correlation coefficient in business research.

In business, the main purpose behind a correlation coefficient such as the Pearson coefficient is to:

1. Assess the type (or direction) of the relationship as seen in Table 9.4.1. For example, if one variable increases, the
coefficient tells us whether the second variable increases or decreases.

2. Quantify the strength of the relationship through a number from -1 to +1.

Table 9.4.1 Types of Correlations and their Strength


There are several assumptions that are made about the correlation coefficient between the two variables, X and Y:

1. The correlation coefficient assumes that there is a linear relationship between the two variables. In other words,
the two variables are related in a direct manner. For example, assume that you are a manager who wishes to
understand how your marketing budget affects your product sales. By plotting your marketing spending against
your product sales, you realize that there is no relationship between the two. With further analysis, you realize that
quality is what determines sales for this particular product. So no matter how much money you spend in marketing,
the poor quality of the product prevents good sales. In this case, there is no linear relationship between the two
variables.

2. The correlation coefficient also assumes that the data for the two variables is randomly selected. As we have
covered in earlier chapters, it is important that samples selected for statistical analyses are representative of the
general population. A correlation coefficient is not an accurate reflection of the relationship between two variables if
the sample is not randomly selected.

3. Finally, the correlation coefficient assumes that X and Y are independent of each other. If the two variables under
study are dependent on each other such as the amount of fat in the diet and cholesterol readings, then a correlation
coefficient is not a useful reflection of their relationship.

9.5 Correlation Coefficient

The Pearson correlation coefficient (r) is calculated by taking into account the deviation of each value of the two variables
from its mean. In other words, the mean of variable X is calculated and then subtracted from each value in the data set. The
calculation of the correlation coefficient is shown in Equation 9.5.1.

Equation 9.5.1 Pearson Correlation Coefficient (r)


Please note that Sx and Sy in the equation above are the sample standard deviations of X and Y, which take into account the
deviation of each data point from the variable mean. For X, Sx = √( Σ(Xi – X̄)² / (n – 1) ), and Sy is calculated in the same
way from the Y values.

It is important to note the following about the actual value of the correlation coefficient, r, once it is calculated:

 The value of the coefficient lies between -1 and +1.


 Positive correlation: If X and Y have a strong positive correlation, the coefficient value is close to +1. A value of exactly +1
indicates an exact positive fit, meaning that every change in X is matched by an equivalent change in Y in the same direction.
For example, if a factory has 10 machines with a production capacity of 500 units per week, then adding 10 more machines
means that the production capacity increases to 1000 units per week. In this case, there is a perfect positive fit between the
number of machines and the production capacity of the factory.
 Negative correlation: If X and Y have a strong negative correlation, the coefficient value is close to -1. A value of exactly -1
indicates an exact negative fit which means that for every negative change in X, there is an equivalent change in Y in the
opposite direction.
 No correlation: If there is no linear correlation or a very weak linear correlation, the value of the correlation coefficient
approaches the value of zero. A correlation coefficient value approaching zero indicates that there is a random, nonlinear
relationship between variables X and Y.
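The coefficient described above can be sketched from scratch in a few lines. The data below is illustrative, of my own invention, not taken from the chapter's tables; the function follows Equation 9.5.1 (covariance divided by Sx and Sy):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation per Equation 9.5.1, using sample standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
    sx = sqrt(sum((xi - mx) ** 2 for xi in x) / (n - 1))   # Sx
    sy = sqrt(sum((yi - my) ** 2 for yi in y) / (n - 1))   # Sy
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))   # ~ +1: perfect positive fit
print(pearson_r([1, 2, 3, 4, 5], [10, 8, 6, 4, 2]))   # ~ -1: perfect negative fit
```

Feeding in real paired business data, such as the quarterly sales and promotions figures discussed next, would give values between these two extremes.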

Positive Correlation

Assume that you are managing a company that produces power drills. You are tasked with studying the relationship between
quarterly sales figures and promotion expenditure. As promotional spending is an important part of a company’s marketing
strategy, it is important to make sure that it fulfills its goals of increasing sales. To find the relationship between sales and
promotions, data from 12 consecutive quarters (3 years) is collected and analyzed as seen in Table 9.5.1.

Table 9.5.1 Sales Figures vs. Promotions over a Period of 12 Quarters


The first step in analyzing this data is to construct a scatterplot. As we covered in Chapter 3, a scatterplot is a graphical
representation of the relationship between two variables.

This graph plots the first variable (sales figures) on the X-axis and the second variable (promotions) on the Y-axis. A scatterplot is an
initial step in determining both the type and strength of the relationship between the two variables. For example, a strong
relationship shows data scattered more closely around a line than a weak relationship does. In Figure 9.5.1, the scatterplot
shows how the data scatter when Y is plotted against X.

Figure 9.5.1 Scatterplot of Sales Figures vs. Promotions


The above scatterplot of the data provides the following information about the type and strength of the Correlation between
sales figures and promotions:

 The trend line or line around which data points are uniformly organized is up-sloping which means that the relationship
between variables X (sales) and Y (promotions) is positive.
 The data is closely scattered around the trend line which means that the correlation between X and Y is fairly strong.

A scatterplot is a very useful tool as an initial, graphic assessment of the relationship between two variables. It gives you a
rapid look at whether the relationship is positive or negative (or zero), and how strong it is from the scatter of the data
around the trend line.

However, for an accurate assessment of the relationship between sales and promotions in our above example, it is important
to calculate the correlation coefficient, r, as seen in Table 9.5.2.

Table 9.5.2 Calculation of the Correlation Coefficient (r)


Working through the calculation in the table above yields r = 0.88.

A correlation coefficient of 0.88 indicates a strong, positive correlation between promotional expenditure and total sales
calculated over 12 consecutive quarters. To management, this high correlation may indicate that the company’s marketing
strategy is working in generating higher sales. The more money they spend on promoting the power drills, the more sales
that are generated.

Negative Correlation

Assume that you are the owner of a fast food restaurant and you wish to find out whether your restaurant attracts a wide
range of patrons. You are especially interested in finding out whether people in higher income demographics buy food at your
restaurant, and how often they visit each week. To find out, you designate average income as variable X and the number of
weekly visits as variable Y.

To determine the type and strength of the correlation between the number of weekly visits to the restaurant and the average
income of the patrons, the following steps must be followed:

1. Collect data by designing a survey that asks patrons to recount the number of times they’ve visited the restaurant in the
previous week. The survey will also collect information on their annual income.

2. Prepare a scatterplot of the data to obtain a graphical representation of the type and strength of the relationship.

3. Calculate the correlation coefficient, r.

Table 9.5.3 Average Income vs. Number of Weekly Visits


Table 9.5.4 Average Income vs. Number of Weekly Visits
Table 9.5.4 represents the scatterplot of the data which provides information about the type and strength of the correlation
between average income and number of weekly visits:

 The trend line or line around which data points are uniformly organized is down-sloping which means that the relationship
between variables X (average income) and Y (weekly visits) is negative.
 The data is closely scattered around the trend line which means that the correlation between X and Y is very strong.

Table 9.5.5 Calculation of the Correlation Coefficient (r)


Working through the calculation in the table above yields r = -0.93.

A correlation coefficient of -0.93 indicates a strong, negative correlation between average income and the number of weekly
visits to the fast food restaurant. In other words, the higher the income of individuals, the less likely they are to buy food at
the restaurant. To management, this high negative correlation indicates that the company needs to develop ways to attract
higher income individuals and families to frequent their establishment. An example of such a method is to create a new
menu based on the stated preferences of a sample of high earners. This menu can be especially advertised to those in higher
income demographics.

It is important to note the following about the value of the correlation coefficient, r:

 Depending on the type of research, a strong positive correlation can be from +0.70 to +1.00 while a strong negative correlation
can be from -0.70 to -1.00. Studies in business and social sciences typically require smaller values than those in the pure and
applied sciences. For example, in business a correlation of +0.60 between two variables can be considered to be a strong,
positive correlation.
 A weak correlation that approaches zero does not necessarily mean that the two variables have no relationship. It may mean
that their relationship is non-linear or can simply be affected by a third variable.
 Remember that correlation does NOT mean causation. In other words, just because two variables have a strong relationship
doesn’t mean that one causes the other. It simply means that they tend to occur at the same time and there could be a number
of reasons why they occur. For example, a coastal city’s officials may find that the higher the number of church services per
year, the more likely there are storms. This does not mean that conducting church services leads to storms. It simply means
that they occur together and a third variable may be at play.

9.6 Summary of Key Concepts

 A correlation is one way to understand the relationship between two variables that occur at the same time.
 Correlation analysis does NOT mean that you are searching for causes of why one variable is related to another.
 Correlation analysis provides a numerical figure which indicates important information about the type and strength of the
relationship between two variables.
 The type of a correlation describes whether an increase in one variable (such as length of education) is accompanied by no
change, an increase or a decrease in the second variable.
 A correlation can be linear or curvilinear in nature.
 The strength of the relationship between two variables is quantified through a number that is referred to as a correlation
coefficient.
 The Pearson correlation (r) is sometimes referred to as bivariate correlation and is named after its developer, Karl Pearson.
 The Pearson correlation is the most widely used correlation coefficient in business research.

 There are several assumptions that are made about the correlation coefficient between the two variables, X and Y:

1. The correlation coefficient assumes that there is a linear relationship between the two variables.
2. The correlation coefficient also assumes that the data for the two variables is randomly selected
3. The correlation coefficient assumes that X and Y are independent of each other.

 Studies in business and social sciences typically require smaller values than those in the pure and applied sciences.
 A weak correlation that approaches zero does not necessarily mean that the two variables have no relationship. It may mean
that their relationship is non-linear or can simply be affected by a third variable.

Glossary of Terms

Negative Correlation: A relationship between two variables that displays a down-slope scatterplot and a negative
correlation coefficient.

Pearson Correlation: The Pearson correlation (r) is the most widely used correlation coefficient in business research.

Positive Correlation: A relationship between two variables that displays an up-slope scatterplot and a positive
correlation coefficient.

9.7 Chapter Review Questions

Descriptive Questions

 Give two (2) examples of why correlation between two variables does not mean causation.
 What three assumptions are made with regard to variables X and Y when conducting correlation analysis?
 Explain what a correlation coefficient value approaching zero may mean in your analysis. Provide an example.
 What are the three basic assumptions made when estimating confidence intervals? Explain how the assumptions differ
whether the population standard deviation is known or not.
 What is the formula for determining the confidence interval estimate with an unknown population standard deviation?
 How do point estimates differ from interval values in making inferences about population parameters? Give an example of
each.

Multiple Choice Questions

Mark the correct answer.

1: Correlation analysis allows a researcher to determine the following about two variables EXCEPT:
a)Type of relationship
b)Strength of relationship
c)Correlation coefficient
d)Cause of relationship

2: A study that examines growth hormone levels in local dairy products finds them to be positively correlated with the presence of
pesticide traces. These results indicate:
a)The higher the growth hormone level in the products, the higher the pesticide traces
b)The higher the growth hormone level in the products, the lower the pesticide traces
c)The lower the growth hormone level in the products, the higher the pesticide traces
d)The higher the growth hormone level in the products, no change in the pesticide traces

3: The following are assumptions about the correlation coefficient between two variables EXCEPT:
a)Linear relationship between variables
b)Causation relationship between variables
c)Random selection of variable data
d)Independence of variable data from one another

4: A human resources manager finds that employees who participate in the staff annual retreat in August are less likely to
miss hours of work through the holiday season. These results indicate:
a)The more employees who attend the annual retreat, the more missed hours of work
b)The fewer employees who attend the annual retreat, the more missed hours of work
c)The more employees who attend the annual retreat, the fewer missed hours of work
d)The fewer employees who attend the annual retreat, the fewer missed hours of work

5: A construction company paid by city officials finds that as politicians reduce the budget for maintenance work, the less
likely it is to find lucrative contracts. These results indicate:
a)The less money city politicians allocate to maintenance work, the more lucrative are the contracts
b)The more money city politicians allocate to maintenance work, the less lucrative are the contracts
c)The less money city politicians allocate to maintenance work, the less lucrative are the contracts
d)The local budget restrictions do not affect the size of construction company contracts

6: Confidence interval estimation is a type of statistical tool that allows business researchers to:
a)Conduct predictions on how a sample will vary
b)Make inferences about population parameters
c)Make inferences about sample parameters
d)Compare means of known populations and unknown samples
Answer Key

1-d, 2-a, 3-b, 4-c, 5-c, 6-b, 7-a, 8-d, 9-c, 10-b.

References

Anderson, D.R., Sweeney, D. J., and Williams, T.A. (2007). Statistics for Business and Economics. South-Western College Publication. 10th Edition.

Fox, J. (2008). Applied Regression Analysis and Generalized Linear Models. Sage Publications.

Freedman, D., Purves, R., and Pisani, R. (2007). Statistics. W.W. Norton & Company. 4th Edition.

Kleinbaum, D., Kupper, L., Muller, K., & Nizam, A. (2008). Applied Regression Analysis and Other Multivariate Methods. 4th Edition,
Thomson Brooks/Cole.
CHAPTER 10

10 Making Business Predictions: Linear Regression Analysis - Part 1

Overview

The business world relies a great deal on communicating sound predictions to a number of stakeholders that include
financial institutions, governments, investors, journalists, special interest groups, and the general public. You will likely hear
predictions in publications such as The Wall Street Journal or The Economist about the value of stocks in companies such as IBM or
Microsoft. On a much smaller scale, a business manager or leader may be interested in finding out how annual bonuses affect
the performance of staff in a company. A business leader may also be interested in predicting the performance of new hires
based on past performances.

To make accurate predictions, it is important for business professionals to gather comprehensive data relevant to the
predictions. In other words, predictions about future outcomes are only as good as the data from past occurrences. For
example, if a meteorologist is predicting the likelihood of a major storm in a certain city, he or she must rely on storm data
from many years in order to make an accurate prediction of the future. Similarly, a human resources manager may be
interested in finding out if monthly webinars offered by the company are boosting the sales figures of the marketing
department. This manager must rely on prior data that matched the number of webinar hours with the total dollar amount
in sales to make a prediction about the future performance of staff.

Regression analysis is the most common method for making accurate predictions about future outcomes. By using data
collected about two variables, regression analysis allows a business manager to make a prediction about one variable from
another variable. For example, based on traffic accidents data that examines the age of drivers and the frequency of their
accidents, one can predict the likelihood that someone under the age of 20 will be involved in an accident. The accuracy of
the prediction will depend on the data that is collected.

In this Chapter, we will explore regression analysis through the simple linear regression model. The objectives of this chapter
are as follows:

 Understand the meaning of the regression coefficients β0 and β1


 Use regression analysis to predict the value of a dependent variable based on an independent variable

Keywords

Regression Coefficient
Slope
Intercept
Regression Analysis
Regression Equation

10.1 Relationships between Variables

We’ve learned in Chapter 9 that through a correlation coefficient, you can determine the nature and strength of a relationship
between two variables. For example, if you want to determine the relationship between training hours and the sales
performance of your staff, you would collect data that matches training (variable X) with the sales performance (variable Y) of
every employee. The resulting data set is then analyzed both graphically, through a scatterplot, and numerically, through a
calculation of the correlation coefficient. Your data may show that the more training hours your staff members receive, the
higher their sales performance. If the correlation coefficient is high enough, you can conclude with some confidence that your
two variables (training hours and sales performance) are strongly correlated.

As seen in Figure 10.1.1, if two variables have a strong relationship, their scatterplot would reflect a trend line around which
the data points cluster tightly. Notice on the left side of the figure that both a positive and a negative correlation between
variables X and Y can occur. At the top left, the scatterplot indicates a strong, positive correlation where if X increases, Y
increases too. At the bottom left, the scatterplot indicates a strong, negative correlation where if X increases, Y decreases. An
example of this pattern is when someone adopts an exercise regimen instead of watching television. As more hours are
spent exercising, fewer hours are spent watching television.

On the right side of the figure, we have examples of weak relationships between variables X and Y. At the top right, there is a
weak, positive correlation between variable X and Y. That means whenever X increases, Y increases for only some of the data
points. For example, assume that you are studying the relationship between car costs and their incidence of illegal speeding.
You hypothesize that the more expensive the car, the more likely it is involved in illegal speeding.

Figure 10.1.1 Relationship Strength Between Two Variables

However, after you gather the data, you realize that the scatterplot shows data points that are scattered widely around the
trend line. This means that although more expensive cars have a slightly higher chance of being driven past the legal speed
limit, the relationship is weak, so it is very likely that other factors lead to driving past the legal speed limits. For example,
younger drivers and those who live in rural areas are more likely to be caught driving past legal limits.

Figure 10.1.1 No Relationship Between Two Variables


Finally, as seen in Figure 10.1.1, there are instances when a scatterplot shows that there is no relationship between the two
variables. Assume that you are a manager of a factory producing car parts and you wish to determine the relationship
between product quality as shown in the number of defective pieces (variable X) and the total experience in years of your
team leads (variable Y). Your operations manager collects data from 15 shifts. For every shift, the number of defective
products and the combined years of experience of the team leads are tabulated. A scatterplot is then constructed which shows that
there is no relationship between the two variables, with the trend line being horizontal. This indicates that in some shifts, you
may find a high number of defective products despite extensive combined lead experience. In other shifts, you may
also find a high number of defective products despite limited combined lead experience.

No relationship between variables in this example can mean a number of things. For example, it could be that the real
important factor in product quality is the combined experience of machine operators or frontline workers and not team
leads. Also, product quality could be affected by the age and functioning of the actual machines. Determining that two
variables have no relationship does not mean that they are not related in some way. It simply means that the relationship is
not linear or that there could be other factors that influence both.

In addition to a scatterplot, correlation analysis is done by calculating a coefficient (r) from the data set and determining
where it falls between -1 and +1. This coefficient provides information about the nature of the relationship between the two
variables and its strength. A correlation coefficient only provides information about the relationship but does not comment on
causation; in other words, from this figure alone, we cannot determine if one variable causes the other, only that they coexist.
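The coefficient described above can be computed directly. The following is a minimal Python sketch (the courseware itself uses Excel; the function name and the data values here are illustrative, not from the text):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient (r) between two paired data sets."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of deviations from each mean (covariation)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the summed squared deviations of each variable
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Hypothetical example: training hours vs. sales ($1000s) for 5 employees
hours = [5, 10, 15, 20, 25]
sales = [20, 28, 35, 39, 48]
print(round(pearson_r(hours, sales), 2))  # 0.99 — a strong, positive correlation
```

The result always falls between -1 and +1, matching the interpretation given above.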
Making Predictions – From Correlation to Regression

In everyday settings, you will be interested in making predictions about certain outcomes. For example, you could be
interested in determining:

 How much your house will sell for next year given current house prices and size (square footage).
 The amount of scholarship offered given current grades.
 The likelihood of a hurricane given information about past weather patterns and frequency of storms.

Similarly, in business settings you may be interested in making predictions about a number of outcomes such as:

 The likelihood of car accidents based on the age of new drivers


 The price of a stock based on previous earnings and stock prices
 The annual sales performance of new hires based on the performance of past employees

The basic idea behind making predictions about future outcomes is to take data that has already been collected about the
variables (for example, X and Y), and to calculate the degree with which these are correlated. Then, you use the correlation
and the information that you have about Xi to predict Yi. This process is referred to as Regression Analysis.

Regression analysis is used to:

1. Predict the value of the dependent variable based on the value of the independent variable
2. Determine the effect of changes in the independent variable on the dependent variable.

Remember that the independent and dependent variables are defined as follows:

 Independent Variable – The variable that is used to predict the dependent variable.
 Dependent Variable - The variable that must be predicted or explained.

Assume that you are a city worker interested in accidents committed at a major intersection. You are specifically interested in
finding out if the age of the drivers influences driving speed, which is related to accidents. To investigate this, you collect data
about accidents for a period of 30 days. For each accident, you record information about the age of the driver and his or her
car speed when arriving at the intersection. You classify your variables as follows:

 Independent Variable – Age of the driver (in years)


 Dependent Variable – Car speed at the intersection (in miles per hour)

Using regression analysis, you can estimate the future speed of a driver arriving at this particular intersection from his age,
based on this previous data set. Regression analysis can also determine the effect of changes in the prevalence of drivers of a
certain age on the overall car accident rate.

Simple Linear Regression Model

The most basic regression analysis that could be conducted between two variables (X and Y) is carried out through a Simple
Linear Regression Model as follows:

Figure 10.1.2 Simple Linear Regression Model

Yi = β0 + β1Xi + Ei

where Yi is the dependent variable, Xi is the independent variable, β0 is the intercept, β1 is the slope, and Ei is the random error term.

In Chapter 9, we’ve learned to plot X and Y values for each data point and to draw a line of best fit or a trend line through the
data points. The scatterplot shows graphically how far each data point (with its own X and Y value) lies from the trend line as
shown in Figure 10.1.3.

Figure 10.1.3 The Position of a Trend Line or Line of Best Fit through a Data Set

Once you’ve plotted all your data, you can determine the components of the simple linear regression model, also known
as regression coefficients, as shown in Figure 10.1.4 as follows:

 The intercept (β0) is calculated by extending the trend line to intersect the Y-axis (where X is equal to zero).
 The slope (β1) is the estimated change in the average value of Y as a result of a one-unit change in X.
Figure 10.1.4 Determining the Components of the Simple Linear Regression Model
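The two coefficients defined above can also be computed from the data rather than read off a plot. The following Python sketch applies the standard least-squares formulas for the slope and intercept (the function name and data are hypothetical, for illustration only):

```python
def fit_line(xs, ys):
    """Least-squares estimates of the intercept (b0) and slope (b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariation of X and Y divided by the variation of X
    b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    # Intercept: where the trend line crosses the Y-axis (X = 0)
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical data: X = training hours, Y = sales performance ($1000s)
b0, b1 = fit_line([5, 10, 15, 20, 25], [20, 28, 35, 39, 48])
print(round(b0, 2), round(b1, 2))  # 13.9 1.34 — trend line: Y = b0 + b1 * X
```

Here b1 says that each additional unit of X is associated with an estimated 1.34-unit increase in the average value of Y, matching the definition of the slope above.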

10.2 Conducting Regression Analysis

Here is another example of how regression analysis is used, this time in real estate planning. Assume that you are a real estate agent
interested in determining the relationship between the price of a house and its size. Specifically, you are interested in finding
out if you can estimate a price for a house that you are now selling for which you have information about size.

You are tasked with estimating the price of a house (Yi) for which you are given the size in square feet (Xi). To begin the
regression analysis, you compile a data set that matches house prices in the thousands to house sizes in square feet as seen
in Table 10.2.1.

Table 10.2.1 House Prices in Thousands vs. Sizes in Square Feet, 10 Houses
To conduct a regression analysis on the above data set, you follow a three-step process:

1. First, create a scatterplot of the data set.

2. Second, compute the value of the correlation coefficient.

3. Third, conduct regression analysis.

Create a Scatterplot

A scatterplot will give you information about the type and strength of relationship between the two variables. Once you have
created the scatterplot, you create a line of best fit or a trend line. This line provides you with the two regression coefficients
that you need to build your model:

a. Y-intercept
b. Slope

As you can observe in Figure 10.2.1, the scatterplot shows that the correlation between house prices and sizes is a strong,
positive correlation. We know that it is positive from the up-slope position of the trend line that is drawn from the data.

Figure 10.2.1 Scatterplot of House Prices in Thousands vs. Sizes in Square Feet
Compute the Correlation Coefficient

The second step is to calculate the correlation coefficient of the data set. This allows you to determine the strength of the
relationship between the two variables. Remember that the higher the correlation between the price and size of a house, the
more accurate your predictions of one variable from the known variable.

Table 10.2.2 Calculation of the Correlation Coefficient (r)


Finally, the correlation coefficient, r, is calculated from the totals in the table.

This correlation coefficient of 0.76 indicates that there is a strong, positive correlation between the size of a house and its
selling price. This coefficient also indicates that, from the given values of house sizes and prices in Table 10.2.1, you can
predict the price of a house (Yi) from its size (Xi).

Conduct Regression Analysis

In this final step, you compute the following coefficients of the simple linear regression model when conducting regression
analysis:
The simple linear regression equation provides an estimate of the population regression line. If you recall from earlier
chapters, population figures can only be estimated from a sample statistic. In this sample of 10 homes, we can calculate the
regression coefficients as follows:

After computing the slope of the data set, you can calculate the Y-intercept according to the following equation:

Table 10.2.3 Calculation of the Regression Coefficients


In other words, the regression equation for the two variables of house prices and sizes in square feet is as follows:

As a real estate agent, you are now interested in using this regression equation you have developed to predict the price of a
new home based on prior information found in the data set of house prices and sizes. For example, imagine that you are
trying to sell a house measuring 5000 square feet. The buyers are interested in finding out how much they should bid on the
house given information about house prices in the area. Using the above regression equation, you compute the price of the
house as follows:

Given that the house price units were in the $1000s, then this house would be approximately priced at $687,000.
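The prediction step itself is a single evaluation of the fitted equation. A minimal Python sketch (the coefficient values below are placeholders for illustration, not the values derived from Table 10.2.1; with the actual fitted coefficients, the 5000-square-foot house prices out near 687 in $1000s):

```python
def predict(b0, b1, x):
    """Evaluate the simple linear regression equation: Y = b0 + b1 * X."""
    return b0 + b1 * x

# Placeholder intercept and slope, chosen only to illustrate the mechanics
price_thousands = predict(100.0, 0.12, 5000)  # house size of 5000 square feet
print(price_thousands)  # predicted price in $1000s
```

Because the Y values in the data set were recorded in $1000s, the predicted figure must be multiplied by 1000 to state a dollar price, exactly as done above.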

Regression at a Glance

In summary, regression analysis goes beyond correlation to determine the relationship between two variables. It provides a
statistical means by which one variable can be used to predict another based on prior information found in a data set.
Conducting regression analysis involves the following three steps:

1. Create a scatterplot to graphically observe the nature and strength of the relationship between the two variables.
2. Compute the correlation coefficient (r). If the absolute value of the correlation coefficient is high (whether it is negative or
positive), there is a good possibility that one variable can be used to predict the value of another.
3. Conduct regression analysis to determine the regression equation.
10.3 Summary of Key Concepts

 A correlation coefficient can determine the nature and strength of a relationship between two variables.
 If two variables have a strong relationship, their scatterplot would reflect a trend line around which the data points cluster
tightly.
 Regression analysis is used to predict the value of the dependent variable based on the value of the independent variable.
 Regression analysis is also used to determine the effect of changes in the independent variable on the dependent variable.
 This regression model or equation relies on several figures determined from the known data set (β0, β1, and Ei).
 The intercept (β0) is calculated by extending the trend line to intersect the Y-axis (where X is equal to zero).
 The slope (β1) is the estimated change in the average value of Y as a result of a one-unit change in X.
 The simple linear regression equation provides an estimate of the population regression line.
 Conducting regression analysis involves the following three steps:

1. Create a scatterplot to graphically observe the nature and strength of the relationship between the two variables.
2. Compute the correlation coefficient (r). If the absolute value of the correlation coefficient is high (whether it is negative or
positive), there is a good possibility that one variable can be used to predict the value of another.
3. Conduct regression analysis.

Glossary of Terms

Intercept: A regression coefficient that is part of the simple linear regression model. The intercept (β0) is calculated
by extending the trend line to intersect the Y-axis (where X is equal to zero).

Regression analysis: A step-wise process that uses prior knowledge found in a data set to predict one variable based
on the known value of another variable.

Regression coefficient: A component of the simple linear regression model that is derived from a known data set of X
and Y values.

Regression equation: An equation developed following regression analysis that is used to predict one variable based
on the known value of another variable.

Slope: A regression coefficient that is part of the simple linear regression model. The slope (β1) is the estimated
change in the average value of Y as a result of a one-unit change in X.

10.4 Chapter Review Questions

Descriptive Questions

1. List and define the three components of the simple linear regression model.

2. Differentiate between the slope and the intercept of a trend line.

3. What are the three steps involved in conducting regression analysis?

Multiple Choice Questions

Mark the correct answer.

1: All of the following are the components of the simple linear regression model EXCEPT:
a)The slope of the trend line
b)The correlation coefficient (r)
c)The random error term
d)The intercept (where the trend line intersects the Y-axis)

2: A researcher finds that the correlation coefficient between cigarette smoking and speeding violations is zero. The trend line
of the scatterplot will show:
a)A horizontal line with a slope of zero
b) An up-slope with an intercept at the Y-axis
c)A down-slope with an intercept at the Y-axis
d)A vertical line with a slope of +1

3: All of the following are the steps in the process of regression analysis EXCEPT:
a)Create a scatterplot that shows the nature and strength of the relationship between variables
b)Compute the correlation coefficient that provides a mathematical figure of the relationship between variables
c)Conduct regression analysis to predict one variable from another variable
d)Compute the effect of changes in the independent variable on the dependent variable

4: Which of the following forms the denominator (or bottom row) of the equation to determine the slope (β1):
a)The total sum of the squared deviations of the variable Xi from its mean
b)The total sum of the squared deviations of the variable Yi from its mean
c)The total sum of the deviation of the intercept from the slope of the trend line
d)The total sum of the deviation of the variable Xi from the variable Yi

Answer Key

1- b, 2- a, 3- d, 4- a

References

Anderson, D.R., Sweeney, D. J., and Williams, T.A. (2007). Statistics for Business and Economics. South-Western College Publication. 10th Edition.

Fox, J. (2008). Applied Regression Analysis and Generalized Linear Models. Sage Publications.

Freedman, D., Purves, R., and Pisani, R. (2007). Statistics. W.W. Norton & Company. 4th Edition.

Kleinbaum, D., Kupper, L., Muller, K., & Nizam, A. (2008). Applied Regression Analysis and Other Multivariate Methods. 4th Edition, Thomson Brooks/Cole.

CHAPTER 11

11 Making Business Predictions: Linear Regression Analysis - Part 2

Overview

We have learned in Chapter 10 that the simple linear regression model contains three regression coefficients: an intercept, a
slope, and a random error term. The random error term is designed to quantify the effects of all factors that are outside the
independent variable. If you are studying the effects of alcohol consumption on accident frequency, the linear regression
model will designate alcohol consumption as the independent variable and accident frequency as the dependent variable.

The random error term provides a numerical estimate of factors beyond alcohol consumption in accident frequency. These
factors may include speeding behavior, the age of the driver, or the condition of the road.

Multiple regression analysis is essential in understanding how many factors can affect a single outcome. For example, if a
company is struggling with high turnover amongst its most experienced managers, it can explore the different causes for this
phenomenon. High turnover could be due to poor pay, difficult working conditions, and limited opportunities for
advancement. As a business professional, you want to know whether one or more of these causes is driving experienced
managers to leave the company. Multiple regression allows you to analyze the individual effects of these causes and to quantify the size of the
effects.

In this Chapter, we will further explore regression analysis. The objectives of this chapter are as follows:

 Understand the meaning of the regression coefficient, Ei


 Use multiple regression analysis to predict the value of a dependent variable based on more than one independent variable

Keywords

Random Error Term


Multiple Regression

11.1 Random Error Term

In earlier chapters, we have learned that you can predict the value of one variable from another variable given a data set of
paired observations of the two variables. For example, assume that you are studying property costs and their relation to size.
To predict the cost of an individual property you wish to sell, you must use a data set of property costs and sizes to develop
a simple linear regression model.

As seen in Figure 11.1.1, the model includes an error term that takes into account the distance of every point in the
scatterplot from the trend line.

Figure 11.1.1 Simple Linear Regression Model

Yi = β0 + β1Xi + Ei

where Yi is the dependent variable, Xi is the independent variable, β0 is the intercept, β1 is the slope, and Ei is the random error term.
The random error term is a numerical estimate of factors outside the independent variable. The regression equation
accounts for these factors.

In a linear regression, the slope can be determined in two ways. Graphically, the slope is read from the line of best fit in a
scatterplot; in other words, a line is drawn that is as near to all the points as possible. Mathematically, the slope can be
calculated with a formula that takes into account the deviations of variables X and Y from their averages.

As seen in Figure 11.1.2, the random error value of a point on the scatterplot is the vertical distance between the point and
the trend line. In other words, it is the distance between a given data point and its corresponding value on the trend line.

Figure 11.1.2 The Position of the Random Error Term

As regression approximates the value of one variable from another, the error term in the model is designed to capture the
size of the error in this approximation. If the relationship between two variables is perfectly linear, then the size of the error
in the model is minimal or zero as seen in Figure 11.1.3. As the relationship between the two variables weakens, the size of
the error increases. When two variables have no relationship (for example, if research shows that a person’s height and
intelligence are unrelated), then the error term is very large.
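This idea that weaker relationships produce larger errors can be seen by computing residuals, the distances between each observed Y value and the value the trend line predicts for the same X. A minimal Python sketch with hypothetical data:

```python
def residuals(xs, ys, b0, b1):
    """Observed Y minus the Y value the fitted line (b0 + b1*X) predicts."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Perfectly linear data lying on Y = 3 + 2X: every residual is zero
print(residuals([1, 2, 3], [5, 7, 9], 3, 2))   # [0, 0, 0]

# Noisier data around the same line: residuals grow with the scatter
print(residuals([1, 2, 3], [6, 6, 10], 3, 2))  # [1, -1, 1]
```

When the relationship is perfectly linear, the residuals (and hence the error term) vanish; as the scatter widens, the residuals grow, matching the pattern described in Figure 11.1.3.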

As mentioned above, the random error term (or error term) accounts for factors beyond the independent variable that could
influence the dependent variable. For example, an advertiser interested in the effect of promotional spending on sales figures
should also consider factors such as product quality when examining sales figures.

Figure 11.1.3 Size of the Random Error Term Ei


Recall the example seen in Chapter 9 about the relationship between a company’s promotional expenditures (Variable
X) and its quarterly sales figures (Variable Y) as seen in Table 11.1.1. In that chapter, we calculated a correlation
coefficient of 0.88 which indicates a strong, positive correlation between promotional expenditure and total sales calculated
over 12 consecutive quarters. Given this strong correlation between the two variables, regression analysis will provide a
regression equation from which sales figures can be predicted from promotional spending.

The random error term in the regression equation takes into account factors outside promotional spending that could affect
quarterly sales. Factors that could affect the quarterly sales of the company include:

1. Pricing
2. Product quality
3. Competitive market placement
4. Customer service
5. New/emerging markets

Table 11.1.1 Sales Figures vs. Promotions over a Period of 12 Quarters


To determine the linear regression equation for these two variables, the regression coefficients (slope, intercept, and random
error term) are calculated as shown in Table 11.1.2.

Table 11.1.2 Calculation of the Regression Coefficients


Step 1: Calculate the regression coefficient (β1) or the slope

To calculate the regression coefficient (β1) from the above table, the process will be as follows:

Remember that a slope (β1) of 6.12 is the estimated change in the average value of Y as a result of one-unit change in X.

Step 2: Calculate the regression coefficient (β0) or the intercept

To calculate the regression coefficient (β0) from the above table, the process will be as follows:
Step 3: Calculate the regression coefficient (Ei) or the random error term

To calculate the random error term, use the STEYX function in Excel to find the standard error of the predicted y-value for
each individual x in the regression. Please find step-by-step instructions in Chapter 13 on how to use Excel in this calculation.
The STEYX function uses the following syntax:

=STEYX (known Y’s, known X’s)

= 34
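For readers working outside Excel, the same standard error of estimate can be computed directly. This Python sketch follows the formula that STEYX is documented to implement (the data here is hypothetical, not the courseware's 12-quarter table):

```python
import math

def std_error_estimate(xs, ys):
    """Standard error of the predicted Y for each X (what Excel's STEYX returns)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_y = sum((y - mean_y) ** 2 for y in ys)   # total variation in Y
    ss_x = sum((x - mean_x) ** 2 for x in xs)   # variation in X
    s_xy = sum((x - mean_x) * (y - mean_y)
               for x, y in zip(xs, ys))         # covariation of X and Y
    # Residual variation divided by (n - 2) degrees of freedom
    return math.sqrt((ss_y - s_xy ** 2 / ss_x) / (n - 2))

# Hypothetical quarters: promotions ($1000s) vs. sales ($1000s)
promos = [1, 2, 3, 4, 5]
sales = [10, 13, 17, 18, 24]
print(round(std_error_estimate(promos, sales), 2))  # 1.2
```

A smaller result means the data points sit closer to the trend line, so predictions made from the regression equation carry less error.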

Given these values of the regression coefficients, the regression equation is:

In other words, the regression equation for the two variables of promotional spending and quarterly sales is as follows:

As a business professional, you can use this regression equation to predict quarterly sales based on prior information found
in the data set of promotions and sales. Assume that you are interested in predicting quarterly sales for the third quarter of
2009 based on known promotional spending of $2,000. The Q3 calculation will be as follows:

11.2 Multiple Regression

Multiple regression analysis is used to study the effect of two or more independent (or explanatory) variables on the
dependent variable. Similar to simple linear regression models where only one independent variable is studied, multiple
regression models can explore both linear and non-linear relationships between the variables.

Figure 11.2.1 Linear vs. Multiple Linear Regression Models


The linear regression model for two or more independent variables is expressed as follows (Figure 11.2.2). Both the intercept
and the random error term remain singular, but there is a slope value for every independent variable up to the kth variable.

Figure 11.2.2 Multiple Linear Regression Model Equation

Given that there is more than one slope calculation in the multiple regression model, the graphical representation of the
model will look like this:
The graph above shows that the two independent variables (X1 and X2) are represented by two different slopes that show a
three-dimensional view of the data. In the graph of the simple linear regression model, the data is represented as a
scatterplot.

Companies are usually interested in a number of factors that could affect their performance and productivity. Professionals
within these companies realize that one factor such as marketing expenditure or product quality alone would not affect sales
figures. It is likely that two or more factors within the company and outside it would affect sales figures. For example, if a
company is launching a new expensive electronic product in a new market, it must pay attention to the earning profile of its
target customers.

If the new market is in a city that is suffering from massive layoffs and an increase in the ranks of the unemployed, then
high promotional expenditure would have a limited effect on sales figures. In some cases, two or more factors (independent
variables) act together to affect the outcome (dependent variable).

Assume that you are interested in finding out if quarterly sales figures would be affected by promotional spending and
pricing. To conduct the analysis, you must identify your research variables as follows:

Table 11.2.1 Sales Figures vs. Promotions over a Period of 12 Quarters


The manual calculation of regression coefficients for the multiple linear regression model follows the same principles as the
simple regression model. Computer programs such as Excel, however, offer an easier, more efficient way to calculate the
coefficients. Chapter 13 will cover regression calculations through Excel in depth. Using this program, the above example’s
regression coefficients are determined as follows:

The multiple regression model is as follows:


Since the quarterly sales are expressed in $1,000s, the predicted figure is $1,169,000.
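As with the simple case, the coefficient arithmetic can be reproduced outside Excel. The sketch below, on hypothetical data (the courseware's own data set and coefficients are not reproduced here), fits a two-predictor model by solving the normal equations:

```python
# Sketch: fitting sales = b0 + b1*promotions + b2*price by ordinary least
# squares, solving the normal equations (X'X)b = X'y with Gaussian
# elimination. The data are hypothetical; the courseware's own figures
# come from Excel's regression tools (Chapter 13).

def fit_ols(rows, y):
    # rows: list of (x1, x2) pairs; prepend an intercept column of 1s
    X = [[1.0, x1, x2] for (x1, x2) in rows]
    k = 3
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    # Forward elimination with partial pivoting
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[piv] = xtx[piv], xtx[col]
        xty[col], xty[piv] = xty[piv], xty[col]
        for r in range(col + 1, k):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, k):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    # Back substitution
    b = [0.0] * k
    for r in range(k - 1, -1, -1):
        b[r] = (xty[r] - sum(xtx[r][c] * b[c] for c in range(r + 1, k))) / xtx[r][r]
    return b  # [intercept, slope for x1, slope for x2]

# Hypothetical quarterly data: (promotions in $, price in $) and sales in $1,000s
predictors = [(1000, 20), (1500, 19), (2000, 18), (2500, 18), (3000, 17), (3500, 16)]
quarterly_sales = [600, 700, 820, 900, 1000, 1100]
b0, b1, b2 = fit_ols(predictors, quarterly_sales)
prediction = b0 + b1 * 2000 + b2 * 18  # predicted sales, in $1,000s
```

In practice, Excel's regression tools or a statistics package would do this solving for you; the sketch only shows what is happening underneath.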

It is important to remember that regression analysis provides a predicted figure based on prior information. The usefulness
of the prediction depends on how comprehensive and accurate the data set from which the regression coefficients are
calculated is. Also, there are other factors that could affect quarterly sales figures, such as leadership quality, branding, and
customer service.

Furthermore, it is important that you limit your predictions to within the range of known (or observed) Xs. In other words, as
seen in Figure 11.2.3, your most accurate prediction will be within the range of known Xs. If you try to predict quarterly sales,
for example, from very low or very high values of promotional spending, the prediction becomes far less reliable.

Figure 11.2.3 Extrapolation Limits in Regression Analysis

11.3 Strength and Limitations of Regression Analysis


Regression analysis is a rigorous statistical tool used to predict the value of one variable from another using a data set of
known Xs and Ys. However, this type of analysis has a number of built-in assumptions of which you must be aware before
making business decisions. Some of these assumptions include:

1. The sample drawn for analysis is representative of the entire population. We have covered in earlier chapters that it is
important to choose a sample that is well representative of the population. In some cases, the sample used for analysis is not
representative of the entire population, which reduces its predictive value.

2. The independent and dependent variables have a linear relationship.

Regression analysis also has a number of strengths and limitations that are important to consider when making business
decisions.

Strengths

Unlike correlation analysis, regression analysis gives you a chance to specify hypotheses about the dependent variable and its
relationship to the independent variable. When it is done with an appropriately representative sample, regression analysis
can produce a quantitative effect of a number of independent variables on the dependent variable.

Another strength of regression analysis in business statistics is that it is straightforward and easy to conduct using a number
of computer applications such as Excel. While it is preferable to use large samples for good predictive power, regression
analysis can also be conducted on relatively small samples. In the case of small samples, it is important to test one or two
independent variables at most.

Limitations

Despite its strengths, regression analysis has a number of limitations that include:

 The analysis can be demanding and time consuming. This is especially true when dealing with large data sets.
 Assumptions about the results of regression analysis are not always known or investigated by those conducting it.

 Conducting regression analysis without having an initial understanding of the relationship between the variables.
One way to avoid the effects of this limitation is to calculate the correlation coefficient between the variables to
understand both the strength and type of relationship between the variables. Another way to understand this
relationship is to draw a scatterplot and visually evaluate how the data points scatter around the trend line.

11.4 Summary of Key Concepts

 The random error term is a numerical estimate of factors outside the independent variable. The regression equation accounts
for these factors.
 As regression approximates the value of one variable from another, the error term in the model is designed to capture the size
of the error in this approximation.
 As the relationship between the two variables weakens, the size of the error increases.
 Multiple regression analysis is used to study the effect of two or more independent (or explanatory) variables on the
dependent variable.
 Multiple regression models can explore both linear and non-linear relationships between the variables.
 Regression analysis has a number of built-in assumptions that must be considered before making business decisions such as
the sample drawn for analysis being representative of the entire population and the presence of a linear relationship between
the independent and dependent variables.
 Regression analysis is straightforward and easy to conduct using a number of computer applications such as Excel.
 A limitation of regression analysis is conducting it without having an initial understanding of the relationship between the
variables.
 To reduce the above limitation, one must calculate the correlation coefficient between the variables to understand both the
strength and type of relationship between the variables. Another way to understand this relationship is to draw a scatterplot
and visually evaluate how the data points scatter around the trend line.

Glossary of Terms

Multiple Regression Analysis: A statistical tool to study the effect of two or more independent (or explanatory)
variables on a dependent variable.

Random Error Term: A numerical estimate of factors outside the independent variable. The regression equation
accounts for these factors.

11.5 Chapter Review Questions

Descriptive Questions

1. Define the random error term and what it represents on a scatterplot.

2. List two (2) assumptions that are found in regression analysis.

3. List and provide examples of two (2) strengths and two (2) limitations of regression analysis.

Multiple Choice Questions

Mark the correct answer.

1: The random error term in the linear regression model indicates:


a)A numerical estimate of the relationship between the independent and dependent variables
b)A numerical estimate of factors outside the independent variable
c)A numerical estimate of factors outside the dependent variable
d)A numerical estimate of the relationship between the intercept and slope

2: All of the following are the components of the multiple linear regression model EXCEPT:
a)The slope of the trend line
b)The correlation coefficient (r)
c)The random error term
d)The intercept (where the trend line intersects the Y-axis)

3: The following data is used to construct a linear regression model. Which of the following figures is the SLOPE of the data?

a)- 0.60
b)- 0.06
c)+ 1.06
d)+ 1.60

4: A multiple linear regression model contains the following regression coefficients:


a)An intercept, a slope, more than one random error term
b)A slope, an intercept, and a random error term
c)An intercept, more than one slope, a random error term
d)A slope, more than one intercept, and a random error term

5: A limitation of regression analysis is:


a)An incomplete understanding of the relationship between the independent and dependent variable
b)An incomplete understanding of the relationship between the slope and the random error term
c)An incomplete understanding of the relationship between the slope and the intercept
d)An incomplete understanding of the relationship between the intercept and the random error term

6: The following data is used to construct a linear regression model. Which of the following figures is the INTERCEPT of the data?

a)- 105
b)- 150
c) + 150
d)+ 105

Answer Key

1- b, 2- b, 3- a, 4- c, 5- a, 6- d

References

Anderson, D.R., Sweeney, D.J., and Williams, T.A. (2007). Statistics for Business and Economics. South-Western College
Publication. 10th Edition.

Fox, J. (2008). Applied Regression Analysis and Generalized Linear Models. Sage Publications.

Freedman, D., Purves, R., and Pisani, R. (2007). Statistics. W.W. Norton & Company. 4th Edition.

Kleinbaum, D., Kupper, L., Muller, K., & Nizam, A. (2008). Applied Regression Analysis and Other Multivariate Methods. 4th Edition,
Thomson Brooks/Cole.

CHAPTER 12

12 Making Effective Comparisons through Analysis of Variance (ANOVA)

Overview

In business you may need to compare two or more groups on a number of variables. For example, you may be interested in
finding out if there is a difference in opinion on increased employment taxes between people younger than 35 and those
older than 35. As a stockbroker, you might be asked by investors to compare the performance of two companies over the
past fiscal year. In both these examples, you could compare the population mean (or average) in order to make business
decisions. Your statistical analysis could reveal the following:

 People who are younger than 35 are more likely to agree to increased employment taxes than those over 35.
 IBM portfolio averaged better returns to investors over the past fiscal year than Apple.

How are outcomes different for these two age groups when they are asked about increased employment taxes? This is a
question of importance for business analysts and marketers who are interested in how demographic characteristics relate to
political choices, opinions on products, and buying behavior. In business settings, comparison between groups is also
important in product research and development, and financial market analysis amongst others.

In this Chapter, we will explore the Analysis of Variance (ANOVA) as a statistical tool to compare groups. The objectives of this
chapter are as follows:

 T-tests
 ANOVA and the F-statistic
 ANOVA in Business Settings

Keywords

Analysis of Variance
Null Hypothesis
Alternative Hypothesis

12.1 Two Sample Tests

In business research, you are often faced with making a decision about two products, strategies, or approaches. For example,
a software developer may develop two different versions of their software, one for home use and another for office use. These
two versions may have slightly different capabilities. As a business professional, you are interested in whether your
personal users are as satisfied with the software as the office users. The two-sample test, or t-test, is a hypothesis test for
making a comparison between two independent samples such as those of the software users.

Assume that you are testing a new hypoallergenic soap product and want to know how much it differs in skin reactions from
your last product. Let’s call them Product A and Product B, respectively. You assign 12 men and 12 women to test Product A
(the new product) and a similar number of men and women to test Product B. You supply both groups with their respective
soap products and request that they record the number of allergic reactions that they receive. You record the total number of
allergic reactions reported by each group of men and women using both products as seen in Table 12.1.1.

Table 12.1.1 Comparison Between the Means of Two Conditions, Product A and Product B Use

We see that those who use Product A report an average of 5 allergic reactions following use of the product while those using
Product B report an average of 12 allergic reactions. Comparisons of this kind are very common in business research across a
number of disciplines such as product design and development, marketing, and even finance. To evaluate the effects of some
product, strategy, or campaign, a group of people is divided into two groups:
 A group receiving the treatment to be evaluated is referred to as the treatment group. In the above, the treatment group is the
one using Product A.
 A comparison group, also referred to as the control group. In the above example, the comparison group is the one using
Product B.

By dividing your groups into two (treatment and control), you can determine what the effect of the design or strategy is on
the users. In the above example, it is clear that Product A caused fewer than half as many allergic reactions as Product B.

However, in some business settings, the difference in outcome between the treatment and control groups can be smaller.
The question that you must answer is: is the difference between the means of the treatment group and the control
group statistically significant? This is the question that a two-sample test attempts to answer.

To determine whether the mean number of allergic reaction of Product A vs. Product B is statistically significant, the first step
is to establish a hypothesis. As we have seen in previous chapters, you establish a null hypothesis and an alternative
hypothesis as follows:

 Null hypothesis: This hypothesis states that the difference between the treatment group and the comparison group is 0. In this
case, the difference in means between Product A and Product B’s allergic reactions is 0.
 Alternative hypothesis: This hypothesis states that the difference between the treatment group and the comparison group is
not zero.
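
The t-statistic equation itself did not survive conversion to this format. A standard pooled-variance form, stated here as an assumption consistent with the symbol definitions that follow, is:

tSTAT = (X̄1 − X̄2) / √[ Sp² (1/n1 + 1/n2) ], where Sp² = [(n1 − 1)S1² + (n2 − 1)S2²] / (n1 + n2 − 2)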

Where

n = number of people in a group (treatment or comparison)


µ = population average
X = sample average

By calculating the t-statistic, or tSTAT as it is referred to, you can determine whether to accept or reject the null hypothesis from
the values in a data set. If both Product A and Product B report, on average, an equal number of allergic reactions in the
testing phase, you can determine that the difference between their means is equal to zero. This indicates that you can
accept the null hypothesis.

From a business perspective, discovering that there is no difference between the treatment and comparison groups may lead
to the following:
 Conducting another research study with a larger sample. In the above example, the sample can be increased from a total of 48
participants to approximately 200 participants. The larger sample size could indicate a different outcome in reporting allergic
reactions.
 Changing the marketing strategy of the product to emphasize another aspect of Product A. For example, the product could be
presented in different scents and packaging, or it may be promoted to a different demographic such as adolescents.
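The decision process above can be sketched in code. The reaction counts below are hypothetical stand-ins for the Product A and Product B data (chosen so the group means are 5 and 12, matching the example's averages), and the critical value is the standard t-table figure for these degrees of freedom:

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical reaction counts per tester (not the courseware's raw data)
product_a = [4, 5, 6, 5, 4, 6]        # mean 5
product_b = [11, 12, 13, 12, 11, 13]  # mean 12

n1, n2 = len(product_a), len(product_b)
# Pooled variance across the two independent samples
sp2 = ((n1 - 1) * variance(product_a) + (n2 - 1) * variance(product_b)) / (n1 + n2 - 2)
t_stat = (mean(product_a) - mean(product_b)) / sqrt(sp2 * (1 / n1 + 1 / n2))

# Compare |tSTAT| with a critical value from the t table
# (df = n1 + n2 - 2 = 10, two-tailed alpha = 0.05 -> 2.228)
t_critical = 2.228
reject_null = abs(t_stat) > t_critical  # True here: the means differ significantly
```

With such a large gap between the group means relative to their spread, the null hypothesis of equal means is rejected.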

12.2 ANOVA

The t-test is a useful tool for comparing the means of two groups in business research; however, the t-test does not provide
good inferences about population means when there are 3 or more samples. For example, if you are comparing a marketing
campaign’s outcome in three or more cities, the t-test is not an effective tool to make such a comparison. Imagine that you are launching a
new marketing campaign in three cities: Los Angeles, Chicago, and New York. After 6 months, you want to compare the mean
sales figures of the three cities (µ1, µ2, µ3, respectively). As seen in the figure below, your stated hypothesis will be as follows:

1. Null hypothesis: The mean sales figures of the three cities are equal so that µ1= µ2= µ3. This means that the difference
between of every two means is equal to 0.

2. Alternative hypothesis: The mean sales figures of the three cities are not all equal. In the figure, for example, sales
figures for Los Angeles and Chicago are equal but those for New York differ from both.

Figure 12.2.1 Equal and Unequal Means in ANOVA

In business settings where you wish to compare 3 or more groups, ANOVA or Analysis Of Variance is the appropriate
statistical tool. There are some similarities between the t-test and ANOVA. Like the t-test, ANOVA is used to test hypotheses
about differences in the mean values of an outcome between groups. For example, the number of allergic reactions
reported by soap users is an outcome.
However, while the t-test can be used to compare two means or one mean, ANOVA can be used to examine differences
among the means of several different groups at once. If your company has 4 different types of soaps, you can use ANOVA to
compare allergic reaction outcomes between the soaps.

Figure 12.2.2 Two Sample Tests and Examples

There are four kinds of two-sample tests that can be conducted as follows:

1. Population means with independent samples. These samples are not related such as conducting a marketing campaign
in two different cities or offering clients winter tires from two different manufacturers.

2. Population means with related samples. In these types of tests, the same participants/groups are tested and the outcome is
assessed based on treatment that is administered. For example, a human resource manager may be interested in whether an
educational seminar would improve productivity. In this case, a group of employees’ productivity is assessed before and after
the educational seminar. If the seminar is effective, then productivity will increase. If the seminar is not effective, then
productivity will remain the same.

3. Population proportions. A comparison of population proportions is done using independent samples. If you have two
groups of 20 individuals each (a total of 40 individuals), and 2 individuals in the first group and 5 in the second group report
dissatisfaction, then the proportions are 0.1 and 0.25, respectively.

4. Population variances. Variances are concerned with how far an individual participant in a sample deviates from the mean.
In this type of two-sample test, the variances between two populations are compared through their means.

In experimental designs, the groups are randomly assigned to treatment and comparison groups. In other words, participants
have an equal chance of being placed in one of the groups as seen in Figure 12.2.3.

Figure 12.2.3 Experimental Designs and ANOVA


The F-Statistic

The F-statistic is a ratio of the variance of sample 1 to the variance of sample 2. By convention, the sample with the larger
variance is treated as sample 1 (the numerator) and the sample with the smaller variance as sample 2 (the denominator), as seen in Equation 12.2.1.

Equation 12.2.1 Calculating the F-statistic (FSTAT) in ANOVA

When conducting an ANOVA test, you must follow these steps:

 First, state the null and alternative hypothesis as we have previously discussed. For ANOVA the null hypothesis is stated as: H0:
All population means are equal. For example, if we are testing three populations (such as cities or demographics), then we
would state: H0: µ1 = µ2 = µ3. The alternative hypothesis is stated as: H1: Not all population means are equal. It is important to
note that ANOVA testing does not tell you which mean does not equal the other means, or the combination of means that are
not equal. It will only tell you that at least one mean does not equal the other means.
 Second, select the level of significance. Remember that the level of significance is the chance that you will reject the null
hypothesis by mistake. In other words, it is the chance that you will determine that there is a difference between the means
when there is none. For example, if you reject the null hypothesis by mistake in the example with Product A and Product B, you
will assert that the two products are different in causing allergic reactions when they are not. Normally α is equal to .05 or .01
although other significance levels can be used.
 Third, determine the test distribution to use. For ANOVA, the F distribution is used, which is a compilation of figures that match
the degrees of freedom and which determine whether the difference between means is actually statistically significant. An
example of an F distribution is found in Figure 12.3.
 Fourth, define the critical value. To find the critical F value, we must first calculate the degrees of freedom for our numerator
and our denominator as follows:

1. The degrees of freedom for the numerator is equal to N - 1, where N is the number of samples. For example, if we have
independent samples from five populations, N equals 5 and our degrees of freedom equal 4.

2. The degrees of freedom for the denominator is equal to K - N, where K is the total number of items in all samples and N is
the number of samples. So if we have 5 independent populations, our N value is 5. If each sample contained 6 items, then our
value for K equals 30 (6 + 6 + 6 + 6 + 6). The F table has column values for the numerator and row values
for the denominator.

3. Then, take your numerator value and your denominator value to find the appropriate F value. For example, if our
numerator is equal to 10 and our denominator is equal to 15, our F value from the table is equal to 2.54.

 Fifth, calculate the test statistic which is also referred to as FSTAT.


 Sixth and finally, state your decision rule based on the hypothesis. Reject H0 in favor of H1 if the test statistic F (calculated F
value) is > the critical value from step 4, otherwise fail to reject H0. The decision rule for ANOVA testing will always use the >
symbol.

Remember that ANOVA, like any statistical measure, produces results that are dependent on the quality and the size of the
sample chosen for study. It is important to use one of the proven sample selection methods when designing the study.

Assume that you are presenting three new products to the market with a goal of finding out which age demographic each will
appeal to. The products are defined as follows: Product 1, Product 2, and Product 3. You collect data on the ages of a random
sample of users.

Table 12.2.1 Ages of a Random Sample of Users of Products 1, 2, and 3


Step 1: Calculate the average of all the means

To calculate the average of all the means, the process will be as follows:

Total Mean= [10 (mean Product 1) + 10 (mean Product 2) + 10 (mean Product 3)] / [n1+n2+n3]
µtotal= [10*35 + 10*36.3 + 10*33.5] / [30]
= 34.9

Step 2: Calculate the variance within groups

To calculate the variance within groups, the process will be as follows:


= [778 + 1574 + 1665] / (30 – 3)
= 4017 / 27
= 149

Step 3: Calculate the variance between groups

To calculate the variance between groups, the process will be as follows:

Variance (between) = [10(µtotal - µ1)² + 10(µtotal - µ2)² + 10(µtotal - µ3)²] / (N - 1)


= [10(34.9 - 35)² + 10(34.9 - 36.3)² + 10(34.9 - 33.5)²] / (3 - 1)
= 39.3 / 2
= 19.65

Step 4: Calculate the FSTAT

To calculate the FSTAT, the process will be as follows:


FSTAT = Variance (between) / Variance (within)
= 19.65 / 149
FSTAT = 0.13

If we set the level of significance at 0.05, then the degrees of freedom (Df) are calculated as follows:

 Df (numerator) = N – 1 = 2
 Df (denominator) = K – N = 30 – 3 = 27

FCritical = 3.35
(Please check Figure 12.3 for this critical figure by looking at the top row for the Df (numerator) and the left-most
column for the Df (denominator).)

Step 5: Determine your Decision Rule

In this final step, you must make a decision whether to accept or reject the null hypothesis. Our hypothesis
statements were as follows:

H0: µ1 = µ2 = µ3
H1: Not all population means are equal.

We have also determined that the FSTAT was 0.13 while the FCritical was 3.35. Since FSTAT < FCritical,
we accept the null hypothesis, which states that µ1 = µ2 = µ3.

What are the implications of this result? The results show that there are no statistically significant differences
between the ages of those using Product 1, Product 2, and Product 3. For a company, this indicates that the products
are not differentiated according to demographics which may mean that marketing campaigns can be fairly similar
for all 3 products.
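The five steps above can be sketched directly in code from the example's summary figures (the raw ages table is not reproduced here; only the group means and within-group sums of squares quoted in the steps are used):

```python
# ANOVA on summary figures: group means 35, 36.3, 33.5; n = 10 per group;
# within-group sums of squares 778, 1574, 1665 (from the example above).

means = [35.0, 36.3, 33.5]
n_per_group = 10
ss_within = [778, 1574, 1665]
groups = len(means)
total_n = n_per_group * groups

# Step 1: grand mean of all observations (the courseware rounds this to 34.9)
grand = round(sum(m * n_per_group for m in means) / total_n, 1)

# Step 2: variance within groups = total SS within / (N - k)
var_within = sum(ss_within) / (total_n - groups)

# Step 3: variance between groups = SS between / (k - 1)
var_between = sum(n_per_group * (grand - m) ** 2 for m in means) / (groups - 1)

# Step 4: the F statistic
f_stat = var_between / var_within

# Step 5: compare with the critical value for df = (2, 27) at alpha = 0.05
f_critical = 3.35  # from the F table
accept_null = f_stat < f_critical  # True here: no significant difference
```

Because FSTAT is far below the critical value, the null hypothesis of equal means stands, matching the conclusion above.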

12.3 ANOVA in Business Settings

ANOVA is a statistically accurate and convenient way to make comparisons amongst groups. In business, ANOVA is often used
in these settings:

 Product research and development. Companies use ANOVA to determine the effectiveness of products and innovations. For
example, pharmaceutical companies determine the effectiveness of new drugs by studying treatment and comparison groups
and finding out if there are differences. Similarly, a company producing computer chips will be interested in comparing the
processing speeds of new chips in comparison to those currently on the market.
 Education and training. Organizations use ANOVA to evaluate the effectiveness of training and educational programs. In these
settings, the same groups of participants are evaluated before and after they enroll in these programs. ANOVA results are used
to compare the effectiveness of the programs based on how much the participants have gained in knowledge.
 Financial market information. ANOVA is useful in providing comparison between such financial indicators as dividend yields
amongst numerous stocks listed on the NYSE and NASDAQ.
 Human factor analysis. ANOVA is used to determine those factors that affect human performance such as fatigue, sleep
problems, and sensory overload.

ANOVA, however, has its limitations as follows:

 ANOVA assumes that the populations under study are normally distributed.
 ANOVA also assumes that the variances of the populations from each data group are equal.

Figure 12.3 An Example of an F Distribution


12.4 Chi-Square Distribution

The chi-square distribution is a test that verifies the difference between the expected frequencies and the observed
frequencies in categories within a set of data. In this test, we are not testing individual scores in a sample to
make inferences about the population. Instead, we are looking at frequencies of categories such as gender or hair colour.
The question asked by the chi-square test is: Within each category in the data, do the numbers of individuals or items recorded
differ significantly from those you expect?

For example, if a businessman expects that 70 percent of his new clients will be female and observes that only 45 percent
are, then the question is whether the difference between the expected and observed percentages of clients is due to a
sampling error or a real, statistical difference.

To conduct a chi-square test, the following conditions must be satisfied:

 The sample from which data is composed must be drawn randomly from a population.
 Data must be available in raw frequencies; in other words, already compiled or grouped data cannot be tested using the chi-
square distribution.
 Variables under study must be independent.

Chi-Square Test Statistic

To find values for the chi-square test, the business researcher must determine what the expected frequencies of the data are.
There are two ways to determine expected frequencies as follows:

1. One can assume that all frequencies are equal in each category. For example, an MBA university program may expect
that half of their applicants for the admission year 2012/2013 will be women while the other half will be men.

Program administrators can determine the expected frequency in this category of admission (gender) by dividing the number
in the sample by the number of categories. If the college is expecting 3,000 applications in two categories, women and men,
they must divide their sample by 2, the number of categories, to get the expected frequencies in each category.

2. Determine expected frequencies based on prior knowledge and assessments. In the MBA university program above, if
the previous year’s (2011/2012) applicants’ ratio is 55% women and 45% men, this is considered prior knowledge. For this
year, administrators can expect the ratio to be the same. To determine the expected frequencies, the following calculation is
done:
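For the MBA example, the prior-knowledge calculation amounts to multiplying the expected total by each category's prior proportion; a minimal sketch:

```python
# Expected frequencies from prior knowledge: the 55% women / 45% men split
# observed in 2011/2012, applied to an expected 3,000 applications
total_applications = 3000
expected_women = total_applications * 0.55  # 1,650 applications
expected_men = total_applications * 0.45    # 1,350 applications
```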

There are a number of steps to conduct the chi-square test on categories within a set of data as seen in Figure 12.4.1.

Figure 12.4.1 Chi-Square Test Steps


1. Establish the level of significance, α

The level of significance is standard and denoted by the symbol α, as we have seen in previous chapters. Typical values of the
level of significance are

 α = .05
 α = .01
 α = .001

2. Determine the statistical hypothesis

Null Hypothesis (H0): This is the hypothesis that the two variables are independent.
Alternate Hypothesis (H1): This is the hypothesis that the two variables are associated.

3. Calculate the test statistic, X²

The test statistic, X², is calculated by subtracting the expected frequency of each data point from the observed frequency,
squaring the difference, dividing the result by the expected frequency, and summing over all categories, as shown:

4. Determine the degrees of freedom

Chi-Square Test Example

The following table data was gathered by the school superintendent from a random sample of 255 employees in a number of
charter schools. At a 5% level of significance, the superintendent aimed to determine whether a school employee’s gender is
associated with their job title.

Table 12.4.1

Step 1

Based on the above table, we can begin calculating expected frequencies.


This is the expected frequency for school employees who are both male and teachers.

Step 2

The next step is to calculate the other expected frequencies and insert them in the table (between brackets):

Step 3

In this step, we determine the statistical hypothesis as follows:

Null Hypothesis (H0): There is no association between school employees’ gender and their job title.
Alternative Hypothesis (H1): There is an association between school employees’ gender and their job title.

Step 4
Test Statistic = 4.90

Step 5

In this step, we determine the critical value from the chi-square table 12.4.2 given below.

Step 6

In this final step, we reach the conclusion that: Test statistic < Critical value; therefore, we accept the null hypothesis, H0. The
conclusion is that there is no association between school employees’ gender and their job titles in the charter schools.
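The full procedure can be sketched end-to-end on a hypothetical gender-by-job-title table (the superintendent's actual 255-employee table is not reproduced here; the counts below are illustrative):

```python
# Chi-square test of independence on a hypothetical 2x2 table
observed = {("male", "teacher"): 30, ("male", "admin"): 20,
            ("female", "teacher"): 60, ("female", "admin"): 15}

genders = ["male", "female"]
titles = ["teacher", "admin"]
n = sum(observed.values())
row = {g: sum(observed[(g, t)] for t in titles) for g in genders}
col = {t: sum(observed[(g, t)] for g in genders) for t in titles}

# Expected frequency for each cell: (row total * column total) / grand total
expected = {(g, t): row[g] * col[t] / n for g in genders for t in titles}

# Test statistic: sum of (observed - expected)^2 / expected over all cells
chi2 = sum((observed[c] - expected[c]) ** 2 / expected[c] for c in observed)
df = (len(genders) - 1) * (len(titles) - 1)  # (rows - 1) * (cols - 1) = 1
critical = 3.841  # chi-square critical value for df = 1, alpha = 0.05
reject_null = chi2 > critical  # True here: gender and job title are associated
```

Unlike the superintendent's data, this illustrative table produces a test statistic above the critical value, so its null hypothesis of independence would be rejected.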

Chi-Square Distribution Table

Table 12.4.2
12.5 Summary of Key Concepts

 Analysis of Variance or ANOVA is a powerful statistical tool that is used to test hypotheses about differences in the mean values
of an outcome between two groups.
 The two-sample test, or t-test, is a hypothesis test for making a comparison between two independent samples, such as the
two groups of software users.
 A null hypothesis is a hypothesis stating that the difference between the treatment group and the comparison group is 0.
 An alternative hypothesis is a hypothesis stating that the difference between the treatment group and the comparison group is
not zero.
 ANOVA, like any statistical measure, produces results that are dependent on the quality and the size of the sample chosen for
study. It is important to use one of the proven sample selection methods when designing the study.
 In business settings, ANOVA is used in product research and development, education and training, financial market
information, and human factor analysis.
 ANOVA assumes that the populations under study are normally distributed.
 ANOVA also assumes that the variances of the populations from each data group are equal.
 The chi-square distribution is a test that verifies whether there is a significant difference between the expected frequencies and
the observed frequencies in one or more categories in a set of quantitative data.
 To determine expected frequencies, one can assume that all frequencies are equal in each category or determine them based
on prior knowledge.

Glossary of Terms

Alternative hypothesis: A hypothesis stating that the difference between the treatment group and the comparison
group is not zero.

ANOVA: Analysis of Variance.

Null hypothesis: A hypothesis stating that the difference between the treatment group and the comparison group is
zero.

Chi-square distribution: The distribution used in a test that verifies whether there is a significant difference between expected and observed
frequencies in one or more categories in a set of categorical data.

Expected frequency: Calculated data based on researcher assumptions or prior knowledge.

Observed frequency: Raw data gathered in a study and entered in a frequency table.

12.6 Chapter Review Questions

Descriptive Questions

1. List and describe three (3) business settings in which ANOVA is used to compare outcomes amongst groups.

2. Describe two (2) assumptions made in ANOVA.

3. State and give an example each of the four (4) types of two-sample tests.

4. Describe the steps required to conduct a chi-square test on a set of raw frequencies.

Multiple Choice Questions


Mark the correct answer.

1: All of the following are examples of two-sample tests EXCEPT:


a) Population means
b) Population standard deviation
c) Population variance
d) Population proportion

The following two (2) questions are based on the data in the table below.

2: The variance between the two groups of application users is:


a) 78
b) 88
c) 68
d) Cannot be calculated because of small sample

3: The variance within the two groups of application users is:


a) 31.9
b) 19.3
c) 13.9
d) Cannot be calculated because of small sample

4: If we ACCEPT the null hypothesis in this example of application users, the implication is that:
a) There are no differences in the means between the two groups
b) There are small differences in the means between the two groups
c) There are large differences in the means between the two groups
d) Cannot be calculated because of small sample

5: After conducting a chi-square test, a researcher finds that the test statistic is greater than the critical value. The conclusion
is to:
a) Reject the null hypothesis
b) Accept the null hypothesis
c) Reject the degrees of freedom
d) Gather more data

Answer Key:

1- b, 2- a, 3- c, 4- a, 5-a.

References

Anderson, D.R., Sweeney, D. J., and Williams, T.A. (2007). Statistics for Business and Economics. South-Western College
Publication. 10th Edition.
Fox, J. (2008). Applied Regression Analysis and Generalized Linear Models. Sage Publications.

Freedman, D., Purves, R., and Pisani, R. (2007). Statistics. W.W. Norton & Company. 4th Edition.

Kleinbaum, D., Kupper, L., Muller, K., & Nizam, A. (2008). Applied Regression Analysis and Other Multivariate Methods. 4th
Edition, Thomson Brooks/Cole.

CHAPTER 13

13 Using Excel In Business Statistics

Charts and Diagrams (Chapter 2)

Histograms

To create a histogram, refer to Data Set 1 in the Excel File titled ‘Book 1’, and follow these steps:

1. Choose the data table labeled ‘Monthly Power Drill Sales in 2008’.

2. Click on the data range (B4–B15).

3. From the Excel Menu, choose Insert, Column, 2-D Column.

4. From the Excel Menu, choose Design, Layout if you wish to make changes to the Histogram.
Scatter Plots

To create a scatter plot, refer to Data Set 1 in the Excel File titled ‘Book 1’, and follow these steps:

1. Choose the data table labeled ‘Level of Education vs. Years of Experience’.

2. Choose the data range (E4–F17).

3. From the Excel Menu, choose Insert, Scatter.

4. From the Excel Menu, choose Design, Layout if you wish to make changes to the Scatter plot.
13.1 Measures of Central Tendency (Chapter 3)

Mean

To calculate the mean of a data set, refer to Data Set 2 in the Excel File titled ‘Book 1’, and follow these steps:

1. Click on an empty cell.

2. To calculate a mean, this is the syntax in Excel: AVERAGE(number1, [number2],...)

3. Type the following in this order, =, average, (B3:B17).

4. Make sure that there are NO spaces so that the cell contains ‘=average(B3:B17)’, which returns the average or mean of the
numbers in cells B3 through B17.

5. Press Enter.
6. You will receive the following answer: 32.1.

Median

To calculate the median of a data set, refer to Data Set 2 in the Excel File titled ‘Book 1’, and follow these steps:

1. Click on an empty cell.

2. To calculate a median, this is the syntax in Excel: MEDIAN(number1, [number2],...)

3. Type the following in this order, =, median, (B3:B17).

4. Make sure that there are NO spaces so that the cell contains ‘=median(B3:B17)’, which returns the median of the numbers in cells
B3 through B17.

5. Press Enter.

6. You will receive the following answer: 34.


Mode

To calculate the mode of a data set, refer to Data Set 2 in the Excel File titled ‘Book 1’, and follow these steps:

1. Click on an empty cell.

2. To calculate a mode, this is the syntax in Excel: MODE(number1, [number2],...)

3. Type the following in this order, =, mode, (B3:B17).

4. Make sure that there are NO spaces so that the cell contains ‘=mode(B3:B17)’, which returns the mode of the numbers in cells B3
through B17.

5. Press Enter.

6. You will receive the following answer: 36.
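The three Excel functions above have direct counterparts in Python's standard statistics module. The sketch below uses hypothetical stand-in values, not the actual Data Set 2 from ‘Book 1’:

```python
from statistics import mean, median, mode

# Hypothetical stand-in for Data Set 2 (the real values live in 'Book 1')
values = [22, 25, 25, 30, 34, 36, 36, 36, 40, 41]

avg = mean(values)    # like =AVERAGE(B3:B17)
mid = median(values)  # like =MEDIAN(B3:B17)
top = mode(values)    # like =MODE(B3:B17)

print(avg, mid, top)  # 32.5 35.0 36
```

As in Excel, the median of an even-sized sample is the average of the two middle values, and the mode is the most frequent value.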


13.2 Measures of Dispersion (Chapter 3)

Standard deviation

To calculate the standard deviation of a data set, refer to Data Set 2 in the Excel
File titled ‘Book 1’, and follow these steps:

1. Click on an empty cell.

2. To calculate standard deviation, the syntax in Excel is: STDEV(number1, [number2],...)

3. Type the following in this order, =, stdev, (B3:B17).


4. Make sure that there are NO spaces so that the cell contains ‘=stdev(B3:B17)’,
which returns the standard deviation of the numbers in cells B3 through B17.

5. Press Enter.

6. You will receive the following answer: 6.7.

Variance

To calculate the variance of a data set, refer to Data Set 2 in the Excel File titled
‘Book 1’, and follow these steps:
1. Click on an empty cell.

2. To calculate a variance, this is the syntax in Excel: VAR(number1, [number2],...)

3. Type the following in this order, =, var, (B3:B17).

4. Make sure that there are NO spaces so that the cell contains ‘=var(B3:B17)’, which
returns the variance of the numbers in cells B3 through B17.

5. Press Enter.

6. You will receive the following answer: 45.2.
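Excel's STDEV and VAR use the sample formulas with an n − 1 denominator, as do Python's statistics.stdev and statistics.variance. A sketch with hypothetical stand-in values (not the actual Data Set 2):

```python
from statistics import stdev, variance

# Hypothetical stand-in data; both Excel's STDEV/VAR and these functions
# divide by n - 1 (the sample formulas)
values = [22, 25, 25, 30, 34, 36, 36, 36, 40, 41]

s = stdev(values)     # like =STDEV(B3:B17)
v = variance(values)  # like =VAR(B3:B17)

print(round(s, 2), round(v, 2))  # 6.64 44.06
```

The standard deviation is the square root of the variance, so the two results are consistent by construction.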


13.3 Hypothesis Testing (Chapter 7)

To test a hypothesis about a population mean, you can use Excel’s Z-test function, which standardizes the sample mean against a
hypothesized value using the sample standard deviation and returns a one-tailed probability. To run the Z-test on a data set, refer to Data Set
3 in the Excel File titled ‘Book 1’, and follow these steps:

1. Click on an empty cell.

2. To run a Z-test against a hypothesized mean x, this is the syntax in Excel: ZTEST(array,x,sigma)

3. The array refers to the range of data values to be analyzed, x refers to the hypothesized population mean to test against,
and sigma is the population standard deviation. If the population standard deviation is unknown, it is left blank and Excel
will use the sample standard deviation.

4. Type the following in this order, =, ztest, (B3:B17), x.


5. Press Enter.

6. For an x value of 42, the returned one-tailed p-value is 1.
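Excel's ZTEST result can be reproduced by hand: standardize the sample mean against the hypothesized value x and take the upper-tail normal probability. This is a sketch with hypothetical data (not Data Set 3), using the sample standard deviation as Excel does when sigma is omitted:

```python
from math import erf, sqrt
from statistics import mean, stdev

def z_test(data, x0):
    """One-tailed p-value, mirroring Excel's ZTEST(array, x) with sigma omitted."""
    n = len(data)
    z = (mean(data) - x0) / (stdev(data) / sqrt(n))
    phi = 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF
    return 1 - phi

# Hypothetical sample: is the population mean greater than 30?
sample = [28, 30, 31, 33, 35, 36, 29, 34, 32, 30]
p = z_test(sample, 30)
print(round(p, 3))  # 0.016
```

A small p-value (here well under 0.05) suggests the population mean exceeds the hypothesized value; a value near 1, as in the Excel example above, suggests it does not.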

13.4 Correlation Analysis (Chapter 8)

To calculate the correlation between two variables in a data set, refer to Data Set 4 in the Excel File titled ‘Book 1’, and follow
these steps:

1. Click on an empty cell.

2. To calculate the correlation between two variables, this is the syntax in Excel: CORREL(array1,array2).

3. Type the following in this order, =, correl, (A3:A14), (B3:B14).

4. Make sure that there are NO spaces so that the cell contains ‘=correl(A3:A14,B3:B14)’, which returns a value between -1 and
+1.

5. Press Enter.

6. You will receive the following answer: 0.88.
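CORREL computes the Pearson correlation coefficient, which can be sketched directly from its definition (the data below are hypothetical, not Data Set 4):

```python
from statistics import mean

def correl(xs, ys):
    """Pearson correlation coefficient, like Excel's CORREL(array1, array2)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# A perfectly linear positive relationship gives exactly +1
print(correl([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```

Values near +1 or -1 indicate a strong linear relationship; values near 0 indicate little linear association.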


13.5 Simple Linear Regression Analysis (Chapter 9)

To perform simple linear regression analysis between two variables in a data set, refer to Data Set 5 in the Excel File titled
‘Book 1’, and follow these steps:

1. From the Excel Menu, choose Data.

2. Then, choose Data Analysis.

3. Then, choose Regression

4. When prompted, Input Y Range (A3:A12), and Input X Range (B3:B12)

5. Press OK.

Screen 1
Screen 2
Screen 3

Given the information in Screen 3, the Excel coefficient outputs yield the following regression equation:
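Behind Excel's Regression tool lies the ordinary least-squares fit. A minimal sketch with hypothetical points (not Data Set 5), showing how the slope and intercept that Excel reports are computed:

```python
def least_squares(xs, ys):
    """Slope and intercept of the ordinary least-squares line y = b0 + b1 * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b0 = my - b1 * mx
    return b0, b1

# Points lying exactly on y = 1 + 2x recover those coefficients
b0, b1 = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)  # 1.0 2.0
```

With real, noisy data the fitted line will not pass through every point; the coefficients minimize the sum of squared vertical distances.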
13.6 Multiple Regression Analysis (Chapter 10)

To perform multiple regression analysis between variables in a data set, refer to Data Set 6 in the Excel File titled ‘Book 1’,
and follow these steps:

1. From the Excel Menu, choose Data.

2. Then, choose Data Analysis

3. Then, choose Regression

4. When prompted, Input Y Range (A4:A15), and Input X Range (B4:C15). Note that since there are two independent
variables (X1 and X2), the range will be across two columns of B and C to cover their values.

5. Press OK.

6. You will receive the following output:

Screen 1
Screen 2
Screen 3

Given the above Excel coefficient outputs, the regression equation is:
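With two independent variables, Excel's Regression tool performs the same least-squares fit extended to several predictors. A sketch via the normal equations, using hypothetical data (not Data Set 6):

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def multiple_regression(X, y):
    """Coefficients [b0, b1, b2, ...] from the normal equations (X'X) b = X'y."""
    rows = [[1.0] + list(r) for r in X]  # prepend an intercept column of ones
    k = len(rows[0])
    XtX = [[sum(row[i] * row[j] for row in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yv for row, yv in zip(rows, y)) for i in range(k)]
    return solve(XtX, Xty)

# Hypothetical data generated from y = 1 + 2*x1 + 3*x2
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1, 3, 4, 6, 8]
coefs = multiple_regression(X, y)
print([round(c, 6) for c in coefs])  # [1.0, 2.0, 3.0]
```

This is why the Input X Range in Excel spans both columns B and C: each column contributes one coefficient to the fitted equation.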
13.7 Analysis of Variance (ANOVA) (Chapter 11)

To perform analysis of variance in a data set, refer to Data Set 7 in the Excel File titled ‘Book 1’, and follow these steps:

1. From the Excel Menu, choose Data.


2. Then, choose Data Analysis

3. Then, choose ANOVA: Single Factor

4. When prompted, Input Range (A3:C12). Note that since there are 3 Products whose means are being compared, the range
spans the three columns A, B and C to cover their values.

5. Press OK.

Screen 1

Screen 2
Screen 3

These results show that the FSTAT is 0.13 while the FCritical is 3.35. Since FSTAT < FCritical, you will accept the null hypothesis, as
covered in Chapter 12.
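The FSTAT that Excel reports is the ratio of the between-group to the within-group mean square. A sketch with hypothetical product data (not Data Set 7):

```python
from statistics import mean

def anova_f(groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    k = len(groups)                  # number of groups
    n = sum(len(g) for g in groups)  # total observations
    grand = mean(x for g in groups for x in g)
    ssb = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)  # between-group SS
    ssw = sum((x - mean(g)) ** 2 for g in groups for x in g)    # within-group SS
    return (ssb / (k - 1)) / (ssw / (n - k))

# Hypothetical monthly sales for three products
f_stat = anova_f([[10, 12, 11], [11, 13, 12], [10, 11, 12]])
print(round(f_stat, 2))  # 1.0
```

A small F statistic, as here, indicates that the group means differ little relative to the spread inside each group, which favors the null hypothesis of equal means.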
References

Fox, J. (2008). Applied Regression Analysis and Generalized Linear Models. Sage Publications.

Kleinbaum, D., Kupper, L., Muller, K., & Nizam, A. (2008). Applied Regression Analysis and Other Multivariate Methods. 4th
Edition, Thomson Brooks/Cole.

CHAPTER 14
14 Current Case Studies in Applied Statistics

Overview

The business world is complex, with numerous data sources. Making sense of the information collected, or of that which is available
in the public domain, is necessary for competitiveness. Designing effective business studies and analyzing the data they generate are
necessary to answer management questions. As we have covered in earlier chapters, data gathered is only as good as the
hypotheses posed and instruments used. Therefore, it is important to devote sufficient time and effort to developing hypotheses
and choosing the right type of data gathering tools.
Additionally, how information is presented is important to all business stakeholders. Whether using simple bar graphs or
displaying complex inferential formulas, the presentation of data has to be comprehensive and geared towards answering
management questions.

In this chapter, we will cover two in-depth case studies, one in the field of management while the other in human resources.

The objectives of this chapter are as follows:

 Devise management questions


 Determine study design and survey questions
 Display data through tables and graphs
 Develop hypotheses and determine critical values
 Calculate test statistics and display inferential test results
 Use Excel functions to make necessary calculations
 Answer management questions

14.1 Management: Precision Beauty Inc.

In many management settings, managers and executives are interested in determining whether two categorical, random
variables are statistically related. A Chi-square test is a good test to make this determination. Chi-square tests are also
referred to as a test for independence of association. The goal is to understand whether an association found between the two
variables happened by chance or indicates a statistically significant relationship. As with other hypothesis tests, Chi-square
follows multiple steps.

Precision Beauty Inc. (Precision) is a leading edge company in skin products for both men and women. The company’s
meteoric rise over the past decade has been due to their unisex products. Using modern marketing strategies, the company
created and is now dominating a market of skin products that both men and women use. The products have natural
ingredients, are made through sustainable production practices, and are packaged in recyclable containers. The
environmental consciousness message of the company has appealed to a younger market that is less concerned with gender
differences.

Precision development managers are interested in the answer to this question:

Management Question: “Are purchasing preferences for three types of moisturizer (Aloe, Shea, and Coconut) similar or
different across gender?”

Solution:

For the company to answer its management question, there are a number of steps that it must take through exploratory and
descriptive data analysis, and causal data analysis (Chi-square test).

Company managers tasked their research department with developing a study to answer the management questions above.
The research manager developed a survey that would have two (2) questions posed to regular customers of the company’s
skin products as follows:
From the exploratory phase, the research manager gathers completed surveys from a random sample of 3,150 individuals.

Table 14.1 Percentage of Total Customers by Purchasing Preference

To provide a visual describing data gathered during the exploratory phase, the table above with the raw data breaking down
purchasing preferences by gender is transformed into a bar graph as follows:

Figure 14.1 Purchasing Preferences by Gender, Precision Beauty, Inc.


Another method to describe the data is to combine female and male purchases and determine the percentage of total
customers choosing one ingredient over another. As you can see in Figure 14.2, 30% of the overall survey respondents prefer
Aloe while 44% and 26% prefer Shea and Coconut, respectively.

Figure 14.2 Percentages of Total Customers by Purchasing Preference

The goal of this basic data analysis is to examine the data, transform it into desired statistics, and present it in a variety of
ways such as bar diagrams. While it provides useful information about overall preferences, this type of analysis does not
address the primary hypothesis test for which the study was designed.

The initial management question was: “Are purchasing preferences for three types of moisturizer (Aloe, Shea, and
Coconut) similar or different across gender?”
Statistically, this question translates to carrying out a hypothesis test using the Chi-square method. Set at a 5% level of
significance, the research manager aimed to associate a customer’s purchasing preferences with their gender. In other
words, the goal is to determine if these two random, categorical variables are statistically related.

First Step:

The initial step in this hypothesis test is to calculate the purchasing frequencies by gender. As we have covered in the Chi-square
distribution chapter, this is done by calculating the expected number of female and male customers preferring one
purchase over another. Table 14.1 represents the Observed frequencies (O) of purchasing preferences. In other words, data
collected is based on observations made through survey testing.

However, the formula for calculating the Expected frequency (E) is as follows:

For example, the expected frequency of males preferring Aloe is:

This is the expected frequency for customers who are both male and prefer Aloe. The next step is to calculate the other
expected frequencies and insert them in the table as seen below:

Table 14.2 Expected Frequencies of Total Customers by Purchasing Preference

Second Step
In this step, we determine the statistical hypothesis as follows:

Null Hypothesis (H0): There is no association between purchasing preferences and gender.
Alternative Hypothesis (H1): There is an association between purchasing preferences and gender.

Third Step

Table 14.3 Test Statistic Table (Observed and Expected Frequencies)

The value can also be determined by using the Excel Function as follows:
Thus, the hypothesis statements can be rewritten as follows:

Accepting the null hypothesis means that there is no statistically significant relationship between the two categorical, random
variables.

Calculating the test statistic by using the Excel Function is as follows:

Excel Function = CHITEST(actual_range, expected_range)

To conduct these calculations:

1. Open a new Excel spreadsheet file.

2. Organize the data found in the first two columns of Table 14.3:

 Enter Observed values as a single row


 Enter Expected values as a single row

3. Enter ‘=’ into any empty cell and type: =CHITEST

4. A dialogue box will appear that will prompt you to enter the actual_range. To do so, highlight the Observed values row
then enter ‘,’.

5. A dialogue box will also appear that will prompt you to enter the expected_range. To do so, highlight the Expected values
row then press ‘enter’.

6. You will receive a value of 1.15886 × 10⁻⁹. This is a probability, or p-value, which is the chance that purchasing preferences
are happening randomly and have nothing to do with gender.

The following screenshot illustrates what you will see when using the Excel function:

Since the p-value is much smaller than 0.05, we can deduce that there is a statistically significant relationship between
purchasing preferences and gender.
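The CHITEST p-value is derived from the chi-square test statistic, the sum of (O − E)²/E over all cells. A sketch with hypothetical observed and expected rows (not the Precision Beauty figures):

```python
def chi_square_stat(observed, expected):
    """Chi-square test statistic: the sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical purchase counts across six gender-by-product cells
observed = [30, 50, 20, 40, 40, 20]
expected = [35, 45, 20, 35, 45, 20]
stat = chi_square_stat(observed, expected)
print(round(stat, 2))  # 2.54
```

When observed and expected frequencies match exactly, the statistic is zero; larger values mean the observed data depart further from what the null hypothesis predicts.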

Fourth and Final Step

In this final step, we reach the conclusion that since the calculated test statistic (50.4) is larger than the critical value (11.07) at 5
degrees of freedom, we can reject the null hypothesis, H0. Using the Excel function, we reach the same
conclusion, as the p-value calculated is much smaller than 0.05.

The conclusion is that there is an association between customers’ purchasing preferences and their gender.

If we return to the management question, we can confidently say that purchasing preferences for the three types of
moisturizer (Aloe, Shea, and Coconut) are different across gender. This type of information gained from the Chi-square
hypothesis test can allow management to gear their marketing and promotional efforts to these differences. For example, the
company may decide that low-selling Aloe moisturizers amongst women may need to be studied further. The next step is to
ask women whether it is the product itself (quality, smell, texture, etc.) or the packaging/promotions.

When management embarks on a research study, the design is geared to the basic question it wishes answered. Without
causal research design, information may not be examined at a depth that allows for these questions to be answered. For
Precision Beauty, Inc., their products can continue to be geared towards both men and women.

However, they may choose to continue studying how the two groups prefer to purchase their products. For example, the
following criteria may influence current and future purchasing preferences:

 Income
 Age
 Education level
 Urban access to stores
 Promotional materials and special marketing
 Brand recognition

14.2 Human Resources: Affinity Plus, LLC

In many business settings, inferential statistics allow managers to estimate the value of a population parameter based on a random sample.
A confidence interval estimate is one such statistical manipulation that allows for a random sample to estimate the value of a
population parameter. Specifically, the estimate is expressed as a range of values, derived from the sample, within which we
can say, with a certain confidence, that the population parameter lies. The confidence level, then, is the probability that
the derived interval will contain the true population parameter.

Affinity Plus, LLC (Affinity) is a human resources management company that provides consulting services to companies,
private and public. While the company does not replace an in-house human resources department within an organization,
Affinity provides research and strategic consulting on specific issues. The expertise of the company is in its ability to provide
clear data analysis as well as strategic recommendations based on this analysis.

Affinity has been approached by a corporation concerned about the commute times of their employees and the impact on
their productivity. The overall population of the city has previously shown to have a mean commute time of 18 minutes with a
standard deviation of 2.5 minutes. The overall population mean commute distance is also found to be 16 miles with a
standard deviation of 6 miles. However, since the corporation’s managers did not have hard data on just how much their
employees commute, they hired Affinity to conduct a study to answer the following question:

Management Question: “Are the corporation’s employees spending longer hours and travelling longer distances, on average,
in daily commutes to work?”

Solution:

Affinity researchers developed a two-question survey to distribute to a random sample of corporate employees. The goal of
the survey was to gain average commute times (in minutes) and average commute distances (in miles).

Table 14.4 displays commute times and distances travelled by sample and population. From a cursory examination of the data,
it is clear that company employees spend more time, on average, commuting back and forth to work than city residents.
The table also shows that the employees travel longer distances, on average, than city residents.

Table 14.4 Commute Time and Distance, Sample vs. Population

Another way to present the data is to display the above table graphically. Figure 14.3 shows that the difference in commute
times between sample and population seems larger than the difference in commute distances.
Figure 14.3 Commute Time and Distance, Sample vs. Population

While the figure is useful in seeing the differences in commute times and distances, the information does not describe the
sample further. For the research manager to make inferences about the entire corporate population (note that this is NOT
the larger city one) from the given sample of 75 employees, confidence interval estimation is necessary.

The research manager gathers completed surveys from 75 employees.

In statistical terms, the first part of the management question above can be detailed as follows:

 Find the 95% confidence limits for the actual average commute time of all company employees based on the sample.

The following information is available to the researchers:

To determine the critical value of Zα/2, you can use Table 14.4.2 (Normal Distribution Z-Table). The value can also be
determined by using the Excel Function as follows:
To calculate the confidence interval estimate, this information is inserted into the following formula:

The lower confidence limit = 31 minutes – 0.5658 = 30.43 minutes


The higher confidence limit = 31 minutes + 0.5658 = 31.57 minutes

Using the Excel Function, confidence estimate can also be derived as follows:

In this first part of the management question, the analysis shows that there is a 95% chance that the interval between 30.43
minutes and 31.57 minutes will include the actual mean commute time of all the corporate employees at the company. This
interval also shows that, as the company suspected, employees spend over 70% more time on their commute than
the city’s average (31 minutes vs. 18 minutes).
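The interval above follows directly from the formula x̄ ± z·σ/√n. Recomputing it with the figures from this example:

```python
from math import sqrt

sample_mean = 31.0  # sample mean commute time, minutes
sigma = 2.5         # known population standard deviation, minutes
n = 75              # sample size
z = 1.96            # critical value for a 95% confidence level

margin = z * sigma / sqrt(n)  # 1.96 * 2.5 / sqrt(75) = 0.5658
lower, upper = sample_mean - margin, sample_mean + margin
print(round(lower, 2), round(upper, 2))  # 30.43 31.57
```

The same calculation with the distance figures (mean 22 miles, sigma 6 miles) reproduces the second interval in this case study.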

In statistical terms, the second part of the management question above can be detailed as follows:

 Find the 95% confidence limits for the actual average commute distance of all company employees based on the sample.

The following information is available to the researchers:

To determine the critical value of Zα/2 , you can use Table 14.4.2 (Normal Distribution Z-Table). The value can also be
determined by using the Excel Function as follows:
To calculate the confidence interval estimate, this information is inserted into the following formula:

The lower confidence limit = 22 miles – 1.3579 = 20.64 miles


The higher confidence limit = 22 miles + 1.3579 = 23.36 miles

Using the Excel Function, confidence estimate can also be derived as follows:

In this second part of the management question, the analysis shows that there is a 95% chance that the interval between
20.64 miles and 23.36 miles will include the actual mean commute distance of all the corporate employees at the company.
This interval also shows that employees travel longer distances in their commute than the city’s average (22 miles vs. 16 miles).

Please note that the confidence intervals calculated depend on the alpha or confidence probabilities that are set. In the above
examples, the probabilities are set at 95%, which indicates an alpha value of 0.05 (1 - 0.95). However, confidence intervals set at
99% will be different, as seen below:

1. Find the 99% confidence limits for the actual average commute time of all company employees based on the sample.
2. Find the 99% confidence limits for the actual average commute distance of all company employees based on the sample.

14.3 Statistical Tables

CHI-SQUARE DISTRIBUTION TABLE


NORMAL DISTRIBUTION Z-TABLE
