
Quantitative Methods Online Course


Pre-Assessment Test Introduction
Welcome to the pre-assessment test for the HBS Quantitative Methods Tutorial.

All questions must be answered for your exam to be scored.

Navigation:
To advance from one question to the next, select one of the answer choices or, if applicable, enter your own answer, and click the “Submit” button. After submitting your answer, you will not be able to change it, so make sure you are satisfied with your selection before you submit each answer. You may also skip a question by pressing the forward advance arrow.
Please note that you can return to “skipped” questions using the “Jump to unanswered question” selection menu or the
navigational arrows at any time. Although you can skip a question, you must navigate back to it and answer it - all questions
must be answered for the exam to be scored.

In the briefcase, links to Excel spreadsheets containing z-value and t-value tables are provided for your convenience. For some
questions, additional links to Excel spreadsheets containing relevant data will appear immediately below the question text.

Your results will be displayed immediately upon completion of the exam.

After completion, you can review your answers at any time by returning to the exam.

Good luck!

Frequently Asked Questions


How difficult are the questions on the exam? The exam questions have a level of difficulty similar to the exercises in
the course.

Can I refer to statistics textbooks and online resources to help me during the test? Yes. This is an open-book
examination.

May I receive assistance on the exam? No. Although we strongly encourage collaborative learning at HBS, work on
exams such as the assessment tests must be entirely your own. Thus you may neither give nor receive help on any exam
question.

Is this a timed exam? No. You should take about 60-90 minutes to complete the exam, depending on your familiarity
with the material, but you may take longer if you need to.

What happens if I am (or my internet connection is) interrupted while taking the exam? Your answer choices
will be recorded for the questions you were able to complete and you will be able to pick up where you left off when you
return to the exam site.

How do I see my exam results? Your results will be displayed as soon as you submit your answer to the final question.
The results screen will indicate which questions you answered correctly.

Overview & Introduction


Welcome to QM...
Welcome! You are about to embark on a journey that will introduce you to the basics of quantitative and statistical analysis.
This course will help you develop your skills and instincts in applying quantitative methods to formulate, analyze, and solve
management decision-making problems.

Click on the link labeled "The Tutorial and its Method" in the left menu to get started.

The Tutorial and its Method


QM is designed to help you develop quantitative analysis skills in business contexts. Mastering its content will help you
evaluate management situations you will face not only in your studies but also as a manager. Click on the right arrow icon
below to advance to the next page.

This isn't a formal or comprehensive course in quantitative methods. QM won't make you a statistician, but it will help
you become a more effective manager.

The tutorial's primary emphasis is on developing good judgment in analyzing management problems. Whether you are
learning the material for the first time or are using QM to refresh your quantitative skills, you can expect the tutorial to
improve your ability to formulate, analyze, and solve managerial problems.

You won't be learning quantitative analysis in the typical textbook fashion. QM's interactive nature provides frequent
opportunities to assess your understanding of the concepts and how to apply them — all in the context of actual
management problems.

You should take 15 to 20 hours to run through the whole tutorial, depending on your familiarity with the material. QM
offers many features we hope you will explore, utilize, and enjoy.

The Story and its Characters


Naturally, the most appropriate setting for a course on statistics is a tropical island...

Somehow, "internship" is not the way you'd describe your summer plans to your friends. You're flying out to Hawaii after
all, staying at a 5-star hotel as a Summer Associate with Avio Consulting.

This is a great learning opportunity, no doubt about it. To think that you had almost skipped over this summer
internship, as you prepared to enroll in a two-year MBA program this fall.

You are also excited that the firm has assigned Alice, one of its rising stars, as your mentor. It seems clear that Avio
partners consider you a high potential intern — they are willing to invest in you with the hope that you will later return
after you complete your MBA program.

Alice recently received the latest in a series of quick promotions at Avio. This is her first assignment as a project lead:
providing consulting assistance to the Kahana, an exclusive resort hotel on the Hawaiian island Kauai.

Needless to say, one of the perks of the job is the lodging. The Kahana's brochure looks inviting — luxury suites, fine
cuisine, a spa, sports activities. And above all, the pristine beach and glorious ocean.

After your successful interview with Avio, Alice had given you a quick briefing on the hotel and its manager, Leo.

Leo inherited the Kahana just three years ago. He has always been in the hospitality industry, but the sheer scope of the
luxury hotel's operations has him slightly overwhelmed. He has asked for Avio's help to bring a more rigorous approach
to his management decision-making process.

Using the Tutorial: A Guide to Tutorial Resources


Before you start packing your beach towel, read this section to learn how to use this tutorial to your greatest advantage.

QM's structure and navigational tools are easy to master. If you're reading this text, you must have clicked on the link
labeled "Using the Tutorial" on the left.

There are three types of interactive clips: Kahana Clips, Explanatory Clips, and Exercise Clips.

Kahana Clips pose problems that arise in the context of your consulting engagement at the Kahana. Typically, one clip
will have Leo assign you and Alice a specific task. In a later Kahana Clip you will analyze the problem, and you and Alice
will present your results to Leo for his consideration. The Kahana Clips will give you exposure to the types of business
problems that benefit from the analytical methods you'll be learning, and a context for practicing the methods and
interpreting their results.

To fully benefit from the tutorial, you should solve all of Leo's problems. At the end of the tutorial, a multiple-choice
assessment exam will evaluate your understanding of the material.

In Explanatory Clips, you will learn everything needed to analyze management problems like Leo's. Complementing
the text are graphs, illustrations, and animations that will help you understand the material. Keep on your toes: you'll be
asked questions even in Explanatory Clips that you should answer to check your understanding of the concepts.

Some Explanatory Clips give you directions or tips on how to use the analytical and computational features of Microsoft
Excel. Facility with the necessary Excel functions will be critical to solving the management decision problems in this
course.

QM is supplemented with spreadsheets of data relating to the examples and problems presented. When you see a
Briefcase link in a clip, we strongly encourage you to click on the link to access the data. Then, practice using the Excel
functions to reproduce the graphs and analyses that appear in the clips.

You will also see Data links that you should click to view summary data relating to the problem.

Exercise Clips provide additional opportunities for you to test your understanding of the material. They are a resource
that you can use to make sure that you have mastered the important concepts in each section.

Work through exercises to solidify your knowledge of the material. Challenge exercises provide opportunities to tackle
somewhat more advanced problems. The challenge exercises are optional - you should not have to complete them to gain
the mastery needed to pass the tutorial assessment test.

The arrow buttons immediately below are used for navigation within clips. If you've made it this far, you've been using the
one on the right to move forward.

Use the one on the left if you want to back up a page or two.

In the upper right of the QM tutorial screen are three buttons. From left to right they are links to the Help, Briefcase,
and Glossary.

To access additional Help features, click on the Help icon.

In your Briefcase you'll find all the data you'll need to complete the course, neatly stored as Excel workbooks. In many
of the clips there will be links to specific documents in the Briefcase, but the entire Briefcase is available at any time.

In the Glossary/Index you'll find a list of helpful definitions of terms used in the course, along with brief descriptions of
the Excel functions used in the course.

We encourage you to use all of QM's features and resources to the fullest. They are designed to help you build an intuition
for quantitative analysis that you will need as an effective and successful manager.

... and Welcome to Hawaii!


The day of departure has come, and you're in flight over the Pacific Ocean. Alice graciously let you take the window seat,
and you watch as the foggy West Coast recedes behind you.

I've been to Hawaii before, so I'll let you have the experience of seeing the islands from the air before you set foot on them.

This Leo sounds like quite a character. He's been in business all his life, involved in many ventures — some more successful
than others. Apparently, he once owned and managed a gourmet Spam restaurant!

Spam is really popular among the islanders. Leo tried to open a second location in downtown Honolulu for the tourists, but
that didn't do so well. He had to declare bankruptcy.

Then, just three years ago, his aunt unexpectedly left him the Kahana. Now Leo is back in business, this time with a large
operation on his hands.

It sounds to me like he's the kind of manager who usually relies on gut instincts to make business decisions, and likes to
take risks. I think he's hired Avio to help him make managerial decisions with, well, better judgment. He wants to learn how
to approach management problems in a more sophisticated, analytical fashion.

We'll be using some basic statistical tools and methods. I know you're no expert in statistics, but I'll fill you in along the way.
You'll be surprised at how quickly they'll become second nature to you. I'm confident you'll be able to do quite a bit of the
analytic work soon.

Leo and the Hotel Kahana


Once your plane touches down in Kauai, you quickly pick up your baggage and meet your host, Leo, outside the airport.

Inheriting the Kahana came as a big surprise. My aunt had run the Kahana for a long time, but I never considered that
she would leave it to me.

Anyway, I've been trying my best to run the Kahana the way a hotel of its quality deserves. I've had some ups and downs.
Things have been fairly smooth for the past year now, but I've realized that I have to get more serious about the way I
make decisions. That's where you come into the picture.

I used to be quite a risk-taker. I made a lot of decisions on impulse. Now, when I think of what I have to lose, I just want
to get it right.

After you arrive at the Kahana, Leo personally shows you to your rooms. "I have a table reserved for the three of us at 8 in
the main restaurant," Leo announces. "You just have to try our new chef's mango and brie tart."

Basics: Data Description

Leo's Data Mine


After your welcome dinner in the Kahana's main restaurant, Leo asks you and Alice to meet him the next morning. You
wake up early enough to take a short walk on the beach before you make your way to Leo's office.

Good morning! I hope you found your rooms comfortable last night and are starting to recover from your trip.

Unfortunately, I don't have much time this morning. As you requested on the phone, I've assembled the most important
data on the Kahana. It wasn't easy — this hasn't been the most organized hotel in the world, especially since I took over.
There's just so much to keep track of.

Thank you, Leo. We'll have a look at your data right away, so we can get a more detailed understanding of the Kahana and
the type of data you have available for us to work with. Anything in particular that you'd like us to focus on as we peruse
your files?

Yes. There are two things in particular that have been on my mind recently.

For one, we offer some recreational activities here at the Kahana, including a scuba diving certification course. I contract
out the operations to a local diving school. The contract is up soon, and I need to renew it, hire another school, or
discontinue offering scuba lessons altogether.

I'd like you to get me some quotes from other diving schools on the island so I get an idea of the competition's pricing and
how it compares to the school I've been using.

I'm also very concerned about hotel occupancy rates. As you might imagine, the Kahana's occupancy fluctuates during the
year, and I'd like to know how, when, and why. I'd love to have a better feeling for how many guests I can expect in a given
month.

These files contain some information about tourism on the island, but I'd really like you to help me make better sense of it.
Somehow I feel that if I could understand the patterns in the data, I could better predict my own occupancy rates.

That's what we're here to do. We'll take a look at your files to get better acquainted with the Kahana, and then focus on
diving school prices and occupancy patterns.

Thanks, or as we say in Hawaiian, Mahalo. By the way, we're not too formal here on Hawaii. As you probably noticed, Alice,
your suite includes a room that has been set up as an office. But feel free to take your work down to the beach or by the pool
whenever you like.

Thanks! We'll certainly take advantage of that.

Later, under a parasol at the beach, you pore over Leo's folders. Feeling a bit overwhelmed, you find yourself staring out to
sea.

Alice tells you not to worry: "We have a number of strategies we can use to compile a mountain of data like this into concise
and useful information. But no matter what data you are working with, always make sure you really understand the data
before doing a lot of analysis or making managerial decisions."

What is Alice getting at when she tells you to "understand the data"? And how can you develop such an understanding?

Describing and Summarizing Data


Data can be represented by graphs like histograms. These visual displays allow you to quickly recognize patterns in the
distribution of data.

Working with Data


Information overload. Inventory costs. Payroll. Production volume. Asset utilization. What's a manager to do?

The data we encounter each day have valuable information buried within them. As managers, correctly analyzing
financial, production, or marketing data can greatly improve the quality of the decisions we make.

Analyzing data can be revealing, but challenging. As managers, we want to extract as much relevant information and insight as possible from the data we have available.

When we acquire a set of data, we should begin by asking some important questions: Where do the data come from? How
were they collected? How can we help the data tell their story?

Suppose a friend claims to have measured the heights of everyone in a building. She reports that the average height was
three and a half feet. We might be surprised...

... until we learn that the building is an elementary school.

We'd also want to know if our friend used a proper measuring stick. Finally, we'd want to be sure we knew how she
measured height: with or without shoes.

Before starting any type of formal data analysis, we should try to get a preliminary sense of the data. For example, we
might first try to detect any patterns, trends, or relationships that exist in the data.

We might start by grouping the data into logical categories. Grouping data can help us identify patterns within a single
category or across different categories. But how do we do this? And is this often time-consuming process worth it?

Accountants think so. Balance Sheets and Profit and Loss Statements arrange information to make it easier to
comprehend.

In addition, accountants separate costs into categories such as capital investments, labor costs, and rent. We might ask:
Are operating expenses increasing or decreasing? Do office space costs vary much from year to year?

Comparing data across different years or different categories can give us further insight. Are selling costs growing more
rapidly than sales? Which division has the highest inventory turns?

Histograms

In addition to grouping data, we often graph them to better visualize any patterns in the data. Seeing data displayed
graphically can significantly deepen our understanding of a data set and the situation it describes.

To see the value a graphical approach can add, let's look at worldwide consumption of oil and gas in 2000. What
questions might we want to answer with the energy data? Which country is the largest consumer? How much energy do
most countries use?


In order to create a graph that provides good visual insight into these questions, we might sort the countries by their
level of energy consumption, then group together countries whose consumption falls in the same range — e.g., the
countries that use 100 to 199 million tonnes per year, or 200 to 299 million tonnes.


We can find the number of countries in each range, and then create a bar graph in which the height of each bar
represents the number of countries in each range. This graph is called a histogram.

A histogram shows us where the data tend to cluster. What are the most common values? The least common? For
example, we see that most countries consume less than 100 million tonnes per year, and the vast majority less than 200
million tonnes. Only three countries, Japan, Russia, and the US, consume more than 300 million tonnes per year.

Why are there so many countries in the first range — the lowest consumption? What factors might influence this?
Population might be our first guess.

Yet despite a large population, India's energy consumption is significantly less than that of Germany, a much smaller
nation. Why might this be? Clearly other factors, like climate and the extent of industrialization, influence a country's
energy usage.

Outliers
In many data sets, there are occasional values that fall far from the rest of the data. For example, if we graph the age
distribution of students in a college course, we might see a data point at 75 years. Data points like this one that fall far
from the rest of the data are known as outliers. How do we interpret them?

First, we must investigate why an outlier exists. Is it just an unusual, but valid value? Could it be a data entry error?
Was it collected in a different way than the rest of the data? At a different time?

We might discover that the data point refers to a 75-year-old retiree taking the course for fun.

After making an effort to understand where an outlier comes from, we should have a deeper understanding of the
situation the data represent. Then, we can think about how to handle the outlier in our analysis. Typically, we do one
of three things: leave the outlier alone, or — very rarely — remove it or change it to a corrected value.

A senior citizen in a college class may be an outlier, but his age represents a legitimate value in the data set. If we
truly want to understand the age distribution of all students in the class, we would leave the point in.

Or, if we now realize that what we really want is the age distribution of students in the course who are also enrolled in
full-time degree-granting programs, we would exclude the senior citizen and all other non-degree program students
enrolled in the course.

Occasionally, we might change the value of an outlier. This should be done only after examining the underlying
situation in great detail.

For example, if we look at the inventory graph below, a data point showing 80 pairs of roller-blades in inventory
would be highly unusual.

Notice that the data point "80" was recorded on April 13th, and that the inventory was 10 pairs on April 12th, and 6
on April 14th.

Based on our management understanding of how inventory levels rise and fall, we realize that the value of 80 is
extraordinarily unlikely. We conclude that the data point was likely a data entry error. Further investigation of sales
and purchasing records reveals that the actual inventory level on that day was 8, not 80. Having found a reliable
value, we correct the data point.

Excluding or changing data is not something we do often. We should never do it to help the data 'fit' a conclusion we
want to draw. Such changes to a data set should be made on a case-by-case basis only after careful investigation of
the situation.

Summary
With any data set we encounter, we must find ways to allow the data to tell their story. Ordering and graphing data
sets often expose patterns and trends, thus helping us to learn more about the data and the underlying situation. If
data can provide insight into a situation, they can help us to make the right decisions.


Creating Histograms
Note: Unless you have installed the Excel Data Analysis ToolPak add-in, you will not be able to create histograms using the Histogram tool. However, we suggest you read through the instructions to learn how Excel creates histograms so you can construct them in the future when you do have access to the Data Analysis ToolPak.

To check whether the ToolPak is installed on your computer, go to the Data tab of the Ribbon in Excel 2007. If "Data Analysis" appears in the Ribbon, the ToolPak has already been installed. If not, click the Office Button in the top left and select "Excel Options." Choose "Add-Ins," highlight "Analysis ToolPak" in the list, and click "Go." Check the box next to Analysis ToolPak and click "OK." Excel will then walk you through a setup process to install the ToolPak.

Creating a histogram with Excel involves two steps: preparing our data, and processing them with the Data Analysis
Histogram tool.

To prepare the data, we enter or copy the values into a single column in an Excel worksheet.

Often, we have specific ranges in mind for classifying the data. We can enter these ranges, which Excel calls "bins,"
into a second column of data.
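For example, assuming we want to match the consumption ranges described earlier, we could enter bin values of 99, 199, 299, and so on: the Histogram tool counts each value that is greater than the previous bin value and less than or equal to the current one, and collects anything above the last bin in a "More" category.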

In the Ribbon, select the Data tab, and then choose Data Analysis.

In the Data Analysis pop-up window, choose Histogram and click OK.

Click on the Input Range field and enter the range of data values by either typing the range or by dragging the
cursor over the range.

Next, to use the bins we specified, click on the Bin Range field and enter the appropriate range. Note: if we don't
specify our own bins, Excel will create its own bins, which are often quite peculiar.

Click the Chart Output checkbox to indicate that we want a histogram chart to be generated in addition to the
summary table, which is created by default.

Click New Worksheet Ply, and enter the name you would like to give the output sheet.

Finally, click OK, and the histogram with the summary table will be created in a new sheet.

Central Values for Data


Graphs are very useful for gaining insight into data. However, sometimes we would like to summarize the data in a
concise way with a single number.

The Mean
Often, we'd like to summarize a set of data with a single number. We'd like that summary value to describe the data
as well as possible. But how do we do this? Which single value best represents an entire set of data? That depends on
the data we're investigating and the type of questions we'd like the data to answer.

What number would best describe employee satisfaction data collected from annual review questionnaires? The
numerical average would probably work quite well as a single value representing employees' experiences.

To calculate average — or mean — employee satisfaction, we take all the scores, sum them up, and divide the result
by 11, the number of surveys. The Greek letter mu represents the mean of the data set.
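In symbols, for a data set of n values $x_1, x_2, \ldots, x_n$, the mean is

$$\mu = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i,$$

so for the satisfaction survey, n = 11.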

The mean is by far the most common measure used to describe the "center" or "central tendency" of a data set.
However, it isn't always the best value to represent data. Outliers can exercise undue influence and pull the mean
value towards one extreme.

In addition, if the distribution has a tail that extends out to one side — a skewed distribution — the values on that
side will pull the mean towards them. Here, the distribution is strongly skewed to the right: the high value of US
consumption pulls the mean to a value higher than the consumption of most other countries. What other numbers
can we use to find the central tendency of the data?

The Median
Let's look at the revenues of the top 100 companies in the US. The mean revenue of these companies is about $42
billion. How should we interpret this number? How well does this average represent the revenues of these
companies?

When we examine the revenue distribution graphically, we see that most companies bring in less than $42 billion of
revenue a year. If this is true, why is the mean so high?


As our intuition might tell us, the top companies have revenues that are much higher than $42 billion. These higher
revenues pull up the average considerably.


In cases like income, where the data are typically very skewed, the mean often isn't the best value to represent the
data. In these cases, we can use another central value called the median.


The median is the middle value of a data set whose values are arranged in numerical order. Half the values are higher
than the median, and half are lower.


The median revenue of the top 100 US companies is $30 billion, significantly less than the $42 billion mean. Half of the companies earn less than $30 billion, and half earn more than $30 billion.


Median revenue is a more informative revenue estimate because it is not pulled upwards by a small number of high-
revenue earners. How can we find the median?


With an odd number of data points, listed in order, the median is simply the middle value. For example, consider
this set of 7 data points. The median is the 4th data point, $32.51.

In a data set with an even number of points, we average the two middle values — here, the fourth and fifth values —
and obtain a median of $41.92.

When deciding whether to use a mean or median to represent the central tendency of our data, we should weigh the
pros and cons of each. The mean weighs the value of every data point, but is sometimes biased by outliers or by a
highly skewed distribution.

By contrast, the median is not biased by outliers and is often a better value to represent skewed data.

The Mode
A third statistic to represent the "center" of a data set is its mode: the data set's most frequently occurring value. We
might use the mode to represent data when knowing the average value isn't as important as knowing the most
common value.

In some cases, data may cluster around two or more points that occur especially frequently, giving the histogram
more than one peak. A distribution that has two peaks is called a bimodal distribution.

Summary
To summarize a data set using a single value, we can choose one of three values: the mean, the median, or the mode. They are often called summary statistics or descriptive statistics. All three give a sense of the "center" or "central tendency" of the data set, but we need to understand how they differ before using them: the mean incorporates every value but can be pulled toward outliers or a skewed tail; the median is the middle value and is not biased by outliers, making it better for skewed data; and the mode is the most frequently occurring value, useful when the most common value matters more than the average.

Finding The Mean In Excel


To find the mean of a data set entered in Excel, we use the AVERAGE function.

We can find the mean of numerical values by entering the values in the AVERAGE function, separated by commas.

In most cases, it's easier to calculate a mean for a data set by indicating the range of cell references where the data are
located.

Excel ignores blank values in cells, but not zeros. Therefore, we must be careful not to put a zero in the data set if it
does not represent an actual data point.
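For example, assuming the eleven satisfaction scores were entered in cells A2 through A12 (a hypothetical layout), the formula =AVERAGE(A2:A12) would return their mean.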

Finding The Median In Excel


Excel can find the median, even if a data set is unordered, using the MEDIAN function.

The easiest way to calculate a data set's median is to select a range of cell references.

Finding The Mode In Excel


Excel can also find the most common value of a data set, the mode, using the MODE function.

If more than one mode exists in a data set, Excel will find the one that occurs first in the data.

Mean, median, and mode are fairly intuitive concepts. Already, Leo's mountain of data seems less intimidating.

Variability
The mean, median and mode give you a sense of the center of the data, but none of these indicate how far the data are
spread around the center. "Two sets of data could have the same mean and median, and yet be distributed completely
differently around the center value," Alice tells you. "We need a way to measure variation in the data."

The Standard Deviation


It's often critical to have a sense of how much data vary. Do the data cluster close to the center, or are the values widely
dispersed?

Let's look at an example. To identify good target markets, a car dealership might look at several communities and find
the average income of each. Two communities — Silverhaven and Brighton — have average household incomes of
$95,500 and $97,800. If the dealer wants to target households with incomes above $90,000, he should focus on
Brighton, right?

We need to be more careful: the mean income doesn't tell the whole story. Are most of the incomes near the mean, or is
there a wide range around the average income? A market might be less attractive if fewer households have an income
above the dealer's target level. Based on average income alone, Brighton might look more attractive, but let's take a
closer look at the data.

Despite having a lower average income, incomes in Silverhaven have less variability, and more households are in the
dealer's target income range. Without understanding the variability in the data, the dealer might have chosen Brighton,
which has fewer targeted homes.

Clearly it would be helpful to have a simple way to communicate the level of variability in the household incomes in two
communities.

Just as we have summary statistics like the mean, median, and mode to give us a sense of the 'central tendency' of a
data set, we need a summary statistic that captures the level of dispersion in a set of data.

The standard deviation is a common measure for describing how much variability there is in a set of data. We represent the standard deviation with the Greek letter sigma (σ).

The standard deviation emerges from a formula that looks a bit complicated initially, so let's try to understand it at a
conceptual level first. Then we'll build up step by step to help understand where the formula comes from.

The standard deviation tells us how far the data are spread out. A large standard deviation indicates that the data are
widely dispersed. A smaller standard deviation tells us that the data points are more tightly clustered together.

Calculating the Standard Deviation
A hotel manager has to staff the front reception desk in her lobby. She initially focuses on a staffing plan for
Saturdays, typically a heavy traffic day. In the hospitality industry, like many service industries, proper staffing can
make the difference between unhappy guests and satisfied customers who want to return.

On the other hand, overstaffing is a costly mistake. Knowing the average number of customer requests for services
during a shift gives the manager an initial sense of her staffing needs; knowing the standard deviation gives her
invaluable additional information about how those requests might vary across different days.

The average number of customer requests is 172, but this doesn't tell us there are 172 requests every Saturday. To
staff properly, the hotel manager needs a sense of whether the number of requests will typically be between 150 and
195, for example, or between 120 and 220.

To calculate the standard deviation for data — in this case the hotel traffic — we perform two steps. The first is to
calculate a summary statistic called the variance.

Each Saturday's number of requests lies a certain distance from 172, the mean number of requests. To find the
variance, we first sum the squares of these differences. Why square the differences?

A hotel manager would want information about the magnitude of each difference, which can be positive, negative, or
zero. If we simply summed the differences between each Saturday's requests and the mean, positive and negative
differences would cancel each other out.

But we are interested in the magnitude of the differences, regardless of their sign. By squaring the differences, we get
only positive numbers that do not cancel each other out in a sum.

The formula for variance adds up the squared differences and divides by n-1 to get a type of "average" squared
difference as a measure of variability. (The reason we divide by n-1 to get an average here is a technicality beyond the
scope of this course.) The variance in the hotel's front desk requests is 637.2. Can we use this number to express the
variability of the data?

Sure, but variances don't come out in the most convenient form. Because we square the differences, we end up with a
value in 'squared' requests. What is a request-squared? Or a dollar-squared, if we were solving a problem involving
money?

We would like a way to express variability that is in the same units as the original data — front-desk requests, for
example. The standard deviation — the first formula we saw — accomplishes this.

The standard deviation is simply the square root of the variance. It returns our measure to our original units. The
standard deviation for the hotel's Saturday desk traffic is 25.2 requests.
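Written out, with n observations $x_1, \ldots, x_n$ and mean $\mu$, the two measures are

$$\text{variance} = \frac{\sum_{i=1}^{n}(x_i - \mu)^2}{n-1} \qquad \text{and} \qquad \sigma = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \mu)^2}{n-1}}.$$

For the Saturday traffic data, the variance is 637.2, and its square root gives the standard deviation of about 25.2 requests.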

Interpreting the Standard Deviation
What does a standard deviation of 25.2 requests tell us? Suppose the standard deviation had been 50 requests.

With a larger standard deviation, the data would be spread farther from the mean. A higher standard deviation
would translate into more difficult staffing: when request traffic is unusually high, disgruntled customers wait in long
lines; when traffic is very low, desk staff are idle.

For a data set, a smaller standard deviation indicates that more data points are near the mean, and that the mean is
more representative of the data. The lower the standard deviation, the more stable the traffic, thereby reducing both
customer dissatisfaction and staff idle time.

Fortunately, we almost never have to calculate a standard deviation by hand. Spreadsheet tools like Excel make it
easy for us to calculate variance and standard deviation.

Summary
The standard deviation measures how much data vary about their mean value.

Finding the Standard Deviation in Excel
Excel's STDEV function calculates the standard deviation.

To find the standard deviation, we can enter data values into the STDEV formula, one by one, separated by commas.

In most cases, however, it's much easier to select a range of cell references to calculate a standard deviation.

To calculate variance, we can use Excel's VAR function in the same way.
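For example, assuming the Saturday request counts were listed in cells A2 through A27 (a hypothetical layout), =STDEV(A2:A27) would return the standard deviation of about 25.2 requests, and =VAR(A2:A27) the variance of about 637.2.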

The Coefficient of Variation


The standard deviation measures how much a data set varies from its mean. But the standard deviation only tells you
so much. How can you compare the variability in different data sets?

A standard deviation describes how much the data in a single data set vary. How can we compare the variability of two
data sets? Do we just compare their standard deviations? If one standard deviation is larger, can we say that data set is
"more variable"?

Standard deviations must be considered within the data's context. The standard deviations for two stock indices below
— The Street.Com (TSC) Internet Index and the Pacific Exchange Technology (PET) Index — were roughly equivalent
over a period. But were the two indices equally variable?


If the average price of an index is $200, a $20 standard deviation is relatively high (10% of the average); if the average
is $700, $20 is relatively low (not quite 3% of the average). To gauge volatility, we'd certainly want to know that PET's
average index price was over three and half times higher than TSC's average index price.


To get a sense of the relative magnitude of the variation in a data set, we want to compare the standard deviation of
the data to the data's mean.


We can translate this concept of relative volatility into a standardized measure called the coefficient of variation, which
is simply the ratio of the standard deviation to the mean. It can be interpreted as the standard deviation expressed as a
percent of the mean.
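In symbols,

$$\text{coefficient of variation} = \frac{\sigma}{\mu},$$

so a $20 standard deviation on a $200 mean gives a coefficient of variation of 0.10 (10%), while the same $20 on a $700 mean gives about 0.03 (roughly 3%).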

To get a feeling for the coefficient of variation, let's compare a few data sets. Which set has the highest relative
variation? Click the answer you select.

Because the coefficient of variation has no units, we can use it to compare different kinds of data sets and find out
which data set is most variable in this relative sense.

The coefficient of variation describes the standard deviation as a fraction of the mean, giving you a standard measure of
variability.

Summary
The coefficient of variation expresses the standard deviation as a fraction of the mean. We can use it to compare variation across data sets with different scales or units.

Applying Data Analysis


After a good night's sleep, you meet Alice for breakfast.

"It's time to get started on Leo's assignments. Could you get those price quotes from diving schools and prepare a
presentation for Leo? We'll want to present our findings as neatly and concisely as possible. Use graphs and summary
statistics wherever appropriate. Meanwhile, I'll start working on Leo's hotel occupancy problem."

Pricing the Scuba Schools


In addition to the school Leo is currently using, you find 20 other scuba services in the phone book. You call those 20
and get price quotes on how much they would charge the Kahana per guest for a Scuba Certification Course.

Prices

You create a histogram of the prices. Use the bin ranges provided in the data spreadsheet, or experiment with your own
bins. If you do not have the Excel Analysis Toolpak installed, click on the Briefcase link labeled "Histogram" to see the
finished histogram.

Prices
Histogram

This distribution is skewed to the right, since a tail of higher prices extends to the right side of the histogram. The shape
of the distribution suggests that:

Prices
Histogram

You calculate the key summary statistics. The correct values are (Mean, Median, Standard Deviation):

Prices
Histogram

Your report looks good. This graphic is very helpful. At the moment, I'm paying $330 per guest, which is about average
for the island. Clearly, I could get a cheaper deal — only 6 schools would charge a higher rate. On the other hand,
maybe these more expensive schools offer a better diving experience? I wonder how satisfied my guests have been with
the course offered by my current contractor...

Exercise 1: VA Linux Stock Bonanza


After a company completes its initial public offering, how is the ownership of common stock distributed between
individuals in the firm, often termed "named insiders"?

Let's examine a company, VA Linux, that chose to sell its stock in an Initial Public Offering (IPO) during the IPO
craze in the late 1990s.

According to its prospectus, after the IPO, VA Linux would have the following distribution of outstanding shares of
common stock owned by insiders:


From the VA Linux common stock data, what could we learn by creating a histogram? (Choose the best answer)

Exercise 2: Employee Turnover


Here is a histogram graphing annual turnover rates at a consulting firm.

Which summary statistic better describes these data?

Exercise 3: Honidew Internship


The J. B. Honidew Corporation offers a prestigious summer internship to first-year students at a local business
school. The human resources department of Honidew wants to publish a brochure to advertise the position.

To attract a suitable pool of applicants, the brochure should give an indication of Honidew's high academic
expectations. The human resources manager calculates the mean GPA of the previous 8 interns, to include in the
brochure.

What is the mean GPA of the former interns?

Interns' GPA's

In 1997, J. B. Honidew's grandson's girlfriend was awarded the internship, even though her GPA was only 3.35.
In the presence of outliers or a strongly skewed data set, the median is often a better measure of the 'center'.
What's the median GPA in this data set?

Interns' GPA's

Exercise 4: Scuba Regulations


Safety equipment typically needs to fall within very precise specifications. Such specifications apply, for
example, to scuba equipment using a device called a "rebreather" to recycle oxygen from exhaled air.

Recycled air must be enriched with the right amount of oxygen from the tank before delivery to the diver. With
too little oxygen, the diver can become disoriented; too much, and the diver can experience oxygen poisoning.
Minimizing the deviation of oxygen concentration levels from the specified level is clearly a matter of life and
death!

A scuba equipment-testing lab compared the oxygen concentrations of two different brands of rebreathers, A
and B. Examine the data. Without doing any calculations, for which of the two rebreathers does the oxygen
concentration appear to have a lower standard deviation?

Notice that data set A's extreme values are closer to the center, and more of its data points cluster near the center of the set. Even without calculations, we have a good knack for seeing which set is more variable.

We can back up our observations: using the standard deviation formula or the STDEV function in Excel, we can calculate that the standard deviation of A is 0.58%, whereas that of B is 1.05%.

Exercise 5: Fluctuations in Energy Prices


After decades of government control, states across the US are deregulating energy markets. In a deregulated
market, electricity prices tend to spike in times of high demand.

This volatility is a concern. A primary benefit to consumers in a regulated market is that prices are fairly stable.
To provide a baseline measure for the volatility of prices prior to deregulation, we want to compute the standard
deviation of prices during the 1990s, when electricity prices were largely regulated.

From 1990 to 2000, the average national price in July of 500 kWh of electricity ranged between $45.02 and $50.55. What is the standard deviation of these eleven prices?

Electricity Prices

Excel makes the job much easier, because all that's required is entering the data into cells and inputting the
range of cells into the =STDEV() function. The result is $2.02.

On the other hand, to calculate the standard deviation by hand, use the formula:
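$$\sigma = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \mu)^2}{n-1}},$$

where the $x_i$ are the eleven July prices, $\mu$ is their mean, and n = 11.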

First, calculate the mean, $48.40. Then, find the difference between each data point and the mean, and square it. The sum of these squared differences is 40.79. Divide by the number of points minus one (11 - 1 = 10 in this case) to obtain 4.08. Taking the square root of 4.08 gives us the standard deviation, $2.02.

Exercise 6: Big Mart Personal Care Products


Suppose you are a purchasing agent for a wholesale retailer, Big-Mart. Big-Mart offers several generic versions
of household items, like deodorant, to consumers at a considerable discount.

Every 18 months, Big-Mart requests bids from personal care companies to produce these generic products.

After simply choosing the lowest individual bidder for years, Big-Mart has decided to introduce a vendor "score
card" that measures multiple aspects of each vendor's performance. One of the criteria on the score card is the
level of year-to-year fluctuation in the vendor's pricing.

Compare the variability of prices from each supplier. Which company's prices vary the least from year to year in
relation to their average price, as measured by the coefficient of variation?
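For a purely hypothetical illustration (these figures are not from the Big-Mart bid data): if Supplier A's bids average $2.00 with a standard deviation of $0.10, its coefficient of variation is 0.05, or 5%; if Supplier B's bids average $3.00 with a standard deviation of $0.12, its coefficient of variation is 0.04, or 4%. Even though B's standard deviation is larger, B's prices vary less in relation to their average.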

Summary

Pleased with your work, Alice decides to teach you more data description techniques, so you can take over a greater
share of the project.

Relationships Between Variables


So far, you have learned how to work with a single variable, but many managerial problems involve several factors that need to be considered simultaneously.

Two Variables
We use histograms to help us answer questions about one variable. How do we start to investigate patterns and
trends with two variables?

Let's look at two data sets: heights and weights of athletes. What can we say about the two data sets? Is there a
relationship between the two?

Our intuition tells us that height and weight should be related. How can we use the data to inform that intuition?
How can we let the data tell their story about the strength and nature of that relationship?

As always, one of our first steps is to try to visualize the data.

Because we know that each height and weight belong to a specific athlete, we first pair the two variables, with one
height-weight pair for each athlete.

Plotting these data pairs on axes of height and weight — one data point for each athlete in our data set — we can
see a relationship between height and weight. This type of graph is called a "scatter diagram."

Scatter diagrams provide a visual summary of the relationship between two variables. They are extremely helpful
in recognizing patterns in a relationship. The more data points we have, the more apparent the relationship
becomes.

In our scatter diagram, there's a clear general trend: taller athletes tend to be heavier.

We need to be careful not to draw conclusions about causality when we see these types of relationships.

Growing taller might make us a bit heavier, but height certainly doesn't tell the whole story about our weights.

Assuming causality in the other direction would be just plain wrong. Although we may wish otherwise, growing
heavier certainly doesn't make us taller!

The direction and extent of causality might be easy to understand with the height and weight example, but in
business situations, these issues can be quite subtle.

Managers who use data to make decisions without a firm understanding of the underlying situation often make blunders that in hindsight can appear as ludicrous as assuming that gaining weight can make us taller.

Why don't we try graphing another pair of data sets to see if we can identify a relationship? On a scatter diagram,
we plot for each day the number of massages purchased at a spa resort versus the total number of guests visiting
the resort.

We can see a relationship between the number of guests and the number of massages: the more guests staying at the resort, the more massages purchased — up to a point, after which massages level off.

Why does the number of massages reach a plateau? We should investigate further. Perhaps there are limited
numbers of massage rooms at the spa. Scatter plots can give us insights that prompt us to ask good questions,
those that deepen our understanding of the underlying context from which the data are drawn.

Variable and Time


Sometimes, we are not as interested in the relationship between two variables as we are in the behavior of a
single variable over time. In such cases, we can consider time as our second variable.

Suppose we are planning the purchase of a large amount of high-speed computer memory from an electronics
distributor. Experience tells us these components have high price volatility. Should we make the purchase now?
Or wait?

Assuming we have price data collected over time, we can plot a scatter diagram for memory price, in the same
way we plotted height and weight. Because time is one of the variables, we call this graph a time series.

Time series are extremely useful because they put data points in temporal order and show how data change over
time. Have prices been steadily declining or rising? Or have prices been erratic over time? Are there seasonal
patterns, with prices in some months consistently higher than in others?

Time series will help us recognize seasonal patterns and yearly trends. But we must be careful: we shouldn't rely
only on visual analysis when looking for relationships and patterns.


False Relationships
Our intuition tells us that pairs of variables with a strong relationship on a scatter plot must be related to each other. But human intuition isn't foolproof: we often infer relationships where there are none. We must be careful to avoid some common pitfalls.

Let's look at an example. For US presidents of the last 150 years, there seems to be a connection between being
elected in a year that is a multiple of 20 (1900, 1920, 1940, etc.) and dying in office. Abraham Lincoln (elected in
1860) was the first victim of this unfortunate relationship.


James Garfield (elected 1880) was assassinated in office in 1881, and William McKinley (1900), Warren Harding (1920), Franklin Roosevelt (1940), and John F. Kennedy (1960) also died in office.


Ronald Reagan (elected 1980) only narrowly survived an assassination attempt. What do the data suggest about
the president elected in 2020?

Probably nothing. Unless we have a reasonable theory about the connection between the two variables, the
relationship is no more than an interesting coincidence.

Hidden Variables
Even when two data sets seem to be directly related, we may need to investigate further to understand the
reason for the relationship.

We may find that the reason is not due to any fundamental connection between the two variables themselves,
but that they are instead mutually related to another underlying factor.

Suppose we're examining sales of ice-hockey pucks and baseballs at a sporting goods store.

The sales of the two products form a relationship on a scatter plot: when puck sales slump, baseball sales jump.
But are the two data sets actually related? If so, why?

A third, hidden factor probably drives both data sets: the season. In winter, people play ice hockey. In spring and
summer, people play baseball.

If we had simply plotted puck and baseball sales without thinking further, we might not have considered the
time of year at all. We could have neglected a critical variable driving the sales of both products.

In many business contexts, hidden variables can complicate the investigation of a relationship between almost
any two variables.

A final point: Keep in mind that scatter plots don't prove anything about causality. They never prove that one
variable causes the other, but simply illustrate how the data behave.

Summary
Plotting two variables helps us see relationships between two data sets. But even when relationships exist, we
still need to be skeptical: is the relationship plausible? An apparent relationship between two variables may
simply be coincidental, or may stem from a relationship each variable has with a third, often hidden variable.

Creating Scatter Diagrams


To create a scatter diagram in Excel with two data sets, we need to first prepare the data, and then use Excel's
built in chart tools to plot the data.

To prepare our data, we need to be sure that each data point in the first set is aligned with its corresponding
value in the other set. The sets don't need to be contiguous, but it's easier if the data are aligned side by side in
two columns.

If the data sets are next to each other, simply select both sets.

Next, from the Insert tab in the toolbar, select Scatter in the Charts bin from the Ribbon, and choose the first
type: Scatter with Only Markers.

Excel will insert a basic, unformatted scatter plot into the worksheet, with the first column of data represented on the X-axis and the second column of data on the Y-axis.

We can include a chart title and label the axes by selecting Quick Layout from the Ribbon and choosing
Layout 1.

Then we can add the chart title and label the axes by selecting and editing the text.

Finally, our scatter diagram is complete. You can explore more of Excel's new Chart Tools to edit and design
elements of your chart.

Correlation
By plotting two variables on a scatter plot, we can examine their relationship. But can we measure the strength of
that relationship? Can we describe the relationship in a standardized way?

Humans have an uncanny ability to discern patterns in visual displays of data. We "know" when the relationship
between two variables looks strong ...

... or weak ...

... linear ...

... or nonlinear ...

... positive (when one variable increases, the other tends to increase) ...

... or negative (when one variable increases, the other tends to decrease).

Suppose we are trying to discern if there is a linear relationship between two variables. Intuitively, we notice when
data points are close to an imaginary line running through a scatter plot.

Logically, the closer the data points are to that line, the more confidently we can say there is a linear relationship
between the two variables.

However, it is useful to have a simple measure to quantify and communicate to others what we so readily perceive
visually. The correlation coefficient is such a measure: it quantifies the extent to which there is a linear relationship
between two variables.

To describe the strength of a linear relationship, the correlation coefficient takes on values between -1 and +1.
Here's a strong positive correlation (about 0.85) ...

... and here's a strong negative correlation (about -0.90).

If every point falls exactly on a line with a negative slope, the correlation coefficient is exactly -1; if every point falls exactly on a line with a positive slope, it is exactly +1.

At the extremes of the correlation coefficient, we see relationships that are perfectly linear, but what happens in
the middle?

Even when the correlation coefficient is 0, a relationship might exist — just not a linear relationship. As we've seen,
scatter plots can reveal patterns and help us better understand the business context the data describe.

To reinforce our understanding of how our intuition about the strength of a linear relationship between variables
translates into a correlation coefficient, let's revisit the examples we analyzed visually earlier.

Influence of Outliers
In some cases, the correlation coefficient may not tell the whole story. Managers want to understand the
attendance patterns of their employees. For example, do workers' absence rates vary by time of year?

Suppose a manager suspects that his employees skip work to enjoy the good life more often as the temperature
rises. After pairing absences with daily temperature data, he finds the correlation coefficient to be 0.466.

While not a strong linear relationship, a coefficient of 0.466 does indicate a positive relationship — suggesting
that the weather might indeed be the culprit.

But look at the data — besides a few outliers, there isn't a clear relationship. Seeing the scatter plot, the manager
might realize that the three outliers correspond to a late-summer, three-day transportation strike that kept some
workers homebound the previous year.

Without looking at the data, the correlation coefficient can lead us down false paths. If we exclude the outliers,
the relationship disappears, and the correlation essentially drops to zero, quieting any suspicion of weather. Why
do the outliers influence our measure of linearity so much?

As a summary statistic for the data, the correlation coefficient is calculated numerically, incorporating the value
of every data point. Just as it does with the mean, this inclusiveness can get us into trouble...

Because measures like correlation give more weight to points distant from the center of the data, outliers can
strongly influence the correlation coefficient of the entire set. In these situations, our intuition and the measure

we use to quantify our intuition can be quite different. We should always attempt to reconcile those differences
by returning to the data.

Summary
The correlation coefficient characterizes the strength and direction of a linear relationship between two data
sets. The value of the correlation coefficient ranges between -1 and +1.

Finding in Excel
Excel's CORREL function calculates the correlation coefficient for two variables. Let's return to our data on
athletes' height and weight.

Enter the data set into the spreadsheet as two paired columns. We must make sure that each data point in the
first set is aligned with its corresponding value in the other set.

To compute the correlation, simply enter the two variables' ranges, separated by a comma, into the CORREL
function as shown below.

The order in which the two data sets are selected does not matter, as long as the data "pairs" are maintained.
With height and weight, both values certainly need to refer to the same person!
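
If you prefer to check the calculation outside of Excel, here is a minimal Python sketch; the height and weight numbers are made up for illustration, and np.corrcoef returns the same value CORREL would.

import numpy as np

# Hypothetical paired data: each height lines up with the same athlete's weight.
height = np.array([70, 72, 68, 75, 71, 69, 74, 73])          # inches
weight = np.array([180, 195, 160, 210, 185, 170, 205, 200])  # pounds

r = np.corrcoef(height, weight)[0, 1]   # plays the role of =CORREL(height range, weight range)
print(round(r, 2))                      # the order of the two arguments does not matter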

Occupancy and Arrivals


Alice is eager to move forward: "With your new understanding of scatter diagrams and correlation, you'll be able to
help me with Leo's hotel occupancy problem."

In the hotel industry, one of the most important management performance measures is room occupancy rate, the
percentage of available rooms occupied by guests.

Alice suggests that the monthly occupancy rate might be related to the number of visitors arriving on the island
each month. In a geographically isolated location like Hawaii, visitors almost all arrive by airplane or cruise ship,
so state agencies can gather very precise data on arrivals.

Alice asks you to investigate the relationship between room occupancy rates and the influx of visitors, as measured
by the average number of visitors arriving to Kauai per day in a given month. She wants a graphical overview of
this relationship, and a measure of its strength.

Leo's folders include data on the number of arrivals on Kauai, and on average hotel occupancy rates in Kauai, as
tracked by the Hawaii Department of Business, Economic Development, and Tourism. Click on the link to access
these data.

Kauai Data
Source

The best way to graphically represent the relationship between arrivals and occupancy is a _______.

Kauai Data
Source

You generate the scatter diagram using the data file and Excel's Chart Wizard. The relationship can be
characterized as

Kauai Data
Source

You calculate the correlation coefficient. Enter the correlation coefficient in decimal notation with 2 digits to the
right of the decimal (e.g., enter "5" as "5.00"). Round if necessary.

Kauai Data
Source

To find the correlation coefficient, open the Kauai Data file. In any empty cell, type =CORREL(B2:B37,C2:C37).
When you hit enter, the correct answer, 0.71, will appear.

Kauai Data

Together with Alice, you compile your findings and present them to Leo.

Source

I see. The relationship between the number of people arriving on Kauai and the island's hotel occupancy rate
follows a general trend, but not a precise pattern. Look at this: in two months with nearly the same average
number of daily arrivals, the occupancy rates were very different — 68% in one month and 82% in the other.

But why should they be so different? When people arrive on the island, they have to sleep somewhere. Do more
campers come to Kauai in one month, and more hotel patrons in the other?

Well, that might be one explanation. There could be differences in the type of tourists arriving. The vacation
preferences of the arrivals would be what we call a hidden variable.

Another hidden variable might be the average length of stay. If the length of stay varies month to month, then so
will hotel occupancy. When 50 arrivals check into a hotel, the occupancy rate will be higher if they spend 10 days
each at the hotel than if they spend only 3 days.

I'm following you, but I'm beginning to see that the occupancy issue is more complex than I expected. Let's get
back to it at a later time. The scuba school contract is more pressing at the moment.

Exercise 1: The Effectiveness of Search Engines


As online retailing expands, many companies are interested in knowing how effective search engines are in
helping consumers find goods online.

Computer scientists study the effectiveness of such search engines and compare how many results search
engines recall and the precision with which they recall them. "Precision" is another way of saying that the search
found its target, for example a page containing both the phrases "winter parka" and "Eddie Bauer."

What could you say about the relationship between the precision and the number of results recalled?

Source

Exercise 2: Education and Income


Is an education a good investment in your future? Some very successful business executives are college
dropouts, but is there a relationship in the general population between income and education level?

Consider the following scatter plot, which lists the income and years of formal education for 18 people. What is
the correlation between the two variables?

Source

Though we should always calculate the correlation coefficient if we want to have a precise measure, it's good to
have a rough feel for the correlation between two variables we see plotted on a scatter diagram. For the income-
education data, the coefficient is nearest to _______.

Sampling & Estimation


Introduction: The Scuba Problem
Leo asks you to help him evaluate the Kahana's contract with the scuba school.

Scuba diving lessons are an ideal way for our guests to enjoy their vacation or take a break from their business
activities. We have an excellent coral reef, and scuba diving is becoming very popular among vacationers and
business travelers.

We started our year-round diving program last year, contracting a local diving school to do a scuba certification
course. The one-year trial contract is now up for renewal.

Maintaining the scuba offerings on-site isn't cheap. We have to staff the scuba desk seven days a week, and we
subsidize the costs associated with each course. So I want to get a good handle on how satisfied the guests are with
the lessons before I decide whether or not to renew the contract.

The hotel has a database with information about which guests took scuba lessons and when. Feel free to take a look
at it, but I can't spend a fortune figuring this out. And I need to know as soon as possible, since our contract expires
at the end of the month.

Alice convinces you to do some field research and join her for a scuba diving lesson. You return late that afternoon
exhausted but exhilarated. Alice is especially enthusiastic.

"Well, I certainly give the lessons two thumbs up. And we haven't even been out to sea yet!

"But our opinions alone can't decide the matter. We shouldn't infer from our experience that Leo's clientele as a
whole enjoyed the scuba certification course. After all, we may have caught the instructor on his best day this year."

Alice suggests creating a survey to find out how satisfied guests are with the scuba diving school.

Generating Random Samples

Naturally, you can't ask the opinion of every guest who took scuba lessons over the past year. You have to survey a
few guests, and from their opinions draw conclusions about hotel guests in general. The guests you choose to survey
must be representative of all of the guests who have taken the scuba course at the resort. But how can you be sure you
get a good sample?

How to Create a Representative and Unbiased Sample


As managers, we often need to know something about a large group of people or products. For example, how many
defective parts does a large plant produce each year? What are the average annual earnings of a Wall Street
investment banker? How many people in our industry plan to attend the annual conference?

When it is too costly to gather the information we want to know about every person or every thing in an entire
group, we often ask the question of a subset, or sample of the group. We then try to use that information to draw
conclusions about the whole group.

To take a sample, we first select elements from the entire group, or "population," at random. We then analyze that
sample and try to infer something about the total population we're interested in. For example, we could select a
sample of people in our industry, ask them if they plan to attend the annual conference, and then infer from their
answers how many people in the entire industry plan to attend.

For example, if 10% of the people in our sample say they will attend, we might feel quite confident saying that
between 7% and 13% of our entire population will attend.

This is the general structure of all the problems we'll address in this unit — we'll work out the details as we go
forward. We want to know something about a population large enough to make examining every population
member impractical.

We first select elements from the population at random...

...then analyze that sample...

...and then draw an inference about the total population we're interested in.

Taking a Random Sample


The first trick to sampling is to make sure we select a sample that broadly represents the entire group we're
interested in. For example, we couldn't just ask the conference organizers if they wanted to attend. They would
not be representative of the whole group — they would be biased in favor of attending the conference!

To get a good sample, we must make sure we select the sample "at random" from the full population. This means
that every person or thing in the population is equally likely to be selected. If there are 15,000 people in the
industry, and we are choosing a sample of 1,000, then every person needs to have the same chance — 1 out of 15
— of being selected.

Selecting a random sample sounds easy, but actually doing it can be quite challenging. In this section, we'll see
examples of some major mistakes people have made while trying to select a random sample, and provide some
advice about how to avoid the most common types of sampling errors.

In some cases, selecting a random sample can be fairly easy. If we have a complete list of each member of the
group in a database, we can just assign a unique number to each member of the group. We then let a computer
draw random numbers from the list. This would ensure that each element of the population has an equal
likelihood of being selected.
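
As a minimal sketch (assuming, as in the conference example above, a listed population of 15,000 people and a desired sample of 1,000), the random draw can be done in a few lines of Python:

import random

population_ids = list(range(1, 15001))            # one unique number per person on the industry list
sample_ids = random.sample(population_ids, 1000)  # every person has the same 1-in-15 chance of selection
print(len(sample_ids), sample_ids[:5])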

If the population about which we need to obtain information is not listed in an easy-to-access database, the task
of selecting a sample at random becomes more difficult. In these cases, we have to be extremely careful not to
introduce a bias in the way we select the sample.

For example, if we want to know something about the opinions of an entire company, we cannot just pick
employees from one department. We have to make sure that each employee has an equal chance of being
included in the sample. A department as a whole might be biased in favor of one opinion.

Sample Size
Once we have decided how to select a sample, we have to ask how large our sample needs to be. How many
members of the group do we need to study to get a good estimate about what we want to know about the entire
population?

The answer is: It depends on how "accurate" we want our estimate to be. We might expect that the larger the
population, the larger the sample size needed to achieve a given level of accuracy, but this is not true.

A sample size of 1,000 randomly-selected individuals can often give a satisfactory estimation about the
underlying population, as long as the sample is representative of the whole population. This is true regardless of
whether the population consists of thousands of employees or millions of factory parts.

Sometimes, a sample size of 100 or even 50 might be enough when we are not that concerned about the accuracy
of our estimate. Other times, we might need to sample thousands to obtain the accuracy we require.

Later in this unit, we will find out how to calculate a good sample size. For now, it's important to understand
that the sample size depends on the level of accuracy we require, not on the size of the population.

Learning about a Sample


Once we select our sample, we need to make sure we obtain accurate information about each member of the
sample. For example, if we want to learn about the number of defects a plant produces, we must carefully
measure each item in the sample.

When we want to learn something about a group of people and don't have any existing data, we often use a
survey to learn about an issue of interest. Conducting a survey raises problems that can be surprisingly tricky to
resolve.

First, how do we phrase our questions? Is there a bias in any questions that might lead participants to answer
them in a certain way? Are any questions worded ambiguously? If some of the people in the sample interpret a
question one way, and others interpret it differently, our results will be meaningless!

Second, how do we best conduct the survey? Should we send the survey in the mail, or conduct it over the
phone? Should we interview survey participants in person, or distribute handouts at a meeting?

There are advantages and disadvantages to all methods. A survey sent through the mail may be relatively
inexpensive, but might have a very low response rate. This is a major problem if those who respond have a
different opinion than those who don't respond. After all, the sample is meant to learn about the entire
population, not just those with strong opinions!

A telephone survey raises other issues: When do we call people? Who is home during regular business
hours? Most likely not working professionals. On the other hand, if we call household numbers in the evening,
the "happy hour crowd" might not be available.

When we decide to conduct a survey in person, we have to consider whether the presence of the person asking
the questions might influence the survey results. Are the survey participants likely to conceal certain information
out of embarrassment? Are they likely to exaggerate?

Clearly, every survey will have different issues that we need to confront before going into the field to collect the
data.

Response Rates
With any type of survey, we must pay close attention to the response rate. We have to be sure that those who
respond to the survey answer questions in much the same way as those who don't respond would answer them.
Otherwise, we will have a biased view of what the whole population thinks.

Surveys with low response rates are particularly susceptible to bias. If we get a low response rate, we must try to
follow up with the people who did not respond the first time. We either need to increase the response rate by
getting answers from those who originally did not respond, or we must demonstrate that the non-respondents'
opinions do not differ from those of the respondents on the issue of interest.

Tracking down everyone in a sample and getting their response can be costly and time consuming. When our
resources are limited, it is often better to take a small sample and relentlessly pursue a high response rate than
to take a larger sample and settle for a low response rate.

Summary
Often it makes sense to infer facts about a large population from a smaller sample. To make sound inferences:

Classic Sampling Mistakes


To understand the importance of representative samples, let's go back in history and look at some mistakes
made in the Literary Digest poll of 1936.

The Literary Digest, a popular magazine in the 1930s, had correctly predicted the outcome of U.S. presidential
elections from 1916 to 1932. When the results of the 1936 poll were announced, the public paid attention. Who
would become the next president?

Newscaster: "Once again, the Literary Digest sent out a survey to the American public, asking, "Whom will you
vote for in this year's presidential election?" This may well be the largest poll in American history."

Newscaster: "The Digest sent the survey to over 10 million Americans and over two million responded!"

Newscaster: "And the survey results predict: Alf Landon will beat Franklin D. Roosevelt by a large margin and
become President of the United States."

As it turned out, Alf Landon did not become President of the United States. Instead, Franklin D. Roosevelt was
re-elected to a second term in office in the largest landslide victory recorded to that date. This was a devastating
blow to the Digest's reputation. What went wrong? How could such a large survey be so far off the mark?

The Literary Digest made two mistakes that led it to predict the wrong election outcome. First, it mailed the
survey to people on three different lists: the magazine's subscribers, car owners, and people listed in telephone
directories. What was wrong with choosing a sample from these lists?

The sample was not representative of the American public. Most lower-income people did not subscribe to the
Digest and did not own phones or cars back in 1936. This led the poll to be biased towards higher-income
households and greatly distorted the poll's results. Lower-income households were more likely to vote for the
Democrat, Roosevelt, but they were not included in the poll.

Second, the magazine relied on people to voluntarily send their responses back to the magazine. Out of the ten
million voters who were sent a poll, over two million responded. Two million is a huge number of people. What
was wrong with this survey?

The mistake was simple: Republicans, who wanted political change, felt more strongly about the election than
Democrats. Democrats, who were generally happy with Roosevelt's policies, were less interested in returning the
survey. Among those who received the survey, a disproportionate number of Republicans responded, and the
results became even more biased.

The Digest had put an unprecedented effort into the poll and had staked its reputation on predicting the
outcome of the election. Its reputation wounded, the Digest went out of business soon thereafter.

During the same election year, a little known psychologist named George Gallup correctly predicted what the
Digest missed: Roosevelt's victory. What did Gallup do that the Literary Digest did not? Did he create an even
bigger sample?

Surprisingly, George Gallup used a much smaller sample. He knew that large samples were no guarantee of
accurate results if they weren't randomly selected from the population.

Gallup's team interviewed only 3,000 people, but made sure that the people they selected were truly
representative of the US population. He also instructed his team to be persistent in asking the opinion of each
person in the sample, which generated a high response rate.

Gallup's correct prediction of the 1936 election winner boosted his reputation and Gallup's method of polling
soon became a standard for public opinion polls.

Today's polls usually consist of a sample of around a thousand randomly selected people who are truly
representative of the underlying population. For example, look at a poll reported in a leading newspaper: the
sample size will likely be around a thousand.

Another common survey mistake is phrasing the questions in a way that leads to a biased response. Let's take a
look at a recent example of a biased question.

In 1992, Ross Perot, an independent contender for the US Presidential election, conducted a mail-in survey to
show that the public supported his desire to abolish special interest groups. This is the question he asked:

Source

In Perot's mail-in survey, 99 percent of respondents said "yes" to that question. It seemed as if everyone in
America agreed with Perot's stance.

Source

Soon after Perot's survey, Yankelovich Partners, an independent market research firm, conducted two
interesting follow-up surveys. In the first survey, it used the same question that Perot asked and found that 80
percent of the population favored passing the law. YP attributed the difference to the fact that it was able to
create a more representative sample than Perot.

Source

Interestingly, Yankelovich then conducted a similar survey, but rephrased the question in the following way:

Source

The response to this question was strikingly different. Only 40 percent of the sampled population agreed to
prohibit contributions. As it turned out, the results of the survey all came down to the way the question was
phrased.

Source

For any survey we conduct, it's critical to phrase the question in the most neutral way possible to avoid bias in
the sample results.

Source

The real lesson of these two examples is this: How data are collected is at least as important as how data are
analyzed. A sample that is unrepresentative, biased, or not drawn at random can give highly misleading results.

Knowing that sample data need to be representative and unbiased, you conduct a survey of the hotel guests.

Solving the Scuba Problem (Part I)


How can you best determine if hotel guests are enjoying the scuba course? By searching the hotel database, you
determine that 2,804 hotel guests took scuba trips in the past year. The scuba certification course was offered year-
round. The database includes each guest's name, address, phone number, age, date of arrival, length of stay, and
room number.

Your first step is deciding what type of survey to conduct that will be inexpensive, quick, and will provide a good
sample of all the guests who took scuba lessons.

Should you mail a survey to the whole list of guests who took scuba lessons, expecting that a small percentage will
respond, or conduct a telephone survey, which would likely provide a higher response rate, but cost more per guest
contacted?

To ensure a good response rate — and because Leo wants an answer quickly — you choose to contact customers by
phone. Alice warns that to keep costs low, you can only contact 50 hotel guests, and reminds you to create a
random, representative sample.

You open up the list of names in the hotel database. The names were entered as guests arrived. To make things
simple, you randomly select a date and then record the first 50 guests arriving after that date who took the course.
You ask the hotel operator to call them for you, and tell him to be persistent. Eventually he is able to contact 45 of
the guests on the list. He asks the guests to rate their scuba experience on a 1 to 6 scale and reports the results back
to you. Click the link below to view your sample.

Enter the average satisfaction level as a decimal number with one digit to the right of the decimal point (e.g., enter
"5" as "5.0"). Round if necessary.

Hotel Database

You compute the average satisfaction level and find that it is 2.5. You give Leo the news. He explodes.

Two point five! That's impossible! I know for sure that it must be higher than that! You'd better go over your data
again.

Back in your room, you look over your list of data. What should you tell Leo?

What factor is biasing your results?

When you report this news to Leo, he begins to laugh.

We were hit with a hurricane at the beginning of April. Half the scuba classes were cancelled, and the ones that did
meet had to deal with choppy water and bad visibility. Even the weeks following the hurricane were bad. Usually
guests see a manta ray every week, and the guests in April could barely see the underwater coral. No wonder they
weren't happy.

You assure Leo you will conduct the survey again with a more representative sample. This time, you make sure that
the guests are truly randomly selected. Later, you have new data in your hands from 45 randomly chosen guests
that show the average satisfaction rate to be 4.4 on a 1 to 6 scale. The standard deviation of the sample is 1.54.

Exercise 1: The Bell Computer Problem


Mr. Gavin Collins is the Chief Operating Officer of Bell Computers, a market leader in personal computers. This
morning, he opened the latest issue of Business 4.0, a business journal, and noticed an article on Bell
Computers.

The article praised the high quality and low cost of the PCs made by Bell. However, it also included some
negative comments about Bell's customer service.

Currently, customer service is only available to customers of Bell Computers over the phone.

Collins wants to understand more fully what customers think of Bell's customer service. His marketing
department designs a survey that asks customers to rate Bell's customer service from 1 to 10.

How should he conduct the survey?

Exercise 2: The Wave Problem

"Wave" is a company that manufactures laundry detergent in several countries around the world. In India, the
competition among laundry detergents is fierce.

The sales per month of Wave have been constant for the past five years. Wave CEO Mr. Sharma instructed his
marketing team to come up with a strong advertising campaign stressing Wave's superiority over other
competitors. Wave conducted a survey in the month of June.

They asked the following questions: "Have you heard of Wave?" "Do you think Wave is a good product?" "Do you
notice a difference in the color of your clothes after using Wave?" Then, citing the results of their survey, Wave
aired a major television campaign claiming that 75% of the population thought that Wave was a good product.

You are a new associate at Madison Consulting. With your partner, Ms. Mehta, you have been asked to conduct a
study for Wave's main competitor, the Coral Reef Detergent Company, about whether Wave's claims hold water.
Coral Reef wonders how the Wave results are possible, considering that Coral Reef holds over 45% of the current
market share.

Ms. Mehta has been going through the survey methodology, and she tells you, "This sample is obviously not
representative and unbiased. Coral Reef can dispute Wave's claim!" What has Ms. Mehta noticed?

Challenge: The Airport


You have been asked to conduct a survey to determine the percentage of flights arriving at a small airport that
were filled to capacity that morning. You decide to stand outside the airport's single exit door and ask a sample
of 60 passengers leaving the airport how full their flight was.

Your first thought is to just ask the first 60 passengers departing the airport how full their flight was, but you
quickly realize that that could be a highly biased sample. Any 60 people leaving at the same time would likely
have come from only a couple of flights, and you want to get a good sense of what percent of all flights arriving
that morning were filled to capacity. Thus, you decide to randomly select 60 people from all the passengers
departing the building that morning.

After conducting your survey, you tally the results: 10 people decline to answer, 30 people tell you that their
flight was filled to capacity, and 20 people tell you that their flight was not filled to capacity. What can you
conclude from your survey results so far?

What is the problem with your survey?

To see this, imagine that 10 planes have arrived that morning — five of which were full (having 100 passengers
each) and five of which had only a single passenger on the plane. In this case, half of the planes were full.
However, almost all of the passengers (500 of the total 505) departing from the airport would report (correctly!)
that they had been on a full plane. Since people from a full plane are more likely to be selected, there is a
systematic bias in your response.
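
A small simulation makes the bias concrete. This Python sketch is illustrative only and assumes the ten flights described above: five full flights of 100 passengers each and five flights carrying a single passenger.

import random

passengers = ["full"] * 500 + ["not full"] * 5    # 5 flights x 100 passengers, plus 5 flights x 1 passenger
responses = random.sample(passengers, 60)         # survey 60 randomly chosen passengers
share_full = responses.count("full") / len(responses)
print(share_full)   # usually between 0.98 and 1.00, even though only half of the flights were full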

It is important, in every survey, to try to make your sample as representative as possible. In this case, your
sample was not representative of the planes arriving to the airport.

A better approach might be to ask the people you select what their flight number was, and then ask them how
full their flight was. Make sure you have at least one passenger from every plane. Then count the responses of
only one person from each flight. By including only one person per flight in your sample, you ensure that your
sample is an accurate prediction of how many planes are filled to capacity.

Sampling is complicated, and it is important to think through all the factors that might influence your results. In
this case, the mistake is that you are trying to estimate a population of planes by sampling a population of
passengers. This makes the sample unrepresentative of the underlying population. By randomly sampling the
passengers rather than the flights, each flight is not equally likely to be selected, and the sample is biased.

The Population Mean


You report the results of your survey, the sample mean, and its standard deviation to Leo.

The Scuba Problem II


A sample mean of 4.4 makes more sense to me, but I'm still a bit uneasy about your survey result. After all, you've
only collected 45 responses.

If you'd chosen different people, they likely would have given different responses. What if — just by chance — these
45 people loved the scuba course, and no one else did?

You have a good point there, Leo. Our intuition is that the average satisfaction rate for all guests isn't too far from
4.4, but at this point we're not sure exactly how far away it might be. Without more calculations, all we can say is
that 4.4 is the best estimate we have. That is why...

Wait a minute! This is very unsatisfying. Are you telling me that there's no way to gauge the accuracy of this survey
result?

If the results are a little off, that's not a problem. But you have to tell me how far off they might be. What if you're
off by two whole points, and the true satisfaction of my hotel guests is 2.4, not 4.4? In that case, my decision would
be completely different.

I need to know how accurately this sample reflects the opinions of all the hotel guests who went scuba diving!

The sample mean is the best point estimate of the population mean, but it cannot tell you how accurately the
sample reflects the population.

Alice suggests giving Leo a range of values that is almost certain to contain the population mean. "We may not be
able to pin down mean satisfaction precisely. But confining it to a range of likely values will provide Leo with
enough information to make a sound business decision."

That sounds like a good idea, but you wonder how to actually do it.

Using Confidence Intervals


The sample mean is the best estimate of our population mean. However, it is only a point estimate. It does not give
us a sense of how accurately the sample mean estimates the population mean.

Think about it. If we know only the sample mean, what can we really say about the population mean? In the case of
our scuba school, what can we say about the average satisfaction rate of all scuba-diving hotel guests? Could it be
4.3? 4.0? 4.7? 2.0?

To make decisions as a manager, we need to have more than just a good point estimate. We need to have a
sense of how close or far away the true population mean might be from our estimate.

We can indicate the most likely values of the true population mean by creating a range, or interval, around the
sample mean. If we construct it correctly, this range will very likely contain the true population mean.

For example, by constructing a range, we might be able to tell Leo that we are very confident that the true average
customer satisfaction for all scuba guests falls between 4.2 and 4.6.

Knowing that the true average is almost certainly between 4.2 and 4.6, Leo is better equipped to make a decision
than if he simply knew the estimated average of 4.4.

Creating a range around the sample mean is quite easy. First, we need to know three statistics of the sample: the
mean x-bar, the standard deviation s, and the sample size n.

We also need to know how "confident" we'd like to be that the range contains the true mean of the population. For
any level of "confidence", there is a value we'll call z to put into the formula. We'll learn later in this unit exactly
what we mean by "confidence," and how to compute z. For now, just keep in mind that for higher levels of
confidence, we'll need to put in a larger value of z.

Using these numbers, we can create a range around the sample mean according to the following formula: the range runs from x-bar - z(s/sqrt(n)) to x-bar + z(s/sqrt(n)). In other words, it extends a distance of z times s divided by the square root of n on either side of the sample mean.

Before we actually use the formula, let's try to develop our intuition about the range we're creating. Where should
the range be centered? How wide must the range be to make us confident that it contains the true population
mean? What factors would lead us to need a wider or narrower range?

Let's see how the statistics of the sample influence the location and width of the range. Let's start with the sample
mean.

The sample mean is our best estimate of the population mean. This suggests that the sample mean should always
be the center of the range. Move the slider bar to see how the sample mean affects the range.

Second, the width of the range depends on the standard deviation of the sample. When the sample standard
deviation is large, we have greater uncertainty about the accuracy of the sample mean as an estimate of the
population mean. Thus, we have to create a wider range to be confident that it includes the true population mean.

On the other hand, if the sample standard deviation is small, we feel more confident that our sample mean is an
accurate predictor of the true population mean. In this case, we can draw a narrower range.

The larger the standard deviation, the wider the range must be. Move the slider bar to see how the sample standard
deviation affects the range.

Third, the width of the range depends on the sample size. With a very small sample, it's quite possible that one or
two atypical points in the sample could throw the sample mean off considerably from the true population mean. So
with a small sample, we need to create a wide range to feel comfortable that the true mean is likely to be inside it.

The larger the sample, the more certain we can be that the sample mean represents the population mean. With a
large sample, even if our sample includes a few atypical points, there are likely to be many more typical points in
the sample to compensate for the outliers. Thus, with a large sample, we can feel comfortable with a small range.

Move the slider bar to see how the sample size influences the range.

Finally, the width of the range depends on our desired level of confidence. The level of confidence states how
certain we want to be that the range contains the mean of the population. The more confident we want to be that
the range contains the true population mean, the wider we have to make the range.

If our desired level of confidence is fairly low, we can draw a narrower range.

In the language of statistics, we indicate our level of confidence by saying, for example, that we are "95% confident"
that the range contains the true population mean. This means there is a 95% chance that the range contains the
true population mean.

Move the slider bar to see how the confidence level affects the range.

These variables determine the size of the range that we want to construct. We will learn exactly how to construct
this range in a later section.

For now, all we have to understand is that the population mean can best be estimated by a range of values and that
the range depends on three sample statistics as well as the level of confidence that we want to assign to the range.

Summary
The sample mean is our best initial estimate of the population mean. To indicate how accurate this estimate is, we
construct a range around the sample mean that likely contains the population mean. The width of the range is
determined by the sample size, sample standard deviation, and the level of confidence. The confidence level
measures how certain we are that the range we construct contains the true population mean.
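
To make the idea concrete, here is a minimal Python sketch. It assumes the scuba survey statistics from this unit (sample mean 4.4, sample standard deviation 1.54, 45 respondents) and a z of 1.96, the value the course later associates with 95% confidence.

import math

x_bar, s, n, z = 4.4, 1.54, 45, 1.96
margin = z * s / math.sqrt(n)         # z times the standard error s / sqrt(n)
low, high = x_bar - margin, x_bar + margin
print(round(low, 2), round(high, 2))  # roughly 3.95 to 4.85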

Alice recommends taking a step back from sampling and learning about the normal distribution.

The Normal Distribution



The normal distribution helps us create a range around a sample mean that is likely to contain the true population
mean. You can use the normal distribution to turn the intuitive notion of "confidence in your estimate" into a
precisely defined concept. Understanding the normal distribution will also give you deeper insight into how sampling
works.

The normal distribution is a probability distribution that is centered at the mean. It is shaped like a bell, and is
sometimes called the "bell curve."

Like any probability distribution, the normal distribution is shown on two axes: the x-axis for the variable we're
studying — women's heights, for example — and the y-axis for the likelihood that different values of the variable
will occur.

For example, few women are very short and few are very tall. Most are in the middle somewhere, with fairly average
heights. Since women of average height are so much more common, the distribution of women's heights is much
higher in the center near the average, which is about 63.5 inches.

As it turns out, for a probability distribution like the normal distribution, the percent of all values falling into a
specific range is equal to the area under the curve over that range.

For example, the percentage of all women who are between 61 and 66 inches tall is equal to the area under the curve
over that range.

The percentage of all women taller than 66 inches is equal to the area under the curve to the right of 66 inches.

Like any probability distribution, the total area under the curve is equal to 1, or 100%, because the height of every
woman is represented in the curve.

Over the years, statisticians have discovered that many populations have the properties of the normal distribution.
For example, IQ test scores follow a normal distribution. The weights of pennies produced by U.S. mints have been
shown to follow a normal distribution.

But what is so special about this curve?

First, the normal distribution's mean and median are equal. They are located exactly at the center of the distribution.
Hence, the probability that a normal distribution will have a value less than the mean is 50%, and the probability
that it will have a value greater than the mean is also 50%.

Second, the normal distribution has a unique symmetrical shape around this mean. How wide or narrow the curve is
depends solely on the distribution's standard deviation.

In fact, the location and width of any normal curve are completely determined by two variables: the mean and the
standard deviation of the distribution.

Large standard deviations make the curve very flat. Small standard deviations produce tight, tall curves with most of
the values very close to the mean.

How is this information useful?

Regardless of how wide or narrow the curve, it always retains its bell-shaped form. Because of this unique shape, we
can create a few useful "rules of thumb" for the normal distribution.

For a normal distribution, about 68% (roughly two-thirds) of the probability is contained in the range reaching one
standard deviation away from the mean on either side.

It's easiest to see this with a standard normal curve, which has a mean of zero and a standard deviation of one.

If we go two standard deviations away from the mean for a standard normal curve we'll cover about 95% of the
probability.

The amazing thing about normal distributions is that these rules of thumb hold for any normal distribution, no
matter what its mean or standard deviation.

For example, about two thirds of all women have heights within one standard deviation, 2.5 inches, of the average
height, which is 63.5 inches.

95% of women have heights within two standard deviations (or 5 inches) of the average height.

To see how these rules of thumb translate into specific women's heights, we can label the x-axis twice to show which
values correspond to being one standard deviation above or below the mean, which values correspond to being two
standard deviations above or below the mean, and so on.

Essentially, by labeling the x-axis twice we are translating the normal curve into a standard normal curve, which is
easier to work with.

For women's height, the mean is 63.5 and the standard deviation is 2.5. So, one standard deviation above the mean is
63.5 + 2.5, and one standard deviation below the mean is 63.5 - 2.5.

Thus, we can see that about 68% of all women have heights between 61 and 66 inches, since we know that about 68%
of the probability is between -1 and +1 on a standard normal curve.

Similarly, we can read the heights corresponding to two standard deviations above and below the mean to see that
about 95% of all women have heights between 58.5 and 68.5 inches.

The z-statistic
The unique shape of the normal curve allows us to translate any normal distribution into a standard normal curve,
as we did with women's heights simply by re-labeling the x-axis. To do this more formally, we use something called
the z-statistic.

For a normal distribution, we usually refer to the number of standard deviations we must move away from the
mean to cover a particular probability as "z", or the "z-value." For any value of z, there is a specific probability of
being within z standard deviations of the mean.

For example, for a z-value of 1, the probability of being within z standard deviations of the mean is about 68%, the
probability of being between -1 and +1 on a standard normal curve.

A good way to think about what the z-statistic can do is this analogy: if a giant tells you his house is four steps to
the north, and you want to know how many steps it will take you to get there, what else do you need to know?

You would need to know how much bigger his stride is than yours. Four steps could be a really long way.

The same is true of a standard deviation. To know how far you must go from the mean to cover a certain area
under the curve, you have to know the standard deviation of the distribution.

Using the z-statistic, we can then "standardize" the distribution, making it into a standard normal distribution with
a mean of 0 and a standard deviation of 1. We are translating the real value in its original units — inches in our
example — into a z-value.

The z-statistic translates any value into its corresponding z-value simply by subtracting the mean and dividing by
the standard deviation.
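
Written as a formula, for a value x drawn from a distribution with a given mean and standard deviation: z = (x - mean) / standard deviation.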

Thus, for the women's height of 66 inches, the z-value, z = (66-63.5)/2.5, equals 1. Therefore, 66 is exactly one
standard deviation above the mean.

Essentially, the z-statistic allows us to measure the distance from the mean in terms of standard deviations instead
of real values. It gives everyone the same size feet in statistics.

We can extend the rules of thumb we've developed beyond the two cases we've looked at. For example, we may
want to know the likelihood of being within 1.5 standard deviations from the mean, or within three standard
deviations from the mean.

Select different values of z — that is, select different numbers of standard deviations from the mean — and see how
the probability changes. Be sure to try z values of 1 and 2 to verify that our rules of thumb are on target!

Sometimes we may want to go in the other direction, starting with the probability and figuring out how many
standard deviations are necessary on either side of the mean to capture that probability.

For example, suppose we want to know how many standard deviations we need to be from the mean to capture
95% of the probability.

Our second rule of thumb tells us that when we move two standard deviations from the mean, we capture about
95% of the probability. More precisely, to capture exactly 95% of the probability, we must be within 1.96 standard
deviations of the mean.

This means that for a normal distribution, there is a 95% probability of falling between -1.96 and 1.96 standard
deviations from the mean.
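
If you want to verify this value outside of Excel, a one-line check in Python (using the scipy library, which the course itself does not use) gives the area between -1.96 and +1.96 on the standard normal curve:

from scipy.stats import norm

print(round(norm.cdf(1.96) - norm.cdf(-1.96), 3))   # about 0.950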

Select different probabilities and see how many standard deviations we have to move away from the mean to cover
that probability.

We can create a table that shows which values of z correspond to each probability or we can calculate z using a
simple function in Microsoft Excel. We'll explain how to use both of these approaches in the next few clips.

z-table

Remember, the probabilities and the rules of thumb we've described apply ONLY to a normal distribution. Don't
think you can use them for any distribution!

Sometimes, probabilities are shown in other forms. If we start at the very left side of the distribution, the area
underneath the curve is called the cumulative probability. For example, the probability of being less than the mean
is 0.5, or 50%. This is just one example of a cumulative probability.

A cumulative probability of 70% corresponds to a point that has 70% of the area under the curve to its left.

There are easy ways to find cumulative probabilities using spreadsheet packages such as Microsoft Excel. You'll
have opportunities to practice solving these types of problems shortly.

Cumulative probabilities can be used to find the probability of any range of values. For example, to find the
percentage of all women who have heights between 63.5 and 68 inches, we would simply subtract the percent
whose heights are less than 63.5 inches from the percent whose heights are less than 68 inches.

Summary
The normal distribution has a unique symmetrical shape whose center and width are completely determined by its
mean and its standard deviation. For every normal distribution, the probability of being within a specified number
of standard deviations of the mean is the same. The distance from the mean, as measured in standard deviations,
is known as the z-value. Using the properties of the normal distribution, we can calculate a probability associated
with any range of values.

Using Excel's Normal Functions


To find the cumulative probability associated with a given z-value for a standard normal curve, we use the Excel
function NORMSDIST. Note the S between the M and the D. It indicates we are working with a 'standard' normal
curve with mean zero and standard deviation one.

For example, to find the cumulative probability for the z-value 1, we enter the Excel function =NORMSDIST(1).

The value returned, 0.84, is the area under the standard normal curve to the left of 1. This tells us that the
probability of obtaining a value less than 1 for a standard normal curve is about 84%.

We shouldn't be surprised that the probability of being less than 1 is 84%. Why? First, we know that the normal
curve is symmetric, so there is a 50% chance of being below the mean.

Next, we know that about 68% of the probability for a standard normal curve is between -1 and +1.

Since the normal curve is symmetric, half of that 68% — or 34% of the probability — must lie between 0 and 1.

Putting these two facts together confirms that there is an 84% chance of obtaining a value less than 1 for a standard
normal curve.

If we want to find the cumulative probability of a value in a general normal curve — one that does not necessarily
have a mean of zero and a standard deviation of one — we have two options. One option is to first standardize the
value in question to find the equivalent z-value, and then use the NORMSDIST function to find the cumulative probability
for that z-value.

For example, if we have a normal distribution with mean 26 and standard deviation 8, we may wish to know the
probability of obtaining a value less than 24.

Standardizing can be done easily by hand, but Excel also has a STANDARDIZE function. We enter the function in a
cell and insert three values: the value to be standardized, and the mean and standard deviation of the normal
distribution.

We find that the standardized value (or z value) of 24 for a normal curve with mean 26 and standard deviation 8 is
-0.25.

Now, to find the cumulative probability for the z-value -0.25, we enter the Excel function =NORMSDIST(-0.25),
which tells us that the probability of a value less than -0.25 on a standard normal curve is 40%. Thus, the
probability of a value less than 24 on a normal curve with mean 26 and standard deviation 8 is 40%.

The second way to find a cumulative probability in a general normal curve is to use the NORMDIST function.
Here, we enter the function in a cell and insert four values: the number whose cumulative probability we want to
find, the mean and standard deviation of the normal distribution, and the word "TRUE."

As with our previous approach, we find that the probability of obtaining a value less than 24 on a normal curve
with mean 26 and standard deviation 8 is 40%.

The value "TRUE" tells Excel to return a cumulative probability. If instead of "TRUE" we enter "FALSE," Excel
returns the y-value of the normal curve — something we are usually not interested in.

Quite often, we have a cumulative probability, and want to work backwards, translating it into a value on a normal
curve.

Suppose we want to find the z-value associated with the cumulative probability 95%.

To translate a cumulative probability back to a z-value on the standard normal curve, we use the Excel function
NORMSINV. Note once again the S, which tells us we are working with a standard normal curve.

We find that the z-value associated with the cumulative probability 95% is 1.64.

Sometimes we may want to translate a cumulative probability back to a value on a general normal curve. For
example, we may want to find the value associated with the cumulative probability 95% for a normal curve with
mean 26 and standard deviation 8.

If we want to translate a cumulative probability back to a value on a general normal curve, we use the NORMINV
function. NORMINV requires three values: the cumulative probability, and the mean and standard deviation of the
normal distribution in question.

We find that the value associated with the cumulative probability 95% for a normal curve with mean 26 and
standard deviation 8 is 39.2.
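
For readers working outside Excel, the same calculations can be sketched in Python with scipy.stats.norm; the values below mirror the Excel examples above and are given only as an illustrative aside.

from scipy.stats import norm

print(norm.cdf(1))             # about 0.84, like =NORMSDIST(1)
print((24 - 26) / 8)           # -0.25, like =STANDARDIZE(24, 26, 8)
print(norm.cdf(24, 26, 8))     # about 0.40, like =NORMDIST(24, 26, 8, TRUE)
print(norm.ppf(0.95))          # about 1.64, like =NORMSINV(0.95)
print(norm.ppf(0.95, 26, 8))   # about 39.2, like =NORMINV(0.95, 26, 8)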

Using the z-table


The previous clip shows us how to use software programs like Excel to calculate z-values and cumulative
probabilities for the normal curve. Another way to find z-values and cumulative probabilities is to use a z-table.
Using z-tables is a bit more cumbersome than using Excel, but it helps reinforce the concepts.

Let's use the z-table to find a cumulative probability. Women's heights are distributed normally, with mean around
63.5 inches, and standard deviation 2.5 inches. What percentage of women are shorter than 65.6 inches?

First, we calculate the z-value for 65.6 inches, 0.84. The cumulative probability associated with the z-value is the
area under the standard normal curve to the left of the z-value. This cumulative probability is the percentage of
women who are shorter than 65.6 inches.

We next use the table to find the cumulative probability corresponding to a z-value of 0.84. First, we find the row
by locating the z-value up to the first digit to the right of the decimal point, 0.8. Then we choose the column
corresponding to the remainder of the z-value (0.84 - 0.8 = 0.04). The cumulative probability is 0.7995. About
80% of women are shorter than 65.6 inches.

Finding the cumulative probability for a value less than the mean is a bit trickier. For example, we might want to
know what percentage of women are shorter than 61.6 inches.

We find that the z-value for a height of 61.6 inches is a negative number: -0.76.

When a z-value is negative, we must first use the table to find the cumulative probability corresponding to the
positive z-value, in this case +0.76. Then, since the normal curve is symmetric, we will be able to conclude that the
probability of being less than the z-value -0.76 is the same as the probability of being greater than the z-value
+0.76.

We find the cumulative probability for +0.76 by locating the row corresponding to the z-value up to the first digit
to the right of the decimal point, 0.7, and the column corresponding to the remainder of the z-value (0.76 - 0.7 =
0.06). The cumulative probability is 0.7764.

Since the probability of being less than a z-value of +0.76 is 0.7764, then the probability of being greater than a z-
value of +0.76 is 1 - 0.7764 = 0.2236. Thus, we can conclude that the probability of being less than a z-value of
-0.76 is also 0.2236.

Finally, we reach our conclusion. About 22.36% of women are shorter than 61.6 inches.

Practice with Normal Curves


Find the cumulative probability associated with the z-value 2.

Enter your answer in decimal notation with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if
necessary.

z-table
Excel

Find the cumulative probability associated with the z-value 2.36.

Enter your answer in decimal notation with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if
necessary.

z-table
Excel

Find the cumulative probability associated with the z-value -1.

Enter your answer in decimal notation with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if
necessary.

z-table
Excel

Find the cumulative probability associated with the z-value 1.645.

Enter your answer in decimal notation with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if
necessary.

z-table
Excel

Find the cumulative probability associated with the z-value -1.645.

Enter your answer in decimal notation with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if
necessary.

z-table
Excel

For a normal curve with mean 100 and standard deviation 10, find the cumulative probability associated with the
value 115.

Enter your answer in decimal notation with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if
necessary.

z-table
Excel

For a normal curve with mean 100 and standard deviation 10, find the cumulative probability associated with the
value 80.

Enter your answer in decimal notation with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if
necessary.

z-table
Excel

For a normal curve with mean 100 and standard deviation 10, find the probability of obtaining a value greater than
80 but less than 115.

Enter your answer in decimal notation with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if
necessary.

z-table
Excel

For a normal curve with mean 80 and standard deviation 5, find the probability of obtaining a value greater than
85 but less than 95.

Enter your answer in decimal notation with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if
necessary.

z-table
Excel

For a normal curve with mean 47 and standard deviation 6, find the probability of obtaining a value greater than
45.

Enter your answer in decimal notation with 3 digits to the right of the decimal (e.g., enter "5" as "5.000"). Round if
necessary.

z-table
Excel

For a normal curve with mean 47 and standard deviation 6, find the probability of obtaining a value greater than
38 but less than 45.

Enter your answer in decimal notation with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if
necessary.

z-table
Excel

Find the z-value associated with the cumulative probability of 60%.

Enter your answer in decimal notation with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if
necessary.

z-table
Excel

Find the z-value associated with the cumulative probability of 40%.

Enter your answer in decimal notation with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if
necessary.

z-table
Excel

Find the z-value associated with the cumulative probability of 2.5%.

Enter your answer in decimal notation with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if
necessary.

z-table
Excel

For a normal curve with mean 222 and standard deviation 17, find the value associated with the cumulative
probability of 88%.

Enter your answer as an integer (e.g., "5"). Round if necessary.

z-table
Excel

For a normal curve with mean 222 and standard deviation 17, find the value associated with the cumulative
probability of 28%.

Enter your answer as an integer (e.g., "5"). Round if necessary.

z-table
Excel

The Central Limit Theorem


How can the normal distribution help you sample Leo's hotel guests?

How do the unique properties of the normal distribution help us when we use a random sample to infer something
about the underlying population?

After all, when we sample a population, we usually have no idea whether or not the population is normally
distributed. We're typically sampling because we don't even know the mean of the population! If the normal
distribution is such a great tool, when can we use it?

It turns out that even if a population is not normally distributed, the properties of the normal distribution are very
helpful to us in sampling. To see why, let's first learn about a well-established statistical fact known as the "Central
Limit Theorem".


Definition
Roughly speaking, the Central Limit Theorem says that if we took many random samples from a population and
plotted the means of each sample, then — assuming the samples we take are sufficiently large — the resulting
plot of the sample means would look normally distributed.

Furthermore, if we took enough of these samples, the mean of the resulting distribution of sample means would
be equal to the true mean of the population.

To repeat: no matter what type of distribution the population has — uniform, skewed, bi-modal, or completely
bizarre — if we took enough samples, and the samples were sufficiently large, then the means of those samples
would form a normal distribution centered around the true mean of the population.

The Central Limit Theorem is one of the subtlest aspects of basic statistics. It may seem odd to be drawing a
distribution of the means of many samples, but that is exactly what we are doing. We'll call this distribution the
Distribution of Sample Means. (Statisticians also often call it the Sampling Distribution of the Mean).

Let's walk through this step-by-step. If we have a population — any population — we can take a random sample.
This sample has a mean.

We can plot that mean on a graph.

Then we take another sample. That sample also has a mean, which we also plot on the graph.

Now, if we plot a lot of sample means in this way, they will start to form a normal distribution around the
population's mean.

The more samples we take, the more the graph of the sample means would look like a normal distribution.
Eventually, the graph of the sample means — the Distribution of the Sample Means — would form a nearly
perfect replica of a normal distribution.

Now, nobody would actually take a lot of samples, calculate all of the sample means, and then construct a
normal distribution with them. We're taking a lot of samples here just to let you see that graphing the means of
many samples would give you a normal curve.

In the real world, we take a single sample and squeeze it for all the information it's worth. But what does the
Central Limit Theorem allow us to say based on that single sample?

The Central Limit Theorem tells us that the mean of that one sample is part of a normal distribution. More
specifically, we know that the sample mean falls somewhere in a normal Distribution of Sample Means that is
centered at the true population mean.

The Central Limit Theorem is so powerful for sampling and estimation because it allows us to ignore the
underlying distribution of the population we want to learn about. Since we know the Distribution of Sample
Means is normally distributed and centered at the true population mean, we can completely disregard the
underlying distribution of the population.

As we'll see shortly, because we know so much about the normal distribution, we can use the information about
the Distribution of Sample Means to draw conclusions about the likelihood of different values of the actual
population mean.

Summary
The Central Limit Theorem states that for any population distribution, the means of samples from that
population are distributed approximately normally. The more samples, and the larger the sample size, the
closer the Distribution of Sample Means fits a normal curve. The mean of a single sample lies on this normal
curve, so we can use the normal curve's special properties to extract more information from a single sample
mean.

Illustrating
Let's see how the Central Limit Theorem works using a graphical illustration. The three icons are marked
"Uniform," "Bimodal," and "Skewed." On a later page, clicking on each of the three sections in the navigation
will display a different kind of distribution.

On the next page, clicking on "Uniform" will display a distribution that is uniform in shape, i.e. a distribution for
which all values in a specified range are equally likely to occur. Clicking on "Bimodal" will display a distribution
that has two separate areas where values are more likely to occur than elsewhere. Clicking on "Skewed" will
display a distribution that is not symmetrical — values are more likely to fall above the mean than below.

Uniform
The population distribution is on the top half of the page. Let's take a sample of it. This sample has a mean.

Let's start building a distribution of the sample means on the bottom half of the page by placing each sample
mean on a graph. We repeat this process several times to create our distribution.

Take a sample. Find its mean. Record it in the sample mean histogram. This histogram approximates the
distribution of the sample means.

As we can see, the shape of the original distribution doesn't matter. The distribution of the sample means will
always form a normal distribution. This is what the Central Limit Theorem predicts.

Bimodal
The population distribution is on the top half of the page. Let's take a sample of it. This sample has a mean.

Let's start building a distribution of the sample means on the bottom half of the page by placing each sample
mean on a graph. We repeat this process several times to create our distribution.

Take a sample. Find its mean. Record it in the sample mean histogram. This histogram approximates the
distribution of the sample means.

As we can see, the shape of the original distribution doesn't matter. The distribution of the sample means will
always form a normal distribution. This is what the Central Limit Theorem predicts.

Skewed
The population distribution is on the top half of the page. Let's take a sample of it. This sample has a mean.

Let's start building a distribution of the sample means on the bottom half of the page by placing each sample
mean on a graph. We repeat this process several times to create our distribution.

Take a sample. Find its mean. Record it in the sample mean histogram. This histogram approximates the
distribution of the sample means.

As we can see, the shape of the original distribution doesn't matter. The distribution of the sample means will
always form a normal distribution. This is what the Central Limit Theorem predicts.

The Central Limit Theorem states that the means of sufficiently large samples are always approximately normally distributed, a key insight that will allow you to estimate the population mean from a sample.
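
A short simulation makes the same point numerically. The sketch below is written in Python with numpy; the skewed population, the sample size of 50, and the 2,000 repetitions are all invented for illustration, not taken from the course.

import numpy as np

rng = np.random.default_rng(0)

# A strongly skewed population; clearly not normally distributed
population = rng.exponential(scale=10.0, size=100_000)

sample_size = 50       # size of each random sample
num_samples = 2_000    # how many sample means we collect

# Build the Distribution of Sample Means
sample_means = [rng.choice(population, size=sample_size).mean()
                for _ in range(num_samples)]

print("True population mean:    ", population.mean())
print("Mean of the sample means:", np.mean(sample_means))
# A histogram of sample_means would look approximately normal and be
# centered very close to the true population mean.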

Confidence Intervals
Using the properties of the normal distribution and the Central Limit Theorem, you can construct a range of values
that is almost certain to contain the population mean.

Estimating a Population Mean II


For a normal distribution, we know that if we select a value at random, it will be within two standard deviations of
the distribution's mean 95% of the time.

The Central Limit Theorem offers us two additional insights. First, we know that the means of sufficiently large
samples are normally distributed, regardless of the distribution of the underlying population.

Second, we know that the mean of the Distribution of Sample Means is equal to the true population mean.

Combining these facts can give us a measure of how accurately the mean of a sample estimates the population
mean.

Specifically, we can now conclude that if we take a sufficiently large sample — let's say at least 30 points —
from a population, there is a 95% chance that the mean of that sample falls within two standard deviations of the
true population mean.

Let's build this up step by step to make sure we understand the logic.

First, we take a sample from a population and compute its mean. We know that the mean of that sample is a point
on a normal distribution — the Distribution of Sample Means.

Since the mean of our sample is a value randomly obtained from a normal distribution, there is a 95% chance that
the sample mean is within two standard deviations of the mean of the distribution.

The Central Limit Theorem tells us that the mean of that distribution is the same as the true population mean.
Thus, we can conclude that there is a 95% chance that the sample mean is within two standard deviations of the
population mean.

We have argued that 95% of our samples will have a mean within the range shown around the true population
mean.

Next we'll turn this around and look at intervals around sample means, because that's exactly what a confidence
interval is.

Let's look at intervals around the means of two different types of samples: those whose sample means fall within
the 2 standard deviation range around the population mean (which should be the case for 95% of all samples) and
those whose sample means fall outside the 2 standard deviation range around the population mean (which should
be the case for 5% of all samples).

First, let's look at a sample whose mean falls outside the 2 standard deviation range shown around the population
mean.

Since this sample mean is outside the range, it must be more than 2 standard deviations away from the population mean. Equivalently, the population mean is more than 2 standard deviations away from this sample mean, so an interval extending 2 standard deviations on either side of this sample mean could not contain the true population mean.

We know that 5% of all samples should have sample means outside the 2 standard deviation range around the
population mean. Therefore 5% of all samples we obtain will have intervals that do not contain the population
mean.

Now let's think about the remaining 95% of samples whose means do fall within the 2 standard deviation range
around the population mean.

If we draw an interval extending 2 standard deviations on either side of any one of these sample means, the interval would contain the true population mean. Thus, 95% of all samples we obtain will have intervals that contain the population mean.

We've just shown how to go from any sample mean — a point estimate — to a range around the sample mean — a
95% confidence interval. We've also argued that 95% of confidence intervals obtained in this way should contain
the true population mean.

It's important to emphasize: We are not saying that 95% of the time our sample mean is the population mean, but we are saying that 95% of the time a range extending two standard deviations on either side of the sample mean contains the population mean.

To visualize the general concept of a confidence interval, imagine taking 20 different samples from a population
and drawing a confidence interval around each. As the diagram shows, on average 95% of these intervals — or 19
out of 20 — would actually contain the population mean.

What does this insight mean for us as managers? When we set a confidence level of 95%, we are agreeing to an
approach that 1 out of 20 times will give us an interval that does not contain the true population mean. If we aren't
comfortable with those odds, we should raise the confidence level.

If we increase the confidence level to 98%, we have only a 1 out of 50 chance of obtaining an interval that does not
contain the true population mean. However, this higher confidence comes at a cost. If we keep the same sample
size, then the confidence interval will widen, thereby decreasing the accuracy of our estimate. Alternatively, to keep
the same interval width, we can increase our sample size.

How do we know if an interval is too wide? Typically, if we would make a different decision for different values
within an interval, that interval is too wide. Let's look at an example.

To estimate the percent of people in our industry who will attend the annual conference, we might construct a confidence interval that ranges from 7% to 13%. If we would select a different conference venue when the true percentage is 7% than when it is 13%, we need to tighten our range.

Now, before we are ready to actually create our own confidence intervals, there is a technical point we need to be
acquainted with. We need to know that the standard deviation of the Distribution of Sample Means is σ, the
standard deviation of the underlying population, divided by the square root of n, the sample size.

We won't prove this fact here, but simply note that it is true, and that it should confirm our general intuition about
the Distribution of Sample Means. For example, if we have huge samples, we'd expect the means of those large
samples to be tightly clustered around the true population mean, and thereby form a narrow distribution.

Summary
A confidence interval is an estimate for the mean of a population. It specifies a range that is likely to contain the population mean. A confidence interval is centered at the mean of a sample randomly drawn from the population under study. At a confidence level of 95%, we expect intervals of this width centered at 95 out of 100 such sample means to contain the population mean.

Finding a Confidence Interval


You understand the theory behind a confidence interval. But how do you actually construct one?

We can now translate the previous discussion into a simple method for finding a confidence interval for the mean
of any population. First, we randomly select a sample of size 30 from the population.

We then compute the mean and standard deviation of the sample.

Next, we assign the sample mean as the center of the confidence interval.

To find the width of the interval, we must know the level of confidence we want to assign to the interval. If we want a 95% confidence interval, the interval should extend about 2 times the standard deviation of the population divided by the square root of n, the sample size, on either side of the sample mean.

Since we typically don't know the standard deviation of the population, we substitute the best estimate that we do
have — the standard deviation of the sample.

Here's what the equation looks like for our example.

If we want a level of confidence other than 95%, instead of multiplying s/sqrt(n) by 2, we multiply by the z-value
corresponding to the desired level of confidence.

We can use this formula to compute any confidence interval. There is one restriction: in order for it to work, the
sample size has to be at least 30.
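
As a sketch of the method, the Python code below builds a 95% confidence interval from a sample using exactly these steps. The sample values are invented for illustration only.

import math

# A hypothetical sample of 30 observations (invented for illustration)
sample = [52, 47, 61, 55, 49, 58, 44, 63, 51, 57,
          48, 60, 53, 46, 59, 50, 62, 45, 56, 54,
          64, 43, 52, 58, 49, 55, 61, 47, 53, 57]

n = len(sample)                      # sample size (at least 30)
mean = sum(sample) / n               # center of the confidence interval

# Sample standard deviation, our stand-in for the unknown population sigma
s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))

z = 1.96                             # z-value for 95% confidence
half_width = z * s / math.sqrt(n)

print("95% confidence interval:", mean - half_width, "to", mean + half_width)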

Wine Lover's Magazine


Let's walk through an example. Wine Lover's Magazine's managers have asked us to help them estimate the
average age of their subscribers so they can better target potential advertisers.

We tell them we plan to survey a sample of their subscribers. They say they're comfortable with our working with
a sample, but emphasize that they want to be 95% confident that the range we give them contains the true
average age of its full set of subscribers.

We obtain survey results from 60 randomly-chosen subscribers and determine that the sample has a mean of 52
and a standard deviation of 40.

To find an appropriate confidence interval, we incorporate the sample information into the formula: 52 plus or minus z times 40/sqrt(60).

The z-value for a 95% confidence interval is about 2, or more accurately, about 1.96. This tells us that a 95%
confidence interval would begin at 52 minus 10.12, or 41.88, and end at the mean plus 10.12, or 62.12.

We give management the range from 41.88 to 62.12 as an estimate of the average age of its subscribers, telling
them they can be 95% confident that the true population mean falls between these values.

What if we want a confidence level other than 95%? We can use the sample mean, standard deviation, and size
from the sample data, but how do we obtain the right z-value?

Obtaining the z-value


The z-value for 95% confidence is well known to be about 2, but how do we find a z-value for a less common
confidence interval? To be 98% confident that our interval contains the population mean, how do we obtain the
appropriate z-value?

To find the z-value for 98% confidence level, we are essentially asking: How far to the left and right of the
standard normal curve's mean do we have to go to capture 98% of the area?

Capturing 98% of the area centered at the mean of the normal curve leaves two areas at the tails, each covering
1% of the area under the curve. The z-value of the right boundary is the z-value associated with a cumulative
probability of 99% — the sum of the central 98% and the 1% in the left tail.

Converting the desired confidence level into the corresponding cumulative probability on the standard normal
curve is essential because Excel's NORMSINV function and the z-table work with cumulative probabilities.

To find the z-value associated with a cumulative probability of 99%, enter into Excel =NORMSINV(0.99), which
returns the z-value 2.33. Or, look in the z table and find the cell that contains a cumulative probability closest to
0.9900. The z-value is 2.33, the sum of the row-value 2.3 and the column-value 0.03.

Try finding a z-value yourself. Find the z-value associated with a 99.5% confidence level using the appropriate
normal distribution function in Excel. Alternatively, you can use the Standard Normal Table (z-table) in your
briefcase.

The correct z-value for a confidence level of 99.5% is:

z-table
Excel

Our first step is to convert the confidence level of 99.5% into the corresponding cumulative probability on the
standard normal curve. To do this, note that to have 99.5% probability in the middle of the standard normal
curve, we must exclude a total area of 1 - 99.5% = 0.5% from the curve. That area is divided into two equal parts
in the distribution's tails: 0.25% in each tail.

We can now see that the cumulative probability associated with confidence level of 99.5% is 1 − 0.25% = 99.75%.
Thus, the z-value for a confidence level of 99.5% is the same as the z-value of a cumulative probability of 99.75%.
We find the z-value in Excel by entering =NORMSINV(0.9975), which returns the value 2.81. Alternatively, we
could find the z-value in the z-table by looking up the probability 0.9975.
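
The same conversion can be checked in code. The Python sketch below uses scipy.stats.norm.ppf, the counterpart of Excel's NORMSINV, to turn a confidence level into a z-value.

from scipy.stats import norm

def z_for_confidence(confidence):
    # Split the excluded area equally between the two tails, then look up
    # the z-value at the right boundary's cumulative probability.
    tail = (1 - confidence) / 2
    return norm.ppf(1 - tail)

print(z_for_confidence(0.98))     # about 2.33 (cumulative probability 99%)
print(z_for_confidence(0.995))    # about 2.81 (cumulative probability 99.75%)
print(z_for_confidence(0.95))     # about 1.96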

Summary
To calculate a confidence interval, we take a sample, compute its mean and standard deviation, and then build a range around the sample mean with a specified level of confidence. The confidence level indicates how confident we are that the interval constructed around the sample mean contains the population mean.

Using Small Samples


We assumed in our confidence limit calculations that the sample size was at least 30. What if it isn't? What if we
have only a small sample? Let's consider a different survey, one that concerns a delicate matter.

The business manager of a large ocean liner, the Demiurgos, asks for our help. She wants us to find out the value
of her guests' belongings. She needs this value to determine the correct insurance protection in case guest
belongings disappear from their cabins, are destroyed in a fire, or sink with the ship.

She has no idea how valuable her guests' belongings are, but she feels uneasy asking them for this information.

She is willing to ask only 16 guests to estimate the total value of the belongings in their cabins. From this sample,
we need to prepare an estimate.

With a sample size less than 30, we cannot calculate confidence intervals in the same way as with a large sample
size. A small sample increases our uncertainty about two important aspects of our estimate of the population
mean.

First, with a small sample, the consequences of the Central Limit Theorem are not assured, so we cannot be sure
that the sample means follow a normal distribution.

Second, with a small sample, we can't be sure that the sample standard deviation is a good estimate of the
population standard deviation.

Due to these additional uncertainties, we cannot use z-values to construct confidence intervals. Using a z-value
would overstate our confidence in our estimate.

Can we still create a confidence interval? Is there a way to estimate the population mean even if we have only a
handful of data points?

It depends: if we don't know anything about the underlying population, we cannot create a confidence interval
with fewer than 30 data points. However, if the underlying population is normally distributed — or even roughly
normally distributed — we can use a confidence interval to estimate the population mean.

In practice, as long as we are sure the underlying population is not highly skewed or extremely bimodal, we can
construct a confidence interval, even when we have a small sample. However, we do need to modify our
approach slightly.

To estimate the population mean with a small sample, we use a t-distribution, which was discovered in the early
20th century at the Guinness Brewing Company in Ireland.

A t-distribution gives us t-values in much the same way as a normal distribution gives us z-values. What is the
difference between the normal distribution and the t-distribution?

A t-distribution looks similar to a normal distribution, but is not as tall in the center and has thicker tails,
because it is more likely than the normal distribution to have values fall farther away from the mean.

Therefore, the normal distribution's "rules of thumb" for 68% and 95% probabilities no longer hold. For
example, we must go more than 2 standard deviations on either side of the mean to capture 95% of the
probability for a t-distribution.

Thus, to achieve the same level of confidence, a confidence interval based on a t-distribution will be wider than
one based on a normal distribution. This reinforces our intuition: we have less certainty about our estimate with
a smaller sample, so we need a wider interval to achieve a given level of confidence.

The t-distribution is also different because it varies with the sample size: For each sample size, there is a
different t-value associated with a given level of confidence. The smaller the sample size n, the shorter the height
and the thicker the tails of the t-distribution curve, and the farther we have to go from the mean to reach a given
level of confidence.

On the other hand, as the sample size increases, the shape of the t-distribution becomes more and more like the
shape of a normal distribution. Once we reach a sample size of 30, the t-distribution becomes virtually identical
to the z-distribution, so t-values and z-values can be used interchangeably.

Incidentally, we can use the t-distribution even for sample sizes larger than 30. However, most people use the z-
distribution for larger samples, partially out of habit and partially because it's easier, since the z-value doesn't
vary based on the sample size.

Finding the t-value


To find the right t-value, we first have to identify the t-distribution that corresponds to our sample size. We do
this by finding the number of "degrees of freedom" of the sample, which for our purposes is simply the sample
size minus one. If our sample size is 16, we have 15 degrees of freedom, and so on.

Excel provides a simple function for finding the appropriate t-value for a confidence interval. If we enter 1
minus the level of confidence we want and the degrees of freedom into the Excel function TINV, Excel gives us
the appropriate t-value.

For example, for a 95% confidence interval and a sample size of n = 16, the Excel function TINV(0.05,15)
would return the value 2.131.

Excel

Once we find the t-value, we use it just like we used the z-value to find a confidence interval. For example, for t = 2.131, the appropriate confidence interval is the sample mean plus or minus 2.131 times s/sqrt(n).

Excel

If we don't have Excel handy, we can use a t-distribution table to find the t-value associated with the degrees
of freedom and the confidence level we specify. When using different t-value tables, we need to be careful to
note which probability the table depicts.

Excel
t-table

Some tables report values associated with the confidence level, like 0.95. Others report values based on the
area in the tails, which would be 0.05 for a 95% confidence interval. Our t-table, like many others, reports
values associated with a cumulative probability, so for a 95% level of confidence, we would have to look at a
cumulative probability of 97.5%.

Excel
t-table

The Good Ship Demiurgos


Returning to the good ship Demiurgos, let's determine an estimate of the average value of passengers'
belongings. The manager samples 16 guests, and reports that they have an average of $10,200 worth of
clothing, jewelry, and personal effects in their cabins. From her survey numbers, we calculate a standard
deviation of $4,800.

We need to double check that the distribution isn't too skewed, which we might expect, since some of the
passengers are quite wealthy. The manager explains that the insurance policy has a limited liability clause that
limits a passenger's maximum claim to $20,000.

Above $20,000, passengers' own homeowners' policies must cover any losses. Thus, in the survey, if a guest
reported values above $20,000, the manager simply reported $20,000 as the value to be covered for our data
set.

We sketch a graph of the 16 values that confirms that the distribution is not too asymmetric, so we feel
comfortable using the t-distribution.

Since we have a sample of 16 passengers, there are 15 degrees of freedom. The Excel function =TINV(0.05,15)
tells us that the appropriate t-value is 2.131.

Excel

Using the confidence interval formula, the guests' valuables are worth $10,200 plus or minus 2.131 times $4,800 over the square root of 16. Thus, the interval extends 2.131*4,800/4 = $2,557 on either side of the sample mean, and we can report that we are 95% confident that the average value of passengers' belongings is between $7,643 and $12,757.

Excel
t-table
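
The same calculation can be reproduced in code. The Python sketch below uses scipy.stats.t.ppf in place of Excel's TINV; the figures are the survey numbers above, and small rounding differences from the text are possible.

import math
from scipy.stats import t

n = 16          # guests surveyed
mean = 10200    # sample mean value of belongings, in dollars
s = 4800        # sample standard deviation, in dollars

df = n - 1      # degrees of freedom
# t.ppf takes a cumulative probability, so 95% confidence uses 0.975
t_value = t.ppf(0.975, df)            # about 2.131, like TINV(0.05, 15)

half_width = t_value * s / math.sqrt(n)
print(round(mean - half_width), "to", round(mean + half_width))
# Roughly 7,643 to 12,757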

What if the Demiurgos' manager thinks this interval is too large?

Excel
t-table

She will have to survey more guests. Increasing the sample size causes the t-value to decrease, and also
increases the size of the denominator (the square root of n). Both factors narrow the confidence interval.

Excel
t-table

For example, if she asks 10 more guests, and the standard deviation of the sample does not change, the t-value would drop to 2.06 and the square root of n in the denominator would increase. The distance the interval extends on either side of the mean would decrease significantly, from $2,557 to $1,939.

Excel
t-table

Summary
Confidence intervals can be constructed even with a sample size of less than 30, as long as the population is
roughly normally distributed (or, at least not too skewed or bimodal). To find a confidence interval with a
small sample, use a t-distribution. T-distributions are a set of distributions that resemble the normal
distribution, but with shorter heights near the mean and thicker tails. To find a confidence interval for a small
sample size, place the appropriate t-value into the confidence interval formula.

Choosing a Sample Size


When we take a survey, we often want a specific level of accuracy in our estimate of the population mean. For
example, when estimating car owners' average spending on car repairs each year, we might want to be 95%
confident that our estimate is within $50 of the true mean.

We know that the sample size of our survey directly affects the accuracy of our estimate. The larger the sample
size, the tighter the confidence interval and the more accurate our estimate. A sample of size n gives us a confidence interval that extends a distance of d on either side of the mean, where d = z*sigma/sqrt(n).

To find the sample size necessary to give us a specified distance d from the mean, we must have an estimate of
sigma, the standard deviation of spending. If we do not have an estimate based on past data or some other
source, we might take a preliminary survey to obtain a rough estimate of sigma.

In this example, we estimate sigma to be $300 based on past experience. Since we want a 95% level of
confidence, we set z = 1.96. To ensure our desired accuracy — that d is no more than $50 — we must randomly
sample at least 139 people.

In general, to ensure a confidence interval extends a distance of at most d on either side of the mean, we choose a sample size n that satisfies n ≥ (z*sigma/d)^2. We can find this n with simple algebra, or by using the attached Excel utility.

Confidence Interval Utility
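
The arithmetic for the car-repair example can be written out directly. The Python sketch below solves d = z*sigma/sqrt(n) for n and rounds up to the next whole person.

import math

sigma = 300    # estimated population standard deviation, in dollars
z = 1.96       # z-value for 95% confidence
d = 50         # largest acceptable distance from the mean, in dollars

# Rearranging d = z * sigma / sqrt(n) gives n >= (z * sigma / d) ** 2
n = math.ceil((z * sigma / d) ** 2)
print(n)       # 139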

Summary
When estimating a population mean, we can ensure that our confidence interval extends a distance of at most
d on either side of the mean by choosing an appropriate sample size.

Step-by-Step Guide
Here is a step-by-step process for creating a confidence interval:

First, we choose a level of confidence and a sample size n appropriate to the decision context.

Second, we take a random sample and find the sample mean. This is our best estimate for the population mean.

Third, we find the sample's standard deviation.

Fourth, we find the z-value or t-value associated with the proper confidence level. If our sample size is at least 30, we find the z-value for our confidence level. If not, we find the t-value for our confidence level, with degrees of freedom = sample size - 1.

Fifth, we calculate the end points of the confidence interval: the sample mean plus or minus the z-value (or t-value) times s/sqrt(n).

Summary
Construct confidence intervals using the steps outlined below. With a confidence interval derived from an
unbiased random sample, we can say that the true population mean falls within the interval with the
corresponding level of confidence.

Excel Utility
Click here to open an Excel utility that allows you to create confidence intervals by providing the sample mean,
standard deviation, size, and desired level of confidence. You should enter data only in the yellow input areas of
the utility. To ensure you are using the utility correctly, try to reproduce the results for the Wine Lover's
Magazine and the Demiurgos examples.

Solving the Scuba Problem II


The sample you collected earlier has all the data you need to create a confidence interval for Leo's problem.

You take another look at the survey you created earlier for Leo: you sampled 45 guests, and calculated that the
average satisfaction rate of the sample was 4.4, with a standard deviation of 1.54. Using this information, you
decide to create a 95% confidence interval for Leo.

Your calculations show the following:

Confidence Interval Utility


z-table
t-table

To create a 95% confidence interval, you take the mean of the sample and add/subtract the z-value multiplied by
the sample standard deviation divided by the square root of the sample size. Using the numbers given, you obtain a
95% confidence interval by going 0.45 points above and below the sample mean of 4.4, which translates into a
confidence interval from 3.95 to 4.85.

You meet with Leo and tell him that you can be 95% certain that the population mean falls between 3.95 and 4.85.
Leo looks at your numbers and shakes his head.

That's just not accurate enough for me to make a decision. If the mean is close to 4.85, I'd be happy, but if it's
closer to 4, I'm concerned. Can we narrow the range at all?

Looking over your notes, you think you can give Leo some options.

Why don't you create a larger sample and report the results back to me?

You select another 40 guests at random and ask the hotel operator to conduct the survey for you again. He is able
to reach 25 guests. You combine the two samples, which gives a new sample size of 70.

For the combined sample, you find that the new sample mean is 4.5 and the new sample standard deviation is 1.2.
Armed with more data, you create another confidence interval.

We can be 95% certain that the average satisfaction of all hotel guests with the scuba school is between:

Confidence Interval Utility


z-table
t-table

To create this 95% confidence interval, you take the mean of the sample and add/subtract the z-value multiplied by
the sample standard deviation divided by the square root of the sample size. Using the new numbers given for the
larger sample, you obtain a 95% confidence interval by going 0.28 points above and below the sample mean of 4.5,
which translates into a confidence interval from 4.22 to 4.78.
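
Both of Leo's intervals follow from the same z-based formula; the short Python sketch below (illustrative only) reproduces the two calculations.

import math

def ci_95(mean, s, n):
    # 95% confidence interval: mean plus or minus 1.96 * s / sqrt(n)
    half_width = 1.96 * s / math.sqrt(n)
    return mean - half_width, mean + half_width

print(ci_95(4.4, 1.54, 45))   # roughly (3.95, 4.85), the first sample
print(ci_95(4.5, 1.20, 70))   # roughly (4.22, 4.78), the combined sample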

Thank you. I am much happier with this result. I have enough information now to decide whether to keep the
current scuba diving school.

Exercise 1: The Veetek VCR Gambit


Toshi Matsumoto is the Chief Operating Officer of a consumer electronics retailer with over 150 stores spread
throughout Japan. For over a year, the sales of high-end VCRs have lagged, due to a shift towards DVD players.

Just today, Toshi heard that Veetek, a large South Asian electronics retailer, is looking to purchase a bulk
shipment of high-end VCRs.

This would be a perfect opportunity for Toshi to liquidate slow-moving inventory currently languishing on the
shelves of his stores. Before he calls Veetek, he wants to know how many high-end VCRs he can promise. After
two days of furious phone calls, his deputy has gathered data from 36 representative outlets in his retail chain.

The mean high-end VCR inventory in each store polled was 500 units. The standard deviation was 180. Toshi
needs you to find a 95% confidence interval for the average VCR inventory per store. The interval is

Confidence Interval Utility


z-table
t-table

Exercise 2: Pulluscular Pig Disorder

Paul Segal manages the pig-farming division of the agricultural company Bowman-Lyons-Centerville. A
rumored outbreak of Pulluscular Pig Disorder (PPD) in one of Paul's herds is on the verge of causing a public
relations disaster.

The main symptom of PPD is a shrinking brain, and the only certain way to diagnose PPD is by measuring brain
size post-mortem.

Paul needs to know if his herd is affected by PPD, but he does not want to have to slaughter hundreds of swine to
find out. At the preliminary stage, he can offer no more than 5 prime porkers to be slaughtered and diagnosed.

For the pigs slaughtered, the mean brain weight was 0.18 lbs, with a standard deviation of 0.06 lbs. With 95%
confidence, in what range does the herd's average brain weight lie?

Confidence Interval Utility


z-table
t-table

Proportions
The next morning, you and Alice are about to head off to the hotel pool when Leo calls you.

The Customer Response Problem


I'm sorry to disturb you, but I have another problem, and I think you might be able to help.

The Kahana is a very popular resort during the summer tourist season. But the number of leisure visitors drops
significantly during the off-season, from September through February and then April through May.

We usually have quite a few room vacancies during that period of time. We expect to have about 200 rooms vacant
for weeklong periods during the slow season this year.

I've developed a new program that rewards our best guests with a special discount if they book a weeklong stay
during our slow period. They won't have complete date flexibility of course, but the steep discount should make the
offer attractive for them.

To see how many of our past guests would accept such an offer, I sent promotional brochures to 100 of them. The
deadline by which they had to respond to the offer has passed. Ten guests responded with the required room
deposit prior to the deadline — that's a solid 10 percent.

I figure if we send out 2,000 promotions, we'll get about 200 responses.

This is a nice idea, Leo, but I'm concerned it could backfire. If more than 10% respond to this offer, you might end up disappointing some of the very guests you're trying to reward. Or, if too many respond and you give them all the discount, you'll have to turn away customers willing to pay full price.

That is exactly my concern. I wonder how accurate the 10% response rate is. Just because it held for 100 guests,
will it hold for 2,000? What if 11% actually respond to the promotions?

Imagine what would happen if 220 guests responded. I don't want to anger 20 loyal customers by telling them the
offer is not valid, but I also don't want to turn away full paying guests to accommodate the extra 20 guests at a
discount.

I'm willing to reserve 200 rooms for these discount weeklong stays during the slow season. How many return
guests can I safely send the discount offer and be confident that no more than 200 will respond?

You can tell that Leo is growing quite comfortable with relying on your statistical methods. He seems almost as
interested in them as he is in your results.

Confidence Intervals And Proportions


Sometimes, the question we pose to members of a sample calls for a yes or no answer.

We might survey people in a target market and ask if they plan to buy a new car this year. Or survey voters and ask
if they plan to vote for the incumbent candidate for office. Or we might take a sample of the products our plant
produced yesterday and count how many are defective.

Even though our question has only two answers, we still have to address an inherent uncertainty: We know what
values our data can take — yes or no — but we don't know how often each response will be given.

In these cases, we usually convey the survey results by reporting the percentage of yes responses as a proportion,
p-bar. This is our best estimate of p, the true percentage of "yes" responses in the underlying population.

Suppose, for example, that we have posted advertisements in the subway cars on Boston's "Red Line," and want to
know what percentage of all passengers remembers seeing our ad.

We create a proper survey, and ask randomly selected Red Line passengers if they remember seeing our ad. 300
passengers respond to our survey, of which 100 passengers report remembering the ad.

Then p-bar is simply 33%, which is the number of people that remember the ad, 100, divided by the number of
respondents, 300.

The remaining 200 passengers, or 67% of the sample, report not remembering the ad. The two proportions always
add up to 1 because survey respondents report either remembering the ad or not.

Once we know the proportion of the sample, we can draw conclusions about all Red Line passengers. Our best
estimate, or point estimate, for p, the percentage of all passengers who remember seeing our ad, is 33%.

As managers, we typically want more than this simple point estimate — we want to know how accurate the
estimate is. How far from 33% might the true percentage be? Can we say confidently that it is between 30% and
36%, for example?

When we work with proportions, how do we find a confidence interval around our point estimate?

The process for creating a confidence interval around a proportion is nearly identical to the process we've used
before. The only difference is that we can approximate the standard deviation of the population with a simple
formula rather than calculating it directly from the raw data.

Based on our sample, our best estimate of the true population proportion is p-bar, the percentage of "yes" responses in our survey. Statistical theory tells us that our best estimate of the standard deviation of the underlying population is the square root of [(p-bar)*(1 - (p-bar))]. We can use this approximate standard deviation to determine a confidence interval for the proportion.

For our Red Line ad, we approximate the standard deviation with the square root of 0.33 times 0.67, or 0.47. A
95% confidence interval is 0.33 plus or minus 1.96 times 0.47 divided by the square root of 300. This is equal to
0.33 plus or minus 0.053, or 27.7% to 38.3%.
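
Here is the Red Line calculation written out as a short Python sketch, following the approximation just described.

import math

n = 300      # respondents
yes = 100    # passengers who remembered the ad

p_bar = yes / n                           # sample proportion, about 0.33
sd = math.sqrt(p_bar * (1 - p_bar))       # approximate standard deviation, about 0.47

half_width = 1.96 * sd / math.sqrt(n)     # about 0.053
print(p_bar - half_width, p_bar + half_width)
# Roughly 0.28 to 0.39; rounding p-bar to 0.33 first, as the text does,
# gives 27.7% to 38.3%.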

Unfortunately, there is one catch when we calculate confidence intervals around proportions...

Sample Size
Sample size matters, particularly when dealing with very small or very large proportions. Suppose we are
sampling New Yorkers for Amyotrophic Lateral Sclerosis, commonly known as Lou Gehrig's Disease. In the U.S.,
the odds of having the disease are less than 1 in 10,000. Would our sample be useful if we surveyed 100 people?

No. We probably wouldn't find a single person with the disease in our sample. Since the true proportion is very
small, we need to have a large enough sample to make sure we find at least a few people with the disease.
Otherwise, we will not have enough data to get a good estimate of the true proportion.

There is a guideline we must meet to make sure that our sample is large enough when estimating proportions.
Two conditions must be met: First, the product of the sample size and the proportion must be at least 5. Second,
the product of the sample size and 1 minus the proportion must also be at least 5.

If both these requirements are met, we can use the sample. Essentially, this guideline guarantees that our
sample contains a reasonable number of "yes" and a reasonable number of "no" answers. Our sample will not be
useful otherwise.

To avoid an invalid sample, we need to create a large enough sample size to satisfy the requirements. However,
since we don't know the proportion p-bar before sampling, we don't know if the two conditions are met before
setting the sample size. How can we get around this problem?

Finding a Preliminary Estimate of p-bar


We can obtain a preliminary estimate of p-bar using either of two methods: first, we can use past experience. For
example, to estimate the rate of Lou Gehrig's disease, we can research the rate of occurrence in the general
population. This is a reasonable first estimate for p-bar.

In many cases, however, we are sampling for the first time. Without past experience, we don't know what p-bar
might be. In this case, it may well be worth our time to take a small test sample to estimate the proportion, p-
bar.

For example, if the proportion of yes answers in our small test sample is 3%, then we can use 3% as our
preliminary estimate of p-bar.

Substituting 3% for p-bar in our two requirements, n(p-bar) ≥ 5 and n(1 - (p-bar)) ≥ 5, tells us that n must satisfy
n*0.03 ≥ 5 and n*0.97 ≥ 5. Thus the sample size we need for our real sample must be at least 167.

We would then use a real sample — with at least 167 respondents — to find an actual sample value of p-bar to
create a confidence interval for the population proportion.
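
The required sample size can be found mechanically. The Python sketch below takes the preliminary estimate of 3% and finds the smallest n satisfying both requirements.

import math

p_bar = 0.03    # preliminary estimate of the proportion

# Both n * p_bar >= 5 and n * (1 - p_bar) >= 5 must hold
n_min = math.ceil(max(5 / p_bar, 5 / (1 - p_bar)))
print(n_min)    # 167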

Summary
Proportions are often used to indicate the frequency of some characteristic in a population. The sample
proportion p-bar is the number of occurrences of the characteristic in the sample divided by the number of
respondents, the sample size. It is our best estimate of the true proportion in the population. We can construct a
confidence interval for the population proportion. Two guidelines for the sample size must be met for a valid
confidence interval: n(p-bar) and n(1 - (p-bar)) must each be at least five.

Solving the Customer Response Problem


Creating confidence intervals around proportions is not much different from creating them around means. Finding
the right number of Leo's promotional brochures to mail should be easy.

Leo needs to know how accurate the 10 percent response rate of his 100-customer sample is. Will this response
rate hold for 2,000 guests? To how many guests can he send the discount offer for his 200 rooms?

First, you calculate a 95% confidence interval for the response rate.

Enter the lower bound as a decimal number with two digits to the right of the decimal (e.g., enter "5" as "5.00").
Round if necessary.

z-table
Confidence Interval Utility

The 95% confidence interval for the proportion estimate is 0.0412 to 0.1588, or 4.12% to 15.88%. You obtain that answer by using the sample data and applying the familiar formula:

Then, after giving Leo's questions some thought, you recommend to him that he send the mailing to a specific
number of guests.

Enter the number of guests as an integer (e.g., "5"). Round if necessary.

z-table

Based on the confidence interval for the proportion, the maximum percentage of people who are likely to respond to the discount offer (at the 95% confidence level) is 15.88%. So, if 15.88% of recipients were to respond and only 200 rooms are available, to how many people can Leo send the offer? Simply divide 200 by 0.1588 to get the answer: Leo can send the offer to at most 1,259 past guests.
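
The two answers for Leo come from one short calculation, sketched below in Python for illustration.

import math

n = 100           # guests who received the test mailing
responses = 10    # guests who sent the required deposit

p_bar = responses / n                                   # 0.10
half_width = 1.96 * math.sqrt(p_bar * (1 - p_bar)) / math.sqrt(n)

upper = p_bar + half_width
print(p_bar - half_width, upper)       # about 0.0412 to 0.1588

# Largest mailing that keeps responses at or below 200 rooms even at the
# upper end of the confidence interval
print(math.floor(200 / upper))         # about 1,259 guests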

Leo is pleased with your work. He tells you to relax and enjoy the resort.

Exercise 1: GMW Automotive


GMW is a German auto manufacturer that has regional sales subsidiaries throughout the world. Arturo Lopez
heads the Mexican sales division of the company's Latin American subsidiary.

GMW earns additional profit when customers choose to finance their car purchase with a GMW financing
package. Arturo has been asked to submit a report to the GMW CEO in Germany about the percentage of GMW
customers who opt for financing.

Arturo has asked you, a new member of the division sales team, to devise a way to estimate this percentage. You
take a random sample of 64 cars sold in the Mexican sales division, and find that 13 of them, or about 20.3%,
opted for GMW financing.

If you want to be 95% confident in your report to Mr. Lopez, you should tell him that the percentage of all
Mexican customers opting for GMW financing falls in the range:

z-table
Confidence Interval Utility

Exercise 2: Crown Toothpaste


Kayleigh Marlon is the Chief Buyer at Tar-Mart, a company that operates a chain of superstores selling discount
merchandise. Tar-Mart has a huge national presence, and manufacturers compete fiercely to get their products
onto Tar-Mart's shelves.

Crown Toothpaste, a new entrant in the toothpaste market, is one of them. Kayleigh agreed to stock Crown for 4
weeks and display it prominently. After that period, she will stop stocking Crown unless 5% of Tar-Mart's
customers bought Crown or were considering buying Crown within the next month.

The trial period is now over. Kayleigh has asked you to take a sample of customers to see if Tar-Mart should
continue stocking Crown. She would like you to be at least 95% confident in your answer.

The first step is to decide how large a sample size to choose. Kayleigh tells you that, in the past, when Tar-Mart
introduced a new product, the percentage of people who expressed interest ranged between 2% and 10%. What
sample size should you use?

You choose a sample size of 250. After conducting the survey, you find that 10 out of 250 people surveyed had
bought Crown or were considering buying Crown within the next month. What is the 95% confidence interval for
the population proportion?

z-table
Confidence Interval Utility

First, you find the sample proportion: 10 out of 250 is a proportion of 4%. You verify that n(p-bar) = 250*0.04 =
10 ≥ 5 and n(1 - (p-bar)) = 250*0.96 = 240 ≥ 5. Then, using the formula, you find the confidence interval around
the sample proportion. The endpoints of that interval are 1.6% and 6.4%.

Challenge: OOPS! Package Deliveries


OO-P-S is a small-package delivery service with worldwide operations. Celine Bedex, VP Marketing, has heard
increasing complaints about late deliveries, and wants to know how many of the shipments are late by one day or
more.

Celine would like an estimate of the percentage of late deliveries. In a sample of 256 shipments, 2 were delivered
late, a proportion of about 0.008, or 0.8%. If Celine wants to be 99% confident in the result of a confidence
interval calculation, the interval is:

Celine collects a new sample, this time of 729 shipments. Of these, 8 were late. Celine can be 99% confident that
the population proportion of late packages is between:

z-table
Confidence Interval Utility

First, calculate the sample proportion for the new sample: 8/729 = 0.011. Then, verify that the new sample size
satisfies the rules of thumb. Both n(p-bar) and n(1 - (p-bar)) are greater than 5.

Using the new sample size and sample proportion, calculate the confidence interval: [0.1%, 2.1%].
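
The same formula, with the 99% z-value of about 2.58 in place of 1.96, reproduces this interval. A minimal Python check:

import math

p_bar = 8 / 729     # about 0.011
z = 2.576           # z-value for 99% confidence

half_width = z * math.sqrt(p_bar * (1 - p_bar)) / math.sqrt(729)
print(p_bar - half_width, p_bar + half_width)    # roughly 0.001 to 0.021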

Hypothesis Testing

Introduction
After finishing the sampling assignments, you and Alice decide to take some time off to enjoy the beach. Just as you
are gathering your beach gear, Leo gives you another call.

Improving the Kahana


Hi there! Don't let me keep you from enjoying the beach. I just wanted to let you know what I'd like you to help me
with next. I've been working on ideas to increase the Kahana's profits.

Is it possible to increase profits by raising the room prices? That would be an easy solution.

I wish it were that easy. Room prices are extremely competitive and are often the first thing potential guests take
into consideration. So if we increase room prices, I'm afraid we'll have fewer guests. That might put us back where
we started from with profits — or even worse.

What other factors influence your profits?

The two major ones are room occupancy rates and discretionary spending. "Discretionary spending" is the money
guests spend on non-room amenities. You know, food, drinks, spa services, sports activities, and so on.

As a manager I can affect a variety of factors that influence discretionary spending: the quality of the restaurant,
for example, or the types of amenities offered.

And you'd like us to help you understand your guests' discretionary spending patterns better.

Right. Then I can explore new ways to increase profits on non-room amenities. I can also see if some of my recent
efforts to increase guest spending have paid off.

I'm particularly interested in restaurant operations. I've made some changes to the restaurants recently. For
example, I hired a new executive chef last year. I'd like to know if restaurant revenues per person have changed
since then.

I'd also like to find out if the renovation of our premier cocktail lounge has resulted in higher spending on
beverages.

Finally, I've been wondering if discretionary spending patterns are different for leisure and business guests. If so, I
might change our marketing campaigns to better suit each of those market segments.

What records do you have for us to work with?

We don't have a consolidated report for this year yet, so we'll need to conduct some surveys and analyze the
results.

You're really getting into these statistical methods, aren't you, Leo?

Definition
Leo made some important changes to his business and he has some ideas of what the impact of these changes has
been. How do you put his ideas to the test?

As managers, we often need to put our claims, ideas, or theories to the test before we make important decisions.
Based on whether or not our claim is statistically supported, we may wish to take managerial action.

Hypothesis testing is a statistical method for testing such claims. A hypothesis is simply a claim that we want to
substantiate. To begin, we will learn how to test hypotheses about population means.

For instance, suppose we know that the historical average number of defects in a production process is 3 defects
per 1,000 units produced. We have a hunch that a certain change to the process — a new machine, say — has
changed this number. The hypothesis we wish to substantiate is that the average defect rate has changed — that it
is no longer 3 per 1,000.

How do we conduct a hypothesis test? First, we collect a random sample of units produced by the process. Then,
we see whether or not what we learn about the sample supports our hypothesis that the defect rate has changed.

Suppose our sample has an average defect rate of 2.7 defects per 1,000. Based on this sample, can we confidently
say that the defect rate has changed?

That depends. To find out, we construct a range around the historical defect rate of 3 — the population mean that
has been cast in doubt. We construct the range so that if the mean defect rate in the population is still 3, it
is very likely for the mean of a sample taken from the population to fall within that range.

The outcome of our test will depend on whether 2.7, the mean of the sample we have taken, falls within the range
or not.

If the sample mean of 2.7 falls outside of the range, we feel comfortable rejecting the hypothesis that the defect
rate is still 3.

However, if the sample mean falls within the range, we don't have enough evidence to support the claim that the
defect rate has changed.

This example captures the essence of hypothesis testing, but we need to formalize our intuition about the example
and define our new statistical technique more precisely.

To conduct a hypothesis test, we formulate two hypotheses: the so-called null hypothesis and the alternative
hypothesis.

Based on experience or conventional wisdom, we have an initial value of the population mean in mind. The null
hypothesis states that the population mean is equal to that initial value: in our example, the null hypothesis states
that the current population mean is 3 defects per 1,000. We use the Greek letter mu to represent the population
mean, in this case the current average defect rate.

The alternative hypothesis is the claim we are trying to substantiate. Here, the alternative hypothesis is that the
average defect rate has changed. Note that the alternative hypothesis states that the null hypothesis does not
hold.

As the example suggests, in a hypothesis test, we test the null hypothesis. Based on evidence we gather from a
sample, there are only two possible conclusions we can draw from a hypothesis test: either we reject the null
hypothesis or we do not reject it.

Since the alternative hypothesis states the opposite of the null hypothesis, by "rejecting" the null hypothesis we
necessarily "accept" the alternative hypothesis.

In our example, the evidence from our sample will help us determine whether or not we should reject the null
hypothesis that the defect rate is still 3 in favor of the alternative hypothesis that the defect rate has changed.

Based on our sample evidence, which conclusion should we draw? We reject the null hypothesis if it is highly
unlikely that our sample mean would come from a population with the mean stated by the null hypothesis.

For example, if the sample we drew had a defect rate of 14 per 1,000, we would reject the null hypothesis. Drawing
a sample with 14 defects from a population with an average defect rate of 3 would be very unlikely.

"We cannot reject the null hypothesis if it is reasonably likely that our sample mean would come from a
population with the mean stated by the null hypothesis. The null hypothesis may or may not be true: we simply
don't have enough evidence to draw a definite conclusion."

For example, if the sample we drew had a defect rate of 3.05 per 1,000, we could not reject the null hypothesis,
since it wouldn't be unusual to randomly draw a sample with 3.05 defects from a population with an average defect
rate of 3.

Note that having the sample's average defect rate very close to 3 does not "prove" that the mean is 3. Thus we
never say that we "accept" the null hypothesis — we simply don't reject it.

It is because we can never "accept" the null hypothesis that we do not pose the claim that we actually want to
substantiate as the null hypothesis — such a test would never allow us to "accept" our claim! The only way we can
substantiate our claim is to state it as the opposite of the null hypothesis, and then reject the null hypothesis based
on the evidence.

It is important that we understand exactly how to interpret the results of a hypothesis test. Let's illustrate the two
types of conclusions with an analogy: a US jury trial.

In the US judicial system, the accused is considered innocent until proven guilty. So, the null hypothesis is that the
accused is innocent. The alternative hypothesis is that the accused is guilty: this is the claim that the prosecution is
trying to prove.

The two possible outcomes of a jury trial are "guilty" or "not guilty." The jury does not convict the accused unless it
is certain beyond reasonable doubt that the accused is guilty. With insufficient evidence, the jury cannot conclude
that the accused truly is innocent. The jury simply declares that the accused is "not guilty."

Similarly, in a hypothesis test, if our evidence is not strong enough to reject the null hypothesis, then that does not
prove that the null hypothesis is true. We simply have failed to show it is false, and thus cannot reject it.

A hypothesis is a claim or assertion that can be tested. On the basis of a hypothesis test we either reject or leave
unchallenged a particular statement: the null hypothesis.

Alice promises Leo that the two of you will drop by his office first thing in the morning to test if Leo's survey results
support his claims that food and beverage spending patterns have changed.

Summary
We use hypothesis tests to substantiate a claim about a population mean. The null hypothesis states that the
population mean is equal to an initial value that is based on our experience or conventional wisdom. We test the
null hypothesis to learn if we should reject it in favor of our claim, the alternative hypothesis, which states that
the null hypothesis does not hold.

Single Population Means


The next morning, Leo explains the measures he has undertaken to increase customer spending on food and
beverages. "I'd like to see if they've had a discernable impact on my guests' restaurant-related spending patterns."

The Restaurant Revenue Problem


Last year, I made two major changes to restaurant operations: I brought in a new executive chef and renovated the
main cocktail lounge.

The chef introduced a new menu: a fusion of traditional Hawaiian and French cuisine. She put some elaborate
items on the menu, like that mango and brie tart I recommended to you. She also has offerings that cater to
simpler tastes. But the question is, have restaurant profits been affected by the new chef?

Since we set our food margins as a fixed percentage of food revenue, I know that if revenues have increased, profits
have increased too. Based on last year's consolidated reports, the average spending on food per person per day was
$55. I'm curious to see if that has changed.

In addition, I renovated the cocktail lounge. The old bar was designed poorly and used space inefficiently. Now
more guests can be seated in the lounge, and more seats have good views of the ocean.

I also invested in a large machine that makes a wide variety of frozen drinks. Pina coladas are very, very popular.

I hope my investments in the bar are paying off in terms of higher guest spending on drinks. Beverages have high
margins, but I'm not sure if beverage sales have increased enough to cover the investments.

Can we say, for beverages, as for food, that "changes in revenues" are a good proxy for "changes in profits"?

Absolutely. I set my profit margins as a fixed percentage of revenues for beverages as well. Last year, the average
spending on beverages per guest per day was $21.

Isn't that high?

Well, we have some very nice wines in our restaurants.

We don't have the consolidated report yet, but I've already had my staff choose a random sample of guests.

We pulled the restaurant and lounge receipts for the guests in the sample and noted three items: total food
revenues, total beverage revenues, and number of guests at the table. Using this information, we should be able to
estimate the daily spending on food and beverages per guest.

You look at Leo's data and wonder how you can discern whether Leo's changes — the new chef and the bar
renovations — have influenced the resort's profits.

Hypothesis Tests for Single Population Means


Leo has prepared data for you. How are you going to put it to use?

Our first type of hypothesis test is used to study population means. Let's walk through an example of this type of
test.

Suppose the manager of a movie theater implemented a new strategy at the beginning of the year: he started
showing old classics instead of recent releases.

He knows that prior to the change in strategy, average customer satisfaction was 6.7 out of a possible 10 points. He
would like to know if average customer satisfaction has changed since he altered his theater's artistic focus.

The manager's null hypothesis states that the current mean satisfaction has not changed; it is still 6.7. We use the
Greek letter mu to represent the current mean satisfaction rating of the theater's entire film-going population.

His alternative hypothesis is the opposite of the null hypothesis: it states that average customer satisfaction is now
different.

To substantiate his claim that the mean has changed, the manager takes a random sample of 196 moviegoers. He is
careful to sample across movies, show times, and dates. The mean satisfaction rating for the sample is 7.3, with a
standard deviation of 2.8.

Does the fact that the random sample's mean of 7.3 is higher than the historical mean of 6.7 indicate that this
year's moviegoers really are more satisfied?

Or, is the mean still the same, and the manager "just happened" to pick a sample with an unusually high average
satisfaction rating? This is equivalent to asking the question: If the null hypothesis is true — the average
satisfaction is still 6.7 — would we be likely to randomly draw the sample that we did, with average satisfaction 7.3?

To answer this question, we have to first define what we mean by "likely." As in sampling and estimation, we
typically use 95% as our threshold level of likelihood.

We then construct a range around the population mean specified by our null hypothesis. The range should be
drawn so that if the null hypothesis is true, 95% of all samples drawn from the population would fall in that range.
In other words, we create a range of likely sample means.

The central limit theorem tells us that the distribution of sample means follows a normal curve, so we can use its
familiar properties to find probabilities. Moreover, the distribution of sample means is centered at our assumed
population mean, mu, and has standard deviation sigma/sqrt(n). We don't know sigma, the underlying population
standard deviation, so we use the sample standard deviation as our best estimate.

As we do when constructing 95% confidence intervals, we create a range with width z*s/sqrt(n) = 1.96*s/sqrt(n) on
either side of the mean. However, when we conduct a hypothesis test, we center the range around the mean
specified in the null hypothesis because we always start a hypothesis test by assuming the null hypothesis is true.

In our example, the null hypothesis is that the population mean is 6.7, n is 196, and s is 2.8. Our 95% confidence
level translates into a z-value of 1.96. We construct the range of likely sample means: 6.7 ± 1.96*2.8/sqrt(196) = 6.7 ± 0.39, which runs from roughly 6.3 to 7.1.

This tells us that if the population mean is 6.7, there is a 95% chance that the mean of a randomly selected sample
will fall between 6.3 and 7.1.

Now, if we take a sample, and the mean does not fall within the range around 6.7, we can reject the null hypothesis.
Why? Because if the population mean were 6.7, it would be unlikely to collect a sample whose mean falls outside
this range.

The region outside the range of likely sample means is called the "rejection region," since we reject the null
hypothesis if our sample mean falls into it. In the movie theater example, the rejection region contains all values
less than 6.3 and all values greater than 7.1.
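
If you want to verify this range outside of Excel, here is a minimal Python sketch (our own illustration, not part of the course utilities) that reproduces the calculation; the variable names are ours.

    from statistics import NormalDist
    from math import sqrt

    mu_0 = 6.7      # null hypothesis mean
    s = 2.8         # sample standard deviation
    n = 196         # sample size
    x_bar = 7.3     # observed sample mean

    z = NormalDist().inv_cdf(0.975)           # 1.96 for a two-sided 95% test
    half_width = z * s / sqrt(n)              # 1.96 * 2.8 / 14 = 0.392
    low, high = mu_0 - half_width, mu_0 + half_width
    print(round(low, 1), round(high, 1))      # 6.3 7.1
    print(x_bar < low or x_bar > high)        # True: 7.3 falls in the rejection region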

In this example, the sample mean, 7.3, falls in the rejection region, so we reject the null hypothesis. Whenever we
reject the null hypothesis, we in effect accept the alternative hypothesis.

We conclude that customer satisfaction has indeed changed from the historical mean value of 6.7.

If our sample mean had fallen within the range around 6.7, we could not make a definite statement about
moviegoers' satisfaction. We would not have enough evidence to state that things have changed, but we can never
claim that they have definitely remained the same.

Unless we poll every customer, we'll never know for sure if customer satisfaction has truly changed. Working only
with sample data, there is always a chance that we'll draw the wrong conclusion about the population. We can go
wrong in two ways: rejecting a null hypothesis that is in fact true or failing to reject a null hypothesis that is in fact
false. Let's look at the first of these: the null hypothesis is true, but we reject it.

We choose the confidence level so it is unlikely — but not impossible — for the sample mean to fall in the
rejection region when the null hypothesis is true. In this case, we are using a 95% confidence level, so by unlikely
we mean a 5% chance. However, 5% of all samples from a population with the null hypothesis mean would fall in
the rejection region, so when we reject a null hypothesis, there is a 5% chance we will do so erroneously.

Therefore, when the sample mean falls in the rejection region, we can only be 95% confident that we are justified
in rejecting the null hypothesis. Hence we continue to speak of a confidence level of 95%.

A hypothesis test with a 95% confidence level is said to have a 5% level of significance. A 5% significance level says
that there is a 5% chance of a sample mean falling in the rejection region when the null hypothesis is true. This is
what people mean when they say that something is "statistically significant" at a 5% significance level.

If we increase our confidence level, we widen the range around the null hypothesis mean. At a 99% confidence
level, our range captures 99% of all sample means. This reduces to 1% our chance of rejecting the null hypothesis
erroneously. But doing this has a downside: by decreasing the chance of one type of error, we increase the chance
of the other type.

The higher the confidence level, the smaller the rejection region, and the less likely it is that we can reject the null
hypothesis when it is in fact false. This decreases our chance of being able to substantiate the alternative
hypothesis when it is true. As managers, we need to choose the confidence level of our test based on the relative
costs of making each type of error.

The range of likely sample means should not be confused with a confidence interval. Confidence intervals are
always constructed around sample means, never around population means. When we construct a confidence
interval, we don't even have an initial estimate of the population mean. Constructing a confidence interval is a
process for estimating the population mean, not for testing particular claims about that mean.

Summary
In a hypothesis test for population means, we assume that the null hypothesis is true. Then, we construct a range
of likely sample means around the null hypothesis mean. If the sample mean we collect falls in the rejection
region, we reject the null hypothesis. Otherwise, we cannot reject the null hypothesis. The confidence level
measures how confident we are that we are justified in rejecting the null hypothesis.

One-sided Hypothesis Tests


The movie theater manager did not have a strong conviction about the direction of change for customer
satisfaction prior to performing the hypothesis test.

He wanted to test for change in both directions — up or down — and thus he used a two-sided hypothesis test.
The null hypothesis — that no change has taken place — could have been wrong in either of two ways: Customer
satisfaction may have increased or decreased. The two-tailed nature of the test was reflected in the two-sided
range we drew around the population mean.

Sometimes, we may want to know if the actual population mean differs from our initial value of the population
mean in a specific direction. For instance, if the theater manager were quite sure that satisfaction had not
decreased, he wouldn't have to test in that direction; rather, he'd only have to test for positive change.

In these cases, our alternative hypothesis should clearly state which direction of change we want to test for.
These kinds of tests are called one-sided hypothesis tests. Here, we substantiate the claim that the mean has
increased only if the sample mean is sufficiently higher than 6.7, so our rejection region extends only to the
right.

Let's outline how to formulate one- and two-sided tests. For a two-sided test we have an initial understanding of
the population: the population mean is equal to a specified initial value.

If we want to substantiate the claim that a population mean has changed, the null hypothesis should state that
the mean still equals that initial value. The alternative hypothesis should state that the mean does not equal that
initial value.

If we want to substantiate the claim that the actual population mean is greater than the initial value — the null hypothesis mean — then the null hypothesis should state that the population mean is at most that value. The alternative hypothesis states that the mean is greater than the null hypothesis mean, and the rejection region extends only to the right.

Likewise, if we want to substantiate the claim that a population mean is less than the initial value, the null
hypothesis should state that the mean is at least that initial value. The alternative hypothesis should state that
the mean is less than the null hypothesis mean, and the rejection region extends only to the left.

When we conduct a one-sided hypothesis test, we need to create a one-sided range of likely sample means.
Suppose the theater manager claims that satisfaction improved. As usual, he states the claim he wants to substantiate as his alternative hypothesis.

The 196-person sample has mean 7.3 and standard deviation 2.8. Does this sample provide sufficient evidence to
substantiate the claim that mean satisfaction increased? To find out, the manager creates a one-sided range: he
assumes the population mean is the null hypothesis mean, 6.7, and finds the range that contains the lower 95%
of all sample means.

To find this range, all he needs to do is calculate its upper bound: the value below which 95% of all sample means would fall.

To find out, we use what we know about the cumulative probability under the normal curve: a cumulative
probability of 95% corresponds to a z-value of 1.645.

z-table

Why is this different from the z-value for a two-sided test with a 95% confidence level? For a two-sided test, the
z-value corresponds to a 97.5% cumulative probability, since 2.5% of the probability is excluded from each tail.
For a one-sided test, the z-value corresponds to a 95% cumulative probability, since 5% of the probability is
excluded from the upper tail.

z-table

We now have all the information we need to find the upper bound on the range of likely sample means: 6.7 + 1.645*2.8/sqrt(196) = 6.7 + 0.33, or approximately 7.0.

The rejection region is everything above the value 7.0. The sample mean falls in the rejection region, so the
manager rejects the null hypothesis. He is confident that customer satisfaction is higher.
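
A similar illustrative Python sketch (again our own, not the course's utility) reproduces the one-sided bound:

    from statistics import NormalDist
    from math import sqrt

    mu_0, s, n, x_bar = 6.7, 2.8, 196, 7.3
    z = NormalDist().inv_cdf(0.95)            # 1.645 for a one-sided 95% test
    upper_bound = mu_0 + z * s / sqrt(n)      # 6.7 + 1.645 * 0.2
    print(round(upper_bound, 2))              # 7.03, roughly 7.0
    print(x_bar > upper_bound)                # True: reject the null hypothesis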

Summary
When we want to test for change in a specific direction, we use a one-sided test. Instead of finding a range
containing 95% of all sample means centered at the null hypothesis mean, we find a one-sided range. We
calculate its endpoint using the cumulative probability under the normal curve.

Excel Utility (Single Populations)


The Excel Utility link below allows you to perform hypothesis tests for single populations.

Make sure you do at least one example by hand to ensure you thoroughly understand the basic concepts before
using the utility. You should enter data only in the yellow input areas of the utility. To ensure you are using the
utility correctly, try to reproduce the results for the theater manager's example.

Excel Utility for Single populations

Solving the Restaurant Revenue Problem


A single-population hypothesis test tests a claim using a sample from a single population. With a plan in mind, you
take a look at Leo's sample data.

You are ready to analyze the impact of the changes Leo has made to his restaurant operations. You draw a table to
organize the data from your sample on daily guest spending on restaurant food.

One change Leo made to his restaurant operations was to hire a new chef. He wants to know whether average
restaurant spending per guest has changed since she took over the menu and the kitchen. This is a clear case for a
hypothesis test.

Last year's average spending on food per person was $55; this gives you an initial value for the mean.

Leo wants to know if mean spending has changed, so you use a two-sided test. You jot down your null hypothesis,
which states that the average revenue per guest is still $55.

If the null hypothesis is true, the difference between the sample mean of $64 and the initial value of $55 can be
accounted for by chance.

You add the alternative hypothesis to your notes.

Next, you assume that the null hypothesis is true: the population mean is $55. Now you need to construct a range
of likely sample means around $55 and ask: does the sample mean of $64 fall within that range? Or does it fall in
the rejection region?

Leo didn't specify what level of confidence he wanted for your results. You call him for clarification.

I suppose a 95% confidence level is okay. I'd like to be more confident, of course.

After you point out that higher confidence would reduce his chances of being able to substantiate a change in
spending if a change has taken place, he agrees to 95%. You pull out your trusty calculator and get ready to compute a range around the null hypothesis mean of $55. Consulting your notes, you find the correct formula:

You find the range containing 95% of all sample means. Its endpoints are:

z-table
Utility for Single Populations

You pause for a moment to reflect on the interpretation of this range. Suppose the null hypothesis is true. Then 19 out of 20 samples of this size from the population of hotel guests would have means that fall within the calculated range.

The sample mean of $64 falls outside of this range. You and Alice report your results to Leo.

Looks like hiring that chef was a good decision. The evidence suggests that mean spending per person has
increased.

I'm glad to hear it. Now what about renovating the bar? Can you run a similar test to see if that has affected
average beverage spending?

Leo emphasizes that he can't imagine that his investments in the bar could have reduced average beverage
spending per guest. He wants to know if spending has gone up. You decide to do a one-sided test.

First, you write down all of Leo's data, along with the hypotheses:

You need to find an upper bound such that 95% of all sample means are smaller than it. To do so, you use a z-value
of 1.645. The upper bound is $24.29.

What is the correct interpretation of this number? Given that the null hypothesis is true,

The range of likely sample means contains the collected sample mean of $24. This tells you that:

Presenting your full report to Leo, he appears confused and disappointed.

How is this possible? Why hasn't renovating the bar increased revenues? Even if the frozen drink machine didn't
pay off, shouldn't the increase in seats have helped?

First of all, we haven't concluded that average revenue has not increased. We just can't be sure that it has. The fact
that our sample mean is $24 vs. $21 last year does not allow us to say anything definitive about the change in
average beverage revenue.

Remember, we set out to substantiate our hypothesis that spending has improved. Based just on this sample, we
are unable to conclude that spending has increased.

You added seats and now more people can be seated in your lounge. But a greater number of guests does not
necessarily translate into more spending per person.

That does make a lot of sense.

Your overall revenues may have actually increased, because more guests can be seated in the lounge.

Gosh, I'm glad to hear that. For a moment there, I thought I had made a really bad investment. I'm quite optimistic
I'll see a jump in total beverage revenues in the consolidated report at the end of the year.

Why don't we go fill three of those new seats right now?

Exercise 1: Oma's Pretzels


Blanche McCarthy is the marketing director of Oma's Own snack food company. Oma's makes toasted pretzel
snacks, and advertises that these pretzels contain an average of 112 calories per serving.

In a recent test, an independent consumer research organization conducted an experiment to see if this claim
was true. Somewhat to their surprise, the researchers found that the average calorie content in a sample of 32
bags was 102 calories per serving. The standard deviation of the sample was 19.

Blanche would like to know if the calorie content of Oma's pretzels really has changed, so she can market them
appropriately. With 99% confidence, do these data indicate that the pretzels' calorie content has changed?

z-table
Utility for Single Populations

You begin any hypothesis test by formulating a null and an alternative hypothesis. The null hypothesis states
that the population mean is equal to the initial value. In this problem, the null hypothesis is that the caloric
content in the actual population is what Oma's has always advertised.

The alternative hypothesis should contradict the null hypothesis. For a two-sided test, the alternative hypothesis
simply states that the mean does not equal the initial value. A two-sided test is more appropriate in this
problem, since Blanche only wants to know if the mean calorie content has changed.

You assume that the null hypothesis is true and construct a range of likely sample means around the population
mean. Using the data and the appropriate formula, you find the range [103, 121].

The sample mean of 102 falls outside of that range, so you can reject the null hypothesis. Blanche can be 99%
confident that the population mean is not 112.
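
For readers checking the numbers in code, here is a short illustrative Python sketch of this calculation (our own; variable names are ours):

    from statistics import NormalDist
    from math import sqrt

    mu_0, s, n, x_bar = 112, 19, 32, 102
    z = NormalDist().inv_cdf(0.995)           # about 2.58 for a two-sided 99% test
    half_width = z * s / sqrt(n)              # about 8.7
    print(round(mu_0 - half_width), round(mu_0 + half_width))   # 103 121
    print(x_bar < mu_0 - half_width)          # True: 102 falls in the rejection region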

Why might Blanche have chosen a 99% confidence level rather than the more typical 95% level for her test?

z-table
Utility for Single Populations

Exercise 2: The Clearwater Power Company


The Clearwater Power Company produces electrical power from coal. A local environmental group claims that
Clearwater's emissions have raised sulfur dioxide levels above permissible standards in Blue Sky, the town
downwind of the plant.

According to Environmental Protection Agency standards, an acceptable average sulfur dioxide level is 30 parts
per billion (ppb). As Clearwater's PR consultant, you want to defend the company, and you try to anticipate the
environmentalist's argument.

The environmental group collects 36 samples on randomly selected days over the course of a year. It finds a
mean sulfur dioxide content of 35 ppb with a standard deviation of 24 ppb.

The environmental group will use a hypothesis test to back up its claim that the sulfur dioxide levels are
higher than permitted. Which of the following is an appropriate null hypothesis for this problem?

z-table
Utility for Single Populations

The environmentalists' claim is that sulfur dioxide levels are higher, so they will want to run a one-sided test.
The alternative hypothesis states that the sulfur dioxide levels are above the accepted standard. We assume they
will choose a 95% confidence level.

What is the range of likely sample means?

z-table
Utility for Single Populations

They calculate the one-sided range around the null hypothesis mean that contains 95% of all sample means. The z-
value for a one-sided 95% range is 1.645. The upper bound on the range of likely sample means is 36.58 ppb.
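
A brief illustrative Python check of these numbers (our own sketch, not part of the exercise):

    from statistics import NormalDist
    from math import sqrt

    mu_0, s, n, x_bar = 30, 24, 36, 35
    z = NormalDist().inv_cdf(0.95)            # 1.645
    upper_bound = mu_0 + z * s / sqrt(n)      # 30 + 1.645 * 4
    print(round(upper_bound, 2))              # 36.58
    print(x_bar > upper_bound)                # False: 35 does not fall in the rejection region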

Based on your calculations, what should you do?

z-table
Utility for Single Populations

Exercise 3: Neshey's Smooches


You are the plant manager of a Neshey's chocolate factory. The shop was flooded during the recent storms. The
machine that wraps Neshey's popular chocolate confection, Smooches, still works, but you are afraid it may not
be working at its former capacity.

If the machine isn't working at top capacity, you will need to have it replaced.

Which type of hypothesis test is most appropriate for this problem?

The hourly output of the machine is normally distributed. Before the flood, the machine wrapped an average of
340 Smooches per hour. Over the first week after the flood, you counted wrapped Smooches during 32 randomly
selected one-hour periods. The machine averaged 318 Smooches per hour, with a standard deviation of 44.

You conduct a one-sided hypothesis test using a 95% confidence level. According to your calculations, what
should you do?

The null hypothesis is that µ ≥ 340. The alternative hypothesis is that µ < 340 since you are using a one-tail test
and you are assuming that the new population mean is lower than the population mean before the flood.

Identify the relevant values. The sample size n=32. The standard deviation s=44. The appropriate z-value is
1.645 if you want to capture 95% of all sample means in a one-sided range around the null hypothesis mean.

Use the formula and calculate the lower bound, 327. The sample mean of 318 falls well outside of the calculated
range of likely sample means. You accept this as strong evidence against the null hypothesis, substantiating the
alternative hypothesis that the mean output rate has dropped. You should replace the machine.
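
An illustrative Python check of the lower bound (our own sketch):

    from statistics import NormalDist
    from math import sqrt

    mu_0, s, n, x_bar = 340, 44, 32, 318
    z = NormalDist().inv_cdf(0.95)            # 1.645
    lower_bound = mu_0 - z * s / sqrt(n)      # 340 - 1.645 * 7.78
    print(round(lower_bound))                 # 327
    print(x_bar < lower_bound)                # True: reject the null hypothesis and replace the machine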

Single Population Proportions


Happy with your work on restaurant spending, Leo jumps right into the next problem. "It's not just the revenue of
the restaurants that I care about," Leo says, "It's also my guests' satisfaction with their restaurant experience."

The Restaurant Ambiance Problem


When I go out to eat, I expect more than just excellent food. The whole dining experience is essential — everything
from the service, to the décor, to the design and quality of the silverware.

And it's not just that all of these factors must be excellent individually — they have to fit together. The restaurant
has to have ambiance! I'm sure my guests have similar expectations, and I want to be sure my restaurant meets
them.

Since my new chef introduced more sophisticated cuisine, I made some changes to the décor that I think have
improved the ambiance.

It took me a long time and a substantial amount of money to get everything right, but I'm pleased with the result:
the restaurants are elegant and distinctly Hawaiian. Just like the new chef's cuisine.

In the past, I've contracted a local market research firm to conduct surveys, asking guests to rate the Kahana's
restaurants' ambiance on a scale of one to five.

Historically, the percentage of people that rated ambiance the top score of 5 gave me a good idea of how well we
were doing. That percentage has been very high: 72%.

I've collected this year's data for you. Can you figure out if my guests are happier with my restaurants' ambiance?

Hypothesis Tests for Single Population Proportions


Alice tells you that testing Leo's claim about a proportion will be very similar to testing a mean.

Often the summary statistic we want to make a claim about is a proportion. How do we test a hypothesis about a
population proportion instead of a population mean?

We know from our work with confidence intervals that the processes for estimating population proportions and
population means are virtually identical. Similarly, hypothesis tests for proportions are much like hypothesis tests
for means.

Because we are examining a population proportion instead of a population mean, we use slightly different
notation: we use a lower case p to represent the population proportion in place of µ for a population mean. We
construct a hypothesis test to test a claim about the value of p.

Again, we formulate null and alternative hypotheses. Based on conventional wisdom or past experience, we have
an initial understanding of the population proportion. The null hypothesis for a proportion test states the initial
understanding. For example, in a two-sided test, the null hypothesis asserts that the population proportion, p, is
equal to the initial value we had in mind.

The alternative hypothesis is the claim we are using the hypothesis test to substantiate. The alternative hypothesis
typically states the opposite of the null hypothesis: it states that our initial understanding is incorrect.

As with population means, we collect a random sample and calculate the sample proportion, "p-bar." However, for
a hypothesis test about a population proportion, we don't need to calculate a standard deviation from the sample.

Statistical theory tells us that σ, the standard deviation of the underlying population, is the square root of [p*(1 -
p)]. Since we always start the test assuming the null hypothesis is true, we will calculate σ using the null
hypothesis proportion.

Analogously to population mean tests, we create a range of likely sample proportions around the null hypothesis
proportion. To create the range, we substitute for σ, the standard deviation of the underlying null hypothesis
population.

If our sample proportion falls outside the range of likely sample proportions, we reject the null hypothesis.
Otherwise, we cannot reject the null hypothesis.

Summary
In a hypothesis test for population proportions, we assume that the null hypothesis is true. Then, we construct a
range of likely sample proportions around the null hypothesis proportion. If the sample proportion we collect
falls in the rejection region, we reject the null hypothesis. Otherwise, we cannot reject the null hypothesis.

Solving the Restaurant Ambiance Problem


Once you understand hypothesis testing for means, using the same techniques on proportions is easy.

By now, you're familiar with the concept of testing a hypothesis. You recognize that Leo's restaurant ambiance
problem calls for a hypothesis test for a population proportion.

Leo wants you to find out if the proportion of his guests that rate restaurant ambiance "excellent" has increased.
Historically, that population proportion has been 0.72. Since Leo wants to see if there has been positive change,
you do a one-sided test.

The appropriate pair of hypotheses is:

You are doing a one-sided test to see if the proportion of guests rating the restaurant "excellent" has increased. The
alternative hypothesis states that the proportion has increased, and the null hypothesis states that it has not
increased.

You look at Leo's data. The sample proportion is 0.81 and the sample size is 126.

But what about the standard deviation?

z-table
Utility for Single Populations

Here's how you find the standard deviation for a proportion problem: σ is the square root of [p*(1 - p)], evaluated at the null hypothesis proportion.

Using this formula, you calculate the standard deviation to be the square root of [0.72*(1 - 0.72)], which is about 0.45.

Leo wanted you to use a 95% confidence level. Now you're ready to construct a range of likely sample proportions around the null hypothesis value of the population proportion: 0.72.

Find the range of likely sample proportions around the null hypothesis proportion, and formulate a short answer
for Leo.

z-table
Utility for Single Populations

A one-sided test calls for a one-sided range of likely sample proportions. You need to find the upper bound for this
range such that the range captures the lower 95% of the sample proportions.

The z-value for a one-sided 95% confidence level is 1.645. Substitute the null hypothesis proportion, 0.72, for p.
The upper bound for the range containing the lower 95% of all sample proportions is 0.78.

Since the sample proportion 0.81 falls in the rejection region, you reject the null hypothesis. The data provide sufficient evidence that the population proportion has, in fact, increased.
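
An illustrative Python sketch of this calculation (our own; carrying full precision gives an upper bound of about 0.786, which the text rounds to 0.78, and the conclusion is the same either way):

    from statistics import NormalDist
    from math import sqrt

    p_0, p_bar, n = 0.72, 0.81, 126
    sigma = sqrt(p_0 * (1 - p_0))             # about 0.45
    z = NormalDist().inv_cdf(0.95)            # 1.645 for a one-sided 95% test
    upper_bound = p_0 + z * sigma / sqrt(n)
    print(round(upper_bound, 3))              # about 0.786
    print(p_bar > upper_bound)                # True: 0.81 falls in the rejection region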

Alice presents your findings to Leo, telling him that with 95% confidence, the data you collected indicate that the
difference between the historical population proportion and the proportion of the random sample is not due to
chance.

The proportion of your guests that rate the restaurants' ambiance as "excellent" has increased.

Just what I wanted to hear! Thanks, you two.

Exercise 1: The Ventura Insurance Company


Luther Lenya, the new product guru of The Ventura Automotive Insurance Company, is considering marketing a
special insurance package to members of certain professional groups.

In particular, Luther wants to create a special package for health professionals.

To find out what rate to charge for this package, Luther conducts a preliminary study to see if health
professionals are less likely to be involved in car accidents than the rest of his customer base.

If the data indicate that health professionals are less likely to be involved in car accidents, then Ventura can offer
health professionals a lower, more competitive rate.

In the past 5 years, 8.3% of Ventura's customers have been involved in accidents. Which of the following is the
correct pair of hypotheses for solving Luther's problem?

A sample of 240 customers in the health profession reveals that 12 (5.0%) have had accidents.

If he uses a 95% confidence level, which of the following is the best conclusion Luther can come to?

z-table
Utility for Single Populations

You need to find a range of likely sample proportions. To find this range, you calculate a standard deviation. The
standard deviation is 0.28.

For a one-sided test, a confidence level of 95% corresponds to a z-value of 1.645. The lower bound of this range is
0.054 = 5.4%.

The range of likely sample proportions does not contain 5.0%, so you should reject the null hypothesis. With
95% confidence, the proportion of health professionals involved in car accidents is lower than the proportion of
the overall population of drivers.
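
An illustrative Python check of these figures (our own sketch):

    from statistics import NormalDist
    from math import sqrt

    p_0, p_bar, n = 0.083, 0.05, 240
    sigma = sqrt(p_0 * (1 - p_0))             # about 0.28
    z = NormalDist().inv_cdf(0.95)            # 1.645
    lower_bound = p_0 - z * sigma / sqrt(n)
    print(round(lower_bound, 3))              # 0.054
    print(p_bar < lower_bound)                # True: reject the null hypothesis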

P-Values
After sleeping over your analysis of restaurant operations, Leo seems unsatisfied.

Leo Demands a Deeper Understanding


Don't get me wrong, I appreciate your hard work. But look here: these hypothesis tests result in a "reject/don't
reject" decision. If I understand you correctly, it doesn't matter how close to the border of the rejection region our
sample statistic falls: "reject" is "reject."

But can't you tell me more? I want to know how strong the evidence against the null hypothesis is, not just if it is
strong enough.

I'm glad you brought that issue up, Leo. We have a second method of doing hypothesis tests, one that provides a
measure of the strength of the evidence.

P-Values
The evening before, Alice had acquainted you with p-values: "We can use the p-value method of hypothesis testing
to make 'reject/not reject' decisions in the same way we have been doing all along. But the p-value also measures
the strength of evidence against a null hypothesis."

In hypothesis tests we've done so far, we first chose the confidence level of the test. The confidence level tells us the
significance level of the test, which is simply 1 minus the confidence level.

Typically, we chose a 5% significance level — a 95% confidence level — as our threshold value for rejection.
Assuming that the null hypothesis is true, we reasoned that certain sample mean values are less likely to appear
than others. If the mean of the sample we collected was sufficiently unlikely to appear (that is, less than 5%
likely), we considered the null hypothesis implausible and rejected it.

Now, rather than simply checking whether the likelihood of collecting our sample is above or below our chosen
threshold, we'll ask: if the null hypothesis is true, how likely is it to choose a sample with a mean at least as far
from the null hypothesis mean as the sample mean we collected?

The "p-value" measures this likelihood: it tells us how likely it is to collect a sample mean that falls at least a
certain distance from the null hypothesis mean.

In the familiar hypothesis testing procedure, if the p-value is less than our threshold of 5%, we reject our null
hypothesis.

The p-value does more than simply answer the question of whether or not we can reject the hypothesis. It also
indicates the strength of the evidence for rejecting the null hypothesis. For example, if the p-value is 0.049, we
barely have enough evidence to reject the null hypothesis at the 0.05 level of significance; if it is 0.001, we have
strong evidence for rejecting the null hypothesis.

Let's look at an example. Recall the movie theater manager who wanted to know if the average satisfaction rate for
his clientele had changed from its historical rate of 6.7.

To find out, we constructed the range, 6.3 to 7.1, which would have contained 95% of the sample means if the null
hypothesis mean had still been true. Since the mean of the sample of current moviegoers we collected, 7.3, fell
outside of that range, we rejected the null hypothesis.

Because 7.3 fell in the rejection region, we know that the likelihood of collecting a sample mean as extreme as 7.3 is
less than 5% if the null hypothesis is true. Now let's find out exactly how unlikely it is by calculating the p-value.

Calculating the p-value is a little tricky, but we have all the tools we need to do it. Recall that for samples of
sufficient size, the sample means of any population are distributed normally.

To calculate the likelihood of a certain range of sample mean values — in our example, sample mean values greater
than 7.3 or less than 6.1 — we just need to find the appropriate area under the distribution curve of the sample
means.

To calculate the p-value for this two-sided test, we want to find the area under the normal curve to the right of 7.3
and to the left of 6.1. The standard deviation in this example is 2.8, and the sample size is 196.

We can calculate this probability by first calculating the z-value associated with the value 7.3. That z-value is 3.

Then, we find the probability of having a z-value less than -3 or greater than 3.

The area to the left of the z-value of -3 is 0.00135. The area to the right of the z-value of +3 is the same size, so the
total area is 0.0027. That is our p-value. These areas and the p-value can be found in Excel using the
NORMSDIST(-3) function, in the z-table, or with the Excel utility provided.

Excel
z-table
Utility for Single Populations
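
Outside of Excel, the same p-value can be computed with a few lines of Python; this sketch is our own illustration using the standard library's normal distribution:

    from statistics import NormalDist
    from math import sqrt

    mu_0, s, n, x_bar = 6.7, 2.8, 196, 7.3
    z = (x_bar - mu_0) / (s / sqrt(n))             # (7.3 - 6.7) / 0.2 = 3.0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided: both tails
    print(round(p_value, 4))                       # 0.0027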

Our p-value calculation tells us that the probability of collecting a sample mean at least as far from 6.7 as 7.3 is
0.0027. The p-value is lower than 0.05. Thus, at a significance level of 0.05, we would reject the null hypothesis
and conclude that moviegoers' average satisfaction rating is no longer 6.7.

Excel
z-table
Utility for Single Populations

But the p-value 0.0027 is much smaller than 0.05. Thus, we can reject the null hypothesis at 0.0027, a much lower
significance level. In other words, we can reject the null hypothesis with 99.73% confidence. In general, the lower
the p-value, the higher our confidence in rejecting the null hypothesis.

One-sided hypothesis tests are also easily conducted with p-values. For one-sided tests, the p-value is the area
under one side of the curve. In our movie theater example, if the alternative hypothesis states that the population
mean is larger than 6.7, the p-value is the area under the normal curve to the right of the sample mean of 7.3.

Summary
The p-value measures the strength of the evidence against the null hypothesis. It is the likelihood, assuming that
the null hypothesis is true, of collecting a sample mean at least as far from the null hypothesis mean as the
sample actually collected. We compare the p-value to the threshold significance level to make a reject/not reject
decision. The p-value also tells us how comfortable we can be with that decision.

Solving the Restaurant Revenue Problem (Part II)


Now Alice explains the basics of p-values to Leo, so you can present the results of your restaurant revenue
hypothesis test again. This time, you'll be able to give Leo an idea of how strong the statistical evidence is.

Leo wants you to complete the p-value hypothesis test right there in his office. You're a little nervous — you've
never had a client peering over your shoulder when you work. But you oblige him, because you're growing more
confident of your statistical skills.

Looking back at your notes on the problem, you find the data and the hypotheses. You make a mental note that you
are doing a two-sided test to see whether or not average spending on food has changed from its historical level of
$55.

An eager Leo interrupts your thought process:

When you ran the hypothesis test earlier, I had you use a 95% confidence level. That corresponds to a significance
level of 0.5, right?

You politely respond:

To find the significance level corresponding to a confidence level of 95%, simply subtract 95% from 100%, and
convert into decimal notation: 0.05.

After you clarify Leo's mistake, he sits back and lets you finish your analysis without further interruption. First,
you find the appropriate z-value. Enter the z-value as a decimal number with 2 digits to the right of the decimal
(e.g., enter "5" as "5.00"). Round if necessary.

Excel
z-table

The correct z-value is 3.00, corresponding to a right-tail probability of 0.00135. You:

z-table
Utility for Single Populations

Your sample has a mean (x-bar = $64) that is $9 higher than the assumed population mean, $55. You want to
calculate the likelihood of getting a sample mean that is at least as far from the population mean as x-bar.

That likelihood is not just the tail probability to the right of the sample mean. Sample means on the other side of
the normal curve are just as far from the population mean as x-bar. They must be included, too, when you calculate
the p-value for a two-sided hypothesis test.

Doubling the right-tail probability gives you the correct p-value: 0.0027.

Alice summarizes your results for Leo.



All we have to do is compare the p-value to the significance level. The p-value 0.0027 is less than the significance
level 0.05. Our data are statistically significant at the 0.05 level.

Just as we calculated earlier by constructing a range around the null hypothesis mean, the p-value method
suggests that we reject the null hypothesis. With 95% confidence, average food spending per guest has changed.

But now we can also see that the evidence is very strong, because the p-value is much lower than the significance
level. We can claim that food spending has changed at the 0.0027 level of significance.

Thanks, you two. I feel much more comfortable concluding that average guest spending in my restaurant has changed.

Exercise 1: Oma's Revisited


In the following exercise you will revisit an earlier problem, this time solving it with the p-value method.

Blanche McCarthy is the marketing director of Oma's Own snack food company. Oma's makes toasted pretzel
snacks. Each bag of pretzels contains one serving, and Oma's advertises that the pretzel snacks contain an
average of 112 calories per serving.

In a recent test, an independent consumer research organization conducted an experiment to see if this claim
was true. The researchers found that the average calorie content in a sample of 32 bags was 102 calories per
serving. The standard deviation of the sample was 19.

Blanche would like to know if the calorie content of Oma's pretzels has really changed, so she can market them
appropriately. At the significance level 0.01, do these data indicate that the pretzels' calorie content has
changed?

Utility for Single Populations


Excel
z-table

In this problem, the null hypothesis is that the actual population mean is what Oma's has always advertised. A
two-sided test is more appropriate in this problem, since Blanche only wants to know if the mean calorie content
has changed.

Assuming that the null hypothesis is true, you find a z-value for the sample mean of 102 using the appropriate
formula. The z-value is -2.98.

Using the Excel NORMSDIST function or the Standard Normal Table, you can find the corresponding left-tail
probability of 0.0014. For a two-sided test, you double this number to find the p-value, in this case 0.0028.

Since this p-value is less than the significance level, you can reject the null hypothesis. Moreover, you now can
say that you are rejecting the null hypothesis at the 0.0028 level of significance. You can recommend to Blanche
that she have the labeling changed on the pretzel bags, and adjust her marketing accordingly.
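
An illustrative Python check of this p-value (our own sketch; the small difference from 0.0028 comes from rounding the tabled tail probability to 0.0014 before doubling):

    from statistics import NormalDist
    from math import sqrt

    mu_0, s, n, x_bar = 112, 19, 32, 102
    z = (x_bar - mu_0) / (s / sqrt(n))        # about -2.98
    p_value = 2 * NormalDist().cdf(z)         # two-sided: double the left tail
    print(round(z, 2), round(p_value, 4))     # -2.98 0.0029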

Exercise 2: Neshey's Revisited


In the following challenge you will revisit an earlier problem, this time solving it with the p-value method.

You are the plant manager of a Neshey's chocolate factory. The shop was flooded during the recent storms. The
machine that wraps Neshey's popular chocolate confection, Smooches, still works, but you are afraid it may not
be working at its former capacity.

If the machine isn't working at top capacity, you will need to have it replaced.

The hourly output of the machine is normally distributed. Before the flood, the machine wrapped an average of
340 Smooches per hour. Over the first week after the flood, you counted wrapped Smooches during 32 randomly
selected one-hour periods. The machine averaged 318 Smooches per hour, with a standard deviation of 44.

You conduct a one-sided hypothesis test using a 95% confidence level. According to your calculations, what
should you do?

Utility for Single Populations


Excel
z-table

The null hypothesis is that the population mean is no lower than 340. The alternative hypothesis is that the
population mean is now less than 340. You are using a one-tailed test, and you are assuming that the new
population mean is lower than the population mean before the flood.

Identify the values of the relevant quantities. Use the appropriate formula and calculate the z-value. The z-value
is -2.83.

This z-value corresponds to a left-tail probability of 0.0023. This is the tail you are interested in, since you are
conducting a one-sided test to see if the actual population mean is less than it was in the past. This tail
probability is the p-value.

Since the p-value is less than the significance level, you reject the null hypothesis that the population mean is
unchanged. Moreover, you now can say that you are rejecting the null hypothesis at the 0.0023 level of
significance. You should replace the machine.
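
An illustrative Python check of this p-value (our own sketch):

    from statistics import NormalDist
    from math import sqrt

    mu_0, s, n, x_bar = 340, 44, 32, 318
    z = (x_bar - mu_0) / (s / sqrt(n))        # about -2.83
    p_value = NormalDist().cdf(z)             # one-sided: left tail only
    print(round(z, 2), round(p_value, 4))     # -2.83 0.0023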

Comparing Two Populations


Now satisfied with your analysis of the restaurant, Leo asks you to compare the discretionary spending habits of two
categories of guests: leisure and business.

Leisure Guests vs. Business Guests: Who spends more?


Every hotel manager wrestles with the problem of stretching limited marketing resources. I want to make sure that
I'm wisely allocating each marketing dollar.

Leisure guests, such as tourists and honeymooners, are especially attracted to Hawaii. Also, many professional
associations like to have their conventions here, so our islands attract business travelers, who mix business and
pleasure.

Business travelers pay lower room prices because conferences book rooms in bulk. Bulk reservations are good for
me because they keep my occupancy levels high.

However, I don't have a good sense of whether the discretionary spending of my business guests is different from
that of my leisure guests: they may take fewer scuba lessons but use the spa services more, for example.

Can you help me figure out whether there is any significant difference between leisure and business travelers'
discretionary spending habits? Your conclusions might influence my marketing efforts.

I collected two random samples: one of leisure guests and one of business guests. Not including room, meal, and beverage charges, leisure travelers spent an average of $75 a day, compared to $64 a day for the business travelers.

I knew that the difference between the two averages of the two samples could be due to chance, so I thought I'd
have you do a hypothesis test to find out.

When I was compiling the data for you, I realized that my samples were of different sizes. I was able to get 85
leisure guests to respond, but only 76 business guests returned my survey.

Which figure will you use as the sample size? Or will you add them together?

I also realized that with these data, you'd have to calculate two sample standard deviations, one for each sample.
How do you go about solving a problem like this?

Using Hypothesis Tests to Compare Two Population Means


How do you test whether two populations have different means?

So far, we've used hypothesis tests to study the mean or proportion of a single population. Often, managers want
to compare the means or proportions of two different populations: in this case, we use a two-population
hypothesis test. Let's clarify when we use each type of test.

We conduct single-population tests when we have an initial value for a population mean and want to test to see if
it is correct. Single population tests are especially useful when we suspect that the population mean has changed.
For example, we use a single-population test when we know the historical average of a population and want to
test whether that historical average has changed.

We conduct two-population tests to compare a characteristic of two groups for which we have access to sample
data for each group. For example, we'd use a two-population test to study which of two educational software
packages better prepares students for the GMAT. Do the students using package 1 perform better on the GMAT
than the students using package 2?

In two-population tests, we take two samples, one from each population. For each sample, we calculate the
sample mean, standard deviation, and sample size.

We can then use the two sets of sample data to test claims about differences between the two populations. For
example, when we want to know whether two populations have different means, we formulate a null hypothesis
stating that the means are not different: the first population mean is equal to the second.

Let's look at the GMAT software package example more closely. The manager of one educational software
company might wonder if the average GMAT score of students using her software is different from the average
GMAT score of students using the competitor's software.

Since the manager only wants to test if the average GMAT scores are different, she conducts a two-sided
hypothesis test for two populations. The null hypothesis states that there is no difference between the average
GMAT scores of the students who use the two companies' software.

The alternative hypothesis states that the average GMAT scores of the students who use the two companies'
software are different.

We denote the average scores of the two populations by the Greek letter mu and distinguish them with
subscripts. Our hypotheses are:

To be 95% confident in the result of the test, we use a significance level of 0.05.

We collect two samples, one from each population. We denote the sample means with the familiar x-bar, which
we again distinguish with subscripts.

We are able to collect the GMAT scores of 45 people who used the company's software, and 36 people who used
the competitor's software. As we will see shortly, the different sample sizes will not pose a problem.

The respective sample means are 650 and 630, and the standard deviations are 60 and 50.

Could the two random samples we picked just happen to have different means by chance, even though they really come from populations with the same mean?

The null hypothesis states that there is no difference in the two population means. As with single-population
tests, we test the null hypothesis by asking how likely it would be to produce the sample results if the null
hypothesis is in fact true.

That is, if the average GMAT scores for students using the two different software packages actually are the same,
what is the chance that two samples we collect would have sample means as different as 650 and 630?

Our intuition tells us that the greater the difference between the means of the two samples, the more likely it is
that the samples came from different populations. But how do we know when the numerical difference is large
enough to be statistically significant? When do we have enough evidence to actually conclude that the two
populations must be different?

We use p-values to answer this question. First, we calculate a z-value for the difference of the sample means,
incorporating the data from both populations. It looks a bit complicated: we take the difference between the two
sample means, subtract the difference stated by the null hypothesis, and divide by sqrt(s1^2/n1 + s2^2/n2).

Let's compute the z-value for our example. Since we assume that the null hypothesis is true, the difference stated
by the null hypothesis is zero, so the numerator is simply 650 - 630 = 20.

Using the formula, we find that the z-value is 20/sqrt(60^2/45 + 50^2/36), or about 1.64.

For a two-sided test, a z-value of 1.64 translates into a probability in one tail of 0.05, and thus a p-value of 0.10.

Since this p-value is greater than the significance level of 0.05, we cannot reject the null hypothesis.

In other words, the high p-value tells us that there is insufficient evidence from the two samples to conclude that
the average GMAT score of the students who use the company's software is different from the average GMAT
score of students who use the competitor's software.
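
As an illustration, here is a short Python sketch of this two-population calculation (our own, using the formula described above; variable names are ours):

    from statistics import NormalDist
    from math import sqrt

    x1, s1, n1 = 650, 60, 45   # students using the company's software
    x2, s2, n2 = 630, 50, 36   # students using the competitor's software

    z = (x1 - x2) / sqrt(s1**2 / n1 + s2**2 / n2)   # about 1.64
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))    # two-sided p-value
    print(round(z, 2), round(p_value, 2))           # 1.64 0.1
    print(p_value < 0.05)                           # False: cannot reject the null hypothesis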

Two-population hypothesis tests can be performed using the formula shown on the previous page, or you can
click here to access the Excel utility for hypothesis testing.

Summary
In a hypothesis test for two population means, we assume a null hypothesis: that the two population means
are equal. We collect a sample from each population and calculate its sample statistics. We calculate a p-value
for the difference between the two samples. If the p-value is less than the significance level, we reject the null
hypothesis.

Hypothesis Tests for Two Population Proportions


Often, managers want to know if two population proportions are equal. For example, a marketing manager of
a packaged snack foods company might want to compare the snack food habits of different states in the US.

The marketing manager might think that the proportion of consumers who favor potato chips in Texas is
different from the proportion of consumers who favor potato chips in Oklahoma.

Comparing two population proportions is similar to comparing two population means. We have two
populations: the null hypothesis states that their proportions are the same; the alternative hypothesis states
that they are different.

We collect a sample from each population and calculate its sample size and sample proportion. As in the single
population proportion test, we don't need to find the sample standard deviation, since we know that the
population standard deviation is the square root of
[p*(1 - p)].

Similarly to the hypothesis tests for comparing two population means, we calculate a z-value for the difference
between the proportions: we take the difference between the two sample proportions, subtract the difference stated
by the null hypothesis, and divide by the square root of [p-bar1*(1 - p-bar1)/n1 + p-bar2*(1 - p-bar2)/n2].

We translate the z-value into a p-value just as we would for any other type of hypothesis test. If the p-value is
less than our significance level, we reject the null hypothesis and conclude that the proportions are different. If
the p-value is greater than the significance level, we do not reject the null hypothesis.

Optional Example
Let's take a closer look at the study of snacking habits in Texas and Oklahoma.

The manager does not wish to test for a particular direction of difference; he just wants to know if the
proportions are different. Thus, he should use a two-sided test.

The marketing manager wants to be 95% confident in the result of this test, so the significance level is 0.05.

Suppose we collect responses from 400 people in Texas and 225 people in Oklahoma. The sample
proportions are 45% and 35%, respectively.

Could the two random samples we picked just happen to have different sample proportions? That is, if the
true proportions of Texans and Oklahomans favoring potato chips actually are the same, what would be the
chance that the sample proportions are 45% and 35% respectively?

We use p-values to answer this question. First, we calculate a z-value for the difference of the sample
proportions that incorporates the data from both populations. The null hypothesis states that the
population proportions are equal, so their difference is 0.

The z-value is 2.48.

For a two-sided test, a z-value of 2.48 translates into a probability in one tail of 0.0065 and hence a p-value
of 0.013.

Since this p-value is less than the significance level of 0.05, we can reject the null hypothesis.

In other words, the low p-value tells us that there is sufficient evidence from the samples to conclude that
there is a difference between the proportions of Texan and Oklahoman potato chip lovers. We can make this
claim at a 0.013 level of significance.
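As a check on the arithmetic, here is a minimal Python sketch of this two-proportion z-test. The function name and structure are illustrative, not part of the course; it uses the unpooled standard error, which reproduces the z-value of 2.48 quoted above.

from math import sqrt
from scipy.stats import norm  # standard normal distribution

def two_proportion_z_test(p1, n1, p2, n2):
    # Two-sided z-test for the difference between two sample proportions
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # unpooled standard error
    z = (p1 - p2) / se
    p_value = 2 * norm.sf(abs(z))  # double the one-tail probability
    return z, p_value

# Texas vs. Oklahoma potato chip samples
z, p = two_proportion_z_test(0.45, 400, 0.35, 225)
print(round(z, 2), round(p, 3))  # 2.48 0.013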

Two-population hypothesis tests for population proportions can be performed using the formula shown in
the previous slide, or you can click here to access the Excel utility for hypothesis tests.

Summary
In a hypothesis test for two population proportions, we assume a null hypothesis: the two population
proportions are equal. We collect two samples and calculate the sample proportions. We calculate a p-value
for the difference between the sample proportions. If the p-value is less than the significance level, we reject
the null hypothesis.

Excel Utility (Two Populations)


Click here to open an Excel Utility that allows you to perform hypothesis tests for two populations.

Make sure you do at least one example by hand to ensure you thoroughly understand the basic concepts
before using the utility. You should enter data only in the yellow input areas of the utility. To ensure you are
using the utility correctly, try to reproduce the results for the GMAT and potato chip examples.

Solving the Leisure vs. Business Guest Spending Problem


Two-population hypothesis tests help you determine whether two populations have different means. You use a
two-population test to solve Leo's problem.

You have to find out if leisure guests' average daily discretionary spending is different from business guests'
average daily discretionary spending.

Leo has provided these data:

Now it's time to state the null hypothesis. The best formulation is:

You want to find out if the means of two populations — average spending by leisure guests vs. average spending
by business guests — are different. The two samples from those populations have different means: $75 and $64,
respectively.

The samples may come from populations with the same means, and the numerical difference is due to chance in
getting these particular samples. It could be that the first sample just happened to have a high mean and the
second sample just happened to have a lower mean.

You test the null hypothesis that the population means have the same value.


You make note of your null hypothesis and the corresponding alternative hypothesis. You use a two-sided test
because you don't have any reason to believe that one type of guest spends more than the other.

At Leo's request you do a p-value test using a significance level of 0.05. To calculate the p-value, you first find
the z-value.

Enter the z-value as a decimal number with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round
if necessary.

z-table
Utility for Two Populations

The z-value is +2.12 or -2.12, depending on how you set up the difference, i.e., in what order you subtract the
sample means. Either way, the final conclusion will be the same.

You use the z-value to calculate the p-value. The p-value is _______.

z-table
Utility for Two Populations

A z-value of 2.12 has a cumulative probability of 0.9830. You subtract this probability from the total probability,
1, for a right-tail probability of 0.017.

Because we are doing a two-sided test, we want to measure the probability of extreme values on both sides. Thus,
we double 0.017 to get a p-value of 0.034.

Since the p-value 0.034 is less than the significance level 0.05, you recommend to Leo that he reject the null
hypothesis. The average daily discretionary spending per person is different for leisure and business guests.

Leo reads your report:

I see. We can tell if two population means are different by running a hypothesis test on their difference. We test
the null hypothesis that there is no difference.

As you see, the p-value is less than the significance level. This tells you that...

...the difference in the two sample means is probably not due to chance. The spending habits of the two types of
guests are different. Got it.

Exercise 1: The Burger Baron


Claude Forbes is a regional manager of The Burger Baron restaurant chain. The Baron would like to open a
franchise in Sappington. Two properties are up for sale, and Claude wants to choose the location that has the
most traffic.

On 54 randomly selected days over a six-month period, an average of 92 cars passed by location A during the
lunch hour, from noon to 1 p.m. On 62 randomly selected days, 101 cars passed by location B during the lunch
hour. The standard deviations for the two locations were 16 and 23, respectively.

Is the difference between the amount of traffic at locations A and B statistically significant at the 0.01 level?

z-table
Utility for Two Populations

First, you set up the hypotheses. The null hypothesis states that there is no difference in mean traffic flow
during the lunch hour at the two locations. Since you are only interested in whether the traffic is different at
the two locations, you conduct a two-sided hypothesis test.

Next, you find the z-value for the difference between the two sample means using the appropriate formula.
The z-value is -2.47.

A z-value of -2.47 corresponds to a left-tail probability of 0.0068. Since you are doing a two-sided test, you
must double this probability to find the correct p-value: 0.0136.

The p-value is higher than the significance level. You cannot reject the null hypothesis. On the basis of these
data, Claude has insufficient evidence to show that one location has more traffic than the other.
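A minimal Python sketch of the same calculation, assuming (as the course does) that the sample standard deviations can stand in for the population values:

from math import sqrt
from scipy.stats import norm

def two_mean_z_test(mean1, sd1, n1, mean2, sd2, n2):
    # Two-sided z-test for the difference between two sample means
    se = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    z = (mean1 - mean2) / se
    p_value = 2 * norm.sf(abs(z))
    return z, p_value

# Location A vs. location B lunch-hour traffic
z, p = two_mean_z_test(92, 16, 54, 101, 23, 62)
print(round(z, 2), round(p, 4))  # -2.47 0.0135 (the z-table rounding above gives 0.0136)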

Exercise 2: Karnivorous Kong vs. Peter the Pipsqueak


The Magical Toy company manufactures a line of wrestling action figures. Maude Troston, the head of the doll
department, must make the final decision regarding which of two models, "Karnivorous Kong" or "Peter the
Pipsqueak," should be discontinued.

Magical Toy ships crates containing equal numbers of Pipsqueaks and Kongs to retail outlets. 45 randomly
selected toy stores have sold an average of 78% of all Pipsqueaks delivered to them. 48 other stores have sold

an average of 84% of their Kongs.

At a significance level of 0.05, do the data indicate that one action figure sells better than the other?

z-table
Utility for Two Populations

Maude wants to know if one figure sells better, but doesn't have a preconception about which figure might be
flying off the shelves. Thus you use a two-sided test, and your alternative hypothesis simply states that there is
a difference between the two population proportions.

Using the appropriate formula and the data, you find the z-value for the difference between the sample
proportions. The z-value is -0.738.

A z-value of -0.738 corresponds to a left-tail probability of 0.2303. You double this to find the p-value for the
two-sided test. The p-value is 0.4606.

The p-value is greater than the significance level. Therefore, you can't reject the null hypothesis. The data do not
show conclusively that one action figure sells better than the other.

Grapefruit Bizarre
The Regal Beverage Company makes the soft drink Grapefruit Bizarre. The marketing department wants to
refocus its energies and resources.

You have been asked to determine if there are regional differences in consumers' response to advertisements
for Grapefruit Bizarre. Specifically, you must find out if the Midwest responds to Grapefruit Bizarre
advertisements as well as the West Coast.

The marketing department is clamoring to start a second campaign. It claims that ads that are effective on the
West Coast do not go over as well in the Midwest. Management demands statistical evidence at a significance
level of 0.05.

In the context of a free movie screening, an ad for Grapefruit Bizarre is shown to 173 Midwesterners. The
viewers had been randomly selected, and had not previously tasted the drink.

When asked later, 33% claimed that they were at least mildly interested in trying Grapefruit Bizarre. In a
similar survey conducted on the West Coast, 42% of 152 test subjects claimed at least a mild interest in trying
Grapefruit Bizarre.

You calculate the z-value for the difference in sample proportions.

Enter your z-value as a decimal number with 2 digits to the right of the decimal (e.g., enter "5" as "5.00").
Round if necessary.

z-table
Utility for Two Populations

The z-value is +1.68 or -1.68, depending on how you set up the difference, i.e., in what order you subtract the sample proportions.

A z-value of -1.68 corresponds to a left-tail probability of 0.0465. What do you report to the marketing
department?

z-table
Utility for Two Populations

You are conducting a one-sided hypothesis test. The alternative hypothesis states that the proportion of the
Midwest sample is less than the proportion of the West Coast sample. Therefore, you are interested only in the
left-tail probability. Your p-value is 0.0465.

0.0465 is less than the significance level 0.05. You should reject the null hypothesis. Midwesterners do not
respond to the ads as well as people from the West Coast. Marketing's claims are valid.
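A one-sided variant of the same sketch reproduces these figures; the unpooled standard error is assumed, and the left-tail probability is not doubled because the test is one-sided.

from math import sqrt
from scipy.stats import norm

p_mw, n_mw = 0.33, 173  # Midwest sample
p_wc, n_wc = 0.42, 152  # West Coast sample

se = sqrt(p_mw * (1 - p_mw) / n_mw + p_wc * (1 - p_wc) / n_wc)
z = (p_mw - p_wc) / se           # about -1.68
p_value = norm.cdf(z)            # one-sided: left-tail probability only, about 0.047
print(round(z, 2), round(p_value, 4))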

Challenge: LeMer Fashion Design


Upscale fashion designer Marjorie LeMer must decide from which supplier she should purchase bolts of cloth.
Rumor has it that BlueTex's product is superior to Southern Halifax's.

Random 10-yard sections from 43 bolts of Halifax's cloth contain a mean of 1.8 flaws per yard. Similar
sections from 42 bolts of BlueTex's product contain 1.6 flaws per yard. The standard deviations are 0.3 and
0.6, respectively.

Marjorie wants you to find out if the rumors that BlueTex makes a better product are statistically warranted.

You conduct a one-sided test. Which of the following is the best alternative hypothesis?


z-table
Utility for Two Populations

At what level are these data significant?

z-table
Utility for Two Populations

You find the z-value for the difference between the two sample means using the appropriate formula. The z-
value is -1.94.

The cumulative probability for z=-1.94 is 0.0262. This is the left-tail probability. Since you are running a one-
sided test, 0.0262 is your p-value.

0.0262 is greater than 0.01. At this significance level, you would not reject the null hypothesis. 0.0262 is less
than 0.05. At this significance level, you would reject the null hypothesis.

Southern Halifax's product is slightly cheaper than BlueTex's. All other factors being equal, Marjorie would
like to buy the less expensive product. Unless she is 99% confident that there is a difference in quality, she will
go with the cheaper cloth.

Based on this information and your calculations, what should Marjorie do?

z-table
Utility for Two Populations

The data are not significant at the 0.01 level, so you can't reject the null hypothesis that the two suppliers' mean flaw rates are the same.

A significance level of 0.01 corresponds to a confidence level of 99%. So at a 99% confidence level, you can't
reject the null hypothesis. Marjorie can't be 99% confident that BlueTex's product is better than Halifax's.

Since Halifax's product is cheaper and you can't establish a difference in quality at the level of statistical significance Marjorie requires, you recommend Halifax to Marjorie.

"Good work!" says Alice. "You're ready for a new challenge: investigating relationships between variables."

Regression Basics
Introduction
As you relax in your room during a brief afternoon downpour, your phone rings.

Leo's Bisque Debacle and the Staffing Problem


Leo just called. He wants us to come to his office immediately. He sounds a little angry. We'd better not keep
him waiting.

I'm sorry if I was short on the phone. I'm very upset. We just had a little incident down in the restaurant. A
server spilled a tureen of crab bisque on one of our most "favored" guests, Mr. Pitt.

The Kahana's occupancy this year has been higher than I expected, and I had to hire extra help from a staffing
agency. Those staffing agencies charge a fortune, which is especially irritating considering that the employees
they refer to us are often poorly suited to customer service in an up-scale hotel.

Really, this is my fault for not having a more effective staffing process. I just wish I could predict my needs
better. Sometimes, when demand is lower than I expected, I'm overstaffed. Then I lose money paying idle
bellhops. If I had a good sense of my staffing needs at least a month in advance, I could avoid hiring workers at
the last minute and having idle staff.

I had been thinking that the number of advance reservations would give me a good idea of how high my
occupancy would be a month down the road. But clearly advance reservations don't tell me the whole story. I've
been making way too many false predictions.

Is there anything you can do to help me here? What predictions about occupancy can I make based on advance
bookings? And how much can I trust them?

We'll take a look at the data on advance bookings and occupancy and let you know what we find out.

Introducing the Regression Line


Alice seems confident that the two of you can offer useful advice on Leo's staffing problem:

"This will be a great opportunity for you to learn regression. It's a powerful statistical tool used all the time in
business: in finance, demand forecasting, market research to name just a few areas. I'm sure you'll use it in your
MBA program. And it's a great chance to review what you've learned so far: sampling, confidence intervals, and
hypothesis testing all play a part in regression."

As we have seen, it is often useful to examine the relationship between two variables. Using scatter diagrams, we
can visualize such relationships.

We can learn more about the relationship by finding the correlation coefficient, which measures the strength of
the linear relationship on a scale from -1 to 1.

Regression is a statistical tool that goes even further: it can help us understand and characterize the specific
structure of the relationship between two variables.

Let's look at an example. Julius Tabin owns a small food processing company that produces the spreadable
lunchmeat product EasyMeat. Julius is trying to understand the relationship between his firm's advertising and
its sales.

Total sales in the spreadable meat industry have been fairly flat over the last decade, and Julius' competitors'
actions have been quite stable. Julius believes that his advertising levels influence his firm's sales positively, but
he doesn't have a clear understanding of what the relationship looks like.

Let's have a look at data on his firm's advertising and sales over the last 10 years. Click on the Excel link to create
the scatter diagram yourself from an Excel spreadsheet.

EasyMeat Data

Plotting annual sales against annual advertising expenditures gives us a visual sense of the relationship between
the two variables. Looking at the graph, we can see that as advertising has gone up, sales have generally
increased. The relationship looks reasonably linear.

EasyMeat Data

The correlation coefficient for the two variables is 0.93, indicating a strong linear relationship between
advertising and sales.

EasyMeat Data

What if we were to draw a line that characterizes this relationship? Which line would best fit the data? Our
mind's eye already sees how the two variables are related, but how can we formalize our visual impression?

EasyMeat Data

Before we start any calculations, let's look at several lines that could describe the relationship.

EasyMeat Data

One of these lines most accurately describes the relationship between the two variables: the "best-fit" or
regression line.

EasyMeat Data

In our example, the best-fit line is Sales = -333,831 + 50*Advertising. For this line, the y-intercept is -333,831
and the slope is 50.

EasyMeat Data

In general, a regression line can be described by a simple linear equation: y = a + bx, with y-intercept a and
slope b.

EasyMeat Data

In this equation, the y-variable, sales, is called the dependent variable, to suggest that we think Julius' sales
depend to some degree on his advertising. The x-variable, advertising, is called the independent variable, or
the explanatory variable.

EasyMeat Data

When we observe that a change in the independent variable (here advertising) is typically accompanied by a
proportional change in the dependent variable (here sales), regression analysis can identify and formalize that
relationship.

EasyMeat Data

Summary
Regression analysis helps us find the mathematical relationship between two variables. We can use regression
to describe a linear relationship: one that can be represented by a straight line and characterized by an
equation of the form y = a + bx.


The Uses of Regression


What kinds of questions can regression analysis help answer?

How does regression help us as managers? It can help in two ways: first, it helps us forecast. For example, we
can make predictions about future values of sales based on possible future values of advertising.

Second, it helps us deepen our understanding of the structure of the relationship between two variables by
expressing the relationship mathematically.

EasyMeat Data

Using Regression for Forecasting


Let's talk first about how managers can use regression to forecast. In our example, regression can help Julius
predict his company's sales for a specified level of advertising.

For example, if he plans to spend $65,000 in advertising next year, what might we expect sales to be?

If we didn't know anything about the relationship, but only had the historical data, we might simply note that
the last time Julius spent $65,000 on advertising, his sales were $3,200,200. But is this the best prediction we
can make?

EasyMeat Data

Not at all. Regression analysis brings the entire data set to bear on our prediction. In general, this will allow us
to make more accurate predictions than if we infer the future value of sales from a single observation of
advertising and sales. Having identified the relationship between the two variables from the full data set, we
can apply our understanding of that relationship to our forecast.

EasyMeat Data

Using regression analysis, we found the regression line to be Sales = -333,831 + 50*Advertising. If Julius plans
to spend $65,000 in advertising, what would we predict sales to be?

EasyMeat Data

The point on the line shows us what level of sales to expect. In this case, we would expect sales of $2,916,169.
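Plugging $65,000 into the regression equation makes the arithmetic explicit:

Sales = -333,831 + 50 * 65,000 = -333,831 + 3,250,000 = $2,916,169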

EasyMeat Data

With regression, we can forecast sales for any advertising level within the range of advertising levels
we've seen historically. For example, even if Julius has never spent exactly $50,000 on advertising, we can
still forecast a corresponding level of sales.

EasyMeat Data

We must be extremely cautious about forecasting sales for values of advertising beyond the range of values we
have already observed. The further we are from the historical values of advertising, the more we should
question the reliability of our forecast.

EasyMeat Data

For example, we might feel comfortable forecasting sales for advertising levels a bit above the observed range, perhaps as high as $100,000 or $105,000. But we shouldn't infer that if Julius spent $10 million on
advertising, he would achieve $500 million in sales. The total market for spreadable meat is probably much
less than $500 million annually!

EasyMeat Data

Likewise, we might feel comfortable forecasting sales for advertising levels just below the observed range. But
we certainly shouldn't report that if Julius spent $0 on advertising, he would have negative sales!

EasyMeat Data

If we try to use our regression equation to forecast sales for advertising levels outside of the historical range,
we are implicitly assuming that the relationship between advertising and sales continues to be linear outside
of the historical range.

EasyMeat Data

In reality, although the relationship may be quite linear for the range of values we've observed, the curve may
well level off for advertising values much lower or much higher than those we've observed. With no
observations outside the historical data range, we simply don't have evidence about what the relationship
looks like there.

EasyMeat Data

Another critical caveat to keep in mind is that whenever we use historical data to predict future values, we are
assuming that the past is a reasonable predictor of the future. Thus, we should only use regression to predict
the future if the general circumstances that held in the past, such as competition, industry dynamics, and
economic environment, are expected to hold in the future.

The Structure of a Relationship


Regression can be used to deepen our understanding of the structural relationship between two variables. If
we think about it, many business decisions are about increasing or decreasing one variable — investments or
advertising, for example — to affect some other variable — productivity, brand recognition, or profits, for
example. Regression can reveal the structure of relationships of this type.

Our regression analysis stipulates a linear relationship between sales and advertising. Understanding "the
structure" of this relationship translates into finding and interpreting the coefficients of the regression
equation.

As we've noted in the previous slide, the constant term -333,831 may have no real managerial significance; it
just "anchors" the regression line by telling us the y-intercept. We've never seen advertising levels close to $0,
so we cannot infer that spending no money on advertising will lead to sales of -$333,831!

The more important term is the advertising coefficient, 50, which gives us the slope of the line. The
advertising coefficient tells us how sales have changed on average as advertising has increased.

In the past, when advertising has increased by $10,000, what has been the average corresponding change in
sales?

Assuming that the relationship between sales and advertising is linear, each $1 increase in advertising should
be accompanied by the same average increase in sales. In our example, for every incremental $1 in advertising,
sales increase on average by $50. Thus, for every incremental $10,000 in advertising, sales increase on
average by $500,000.

The regression line gives us insight into how two variables are related. As one variable increases, by how much
does the other variable typically change? How much growth in sales can we anticipate from an incremental
increase in advertising expenditures? Regression analysis helps managers answer questions like these.

Summary
We use regression analysis for two primary purposes: forecasting and studying the structure of the
relationship between two variables. We can use regression to predict the value of the dependent variable for a
specified value of the independent variable. The regression equation also tells us how the dependent variable
has typically changed with changes in the independent variable.

Exercise 1: Soft Drink Consumption


Per-capita consumption of soft drink beverages is related to per-capita gross domestic product (GDP).
Generally, the higher the GDP of a country, the more soda its citizens consume. Soft drink consumption is
measured in number of 8-oz servings.

Based on data from 12 countries, the relationship can be expressed mathematically as: average soda consumption = 130 + 0.018*(per-capita GDP).

Source

Based on this relationship, you can expect that, on average, for each additional $1,000 of per-capita GDP a
country's soda consumption increases by _____ servings.

The regression equation tells us that in our data set, average soda consumption increases by 0.018 servings for
every additional $1 of per-capita GDP. So, for an additional $1,000, average consumption increases by
($1,000)(0.018 servings/$) = 18 servings.

The per-capita GDP in the Netherlands is $25,034. What do you predict is the average number of servings of
soda consumed in the Netherlands per year?

Enter predicted average soda consumption (in servings) as an integer (e.g., "5"). Round if necessary.

The regression equation tells us that average soda consumption = 130 + 0.018*(per-capita GDP). Therefore,
we anticipate the Netherlands' average soda consumption to be 580.6 servings.

Although the regression predicts a soda consumption of around 581 servings per person for the Netherlands,
the actual measured number of servings consumed is much lower: 362. The discrepancy in the actual and
predicted consumption reinforces that per-capita GDP alone is not a perfect predictor of soda consumption.

Calculating the Regression Line

A regression line helps you understand the relationship between two variables and forecast future values of the
dependent variable. Alice points out to you that these two features of regression analysis make it a powerful tool
for managers who make important decisions in the uncertain world of business.

But how do you generate a regression line from observed data? Of all the straight lines that you could draw
through a scatter diagram, which one is the regression line?

The Accuracy of a Line


Let's return to Julius Tabin's sales and advertising data. As we can see from the graph, no straight line could be
drawn that would pass through every point in the data set.

This is not surprising. Typically, advertising is not a perfect predictor of sales, so we don't expect every data
point to fall in a perfect line. The regression line depicts the best linear relationship between the two variables.
We attribute the difference between the actual data points and the line to the influence that other variables have
on sales, or to chance alone.

Since the regression line does not pass through every point, the line does not fit the data perfectly. How
accurately does the regression line represent the data?

To measure the accuracy of a line, we'll quantify the dispersion of the data around the line. Let's look at one line
we could draw through our data set.

Let's consider a second line. Click on the line that more closely fits the ten data points.

Although in this example we can see which of two lines is more accurate, it is useful to have a precise measure
of a line's accuracy.

To quantify how accurately a line fits a data set, we measure the vertical distance between each data point and
the line.

Why don't we measure the shortest distance between the point and the line — the distance perpendicular to
the line? Why do we measure vertically?

We measure vertical distance because we are interested in how well the line predicts the value of the
dependent variable. The dependent variable — in our case, sales — is measured on the vertical axis. For each
data point, we want to know how close the value of sales predicted by the line is to the historically observed value
of sales.

From now on we will refer to this vertical distance between a data point and the line as the error in prediction
or the residual error, or simply the error. The error is the difference between the observed value and the
line's prediction for our dependent variable. This difference may be due to the influence of other variables or to
plain chance.

Going forward, we will refer to the value of the dependent variable predicted by the line as y-hat and to the
actual value of the dependent variable as y. Then the error is y - (y-hat), the difference between the actual and
predicted values of the dependent variable.

The complete mathematical description of the relationship between the dependent and independent variables is
y = a + bx + error. The y-value of any data point is exactly defined by these terms: the value y-hat given by the
regression line plus the error, y - (y-hat).

Collectively, the errors in prediction for all the data points measure how accurately a line fits a set of data.

To quantify the total size of the errors, we cannot just sum each of the vertical distances. If we did, positive and
negative distances would cancel each other out.

Instead, we take the square of each distance and then sum all the squares, similarly to what we do when we
calculate variance.

This measure, called the Sum of Squared Errors (SSE), or the Residual Sum of Squares, gives us a good
measure of how accurately a line describes a set of data.

The less well the line fits the data, the larger the errors, and the higher the Sum of Squared Errors.

Summary
To find the line that best fits a data set, we first need a measure of the accuracy of a line's fit: the Sum of
Squared Errors. To find the Sum of Squared Errors, we calculate the vertical distances from the data points to
the line, square the distances, and sum the squares.

Identifying the Regression Line


Now that you have a way to measure how well a line fits a set of data, you need a way to identify the line that
"best fits" the data: the regression line.

We can calculate the Sum of Squared Errors for any line that passes through the data. Of course, different lines
will give us different Sums of Squared Errors. The line we are looking for — the regression line — is the one with
the smallest Sum of Squared Errors.

Let's look at several lines that could describe the relationship between advertising and sales in our example. Our
intuition tells us that the middle line is a much better fit than line a or line b.

Let's check our intuition. For each line, we can calculate the Sum of Squared Errors to determine its accuracy.

The lower the Sum of Squared Errors, the more precisely the line fits the data, and the higher the line's accuracy.

The line that most accurately describes the relationship between advertising and sales — the regression line — is
the line that minimizes the sum of squares. Finding the regression line for a set of data is a calculation-intensive
process best left to statistical software.
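Although the course relies on Excel, the least-squares coefficients can be computed directly from the data. A minimal Python sketch follows, using made-up advertising and sales figures since the actual EasyMeat data live in the linked spreadsheet:

import numpy as np

# Hypothetical advertising (x, in $) and sales (y, in $) observations
x = np.array([40000, 45000, 50000, 55000, 60000, 65000, 70000, 75000, 80000, 85000], dtype=float)
y = np.array([1.7e6, 1.9e6, 2.2e6, 2.3e6, 2.6e6, 2.9e6, 3.0e6, 3.3e6, 3.4e6, 3.8e6])

# The least-squares slope and intercept minimize the Sum of Squared Errors
b = np.cov(x, y)[0, 1] / np.var(x, ddof=1)  # slope
a = y.mean() - b * x.mean()                 # intercept

y_hat = a + b * x                           # predicted values on the regression line
sse = np.sum((y - y_hat) ** 2)              # Sum of Squared Errors (Residual Sum of Squares)
print(a, b, sse)

Calling np.polyfit(x, y, 1) would return the same slope and intercept; the longer form above shows the formula behind the fit.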

Summary
The line that most accurately fits the data — the regression line — is the line for which the Sum of Squared
Errors is minimized.

Performing Regression Analysis


Note: Unless you have installed the Excel Data Analysis ToolPak add-in, you will not be able to do regression
analysis using the regression tool. However, we suggest you read through the following instructions to learn
how Excel's regression tool works, so you can run regressions in the future, when you do have access to the
Data Analysis Toolpak.

Performing regression analysis by hand is a time-consuming process. Fortunately, statistical software


packages and major spreadsheet programs — Excel, for example — can do the necessary calculations for you
in a matter of seconds. Click on the Excel link to access the data file so you can practice doing the analysis in
Excel as you read through the instructions.

EasyMeat Data

Let's go through the process step by step. We start with data entered into two columns in an Excel
spreadsheet. Each column contains values of a variable. To perform regression analysis, there must be an
equal number of entries in each column.

Under the Data tab in the toolbar we select the Data Analysis option.

A window pops open containing an alphabetical list of statistical tools. We select "Regression" and click "OK".

A new window opens offering several options for regression analysis.

In the regression window, we see a prompt field titled ''Input Y Range.'' In it, we enter C1:C11, the range of
cells containing the column label (C1) and the data (C2:C11) for the dependent variable: Sales ($).

We repeat this for the prompt field titled ''Input X Range,'' entering B1:B11 to include both the column label
(B1) and the data (B2:B11) for the independent variable: Advertising ($).

Since we included the column labels in row 1 in our ranges, we must check the "Labels" box. Including labels is
helpful because Excel uses the labels to identify the variable coefficients in the output sheet. If you do not
include the labels in your ranges, do not check the label box, or Excel will treat the first row of data as labels,
excluding those entries from the regression.

Finally, we select the output option "New Worksheet Ply:", enter the name for the new worksheet, and click
"OK."

Excel opens a new worksheet with the name we specified. In it, we see an intimidating array of data.

For the moment, we're mainly interested in the entries in the cells labeled "Coefficients," which specify the
intercept and slope of the regression line.

Note that the label "Advertising ($)" has been carried over from the original data column. The coefficient in
the "Advertising ($)" row is the slope of the regression line.

For the exercises in this unit, we strongly recommend you find the relevant data in an Excel spreadsheet and
perform the regression analyses yourself. If you do not have the Analysis Toolpak, you can open a file
containing the relevant regression output.

EasyMeat Data
EasyMeat Regression

Exercise 1: Soft Drink Consumption Revisited


To practice using Excel's regression tool, run a regression using the world soft drink consumption data from
an earlier exercise. Use soft drink consumption for the dependent variable and per capita GDP for the
independent variable.

Soft Drink Consumption Data

What is the slope of the regression line? Enter the slope as a decimal number with 3 digits to the right of the
decimal point (e.g., enter "5" as "5.000"). Round if necessary.

Soft Drink Consumption Data


Soft Drink Consumption Regression
Source

We run the regression by selecting range C1:C13 for the Y-range, the dependent variable consumption, and
B1:B13 for the X-range, the independent variable GDP per capita. We check the label box, and see the output
below. The slope of the regression line is the coefficient of the independent variable, GDP per capita.

What is the intercept of the regression line? Enter the intercept as an integer (e.g., "5"). Round if necessary.

Soft Drink Consumption Data


Soft Drink Consumption Regression
Source

The intercept of the line is the coefficient labeled "Intercept."

Deeper into Regression


Equipped with the basic tools needed to find and interpret the regression line, you feel ready to tackle Leo's
assignment. But Alice cautions you not to be hasty and urges you to consider some tricky questions: "How well
does the regression line actually characterize the relationship in the data? Is a straight line even a good descriptor
of the relationship?"

Quantifying the Predictive Power of Regression


How much does the relationship between advertising and sales help us understand and predict sales? We'd like
to be able to quantify the predictive power of the relationship in determining sales levels. How much more do we
know about sales thanks to the advertising data?

To answer this question we need a benchmark telling us how much we know about the behavior of sales
without the advertising data. Only then does it make sense to ask how much more information the
advertising data give us.

Without the advertising data, we have the sales data alone to work with. Using no information other than the
sales data, the best predictor for future sales is simply the mean of previous sales. Thus, we use mean sales as
our benchmark, and draw a "mean sales line" through the data.

Let's compare the accuracy of the regression line and the mean sales line. We already have a measure of how
accurately an individual line fits a set of data: the Sum of Squared Errors about the line. Now we want a measure
of how much more accurate the regression line is than the mean line.

To obtain such a measure, we'll calculate the Sum of Squared Errors for each of the two lines, and see how much
smaller the error is around the regression line than around the mean line.

The Sum of Squared Errors for the mean sales line measures the total variation in the sales data. In fact, it is the
same measure of variation we use to derive the standard deviation of sales. We call the Sum of Squared Errors
for our benchmark — the mean sales line — the Total Sum of Squares. Here, the Total Sum of Squares is 8.01
trillion.

The Sum of Squared Errors for the regression line is often called the Residual Sum of Squared Errors, or the
Residual Sum of Squares. The Residual Sum of Squares is the variation left "unexplained" by the regression.
Here, the Residual Sum of Squares is 1.13 trillion.

The difference between the Total Sum of Squares and the Residual Sum of Squares, 6.88 trillion in this case, is
called the Regression Sum of Squares. The Regression Sum of Squares measures the variation in sales
"explained" by the regression line.

Excel's regression output reports all three of these terms.

A standardized measure of the regression line's explanatory power is called R-squared. R-squared is the fraction
of the total variation in the dependent variable that is explained by the regression line.

R-squared will always be between 0 and 1 — at worst, the regression line explains none of the variation in sales;
at best it explains all of it.

We find R-squared by dividing the variation explained by the regression line — the Regression Sum of Squares —
by the total variation in the dependent variable — the Total Sum of Squares.

R-squared is presented either as a fraction, a percentage, or a decimal. We find that in the advertising and sales
example, the R-squared value is 6.88 trillion/8.01 trillion = 0.859 = 85.9%.

An equivalent approach to computing R-squared is somewhat less intuitive but more common. In this approach
we first find the fraction of the total variation in the dependent variable that is NOT explained by the regression
line: we divide the Residual Sum of Squares by the Total Sum of Squares.

Then we subtract the fraction of unexplained variation from 1 to obtain R-squared.
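Using the sums of squares quoted above, a quick Python check of both computations (the 8.01 and 1.13 trillion figures are taken directly from the text):

total_ss = 8.01e12       # Total Sum of Squares (variation around the mean sales line)
residual_ss = 1.13e12    # Residual Sum of Squares (variation around the regression line)

regression_ss = total_ss - residual_ss        # 6.88 trillion, the "explained" variation
r_squared = regression_ss / total_ss          # about 0.859
r_squared_alt = 1 - residual_ss / total_ss    # the more common, equivalent form
print(round(r_squared, 3), round(r_squared_alt, 3))  # 0.859 0.859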

Fortunately, we don't need to calculate R2 ourselves — Excel computes R2 and includes it in the standard
regression output.

In a regression that has only one independent variable, R-squared is closely related to the correlation coefficient
between the independent and dependent variables: the correlation coefficient is simply the positive or negative
square root of R-squared—positive if the slope of the regression line is positive and negative if the slope of the
regression line is negative.

Excel's regression output always reports the positive square root of R-squared, which it labels "Multiple R."

Summary
R-squared measures how well the behavior of the independent variable explains the behavior of the
dependent variable. R-squared is the ratio of the Regression Sum of Squares to the Total Sum of Squares. As
such, it tells us what proportion of the total variation in the dependent variable is explained by its linear
relationship with the independent variable.

Residual Analysis
Although the regression line is the line that best fits the observed data, the data points typically do not fall
precisely on the line. Collectively, the vertical distances from the data to the line — the errors — measure how
well the line fits the data. These errors are also known as residuals.

A careful study of the residuals can tell us a lot about a regression analysis and the validity of the assumptions
we base it on.

For example, when we run a regression, we assume that a straight line best describes the relationship between
our two variables. In fact, sometimes the relationship may be better described by a curve.

In this graph, we can clearly see a negative trend. If we run a regression on these data, we find a relatively high
R-squared. How do we use the residuals to check our assumption that the relationship is linear?

First, we measure the residuals: the distance from the data points to the regression line.

Then we plot the residuals against the values of the independent variable. This graph — called a residual plot
— helps us identify patterns in the residuals.

We can recognize a pattern in the residual plot: a curve. This pattern strongly indicates that a straight line is
not the best way to express the relationship between the variables: a curve would be a much better fit.

A residual plot often is better than the original scatter plot for recognizing patterns because it isolates the
errors from the general trend in the data. Residual plots are critical for studying error patterns in more
advanced regressions with multiple independent variables.

If the only pattern in the dependent variable is accounted for by a linear relationship with the independent
variable, then we should see no systematic pattern in the residual plot. The residuals should be spread
randomly around the horizontal axis.

In fact, the distribution of the residuals should be a normal distribution, with mean zero, and a fixed variance.
Residuals are called homoskedastic if their distributions have the same variance.

If we see a pattern in the distribution of the residuals, then we can infer that there is more to the behavior of
the dependent variable than what is explained by our linear regression. Other factors may be influencing the
dependent variable, or the assumption that the relationship is linear may be unwarranted.

We've already seen the pattern of a curved relationship. What other patterns might we see?

Let's look at this scatter diagram and its corresponding residual plot. The residuals appear to be getting larger
for higher values of the independent variable. This phenomenon is known as heteroskedasticity.

Residual analysis reveals that the distribution of the residuals changes with the independent variable: the
variance increases as the independent variable increases. Since the variance of the residuals — which
contributes to the variation of the dependent variable — is affected by the behavior of the independent
variable, we can conclude that there must be more to the story than just the linear relationship.

There are a number of other assumptions about regression whose validity can be tested by performing a
residual analysis. Although interesting, these uses of residual analysis are beyond the scope of this course.

Summary
A complete regression analysis should include a careful inspection of the residuals. Plot the residuals
against the independent variable to reveal patterns in the distribution of the residuals.
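Outside Excel, the same residual plot can be produced in a few lines. A minimal sketch, assuming x and y are NumPy arrays holding the independent and dependent variables:

import numpy as np
import matplotlib.pyplot as plt

def residual_plot(x, y):
    # Fit the best-fit (regression) line and plot residuals against x
    b, a = np.polyfit(x, y, 1)      # slope and intercept
    residuals = y - (a + b * x)     # observed minus predicted values
    plt.scatter(x, residuals)
    plt.axhline(0, color="gray")    # residuals should scatter randomly around zero
    plt.xlabel("Independent variable")
    plt.ylabel("Residual")
    plt.show()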

Graphing Residual Lines and Residuals


Note: Unless you have installed the Excel Data Analysis ToolPak add-in, you will not be able to perform
residual analysis using the regression tool. However, we suggest you read through the instructions to learn
how Excel's regression tool works, so you can perform residual analysis in the future, when you do have access
to the Data Analysis Toolpak.

To study patterns in residuals and other regression data, visual representations can be very helpful. To plot a
regression line we first generate a scatter diagram of the data we are studying.

Once we have the scatter diagram, we add the regression line by right-clicking on any one of the data points. A
menu will pop up. From that menu, we select "Add Trendline."

In the Trendline Operations Menu, we select "Linear" under Trend/Regression Type.

If we wish to display the regression equation and R-squared value on the same scatter plot, we can select the
check boxes next to those items at the bottom and press "Close."

The scatter diagram will now have been augmented by the regression line, the regression equation, and the R-
squared value. This is a quick way to perform a simple regression analysis from a scatter plot, though it doesn't provide all of the output we'll want in order to review the results thoroughly.

Residual analysis is an option in Excel's Regression tool. To calculate the residuals and generate the residual
plot, we select "Residual Plots" in the Residuals section of the Regression Menu and click "Ok."

The residuals and their plot appear in the regression output.

The Significance of Regression Coefficients


R-squared and the residuals are not the only things to keep an eye on when running a regression.

The regression line is the line that best fits our observed data. But are we sure that the regression line depicts the
"true" linear relationship between the variables?

In fact, the regression line is almost never a perfect descriptor of the true linear relationship between the
variables. Why? Because the data we use to find the regression line typically represent only a sample from the
entire population of data pertaining to the relationship.

Returning to our spreadable lunchmeat example, advertising and sales data for a different set of years would
likely have given us a different line. Since each regression line comes from a limited set of data, it gives us only
an approximation of the "true" linear relationship between the variables.

As we do in sampling, we'll use Greek letters to denote parameters describing the population, and Latin letters to
denote the estimates of those parameters based on sample data. We'll use alpha and beta to represent the
coefficients of the "true" linear relationship in the population, and a and b to represent our estimates of alpha
and beta.

When we calculate the coefficients of a regression line from a set of observed data, the value a is an estimate of
alpha, the intercept of the "true" line. Similarly, the value b is an estimate of beta, the slope of the true line.

Like all estimates based on sample data, the calculated estimate for a coefficient is probably different from the
true coefficient. Just as in sampling, to find a range of likely values for the true coefficient, we construct
confidence intervals around each estimated coefficient.

Excel's output gives us 95% confidence intervals for each coefficient by default.

The Excel output for our EasyMeat example tells us that the best estimate of the slope is 50, and we are 95%
confident that the slope of the true line is between 33.5 and 66.5.

We can specify any other confidence level when setting up the regression. We might be interested in a 99%
confidence interval for the slope of the EasyMeat regression line.

The Excel output for our EasyMeat example tells us that the best estimate of the slope is 50, and we are 99%
confident that the slope of the true line is between 25.9 and 74.1.

Testing for a Linear Relationship


Since we don't know the exact value of the slope of the true advertising line, we might well question whether
there actually IS a linear relationship between advertising and sales. How can we assure ourselves that the
pattern we see in the sample data is not simply due to chance?

If there truly were no linear relationship, then in the full population of relevant data, changes in advertising
would not correspond systematically to proportional changes in sales. In that case, the slope of the best-fitting
line for the true relationship would be zero.

There are two quick ways to test if the slope of the true line might be zero. One is simply to look at the
confidence interval for the slope coefficient and see if it includes zero.

The 95% confidence interval for the slope in our case, [33.5, 66.5], does not include zero, so we can reject with
95% confidence the hypothesis that the slope of the true line is zero. We can say that we are 95% confident
that there really is a linear relationship between advertising and sales.

Alternatively, we can test how likely it is that the slope really is zero using a hypothesis test. We formulate the
null hypothesis that there is no linear relationship: the true slope, beta, of the regression line is zero.

Then we calculate the p-value: the likelihood of having a slope at least as far from zero as the slope calculated,
b, assuming that the slope is in fact zero.

In our example, the p-value tells us the likelihood of having a slope as far from zero as 50 if the true slope were
in fact zero.

Fortunately, Excel saves us the trouble of calculating the p-value and reports it along with the regression
output. This p-value tells us exactly what we want to know: if there really is no linear relationship between the
variables, how likely is it that our sample would have given us a slope as large as 50?

In our example, the p-value for the advertising coefficient is 0.0001, indicating that we can be confident at the
99.99% level that beta, the slope of the true line, is different from zero. Thus, we are confident at the 99.99%
level that there really is a linear relationship between advertising and sales.

In general, if the p-value for a slope coefficient is less than 0.05, then we can reject the null hypothesis that the
slope beta is zero, and conclude with 95% confidence that there is a linear relationship between the two
variables. Moreover, the smaller the p-value, the more confident we are that a linear relationship exists.

If the p-value for a slope coefficient is greater than 0.05, then we do not have enough evidence to conclude
with 95% confidence that there is a significant linear relationship between the variables.

Additionally, p-values can be used to test hypotheses about the intercept of a regression line. In our example,
the p-value for the intercept is 0.498, much larger than 0.05. However, since the intercept simply anchors the
regression line, whether or not it is zero is not particularly important.

We won't use the standard error of the coefficient in this course, but will simply point out that the standard
error is similar to a standard deviation of our distribution for the coefficient. Thus, the 95% confidence
interval is approximately 50 +/- 2*(7.2). It's not exact, because the distribution isn't quite normal, but the
concept is very similar.

We won't use the t-stat in this course either, since the p-value gives the equivalent information in a more
readily usable form. The t-stat tells us how many standard errors the coefficient is from the value zero. Thus, if
the t-stat is greater than 2, we are quite sure (approximately 95% confident) that the true coefficient is not
zero.

Summary
The slope and intercept of the regression line are estimates based on sample data: how closely they
approximate the actual values is uncertain. Confidence intervals for the regression coefficients specify a range
of likely values for the regression coefficients. Excel reports a p-value for each coefficient. If the p-value for a
slope coefficient is less than 0.05 we can be 95% confident that the slope is nonzero, and hence that there is a
linear relationship between the independent and dependent variables.
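For readers working outside Excel, the same estimates, confidence intervals, and p-values are available from standard statistical libraries. A minimal sketch using Python's statsmodels; the data arrays are placeholders standing in for the advertising and sales columns:

import numpy as np
import statsmodels.api as sm

x = np.array([40000, 45000, 50000, 55000, 60000, 65000, 70000, 75000, 80000, 85000], dtype=float)
y = np.array([1.7e6, 1.9e6, 2.2e6, 2.3e6, 2.6e6, 2.9e6, 3.0e6, 3.3e6, 3.4e6, 3.8e6])

model = sm.OLS(y, sm.add_constant(x)).fit()  # add_constant supplies the intercept term
print(model.params)          # intercept (a) and slope (b) estimates
print(model.conf_int(0.05))  # 95% confidence intervals for each coefficient
print(model.pvalues)         # p-values for the null hypothesis that each coefficient is zero
print(model.rsquared)        # R-squared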

Revisiting R-squared and p


It is important not to confuse the p-value for a coefficient with the R-squared for the regression. R-squared tells us what percentage of the variation seen in the dependent variable is explained by its relationship with the independent variable. The p-value tells us how likely it would be to obtain a relationship as strong as the one observed in the sample if there were no real relationship between the dependent and independent variables, that is, if the true coefficient of the line in the full population were zero.

Let's look at an example that illustrates the difference in these two measures. This scatter diagram shows data
tightly clustered around the regression line. R-squared is very high — very little variation in Y is left
unexplained by the line. The p-value on the coefficient is very low — there is clearly a significant linear
relationship between X and Y.

The second scatter diagram has a much lower R-squared — a lot of variation in Y is left unexplained by the
line. But there is no question that there is a significant relationship between X and Y, so the p-value on the

coefficient is again very low. Here, the linear relationship may not explain as much variation as it does in the
upper graph, but the relationship is clearly significant.

A high p-value indicates a lack of confidence in the underlying linear relationship. We would not expect a
questionable relationship to explain much variation. So if we have only one independent variable and it has a
high p-value, we would not expect to find a very high R-squared. However, as we'll see shortly, the story
becomes more complex when we have two or more independent variables.

So far, we haven't discussed how sample size affects the accuracy of a regression analysis. The larger the
sample we use to conduct the regression analysis, the more precise the information we obtain about the true
nature of the relationship under investigation. Specifically, the larger the sample, the better our estimates for
the slope and the intercept, and the tighter the confidence intervals around those estimates.

Let's look at how sample size affects p-values in an example. This scatter plot is based on a large sample size —
50 observations. With a p-value of zero we are extremely confident that there is a linear relationship between
X and Y, and our confidence interval for the slope coefficient is quite tight.

The second scatter plot is based on a sample size of only 10. The p-value rises to 0.07, so we cannot be 95%
confident that there is a linear relationship between X and Y. Our lack of confidence in our estimate is also
evident in the wide confidence interval for the slope coefficient.

Summary
The p-value and R2 provide different information. A linear relationship can be significant but not explain a
large percentage of the variation, so having a low p-value does not ensure a high R2. Sample size is an
important determinant of regression accuracy: as with all sampling, larger samples give more accurate
estimates.

Solving the Staffing Problem


Now aware of many of the subtleties of basic regression, you are ready to turn to Leo's staffing problem.

The Kahana has 500 guest rooms. Over the last three years, average yearly occupancy has varied a good deal: my
record low was 250 rooms occupied, and my high was 484. That range of variation makes staff planning
difficult.

I'd like to be able to predict the Kahana's occupancy one month in advance. So far, I've been making educated
guesses about the level of occupancy one month in the future based on the number of advance bookings. I take
the number of bookings and add another 50%, to be on the safe side. Obviously, I haven't done very well with
that method.

You need to find out more precisely what the relationship is between advance bookings and occupancy. Alice
asks you to run a regression analysis on Leo's occupancy and advance bookings data.

Bookings and Occupancy Data

The results of your analysis tell you that, for every 100 additional advance bookings,

Bookings and Occupancy Data


Bookings and Occupancy Regression

When you run the regression analysis, be sure you choose the dependent and independent variables correctly.
We are using advance bookings to predict occupancy, suggesting that we believe that occupancy levels "depend"
on advance bookings. Thus, the occupancy is the dependent variable, and the number of advance bookings is the
independent variable.

In the regression output, find the coefficient of the slope of the regression line. This tells you how many
additional guests Leo can expect for each additional advance booking.

Based on these data, can you be 95% confident that the slope of the regression line is not 0?

Bookings and Occupancy Data


Bookings and Occupancy Regression

How much of the variation in occupancy is explained by the variation in the number of advance bookings?

Bookings and Occupancy Data


Bookings and Occupancy Regression

You present your findings to Leo.

So there is a positive relationship between advance bookings and occupancy. But the power of advance bookings
to predict occupancy is pretty small. More than half of the variation in your room occupancy is due to other
factors.

I suppose the mathematical relationship you found will help me make slightly more informed staffing choices.
But mostly I'll still be stumbling in the dark.

Not so fast, Leo. We may be able to identify other factors linked to occupancy, and use them to come up with an
improved forecasting model. Why don't we meet tomorrow and discuss what other factors might help us predict
occupancy?

Alright. In the meantime, I'll be doing my best to appease Mr. Pitt. He's talking about filing suit against me. He
says he was burned. Burned by a bisque!

Exercise 1: Inventories and Capacity Utilization


You have been asked to examine the relationship between two important macroeconomic quantities: the
change in business inventories and factory capacity utilization levels.

If you wish to learn what percent of the variation in changes in inventories is explained by capacity utilization
levels, which variable should you choose as your independent variable?

Click here to access US economic data from the years 1971-1986. Run the regression with the change in
business inventories as the dependent variable and the capacity utilization as the independent variable.

Source

Using the regression output, find the slope of the regression line. Enter the slope as a decimal number with 2
digits to the right of the decimal point (e.g., enter "5" as "5.00"). Round if necessary.

Inventories and Capacity Data


Inventories and Capacity Regression

Exercise 2: Clever Solutions


Greta John is the human resources manager at the software consulting firm Clever Solutions. Recently, some
of the programmers have been restless: the senior programmers feel that length of service and loyalty have not
been rewarded in their compensation. The junior programmers think that seniority should not be a major
basis for pay.

Greta wants some hard data to inform the debate. As a preliminary step, she plots employees' salaries against
their length of service. Using the data provided, perform a regression analysis with salary as the dependent
variable and length of service as the independent variable.

Clever Solutions Data

What is the average increase in a Clever Solutions programmer's salary per year of service?

Clever Solutions Data


Clever Solutions Regression

Based on the regression analysis, Greta can tell that approximately _____ of the variation in compensation
can be explained by length of service.

Clever Solutions Data


Clever Solutions Regression

Exercise 3: Productivity and Compensation


Productivity measures a nation's average output per labor-hour. It is one of the most closely watched variables
in economics: as workers produce more per hour, employers can pay them more without increasing the price
of the product.

Since wages can rise without provoking a corresponding rise in consumer prices, the growth of productivity is
essential to a real increase in a nation's standard of living.

Peter Agarwal, a student at the Harvard Business School, wants to investigate the relationship between change
in productivity and change in real hourly compensation.

Peter has data on change in productivity and change in compensation for eight industrialized nations. The
figures are annual averages over the period from 1979-1990.

Productivity and Compensation Data


Source

Run a regression with change in compensation as the dependent variable and change in productivity as the
independent variable.

Productivity and Compensation Data
Source

How much of the variation in the change in compensation can be explained by the change in productivity?
Enter the percentage as a decimal number with 2 digits to the right of the decimal point (e.g., enter "50%" as
"0.50"). Round if necessary.

Productivity and Compensation Data


Productivity and Compensation Regression

Given these data, Peter finds that the relationship can be mathematically expressed as:

Can Peter claim (with a 95% level of confidence) that the relationship is statistically significant?

Productivity and Compensation Data


Productivity and Compensation Regression

The coefficient for the slope given by Excel is an estimate based on the data in Peter's sample. The estimate for
the slope of the regression line is about 0.75. If the actual slope of the relationship is 0, there is no linear
relationship between the change in productivity and the change in compensation.

On the regression output, there are two ways to tell if the slope coefficient is significant at the 0.05 level. First,
we can look at the 95% confidence interval provided and see that it ranges from -0.05 to +1.56. Since the 95%
confidence interval contains zero, the coefficient is not significant at the 0.05 level.

Alternatively, we can note that the p-value of the slope coefficient, 0.0625, is greater than 0.05. Peter cannot
be 95% confident that the actual slope is not 0.

Since Peter cannot be confident that the slope is not zero, he cannot be confident that there is a linear
relationship between the two variables.
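
Both checks can be read straight off a regression output. As a rough sketch in Python using the statsmodels library (the numbers below are placeholders, not Peter's actual data), the confidence interval and the p-value for the slope are inspected like this:

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical stand-ins for change in productivity (x) and change in compensation (y)
    x = np.array([0.8, 1.2, 2.0, 2.5, 3.1, 1.7, 2.9, 3.5])
    y = np.array([0.9, 1.0, 1.8, 2.2, 2.0, 1.5, 3.0, 2.6])

    model = sm.OLS(y, sm.add_constant(x)).fit()

    ci_low, ci_high = model.conf_int(alpha=0.05)[1]   # 95% CI for the slope
    p_value = model.pvalues[1]                        # p-value for the slope

    print(f"95% CI for slope: ({ci_low:.2f}, {ci_high:.2f})")
    print(f"p-value for slope: {p_value:.4f}")
    print("significant at the 0.05 level?", p_value < 0.05)

The two checks always agree: the 95% confidence interval excludes zero exactly when the slope's p-value is below 0.05.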

Suppose Peter collects data on eight more countries. Run the regression for the entire data set with change in
compensation as the dependent variable and change in productivity as the independent variable.

Expanded Productivity and Compensation Data

The new data set indicates that the variables have a slightly different relationship:

Can Peter claim (with a 95% level of confidence) that the relationship is statistically significant?

Expanded Productivity and Compensation Data


Expanded Productivity and Compensation Regression

What might explain why the coefficient is significant in the second (combined) data set?

Expanded Productivity and Compensation Data


Expanded Productivity and Compensation Regression

Multiple Regression
Introduction
After brainstorming for several hours, you and Alice devise a plan to improve your regression analysis of the
staffing problem, to help Leo better predict the Kahana's occupancy.

The Staffing Problem (II)


Good morning. I hope you had a good night's sleep and have come up with some ideas on how to better predict
my occupancy.

For my part, I slept horribly last night. This bisque lawsuit is taking years off my life.

I'm sorry to hear that. I'm afraid we don't have any legal advice to give you. We do have some ideas about how to
improve your ability to forecast your hotel's occupancy.

We can do better than explaining 39% of the variation in occupancy as we did with our earlier regression using
advance bookings as the independent variable.

Remember when we first arrived, we analyzed the relationship between Kauai's average hotel occupancy rates
and arrivals on the island? We found a fairly strong correlation between occupancy and arrivals: 71%.

Source

I took a look at the relationship between arrivals on Kauai and your hotel's occupancy numbers. In the
regression of Kahana occupancy versus arrivals, arrivals explain 80% of the variation in occupancy. That's much
better than the 39% explained by advance bookings.
Bookings and Occupancy Data

Wait a minute. Those numbers add up to more than 100%. It seems like your regression technique explains
more variation than there is!

Good observation, Leo. The numbers don't add up. The reason is that there is a statistical relationship between
arrivals and advance bookings that we aren't taking into account when we run the two regressions separately.

We intend to find one equation that incorporates the data on both arrivals and advance bookings and takes into
account the relationship between them. We'll also investigate the impact of other factors, such as the business
practices of your competitors. Who is your main competitor in the area?

That would be the Hotel Excelsior. Its manager, Knut Steinkalt, is a real cut-throat. He's always offering special
promotions that undercut my room prices.

Fortunately, the Excelsior, though very luxurious, is not nearly as inviting as the Kahana. That place feels like an
undertaker's parlor! I've been able to keep ahead of old Knut by offering a better product.

We'll study the Excelsior's promotions, and see if they've had a significant influence on the Kahana's occupancy.

Thanks. Let me know as soon as you have some results. I have to warn you, though, I may be out: I'm going see
my lawyers in Honolulu this week to discuss Mr. Pitt's bisque lawsuit. What a mess!

Introducing Multiple Regression


"Most management problems are too complex to be completely described by the interactions between only two
variables," Alice tells you. "Incorporating multiple independent variables can give managers a more accurate
mathematical representation of their business."

When buying a new home, we know that the house's size influences its selling price. All other factors being
equal, we expect that the larger the home, the higher its price.

We can gain a better sense of this relationship by graphing data on price and house size in a scatter diagram, and
running a regression with price as the dependent variable and house size as the independent variable. We'll use
data on 15 recent home purchases in the town of Silverhaven.

Silverhaven Real Estate Data

We can tell from the value of R-squared, 26%, that the relationship is fairly weak: house size variation explains
only about a quarter of the variation in price.

Silverhaven Real Estate Data


Silverhaven Real Estate Regressions

This should come as no surprise: house size is not the only variable affecting the price of a home. There are at
least three other important factors, as any real estate agent will tell you:

One facet of a desirable location is a low commuting time, so we'd expect average commuting time to be related
to price. To study this relationship, we'll use a proxy variable: the house's distance from the business center in
downtown Silverhaven. A proxy variable is a variable that is closely correlated with the variable we want to
investigate, but typically has more readily available data.

The data on distance for the same 15 houses reveals a negative relationship: the farther away from downtown,
the less expensive houses tend to be. Again, the strength of the relationship is relatively weak: R-squared for the
regression of price versus distance is 37%.

Silverhaven Real Estate Data


Silverhaven Real Estate Regressions

We might be tempted to think that, if house size explains 26% of the price of a house and distance explains 37%,
then the two variables together would explain 63% of the price. But what if there is a relationship between the
two independent variables, house size and distance?

In fact, the correlation coefficient of house size and distance is 31%. As we might expect, there is a positive
relationship between the two variables: as we move farther from the city center, houses tend to be larger. How
should we factor this relationship into our analysis of how house size and distance affect the price of a house?

Rather than considering each individual relationship, we need to find a way to express the three-way
relationship among all three variables: price, house size, and location. Instead of two separate equations
describing price, each with a different independent variable, we need one equation that includes both
independent variables.

We find this three-way relationship using multiple regression. In multiple regression, we adapt what we
know about regression with one independent variable — often called simple regression — to situations in
which we take into account the influence of several variables.

Thinking about several variables simultaneously can be quite challenging. Given only the price of a home, we
cannot make inferences about its size and location. A $500,000 home might be a mansion in the countryside, a
modest house in the suburbs, or a cozy cardboard box on the corner of 1st and Main.

Graphing data on more than two variables poses its own set of difficulties. Three variables can still be
represented, but beyond that, visualization and graphical representation become essentially impossible.

Although we can carry over many of the central ideas behind simple regression to the multivariate case, we'll
have to consider several interesting complications.

As managers, almost any quantity we wish to study will be influenced by more than one variable: to construct an
accurate model of a business' dynamics, we'll usually need several variables. Multiple regression is an essential
and powerful management tool for analyzing these situations.

Incorporating more than one independent variable into your analysis of the Kahana's occupancy sounds like a
good idea. But how do you adapt regression analysis to accommodate multiple variables?

Summary
Multiple regression is an extension of simple regression that allows us to analyze the relationships between
multiple independent variables and a dependent variable. Relationships among independent variables
complicate multivariate regression. With more than two independent variables, graphing multivariable
relationships is impossible, so we must proceed with caution and conduct additional analyses to identify
patterns.

Adapting Basic Concepts


"Multiple regression equations look very similar to regression equations with only one independent variable," Alice
explains. "But be careful - you have to interpret them slightly differently."

Interpreting the Multiple Regression Equation


In our home price example, we found two regression equations: one for the relationship between price and
house size, and one for the relationship between price and distance. What will the equation look like for the
three-way relationship between price, the dependent variable, and the two independent variables, house size
and distance?

The regression equation in our housing example will have the form below: house size and distance each have
their own coefficients, and they are summed together along with the constant coefficient a.

In general, the linear equation for a regression model with k different variables has the form below. Since the
coefficients we obtain from the data are just estimates, we must distinguish between the idealized equation that
represents the "true" relationship and the regression line that estimates that relationship. To express that even
the "true" equation does not fit perfectly, we include an error term in the idealized equation.

Running the regression gives us coefficients for house size and distance: 252 and -55,006, respectively. We can
use this multiple regression equation to predict the price of other houses not in our data set. To predict a house's
price, we need to know only its size and its distance to downtown.

Silverhaven Real Estate Data


Silverhaven Real Estate Regressions

Suppose "Windsor" is a modest mansion of 3,500 square feet, located in the outer suburbs of Silverhaven,
approximately 11 miles from downtown. Based on our regression equation, how much would we expect Windsor
to sell for?

Silverhaven Real Estate Data


Silverhaven Real Estate Regressions

We simply enter Windsor's square footage and distance to downtown into the equation, and calculate an
expected selling price of $699,938.
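
The arithmetic behind that prediction is worth seeing once. A minimal sketch follows; note that the intercept of $423,004 is not quoted in the text — it is the value implied by the two coefficients and the $699,938 prediction, and in practice it would be read directly from the regression output:

    # Coefficients from the multiple regression of price on house size and distance
    intercept = 423_004    # implied by the quoted prediction (assumption); read it from the output in practice
    size_coef = 252        # dollars per square foot, controlling for distance
    dist_coef = -55_006    # dollars per mile from downtown, controlling for house size

    windsor_size = 3_500   # square feet
    windsor_dist = 11      # miles from downtown

    predicted_price = intercept + size_coef * windsor_size + dist_coef * windsor_dist
    print(predicted_price)  # prints 699938, i.e. $699,938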

Let's take a closer look at the coefficients in the housing example, focusing on the distance coefficient: -55,006.
This coefficient is substantially different from the coefficient in the original simple regression: -39,505. Why is it
so different?

The coefficient in the simple regression and the coefficient in the multiple regression have very different
meanings. In the simple regression equation of price versus distance, we interpret the coefficient, -39,505, in the
following way: for every additional mile farther from downtown, we expect house price to decrease by an average
of $39,505.

We describe this average decrease of $39,505 as a gross effect - it is an average computed over the range of
variation of all other factors that influence price.

In the multiple regression of price versus size and distance, the value of the distance coefficient, -55,006, is
different, because it has a different meaning. Here, the coefficient tells us that, for every additional mile, we
should expect the price to decrease by $55,006, provided the size of the house stays the same.

In other words, among houses that are similarly sized, we expect prices to decrease by $55,006 per mile of
distance to downtown. We refer to this decrease as the net effect of distance on price. Alternatively, we refer to it
as "the effect of distance on price controlling for house size."

Two houses are similar in size, but located in different neighborhoods: "Shangri La" is five miles farther from
downtown than "Xanadu." If Xanadu's selling price is $450,000, what would we expect Shangri La's selling price
to be?

Silverhaven Real Estate Data


Silverhaven Real Estate Regressions

Since the two houses are the same size, we use the net effect of distance on price,
-$55,006/mile, to predict the expected difference in their selling prices. Shangri La is 5 additional miles from
downtown, so its price should be -$55,006/mile * 5 miles = $275,030 less than Xanadu's, or $450,000 -
$275,030 = $174,970.

"Valhalla" is another house located 5 miles farther from downtown than Xanadu. We have no information about
the relative sizes of the two homes. If Xanadu's selling price is $450,000, what would we expect Valhalla's selling
price to be?

Silverhaven Real Estate Data


Silverhaven Real Estate Regressions

Since we cannot assume that the sizes of the two houses are equal, we should not control for size. Thus we use
the gross effect of distance on price,
-$39,505/mile, to predict the expected difference in the two homes' selling prices. Valhalla is 5 additional miles
from downtown, so its price should be $39,505/mile * 5 miles = $197,525 less than Xanadu's, or $450,000 -
$197,525 = $252,475.

Let's try to build our intuition about the difference in the distance coefficients in the simple and multiple
regressions. The coefficients are different because they have different meanings. But what exactly accounts for
the drop from -39,505 to -55,006?

In the multiple regression, by essentially considering only houses that are of equal size, we separate out the
effect of house size on price. We are left with a distance coefficient that is net relative to house size.

In the simple regression, the gross effect of distance, -$39,505/mile, represents an average over the range of
house sizes. As such, it also captures some of the effect that house size has on price. Let's take a closer look at the
distance and house size data.

Calculating the correlation coefficient between sizes of homes and their distances from downtown Silverhaven,
we see that there is a slight positive relationship, with a correlation coefficient of 31%. In other words, as we
move farther from downtown, houses tend to be larger.

We have seen that two things happen as we move farther from downtown — housing prices drop because the
commute is longer, and house size increases. The fact that house size increases with distance complicates the
pricing story, because larger houses tend to be more expensive.

Longer distances from downtown translate into two different effects on price. One effect of distance on price is
negative: as distance increases, commute times increase and prices drop.

A second effect of distance on price is positive: as distance increases, house sizes increase, and larger houses
correspond to higher prices.

Running a multiple regression with both size and distance as independent variables helps tease out these two
separate effects. When we control for house size, we see the net effect of distance on price: prices drop by
$55,006 per additional mile.

When we don't control for house size, the effect of distance alone on price is confounded by the fact that house
size tends to rise as distance increases. The "real" effect of distance on price is partly masked: the larger houses
found farther from downtown push prices back up, shrinking the measured gross effect.

When we look at the net relationship between distance and price, we consider only similarly sized houses. Now
we assume that as distance from downtown grows, house size stays the same. If house size didn't increase as we
moved farther out, prices would drop more sharply: by $55,006 rather than $39,505 per additional mile.

Let's analyze the house size coefficient in a similar fashion. In the multi-variable regression model of home
prices, the house size coefficient is net relative to distance. The coefficient of 252 tells us to expect prices of
homes equally distant from downtown to increase by an average of $252 for each additional square foot of
size.

The gross effect of house size on price is $167 per square foot, considerably less than $252, the net effect. When
we do not control for distance, distance "offsets" some of the positive effect of house size: the fact that larger
houses are typically located farther from the city counteracts some of the effect of increased house size.

We should always be careful to interpret regression coefficients properly. A coefficient is "net" with respect to all
variables included in the regression, but "gross" with respect to all omitted variables. An included variable may
be picking up the effects on price of a variable that is not included in the model — school district, for example.

Finally, we should note that these coefficients are based on sample data, and as such are only estimates of the
coefficients of the true relationship. For each independent variable, we must inspect its p-value in the regression
output to make sure that its relationship with the dependent variable is significant.

Since the p-value is less than 0.05 for both house size and distance to downtown, we can be 95% confident that
the true coefficients of the two independent variables are not zero. In other words, we are confident that there
are linear relationships between each independent variable and house price.

There are four steps we should always follow when interpreting the coefficients of an independent variable in a
multiple regression:

Summary
We use multiple regression to understand the structure of relationships between multiple variables and a
dependent variable, and to forecast values of the dependent variable. A coefficient for an independent variable
in a regression equation characterizes the net relationship between the independent variable and the
dependent variable: the effect of the independent variable on the dependent variable when we control for the
other independent variables included in the regression.

Performing in Excel
Note: Unless you have installed the Excel Data Analysis ToolPak add-in, you will not be able to perform
multiple regression analysis using the regression tool. However, we suggest you read through the following
instructions to learn how Excel's regression tool works, so you can perform multiple regression in the future,
when you do have access to the Data Analysis ToolPak.

Performing multiple regression in Excel is nearly identical to performing simple regressions. In the "Input Y
range" field, enter the range that includes the column label and the data for the dependent variable as you
would for a simple regression.

To enter data on the independent variables, the columns of all the independent variables must be contiguous
and have the same number of rows.

In the "Input X Range" field, enter the cell reference of the top cell of the independent variable appearing
farthest to the left. Following a colon, enter the cell reference of the bottom cell of the independent variable
appearing farthest to the right.

Select "Labels" so that Excel can properly label the independent variables in the output. Select "Residual
Plots" to include the residuals and residual plots of the residuals against each independent variable in the
output. Enter the desired "Confidence Level," or accept the default level of 95%.

The regression output prints the values of the coefficients for each variable in separate rows, along with the p-
values for the coefficients and the desired confidence intervals.
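
If you prefer to work outside Excel, the same regression can be run with a few lines of Python using the statsmodels library. This is only a sketch: the column names and figures below are illustrative placeholders, not the actual Silverhaven data.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Illustrative data frame: one row per house, with the dependent and independent variables
    houses = pd.DataFrame({
        "price":    [385_000, 570_000, 412_000, 635_000, 298_000],
        "size":     [1_800,   3_500,   2_100,   3_000,   1_500],
        "distance": [4.0,     11.0,    6.5,     3.0,     9.0],
    })

    # Fit price on both independent variables at once
    model = smf.ols("price ~ size + distance", data=houses).fit()

    print(model.params)                 # intercept and one coefficient per variable
    print(model.pvalues)                # p-value for each coefficient
    print(model.conf_int(alpha=0.05))   # 95% confidence intervals
    print(model.resid)                  # residuals, one per observation

The params, pvalues, conf_int, and resid attributes correspond to the coefficients, p-values, confidence intervals, and residuals that Excel's regression tool prints.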

Residual Analysis
When running a multiple regression, you must distinguish between net and gross relationships. What about R-
squared? How do you measure the predictive power of a regression model with multiple independent variables?

One of the Silverhaven houses in our data set, "Windsor," sold for $570,000, substantially less than its predicted
price, $699,938. Rumor has it that Windsor is haunted, and it was hard to find a buyer for it. The difference
between the actual price and the predicted price is the residual error, in this case -$129,938.

In simple regression, the residuals — the differences between the actual and predicted values of the dependent
variables — are easy to visualize on a scatterplot of the data. The residuals are the vertical distances from the
regression line to the data points.

Graphically representing relationships among three variables is a bit difficult.

When we have only one independent variable, we create a scatter plot of the data points, then draw a line
through the data representing the relationship defined by the regression equation: y = a + bx.

With two independent variables, an ordinary scatter plot will no longer do. Instead, we keep track of the third
variable by picturing a three-dimensional space.

Our data set is "scattered" in the space with distance from downtown measured on one axis, house size
measured on another axis, and the dependent variable, price, measured on the vertical axis.

The regression equation with two independent variables defines a plane that passes through the data. The
residuals — the differences between the actual and predicted prices in the data set — are the vertical distances
from the regression plane to the data points.

As in simple regression, these vertical distances are known as residuals or errors. The regression plane is the
plane that "best fits" the data in the sense that it is the plane for which the sum of the squared errors is
minimized.

We can plot the residuals against the values of each independent variable to look for patterns that could indicate
that our linear regression model is inadequate in some way. Here is the residual plot analyzing the behavior of
the residuals over the range of the independent variable house size...

...and here is the residual plot analyzing the behavior of the residuals over the range of the independent variable
distance.

When linear regression is a good model for the relationships studied, each of the residual plots should reveal a
random distribution of the residuals. The distribution should be normal, with mean zero and fixed variance.

The residual plot against distance in the multiple regression looks different from the residual plot against
distance in the simple regression. They represent different concepts: the first gives insight into the net
relationship between distance and price controlling for house size, and the second gives insight into the gross
relationship.

With more than two independent variables, visualizing the residuals in the context of the full regression
relationship becomes essentially impossible. We can only think of residuals in terms of their meaning: the
differences between the actual and predicted values of the dependent variable.

Because residual plots always involve only two variables — the magnitude of the residual and one of the
independent variables — they provide an indispensable visual tool for detecting patterns such as
heteroskedasticity and non-linearity in regressions with multiple independent variables.
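
Here is a sketch of how such residual plots can be produced in Python with matplotlib, again using illustrative placeholder data rather than the actual Silverhaven figures:

    import pandas as pd
    import matplotlib.pyplot as plt
    import statsmodels.formula.api as smf

    # Illustrative data, as in the earlier sketch
    houses = pd.DataFrame({
        "price":    [385_000, 570_000, 412_000, 635_000, 298_000],
        "size":     [1_800,   3_500,   2_100,   3_000,   1_500],
        "distance": [4.0,     11.0,    6.5,     3.0,     9.0],
    })
    model = smf.ols("price ~ size + distance", data=houses).fit()

    # One residual plot per independent variable: residuals on the vertical axis,
    # the independent variable on the horizontal axis
    fig, axes = plt.subplots(1, 2, figsize=(8, 3))
    for ax, var in zip(axes, ["size", "distance"]):
        ax.scatter(houses[var], model.resid)
        ax.axhline(0, linestyle="--")   # a good model leaves residuals scattered randomly around zero
        ax.set_xlabel(var)
        ax.set_ylabel("residual")
    plt.tight_layout()
    plt.show()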

Summary
The residuals, or errors, are the differences between the actual values of the dependent variable and the
predicted values of the dependent variable. For a regression with two independent variables, the residuals are
the vertical distances from the regression plane to the data points. We can graph a residual plot for each
independent variable to help identify patterns such as heteroskedasticity or non-linearity.

Quantifying the Predictive Power of Multiple Regression


In simple regressions, R-squared measures the predictive or explanatory power of the independent variable:
the percentage of the variation in the dependent variable explained by the independent variable. When we run
the regression of house price versus house size, we find an R-squared of 26%. The R-squared for price versus
distance from downtown is 37%.

The multiple regression of price versus the two independent variables, house size and distance, also returns an
R-squared value: 90%. Here, R-squared is the percentage of price variation explained by the variation in both
of the independent variables.

The multiple regression's R-squared is much higher than the R-squared of either simple regression. But how
can we be sure we really gained predictive power by considering more than one variable?

In fact, R-squared cannot decrease when we add another independent variable to a regression — it can only
stay the same or increase, even if the new independent variable is completely unrelated to the dependent
variable.

To understand why R-squared always improves when we add another variable, let's return to the simple
regression of house price versus distance from downtown. When we look at only two observations — two
houses — we can find a line that fits the two points perfectly.

R-squared for these two data points with this line is 100%. But clearly we can't use this line to explain the
true relationship between house price and distance. The high R-squared is a result of the fact that the number
of observations, 2, is so small relative to the number of independent variables, 1.

If we have 3 observations — three houses — we can't find a line that fits perfectly. Here, R-squared is 35%.

But when we add another independent variable — no matter how irrelevant — we can find a plane that fits the
three data points perfectly, increasing R-squared to 100%. The "perfect" R-squared is again due to the fact
that the number of observations, 3, is only one more than the number of independent variables, 2.

Improving R-squared by adding irrelevant variables is "cheating." We can always increase R-squared to 100%
by adding independent variables until we have one fewer than the number of observations. We are "over-fitting"
the data when we obtain a regression equation in this way: the equation fits our particular data set
exactly, but almost surely does not explain the true relationship between the independent and dependent
variables.

To balance out the effect of the difference between the number of observations and the number of
independent variables, we modify R-squared by an adjustment factor. This transformation looks quite
complicated, but notice that it is largely determined by n-k, the difference between the number of observations
n and the number of independent variables k.
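
The adjustment appears as a graphic in the course; a commonly used form of adjusted R-squared, and the one Excel reports, is

    Adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - k - 1)

where n is the number of observations and k the number of independent variables. Adding a variable raises k, which makes the factor (n - 1)/(n - k - 1) larger and therefore pulls adjusted R-squared down unless R-squared itself rises enough to compensate.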

This adjustment reduces R-squared slightly for each variable we add: unless the new variable explains enough
additional variance to increase R-squared by more than the adjustment factor reduces it, we should not add
the new variable to the model.

Excel reports both the "raw" R-squared and the Adjusted R-squared.

It is critical to use adjusted R-squared when comparing the predictive power of regressions with different
numbers of independent variables. For example, since the adjusted R-squared of the multiple regression of
house price versus house size and distance is greater than the adjusted R-squared of either simple regression,
we can conclude that we gained real predictive power by considering both independent variables
simultaneously.

A final caveat: we should never compare R-squared or adjusted R-squared values for regressions with
different dependent variables. An R-squared of 50% might be considered low when we are trying to explain
product sales, since we expect to be able to identify and quantify many key drivers of sales. An R-squared of
50% would be considered high if we were trying to explain human personality traits, since they are influenced
by so many factors and random events.

Summary
R-squared measures how well the behavior of the independent variables explains the behavior of the
dependent variable. It is the percentage of variation in the dependent variable explained by its relationship
with the independent variables. Because R-squared never decreases when independent variables are added
to a regression, we multiply it by an adjustment factor. This adjustment balances out the apparent
advantage gained just by increasing the number of independent variables.

Solving the Staffing Problem (II)


Alice urges you to use your newfound knowledge to analyze the relationship between the Kahana's occupancy
and its advance bookings and Kauai's arrivals.

You and Alice want to find the three-way relationship between the dependent variable, the Kahana's occupancy,
and the two independent variables, arrivals on Kauai and the Kahana's advance bookings.

The first thing Alice asks you to do is to measure the strength of the relationship between the two independent
variables. Enter the correlation coefficient as a decimal number with three digits to the right of the decimal point
(e.g., enter "5" as "5.000"). Round if necessary.

Kahana Occupancy Data

The correlation between arrivals and bookings isn't especially strong, 44%.
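
In Excel, a correlation coefficient like this comes from the CORREL worksheet function. As a sketch in Python (with placeholder arrays standing in for the arrivals and advance-bookings columns):

    import numpy as np

    # Placeholder columns standing in for Kauai arrivals and Kahana advance bookings
    arrivals = np.array([41_000, 39_500, 44_200, 47_800, 43_100, 45_600])
    bookings = np.array([210,    180,    260,    240,    230,    275])

    corr = np.corrcoef(arrivals, bookings)[0, 1]   # off-diagonal entry of the 2x2 correlation matrix
    print(f"correlation coefficient: {corr:.3f}")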

Kahana Occupancy Data

You run the multiple regression of occupancy versus Kauai arrivals and advance bookings.

Kahana Occupancy Data

Which of the following indicates that the multiple regression with two independent variables is an improvement
over both simple regressions?

Kahana Occupancy Data


Kahana Occupancy Regressions

Do the data indicate that the regression coefficients of the two independent variables are significant at the 0.05
significance level?

Kahana Occupancy Data


Kahana Occupancy Regressions

Suppose Leo has 200 advance bookings for the month of January and 250 advance bookings for February. What
is the best estimate of how many more guests Leo can expect in February compared to January?

Kahana Occupancy Data


Kahana Occupancy Regressions

You and Alice try to contact Leo, but he appears to be unavailable.



Leo must be with his lawyers in Honolulu. I'm sure we'll be able to get a hold of him later.

In the meantime, I can complete my research into the Excelsior's promotions. And there are some complexities
of multiple regression that you should become familiar with...

Exercise 1: Empire Learning


Empire Learning is a developer of educational software. CEO Bill Hartborne is making a bid for a contract to
create an e-learning module for a new client.

Preparing the bid requires an estimate of the number of labor-hours it will take to create the new module. Bill
believes that the length of a module and the complexity of its animations directly affect the amount of labor
required to complete it.

Bill has data on the labor-hours Empire used to complete previous courses. He also knows the number of
pages and the animation run-time of each previous course — quantities he thinks are reasonable proxies for
course length and animation complexity, respectively.

Perform a simple regression analysis for each of the independent variables: number of pages and run-time of
animations.

Empire Learning Data

Which factor explains more variation in labor hours?

Empire Learning Data


Empire Learning Regressions

In the simple regressions, which of the independent variables contributes significantly to the number of labor-
hours it takes Empire to create an e-learning course?

Empire Learning Data


Empire Learning Regressions

The p-values for the coefficients on animation run-time and number of pages are 0.003 and 0.0002
respectively—well below 0.05, the most commonly used level of significance. Thus, we conclude that both
independent variables contribute significantly in their respective simple regressions to the number of labor
hours Empire takes to create an e-learning course.

Run the multiple regression of labor-hours versus number of pages and run-time of animations.

Empire Learning Data

According to this multiple regression, which of the independent variables contributes significantly to the
number of labor-hours it takes Empire to create an e-learning course?

Empire Learning Data


Empire Learning Regressions

The p-values for the coefficients on animation run-time and number of pages are 0.014 and 0.0015
respectively—well below 0.05, the most commonly used level of significance. Thus, we conclude that both
independent variables contribute significantly in the multiple regression to the number of labor hours Empire
takes to create an e-learning course.

Exercise 2: The Empire Strikes Back


For this exercise, refer to the regression analyses performed in Exercise 1 of this section.

Empire Learning Data


Empire Learning Regressions

Bill Hartborne, CEO of Empire Learning, is using regression analysis to predict the number of labor-hours it
will take his team to create a new e-learning course. He is using data on previous courses Empire created, with
the number of pages and the total run-time of animations as independent variables.

Empire Learning Data


Empire Learning Regressions

In the multiple regression of labor-hours versus number of pages and run-time of animations, what does the
coefficient of 0.84 for the number of pages tell us?

Empire Learning Data


Empire Learning Regressions

In the multiple regression equation, the coefficient of the independent variable "number of pages" is gross
relative to:

Empire Learning Data


Empire Learning Regressions

Challenge: Children of the Empire


For this exercise, refer to the regression analyses performed in Exercise 1 of this section.

Empire Learning Data


Empire Learning Regressions

Bill Hartborne, the CEO of Empire Learning, is using regression analysis to predict the number of labor-hours
it will take his team to create a new e-learning course. He is using data on previous courses Empire created,
with the number of pages and the total run-time of animations as independent variables.

Empire Learning Data


Empire Learning Regressions

Bill bills out his talent at $70/hour. Based on the multiple regression, how much should he charge for the
labor content of a course with 400 pages and 170 seconds of animations? Enter the estimated cost of the labor
(in $) as an integer (e.g., enter "$5.00" as "5"). Round if necessary.

Empire Learning Data


Empire Learning Regressions

First use the regression equation to predict the number of labor-hours required to complete the course.

Empire Learning Data


Empire Learning Regressions

Then multiply that number (about 1,103 labor-hours) by Empire Learning's billing rate of $70/hour to find the
total amount he should charge for the labor content of the course, $77,210.

Empire Learning Data


Empire Learning Regressions

Bill is sure that the client will balk at a labor bill of over $70,000. He knows that animation is important to the
client, so he doesn't want to cut corners there. However, he believes that his lead writer can cover the content
in fewer pages without compromising his renowned clear and engaging prose.

Empire Learning Data


Empire Learning Regressions

To reduce total labor costs to $70,000, how many pages must Bill cut from the plan to meet his client's cost
limits?

Empire Learning Data


Empire Learning Regressions

To reduce the labor bill from $77,210 to $70,000, Bill must reduce labor costs by $7,210. To achieve this
reduction, Bill must cut the contract's labor hours by 103 hours, since he bills out his talent at $70/hour.

Empire Learning Data


Empire Learning Regressions

Since the animation run time will not change, we use the net relationship between labor-hours and number of
pages, which tells us that each additional page consumes 0.84 labor hours. Thus, Bill must reduce the number
of pages by 123.

Empire Learning Data


Empire Learning Regressions

New Concepts in Multiple Regression


"I expect Leo will call us this evening after his meeting with the lawyers," Alice predicts. "I hope things are going
well. If Mr. Pitt's lawsuit materializes, Leo might not have much of a business left to help him with."

The Staffing Problem (III)


I just spent the whole day at my lawyers' offices. Please give me some good news about the occupancy problem.

Well, we've found a regression model that incorporates arrivals on Kauai and advance bookings. We're now able
to explain about 86% of the variation in occupancy.

Kahana Occupancy Regressions

That's great. That's so much better than the 39% you calculated using only advance bookings as the independent
variable. I should be able to make much more reliable predictions based on your new model!

Unfortunately, no. Although this new model helps us understand why your occupancy varies, we can't exactly
use the model to make predictions.

You can use advance bookings to make a prediction about occupancy in a given month because the bookings are
known to you ahead of time. But you won't get the data on the number of arrivals in a month until it's too late
and your guests are already on your doorstep.

That's terrible! Sure, it's nice to know how today's occupancy is affected by today's arrivals. But I need to make
business decisions! I need to know one month in advance how many staffers to hire! Please, isn't there
something you can do?

Don't give up yet, Leo. We still have a number of statistical approaches at our disposal. We'll have something for
you when you get back from Honolulu.

Multicollinearity
"We need some more advanced statistical tools to find a regression suitable to Leo's purposes," Alice tells you.
"And there is still a pitfall you'll need to learn to avoid when using multiple regression."

Another key factor that influences the price of a house is the size of the property it is built on - its "lot size."
Naturally, we'd pay more for a spacious acre of land than for 800 cramped square feet.

Using new data on the lot sizes of the 15 sample houses in Silverhaven, run the simple regression of house price
on lot size. If we don't control for any other factors, how much of the variation in price can be explained by
variation in lot size? Enter R-squared as a decimal number with two digits to the right of the decimal point (e.g.,
enter "50%" as "0.50"). Round if necessary.

Silverhaven Real Estate Data


Silverhaven Real Estate Regressions

Variation in lot size accounts for 30 percent of the variation in home prices. Do the data provide evidence that there is a significant linear
relationship between house price and lot size?

Silverhaven Real Estate Data


Silverhaven Real Estate Regressions

The low p-value of 0.033 tells us that we can be confident that the gross relationship between lot size and home
price is significant. What happens when we add lot size as a third independent variable in our multiple
regression of price on three independent variables: house size, distance from downtown Silverhaven, and lot
size?

Run a multiple regression of price on the three independent variables: house size, distance, and lot size. How
does the addition of the new independent variable, "lot size" affect the predictive power of the regression model?

Silverhaven Real Estate Data


Silverhaven Real Estate Regressions

By adding the independent variable "lot size", we improve adjusted R-squared slightly: from 89% to 91%, telling
us that the predictive power of the regression has improved. What about the significance of the independent
variables? Has adding the new variable changed the p-values of the coefficients?

Silverhaven Real Estate Regressions

Something odd has happened. In our earlier regression with two independent variables, the p-values for both
the house size and the distance coefficients were less than 0.05. Now, adding lot size into the equation has
somehow raised the p-value for house size.

The new p-value for house size, 0.2179, is so high that there is no longer evidence of a significant linear
relationship between price and house size after taking lot size and distance into account. How do we explain this
drop in significance?

When a multiple regression delivers a surprising result such as this, we can usually attribute it to a relationship
between two or more of the independent variables. Let's look at the data on house size and lot size.

The high correlation coefficient between house size and lot size — 94% — is the culprit in the Case of the
Dropping Significance.

When two of the independent variables are highly correlated, one is essentially a proxy for the other. This
phenomenon is called multicollinearity. In our example, lot size is a good proxy for house size.

Both house size and lot size contribute to the price of a home. But because these two variables are closely
correlated in our data set, there is not enough information in the data to discern how their combined
contributions should be attributed to each of these two variables.

The net effect of house size on price should tell us the effect of house size on price assuming that the lot size is
fixed. However, we can't detect this effect in the data: house size and lot size are so closely related that we've
never seen house size vary much when lot size is fixed.

Would dropping the variable house size improve the predictive power of the regression model? The multiple
regressions with and without house size have different numbers of independent variables, so we use adjusted R-
squared to compare their predictive power.

Without house size, adjusted R-squared is 90.89%, slightly lower than 91.40%, the adjusted R-squared for the
regression including house size. Thus, although the regression model cannot accurately estimate the effect of
house size when we control for lot size and distance, the addition of house size does help explain a bit more of
the variance in selling price.

Diagnosing and Treating Multicollinearity


A common indication of lurking multicollinearity in a regression is a high adjusted R-squared value
accompanied by low significance for one or more of the independent variables. One way to diagnose
multicollinearity is to check whether the p-value on an independent variable rises when a new independent variable
is added, suggesting strong correlation between those independent variables.
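
Two concrete diagnostics are easy to compute: the pairwise correlations among the independent variables, and variance inflation factors (VIFs), which grow large when one independent variable is nearly a linear combination of the others. A sketch in Python with illustrative placeholder data (not the actual Silverhaven figures):

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # Illustrative independent variables; size and lot_size are made strongly related on purpose
    X = pd.DataFrame({
        "size":     [1_800, 3_500, 2_100, 3_000, 1_500, 2_700, 3_200],
        "lot_size": [6_000, 12_500, 7_200, 10_800, 5_100, 9_400, 11_600],
        "distance": [4.0, 11.0, 6.5, 3.0, 9.0, 7.5, 10.0],
    })

    print(X.corr())   # pairwise correlations; values near +/-1 are a warning sign

    X_const = sm.add_constant(X)
    for i, name in enumerate(X_const.columns):
        if name != "const":
            # VIFs well above 5 or 10 commonly indicate troublesome multicollinearity
            print(name, variance_inflation_factor(X_const.values, i))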

How much of a problem is multicollinearity? That depends on what we are using the regression analysis for. If
we're using it to make predictions, multicollinearity is not a problem, assuming as always that the historically
observed relationships among the variables continue to hold going forward.

In the house price example, we'd keep the house size variable in the model, because its presence improves
adjusted R-squared and because our judgment would suggest that house size should have an impact on price
separate from the effect of lot size.

If we're trying to understand the net relationships of the independent variables, multicollinearity is a serious
problem that must be addressed. One way to reduce multicollinearity is to increase the sample size. The more
observations we have, the easier it will be to discern the net effects of the individual independent variables.

We can also reduce or eliminate multicollinearity by removing one of the collinear independent variables.
Identifying which variable to remove requires a careful analysis of the relationships between the independent
variables and the dependent variable. This is where a manager's deep understanding of the dynamics of the
situation becomes invaluable.

In our home price example, we'd expect both house size and lot size to have significant and discernable effects
on the price of a home: a shack on an acre of land should cost less than a mansion on a similar property.

To better understand the net effect of house size we should probably gather a larger sample to reduce
multicollinearity. If we didn't expect lot size and house size to have distinct effects on price, we might remove
house size from the equation.

Summary
Multicollinearity occurs when some of the independent variables are strongly interrelated: distinguishing the
respective effects of some of the independent variables on the dependent variable is not possible using the
available data. Multicollinearity is typically not a problem when we use regression for forecasting. When using
regression to understand the net relationships between independent variables and the dependent variable,
multicollinearity should be reduced or eliminated.

Lagged Variables
In the Silverhaven real estate example, we looked at a number of houses and collected data on four
characteristics: price, house size, lot size, and distance from the city center. These are cross-sectional data:
we looked at a cross-section of the Silverhaven real estate market at a specific point in time.

A time series is a set of data collected over a range of time: each data point pertains to a specific time period.
Our EasyMeat data is an example of a time series — we have data on sales and advertising levels for ten
consecutive years.

Sometimes, the value of the dependent variable in a given period is affected by the value of an independent
variable in an earlier period. We incorporate the delayed effect of an independent variable on a dependent
variable using a lagged variable.

The effects of variables such as advertising often carry over into the later time period. For example, last year's
EasyMeat advertising levels may continue to affect this year's sales.

To study this carry-over effect, we add the lagged variable "previous year's advertising." Our regression now
has two independent variables: the current year's advertising level and last year's advertising level.

To run a regression on the lagged EasyMeat advertising variable, we first need to prepare the data for the
lagged variable: we copy the column of advertising data over to a new column, shifted down by one row.

The first and last data points draw attention: the first has all the necessary data except a lagged advertising
value, and the last has a lagged value but no other information. Since we need observations with data on all
variables, we are forced to discard the first observation as well as the extraneous piece of information in the
lagged variable column.

In effect, by introducing a lagged variable, we lose a data point, because we have no value for "previous year's
advertising" for the first observation.

We run the regression as we would any ordinary multiple regression. The equation we obtain is:

EasyMeat Sales Data


EasyMeat Sales Regressions

Comparing the output from the regressions with and without the lagged variable, we first notice that the
number of observations has decreased from 10 to 9. Does the addition of the lagged advertising variable
improve the regression?

The lagged variable decreases adjusted R-squared from 84.11% to 76.62%. Moreover, the lagged variable's
high p-value — 0.3305 — tells us that we cannot distinguish its effect from zero. We conclude that the addition
of the lagged variable has not improved the regression.

We can introduce variables with even longer lags. We might try to use advertising data from two or more years
in the past to help explain the present sales level. However, for each additional period of lag we lose another
data point.

Running the EasyMeat regression with the current year's advertising, last year's advertising, and the
advertising level from two years ago leaves only eight observations. We lose predictive power: adjusted R-
squared drops from 76.62% to 62.47%. Moreover, none of the coefficients is significant at the 0.05 level.

Adding a lagged variable is costly in two ways. The loss of a data point decreases our sample size, which
reduces the precision of our estimate of the regression coefficients. At the same time, because we are adding
another variable, we decrease adjusted R-squared.

Thus, we include a lagged variable only if we believe the benefits of adding it outweigh the loss of an
observation and the "penalty" imposed by the adjustment to R-squared.

Despite these costs, lagged variables can be very useful. Since they pertain to previous time periods, they are
usually available ahead of time. Lagged variables are often good "leading indicators" that help us predict
future values of a dependent variable.

Summary
We can use lagged variables when data consist of a time series, and we believe that the value of the
dependent variable at one point in time is related to the value of an independent variable at a previous point
in time. Lagged variables are especially useful for prediction since they are available ahead of time.
However, they come at a cost: we lose one observation for each time series interval of delayed effect we
incorporate into the lagged variable.

Dummy Variables
So far, we've been constructing regression models using only quantitative variables such as advertising
expenditures, house size, or distance. These variables by their nature take on numerical values.

But many variables we study are qualitative or categorical: they do not naturally take on numerical values,
but can be classified into categories.

For example, in our house price example, a categorical variable might be the home's primary construction
material: wood, brick, or straw. Assigning numerical values to such categories doesn't make sense — wood
cannot be described as a brick plus some number. How do we incorporate categorical variables into a
regression analysis?

Let's look at an example from Julius Tabin's EasyMeat business. We have found that EasyMeat sales are
determined to a large extent by how much he spends on advertising. Another factor affecting sales is which
flavor of EasyMeat Julius features in his advertising campaigns.

Julius produces both EasyMeat Classic, made from beef for the most part, and EasyMeat Poulk!, a pork and
poultry blend.

Hoping to match the right spreadable meat flavor to the mood of the nation, Julius features Poulk! in his ad
campaigns in some years. In other years, Classic takes the lead role.

To study the effects of Julius' flavor-of-the-year choice, we use a type of categorical variable called a dummy
variable. A dummy variable takes on one of two values, 0 or 1, to indicate which of two categories a data point
falls into.

This dummy variable — we'll call it "Poulk!" flavor — is set to 1 for years when EasyMeat ads feature Poulk!,
and 0 for years when they feature Classic.

Run the regression on the data, with the sales as the dependent variable, and advertising and the Poulk! flavor
dummy variable as the independent variables. What is the coefficient for the Poulk! flavor variable? Enter the
coefficient for the Poulk! flavor variable as an integer (e.g., "5"). Round if necessary.

EasyMeat Sales Data


EasyMeat Sales Regressions

We find a coefficient of 533,024 for Poulk! flavor. What does this coefficient tell us?

The coefficient 533,024 tells us that after controlling for the level of advertising expenditures, average sales
are $533,024 higher when Poulk! is the flavor-of-the-year, rather than EasyMeat Classic.

This regression model can be expressed graphically: essentially we have two parallel regression lines: one for
years in which Classic is promoted, and one for "Poulk! Years." The vertical distance between the lines is the
average increase in sales Julius would expect if he chooses to feature Poulk! in a given year, after controlling
for advertising level.

The slope of each line is the same: it tells us the average increase in sales when Julius spends another dollar in
advertising after controlling for which flavor is featured.

In this example, we arbitrarily chose to set the dummy variable equal to zero for the Classic flavor, i.e. to make
Classic the "base case" flavor. If we made Poulk! the base case flavor, we would obtain exactly the same graph.
The coefficient on flavor now would be -533,024, but it would again indicate that Julius should expect average
sales to be $533,024 lower in the years that his ads feature Classic rather than Poulk!.

As for any independent variable in a regression, we can test a dummy variable's significance by calculating a p-
value for its coefficient. When the p-value is less than 0.05, we can be 95% confident that the coefficient is not
zero and that the dummy variable is significantly related to the dependent variable.

If Julius produced and promoted more than two flavors, we'd need another dummy variable for each
additional flavor. In general, we'd need one fewer dummy variable than the number of flavors: one variable for
each of the flavors except for Classic, the base case.

For each year that a given flavor is featured, its dummy variable is set to one, and all other flavor variables are
set to zero. In a "Classic Year," all dummy variables are set to zero.

Why don't we have exactly as many dummy variables as categories? Suppose we have just two categories —
Poulk! and Classic — and we tried to use separate dummy variables for each — P and C. In a Poulk! Year, P = 1
and C = 0. In a Classic Year, P = 0 and C = 1. The second dummy variable would be completely redundant:
since there are only 2 flavors, if P = 0 we know the flavor is not Poulk!, so it must be Classic.

Technically, including both dummy variables would be problematic because they would be perfectly
correlated: whenever P = 1, C = 0, and whenever P = 0, C = 1. Perfectly correlated variables are perfectly
collinear: we gain nothing by including both, and we must drop one to avoid perfect multicollinearity, which
would make the regression coefficients impossible to estimate.

It is useful to note that running a simple regression analysis with only a dummy variable is equivalent to
running a hypothesis test for the difference between the means of two populations. In this case, one
population would be sales in Poulk! years, and the other would be sales in Classic years. From the data we find
the mean and standard deviation of sales in Poulk! and Classic years. The hypothesis test gives us a p-value of
0.43, so we cannot conclude that the means differ when the featured flavor is different.

A simple regression of sales on the dummy variable flavor gives us the same result. The p-value on the dummy
variable is 0.45, telling us that the flavor dummy variable is not statistically significant. The p-values differ
slightly for the two approaches due to differences in the way Excel calculates the terms, but the tests are
conceptually equivalent.
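
As a rough illustration of this equivalence (again with made-up numbers), a simple regression on a 0/1 dummy
and a pooled-variance two-sample t-test produce the same p-value:

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(1)

# Two made-up groups of yearly sales: Classic years and Poulk! years.
classic = rng.normal(4_000_000, 600_000, size=12)
poulk = rng.normal(4_300_000, 600_000, size=12)

# Two-sample t-test for a difference in means (pooled variances).
_, p_ttest = stats.ttest_ind(poulk, classic, equal_var=True)

# Simple regression of sales on a 0/1 flavor dummy.
sales = np.concatenate([classic, poulk])
dummy = np.concatenate([np.zeros(12), np.ones(12)])
ols = sm.OLS(sales, sm.add_constant(dummy)).fit()

# With pooled variances the two p-values are identical.
print(p_ttest, ols.pvalues[1])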

Summary
We can use dummy variables to incorporate qualitative variables into a regression analysis. Dummy
variables have the values 0 and 1: the value is 1 when an observation falls into a category of the qualitative
variable and 0 when it doesn't. For qualitative variables with more than two categories, we need multiple
dummy variables: one fewer than the number of categories.

Solving the Staffing Problem (III)


With one eye on the lookout for signs of lurking multicollinearity and two new types of variables in your toolbox,
you set out to settle the staffing problem for good.

You might have noticed that the number of arrivals follows an annual seasonal pattern. Arrivals tend to drop off
during the late summer, surge again in October, and drop very low for the rest of the year.


They pick up briefly in February, but the tourist business slows through the spring. During the early summer,
vacationers start arriving in droves, with arrivals peaking in June or July.


Perhaps we can use the seasonality of arrivals in some way. We might come up with a lagged variable that
functions as a proxy for the current month's arrivals. Then we can run a regression with the lagged variable.
Since the values of a lagged variable would be based on historical data, Leo would know them ahead of time.

Given that arrivals follow this seasonal pattern, which of the following variables is likely to be a good proxy for
this month's arrivals on Kauai?


You run the simple regression of Kahana occupancy versus 12-month lagged arrivals.

Kahana Occupancy Data



What is the lowest level at which the "lagged arrivals" variable is significant?

Kahana Occupancy Data


Kahana Occupancy Regressions

The regression output shows a p-value of 0.000 for the lagged arrivals coefficient, far below the significance
level of 0.01.

At 55%, the R-squared for the simple regression on lagged arrivals is much lower than for the simple regression
on current arrivals, 80%. The adjusted R-squared is only 53%. Still, lagged arrivals have substantial predictive
power. Run a multiple regression of occupancy versus lagged arrivals and advance bookings.

Kahana Occupancy Data

What is the adjusted R-squared for this multiple regression? Enter the adjusted R-squared as a decimal number
with two digits to the right of the decimal point (e.g., enter "50%" as "0.50").

Kahana Occupancy Data


Kahana Occupancy Regressions

The adjusted R-squared of 60% for the multiple regression is higher than the 53% for the simple regression. But
is that the best you can do? For the moment, you report your analysis to Alice.

This model with the lagged variable is more useful to Leo, even if its predictive power is still pretty low. But let's
look at these data on Leo's competition that I dug up. Maybe we can use them to give us an even better model.

Leo's competitor, Knut Steinkalt at the Hotel Excelsior, frequently launches promotion campaigns: in some
months, Steinkalt slashes room prices dramatically.

These data show the months in which the Excelsior's promotions took place in the last three years. We can use a
dummy variable to see how the competition's promotions have affected Leo's occupancy.

Kahana Occupancy Data

The Excelsior has to advertise these promotional packages at least a month in advance to attract customers. As
long as Leo keeps an eye on the Excelsior's advertising, he'll be alerted to these promotions in enough time to
take them into account when he makes staffing decisions for the following month.

Kahana Occupancy Data

Using Alice's research, you run a simple regression of occupancy versus the promotions. Then you run the
multiple regression of occupancy versus advance bookings, lagged arrivals, and Excelsior promotions.

Kahana Occupancy Data

Suppose Leo has used the multiple regression equation to predict July's occupancy using a given level of advance
bookings and last July's arrivals. Just before he makes his staffing decisions, he learns that, unexpectedly, his
rival Knut has cut the Excelsior's room prices for July. Leo should revise his predicted occupancy by

Kahana Occupancy Data


Kahana Occupancy Regressions

The gross effect of Excelsior promotions on Kahana occupancy, a reduction of 52 guests, is not relevant here
since we know the advance bookings and lagged arrivals for July are still the same as they were before the
Excelsior launched its campaign. Leo wants to use the net effect of the promotion on his occupancy levels,
assuming advance bookings and lagged arrivals remain fixed — an average reduction in occupancy of 60
guests.

Adding Excelsior promotions to the regression analysis

Kahana Occupancy Data


Kahana Occupancy Regressions

Upon Leo's return, you eagerly report your results.

This is good work. Naturally, I'd like to have even greater predictive power than 84%, but I realize you aren't
psychics. This model will really help me when I hire staff. Thanks!

I have some good news of my own. Mr. Pitt agreed not to file suit against me! I did have to promise him an
extended stay in the Kahana's penthouse, free of charge.

He's coming next month — there's one occupant I don't need to use statistics to predict. This time, I'll serve the
bisque myself.

Thanks again for all your help!

Exercise 1: The Kiwana Quandary


Linda Szewczyk, marketing director of Amalgamated Fruits Vegetables & Legumes (AFV&L), is researching
the nation's fruit consumption habits. In particular, she would like greater insight into household
consumption of the kiwana, a cross-breed of kiwis and bananas that AFV&L pioneered.

Naturally, one important determinant of household consumption is the size of the household — the number of
members. Since AFV&L has positioned the kiwana as a "high end" fruit, Linda believes that household income
may also influence its consumption.

Run a multiple regression of household kiwana consumption versus household size and income. Make note of
important regression parameters such as R-squared, adjusted R-squared, the coefficients, and the coefficients'
significance. The income variable has a coefficient of 0.0004. Can a variable with such a small coefficient be
statistically significant?

Kiwana Consumption Data

The independent variable, income, is statistically significant since its p-value is less than 0.05, the most
common level of significance. The small coefficient tells us that for every additional $10,000 of income,
average kiwana consumption increases by 4 lbs. a year (0.0004 lbs. per dollar × $10,000 = 4 lbs.).

To date, AFV&L has focused its marketing campaigns on high-income, highly educated consumers. Linda would
like to deepen her understanding of how the educational level of the household members might affect their
appetite for kiwanas.

Kiwana Consumption Data

To incorporate education into her kiwana consumption analysis, Linda separated the households in her data
set into three categories based on the highest level of education attained by any member of the household —
no college degree, college degree but no post-graduate degree, and post-graduate degree. She represents these
categories using two dummy variables — "college only" and "post-graduate."

Run a regression on all four independent variables.

Kiwana Consumption Data

Controlling for household size, income, and post-graduate degree, how many more pounds of kiwanas are
consumed in a household in which the highest educational level is a college degree, compared to a household
in which no one holds a college degree?

Kiwana Consumption Data


Kiwana Consumption Regressions

The coefficient for the dummy variable "college only," 51.6, tells us the expected difference in kiwana consumption
for "college degrees only" households compared to the excluded educational category: households in which no
one holds a college degree. The coefficient describes the net relationship between "college degrees only" and
household kiwana consumption, controlling for household size, income, and post-graduate degree.

Controlling for household size and income, how many more pounds of kiwanas are consumed in a household
in which the highest educational level is a post-graduate degree, compared to a household in which the highest
educational level is a college degree? Enter the difference in consumption between the two households as a
decimal number with two digits to the right of the decimal point (e.g., enter "5" as "5.00"). Round if necessary.

Kiwana Consumption Data


Kiwana Consumption Regressions

When you control for household size and income, college degree households consume 51.63 lbs more than
non-college households. Post-graduate degree households consume 51.95 lbs more than non-college
households. In other words, post-graduate households consume 0.32 lbs more than college households.

What does the analysis of household kiwana consumption indicate?

Kiwana Consumption Data


Kiwana Consumption Regressions

Exercise 2: The Return of the Empire


For this exercise, refer to the regression analyses you ran in exercise 1 of the previous section.

Empire Learning Data


Empire Learning Regressions

Bill Hartborne, the CEO of Empire Learning, is using regression analysis to predict the number of labor-hours
it will take his team to create a new e-learning course. He is using data on previous courses Empire created,
with the number of pages and the total animation run-time as independent variables.

Empire Learning Data


Empire Learning Regressions

Bill believes that the number of illustrations used in the course may also have a significant impact on the
number of labor-hours it takes to complete an e-learning course. He wants to add the number of illustrations
to the model as another independent variable.

Empire Learning Data


Empire Learning Regressions

Run the simple regression of labor-hours versus number of illustrations.

Empire Learning Data


Empire Learning Regressions

At which level is the number of illustrations a statistically significant independent variable?

Empire Learning Data


Empire Learning Regressions

Run the multiple regression of labor-hours versus number of pages, illustrations, and animation run-time.

Empire Learning Data


Empire Learning Regressions

Is there evidence of multicollinearity in the data?

Empire Learning Data


Empire Learning Regressions

A common symptom of multicollinearity is a high adjusted R-squared — in this case 94% — accompanied by
one or more independent variables with low significance. In this case, the coefficient for the number of
illustrations is not significant at the 0.05 level, and the p-value for the number of pages has risen to 0.0291, up
from 0.0015 in the regression without illustrations.

Which of the following is the likely culprit of multicollinearity?

Empire Learning Data


Empire Learning Regressions

Multicollinearity occurs when the respective effects of two or more independent variables on the dependent
variable are not distinguishable in the data. This can be the result of correlated independent variables. The
fact that the p-value for the number of pages rises when we add the illustrations raises our suspicions that the
number of illustrations and the number of pages might be correlated.

We can compute the correlation between the number of pages and the number of illustrations: the correlation
coefficient, 67%, is fairly high.
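
As a sketch of this diagnostic step (with hypothetical figures, not Empire Learning's actual data), the
correlation between two candidate predictors can be computed directly:

import numpy as np

# Hypothetical course data: courses with more pages also tend to use
# more illustrations.
pages = np.array([40, 55, 60, 72, 80, 95, 110, 120])
illustrations = np.array([12, 20, 18, 30, 33, 35, 50, 48])

# A high correlation between two independent variables is a warning sign
# that their separate effects may be hard to distinguish.
corr = np.corrcoef(pages, illustrations)[0, 1]
print(f"correlation between pages and illustrations: {corr:.2f}")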

We could also attempt to diagnose the cause of the multicollinearity by running a regression of labor-hours
versus number of pages and number of illustrations - omitting animations. Here, the significance of
illustrations is extremely low, with a p-value of 0.85.

Empire Learning Regressions

In the regression of labor-hours versus number of illustrations and run-time of animations - omitting pages -
the respective effects of the independent variables on the dependent variable can be distinguished. Here, the
p-values for both variables are much lower than 0.05. All of the evidence points to a linear relationship
between number of pages and number of illustrations as the culprit for the multicollinearity.

Empire Learning Regressions

Bill wants to use the regression analysis to predict the number of labor-hours it will take to complete a new e-
learning course. Comparing the two regression models - one with all three independent variables, one without
illustrations - which should Bill use?

Empire Learning Data


Empire Learning Regressions

Decision Analysis
Introduction
Three more days before you leave Hawaii. As excited as you are by your imminent debut at business school, you
don't look forward to your departure. You find yourself weighing the relative merits of business school and just
staying on Kauai. Leo has an opening in the restaurant...

The shrill ring of the telephone interrupts your reverie. Sounding excited even by Leo standards, the hotelier
demands your immediate attention in his office.

Dining at Sea: The Chez Tethys Problem


I've had an exciting business venture on my mind for some time: I want to run a floating restaurant. It's going to
be amazing, offering spectacular views of volcanic silhouettes and the bright lights of resort towns, the best
Hawaiian culinary artistry, tastefully alluring hula dancing and music, ...

Slow down, slow down, Leo. I'm interested to hear your idea, but let's focus on the big picture first. A floating
restaurant?

I want to run a world-class restaurant on a yacht. Chez Tethys, I'll call it. I've had the idea for some time. Chez
Tethys will be Hawaii's most luxurious dining experience. The yacht will travel from island to island, docking in
a different harbor each night.

Seating will be limited, and you can board only at selected times. The challenge alone of making a reservation
will make it irresistible. It would become a matter of prestige to include a meal Chez Tethys on any trip to the
islands.

The Chez Tethys has been on my mind since before I inherited the Kahana. I'd been looking for investors, but
without luck. Then the Kahana fell into my lap, and the hotel has preoccupied me since then.

I'm beginning to feel confident about running the hotel — thanks to you two, I should add. Now I feel like I'm
ready to expand my operations. In addition to its own profits, just think of the prestige a high-profile floating
restaurant would add to the Kahana!

Listen, Leo, this is an intriguing proposition, but I'm not fully convinced that a floating restaurant would really
catch on. Let's go over your business plan, and then we'll think about how likely it is that the Tethys will be a
profitable venture. One quick question: How do you intend to finance the Tethys?

I'm way ahead of you. I've already been to the bank, and I think I can take out a loan using the Kahana as
collateral. I have a really good feeling about the Tethys. I'm pretty confident that I can make a floating restaurant
work.

"Leo's Chez Tethys idea is out of the ordinary, but very interesting," Alice muses. "But we'll need to guide him
through the decision to expand his operations so extravagantly, especially since he's putting the Kahana at risk.
We wouldn't want him to make a choice he'll regret."

Introducing Decision Analysis


S&C Films, a small film production company, just purchased a script for a new film called Cloven. Cloven's
unusual plot has the makings of a cult classic: a pastoral romance shattered by a livestock revolution and the
creation of a poultry-dominated totalitarian farm state. S&C's owner and producer, Seth Chaplin,
enthusiastically pitches the film: "It's Animal Farm meets Casablanca."

Seth entered into negotiations with two major studios — Pony Pictures and K2 Classics — and hammered out a
prospective deal with each of them.

Pony Pictures liked the script and agreed to pay S&C $10 million to produce Cloven, covering the production
costs and a profit margin for S&C. Seth estimates Cloven would cost S&C $9 million to produce. Under this
agreement, Pony would acquire all rights to the movie.

The second deal, negotiated with K2 Classics, stipulates that K2 would market and distribute the movie, and
S&C would cover the production costs itself. Under this agreement, K2 and S&C would split ownership of the
movie. Seth believes that the movie has huge potential. Even with partial ownership, S&C would reap enormous
profits if Cloven becomes a blockbuster.

This deal specifies that K2 would collect all revenues until its marketing and distribution costs are covered, then
take 35% of any further revenues. S&C would collect the remaining 65%. If Cloven is a blockbuster, S&C will
make a killing. If it flops, S&C will lose some or all of its production investment.

Which deal should Seth choose?

Decision-making is an essential management responsibility. Managers are charged with choosing from multiple
courses of action that can each lead to radically different consequences. Seth's decision about how to finance
Cloven's production might lead to anything from blockbuster profits to financial disaster.

When faced with a choice of actions and a range of possible outcomes, decision making can be difficult, in large
part due to uncertainty. Although the course of action we choose influences the outcome, much of what
happens is not only beyond our control, but beyond our powers to predict with certainty.

Nonetheless, decisions like Seth's that involve uncertainty can be analyzed logically and rigorously: decision
analysis tools can help managers weigh alternative options and make informed and rational choices.

In a world of uncertainty, applying decision analysis will not guarantee that each decision you make will lead to
the best result, or even to a good result. But if you apply effective decision analysis consistently over the course
of a managerial career, you are almost certain to gain a reputation for sound judgment.

Summary
Decision analysis is a set of formal tools that can help managers make more informed decisions in the face of
uncertainty. Although applying even the most rigorous decision analysis does not guarantee infallibility, it can
help you make sound judgments over the course of your career.

Decision Trees
"The problem with Leo's floating restaurant is that its success hinges on so many factors," Alice tells you. "Some,
Leo can predict, like the approximate price of a yacht, or the cost of labor. But what about consumer demand?
Truth be told, not even Leo knows whether the Chez Tethys will catch on. Leo has to incorporate this uncertainty
into his decision-making process."

Uncertainty and Probability


Producer Seth Chaplin boasts that when Cloven is released in theaters, it will become an instant classic. But even
an incurable enthusiast like Seth would never claim that success is certain, and if asked how confident he is that
Cloven will achieve at least modest success, he might reply with a percentage: "80%."

Uncertainty clearly has a quantitative dimension: we can distinguish between degrees of uncertainty.

But our intuition about uncertainty is often underdeveloped. How do we measure uncertainty consistently and
systematically?

Probability is a measure of uncertainty. A meteorologist reports the probability of rain in her forecast: "We'll see
cloudy skies tomorrow, with a 20% chance of rain."

Probability is measured on a scale from 0 to 1, or 0% to 100%. Events that are impossible are said to have zero
(or zero percent) probability, and events that are absolutely certain are said to have a probability of one (or
100%). But how do we interpret probabilities of 20% or 37.8%?

Let's look at a very simple and familiar uncertain event: a spin of a wheel of fortune. Our wheel consists of two
equally-sized areas: an orange half and a green half. Spin the wheel a few times and count the number of green
and orange outcomes you get.

Since the two areas are of equal size, we'd expect "greens" and "oranges" to occur with equal frequency. About half
of the spins will end up "green" and half will end up "orange." Of course, if we only spin a few times, we won't
expect exactly half of the spins to result in "green." But the more times we spin, the closer we expect the
percentage of greens to come to 50%. This is the probability of getting a "green" each time we spin.

This interpretation of probability is a relative frequency interpretation. We have a set of events — in our case,
spins — and a set of outcomes — "green" or "orange." We form the ratio of the number of times a particular color
occurs to the total number of spins. The probability of that particular color is the value that ratio approaches as the
total number of spins approaches infinity.
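
A short simulation (illustrative only) shows the relative frequency interpretation at work: as the number of
simulated spins grows, the observed fraction of greens approaches the 50% probability of a green on any one spin:

import numpy as np

rng = np.random.default_rng(42)

# Simulate spins of a wheel that is half green, half orange.
for n_spins in (10, 100, 10_000, 1_000_000):
    spins = rng.random(n_spins) < 0.5       # True means "green"
    print(n_spins, spins.mean())            # observed fraction of greens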

Let's look at a different wheel. This wheel is also divided into two areas, but now the green area is larger: it covers
three quarters of the wheel.

What is the probability of a spin of this wheel resulting in the outcome "green"?

Since the green area covers three quarters of the whole wheel, we can expect that three out of four spins will result in a
green outcome. This corresponds to a probability of 75%.

For simple uncertain events like a wheel of fortune game, figuring out the probabilities of the outcomes is easy. There
are a limited number of possible outcomes and there are no hidden factors affecting the outcome. Besides, if we had
any difficulty believing that the probability of a "green" is 75%, we could simply spin the wheel over and over again to
calculate an approximation of the probability.

For more complex uncertain events like the weather or the success of a new venture, finding the probabilities of the
outcomes is more difficult. Many factors can affect the outcome; indeed, we may not even know all the possible
outcomes. Furthermore, there may be no simple experiment we can repeatedly run to generate approximations of the
relative frequencies. How should we think about probabilities in these situations?

Let's think about the weather. Anytime you leave your house, you face the decision of whether or not to take an
umbrella. If the sky is perfectly blue, you probably won't take one. Nor will you do so if there are a few white clouds in
the sky.

On the other hand, if the sky is completely overcast, you might consider taking some kind of rain protection, and if it's
already raining, you almost certainly would. You'd be able to give a rough estimate of the probability of rain; for
instance if it's overcast, you might estimate a 40% chance of rain.

This estimate may be rough, but it isn't completely arbitrary, and it's clearly a better estimate than either 0% or 100%.
Your subjective estimate may be based on an informal assessment of relative frequency. The estimate of "40%"
might be shorthand for the reflection: "In my past experience, it has rained on a little under half the days when it has
been this overcast in the morning at this time of the year, in this geographic location."

Although you probably never sat down and tabulated the days according to season, atmospheric conditions, and
rainfall, your informal assessment is powerful enough to make a reasonable choice about taking your umbrella. Though
sometimes, you may make the wrong choice.

Similarly, when Seth Chaplin gives his estimate of an 80% chance of Cloven's success, he is probably articulating the
following: "In my experience, about four out of five movies of a similar genre, with a similar budget, released under
similar conditions were successful."

In the examples we've discussed thus far, our uncertainty about the events we faced has stemmed from the fact that the
events had not yet happened. Sometimes we face uncertainty for a different reason: an event may have already
occurred, but we simply don't know what the outcome was. Let's return to the wheel of fortune to see how we might
think about uncertainty in these cases.

Imagine you and a friend spin the wheel. You close your eyes and keep them shut while your friend observes the wheel.
While the wheel is spinning, both you and your friend are uncertain about the outcome, and you would each assess the
probability of "green" at 75%.

When the wheel stops, the result becomes definite. Your friend knows the outcome: for her, the probability of "green" is
now either 0% or 100%. But as long as your eyes are closed, you remain uncertain about the outcome.

If someone promised to give you $10 if you correctly guessed the outcome of that spin before you opened your eyes,
which color would you choose?

From your limited point of view, the best choice you can make is "green," even though the outcome may already be a
definite "orange." If you played the same game over and over again and always chose "green," you would win three
quarters of the games. The probability of the outcome turning out to have been "green" when you open your eyes is
75%.

Even though the event has already occurred, the outcome is uncertain from your point of view. It makes sense to
measure the uncertainty due to a lack of information the same way we measure the uncertainty due to our inability to
predict the outcomes of future events.

Summary
Uncertainty makes decision-making challenging. We can be uncertain about outcomes that have or have not
occurred. To make the most informed decisions, we quantify our uncertainty using probability measures. Probability
is measured on a scale from 0% to 100%: events with 0% probability are impossible; events with 100% probability
are certain. An event's probability can often be determined by observing the relative frequency of its occurrence
within a set of opportunities for the event to occur. For events for which relative frequencies are difficult to assess, we
often make subjective estimates of the probabilities. Though not based on "hard" data, these estimates are often a
sufficient basis for sound decision making.

Structuring Decision Trees


Alice whips out a notepad and pencil, and begins drawing something that vaguely resembles a saguaro cactus. "Let's
look at how we can analyze and structure decisions like Leo's."

Graphical representations are often a highly efficient way to organize and convey information. Scatter plots and
histograms efficiently communicate distributions of data; an organizational chart can quickly outline the structure of a
company; and pie charts effectively express information about proportions and probabilities.

Is there such a thing as a graphical tool that helps inform and organize the decision-making process?

The answer is yes, and the graphical tool is called a decision tree. Let's look at a simple decision tree.

Seth Chaplin's business decision — whether to produce Cloven for Pony Pictures or in partnership with K2 Classics —
involves two alternatives. At some point, Seth must make a decision. Until he does, there are two paths along which
history can unfold. Cloven will either be owned by Pony or co-owned by S&C Films and K2 Classics. A different branch
of the tree represents each alternative.

Decisions aren't the only points at which alternatives can branch off. Events with uncertain outcomes also lead to
branching alternatives. Once released in theaters, Cloven could be a "Blockbuster" hit, have a fair, but "Lackluster"
performance, or "Flop" completely. Each of these outcomes corresponds to a separate branch on the decision tree.

Decision trees have two types of branching points, or nodes. At decision nodes, the tree branches into the
alternatives the decision-maker can choose. We use square boxes to represent decision nodes.

At chance nodes, the tree branches because the uncertainty of the business world permits multiple possible
outcomes. We represent chance nodes by circles.

We arrange the nodes from left to right in the order in which we will eventually determine their results. In Seth's
example, the first thing to be determined is his own decision about which deal to choose. The next thing to be
determined is Cloven's box office performance.

One of the challenges of creating a decision tree is determining the correct sequence of nodes: in what order will the
relevant outcomes be determined? Which events "depend" on the occurrence of other events? What alternatives are
created or foreclosed by prior decisions or by the outcomes of uncertain events?

Drawing a decision tree forces us to delineate each alternative and clarify our assumptions about those alternatives.
Thus, even before we use a tree to make a decision, simply structuring the tree helps us clarify and organize our
thoughts about a problem. A decision tree is also useful in its own right as a tool for communicating our understanding
of a complex situation to others.

Categorizing branching points as decision or chance nodes makes clear which events and outcomes are under our
power and which are beyond our control.

Writing down alternatives in a systematic fashion often allows us to think of ideas for new alternatives. For instance,
after viewing the decision tree, Seth might realize that he would prefer a deal that pays part of his production costs but
still allows a small ownership stake as opposed to either of the deals he has hammered out with Pony and K2.

Summary
Decision trees are a graphical tool managers use to organize, structure, and inform their decision-making. Decision
trees branch at two types of nodes: at decision nodes, the branches represent different courses of action the decision-
maker can choose. At chance nodes, the branches represent different possible outcomes of an uncertain event - at the
time of the initial decision, the decision maker does not know which of these outcomes will occur (or has occurred).
The nodes are arranged from left to right, in the order in which the decision-maker will determine which of the
possible branches actually occurs.

Incorporating Data into Decision Trees


Laying out a decision tree — mapping a logical structure and identifying the alternative scenarios — is an essential
first step in the decision-making process. But how do we compare different scenarios? On what criteria might we
base our preference for one scenario over another?

In the business world, success is often measured in profits. Seth Chaplin will consider Cloven a success if it brings in
at least a modest profit. How should Seth incorporate possible profit levels into his decision analysis?

Seth's decision tree contains four possible scenarios: four unique paths from his current decision node on the left to
the ultimate outcomes on the right. He must assign to each scenario a dollar amount — S&C's profits. For example, if
Seth chooses to produce Cloven for Pony Pictures, his expected profits will be $1 million.

What if Seth chooses the K2 deal? If he has a significant ownership stake in the movie, his profits will depend on
Cloven's box office success. With a "Blockbuster" movie come "Blockbuster" profits: Seth estimates around $6
million.

If Cloven's performance is "Lackluster," Seth thinks the Cloven project would break even for S&C Films: S&C's
expected profits would be roughly $0. If Cloven "Flops," S&C's production costs will exceed its revenues from the
movie, leading to an expected loss of $2 million.

Are these three the only possible outcomes? Clearly, no. S&C's profits on the release of Cloven could be $6.6 million
or $15.03. In fact there is a range of outcomes stretching from profits in the hundreds of millions of dollars — the
historical upper limit of movie profits — to a loss of $9 million — the amount S&C will invest in production.

Although it is mathematically possible to analyze the full range of possible outcomes, the procedure for doing so is
usually more complicated and time consuming than is warranted for many decision problems. In practice,
considering only a few representative scenarios as we have done in the Cloven example will typically lead to good
decisions.

When we use a scenario to represent a range of possible outcomes, the outcome figure we assign to that scenario
should represent the weighted average of all outcomes in the range that scenario represents. For example, in the
Cloven decision, $0 million might represent the weighted average of all possible profits from -$3 million to +$3
million.

Decision Trees and Probabilities


How can Seth use his decision tree and estimated profit values to choose the better deal?

If Seth chooses the partnership with K2, S&C could make $6 million. Clearly, that scenario is preferable to the
scenario in which S&C simply sells Cloven to Pony for a profit of $1 million.

If S&C retains part ownership of Cloven and it "Flops" in the theaters, S&C stands to lose $2 million. That scenario is
clearly worse than the scenario in which S&C sells Cloven to Pony.

If Cloven is highly likely to "Flop" in theaters, then Seth should choose the Pony deal; if it's highly likely to be a
"Blockbuster" he should choose the K2 deal. Clearly, the likelihood of the different outcomes in the K2 deal should
weigh heavily in Seth's decision.

To each branch emanating from a chance node we must associate a probability: the probability of that outcome
occurring. These probabilities may be based on historical data or on our best judgment of the likelihood of each
outcome.

Based on the quality of the script and his years of experience in the movie industry, Seth estimates the probability of
Cloven becoming a "Blockbuster" at 30%, the probability of a fair, but "Lackluster" performance at 50%, and the
probability of a "Flop" at 20%.

If S&C sells Cloven to Pony, the level of box office success is largely irrelevant to Seth. S&C's profits are certain to be
$1 million.

When assigning outcomes and probabilities to chance nodes, we must meet two requirements. First, outcomes that
branch off from the same chance node must be "mutually exclusive." Thus, two outcomes emanating
from the same node cannot both occur at the same time: the occurrence of one outcome excludes the occurrence of
any other.

Second, the set of outcomes that branch off from the same chance node must be "collectively exhaustive": the
branches must represent all possible outcomes.

In practice, we typically don't depict every possible outcome separately. For Cloven, we've reduced the vast range of
possible financial outcomes to a manageable set of three representative outcomes. However, we consider these three
outcomes collectively exhaustive in that they represent three ranges that together exhaust all possibilities.

Also, we usually consider extremely unlikely but not impossible events to be included in one of the existing branches,
or if sufficiently unlikely, to be irrelevant to decision-making. In the Cloven example, we don't include a branch
representing the simultaneous destruction of all existing copies of the Cloven film.

When we construct a chance node, we must make sure that the outcomes emanating from that node are mutually
exclusive and collectively exhaustive. Since it is certain that one and only one of the possible scenarios will occur, the
individual probabilities must add up to 100%.

Summary
For a decision tree to effectively inform a decision, it must incorporate two types of relevant data: endpoint values
corresponding to each scenario and the probabilities of the possible outcomes of each uncertain event. An outcome
value is associated with a scenario - a unique path from the first node on the left of the tree to an endpoint on the
right. We place the appropriate probability on each branch emanating from a chance node. Outcomes represented
by branches emanating from the same chance node must be mutually exclusive and collectively exhaustive.

Exercise 1: The Tardytech Blame Game


Ted Nesbit is a project manager at Tardytech, a software development company. Ted is planning the project schedule
and budget for the development of a new piece of custom business software for a Fortune 500 client.

The client has imposed a strict deadline for completion of the project. However, based on past experience with
similar clients, Ted knows that there is a significant probability that Tardytech will not be able to meet its deadline
due — in most cases — to client-side delays.

Experience indicates that 56% of all Tardytech projects are completed on time, 42% are delayed due to client-side
delays, and 2% are delayed due to errors and process failures at Tardytech.

What is the probability that Tardytech's new project will not be completed on time?

The total probability of a delay is the sum of the probabilities of a client-side delay and a Tardytech-side delay: 44%.

A potential new client has asked Tardytech to report the probability that a project will not be delayed due to
Tardytech errors or process failures. What probability should Ted report?

Only 2% of all projects are delayed due to problems caused by Tardytech. The remaining projects are either
completed on time (56%) or delayed by client-side delays (42%). The probability that a project will not be delayed by
Tardytech is the sum of the probabilities of those outcomes, 98%.

Another potential client has asked Tardytech to report the percentage of all delayed projects whose delays could be
attributed to Tardytech's errors and process failures. What percentage should Ted report?

In the past, 44 out of 100 projects were not completed on time. On average, 2 projects out of 44 delayed projects were
delayed due to Tardytech errors and process failures. Thus, the proportion of Tardytech-caused delays among all
delays is 2/44, or 4.5%. We'll discuss and analyze similar questions in greater depth in the next unit.
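
Written as a conditional probability (the kind of question the next unit takes up), the calculation is:

P(delay caused by Tardytech | project delayed) = 0.02 / (0.42 + 0.02) = 0.02 / 0.44 ≈ 4.5%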

Exercise 2: The First Bank of Silverhaven


As a loan officer at the First Bank of Silverhaven, Carla Wu determines whether or not to grant loans to small
business loan applicants.

Of 108 recent successful small-business loan applicants with a full year of payment history, 28 were unable to meet at
least one loan payment in the first year they had an outstanding loan.

What is the probability that a randomly selected small business in this pool missed a payment in the first year of the
loan?

Enter the probability as a decimal number with two digits to the right of the decimal point (e.g., enter "50%" as "0.50").
Round if necessary.

The probability that a randomly selected small business missed a payment in the first year of the loan is simply the
ratio of the number who missed at least one payment to the total number of loans, i.e., 25.9%.

Exercise 3: The Shipping Bea


Robin Bea is the CEO of a small shipping company. She needs to decide whether or not to lease another truck to add
to her current fleet.

If she leases the truck, she may be able to generate new business that would increase her company's profits.
However, if additional business fails to materialize, she may not be able to cover the incremental leasing costs.

Based on her assessment of future trends in the transportation sector, Robin identifies three scenarios that might
transpire: "Boom," "Moderate Growth," and "Slowdown," depending upon the performance of the economy and its
impact on the transportation sector. She associates with each scenario an estimate of the total profits an expanded
fleet will generate for her firm.

If she doesn't lease the new truck, Robin probably won't generate as much revenue, but her leasing costs will be
lower. Again, she breaks down the possible outcomes if she does not lease the additional truck into three scenarios
corresponding to the performance of the economy. Then she associates an estimate of total firm profits with each
scenario.

Which of the following trees best represents Robin's decision?

The first branching of the tree occurs at the point when Robin makes her decision: to lease or not to lease a new
truck. A square decision node should represent this decision.

Each of Robin's options splits into three possible outcomes depending on the performance of the economy. Robin
will learn which economic scenario transpires only after making her decision. The second branching represents this
uncertainty. The nodes of these branches are chance nodes and should be drawn as circles.

Exercise 4: Wheel of Fortune


You and a friend have a wheel of fortune that has a 75% chance of a "green" outcome and a 25% chance of a "red"
outcome. Before you spin, the friend offers you $100 if you correctly predict the outcome. If you choose the wrong
color, you receive nothing. The good news: you don't have to choose until the spin is complete. The bad news: you
will have to keep your eyes closed until the wheel has stopped and you have made your prediction.

You close your eyes and keep them shut while the wheel spins. When the wheel stops, your eyes are still closed. You
now must decide whether to choose "red" or "green."

Which of the following trees best represents your decision?

The nodes of a decision tree are arranged from left to right in the order in which we discover their results, not in the
order in which the events actually occur. Thus, even though the wheel stops and the result is finalized before you
make your decision, because you don't know the result until after your decision is made, the chance node for the spin
result should appear after — to the right of — the decision node.

Comparing the Outcomes


Alice's decision trees are neat devices to help organize and structure a decision problem. But how are they going to help
Leo evaluate his options and choose the best one?

Introducing the Expected Monetary Value


Seth Chaplin must decide how to produce the movie Cloven. He has mapped out the logical structure of his decision in
a decision tree and has incorporated the appropriate data, evaluating each scenario in terms of its expected profits and
its probability of occurring. Now, how should he use the tree to inform his decision?

If S&C Films works in partnership with K2 and retains part ownership of Cloven, S&C could earn a profit of $6 million,
a highly desirable outcome. However, that scenario is relatively unlikely: 30%. How do we balance the high value of
that outcome against its low probability?

The answer is elegant and simple: we multiply the outcome value by its probability: $6 million * 0.3 = $1.8 million.
Essentially, we "credit" the "Blockbuster" outcome with only 30% of its value.

Similarly, we credit a "Lackluster" performance of $0 profits with only half of its value.

The magnitude of the loss incurred if the film "Flops" is mitigated by its low likelihood of 20%.

Then we add these weighted values together, for a total of $1.4 million. This total is called the expected monetary
value (EMV) of the K2 option. The EMV is a weighted average of the expected outcomes of the scenarios in which S&C
retains part ownership of Cloven as stipulated in the K2 deal.
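
The EMV calculation for the K2 deal can be written out in a few lines of Python (figures from the text, in
millions of dollars):

# Outcome values and probabilities for the K2 deal, in millions of dollars.
outcomes = {"Blockbuster": 6.0, "Lackluster": 0.0, "Flop": -2.0}
probabilities = {"Blockbuster": 0.3, "Lackluster": 0.5, "Flop": 0.2}

# EMV: each outcome value weighted by its probability, then summed.
emv = sum(outcomes[s] * probabilities[s] for s in outcomes)
print(emv)   # 1.4 -- the expected monetary value of the K2 option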

Let's return to our wheel of fortune game to build our intuition for the concept of the EMV. This wheel consists of three
areas: "blue," "green," and "red." The blue area is 30% of the total area. If a spin results in "blue," you win $6.

"Green" covers 50% and "red" covers 20% of the wheel's area. If a spin results in "green," you gain nothing at all, if the
result is "red," you lose $2.

Suppose you play the game 100 times. How much money do you think you will have gained or lost over 100 spins?
What do you estimate will be your average yield per game?

About 30% of the time, a spin will result in "blue." In other words, you can expect 30 of the 100 spins to yield $6.

About 50% of the time a spin will result in "green." You expect 50 of the 100 spins to yield $0. And about 20% of the time
a spin will result in "red" — that is, you expect 20 of the 100 spins to cost you $2.

The total amount you can expect to win after playing the game 100 times is $140, and the average yield per spin is $1.40.
That average is the expected monetary value (EMV) for a single spin.

The expected monetary value for a single spin is $1.40. If you spin the wheel once, which of the following results is least
likely to occur as the outcome?

The probability of each outcome is shown below. With a probability of 0%, $1.40 is the least likely outcome.

It is important to understand the nature of the expected monetary value. We do not actually "expect" an outcome of
$1.40. In fact, $1.40 is not even a possible outcome. The EMV of $1.40 is the long-term average value of the outcomes of
a large number of spins.

In the Cloven case, the EMV of $1.4 million is the average amount of profits Seth Chaplin can expect to make when he
produces similar films in similar circumstances.

We can use the EMV as a measure with which to compare alternative options. First, we calculate the EMV for each chance
node, beginning at the right of the tree. For the chance node associated with the K2 deal, the EMV is $1.4 million.

Now that we have calculated the EMV of the K2 chance node, we can "collapse" the branches emanating from the chance
node to a single point. Going forward, we can treat the EMV of $1.4 million as the endpoint value of the K2 option.

The EMV for the Pony Option is simply $1.0 million: the outcome value multiplied by its probability of 100%.

At a decision node, we choose the best EMV of all the branches emanating from that decision node. In our Cloven
example, the best EMV is the one with the highest expected profits. The EMV of the K2 deal, $1.4 million, exceeds $1.0
million, the EMV of the Pony deal. Selecting the option with the best EMV and removing all other options from
consideration is known as "pruning" the tree.

Any decision tree — no matter how large or complex — can be analyzed using two simple procedures. At each chance
node, calculate the EMV, collapse the branches to a point, and replace the chance node with its EMV. At each decision
node, compare the EMVs and prune the branches with less favorable EMVs. This entire process is known as folding
back the decision tree.
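
Here is a minimal Python sketch of the fold-back procedure, using the Cloven figures from the text (the node
representation is just one possible choice):

# Profits are in millions of dollars.
def fold_back(node):
    """Return a node's EMV by working from the endpoints back toward the root."""
    if node["type"] == "endpoint":
        return node["value"]
    if node["type"] == "chance":
        # EMV of a chance node: probability-weighted average of its branches.
        return sum(p * fold_back(child) for p, child in node["branches"])
    if node["type"] == "decision":
        # At a decision node, keep the branch with the best EMV ("pruning").
        return max(fold_back(child) for child in node["options"])
    raise ValueError("unknown node type")

cloven_tree = {
    "type": "decision",
    "options": [
        {"type": "endpoint", "value": 1.0},                 # Pony deal
        {"type": "chance", "branches": [                    # K2 deal
            (0.3, {"type": "endpoint", "value": 6.0}),      # Blockbuster
            (0.5, {"type": "endpoint", "value": 0.0}),      # Lackluster
            (0.2, {"type": "endpoint", "value": -2.0}),     # Flop
        ]},
    ],
}

print(fold_back(cloven_tree))   # 1.4 -- the EMV of the better (K2) option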

Summary
We often use expected monetary value (EMV) to quantify the value of uncertain outcomes. The EMV is the sum of the
values of the possible outcomes of an uncertain event after each has been weighted by its probability of occurring. The
EMV can be interpreted as the expected average outcome value of the uncertain event, if that uncertain event were
repeated a large number of times. To analyze a tree, we "fold it back": we move from right (the future) to left, finding
the EMV for each node. For chance nodes we calculate the EMV as described above. For decision nodes we simply
choose the option with the best EMV — lower costs or higher profits — among the choices represented by a decision
node's branches and prune the others.

Relevant Costs
Burning to use decision trees and EMVs, you start to draw up the square and circular nodes that make up Leo's Chez
Tethys decision. Alice, however, urges caution: "Before you get too trigger-happy with those decision trees, you should be
aware of a few common pitfalls."

Jen Amato has been driving "Millie," an old jalopy of a car, for the past five years. Millie — an Oldsmobile Delta-88 — is
twenty years old. A week ago, the air conditioner broke down, and Jen had it replaced at a cost of $500.

Now, the drive train is worn out. Replacing it will cost $1,200. If she doesn't repair it, Jen can sell Millie "as is" for about
$300. Jen must decide: should she sell her car or should she have the drive train replaced?

If Jen sells her car now, she will not even recoup her $500 investment in the air conditioner. If she has the drive train
replaced, her beloved Millie will last a little longer, and Jen will be able to enjoy the cool air she spent so much money on.
How should the fact that Jen hasn't had the opportunity to benefit from her air conditioner investment affect her
decision?

With a broken drive train, Millie is worth about $300. According to her mechanic, if Jen pays $1,200 to replace the drive
train, Millie will be worth approximately $1,100 in terms of resale value. In the analysis of Jen's decision, what role
should her $500 investment in the air conditioner play?

In any scenario in which Jen sells her car, the $500 air conditioner cost has already been incurred, so it could be
included in the total cost.

Likewise, the air conditioner repair costs are part of the prehistory of any scenario in which Jen has Millie's drive train
replaced. Here, too, the $500 investment in the air conditioner could be included as a cost.

However, when we compare the total costs of the two options, we recognize that the $500 does not contribute to a
difference in the outcome values. Because the $500 cost is incurred in both cases, it plays no role in the comparison
of the two options, and thus is irrelevant to Jen's decision — she will have spent the $500 no matter what she decides to
do now.

Costs that were incurred or committed to in the past, before a decision is made, contribute to the total costs of all
scenarios that could possibly unfold after the decision is made. As such, these costs — called sunk costs — should not
have any bearing on the decision, because we cannot devise a scenario in which they are avoided.

It isn't wrong to include a sunk cost in the analysis as long as it is included in the value of every outcome. However,
including sunk costs distracts from the differences between scenarios — the relevant costs. Imagine the complexity of
Jen's tree if she included every sunk cost she's incurred from owning Millie — from the original purchase price of the car
to the cost of all the gasoline she's pumped into Millie over the years, to her expenditure on a dashboard hula dancer.

Misinterpreting a sunk cost as a cost that weighs on only some of the scenarios is a common error. After sinking $500 of
repairs into her car, selling it for $300 will be quite painful for Jen. Nonetheless, good decisions are made based on
possible future outcomes, not on the desire to correct or justify past decisions or mistakes.

Another common decision-making error is to omit from the analysis relevant costs that should have a bearing on the
decision. If Jen has Millie's drive train replaced, the repair costs are just one of many costs involved with that option.
Given Millie's age, there is a high likelihood of another repair cost soon. Similarly, if Jen sells Millie now, she will have
costs associated with buying a new car or arranging other transportation services.

Opportunity costs are an important cost category that decision makers often neglect to include in their analyses. For
example, selling the car will require Jen to devote 10 hours of her time that she would otherwise devote to her part-time
job paying $12 per hour. Thus, Jen should add $120 in opportunity costs to the outcomes of the "sell Millie" decision.

We should also take non-monetary costs into account. Jen will feel sad at leaving her trusted vehicle and companion on
many a road trip behind. Although such costs can be difficult to quantify, they should not be neglected. We will see
shortly how to use sensitivity analysis to incorporate non-monetary costs into a decision analysis.

Summary
Among the most common errors in decision analysis is the failure to properly account for the costs involved in different
possible scenarios. On the one hand, relevant costs such as opportunity costs or non-monetary consequences are often
omitted. On the other hand, irrelevant costs such as sunk costs are incorrectly included in the analysis. Sunk costs are
costs that were incurred or committed to prior to making the decision and cannot be recovered at the time the decision
is being made. Since these costs factor into any possible future outcome, they can be safely omitted from the analysis;
sunk costs must never be included in only selected branches of a decision tree.

Time Horizons
Jen has decided to replace her old car, Millie the Oldsmobile. Her friend, Sven, is leaving the country for two years and
doesn't want to pay to store his Mazda Miata. He offers Jen two options.

In the first option, Jen leases the car, paying Sven $700 each year. When Sven returns from abroad, he reclaims his car.
In the second option, Jen buys the car outright for $4,000. How should Jen compare these two options?

What is the appropriate cost difference Jen should use to compare the two options Sven has offered?

To begin, note that in the first option, the costs are spread across two years, whereas in the second option, Jen makes a
single payment when she closes the deal with Sven. To compare the options meaningfully, we must define a time
horizon for this problem, that is, we must compare their costs over a common time period. In this case, two years is a
convenient time horizon.

Next, note that we cannot directly compare $1,400, the simple sum of the two lease payments made at two different
times, to the $4,000 one-time cost of the purchase option paid entirely at the beginning of the first year. We can only
compare costs when they are valued at the same point in time.

We'll compare the present value of the costs associated with each option — that is, the value of the costs at the time
Jen makes her decision. In order to compare the costs, we first need to convert the second installment of the leasing
payment into its present value.

Jen currently has sufficient cash for either alternative in an investment account with a 5% rate of return. What is the
present value of the second installment of $700 paid under the leasing option?

The present value of the second installment of $700 is $666.67. At the end of one year, $666.67 residing in her
investment account today will have increased in value to $666.67*(1.05) = $700, the amount she must pay for the
second lease installment.

The present value of the total cost of the leasing option is $1,366.67. Can we use this number to compare the two
options?

Although $1,366.67 and $4,000 are now comparable costs because they are given at their present value, we have not
yet considered the fact that under the purchase option, Jen will own the Miata at the end of the two-year period. In two
years, Jen expects that the Miata's market value will be around $3,000. The value of an asset at the end of the time
horizon is typically called its terminal value.

What's the net present value of the cost of the purchase option - that is, the present value of future cash flows after
subtracting, or "netting out," the initial payment to Sven? Recall that Jen currently has her money invested in accounts
that earn a 5% average annual return.

If Jen resells the car in two years, she would receive $3,000. The present value of $3,000 received at the end of two
years is $2,721.09.

The net present value of the cost of the purchase option is $1,278.91.

If cost is the only deciding factor in Jen's choice, she'd save $87.76 if she chose to purchase the Miata. Before finalizing
her decision, Jen should make sure to weigh other considerations relevant to the decision, such as whether or not she
will need to borrow money to cover other expenses this year if she spends $4,000 to buy the Miata now, what the
borrowing rate is, etc.
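
For readers who want to verify the arithmetic, here is a minimal sketch (in Python) of the comparison above, assuming the 5% annual return and the two-year horizon described in the example:

    # Sketch: Jen's lease vs. purchase comparison at a 5% annual return.
    rate = 0.05

    # Option 1: lease -- $700 paid now and $700 paid at the end of year 1.
    lease_pv = 700 + 700 / (1 + rate)             # about 1,366.67

    # Option 2: buy for $4,000 now, resell for an estimated $3,000 in two years.
    terminal_value_pv = 3000 / (1 + rate) ** 2    # about 2,721.09
    buy_npv_of_cost = 4000 - terminal_value_pv    # about 1,278.91

    print(f"PV of leasing cost:      {lease_pv:,.2f}")
    print(f"NPV of purchase cost:    {buy_npv_of_cost:,.2f}")
    print(f"Savings from purchasing: {lease_pv - buy_npv_of_cost:,.2f}")  # about 87.76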

Summary
When we make a decision, we need to choose a time horizon over which we quantify the outcomes of our decision. To
compare monetary values that take place at different times, we must account for the time value of money by
comparing cash flows at their values at a common point in time. Generally, we compare the present values — or net
present values — of different outcomes by discounting future cash flows at the appropriate discount rate. Terminal
values of assets held at the end of the time horizon must be determined and discounted to their present value.

Solving the Chez Tethys Problem


"Now let's find Leo, and get to work on his decision problem," Alice says, snapping her laptop shut.

I calculated that if the floating restaurant idea really takes off, I'd make over $2 million in profits over the next five years.
And that's after I subtract my initial investment, including the purchase of the ship.

How certain are you that the Tethys will be the success you envision?

Gosh, I don't know. It's almost like a flip of a coin to me — chances are about fifty/fifty that the Tethys will be a big hit.

That sounds a little too optimistic. Let's take a close look at your business plan.

The three of you spend the rest of the day in energetic discussion and research. Finally, you agree on two representative
scenarios that might occur if Leo launches the Chez Tethys: "Phenomenon," or "Fad."

If the Tethys becomes a cultural "Phenomenon" with staying power, Leo can expect $2 million in profits over five years,
in terms of net present value. Leo grudgingly agrees that the likelihood of this scenario is only 35%.

Alternatively, dining Chez Tethys might become a passing "Fad" for a couple of years, then be replaced by "The Next Big
Thing." In that case, Leo would face substantial losses, estimated to have a present value of about $800,000. Leo grants
that his brainchild has a 65% chance of just being a fad.

What is the EMV of launching the Chez Tethys?

Enter the EMV in $millions as a decimal number with three digits to the right of the decimal point (e.g., enter
''$5,500,000'' as ''5.500''). Round if necessary.

The expected monetary value of launching the Tethys is the outcome value of the "Phenomenon" scenario weighted by its
probability added to the outcome value of the "Fad" scenario weighted by its probability, i.e., $180,000.
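
A quick sketch of that EMV calculation, using the scenario values and probabilities given above (all figures in $millions):

    # Sketch: EMV of launching the Chez Tethys.
    p_phenomenon, value_phenomenon = 0.35,  2.0   # $2 million NPV of profits
    p_fad,        value_fad        = 0.65, -0.8   # $800,000 loss

    emv_launch = p_phenomenon * value_phenomenon + p_fad * value_fad
    print(f"EMV of launching: ${emv_launch:.3f} million")   # 0.180, i.e., $180,000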

Hmm. I can see how this analysis is helpful. I'm glad that my expected profits are positive, but somehow I'm not satisfied.
I don't feel very comfortable with some of our estimates.

Exercise 1: The Shipping Bea Flies Again


Robin Bea is the CEO of a small shipping company. She needs to decide whether or not to lease another truck to add to
her current fleet.

Robin identifies three states of the transportation sector that might occur: "Boom," "Moderate Growth," and
"Slowdown." Her firms profits depend on whether she leases an additional truck and on the state of the sector: She
associates an estimate of total firm profits with each of the six scenarios.

What is the EMV of the decision to lease the truck?

Enter the EMV in $thousands as a decimal number with one digit to the right of the decimal point (e.g., enter ''$5,500''
as ''5.5''). Round if necessary.

The EMV of the decision to lease the truck is $31,000. Weigh each scenario's outcome by the probability that it will occur, then add the weighted values.

What is the EMV of the decision to not lease the truck?

Enter the EMV in $thousands as a decimal number with one digit to the right of the decimal point (e.g., enter ''$5,500''
as ''5.5''). Round if necessary.

The EMV of the option to not lease the truck is $18,200. Weigh each scenario's outcome by the probability that it will occur, then add the weighted values.

Exercise 2: The SHAMH of the Century


Jari Lipponen of the Silverhaven Home for Abandoned Miniature Horses (SHAMH) needs funds to maintain
operations. He can either apply for a government grant or run a local fundraiser, but the demands on his time are too
high for him to be able to do both.

Jari believes he has a 90% chance that he will win a grant of $25,000 if he submits the grant application. Grants in this
category are for a fixed amount, so if he loses the grant, he'll have no money to run the SHAMH.

Based on his past fundraising experience, he estimates that if he runs a local fundraiser, he has a 30% chance of raising
$30,000 and 70% chance of raising $20,000.

What is the EMV of launching a fundraiser?

Enter the EMV in $thousands as a decimal number with one digit to the right of the decimal point (e.g., enter ''$5,500''
as ''5.5''). Round if necessary.

To find the EMV of launching the fundraiser, weigh the $30,000 outcome value by its probability of 30% and add that
to the $20,000 outcome value weighted by its probability of 70%.

What is the EMV of applying for the grant?

Enter the EMV in $thousands as a decimal number with one digit to the right of the decimal point (e.g., enter ''$5,500''
as ''5.5''). Round if necessary.

To find the EMV of applying for the grant, weigh the grant award amount of $25,000 by the probability of winning it
(90%) and add that to the value of not winning the grant ($0) weighted by the probability of losing (10%).

The EMV of the fundraiser option is $23,000, higher than the EMV of applying for the grant. Based on this analysis,
Jari should organize a fundraiser.
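
A brief sketch of both EMV calculations, using the probabilities and amounts given above (figures in $thousands):

    # Sketch: comparing the EMVs of Jari's two funding options.
    emv_fundraiser = 0.30 * 30 + 0.70 * 20   # 23.0
    emv_grant      = 0.90 * 25 + 0.10 * 0    # 22.5

    print(f"Fundraiser EMV: {emv_fundraiser:.1f}")
    print(f"Grant EMV:      {emv_grant:.1f}")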

Exercise 3: The Gaiacorps Upgrade


Marsha Ratulangi is the chief operating officer at Gaiacorps, a non-profit organization dedicated to preserving natural
habitats around the world. Gaiacorps' IT hardware is aging, and Marsha must decide whether to extend the current
lease on Gaiacorps' IT desktop computers or purchase new ones.

Which of the following costs is not relevant to Marsha's decision?

The cost of the lease extension and the future maintenance costs of the current hardware are incurred only in the
scenario in which Gaiacorps extends its current lease. The purchase price of the new hardware is incurred only in the
scenario in which Gaiacorps purchases new hardware. Which of the three costs are incurred depends on which option
Marsha chooses, so all three are highly relevant to her decision.

Which of the following costs is relevant to Marsha's decision?

The cost of the memory card upgrade and the maintenance costs previously invested in the current hardware are sunk
costs that have already been incurred.

Marsha's salary is not a sunk cost, but it is incurred in all possible scenarios. Neither of the options Marsha is considering will affect the amount of these three costs, so all three are irrelevant to the decision.

In contrast, the cost of disposing of the new hardware is relevant, since it is incurred only in the case that Marsha buys
new desktop computers.

Exercise 4: Mopping up the Empire


Under pressure from his company's ad hoc advisory board to lower operating expenses, the CEO of Empire Learning,
Bill Hartborne, is considering canceling the biweekly cleaning service for the company offices. Each cleaning
engagement costs $75.

Instead of hiring a cleaning service, Bill could simply clean the office himself whenever a client visits the office. The
probability of exactly one client visiting the office in a given month is 25%, and the probability that two clients will visit
is 10%. The probability that no clients will visit is 65%.

Bill draws up the structure of his decision in the tree depicted below. Given a one-month time horizon, what is your
best estimate of the EMV of the cost of not hiring a cleaning service?

Instead of cleaning the office, Bill could be creating value for his company by making sales, networking, boosting
employee morale, or simply increasing his productivity by napping on the office sofa. We need to calculate the
opportunity cost of the time Bill would spend cleaning the office, so we need to know how long he spends cleaning,
and how highly his time is valued.

Assuming that it always takes Bill 2 full hours to clean Empire Learning's offices, and that Bill's time is valued at
$200/hour, what is the EMV of Bill cleaning the office?

Enter the EMV in dollars as an integer (e.g., enter ''$5.00'' as ''5''). Round if necessary.

The EMV of Bill cleaning the office himself is $180 per month, which exceeds what the biweekly cleaning service costs over the same period (at $75 per visit). Based on this analysis, Bill should continue employing the cleaning service.

Exercise 5: The Eris Shoe Company


Val Purcell, CEO of a supply-chain management consulting firm Purcell & Co., must decide whether or not to put in a
bid for a contract to re-engineer the supply chain of a potential new client: the Eris Shoe Company.

Creating the bid will cost $16,000 in Val's time and legal fees. If his bid beats out the competition, Val expects the
contract to return profits of $100,000 — from which the cost of preparing the bid has not yet been subtracted. Val
believes he has a 20% chance of winning the bid.

Eris will pay the consulting fee and Val will accrue his estimated $100,000 profits upon completion of the project,
which is scheduled for one year from now.

Under these terms, what is the expected monetary value of creating and submitting the bid?

The answer cannot be determined without knowing Purcell's discount rate. If Purcell wins the bid, it won't receive its
consulting fee until completion of the project one year from now. Cash flows related to the contract should be
discounted at Purcell's discount rate. For simplicity assume that the cost/profit figures represent cash outflows and
inflows, and that Purcell's discount rate is 15%.

What is the EMV for the option of submitting this bid?

Enter the EMV in $thousands as a decimal number with two digits to the right of the decimal point (e.g., enter
''$5,500'' as ''5.50''). Round if necessary.

To find the EMV of creating the bid, first calculate the present value of the cash inflow from Purcell's profits on the
contract, to be received a year from now upon completion of the project.

To determine the net present value of winning the bid, we subtract the $16,000 bid preparation costs.

Then, use the probability of winning the bid to weight the EMVs of the "Win" and "Don't win" scenarios to determine
the EMV of placing the bid.

Before Val starts work on the bid, Eris decides to move the project's completion date to two years after the contract is
signed. Again assume that the cost/profit figures represent cash outflows and inflows, and that Purcell's discount rate is
15%. What is the EMV for submitting the bid now?

Enter the EMV in dollars as an integer (e.g., enter ''$5.00'' as ''5''). Round if necessary.

To find the new EMV of putting in the bid, first calculate the present value of the cash inflow from Purcell's profits on
the contract, to be received two years from now upon completion of the project.

To determine the net present value of winning the bid, we subtract the $16,000 bid preparation costs.

Then, use the probability of winning the bid to weight the EMVs of the "Win" and "Don't win" scenarios and determine
the EMV of placing the bid. The EMV of putting in the bid is -$877. If Val's payments are delayed by another year, he
cannot expect the project to be profitable, so he should not submit a bid. Delaying the profits changes Val's optimal
decision!
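
The sketch below reproduces both versions of the calculation, assuming (as the exercise states) a 15% discount rate, a $16,000 bid cost, a $100,000 profit received at completion, and a 20% chance of winning:

    # Sketch: EMV of bidding for the Eris contract, for different completion dates.
    def emv_of_bid(years_to_completion, rate=0.15, profit=100_000,
                   bid_cost=16_000, p_win=0.20):
        pv_profit = profit / (1 + rate) ** years_to_completion
        npv_win = pv_profit - bid_cost    # win: discounted profit less bid cost
        npv_lose = -bid_cost              # lose: the bid cost is spent regardless
        return p_win * npv_win + (1 - p_win) * npv_lose

    print(f"Completion in 1 year:  {emv_of_bid(1):,.2f}")   # about  1,391.30
    print(f"Completion in 2 years: {emv_of_bid(2):,.2f}")   # about   -877.13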

Exercise 6: The Crumbling Empire


Cap Winestone of Universal Learning is preparing a bid for a new building to accommodate its expanding operations.
As luck has it, the headquarters of former competitor Empire Learning is for sale through a sealed bid auction. The
building is well suited to Universal's needs, containing computing equipment, network infrastructure and other
important e-learning accessories.

Cap estimates the building is worth about $900,000 to him and is trying to decide what bid to place. To simplify his
decision, he narrows down his bid choices to four possible bids. He muses: "With lower bids I gain more value. If I bid
$600,000, I'll get a building worth $900,000 to me, so I gain $300,000 in value. In terms of the value I gain, the lower
the bid, the better."

"On the other hand," Cap continues, "with a low bid I'm not likely to win. From the point of view of winning the bid, the
higher the bid the better. How do I balance these two opposing factors?"

Cap lays out a decision tree with the possible bid amounts, the likelihood of winning for each bid, and the outcome
values. What amount should Cap bid?

Folding back the tree, we find the $800,000 bid gives the highest expected monetary value. The $800,000 bid best
balances the value gained against the probability of winning.

Sensitivity Analysis
Still in Leo's office after you initially calculated the expected monetary value of launching the Chez Tethys, you listen to
Leo's growing concerns about the estimates that inform his decision.

Leo the Skeptic: The Uncertain Estimates Problem


Now that you've mapped out the possible downside and its probability for me, I'm a little discouraged. Sure, the analysis
indicates that ventures like the Tethys will be profitable on average, but the fact that I have almost a two-thirds chance of
an $800,000 loss scares me.

What if the potential loss is even greater? Or, what if it's even more likely that the Chez Tethys will just be a passing
"Fad"?

Both points are well taken, Leo. I'll tell you what: let's break for lunch, and then delve a little deeper into our analysis.

A Decision's Sensitivity to Outcome Estimates


Over lunch, Alice comments on Leo's reaction to your analysis. "Leo's questions bring us to a crucial component of
decision analysis: sensitivity analysis."

Seth Chaplin has all but decided how to produce the film Cloven. Based on his initial analysis, he is inclined to produce
Cloven in partnership with K2 Classics, thereby retaining part ownership of the film. The EMV of the K2 option is $1.4
million. The EMV of the alternative option — in which Seth's company produces Cloven for Pony Pictures and
relinquishes ownership in the film — is $1.0 million.

But Seth's calculations were based on estimates. The probabilities and the outcome values he used in his analysis were
educated guesses. No matter how detailed and rigorous the methodology, a decision analysis is only as good as the data
on which it is based. What if Seth isn't completely certain about these data?

Seth is particularly unsure about the $6 million value he used to represent the profits associated with a "Blockbuster"
film. What would the EMV of the K2 option be if $4 million were a more representative value for "Blockbuster" profits?

Enter the EMV in $millions as a decimal number with one digit to the right of the decimal point (e.g., enter ''$5,500,000''
as ''5.5''). Round if necessary.

If $4 million is the expected value of S&C's profits for the "Blockbuster" scenario, then the EMV of the K2 option drops
to $800,000, less than the $1 million EMV of the Pony option. Note that if the "Blockbuster" profit figure drops to $4
million, Seth's optimal decision changes: he should now choose the Pony option. Seth has learned that his optimal
decision is sensitive to the figure he uses to represent S&C's profits if Cloven attains "Blockbuster" status.

If Seth's optimal decision switches when the "Blockbuster" profit figure drops to $4 million, he might reasonably wonder
what his decision would be for other values. What about $5 million? $4.5 million? How low would the "Blockbuster"
profit figure have to be to make the Pony option preferable to the K2 option in terms of the EMV?

At what "Blockbuster" profit figure does Seth's optimal decision switch?

Enter the EMV in $millions as a decimal number with one digit to the right of the decimal point (e.g., enter ''$5,500,000''
as ''5.5''). Round if necessary.

To answer that question, we first write the EMV of the K2 option with a variable, B, to represent "Blockbuster" profits in
millions of dollars.

Now, we compare the EMV of the K2 option to the EMV of the Pony option. The K2 EMV is greater than the Pony EMV
when the "Blockbuster" profit figure is greater than $4.67 million.

We call $4.67 million the breakeven value for "Blockbuster" profits: above it, the EMV criterion recommends the K2
option; below it, the EMV criterion recommends the Pony option.

Seth may not be sure if the "Blockbuster" profits are best represented by $6 million or $5 million or $4.8 million, but as
long as he is confident that the figure is greater than $4.67 million, he need not lose any sleep over finding a more
accurate value. Knowing that his estimates for the "Blockbuster" profit figure are firmly on one side of the breakeven
value allows him to stop worrying about the precise value of that number in the decision analysis, since knowing that
number with greater precision will not change his decision.

On the other hand, if Seth thinks the "Blockbuster" profit figure could be below $4.67 million, he may wish to invest
additional resources to find a more accurate figure to represent "Blockbuster" profits.

If Seth thinks the true number is around the breakeven value of $4.67 million, but isn't certain if it's slightly above or
slightly below that figure, he can also stop worrying about the precise value of the number. As long as the figure is around
$4.67 million, Seth should be indifferent between the K2 and Pony deals, since the EMVs of the two options are about the
same.

In fact, a good way to check that we have calculated a breakeven value correctly is to substitute the value into the EMV
calculation: if the options have the same EMV, the breakeven value is correct.
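
As an illustration, the sketch below computes and checks the breakeven value. It treats the K2 EMV as a linear function of the "Blockbuster" profit B; the coefficients 0.3 and -0.4 are inferred from the two EMVs quoted above ($1.4 million at B = $6 million and $0.8 million at B = $4 million), consistent with the 30% "Blockbuster" probability used elsewhere in Seth's analysis:

    # Sketch: breakeven "Blockbuster" profit for the K2 option (values in $millions).
    # From the two data points above, the K2 EMV is linear in B: EMV_K2(B) = 0.3 * B - 0.4.
    def emv_k2(blockbuster_profit):
        return 0.3 * blockbuster_profit - 0.4

    emv_pony = 1.0
    breakeven_B = (emv_pony + 0.4) / 0.3
    print(f"Breakeven Blockbuster profit: {breakeven_B:.2f}")   # about 4.67

    # Check: at the breakeven value, the two options should have (nearly) equal EMVs.
    print(f"K2 EMV at breakeven: {emv_k2(breakeven_B):.2f}")    # about 1.00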

Once we calculate a breakeven value, we know whether or not expending additional time and other resources to find a
more accurate estimate is worthwhile. The breakeven value establishes a comfort zone: as long as we are confident that
the value we are estimating is within the zone, we can feel comfortable choosing the option recommended by our initial
analysis, based on our original estimate.

If we think the value we are estimating could be close to the breakeven value, we need to be more cautious. If the true
value we are trying to estimate could lie outside of the comfort zone, we might want to try to make our estimate of that
value more accurate before we reach a final decision.

How confident we are that the true value we are estimating lies inside the comfort zone given by the breakeven analysis is
a matter of judgment and experience. Sometimes, we might collect sample data to estimate an outcome value. In this
case, we should look closely at the variation in the data to see how widely and in what way the data can vary.

Calculating a breakeven value for data used in a decision analysis is called sensitivity analysis: for each estimated value
in the analysis, we check to see by how much it would have to change to affect our decision, assuming our estimates for all
the other data are correct.

Sensitivity analysis is an important and powerful tool for management decision-making. Managers who base decisions on
an initial analysis without performing sensitivity analysis on critical data risk lulling themselves into a false sense of
security in their decisions.

Summary
After completing an initial decision analysis, always conduct a sensitivity analysis for each outcome value estimate you
are uncomfortable with. First, calculate the outcome value's breakeven value: the value for which the EMV of the option
initially recommended by the decision analysis ceases to be the best EMV. The breakeven value defines a comfort zone:

If we believe that the actual outcome value might be outside that zone — thereby changing the optimal decision — we
should reconsider our analysis and refine our estimate of the outcome value in question.

Evaluating Non-Monetary Consequences


During his negotiations with K2 and Pony, Seth realized that he did not particularly look forward to working with the
K2 team. Based on past experience, he knows that interpersonal frictions can be highly frustrating and can make a
collaboration unpleasant. This frustration is clearly a cost — albeit a non-monetary cost — associated with the K2
option.

Sensitivity analysis can give us a "reality check" on how highly we value non-monetary consequences such as
frustration, reputation costs and benefits, and sentimental values. Although it may be difficult to assign a value to such
consequences, we can often answer questions about the most we'd be willing to pay to avoid (or to obtain) them by
calculating a threshold value.

Clearly, Seth wouldn't want to spoil any magical Hollywood days just to make an additional $5 in profits. But the more
the K2 option pays relative to the alternatives, the more willing Seth might be to suffer working with the K2 team. How
can Seth determine whether he should accept the K2 deal and bear the resulting interpersonal trials and tribulations?

Sensitivity analysis can help us analyze non-monetary consequences such as frustration. Let's use "F" to represent the
frustration cost (in $millions) Seth will incur if he has to work with the K2 team. Since frustration will occur in any
scenario involving the K2 option, $F million must be subtracted from all outcomes associated with the K2 option.

How large must the cost of frustration be for Seth's optimal decision to switch to the Pony option?

Enter F, the cost of frustration, in $millions as decimal number with two digits to the right of the decimal point (e.g.,
enter ''$5,500,000'' as ''5.50''). Round if necessary.

Subtracting $F million from the K2 outcomes changes the EMV of the K2 option to $1.4 million - $F million. For Seth's decision
to switch from the K2 deal to the Pony deal, the EMV of the K2 option, $1.4 million - $F million, must drop below $1.0
million. In other words, F must satisfy the inequality below.

In order to lower the EMV of the K2 option below the EMV of the Pony option, the cost of frustration would have to exceed $400,000. Seth needs to ask himself if he would pay $400,000 to avoid the frustration of working
with K2. Sensitivity analysis provides a clear, monetary upper bound against which he can measure the strength of his
feelings.
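
A one-line version of that breakeven calculation (figures in $millions):

    # Sketch: F is subtracted from every K2 outcome, so it lowers the K2 EMV dollar for dollar.
    emv_k2, emv_pony = 1.4, 1.0
    breakeven_F = emv_k2 - emv_pony
    print(f"Breakeven frustration cost: ${breakeven_F:.2f} million")   # 0.40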

In the end, Seth decides that he can learn to love the K2 team for $400,000.

Summary
To incorporate non-monetary consequences into a decision, first find the option with the best EMV. Then, add the
non-monetary consequence to the outcome values of all scenarios affected by that non-monetary consequence, and
calculate the breakeven value for which the option recommended by the initial decision analysis ceases to have the
best EMV. The breakeven value defines a comfort zone: If we believe that the actual value of the non-monetary
consequences might be outside that zone — thereby changing the optimal decision — we should try to gain a firm
estimate of the non-monetary consequence's value.

A Decision's Sensitivity to Probability Estimates


Seth is somewhat unsure about his estimates for the probabilities of how successful Cloven will be. He is quite sure that
the probability of the film flopping is around 20%, but he's less sure about the probabilities of "Lackluster" and
"Blockbuster" performance levels. He wants to know how sensitive his decision to choose the K2 deal is to the values of
these probabilities.

Let's call the probability of a "Blockbuster" "p." Seth is confident that the probability of a "Flop" is 20%. What is the
probability of a "Lackluster" outcome?

Since the three outcomes are mutually exclusive and collectively exhaustive, their probabilities must add to 100%. Seth
is confident that the probability of a "Flop" is 20%, so he knows that the probabilities of the remaining two outcomes
("Blockbuster" and "Lackluster") must add to 80%. Thus, the probability of a "Lackluster" performance is 0.8 - p.

Using p to denote the probability of a "Blockbuster" outcome and 0.8 - p to denote the probability of a "Lackluster"
outcome, what is the EMV of the K2 option?

The EMV of the K2 option is p * $6 million - $0.4 million. For what values of p, the probability of Cloven becoming a
"Blockbuster" film, is the Pony option preferred to the K2 option on the basis of EMV?

The breakeven value for the probability of a "Blockbuster" hit is 23.3%: when the probability is above 23.3%, the EMV
of the K2 option is higher; when the probability is below 23.3%, the EMV of the Pony option is higher. If Seth is
confident that the probability of a "Blockbuster" is at least 23.3%, he can feel comfortable choosing the K2 option. He
need not expend additional effort trying to further refine his estimates for the probabilities of different levels of success.

Decision-making is an iterative, multi-step process. When analyzing a decision, we should first construct and analyze a
decision tree based on our best estimates of the outcomes and probabilities involved. After reaching a tentative
decision, it is critical to scrutinize the data used in a decision analysis and conduct sensitivity analyses for each estimate
that we feel unsure about.

As long as we are comfortable that the true value we are estimating is within the range specified by the breakeven
calculation, we can confidently proceed with our decision. If not, we should focus our efforts on refining our estimates
for those values to which our decisions are most sensitive.

Finally, we should note that managers sometimes need to test the sensitivity of their decisions to two or more estimates
simultaneously. Sensitivity analysis techniques can be extended to address these situations; these techniques are
beyond the scope of this course.

Summary
After completing an initial decision analysis, always conduct a sensitivity analysis for each probability value you are
uncomfortable with. First calculate the probability's breakeven value: the probability for which the EMV of the option
initially recommended by the decision analysis ceases to be the best EMV. The breakeven value defines a comfort
zone: If we believe that the actual probability might be outside that zone — thereby changing the optimal decision —
we should reconsider our analysis and refine our estimate of the probability in question.

Solving the Uncertain Estimates Problem


Sensitivity analysis helps managers cope with the uncertainty that surrounds the estimates they base decisions on. With
your new knowledge, you're ready to turn to Leo's problem.

Your initial analysis recommends that Leo launch his floating restaurant, but Leo has expressed some doubt about the
estimate of the losses he'd incur if the Tethys turns out to be a passing "Fad."

How high do the losses have to be in the "Fad" scenario to make launching the Tethys ill-advised in terms of EMV?

Launching the Tethys would be less attractive than not launching it if the EMV of launching is less than $0, the EMV of
not launching. Solving the inequality below, we find that the losses in the event that the Tethys is just a "Fad" would have
to exceed $1.08 million for Leo to abandon the project based on the EMV criterion.

Assume that, in fact, Leo's estimate — that he will incur an $800,000 loss if Chez Tethys turns out to be a "Fad" — is
correct. For what probability of a "Fad" does launching the Tethys cease to be a worthwhile venture, in terms of the EMV?

Launching the Tethys is less attractive than not launching it if the EMV of launching is less than $0, the EMV of not
launching. Solving the inequality below, we find that the probability of a "Fad" would have to be higher than 71% for Leo
to prefer to abandon the project based on the EMV criterion.
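
The sketch below reproduces both breakeven calculations for Leo's decision, using the $2 million "Phenomenon" profit, the $800,000 "Fad" loss, and the 35%/65% probabilities from the earlier analysis (figures in $millions):

    # Sketch: sensitivity of Leo's launch decision.
    p_phenomenon, profit_phenomenon = 0.35, 2.0
    p_fad, loss_fad = 0.65, 0.8

    # Breakeven "Fad" loss L: EMV of launching is 0 when 0.35 * 2 - 0.65 * L = 0.
    breakeven_loss = p_phenomenon * profit_phenomenon / p_fad
    print(f"Breakeven Fad loss: {breakeven_loss:.2f}")          # about 1.08

    # Breakeven "Fad" probability q: EMV is 0 when (1 - q) * 2 - q * 0.8 = 0.
    breakeven_q = profit_phenomenon / (profit_phenomenon + loss_fad)
    print(f"Breakeven Fad probability: {breakeven_q:.1%}")      # about 71.4%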

Hmm, I'm fairly confident that our estimate of the possible loss is on target — certainly it won't be higher than a million
dollars! But the more I think about it, the more I wish we had a better estimate for the probability that the Tethys will be
the success I've always dreamed of.

I have some ideas for ways to get a better handle on the likelihood of the Tethys' success. But I need to make some phone
calls to see if I'm on the right track. Let's break for today and meet tomorrow morning.

Exercise 1: Sensitive as a Truck


Robin Bea is the CEO of a small shipping company. She needs to decide whether or not to lease another truck to add to
her current fleet.

Based on her assessment of future trends in the transportation sector, Robin identified three scenarios that might
occur: "Boom," "Moderate Growth," and "Slowdown," corresponding to the performance of the economy and its impact
on the transportation sector. She associated estimates of total firm profits with each scenario, and calculated the EMVs
of "leasing" and "not leasing" a new truck: $31,000 and $18,200, respectively.

Robin is afraid that she may have been a little overoptimistic in her estimate of $70,000 of total profits in the scenario
in which she "leases a new truck" and the transportation sector "booms." She wants to know how sensitive her decision
is to the outcome value of this scenario.

What must the value of total profits of the scenario in which Robin "leases the truck" and the transportation sector
"booms" be in order for the "don't lease truck" option to be the more attractive one?

For sufficiently low profits generated by the new truck in the "Boom" scenario, the EMV of leasing the truck is lower
than the EMV of not leasing the truck. To find the breakeven value X, we solve the inequality between the two EMVs
shown below. This inequality is solved when X, the value of total profits in the lease truck and "Boom" scenario, is
$27,333.

Exercise 2: Sneakers of Discord


Val Purcell, CEO of the supply-chain management consulting firm Purcell & Co., must decide whether or not to put in a
bid for a contract to re-engineer the procurement process for a potential new client: the Eris Shoe Company.

Creating the bid will cost $16,000 in Val's time and legal fees. If he wins the bid, he will receive payment for the project
upon completion, two years after signing the contract. He has estimated the present value of the profits if he wins the
bid at $75,614. After factoring in the $16,000 cost of submitting the bid, the net present value of the profits if he wins
the bid is $59,614.

Val believes he has a 20% chance of winning the bid, so the EMV for submitting the bid is -$877: under these
circumstances, Val can't expect submitting a bid to be profitable. But Val hasn't been able to figure out how stiff the
competition for the contract is, and he's fairly uncertain about the probability of winning.

How high would the probability of winning the bid have to be to make Val change his mind and choose to put in a bid?

To change Val's decision, the EMV of bidding will have to be higher than the EMV of not bidding: $0. To find out how
high p, the probability of winning the bid, would have to be, we solve the inequality below. Note that the probability of
losing the bid (1 - p) must change as the probability of winning changes. The inequality is solved when the probability
of winning the bid is greater than 21.2%.
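
A quick check of that breakeven probability, using the $59,614 net present value of winning and the $16,000 cost of losing:

    # Sketch: EMV(p) = p * 59,614 - (1 - p) * 16,000 is positive when p > 16,000 / 75,614.
    breakeven_p = 16_000 / (59_614 + 16_000)
    print(f"Breakeven win probability: {breakeven_p:.1%}")   # about 21.2%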

Exercise 3: The SHAMH Continues


Jari Lipponen of the Silverhaven Home for Abandoned Miniature Horses (SHAMH) needs funds to maintain
operations. He can either apply for a government grant or run a local fundraiser, but the demands on his time are too
high for him to be able to do both.

Jari believes he has a 90% chance that he will win a grant of $25,000 if he submits the grant application. He estimates
that as a local fundraiser, he has a 30% chance of raising $30,000 and 70% chance of raising $20,000. The EMV of the
fundraiser option is $23,000, higher than $22,500, the EMV of applying for the grant. Based on the initial analysis,
Jari should organize a fundraiser.

Jari has expressed uncertainty about his estimate for the probabilities of the two levels of fundraising success. How low
would the probability of raising $30,000 have to be to change his decision? (Enter your answer as a decimal number
with two digits to the right of the decimal point)

For Jari to change his initial decision, the EMV of the fundraising option would have to be less than $22,500. That
would be the case if the probability of the high fundraising success — valued at $30,000 — were less than 25%.

Jari needs $20,000 to operate the SHAMH through the next year. Raising $25,000 or more would permit Jari to
operate the SHAMH and expand the capacity of the operation, thereby allowing even more abandoned miniature
horses to be saved. This year will be Jari's last running the SHAMH, and he really wants to expand its capacity before
he leaves.

If he's unable to raise at least $25,000 to cover the SHAMH expansion, he'll be very disappointed. How highly would
Jari have to value his disappointment in order for him to prefer the government grant option that is more likely to
secure funds for expansion? Assume that his original probability assessments for the success of the fundraiser are
correct.

If Jari is only moderately successful in his fundraising activities, he won't be able to expand the SHAMH on his watch.
The disappointment cost D should be subtracted from the outcome value for the scenario in which he takes in only
$20,000 in contributions. After incorporating the disappointment cost, the EMV of the fundraiser option becomes
$23,000 - 0.7D.

However, if he applies for the grant and doesn't win it, he won't be able to expand, either. The disappointment cost D
must also be subtracted from the outcome value in the scenario in which he writes the grant but doesn't win it. After
incorporating the disappointment cost, the EMV of the grant-writing option becomes $22,500 - 0.1D.

The breakeven value is the value of disappointment for which the EMVs of the two options are equal. Thus, to find the
breakeven value of D, we must set up an inequality between the two EMVs and solve for D. In this case, the
disappointment cost would have to be greater than $833.33 in order for Jari to prefer the grant-writing option.
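
A short sketch of the disappointment-cost breakeven, using the EMV expressions derived above (figures in $thousands):

    # Sketch: breakeven disappointment cost D.
    # Fundraiser EMV: 23.0 - 0.7 * D   (disappointed if he raises only $20,000: 70% chance)
    # Grant EMV:      22.5 - 0.1 * D   (disappointed if he loses the grant: 10% chance)
    # The grant is preferred when 22.5 - 0.1 * D > 23.0 - 0.7 * D.
    breakeven_D = (23.0 - 22.5) / (0.7 - 0.1)
    print(f"Breakeven disappointment cost: {breakeven_D:.3f}")   # about 0.833, i.e., $833.33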

Decision Analysis II

Conditional Probabilities
After a Kahana breakfast of crab benedict, you meet once more with Leo.

Dining Chez Tethys: The Market Research Problem


So, I've been thinking: my decision is heavily dependent on the likelihood that the Chez Tethys will be a success. Couldn't
we do some market research to find out how interested our target market would be? Then I could make a better decision
— one that would be less risky!

Sure, Leo. But market research costs money. How much would you be willing to pay for information that would help you
better predict the success of the Chez Tethys?

Hmmm. Good question...frankly, I don't have a clue how to even start thinking about that. Can you two help me?

"Whatever market research Leo has in mind," remarks Alice, "it won't reveal with certainty how consumers will take to
the Tethys. In other words, we're uncertain about the Tethys' chance of success, and we'll be uncertain that the market
research we collect is accurate."

Joint and Marginal Probabilities


Pondering two layers of uncertainty, you begin to feel vertigo...

When analyzing a decision, we typically use estimates for the probabilities and financial implications of various outcomes
that could occur. Often, we can imagine additional information — scientific tests, market research data, or professional
expertise — that would help make our estimates more accurate. How much should we be willing to pay for this type of
information? And how do we incorporate new information into our decision analysis?

Before we answer these questions, we'll need to expand our understanding of probability and introduce the important
concepts of conditional probability and statistical independence. Let's look at an example first.

British automaker Chariot's most sought-after model is the Ben Hur. Consumers love the Bennie, as it's affectionately
called. It was offered as a limited edition: to date, only 1,000 Bennies are on the road in Britain. We'll take a closer look at
two properties of the Bennie: its "Color" and its "Stereo."

The Bennie comes in two color options: Burgundy and Champagne. Also, the Bennie is offered with a high-end car stereo
system by Sweetone. Let's look at a table of the 1,000 Bennies currently on the road and see how the two properties —
"Color" and "Stereo" — are distributed.

Of the 1,000 Bennies, 150 are Burgundy and have a Sweetone stereo. 600 are Burgundy and do not have the Sweetone
stereo. Furthermore, there are 50 Champagne Bennies with the stereo and 200 without.

Let's add another column to our table and fill in the total numbers of Burgundy and Champagne Bennies. To calculate
these numbers, we simply take the sums of the rows. For example, the number of Burgundy Bennies is simply the number
of Burgundy Bennies with the Sweetone stereo added to the number of Burgundy Bennies without.

Next, in the final row, we'll fill in the total numbers of Bennies with and without the Sweetone stereo. Here, we simply
take the sums of the two columns. In the bottom right cell, we enter the total number of cars: 1,000.

This number — 1,000 — should be equal to the sum of the numbers in the final row: the total number of Bennies with the
Sweetone stereo added to the total number without one. Also, it should be equal to the sum of the numbers in the final
column: the number of Burgundy Bennies plus the number of Champagne Bennies. Finally, since 1,000 is the total
number of all Bennies, it should be the sum of all the original numbers we entered into the table.

A Venn diagram is a useful graphical way to represent the contents of the table. The Burgundy rectangle on top represents
the set of all Burgundy Bennies. The Champagne rectangle below represents the set of all Champagne Bennies. The
rectangles do not intersect because a Bennie cannot be both Champagne and Burgundy.

We now add a patterned rectangle on the left to represent the set of Bennies with the Sweetone stereo. The area without
the pattern represents the set of Bennies that are not equipped with the Sweetone. These areas intersect the rectangles
that represent the distribution of "color": some Burgundy Bennies are Sweetone-equipped; some are not. Some
Champagne Bennies are Sweetone-equipped; some are not.

The sizes of the areas in the Venn diagram are directly proportional to the incidence of the different Bennie properties:
Burgundy/Champagne and Sweetone/no Sweetone. We use Venn diagrams to effectively communicate information about
sets of things — for example, Burgundy cars and Sweetone-equipped cars — and their interactions.

The table is a useful tool for calculating proportions of potential interest to managers. For instance, we can find the
proportion of Burgundy Bennies with the Sweetone stereo simply by locating the cell that contains their number and
dividing it by the total number of cars: 15%.

Or, to find the proportion of Burgundy Bennies in general, we find the cell that contains the total number of Burgundy
Bennies and divide it by the total number of cars: 75%.

In this way, we can create an entire table of proportions. These proportions can be interpreted as probabilities. For
example, since 15% of the Bennies on the road are Burgundy-colored and Sweetone-equipped, the probability that a
randomly selected Bennie will be Burgundy-colored and have the Sweetone is 15%.

The probability that a randomly selected Bennie will be Burgundy is 75%. Going forward, when talking about the table,
we'll use the words "proportion" and "probability" interchangeably.

The probabilities on the inside of the table are called joint probabilities: the probabilities of a single car having two
particular Bennie features, for example, Burgundy-colored and Sweetone-equipped. We'll denote joint probabilities in the
following way: P(Burgundy & Sweetone) is the probability that a Bennie is Burgundy-colored and Sweetone-equipped.

The probabilities of each "Color" or "Stereo" option occurring in a given Bennie — Burgundy or Champagne, Sweetone-
equipped or not Sweetone-equipped — are often called marginal probabilities. They are denoted simply as
P(Burgundy), P(Champagne), P(Sweetone), and P(no Sweetone).

Information about the distribution of properties in populations is often available in terms of probabilities, so the table of
probabilities is a very natural way to represent the Bennie data.
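
As an illustration of how the probability table is built, the sketch below (in Python) starts from the Bennie counts given above and derives the joint and marginal probabilities:

    # Sketch: building the joint and marginal probability table from the Bennie counts.
    counts = {
        ("Burgundy",  "Sweetone"):    150,
        ("Burgundy",  "No Sweetone"): 600,
        ("Champagne", "Sweetone"):     50,
        ("Champagne", "No Sweetone"): 200,
    }
    total = sum(counts.values())   # 1,000 Bennies

    # Joint probabilities, e.g. P(Burgundy & Sweetone) = 0.15.
    joint = {key: n / total for key, n in counts.items()}

    # Marginal probabilities are the row and column sums of the joint probabilities.
    p_burgundy = sum(p for (color, _), p in joint.items() if color == "Burgundy")    # 0.75
    p_sweetone = sum(p for (_, stereo), p in joint.items() if stereo == "Sweetone")  # 0.20

    print(joint)
    print(f"P(Burgundy) = {p_burgundy:.2f}, P(Sweetone) = {p_sweetone:.2f}")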


Summary
For two events A and B with outcomes A1, A2, etc. and B1, B2, etc., respectively, the joint probability P(A1 & B1) is
the probability that the uncertain event A has outcome A1 and the uncertain event B has outcome B1. The joint
probabilities of all possible outcomes of two uncertain events can be summarized in a probability table. The
marginal probability of the outcome A1 of the first uncertain event is the sum of the joint probabilities of outcomes
A1 and all possible outcomes B1, B2, etc. of the second uncertain event.

Conditional Probabilities
Automaker Chariot's limited edition Ben Hur model comes in two possible colors — Champagne or Burgundy — and
with or without a high-end Sweetone stereo. The table below shows the distribution of these properties — "Color" and
"Stereo" — in the population of 1,000 Bennies that Chariot has sold to date.

Restricting our focus to the Burgundy Bennies, we ask the following question: among the set of Burgundy Bennies,
what is the proportion of Burgundy Bennies with a Sweetone stereo? Stated differently: what is the probability that a
randomly selected Bennie has a Sweetone given that we know it is Burgundy?

To answer the question, we find the ratio of Sweetone-equipped Burgundy Bennies among the set of all Burgundy
Bennies. That is, we divide the number of Bennies that are both Burgundy and have a Sweetone — 150 — by the total
number of Bennies that are Burgundy — 750. The probability is 20%.

This probability is called a conditional probability: the probability that a Bennie is Sweetone-equipped given that it
is Burgundy. We'll denote this probability P(Sweetone | Burgundy), and read the vertical line as "given."

We can calculate P(Sweetone | Champagne) as:

P(Sweetone | Champagne) is the number of Bennies that are Champagne and Sweetone-equipped divided by the total
number of Champagne Bennies.

What is P(No Sweetone | Burgundy)?

Enter the percentage as a decimal number with two digits to the right of the decimal point (e.g., enter "50%" as "0.50").
Round if necessary.

Using the table of actual numbers of cars we can calculate P(No Sweetone | Burgundy) and a fourth conditional
probability, P(No Sweetone | Champagne).

Earlier, we used the table of actual numbers of cars to calculate a table of probabilities. When given information as a
table of probabilities, we can use the probabilities to calculate conditional probabilities as well: we simply form the
ratios of the appropriate probabilities.

For example, the probability that a Bennie is Sweetone-equipped given that it is Burgundy is the probability of a
Sweetone-equipped, Burgundy Bennie divided by the probability of a Burgundy Bennie.

In fact, a conditional probability is formally defined in terms of the ratio of a joint probability to a marginal probability.
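
A minimal sketch of that definition, applied to the Bennie probabilities computed earlier:

    # Sketch: P(A | B) = P(A & B) / P(B), using the Bennie data.
    p_burgundy_and_sweetone = 0.15
    p_burgundy = 0.75
    p_sweetone = 0.20

    p_sweetone_given_burgundy = p_burgundy_and_sweetone / p_burgundy   # 0.20
    p_burgundy_given_sweetone = p_burgundy_and_sweetone / p_sweetone   # 0.75

    print(p_sweetone_given_burgundy, p_burgundy_given_sweetone)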

If P(Sweetone | Burgundy) represents the probability that a Bennie is Sweetone-equipped given that it is Burgundy,
what does P(Burgundy | Sweetone) represent?

P(Burgundy | Sweetone) represents the probability that a Bennie is Burgundy given that it is Sweetone-equipped. Note
that P(Sweetone | Burgundy) and P(Burgundy | Sweetone) are not the same.

We can calculate the other conditional probabilities, this time conditioning the property "Color" on the property
"Stereo." It is useful to write the conditional probailities next to the table of probabilities to facilitate calculation: since
P(Burgundy | Sweetone) and P(Champagne | Sweetone) require only the probabilities in the Sweetone column, we
write them directly below the Sweetone column.

Similarly, since P(Burgundy | No Sweetone) and P(Champagne | No Sweetone) require only the probabilities in the "No
Sweetone" column, we write them directly below the "No Sweetone" column. Note that the new rows mimic the rows in
the original table: Burgundy on top and Champagne below it.

Similarly, we write the conditional probabilities for the property "Stereo" conditioned on the property "Color" to the
right of the table. Again, we mimic the columns in the original table: a column for "Sweetone" and one for "No
Sweetone." We now have a full table of joint, marginal, and conditional probabilities.

It is important to double check which event we are conditioning on, and to make sure our calculations are realistic. For
example, the probability that a randomly chosen Indian citizen is the prime minister is nearly zero, but the probability
that the prime minister of India is an Indian citizen is 100%!

The table of joint probabilities informs our understanding of the likelihood of different properties or events in a crucial
way: as we've seen in the Bennie example, once we have the joint probabilities, we can compute any conditional
probability and any marginal probability.

Thus, when presented with a decision problem in which the outcomes are influenced by multiple uncertain events,
constructing the joint probabilities of these events is almost always a wise first step.

Summary
The conditional probability P(A | B) is the probability of the outcome A of one uncertain event, given that the
outcome B of a second uncertain event has already occurred. The table of joint probabilities provides all the
information needed to compute all conditional probabilities. First calculate the marginal probabilities for each event,
then compute the conditional probabilities as shown below:

Statistical Independence
Let's return once more to our Chariot Ben Hur example. The Bennie comes in two possible colors — Champagne or
Burgundy — and with or without a Sweetone stereo. The probability table below shows the distribution of these
properties - "Color" and "Stereo" - in the population of Bennies that Chariot has sold to date.

Note that the proportion of Sweetone-equipped Bennies in the Burgundy population is the same as the proportion of
Sweetone-equipped Bennies in the overall population: 20%. In the language of conditional probabilities, this is the
same as saying that P(Sweetone | Burgundy) = P(Sweetone).

In other words, if we randomly select a Bennie, then discovering that it is Burgundy gives us no additional information
about whether or not it is equipped with a Sweetone stereo, beyond what we had before we knew its color: we still think
there is a 20% chance it has a Sweetone.

Similarly, the proportion of Burgundy Bennies in the Sweetone-equipped population is the same as the proportion of
Burgundy Bennies in the overall population: P(Burgundy | Sweetone) = P(Burgundy) = 75%.

A Bennie's color tells us nothing about its stereo system. Its stereo system tells us nothing about its color. When this is
true, we say that that the Bennie properties "Color" and "Stereo" are statistically independent, or, more simply,
independent.

In general, we can interpret the fact that two uncertain events are independent in the following way: knowing that one
event has occurred gives us no additional information about whether or not the other event has. For example, the
results of two spins of a wheel of fortune are independent. The first result does not reveal anything about the second.

We can confirm the independence of the Bennie's stereo and color by looking at our Venn diagram and noting that
Sweetone-equipped Bennies occupy the same percentage in the population of Burgundy Bennies — 20% — as they do in
the entire Bennie population.

Thus, to find the joint probability that a Bennie has a Sweetone stereo and is Burgundy, we take 20% of the 75% of Bennies that are Burgundy, giving us a joint probability of 15%. This property is true for any two statistically
independent properties: the joint probabilities are simply the products of the marginal probabilities.

Although it may seem plausible to assume that certain properties are independent, managers who take statistical
independence for granted do so at their peril. We need to verify the assumption that the properties are independent by
looking at and evaluating data or by proving independence on the theoretical level.
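
One simple way to verify independence from data is to compare each joint probability with the product of the corresponding marginal probabilities, as in the sketch below (using the Bennie probabilities from the table above):

    # Sketch: checking statistical independence from the joint probability table.
    joint = {
        ("Burgundy",  "Sweetone"):    0.15,
        ("Burgundy",  "No Sweetone"): 0.60,
        ("Champagne", "Sweetone"):    0.05,
        ("Champagne", "No Sweetone"): 0.20,
    }
    p_color  = {"Burgundy": 0.75, "Champagne": 0.25}
    p_stereo = {"Sweetone": 0.20, "No Sweetone": 0.80}

    # "Color" and "Stereo" are independent if every joint probability equals the
    # product of the corresponding marginals.
    independent = all(
        abs(p - p_color[color] * p_stereo[stereo]) < 1e-9
        for (color, stereo), p in joint.items()
    )
    print("Independent:", independent)   # True for Color and Stereo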

Statistical Dependence
When are two outcomes statistically dependent? Let's look at another optional feature of the Bennie: a unique
factory-installed theft discouragement system (TDS).

Using Chariot's data, we can create the following table of the distribution of the properties "Stereo" and "Protection."
Before going forward, practice your skills by calculating the joint, marginal, and conditional probabilities for these
properties. Place them in the usual format in the complete probability table.

The completed table is shown below. Which of the following is true?

From the table, we can see that P(TDS | Sweetone) is not equal to P(TDS). We can infer that the properties "Stereo"
and "Protection" are not statistically independent — once you know that a randomly selected Bennie has a Sweetone,
you know the chances of it having a TDS system are greater than they would be if you didn't have that information.

Once again our Venn diagram provides visual confirmation, in this case that the properties are not independent. In
the overall population, the proportion of Bennies that are TDS-protected is only 35%. However, in the population of
Sweetone-equipped Bennies, the proportion is significantly higher: 75%. Why might that be?

Bennie buyers who opt for the Sweetone stereo feel that their cars will be especially targeted for theft or vandalism,
and are more likely to choose the TDS option than are buyers who choose the low-end car stereo.

A savvy car thief knows that the next Bennie he passes has a 35% probability of being protected by a theft
discouragement system.

Once he identifies that a particular Bennie has a Sweetone stereo, he gains more information: then he knows that the
car has a 75% chance of being TDS-protected. This may affect his decision about whether or not to break into that
car.


Summary
Two uncertain events A and B are said to be statistically independent if knowing that A has occurred does not tell us
anything about the probability of B occurring, and vice versa. Statistical independence of two events can be
demonstrated based on data or proved from theory; it should never be assumed. Events that are not statistically
independent are said to be statistically dependent.

Conditional Probabilities in Decision Analysis


Probability theory is fascinating, sure, but what do conditional probabilities have to do with decision analysis?

To see how we might apply conditional probabilities in a decision analysis, let's revisit the Cloven film production
example.

Seth Chaplin of S&C Films is ecstatic. In a sushi lunch meeting with the agent of superstar actor Shawn Connelly, the
agent agreed to try to convince Connelly to be the voice of the main character in Seth's new film, Cloven.

Under Seth's charming yet irresistible pressure, the agent told Seth that — just between the two of them — he thought he
had a 40% chance of persuading Connelly to take the role. Connelly's fame is such that by lending the prestige of his name
to Cloven, the likelihood of Cloven's success increases dramatically.

Seth has booked the services of a crack animation team, so he'll have to start production soon — probably before Connelly
makes a decision. But that shouldn't hinder production because Connelly can add his voiceovers well after the animation
has been created.

Seth has not yet signed either of the two deals he negotiated, and feels he should reconsider his options in light of
Connelly's possible participation. With a decent shot at star power behind Cloven, Seth feels he can close a better deal
with Pony. Connelly's voice would also substantially increase the likelihood of blockbuster success, making the deal with
K2 potentially more lucrative.

Seth returns to Pony and hammers out a second agreement, contingent on Connelly's participation. Pony will pay an
additional $5 million — $15 million in total — if Seth can retain Connelly's voice acting services. Even after accounting for
Connelly's salary, Seth expects to make a total of $2.2 million in profits from the Pony deal if Connelly agrees to take the
role.

How should all of this new information affect Seth's decision? Let's take a look at the new decision tree.

If Seth chooses the Pony deal, there are now two possible scenarios: in the first, Connelly agrees to lend his voice to
Cloven, in the second, Connelly declines. These two scenarios are associated with two different profit outcomes: $2.2
million and $1 million, respectively.

Based on Connelly's agent's assessment, there is a 40% chance that Connelly will take the role, and a 60% chance that he
won't.

What about the K2 option? Which of the following decision trees correctly reflects the newly introduced circumstances?

The nodes of a decision tree are arranged from left to right in the order in which the decision maker will eventually know
their results. In this case, Seth will first discover whether or not Connelly takes the role; only then will he see how Cloven
performs at the box office.

What are the probabilities of the different branches? As in the Pony option, the first branching in the K2 option is a split
between the scenarios in which Connelly signs onto the Cloven project and those in which he doesn't. These branches are
associated with probabilities of 40% and 60%, respectively.

What are the probabilities of the six final branches? Let's look closely at the top three branches. Once we know that
Connelly has signed on, we need to know what the respective probabilities are of a "Blockbuster," a "Lackluster"
performance, and a "Flop." In other words, we need the conditional probabilities of the three outcomes, given that
Connelly takes part in Cloven.

In any decision tree, to "be at a node" is to assume that all the events on the path leading to that node from the left have
already taken place. Thus, the data we associate with any decision or chance node depend directly on the sequence of
events on the unique path leading up to that node from the left.

Specifically, the probabilities after a chance node must be conditioned on all events on the preceding path, and the
outcome values must incorporate the effects of all of the preceding events.

Returning to Seth's decision: if Connelly takes part in Cloven, Seth estimates at 50% each the conditional probability that
Cloven will be a "Blockbuster" and the conditional probability that it will have a "Lackluster" theater run. At this stage in
Connelly's career, a "Flop" is virtually impossible. If Connelly doesn't take part, the conditional probabilities are simply
the original probabilities of 30%, 50%, and 20% for the three possible outcomes.

Using these conditional probabilities, we can determine which of the two options Seth should choose by calculating the
EMVs of all the nodes and folding back the decision tree.

Let's begin with the Pony option. What is its EMV? Enter the EMV in $millions as a decimal number with two digits to the
right of the decimal point (e.g., enter "$5,500,000" as "5.50"). Round if necessary.

The EMV of the Pony option is $1.48 million.

Now we find the EMV of the K2 option, constructing it step by step. First, we must find the EMV for the event that
Connelly decides to voice the main character in Cloven. That EMV is $3 million.

Next, we find the EMV for the event in which Connelly refuses the part. That EMV is simply the EMV of the original K2
option: $1.4 million.

The total EMV for the K2 option is $2.04 million: the EMV with Connelly's participation weighted by the probability of
his participation, plus the EMV of Cloven filmed without Connelly, weighted by the probability that he refuses the part.

The possibility that Shawn Connelly might take part greatly enhances the attractiveness of the K2 option.
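As a check on the fold-back arithmetic, here is a minimal Python sketch. It uses only the figures quoted above — the Pony payoffs and the two K2 branch EMVs — so the numbers are the course's; the code itself is just an illustrative way to weight and compare them.

# Fold back Seth's decision tree using the figures quoted above.
p_connelly = 0.40

# Pony deal: $2.2M profit if Connelly signs, $1.0M if he doesn't.
emv_pony = p_connelly * 2.2 + (1 - p_connelly) * 1.0          # 1.48

# K2 deal: branch EMV of $3.0M with Connelly, $1.4M (the original K2 EMV) without.
emv_k2 = p_connelly * 3.0 + (1 - p_connelly) * 1.4            # 2.04

print(f"Pony EMV = ${emv_pony:.2f}M, K2 EMV = ${emv_k2:.2f}M")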

Summary
The eventual outcomes of many decisions involve sequential uncertain events whose outcomes are determined over
time. To conduct a decision analysis in such cases, we need the probabilities of the first uncertain event's possible
outcomes and conditional probabilities of future uncertain events' possible outcomes, conditioned on the previous
events' outcomes.

Exercise 1.: Captain Ahab Fisheries


Captain Ahab Fisheries cans herring and sardines in either spicy tomato sauce or vegetable oil. The table to the right
summarizes the distribution of Ahab's product line. What is the marginal probability that a randomly selected can of
fish is canned in tomato sauce?

Enter the probability as a decimal number with three digits to the right of the decimal point (e.g., enter "50%" as
"0.500"). Round if necessary.

The marginal probability of tomato sauce is 23%. Similarly, the marginal probability of vegetable oil is 77%.

What is the marginal probability that a randomly selected can of fish contains sardines?

Enter the probability as a decimal number with three digits to the right of the decimal point (e.g., enter "50%" as
"0.500"). Round if necessary.

The marginal probability of sardines is 40%. Similarly, the marginal probability of herring is 60%.

What is the conditional probability that a randomly selected can of fish is canned in tomato sauce, given that it is a can
of sardines?

Enter the probability as a decimal number with three digits to the right of the decimal point (e.g., enter "50%" as
"0.500"). Round if necessary.

The conditional probability of tomato sauce given that a can contains sardines is 35%.

What is the conditional probability that a randomly selected can of fish is canned in tomato sauce, given that it is a can
of herring?

Enter the probability as a decimal number with two digits to the right of the decimal point (e.g., enter "50%" as "0.50").
Round if necessary.

The conditional probability of tomato sauce given that a can contains herring is 15%.

Which is larger, the conditional probability that a randomly selected can of fish contains sardines, given that it is
canned in vegetable oil, or the conditional probability that a randomly selected can of fish is canned in vegetable oil,
given that it contains sardines?

The conditional probability of a can containing sardines given that it contains vegetable oil is 34%. The conditional
probability of the fish being canned in vegetable oil given that the can contains sardines is 65%. Note that P(Sardines |
Oil) is not the same as P(Oil | Sardines).
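The course's product-line table is not reproduced in this text, but a joint-probability table consistent with the answers above would be: sardines & tomato 14%, sardines & oil 26%, herring & tomato 9%, herring & oil 51%. Under that assumed table, a short Python sketch of the marginal and conditional calculations:

# Reconstructed joint probabilities consistent with the answers above.
joint = {
    ("sardines", "tomato"): 0.14, ("sardines", "oil"): 0.26,
    ("herring",  "tomato"): 0.09, ("herring",  "oil"): 0.51,
}

def marginal(value, axis):
    # axis 0 = fish type, axis 1 = packing medium
    return sum(p for key, p in joint.items() if key[axis] == value)

p_tomato = marginal("tomato", 1)                                        # 0.230
p_sardines = marginal("sardines", 0)                                    # 0.400
p_tomato_given_sardines = joint[("sardines", "tomato")] / p_sardines    # 0.35
p_sardines_given_oil = joint[("sardines", "oil")] / marginal("oil", 1)  # ~0.34
p_oil_given_sardines = joint[("sardines", "oil")] / p_sardines          # 0.65
print(round(p_tomato, 3), round(p_sardines, 3), round(p_tomato_given_sardines, 3),
      round(p_sardines_given_oil, 3), round(p_oil_given_sardines, 3))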

Exercise 2: Mutually Exclusive and Collectively Exhaustive


The table to the right shows three possible outcomes of an uncertain event. The fact that the sum of the three
probabilities is less than 100% tells us

We cannot infer anything about whether or not these events are mutually exclusive from the information provided. The
events A, B, and C on the left are mutually exclusive. The events D, E, and F on the right are not mutually exclusive.

The table to the right shows three possible outcomes of an uncertain event. The fact that the sum of the three
probabilities is less than 100% tells us

For a set of events to be collectively exhaustive, the probabilities of the events must sum to at least 100%.

The table to the right shows three possible outcomes of an uncertain event. The fact that the sum of the three
probabilities is greater than 100% tells us

The probabilities of the events in a mutually exclusive set cannot add up to more than 100%. If the probabilities of a set
of events add up to more than 100%, then at least two of them are not mutually exclusive, i.e., they can occur
simultaneously.

The table to the right shows three possible outcomes of an uncertain event. The fact that the sum of the three
probabilities is greater than 100% tells us

We cannot infer anything about whether or not these events are collectively exhaustive from the information provided.
Below are examples of events with these probabilities, of which one set is collectively exhaustive and the other isn't.

The Value of Information


Market research can be expensive. If the cost of the research outweighs its value to Leo, then obviously he shouldn't spend
money on the research. "We'll need to assess the value of Leo's market research," Alice tells you, "to determine whether or
not paying for it is a wise choice."

The Expected Value of Perfect Information


The Eris Shoe Company is considering sourcing some of its production from the developing country Arboria. Arboria was
a land of civil war and strife until a controversial UN intervention two years ago reconciled warring factions and helped
install a national unity government. Today, guerrilla warfare persists in the mountains, but the major coastal cities are
relatively secure.

Eris CEO Emily Ville has identified Arboria as a candidate location due to its low labor costs and is considering opening a
small buying office in the capital city. However, she also recognizes the potential for substantial risk if civil war should
break out, including the loss of Eris' investments in the office and the disruption of its supply chain.

Emily estimates the present value of sourcing from Arboria at $1 million in net savings as long as the Arborian production
facility and infrastructure work at a high level of reliability. In her opinion, the probability of such a "Serene" scenario is
40%.

Emily believes there is a 40% chance of a more "Troubled" environment beset with minor supply chain disruptions, which
would reduce Eris' expected cost savings to $0.5 million. Finally, she estimates the likelihood of major political unrest at
20%, and the net present value of the losses associated with such a "Chaotic" scenario at $1 million.

Instead of sourcing from Arboria, Eris could continue its current sourcing agreements, which would not result in any cost
reduction.

The EMV of sourcing from Arboria is $0.4 million. Based on the EMV criterion, Emily should choose to source from
Arboria. However, given the stakes involved, Emily would like to have additional information that would increase her
understanding of Arboria's political situation.

Ideally, she'd like to base her decision on perfect information, i.e., she'd like to know now exactly which one of the three
outcomes will materialize. Such perfect information would be extremely valuable: how much should Emily be willing to
pay for it?

Managers often have the opportunity to gather more data or expertise to inform their business decisions. Depending on
the business context, new information can be obtained from statistical data, expert consultants, or scientific tests and
experiments. In many cases, such information can improve the accuracy of the estimates incorporated into a decision
analysis.

However, the cost of such data can drag down the bottom line. Clearly, the cost of additional information must be
weighed against its value. How can managers assess the value of information?

Suppose — for the sake of argument — that Emily knows a fortune-teller, Frieda Featherlight, who has genuine psychic
powers and can predict the future flawlessly, giving Emily the perfect information she craves. How much should Emily
pay Frieda for her supernatural services?

To answer this question, we calculate the Expected Value of Perfect Information — the EVPI — provided by Frieda.
We can determine this value by framing a decision: should Emily purchase Frieda's information before she makes her
decision to source from Arboria?

Once we frame it as a decision problem, we can determine the EVPI using familiar tools: a decision tree and the expected
monetary value. Let's construct a tree for Emily's decision, beginning with a decision node that branches into two options:
"Buy the information" or "Don't buy the information."

The lower branch — "Don't buy the information" — is simply the original tree for the decision about whether or not to
source from Arboria. This tree uses only the information Emily already possesses. The EMV of this option is $0.4 million.

If Emily engages Frieda's services and "buys the information," three possibilities emerge based on which scenario Frieda
predicts — "Serene," "Troubled," or "Chaotic." If the probabilities of these scenarios actually occurring are 40%, 40%, and
20%, respectively, what is Emily's best assessment of the likelihood that Frieda will predict each scenario?

Without further information, Emily has no reason to change her initial assessments. Unless she probes Frieda for
information, Emily's best guess that Frieda will predict each outcome — "Serene," "Troubled," and "Chaotic" — is the
same as her best estimates for the probabilities that those outcomes will actually occur: 40%, 40%, and 20%, respectively.

Let's assume for the moment that Frieda won't charge Emily for her services. The beauty of perfect information is that
Emily can delay her decision until after she has heard Frieda's completely accurate prediction of what the Arborian
political climate will be. If Frieda predicts a "Serene" sourcing experience, then Emily will choose to source from Arboria,
and cost savings of about $1 million are certain.

Likewise, if Frieda predicts a "Troubled" sourcing experience, Emily will source from Arboria, and cost savings of about
$0.5 million are certain. If Frieda predicts a "Chaotic" sourcing experience, then Emily will choose not to source from
Arboria, thereby avoiding a certain loss in the $1 million range. In this case, Eris will receive no cost savings.

Still working under the assumption that Frieda won't charge for her services, what is the EMV of "buying the
information"?

Enter the EMV in $millions as a decimal number with two digits to the right of the decimal point (e.g., enter
"$5,500,000" as "5.50"). Round if necessary.

The EMV of "buying the information" is $600,000.

At $600,000, the EMV of "buying the information" is $200,000 higher than the EMV of "not buying the information."
This difference between the EMVs of the option with perfect information and without perfect information is the expected
value of perfect information (EVPI). The EVPI of $200,000 is the maximum amount Emily should pay Frieda for a
séance.
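A minimal Python sketch of this EVPI calculation, using only the scenario probabilities and cost-savings figures stated above:

# Eris / Arboria: expected value of perfect information (EVPI), in $ millions.
probs   = {"Serene": 0.40, "Troubled": 0.40, "Chaotic": 0.20}
savings = {"Serene": 1.0,  "Troubled": 0.5,  "Chaotic": -1.0}

# Without information: source from Arboria if the EMV beats the $0 status quo.
emv_source = sum(probs[s] * savings[s] for s in probs)             # 0.4
emv_no_info = max(emv_source, 0.0)                                 # 0.4

# With perfect information: for each predicted scenario, take the better choice.
emv_perfect = sum(probs[s] * max(savings[s], 0.0) for s in probs)  # 0.6

evpi = emv_perfect - emv_no_info                                   # 0.2, i.e., $200,000
print(f"EVPI = ${evpi:.2f} million")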

The EVPI establishes an upper bound on what we should pay for perfect information. In some business cases, perfect —
or near perfect — information may be available without supernatural means.

For instance, suppose that aerospace company Airbus wants to know if it could perfect all of the technologies necessary to
create a safe and reliable SpaceCruiser, a space shuttle for tourists that would have all the comforts of a luxury cruise
ship.

Airbus could build a small but completely functional prototype and see if it works. Although Airbus might collect near
perfect information by doing so, the cost of developing the SpaceCruiser prototype would likely be higher than the
expected value of that perfect information.

Instead, Airbus might first use computer simulations and limited functionality prototypes to assess the technical viability
of the SpaceCruiser. The information thus gained wouldn't be perfect, but it would be helpful and it would cost far less
than a fully functional prototype.

The EVPI establishes an upper bound on the value of any information, perfect or flawed. Through sampling, educated
expertise, or imperfect testing, we might be able to gain better — though not perfect — estimates of the probabilities and
the outcome values of our decision's possible scenarios.

Shortly, we'll learn how to value imperfect information — information that reduces, but does not eliminate, our
uncertainty about future events. For the time being, we know that we should never spend more for imperfect information
than we are willing to pay for perfect information.

In her quest for an infallible choice in the Arborian sourcing decision, if Emily won't pay more than $200,000 for perfect
information, she certainly shouldn't pay more than $200,000 for imperfect information. If the price of imperfect
information exceeds the EVPI, we should not expend resources on it.

Summary
As managers, we would like to know exactly which outcomes will occur so we can make the best decisions. We can
calculate the expected value of such perfect information — EVPI — to find an upper limit on the amount we
would be willing to pay for any additional information. To calculate the EVPI, we first frame the decision problem "to
buy or not to buy the perfect information." We then subtract the EMV of not buying perfect information from the EMV of
buying perfect information — assuming it's free — to find the EVPI.

The Expected Value of Sample Information


In almost all circumstances, perfect information is either impossible to assemble or prohibitively expensive. In these
cases, we use imperfect information — often called "sample" information — to inform our decision. As with perfect
information, the cost of sample information must be weighed against its value. As managers, how do we assess the
value of sample information?

Let's return to the Eris Shoe Company's decision to source production in Arboria. Emily Ville has distinguished three
possible scenarios for the Arborian political climate: "Serene," "Troubled," and "Chaotic," which she believes have
probabilities of 40%, 40%, and 20%, respectively. The outcomes she associates with these scenarios — in cost savings
for Eris — are $1 million, $0.5 million, and -$1 million, respectively.

Emily would like to improve her estimates of the probabilities. She contacts PoliFor, a consulting company that
specializes in business outlook intelligence. PoliFor produces an analysis of a country's or region's political climate, and

then provides a risk assessment specific to the needs of its client, distinguishing between "High" and "Low" risk
situations.

In a world of uncertainty, PoliFor's assessments are not always accurate. Sometimes, regions with "Low" risk
assessments burst into revolutionary flame. Sometimes, regions with "High" risk assessments turn out to be as stable
and placid as the moon's orbit. How high a price should Emily pay for PoliFor's services given that PoliFor's
assessments are not perfectly reliable?

Based on preliminary conversations with PoliFor's lead consultant, Emily assesses the probabilities that PoliFor will
report "High" or "Low" risk for sourcing in Arboria at 30% and 70%, respectively. She then considers how each of these
risk level assessments would affect her estimates of the relative likelihood of her three representative scenarios:
"Serene," "Troubled," and "Chaotic."

Emily believes that if PoliFor predicts "Low" risk, then the likelihood of a "Chaotic" political climate would be low: 5%,
and the probabilities of "Troubled" or "Serene" scenarios would be 40% and 55%, respectively. If we use "Low" to
represent PoliFor predicting a low-risk political environment, which of the following best expresses the beliefs Emily
has summarized above?

The information upon which Emily is basing her assessments is the PoliFor prediction of a "Low" risk political climate.
Thus, she is estimating conditional probabilities that Eris' experience sourcing from Arboria will be "Serene,"
"Troubled," or "Chaotic," given that the consultants predict a "Low" level of risk.

Similarly, Emily assesses the conditional probabilities that Eris' experience sourcing from Arboria will be "Serene,"
"Troubled," or "Chaotic" given that the consultants predict a "High" level of risk at 5%, 40%, and 55%, respectively.
How do we calculate the expected value of the imperfect information PoliFor offers?

As we did for perfect information, we frame this question as a decision problem: should Emily purchase the imperfect
information from PoliFor or not? First, we'll construct a decision tree. The tree begins with a decision node for the
choice Emily faces: "Buy the information" or "Don't buy the information".

The lower branch for the option "Don't buy the information" is the original tree that Emily constructed using her initial
estimates for the outcome values and probabilities of the three basic scenarios. What should emanate to the right from
the "Buy Information" branch?

If we are at the end of the "Buy Information" branch, we assume that Emily has purchased the information. In that
case, the first thing she will do is open the report envelope and examine the information to learn what level of risk
PoliFor has predicted. Thus, the upper branch for the option "Buy the information" leads to a chance node that splits into
two branches, one for each risk level the consulting company might predict.

What should emanate from each of the Low and High branches?

The information that Emily has purchased is valuable only if she uses it to support her decision-making. The value of
the information resides in Emily's ability to make her sourcing decision after learning the level of risk PoliFor predicts.
Thus, after each prediction, we place a decision node: should she source from Arboria or continue to source from her
current suppliers?

To complete the decision tree, we must incorporate the possible scenarios that can occur if Emily chooses to source
from Arboria: "Serene," "Troubled," or "Chaotic." What probabilities should we assign to the three branches emanating
from the chance node highlighted to the right?

At the highlighted node, we assume that every event on the unique path leading from the node back to the beginning of
the tree has transpired: Emily bought the information; PoliFor reported "High" risk; and Emily chose to source from
Arboria. Thus, the probabilities assigned to the branches must be conditioned on those events. For example, we assign
P(Serene | High) to the "Serene" scenario on the "High" branch.

Now we complete the tree, substituting the values for the conditional probabilities and placing the appropriate
monetary values at the endpoints. As usual, we'll assume for the moment that the information is free, and start to fold
back the tree. What is the EMV of the "High Risk" branch's decision node?

If PoliFor predicts a "High" risk level, Emily should not source from Arboria, since the EMV of the "Source from
Arboria" node, -$0.3 million, is less than the EMV of the "Don't source" node, $0 million. Instead, she should
continue her current supply arrangements, in which case she will realize $0 in cost savings.

Folding back the decision tree, we find an EMV of $490,000 for the option "Buy the information."

What is the most Emily should pay for the imperfect information?

The difference between the EMVs of the "Buy" and "Don't buy the information" options is $90,000, the expected value
of imperfect (or sample) information (EVSI) provided by PoliFor. The EVSI is the maximum amount that Emily should
be willing to pay for PoliFor's report.

As we would anticipate, the expected value of this imperfect information is quite a bit less than $200,000, the expected
value of perfect information we computed earlier.
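A short Python sketch of this EVSI fold-back, using Emily's report probabilities and the conditional (posterior) scenario probabilities given above; the figures are the course's, the code structure is illustrative.

# Eris / PoliFor: expected value of sample (imperfect) information, in $ millions.
savings = {"Serene": 1.0, "Troubled": 0.5, "Chaotic": -1.0}
p_report = {"Low": 0.70, "High": 0.30}
posterior = {                                   # P(scenario | PoliFor report)
    "Low":  {"Serene": 0.55, "Troubled": 0.40, "Chaotic": 0.05},
    "High": {"Serene": 0.05, "Troubled": 0.40, "Chaotic": 0.55},
}

def emv_after(report):
    # After seeing the report, Emily sources only if the conditional EMV beats $0.
    emv_source = sum(posterior[report][s] * savings[s] for s in savings)
    return max(emv_source, 0.0)

emv_buy = sum(p_report[r] * emv_after(r) for r in p_report)   # 0.49
emv_dont_buy = 0.40                                           # original EMV of sourcing
print(f"EVSI = ${emv_buy - emv_dont_buy:.2f} million")        # 0.09, i.e., $90,000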

PoliFor's information, although imperfect, has value because it allows Emily to update her original probability
estimates based on the risk level PoliFor predicts. She can make better decisions based on these more accurate updated
probability assessments.


The probability estimates we started with — Emily's estimates for P(Chaotic), etc. — are called prior probabilities. The
updated conditional probabilities — P(Chaotic | High Risk), P(Troubled | High Risk), etc. — are called posterior
probabilities.

Summary
By investing in professional expertise, statistical studies or tests, managers can often improve their understanding of
uncertain events. Such imperfect (or "sample") information is itself a source of uncertainty because it isn't a perfect
predictor of the outcomes of interest. Although imperfect information can improve our predictions, such information
comes at a price. Responsible managers calculate the expected value of sample information (EVSI), and purchase
information only if its expected value exceeds its cost. To calculate the EVSI, we pose the decision problem "to buy or
not to buy the sample information," and subtract the EMV of not buying the information from the EMV of buying it —
assuming that it's free.

Updating Prior Probabilities


Zeke "Claw" Crankshaw owns and runs Oligarchol, one of the largest independent oil producers in the Western
Hemisphere.

Recently, Zeke acquired the mineral rights to a piece of land near Chanchito Gordo Canyon. Based on his own
professional assessment, Zeke believes that he has a 50% chance of finding an oil field if he drills in Chanchito Gordo,
and a 50% chance of drilling nothing but "Dry" holes.

Like any subjective probabilities, these estimates are the product of Zeke's educated guesswork. Clearly, the oil is either
beneath the surface at Chanchito or it isn't. However, Zeke's experience tells him, for example, that in about half of all
similar situations — similar terrain, similarly exposed rock strata, etc. — oil prospectors have struck oil.

He estimates the value of striking oil — in present value of future profits — to be $8 million after netting out drilling
costs. If all of his boreholes come up "Dry," Zeke will have lost the $2 million cost of drilling. With these assessments in
mind, Zeke must decide if he should drill on the property.

If Zeke decides not to drill, his investment in the mineral rights will not pay off in the way he had hoped. However,
since the purchase price of the mineral rights has already been incurred, it is a sunk cost that shouldn't bear on the
decision.

What is the expected monetary value of drilling in Chanchito Gordo?

Enter the EMV in $millions as a decimal number with two digits to the right of the decimal point (e.g., enter
"$5,500,000" as "5.50"). Round if necessary.

The EMV of drilling in Chanchito Gordo is $3 million.

Without further information, Zeke should start shipping drill bits and oil rigs over the dusty back-country roads to
Chanchito. But old Zeke is not the impetuous young whippersnapper of the early days of Oligarchol, back when the
company was derisively referred to in industry circles as "Pipe Dreams, Incorporated."

Now, Zeke is happy to use the latest in seismic testing technology to gain better information and make more informed
decisions. Zeke can order a $100,000 seismic test of the Chanchito Gordo property. Although the seismic test will
improve his estimate of the likelihood of finding oil, it won't provide perfect information.

The seismic test will report one of two possible outcomes: a "Positive" or a "Negative" result, depending on whether or
not the test indicates the presence of oil. Because the seismic test's accuracy is a key determinant of its value, Zeke has
gathered some historical data on the test's performance.

Zeke believes that if there is an oil field on his property, then the probability that the test will return a "Positive" result
is 90%. This is the conditional probability P(Positive | Oil). Clearly, the test isn't perfect — even if there is a reserve on
his property, there is a 1 in 10 chance the test will fail to detect it. This is the conditional probability of a "False
Negative" result: P(Negative | Oil) = 10%.

And, unfortunately for Zeke, even when there isn't a drop of oil to be found, the test will still report a positive result — a
false "Positive" — with probability 30%. The probability that the seismic test will indicate the absence of oil when there
is, in fact, no oil — P(Negative|Dry) — is 70%. How can Zeke use these data to calculate the EVSI — the expected value
of the test information?

As usual, we start with a decision node that branches into two options: "Test" or "Don't Test." The lower branch, "Don't
Test," is the same as Zeke's original tree for whether to drill or not. The EMV for the "Don't Test" branch is $3 million;
it is based only on the information Zeke previously had available to him.

If Zeke buys the test, he will learn the test's result — positive or negative — before he has to decide whether to drill or
not. What probability should Zeke assign to the "Test Positive" branch?

The probability that Zeke should assign to the "Test Positive" branch is P(Positive). However, this probability is not in
the information Zeke originally had available, which is summarized in the table to the right.

To find the P(Positive), we must first find the joint probabilities and enter them into the table. What is P(Oil &
Positive)?

Enter the probability as a decimal number with three digits to the right of the decimal point (e.g., enter "50%" as
"0.500"). Round if necessary.

To find P(Oil & Positive), we use the familiar relationship for conditional probabilities: P(Oil & Positive) =
P(Positive | Oil) × P(Oil) = 90% × 50% = 45%. If you wish to practice, calculate the other joint probabilities and
P(Positive) and P(Negative) before advancing to the next screen.

Adding the joint probabilities, we find that P(Positive) = 60% and P(Negative) = 40%.

We are now ready to complete the tree. After Zeke learns the test result, he makes a decision about whether or not to
drill. If he drills, he will either strike oil or not. What probability should we associate with the branch highlighted to the
right?

If Zeke is at the highlighted node, then all the events on the branch leading up to the node must have occurred: he
decided to test, the test was positive, and he decided to drill. Thus, Zeke must condition the probability of striking oil on
past events: specifically, on the fact that the test report was "Positive." Thus, he should associate the conditional
probability P(Oil | Positive) with the highlighted "Oil" branch.

We find the conditional probabilities in the usual way. The full table of joint, marginal, and conditional probabilities is
shown to the right.

Our tree now has all the necessary data. We fold it back and find that the EMV for conducting a seismic test —
assuming the test is free — is $3.3 million.

What is the EVSI of the seismic test?

Enter the EVSI in $millions as a decimal number with two digits to the right of the decimal point (e.g., enter
"$5,500,000" as "5.50"). Round if necessary.

The EVSI is $300,000 — the difference between the EMV of conducting the test, $3.3 million, and the EMV of not
conducting the test, $3 million. This is the maximum amount Zeke should be willing to spend on the seismic test. Since
the expected value of the test exceeds its $100,000 cost, Zeke should order the test.
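The whole calculation — joint probabilities, marginals, posteriors, and the EVSI — can be checked with a brief Python sketch of the figures above:

# Zeke's seismic test: Bayesian updating and EVSI, in $ millions.
p_oil = 0.50                                    # prior probability of an oil field
reliability = {("Positive", "Oil"): 0.90, ("Negative", "Oil"): 0.10,
               ("Positive", "Dry"): 0.30, ("Negative", "Dry"): 0.70}
payoff = {"Oil": 8.0, "Dry": -2.0}              # net of drilling costs

prior = {"Oil": p_oil, "Dry": 1 - p_oil}
# Joint probabilities P(result & state) and marginal probabilities of each test result.
joint = {(r, s): reliability[(r, s)] * prior[s] for (r, s) in reliability}
p_result = {r: joint[(r, "Oil")] + joint[(r, "Dry")] for r in ("Positive", "Negative")}

def emv_after(result):
    # Posterior P(Oil | result); drill only if the conditional EMV beats $0.
    p_oil_post = joint[(result, "Oil")] / p_result[result]
    emv_drill = p_oil_post * payoff["Oil"] + (1 - p_oil_post) * payoff["Dry"]
    return max(emv_drill, 0.0)

emv_test = sum(p_result[r] * emv_after(r) for r in p_result)    # 3.3
emv_no_test = max(p_oil * 8.0 + (1 - p_oil) * (-2.0), 0.0)      # 3.0
print(f"P(Positive) = {p_result['Positive']:.2f}, EVSI = ${emv_test - emv_no_test:.2f}M")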

Summary
Often, information is not available in the form we need to inform our decision process and the calculation of the
EVSI. Frequently, for example, we have access to data about the reliability of a test. The reliability of a test is given as
the set of conditional probabilities P(TR | S), where TR is a test result and S is one of the scenarios the test is
intended to help predict. Collectively, these conditional probabilities reveal the test's ability to predict the outcome of
the uncertain event in question. Using this information, we update our initial estimates for the probabilities of each
scenario — the prior probabilities P(S) — to obtain improved, conditional probabilities of each scenario, given the
result of the test — the posterior probabilities P(S | TR). We use the posterior probabilities to calculate the EVSI and aid
our decision process.

Solving the Market Research Problem


You know how to calculate the expected value of information and set an upper limit on the amount Leo should be willing
to pay for market research. But what kind of market research could Leo have in mind? And would it be worth its cost?

Leo wants to know how much money he should be willing to spend on additional information about consumers' likely
reaction to a floating restaurant. First, you calculate the expected value of perfect information to give Leo an absolute
upper bound on the amount he should pay for any kind of information.

You begin by setting up a decision tree to help inform the decision of whether or not to buy perfect information. Which of
the following best represents the "Buy Information" branch of the perfect information tree?

Whatever the source of perfect information, it will predict the occurrence of either a "Phenomenon" or a "Fad." Which
one of these scenarios it will predict is uncertain; the respective probabilities are equal to the probabilities of the two
scenarios: 35% and 65%.

Once Leo has perfect information, he can choose to either launch the Tethys venture or not.

What is the EMV of acquiring the perfect information, assuming that it is free?

Enter the EMV in $millions as a decimal number with two digits to the right of the decimal point (e.g., enter
"$5,500,000" as "5.50"). Round if necessary.

If Leo is certain that the Chez Tethys will be a "Phenomenon," he will launch her and realize a $2 million profit. If he is
certain that the Tethys would be a mere "Fad," he'll avoid the $800,000 loss by not launching, thereby realizing no
profits or losses in the floating restaurant industry. The EMV of acquiring perfect information is $700,000.

Recall that the EMV of launching the Tethys — based on Leo's original estimates — was $180,000. What is the expected
value of the perfect information?

Enter the EMV in $millions as a decimal number with two digits to the right of the decimal point (e.g., enter
"$5,500,000" as "5.50"). Round if necessary.

The expected value of perfect information is the difference between $700,000 — the EMV of acquiring the perfect
information — and the EMV of not acquiring the perfect information. In this case, the EMV of not acquiring the perfect
information is the original EMV of launching the Chez Tethys, $180,000. The difference is $520,000.

That's a lot of money! I have an idea for a market research event that is expensive, but it's much cheaper than that. Let's
get to it!

Not so fast, Leo. Up to $520,000 is what you should be willing to pay for perfect information. But real world information
is usually much less than perfect. What exactly do you have in mind?

You two are going to love this. I'll rent a cruise ship and do a week-long test cruise around the islands, inviting guests at
the finest hotels on board for free. At dinner, I'll run an advertisement for the Tethys, then survey all the guests and ask
them if they'd be interested in dining there!

That's an audacious market research plan for an audacious business venture. I like the idea, though, and I think you'll get
much better information than if you hire a firm to do conventional market research. Let's try to estimate the value of this
information.

The three of you debate the merits of Leo's marketing event. You split the results of Leo's event into two response
scenarios based on how Leo's guests/subjects receive his demonstration: "Enthusiastic" and "Tepid," estimating the
probabilities of these responses at 40% and 60%, respectively.

If the subjects respond to Leo's demonstration "Enthusiastically," you believe that the Tethys becoming a "Phenomenon"
is much more likely than your earlier estimate: 65%. On the other hand, if the reception is "Tepid," you estimate that the
probability of a "Phenomenon" is a mere 15%.

Given these data and the tree to the right, what is the expected value of Leo's sample information?

Enter the EVSI in $millions as a decimal number with four digits to the right of the decimal point (e.g., enter
"$5,555,000" as "5.5550"). Round if necessary.

To calculate the EVSI, calculate the EMV of Leo's running his planned event. First, calculate the EMV of launching the
Tethys given that the event participants' response is "Enthusiastic." The EMV is the sum of the outcomes of the
"Phenomenon" and "Fad" scenarios, after they have been weighted by the conditional probabilities that these scenarios
will occur given an "Enthusiastic" reception: $1.02 million.

Since the EMV for launching the Tethys is higher than the EMV for not launching her, Leo should embark on the Tethys
venture if his guests/subjects respond "Enthusiastically" to his event. Thus, the EMV if guests give an "Enthusiastic"
response to Leo's event is $1.02 million.

Next, calculate the EMV of launching the Tethys given that the event participants' response is "Tepid." The EMV is the
sum of the outcomes of the "Phenomenon" and "Fad" scenarios, after they have been weighted by the conditional
probabilities that these scenarios will occur given a "Tepid" reception: -$380,000.

Since the EMV for launching the Tethys is lower than the EMV for maintaining the status quo at the Kahana, Leo should
not embark on the Tethys venture if the participants' response is "Tepid". Thus, the EMV of a "Tepid" response to Leo's
event is $0.

The EMV of Leo's event is $408,000: the sum of the EMVs of the "Enthusiastic" and "Tepid" responses, after they have
been weighted by the probabilities that these responses will occur.

The expected value of Leo's sample information is the difference between the EMV of his event and $180,000 — the EMV
of launching the Tethys without further information. The difference is $228,000.
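A minimal Python sketch of both the EVPI and EVSI calculations for Leo's decision, using the probabilities and outcomes given above:

# Leo's floating restaurant: EVPI and EVSI, in $ millions.
profit = {"Phenomenon": 2.0, "Fad": -0.8}
prior  = {"Phenomenon": 0.35, "Fad": 0.65}

emv_launch = sum(prior[s] * profit[s] for s in prior)       # 0.18
emv_perfect = prior["Phenomenon"] * profit["Phenomenon"]    # 0.70 (launch only if "Phenomenon")
evpi = emv_perfect - emv_launch                             # 0.52

p_response = {"Enthusiastic": 0.40, "Tepid": 0.60}
p_phenom_given = {"Enthusiastic": 0.65, "Tepid": 0.15}

def emv_after(response):
    # Launch only if the conditional EMV is better than staying out ($0).
    p = p_phenom_given[response]
    return max(p * profit["Phenomenon"] + (1 - p) * profit["Fad"], 0.0)

emv_event = sum(p_response[r] * emv_after(r) for r in p_response)   # 0.408
evsi = emv_event - emv_launch                                       # 0.228
print(f"EVPI = ${evpi:.2f}M, EVSI = ${evsi:.3f}M")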

I talked to some friends last night, and the total cost of renting and running a cruise ship for a week, and shooting an
advertising short comes to at least $240,000.

That's higher than the expected value of this information we calculated: $228,000.

Are you telling me that my event would not be worth its cost?

I'm afraid so, Leo. But that doesn't necessarily mean that you shouldn't do it. You might choose to have your event even if
its expected monetary value is low. It's late. Let's break for tonight and meet over breakfast tomorrow.

Exercise 1: The FUD Grocery Chain


The grocery chain FUD offers two lines of private label products: "FUD Brand," and "Pleasant Valley Fare." The two
brands each are used to label a set of grocery items with very little overlap. For example, "FUD Brand" is used for meat
and cheese products; "Pleasant Valley Fare" for canned goods.

Esther Smith, CEO of FUD, anticipates that eliminating the brand with weaker consumer loyalty and repackaging its
product line under the label of the stronger brand would result in an additional $200,000 in profits from increased
sales and from cost savings on packaging and advertising. If Esther eliminates the stronger brand, she anticipates a net
bottom line improvement of $0 — reductions in sales volume or price discounts would roughly offset cost savings.

Esther doesn't know which of the brands enjoys greater customer loyalty — since the product lines don't overlap, she
can't directly compare the two brands' performances. Initially, she believes that there is a 50% chance that customers
are more loyal to FUD Brand, and a 50% chance that they are more loyal to Pleasant Valley Fare (PVF).

Based on this information, which of the two options - eliminating FUD Brand or Pleasant Valley Fare - has the higher
EMV?

Each option has an EMV of $100,000. Based on Esther's current information, she has no basis for preferring one brand
over the other.

Suppose Esther had access to perfect information, i.e., she could discover with certainty which brand her customers
prefer, so she could eliminate the weaker brand and increase FUD's profits by $200,000. How much should she be
willing to pay for such perfect information?

Enter the expected value of perfect information in $millions as a decimal number with two digits to the right of the
decimal point (e.g., enter "$5,500,000" as "5.50"). Round if necessary.

First, construct the tree for the decision to buy or not to buy the information. The EMV for not buying the information
is $100,000 no matter which brand Esther chooses to eliminate.

There is a 50% chance that the perfect information Esther acquires will reveal that customers prefer the FUD Brand
and a 50% chance it will reveal that customers prefer Pleasant Valley Fare.

Once Esther has the perfect information, she will choose to eliminate the brand that customers are less loyal to, adding
$200,000 to FUD's bottom line.

The EMV of buying perfect information is $200,000. The expected value of perfect information, EVPI, is the difference
between the EMVs of buying and not buying perfect information: $100,000.

Since Esther cannot get perfect information, she's willing to settle for information that is less than perfect and decides
to hire a market research firm to determine which of FUD's two brands customers prefer.

Esther believes that the probability that the market research firm will find that FUD Brand has higher customer loyalty
is 50%. Likewise, the probability that the market research firm will predict that Pleasant Valley Fare has higher
customer loyalty is 50%.

Whichever brand the market research firm indicates is stronger, Esther believes that the finding will be correct with
probability 85%. Thus, there is a 15% chance that the finding the firm provides will be wrong.

What is the expected value of this sample information?

Enter the expected value of sample information in $millions as a decimal number with three digits to the right of the
decimal point (e.g., enter "$5,500,000" as "5.500"). Round if necessary.

First, construct the tree for the decision to buy or not to buy the sample information using the probabilities and
outcomes cited above. The EMV for not buying the information is $100,000 no matter which brand Esther chooses.

Once Esther has the sample information, she will choose to eliminate the brand that the market research indicates is
weaker. The EMV of eliminating the brand that market research predicts enjoys less customer loyalty is $170,000.

The EMV of buying this imperfect information is $170,000. Thus, the expected value of sample information, EVSI, is
the difference between the EMVs of buying and not buying the sample information: $70,000.
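A short Python sketch of the FUD calculation, with figures in $ thousands as above; the two-outcome payoff structure is taken directly from the exercise.

# FUD brand decision: EVPI and EVSI, in $ thousands.
gain_if_right = 200.0        # eliminate the weaker brand
gain_if_wrong = 0.0          # eliminate the stronger brand by mistake

emv_no_info = 0.5 * gain_if_right + 0.5 * gain_if_wrong     # 100
emv_perfect = gain_if_right                                 # always pick correctly: 200
evpi = emv_perfect - emv_no_info                            # 100

# Market research names a brand; Esther drops it. The finding is correct 85% of the time.
p_finding_correct = 0.85
emv_sample = p_finding_correct * gain_if_right + (1 - p_finding_correct) * gain_if_wrong  # 170
evsi = emv_sample - emv_no_info                             # 70
print(f"EVPI = ${evpi:.0f}k, EVSI = ${evsi:.0f}k")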

Exercise 2: PPD Immunity


After a public relations debacle over a possible outbreak of pulluscular pig disorder (PPD) at one of Bowman-Lyons-
Centerville's hog farms, Paul Segal, head of the hog farming division, is considering vaccinating another herd at a hog
farm located near the original outbreak.

Immunity to PPD is fairly common among pigs; the probability that the entire herd in question is immune is fairly
high: 60%.

However, if any pigs in the herd are not immune, the expected cost to Bowman-Lyons-Centerville is $150,000. This
cost has been calculated based on the probability of a PPD outbreak and the ensuing costs to contain the disease and
manage the expected PR fallout.

The cost of vaccinating the herd is $40,000. What is the EMV of the cost of not vaccinating the herd?

Enter the EMV in $thousands as a decimal number with one digit to the right of the decimal point (e.g., enter "$5,500"
as "5.5"). Round if necessary.

The EMV of the cost of not vaccinating the herd is $60,000. Since the cost of vaccinating the herd — $40,000 — is
lower than the expected cost of not vaccinating the herd, Paul should have the herd vaccinated, barring the emergence
of any further information.

There is a test that Paul could perform on the herd to determine if the herd is immune to PPD. The test delivers one of
two results: "Positive," indicating immunity, or "Negative," indicating a lack of immunity.

The test is not perfectly accurate: if the entire herd is immune, the test will report "Positive" with probability 85% and
"Negative" with probability 15%. Similarly, if any of the pigs in the herd are not immune, the test will report "Positive"
with probability 30% and "Negative" with probability 70%.

Using Paul's prior probability for the immunity of the herd, calculate P(Positive), the marginal probability that the test
will report a "Positive" result.

Enter the probability as a decimal number with three digits to the right of the decimal point (e.g., enter "50%" as
"0.500"). Round if necessary.

First, calculate the joint probabilities P(Positive & Immune) and P(Positive & Not Immune), using the marginal
probabilities for immunity and non-immunity — 60% and 40%, respectively — and the conditional probabilities that
quantify the reliability of the test: P(Positive | Immune) and P(Positive | Not Immune).

Then, sum the joint probabilities to find the marginal probability of a "Positive" test result: 63%. Since the test results
"Positive" and "Negative" are mutually exclusive and collectively exhaustive events, the probability of a "Negative" is
simply 100% minus the probability of a "Positive," i.e., P(Negative) is 37%.

The tree below will help Paul determine whether or not ordering the immunity test will be worthwhile. The calculated
probabilities of the test results are entered in the appropriate branches. However, to reach a decision, Paul needs the
conditional probabilities of immunity and non-immunity conditioned on a "Positive" test result. What is P(Immune |
Positive)?

Enter the probability as a decimal number with three digits to the right of the decimal point (e.g., enter "50%" as
"0.500"). Round if necessary.

To calculate P(Immune | Positive), use the definition of conditional probability and divide the joint probability
P(Positive & Immune) by the marginal probability P(Positive). P(Immune | Positive) is 81%. Since immunity and non-
immunity are mutually exclusive and collectively exhaustive events, P(Not Immune | Positive) is simply 100% minus
P(Immune | Positive) i.e., 19%.

Calculate the remaining probabilities on the decision tree. What is the highest amount Paul should be willing to spend
on the immunity test?

Enter the expected value of sample information in $thousands as a decimal number with one digit to the right of the
decimal point (e.g., enter "$5,500" as "5.5"). Round if necessary.

The expected cost of not vaccinating, given a "Positive" test result, is $28,500. Since this is lower than the cost of
vaccination, Paul should choose not to vaccinate on EMV grounds.

The probability that the herd is immune given that the test reports "Negative" is 24.3%. Since the EMV of the cost of
vaccinating the herd ($40,000) is lower than the EMV of the cost of not vaccinating ($113,550) Paul should choose to
vaccinate if the test result is "Negative".

The EMV of conducting the immunity test is $32,755, i.e., the sum of the EMVs of "Negative" and "Positive" test
results, weighted by their respective probabilities.

The expected value of sample information, EVSI, is the difference between the EMVs of testing and not testing for
immunity, i.e., $7,245. The most Paul should be willing to pay for the immunity test is $7,245.
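A brief Python sketch of the full calculation (figures in $ thousands, costs as positive numbers). Note that exact arithmetic gives an EVSI of about $7,200; the $7,245 above appears to come from folding back with the rounded posteriors (19% and 24.3%) quoted in the exercise.

# PPD immunity test: Bayesian updating and EVSI, in $ thousands.
p_immune = 0.60
cost_outbreak = 150.0        # expected cost if the herd is not immune and unvaccinated
cost_vaccine = 40.0
reliability = {("Positive", "Immune"): 0.85, ("Negative", "Immune"): 0.15,
               ("Positive", "NotImmune"): 0.30, ("Negative", "NotImmune"): 0.70}

prior = {"Immune": p_immune, "NotImmune": 1 - p_immune}
joint = {(r, s): reliability[(r, s)] * prior[s] for (r, s) in reliability}
p_result = {r: joint[(r, "Immune")] + joint[(r, "NotImmune")]
            for r in ("Positive", "Negative")}                    # 0.63 and 0.37

def expected_cost_after(result):
    # Posterior P(Not Immune | result); take the cheaper of "don't vaccinate" and "vaccinate".
    p_not_immune = joint[(result, "NotImmune")] / p_result[result]
    return min(p_not_immune * cost_outbreak, cost_vaccine)

cost_with_test = sum(p_result[r] * expected_cost_after(r) for r in p_result)
cost_without_test = min((1 - p_immune) * cost_outbreak, cost_vaccine)   # 40
evsi = cost_without_test - cost_with_test
print(f"P(Positive) = {p_result['Positive']:.3f}, EVSI = ${evsi:.1f}k")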

Risk Analysis
Just as an e-learning course must have a bittersweet last section, so too must your internship on Hawaii have a final day. A
parting Pacific frolic is refreshing preparation for your meeting with Leo.

Introducing Risk
As you dry in the morning sun, Alice invites you to consider Leo's position should he wager the Kahana on the Tethys'
success. "Even if the potential profits are substantial, losing the Kahana if the Tethys tanks would be a severe blow. I
wonder how seriously he's considered the potential downside."

Imagine: upon your arrival at business school you buy yourself a nice new car. It's a sleek, powerful, Burgundy Ben Hur
by Chariot — with both a Sweetone stereo and a Theft Discouragement System — worth over $25,000.

City driving can be rough: the probability that you'll be involved in an accident that "totals" your car — reduces its value
from $25,000 to zero — in your first year is 0.01%. Collision insurance to cover such a loss in your new city is optional: it
will cost you $1,600 per year above and beyond the premium for required liability insurance. Will you buy the collision
insurance?

Most people would pay the premium to mitigate the risks associated with a car accident leading to such large losses. But
let's look at the decision tree for buying collision insurance. For simplicity, we'll ignore accidents that do not total your
car.

After a year in school, your Bennie's market value will have a present value of $25,000. If you don't buy collision
insurance and the car is totaled, you'll lose the full value of the car — $25,000. If — as we all hope — you and your car
survive the city's roads unscathed, you won't lose anything.

Opting for collision insurance entails a $1,600 insurance premium. If you are spared the calamity of a wreck, you'll have
incurred just the premium cost at the end of the year. If traffic tragedy befalls you, your insurance company will pay out
the value of the car minus a $1,000 deductible: so at the end of the year you will be out $2,600: the premium plus the
$1,000 deductible.

Which option is better in terms of the EMV?

At an expected loss of just $2.50, the EMV of not buying collision insurance is much better than the EMV of buying the insurance. Essentially, if
you buy the collision insurance, you are paying $1,600 to avoid an expected loss of $2.50! We shouldn't be surprised —
insurance companies are not charities, and from their perspective, selling you insurance has a very favorable EMV.
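A quick Python check of these two EMVs, with losses shown as negative numbers and using the figures above:

# Collision insurance: EMV of each option.
p_total = 0.0001                     # 0.01% chance the car is totaled this year
car_value, premium, deductible = 25_000, 1_600, 1_000

emv_no_insurance = -p_total * car_value                 # -$2.50
emv_insurance = -(premium + p_total * deductible)       # about -$1,600.10
print(f"No insurance: ${emv_no_insurance:.2f}, Insurance: ${emv_insurance:.2f}")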

Still, a majority of drivers choose collision insurance, even though it has a significantly lower EMV. Why might this be the
case?

Let's consider a different situation. Suppose you owned a bright and shiny miniature windup toy Bennie for $25. Would
you take out a $1.60 insurance policy on it, even if the probability were upwards of 50% that it would be stepped on or
lost in the coming year?

Probably not. A $25 loss is not even a minor catastrophe. If you lost or accidentally destroyed your miniature Bennie,
you'd either replace it quickly, or accept its departure without fuss.

If instead of a nice, new, $25,000 Bennie you owned a 20-year-old, beat-up, rusty Oldsmobile Delta-88, worth about
$2,500, you'd also think twice about paying $160 to insure it against collision. A loss of $2,500 — though stinging — is
unlikely to be a major setback, especially when weighed against a $160 premium.

But, imagine that you — assisted by your MBA degree — amass a multi-million dollar fortune over the next decades.
Would you still insure a $25,000 car against collision? Perhaps not.

Clearly, the value of the potential loss relative to your net worth has an important influence on your willingness to take on
the risk associated with uncertainty. Suppose you are invited to play a gambling game in which you could win $10 million
with 50% probability or lose $2 million with 50% probability. Even though the EMV of playing this game is $4 million,
you'd probably decline participation unless you can afford a $2 million loss.

For most of us, the pain of the $2 million loss would weigh more heavily than the joy of winning $10 million, so we would
not accept this gamble. For someone who can afford a $2 million loss, this game might be very profitable, especially if
played multiple times.

There are formal ways to measure such assessments by assigning a personal utility value to each outcome. Then we can
show that maintaining the status quo — our current assets — provides greater utility than playing the game does.
Rigorous methods of quantifying utility are beyond this course's scope. For now, let's look at how we might gain insight
into the personal utility we associate with different monetary outcomes.

Returning to collision insurance, let's summarize the scenarios and their probabilities and outcome values. You have two
options: "Buy collision insurance" and "Don't buy collision insurance.

If you insure against collision, there are two possible outcomes: either you avoid a wreck for one year and lose the
insurance premium of $1,600, or misfortune strikes and your Bennie is totaled. In the latter case, you lose the premium
and the deductible. The probabilities associated with these two outcomes are 99.99% and 0.01%, respectively. The
possible outcomes and their respective probabilities are listed in the table below.

If you don't insure against collision and park your car safely at year's end, you won't sustain any loss at all. If the
unfortunate alternative occurs and your Bennie is totaled, you'll have lost its value of $25,000. Again, the probabilities of
these outcomes are 99.99% and 0.01%, respectively.

We now have two tables that completely summarize the possible scenarios of our decision. The monetary values of these
scenarios are arranged from top to bottom, from the best outcomes to the worst. These tables — one for each option — are
together referred to as the risk profiles for the decision.

From the risk profiles, it's easy to recognize that the option of not buying insurance will deliver the most preferred
outcome if you don't have an accident but will deliver the least desirable outcome if you do. This insight provides the basis
for you to prefer to buy the collision insurance: it exposes you to lower risk, even though it has a lower EMV.

Summary
The EMV criterion is not the only criterion that informs decisions. Besides wanting to maximize our average outcomes
in the long run, we want to minimize our exposure to risk, especially when potential losses are high relative to our own
net worth.

Risk Attitudes

Recall Seth Chaplin and S&C Films' production of the movie Cloven. Seth has a choice between two options. In a deal
with Pony Pictures, S&C produces the film, and Pony acquires the complete rights to Cloven for a flat production fee. In
a second deal with K2 Classics, S&C retains part ownership, and the financial outcomes for S&C depend on Cloven's
performance at the box office.

Seth knows from previous analysis that the K2 deal has a higher EMV: $2.04 million vs. $1.48 million for the Pony
deal. However, he also knows the K2 deal is riskier. Before making a final decision, Seth wants to understand the full
implications of his two options. Let's build risk profiles for each deal.

There are seven possible scenarios that can occur. The outcome values of these scenarios depend on Seth's choice of a
production partner, on whether or not superstar actor Shawn Connelly lends his gravelly baritone to the film's lead
character, and on the audience's reception of the film.

Let's summarize the possible scenarios. If Seth chooses the Pony option, two scenarios could occur. In one, Connelly
participates in the movie and Seth's profits are $2.2 million. In the other, Connelly does not participate in the movie
and Seth's profits are only $1 million.  If Seth opts for the Pony deal, the probabilities of these two scenarios occurring
are 40% and 60%, respectively.

If Seth takes the K2 option, there are five possible scenarios, but only three possible outcomes: "Blockbuster" success, a
"Lackluster" performance, and a "Flop."

A "Blockbuster" success can occur whether or not Connelly participates in the production. The probability of a
"Blockbuster" is 38%: this probability is calculated by adding the joint probability that Cloven stars Connelly and is a
"Blockbuster" to the joint probability that Cloven is a "Blockbuster" without Connelly's participation.

A "Lackluster" performance can also occur whether or not Connelly participates in the production. The probability of a
"Lackluster" performance is 50%: this probability is calculated by adding the joint probability that Cloven stars
Connelly and has a "Lackluster" performance to the joint probability that Cloven has a "Lackluster" performance
without Connelly's participation.

A "Flop" occurs only when Connelly doesn't take part in the production of Cloven. The total probability of a "Flop" is
the joint probability of Cloven being a "Flop" and Connelly not taking part: 12%.
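A small Python sketch of this total-probability calculation for the K2 risk profile, using the conditional probabilities given earlier in the Cloven example:

# K2 option: probability of each box-office outcome, combining the Connelly scenarios.
p_connelly = 0.40
p_given_connelly    = {"Blockbuster": 0.50, "Lackluster": 0.50, "Flop": 0.00}
p_given_no_connelly = {"Blockbuster": 0.30, "Lackluster": 0.50, "Flop": 0.20}

risk_profile = {
    outcome: round(p_connelly * p_given_connelly[outcome]
                   + (1 - p_connelly) * p_given_no_connelly[outcome], 2)
    for outcome in ("Blockbuster", "Lackluster", "Flop")
}
print(risk_profile)   # {'Blockbuster': 0.38, 'Lackluster': 0.5, 'Flop': 0.12}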

Seth now has complete risk profiles for the two options. The risk profiles give Seth a quick overview of the possible
outcomes — and their respective probabilities — for each choice. Note that we've combined scenarios so that we now
have probabilities for each distinct outcome value. How does a manager like Seth use risk profiles to inform a decision?

Risk profiles allow Seth to compare options not just in terms of their EMVs but in terms of the risk they expose him to.
From Seth's risk profile, we can see that, although the K2 option has the higher EMV, it is much riskier: a $2 million
loss is possible, whereas the Pony option involves no possibility of loss.

Which option Seth should choose ultimately depends on how comfortable he is with the risk associated with the K2
option. The Pony option is certain to deliver a profit. The K2 option might generate a substantial loss.

If a $2 million loss is more than Seth believes his company can — or should — sustain, he may want to choose the Pony
option despite its lower EMV. If Seth chooses the Pony option, he will be demonstrating that he is risk averse: he is
choosing an option with a lower EMV in return for lower exposure to risk.

Most people are at least slightly risk averse, as shown by their inclinations to buy insurance. Some people tend to be
risk seeking, that is, they are willing to forgo options with higher EMVs and accept higher levels of risk in return for the
possibility of extremely high returns.

Although risk-seeking behavior is generally rare in the population when significant downside risk is involved, many
people tend to be somewhat risk seeking when the value of the loss risked is low. For example, the EMV for playing the
lottery is lower than the EMV for not playing the lottery. However, since the amount risked by playing the lottery — the
price of a lottery ticket — is so low, many people choose to play in return for the small chance of winning a
multi-million dollar prize.

When we use the EMV as the basis for decision making, we are acting in a risk neutral manner. Most people are risk
neutral when they make routine "day in, day out" types of decisions — those for which potential losses are not terribly
harmful to the decision-maker or her organization.

People feel comfortable using the EMV for routine decisions because they feel comfortable "playing the averages": some
decisions will result in losses and some in gains, but over the long run, the outcomes will average out. We expect that,
over time, choosing options with the highest EMVs will lead to the highest total value.

Most people feel comfortable using EMV as the basis for decisions that are not made repeatedly — provided the
decisions have outcome values similar to those of other, more routine decisions that they make on a regular basis.

Why would Pony Pictures agree to purchase the rights to Cloven and take on all the risk associated with marketing and
distributing the film? For Pony, the deal with Seth is a routine decision. Pony is a large enough company to sustain a
potential multi-million dollar loss if Cloven flops. Pony also distributes 10 to 12 movies a year, so overall, the positive
EMV of a film release promises that in the long run, Pony's operations will be profitable.

Seth knows that the K2 deal is very risky: if he chooses the K2 deal, he will tie his company's financial success to box
office performance and to Shawn Connelly's whims. If Cloven "Flops," S&C Films might well go bankrupt.

But Seth likes a gamble...

Summary
Risk profiles allow us to assess the utility different outcomes bring us, as opposed to their monetary value alone. The
concise summary a risk profile provides helps us compare and contrast our different decision options, allowing us to
choose the option we prefer based on our attitude toward risk: risk averse, risk seeking, or risk neutral.

Solving the Market Research Problem (II)


"Although Leo's market research event doesn't pay off in terms of its expected value," Alice explains, "it could help him
manage his risk."

Let's assemble the risk profile for Leo's decision, including the decision about whether or not to run his market research
event. We have to keep in mind that the cost of running his event is $240,000.

Which of the following tables correctly summarizes the outcomes and probabilities if Leo chooses to run his market
research event?

If Leo chooses to run his market research event, then the participants' response can be either "Enthusiastic" or "Tepid." If
it's "Tepid," Leo will choose to stay out of the floating restaurant business, so there is a 60% chance that Leo will incur
only the cost of the event, $240,000.

If the response to the event is "Enthusiastic," then there is a 65% chance that Leo will make a profit of $1.76 million — the
original profit estimate of $2 million, minus the event's cost of $240,000. The likelihood of a $1.76 million profit is (40%)
(65%), or 26%.

If the response to the event is "Enthusiastic," then there is a 35% chance that Leo will suffer a loss of $1.04 million — the
original loss estimate of $800,000 plus the event's cost of $240,000. The likelihood of a $1.04 million loss is (40%)
(35%), or 14%.
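Putting the three outcomes together gives the risk profile for running the event. The following Python sketch uses only the figures above (values in $ millions); the EMV it prints is not stated in the text but follows directly from those figures.

```python
# Risk profile for "run the market research event" (values in $ millions).
risk_profile = [
    (0.60, -0.24),   # "Tepid" response: incur only the event cost
    (0.26,  1.76),   # "Enthusiastic" and the restaurant succeeds
    (0.14, -1.04),   # "Enthusiastic" but the restaurant is a mere "Fad"
]

emv = sum(p * value for p, value in risk_profile)
print(round(emv, 3))  # 0.168, i.e., an EMV of about $168,000 for running the event
```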

You next complete the risk profile for Leo's profits if he chooses not to run the event. You present your risk analysis to Leo
to show him why he might want to run his event even if its expected monetary value is not optimal.

I see. So if I run my event, I'm most likely to incur the cost of the event, and nothing else. But, if the response is
"Enthusiastic," I should go ahead with my plan. Then, there is a small chance that I will incur a large loss if the restaurant
becomes a mere "Fad," and a good chance that the Tethys will become a "Phenomenon."

On the other hand, if I don't run my event, although my potential profits are slightly higher, the risk is too great. I just
have too much to lose at this point.

At the same time, I think I can afford to pay for a market research event now, without going to the bank. Then if my
guests react favorably, I might consider diving into the Tethys adventure.

I want you two to enjoy your last day here and have dinner with me tonight. Anywhere special you'd like to go?

Hmmm. I know this wonderful place that serves the most delightful mix of French and Hawaiian cuisines...

Aw, shucks.

Exercise 1: The Frivolous Lawsuit of D. Pitt


Danforth Pitt was injured in a bisque-spilling incident in a Hawaiian luxury resort restaurant. Danforth, who insists he
hasn't been able to remove the scent of crab from his knees, is pursuing his legal options.

After initial investigation and consultation, Pitt's lawyers explain to Danforth that he can expect to win $250,000 in
damages with 5% probability or $100,000 with 25% probability, before subtracting the attorney's fees. The fees
associated with pursuing legal action are estimated at $50,000, which Pitt must pay whether he wins or loses the suit.

What is the EMV of pursuing legal action against the resort hotel?

Enter the EMV in $thousands as a decimal number with two digits to the right of the decimal point (e.g., enter
"$5,000" as "5.00"). Round if necessary.

The EMV of pursuing legal action is -$12,500.
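As a check, the EMV is the probability-weighted damages minus the certain legal fees. A quick sketch in Python, with all values in $ thousands:

```python
# Worked EMV for pursuing legal action (values in $ thousands).
p_win_250, award_250 = 0.05, 250   # 5% chance of $250,000 in damages
p_win_100, award_100 = 0.25, 100   # 25% chance of $100,000 in damages
legal_fees = 50                    # paid whether Pitt wins or loses

emv = p_win_250 * award_250 + p_win_100 * award_100 - legal_fees
print(emv)  # -12.5, i.e., an EMV of -$12,500
```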

To avoid a legal battle, the hotel's owner agrees to settle out of court, offering Danforth two full weeks in the hotel's
penthouse suite free of charge. Including amenities, the value of this offer is $12,000.

If Danforth chooses to reject the offer to settle out of court, and to sue the hotel instead, he should be characterized as
which of the following?

Since the EMV of pursuing legal action is lower than the EMV of settling out of court, but the possible gains of legal
action, though relatively unlikely, are much higher than the EMV of settling, Danforth would have to be characterized
as risk seeking.


Final Assessment Test I Introduction


Welcome to the post-assessment test for the HBS Quantitative Methods Tutorial.

All questions must be answered for your exam to be scored.

Navigation:
To advance from one question to the next, select one of the answer choices or, if applicable, complete with your own choice
and click the “Submit” button. After submitting your answer, you will not be able to change it, so make sure you are satisfied
with your selection before you submit each answer. You may also skip a question by pressing the forward advance arrow.
Please note that you can return to “skipped” questions using the “Jump to unanswered question” selection menu or the
navigational arrows at any time. Although you can skip a question, you must navigate back to it and answer it - all questions
must be answered for the exam to be scored.

In the briefcase, links to Excel spreadsheets containing z-value and t-value tables as well as utilities for finding confidence
intervals and conducting hypothesis tests are provided for your convenience. For some questions, additional links to Excel
spreadsheets containing relevant data will appear immediately below the question text.

Your results will be displayed immediately upon completion of the exam.

After completion, you can review your answers at any time by returning to the exam.

Good luck!

Frequently Asked Questions


How difficult are the questions on the exam? The exam questions have a level of difficulty similar to the exercises in
the course.

Can I refer to statistics textbooks and online resources to help me during the test? Yes. This is an open-book
examination.

May I receive assistance on the exam? No. Although we strongly encourage collaborative learning at HBS, work on
exams such as the assessment tests must be entirely your own. Thus you may neither give nor receive help on any exam
question.

Is this a timed exam? No. You should take about 60-90 minutes to complete the exam, depending on your familiarity
with the material, but you may take longer if you need to.

What happens if I am (or my internet connection is) interrupted while taking the exam? Your answer choices
will be recorded for the questions you were able to complete and you will be able to pick up where you left off when you
return to the exam site.

How do I see my exam results? Your results will be displayed as soon as you submit your answer to the final question.
The results screen will indicate which questions you answered correctly.

Final Assessment Test II Introduction


Welcome to the second post-assessment test for the HBS Quantitative Methods Tutorial.

All questions must be answered for your exam to be scored.

Navigation:
To advance from one question to the next, select one of the answer choices or, if applicable, complete with your own choice
and click the “Submit” button. After submitting your answer, you will not be able to change it, so make sure you are satisfied
with your selection before you submit each answer. You may also skip a question by pressing the forward advance arrow.
Please note that you can return to “skipped” questions using the “Jump to unanswered question” selection menu or the
navigational arrows at any time. Although you can skip a question, you must navigate back to it and answer it - all questions
must be answered for the exam to be scored.

In the briefcase, links to Excel spreadsheets containing z-value and t-value tables are provided for your convenience. For some
questions, additional links to Excel spreadsheets containing relevant data will appear immediately below the question text.

IMPORTANT: Take this exam only after taking the previous final assessment exam. There are only two final
assessment exams. Thus, if you did not receive a passing score on the first, it is critical that you review the
course material carefully before taking this exam.

Your results will be displayed immediately upon completion of the exam.

After completion, you can review your answers at any time by returning to the exam.

Good luck!

Frequently Asked Questions


How difficult are the questions on the exam? The exam questions have a level of difficulty similar to the exercises in
the course.

Can I refer to statistics textbooks and online resources to help me during the test? Yes. This is an open-book
examination.

May I receive assistance on the exam? No. Although we strongly encourage collaborative learning at HBS, work on
exams such as the assessment tests must be entirely your own. Thus you may neither give nor receive help on any exam
question.

Is this a timed exam? No. You should take about 60-90 minutes to complete the exam, depending on your familiarity
with the material, but you may take longer if you need to.

What happens if I am (or my internet connection is) interrupted while taking the exam? Your answer choices
will be recorded for the questions you were able to complete and you will be able to pick up where you left off when you
return to the exam site.

How do I see my exam results? Your results will be displayed as soon as you submit your answer to the final question.
The results screen will indicate which questions you answered correctly.

