Business Analytics

SAMPLING AND ESTIMATION


Amazon Case Study

• Amazon uses sampling to answer an important managerial question. The company
decided long ago that its mission is to be Earth's most customer-centric company,
and it is obsessed with the customer experience. So when Amazon has an opportunity
to improve the customer experience through analytics, it usually focuses on whatever
is likely to have the highest positive impact and the broadest possible impact,
because it is a global company. It looks for things like low prices, huge selection,
and improvements in delivery experience and convenience that are likely to apply
for a long time everywhere in the world. When Amazon ships items to customers,
they come from a warehouse where inventory is stored. Inventory is processed as
follows: a truck arrives carrying books, consumer electronics, toys, kitchen goods,
sports equipment, clothes, and shoes. Amazon receives the items, which basically
means opening each carton, taking out the items, and making sure they are in good
shape. The items are then stowed on a shelf to wait for the customer orders that
will eventually ship them out. The places for errors in this process include
misidentifying the item at receive.
• For example, black shoes might be mistakenly identified as blue shoes. An item
could be placed into the wrong bin, or the wrong item could be pulled from the
shelf, and there are a couple of other smaller ways to make mistakes. Amazon is
trying to minimize defects that reach customers, meaning minimize the chance that
a customer receives the wrong item or experiences a delay because the last item in
stock is in the wrong place. At the same time, it is trying to reduce the cost of
dealing with those kinds of defects: improve quality and lower costs. The best way
to do that is to have as few defects as possible in inventory. Years ago, well
before Amazon.com, retailers learned that inventory accuracy matters in stores and
in warehouses, and they grew accustomed to annual inventory counts. Often, stores
would close for a day or two, or sometimes a week. Warehouses would do the same:
people would go out into the warehouse and count every item to make sure they knew
what was where, and then the operation would reopen and start selling again.
• That is a very expensive process, because the operation has to shut down while
the warehouse is closed for the count. It also gives no way of knowing whether the
inventory is accurate throughout the rest of the year. There is basically one
sample. It is a complete sample, but it is one sample, once a year, and then you
hope that your processes are good enough the rest of the year. What Amazon has
learned to do is to sample the accuracy of its inventory continuously, to make sure
the inventory is as accurate as it can afford. The idea behind sampling is that it
might not be possible to learn the true value of a statistic of interest in the
population. Amazon has many warehouses that house inventory, and going through all
of it would be very time-consuming. In that situation, you would pick a random
subset of the items in inventory and check whether they have those defects. It is
a lower-cost way to learn the rate at which the statistic of interest occurs in the
population.
Sample vs. Population

• In the previous module, we learned about descriptive statistics. The numerical
properties of a population are called parameters and those of a sample are called
statistics. A statistic is an estimate of a true value of a parameter. If a sample
is sufficiently large and is representative of the population, the sample
statistics should be reasonably good estimates of the population parameters.
Sample vs. Population

• To differentiate between population and sample measures, we use the Greek
alphabet for population parameters and the Latin alphabet for sample statistics.
The symbols for the mean and standard deviation are: population mean μ and
population standard deviation σ; sample mean x̄ and sample standard deviation s.
Important Pointers

• What happens to the sample mean and standard deviation as you take
new samples of equal size?

• Since each sample is randomly selected, the mean and standard deviation vary
from one sample to the next. However, since the sample size is fairly large, each
sample’s mean and standard deviation are fairly close to the population mean and
standard deviation. We’ll learn more about how to select a good sample later.
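The behavior described above can be sketched with a small simulation. This is only an illustration: the population is made up (10,000 simulated values), and the names `population` and `sample_statistics` are hypothetical, not part of the course material.

```python
import random
import statistics

random.seed(42)  # fixed seed so the illustration is repeatable

# Hypothetical population of 10,000 values (invented for illustration only).
population = [random.gauss(50, 10) for _ in range(10_000)]

# Parameters describe the whole population (mu and sigma).
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)  # population standard deviation


def sample_statistics(n):
    """Draw one simple random sample of size n and return (x_bar, s)."""
    sample = random.sample(population, n)  # sampling without replacement
    return statistics.fmean(sample), statistics.stdev(sample)  # sample std dev


# Five new samples of equal size: the statistics vary, but stay near mu and sigma.
results = [sample_statistics(500) for _ in range(5)]

print(f"population: mu={mu:.2f}, sigma={sigma:.2f}")
for x_bar, s in results:
    print(f"sample:     x_bar={x_bar:.2f}, s={s:.2f}")
```

Each run of `sample_statistics` gives slightly different values, which is the sample-to-sample variation the slide refers to.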
Selection of Random Sample

• In some cases, selecting a random sample is quite straightforward. If we have a
list of all members of a population in a database, we can use a computer to assign
a random number to each member and draw a sample from the list. This process gives
each member (that is, each element of the population) an equal likelihood of being
selected, which guards against selection bias and makes a sufficiently large
sample likely to be representative of the population.
• In Excel, the RAND function returns a random number between 0 and 1, which we
can scale and shift to cover any two specified values. For example, to generate
random numbers between 0 and 10, we would multiply the function by 10 and enter
=RAND()*10. For numbers between 5 and 15, we would enter =5+RAND()*10.
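For readers working outside Excel, the same two ideas can be sketched in Python. The member list and the helper name `draw_sample` are hypothetical, and `random.random()` plays the role of RAND here.

```python
import random

random.seed(7)  # fixed seed so the illustration is repeatable

# Hypothetical population list, e.g. member IDs pulled from a database.
members = [f"member_{i:04d}" for i in range(1, 1001)]


def draw_sample(population, n):
    """Assign a random number to every member, sort by it, keep the first n."""
    keyed = [(random.random(), member) for member in population]
    keyed.sort()
    return [member for _, member in keyed[:n]]


sample = draw_sample(members, 50)

# Python counterparts of =RAND()*10 and =5+RAND()*10.
zero_to_ten = random.random() * 10
five_to_fifteen = 5 + random.random() * 10

print(sample[:5], zero_to_ten, five_to_fifteen)
```

In practice `random.sample(members, 50)` draws the same kind of sample in one call; the longer version is shown only because it mirrors the "assign a random number to each member" description.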
Case Study

• Suppose a college has asked you to conduct a survey to determine the percentage
of 8:00 AM classrooms that were full on a given morning. The college has three
classroom buildings, each containing two lecture halls. Each lecture hall has a
capacity of 100 students. You randomly choose one of the three buildings and stand
outside the entrance when classes let out. You ask the first 60 students leaving
the building how full their class was. However, you soon realize that this sample
is not random, because you went to only one of the buildings and the classes at
that building may not be representative of all 8:00 AM classes. Moreover, since
the students you surveyed were the first to exit the building, it’s also quite
possible that they all came from the same class!
• Realizing that your survey approach would not produce a random and
representative sample, you gather some friends to help sample. You place one
surveyor outside each building. You each randomly select 20 students leaving
the buildings that morning and tally the results: 5 people decline to participate,
35 tell you that their class was full, and 20 tell you that their class was not full.
Explanation

• This question is a bit tricky. This sample still may not be representative of all
classes because there is a bias in the approach. When you sample
students leaving each of the buildings, you will, on average, select more
people from full classes, simply because there were more people in those
classes. Imagine that of the 6 classes that took place that morning, 4 were
full (each having 100 students) and 2 had only 40 students each. In this
case, most of the students, 400 of the total 480, were in full classes. Your
sample would include more students from the full classes and therefore is
not representative of all classes that took place that morning.
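The arithmetic behind this bias can be checked directly. The sketch below uses only the numbers from the imagined scenario above (4 classes of 100 students and 2 classes of 40); the variable names are illustrative.

```python
# Class sizes from the imagined scenario: 4 full classes and 2 partly empty ones.
class_sizes = [100, 100, 100, 100, 40, 40]
capacity = 100

full_classes = [size for size in class_sizes if size == capacity]

# What the college wants to know: the share of CLASSES that were full.
share_of_classes_full = len(full_classes) / len(class_sizes)

# What sampling students tends to measure: the share of STUDENTS in full classes.
share_of_students_in_full = sum(full_classes) / sum(class_sizes)

print(f"classes that were full:   {share_of_classes_full:.1%}")
print(f"students in full classes: {share_of_students_in_full:.1%}")
```

Four of six classes were full (about 67%), but 400 of 480 students (about 83%) sat in a full class, so a sample of students overstates how many classes were full.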
Sample Size

• In addition to deciding how to select a random sample, we also must determine
how large the sample should be. The appropriate sample size depends on how
accurate we want our estimates of the population parameters to be. Suppose we want
to sample from two populations: the first comprises 5,000 observations and the
second comprises 5 million.
• If we take a sample of size 1,000 from the first population, how many times
larger does the sample need to be from the second population to ensure
the same level of accuracy?
Explanation

• We might expect that for a larger population, a larger sample size is needed to
achieve a given level of accuracy, but this is not necessarily true. A sample of
1,000 is often a satisfactory representation of a population numbering in the
millions, as long as the sample is randomly selected and representative of the
entire population.
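One way to see why population size matters so little is to compare margins of error for the two populations. The sketch below rests on assumptions that are not in the slides: it estimates a proportion, uses the worst case p = 0.5, a 95% z-value of 1.96, and the standard finite population correction. The function name `margin_of_error` is hypothetical.

```python
import math


def margin_of_error(n, population_size, p=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample.

    Assumes sampling without replacement, so the usual standard error is
    multiplied by the finite population correction sqrt((N - n) / (N - 1)).
    """
    standard_error = math.sqrt(p * (1 - p) / n)
    fpc = math.sqrt((population_size - n) / (population_size - 1))
    return z * standard_error * fpc


small = margin_of_error(1_000, 5_000)       # population of 5,000
large = margin_of_error(1_000, 5_000_000)   # population of 5 million

print(f"N = 5,000:     +/- {small:.4f}")
print(f"N = 5,000,000: +/- {large:.4f}")
```

Under these assumptions the same sample of 1,000 gives roughly 2.8 percentage points for the small population and roughly 3.1 for the large one, so a population 1,000 times larger does not call for a sample anywhere near 1,000 times larger.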
Sample Size

• The graphic below suggests the general relationship between accuracy and sample
size. Later in this module, we will learn how to calculate the minimum required
sample size to ensure a specified level of accuracy.
• Although we don’t necessarily have to increase the sample size for larger
populations, we may need a larger sample size when we are trying to detect
something very rare. For example, if we are trying to estimate the incidence of a
rare disease, we may need a larger sample simply to ensure that some people
afflicted with the disease are included in the sample.
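A quick calculation shows how easily a sample can miss a rare condition altogether. The prevalence of 1 in 1,000 is an invented figure, and the sketch assumes each sampled person is an independent draw, which is a reasonable approximation when the population is much larger than the sample.

```python
def prob_no_cases(prevalence, n):
    """Probability that a random sample of n people contains zero cases."""
    return (1 - prevalence) ** n


# Hypothetical prevalence: 1 person in 1,000 has the disease.
p_miss_1000 = prob_no_cases(0.001, 1_000)
p_miss_10000 = prob_no_cases(0.001, 10_000)

print(f"n = 1,000:  chance of seeing no cases = {p_miss_1000:.3f}")
print(f"n = 10,000: chance of seeing no cases = {p_miss_10000:.6f}")
```

With these assumed numbers, a sample of 1,000 contains no cases at all more than a third of the time, while a sample of 10,000 almost always includes some.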
Avoiding Bias

• A common way to collect information about a population is to conduct a survey.
That is, a researcher asks questions of a randomly selected sample from the
population. Conducting a survey raises problems that can be surprisingly tricky to
resolve. Consider how we phrase our questions. Is there bias in the phrasing that
might lead participants to answer the questions in a certain way? Are any
questions worded ambiguously? If some of the people in the sample interpret a
question one way, and others interpret it differently, the survey’s results will
be meaningless.
Questions

• Suppose you are an aspiring politician thinking about running for local
office. You decide to conduct a survey to get a sense of whether you
actually have a chance of winning. Which method would you use?

• In-person
• Mail
• Phone
• E-mail
• Text
• Social Media
Avoiding Bias

• Surveyors wish to get as high a response rate as possible. Low response rates
can introduce bias if the non-respondents’ answers would have differed from those
of the respondents, that is, if the non-respondents and the respondents represent
different segments of the population. If we do not represent a segment of the
population, then our sample is not representative of the population. If resources
are limited, it is often better to take a small sample and relentlessly pursue a
high response rate than to take a larger sample and settle for a low response
rate. If we have a low response rate, we should contact non-respondents and try to
either increase the response rate or demonstrate that the non-respondents’
answers do not differ from the respondents’ answers.
Normal Distribution

• After we obtain a sample, we will analyze the sample to draw inferences about
the greater population. To understand how to do this, it is helpful to understand
the basic characteristics of a common probability distribution known as the
Normal Distribution.
• Because the normal distribution is a continuous probability distribution, the
probability of the normal distribution equaling any particular value is zero
(this is why we only assess the probability of a range for a continuous
distribution). Because of this, we can use the terms “less than” and “less
than or equal to” interchangeably when calculating probabilities for
continuous distributions.
Excel Formulae

• To find a cumulative probability, the probability of being less than a specified
value on a normal curve, we use Excel’s NORM.DIST function.
• =NORM.DIST(x, mean, standard_dev, cumulative)
• x is the value at which you want to evaluate the distribution function.
• mean is the mean of the distribution.
• standard_dev is the standard deviation of the distribution.
• cumulative is an argument that specifies the type of probability we wish to
calculate. We insert “TRUE” to indicate that we wish to find the cumulative
probability, that is, the probability of being less than or equal to the x-value.
(Inserting the value “FALSE” provides the height of the normal distribution at the
value x, which we will not cover in this course.)
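For comparison, the same calculation can be sketched in Python with the standard library's `statistics.NormalDist`. The mean of 100 and standard deviation of 15 are made-up values chosen only for illustration.

```python
from statistics import NormalDist

# Hypothetical normal distribution: mean 100, standard deviation 15.
dist = NormalDist(mu=100, sigma=15)

# Counterpart of =NORM.DIST(115, 100, 15, TRUE): P(X <= 115).
p_below_115 = dist.cdf(115)

# Counterpart of =NORM.DIST(115, 100, 15, FALSE): height of the curve at 115.
height_at_115 = dist.pdf(115)

# Probability of a range is the difference of two cumulative probabilities.
p_between_85_and_115 = dist.cdf(115) - dist.cdf(85)

print(f"P(X <= 115)       = {p_below_115:.4f}")
print(f"height at x = 115 = {height_at_115:.4f}")
print(f"P(85 <= X <= 115) = {p_between_85_and_115:.4f}")
```

Because 115 is one standard deviation above the mean in this example, the cumulative probability is about 0.84 and the range from 85 to 115 covers about 68% of the distribution.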
Excel Formulae

• For a normal distribution, we can use Excel’s NORM.INV function to calculate a
given percentile. The “INV” indicates that this function calculates the inverse of
the cumulative probability.
• =NORM.INV(probability, mean, standard_dev)
• probability is the cumulative probability for which we want to know the
corresponding x-value on a normal distribution.
• mean is the mean of the distribution.
• standard_dev is the standard deviation of the distribution.
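A Python counterpart of the percentile calculation is sketched below, again with `statistics.NormalDist` and the same made-up distribution (mean 100, standard deviation 15).

```python
from statistics import NormalDist

# Hypothetical normal distribution: mean 100, standard deviation 15.
dist = NormalDist(mu=100, sigma=15)

# Counterpart of =NORM.INV(0.90, 100, 15): the 90th percentile.
x_90 = dist.inv_cdf(0.90)

# Feeding the result back into the cumulative function recovers the probability.
round_trip = dist.cdf(x_90)

print(f"90th percentile = {x_90:.2f}")
print(f"P(X <= {x_90:.2f}) = {round_trip:.2f}")
```

The round trip illustrates what "inverse" means here: `inv_cdf` undoes `cdf`, just as NORM.INV undoes the cumulative form of NORM.DIST.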
Central Limit Theorem

• The Central Limit Theorem is one of the most subtle statistical concepts, and
it is worth understanding because it gives us much deeper insight into how
sampling actually works. The Central Limit Theorem says that if we take many
random samples from a population and plot the mean of each sample, then, provided
the samples are sufficiently large, the resulting plot of the sample means will
look approximately normally distributed. Furthermore, as we take more and more of
these samples, the average of the sample means approaches the true mean of the
population. Thus, we draw this graph, called the distribution of sample means, as
a normal curve centered at the true population mean.
Central Limit Theorem

• How does the distribution of sample means differ from the distribution of the
population? The most important difference is that the two distributions have
different standard deviations. The width of the distribution of sample means
depends on the sample size: its standard deviation is the population standard
deviation divided by the square root of the sample size, so larger samples result
in a narrower distribution of sample means. This matches our intuition, because we
know that the larger the sample size, the more accurately the sample mean
approximates the population mean. Thus, for larger samples, the resulting
distribution of sample means will be more closely clustered around the population
mean. One of the most amazing findings of the Central Limit Theorem is that no
matter what type of distribution the population has (uniform, skewed, bimodal, or
completely bizarre), if we take enough sufficiently large samples, the means of
those samples will form an approximately normal distribution centered at the true
population mean. Let's walk through this step by step.
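The narrowing effect of sample size can be sketched with a short simulation. The uniform population on the interval 0 to 1, the sample sizes of 10 and 100, and the helper name `spread_of_sample_means` are all choices made for this illustration; the comparison value uses the standard result that the spread of sample means is the population standard deviation divided by the square root of the sample size.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the illustration is repeatable

# Standard deviation of a uniform(0, 1) population is 1 / sqrt(12).
POPULATION_SD = 1 / math.sqrt(12)


def spread_of_sample_means(n, num_samples=2000):
    """Standard deviation of the means of num_samples samples of size n."""
    means = [
        statistics.fmean([random.uniform(0, 1) for _ in range(n)])
        for _ in range(num_samples)
    ]
    return statistics.stdev(means)


spread_n10 = spread_of_sample_means(10)
spread_n100 = spread_of_sample_means(100)

print(f"n = 10:  simulated {spread_n10:.4f}, theory {POPULATION_SD / math.sqrt(10):.4f}")
print(f"n = 100: simulated {spread_n100:.4f}, theory {POPULATION_SD / math.sqrt(100):.4f}")
```

Raising the sample size from 10 to 100 shrinks the spread of the sample means by a factor of about the square root of 10, roughly 3.2.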
Central Limit Theorem

• If we have a population, any population, we can take a random sample from that
population. That sample has a mean. We can plot that mean on a graph. Then we can
take another sample. That sample also has a mean, which we also plot on the graph.
If we plot a lot of sample means in this way, they will start to form a normal
distribution around the population mean. The more samples we take, the more the
graph of the sample means will look like a normal distribution. Eventually we
would form the distribution of sample means. In practice, no one would actually
take a lot of samples, calculate all the sample means, and then construct a normal
distribution with them. In the real world, we take a single sample and squeeze it
for all the information it's worth. The Central Limit Theorem is a powerful tool
for sampling and estimation, because it allows us to largely ignore the underlying
distribution of the population we want to learn about. We now know that the mean
of a sufficiently large sample behaves like a draw from a normal distribution that
is centered at the true population mean. Because of this, we can set aside the
shape of the population's distribution and focus on the sample.
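The step-by-step procedure above can be sketched as a simulation. Everything specific here is an assumption made for illustration: the population is a strongly skewed exponential distribution with mean 10 (and therefore standard deviation 10), each sample has 50 observations, and 5,000 samples are drawn.

```python
import random
import statistics

random.seed(3)  # fixed seed so the illustration is repeatable

POPULATION_MEAN = 10  # mean (and standard deviation) of the exponential population
SAMPLE_SIZE = 50
NUM_SAMPLES = 5000


def one_sample_mean(n):
    """Take one random sample of size n from the skewed population; return its mean."""
    return statistics.fmean(
        [random.expovariate(1 / POPULATION_MEAN) for _ in range(n)]
    )


# Repeat the "take a sample, record its mean" step many times.
sample_means = [one_sample_mean(SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

center = statistics.fmean(sample_means)
spread = statistics.stdev(sample_means)

# For a normal curve, about 68% of values lie within one standard deviation.
within_one_sd = sum(abs(m - center) <= spread for m in sample_means) / NUM_SAMPLES

print(f"average of sample means: {center:.2f} (population mean {POPULATION_MEAN})")
print(f"spread of sample means:  {spread:.2f}")
print(f"share within one sd:     {within_one_sd:.1%}")
```

Even though the population is heavily skewed, the sample means cluster symmetrically around the population mean, and the share within one standard deviation comes out close to the 68% expected of a normal curve.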
Questions ???