
Estimation

• Percentiles
• The Bootstrap
• Confidence Intervals
• Using Confidence Intervals
Estimation - Introduction

• So far, we have developed ways of inferential thinking. In particular, we learned how to use data
to decide between two hypotheses about the world.

• But often we just want to know how large some quantity is.

– In an election year, we might want to know what percent of voters favor a particular candidate.

– To assess the current economy, we might be interested in the median annual income of
households in India.

• Here, we will develop a way to estimate an unknown parameter.
(Remember that a parameter is a numerical value associated with a population.)
Estimation

• To figure out the value of a parameter, we need data. If we have
the relevant data for the entire population, we can simply calculate
the parameter.

• But if the population is very large (for example, if it consists of all
the households in a country), then it might be too expensive and
time-consuming to gather data from the entire population.

• In such situations, data scientists rely on sampling at random from
the population.

This leads to a question of inference:

How can we make justifiable conclusions about the
unknown parameter, based on the data in the
random sample?

We answer this question by using inferential thinking.
Estimation

• A statistic based on a random sample can be a reasonable estimate of
an unknown parameter in the population.
(For example, you might want to use the median annual income of sampled households as an estimate of the median annual
income of all households in India.)

• But the value of any statistic depends on the sample, and the sample is
based on random draws. So every time data scientists come up with an
estimate based on a random sample, they are faced with a question:

"How different could this estimate have been, if the sample had come
out differently?"

• We shall see one way of answering this question. The answer will give you the
tools to estimate a numerical parameter and quantify the amount of
error in your estimate.
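
To make the question concrete, here is a minimal sketch showing that the sample median changes from one random sample to the next. The population is simulated with made-up numbers (it is not real income data), and the names `population` and `sample_median` are hypothetical, chosen only for this illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 simulated "incomes" (made-up, not real data).
population = [rng.lognormvariate(10, 1) for _ in range(100_000)]

# With the whole population in hand, the parameter can simply be calculated.
parameter = statistics.median(population)

def sample_median(size):
    """Median of one random sample drawn without replacement from the population."""
    return statistics.median(rng.sample(population, size))

# Five different random samples give five different estimates of the same parameter.
estimates = [sample_median(500) for _ in range(5)]
print("parameter:", round(parameter))
print("estimates:", [round(e) for e in estimates])
```

In practice we usually have only one sample and no access to the full population, which is why the spread of the estimates has to be approximated in some other way.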
Percentiles, Quartiles & Quantiles (already covered)
Percentiles

• The most famous percentile is the median, often used in summaries of
income data. Other percentiles will be important as well, so we start by defining
percentiles.

• Numerical data can be sorted in increasing or decreasing order. Thus the
values of a numerical data set have a rank order. A percentile is the value
at a particular rank.

• For example, if your score on a test is at the 95th percentile, a common
interpretation is that only 5% of the scores were higher than yours.

• The median is the 50th percentile; it is commonly assumed that 50% of the
values in a data set are above the median.
The General Definition - Percentiles
• Let p be a number between 0 and 100. The pth percentile of a collection is the
smallest value in the collection that is at least as large as p% of all the values.

• By this definition, any percentile between 0 and 100 can be computed for
any collection of values, and it is always an element of the collection.

• In practical terms, suppose there are n elements in the collection. To find the
pth percentile:
– Sort the collection in increasing order.
– Find (p/100) * n. Call that k.
– If k is an integer, take the kth element of the sorted collection.
– If k is not an integer, round it up to the next integer, and take that element of
the sorted collection.
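
The steps above can be sketched in a few lines of Python. This is one possible reading of the definition, not a library routine: the function name `percentile` and its argument order are assumptions made for this example, and p is restricted to 0 < p <= 100 so that the rank k is always at least 1.

```python
import math

def percentile(p, values):
    """Return the pth percentile of values, following the slide's definition.

    Assumes 0 < p <= 100 and a non-empty collection (both are assumptions
    of this sketch, enforced below).
    """
    if not 0 < p <= 100:
        raise ValueError("p must be greater than 0 and at most 100")
    if len(values) == 0:
        raise ValueError("values must not be empty")
    ordered = sorted(values)            # step 1: sort in increasing order
    # step 2: k = (p/100) * n, rounded up when it is not an integer.
    # Multiplying before dividing avoids most floating-point surprises.
    k = math.ceil(p * len(ordered) / 100)
    return ordered[k - 1]               # kth element (lists are 0-indexed)

sizes = [12, 17, 6, 9, 7]
print(percentile(50, sizes))   # k = ceil(2.5) = 3, third smallest value
print(percentile(80, sizes))   # k = 4, fourth smallest value
```

Because the result is always read off the sorted list, it is always an element of the collection, as the definition promises. Library functions that interpolate between neighbouring values can return numbers that are not in the data.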
Quantile Plot
• Displays all of the data (allowing the user to assess both the overall
behavior and unusual occurrences).

• Plots quantile information:
– For data xi sorted in increasing order, fi indicates that
approximately 100 fi% of the data are below or equal to the value xi.
Quantile Plot

• A simple way to get a first look at a univariate data distribution.

• Displays all of the data for the given attribute, to assess both
– the overall behavior and
– unusual occurrences.
• It plots quantile information.
– Let xi, for i = 1 to N, be the data sorted in increasing order, so that
– x1 is the smallest observation and
– xN is the largest for some attribute X.
• Each observation xi is paired with a percentage fi, which indicates that approximately 100 fi% of
the data are below the value xi.
• Note: the 0.25 quantile corresponds to quartile Q1, the 0.50 quantile is the median,
and the 0.75 quantile is Q3.
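
As a rough sketch, the (fi, xi) pairs of a quantile plot can be computed as below. The slides do not give a formula for fi in the text that survives here, so the choice fi = (i - 0.5)/N is an assumption (it is one common convention), and the function name `quantile_plot_points` is hypothetical.

```python
def quantile_plot_points(values):
    """Pair each sorted observation x_i with a fraction f_i.

    Assumption of this sketch: f_i = (i - 0.5) / N for i = 1..N, a common
    convention that keeps every f_i strictly between 0 and 1.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

# Each pair reads: roughly 100 * f_i percent of the data are at or below x_i.
for f, x in quantile_plot_points([30, 10, 20, 40]):
    print(f, x)
```

Plotting fi on the horizontal axis against xi on the vertical axis with any plotting tool gives the quantile plot.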
Quantile Plot
• Let
Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate distribution against the corresponding quantiles
of another.

• Allows the user to see whether there is a shift in going from one distribution to
another.
Quantile-Quantile (Q-Q) Plot

• Each point corresponds to the same quantile for each data set and shows the unit price of items sold at
branch 1 versus branch 2 for that quantile.

• For comparison, the straight line represents the case where, for each given quantile, the unit price at each branch is
the same.

• The darker points correspond to the data for Q1, the median, and Q3.

• At Q1, the unit price of items sold at branch 1 is lower than at branch 2: 25% of items sold at branch 1 cost at most
$60, while 25% of items sold at branch 2 cost at most $64.

• At Q2, the 50th percentile (the median), 50% of items sold at branch 1 cost at most $75, while 50% of items sold at branch 2 cost at most
$85.

• In general, there is a shift in the distribution of branch 1 relative to branch 2, in that the unit prices of items sold at branch 1 tend to be lower than those at branch 2.
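
The points of a Q-Q plot can be sketched as follows. The price lists are made-up values for illustration only (they are not the data behind the figure described above), the helper names `quantile` and `qq_points` are hypothetical, and quantiles are taken with the same round-up rank rule used for percentiles earlier.

```python
import math

def quantile(q, values):
    """Smallest value that is at least as large as a fraction q of the data."""
    ordered = sorted(values)
    k = max(1, math.ceil(q * len(ordered)))  # round the rank up, at least 1
    return ordered[k - 1]

def qq_points(sample_a, sample_b, fractions=(0.25, 0.50, 0.75)):
    """One (quantile of a, quantile of b) pair per requested fraction."""
    return [(quantile(q, sample_a), quantile(q, sample_b)) for q in fractions]

# Made-up unit prices, for illustration only.
branch_1 = [10, 20, 30, 40, 50, 60, 70, 80]
branch_2 = [15, 25, 35, 45, 55, 65, 75, 85]

points = qq_points(branch_1, branch_2)
print(points)

# If every branch 1 quantile is below the matching branch 2 quantile,
# the whole distribution of branch 1 is shifted toward lower prices.
print(all(a < b for a, b in points))
```

If the two distributions were identical, every pair would have equal coordinates and the points would fall on the reference line.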
Bootstrap
Bootstrap
• One sample gives one estimate.
• But the random sample could have come out
differently.
• Then the estimate would have been different.
Main question:
• How different could the estimate have been?
• The variability of the estimate tells us something
about how accurate the estimate is.
Where to Get Another Sample?

• One sample gives one estimate.

• To get another value of the estimate, we need
another random sample.
• But we can’t go back and sample again from the
population: time and resources are limited.
• Stuck?
The Bootstrap

• We need another random sample that looks like the population.

• All that we have is the original sample, which is large and random.

• It’s a good bet that it resembles the population.

• So sample at random from the original sample!


Demo
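
A minimal sketch of the idea, under two assumptions that the slides above do not spell out: each resample is drawn with replacement, and each resample has the same size as the original sample (the usual bootstrap setup). The function name `bootstrap_medians` and the sample values are hypothetical.

```python
import random
import statistics

def bootstrap_medians(sample, num_resamples=1000, seed=None):
    """Medians of many resamples drawn at random from the original sample.

    Assumptions of this sketch: draws are made with replacement, and every
    resample has the same size as the original sample.
    """
    rng = random.Random(seed)
    n = len(sample)
    medians = []
    for _ in range(num_resamples):
        resample = rng.choices(sample, k=n)   # n draws with replacement
        medians.append(statistics.median(resample))
    return medians

# Made-up sample values, for illustration only.
original_sample = [3, 8, 1, 9, 4, 7, 5]
medians = bootstrap_medians(original_sample, num_resamples=1000, seed=1)

# The spread of these medians suggests how different the estimate could have been.
print(min(medians), max(medians))
```

Drawing with replacement matters: a same-size resample drawn without replacement would just be the original sample in a different order, and its median would never change.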
