Estimation

Estimation
• Percentiles
• The Bootstrap
• Confidence Intervals
• Using Confidence Intervals
Estimation - Introduction
• So far, we have developed ways of inferential thinking. In particular, we learned how to use data
to decide between two hypotheses about the world.
• But often we just want to know how big some quantity is.
– In an election year, we might want to know what percent of voters favor a particular candidate.
– To assess the current economy, we might be interested in the median annual income of
households in India.
• Here, we will develop a way to estimate an unknown parameter.
• (Remember that a parameter is a numerical value associated with a population.)
Estimation
• To figure out the value of a parameter, we need data. If we have
the relevant data for the entire population, we can simply calculate
the parameter.
• But if the population is very large – for example, if it consists of all
the households in a country – then it might be too expensive and
time-consuming to gather data from the entire population.
• In such situations, data scientists rely on sampling at random from
the population.
This leads to a question of inference:
How to make justifiable conclusions about the
unknown parameter, based on the data in the
random sample?
We answer this question by using inferential thinking.
Estimation
• A statistic based on a random sample can be a reasonable estimate of
an unknown parameter in the population.
(For example, you might want to use the median annual income of sampled households as an estimate of the median annual
income of all households in India.)
• But the value of any statistic depends on the sample, and the sample is
based on random draws. So every time data scientists come up with an
estimate based on a random sample, they are faced with a question:
"How different could this estimate have been, if the sample had come
out differently?"
• We shall see one way of answering this question. The answer will give you the
tools to estimate a numerical parameter and quantify the amount of
error in your estimate.
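A small simulation can make this question concrete. The sketch below is only an illustration: the population is made up (it is not real income data), and the names population, parameter and sample_median are hypothetical. It draws several random samples and shows that each one gives a somewhat different estimate of the same parameter.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100,000 made-up "incomes" (not real data).
population = [random.lognormvariate(10, 1) for _ in range(100_000)]

# The parameter is known here only because we built the population ourselves.
parameter = statistics.median(population)


def sample_median(size):
    """Median of one simple random sample drawn without replacement."""
    return statistics.median(random.sample(population, size))


# Each new random sample gives a somewhat different estimate.
estimates = [sample_median(500) for _ in range(5)]

print("population median:", round(parameter))
for estimate in estimates:
    print("sample median:    ", round(estimate))
```

In practice the population is not available, so this kind of repeated sampling is usually not possible; the sketch only shows why the estimate varies.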
Percentiles, Quartiles & Quantiles (already covered)
Percentiles
• The most famous percentile is the median, often used in summaries of
income data. Other percentiles will be important as well, so we start by defining
percentiles.
• Numerical data can be sorted in increasing or decreasing order. Thus the
values of a numerical data set have a rank order. A percentile is the value
at a particular rank.
• For example, if your score on a test is on the 95th percentile, a common
interpretation is that only 5% of the scores were higher than yours.
• The median is the 50th percentile; it is commonly assumed that 50% of the
values in a data set are above the median.
The General Definition - Percentiles
• Let p be a number between 0 and 100. The pth percentile of a collection is the
smallest value in the collection that is at least as large as p% of all the values.
• By this definition, any percentile between 0 and 100 can be computed for
any collection of values, and it is always an element of the collection.
• In practical terms, suppose there are n elements in the collection. To find the
pth percentile:
– Sort the collection in increasing order.
– Find (p/100) * n. Call that k.
– If k is an integer, take the kth element of the sorted collection.
– If k is not an integer, round it up to the next integer, and take that element of
the sorted collection.
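The steps above can be sketched in Python as follows. This is a minimal illustration, not a library routine: the function name percentile and the list sizes are hypothetical, and the sketch assumes 0 < p <= 100, because p = 0 gives k = 0, which has no matching element under this rule.

```python
import math


def percentile(p, values):
    """Return the pth percentile of values using the sort-and-round-up rule.

    Assumes a non-empty collection and 0 < p <= 100 (an assumption of this
    sketch: p = 0 would give rank 0, which matches no element).
    """
    if not values or not (0 < p <= 100):
        raise ValueError("need a non-empty collection and 0 < p <= 100")
    ordered = sorted(values)          # sort in increasing order
    # Same quantity as (p/100) * n; multiplying first avoids float round-off
    # such as 0.07 * 100 coming out slightly above 7.
    k = p * len(ordered) / 100
    rank = math.ceil(k)               # unchanged when k is already an integer
    return ordered[rank - 1]          # ranks start at 1, list indexes at 0


sizes = [12, 17, 6, 9, 7]  # made-up values for illustration

print(percentile(50, sizes))   # k = 2.5, rounded up to 3 -> 9
print(percentile(80, sizes))   # k = 4, an integer -> 12
print(percentile(100, sizes))  # k = 5 -> 17
```

Note that the result is always one of the values in the collection, as the definition requires. Library functions often interpolate between values instead, so they may return a different number.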
Quantile Plot
• Displays all of the data (allowing the user to assess both the overall
behavior and unusual occurrences)
• Plots quantile information

– For data xi sorted in increasing order, fi indicates that
approximately 100 fi% of the data are below or equal to the value xi
Quantile Plot
• A simple way to take a first look at a univariate data distribution.
• Displays all of the data for the given attribute, to assess both
– the overall behavior and
– unusual occurrences.
• It plots quantile information.
– Let xi, for i = 1 to N, be the data sorted in increasing order, so that
– x1 is the smallest observation and
– xN is the largest, for some attribute X.
• Each observation, xi, is paired with a percentage, fi, which indicates that approximately 100 fi% of
the data are below the value xi.
• Note: the 0.25 quantile corresponds to quartile Q1, the 0.50 quantile is the median,
and the 0.75 quantile is Q3.
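The pairing of each xi with a fraction fi can be sketched as below. The formula fi = (i - 0.5) / N is an assumption of this sketch: it is one common plotting-position convention, and the text above does not state which formula is intended. The function name quantile_plot_points and the data are hypothetical.

```python
def quantile_plot_points(values):
    """Pair each sorted observation x_i with a fraction f_i.

    Assumption: f_i = (i - 0.5) / N, one common convention; the formula
    is not given in the text this sketch accompanies.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


unit_prices = [40, 43, 47, 74, 75, 78, 115, 117, 120]  # made-up values

points = quantile_plot_points(unit_prices)
for f, x in points:
    print(f"{f:.3f}  {x}")
```

Plotting f on the horizontal axis against x on the vertical axis gives the quantile plot. With this convention the middle observation of an odd-sized data set is paired with f = 0.5, the median.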
Quantile Plot
• Let
Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate distribution against the corresponding quantiles
of another
• Allows the user to view whether there is a shift in going from one distribution to
another
Quantile-Quantile (Q-Q) Plot
• Each point corresponds to the same quantile for each data set and shows the unit price of items sold at
branch 1 versus branch 2 for that quantile.
• For comparison, the straight line represents the case where, for each given quantile, the unit price at each branch is
the same.
• The darker points correspond to the data for Q1, the median, and Q3.
• At Q1, the unit price of items sold at branch 1 is less than that at branch 2. That is, 25% of items sold at branch 1 were priced at or below
$60, while 25% of items sold at branch 2 were priced at or below $64.
• At Q2, the 50th percentile (marked by the median), 50% of items sold at branch 1 were priced at or below $75, while 50% of items sold at branch 2 were priced at or below
$85.
• In general, there is a shift in the distribution of branch 1 with respect to branch 2, in that the unit prices of items sold at branch 1 tend to be lower than those at branch 2.
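The points of a Q-Q plot can be computed as in the sketch below. The two lists are made up for illustration and are not the data behind the plot described above; they were chosen only so that Q1 and the median match the quoted $60/$64 and $75/$85. The percentile helper uses the sort-and-round-up rule from the general definition of percentiles, and the names qq_points, branch_1 and branch_2 are hypothetical.

```python
import math


def percentile(p, values):
    """pth percentile by the sort-and-round-up rule (assumes 0 < p <= 100)."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def qq_points(sample_a, sample_b, percents=range(5, 100, 5)):
    """One (quantile of a, quantile of b) pair per requested percentile."""
    return [(percentile(p, sample_a), percentile(p, sample_b)) for p in percents]


# Made-up unit prices, for illustration only.
branch_1 = [40, 45, 50, 55, 60, 63, 66, 69, 72, 75,
            79, 84, 90, 96, 103, 110, 118, 127, 137, 150]
branch_2 = [43, 49, 54, 59, 64, 68, 72, 76, 80, 85,
            89, 94, 100, 107, 114, 121, 130, 140, 149, 160]

points = qq_points(branch_1, branch_2)

print(percentile(25, branch_1), percentile(25, branch_2))  # Q1: 60 64
print(percentile(50, branch_1), percentile(50, branch_2))  # median: 75 85

# If every pair has a lower branch 1 value, every plotted point lies on the
# same side of the straight "equal prices" line, which indicates a shift.
print(all(a < b for a, b in points))  # True
```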
Bootstrap
Bootstrap
• One sample - One estimate
• But the random sample could have come out
differently.
=> Then the estimate would have been different.
Main question:
• How different could the estimate have been?
• The variability of the estimate tells us something
about how accurate the estimate is.
Where to Get Another Sample?
• One sample → One estimate
• To get another value of the estimate, we need
another random sample.
• We can’t go back and sample again from the
population:
• Time and resources are limited. Stuck?
The Bootstrap
• We need another random sample that looks like the
population.
• All that we have is the original sample, which is
large and random.
• It’s a good bet that it resembles the population.
• So sample at random from the original sample!

Demo
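A minimal sketch of the idea in Python, under stated assumptions. The original sample here is made up and stands in for real data. Drawing with replacement and making each resample the same size as the original are the standard bootstrap choices; the slide itself only says to sample at random from the original sample. The names original_sample and bootstrap_medians are hypothetical.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Made-up "original sample" standing in for real survey data.
original_sample = [random.lognormvariate(10, 1) for _ in range(500)]


def bootstrap_medians(sample, repetitions):
    """Medians of resamples drawn with replacement from the sample.

    Each resample has the same size as the original sample.
    """
    medians = []
    for _ in range(repetitions):
        resample = random.choices(sample, k=len(sample))  # with replacement
        medians.append(statistics.median(resample))
    return medians


medians = bootstrap_medians(original_sample, 1000)

print("estimate from the original sample:", round(statistics.median(original_sample)))
print("smallest bootstrap median:        ", round(min(medians)))
print("largest bootstrap median:         ", round(max(medians)))
print("spread (standard deviation):      ", round(statistics.stdev(medians)))
```

The spread of the bootstrap medians gives a sense of how different the estimate could have been if the sample had come out differently.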
