
MIS171 Business Analytics

Week 6a: Confidence intervals


Unit Schedule
Descriptive
Week 1 Introduction and Data Visualisation
Week 2 Univariate Analyses
Week 3 Bivariate Analyses
Week 4 Data visualisation
Inference
Week 5 Probability
Week 6 Confidence Intervals
Week 7 Hypothesis Testing
Predictive
Week 8 Simple Linear Regression and Prediction
Week 9 Multiple Linear Regression
Week 10 Multiple Linear Regression in action
Revision
Week 11 Exam Revision 1
Week 12 Exam Revision 2
Week 13 Exam

Unit Information
Assessment task 1 (10%) - Participation in Pre lectorial and Lectorial workbook activities and continuous class attendance from week 2 to week 12
Assessment task 2 (10%) - Project related Online quiz
Assessment task 3 (20%) - Assignment on analysis and reporting
Assessment task 4 (10%) - Project related Online quiz
Exam (50%)

Workbook Exercise 5.1
Weights of chocolate bars
The supervisor of a dairy milk chocolate factory has observed that the weight of each
“32 gram” chocolate bar is actually a normally distributed random variable, with a
mean of 32.2 grams and a standard deviation of 0.3 grams.
(a) Find the probability that, if a customer buys one chocolate bar, that will weigh less
than 32 grams.

X ~ N(µ = 32.2, σ = 0.3)

P(X < 32) = ?
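
As a quick check on part (a), the probability can be computed directly from the normal distribution. The sketch below uses Python's standard-library `NormalDist`; the variable names are illustrative only and are not part of the workbook.

```python
from statistics import NormalDist

# Bar weight in grams: mean 32.2 and standard deviation 0.3, as given in the exercise.
weight = NormalDist(mu=32.2, sigma=0.3)

# P(X < 32) is the cumulative probability up to 32 grams.
p_under_32 = weight.cdf(32)
print(f"P(X < 32) = {p_under_32:.4f}")
```

The same value should come from standardising first, z = (32 − 32.2)/0.3 ≈ −0.67, and looking up the lower-tail area in a standard normal table.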
Week 5 – Learning objectives
• Describe the elements of a sampling plan;
• Understand the issues of working with sample data;
• Explain how to conduct simple random sampling;
• Explain systematic, stratified, and cluster sampling;
• Explain sampling from a continuous process;
• Choose appropriate sampling techniques;
• Define sampling distributions and explain why they are important;
• Explain the importance of the central limit theorem;
• Construct and interpret confidence interval estimates for the mean and the
proportion;
• Calculate sample size.

Statistical sampling
Sampling is the foundation of statistical analysis.
A sampling plan is a description of the approach used to obtain samples from a
population prior to any data collection activity.
A sampling plan states:

• its objectives
• target population
• population frame (the list from which the sample is selected)
• operational procedures for collecting data
• statistical tools for data analysis

Sampling - Example
A company wants to understand how golfers might respond to a
membership program that provides discounts at golf courses.

• Objective - estimate the proportion of golfers who would join the program
• Target population - golfers over 25 years old
• Population frame - golfers who purchased equipment at particular stores
• Operational procedures - e-mail link to survey or direct-mail questionnaire
• Statistical tools - PivotTables to summarize data by demographic groups and
estimate likelihood of joining the program

Importance of sampling

• We do not always have access to all the data.
• Working with whole populations can be time-consuming and expensive
(data cleansing, i.e. removing errors such as spurious zero values, can be
very time-consuming and costly).
• Sampling allows the analyst to work with a manageable amount of data in
order to run analytical models quickly.
• We need to hold back some data to test the model's explanatory power on
new events and to validate findings on data unseen by the model.

Sampling methods
Subjective Methods
• Judgment sampling – expert judgment is used to select the sample
e.g. asking doctors (experts) for their opinions; the selection is not random.
• Convenience sampling – samples are selected based on the ease with which the
data can be collected
e.g. choosing whoever is easiest to reach; not everyone has an equal chance of being selected.
Probabilistic Sampling
• Simple random sampling involves selecting items from a population so that
every subset of a given size has an equal chance of being selected
e.g. a random draw in which everyone has an equal chance of being selected.
• Systematic (periodic) sampling – a sampling plan that selects every nth item
from the population.
Note: drawing names from a poorly mixed jar (names at the bottom are less likely
to be picked) or pointing at people "at random" is closer to convenience
sampling than to a probabilistic method.
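
The two probabilistic methods above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed procedure: the population of customer IDs, the helper names and the random starting point for the systematic sample are all assumptions made for the example.

```python
import random

def simple_random_sample(population, n, rng):
    """Every subset of size n is equally likely to be chosen."""
    return rng.sample(population, n)

def systematic_sample(population, step, rng):
    """Pick a random start among the first `step` items, then take every step-th item."""
    start = rng.randrange(step)
    return population[start::step]

rng = random.Random(171)           # fixed seed so the sketch is repeatable
customers = list(range(1, 101))    # hypothetical population frame: IDs 1 to 100

srs = simple_random_sample(customers, 10, rng)
every_tenth = systematic_sample(customers, 10, rng)
print("Simple random sample:", srs)
print("Systematic sample:   ", every_tenth)
```

Systematic sampling is easy to carry out, but it can mislead if the list has a repeating pattern that lines up with the chosen step.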

Additional probabilistic sampling methods
Stratified sampling –
applies to populations that are divided into natural subsets (called strata) and
allocates the appropriate proportion of samples to each stratum.
e.g. boys and girls are two strata; if one stratum has 5 members and the other 11,
each contributes to the sample in proportion to its size.
Cluster sampling –
based on dividing a population into subgroups (clusters), sampling a set of clusters,
and (usually) conducting a complete census within the clusters sampled
e.g. for a presidential election poll, sample households (the clusters) and ask everyone in each sampled household how they will vote.
Sampling from a continuous process:
• Select a time at random; then select the next n items produced after that time.
e.g. in an eraser factory that runs 24 hours a day, pick a random time and take the next n items produced after it.
• Select n times at random; then select the next item produced after each of
these times.
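
Stratified and cluster sampling can be sketched as follows. The strata sizes, household names and helper functions are hypothetical, and proportional allocation by rounding is only one simple way to split the sample across strata (rounded counts will not always add up exactly to n).

```python
import random

def stratified_sample(strata, n, rng):
    """Proportional allocation: each stratum contributes in proportion to its size."""
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        k = round(n * len(members) / total)   # share of the sample for this stratum
        sample[name] = rng.sample(members, k)
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random, then take everyone in them (a census)."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

rng = random.Random(6)

# Hypothetical strata in a 5:11 ratio.
strata = {
    "boys": [f"b{i}" for i in range(50)],
    "girls": [f"g{i}" for i in range(110)],
}
picked = stratified_sample(strata, 16, rng)
print({name: len(members) for name, members in picked.items()})

# Hypothetical clusters: households and their voters.
households = {
    "house_A": ["a1", "a2"],
    "house_B": ["b1"],
    "house_C": ["c1", "c2", "c3"],
    "house_D": ["d1", "d2"],
}
surveyed = cluster_sample(households, 2, rng)
print(surveyed)
```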

Random sampling

• Each element of the population has a known (non-zero) chance of being included
in the sample chosen.
• Should be used where possible.
• Inferential statistics requires random samples.
• For the rest of this course we assume that all samples are random samples.

Sampling error
Sampling error is the difference between a sample statistic and the
corresponding population parameter, such as:
• |x̄ − µ| for the sample mean
• |s − σ| for the sample standard deviation

Sampling (statistical) error occurs because samples are only a subset of the total
population. Sampling error is inherent in any sampling process, and although it
can be minimized, it cannot be totally avoided.

Nonsampling error occurs when the sample does not represent the target
population adequately. Nonsampling error usually results from a poor sample
design or inadequate data reliability.
(That is, it arises when the data are not reliable enough or the sample design is poor.)

Sampling experiment – Other sample sizes
As the sample size increases, the averages of the sample means all remain close to
the expected value of 5; however, the standard deviation of the sample means
becomes smaller, meaning that the sample means cluster more closely around the
true expected value. The distribution of the sample means also becomes
approximately normal. (For example, with a sample size of 10, 25 samples of 10
observations each give 25 different sample means. The average of all the sample
means stays about the same, while their spread shrinks as the sample size grows.)
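
The experiment can be imitated with a short simulation. The sketch below assumes the population is uniform on 0 to 10 (which has expected value 5) and draws 2,000 samples per sample size; both choices are assumptions for illustration, not details taken from the slide.

```python
import random
from statistics import mean, stdev

rng = random.Random(5)   # fixed seed so the run is repeatable
N_SAMPLES = 2000         # repeated samples per sample size (an assumption)

def sample_means(n):
    """Means of N_SAMPLES samples of size n from a uniform(0, 10) population."""
    return [sum(rng.uniform(0, 10) for _ in range(n)) / n for _ in range(N_SAMPLES)]

results = {}
for n in (10, 25, 100):
    means = sample_means(n)
    results[n] = (mean(means), stdev(means))
    print(f"n = {n:>3}: average of sample means = {results[n][0]:.3f}, "
          f"std dev of sample means = {results[n][1]:.3f}")
```

The averages should all sit near 5, while the standard deviation of the sample means falls as n grows.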

Central limit theorem

If the sample size is large enough, then the sampling distribution of the mean:

• is approximately normally distributed regardless of the distribution of the population;
• has a mean equal to the population mean.

Rule of thumb ("magic number"): n ≥ 30. With a sample of at least 30, the sampling
distribution of the mean (not the population itself) is treated as approximately normal.

• If the population is normally distributed, then the sampling distribution is also normally
distributed for any sample size.

• The central limit theorem allows us to use the theory we learned about calculating probabilities
for normal distributions to draw conclusions about sample means.
Central limit theorem
If X has mean µ and standard deviation σ, and if n (the sample size) is sufficiently
large, the distribution of the sample mean X̄ is approximately normal:

X̄ ~ N(µ, σ/√n)

i.e., regardless of the distribution of the original population, the distribution of the
sample mean is approximately normal, if the sample size is large enough.
This “normal approximation” gets better as the sample size increases.
But the sample MUST be (simple) random.

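
A small simulation can make the theorem concrete. The sketch below draws samples from a strongly skewed (exponential) population, which is an assumption chosen for illustration, and checks that the sample means still behave like a normal distribution with standard deviation σ/√n.

```python
import random
from statistics import mean, stdev

rng = random.Random(42)
MU = SIGMA = 2.0     # an exponential population with mean 2 also has std dev 2
n = 50               # sample size
N_SAMPLES = 4000     # number of repeated samples (an assumption)

# Each entry is the mean of one random sample of size n.
sample_means = [
    sum(rng.expovariate(1 / MU) for _ in range(n)) / n
    for _ in range(N_SAMPLES)
]

standard_error = SIGMA / n ** 0.5
# Share of sample means within 1.96 standard errors of the population mean.
within = sum(abs(m - MU) <= 1.96 * standard_error for m in sample_means) / N_SAMPLES

print(f"mean of sample means    = {mean(sample_means):.3f} (population mean {MU})")
print(f"std dev of sample means = {stdev(sample_means):.3f} (sigma/sqrt(n) = {standard_error:.3f})")
print(f"share within 1.96 SE    = {within:.3f}")
```

If the normal approximation holds, roughly 95% of the sample means should fall within 1.96 standard errors of µ, even though the population itself is far from normal.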

Sampling distribution
The sampling distribution of the mean is the distribution of the means of all
possible samples of a fixed size n from some population.

The standard deviation of the sampling distribution of the mean is called the
standard error of the mean:

Standard error of the mean = σ/√n

As n increases, the standard error decreases.


Larger sample sizes have less sampling error.

Workbook Exercise 5.1

Computing the standard error of the mean


For the uniformly distributed population, we know σ² = 8.333 and, therefore, σ = 2.89.

Calculate the standard errors for the following samples

Sample size Standard error of the mean


10
25
100
500
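
One way to check the answers is to compute σ/√n for each sample size. This is a minimal sketch using the population variance given in the exercise; the variable names are illustrative.

```python
from math import sqrt

sigma = sqrt(8.333)   # population standard deviation, about 2.89

# Standard error of the mean for each sample size in the table.
standard_errors = {n: sigma / sqrt(n) for n in (10, 25, 100, 500)}

for n, se in standard_errors.items():
    print(f"n = {n:>3}: standard error of the mean = {se:.3f}")
```

Note that quadrupling the sample size (25 to 100) only halves the standard error, because n enters through its square root.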
Point estimate
From the sample, we compute the value of relevant sample statistics which could be
used as point (i.e. a single value) estimates of the equivalent population parameters.

• We refer to x̄ (x-bar) as the point estimator of the population mean, µ.

• s is the point estimator of the population standard deviation, σ.

• p is the point estimator of the population proportion, π.

Why don’t we use a point estimate?
We should not use the value of x̄ by itself as an estimate of µ, because:
• it is almost certain to be wrong,
• and we don’t know how wrong.

A point estimator does not provide information about how close the
estimate is to the population parameter.

We would have no confidence in using it. Instead we use a confidence interval.

Interval Estimate
• To deal with uncertainty, we can use an interval estimate
• It provides a range of values that best describe the population
• To develop an interval estimate we need to learn about confidence
levels
• Because we only have a sample and its sample mean, we report an interval
estimate rather than a single value.

Confidence levels
• A confidence level is the probability that the interval
estimate will include the population parameter (such
as the mean)
• Sample means will follow the normal probability
distribution for large sample sizes (n ≥ 30)
• To construct an interval estimate with a 90% confidence level, use the
z-score from the standard normal table that corresponds to that level: 1.645
• The standard normal (z) distribution always has a mean of zero
• An upper-tail area of 0.05 lies midway between the table entries 0.0505
(z = 1.64) and 0.0495 (z = 1.65), which is why z = 1.645 is used
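
Instead of reading the table, the z-score for a given confidence level can be computed from the inverse of the standard normal distribution. The helper name `z_for_confidence` below is hypothetical; the sketch relies only on Python's standard library.

```python
from statistics import NormalDist

def z_for_confidence(level):
    """z-score that leaves (1 - level)/2 in each tail of the standard normal."""
    alpha = 1 - level
    return NormalDist().inv_cdf(1 - alpha / 2)

z_scores = {level: z_for_confidence(level) for level in (0.90, 0.95, 0.99)}
for level, z in z_scores.items():
    print(f"{level:.0%} confidence level: z = {z:.3f}")
```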

Confidence intervals
• A confidence interval is a range of values between which the value of the population
parameter is believed to be, along with a probability that the interval correctly
estimates the true (unknown) population parameter.

• This probability is called the level of confidence, denoted by 1 − α, where α is a
number between 0 and 1.

• The level of confidence is usually expressed as a percent; common values are 90%,
95%, or 99%.

• For a 95% confidence interval, if we chose 100 different samples, leading to 100
different interval estimates, we would expect that 95% of them would contain the true
population mean.
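
The "95 out of 100" interpretation can be checked by simulation. The sketch below assumes a hypothetical normal population (mean 50, standard deviation 10) with known σ, builds 1,000 intervals from independent random samples, and counts how many contain the true mean.

```python
import random
from statistics import NormalDist

rng = random.Random(2024)
MU, SIGMA, n = 50.0, 10.0, 40     # hypothetical population and sample size
z = NormalDist().inv_cdf(0.975)   # about 1.96 for 95% confidence
margin = z * SIGMA / n ** 0.5

N_INTERVALS = 1000
hits = 0
for _ in range(N_INTERVALS):
    # Draw one random sample and compute its mean.
    x_bar = sum(rng.gauss(MU, SIGMA) for _ in range(n)) / n
    # Does this sample's interval contain the true population mean?
    if x_bar - margin <= MU <= x_bar + margin:
        hits += 1

coverage = hits / N_INTERVALS
print(f"{coverage:.1%} of the intervals contain the true mean")
```

The observed share will vary from run to run but should stay close to 95%; any single interval either contains µ or it does not.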
Level of Significance
• With a 90% confidence level, there is a 90% probability that any given
confidence interval will contain the true population mean, so there is a
10% chance that it won’t.

• This 10% is known as the level of significance (α) and is represented by the
purple shaded area.

Interval estimates
An interval estimate provides a range for a population characteristic based on a
sample. Intervals specify a range of plausible values for the characteristic of
interest and a way of assessing “how plausible” they are.

In general, a 100(1 − α)% probability interval is any interval [A, B] such that the
probability of falling between A and B is 1 − α.

Probability intervals are often centered on the mean or median.

Example: in a normal distribution, the mean plus or minus 1 standard deviation
describes an approximate 68% probability interval around the mean.
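
The 68% figure, and the multipliers behind the common 90% and 95% intervals, can be verified numerically. The helper `probability_within` below is a hypothetical name used only for this sketch.

```python
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, standard deviation 1

def probability_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 1.645, 1.96):
    print(f"mean ± {k} standard deviations covers {probability_within(k):.4f}")
```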

Estimation process

Margin of error and interval estimate

Margin of error (ME) = zα/2 × (σ/√n), where σ/√n is the standard error of the mean.

An interval estimate is constructed by subtracting and adding the margin of error
(ME) to a point estimate:

Point estimate ± ME

An interval estimate of the population mean is:

x̄ ± zα/2 (σ/√n)

Confidence Interval for the Mean
When Population Standard Deviation is known
Sample mean ± margin of error (use this formula when the population standard deviation σ is given)

Margin of error is: ± zα/2 (standard error)

• zα/2 is the value of the standard normal random variable for an upper tail area of α/2
(or a lower tail area of 1 − α/2).
• zα/2 is computed as =NORM.S.INV(1 – α/2)

• Example: if α = 0.05 (for a 95% confidence interval), then NORM.S.INV(0.975) = 1.96


• Example: if α = 0.10 (for a 90% confidence interval), then NORM.S.INV(0.95) = 1.645

The margin of error can also be computed by =CONFIDENCE.NORM(alpha, standard deviation, size).
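
For readers working outside Excel, the same quantities can be sketched in Python. The function `margin_of_error` below is a hypothetical helper that takes the same three inputs as CONFIDENCE.NORM; the standard deviation of 10 and sample size of 100 are made-up values for illustration.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(alpha, sigma, n):
    """z_(alpha/2) times the standard error, for a known population sigma."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # plays the role of NORM.S.INV(1 - alpha/2)
    return z * sigma / sqrt(n)

z_95 = NormalDist().inv_cdf(0.975)
z_90 = NormalDist().inv_cdf(0.95)
print(f"z for alpha = 0.05: {z_95:.3f}")
print(f"z for alpha = 0.10: {z_90:.3f}")

# Hypothetical inputs: sigma = 10 and n = 100 give a standard error of 1.
me = margin_of_error(0.05, 10, 100)
print(f"margin of error: {me:.3f}")
```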
Confidence Interval for the Mean
When Population Standard Deviation is known
Example:
A production process fills bottles of liquid detergent. The standard deviation in filling
volumes is constant at 15 mL. A sample of 25 bottles revealed a mean filling volume of
796 mL.

A 95% confidence interval estimate of the mean filling volume for the population is:

x̄ ± zα/2 (σ/√n) = 796 ± 1.96 (15/√25) = 796 ± 5.88, or [790.12, 801.88] mL

where x̄ = sample mean, α = level of significance (1 − confidence level, so α/2 = 0.025
here), and σ = population standard deviation.
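
The arithmetic for the detergent example can be traced step by step. This sketch uses only the figures given in the example; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

x_bar = 796    # sample mean filling volume (mL)
sigma = 15     # known population standard deviation (mL)
n = 25         # number of bottles sampled
alpha = 0.05   # 95% confidence level

z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96
standard_error = sigma / sqrt(n)          # 15 / 5 = 3
margin = z * standard_error

lower, upper = x_bar - margin, x_bar + margin
print(f"95% confidence interval: {lower:.2f} mL to {upper:.2f} mL")
```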

Workbook Exercise 5.3
• Find a 90% interval estimate of the average number of hours that Australian children
spend watching television.
• Suppose a survey of 100 children is taken, and the sample mean calculated as 27.191
hours per week.
• We are also told that the population standard deviation is σ = 8 hours/week.
• Assume the population of hours spent watching TV is normal.
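
One possible way to check a worked answer is with a small reusable function. The name `confidence_interval` and its parameters are hypothetical; the inputs are the figures stated in the exercise.

```python
from math import sqrt
from statistics import NormalDist

def confidence_interval(x_bar, sigma, n, level):
    """Interval for the mean when the population standard deviation is known."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

# Exercise inputs: 100 children, sample mean 27.191 hours, sigma = 8 hours, 90% level.
lower, upper = confidence_interval(x_bar=27.191, sigma=8, n=100, level=0.90)
print(f"90% interval estimate: {lower:.3f} to {upper:.3f} hours per week")
```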
Confidence interval properties
• As the level of confidence, 1 − α, decreases, zα/2 decreases, and the confidence
interval becomes narrower.

• For example, a 90% confidence interval will be narrower than a 95%
confidence interval. Similarly, a 99% confidence interval will be wider than a
95% confidence interval.

• Essentially, you must trade off a higher level of accuracy (a narrower interval)
against the risk that the confidence interval does not contain the true mean.

• To reduce the risk without widening the interval, you should consider increasing
the sample size, which lowers the standard error.
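
These properties can be seen by comparing interval widths across confidence levels and sample sizes. The sketch below assumes a known population standard deviation of 15, a made-up value chosen for illustration, and the helper name `interval_width` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def interval_width(level, sigma, n):
    """Full width of the confidence interval: twice the margin of error."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return 2 * z * sigma / sqrt(n)

SIGMA = 15   # hypothetical known population standard deviation

widths = {
    (level, n): interval_width(level, SIGMA, n)
    for n in (25, 100)
    for level in (0.90, 0.95, 0.99)
}
for (level, n), width in widths.items():
    print(f"n = {n:>3}, {level:.0%} confidence: width = {width:.2f}")
```

Raising the confidence level widens the interval, while quadrupling the sample size halves its width at any given level.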

END
