
MINISTRY OF EDUCATION AND TRAINING

NATIONAL ECONOMICS UNIVERSITY

***

GROUP MID-TERM EXAM


Course: Mathematical Statistics
TOPIC: Sampling distribution & estimation

Lecturer: Trần Thị Bích

Members: Vũ Thị Hồng Giang 11221852


Trịnh Ngân Hà 11221997
Hoàng Minh Hiền 11222204
Phạm Hồng Ngọc 11224749
Nguyễn Đức Thành 11203579
Sampling distribution & estimation
I. SUMMARIZE THE ARTICLE
1. The article
2. Introduction
3. Technique
4. Additional references of Sampling distribution and estimation
5. Conclusion
II. DATA ANALYSIS OF A SPECIFIC ORGANIZATIONAL PROBLEM
1. Overall view: Amazon and dataset
2. Sampling distribution
3. Estimation
I. SUMMARIZE THE ARTICLE
1. The article

COVID-19 prevalence estimation by random sampling in population - optimal sample pooling under varying assumptions about true prevalence.

Source: https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-020-01081-0

2. Introduction

Studying large populations can be very difficult, and getting information
from each member can be costly and time-consuming. That's why researchers turn
to random sampling to help reach the conclusions they need to make key decisions,
whether that means helping provide the services that residents need, making better
business decisions, or executing changes in an investor's portfolio.

The study set out to estimate the rate of Covid infection by random sampling
in the population and to find the optimal sample pooling under varying
assumptions about true prevalence. The aim of this method is to significantly
reduce the total number of tests needed to estimate prevalence, making it quicker
to identify Covid cases. This in turn reduces the pressure on agencies and saves
time and costs.

The study showed that pooling approaches could facilitate patient testing
during an infectious disease outbreak mainly by expanding current screening
capacities of detection both in the community and among healthcare providers.

The sample pooling strategy was intended to determine the optimal
parameters for group testing of pooled specimens for the detection of SARS-CoV-2
and to process them without significant loss of test usability.

3. Technique

The paper used a sampling distribution to estimate the rate of Covid
infection. This statistical tool describes the probability of an outcome based
on data from a small group within a large population, with the primary purpose
of obtaining representative results from small samples of a comparatively larger
population. When the population is too large to analyze in full, you can select a
smaller group and repeatedly sample or analyze it. The majority of data analyzed
by researchers are samples, not populations. For example, a medical researcher who
wants to compare the average weight of all babies born in North America from
1995 to 2005 with that of babies born in South America within the same period
cannot collect the data for the entire population of over a million childbirths
that occurred over the ten-year time frame within a reasonable amount of time.
Instead, the researcher will use the weights of only 100 babies from each
continent to draw a conclusion. The 100 babies whose weights are used are the
sample, and the average weight calculated from them is the sample mean.

Using finite-sample distribution, users can calculate the mean, range,
standard deviation, mean absolute value of the deviation, variance, and unbiased
estimate of the variance of the sample. No matter for what purpose users wish to
use the collected data, it helps strategists, statisticians, academicians, and financial
analysts make necessary preparations and take relevant actions concerning the
expected outcome.

The article started by generating a population of 500,000 citizens, each of
whom has probability p of being infected at the time of sampling.
The number of patient samples collected from the population is denoted by n, and
the number of patient samples that are pooled into a single well is denoted by k.
The total number of pools is thus m = n/k. The number of positive
pools in an experiment is denoted by x.

Explored parameter options:

p∈{0.001,0.003,0.01,0.03,0.10}

n∈{200,500,1000,1500,2000,3000,5000}

k∈{1,3,5,7,10,15,20,25,30,40,50,70,100,200}
Then, for each parameter combination, the study replicated the experiment
100,000 times, calculated the estimated prevalence p each time, and reported the
2.5% and 97.5% quantiles of the distribution of these estimates.
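
The replication procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the paper's own code: it assumes a perfect test, draws pool results from a binomial model instead of a finite population of 500,000, uses far fewer replicates than 100,000, and uses the standard pooled-testing estimator p = 1 - (1 - x/m)^(1/k). The function names are hypothetical.

```python
import random

def estimate_prevalence(x, m, k):
    """Estimate p from x positive pools out of m pools of size k.

    Assumes a perfect test, so a pool is positive with probability
    1 - (1 - p)**k; solving x/m = 1 - (1 - p)**k for p gives the estimate.
    """
    if x >= m:
        return 1.0
    return 1.0 - (1.0 - x / m) ** (1.0 / k)

def simulate_quantiles(p, n, k, reps=2000, seed=0):
    """Return the 2.5% and 97.5% quantiles of the estimate over reps replicates."""
    rng = random.Random(seed)
    m = n // k                              # number of pools
    pool_positive = 1.0 - (1.0 - p) ** k    # chance a pool holds at least one case
    estimates = []
    for _ in range(reps):
        x = sum(1 for _ in range(m) if rng.random() < pool_positive)
        estimates.append(estimate_prevalence(x, m, k))
    estimates.sort()
    low = estimates[int(0.025 * (reps - 1))]
    high = estimates[int(0.975 * (reps - 1))]
    return low, high

# Example: true prevalence 1%, 5000 samples, pools of 10.
low, high = simulate_quantiles(p=0.01, n=5000, k=10)
print(f"central 95% range of the estimate: {low:.4f} to {high:.4f}")
```

Running the function for several values of k with the same n shows how the central 95% range changes as more samples are pooled into each well.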

With 5000 total samples, there is minimal difference in the central estimates
of p between individual samples and a pooling level of 200 (95% range 0.0022-0.0021).
145 reactions, a 34.5-fold reduction in the number of independent RT-PCR
sets compared with testing all 5000 samples individually, are sufficient to obtain
patient-level diagnosis 97.5% of the time.

Figure: Central 95% estimates of p with a test with sensitivity (η) 0.95 and perfect
specificity (θ = 1) under different combinations of the total number of samples and
level of sample pooling. a: p = 0.001; b: p = 0.003; c: p = 0.01; d: p = 0.03; e: p = 0.10
4. Additional references of Sampling distribution and estimation
- The use of random sampling in investigations involving child abuse material
Source:
https://www.sciencedirect.com/science/article/pii/S1742287612000369?fbclid=IwAR3NjlExEHsJEQqL1sji6E1nuR8tqQjtYO2p29h5r_2cHcUkdW6CUphe7s8#aep-abstract-id3
- The effect of sampling health facilities on estimates of effective coverage: a
simulation study
Source:
https://ij-healthgeographics.biomedcentral.com/articles/10.1186/s12942-022-00307-2
5. Conclusion
Attempts to estimate the true current prevalence of COVID-19 by PCR tests
can benefit from sample pooling strategies. Such strategies have the potential to
greatly reduce the required number of tests with only slight decreases in the
precision of prevalence estimates.
However, there are still some limitations. The strategy outlined here
presents some logistical challenges: samples must be randomly allocated to pools,
which rules out some practical approaches such as sampling a particular
sub-district and pooling these samples, then sampling another district the next day;
and binary testing of sub-pools might be more cumbersome than it is worth. All in
all, even though this sampling method has certain disadvantages, we cannot deny
that it has various benefits and gives results that can be fairly trustworthy.

II. DATA ANALYSIS OF A SPECIFIC ORGANIZATIONAL PROBLEM


1. Overall view: Amazon and dataset
Amazon is an American multinational technology company
headquartered in Seattle, Washington, that focuses on cloud computing, digital
streaming, artificial intelligence, and e-commerce. In particular, Amazon's
e-commerce business is highly regarded for its quality and reputation.
For this analysis, our team selected the Amazon Products Sales Dataset
2023 and focused on buyer ratings of the men's fashion products listed in
this dataset.
Source: https://www.kaggle.com/datasets/lokeshparab/amazon-products-dataset
2. Sampling distribution
The data table of consumer ratings for men's fashion products
includes 16094 observations.
These are the results obtained by the team using SPSS:

Statistics: ratings

N (valid)         16094
N (missing)       0
Mean              3.867
Median            3.900
Mode              4.0
Std. Deviation    0.4307
Variance          0.186
Range             2.7
Minimum           2.3
Maximum           5.0

Frequency table: ratings

Rating   Frequency   Percent   Valid Percent   Cumulative Percent
2.3      14          0.1       0.1             0.1
2.4      31          0.2       0.2             0.3
2.5      58          0.4       0.4             0.6
2.6      58          0.4       0.4             1.0
2.7      86          0.5       0.5             1.5
2.8      96          0.6       0.6             2.1
2.9      128         0.8       0.8             2.9
3.0      320         2.0       2.0             4.9
3.1      250         1.6       1.6             6.5
3.2      343         2.1       2.1             8.6
3.3      473         2.9       2.9             11.5
3.4      613         3.8       3.8             15.3
3.5      807         5.0       5.0             20.4
3.6      1042        6.5       6.5             26.8
3.7      1181        7.3       7.3             34.2
3.8      1528        9.5       9.5             43.7
3.9      1694        10.5      10.5            54.2
4.0      1973        12.3      12.3            66.5
4.1      1465        9.1       9.1             75.6
4.2      1262        7.8       7.8             83.4
4.3      874         5.4       5.4             88.8
4.4      566         3.5       3.5             92.3
4.5      505         3.1       3.1             95.5
4.6      352         2.2       2.2             97.7
4.7      194         1.2       1.2             98.9
4.8      110         0.7       0.7             99.6
4.9      26          0.2       0.2             99.7
5.0      45          0.3       0.3             100.0
Total    16094       100.0     100.0

The frequency distribution table of the ratings of these products
shows that there are no ratings from 1.0 to 2.2. In the data set, the
rating 4.0 appears most frequently (1973 times). The
frequency of the ratings increases gradually and reaches its peak at 4.0 before
decreasing. The neighbouring ratings are also very frequent: 1528 products are
rated 3.8, 1694 are rated 3.9, and 1465 are rated 4.1. This can also be seen more
clearly in the histogram.

The statistics table and histogram show that: The mean is 3.867, which
means the average rating is 3.867. The median is 3.9, so 3.9 is the rating in the
middle of the data series sorted in ascending or descending order. The mode is 4.0,
so 4.0 is the rating that occurs most often. The variance is 0.186 and the
standard deviation is 0.4307. The lower these values are, the more stable the data
are and the less they fluctuate around the mean. The range is 2.7, the
maximum value is 5.0, and the minimum value is 2.3.

It can be inferred from the information above that 45.8% of products are
rated 4.0 or higher, since the cumulative percentage up to 3.9 is 54.2%. The
ratings from customers cluster around 4.0, which is quite
high. This might be attributed to the high quality of goods and services on Amazon,
which meet the needs of customers.
From the data on consumer evaluation of men's fashion products, our team has
selected 2 samples as follows:

a) Sample 1 has 270 observations

Statistics: ratings

N (valid)         270
N (missing)       0
Mean              4.421
Median            4.800
Mode              4.8
Std. Deviation    0.5658
Variance          0.320
Range             1.9
Minimum           3.1
Maximum           5.0

Frequency table: ratings

Rating   Frequency   Percent   Valid Percent   Cumulative Percent
3.1      11          4.1       4.1             4.1
3.2      8           3.0       3.0             7.0
3.6      15          5.6       5.6             12.6
3.7      12          4.4       4.4             17.0
3.9      20          7.4       7.4             24.4
4.0      20          7.4       7.4             31.9
4.3      20          7.4       7.4             39.3
4.5      13          4.8       4.8             44.1
4.8      97          35.9      35.9            80.0
4.9      26          9.6       9.6             89.6
5.0      28          10.4      10.4            100.0
Total    270         100.0     100.0

The statistics table and histogram show that: The mean is 4.421, which
means the average rating is 4.421. The median is 4.8, so 4.8 is the rating in the
middle of the data series sorted in ascending or descending order. The mode is 4.8,
so 4.8 is the rating that occurs most often. The variance is 0.320 and the
standard deviation is 0.5658.
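
As a cross-check of the SPSS figures for Sample 1, the same statistics can be recomputed directly from the frequency table. The sketch below is illustrative only: the dictionary is copied from the Sample 1 frequency table, the function name `describe` is hypothetical, and the variance uses the n - 1 divisor (sample variance).

```python
import math
import statistics

# Frequency table for Sample 1 (rating -> number of products), copied from the table above.
SAMPLE_1 = {
    3.1: 11, 3.2: 8, 3.6: 15, 3.7: 12, 3.9: 20, 4.0: 20,
    4.3: 20, 4.5: 13, 4.8: 97, 4.9: 26, 5.0: 28,
}

def describe(freq):
    """Return n, mean, median, mode, sample variance and SD for a frequency table."""
    # Expand the table into one value per observation, in ascending order.
    values = [rating for rating, count in sorted(freq.items()) for _ in range(count)]
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    return {
        "n": n,
        "mean": mean,
        "median": statistics.median(values),
        "mode": max(freq, key=freq.get),  # rating with the highest frequency
        "variance": variance,
        "sd": math.sqrt(variance),
    }

summary = describe(SAMPLE_1)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

The same function can be applied to the frequency table of any other sample by swapping in its dictionary.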

b) Sample 2 has 1097 observations

Statistics: ratings

N (valid)         1097
N (missing)       0
Mean              3.687
Median            3.800
Mode              4.3
Std. Deviation    0.6191
Variance          0.383
Range             2.7
Minimum           2.3
Maximum           5.0

Frequency table: ratings

Rating   Frequency   Percent   Valid Percent   Cumulative Percent
2.3      6           0.5       0.5             0.5
2.4      10          0.9       0.9             1.5
2.5      23          2.1       2.1             3.6
2.6      33          3.0       3.0             6.6
2.7      45          4.1       4.1             10.7
2.8      46          4.2       4.2             14.9
2.9      5           0.5       0.5             15.3
3.0      24          2.2       2.2             17.5
3.2      59          5.4       5.4             22.9
3.3      110         10.0      10.0            32.9
3.5      41          3.7       3.7             36.6
3.6      115         10.5      10.5            47.1
3.8      141         12.9      12.9            60.0
3.9      11          1.0       1.0             61.0
4.0      115         10.5      10.5            71.5
4.2      53          4.8       4.8             76.3
4.3      189         17.2      17.2            93.5
4.7      38          3.5       3.5             97.0
4.9      4           0.4       0.4             97.4
5.0      29          2.6       2.6             100.0
Total    1097        100.0     100.0

The statistics table and histogram show that: The mean is 3.687, which
means the average rating is 3.687. The median is 3.8, so 3.8 is the rating in the
middle of the data series sorted in ascending or descending order. The mode is 4.3,
so 4.3 is the rating that accounts for the most. The variance is 0.383 and the
standard deviation is 0.6191.

In conclusion, the mean rating of men's fashion products on Amazon differs
between the two samples: 4.421 for Sample 1 (n = 270) and 3.687 for Sample 2
(n = 1097), compared with a population mean of 3.867. As the sample size
increases, the sampling distribution of the sample mean more closely
approximates the normal distribution and becomes more tightly clustered around the
population mean of the buyers' ratings of men's fashion, just as the central
limit theorem states.

We can notice that the histograms for all three statistics (sample mean, sample
median, and sample standard deviation) become more symmetric and bell-shaped,
and also narrower, as the sample size increases, particularly those for the
sample mean. Also, notice that the estimated standard deviation of the sample
mean not only decreases as the sample size increases but is also approximately
the same for samples of the same size.
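
The behaviour of the standard deviation of the sample mean can be illustrated with a small simulation. The sketch below is an assumption-laden illustration, not the team's SPSS procedure: it rebuilds the population from the full frequency table of 16094 ratings, draws repeated random samples with replacement at the two sample sizes used in this report, and compares the spread of the sample means with the theoretical value σ/√n. The function name and the number of replicates are hypothetical choices.

```python
import math
import random
import statistics

# Population frequency table (rating -> count) for all 16094 products.
POPULATION = {
    2.3: 14, 2.4: 31, 2.5: 58, 2.6: 58, 2.7: 86, 2.8: 96, 2.9: 128,
    3.0: 320, 3.1: 250, 3.2: 343, 3.3: 473, 3.4: 613, 3.5: 807, 3.6: 1042,
    3.7: 1181, 3.8: 1528, 3.9: 1694, 4.0: 1973, 4.1: 1465, 4.2: 1262,
    4.3: 874, 4.4: 566, 4.5: 505, 4.6: 352, 4.7: 194, 4.8: 110, 4.9: 26,
    5.0: 45,
}

ratings = list(POPULATION)
weights = list(POPULATION.values())
N = sum(weights)
mu = sum(r * c for r, c in POPULATION.items()) / N
sigma = math.sqrt(sum(c * (r - mu) ** 2 for r, c in POPULATION.items()) / N)

def sd_of_sample_means(n, reps=500, seed=1):
    """Standard deviation of the means of reps random samples of size n."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.choices(ratings, weights=weights, k=n))
        for _ in range(reps)
    ]
    return statistics.stdev(means)

for n in (270, 1097):
    simulated = sd_of_sample_means(n)
    theoretical = sigma / math.sqrt(n)
    print(f"n={n}: simulated SD of mean {simulated:.4f}, sigma/sqrt(n) {theoretical:.4f}")
```

The simulated spread should sit close to σ/√n at each sample size and shrink as n grows, which is the pattern described in the paragraph above.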
3. Estimation

Using a single sample and the Z table, we can draw conclusions about a
variable's population mean. For instance, for a sample of men's heights we might
conclude with 95% confidence that the population mean height lies between
174.08 cm and 181.22 cm.
Problem 1: The sample of 270 observations is taken from a normal population of
the rating data. The sample mean is 4.421, and the SD of the sample is 0.5658.
a) Determine the 90% confidence interval for the population mean.

Solution

Let

● x̄ be the sample mean

● µ be the mean of the population

● σ be the standard deviation of the population

● n be the sample size

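We have:

$$P\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1-\alpha$$

For x̄ = 4.421, σ ≈ 0.5658 (the sample SD is used as an estimate of σ because n is large), n = 270, α = 10%, z_{α/2} = 1.64:

$$P\left(4.421 - 1.64\cdot\frac{0.5658}{\sqrt{270}} < \mu < 4.421 + 1.64\cdot\frac{0.5658}{\sqrt{270}}\right) = 1 - 0.1 = 0.90$$

$$P(4.365 < \mu < 4.477) = 0.90$$

Interpretation:

If the experiment were carried out multiple times, 90% of the intervals created in
this way would contain µ.

Lower Confidence Limit: 4.365 Upper Confidence Limit: 4.477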


b) Determine the 95% confidence intervals for the population mean
Solution
$$P\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1-\alpha$$

For x̄ = 4.421, σ ≈ 0.5658, n = 270, α = 5%, z_{α/2} = 1.96:

$$P\left(4.421 - 1.96\cdot\frac{0.5658}{\sqrt{270}} < \mu < 4.421 + 1.96\cdot\frac{0.5658}{\sqrt{270}}\right) = 1 - 0.05 = 0.95$$

$$P(4.354 < \mu < 4.488) = 0.95$$

Interpretation:
If the experiment were carried out multiple times, 95% of the intervals created in
this way would contain µ.
Lower Confidence Limit: 4.354 Upper Confidence Limit: 4.488
c) Determine the 99% confidence intervals for the population mean
Solution
$$P\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1-\alpha$$

For x̄ = 4.421, σ ≈ 0.5658, n = 270, α = 1%, z_{α/2} = 2.57:

$$P\left(4.421 - 2.57\cdot\frac{0.5658}{\sqrt{270}} < \mu < 4.421 + 2.57\cdot\frac{0.5658}{\sqrt{270}}\right) = 1 - 0.01 = 0.99$$

$$P(4.333 < \mu < 4.509) = 0.99$$

Interpretation:
If the experiment were carried out multiple times, 99% of the intervals created in
this way would contain µ.
Lower Confidence Limit: 4.333 Upper Confidence Limit: 4.509
⇒ By increasing the confidence level, we can be more certain that the interval
contains the population mean. The lower confidence limit decreases and the upper
confidence limit increases, so the 99% confidence interval is the widest of the
three. Therefore, as we increase the confidence level, the width of the interval
increases as well: a higher confidence level comes at the cost of a less precise
estimate.
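
The three intervals in Problem 1 can be recomputed with a short script. This is a minimal sketch under the same assumptions as the solution: the sample SD stands in for σ, and the critical values 1.64, 1.96 and 2.57 are the Z-table values used in the text. The names `Z_TABLE` and `z_interval` are hypothetical.

```python
import math

# Two-sided critical values as read from the Z table in the solution.
Z_TABLE = {0.90: 1.64, 0.95: 1.96, 0.99: 2.57}

def z_interval(mean, sd, n, confidence):
    """Return (lower, upper) limits of the z confidence interval for the mean."""
    margin = Z_TABLE[confidence] * sd / math.sqrt(n)
    return mean - margin, mean + margin

for level in (0.90, 0.95, 0.99):
    low, high = z_interval(4.421, 0.5658, 270, level)
    print(f"{level:.0%}: ({low:.3f}, {high:.3f}), width {high - low:.3f}")
```

Printing the width next to each interval makes the trade-off visible: each step up in confidence level widens the interval around the same sample mean.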

Problem 2: From the rating data, the standard deviation of buyers' ratings is 0.4307.
How many observations should we take to ensure that the sample mean we
obtain is no more than 0.2 from the population mean with

a) 90% confidence

Let

● x̄ be the sample mean

● µ be the mean of the population

● σ be the standard deviation of the population

● n be the sample size

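We have:

$$P(|\bar{x}-\mu| < 0.2) = 0.90$$

$$P\left(\frac{|\bar{x}-\mu|}{\sigma/\sqrt{n}} < \frac{0.2}{\sigma/\sqrt{n}}\right) = 0.90$$

$$P\left(|Z| < \frac{0.2\sqrt{n}}{0.4307}\right) = 0.90$$

$$P\left(-\frac{0.2\sqrt{n}}{0.4307} < Z < \frac{0.2\sqrt{n}}{0.4307}\right) = 0.90$$

$$P\left(Z < \frac{0.2\sqrt{n}}{0.4307}\right) = 0.95$$

$$\frac{0.2\sqrt{n}}{0.4307} = 1.64$$

$$n \approx 12.47$$

A minimum sample size of 13 is needed for the sample mean to lie within 0.2 of
the true population mean with 90% confidence.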

b) 95% confidence

Let

● x̄ be the sample mean

● µ be the mean of the population

● σ be the standard deviation of the population

● n be the sample size

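We have:

$$P(|\bar{x}-\mu| < 0.2) = 0.95$$

$$P\left(\frac{|\bar{x}-\mu|}{\sigma/\sqrt{n}} < \frac{0.2}{\sigma/\sqrt{n}}\right) = 0.95$$

$$P\left(|Z| < \frac{0.2\sqrt{n}}{0.4307}\right) = 0.95$$

$$P\left(-\frac{0.2\sqrt{n}}{0.4307} < Z < \frac{0.2\sqrt{n}}{0.4307}\right) = 0.95$$

$$P\left(Z < \frac{0.2\sqrt{n}}{0.4307}\right) = 0.975$$

$$\frac{0.2\sqrt{n}}{0.4307} = 1.96$$

$$n \approx 17.82$$

A minimum sample size of 18 is needed for the sample mean to lie within 0.2 of
the true population mean with 95% confidence.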

c) 99% confidence

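We have:

$$P(|\bar{x}-\mu| < 0.2) = 0.99$$

$$P\left(\frac{|\bar{x}-\mu|}{\sigma/\sqrt{n}} < \frac{0.2}{\sigma/\sqrt{n}}\right) = 0.99$$

$$P\left(|Z| < \frac{0.2\sqrt{n}}{0.4307}\right) = 0.99$$

$$P\left(-\frac{0.2\sqrt{n}}{0.4307} < Z < \frac{0.2\sqrt{n}}{0.4307}\right) = 0.99$$

$$P\left(Z < \frac{0.2\sqrt{n}}{0.4307}\right) = 0.995$$

$$\frac{0.2\sqrt{n}}{0.4307} = 2.57$$

$$n \approx 30.63$$

A minimum sample size of 31 is needed for the sample mean to lie within 0.2 of
the true population mean with 99% confidence.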

⇒ For a fixed margin of error, the higher the required confidence level, the
greater the number of observations needed in the sample.
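
The sample sizes in Problem 2 all follow from solving z·σ/√n = E for n and rounding up. The sketch below is illustrative: it reuses the Z-table values from the solutions, and the names `Z_TABLE` and `required_sample_size` are hypothetical.

```python
import math

# Two-sided critical values as read from the Z table in the solutions.
Z_TABLE = {0.90: 1.64, 0.95: 1.96, 0.99: 2.57}

def required_sample_size(sigma, margin, confidence):
    """Smallest n for which z * sigma / sqrt(n) does not exceed the margin."""
    z = Z_TABLE[confidence]
    return math.ceil((z * sigma / margin) ** 2)

for level in (0.90, 0.95, 0.99):
    n = required_sample_size(sigma=0.4307, margin=0.2, confidence=level)
    print(f"{level:.0%} confidence: n = {n}")
```

Rounding up rather than to the nearest integer keeps the margin of error at or below the 0.2 target.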

Thus, our team applied sampling distributions and estimation to the collected
data to analyze several related issues. The analysis shows
that these methods are really beneficial: they make it possible to estimate the
parameters of the population more accurately and to make decisions based on these
estimates. In addition, they help us check assumptions about the
distribution of the population, which is useful for large-scale studies.
