Group 9 - Sampling Distribution and Estimation
***
Source: https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-020-01081-0
2. Introduction
The study showed that pooling approaches could facilitate patient testing
during an infectious disease outbreak mainly by expanding current screening
capacities of detection both in the community and among healthcare providers.
3. Technique
The paper uses a sampling distribution to estimate the prevalence of COVID-19
infection. A sampling distribution describes how a statistic, such as a sample
mean or proportion, varies across repeated samples drawn from the same
population; its primary purpose is to let small samples yield results that
represent a comparatively larger population. When the population is too large to
analyze in full, researchers select a smaller group and sample it repeatedly. Most
of the data that researchers analyze are samples, not populations. For example, a
medical researcher who wants to compare the average weight of all babies born in
North America from 1995 to 2005 with that of babies born in South America in the
same period cannot collect data on the entire population of over a million births
within a reasonable amount of time. Instead, the researcher uses the weights of
100 babies from each continent to draw a conclusion. The 100 babies whose weights
are used form the sample, and the average weight calculated from them is the
sample mean.
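The idea of repeatedly sampling a large population can be sketched in a few lines of Python. This is only an illustration: the population below is synthetic (100,000 simulated birth weights in kg, with an assumed mean of 3.3 and standard deviation of 0.5), and the helper name `sample_means` is hypothetical rather than taken from the paper.

```python
import random
import statistics


def sample_means(population, sample_size, draws, seed=0):
    """Return the means of `draws` random samples taken without replacement."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(draws)
    ]


# Hypothetical population: synthetic birth weights in kg (not real data).
_rng = random.Random(42)
population = [_rng.gauss(3.3, 0.5) for _ in range(100_000)]

# Each draw uses only 100 babies; the collection of means approximates
# the sampling distribution of the sample mean.
means = sample_means(population, sample_size=100, draws=500)
print("population mean:       ", round(statistics.fmean(population), 3))
print("mean of sample means:  ", round(statistics.fmean(means), 3))
print("spread of sample means:", round(statistics.stdev(means), 3))
```

The sample means cluster around the population mean, and their spread is much smaller than the spread of individual weights.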
p∈{0.001,0.003,0.01,0.03,0.10}
n∈{200,500,1000,1500,2000,3000,5000}
k∈{1,3,5,7,10,15,20,25,30,40,50,70,100,200}
Then, the study calculated the estimated prevalence p̂ at each combination of
true prevalence p, total number of samples n, and pooling level k by replicating
the experiment 100,000 times and reporting the 2.5% and 97.5% quantiles of the
distribution of p̂.
With 5000 total samples, there is minimal difference in the central estimates
of p between individual samples and a pooling level of 200 (95% range 0.0022-
0.0021). 145 reactions, or a 34.5-fold reduction in the number of independent
RT-PCR sets, are sufficient to obtain patient-level diagnosis 97.5% of the time.
Figure: Central 95% estimates of p with a test with sensitivity (η) 0.95 and
perfect specificity (θ = 1) under different combinations of the total number of
samples and level of sample pooling. a: p = 0.001; b: p = 0.003; c: p = 0.01;
d: p = 0.03; e: p = 0.10
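The replication procedure described in this section can be sketched as a small simulation. The code below is a minimal sketch under stated assumptions, not the paper's implementation: it assumes each pool of k samples tests positive with probability η(1 − (1 − p)^k), and it uses the estimator p̂ = 1 − (1 − x/(ηm))^(1/k), where x is the number of positive pools out of m. Both function names are hypothetical, and far fewer than 100,000 replicates are used to keep the run short.

```python
import random


def simulate_prevalence_estimates(p, n, k, sensitivity=0.95,
                                  replicates=2000, seed=0):
    """Simulate pooled testing and return one prevalence estimate per replicate."""
    rng = random.Random(seed)
    pools = n // k  # number of pooled tests actually run
    # Assumed model: a pool is positive if it holds at least one infected
    # sample AND the test detects it (perfect specificity).
    q = sensitivity * (1.0 - (1.0 - p) ** k)
    estimates = []
    for _ in range(replicates):
        positives = sum(rng.random() < q for _ in range(pools))
        # Correct the positive fraction for sensitivity, capped at 1.
        frac = min(positives / pools / sensitivity, 1.0)
        estimates.append(1.0 - (1.0 - frac) ** (1.0 / k))
    return estimates


def central_range(values, lower=0.025, upper=0.975):
    """Return the empirical lower and upper quantiles of `values`."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[int(lower * last)], ordered[int(upper * last)]


for k in (10, 50):
    est = simulate_prevalence_estimates(p=0.01, n=5000, k=k, replicates=500)
    lo, hi = central_range(est)
    print(f"k={k}: central 95% range of estimates = ({lo:.4f}, {hi:.4f})")
```

Comparing the printed ranges for different values of k shows how much precision is given up in exchange for running fewer tests.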
4. Additional references of Sampling distribution and estimation
- The use of random sampling in investigations involving child abuse material
Source:
https://www.sciencedirect.com/science/article/pii/S1742287612000369?fbclid=IwAR3NjlExEHsJEQqL1sji6E1nuR8tqQjtYO2p29h5r_2cHcUkdW6CUphe7s8#aep-abstract-id3
- The effect of sampling health facilities on estimates of effective coverage: a
simulation study
Source:
https://ij-healthgeographics.biomedcentral.com/articles/10.1186/s12942-022-00307-2
5. Conclusion
Attempts to estimate the true current prevalence of COVID-19 by PCR tests
can benefit from sample pooling strategies. Such strategies have the potential to
greatly reduce the required number of tests with only slight decreases in the
precision of prevalence estimates.
However, some limitations remain. The strategy outlined here presents
logistical challenges: samples must be randomly allocated to pools, which rules
out some practical approaches such as sampling one sub-district and pooling those
samples, then sampling another district the next day; and binary testing of
sub-pools might be more cumbersome than it is worth. All in all, even though this
sampling method has certain distinct disadvantages, we cannot deny that the
sampling distribution offers various benefits and gives results that are fairly
trustworthy.
Statistics
ratings
N Valid 16094
Missing 0
Mean 3.867
Median 3.900
Mode 4.0
Std. Deviation .4307
Variance .186
Range 2.7
Minimum 2.3
Maximum 5.0
ratings
Rating Frequency Percent Valid Percent Cumulative Percent
2.3 14 .1 .1 .1
2.4 31 .2 .2 .3
2.5 58 .4 .4 .6
2.6 58 .4 .4 1.0
2.7 86 .5 .5 1.5
2.8 96 .6 .6 2.1
2.9 128 .8 .8 2.9
3.0 320 2.0 2.0 4.9
3.1 250 1.6 1.6 6.5
3.2 343 2.1 2.1 8.6
3.3 473 2.9 2.9 11.5
3.4 613 3.8 3.8 15.3
3.5 807 5.0 5.0 20.4
3.6 1042 6.5 6.5 26.8
3.7 1181 7.3 7.3 34.2
3.8 1528 9.5 9.5 43.7
3.9 1694 10.5 10.5 54.2
4.0 1973 12.3 12.3 66.5
4.1 1465 9.1 9.1 75.6
4.2 1262 7.8 7.8 83.4
4.3 874 5.4 5.4 88.8
4.4 566 3.5 3.5 92.3
4.5 505 3.1 3.1 95.5
4.6 352 2.2 2.2 97.7
4.7 194 1.2 1.2 98.9
4.8 110 .7 .7 99.6
4.9 26 .2 .2 99.7
5.0 45 .3 .3 100.0
Total 16094 100.0 100.0
The frequency distribution table for the ratings given by Amazon customers
shows that there are no ratings from 1.0 to 2.2. The rating 4.0 appears most
frequently (1973 times). The frequency generally rises, peaks at 4.0, and then
declines. The ratings 3.8 and 4.1 are also frequent, with 1528 and 1465
occurrences respectively. This can be seen more clearly in the histogram.
The statistics table and histogram show the following. The mean is 3.867,
so the average rating is 3.867. The median is 3.9, so 3.9 is the rating in the
middle of the data series when sorted in ascending or descending order. The mode
is 4.0, so 4.0 is the most frequent rating. The variance is 0.186 and the
standard deviation is 0.4307. The lower these values are, the more stable the data
are and the less they fluctuate around the mean. The range is 2.7, the maximum
value is 5.0, and the minimum value is 2.3.
It can be inferred from the information above that 45.8% of products are
rated 4.0 or higher, since the cumulative percentage up to 3.9 is 54.2%. The
ratings from customers cluster around 4.0, which is quite high. This might be
attributed to the high quality of goods and services on Amazon, which meet the
needs of customers.
From the data on consumer evaluation of men's fashion products, our team has
selected 2 samples as follows:
Statistics
ratings
N Valid 270
Missing 0
Mean 4.421
Median 4.800
Mode 4.8
Std. Deviation .5658
Variance .320
Range 1.9
Minimum 3.1
Maximum 5.0
ratings
Rating Frequency Percent Valid Percent Cumulative Percent
3.1 11 4.1 4.1 4.1
3.2 8 3.0 3.0 7.0
3.6 15 5.6 5.6 12.6
3.7 12 4.4 4.4 17.0
3.9 20 7.4 7.4 24.4
4.0 20 7.4 7.4 31.9
4.3 20 7.4 7.4 39.3
4.5 13 4.8 4.8 44.1
4.8 97 35.9 35.9 80.0
4.9 26 9.6 9.6 89.6
5.0 28 10.4 10.4 100.0
Total 270 100.0 100.0
The statistics table and histogram show the following. The mean is 4.421,
so the average rating is 4.421. The median is 4.8, so 4.8 is the rating in the
middle of the data series when sorted in ascending or descending order. The mode
is 4.8, so 4.8 is the most frequent rating. The variance is 0.320 and the
standard deviation is 0.5658.
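The summary statistics of the first sample can be checked directly from its frequency table. The sketch below copies the rating counts of the 270-observation sample and recomputes the values with Python's standard `statistics` module; the function name `describe` is hypothetical, and the variance and standard deviation use the sample (n − 1) formulas.

```python
import statistics

# Frequency table of the 270-observation sample (rating -> count).
sample_1 = {3.1: 11, 3.2: 8, 3.6: 15, 3.7: 12, 3.9: 20, 4.0: 20,
            4.3: 20, 4.5: 13, 4.8: 97, 4.9: 26, 5.0: 28}


def describe(freq):
    """Expand a frequency table into raw values and summarize them."""
    values = [rating for rating, count in sorted(freq.items())
              for _ in range(count)]
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "std_dev": statistics.stdev(values),      # sample standard deviation
        "range": max(values) - min(values),
    }


stats = describe(sample_1)
for name, value in stats.items():
    print(f"{name:>8}: {value:.4f}")
```

Rounded to the precision of the statistics table, the output agrees with the reported mean, median, mode, variance, standard deviation, and range.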
Statistics
ratings
N Valid 1097
Missing 0
Mean 3.687
Median 3.800
Mode 4.3
Std. Deviation .6191
Variance .383
Range 2.7
Minimum 2.3
Maximum 5.0
ratings
Rating Frequency Percent Valid Percent Cumulative Percent
2.3 6 .5 .5 .5
2.4 10 .9 .9 1.5
2.5 23 2.1 2.1 3.6
2.6 33 3.0 3.0 6.6
2.7 45 4.1 4.1 10.7
2.8 46 4.2 4.2 14.9
2.9 5 .5 .5 15.3
3.0 24 2.2 2.2 17.5
3.2 59 5.4 5.4 22.9
3.3 110 10.0 10.0 32.9
3.5 41 3.7 3.7 36.6
3.6 115 10.5 10.5 47.1
3.8 141 12.9 12.9 60.0
3.9 11 1.0 1.0 61.0
4.0 115 10.5 10.5 71.5
4.2 53 4.8 4.8 76.3
4.3 189 17.2 17.2 93.5
4.7 38 3.5 3.5 97.0
4.9 4 .4 .4 97.4
5.0 29 2.6 2.6 100.0
Total 1097 100.0 100.0
The statistics table and histogram show the following. The mean is 3.687,
so the average rating is 3.687. The median is 3.8, so 3.8 is the rating in the
middle of the data series when sorted in ascending or descending order. The mode
is 4.3, so 4.3 is the most frequent rating. The variance is 0.383 and the
standard deviation is 0.6191.
In conclusion, the ratings that buyers give men's fashion products on
Amazon differ slightly between the two samples, whose means are 4.421 (n = 270)
and 3.687 (n = 1097). As the sample size increases, the sampling distribution of
the mean more closely approximates the normal distribution and becomes more
tightly clustered around the mean of the buyers' ratings of men's fashion, just
as the central limit theorem states.
We can notice that the histograms for all three statistics (sample mean, sample
median, and sample standard deviation) become more symmetric and bell-shaped,
and also narrower, particularly those for the sample mean. Also, notice that the
estimated standard deviation of the sample mean not only decreases as sample size
increases but is also approximately the same for samples of the same size.
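The narrowing of the sample mean's distribution can be illustrated with a short simulation. This sketch assumes, purely for illustration, a normally distributed population with the mean (3.867) and standard deviation (0.4307) reported for the full rating data; the real ratings are bounded and discrete, so the normal model is a simplification. The function name `spread_of_sample_means` is hypothetical.

```python
import math
import random
import statistics


def spread_of_sample_means(mu, sigma, sample_size, replicates=1000, seed=1):
    """Estimate the standard deviation of the sample mean by simulation."""
    rng = random.Random(seed)
    means = [
        statistics.fmean([rng.gauss(mu, sigma) for _ in range(sample_size)])
        for _ in range(replicates)
    ]
    return statistics.stdev(means)


# Compare the simulated spread with the theoretical value sigma / sqrt(n).
for n in (25, 100, 400):
    empirical = spread_of_sample_means(3.867, 0.4307, n)
    theoretical = 0.4307 / math.sqrt(n)
    print(f"n={n:>3}: simulated {empirical:.4f}, sigma/sqrt(n) {theoretical:.4f}")
```

Each fourfold increase in sample size roughly halves the spread of the sample mean, in line with σ/√n.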
3. Estimation
Using a single sample and the Z table, we can draw conclusions about a
variable's population mean. For instance, we might conclude with 95% confidence
that the mean height of men lies between 174.08 cm and 181.22 cm.
Problem 1: A sample of 270 observations is taken from a normal population of
the rating data. The sample mean is 4.421, and the sample standard deviation is
0.5658.
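a) Determine the 90% confidence interval for the population mean.
Solution
Let x̄ be the sample mean, σ the standard deviation (estimated here by the
sample value 0.5658), and n = 270 the sample size.
We have:
$$P\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 100(1-\alpha)\%$$
Interpretation:
If the experiment were carried out multiple times, 90% of the intervals created in
this way would contain µ.
b) Determine the 95% confidence interval for the population mean.
Interpretation:
If the experiment were carried out multiple times, 95% of the intervals created in
this way would contain µ.
Lower Confidence Limit: 4.354 Upper Confidence Limit: 4.488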
c) Determine the 99% confidence interval for the population mean.
Solution
$$P\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 100(1-\alpha)\%$$
Interpretation:
If the experiment were carried out multiple times, 99% of the intervals created in
this way would contain µ.
Lower Confidence Limit: 4.329 Upper Confidence Limit: 4.513
By increasing the confidence level, we can be more certain that the interval
contains the population mean. The lower confidence limit decreases and the upper
confidence limit increases, so the 99% confidence interval is wider than the 90%
and 95% intervals. Therefore, as we increase the confidence level, the width of
the interval increases as well. A higher confidence level comes at the cost of a
less precise interval.
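The z-based interval formula from Problem 1 can be evaluated with Python's standard library. This is a sketch under the assumption that the sample standard deviation stands in for σ; the function name `z_confidence_interval` is hypothetical, and the critical values come from `statistics.NormalDist` rather than a printed Z table, so the limits can differ in the third decimal place from hand calculations or from limits reported by statistical software.

```python
import math
from statistics import NormalDist


def z_confidence_interval(mean, sd, n, confidence):
    """Return (lower, upper) limits of a two-sided z confidence interval."""
    # Critical value z_{alpha/2} of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sd / math.sqrt(n)
    return mean - margin, mean + margin


# Sample values from Problem 1: mean 4.421, SD 0.5658, n = 270.
for level in (0.90, 0.95, 0.99):
    lo, hi = z_confidence_interval(4.421, 0.5658, 270, level)
    print(f"{level:.0%}: ({lo:.3f}, {hi:.3f}), width {hi - lo:.3f}")
```

The printed widths grow with the confidence level, which is the trade-off between confidence and precision.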
Problem 2: From the rating data, the standard deviation of buyers' ratings is
0.4307. How many observations should we take to ensure that the sample mean we
obtain is no more than 0.2 from the population mean with
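a) 90% confidence
Let $Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$ with $\sigma = 0.4307$.
We have:
$$P\left(|Z| < \frac{0.2\sqrt{n}}{0.4307}\right) = 0.90$$
$$P\left(-\frac{0.2\sqrt{n}}{0.4307} < Z < \frac{0.2\sqrt{n}}{0.4307}\right) = 0.90$$
$$P\left(Z < \frac{0.2\sqrt{n}}{0.4307}\right) = 0.95$$
$$\frac{0.2\sqrt{n}}{0.4307} = 1.64$$
$$n \approx 12.47$$
A minimum sample size of 13 is needed to estimate the population mean to
within 0.2 with 90% confidence.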
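b) 95% confidence
Let $Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$ with $\sigma = 0.4307$.
We have:
$$P\left(|Z| < \frac{0.2\sqrt{n}}{0.4307}\right) = 0.95$$
$$P\left(-\frac{0.2\sqrt{n}}{0.4307} < Z < \frac{0.2\sqrt{n}}{0.4307}\right) = 0.95$$
$$P\left(Z < \frac{0.2\sqrt{n}}{0.4307}\right) = 0.975$$
$$\frac{0.2\sqrt{n}}{0.4307} = 1.96$$
$$n \approx 17.82$$
A minimum sample size of 18 is needed to estimate the population mean to
within 0.2 with 95% confidence.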
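c) 99% confidence
We have:
$$P\left(|Z| < \frac{0.2\sqrt{n}}{0.4307}\right) = 0.99$$
$$P\left(-\frac{0.2\sqrt{n}}{0.4307} < Z < \frac{0.2\sqrt{n}}{0.4307}\right) = 0.99$$
$$P\left(Z < \frac{0.2\sqrt{n}}{0.4307}\right) = 0.995$$
$$\frac{0.2\sqrt{n}}{0.4307} = 2.57$$
$$n \approx 30.63$$
A minimum sample size of 31 is needed to estimate the population mean to
within 0.2 with 99% confidence.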
For a fixed margin of error, a higher confidence level requires a greater
number of observations in the selected sample.
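The sample size calculation in Problem 2 amounts to solving z·σ/√n ≤ E for n, which gives n ≥ (z·σ/E)², rounded up to the next whole observation. The sketch below applies that rule; the function name `required_sample_size` is hypothetical, and the critical values come from `statistics.NormalDist` instead of a Z table, so the unrounded n differs slightly from the hand calculation while the rounded minimum sizes (13, 18, and 31) are the same.

```python
import math
from statistics import NormalDist


def required_sample_size(sigma, margin, confidence):
    """Smallest n such that z * sigma / sqrt(n) <= margin."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / margin) ** 2)


# Values from Problem 2: sigma = 0.4307, margin of error E = 0.2.
for level in (0.90, 0.95, 0.99):
    n = required_sample_size(0.4307, 0.2, level)
    print(f"{level:.0%} confidence: at least {n} observations")
```

Because n scales with 1/E², halving the margin of error would roughly quadruple the required sample size.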
Thus, from the collected data, our team used the method of sampling
distribution and estimation to analyze several related issues. This method is
beneficial because it makes it possible to estimate the parameters of the
population more accurately and to make decisions based on these estimates. In
addition, it helps us test assumptions about the distribution of the population,
which is useful for large-scale studies.