DRAFT DISCUSSION DOCUMENT

Dr. Kirkendall Sample Size Recommendation

Sample Sizes

We need to select a sample size that we think provides the best compromise in terms of confidence and cost. We can make reasonable statements about the probability of missing messages being less than .1 with a sample between 25 and 35. To be able to state that the probability that p is less than .05 is .95 we would need a sample of about 60.

If n=15 and we do not find any missing messages then the probability that p (probability of missing messages in a good day) is less than .1 is .79 and the probability that p is less than .05 is .54.

If n=25 and we do not find any missing messages then the probability that p is less than .1 is .93 and the probability that p is less than .05 is .72.

If n=35 and we do not find any missing messages then the probability that p is less than .1 is .975 and the probability that p is less than .05 is .83.

If n=60 and we do not find any missing messages then the probability that p is less than .1 is .998 and the probability that p is less than .05 is .954.

Confidence Statements

If the sample size is 25 and we do not find any missing messages in the sampled days, then we can be 93% confident that the probability of missing messages among the good days is less than .1; and we can be 72% confident that the probability of missing messages among the good days is less than .05.

If n=25 and we do not find any missing messages then the probability that p is less than .1 is .93 and the probability that p is less than .05 is .72.

If the sample size is 35 and we do not find any missing messages in the sampled days, then we can be 97.5% confident that the probability of missing messages among the good days is less than .1; and we can be 83% confident that the probability of missing messages among the good days is less than .05.

If n=35 and we do not find any missing messages then the probability that p is less than .1 is .975 and the probability that p is less than .05 is .83.

If the sample size is 60 and we do not find any missing messages among the sampled days, then we can be 99.8% confident that the probability of missing messages among the good days is less than .1; and we can be 95.4% confident that the probability of missing messages among the good days is less than .05.

If n=60 and we do not find any missing messages then the probability that p is less than .1 is .998 and the probability that p is less than .05 is .954.

GEORGE W. BUSH PRESIDENTIAL RECORD. OAP00005353.

Determination of Good Days

Good Days are days that OA believes have no missing email messages, based on ARIMA analysis and expert judgment. A random sample with replacement will be taken from these days and restored from the DR tapes. Results will be used to determine whether any missing email messages are found in the sampled days. If no missing email messages are found in any sampled day, the Good Days will be assumed to be complete.

Let p represent the probability that a file from the DR tapes contains new messages, that is, messages that are not already in Phase II results. We would like to have p=0 for the Good Days, but the only way to be completely sure is to evaluate data files, and/or restore files, for all days. Instead we propose a lower cost approach that uses sampling to help us assure that p is very low. A random sample with replacement of Good Days will be taken and assessed for presence of new messages.
If the DR tapes for the sample of days contain only messages that are already included in Phase II results, this will be evidence that there are no missing messages among the Good Days.

Statistical Foundations

We want information about p, the proportion of the population (days) that have some feature (missing messages). If we use a simple random sample of size n selected with replacement, the number of sampled units that have the feature is known to follow a binomial distribution with parameters n and p.

We can use the binomial distribution and a prior distribution under which the probability p could be anywhere in the interval from 0 to 1 (a uniform distribution) to find the posterior distribution of p given the sampled results. The posterior distribution can be viewed as the evidence in the sample for the values p might take.

The posterior distribution of p is a beta distribution with parameters x+1 and n-x+1. (x is the number of days in the sample for which missing messages are found.) We are interested in the situation when x=0, so the parameters are 1 and n. The mean of this distribution is 1/(n+1), and the variance is n/((n+1)^2 (n+2)).

The following confidence statements (probabilities) came from a beta distribution with parameters 1 and n, available through a statistical software package called Statistix.

GEORGE W. BUSH PRESIDENTIAL RECORD. OAP00005354
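
The quoted probabilities can be checked without a statistics package. For a beta distribution with first parameter 1 and second parameter b, the probability that p is less than c is 1 - (1 - c)^b. The sketch below is an illustrative check, not the original Statistix calculation; the function names are hypothetical. The figures quoted in this document appear to match b = n. A uniform prior with x=0 would strictly give b = n+1, which yields slightly larger probabilities.

```python
import math


def prob_p_below(c, b):
    """P(p < c) when p follows a Beta(1, b) distribution: 1 - (1 - c)**b."""
    return 1.0 - (1.0 - c) ** b


def smallest_b(c, target):
    """Smallest integer b such that prob_p_below(c, b) >= target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - c))


# Assumption: b = n, which reproduces the figures quoted in the document.
for n in (15, 25, 35, 60):
    print(n, round(prob_p_below(0.10, n), 3), round(prob_p_below(0.05, n), 3))

# Sample size needed to state that p < .05 with probability .95.
print(smallest_b(0.05, 0.95))
```

Under the b = n assumption the last line gives 59, which agrees with the statement that a sample of about 60 is needed.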
