
On the Reliability of Confidence Interval Estimates in both Normal and Non-normal Data: A Simulation Study

A Thesis
Presented to the
Faculty of the Department of Mathematics
College of Arts and Sciences
Caraga State University
Butuan City

In Partial Fulfillment
of the Requirements for the Degree
Bachelor of Science in Mathematics
(Applied Statistics)

Marchan A. Saga
February 2016

ABSTRACT

As in any estimation problem, researchers are interested in the reliability of their estimates. In theory, confidence interval estimates are expressed as functions of tabular values of either the t distribution or the Z distribution. These distributions in turn rest on the assumption that the data are normal, so the reliability of the estimates depends on this requirement. Analyses commonly go wrong when this important assumption is ignored. When this happens, the uncertainty of inferences about the population parameter becomes higher than what is nominally defined (the level of significance), and decision making is misled.
This study investigates the reliability of confidence interval estimates via a computer program (C programming language) for both normal and non-normal data taken from normal and non-normal populations, respectively, via Simple Random Sampling without Replacement (SRSWOR). Confidence interval reliability is also observed as the sample size and the level of significance vary.
Results showed that when the normality assumption is ignored, confidence interval estimates become less reliable. This worsens as the specified level of significance gets smaller. However, increasing the sample size yields a gain in reliability.

TABLE OF CONTENTS

Title Page
Approval Sheet
Abstract
Acknowledgement
Table of Contents

I. Introduction
   1.1 Background of the Study
   1.2 Objectives of the Study
   1.3 Significance of the Study
II. Basic Concepts And Preliminaries
   2.1 Basic Concepts
   2.2 Preliminaries
III. Methodology
   3.1 Algorithm
   3.2 Flowchart
IV. Results And Discussions
   4.1 Results
V. Summary, Conclusions And Recommendations
   5.1 Summary And Conclusion
   5.2 Recommendations
Bibliography
Annexes

CHAPTER 1
Introduction

"When statistics are not based on strictly accurate and precise calculations, they mislead instead of guide." - Alexis de Tocqueville, Democracy in America
1.1 Background of the Study


Across disciplines, many researchers must contend with the cost of data collection, the time needed to complete a project, and even limited human resources. This problem has been addressed through the development of sampling and survey theory, in which estimation is an inevitable element. Estimation has a big role in the field of statistics: one of the major applications of statistics is estimating population parameters from sample statistics. In conducting research, there is nothing wrong with covering all the respondents; however, as mentioned, doing so is costly and time consuming. Estimation resolves these difficulties. As an important part of statistics, estimation theory aims to find an accurate and efficient estimate of a given parameter of interest. These estimates can be in point or interval form.

A point estimate is a single number (value) computed from a random sample that represents a plausible value of the parameter. It pinpoints a location or a point in the distribution of possible values of the random variable [8]. On the other hand, an interval estimate is a range of values computed from a random sample that represents an interval of plausible values for the unknown value of the population parameter. When some measure of certainty or confidence is attached to the interval estimate, the interval is called a confidence interval estimate. The confidence interval estimate is a fundamental and widely used technique in statistical inference [13]. The interpretation of a confidence interval derives from the sampling process that generates the sample from which the confidence interval is calculated. It provides a measure of how confident the researcher is in stating that the interval estimate obtained from the random sample contains the true value of the parameter. Equivalently, if a 95% confidence interval is constructed, then, in the long run, 95% of the intervals constructed in a similar manner will contain the true value of the parameter [8].

As in any estimation problem, researchers are interested in the reliability of their estimates. In theory, confidence interval estimates are expressed as functions of tabular values of either the t distribution or the Z distribution. These distributions in turn rest on the assumption that the data are normal. Thus, the reliability of the estimates depends on the necessary requirements or assumptions. Analyses commonly go wrong when these important assumptions are ignored. When this happens, the uncertainty of inferences about the population parameter becomes higher than what is nominally defined (the level of significance), and decision making is misled.
1.2 Objectives of the Study

This study investigates the reliability of confidence interval estimates via a computer program (C programming language) for both normal and non-normal data taken from normal and non-normal populations, respectively, via Simple Random Sampling without Replacement (SRSWOR). Confidence interval reliability is also observed as the sample size and the level of significance vary.
1.3 Significance of the Study

Research is not new in our society. In fact, it plays a big role in our society; however, several researchers neglect some of the statistical assumptions. This study shows the significance of satisfying the necessary statistical assumptions.

CHAPTER 2
Basic Concepts And Preliminaries

This section presents the basic concepts and terminologies that will be used
in the next chapter. Definitions and theorems are taken from [12], [10] and
[8].
2.1 Basic Concepts

Definition 2.1.1 A variable is a characteristic or property of an individual population unit.

Definition 2.1.2 A universe is a collection or set of all individuals or entities whose characteristics are to be studied.

Definition 2.1.3 A population is the set of all possible values of the variable.

Let us consider the illustration given below.

[Illustration: a universe and the population of values of a variable Y taken from it]

That is, the population is now expressed as a collection of values of a variable Y taken from a universe.
Furthermore, there are two types of population, namely, finite and infinite. A population is finite when its elements can be counted for a given time period. On the other hand, it is infinite when the number of its elements is unlimited.
Definition 2.1.4 Any numerical value describing a characteristic of a population is called a parameter.
Definition 2.1.5 Let $X_1, X_2, \ldots, X_N$ be a set of data representing a finite population of size $N$. Then the population mean is
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}.$$
The population mean describes a characteristic of the population; thus, it is a parameter.
Example 2.1.6 The number of employees at 5 different drugstores are 3, 5, 6, 4 and 6. Treating the data as a population, find the mean number of employees for the 5 stores.
Solution. Since the data are considered to be a finite population,
$$\mu = \frac{3 + 5 + 6 + 4 + 6}{5} = 4.8.$$

Definition 2.1.7 Let $X_1, X_2, \ldots, X_N$ be a finite population. The population variance is
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}.$$

Definition 2.1.8 A sample is a subset of the units of a population.

Definition 2.1.9 Any numerical value describing a characteristic of a sample is called a statistic.
Definition 2.1.10 Let $X_1, X_2, \ldots, X_n$ be a set of data representing a finite sample of size $n$. Then the sample mean is
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}.$$
The sample mean $\bar{X}$ describes a characteristic of a sample; thus, $\bar{X}$ is a statistic.

Example 2.1.11 A food inspector examined a random sample of 7 cans of a certain brand of tuna to determine the percent of foreign impurities. The following data were recorded: 1.8, 2.1, 1.7, 1.6, 0.9, 2.7 and 1.8. Compute the sample mean.
Solution.
$$\bar{X} = \frac{1.8 + 2.1 + 1.7 + 1.6 + 0.9 + 2.7 + 1.8}{7} = \frac{12.6}{7} = 1.8.$$
Definition 2.1.12 Let $X_1, X_2, \ldots, X_n$ be a random sample. The sample variance is
$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}.$$

Example 2.1.13 Suppose set A is a random sample taken from population B, and set A consists of -5, -4, -3, -2, 0, 1, 2, 4, 7. Compute the sample variance.
Solution. First, compute the sample mean,
$$\bar{X} = \frac{-5 - 4 - 3 - 2 + 0 + 1 + 2 + 4 + 7}{9} = 0.$$
Then the variance is
$$s^2 = \frac{\sum_{i=1}^{9} (X_i - 0)^2}{9 - 1} = \frac{(-5)^2 + (-4)^2 + \cdots + (4)^2 + (7)^2}{8} = \frac{124}{8} = 15.5.$$
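As a minimal sketch, the sample mean and sample variance of Definitions 2.1.10 and 2.1.12 could be computed in C as follows. The function names `sample_mean` and `sample_variance` are illustrative assumptions and are not taken from the program in the Annexes.

```c
#include <assert.h>
#include <stddef.h>

/* Sample mean: sum of the n observations divided by n. */
double sample_mean(const double *x, size_t n)
{
    assert(n >= 1);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum / (double)n;
}

/* Sample variance with the n - 1 divisor (Definition 2.1.12). */
double sample_variance(const double *x, size_t n)
{
    assert(n >= 2);
    double xbar = sample_mean(x, n);
    double ss = 0.0;                 /* sum of squared deviations */
    for (size_t i = 0; i < n; i++) {
        double d = x[i] - xbar;
        ss += d * d;
    }
    return ss / (double)(n - 1);
}
```

Applied to the nine values of set A in Example 2.1.13, these functions give a mean of 0 and a variance of 124/8 = 15.5.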

Definition 2.1.14 Sampling is the process of selecting a sample from a universe or a population.

Definition 2.1.15 Probability sampling is sampling where samples are obtained using some objective chance mechanism, thus involving randomization. It requires the use of a complete listing of the elements of the universe, called the sampling frame. Simple random sampling is one example of probability sampling, and it has two types. Simple random sampling without replacement (SRSWOR) does not allow repetitions of selected units in the sample. On the other hand, simple random sampling with replacement (SRSWR) allows repetitions of selected units in the sample.

Definition 2.1.16 Non-probability sampling is sampling where samples are obtained haphazardly, selected purposively, or taken as volunteers. The probabilities of selection are unknown.

Definition 2.1.17 Estimation is concerned with finding a value or range of values for an unknown parameter.

Definition 2.1.18 A point estimator of a population parameter is a rule or formula that tells us how to use sample data to calculate a single number that can be used as an estimate of the population parameter.

Definition 2.1.19 An interval estimator is a formula that tells us how to use sample data to calculate an interval that estimates a population parameter.

Definition 2.1.20 The rejection region is the set of numerical values of the test statistic for which the null hypothesis will be rejected. The probability $\alpha$ that the test statistic falls in this region when the null hypothesis is true is usually chosen to be small (e.g., 0.01, 0.05, 0.10) and is referred to as the level of significance of the test.

Definition 2.1.21 A $(1 - \alpha)100\%$ confidence interval is a range of numbers believed to include an unknown population parameter. Associated with the interval is a measure of the confidence we have that the interval does indeed contain the parameter of interest.

Example 2.1.22 The contents of 7 similar containers of sulfuric acid are 9.8, 10.2, 10.4, 9.8, 10.0, 10.2, and 9.6 liters. Find the 95% confidence interval for the mean content of all such containers, assuming an approximately normal distribution for container contents.
Solution. The sample mean and standard deviation for the given data are $\bar{X} = 10.0$ and $s = 0.283$. Using the t table, we find $t_{0.025} = 2.447$ for $v = 6$ degrees of freedom. Hence the 95% confidence interval for $\mu$ is
$$10.0 - (2.447)(0.283/\sqrt{7}) < \mu < 10.0 + (2.447)(0.283/\sqrt{7}),$$
which reduces to
$$9.74 < \mu < 10.26.$$

2.2 Preliminaries

The next theorem is stated in [12].

Theorem 2.2.1 If $\bar{X}$ and $s^2$ are the mean and variance, respectively, of a random sample of size $n$ taken from a population that is normally distributed with mean $\mu$ and variance $\sigma^2$, then
$$t = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$
is a value of a random variable $T$ having the t distribution with $v = n - 1$ degrees of freedom.

Theorem 2.2.2 If all possible random samples of size $n$ are drawn, without replacement, from a finite population of size $N$ with mean $\mu$ and standard deviation $\sigma$, then the sampling distribution of the sample mean $\bar{X}$ will be approximately normally distributed with a mean and standard deviation given by
$$\mu_{\bar{X}} = \mu, \qquad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}.$$
Theorem 2.2.3 Central Limit Theorem. If random samples of size $n$ are drawn from a large or infinite population with mean $\mu$ and variance $\sigma^2$, then the sampling distribution of the sample mean $\bar{X}$ is approximately normally distributed with mean $\mu_{\bar{X}} = \mu$ and standard deviation $\sigma_{\bar{X}} = \sigma/\sqrt{n}$. Hence,
$$z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$
is a value of a standard normal variable $Z$.


Theorem 2.2.4 Sample Size for Estimating $\mu$. If $\bar{X}$ is used as an estimate of $\mu$, we can be $(1 - \alpha)100\%$ confident that the error will not exceed a specified amount $e$ when the sample size is
$$n = \left( \frac{z_{\alpha/2}\,\sigma}{e} \right)^2.$$

Theorem 2.2.5 If $\bar{X}$ and $s^2$ are from a normal population, then
$$P\left[\bar{X} - t_{\alpha/2,\,n} \cdot s < \mu < \bar{X} + t_{\alpha/2,\,n} \cdot s\right] \geq 1 - \alpha,$$
where $\alpha$ is the level of significance.

Theorem 2.2.5 is the core of this study. The researcher will validate whether the theorem really holds given the assumptions above. Moreover, $\bar{X}$ and $s^2$ from non-normal data will also be part of the validation.

CHAPTER 3
Methodology

In this chapter, the methodology by which the researcher obtained the results is presented. The researcher generated normal data using the R programming language (free software) and obtained non-normal data of size 20 from the examination results of students. The researcher used the C programming language (free software) to exhaust all combinations of sample sizes 5, 10 and 15 from the population of size 20.
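To give a sense of how many samples "all combinations" involves, the number of distinct samples of size $n$ from $N = 20$ units can be counted with a short C function. The name `n_choose_k` is an illustrative assumption, not a routine from the thesis program.

```c
#include <assert.h>

/* Number of ways to choose k units out of n without regard to
 * order. The running product stays an exact integer because after
 * step i it equals the binomial coefficient C(n - k + i, i). */
unsigned long long n_choose_k(unsigned n, unsigned k)
{
    assert(k <= n);
    if (k > n - k)
        k = n - k;                   /* use symmetry C(n,k) = C(n,n-k) */
    unsigned long long c = 1;
    for (unsigned i = 1; i <= k; i++)
        c = c * (n - k + i) / i;
    return c;
}
```

For the population of size 20 this gives 15504 samples of size 5, 184756 samples of size 10, and 15504 samples of size 15.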

3.1 Algorithm

1st Step: Set the level of significance $\alpha$, the sample size $n$, the integer $x$, the population mean $\mu$ and $_N C_n$.

2nd Step: Take a random sample of size $n$ from a population of size $N$.

3rd Step: Compute the $(1 - \alpha)100\%$ confidence interval for the population mean $\mu$,
$$\left( \bar{X} - t_{\alpha/2,\,(n-1)} \frac{s}{\sqrt{n}} \sqrt{\frac{N - n}{N}},\;\; \bar{X} + t_{\alpha/2,\,(n-1)} \frac{s}{\sqrt{n}} \sqrt{\frac{N - n}{N}} \right).$$

4th Step: Assign $x = 1$ if $\mu$ is in the interval and $x = 0$ otherwise.

5th Step: Repeat steps 2 to 4 until all $_N C_n$ samples of size $n$ are exhausted.

Then the following proportion gives the percent reliability:
$$\frac{\sum x}{_N C_n}.$$
3.2 Flowchart

CHAPTER 4
Results And Discussions

4.1 Results

Figure 4.1: Histogram of the Non-normal Data.

The figure above shows the histogram of the non-normal data taken from the examination results of the students. Observe that the histogram is skewed to the left; from that observation, it is visually evident that the set of data is not normally distributed.

Figure 4.2: Histogram of the Normal Data.

The figure above shows the histogram of the normal data generated in the R programming language (free software). Observe that the histogram of the data somewhat resembles a bell curve; that is, it is visually evident that the data are normally distributed.

Table 1: Simulation result on the proportion of $(1 - \alpha)100\%$ confidence intervals containing the population parameter across normal and non-normal data, varying levels of significance and increasing sample sizes.

Table 1 above shows the main result of the study. For $\alpha = 0.10$, theoretically it is expected that at least 90% of the constructed intervals will contain the population mean $\mu$. This is indeed true for data taken from a normal population: the proportions are 90.33%, 90.47% and 90.50% for sample sizes 5, 10 and 15, respectively. Note that there is a relatively small increase in reliability as the sample size increases. However, this coverage is not observed when data are taken from a non-normal population. As shown, the observed proportions are 83.29% for sample size 5, 87.19% for sample size 10 and 89.99% for sample size 15. This result reveals that when the normality assumption is ignored, the error in estimating a desired parameter becomes relatively higher than what is predefined by the researcher. Notably, increasing the sample size still serves as a remedy when the normality assumption is not met, which is consistent with the commonly known central limit theorem.
At $\alpha = 0.05$, an almost similar trend is observed. As usual, it is expected that in the long run, 95% of the intervals under a given sampling design will contain the parameter of interest, the population mean. Under the normality assumption, the proportions are 95.37%, 95.35% and 95.68% for sample sizes 5, 10 and 15, respectively. However, samples from the non-normal population still generated fewer intervals that contained the declared parameter. The observed proportions are 89.86%, 91.36% and 93.64% for sample sizes 5, 10 and 15, respectively, which are smaller than the theoretically expected value (95%). Increasing the sample size contributed a relative increase in reliability even while the normality assumption was ignored.
Lastly, at the stricter level of significance $\alpha = 0.01$, the same pattern was observed. When the normality assumption is violated, the estimation process becomes worse (less accurate and less precise). Under this unsatisfied assumption, the proportions of intervals that contain the population parameter are 95.98%, 95.94% and 97.24% for sample sizes 5, 10 and 15, respectively. These values are unfortunately lower than what is expected (99%). The idea of increasing the sample size as a remedy for increasing the reliability of estimates for non-normal data still holds.
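The shortfall of the non-normal coverage from the nominal level can be tabulated directly from the proportions reported above (nominal minus observed, in percentage points). This is only a restatement of the reported figures, taking the sample sizes to be 5, 10 and 15:

```latex
\begin{array}{c|ccc}
\alpha & n = 5 & n = 10 & n = 15 \\ \hline
0.10 & 90 - 83.29 = 6.71 & 90 - 87.19 = 2.81 & 90 - 89.99 = 0.01 \\
0.05 & 95 - 89.86 = 5.14 & 95 - 91.36 = 3.64 & 95 - 93.64 = 1.36 \\
0.01 & 99 - 95.98 = 3.02 & 99 - 95.94 = 3.06 & 99 - 97.24 = 1.76
\end{array}
```

In every row the shortfall at $n = 15$ is smaller than at $n = 5$.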

CHAPTER 5
Summary, Conclusion and Recommendation

5.1 Summary And Conclusion

Under the assumption of normality, the $(1 - \alpha)100\%$ confidence interval indeed attained its predefined degree of statistical reliability. The result is consistent, as revealed in a computer simulation anchored on the Simple Random Sampling Without Replacement scheme. This consistency holds when the simulation is run across levels of significance (0.10, 0.05 and 0.01) and varying sample sizes (in increasing order). On the other hand, for non-normal data, the simulation showed that the intervals did not reach the expected degree of reliability. For instance, at $\alpha = 0.05$, for all of the sample sizes considered, the proportions of interval estimates that contain the population mean are lower than the expected 95%. Furthermore, it was also found that increasing the sample size is a good remedy when the normality assumption is not met. Based on the results presented, it is concluded that to attain a higher degree of statistical reliability, researchers should first satisfy the necessary statistical assumptions, which in this study is the assumption of a normal distribution.

5.2 Recommendations

This study focuses only on interval estimates of the population mean. Thus, for the interest of future researchers, the following recommendations are made:
Estimating the interval estimates for the population variance in both normal and non-normal data.
Estimating the interval estimates for the population proportion in both normal and non-normal data.
Comparing the power of the test in hypothesizing a parameter value using the one-sample t-test in both normal and non-normal data.
Comparing the power of the test in testing mean differences using the independent-samples t-test in both normal and non-normal data.


Bibliography

[1] K. Kenly, 2005. The Effects of Non-Normal Distribution on Confidence Intervals Around the Standardized Mean Difference: Bootstrap and Parametric Confidence Intervals, Sage Publications.
[2] S. Gali, 2015. On Importance of Normality Assumption in Using a T-Test: One Sample and Two Sample Cases, Chennai, India.
[3] R. Bender, G. Berg, and H. Zeeb, 2005. Using Confidence Curves in Medical Research, Biometrical Journal 47, pp. 237-247.
[4] A. Attia, 2005. Why should researchers report the confidence interval in modern research?, Middle East Fertility Society Journal, Vol. 10, No. 1.
[5] Statistics Solutions, 2016. Normality, http://www.statisticssolutions.com/academic-solutions/resources/directory-of-statistical-analyses/normality.
[6] J. Sim and N. Reid, 1999. Statistical inference by confidence intervals: issues of interpretation and utilization, Physical Therapy, Vol. 79, No. 2, pp. 186-195.
[7] A.M. Mood, 1913. Introduction to the Theory of Statistics, Third Ed. McGraw-Hill, Inc.
[8] Institute of Statistics, 2014. Workbook in Statistics 1, Tenth Ed. University of the Philippines, Los Baños.
[9] J. Sauro and J.R. Lewis, 2005. Estimating Completion Rates From Small Samples Using Binomial Confidence Intervals: Comparisons And Recommendations, Proceedings of the Human Factors And Ergonomics Society 49th Annual Meeting.
[10] J.T. McClave, F.H. Dietrich and T. Sincich, 1997. Statistics, Seventh Ed. Prentice Hall.
[11] A.D. Aczel, 1995. Statistics: Concepts and Applications, Richard D. Irwin, Inc.
[12] R.E. Walpole, 1982. Introduction to Statistics, Macmillan Publishing Company, New York.
[13] D. Gilliland and V. Melfi, 2010. A Note on Confidence Interval Estimation and Margin of Error, Journal of Statistics Education, Volume 18, No. 1.
[14] J. Orloff and J. Bloom, 2014. Confidence Intervals for the Mean of Non-normal Data, Class 23, 18.05, Spring.

Annexes
CODE:

Actual Results for Normal Data at
$\alpha = 0.1$: 20 taken 5, 20 taken 10, 20 taken 15
$\alpha = 0.05$: 20 taken 5, 20 taken 10, 20 taken 15
$\alpha = 0.01$: 20 taken 5, 20 taken 10, 20 taken 15

Actual Results for Non-Normal Data at
$\alpha = 0.1$: 20 taken 5, 20 taken 10, 20 taken 15
$\alpha = 0.05$: 20 taken 5, 20 taken 10, 20 taken 15
$\alpha = 0.01$: 20 taken 5, 20 taken 10, 20 taken 15
