Article Sampling
Original Article
ABSTRACT
Introduction: The purpose of this article is to provide a general understanding of the concepts of sampling as applied to health-
related research.
Sample Size Estimation: It is important to select a representative sample in quantitative research in order to be able to generalize
the results to the target population. The sample should be of the required sample size and must be selected using an appropriate
probability sampling technique. There are many hidden biases which can adversely affect the outcome of the study. Important factors
to consider for estimating the sample size include the size of the study population, confidence level, expected proportion of the
outcome variable (for categorical variables)/standard deviation of the outcome variable (for numerical variables), and the required
precision (margin of accuracy) from the study. The greater the precision required, the larger the required sample size.
Sampling Techniques: The probability sampling techniques applied for health-related research include simple random sampling,
systematic random sampling, stratified random sampling, cluster sampling, and multistage sampling. These are preferred over
nonprobability sampling techniques because they allow the results of the study to be generalized to the target population.
Table 1: Sampling framework – from target population to the sample
Sampling unit | Description
Target population | This is a larger population – the results from a representative sample can be generalized to this level
Study population | This is the accessible part of the target population from where the sample is selected
Sampling frame | This is a list of all the members in the study population (may not be available in all cases)
Sample | Members of the study population who are selected for the study – should be representative of the study and target population

patients depending on the type of health care facility, that is, Ministry of Health, Military hospital, Private hospital, etc. Patients who present to a specific health care center may be different from those who go to another health center or another health care provider. Hence, it is not advisable to generalize the results from a single hospital-based study to the whole city, let alone the entire country.[6,7] Another important point to consider is that people who agree to participate in the study may be different from the nonresponders. In general, responders tend to be more health conscious and more literate, or they may be more likely to have a chronic condition as compared to an acute exacerbation of the disease. Hence, the results of the study may differ from the outcome in the general population in either a positive or a negative direction, depending on how the responders differ from the nonresponders.[8]

It is important to show that the selected sample is representative of the study and the target population with regard to the demographic and other relevant characteristics that may affect the outcome of the study.[9] For example, in a study comparing the outcomes of diabetic patients managed by an endocrinologist with those managed by family physicians, it is important to ensure that the two groups have the same socioeconomic characteristics with regard to age, gender, income, and education, since all of these are related to the outcome.[10] It is also important to consider the severity and duration of disease, since patients presenting to the endocrinologist may be more likely to already have complications due to diabetes mellitus. It is recommended to obtain the relevant demographic and background information from the responders to demonstrate that they are representative of the target/study population.[10] Additional information that may be easily obtained, like area of residence, body mass index (BMI), smoking status, etc., should also be collected about the nonresponders/lost to follow-up cases. This will be useful to demonstrate that the responders are similar to the nonresponders with regard to these background variables.[11]

SAMPLE SIZE ESTIMATION

A sample must be of the required size in order to have the required degree of accuracy in the results as well as to be able to identify any significant difference/association that may be present in the study population.[12]

Determining the minimum required sample size for achieving the main objectives of the study is of prime importance for all studies but is generally neglected by most novice researchers. A common practice is to select all the cases that are available (consecutive sampling) in a given period of time or to select a sample size based on a previous study.[13] Another practice is to select a sample of 50 or 100 patients depending upon the time and resources available.[14] While the above practices may be adequate in some cases, they are generally not appropriate, especially for studies which require the comparison of two or more groups with respect to one or more outcomes of interest.

The factors that need to be considered when determining the required sample size include the size of the study population (from which the sample is to be selected), the confidence level (generally set at 95%), the expected prevalence or variance of the main outcome variable that is being studied, and the margin of error/accuracy that is acceptable for the study.[12,15] In studies comparing two or more groups, the power of the study is generally set at 80%, and additional information regarding the expected difference between the two groups will also be required.[16] Nowadays, it is not necessary to look up difficult formulae and go through complicated calculations in order to determine the required sample size. A number of free online software tools and easily accessible websites, such as Open-Epi,[17] RaoSoft,[18] and Pi-face,[19] can estimate the required sample size for a number of permutations of the estimated parameters for the study population.

The researcher does need to do some preparation in advance before estimating the required sample size. The simplest scenario is a single sample study, where the prevalence of a specific variable is required in the study population, e.g., prevalence of diabetes mellitus or its complications. The additional information needed to determine the required sample size includes the estimated size of the study population (if very large then use 20,000), the expected prevalence of the main variable (if unknown then use 50%), and the required margin of accuracy (generally set at 10% or 5%).[20] The margin of accuracy describes how close the result should be to the true population value; the more precise the required result, the greater the sample size required. Generally, for an expected prevalence of around 50% for the outcome variable,
a margin of accuracy of ±10% requires a sample size of around 100, which increases to around 400 for an accuracy of ±5% and around 10,000 for a ±1% margin of accuracy.

When determining the sample size for estimating the mean value of a numerical variable (e.g., BMI, cholesterol level, etc.), the additional information required is the expected variance of that variable in the target population.[21] This information can be obtained from the literature review of similar studies in the form of the standard deviation (SD) for the required variable. The higher the SD, the greater will be the required sample size. In case the SD is not known for the target population, it can be estimated by taking the difference between the estimated “highest” and “lowest” values in the population and dividing it by four (±2 SD on either side of the mean for the “normal” distribution).[22] For example, the BMI for a group of diabetics is expected to have a high value of 48 and a low value of 16 kg/m². Hence, the “normal” range is 48-16 = 32, which gives an estimated value for the SD of ±8 (32 divided by 4). The other information required for determining the sample size is the accuracy of the estimated mean, that is, how close it should be to the actual population mean.[23] In the above example, for BMI, the accuracy can be set as ±1, ±2 or ±4 kg/m²; the general rule is that the more precise the required accuracy, the greater is the required sample size.[24] A summary of the information required for estimating the sample size is given in Table 2.

The requirements for determining the sample size for comparing two (or more) groups become more complex, since estimates are needed for the expected prevalence in both groups (for categorical variables) and for the expected difference of means (for numerical variables). But the basic rule is the same – the greater the variability of the variable under study or the more precise the required accuracy, the greater is the required sample size.[12] Table 3 shows the estimated sample sizes for a categorical variable (hypertension) and a numerical variable (systolic blood pressure) for comparing these variables between smokers and nonsmokers for different levels of accuracy. It is up to the researcher to select the required criteria according to the study objectives and the available resources. It should be kept in mind that these are all based on estimates, and if the sample results are found to have more variability than was used in the estimation, then the observed differences may fail to reach statistical significance. If there is provision for doing a pilot study, then the estimated prevalence or SD can be more accurately determined based on a smaller sample from within the study
Table 3: Estimated sample sizes required for varying expected differences between two samples (nonsmokers versus smokers) for a categorical and a numerical variable
Confidence level = 95%, Power = 80%, Ratio of nonsmokers:smokers = 1:1

Hypertension (%)
Nonsmokers (%) | Smokers (%) | Expected difference (%) | Required sample size
20 | 30 | 10 | 315+315 (630)
20 | 35 | 15 | 150+150 (300)
20 | 40 | 20 | 90+90 (180)

Systolic blood pressure (mmHg)
Nonsmokers (SD: ±15) | Smokers (SD: ±20) | Expected difference (mmHg) | Required sample size
120 | 125 | 5 | 200+200 (400)
120 | 130 | 10 | 50+50 (100)
120 | 135 | 15 | 25+25 (50)

SD: Standard deviation
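The sample sizes quoted in this section, including those in Table 3, can be approximated with the standard normal-approximation formulas. The following is a minimal Python sketch under stated assumptions, not the method used by Open-Epi or the other calculators cited: the function names are illustrative, the study population is assumed to be large (no finite population correction), the two groups are assumed to be of equal size, and the continuity correction applied to the comparison of proportions is one common choice (Fleiss). Results will therefore be close to, but not identical with, the published figures.

```python
from math import ceil, sqrt
from statistics import NormalDist


def z_value(prob):
    """Standard normal quantile, e.g. z_value(0.975) is about 1.96."""
    return NormalDist().inv_cdf(prob)


def n_single_proportion(p, margin, confidence=0.95):
    """Sample size to estimate a prevalence p within +/- margin (both as fractions)."""
    z = z_value(1 - (1 - confidence) / 2)
    return ceil(z ** 2 * p * (1 - p) / margin ** 2)


def sd_from_range(highest, lowest):
    """Rough SD estimate: expected range divided by four (about +/-2 SD)."""
    return (highest - lowest) / 4


def n_single_mean(sd, margin, confidence=0.95):
    """Sample size to estimate a mean within +/- margin (same units as sd)."""
    z = z_value(1 - (1 - confidence) / 2)
    return ceil((z * sd / margin) ** 2)


def n_two_proportions(p1, p2, confidence=0.95, power=0.80, continuity=True):
    """Per-group size for comparing two proportions (assumption: 1:1 ratio)."""
    z_a = z_value(1 - (1 - confidence) / 2)
    z_b = z_value(power)
    p_bar = (p1 + p2) / 2
    diff = abs(p1 - p2)
    n = (z_a * sqrt(2 * p_bar * (1 - p_bar))
         + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / diff ** 2
    if continuity:
        # Fleiss continuity correction for equal group sizes (an assumption here)
        n = n / 4 * (1 + sqrt(1 + 4 / (n * diff))) ** 2
    return ceil(n)


def n_two_means(sd1, sd2, difference, confidence=0.95, power=0.80):
    """Per-group size for comparing two means (assumption: 1:1 ratio)."""
    z_a = z_value(1 - (1 - confidence) / 2)
    z_b = z_value(power)
    return ceil((z_a + z_b) ** 2 * (sd1 ** 2 + sd2 ** 2) / difference ** 2)


if __name__ == "__main__":
    # Single sample, expected prevalence 50%, margins of 10%, 5% and 1%
    for margin in (0.10, 0.05, 0.01):
        print("prevalence, margin", margin, "->", n_single_proportion(0.5, margin))
    # BMI example: SD estimated from a range of 16 to 48 kg/m2
    sd_bmi = sd_from_range(48, 16)
    for margin in (1, 2, 4):
        print("mean BMI, margin", margin, "->", n_single_mean(sd_bmi, margin))
    # Table 3 scenarios: hypertension 20% versus 30/35/40%
    for p2 in (0.30, 0.35, 0.40):
        print("hypertension 20% vs", p2, "->", n_two_proportions(0.20, p2), "per group")
    # Table 3 scenarios: systolic BP, SD 15 and 20, differences of 5/10/15 mmHg
    for delta in (5, 10, 15):
        print("systolic BP, difference", delta, "->", n_two_means(15, 20, delta), "per group")
```

For a prevalence of 50%, the first function should return values close to the "around 100, 400, and 10,000" quoted in the text. The two-group functions should give per-group sizes of the same order as Table 3. Small differences are expected because each calculator applies its own rounding and corrections.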
population for determining the required sample size more accurately.[25]

SAMPLING TECHNIQUES

The other important issue related to sampling is selecting the required sample in such a manner that it is representative of the study population.[7] It is a common pitfall to opt for the easier option of convenience sampling, where “all” the available persons in the study population are selected for the study until the required sample size is reached. This is nonprobability sampling, where the sample is less likely to be representative of the study population due to inherent biases in the sampling process.[13] Other forms of nonprobability sampling include purposive sampling, quota sampling, and snowball sampling, where the sample is selected according to some predetermined criteria. These types of sampling techniques are more appropriate for small-scale studies which are not meant to be generalized to a larger population.[13]

The more relevant sampling technique is called “probability sampling” or “random sampling.”[26] It is important to note here that the word “random” as used in this context differs from its everyday usage. It is misleading to state that the sample was chosen at random from all the patients coming to the outpatient clinic. In order to be classified as random or probability sampling, every person in the study population must have an equal or known probability of being included in the sample.[7] It is quite common to overlook hidden biases in the sampling process which adversely affect the outcome of the study. For example, suppose a study is to be conducted to determine the satisfaction of patients coming to a health care center, and the decision is to sample every third patient coming out of the center. At first glance, this seems to be “unbiased” if every third person is selected accordingly. But one hidden factor is related to the outcome of the study, that is, satisfaction with the care provided. A person who is not satisfied with the health care provided would be unlikely to return to the center or would come only once a month, while a person who is satisfied would return more frequently, maybe 2-3 times a month. Hence, it is quite likely that the satisfaction survey shows a more positive result than the actual perception.[27] One way to account for this hidden bias is to interview only “new” patients who are visiting the clinic for the 1st time; alternatively, it may be sufficient to just ask the respondent how many times s/he has visited the clinic in the last month or year.[28] The same bias may be associated with random digit dialing for a phone survey. Since the computer dials numbers randomly, it may seem that there should be no bias in the sample selection. In fact, there is still a hidden bias: people who have two phones (or dual-SIM phones) are twice as likely to be selected as the majority of people who have only one number.[29] People with more than one phone are more likely to have a higher income, so this may bias any study asking about perceptions of health care insurance or even about choosing between prepaid/postpaid mobile phone services. This type of bias can be controlled for by recognizing it at the planning stage of the survey and including the question “How many phone numbers do you have?” in the survey. The answer can then be used to appropriately weight the responses of such respondents at the final analysis stage.[29]

The types of probability sampling methods include simple random sampling, systematic random sampling, and stratified random sampling [Table 4]; these three methods are most relevant when a sampling frame (list of the people in the study population) is available.[7] Simple random sampling is as simple as picking up chits (names or numbers written on pieces of paper) from a box for a small study population of up to 30-50 people. For a larger study population, a computer-generated random number table can be used to select the respondents accordingly, e.g., every nth person coming out of a clinic or selecting the nth person from each household.[7]

Systematic random sampling is applicable when the study population is relatively large (100 or more) and a list is available of all the members, e.g., employees in a hospital, medical students in a class, or even beds in a hospital. The total number of subjects in the list is divided by the required sample size to obtain the “skip number,” e.g., to select 25 out of a list of 200 the skip number will be every 8th person on the list (200/25 = 8). The next step is to choose a number randomly from between 1 and 8, which will be the first person selected, and then systematically select every 8th person from the list till the end of the list is reached, e.g., 3, 11, 19, 27, …, 195. It is important to remember that the first person should be chosen randomly; arbitrarily selecting the 1st person or the 8th person on the list would give every other person on the list zero probability of being selected.[30]

Stratified random sampling is a form of systematic random sampling with the addition that the list is stratified (arranged by categories) according

Table 4: List of different probability and nonprobability sampling methods
Probability sampling methods | Nonprobability sampling methods
Simple random sampling | Convenience sampling
Systematic random sampling | Consecutive sampling
Stratified random sampling | Quota sampling
Cluster sampling | Judgmental/purposive sampling
Multistage sampling | Snowball sampling
23. Boston University, School of Public Health. Power and Sample Size Determination. Available from: http://www.sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Power/BS704_Power_print.html. [Last accessed on 2014 Sep 25].
24. Penn State University. Stat 100 – Statistical Concepts and Reasoning. 2014. Available from: http://www.onlinecourses.science.psu.edu/stat100/node/17. [Last accessed on 2014 Sep 25].
25. Noordzij M, Tripepi G, Dekker FW, Zoccali C, Tanck MW, Jager KJ. Sample size calculations: Basic principles and common pitfalls. Nephrol Dial Transplant 2002;17:2087-93. Available from: http://www.ndt.oxfordjournals.org/content/25/5/1388.long. [Last accessed on 2014 Sep 25].
26. Doherty M. Probability versus non-probability sampling in sample surveys. New Zealand Stat Rev 1994;21-8. Available from: http://www.nss.gov.au/nss/home.nsf/75427d7291fa0145ca2571340022a2ad/768dd0fbbf616c71ca2571ab002470cd/$FILE/Probability%20versus%20Non%20Probability%20Sampling.pdf. [Last accessed on 2014 Sep 26].
27. Lane DM. Research design. In: Online Statistics Education: An Interactive Multimedia Course of Study. Rice University, University of Houston, Tufts University. Available from: http://www.onlinestatbook.com/2/research_design/sampling.html. [Last accessed on 2014 Sep 26].
28. World Health Organization. Toolkit on Monitoring Health Systems Strengthening: Service Delivery, [June 2008]. Available from: http://www.who.int/healthinfo/statistics/toolkit_hss/EN_PDF_Toolkit_HSS_ServiceDelivery.pdf. [Last accessed on 2014 Sep 26].
29. Ferraro D, Krenzke T, Montaquila J. RDD Telephone Surveys: Reducing Bias and Increasing Operational Efficiency. Joint Statistical Meeting: Section on Survey Research Methods; 2008. p. 1949-56. Available from: http://www.amstat.org/sections/srms/proceedings/y2008/Files/301280.pdf. [Last accessed on 2014 Sep 26].
30. Daniel J. Choosing the type of probability sampling. In: Sampling Essentials. CH. 5. SAGE Publications Inc.; 2012. Available from: http://www.sagepub.com/upm-data/40803_5.pdf. [Last accessed on 2014 Sep 26].

How to cite this article: Omair A. Sample size estimation and sampling techniques for selecting a representative sample. J Health Spec 2014;2:142-7.
Source of Support: Nil, Conflict of Interest: None declared.