
3.5 Simple Random Sampling and Other Sampling Methods
Sampling Methods can be classified into one of two categories:

 Probability Sampling: Sample has a known probability of being selected


 Non-probability Sampling: Sample does not have a known probability of being selected, as in
convenience or voluntary response surveys
Probability Sampling
In probability sampling it is possible to both determine which sampling units belong to which sample
and the probability that each sample will be selected. The following sampling methods are examples
of probability sampling:
1. Simple Random Sampling (SRS)
2. Stratified Sampling
3. Cluster Sampling
4. Systematic Sampling
5. Multistage Sampling (in which some of the methods above are combined in stages)
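As a concrete illustration of method 1, a simple random sample can be drawn in a few lines of code. This is a minimal sketch in Python; the sampling frame and sample size are invented for the example.

```python
import random

# Hypothetical sampling frame: ID numbers for a population of 10,000 units.
population = list(range(1, 10001))

# Draw a simple random sample (SRS) of 100 units without replacement:
# every possible subset of 100 units is equally likely to be chosen.
srs = random.sample(population, k=100)
print(sorted(srs)[:10])  # first few selected IDs
```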
Of the five methods listed above, students have the most trouble distinguishing between stratified
sampling and cluster sampling.
Stratified Sampling is possible when it makes sense to partition the population into groups based on
a factor that may influence the variable that is being measured. These groups are then called strata.
An individual group is called a stratum. With stratified sampling one should:
 partition the population into groups (strata)
 obtain a simple random sample from each group (stratum)
 collect data on each sampling unit that was randomly sampled from each group (stratum)
Stratified sampling works best when a heterogeneous population is split into fairly homogeneous
groups. Under these conditions, stratification generally produces more precise estimates of the
population percentages than estimates that would be found from a simple random sample. Table
3.2 shows some examples of ways to obtain a stratified sample.
Table 3.2. Examples of Stratified Samples

Example 1
- Population: all people in the U.S.
- Groups (strata): the 4 time zones in the U.S. (Eastern, Central, Mountain, Pacific)
- Obtain a simple random sample: 500 people from each of the 4 time zones
- Sample: 4 × 500 = 2,000 selected people

Example 2
- Population: all PSU intercollegiate athletes
- Groups (strata): the 26 PSU intercollegiate teams
- Obtain a simple random sample: 5 athletes from each of the 26 PSU teams
- Sample: 26 × 5 = 130 selected athletes

Example 3
- Population: all elementary students in the local school district
- Groups (strata): the 11 different elementary schools in the local school district
- Obtain a simple random sample: 20 students from each of the 11 elementary schools
- Sample: 11 × 20 = 220 selected students
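To make the three steps concrete, here is a minimal sketch in Python of the stratified design from Example 1. The frame is invented; only the structure (partition into strata, then a simple random sample per stratum) matters.

```python
import random
from collections import defaultdict

random.seed(1)

# Hypothetical frame: (person_id, time_zone) pairs for 20,000 people.
zones = ["Eastern", "Central", "Mountain", "Pacific"]
frame = [(i, random.choice(zones)) for i in range(20000)]

# Step 1: partition the population into strata (time zones).
strata = defaultdict(list)
for person_id, zone in frame:
    strata[zone].append(person_id)

# Steps 2 and 3: simple random sample of 500 from EACH stratum,
# then collect data on every sampled unit.
sample = {zone: random.sample(ids, k=500) for zone, ids in strata.items()}
print({zone: len(ids) for zone, ids in sample.items()})  # 500 per stratum
```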
Cluster Sampling is very different from Stratified Sampling. With cluster sampling one should
 divide the population into groups (clusters).
 obtain a simple random sample of so many clusters from all possible clusters.
 obtain data on every sampling unit in each of the randomly selected clusters.
It is important to note that, unlike with the strata in stratified sampling, the clusters should be
microcosms, rather than subsections, of the population. Each cluster should be heterogeneous.
Additionally, the statistical analysis used with cluster sampling is not only different, but also more
complicated than that used with stratified sampling.

Table 3.3. Examples of Cluster Samples

Example 1
- Population: all people in the U.S.
- Groups (clusters): the 4 time zones in the U.S. (Eastern, Central, Mountain, Pacific)
- Obtain a simple random sample: 2 time zones from the 4 possible time zones
- Sample: every person in the 2 selected time zones

Example 2
- Population: all PSU intercollegiate athletes
- Groups (clusters): the 26 PSU intercollegiate teams
- Obtain a simple random sample: 8 teams from the 26 possible teams
- Sample: every athlete on the 8 selected teams

Example 3
- Population: all elementary students in a local school district
- Groups (clusters): the 11 different elementary schools in the local school district
- Obtain a simple random sample: 4 elementary schools from the 11 possible elementary schools
- Sample: every student in the 4 selected elementary schools
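By contrast, a cluster design randomizes over the groups themselves and then keeps every unit in the chosen groups. A minimal Python sketch of Example 3 (the school rosters are invented):

```python
import random

random.seed(2)

# Hypothetical clusters: 11 elementary schools, each a roster of student IDs.
schools = {f"school_{k}": [f"s{k}_{i}" for i in range(300)]
           for k in range(1, 12)}

# Step 2: simple random sample of 4 schools from the 11 possible clusters.
chosen = random.sample(sorted(schools), k=4)

# Step 3: collect data on EVERY student in each selected school.
sample = [student for school in chosen for student in schools[school]]
print(chosen, len(sample))  # 4 schools, 4 x 300 = 1,200 students
```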
Each of the three examples in Tables 3.2 and 3.3 illustrates how both stratified and cluster
sampling could be accomplished. However, there are times when one sampling method is
preferred over the other. The following explanations clarify when to use which method.
 With Example 1: Stratified sampling would be preferred over cluster sampling, particularly if the
questions of interest are affected by time zone. For example, the percentage of people watching a live
sporting event on television might be highly affected by the time zone they are in. Cluster sampling
really works best when there are a reasonable number of clusters relative to the entire population. In
this case, selecting 2 clusters from 4 possible clusters really does not provide much advantage over
simple random sampling.
 With Example 2: Either stratified sampling or cluster sampling could be used. It would depend on
what questions are being asked. For instance, consider the question "Do you agree or disagree that
you receive adequate attention from the team of doctors at the Sports Medicine Clinic when
injured?" The answer to this question would probably not be team dependent, so cluster sampling
would be fine. In contrast, if the question of interest is "Do you agree or disagree that weather
affects your performance during an athletic event?", the answer would probably be
influenced by whether the sport is played outside or inside. Consequently, stratified sampling
would be preferred.
 With Example 3: Cluster sampling would probably be better than stratified sampling if each
individual elementary school appropriately represents the entire population as in a school district
where students from throughout the district can attend any school. Stratified sampling could be used
if the elementary schools had very different locations and served only their local neighborhood (i.e.,
one elementary school is located in a rural setting while another elementary school is located in an
urban setting.) Again, the questions of interest would affect which sampling method should be used.
The most common method of carrying out a poll today is using Random Digit Dialing, in which a
machine randomly dials phone numbers. Some polls go even further and have a machine conduct the
interview itself rather than just dialing the number! Such "robo call polls" can be very biased
because they have extremely low response rates (most people don't like speaking to a machine) and
because federal law prevents such calls to cell phones. Since the people who have landline phone
service tend to be older than people who have cell phone service only, another potential source of
bias is introduced. National polling organizations that use random digit dialing in conducting
interviewer based polls are very careful to match the number of landline versus cell phones to the
population they are trying to survey.
Non-probability Sampling
The following sampling methods that are listed in your text are types of non-probability sampling
that should be avoided:
1. volunteer samples
2. haphazard (convenience) samples
Since such non-probability sampling methods are based on human choice rather than random
selection, statistical theory cannot explain how they might behave and potential sources of bias are
rampant. In your textbook, the two types of non-probability samples listed above are called
"sampling disasters."
Read the article: "How Polls are Conducted" by the Gallup organization available in Canvas.
The article provides great insight into how major polls are conducted. When you are finished reading
this article you may want to go to the Gallup Poll Web site, https://www.gallup.com, and see the
results from recent Gallup polls. Another excellent source of public opinion polls on a wide variety
of topics using solid sampling methodology is the Pew Research Center website
at https://www.pewresearch.org. When you read one of the summary reports on the Pew site, there is
a link (in the upper right corner) to the complete report giving more detailed results and a full
description of their methodology as well as a link to the actual questionnaire used in the survey so
you can judge whether there might be bias in the wording of their survey.
It is important to be mindful of margin of error as discussed in this article. We all need to remember
that public opinion on a given topic cannot be appropriately measured with one question that is only
asked on one poll. Such results only provide a snapshot at that moment under certain conditions.
The concept of repeating procedures over different conditions and times leads to more valuable and
durable results. Within this section of the Gallup article, there is also an error: "in 95 out of those 100
polls, his rating would be between 46% and 54%." This should instead say that in an expected 95 out
of those 100 polls, the true population percent would be within the confidence interval calculated. In
5 of those surveys, the confidence interval would not contain the population percent.

Basic statistical tools in research and data analysis


Zulfiqar Ali and S Bala Bhaskar




INTRODUCTION
Statistics is a branch of science that deals with the collection, organisation, analysis of data and
drawing of inferences from the samples to the whole population.[1] This requires a proper design
of the study, an appropriate selection of the study sample and choice of a suitable statistical test.
An adequate knowledge of statistics is necessary for proper designing of an epidemiological
study or a clinical trial. Improper statistical methods may result in erroneous conclusions which
may lead to unethical practice.[2]

VARIABLES
A variable is a characteristic that varies from one individual member of a population to
another.[3] Variables such as height and weight are measured by some type of scale, convey
quantitative information and are called quantitative variables. Sex and eye colour give
qualitative information and are called qualitative variables[3] [Figure 1].
[Figure 1: Classification of variables]

Quantitative variables
Quantitative or numerical data are subdivided into discrete and continuous measurements.
Discrete numerical data are recorded as a whole number such as 0, 1, 2, 3,… (integer), whereas
continuous data can assume any value. Observations that can be counted constitute the discrete
data and observations that can be measured constitute the continuous data. Examples of discrete
data are number of episodes of respiratory arrests or the number of re-intubations in an intensive
care unit. Similarly, examples of continuous data are the serial serum glucose levels, partial
pressure of oxygen in arterial blood and the oesophageal temperature.
A hierarchical scale of increasing precision can be used for observing and recording the data
which is based on categorical, ordinal, interval and ratio scales [Figure 1].
Categorical or nominal variables are unordered. The data are merely classified into categories
and cannot be arranged in any particular order. If only two categories exist (as in gender: male
and female), the data are called dichotomous (or binary). The various causes of re-intubation in
an intensive care unit due to upper airway obstruction, impaired clearance of secretions,
hypoxemia, hypercapnia, pulmonary oedema and neurological impairment are examples of
categorical variables.
Ordinal variables have a clear ordering between the variables. However, the ordered data may
not have equal intervals. Examples are the American Society of Anesthesiologists status or
Richmond agitation-sedation scale.
Interval variables are similar to an ordinal variable, except that the intervals between the values
of the interval variable are equally spaced. A good example of an interval scale is the Fahrenheit
degree scale used to measure temperature. With the Fahrenheit scale, the difference between 70°
and 75° is equal to the difference between 80° and 85°: The units of measurement are equal
throughout the full range of the scale.
Ratio scales are similar to interval scales, in that equal differences between scale values have
equal quantitative meaning. However, ratio scales also have a true zero point, which gives them
an additional property. The system of centimetres, for instance, is a ratio scale.
There is a true zero point and the value of 0 cm means a complete absence of length. The
thyromental distance of 6 cm in an adult may be twice that of a child in whom it may be 3 cm.

STATISTICS: DESCRIPTIVE AND INFERENTIAL STATISTICS


Descriptive statistics[4] try to describe the relationship between variables in a sample or
population. Descriptive statistics provide a summary of data in the form of mean, median and
mode. Inferential statistics[4] use a random sample of data taken from a population to describe
and make inferences about the whole population. It is valuable when it is not possible to examine
each member of an entire population. Examples of descriptive and inferential statistics are
illustrated in Table 1.
[Table 1: Example of descriptive and inferential statistics]
Descriptive statistics
The extent to which the observations cluster around a central location is described by the central
tendency and the spread towards the extremes is described by the degree of dispersion.

Measures of central tendency


The measures of central tendency are mean, median and mode.[6] Mean (or the arithmetic
average) is the sum of all the scores divided by the number of scores. The mean may be influenced
profoundly by extreme values. For example, the average stay of organophosphorus
poisoning patients in the ICU may be influenced by a single patient who stays in the ICU for around 5
months because of septicaemia. Such extreme values are called outliers. The formula for the mean
is

$$\bar{x} = \frac{\sum x}{n}$$

where x = each observation and n = number of observations. Median[6] is defined as the middle
of a distribution in ranked data (with half of the variables in the sample above and half below
the median value) while mode is the most frequently occurring variable in a distribution. Range
defines the spread, or variability, of a sample.[7] It is described by the minimum and maximum
values of the variables. If we rank the data and after ranking, group the observations into
percentiles, we can get better information of the pattern of spread of the variables. In percentiles,
we rank the observations into 100 equal parts. We can then describe 25%, 50%, 75% or any
other percentile amount. The median is the 50th percentile. The interquartile range will be the
observations in the middle 50% of the observations about the median (25th-75th percentile).
Variance[7] is a measure of how spread out the distribution is. It gives an indication of how close
an individual observation clusters about the mean value. The variance of a population is defined
by the following formula:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}$$

where $\sigma^2$ is the population variance, $\bar{X}$ is the population mean, $X_i$ is the ith element from the
population and $N$ is the number of elements in the population. The variance of a sample is
defined by a slightly different formula:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$

where $s^2$ is the sample variance, $\bar{x}$ is the sample mean, $x_i$ is the ith element from the sample and $n$
is the number of elements in the sample. Note that the population formula has $N$ as the
denominator, whereas the sample formula uses $n-1$. The expression $n-1$ is known as the degrees of
freedom and is one less than the number of observations: each observation is free to vary, except
the last one, which must take a defined value once the sample mean is fixed. The variance is
measured in squared units. To make the
interpretation of the data simple and to retain the basic unit of observation, the square root of
variance is used. The square root of the variance is the standard deviation (SD).[8] The SD of a
population is defined by the following formula:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}}$$

where $\sigma$ is the population SD, $\bar{X}$ is the population mean, $X_i$ is the ith element from the population
and $N$ is the number of elements in the population. The SD of a sample is defined by a slightly
different formula:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$

where $s$ is the sample SD, $\bar{x}$ is the sample mean, $x_i$ is the ith element from the sample and $n$ is the
number of elements in the sample. An example of the calculation of variance and SD is illustrated
in Table 2.
[Table 2: Example of mean, variance and standard deviation]
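Since Table 2 is not reproduced here, a small worked example with invented data shows the same calculations in Python. The `ddof` argument selects the denominator: `ddof=0` gives the population formula (denominator N), `ddof=1` the sample formula (denominator n−1).

```python
import numpy as np

# Hypothetical observations, e.g. ICU length of stay in days.
x = np.array([4, 7, 5, 9, 5, 6])

mean = x.mean()            # sum of all scores / number of scores
pop_var = x.var(ddof=0)    # population variance: denominator N
samp_var = x.var(ddof=1)   # sample variance: denominator n - 1
samp_sd = x.std(ddof=1)    # sample SD: square root of the sample variance

print(mean, pop_var, samp_var, samp_sd)  # 6.0 2.67 3.2 1.79 (approximately)
```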
Normal distribution or Gaussian distribution
Most of the biological variables usually cluster around a central value, with symmetrical positive
and negative deviations about this point.[1] The standard normal distribution curve is a
symmetrical, bell-shaped curve. In a normal distribution, about 68% of the scores fall within 1 SD
of the mean, around 95% within 2 SDs of the mean and about 99.7% within 3 SDs of
the mean [Figure 2].

[Figure 2: Normal distribution curve]
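These coverage figures follow directly from the normal cumulative distribution function and can be verified numerically; a quick check using SciPy, purely for illustration:

```python
from scipy.stats import norm

# P(-k SD < Z < +k SD) for a standard normal variable Z.
for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} SD: {coverage:.1%}")
# within 1 SD: 68.3%; within 2 SD: 95.4%; within 3 SD: 99.7%
```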
Skewed distribution
It is a distribution with an asymmetry of the variables about its mean. In a negatively skewed
distribution [Figure 3], the mass of the distribution is concentrated on the right, leading to a
longer left tail. In a positively skewed distribution [Figure 3], the mass of the distribution is
concentrated on the left, leading to a longer right tail.

[Figure 3: Curves showing negatively skewed and positively skewed distributions]

Inferential statistics
In inferential statistics, data are analysed from a sample to make inferences in the larger
collection of the population. The purpose is to answer or test the hypotheses. A hypothesis
(plural hypotheses) is a proposed explanation for a phenomenon. Hypothesis tests are thus
procedures for making rational decisions about the reality of observed effects.
Probability is the measure of the likelihood that an event will occur. Probability is quantified as a
number between 0 and 1 (where 0 indicates impossibility and 1 indicates certainty).
In inferential statistics, the term ‘null hypothesis’ (H0 ‘H-naught,’ ‘H-null’) denotes that there is
no relationship (difference) between the population variables in question.[9]
The alternative hypothesis (H1 or Ha) denotes that there is a relationship (difference) between the
population variables in question.[9]
The P value (or the calculated probability) is the probability of the event occurring by chance if
the null hypothesis is true. The P value is a number between 0 and 1 and is interpreted by
researchers in deciding whether to reject or retain the null hypothesis [Table 3].
[Table 3: P values with interpretation]
If the P value is less than the arbitrarily chosen value (known as α or the significance level), the null
hypothesis (H0) is rejected [Table 4]. However, if the null hypothesis (H0) is incorrectly rejected,
this is known as a Type I error.[11] Further details regarding alpha error, beta error and sample
size calculation and factors influencing them are dealt with in another section of this issue by
Das S et al.[12]
[Table 4: Illustration for null hypothesis]

PARAMETRIC AND NON-PARAMETRIC TESTS


Numerical data (quantitative variables) that are normally distributed are analysed with
parametric tests.[13]
The two most basic prerequisites for parametric statistical analysis are:

 The assumption of normality which specifies that the means of the sample group are
normally distributed
 The assumption of equal variance which specifies that the variances of the samples and of
their corresponding population are equal.

However, if the distribution of the sample is skewed towards one side or the distribution is
unknown due to the small sample size, non-parametric[14] statistical techniques are used. Non-
parametric tests are used to analyse ordinal and categorical data.

Parametric tests
The parametric tests assume that the data are on a quantitative (numerical) scale, with a normal
distribution of the underlying population. The samples have the same variance (homogeneity of
variances). The samples are randomly drawn from the population, and the observations within a
group are independent of each other. The commonly used parametric tests are the Student's t-test,
analysis of variance (ANOVA) and repeated measures ANOVA.
Student's t-test
Student's t-test is used to test the null hypothesis that there is no difference between the means of
the two groups. It is used in three circumstances:

1. To test if a sample mean (as an estimate of a population mean) differs significantly from
a given population mean (this is a one-sample t-test)

The formula for the one-sample t-test is

$$t = \frac{\bar{X} - \mu}{SE}$$

where $\bar{X}$ = sample mean, $\mu$ = population mean and SE = standard error of the mean

2. To test if the population means estimated by two independent samples differ significantly
(the unpaired t-test). The formula for the unpaired t-test is:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{SE}$$

where $\bar{X}_1 - \bar{X}_2$ is the difference between the means of the two groups and SE denotes the
standard error of the difference.

3. To test if the population means estimated by two dependent samples differ significantly
(the paired t-test). A usual setting for paired t-test is when measurements are made on the
same subjects before and after a treatment.
The formula for the paired t-test is:

$$t = \frac{\bar{d}}{SE}$$

where $\bar{d}$ is the mean difference and SE denotes the standard error of this difference.
The group variances can be compared using the F-test. The F-test is the ratio of variances (var
1/var 2). If F differs significantly from 1.0, then it is concluded that the group variances differ
significantly.
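All three circumstances are available in standard statistical software. A brief sketch with SciPy on invented numbers, purely to show the calls:

```python
from scipy import stats

before = [140, 152, 138, 145, 160, 149]  # hypothetical values, same subjects
after = [132, 147, 139, 140, 151, 144]   # after treatment (paired with above)
other = [150, 158, 143, 162, 155, 149]   # an independent second group

t1, p1 = stats.ttest_1samp(before, popmean=145)  # 1. one-sample t-test
t2, p2 = stats.ttest_ind(before, other)          # 2. unpaired t-test
t3, p3 = stats.ttest_rel(before, after)          # 3. paired t-test
print(p1, p2, p3)
```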
Analysis of variance
The Student's t-test cannot be used for comparison of three or more groups. The purpose of
ANOVA is to test if there is any significant difference between the means of two or more groups.
In ANOVA, we study two variances – (a) between-group variability and (b) within-group
variability. The within-group variability (error variance) is the variation that cannot be accounted
for in the study design. It is based on random differences present in our samples.
However, the between-group (or effect variance) is the result of our treatment. These two
estimates of variances are compared using the F-test.
A simplified formula for the F statistic is:

$$F = \frac{MS_b}{MS_w}$$

where $MS_b$ is the mean squares between the groups and $MS_w$ is the mean squares within groups.
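A one-way ANOVA on three hypothetical groups, sketched with SciPy; `f_oneway` computes the F statistic (the ratio of between-group to within-group mean squares) and its P value:

```python
from scipy import stats

# Hypothetical measurements from three treatment groups.
g1 = [23, 25, 21, 26, 24]
g2 = [28, 30, 27, 31, 29]
g3 = [22, 24, 23, 25, 21]

f_stat, p_value = stats.f_oneway(g1, g2, g3)
print(f_stat, p_value)
```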
Repeated measures analysis of variance
As with ANOVA, repeated measures ANOVA analyses the equality of means of three or more
groups. However, a repeated measures ANOVA is used when all variables of a sample are
measured under different conditions or at different points in time.
As the variables are measured from a sample at different points of time, the measurement of the
dependent variable is repeated. Using a standard ANOVA in this case is not appropriate because
it fails to model the correlation between the repeated measures: The data violate the ANOVA
assumption of independence. Hence, in the measurement of repeated dependent variables,
repeated measures ANOVA should be used.

Non-parametric tests
When the assumptions of normality are not met and the sample means are not normally
distributed, parametric tests can lead to erroneous results. Non-parametric tests (distribution-free
tests) are used in such situations as they do not require the normality assumption.[15] Non-
parametric tests may fail to detect a significant difference when compared with a parametric test.
That is, they usually have less power.
As is done for the parametric tests, the test statistic is compared with known values for the
sampling distribution of that statistic and the null hypothesis is accepted or rejected. The types of
non-parametric analysis techniques and the corresponding parametric analysis techniques are
delineated in Table 5.
[Table 5: Analogues of parametric and non-parametric tests]

Median test for one sample: The sign test and Wilcoxon's signed rank test
The sign test and Wilcoxon's signed rank test are used for median tests of one sample. These
tests examine whether one instance of sample data is greater or smaller than the median
reference value.
Sign test
This test examines the hypothesis about the median θ0 of a population. It tests the null
hypothesis H0: θ = θ0. When the observed value (Xi) is greater than the reference value (θ0), it is
marked with a + sign. If the observed value is smaller than the reference value, it is marked with a
− sign. If the observed value is equal to the reference value (θ0), it is eliminated from the sample.
If the null hypothesis is true, there will be an equal number of + signs and − signs.
The sign test ignores the actual values of the data and only uses + or − signs. Therefore, it is
useful when it is difficult to measure the values.
Wilcoxon's signed rank test
There is a major limitation of sign test as we lose the quantitative information of the given data
and merely use the + or – signs. Wilcoxon's signed rank test not only examines the observed
values in comparison with θ0 but also takes into consideration the relative sizes, adding more
statistical power to the test. As in the sign test, if there is an observed value that is equal to the
reference value θ0, this observed value is eliminated from the sample.
Wilcoxon's rank sum test ranks all data points in order, calculates the rank sum of each sample
and compares the difference in the rank sums.
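Both ideas can be sketched in Python with invented data. The sign test reduces to a binomial test on the count of + signs (SciPy ≥ 1.7 provides `binomtest`), while `wilcoxon` implements the signed rank test directly.

```python
from scipy import stats

x = [5.1, 4.8, 6.0, 5.5, 4.9, 5.7, 6.2, 4.6]  # hypothetical observations
theta0 = 5.0                                   # reference median

# Sign test: drop values equal to theta0, count + signs,
# and test whether + and - signs are equally likely (p = 1/2).
kept = [v for v in x if v != theta0]
n_plus = sum(v > theta0 for v in kept)
sign_p = stats.binomtest(n_plus, n=len(kept), p=0.5).pvalue

# Wilcoxon signed rank test on the differences from theta0.
w_stat, w_p = stats.wilcoxon([v - theta0 for v in x])
print(sign_p, w_p)
```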
Mann-Whitney test
It is used to test the null hypothesis that two samples have the same median or, alternatively,
whether observations in one sample tend to be larger than observations in the other.
Mann–Whitney test compares all data (xi) belonging to the X group and all data (yi) belonging to
the Y group and calculates the probability of xi being greater than yi: P (xi > yi). The null
hypothesis states that P(xi > yi) = P(xi < yi) = 1/2, while the alternative hypothesis states
that P(xi > yi) ≠ 1/2.
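A minimal sketch with SciPy, using made-up samples:

```python
from scipy import stats

x = [12, 15, 11, 18, 14, 16]  # hypothetical group X
y = [20, 17, 22, 19, 24, 18]  # hypothetical group Y

# Tests whether observations in one group tend to be larger than in the other.
u_stat, p_value = stats.mannwhitneyu(x, y, alternative="two-sided")
print(u_stat, p_value)
```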
Kolmogorov-Smirnov test
The two-sample Kolmogorov-Smirnov (KS) test was designed as a generic method to test
whether two random samples are drawn from the same distribution. The null hypothesis of the
KS test is that both distributions are identical. The statistic of the KS test is a distance between
the two empirical distributions, computed as the maximum absolute difference between their
cumulative curves.
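A minimal two-sample KS test with SciPy; `ks_2samp` returns the D statistic (the maximum absolute distance between the two empirical cumulative curves) and its P value. The samples are simulated for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=100)  # draws from N(0, 1)
b = rng.normal(loc=0.5, scale=2.0, size=100)  # shifted, wider normal

d_stat, p_value = stats.ks_2samp(a, b)
print(d_stat, p_value)
```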
Kruskal-Wallis test
The Kruskal–Wallis test is a non-parametric test to analyse the variance.[14] It analyses if there
is any difference in the median values of three or more independent samples. The data values are
ranked in increasing order, and the rank sums are calculated, followed by calculation of the test
statistic.
Jonckheere test
In contrast to the Kruskal–Wallis test, the Jonckheere test assumes an a priori ordering of the
groups, which gives it more statistical power than the Kruskal–Wallis test.[14]
Friedman test
The Friedman test is a non-parametric test for testing the difference between several related
samples. The Friedman test is an alternative to repeated measures ANOVA, used when
the same parameter has been measured under different conditions on the same subjects.[13]

Tests to analyse the categorical data


Chi-square test, Fisher's exact test and McNemar's test are used to analyse the categorical or
nominal variables. The Chi-square test compares the frequencies and tests whether the observed
data differ significantly from that of the expected data if there were no differences between
groups (i.e., the null hypothesis). It is calculated by the sum of the squared difference between
observed (O) and the expected (E) data (or the deviation, d) divided by the expected data, by the
following formula:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

A Yates correction factor is used when the sample size is small. Fisher's exact test is used to
determine if there are non-random associations between two categorical variables. It does not
assume random sampling, and instead of referring a calculated statistic to a sampling
distribution, it calculates an exact probability. McNemar's test is used for paired nominal data. It
is applied to 2 × 2 table with paired-dependent samples. It is used to determine whether the row
and column frequencies are equal (that is, whether there is ‘marginal homogeneity’). The null
hypothesis is that the paired proportions are equal. The Mantel-Haenszel Chi-square test is a
multivariate test as it analyses multiple grouping variables. It stratifies according to the
nominated confounding variables and identifies any that affects the primary outcome variable. If
the outcome variable is dichotomous, then logistic regression is used.
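A sketch of the chi-square and Fisher's exact tests on a hypothetical 2 × 2 table (SciPy; the counts are invented). For 2 × 2 tables, `chi2_contingency` applies the Yates correction by default.

```python
from scipy import stats

# Hypothetical 2 x 2 table: rows = treatment/control, columns = improved/not.
table = [[30, 10],
         [18, 22]]

chi2, p_chi, dof, expected = stats.chi2_contingency(table)
odds_ratio, p_fisher = stats.fisher_exact(table)  # exact probability
print(p_chi, p_fisher)
```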

SOFTWARE AVAILABLE FOR STATISTICS, SAMPLE SIZE CALCULATION AND POWER ANALYSIS
Numerous statistical software systems are available currently. The commonly used software
systems are Statistical Package for the Social Sciences (SPSS – manufactured by IBM
Corporation), Statistical Analysis System (SAS – developed by SAS Institute, North Carolina,
United States of America), R (designed by Ross Ihaka and Robert Gentleman from the R core team),
Minitab (developed by Minitab Inc.), Stata (developed by StataCorp) and MS Excel
(developed by Microsoft).
There are a number of web resources which are related to statistical power analyses. A few are:

 StatPages.net – provides links to a number of online power calculators


 G-Power – provides a downloadable power analysis program that runs under DOS
 Power analysis for ANOVA designs – an interactive site that calculates the power or sample
size needed to attain a given power for one effect in a factorial ANOVA design
 SPSS makes a program called SamplePower. It produces a complete report on
the computer screen which can be cut and pasted into another document.


SUMMARY
It is important that a researcher knows the concepts of the basic statistical methods used for
conduct of a research study. This will help to conduct an appropriately well-designed study
leading to valid and reliable results. Inappropriate use of statistical techniques may lead to faulty
conclusions, inducing errors and undermining the significance of the article. Bad statistics may
lead to bad research, and bad research may lead to unethical practice. Hence, an adequate
knowledge of statistics and the appropriate use of statistical tests are important. An appropriate
knowledge about the basic statistical methods will go a long way in improving the research
designs and producing quality medical research which can be utilised for formulating the
evidence-based guidelines.

Sampling
Sampling is a statistical procedure concerned with the selection of individual
observations; it helps us to make statistical inferences about the population.
The Main Characteristics of Sampling
In sampling, we assume that samples are drawn from the population and
sample means and population means are equal. A population can be
defined as a whole that includes all items and characteristics of the
research taken into study. However, gathering all this information is time
consuming and costly. We therefore make inferences about the
population with the help of samples.

Random sampling:
In data collection, every individual observation has an equal probability of being
selected into a sample. In random sampling, there should be no pattern
when drawing a sample.

Significance: Significance is the probability that a relationship found in sample
data arose by chance. Researchers often use the 0.05 (5%) significance level.
Probability and non-probability sampling:
Probability sampling is the sampling technique in which every individual
unit of the population has a greater-than-zero probability of getting selected
into a sample.

Non-probability sampling is the sampling technique in which some
elements of the population have no probability of getting selected into a
sample.

Types of random sampling:


The main types of random sampling are:

Simple random sampling: Using a random number generator, the researcher
draws a sample directly from the population; this is called simple random
sampling. Simple random sampling is of two types: samples may be drawn
with replacement or without replacement.
Equal probability systematic sampling: In this type of sampling method, a
researcher starts from a random point and selects every nth subject in
the sampling frame. In this method, there is a danger of order bias.
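A minimal sketch of equal-probability systematic sampling in Python (the frame size and sample size are invented):

```python
import random

frame = list(range(1, 1001))  # hypothetical ordered sampling frame
n = 50                        # desired sample size
k = len(frame) // n           # sampling interval: select every kth subject

start = random.randrange(k)   # random starting point within the first interval
sample = frame[start::k][:n]
print(sample[:5], len(sample))
```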
Stratified simple random sampling: In stratified simple random sampling, a
proportion from strata of the population is selected using simple random
sampling. For example, a fixed proportion is taken from every class from
a school.
Multistage stratified random sampling: In multistage stratified random sampling,
a proportion of strata is selected from a homogeneous group using simple
random sampling. For example, a sample drawn from the nth class and the nth
stream of a school is called a multistage stratified random sample.
Cluster sampling: Cluster sampling occurs when a random sample is drawn
from certain aggregational geographical groups.
Multistage cluster sampling: Multistage cluster sampling occurs when a
researcher draws a random sample from the smaller unit of an
aggregational group.
Types of non-random sampling: Non-random sampling is widely used in
qualitative research. Random sampling is too costly in qualitative
research. The following are non-random sampling methods:
Availability sampling: Availability sampling occurs when the researcher
selects the sample based on the availability of a sample. This method is
also called haphazard sampling. E-mail surveys are an example of
availability sampling.
Quota sampling: This method is similar to the availability sampling method,
but with the constraint that the sample is drawn proportionally by strata.
Expert sampling: This method is also known as judgment sampling. In this
method, a researcher collects the samples by taking interviews from a
panel of individuals known to be experts in a field.
Analyzing non-response samples: The following methods are used to handle the
non-response sample:
Weighting: Weighting is a statistical technique that is used to handle the
non-response data. Weighting can be used as a proxy for data.
In SPSS commands, “weight by” is used to assign weight. In SAS, the
“weight” parameter is used to assign the weight.
Dealing with missing data: In statistical analysis, non-response data is called
missing data. During the analysis, we have to delete the missing data, or
we have to replace the missing data with other values. In SPSS, missing
value analysis is used to handle the non-response data.
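As an illustration of the two options just mentioned (deletion versus replacement), here is a minimal pandas sketch on made-up survey data; it is not the SPSS Missing Value Analysis procedure, just the same ideas in code.

```python
import numpy as np
import pandas as pd

# Hypothetical survey responses; NaN marks non-response.
df = pd.DataFrame({"age": [25, 31, np.nan, 42, 38],
                   "score": [3.0, np.nan, 4.5, 4.0, np.nan]})

dropped = df.dropna()                             # option 1: delete incomplete cases
imputed = df.fillna(df.mean(numeric_only=True))   # option 2: replace with column means
print(dropped.shape, imputed.isna().sum().sum())  # (2, 2) and 0 remaining missing
```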
Sample size: To handle the non-response data, a researcher usually takes
a large sample.
GraphPad Statistics Guide: Interpreting results: Kolmogorov-Smirnov test

Key facts about the Kolmogorov-Smirnov test
• The two-sample Kolmogorov-Smirnov test is a nonparametric test that compares the cumulative distributions of two data sets (1, 2).
• The test is nonparametric: it does not assume that data are sampled from Gaussian distributions (or any other defined distribution).
• The results will not change if you transform all the values to logarithms or reciprocals or any other transformation. The KS test reports the maximum difference between the two cumulative distributions and calculates a P value from that difference and the sample sizes. A transformation will stretch (even rearrange, if you pick a strange transformation) the X axis of the frequency distribution, but cannot change the maximum distance between two frequency distributions.
• Converting all values to their ranks also would not change the maximum difference between the cumulative frequency distributions (pages 35-36 of Lehmann, reference 2). Thus, although the test analyzes the actual data, it is equivalent to an analysis of ranks and is fairly robust to outliers (like the Mann-Whitney test).
• The null hypothesis is that both groups were sampled from populations with identical distributions. The test detects any violation of that null hypothesis: different medians, different variances, or different distributions.
• Because it tests for more deviations from the null hypothesis than the Mann-Whitney test does, it has less power to detect a shift in the median but more power to detect changes in the shape of the distributions (Lehmann, page 39).
• Since the test does not compare any particular parameter (i.e., mean or median), it does not report any confidence interval.
• Don't use the Kolmogorov-Smirnov test if the outcome (Y values) is categorical, with many ties. Use it only for ratio or interval data, where ties are rare.
• The concept of one- and two-tail P values only makes sense when you are looking at an outcome that has two possible directions (i.e., a difference between two means). Two cumulative distributions can differ in many ways, so the concept of tails is not really appropriate; the P value reported by Prism essentially has many tails. Some texts call this a two-tail P value.

Interpreting the P value
The P value is the answer to this question: if the two samples were randomly sampled from identical populations, what is the probability that the two cumulative frequency distributions would be as far apart as observed? More precisely, what is the chance that the value of the Kolmogorov-Smirnov D statistic would be as large or larger than observed? If the P value is small, conclude that the two groups were sampled from populations with different distributions. The populations may differ in median, variability or the shape of the distribution.

Graphing the cumulative frequency distributions
The KS test works by comparing the two cumulative frequency distributions, but it does not graph those distributions. To do that, go back to the data table, click Analyze and choose the Frequency distribution analysis. Choose to create cumulative distributions and to tabulate relative frequencies.

Don't confuse with the KS normality test
It is easy to confuse the two-sample Kolmogorov-Smirnov test (which compares two groups) with the one-sample Kolmogorov-Smirnov test, also called the Kolmogorov-Smirnov goodness-of-fit test, which tests whether one distribution differs substantially from theoretical expectations. The one-sample test is most often used as a normality test to compare the distribution of data in a single data set with the predictions of a Gaussian distribution. Prism performs this normality test as part of the Column Statistics analysis.

Comparison with the Mann-Whitney test
The Mann-Whitney test is also a nonparametric test to compare two unpaired groups. The Mann-Whitney test works by ranking all the values from low to high and comparing the mean rank of the values in the two groups.

How Prism computes the P value
Prism first generates the two cumulative relative frequency distributions, and then asks how far apart those two distributions are at the point where they are furthest apart. Prism uses the method explained by Lehmann (2). This distance is reported as the Kolmogorov-Smirnov D. The P value is computed from this maximum distance between the cumulative frequency distributions, accounting for sample size in the two groups. With larger samples, an excellent approximation is used (2, 3). An exact method is used when the samples are small, defined by Prism to mean when the number of permutations of n1 values from n1+n2 values is less than 60,000, where n1 and n2 are the two sample sizes. Thus an exact test is used for these pairs of group sizes (the two numbers in parentheses are the numbers of values in the two groups): (2, 2), (2, 3) ... (2, 346); (3, 3), (3, 4) ... (3, 69); (4, 4), (4, 5) ... (4, 32); (5, 5), (5, 6) ... (5, 20); (6, 6), (6, 7) ... (6, 15); (7, 7), (7, 8) ... (7, 12); (8, 8), (8, 9), (8, 10); (9, 9). Prism accounts for ties in its exact algorithm (developed in-house). It systematically shuffles the actual data between two groups (maintaining sample sizes). The P value it reports is the fraction of these reshuffled data sets where the D computed from the reshuffled data is greater than or equal to the D computed from the actual data.

Introduction

When faced with a research problem, you need to collect, analyze and interpret data to
answer your research questions. Examples of research questions that could require you
to gather data include how many people will vote for a candidate, what is the best
product mix to use and how useful is a drug in curing a disease. The research problem
you explore informs the type of data you’ll collect and the data collection method you’ll
use. In this article, we will explore various types of data, methods of data collection and
advantages and disadvantages of each. After reading our review, you will have an
excellent understanding of when to use each of the data collection methods we discuss.

Types of Data

Quantitative Data
Data that is expressed in numbers and summarized using statistics to give meaningful
information is referred to as quantitative data. Examples of quantitative data we could
collect are heights, weights, or ages of students. If we obtain the mean of each set of
measurements, we have meaningful information about the average value for each of
those student characteristics.
Qualitative Data
When we use data for description without measurement, we call it qualitative data.
Examples of qualitative data are student attitudes towards school, attitudes towards
exam cheating and friendliness of students to teachers. Such data cannot be easily
summarized using statistics.
Primary Data
When we obtain data directly from individuals, objects or processes, we refer to it
as primary data. Quantitative or qualitative data can be collected using this approach.
Such data is usually collected solely for the research problem you will study. Primary
data has several advantages. First, we tailor it to our specific research question, so
there are no customizations needed to make the data usable. Second, primary data is
reliable because you control how the data is collected and can monitor its quality. Third,
by collecting primary data, you spend your resources in collecting only required data.
Finally, primary data is proprietary, so you enjoy advantages over those who cannot
access the data.
Despite its advantages, primary data also has disadvantages of which you need to be
aware. The first problem with primary data is that it is costlier to acquire as compared to
secondary data. Obtaining primary data also requires more time as compared to
gathering secondary data.

Secondary Data
When you collect data after another researcher or agency that initially gathered it makes
it available, you are gathering secondary data. Examples of secondary data are
census data published by the US Census Bureau, stock prices data published by CNN
and salaries data published by the Bureau of Labor Statistics.
One advantage to using secondary data is that it will save you time and money,
although some data sets require you to pay for access. A second advantage is the
relative ease with which you can obtain it. You can easily access secondary data from
publications, government agencies, data aggregation websites and blogs. A third
advantage is that it eliminates effort duplication, since you can identify existing data that
matches your needs instead of gathering new data.
Despite the benefits it offers, secondary data has its shortcomings. One limitation is that
secondary data may not be complete. For it to meet your research needs, you may
need to enrich it with data from other sources. A second shortcoming is that you cannot
verify the accuracy of secondary data, or the data may be outdated. A third challenge
you face when using secondary data is that documentation may be incomplete or
missing. Therefore, you may not be aware of any problems that happened in data
collection which would otherwise influence its interpretation. Another challenge you may
face when you decide to use secondary data is that there may be copyright restrictions.

Now that we’ve explained the various types of data you can collect when conducting
research, we will proceed to look at methods used to collect primary and secondary
data.

Methods Employed in Primary Data Collection

When you decide to conduct original research, the data you gather can be quantitative
or qualitative. Generally, you collect quantitative data through sample surveys,
experiments and observational studies. You obtain qualitative data through focus
groups, in-depth interviews and case studies. We will discuss each of these data
collection methods below and examine their advantages and disadvantages.

Sample Surveys
A survey is a data collection method where you select a sample of respondents from a
large population in order to gather information about that population. The process of
identifying individuals from the population who you will interview is known as sampling.
To gather data through a survey, you construct a questionnaire to prompt information
from selected respondents. When creating a questionnaire, you should keep in mind
several key considerations. First, make sure the questions and choices are
unambiguous. Second, make sure the questionnaire will be completed within a
reasonable amount of time. Finally, make sure there are no typographical errors. To
check if there are any problems with your questionnaire, use it to interview a few people
before administering it to all respondents in your sample. We refer to this process as
pretesting.
Using a survey to collect data offers you several advantages. The main benefit is time
and cost savings because you only interview a sample, not the large population.
Another benefit is that when you select your sample correctly, you will obtain
information of acceptable accuracy. Additionally, surveys are adaptable and can be
used to collect data for governments, health care institutions, businesses and any other
environment where data is needed.

A major shortcoming of surveys occurs when you fail to select a sample correctly;
without an appropriate sample, the results will not generalize accurately to the population.

Ways of Interviewing Respondents

Once you have selected your sample and developed your questionnaire, there are
several ways you can interview participants. Each approach has its advantages and
disadvantages.

In-person Interviewing
When you use this method, you meet with the respondents face to face and ask
questions. In-person interviewing offers several advantages. This technique has
excellent response rates and enables you to conduct interviews that take a longer
amount of time. Another benefit is you can ask follow-up questions to responses that
are not clear.

In-person interviews do have disadvantages of which you need to be aware. First, this
method is expensive and takes more time because of interviewer training, transport,
and remuneration. A second disadvantage is that some areas of a population, such as
neighborhoods prone to crime, cannot be accessed which may result in bias.

Telephone Interviewing
Using this technique, you call respondents over the phone and interview them. This
method offers the advantage of quickly collecting data, especially when used with
computer-assisted telephone interviewing. Another advantage is that collecting data via
telephone is cheaper than in-person interviewing.
One of the main limitations of telephone interviewing is that it is hard to gain the trust of
respondents. For this reason, you may not get responses, or bias may be introduced.
Since phone interviews are generally kept short to reduce the possibility of upsetting
respondents, this method may also limit the amount of data you can collect.

Online Interviewing
With online interviewing, you send an email inviting respondents to participate in an
online survey. This technique is used widely because it is a low-cost way of interviewing
many respondents. Another benefit is anonymity; you can get sensitive responses that
participants would not feel comfortable providing with in-person interviewing.

When you use online interviewing, you face the disadvantage of not getting a
representative sample. You also cannot seek clarification on responses that are
unclear.

Mailed Questionnaire
When you use this interviewing method, you send a printed questionnaire to the postal
address of the respondent. The participants fill in the questionnaire and mail it back.
This interviewing method gives you the advantage of obtaining information that
respondents may be unwilling to give when interviewing in person.

The main limitation with mailed questionnaires is you are likely to get a low response
rate. Keep in mind that inaccuracy in mailing address, delays or loss of mail could also
affect the response rate. Additionally, mailed questionnaires cannot be used to interview
respondents with low literacy, and you cannot seek clarifications on responses.

Focus Groups
When you use a focus group as a data collection method, you identify a group of 6 to 10
people with similar characteristics. A moderator then guides a discussion to identify
attitudes and experiences of the group. The responses are captured by video recording,
voice recording or writing—this is the data you will analyze to answer your research
questions. Focus groups have the advantage of requiring fewer resources and time as
compared to interviewing individuals. Another advantage is that you can request
clarifications to unclear responses.

One disadvantage you face when using focus groups is that the sample selected may
not represent the population accurately. Furthermore, dominant participants can
influence the responses of others.

Observational Data Collection Methods

In an observational data collection method, you acquire data by observing any
relationships that may be present in the phenomenon you are studying. There are four
types of observational methods that are available to you as a researcher: cross-
sectional, case-control, cohort and ecological.

In a cross-sectional study, you only collect data on observed relationships once. This
method has the advantage of being cheaper and taking less time as compared to case-
control and cohort. However, cross-sectional studies can miss relationships that may
arise over time.
Using a case-control method, you create cases and controls and then observe them. A
case has been exposed to a phenomenon of interest while a control has not. After
identifying the cases and controls, you move back in time to observe how your event of
interest occurs in the two groups. This is why case-control studies are referred to as
retrospective. For example, suppose a medical researcher suspects a certain type of
cosmetic is causing skin cancer. You recruit people who have used a cosmetic, the
cases, and those who have not used the cosmetic, the controls. You request
participants to remember the type of cosmetic and the frequency of its use. This method
is cheaper and requires less time as compared to the cohort method. However, this
approach has limitations when individuals you are observing cannot accurately recall
information. We refer to this as recall bias because you rely on the ability of participants
to remember information. In the cosmetic example, recall bias would occur if
participants cannot accurately remember the type of cosmetic and number of times
used.
In a cohort method, you follow people with similar characteristics over a period. This
method is advantageous when you are collecting data on occurrences that happen over
a long period. It has the disadvantage of being costly and requiring more time. It is also
not suitable for occurrences that happen rarely.
The three methods we have discussed previously collect data on individuals. When you
are interested in studying a population instead of individuals, you use
an ecological method. For example, say you are interested in lung cancer rates in Iowa
and North Dakota. You obtain number of cancer cases per 1000 people for each state
from the National Cancer Institute and compare them. You can then hypothesize
possible causes of differences between the two states. When you use the ecological
method, you save time and money because data is already available. However, the data
collected may lead you to infer population-level relationships that do not exist at the
individual level (the ecological fallacy).

Experiments

An experiment is a data collection method where you as a researcher change some
variables and observe their effect on other variables. The variables that you manipulate
are referred to as independent while the variables that change as a result of
manipulation are dependent variables. Imagine a manufacturer is testing the effect of
drug strength on number of bacteria in the body. The company decides to test drug
strength at 10mg, 20mg and 40mg. In this example, drug strength is the independent
variable while number of bacteria is the dependent variable. The drug administered is
the treatment, while 10mg, 20mg and 40mg are the levels of the treatment.
The greatest advantage of using an experiment is that you can explore causal
relationships that an observational study cannot. Additionally, experimental research
can be adapted to different fields like medical research, agriculture, sociology, and
psychology. Nevertheless, experiments have the disadvantage of being expensive and
requiring a lot of time.

Summary

This article introduced you to the various types of data you can collect for research
purposes. We discussed quantitative, qualitative, primary and secondary data and
identified the advantages and disadvantages of each data type. We also reviewed
various data collection methods and examined their benefits and drawbacks. Having
read this article, you should be able to select the data collection method most
appropriate for your research question. Data is the evidence that you use to solve your
research problem. When you use the correct data collection method, you get the right
data to solve your problem.
