Data Management
What is Data?
Data are raw facts that become useful information when organized in a meaningful way. Data can be
qualitative or quantitative in nature.
Planning and Conducting an Experiment or Study
A. Methods of data collection
1. Census – this is the procedure of systematically acquiring and recording information about all members of
a given population. Researchers rarely survey the entire population for two (2) reasons: the cost is too
high and the population is dynamic in that the individuals making up the population may change over time.
2. Sample Survey – sampling is a selection of a subset within a population, to yield some knowledge about
the population of concern. The three main advantages of sampling are that (i) the cost is lower, (ii) data
collection is faster, and (iii) since the data set is smaller, it is possible to improve the accuracy and quality
of the data.
3. Experiment – this is performed when there are some controlled variables (such as a particular medical
treatment) and the intention is to study their effect on other observed variables (such as the health of
patients). One of the main requirements of an experiment is that it can be replicated.
4. Observational study – this is appropriate when there are no controlled variables and replication is
impossible. This type of study typically uses a survey. An example is one that explores the correlation
between smoking and lung cancer. In this case, the researchers would collect observations of both
smokers and non-smokers and then look for the number of cases of lung cancer in each group.
2. Sampling Methods
a. Nonprobability sampling – is any sampling method where some elements of the population have no
chance of selection or where the probability of selection can’t be accurately determined. The selection
of elements is based on some criteria other than randomness. These conditions give rise to exclusion
04 Handout *Property of
STI
GE1707
bias, caused by the fact that some elements of the population are excluded. Nonprobability sampling
does not allow the estimation of sampling errors. Information about the relationship between sample
and population is limited, making it difficult to extrapolate from the sample to the population.
Example: We visit every household in a given street, and interview the first person to answer the door.
In any household with more than one occupant, this is a nonprobability sample, because some people
are more likely to answer the door (e.g. an unemployed person who spends most of their time at
home is more likely to answer than an employed housemate who might be at work when the
interviewer calls) and it’s not practical to calculate these probabilities.
In addition, nonresponse effects may turn any probability design into a nonprobability design if the
characteristics of nonresponse are not well understood, since nonresponse effectively modifies each
element’s probability of being sampled.
b. Probability Sampling – it is possible both to determine which sampling units belong to which sample
and to determine the probability that each sample will be selected. The following sampling methods are
examples of probability sampling:
i. Simple Random Sampling (SRS) – all samples of a given size have an equal probability of being
selected and selections are independent. The frame is not subdivided or partitioned. The sample
variance is a good indicator of the population variance, which makes it relatively easy to estimate
the accuracy of results.
However, SRS can be vulnerable to sampling error because the randomness of the selection may
result in a sample that doesn’t reflect the makeup of the population. For instance, a simple
random sample of ten people from a given country will on average produce five men and five
women, but any given trial is likely to overrepresent one sex and underrepresent the other.
Systematic and stratified techniques, discussed below, attempt to overcome this problem by using
information about the population to choose a more representative sample.
In some cases, investigators are interested in research questions specific to subgroups of the
population. For example, researchers might be interested in examining whether cognitive ability
as a predictor of job performance is equally applicable across racial groups. SRS cannot
accommodate the needs of researchers in this situation because it does not provide subsamples
of the population. Stratified sampling, which is discussed below, addresses this weakness of
SRS.
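The behaviour described above can be sketched with a short simulation. This is only an illustration under stated assumptions: the population of 500 men and 500 women, the number of trials, and the function name simple_random_sample are all hypothetical and are not taken from the handout.

```python
import random


def simple_random_sample(frame, n, rng):
    """Draw n distinct units; every subset of size n is equally likely."""
    return rng.sample(frame, n)


# Hypothetical population (an assumption for illustration): 500 men, 500 women.
population = ["M"] * 500 + ["F"] * 500

rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 2000
unbalanced = 0
for _ in range(trials):
    sample = simple_random_sample(population, 10, rng)
    # A sample of ten is "balanced" only if it holds exactly five men.
    if sample.count("M") != 5:
        unbalanced += 1

share_unbalanced = unbalanced / trials
print(share_unbalanced)
```

Although the average sample is balanced, most individual samples of ten are not: in theory about three samples in four contain something other than a five-five split.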
ii. Systematic Sampling – relies on arranging the target population in an ordered list, dividing the
list into consecutive groups of equal size, and then selecting one element at random from the first
group and the corresponding elements from all other groups. A simple example would be to select
every 10th name from the telephone directory, with the first selection being random. SRS may, by
chance, select a sample concentrated at the beginning of the list. Systematic sampling helps to
spread the sample over the list.
As long as the starting point is randomized, systematic sampling is a type of probability sampling.
“Every 10th” sampling is especially useful for efficient sampling from databases.
However, systematic sampling is especially vulnerable to periodicities in the list. Consider a street
where the odd-numbered houses are all on one side of the road, and the even-numbered houses
are all on the other side. Under systematic sampling with an even interval (such as every 10th
house), the houses sampled will all be either odd-numbered or even-numbered. Another drawback
of systematic sampling is that even in scenarios where it is more accurate than SRS, its theoretical
properties make it difficult to quantify that accuracy.
Systematic sampling is not SRS because different samples of the same size have different
selection probabilities, e.g. the set (4, 14, 24, ...) has a one-in-ten probability of selection, but the
set (4, 13, 24, 34, ...) has zero probability of selection.
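A minimal sketch of "every 10th" sampling with a random start follows. The frame of 1000 numbered houses and the function name systematic_sample are assumptions made for illustration only.

```python
import random


def systematic_sample(frame, k, seed=None):
    """Pick a random start among the first k units, then take every k-th unit."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # random position 0..k-1 in the first group
    return frame[start::k]


# Hypothetical frame (an assumption): houses numbered 1 to 1000 along a street.
frame = list(range(1, 1001))
sample = systematic_sample(frame, k=10, seed=3)

# With an even interval every sampled house number has the same parity,
# which is the periodicity problem described in the text.
parities = {house % 2 for house in sample}
print(len(sample), parities)
```

Only ten distinct samples can ever be drawn this way (one per starting position), which is why a set that mixes two starting positions has zero probability of selection.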
iii. Stratified Sampling – when the population embraces a number of distinct categories, the frame
can be organized by these categories into separate “strata”. Each stratum is then sampled as an
independent sub-population. Dividing the population into strata can enable researchers to draw
inferences about specific subgroups that may be lost in a more generalized random sample.
Since each stratum is treated as an independent population, different sampling approaches can
be applied to different strata. However, implementing such an approach can increase the cost and
complexity of sample selection. Example: To determine the proportions of defective products
being assembled in a factory.
A stratified sampling approach is most effective when three conditions are met:
a. Variability within strata is minimized
b. Variability between strata is maximized
c. The variables upon which the population is stratified are strongly correlated with the desired
dependent variable (e.g. beer consumption is strongly correlated with gender).
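Stratified sampling can be sketched as below. The handout's factory example does not say what the strata are, so stratifying by assembly line, the line names, the batch sizes and the 10% sampling fraction are all assumptions for illustration; the sketch uses proportional allocation with SRS inside each stratum, which is one option among several.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, fraction, seed=None):
    """Split units into strata, then draw a simple random sample in each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = {}
    for name, members in strata.items():
        # Proportional allocation: the same fraction from every stratum.
        n = max(1, round(fraction * len(members)))
        sample[name] = rng.sample(members, n)
    return sample


# Hypothetical factory output (an assumption): products tagged by assembly line.
products = (
    [("line_A", i) for i in range(600)]
    + [("line_B", i) for i in range(300)]
    + [("line_C", i) for i in range(100)]
)
sample = stratified_sample(products, stratum_of=lambda p: p[0], fraction=0.1, seed=7)
sizes = {name: len(members) for name, members in sample.items()}
print(sizes)
```

Because each stratum is sampled separately, every assembly line is guaranteed to appear in the sample, which SRS alone does not guarantee.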
iv. Cluster Sampling – sometimes it is cheaper to ‘cluster’ the sample in some way (e.g. by
selecting respondents from certain areas only, or certain time-periods only). Cluster sampling is
an example of two-stage random sampling: in the first stage a random sample of areas is chosen;
in the second stage a random sample of respondents within those areas is selected. This works
best when each cluster is a small copy of the population.
This can reduce travel and other administrative costs. Cluster sampling generally increases the
variability of sample estimates above that of simple random sampling, depending on how the
clusters differ between themselves, as compared with the within-cluster variation. If clusters
chosen are biased in a certain way, inferences drawn about population parameters will be
inaccurate.
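The two stages of cluster sampling can be sketched as follows. The 20 areas of 50 residents each, the sample sizes and the function name two_stage_cluster_sample are hypothetical values chosen only to illustrate the idea.

```python
import random


def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: random sample of clusters. Stage 2: random sample within each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {
        name: rng.sample(clusters[name], min(n_per_cluster, len(clusters[name])))
        for name in chosen
    }


# Hypothetical population (an assumption): 20 areas with 50 residents each.
clusters = {
    f"area_{i}": [f"area_{i}_resident_{j}" for j in range(50)] for i in range(20)
}
sample = two_stage_cluster_sample(clusters, n_clusters=4, n_per_cluster=5, seed=11)
for area, respondents in sample.items():
    print(area, respondents)
```

Interviewers only need to visit the four chosen areas, which is where the cost saving comes from; the price is that residents of the other sixteen areas cannot appear in this sample.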
v. Matched random sampling – in this method, there are two (2) samples in which the members
are clearly paired, or are matched explicitly by the researcher (for example, IQ measurements or
pairs of identical twins). Alternatively, the same attribute, or variable, may be measured twice on
each subject, under different circumstances (e.g. the milk yields of cows before and after being
fed a particular diet).
a. Control
To be able to compare effects and make inferences about associations or predictions, one typically
has to subject different groups to different conditions. Usually, an experimental group is subjected
to a treatment and a control group is not.
b. Random Assignments
The second fundamental design principle is randomization of allocation of (controlled variables)
treatments to units. The treatment effects, if present, will be similar within each group.
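Random assignment of treatments to units can be sketched as below. The 20 unit labels and the even split into two groups are assumptions for illustration; real designs may use more groups or unequal sizes.

```python
import random


def randomly_assign(units, seed=None):
    """Shuffle the units and split them into a treatment and a control group."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}


# Hypothetical experimental units (an assumption): 20 labelled subjects.
units = [f"unit_{i}" for i in range(20)]
groups = randomly_assign(units, seed=2)
print(groups["treatment"])
print(groups["control"])
```

Because chance alone decides who receives the treatment, neither the researcher nor the subjects can systematically steer particular units into one group.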
c. Replication
All measurements, observations or data collected are subject to variation, as there are no
completely deterministic processes. To reduce the effect of this variability, the measurements
in the experiment must be repeated. The experiment itself should also allow for replication, so
that it can be checked by other researchers.
d. Blocking – is the arranging of experimental units in groups (blocks) that are similar to one another.
Typically, a blocking factor is a source of variability that is not of primary interest to the
experimenter. An example of a blocking factor might be the sex of a patient; by blocking on sex
(that is, comparing men to men and women to women), this source of variability is controlled for,
thus leading to greater precision.
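Blocking on sex, as in the example above, can be combined with random assignment by randomizing separately inside each block. The sketch below assumes 20 hypothetical patients, ten of each sex; the names and the function blocked_assignment are illustrative only.

```python
import random
from collections import defaultdict


def blocked_assignment(units, block_of, seed=None):
    """Group units into blocks, then randomize treatment/control within each block."""
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for unit in units:
        blocks[block_of(unit)].append(unit)
    assignment = {}
    for members in blocks.values():
        shuffled = list(members)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        for unit in shuffled[:half]:
            assignment[unit] = "treatment"
        for unit in shuffled[half:]:
            assignment[unit] = "control"
    return assignment


# Hypothetical patients (an assumption): (name, sex) pairs, ten of each sex.
patients = [(f"patient_{i}", "male" if i % 2 == 0 else "female") for i in range(20)]
assignment = blocked_assignment(patients, block_of=lambda p: p[1], seed=5)

treated_by_sex = defaultdict(int)
for (name, sex), group in assignment.items():
    if group == "treatment":
        treated_by_sex[sex] += 1
print(dict(treated_by_sex))
```

Each sex ends up split evenly between treatment and control, so a difference between the groups cannot be explained by one group simply containing more men or more women.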
Chi-Square
The chi-square test is used to determine whether there is a significant difference between the expected
frequencies and the observed frequencies in one or more categories.
There are two (2) types of chi-square tests, the goodness of fit test and the test of independence. Both use
the chi-square statistic and distribution, but for different purposes.
Example: Researchers have conducted a survey of 1600 coffee drinkers asking how much coffee they
drink in order to confirm previous studies. Previous studies have indicated that 72% of Americans drink
coffee. Below are the results of previous studies (left) and the survey (right). At $\alpha = 0.05$, is there enough
evidence to conclude that the distributions are the same?
a. The null hypothesis $H_0$: the population frequencies are equal to the expected frequencies.
b. The alternative hypothesis $H_a$: the null hypothesis is false.
c. $\alpha = 0.05$
d. The degrees of freedom: $k - 1 = 4 - 1 = 3$
e. The test statistic can be calculated using the table below:
Response        | % of Coffee Drinkers | E (expected)      | O (observed) | O − E | (O − E)^2 | (O − E)^2 / E
2 cups per week | 15%                  | 0.15 × 1600 = 240 | 206          | −34   | 1156      | 4.817
1 cup per week  | 13%                  | 0.13 × 1600 = 208 | 193          | −15   | 225       | 1.082
1 cup per day   | 27%                  | 0.27 × 1600 = 432 | 462          | 30    | 900       | 2.083
2+ cups per day | 45%                  | 0.45 × 1600 = 720 | 739          | 19    | 361       | 0.5014
$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \sum \frac{(O - E)^2}{E} = 8.483$$
f. From $\alpha = 0.05$ and $k - 1 = 3$, the critical value is 7.815.
g. Is there enough evidence to reject $H_0$? Since $\chi^2 \approx 8.483 > 7.815$, there is enough statistical
evidence to reject the null hypothesis and to believe that the old percentages no longer hold.
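The arithmetic of the coffee example can be checked with a few lines of code. The percentages, observed counts and the critical value 7.815 are taken from the example above; the function name chi_square_statistic is only an illustrative choice, and the critical value is entered by hand rather than computed.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


n = 1600                                   # survey size from the example
proportions = [0.15, 0.13, 0.27, 0.45]     # percentages from previous studies
observed = [206, 193, 462, 739]            # survey counts from the example
expected = [p * n for p in proportions]    # E = proportion x sample size

chi2 = chi_square_statistic(observed, expected)
degrees_of_freedom = len(observed) - 1     # k - 1
critical_value = 7.815                     # table value quoted in the example
reject_h0 = chi2 > critical_value

print(round(chi2, 3), degrees_of_freedom, reject_h0)
```

The statistic agrees with the hand calculation (about 8.483), and since it exceeds the critical value the null hypothesis is rejected.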
Test of Independence
The chi-square test of independence is used to assess if two (2) factors are related. This test is often used in
social science research to determine if factors are independent of each other. For example, we would use this
test to determine relationships between voting patterns and race, income and gender, and behavior and
education.
In general, when running the test of independence, we ask, “Is Variable X independent of Variable Y?” It is
important to note that this test does not test how the variables are related, just simply whether or not they are
independent of one another. For example, while the test of independence can help us determine if income and
gender are independent, it cannot help us assess how one category might affect the other.
Just as with a goodness of fit test, we will calculate expected values, calculate the chi-square statistic, and
compare it to the appropriate chi-square value from a reference table to see if we should reject $H_0$, which is
that the variables are not related. Formally, the hypothesis statements for the chi-square test of independence are:
$H_0$: There is no association between the two (2) categorical variables
$H_a$: There is an association (the two (2) variables are not independent)
An experiment is conducted in which the frequencies for two (2) variables are determined. To use the test, the
same assumptions must be satisfied: the observed frequencies are obtained through a simple random
sample, and each expected frequency is at least 5. The frequencies are written down in a table: the columns
contain outcomes for one (1) variable, and the rows contain outcomes for the other variable.
The procedure for the hypothesis test is essentially the same. The differences are that:
a. $H_0$ is that the two (2) variables are independent.
b. $H_a$ is that the two (2) variables are not independent (they are dependent).
c. The expected frequency $E_{r,c}$ for the entry in row $r$, column $c$ is calculated using:
$$E_{r,c} = \frac{(\text{sum of row } r) \times (\text{sum of column } c)}{\text{sample size}}$$
d. The degrees of freedom: $(\text{number of rows} - 1) \times (\text{number of columns} - 1)$
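The expected-frequency and degrees-of-freedom rules above can be sketched in code. The 2 × 3 table of counts below is entirely hypothetical (it is not the data of any example in this handout), and the function names are illustrative assumptions.

```python
def expected_frequencies(table):
    """E[r][c] = (sum of row r) * (sum of column c) / sample size."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    return [[r * c / n for c in col_sums] for r in row_sums]


def degrees_of_freedom(table):
    """(number of rows - 1) * (number of columns - 1)."""
    return (len(table) - 1) * (len(table[0]) - 1)


def chi_square_independence(table):
    """Sum of (O - E)^2 / E over every cell of the table."""
    expected = expected_frequencies(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Hypothetical counts (an assumption): 2 groups (rows) by 3 outcomes (columns).
table = [
    [30, 20, 50],
    [20, 30, 50],
]
expected = expected_frequencies(table)
chi2 = chi_square_independence(table)
df = degrees_of_freedom(table)
print(expected, chi2, df)
```

The resulting statistic is then compared with the critical value for the chosen significance level and these degrees of freedom, exactly as in the goodness of fit procedure.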
Example: The results of a random sample of children with pain from musculoskeletal injuries treated
with acetaminophen, ibuprofen, or codeine are shown in the table. At $\alpha = 0.10$, is there enough
evidence to conclude that the treatment and result are independent?
Example: A doctor believes that the proportions of births in this country on each day of the week are equal. A
simple random sample of 700 births from a recent year is selected, and the results are below. At a significance
level of 0.01, is there enough evidence to support the doctor’s claim?
a. The null hypothesis $H_0$: the population frequencies are equal to the expected frequencies.
b. The alternative hypothesis $H_a$: the null hypothesis is false.
c. $\alpha = 0.01$
d. The degrees of freedom: $k - 1 = 7 - 1 = 6$
e. The test statistic can be calculated using a table:
REFERENCES:
Almukkahal R., Ottman L., DeLancey D., Evans A., Lawsky E., & Meery B. (2016). CK12 advance probability
and statistics. Flexbook Next Generation Textbooks.
Sampling and experimentation: planning and conducting a study (n.d.) Retrieved from
https://www.scribd.com/document/51105391/Planning-and-Conducting-a-Study-for-AP-Statistics