
Data Management

What is Data?

Data are raw information or facts that become useful information
when organized in a meaningful way. Data can be qualitative or
quantitative in nature.
What is Data Management?

Data Management is concerned with “looking after” and processing data. It
involves the following:
• Looking after field data sheets
• Checking and correcting the raw data
• Preparing data for analysis
• Documenting and archiving the data and
meta-data
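
The "checking and correcting the raw data" step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the field name "value", and the valid range are hypothetical and not part of the original material.

```python
def find_problem_records(records, valid_range):
    """Return (index, reason) pairs for records that need attention."""
    low, high = valid_range
    problems = []
    for i, rec in enumerate(records):
        value = rec.get("value")  # "value" is a hypothetical field name
        if value is None:
            problems.append((i, "missing"))
        elif not (low <= value <= high):
            problems.append((i, "out of range"))
    return problems

# Hypothetical field data sheet entries, for illustration only
records = [
    {"id": 1, "value": 12.5},
    {"id": 2, "value": None},
    {"id": 3, "value": 250.0},
]
problems = find_problem_records(records, valid_range=(0, 100))
```

Flagging records instead of silently correcting them keeps the raw data intact, so that every correction can be documented along with the metadata.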
Importance of Data Management
• Ensures that data for analysis are of high quality
so that conclusions are correct
• Good data management allows further use of the
data in the future and enables efficient
integration of results with other studies.
• Good data management leads to improved
processing efficiency, improved data quality, and
improved meaningfulness of the data.
Planning and Conducting an Experiment or Study
A. Methods of data collection
1. Census – this is the procedure of systematically acquiring and recording information
about all members of a given population. Researchers rarely survey the entire
population for two (2) reasons: the cost is too high and the population is dynamic in
that the individuals making up the population may change over time.
2. Sample Survey – sampling is a selection of a subset within a population, to yield
some knowledge about the population of concern. The three main advantages of
sampling are that (i) the cost is lower, (ii) data collection is faster, and (iii) since the
data set is smaller, it is possible to improve the accuracy and quality of the data.
3. Experiment – this is performed when there are some controlled variables (like
certain treatment in medicine) and the intention is to study their effect on other
observed variables (like health of patients). One of the main requirements of an
experiment is that it can be replicated.
4. Observational study – this is appropriate when there are no controlled variables and
replication is impossible. This type of study typically uses a survey. An example is one
that explores the correlation between smoking and lung cancer. In this case, the
researchers would collect observations of both smokers and non-smokers and then
look for the number of cases of lung cancer in each group.
B. Planning and Conducting Surveys

1. Characteristics of a well-designed and well-conducted survey
a. A good survey must be representative of the population.
b. To support probabilistic conclusions, the selection must incorporate chance, for example
through a random number generator. Often we don’t have a complete listing of the population,
so we have to be careful about exactly how we are applying “chance”. Even when the frame is
correctly specified, the subjects may choose not to respond or may not be able to respond.
c. The wording of the question must be neutral; subjects give different answers depending on the
phrasing.
d. Possible sources of errors and biases should be controlled. The population of concern as a
whole may not be available for a survey. The subset of items that can be measured is called
the sampling frame (from which the sample will be selected). The plan of the survey should
specify a sampling method, determine the sample size, and lay out the steps for implementing
the sampling plan and collecting the data.
2. Sampling Methods
a. Nonprobability sampling – is any sampling method where some elements of the
population have no chance of selection or where the probability of selection can’t
be accurately determined.

Example: We visit every household in a given street, and interview the first person to
answer the door. In any household with more than one occupant, this is a nonprobability
sample, because some people are more likely to answer the door (e.g. an unemployed
person who spends most of their time at home is more likely to answer than an
employed housemate who might be at work when the interviewer calls) and it’s not
practical to calculate these probabilities.

One example of nonprobability sampling is convenience sampling (customers in a
supermarket are asked questions). Another is quota sampling, when judgment is used to
select the subjects based on specified proportions. For example, an interviewer may be
told to sample 200 females and 300 males between the ages of 45 and 60.

In addition, nonresponse effects may turn any probability design into a nonprobability
design if the characteristics of nonresponse are not well understood, since nonresponse
effectively modifies each element’s probability of being sampled.
b. Probability Sampling – it is possible to determine both which sampling units
belong to which sample and the probability that each sample will be selected.
The following sampling methods are examples of probability sampling:
i. Simple Random Sampling (SRS) – all samples of a given size have an
equal probability of being selected and selections are independent. The
frame is not subdivided or partitioned. The sample variance is a good
indicator of the population variance, which makes it relatively easy to
estimate the accuracy of results.
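
A minimal sketch of simple random sampling, assuming the frame is available as a Python list. The frame contents, the function name, and the seed are hypothetical and chosen only for illustration.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units; every subset of size n is equally likely."""
    rng = random.Random(seed)  # seed is optional, used here for repeatability
    return rng.sample(frame, n)

# Hypothetical frame of 250 units
frame = [f"unit_{i}" for i in range(1, 251)]
sample = simple_random_sample(frame, 25, seed=1)
```

Note that `random.sample` draws without replacement, so no unit can appear twice in the sample.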

ii. Systematic Sampling – relies on dividing the target population into strata
(subpopulations) of equal size and then selecting randomly one element from the
first stratum and corresponding elements from all other strata. A simple example
would be to select every 10th name from the telephone directory, with the first
selection being random. By chance, SRS may draw most of a sample from the beginning
of the list; systematic sampling helps to spread the sample over the list.
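
The "every 10th name" idea can be sketched as follows, assuming the list is held in memory. The list of names and the function name are hypothetical.

```python
import random

def systematic_sample(frame, k, seed=None):
    """Pick a random start among the first k units, then take every k-th unit."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # random position within the first interval
    return frame[start::k]

# Hypothetical directory of 100 names
names = [f"name_{i}" for i in range(100)]
sample = systematic_sample(names, 10, seed=0)
```

Only the starting point is random; all other selections follow from it, which is what spreads the sample evenly over the list.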
iii. Stratified Sampling – when the population embraces a number
of distinct categories, the frame can be organized by these
categories into separate “strata”. Each stratum is then sampled as
an independent sub-population. Dividing the population into strata
can enable researchers to draw inferences about specific
subgroups that may be lost in a more generalized random sample.

Example: To determine the proportions of defective products being
assembled in a factory.
A stratified sampling approach is most effective when three
conditions are met:
a. Variability within strata is minimized
b. Variability between strata is maximized
c. The variables upon which the population is stratified are strongly
correlated with the desired dependent variable (beer
consumption is strongly correlated with gender).
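
A minimal sketch of stratified sampling, in which each stratum is sampled independently. The dorm names, dorm sizes, and the choice of an equal sample size per stratum are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, n_per_stratum, seed=None):
    """Group units into strata, then draw a simple random sample from each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    return {
        name: rng.sample(members, min(n_per_stratum, len(members)))
        for name, members in strata.items()
    }

# Hypothetical population: residents tagged with their dorm
dorm_sizes = {"A": 50, "B": 80, "C": 35}
residents = [(dorm, i) for dorm, size in dorm_sizes.items() for i in range(size)]
sample = stratified_sample(residents, lambda r: r[0], 20, seed=3)
```

In practice the sample size per stratum is often made proportional to the stratum size; the equal allocation used here is just the simplest choice.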
iv. Cluster Sampling – sometimes it is cheaper to ‘cluster’ the
sample in some way (e.g. by selecting respondents from certain
areas only, or certain time-periods only). Cluster sampling is an
example of two-stage random sampling: in the first stage a random
sample of areas is chosen; in the second stage a random sample of
respondents within those areas is selected. This works best when
each cluster is a small copy of the population.
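
The two stages described above can be sketched like this, assuming the clusters are given as a dictionary from area to respondents. The area labels and respondent names are hypothetical.

```python
import random

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: sample clusters at random. Stage 2: sample units within each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {
        c: rng.sample(clusters[c], min(n_per_cluster, len(clusters[c])))
        for c in chosen
    }

# Hypothetical areas, each with 30 respondents
clusters = {
    f"area_{a}": [f"area_{a}_person_{p}" for p in range(30)] for a in range(12)
}
sample = two_stage_cluster_sample(clusters, n_clusters=4, n_per_cluster=5, seed=7)
```

Setting `n_per_cluster` at least as large as the biggest cluster surveys everyone in the chosen clusters, which is the one-stage variant of the same idea.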

v. Matched random sampling – in this method, there are two (2)
samples in which the members are clearly paired, or are matched
explicitly by the researcher (for example, IQ measurements or pairs
of identical twins). Alternatively, the same attribute, or variable, may
be measured twice on each subject, under different circumstances
(e.g. the milk yields of cows before and after being fed a particular
diet).
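
With matched data, the analysis usually works on the difference within each pair. The sketch below uses invented before-and-after values purely for illustration; they are not real measurements.

```python
from statistics import mean

# Hypothetical milk yields for the same four cows (made-up numbers)
before = [20.0, 22.5, 19.0, 24.0]
after = [21.5, 23.0, 20.0, 26.0]

# One difference per subject: the pairing removes cow-to-cow variation
diffs = [a - b for b, a in zip(before, after)]
mean_diff = mean(diffs)
```

Because each difference comes from a single subject, variation between subjects does not inflate the comparison.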
C. Planning and conducting experiments
1. Characteristics of a well-designed and well-conducted experiment
A good statistical experiment includes:
a. Stating the purpose of research, including estimates regarding the size of treatment
effects, alternative hypotheses, and the estimated experimental variability. Experiments
must compare the new treatment with at least one (1) standard treatment, to allow an
unbiased estimate of the difference in treatment effects.
b. Design of experiments, using blocking (to reduce the influence of confounding
variables) and randomized assignment of treatments to subjects
c. Examining the data set in secondary analyses, to suggest new hypotheses for future
study
d. Documenting and presenting the results of the study
Example: Experiments on humans can change their behavior. The famous Hawthorne
study examined changes to the working environment at the Hawthorne plant of the
Western Electric Company. The researchers first measured the productivity in the plant,
then modified the illumination in an area of the plant and found that productivity improved.
However, the study is criticized today for the lack of a control group and blinding. Those
in the Hawthorne study became more productive not because the lighting was changed
but because they were being observed.
2. Treatment, control groups, experimental units, random assignments and replication
a. Control groups and experimental units
To be able to compare effects and make inferences about
associations or predictions, one typically has to subject different
groups to different conditions. Usually, the experimental units in a
treatment group receive the treatment and those in a control group do not.
b. Random Assignments
The second fundamental design principle is the random allocation
of treatments (the controlled variables) to units. With random
allocation, the treatment effects, if present, will be similar within each group.
c. Replication
All measurements, observations or data collected are subject to
variation, as there are no completely deterministic processes. To
reduce variability, the measurements in the experiment must be
repeated. The experiment itself should also allow for replication,
so that it can be checked by other researchers.
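
The random assignment described in item b can be sketched as a shuffle followed by a split. The patient labels, the group names, and the even split are assumptions for illustration.

```python
import random

def randomize_assignment(units, seed=None):
    """Shuffle the units, then split them into treatment and control groups."""
    rng = random.Random(seed)
    shuffled = list(units)  # copy, so the caller's list is left unchanged
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}

# Hypothetical experimental units
patients = [f"patient_{i}" for i in range(20)]
groups = randomize_assignment(patients, seed=42)
```

Recording the seed is one simple way to make the allocation itself reproducible for other researchers.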
3. Sources of bias and confounding, including placebo effect and blinding
Sources of bias specific to medicine are confounding variables and placebo effects,
among others.

a. Confounding – a confounding variable is an extraneous variable in a statistical model
that correlates (positively or negatively) with both the dependent variable and the
independent variable. The methodologies of scientific studies therefore need to control
for these factors to avoid a false positive (Type I) error (an erroneous conclusion that
the dependent variable is in a causal relationship with the independent variable).

Example: Consider the statistical relationship between ice cream sales and drowning
deaths. These two (2) variables have a positive correlation because both occur more
often during summer. However, it would be wrong to conclude that there is a cause-and-
effect relation between them.
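
The ice cream example can be illustrated numerically. In the sketch below every number is invented: both series are constructed to rise with temperature (the confounder), and neither is computed from the other.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical data: temperature drives both series
temperature = [10, 15, 20, 25, 30]
ice_cream_sales = [31, 44, 60, 76, 89]
drownings = [1.2, 1.4, 2.1, 2.3, 3.1]

r = pearson_r(ice_cream_sales, drownings)  # strongly positive
```

The strong correlation arises only because both variables follow temperature, which is why a correlation alone cannot establish cause and effect.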

b. Placebo and blinding – a placebo is an imitation pill identical to the actual treatment
pill, but without the treatment ingredients. A placebo effect is a sham (or simulated) effect:
the medical intervention has no direct health impact, yet the medical condition actually
improves because the patients believe they were treated. Blinding, in which subjects are not
told whether they receive the treatment or the placebo, guards against this effect.

c. Blocking – is the arranging of experimental units in groups (blocks) that are similar to
one another. Typically, a blocking factor is a source of variability that is not of primary
interest to the experimenter.
4. Completely randomized design, randomized block design and matched pairs
a. Completely randomized designs – are for studying the effects of
one primary factor without the need to take other nuisance variables
into account. The experiment compares the values of a response
variable (like health improvement) based on the different levels of that
primary factor (e.g., different amounts of medication).

b. Randomized block design – is a collection of completely randomized
experiments, each run within one of the blocks of the total experiment. A
matched pairs design is its special case when the blocks consist of just
two (2) elements (measurements on the same patient before and after the
treatment or measurements on two (2) different but in some way similar
patients).
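
A minimal sketch of a randomized block design, assuming the blocks are already formed. Treatments are randomized separately inside each block; the block names, patient labels, and treatment names are hypothetical.

```python
import random

def randomized_block_assignment(blocks, treatments, seed=None):
    """Run a separate randomization of treatments inside every block."""
    rng = random.Random(seed)
    assignment = {}
    for block_name, units in blocks.items():
        shuffled = list(units)
        rng.shuffle(shuffled)
        # Cycle through the treatments so each block stays balanced
        for i, unit in enumerate(shuffled):
            assignment[unit] = (block_name, treatments[i % len(treatments)])
    return assignment

# Hypothetical blocks formed on a nuisance variable (age group)
blocks = {
    "younger": ["p1", "p2", "p3", "p4"],
    "older": ["p5", "p6", "p7", "p8"],
}
assignment = randomized_block_assignment(blocks, ["drug", "placebo"], seed=5)
```

With blocks of exactly two units and two treatments, the same function produces a matched pairs design.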
Determine which kind of sampling was used in each of the following scenarios:
1. To evaluate employee compensation, choose a random sample
of 10 zip codes in the state, then survey all businesses within each
chosen zip code about their benefit package.
2. To determine the quality of education at the University of Utah, a
UNID number is chosen at random, then every 1000th student is
evaluated until 30 students are selected.
3. The names of 25 employees are being chosen out of a hat from
a company of 250 employees.
4. The same study participants are measured before and after an
intervention.
5. To determine the quality of on-campus housing, 20 residents
from each dorm were chosen to complete a survey.
Answers:
1. Cluster Sampling
2. Systematic Sampling
3. Simple Random Sampling
4. Matched Random Sampling
5. Stratified Sampling
Thank you ☺
