
Introduction to

Survey Quality
Chapter 7
Data Processing: Errors
and Their Control
Topics
1. Overview of Data Processing Steps
2. Nature of Data Processing Error
3. Data Capture Errors
4. Post–Data Capture Editing
5. Coding
6. File Preparation
7. Applications of Continuous Quality Improvement:
The Case of Coding
8. Integration Activities
Data processing is a set of activities aimed at
converting the survey data from its raw state
as the output from data collection to a
cleaned and corrected state that can be used
in analysis, presentation, and dissemination.
During data processing, the data may be changed
by a number of operations intended
to improve their accuracy.
The data may be checked, compared, corrected, keyed or
scanned, coded, tabulated, and so on, until the survey
manager is satisfied that the results are “fit for use.”
The sequence of data processing steps ranges
from the simple (e.g., data keying) to the
complex, involving:
➢ editing,
➢ imputation,
➢ weighting,
➢ and so on.
Data processing operations can be expensive
and time consuming.
Technology has also allowed greater integration of
data processing with other survey processes.
Some data processing steps can be accomplished during
the data collection phase, thereby reducing costs and
total production time while improving data accuracy.
Data processing operations may be quite
prone to human error when performed
manually.
By reducing reliance on manual labor, automation
reduces the types of errors in the data caused by
manual processing, but it may also introduce
other types of errors that are specific to the
technology used.
The literature on data processing error
and its control is quite small relative to that on
measurement error (especially
respondent errors and questionnaire effects)
and nonresponse.
Data processing operations have traditionally accounted for a very large
portion of the total survey budget. In some surveys the editing alone
consumes up to 40% of the entire survey budget (U.S. Federal Committee on Statistical
Methodology, 1990).
This is unfortunate since some processing
steps, such as coding, can be very error-prone,
particularly coding of complex concepts.

Coding error rates or coding disagreement rates
can reach levels of 20% for some variables, especially
if the coding staff is not well trained.
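As an illustration of what such a disagreement rate measures, here is a minimal Python sketch (not from the text) that compares the codes assigned independently by two coders to the same responses; the occupation codes are invented for the example.

```python
def coding_disagreement_rate(codes_a, codes_b):
    """Share of cases where two independent coders assigned different codes
    to the same verbatim response (a common check on coding quality)."""
    assert len(codes_a) == len(codes_b)
    disagreements = sum(1 for a, b in zip(codes_a, codes_b) if a != b)
    return disagreements / len(codes_a)

# Hypothetical occupation codes assigned by two coders to five responses.
coder_1 = ["7212", "5223", "9112", "2141", "3115"]
coder_2 = ["7212", "5230", "9112", "2141", "3113"]
print(f"{coding_disagreement_rate(coder_1, coder_2):.0%}")  # 40%
```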
Despite its potential to influence survey results,
data processing error is regarded by many
survey methodologists as relatively uninteresting.
Perhaps this is because, unlike nonresponse and
questionnaire design, cognitive models and
sociological theories are not directly applicable to
data processing operations.
This may explain the dearth of literature on data
processing topics. Although there is ample
evidence of the importance of data processing
error in survey work, the associated error
structures are essentially unknown and
unexplored.
Acceptance sampling was originally developed by
Dodge and Romig (1944) for industrial settings. In
the 1960s and 1970s, these methods were
extended for use in many manual survey
operations. Such methods have been referred to
in the survey literature as the administrative
applications of quality control methods.
The main objective of acceptance sampling
is to find defects in a product. Typically, the
method works by first creating batches or lots
from the survey materials (e.g., questionnaires);
a sample of units from each lot is then inspected,
and the lot is accepted or rejected depending on
the number of defects found in the sample.
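A minimal sketch of such a single-sampling plan, assuming a hypothetical lot of keyed questionnaires and an illustrative sample size and acceptance number: draw a random sample from the lot, count the defective items, and accept the lot only if the count does not exceed the acceptance number c.

```python
import random

def inspect_lot(lot, n, c, is_defective):
    """Single-sampling acceptance plan: draw n items from the lot,
    count defects, and accept the lot if the count does not exceed c."""
    sample = random.sample(lot, min(n, len(lot)))
    defects = sum(1 for item in sample if is_defective(item))
    return defects <= c, defects

# Illustrative use: a lot of 500 keyed questionnaires, 5% assumed defective.
lot = [{"id": i, "has_error": random.random() < 0.05} for i in range(500)]
accepted, defects = inspect_lot(lot, n=50, c=2, is_defective=lambda q: q["has_error"])
print(f"Defects in sample: {defects}; lot {'accepted' if accepted else 'rejected (rework/100% verify)'}")
```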

Acceptance sampling has some shortcomings
as a means of continually lowering the error rate for the
outgoing product over time.
Data processing is a neglected error source in
survey research. Coders, keyers, editors, and
other operators contribute correlated error to
the mean squared error, similar to the
contributions made by interviewers. The
increased use of technology eliminates
correlated error but may generate new errors.
Data capture is the phase of the survey
where information recorded on the form or
questionnaire is converted to a computer-
readable format.
The information may be captured by keying, mark
character recognition (MCR), intelligent character
recognition (ICR), and even by voice recognition entry
(VRE).
Keying is a tedious and labor-intensive task that is prone
to error unless controlled properly.

Keying can be avoided for interviewer-assisted modes
by using CATI, CAPI, and other CAI technologies.
In the data processing literature, three definitions
of a keying error rate are prevalent.
Suppose that a response to a question on income
is 17,400. If the value keyed differs from 17,400,
we say that a keying error has occurred. The
following three entries are examples of keying
errors: 1740, 17,500, or 17,599.
Which of the three errors should be considered
most serious?
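One way to compare them is to look at the deviation each keyed entry produces from the true value. A small sketch (using the figures from the example above) shows that the dropped digit, although only a single missing keystroke, is by far the most damaging to an estimate such as mean income:

```python
true_value = 17400
keyed_values = [1740, 17500, 17599]

for keyed in keyed_values:
    abs_err = abs(keyed - true_value)       # size of the keying error
    rel_err = abs_err / true_value          # error relative to the true value
    print(f"keyed {keyed:>6}: absolute error {abs_err:>5}, relative error {rel_err:.1%}")
```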
In most keying operations, independent rekey
verification is the primary method of quality
control.
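A minimal sketch of what rekey verification amounts to, with hypothetical record and field names: the same forms are keyed twice by different keyers, the two passes are compared field by field, and disagreements are routed to adjudication.

```python
def verify_rekey(first_pass, second_pass):
    """Compare two independent keying passes of the same records and
    flag fields where the passes disagree (candidates for adjudication)."""
    discrepancies = []
    for rec_id, first in first_pass.items():
        second = second_pass.get(rec_id, {})
        for field, value in first.items():
            if second.get(field) != value:
                discrepancies.append((rec_id, field, value, second.get(field)))
    return discrepancies

# Hypothetical example: the income field disagrees between the two passes.
pass1 = {"R001": {"age": "34", "income": "17400"}}
pass2 = {"R001": {"age": "34", "income": "17500"}}
for rec_id, field, v1, v2 in verify_rekey(pass1, pass2):
    print(f"{rec_id}.{field}: first pass={v1}, verification pass={v2}")
```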
The error rate for any keying process depends on a
number of factors, such as:
1. keyer experience,
2. the variation in error rates among keyers,
3. the keyer turnover rates,
4. the amount and type of quality control used for the keying
process,
5. the legibility of the data to be keyed,
6. the ease with which keyers can identify what fields should
be keyed, and
7. the amount and quality of the scan editing process needed
to prepare forms for keying.
Keying error rates are usually small, but the
error effects can be considerable. Comparisons
of keying error rates across surveys and
organizations are often difficult to do, due to
varying definitions of error rates.
Two types of errors can occur with Intelligent
Character Recognition (ICR):
1. substitution and
2. rejection
➢ A substitution error occurs when the ICR software
misinterprets a character.

Substitution error rates are usually small.
However, these rates should not be generalized to other surveys,
since the accuracy of ICR depends on the application.
➢ A rejection error occurs when the ICR software cannot
interpret a character and therefore rejects it.

Rejected characters must be corrected manually and
then re-entered into the system. Consequently, reject
errors are expensive to handle but contribute no error
to the data entry process as long as they are handled
properly.
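To make the distinction concrete, a small sketch (with an invented character string) of how substitution and rejection rates could be computed once the ICR output has been compared against verified values:

```python
def icr_error_rates(recognized, truth, reject_symbol="?"):
    """Character-level substitution and rejection rates for an ICR pass.
    `recognized` holds the ICR output, `truth` the verified characters."""
    assert len(recognized) == len(truth)
    rejected = sum(1 for r in recognized if r == reject_symbol)
    substituted = sum(1 for r, t in zip(recognized, truth)
                      if r != reject_symbol and r != t)
    n = len(truth)
    return substituted / n, rejected / n

# Hypothetical example: 1 substitution ('5' read as 'S') and 1 rejection in 10 characters.
sub_rate, rej_rate = icr_error_rates("17S00 1?40", "17500 1740")
print(f"substitution rate: {sub_rate:.1%}, rejection rate: {rej_rate:.1%}")
```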
1. Editing is procedures designed and used for detecting
erroneous and/or questionable survey data (survey
response data or identification-type data) with the goal of
correcting (manually and/or via electronic means) as much
erroneous data (not necessarily all of the questioned data)
as possible, usually prior to data imputation and summary
procedures. (U.S. Federal Committee on Statistical Methodology, 1990)
2. Editing is the activity aimed at detecting and correcting
errors in data. (The International Work Session on Statistical Editing, 1995)

3. Editing is the identification and, if necessary, correction of
errors and outliers in individual data used for statistics
production. (Granquist and Kovar, 1997)
Editing serves three main purposes:
1. To provide information about data quality.
2. To provide information for future survey
improvements.
3. To simply “clean up” the data so that further
processing is possible.
Data editing consists of applying a set of rules to
the survey data.
Some examples of editing rules are:
1. A value should always appear in specific positions of the file;
2. some values of a variable are not allowed;
3. some combinations of values are not allowed;
4. the value of a sum should be identical to the sum of its
components; and
5. a specific value is allowed only if it is included in a
predefined interval.
These edits are called deterministic edits, which, if violated, point to
errors with certainty.
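The following sketch shows how deterministic edit rules of the kinds listed above might be expressed in code; the field names, codes, and intervals are purely illustrative, not taken from any actual survey.

```python
def deterministic_edits(record):
    """Apply a few deterministic edit rules of the kinds listed above.
    Field names, codes, and limits are illustrative, not from any actual survey."""
    failures = []
    # Rule: some values of a variable are not allowed.
    if record.get("sex") not in {"M", "F"}:
        failures.append("invalid code for 'sex'")
    # Rule: a value is allowed only if it lies in a predefined interval.
    if not (0 <= record.get("age", -1) <= 120):
        failures.append("'age' outside allowed interval 0-120")
    # Rule: some combinations of values are not allowed.
    if record.get("age", 0) < 15 and record.get("marital_status") == "married":
        failures.append("inconsistent combination: age < 15 and married")
    # Rule: the value of a sum must equal the sum of its components.
    if record.get("total_income") != record.get("wage_income", 0) + record.get("other_income", 0):
        failures.append("'total_income' does not equal sum of components")
    return failures

record = {"sex": "M", "age": 14, "marital_status": "married",
          "wage_income": 12000, "other_income": 500, "total_income": 13000}
print(deterministic_edits(record))  # two edit failures for this record
```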
Fatal or critical edits are supposed to detect erroneous variable
values that must be corrected for the data record to be usable.
Examples of errors that may be considered as fatal are:
• Identification number errors (e.g., a business ID number is invalid)
• Item nonresponse for key variables
• Invalid values on key items (e.g., out-of-range values for age)
• Values that are inconsistent (e.g., birth mother’s age is less than the
child’s age)
• Defined relationships between variables are not satisfied (e.g., net
after-tax income exceeds gross income before taxes)
• Values that are extreme or unreasonable
Query edits are supposed to identify suspicious values.
Suspicious values are values that may be in error, but that
determination cannot be made without further investigation.
For example, a farm operator may value his or her land at 10 times the value of the
land that surrounds the farm. This could be a keying error in which an
extra “0” was appended to the farmer’s report. However, it may also be
the actual estimate provided by the farm operator. A check of the
original survey form should reveal which it is.
Query edits should be pursued only if they have the potential
to affect the survey estimates in a noticeable way. Since it is
not clear whether an error has occurred, each query edit has
to be investigated in more depth. Such investigations can be
time consuming and costly. It is therefore important that the
rules for query edits be designed so that only errors that
have a substantial probability of affecting the estimates are
identified and pursued. These editing rules might also be
termed stochastic edits (as opposed to deterministic edits),
since there is uncertainty as to whether they identify actual
errors in the data.
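A minimal sketch of one possible query (stochastic) edit in the spirit of the farm-land example: a reported value is flagged as suspicious when it differs from the median of comparable units by more than some factor. The factor of 5 is an arbitrary illustration, not a recommended setting.

```python
from statistics import median

def query_edit(value, comparison_values, ratio_limit=5.0):
    """Flag a value as suspicious (not necessarily wrong) when it differs
    from the median of comparable units by more than `ratio_limit` times.
    The limit is an arbitrary illustration, not a recommended setting."""
    ref = median(comparison_values)
    if ref == 0:
        return False
    ratio = value / ref
    return ratio > ratio_limit or ratio < 1.0 / ratio_limit

# Hypothetical example: one farm values its land far above its neighbours.
neighbours = [2100, 1950, 2300, 2050, 2200]   # per-acre land values
print(query_edit(21000, neighbours))   # True  -> route to manual follow-up
print(query_edit(2400, neighbours))    # False -> accept without review
```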
Microediting: editing performed at the record or questionnaire
level.

Macroediting: editing performed on aggregate data (e.g., means
or totals) or other checks that are applied to the
entire body of records.
Data can be edited at different stages of the
survey process, as follows:
• Editing during the interview.
• Editing prior to data capture
• Editing during data capture
• Editing after data capture
• Output editing
Editing is an essential stage of the survey process, but the
problems created by editing can be quite serious. Editing is
costly and time consuming, despite all the new technology. As
mentioned, in some surveys, editing alone can consume 40%
of the entire survey budget (U.S. Federal Committee on Statistical Methodology, 1990).

Extensive editing may delay release of the survey data, thus
reducing its relevance to data users.
A key lesson from contemporary research on
editing is that not all errors in the data should be
fixed, but rather, one should try to implement
selective editing.

Various studies show that selective editing can result in
considerable savings in time and money without any
degradation in the accuracy of the final estimates.
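One common way to implement selective editing is with a score function. The sketch below (a hypothetical score and threshold, not a prescribed method) prioritizes records by their potential impact on an estimated total: design weight times the deviation from an anticipated value, with only high-scoring records sent to manual review.

```python
def selective_editing_scores(records, threshold):
    """Score each record by its potential impact on an estimated total:
    design weight times the absolute deviation from an anticipated value
    (e.g., last period's report). Only records scoring above `threshold`
    are sent to manual editing; the rest pass through untouched.
    The score function and threshold are illustrative choices."""
    to_edit, to_pass = [], []
    for rec in records:
        score = rec["weight"] * abs(rec["reported"] - rec["anticipated"])
        (to_edit if score > threshold else to_pass).append((rec["id"], score))
    return to_edit, to_pass

records = [
    {"id": "A", "weight": 120, "reported": 5400, "anticipated": 5300},
    {"id": "B", "weight": 80,  "reported": 900,  "anticipated": 8700},  # possible unit error
]
to_edit, to_pass = selective_editing_scores(records, threshold=50000)
print("manual review:", to_edit, " accepted:", to_pass)
```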
Editing is an essential part of the survey design,
but it should focus on those changes that have a
considerable effect on the survey estimates. The
only way to identify the most efficient edit rules is
to conduct experiments that compare edited and
unedited data. Thus, resources spent on editing
must be balanced against other needs.
Editing can consume a large fraction of a total survey
budget. To maximize the benefits of editing, results from the
editing phase should be used to improve the survey
processes; for example, to improve the questionnaire,
interviewer training, and other survey processes that may
cause edit failures. There is also a need to improve the
editing process itself by using methods that more effectively
detect potential errors. Macroediting and selective editing
are examples of such methods.
The End
• Questions?
