
Public Opinion Quarterly, Vol. 74, No. 5, 2010, pp. 907–933

NONRESPONSE ERROR, MEASUREMENT ERROR, AND MODE OF DATA COLLECTION: TRADEOFFS IN A MULTI-MODE SURVEY OF SENSITIVE AND NON-SENSITIVE ITEMS



JOSEPH W. SAKSHAUG*
TING YAN
ROGER TOURANGEAU

Abstract: Although some researchers have suggested that a tradeoff exists between nonresponse and measurement error, to date, the evidence
for this connection has been relatively sparse. We examine data from an
alumni survey to explore potential links between nonresponse and mea-
surement error. Records data were available for some of the survey items,
allowing us to check the accuracy of the answers. The survey included
relatively sensitive questions about the respondent’s academic perfor-
mance and compared three methods of data collection—computer-assisted
telephone interviewing (CATI), interactive voice response (IVR), and an
Internet survey. We test the hypothesis that the two modes of computerized
self-administration reduce measurement error but increase nonresponse
error, in particular the nonresponse error associated with dropping out
of the survey during the switch from the initial telephone contact to the
IVR or Internet mode. We find evidence for relatively large errors due
to the mode switch; in some cases, these mode switch biases offset
the advantages of self-administration for reducing measurement error.
We find less evidence for a possible second link between nonresponse
and measurement error, based on a relationship between the level of effort
needed to obtain the data and the accuracy of the data that are ultimately
obtained. We also compare nonresponse and measurement errors across
different types of sensitive items; in general, measurement error tended
to be the largest source of error for estimates of socially undesirable characteristics; nonresponse error tended to be the largest source of error for estimates involving socially desirable or neutral characteristics.

JOSEPH W. SAKSHAUG is a Ph.D. candidate in the Program in Survey Methodology at the Institute for Social Research at the University of Michigan, Ann Arbor, MI, USA. TING YAN is a Senior Survey Methodologist with NORC at the University of Chicago, Chicago, IL, USA. ROGER TOURANGEAU is a Research Professor at the Institute for Social Research at the University of Michigan, Ann Arbor, MI, USA, and the Director of the Joint Program in Survey Methodology at the University of Maryland, College Park, MD, USA. We thank Paul Biemer and three anonymous reviewers for critical comments and helpful suggestions. *Address correspondence to Joseph W. Sakshaug, Institute for Social Research, University of Michigan, 426 Thompson Street, Room 4050, Ann Arbor, MI 48104, USA; e-mail: joesaks@umich.edu.
doi: 10.1093/poq/nfq057
© The Author 2011. Published by Oxford University Press on behalf of the American Association for Public Opinion Research. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com

Introduction
Several methodological studies have examined both nonresponse and measurement errors. These studies fall into two main categories. One attempts to mea-
sure the relative contributions of nonresponse and measurement error to the
overall error; the other attempts to determine whether there is some relationship
or link between the two sources of error.
An example of the first type of study is the article by Schaeffer, Seltzer, and
Klawitter (1991). That article used court records data on respondents and non-
respondents to estimate nonresponse and response errors in a survey of divorced
resident and nonresident parents. Schaeffer et al. (1991) found that for estimates
related to the amounts of child support owed or paid by nonresident fathers, the
measurement errors (due to substantial net overreporting) were about three
times as large as the nonresponse errors. However, the opposite result was
found for estimates related to whether any support was paid; nonresident fathers
who were not interviewed were less likely to pay support than those who were
interviewed, yielding nonresponse errors that were about twice as large as the
measurement errors for this variable.
In a later study, Biemer (2001) used a reinterview survey design and latent
class analysis to compare estimates of nonresponse and measurement error
across two modes of data collection (computer-assisted telephone interviewing
[CATI] and face-to-face interviewing). His findings were also mixed. For some
items, the error due to nonresponse was greater than the error due to inaccurate
reporting (e.g., the percentage reporting ‘‘poor’’ health or the percentage report-
ing they had ever smoked at least 100 cigarettes); for other items, measurement
error dominated nonresponse error (e.g., the percentage reporting ‘‘excellent’’
general health). Other studies have reported similarly mixed results, with the
most consistent finding being that the relative magnitude of the errors depends
on the particular estimates involved (see, for example, Olson 2006;
Tourangeau, Groves, and Redline 2010).
Because the magnitudes of nonresponse and measurement errors are specific
to an item (and to an estimate), their relative magnitudes are also likely to vary
by item. For this reason, it is important to take into account item characteristics
that may affect each source of error and may play a role in the mixed evidence.
For example, items that ask about socially undesirable behaviors (e.g., illicit
drug use) are highly susceptible to measurement errors but not especially sus-
ceptible to nonresponse errors (Tourangeau and Yan 2007). Items that ask about
socially desirable behaviors (e.g., voting, paying child support) are somewhat
less prone to measurement errors than those asking about undesirable behav-
iors, but they may also be more susceptible to nonresponse errors since those
who belong to the desirable category (e.g., the voters) are more likely to become
respondents than those in the undesirable category (e.g., the nonvoters; see
Schaeffer, Seltzer, and Klawitter 1991; Tourangeau, Groves, and Redline
2010). These nonresponse-reporting error effects can be particularly strong
when the survey topic described to respondents is related to the undesirable
behavior (e.g., items about voter turnout in a survey described as being about
‘‘Politics, Elections, and Voting’’; Tourangeau, Groves, and Redline 2010). In
the extreme case, this type of selectivity may produce a respondent pool that
overrepresents those who belong to the socially desirable category and who are,
therefore, unlikely to misreport about it. The resulting estimates are likely to
exhibit relatively large nonresponse errors but small measurement errors.
By contrast, estimates based on non-sensitive items, such as questions about
age, gender, ethnicity, or poor health status, often can be expected to yield
results similar to those for socially desirable characteristics (large nonresponse
errors but small measurement errors) because these variables are not particu-
larly prone to misreporting, but they are often related to unit nonresponse
(Biemer 2001; Groves and Couper 1998). One goal of this article is to examine
the relative contributions of measurement and nonresponse errors by different
types of survey items, including items asking about socially desirable character-
istics, socially undesirable ones, and neutral items.
The issue of item sensitivity and its relation to nonresponse and measurement
errors is particularly important in the context of mixed-mode studies. This is
because self-administered modes of data collection tend to elicit both greater
reporting accuracy to sensitive items (e.g., Kreuter, Presser, and Tourangeau
2008) and lower response rates; the lower response rates may, in turn, produce
greater nonresponse bias than interviewer-administered modes. In an effort to
minimize the impact of both sources of error, many surveys use interviewers
for contacting and recruiting respondents prior to switching them to a self-
administered interview. This is especially common in interactive voice response
(IVR) surveys (for a review, see Tourangeau, Steiger, and Wilson 2002). It is
also possible to recruit respondents for a Web survey by contacting them by
telephone and then providing a URL for the Web survey, as Kreuter et al.
(2008) did for a portion of their sample. A drawback to the ‘‘recruit-and-switch’’
form of IVR is that a substantial portion of the sample (typically 20 percent or
more; see Tourangeau, Steiger, and Wilson 2002) drop out during the switch
from CATI to IVR—that is, sample members say they will complete the IVR
portion of the interview but hang up during the transfer to the IVR system. An
analogous phenomenon occurs with Web surveys: Some members of the tele-
phone sample agree to complete the survey via the Internet but never bother to
access the questionnaire online (Kreuter, Presser, and Tourangeau 2008; see
also Fricker, Galesic, Tourangeau, and Yan 2005 for similar findings). If, as
Kreuter et al. (2008) argue, IVR and Web data collection reduce measurement
error to sensitive items, the question arises as to whether this reduction in mea-
surement error offsets any increase in nonresponse bias due to the relatively
large number of cases who drop out during the switch from one mode of col-
lection to another. The second goal of this article is to address this question.
The second group of nonresponse/measurement error studies investigates
possible links between the two error sources. Several researchers have
expressed concern that efforts to reduce nonresponse by contacting hard-to-
reach members of the sample or converting initial refusals may increase mea-
surement error. In an early study investigating this possibility, Cannell and
Fowler (1963) examined survey reports about hospital stays and found that
respondents who were recruited after extensive follow-up provided less accu-
rate information about their hospital stays than those who required less follow-
up. In a related finding, Bollinger and David (2001) showed that respondents to
the Survey of Income and Program Participation who later dropped out pro-
vided less accurate information in the waves in which they did take part than
respondents who completed every wave of data collection. Both of these studies
are based on the hypothesis that the same people who are reluctant to participate
in a survey will be unwilling to put much effort into answering the questions if
they are induced to take part.
Similar reasoning underlies the hypothesis that reluctant respondents (for
example, those requiring refusal conversion) are especially prone to survey
‘‘satisficing’’ (Krosnick 1991, 1999)—that is, reluctant respondents are more
likely than their more willing counterparts to take various cognitive shortcuts
(such as giving ‘‘don’t know’’ responses) in answering the questions. Studies
exploring this notion have the disadvantage of looking at relatively indirect
indicators of measurement error, such as ‘‘straightlining’’ (giving the same
answer to every question in a battery of related questions) or item nonresponse,
rather than response accuracy in comparison to records data.
At least four studies have examined the notion of a connection between
reluctance to take part in a survey and survey satisficing, finding mixed
evidence of such a link (Fricker 2007; Tourangeau, Groves, Kennedy, and
Yan 2009; Triplett, Blair, Hamilton, and Kang 1996; and Yan, Tourangeau,
and Arens 2004). In the largest of these studies, Fricker (2007) examined four
indicators of response inaccuracy in the Current Population Survey (CPS) and
found a relationship between response propensities (in this case, the estimated
probability that a given household would complete all eight CPS interviews)
and two of the indicators of inaccuracy—item nonresponse and the use of round
values in reporting earnings and hours worked. (In a similar vein, Friedman,
Clusen, and Hartzell 2003 reported that late respondents to a health survey
had higher item nonresponse rates than early respondents.) Fricker (2007) also
found that respondents to the American Time Use Survey (ATUS) who required
refusal conversion reported fewer activities on average than ATUS respondents
who cooperated more readily. Triplett et al. (1996) report a very similar
finding—respondents in their study who required refusal conversion reported
fewer activities in a time-use survey than those who did not require a conversion
attempt. On the other hand, Yan et al. (2004) found no consistent relationship
between various indicators of survey satisficing and the estimated likelihood of
participation, and Tourangeau et al. (2009) found evidence that the relationship
between giving inconsistent answers across two waves of a Web survey and the
probability of responding to the second wave was nonmonotonic. Olson (2006)
investigated the relationship between nonresponse and response accuracy and
found no simple relationship between the two (for similar findings, see
Willimack, Schuman, Pennell, and Lepkowski 1995). In summary, there is
some evidence that hard-to-persuade or hard-to-contact respondents are more
likely to give satisficing answers if they are persuaded to take part, but this ev-
idence is not entirely consistent across studies or measures of satisficing. Some
additional studies that do not directly look at the issue of satisficing (Friedman
et al. 2003; Olson 2006; and Willimack et al. 1995) also find evidence that the
relationship between nonresponse and measures of data quality is not a straight-
forward one. The third goal of this article is to revisit this issue, examining the
possible link between nonresponse and measurement error by examining how
reporting error varies with the level of effort needed to complete the case.
We reexamine data from a study previously discussed by Kreuter, Presser,
and Tourangeau (2008) and Kreuter, Yan, and Tourangeau (2008). The previ-
ous analyses of these data have focused on measurement error. Here, we focus
on total error and on the relative contributions of nonresponse and measurement
error for the multiple modes of data collection and for both sensitive and non-
sensitive items. The study collected data from a sample of University of Mary-
land alumni and, after a brief telephone screener, used three methods of data
collection—CATI, IVR, and a Web survey. The main questionnaire included
several sensitive items about the respondent’s academic record (such as whether
he or she had withdrawn from a class) that were verified against the respondent’s
official transcript; non-sensitive items (such as age or years since graduation)
were also collected and verified against official records. In summary, we use these
data to address three questions:

• First, what was the relative contribution of nonresponse and measurement error to the overall error in the survey estimates? Do the relative contributions of each error source change across sensitive and non-sensitive items?
• Second, does the reduction in measurement error offset any increase in nonresponse bias due to the relatively large number of cases who drop out during the switch from an interviewer-administered mode of data collection to a self-administered mode?
• Third, how did the level of effort needed to contact the sample members and get them to complete the screener relate to the level of accuracy in their answers?

Methods
The data analyzed here are from a study carried out by the Joint Program in
Survey Methodology (JPSM) at the University of Maryland as part of one
of its graduate classes. In 2005, students in the Practicum class designed a sur-
vey of University of Maryland alumni, with data collection for the main survey
conducted by Schulman, Ronca, and Bucuvalas, Inc. (The students conducted
pretest interviews.) Members of the sample were contacted initially by tele-
phone, asked a brief set of screening questions about their personal and house-
hold characteristics (including access to the Internet), and then assigned to one
of three methods of data collection for the main interview: CATI, IVR, and an
Internet survey. (For the sample members assigned to the CATI mode, the main
interview followed the screening questions with no break between the two.) The
introduction described the survey as sponsored by the University of Maryland
and asking about ‘‘your college experience, interest in alumni activities, and
community involvement.’’ The specific wording of the introduction is given
below:

Hello, my name is [INTERVIEWER’S NAME] and I’m calling on behalf
of the University of Maryland. You have been randomly selected to
participate in a survey of Maryland alumni. I’m not calling to ask
you for a donation. I will be asking you about your college experience,
interest in alumni activities, and community involvement. Your parti-
cipation is strictly voluntary and you may skip any question you don’t
want to answer. All of your responses will be kept confidential. The sur-
vey will take about 10 minutes.

The methods used in the study are described in more detail by Kreuter,
Presser, and Tourangeau (2008). Here, we summarize the relevant features
of the sample design, data collection, and questionnaires.
Sampling and data collection: The sample was a random sample (proportion-
ately stratified by graduation year) consisting of 20,000 graduates drawn from
a population of 55,320 alumni who, according to university records, received
undergraduate degrees from the University of Maryland from 1989 to 2002.
After sample cases were matched with Alumni Association records (to obtain
telephone numbers) and various ineligible cases were dropped (e.g., those used
in pretesting and those living abroad), the survey fielded 7,535 telephone
numbers.1 More than a third of these telephone numbers turned out to be invalid
(e.g., the number was disconnected), and the status of about another quarter
could not be determined. A total of 1,501 alumni completed the screener
and were randomly assigned to a mode of data collection. There were 37 cases
who reported they did not have Internet access, and these were randomly
assigned to either CATI or IVR data collection. The response rate (AAPOR Response Rate 1; AAPOR 2009) for the screener was 31.9 percent. Most of the nonresponse was due to difficulties in contacting the alumni rather than their unwillingness to cooperate. The refusal rate was about ten percent of the fielded phone numbers (excluding ineligibles).

1. This number differs slightly from the corresponding figure in Kreuter, Presser, and Tourangeau (2008), because we excluded certain cases (sample members found to be deceased) that Kreuter et al. included.

Questionnaire: The main interview consisted of 37 questions. Most of these
were included at the behest of the Alumni Association and are not relevant
to our purposes. Our analysis focuses on nine questions for which validation
data were available from university records:
(1) What was your cumulative overall undergraduate grade point average or
GPA at the time you received your undergraduate degree?
(2) Did you ever receive a grade of ‘‘D’’ or ‘‘F’’ for a class?
(3) During the time you were an undergraduate at the University of Mary-
land, did you ever drop a class and receive a grade of ‘‘W’’?
(4) . . .did you graduate cum laude, magna cum laude, or summa cum laude?
(5) Since you graduated, have you ever donated financially to the University
of Maryland?
(6) Did you make a donation to the University of Maryland in calendar year
2004?
(7) Are you a dues-paying member of the University of Maryland Alumni
Association?
(8) In what year were you born?
(9) In what year did you receive your undergraduate degree?

The year-of-birth question was part of the telephone screener. The rest of the
items were included in the main questionnaire. These questions came early in
the questionnaire, but our numbering of the items above does not correspond to
the item numbers in the questionnaire.
We constructed three estimates based on the GPA item: the proportion of cases
with a GPA less than 2.5, the proportion with a GPA higher than 3.5, and the
mean GPA. We thought GPAs lower than 2.5 would be seen as socially undesir-
able (a GPA of 2.0 or less triggers academic warning at the University of Mary-
land) and that GPAs higher than 3.5 would be seen as socially desirable (a GPA
that high or higher in a given term qualifies the student for the Dean’s List). We
calculated proportions based on items 2 through 7 above and the mean years
since birth and since graduation based on the last two items.

Forms of nonresponse and bias estimates: We estimated the effects of several


forms of nonresponse by comparing the records data for various subgroups of
the sample:

(1) The entire sample (n = 7,535)
(2) The subset of the sample that was actually contacted for the screening interview (n = 3,497)
(3) The cases who completed the screener (n = 1,501)
(4) Those who started the main questionnaire (n = 1,107)
(5) Those who completed the relevant item (with sample sizes that vary from item to item)

Comparisons between members of groups 2 and 1 on the variable of interest
provide an estimate of the bias due to noncontact with sample members. Sim-
ilarly, comparisons between members of groups 3 and 2 provide an estimate of
the bias due to screener refusals. Of particular interest are the comparisons
between members of groups 4 and 3; differences between these two groups
reflect the impact of losses due to the switch in modes of data collection between
the screener and main interview. Comparisons between groups 5 and 4 reflect
the impact of item nonresponse and break-offs partway through the main
questionnaire. Finally, comparisons between groups 5 and 1 provide estimates
of total nonresponse bias. All of these comparisons are possible since they are
based on records data that are available for the entire sample, not just the
respondents.
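To make the decomposition concrete, the following is a minimal sketch (in Python with pandas) of how these subgroup comparisons could be computed from a records file. The DataFrame layout and the column names (contacted, completed_screener, started_main, answered_item, and the records value y_records) are hypothetical illustrations, not the study's actual files.

```python
import pandas as pd

def nonresponse_bias_components(frame: pd.DataFrame, y: str = "y_records") -> dict:
    """Decompose nonresponse bias for one item using records data available
    for the entire sample. Each component is the difference between the
    records-based mean for a successively narrower subgroup and the mean for
    the group preceding it; the total compares item responders with the full sample."""
    full = frame                                                # group 1: entire sample
    contacted = full[full["contacted"] == 1]                    # group 2: contacted cases
    screener = contacted[contacted["completed_screener"] == 1]  # group 3: screener respondents
    started = screener[screener["started_main"] == 1]           # group 4: started main questionnaire
    item_resp = started[started["answered_item"] == 1]          # group 5: answered this item

    return {
        "noncontact_bias": contacted[y].mean() - full[y].mean(),
        "refusal_bias": screener[y].mean() - contacted[y].mean(),
        "mode_switch_dropout_bias": started[y].mean() - screener[y].mean(),
        "item_nonresponse_bias": item_resp[y].mean() - started[y].mean(),
        "total_nonresponse_bias": item_resp[y].mean() - full[y].mean(),
    }
```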
The estimates of the biases due to measurement error are all based on com-
parisons between the survey responses and records data within group 5. We
examine three groups of estimates: those based on characteristics that are
socially undesirable (such as having received a D or an F in a course), those
based on characteristics that are socially desirable (having graduated with
honors), and those based on neutral characteristics (years since graduation).
Standard errors for bias estimates reported in the tables below were computed
using the random-groups method (see Wolter 2009 for a detailed description of
this technique). More specifically, we randomly subdivided the sample into 30
replicates and computed bias estimates for each replicate. The variances of the
bias estimates were computed by estimating the variability in the bias estimates
across the replicates.
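As a rough sketch of the random-groups computation (assuming, for illustration, that the data sit in a pandas DataFrame and that bias_fn returns the bias estimate of interest for any subset of rows, for example a measurement bias computed as the mean reported value minus the mean records value; these names are hypothetical):

```python
import numpy as np
import pandas as pd

def random_groups_se(frame: pd.DataFrame, bias_fn, n_groups: int = 30, seed: int = 0) -> float:
    """Random-groups standard error for a bias estimate.

    The sample is randomly divided into `n_groups` replicates, the bias is
    estimated within each replicate, and the variance of the overall estimate
    is approximated by the variance of the replicate estimates divided by the
    number of replicates."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, n_groups, size=len(frame))  # random replicate assignment
    replicate_estimates = np.array(
        [bias_fn(frame[labels == g]) for g in range(n_groups)]
    )
    variance = replicate_estimates.var(ddof=1) / n_groups
    return float(np.sqrt(variance))
```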

Results
We present the results in three parts. First, we assess the relative contributions of
nonresponse and measurement error to the overall error in the survey estimates
for the sensitive and non-sensitive items. Next, we examine the impact of
different types of nonresponse (noncontact, refusal to complete the screening
interview, and dropout and item nonresponse after the screener was completed)
and their relation to response accuracy. One aim of this part of the analysis is to
determine whether the expected reduction in measurement error due to self-
administration offsets any increase in nonresponse bias due to the cases that
drop out during the switch from one mode of data collection to another.
Our final set of analyses sheds further light on the tradeoffs between
nonresponse and measurement error by examining how the level of effort
needed to contact the sample members and get them to complete the screener
relates to the level of accuracy in their answers.
Relative contribution of nonresponse and measurement error: Table 1 shows
the distribution of the true statuses (according to university records) for the full
sample and for the various subgroups of the sample at each stage of the recruit-
ment process (getting contacted, completing the screening interview and getting
assigned to a main interview mode, responding to at least one item after switch-
ing modes, and responding to the question of interest). For example, 15.3 per-
cent of the initial sample had a GPA of less than 2.5; the corresponding
estimates were 14.9 percent for the sample cases who were contacted and
13.7 percent for those who completed the screener. In addition, the table shows
the distribution of reported statuses for each survey item (for example, of the
cases who responded to the relevant survey item, 4.2 percent reported a GPA of
less than 2.5). The items are in three groups—those asking about socially
desirable, socially undesirable, and neutral characteristics.
By comparing the means and proportions for the various subgroups of the
sample in the different columns of table 1, we can estimate the effects of the
several forms of nonresponse in this study: noncontact, refusal, dropout after
the mode assignment, and item missing data. The differences between the
estimates in the first and second columns reflect the impact of noncontact;
the differences between those in the second and third columns reflect the im-
pact of screener nonresponse; and so on. We can also assess the impact of
response error by comparing the estimates in the last two columns of table
1, which are based on university records and survey reports, respectively.
Table 2 shows the resulting estimates for individual components that make
up the nonresponse error and also for the measurement error in each statistic.
The table reveals a clear pattern—measurement error contributes more to the
overall error in the estimates for the socially undesirable characteristics than
nonresponse error does, but the opposite is largely true for the estimates re-
garding the socially desirable and neutral characteristics. In several cases, the
difference between the nonresponse and measurement error estimates is sta-
tistically significant.
Table 2 also allows us to examine the effects of nonresponse at each stage of
the recruitment and data collection process. First, we can compare the sizes of
the nonresponse biases incurred during the screening operation, namely, non-
contact and refusal bias. Under ideal circumstances, any biases associated with
the failure to contact sample members (noncontacts) would more or less be
‘‘canceled out’’ by the biases associated with those who were contacted but
refused to participate in the survey (refusals). We did not find any evidence
that these two forms of nonresponse error offset each other. For all 11 of
the estimates, the noncontact and refusal biases move the estimates in the same
direction. In addition, it appears that the biases due to refusal tend to be larger in
absolute magnitude than the biases due to noncontact.
Table 1. Percentage/Mean in Each Subgroup, According to Frame and Survey Data (standard errors in parentheses)
Columns: Sample (n = 7,535); Contacts (n = 3,497); Screener Rs (n = 1,501); Mode Switch Rs (n = 1,107); Item Responders, frame data (n’s vary); Item Responders, survey report (n’s vary)
Undesirable Characteristics
GPA < 2.5 15.3 14.9 13.7 12.2 12.1 4.2
(0.4) (0.6) (0.9) (1.0) (969; 1.1) (969; 0.7)
At least one D/F 62.6 62.4 61.0 61.1 60.7 45.5
(0.6) (0.8) (1.3) (1.5) (1,071; 1.5) (1,071; 1.5)
Dropped a class 70.9 69.7 69.3 68.7 68.1 47.9
(0.5) (0.8) (1.2) (1.4) (1,057; 1.4) (1,057; 1.5)
Desirable Characteristics
GPA > 3.5 18.6 19.5 20.9 21.4 22.0 23.1
(0.5) (0.7) (1.1) (1.2) (969; 1.3) (969; 1.3)
Honors 9.4 9.9 11.3 11.9 12.3 17.2
(0.3) (0.5) (0.8) (1.0) (1,062; 1.0) (1,062; 1.2)
Ever donated to UMD 25.3 30.9 38.0 40.0 40.6 41.4
(0.5) (0.8) (1.3) (1.5) (1,019; 1.5) (1,019; 1.5)
Donated in last year 8.5 11.6 15.1 15.4 15.7 17.1
(0.3) (0.5) (0.9) (1.1) (1,001; 1.2) (1,001; 1.2)
Alumni member 7.1 9.8 14.5 15.9 15.9 23.2
(0.3) (0.5) (0.9) (1.1) (1,048; 1.1) (1,048; 1.3)

Neutral characteristics
GPA 3.02 3.03 3.06 3.07 3.08 3.18
(0.01) (0.01) (0.01) (0.01) (969; 0.02) (969; 0.02)
Years since birth 33.44 33.98 34.63 34.57 34.52 34.69
(screener item) (0.07) (0.12) (0.19) (0.22) (1,090; 0.22) (1,090; 0.23)
Years since degree 9.27 9.51 9.82 9.86 9.86 9.92
(0.05) (0.07) (0.11) (0.13) (1,076; 0.13) (1,076; 0.15)

NOTE.—Parenthetical entries in the first four columns of figures are standard errors; in the last two columns, the parenthetical entries are sample sizes followed by the
standard errors.

Table 2. Nonresponse and Measurement Error Bias Estimates, by Survey Statistic (standard errors in parentheses)
Columns: Noncontact; Refusal; Mode Switch Dropout; Item Nonresponse; Total NR (nonresponse bias components); Measurement Bias
Undesirable Characteristics
GPA < 2.5 0.4 1.2 1.5 0.1 3.2 7.9
(0.4) (0.7) (0.5) (0.4) (1.0) (1.1)
At least one D/F 0.2 1.4 0.1 0.4 1.9 15.2
(0.8) (1.0) (0.9) (0.2) (1.9) (1.2)†
Dropped a class 1.2 0.4 0.6 0.6 2.8 20.2
(0.5) (0.9) (0.6) (0.2) (1.5) (1.3)†
Desirable Characteristics
GPA > 3.5 0.9 1.4 0.5 0.6 3.4 1.1
(0.6) (0.7) (0.7) (0.5) (1.5) (0.9)
Honors 0.5 1.4 0.6 0.4 2.9 4.9
(0.3) (0.5) (0.5) (0.1) (0.8) (0.6)
Ever donated to UMD 5.6 7.1 2.0 0.6 15.3 0.8
(0.5) (1.3) (0.8) (0.4) (1.6)† (1.7)
Donated in last year 3.1 3.5 0.3 0.3 7.2 1.4
(0.3) (0.7) (0.6) (0.3) (0.8)† (1.3)
Alumni member 2.7 4.7 1.4 0.0 8.8 7.3
(0.3) (0.6) (0.5) (0.2) (1.0) (0.8)

Neutral Characteristics
GPA 0.01 0.03 0.01 0.01 0.06 0.10
(0.01) (0.01) (0.01) (0.01) (0.02) (0.01)
Years since birth 0.54 0.65 0.06 0.05 1.08 0.17
(screener item) (0.09) (0.11) (0.10) (0.03) (0.17)† (0.06)
Years since degree 0.24 0.31 0.04 0.00 0.59 0.06
(0.05) (0.07) (0.07) (0.02) (0.10)† (0.07)

NOTE.—Noncontact bias is computed as the difference between the contacted and full sample estimates in table 1; refusal bias is the difference between the screener respondent and contacted sample estimates; and so on.
† indicates that the difference between the nonresponse and measurement error biases is statistically significant, p < 0.05.
A relatively unexplored source of nonresponse error in multi-mode surveys is
the bias due to the cases who complete the screening interview but drop out
during or after the switch to the mode of data collection for the main question-
naire. Of the 1,501 respondents who completed the screening interview and
were assigned to a mode of data collection for the main interview, 394 (or
26.2 percent) never started the questionnaire. This includes 114 of the 524 cases
assigned to IVR and 271 of the 639 cases assigned to the Web. The dropout bias
estimates in table 2 reflect the dropouts from all three modes, including the nine
CATI cases who completed the screening interview but dropped out prior to
starting the main interview. In table 2, the dropout biases appear to be similar
in size and direction to the prior sources of nonresponse (noncontact and
screener refusal). However, in one case the magnitude of the mode switch bias
exceeds either of the individual components of nonresponse error—the mode
switch bias associated with having a GPA less than 2.5 (1.5 percent) is greater
than either of the screener biases for that item (noncontact: 0.4; refusal: 1.2).
For the majority of items, however, the mode switch bias is slightly larger than
or between the size of the noncontact and refusal biases.
The final source of nonresponse error is item nonresponse. Item nonresponse
occurred if a respondent failed to answer one or more specific items or if he or
she prematurely stopped answering after starting the main questionnaire (break-
offs). The item nonresponse biases generally follow the same direction as the
screener nonresponse biases—they are negative for the socially undesirable
characteristics and positive for the socially desirable ones. For both types of
characteristics, this means that the respondents were more likely than the non-
respondents to be in the socially desirable category (e.g., not to have failed
a class but to have received university honors). For the socially undesirable
characteristics, the absolute magnitudes of the item nonresponse biases gener-
ally fall between those of the biases introduced by noncontact and screener refusal; item nonresponse generally had less impact on the estimates for
the socially desirable and neutral characteristics.

Relationship between nonresponse and response accuracy: We also examined
the magnitude and relationship between the different sources of nonresponse
and measurement error more fully by standardizing the bias estimates to remove
the effects of differing units of measurement. We computed relative bias esti-
mates for each item:

\[
\text{Relbias} = \frac{y_r - y_n}{y_n},
\]

in which y_r is the estimate based on the relevant group of respondents and y_n is the estimate for the full sample. The results comparing the relative biases from
each error source confirm the main findings already apparent from table 2. For
example, measurement error dominates the overall error in the estimates based
on the socially undesirable characteristics, whereas nonresponse introduces
more overall error for the estimates based on the desirable and neutral character-
istics. Similarly, the analysis of the relative biases confirms that the overall non-
response bias appears to be driven mostly by screener nonresponse rather than
mode switch dropout; however, the magnitude of mode switch bias equals or
exceeds the individual components of screener nonresponse for some of the
estimates.
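As a minimal illustration of the relative bias measure defined above (the function and the worked figures are ours, not a calculation reported in the article):

```python
def relative_bias(y_r: float, y_n: float) -> float:
    """Relative bias of a respondent-based estimate y_r against the
    full-sample (frame) estimate y_n: (y_r - y_n) / y_n."""
    return (y_r - y_n) / y_n

# Using the GPA < 2.5 figures from table 1: the survey report among item
# responders (4.2 percent) versus the frame value for the full sample
# (15.3 percent) gives a relative total bias of roughly -0.73.
print(round(relative_bias(4.2, 15.3), 3))  # -0.725
```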
The findings in table 2 suggest that there is a risk in switching respondents
from one mode of data collection to another. Switching respondents from an
interviewer-administered mode of data collection to a self-administered mode is
a design decision that is typically made in the hope that respondents will pro-
vide more accurate responses in the self-administered mode to any sensitive
questions. However, it carries the risk that sample members will drop out during
the switch and that any gains in response accuracy may be offset by increases
in nonresponse error. Under these circumstances, it would be better from a to-
tal survey error perspective to forego the mode switch and proceed with an
interviewer-administered interview.
We explored the idea of a tradeoff between measurement error and this form
of nonresponse error by comparing the size of the biases associated with the two
forms of error in each mode condition. The bias estimates for these two sources
of error are shown in table 3.2 The mode switch nonresponse bias estimates for
the CATI cases reflect the CATI cases who were asked to complete the main
interview in the same mode as the screener but who nonetheless dropped out
before completing any of the main questionnaire items (there were nine such
cases). For a few items, the mode switch nonresponse biases are significantly
different from zero. For example, in the IVR mode, the mode switch nonre-
sponse bias for the percentage with a GPA less than 2.5 is 1.8 (p =
0.04). In the Web mode, the mode switch nonresponse bias in the percentage
ever donating to the University of Maryland and the percentage who are dues-
paying alumni members is 4.1 (p = 0.02) and 3.0 (p = 0.04), respectively. As
expected, the bias estimates for measurement error for the self-administered
modes are generally lower than those for the CATI cases, although there
are some exceptions. For some of the items, the direction of the measurement
error or nonresponse error for a particular estimate is not the same for all three
modes. For example, for the estimated proportion of GPAs above 3.5, the bias is
negative for IVR but positive for Web administration and CATI. Overall,
though, the self-administered modes appear to be better at eliciting accurate
responses to the sensitive items, particularly those involving undesirable characteristics (as Kreuter, Presser, and Tourangeau 2008 also found in looking at these data).

2. The years-since-birth item was asked in the CATI mode only as part of the screener and is
removed from the remaining analysis of the impact of the switch in modes.
Table 3. Bias Estimates for Mode Switch Nonresponse, Measurement Error, and Total Bias after Switch by Survey Statistic and Mode of Data Collection (standard errors in parentheses)
Columns: Mode Switch Nonresponse Bias (CATI, IVR, Web); Measurement Bias (CATI, IVR, Web); Total Bias after Mode Switch (CATI, IVR, Web)
Undesirable Characteristics
GPA < 2.5 0.2 1.8 1.4 8.5 6.7 8.4 8.7 8.5 9.8
(0.3) (0.8)† (1.3) (1.9)† (1.3)† (1.4)† (1.9) (1.6) (1.3)*
At least one D/F 0.4 0.7 0.2 19.0 15.3 12.0 19.4 14.6 11.8
(0.5) (1.4) (1.7) (2.0)† (2.5)† (1.9)† (2.0) (2.5) (2.1)
Dropped a class 0.2 1.3 0.2 21.1 19.4 20.5 21.3 20.7 20.3
(0.4) (0.8) (1.6) (3.4)† (1.9)† (2.4)† (3.5) (2.0) (2.9)
Desirable Characteristics
GPA > 3.5 0.6 0.1 0.5 1.1 1.3 3.3 1.7 1.2 3.8
(0.3) (1.1) (1.4) (2.2) (1.4) (1.4)† (2.1) (1.1) (1.9)*
Honors 0.4 0.0 0.9 4.2 4.6 5.7 4.6 4.6 6.6
(0.1)† (0.8) (1.0) (1.1)† (1.0)† (1.0)† (1.1) (1.2) (1.2)*
Ever donated to UMD 0.4 1.6 4.1 1.9 2.0 1.4 2.3 3.6 2.7
(0.3) (1.0) (1.6)† (3.2) (2.7) (3.9) (3.0) (2.3)* (3.1)*
Donated in last year 0.1 0.1 1.3 3.6 0.9 0.0 3.7 0.8 1.3
(0.3) (1.0) (1.2) (2.3) (2.7) (2.7) (2.4) (2.3) (2.3)
Alumni member 0.1 0.8 3.0 8.9 6.7 6.4 9.0 7.5 9.4
(0.3) (1.0) (1.3)† (1.7)† (1.7)† (1.4)† (1.8) (1.4) (2.0)*

Neutral Characteristics
GPA 0.01 0.01 0.02 0.10 0.08 0.10 0.11 0.09 0.12
(0.04) (0.01) (0.02) (0.03)† (0.02)† (0.01)† (0.03) (0.02) (0.02)*
Years since degree 0.04 0.07 0.12 0.00 0.23 0.08 0.04 0.30 0.04
(0.03) (0.12) (0.10) (0.10) (0.15) (0.11) (0.10) (0.19)* (0.15)

† indicates that the nonresponse or measurement error bias estimate is significantly different from zero, p < 0.05.
*indicates that the overall error introduced after the switch in data collection mode was greater in Web or IVR than in CATI.

Table 4. Bias Estimates, by Survey Statistic and Level of Effort (standard errors in parentheses)
Columns: Noncontact Bias (1–2, 3–5, 6+ calls); Nonresponse Bias (1–2, 3–5, 6+ calls); Measurement Bias (1–2, 3–5, 6+ calls); Total Bias (1–2, 3–5, 6+ calls)
Undesirable Characteristics
GPA < 2.5 3.2 1.6 0.4 0.9 2.1 2.8 7.6 7.2 7.8 11.7 10.9 11.0
(0.05) (0.03) (0.03) (0.07) (0.06) (0.04) (0.06) (0.05) (0.04) (0.03) (0.03) (0.02)
At least one D/F 3.7 2.0 0.2 1.3 0.3 1.7 15.2 15.7 15.2 17.6 17.4 17.1
(0.07) (0.05) (0.04) (0.11) (0.08) (0.07) (0.06) (0.04) (0.04) (0.09) (0.06) (0.06)
Dropped a class 5.3 3.2 1.2 2.4 1.7 1.5 18.5 19.2 20.2 26.2 24.1 22.9
(0.05) (0.04) (0.03) (0.09) (0.07) (0.06) (0.06) (0.05) (0.04) (0.10) (0.08) (0.07)
Desirable Characteristics
GPA > 3.5 4.4 2.6 0.9 1.5 1.8 2.4 0.6 0.9 1.1 6.5 5.3 4.4
(0.06) (0.04) (0.03) (0.10) (0.07) (0.06) (0.04) (0.03) (0.03) (0.08) (0.06) (0.05)
Honors 1.8 1.0 0.5 2.9 2.4 2.4 5.1 5.3 4.9 9.8 8.7 7.8
(0.03) (0.03) (0.02) (0.06) (0.04) (0.03) (0.04) (0.03) (0.02) (0.07) (0.05) (0.03)
Ever donated 6.7 6.2 5.6 9.5 10.0 9.7 1.1 0.4 0.8 15.1 16.6 16.1
to UMD (0.05) (0.03) (0.03) (0.11) (0.08) (0.06) (0.10) (0.08) (0.06) (0.08) (0.05) (0.05)
Donated in last year 4.1 3.5 3.0 3.7 3.6 4.1 0.8 2.7 1.4 8.6 9.8 8.5
(0.04) (0.02) (0.02) (0.07) (0.05) (0.03) (0.06) (0.05) (0.04) (0.06) (0.04) (0.03)
Alumni member 4.6 3.9 2.8 6.3 6.3 6.1 7.0 7.0 7.2 17.9 17.2 16.1
(0.03) (0.03) (0.02) (0.07) (0.05) (0.04) (0.04) (0.03) (0.03) (0.08) (0.05) (0.03)

Neutral Characteristics
GPA 0.07 0.04 0.01 0.03 0.03 0.05 0.09 0.10 0.10 0.19 0.17 0.16
(0.0006) (0.0004) (0.0003) (0.001) (0.0007) (0.0007) (0.0006) (0.0003) (0.0003) (0.0008) (0.0006) (0.0005)
Years since birth 0.69 0.66 0.54 1.03 0.70 0.55 0.38 0.23 0.17 2.10 1.59 1.26
(screener item) (0.009) (0.007) (0.005) (0.02) (0.01) (0.007) (0.005) (0.003) (0.002) (0.01) (0.01) (0.007)
Years since degree 0.24 0.19 0.24 0.52 0.33 0.35 0.20 0.09 0.06 0.96 0.61 0.65
(0.006) (0.005) (0.003) (0.01) (0.007) (0.005) (0.006) (0.004) (0.002) (0.01) (0.007) (0.005)

The final three columns in table 3 show estimates of the overall bias in-
troduced after the assignment to the mode of data collection for the main
questionnaire and reflect the combined effects of mode switch nonresponse
and measurement error.3 These estimates allow us to assess the overall im-
pact of the mode switch. For two estimates about socially undesirable char-
acteristics and one estimate about socially desirable characteristics, the
switch to self-administration (whether by IVR or Web data collection) re-
duced the overall error, but this was not always true for the other seven esti-
mates. For two of the seven items, the IVR respondents show greater absolute
error after the switch than the CATI respondents; and for six of the seven
items, the Web respondents show greater overall error than the CATI respond-
ents. The gains from self-administration are generally smaller for the socially
desirable and neutral items than for those involving the socially undesirable
outcomes, and in some cases these gains are outweighed by the losses due to
mode switch nonresponse bias. The estimates in which the overall bias
increased in the groups switched to self-administration are shown with an
asterisk in table 3.
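To see how the last three columns of table 3 fit together, note that the total bias after the mode switch is, in effect, the signed sum of the two components shown to its left (with item nonresponse omitted, as noted in footnote 3). For example, reading across the Alumni member row as printed, the Web entries combine as 3.0 (mode switch nonresponse bias) plus 6.4 (measurement bias), giving 9.4 (total bias after the mode switch).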
Level of effort and nonresponse and measurement error: Thus, one factor that
sometimes increased nonresponse error but decreased measurement error was
the switch to self-administration. Does the level of effort needed to get the case
also produce a tradeoff between the two forms of error, such that harder-to-
interview cases provide less accurate answers? Table 4 examines this issue,
showing the level of accuracy in the answers by the level of effort needed
to contact sample members and get them to complete the screener. The table
also shows estimates of the noncontact bias and the overall nonresponse bias by
level of effort. The final three columns in table 4 show estimates of the overall
bias by level of effort. (The overall bias estimate represents the difference
between the estimate based on the survey data from the item respondents
and the estimate based on the frame data for the entire sample, that is,
differences between the first and last columns in table 1, broken down
separately for the three level-of-effort groups.)
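A minimal sketch of this level-of-effort breakdown (again in Python with pandas; the call-count column, the answered_item flag, and the other column names are hypothetical stand-ins for the study's records, not its actual variables):

```python
import pandas as pd

def total_bias_by_effort(frame: pd.DataFrame,
                         y_records: str = "y_records",
                         y_report: str = "y_report",
                         calls: str = "n_call_attempts") -> pd.Series:
    """Total bias within each level-of-effort band: the survey-report mean
    among item responders minus the records-based mean for the full sample,
    computed separately for the 1-2, 3-5, and 6+ call groups."""
    bands = pd.cut(frame[calls], bins=[0, 2, 5, float("inf")],
                   labels=["1-2 calls", "3-5 calls", "6+ calls"])
    frame_means = frame.groupby(bands)[y_records].mean()      # frame data, entire sample
    responders = frame[frame["answered_item"] == 1]           # item responders only
    report_means = responders.groupby(bands.loc[responders.index])[y_report].mean()
    return report_means - frame_means
```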
Several patterns are apparent in table 4. First, the noncontact bias decreases in
absolute magnitude as the level of effort increases, a pattern that is consistent
across all 11 estimates. Further, the noncontact bias estimates are negative for
all of the questions involving socially undesirable characteristics and positive
for all of the questions involving socially desirable characteristics, again sug-
gesting that sample members with socially undesirable characteristics are less
likely to be contacted than sample members with socially desirable

3. For brevity, we omit item nonresponse from our assessment of the overall error after the assign-
ment of mode. Our conclusions remain the same whether or not we include this source of error.

Figure 1. Estimated noncontact biases, by number of callbacks. All of the
noncontact bias estimates are closer to zero for six or more callbacks. The top
five trend lines plot estimates involving socially desirable characteristics; the
middle three, estimates involving neutral characteristics; and the bottom three,
estimates involving socially undesirable characteristics.

(For the estimates involving neutral characteristics, the biases are near zero at all levels of effort.) Figure 1 plots the trends in the biases by the
number of callbacks for the 11 survey statistics. All of the estimates converge on
zero—no bias—with more callbacks.
Second, the relationship between level of effort and the overall nonresponse
bias in table 4 does not seem to be consistent. For some of the estimates,
additional callbacks led to a reduction in nonresponse bias (e.g., for the
proportion of sample members who belong to the Alumni Association and
the average years since birth), but for others, additional calls were associated
with greater nonresponse bias (e.g., for the proportion with a GPA higher than
3.5 or the proportion who donated to the university in the last year). The nature
of the questions (whether they involve socially desirable or undesirable or
neutral characteristics) predicts only the direction of the nonresponse bias,
not the size of the bias or the nature of the relation between the level of effort
and the level of nonresponse bias. Third, as with the nonresponse bias, the
relationship between the level of effort and the level of measurement bias is
not consistent across the 11 estimates. Still, an overall trend is evident such
that additional level of effort is associated with increased measurement bias;
this positive relationship between level of effort and measurement bias holds
for nine of the 11 estimates. Even when it is apparent, however, this increase in
measurement bias by level of effort is rather small.
Overall, then, table 4 suggests that additional efforts most clearly affect non-
contact bias (which is hardly surprising) but have smaller, less consistent effects
on the overall level of nonresponse and measurement error. Any increase in the
inaccuracy of the survey answers produced by additional recruitment efforts is
small, so that overall there seems to be a net gain from additional callbacks. For
seven of the 11 statistics, there is a monotonic (though generally small) decline
in the absolute level of overall bias across the three levels of effort.

Discussion
Our results support five conclusions. First, all of the different forms of non-
response had a consistent relationship to the survey estimates, so that the
effects of one form of nonresponse reinforced rather than canceled the effects
of other forms. Second, breaking nonresponse into its various components
was still useful, since the relative importance of the different components var-
ied from one estimate to the next. Third, as some prior investigations of the
relative contributions of different nonsampling errors have found, measure-
ment error tended to be the largest source of error, but in our study this was
true only for the estimates regarding the prevalence of socially undesirable
characteristics; the estimates involving socially desirable characteristics
tended to be dominated by nonresponse error. Fourth, the results show that
switching respondents to a self-administered mode (like IVR or the Web) can
reduce measurement error but may increase overall error because of dropouts
during or after the mode switch. And finally, additional callbacks appeared to
reduce one form of nonresponse error (the bias due to noncontacts) but had
a less consistent relation to other forms of nonresponse error or to measure-
ment error.

Nonresponse error: Our first conclusion is that the various nonresponse biases
we distinguished in our analysis all tended to push the survey estimates in the
same direction. This is apparent from table 2, where, within any given row,
the bias estimates in the first four columns tend to be all negative or all positive.
The alumni who had greater difficulties during their undergraduate years were
harder to contact, more difficult to screen, more likely to drop out during
the switch to the main interview, and less inclined to answer the questions
in the main questionnaire than those who had more successful undergraduate
careers. Although there are a few reversals of sign in table 2, they tend to be
quite small (see, for example, the row with the estimated biases for the mean
years since birth). Some earlier studies have found offsetting effects of
noncontact and refusal (e.g., Kalsbeek, Yang, and Agans 2002), raising the
possibility that these two forms of nonresponse error might sometimes cancel
each other out, but in our study the various forms of nonresponse almost always
reinforce each other. And, in general, the biasing effects of measurement error
also worsen (rather than offset) the biasing effects of nonresponse.
Despite the fact that the different types of nonresponse tended to push the
estimates in the same direction, their relative importance varied from one
estimate to the next. For example, for one of the estimates, dropouts after
the mode switch introduced the largest bias, but for most of the other estimates,
screener refusal seemed to introduce the most nonresponse bias. In general,
screener refusal had the largest absolute impact of the four forms of nonresponse
that we examined (this is true for all but two of the 11 estimates in table 2).
Whenever the topic or sponsorship of a survey is a key factor in sample mem-
bers’ decisions about whether to take part, screening refusal may loom large as
a source of nonresponse error; the screening stage is when sample members
are first likely to become aware of the survey topic and sponsor, and attitudes
toward these features of the survey are likely to be related to the survey var-
iables (Groves et al. 2006; Tourangeau et al. 2009). In our study, we suspect
that the survey topic and sponsor had an effect on whether sample members
took part in either the screening interview or the mode switch. The large non-
response biases for the socially desirable characteristics indicate that those
with positive college experiences (e.g., GPA > 3.5) were more cooperative
than those with less desirable characteristics. People with unfavorable college
experiences or those who had an unfavorable opinion of the sponsor were
more likely to refuse.
More generally, to the extent that noncontact reflects sample members’
deliberate attempts to ward off unwelcome intrusions (for example, by screen-
ing their telephone calls), noncontact and refusal are likely to have similar
effects on survey estimates. Increasingly, noncontact may reflect the same gen-
eral reluctance to comply with requests from outsiders that seems to be behind
the rising rates of refusals in surveys (Brick and Williams 2009). Traditionally,
researchers have thought of noncontact and refusal as separate phenomena
(Groves and Couper 1998); over time, however, they may be converging,
reflecting the same underlying processes.

Measurement error: Our third conclusion is that measurement error can pro-
duce very large biases, especially for sensitive questions about socially undesir-
able characteristics, like flunking or withdrawing from a class. The
measurement biases for the estimates about such undesirable characteristics
range from almost eight to more than 20 percentage points (see the top three
rows of the final column of table 2). For the most part, the measurement errors
are smaller for the estimates based on positive characteristics (like having a high
GPA) and smaller still for the estimates regarding neutral characteristics (years
since graduation). Tourangeau, Groves, and Redline (2010) reach similar
conclusions about the importance of measurement bias in their study of reports
about voting; in that study, the measurement biases were about twice as large as
the nonresponse biases. Beginning with Horvitz (1952), methodological
researchers have demonstrated that measurement error can be a large contrib-
utor to the overall error in survey estimates; that may be especially true when the
survey questions ask respondents to make potentially embarrassing admissions
about themselves. An alternative explanation is that the measurement errors we
observed were due to memory decay rather than social desirability effects. Be-
cause the average time since graduation for the respondents was quite long (al-
most 10 years), respondents may simply have forgotten their classes and grades.
We examined the measurement error biases by graduation year and indeed
found that reporting accuracy was poorer for longer recall periods. However,
the reporting errors were all in one direction; that is, the social desirability bias
seems to get worse the further back in time one goes. This suggests that the
measurement errors were due mostly to deliberate misreporting rather than clas-
sical (i.e., random) forgetting. It may be easier for respondents to downplay
their academic failings when the memory of them is not so fresh.

Tradeoffs between measurement and nonresponse errors: We examined two
possible tradeoffs between measurement error and nonresponse error. One in-
volved the switch from the initial CATI screener to a self-administered mode of
data collection (IVR or Web) for the main questionnaire. Self-administration
often reduces measurement errors for sensitive questions (Tourangeau and
Yan 2007), but the switch to IVR or the Web can encourage dropouts, and this
form of nonresponse may offset gains in accuracy and increase the overall level
of error in the survey estimates. Dropout in this study was substantial: More than
a quarter of the screener respondents never started the main survey, and the drop-
out rate for the main data collection was much higher for the IVR cases (22 percent) than for those assigned to CATI (three percent), and much higher still for the Web cases (42 percent). For some estimates, the biasing effects of this form of
nonresponse more than offset the gains from self-administration (see table 3). So,
switching to a self-administered mode had opposite effects on the two major sour-
ces of nonsampling error examined in our study.
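The arithmetic of this tradeoff can be made concrete with a small simulation. The sketch below is not the authors' analysis; the arm labels, dropout rates, and misreporting rates are assumptions chosen only to mimic the qualitative pattern of interviewer administration (low dropout, more denial) versus self-administration (high dropout at the mode switch, more candor).

# Illustrative simulation (not the authors' code): how dropout at a mode switch and
# under-reporting of a sensitive trait trade off in the bias of a survey estimate.
# All rates below are assumptions, not figures from the study.
import numpy as np

rng = np.random.default_rng(12345)
n = 5_000

# Records-style "truth" for a binary sensitive item, e.g., ever received a failing grade.
truth = rng.random(n) < 0.30
true_mean = truth.mean()

def simulate_arm(drop_if_trait, drop_if_no_trait, misreport_rate):
    """One design arm: differential dropout after the screener, then under-reporting."""
    drop_prob = np.where(truth, drop_if_trait, drop_if_no_trait)
    responded = rng.random(n) >= drop_prob
    reports = truth.copy()
    # Some respondents with the undesirable trait deny it.
    reports[truth & (rng.random(n) < misreport_rate)] = False

    nonresponse_bias = truth[responded].mean() - true_mean
    measurement_bias = reports[responded].mean() - truth[responded].mean()
    total_bias = reports[responded].mean() - true_mean
    return nonresponse_bias, measurement_bias, total_bias

# Hypothetical arms: interviewer administration keeps nearly everyone but elicits more
# denial; a self-administered mode loses more cases at the switch but gets more candor.
arms = {
    "interviewer-administered": dict(drop_if_trait=0.05, drop_if_no_trait=0.02, misreport_rate=0.40),
    "self-administered": dict(drop_if_trait=0.35, drop_if_no_trait=0.25, misreport_rate=0.10),
}
for label, params in arms.items():
    nr, me, tot = simulate_arm(**params)
    print(f"{label:>25}: nonresponse {nr:+.3f}  measurement {me:+.3f}  total {tot:+.3f}")

Under assumptions like these, the self-administered arm trades a smaller measurement bias for a larger nonresponse bias, and whether its total bias ends up smaller or larger than the interviewer arm's depends on the relative sizes of the dropout and misreporting rates, which is the tradeoff at issue here.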
Our final conclusion from the study involved another factor that might have
opposite effects on nonresponse and measurement error. We found that addi-
tional callbacks consistently reduced the bias from noncontact (see figure 1 and
the first three columns of table 4) but had no consistent relation to measurement
error. Regardless of how many contact attempts were needed for a given
respondent, measurement error stayed at about the same level and was affected
most by whether the respondent was in the socially desirable or undesirable
category (Kreuter, Presser, and Tourangeau 2008). Moreover, additional
callbacks did not clearly reduce the overall levels of nonresponse error (see
the middle three columns of table 4), just the noncontact component of it.
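When records data are available for the whole frame, an analysis of this kind can be set up along the following lines. The file name and column names below are placeholders we invented for illustration, not the study's actual variables.

# Sketch (assumed file and column names) of tracking noncontact bias as callbacks accumulate.
import pandas as pd

# One row per sample member, with a records-based indicator of the characteristic of
# interest and the number of call attempts before first contact (NaN if never contacted).
frame = pd.read_csv("alumni_frame.csv")          # hypothetical file
benchmark = frame["record_low_gpa"].mean()       # full-sample, records-based benchmark

rows = []
for max_calls in range(1, int(frame["calls_to_first_contact"].max()) + 1):
    contacted = frame[frame["calls_to_first_contact"] <= max_calls]
    rows.append({
        "max_calls": max_calls,
        # Noncontact bias: how cases contacted within max_calls attempts differ
        # from the full frame on the records value.
        "noncontact_bias": contacted["record_low_gpa"].mean() - benchmark,
        "pct_of_frame_contacted": 100 * len(contacted) / len(frame),
    })

print(pd.DataFrame(rows).round(3))

A table (or plot) built this way shows the noncontact component of the bias shrinking as additional callbacks bring in harder-to-reach cases, even when, as in our data, the measurement error for those later-contacted cases stays at about the same level.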

Conclusions
The 2005 JPSM Practicum survey affords an unusual opportunity to examine
the effects of nonresponse and measurement error on a range of survey estimates,
because high-quality records data are available for both the respondents and
nonrespondents and because it is possible to distinguish several forms of non-
response. Despite the study’s unique features—it is a survey of alumni at a sin-
gle university designed by graduate students and faculty at that university and
featuring a mode experiment—the conclusions seem generally consistent with
prior attempts to assess multiple sources of error in the same survey. The nonsampling errors are large, in some cases very large. For example, the survey
underestimates the proportion of alumni with low GPAs by about 11 percent
and the proportion who ever got a D or an F in a class by about 17 percent (see
table 1). With some variables, the major source of nonsampling error is mea-
surement error, but with others it is nonresponse. We suspect that with many
items asking about undesirable characteristics, measurement error may be the
biggest contributor to overall error and that nonresponse error is more likely to
affect items asking about socially desirable characteristics. For the most part,
the different sources of error in the survey estimates do not cancel each other out
but instead reinforce each other.
Finally, a unique and surprising result from our analysis indicated that
switching respondents to a more private mode of data collection had opposite
effects on nonresponse and measurement error. These opposing effects offset
the advantages of self-administration for some of the estimates (see table 3).
Although many researchers have expressed concern that efforts to reduce non-
response may increase measurement error, our results suggest the opposite:
Efforts to reduce measurement error (by switching respondents to a self-
administered mode of data collection) may increase nonresponse error.
A key question is whether the mode switch should be abandoned. We find
mixed support for this notion. On the one hand, we find that implementing the
mode switch can increase nonresponse error and offset the advantages of self-
administration. Furthermore, implementing the mode switch does not necessar-
ily lead to an overall reduction in bias (see table 3). Thus, it may be better from
a total survey error perspective to forgo the mode switch and use interviewer
administration for all interviews. On the other hand, the mode switch does not necessarily increase the bias either; across estimates, its effects are mixed. Thus, the mode switch may still be appropriate: it does not consistently increase the overall bias, and in some cases it does reduce measurement error. Further investigation is needed into how the mode switch performs, and how its tradeoffs between nonresponse and measurement error play out, for different types of sensitive items, including items more sensitive than those studied here, and for different target populations.

References
American Association for Public Opinion Research (AAPOR). 2009. Standard Definitions: Final
Dispositions of Case Codes and Outcome Rates for Surveys. 6th ed. Lenexa, KS: AAPOR.
Biemer, Paul P. 2001. ‘‘Nonresponse Bias and Measurement Bias in a Comparison of Face-to-face
and Telephone Interviewing.’’ Journal of Official Statistics 17(2):295–320.
Bollinger, Christopher R., and Martin David. 2001. ‘‘Estimation with Response Error and Nonre-
sponse: Food Stamp Participation in the SIPP.’’ Journal of Business and Economic Statistics
19(2):129–42.
Brick, J. Michael, and Douglas Williams. 2009. ‘‘Reasons for Increasing Nonresponse in U.S.
Household Surveys.’’ Paper presented at the Workshop of the Committee on National Statistics,
Washington, DC, December 14.
Cannell, Charles F., and Floyd J. Fowler. 1963. ‘‘Comparison of a Self-enumerative Procedure and
a Personal Interview: A Validity Study.’’ Public Opinion Quarterly 27(2):250–64.
Fricker, Scott. 2007. ‘‘The Relationship between Response Propensity and Data Quality in the Cur-
rent Population Survey and the American Time-use Survey.’’ Unpublished doctoral dissertation.
College Park: University of Maryland.
Fricker, Scott, Mirta Galesic, Roger Tourangeau, and Ting Yan. 2005. ‘‘An Experimental Compar-
ison of Web and Telephone Surveys.’’ Public Opinion Quarterly 69(3):370–92.
Friedman, Esther M., Nancy A. Clusen, and Michael Hartzell. 2003. ‘‘Better Late? Characteristics of
Late Respondents to a Health Care Survey.’’ Proceedings of the Survey Research Methods Sec-
tion of the American Statistical Association (pp. 992–98). Alexandria, VA: American Statistical
Association.
Groves, Robert M., and Mick P. Couper. 1998. Nonresponse in Household Interview Surveys. New
York: Wiley.
Groves, Robert M., Mick P. Couper, Stanley Presser, Eleanor Singer, Roger Tourangeau, Giorgina
P. Acosta, and Lindsay Nelson. 2006. ‘‘Experiments in Producing Nonresponse Bias.’’ Public
Opinion Quarterly 70(5):720–36.
Horvitz, Daniel G. 1952. ‘‘Sampling and Field Procedures in the Pittsburgh Morbidity Survey.’’
Public Health Reports 67(10):1003–12.
Kalsbeek, William D., Juan Yang, and Robert P. Agans. 2002. ‘‘Predictors of Nonresponse in
a Longitudinal Survey of Adolescents.’’ Proceedings of the Survey Research Methods Section
of the American Statistical Association (pp. 1740–45). Alexandria, VA: American Statistical
Association.
Kreuter, Frauke, Stanley Presser, and Roger Tourangeau. 2008. ‘‘Social Desirability Bias in CATI,
IVR, and Web Surveys: The Effects of Mode and Question Sensitivity.’’ Public Opinion Quar-
terly 72(5):847–65.
Kreuter, Frauke, Ting Yan, and Roger Tourangeau. 2008. ‘‘Good Item or Bad—Can Latent Class
Analysis Tell? The Utility of Latent Class Analysis for the Evaluation of Survey Questions.’’
Journal of the Royal Statistical Society, Series A (Statistics in Society) 171(3):723–38.
Krosnick, Jon A. 1991. ‘‘Response Strategies for Coping with the Cognitive Demands of Attitude
Measures in Surveys.’’ Applied Cognitive Psychology 5(3):213–36.
———. 1999. ‘‘Survey Research.’’ Annual Review of Psychology 50(3):537–67.
Olson, Kristen M. 2006. ‘‘Survey Participation, Nonresponse Bias, Measurement Error Bias, and
Total Bias.’’ Public Opinion Quarterly 70(5):737–58.
Schaeffer, Nora C., Judith A. Seltzer, and Marieka Klawitter. 1991. ‘‘Estimating Nonresponse and
Response Bias: Resident and Nonresident Parents’ Reports about Child Support.’’ Sociological
Methods and Research 20(1):30–59.
Tourangeau, Roger, Robert M. Groves, Courtney Kennedy, and Ting Yan. 2009. ‘‘The Presentation
of a Web Survey, Nonresponse, and Measurement Error among Members of a Web Panel.’’ Jour-
nal of Official Statistics 25(3):299–321.
Tourangeau, Roger, Robert M. Groves, and Cleo D. Redline. 2010. ‘‘Sensitive Topics and Reluctant
Respondents: Demonstrating a Link between Nonresponse Bias and Measurement Error.’’ Public
Opinion Quarterly 74(3):413–32.
Tourangeau, Roger, Darby M. Steiger, and David Wilson. 2002. ‘‘Self-administered Questions by
Telephone: Evaluating Interactive Voice Response.’’ Public Opinion Quarterly 66(2):265–78.
Tourangeau, Roger, and Ting Yan. 2007. ‘‘Sensitive Questions in Surveys.’’ Psychological Bulletin
133(5):859–83.
Triplett, Timothy, Johnny Blair, Teresa Hamilton, and Yun Chiao Kang. 1996. ‘‘Initial Cooperators
vs. Converted Refusers: Are There Response Behavior Differences?’’ Proceedings of the Survey
Research Methods Section of the American Statistical Association (pp. 1038–41). Alexandria,
VA: American Statistical Association.
Willimack, Diane K., Howard Schuman, Beth-Ellen Pennell, and James M. Lepkowski. 1995.
‘‘Effects of a Prepaid Nonmonetary Incentive on Response Rates and Response Quality in
a Face-to-face Survey.’’ Public Opinion Quarterly 59(1):78–92.
Wolter, Kirk M. 2009. Introduction to Variance Estimation. New York: Springer.
Yan, Ting, Roger Tourangeau, and Zac Arens. 2004. ‘‘When Less Is More: Are Reluctant Respond-
ents Poor Reporters?’’ Proceedings of the Survey Research Methods Section of the American
Statistical Association (pp. 4633–51). Alexandria, VA: American Statistical Association.