You are on page 1of 5

CBS News/New York Times Battleground Tracker

August 20, 2014


Introduction
The Battleground Tracker is a four-wave panel study conducted by YouGov for CBS
News and the New York Times Upshot during the 2014 midterm elections. The
survey provides estimates for voting on every race for U.S. Senate, Governor, and the
House of Representatives. Interviews are conducted with samples of online survey
panelists that have been selected and weighted to be representative of registered
voters. This document describes in detail the statistical methodology used in the
Battleground Tracker panel.
There are three basic steps in data collection:
1. Creation of a frame representing the U.S. population that includes detailed
demographic data (from the U.S. Bureau of the Census) and 2012 voting
behavior (from the 2012 News Election Pool Exit Poll).
2. A target sample of the desired size is selected from the frame using
probability sampling.
3. A matched sample is selected from a pool of panelists with similar
demographics and voting patterns to those in the target sample.
The pool of panelists come from online opt-in panels. Matched sampling is a method
of purposive selection designed to produce a sample representative of the target
population. The general procedures described here have been used at YouGov in the
U.S. since 2006 for the Cooperative Congressional Election Surveys (see
http://projects.iq.harvard.edu/cces/home) with some modifications described
below.
Sample Design

The target population for the CBS News/New York Times Battleground Tracker is
registered voters for 2014 midterm elections in the U.S., excluding D.C. Oversamples
of (a) 66 competitive congressional districts, and (b) 15 smaller states where there
are Senate or Governors races which, if sampled proportionately, would have few
interviews. (The population of Wyoming, for example, is smaller than the average
congressional district.)
The sample is intended to be representative of the target population (all registered
voters in the U.S.). In conventional probability sampling, the sample is usually
drawn with minimal information about respondents prior to selection. Random
sampling ensures that those selected will be representative of the population, in
theory, though in practice, it has been shown that because of the lower response
rates today they can be effectively opt-in as well. Instead, we use purposive
sampling (selecting respondents based upon their characteristics) to obtain a
sample that is constructed to be representative of the population in terms of a
specified set of characteristics.
The first step in the sample selection process is to construct a frame of detailed
individual level records of members of the target population. The frame is based on
official U.S. Census Bureau statistics and past voting patterns as found in the 2012
Presidential Election Exit Poll conducted by the News Election Pool (NEP), a
consortium of news organizations.
The following sources were used to create the frame:
2010 US Census Congressional District Summary File (113
th
Congress) (Technical
Documentation, April 2013) provides a cross-classification of age, gender and
race (non-Hispanic whites, blacks, Hispanics, Asian, and all others) based on the
2010 U.S. Census.
2012 American Community Survey (ACS) Public Use Microsample provides data
on education, citizenship, marital status, employment status, and home
ownership. These variables were imputed onto the 2010 Census data using
Geocorr12 (a geographical database maintained by the Missouri Census Data
Center) to estimate geographic overlap.
Current Population Survey (CPS) November 2012 Registration and Voting
Supplement provides data on voter registration and turnout in the November
2012 U.S. Presidential election. These data were imputed onto the partial frame
created in steps 1 and 2 using multinomial logit models.
NEP Exit Poll was used to impute 2012 candidate choice onto the records for
voters in the previous step based upon demographic and geographic
characteristics common to both sources. Probabilities were estimated using
multinomial logit models.
A post-stratification weight for the frame was computed by raking the imputed data
to match turnout and vote counts at the Congressional District (CD) level as well as
ACS demographics. This preserves the joint distribution of age by gender and race
by education found in the ACS sample. To eliminate the weights, the data were then
resampled with probabilities inversely proportional to the raking weights.
The resulting frame includes the following data on each individual level record:
Congressional district, Census demographics (age, gender, race), ACS demographics
(education, citizenship, marital status, employment status, and home ownership),
CPS voter registration and turnout in the 2012 election, and Presidential and Senate
vote from NEP Exit Poll states. This is the basis for sample selection and weighting,
described next.
Sample Selection and Weighting
The sample was selected from a pool of online panelists to approximately match the
characteristics of those on the frame. The primary source of panelists was the
YouGov America panel, supplemented by other online panels. The YouGov panel is
recruited primarily through internet advertising (Google Adwords, Facebook,
banner ads) and referrals. The other panels use mostly co-registration and
advertising for recruitment. Sources and counts of all panelists are included in the
disposition reports for each wave of the Battleground Tracker.
Sample inclusion probabilities (propensity scores) were estimated using a case-
control logistic regression model (where the frame is the control group and the
sample is the case or treatment group) and divided into propensity score deciles.
Finally, the sample was raked to propensity score deciles, demographics, and 2012
Presidential vote, with the weights trimmed to a maximum value of six within each
of 115 strata (defined by congressional district for oversampled districts or rest of
state for the remainder). A national population weight was obtained by multiplying
the raked weights by the ratio of weighted sample size in each stratum to the
estimated number of registered voters in that stratum.
Interviewing Procedures
The survey was conducted using YouGovs web-based survey system. Third party
panelists were selected based upon criteria determined by YouGov and invited via
an email with a link which took them to the YouGov survey system.
To discourage duplicate responses, visitors to the YouGov survey system are
cookied so that repeat visitors can be detected and discarded from the sample.
(Users can delete cookies from their system, but only 3 percent of YouGov panelists
who have taken a previous survey in the past 90 days have done so.)
In addition, interviews were discarded if one of the following conditions occurred:
Breakoffs
More than 50% of the questions were skipped. (Respondents were allowed
to skip any question except for age and voter registration, though most
respondents answered all or nearly all of the questions.)
Interviews that were completed in less than four minutes.
Respondents who said their birth year was prior to 1915 or after 1996.
Respondents who did not provide an address (so we were unable to
determine their congressional district) and other information required for
estimation and weighting (except for education, which was imputed).
Counts of the number of the number of discarded interviews are provided in the
disposition report for each wave of the Battleground Tracker.
Margins of Error
The margin of error is calculated using model-based standard errors, which
estimate the variability of estimates from repeated application of the same
procedures. Model-based standard errors depend on the assumption that responses
are independent and that the selection mechanism is missing at random. (See R.J.A.
Little and D.B. Rubin, Statistical Analysis with Missing Data, 2
nd
ed., Wiley, 2002.)
This means that the survey responses are conditionally independent of the selection
mechanism conditional upon the matching and weighting variables; that is, given
any specific combination of matching and weighting variables, we assume that
panelists have the same likelihood of voting (or answering other questions in the
survey) as non-panelists with the same characteristics. It does not assume that the
data come from a probability sample with known probabilities of selection.
Standard errors are computed using formula (29) in J.K. Kim and M.K. Riddles,
Some theory for propensity-score-adjustment estimators in Survey Sampling,
Survey Methodology (December 2012), vol 38, pp. 157-165. The estimated
probabilities in equation (24) of Kim and Riddles uses the reciprocal of the final
raking weight and the variance estimates are averaged for each of the vote variables.

House Election Model
Because many of the individual congressional district samples are small,
hierarchical regression models were estimated for turnout and congressional vote
as a function of demographics and past vote, with the coefficients varying by
congressional district. In the hierarchical model, the coefficients were modeled as a
function of individual respondent characteristics (age, race, gender, education, 2012
Presidential vote, and 2014 Senate vote) and district level characteristics (past vote,
incumbency status, and election spending), using a multinomial logit model. The
estimated probabilities were then imputed to all units on the frame. The sample
units were then raked to the marginals on the frame, so that the weighted sample
counts are equal to the average model imputations. This technique uses data from
voters with similar characteristics to determine the final weights.
The main differences between this procedure and the previous CCESwere (a) the
construction of the new Congressional district frame with precise CD-level
demographics and voting behavior, and (b) the use of the hierarchical logistic
regression model to obtain better estimates of voting for House candidates even
when the sample size in a district is small.
Dispositions
Wave 1 Wave 2
Completed interviews
Both wave 1 and wave 2 68974 68974
Wave 1 only 38649 0
Wave 2 only 0 39751
TOTAL (should match toplines and xtabs) 107623 108725
Discarded interviews
Duplicates 7639 6822
Breakoffs 4624 4520
Data cleaning (too fast, too many skips, bad data) 2014 4636
TOTAL 14277 15978
Total interviews started (excludes screenouts for non-registered, etc.) 121900 124703
Screenout, ineligible 39515 15704
Total survey starts 161415 140407
Source of Panelists (excluding breakoffs))
YouGov America 73898
Critical Mix 8725
Toluna 8350
Lightspeed/GMI 7281
MyPoints 5116
Research Now 3810
Usamp 1994

You might also like