
Benchmark Model Specification Sheet

Data Used
The Benchmark model uses a variety of open source data and some closed, proprietary
data to make its predictions. Among the open source data used are:

- Census data, obtained from IPUMS:
o How we use it: To obtain demographic variables (such as race, income,
population, and age distribution) and economic data on industries.

- National Generic Ballot, obtained from FiveThirtyEight:
o How we use it: To adjust the national environment for all of the races in the
model.

- Historical election data for districts, obtained from Daily Kos:
o How we use it: To determine the presidential voting history of a district for
the model, and establish how it differs from the congressional voting history.

- Incumbency and congressional election data, obtained from Ballotpedia:
o How we use it: To handicap incumbents properly and to see if they tend to
over-perform or under-perform the presidential margins in their district.

- Cook PVI values, obtained from The Cook Political Report:
o How we use it: To give a standardized lean to each district in the model.

- Party Affiliation Changes, obtained from proprietary data:
o How we use it: To see whether a district is trending toward one party or the
other in its partisan lean.

- Polling and Trump Approval, obtained from pollsters' websites:
o How we use it: Sometimes, local polling exists for the races. The model can
take these polls into account. Trump approval is also useful at the state level.

All data are collected and arranged in an HLM (hierarchical linear model) format, with
districts nested in states. This accounts for the differing political landscape of each state
through statewide variables, while each district also has its own set of variables that are
controlled for when making a prediction.
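The nested structure described above can be sketched as follows. This is only an illustration of how state-level and district-level terms combine into a predicted margin; the coefficients, variable names, and incumbency bonus here are hypothetical and are not the actual STATA specification.

```python
# Illustrative sketch of the districts-nested-in-states structure.
# All coefficients and state effects below are hypothetical placeholders.

STATE_EFFECTS = {"PA": -1.2, "TX": 2.5}  # hypothetical statewide adjustments

def predict_margin(state, district_pvi, gop_incumbent, national_env):
    """Predicted Republican margin in points: positive favors the Republican."""
    margin = national_env                      # national generic-ballot environment
    margin += STATE_EFFECTS.get(state, 0.0)    # state-level (group) effect
    margin += district_pvi                     # district's standardized partisan lean
    margin += 2.0 if gop_incumbent else 0.0    # hypothetical incumbency bonus
    return margin

print(predict_margin("PA", 4.0, True, -2.0))  # -2.0 - 1.2 + 4.0 + 2.0 = 2.8
```

In the real model the state effects are estimated jointly with the district-level coefficients rather than fixed in advance; the point of the sketch is only the nesting.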

Ultimately, running the model in STATA will yield a predicted margin for the Republican
candidate. If that margin is negative, then the Democrat is favored to win by that amount. If the
margin is positive, the Republican is favored to win. We classify our results into general
categories: Tossup (0-3%), Lean (3-6%), and Likely (6-12%).

A special word on local polling: Sometimes, local polling is done on House races, and it
needs to be incorporated into the model. Inserting a dummy variable for polling into the model
does not work well for races where there is no polling at all.

The best way to do this, I have found, is to treat polling as a separate input and average
the polling average with the model's prediction. Generally, the two are not far apart.
In cases where they are far apart, this approach lets us project whether the polling is
overestimating a candidate. We weight this average based on the quality of the polling.
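The blending step above can be sketched as a weighted average. The sheet says only that the average is weighted by poll quality, so the single `poll_weight` parameter here is an assumed form of that weighting.

```python
def blend(model_margin, poll_avg, poll_weight):
    """Combine the model's predicted margin with the local polling average.
    poll_weight in [0, 1] stands in for polling quality; the exact
    weighting scheme is an assumption, not the sheet's specification."""
    if not 0.0 <= poll_weight <= 1.0:
        raise ValueError("poll_weight must be between 0 and 1")
    return poll_weight * poll_avg + (1.0 - poll_weight) * model_margin

# Model says R+5, polls say R+1, equal weight:
print(blend(5.0, 1.0, 0.5))  # 3.0
```

With no usable polling, `poll_weight` would simply be 0 and the model's margin passes through unchanged.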

It is important to note that there are some things that the model does not take into
account. Most notably, candidate quality is not something we are modeling.

Local polling will mostly capture candidate quality, so in races where local polling exists,
this is less of a problem. In races with no local polling and little national coverage, we do not
feel it is prudent to decide for ourselves which candidates are quality candidates, so we report
the model's numbers as is, regardless of the quality of the candidate.