
The Building Blocks of Social Scientific Research: Measurement

CHAPTER OBJECTIVES
5.1 Discuss the importance of operationalization in hypothesis measurement.

5.2 Explain why measurements of political phenomena must correspond closely to the original meaning of a researcher's concepts.

5.3 Summarize the ways in which accurate measurements must be reliable and valid.

5.4 Describe different levels of measurement and their importance for measurement precision.

5.5 Identify different types of multi-item measures.

IN THE PREVIOUS CHAPTERS, WE DISCUSSED the beginning stages of


political science research projects: the choice of research topics, the formula-
tion of scientific explanations, the development of testable hypotheses, and
the definition of concepts. In this chapter, we take the next step toward test-
ing hypotheses empirically. Before testing hypotheses, we must understand
some issues involving the measurement of the concepts we have decided
to investigate and how we record systematic observations using numerals or
scores to create variables that represent the concepts for analysis.

In chapter 2, we said that scientific knowledge is based on empirical research. In
order to test empirically the accuracy and utility of a scientific explanation for a
political phenomenon, we will have to observe and measure the presence of the
concepts we are using to understand that phenomenon. Furthermore, if this test
is to be adequate, our measurements of the political phenomenon must be as
accurate and precise as possible. The process of measurement is important because
it provides the bridge between our proposed explanations and the empirical world
they are supposed to explain. How researchers measure their concepts can have a
significant impact on their findings; differences in measurement can lead to totally
different conclusions.

Lane Kenworthy and Jonas Pontusson's investigation of income inequality in afflu-


ent countries illustrates well the impact on research findings of how a concept is
measured. 1 One way to measure income distribution is to look at the earnings of
full-time-employed individuals and to compare the incomes of those at the top
and the bottom of the earnings distribution. Kenworthy and Pontusson argued
that it is more appropriate to compare the incomes of households than incomes of
individuals. The unemployed are excluded from the calculations of individual earn-
ings inequality, but households include the unemployed. Also, low-income work-
ers disproportionately drop out of the employed labor force. Using working-age
household income reflects changes in employment among household members.
Kenworthy and Pontusson found that when individual income was used as a basis
for measuring inequality, inequality had increased the most in the United States,
New Zealand, and the United Kingdom, all liberal market economies. They further
found that income inequality had increased significantly more in these countries
than in Europe's social market economies and Japan. When household income was
used, the data indicated that inequality had increased in all countries with the
exception of the Netherlands.

Another example involves the measurement of turnout rates (discussed in


chapter 1). Political scientists have investigated whether turnout rates in the United
States have declined in recent decades. 2 The answer may depend on how the num-
ber of eligible voters is measured. Should it be the number of all citizens of voting
age, or should this number be adjusted to take into account those who are not
eligible to vote, or should the turnout rate be calculated using just the number of
registered voters as the potential voting population?

The researchers discussed in chapter 1 measured a variety of political phenomena,


some of which posed greater challenges than others. Milner, Poe, and Leblang wanted to measure

1 Lane Kenworthy and Jonas Pontusson, "Rising Inequality and the Politics of Redistribution in
Affluent Countries," Perspectives on Politics 3, no. 3 (2005): 449-71. Available at
http://www.u.arizona.edu/~lkenwor/pop2005.pdf
2 See Walter Dean Burnham, "The Turnout Problem," in Elections American Style, ed. A. James
Reichley (Washington, D.C.: Brookings Institution, 1987), 97-133; Michael P. McDonald and
Samuel L. Popkin, "The Myth of the Vanishing Voter," American Political Science Review 95,
no. 4 (2001): 963-74. Available at http://elections.gmu.edu/APSR%20McDonald%20and_Popkin_2001.pdf

three different types of human rights: personal integrity or security rights, sub-
sistence rights, and civil and political rights. Each of these types of rights has
multiple dimensions. For example, civil and political rights consist of both civil
liberties, such as freedom of speech, and economic liberties, including pri-
vate property rights. Jeffrey A. Segal and Albert D. Cover measured both the polit-
ical ideologies and the written opinions of US Supreme Court justices in cases
involving civil rights and liberties. 3 Valerie J. Hoekstra measured people's opinions
about issues connected to Supreme Court cases and their opinions about the
Court. 4 Richard L. Hall and Kristina Miler wanted to measure oversight activity
by members of Congress, the number of times they were contacted by lobbyists,
and whether members of Congress and lobbyists were pro-regulation or antireg-
ulation. 5 And Stephen Ansolabehere, Shanto Iyengar, Adam Simon, and Nicholas
Valentino measured the intention to vote reported by study participants to see
if it was affected by exposure to negative campaign advertising. 6 In each case,
some political behavior or attribute was measured so that a scientific explanation
could be tested. All of these researchers made important choices regarding their
measurements.

Devising Measurement Strategies


As we pointed out in chapter 4, researchers must define the concepts they use
in their hypotheses through conceptualization. They also must decide how to
measure the presence, absence, or amount of these concepts in the real world.
Political scientists refer to this process as operationalization, or providing an
operational definition of their concepts. Operationalization is deciding how to
record empirical observations of the occurrence of an attribute or a behavior using
numerals or scores.

Let us consider, for example, a researcher trying to explain the existence of democ-
racy in different nations. If the researcher were to hypothesize that higher rates
of literacy make democracy more likely, then a definition of two concepts-lit-
eracy and democracy-would be necessary. The researcher could then develop a

3 Jeffrey A. Segal and Albert D. Cover, "Ideological Values and the Votes of U.S. Supreme Court
Justices," American Political Science Review 83, no. 2 (1989): 557-65. Available at
http://www.uic.edu/classes/pols/pols200mm/Segal89.pdf
4 Valerie J. Hoekstra, Public Reaction to Supreme Court Decisions (New York: Cambridge University
Press, 2003).
5 Richard L. Hall and Kristina Miler, "What Happens after the Alarm? Interest Group Subsidies to
Legislative Overseers," Journal of Politics 70, no. 4 (2008): 990-1005.
6 Stephen Ansolabehere, Shanto Iyengar, Adam Simon, and Nicholas Valentino, "Does Attack
Advertising Demobilize the Electorate?" American Political Science Review 88, no. 4 (1994):
829-38. Available at http://weber.ucsd.edu/~tkousser/Ansolabehere.pdf

strategy, based on the definitions of the two concepts, for measuring the existence
and amount of both -attributes in nations.

Suppose literacy was defined as "the completion of six years of formal education"


and democracy was defined as "a system of government in which public officials
are selected in competitive elections." These definitions would then be used to
develop operational definitions of the two concepts. These operational definitions
would indicate what should be observed empirically to measure both literacy and
democracy, and they would indicate specifically what data should be collected to
test the researcher's hypothesis. In this example, the operational definition of liter-
acy might be "those nations in which at least 50 percent of the population has had
six years of formal education, as indicated in a publication of the United Nations,"
and the operational definition of democracy might be "those countries in which the
second-place finisher in elections for the chief executive office has received at least
25 percent of the vote at least once in the past eight years."
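
To see how such operational definitions translate into coding decisions, here is a minimal sketch in Python; the country names, education percentages, and vote shares are invented for illustration, not drawn from any real data source.

# Hypothetical illustration of applying the two operational definitions above.
# "pct_educated" is the percentage of the population with six years of formal
# education; "runner_up_share" is the second-place candidate's best vote share
# in a chief-executive election during the past eight years.
countries = [
    ("Country A", 72.0, 41.0),
    ("Country B", 38.0, 30.0),
    ("Country C", 65.0, 12.0),
]

for name, pct_educated, runner_up_share in countries:
    literate = pct_educated >= 50.0       # operational definition of literacy
    democratic = runner_up_share >= 25.0  # operational definition of democracy
    print(f"{name}: literate={literate}, democratic={democratic}")

Writing the definitions out this explicitly makes the measurement rules easy to inspect, replicate, and debate.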

When a researcher specifies a concept's operational definition, the concept's pre-


cise meaning in a particular research study becomes clear. In the preceding exam-
ple, we now know exactly what the researcher means by literacy and democracy.
Since different people often mean different things by the same concept, operational
definitions are especially important. Someone might argue that defining literacy
in terms of formal education ignores the possibility that people who complete six
years of formal education might still be unable to read or write well. Similarly, it
might be argued that defining democracy in terms of competitive elections ignores
other important features of democracy, such as freedom of expression and citizen
involvement in government activity. In addition, the operational definition of com-
petitive elections is clearly debatable. Is the "competitiveness" of elections based on
the number of competing candidates, the size of the margin of victory, or the num-
ber of consecutive victories by a single party in a series of elections? Unfortunately,
operational definitions are seldom absolutely correct or absolutely incorrect; rather,
they are evaluated according to how well they correspond to the concepts they are
meant to measure.

It is useful to think of arriving at the operational definition as being the last stage in
the process of defining a concept precisely. We often begin with an abstract concept
(such as democracy), then attempt to define it in a meaningful way, and finally
decide in specific terms how we are going to measure it. At the end of this process,
we hope to attain a definition that is sensible, close to our meaning of the concept,
and exact in what it tells us about how to go about measuring the concept.

Let us consider another example: imagine that a researcher is interested in why


some individuals are more liberal than others. The concept of liberalism might be
defined as "believing that government ought to pursue policies that provide ben-
efits for the less well-off." The task, then, is to develop an operational definition

that can be used to measure whether particular individuals are liberal or not. The
following question from the General Social Survey might be used to operationalize
the concept:

73A. Some people think that the government in Washington ought to


reduce the income differences between the rich and the poor, perhaps by
raising the taxes of wealthy families or by giving income assistance to the
poor. Others think that the government should not concern itself with
reducing this income difference between the rich and the poor.

Here is a card with a scale from 1 to 7. Think of a score of 1 as meaning


that the government ought to reduce the income differences between
rich and poor, and a score of 7 as meaning that the government should
not concern itself with reducing income differences. What score between
1 and 7 comes closest to the way you feel? (CIRCLE ONE)7

An abstract concept, liberalism, has now been given an operational definition that
can be used to measure the concept for individuals. This definition is also related
to the original definition of the concept, and it indicates precisely what observa-
tions need to be made. It is not, however, the only operational definition possible.
Others might suggest that questions regarding affirmative action, same-sex mar-
riage, school vouchers, the death penalty, welfare benefits, and pornography could
be used to measure liberalism.
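
As a rough illustration of how such a question becomes a variable, the sketch below recodes hypothetical 1-to-7 responses to this item into three ideological groups. The cut points are our own assumption for the example, not part of the GSS coding scheme.

# Hypothetical recoding of the GSS EQWLTH item (1 = government should
# reduce income differences ... 7 = government should not concern itself).
# The cut points are an illustrative assumption, not official GSS codes.
def liberalism_from_eqwlth(score: int) -> str:
    if not 1 <= score <= 7:
        raise ValueError("EQWLTH responses range from 1 to 7")
    if score <= 3:
        return "liberal"       # favors government redistribution
    if score == 4:
        return "moderate"
    return "conservative"      # opposes government redistribution

print(liberalism_from_eqwlth(2))  # liberal
print(liberalism_from_eqwlth(6))  # conservative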

The important thing is to think carefully about the operational definition you
choose and to try to ensure that the definition coincides closely with the meaning
of the original concept. How a concept is operationalized affects how generaliza-
tions are made and interpreted. For example, general statements about liberals or
conservatives apply to liberals or conservatives only as they have been operation-
ally defined, in this case by this one question regarding government involvement
in reducing income differences. As a consumer of research, you should familiarize
yourself with the operational definitions used by researchers so that you are better
able to interpret and generalize research results.

Examples of Political Measurements:
Getting to Operationalization

Let us take a closer look at some operational definitions used by the political
science researchers referred to in chapter 1, as well as some others. To measure the
strength of a legislator's intervention in air pollution regulations proposed by the

7 Question wording for the variable EQWLTH from GSS 1998 Codebook. Available at
http://www.thearda.com/Archive/Files/Codebooks/GSS1998_CB.asp

Environmental Protection Agency, Hall and Miler coded and counted the number
of substantive comments made by legislators challenging or defending the agency's
proposed air quality regulations during five oversight hearings held in Congress
and during the public comment period. 8 Agencies are required to maintain a
public docket that contains all the comments received during the comment period.
Transcripts were available for each of the hearings. The researchers ended up
with two variables: one was the number of supporting comments; the other was
the number of comments in opposition to the proposed regulation. To measure
constituency interests in each of the members' districts, they measured the number
of manufacturing jobs in each district and created an index of air pollution based on
district levels of PM 10 particulate matter and ground-level ozone (the pollutants
addressed by the proposed regulations). Because Hall and Miler were interested in
investigating whether lobbyists targeted their efforts toward members of Congress
friendly toward the lobbyists' positions, they needed to measure the pro- or
antienvironmental policy positions for each member of Congress, and this variable
had to measure position before the oversight hearings and regulatory comment
period. Fortunately for the researchers, the leaders of the health and environmental
coalition had classified members in terms of their likely support for the rule prior
to the lobbying period and were willing to share their ratings. These measures were
based on legislators' previous voting record on health and environmental issues.

The research conducted by Segal and Cover on the behavior of US Supreme Court
justices is a good example of an attempt to overcome a serious measurement prob-
lem to test a scientific hypothesis. 9 Recall that Segal and Cover were interested,
as many others have been before them, in the extent to which the votes cast by
Supreme Court justices were dependent on the justices' personal political attitudes.
Measuring the justices' votes on the cases decided by the Supreme Court is no
problem; the votes are public information. But measuring the personal political atti-
tudes of judges, independentof theirvotes,is a problem (remember the discussion in
chapter 4 on avoiding tautologies, or statements that link two concepts that mean
essentially the same thing). Many of the judges whose behavior is of interest have
died, and it is difficult to get living Supreme Court justices to reveal their political
attitudes through personal interviews or questionnaires. Furthermore, one ideally
would like a measure of attitudes that is comparable across many judges and that
measures attitudes related to the cases decided by the Court.

Segal and Cover limited their inquiry to votes on civil liberties cases between 1953
and 1987, so they needed a measure of related political attitudes for the judges
serving on the Supreme Court over that same period. They decided to infer the
judges' attitudes from the newspaper editorials written about them in four major

8 Hall and Miler, "What Happens after the Alarm?"


9 Segal and Cover, "Ideological Values and the Votes of U.S. Supreme Court Justices."

daily newspapers from the time each justice was appointed by the president until
the justice's confirmation vote by the Senate. They selected the editorials appearing
in two liberal papers and in two conservative papers. Trained analysts read the edi-
torials and coded each paragraph for whether it asserted that a justice designate was
liberal, moderate, or conservative (or if the paragraph was inapplicable) regarding
"support for the rights of defendants in criminal cases, women and racial minorities
in equality cases, and the individual against the government in privacy and First
Amendment cases."10

Because of practical barriers to ideal measurement, then, Segal and Cover had to
rely on an indirect measure of judicial attitudes as perceived by four newspapers
rather than on a measure of the attitudes themselves. Although this approach may
have resulted in flawed measures, it also permitted the test of an interesting and
important hypothesis about the behavior of Supreme Court justices that had not
been tested previously. Without such measurements, the hypothesis could not have
been tested.

Next, let us consider research conducted by Bradley and his colleagues on the rela-
tionship between party control of government and the distribution and redistri-
bution of wealth. 11 The researchers relied on the Luxembourg Income Study (LIS)
database, which provides cross-national income data over time in OECD (Organ-
isation for Economic Co-operation and Development) countries. 12 They decided,
however, to make adjustments to published LIS data on income inequality. That
data included pensioners. Because some countries make comprehensive provisions
for retirees, retirees in these countries make little provision on their own for retire-
ment. Thus, many of these people would be counted as "poor" before any govern-
ment transfers. Including pensioners would inflate the pretransfer poverty level as
well as the extent of income transfer for these countries. Therefore, Bradley and
his colleagues limited their analysis to households with a head aged twenty-five to
fifty-nine (thus excluding the student-age population as well) and calculated their
own measures of income inequality from the LIS data. They argued that their data
would measure redistribution across income gro:ups, not life-cycle redistributions
of income, such as transfers to students and retired persons. Income was defined as
income from wages and salaries, self-employment income, property income, and
private pension income. The researchers also made adjustments for household size
using an equivalence scale, which adjusts the number of persons in a household
to an equivalent number of adults. The equivalence scale takes into account the
economies of scale resulting from sharing household expenses.

10 Ibid., 559.
11 David Bradley, Evelyne Huber, Stephanie Moller, François Nielsen, and John D. Stephens,
"Distribution and Redistribution in Postindustrial Democracies," World Politics 55, no. 2 (2003):
193-228.
12 For information on the LIS database, see http://www.lisdatacenter.org/our-data/lis-database/

Martin P. Wattenberg and Craig Leonard Brians measured exposure by responses


to a survey question that asked respondents if they recalled a campaign ad and
whether it was negative or positive in tone. 13 Finally, Ansolabehere and his
colleagues measured exposure to negative campaign ads in the 1990 Senate elec-
tions by accessing newspaper and magazine articles about the campaigns and deter-
mining how the tone of the campaigns was described in these articles.14

The cases discussed here are good examples of researchers' attempts to mea-
sure important political phenomena (behaviors or attributes) in the real world.
Whether the phenomenon in question was judges' political attitudes, income
inequality, the tone of campaign advertising, or the attitudes and behavior of
legislators, the researchers devised measurement strategies that could detect and
measure the presence and amount of the concept in question. These observations
were then generally used as the basis for an empirical test of the researchers'
hypotheses.

To be useful in providing scientific explanations for political behavior, measure-


ments of political phenomena must correspond closely to the original meaning of a
researcher's concepts. They must also provide the researcher with enough informa-
tion to make valuable comparisons and contrasts. Hence, the quality of measure-
ments is judged in regard to both their accuracy and their precision.

The Accuracy of Measurements


Because we are going to use our measurements to test whether or not our
explanations for political phenomena are valid, those measurements must be as
accurate as possible. Inaccurate measurements may lead to erroneous conclusions,
since they will interfere with our ability to observe the actual relationship between
two or more variables.

There are two major threats to the accuracy of measurements. Measures may be
inaccurate because they are unreliable and/or because they are invalid.

Reliability
Reliability describes the consistency of results from a procedure or measure in
repeated tests or trials. In the context of measurement, a reliable measure is one

13 Martin P. Wattenberg and Craig Leonard Brians, "Negative Campaign Advertising: Demobilizer or
Mobilizer?" American Political Science Review 93, no. 4 (1999): 891-99. Available at
http://weber.ucsd.edu/~tkousser/Wattenberg.pdf
14 Ansolabehere, Iyengar, Simon, and Valentino, "Does Attack Advertising Demobilize the Electorate?"

that produces the same result each time the measure is used. An unreliable measure
is one that produces inconsistent results-sometimes higher, sometimes lower.15

Suppose, for example, you want to measure support for the president among col-
lege students. You select two similar survey questions (Ql and Q2) and ask the
participants in a random sample of students to answer each question. The results
from this sample were 50 percent support for the president using Ql and 50 per-
cent support for the president using Q2. But what might you find if you ask the
same questions of multiple random samples of students? Will the results from each
question remain consistent, assuming that the samples are identical? If a second
sample of students is polled, you may find the same result, 50 percent, for Ql but
60 percent for Q2. If you were to ask Ql of multiple random samples of students
and the result was consistently 50 percent, you could assert that your measure, Ql,
is reliable. If Q2 were asked of multiple random samples of students and each sam-
ple of students returned different answers ranging somewhere between 40 percent
and 60 percent, you could conclude that Q2 is less reliable than Ql because Q2
generates inconsistent results each time it is used.
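
The logic of comparing Q1 and Q2 can be mimicked in a short simulation. The sketch below, which assumes the numpy library and uses invented numbers, draws ten hypothetical sample results for each question; the smaller spread for Q1 is what we mean in calling it the more reliable measure.

import numpy as np

rng = np.random.default_rng(0)

# Each draw stands in for the percentage supporting the president
# in one random sample of students (hypothetical values).
q1_results = rng.normal(loc=50, scale=1, size=10)  # consistent question
q2_results = rng.normal(loc=50, scale=7, size=10)  # inconsistent question

# A smaller standard deviation across repeated samples indicates
# the more reliable question.
print(f"Q1 spread: {q1_results.std():.1f} points")
print(f"Q2 spread: {q2_results.std():.1f} points")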

Likewise, you can assess the reliability of procedures as well. Suppose you are given
the responsibility of counting a stack of one thousand paper ballots for some public
office. The first time you count them, you obtain a particular result. But as you were
counting the ballots, you might have been interrupted, two or more ballots might
have stuck together, some might have been blown onto the floor, or you might have
written down the totals incorrectly. As a precaution, then, you count them five more
times and get four other people to count them once each as well. The similarity of
the results of all ten counts would be an indication of the reliability of the counting
process.

Similarly, suppose you wanted to test the hypothesis that the New York Times is
more critical of the federal government than is the Wall Street Journal. This would
require you to measure the level of criticism found in articles in the two papers.
You would need to develop criteria or instructions for identifying or measuring
criticism. The reliability of your measuring scheme could be assessed by having
two people read all the articles, independently rate the level of criticism in them
according to your instructions, and then compare their results. Reliability would be
demonstrated if both people reached similar conclusions regarding the content of
the articles in question.

The reliability of political science measures can be calculated in many different


ways. We describe three methods here that are often associated with written test
items or survey questions, but the ideas may be applied in other research contexts.

15 Edward G. Carmines and Richard A. Zeller, Reliability and Validity Assessment, A Sage University
Paper: Quantitative Applications in the Social Sciences no. 07-017 (Beverly Hills, Calif.: Sage, 1979).

The test-retest method involves applying the same "test" to the same observations
after a period of time and then comparing the results of the different measure-
ments. For example, if a series of questions measuring liberalism is asked of a group
of respondents on two different days, a comparison of their scores at both times
could be used as an indication of the reliability of the measure of liberalism. We
frequently engage in test-retest behavior in our everyday lives. How often have you
stepped on the bathroom scale twice in a matter of seconds?
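
One common way to summarize a test-retest comparison is to correlate the same respondents' scores at the two times. A minimal sketch, using invented liberalism scores and assuming numpy:

import numpy as np

# Hypothetical liberalism scores for five respondents, measured twice.
day_one = np.array([4, 7, 2, 6, 5])
day_two = np.array([4, 6, 2, 7, 5])

# The correlation between the two administrations summarizes
# test-retest reliability (closer to 1 means more reliable).
r = np.corrcoef(day_one, day_two)[0, 1]
print(f"test-retest correlation: {r:.2f}")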

The test-retest method of measuring reliability may be both difficult and problem-
atic, since one must measure the phenomenon at two different points. It is possible
that two different results may be obtained because what is being measured has
changed, not because the measure is unreliable. For example, if your bathroom
scale gives you two different weights within a few seconds, the scale is unreliable,
as your weight cannot have changed. However, if you weigh yourself once a week
for a month and find that you get different results each time, is the scale unreliable,
or has your weight changed between measurements? A further problem with the
test-retest check for reliability is that the administration of the first measure may
affect the second measure's results. For instance, the difference between SAT Rea-
soning Test scores the first and second times that individuals take the test may not
be assumed to be a measure of the reliability of the test, since test takers might alter
their behavior the second time as a result of taking the test the first time (e.g., they
might learn from their first experience with the test).

The alternative-form method of measuring reliability also involves measuring the


same attribute more than once, but it uses two different measures of the same
concept rather than the same measure. For example, a researcher could devise
two different sets of questions to measure the concept of liberalism, ask the same
respondents questions at two different times using one set of questions the first
time and the other set of questions the second time, and compare the respondents'
scores. Using two different forms of the measure reduces the chance that the second
scores are influenced by the first measure, but it still requires the phenomenon to
be measured twice. Depending on the length of time between the two measure-
ments, what is being measured may change.

The split-halves method of measuring reliability involves applying two measures


of the same concept at the same time. The results of the two measures are then
compared. This method avoids the problem that the concept being measured may
change between measures. The split-halves method is often used when a multi-
item measure can be split into two equivalent halves. For example, a researcher
may devise a measure of liberalism consisting of the responses to ten questions on
a public opinion survey. Half of these questions could be selected to represent one
measure of liberalism, and the other half selected to represent a second measure of
liberalism. If individual scores on the two measures of liberalism are similar, then
the ten-item measure may be said to be reliable by the split-halves approach.
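
Continuing the ten-item example, here is a minimal sketch of the split-halves check with simulated, entirely hypothetical data (assuming numpy): each respondent's odd-numbered items form one half-score, the even-numbered items the other, and the two half-scores are then correlated.

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 100 respondents, ten yes/no liberalism items that
# all tap the same underlying attitude plus random noise.
trait = rng.normal(size=(100, 1))
items = (trait + rng.normal(scale=0.8, size=(100, 10)) > 0).astype(int)

# Split the ten items into two five-item halves and score each half.
half_a = items[:, 0::2].sum(axis=1)  # items 1, 3, 5, 7, 9
half_b = items[:, 1::2].sum(axis=1)  # items 2, 4, 6, 8, 10

# A high correlation between the half-scores suggests the full
# ten-item measure is reliable by the split-halves approach.
r = np.corrcoef(half_a, half_b)[0, 1]
print(f"split-half correlation: {r:.2f}")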

The test-retest, alternative-form, and split-halves methods provide a basis for calcu-
lating the similarity of results of two or more applications of the same or equivalent
measures. The less consistent the results are, the less reliable the measure. Polit-
ical scientists take very seriously the reliability of the measures they use. Survey
researchers are often concerned about the reliability of the answers they receive. For
example, respondents' answers to survey questions often vary considerably when
the instruments are given at two different times. 16 If respondents are not concen-
trating or taking the survey seriously, the answers they provide may as well have
been pulled out of a hat.

Now, let us return to the example of measuring your weight using a home scale.
If you weigh yourself on your home scale, then go to the gym and weigh yourself
again there, and get the same number (alternative forms test of reliability), you
may conclude that your home scale is reliable. But what if you get two different
numbers? Assuming your weight has not changed, what is the problem? If you go
back home immediately and step back on your home scale and find that it gives
you a measurement that is different from the first it gave you, you could conclude
that your scale has a faulty mechanism, is inconsistent, and therefore is unreliable.
However, what if your bathroom scale gives you the same weight as the first time?
It would appear to be reliable. Maybe the gym scale is unreliable. You could test
this out by going back to the gym and reweighing yourself. If the gym scale gives a
reading different from the one it gave the first time, then it is unreliable. But what if
the gym scale gives consistent readings? Each scale appears to be reliable (the scales
are not giving you different weights at random), but at least one of them is giving
you a wrong measurement (that is, not giving you your correct weight). This is a
problem of validity.

Validity
Essentially, a valid measure is one that measures what it is supposed to measure.
Unlike reliability, which depends on whether repeated applications of the same or
equivalent measures yield the same result, validity refers to the degree of corre-
spondence between the measure and the concept it is thought to measure.

Let us consider first an example of a measure whose validity has been questioned:
voter turnout. Many studies examine the factors that affect voter turnout and, thus,

16 Philip E. Converse, "The Nature of Belief Systems in Mass Publics," in Ideology and Discontent,
ed. David E. Apter (New York: Free Press of Glencoe, 1964); Pauline Marie Vaillancourt, "Stability
of Children's Survey Responses," Public Opinion Quarterly 37, no. 3 (1973): 373-87; J. Miller
McPherson, Susan Welch, and Cal Clark, "The Stability and Reliability of Political Efficacy: Using
Path Analysis to Test Alternative Models," American Political Science Review 71, no. 2 (1977):
509-21; and Philip E. Converse and Gregory B. Markus, "Plus ça change ...: The New CPS Election
Study Panel," American Political Science Review 73, no. 1 (1979): 32-49.

require an accurate measurement of voter turnout. One way of measuring voter


turnout is to ask people if they voted in the last election-self-reported voting.
However, given the social desirability of voting in the United States-wearing the "I
voted" sticker or posting "I voted" on a social media site can bring social rewards-
will nonvoters admit their failure to vote to an interviewer? Some nonvoters may
claim in surveys to have voted, resulting in an invalid measure of voter turnout that
overstates the number of voters. In fact, this is what usually happens. Voter surveys
commonly overestimate turnout by several percentage points. 17

A measure can also be invalid if it measures a slightly or very different concept


than intended. For example, assume that a researcher intends to measure ideology,
conceptualized as an individual's political views on a continuum between conserva-
tive, moderate, and liberal. The researcher proposes to measure ideology by asking
survey respondents, "To which party do you feel closest, the Democratic Party or
the Republican Party?" This measure would be invalid because it fails to measure
ideology as conceptualized. Partisan affinity, while often consistent with ideology,
is not the same as ideology. This measure could be a valid measure of party identi-
fication, but not ideology.

A measure's validity is more difficult to demonstrate empirically than is its reliability


because validity involves the relationship between the measurement of a concept
and the actual presence or amount of the concept itself. Information regarding the
correspondence is seldom abundant. Nonetheless, there are ways to evaluate the
validity of any particular measure. In the following paragraphs we explain several
ways of thinking about validity, including face, content, construct, and interitem
validity.

Face validity may be asserted (not empirically demonstrated) when the measure-
ment instrument appears to measure the concept it is supposed to measure. To
assess the face validity of a measure, we need to know the meaning of the concept
being measured and whether the information being collected is "germane to that
concept." 18 For example, let us return to thinking about how we might meas.ure
political ideology-that is, whether someone is conservative, moderate, or liberal.
Such a measure could be as simple as a question used by the Pew Research Cen-
ter: "Do you think of yourself as conservative, moderate or liberal?"19 On its face,
this measure appears to capture the intended concept, so it has face validity. It
might be tempting to use individuals' responses to a question on party identifica-
tion, but one would be assuming that all Democrats are liberal and all Republicans

17 Raymond E. Wolfinger and Steven J. Rosenstone, appendix A in Who Votes? (New Haven, Conn.: Yale
University Press, 1980).
18 Kenneth D. Bailey, Methods of Social Research (New York: Free Press, 1978), 58.
19 Pew Research Center. Retrieved December 5, 2014. Available at
http://www.pewresearch.org/data-trend/political-attitudes/political-ideology/

are conservative. Also, if the party identification variable included a category for
independents, what would be their ideology? Can you assume they are all moder-
ates? For these reasons, a question measuring party identification would lack face
validity as a measure of ideology.

In general, measures lack face validity when there are good reasons to question the
correspondence of the measure to the concept in question. In other words, assess-
ing face validity is essentially a matter of judgment. If no consensus exists about the
meaning of the concept to be measured, the face validity of the measure is bound
to be problematic.

Content validity is similar to face validity but involves determining the full domain
or meaning of a particular concept and then making sure that all components of
the meaning are included in the measure. For example, suppose you wanted to
design a measure of the extent to which a nation's political system is democratic.
As noted earlier, democracy means many things to many people. Raymond D. Gastil
constructed a measure of democracy that included two dimensions, political rights
and civil liberties. His checklists for each dimension consisted of eleven items. 20
Political scientists are often interested in concepts with multiple dimensions or
complex domains, like democracy, and spend quite a bit of time discussing and
justifying the content of their measures. In order for a measure of Gastil's concep-
tion of democracy to achieve content validity, the measure should capture all eleven
components in the definition.

A third way to evaluate the validity of a measure is by empirically demonstrating


construct validity. Construct validity can be understood in two different ways:
convergent construct validity and divergent construct validity. Convergent con-
struct validity is when a measure of a concept is related to a measure of another
concept with which the original concept is thought to be related. In other words,
a researcher may specify, on theoretical grounds, that two concepts ought to be
related in a positive manner (say, political efficacy with political participation or
education with income) or a negative manner (say, democracy and human rights
abuses). The researcher then develops a measure of each of the concepts and exam-
ines the relationship between them. If the measures are positively or negatively
correlated, then one measure has convergent validity for the other measure. In the
case that there is no relationship between the measures, then the theoretical rela-
tionship is in error, at least one of the measures is not an accurate representation of
the concept, or the procedure used to test the relationship is faulty. The absence
of a hypothesized relationship does not mean a measure is invalid, but the presence
of a relationship gives some assurance of the measure's validity.

20 As discussed in Ross E. Burkhart and Michael S. Lewis-Beck, "Comparative Democracy: The


Economic Development Thesis," American Political Science Review 88, no. 4 (1994): appendix A.

Discriminant construct validity involves two measures that theoretically are


expected not to be related; thus, the correlation between them is expected to be
low or weak. If the measures do not correlate with one another, then discriminant
construct validity is demonstrated.

Let us return to the question of measuring the power of legislative leaders because it
provides a good example of the importance of construct validity. As we pointed out
before, the perceived-influence approach to measuring power is more difficult to
use than the formal-powers approach. Therefore, if the two measures are shown to
have construct validity, operationalizing leadership power using the formal-powers
approach by itself might be a valid way to measure the concept. If the two measures
do not have construct validity, then it would be clear that the two approaches are
not measuring the same thing. Thus, which measure is used could greatly affect the
findings of research into the factors associated with the presence of strong leader-
ship power or on the consequences of such power. These were the very questions
raised by political scientist James Coleman Battista.21 He constructed several mea-
sures of perceived leadership power and correlated them with a measure of formal
power. The results, shown in table 5-1, show that the measure of formal power
correlates only weakly with three measures of perceived power (which, as expected,
correlate well with one another). Therefore, measures of perceived power and the
measure of formal power do not demonstrate convergent construct validity.

A fourth way to demonstrate validity is through interitem association. This is the


type of validity test most often used by political scientists. It relies on the similarity
of outcomes of more than one measure of a concept to demonstrate the validity of
the entire measurement scheme. It is often preferable to use more than one item to
measure a concept-reliance on just one measure is more prone to error or mis-
classification of a case.22

Let us return to the researcher who wants to develop a valid measure of liberalism.
First, the researcher might measure people's attitudes toward (1) welfare, (2) mili-
tary spending, (3) abortion, (4) Social Security benefit levels, (5) affirmative action,
(6) a progressive income tax, (7) school vouchers, and (8) protection of the rights
of the accused. Then the researcher could determine how the responses to each
question relate to the responses to each of the other questions. The validity of the
measurement scheme would be demonstrated if strong relationships existed among
people's responses across the eight questions.

21 James Coleman Battista, "Formal and Perceived Leadership Power in U.S. State Legislatures," State
Politics and Policy Quarterly 11, no. 1 (2011).
22 Joseph A. Gliem and Rosemary R. Gliem, "Calculating, Interpreting, and Reporting Cronbach's
Alpha Reliability Coefficient for Likert-Type Scales" (paper presented at the 2003 Midwest Research
to Practice Conference in Adult, Continuing, and Community Education, Ohio State University,
Columbus, 2003). Available at https://scholarworks.iupui.edu/bitstream/handle/1805/344/
Gliem+&+Gliem.pdf?sequence=1

TABLE 5-1 Correlations of Leadership Power Measures

                                    Formal      Average     % rating    % rating
                                    power       perceived   highest     highest
                                    index       power       (all)       (internal)
Formal power index                  1
Average perceived power             .186        1
Prop. rating highest (all)          .086        .698*       1
Prop. rating highest (internal)     .134        .702*       .634*       1
Combined power index                .915*       .558*       .338*       .375*

*p < .05

Source: James Coleman Battista, "Formal and Perceived Leadership Power in U.S. State
Legislatures," State Politics and Policy Quarterly 11, no. 1 (2011): tab. 2, p. 209. Copyright © The
Author 2011. Reprinted by permission of SAGE Publications.

Note: We show here only the results for Battista's analysis when southern states are excluded.
Because southern states have a history of little two-party competition but may have to contend with
party factions, and the question asked respondents to rate the power of legislative leaders as party
leaders, respondents were likely to give low perceived power ratings. Southern legislative leaders,
however, may be strong chamber leaders, which is what is being measured in states with a history of
two-party competition. See chapter 13 for an explanation of correlation.

The results of such interitem association tests are often displayed in a correlation
matrix. Such a display shows how strongly related each of the items in the mea-
surement scheme is to all the other items. In the hypothetical data shown in table
5-2, we can see that people's responses to six of the eight measures were strongly
related to each other, whereas responses to the questions on protection of the rights
of the accused and school vouchers were not part of the general pattern. Thus, the
researcher would probably conclude that the first six items all measure liberalism
and that, taken together, they are a valid measurement of liberalism.

The figures in table 5-2 are product-moment correlations: numbers that can vary
in value from -1.0 to +1.0 and that indicate the extent to which one variable is
related to another. The closer the correlation is to ±1, the stronger the relationship;
the closer the correlation is to 0.0, the weaker the relationship (see chapter 13 for
a full explanation). The figures in the last two rows are considerably closer to 0.0
than are the other entries, indicating that people's answers to the questions about
school vouchers and rights of the accused did not follow the same pattern as their
answers to the other questions. Therefore, it looks like school vouchers and rights
of the accused are not connected to the same concept of liberalism as measured by
the other questions.
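
A matrix like the one in table 5-2 can be computed directly from respondent-level data. The sketch below, which assumes the pandas library and uses invented responses for just three of the eight items, shows the general recipe; pandas's corr() method returns the pairwise product-moment correlations.

import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Hypothetical respondent-level data for three of the eight items.
# "welfare" and "military_spend" share an underlying attitude;
# "school_vouchers" is deliberately unrelated to it.
n = 200
liberalism = rng.normal(size=n)
data = pd.DataFrame({
    "welfare": liberalism + rng.normal(scale=0.7, size=n),
    "military_spend": liberalism + rng.normal(scale=0.9, size=n),
    "school_vouchers": rng.normal(size=n),
})

# Pairwise product-moment (Pearson) correlations among the items.
print(data.corr().round(2))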

TABLE 5-2 Interitem Association Validity Test of a Measure of Liberalism

                     Welfare  Military  Abortion  Social    Affirmative  Income  School    Rights of
                              spending            Security  action       tax     vouchers  accused
Welfare              1
Military spending    .26      1
Abortion             .71      .60       1
Social Security      .69      .91       .61       1
Affirmative action   .63      .38       .59       .69       1
Income tax           .48      .67       .75       .39       .51          1
School vouchers      .28      .08       .19       .03       .30          -.07    1
Rights of accused    -.05     .14       -.14      .10       .23          .18     .45       1

Note: Hypothetical data.

Content and face validity are difficult to assess when agreement is lacking on the
meaning of a concept, and construct validity, which requires a well-developed the-
oretical perspective, usually yields a less-than-definitive result. The interitem asso-
ciation test requires multiple measures of the same concept. Although these validity
"tests" provide important evidence, none of them is likely to support an unequivo-
cal decision concerning the validity of particular measures.

Problems with Reliability and Validity


in Political Science Measurement
Survey researchers often want to measure respondents' household income. Mea-
surement of this basic variable illustrates the numerous threats to the reliability and
validity of political science measures. The following is a question used in the 2004
American National Election Study (ANES):

Please look at the booklet and tell me the letter of the income group
that includes the income of all members of your family living here in
2003 before taxes. This figure should include salaries, wages, pensions,
dividends, interest, and all other income. Please tell me the letter of the
income group that includes the income you had in 2003 before taxes.

Respondents were given the following choices:

A. None, or less than $2,999       M. $30,000-$34,999
B. $3,000-$4,999                   N. $35,000-$39,999
C. $5,000-$6,999                   O. $40,000-$44,999
D. $7,000-$8,999                   P. $45,000-$49,999
E. $9,000-$10,999                  Q. $50,000-$59,999
F. $11,000-$12,999                 R. $60,000-$69,999
G. $13,000-$14,999                 S. $70,000-$79,999
H. $15,000-$16,999                 T. $80,000-$89,999
I. $17,000-$19,999                 U. $90,000-$104,999
J. $20,000-$21,999                 V. $105,000-$119,999
K. $22,000-$24,999                 W. $120,000 and over
L. $25,000-$29,999

Both the reliability and the validity of this method of measuring income are ques-
tionable. Threats to the reliability of the measure include the following:

• Respondents may not know how much money they make and therefore
incorrectly guess their income.
• Respondents may not know how much money other family members
make"and guess incorrectly. ·
• Respondents may know how much they make but carelessly select the
wrong categories.
• Interviewers may circle the wrong categories when listening to the
selections of the respondents.
• Data-entry personnel may touch the wrong numbers when entering the
answers into the computer.
• Dishonest interviewers may incorrectly guess the income of a respondent
who does not complete the interview.
• Respondents may not know which family members to include in the
income total; some respondents may include only a few family members,
while others may include even distant relations.
• Respondents whose income is on the border between two categories may
not know which one to pick. Some may pick the higher category; others,
the lower one.

Because of these measurement problems, if this measure were applied to the same
people at two different times, we could expect the results to vary, resulting in

inaccurate measures that are too high for some respondents and too low for others.
Some amount of random measurement error is likely to occur with any measure-
ment scheme.

In addition to these threats to reliability, there are numerous threats to the validity
of this measure:

• Respondents may have illegal income they do not want to reveal and_
therefore may systematically underestimate their income.
• Respondents may try to impress the interviewer, or themselves, by
systematically overestimating their income.
• Respondents may systematically underestimate their before-tax income
because they think of their take-home pay and underestimate how much
money is being withheld from their paychecks.
• Respondents may systematically skip the question due to privacy
concerns over providing a precise number even if they know it.

Notice that this second list of problems contains the word systematically. These
problems are not simply caused by random inconsistencies in measurements, with
some being too high and others too low for unpredictable reasons. Systematic mea-
surement error introduces error that may bias research results, thus compromising
the confidence we have in them.

This long list of problems with both the reliability and the validity of this fairly
straightforward measure of a relatively concrete concept is worrisome. Imagine how
much more difficult it is to develop reliable and valid measures when the concept
is abstract (for example, tolerance, environmental conscience, self-esteem, or liber-
alism) and the measurement scheme is more complicated.

The reliability and validity of the measures used by political scientists are seldom
demonstrated to everyone's satisfaction. Most measures of political phenomena are
neither completely invalid nor valid, neither thoroughly unreliable nor reliable, but, rather,
are partly accurate. Therefore, researchers generally present the rationale and evi-
dence available in support of their measures and attempt to persuade their audi-
ence that their measures are at least as accurate as alternative measures would be.
Nonetheless, a skeptical stance on the part of the reader toward the reliability and
validity of political science measures is often warranted.

Note, finally, that reliability and validity are not the same thing. A measure may
be reliable without being valid. One may devise a series of questions to measure
liberalism, for example, that yields the same result for the same people every time
but that misidentifies individuals. A valid measure, however, will also be reliable: if
it accurately measures the concept in question, then it will do so consistently across
measurements-allowing, of course, for some random measurement error that may
occur. It is more important, then, to demonstrate validity than reliability, but reli-
ability is usually more easily and precisely tested.

The Precision of Measurements


Measurements should be not only accurate but also precise; that is, measurements
should contain as much information as possible about the attribute or behavior
being measured. The more precise our measures, the more complete and informative
can be our test of the relationships between two or more variables.

Suppose, for example, that we wanted to measure the height of political candidates
to see if taller candidates usually win elections. Height could be measured in many
different ways. We could have two categories of the variable "height"-tall and
short-and assign different candidates to the two categories based on whether they
were of above-average or below-average height. Or we could compare the heights
of candidates running for the same office and measure which candidate was the
tallest, which the next tallest, and so on. Or we could take a tape measure and mea-
sure each candidate's height in inches and record that measure. The last method of
measurement captures the most information about each candidate's height and is,
therefore, the most precise measure of the attribute.

Levels of Measurement
When we consider the precision of our measurements, we refer to the level of
measurement. The level of measurement involves the type of information that we
think our measurements contain and the mathematical properties that determine
the type of comparisons that can be made across a number of observations on the
same variable. The level of measurement also refers to the claim we are willing to
make when we assign numbers to our measurements.

There are four different levels of measurement: nominal, ordinal, interval, and ratio.
While few concepts used in political science research inherently require a particular
level of measurement, there are methodological limitations because some measures
provide more information and better mathematical properties than others. So the
level of measurement used to measure any particular concept is a function of both
the researcher's imagination and resources, and methodological needs.

We begin with nominal measurement, the level that has the fewest mathematical
properties of the four levels. A nominal-level measure indicates that the values
assigned to a variable represent only different categories or classifications for that
variable. In such a case, no category is more or less than another category; they
are simply different. For example, suppose we measure the religion of individuals
by asking them to indicate whether they are Christian, Jewish, Muslim, or other.
Since the four categories or values for the variable religion are simply different,
the measurement is at a nominal level. Other common examples of nominal-level

measures are gender, marital status, and state of residence. A nominal measure
of partisan affiliation might have the following categories: Democrat, Republican,
Green, Libertarian, other, and none. Numbers will be assigned to the categories
when the data are coded for statistical analysis, but these numbers do not repre-
sent mathematical differences between the categories-any of the parties could be
assigned any number, as long as those numbers are different from each other. In
this sense, nominal-level measures provide the least amount of information about a
concept.

An ordinal measurement has all of the properties of a nominal measure
but also assumes observations can be compared in terms of having more or less
of a particular attribute. Hence, the ordinal level of measurement captures more
information about the measured concept and has more mathematical properties
than a nominal-level measure. For example, we could create an ordinal measure of
formal education completed with the following categories: "eighth grade or less,"
"some high school," "high school graduate," "some college," and "college degree or
more." Here we are concerned not with the exact difference between the catego-
ries of education but only with whether one category is more or less than another.
When coding this variable, we would assign higher numbers to higher categories
of education. The intervals between the numbers have no meaning; all that matters
is that the higher numbers represent more of the attribute than do the lower num-
bers. An ordinal variable measuring partisan affiliation with the categories "strong
Republican," "weak Republican," "neither leaning Republican nor Democrat,"
"weak Democrat," and "strong Democrat" could be assigned codes 1, 2, 3, 4, 5 or 1,
2, 5, 8, 9 or any other combination of numbers, as long as they were in ascending
or descending order.

Dichotomous nominal variables-that is, nominal-level variables with only two


categories-are nominal-level measures but are frequently treated as ordinal-level
measures. For example, we could measure nuclear capability with two categories,
where a country that has nuclear capabilities would be coded as a one and a coun-
try that does not would be coded as a zero. One could interpret this variable as
nuclear capability being present or absent in a country and therefore a one rep-
resents more of the concept, nuclear capability. To give another example, a person
who did not vote in the last election lacks, or has less of, the attribute of having
voted than a person who did vote.

Because nominal and ordinal measures rely on categories, it is important to make


sure these variables are both exhaustive and exclusive. Exhaustive refers to making
sure that all possible categories-or answer choices-are accounted for. The sim-
plest solution to make sure a variable is exhaustive is to include an "other" category
that can be used for values that are not represented in the identified categories.
F.xclusive~efers to making sure that a single value or answer can only fit into one
category. Each.category should be distinct from the others, with no overlap.
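A simple way to honor both requirements, sketched below with the party categories used earlier (the helper function is hypothetical), is to fix the category set in advance and route any unanticipated response to "other": the fallback keeps the variable exhaustive, and mapping each response to exactly one category keeps it exclusive.

```python
# The fixed, mutually exclusive category set.
CATEGORIES = {"Democrat", "Republican", "Green", "Libertarian", "none"}

def code_party(response: str) -> str:
    """Map a raw response to exactly one category, with an exhaustive fallback."""
    response = response.strip()
    return response if response in CATEGORIES else "other"

print(code_party("Green"))         # -> "Green"
print(code_party("Constitution"))  # -> "other" (no response is left uncoded)
```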

Debating the Level of Measurement

Suppose we ask individuals three questions designed to measure social trust, and we believe that individuals who answer all three questions a certain way have more social trust than persons who answer two of the questions a certain way, and that these individuals have more social trust than individuals who answer one of the questions a certain way. We could assign a score of 3 to the first group, 2 to the second group, 1 to the third group, and 0 to those who did not answer any of the questions in a socially trusting manner. In this case, the higher the number, the more social trust an individual has.

What level of measurement is this variable? It might be considered to be ratio level, if one interprets the variable as simply the number of questions answered indicating social trust. But does a person who has a score of 0 have no social trust? Does a person with a score of 3 have three times as much social trust as a person with a score of 1? Perhaps, then, the variable is an interval-level measure, if one is willing to assume that the difference in social trust between individuals with scores of 2 and 3 is the same as the difference between individuals with scores of 1 and 2. But what if the effect of answering more questions in the affirmative is not simply additive? In other words, perhaps a person who has a score of 3 has a lot more social trust than someone with a score of 2, and this difference is more than the difference between individuals with scores of 1 and 2. In this case, then, the measure would be ordinal level, not interval level.

Check out more Helpful Hints at edge.sagepub.com/johnson8e

The next level of measurement, an interval measurement, includes the properties of the nominal level (characteristics are different) and the ordinal level (characteristics can be put in a meaningful order). But unlike the preceding levels of measurement, the intervals between the categories or values assigned to the observations do have meaning. The value of a particular observation is important not just in terms of whether it is larger or smaller than another value (as in ordinal measures) but also in terms of how much larger or smaller it is. For example, suppose we record the year in which certain events occurred. If we have three observations--1950, 1962, and 1977--we know that the event in 1950 occurred twelve years before the one in 1962 and twenty-seven years before the one in 1977. A one-unit change (the interval) all along this measurement is identical in meaning: the passage of one year's time.

Another characteristic of an interval level of measurement that distinguishes it from the next level of measurement (ratio) is that an interval-level measure has an arbitrarily assigned zero point that does not represent the absence of the attribute being measured. For example, many time and temperature scales have arbitrary zero points. Thus, the year 0 CE does not indicate the beginning of time--if this were true, there would be no BCE dates. Nor does 0°C indicate the absence of heat; rather, it indicates the temperature at which water freezes. For this reason, with interval-level measurements we cannot calculate ratios; that is, we cannot say that 60°F is twice as warm as 30°F. So while the interval level of measurement captures more information and mathematical properties than the nominal and ordinal levels, it does not have the full properties of mathematics.

The final level of measurement is a ratio measurement. This type of measurement
involves the full mathematical properties of numbers and contains the most possi-
ble information about a measured concept. That is, a ratio-level measure includes
the values of the categories, the order of the categories, and the intervals between
the categories; it also precisely indicates the relative amounts of the variable that
the categories represent because its scale includes a meaningful zero. If, for exam-
ple, a researcher is willing to claim that an observation with ten units of a vari-
able possesses exactly twice as much of that attribute as an observation with five
units of that variable, then a ratio-level measurement exists. The key to making this
assumption is that a value of zero on the variable actually represents the absence
of that variable. Because ratio measures have a true zero point, it makes sense to
say that one measurement is x times another. It makes sense to say a sixty-year-old
person is twice the age of a thirty-year-old person (60/30 = 2), whereas it does not
make sense to say that 60°C is twice as warm as 30°C.23
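The temperature claim can be checked directly. Converting the same two readings from Celsius (an interval scale with an arbitrary zero) to kelvin (a ratio scale whose zero really is the absence of thermal energy) shows why "twice as warm" fails; this is a worked illustration, not part of the original studies discussed here.

```python
# 60 degrees C vs. 30 degrees C: the Celsius ratio is 2, but the zero is arbitrary.
c_hot, c_cold = 60.0, 30.0
print(c_hot / c_cold)  # 2.0 -- a meaningless ratio on an interval scale

# On the kelvin scale, which has a true zero, the same temperatures compare as:
k_hot, k_cold = c_hot + 273.15, c_cold + 273.15
print(k_hot / k_cold)  # about 1.10 -- roughly 10 percent more, not twice as much
```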

Political science researchers have measured many concepts at the ratio level. Peo-
ple's ages, unemployment rates, percentage of the vote for a particular candidate,
and crime rates are all measures that contain a zero point and possess the full math-
ematical properties of the numbers used. However, more political science research
has probably relied on nominal- and ordinal-level measures than on interval- or
ratio-level measures. This has restricted the types of hypotheses and analysis tech-
niques that political scientists have been willing and able to use.

Identifying the level of measurement of variables is important, since it affects the data analysis techniques that can be used and the conclusions that can be drawn about the relationships between variables. Higher-order methods often require higher levels of measurement, while other methods have been developed for lower levels of measurement. The decision of which level of measurement to use is not always a straightforward one, and uncertainty and disagreement often exist among researchers concerning these decisions. Few phenomena inherently require one particular level of measurement. Often, a phenomenon can be measured with any level of measurement, depending on the particular technique designed by the researcher and the claims the researcher is willing to make about the resulting measure.

23 The distinction between an interval-level and a ratio-level measure is not always clear, and some political science texts do not distinguish between them. Interval-level measures in political science are rather rare; ratio-level measures (money spent, age, number of children, years living in the same location, for example) are more common.

Working with Precision: Too Little or Too Much


Researchers usually try to devise as high a level of measurement for their concepts
as possible (nominal being the lowest level of measurement and ratio the highest).
With a higher level of measurement, more advanced data analysis techniques can
be used, and more precise statements can be made about the relationships between
variables. Thus, researchers measuring attitudes or concepts with multiple opera-
tional definitions often construct a scale or an index from nominal-level measures
that permits at least ordinal-level comparisons between observations. We discuss
the construction of indexes and scales in greater detail in the following paragraphs.

It is easy to transform ratio-level information (e.g., age in number of years) into ordinal-level information (e.g., age groups). However, if you start with the ordinal-level measure, age groups, you will not have each person's actual age. If you decide you want to use a person's actual age, you will have to collect that data--it cannot be created from an ordinal-level measurement. Similarly, a researcher investigating the effect of campaign spending on election outcomes could use a ratio-level variable measuring how much each candidate spent on his or her campaign. This information could be used to construct a new variable indicating how much more one candidate spent than the other, or simply whether or not a candidate spent more than his or her opponent. Candidate spending could also be grouped into ranges.
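As a sketch of this one-way transformation (the spending figures are hypothetical), pandas' cut function collapses a ratio-level variable into ordered ranges; the exact amounts cannot be recovered from the ranges afterward, which is why the data would have to be recollected.

```python
import pandas as pd

# Hypothetical campaign spending, in thousands of dollars (ratio level).
spending = pd.Series([12.5, 480.0, 95.0, 1500.0, 310.0])

# Collapse into ordered ranges (ordinal level); the exact amounts are lost.
ranges = pd.cut(spending,
                bins=[0, 100, 500, float("inf")],
                labels=["low", "medium", "high"])
print(ranges.tolist())  # ['low', 'medium', 'low', 'high', 'medium']

# A dichotomous comparison is coarser still: did the first candidate
# outspend the second?
outspent = spending[0] > spending[1]  # False
```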

Nominal and ordinal variables with many categories or interval- and ratio-level measures using more decimal places are more precise than measures with fewer categories or decimal places, but sometimes the result may provide more information than can be used. Researchers frequently start out with ratio-level measures or with ordinal and nominal measures with quite a few categories but then collapse or combine the data to create groups or fewer categories. They do this so that they have enough cases in each category for statistical analysis or to make comparisons easier to follow. For example, one might want to present comparisons simply between Democrats and Republicans rather than presenting data broken down into categories of strong, moderate, and weak for each party.

It may seem contradictory now to point out that extremely precise measures also may create problems. For example, measures with many response possibilities take up space if they are questions on a written questionnaire or require more time to
explain if they are included in a telephone survey. Such questions may also confuse or tire survey respondents. A more serious problem is that they may lead to measurement error. Think about the possible responses to a question asking respondents to use a 100-point scale (called a thermometer scale) to indicate their support for or opposition to a political candidate, assuming that 50 is considered the neutral position and 0 is least favorable or "coldest" and 100 most favorable. Some respondents may not use the whole scale (to them, no candidate ever deserves more than an 80 or less than a 20), whereas others may use the ends and the very middle of the scale and ignore the scores in between. We might predict that a person who gives a candidate a 100 is more likely to vote for that candidate than is a person who gives the same candidate an 80, but in reality they may like the candidate pretty much the same and would be equally likely to vote for the candidate. Another problem with overly precise measurements is that they may be unreliable. If asked to rate candidates on more than one occasion, respondents could vary slightly the number that they choose, even if their opinion has not changed.

Multi-Item Measures
Many measures consist of a single item. For example, the measures of party
identification, whether or not one party controls Congress, the percentage of
the vote received by a candidate, how concerned about an issue a person is, the
policy area of a judicial case, and age are all based on a single measure of each
phenomenon in question. Often, however, researchers need to devise measures. of
more complicated phenomena that have more than one facet or dimension. For
example, internationalism, political ideology, political knowledge, dispersion of
political power, and the extent to which a person is politically active are complex
phenomena or concepts that may be measured in many different ways.

In this situation, researchers often develop a measurement strategy that allows them to capture numerous aspects of a complex phenomenon while representing the existence of that phenomenon in particular cases with a single representative value. Usually this involves the construction of a multi-item index or scale representing the several dimensions of the phenomenon. These multi-item measures are useful because they enhance the accuracy of a measure, simplify a researcher's data by reducing them to a more manageable size, and increase the level of measurement of a phenomenon. In the remainder of this section, we describe several common types of indexes and scales.

Indexes
A summation index is a method of accumulating scores on individual items to
form a composite measure of a complex phenomenon. An index is constructed by
assigning a range of possible scores for a certain number of items, determining the
score for each item for each observation, and then combining the scores for each
observation across all the items. The resulting summary score is the representative
measurement of the phenomenon.

A researcher interested in measuring how much freedom exists in different countries, for example, might construct an index of political freedom by devising a list of items germane to the concept, determining where individual countries score on each item, and then adding these scores to get a summary measure. In table 5-3, such a hypothetical index is used to measure the amount of freedom in countries A through E.

The index in table 5-3 is a simple, additive one; that is, each item counts equally toward the calculation of the index score, and the total score is the summation of the individual item scores. However, indexes may be constructed with more complicated aggregation procedures and by counting some items as more important than others. In the preceding example, a researcher might consider some indicators of freedom as more important than others and wish to have them contribute more to the calculation of the final index score. This could be done either by weighting (multiplying) some item scores by a number indicating their importance or by assigning a higher score than 1 to those attributes considered more important.

TABLE 5-3 Hypothetical Index for Measuring Freedom in Countries

Does the country possess:                             A    B    C    D    E
Privately owned newspapers                            1    0    0    0    1
Legal right to form political parties                 1    1    0    0    0
Contested elections for significant public offices    1    1    0    0    0
Voting rights extended to most of the adult
  population                                          1    1    0    1    0
Limitations on government's ability to
  incarcerate citizens                                1    0    0    0    1
Index score                                           5    3    0    1    2

Note: Hypothetical data. The score is 1 if the answer is yes, 0 if no.
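The calculation itself, both unweighted and weighted, can be sketched in a few lines of Python using the hypothetical scores from table 5-3 (the doubled weight on contested elections is invented purely to illustrate weighting).

```python
import numpy as np

# Rows are the five items from table 5-3; columns are countries A-E (1 = yes, 0 = no).
items = np.array([
    [1, 0, 0, 0, 1],  # privately owned newspapers
    [1, 1, 0, 0, 0],  # legal right to form political parties
    [1, 1, 0, 0, 0],  # contested elections for significant public offices
    [1, 1, 0, 1, 0],  # voting rights extended to most of the adult population
    [1, 0, 0, 0, 1],  # limits on government's ability to incarcerate citizens
])

# Simple additive index: every item counts equally.
print(items.sum(axis=0))  # [5 3 0 1 2] -- the index scores in table 5-3

# Weighted index: suppose (hypothetically) contested elections count double.
weights = np.array([1, 1, 2, 1, 1])
print((items * weights[:, None]).sum(axis=0))  # [6 4 0 1 2]
```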

Indexes are often used with public opinion surveys to measure political attitudes.
This is because attitudes are complex phenomena and we usually do not know
enough about them to devise single-item measures. So we often ask several ques-
tions of people about a single attitude and aggregate the answers to represent the
attitude. A researcher might measure attitudes toward abortion, for example, by
asking respondents to choose one of five possible responses-strongly agree, agree,
undecided, disagree, and strongly disagree-to the following three statements:
(1) Abortions should be permitted in the first three months of pregnancy. (2) Abortions
should be permitted if the woman's life is in danger. (3) Abortions should be permit-
ted whenever a woman wants one.

An index of attitudes toward abortion could be computed by assigning numerical values to each response (such as 1 for strongly agree, 2 for agree, 3 for undecided, and so on) and then adding the values of a respondent's answers to these three questions. (The researcher would have to decide what to do when a respondent did not answer one or more of the questions.) The lowest possible score in this case would be a 3, indicating the most extreme pro-abortion attitude, and the highest possible score would be a 15, indicating the most extreme anti-abortion attitude. Scores in between would indicate varying degrees of approval of abortion.
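A minimal sketch of that computation follows (the responses are invented, and dropping incomplete respondents is only one of several defensible choices for handling missing answers).

```python
# Scores run from 1 = strongly agree to 5 = strongly disagree, so low totals
# are most pro-abortion and high totals most anti-abortion.
SCORES = {"strongly agree": 1, "agree": 2, "undecided": 3,
          "disagree": 4, "strongly disagree": 5}

def abortion_index(answers):
    """Sum the scores across the three items; return None if any item is missing.

    Discarding incomplete respondents is one option; a researcher might instead
    impute, say, the respondent's mean score on the answered items.
    """
    if any(a not in SCORES for a in answers):
        return None
    return sum(SCORES[a] for a in answers)

print(abortion_index(["agree", "strongly agree", "undecided"]))  # 2 + 1 + 3 = 6
print(abortion_index(["agree", None, "disagree"]))               # None (incomplete)
```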

Indexes are typically fairly simple ways of producing single representative scores of complicated phenomena such as political attitudes. They are probably more accurate than most single-item measures, but they may also be flawed in important ways. Aggregating scores across several items assumes, for example, that each item is equally important to the summary measure of the concept and that the items used faithfully encompass the domain of the concept. Although individual item scores can be weighted to change their contribution to the summary measure, the researcher often has little information upon which to base a weighting scheme.

Several standard indexes are often used in political science research. The FBI crime index, the Consumer Confidence Index, and the Consumer Price Index, for example, have been used by many researchers. Before using these or any other readily available index, you should familiarize yourself with its construction and be aware of any questions raised about its validity. Although simple summation indexes are generally more accurate than single-item measures of complicated phenomena, it is often unclear how valid they are or what level of measurement they represent. For example, is the index of freedom an ordinal-level measure, or could it be an interval-level or even a ratio-level measure? Another possible issue with indexes such as the Consumer Price Index is that what goes into its calculation can change over time.24

24 Brett Arends, "Why You Can't Trust the Inflation Numbers," Wall Street Journal, January 26, 2011, http://online.wsj.com/article/SB10001424052748704013604576104351050317610.html

Creating an Index of Speakers' Institutional Powers

Richard A. Clucas created an index of the institutional power of state house speakers by sorting powers into five general categories and assigning values to powers within those categories using the following scoring procedure:

• Appointment Power Index
  Speaker selects all chairs and party leaders = 5 points
  Selects chairs and a majority of leaders = 4
  Selects chairs; selects few or no leaders = 3
  Shares power to select chairs; selects few or no leaders = 2
  Does not select chairs; selects few or no leaders = 1

• Committee Power Index
  Speaker assigns all members to committee; decides number of committees = 5.0
  Assigns all members; shares power over number = 4.5
  Assigns all members; does not decide number = 4.0
  Assigns majority of members; decides number = 3.5
  Assigns majority of members; shares power over number = 3.0
  Assigns majority of members; does not decide number = 2.5
  Another actor has formal power over assignments, but the speaker shares in the process, such as serving as a member of the rules committee; decides committee number = 2.0
  Shares in assignment process; shares power over number = 1.5
  Shares in assignment process; does not decide number = 1.0
  Not involved in either = 0.0

• Resource Power Index
  Campaign committee exists; speaker has control over legislative employees = 5.0
  Committee exists; speaker does not control employees = 3.0
  Committee does not exist; speaker controls employees = 1.0
  Committee does not exist; speaker does not control employees = 0.0

• Procedural Power Index: The index was created by first creating separate indexes for the speaker's power over bill referral and floor procedures. The average of these scores was then used for the index.

• Bill Referral Power Index
  Speaker has complete control over bill referral = 5
  Controls referral, but there are restrictions on its use = 4
  Shares power over referral; no restrictions on use = 3
  Shares power over referral; actions restricted = 2
  Not involved in referral = 1

• Floor Powers Index
  Speaker prepares calendar, decides question, directs chamber = 5
  Controls two of these floor powers = 4
  Controls one of these floor powers = 3
  Has no control over floor = 1

• Tenure Power Index
  No tenure limit = 5
  Eight years = 4
  Six years = 3
  Four years = 2
  Two years = 1

Clucas's overall index can range in value from 3 to 25. Each category has a maximum value of 5; thus, he is making the claim that full power in each category is of equal significance. He uses the index as if it were an interval-level measure, which means that a unit change in any power category is equivalent. These assumptions are reasonable but are certainly open to challenge.

Source: Richard A. Clucas, "Principal-Agent Theory and the Power of State House Speakers," Legislative Studies Quarterly 26, no. 2 (2001): 319-38; material taken from the appendix on pp. 335 and 336.

Check out more Helpful Hints at edge.sagepub.com/johnson8e
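To see how the category scores combine, here is a minimal sketch with invented scores for a single speaker; following the description above, the procedural index is the average of the bill referral and floor scores, and the overall index is the sum of the five categories.

```python
# Hypothetical scores for one speaker on Clucas's five categories.
appointment = 4      # selects chairs and a majority of leaders
committee = 3.5      # assigns majority of members; decides number
resource = 3.0       # campaign committee exists; no control over employees
tenure = 5           # no tenure limit

# The procedural power index averages the bill referral and floor scores.
bill_referral = 4
floor = 3
procedural = (bill_referral + floor) / 2

# The overall index sums the five categories (possible range: 3 to 25).
overall = appointment + committee + resource + procedural + tenure
print(overall)  # 19.0
```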

Scales
Although indexes are generally an improvement over single-item measures, their
construction also contains an element of arbitrariness. Both the selection of partic-
ular items making up the index and the way in which the scores on individual items
are aggregated are based on the researcher's judgment. Scales are also multi-item
measures, but the selection and combination of items in them is accomplished more
systematically than is usually the case for indexes. Over the years, several different
kinds of multi-item scales have been used frequently in political science research.
We discuss three of them: Likert scales, Guttman scales, and Mokken scales.

A Likert scale score is calculated from the scores obtained on individual items.
Each item generally asks a respondent to indicate a degree of agreement or dis-
agreement with the item, as with the abortion questions discussed earlier. A Likert
scale differs from an index, however, in that once the scores on each of the items
are obtained, only some of the items are selected for inclusion in the calculation of
the final score. Those items that allow a researcher to distinguish most readily those
scoring high on an attribute from those scoring low will be retained, and a new
scale score will be calculated based only on those items.

For example, consider the researcher interested in measuring the liberalism of a group of respondents. Since definitions of liberalism vary, the researcher cannot be sure how many aspects of liberalism need to be measured. With Likert scaling, the researcher would begin with a large group of questions thought to express various aspects of liberalism with which respondents would be asked to agree or disagree. A provisional Likert scale for liberalism, then, might look like the one in table 5-4.

TABLE 5-4 Provisional Likert Scale to Measure Concept of Liberalism

Respondents indicate agreement with each item on a five-point scale: Strongly Disagree (1), Disagree (2), Undecided (3), Agree (4), Strongly Agree (5).

• The government should ensure that no one lives in poverty.
• It is more important to take care of people's needs than it is to balance the federal budget.
• Social Security benefits should not be cut.
• The government should spend money to improve housing and transportation in urban areas.
• Wealthy people should pay taxes at a much higher rate than poor people.
• Busing should be used to integrate public schools.
• The rights of persons accused of a crime must be vigorously protected.

In practice, a set of questions like this would be scattered throughout a questionnaire so that respondents do not see them as related. Some of the questions might also be worded in the opposite way (that is, so an "agree" response is a conservative response) to ensure genuine answers.

The respondents' answers to these eight questions would be summed to produce a provisional score. The scores in this case can range from 8 to 40. Then the responses of the most liberal and the most conservative people to each question would be compared; any questions with similar answers from the disparate respondents would be eliminated, since such questions would not distinguish liberals from conservatives. A new summary scale score for all the respondents would be calculated from the questions that remained. A statistic called Cronbach's alpha, which measures the internal consistency of the items in the scale and has a maximum value of 1.0, is used to determine which items to drop from the scale. The rule of thumb is that Cronbach's alpha should be 0.8 or above; items are dropped from the scale one at a time until this value is reached.25
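The statistic itself is straightforward to compute: alpha equals k/(k - 1) times one minus the ratio of the summed item variances to the variance of the summed scale. The sketch below uses an invented response matrix and shows the drop-one-item check.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents x k_items) matrix of item scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each separate item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the summed scale
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 1-5 responses from six people to four liberalism items.
responses = np.array([
    [5, 4, 5, 2],
    [4, 4, 5, 3],
    [2, 1, 2, 4],
    [1, 2, 1, 3],
    [3, 3, 3, 2],
    [5, 5, 4, 3],
])

print(round(cronbach_alpha(responses), 2))  # about 0.73 -- below the 0.8 rule of thumb

# Drop-one check: recompute alpha without each item in turn. Here, dropping the
# fourth item lifts alpha to about 0.95, so it would be the one to remove.
for j in range(responses.shape[1]):
    reduced = np.delete(responses, j, axis=1)
    print(j, round(cronbach_alpha(reduced), 2))
```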

Likert scales are improvements over multi-item indexes because the items that make up the multi-item measure are selected in part based on the respondents' behavior rather than on the researcher's judgment. Likert scales suffer two of the other defects of indexes, however. The researcher cannot be sure that all the dimensions of a concept have been measured, and the relative importance of each item is still determined arbitrarily.

The Guttman scale also uses a series of items to produce a scale score for respon-
dents. Unlike the Likert scale, however, a Guttman scale presents respondents with a
range of attitude choices that are increasingly difficult to agree with; that is, the items
composing the scale range from those easy to agree with to those difficult to agree
with. Respondents who agree with one of the "more difficult" attitude items will also
generally agree with the "less difficult" ones. (Guttman scales have also been used to
measure attributes other than attitudes. Their main application has been in the area
of attitude research, however, so an example of that type is used here.)

Let us return to the researcher interested in measuring attitudes toward abortion. He or she might devise a series of items ranging from "easy to agree with" to "difficult to agree with." Such an approach might be represented by the following items:

Do you agree or disagree that abortions should be permitted:

1. When the life of the woman is in danger


2. In the case of incest or rape
3. When the fetus appears to be unhealthy
4. When the father does not want to have a baby
5. When the woman cannot afford to have a baby
6. Whenever the woman wants one

25 Gliem and Gliem, "Calculating, Interpreting, and Reporting Cronbach's Alpha Reliability Coefficient
for Likert-Type Scales."

This array of items seems likely to result in responses consistent with Guttman
scaling. A respondent agreeing with any one of the items is likely also to agree with
those items numbered lower than that one. This would result in the "stepwise"
pattern of responses characteristic of a Guttman scale.

Suppose six respondents answered this series of questions, as shown in table 5-5. Generally speaking, the pattern of responses is as expected; those who agreed with the "most difficult" questions were also likely to agree with the "less difficult" ones. However, the responses of three people (2, 4, and 5) to the question about the father's preferences do not fit the pattern. Consequently, the question about the father does not seem to fit the pattern and would be removed from the scale. Once that has been done, the stepwise pattern becomes clear.

TABLE 5-5 Guttman Scale of Attitudes toward Abortion

            Life of   Incest    Unhealthy                             No. of agree   Revised
Respondent  woman     or rape   fetus       Father   Afford  Anytime  answers        scale score
1           A         A         A           A        A       A        6              5
2           A         A         A           D        A       D        4              4
3           A         A         A           D        D       D        3              3
4           A         A         D           A        D       D        3              2
5           A         D         D           A        D       D        2              1
6           D         A         D           D        D       D        1              0

Note: Hypothetical data. A = Agree, D = Disagree.

With real data, it is unlikely that every respondent would give answers that fit the
pattern perfectly. For example, in table 5-5, respondent 6 gave an "agree" response
to the question about incest or rape. This response is unexpected and does not fit
the pattern. Therefore, we would be making an error if we assigned a scale score of 0
to respondent 6. When the data fit the scale pattern well (number of errors is small),
researchers assume that the scale is an appropriate measure and that the respon-
dent's "error" may be "corrected" (in this case, either the "agree" in the case of incest
or rape or the "disagree" in the case of the life of the woman). There are standard
procedures to follow to determine· how to correct the ·data to make it conform to the
scale pattern. We emphasize, however, that this is done. only if the changes are few.
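One common way to count such errors, sketched below on hypothetical response patterns like those in table 5-5, is to compare each respondent's answers with the ideal stepwise pattern implied by his or her number of agreements.

```python
# Items are ordered from easiest (life of the woman) to hardest (anytime) to
# agree with; 1 = agree, 0 = disagree. The father item has been removed.
patterns = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [0, 1, 0, 0, 0],  # the stray "agree" on item 2 breaks the stepwise pattern
]

def guttman_errors(responses):
    """Count deviations from the ideal stepwise pattern.

    A respondent with k agreements fits perfectly only by agreeing with exactly
    the k easiest items; each mismatch from that ideal counts as one error.
    """
    k = sum(responses)
    ideal = [1] * k + [0] * (len(responses) - k)
    return sum(r != i for r, i in zip(responses, ideal))

for row in patterns:
    print(row, "errors:", guttman_errors(row))
# The last pattern produces 2 mismatches; "correcting" such responses is
# defensible only when errors like this are rare.
```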

Guttman scales differ from Likert scales in that, in the former case, generally only one set of responses will yield a particular scale score. That is, to get a score of 3 on the abortion scale, a particular pattern of responses (or something very close to it) is necessary. In the case of a Likert scale, however, many different patterns of responses can yield the same scale score. A Guttman scale is also much more difficult to achieve than a Likert scale, since the items must have been ordered and be perceived by the respondents as representing increasingly more difficult responses reflecting the same attitude.

Both Likert and Guttman scales have shortcomings in their level of measurement.
The level of measurement produced by Likert scales is, at best, ordinal (since we
do not know the relative importance of each item and so cannot be sure that a 5
answer on one item is the same as a 5 answer on another), and the level of measure-
ment produced by Guttman scales is usually assumed to be ordinal.

Another type of scaling procedure, called Mokken scaling, also analyzes responses to multiple items by respondents to see if, for each item, respondents can be ordered and if items can be ordered.26 The Mokken scale was used by Saundra K. Schneider, William G. Jacoby, and Daniel C. Lewis to see if there was structure and coherence in public opinion regarding the distribution of responsibilities between the federal government and state and local governments.27 Respondents were asked whether they thought state or local governments "should take the lead" rather than the national government for thirteen different policy areas. The scaling procedure allowed the researchers to see if a specific sequence of policies emerged while moving from one end of the scale to the other. One end of the scale would indicate maximal support for national policy activity, while the other end would indicate maximal support for subnational government policy responsibility.
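Full Mokken scaling is a probabilistic model and is normally run with dedicated software (for example, the mokken package for R). But the deterministic core of the idea, ordering items by how often they are endorsed and then placing respondents on the resulting scale, can be sketched as follows; the data here are invented.

```python
import numpy as np

# Hypothetical responses: 1 = "state/local should take the lead" for each of
# five policy areas (columns); rows are respondents.
X = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
])

# Order items from most- to least-often endorsed; in a well-fitting scale,
# endorsing a "hard" item implies endorsing all of the easier ones.
popularity = X.mean(axis=0)
order = np.argsort(-popularity)  # easiest item first
X_ordered = X[:, order]

# Each respondent's scale score is the number of items endorsed, read as
# "state/local should lead on this many of the easiest areas."
scores = X_ordered.sum(axis=1)
print(order.tolist(), scores.tolist())  # [0, 1, 2, 3, 4] [3, 2, 1, 4, 0]
```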

The results of their analysis are shown in figure 5-1. The scale runs from 0 to 13, with 0 indicating that the national government should take the lead in all thirteen policy areas and a score of 13 indicating that the respondent believes that state and local governments should take the lead in all policy areas. A person at any scale score believes that the state and local governments should take the lead in all policy areas that fall below that score. Thus, a person with a score of 9 believes that the national government should take the lead in health care, equality for women, protecting the environment, and equal opportunity, and that state and local governments should take the lead responsibility for natural disasters down to urban development. The bars in the figure correspond to the percentage of respondents who received a particular score. Thus, just slightly more than 5 percent of the respondents thought that state and local governments should take the lead in all policy areas, whereas only 1 percent of the respondents thought the national government should take the lead in all thirteen policy areas. Most respondents divided up the responsibilities. The authors concluded from their analysis that the public does have a rationale behind its preferences for the distribution of policy responsibilities between national versus state and local governments.

FIGURE 5-1 Mokken Scale of Public Opinion about National versus State/Local Government Responsibilities for Specific Policy Areas

[Bar chart showing the percentage of respondents (roughly 0 to 15 percent) at each scale score from 0 to 13. From the top of the scale down, the policy areas are ordered: health care (13), equality for women (12), protecting the environment (11), equal opportunity (10), natural disasters (9), assisting the elderly (8), assisting the disabled (7), assisting the poor (6), reducing unemployment (5), providing education (4), economic development (3), reducing crime (2), and urban development (1).]

Source: 2006 Comparative Congressional Election Survey, cited in Saundra K. Schneider, William G. Jacoby, and Daniel C. Lewis, "Public Opinion toward Intergovernmental Policy Responsibilities," Publius: The Journal of Federalism 41, no. 1 (2011): fig. 3, p. 14. Reprinted by permission of Oxford University Press. © The Author 2010. Published by Oxford University Press on behalf of CSF Associates: Publius, Inc.

26 Robert Jan Mokken, A Theory and Procedure of Scale Analysis with Applications in Political Research (The Hague, Neth.: Mouton, 1971). Mokken scaling is an example of a nonparametric item response theory (IRT) model. It differs from Guttman scaling in that it has a probabilistic interpretation, whereas Guttman scaling does not. See Ate Dijkstra, Girbe Buist, Peter Moorer, and Theo Dassen, "Construct Validity of the Nursing Care Dependency Scale," Journal of Clinical Nursing 8, no. 4 (1999): 380-88.

27 Saundra K. Schneider, William G. Jacoby, and Daniel C. Lewis, "Public Opinion toward Intergovernmental Policy Responsibilities," Publius: The Journal of Federalism 41, no. 1 (2011): 1-30.

The procedures described so far for constructing multi-item measures are fairly
straightforward. There are other advanced statistical techniques for summarizing
or combining individual items or variables. For example, it is possible that several
variables are related to some underlying concept. Factor analysis is a statistical
technique that may be used to uncover patterns across measures. It is especially
useful when a researcher has a large number of measures and when there is uncer-
tainty about how the measures are interrelated.
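As a sketch of the general workflow (the data are synthetic, and scikit-learn's FactorAnalysis is only one of several implementations; real applications involve further choices, such as the rotation method, that are glossed over here):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Synthetic responses: two latent attitudes each drive three observed items.
n = 300
latent = rng.normal(size=(n, 2))
loadings = np.array([[1.0, 0.9, 0.8, 0.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0, 1.0, 0.9, 0.8]])
items = latent @ loadings + rng.normal(scale=0.4, size=(n, 6))

# Fit a two-factor model and inspect which items load on which factor.
fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(items)
print(np.round(fa.components_, 2))  # rows = factors, columns = items

# The factor scores can then serve as summary variables in later analysis.
scores = fa.transform(items)
```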

An example is the analysis by Daniel D. Dutcher, who conducted research on the attitudes of owners of streamside property toward the water-quality improvement strategy of planting trees in a wide band (called riparian buffers) along the sides of streams.28

He asked landowners to rate the importance of twelve items thought to affect the willingness of landowners to create and maintain riparian buffers. He wanted to know whether the attitudes could be grouped into distinct dimensions that could be used as summary variables instead of using each of the eleven items separately. Using factor analysis, he found that the items factored into three dimensions. These dimensions and the items included in each dimension are listed in table 5-6. The first dimension, which he labeled "maintaining property aesthetics," included items such as maintaining a view of the stream, neatness, and maintaining open space. A second dimension contained items related to concern over water quality. The third dimension related to protecting property against damage or loss.

TABLE 5-6 Items Measuring Landowner Attitudes toward Riparian Buffers on Their Streamside Property, Sorted into Dimensions Using Factor Analysis

Maintaining Property Aesthetics
• Maintaining my view of the stream
• Maintaining the look of a pastoral or meadow stream
• Maintaining open space
• Wanting the neighbors to think that I'm doing my part to keep up the traditional look of the neighborhood

Contributing to Stream and Bay Quality
• Being confident that maintaining or creating a streamside forest on my property is necessary to protect the stream
• Contributing to the improvement of downstream areas, including the Chesapeake Bay
• Maintaining or improving stream-bank stability on my property

Protecting Property against Damage or Loss
• Keeping vegetation from encroaching on fields or fences
• Minimizing the potential for flood damage to lands or buildings
• Discouraging pests (deer, woodchucks, snakes, insects, etc.)
• Initial costs, maintenance costs, or loss of income

Source: Adapted from tab. 3.4 in Daniel D. Dutcher, "Landowner Perceptions of Protecting and Establishing Riparian Forests in Central Pennsylvania" (Ph.D. diss., Pennsylvania State University, May 2000), 64.

Factor analysis is just one of many techniques developed to explore the dimension-
ality of measures and to construct multi-item scales. The readings listed at the end
of this chapter include some resources for students who are especially interested in
this aspect of variable measurement.

Through indexes and scales, researchers attempt to enhance both the accuracy and
the precision of their measures. Although these multi-item measures have received
most use in attitude research, they are often useful in other endeavors as well. Both
indexes and scales require researchers to make decisions regarding the selection of
individual items and the way in which the scores on those items will be combined
to produce more useful measures of political phenomena.

Conclusion
To a large extent, a research project is only as good as the measurements that are
developed and used in it. Inaccurate measurements will interfere with the testing
of scientific explanations for political phenomena and may lead to erroneous
conclusions. Imprecise measurements will limit the extent of the comparisons that
can be made between observations and the precision of the knowledge that results
from empirical research.

Despite the importance of good measurement, political science researchers often find that their measurement schemes are of uncertain accuracy and precision. Abstract concepts are difficult to measure in a valid way, and the practical constraints of time and money often jeopardize the reliability and precision of measurements. The quality of a researcher's measurements makes an important contribution to the results of his or her empirical research and should not be lightly or routinely sacrificed.

28 Daniel D. Dutcher, "Landowner Perceptions of Protecting and Establishing Riparian Forests in Central Pennsylvania" (Ph.D. diss., Pennsylvania State University, 2000).
Sometimes the accuracy of measurements may be enhanced through the use of
multi-item measures. With indexes and scales, researchers select multiple indica-
tors of a phenomenon, assign scores to each of these indicators, and combine those
scores into a summary measure. Although these methods have been used most
frequently in attitude research, they can also be used in other situations to improve
the accuracy and precision of single-item measures.

Terms Introduced

Alternative-form method. A method of calculating reliability by repeating different but equivalent measures at two or more points in time.

Bias. A type of measurement error that results in systematically over- or under-measuring the value of a concept.

Construct validity. Validity demonstrated for a measure by showing that it is related to the measure of another concept.

Content validity. Validity demonstrated by ensuring that the full domain of a concept is measured.

Convergent construct validity. Validity demonstrated by showing that the measure of a concept is related to the measure of another, related concept.

Correlation matrix. A table showing the relationships among discrete measures.

Dichotomous variable. A nominal-level variable having only two categories that for certain analytical purposes can be treated as a quantitative variable.

Discriminant construct validity. Validity demonstrated by showing that the measure of a concept has a low correlation with the measure of another concept that is thought to be unrelated.

Face validity. Validity asserted by arguing that a measure corresponds closely to the concept it is designed to measure.

Factor analysis. A statistical technique useful in the construction of multi-item scales to measure abstract concepts.

Guttman scale. A multi-item measure in which respondents are presented with increasingly difficult measures of approval for an attitude.

Interitem association. A test of the extent to which the scores of several items, each thought to measure the same concept, are the same. Results are displayed in a correlation matrix.

Interval measurement. A measure for which a one-unit difference in scores is the same throughout the range of the measure.

Level of measurement. The extent or degree to which the values of variables can be compared and mathematically manipulated.

Likert scale. A multi-item measure in which the items are selected based on their ability to discriminate between those scoring high and those scoring low on the measure.

Measurement. The process by which phenomena are observed systematically and represented by scores or numerals.

Mokken scale. A type of scaling procedure that assesses the extent to which there is order in the responses of respondents to multiple items. Similar to Guttman scaling.

Nominal measurement. A measure for which different scores represent different, but not ordered, categories.

Operational definition. The rules by which a concept is measured and scores assigned.

Operationalization. The process of assigning numerals or scores to a variable to represent the values of a concept.

Ordinal measurement. A measure for which the scores represent ordered categories that are not necessarily equidistant from each other.

Random measurement error. Errors in measurement that have no systematic direction or cause.

Ratio measurement. A measure for which the scores possess the full mathematical properties of the numbers assigned.

Reliability. The extent to which a measure yields the same results on repeated trials.

Split-halves method. A method of calculating reliability by comparing the results of two equivalent measures made at the same time.

Summation index. A multi-item measure in which individual scores on a set of items are combined to form a summary measure.

Test-retest method. A method of calculating reliability by repeating the same measure at two or more points in time.

Validity. The correspondence between a measure and the concept it is supposed to measure.

Suggested Readings

DeVellis, Robert F. Scale Development: Theory and Applications. 3rd ed. Thousand Oaks, Calif.: Sage, 2011.

Kim, Jae-On, and Charles W. Mueller. Introduction to Factor Analysis: What It Is and How to Do It. A Sage University Paper: Quantitative Applications in the Social Sciences no. 07-013. Beverly Hills, Calif.: Sage, 1978.

Mertler, Craig A., and Rachel A. Vannatta. Advanced and Multivariate Statistical Methods: Practical Application and Interpretation. 5th ed. Glendale, Calif.: Pyrczak, 2013.

Netemeyer, Richard G., William O. Bearden, and Subhash Sharma. Scaling Procedures: Issues and Applications. Thousand Oaks, Calif.: Sage, 2003.

Walford, Geoffrey, Eric Tucker, and Madhu Viswanathan. The Sage Handbook of Measurement. Los Angeles, Calif.: Sage, 2010.
