The Building Blocks of Social Scientific Research: Measurement
CHAPTER OBJECTIVES
5.1 Discuss the importance of operationalization in hypothesis measurement.
5.2 Explain why measurements of political phenomena must correspond closely to the original meaning of a researcher's concepts.
5.3 Summarize the ways in which accurate measurements must be reliable and valid.
5.4 Describe different levels of measurement and the importance of measurement precision.
5.5 Identify different types of multi-item measures.
... important because it provides the bridge between our proposed explanations
and the empirical world they are supposed to explain. How researchers measure
their concepts can have a significant impact on their findings; differences in mea-
surement can lead to totally different conclusions.
three different types of human rights: personal integrity or security rights, sub-
sistence rights, and civil and political rights. Each of these types of rights has
multiple dimensions. For example, civil and political rights consist of both civil
liberties, such as freedom of speech, as well as economic liberties, including pri-
vate property rights. Jeffrey A. Segal and Albert D. Cover measured both the polit-
ical ideologies and the written opinions of US Supreme Court justices in cases
involving civil rights and liberties. 3 Valerie J. Hoekstra measured people's opinions
about issues connected to Supreme Court cases and their opinions about the
Court. 4 Richard L. Hall and Kristina Miler wanted to measure oversight activity
by members of Congress, the number of times they were contacted by lobbyists,
and whether members of Congress and lobbyists were pro-regulation or antireg-
ulation. 5 And Stephen Ansolabehere, Shanto Iyengar, Adam Simon, and Nicholas
Valentino measured the intention to vote reported by study participants to see
if it was affected by exposure to negative campaign advertising. 6 In each case,
some political behavior or attribute was measured so that a scientific explanation
could be tested. All of these researchers made important choices regarding their
measurements.
Let us consider, for example, a researcher trying to explain the existence of democ-
racy in different nations. If the researcher were to hypothesize that higher rates
of literacy make democracy more likely, then a definition of two concepts, literacy and democracy, would be necessary. The researcher could then develop a
3 Jeffrey A. Segal and Albert D. Cover, "Ideological Values and the Votes of U.S. Supreme Court Justices," American Political Science Review 83, no. 2 (1989): 557-65. Available at http://www.uic.edu/classes/pols/pols200mm/Segal89.pdf
4 Valerie J. Hoekstra, Public Reaction to Supreme Court Decisions (New York: Cambridge University Press, 2003).
5 Richard L. Hall and Kristina Miler, "What Happens after the Alarm? Interest Group Subsidies to Legislative Overseers," Journal of Politics 70, no. 4 (2008): 990-1005.
6 Stephen Ansolabehere, Shanto Iyengar, Adam Simon, and Nicholas Valentino, "Does Attack Advertising Demobilize the Electorate?" American Political Science Review 88, no. 4 (1994): 829-38. Available at http://weber.ucsd.edu/~tkousser/Ansolabehere.pdf
strategy, based on the definitions of the two concepts, for measuring the existence
and amount of both attributes in nations.
It is useful to think of arriving at the operational definition as being the last stage in
the process of defining a concept precisely. We often begin with an abstract concept
(such as democracy), then attempt to define it in a meaningful way, and finally
decide in specific terms how we are going to measure it. At the end of this process,
we hope to attain a definition that is sensible, close to our meaning of the concept,
and exact in what it tells us about how to go about measuring the concept.
that can be used to measure whether particular individuals are liberal or not. The following question from the General Social Survey might be used to operationalize the concept:
An abstract concept, liberalism, has now been given an operational definition that can be used to measure the concept for individuals. This definition is also related
to the original definition of the concept, and it indicates precisely what observa-
tions need to be made. It is not, however, the only operational definition possible.
Others might suggest that questions regarding affirmative action, same-sex mar-
riage, school vouchers, the death penalty, welfare benefits, and pornography could
be used to measure liberalism.
The important thing is to think carefully about the operational definition you
choose and to try to ensure that the definition coincides closely with the meaning
of the original concept. How a concept is operationalized affects how generalizations are made and interpreted. For example, general statements about liberals or
conservatives apply to liberals or conservatives only as they have been operation-
ally defined, in this case by this one question regarding government involvement
in reducing income differences. As a consumer of research, you should familiarize
yourself with the operational definitions used by researchers so that you are better
able to interpret and generalize research results.
Let us take a closer look at some operational definitions used by the political
science researchers referred to in chapter 1, as well as some others. To measure the
strength of a legislator's intervention in air pollution regulations proposed by the
7 Question wording for the variable EQWLTH from GSS 1998 Codebook. Available at http://www.thearda.com/Archive/Files/Codebooks/GSS1998_CB.asp
Environmental Protection Agency, Hall and Miler coded and counted the number
of substantive comments made by legislators challenging or defending the agency's
proposed air quality regulations during five oversight hearings held in Congress
and during the public comment period. 8 Agencies are required to maintain a
public docket that contains all the comments received during the comment period.
Transcripts were available for each of the hearings. The researchers ended up
with two variables: one was the number of supporting comments; the other was
the number of comments in opposition to the proposed regulation. To measure
constituency interests in each of the members' districts, they measured the number
of manufacturing jobs in each district and created an index of air pollution based on
district levels of PM 10 particulate matter and ground-level ozone (the pollutants
addressed by the proposed regulations). Because Hall and Miler were interested in
investigating whether lobbyists targeted their efforts toward members of Congress
friendly toward the lobbyists' positions, they needed to measure the pro- or
antienvironmental policy positions for each member of Congress, and this variable
had to measure position before the oversight hearings and regulatory comment
period. Fortunately for the researchers, the leaders of the health and environmental
coalition had classified members in terms of their likely support for the rule prior
to the lobbying period and were willing to share their ratings. These measures were
based on legislators' previous voting record on health and environmental issues.
The research conducted by Segal and Cover on the behavior of US Supreme Court
justices is a good example of an attempt to overcome a serious measurement prob-
lem to test a scientific hypothesis. 9 Recall that Segal and Cover were interested,
as many others have been before them, in the extent to which the votes cast by
Supreme Court justices were dependent on the justices' personal political attitudes.
Measuring the justices' votes on the cases decided by the Supreme Court is no
problem; the votes are public information. But measuring the personal political atti-
tudes of judges, independent of their votes, is a problem (remember the discussion in
chapter 4 on avoiding tautologies, or statements that link two concepts that mean
essentially the same thing). Many of the judges whose behavior is of interest have
died, and it is difficult to get living Supreme Court justices to reveal their political
attitudes through personal interviews or questionnaires. Furthermore, one ideally
would like a measure of attitudes that is comparable across many judges and that
measures attitudes related to the cases decided by the Court.
Segal and Cover limited their inquiry to votes on civil liberties cases between 1953
and 1987, so they needed a measure of related political attitudes for the judges
serving on the Supreme Court over that same period. They decided to infer the
judges' attitudes from the newspaper editorials written about them in four major
daily newspapers from the time each justice was appointed by the president until
the justice's confirmation vote by the Senate. They selected the editorials appearing
in two liberal papers and in two conservative papers. Trained analysts read the edi-
torials and coded each paragraph for whether it asserted that a justice designate was
liberal, moderate, or conservative (or if the paragraph was inapplicable) regarding
"support for the rights of defendants in criminal cases, women and racial minorities
in equality cases, and the individual against the government in privacy and First
Amendment cases."10
Because of practical barriers to ideal measurement, then, Segal and Cover had to
rely on an indirect measure of judicial attitudes as perceived by four newspapers
rather than on a measure of the attitudes themselves. Although this approach may
have resulted in flawed measures, it also permitted the test of an interesting and
important hypothesis about the behavior of Supreme Court justices that had not
been tested previously. Without such measurements, the hypothesis could not have
been tested.
Next, let us consider research conducted by Bradley and his colleagues on the rela-
tionship between party control of government and the distribution and redistri-
bution of wealth. 11 The researchers relied on the Luxembourg Income Study (LIS)
database, which provides cross-national income data over time in OECD (Organ-
isation for Economic Co-operation and Development) countries. 12 They decided,
however, to make adjustments to published LIS data on income inequality. That
data included pensioners. Because some countries make comprehensive provisions
for retirees, retirees in these countries make little provision on their own for retire-
ment. Thus, many of these people would be counted as "poor" before any govern-
ment transfers. Including pensioners would inflate the pretransfer poverty level as
well as the extent of income transfer for these countries. Therefore, Bradley and
his colleagues limited their analysis to households with a head aged twenty-five to
fifty-nine (thus excluding the student-age population as well) and calculated their
own measures of income inequality from the LIS data. They argued that their data
would measure redistribution across income groups, not life-cycle redistributions of income, such as transfers to students and retired persons. Income was defined as
income from wages and salaries, self-employment income, property income, and
private pension income. The researchers also made adjustments for household size
using an equivalence scale, which adjusts the number of persons in a household
to an equivalent number of adults. The equivalence scale takes into account the
economies of scale resulting from sharing household expenses.
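The adjustment described above can be sketched in a few lines. The square-root scale used here is a common convention in income-inequality research, not necessarily the exact scale Bradley and his colleagues used, and the function name and dollar figures are illustrative:

```python
import math

def equivalized_income(household_income: float, household_size: int) -> float:
    """Adjust household income for size using the square-root equivalence
    scale (an assumed, commonly used scale): dividing by the square root of
    household size reflects economies of scale, so a four-person household
    is treated as needing twice, not four times, one person's income."""
    return household_income / math.sqrt(household_size)

# A $60,000 four-person household and a $30,000 single-person household
# come out equally well off on this scale.
print(equivalized_income(60000, 4))  # 30000.0
print(equivalized_income(30000, 1))  # 30000.0
```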
10 Ibid., 559.
11 David Bradley, Evelyne Huber, Stephanie Moller, François Nielsen, and John D. Stephens, "Distribution and Redistribution in Postindustrial Democracies," World Politics 55, no. 2 (2003): 193-228.
12 For information on the LIS database, see http://www.lisdatacenter.org/our-data/lis-database/
The cases discussed here are good examples of researchers' attempts to mea-
sure important political phenomena (behaviors or attributes) in the real world.
Whether the phenomenon in question was judges' political attitudes, income
inequality, the tone of campaign advertising, or the attitudes and behavior of
legislators, the researchers devised measurement strategies that could detect and
measure the presence and amount of the concept in question. These observations
were then generally used as the basis for an empirical test of the researchers'
hypotheses.
There are two major threats to the accuracy of measurements. Measures may be
inaccurate because they are unreliable and/or because they are invalid.
Reliability
Reliability describes the consistency of results from a procedure or measure in
repeated tests or trials. In the context of measurement, a reliable measure is one
13 Martin P. Wattenberg and Craig Leonard Brians, "Negative Campaign Advertising: Demobilizer or Mobilizer?" American Political Science Review 93, no. 4 (1999): 891-99. Available at http://weber.ucsd.edu/~tkousser/Wattenberg.pdf
14 Ansolabehere, Iyengar, Simon, and Valentino, "Does Attack Advertising Demobilize the Electorate?"
that produces the same result each time the measure is used. An unreliable measure
is one that produces inconsistent results-sometimes higher, sometimes lower.15
Suppose, for example, you want to measure support for the president among col-
lege students. You select two similar survey questions (Ql and Q2) and ask the
participants in a random sample of students to answer each question. The results
from this sample were 50 percent support for the president using Ql and 50 per-
cent support for the president using Q2. But what might you find if you ask the
same questions of multiple random samples of students? Will the results from each
question remain consistent, assuming that the samples are identical? If a second
sample of students is polled, you may find the same result, 50 percent, for Ql but
60 percent for Q2. If you were to ask Ql of multiple random samples of students
and the result was consistently 50 percent, you could assert that your measure, Ql,
is reliable. If Q2 were asked to multiple random samples of students and each sam-
ple of students returned different answers ranging somewhere between 40 percent
and 60 percent, you could conclude that Q2 is less reliable than Ql because Q2
generates inconsistent results each time it is used.
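The logic of this comparison can be sketched in a short simulation. The numbers and the noise mechanism below are hypothetical, not from the text: Q2 is modeled as a question whose measured support drifts unpredictably by up to ten points each time it is administered, and the spread of results across repeated samples summarizes each question's reliability.

```python
import random
import statistics

random.seed(0)

def sample_support(true_support: float, question_noise: float, n: int = 500) -> float:
    """Percent support measured in one random sample of n students.
    question_noise shifts the question's measured support up or down
    unpredictably on each administration (an unreliable question)."""
    measured = true_support + random.uniform(-question_noise, question_noise)
    hits = sum(random.random() < measured for _ in range(n))
    return 100 * hits / n

q1 = [sample_support(0.50, question_noise=0.00) for _ in range(20)]  # reliable
q2 = [sample_support(0.50, question_noise=0.10) for _ in range(20)]  # unreliable

print(round(statistics.stdev(q1), 1))  # small spread across samples
print(round(statistics.stdev(q2), 1))  # noticeably larger spread
```

Both questions center on roughly 50 percent support, but Q2's results scatter far more from sample to sample, which is exactly what the text means by "inconsistent results."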
Likewise, you can assess the reliability of procedures as well. Suppose you are given
the responsibility of counting a stack of one thousand paper ballots for some public
office. The first time you count them, you obtain a particular result. But as you were
counting the ballots, you might have been interrupted, two or more ballots might
have stuck together, some might have been blown onto the floor, or you might have
written down the totals incorrectly. As a precaution, then, you count them five more
times and get four other people to count them once each as well. The similarity of
the results of all ten counts would be an indication of the reliability of the counting
process.
Similarly, suppose you wanted to test the hypothesis that the New York Times is
more critical of the federal government than is the Wall Street Journal. This would
require you to measure the level of criticism found in articles in the two papers.
You would need to develop criteria or instructions for identifying or measuring
criticism. The reliability of your measuring scheme could be assessed by having
two people read all the articles, independently rate the level of criticism in them
according to your instructions, and then compare their results. Reliability would be
demonstrated if both people reached similar conclusions regarding the content of
the articles in question.
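An intercoder check of this kind can be sketched as follows; the ratings are hypothetical, and in practice researchers also report chance-corrected agreement statistics such as Cohen's kappa rather than raw agreement alone:

```python
# Two readers independently rate the level of criticism in the same ten
# articles (0 = none, 1 = mild, 2 = strong). Exact agreement on an article
# counts toward the reliability of the coding instructions.
coder_a = [2, 1, 0, 2, 1, 1, 0, 2, 2, 1]
coder_b = [2, 1, 0, 1, 1, 1, 0, 2, 2, 0]

agreements = sum(a == b for a, b in zip(coder_a, coder_b))
percent_agreement = 100 * agreements / len(coder_a)
print(percent_agreement)  # 80.0
```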
15 Edward G. Carmines and Richard A. Zeller, Reliability and Validity Assessment, A Sage University Paper: Quantitative Applications in the Social Sciences no. 07-017 (Beverly Hills, Calif.: Sage, 1979).
The test-retest method involves applying the same "test" to the same observations
after a period of time and then comparing the results of the different measure-
ments. For example, if a series of questions measuring liberalism is asked of a group
of respondents on two different days, a comparison of their scores at both times
could be used as an indication of the reliability of the measure of liberalism. We
frequently engage in test-retest behavior in our everyday lives. How often have you
stepped on the bathroom scale twice in a matter of seconds?
The test-retest method of measuring reliability may be both difficult and problem-
atic, since one must measure the phenomenon at two different points. It is possible
that two different results may be obtained because what is being measured has
changed, not because the measure is unreliable. For example, if your bathroom
scale gives you two different weights within a few seconds, the scale is unreliable,
as your weight cannot have changed. However, if you weigh yourself once a week
for a month and find that you get different results each time, is the scale unreliable,
or has your weight changed between measurements? A further problem with the
test-retest check for reliability is that the administration of the first measure may
affect the second measure's results. For instance, the difference between SAT Reasoning Test scores the first and second times that individuals take the test may not be assumed to be a measure of the reliability of the test, since test takers might alter
their behavior the second time as a result of taking the test the first time (e.g., they
might learn from their first experience with the test).
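As a sketch, test-retest reliability is often summarized as the correlation between the two sets of scores; the scores below are hypothetical, and the `pearson` helper simply computes an ordinary product-moment correlation:

```python
def pearson(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

time1 = [12, 18, 25, 31, 40, 22, 28, 35]  # liberalism scores, first day
time2 = [14, 17, 26, 30, 41, 20, 29, 36]  # same respondents, a week later

print(round(pearson(time1, time2), 2))
```

A correlation near 1 means respondents kept roughly the same relative positions across the two administrations, which is the evidence of reliability the text describes, subject to the caveat that what is being measured may itself have changed in the interval.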
be measured twice. Depending on the length of time between the two measurements, what is being measured may change.
The test-retest, alternative-form, and split-halves methods provide a basis for calcu-
lating the similarity of results of two or more applications of the same or equivalent
measures. The less consistent the results are, the less reliable the measure. Polit-
ical scientists take very seriously the reliability of the measures they use. Survey
researchers are often concerned about the reliability of the answers they receive. For
example, respondents' answers to survey questions often vary considerably when
the instruments are given at two different times. 16 If respondents are not concen-
trating or taking the survey seriously, the answers they provide may as well have
been pulled out of a hat.
Now, let us return to the example of measuring your weight using a home scale.
If you weigh yourself on your home scale, then go to the gym and weigh yourself
again there, and get the same number (alternative forms test of reliability), you
may conclude that your home scale is reliable. But what if you get two different
numbers? Assuming your weight has not changed, what is the problem? If you go
back home immediately and step back on your home scale and find that it gives
you a measurement that is different from the first it gave you, you could conclude
that your scale has a faulty mechanism, is inconsistent, and therefore is unreliable.
However, what if your bathroom scale gives you the same weight as the first time?
It would appear to be reliable. Maybe the gym scale is unreliable. You could test
this out by going back to the gym and reweighing yourself. If the gym scale gives a
reading different from the one it gave the first time, then it is unreliable. But what if
the gym scale gives consistent readings? Each scale appears to be reliable (the scales
are not giving you different weights at random), but at least one of them is giving
you a wrong measurement (that is, not giving you your correct weight). This is a
problem of validity.
Validity
Essentially, a valid measure is one that measures what it is supposed to measure.
Unlike reliability, which depends on whether repeated applications of the same or
equivalent measures yield the same result, validity refers to the degree of corre-
spondence between the measure and the concept it is thought to measure.
Let us consider first an example of a measure whose validity has been questioned:
voter turnout. Many studies examine the factors that affect voter turnout and, thus,
16 Philip E. Converse, "The Nature of Belief Systems in Mass Publics," in Ideology and Discontent, ed. David E. Apter (New York: Free Press of Glencoe, 1964); Pauline Marie Vaillancourt, "Stability of Children's Survey Responses," Public Opinion Quarterly 37, no. 3 (1973): 373-87; J. Miller McPherson, Susan Welch, and Cal Clark, "The Stability and Reliability of Political Efficacy: Using Path Analysis to Test Alternative Models," American Political Science Review 71, no. 2 (1977): 509-21; and Philip E. Converse and Gregory B. Markus, "Plus ça change ... : The New CPS Election Study Panel," American Political Science Review 73, no. 1 (1979): 32-49.
Face validity may be asserted (not empirically demonstrated) when the measure-
ment instrument appears to measure the concept it is supposed to measure. To
assess the face validity of a measure, we need to know the meaning of the concept
being measured and whether the information being collected is "germane to that
concept." 18 For example, let us return to thinking about how we might measure
political ideology-that is, whether someone is conservative, moderate, or liberal.
Such a measure could be as simple as a question used by the Pew Research Cen-
ter: "Do you think of yourself as conservative, moderate or liberal?"19 On its face,
this measure appears to capture the intended concept, so it has face validity. It
might be tempting to use individuals' responses to a question on party identifica-
tion, but one would be assuming that all Democrats are liberal and all Republicans
17 Raymond E. Wolfinger and Steven J. Rosenstone, appendix A in WhoVotes?(New Haven, Conn.: Yale
University Press, 1980).
18 Kenneth D. Bailey, Methodsof Social Research(New York: Free Press, 1978), 58.
19 Pew Research Center. Retrieved December 5, 2014. Available at http://www.pewresearch.org/data-trend/political-attitudes/political-ideology/
are conservative. Also, if the party identification variable included a category for
independents, what would be their ideology? Can you assume they are all moder-
ates? For these reasons, a question measuring party identification would lack face
validity as a measure of ideology.
In general, measures lack face validity when there are good reasons to question the
correspondence of the measure to the concept in question. In other words, assess-
ing face validity is essentially a matter of judgment. If no consensus exists about the
meaning of the concept to be measured, the face validity of the measure is bound
to be problematic.
Content validity is similar to face validity but involves determining the full domain
or meaning of a particular concept and then making sure that all components of
the meaning are included in the measure. For example, suppose you wanted to
design a measure of the extent to which a nation's political system is democratic.
As noted earlier, democracy means many things to many people. Raymond D. Gastil
constructed a measure of democracy that included two dimensions, political rights
and civil liberties. His checklists for each dimension consisted of eleven items. 20
Political scientists are often interested in concepts with multiple dimensions or
complex domains, like democracy, and spend quite a bit of time discussing and
justifying the content of their measures. In order for a measure of Gastil's concep-
tion of democracy to achieve content validity, the measure should capture all eleven
components in the definition.
Let us return to the question of measuring the power of legislative leaders because it
provides a good example of the importance of construct validity. As we pointed out
before, the perceived-influence approach to measuring power is more difficult to
use than the formal-powers approach. Therefore, if the two measures are shown to
have construct validity, operationalizing leadership power using the formal-powers
approach by itself might be a valid way to measure the concept. If the two measures
do not have construct validity, then it would be clear that the two approaches are
not measuring the same thing. Thus, which measure is used could greatly affect the
findings of research into the factors associated with the presence of strong leader-
ship power or on the consequences of such power. These were the very questions
raised by political scientist James Coleman Battista.21 He constructed several mea-
sures of perceived leadership power and correlated them with a measure of formal
power. The results, shown-in table 5-1, show that the me~ure of formal power
correlates only weakly with three measures of perceived power (which, as expected,
correlate w~l with one another). Therefore, measures of perceived power and the
measure of formal power do not demonstrate,convergent construct validity.
Let us return to the researcher who wants to develop a valid measure of liberalism.
First, the researcher might measure people's attitudes toward (1) welfare, (2) mili-
tary spending, (3) abortion, (4) Social Security benefit levels, (5) affirmative action,
(6) a progressive income tax, (7) school vouchers, and (8) protection of the rights
of the accused. Then the researcher could determine how the responses to each .
question relate to the responses to each of the other questions. The validity of the
measurement scheme would be demonstrated if strong relationships existed among
people's responses across the eight questions.
21 James Coleman Battista, "Formal and Perceived Leadership Power in U.S. State Legislatures," State Politics and Policy Quarterly 11, no. 1 (2011).
22 Joseph A. Gliem and Rosemary R. Gliem, "Calculating, Interpreting, and Reporting Cronbach's Alpha Reliability Coefficient for Likert-Type Scales" (paper presented at the 2003 Midwest Research to Practice Conference in Adult, Continuing, and Community Education, Ohio State University, Columbus, 2003). Available at https://scholarworks.iupui.edu/bitstream/handle/1805/344/Gliem+&+Gliem.pdf?sequence=1
Table 5-1 Correlations among Measures of Formal and Perceived Leadership Power

[Only the final row of the table is legible in this reproduction:]
Combined power index .915* .558* .338* .375*
*p < .05
Source: James Coleman Battista, "Formal and Perceived Leadership Power in U.S. State Legislatures," State Politics and Policy Quarterly 11, no. 1 (2011): tab. 2, p. 209. Copyright © The Author 2011. Reprinted by permission of SAGE Publications.
Note: We show here only the results for Battista's analysis when southern states are excluded.
Because southern states have a history of little two-party competition but may have to contend with
party factions, and the question asked respondents to rate the power of legislative leaders as party
leaders, respondents were likely to give low perceived power ratings. Southern legislative leaders,
however, may be strong chamber leaders, which is what is being measured in states with a history of
two-party competition. See chapter 13 for an explanation of correlation.
The results of such interitem association tests are often displayed in a correlation matrix. Such a display shows how strongly related each of the items in the mea-
surement scheme is to all the other items. In the hypothetical data shown in table
5-2, we can see that people's responses to six of the eight measures were strongly
related to each other, whereas responses to the questions on protection of the rights
of the accused and school vouchers were not part of the general pattern. Thus, the
researcher would probably conclude that the first six items all measure liberalism
and that, taken together, they are a valid measurement of liberalism.
The figures in table 5-2 are product-moment correlations: numbers that can vary in value from -1.0 to +1.0 and that indicate the extent to which one variable is
related to another. The closer the correlation is to ±1, the stronger the relationship;
the closer the correlation is to 0.0, the weaker the relationship (see chapter 13 for
a full explanation). The figures in the last two rows are considerably closer to 0.0
than are the other entries, indicating that people's answers to the questions about
school vouchers and rights of the accused did not follow the same pattern as their
answers to the other questions. Therefore, it looks like school vouchers and rights
of the accused are not connected to the same concept of liberalism as measured by
the other questions.
The Building Blocks of Social Scientific Research: Measurement 143
Table 5-2 Correlation Matrix for Eight Measures of Liberalism

Items: welfare, military spending, abortion, Social Security, affirmative action, income tax, school vouchers, rights of accused. Only some rows of the matrix survived reproduction legibly:

Welfare: 1
Military spending: .26, 1
Affirmative action: .63, .38, .59, .69, 1
Income tax: .48, .67, .75, .39, .51, 1
[Remaining rows, including school vouchers and rights of the accused, are illegible in this reproduction.]

Note: Hypothetical data.
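The interitem logic behind a table like 5-2 can be sketched with made-up data. The item names echo the text's example, but the responses and the "item-rest" summary below (each item correlated with the sum of the remaining items) are illustrative, not the table's actual figures:

```python
def pearson(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Hypothetical responses from ten people (1 = most conservative answer,
# 5 = most liberal). "vouchers" is built to stray from the pattern.
items = {
    "welfare":  [5, 4, 4, 2, 1, 5, 3, 2, 4, 1],
    "military": [4, 5, 3, 2, 1, 4, 3, 1, 5, 2],
    "abortion": [5, 4, 5, 1, 2, 4, 3, 2, 4, 1],
    "vouchers": [1, 4, 2, 5, 3, 1, 4, 2, 3, 5],  # deliberately stray item
}

item_rest = {}
for name, scores in items.items():
    # Sum of every respondent's answers on all the *other* items.
    rest = [sum(other[i] for key, other in items.items() if key != name)
            for i in range(len(scores))]
    item_rest[name] = pearson(scores, rest)
    print(name, round(item_rest[name], 2))
```

Items that tap the same underlying concept show clearly positive item-rest correlations, while the stray item does not; summary statistics such as Cronbach's alpha (see footnote 22) condense the same interitem information into a single reliability coefficient.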
Content and face validity are difficult to assess when agreement is lacking on the
meaning of a concept, and construct validity, which requires a well-developed the-
oretical perspective, usually yields a less-than-definitive result. The interitem asso-
ciation test requires multiple measures of the same concept. Although these validity
"tests" provide important evidence, none of them is likely to support an unequivo-
cal decision concerning the validity of particular measures.
Please look at the booklet and tell me the letter of the income group
that includes the income of all members of your family living here in
2003 before taxes. This figure should include salaries, wages, pensions,
dividends, interest, and all other income. Please tell me the letter of the income group that includes the income you had in 2003 before taxes.
B. $3,000-$4,999    N. $35,000-$39,999
C. $5,000-$6,999    O. $40,000-$44,999
D. $7,000-$8,999    P. $45,000-$49,999
E. $9,000-$10,999   Q. $50,000-$59,999
F. $11,000-$12,999  R. $60,000-$69,999
G. $13,000-$14,999  S. $70,000-$79,999
H. $15,000-$16,999  T. $80,000-$89,999
I. $17,000-$19,999  U. $90,000-$104,999
J. $20,000-$21,999  V. $105,000-$119,999
K. $22,000-$24,999  W. $120,000 and over
L. $25,000-$29,999
Both the reliability and the validity of this method of measuring income are ques-
tionable. Threats to the reliability of the measure include the following:
• Respondents may not know how much money they make and therefore
incorrectly guess their income.
• Respondents may not know how much money other family members make and guess incorrectly.
• Respondents may know how much they make but carelessly select the
wrong categories.
• Interviewers may circle the wrong categories when listening to the
selections of the respondents.
• Data-entry personnel may touch the wrong numbers when entering the
answers into the computer.
• Dishonest interviewers may incorrectly guess the income of a respondent
who does not complete the interview.
• Respondents may not know which family members to include in the
income total; some respondents may include only a few family members,
while others may include even distant relations.
• Respondents whose income is on the border between two categories may
not know which one to pick. Some may pick the higher category; others,
the lower one.
Because of these measurement problems, if this measure were applied to the same
people at two different times, we could expect the results to vary, resulting in
The Building Blocks of Social Scientific Research: Measurement 145
inaccurate measures that are too high for some respondents and too low for others.
Some amount of random measurement error is likely to occur with any measure-
ment scheme.
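The effect of random measurement error on test-retest reliability can be illustrated with a short simulation. This is a minimal sketch, not part of the chapter's examples: the incomes are synthetic, the function names are ours, and Pearson correlation stands in for a test-retest reliability estimate.

```python
import random

def measure(true_values, error_sd):
    """One wave of measurements: the true value plus random, nonsystematic error."""
    return [v + random.gauss(0, error_sd) for v in true_values]

def correlation(xs, ys):
    """Pearson correlation between two waves, a common test-retest reliability estimate."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(1)
true_incomes = [random.uniform(20_000, 120_000) for _ in range(500)]

wave1 = measure(true_incomes, error_sd=2_000)    # small reporting errors
wave2 = measure(true_incomes, error_sd=2_000)
noisy1 = measure(true_incomes, error_sd=40_000)  # large reporting errors
noisy2 = measure(true_incomes, error_sd=40_000)

# More random error lowers the correlation between repeated measurements.
print(correlation(wave1, wave2) > correlation(noisy1, noisy2))
```

Running the sketch shows that the same respondents, measured twice, correlate strongly when errors are small and much more weakly when errors are large — exactly the variation across repeated applications described above.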
In addition to these threats to reliability, there are numerous threats to the validity
of this measure:
• Respondents may have illegal income they do not want to reveal and
therefore may systematically underestimate their income.
• Respondents may try to impress the interviewer, or themselves, by
systematically overestimating their income.
• Respondents may systematically underestimate their before-tax income
because they think of their take-home pay and underestimate how much
money is being withheld from their paychecks.
• Respondents may systematically skip the question due to privacy
concerns over providing a precise number even if they know it.
Notice that this second list of problems contains the word systematically. These
problems are not simply caused by random inconsistencies in measurements, with
some being too high and others too low for unpredictable reasons. Systematic
measurement error introduces error that may bias research results, thus compromising
the confidence we have in them.
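The contrast with random error can also be sketched in code. In this hypothetical scenario (our construction, not the chapter's), every respondent systematically hides a fixed share of income: the measure remains perfectly consistent across repetitions, yet every estimate built from it is biased downward.

```python
import random

random.seed(42)
true_incomes = [random.uniform(20_000, 120_000) for _ in range(1000)]

# Systematic error: every respondent reports only 85% of true income
# (for example, omitting unreported cash earnings).
reported = [v * 0.85 for v in true_incomes]

mean_true = sum(true_incomes) / len(true_incomes)
mean_reported = sum(reported) / len(reported)

# A reliability check would not catch this: respondents answer the same way
# every time. But the estimated mean income is shifted downward by 15 percent.
print(mean_reported < mean_true)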
This long list of problems with both the reliability and the validity of this fairly
straightforward measure of a relatively concrete concept is worrisome. Imagine how
much more difficult it is to develop reliable and valid measures when the concept
is abstract (for example, tolerance, environmental conscience, self-esteem, or liber-
alism) and the measurement scheme is more complicated.
The reliability and validity of the measures used by political scientists are seldom
demonstrated to everyone's satisfaction. Most measures of political phenomena are
neither completely invalid or valid nor thoroughly unreliable or reliable but, rather,
are partly accurate. Therefore, researchers generally present the rationale and evi-
dence available in support of their measures and attempt to persuade their audi-
ence that their measures are at least as accurate as alternative measures would be.
Nonetheless, a skeptical stance on the part of the reader toward the reliability and
validity of political science measures is often warranted.
Note, finally, that reliability and validity are not the same thing. A measure may
be reliable without being valid. One may devise a series of questions to measure
liberalism, for example, that yields the same result for the same people every time
but that misidentifies individuals. A valid measure, however, will also be reliable: if
it accurately measures the concept in question, then it will do so consistently across
measurements, allowing, of course, for some random measurement error that may
occur. It is more important, then, to demonstrate validity than reliability, but reli-
ability is usually more easily and precisely tested.
Suppose, for example, that we wanted to measure the height of political candidates
to see if taller candidates usually win elections. Height could be measured in many
different ways. We could have two categories of the variable "height"-tall and
short-and assign different candidates to the two categories based on whether they
were of above-average or below-average height. Or we could compare the heights
of candidates running for the same office and measure which candidate was the
tallest, which the next tallest, and so on. Or we could take a tape measure and mea-
sure each candidate's height in inches and record that measure. The last method of
measurement captures the most information about each candidate's height and is,
therefore, the most precise measure of the attribute.
Levels of Measurement
When we consider the precision of our measurements, we refer to the level of
measurement. The level of measurement involves the type of information that we
think our measurements contain and the mathematical properties that determine
the type of comparisons that can be made across a number of observations on the
same variable. The level of measurement also refers to the claim-we are ~lling to
make when we assign numbers to our measurements.
There are four different levels of measurement: nominal, ordinal, interval, and ratio.
While few concepts used in political science research inherently require a particular
level of measurement, there are methodological limitations because some measures
provide more information and better mathematical properties than others. So the
level of measurement used to measure any particular concept is a function of both
the researcher's imagination and resources, and methodological needs.
We begin with nominal measurement, the level that has the fewest mathematical
properties of the four levels. A nominal-level measure indicates that the values
assigned to a variable represent only different categories or classifications for that
variable. In such a case, no category is more or less than another category; they
are simply different. For example, suppose we measure the religion of individuals
by asking them to indicate whether they are Christian, Jewish, Muslim, or other.
Since the four categories or values for the variable religion are simply different,
the measurement is at a nominal level. Other common examples of nominal-level
measures are gender, marital status, and state of residence. A nominal measure
of partisan affiliation might have the following categories: Democrat, Republican,
Green, Libertarian, other, and none. Numbers will be assigned to the categories
when the data are coded for statistical analysis, but these numbers do not repre-
sent mathematical differences between the categories--any of the parties could be
assigned any number, as long as those numbers are different from each other. In
this sense, nominal-level measures provide the least amount of information about a
concept.

An ordinal measurement has all of the properties of a nominal measure
but also assumes observations can be compared in terms of having more or less
of a particular attribute. Hence, the ordinal level of measurement captures more
information about the measured concept and has more mathematical properties
than a nominal-level measure. For example, we could create an ordinal measure of
formal education completed with the following categories: "eighth grade or less,"
"some high school," "high school graduate," "some college," and "college degree or
more." Here we are concerned not with the exact difference between the catego-
ries of education but only with whether one category is more or less than another.
When coding this variable, we would assign higher numbers to higher categories
of education. The intervals between the numbers have no meaning; all that matters
is that the higher numbers represent more of the attribute than do the lower num-
bers. An ordinal variable measuring partisan affiliation with the categories "strong
Republican," "weak Republican," "neither leaning Republican nor Democrat,"
"weak Democrat," and "strong Democrat" could be assigned codes 1, 2, 3, 4, 5 or 1,
2, 5, 8, 9 or any other combination of numbers, as long as they were in ascending
or descending order.
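The point about admissible codings can be made concrete in a few lines of Python (the category labels and data here are our own illustration). Two codings of the same ordinal partisan-affiliation measure with different intervals sort respondents into exactly the same order:

```python
# Two admissible codings of the same ordinal partisan-affiliation measure.
# The intervals between codes differ; only the ordering carries information.
codes_a = {"strong Rep": 1, "weak Rep": 2, "neither": 3, "weak Dem": 4, "strong Dem": 5}
codes_b = {"strong Rep": 1, "weak Rep": 2, "neither": 5, "weak Dem": 8, "strong Dem": 9}

respondents = ["weak Dem", "strong Rep", "neither", "strong Dem", "weak Rep"]

# Sorting respondents by either coding yields an identical ordering.
order_a = sorted(respondents, key=codes_a.get)
order_b = sorted(respondents, key=codes_b.get)
print(order_a == order_b)  # True: the codings are interchangeable at the ordinal level
```

For a nominal measure, by contrast, any one-to-one assignment of numbers to categories would be admissible, since no ordering is implied at all.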
Political science researchers have measured many concepts at the ratio level. Peo-
ple's ages, unemployment rates, percentage of the vote for a particular candidate,
and crime rates are all measures that contain a zero point and possess the full math-
ematical properties of the numbers used. However, more political science research
has probably relied on nominal- and ordinal-level measures than on interval- or
ratio-level measures. This has restricted the types of hypotheses and analysis tech-
niques that political scientists have been willing and able to use.
23 The distinction between an interval-level and a ratio-level measure is not always clear, and some
political science texts do not distinguish between them. Interval-level measures in political science
are rather rare; ratio-level measures (money spent, age, number of children, years living in the same
location, for example) are more common.
Nominal and ordinal variables with many categories or interval- and ratio-level
measures using more decimal places are more precise than measures with fewer
categories or decimal places, but sometimes the result may provide more informa-
tion than can be used. Researchers frequently start out with ratio-level measures or
with ordinal and nominal measures with quite a few categories but then collapse
or combine the data to create groups or fewer categories. They do this so that they
have enough cases in each category for statistical analysis or to make compari-
sons easier to follow. For example, one might want to present comparisons simply
between Democrats and Republicans rather than presenting data broken down into
categories of strong, moderate, and weak for each party.
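Collapsing categories of this kind is a simple recoding step. The sketch below is our own illustration (the category labels are hypothetical, not a specific survey's codebook): a seven-category party-identification measure is reduced to a two-party comparison.

```python
def collapse_party(detailed):
    """Collapse a detailed party-identification category into two groups.
    Independents and leaners are excluded (returned as None) from the
    two-party comparison."""
    if detailed in ("strong Democrat", "weak Democrat"):
        return "Democrat"
    if detailed in ("strong Republican", "weak Republican"):
        return "Republican"
    return None  # not part of the collapsed two-category comparison

sample = ["strong Democrat", "weak Republican", "independent", "weak Democrat"]
print([collapse_party(r) for r in sample])
```

The collapsed variable carries less information than the original, which is exactly the trade-off described above: fewer categories, but enough cases in each for analysis.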
It may seem contradictory now to point out that extremely precise measures also
may create problems. For example, measures with many response possibilities take
up space if they are questions on a written questionnaire or require more time to
explain if they are included in a telephone survey. Such questions may also confuse
or tire survey respondents. A more serious problem is that they may lead to mea-
surement error. Think about the possible responses to a question asking respon-
dents to use a 100-point scale (called a thermometer scale) to indicate their support
for or opposition to a political candidate, assuming that 50 is considered the neutral
position and 0 is least favorable or "coldest" and 100 most favorable. Some respon-
dents may not use the whole scale (to them, no candidate ever deserves more than
an 80 or less than a 20), whereas others may use the ends and the very middle of
the scale and ignore the scores in between. We might predict that a person who
gives a candidate a 100 is more likely to vote for that candidate than is a person
who gives the same candidate an 80, but in reality they may like the candidate
pretty much the same way and would be equally likely to vote for the candidate.
Another problem with overly precise measurements is that they may be unreliable.
If asked to rate candidates on more than one occasion, respondents could vary
slightly the number that they choose, even if their opinion has not changed.
Multi-Item Measures
Many measures consist of a single item. For example, the measures of party
identification, whether or not one party controls Congress, the percentage of
the vote received by a candidate, how concerned about an issue a person is, the
policy area of a judicial case, and age are all based on a single measure of each
phenomenon in question. Often, however, researchers need to devise measures of
more complicated phenomena that have more than one facet or dimension. For
example, internationalism, political ideology, political knowledge, dispersion of
political power, and the extent to which a person is politically active are complex
phenomena or concepts that may be measured in many different ways.
In this sit~ation, researchers often develop a measurement strategy that allows them
to capture numerous aspects of a complex phenomenon while representing the
existence of that phenomenon in particular cases with a single representative value.
Usually this involves the construction of a multi-item index or scale representing
the several dimensions of the phenomenon. These multi-item measures are useful
because they enhance the accuracy of a measure, simplify a researcher's data by
reducing them to a more manageable size, and increase the level of measurement
of a phenomenon. In the remainder of this section, we describe several common
types of indexes and scales.
Indexes
A summation index is a method of accumulating scores on individual items to
form a composite measure of a complex phenomenon. An index is constructed by
assigning a range of possible scores for a certain number of items, determining the
score for each item for each observation, and then combining the scores for each
observation across all the items. The resulting summary score is the representative
measurement of the phenomenon.
The index in table 5-3 is a simple, additive one; that is, each item counts equally
toward the calculation of the index score, and the total score is the summation of
the individual item scores. However, indexes may be constructed with more com-
plicated aggregation procedures and by counting some items as more important
than others. In the preceding example, a researcher might consider some indicators
of freedom as more important than others and wish to have them contribute more
to the calculation of the final index score. This could be done either by weighting
Table 5-3

                                                      Country
Indicator                                        A    B    C    D    E
Privately owned newspapers                       1    0    0    0    1
Legal right to form political parties            1    1    0    0    0
Contested elections for significant
  public offices                                 1    1    0    0    0
Voting rights extended to most of
  the adult population                           1    1    0    1    0
Limitations on government's ability
  to incarcerate citizens                        1    0    0    0    1
Index score                                      5    3    0    1    2

Note: Hypothetical data. The score is 1 if the answer is yes, 0 if no.
Indexes are often used with public opinion surveys to measure political attitudes.
This is because attitudes are complex phenomena and we usually do not know
enough about them to devise single-item measures. So we often ask several ques-
tions of people about a single attitude and aggregate the answers to represent the
attitude. A researcher might measure attitudes toward abortion, for example, by
asking respondents to choose one of five possible responses-strongly agree, agree,
undecided, disagree, and strongly disagree-to the following three statements:
(1) Abortions should be permitted in the first three months of pregnancy. (2) Abortions
should be permitted if the woman's life is in danger. (3) Abortions should be permit-
ted whenever a woman wants one.
Indexes are typically fairly simple ways of producing single representative scores of
complicated phenomena such as political attitudes. They are probably more accu-
rate than most single-item measures, but they may also be flawed in important
ways. Aggregating scores across several items assumes, for example, that each item
is equally important to the summary measure of the concept and that the items
used faithfully encompass the domain of the concept. Although individual item
scores can be weighted to change their contribution to the summary measure, the
researcher often has little information upon which to base a weighting scheme.
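Weighting changes the arithmetic only slightly. The sketch below is a hypothetical example of ours, not a published weighting scheme: one indicator (contested elections) is judged twice as important as the others.

```python
def weighted_index(scores, weights):
    """Weighted summation index: items judged more important contribute more.
    The weights themselves reflect the researcher's judgment."""
    return sum(weights[item] * value for item, value in scores.items())

# Hypothetical weighting: contested elections count double.
weights = {"newspapers": 1, "parties": 1, "elections": 2, "suffrage": 1, "due process": 1}
scores = {"newspapers": 1, "parties": 1, "elections": 1, "suffrage": 1, "due process": 0}

print(weighted_index(scores, weights))   # weighted score
print(sum(scores.values()))              # unweighted score, for comparison
```

Note that the choice of weights, like the choice of items, is arbitrary unless the researcher has evidence to justify it, which is the limitation described above.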
Several standard indexes are often used in political science research. The FBI crime
index, the Consumer Confidence Index, and the Consumer Price Index, for example,
have been used by many researchers. Before using these or any other readily avail-
able index, you should familiarize yourself with its construction and be aware of any
questions raised about its validity. Although simple summation indexes are gener-
ally more accurate than single-item measures of complicated phenomena, it is often
unclear how valid they are or what level of measurement they represent. For exam-
ple, is the index of freedom an ordinal-level measure, or could it be an interval-level
or even a ratio-level measure? Another possible issue with indexes such as the
Consumer Price Index is that what goes into its calculation can change over time.24
24 Brett Arends, "Why You Can't Trust the Inflation Numbers," Wall Street Journal, January 26, 2011,
http://online.wsj.com/article/SB10001424052748704013604576104351050317610.html
[Sidebar, partially lost in scan: Richard A. Clucas's index of the institutional power of state house speakers. Speakers are scored on several categories of power, including committee appointments, bill referral, and floor powers, with higher scores indicating greater control.]

Clucas's overall index can range in value from 3 to 25. Each category has a maximum value of 5; thus, he is making the claim that full power in each category is of equal significance. He uses the index as if it were an interval-level measure, which means that a unit change in any power category is equivalent. These assumptions are reasonable but are certainly open to challenge.

Source: Richard A. Clucas, "Principal-Agent Theory and the Power of State House Speakers," Legislative Studies Quarterly 26, no. 2 (2001): 319-38; material taken from the appendix on pp. 335 and 336.
Scales
Although indexes are generally an improvement over single-item measures, their
construction also contains an element of arbitrariness. Both the selection of partic-
ular items making up the index and the way in which the scores on individual items
are aggregated are based on the researcher's judgment. Scales are also multi-item
measures, but the selection and combination of items in them is accomplished more
systematically than is usually the case for indexes. Over the years, several different
kinds of multi-item scales have been used frequently in political science research.
We discuss three of them: Likert scales, Guttman scales, and Mokken scales.
A Likert scale score is calculated from the scores obtained on individual items.
Each item generally asks a respondent to indicate a degree of agreement or dis-
agreement with the item, as with the abortion questions discussed earlier. A Likert
scale differs from an index, however, in that once the scores on each of the items
are obtained, only some of the items are selected for inclusion in the calculation of
the final score. Those items that allow a researcher to distinguish most readily those
scoring high on an attribute from those scoring low will be retained, and a new
scale score will be calculated based only on those items.
                                      Strongly                                   Strongly
                                      Disagree   Disagree   Undecided   Agree    Agree
                                        (1)        (2)        (3)        (4)      (5)
Wealthy people should pay taxes at a
much higher rate than poor people.
Busing should be used to integrate
public schools.
responses of the most liberal and the most conservative people to each question
would be compared; any questions with similar answers from the disparate respon-
dents would be eliminated-such questions would not distinguish liberals from
conservatives. A new summary scale score for all the respondents would be calcu-
lated from the questions that remained. A statistic called Cronbach's alpha, which
measures internal consistency of the items in the scale and has a maximum value of
1.0, is used to determine which items to drop from the scale. The rule of thumb is
that Cronbach's alpha should be 0.8 or above; items are dropped from the scale one
at a time until this value is reached. 25
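Cronbach's alpha is straightforward to compute. The following is a minimal sketch using the standard formula, alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores); the data are made up for illustration.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a respondents-by-items matrix of scores."""
    k = len(responses[0])                                  # number of items
    item_variances = [variance(item) for item in zip(*responses)]
    total_scores = [sum(row) for row in responses]
    return (k / (k - 1)) * (1 - sum(item_variances) / variance(total_scores))

# Perfectly consistent items yield alpha = 1.0 ...
print(cronbach_alpha([[1, 1], [2, 2], [3, 3], [4, 4]]))
# ... while weakly related items yield a lower alpha. Below the 0.8 rule of
# thumb, items would be dropped one at a time until the threshold is reached.
print(round(cronbach_alpha([[1, 2], [2, 1], [3, 3], [4, 2]]), 3))
```

In practice a researcher would recompute alpha after each candidate deletion and keep the item set whose alpha clears the threshold.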
Likert scales are improvements over multi-item indexes because the items that
make up the multi-item measure are selected in part based on the respondents'
behavior rather than on the researcher's judgment. Likert scales suffer two of the
other defects of indexes, however. The researcher cannot be sure that all the dimen-
sions of a concept have been measured, and the relative importance of each item is
still determined arbitrarily.
The Guttman scale also uses a series of items to produce a scale score for respon-
dents. Unlike the Likert scale, however, a Guttman scale presents respondents with a
range of attitude choices that are increasingly difficult to agree with; that is, the items
composing the scale range from those easy to agree with to those difficult to agree
with. Respondents who agree with one of the "more difficult" attitude items will also
generally agree with the "less difficult" ones. (Guttman scales have also been used to
measure attributes other than attitudes. Their main application has been in the area
of attitude research, however, so an example of that type is used here.)
25 Gliem and Gliem, "Calculating, Interpreting, and Reporting Cronbach's Alpha Reliability Coefficient
for Likert-Type Scales."
This array of items seems likely to result in responses consistent with Guttman
scaling. A respondent agreeing with any one of the items is likely also to agree with
those items numbered lower than that one. This would result in the "stepwise"
pattern of responses characteristic of a Guttman scale.
Suppose six respondents answered this series of questions, as shown in table 5-5.
Generally speaking, the pattern of responses is as expected; those who agreed with
the "most difficult" questions were also likely to agree with the "less difficult" ones.
However, the responses of three people (2, 4, and 5) to the question about the father's
preferences do not fit the pattern. Consequently, the question about the father does
not seem to fit the pattern and would be removed from the scale. Once that has been
done, the stepwise pattern becomes clear.
With real data, it is unlikely that every respondent would give answers that fit the
pattern perfectly. For example, in table 5-5, respondent 6 gave an "agree" response
to the question about incest or rape. This response is unexpected and does not fit
the pattern. Therefore, we would be making an error if we assigned a scale score of 0
to respondent 6. When the data fit the scale pattern well (number of errors is small),
researchers assume that the scale is an appropriate measure and that the respon-
dent's "error" may be "corrected" (in this case, either the "agree" in the case of incest
or rape or the "disagree" in the case of the life of the woman). There are standard
procedures to follow to determine how to correct the data to make it conform to the
scale pattern. We emphasize, however, that this is done only if the changes are few.
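The error-counting logic behind such corrections can be sketched briefly. This is our own illustration, not a standard library routine: responses are coded 1 (agree) / 0 (disagree) in order from easiest to hardest item, and departures from the ideal stepwise pattern count as errors.

```python
def guttman_score_and_errors(responses):
    """For responses ordered easiest-to-hardest, the scale score is the number
    of agreements. Errors are departures from the ideal 'stepwise' pattern
    (all agreements first, then all disagreements)."""
    score = sum(responses)
    ideal = [1] * score + [0] * (len(responses) - score)
    errors = sum(a != b for a, b in zip(responses, ideal))
    return score, errors

print(guttman_score_and_errors([1, 1, 1, 0, 0, 0]))  # perfect stepwise pattern
print(guttman_score_and_errors([1, 0, 1, 1, 0, 0]))  # one item out of order
```

When the total error count across respondents is small relative to the number of responses, researchers treat the scale as an acceptable measure and correct the few deviant responses.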
Guttman scales differ from Likert scales in that, in the former case, generally only
one set of responses will yield a particular scale score. That is, to get a score of 3
                                                                          No. of    Revised
              Life of    Incest     Unhealthy                             agree     scale
Respondent    woman      or rape    fetus       Father   Afford  Anytime  answers   score
1             A          A          A           A        A       A        6         5
2             A          A          A           D        A       D        4         4
3             A          A          A           D        D       D        3         3
4             A          A          D           A        D       D        3         2
5             A          D          D           A        D       D        2         1
6             D          A          D           D        D       D        1         0
on the abortion scale, a particular pattern of responses (or something very close to
it) is necessary. In the case of a Likert scale, however, many different patterns of
responses can yield the same scale score. A Guttman scale is also much more diffi-
cult to achieve than a Likert scale, since the items must have been ordered and be
perceived by the respondents as representing increasingly more difficult responses
reflecting the same attitude.
Both Likert and Guttman scales have shortcomings in their level of measurement.
The level of measurement produced by Likert scales is, at best, ordinal (since we
do not know the relative importance of each item and so cannot be sure that a 5
answer on one item is the same as a 5 answer on another), and the level of measure-
ment produced by Guttman scales is usually assumed to be ordinal.
Another type of scaling procedure, called Mokken scaling, also analyzes responses
to multiple items by respondents to see if, for each item, respondents can be
ordered and if items can be ordered. 26 The Mokken scale was used by Saundra K.
Schneider, William G. Jacoby, and Daniel C. Lewis to see if there was structure and
coherence in public opinion regarding the distribution of responsibilities between
the federal government and state and local governments. 27 Respondents were asked
whether they thought state or local governments "should take the lead" rather than
the national government for thirteen different policy areas. The scaling procedure
allowed the researchers to see if a specific sequence of policies emerged while mov-
ing from one end of the scale to the other. One end of the scale would indicate
maximal support for national policy activity, while the other end would indicate
maximal support for subnational government policy responsibility.
The results of their analysis are shown in figure 5-1. The scale runs from 0 to 13,
with 0 indicating that the national government should take the lead in all thirteen
policy areas and a score of 13 indicating that the respondent believes that state and
local governments should take the lead in all policy areas. A person at any scale
score believes that the state and local governments should take the lead in all pol-
icy areas that fall below that score. Thus, a person with a score of 9 believes that
the national government should take the lead in health care, equality for women,
protecting the environment, and equal opportunity, and that state and local gov-
ernments should take the lead responsibility for natural disasters down to urban
development. The bars in the figure correspond to the percentage of respondents
who received a particular score. Thus, just slightly more than 5 percent of the
respondents thought that state and local governments should take the lead in all
policy areas, whereas only 1 percent of the respondents thought the national gov-
ernment should take the lead in all thirteen policy areas. Most respondents divided
up the responsibilities. The authors concluded from their analysis that the public
[Figure 5-1: Bar chart of respondents' scale scores (0-13), with policy areas listed in scale order from 13 down to 1: health care, equality for women, protecting the environment, equal opportunity, natural disasters, assisting the elderly, assisting the disabled, assisting the poor, reducing unemployment, providing education, economic development, reducing crime, urban development. Bars show the percentage of respondents (0-15 percent) at each score.]

Source: 2006 Comparative Congressional Election Survey, cited in Saundra K. Schneider, William G. Jacoby, and Daniel C. Lewis, "Public Opinion toward Intergovernmental Policy Responsibilities," Publius: The Journal of Federalism 41, no. 1 (2011): fig. 3, p. 14. Reprinted by permission of Oxford University Press. © The Author 2010. Published by Oxford University Press on behalf of CSF Associates: Publius, Inc.
does have a rationale behind its preferences for the distribution of policy responsi-
bilities between national versus state and local governments.
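The cumulative logic of the Mokken scale can be expressed directly in code. This sketch is ours, using the policy ordering reported in the figure; a respondent's scale score splits the ordered list into state/local-led and national-led areas.

```python
# Policy areas in the scale order reported by Schneider, Jacoby, and Lewis,
# from position 1 (state/local lead most widely supported) up to 13.
POLICY_ORDER = [
    "urban development", "reducing crime", "economic development",
    "providing education", "reducing unemployment", "assisting the poor",
    "assisting the disabled", "assisting the elderly", "natural disasters",
    "equal opportunity", "protecting the environment", "equality for women",
    "health care",
]

def lead_assignments(score):
    """Split the policy areas for a respondent with the given scale score:
    areas at or below the score are state/local-led; the rest are national-led."""
    return POLICY_ORDER[:score], POLICY_ORDER[score:]

state_local, national = lead_assignments(9)
print(national)  # the areas a score-9 respondent leaves to the national government
```

A respondent with a score of 9, for example, is assigned state/local responsibility for the nine areas from urban development through natural disasters, and national responsibility for equal opportunity, protecting the environment, equality for women, and health care, matching the interpretation given in the text.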
The procedures described so far for constructing multi-item measures are fairly
straightforward. There are other advanced statistical techniques for summarizing
or combining individual items or variables. For example, it is possible that several
variables are related to some underlying concept. Factor analysis is a statistical
technique that may be used to uncover patterns across measures. It is especially
useful when a researcher has a large number of measures and when there is uncer-
tainty about how the measures are interrelated.
[Survey items from Dutcher's study of landowner attitudes, including:]
• Contributing to the improvement of downstream areas, including the Chesapeake Bay
• Maintaining or improving stream-bank stability on my property
Factor analysis is just one of many techniques developed to explore the dimension-
ality of measures and to construct multi-item scales. The readings listed at the end
of this chapter include some resources for students who are especially interested in
this aspect of variable measurement.
Through indexes and scales, researchers attempt to enhance both the accuracy and
the precision of their measures. Although these multi-item measures have received
most use in attitude research, they are often useful in other endeavors as well. Both
indexes and scales require researchers to make decisions regarding the selection of
individual items and the way in which the scores on those items will be combined
to produce more useful measures of political phenomena.
Conclusion
To a large extent, a research project is only as good as the measurements that are
developed and used in it. Inaccurate measurements will interfere with the testing
of scientific explanations for political phenomena and may lead to erroneous
conclusions. Imprecise measurements will limit the extent of the comparisons that
can be made between observations and the precision of the.knowledge that results
from empirical research.
28 Daniel D. Dutcher, "Landowner Perceptions of Protecting and Establishing Riparian Forests in Central
Pennsylvania" (PhD diss., Pennsylvania State University, 2000).
Sometimes the accuracy of measurements may be enhanced through the use of
multi-item measures. With indexes and scales, researchers select multiple indica-
tors of a phenomenon, assign scores to each of these indicators, and combine those
scores into a summary measure. Although these methods have been used most
frequently in attitude research, they can also be used in other situations to improve
the accuracy and precision of single-item measures.
Terms Introduced
Alternative-form method. A method of calculating reliability by repeating different but equivalent measures at two or more points in time.

Bias. A type of measurement error that results in systematically over- or under-measuring the value of a concept.

Construct validity. Validity demonstrated for a measure by showing that it is related to the measure of another concept.

Content validity. Validity demonstrated by ensuring that the full domain of a concept is measured.

Convergent construct validity. Validity demonstrated by showing that the measure of a concept is related to the measure of another, related concept.

Correlation matrix. A table showing the relationships among discrete measures.

Dichotomous variable. A nominal-level variable having only two categories that for certain analytical purposes can be treated as a quantitative variable.

Discriminant construct validity. Validity demonstrated by showing that the measure of a concept has a low correlation with the measure of another concept that is thought to be unrelated.

Face validity. Validity asserted by arguing that a measure corresponds closely to the concept it is designed to measure.

Factor analysis. A statistical technique useful in the construction of multi-item scales to measure abstract concepts.

Guttman scale. A multi-item measure in which respondents are presented with increasingly difficult measures of approval for an attitude.

Interitem association. A test of the extent to which the scores of several items, each thought to measure the same concept, are the same. Results are displayed in a correlation matrix.

Interval measurement. A measure for which a one-unit difference in scores is the same throughout the range of the measure.

Level of measurement. The extent or degree to which the values of variables can be compared and mathematically manipulated.

Likert scale. A multi-item measure in which the items are selected based on their ability to discriminate between those scoring high and those scoring low on the measure.

Measurement. The process by which phenomena are observed systematically and represented by scores or numerals.

Mokken scale. A type of scaling procedure that assesses the extent to which there is order in the responses of respondents to multiple items. Similar to Guttman scaling.

Nominal measurement. A measure for which different scores represent different, but not ordered, categories.

Operational definition. The rules by which a concept is measured and scores assigned.

Operationalization. The process of assigning numerals or scores to a variable to represent the values of a concept.

Ordinal measurement. A measure for which the scores represent ordered categories that are not necessarily equidistant from each other.

Random measurement error. Errors in measurement that have no systematic direction or cause.

Ratio measurement. A measure for which the scores possess the full mathematical properties of the numbers assigned.

Reliability. The extent to which a measure yields the same results on repeated trials.

Split-halves method. A method of calculating reliability by comparing the results of two equivalent measures made at the same time.

Summation index. A multi-item measure in which individual scores on a set of items are combined to form a summary measure.

Test-retest method. A method of calculating reliability by repeating the same measure at two or more points in time.

Validity. The correspondence between a measure and the concept it is supposed to measure.
Suggested Readings

DeVellis, Robert F. Scale Development: Theory and Applications. 3rd ed. Thousand Oaks, Calif.: Sage, 2011.

[Entry partially lost in scan] ... and Interpretation. 5th ed. Glendale, Calif.: Pyrczak, 2013.