
Measurement 46 (2013) 3766–3774


Using the concept of a measurement system to characterize measurement models used in psychometrics
Mark Wilson *
University of California, Berkeley, CA 94720, USA

Article info

Article history: Available online 18 April 2013

Keywords: Measurement system; Classical Test Theory; Guttman Scaling; Item Response Theory; Rasch Scaling; Construct Modeling

Abstract

The philosophy of measurement in the social and behavioral sciences is seen (from without) as typically following the representational viewpoint. However, in practice, this is not the case for the great majority of measures that are developed in this area. The paper surveys several approaches to measurement in the social sciences (i.e., Classical Test Theory, Guttman Scaling, Item Response Theory, Rasch Scaling, and Construct Modeling), as examples of measurement approaches in the area of psychometrics, and uses the foundational concept of a measuring system, as developed by Mari [1], to explicate the logic on which these approaches are based and thus enable a comparison with measurement approaches used by other fields such as engineering and physics. The paper uses the underlying concept of the standard reference set (one of the essential features of Mari's formalization) to show how the five approaches differ, and also how they are related. The importance of these differences, and the consequences for measurement using those approaches, are also explicated and discussed.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

In the social sciences in general, and in the behavioral sciences in particular, the study of measurement is most commonly sustained within the discipline called "psychometrics". There are, for each discipline within the social sciences, particular subdomains that pertain to measurement within those disciplines in particular—for example, in sociology, there is the sub-discipline of "sociometrics", etc., but, in the main, the major debates and developments in these specific sub-disciplines reflect the situation that pertains in psychometrics. Thus, in this paper, I will use psychometrics as a locale for discussing approaches to measurement in the social sciences. My particular take on this is to sketch an alternative conceptualization for the philosophical foundations of measurement in the social sciences that aligns better with the actual practice of social sciences measurement than what is currently seen as the dominant philosophical viewpoint on measurement.

In psychology and the behavioral sciences in general, and in psychometrics in particular, the representational viewpoint [2] is widely seen as dominating the philosophical foundations of measurement. This is quite understandable, as the combined philosophical works of the authors of Ref. [2] stand head and shoulders above the works of any others in this area. Not that there are no significant works by others; rather, the scope and comprehensiveness of [2] is generally seen as being without equal. However, this dominance in formal modeling has little or no correspondence with the reality of most actual measures that are constructed in these domains. The sad state of philosophizing in the area of psychometrics is that the philosophical grounding provided by these giants of the field is "More honor'd in the breach than the observance" [3]. In fact, it is somewhat rare to find examples of applications of the representational approach beyond the works of the authors of [2]. One reaction to this has been for some authors to amend the tenets of the representational approach to incorporate a probabilistic element (e.g., [4,5]).

* Tel.: +1 510 708 0322; fax: +1 510 642 4803. E-mail address: MarkW@berkeley.edu
0263-2241/$ - see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.measurement.2013.04.005
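The flavor of such a probabilistic amendment, replacing a deterministic endorsement rule with a smooth response probability, can be sketched in a few lines of code. This sketch, and its parameter values, are illustrations of mine, not taken from the paper; the Rasch response function used here is the one introduced in Section 5.

```python
import math

# Illustrative only: a deterministic (Guttman-style) response rule versus
# a probabilistic (Rasch-style) response function. theta is the person
# location and delta the item difficulty, both on the same latent scale.

def guttman_response(theta, delta):
    """Deterministic rule: endorse (1) exactly when the person is at or above the item."""
    return 1 if theta >= delta else 0

def rasch_probability(theta, delta):
    """Rasch model: Pr(endorse) = exp(theta - delta) / (1 + exp(theta - delta))."""
    return math.exp(theta - delta) / (1.0 + math.exp(theta - delta))

# A person exactly at the item's difficulty: certain endorsement under the
# deterministic rule, but a 50/50 chance under the probabilistic one.
print(guttman_response(0.0, 0.0))   # 1
print(rasch_probability(0.0, 0.0))  # 0.5
```

The point of the contrast is that the probabilistic version assigns every response vector a non-zero probability, rather than ruling some out entirely.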


Another reaction has been to seek alternative philosophical bases for measurement, such as "scientific realism" (e.g., [6,7]). The debate about this is still in its early stages, with several presentations given and planned at psychometric conferences, and only a little of it yet having reached publication (though see [8] for some background to this debate). In fact, my view is that most practicing psychometricians do not actually subscribe to a representational measurement philosophy; it seems to me that there is very little representationalism present in mainstream psychometrics, except insofar as Stevens' definition [9] of measurement can be seen as being consistent with representationalism.

As a contribution to this debate, this paper utilizes the functional concept of a measuring system, as developed by Mari [1], to explicate the logic of several measurement approaches used in psychometrics (for a summary see [10]), and thus establish grounds for the comparison of these with measurement approaches used by other fields such as engineering and physics.

2. Brief background on the concept of a measuring system

A very comprehensive account of the concept of a measurement system can be found in Mari [1]—in this section, only a brief account is given. The operation of a measuring system (MS), as described by Mari [1], is summarized by Fig. 1. In this figure, the evaluation of the "thing" is accomplished by the following 3-step procedure.

(1) Establish a calibration: create a "standard reference set" (of things) A, by associating them with a set of symbols Θ. Generally these symbols will be the elements of a mathematical structure such as a set of integers or the real numbers along with their usual arithmetic relations: m_I(a) = θ (see Fig. 2).
(2) Carry out data acquisition: through empirical interaction, select an element of A that corresponds to the thing u_p (for person p) using the "thing selection function" v(·), i.e., v(u_p) = a (see Fig. 3).
(3) Make the data presentation (i.e., re-use the calibration): select the symbol to measure by compounding 1 and 2: m(u_p) = m_I(v(u_p)) = θ (see Fig. 4).

Fig. 1. Summary of the operation of a measuring system.
Fig. 2. Establishing a calibration.
Fig. 3. Data acquisition.
Fig. 4. Data presentation.

What makes this a measuring system is that this mapping m is a homomorphism between the empirical relational structure of things (U, R_U) and the symbolic relational structure of symbols (Θ, R_Θ): that is, r_U(u_p) whenever m(r_U)(m(u_p)). In Mari's approach, evaluations are measurements if and only if this latter is the case. In the following we will see how this looks for some measurement approaches that have been used in psychometrics.

3. Guttman Scaling

For all the examples of measurement approaches in this paper, the basic situation under consideration is that we have an instrument composed of a set of items, I = {I_1, . . . , I_I}, for which prior substantive theory indicates that a person's response to those items (indicated by the response vector u_p for person p) is an indicator of the construct to be measured, Θ (i.e., the measurand). Without loss of generality, I will assume that the responses to the items are dichotomous, that is, 0 or 1.

The Likert style of item has been a very common form of item in the social sciences. The most generic form of this is the provision of a stimulus statement (sometimes called a "stem"), and a set of standard options that the respondent must choose among. Possibly the most common set of options is {Strongly Agree, Agree, Disagree, Strongly Disagree}, sometimes with a middle or neutral option. The particular set of options used may be adapted from those given above to match the context. For example, an example of a Likert-style item would be:

I believe in fairies. (Choose one option)
Strongly Agree, Agree, Disagree, Strongly Disagree.

It is relatively easy to come up with many items when all that is needed is a new stem for each one. Although this makes it a very popular approach, there is a certain dissatisfaction with the way that the response options relate to the construct. The problem is that there is very little to guide a respondent in judging what the difference is between the options, say, "Strongly Disagree" and "Agree". Respondents may well have radically different ideas about these distinctions. This problem is greatly aggravated when the options offered are not even words, but a set of numerals or letters,

such as {1, 2, . . . , 8, 9}—in this sort of array, the respondent does not even get a hint as to what it is that s/he is supposed to be making distinctions between!

One alternative that has been developed is to build into each option set meaningful statements that give the respondent some context in which to make the desired distinctions. The aim here is to try to make the relationship between each item and the overall scale interpretable. This approach was formalized by Guttman [11], who created his scalogram approach (also known as Guttman Scaling):

    If a person endorses a more extreme statement, he should endorse all less extreme statements if the statements are to be considered a [Guttman] scale. . . We shall call a set of items of common content a scale if a person with a higher rank than another person is just as high or higher on every item than the other person [11, p. 62].

Thus, for example, suppose we hypothesize four dichotomous attitude (Likert-style) items that do indeed form a Guttman scale. If the order of items 1, 2, 3, 4 is in this case also the scale order, and the responses are "Agree" and "Disagree", then only the following responses are possible under the Guttman scale requirement:

(Agree, Agree, Agree, Agree),
(Agree, Agree, Agree, Disagree),
(Agree, Agree, Disagree, Disagree),
(Agree, Disagree, Disagree, Disagree),
(Disagree, Disagree, Disagree, Disagree).

If all the responses are of this type, then, when they are scored (say, "Agree" = 1, and "Disagree" = 0), there is a one-to-one relationship between the sumscores and the set of item responses. A person with a sumscore of 1 must have agreed with Item 1 and not the rest, and thus can be interpreted as being somewhere between Item 1 and Item 2 in his/her views. Similarly, a person who scored 3 must have agreed with the first three items and disagreed with the last, so can be interpreted as being somewhere between Item 3 and Item 4 in his/her views. Other responses to the item set, such as (Disagree, Disagree, Agree, Disagree), would indicate that the items did not form a perfect Guttman scale.

Four items developed by Guttman using this approach are shown in Fig. 5. These items were used in a study of American soldiers returning from the Second World War—they have more than two categories, which makes them somewhat more complicated to interpret as Guttman items, but nevertheless, they can still be thought of in the same way.

Thus, in Guttman Scaling the essential idea is that, on the basis of substantive theory and practical knowledge about items, one can order a set of items, in terms of expected responses, from "easiest" to "hardest" (or "least positive" to "most positive", etc., depending on the context). Then the ordered set of possible response vectors is assumed to constitute a standard reference set, directly indicating the value of Θ in terms of the rank of the "highest" item for which the response is 1:

    m(u_p) = guttmanscore(u_p),

which is the rank of v(u_p) minus 1 if it is defined, and undefined otherwise. As there is a finite number of items, I, the usual symbol set is the integers 1 to I + 1. The standard reference set can be written as A = {a_0, a_1, . . . , a_I} where

    a_0 = {(0, 0, . . . , 0)}
    a_1 = {(1, 0, . . . , 0)}
    a_2 = {(1, 1, 0, . . . , 0)}
    . . .
    a_I = {(1, 1, . . . , 1, 1)}.

Note that these particular response vectors in A are called "Guttman true scale-types". There is a frustrating incompleteness in this approach, as there are many possible response vectors that are not scale-types, such as (1, 0, 1, 0, . . . , 0), etc. Hence, one aspect of the creation of the instrument I is the selection of items that are suitable for use in a Guttman scale—intuitively, one would seek

Fig. 5. Examples of Guttman's items [11].



items that are increasing in "difficulty" as one goes from the first to the last item, and where the increments in difficulty are as large as possible (although this will become more difficult to achieve if the number of items, I, is large). Guttman recommended the use of a quantitative indicator, the coefficient of reproducibility [11], which is the ratio of the observed number of true scale-type response vectors to the total number of responses, to gauge the suitability of the set of items for use in a Guttman scale, both in a relative sense (i.e., sets with larger coefficients are better) and in an absolute sense (e.g., accept item sets with coefficient values greater than .85).

The problem of what to do with persons with non-scale-types, which can constitute a high proportion of the sample, has led to the low usage of this approach in most areas of application. According to Kofsky [12, pp. 202–203],

    . . . the scalogram model may not be the most accurate picture of development, since it is based on the assumption that an individual can be placed on a continuum at a point that discriminates the exact [emphasis added] skills he has mastered from those he has never been able to perform. ... A better way of describing individual growth sequences might employ probability statements about the likelihood of mastering one task once another has been or is in the process of being mastered.

The next approach can be seen as a step towards dealing with this problem, although historically it much predates Guttman Scaling.

4. Classical Test Theory

In Classical Test Theory (CTT) [13–16], the problem of what to do about non-scale-types is finessed by simply ignoring it. Examples of item-types that are used in measures that use a CTT approach include the Likert-style and Guttman-type items discussed in the previous section, the common multiple-choice items used in educational achievement tests, and open-ended essay-type items that require ratings by judges. Response vectors are given a symbol (usually called a "score") that is equal to what the Guttman score would have been if the 1s and 0s had been ordered according to a scale-type, or equivalently, the sum score of the response vector: m(u_p) = sumscore(u_p) = Σ_i u_pi, where u_pi is the ith response in the vector u_p. The symbol set Θ will then be the integers from 0 to the maximum score I. The standard reference set can be written as A = {a_0, a_1, . . . , a_I} where

    a_0 = {(0, 0, . . . , 0)}
    a_1 = {(1, 0, . . . , 0), (0, 1, . . . , 0), . . . , (0, 0, . . . , 0, 1)}
    a_2 = {(1, 1, 0, . . . , 0), (1, 0, 1, 0, . . . , 0), . . . , (0, 0, . . . , 0, 1, 1)}
    . . .
    a_I = {(1, 1, . . . , 1, 1)}.

Note that this standard set now accounts for every possible response vector under the assumptions. Just as for Guttman Scaling, the item set I is typically a smaller set that results from some item selection based on empirical data. However, unlike Guttman Scaling, the criteria are not based on substantive theory about the interpretation of the items. Instead the standard criteria are based on statistical considerations having to do with various aspects of uncertainty. These are grouped together under the term "item analysis" [17]—typical criteria are:

(a) Reliability of the item set, reliability(I).
(b) Discriminations of the items, discrimination(I_i).
(c) Etc.

Uncertainty in the measure is estimated in terms of the standard error of measurement (sem) for the scores, where

    sem = S √(1 − r),

where S is the standard deviation of the scores, and r is the reliability [17].

Using this concept, each measure should be more accurately expressed as a binary: (θ, sem). Effectively, this moves the symbol set Θ beyond the set of integers in the interval [0, I], as it was above, to encompass the segment of the real number line [0 − d, I + d], where the value of d is dependent on the level of uncertainty one wishes to express. The most common representation of CTT is X_p = T_p + E_p, where X_p is the observed score for person p (or, sumscore(u_p) above), E_p is the error (or sem above), and T_p is the "true score" (i.e., the theoretical average of person p's observed score over a large (infinite) number of observations). Note that, in historical terms, CTT was developed long before Guttman Scaling (as is indicated by the publication dates of the references above). However, the order has been reversed in this account, in order to emphasize the potential importance of the meaning of the standard set.

One alternative to the sum score symbols that is often used is the percentile. This is simply the value of the cumulative distribution function for the sum score in a chosen reference sample of persons, expressed as a percentage. This conceals the differences between instruments that have different numbers of items, and can be used as a basis for the "equating" of such instruments. This approach is often referred to as "norm-referenced testing". For this approach, the symbol set (ignoring the sem issue) is the real numbers between 0 and 100. All of the remaining points above hold, however.

Sometimes a decision is to be made, on the basis of the measures, whether a person is "above" some point (or, equivalently, "below"). This requires the setting of a cut-score. If that is the sole purpose of the instrument, this can be seen as equivalent to establishing a new, coarser, standard reference set A′ = {a′_0, a′_1}, where (assuming the cut-off is k)¹

    a′_0 = {a_0, a_1, . . . , a_k}
    a′_1 = {a_{k+1}, a_{k+2}, . . . , a_I},

where the a_i are as in the definitions above.

¹ Note that in this expression, {a_0, a_1, . . . , a_k} is taken to mean the set of all the response vectors in the sets a_0, a_1, . . . , a_k rather than the set of the k sets a_0, a_1, . . . , a_k. This is true for similar expressions below.

The cut-score will usually be set using a procedure that invokes substantive knowledge from among professionals in areas related

to the construct and the typical applications. Clearly, this same procedure can be repeated to have multiple cut-scores.

5. Item Response Theory

In Item Response Theory [18] the CTT approach is amended and extended to (a) formalize the relationship between the person and the item (i.e., rather than the instrument as a whole) using a mathematical model, and (b) adopt a metric (most commonly the log of the odds) that frees the scale from a dependence on the (largely) incidental aspect of the number of items. There are several different mathematical models in common use—I will illustrate only the simplest, the Rasch model. For this model, which assumes dichotomous data, represented as 0 and 1, the mathematical relationship is given by:

    Pr(u_pi = 1 | θ_p, δ_i) = π_pi1 = exp(θ_p − δ_i)/Ψ_pi,

where θ_p is the person symbol (often called the "location", or the "ability", depending on the context), δ_i is an item parameter (often termed the "difficulty"), and Ψ_pi is a norming value equal to [1 + exp(θ_p − δ_i)]. The connection to the log-odds is evident as:

    log(π_pi1/π_pi0) = θ_p − δ_i

(where π_pi0 has an obvious definition). The probability of a response vector is given by applying the local independence assumption:

    Pr(u_p | θ_p, δ) = ∏_{i=1}^{I} Pr(u_pi | θ_p, δ_i),

where δ is the vector of item parameters. This special case is commonly termed Rasch Scaling.

In psychometric modeling, many other functions (termed "item response models") besides the simple logistic function are used. Other models that are commonly used are the two-parameter (2PL) and the three-parameter (3PL) logistic models [18]: in particular, the 2PL adds a so-called discrimination parameter for each item, a_i, and the 3PL adds a pseudo-guessing parameter, c_i, for each item. There are also versions of these models that are based on the cumulative distribution function of the Gaussian probability distribution.

In the special case of the Rasch model, just as for CTT, a symbol is found for each possible response vector, and, due to the fact that the sumscores are sufficient statistics for the person location [19,20], the initial standard reference set is the same as for CTT (see above). However, unlike the case for CTT, the symbols are not automatically assigned via a simple explicit function based on the sumscores, but must be statistically estimated [18], and may take values anywhere on the real number line (−∞, +∞)²—we denote the estimated value of θ_p, for person p with response vector u_p, as θ̂(u_p).

² The minimum and maximum cases, (0, 0, . . . , 0) and (1, 1, . . . , 1), receive the symbols −∞ and +∞, respectively. (In practical terms, these symbols are not very useful, and practitioners typically either refrain from giving persons with those response vectors symbols, or they assign finite values to them, following certain procedures.)

Just as for Guttman Scaling and CTT, the item set I is typically a smaller set that results from some item selection based on empirical data. As for CTT, the criteria are based on statistical considerations having to do with various aspects of uncertainty, such as uncertainty that the items are a satisfactory source of data for the measurand, and uncertainty whether the logistic function specified above is a good choice. These are grouped together under the term "item fit analysis" [18–20], and generally capitalize on the probability of response vectors to accumulate the likelihood of observing the responses to a given item, given its difficulty and the person parameters for the observed persons (more traditional CTT item analysis procedures are also commonly used). Other considerations are also involved, such as the match between the observed and expected distributions of persons on the logit scale, and the match of item locations to that distribution³.

³ Note that the graphical device showing both persons and items on the same scale is sometimes referred to as a "Wright map."

Uncertainty in the measure is expressed in terms of the standard error for the person estimates, which is found as a by-product of the estimation procedures, and differs from the standard error for CTT in that it is conditional on the location itself: s(θ). As for CTT, each person's measure should be more accurately expressed as a binary: (θ, s(θ)).

Note that one way to see the standard reference set for Guttman Scaling is that it is the standard set for CTT where we have accepted some response vectors as being possible (i.e., the Guttman true-types), and others as not (i.e., the non-Guttman types). Put another way, the Guttman true-types could be seen as having a non-zero probability, and the rest as having a zero probability. In a sense that is mid-way between CTT and Guttman Scaling, the symbols in the Rasch Scaling standard reference set are not accorded equal standing—each response vector has a probability of being observed (conditional on the estimated parameters δ). Thus, some response vectors will have smaller probabilities than others—this would be the case, for example, for the response vector (0, 0, . . . , 0, 1) in the case where the items are ordered in terms of item difficulty. This fact has been used as the basis for developing several "person fit statistics" [18,21], which are then used to decide that some persons with low-probability response vectors should not be assigned symbols (estimates). This results in a reduced standard set B, although the number of symbols (estimates) remains the same. Thus the standard reference set can be written as B = {b_0, b_1, . . . , b_I} where

    b_0 ⊆ a_0 = {(0, 0, . . . , 0)}
    b_1 ⊆ a_1 = {(1, 0, . . . , 0), (0, 1, . . . , 0), . . . , (0, 0, . . . , 0, 1)}
    b_2 ⊆ a_2 = {(1, 1, 0, . . . , 0), (1, 0, 1, 0, . . . , 0), . . . , (0, 0, . . . , 0, 1, 1)}
    . . .
    b_I ⊆ a_I = {(1, 1, . . . , 1, 1)}

(i.e., each b_i is a subset of the respective a_i in CTT), and the rules for reducing the sets a_0 through a_I are determined by the specific person fit procedures chosen.

In the 2PL and 3PL, the standard reference set will (in non-trivial cases) be much larger than that shown above,

as, in general, every different response vector u will have a different estimate θ̂(u) (this can be seen as a consequence of the property of the 2PL that the sufficient statistic for θ̂(u_p) is given by a weighted sumscore, Σ_i a_i u_pi), and this implies that the 3PL also has the same larger symbol set (although with different estimates than for the 2PL). A particular consequence of this is that the set of estimates for response vectors that have the same sumscore will not necessarily be contiguous on the θ scale.

Some special characteristics of Rasch Scaling: due to (a) its simplicity, and (b) its unique properties, such as "specific objectivity" [19], which confer particular strengths on item sets that are found to be amenable to Rasch modeling (i.e., item sets that "fit" the model), the Rasch model can be used to bring the item and person parameters onto the same metric, allowing a wide range of possibilities for the development of analogical and figurative aids to interpretation (e.g., see [20]). With these advantages, Rasch Scaling represents an interesting, and potentially powerful, compromise between the strictness of the Guttman Scale adherence to the pre-eminence of the substantive theory (via the item ordering implied by the substantive interpretation), and CTT's flexibility in accepting all response vectors. Thus, Rasch Scaling has been seen as a way to reconcile the perspectives of CTT and Guttman Scaling [22].

6. Construct Modeling

Construct Modeling [23] builds upon the reconciliation of CTT and Guttman Scaling, as represented by Rasch Scaling. It takes as its technical side the groundwork of Rasch Scaling, and moves one step further along the path towards adhering to the substantive theory. In this case, it is assumed that the substantive theory takes a particularly simple form: the construct consists of a simple linear succession of discrete segments of a continuum, from a lowest level to a highest level, and when these are laid out in a figure, it is termed a "construct map"—see Fig. 6. A concrete example of this is shown in Fig. 7, which was developed in the context of a test of middle-school students' knowledge about the practice of designing and critiquing displays of data, with the lowest level at the bottom of the figure. In this case, the Rasch Scaling described above serves as a starting place for establishing the standard set. All of the steps described above for Rasch Scaling are followed. Simultaneously with that, a second set of steps is followed that takes into account the construct map, and additional substantive information concerning each item, to wit, a substantive link from each item to a (single) level of the construct map.

When deciding on the item set, an additional criterion that is used is this link from each item in the set to the construct map—for each item it needs to be judged whether the estimated item location is well-matched to the hypothesized order in the construct map. An example of an item used to assess the Data Display construct is shown in Fig. 8, and Fig. 9 shows sample student responses that have been matched to certain levels of the DaD construct map. Of course, accumulated empirical information that it does not match could lead to a revision of the construct map and/or the hypothesized link, as well as modification/deletion of the item.

Once an item set is established, "banding" or "standard setting" takes place—this is the equivalent of Mari's calibration: the placement of the values into segments of the logit scale (the "Wright Map"). The upper and lower limits of these bands are determined by the locations of the lower and upper limits of the locations of the items linked to each level of the construct map. Most often, these item locations do not result immediately in a "clean" segmentation of the logit scale—hence a judgmental process [24] is required to determine reasonable locations for the band edges⁴. This process may result in further decisions regarding the suitability of certain items, and may also involve information about the persons, when that is available. Denote these limits by H = (H_1, H_2, . . . , H_K), the K boundary values between the K + 1 segments. Then we see that:

    Band 1 is (−∞, H_1),
    Band 2 is (H_1, H_2),
    . . .
    Band k is (H_{k−1}, H_k),
    . . .
    Band K + 1 is (H_K, +∞).

⁴ The label for this is different depending on when the set B is developed: if it is developed before the scaling (although it may be adapted after), then the process is termed "construct modeling" [23]; if it is developed after the scaling, then the resulting scale is termed a "described variable" [25].

Or, equivalently, the standard reference set is C, composed of the sets of response vectors c_k, k = 1, . . . , K, and

    c_k = {b_i : H_{k−1} < θ̂(u) < H_k, u ∈ b_i},

where the b_i are the same as those constructed for Rasch Scaling. These bands can be used as a basis for criterion-referenced interpretations of the measurements, in that each band may be associated with the items whose difficulties fall within its range⁵. This provides ways to enhance and deepen the interpretations available to those who apply the measurements. At the same time, the existence of the underlying Rasch scale means that (a) technical aspects of the measures are available, such as standard errors, and (b) technical advantages of item response scales are still available, such as flexibility in item choice, the ability to link forms through items, and the possibility of computerized adaptive item administration. This is one reason to choose the Construct Modeling approach over the alternative of latent class modeling [26], which might be seen as potentially appropriate, given what is shown in Fig. 5.

⁵ Depending on the interpretive context, the specification of the difficulties of the items may vary. For example, whereas the standard is to set it at the point where the probability of endorsement is .5, for some contexts it may be more readily interpretable if the probability is set at, say, .8, for "mastery" in achievement testing.

7. Discussion and conclusion

Mari [1, p. 80] characterizes measurement as needing to attain both objectivity and intersubjectivity. In his words:

Fig. 6. A graphical representation of a construct map with K levels.

Fig. 7. The levels of the Data Display construct map.

Fig. 8. An example item related to the DaD construct: ‘‘Shimmering Candle’’.



Fig. 9. Examples of responses to the DaD construct levels for the Shimmering Candles item.
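The banding rule from Section 6, in which a person estimate on the logit scale is assigned to one of the K + 1 bands separated by the boundary values H_1, . . . , H_K, can be sketched in code as follows. The boundary values in the usage lines are hypothetical, not taken from the DaD study.

```python
from bisect import bisect_left

def band(theta_hat, boundaries):
    """Return the band number (1..K+1) for a person estimate theta_hat,
    given K ordered boundary values [H1, ..., HK]. Here Band 1 covers
    everything up to and including H1, and Band K+1 everything above HK."""
    return bisect_left(boundaries, theta_hat) + 1

# Hypothetical band edges on the logit scale for a 4-band construct map:
H = [-1.0, 0.5, 2.0]
print(band(-2.0, H))  # 1
print(band(0.0, H))   # 2
print(band(3.0, H))   # 4
```

Pairing each band number with the construct-map level it represents is what turns the Rasch estimate into a criterion-referenced interpretation.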

- objectivity implies that the MS is able to discriminate the measurand from the various influence quantities, so that the acquisition component of the MS is sensitive only to the measurand;

- intersubjectivity implies that the MS is able to refer the measurand to the primary standard, so that all the measurements expressed in terms of that standard are comparable with each other.

From the point of view adopted here, we need to specify how these properties would appear in the context of Construct Modeling.

First, consider objectivity. If we think of "the various influence quantities" as being embodied by the possibility of using different sets of items as the instrument, then objectivity amounts to independence from the specific set of items used. This is indeed what Rasch's "specific objectivity" is concerned with (i.e., if the set of items fits the Rasch model, then it does not matter which items are used to measure the person). Hence, when using a Rasch scale as the basis for Construct Mapping (and where the items do indeed fit a Rasch model), it would remain to check whether the banding is relatively robust to the choice of items as representatives.

Second, consider intersubjectivity. Unlike the case for CTT, Construct Modeling inherits the Rasch Scaling (and Guttman Scaling) characteristic of disallowing some response vectors. However, these are not represented in the standard reference set, hence this is not a formal problem. In practical terms, it amounts to a situation where the measurement system does not give some people measures. The best procedure in this case is to seek a re-administration of the measurement process for those individuals, perhaps gathering extra information to check on the conditions under which these misfit response vectors tend to be found.

In the text above, it was noted that there are many item response models available beyond the Rasch model. Where these are being used, some, but not all, of the development above in the Construct Modeling section can be carried over. In particular, banding is not readily possible, as the concept of the "location" of an item on the logit scale does not have a straightforward interpretation. Also, the specific objectivity possible under the Rasch model is not attainable for other models [18]. Thus, for other item response model approaches, the development here seems difficult.

The generic type of construct (measurand) that is used to motivate the development of Construct Modeling (i.e., a simple linear succession of discrete segments of a continuum) may seem quite restrictive at first glance. However, most published measures in the social sciences are in fact of just this type, or simpler (i.e., they have no segments, just a continuum). That said, where more complex constructs are under consideration, many of them represent quite simple extensions of the generic construct discussed above. For example, where there are multiple linear continua (i.e., a "multi-dimensional" construct), the scaling can be accomplished using multi-dimensional
versions of the Rasch model [27,28], and each dimension can then be treated as a separate case for banding. Where there are polytomous items and/or multiple substantive categories within a particular polytomous score [29], the banding procedure can be generalized to deal with the situation [24]. Where the latent structure is posited to be an ordered latent class rather than a latent continuum, the methods described above can be applied, with the proviso that one should check for the most appropriate model using fit procedures [30]. Where a more complex construct is under consideration, such as a "learning progression" [31] (which posits level-based links among different dimensions), there are also methods analogous to those described above, although these are still under development [32]. Of course, there are more complex constructs yet, but the list above contains a very large proportion of the extant types.

This paper has built on an earlier conference paper [33] to explore how to use the functional concept of a Measuring System⁶ to explicate the logic of several measurement approaches used in psychometrics, and thus enable a comparison with measurement approaches used by other fields such as engineering and physics. It surveyed Guttman Scaling, Classical Test Theory, Rasch Scaling and Construct Modeling, as examples of measurement approaches in the area of psychometrics, explicated the underlying standard reference set that is one of the essential features of Mari's formalization [1], and showed how these differ among the four approaches. The importance of these differences, and the consequences for measurement using those approaches, hinge on the capacity to identify theoretically tractable substantive properties capable of supporting both objectivity and intersubjectivity. Connecting psychometric approaches to measurement with Mari's formalization of the functional concept of a measuring system opens up new opportunities for productive dialogue between the natural and social sciences.

Acknowledgements

I would like to thank Luca Mari and Roman Morawski for organizing the symposium in which some of these ideas were originally presented, and for their encouragement of this work.

6 This paper has not sought to expand the viewpoint adopted here to include the new insights from Mari's later work [34]. This remains an important step to be carried out, but is beyond the scope of the current paper.

References

[1] L. Mari, Beyond the representational viewpoint: a new formalization of measurement, Measurement 27 (2000) 71–84.
[2] D.H. Krantz, R.D. Luce, P. Suppes, A. Tversky, Foundations of Measurement. Volume 1: Additive and Polynomial Representations, Academic Press, New York, 1971.
[3] A. Thompson, N. Taylor, Hamlet, Arden, London, 2006.
[4] R. Perline, B. Wright, H. Wainer, The Rasch model as additive conjoint measurement, Applied Psychological Measurement 3 (1979) 237–256.
[5] G. Karabatsos, The Rasch model, additive conjoint measurement, and new models of probabilistic measurement theory, Journal of Applied Measurement 2 (4) (2001) 389–423.
[6] J. Michell, An Introduction to the Logic of Psychological Measurement, Lawrence Erlbaum Associates, Mahwah, New Jersey, 1990.
[7] D. Borsboom, G. Mellenbergh, J. van Heerden, The theoretical status of latent variables, Psychological Review 110 (2) (2003) 203–219.
[8] R. Lissitz (Ed.), The Concept of Validity: Revisions, New Directions, and Applications, Information Age Publishing, Charlotte, North Carolina, 2009.
[9] S.S. Stevens, On the theory of scales of measurement, Science 103 (1946) 677–680.
[10] K. Sijtsma, Introduction to the measurement of psychological attributes, Measurement 44 (2011) 1209–1219.
[11] L.A. Guttman, The basis for scalogram analysis, in: S.A. Stouffer, L.A. Guttman, F.A. Suchman, P.F. Lazarsfeld, S.A. Star, J.A. Clausen (Eds.), Studies in Social Psychology in World War Two. Vol. 4: Measurement and Prediction, Princeton University Press, Princeton, 1950, pp. 60–90.
[12] E. Kofsky, A scalogram study of classificatory development, Child Development 37 (1966) 191–204.
[13] F. Edgeworth, The statistics of examinations, Journal of the Royal Statistical Society 51 (1888) 599–635.
[14] F. Edgeworth, Correlated averages, Philosophical Magazine, 5th Series 34 (1892) 190–204.
[15] C. Spearman, The proof and measurement of association between two things, American Journal of Psychology 15 (1904) 72–101.
[16] C. Spearman, Demonstration of formulae for true measurement of correlation, American Journal of Psychology 18 (1907) 161–169.
[17] J. Nunnally, I. Bernstein, Psychometric Theory, 3rd ed., McGraw-Hill, Columbus, Ohio, 1994.
[18] W. van der Linden, R. Hambleton (Eds.), Handbook of Modern Item Response Theory, Springer, New York, 1997.
[19] G. Rasch, Probabilistic Models for Some Intelligence and Attainment Tests, University of Chicago Press, Chicago, 1960/1980.
[20] B. Wright, M. Stone, Best Test Design, MESA Press, Chicago, 1979.
[21] R. Meijer, K. Sijtsma, Methodology review: evaluating person fit, Applied Psychological Measurement 25 (2001) 107–135.
[22] B. Wright, G. Masters, Rating Scale Analysis, MESA Press, Chicago, 1981.
[23] M. Wilson, Constructing Measures: An Item Response Modeling Approach, Erlbaum, Mahwah, New Jersey, 2005.
[24] M. Wilson, K. Draney, A technique for setting standards and maintaining them over time, in: S. Nishisato, Y. Baba, H. Bozdogan, K. Kanefugi (Eds.), Measurement and Multivariate Analysis, Springer-Verlag, Tokyo, 2002, pp. 325–332.
[25] Organisation for Economic Co-operation and Development, PISA 2000 Technical Report, OECD, Paris, 2002.
[26] M. Croon, Latent class analysis with ordered latent classes, British Journal of Mathematical and Statistical Psychology 43 (2) (1990) 171–192.
[27] R. Adams, M. Wilson, W. Wang, The multidimensional random coefficients multinomial logit model, Applied Psychological Measurement 21 (1) (1997) 1–23.
[28] M. Wu, R. Adams, M. Wilson, S. Haldane, ACER ConQuest 2.0 [computer program], ACER, Hawthorn, Australia, 2008.
[29] M. Wilson, R. Adams, Marginal maximum likelihood estimation for the ordered partition model, Journal of Educational Statistics 18 (1) (1993) 69–90.
[30] D. Torres Irribarra, R. Diakow, Model selection for tenable assessment: selecting a latent variable model by testing the assumed latent structure, paper presented at the 17th International Meeting of the Psychometric Society, Hong Kong SAR, 2011.
[31] A. Alonzo, A. Gotwals (Eds.), Learning Progressions in Science, Sense Publishers, Rotterdam, The Netherlands, 2012.
[32] M. Wilson, Responding to a challenge that learning progressions pose to measurement practice: hypothesized links between dimensions of the outcome progression, in: A.C. Alonzo, A.W. Gotwals (Eds.), Learning Progressions in Science, Sense Publishers, Rotterdam, The Netherlands, 2012.
[33] M. Wilson, The role of mathematical models in measurement: a perspective from psychometrics, paper presented at the Joint International IMEKO TC1+TC7+TC13 Symposium "Intelligent Quality Measurement – Theory, Education and Training", August 31–September 2, Jena, Germany, 2011.
[34] A. Frigerio, A. Giordani, L. Mari, Outline of a general model of measurement, Synthese 175 (2010) 123–149.
