
ANTHROPAC 4

Methods Guide

Stephen P. Borgatti

ANALYTIC TECHNOLOGIES
How To Cite (two examples)

The software:

Borgatti. 1996. ANTHROPAC 4.0. Natick, MA: Analytic Technologies.

This manual:

Borgatti. 1996. ANTHROPAC 4.0 Methods Guide. Natick, MA: Analytic


Technologies.

Copyright © 1996 by Analytic Technologies. All rights reserved.


TABLE OF CONTENTS

1. Defining a Cultural Domain
   1.1 Freelisting
2. Collecting Proximities
   2.1 Direct Ratings
   2.2 Single Pilesorts
   2.3 Multiple Pilesorts
   2.4 Successive Pilesorts
   2.5 Triads
3. Deriving Proximities
   3.1 Levels of Measurement
   3.2 Measures of Similarity and Distance
4. Analyzing Proximities
   4.1 Hierarchical Clustering
   4.2 Multidimensional Scaling
   4.3 PROFIT
5. Consensus Analysis

CHAPTER 1
Defining a Cultural Domain

Practically speaking, to define a cultural or cognitive domain is to make a list of its elements.
For example, to define the domain of fruits is to generate a list of things that people in a given
culture would consider a fruit. Often, the definition of a domain is determined directly by the
research question. For example, if we are interested in perceptions of the character of the
(English) days of the week, the domain is clear: it is the set {Monday, Tuesday, Wednesday,
Thursday, Friday, Saturday, Sunday}. Other times, however, we have a general idea of the
domain, but do not know exactly which items belong to it. To determine this, we use the Free
List technique.

1.1 Freelisting

The technique basically consists of asking a small set of respondents (say 30) to name (or,
ideally, write down) all items matching a given description. For example, if you are interested
in the domain of "bad" words, you might use the following instructions:

Please write down as many bad words as you can think of. Don't be embarrassed to
write down even the really bad ones. Thank you.

Once the data have been collected, a number of analyses are possible. Often, the purpose of the
free listing task is to obtain a set of terms to be used in additional data collection tasks, such as
pilesorts and ratings or rankings. If this is the case, the next step is to count up the number of
times that each term is mentioned, and sort the list in order of decreasing frequency. A typical
domain will have a core set of items that are mentioned by many respondents, plus a large
number of items that are mentioned by few or just one person. Presumably, the core set of
items reflect the existence of a shared cultural norm regarding what is a bad word, while the
additional items represent the idiosyncratic views of individuals. If so, we should ideally expect
a noticeable drop-off in the frequency of mention of non-core items. In practice, this "elbow"
in the frequency distribution may be difficult to spot.

When no elbow suggests itself you must find another means of deciding which items to include
in further work. Obviously, one approach is to take them all. This solution is usually not
feasible, however, on purely practical grounds: some domains can elicit hundreds and even
thousands of distinct terms. Another approach is to adopt the rule that only items mentioned by
at least two people will be kept, on the grounds that agreement between two individuals is the
absolute minimum requirement for viewing an item as more than idiosyncratic. Alternatively,
one can pick the top n items, where n is an arbitrary but convenient number, given the nature
of subsequent research.



No matter what rule is adopted, it is worth asking oneself why some respondents did not
mention some of the items selected by the rule. If it is that they would have mentioned a given
item had they remembered it, then there is no problem. You could and probably should test for
that by presenting the final list to a new sample and asking, for each item, 'is this a bad word?'.
You could then use consensus analysis to analyze the resulting data. On the other hand, the
variation in frequencies may be due to individual differences in opinion. This could be
problematic, depending on what you will be using the list for. After all, if after settling on a list
of bad words you go on to do further experiments in which it is assumed that all the words are
clearly "bad", some experimental subjects may become confused if the list contains words that
they do not regard as bad1. A third possibility is that the variation reflects the degree of badness
of the word. Rather than there being two classes of words, bad and not bad, it may be simply
that all words have a degree of badness, and there is no cultural category of "bad words". This
might also be a problem if your subsequent research implicitly assumes the existence of a
bounded category.

Two other problems that can crop up are the existence of synonyms and the inclusion of terms
at varying levels of contrast. Unidentified synonyms can cause problems if, in a later phase of
research, you ask respondents to react to relationships among the terms. For example, it would
be embarrassing to unwittingly ask "Do you spend more time with your brothers and sisters, or
your siblings?". Part-whole relationships cause similar problems. For example, if you freelist
the domain of places to live, you may find yourself later asking "Do you like it better here in
Connecticut, or the States?"2. Part-whole and related problems are particularly common when
you freelist reasons for things. For example, asking "What are some reasons why people go to
the doctor?" can get all of the following (in the same list!): "cancer", "flu", "fatigue", "illness",
"disease", "they have insurance", "only doctors can dispense drugs". Some of these items
include others, others are synonyms, and still others belong to very different domains reflecting
very different ways of answering the question.

The remedy in all these cases -- if any is needed -- is to go back to your informants (the same
or different sample), and ask about each pair of items: 'is this a kind of that?', 'does this mean
the same as that?'.

In some studies, the free list is not the first step toward a larger data collection process. Rather,
it is an end in itself. This usually occurs when the structure of the freelist itself is the object of
study. For example, it has been suggested that the order of items in a given list reflects
saliency. Romney and d'Andrade (1964) have shown that when you ask Americans to freelist
kin terms, 97% of the lists give "mother" as the first item. One interpretation of saliency might
be prototypicality. If you ask consumers to freelist mustards, the most common first mention is
"French's yellow mustard", which perhaps typifies the category.

1 I have found that among US college students, about 1% of respondents will list words like "poverty", "war",
and "no" as bad words. If these words are left in the list, they can be very confusing to other respondents.

2 I grew up in Latin America. When I used to come visit relatives in the States, my uncle in Connecticut liked
to confuse me by asking that question.



An index of item saliency is recommended by Jerry Smith (1993). It is essentially a weighted
average of the (inverse) rank of an item across multiple freelists, where each list is weighted by
the number of items in the list.
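The index is easy to compute outside ANTHROPAC as well. The sketch below (plain Python, not an ANTHROPAC routine) implements one common formulation of a freelist salience index; whether it matches Smith's weighting in every detail is an assumption on my part, so consult Smith (1993) for the definitive formula. All names in the code are illustrative.

```python
# Illustrative sketch of a freelist salience index (assumed formulation;
# see Smith 1993 for the definitive version).  All names are made up.

def salience(freelists):
    """freelists: one ordered list of mentioned items per respondent."""
    items = {item for fl in freelists for item in fl}
    scores = {}
    for item in items:
        total = 0.0
        for fl in freelists:
            if item in fl:
                rank = fl.index(item) + 1                  # 1 = first mentioned
                total += (len(fl) - rank + 1) / len(fl)    # inverse rank, weighted by list length
        scores[item] = total / len(freelists)              # average over all respondents
    return scores

lists = [["mother", "father", "sister", "uncle"],
         ["mother", "brother", "father"],
         ["father", "mother", "aunt"]]
print(sorted(salience(lists).items(), key=lambda kv: -kv[1]))
```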

It has also been noticed that freelists contain runs of items that are strongly related. For
example, a freelist of animal terms among Americans might have all domestic animals, like
dog and cat, together, along with all farm animals, like pig, chicken and cow, followed by
forest animals, like bear, wolf, and deer, followed by zoo animals, etc. Thus, the order of items
in a freelist can give glimpses of the underlying cognitive structure of the domain.

References

Romney, A.K. and R. d'Andrade
1964 "Cognitive aspects of English kin terms in transcultural studies in cognition."
     American Anthropologist 66(3):146-170.

Gatewood, J.B.
1983 "Loose talk: Linguistic competence and recognition ability." American Anthropologist
     85:378-86.

Smith, J.J.
1993 "Using ANTHROPAC 3.5 and a spreadsheet to compute a freelist salience index."
     Cultural Anthropology Methodology Newsletter 5(3):1-3.



CHAPTER 2
Collecting Proximities

Proximities are measurements of the similarities or dissimilarities among a set of items.


Proximity data are stored in square item-by-item matrices in which the i,jth cell records the
degree of similarity or dissimilarity between the ith and jth items. Proximity data may be
observed directly (as in measuring miles between cities), derived analytically (as in computing
distances between cities based on their map coordinates), or collected from respondents. To
collect proximity data is to ask respondents to tell us which items in a cultural domain are more
similar to each other than to others. The objective is to map the structure of the domain. For
example, a researcher might be interested in studying the way members of a given culture
understand the domain of fruits and vegetables.

Sometimes, the interest is in determining whether there are subgroups within the domain (such
as tubers or vines), and in which subgroups a given set of items fall. This is often the case in
marketing research, where one is interested in knowing whether there exist competitive niches
within a larger product category. Brands that occupy the same niche compete more heavily
with each other than with other brands. Before introducing a new brand, marketers often try to
determine what niche it is likely to fall into, so as to determine who the key competitors are.

Other times, the interest is in uncovering the dimensions that people unconsciously use to
classify the world. For example, we might find that people tend to group fruits and vegetables
by taste (these are sweet, these are sour, those are pasty) and nutritional value (these give you
energy, these are good for your eyes, etc). This suggests a functional orientation that classifies
items by their significance to humans. Alternatively, we might find that informants group these
items by shape, color or other morphological characteristics. A simple hypothesis we might
advance is that domains which are used to satisfy basic human needs are more likely to be
sorted according to function than form.

In either case, the first step is to collect perceived similarities among the items. There are
several techniques for doing this, including pilesorts, triads, and direct ratings. In this chapter,
we review each of these methods.

2.1 Direct Ratings

Conceptually, the simplest way to collect proximity data is to ask respondents directly to rate
how similar each pair of items in a domain are. If the number of items in the domain is small
enough, you can ask respondents to fill in a grid, as shown in Figure 2-1.
Track Mark Relic Scar Wake Trail Scent Spoor
Track XXXX XXXX XXXX XXXX XXXX XXXX XXXX XXXX
Mark XXXX XXXX XXXX XXXX XXXX XXXX XXXX
Relic XXXX XXXX XXXX XXXX XXXX XXXX
Scar XXXX XXXX XXXX XXXX XXXX
Wake XXXX XXXX XXXX XXXX
Trail XXXX XXXX XXXX
Scent XXXX XXXX
Spoor XXXX

Since the similarity between, say, "mark" and "spoor" is the same as the similarity between
"spoor" and "mark", the top half of the matrix is blocked out. Also, the respondent is not asked
to rate the similarity of an item to itself.

One question that arises is what kind of ratings the respondent should be asked to supply. One
approach is to rate pairs of items on a k-point scale. Typical choices might be 5, 7, 10 or 100.
The simplest scale is k = 2, in which respondents are asked simply whether the items are
similar or not. The larger the value of k, the finer the discrimination among pairs. However,
experiments suggest that as k increases beyond 15, reliability begins to decline. If the data are
going to be averaged across respondents (as they usually are), the discrimination issue is less
important, and so you are better off choosing a very low k. If you are tempted to use k = 2, you
should consider using pilesorts instead (see sections 2.2-2.4), which accomplish the same task
in a friendlier way.

Other approaches include direct magnitude scaling, rank ordering, and so on. Rank-ordering is
generally not recommended because there are too many pairs to work with.

If the number of items is too great for a grid (which can be intimidating), you can present each
column of the grid separately, as shown in Figure 2-2. Each stimulus item ("scar" in the
example) is given its own page. Since every item gets its own page with all the other items on
it, the similarity of any given pair is actually asked twice. For example, Figure 2-2 asks about
the similarity between "scar" and "relic". But "relic" also gets its own page (not shown) which
contains "scar". This can be used as a reliability check. However, it can also induce anxiety in
respondents who notice the redundancy and want to appear consistent.
On the 5-point scale shown above, how similar in meaning is "SCAR" to each of these
words?

Track _______
Mark _______
Wake _______
Trail _______
Scar __5____
Relic _______
Scent _______
Spoor _______

Another approach is to present pairs of items one at a time for the respondent to rate, as shown
in Figure 2-3. From the point of view of entering data into the computer, this method is no
different from the grid. In fact, it is the grid wordprocessed differently. However, it may be
less intimidating to respondents, allowing them to focus on each pair individually.

On the 5-point scale shown above, how similar in meaning is each pair of words below?
Track Mark _______
Track Wake _______
Track Trail _______
Track Scar _______
Track Relic _______
Track Scent _______
Track Spoor _______
Mark Wake _______
Mark Trail _______
Mark Scar _______
Mark Relic _______
Mark Scent _______
Mark Spoor _______
Wake Trail _______
Wake Scar _______
Wake Relic _______
Wake Scent _______
Wake Spoor _______
Trail Scar _______
Trail Relic _______
Trail Scent _______
Trail Spoor _______
Scar Relic _______
Scar Scent _______
Scar Spoor _______
Relic Scent _______
Relic Spoor _______
Scent Spoor _______
All of these approaches require literate, numerate respondents. They also forbid large domains:
no more than 30 or 40 items can be accommodated using these methods. In the next section,
we consider pilesorts, which are capable of handling much larger domains.

2.2 Single Pilesorts

The basic pilesort technique asks informants to "sort these [items] into piles according to how
similar they are." Normally, what the informants actually sort is small cards with the name of
an item printed on each one. In many cases, however, it is feasible to sort the items themselves,
such as fish hooks, dried insects, cans of tuna, etc. Alternatively, photographs or short
descriptions may be sorted. It is important to realize that each of these alternatives can yield
quite different results, depending on the domain and the level of familiarity of the informant
with the items. For example, sorting photographs of fish might bias the pilesorts towards a
morphological classification, especially if the informant does not recognize the fish
(Boster and Johnson, 1991). On the other hand, sorting names of fish may tap additional
characteristics, such as the way it reacts to being hooked, its flavor when eaten, etc. If you are
after shared cultural beliefs, I recommend keeping the stimuli as abstract as possible. This may
allow the informant to call up culturally-transmitted information about, say, elm trees, rather
than base judgements on the physical characteristics of the tree before them.
Typically, respondents are informed that "it's ok to put a strange item in a pile by itself, and
you can make as many or as few piles as you like, as long as you don't put everything into one
pile or every item into a pile by itself". This instruction is normally given in order to let the
informant express themselves as freely as possible with a minimum of investigator
interference. However, one must not fall into the trap of believing that respondents carry a set
of piles in their head, which they reproduce on demand. For many domains, it is likely that
informants perceive items as having similarities and differences along a multitude of
dimensions, and any single sort represents just one compromise solution among several
possibilities. Further, these dimensions of similarity need not be categorical: they may be
continua with no natural gaps that would make a division into piles a simple and reliable
task.

It is best to think of pilesorts as a rating, on a dichotomous scale of "similar" and "not similar",
of the similarity of pairs of objects. If two items are placed in the same pile together, that
constitutes a rating of that pair as "similar" (i.e., similarity score of "1"). If two items are
placed in separate piles, that constitutes a rating of that pair as "not similar" (similarity = "0").
Thus, a pilesort of a set of 5 items is like (but perhaps cognitively easier than) asking the
respondent to answer the paired comparison questionnaire shown in Figure 2-4.
For each of the following pairs of fruit, please indicate whether they are similar
or different. Of course, every pair of fruit have some points of similarity and
some points of difference. What I want you to do is judge whether, on balance, the
pair are more alike than they are different.
Similar?
APPLE PAPAYA ________
APPLE WATERMELON ________
ORANGE APPLE ________
PEAR APPLE ________
PAPAYA WATERMELON ________
ORANGE PAPAYA ________
PAPAYA PEAR ________
WATERMELON ORANGE ________
PEAR WATERMELON ________
ORANGE PEAR ________

Note that in the figure, all 5*(5-1)/2 = 10 pairs of fruit are presented for judgement. It is
important to realize that the paired comparisons approach is in principle equivalent to (but
more desirable than) asking the respondent to fill out the bottom half of a fruit-by-fruit matrix
(Figure 2-5) with ones and zeros, where a "1" means the pair are similar and a "0" means they
are not. In practice, however, there can be a difference between the pilesort and the ratings. In
a rating, a respondent may find herself claiming that item A is similar to item B, and item B is
similar to item C, but item A is not similar to item C. That is, s(A,B) = 1, s(B,C) = 1, but s(A,C) = 0. But
note that in the standard pilesort situation, if item A is in the same pile as B, and item B is in
the same pile as C, then obviously item A must be in the same pile as C. Hence, the physical
task of pilesort enforces a transitivity that is not required by the rating task and that is not
necessarily in the respondent's head. The results of a pilesort satisfy the mathematical
requirements of an equivalence relation, which is any binary relation R such that for all
elements a, b and c, it is true that aRa, aRb implies bRa, and aRb plus bRc implies aRc.

APPLE PAPAYA WATERMELON ORANGE PEAR


┌───────┬───────┬──────────┬──────────┬──────┐
APPLE │ │ │ │ │ │
├───────┼───────┼──────────┼──────────┼──────┤
PAPAYA │ │ │ │ │ │
├───────┼───────┼──────────┼──────────┼──────┤
WATERMELON │ │ │ │ │ │
├───────┼───────┼──────────┼──────────┼──────┤
ORANGE │ │ │ │ │ │
├───────┼───────┼──────────┼──────────┼──────┤
PEAR │ │ │ │ │ │
└───────┴───────┴──────────┴──────────┴──────┘

This discrepancy comes about because, in standard usage, a respondent may place an item in
one and only one pile. In practice, this is guaranteed by the fact that a single card or object
physically represents the object, and it cannot be in two places at once. However, it is often the
case that a respondent feels that an item properly belongs in more than one pile. In such cases,
there is no reason why the researcher couldn't make up a duplicate card for that item, thereby
enabling the respondent to do what he or she wants. Note that if a paired comparisons or fill-in-
the-matrix approach were used instead of a pilesort, this freedom is available inherently. At the
same time, if duplicate cards are provided, the pilesort technique no longer imposes transitivity.
For example, if there were two cards for item B, then it would be perfectly possible for item A
to be put in the same pile as B, B put in the same pile as C, and yet A not be in the same pile as
C, because two different B's were being used.

Pilesort data is typically aggregated across respondents to obtain a single item-by-item matrix
X in which xij gives the number of respondents who placed item i and item j in the same pile.
This number is interpreted as an index of the degree of similarity between i and j, on the theory
that if i and j are really similar almost everyone will see that and xij will be high. In contrast, if
i and j are really dissimilar, most respondents will not put them in the same pile, and xij will be
low. The assumption is that the probability of a given respondent placing i and j in the same
pile is a fixed function of the unobservable similarity of the two which is more or less the same
for all informants in the sample. When the similarity between two items is neither high nor
low, we expect respondents to be equally likely to place them in the same pile or not, resulting
in about half the sample placing them together and half apart. Thus, middling values in X,
reflecting disagreement among respondents, are taken to mean middling degrees of similarity.
It is worth noting that this relationship between agreement and similarity only holds if all the
respondents have the same underlying view of the similarity of the objects, which is
presumably culturally transmitted. If there are respondents with different views, then
disagreement among respondents (middling values of X) cannot be interpreted as evidence of
middling similarity, since it is quite possible that half the group sees the pair as absolutely
identical while the other half sees them as absolutely distinct. See the chapter on consensus
analysis for further discussion on this topic.
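Concretely, the aggregation just described can be sketched in a few lines of Python (illustrative code, not an ANTHROPAC routine; the items and piles are invented):

```python
# Illustrative sketch: aggregate single pilesorts into an item-by-item matrix X,
# where X[i][j] counts the respondents who put items i and j in the same pile.
from itertools import combinations

items = ["apple", "papaya", "watermelon", "orange", "pear"]
index = {item: k for k, item in enumerate(items)}

# One pilesort per respondent: a list of piles, each pile a set of items.
sorts = [
    [{"apple", "orange", "pear"}, {"papaya", "watermelon"}],
    [{"apple", "pear"}, {"orange"}, {"papaya", "watermelon"}],
    [{"apple", "orange"}, {"pear", "papaya", "watermelon"}],
]

n = len(items)
X = [[0] * n for _ in range(n)]
for piles in sorts:
    for pile in piles:
        for a, b in combinations(pile, 2):   # each pair sharing a pile scores a "1"
            i, j = index[a], index[b]
            X[i][j] += 1
            X[j][i] += 1                     # keep the matrix symmetric

for row in X:
    print(row)
```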

Returning to the issue of allowing the respondent to use as many piles as they like, it should be
evident from the foregoing discussion that this option is only helpful if respondents actually
store relationships among items in mental piles. Since this is unlikely, there is little reason to
feel guilty about imposing a restriction on the number of piles when necessary. Restricting the
number of piles is necessary if you intend to compare individual informants' pilesorts. This is
because some informants will tend, as a matter of personal style, to be "lumpers" while others
will be "splitters". That is, some respondents, no matter what the domain, will take a broad
view and make very few piles, while others tend to split hairs and make as many piles as they
can. Two such respondents could have identical views regarding what is similar to what, but
choose to answer at a different level of detail using different numbers of piles. Consequently,
their data can look quite different, and measures of similarity or distance will find the
respondents quite different. Therefore, when individual differences are to be studied, all
respondents should be required to make the same number of piles.

2.3 Multiple Pilesorts

Given that a pilesort may be asking the respondent to solve the difficult mathematical
problem of finding the equivalence relation that most closely conforms to a mental similarity
relation, it is reasonable to suppose that the respondent may come to a local rather than global
optimum. In other words, there may be other ways of sorting the items into piles that the
respondent would feel equally or more comfortable (or uncomfortable) with. It makes sense,
then, to give the respondent the opportunity to sort the items again. This is the multiple pilesort
technique.

The second sort may be quite different in character from the first, though certain pairs of items
that were sorted together the first time may well end up sorted together the second time. Such
items, one assumes, are really very similar on a number of criteria. If so, the number of
different sorts in which a respondent piles together a given pair of items is a measure of the
perceived similarity of those items. Thus, asking a respondent to sort a set of items 5 times
(perhaps on different occasions) is like asking them to rate the similarity of each pair of items
on a 0 to 5 scale, where a rating of 5 occurs if they place the pair together in all 5 sorts, and a
rating of 0 occurs if the two are never placed in the same pile.

2.4 Successive Pilesorts

As outlined above, one of the advantages of multiple sorts is that, for a given respondent, they
give a measure of the degree of similarity between two items, whereas a single sort gives only
a yes/no judgement. Another approach to getting degrees of similarity is the successive
pilesort. In this technique, the respondent is asked to sort all the items into just two piles. Then
they are asked to split either of the piles in two, to produce 3 piles. The process continues until
all piles have been reduced to a single item. Alternatively, the process can begin from the other
direction. In this approach, the respondent begins with as many piles as there are items, and is
asked to combine the two most similar items into a single pile. The process is repeated until all
items are placed in the same pile.

Yet another approach was suggested by Jim Boster. In his method, the respondent is asked to
generate an initial pilesort using as many piles as they like. Let us call the number of piles they
produce N. Then the respondent is asked to collapse any two of the N piles to yield just N-1
piles. Then they collapse any two of these, until only 2 piles are left. Then the investigator has
the respondent return to his or her original sort and, this time, split any pile in two, creating
N+1 piles. This process is continued until all piles have been reduced to singletons.
Splits Piles

0 a,b,c,d,e,f,g 1
┌──────────┴──────────┐
1 a,b,c,d e,f,g 2
┌──────┴─────┐ │
2 a,b c,d │ 3
┌──┴──┐ │ │
3 │ │ │ │ 4
│ │ │ ┌───┴───┐
4 │ │ │ │ f,g 5
│ │ ┌──┴──┐ │ │
5 │ │ │ │ │ │ 6
│ │ │ │ │ ┌──┴──┐
6 a b c d e f g 7
^ ^ ^ ^ ^ ^
Cuts: 3 2 5 1 4 6

The immediate result of a successive pilesort is a hierarchical clustering of items for each
respondent. As the diagram above shows, it may be seen as an inverted binary tree. At the top
all items are together in one pile. We can label the levels by the number of piles at that point,
or by the number of splits it took to get that far. The number of splits is one less than the
number of piles. One level down, the items have been divided into two piles. Another level
down, one of these classes is split in two, and so on. The similarity between two items is
indexed by the number of splits it takes to separate them. For example, the similarity between f
and g is 6, because six splits are needed to separate them. In contrast, the similarity of a and e
is 1, because it only takes one split to separate them. Thus, the minimum possible similarity is
1 and the maximum is one less than the number of items in the domain.

Coding the results of a successive pilesort can be tricky. There are two basic methods,
depending on whether you want to make things easy for the respondent or the interviewer. One
way is to keep track of changes in the piles using paper and pencil as the sorting goes on. After
the first sort occurs, you stop to write down which items occur in which piles. For example,
suppose the first sort was a free sort into 4 piles. You write down

#4: {a} {b} {c,d} {e,f,g}.

Then suppose the respondent merges a and b into a single pile. You can either write the entire
partition as it stands after the merge:

#3: {a,b} {c,d} {e,f,g},

or you can write a shorthand note to yourself, like

#3: (1+2)
to indicate that piles 1 and 2 were merged. Similarly, when the respondent splits a pile, you
might record:

#5: 4 --> {e}, 5 = {f,g}

which would mean that f and g were removed from pile 4 to create pile 5. Alternatively, you
can draw a tree like the one above as the respondent creates it.

For data entry purposes, you can treat the successive sort data as multiple sort data with 7 sorts
per person.3 Alternatively, you can use a more efficient format that only works with successive
pilesorts. In this format, you write down the items in a long line from left to right with a
number between each item that indicates the level at which that pair of items were separated.
For example, for the tree shown above, you would write:

a3b2c5d1e4f6g

The similarity between any pair of items is given by the smallest number anywhere between
them. For example, the smallest value between d and a is 2, which means that d was separated
from a by the respondent's second split. (NOTE: when entering such data into the computer,
the items are identified by number, not letter or name.)
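To illustrate how this compact coding can be decoded, the sketch below (illustrative Python, not ANTHROPAC's own data-entry routine) recovers the pairwise similarities from a coded string; letters are used only to match the example above.

```python
# Illustrative sketch: decode a successive-pilesort string such as "a3b2c5d1e4f6g".
# The similarity of two items is the smallest cut number lying between them.
import re

def decode_successive_sort(coded):
    tokens = re.findall(r"[A-Za-z]+|\d+", coded)
    items = tokens[0::2]                        # item labels
    cuts = [int(t) for t in tokens[1::2]]       # split level separating neighbours
    sims = {}
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            sims[(items[i], items[j])] = min(cuts[i:j])
    return sims

sims = decode_successive_sort("a3b2c5d1e4f6g")
print(sims[("a", "d")])   # 2: a and d were separated by the second split
print(sims[("f", "g")])   # 6: f and g stayed together until the sixth split
```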

An even more efficient way of coding the data, according to Jim Boster, is to integrate the
coding process with the sorting. To do it, start by creating a set of cards called cutcards
numbered from 1 to N, where N is the number of items in the domain. Ask the respondent to
make the initial sort. Arrange the piles in a line. Count the number of piles (call that number
P). Ask the respondent to indicate which two piles are similar enough to combine, but don't let
him or her combine them. Instead move the piles so they are physically close to each other, and
put cutcard P-1 between them. Tell the respondent to regard the two piles as one. Then ask
them which two piles to combine now. As before, move the piles so they are physically
adjacent, separated by a cutcard P-2. Repeat until there is only one pile. Now return to the
original sort with P piles. Ask the informant to split the most heterogeneous pile. Place cutcard
P+1 between the newly separated piles. Then split another pile, placing cutcard P+2 between
the new piles. Repeat until all piles contain a single item. To record the data, start at the left of
the line and write down the sequence of item-cutcard-item-cutcard exactly as you see it.

One obvious benefit of successive pilesorts is that degrees of similarity are generated for each
respondent. Another important benefit is that the data do not suffer from the lumper/splitter
problem mentioned earlier, and so are entirely comparable between respondents. This is true
regardless of how many piles a respondent makes to start with, since he or she will always end
up having produced a sort with every possible number of piles.

An obvious disadvantage of successive pilesorts is the amount of time required to complete
one, plus the additional demands it places on the intelligence and patience of the informant.
There is also the greater amount of data to enter (if one treats the data as multiple sorts) or the
added complexity of the cutcard recording method.

3 In fact it is the only way to do it in this version of ANTHROPAC. If you want to enter the data in the
format advocated by Boster, you must use ANTHROPAC version 3.5 or lower.
ANTHROPAC cannot process successive pilesort data if they are coded in the economical
Boster format. However, programs that do are available from Jim Boster.

2.5 Triads

The primary reason to use triads is to limit the cognitive burden on respondents by giving them
a very simple set of tasks which can be analyzed to reveal their perceptions of the degree of
similarity between all pairs of items.

A triads questionnaire consists of a series of triples (triads) of items. For each triad, the
respondent is asked to indicate which pair of items is most similar, or alternatively, which one
item is the most different. For example, consider the following two triads:

1. DOG SEAL MOSQUITO

2. BEAR SHARK DOLPHIN

For the first triad, most North Americans would choose mosquito as the most different, which is
equivalent to choosing {dog,seal} as the most similar pair. In the second triad, there will be
some people choosing "bear" because sharks and dolphins are similar in shape and habitat, and
some people choosing "shark" because dolphins and bears are mammals. Very few North
Americans are likely to choose "dolphin" because the pair {bear,shark} do not seem as similar
as either {bear,dolphin} or {shark,dolphin}.

Ideally, all possible combinations of items are presented to the respondent for judgement. For
example, "dog" and "seal" would appear against "dolphin", "bear", "shark" and every other
animal in the set. Thus, each pair of items would occur n-2 times, where n is the number of
items in the domain. However, since the number of triads increases roughly with the cube of
the number of items, this usually results in too many triads to reasonably administer. The exact
formula is

    t = n(n - 1)(n - 2) / 6

where t is the number of triads. For example, a questionnaire with 10 items has 120 triples. A
questionnaire with 20 items has 1140 triads.
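Both the count and the full factorial set of triads are easy to produce by machine; the snippet below is a small illustrative check (the animal names are arbitrary).

```python
# Illustrative check of the triad-count formula t = n(n-1)(n-2)/6.
from itertools import combinations
from math import comb

def n_triads(n):
    return n * (n - 1) * (n - 2) // 6      # same value as comb(n, 3)

print(n_triads(10), comb(10, 3))           # 120 120
print(n_triads(20))                        # 1140

items = ["dog", "seal", "mosquito", "bear", "shark"]
print(list(combinations(items, 3)))        # the 10 triads for a 5-item domain
```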

Consequently, instead of a full factorial design it is common to use a fractional factorial design
known as a balanced incomplete block (BIB). (A design is a pattern or template that specifies
which triads should appear and in what order.) BIB designs reduce the number of triads by
presenting each pair of items only a limited number of times. BIB designs are classified by the
number of times each pair occurs. This number is known as "lambda" (λ). For example, a λ=2
design is a BIB that has each pair of items occur exactly twice. The highest lambda possible is
N-2, since that is the case where all possible triples occur.
An example of a λ=1 BIB design for 9 items is given below:

2 3 8
9 6 2
3 5 6
9 8 4
1 8 6
8 7 5
9 3 7
6 7 4
4 2 5
5 1 9
3 4 1
2 1 7

The numbers refer to items. Accordingly, the first row specifies that the 2nd, 3rd and 8th items
in the domain will occur together in a triad. Note that each pair of items occurs together in a
triad only once throughout the design. To construct the actual questionnaire, you arbitrarily
number each item in the domain, then substitute the corresponding item for each number in the
design. For example, if the list of items is

1. Shark
2. Dolphin
3. Whale
4. Frog
5. Seal
6. Dog
7. Tuna
8. Snake
9. Hippo

then the questionnaire created by the design above would be:

Dolphin Whale Snake


Hippo Dog Dolphin
Whale Seal Dog
Hippo Snake Frog
Shark Snake Dog
Snake Tuna Seal
Hippo Whale Tuna
Dog Tuna Frog
Frog Dolphin Seal
Seal Shark Hippo
Whale Frog Shark
Dolphin Shark Tuna
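The substitution step is purely mechanical; the sketch below (illustrative Python, not ANTHROPAC's own design generator) rebuilds the questionnaire above from the design and the numbered item list.

```python
# Illustrative sketch: substitute item names for the numbers in a BIB design.
design = [(2, 3, 8), (9, 6, 2), (3, 5, 6), (9, 8, 4),
          (1, 8, 6), (8, 7, 5), (9, 3, 7), (6, 7, 4),
          (4, 2, 5), (5, 1, 9), (3, 4, 1), (2, 1, 7)]

items = {1: "Shark", 2: "Dolphin", 3: "Whale", 4: "Frog", 5: "Seal",
         6: "Dog", 7: "Tuna", 8: "Snake", 9: "Hippo"}

for triad in design:
    print("  ".join(items[k] for k in triad))
```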

Since the example uses a λ=1 design, each pair of items occurs together only once in the
questionnaire, and therefore only 12 triads are needed in total. The lower the value of lambda,
the greater the reduction in the number of triads that a respondent must endure. For example,
for the case of a domain with 15 items, the full factorial λ=13 design has 455 triads. The λ=3
design has 105 triads. The λ=2 design has 70 triads and the λ=1 design has 35 triads. (Each
block has 35 triads: the λ=13 design has 13x35=455 triads.)
Unfortunately, this reduction in triads is accompanied by a reduction in accuracy. Consider, for
example, the case of λ=1. In this design, each pair of items occurs only once, "against" a single,
randomly assigned item. If that item happens to be extremely unusual, most respondents will
choose that item as the most different, even though the other two items are not particularly
similar. In fact, if a different third item had been assigned (such as one very similar to one of
the other two items), then most respondents might have chosen one of the other two items as
the most different. Thus, in a λ=1 design the similarity of any two items is completely
determined by their similarity to a single third item.

For this reason, λ=1 designs are not recommended unless different designs are used for each
respondent. What this means is that a different initial ordering of items is used for each
questionnaire, so that a given triple in one questionnaire may or may not appear in another
questionnaire. Thus, it is as if each questionnaire received a different but equally valid λ=1
design. The advantage of this is that, in the aggregate, the similarity of any given pair of items
will not be determined by any one third item, but rather by many (if not all). This significantly
improves the accuracy and reliability of the test, at the cost of making individual responses
incomparable (since each individual did not receive the same questionnaire).

For more information on the accuracy of BIB designs, see the classic study by Burton and
Nerlove (1976).

REFERENCES

Boster, J.S. and J.C. Johnson
1991 "Form or function: A comparison of expert and novice judgments of similarity among
     fish." American Anthropologist 91:866-889.

Burton, M.L. and S.B. Nerlove
1976 "Balanced designs for triads tests: Two examples from English." Social Science
     Research 5:247-267.

Weller, S.C. and A.K. Romney
1988 Systematic Data Collection. Sage Publications.

CHAPTER 3
Deriving Proximities

Suppose you have data consisting of a number of attitude variables collected on a sample of
respondents. The data are arranged as a respondent-by-variable matrix. One way to explore the
data is to examine which variables (attitudes) tend to form clusters such that if a respondent has
one of the attitudes in the cluster, he or she is also likely to have most of the other attitudes in
the cluster as well. This analysis can be done by computing an attitude-by-attitude correlation
matrix which is then input to cluster analysis or multidimensional scaling.

Similarly, you may be interested in looking at similarities among respondents across attitudes.
Are there segments or groups of respondents who have similar attitudes across a range of
topics, but which differ from the pattern of attitudes of other groups? One approach to this
problem is to treat each row of the data matrix as a profile that describes each respondent. The
profiles may be correlated to produce a person-by-person correlation matrix that can then be
analyzed via cluster analysis or MDS.

[Illustration: a rectangular data matrix with m rows and n columns. Comparing the rows
produces an m-by-m proximity matrix; comparing the columns produces an n-by-n proximity
matrix.]

Of course, in either case we are free to choose a different measure of similarity besides the
correlation coefficient. In fact, when comparing respondents, as opposed to variables, it is
typically the case that euclidean distances are computed instead of correlations. The reason for
this is explained later on. In general, the process we are describing is this: Given a rectangular
data matrix of m rows and n columns, compute similarities or distances either among the rows
to produce an mm proximity matrix, or among the columns to produce an nn proximity
matrix (see illustration above). The resulting square, symmetric matrix is then input into a
cluster analysis, multidimensional scaling or other procedure in order to help reveal patterns of
association among the m or n objects.
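The sketch below (numpy; the tiny data matrix is invented) makes this process concrete, computing euclidean distances among the rows and Pearson correlations among the columns of a respondent-by-variable matrix. The reasons for pairing distances with rows and correlations with columns are taken up in the sections that follow.

```python
# Illustrative sketch: derive both proximity matrices from a rectangular
# respondent-by-variable data matrix (values invented).
import numpy as np

data = np.array([[35, 12, 1],     # rows = respondents
                 [52, 16, 0],     # columns = variables (e.g. age, education, ...)
                 [29, 12, 1],
                 [61, 10, 0]], dtype=float)

# m-by-m matrix: euclidean distances among the rows (respondents).
diffs = data[:, None, :] - data[None, :, :]
row_distances = np.sqrt((diffs ** 2).sum(axis=-1))

# n-by-n matrix: Pearson correlations among the columns (variables).
col_correlations = np.corrcoef(data, rowvar=False)

print(row_distances.round(2))
print(col_correlations.round(2))
```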

The most important choice to be made is the measure of similarity or distance. There are
probably hundreds of named measures in the literature, plus an infinite number of possible
variations. One key factor in this decision is the measurement level of the data.

3.1 Levels of Measurement

Measurement is the assignment of numbers to objects in such a way that certain key
relationships among the objects correspond to relationships among the numbers. Different
levels of measurement are defined by the number and kind of correspondences that hold
between the object relations and the numeric relations.

Consider the measurement of the mass (weight) of a set of blocks. Let us refer to any given
block as x, and its mass as m(x). When we measure the mass of x, we assign a score, denoted
w(x), which is a number that is meant to reflect that block's mass. The assignment of weight
scores has a number of important properties. For instance, suppose we use a balance scale to
see whether block x and block y have equal mass. The measurement of weight is such that x
and y are assigned the same score if and only if the scale balances. In other words, x balances y
<===> w(x) = w(y). Similarly, if the scale tips in x's favor, then (and only then) x is assigned a
larger score than y. That is, x overcomes y <===> w(x) > w(y). Furthermore, if we add z to the
side that x is on, and the combination just balances y, then the numeric score of x, added
arithmetically to that of z, should exactly equal the score of y. That is, x and z balance y
<===> w(x)+w(z) = w(y).

The first thing to understand is that the algebraic relationships among the numbers assigned as
scores, such as addition, equality, and ordinality, mirror relationships among the objects in the
real world, as revealed by a balance scale. The second important thing is the implication that
more than one set of scores can do the job. For example, it's obvious in the weight example that
if we multiply every weight score, w(x), by 10, all the properties we discussed will continue to
hold true. The point is obvious if you consider the fact that you can measure things using
different units. As long as you don't mix units (measuring some objects with some units and
other objects with other units), there is no problem: the measurement is equally valid.

Levels of measurement are distinguished by which algebraic relationships among scores
correspond to real relationships among the objects. Typically, people talk about four levels of
measurement (although many more have been devised and an infinite number are possible):
nominal, ordinal, interval and ratio.

3.1.1 Nominal Scales

If we are strict in our definition of nominal scales, we find that nominal scales are very rarely
used. Most of what passes for nominal scales are simply classifications. A classification is
where you divide up a set of objects into mutually exclusive categories, such as grouping
people by religion or state of birth. For convenience, you assign a number to each person based
on which category they fall in. The numeric properties of these scores don't mean anything: a
"3" is not greater than a "1".

Strictly speaking, however, classifications are not nominal measurements, though practically
speaking they are indistinguishable. Nominal measurement is where you measure an attribute
of objects, say length, but the only property preserved by your assignment of scores is that two
objects are assigned the same score if and only if they are the same length. That is,

x is same length as y <===> f(x) = f(y)

Thus, in nominal scales, you can assign objects any number you like, as long as objects that are
equal with respect to the attribute being measured are assigned the same code.

3.1.2 Ordinal Scales

Ordinal scales are ones in which the assignment of scores preserves order relations (greater
than and less than) in addition to the equality relation that nominal scales preserve. For
example, if we measure people's height on an ordinal scale, we are free to assign any numbers
we like, just so long as taller people get larger scores. Consider the lengths of the following
four lines:

A. ----- B. ------- C. --- D. ----

A valid ordinal measurement of these lengths is given by the second column of the table below.
An equally valid ordinal alternative is given by the third column.

     Line      Length    Length

     A            5          99
     B          100         100
     C            3         -12
     D            4          50

Obviously, when things are measured on an ordinal scale, you can tell whether one object has
more of the attribute than the other, but you can't tell how much more. One way to determine
whether two different ordinal scales applied to the same set of objects are both valid is to rank
order4 the objects separately according to their scores on each scale. If both scales are valid,
they will result in exactly the same rank order.

4 To rank order a set of objects on, say, length, means assigning the number "1" to the shortest object, the
number "2" to the next shortest, and so on. Or, you can do it exactly the opposite, assigning a "1" to the longest
object, a "2" to the next longest, etc. Either way, the result is that the only numbers you assign are from 1 to n,
where n is the number of objects ranked.
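The rank-order check is easy to carry out by machine; the snippet below (illustrative, using SciPy) applies it to the two candidate ordinal measurements of lines A-D shown in the table above.

```python
# Illustrative check: two valid ordinal scales of the same objects produce
# the same rank order.  Values are taken from the table of lines A-D above.
from scipy.stats import rankdata

scale_1 = [5, 100, 3, 4]         # lines A, B, C, D on the first ordinal scale
scale_2 = [99, 100, -12, 50]     # the same lines on the second ordinal scale

print(rankdata(scale_1))         # [3. 4. 1. 2.]
print(rankdata(scale_2))         # [3. 4. 1. 2.]  -- identical rank orders
```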
A special case of ordinal data is the presence/absence scale. This is where an object's score is
either a "1" or a "0" depending on whether it has a given attribute or not. The scale is
considered ordinal because a larger score indicates more of the attribute than a smaller score.

3.1.3 Interval Scales

Interval scales are ones in which the assignment of scores preserves equality of intervals
between objects, in addition to the order and equality relations that ordinal scales preserve.
Preserving the equality of intervals means that if we measure people's height, the numbers we
assign must be such that if person a is taller than b by the same amount that c is taller than d,
then f(a)-f(b) must be numerically equal to f(c)-f(d). Consider the lengths of the following four
lines:

A. ----- B. ------- C. --- D. ----

A valid interval measurement of these lengths is given by the second column of the table
below. An equally valid interval alternative is given by the third column.

     Line      Length    Length

     A           20          60
     B           24          62
     C           16          58
     D           18          59

Note that in both measurements, the interval between the length of A and the length of B is the
same as the interval between the lengths of C and A. Further, across scales, you can see that the
intervals are proportional to each other: all the intervals in the first scale are twice as big as the
intervals in
the second scale. In general, if two interval scales applied to the same objects are both valid,
then you can always get an object's score on one scale from its score on another by means of a
simple linear transformation. For example, suppose we denote the temperature of city x on one
interval scale by f(x), and the temperature on another scale by c(x). Then, if both are valid, it is
possible to find constants m and b such that

    f(x) = m·c(x) + b
    c(x) = (f(x) - b) / m

Another way to tell if two interval measurements are really the same is to standardize them
separately. Starting with one set of scale scores, compute the mean and standard deviation
across all objects. Then, from each object's score, subtract the mean and divide by the standard
deviation (in that order). Now repeat the process for the second set of measurements. If both
are valid (i.e., they measure the same thing), the standardized scores should be identical.
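A quick numerical illustration of the standardization check (the temperature readings are invented, and the Centigrade series is simply the Fahrenheit series converted):

```python
# Illustrative check: two valid interval scales of the same attribute yield
# identical scores once each is standardized separately.
import numpy as np

def standardize(x):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()       # subtract the mean, then divide by the sd

fahrenheit = np.array([32.0, 50.0, 68.0, 86.0])
centigrade = (fahrenheit - 32.0) * 5.0 / 9.0    # same temperatures on another scale

print(standardize(fahrenheit))
print(standardize(centigrade))                   # identical standardized scores
```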

When things are measured on an interval scale, you can tell how much more of an attribute one
object has than another in an additive sense, but you can't tell how much more in a
multiplicative sense. In other words, if air temperature is measured on an interval scale (e.g.,
Fahrenheit), it is meaningless to observe that if the temperature was 40° one day and 80°
another day, that it was twice as hot the second day. You can see this by considering what the
temperature for each day was when measured on another interval scale (e.g., Centigrade). It is
easy to verify that 40° Fahrenheit corresponds to about 4° Centigrade, while 80° Fahrenheit
corresponds to about 27° Centigrade. According to the Centigrade scale, the second day was
more than 6 times hotter than the first day, not twice as hot. Yet both scales are valid.

What's going on here is that, being interval scales, Fahrenheit and Centigrade do not assign
numeric scores in such a way that ratios among the scores reflect anything at all about
relationships between the amount of heat in the air.

3.1.4 Ratio Scales

Ratio scales are ones in which the assignment of scores preserves equality of ratios between the
amounts of an attribute that objects possess, in addition to the order, equality and difference
relations that interval scales preserve. Preserving ratios means that if we measure a country's
size in terms of the number of citizens, the numbers we assign must be such that if a country
doubles in size over a period of time, the numeric score we assign at the end of the period must
be twice as large as the score assigned at the beginning. Consider the lengths of the following
four lines:

A. ----- B. ------- C. --- D. ----

A valid ratio-scale measurement of these lengths is given by the second column of the table
below. An equally valid ratio alternative is given by the third column.

Line Length Length


A 50 100
B 70 140
C 30 60
D 40 80

Note that not only the intervals between objects, but the scores themselves are proportional
across the two measurements for all objects. In general, if two ratio scales applied to the same
objects are both valid, then you can always get an object's score on one scale from its score on
another by means of a simple multiplicative factor. For example, suppose we denote the length
of road x on one ratio scale by f(x), and the length on another scale by k(x). Then, if both are
valid, it is possible to find a constant m such that

    f(x) = m·k(x)
    k(x) = (1/m)·f(x)
This equation is just like the corresponding equation for interval scales, except that the constant
b (see previous section) has been set to zero. This is why people refer to ratio scales as "having
a fixed zero point".

Another way to tell if two ratio measurements are really the same is to normalize them
separately. Starting with one set of scale scores, take the sum across all objects. Then, divide
each individual object's score by the sum. Repeat the process for the second set of
measurements. If both are valid (i.e., they measure the same thing using different units), the
normalized scores should be identical.
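The normalization check can be illustrated with the two ratio measurements of lines A-D from the table above:

```python
# Illustrative check: two valid ratio scales of the same lengths yield
# identical scores once each is normalized (divided by its own sum).
import numpy as np

scale_1 = np.array([50.0, 70.0, 30.0, 40.0])     # lines A-D, second column above
scale_2 = np.array([100.0, 140.0, 60.0, 80.0])   # lines A-D, third column above

print(scale_1 / scale_1.sum())    # [0.2632 0.3684 0.1579 0.2105]
print(scale_2 / scale_2.sum())    # identical normalized scores
```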

Ratio scales are the most desirable of the scales discussed here, because the scores they
generate contain so much information. Whereas with an ordinal scale a score of "5" and a score
of "10" just means that the second object has more of a quality than the first, with a ratio scale
it means that the second object has exactly twice as much of the quality as the first. Ratio
scaling is an assignment of numbers to objects in which the numbers are chosen extremely
cleverly so as to contain all the ordinal, interval and ratio information about the objects.

Of course, ratio scales are also the most difficult to come by, especially in the social sciences.
In creating any measurement device, it is extremely difficult to ascertain what properties
(beyond the ordinal) the scores really have. (For the postmodernists among you, it should be
admitted that the validity of any measurement scale depends ultimately on unprovable theories
and assumptions. However, there is no need to throw up one's hands in adolescent existential
despair. I have not exposed science as a white, male, western, fascist, pack of lies designed to
oppress the politically correct. In many cases, the assumptions are so fundamental that even
postmodernists implicitly assume them.)

3.2 Measures of Similarity and Distance

The purpose of a measure of similarity is to compare two lists of numbers (i.e. vectors), and
compute a single number which evaluates their similarity. Most measures were developed in
the context of comparing pairs of variables (such as income or attitude toward abortion) across
cases (such as respondents in a survey). In other words, the objective is to determine to what
extent two variables co-vary, which is to say, have the same values for the same cases.

One problem with comparing two variables is that they may not be measured on the same
scale. For example, suppose we are interested in comparing the temperature of one city with
the temperature of a nearby city, across a hundred years. Clearly, we expect some relationship
between the temperatures. But even if the relationship is absolutely perfect, we don't
necessarily expect to see the same numbers. For instance, if one city is in Texas and the other is
in Mexico, it may be that one set of temperatures is measured on a Fahrenheit scale, while the
others are in Centigrade. Even if the temperatures are both measured in Centigrade, it may be
that the thermometers are calibrated differently, so that one reads consistently higher than the
other. Consequently, in comparing two temperature variables, we would want to allow for or
control for differences in scale.

The general principle is that a measure of similarity should be invariant under admissible data
transformations, which is to say changes in scale. Thus, a measure designed for interval data,
such as the familiar Pearson correlation coefficient, automatically disregards differences in
variables that can be attributed to differences in scale. If you recall, all valid interval scales,
applied to the same objects, can be translated into each other by a linear transformation (see
Section 3.1.3). This means that to see how similar two interval variables are, you must first do
away with differences in scale by either standardizing the data (this is what the correlation
coefficient does), or by trying to find the constants m and b such that the transformed variable
mX+b is as similar as possible to Y, and then reporting that similarity (this is what the r-square
measure of regression does). Likewise, a measure designed for ordinal data should respond
only to differences in the rank ordering, not to the absolute size of scores. A measure designed
for ratio data should control for differences due to a multiplicative factor.

3.2.1 Euclidean Distance

The basis of many measures of similarity and dissimilarity is euclidean distance. The distance
between vectors X and Y is defined as follows:

    d(X,Y) = sqrt( Σi (xi - yi)² )

In other words, euclidean distance is the square root of the sum of squared differences between
corresponding elements of the two vectors. Note that the formula treats the values of X and Y
seriously: no adjustment is made for differences in scale. Euclidean distance is only appropriate
for data measured on the same scale. As you will see in the section on correlation, the
correlation coefficient is (inversely) related to the euclidean distance between standardized
versions of the data.

Euclidean distance is most often used to compare profiles of respondents across variables. For
example, suppose our data consist of demographic information on a sample of individuals,
arranged as a respondent-by-variable matrix. Each row of the matrix is a vector of m numbers,
where m is the number of variables. We can evaluate the similarity (or, in this case, the
distance) between any pair of rows. Notice that for this kind of data, the variables are the
columns. A variable records the results of a measurement. For our purposes, in fact, it is useful
to think of the variable as the measuring device itself. This means that it has its own scale,
which determines the size and type of numbers it can have. For instance, the income measurer
might yield numbers between 0 and 79 million, while another variable, the education measurer,
might yield numbers from 0 to 30. The fact that the income numbers are larger in general than
the education numbers is not meaningful because the variables are measured on different
scales. In order to compare columns we must adjust for or take account of differences in scale.
But the row vectors are different. If one case has larger numbers in general than another case,
this is because that case has more income, more education, etc., than the other case; it is not an
artifact of differences in scale, because rows do not have scales: they are not even variables. In
order to compute similarities or dissimilarities among rows, we do not need to (in fact, must
not) try to adjust for differences in scale. Hence, euclidean distance is usually the right measure
for comparing cases.
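A minimal sketch of the formula applied to two respondent profiles (the values are invented):

```python
# Illustrative implementation of the euclidean distance formula given above.
from math import sqrt

def euclidean(x, y):
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

profile_a = [35, 12, 1]       # one respondent's scores on three variables
profile_b = [52, 16, 0]       # another respondent's scores on the same variables
print(euclidean(profile_a, profile_b))    # about 17.49
```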

3.2.2 Correlation

The correlation between vectors X and Y is defined as follows:

    r(X,Y) = [ (1/n) Σi xi yi - μX μY ] / (σX σY)

where μX and μY are the means of X and Y respectively, and σX and σY are the standard
deviations of X and Y. The numerator of the equation is called the covariance of X and Y: it is
the mean of the product of X and Y minus the product of their means. Note that if X and Y are
standardized, they will each have a mean of 0 and a standard deviation of 1, so the formula
reduces to:

    r(X*,Y*) = (1/n) Σi xi yi
Whereas squared euclidean distance is a sum of squared differences, the correlation is basically
an average product. There is a further relationship between the two. If we expand the formula
for squared euclidean distance, we get this:

    d²(X,Y) = Σi (xi - yi)²  =  Σi xi² + Σi yi² - 2 Σi xi yi

But if X and Y are standardized, the sums Σxi² and Σyi² are both equal to n. That leaves Σxiyi as
the only non-constant term, just as it was in the reduced formula for the correlation coefficient.
Thus, for standardized data, we can write the correlation between X and Y in terms of the
squared distance between them:

r(X^*, Y^*) = 1 - \frac{ d^2(X^*, Y^*) }{ 2n }
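
The identity can be checked numerically. Here is a small sketch in Python with made-up vectors; the standardization uses the population standard deviation, to match the formulas above:

import math

def standardize(v):
    n = len(v)
    mean = sum(v) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in v) / n)   # population standard deviation
    return [(x - mean) / sd for x in v]

X = [1.0, 2.0, 4.0, 7.0]
Y = [2.0, 3.0, 3.0, 8.0]
Xs, Ys = standardize(X), standardize(Y)
n = len(X)

r = sum(a * b for a, b in zip(Xs, Ys)) / n                # correlation as the average product
d_squared = sum((a - b) ** 2 for a, b in zip(Xs, Ys))     # squared distance between X*, Y*
print(round(r, 4))                                        # 0.9305
print(round(1 - d_squared / (2 * n), 4))                  # the same value, via the identity
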
CHAPTER 4
Analyzing Proximities

The primary goal in analyzing proximities is to find groups of items which "hang together". A
secondary goal, which requires additional data, is to discover underlying dimensions along
which the items vary, and which could in some sense explain the observed pattern of
proximities.

4.1 Hierarchical Clustering

One of the simplest methods of analyzing proximity data is cluster analysis. Dozens of cluster
analysis methods have been described in the literature, but the one most commonly used is
Johnson's (1967) hierarchical clustering. The method is agglomerative, which means it starts
with many small clusters and gradually merges them into fewer, bigger clusters. How it works
(a code sketch in Python follows the steps):

Step 1. Find the most proximate pair of clusters. If this is the first iteration, each item is
considered a cluster.

Step 2. Join the two clusters together into a new cluster which replaces the two.

Step 3. Determine the proximity of the new cluster to each other cluster. Johnson suggested
two ways of doing this. The first is the minimum method, which takes the smallest
distance (largest similarity) between any item in one cluster and any item in the other
cluster. This is also called single-linkage clustering and the connectedness method.
The other approach is the maximum method, which takes the largest distance (smallest
similarity) between any item in one cluster and any item in the other cluster. This is
also called complete-linkage clustering and the diameter method.

ANTHROPAC also provides two other options: the average method and the median
method. The average method takes the proximity between two clusters to be the
average proximity between members of the two clusters. The median method uses the
median instead of the average.

Step 4. Repeat steps 1 to 3 until all items belong to the same cluster.
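
Here is a minimal sketch of the procedure in Python (not ANTHROPAC code). The "method" argument selects between Johnson's minimum (single-link) and maximum (complete-link, diameter) rules; the input is a symmetric matrix of distances.

def hierarchical_cluster(dist, method="max"):
    """dist: symmetric list-of-lists distance matrix. Returns the merge history."""
    # Each item starts out as its own cluster (Step 1, first iteration)
    clusters = [[i] for i in range(len(dist))]
    link = min if method == "min" else max
    merges = []
    while len(clusters) > 1:
        # Step 1: find the most proximate (least distant) pair of clusters
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = link(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        level, a, b = best
        # Step 2: join the two clusters into a new cluster that replaces them
        merged = clusters[a] + clusters[b]
        merges.append((level, merged))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        # Step 3: distances to the new cluster follow from the link rule on the next pass
    return merges  # Step 4: the loop repeats until all items are in one cluster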

As an example, consider the following matrix of distances among U.S. cities:


1 2 3 4 5 6 7 8 9
BOST NY DC MIAM CHIC SEAT SF LA DENV
1 BOSTON 0 206 429 1504 963 2976 3095 2979 1949
2 NY 206 0 233 1308 802 2815 2934 2786 1771
3 DC 429 233 0 1075 671 2684 2799 2631 1616
4 MIAMI 1504 1308 1075 0 1329 3273 3053 2687 2037
5 CHICAGO 963 802 671 1329 0 2013 2142 2054 996
6 SEATTLE 2976 2815 2684 3273 2013 0 808 1131 1307
7 SF 3095 2934 2799 3053 2142 808 0 379 1235
8 LA 2979 2786 2631 2687 2054 1131 379 0 1059
9 DENVER 1949 1771 1616 2037 996 1307 1235 1059 0

The principal output of hierarchical clustering is a cluster diagram (see below). In the diagram,
the cities are columns (the city names are written vertically above the city id code), and the
levels of clustering are the rows. The example was run using the diameter method. Since the
most proximate pair of cities in the data matrix was Boston and New York (206 miles), they
were the first to be joined into a cluster. This is shown in the first row of the diagram, in which
capital 'X's form a bar joining the Boston and NY columns.

Hierarchical Clustering Via the DIAMETER Method

C S
B H E D
M O I A E
I S C T N
A T A T V
M O N D G L S L E
I N Y C O E F A R

Level 4 1 2 3 5 6 7 8 9
----- - - - - - - - - -
206 . XXX . . . . . .
379 . XXX . . . XXX .
429 . XXXXX . . XXX .
963 . XXXXXXX . XXX .
1131 . XXXXXXX XXXXX .
1307 . XXXXXXX XXXXXXX
1504 XXXXXXXXX XXXXXXX

The next most proximate cities were San Francisco and LA, at 379 miles. These two were
joined at the second iteration. Notice that the second line of the diagram shows not only these
two cities linked by 'X's, but also Boston and NY, which were joined earlier.

On the third iteration, the program added Washington D.C. to the cluster containing NY and
Boston. Since this example was run using the diameter method, the cluster level shown at the
extreme left column (429) can be interpreted as the maximum distance between cities in the
same cluster. In other words, no pair of cities within either cluster is more than 429 miles apart.

On the next to last iteration, all cities have been assigned to one of two clusters, an eastern and
a western group. The maximum distance within either cluster is 1504 miles. If we go back to
the raw data, we can see that the maximum distance between clusters is 3273 (Miami-Seattle).
When the two clusters are merged in the last iteration, the cluster level now reads 3273.

With perfect data, in which clusters clearly exist such that members of each subgroup are
highly similar (close) to each other but very different (distant) from members of other clusters,
all four methods described above reach the same conclusions. With messy data, in which
clusters may or may not exist, they can get quite different results. In these circumstances, the
minimum method tends to produce a single core cluster to which all items are connected more
or less peripherally. In contrast, the diameter method tends to produce lots of smaller clusters.

The average method works well in practice but, unlike all the others, assumes that the data are
interval-scaled. The median method is perhaps the best analytically, with results similar to the
average method but with the advantage of working well with ordinal data. However, in
practice, the median method is significantly slower than the others.

Another output of cluster analysis is the ultrametric distance matrix. If you think of the
clustering diagram as a model of the proximity structure of the data, the ultrametric distance
matrix records the distances between cities in the model. It is the fitted or expected values
under the model. A clustering is good to the extent that the ultrametric matrix correlates with
the input proximity matrix.
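
As an illustration of this idea, here is a sketch (assuming the SciPy library is available) that clusters the city matrix above with the diameter (complete-linkage) method and reports the cophenetic correlation, i.e. the correlation between the ultrametric distances implied by the tree and the input distances:

import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

cities = ["BOSTON", "NY", "DC", "MIAMI", "CHICAGO", "SEATTLE", "SF", "LA", "DENVER"]
D = np.array([
    [   0,  206,  429, 1504,  963, 2976, 3095, 2979, 1949],
    [ 206,    0,  233, 1308,  802, 2815, 2934, 2786, 1771],
    [ 429,  233,    0, 1075,  671, 2684, 2799, 2631, 1616],
    [1504, 1308, 1075,    0, 1329, 3273, 3053, 2687, 2037],
    [ 963,  802,  671, 1329,    0, 2013, 2142, 2054,  996],
    [2976, 2815, 2684, 3273, 2013,    0,  808, 1131, 1307],
    [3095, 2934, 2799, 3053, 2142,  808,    0,  379, 1235],
    [2979, 2786, 2631, 2687, 2054, 1131,  379,    0, 1059],
    [1949, 1771, 1616, 2037,  996, 1307, 1235, 1059,    0],
], dtype=float)

condensed = squareform(D)                      # SciPy expects the upper triangle as a vector
Z = linkage(condensed, method="complete")      # complete linkage = the diameter method
c, ultrametric = cophenet(Z, condensed)        # ultrametric distances implied by the tree
print("cophenetic correlation:", round(c, 3))  # higher = better fit to the input matrix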

4.2 Multidimensional Scaling

From a non-technical point of view, the purpose of multidimensional scaling (MDS) is to
provide a visual representation of the pattern of similarities or distances among a set of objects.
For example, given a matrix of perceived similarities between various brands of air fresheners,
MDS plots the brands on a map such that brands that are perceived to be very similar to each
other are placed near each other on the map, and brands that are perceived to be very different
from each other are placed far away from each other on the map. For instance, given the matrix
of distances among cities shown above, MDS produces this map:
┌───────────┴───────────┴───────────┴───────────┴───────────┴───────┐
0.43 ┤ ├
│ │
│ │
│ │
0.23 ┤ MIAMI ├
│ │
│ LA │
│ │
0.02 ┤ SF ├
│ DENVER │
│ CHICAGO DC │
│ NY │
-0.19 ┤ SEATTLE BOSTO ├
│ │
│ │
│ │
-0.39 ┤ ├
│ │
│ │
│ │
└───────────┬───────────┬───────────┬───────────┬───────────┬───────┘
-0.40 -0.21 -0.03 0.16 0.35

In this example, the relationship between input proximities and distances among points on the
map is positive: the smaller the input proximity, the closer (smaller) the distance between
points, and vice versa. Had the input data been similarities, the relationship would have been
negative: the smaller the input similarity between items, the farther apart in the picture they
would be.

From a slightly more technical point of view, what MDS does is find a set of vectors in p-
dimensional space such that the matrix of euclidean distances among them corresponds as
closely as possible to some function of the input matrix according to a criterion function called
stress.

A simplified view of the algorithm is as follows (a code sketch in Python follows the steps):

1. Assign points to arbitrary coordinates in p-dimensional space.
2. Compute euclidean distances among all pairs of points, to form what we will call the
D matrix.
3. Compare the D matrix with a monotonic function of the input data, called DHAT, by
evaluating the stress function. The smaller the value, the greater the correspondence
between the two.
4. Adjust the coordinates of each point in the direction that maximally reduces stress.
5. Repeat steps 2 through 4 until stress won't get any lower.
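
For readers who want to experiment, here is a simplified sketch of the procedure in Python (metric scaling, raw stress, plain gradient descent). It is illustrative only and is not ANTHROPAC's algorithm; the step size and iteration count would need tuning for real data.

import math
import random

def simple_mds(delta, p=2, iters=500, step=0.001, seed=0):
    """delta: symmetric matrix of target dissimilarities. Returns n points in p dimensions."""
    rng = random.Random(seed)
    n = len(delta)
    # Step 1: assign points to arbitrary coordinates in p-dimensional space
    X = [[rng.uniform(-1.0, 1.0) for _ in range(p)] for _ in range(n)]
    for _ in range(iters):
        # Step 2: compute euclidean distances among all pairs of points (the D matrix)
        d = [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]
        # Steps 3-4: move each point in the direction that reduces the raw stress,
        # sum of (d_ij - delta_ij)^2, evaluated here by its gradient
        for i in range(n):
            grad = [0.0] * p
            for j in range(n):
                if i == j or d[i][j] == 0.0:
                    continue
                coef = 2.0 * (d[i][j] - delta[i][j]) / d[i][j]
                for k in range(p):
                    grad[k] += coef * (X[i][k] - X[j][k])
            for k in range(p):
                X[i][k] -= step * grad[k]
        # Step 5: a real implementation would stop once stress no longer decreases
    return X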

4.2.1 Input Data

The input to MDS is a square, symmetric 1-mode matrix indicating relationships among a set
of items. By convention, such matrices are categorized as either similarities or dissimilarities,
which are opposite poles of the same continuum. A matrix is a similarity matrix if larger
numbers indicate more similarity between items, rather than less. A matrix is a dissimilarity
matrix if larger numbers indicate less similarity. The distinction is somewhat misleading,
however, because similarity is not the only relationship among items that can be measured and
analyzed using MDS. Hence, many input matrices are neither similarities nor dissimilarities.

However, the distinction is still used as a means of indicating whether larger numbers in the
input data should mean that a given pair of items should be placed near each other on the map,
or far apart. Calling the data "similarities" indicates a negative or descending relationship
between input values and corresponding map distances, while calling the data "dissimilarities"
or "distances" indicates a positive or ascending relationship.

A typical example of an input matrix is the aggregate proximity matrix derived from a pilesort
task. Each cell xij of such a matrix records the number (or proportion) of respondents who
placed items i and j into the same pile. It is assumed that the number of respondents placing
two items into the same pile is an indicator of the degree to which they are similar. An MDS
map of such data would put items close together which were often sorted into the same piles.

Another typical example of an input matrix is a matrix of correlations among variables.
Treating these data as similarities (as one normally would) causes the MDS program to
put variables with high positive correlations near each other, and variables with strong negative
correlations far apart.

Another type of input matrix is a flow matrix. For example, a dataset might consist of the
number of business transactions occurring during a given period between a set of corporations.
Running this data through MDS might reveal clusters of corporations whose members trade
more heavily with one another than with outsiders. Although technically neither similarities
nor dissimilarities, these data should be classified as similarities in order to have companies
who trade heavily with each other show up close to each other on the map.

4.2.2 Dimensionality

Normally, MDS is used to provide a visual representation of a complex set of relationships that
can be scanned at a glance. Since maps on paper are two-dimensional objects, this translates
technically to finding an optimal configuration of points in 2-dimensional space. However, the
best possible configuration in two dimensions may be a very poor, highly distorted,
representation of your data. If so, this will be reflected in a high stress value. When this
happens, you have two choices: you can either abandon MDS as a method of representing your
data, or you can increase the number of dimensions.

There are two difficulties with increasing the number of dimensions. The first is that even 3
dimensions are difficult to display on paper and are significantly more difficult to comprehend.
Four or more dimensions render MDS virtually useless as a method of making complex data
more accessible to the human mind.

The second problem is that with increasing dimensions, you must estimate an increasing
number of parameters to obtain a decreasing improvement in stress. The result is a model of
the data that is nearly as complex as the data itself.

On the other hand, there are some applications of MDS for which high dimensionality is not a
problem. For instance, MDS can be viewed as a mathematical operation that converts an item-
by-item matrix into an item-by-variable matrix. Suppose, for example, that you have a person-
by-person matrix of similarities in attitudes. You would like to explain the pattern of
similarities in terms of simple personal characteristics such as age, sex, income and education.
The trouble is, these two kinds of data are not conformable. The person-by-person matrix in
particular is not the sort of data you can use in a regression to predict age (or vice-versa).
However, if you run the data through MDS (using very high dimensionality in order to achieve
zero or near-zero stress), you can create a person-by-dimension matrix which is similar to the person-by-
demographics matrix that you are trying to compare it to.

4.2.3 Stress

The degree of correspondence between the distances among points implied by the MDS map
and the matrix input by the user is measured (inversely) by a stress function. The general form
of these functions is as follows:

stress = \sqrt{ \frac{ \sum (f(x_{ij}) - d_{ij})^2 }{ \text{scale} } }

In the equation, dij refers to the euclidean distance, across all dimensions, between points i and
j on the map, f(xij) is some function of the input data, and scale refers to a constant scaling
factor, used to keep stress values between 0 and 1. When the MDS map perfectly reproduces
the input data, f(xij) - dij is 0 for all i and j, so stress is zero. Thus, the smaller the stress, the
better the representation.

The stress function used in ANTHROPAC is variously called "Kruskal Stress", "Stress
Formula 1" or just "Stress 1". The formula is:
stress = \sqrt{ \frac{ \sum (f(x_{ij}) - d_{ij})^2 }{ \sum d_{ij}^2 } }

The transformation of the input values f(xij) used depends on whether we are using metric or
non-metric scaling. In metric scaling, f(xij) = xij. In other words, the raw input data is compared
directly to the map distances (at least in the case of dissimilarities). In non-metric scaling, f(xij)
is a weakly monotonic transformation of the input data that minimizes the stress function. The
monotonic transformation is computed via "monotonic regression", also known as "isotonic
regression".

From a mathematical standpoint, non-zero stress values occur for only one reason: insufficient
dimensionality. That is, for any given dataset, it may be impossible to perfectly represent the
input data in two dimensions or any other small number of dimensions. On the other hand, any dataset can be
perfectly represented using n-1 dimensions, where n is the number of items scaled. As the
number of dimensions used goes up, the stress must either come down or stay the same. It can
never go up.

Of course, it is not necessary that an MDS map have zero stress in order to be useful. A certain
amount of distortion is tolerable. Different people have different standards regarding the
amount of stress to tolerate. The rule of thumb we use is that anything under 0.1 is excellent
and anything over 0.15 is unacceptable. Care must be exercised in interpreting any map that
has non-zero stress since, by definition, non-zero stress means that some or all of the distances
in the map are, to some degree, distortions of the input data. The distortions may be spread out
over all pairwise relationships, or concentrated in just a few egregious pairs. In general,
however, longer distances tend to be more accurate than shorter distances, so larger patterns are
still visible even when stress is high. See the sections on Shepard Diagrams and Interpretation
for further information on this issue.

From a substantive standpoint, stress may be caused either by insufficient dimensionality, or by
random measurement error. For example, a dataset consisting of distances between buildings in
New York City, measured from the center of the roof, is clearly 3-dimensional. Hence we
expect a 3-dimensional MDS configuration to have zero stress. In practice, however, there is
measurement error such that a 3-dimensional solution does not have zero stress. In fact, it may
be necessary to use 8 or 9 dimensions to bring stress down to zero. In this case, the fact that the
"true" number of dimensions is known to be three allows us to use the stress of the 3-
dimensional solution as a direct measure of measurement error. Unfortunately, in most
datasets, it is not known in advance how many dimensions there "really" are.

In such cases we hope (with little foundation) that the true dimensionality of the data will be
revealed to us by the rate of decline of stress as dimensionality increases. For example, in the
distances between buildings example, we would expect significant reductions in stress as we
move from one to two to three dimensions, but then we expect the rate of change to slow as we
continue to four, five and higher dimensions. This is because we believe that all further
variation in the data beyond that accounted for by three dimensions is non-systematic noise
which must be captured by a host of "specialized" dimensions each accounting for a tiny
reduction in stress. Thus, if we plot stress by dimension, we expect a curve that drops steeply
across the first few dimensions and then flattens out. In theory, we can use the "elbow" in this
curve as a guide to the dimensionality of the data. In practice, however, such elbows are rarely
obvious, and other, theoretical, criteria must be used to determine dimensionality.
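
As a rough illustration (assuming scikit-learn, and reusing the city matrix D from the clustering sketch above), one can compute stress for a range of dimensionalities and look for the flattening. Note that scikit-learn reports raw stress rather than Stress Formula 1, but it declines with dimensionality in the same general way:

from sklearn.manifold import MDS

# D: the 9 x 9 symmetric city distance matrix defined in the clustering sketch above
for p in range(1, 7):
    mds = MDS(n_components=p, dissimilarity="precomputed", random_state=0)
    mds.fit(D)
    print(p, round(mds.stress_, 1))   # look for the dimension where the decline flattens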

4.2.4 Shepard Diagrams

The Shepard diagram is a scatterplot of input proximities (both xij and f(xij)) against output
distances for every pair of items scaled. Normally, the X-axis corresponds to the input
proximities and the Y-axis corresponds to both the MDS distances dij and the transformed
("fitted") input proximities f(xij). An example is given in Figure 3. In the plot, asterisks mark
values of dij and dashes mark values of f(xij). Stress measures the vertical discrepancy between
dij (the map distances) and f(xij) (the transformed data points). When the stress is zero, the
asterisks and dashes lie on top of each other. In metric scaling, the asterisks form a straight
line. In nonmetric scaling, the asterisks form a weakly monotonic function 5, the shape of which
can sometimes be revealing (e.g., when map-distances are an exponential function of input
proximities).

If the input proximities are similarities, the points should form a loose line from top left to
bottom right, as shown in the figure. If the proximities are dissimilarities, then the data should
form a line from bottom left to top right. In the case of non-metric scaling, f(xij) is also plotted.

┌┬──────────────┬──────────────┬──────────────┬──────────────┬┐
2.5 ├ * ┤
│ * * │
│ - * │
2.0 ├ * * * - ┤
│ ** * │
│ - │
1.5 ├ * ┤
│ │
│ │
1.0 ├ ┤
│ │
│ * * - │
0.5 ├ * * * ┤
│ - * │
│ *│
0.0 ├ ┤

At present, the ANTHROPAC program does not print Shepard diagrams. It does, however,
print out a list of the most discrepant (poorly fit) pairs of items. If you notice that the same
item tends to appear in a number of discrepant pairs, it would make sense to delete the item and
rerun the scaling.

5 If the input data are dissimilarities, the function is never decreasing. If the input data are similarities, the
function is never increasing.
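
Since ANTHROPAC does not print Shepard diagrams, one can construct one outside the program. Here is a sketch assuming the matplotlib and scikit-learn libraries, with made-up dissimilarities and map distances:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.isotonic import IsotonicRegression

# x: input dissimilarities for each pair of items (i < j); d: corresponding map distances
x = np.array([1.0, 2.0, 2.5, 3.0, 4.0, 5.0, 6.0])                # made-up values
d = np.array([0.9, 2.2, 2.1, 3.4, 3.9, 5.5, 5.8])
dhat = IsotonicRegression(increasing=True).fit_transform(x, d)   # f(x), i.e. DHAT

order = np.argsort(x)
plt.scatter(x, d, marker="*", label="map distances d_ij")
plt.plot(x[order], dhat[order], "--", label="fitted f(x_ij)")
plt.xlabel("input proximities")
plt.ylabel("distances")
plt.legend()
plt.show()
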
4.2.5 Interpretation

There are two important things to realize about an MDS map. The first is that the axes are, in
themselves, meaningless and the second is that the orientation of the picture is arbitrary. Thus
an MDS representation of distances between US cities need not be oriented such that north is
up and east is right. In fact, north might be diagonally down to the left and east diagonally up
to the left. All that matters in an MDS map is which point is close to which others.

When looking at a map that has non-zero stress, you must keep in mind that the distances
among items are imperfect, distorted representations of the relationships given by your data.
The greater the stress, the greater the distortion. In general, however, you can rely on the larger
distances as being accurate. This is because the stress function accentuates discrepancies in the
larger distances, and the MDS program therefore tries harder to get these right.

There are two things to look for in interpreting an MDS picture: clusters and dimensions.
Clusters are groups of items that are closer to each other than to other items. For example, in an
MDS map of perceived similarities among animals, it is typical to find (among North
Americans) that the barnyard animals such as chicken, cow, horse, and pig are all very near
each other, forming a cluster. Similarly, the zoo animals like lion, tiger, antelope, monkey,
elephant and giraffe form a cluster. When really tight, highly separated clusters occur in
perceptual data, it may suggest that each cluster is a domain or subdomain which should be
analyzed individually. It is especially important to realize that any relationships observed
within such a cluster, such as item a being slightly closer to item b than to c should not be
trusted because the exact placement of items within a tight cluster has little effect on overall
stress and so may be quite arbitrary. Consequently, it makes sense to extract the submatrix
corresponding to a given cluster and re-run the MDS on the submatrix. 6 (In some cases,
however, you will want to re-run the data collection instead.)

Dimensions are item attributes that seem to order the items in the map along a continuum. For
example, an MDS of perceived similarities among breeds of dogs may show a distinct ordering
of dogs by size. The ordering might go from right to left, top to bottom, or move diagonally at
any angle across the map. At the same time, an independent ordering of dogs according to
viciousness might be observed. This ordering might be perpendicular to the size dimension, or
it might cut a sharper angle.

The underlying dimensions are thought to "explain" the perceived similarity between items.
For example, in the case of similarities among dogs we expect that the reason why two dogs
are seen as similar is because they have similar locations or scores on the identified
dimensions. Hence, the observed similarity between a doberman and a german shepherd is
explained by the fact that they are seen as nearly equally vicious and about the same size. Thus,
the implicit model of how similarity judgments are produced by the brain is that items have
attributes (such as size, viciousness, intelligence, furriness, etc) in varying degrees, and the
similarity between items is a function of their similarity in scores across all attributes. This
function is often conceived of as a weighted sum of the similarity across each attribute, where
the weights reflect the importance or saliency of the attribute.

6 In some cases, however, it is better to rerun the data collection on the subset of items. This is because the
presence of the other items can evoke additional dimensions/attributes of comparison that could affect the way
items in the subset are viewed.

It is important to realize that these substantive dimensions or attributes need not correspond in
number or direction to the mathematical dimensions (axes) that define the vector space (MDS
map). For example, the number of dimensions used by respondents to generate similarities may
be much larger than the number of mathematical dimensions needed to reproduce the observed
pattern. This is because the mathematical dimensions are necessarily orthogonal
(perpendicular), and therefore maximally efficient. In contrast, the human dimensions, while
cognitively distinct, may be highly intercorrelated and therefore contain some redundant
information.

One thing to keep in mind in looking for dimensions is that your respondents may not have the
same views that you do. For one thing, they may be reacting to attributes you have not thought
of. For another, even when you are both using the same set of attributes, they may assign
different scores on each attribute than you do. For example, one of the attributes might be
"attractiveness". Your view of what constitutes an attractive dog, person, fruit or other item
may be very different from your respondents'. Fortunately, a very simple technique exists to
deal with this problem. The technique is called property fitting (PROFIT) and is described in
the next section.

4.3 PROFIT

PROFIT (PROperty FITting) is a method of testing hypotheses about the attributes that
influence people's judgement of the similarities among a set of items. As discussed above, there
are two basic approaches to analyzing proximities: searching for clusters and searching for
dimensions. PROFIT is a way of testing hypotheses about underlying dimensions.

Suppose we have an MDS map, based on perceived similarities among dog breeds, such as the
one shown below (the data are made up). You might hypothesize that the pattern of similarities
we observe is partly a function of breed size. As we move from top left to bottom right, the
breeds seem to be getting larger. However, the pattern is not perfect and may in fact be more a
function of my selective attention than truly present. What we need is an objective assessment
of the degree to which breeds in fact get larger as we move down and right. Also, the exact
direction along which breeds get larger is open to question. Perhaps it is more left-right than
up-down.
The way to do this is to estimate the parameters of a model that relates breed size to position on
the map, or, equivalently, relates position on the map to breed size. Putting it that way, it is
apparent that what we need to do is regress breed size on map location. Map location is given
by the coordinates of each breed on the map. If the map is 2-dimensional, then there are two
coordinates for each breed. Hence there are two independent variables in the regression.

The dependent variable is breed size. You can get these data by looking them up in a dog book.
Similarly, if you are scaling cars and you think that price is a factor in assessing similarities,
then you can look up the price of each car in a reference book. However, in general this is not a
good idea, because the purpose of running PROFIT is to understand the criteria that
respondents used to assess similarities. To use book figures is to assume that respondents are
aware of those same figures, which is unlikely. The best thing to do is to collect new data from
a sample of people drawn from the same population as the respondents who generated the
proximities. Have the new sample rate each item on the attribute you have hypothesized. In our
case, we would ask respondents to indicate the typical size of each dog breed, either via a
rating system (such as a 7-point scale), or direct estimation of the number of pounds or height
at the withers, or both. The data are then averaged across respondents to produce a single value
for each breed.

Both the coordinate data and the attribute data (in separate files) are input to the PROFIT
program. PROFIT then performs a multiple regression using the coordinates as independent
variables and the attribute as the dependent variable. If you have more than one attribute, such
as size, ferocity, retrieving ability, length of hair, etc., the program performs a separate
regression for each one. For each attribute, there are two key outputs: an r-square statistic and
the direction cosines.
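
Here is a minimal sketch of the computation in Python (ordinary least squares; not ANTHROPAC's code). The coordinates and attribute values would come from your MDS output and your new ratings data; the function name is illustrative only.

import numpy as np

def profit(coords, attribute):
    """coords: n x p array of MDS coordinates; attribute: length-n vector of mean ratings."""
    coords = np.asarray(coords, dtype=float)
    attribute = np.asarray(attribute, dtype=float)
    X = np.column_stack([np.ones(len(coords)), coords])    # add an intercept column
    beta, *_ = np.linalg.lstsq(X, attribute, rcond=None)   # ordinary least squares
    slopes = beta[1:]
    cosines = slopes / np.linalg.norm(slopes)              # rescale so squared cosines sum to 1
    fitted = X @ beta
    r_square = 1 - np.sum((attribute - fitted) ** 2) / np.sum((attribute - attribute.mean()) ** 2)
    return r_square, cosines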

The r-square tells you whether location on the map was related to values of the attribute (i.e.,
does size really increase as you go from left to right?). The higher the r-square, the closer the
relationship. For domains with fewer than 20 items, the rule of thumb is that you need an r-
square of at least .80 to support a conclusion that the hypothesized attribute was driving the
perceived similarities among items. (And of course, you can never prove it, even with an r-
square of 1.0. However, a low r-square does disprove the hypothesis.)

The direction cosines are rescalings of the regression coefficients. They give the relative
contribution of each axis of the map to the prediction of the attribute. In other words, they tell
you what precise direction the attribute increases along. For example, for the dog data, both
cosines are positive, which means that breed size increases as you move both east and north on
the map. However, the cosine for the horizontal (X) axis is larger than the cosine for the
vertical (Y) axis. This means that larger breeds are more east than they are north.

We use the direction cosines to draw arrows representing the attributes on the map (see map
below). The values of the direction cosines give the coordinates of the head of the arrow. The
middle of the arrow is always located at the dead center of the map (coordinates 0,0). To draw
the arrow, draw a line from the spot indicated by the direction cosines (the head), through the
center of the map, and out the other side. If the attribute data were coded in such a way that
bigger numbers meant more of the attribute, then we draw an arrowhead at the spot indicated
by the direction cosines, as shown below. Otherwise, we draw an arrowhead at the other end of
the line. The arrowhead always points in the direction of increasing attribute values.

To interpret the line, do NOT think of it as a boundary separating dogs above the line from
those below the line: this is totally wrong. Instead, draw perpendicular lines from each dog to
the PROFIT arrow (see map below). This is called the projection of location onto breed size.
The length of the line from the dog to the arrow is utterly irrelevant. It means absolutely
nothing. What matters is where the line from the dog meets the arrow. If the line is closer to
the arrowhead than another line is, then the dog associated with the first line is (predicted to
be) larger than the dog associated with the second line. For example, in the picture, the
doberman ("dobie") is predicted to be larger than the pitbull ("pitt"). The square of the
correlation between these projections and the actual breed size is equal to the r-square
discussed above.
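
A small numerical check of this claim, continuing the hypothetical profit() sketch above with made-up coordinates and sizes:

import numpy as np

rng = np.random.default_rng(1)
coords = rng.normal(size=(12, 2))                        # fictitious 2-d map coordinates
size = 2.0 * coords[:, 0] + 0.5 * coords[:, 1] + rng.normal(scale=0.3, size=12)

r_square, cosines = profit(coords, size)                 # profit() as sketched in section 4.3
projections = coords @ cosines                           # where each breed meets the arrow
print(round(np.corrcoef(projections, size)[0, 1] ** 2, 6))   # squared correlation ...
print(round(r_square, 6))                                    # ... equals the r-square
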
CHAPTER 5
Consensus Analysis

Consensus analysis is both a theory and a method. As a theory, it specifies the conditions under which
more agreement among individuals on the right answers to a "test" indicates more knowledge on their
part. As a method, it provides a way to uncover the culturally correct answers to a set of questions in
the face of certain kinds of intra-cultural variability. At the same time, it enables the researcher to assess
the extent of knowledge possessed by an informant about a given cultural domain.

Consider a multiple choice exam given to an introductory anthropology class. A possible question might
be "The author of Tristes Tropiques was _______", followed by 5 choices. If we treat the students'
responses as data we obtain a rectangular, respondent-by-question matrix X in which xij gives student i's
choice on the jth question. Each row of the matrix is a vector of numbers, ranging from 1 to 5,
representing a given student's responses to each question. The answer key is also a vector, with the
same range of values, representing the instructor's responses. To obtain each student's score, we
compare the student's vector with the instructor's vector. If the two vectors are the same across all
questions, the student gets a perfect score. If the vectors are quite different, the student has missed
many questions and gets a low score. The important point is that a student's score on the exam is
actually a measure of similarity between the student's and instructor's vectors.

Of course, we can compute the similarity between any two vectors, not just a student's with an
instructor's. Let's consider the similarity between two students' vectors. If both students got perfect
scores on the exam, then their vectors will be identical, assuming there is only one right answer to each
question. If both students did pretty well, then again we expect a certain amount of similarity in their
pattern of answers, because on all questions that they both got right, they will have the same answer.
On questions that one got right and the other got wrong, they will definitely have different answers.
And on questions which both got wrong, they will usually have different answers, because for each
question there are several wrong answers to choose from. The more questions they get wrong, the less
the similarity we expect between their response vectors. If two students each got very few questions
right, the similarity between their response vectors (assuming no cheating or other bias), should be no
better than chance level.

This thought experiment suggests that agreement between students (i.e., similarity of response vectors)
is a function of each student's knowledge of the subject matter, at least under ideal conditions. Let us
specify more clearly what these conditions are. Implicitly, we have assumed a test situation in which
there is one and only one right answer to each question. We also assume a student response model of
the following sort (see Figure 1).
Figure 1. Simple response model.

If the student knows the answer to a question, she writes it down without error (i.e., gets it right) and
moves on to the next question. If the student doesn't know the answer, she guesses randomly among all
the choices. Let's use di to denote the probability that the ith student knows the right answer to any
given question. We can think of di as the proportion of all possible questions about a given topic that
student i knows the answer to.

The probability that i doesn't know the answer is 1-di. Given that she doesn't know the answer to a
given question, the probability that she will get the question right by guessing is 1/L, where L
is the number of alternatives (in this case, 5). We are of course assuming that she is not
predisposed to always pick, say, the middle answer, and that she cannot eliminate absurd
alternatives. If she can eliminate some alternatives, the probability of guessing right is given by
1/L*, where L* is the number of choices left. In any case, the probability of getting a given
question right is the probability of knowing the right answer, plus the probability of not
knowing it but guessing correctly. We can see that the total probability of getting a question right is

mi = di + (1 - di)/L = prob of getting a question right

and, solving for di, the probability of knowing the answer (the "competence") is

di = (Lmi - 1)/(L-1) = prob of knowing answer
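
A one-line illustration of the correction:

L = 5                            # number of answer choices
m_i = 0.68                       # observed proportion of questions answered correctly
d_i = (L * m_i - 1) / (L - 1)    # competence, corrected for lucky guesses
print(round(d_i, 2))             # 0.6: she "knows" 60% of the material
print(round(d_i + (1 - d_i) / L, 2))   # 0.68: the forward formula recovers the observed score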

Using this simple model, we can retrace our thought experiment to get a more precise statement about
the relationship between agreement and knowledge. We begin by formulating the probability that two
students i and j, with knowledge levels di and dj respectively, give the same answer to a given question.

There are four ways it can happen:

1. Both i and j know the right answer.

p(both know) = didj

2. Student i knows the right answer, and student j guesses right.

p(i knows, j guesses) = di(1-dj)/L

3. Student j knows the right answer, and student i guesses right.

p(j knows, i guesses) = dj(1-di)/L

4. Neither knows the right answer, but both guess the same answer (regardless of whether it's right or
wrong).

p(neither knows, guess the same) = (1-di)(1-dj)/L

The probability that i and j give the same answer to any given question is denoted by mij and is
simply the sum of the four probabilities above, as follows:
mij = didj + di(1-dj)/L + dj(1-di)/L + (1-di)(1-dj)/L
mij = didj + (1 - didj)/L

since the last three terms together sum to (1 - didj)/L. Thus, apart from the correction for chance
guessing, the agreement between i and j is given by the product of their respective competencies.
This is the key theoretical result: given a test situation and student response model as outlined
above, it is incontrovertible that agreement implies knowledge and vice versa. The assumptions
of this model can be summarized as follows:

1. Common Truth. There is one and only one right answer for every question. This is implied by
the first fork in the response model where if the student knows the answer, they write it down.

2. Local Independence. Students' responses are independent (across students and questions),
conditional on the truth. This is implied in the second fork of the response model, where if a
student does not know the answer, she guesses randomly among the available choices.

3. Item Homogeneity. Questions are drawn randomly from a universe of possible questions, so
that the probability di that student i knows the answer to a question is the same for all
questions. Thus, all questions are on the same topic, about which a given student has a fixed
level of knowledge. This is implied in the response model by the use of a single parameter di
to characterize a respondent's probability of knowing the answer.

We now turn to the key practical result. In the last equation, mij (the proportion of questions
that students i and j answered the same) is known. We can look at two student response
vectors, and compute the proportion of matches. And in a test situation, where we have an
answer key, we can also compute the d parameters, since they are just the percentage of
questions answered correctly, minus a correction for chance guessing. But suppose we do not
have the answer key. The equation can be rewritten as follows:

didj = (Lmij - 1)/(L-1) = m*ij

This says that the products of the unknown parameters d are equal to observed similarities
between students' responses, corrected for guessing. The observed similarities are the mijs.
After correcting for chance, we get m*ij, which is just a rescaling of the observed similarities.
This new equation can be solved via minimum residual factor analysis (Comrey) to yield least
squares estimates of the d parameters.
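
Here is a minimal sketch (not ANTHROPAC's implementation) of this estimation: the competences d are found by least squares on the off-diagonal cells of the model m*ij = didj, applied here to simulated test data with a known answer key so the recovery can be checked. All names and the simulation itself are illustrative.

import numpy as np

def estimate_competence(mstar, iters=100):
    """mstar: n x n matrix of chance-corrected agreements. Returns estimated competences d."""
    n = mstar.shape[0]
    d = np.full(n, 0.5)                       # crude starting values
    for _ in range(iters):
        for i in range(n):
            others = [j for j in range(n) if j != i]
            num = mstar[i, others] @ d[others]
            den = d[others] @ d[others]
            d[i] = np.clip(num / den, 0, 1)   # least-squares update, diagonal ignored
    return d

# Simulate 20 students with known competences answering 40 five-choice questions
rng = np.random.default_rng(0)
L, n_items = 5, 40
true_d = rng.uniform(0.3, 0.9, size=20)
answers = np.where(rng.random((20, n_items)) < true_d[:, None],    # knows the answer...
                   0,                                               # ...writes the "right" choice (coded 0)
                   rng.integers(0, L, size=(20, n_items)))          # otherwise guesses at random
match = (answers[:, None, :] == answers[None, :, :]).mean(axis=2)   # m_ij: proportion of matching answers
mstar = (L * match - 1) / (L - 1)                                   # corrected for chance guessing
print(np.round(estimate_competence(mstar), 2))                      # recovered competences...
print(np.round(true_d, 2))                                          # ...compared with the true values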

In other words, even if we have lost the answer key, we can still find out exactly how much
knowledge each student has by factor analyzing the pattern of student-student similarities. And
given that we can tell who knows what they're doing and who doesn't, we can also determine
what the right answers must have been to each question. That is, we can determine what the
most probable answer was to any given question, given knowledge of who gave what answer.
For example, if all 20 students who got more than 90% of the questions right said that the
answer to question 7 was "b", the likelihood that it is not "b" is extremely remote, regardless of
what the majority of students might have said.

This result is of tremendous significance for cultural anthropologists, who typically do not
know the answers to the questions they are asking (!). One of the problems faced by
anthropologists is the existence of cultural variability. If we ask basic questions of a sample of
informants, even in matters of presumed "fact", we receive a variety of conflicting answers. We
are not talking here of matters of personal preference, such as what do you like to do in your
spare time, but more general questions which all respondents may agree have a single "right"
answer -- yet disagree on what it is. Sometimes such disagreement is due to subcultural
variability: there are in effect as many truths as subcultures. Yet even within a subculture, there
may be differences in knowledge (or "cultural literacy" to put it in contemporary terms) which
result in different answers. For example, I am not very good at identifying neighborhood trees
or plants. An anthropologist asking me for the names of plants is likely to get many wrong
answers. On the other hand, I have a good memory for names and dates of European historical
interest. The methodology of consensus analysis permits the anthropologist to (a) discover the
right answers and (b) determine who knows about a given topic and who doesn't.

It is important to note that, in this context, the "right answer" to a question is a culturally
defined concept. We are not talking about truth in the Western folk-scientific sense of
empirical reality. To name a tree correctly I do not conduct a biological investigation: I access
the culture that assigns it a name. Knowing the right answer to "is the earth flat?" has nothing
to do with understanding astronomy or geology: it is a function of one's access to the culture of
a given group.

The methodology of consensus analysis depends on the three assumptions outlined earlier.
Translated into the anthropological context, they are as follows:

1. One Culture. It is assumed that, whatever cultural reality might be, it is the same for
everyone. There are no subcultures that have systematically different views on a given topic.
All variability is due to variations in amount of knowledge.

2. Independence. The only force drawing people to a given answer is the culturally correct
answer. When informants do not know an answer, they choose or make up one independently
of each other. In practice, this means interviewing each respondent individually rather than in groups, and
trying to prevent the respondent from getting into a "response set", such as always answering
"yes".

3. One Domain. All questions are drawn from the same underlying domain: you must not mix
questions about tennis with questions about plants, because a person's knowledge of tennis may
be very different from their knowledge of plants.

If these assumptions hold, you can rely on the estimates of the degree of knowledge an
informant has, and what the right answers are. In addition, the ANTHROPAC implementation
of consensus analysis can help test whether the assumptions hold, or more precisely, it can
test whether the assumptions do not hold. One such test is the computation of the eigenvalues
of the M* matrix. The One Culture assumption is inconsistent with the existence of more than
one large eigenvalue. Two large eigenvalues, for instance, are strong evidence that (at least) two
truths (two systematically different patterns of responses) are governing the responses of
informants. The program prints the ratio of the first eigenvalue to the second. The rule of
thumb is that if the ratio is less than 3 to 1, the assumption of One Culture is indefensible. A
ratio of 10 to 1 provides strong support for, though it can never prove, the claim that the
assumption is valid.
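
A sketch of this eigenvalue check, reusing the M* matrix and the estimate_competence() sketch from above. How the (unmodeled) diagonal is treated is an assumption here; it is filled with the estimated communalities di² before the eigenvalues are computed.

import numpy as np

d_hat = estimate_competence(mstar)          # from the simulation sketch above
m = mstar.copy()
np.fill_diagonal(m, d_hat ** 2)             # fill the unmodeled diagonal with communalities
eigvals = np.sort(np.linalg.eigvalsh(m))[::-1]
print("largest eigenvalues:", np.round(eigvals[:3], 2))
print("ratio of first to second:", round(eigvals[0] / eigvals[1], 1))   # > 3 supports One Culture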