
Feature Article

Controlled Vocabularies: A Primer


by Julia Marshall
Julia Marshall has a Master’s Degree in Library Science from Catholic University and has been indexing since 1998. While working part-time at the American Institutes for Research and working with online indexes for the National Center for Education Statistics, she became interested in controlled vocabularies. A past chair of the Washington DC chapter of ASI, she is expanding her skills to include information architecture as well as indexing. She would like to thank Bonnie Jo Dopp, Sue Nedrow, and Pilar Wyman for their help in editing this article.

Peter Mark Roget published the Thesaurus of English Words and Phrases in 1852, but not until 1957 was the term “thesaurus” used in the context of information retrieval by Peter Luhn of IBM.[1] As the impact of computers and the Internet on information retrieval has increased, printed thesauri have evolved to include other types of controlled vocabularies.

This article is meant to be used as a starting point toward further study of controlled vocabularies.

Defining the Vocabulary

When you plan a controlled vocabulary, you need to begin gathering information on the client to determine the best solution for their information retrieval needs. You need to ask a lot of questions, which will generally fall into four main categories.

1) What is the material being searched?
2) Who is doing the searching?
3) How will the controlled vocabulary be implemented?
4) How will the controlled vocabulary be maintained?

The first question concerns content. Is the content mostly text, or will pictures or even sound bites be involved? Is the content already online or is it in print format? Consider also the specificity and the stability of the terms. How many fine distinctions will you need to make among your terms?

Fast, Leise, and Steckel use the example of camping gear. An outdoor company that has 100 types of tents is going to need more specific terms than a company that sells only 7-10 styles.[2] You could have terms for “three-season tents,” “expedition tents,” and “screen house tents.” You could have terms for the number of people that will fit in a tent, from “bivy-sack tents” to “family tents.” Designs of tents include “A-frame,” “umbrella hub,” and “hoop tents.” Tents can also be “steel-framed,” “aluminum-framed,” and “fiberglass-framed.” And this list does not include the tents used for commercial or military purposes.[3] To think I used to sleep in a plain old “pup tent” in the backyard when I was a kid!

Stability refers to how often vocabularies change. When I hiked as a youngster, we carried plastic jugs of water with us. Now people carry “hydration systems.” If you had a term for “drinking water,” would you change that term to “hydration systems,” or would you make “hydration systems” a synonym for “drinking water”? If you added “hydration systems” to a hierarchy, where would it fit and how would it affect other terms like “water purification systems” or “giardiasis”?

The second question addresses the end user or target audience. Sometimes this is an easy one.

The end users for a company intranet are the employees of that company. But how web-savvy are the company employees? Are new employees trained on the intricacies of the intranet, or are they left to fend for themselves? What documents are company employees most likely to be searching? If your target audience is the general public, answering these questions can be quite daunting. Ask the marketing department for information on customer profiles. If a department routinely handles queries from the public, interview staff about what kinds of questions people ask. Rosenfeld and Morville have some excellent information on user research in the 2002 edition of Information Architecture.[4]

If the controlled vocabulary will be used in an online environment, you will need to ask how the vocabulary will be accessed. Who will be responsible for “plugging” the controlled vocabulary into the site? Is he or she a willing participant in the process of creating a controlled vocabulary? Cooperation from the person responsible for the implementation of the controlled vocabulary is crucial in an online environment.

Will the controlled vocabulary be used as a browsing list with hyperlinks or will it be hidden within the search function? If it is in the search function, will your client be using a database program such as SQL Server or MS Access? Or will they be using XML or CGI scripts? Will they be using a third-party search engine such as Inktomi or a custom-made search function? All of these can significantly affect how a controlled vocabulary works within a site. Since this is such a complicated area, I will be writing more about this in the next installment of this series.

Finally, consider how the controlled vocabulary will be used by the staff to maintain and add records to the database. Will the staff be trained information specialists who are full-time employees, or interns who change every three to six months? Trained employees will be much more capable of working with complex hierarchies and relationships. Interns might be better off with a simpler synonym ring. If the staff finds a new term that they want to add to the controlled vocabulary, what will be the procedure? Will anyone be able to add terms, or only designated staff?

Milstead writes, “A thesaurus is never ‘finished,’ unless it is no longer being used for indexing or its database is no longer being updated. Plan for maintenance before you even begin developing your thesaurus. A thesaurus which is not well-maintained quickly becomes a liability rather than an asset.”[5]

Browsing versus Searching

As I mentioned above, controlled vocabularies can be used for browsing or searching. A browsing list is great for users who want to scan through the topics, in much the same way that they would with a back-of-the-book index. Hidden controlled vocabularies are tucked away under the “hoods” of search functions by various linking methods.[6] A user types in a term, the search function finds the term in the controlled vocabulary, and then displays documents linked to the term the user typed in. The user never actually sees the controlled vocabulary, just the results.

Jakob Nielsen states,

    Slightly more than half of all users are search-dominant, about a fifth of the users are link-dominant, and the rest exhibit mixed behavior. The search-dominant users will usually go straight for the search button when they enter a website… In contrast, the link-dominant users prefer to follow the links around a site… Mixed-behavior users switch between search and link-following, depending on what seems most promising to them.[7]

Keeping the above in mind, I would recommend providing both a browsing list and search capabilities. People have been trained to expect a “search” box at a website, but 20 percent of clients is a significant number. Even without a browsing list, I would add an interactive component to the search function. This would let users in on the logic behind the search function, showing that documents retrieved are not arbitrary, and it would allow them to revise their query if necessary.[8]

Annie Dillard once wrote, “A work in progress quickly becomes feral. It reverts to a wild state overnight.”[9]

Controlled vocabularies can get even wilder with all of the input you are trying to coordinate. Documenting all the decisions you make on a project as they come up keeps the project, if not tame, then manageable. This documentation will help maintain consistency as the controlled vocabulary changes and grows.[10] Make an extra copy so that you can leave it with the client once you have finished the project.

Creating the Vocabulary

Remember question #1, about what materials will be searched? That is a good starting place for gathering terms. If the documents/materials are already indexed, you can begin with the index terms. If not, index a sample yourself to come up with a “starter list.” Jessica Milstead calls this the “bottom-up” method.[11]

Alternatively, Milstead suggests you examine existing dictionaries and thesauri to create the vocabulary (the “top-down” method[12]). As proven authorities on the subject at hand, these works do some of the intellectual work for you, but keep in mind that some of the terms or structures will not suit your client’s needs. Remember too, material from dictionaries and thesauri might be under copyright protection.

When gathering terms, keep your target audience in mind. Milstead suggests you convene a group of subject experts to serve as advisors.[13] If these people will be users of the controlled vocabulary, so much the better.

Aitchison, Gilchrist, and Bawden suggest scanning user queries from public service logs.[14] A website’s log files are useful places to gather this information. If a log shows that “hydration systems” is searched over and over again, consider adding “hydration systems” to the vocabulary. You cannot accommodate every single search, but frequent searches should be given careful consideration.

Fast, Leise, and Steckel also advise checking out the competition.[15] What kind of catch phrases are your client’s competitors using? Journals or magazines where your client advertises can give you an idea of current buzzwords. E-mail discussion lists and blogs are also possible sources of vocabulary terms.

By now you have a growing pile of terms. Do you really need to include “three-season tents” as a term, or would “tents” do just fine? If you include “china” as a term, should you include the qualifier of “ceramics” to disambiguate it from the country? If you include “Early Girl” tomatoes as a term, will you then include all varieties of tomatoes? Keep in mind that the “disadvantage of a highly specific vocabulary is that the number of indexing terms is increased….and it is consequently more expensive to compile, maintain, and operate.”[16] In one of her workshops, Bella Weinberg commented that as a thesaurus becomes more complex it also becomes less consistent.[17]

Leonard Will maintains that the job of the thesaurus is to provide useful access points by which material can be retrieved. A detailed description of the material is what the catalog record is for.[18] Back-of-the-book indexers are familiar with this concept as the “do not duplicate the book in the index” rule.

Now that you have a list of terms to work with, you will need to clarify a few issues.

Homographs

“Homographs are words having the same spelling but different meanings.”[19] As noted above, “china” can refer to ceramics, as well as to the country. “Firing” is a process that can be used to mature ceramics, a scorching of plants in poor soil, or a process that propels ammunition. “Cranes” can refer to birds as well as lifting equipment. If you are working on a website devoted to pottery and ceramics, “china” might well need to have qualifiers to distinguish between domestic objects made from vitreous porcelain and objects made out of clay that originated in China:

    china (ceramics)
    China (country)

The ANSI/NISO Z39.19-200x (2005 draft) has an excellent section on homographs[20] and advises, the “use of qualifiers should be avoided whenever possible because of problems that parentheticals can cause in filing and in retrieval.”[21]

Take into consideration the context of the vocabulary. If your client is an international trade association, “China” will probably be understood to mean the country. “Cranes” will probably not need a qualifier if your client is a construction-equipment dealer.

Compound Terms

Compound terms consist of more than one word that represents a single concept.[22] Also known as “bound terms” or “lexemes,” compound terms need to be considered carefully, particularly in an online retrieval environment.

For example, a search on “topic map” in Google brings up the Enchanted Learning website (http://www.enchantedlearning.com/Home.html), which includes “Short printable books on many topics – from early to fluent readers” as well as “Maps, flag, and information on many areas” in the Geography section.[23]

A search environment depends a great deal on post-coordination. In other words, the user is responsible for putting together the terms, such as “topic” and “map,” that will retrieve useful documents.

A pre-coordinated controlled vocabulary is much more suited to browsing. (A back-of-the-book index, for example, is a pre-coordinated list of terms that the indexer has created to access a book.)

According to the ANSI/NISO 2005 draft, compound terms should have both a focus, or a head noun, and a modifier, which narrows the concept of the noun. In the term “topic maps,” “maps” is the noun and “topic” is the modifier.

Following are situations where the ANSI/NISO 2005 draft encourages compound terms:[24]

• Ambiguity of single terms is clarified by having modifiers: “educational attainment” and “price indexes.”
• The compound terms have non-distinct elements: “first aid” and “cut glass.”
• The modifier is a metaphor from an unrelated thing: “tree structures” and “blood oranges.”
• The modifier has lost its original meaning so that the meaning of the compound term has become more than the sum of its parts: “teddy bears” and “trade winds.”
• The compound term is a proper name or includes proper names: “Oedipus complex” and “United Nations.”
• The compound term is in common usage and understood to represent a single concept: “gross domestic product” and “high school.”

Synonyms

Start by grouping synonyms together. As you think of other synonymous terms, add those to the list. Following are types of synonyms to be included:

• The obvious: “cars” and “automobiles” or “pants” and “trousers.”
• Popular names and scientific names: “tomatoes” and “Lycopersicon esculentum” or “aspirin” and “acetylsalicylic acid.”
• Generic names versus trade names: “photocopies” and “Xeroxes” or “tissues” and “Kleenex.”
• Spelling variants: “theater” and “theatre,” “e-mail” and “email,” or “gypsies” and “gipsies.”
• Dialect variants: “elevators” and “lifts” or “soda,” “pop,” and “soft drink.”
• And of course, acronyms: “FBI” and “Federal Bureau of Investigation” or “ASI” and “American Society of Indexers.”

Near-synonyms, defined by the ANSI/NISO 2005 draft as “terms whose meanings are generally regarded as different, but which are treated as equivalents for the purposes of a controlled vocabulary”[25] should be considered as well. Near-synonyms (also called “quasi-synonyms” by Aitchison et al.[26]) can include:

• Terms with significant overlap: “urban areas” and “cities” or “sea water” and “salt water.”
• Terms on a continuum: “meteors,” “meteorites,” and “meteoroids.”
• Antonyms that express opposite sides of the same concept: “equality” and “inequality” or “heat” and “cold” (temperature being the concept).

Once you have divided your list into piles of equivalent terms, you have created a synonym ring or synonym equivalence list. Synonym rings are a good choice if the client has a decentralized, search-dominant site with little centralized control over content.[27] If you take a term in one of the groupings of synonyms and make that the “preferred term” to which all the other terms will refer back, then you have created an authority file.

For example, if you enter “Black Americans,” “Negroes,” “Afro-Americans,” or “Colored people,” you are told, “Use: African Americans.”[28] (“See” cross-references in back-of-the-book indexes function in a similar way: “Black Americans. See African Americans.”)

Defining Relationships within the Vocabulary

If you add a hierarchical structure to your vocabulary list, you will create a taxonomy. A hierarchical structure can add more precision[29] to an information retrieval system, but it is also more complex to organize and maintain.

Ask if your client is willing to make the commitment to provide the upkeep necessary for the controlled vocabulary to be effective. Will there be staff available for maintenance and additions to the database? A hierarchical controlled vocabulary that is not maintained on a routine basis will quickly become obsolete.

Hierarchical relationships are based on the concepts of “broader” and “narrower” in a logically progressive sequence.[30] The relationships are reciprocal, meaning that they can go both ways. For example:

    back-of-the-book indexes
        BT indexes
    indexes
        NT back-of-the-book indexes

(“BT” is the conventional indication of a “broader term” and “NT” indicates a “narrower term.”)

The ANSI/NISO 2005 draft recognizes three types of hierarchical relationships: generic, instance, and whole-part. Generic and instance relationships are often characterized as “IsA” relationships. For example, an elephant Is A mammal. Therefore:

    elephant
        BT mammal
    mammal
        NT elephant [31]

The elephant-mammal relationship is a generic relationship. A good test for whether a generic relationship is valid is the “all-and-some” test. The relationship between elephants and mammals passes this test because “all elephants are mammals and some mammals are elephants” is true.

The instance relationship identifies a link between a general category of objects or events and individual instances of those objects or events. These instances are often proper names or nouns.



    indexing software
        NT Cindex
        NT Macrex
        NT Sky Index

The whole-part relationship is just what its name suggests: The narrower term is inherently a part of the whole (broader term).

    index entries
        NT cross-references
        NT locators
        NT main headings
        NT subheadings

You might discover that some terms can fit under two or more headings, in which case you will need to make a decision on whether to create a poly-hierarchy:

    eyes
        NT optic nerve
    nervous system
        NT optic nerve
    optic nerve
        BT eyes
        BT nervous system

Poly-hierarchical relationships create even more complexity within your vocabulary. The ANSI/NISO 2005 draft says:

    Some taxonomies allow poly-hierarchy, which means that a term can have multiple parents [broader terms], and although the term appears in multiple places, it is the same term. If the parent term has children [narrower terms] in one place in a taxonomy, then it has the same children in every other place where it appears.[32]

Another way to introduce multiple hierarchies is with facets. Most taxonomies list narrower terms alphabetically, but some arrange the list of narrower terms by grouping like items together:[33]

    index entries
        NT <components of>
               cross-references
               locators
               main headings
               subheadings
           <formatting>
               indented indexes
               run-in indexes

The words in angle brackets will never be used as indexing terms. They are simply indicators of the logic behind the grouping in the taxonomy.[34]

If you decide against a poly-hierarchy, you will need to decide under which broader term to post your narrower terms. For example, you will need to decide whether to post “optic nerve” with “eyes” or with “nervous system.” The narrower terms listed by facets in the above examples would be listed alphabetically instead of grouped under the facet.

How do you decide what main categories you will give your taxonomy? How do you decide which narrower terms go where?

Check the materials you examined when you were compiling terms. Indexes, dictionaries, and existing thesauri can also give you some ideas.

A simple, but effective idea for making structure choices is Card Sorting. Write some of your terms onto index cards – one term per card. Ask a sampling of people to sort the cards into like piles. An Open Card Sort allows the testers to create and name their own piles. A Closed Card Sort presents the testers with a pre-defined list of terms into which they sort the terms on the cards.[35] This is an effective way to not only generate feedback on the work you are doing, but to involve staff within your client company more directly in the process of creating the controlled vocabulary.

Thus far, we have examined equivalence relationships (synonym rings) and hierarchical relationships (taxonomies). If you add a third type of relationship, associative, you have built a thesaurus.

Associative Terms

Associative (or related) terms are not related by hierarchy or equivalence to another term, but are related semantically or conceptually:

    mammals
        NT elephants
            RT circus animals

(“RT” indicates “related term.”)

Although not all elephants are circus animals and not all circus animals are elephants, elephants are strongly associated with circus animals.

Since they do not need to fit in a hierarchy, associative terms can become overused. Bella Weinberg has advised that related terms be used judiciously as they can easily become garbage pails for your vocabulary.[36]

My trusty guide, the ANSI/NISO 2005 draft, gives a lengthy discussion on types of associative relationships.[37] The following types of associational relationships are included in the ANSI/NISO 2005 draft.

Derivational relationships:

    equines
        NT donkeys
            RT mules
        NT horses
            RT mules
        NT mules
            RT donkeys
            RT horses

Because a mule is a hybrid offspring of a donkey and a horse, the term “mules” is related to both “horses” and “donkeys.” However, “horses” and “donkeys” would not be listed as related terms, but as “sibling” narrower terms to the broader term of “Equines.”

Relationships based on action:

    environmental cleanup
        RT pollution

Cause and effect relationships:

    pathogens
        RT infections

Concept/object relationships:

    temperature
        RT thermometers

Raw materials and products:

    wheat
        RT flour

Discipline/object or practitioner:

    botany
        RT botanist
        RT plants

Associational relationships are the main organizing principle for topic maps. An ISO standard for topic maps provides for a construct called “topic associations,” which create relationships between two or more topics:[38] “Flour is made from wheat” or “Bread is made from flour,” and “Don Quixote was written by Cervantes” and “Don Quixote takes place in Spain.”

Association types (“made from,” “written by,” and “takes place in”) each have designated roles and can be applied to topics within a database no matter the subject matter. In the case of Don Quixote, the roles might be “title” for Don Quixote, “author” for Cervantes, and “place” for Spain.

According to Steve Pepper, the topic map model generalizes the features of a printed index, and extends those features in many directions at once, thereby creating more navigational possibilities.[39]

Next Steps

Once you have created your controlled vocabulary, be it synonym ring, taxonomy, or thesaurus, you will need to decide on a display format, browsing list, or other strategy for implementing the controlled vocabulary into a search function. And once you have installed the controlled vocabulary, or printed copies, what are the steps for evaluating the final product? These topics I will cover in my next installment for Key Words, “Controlled Vocabularies: Implementation and Evaluation.”

Notes

1. Jean Aitchison and Stella Dextre Clarke. “The Thesaurus: A Historical Viewpoint, with a Look to the Future,” The Thesaurus: Review, Renaissance, and Revision. Edited by Sandra K. Roe and Alan R. Thomas. (Binghamton, NY: Haworth Press, 2004).
2. Karl Fast, Fred Leise, and Mike Steckel, “Creating a Controlled Vocabulary,” Boxes and Arrows, April 7, 2003. http://www.boxesandarrows.com/archives/creating_a_controlled_vocabulary.php (Accessed March 3, 2005.)
3. Eureka! The Tent Company. http://www.eurekacamping.com (Accessed July 21, 2005.)
4. Louis Rosenfeld and Peter Morville. Information Architecture for the World Wide Web. (Sebastopol, CA: O’Reilly, 2002).
5. Jessica Milstead, How Do I Build A Thesaurus? http://www.asindexing.org/site/thesbuild.shtml (Accessed June 21, 2005.)
6. Web databases, CGI Scripts, XML, and search algorithms are ways to link terms to the documents being searched.
7. Jakob Nielsen. Designing Web Usability. (Indianapolis, IN: New Riders Publishing, 2000).
8. Rosenfeld and Morville, 2002. Check out Chapter 8 on search systems.
9. Annie Dillard. The Writing Life. (New York: Harper & Row, 1989).
10. Fast, Leise, and Steckel. “Creating a Controlled Vocabulary,” 2003.
11. Milstead. How Do I Build A Thesaurus? (1996).
12. Ibid.
13. Ibid.
14. Jean Aitchison, Alan Gilchrist, and David Bawden. Thesaurus Construction and Use: A practical manual. (London: Aslib, 1997).
15. Fast, Leise, and Steckel. “Creating a Controlled Vocabulary,” 2003.
16. Aitchison, Gilchrist, and Bawden, 1997.
17. Bella Haas Weinberg. “Thesaurus Design for Indexing, Metadata, and Natural Language Searching” Workshop, June 1, 2000, St. John’s University.
18. Leonard Will. “Thesaurus Principles and Practice,” http://www.willpower.demon.co.uk/thesprin.htm (Accessed July 8, 2005.)
19. Hans Wellisch. Indexing from A to Z. (New York: H. W. Wilson, 1996). Wellisch distinguishes homographs from homonyms by saying that homonyms might sound the same, but are spelled differently (e.g., “beau” versus “bow”). Since the spelling is what would affect both searching and browsing, I use the term “homographs.”
20. ANSI/NISO Z39.19-200x (2005 draft).
21. Ibid.
22. Ibid.
23. Google also retrieved many websites on the relevant term “topic map.” This out-of-context example makes my point about the difficulties compound terms can cause.
24. ANSI/NISO Z39.19-200x (2005 draft). Aitchison, Gilchrist, and Bawden (1997) also discuss the formation of compound terms.
25. Ibid.
26. Aitchison, Gilchrist, and Bawden, 1997.
27. Karl Fast, Fred Leise, and Mike Steckel. “Synonym Rings and Authority Files,” Boxes and Arrows, August 26, 2003. http://www.boxesandarrows.com//archives/synonym_rings_and_authority_files.php (Accessed March 3, 2005.)
28. Library of Congress Subject Authorities List. http://authorities.loc.gov/
29. Precision and recall are two basic concepts in information retrieval. Both concepts are represented as ratios. Precision is expressed as a ratio percentage between the number of relevant documents retrieved and the total number of documents retrieved. The ratio percentage for recall is calculated by dividing the number of retrieved relevant documents by the number of all relevant documents within the collection. Precision and recall tend to be inversely related: when one goes up, the other usually goes down.
30. “Broader” and “narrower” terms can also be referred to as “parent” and “child” terms.
31. I use the singular form of the nouns here to express the “IsA” relationship. The plural form is a more accepted practice in both controlled vocabularies and indexing.
32. ANSI/NISO Z39.19-200x (2005 draft).
33. Fred Leise’s article “Using Faceted Classification to Assist Indexing” is a good introduction to the subject of facets. http://contextualanalysis.com/pub_usingfacets.php (Accessed April 12, 2005.)
34. Ibid.
35. IAWiki. “CardSorting.” http://iawiki.net/cardsorting (Accessed June 29, 2005.) Rosenfeld and Morville’s Information Architecture (listed above) also gives a good introduction to card sorting.
36. Weinberg. “Thesaurus Design for Indexing, Metadata, and Natural Language Searching” Workshop, 2000.
37. ANSI/NISO Z39.19-200x (2005 draft).
38. Steve Pepper. The Tao of Topic Maps, April 2002. http://www.ontopia.net/topicmaps/materials/tao.html (Accessed July 19, 2005.) The ISO standard is ISO/IEC 13250:2003, listed on Wikipedia at http://en.wikipedia.org/wiki/Topic_map.
39. Pepper, 2002.

124 . VOL. 13/NO. 4 OCTOBER – DECEMBER 2005 KEY WORDS
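Note 29 defines precision and recall as ratios. As a small worked illustration of those two definitions, the following Python sketch computes both for a narrow and a broad search. The function name `precision_recall` and the document identifiers are invented for this example; they do not come from any of the sources cited in the article.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search, following note 29.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: four documents are truly relevant to the query.
relevant_docs = {"d1", "d2", "d3", "d4"}

# A narrow search retrieves few documents, all of them relevant.
narrow = precision_recall(["d1", "d2"], relevant_docs)

# A broad search finds more relevant documents but also more noise.
broad = precision_recall(["d1", "d2", "d3", "d5", "d6", "d7"], relevant_docs)

print(narrow)  # (1.0, 0.5)
print(broad)   # (0.5, 0.75)
```

In this made-up case, broadening the search raises recall from 0.5 to 0.75 while precision falls from 1.0 to 0.5, which is the trade-off the note describes.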

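The three kinds of relationships discussed in this article (equivalence, hierarchical, and associative) can be modeled with a very small data structure. The following Python sketch is only one possible arrangement, not a method prescribed by the ANSI/NISO draft or by any tool named above. The class name `Thesaurus` and all of its method names are hypothetical; the sample terms are taken from the article's own examples.

```python
from collections import defaultdict


class Thesaurus:
    """Toy controlled vocabulary holding USE, BT/NT, and RT relationships."""

    def __init__(self):
        self.use = {}                     # non-preferred term -> preferred term
        self.broader = defaultdict(set)   # term -> its broader terms (BT)
        self.narrower = defaultdict(set)  # term -> its narrower terms (NT)
        self.related = defaultdict(set)   # term -> its related terms (RT)

    def add_synonyms(self, preferred, *variants):
        # Authority file: every variant refers back to one preferred term.
        for variant in variants:
            self.use[variant.lower()] = preferred

    def add_hierarchy(self, broader_term, narrower_term):
        # Recording both directions keeps BT and NT reciprocal.
        self.narrower[broader_term].add(narrower_term)
        self.broader[narrower_term].add(broader_term)

    def add_related(self, term_a, term_b):
        # RT links are symmetric, so store them on both terms.
        self.related[term_a].add(term_b)
        self.related[term_b].add(term_a)

    def preferred(self, entered):
        # Terms with no USE reference are returned unchanged.
        return self.use.get(entered.lower(), entered)

    def expand(self, entered):
        """Preferred term plus all of its narrower terms, at any depth."""
        found, stack = set(), [self.preferred(entered)]
        while stack:
            term = stack.pop()
            if term not in found:
                found.add(term)
                stack.extend(self.narrower[term])
        return found


vocab = Thesaurus()
vocab.add_synonyms("automobiles", "cars")
vocab.add_hierarchy("eyes", "optic nerve")
vocab.add_hierarchy("nervous system", "optic nerve")  # poly-hierarchy
for animal in ("donkeys", "horses", "mules"):
    vocab.add_hierarchy("equines", animal)
vocab.add_related("mules", "donkeys")
vocab.add_related("mules", "horses")

print(vocab.preferred("Cars"))           # automobiles
print(sorted(vocab.expand("equines")))   # ['donkeys', 'equines', 'horses', 'mules']
```

A hidden vocabulary inside a search function could call something like `expand` on the user's query before matching documents, while a browsing list would instead display the `narrower` and `related` entries directly. Note that “horses” and “donkeys” are siblings under “equines” here, not related terms of each other, as in the derivational example.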