
In general usage, a thesaurus is a reference work that lists words grouped
together according to similarity of meaning (containing synonyms and
sometimes antonyms), in contrast to a dictionary, which provides definitions
for words, and generally lists them in alphabetical order.
In library and information science, a thesaurus is a compilation of terms showing
synonymous, hierarchical, and other relationships and dependencies, the function of
which is to provide a standardized, controlled vocabulary for information storage and
retrieval.
The major purposes of a thesaurus are:
1. To provide a map of a given field of knowledge, indicating how concepts or ideas about
concepts are related to one another, which helps an indexer or a searcher to understand
the structure of the field.
2. To provide a standard vocabulary for a given subject field which will ensure that indexers
are consistent when they are making index entries to an information storage and retrieval
system.
3. To provide a system of references between terms which will ensure that only one term
from a set of synonyms is used for indexing one concept.
4. To provide a guide for users of the systems so that they choose the correct term for a
subject search.
5. A desirable purpose is to provide a means by which the use of terms in a given subject
field may be standardized.
(Encyclopedia of library and information science, 1980.)

9.2 FEATURES OF THESAURI


Some important features of thesauri will be highlighted here. For a more detailed
discussion, please consult Aitchison and Gilchrist (1972). The objective is for the
reader to be able to compare thesauri. In general, the discussion applies to both
manually and automatically generated thesauri. However, differences between the two
are also identified where appropriate.

9.2.1 Coordination Level


Coordination refers to the construction of phrases from individual terms. Two distinct
coordination options are recognized in thesauri: precoordination and postcoordination. A precoordinated thesaurus is one that can contain phrases.
Consequently, phrases are available for indexing and retrieval. A postcoordinated
thesaurus does not allow phrases. Instead, phrases are constructed while searching.
The choice between the two options is difficult. The advantage in precoordination is
that the vocabulary is very precise, thus reducing ambiguity in indexing and in
searching. Also, commonly accepted phrases become part of the vocabulary. However,
the disadvantage is that the searcher has to be aware of the phrase construction rules
employed. Thesauri can adopt an intermediate level of coordination by allowing both
phrases and single words. This is typical of manually constructed thesauri. However,
even within this group there is significant variability in terms of coordination level.
Some thesauri may emphasize two- or three-word phrases, while others may emphasize
even longer phrases. Therefore, it is insufficient to state that two thesauri are
similar simply because they follow precoordination. The level of coordination is
important as well. It should be recognized that the higher the level of coordination, the
greater the precision of the vocabulary but the larger the vocabulary size. It also
implies an increase in the number of relationships to be encoded. Therefore, the
thesaurus becomes more complex. The advantage in postcoordination is that the user
need not worry about the exact ordering of the words in a phrase. Phrase combinations
can be created as and when appropriate during searching. The disadvantage is that
search precision may fall, as illustrated by the following well-known example, from
Salton and McGill (1983): the distinction between phrases such as "Venetian blind"
and "blind Venetian" may be lost. A more likely example is "library school" and
"school library." The problem is that unless search strategies are designed carefully,
irrelevant items may also be retrieved. Precoordination is more common in manually
constructed thesauri. Automatic phrase construction is still quite difficult and therefore
automatic thesaurus construction usually implies postcoordination. Section 9.4
includes a procedure for automatic phrase construction.
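
The loss of precision under postcoordination can be illustrated with a minimal sketch in Python. The three toy documents, the whitespace tokenizer, and the function names `postcoordinated_search` and `phrase_search` are assumptions made only for illustration; they are not taken from Salton and McGill (1983). The postcoordinated search combines single words at search time and ignores their order, while the phrase search behaves like a precoordinated entry.

```python
# Toy collection; the document texts are invented for illustration.
DOCUMENTS = {
    1: "the school library bought new books",
    2: "she enrolled in library school last year",
    3: "the library is next to the school",
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def postcoordinated_search(query, documents):
    """Return ids of documents containing every query word, in any order."""
    wanted = set(tokenize(query))
    return sorted(
        doc_id for doc_id, text in documents.items()
        if wanted <= set(tokenize(text))
    )


def phrase_search(phrase, documents):
    """Return ids of documents containing the words adjacent and in order."""
    target = tokenize(phrase)
    n = len(target)
    hits = []
    for doc_id, text in documents.items():
        words = tokenize(text)
        if any(words[i:i + n] == target for i in range(len(words) - n + 1)):
            hits.append(doc_id)
    return sorted(hits)


post_hits = postcoordinated_search("library school", DOCUMENTS)
phrase_hits = phrase_search("library school", DOCUMENTS)
print(post_hits)    # every document that contains both words
print(phrase_hits)  # only documents containing the exact phrase
```

In this toy collection the postcoordinated search cannot separate "library school" from "school library," which is the distinction the phrase search preserves.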

9.2.2 Term Relationships


Term relationships are the most important aspect of thesauri since the vocabulary
connections they provide are most valuable for retrieval. Many kinds of relationships
are expressed in a manual thesaurus. These are semantic in nature and reflect the
underlying conceptual interactions between terms. We do not provide an exhaustive
discussion or listing of term relationships here. Instead, we only try to illustrate the
variety of relationships that exist. Aitchison and Gilchrist (1972) specify three
categories of term relationships: (1) equivalence relationships, (2) hierarchical
relationships, and (3) nonhierarchical relationships. The example in Figure 9.1
illustrates all three categories. Equivalence relations include both synonymy and
quasi-synonymy. Synonyms can arise because of trade names, popular and local
usage, superseded terms, and the like. Quasi-synonyms are terms which for the
purpose of retrieval can be regarded as synonymous, for example, "genetics" and
"heredity," which have significant overlap in meaning. Another example is the pair
"harshness" and "tenderness," which represent different viewpoints of the same
property continuum. A typical example of a hierarchical relation is genus-species, such as
"dog" and "German shepherd." Nonhierarchical relationships also identify
conceptually related terms. There are many examples, including thing--part, such as
"bus" and "seat," and thing--attribute, such as "rose" and "fragrance."
Wang, Vandendorpe, and Evens (1985) provide an alternative classification of term
relationships consisting of: (1) parts--wholes, (2) collocation relations, (3)
paradigmatic relations, (4) taxonomy and synonymy, and (5) antonymy relations. Parts
and wholes include examples such as set--element; count--mass. Collocation relates
words that frequently co-occur in the same phrase or sentence. Paradigmatic relations
relate words that have the same semantic core like "moon" and "lunar" and are
somewhat similar to Aitchison and Gilchrist's quasi-synonymy relationship.
Taxonomy and synonymy are self-explanatory and refer to the classical relations
between terms. It should be noted that the relative value of these relationships for
retrieval is not clear. However, some work has been done in this direction as in Fox
(1981) and Fox et al. (1988). Moreover, at least some of these semantic relationships
are commonly included in manual thesauri. Identifying these relationships requires
knowledge of the domain for which the thesaurus is being designed. Most if not all of
these semantic relationships are difficult to identify by automatic methods, especially
by algorithms that exploit only the statistical relationships between terms as exhibited
in document collections. However, it should be clear that these statistical associations
are only as good as their ability to reflect the more interesting and important semantic
associations between terms.
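
As a rough illustration of how the three categories of Aitchison and Gilchrist might be held in a data structure, consider the following Python sketch. The class `ThesaurusEntry`, its field names, and the helper `preferred_term` are hypothetical and are not drawn from any particular thesaurus standard or system. The choice of "genetics" as the preferred term over "heredity" is arbitrary and made only for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    """One preferred term with its three categories of relationships."""
    term: str
    equivalent: set = field(default_factory=set)  # synonyms, quasi-synonyms
    broader: set = field(default_factory=set)     # hierarchical (genus side)
    narrower: set = field(default_factory=set)    # hierarchical (species side)
    related: set = field(default_factory=set)     # nonhierarchical


def build_thesaurus():
    """Build a tiny thesaurus from the examples given in the text."""
    entries = {}

    def entry(term):
        # Create the entry on first use, then reuse it.
        return entries.setdefault(term, ThesaurusEntry(term))

    # Equivalence: quasi-synonyms mapped to one preferred term (assumed choice).
    entry("genetics").equivalent.add("heredity")
    # Hierarchical: genus-species, recorded in both directions.
    entry("dog").narrower.add("German shepherd")
    entry("German shepherd").broader.add("dog")
    # Nonhierarchical: thing--part and thing--attribute, stored symmetrically.
    for a, b in [("bus", "seat"), ("rose", "fragrance")]:
        entry(a).related.add(b)
        entry(b).related.add(a)
    return entries


def preferred_term(word, entries):
    """Map a word to its preferred term; unknown words are returned unchanged."""
    for e in entries.values():
        if word == e.term or word in e.equivalent:
            return e.term
    return word


thesaurus = build_thesaurus()
print(preferred_term("heredity", thesaurus))
print(thesaurus["German shepherd"].broader)
```

Mapping every member of an equivalence set to a single preferred term is what ensures that only one term from a set of synonyms is used to index a concept.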

9.2.3 Number of Entries for Each Term


It is in general preferable to have a single entry for each thesaurus term. However, this
is seldom achieved due to the presence of homographs--words with multiple
meanings. Also, the semantics of each instance of a homograph can only be
contextually deciphered. Therefore, it is more realistic to have a unique representation
or entry for each meaning of a homograph. This also allows each homograph entry to
be associated with its own set of relations. The problem is that multiple term entries
add a degree of complexity in using the thesaurus--especially if it is to be used
automatically. Typically the user has to select between alternative meanings. In a
manually constructed thesaurus such as INSPEC, this problem is resolved by the use
of parenthetical qualifiers, as in the pair of homographs, bonds (chemical) and bonds
(adhesive). However, this is hard to achieve automatically.
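
One possible way to give each meaning of a homograph its own entry is to key entries by a (term, qualifier) pair. The Python sketch below assumes that qualifiers are written in trailing parentheses, as in the INSPEC example; the names `ENTRIES`, `parse_entry`, and `senses` are hypothetical, and the unqualified entry "dog" is added only for contrast.

```python
import re

# Entries keyed by (term, qualifier); None means the term has a single meaning.
ENTRIES = {
    ("bonds", "chemical"),
    ("bonds", "adhesive"),
    ("dog", None),
}

# A label such as "bonds (chemical)": a term followed by one parenthetical.
QUALIFIED = re.compile(r"^(?P<term>.+?)\s*\((?P<qualifier>[^()]+)\)$")


def parse_entry(label):
    """Split 'bonds (chemical)' into ('bonds', 'chemical')."""
    match = QUALIFIED.match(label.strip())
    if match:
        return match.group("term"), match.group("qualifier")
    return label.strip(), None


def senses(term, entries):
    """List the qualifiers under which a term is entered."""
    return sorted(q for (t, q) in entries if t == term and q is not None)


print(parse_entry("bonds (chemical)"))
print(senses("bonds", ENTRIES))
```

A program using such a structure still has to choose among the qualifiers returned by `senses`, which is the selection step that is hard to automate.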

9.2.4 Specificity of Vocabulary


Specificity of the thesaurus vocabulary is a function of the precision associated with
the component terms. A highly specific vocabulary is able to express the subject in
great depth and detail. This promotes precision in retrieval. The concomitant
disadvantage is that the size of the vocabulary grows since a large number of terms are
required to cover the concepts in the domain. Also, specific terms tend to change (i.e.,
evolve) more rapidly than general terms. Therefore, such vocabularies tend to require
more regular maintenance. Further, as discussed previously, high specificity implies a
high coordination level which in turn implies that the user has to be more concerned
with the rules for phrase construction.

9.2.5 Control on Term Frequency of Class Members


This has relevance mainly for statistical thesaurus construction methods which work
by partitioning the vocabulary into a set of classes where each class contains a
collection of equivalent terms. Salton and McGill (1983, 77-78) have stated that in
order to maintain a good match between documents and queries, it is necessary to
ensure that terms included in the same thesaurus class have roughly equal frequencies.
Further, the total frequency in each class should also be roughly similar. (Appropriate
frequency counts for this include: the number of term occurrences in the document
collection; the number of documents in the collection in which the term appears at
least once). These constraints are imposed to ensure that the probability of a match
between a query and a document is the same across classes. In other words, terms
within the same class should be equally specific, and the specificity across classes
should also be the same.
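
The two frequency constraints can be checked mechanically. The following Python sketch is only an illustration: the document frequencies are invented, the grouping of terms into classes is hypothetical, and the threshold `max_ratio` is an assumption, since the text says only that frequencies should be "roughly equal."

```python
# Number of documents in which each term appears at least once (invented).
DOC_FREQ = {
    "genetics": 40,
    "heredity": 35,
    "harshness": 60,
    "tenderness": 5,
}


def class_report(thesaurus_class, frequency, max_ratio=2.0):
    """Return (total frequency, whether member frequencies are roughly equal).

    Members count as roughly equal when the most frequent term is at most
    max_ratio times as frequent as the least frequent one (assumed rule).
    """
    freqs = [frequency[term] for term in thesaurus_class]
    total = sum(freqs)
    balanced = max(freqs) <= max_ratio * min(freqs)
    return total, balanced


class_a = ["genetics", "heredity"]
class_b = ["harshness", "tenderness"]
print(class_report(class_a, DOC_FREQ))
print(class_report(class_b, DOC_FREQ))
```

Here the two class totals are similar, but the second class violates the within-class constraint because one member is far more frequent than the other.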

9.2.6 Normalization of Vocabulary


Normalization of vocabulary terms is given considerable emphasis in manual thesauri.
There are extensive rules which guide the form of the thesaural entries. A simple rule
is that terms should be in noun form. A second rule is that noun phrases should avoid
prepositions unless they are commonly known. Also, a limited number of adjectives
should be used. There are other rules to direct issues such as the singularity of terms,
the ordering of terms within phrases, spelling, capitalization, transliteration,
abbreviations, initials, acronyms, and punctuation. In other words, manual thesauri are
designed using a significant set of constraints and rules regarding the structure of
individual terms. The advantage in normalizing the vocabulary is that variant forms
are mapped into base expressions, thereby bringing consistency to the vocabulary. As
a result, the user does not have to worry about variant forms of a term. The obvious
disadvantage is that, to use the thesaurus effectively, the user has to be well aware of
the normalization rules used. This is certainly nontrivial and often viewed as a major
hurdle during searching (Frost 1987). In contrast, normalization rules in automatic
thesaurus construction are simpler, seldom involving more than stoplist filters and
stemming. (These are discussed in more detail later.) However, this feature can be
regarded both as a weakness and a strength.
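
To make the contrast concrete, the kind of normalization used in automatic construction can be sketched in a few lines of Python. The stoplist below is a small invented sample, and `crude_stem` is a deliberately simplistic suffix stripper written for this illustration; it is not the Porter algorithm or any other published stemmer.

```python
# A tiny sample stoplist; real stoplists are much longer.
STOPLIST = {"a", "an", "and", "the", "of", "in", "for", "to", "is", "are"}

# Suffixes are tried in this order; the first match is removed.
SUFFIXES = ("ing", "es", "s")


def crude_stem(word):
    """Strip one common suffix, keeping a stem of at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, drop punctuation and stoplist words, then stem the rest."""
    words = [w.strip(".,;:()\"'").lower() for w in text.split()]
    return [crude_stem(w) for w in words if w and w not in STOPLIST]


terms = normalize("The indexing of documents and the searching of indexes")
print(terms)
```

Variant forms such as "indexing" and "indexes" are mapped to the same base expression without the user having to learn any rules, but none of the finer constraints of manual normalization (noun forms, word order within phrases, and so on) are enforced.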
These different features of thesauri have been presented so that the reader is aware of
some of the current differences between manual and automatic thesaurus construction
methods. Not all features are equally important, and they should be weighed according
to the application for which the thesaurus is being designed. This section also gives an
idea of where further research is required. Given the growing abundance of large-sized
document databases, it is indeed important to be challenged by the gaps between
manual and automatic thesauri.

Cleveland and Cleveland (2001) suggest the following steps for thesaurus
construction:

1. Identify the subject field. The boundaries of the subject field should be clearly
defined.
2. Identify the nature of the literature to be indexed. Is it primarily journal
literature, or does it include books, reports, conference papers, etc.? Is it
retrospective or current?
3. Identify the users. What are their information needs? Will their questions be
broad or specific?
4. Identify the structure. Will it be a precoordinated or postcoordinated system?
5. Consult published indexes, glossaries, dictionaries, and other tools in the
subject area for a raw vocabulary. This will increase the thesaurus designer's
understanding of the terminology and semantic relationships in the field.
6. Cluster the terms.
7. Establish term relationships.
8. Review or refine for consistency.
9. Invert the structured thesaurus to produce an alphabetical arrangement of
entries.
10. Test the thesaurus.

Evaluation of index
To determine whether an index is good or bad, it has to be evaluated. Index
evaluation helps to determine how effective an index is in terms of how many of the
documents that contain a term can be retrieved, and how many of the retrieved
items actually match the needs of the user. It also asks whether the index is
efficient. Related to the foregoing is the quality of an index, which is a function
of its effectiveness in providing users with what they want with minimum difficulty.
The qualities of a good indexer are also discussed.
Index quality is a product of factors resident in indexers and the information
systems. Some of these include:

Indexing language used.
Exhaustivity issues.
Specificity problems.
Extent of coordination.
The structure of the information system.
Index consistency.
Expertise of the indexer.
Experience and training of the indexer.

When these factors are properly addressed, the quality of the index as a retrieval
mechanism is assured.
Some checklist questions against which indexes can be evaluated are:

Are the indexed terms appropriate for the intended audience?
Are the main headings relevant to the needs of the reader? Are they
pertinent, specific and comprehensive?
Are the subheadings useful?
Are the subheadings concise, with the most important word at the beginning?
Unnecessary words and phrases, prepositions, and articles should be
avoided.
Are the page references accurate?
Does the index have see and see also cross-references?
Is the index length adequate for the complexity of the document?
Is there a need for more than one type of index?
Is the index organisation accurate, clear and consistent?

Measures of index effectiveness


There are two major measures of the effectiveness of indexes: recall and
precision. Recall is the capacity of the indexing system to identify relevant
documents. The recall ratio is quantitative; it is measured by the ratio of the
number of relevant documents retrieved to the total number of relevant
documents in the collection.
Precision refers to the capability of the system to hold back unwanted documents
and to give the user wanted documents. The precision ratio is also quantitative:
it is the ratio of the number of relevant documents retrieved to the total number of
documents retrieved. As a percentage, the precision ratio is measured as 100 R/L,
where R = number of relevant documents retrieved and L = total number of documents
retrieved in a search. For instance, if 100 documents are retrieved and 70 of them are
relevant to the user's request, then the precision ratio is 70 to 100 (70/100), and the
precision of the search is 70 percent.
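
The two ratios can be computed directly, as in the short Python sketch below. The function names are hypothetical. The precision figures come from the example in the text (70 relevant documents among 100 retrieved); the figure of 140 relevant documents in the whole collection is an assumption added only so that recall can be shown as well.

```python
def precision_ratio(relevant_retrieved, total_retrieved):
    """Precision as a percentage: 100 * R / L."""
    if total_retrieved == 0:
        return 0.0  # nothing retrieved, so no precision to report
    return 100 * relevant_retrieved / total_retrieved


def recall_ratio(relevant_retrieved, total_relevant):
    """Recall as a percentage of all relevant documents in the collection."""
    if total_relevant == 0:
        return 0.0  # no relevant documents exist in the collection
    return 100 * relevant_retrieved / total_relevant


precision = precision_ratio(70, 100)  # example from the text
recall = recall_ratio(70, 140)        # 140 is an assumed collection total
print(precision)
print(recall)
```

Under these figures the search is 70 percent precise, and with the assumed collection total it recovers half of the relevant documents.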
Parameters for index quality are as follows:
Coverage should be complete.
Term choices should be consistent.
Term choices should be appropriate to the nature of the users.
Cross-references should be adequate but not overzealous.
Subheadings should truly reflect the main headings.
There should be no incorrect or missing locators.
There should be no missing proper names.
Alphabetization should be consistent throughout the text.
All misspellings should be corrected.
The same topic should not be indexed under different index terms
without proper cross-references.