Vocabulary Mapping for Terminology Services
Diane Vizine-Goetz, Carol Hickey, Andrew Houghton and Roger Thompson
OCLC Research, OCLC Online Computer Library Center, Inc.
Email: vizine@oclc.org; hickeyc@oclc.org; houghton@oclc.org; thompson@oclc.org
Project Web Site: http://www.oclc.org/research/projects/termservices/
Abstract
The paper describes a project to add value to controlled vocabularies by making intervocabulary associations. A
methodology for mapping terms from one vocabulary to another is presented in the form of a case study applying the
approach to the Educational Resources Information Center (ERIC) Thesaurus and the Library of Congress Subject
Headings (LCSH). Our approach to mapping involves encoding vocabularies according to Machine-Readable Cataloging
(MARC) standards, machine matching of vocabulary terms, and categorizing candidate mappings by likelihood of valid
mapping. Mapping data is then stored as machine links. Vocabularies with associations to other schemes will be a key
component of Web-based terminology services. The paper briefly describes how the Open Archives Initiative Protocol for
Metadata Harvesting (OAI-PMH) is used to provide access to a vocabulary with mappings.
1 Introduction
Most of the tools and features used to access the names, subjects, and classification categories assigned to content
objects are themselves not easily accessed by people or computers. The knowledge organization schemes and the features found in
cataloging and retrieval systems are often deeply embedded in proprietary formats and software. Even when knowledge
organization resources are openly available, they are rarely linked with other compatible schemes or services. This paper
describes a project to add value to controlled vocabularies through vocabulary mapping. The vocabulary associations are
then made accessible through Web services.
In this paper, 'terminology services' is used to describe Web services involving various types of knowledge organization
resources, including authority files, subject heading systems, thesauri, Web taxonomies, and classification schemes. The
term 'vocabulary' is used to refer to these knowledge organization resources. Vocabularies with associations to other
schemes will be a key component of Web-based terminology services. Web services are modular, Web-based,
machine-to-machine applications that can be combined in various ways. For background information on Web services, see Gardner
(2001) and Tennant (2002). Web services can be accessed at various points in the metadata lifecycle, for example, when
a work is authored or created, at the time an object is indexed or cataloged, or during search and retrieval. A Web
service that provides mappings from a term in one vocabulary to one or more terms in another vocabulary is an example
of a terminology service.
2 Vocabulary compatibility
Researchers have been interested in achieving compatibility among controlled vocabularies for many years. Lancaster and
Smith (1983) published an overview of the issues involved in integrating vocabularies, which is still relevant today. They
describe several factors that influence how successfully one vocabulary can be associated with another, including:
Extent of overlap in the subject matter
Level of specificity of terms
Degree of pre/post-coordination
How the vocabulary codes equivalence, hierarchical, and other relationships
Researchers involved in more recent efforts to integrate vocabularies have identified additional factors affecting
vocabulary compatibility:
Differences in word use, e.g. common versus scientific names (Doerr 2001, Olson and Strawn 1997)
Differences in meaning resulting from different classifications of terms (Doerr 2001, Whitehead 1990)
Zeng and Chan (2003) review the primary methodologies used to associate and integrate vocabularies. Of the
approaches they describe, two are relevant to this paper:
Direct mapping: establishing equivalence between terms in different controlled vocabularies or between verbal terms
and classification numbers.
Co-occurrence mapping: establishing mappings from the co-occurrence of terms from different schemes in the same
metadata or catalog record.
Co-occurrence mappings are considered to be more loosely mapped than direct mappings, which usually have an
intellectual review component.
3 OCLC vocabulary projects
3.1 Dewey mappings
In 1994, OCLC staff began linking Library of Congress Subject Headings (LCSH) to the Dewey Decimal Classification (DDC)
scheme. DDC/LCSH pairs were generated from OCLC WorldCat records that contained both DDC numbers and LCSH.
Co-occurrence mappings were made for frequently occurring pairs. Later, an association measure was introduced in the
co-occurrence mapping process to provide a better indicator of association than simple pair frequencies (Vizine-Goetz 1998).
Approximately 90,000 co-occurrence mappings have been made in WebDewey, the electronic version of the DDC. An
example of DDC/LCSH co-occurrence mappings is shown below for DDC class 617.522 Oral region--surgery:
LC Subject Headings
https://journals.tdl.org/jodi/index.php/jodi/rt/printerFriendly/114/113 1/13
4/13/2016 VizineGoetz
Cleft lip
Cleft lip--Surgery
Cleft palate
Mouth--Diseases
Mouth--Microbiology
Mouth--Surgery
Oral medicine
Temporomandibular joint--Diseases
The mapped LCSH provide additional indexing vocabulary for the electronic version of the DDC and also assist catalogers
in assigning subject headings. These terms are also included in versions of the DDC used in automated classification
services.
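The pair-frequency step behind co-occurrence mapping can be illustrated with a minimal sketch. It assumes each catalog record is available as a dictionary holding its DDC numbers and LCSH headings; the record layout, the sample records, and the frequency threshold are invented for illustration and are not taken from WorldCat or from the project software. The association measure mentioned above is not reproduced here.

```python
from collections import Counter
from itertools import product

def count_pairs(records):
    """Count how often each (DDC number, LCSH heading) pair shares a record."""
    pairs = Counter()
    for rec in records:
        # set() ensures a pair is counted at most once per record
        for pair in product(set(rec["ddc"]), set(rec["lcsh"])):
            pairs[pair] += 1
    return pairs

# Hypothetical records, invented for illustration only
records = [
    {"ddc": ["617.522"], "lcsh": ["Cleft palate", "Mouth--Surgery"]},
    {"ddc": ["617.522"], "lcsh": ["Mouth--Surgery"]},
    {"ddc": ["616.31"], "lcsh": ["Oral medicine"]},
]

pairs = count_pairs(records)
MIN_FREQUENCY = 2  # assumed cut-off for "frequently occurring" pairs
frequent = {pair: n for pair, n in pairs.items() if n >= MIN_FREQUENCY}
print(frequent)
```

A raw count of this kind favors very common headings, which is one reason an association measure may be preferred to simple pair frequencies.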
3.2 Other mappings
The scope of OCLC's vocabulary mapping research projects has expanded to include additional classification schemes,
subject heading systems, and thesauri. A list of OCLC vocabulary associations and the mapping approach used (direct,
co-occurrence, or both) is shown in Table 1. In addition to DDC/LCSH co-occurrence mappings, direct mappings have been
made between selected classes from the Library of Congress Classification (LCC) and the National Library of Medicine
Classification (NLMC) and DDC. The LCC/DDC mappings and NLMC/DDC mappings are used to profile questions and
expertise for virtual reference services. Project staff members have also made direct mappings of genre terms for fiction
and drama (GSAFD) to LCSH and to LCSHac (headings for children's materials) using the procedures outlined in this
paper. Because the GSAFD vocabulary is quite small (only 153 preferred terms) and is based largely on LCSH, the GSAFD
mapping effort was not considered a suitable test of our mapping approach. For these reasons, the approach was
applied to another vocabulary.
Table 1. OCLC vocabulary associations
To vocabularies (column order): DDC, ERIC, GSAFD, LCC, LCSH, LCSHac, MeSH, NLMC
From vocabulary: mapping approaches in the order they appear across the row (empty cells not shown)
DDC (Dewey Decimal Classification): Direct; Direct & Co-occur; Direct & Co-occur; Direct; Direct
ERIC Thesaurus: Direct
GSAFD (Genre terms for fiction): Direct; Direct
LCC (Library of Congress Classification): Direct
LCSH (LC Subject Headings): Direct & Co-occur; Direct; Direct; Co-occur; Direct
LCSHac (LC Children's Headings): Direct & Co-occur
MeSH (Medical Subject Headings): Direct; Direct
NLMC (National Library of Medicine Classification): Direct
The GSAFD vocabulary terms with mappings are accessible using the OAI-PMH. The OAI protocol specifies a simple HTTP
protocol for automated sharing of metadata, but as the OAICat effort has shown, the approach works equally well for
sharing other XML content. The content of the GSAFD records is MARC in XML (MARC Standards). The records are
accessible to users via a browser (http://alcme.oclc.org/gsafd/) and to machines through the OAI-PMH Web services
mechanisms. See Van de Sompel et al. (2003) for a more complete description of how the file can be accessed using
the OAI-PMH. The GSAFD/LCSH mapping file can also be downloaded from our project Web site. The file is encoded in
MARC in XML and also according to version 0.5 of the Zthes schema. We have also prototyped some experimental Web
services using co-occurrence mappings between the GSAFD vocabulary and LCSH.
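As an illustration of the machine interface, an OAI-PMH request is an ordinary HTTP request whose query string carries a verb and its arguments. The sketch below only builds request URLs and performs no network access. The base URL, the record identifier, and the metadata prefix "marcxml" are placeholders assumed for the example; the actual endpoint and prefix of the GSAFD repository are not given in this paper. The verb names ListRecords and GetRecord and the argument names metadataPrefix and identifier come from the OAI-PMH specification.

```python
from urllib.parse import urlencode

# Hypothetical base URL; the real repository endpoint is not stated here.
BASE_URL = "http://example.org/oai"

def oai_request(verb, **arguments):
    """Build an OAI-PMH request URL from a verb and its arguments."""
    query = {"verb": verb}
    query.update(arguments)
    return BASE_URL + "?" + urlencode(query)

# Harvest all records in an assumed MARC XML metadata format
list_url = oai_request("ListRecords", metadataPrefix="marcxml")

# Fetch one record by a made-up identifier
get_url = oai_request(
    "GetRecord", identifier="oai:example.org:gsafd-1", metadataPrefix="marcxml"
)

print(list_url)
print(get_url)
```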
3.3 Mapping to LCSH
As Table 1 shows, much of our mapping activity involves LCSH. Describing the relationship between the Art and
Architecture Thesaurus (AAT) and LCSH, Whitehead (1990, p. 82) asks: "Why map to LCSH?" and replies:
Despite the weaknesses and the critical assessments that have plagued LCSH over the years, the fact
remains that LCSH is the standard vocabulary used by the majority of information resources, especially
libraries, in the United States.
She also notes that efforts to improve or replace LCSH must take into account its widespread use and the probability that
it will be maintained for a long time. Others have reached similar conclusions. For example, the FAST project sponsored by
OCLC selected LCSH as the basis for creating a faceted vocabulary for metadata. O'Neill and Chan (2003) cite the
following reasons for choosing the LCSH scheme:
[LCSH] is by far the most commonly used and widely accepted subject vocabulary for general application.
It is the de facto universal controlled vocabulary and has been translated or adapted by many countries around the
world.
It is the largest general indexing vocabulary in the English language.
LCSH are also among the recommended encoding schemes that can be used to qualify the Dublin Core subject element.
Several prominent projects that use Dublin Core metadata create subject elements based on LCSH, including the
Colorado Digitization Program, DSpace, and ePrints UK.
3.4 Vocabulary encoding standards
Many standards exist for encoding vocabularies: see Koch (2003) and the SWAD-Europe Thesaurus Activity thesaurus link
page for listings of some current standards. For authority files, subject headings and thesauri, we have decided to use
the MARC21 Format for Authority Data. For classification data, we use the MARC21 Format for Classification Data. MARC
was chosen because many large vocabularies are available in the MARC formats, and the MARC Authority format supports
intervocabulary relationships, which are particularly important to us because of our mapping work. Some examples of
vocabularies available in the MARC format include:
Library of Congress Subject Headings (LCSH) > 277,000/263,524 concepts/terms
The Getty vocabularies (Art & Architecture Thesaurus; Union List of Artist Names; Thesaurus of Geographic Names) >
1.6 million concepts/names/terms
Medical Subject Headings (MeSH) > 21,973/125,858 concepts/terms
Canadian Subject Headings (CSH) > 6,000 concepts
Library of Congress Classification data (LCC) > 595,000 categories
The MARC authority format enables us to provide detailed coding for many common controlled vocabulary elements.
Preferred terms are coded in the block of MARC tags labeled 1XX. The tag 150 is used for topical terms. Non-preferred
terms are coded in the 4XX range. The MARC authority format provides for the coding of some relationships between a
preferred term and non-preferred terms, including earlier forms and acronyms. Broader term/narrower term relationships
and associative relationships (related terms) are coded in MARC tags 5XX. Subfield $w is used to code relationships
between 1XX and 4XX fields and 1XX and 5XX fields. Tags 7XX are used to provide links between equivalent terms in the
same vocabulary and equivalent terms in different vocabularies. Section 5 provides a detailed explanation of MARC 7XX
linking fields.
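The division of labor among the tag blocks described above can be summarized in a small lookup. This is only a reading aid based on the description in this section, not a full account of the MARC authority format; the function name and the role labels are our own.

```python
def tag_role(tag):
    """Return the role of a MARC authority tag, keyed on its first digit."""
    roles = {
        "1": "preferred term",
        "4": "non-preferred term",
        "5": "broader, narrower, or related term",
        "7": "link to an equivalent term",
    }
    # Any block not discussed in this section is reported as "other"
    return roles.get(tag[0], "other")

for tag in ("150", "450", "550", "750", "680"):
    print(tag, tag_role(tag))
```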
In the remainder of this paper we describe our approach to mapping the ERIC Thesaurus to LCSH. The ERIC Thesaurus
was chosen because it is a well-established vocabulary, publicly accessible on the Web, and large enough to provide a
meaningful test of our mapping approach. The ERIC Thesaurus is produced by the Educational Resources Information
Center, an education information network, sponsored by the U.S. Department of Education, and provides public access to
education literature (ERIC 2004).
4 Mapping the ERIC Thesaurus to LCSH
4.1 Converting ERIC to MARC
Vocabularies to be mapped are first converted to the MARC21 Authority Format. The effort involved in this step varies
depending on the format of the source vocabulary (the vocabulary being mapped). We have converted vocabularies from
formats primarily intended for display, e.g. word processing documents without extensive use of styles, and vocabularies
in more structured formats such as the ERIC file (Figure 1).
Multiple instances of broader terms (BT), narrower terms (NT), and related terms (RT) stored in single ERIC fields are
encoded as separate fields in the MARC format (Figure 2). The RT field shown in Figure 1 generates 14 fields in the MARC
record. These are the fields labeled with MARC tag 550 (without $w subfields). The field labeled UF is similarly converted
into two MARC fields (tag 450). One of the terms, Student Ability, represents a formerly valid term. The notation in
parentheses in the ERIC record indicates this and gives the lifespan of the term. When this data is converted to MARC, a
688 field (Application History Note) is constructed for this data. In the 450 field, subfield $w is added to indicate the term
was formerly valid. By encoding the source and target vocabularies in the MARC Authority Format we are able to
standardize the representation of similar information and improve our ability to match vocabularies.
Figure 1. Sample ERIC record
<TERM> Academic Ability
<SCOPE> The degree of actual competence to perform in scholastic or
educational activities (Note: For potential competence, use
"Academic Aptitude" for measured achievement, use
"Academic Achievement")
<RT> Ability Grouping; Academic Achievement; Academic Aptitude;
Academic Aspiration; Academically Gifted; Aptitude Treatment
Interaction; Cognitive Ability; College Entrance Examinations;
High Risk Students; Intelligence; Scholarship; Spatial Ability;
Student Characteristics; Verbal Ability
<BT> Ability
<UF> Scholastic Ability; Student Ability (1966 1980)
<GROUP> 120
<TYPE> Main
<ADD> 07/01/1966
Figure 2. ERIC record in MARC21 authority format
001 ERIC00025
003 OCoLC-O
005 20031117154238.0
008 031118 n|a|znn|bb||||||||||| ||an| ||| d
040 $beng$cOCoLC-O$dOCoLC-O$eericd
072 $a120
150 $aAcademic Ability
450 $aScholastic Ability
450 $wa$aStudent Ability
550 $aAbility Grouping
550 $aAcademic Achievement
550 $aAcademic Aptitude
550 $aAcademic Aspiration
550 $aAcademically Gifted
550 $aAptitude Treatment Interaction
550 $aCognitive Ability
550 $aCollege Entrance Examinations
550 $aHigh Risk Students
550 $aIntelligence
550 $aScholarship
550 $aSpatial Ability
550 $aStudent Characteristics
550 $aVerbal Ability
550 $aAbility$wg
680 $iThe degree of actual competence to perform in scholastic or
educational activities (Note: For potential competence, use "Academic
Aptitude" for measured achievement, use "Academic Achievement")
688 $aStudent Ability (1966 1980)
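The conversion from Figure 1 to Figure 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's conversion software: the ERIC record is assumed to have been parsed into a dictionary keyed by field label, the function names are hypothetical, and only the rules visible in Figures 1 and 2 are implemented (control fields and narrower terms are omitted because neither figure shows how they are derived).

```python
import re

def split_terms(value):
    """Split a semicolon-delimited ERIC field into individual terms."""
    return [" ".join(term.split()) for term in value.split(";") if term.strip()]

def eric_to_marc(rec):
    """Convert a parsed ERIC record (a dict) into sorted (tag, data) pairs."""
    fields = [("072", "$a" + rec["GROUP"]), ("150", "$a" + rec["TERM"])]
    for uf in split_terms(rec.get("UF", "")):
        # A trailing "(yyyy yyyy)" lifespan marks a formerly valid term
        former = re.fullmatch(r"(.+?) \(\d{4}[ -]\d{4}\)", uf)
        if former:
            fields.append(("450", "$wa$a" + former.group(1)))
            fields.append(("688", "$a" + uf))  # history note keeps the dates
        else:
            fields.append(("450", "$a" + uf))
    for rt in split_terms(rec.get("RT", "")):
        fields.append(("550", "$a" + rt))          # related term, no $w
    for bt in split_terms(rec.get("BT", "")):
        fields.append(("550", "$a" + bt + "$wg"))  # $wg flags a broader term
    if rec.get("SCOPE"):
        fields.append(("680", "$i" + " ".join(rec["SCOPE"].split())))
    # sorted() is stable, so fields sharing a tag keep their input order
    return sorted(fields, key=lambda field: field[0])

record = {
    "TERM": "Academic Ability",
    "RT": "Ability Grouping; Academic Achievement; Academic Aptitude; "
          "Academic Aspiration; Academically Gifted; Aptitude Treatment "
          "Interaction; Cognitive Ability; College Entrance Examinations; "
          "High Risk Students; Intelligence; Scholarship; Spatial Ability; "
          "Student Characteristics; Verbal Ability",
    "BT": "Ability",
    "UF": "Scholastic Ability; Student Ability (1966 1980)",
    "GROUP": "120",
}

fields = eric_to_marc(record)
for tag, data in fields:
    print(tag, data)
```

Run on the Figure 1 data, the sketch yields the fourteen plain 550 fields, the two 450 fields, and the 688 note described in the text.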
MARC field and subfield statistics are provided in Appendices 1-4 for the following versions of the files:
ERIC
Statistics for complete file (Appendix 1)
Statistics for subset without mapping data (Appendix 2)
Statistics for subset with mapping data (Appendix 3)
LCSH
Statistics for complete file (Appendix 4)
As these statistics show, LCSH is a large vocabulary with more than 200,000 preferred terms (MARC tag 150) and nearly
as many topical non-preferred terms (MARC tag 450). In contrast, the ERIC Thesaurus has about 6,000 preferred terms
and 4,500 non-preferred terms. Although these statistics do not provide information about the potential subject overlap
between ERIC and LCSH, the sheer size of the LCSH file compared with ERIC leads us to expect a favorable match rate.
Statistics are provided for the subset of ERIC records, without and with mapping data, reported in this paper. This subset
is described in detail in section 4.2.
4.2 Matching vocabulary terms
After the ERIC file is encoded in the MARC Authority format, the ERIC vocabulary is matched to the LCSH vocabulary. Using
a series of computer programs, all preferred terms (MARC tag 150) and non-preferred terms (MARC tag 450) in the source
and target vocabularies are matched. Differences in spacing, capitalization, and punctuation are ignored during the
matching process. The following terms are considered matches:
ERIC Thesaurus Term     LCSH Term
Alzheimers Disease      Alzheimer's disease
Nurses Aides            Nurses' aides
Currently, plural versus singular forms, terms that differ only by the presence or absence of a parenthetical qualifier, and
terms with a qualifier introduced by a comma are not being matched. These refinements would likely improve the match
rate and will be employed in the next phase of the project.
ERIC Thesaurus Term     LCSH Term
Echolocation            Echolocation (Physiology)
Crack                   Crack (Drug)
Radiology               Radiology, Medical
Rh factors              Rh factor
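The normalization described in this section can be approximated with a short function. This is a sketch under our own assumptions about what "ignoring" spacing, capitalization, and punctuation means in practice (lowercasing, deleting punctuation characters, and collapsing runs of whitespace); the project's matching programs may differ in detail.

```python
import re

def normalize(term):
    """Reduce a term to a comparison key that ignores case, punctuation, and spacing."""
    term = term.lower()
    term = re.sub(r"[^\w\s]", "", term)  # delete punctuation characters
    return " ".join(term.split())        # collapse runs of whitespace

def is_match(source_term, target_term):
    """True when two terms are identical after normalization."""
    return normalize(source_term) == normalize(target_term)

# Pairs from the first table: treated as matches
print(is_match("Alzheimers Disease", "Alzheimer's disease"))
print(is_match("Nurses Aides", "Nurses' aides"))

# Pairs from the second table: not matched by this rule
print(is_match("Echolocation", "Echolocation (Physiology)"))
print(is_match("Radiology", "Radiology, Medical"))
print(is_match("Rh factors", "Rh factor"))
```

Qualifier stripping and singular/plural handling, the refinements mentioned above, would be additional steps applied before the comparison.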
A total of 3,797 ERIC terms were matched to LCSH and categorized according to the following match types:
PT/PT: An exact match (after normalization) of a preferred term (PT) in the source vocabulary to a preferred term (PT) in the target vocabulary
PT/NPT: An exact match of a preferred term (PT) (source) to a non-preferred term (NPT) (target)
NPT/NPT: An exact match between a non-preferred term (NPT) (source) and a non-preferred term (NPT) (target)
NPT/PT: An exact match between a non-preferred term (NPT) (source) and a preferred term (PT) (target)
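The four match types can be assigned mechanically once every term in the target vocabulary is indexed by its normalized form. The sketch below assumes each vocabulary record is a dictionary with one preferred term and a list of non-preferred terms; that layout, the function names, and the compact normalization helper are illustrative assumptions. The sample terms are taken from the Adolescents example in section 4.3.

```python
import re

def normalize(term):
    """Comparison key: lowercase, no punctuation, single spaces (assumed rule)."""
    return " ".join(re.sub(r"[^\w\s]", "", term.lower()).split())

def terms_with_roles(rec):
    """Yield (role, term) for the preferred and non-preferred terms of a record."""
    yield "PT", rec["pt"]
    for term in rec["npt"]:
        yield "NPT", term

def index_terms(records):
    """Map each normalized target term to (role, preferred term of its record)."""
    index = {}
    for rec in records:
        for role, term in terms_with_roles(rec):
            index.setdefault(normalize(term), []).append((role, rec["pt"]))
    return index

def find_matches(source_records, target_index):
    """Return (match type, source term, target preferred term) tuples."""
    matches = []
    for rec in source_records:
        for role, term in terms_with_roles(rec):
            for target_role, target_pt in target_index.get(normalize(term), []):
                matches.append((role + "/" + target_role, term, target_pt))
    return matches

eric = [{"pt": "Adolescents", "npt": ["Adolescence", "Teenagers"]}]
lcsh = [
    {"pt": "Teenagers", "npt": ["Adolescents"]},
    {"pt": "Adolescence", "npt": []},
]

matches = find_matches(eric, index_terms(lcsh))
for match in matches:
    print(match)
```

The output is a list of candidate mappings only; whether each candidate is a valid mapping still has to be decided by review.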
4.3 Evaluating matches
Four categories of ERIC terms were reviewed and analyzed (numbers in parentheses are ERIC category codes):
Learning & perception (110)
Individual development & characteristics (120)
Health & safety (210)
Disabilities (220)
This subset comprises about 12% of ERIC preferred terms. Statistics for PT/PT matches and PT/NPT matches are shown in
Table 2. Columns 3 and 4 show the number of term matches and concept matches for PT/PT matches and columns 5 and
6 present this information for PT/NPT matches.
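Table 2. PT/PT and PT/NPT matches
Category                                        Preferred   PT/PT term   PT/PT concept   PT/NPT term   PT/NPT concept
                                                terms       matches      matches         matches       matches
Learning & perception (110)                     164         49           49              10            8
Individual development & characteristics (120)  269         83           81              25            23
Health & safety (210)                           227         129          127             31            30
Disabilities (220)                              113         37           37              12            10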
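The rates discussed in this section follow directly from the counts in Table 2, as the short calculation below shows. The counts are copied from the table; the variable and function names are our own.

```python
# (term matches, concept matches) per ERIC category, copied from Table 2
pt_pt = {"110": (49, 49), "120": (83, 81), "210": (129, 127), "220": (37, 37)}
pt_npt = {"110": (10, 8), "120": (25, 23), "210": (31, 30), "220": (12, 10)}
preferred_terms = {"110": 164, "120": 269, "210": 227, "220": 113}

def totals(table):
    """Sum term matches and concept matches over all categories."""
    term_matches = sum(t for t, _ in table.values())
    concept_matches = sum(c for _, c in table.values())
    return term_matches, concept_matches

def concept_rate(table):
    """Share of term matches that represent equivalent concepts."""
    term_matches, concept_matches = totals(table)
    return concept_matches / term_matches

equivalent = totals(pt_pt)[1] + totals(pt_npt)[1]
subset_size = sum(preferred_terms.values())

print(f"PT/PT concept rate:  {concept_rate(pt_pt):.1%}")
print(f"PT/NPT concept rate: {concept_rate(pt_npt):.1%}")
print(f"Equivalent concepts: {equivalent} of {subset_size} "
      f"({equivalent / subset_size:.1%})")
```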
About 99% of PT/PT matches were found to represent equivalent concepts in the two vocabularies and 91% of PT/NPT
matches represent equivalent concepts. Very few false matches were observed for these two match types. A false match
occurs when terms from the vocabularies are identical but the concepts represented are different. Some examples of
false matches are:
Term ERIC LCSH
A total of 365 (294 + 71) equivalent concepts were identified. This is 47% (365/773) of the preferred terms in the ERIC
subset. All matches in the subset were manually reviewed to determine which matches represented valid mappings. The
following guidelines established in the Northwestern University LCSH/MeSH mapping project (Olson and Strawn 1997)
were applied in the evaluation:
Mapped terms should have generally the same scope in both vocabularies. One should not be broader or narrower
than the other.
Source vocabulary terms are mapped only to main terms in LCSH (main headings). An exception was made for
non-preferred terms that matched a subdivided LCSH (main heading + subheading). For example:
(PT matches NPT)
ERIC             LCSH                          Match Type   Valid Mapping
PT: Ametropia    PT: Eye--Refractive errors    PT/NPT       Yes
                 NPT: Ametropia
One to one mappings are preferred, but a term in the source vocabulary could be mapped to more than one term in
the target vocabulary when multiple terms are needed to form an equivalent concept.
(PT matches PT) and (NPT matches PT)
ERIC                LCSH                Match Type   Valid Mapping
PT: Cleft Palate    PT: Cleft Palate    PT/PT        Yes
NPT: Cleft Lip      PT: Cleft Lip       NPT/PT       Yes
(NPT matches NPT) and (NPT matches PT)
ERIC                             LCSH                 Match Type   Valid Mapping
PT: Extraversion Introversion    PT: Extraversion     NPT/NPT      Yes
NPT: Ambiversion                 NPT: Extroversion
NPT: Extroversion
NPT: Introversion                PT: Introversion     NPT/PT       Yes
The match types guided our review of the matches. Matches were coded by type and each type was assigned a different
color. PT/PT (white) matches were reviewed first, followed by PT/NPT (green). Evaluation of these matches was relatively
straightforward since most involved one-to-one matches. NPT/NPT (yellow) and NPT/PT (blue) were more complex to
review because they often involved matches to multiple terms in the target vocabulary.
(PT matches NPT) and two (NPT matches PT)
ERIC                LCSH                Match Type   Valid Mapping
PT: Adolescents     PT: Teenagers       PT/NPT       Yes
                    NPT: Adolescents
NPT: Adolescence    PT: Adolescence     NPT/PT       No
NPT: Teenagers      PT: Teenagers       NPT/PT       Yes
In the example above, the NPT/PT match on the term Adolescence is an invalid mapping because the ERIC term and the
LCSH term represent different concepts. The ERIC term Adolescents is for works on young people, 13-17 years of age.
The LCSH term Adolescence is for works on the physiological, psychological, or social development of adolescents. The
ERIC term, Adolescent Development, is a better match for the latter term. For terms that matched three or more LCSH,
e.g. Neurological Impairments, the review could be quite time-consuming and sometimes did not yield a correct
mapping. In the subset, NPT/NPT matches represent equivalent concepts about 81% of the time, and NPT/PT matches
represent equivalent concepts about 55% of the time. This last set of statistics should be viewed with some caution,
given the small number of matches analyzed. Even so, the mapping results do have some interesting implications for
future mapping projects.
If the term/concept mapping rate is constant within a vocabulary, it should be possible to predict the expected mapping
rate for a vocabulary based on a review of a sample of matches. Further, if the false match rate can be predicted reliably,
review of matches with a high term/concept mapping rate (PT/PT and PT/NPT, Table 2) could be dispensed with when the
false match rate is below a particular threshold. Only those types of matches with low term/concept mapping rates
(NPT/NPT and NPT/PT, Table 3) would need to be reviewed. Further, for matches requiring review, more experienced
reviewers could be assigned to complex matches while less experienced reviewers could be given simpler matches.
Table 3. NPT/NPT and NPT/PT matches
Category                                        Total       NPT/NPT term   NPT/NPT concept   NPT/PT term   NPT/PT concept
                                                matches     matches        matches           matches       matches
Learning & perception (110)                     12          6              6                 6             4
Individual development & characteristics (120)  57          16             12                41            22
Health & safety (210)                           60          22             n/a               38            n/a
Disabilities (220)                              30          15             n/a               15            n/a
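The review strategy suggested in section 4.3 can be made concrete with a small sketch. It treats every term match that did not yield an equivalent concept as a false match, which is a simplification, and it uses a threshold of 10% that is purely hypothetical. The counts are the totals from Tables 2 and 3; for Table 3 only categories 110 and 120 are included, because concept matches are not available for the other two.

```python
# (term matches, concept matches) summed over the reviewed categories
reviewed = {
    "PT/PT": (298, 294),    # Table 2, all four categories
    "PT/NPT": (78, 71),     # Table 2, all four categories
    "NPT/NPT": (22, 18),    # Table 3, categories 110 and 120 only
    "NPT/PT": (47, 26),     # Table 3, categories 110 and 120 only
}

MAX_FALSE_MATCH_RATE = 0.10  # hypothetical threshold, not from the paper

def false_match_rate(term_matches, concept_matches):
    """Share of term matches that did not yield an equivalent concept."""
    return 1 - concept_matches / term_matches

def needs_review(term_matches, concept_matches, threshold=MAX_FALSE_MATCH_RATE):
    """True when the estimated false match rate exceeds the threshold."""
    return false_match_rate(term_matches, concept_matches) > threshold

to_review = [
    match_type
    for match_type, (term_matches, concept_matches) in reviewed.items()
    if needs_review(term_matches, concept_matches)
]

for match_type, counts in reviewed.items():
    print(f"{match_type:8s} false match rate {false_match_rate(*counts):.1%}")
print("Match types still needing manual review:", to_review)
```

With these figures only the NPT/NPT and NPT/PT matches exceed the assumed threshold, which agrees with the division of labor proposed in the text.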
5 Intervocabulary linking
Vocabulary links are stored in MARC fields 7XX. Using these fields, we can encode the following:
name or code of the target vocabulary
mapped term
control number or unique identifier for the mapped term
identity of the mapping organization
In the following example, the first two 750 fields are mappings to LCSH. This information is coded in the indicator value.
The first two character positions at the beginning of a field are called indicators. These character positions can contain
information that interprets or supplements the data found in the field. The unique identifier for the term is coded in the $0
subfield and the organization that supplied the mapping data in the $5 subfield (e.g. OCoLC-O). The code, OCoLC-O, is
the MARC organization code for OCLC's Office of Research. MARC organization codes are used to represent names of
libraries and other organizations that need to be identified in the bibliographic environment. The last 750 field is a
mapping to MeSH with the unique MeSH identifier coded in the $0 subfield. No information about the mapping organization is
provided. The unique identifiers for the mapped terms are linked to Web-accessible versions of LCSH and MeSH. Mapping
data of this type can be used to create high quality terminology services. For example, terminology services that support
search and retrieval might use the full range of available mappings, while services invoked during indexing or cataloging
might use only mappings produced by a specific organization or for a given vocabulary.
A legitimate concern about vocabulary mapping is how the mappings will be maintained. Although not a trivial task,
mappings can be maintained with the help of software that tracks changes to vocabulary term records. Changes to
vocabulary terms are recorded in a number of ways, e.g. by data in a vocabulary record that indicates when the record
was last modified, by notes fields that chronicle changes to a vocabulary term (see field 688 in the MARC record
examples), and through notifications of additions and changes distributed by vocabulary owners. Depending on the
nature of the changes, human review may be needed to determine if mappings are still valid when a vocabulary term
changes.
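One of the maintenance signals mentioned above, the date a record was last modified, can be checked automatically. The sketch below assumes that the date on which each mapping was last reviewed is stored alongside the mapping, which is our assumption and not a feature described in this paper. It compares that date with the MARC 005 date-time stamp of the target record. The identifiers are those shown in Figure 3; all dates are invented for illustration.

```python
from datetime import datetime

def parse_005(value):
    """Parse a MARC 005 date-time stamp such as '20031117154238.0'."""
    return datetime.strptime(value, "%Y%m%d%H%M%S.%f")

def mappings_to_recheck(mappings, target_records):
    """Return identifiers of mapped terms whose records changed after review."""
    stale = []
    for mapping in mappings:
        modified = parse_005(target_records[mapping["target_id"]])
        if modified > mapping["reviewed"]:
            stale.append(mapping["target_id"])
    return stale

# Review dates and 005 values below are invented for illustration
mappings = [
    {"target_id": "sh 85041379", "reviewed": datetime(2003, 11, 17)},
    {"target_id": "sh 00009368", "reviewed": datetime(2003, 11, 17)},
]
target_records = {
    "sh 85041379": "20040105093000.0",  # modified after the review date
    "sh 00009368": "20021001120000.0",  # unchanged since before the review
}

stale = mappings_to_recheck(mappings, target_records)
print(stale)
```

A check of this kind only flags candidates; as noted above, human review may still be needed to decide whether a flagged mapping remains valid.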
Figure 3. ERIC record with mapping data
001 ERIC03056
003 OCoLC-O
005 20031117154238.0
008 031118 n|a|znn|bb||||||||||| ||an| ||| d
040 $beng$cOCoLC-O$dOCoLC-O$eericd
072 7 $a110$2ericd
150 $aEidetic Imagery
450 $wa$aEidetic Images
450 $aPhotographic Memory
550 $aVisualization
550 $aMemory$wg
680 $iVividly clear, detailed imagery of something (usually visual) that has been
previously perceived
688 $aEidetic Images (1967 1980)
750 0 $aEidetic imagery$0(DLC)sh 85041379 $5OCoLC-O
750 0 $aPhotographic memory$0(DLC)sh 00009368 $5OCoLC-O
750 2 $a Eidetic Imagery$0(DNLM)D004538
In this example, the LCSH terms are linked to LC subject authority records accessible through the OAICat framework.
These records are accessible to users via a browser and to machines through the OAI-PMH Web services mechanisms.
The MeSH link generates a search of the MeSH vocabulary using the search features of the MeSH Browser.
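To show how the elements listed at the start of this section can be recovered from a linking field, the following sketch parses 750 fields written in the display form used in Figure 3 (tag, one indicator digit, then subfields introduced by $). The function name and the returned dictionary keys are illustrative assumptions, and the parser handles only this display form, not MARC communications format or MARC in XML.

```python
import re

# Indicator values as used in Figure 3: 0 for LCSH, 2 for MeSH
VOCABULARIES = {"0": "LCSH", "2": "MeSH"}

def parse_link(line):
    """Parse a displayed 7XX linking field into its mapping elements."""
    tag, indicator, data = re.fullmatch(r"(\d{3}) (\d) (.*)", line).groups()
    # Each subfield is a "$" plus a one-character code, followed by its value
    subfields = {
        code: value.strip() for code, value in re.findall(r"\$(\w)([^$]*)", data)
    }
    return {
        "tag": tag,
        "vocabulary": VOCABULARIES.get(indicator, "unknown"),
        "term": subfields.get("a"),
        "identifier": subfields.get("0"),
        "mapped_by": subfields.get("5"),  # None when no organization is given
    }

lcsh_link = parse_link("750 0 $aEidetic imagery$0(DLC)sh 85041379 $5OCoLC-O")
mesh_link = parse_link("750 2 $a Eidetic Imagery$0(DNLM)D004538")

print(lcsh_link)
print(mesh_link)
```

A terminology service could use the vocabulary and mapped_by values to restrict results to a given target vocabulary or to mappings supplied by a specific organization.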
6 Next steps
Our plans for the near term include refining the matching software and developing improved tools for reviewers. When
the review of the ERIC/LCSH matches is complete, the file of mappings will be made available to other researchers. The
file will be available in MARC in XML and also encoded according to version 0.5 of the Zthes schema. We also anticipate
making this file available via OAI-PMH and for searching using SRU/SRW and the Zthes profile. See the Terminology
Services project Web site for details.
Acknowledgements
We thank the reviewers of this paper for their many helpful comments and suggestions.
References
Doerr, M. (2001) "Semantic Problems of Thesaurus Mapping". Journal of Digital Information 1(8)
http://jodi.tamu.edu/Articles/v01/i08/Doerr/
Gardner, T. (2001) "An Introduction to Web Services". Ariadne (29) http://www.ariadne.ac.uk/issue29/gardner/
Koch, T. (2003) "Activities to advance the powerful use of vocabularies in the digital environment: Structured overview"
http://www.lub.lu.se/~traugott/drafts/seattlespecvocab.html
Lancaster, F. W. and L. Smith (1983) "Compatibility Issues Affecting Information Systems and Services". General
Information Programme and UNISIST, PGI-83/WS/23 (Paris: UNESCO)
Mandel, C. (1987) "Multiple Thesauri in Online Library Bibliographic Systems". Cataloging Distribution Service (Library of
Congress: Washington, D.C.)
Olson, T. and G. Strawn (1997) "Mapping the LCSH and MeSH Systems". Information Technology and Libraries, 16(1), 5-19
O'Neill, E. and L. Chan (2003) "FAST (Faceted Application of Subject Terminology): A Simplified LCSH-based Vocabulary".
World Library and Information Congress: 69th IFLA General Conference and Council, 1-9 August, Berlin
http://www.ifla.org/IV/ifla69/papers/010eONeill_MaiChan.pdf
Tennant, R. (2002) "Digital Libraries: What To Know About Web Services". Library Journal 12 (July 15)
http://www.libraryjournal.com/index.asp?layout=articleArchive&articleid=CA231639
Van de Sompel, H., Young, J. and T. Hickey (2003) "Using the OAI-PMH... Differently". D-Lib Magazine 9(7/8)
http://www.dlib.org/dlib/july03/young/07young.html
Vizine-Goetz, D. (1998) "Popular LCSH with Dewey Numbers". In Annual Review of OCLC Research 1997
http://digitalarchive.oclc.org/da/ViewObject.jsp?objid=0000003449
Whitehead, C. (1990) "Mapping LCSH into Thesauri: the AAT Model". In Beyond the Book: Extending MARC for Subject
Access, edited by T. Petersen and P. Molholt (Boston: G.K. Hall), p. 81
Zeng, M. and L. Chan (2003) "Trends and issues in establishing interoperability among knowledge organization systems".
Journal of the American Society for Information Science and Technology, published online 16 Dec 2003
Links
Canadian Subject Headings (CSH) http://www.nlcbnc.ca/6/23/indexe.html
Colorado Digitization Program Western States Dublin Core Metadata Best Practices
http://www.cdpheritage.org/westerntrails/wt_bpmetadata.html
Dspace "Metadata" http://dspace.org/technology/metadata.html
ePrints UK. "Using simple Dublin Core to describe eprints" http://www.rdn.ac.uk/projects/eprintsuk/docs/simpledc
guidelines/
ERIC (2004) Educational Resources Information Center (ERIC) http://www.eric.ed.gov/index.html
Getty Vocabulary Program http://www.getty.edu/research/conducting_research/vocabularies/
GSAFD experimental Web services
http://research.oclc.org/WebServices/GenreTermsAndSubjectHeadings/GenreTermsAndSubjectHeadings.asmx
Library of Congress Subject Headings (LCSH) http://www.loc.gov/cds/lcsh.html
Library of Congress Classification (LCC) http://lcweb.loc.gov/cds/mds.html#lccr
MARC21 Format for Authority Data (2003) Concise edition http://www.loc.gov/marc/authority/ecadhome.html
MARC21 Format for Classification Data (2002) Concise edition http://www.loc.gov/marc/classification/eccdhome.html
MARC Code List for Organizations (2004) http://www.loc.gov/marc/organizations/orgshome.html
MARC Standards: MARC in XML http://www.loc.gov/marc/marcxml.html
Medical Subject Headings (MeSH) http://www.nlm.nih.gov/pubs/factsheets/mesh.html
MeSH Browser http://www.nlm.nih.gov/mesh/mbinfo.html
OCLC Research: FAST: Faceted Application of Subject Terminology
http://www.oclc.org/research/projects/fast/default.htm
OCLC Research: OAICat repository framework http://www.oclc.org/research/software/oai/cat.htm
OCLC Research: Search & Retrieve on the Web http://www.oclc.org/research/projects/webservices/default.htm
OCLC Research: Terminology Services http://www.oclc.org/research/projects/mswitch/4_termservs.htm
SWAD-Europe Thesaurus Activity http://www.w3c.rl.ac.uk/SWAD/thes_links.html
Zthes: a Z39.50 Profile for Thesaurus Navigation http://zthes.z3950.org/
Appendices
Definitions for column labels
Tag 3-character MARC field tag
Occ Total number of this field in all records
%Recs Percent of records that contain this field
Occ/Rec Occurrence of this field divided by the total number of records
Len/Occ Average length of this field
Sub 1-character MARC subfield code
Occ Total number of this subfield in all records
Occ/Rec Occurrence of this subfield divided by the total number of records
Len/Occ Average length of this subfield
Appendix 1
ERIC Thesaurus encoded in MARC Authority Format
Field and Subfield Statistics
Tag Occ %Recs Occ/Rec Len/Occ Sub Occ Occ/Rec Len/Occ
001 6080 100.00 1.00 9.00
003 6080 100.00 1.00 7.00
040 6080 100.00 1.00 29.00 a 6080 100.00 7.00
b 6080 100.00 3.00
c 6080 100.00 7.00
d 6080 100.00 7.00
e 6080 100.00 5.00
072 6080 100.00 1.00 3.00 a 6080 100.00 3.00
150 6080 100.00 1.00 16.30 a 6080 100.00 16.30
450 4562 43.90 0.75 18.05 a 4562 43.90 17.86
w 873 11.71 1.00
550 68725 100.00 11.30 16.42 a 68725 100.00 16.24
w 11878 91.71 1.00
680 3774 62.07 0.62 148.18 i 3774 62.07 148.18
688 873 11.71 0.14 29.95 a 873 11.71 29.95
Appendix 2
ERIC Thesaurus encoded in MARC Authority Format
773-Record Subset without mapping data
Field and Subfield Statistics
Tag Occ %Recs Occ/Rec Len/Occ Sub Occ Occ/Rec Len/Occ
001 773 0.00 1.00 9.00
003 773 0.00 1.00 7.00
040 773 0.00 1.00 29.00 a 773 0.00 7.00
b 773 0.00 3.00
c 773 0.00 7.00
d 773 0.00 7.00
e 773 0.00 5.00
072 773 0.00 1.00 3.00 a 773 0.00 3.00
150 773 0.00 1.00 15.93 a 773 0.00 15.93
450 668 0.00 0.86 17.47 a 668 0.00 17.26
w 139 0.00 1.00
550 9365 0.00 12.12 15.94 a 9365 0.00 15.77
w 1595 0.00 1.00
680 520 0.00 0.67 140.28 i 520 0.00 140.28
688 139 0.00 0.18 29.43 a 139 0.00 29.43
Appendix 3
ERIC Thesaurus encoded in MARC Authority Format
773-Record Subset with mapping data
Field and Subfield Statistics
Tag Occ %Recs Occ/Rec Len/Occ Sub Occ Occ/Rec Len/Occ
001 773 100.00 1.00 9.00
003 773 100.00 1.00 7.00
005 773 100.00 1.00 16.00
040 773 100.00 1.00 29.00 a 773 100.00 7.00
b 773 100.00 3.00
c 773 100.00 7.00
d 773 100.00 7.00
e 773 100.00 5.00
072 773 100.00 1.00 3.00 a 773 100.00 3.00
150 773 100.00 1.00 15.93 a 773 100.00 15.93
450 668 49.94 0.86 17.47 a 668 49.94 17.26
w 139 13.71 1.00
550 5777 99.61 7.47 15.81 a 5777 99.61 15.62
w 1098 76.33 1.00
680 520 67.27 0.67 140.28 i 520 67.27 140.28
688 139 13.71 0.18 29.43 a 139 13.71 29.43
750 404 50.19 0.52 30.18 0 404 50.19 16.00
a 404 50.19 13.71
x 12 1.55 15.75
Appendix 4
LCSH (updated in November 2003) in MARC Authority Format
Field and Subfield Statistics
Tag Occ %Recs Occ/Rec Len/Occ Sub Occ Occ/Rec Len/Occ
001 277272 100.00 1.00 12.00
005 277272 100.00 1.00 16.00
008 277272 100.00 1.00 40.00
010 277272 100.00 1.00 12.07 a 277272 100.00 12.00
z 1503 0.45 12.00
035 5 0.00 0.00 6.20 a 5 0.00 6.20
040 277272 100.00 1.00 7.99 a 277272 100.00 3.06
b 32824 11.84 3.00
c 277272 100.00 3.00
d 145241 51.45 3.01
043 1 0.00 0.00 10.00 a 1 0.00 10.00
053 89983 29.46 0.32 10.95 a 89983 29.46 7.40
b 15728 5.30 5.95
c 21721 5.17 10.43
073 3286 1.19 0.01 13.32 a 5046 1.19 6.07
z 3286 1.19 4.00
100 19949 7.19 0.07 15.78 a 19949 7.19 14.48
b 23 0.01 2.39
c 133 0.05 18.02
d 714 0.26 8.94
q 19 0.01 15.74
t 21 0.01 13.67
v 154 0.05 12.92
x 1167 0.32 11.88
y 16 0.01 12.69
z 54 0.02 9.07
110 5644 2.04 0.02 37.25 a 5644 2.04 33.19
b 603 0.21 6.17
p 1 0.00 7.00
t 7 0.00 17.71
v 161 0.06 13.02
x 1111 0.36 14.09
y 52 0.02 18.77
z 33 0.01 9.85
111 7 0.00 0.00 24.29 a 7 0.00 21.57
v 1 0.00 5.00
x 2 0.00 7.00
130 465 0.17 0.00 27.38 a 465 0.17 9.69
f 2 0.00 4.00
l 21 0.01 9.00
p 67 0.02 6.45
v 114 0.04 16.40
x 336 0.10 16.63
y 9 0.00 15.22
150 202494 73.03 0.73 22.79 a 202494 73.03 18.55
v 3513 1.24 13.25
x 43118 13.80 14.42
y 2687 0.97 13.75
z 16385 5.84 9.40
151 45427 16.38 0.16 29.73 a 45427 16.38 23.29
v 694 0.24 14.06
x 13696 4.48 13.27
y 7758 2.80 12.95
z 76 0.03 7.79
180 2858 1.03 0.01 18.57 v 82 0.03 11.99
x 3453 1.03 14.61
y 108 0.04 14.04
z 10 0.00 12.00
181 2 0.00 0.00 27.50 x 1 0.00 21.00
z 2 0.00 17.00
182 34 0.01 0.00 16.71 y 34 0.01 16.71
185 392 0.14 0.00 21.78 v 419 0.14 19.72
x 14 0.01 19.43
260 714 0.26 0.00 125.81 a 1193 0.26 27.97
i 1380 0.26 40.92
360 3900 1.41 0.01 117.73 a 6316 1.23 25.42
i 7713 1.41 38.71
400 30220 3.41 0.11 14.59 a 30220 3.41 14.21
b 10 0.00 2.40
c 70 0.02 15.07
d 324 0.08 8.86
k 14 0.00 22.00
q 1 0.00 18.00
t 11 0.00 21.18
v 57 0.02 15.58
w 567 0.17 3.00
x 330 0.07 13.77
410 6134 1.24 0.02 38.44 a 6134 1.24 37.31
b 155 0.05 9.20
k 4 0.00 18.00
t 1 0.00 12.00
v 49 0.02 20.43
w 294 0.10 3.00
x 201 0.06 16.84
y 6 0.00 21.33
z 3 0.00 10.33
411 3 0.00 0.00 50.00 a 3 0.00 42.00
w 1 0.00 3.00
x 1 0.00 21.00
430 248 0.07 0.00 24.77 a 248 0.07 8.88
f 1 0.00 4.00
g 1 0.00 28.00
l 7 0.00 6.86
p 37 0.01 4.76
v 50 0.01 13.24
w 31 0.01 3.00
x 178 0.05 16.36
y 2 0.00 9.00
450 189834 35.10 0.68 21.59 a 189834 35.10 20.16
v 797 0.24 14.56
w 19091 6.33 3.00
x 12686 3.73 13.56
y 520 0.17 18.90
z 1899 0.64 10.96
451 37085 7.02 0.13 28.61 a 37085 7.02 26.78
v 160 0.06 17.42
w 3695 1.14 3.00
x 2188 0.57 15.91
y 1316 0.32 14.39
z 4 0.00 12.50
480 742 0.21 0.00 21.05 v 16 0.01 13.44
w 210 0.06 3.00
x 809 0.21 18.26
482 5 0.00 0.00 37.80 w 5 0.00 3.00
y 5 0.00 34.80
485 247 0.06 0.00 19.77 v 256 0.06 18.02
w 80 0.02 3.00
x 3 0.00 10.33
500 4193 1.19 0.02 15.46 a 4193 1.19 13.78
b 1 0.00 2.00
c 108 0.04 20.43
d 51 0.02 9.29
t 2 0.00 19.50
v 23 0.01 11.83
w 301 0.10 1.02
x 308 0.10 10.51
y 4 0.00 24.00
z 62 0.02 6.39
510 471 0.16 0.00 33.04 a 471 0.16 20.50
b 155 0.05 5.02
v 3 0.00 9.00
w 442 0.15 1.00
x 319 0.10 13.91
y 3 0.00 12.00
z 20 0.01 9.45
530 127 0.04 0.00 25.75 a 127 0.04 5.99
l 2 0.00 5.50
p 46 0.01 6.13
v 10 0.00 11.10
w 104 0.04 1.00
x 104 0.03 19.24
550 220509 56.84 0.80 17.74 a 220509 56.84 13.82
v 358 0.12 11.74
w 205025 56.10 1.00
x 16008 4.90 12.26
y 1192 0.38 12.26
z 50205 15.23 8.85
551 14009 4.48 0.05 22.73 a 14009 4.48 9.80
v 413 0.13 10.98
w 13774 4.41 1.00
x 12518 3.80 10.22
y 2046 0.52 16.78
z 70 0.01 7.90
580 780 0.27 0.00 14.86 w 780 0.27 1.00
x 782 0.27 13.82
581 1 0.00 0.00 18.00 w 1 0.00 1.00
z 1 0.00 17.00
585 188 0.06 0.00 13.06 v 192 0.06 11.81
w 188 0.06 1.00
667 3769 1.36 0.01 62.46 a 3769 1.36 62.46
670 245924 42.46 0.89 90.70 1 14 0.00 114.64
2 7 0.00 76.43
3 3 0.00 314.00
4 9 0.00 65.89
5 1 0.00 23.00
6 3 0.00 121.00
7 2 0.00 102.50
8 1 0.00 360.00
9 2 0.00 9.50
a 245926 42.46 48.44
b 144503 24.35 71.88
h 2 0.00 76.50
675 35640 12.85 0.13 40.27 a 83548 12.85 17.18
680 9464 3.33 0.03 163.18 a 4029 1.02 22.89
i 10729 3.33 135.35
681 8872 3.15 0.03 40.53 a 9094 3.15 22.53
i 9093 3.15 17.01
682 12 0.00 0.00 167.00 a 9 0.00 41.11
i 21 0.00 77.81
781 28049 10.11 0.10 27.63 v 2 0.00 14.50
x 4 0.00 17.00
z 53046 10.11 14.61