J. Chem. Inf. Comput. Sci. 1985, 25, 410-412

Procedures for Sorting Chemical Names for Chemical Abstracts Indexes

ALLEN C. ISENBERG, JOANN T. LEMASTERS, ABE F. MAXWELL, and GERALD G. VANDER STOUW*
Chemical Abstracts Service, Columbus, Ohio 43210

Received January 25, 1985

In the preparation of each Chemical Substance Index to Chemical Abstracts (CA), nearly three-quarters of a million chemical substance names must be sorted by computer program into an invariant order. This sorting is done on sortkeys that are generated from the character strings in the names and is done in a way that takes advantage of the data elements used by Chemical Abstracts Service (CAS) in preparing these names. The organization of CA index nomenclature and the rules used in sortkey generation are described.

INTRODUCTION

The Chemical Substance Index to Chemical Abstracts (CA) each year includes index entries that refer to nearly three-quarters of a million different chemical substances. These alphabetical indexes, which are published twice annually, are merged every 5 years into a collective index. The preparation of these volume and collective indexes requires that very large lists of chemical substance names be sorted into a consistent order, so that the user of the printed indexes can locate a substance of interest with confidence that it has been placed at the correct point in the index.

For many years the preparation of the CA indexes required the efforts of a group of clerical staff who devoted their time to sorting thousands of index entries typed on separate cards. Although manual sorting achieved remarkably consistent results, the rapid growth of the indexes during the 1950s and 1960s made maintaining the quality of these efforts increasingly difficult and expensive. Since the early 1970s, Chemical Abstracts Service (CAS) has used computer processing extensively in the preparation of its indexes.¹ These computer systems include programs that carry out, with no human intervention, the sorting of chemical names that was formerly done by hand. Recently published descriptions of two algorithms for sorting chemical names²,³ prompt us to describe the procedures that CAS uses in sorting names.

DATA ELEMENT STRUCTURE FOR CHEMICAL NAMES

To appreciate the way CAS sorts chemical names, it is necessary to understand two general aspects of the sorting process: first, the way in which CAS constructs a chemical name from individual data elements and ranks these data elements for sorting purposes; second, the way in which data elements that occur at the same ranking level are sorted by the use of sortkeys. The first two sections of this paper discuss data elements and their utility in sorting; the last section describes the use of sortkeys.

CAS uses an extensive and rigorous set of rules for generating chemical names. These rules, which are applied by human nomenclature experts with extensive computer support, ensure that a given chemical substance can be found at a predictable place in the printed Chemical Substance Index.⁴ The systematic names that result from these rules appear in the Index in an “inverted” form; i.e., that portion of the name that refers to a “parent” structure is given before the names of the structural fragments that are attached to that parent structure. Thus, for example, the name

    2-Butenedioic acid, 2-butyl-

gives the parent name 2-Butenedioic acid before the name of the attached substituent represented by the string 2-butyl-. The corresponding “uninverted” name would be

    2-butyl-2-butenedioic acid

In the inverted form of this name, the characters before the first comma (sometimes referred to as the “comma of inversion”) constitute the data element known as the heading parent. This data element normally has one of three forms: (a) a molecular skeleton name such as Butene, to which is attached the name of the principal functional group if one is present (dioic acid in this instance); (b) a functional parent compound in which no skeleton is expressed, such as Carbonic acid; (c) a trivially named parent such as Phenol or Urea. The names of the attached substituents, such as 2-butyl-, are included in a separate substituent data element.

The heading parent and substituent are two of the data elements that CAS uses for chemical names; the others are described later in this section. These data elements are assigned by the nomenclature specialist when a name is prepared.
As described in the next section of this paper, the data element identifications play an important role in the sorting programs. They are also important in formatting names for the printed indexes. The formatting programs use the data elements to determine, for example, that two names which sort together have identical heading parents; the heading parent then needs to be printed only in the first name and can be represented by a long dash in the second name. Similarly, if two esters of an acid sort together, the formatting process will cause the name of the acid to appear only once, with the two esters identified under it. The data element identifiers do not themselves appear in CAS printed services or online files, however.

Frequently a name contains a character string that describes a derivative of the principal functional group, such as the ester of an acid or the oxime or hydrazone of a ketone. Thus, for example, if the above name were modified to

    2-Butenedioic acid, 2-butyl-, dimethyl ester

the string dimethyl ester would constitute the name modification data element. This data element may also contain other information as, for example, in the following complex name modification:

    butyl ester, ion(1-), compd. with ethanamine (1:1)

© 1985 American Chemical Society
Downloaded by CKRN CNSLP MASTER on August 9, 2009 | Published on May 1, 2002 on http://pubs.acs.org | doi: 10.1021/ci00048a009

Another data element that frequently occurs in names is the stereochemical descriptor, which contains stereochemical or spatial information. In this example, the string (E)- describes the stereochemistry and completes the name:

    2-Butenedioic acid, 2-butyl-, dimethyl ester, (E)-

It is typically formatted after the name modification, if there is one, or after the substituent or heading parent. This data element may also contain data such as trans-, (R)-, and similar terms. (Certain configurational descriptors for stereoparents are expressed or implied in data elements other than the stereochemical descriptor.)

Two other data elements used in chemical names to differentiate between otherwise identical heading parents are the line formula and the homograph definition. The line formula differentiates between parents of different stoichiometric composition, such as line formulas CrCl2 and CrCl3 in the headings Chromium chloride (CrCl2) and Chromium chloride (CrCl3). The homograph definition distinguishes between heading parents having different meanings, such as alkaloid and mineral in the headings Serpentine (alkaloid) and Serpentine (mineral).

Another type of sorting differentiation results from the use of heading subdivisions to organize index headings with large numbers of entries. Four types of subdivisions are used: (1) Qualifiers divide the heading into separate areas of study according to the nature of the topics discussed in the original document, such as properties and reactions. (2) Categories divide the heading into different types of chemical derivatives such as esters, oximes, and polymers. (3) Six chemical substance particle headings (e.g., Alpha particle and Proton) are divided with special radiation qualifiers, biological effects and chemical and physical effects. (4) Alloy categories base and nonbase are used at alloy headings. The effect of any of these subdivisions is to group related index entries that would not otherwise be sorted together. Within any of these subdivided headings, all of the rest of the sorting described below applies.

SORTING BASED ON THE STRUCTURE OF CHEMICAL NAMES

The basic order of priority among the data elements used in sorting CAS chemical substance names is heading parent > line formula > homograph definition > substituent > qualifier > category > name modification > stereochemical descriptor. The homograph definition and line formula serve primarily to resolve sorting of heading parents in those cases where they are applicable; that is, they are considered to be part of the heading parent for sorting. Thus, all names with the same heading parent are brought together by the sorting of that data element. Names having the same heading parent are then sorted on the other data elements present, on the basis of their respective priorities.

An important principle invoked in this data element based sorting is that of “nothing before something”. The implication of this principle is that, for example, all of the names having the heading parent Benzenepropanal and no substituent will sort ahead of all those that have that same heading but have a substituent present.

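
The priority order and the “nothing before something” principle can be sketched in Python as a comparison of tuples. This is an illustration under stated assumptions rather than the CAS procedure: the `Entry` type and `priority_key` function are hypothetical, each data element is compared as a plain string instead of a generated sortkey, and an absent data element is represented by an empty string, which sorts before any non-empty string.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    # Field order mirrors the priority order stated in the text.
    heading_parent: str
    line_formula: str = ""
    homograph_definition: str = ""
    substituent: str = ""
    qualifier: str = ""
    category: str = ""
    name_modification: str = ""
    stereochemical_descriptor: str = ""


def priority_key(entry: Entry) -> tuple:
    # Tuples compare field by field, so higher-priority elements decide first.
    # An empty string compares lower than any non-empty one, which gives
    # "nothing before something" at every level.
    return tuple(entry)


parent = "2-Butenedioic acid"
entries = [
    Entry(parent, substituent="2-butyl-", name_modification="dimethyl ester",
          stereochemical_descriptor="(E)-"),
    Entry(parent, substituent="2-butyl-"),
    Entry(parent, name_modification="dimethyl ester"),
    Entry(parent),
    Entry(parent, substituent="2-butyl-", name_modification="dimethyl ester"),
]

ordered = sorted(entries, key=priority_key)
for e in ordered:
    print(", ".join(field for field in e if field))
```

In this sketch the entry with a name modification but no substituent lands ahead of every entry that has a substituent, because the substituent field is compared before the name modification field.
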
In practice, names are sorted on the basis of a single sort with a sortkey formulated from all of the separate data elements, rather than by consecutive sorts each using a separate data element. However, it is easier to understand the sort order by initially thinking of it as a series of individual sorts. Consider, for example, the list of names with the same heading parent 2-Cyclohexen-1-ol shown in Table I. These names are shown in the table with the data elements labeled. As can be seen, all of those with no substituent come before all those with a substituent; in turn, all of those in each group without a name modification come before those with a name modification.

Table I. Names with the Same Heading Parent 2-Cyclohexen-1-ol

parent               substituent                       name modification    stereochemical descriptor
2-Cyclohexen-1-ol
2-Cyclohexen-1-ol                                                           (0
2-Cyclohexen-1-ol                                      acetate
2-Cyclohexen-1-ol                                      acetate              (S)-
2-Cyclohexen-1-ol    1-methyl-4-(1-methylethenyl)-
2-Cyclohexen-1-ol    1-methyl-4-(1-methylethenyl)-                          (1R-trans)-
2-Cyclohexen-1-ol    1-methyl-4-(1-methylethenyl)-     benzoate
2-Cyclohexen-1-ol    1-methyl-4-(1-methylethenyl)-     benzoate             (1S-cis)-

The following list further illustrates the principle. All of the names with the heading parent 2-Butanol are brought together, as are those with the heading parent 2-Butanone. In both cases, the principle of nothing before something places all of the entries without a substituent before those with a substituent. In turn, within each of these groupings the entries without a name modification occur before those with a name modification. In the case of the heading parent 2-Butanone, the entries without a substituent are subdivided by the categories oximes and hydrazones:

    2-Butanol
    2-Butanol, sodium salt
    2-Butanol, 1-chloro-
    2-Butanol, 4-(trimethylstannyl)-
    1-Butanone, 1-phenyl-
    2-Butanone, hydrazones
        dimethylhydrazone
    2-Butanone, oximes
        O-methyloxime
        oxime
    2-Butanone, 3-(4-acetylphenyl)-
    2-Butanone, 3-ethoxy-1,1-dihydroxy-
    2-Butanone, 3-ethoxy-1,1-dihydroxy-, oxime
    Butanoyl chloride

USE OF SORTKEYS

Sorting at the same data element level is accomplished with sortkeys generated from the data in each data element rather than the data values themselves. Simple character-by-character sorting of the data elements will give a different order than the order desired for the index. The problem is particularly acute with chemical nomenclature, since chemical names often begin with numerical locants; a character-by-character sorting would give these locants undue importance compared to the alphabetic characters in the names. Consider, for example, the following list of heading parents. On the left they are listed in the order that would result from character-by-character sorting; on the right, the heading parents are listed in the desired order, based primarily on alphabetical sorting:

character by character      alphabetical
1-Buten-3-yne               2-Butene
2-Butene                    2-Butene-1,4-diol
2-Butene-1,4-diol           2-Butene-2,3-diol
2-Butene-2,3-diol           1-Buten-3-yne

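
The two orderings above can be reproduced with a short Python sketch. It assumes a simplified two-part key: the Roman alphabetic characters of the name come first, and the full original string is used only to break ties. The function `alphabetic_first_key` and the tie-breaking rule are assumptions made for this illustration; the actual CAS sortkey has more structure than is shown here.

```python
def alphabetic_first_key(name: str) -> tuple[str, str]:
    # Primary key: Roman letters only, lower-cased, so locants and
    # punctuation do not influence the main ordering.
    letters = "".join(ch for ch in name if ch.isascii() and ch.isalpha()).lower()
    # Secondary key (assumed here): the original string, to separate names
    # whose letters are identical, such as the two butenediols.
    return (letters, name)


heading_parents = [
    "2-Butene-2,3-diol",
    "1-Buten-3-yne",
    "2-Butene",
    "2-Butene-1,4-diol",
]

# Left column: plain character-by-character comparison.
print(sorted(heading_parents))
# Right column: alphabetic characters take precedence over locants.
print(sorted(heading_parents, key=alphabetic_first_key))
```

With the letters-only primary key, 1-Buten-3-yne reduces to "butenyne" and therefore follows every name that reduces to "butene" or "butenediol", regardless of its leading locant.
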
The alphabetical sorting is achieved by generating a sortkey from each heading parent on the left. The process of generating a sortkey first divides the data element into three fields. The first field contains all the Roman alphabetics in the data