
Chemical Information Sources/Structure Searches

Introduction
STRUCTURE SEARCHING utilizes a graphic depiction of the chemical structure as input for a search. Such
searches are generally run against the data in online chemical dictionary files, such as STN's Registry File.
Depending on the type of structure search allowed by the system, the complete molecule or any compound
containing the structure of the molecule will be retrieved as an answer set. Unlimited substitution of the input
molecule may be allowed at free sites on the molecule (a FULL SUBSTRUCTURE SEARCH) or substitution may
be limited to certain sites (a CLOSED SUBSTRUCTURE SEARCH). On the STN system, once an answer set is
formed in the Registry File, it can be crossed over to the CA or other files to conduct further subject searches of the
compounds thus isolated in a structure search. In these cases, it is actually the CAS Registry Number for the
compounds that is being searched in the crossover files. Note that it is now possible to conduct a search that takes
into account the stereochemistry of the chiral centers and double bonds. Stereo searching [1] can be performed in
the Registry File and the Beilstein File on STN or on the Reaxys [2] system (which includes both Beilstein and
Gmelin). A SIMILARITY SEARCH finds target molecules that are like the query structure in some respects. The
similarity might lie in some biological property, such as drug absorption or toxicology with respect to metabolism.
Often, it is the similarity in functional groups that is measured. Finally, MARKUSH STRUCTURE SEARCHING, an
important technique in patent searches that allows for considerable variability in the structures retrieved, is
another option in some files.
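As a rough illustration of measuring similarity in functional groups, the sketch below scores molecules by the overlap of their functional-group sets using the Tanimoto (Jaccard) coefficient. The choice of coefficient, the function name, and the hand-written group lists are assumptions made for illustration only; the text does not say which measure any particular system uses.

```python
def tanimoto(groups_a, groups_b):
    """Shared features divided by all distinct features (0.0 to 1.0)."""
    a, b = set(groups_a), set(groups_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


# Hypothetical, hand-assigned functional-group sets (not taken from any database)
query = {"amide", "ketone", "aromatic ring"}
candidates = {
    "candidate 1": {"amide", "ketone", "aromatic ring", "halide"},
    "candidate 2": {"ester", "aromatic ring"},
}

# Rank candidates from most to least similar to the query
ranked = sorted(
    candidates,
    key=lambda name: tanimoto(query, candidates[name]),
    reverse=True,
)
```

A real similarity search works on features derived from the structure itself rather than on hand-labeled sets, but the ranking idea is the same.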
Why Use Structure Searching?
There are many reasons to do a substructure search, among them:
• Particular structural features can be focused on.
• Unwanted features can be excluded.
• The complexities of nomenclature can be avoided.
• The novelty of a compound can be assessed.
• The structure(s) can be correlated with chemical or physical properties or biological activities.
• The structure(s) can be linked to chemical reaction databases to see model compounds or to look for specific
reaction conditions.
• Competitive products or market leads can be found.
In combination with other types of searches, structure searching is a very powerful supplement.
Structure Searches in the STN Registry and Other Files
Over 70,000,000 registered chemical substances appear in the Chemical Abstracts Service Registry File. Most of
those have been registered since 1965, but, of course, not all of the compounds in the Registry File were discovered
since that date. In 2002, Chemical Abstracts Service embarked on a project to retrospectively index all documents in
the CA database. Thus, many compounds that have had no new information published about them since the
establishment of the CA or CAPlus Files (i.e., since 1967) have now been added to the Registry File.
Most of the millions of compounds in the Registry File have their Registry Numbers linked to the databases on the
STN system [3]. The LC (File Locater) field of a Registry File record tells in which STN databases the Registry
Number is found. In addition to the Registry File, structure searches can be conducted in such databases on STN as
BEILSTEIN, CASREACT, and others. A similar file locater function is included in other chemical dictionary files,
such as NLM's ChemIDplus [4].
There are several types of structure searches possible in the Registry File, as well as different options for views of
the molecules and different methods of inputting the structure. SciFinder masks to a certain extent the relationship
between the Registry File and the CAPlus File, CASREACT, and other databases intertwined with its software.
SciFinder Structure Editing Screen
(Reproduced with permission of CAS, a division of the American Chemical Society.)
Within the SciFinder search stage itself, considerable information can be gleaned about the answer set to be
retrieved. In the Preview option, the sample answer set can be analyzed by atom attachments, or, if the drawn
structure contains them, by system-defined or user-defined variable groups. Once the structure is built and the
answer set is retrieved, such information can also be found for the full answer set. At this point, the search can
proceed as it would if the compounds had been identified by name or molecular formula searches, allowing you to
"Get References" from the CAPlus part of the SciFinder system or to link to any of the icons in the retrieved
Registry File records.
The structure search can be further refined with additional structural features or by limiting it to commercially
available substances. Once the search is refined, the references that have the Registry Numbers of the compounds
in their indexing can be retrieved.
With a suitable viewer, the image of the molecule can be viewed as a 3-D model.
CAS Registry connectivity data for the Isatin molecule viewed with WebLab Viewer Lite
In traditional, command-driven structure searching, when logging on to STN, the choice of terminal determines what
type of view of the molecule you will see. If one selects option 3 at the prompt:
TERMINAL (Enter 1, 2, 3 OR ?)
the structural depictions will be encoded with regular punctuation symbols found on a computer keyboard. Thus a
double bond might be indicated by a colon (:) or an equal sign (=). With the proper telecommunications software,
selecting option 2 will depict the structures as true graphical representations. That is the default option when using
STN Express with Discover! (front-end software that allows the building of the structures offline).
The following types of structure searches are possible on STN:
• EXACT SEARCH--retrieves the substance as drawn plus any stereoisomers, ionic substances, or homopolymers,
as well as isotopically labeled compounds with that structure
• FAMILY SEARCH--retrieves the same set of compounds as the EXACT search, but will also retrieve any
multi-component compounds represented in the Registry File (salts, mixtures, or copolymers)
• CLOSED SUBSTRUCTURE SEARCH--allows variable nodes at certain defined positions only
• FULL SUBSTRUCTURE SEARCH--retrieves any record in the file that has the structure input as the search
key.
With SciFinder, one of two options is available, depending on whether the Substructure Search Module is included
in the version of the software. The basic SciFinder search covers an exact and family search. The SSS module allows
the fuller search options.
There are actually several stages of a Registry File structure search. The first stage involves a screening of the huge
file for compounds that have the requisite substituents and other features, without regard to their position on the
molecule. The much more computer-intensive iteration stage involves an atom-by-atom, bond-by-bond look at the
candidate molecules isolated in the screen search. Since this stage requires so much of STN's computer resources,
there are limits on the number of compounds that can be looked at during the iterative stage. A sample search must
be run on approximately 5% of the file, after which a prediction as to whether the full file search will run to
completion is given. Assuming the prediction is favorable, the candidate molecules found in the screening of the full
file can be compared to the structure. Otherwise, the structure must be modified to be able to run to completion. With
SciFinder, there is some built-in intelligence that offers to "autofix" a molecule that might give the system trouble. It
is also wise to preview the SciFinder Scholar search to see what kinds of substances will be retrieved with the
structure as drawn.
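The screen-then-iterate idea described above can be sketched in a few lines. This is a toy model under stated assumptions, not STN's algorithm: molecules are plain dictionaries of heavy atoms and bonds, the screen only counts elements, hydrogens and bond orders are ignored, and the 5% sample search and system limits are not modeled. All names are hypothetical.

```python
from collections import Counter


def mol(atoms, bonds):
    """Toy molecule: atoms maps atom id -> element, bonds is a set of id pairs."""
    return atoms, {frozenset(b) for b in bonds}


def screen(query, candidate):
    """Stage 1: cheap check that the candidate has the required elements,
    without regard to their position in the molecule."""
    need = Counter(query[0].values())
    have = Counter(candidate[0].values())
    return all(have[el] >= n for el, n in need.items())


def neighbors(molecule, atom):
    """Ids of the atoms bonded to the given atom."""
    return {b for bond in molecule[1] if atom in bond for b in bond if b != atom}


def iterate(query, candidate):
    """Stage 2: atom-by-atom, bond-by-bond backtracking match."""
    q_atoms = list(query[0])

    def extend(mapping):
        if len(mapping) == len(q_atoms):
            return True
        q = q_atoms[len(mapping)]
        for c, element in candidate[0].items():
            if c in mapping.values() or element != query[0][q]:
                continue
            # Every already-mapped query neighbor must also be bonded in the candidate
            if all(frozenset({mapping[n], c}) in candidate[1]
                   for n in neighbors(query, q) if n in mapping):
                mapping[q] = c
                if extend(mapping):
                    return True
                del mapping[q]
        return False

    return extend({})


def substructure_search(query, file):
    """Screen the whole file first, then run the costly match on the survivors."""
    survivors = {name: m for name, m in file.items() if screen(query, m)}
    return [name for name, m in survivors.items() if iterate(query, m)]


# Query: a C-C-O chain (heavy atoms only)
query = mol({1: "C", 2: "C", 3: "O"}, [(1, 2), (2, 3)])
file = {
    "ethanol": mol({1: "C", 2: "C", 3: "O"}, [(1, 2), (2, 3)]),
    "dimethyl ether": mol({1: "C", 2: "O", 3: "C"}, [(1, 2), (2, 3)]),
    "methane": mol({1: "C"}, []),
}
hits = substructure_search(query, file)
```

In this toy file, methane is rejected by the screen, dimethyl ether passes the screen but fails the atom-by-atom stage, and only ethanol is returned.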
Structure Searching on Reaxys
It is also possible to do very precise structure searching on the Reaxys system, where the Beilstein Handbook of
Organic Compounds and the Gmelin Handbook of Inorganic and Organometallic Compounds collectively provide
comprehensive coverage of chemical research from the 18th century to the present. As of mid-2007, the Beilstein
database contained more than 10 million organic compounds and more than 8.7 million substances with information
on reactions.
Reaxys Structure Editing Screen with Isatin Molecule (above) and Two of the Search Results (below)
Unlike the STN system, where the type of structure search (exact, family, closed or full substructure; exact or
substructure on SciFinder) determines the type of compounds retrieved, CrossFire, the predecessor of the Reaxys
system, required the user to "set free sites" by indicating the number of substitutions allowed at given atoms or to
make other choices at the time of structure drawing in order to broaden or narrow the scope of a search. Free
sites were set either all at once in the Query Options menu (once the desired atoms had been selected) or atom by
atom by choosing the precise number of free sites allowed for each atom. Other options were the inclusion or
exclusion of isotopes and whether substances with a charge, radicals, etc. could be retrieved in the search.
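One simplified way to read "free sites" is as a cap on the number of extra substituents a matched atom may carry beyond what was drawn. The sketch below encodes only that reading; the function name and the convention that None means "unlimited" are assumptions, and CrossFire's actual rules (for example, how hydrogens were counted) are not described in the text.

```python
def within_free_sites(query_degree, target_degree, free_sites):
    """Return True if the matched atom respects the free-site limit.

    query_degree:  non-hydrogen attachments drawn on the query atom
    target_degree: non-hydrogen attachments on the matched candidate atom
    free_sites:    extra substituents permitted at this atom (None = unlimited)
    """
    extra = target_degree - query_degree
    if extra < 0:
        return False  # the candidate atom lacks a bond the query requires
    return free_sites is None or extra <= free_sites
```

With zero free sites at every atom, the query behaves like an exact match on attachments; leaving every atom unlimited approximates a full substructure search.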
As with SciFinder and other STN options for structure searching, CrossFire included a number of template files to
assist in building complex molecules. To be sure that you are properly drawing a functional group in structure
searching, it is best to choose it from a template file if available.
CrossFire also allowed predefined groups of variable atoms:
A = any atom
Q = any atom but C or H
M = a metal atom
X = halogen
The addition of an H to these symbols meant Hydrogen could also be one of the variable atoms. For example, XH
implied that any of the atoms F, Cl, Br, I, or At plus H would satisfy the search. Likewise, there were generic group
symbols to represent such things as carbocyclic or heterocyclic rings, alkyl, alkenyl, or alkynyl chain groups, etc.
Finally, the user could define generic groups if the predefined groups were not sufficient.
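The generic atom symbols above can be modeled as simple membership tests. This is a minimal sketch with assumptions flagged: the metal list is deliberately incomplete and purely illustrative, "A" is treated as excluding hydrogen unless written "AH" (one reading of the rule about adding H), and the names are hypothetical rather than CrossFire's own.

```python
HALOGENS = {"F", "Cl", "Br", "I", "At"}

# Illustrative and incomplete metal list (an assumption, not CrossFire's definition)
SOME_METALS = {"Li", "Na", "K", "Mg", "Ca", "Fe", "Cu", "Zn", "Pd", "Pt"}

GENERIC = {
    "A": lambda el: el != "H",               # any atom (hydrogen only via "AH")
    "Q": lambda el: el not in {"C", "H"},    # any atom but C or H
    "M": lambda el: el in SOME_METALS,       # a metal atom
    "X": lambda el: el in HALOGENS,          # a halogen
}


def matches(symbol, element):
    """True if the element satisfies a generic symbol such as 'X' or 'XH'."""
    allow_h = len(symbol) == 2 and symbol.endswith("H")
    base = symbol[:-1] if allow_h else symbol
    if allow_h and element == "H":
        return True
    return GENERIC[base](element)
```

For example, matches("XH", "H") and matches("XH", "Br") are both true, while matches("X", "H") is false.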
Reaxys surely includes some of these same features, but the author has not had sufficient experience with Reaxys to
comment on its capabilities.
Beilstein and Gmelin
Beilstein is for organic compounds, whereas Gmelin is for inorganic and organometallic compounds. Beilstein
covers compounds containing carbon along with the following elements:
H
Li, Be, B, C, N, O, F
Na, Mg, Si, P, S, Cl
K, Ca, As, Se, Br
Rb, Sr, Te, I
Cs, Ba
Compounds can be single components or salts and mixtures (if they have at least one organic component). Peptides
are covered if they contain twelve or fewer amino acids. Polymers or polycondensation products are not treated. The
following are not typically treated as Beilstein compounds, but would be found in Gmelin:
• CO, CS, CO2, CS2, COS, C3O2, C3S2
• Carbonic acid and its thio analogs along with their salts with inorganic cations
• HCN, HOCN, HSCN and the corresponding iso-acids and all the metal salts and complexes of these acids
• Dicyanogen
• Phosgene
• Metal salts of formic acid, acetic acid, and oxalic acid
Gmelin covers compounds not covered in Beilstein, i.e., inorganic and organometallic chemistry as well as related
fields such as mineralogy and metallurgy. Compounds are indexed with terms such as coordination compounds,
alloys, ceramics, and inorganic polymers.
Beilstein Lawson Numbers
Compounds in the Beilstein database are also indexed by a number that indicates various structural features. That is
the Lawson Number. It represents certain structural fragments and can be used for structural similarity searches. In
general, the smaller the Lawson Number, the more common the fragment. Every substance in Beilstein has at least
one Lawson Number assigned to it. Dividing the Lawson Number by 8 gives roughly the Beilstein system
number for the printed Beilstein volume that contains the compound. The compounds are divided into 3 major
groups in the printed Beilstein Handbook:
1. Acyclic Compounds, Volumes 1-4; System Numbers 1-449
2. Isocyclic Compounds, Volumes 5-16; System Numbers 450-2358
3. Heterocyclic Compounds, Volumes 17-27; System Numbers 2359-4720.
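The divide-by-8 rule and the three system-number ranges above can be combined into a small lookup. This is only a sketch: the text says the result is rough, so the use of integer division is an assumption, and the function names and sample Lawson Numbers are hypothetical.

```python
# (lowest system number, highest system number, group, printed volumes)
DIVISIONS = [
    (1, 449, "Acyclic Compounds", "Volumes 1-4"),
    (450, 2358, "Isocyclic Compounds", "Volumes 5-16"),
    (2359, 4720, "Heterocyclic Compounds", "Volumes 17-27"),
]


def approximate_system_number(lawson_number):
    """Rough Beilstein system number: Lawson Number divided by 8."""
    return lawson_number // 8


def division_for(lawson_number):
    """Return (approximate system number, group name, volumes).

    Group and volumes are None when the estimate falls outside 1-4720.
    """
    system = approximate_system_number(lawson_number)
    for low, high, name, volumes in DIVISIONS:
        if low <= system <= high:
            return system, name, volumes
    return system, None, None
```

For instance, a Lawson Number of 8000 gives an approximate system number of 1000, which falls in the isocyclic range (Volumes 5-16).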
Unfortunately, the Beilstein Institute never published the meanings of the 4,720 system numbers used to classify
organic compounds. However, the Lawson Number Descriptions [5] can now be found on the web. The Lawson
Number is effective when used in combination with other search keys, such as molecular formula, element ranges,
etc. It is also useful when combined with NOT in substructure searches.
Chemisches Zentralblatt [6]
Chemisches Zentralblatt is the oldest abstracting journal in the field of chemistry. It covers the chemical literature
from 1830 to 1969. In the course of those 140 years, Chemisches Zentralblatt published 900,000 pages, including 2
million abstracts. Chemisches Zentralblatt introduced formula indexes using the Richter system (different from the
Hill system) in 1925. In 1956 it changed to the Hill system. The previous title Chemisches Central-Blatt (1856-1906)
has only author, subject, and patent number indexes. InfoChem has performed automatic chemical named entity
recognition of Chemisches Zentralblatt text to produce a structure-searchable database. The database is offered either
as a Web application or for in-house loading. It will link to digitized versions of the original paper product that were
produced by FIZ-Chemie.
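For readers unfamiliar with the Hill system mentioned above, the sketch below builds a Hill-order molecular formula: when carbon is present, C comes first, H second, and the remaining elements follow alphabetically; without carbon, all elements (including H) are listed alphabetically. The function name and the dictionary input format are assumptions for illustration, and the Richter system is not covered.

```python
def hill_formula(counts):
    """Build a Hill-order formula string from a {element: count} mapping."""
    elements = sorted(el for el, n in counts.items() if n > 0)
    if "C" in elements:
        # Carbon first, then hydrogen, then everything else alphabetically
        rest = [el for el in elements if el not in ("C", "H")]
        ordered = ["C"] + (["H"] if "H" in elements else []) + rest
    else:
        # No carbon: strictly alphabetical, hydrogen included
        ordered = elements
    return "".join(
        el + (str(counts[el]) if counts[el] > 1 else "") for el in ordered
    )


# Isatin, the example molecule used in this chapter's figures
isatin = hill_formula({"O": 2, "N": 1, "H": 5, "C": 8})
```

Because the ordering is deterministic, two searchers who start from the same composition arrive at the same index entry.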
Summary
Structure searching considerably expands the ability of a chemist to retrieve information from a database since the
search key is the "native language" of the chemist, the chemical structure. Any chemist, regardless of his or her native
tongue, understands a chemical structure. Thus, the structure searching systems speak the universal language of
chemistry. The development of graphical user interfaces that allow easy drawing of the desired structure on a
computer screen was a major advance in chemical searching. There are now several commercial databases such as
Chemical Abstracts and Reaxys (Beilstein/Gmelin) that have this capability, as do public systems such as PubChem [7]
and ChemSpider [8]. It may take some time to explore and learn all of the capabilities of the structure searching
systems, but the reward in enhanced search retrieval is well worth it.
CIIM Link for further study
SIRCh Link for Structure Searches
Problem Set on this topic [9]
References
[1] http://www.cas.org/ASSETS/C69290063CFB41BF89840D9E8265C970/stereoex.pdf
[2] http://www.reaxys.com
[3] http://www.indiana.edu/~cheminfo/C471/stnfiles.html
[4] http://chem.sis.nlm.nih.gov/chemidplus/chemidheavy.jsp
[5] http://www.indiana.edu/~cheminfo/cicc/lawson_test.htm
[6] http://infochem.de/products/databases/czb.shtml
[7] http://pubchem.ncbi.nlm.nih.gov/
[8] http://www.chemspider.com/
[9] http://www.indiana.edu/~cheminfo/C471/471ps4.html
Article Sources and Contributors
Chemical Information Sources/Structure Searches  Source: http://en.wikibooks.org/w/index.php?oldid=2063872  Contributors: Adrignola, Avicennasis, Gary Dorman Wiggins
Image Sources, Licenses and Contributors
File:Isatin wlvl.jpg  Source: http://en.wikibooks.org/w/index.php?title=File:Isatin_wlvl.jpg  License: Creative Commons Zero  Contributors: Gary Dorman Wiggins
License
Creative Commons Attribution-Share Alike 3.0 Unported
//creativecommons.org/licenses/by-sa/3.0/