Artificial Neural Networks
Second Edition
Methods in Molecular Biology
Series Editor
John M. Walker
School of Life Sciences
University of Hertfordshire
Hatfield, Hertfordshire, AL10 9AB, UK
Edited by
Hugh Cartwright
Chemistry Department, Oxford University, Oxford, UK; Chemistry Department,
University of Victoria, BC, Canada
Artificial Neural Networks (ANNs) are among the most fundamental techniques within the field of Artificial Intelligence. Their operation loosely emulates the functioning of the human brain, but the value of an ANN extends well beyond its role as a biological model. An ANN can both memorize and reason: it provides a way in which a computer can learn from scratch about a previously unseen problem. Remarkably, the exact form of the problem is rarely critical; it might be financial (e.g., can we predict the direction of the stock market in the next few months?); it might be sociological (what factors make a face attractive?); it could be medical (can we tell from an X-ray whether a bone is broken?); or, as in this volume, the problem might be purely scientific.

This text brings together some productive and fascinating examples of how ANNs are applied in the biological sciences and related areas: from the analysis of intracellular sorting information to the prediction of the behavior of bacterial communities; from biometric authentication to studies of tuberculosis; from studies of gene signatures in breast cancer classification to the use of mass spectrometry in metabolite identification; from visual navigation to computer diagnosis of possible lesions; and more. The authors describe not only what they have done with ANNs but also how they have done it. Readers intrigued by the work described in this book will find numerous practical details, which should encourage further use of these rapidly developing tools.
Contents
Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Contributors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
Contributors
SAKSIRI JAMSAK • Center of Data Mining and Biomedical Informatics, Faculty of Medical
Technology, Mahidol University, Bangkok, Thailand
SRAVANTHI JOGINIPELLI • Department of Bioinformatics, University of Arkansas at Little Rock,
Little Rock, AR, USA
L.J. KANGAS • Washington State University Tri-Cities, Richland, WA, USA
ANDRZEJ KLOCZKOWSKI • Battelle Center for Mathematical Medicine, Nationwide
Children’s Hospital, Columbus, OH, USA; Department of Pediatrics, The Ohio State
University, Columbus, OH, USA
PETER LARSEN • Biosciences Division, Argonne National Laboratory, Lemont, IL, USA;
Department of Bioengineering, University of Illinois at Chicago, Chicago, IL, USA
BEI LIU • Radiation Oncology, City of Hope Medical Center, Duarte, CA, USA
MARIO MALCANGI • Department of Computer Science, Università degli Studi di Milano,
Milan, Italy
PRASIT MANDI • Center of Data Mining and Biomedical Informatics, Faculty of Medical
Technology, Mahidol University, Bangkok, Thailand
ARACELY C. MARTÍNEZ-OLGUÍN • División de Estudios de Posgrado, Universidad del
Papaloapan, Oaxaca, Mexico
VENKATA KIRAN MELAPU • Department of Bioinformatics, University of Arkansas
at Little Rock, Little Rock, AR, USA
J.H. MILLER • Washington State University Tri-Cities, Richland, WA, USA
NARELLE MONTAÑEZ-GODÍNEZ • División de Estudios de Posgrado, Universidad del
Papaloapan, Oaxaca, Mexico
KYAW Z. MYINT • NIDA Center of Excellence for Computational Chemogenomics Drug
Abuse Research, Computational Chemical Genomics Screening Center, Department of
Pharmaceutical Sciences, School of Pharmacy, University of Pittsburgh, Pittsburgh,
PA, USA
CHANIN NANTASENAMAT • Center of Data Mining and Biomedical Informatics,
Faculty of Medical Technology, Mahidol University, Bangkok, Thailand; Department
of Clinical Microbiology and Applied Technology, Faculty of Medical Technology,
Mahidol University, Bangkok, Thailand
STEFAN NIEHÖRSTER • Thin films and Physics of Nanostructures, Physics Department,
Bielefeld University, Bielefeld, Germany
CRISTIAN F. PASLUOSTA • Electronics Core - Medical Device Solutions, Lerner Research
Institute, Cleveland Clinic, Cleveland, OH, USA
ANDREW PHILIPPIDES • Centre for Computational Neuroscience and Robotics, University
of Sussex, Brighton, UK
VIRAPONG PRACHAYASITTIKUL • Department of Clinical Microbiology and Applied
Technology, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand
LIKIT PREEYANON • Center of Data Mining and Biomedical Informatics, Faculty of Medical
Technology, Mahidol University, Bangkok, Thailand
GUILLERMO RAMÍREZ-GALICIA • División de Estudios de Posgrado, Universidad del
Papaloapan, Oaxaca, Mexico
PHILIP H. ROWE • School of Pharmacy & Biomolecular Sciences, Liverpool John Moores
University, Liverpool, UK
SANDHYA SAMARASINGHE • Integrated Systems Modelling Group, Lincoln University,
Christchurch, New Zealand
B.T. SCHROM • Washington State University Tri-Cities, Richland, WA, USA
YANG SHEN • Laboratory of Chemical Physics, National Institute of Diabetes and Digestive
and Kidney Diseases, National Institutes of Health, Bethesda, MD, USA
WATSHARA SHOOMBUATONG • Center of Data Mining and Biomedical Informatics,
Faculty of Medical Technology, Mahidol University, Bangkok, Thailand
SAW SIMEON • Center of Data Mining and Biomedical Informatics, Faculty of Medical
Technology, Mahidol University, Bangkok, Thailand
GIOVANNI SPARACINO • Department of Information Engineering, University of Padova,
Padova, Italy
A. SPECK-PLANCHE • Department of Chemistry and Biochemistry, Faculty of Sciences,
University of Porto, Porto, Portugal
ANDY THOMAS • Thin films and Physics of Nanostructures, Physics Department, Bielefeld
University, Bielefeld, Germany; Institute of Physics, Johannes Gutenberg University
Mainz, Mainz, Germany
MARC TORRENT • Medical Research Council Laboratory of Molecular Biology, Cambridge, UK;
Vall d’Hebron Institute of Research, Universitat Autònoma de Barcelona, Vall d’Hebron
University Hospital, Barcelona, Spain
APILAK WORACHARTCHEEWAN • Center of Data Mining and Biomedical Informatics,
Faculty of Medical Technology, Mahidol University, Bangkok, Thailand
XIANG-QUN XIE • NIDA Center of Excellence for Computational Chemogenomics Drug
Abuse Research, Computational Chemical Genomics Screening Center, Department of
Pharmaceutical Sciences, School of Pharmacy, University of Pittsburgh, Pittsburgh,
PA, USA
CHIARA ZECCHIN • Department of Information Engineering, University of Padova,
Padova, Italy
Chapter 1
Abstract
A precise spatial-temporal organization of cell components is required for basic cellular activities such as
proliferation and for complex multicellular processes such as embryo development. Particularly important
is the maintenance and control of the cellular distribution of proteins, as these components fulfill crucial
structural and catalytic functions.
Membrane protein localization within the cell is determined and maintained by intracellular elements
known as adaptors that interpret sorting information encoded in the amino acid sequence of cargoes.
Understanding the sorting sequence code of cargo proteins would have a profound impact on many areas
of the life sciences. For example, it would shed light onto the molecular mechanisms of several genetic
diseases and would eventually allow us to control the fate of proteins.
This chapter constitutes a primer on protein-sorting information analysis and localization/trafficking prediction. We provide the rationale for, and a discussion of, a simple basic protocol for protein sequence dissection in search of sorting signals, from simple sequence inspection techniques to more sophisticated artificial neural network analysis of sorting signal recognition data.
Key words Protein sorting, Sequence analysis, Artificial neural networks, Intracellular localization
1 Introduction
1.1 How Do the Emergent Properties of Cells Arise from a Mix of Proteins, Lipids, Sugars, and Nucleic Acids? (See Note 1)

This intriguing, complex question is at the core of modern biology, and it can be initially approached by simply stating that cells are not just a mix but an ordered array of their components. Cells require an internal spatial-temporal organization for the fulfillment of specific biochemical processes (e.g., by creating chemical potentials—Fig. 1a). Indeed, life can be conceived as resulting from the constant struggle to generate and maintain internal order against entropy trying to drive organisms to a lethal equilibrium.

This dynamic, yet highly ordered, steady state is particularly complex in eukaryotic cells (e.g., human cells) as they add a physical dimension to their spatial-temporal organization in the form of
Hugh Cartwright (ed.), Artificial Neural Networks, Methods in Molecular Biology, vol. 1260,
DOI 10.1007/978-1-4939-2239-0_1, © Springer Science+Business Media New York 2015
R. Claudio Aguilar
Fig. 1 Establishment and maintenance of cell organization. Cells are represented as compartments separating cytosol (bluish) from the extracellular space (greyish). A single membrane-spanning protein on the cell plasma membrane is represented as a rectangle with cytosolic and lumenal regions (black and white sectors, respectively). A red segment within the protein represents the transmembrane domain (TMD). (a) Left panel: Isotropic (homogenous) distribution of the protein at equilibrium. This is a stage of low energy (no chemical potential) and without spatial differential properties. Right panel: The organized (polarized) distribution of key functional components (e.g., proteins) allows cell function. For example, asymmetrical distribution of ion membrane transporters can generate electrochemical potentials required for synaptic transmission. Diffusion forces (black arrows) oppose intracellular organization by promoting the dispersion of membrane proteins. Active (energy-consuming) transport (red/green arrows) of protein is constantly required to counteract diffusion in order to establish and maintain cell organization. (b) Transmembrane proteins display a tyrosine-based sorting signal ("Y") recognized by adaptors (blue complex labeled "AP"). This cargo is actively removed from the plasma membrane (internalization, red arrows) and recycled back (green arrows) to polarized areas. Transport occurs through vesicle trafficking, i.e., vesicles loaded with cargo by the adaptor bud off from a donor compartment and fuse with a target compartment. Note that the protein topology is conserved: cytosolic regions always face the cytosol while lumenal regions always face the lumen or the extracellular space
1.2 How Are Cargoes Selectively Loaded into the Right Vesicles and Delivered to the Right Compartments?

The answer to this question is (deceptively) simple, just like the mail. The "mailed" cargo displays a destination "address" (or "addresses") and it is delivered via the interplay of cellular "postal workers" (who load specific mail items into the proper "bag") with the "addressee."

If we focus on proteins (i.e., protein sorting), and specifically on integral membrane proteins (see Note 3), their destination addresses and/or delivery routes are encoded as consensus amino acid sequences that may or may not be activated/deactivated by posttranslational modifications. Most of these sorting signals are located in the cytoplasmic domain of the protein cargo, readily available for cellular "postal workers" to interpret and bag the cargo in a vesicle (Fig. 1b). The loaded vesicles are connected to motors and placed on cytoskeleton tracks to be mobilized to their destinations (see Note 2).
Many different cellular "postal workers," or adaptors, have been identified and characterized. These adaptors are also proteins, or protein complexes, that associate with the cytoplasmic side of membranes and vary in both location and type of sorting signals they interpret. Table 1 shows a list of some of the most important protein-sorting signals and their corresponding proposed adaptors.
Within this signal recognition machinery, the clathrin-associated adaptor proteins (APs) emerge as major players in the protein-trafficking system of higher eukaryotes [3, 4] (Fig. 1b). Five tetrameric AP complexes (AP1 through AP5) with different intracellular localizations have been identified and shown to mediate different protein-sorting events connecting several compartments [5, 6]. Whereas other AP subunits are engaged in interactions with various molecules, the μ (medium) subunit is in charge of recognizing tyrosine-based sorting signals.
Table 1
Signal-adaptor specificity
1.3 Can We Predict the Intracellular Destination of Transmembrane Proteins by Reading Their Amino Acid Sequences?

If we were able to identify and interpret the whole repertoire of sorting signals, we would be capable of predicting the fate of a protein cargo solely based on its amino acid sequence. This achievement would have a high impact on many different areas of the life sciences; for example, it would shed light on the nature of several hereditary diseases and would enhance our ability to control the intracellular fate of proteins within cells. However, we still do not fully understand how the sorting information is coded in protein sequences or the precise role in protein trafficking of all the different adaptors. In addition, multiple factors including topology/accessibility and context within the protein affect the recognition of sorting signals by adaptors (see below), adding extra layers of complexity to the process. It should also be noted that cargoes often display multiple sorting signals of one or more types and that the protein's final destination results from the interplay of this composite of motifs.

The purpose of this chapter is to introduce the foundations of protein-sorting information analysis. Specifically, we will discuss how to identify candidate sorting signals within protein sequences and analyze their role in protein trafficking. We will also describe simple techniques to experimentally assess the relevance of putative signals for protein sorting in the context of model and native cargo.
Analysis of Intracellular Sorting Information
2 Method Foundations
2.1 Looking for Signals

In order to read sorting signals, first we need to find them. Finding sequences within cargoes that would fit into a given sorting consensus is not hard. Indeed, some of these motifs (e.g., YXXØ) are very commonly found within protein sequences (see Fig. 2), but not all of them are functional for cargo sorting; actually, only a few are. Therefore, the real task is to identify viable sorting-signal candidates among other sequences that, even when fitting into the consensus, are likely to fulfill different functions within the protein. Although we only partially understand the information-coding system, it has been established that sequence motifs need to meet a few conditions related to their nature, topology/accessibility, and environment to be active in protein sorting. These requirements can be exploited to our advantage to highlight sequences with a higher probability of being functional sorting signals:

(a) Fit into a signal consensus: Although in some cases certain flexibility may exist, the better a sequence fits into a given consensus, the higher the probability that it works as a real sorting signal. It should also be kept in mind that posttranslational
Fig. 2 Sequence analysis of the lysosomal protein Lamp2. Left panel: The lysosomal-associated membrane protein 2 (Lamp2) is transported from endosomes (E) to the lysosome (L). Right panel: The amino acid sequence of the isoform A of human Lamp2 was obtained from the NCBI protein database (URL provided). Although four YXXø motifs (underlined) were identified in Lamp2, only one (Yeqf) was located in the cytosolic region of the protein. Further, this motif is located within the distance range (6–10 amino acids from the TMD) for AP recognition. See text for details
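The motif search illustrated in Fig. 2 amounts to scanning a sequence for matches to the YXXØ consensus. A minimal sketch in Python, assuming Ø = F, M, I, L, or V (the residue set used for the Ø position in the ANN of Fig. 3); the example fragment is made up for illustration, not the real Lamp2 sequence:

```python
import re

# YXXO consensus: Tyr, two arbitrary residues, then a bulky hydrophobic
# residue (O taken here as F, M, I, L, or V). Wrapping the pattern in a
# lookahead lets overlapping motifs be reported as well.
YXXO = re.compile(r"(?=(Y..[FMILV]))")

def find_yxxo(sequence):
    """Return (position, motif) pairs for every YXXO match."""
    return [(m.start(), m.group(1)) for m in YXXO.finditer(sequence)]

# Made-up fragment ending in a YEQF motif (illustrative only; use the
# full NCBI entry for a real analysis):
print(find_yxxo("TAVGNSYMKHANGTTVAKRHHAGYEQF"))  # -> [(23, 'YEQF')]
```

As the output shows, a bare consensus scan reports every fit; deciding which hits are plausible sorting signals requires the topology and context checks discussed in the text.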
2.2.1 To Predict/Determine Signal-Adaptor Specificity: What Adaptor Will Recognize a Given Signal?

Although not comprehensive (and many exceptions to these rules may exist), Table 1 indicates what sorting signal is recognized by each adaptor or family of adaptors. The signals identified in the cargo of interest should be contrasted against Table 1 or similar. In addition, when signals are recognized by families of adaptors (e.g., APs), determining the fine specificity for family members could be critical to predict cargo intracellular trafficking and distribution. For example, different AP adaptors localize to different compartments and are believed to mediate different protein-sorting events.

The fine YXXØ specificity of members of the AP family has been explored in detail [11, 12]. These studies indicate that, in addition to Y and Ø, the X-positions (including amino acids upstream of the crucial Y) also influence the process of signal recognition [11, 12]. Based on these findings, a table summarizing the amino acid preference of each AP complex at the X/Ø-positions of a XXXYXXØ generic signal was constructed (Table 2).

The Y-signal amino acid preference of the different APs documented in Table 2 points to several important trends of AP consensus recognition and their impact on cargo trafficking. For example, enrichment of acidic amino acids (D/E) preceding the Ø position and presence of the amino acid glycine (G) just before the critical Tyr would make a protein cargo a prime subject for AP3 recognition (Table 2). Interestingly, AP3 has been implicated in protein sorting to the lysosome, and many lysosomal proteins display acidic sequences that fit into the GYXXø signal consensus subset.

However, Table 2 does not predict the experimentally observed synergic and inhibitory effects between amino acids on
Table 2
Y-signal specificity of the mammalian AP family (a)

For each adaptor complex (b), the preferred ("+") residues at the −3, −2, −1, +1, +2, and ø positions of an XXXYXXø motif are listed, together with the excluded ("−") residues and the complex's proposed role in TMP vesicle traffic:

AP1
  Preferred (+): −3 R, −2 S, −1 L, Y, +1 R/Q, +2 P, ø L
  Excluded (−): F/L, L, I, I, F/V
  Role: Golgi complex-endosome

AP2
  Preferred (+): −3 G, −2 F, −1 P, Y, +1 E/Q, +2 P/R, ø L
  Excluded (−): F
  Role: Removal of TMP from PM

AP3
  Preferred (+): −3 R, −2 A, −1 G/D/E, Y, +1 E, +2 P, ø I
  Excluded (−): F/I, S, F/L/I
  Role: Endosome to lysosome

AP4 (c)
  Preferred (+): −3 C, −2 Y, −1 F, Y, +1 D, +2 P, ø F
  Excluded (−): T, G, N/T, V
  Role: Endosome to lysosome

(a) Table condenses results published in refs. 11, 12
(b) For each adaptor, amino acids preferred ("+") and excluded ("−") at each position are indicated; positions with no significant preference or exclusion are omitted. Adaptor complex AP5 is excluded from this table as no analysis for signal preference is currently available
(c) In addition to YXXø signals, AP4 has been shown to recognize YX[F/Y/L][F/L]E motifs
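As a simple illustration of how Table 2 can be consulted, its preferred-residue rows can be turned into a position-by-position tally. This is only a heuristic sketch: it ignores the excluded residues, the synergic/inhibitory effects noted in the text, and the quantitative weighting of the published data [11, 12]; `score_signal` is an invented helper name, not part of any published tool.

```python
# Preferred residues at each position of the XXXYXXO motif, transcribed
# from the "+" rows of Table 2 (key 3 stands for the O position).
PREFERRED = {
    "AP1": {-3: "R", -2: "S", -1: "L", 1: "RQ", 2: "P", 3: "L"},
    "AP2": {-3: "G", -2: "F", -1: "P", 1: "EQ", 2: "PR", 3: "L"},
    "AP3": {-3: "R", -2: "A", -1: "GDE", 1: "E", 2: "P", 3: "I"},
    "AP4": {-3: "C", -2: "Y", -1: "F", 1: "D", 2: "P", 3: "F"},
}

# Offsets into a 7-residue window spelling XXXYXXO (Y is at index 3).
OFFSETS = {-3: 0, -2: 1, -1: 2, 1: 4, 2: 5, 3: 6}

def score_signal(window):
    """Count, for each AP complex, how many positions of the window
    carry a preferred residue. A higher tally means a better fit."""
    assert len(window) == 7 and window[3] == "Y", "expected an XXXYXXO window"
    return {
        adaptor: sum(window[OFFSETS[pos]] in allowed
                     for pos, allowed in prefs.items())
        for adaptor, prefs in PREFERRED.items()
    }

# The Gly-preceded, acidic signal type discussed in the text scores
# highest for AP3:
print(score_signal("RAGYEPI"))
```

Running this on "RAGYEPI" gives AP3 a perfect 6/6 tally, consistent with the AP3 recognition trend described above.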
Fig. 3 YXXø artificial neural network architecture. The neurons in the network are represented by squares and the connections between units by arrows. The input layer (I) is divided into 5 clusters (one for each X position within the XXXYXXØ motif) made up of 20 nodes each (representing the 20 possible amino acids—only 2 per cluster are shown); the Ø cluster contains only 5 nodes (for amino acids F, M, I, L, and V), yielding 105 neurons in total. Hidden (H) and output (O) layers are shown. The final network output is a binary Y or N value, denoting the presence or absence, respectively, of interaction between the signal and the adaptor under consideration (see ref. 13 for more details)
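The encoding and forward pass sketched in the caption can be written out as follows. This is a minimal numpy sketch: the weights are random placeholders rather than the trained network of ref. 13, and the hidden-layer size of 10 is an arbitrary choice for illustration.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # one input node per residue
PHI = "FMILV"                          # the 5 residues allowed at O

def encode_signal(window):
    """One-hot encode an XXXYXXO window (7 residues, Y fixed at index 3)
    into the 5*20 + 5 = 105-dimensional input vector of Fig. 3."""
    assert len(window) == 7 and window[3] == "Y"
    x_residues = window[:3] + window[4:6]       # the five X positions
    vec = np.zeros(105)
    for cluster, aa in enumerate(x_residues):
        vec[cluster * 20 + AMINO_ACIDS.index(aa)] = 1.0
    vec[100 + PHI.index(window[6])] = 1.0       # the O cluster
    return vec

def forward(vec, w_hidden, w_out):
    """Sigmoid feed-forward pass (105 inputs -> hidden -> 1 output);
    an output near 1 stands for 'Y' (interaction), near 0 for 'N'."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    return sigmoid(w_out @ sigmoid(w_hidden @ vec))

rng = np.random.default_rng(0)
w_hidden = rng.normal(size=(10, 105))   # placeholder weights, untrained
w_out = rng.normal(size=(1, 10))
print(forward(encode_signal("RAGYEPI"), w_hidden, w_out))
```

With trained weights, thresholding the output at 0.5 would yield the binary Y/N interaction call described in the caption.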
2.2.2 To Predict/Determine Signal Relevance to Cargo Trafficking/Localization: What Is the Impact of Specific Adaptor-Signal Interactions for Cargo Trafficking/Localization?

In order to address this question, we need to understand the function of specific adaptors; i.e., what is going to happen to the cargo when a signal is bound by its specific adaptor? The function of some adaptors is well established, but in other cases it is not. Nevertheless, knowing the adaptor specificity of a given signal, it is possible to make some predictions in terms of cargo localization or trafficking of normal and signal-mutated cargoes. For example, Fig. 4 depicts the trafficking of a transmembrane protein with a Y-signal to be recognized by AP3.

Following its targeting to the endoplasmic reticulum (ER), the nascent lysosomal-resident cargo is inserted into the ER membrane by sequential action of the signal recognition particle and the translocon systems (see Fig. 4 and Note 2). After being transported to the Golgi apparatus by vesicle trafficking, cargo would emerge at the trans-Golgi network (TGN) ready to be sorted. Cargo carrying an appropriate signal can be recognized by the AP3 complex and actively sent en route to late endosomes/lysosomes (LE/L) or can be passively incorporated into carriers destined for the plasma membrane (the "default pathway"). Once at the plasma membrane, AP2 will bind and promote cargo internalization (Fig. 4), leading
Fig. 4 Traffic of a generic AP3 cargo. Following translation of the messenger RNA in the endoplasmic reticulum (ER), cargo is transported to the Golgi apparatus (G) for further processing to finally emerge at the trans-Golgi network (TGN) from where it would be sorted. While AP3 actively recruits cargo for late endosomal (LE) to lysosome (L) targeting (route II–III), the "default" pathway (route I) would transport cargo to the plasma membrane (PM). At the PM, AP2 promotes cargo internalization, facilitating a new iteration of sorting (see text for details)
2.3 Testing the Fate of Cargo

Often the cytoplasmic regions of cargoes contain not a single sorting determinant but several, and those signals can overlap with each other and with sites for posttranslational modification. Therefore, the protein's final destination will result from the interplay of this composite of sequence-encoded information.

Although the adaptors and signals discussed in this chapter will not be involved, TM and lumenal domains can also play an important role in the fate of proteins. Therefore, following studies with simplified cargo models (see above), investigations with the actual cargo (i.e., the signal within its native context) should be conducted. The latter should involve the analysis of truncations and point mutations that in the native context would affect signals and regulatory components (e.g., palmitoylation and phosphorylation sites).
3.1 Finding the Amino Acid Sequence of the Cargo of Interest

There are several resources that can be explored, for example, the NCBI databases and some organism-specialized sites, such as the Saccharomyces Genome Database (SGD) or FlyBase:
NCBI: http://www.ncbi.nlm.nih.gov/ ("Protein" database)
SGD: http://www.yeastgenome.org/
FlyBase: http://flybase.org/
In all cases, be aware of partial sequences, isoforms, and species variation (a corollary to the latter is that conservation of a putative sorting signal means functional relevance but not necessarily a role in protein sorting).

Example
Sequence information for Lamp2 can be found at the NCBI database (Fig. 2): http://www.ncbi.nlm.nih.gov/protein/P13473.2
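Retrieval can also be scripted; for instance, NCBI's E-utilities can serve the same entry in FASTA format. The efetch query below follows the standard E-utilities URL pattern, but consult the current NCBI documentation before relying on it; the short FASTA string used for the demonstration is a made-up fragment, not the real Lamp2 sequence.

```python
from urllib.request import urlopen

def efetch_url(accession):
    """Build an NCBI E-utilities efetch URL that returns a protein
    record in FASTA format."""
    return ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
            f"?db=protein&id={accession}&rettype=fasta&retmode=text")

def parse_fasta(text):
    """Return (header, sequence) for a single-record FASTA string."""
    lines = [line.strip() for line in text.strip().splitlines()]
    return lines[0].lstrip(">"), "".join(lines[1:])

# Uncomment to actually fetch human Lamp2 (network access required):
# header, seq = parse_fasta(urlopen(efetch_url("P13473.2")).read().decode())

header, seq = parse_fasta(">P13473.2 example record\nMVCF\nRLFP")
print(header, len(seq))
```

The parsed sequence string can then be fed directly into the motif and topology analyses of the following steps.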
3.2 Identification of the Cytosolic Region(s) Within the Cargo of Interest

Information about which cargo regions face the cytosol might be readily available from the database protein entries (see above), or it may be necessary to run a topology (TM and cytosolic region) prediction using the cargo sequence. If the latter is required, a few useful online sites are listed below:
http://smart.embl-heidelberg.de/
http://phobius.sbc.su.se/
http://www.enzim.hu/hmmtop/html/submit.html
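These servers implement sophisticated probabilistic models, but the core idea behind TM-segment detection can be illustrated with a classic Kyte-Doolittle hydropathy scan. This is a rough sketch only; the 19-residue window and 1.6 cutoff are conventional textbook values, not the parameters of the servers listed above.

```python
# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=19):
    """Mean hydropathy over a sliding window; 19 residues is a common
    window choice for transmembrane helices."""
    return [sum(KD[aa] for aa in seq[i:i + window]) / window
            for i in range(len(seq) - window + 1)]

def candidate_tm_starts(seq, window=19, threshold=1.6):
    """Window start indices whose mean hydropathy exceeds the cutoff."""
    return [i for i, h in enumerate(hydropathy_profile(seq, window))
            if h > threshold]

# A hydrophobic 19-residue stretch flanked by polar sequence
# (a made-up test sequence, not a real protein):
toy = "NQSKDE" + "LLIVFALIVVGLLIAVVLL" + "RKDEQNS"
print(candidate_tm_starts(toy))
```

The flagged window starts cluster around the hydrophobic stretch; once a TM segment is located this way, the downstream sequence on the cytosolic side is the region to scan for sorting signals.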
Example
The cytosolic region of Lamp2 was predicted to encompass amino acids 401–410, in agreement with information provided by the corresponding NCBI entry (Fig. 2).
3.3 Search for Motifs Fitting Sorting Signal Consensuses

Contrast the sequence of the cargo cytosolic region(s) against motifs listed in Table 1.

Example
The cytosolic region of Lamp2 was found to contain a C-terminal YXXø motif (YEQF).
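Steps 3.2 and 3.3 can be combined in a few lines: restrict the YXXØ scan to the predicted cytosolic region and report each hit's distance from the TMD, since the Fig. 2 example notes that AP recognition favors motifs 6–10 residues from the TMD. The construct below is hypothetical (a 400-residue stub plus an invented 10-residue tail), standing in for the Lamp2 case.

```python
import re

def cytosolic_yxxo(sequence, cytosolic_start, cytosolic_end):
    """Scan only the cytosolic region (1-based, inclusive bounds) for
    YXXO motifs (O = F/M/I/L/V) and report each hit's distance from the
    TMD, assuming the TMD ends just before the cytosolic region."""
    tail = sequence[cytosolic_start - 1:cytosolic_end]
    hits = []
    for m in re.finditer(r"(?=(Y..[FMILV]))", tail):
        distance = m.start()              # residues between TMD and the Y
        hits.append((m.group(1), distance, 6 <= distance <= 10))
    return hits

# Hypothetical 410-residue construct: a 400-residue lumenal/TM stub plus
# an invented 10-residue cytosolic tail ending in YEQF (illustrative
# only, not the real Lamp2 sequence).
protein = "A" * 400 + "KRHHAGYEQF"
print(cytosolic_yxxo(protein, 401, 410))  # -> [('YEQF', 6, True)]
```

The boolean flag marks motifs that fall inside the 6–10 residue recognition window, mirroring the reasoning applied to the Lamp2 YEQF hit.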
3.5 Predict Adaptor Recognition

What adaptor would bind the putative sorting signal?
– Prediction of signal-adaptor specificity based on Table 1 (also use Table 2 and ANNs if the AP family is involved).
– Experimental testing of adaptor recognition. Determine the ability of the isolated candidate signal to bind adaptors: two-hybrid technology, in vitro binding assays, localization analysis, and IP of TAC chimeras.

Example
The Lamp2 YEQF motif fulfills the criteria to be a signal recognized by AP3.
4 Conclusion
These approaches will not only create tools to facilitate the analysis of protein information, but will also allow the generation of new hypotheses to be tested in the classical experimental realm.
4.1 Where to Go from Here?

The method described in this chapter has been successfully and routinely used to analyze protein cargo of different origins (yeast, mammalian, etc.). However, in its current form, it relies on manual inspection of sequences and on the use of scattered resources. One obvious limitation of this "manual" approach is its inability to support high-throughput analysis. Although complete collections of protein sequences from entire organisms exist, we lack the tools to systematically analyze the databases for cargoes carrying specific sorting signals. This limitation, for example, precludes us from searching for cargoes potentially affected by genetic diseases due to deficiencies in specific adaptors (e.g., Hermansky-Pudlak syndrome forms due to abnormal AP3 function—see [15]). If we had this capability, we could predict (and perhaps prevent) cellular abnormalities in patients.

Therefore, there is a clear need for applications that incorporate the myriad of factors discussed in previous sections. Furthermore, the application of information theory [16] should be instrumental for the development of methods to crack the protein-sorting information code.
4.2 Is Benchwork Still Needed?

Absolutely. Our basic cell and molecular knowledge still needs to be pushed forward, and classical benchwork will play a central role in this effort. Indeed, there are still some basic research aspects that need to be resolved/clarified, for example, controversies about the specificity of certain adaptors and several reports of functional signals that deviate from the discussed consensuses. The existence of previously unrecognized signals in the cytoplasmic region of cargoes is another possibility that should be experimentally investigated.

In addition to discussing the basis and application of a method based on resources and analytical tools, from molecular biology to ANNs, this chapter emphasizes the necessity of amalgamating experimental and theoretical methods. The coalescence of in silico, in vitro, and in vivo approaches is what it will take to fully understand adaptor-signal interactions and thereby to move knowledge forward and fulfill our medical and biotechnological needs.
5 Notes
Acknowledgments
References
1. Esposito G, Clara FA, Verstreken P (2012) Synaptic vesicle trafficking and Parkinson's disease. Dev Neurobiol 72(1):134–144. doi:10.1002/dneu.20916
2. Aridor M, Hannan LA (2002) Traffic jams II: an update of diseases of intracellular transport. Traffic 3(11):781–790. doi:10.1034/j.1600-0854.2002.31103.x
3. Boehm M, Bonifacino JS (2002) Genetic analyses of adaptin function from yeast to mammals. Gene 286(2):175–186. doi:10.1016/s0378-1119(02)00422-5
4. Bonifacino JS (2014) Adaptor proteins involved in polarized sorting. J Cell Biol 204(1):7–17. doi:10.1083/jcb.201310021
5. Hirst J, Irving C, Borner GHH (2013) Adaptor protein complexes AP-4 and AP-5: new players in endosomal trafficking and progressive spastic paraplegia. Traffic 14(2):153–164. doi:10.1111/tra.12028
6. Ohno H (2006) Physiological roles of clathrin adaptor AP complexes: lessons from mutant animals. J Biochem 139(6):943–948. doi:10.1093/jb.mvj120
7. Green N, Fang H, Kalies KU et al (2001) Determining the topology of an integral membrane protein. Curr Protoc Cell Biol Chapter 5: Unit 5.2. doi:10.1002/0471143030.cb0502s00
8. Vergarajauregui S, Puertollano R (2006) Two di-leucine motifs regulate trafficking of mucolipin-1 to lysosomes. Traffic 7(3):337–353. doi:10.1111/j.1600-0854.2006.00387.x
9. Neumann U, Brandizzi F, Hawes C (2003) Protein transport in plant cells: in and out of the Golgi. Ann Bot 92(2):167–180. doi:10.1093/aob/mcg134
10. Barman S et al (2001) Transport of viral proteins to the apical membranes and interaction of matrix protein with glycoproteins in the assembly of influenza viruses. Virus Res 77(1):61–69. doi:10.1016/s0168-1702(01)00266-0
11. Ohno H, Aguilar RC, Yeh D et al (1998) The medium subunits of adaptor complexes recognize distinct but overlapping sets of tyrosine-based sorting signals. J Biol Chem 273(40):25915–25921. doi:10.1074/jbc.273.40.25915
12. Aguilar RC, Boehm M, Gorshkova I et al (2001) Signal-binding specificity of the mu 4 subunit of the adaptor protein complex AP-4. J Biol Chem 276(16):13145–13152. doi:10.1074/jbc.M010591200
13. Mukherjee D, Hanna CB, Aguilar RC (2012) Artificial neural network for the prediction of tyrosine-based sorting signal recognition by adaptor complexes. J Biomed Biotechnol. doi:10.1155/2012/498031
14. Parrish JR, Gulyas KD, Finley RL Jr (2006) Yeast two-hybrid contributions to interactome mapping. Curr Opin Biotechnol 17(4):387–393. doi:10.1016/j.copbio.2006.06.006
15. Dell'Angelica EC, Shotelersuk V, Aguilar RC et al (1999) Altered trafficking of lysosomal proteins in Hermansky-Pudlak syndrome due to mutations in the beta 3A subunit of the AP-3 adaptor. Mol Cell 3(1):11–21. doi:10.1016/s1097-2765(00)80170-7
16. Adami C (2012) The use of information theory in evolutionary biology. Ann N Y Acad Sci 1256:49–65. doi:10.1111/j.1749-6632.2011.06422.x
Chapter 2
Abstract
Chemical shifts are obtained at the first stage of any protein structural study by NMR spectroscopy.
Chemical shifts are known to be impacted by a wide range of structural factors, and the artificial neural
network based TALOS-N program has been trained to extract backbone and side-chain torsion angles
from 1H, 15N, and 13C shifts. The program is quite robust and typically yields backbone torsion angles for
more than 90 % of the residues and side-chain χ1 rotamer information for about half of these, in addition
to reliably predicting secondary structure. The use of TALOS-N is illustrated for the protein DinI, and
torsion angles obtained by TALOS-N analysis from the measured chemical shifts of its backbone and 13Cβ
nuclei are compared to those seen in a prior, experimentally determined structure. The program is also
particularly useful for generating torsion angle restraints, which then can be used during standard NMR
protein structure calculations.
Key words NMR, Chemical shifts, Protein structure, Side-chain conformation, Artificial neural
network, Secondary structure, Backbone torsion angle
1 Introduction
1.1 Relations Between Chemical Shifts and Protein Structure

The first step of any protein structural study by NMR spectroscopy typically involves assignment of the multitude of NMR resonances to individual nuclei. Originally, for proteins extracted from natural sources, this only involved assignment of the hydrogen NMR spectra [1, 2]. However, due to extensive resonance overlap in 1H NMR spectra, this technology was restricted to relatively small proteins. With advances in molecular biology, the vast majority of today's structural studies focus on cloned proteins, typically overexpressed in Escherichia coli [3–5]. By using suitable isotopically enriched growth media, it then is readily feasible to obtain essentially full incorporation of the NMR-observable stable isotopes 13C and 15N. These nuclei not only are key to dispersing the crowded NMR spectra in three or four orthogonal frequency dimensions, dramatically reducing the resonance overlap problem; the 13C and
Hugh Cartwright (ed.), Artificial Neural Networks, Methods in Molecular Biology, vol. 1260,
DOI 10.1007/978-1-4939-2239-0_2, © Springer Science+Business Media New York 2015
18 Yang Shen and Ad Bax
15N chemical shifts themselves have proven to be important reporters
on local backbone conformation [6–8]. NMR chemical shifts in
proteins are exquisitely sensitive to local conformation. However,
they depend on many different factors, including backbone and
side-chain torsion angles, neighboring residues, ring currents
caused by nearby aromatic groups, hydrogen bonding, electric
fields, local strain and geometric distortions, as well as solvent
exposure [9–15]. This not only has made it difficult to separately
quantify the relation between each of these parameters and the
chemical shift; it also makes it impossible to uniquely attribute
such a structural parameter to any individual chemical shift.
For protein NMR spectroscopy, triple resonance correlation
experiments, which link the resonances of directly bonded 1H, 13C,
and 15N nuclei, are commonly used to assign the chemical shifts of
1H, 13C, and 15N nuclei in proteins [16–18]. The chemical shift
assignment procedure usually consists of two steps: (1) sequence-
specific assignment of the backbone atoms and (2) side-chain
assignments. Nearly complete chemical shift assignments for back-
bone and side-chain atoms are commonly required to assign
nuclear Overhauser enhancement (NOE) spectra, which classically
are used to derive interproton distances that serve as the primary
experimental restraints for calculating the protein structure. The
backbone (1Hα, 13C′, 13Cα, 15N, and 1HN) and 13Cβ chemical shifts,
which are generally obtained in the earliest stage of any protein
NMR study, are particularly useful reporters on local conforma-
tion. Their link to secondary structure, as well as to hydrogen
bonding and χ1 side-chain torsion angles, has been long recog-
nized and has been the focus of both empirical studies as well as
quantum-chemical calculations [11–15, 19, 20].
1.2 Protein Backbone and Side-Chain Conformation from NMR Chemical Shifts
The rapid increase in the number of proteins for which both high-resolution structural coordinates have been deposited in the Protein Data Bank (PDB) [21] and NMR chemical shift assignments are available in the BioMagResBank (BMRB) [22] has stimulated the development of quantitative empirical methods to
study the relation between protein structure and chemical shifts
[23]. Among the wide array of empirical methods, TALOS [20]
and its two successors TALOS+ [24] and TALOS-N [25] have
become particularly widely used for making accurate ϕ/ψ back-
bone torsion angle predictions on the basis of the backbone (13Cα, 13C′, 15N, 1Hα, and 1HN) and 13Cβ chemical shift assignments.
These ϕ/ψ predictions can be used to validate NOE-derived NMR
structures that did not use chemical shift-derived input parameters
or, conversely, to generate additional restraints as input to the pro-
tein structure calculation and refinement protocols.
The original TALOS program (Torsion Angle Likelihood Obtained from Shift and sequence similarity) searches a protein database, consisting originally of only 20 proteins but later expanded to ca. 200 proteins,
Protein Structure from NMR 19
Fig. 1 Flowchart of the TALOS-N program (reproduced from [25] with permission from Springer)
2 Materials
In this chapter, we use the protein DinI [29] to illustrate the use of
TALOS-N for predicting its backbone ϕ/ψ and side-chain χ1 torsion
angles, as well as its secondary structure classification. To follow the
examples, both the TALOS-N software package and an input file
with correctly formatted chemical shift assignments are needed.
2.1 Software Requirements
The TALOS-N software package, including the required binaries for three of the most common operating systems, Linux, Mac OS
X, and Windows, as well as the requisite protein database and
scripts, can be downloaded from http://spin.niddk.nih.gov/bax/
software/TALOS-N/ and installed straightforwardly (see Note 1).
Alternatively, a server version of TALOS-N can be used directly,
without installing the TALOS-N software (http://spin.niddk.nih.
gov/bax/nmrserver/talosn/).
2.2 Data Requirements
An input table containing both the full amino acid sequence and the NMR chemical shift assignments is required, to be prepared
with a specific data format (general purpose NMRPipe table for-
mat). As an example, an excerpt of such a file is shown below for
the protein DinI:
DATA FIRST_RESID 2
DATA SEQUENCE RIEVTIAKT SPLPAGAIDA LAGELSRRIQ
YAFPDNEGHV SVRYAAANNL
DATA SEQUENCE SVIGATKEDK QRISEILQET WESADDWFVS E
VARS RESID RESNAME ATOMNAME SHIFT
FORMAT %4d %1s %4s %8.3f
2 R C 174.123
2 R CA 55.537
2 R CB 32.786
2 R H 8.772
2 R HA 4.994
2 R N 123.394
3 I C 173.941
3 I CA 60.986
The protein’s amino acid sequence should be provided in one
or more lines starting with the tag “DATA SEQUENCE”. Only the
one-character residue name is allowed (see Note 2) and space char-
acters in the sequence are ignored. An optional line beginning with
a tag of “DATA FIRST_RESID” is needed to specify the first resi-
due number of the amino acid sequence listed in the “DATA
SEQUENCE” line if the first residue listed is not residue number 1.
For the chemical shift table, columns for residue number, one-
character residue type (see Note 2), atom name (see Note 3), and
chemical shift value must be included, and their definitions
(“RESID,” “RESNAME,” “ATOMNAME,” and “SHIFT,” respec-
tively) must be predeclared in a line beginning with a “VARS” tag;
a line beginning with a “FORMAT” tag is also required (immedi-
ately after the “VARS” line) to define the data type of each corre-
sponding column of the table.
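For scripted workflows it can be convenient to read such a table programmatically. The sketch below is a minimal, hypothetical Python parser for the layout just described (it is not part of the TALOS-N distribution); it handles the "DATA", "VARS", and "FORMAT" tags and returns the shifts as plain tuples.

```python
def read_talosn_input(path):
    """Parse a TALOS-N chemical shift table (NMRPipe table format).

    Returns (first_resid, sequence, shifts), where shifts is a list of
    (resid, resname, atomname, shift) tuples.
    """
    first_resid, sequence, columns, shifts = 1, "", None, []
    for line in open(path):
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "DATA":
            if fields[1] == "FIRST_RESID":
                first_resid = int(fields[2])
            elif fields[1] == "SEQUENCE":
                sequence += "".join(fields[2:])   # space characters are ignored
        elif fields[0] == "VARS":
            columns = fields[1:]                  # e.g. RESID RESNAME ATOMNAME SHIFT
        elif fields[0] == "FORMAT":
            continue                              # column data types; not needed here
        elif columns:
            row = dict(zip(columns, fields))
            shifts.append((int(row["RESID"]), row["RESNAME"],
                           row["ATOMNAME"], float(row["SHIFT"])))
    return first_resid, sequence, shifts
```

Applied to the DinI excerpt above, this yields first residue 2, the concatenated sequence, and one tuple per assigned shift.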
Note that all chemical shifts used as input for TALOS-N are
required to be properly referenced (see Note 4) to ensure the accu-
racy and reliability of the predictions. If the protein sample used to
collect the NMR chemical shift data is perdeuterated, 2H isotope
3 Methods
3.1 TALOS-N Prediction
The TALOS-N prediction can be performed for DinI with an input chemical shift file named "inCS.tab" by typing the command:
talosn -in inCS.tab
The program first converts the chemical shifts (δ) of each query
residue to its corresponding secondary chemical shifts (Δδ) by sub-
tracting a residue-type-dependent random coil value, as well as
corrections to account for the residue types of its two immediate
neighbors. The converted secondary chemical shifts are stored in a
file named “predAdjCS.tab” (in the “SHIFT” column),
together with the original chemical shifts (“CS_OBS”) and the cor-
responding corrections (“CS_ADJ”, which is the random coil value
including nearest neighbor (i ± 1) residue-type correction) used to
calculate the secondary chemical shifts. To make a ϕ/ψ angle pre-
diction, the converted secondary chemical shifts together with the
amino acid-type information are used as inputs for the (ϕ/ψ)-ANN
to calculate the 324-state ϕ/ψ distribution for each predictable
residue (see Note 7), with the output stored in a file named "predANN.tab". A database search step is then performed to search a
9,523-protein database for the 25 best-matched heptapeptides in
terms of the 324-state ϕ/ψ angle distribution, the secondary
chemical shifts, and the amino acid type. A single file, “predAll.
tab”, is generated in this step to store the information of those
best database matches for each of the residues in the target protein.
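The 324 states mentioned above correspond to dividing the Ramachandran map into an 18 × 18 grid of 20° × 20° ϕ/ψ voxels (18 × 18 = 324). The bookkeeping can be sketched as below; the row-major voxel numbering is an assumption made for illustration and need not match TALOS-N's internal ordering.

```python
VOXEL = 20            # degrees per voxel edge; 360 / 20 = 18 bins per axis
NBINS = 360 // VOXEL  # 18 bins, so 18 * 18 = 324 voxels in total

def phi_psi_to_voxel(phi, psi):
    """Map torsion angles in degrees (-180..180) to a voxel index 0..323.

    Row-major numbering over (phi bin, psi bin) is an illustrative choice.
    """
    i = int((phi + 180.0) % 360.0) // VOXEL  # phi bin, 0..17
    j = int((psi + 180.0) % 360.0) // VOXEL  # psi bin, 0..17
    return i * NBINS + j

# A helical residue (phi ~ -60, psi ~ -45) and an extended residue
# (phi ~ -120, psi ~ 130) land in different voxels of the 324-state grid.
helix_voxel = phi_psi_to_voxel(-60, -45)
sheet_voxel = phi_psi_to_voxel(-120, 130)
```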
A final summarization and quality control step is performed to
identify outliers in the 25 best-matching heptapeptides by evaluat-
ing the clustering of the ϕ/ψ angles of their center residues in the
Ramachandran map or by using the observed ϕ/ψ of a reference
structure if such a structure is available (this requires an additional
option “-ref ref.pdb” in the command line, where “ref.
pdb” is the name for the reference structure). A summary file
“pred.tab” is then created, displaying the average ϕ and ψ values
(in the PHI and PSI columns) and their respective standard devia-
tions (DPHI and DPSI), as well as an aggregate, weighted χ2 score
(DIST, see Eq. 12 of reference [25]), reflecting how well the target
Fig. 2 Graphic TALOS-N inspection interface for protein DinI. (For details, see Subheading 3.2 or the TALOS-N
webpage http://spin.niddk.nih.gov/bax/software/TALOS-N/)
3.2 Manual Inspection and Adjustment
The TALOS-N predictions can be inspected and further adjusted by using a Java graphic program, jrama. Two examples of command line calls of this program are:
jrama -in pred.tab
jrama -in pred.tab -ref DinI.pdb
Figure 2 shows the jrama graphic interface, loaded with the
TALOS-N predicted results for DinI. The left panel of the graphic
interface shows a map of the ϕ/ψ angles of the center residues of
the 25 best-matched heptapeptides in the database (green
squares) and the query residue Thr-6 (blue, depicting the angles
observed in the NMR-derived PDB entry 1GHH), superimposed
on a Ramachandran map, depicting in gray the “most favorable”
ϕ/ψ angles for Thr, i.e., those most commonly observed in high-
resolution crystal structures of a very large array of proteins. The
324 (ϕ/ψ)-ANN-predicted scores for Thr-6 are shown as colored
voxels but only for those that are populated at least one standard
deviation above the average predicted voxel density. The top
right panel displays the amino acid sequence of DinI, with resi-
dues colored according to their ϕ/ψ prediction classification.
Missing predictions (e.g., residue M1) are shown in light gray,
consistent predictions in light or dark green (for “Strong” and
“Generous” predictions, respectively), ambiguous predictions
in yellow, and dynamic residues in blue. Other panels display the RCI-S2 value and the predicted secondary structure (red, α-helix; aqua, β-sheet), with the height of the bars reflecting the probability assigned by the SS-ANN secondary structure prediction. The bottom right panel depicts the χ1 rotamer predic-
tions (red oval, g−; green, g+; yellow, t), with the height of the
ovals corresponding to the probability assigned by the χ1 rota-
meric state prediction.
The TALOS-N prediction (including the summary of
TALOS-N predicted ϕ/ψ angles) is normally performed with the
default parameters and settings. However, the left panel also can
be used to manually adjust the prediction classification of a given
query residue according to a user’s preference. The prediction
files then will be overwritten to reflect any changes made
interactively.
3.3 Generation of Angular Restraints
The TALOS-N output can be converted into ϕ and ψ torsion angle restraints that then can be used directly as input for a con-
ventional protein NMR structure calculation procedure [33, 34].
Two convenient scripts, "talos2dyana.com" and "talos2xplor.com", are included in the TALOS-N software package for
this purpose. These scripts read predicted ϕ and ψ angles from
the TALOS-N prediction summary file “pred.tab” and gener-
ate for each residue with an unambiguous TALOS-N prediction
(classified as “Strong” or “Generous”) a ϕ and a ψ torsion
angle restraint (see Note 9). These torsion angle restraints can be
stored in either CYANA format, as shown below for residues 2
and 3 of DinI:
2 ARG PHI -127.2 -87.2
2 ARG PSI 109.1 149.1
3 ILE PHI -137.2 -97.2
3 ILE PSI 106.4 146.4
or in XPLOR format:
assign (resid 1 and name C )(resid 2 and name N )
(resid 2 and name CA )(resid 2 and name C ) 1.0 -107.2 20.0 2
assign (resid 2 and name N ) (resid 2 and name CA )
(resid 2 and name C ) (resid 3 and name N ) 1.0 129.1 20.0 2
assign (resid 2 and name C ) (resid 3 and name N )
(resid 3 and name CA ) (resid 3 and name C ) 1.0 -117.2 20.0 2
assign (resid 3 and name N ) (resid 3 and name CA )
(resid 3 and name C ) (resid 4 and name N ) 1.0 126.4 20.0 2
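The conversion those scripts perform can be sketched in a few lines of Python. This is a simplified, hypothetical stand-in for "talos2dyana.com" (the distributed script may, for example, scale the bounds by the reported DPHI/DPSI uncertainties rather than use a fixed ±20°, which is what the example output above shows):

```python
def cyana_restraints(predictions, bound=20.0):
    """Generate CYANA-style torsion angle restraint lines.

    predictions: list of (resid, resname, phi, psi, classification) tuples,
    mimicking a few columns of "pred.tab". Only residues with an
    unambiguous classification ("Strong" or "Generous") are converted;
    other class labels used here (e.g. "Ambiguous") are illustrative.
    """
    lines = []
    for resid, resname, phi, psi, cls in predictions:
        if cls not in ("Strong", "Generous"):
            continue  # skip ambiguous/dynamic/missing predictions
        lines.append(f"{resid:4d} {resname} PHI {phi - bound:8.1f} {phi + bound:8.1f}")
        lines.append(f"{resid:4d} {resname} PSI {psi - bound:8.1f} {psi + bound:8.1f}")
    return lines

# Residues 2 and 3 of DinI, using the center values implied by the
# example above; the third entry is filtered out by its classification.
out = cyana_restraints([(2, "ARG", -107.2, 129.1, "Strong"),
                        (3, "ILE", -117.2, 126.4, "Generous"),
                        (4, "GLU", -60.0, -45.0, "Ambiguous")])
```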
4 Notes
Acknowledgments
References
1. Wüthrich K (1986) NMR of proteins and nucleic acids. Wiley, New York
2. Englander SW, Wand AJ (1987) Main-chain-directed strategy for the assignment of 1H NMR spectra of proteins. Biochemistry 26:5953–5958
3. Oh BH, Westler WM, Darba P et al (1988) Protein 13C spin systems by a single two-dimensional nuclear magnetic resonance experiment. Science 240:908–911
4. Ikura M, Kay LE, Bax A (1990) A novel approach for sequential assignment of 1H, 13C, and 15N spectra of larger proteins: heteronuclear triple-resonance three-dimensional NMR spectroscopy. Application to calmodulin. Biochemistry 29:4659–4667
5. Wagner G (1993) Prospects for NMR of large proteins. J Biomol NMR 3:375–385
6. Saito H (1986) Conformation-dependent C13 chemical shifts—a new means of conformational characterization as obtained by high resolution solid state C13 NMR. Magn Reson Chem 24:835–852
7. Wishart DS, Sykes BD, Richards FM (1991) Relationship between nuclear magnetic resonance chemical shift and protein secondary structure. J Mol Biol 222:311–333
8. Spera S, Bax A (1991) Empirical correlation between protein backbone conformation and Ca and Cb 13C nuclear magnetic resonance chemical shifts. J Am Chem Soc 113:5490–5492
9. Haigh CW, Mallion RB (1979) Ring current theories in nuclear magnetic resonance. Prog Nucl Magn Reson Spectrosc 13:303–344
10. Avbelj F, Kocjan D, Baldwin RL (2004) Protein chemical shifts arising from alpha-helices and beta-sheets depend on solvent exposure. Proc Natl Acad Sci U S A 101:17394–17397
11. de Dios AC, Pearson JG, Oldfield E (1993) Secondary and tertiary structural effects on protein NMR chemical shifts—an ab initio approach. Science 260:1491–1496
12. Case DA (1998) The use of chemical shifts and their anisotropies in biomolecular structure determination. Curr Opin Struct Biol 8:624–630
13. Vila JA, Aramini JM, Rossi P et al (2008) Quantum chemical C-13(alpha) chemical shift calculations for protein NMR structure determination, refinement, and validation. Proc Natl Acad Sci U S A 105:14389–14394
14. Kohlhoff KJ, Robustelli P, Cavalli A et al (2009) Fast and accurate predictions of protein NMR chemical shifts from interatomic distances. J Am Chem Soc 131:13894–13895
15. Asakura T, Taoka K, Demura M et al (1995) The relationship between amide proton chemical shifts and secondary structure in proteins. J Biomol NMR 6:227–236
16. Bax A, Grzesiek S (1993) Methodological advances in protein NMR. Acc Chem Res 26:131–138
17. Sattler M, Schleucher J, Griesinger C (1999) Heteronuclear multidimensional NMR experiments for the structure determination of proteins in solution employing pulsed field gradients. Prog Nucl Magn Reson Spectrosc 34:93–158
18. Salzmann M, Wider G, Pervushin K et al (1999) TROSY-type triple-resonance experiments for sequential NMR assignments of large proteins. J Am Chem Soc 121:844–848
19. Wagner G, Pardi A, Wüthrich K (1983) Hydrogen-bond length and 1H-NMR chemical shifts in proteins. J Am Chem Soc 105:5948–5949
20. Cornilescu G, Delaglio F, Bax A (1999) Protein backbone angle restraints from searching a database for chemical shift and sequence homology. J Biomol NMR 13:289–302
21. Berman HM, Kleywegt GJ, Nakamura H et al (2012) The Protein Data Bank at 40: reflecting on the past to prepare for the future. Structure 20:391–396
22. Markley JL, Ulrich EL, Berman HM et al (2008) BioMagResBank (BMRB) as a partner in the Worldwide Protein Data Bank (wwPDB): new policies affecting biomolecular NMR depositions. J Biomol NMR 40:153–155
23. Wishart DS (2011) Interpreting protein chemical shift data. Prog Nucl Magn Reson Spectrosc 58:62–87
24. Shen Y, Delaglio F, Cornilescu G et al (2009) TALOS+: a hybrid method for predicting
Chapter 3
Abstract
Microbial communities are found in nearly all environments and play a critical role in defining ecosystem
services. Understanding the relationship between these microbial communities and their environment is
essential for prediction of community structure, robustness, and response to ecosystem changes. Microbial
Assemblage Prediction (MAP) describes microbial community structure as an artificial neural network
(ANN) that models the microbial community as functions of environmental parameters and community
intra-microbial interactions. MAP models can be used to predict community assemblages over a wide
range of possible environmental parameters, extrapolate the results of point observations across spatial
scales, and make predictions about how microbial communities may fluctuate as the result of changes in
their environment.
Key words Modeling, Microbial assemblage prediction, Microbial ecology, Systems biology,
Metagenomics
1 Introduction
The submitted manuscript has been created by UChicago Argonne, LLC, operator of Argonne National
Laboratory (“Argonne”). Argonne, a US Department of Energy Office of Science laboratory, is operated
under contract no. DE-AC02-06CH11357. The US Government retains for itself, and others acting on
its behalf, a paid-up nonexclusive, irrevocable worldwide license in the said article to reproduce and pre-
pare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on
behalf of the government.
Hugh Cartwright (ed.), Artificial Neural Networks, Methods in Molecular Biology, vol. 1260,
DOI 10.1007/978-1-4939-2239-0_3, © Springer Science+Business Media New York 2015
34 Peter Larsen et al.
Fig. 1 Two steps to generate MAP model. Generation of MAP model requires as
input a set of related microbial population structures and corresponding environ-
mental parameter data. The first step is to generate the environmental interac-
tion network (EIN) (left panel). The EIN is the set of identified statistical
relationships between environmental parameters and microbial taxa abundances
and between microbial taxa abundances, presented as a directed acyclic graph.
In the small, theoretical example above, there are three environmental parame-
ters and four microbial taxa represented in the network. In an EIN, root nodes are
always environmental parameters. The second step of generating a MAP model
(right panel) is fitting the data to equations derived from the EIN. Each value of a
node in the EIN is defined to be a function of the values of its parent nodes. In the
complete MAP model, the relative abundance of every taxa node in the network
is ultimately a function of the value of environmental parameter nodes
2 Materials
2.1 Collected Biological Observations
Two kinds of data are required in MAP models: microbial population structure and environmental parameter data (see Note 1).
While the list of possible environments and techniques for
metagenomic sampling is outside the scope of this protocol,
methods for metagenomic sample collection and sequencing are
reviewed by Thomas et al. [11]. Raw metagenomic data must be
analyzed to identify the relative abundance of specific taxonomic
groups present in the microbial populations. There are many avail-
able computational tools for this, but "QIIME" for 16S rRNA
sequence data (http://qiime.org/) [15] and "MG-RAST" for shot-
gun metagenomic data (http://metagenomics.anl.gov/) [16] are
examples of excellent and freely available software solutions. The
population data needs to be binned to some level of taxonomy
(e.g., order, class, or genus). The selection of specific taxonomic
level to which data is binned depends greatly on the nature of the
microbial population being modeled but generally should be
Fig. 2 Examples of MAP model applications. MAP models can be used to investigate microbial populations at
a variety of temporal and spatial scales. The included examples are drawn from the MAP model and data originally
presented in Larsen et al. [14]. (a) In a MAP model extrapolated across time, marine microbial populations are
predicted for a single location over a total of 6 years. For every day in this model, environmental parameters
at a single location in the Western English Channel were inferred from satellite imagery and used as input to a
MAP model. In this figure, gray lines indicate changing relative abundance of individual taxa. Black “X”s indi-
cate the Bray-Curtis similarity score between the predicted and observed population structure at every time
point. (b) In this MAP model extrapolated over space, MAP model generated from data collected at a single
point is used to extrapolate the population structure over a wide geographical area using satellite image mod-
eling data as environmental parameters. In this figure, height of surface is proportional to relative abundance
of the taxa Rickettsiales in the surface waters of the Western English Channel on March 17, 2008. The geo-
graphical region pictured is between 48.5°N, 7.6°W and 50.8°N, 1.5°W. (c) In this figure, MAP model is used
to extrapolate over multiple environmental conditions. Using a MAP model generated using satellite image
data collected between 2007 and 2008, an initial microbial population structure (black bars) was generated
for August 20th, 2008. In the figure, y-axis is percent abundance of total population. X-axis is the set of 24
bacterial taxa in this MAP model. Starting with these observed environmental parameters, environmental
parameters were changed in silico until a “bloom” of Vibrionales was observed in microbial population without
greatly altering the abundance of all other microbial taxa (gray bars, Vibrionales abundance highlighted with
arrow). Strikingly, the conditions required to generate an August Vibrionales bloom in MAP model (i.e., lower
soluble phosphorus, increase in chlorophyll A concentration, and moderate levels of dissolved organic carbon)
are nearly identical to environmental parameters associated with a Vibrionales bloom in August 2003
selected by using the highest taxonomic level for which, in the majority of observations across all populations, the observed abundance of a taxon is nonzero.
Each microbial population structure observation must be
accompanied by a set of environmental parameters. The nature of
this data may vary widely depending on the nature of the environ-
ment being sampled and the scale of the desired MAP model but
may include chemical and physical attributes collected from in situ
observations, data collected from automated sensors deployed in
the environment, or remotely sensed data such as by satellite image.
Predicting Bacterial Community Assemblages 37
2.2 Required Computational Tools
Some computational tools are required to generate MAP models, and specific tools are recommended here for use in construction of
MAP models (see Note 2).
For generating EIN, the Bayesian Network Inference tool
“Banjo” [17] is recommended. Banjo is a software application and
framework for structure learning of static and dynamic Bayesian
networks using a score-based structure inference. Banjo is freely
available under a noncommercial use license (http://www.cs.duke.
edu/~amink/software/banjo/).
For generating the equations in the ANN from the topology of
the EIN, Eureqa is recommended. “Eureqa Formulize” is a soft-
ware package available from Nutonian Inc. (http://www.nutonian.
com/), which is freely available for academic research (http://
www.nutonian.com/products/eureqa/). As described by Eureqa
documentation, this analysis tool searches the space of mathemati-
cal equations using symbolic regression to identify a mathematical
model that best fits the data provided. Starting with initial expres-
sions that randomly couple mathematical building blocks, both the
form and parameters of possible equations are generated by recom-
bining previous equations and varying their sub-expressions using
a statistics-based evolutionary algorithm. The resulting models are
ranked using a ratio of equation accuracy and complexity.
3 Methods
There are two steps for construction of the MAP model: genera-
tion of EIN and fitting equations to ANN. Before beginning, the
available data should be divided into training and validation sub-
sets. The training subset is used to construct the MAP model, whose accuracy is then verified with the validation subset.
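A minimal way to make this split can be sketched in Python; the 80/20 fraction and the fixed seed below are illustrative choices, not prescribed by the protocol.

```python
import random

def split_samples(sample_ids, validation_fraction=0.2, seed=0):
    """Randomly partition samples into disjoint training and validation subsets.

    The chapter only requires that the two subsets be disjoint; the
    fraction and seed here are arbitrary illustrative defaults.
    """
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)       # reproducible shuffle
    n_val = max(1, int(len(ids) * validation_fraction))
    return ids[n_val:], ids[:n_val]        # (training, validation)

train, val = split_samples(range(50))
```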
3.2 Generate EIN from Normalized Data
The following steps are specific to Banjo software (see Note 4). Before generating the BN network in Banjo, a file of unallowed edges
must be made (“edgesNotAllowed” parameter in Banjo settings file).
This parameter must be set such that environmental parameter nodes
3.3 Generate ANN from EIN
The same normalized dataset used to generate the EIN is used to generate equations for the MAP model ANN. Specifics for running
Eureqa on particular platforms are available from the Nutonian
Eureqa website (http://www.nutonian.com/products/eureqa/).
From the Banjo EIN output, the network topology, in the format of a list of parent nodes for each child node, is converted into Eureqa input.
The operations “constant,” “add,” “subtract,” “multiply,” “divide,”
and “exponential” are recommended. As Eureqa outputs several
possible functions that fit the observed data, the equation selected for use in MAP should be the most parsimonious one, satisfying as many as possible of the following optimization criteria:
– Smallest equation size.
– Highest correlation with observed data.
– If there is a choice between an equation with an exponential
term and one without, choose the equation without an expo-
nential term.
– If there is an obvious peak or drop in observed relative bacte-
rial abundance, choose the equation that best models that peak
3.4 Implement MAP Model
The results of the previous step are a set of linked mathematical equations in which the results of some equations will be the terms
in other equations. Taken together, given a single set of normal-
ized environmental parameter data, values for all taxonomic rela-
tive abundances can be calculated with a unique solution.
Implementation of this model can be easily accomplished in a
number of mathematical programming platforms (e.g., R-Project
(http://www.r-project.org/) or MatLab (http://www.mathworks.
com/products/matlab/)) or common spreadsheet applications.
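As a sketch of what such an implementation looks like (here in Python rather than R or MatLab; the node names and equations are invented for illustration), the linked equations can be evaluated by memoized recursion over the EIN, which is acyclic and therefore admits a unique solution:

```python
def evaluate_map_model(equations, environment):
    """Evaluate a MAP model expressed as a DAG of node equations.

    equations: {node: (parents, fn)} where fn maps parent values to the
    node's relative abundance; environment: {node: value} for the root
    (environmental parameter) nodes. Because the EIN is acyclic, a
    memoized depth-first evaluation yields the unique solution.
    """
    values = dict(environment)

    def value(node):
        if node not in values:
            parents, fn = equations[node]
            values[node] = fn(*(value(p) for p in parents))
        return values[node]

    return {node: value(node) for node in equations}

# A toy EIN (all names and coefficients hypothetical): two environmental
# roots and two taxa, one of which depends on the other.
eqs = {
    "taxonA": (["temp", "phosphate"], lambda t, p: 0.1 * t + 0.5 * p),
    "taxonB": (["taxonA"], lambda a: 1.0 - a),
}
abundances = evaluate_map_model(eqs, {"temp": 2.0, "phosphate": 0.4})
```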
3.5 Validate MAP Model
Once the MAP model has been generated, the model must be validated. Similarity between predicted and observed population structures in the
validation subset can be calculated by a variety of means, such as
Pearson’s correlation coefficient, product moment correlation
coefficient, or Bray-Curtis similarity score [19]. The specific simi-
larity metric must be chosen depending on the distribution of the
data used to generate the model. Due to the typical distribution of
microbial population data, in which a few taxa are consistently of
high relative abundance and many taxa are consistently of low rela-
tive abundance [18], it is possible to get deceptively high correla-
tions between predicted and observed population structures so
long as the model correctly identifies the small number of high-
abundance taxa. To correct for this, there are two null models that
can be easily tested. The first null model is to set all taxa’s predicted
relative abundance equal to the average taxa abundance across all
samples. The other null model is to set all taxa abundances equal to
the minimum observed taxa’s abundance. If either of these null
models has a better correlation with biological observations than
the MAP model, it is likely that the MAP model is not identifying
useful relationships in the data (see Note 5).
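The similarity metric and both null models described above can be written out directly; the following is a minimal Python sketch (the chapter's own analyses used the Bray-Curtis score [19]):

```python
def bray_curtis_similarity(x, y):
    """Bray-Curtis similarity between two nonnegative abundance vectors
    (1.0 = identical composition, 0.0 = no shared abundance)."""
    total = sum(x) + sum(y)
    if total == 0:
        return 1.0
    return 1.0 - sum(abs(a - b) for a, b in zip(x, y)) / total

def null_mean(observations):
    """Null model 1: every taxon predicted at its average abundance
    across all samples."""
    n_taxa = len(observations[0])
    return [sum(s[i] for s in observations) / len(observations)
            for i in range(n_taxa)]

def null_minimum(observations):
    """Null model 2: every taxon predicted at the minimum observed
    abundance."""
    floor = min(min(s) for s in observations)
    return [floor] * len(observations[0])

# A MAP prediction is only informative if its similarity to the observed
# population exceeds that of both null-model vectors.
```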
To assign a level of statistical significance to the MAP model
results, a bootstrap-style approach is recommended (see Note 6).
In a bootstrap-like method, the model data is randomly permuted
a large number of times. The similarity of the observed data to the permuted data is calculated at each iteration. The significance of the
model can be reported as:
(# of iterations in which the random-permutation similarity ≥ the model similarity) / (total # of iterations)
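This ratio can be implemented as a simple permutation test. The sketch below (helper names are hypothetical) shuffles the observed abundances and counts how often a random permutation matches or beats the model's similarity score:

```python
import random

def permutation_p_value(model_similarity, observed, similarity,
                        n_iter=1000, seed=0):
    """Fraction of random permutations of the observed abundance vector
    whose similarity to the original meets or exceeds the model's score."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_iter):
        permuted = list(observed)
        rng.shuffle(permuted)
        if similarity(permuted, observed) >= model_similarity:
            hits += 1
    return hits / n_iter
```

A small value (e.g., below 0.05) indicates that the MAP model's agreement with the observations is unlikely to have arisen by chance.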
4 Notes
References
1. Hugenholtz P, Goebel BM, Pace NR (1998) Impact of culture-independent studies on the emerging phylogenetic view of bacterial diversity. J Bacteriol 180:4765–4774
2. Barns SM, Takala SL, Kuske CR (1999) Wide distribution and diversity of members of the bacterial kingdom Acidobacterium in the environment. Appl Environ Microbiol 65:1731–1737
3. Shtarkman YM et al (2013) Subglacial Lake Vostok (Antarctica) accretion ice contains a diverse set of sequences from aquatic, marine and sediment-inhabiting bacteria and eukarya. PLoS One 8(7):e67221
4. Fierer N et al (2008) Short-term temporal variability in airborne bacterial and fungal populations. Appl Environ Microbiol 74(1):200–207
5. Bowers RM et al (2009) Characterization of airborne microbial communities at a high-elevation site and their potential to act as atmospheric ice nuclei. Appl Environ Microbiol 75(15):5121–5130
6. Takai K et al (2001) Archaeal diversity in waters from deep South African gold mines. Appl Environ Microbiol 67(12):5750–5760
7. Edwards RA et al (2006) Using pyrosequencing to shed light on deep mine microbial ecology. BMC Genomics 7:57
8. Teske A, Sorensen KB (2008) Uncultured archaea in deep marine subsurface sediments: have we caught them all? ISME J 2(1):3–18
9. Baune C, Bottcher ME (2010) Experimental investigation of sulphur isotope partitioning during outgassing of hydrogen sulphide from diluted aqueous solutions and seawater. Isotopes Environ Health Stud 46(4):444–453
10. Bardgett RD, Freeman C, Ostle NJ (2008) Microbial contributions to climate change through carbon cycle feedbacks. ISME J 2(8):805–814
11. Thomas T, Gilbert J, Meyer F (2012) Metagenomics - a guide from sampling to data analysis. Microb Inform Exp 2(1):3
12. Larsen PE, Gibbons SM, Gilbert JA (2012) Modeling microbial community structure and functional diversity across time and space. FEMS Microbiol Lett 332(2):91–98
13. Larsen P, Hamada Y, Gilbert J (2012) Modeling microbial communities: current, developing, and future technologies for predicting microbial community interaction. J Biotechnol 160(1–2):17–24
14. Larsen PE, Field D, Gilbert JA (2012) Predicting bacterial community assemblages using an artificial neural network approach. Nat Methods 9(6):621–625
15. Caporaso JG et al (2010) QIIME allows analysis of high-throughput community sequencing data. Nat Methods 7(5):335–336
16. Meyer F et al (2008) The metagenomics RAST server—a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinform 9:386
17. Smith VA et al (2006) Computational inference of neural information flow networks. PLoS Comput Biol 2(11):e161
18. Sogin ML et al (2006) Microbial diversity in the deep sea and the underexplored "rare biosphere". Proc Natl Acad Sci U S A 103(32):12115–12120
19. Mumby PJ, Clarke KR, Harborne AR (1996) Weighting species abundance estimates for marine resource assessment. Aquat Conserv 6(3):115–120
Chapter 4
Abstract
Bacteria are among the world's most dangerous and deadly pathogens for mankind, nowadays
giving rise to significant public health concerns. Given the prevalence of these microbial pathogens and
their increasing resistance to existing antibiotics, there is a pressing need for new antibacterial drugs.
However, development of a successful drug is a complex, costly, and time-consuming process. Quantitative
Structure-Activity Relationships (QSAR)-based approaches are valuable tools for shortening the time of
lead compound identification but also for focusing and limiting time-costly synthetic activities and in vitro/
vivo evaluations. QSAR-based approaches, supported by powerful statistical techniques such as artificial
neural networks (ANNs), have evolved to the point of integrating dissimilar types of chemical and biologi-
cal data. This chapter reports an overview of the current research and potential applications of QSAR
modeling tools toward the rational design of more efficient antibacterial agents. Particular emphasis is
given to the setup of multitasking models along with ANNs aimed at jointly predicting different antibacte-
rial activities and safety profiles of drugs/chemicals under diverse experimental conditions.
Key words Antibacterial activity, Drug resistance, QSAR, Topological indices, Ontology, Moving
average approach, Artificial neural networks, mtk-QSBER models
1 Introduction
Hugh Cartwright (ed.), Artificial Neural Networks, Methods in Molecular Biology, vol. 1260,
DOI 10.1007/978-1-4939-2239-0_4, © Springer Science+Business Media New York 2015
46 A. Speck-Planche and M.N.D.S. Cordeiro
2.1 Databases and Handling of the Retrieved Data

Databases are typically organized as public sources and contain large collections of data describing the most relevant aspects of phenomena and/or experiments [34, 35]. A well-known database is ChEMBL [34], a free and dynamically updated online resource.
2.2 Molecular Descriptors and Their Applicability in QSAR Approaches

Molecular descriptors lie at the heart of any QSAR study. As Todeschini and Consonni put it nicely some years ago [37]: "The molecular descriptor is the final result of a logic and mathematical procedure which transforms chemical information encoded within a symbolic representation of a molecule into a useful number or the result of some standardised experiment."
The choice of molecular descriptors depends on the kind of QSAR approach that is going to be employed. Descriptors used in classical QSAR approaches may be experimentally determined physicochemical properties (e.g., the Hammett σ constant, refractivity, etc.), as in Hansch analysis, or simple indicator descriptors encoding R-group substitutions on a parent core, as in Free-Wilson analysis [38]. 3D-QSAR
approaches have emerged as a natural extension to the classical
QSAR approaches pioneered by Hansch and Free-Wilson and are
based on correlating the activity with non-covalent molecular
interaction fields [38]. These approaches necessarily require 3D structures, e.g., based on protein crystallography or molecule superimposition, and assume that all molecules interact with the bio-target in the same manner, i.e., with an identical binding mode.
In 3D-QSAR approaches, pair potentials (e.g., Lennard-Jones) are
calculated for the whole molecule to model the effects of shape
and size (steric fields), the presence of hydrophobic/hydrophilic
regions, and the behavior of the electronic density in different
parts of the molecule (electrostatic fields). CoMFA (Comparative
Molecular Field Analysis) and CoMSIA (Comparative Molecular
Similarity Index Analysis) [39] are the most applied 3D-QSAR
approaches, in particular to support docking calculations. Over time, 3D-QSAR approaches have evolved into improved variants: 4D-QSAR, which additionally includes ensembles of ligand conformations and protonation states [40]; 5D-QSAR, which adds induced-fit effects of the adaptation of the receptor-binding pocket to the ligand [41]; and 6D-QSAR, which further incorporates different solvation models [42].
As an alternative to 3D-QSAR approaches, the structures of the molecules can be optimized without any superimposition or alignment. Toward that end, calculations based on, e.g., density functional theory (DFT) and molecular dynamics (MD) are carried out [43], and both local descriptors (charge, polarizability, HOMO-/LUMO-based parameters) and global descriptors (different energies, volumes, and areas) can be determined and used to obtain correlations with biological activities. Furthermore, highly accurate quantum-mechanical calculations allow the study of more specific interactions, such as the strength of hydrogen bonds, the stability of coordination bonds, etc.
Several works using 3D-QSAR approaches or QM calculations
have been reported in the field of antimicrobial research [44–48].
2.3 Modeling Techniques

The last stage pertains to the creation of a statistically significant QSAR model, in which the molecular descriptors serve as the independent variables and the desired endpoint response(s) as the dependent variable(s). For that purpose, many different modeling techniques are available in a variety of software packages. This section focuses on two of the statistical techniques most commonly applied to generate QSAR models. The first group embraces regression approaches such as partial least squares (PLS) [33], fundamentally used in 3D-QSAR studies, and traditional multiple linear regression (MLR) [38]. The second group comprises classification approaches such as linear discriminant analysis (LDA) and ANNs.
One very important aspect should be highlighted for anyone retrieving a large volume of data to develop a QSAR model. Databases are usually compilations of biological and chemical data reported in the literature, determined in different laboratories by diverse workers using several types of assay protocols; this makes it difficult to estimate the associated experimental errors. Because regression techniques try to predict the actual response values of the compounds, it is very difficult to find a good predictive QSAR model with such techniques when working with a large and heterogeneous dataset of chemicals, since the uncertainty in the data is almost impossible to control. Classification techniques, by contrast, aim to generate lines, planes, hyperplanes, and/or surfaces able to separate compounds into different groups (e.g., active/inactive). One has then only to preestablish the cutoff value(s) of the endpoint response(s) under study and categorize the compounds accordingly. Our personal experience, together with diverse published studies, has led us to conclude that when classification techniques are employed, even for the prediction of diverse
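This cutoff-based categorization can be sketched in a few lines of Python. The endpoints, cutoff values, and direction conventions below are entirely hypothetical, chosen only to illustrate the idea of preestablishing a cutoff per endpoint; the chapter does not prescribe any particular implementation:

```python
# Hypothetical cases: each endpoint value was measured under its own protocol.
cases = [
    {"compound": "C1", "endpoint": "MIC",  "value": 2.0},    # uM, lower is better
    {"compound": "C2", "endpoint": "MIC",  "value": 40.0},
    {"compound": "C3", "endpoint": "LD50", "value": 150.0},  # mg/kg, higher is safer
]

# Preestablished cutoff per endpoint (illustrative values only).
cutoffs = {"MIC": 10.0, "LD50": 100.0}
higher_is_better = {"MIC": False, "LD50": True}

def classify(case):
    """Assign +1 (positive) or -1 (negative) by comparing against the cutoff."""
    cut = cutoffs[case["endpoint"]]
    if higher_is_better[case["endpoint"]]:
        return 1 if case["value"] >= cut else -1
    return 1 if case["value"] <= cut else -1

labels = [classify(c) for c in cases]
print(labels)  # [1, -1, 1]
```

Once the data are categorized this way, the uncertain absolute magnitudes no longer matter; only the side of the cutoff does, which is why classification tolerates heterogeneous data better than regression.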
ANN-Based Mtk-Model for Discovery of Antibacterial Agents 53
3.1 Curation of the ChEMBL Database

In the work referred to above [27], 13,073 endpoints belonging to more than 9,000 chemicals/drugs were retrieved from ChEMBL in an Excel-compatible file format [34]. Chemicals for which data were incomplete were eliminated, as were duplicates (i.e., the same chemicals assayed under the same experimental conditions), leading to a final dataset comprising 8,560 chemicals/drugs and 10,918 statistical cases, considering the various experimental conditions (cj). The experimental conditions cj cover an ontology of the form cj ⇒ (me, bt, ai, lc), where me describes the different measures of biological effects (anti-enterococci activity or toxicity),
3.2 Moving Average Approach

A useful predictive mtk-QSBER model clearly depends on the molecular structure of the compound and on the experimental conditions/ontology cj under which the compound was assayed. But if the original descriptors are calculated as previously described (see Subheading 2.2), one will not be able to differentiate the biological effects for a given molecule when different elements of the ontology cj are modified. For this reason, a new set of molecular descriptors was computed by applying the moving average approach [73]:
ΔDi(cj) = Di − Di(cj)avg    (1)
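A minimal Python sketch of Eq. 1 with invented descriptor values may help. One assumption is made explicit here: the average Di(cj)avg is taken over all cases assayed under the same condition cj, a simplification for illustration; the precise reference set entering the average is defined in the original work [27, 73]:

```python
from collections import defaultdict

# Invented (descriptor value Di, ontology element cj) pairs.
cases = [(1.0, "me1"), (3.0, "me1"), (2.0, "me2"), (6.0, "me2")]

# Average of Di over the cases sharing each condition cj.
sums, counts = defaultdict(float), defaultdict(int)
for di, cj in cases:
    sums[cj] += di
    counts[cj] += 1
avg = {cj: sums[cj] / counts[cj] for cj in sums}

# Eq. 1: condition-dependent descriptors dDi(cj) = Di - Di(cj)avg
delta = [di - avg[cj] for di, cj in cases]
print(delta)  # [-1.0, 1.0, -2.0, 2.0]
```

The same molecule now carries different descriptor values under different experimental conditions, which is precisely what allows one model to differentiate the biological effects across the ontology.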
3.3 Setup of the mtk-QSBER Model

Firstly, the dataset under study (10,918 cases) was randomly divided into two series: training and prediction sets. The training set contained 8,298 cases, 4,217 assigned as positive and 4,081 as negative. The prediction or external set included 2,620 cases, 1,353 positive and 1,267 negative cases.
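The random partition described above can be sketched as follows; integer indices stand in for the real (compound, cj) cases, and the seed and shuffling routine are arbitrary choices, not those of the original study:

```python
import random

random.seed(0)                      # any fixed seed; for reproducibility only
cases = list(range(10918))          # placeholders for the 10,918 statistical cases
random.shuffle(cases)

# ~76 % for training, the remainder as the external prediction set
train, prediction = cases[:8298], cases[8298:]
print(len(train), len(prediction))  # 8298 2620
```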
Then, the original molecular descriptors Di were calculated—
i.e., in this case, MODESLAB spectral moments [70] and those of
the type ΔDi(cj) determined by applying Eq. 1. The total final
Table 1
Comparative analysis of the different ANNs exploited in this study
Table 2
Quality and classification performance of the mtk-QSBER model
3.4 Virtual Screening of Antibacterial Agents

Many scientists argue that QSAR models are not fully validated until novel compounds have been synthesized and biologically tested. Yet a QSAR model can reasonably be considered validated if the external cases (not used to build the model) are correctly predicted. That is the
major reason for splitting the datasets into training and prediction
series. The critical point here, however, is the chemical space that
a derived model can cover. The greater the number of different
families of compounds used to build the model, the greater will be
the chemical space in which the model can be used prospectively.
Besides, a well-validated QSAR model should also show promising results when applied to the virtual screening of chemicals/drugs. The best way to demonstrate the promise of the
present model was to predict the multiple biological effects of the
antibacterial agent known as BC-3781 [84]. This investigational
drug was reported to exhibit high inhibitory activity against dif-
ferent strains belonging to the genus Enterococcus. By applying
our model, BC-3781 was predicted as a possible antibacterial
agent against different enterococci using multiple experimental
conditions, in good agreement with the experimental evidence.
No toxicological data could be obtained from the literature for this investigational drug, but the mtk-QSBER model led us to infer that BC-3781 should not be harmful to laboratory animals and should, in principle, be toxicologically safe for humans as well. Notice that although there are still no experimental data on the toxicity of BC-3781, our toxicity predictions are consistent with the facts that this compound is already undergoing phase II clinical trials and that no significant toxic/adverse effects have been reported so far in humans.
Acknowledgments
References
1. Grayson ML, Crowe SM et al (eds) (2010) Kucers' the use of antibiotics. A clinical review of antibacterial, antifungal, antiparasitic, and antiviral drugs, 6th edn. CRC Press, Taylor & Francis Group, LLC, Boca Raton, FL
2. Shatalin K, Shatalina E et al (2011) H2S: a universal defense against antibiotics in bacteria. Science 334:986–990
3. Cordero OX, Wildschutte H et al (2012) Ecological populations of bacteria act as socially cohesive units of antibiotic production and resistance. Science 337:1228–1231
4. Rossolini GM, Mantengoli E (2008) Antimicrobial resistance in Europe and its potential impact on empirical therapy. Clin Microbiol Infect 14(Suppl 6):2–8
5. Gonzales R, Corbett KK et al (2008) Drug resistant infections in poor countries: a shrinking window of opportunity. BMJ 336:948–949
6. Lautenbach E, Abrutyn E (2009) Healthcare-acquired bacterial infections. In: Brachman PS, Abrutyn E (eds) Bacterial infections of humans: epidemiology and control, 4th edn. Springer Science + Business Media, LLC, New York, NY, pp 543–575
7. Rigottier-Gois L, Alberti A et al (2011) Large-scale screening of a targeted Enterococcus faecalis mutant library identifies envelope fitness factors. PLoS One 6:e29023
8. Tenover FC, McGowan JE Jr (2009) The epidemiology of bacterial resistance to antimicrobial agents. In: Brachman PS, Abrutyn E (eds) Bacterial infections of humans: epidemiology and control, 4th edn. Springer Science + Business Media, LLC, New York, NY, pp 91–104
9. Feuerriegel S, Oberhauser B et al (2012) Sequence analysis for detection of first-line drug resistance in Mycobacterium tuberculosis strains from a high-incidence setting. BMC Microbiol 12:90
10. Lienhardt C, Glaziou P et al (2012) Global tuberculosis control: lessons learnt and future prospects. Nat Rev Microbiol 10:407–416
11. Hann MM, Oprea TI (2004) Pursuing the leadlikeness concept in pharmaceutical research. Curr Opin Chem Biol 8:255–263
12. Lazo JS, Wipf P (2000) Combinatorial chemistry and contemporary pharmacology. J Pharmacol Exp Ther 293:705–709
13. Bleicher KH, Bohm HJ et al (2003) Hit and lead generation: beyond high-throughput screening. Nat Rev Drug Discov 2:369–378
14. Hansch C, Leo A (1995) Exploring QSAR: fundamentals and applications in chemistry and biology. American Chemical Society, Washington, DC
15. Jorgensen WL (2004) The many roles of computation in drug discovery. Science 303:1813–1818
16. Oprea T (2005) Chemoinformatics in drug discovery. WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim
17. Brown N, Lewis RA (2006) Exploiting QSAR methods in lead optimization. Curr Opin Drug Discov Devel 9:419–424
18. Borchardt RT, Kerns EH et al (eds) (2006) Optimizing the "drug-like" properties of leads in drug discovery. Springer Science + Business Media, LLC, New York, NY
19. Croes S, Koop AH et al (2012) Efficacy, nephrotoxicity and ototoxicity of aminoglycosides, mathematically modelled for modelling-supported therapeutic drug monitoring. Eur J Pharm Sci 45:90–100
20. Hau J, Schapiro SJ (2011) Handbook of laboratory animal science: essential principles and practices. CRC Press, Taylor & Francis Group, LLC, Boca Raton, FL
21. Vina D, Uriarte E et al (2009) Alignment-free prediction of a drug-target complex network based on parameters of drug connectivity and protein sequence of receptors. Mol Pharm 6:825–835
22. Prado-Prado FJ, Garcia-Mera X et al (2010) Multi-target spectral moment QSAR versus ANN for antiparasitic drugs against different parasite species. Bioorg Med Chem 18:2225–2231
23. Garcia I, Fall Y et al (2011) First computational chemistry multi-target model for anti-Alzheimer, anti-parasitic, anti-fungi, and anti-bacterial activity of GSK-3 inhibitors in vitro, in vivo, and in different cellular lines. Mol Divers 15:561–567
24. Speck-Planche A, Kleandrova VV et al (2012) Fragment-based approach for the in silico discovery of multi-target insecticides. Chemometr Intell Lab Syst 111:39–45
25. Speck-Planche A, Kleandrova VV et al (2012) In silico discovery and virtual screening of multi-target inhibitors for proteins in Mycobacterium tuberculosis. Comb Chem High Throughput Screen 15:666–673
26. Speck-Planche A, Kleandrova VV et al (2012) Chemoinformatics in anti-cancer chemotherapy: multi-target QSAR model for the in silico discovery of anti-breast cancer agents. Eur J Pharm Sci 47:273–279
27. Speck-Planche A, Cordeiro MNDS (2013) Artificial neural networks for simultaneous prediction of anti-enterococci activities and toxicological profiles. In: Chemoinformatics in drug design, Proceedings of the 5th International joint conference on computational intelligence, NCTA-International conference on neural computation theory and applications, Vilamoura, Algarve, Portugal, 20–22 Sept, pp 458–465
28. Luan F, Cordeiro MNDS et al (2013) TOPS-MODE model of multiplexing neuroprotective effects of drugs and experimental-theoretic study of new 1,3-rasagiline derivatives potentially useful in neurodegenerative diseases. Bioorg Med Chem 21:1870–1879
29. Tenorio-Borroto E, Penuelas Rivas CG et al (2012) ANN multiplexing model of drugs effect on macrophages; theoretical and flow cytometry study on the cytotoxicity of the antimicrobial drug G1 in spleen. Bioorg Med Chem 20:6181–6194
30. Speck-Planche A, Kleandrova VV et al (2013) New insights toward the discovery of antibacterial agents: multi-tasking QSBER model for the simultaneous prediction of anti-tuberculosis activity and toxicological profiles of drugs. Eur J Pharm Sci 48:812–818
31. Speck-Planche A, Kleandrova VV et al (2013) Chemoinformatics for rational discovery of safe antibacterial drugs: simultaneous predictions of biological activity against streptococci and toxicological profiles in laboratory animals. Bioorg Med Chem 21:2727–2732
32. Speck-Planche A, Cordeiro MNDS (2013) Simultaneous modeling of antimycobacterial activities and ADMET profiles: a chemoinformatic approach to medicinal chemistry. Curr Top Med Chem 13:1656–1665
33. van de Waterbeemd H (1995) Chemometrics methods in molecular design. VCH Publishers, Weinheim
34. Gaulton A, Bellis LJ et al (2012) ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 40:D1100–D1107
35. Knox C, Law V et al (2011) DrugBank 3.0: a comprehensive resource for 'omics' research on drugs. Nucleic Acids Res 39:D1035–D1041
36. Mok NY, Brenk R (2011) Mining the ChEMBL database: an efficient chemoinformatics workflow for assembling an ion channel-focused screening library. J Chem Inf Model 51:2449–2454
37. Todeschini R, Consonni V (2000) Handbook of molecular descriptors. WILEY-VCH Verlag GmbH, Weinheim
38. Kubinyi H (1993) QSAR: Hansch analysis and related approaches. VCH Publishers, Weinheim
39. Kubinyi H, Folkers G et al (eds) (2002) 3D QSAR in drug design: recent advances. Kluwer Academic Publishers, New York
40. Klein CD, Hopfinger AJ (1998) Pharmacological activity and membrane interactions of antiarrhythmics: 4D-QSAR/QSPR analysis. Pharm Res 15:303–311
41. Vedani A, Dobler M (2002) 5D-QSAR: the key for simulating induced fit? J Med Chem 45:2139–2149
42. Vedani A, Dobler M et al (2005) Combining protein modeling and 6D-QSAR. Simulating the binding of structurally diverse ligands to the estrogen receptor. J Med Chem 48:3700–3703
43. Carloni P, Alber F (eds) (2003) Quantum medicinal chemistry. WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim
44. Li P, Yin J et al (2013) Synthesis, antibacterial activities, and 3D-QSAR of sulfone derivatives containing 1,3,4-oxadiazole moiety. Chem Biol Drug Des 82:546–556
45. Lu X, Lv M et al (2012) Pharmacophore and molecular docking guided 3D-QSAR study of bacterial enoyl-ACP reductase (FabI) inhibitors. Int J Mol Sci 13:6620–6638
46. Uddin R, Lodhi MU et al (2012) Combined pharmacophore and 3D-QSAR study on a series of Staphylococcus aureus Sortase A inhibitors. Chem Biol Drug Des 80:300–314
47. Bhonsle JB, Venugopal D et al (2007) Application of 3D-QSAR for identification of descriptors defining bioactivity of antimicrobial peptides. J Med Chem 50:6545–6553
48. Bucinski A et al (2004) Artificial neural networks for prediction of antibacterial activity in series of imidazole derivatives. Comb Chem High Throughput Screen 7:327–336
49. Todeschini R, Consonni V (2009) Molecular descriptors for chemoinformatics. WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim
50. Estrada E, Matamala AR (2007) Generalized topological indices. Modeling gas-phase rate coefficients of atmospheric relevance. J Chem Inf Model 47:794–804
51. Estrada E, Uriarte E et al (2000) A novel approach for the virtual screening and rational design of anticancer compounds. J Med Chem 43:1975–1985
52. Roy K, Ghosh G (2004) QSTR with extended topochemical atom indices. 2. Fish toxicity of substituted benzenes. J Chem Inf Comput Sci 44:559–567
53. Roy K, Ghosh G (2008) QSTR with extended topochemical atom indices. 10. Modeling of toxicity of organic chemicals to humans using different chemometric tools. Chem Biol Drug Des 72:383–394
54. Castillo-Garit JA, Vega MC et al (2011) Ligand-based discovery of novel trypanosomicidal drug-like compounds: in silico identification and experimental support. Eur J Med Chem 46:3324–3330
55. Casañola-Martin GM, Marrero-Ponce Y et al (2010) Bond-based 2D quadratic fingerprints in QSAR studies: virtual and in vitro tyrosinase inhibitory activity elucidation. Chem Biol Drug Des 76:538–545
56. Barigye SJ, Marrero-Ponce Y et al (2013) Event-based criteria in GT-STAF information indices: theory, exploratory diversity analysis and QSPR applications. SAR QSAR Environ Res 24:3–34
57. Barigye SJ, Marrero-Ponce Y et al (2013) Relations frequency hypermatrices in mutual, conditional and joint entropy-based information indices. J Comput Chem 34:259–274
58. Vazquez-Prieto S, Gonzalez-Diaz H et al (2013) A QSPR-like model for multilocus genotype networks of Fasciola hepatica in Northwest Spain. J Theor Biol 343C:16–24
59. Alonso N, Caamano O et al (2013) Model for high-throughput screening of multi-target drugs in chemical neurosciences; synthesis, assay and theoretic study of rasagiline carbamates. ACS Chem Neurosci 4:1393–1403
60. Estrada E, Molina E et al (2001) Can 3D structural parameters be predicted from 2D (topological) molecular descriptors? J Chem Inf Comput Sci 41:1015–1021
61. Estrada E (2002) Physicochemical interpretation of molecular connectivity indices. J Phys Chem A 106:9085–9091
62. Molina E, Diaz HG et al (2004) Designing antibacterial compounds through a topological substructural approach. J Chem Inf Comput Sci 44:515–521
63. Gonzalez-Diaz H, Torres-Gomez LA et al (2005) Markovian chemicals "in silico" design (MARCH-INSIDE), a promising approach for computer-aided molecular design III: 2.5D indices for the discovery of antibacterials. J Mol Model 11:116–123
64. Marrero-Ponce Y, Marrero RM et al (2006) Non-stochastic and stochastic linear indices of the molecular pseudograph's atom-adjacency matrix: a novel approach for computational in silico screening and "rational" selection of new lead antibacterial agents. J Mol Model 12:255–271
65. Marrero-Ponce Y, Medina-Marrero R et al (2005) Atom, atom-type, and total nonstochastic and stochastic quadratic fingerprints: a promising approach for modeling of antibacterial activity. Bioorg Med Chem 13:2881–2899
66. Speck-Planche A, Scotti MT et al (2009) Design of novel antituberculosis compounds using graph-theoretical and substructural approaches. Mol Divers 13:445–458
67. Estrada E (1996) Spectral moments of the edge adjacency matrix in molecular graphs. 1. Definition and applications for the prediction of physical properties of alkanes. J Chem Inf Comput Sci 36:844–849
68. Estrada E (1997) Spectral moments of the edge adjacency matrix in molecular graphs. 2. Molecules containing heteroatoms and QSAR applications. J Chem Inf Comput Sci 37:320–328
69. Estrada E (1998) Spectral moments of the edge adjacency matrix in molecular graphs. 3. Molecules containing cycles. J Chem Inf Comput Sci 38:23–27
70. Estrada E, Gutiérrez Y (2002–2004) MODESLAB. v1.5, Santiago de Compostela
71. Yap CW (2011) PaDEL-descriptor: an open source software to calculate molecular descriptors and fingerprints. J Comput Chem 32:1466–1474
72. Todeschini R, Lasagni M et al (1994) New molecular descriptors for 2D and 3D structures. Theory. J Chemometr 8:263–272
73. Hill T, Lewicki P (2006) STATISTICS methods and applications. A comprehensive reference for science, industry and data mining. StatSoft, Tulsa
74. Suzuki K (ed) (2011) Artificial neural networks: methodological advances and biomedical applications. InTech, Rijeka
75. Sabet R, Fassihi A et al (2012) Computer-aided design of novel antibacterial 3-hydroxypyridine-4-ones: application of QSAR methods based on the MOLMAP approach. J Comput Aided Mol Des 26:349–361
76. Garcia-Domenech R, de Julian-Ortiz JV (1998) Antimicrobial activity characterization in a heterogeneous group of compounds. J Chem Inf Comput Sci 38:445–449
77. Lata S, Sharma BK et al (2007) Analysis and prediction of antibacterial peptides. BMC Bioinformatics 8:263
78. Hall M, Frank E et al (2009) The WEKA data mining software: an update. SIGKDD Explor 11:10–18
79. Hall M, Frank E et al (1999–2013) WEKA. Waikato Environment for Knowledge Analysis. v3.6.9, Hamilton
80. Witten IH, Frank E et al (2011) Data mining: practical machine learning tools and techniques. Morgan Kaufmann Publishers, Elsevier, Amsterdam
81. StatSoft (2001) STATISTICA 6.0. Data analysis software system
82. González-Díaz H, Pérez-Bello A et al (2007) Chemometrics for QSAR with low sequence homology: Mycobacterial promoter sequences recognition with 2D-RNA entropies. Chemometr Intell Lab Syst 85:20–26
83. Hanczar B, Hua J et al (2010) Small-sample precision of ROC-related estimates. Bioinformatics 26:822–830
84. Sader HS, Biedenbach DJ et al (2012) Antimicrobial activity of the investigational pleuromutilin compound BC-3781 tested against Gram-positive organisms commonly associated with acute bacterial skin and skin structure infections. Antimicrob Agents Chemother 56:1619–1623
Chapter 5
Abstract
With the introduction of the REACH legislation in the European Union, there is a requirement for property
and toxicity data on chemicals produced in or imported into the EU at levels of 1 tonne/year or more.
This has meant an increase in the in silico prediction of such data. One of the chief predictive approaches
is QSAR (quantitative structure–activity relationships), which is widely used in many fields.
A QSAR approach that is increasingly being used is that of artificial neural networks (ANNs), and this
chapter discusses its application to the range of physicochemical properties and toxicities required by
REACH. ANNs generally outperform the main QSAR approach of multiple linear regression (MLR),
although other approaches such as support vector machines sometimes outperform ANNs. Most ANN
QSARs reported to date comply with only two of the five OECD Guidelines for the Validation of (Q)SARs.
Key words QSAR, Artificial neural networks, Physicochemical properties, Toxicity, REACH
1 Introduction
Hugh Cartwright (ed.), Artificial Neural Networks, Methods in Molecular Biology, vol. 1260,
DOI 10.1007/978-1-4939-2239-0_5, © Springer Science+Business Media New York 2015
66 John C. Dearden and Philip H. Rowe
In this QSAR, LD50 is the dose required to kill 50 % of the mice,
and P is the octanol–water partition coefficient, an important
property that reflects the ability of compounds to penetrate lipid
membranes in an organism; n is the number of compounds used to
develop the QSAR; R² is the coefficient of determination, the frac-
tion of the toxicity that is described by the equation; and s is the
standard error of the prediction. It can be seen that the partition
coefficient, a simple physicochemical property, models 85.2 % of
the acute toxicity of barbiturates to the mouse.
Another example, shown in Eq. 2, concerns the toxicity of
nitrobenzenes to the aquatic ciliate Tetrahymena pyriformis [3]:
log(1/IGC50) = 0.467 log P − 1.60 ELUMO − 2.55    (2)

n = 42, R² = 0.881, s = 0.246
IGC50 is the dose that reduces growth rate of the ciliate by 50 %.
ELUMO is the energy of the lowest unoccupied molecular orbital, a
measure of the electron-attracting ability (electrophilicity) of a
compound. Equation 2 demonstrates the use of both a hydropho-
bic descriptor property, P, and an electronic property, ELUMO, in
the same equation.
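Applying Eq. 2 is simple arithmetic once log P and ELUMO are known. The input values below are invented for illustration and are not taken from the original nitrobenzene data set:

```python
# Hypothetical compound: log P = 2.0, E_LUMO = -1.1 eV (illustrative only)
log_p, e_lumo = 2.0, -1.1

# Eq. 2: log(1/IGC50) = 0.467 log P - 1.60 E_LUMO - 2.55
log_inv_igc50 = 0.467 * log_p - 1.60 * e_lumo - 2.55
print(round(log_inv_igc50, 3))  # 0.144
```

Note the sign of the ELUMO term: a lower (more negative) ELUMO raises the predicted log(1/IGC50), consistent with ELUMO measuring the electron-attracting ability of the compound.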
Equations 1 and 2 are examples of multiple linear regression
(MLR), which is the most widely used QSAR modeling technique.
There are, however, numerous other techniques that are used in
QSAR modeling, including artificial neural networks (ANNs) [4],
nonlinear regression, partial least squares, genetic algorithms, and
support vector machines (SVM) [5, 6]. Each has its own strengths
and weaknesses.
The main strength of ANNs is that they are nonlinear (strictly,
nonrectilinear) methods and so can deal with structure–activity
relationships in which there are nonrectilinear correlations between
activity and one or more descriptor properties. In Eq. 1 above, there
is a nonrectilinear correlation between barbiturate toxicity and parti-
tion coefficient. In this case, the correlation is parabolic, which can
be accounted for by the incorporation of a (log P)² term. However,
in many cases, the form of nonrectilinearity is not known, so an
MLR approach will not yield good correlations. This is particu-
larly important when non-congeneric data sets are being modeled.
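The point about nonrectilinear correlations can be made with a small numerical sketch on synthetic, noise-free data: a straight-line least-squares fit (standing in for MLR) leaves large residuals on a parabolic activity-log P relationship, while adding the squared term removes them, playing the role that a flexible nonlinear model such as an ANN would play when the form of the nonrectilinearity is unknown:

```python
import numpy as np

rng = np.random.default_rng(0)
logp = rng.uniform(-1, 5, 200)                 # synthetic log P values
activity = -0.5 * (logp - 2.0) ** 2 + 4.0      # parabolic relationship, no noise

# Straight-line fit: activity ~ a*logp + b
X1 = np.column_stack([logp, np.ones_like(logp)])
r1 = activity - X1 @ np.linalg.lstsq(X1, activity, rcond=None)[0]

# Fit including the squared term: activity ~ a*logp**2 + b*logp + c
X2 = np.column_stack([logp ** 2, logp, np.ones_like(logp)])
r2 = activity - X2 @ np.linalg.lstsq(X2, activity, rcond=None)[0]

print(np.abs(r1).max() > 1.0, np.abs(r2).max() < 1e-6)  # True True
```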
Artificial Neural Networks for REACH Property/Toxicity 67
4.1.2 Boiling Point

Boiling point is somewhat easier to predict. Hall and Story [35] used a back-propagation NN with atom-type electrotopological descriptors to model the boiling points of 298 diverse organic chemicals, with an MAE of 3.9° (OECD compliance 1,4IE). Using a feed-forward NN with quantum-mechanical descriptors, Chalk et al. [36] modeled the boiling points of a large number of organic chemicals, with a 6,000-compound training-set standard deviation of 16.5° and a 629-compound test-set standard deviation of 19.0°
4.1.3 Relative Density

Relative density is quite simple to calculate, for liquids at least [31]. Gakh et al. [37] used a feed-forward back-propagation NN to predict the densities of 134 liquid hydrocarbons at 25 °C, using descriptors derived from graph theory. They found an MAE of 0.6 % for a 25-chemical external test set (OECD compliance 1,4E). Group contributions and a feed-forward back-propagation NN with two hidden layers were used to predict the densities of 82 ionic liquids [38]. An external test set of 24 ionic liquids yielded an MAE of 0.26 % (OECD compliance 1,4E).
4.1.4 Vapor Pressure

Vapor pressure, being temperature dependent, must be reported for a specific temperature, such as 25 °C (298 K). McClelland and Jurs [39] used a computational NN to model the vapor pressures at 25 °C of 420 diverse organic chemicals. They obtained RMSE values of 0.19 and 0.33 log unit for the training and validation sets, respectively (OECD compliance 1,4IE). Chalk et al. [40] used a feed-forward NN to model the variation of vapor pressure with temperature, using quantum-mechanical descriptors with a data set of 7,681 measurements for 2,349 chemicals. Their standard deviations were 0.322 and 0.326 log unit for training and validation sets, respectively (OECD compliance 1,4IE).
4.1.5 Surface Tension

The surface tensions at various temperatures of 82 organic liquids were modeled with a multilayer perceptron NN, with five physical properties as descriptors [41], giving an MAE of 1.41 % for the training set of 734 data points and 1.95 % for the test set of 314 data points (OECD compliance 1,4E). Gharagheizi et al. [42] used a much larger number (752) of organic liquids and a range of temperatures. Using a feed-forward NN with group contribution descriptors, they obtained standard deviations of 1.7 % for the training and the internal and external validation sets (OECD compliance 1,4IE).
4.1.6 Water Solubility

One of the most important physicochemical properties is aqueous solubility, and there have been many QSPR studies carried out in this area [43], including those using ANN approaches. Yan and Gasteiger [44] used a back-propagation NN with 3D descriptors to model a set of 1,293 organic chemicals. They obtained, for the 797-compound training set and the 496-compound test set, MAEs of 0.41 and 0.49 log unit, respectively (OECD compliance 1,4IE). These compare well with the typical experimental error in aqueous solubility of about 0.58 log unit [45]. They are also considerably better than an MLR model for the same chemicals (MAEs of 0.70 and 0.68 log unit for training and test sets, respectively). However, when an external set of 1,587 Merck chemicals was tested, the NN MAE
was 0.77 log unit. The authors commented that this was probably
due to greater chemical diversity of the Merck chemicals. Similar
results were obtained by Bruneau [46] using a Bayesian NN with
physicochemical descriptors for a training set of 1,560 organic
chemicals (standard error = 0.53 log unit) and an external test set
of 934 organic chemicals (standard error = 0.81 log unit) (OECD
compliance 1,4IE).
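The error statistics quoted throughout this section (MAE, RMSE, standard error) are worth keeping distinct, since RMSE penalizes large residuals more heavily than MAE. As an illustration only, with invented measured and predicted log-solubility values (not data from any of the studies cited):

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the data (here, log units)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root-mean-square error; weights large residuals more than MAE does."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical measured vs. predicted log S values (log units) for five chemicals
measured = [-2.10, -3.45, -0.80, -4.20, -1.55]
predicted = [-2.40, -3.10, -1.10, -4.90, -1.30]

print(f"MAE  = {mae(measured, predicted):.2f} log unit")   # 0.38
print(f"RMSE = {rmse(measured, predicted):.2f} log unit")  # 0.41
```

Because RMSE ≥ MAE always holds, a model whose RMSE is much larger than its MAE has a few badly predicted outliers, which is useful to know when comparing figures such as those for the Merck external set above.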
4.1.7 n-Octanol–Water Partition Coefficient
Without doubt the most important physicochemical property for modeling bioactivity is the n-octanol–water partition coefficient (P, Kow). In QSAR studies it is always used in logarithmic form, as log P or log Kow. Some 70 % of all QSARs include this term or one related to it, such as the chromatographic retention term Rm. Its importance lies in the fact that it is a surrogate for lipid–water partitioning and hence models xenobiotic transport in an organism, although it can also contribute to receptor binding. There are a
number of reviews of the topic [47, 48] and many published stud-
ies. Chen [49] compared the performance of MLR, radial basis
function neural networks, and support vector machines for the
prediction of log P of about 3,500 organic chemicals. The SVM
results were best (training set s = 0.54, test set s = 0.56), with MLR
slightly worse (training set s = 0.62, test set s = 0.56) and RBFNN
somewhat worse (training set s = 0.56, test set s = 0.72) (OECD
compliance 1,4IE). Tetko et al. [50] used electrotopological state
(E-state) indices [51] with an ensemble of 50 different neural net-
works to model log P values of a very large number of organic
chemicals; for a 12,908-compound training set they obtained RMSE = 0.38 log unit, and for a 1,174-compound test set RMSE = 0.36 log unit (OECD compliance 1,4IE). These results
were better than those from MLR and from several commercially
available software programs. Considering the very large number of
chemicals used, this is a very good result.
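The consensus approach of Tetko et al. averages the outputs of many independently trained models, each seeing a different resampling of the training data. The mechanics of such an ensemble can be sketched as below; for brevity the base learner here is a one-descriptor least-squares model rather than an E-state neural network, and the data are invented — this is not the ALOGPS implementation:

```python
import random

def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = a*x + b for a single descriptor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    denom = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / denom
    return a, my - a * mx

def fit_ensemble(xs, ys, n_models=50, seed=0):
    """Train n_models base models, each on a bootstrap resample of the data."""
    rng = random.Random(seed)
    models = []
    while len(models) < n_models:
        idx = [rng.randrange(len(xs)) for _ in range(len(xs))]
        xs_b = [xs[i] for i in idx]
        if len(set(xs_b)) < 2:  # degenerate resample; draw again
            continue
        models.append(fit_linear(xs_b, [ys[i] for i in idx]))
    return models

def ensemble_predict(models, x):
    """Consensus prediction: the mean of the individual model outputs."""
    return sum(a * x + b for a, b in models) / len(models)

# Toy descriptor/property data lying exactly on y = 2x + 1 (hypothetical)
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [2 * x + 1 for x in xs]
models = fit_ensemble(xs, ys)
```

The spread of the 50 individual predictions also gives a rough measure of prediction uncertainty, which is one practical attraction of consensus models for regulatory (OECD principle 3) purposes.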
4.1.8 Flash Point
The flash point of a chemical is the temperature at which the vapor concentration is equal to the lower explosive limit. Katritzky et al.
[52] used both MLR and back-propagation NN with CODESSA
descriptors to model the flash points of up to 758 organic chemi-
cals. For the training sets, MLR yielded MAE = 13.9°, while NN
yielded better results: MAE = 12.6° (OECD compliance 1,4IE).
No test-set results were reported. Gharagheizi et al. [53], using a
feed-forward NN and group contributions as descriptors with a
large data set of organic chemicals, found MAE = 7.9° for the
1,241-compound training set and MAE = 9.9° for the
137-compound test set (OECD compliance 1,4IE).
4.1.9 Flammability Limits
Upper and lower flammability limits represent the range of concentrations in air within which a chemical will ignite, provided that an ignition source is present. Studies of both limits have been made by Gharagheizi [54–57].
74 John C. Dearden and Philip H. Rowe
4.1.10 Explosive Properties
Explosive properties have to date been investigated very little by QSAR, with the exception of impact sensitivity. Cho et al. [58] used
back-propagation NN with structural, topological, and physico-
chemical descriptors to model the impact sensitivity of 263 nitro
compounds, with a standard error of prediction of 0.211 log unit
(standard errors not given for training and test sets) (OECD com-
pliance 1,4IE). Wang et al. [59] compared the performance of
MLR, partial least squares (PLS), and back-propagation NN in
modeling the impact sensitivity of 156 diverse nitro compounds.
The RMSEs found for the 127-compound training set and the
49-compound test set, respectively, were MLR 0.210, 0.251; PLS
0.214, 0.250; and BPNN 0.192, 0.247 (OECD compliance 1,4IE).
Clearly the back-propagation NN approach gave the best results.
4.1.11 Self-Ignition Temperature (Autoignition Temperature)
MLR, PLS, and radial basis function and back-propagation NNs with structural and physicochemical descriptors were used by Tetteh et al. [60] to model the self-ignition temperature of 233
organic chemicals. Overall MAEs were MLR 90.2°, PLS 38.4°,
RBFNN 30.1°, and BPNN 29.9° (MAEs not given for training
and test sets separately) (OECD compliance 1,4IE). Pan et al. [61]
used MLR and back-propagation NNs with E-state descriptors and
group contributions, respectively, to model the self-ignition tem-
peratures of 118 hydrocarbons. MAEs for the 42-compound test
set were MLR 32.4°, E-state BPNN 21.6°, and group contribu-
tion BPNN 24.8° (OECD compliance 1,4E).
4.1.13 Soil Sorption
Liu and Yu [62] modeled the soil sorption of substituted anilines and phenols, with standard errors of prediction of 0.374 log unit for MLR and 0.282 log unit for BPNN (OECD compliance 1,4I). No external validation was performed. Soil sorption of 124 pesticides was investigated by Goudarzi et al. [63] using SPA-MLR and SPA-ANN with physicochemical and structural descriptors (SPA is the successive projection algorithm); the type of NN used was not specified. The RMSEs obtained (log units) were training set 0.420 (MLR) and 0.282 (NN) and test set 0.371 (MLR) and 0.289 (NN) (OECD compliance 1,4IE).
4.1.14 Viscosity
Using their ADAPT descriptors, Kauffman and Jurs [67] developed both MLR and computational NN models to predict the viscosity of almost 200 organic solvents. For 170 training-set
chemicals, RMSE (MLR) was found to be 0.257 mPa s, while
RMSE (CNN) was 0.147 mPa s. RMSEs for the 21-compound
test set were (MLR) 0.278 and (CNN) 0.242 mPa s (OECD com-
pliance 1,4IE). Artemenko et al. [68] found a back-propagation
NN, with molecular fragment descriptors, to yield better predic-
tions than did MLR for a diverse set of liquid organic chemicals.
For 293 training-set chemicals, RMSEs (log units) were MLR
0.111 and BPNN 0.212. RMSEs for the 37-compound test set
were MLR 0.212 and BPNN 0.141 (OECD compliance 1,4IE).
4.2.1 Skin Irritation or Corrosion
Salina et al. [72] have reviewed the QSAR prediction of skin irritation and corrosion. Golla et al. [73] used a feed-forward back-propagation NN with physicochemical and structural descriptors to model the rabbit skin irritation of 186 organic chemicals and obtained RMSE = 1.1 unit, which is reasonable for a data range of 0–8 units.
4.2.2 Eye Irritation
Salina et al. [72] have reviewed the QSAR prediction of eye irritation. Barratt [75] investigated the eye irritation of 57 aliphatic alcohols and other neutral organic chemicals using a back-propagation NN with a small number of physicochemical properties for classification as irritant or nonirritant and obtained 97.37 %
ties for classification as irritant or nonirritant and obtained 97.37 %
correct classification (OECD compliance 1,4). No external test set
was used.
Patlewicz et al. [76] investigated a set of 19 cationic surfactants
with back-propagation NN using a small number of physicochemi-
cal properties, which yielded R2 = 0.702 (OECD compliance 1,4).
No error values were given.
4.2.3 Skin Sensitization
Patlewicz et al. [77] have reviewed the QSAR prediction of skin sensitization. Devillers [78] used a feed-forward back-propagation
NN with structural and physicochemical descriptors in a classifica-
tion model of the skin sensitization of 259 organic chemicals and
obtained an overall correct classification of 89.19 %, compared
with 82.63 % using linear discriminant analysis [79] (OECD com-
pliance 1,4IE). No external test-set results were given.
Golla et al. [80] also used a feed-forward back-propagation
NN with structural and physicochemical descriptors in a classifica-
tion model of skin sensitization, using three different endpoints
(mouse local lymph node assay (LLNA), guinea-pig maximization
test (GPMT), and a test from the German Federal Institute for
Health Protection of Consumers and Veterinary Medicine
(BgVV)). The training-set results were LLNA (358 chemicals)
90 % correct classification, GPMT (307 chemicals) 95 %, and
BgVV (251 chemicals) 90 % (OECD compliance 1,4IE). No test-
set results were given.
4.2.4 Mutagenicity
Benigni [81] has reviewed the QSAR prediction of mutagenicity. Xu et al. [82] investigated a very large data set of 7,617 organic
chemicals, of which 4,252 were mutagens and 3,365 were non-
mutagens. Using molecular fingerprints as descriptors, they
employed five different approaches, namely, SVM, decision tree,
feed-forward NN, k-nearest neighbor, and naïve Bayes, to classify
the chemicals. The prediction accuracies for the training set were,
respectively, 84.1 %, 80.4 %, 80.2 %, 82.1 %, and 65.8 % (OECD
compliance 1,4). The NN method, while good, was not quite as
accurate as the SVM, decision tree, and k-nearest neighbor meth-
ods. No test-set results were given.
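The "correct classification" figures quoted in this and the following subsections are overall accuracies. With an imbalanced data set such as the 4,252 mutagens versus 3,365 non-mutagens in [82], it is also worth checking sensitivity and specificity separately, since a classifier can score a high overall accuracy while missing many actives. A minimal sketch with invented labels (not data from [82]):

```python
def classification_stats(y_true, y_pred):
    """Overall accuracy plus sensitivity/specificity for a binary endpoint
    (1 = active, e.g. mutagen; 0 = inactive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = 100.0 * (tp + tn) / len(y_true)
    sensitivity = 100.0 * tp / (tp + fn) if tp + fn else float("nan")
    specificity = 100.0 * tn / (tn + fp) if tn + fp else float("nan")
    return accuracy, sensitivity, specificity

# Hypothetical true and predicted classes for ten chemicals
truth = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
prediction = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0]
acc, sens, spec = classification_stats(truth, prediction)
print(f"accuracy {acc:.1f} %, sensitivity {sens:.1f} %, specificity {spec:.1f} %")
```

For regulatory screening, a false negative (a missed mutagen) is usually more costly than a false positive, so sensitivity deserves particular attention even when overall accuracy looks good.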
4.2.5 Acute Toxicity (Mammalian)
Tsakovska et al. [84] have reviewed the QSAR prediction of mammalian toxicity. Devillers [85] employed PLS regression and back-propagation NN to model acute oral toxicity (LD50) to the rat of
51 organophosphorus pesticides, with the use of physicochemical
descriptors. The NN approach yielded RMSE = 0.29 log unit for
the training set and 0.26 log unit for the test set (OECD compli-
ance 1,4IE); these figures were greatly superior to those from PLS
regression.
The oral toxicity of 54 benzodiazepine drugs to the mouse was
modeled by MLR, back-propagation ANN, SVM, and PLS meth-
ods [86]. RMSE values (log units) were, respectively, training set
0.213, 0.142, 0.183, and 0.174 and test set 0.194, 0.124, 0.192,
and 0.165, indicating the superior predictivity of the ANN method
(OECD compliance 1,4IE).
No mammalian inhalation toxicity QSAR studies appear to
have been done using a neural network approach.
4.2.6 Aquatic Toxicity
Kaiser and Niculescu [87] used a probabilistic NN to model a large data set of toxicity of organic chemicals to Daphnia magna, using a
combination of physicochemical and structural descriptors. For the
700-compound training set, s = 0.512 log unit, and for the 76-com-
pound test set s = 0.668 log unit (OECD compliance 1,4IE).
Another aquatic invertebrate that has been widely used in
QSAR toxicity studies is the ciliate Tetrahymena pyriformis. Over
2,000 chemicals have been tested in the same laboratory against
this organism, which means that experimental error should be
very low. Kahn et al. [88] compared the performance of MLR
and back-propagation NN on a large number of organic chemi-
cals, using CODESSA descriptors. The BPNN results were better
(914-compound training set s = 0.442 log unit, 457-compound
test set s = 0.484 log unit) than were those from MLR (training
set s = 0.551 log unit, test set s = 0.561 log unit) (OECD compli-
ance 1,4IE). Kahn et al. also reported 18 previous ANN QSAR
studies on toxicity to T. pyriformis.
Izadiyan et al. [89] modeled the effects of 40 ionic chemicals
on the limnic green alga Scenedesmus vacuolatus, using both MLR
and multilayer perceptron NN, with structural descriptors.
Standard errors of prediction in the validation set were MLR 0.525
log unit and MLPNN 0.440 log unit (OECD compliance 1,4IE).
4.2.8 Acute Dermal Toxicity (Mammalian)
There do not appear to be any neural network QSAR studies of mammalian toxicity through dermal absorption.
4.2.10 Reproductive Toxicity
Most QSAR studies of reproductive toxicity have been made on endocrine disruption. Liu et al. [92] used least-square SVM, counterpropagation NN, and k-nearest neighbor (k-NN) approaches to
classify the endocrine-disruption capability of 232 training-set
chemicals and 87 external test-set chemicals, using molecular
structure descriptors. The three methods gave very similar results,
with correct predictions as follows: LS-SVM 89.66 %, CP-NN
87.50 %, and k-NN 89.22 % for the training set and LS-SVM
83.91 %, CP-NN 82.76 %, and k-NN 83.91 % for the external test
set (OECD compliance 1,3,4IE).
Roncaglioni et al. [93] investigated the performance of five
different nonlinear QSAR models (decision forest (DF), adaptive
fuzzy partition (AFP), decision tree (CART), multilayer percep-
tron feed-forward NN (MLPNN), and support vector machine
(SVM)) to classify the endocrine-disruption capability of 232
training-set chemicals and 87 external test-set chemicals, using
topological and structural descriptors. The correct classifications
were DF 97.04 %, AFP 86.36 %, CART 85.38 %, MLPNN 84.39 %,
and SVM 89.92 % for the training set and DF 85.33 %, AFP
85.33 %, CART 85.33 %, MLPNN 84.00 %, and SVM 86.67 % for
the external test set (OECD compliance 1,3,4IE).
4.2.11 Toxicokinetics
There are numerous toxicokinetic parameters, and many QSAR studies have been made of most of them. Wajima et al. [94] investigated the prediction of human clearance from data for 68 drugs
on humans, dogs, and rats, with physicochemical descriptors and
dog and rat data, using several approaches. RMSEs (log units)
were MLR 0.350, PLS 0.350, and feed-forward NN 0.395 (OECD
compliance 1,4). No external test set was used.
Blood levels of the antibiotic tobramycin in about 300 patients
were modeled with a back-propagation NN, using patient data
(e.g., age, weight, sex, illness) as descriptors [95]. The MAE was
33.9 %, which was considered acceptable (OECD compliance 1,4).
No external test set was used. The authors concluded that “neural
networks were capable of capturing the relationships between
plasma drug levels and patient-related prognostic factors from rou-
tinely collected sparse within-patient pharmacokinetic data.”
4.2.12 Short-Term Fish Toxicity
Singh et al. [96] compared the performance of a number of artificial intelligence approaches (multilayer perceptron NN, radial basis
function NN, generalized regression NN, gene expression program-
ming (GEP), SVM, and decision tree (DT)) to model the acute
fish toxicity of 573 chemicals, of which 458 formed the training
set and 115 the test set. The MAEs (log units) obtained for the
training set were MLPNN 0.54, RBFNN 0.57, GRNN 0.34, GEP
0.71, and DT 0.43, and those for the test set were MLPNN 0.46,
RBFNN 0.41, GRNN 0.37, SVM 0.49, GEP 0.48, and DT 0.54
(OECD compliance 1,4IE,5). The generalized regression NN in
particular performed very well.
Gong et al. [97] used radial basis function NN and a heuristic
method to model the acute fish toxicity of 92 substituted benzenes
(training set 74, test set 18 chemicals) using CODESSA descrip-
tors. The RMSEs (log units) for the training and test sets, respec-
tively, were RBFNN 0.220 and 0.205 and heuristic method 0.273
and 0.223 (OECD compliance 1,4IE,5).
4.2.13 Activated Sludge Respiration Inhibition
Okey and Martis [98] employed a feed-forward NN to model the inhibition of activated sludge respiration by 139 (training set) and 25 (test set) diverse organic chemicals, using quantum-mechanical
and topological descriptors. They obtained MAE = 0.41 log unit
for the test set, which is a reasonable result (OECD compliance
1,4E,5).
A QSAR (No. Q17-10-1-226) in the (Q)SAR Model Reporting
Format (QMRF) [28] employs a multilayer perceptron back-
propagation NN to model the inhibition of activated sludge respi-
ration, using physicochemical and structural descriptors. The
reported statistics are RMSE (68-chemical training set) = 0.029 log
unit and RMSE (15-chemical test set) = 0.03 log unit (OECD
compliance 1,2,3,4IE,5).
4.2.14 Abiotic Degradation (Hydrolysis)
The acid hydrolysis of a large number of carboxylic acid esters in water and in water–organic solvent mixtures was modeled [99]
using a feed-forward NN and quantum-chemical descriptors,
descriptors representing the various organic solvents, and temper-
ature and solvent concentration. For the training set (1,883
records), RMSE = 0.271 log unit, and for the test set (209 records),
RMSE = 0.342 log unit (OECD compliance 1,4IE).
4.2.17 Prenatal Developmental Toxicity
4.2.18 Two-Generation Reproductive Toxicity
There do not appear to be any neural network QSAR studies of either of these endpoints.
4.2.19 Long-Term Aquatic Toxicity
Meng and Lin [101] used feed-forward back-propagation NNs to model both acute and chronic (up to 56-day) toxicity of alcohol
ethoxylates (nonionic surfactants) to fathead minnow (Pimephales
promelas), Daphnia magna, and green alga (Pseudokirchneriella
subcapitata). Unfortunately they presented their results largely in
graphical form, making it virtually impossible to obtain accurate
numerical predictions, even though they used both training and
external test sets. It appears from the graphs that chronic toxicity
ran more or less in parallel to acute toxicity. For example, for D.
magna the mean ratio of acute median effect concentration to
chronic no-observed-effect concentration was 2.8.
4.2.20 Bioaccumulation
Zhao et al. [103] used MLR, radial basis function NN, and
SVM to model a large data set of fish BCF values, using topological
and physicochemical descriptors. For the 378-chemical training
set, standard errors of prediction (log units) were MLR 0.70,
RBFNN 0.58, and SVM 0.63, and for the 95-chemical test set, the
errors were MLR 0.74, RBFNN 0.63, and SVM 0.62. A hybrid
RBFNN, from two models with different descriptors, yielded
s = 0.56 for the training set and 0.59 for the test set (OECD com-
pliance 1,4IE).
4.2.21 Effects on Terrestrial Organisms (Invertebrates, Microorganisms, Plants)
Honeybees are susceptible to pesticides, and Devillers et al. [104] developed a feed-forward back-propagation NN model of the acute toxicity of 100 pesticides to Apis mellifera. Using the autocorrelation vectors of physicochemical properties as descriptors,
they obtained RMSE = 0.430 log unit for the 89-chemical training
set and 0.386 log unit for the 11-chemical test set (OECD compli-
ance 1,4IE). These good results contrasted sharply with poor
results from a PLS model (data not given).
A study of fungicidal activities of thiazoline derivatives against
rice blast (Magnaporthe grisea) used both MLR and feed-forward
back-propagation NN, with physicochemical descriptors [105].
Three sets of thiazoline derivatives, with different substitution pat-
terns, were used, and similar good results were obtained for all
three sets. The largest set used 82 training compounds (standard
error (log units) = 0.139 (MLR) and 0.097 (NN)) and 18 test
compounds (standard error (log units) = 0.162 (MLR) and 0.122
(NN)) (OECD compliance 1,4IE).
4.2.23 Long-Term Effects on Terrestrial Organisms (Invertebrates, Microorganisms, Plants, Birds)
There do not appear to be any neural network QSAR studies of long-term effects on terrestrial organisms.
4.3 Software for Property and Toxicity Prediction
Many computer packages are available for the prediction of physicochemical and toxicological properties. Some of these are freely distributed; others are commercial. Dearden et al. [31] have
reviewed the software available for physicochemical property pre-
diction, and Dearden [108] and Lapenna et al. [109] have reviewed
the software available for toxicity prediction. Unfortunately there
is a general lack of transparency concerning the nature of the algo-
rithms used in these packages, and so it is impossible to say with
confidence, in many cases, whether or not the package uses neural
networks for prediction and whether there is OECD compliance.
Among those that predict physicochemical properties, some
are clearly described as using methods other than neural networks.
For example, Molecular Modelling Pro [110] is described as mod-
eling melting and boiling points by the method of Joback and Reid
[111], which is a multiple linear regression method. Others use
neural network methods. For example, ADMET Predictor [112]
uses neural networks in the prediction of pKa, aqueous solubility,
and octanol–water partition coefficient. ChemSilico [113] also
uses neural network methods in the prediction of aqueous solubility
and octanol–water partition coefficient.
There is a similar situation in regard to toxicity prediction.
TOPKAT [114] is described by Lapenna et al. [109] as using
regression analysis to predict rat oral LD50. In contrast, TerraQSAR
[115] is described [116] as using neural net technology to predict
mouse and rat oral and intravenous LD50, fathead minnow and
Daphnia magna LC50, and skin irritation. ADMET Predictor
[112] uses neural networks to predict mutagenicity and endocrine
disruption, and ChemSilico [113] uses neural networks to predict
mutagenicity.
5 Conclusions
Acknowledgments
References
networks: methods and protocols. Humana Press, New York, pp 137–158
14. Devillers J (2009) Artificial neural network modeling of the environmental fate and ecotoxicity of chemicals. In: Devillers J (ed) Ecotoxicology modeling. Springer, Dordrecht, pp 1–28
15. Gleeson MP, Modi S, Bender A et al (2012) The challenges involved in modelling toxicity data in silico: a review. Curr Pharm Des 18:1266–1291
16. Moore DRJ, Breton RL, MacDonald DB (2003) A comparison of model performance for six quantitative structure-activity relationship packages that predict acute toxicity to fish. Environ Toxicol Chem 22:1799–1809
17. Fatemi MH, Abraham MH, Haghdadi M (2009) Prediction of biomagnification factors for some organochlorine compounds using linear free energy relationship parameters and artificial neural networks. SAR QSAR Environ Res 20:453–465
18. Tan N-X, Li P, Rao H-B et al (2010) Prediction of the acute toxicity of chemical compounds to the fathead minnow by machine learning approaches. Chemometr Intell Lab Syst 100:66–73
19. Dearden JC, Cronin MTD, Kaiser KLE (2009) How not to develop a quantitative structure-activity or structure-property relationship (QSAR/QSPR). SAR QSAR Environ Res 20:241–266
20. Baskin II, Ait AO, Halberstam NM et al (2002) An approach to the interpretation of backpropagation neural network models in QSAR studies. SAR QSAR Environ Res 13:35–41
21. Guha R, Jurs PC (2005) Interpreting computational neural network QSAR models: a measure of descriptor importance. J Chem Inf Model 45:800–806
22. Guha R, Stanton DT, Jurs PC (2005) Interpreting computational neural network QSAR models: a detailed interpretation of the weights and biases. J Chem Inf Model 45:1109–1121
23. Chaudhry Q, Piclin N, Cotterill J et al (2010) Global QSAR models of skin sensitisers for regulatory purposes. Chem Cent J 4(Suppl 1):S1–S6
24. Johnson SR (2008) The trouble with QSAR (or how I learned to stop worrying and embrace fallacy). J Chem Inf Model 48:25–26
25. Enoch SJ, Madden JC, Cronin MTD (2008) Identification of mechanisms of toxic action for skin sensitisation using a SMARTS pattern based approach. SAR QSAR Environ Res 19:555–578
26. Vračko M, Bandelj V, Barbier P et al (2006) Validation of counter propagation neural network models for predictive toxicology according to the OECD principles: a case study. SAR QSAR Environ Res 17:265–284
27. Worth AP (2010) The role of QSAR methodology in the regulatory assessment of chemicals. In: Puzyn T, Leszczynski J, Cronin MTD (eds) Recent advances in QSAR studies: methods and applications. Springer, Dordrecht, pp 367–382
28. (Q)SAR Model Reporting Format. Available at http://qsardb.jrc.it/qmrf. Accessed 16 Jan 2014
29. ECHA Guidance on information requirements and chemical safety assessment. Chapter R.7a. Endpoint specific guidance, in Guidance for the Implementation of REACH, ECHA, Helsinki, 2012. Available at http://echa.europa.eu/support/guidance-on-reach-and-clp-implementation/consultation-procedure. Accessed 25 Jan 2014
30. Regulation (EC) No. 1907/2006 of the European Parliament and of the Council of 18 December 2006. Available at http://eurlex.europa.eu/LexUriServ/LexUriServ.do?uri=CONSLEG:2006. Accessed 25 Jan 2014
31. Dearden JC, Rotureau P, Fayet G (2013) QSPR prediction of physico-chemical properties for REACH. SAR QSAR Environ Res 24:279–318
32. Taskinen J, Yliruusi J (2003) Prediction of physicochemical properties based on neural network modelling. Adv Drug Deliv Rev 55:1163–1183
33. Godavarthy SS, Robinson RL, Gasem KAM (2006) An improved structure-property model for predicting melting-point temperatures. Ind Eng Chem Res 45:5117–5126
34. Karthikeyan M, Glen RC, Bender A (2005) General melting point prediction based on a diverse compound data set and artificial neural networks. J Chem Inf Model 45:581–590
35. Hall LH, Story CT (1996) Boiling point and critical temperature of a heterogeneous data set. QSAR with atom type electrotopological state indices using artificial neural networks. J Chem Inf Comput Sci 36:1004–1014
36. Chalk AJ, Beck B, Clark T (2001) A quantum mechanical/neural network model for boiling points with error estimation. J Chem Inf Comput Sci 41:457–462
37. Gakh AA, Gakh EG, Sumpter BG et al (1994) Neural network-graph theory approach to the prediction of the physical properties of organic compounds. J Chem Inf Comput Sci 34:832–839
38. Valderrama JO, Reátegui A, Rojas RE (2009) Density of ionic liquids using group contribution and artificial neural networks. Ind Eng Chem Res 48:3254–3259
39. McClelland HE, Jurs PC (2000) Quantitative structure-property relationships for the prediction of vapor pressures of organic compounds from molecular structure. J Chem Inf Comput Sci 40:967–975
40. Chalk AJ, Beck B, Clark T (2001) A temperature-dependent quantum mechanical/neural net model for vapor pressure. J Chem Inf Comput Sci 41:1053–1059
41. Roosta A, Setoodeh P, Jahanmiri A (2012) Artificial neural network modeling of surface tension for pure organic compounds. Ind Eng Chem Res 51:561–566
42. Gharagheizi F, Eslamimanesh A, Mohammadi AH et al (2011) Use of artificial neural network-group contribution method to determine surface tension of pure compounds. J Chem Eng Data 56:2587–2601
43. Dearden JC (2006) In silico prediction of aqueous solubility. Expert Opin Drug Discov 1:31–52
44. Yan A, Gasteiger J (2003) Prediction of aqueous solubility of organic compounds based on a 3D structure representation. J Chem Inf Comput Sci 43:429–434
45. Katritzky AR, Wang Y, Sild S et al (1998) QSPR studies on vapor pressure, aqueous solubility, and the prediction of water-air partition coefficients. J Chem Inf Comput Sci 38:720–725
46. Bruneau P (2001) Search for predictive generic model of aqueous solubility using Bayesian neural nets. J Chem Inf Comput Sci 41:1605–1616
47. Livingstone DJ (2003) Theoretical property predictions. Curr Top Med Chem 3:1171–1192
48. Klopman G, Zhu H (2005) Recent methodologies for the estimation of n-octanol-water partition coefficients and their use in the prediction of membrane transport properties of drugs. Mini Rev Med Chem 5:127–133
49. Chen H-F (2009) In silico log P prediction for a large data set with support vector machines, radial basis neural networks and multiple linear regression. Chem Biol Drug Des 74:142–147
50. Tetko IV, Tanchuk VY, Villa AEP (2001) Prediction of n-octanol-water partition coefficients from PHYSPROP database using artificial neural networks and E-state indices. J Chem Inf Comput Sci 41:1407–1421
51. Kier LB, Hall LH (1999) Molecular structure description: the electrotopological state. Academic, New York
52. Katritzky AR, Stoyanova-Slavova IB, Dobchev DA et al (2007) QSPR modelling of flash points: an update. J Mol Graph Model 26:529–536
53. Gharagheizi F, Alamdari RF, Angaji MT (2008) A new neural network-group contribution method for estimation of flash point temperature of pure components. Energy Fuel 22:1628–1635
54. Gharagheizi F (2010) Chemical structure-based model for estimation of the upper flammability limit of pure compounds. Energy Fuel 24:3867–3871
55. Gharagheizi F (2009) Prediction of upper flammability limit percent of pure compounds from their molecular structures. J Hazard Mater 167:507–510
56. Gharagheizi F (2008) Quantitative structure-property relationship for prediction of the lower flammability limit of pure compounds. Energy Fuel 22:3037–3039
57. Gharagheizi F (2009) A new group contribution-based model for estimation of lower flammability limit of pure compounds. J Hazard Mater 170:595–604
58. Cho SG, No KT, Goh EM et al (2005) Optimization of neural networks architecture for impact sensitivity of energetic molecules. Bull Korean Chem Soc 26:399–408
59. Wang R, Jiang J, Pan Y et al (2009) Prediction of impact sensitivity of nitro energetic compounds by neural network based on electrotopological-state indices. J Hazard Mater 166:155–186
60. Tetteh J, Metcalfe E, Howells SL (1996) Optimisation of radial basis and backpropagation neural networks for modelling auto-ignition temperature by quantitative structure-property relationships. Chemometr Intell Lab Syst 32:177–191
61. Pan Y, Jiang J, Wang R et al (2008) Prediction of auto-ignition temperatures of hydrocarbons by neural network based on atom-type electrotopological-state indices. J Hazard Mater 157:510–517
62. Liu GS, Yu JG (2005) QSAR analysis of soil sorption coefficients for polar organic chemicals: substituted anilines and phenols. Water Res 39:2048–2055
63. Goudarzi N, Goodarzi M, Araujo MCU et al (2009) QSPR modeling of soil sorption coefficients (Koc) of pesticides using SPA-ANN and SPA-MLR. J Agric Food Chem 57:7153–7158
64. Luan F, Ma W, Zhang H et al (2005) Prediction of pKa for neutral and basic drugs based on radial basis function neural networks and the heuristic method. Pharm Res 22:1454–1460
65. CODESSA software. Available at www.semichem.com. Accessed 26 Jan 2014
66. Habibi-Yangjeh A, Danandeh-Jenagharad M, Nooshyar M (2005) Prediction acidity constant of various benzoic acids and phenols in water using linear and nonlinear QSPR models. Bull Korean Chem Soc 26:2007–2016
67. Kauffman GW, Jurs PC (2001) Prediction of surface tension, viscosity and thermal conductivity for common organic solvents using quantitative structure-property relationships. J Chem Inf Comput Sci 41:408–418
68. Artemenko NV, Baskin II, Palyulin VA et al (2001) Prediction of physical properties of organic compounds using artificial neural networks within the substructure approach. Doklady Chem 381:317–320
69. Modarresi H, Modarress H, Dearden JC (2007) QSPR model of Henry's law constant for a diverse set of organic chemicals based on genetic algorithm-radial basis function network approach. Chemosphere 66:2067–2076
70. Gharagheizi F, Abbasi R, Tirandazi B (2010) Prediction of Henry's law constant of organic compounds in water from a new group-contribution-based method. Ind Eng Chem Res 49:10149–10152
71. ECHA Guidance on information requirements and chemical safety assessment. Chapter R.7b. Endpoint specific guidance, in Guidance for the Implementation of REACH, ECHA, Helsinki, 2012. Available at http://echa.europa.eu/support/guidance-on-reach-and-clp-implementation/consultation-procedure. Accessed 17 Jan 2014
72. Salina AG, Patlewicz G, Worth AP (2008) A review of QSAR models for skin and eye irritation and corrosion. QSAR Comb Sci 27:49–59
73. Golla S, Madihally S, Robinson RL et al (2009) Quantitative structure-property relationships modeling of skin irritation. Toxicol In Vitro 23:176–184
74. Barratt MD (1996) Quantitative structure-activity relationships (QSARs) for skin corrosivity of organic acids, bases and phenols: principal components and neural networks analysis of extended datasets. Toxicol In Vitro 10:85–94
75. Barratt MD (1997) QSARs for the eye irritation potential of neutral organic chemicals. Toxicol In Vitro 11:1–8
76. Patlewicz GY, Rodford RA, Ellis G et al (2000) A QSAR model for the eye irritation of cationic surfactants. Toxicol In Vitro 14:79–84
77. Patlewicz G, Roberts DW, Uriarte E (2008) A minireview of available skin sensitization (Q)SARs/expert systems. Chem Res Toxicol 21:521–541
78. Devillers J (2000) A neural network SAR model for allergic contact dermatitis. Toxicol Methods 10:181–193
79. Cronin MTD, Basketter DA (1994) Multivariate QSAR analysis of a skin sensitization database. SAR QSAR Environ Res 2:159–179
80. Golla S, Madihally S, Robinson RL et al (2009) Quantitative structure-property relationship modeling of skin sensitization: a quantitative prediction. Toxicol In Vitro 23:454–465
81. Benigni R (ed) (2003) Quantitative structure-activity relationship (QSAR) models of mutagens and carcinogens. CRC Press, Boca Raton
82. Xu C, Cheng F, Chen L et al (2012) In silico prediction of chemical Ames mutagenicity. J Chem Inf Model 52:2840–2847
83. Leong MK, Lin S-W, Chen H-B et al (2010) Predicting mutagenicity of aromatic amines by various machine learning approaches. Toxicol Sci 116:498–513
84. Tsakovska I, Lessigiarska I, Netzeva T et al (2008) A mini review of mammalian toxicity (Q)SAR models. QSAR Comb Sci 27:41–48
85. Devillers J (2004) Prediction of mammalian toxicity of organophosphorus pesticides from QSTR modelling. SAR QSAR Environ Res 15:501–510
86. Funar-Timofei S, Ionescu D, Suzuki T (2010) A tentative quantitative structure-toxicity relationship study of benzodiazepine drugs. Toxicol In Vitro 24:184–200
87. Kaiser KLE, Niculescu SP (2001) Modeling acute toxicity of chemicals to Daphnia magna: a probabilistic neural network approach. Environ Toxicol Chem 20:420–431
88. Kahn I, Sild S, Maran U (2007) Modeling the toxicity of chemicals to Tetrahymena pyriformis using heuristic multilinear regression and heuristic back-propagation neural networks. J Chem Inf Model 47:2271–2279
89. Izadiyan P, Fatemi MH, Izadiyan M (2013) Elicitation of the most important structural properties of ionic liquids affecting ecotoxicity in limnic green algae: a QSAR approach. Ecotoxicol Environ Saf 87:42–48
90. Domine D, Devillers J, Chastrette M et al (1993) Estimating pesticide field half-lives from a backpropagation neural network. SAR QSAR Environ Res 1:211–219
91. Jing G-H, Li X-L, Zhou Z-M (2011) Quantitative structure-biodegradability relationship study about the aerobic biodegradation of some aromatic compounds. Chin J Struct Chem 30:368–375
103. Zhao C, Boriani E, Chana A et al (2008) A new hybrid system of QSAR models for prediction bioconcentration factors (BCF). Chemosphere 73:1701–1707
104. Devillers J, Pham-Delègue MH, Decourtye A et al (2002) Structure-toxicity modelling of pesticides to honey bees. SAR QSAR Environ
92. Liu H, Papa E, Walker JD et al (2007) In Res 13:641–648
silico screening of estrogen-like chemicals 105. Song JS, Moon T, Nam KD et al (2008)
based on different nonlinear classification Quantitative structure-activity relationship
models. J Mol Graph Model 26:135–144 (QSAR) studies for fungicidal activities of
93. Roncaglioni A, Piclin N, Pintore M et al thiazoline derivatives against rice blast. Bioorg
(2008) Binary classification models for Med Chem Lett 18:2133–2142
endocrine disrupter effects mediated through 106. Fjodorova N, Vračko M, Tušar M et al (2010)
the estrogen receptor. SAR QSAR Environ Quantitative and qualitative models for carci-
Res 19:697–733 nogenicity prediction for non-congeneric
94. Wajima T, Fukumura K, Yano Y et al (2002) chemicals using CP NN method for regula-
Prediction of human clearance from animal tory uses. Mol Divers 14:581–594
data and molecular structural parameters 107. Singh KP, Gupta S, Rai P (2013) Predicting
using multivariate regression analysis. J Pharm carcinogenicity of diverse chemicals using
Sci 91:2489–2499 probabilistic neural network modelling
95. Chow H-H, Tolle KM, Roe DJ et al (1997) approaches. Toxicol Appl Pharmacol 272:
Application of neural networks to population 465–475
pharmacokinetic data analysis. J Pharm Sci 108. Dearden JC (2010) Expert systems for tox-
86:840–845 icity prediction. In: Cronin MTD, Madden
96. Singh KP, Gupta S, Rai P (2013) Predicting JC (eds) In silico toxicology: principles and
acute aqueous toxicity of structurally diverse applications. Royal Society of Chemistry,
chemicals in fish using artificial intelligence London, pp 478–507
approaches. Ecotoxicol Environ Saf 95: 109. Lapenna S, Fuart-Gatnick M, Worth A (2010)
221–232 Review of QSAR models and software tools
97. Gong Z, Xia B, Zhang R et al (2008) for predicting acute and chronic systemic tox-
Quantitative structure-activity relationship icity. Available at http://ihcp.jrc.ec.europa.
study on fish toxicity of substituted benzenes. eu/our_labs/predictive_toxicology/doc/
QSAR Comb Sci 27:967–976 EUR_24639_EN.pdf/view. Accessed 25 Jan
98. Okey RW, Martis MC (1999) Molecular level 2014
studies of the origin of toxicity: determination 110. Molecular Modeling Pro. Available at http://
of key variables and selection of descriptors. www.chemsw.com/Software-and-Solutions/
Chemosphere 38:1419–1427 L a b o r a t o r y - S o f t w a r e / D r a w i n g - a n d -
99. Halberstam NM, Baskin II, Palyulin VA et al Modeling-Tools/Molecular-Modeling-Pro.
(2002) Quantitative structure-conditions- aspx. Accessed 25 Jan 2014
property relationship studies. Neural network 111. Joback KG, Reid RC (1987) Estimation of
modelling of the acid hydrolysis of esters. pure-component properties from group-
Mendeleev Commun 12(5):185–186 contributions. Chem Eng Commun 57:
100. Dewhurst I, Renwick AG (2013) Evaluation 233–243
of the threshold of toxicological concern 112. ADMET Predictor. http://www.simulations-
(TTC)—challenges and approaches. Regul plus.com. Accessed 28 Jan 2014
Toxicol Pharmacol 65:168–177 113. ChemSilico. Available at www.chemsilico.
101. Meng Y, Lin B-L (2008) A feed-forward arti- com. Accessed 28 Jan 2014
ficial neural network for prediction of the 114. TOPKAT. Available at www.accelrys.com.
aquatic ecotoxicity of alcohol ethoxylate. Accessed 28 Jan 2014
Ecotoxicol Environ Saf 71:172–186 115. TerraBase. Available at www.terrabase-inc.
102. Fatemi MH, Jalali-Heravi M, Konuze E
com. Accessed 28 Jan 2014
(2003) Prediction of bioconcentration factor 116. Kaiser KLE (2003) Neural networks for effect
using genetic algorithm and artificial neural prediction in environmental health issues using
network. Anal Chim Acta 486:101–108 large datasets. QSAR Comb Sci 22:185–190
Chapter 6
Abstract
Collision-induced dissociation (CID) is widely used in mass spectrometry to identify biologically important
molecules by gaining information about their internal structure. Interpretation of experimental CID spec-
tra always involves some form of in silico spectra of potential candidate molecules. Knowledge of how
charge is distributed among fragments is an important part of CID simulations that generate in silico spec-
tra from the chemical structure of the precursor ions entering the collision chamber. In this chapter we
describe a method to obtain this knowledge by machine learning.
Key words Mass spectrometry, Collision-induced dissociation, Charge transfer, Metabolite identifica-
tion, Lipid metabolites
1 Introduction
Hugh Cartwright (ed.), Artificial Neural Networks, Methods in Molecular Biology, vol. 1260,
DOI 10.1007/978-1-4939-2239-0_6, © Springer Science+Business Media New York 2015
89
90 J.H. Miller et al.
the most likely bond to break in CID, in silico MS/MS spectra can
be derived from protein sequence without detailed knowledge of
protein structure. For metabolites, the connection between molec-
ular structure and CID is much more complex than for polypep-
tides; consequently, software tools for metabolite identification
from MS/MS spectra have been more difficult to develop [3].
MetISIS [4] is a software package being developed to aid in
the identification of metabolites. Given an MS/MS spectrum
acquired from an unknown sample, MetISIS attempts to match the
spectrum to an entry in a database of in silico spectra developed
from the chemical structure of metabolites. Most of the entries in
the database can be ignored because their mass is significantly dif-
ferent from the mass of the precursor ions fragmented to produce
the MS/MS spectrum of the unknown. If the masses of the
unknown metabolite and the database candidates are within a spec-
ified tolerance, then the similarity of CID spectra is quantified in
various ways, including Pearson correlation coefficient (PCC).
Requiring similarity of precursor-ion mass and PCC near unity
produces a shortlist of candidate molecules with a high probability
of being the unknown metabolite.
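The matching step described above, a precursor-mass filter followed by a Pearson-correlation cutoff, can be sketched as follows. This is a minimal illustration in Python, not the actual MetISIS implementation; the function names, database layout, tolerance, and cutoff are assumptions for the example.

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient between two aligned intensity lists
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def shortlist(unknown_mass, unknown_spectrum, database, tol=0.01, min_pcc=0.9):
    """Keep database entries whose precursor mass lies within `tol` Da of the
    unknown and whose in silico spectrum correlates strongly with it."""
    hits = []
    for name, (mass, spectrum) in database.items():
        if abs(mass - unknown_mass) <= tol:
            pcc = pearson(unknown_spectrum, spectrum)
            if pcc >= min_pcc:
                hits.append((name, pcc))
    return sorted(hits, key=lambda h: -h[1])
```

A decoy with the right mass but a dissimilar spectrum, or the right spectrum shape at the wrong mass, is excluded; only candidates passing both filters reach the shortlist.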
The success of this identification technique depends on the
quality of the database of MS/MS spectra. It should contain a
large number of metabolites for which high-resolution CID spec-
tra are available. Populating such a database with experimental
MS/MS spectra is both time-consuming and costly. To circumvent
this difficulty, in silico databases, such as LipidBlast [5], have been
developed. MetISIS [4] generates in silico MS/MS spectra of
metabolites based on their molecular structure, such as those found
in the LIPID Metabolites and Pathways Strategy (LIPID MAPS)
database [6] and illustrated in Fig. 1 for phosphatidylcholine
18:0/18:0 (see Note 1).
The simulation of CID that MetISIS [4] performs to generate
an MS/MS spectrum for a metabolite of known chemical structure
has two components: (1) prediction of which bonds are most likely
to break in a collision and (2) prediction of which fragments will
continue to carry charge and be detected by the mass spectrometer.
Fig. 1 Molecular structure of phosphatidylcholine 18:0/18:0. Arrows mark bond cleavages in collision-induced
dissociation (CID) that generate the fragments most frequently observed in a linear ion-trap mass
spectrometer
Charge Prediction in Metabolite Identification 91
Fig. 2 Flow chart of the algorithm developed to learn the relative frequency of
bond cleavages in collision-induced dissociation (CID) from experimental data.
When training is complete, the three components above the dotted line are
ignored and in silico CID spectra can be generated from the molecular structure
of lipid metabolites. The charge-prediction component is a separate machine-
learning task using the methods described in this chapter. (Reproduced from [4]
with permission from Oxford University Press)
Fig. 3 A block in an input vector displaying 18 options for how an atom is bonded to its successor in the tree
derived from a breadth-first search of a fragment from its root atom
Fig. 4 The distribution of true- and false-positive identifications as a function of Pearson correlation coefficient
(R2) in a test of the MetISIS algorithm that did not contain the charge-prediction component (see Fig. 2). While
the algorithm performed reasonably well, the approximately 50 false positives with R2 near unity suggested
that charge prediction was essential for accurate simulation of MS/MS spectra
2 Methods
2.1 Construction of Training and Testing Sets
1. The construction of training and testing sets begins by acquiring experimental CID spectra from pure samples of specified metabolites. The training/testing process for charge prediction in CID of lipid metabolites was based on MS/MS experiments with 94 lipids of known chemical structure (see Note 3).
2. The molecular structure of each lipid metabolite with an
experimental CID spectrum was provided by the chemical
supplier.
3. Input vectors (see Note 4) for all bond cleavages that might
be associated with peaks in experimental spectra were gener-
ated from the molecular structures. Bonds to hydrogen
atoms for which cleavage would reduce the precursor-ion
mass by only 1 Da were ignored. Bonds between carbon atoms
in tails of lipids were also ignored because it seemed likely
that any experimental peak could be explained by this type of
fragmentation.
4. To generate the output label, the left and right mass fields in
the <meta data> block of the encoded vector (see Note 4) were
used to determine the mass for each fragment and the total
mass of the lipid. The experimental CID spectrum of the lipid
was then examined to see if it contained a peak at either of
these two masses. In the absence of such a peak, the potential
input vectors for a charged and a neutral fragment were
removed from consideration. If there was a peak in the experimental spectrum at only one of the potential fragment masses, then the exemplar with that mass as the left-hand fragment was labeled 0.8 and the exemplar with that mass as the right-hand fragment was labeled 0.2. If peaks were found at both masses, then
the mass with the higher intensity was considered the winner
and the output of the pair of exemplars was labeled accord-
ingly. A tolerance of ±2 Da in matching peak masses with frag-
ment masses was used. For the 44 lipids selected to generate a
training set, this process gave rise to 658 pairs of exemplars.
For the 50 lipids selected to generate a testing set, the same
process gave rise to 2,183 pairs of exemplars.
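The labeling rules in step 4 can be sketched as a small function. This is an illustrative Python sketch of the described procedure, not the authors' code; the peak representation (a dict of m/z to intensity) is an assumption.

```python
def label_pair(left_mass, right_mass, peaks, tol=2.0):
    """Return (left_label, right_label) for a candidate bond cleavage, or
    None if the spectrum shows neither fragment. `peaks` maps m/z -> intensity."""
    def best_peak(mass):
        # strongest peak within +/- tol Da of the fragment mass, if any
        near = [(mz, i) for mz, i in peaks.items() if abs(mz - mass) <= tol]
        return max(near, key=lambda p: p[1]) if near else None

    left, right = best_peak(left_mass), best_peak(right_mass)
    if left is None and right is None:
        return None                      # discard this candidate cleavage
    if left is not None and right is None:
        return (0.8, 0.2)                # only the left-hand fragment matched
    if right is not None and left is None:
        return (0.2, 0.8)
    # peaks at both masses: the more intense one is considered the winner
    return (0.8, 0.2) if left[1] >= right[1] else (0.2, 0.8)
```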
5. Both training and testing sets contained labeled exemplars
from all 15 classes of lipids (see Note 3) for which experimental
MS/MS spectra were available.
2.2 ANN Structure
As shown in Fig. 5, the ANN architecture had an input layer with the 1,270 nodes required for input vectors that characterize the chemical environment around bond-cleavage sites (see Note 4) in the lipids with experimental spectra (see Note 3).
Fig. 5 Structure of the neural network used to develop charge prediction for a set
of 94 lipid metabolites distributed over 15 classes (see Note 3). The input vector
size (1,270) reflects the amount of compression that was possible with this set
of molecules (see Note 4). Bias nodes in the input and hidden layers are indi-
cated by their constant value of unity
Between this input layer and the single node of the output
layer is a hidden layer with two nodes. The number of nodes in the
hidden layer was chosen equal to the number of classes the input
data needs to distinguish, which are “left fragment charged” and
“left fragment neutral.” The bias nodes of the input and hidden
layers are indicated by their constant value of 1. Although charge
prediction could be treated as classification, we chose to treat it as
multivariable nonlinear regression. Sigmoidal transformations are
used to generate an output between 0 and 1. Back-propagation of
the difference between this output and the label of exemplars was
used to optimize the weights of the neural network with a learning
rate of 0.25 and momentum coefficient of 0.95.
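The configuration just described (one small sigmoid hidden layer, a single sigmoid output, and back-propagation with a learning rate of 0.25 and momentum of 0.95) can be sketched compactly. This is an illustrative Python implementation under those stated settings, not the authors' software; the weight initialization range is an assumption.

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyMLP:
    """Sigmoid hidden layer (default 2 nodes) + single sigmoid output,
    trained by back-propagation with momentum."""
    def __init__(self, n_in, n_hid=2, lr=0.25, momentum=0.95, seed=0):
        rnd = random.Random(seed)
        self.lr, self.mu = lr, momentum
        # +1 column/entry for the bias node (constant input of 1)
        self.w1 = [[rnd.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hid)]
        self.w2 = [rnd.uniform(-0.5, 0.5) for _ in range(n_hid + 1)]
        self.dw1 = [[0.0] * (n_in + 1) for _ in range(n_hid)]
        self.dw2 = [0.0] * (n_hid + 1)

    def forward(self, x):
        xb = x + [1.0]
        self.h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w1]
        self.out = sigmoid(sum(w * v for w, v in zip(self.w2, self.h + [1.0])))
        return self.out

    def train_one(self, x, target):
        out = self.forward(x)
        d_out = (target - out) * out * (1 - out)   # sigmoid output delta
        # hidden deltas must use w2 BEFORE it is updated
        d_hid = [d_out * self.w2[j] * h * (1 - h) for j, h in enumerate(self.h)]
        hb, xb = self.h + [1.0], x + [1.0]
        for j in range(len(self.w2)):
            self.dw2[j] = self.lr * d_out * hb[j] + self.mu * self.dw2[j]
            self.w2[j] += self.dw2[j]
        for j, dj in enumerate(d_hid):
            for k in range(len(xb)):
                self.dw1[j][k] = self.lr * dj * xb[k] + self.mu * self.dw1[j][k]
                self.w1[j][k] += self.dw1[j][k]
        return (target - out) ** 2
```

Repeatedly presenting an exemplar drives the output toward its label; the momentum term accumulates updates that point in a consistent direction, which is why convergence with only two hidden nodes is rapid.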
2.3 Training Results
The root-mean-square deviation (RMSD) between output values and exemplar labels was selected as a measure of training convergence. The lower curve in Fig. 6 shows the convergence achieved
by 500 epochs of training. These results reveal that convergence is
rapid, as would be expected for a neural network with only two
nodes in a single hidden layer. The “elbow” in the RMSD for the
training set occurs after about 25 epochs, and the RMSD decreases
very slowly after about 200 epochs.
The upper curve in Fig. 6 shows the RMSD of output values
relative to the label on exemplars of the test set, which was calcu-
lated after every epoch of training. The RMSD relative to the test
set follows the RMSD of the training set for only about 10 epochs, after which it deviates sharply and then remains approximately constant, increasing slightly as the RMSD of the training set continues to decrease.
Table 1
Accuracy of charged-fragment prediction in a test set derived from 50
lipid metabolites distributed over 15 classes (see Note 3) after 500 cycles
of ANN refinement on a similar training set
3 Notes
Acknowledgments
References
1. Eng JK, McCormack AL, Yates JR (1994) An approach to correlate tandem mass-spectral data of peptides with amino-acid sequences in a protein database. J Am Soc Mass Spectrom 5:976–989
2. Perkins DN, Pappin DJ, Creasy DM et al (1999) Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20:3551–3567
3. Scheubert K, Hufsky F, Bocker S (2013) Computational mass spectrometry for small molecules. J Cheminform. doi:10.1186/1758-2946-5-12
4. Kangas LJ, Metz TO, Isaac G et al (2012) In silico identification software (ISIS): a machine learning approach to tandem mass spectral identification of lipids. Bioinformatics 28:1705–1713
5. Kind T, Liu KH, Lee do Y et al (2013) LipidBlast in silico tandem mass spectrometry database for lipid identification. Nat Methods 10:755–758
6. LIPID Metabolites and Pathways Strategy (LIPID MAPS) program. www.lipidmaps.org
7. Goldberg DE (1989) Genetic algorithms in search, optimization and machine learning. Addison-Wesley, Reading, MA
8. Faulon JL, Visco D, Roe D (2005) Enumerating molecules. In: Lipkowitz KB et al (eds) Reviews in computational chemistry, vol 21. Wiley, Hoboken, NJ, pp 209–286
9. Schietgat L, Ramon J, Bruynooghe M et al (2008) An efficiently computable graph-based metric for the classification of small molecules. In: Proceedings of the 11th international conference on discovery science (LNAI 5525), pp 197–209
10. Cormen TH, Leiserson CE, Rivest RL et al (2009) Elementary graph algorithms. In: Introduction to algorithms, 3rd edn. MIT Press, Cambridge, pp 594–602
Chapter 7
Abstract
Peptides are molecules of varying complexity, with different functions in the organism and with remarkable
therapeutic interest. Predicting peptide activity by computational means can help us to understand their
mechanism of action and deliver powerful drug-screening methodologies. In this chapter, we describe how
to apply artificial neural networks to predict antimicrobial peptide activity.
1 Introduction
Electronic Supplementary Material: The online version of this chapter (doi: 10.1007/978-1-4939-2239-0_7)
contains supplementary material, which is available to authorized users
Hugh Cartwright (ed.), Artificial Neural Networks, Methods in Molecular Biology, vol. 1260,
DOI 10.1007/978-1-4939-2239-0_7, © Springer Science+Business Media New York 2015
101
102 David Andreu and Marc Torrent
[Fig. 1 structure drawings not reproduced. Panel (a) shows the amino acids grouped as positively charged (Arg, His, Lys), negatively charged (Asp, Glu), polar (Ser, Thr, Asn, Gln, Gly, Cys, Pro), and hydrophobic (Ala, Val, Ile, Met, Phe, Tyr, Trp); panels (b) and (c) show the peptide backbone and its hydrogen-bonded secondary structures.]
Fig. 1 Peptide composition and structure. Peptides are built from amino acids with different chemical properties
(a) linked by amide bonds. The peptide chain is mainly planar, with flexibility mostly limited to rotations
around the α-carbon atoms (b). Despite these constraints, peptides can adopt different secondary structures
(c) including α-helix and β-strands with loops and coils
Prediction of Bioactive Peptides 103
Table 1
Selected examples of natural peptides
1.2 Use of Prediction Models in the Search for New Drugs
As detailed in Table 1, peptides can have diverse functions, which are correlated with amino acid composition and structure. It is therefore of particular interest to develop computational tools able to predict active peptides based on their physicochemical and structural characteristics. In this chapter we will focus on antimicrobial peptide classification, but the methods and techniques described here are widely applicable: the same strategies can be used to predict other classes of biologically active peptides.
We aim to describe how one can use artificial neural networks (ANNs) to predict peptide activity: how to build and curate datasets, train an ANN, and validate the results in the software package R.
2 Materials
3 Methods
3.1 Build Datasets
Though it might appear obvious, it is worth stating that the results of computational analysis are strongly dependent on the quality of the databases provided. For a supervised learning strategy such as the
Table 2
Databases of antimicrobial peptides
3.1.1 Build the Positive Dataset
The positive dataset is built from a list of peptides that have been shown experimentally to have the property of interest. The elements of the dataset can be retrieved from experiments done in the laboratory or from public databases. The dataset consists of a text file, built with either a text editor or a spreadsheet, containing the list of peptides. It should be saved as a non-formatted text file (.txt extension), with the peptides delimited by carriage returns.
For antimicrobial peptides, useful databases are listed in Table 2.
In our working example, we have selected the cationic peptides
from the APD2 database (see AMPlist.txt included in the
Supplementary Material).
3.1.2 Build the Negative Dataset
This dataset contains a list of peptides validated as not having the property of interest. As with the positive dataset, it can be built from experiments performed in the laboratory or from public databases, or by extrapolation as described in the introduction to this section. The dataset consists of a text file as described in Subheading 2.1. Like the positive dataset, it should be saved as a non-formatted text file (.txt extension), with the peptides delimited by carriage returns.
To build our example negative dataset, we have used the
Uniprot server (http://uniprot.org) to search for non-antimicrobial,
nontoxic, and non-antibiotic peptides, with restricted length (from
5 to 45 amino acids). Only Uniprot-refined entries were selected.
A list of >3,000 peptides was finally included in the negative
dataset. As these peptides may have high sequence similarity, we used CD-Hit (http://weizhong-lab.ucsd.edu/cdhit_suite/, [19]) to remove redundant sequences.
3.1.3 Merge Datasets and Add the Class Vector
Merging the datasets is an optional but convenient step. To merge the datasets, we have merely copied the text contained in the positive and negative datasets (in that order) into a new text file, called AMPlist (see AMPlist.txt included in the Supplementary Material).
To train the classification system, peptides in the list must be
labeled as active or inactive. To do that, active peptides were labeled
1 and inactive peptides labeled 0. While a list of 0s and 1s can be built using a spreadsheet and then loaded into the training software, for convenience we have instead included a code line to build the class vector (see below).
3.1.4 Script Explanation
The script provided uses the R function read.table() to load the peptide list file. In the example, the peptide list AMPlist.txt contains 2,048 antimicrobial peptides and 2,234 non-antimicrobial peptides, built as described above. To load a different list, just change the file name in the script (line 13, bold text):
peptide_list<- read.table("AMPlist.txt", quote="\"")
The class vector, containing the appropriate numbers of 0s and 1s, is created in line 19 of the script:
y<- c(rep(1,2048),rep(0,2234))
The first rep() call gives the number of active peptides and the second the number of inactive peptides. These numbers can be changed but must be consistent with the list supplied as peptide_list.
3.2 Select Descriptors
Molecular descriptors are numerical values that characterize properties of the molecule of interest [20]. For example, a simple molecular descriptor could be the molecular weight or the charge of the peptide. There are many types of molecular descriptors, but they can be classified as either (1) structure-based, which depend on the structural properties and include topological or geometrical descriptors, or (2) property-based, which are experimentally or theoretically determined values and include properties such as hydrophobicity or accessible surface area.
Structure-based descriptors can refer to different levels of
organization. For example, in peptides we can refer to (1) the
amino acid sequence, (2) the secondary structure of particular
segments in the peptide, and (3) the complete tertiary structure of
the peptide.
3.2.1 Descriptor Calculation
One of the most extensive lists of descriptors for amino acids is encoded in the AAindex database [21, 22]. The AAindex database contains a selected list of 533 descriptors, based on amino acid physicochemical properties, substitution matrices, and statistical protein contact potentials.
The method of calculating properties based on the AAindex
values is also flexible. The simplest strategy is to compute the aver-
age for all amino acids present in the peptide sequence. This
approach is computationally inexpensive but may miss important
details like the order in which the amino acids appear in the struc-
ture. For example, the peptide ASLP will have the same value as
the peptide PALS, though the peptides may have completely dif-
ferent biological properties.
Another approach is to use autocorrelation, which takes into
account the distribution of these properties along the sequence of
peptides [23, 24]. There are several strategies to compute autocor-
relation; one example is the Moreau-Broto autocorrelation:
AC_MB = Σ_{i=1}^{N−d} P_i · P_{i+d}
where the property of interest is named P and the autocorrelation
value is computed as the summation of products along the sequence
for amino acids separated by a distance d. For the example described before, assuming a distance d of 1 amino acid and the arbitrary values A = 1, S = 2, L = 3, and P = 4, we obtain AC_MB = 1 × 2 + 2 × 3 + 3 × 4 = 20 for peptide ASLP and AC_MB = 4 × 1 + 1 × 3 + 3 × 2 = 13 for peptide PALS.
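Both the averaging strategy and the Moreau-Broto autocorrelation can be sketched in a few lines (Python used for illustration; the toy property values are the arbitrary ones from the text, not real AAindex entries):

```python
def average(seq, prop):
    """Sequence-order-blind descriptor: mean property value over the peptide."""
    return sum(prop[aa] for aa in seq) / len(seq)

def moreau_broto(seq, prop, d=1):
    """Moreau-Broto autocorrelation: sum of P_i * P_(i+d) along the sequence."""
    vals = [prop[aa] for aa in seq]
    return sum(vals[i] * vals[i + d] for i in range(len(vals) - d))

# arbitrary toy property values from the text: A = 1, S = 2, L = 3, P = 4
toy = {"A": 1, "S": 2, "L": 3, "P": 4}
```

Note that ASLP and PALS get identical averages but different autocorrelation values, which is exactly the sequence-order information the averaging strategy misses.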
In the script provided in this chapter, we will use the average
strategy, but autocorrelation can be also implemented in the script.
To learn more about other methods to calculate autocorrelation,
see refs. 23, 24.
Though it is possible to calculate all descriptors in the script
provided, several limitations must be taken into account. First, the
number of descriptors should be small compared with the number
of examples provided for the training process to be accurate. For
example, it is not suitable to use 400 descriptors if the peptide
dataset has only 200 peptides. Second, using a disproportionate number of descriptors could easily lead to model over-fitting. Third, many descriptors are highly correlated with others, generating redundant information.
3.2.2 Script Explanation
In the script provided, only a small subset of parameters [13] is used for prediction. To load a different subset of parameters, substitute the numbers given in lines 22 and 27 (highlighted in bold) with the corresponding numbers in the descriptor list (see Supplementary Material):
vector<- as.numeric(c("9", "37", "38", "39", "40", "69", "71",
"96", "146", "151", "319", "321", "381", "401"))
rownames(ff)<- c("AA9", "AA37", "AA38", "AA39", "AA40",
"AA69", "AA71", "AA96", "AA146", "AA151", "AA319",
"AA321", "AA381", "AA401")
In lines 30–39, we implement code that deletes highly correlated parameters: parameters displaying a correlation coefficient of 0.90 or higher with another parameter are disregarded during network training (see Note 5).
The threshold can be changed in the script, in line 33:
highlyCor<- findCorrelation(cor2, 0.90)
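The effect of this filtering step can be illustrated with a greedy analogue of caret's findCorrelation. This Python sketch is for illustration only; it is not the caret algorithm itself, and the column layout is an assumption.

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient between two descriptor columns
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def drop_correlated(columns, cutoff=0.90):
    """Walk through (name, values) descriptor columns and drop any column
    whose absolute correlation with an already-kept column reaches the cutoff."""
    kept, dropped = [], []
    for name, col in columns:
        if any(abs(pearson(col, kcol)) >= cutoff for _, kcol in kept):
            dropped.append(name)
        else:
            kept.append((name, col))
    return [n for n, _ in kept], dropped
```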
3.3 Develop the Artificial Neural Network Model
Artificial neural networks are computational models inspired by the neuronal system of the brain [25, 26]. ANNs consist of inputs (representing the signal received by the neuron) that are multiplied by weights (representing the strength of the signal) and then mathematically transformed to produce the outputs (the response to the input signal). This is represented schematically in Fig. 2a.
In summary, ANNs can be represented as weighted directed
graphs with neurons represented by nodes and connections rep-
resented by directed edges. Following this representation, ANNs
can be classified as (1) feed-forward networks or as (2) recurrent
(or feedback) networks. The difference between them is that the former does not contain any loops while the latter does (Fig. 2b). In this chapter we will focus only on feed-forward networks and, specifically, on multilayer perceptron networks, a particular ANN in which neurons are organized in layers and are connected by unidirectional edges (Fig. 2c).
In this chapter, we aim to build ANNs that learn from given
examples (known active and inactive peptides) to be able to predict
new inputs (unknown peptides). This scheme is known as supervised learning because we provide a set of examples to help the network learn.
a
Input 1
w1
Input 2 w2
w3 T OUTPUT
Input 3
w4
Input 4
b
Input 1
T OUTPUT
Input 2
Input 1
T OUTPUT
Input 2
c
Input 1
Input 2
T
Input 3
Input 4
T OUTPUT
Input 5
Input 6
T
Input 7
Input 8
3.3.1 Selection of the Training and Test Datasets
In the script provided, we have divided the entire dataset (AMPlist) into a training set containing 75 % of the observations and a testing set containing the remaining 25 %. This division can be changed in line 46 (bold number) (see Note 6):
smp_size<- floor(0.75 * nrow(M))
3.3.2 Training and Validation
The training step consists of feeding the model with input data and allowing it to optimize the weights in order to find the values that best fit the data. Different training strategies have been described in Subheading 2.3; in this section, therefore, we will focus on explaining how validation is performed and why it is important.
Validation is an essential step in assessing the strength of the
method. If the model were trained with the entire set of input data,
we could expect that it would find a set of weights that fit the data
with high accuracy. However, when this model is tested with a new
set of examples, previously unseen by the model, we may find a
poor prediction accuracy. This behavior is called over-fitting: instead of learning a true, generalizable rule, the model merely memorizes the example data.
RMSE was used to select the optimal model using the smallest
value. The final value used for the model was size = 5.
The results output summarizes the method: we have used 9 predictors over a database of 3,211 examples, with no pre-processing and with a fivefold cross-validation strategy (each step comprising 2,569 examples used to train the method).
The method has tested different numbers of neurons in the hidden layer, from 0 to 5. To determine which model was best, it identifies the model with the lowest root mean square error (RMSE), which here is the model with 5 neurons. The RMSE is
the square root of the variance of the residuals, indicating the abso-
lute fit of the model to the data. For classification purposes, our
model outputs have values between 0 and 1 while the true values
are binarized (either 0 or 1). RMSE measures the error distance
between predicted and real data. As we need a threshold to map
predicted to real data, RMSE does not account for prediction
accuracy but is a good indicator of how well the model fits the data.
As a rule of thumb, the lower the better; it should not be higher
than 0.5.
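The RMSE used here for model selection can be computed directly. A minimal sketch in Python, for illustration only, of the error between continuous model outputs and binarized class labels:

```python
import math

def rmse(predicted, actual):
    """Root mean square error between model outputs and 0/1 class labels."""
    residuals = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(residuals) / len(residuals))
```

A perfect fit gives 0; outputs stuck at 0.5 against binary labels give exactly 0.5, which matches the rule of thumb that the RMSE should not exceed 0.5.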
3.3.3 Testing
We use the held-out portion of the dataset (in our case the 25 %) to test the predictive power of the method. In the testing step, the
method is applied to the testing dataset and the predictions are
compared with the real values. In the best-case scenario, the pre-
dicted values will be identical to the real values. We define true posi-
tive values (TP) as those positive observations that are predicted as
positive (in our case, antimicrobial peptides correctly predicted
as antimicrobial) and true negatives (TN) as negative observations
that are correctly predicted as negative (non-active peptides pre-
dicted as non-antimicrobial). We can also have false positives (FP,
non-active peptides predicted as antimicrobials) and false negatives
(FN, active peptides predicted as non-antimicrobials).
Therefore, we can build a matrix, as depicted in Table 3, known
as the confusion matrix. Good prediction methods will have a large
number of observations as true positives or true negatives and low
numbers of false negatives and false positives.
Table 3
Confusion matrix

              Prediction
              0     1
Real     0    TN    FP
         1    FN    TP
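The counts in the confusion matrix translate directly into the usual performance measures; a small Python sketch (the labels below are invented, with 1 = antimicrobial):

```python
def confusion_counts(y_true, y_pred):
    """Tally TP, TN, FP, FN for binary labels (1 = antimicrobial, 0 = non-active)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / (tp + tn + fp + fn)   # overall fraction correct
sensitivity = tp / (tp + fn)                 # fraction of active peptides recovered
specificity = tn / (tn + fp)                 # fraction of inactive peptides recovered
print(tp, tn, fp, fn, accuracy)  # 3 3 1 1 0.75
```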
3.4 Network Visualization

To help understand the structure of the neural network and how it operates, it is useful to represent it as shown in Fig. 2. It is also important to specify the weights, that is, the contribution of each node to the next layer of the network. The higher the contribution of a node, the more important it is to the prediction.
In the script provided (line 69), we have used a function called plot.nnet to visualize our network (please see http://beckmw.wordpress.com/ for a complete description of this function):

repnet <- mlp(train[,1:9], train[,10], size=4)
plot.nnet(repnet)
First, we recompute the neural network with a number of hid-
den neurons (size) specified by the output results as detailed in
Subheading 3.3.2 and then use the plot.nnet function to plot the
results (Fig. 3).
Fig. 3 Example of a neural network obtained using the example model proposed in this chapter: nine input nodes (I1–I9), four hidden-layer nodes (H1–H4), and one output node (O1)
Weights are depicted on the edges that connect the nodes. The absolute value of a weight is symbolized by the width of the line (the wider the line, the higher the contribution). The line color indicates whether the contribution is positive (green) or negative (blue). In this example, descriptors 6 and 7 contribute significantly to the model, while descriptor 2 contributes less.
The plot.nnet function does not depict the bias values (constant values added to each hidden and output layer) for the mlp neural network; however, this has no impact on the interpretation of our results. The plot.nnet function depicts bias values only for the nnet function.
4 Notes
Acknowledgements
References
1. Nelson DL, Cox MM (2005) Lehninger principles of biochemistry, 4th edn. p 1100
2. Liu F, Baggerman G, Schoofs L et al (2008) The construction of a bioactive peptide database in Metazoa. J Proteome Res 7:4119–4131
3. Zhang H, Forman HJ (2012) Glutathione synthesis and its role in redox signaling. Semin Cell Dev Biol 23:722–728
4. Patel BM, Mehta AA (2012) Aldosterone and angiotensin: role in diabetes and cardiovascular diseases. Eur J Pharmacol 697:1–12
5. Amblard M, Fehrentz J-A, Martinez J et al (2005) Fundamentals of modern peptide synthesis. Methods Mol Biol 298:3–24
6. Coin I, Beyermann M, Bienert M (2007) Solid-phase peptide synthesis: from standard procedures to the synthesis of difficult sequences. Nat Protoc 2:3247–3256
7. Hilpert K, Winkler DFH, Hancock REW (2007) Peptide arrays on cellulose support: SPOT synthesis, a time and cost efficient method for synthesis of large numbers of peptides in a parallel and addressable fashion. Nat Protoc 2:1333–1349
8. Winkler DFH, Campbell WD (2008) The spot technique: synthesis and screening of peptide macroarrays on cellulose membranes. Methods Mol Biol 494:47–70
9. Reddy AS, Pati SP, Kumar PP et al (2007) Virtual screening in drug discovery—a computational perspective. Curr Protein Pept Sci 8:329–351
10. Boix E, Nogues VM, Torrent M (2012) Discovering new in silico tools for antimicrobial peptide prediction. Curr Drug Targets 13:1148–1157
11. Torrent M, Andreu D, Nogués VM et al (2011) Connecting peptide physicochemical and antimicrobial properties by a rational prediction model. PLoS One 6:e16968
12. Wang G, Li X, Wang Z (2009) APD2: the updated antimicrobial peptide database and its application in peptide design. Nucleic Acids Res 37:D933–D937
13. Hammami R, Zouhir A, Le Lay C et al (2010) BACTIBASE second release: a database and tool platform for bacteriocin characterization. BMC Microbiol 10:22
14. Waghu FH, Gopi L, Barai RS et al (2014) CAMP: collection of sequences and structures of antimicrobial peptides. Nucleic Acids Res 42:D1154–D1158
15. Seebah S, Suresh A, Zhuo S et al (2007) Defensins knowledgebase: a manually curated database and information source focused on the defensins family of antimicrobial peptides. Nucleic Acids Res 35:D265–D268
16. Gueguen Y, Garnier J, Robert L et al (2006) PenBase, the shrimp antimicrobial peptide penaeidin database: sequence-based classification and recommended nomenclature. Dev Comp Immunol 30:283–288
17. Hammami R, Ben HJ, Vergoten G et al (2009) PhytAMP: a database dedicated to antimicrobial plant peptides. Nucleic Acids Res 37:D963–D968
18. Li Y, Chen Z (2008) RAPD: a database of recombinantly-produced antimicrobial peptides. FEMS Microbiol Lett 289:126–129
19. Huang Y, Niu B, Gao Y et al (2010) CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics 26:680–682
20. Jenssen H (2011) Descriptors for antimicrobial peptides. Expert Opin Drug Discov 6:171–184
21. Kawashima S, Pokarowski P, Pokarowska M et al (2008) AAindex: amino acid index database, progress report 2008. Nucleic Acids Res 36:D202–D205
22. Kawashima S, Ogata H, Kanehisa M (1999) AAindex: amino acid index database. Nucleic Acids Res 27:368–369
23. Li ZR, Lin HH, Han LY et al (2006) PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Res 34:W32–W37
24. Rao HB, Zhu F, Yang GB et al (2011) Update of PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Res 39:W385–W390
25. Krogh A (2008) What are artificial neural networks? Nat Biotechnol 26:195–197
26. Heaton J (2012) Introduction to the math of neural networks. Heaton Research Inc., 119 pages
27. Günther F, Fritsch S (2010) neuralnet: training of neural networks. R J 2:30–38
28. Bergmeir C, Benitez JM (2012) Neural networks in R using the Stuttgart Neural Network Simulator: RSNNS. J Stat Software 46:1–26
29. Kuhn M (2008) Building predictive models in R using the caret package. J Stat Software 28:1–26
Chapter 8
Abstract
In biology and chemistry, a key goal is to discover novel compounds affording potent biological activity or
chemical properties. This could be achieved through a chemical intuition-driven trial-and-error process or
via data-driven predictive modeling. The latter is based on the concept of quantitative structure-activity/
property relationship (QSAR/QSPR) when applied in modeling the biological activity and chemical prop-
erties, respectively, of compounds. Data mining is a powerful technology underlying QSAR/QSPR as it
harnesses knowledge from large volumes of high-dimensional data via multivariate analysis. Although
extremely useful, the technicalities of data mining may overwhelm potential users, especially those in the
life sciences. Herein, we aim to lower the barriers to access and utilization of data mining software for
QSAR/QSPR studies. AutoWeka is an automated data mining software tool that is powered by the widely
used machine learning package Weka. The software provides a user-friendly graphical interface along with
an automated parameter search capability. It employs two robust and popular machine learning methods:
artificial neural networks and support vector machines. This chapter describes the practical usage of
AutoWeka and relevant tools in the development of predictive QSAR/QSPR models. Availability: The
software is freely available at http://www.mt.mahidol.ac.th/autoweka.
1 Introduction
Hugh Cartwright (ed.), Artificial Neural Networks, Methods in Molecular Biology, vol. 1260,
DOI 10.1007/978-1-4939-2239-0_8, © Springer Science+Business Media New York 2015
120 Chanin Nantasenamat et al.
and Cros [2] in the mid-1850s; they established that there exists an
association between chemical constitution and physiological prop-
erties. This was at a time prior to the reporting of the aromatic ring
structure of benzene in 1865 by Kekulé [3]. In 1868, Brown and
Fraser observed changes to physiological action upon methylation
of the basic nitrogen atom in alkaloids. In 1869, Richardson [4]
showed that aliphatic alcohols of different molecular weight also
displayed varied narcotic effects. Similarly, a few decades later in
1893, Richet [5] showed that the water solubility of polar chemi-
cals (such as alcohols, ethers, and ketones) was inversely related to
their toxicities, whereby a decrease in the solubility of a polar
chemical results in an increase in its toxicity. Toward the 1900s,
Meyer and Overton [6, 7] reported that the lipophilicity of a group
of organic compounds exhibited a linear relationship with their
narcotic potencies. In 1917, Moore [8] investigated the effects of
chemical “fumigants” on insects and showed that toxicity of these
compounds increased with their boiling point. It can be seen that
the majority of these efforts pertained to the establishment of qual-
itative relationships between chemicals and their respective activity/
property.
Quantitative relationships between chemical structures and
activity/property started to take shape when Hammett [9] intro-
duced a simple equation (later to be known as the Hammett equa-
tion) in 1937 that considers the substituent effect caused by
electron-withdrawing or electron-donating groups as summarized
by the sigma constant, while the rho constant describes the struc-
tural class or chemotype under study. Such an equation allows
determination of electronic effects caused by the placement of sub-
stituent groups at different positions (i.e., ortho-, meta-, and para-)
of the benzene ring. The subsequent work of Taft [10] led to the
introduction of the steric parameter. It is these two contributions
from Hammett and Taft that paved the way for further contribu-
tions from Hansch et al. In the mid-1950s to 1960s, Hansch and
Muir employed the Hammett substituent constant and partition
coefficients in their formulation of equations to correlate chemical substituents with the growth of the Avena plant using plant growth regulators comprising phenoxyacetic acids and chloromycetins [11–13]. Finally, in 1964, Hansch and Fujita [14] investigated the biological effects of a wide range of chemicals using a
combination of physicochemical properties as summarized by the
following equation:
log(1/C) = a·σ + b·π − c·π² + d
where C is the molar concentration of hormone, σ is the Hammett
parameter, and π is a measure of hydrophobicity. In parallel, Free
and Wilson [15] analyzed such structure-activity relationships using
a binary approach in which the presence or absence of a substituent
AutoWeka 121
[Workflow figure: chemical structures from bioactivity databases and the primary literature are drawn, subjected to initial geometry optimization (AM1), described by molecular descriptors, reduced to a descriptor subset, and used to produce predicted bioactivity values]
Table 1
Types of biological activities modeled by QSAR and selected examples
Type Examples
Absorption Blood-brain barrier penetration
Human intestinal absorption
P-glycoprotein
Skin permeability
Distribution Aqueous solubility (logS)
Octanol-water partition coefficient (logP)
Octanol-water distribution coefficient (logD)
Metabolism Cytochrome P450 induction/inhibition
Excretion Hepatic microsomal intrinsic clearance
Toxicity AMES mutagenicity
Carcinogenicity
Hepatotoxicity
Skin sensitivity
Target/organ cytotoxicity
Receptor agonist/antagonist A3 adenosine receptor antagonist
Angiotensin II receptor antagonist
CCR5 receptor antagonist
Glucagon receptor antagonist
Peroxisome proliferator-activated receptor
gamma agonists
Enzyme activator/inhibitor/modulator Aromatase inhibitor
Dipeptidyl peptidase IV inhibitor
Gamma-aminobutyric acid modulator
Gamma-secretase modulator
Glucokinase activator
2 Materials
2.1 Data Source

There are several ways in which one can acquire the data to be used for QSAR/QSPR modeling: it can be measured in one's own experiments, compiled from the primary literature, retrieved from curated databases, or all of the above. In most cases, access to the primary literature depends heavily on institutional or personal subscriptions, although a growing number of open access journals are freely available to readers.
2.2 Drawing and Refining Chemical Structures

Chemical structures can be drawn into the computer using any of the following software tools:
●● Accelrys Draw (available at http://www.accelrys.com)
●● MarvinSketch (available at http://www.chemaxon.com)
●● OpenEye VIDA (available at http://www.eyesopen.com)
Chemical structure file format conversion tools of use include:
●● Babel (available at http://www.eyesopen.com)
●● Open Babel (available at http://www.openbabel.org)
Molecular structures can be refined using online tools:
●● CORINA Online Demo (available at http://www.molecular-networks.com/online_demos/corina_demo)
●● COSMOS (accessible at http://cosmos.igb.uci.edu)
In certain cases where the molecular structure is relevant or of
importance for use in prediction, it can also be subjected to geom-
etry optimization via quantum chemical calculations:
●● Gaussian (commercially available at http://www.gaussian.
com)
●● GAMESS (available at http://www.msg.ameslab.gov/gamess)
●● HyperChem (commercially available at http://www.hyper.
com)
●● MOLCAS (commercially available at http://www.molcas.org)
●● MOPAC (available at http://openmopac.net)
2.4 Data Compilation

Prior to constructing the QSAR/QSPR model, the block of X molecular descriptors and Y bioactivity values are combined and prepared in ARFF file format for use as input to AutoWeka.
●● Spreadsheet program
–– Microsoft Excel (commercially available at http://office.microsoft.com/excel)
–– OpenOffice (freely available at http://www.openoffice.org)
●● Text editor
–– Notepad++ (freely available at http://notepad-plus-plus.org)
–– EditPad Lite (freely available at http://www.editpadlite.com)
●● CSV to ARFF converter
–– csv2arff (accessible from http://slavnik.fe.uni-lj.si/markot/csv2arff/csv2arff.php)
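If the online converter is unreachable, the same conversion is easy to script. Here is a sketch in Python (the descriptor names and values are placeholders, not those of the study); it treats every column but the last as numeric and the last as the nominal class:

```python
import csv
import io

def csv_to_arff(csv_text, relation, class_values):
    """Convert a CSV string with a header row into a minimal ARFF document.

    All columns except the last are declared as numeric descriptors; the
    last column is declared as a nominal class with the given values.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = ["@RELATION " + relation, ""]
    for name in header[:-1]:
        lines.append("@ATTRIBUTE %s NUMERIC" % name)
    lines.append("@ATTRIBUTE %s {%s}" % (header[-1], ",".join(class_values)))
    lines += ["", "@DATA"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines)

# Hypothetical two-compound example with made-up descriptor columns.
csv_text = "HOMO,LUMO,nOH,activity\n-0.134,0.067,2,Low\n-0.125,0.063,4,High\n"
print(csv_to_arff(csv_text, "RadicalScavenging", ["Low", "High"]))
```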
2.5 Multivariate Analysis

It should be noted that multivariate analysis, the actual process of constructing the QSAR model, is performed using AutoWeka (freely available at http://www.mt.mahidol.ac.th/autoweka). It is in this phase that parameter optimization takes place in an automated fashion. Completed results will contain information on the best set of parameters for the selected machine learning algorithm.
2.6 Plotting Graphs

After completing AutoWeka calculations, results from parameter optimization can be visualized using the Python scripts available in the Download section at http://www.mt.mahidol.ac.th/autoweka. Alternatively, plots of the results can be created with statistical graphics software such as SigmaPlot (commercially available from http://www.sigmaplot.com).
3 Methods
3.1 Data Source

At this phase, the practitioner decides on the data to be used in their QSAR/QSPR project and retrieves the necessary information (i.e., chemical name, SMILES, and bioactivity values) from the resources mentioned in Subheading 2.1; the data are then
compiled as outlined in Subheading 3.4. In the case study described
in this chapter, we will be using a data set of 22 curcumin analogs
with DPPH free radical-scavenging activity as previously reported
by Venkateswarlu et al. [46] and modeled by Worachartcheewan
et al. [47]. The bioactivity values were binned to the binary labels
“low” and “high” activity, denoting compounds affording IC50
values of greater than and less than 10 μM, respectively. It should
be noted that in the previously reported QSAR modeling study
described by Worachartcheewan et al. [47], three compounds
(nos. 5, 6, and 10) were identified as outliers and subjected to
removal from the data set, thereby reducing it to 19 compounds.
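The binarization step described above is simple to reproduce; a minimal Python sketch (the IC50 values are invented, the 10 μM threshold is the one used in the study, and lower IC50 means higher activity):

```python
def bin_activity(ic50_um, threshold=10.0):
    """Label a compound 'high' when its IC50 (in micromolar) is below the threshold."""
    return "high" if ic50_um < threshold else "low"

ic50_values = [3.2, 15.8, 9.9, 42.0]            # hypothetical IC50 values in uM
labels = [bin_activity(v) for v in ic50_values]
print(labels)  # ['high', 'low', 'high', 'low']
```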
3.2 Drawing and Refining Chemical Structures

Chemical structures of the curcumin analogs (Fig. 2) are drawn into the computer using the software described in Subheading 2.2. Subsequently, the structures are subjected either to a rough energy minimization (using a molecular mechanics force field) or to a more extensive quantum chemical calculation (using a method such as HF or B3LYP) in order to obtain a more refined structure. These structures can be subjected to an initial geometry refinement using the semiempirical Austin Model 1 (AM1) method, followed by Becke's three-parameter hybrid method with the Lee-Yang-Parr correlation functional (B3LYP) and the 6-31g(d) basis set, as performed previously by Worachartcheewan et al. [47].
3.4 Data Compilation

One of the first phases of a QSAR/QSPR study is compiling the data set of interest, which can be performed by entering the essential data into an appropriate spreadsheet program. For example, a typical
[Fig. 2: chemical structures of the 22 curcumin analogs (compounds 1–22)]
3.5 Machine Learning Algorithms

Before we move on to the multivariate analysis phase and actually construct the QSAR model, it is pertinent to first provide a glimpse of the algorithmic details of the machine learning algorithms that are commonly used in QSAR/QSPR of biological [47–56] and chemical systems [57–61] and that are implemented in AutoWeka.
3.5.1 Artificial Neural Network

The artificial neural network (ANN) is a well-known multivariate method that is commonly used to develop QSAR/QSPR models. The ANN takes its inspiration from the biological brain and its organization of neurons [62]. The single-layer perceptron (SLP), which
Table 2
Example of ARFF input file of curcumin analogs with
quantitative Y variable
@RELATION RadicalScavengingRegression
@DATA
4.82,−0.134,0.067,7.463,2,−1.322
3.53,−0.141,0.071,7.092,2,−1.531
3.793,−0.143,0.072,6.993,2,−1.519
5.91,−0.124,0.062,8.065,2,−1.38
5.819,−0.127,0.064,7.874,0,−1.716
7.447,−0.133,0.067,7.519,2,−1.415
6.721,−0.129,0.065,7.752,2,−1.407
4.584,−0.142,0.071,7.042,2,−1.681
5.27,−0.139,0.07,7.194,2,−1.633
5.122,−0.142,0.071,7.042,2,−2
3.643,−0.146,0.073,6.849,2,−2
3.277,−0.134,0.067,7.463,4,−0.778
3.949,−0.134,0.067,7.463,3,−0.845
6.099,−0.125,0.063,8,4,−0.903
2.758,−0.138,0.069,7.246,3,−0.881
1.995,−0.129,0.065,7.752,4,−0.732
3.507,−0.127,0.064,7.874,4,−0.799
3.901,−0.129,0.065,7.752,5,−0.716
4.116,−0.127,0.064,7.874,6,−0.663
Table 3
Example of ARFF input file of curcumin
analogs with qualitative Y variable
@RELATION
RadicalScavengingClassification
@DATA
4.82,−0.134,0.067,7.463,2,Low
3.53,−0.141,0.071,7.092,2,Low
3.793,−0.143,0.072,6.993,2,Low
5.91,−0.124,0.062,8.065,2,Low
5.819,−0.127,0.064,7.874,0,Low
7.447,−0.133,0.067,7.519,2,Low
6.721,−0.129,0.065,7.752,2,Low
4.584,−0.142,0.071,7.042,2,Low
5.27,−0.139,0.07,7.194,2,Low
5.122,−0.142,0.071,7.042,2,Low
3.643,−0.146,0.073,6.849,2,Low
3.277,−0.134,0.067,7.463,4,High
3.949,−0.134,0.067,7.463,3,High
6.099,−0.125,0.063,8,4,High
2.758,−0.138,0.069,7.246,3,High
1.995,−0.129,0.065,7.752,4,High
3.507,−0.127,0.064,7.874,4,High
3.901,−0.129,0.065,7.752,5,High
4.116,−0.127,0.064,7.874,6,High
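Note that Tables 2 and 3 list only the @RELATION and @DATA sections; a valid ARFF file also declares its attributes between the two. A sketch of what the header for Table 3 could look like (the descriptor names here are placeholders, not those used in the study):

```
@RELATION RadicalScavengingClassification

@ATTRIBUTE descriptor1 NUMERIC
@ATTRIBUTE descriptor2 NUMERIC
@ATTRIBUTE descriptor3 NUMERIC
@ATTRIBUTE descriptor4 NUMERIC
@ATTRIBUTE descriptor5 NUMERIC
@ATTRIBUTE activity {Low,High}

@DATA
4.82,-0.134,0.067,7.463,2,Low
```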
[Figure: a single-layer perceptron (inputs x1–x3 connected directly to output y1) and a multilayer perceptron (inputs x1–x3, hidden nodes h1–h3, output y)]
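The forward pass of such a one-hidden-layer perceptron takes only a few lines; an illustrative Python version with a logistic activation and arbitrary (not fitted) weights:

```python
import math

def logistic(z):
    """Logistic (sigmoid) activation, mapping any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """One hidden layer: h_j = f(w_j . x + b_j), then y = f(w_out . h + b_out)."""
    h = [logistic(sum(w * xi for w, xi in zip(w_row, x)) + b)
         for w_row, b in zip(w_hidden, b_hidden)]
    return logistic(sum(w * hj for w, hj in zip(w_out, h)) + b_out)

x = [0.5, -1.0, 2.0]                 # three inputs (x1..x3)
w_hidden = [[0.2, -0.4, 0.1],        # weights into h1..h3 (arbitrary values)
            [0.7, 0.3, -0.5],
            [-0.1, 0.6, 0.2]]
b_hidden = [0.0, 0.1, -0.2]
w_out = [0.5, -0.3, 0.8]             # weights from h1..h3 to the output y
b_out = 0.05
y = mlp_forward(x, w_hidden, b_hidden, w_out, b_out)
print(round(y, 3))
```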
3.5.2 Support Vector Machine

The support vector machine (SVM) is a popular learning method that has been adopted for solving a plethora of problems via classification and regression analysis. A notable property of the SVM is its
estimation of model parameters by means of the convex
optimization approach that guarantees that a local solution is also
the global optimum [65]. Vapnik first introduced this technique
based on the principles of structure risk minimization of the statis-
tical learning theory [66, 67]. In practice, the SVM method
constructs a maximum-margin hyperplane to separate two classes.
To handle nonlinearly separable data, the SVM method acts in conjunction with a mapping function Φ(x): x ∈ ℝ^M → ℝ^P that transforms the original data set of dimension M into a higher-dimensional feature space of dimension P, where M ≪ P. Subsequently, a simple linear classifier f(x_i) can be used for classifying the P-dimensional samples [68, 69] (shown in Fig. 6).
Fig. 6 Schematic representation of relationship between input and feature spaces using a mapping function
K(x_i, x_j) = Φ(x_i)^T Φ(x_j)   (4)
In practice, given D, SVM is used to construct a linear function
f(xi) representing the correlation between the structure and bio-
logical activities/chemical properties of data xi:
f(x_i) = Σ_{i=1}^{N} w_i Φ(x_i) + b   (5)

where w_i ∈ ℝ are the coefficients, b ∈ ℝ is the bias, and N is the number of training samples.

●● Linear kernel:

Φ(x_i)^T Φ(x_j)   (6)
●● Polynomial kernel:

(1 + Φ(x_i)^T Φ(x_j))^d   (7)
where d = 2, 3, and 4 (it should be noted that d = 1 for a linear
kernel).
●● Radial basis function (RBF) kernel:

exp(−γ ‖x_i − x_j‖²)   (8)
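Thanks to the kernel trick, each of these kernels can be evaluated directly on the input vectors, so Φ never has to be formed explicitly. A minimal Python sketch (γ and d are user-chosen hyperparameters; the vectors are invented):

```python
import math

def linear_kernel(x, z):
    """Eq. 6: plain dot product of the two input vectors."""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, d=2):
    """Eq. 7: (1 + x . z)^d, typically with d = 2, 3, or 4."""
    return (1.0 + linear_kernel(x, z)) ** d

def rbf_kernel(x, z, gamma=0.5):
    """Eq. 8: exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

x, z = [1.0, 2.0], [0.5, -1.0]
print(linear_kernel(x, z), polynomial_kernel(x, z), rbf_kernel(x, z))
```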
Table 4
Summary of computational methods used in QSAR/QSPR modeling

Method  Advantage                                            Disadvantage                                     Built-in feature selector  Single/ensemble
ANN     Performs well on complex data                        Low interpretability                             No                         Single
SVM     Solves nonlinearly separable data                    Low interpretability                             No                         Single
PCA     Summarizes a data set without losing too much        Does not consider relationship between X and Y   Yes                        Single
        variation
PLS     Simple and interpretable model                       Requires cross-terms                             Yes                        Single
DT      Simple and interpretable model                       Requires a number of training data sets          Yes                        Single
RF      High interpretability and low risk of over-fitting   Complex method                                   Yes                        Ensemble

ANN, SVM, PCA, PLS, DT, and RF are acronyms for artificial neural network, support vector machine, principal component analysis, partial least squares regression, decision tree, and random forest, respectively
3.6 Multivariate Analysis

We will now proceed with the step-by-step case study of constructing the QSAR model using AutoWeka. Although we try to provide
as much detail as possible on the practical aspects of using AutoWeka,
some adjustments may need to be made as the reader adapts this
protocol to their own projects. As previously mentioned in
Subheading 3.1, the case study used in this chapter is based on the
set of 22 curcumin analogs reported by Venkateswarlu et al. [46].
Let us start by going to the website of AutoWeka that is
accessible at http://www.mt.mahidol.ac.th/autoweka. Readers
can click on either the Download link on top or the orange
Download button down below (Fig. 7).
Fig. 9 Screenshot of the contents of the zip file of AutoWeka software (left panel) and upon opening the
AutoWeka program (right panel)
Fig. 13 Screenshot of an ongoing (left panel) and completed (right panel) ANN calculation
Fig. 14 Screenshot of the method for accessing the calculation results folder
Fig. 17 Graphical plots from the parameter optimization process for (a) the number
of hidden nodes, (b) the number of learning epochs, and (c) the learning rate and
momentum
4 Notes
Acknowledgments
References
1. Brodin A (1858) On the analogy of arsenic and phosphoric acid with respect to chemical and toxicology. Medico-Surgical Academy, St. Petersburg, Russia
2. Cros A (1863) Action de l'alcool amylique sur l'organisme. University of Strasbourg, Strasbourg
3. Kekulé A (1865) Sur la constitution des substances aromatiques. Bull Soc Chim Fr 3:98
4. Richardson B (1869) Physiological research on alcohols. Med Times Gaz 2:703–706
5. Richet C (1893) On the relationship between the toxicity and the physical properties of substances. Compt Rendus Seances Soc Biol 9:775–776
6. Overton E (1897) Osmotic properties of cells in the bearing on toxicology and pharmacology. Z Phys Chem 22:189–209
7. Meyer H (1899) On the theory of alcohol narcosis. Arch Exp Pathol Pharmacol 42:109–118
8. Moore W (1917) Volatility of organic compounds as an index of the toxicity of their vapors to insects. J Agric Res 10(7):365
9. Hammett LP (1937) The effect of structure upon the reactions of organic compounds. Benzene derivatives. J Am Chem Soc 59(1):96–103
10. Taft RW (1952) Polar and steric substituent constants for aliphatic and o-benzoate groups from rates of esterification and hydrolysis of esters. J Am Chem Soc 74(12):3120–3128
11. Hansch C, Maloney PP, Fujita T et al (1962) Correlation of biological activity of phenoxyacetic acids with Hammett substituent constants and partition coefficients. Nature 194:178–180
12. Hansch C, Muir RM, Fujita T et al (1963) The correlation of biological activity of plant growth regulators and chloromycetin derivatives with Hammett constants and partition coefficients. J Am Chem Soc 85(18):2817–2824
13. Hansch C, Muir RM (1950) The ortho effect in plant growth-regulators. Plant Physiol 25(3):389
14. Hansch C, Fujita T (1964) p-σ-π analysis. A method for the correlation of biological activity and chemical structure. J Am Chem Soc 86(8):1616–1626
15. Free SM Jr, Wilson JW (1964) A mathematical contribution to structure-activity studies. J Med Chem 7:395–399
16. Hansch C (1969) Quantitative approach to biochemical structure-activity relationships. Acc Chem Res 2(8):232–239
44. Martins JPA, Ferreira MMC (2013) QSAR modeling: a new open source computational package to generate and validate QSAR models. Quim Nova 26:554–560
45. Hall M, Frank E, Holmes G et al (2009) The WEKA data mining software: an update. SIGKDD Explorations 11(1)
46. Venkateswarlu S, Ramachandra MS, Subbaraju GV (2005) Synthesis and biological evaluation of polyhydroxycurcuminoids. Bioorg Med Chem 13(23):6374–6380
47. Worachartcheewan A, Nantasenamat C, Isarankura-Na-Ayudhya C et al (2011) Predicting the free radical scavenging activity of curcumin derivatives. Chemometr Intell Lab Syst 109(2):207–216
48. Mandi P, Nantasenamat C, Srungboonmee K et al (2012) QSAR study of anti-prion activity of 2-aminothiazoles. Excli J 11:453–467
49. Nantasenamat C, Isarankura-Na-Ayudhya C, Naenna T et al (2008) Prediction of bond dissociation enthalpy of antioxidant phenols by support vector machine. J Mol Graph Model 27(2):188–196
50. Nantasenamat C, Li H, Mandi P et al (2013) Exploring the chemical space of aromatase inhibitors. Mol Div. doi:10.1007/s11030-013-9462-x
51. Nantasenamat C, Piacham T, Tantimongcolwat T et al (2008) QSAR model of the quorum-quenching N-acyl-homoserine lactone lactonase activity. J Biol Syst 16(2):279–293
52. Pingaew R, Tongraung P, Worachartcheewan A et al (2012) Cytotoxicity and QSAR study of (thio)ureas derived from phenylalkylamines and pyridylalkylamines. Med Chem Res 22:4016–4029
53. Prachayasittikul S, Wongsawatkul O, Worachartcheewan A et al (2010) Elucidating the structure-activity relationships of the vasorelaxation and antioxidation properties of thionicotinic acid derivatives. Molecules 15(1):198–214
54. Thippakorn C, Suksrichavalit T, Nantasenamat C et al (2009) Modeling the LPS neutralization activity of anti-endotoxins. Molecules 14(5):1869–1888
55. Worachartcheewan A, Nantasenamat C, Isarankura-Na-Ayudhya C et al (2013) Predicting antimicrobial activities of benzimidazole derivatives. Med Chem Res 22:5418–5430
56. Worachartcheewan A, Nantasenamat C, Naenna T et al (2009) Modeling the activity of furin inhibitors using artificial neural network. Eur J Med Chem 44(4):1664–1673
57. Nantasenamat C, Li H, Isarankura-Na-Ayudhya C et al (2012) Exploring the physicochemical properties of templates from molecular imprinting literature using interactive text mining approach. Chemometr Intell Lab Syst 116:128–136
58. Nantasenamat C, Isarankura-Na-Ayudhya C, Tansila N et al (2007) Prediction of GFP spectral properties using artificial neural network. J Comput Chem 28(7):1275–1289
59. Nantasenamat C, Naenna T, Isarankura N-AC et al (2005) Quantitative prediction of imprinting factor of molecularly imprinted polymers by artificial neural network. J Comput Aid Mol Des 19(7):509–524
60. Nantasenamat C, Isarankura-Na-Ayudhya C, Naenna T et al (2007) Quantitative structure-imprinting factor relationship of molecularly imprinted polymers. Biosens Bioelectron 22(12):3309–3317
61. Nantasenamat C, Srungboonmee K, Jamsak S et al (2013) Quantitative structure-property relationship study of spectral properties of green fluorescent protein with support vector machine. Chemometr Intell Lab Syst 120:42–52
62. McCulloch WS, Pitts W (1943) A logical calculus of the ideas immanent in nervous activity. Bull Math Biophys 5(4):115–133
63. Lawrence J (1993) Introduction to neural networks: design, theory, and applications, 6th edn. California Scientific Software, California
64. Smith M (1993) Neural networks for statistical modeling. Van Nostrand Reinhold, New York
65. Bishop CM, Nasrabadi NM (2006) Pattern recognition and machine learning, vol 1. Springer, New York
66. Vapnik V (2000) The nature of statistical learning theory. Springer, New York
67. Vapnik V (1998) Statistical learning theory. Wiley, New York
68. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297
69. Cristianini N, Shawe-Taylor J (2000) An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, Cambridge
70. Platt JC (1999) Fast training of support vector machines using sequential minimal optimization. In: Schoelkopf B, Burges C, Smola A (eds) Advances in kernel methods: support vector learning. MIT Press, Cambridge, USA, pp 185–208
71. Smola AJ, Schölkopf B (2004) A tutorial on support vector regression. Stat Comput 14(3):199–222
Chapter 9
Abstract
This chapter focuses on the fingerprint-based artificial neural networks QSAR (FANN-QSAR) approach to
predict biological activities of structurally diverse compounds. Three types of fingerprints, namely ECFP6,
FP2, and MACCS, were used as inputs to train the FANN-QSAR models. The results were benchmarked
against known 2D and 3D QSAR methods, and the derived models were used to predict cannabinoid (CB)
ligand binding activities as a case study. In addition, the FANN-QSAR model was used as a virtual screen-
ing tool to search a large NCI compound database for lead cannabinoid compounds. We discovered several
compounds with good CB2 binding affinities ranging from 6.70 nM to 3.75 μM. The studies proved that
the FANN-QSAR method is a useful approach to predict bioactivities or properties of ligands and to find
novel lead compounds for drug discovery research.
Key words Quantitative structure-activity relationship (QSAR), Fingerprint, Artificial neural networks
(ANN), Biological activity, Cannabinoid
1 Introduction
Hugh Cartwright (ed.), Artificial Neural Networks, Methods in Molecular Biology, vol. 1260,
DOI 10.1007/978-1-4939-2239-0_9, © Springer Science+Business Media New York 2015
149
150 Kyaw Z. Myint and Xiang-Qun Xie
2 Methods
2.1 Data Sets

As shown in Table 1, a total of six data sets were used in this study. Five of them were compiled by Sutherland et al. [21] and were downloaded from their supplemental data. The sixth, the cannabinoid receptor 2 (CB2) data set, was curated by the Xie lab [5, 20]. For this data set, if there was more than one reported CB2 activity for a ligand, the average activity was used.
Once the structure-activity relationship (SAR) data were col-
lected, each data set was then divided into training, validation,
and testing sets. For the ACE, AchE, BZR, COX2, and DHFR
data sets, the same training and testing data sets provided by
Ligand Biological Activity Predictions 151
Table 1
Summary of six data sets used in the FANN-QSAR model development
Table 2
Number of training, validation, and testing set compounds in each data set
Fig. 1 Graphical representation of the fingerprint-based ANN-QSAR (FANN-QSAR) model: fingerprint bits (0/1) feed the network, bias nodes feed each layer, and the output is a predicted pKi or pIC50. (Reprinted with permission from ref. 17. Copyright (2012) American Chemical Society)
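The architecture in Fig. 1 (fingerprint bits in, bias nodes on each layer, a single activity value out) can be sketched as a one-hidden-layer network. The layer sizes and random weights below are arbitrary placeholders, not the trained model from the chapter:

```python
import math
import random

def fann_forward(bits, w_hid, b_hid, w_out, b_out):
    """One-hidden-layer forward pass: fingerprint bits -> hidden layer
    (sigmoidal units plus bias) -> single predicted pKi/pIC50."""
    hidden = [math.tanh(b + sum(w * x for w, x in zip(ws, bits)))
              for ws, b in zip(w_hid, b_hid)]
    return b_out + sum(w * h for w, h in zip(w_out, hidden))

rng = random.Random(0)
n_bits, n_hidden = 166, 8          # MACCS-length key; sizes are illustrative
w_hid = [[rng.uniform(-0.1, 0.1) for _ in range(n_bits)] for _ in range(n_hidden)]
b_hid = [0.0] * n_hidden
w_out = [rng.uniform(-1, 1) for _ in range(n_hidden)]
b_out = 7.0                        # roughly centre predictions on a typical pKi

fingerprint = [rng.randint(0, 1) for _ in range(n_bits)]
pred = fann_forward(fingerprint, w_hid, b_hid, w_out, b_out)
```

In a real FANN-QSAR run the weights would of course be learned by back-propagation against the training-set activities rather than drawn at random.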
Table 3
Comparison of results from different QSAR methods

ECFP6-ANN-QSAR   Optimal No. of Hidden Neurons   Training Set MSE   Validation Set MSE   Average MSE
Round 1          1000                            0.182              0.795                0.488
Round 2          600                             0.257              0.749                0.503

FP2-ANN-QSAR     Optimal No. of Hidden Neurons   Training Set MSE   Validation Set MSE   Average MSE
Round 1          700                             0.300              0.809                0.554

Fig. 2 Cross-validation results of each FANN-QSAR method on CB2 ligand data set. (Reprinted with permission from ref. 17. Copyright (2012) American Chemical Society)
same training and test compounds were used across all three FANN-QSAR models. For example, the same training and test compounds in Round 1 of the ECFP6-ANN-QSAR model were used in Round 1 of the FP2-ANN-QSAR and MACCS-ANN-QSAR models. As shown in the table, the ECFP6-ANN-QSAR model consistently outperformed the FP2- and MACCS-ANN-QSAR models in all five rounds of experiments. The ECFP6-ANN-QSAR model achieved an average r2 test value of 0.56 (r = 0.75) across all repeat experiments, compared with 0.48 (r = 0.69) and 0.45 (r = 0.67) for the FP2- and MACCS-ANN-QSAR models, respectively. These results showed that the ECFP6 fingerprint was better suited than the FP2 and MACCS fingerprints to the cannabinoid data set as well as the other five data sets. In fact, it has also been reported that circular fingerprints such as ECFP6 are more useful in virtual screening and ADMET properties' prediction.
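The r and r2 figures quoted above can be reproduced for any set of predictions with a plain Pearson correlation; a minimal version (the pKi values below are illustrative only):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between experimental and predicted activities."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

experimental = [6.1, 7.4, 8.0, 5.5]   # hypothetical pKi values
predicted = [6.3, 7.1, 7.8, 5.9]
r = pearson_r(experimental, predicted)
r_squared = r * r
```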
Table 4
A summary of the performance of each FANN-QSAR model on CB2 ligand
data set
2.5 Cannabinoid Receptor Binding Activity Prediction on Newly Reported Cannabinoid Ligands
To more rigorously test the predictive ability of the FANN-QSAR method on new cannabinoid compounds that are not in the Xie group's cannabinoid ligand training data set, the most recently reported cannabinoid ligands and associated CB2 binding affinity data were downloaded from the ChEMBL database [35]. These compounds were not found in the training (CBID) data set and were collected to be used as a new test set to evaluate FANN-QSAR performance. The new test data set consisted of 295 compounds with reported CB2 Ki values which were then
Fig. 3 Scatter plot between experimental pKi and predicted pKi values of 278 test cannabinoid ligands after the removal of 17 outliers (rtest = 0.75). (Reprinted with permission from ref. 17. Copyright (2012) American Chemical Society)
2.6 Virtual Screening of the NCI Compound Database for Lead Cannabinoid Ligands
In order to illustrate how the FANN-QSAR model could be used in drug discovery research, we applied it as a virtual screening tool to search for CB2 lead ligands from the NCI compound database [36]. For consistency, the same 20 trained models in the
Table 5
Identified NCI hit compounds with CB2 binding activities

49888     330.46    5.59    8.17b    8.28
174122    463.52    4.76    5.59c    8.41
369049    488.66    4.00    5.51c    8.48
76301     354.44    3.99    5.42c    8.21

a An average literature-reported Ki value of a known cannabinoid compound (HU210) which is more than 90 % similar to 746843
b An average Ki value of two independent experiments performed in duplicate
c An experimental Ki value of one experiment performed in duplicate
Fig. 4 CB2 receptor binding affinity Ki values of four NCI hit compounds measured by [3H]CP-55940 radioligand competition binding assay using human CB2 receptors harvested from transfected CHO-CB2 cells. (Reprinted with permission from ref. 17. Copyright (2012) American Chemical Society)
3 Notes
to each well of the dried filter plates. Then the bound radioactivity
is counted using a Perkin Elmer 96-well TopCounter. The Ki is
calculated by using nonlinear regression analysis (Prism 5;
GraphPad Software Inc., La Jolla, CA), with the Kd values for
[3H]CP-55940 determined from saturation binding experiments.
This assay is used for determining binding affinity parameters (Ki)
of ligand-receptor interactions between the CB2 receptor and
ligands.
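The chapter's workflow fits Ki by nonlinear regression in Prism using the measured Kd. The conversion commonly applied in such competition binding assays is the Cheng-Prusoff relation, sketched here as an illustration (the concentrations below are hypothetical, and this is not the exact Prism workflow):

```python
def cheng_prusoff_ki(ic50_nM, radioligand_conc_nM, kd_nM):
    """Ki = IC50 / (1 + [L]/Kd), where [L] is the radioligand
    concentration and Kd its affinity from saturation binding."""
    return ic50_nM / (1.0 + radioligand_conc_nM / kd_nM)

# Illustrative values: IC50 = 10 nM, radioligand at 1 nM with Kd = 1 nM
ki = cheng_prusoff_ki(10.0, 1.0, 1.0)   # -> 5.0 nM
```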
References
1. Myint KZ, Xie X-Q (2010) Recent advances in fragment-based QSAR and multi-dimensional QSAR methods. Int J Mol Sci 11(10):3846–3866
2. Perkins R, Fang H, Tong W et al (2003) Quantitative structure-activity relationship methods: perspectives on drug discovery and toxicology. Environ Toxicol Chem 22(8):1666–1679
3. Salum L, Andricopulo A (2009) Fragment-based QSAR: perspectives in drug design. Mol Divers 13(3):277
4. Chen JZ, Wang J, Xie XQ (2007) GPCR structure-based virtual screening approach for CB2 antagonist search. J Chem Inf Model 47(4):1626–1637. doi:10.1021/ci7000814
5. Wang L, Ma C, Wipf P et al (2013) TargetHunter: an in silico target identification tool for predicting therapeutic potential of small organic molecules based on chemogenomic database. AAPS J 15:395–406. PMID: 23292636
6. Tandon M, Wang L, Xu Q et al (2012) A targeted library screen reveals a new inhibitor scaffold for protein kinase D. PLoS One 7(9):e44653. doi:10.1371/journal.pone.0044653. PMID: 23028574
7. Ma C, Wang L, Yang P et al (2013) LiCABEDS II. Modeling of ligand selectivity for G-protein coupled cannabinoid receptors. J Chem Inf Model 53(1):11–26. doi:10.1021/ci3003914
8. Wang L, Ma C, Wipf P et al (2012) Linear and nonlinear support vector machine for the classification of human 5-HT1A ligand functionality. Mol Inf 31(1):85–95. doi:10.1002/minf.201100126
9. Myint K, Xie X-Q (2011) Fragment-based QSAR algorithm development for compound bioactivity prediction. SAR QSAR Environ Res 22(3):385–410
10. Chen JZ, Myint KZ, Xie X-Q (2011) New QSAR prediction models derived from GPCR CB2-antagonistic triaryl bis-sulfone analogues by a combined molecular morphological and pharmacophoric approach. SAR QSAR Environ Res 22(5–6):525–544. doi:10.1080/1062936x.2011.569948
11. Chen J-Z, Han X-W, Liu Q et al (2006) 3D-QSAR studies of arylpyrazole antagonists of cannabinoid receptor subtypes CB1 and CB2. A combined NMR and CoMFA approach. J Med Chem 49(2):625–636
12. Vilar S, Santana L, Uriarte E (2006) Probabilistic neural network model for the in silico evaluation of anti-HIV activity and mechanism of action. J Med Chem 49(3):1118–1124. doi:10.1021/jm050932j
13. González-Díaz H, Bonet I, Terán C et al (2007) ANN-QSAR model for selection of anticancer leads from structurally heterogeneous series of compounds. Eur J Med Chem 42(5):580–585. doi:10.1016/j.ejmech.2006.11.016
14. Patra JC, Chua BH (2011) Artificial neural network-based drug design for diabetes mellitus using flavonoids. J Comput Chem 32(4):555–567. doi:10.1002/jcc.21641
15. Dimitrov I, Naneva L, Bangov I et al (2014) Allergenicity prediction by artificial neural networks. J Chemometrics. doi:10.1002/cem.2597
16. Vanyúr R, Héberger K, Kövesdi I et al (2002) Prediction of tumoricidal activity and accumulation of photosensitizers in photodynamic therapy using multiple linear regression and artificial neural networks. Photochem Photobiol 75(5):471–478
17. Myint K-Z, Wang L, Tong Q et al (2012) Molecular fingerprint-based artificial neural networks QSAR for ligand biological activity predictions. Mol Pharm 9(10):2912–2923. doi:10.1021/mp300237z
18. Molnár L, Keserű GM (2002) A neural network based virtual screening of cytochrome P450 3A4 inhibitors. Bioorg Med Chem Lett 12(3):419–421. doi:10.1016/s0960-894x(01)00771-5
19. Muresan S, Sadowski J (2005) "In-House Likeness": comparison of large compound collections using artificial neural networks. J Chem Inf Model 45(4):888–893. doi:10.1021/ci049702o
20. Wang L, Xie XQ (2012) Cannabinoid Ligand Database. Accessed Nov 2011
21. Sutherland JJ, O'Brien LA, Weaver DF (2004) A comparison of methods for modeling quantitative structure–activity relationships. J Med Chem 47(22):5541–5554. doi:10.1021/jm0497141
22. Greenidge PA, Carlsson B, Bladh L-G et al (1998) Pharmacophores incorporating numerous excluded volumes defined by x-ray crystallographic structure in three-dimensional database searching: application to the thyroid hormone receptor. J Med Chem 41(14):2503–2512
23. Mathworks (2007) MATLAB 7.5.0.342 (R2007b) edn, Natick, MA
24. O'Boyle N, Banck M, James C et al (2011) Open Babel: an open chemical toolbox. J Cheminform 3:33
25. Durant JL, Leland BA, Henry DR et al (2002) Reoptimization of MDL keys for use in drug discovery. J Chem Inf Comput Sci 42(6):1273–1280. doi:10.1021/ci010132r
26. Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50(5):742–754. doi:10.1021/ci100050t
27. Cramer R, Patterson D, Bunce J (1988) Comparative molecular field analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier proteins. J Am Chem Soc 110:5959–5967
28. Klebe G (1998) Comparative molecular similarity indices analysis: CoMSIA. 3D QSAR in Drug Design. Three-Dimensional Quantitative Structure Activity Relationships 3:87–104
29. Lowis D (1997) HQSAR: a new, highly predictive QSAR technique. Tripos Technical Notes 1(5):17
30. Jain AN (2004) Ligand-based structural hypotheses for virtual screening. J Med Chem 47(4):947–961
31. Ferguson AM, Heritage T, Jonathon P et al (1997) EVA: a new theoretically based molecular descriptor for use in QSAR/QSPR analysis. J Comput Aided Mol Des 11(2):143–152. doi:10.1023/a:1008026308790
32. Agrafiotis DK, Cedeño W, Lobanov VS (2002) On the use of neural network ensembles in QSAR and QSPR. J Chem Inf Comput Sci 42(4):903–911. doi:10.1021/ci0203702
33. Bender A, Jenkins JL, Scheiber J et al (2009) How similar are similarity searching methods? A principal component analysis of molecular descriptor space. J Chem Inf Model 49(1):108–119. doi:10.1021/ci800249s
34. Glem R, Bender A, Arnby C et al (2006) Circular fingerprints: flexible molecular descriptors with applications from physical chemistry to ADME. IDrugs 9(3):199–204
35. Bellis LJ, Akhtar R, Al-Lazikani B et al (2011) Collation and data-mining of literature bioactivity data for drug discovery. Biochem Soc Trans 39(5):1365–1370. doi:10.1042/BST0391365
36. Collins J, Crowell J (2011) Drug Synthesis and Chemistry Branch, Developmental Therapeutics Program (DTP), Division of Cancer Treatment and Diagnosis, National Cancer Institute. http://dtp.nci.nih.gov/
37. Tripos (2012) SYBYL-X 1.2. 1699 South Hanley Rd., St. Louis, Missouri, 63144, USA
38. Xie XQ, Chen JZ (2008) Data mining a small molecule drug screening representative subset from NIH PubChem. J Chem Inf Model 48(3):465–475. doi:10.1021/ci700193u
39. Huffman JW, Yu S, Showalter V et al (1996) Synthesis and pharmacology of a very potent cannabinoid lacking a phenolic hydroxyl with high affinity for the CB2 receptor. J Med Chem 39(20):3875–3877. doi:10.1021/jm960394y
40. Morgan HL (1965) The generation of a unique machine description for chemical structures—a technique developed at Chemical Abstracts Service. J Chem Doc 5(2):107–113. doi:10.1021/c160017a018
41. Gertsch J, Leonti M, Raduner S et al (2008) Beta-caryophyllene is a dietary cannabinoid. Proc Natl Acad Sci 105(26):9099–9104. doi:10.1073/pnas.0803601105
42. Raduner S, Majewska A, Chen J-Z et al (2006) Alkylamides from Echinacea are a new class of cannabinomimetics: cannabinoid type 2 receptor-dependent and -independent immunomodulatory effects. J Biol Chem 281(20):14192–14206
43. Zhang Y, Xie Z, Wang L et al (2011) Mutagenesis and computer modeling studies of a GPCR conserved residue W5.43(194) in ligand recognition and signal transduction for CB2 receptor. Int Immunopharmacol 11(9):1303–1310. doi:10.1016/j.intimp.2011.04.013
44. Yang P, Wang L, Feng R et al (2013) Novel triaryl sulfonamide derivatives as selective cannabinoid receptor 2 inverse agonists and osteoclast inhibitors: discovery, optimization, and biological evaluation. J Med Chem 56(5):2045–2058. doi:10.1021/jm3017464
45. DePriest SA, Mayer D, Naylor CB et al (1993) 3D-QSAR of angiotensin-converting enzyme and thermolysin inhibitors: a comparison of CoMFA models based on deduced and experimentally determined active site geometries.
Chapter 10
Abstract
We present a GEneral Neural Network (GENN) for learning trends from existing data and making predictions of unknown information. The main novelty of GENN is its generality, simplicity of use, and its specific handling of windowed input/output. Its main strength is its efficient handling of the input data, enabling learning from large datasets. GENN is built on a two-layered neural network and can use either separate input–output pairs or window-based data, with data structures that efficiently represent input–output pairs. The program was tested on predicting the accessible surface area of globular proteins, scoring proteins according to similarity to native, and predicting protein disorder, and has performed remarkably well. In this paper we describe the program and its use. Specifically, we give as an example the construction of a similarity-to-native protein scoring function that was built using GENN. The source code and Linux executables for GENN are available from Research and Information Systems at http://mamiris.com and from the Battelle Center for Mathematical Medicine at http://mathmed.org. Bugs and problems with the GENN program should be reported to EF.
Key words Neural network, Protein scoring, Windowed input, Automatic learning, GENN, Protein
structure prediction
1 Introduction
Hugh Cartwright (ed.), Artificial Neural Networks, Methods in Molecular Biology, vol. 1260,
DOI 10.1007/978-1-4939-2239-0_10, © Springer Science+Business Media New York 2015
166 Eshel Faraggi and Andrzej Kloczkowski
Fig. 1 Schematic diagram describing the architecture of GENN. The feature files include information used to establish a prediction. The training files contain information to be predicted. The error between the neural network predicted value and the training values is used to train its weights. H1 and H2 refer to the first and second hidden layers, respectively
2.1 Instance-Based Prediction
We use the term instance-based prediction to describe cases in which each example to be learned from is independent of every other example in the database. This is in contrast to window-based prediction, to be discussed next, where examples covered in overlapping windows share common features. All programs related to instance-based prediction have a base name of "genn2inst."
An example of instance-based prediction is the prediction of global protein features. When one wants to predict quantities for entire protein sequences, for example the radius of gyration, one is dealing with instance-based prediction. For each protein, various input features are gathered, but in general different proteins will have unique input features that are not shared by other proteins. Another example we have implemented using GENN, which will be described below, is predicting the fitness of model protein structures to the native state. Such a scheme was implemented successfully using GENN for the CASP10 experiment [27].
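The contrast with window-based prediction can be made concrete with a sketch of how windowed examples are built from a per-residue sequence; the window size and padding value here are arbitrary choices, not GENN's internal format:

```python
def window_examples(features, targets, half_window, pad=0.0):
    """Build window-based input-output pairs: each position's input is the
    features of a window centred on it (padded at the termini), so
    neighbouring examples share features. Instance-based prediction would
    instead emit one independent (features, target) pair per instance."""
    n = len(features)
    examples = []
    for i in range(n):
        window = [features[j] if 0 <= j < n else pad
                  for j in range(i - half_window, i + half_window + 1)]
        examples.append((window, targets[i]))
    return examples

# Three residues with one scalar feature each, window of width 3
pairs = window_examples([0.2, 0.7, 0.4], [1, 0, 1], half_window=1)
```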
Fig. 2 Example of the input file for GENN. The first line gives the head directory where all the instance directories are. This should be enclosed in double quotes to preserve the slashes. The next line lists the inputs used, separated by spaces. The third line lists the outputs to be predicted, represented by file names separated by spaces. The remaining lines list the instances on which the training will be performed
The specifics of using GENN for training and predicting from data
are given below.
4.1 Training a Model
The program is run from the command line. The options associated with training are given in Table 1. By default the program looks for a file called "genn.in" from which to read the training dataset structure (the initialization file). If this file does not exist in the working directory, the command line must include a "-l" option followed by the name of the initialization file.
Given additional data, a model can be retrained by using the existing model as the initial condition for a new round of training. This is achieved with the "-wf" option followed by the name of the file containing the weights and parameters of the trained model. In this case the weights and other model parameters are used as initial values, no randomization is carried out, and training then proceeds on the new instances.
4.2 Predictions from Trained Models
Once a model has been trained, all its parameters are stored in a file. This file can then be used to produce predictions for new instances. In addition, a collection of models can be used to produce an average prediction with an accompanying standard deviation. This standard deviation can be a useful measure of prediction quality, although this should be tested for each specific application. The basic options associated with prediction from a trained model are given in Table 1. Note that some options are common to training and prediction; however, model parameters such as the activation parameter, and others related to the model architecture, are stored in and read from the model file, and should be reissued only in the uncommon event that one wishes to override their values.
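The ensemble averaging described above might look like this in outline; the "trained models" here are simple stand-in functions, not GENN's model-file format:

```python
import math

def ensemble_predict(models, x):
    """Average the predictions of several trained models; the standard
    deviation across models serves as a rough confidence estimate."""
    preds = [m(x) for m in models]
    mean = sum(preds) / len(preds)
    var = sum((p - mean) ** 2 for p in preds) / len(preds)
    return mean, math.sqrt(var)

# Stand-in "models": each is just a function of the input
models = [lambda x: 0.9 * x, lambda x: 1.1 * x, lambda x: 1.0 * x]
mean, std = ensemble_predict(models, 10.0)
```

A small standard deviation indicates that the models agree; whether that agreement tracks actual accuracy must, as the text notes, be tested case by case.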
5 Special Options
5.1 Filter Network
Some cases of window-based prediction can benefit from a filter, where predictions over a window are used as inputs for a follow-up prediction. Examples include protein secondary structure, where a filter prediction layer is often used [21, 25, 29]. While a filter layer can always be run using a second-stage neural network, for the
Table 1
GENN options
5.3 Global Inputs and Outputs
For certain window-based predictions global variables may be important. We shall consider the case of protein secondary structure to illustrate the point. In a window-based approach the secondary structure of a given residue is considered in the context of a window around it. However, certain parameters, such as the amino acid composition of the protein, may carry information about general tendencies to form a specific protein class (all-alpha, all-beta, alpha/beta, or alpha + beta). This information would typically not be contained in a limited window view and hence can help the prediction. Indeed, global input features can contribute significantly to prediction accuracy. To use global inputs/outputs, one should store their values in files called genn.gin/genn.gut, respectively, in the directory corresponding to that instance. An option, "-gi" for global input and/or "-go" for global output, is required on the command line to include these global files. These options should be followed by the number of values to use from the global files, enabling control of how much global information to use. Note that options for global files are required only for training: the input structure is recorded in the weight files and is retrieved from them at prediction time.
5.4 Degeneracy Training
In some cases of window-based prediction one may wish to sample certain cases more often than others. For example, for disordered proteins one may wish to sample more of the disordered residues in a given protein instance [31]. In instance-based prediction, instances and their frequency can be controlled at will by repeating them in the file "genn.in." For window-based prediction, to change the sampling frequency within an instance one includes a file with an integer count of the number of times (including zero) to train on a given training output. The degeneracy file order should therefore match the rest of the input/output and should include the corresponding indexes. Degeneracy files should be stored in the instance directories under files named "genn.dgn," and an option should be given for their use.
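The effect of a degeneracy file can be sketched as a simple expansion of the training outputs by their integer counts; the layout here is schematic, not the exact genn.dgn format:

```python
def expand_by_degeneracy(outputs, counts):
    """Repeat each training output `count` times: a count of zero drops
    the output, and counts > 1 oversample it (e.g., to train more often
    on disordered residues)."""
    expanded = []
    for out, count in zip(outputs, counts):
        expanded.extend([out] * count)
    return expanded

# Oversample the first output, skip the second, keep the third once
sampled = expand_by_degeneracy(["res1", "res2", "res3"], [3, 0, 1])
```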
GEneral Neural Network 173
5.5 Types of Output
GENN is able to predict multiple outputs. This in turn enables specific functionality beyond general multi-output prediction. The first option is to treat the multiple outputs as components of a multi-state probability vector and to train, optimize, and report accordingly. While training is not affected by this option, optimization is carried out on the ability of the trained network to correctly distinguish the most dominant component of the probability vector; reporting follows the same protocol. This behavior was found useful when an ancestor of GENN was used to predict protein secondary structure [25].
Another possibility is to predict several instances of the same output and generate an average prediction from the same trained network. This has proven beneficial in several instances of continuous-value prediction [29]. It provides an extra layer of averaging, with more control over variation between prediction models, and hence the ability to average out specific random errors in certain circumstances. It also yields an estimate of prediction stability in the form of the standard deviation over predictions, which may be useful in assessing prediction stability and accuracy.
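For the multi-state option, optimization scores how often the network's dominant output component matches the true state; a minimal version of that criterion (the three-state vectors below are hypothetical):

```python
def dominant_state(probs):
    """Index of the largest component of a multi-state output vector."""
    return max(range(len(probs)), key=probs.__getitem__)

def dominant_state_accuracy(prob_vectors, true_states):
    """Fraction of examples whose dominant predicted state is correct."""
    hits = sum(dominant_state(p) == t
               for p, t in zip(prob_vectors, true_states))
    return hits / len(true_states)

# Hypothetical 3-state (e.g., helix/sheet/coil) predictions
acc = dominant_state_accuracy([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]], [0, 1])
```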
6 Automated Learning
7 Examples
8 Summary
Acknowledgements
References
1. Kassin SM (1979) Consensus information, prediction, and causal attribution: a review of the literature and issues. J Pers Soc Psychol 37:1966
2. Crick NR, Dodge KA (1994) A review and reformulation of social information-processing mechanisms in children's social adjustment. Psychol Bull 115:74
3. Fielding AH, Bell JF (1997) A review of methods for the assessment of prediction errors in conservation presence/absence models. Environ Conserv 24:38–49
4. Makhoul J (1975) Linear prediction: a tutorial review. Proc IEEE 63:561–580
5. Fontenot RJ, Wilson EJ (1997) Relational exchange: a review of selected models for a prediction matrix of relationship activities. J Bus Res 39:5–12
6. Rost B et al (2001) Review: protein secondary structure prediction continues to rise. J Struct Biol 134:204–218
7. Maier HR, Dandy GC (2000) Neural networks for the prediction and forecasting of water resources variables: a review of modelling issues and applications. Environ Model Softw 15:101–124
8. Chang JC, Wooten EC, Tsimelzon A, Hilsenbeck SG, Gutierrez MC, Elledge R, Mohsin S, Osborne CK, Chamness GC, Allred DC et al (2003) Gene expression profiling for the prediction of therapeutic response to docetaxel in patients with breast cancer. Lancet 362:362–369
9. Schofield W et al (1985) Predicting basal metabolic rate, new standards and review of previous work. Hum Nutr Clin Nutr 39:5
10. Blundell T, Sibanda B, Sternberg M, Thornton J (1987) Knowledge-based prediction of protein structures. Nature 326:26
11. Chou PY, Fasman GD (1978) Empirical predictions of protein conformation. Annu Rev Biochem 47:251–276
12. Floudas C, Fung H, McAllister S, Mönnigmann M, Rajgaria R (2006) Advances in protein structure prediction and de novo protein design: a review. Chem Eng Sci 61:966–988
13. Moult J (2005) A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Curr Opin Struct Biol 15:285–289
14. Vazquez A, Flammini A, Maritan A, Vespignani A (2003) Global protein function prediction from protein-protein interaction networks. Nat Biotechnol 21:697–700
15. Borgwardt KM, Ong CS, Schönauer S, Vishwanathan S, Smola AJ, Kriegel H-P (2005) Protein function prediction via graph kernels. Bioinformatics 21:i47–i56
16. Chothia C (1974) Hydrophobic bonding and accessible surface area in proteins. Nature 248:338–339
17. Moret M, Zebende G (2007) Amino acid hydrophobicity and accessible surface area. Phys Rev E 75:011920
18. Dor O, Zhou Y (2007) Real-spine: an integrated system of neural networks for real-value prediction of protein structural properties. Proteins: Struct Funct Bioinf 68:76–81
19. Durham E, Dorr B, Woetzel N, Staritzbichler R, Meiler J (2009) Solvent accessible surface area approximations for rapid and accurate protein structure prediction. J Mol Model 15:1093–1108
20. Zhang H, Zhang T, Chen K, Shen S, Ruan J, Kurgan L (2009) On the relation between residue flexibility and local solvent accessibility in proteins. Proteins: Struct Funct Bioinf 76:617–636
21. Faraggi E, Xue B, Zhou Y (2009) Improving the prediction accuracy of residue solvent accessibility and real-value backbone torsion angles of proteins by guided-learning through a two-layer neural network. Proteins: Struct Funct Bioinf 74:847–856
22. Zhang T, Zhang H, Chen K, Ruan J, Shen S, Kurgan L (2010) Analysis and prediction of RNA-binding residues using sequence, evolutionary conservation, and predicted secondary structure and solvent accessibility. Curr Protein Pept Sci 11:609–628
23. Gao J, Zhang T, Zhang H, Shen S, Ruan J, Kurgan L (2010) Accurate prediction of protein folding rates from sequence and sequence-derived residue flexibility and solvent accessibility. Proteins: Struct Funct Bioinf 78:2114–2130
24. Nunez S, Venhorst J, Kruse CG (2010) Assessment of a novel scoring method based on solvent accessible surface area descriptors. J Chem Inf Model 50:480–486
25. Faraggi E, Zhang T, Yang Y, Kurgan L, Zhou Y (2012) Spine x: improving protein secondary structure prediction by multistep learning coupled with prediction of solvent accessible surface area and backbone torsion angles. J Comput Chem 33:259–267
26. Wang C, Xi L, Li S, Liu H, Yao X (2012) A sequence-based computational model for the prediction of the solvent accessible surface area for α-helix and β-barrel transmembrane residues. J Comput Chem 33:11–17
27. Faraggi E, Kloczkowski A (2013) A global machine learning based scoring function for protein structure prediction. Proteins: Struct Funct Bioinf. doi:10.1002/prot.24454
28. Xue B, Dor O, Faraggi E, Zhou Y (2008) Real value prediction of backbone torsion angles. Proteins: Struct Funct Bioinf 72:427–433
29. Faraggi E, Yang Y, Zhang S, Zhou Y (2009) Predicting continuous local structure and the effect of its substitution for secondary structure in fragment-free protein structure prediction. Structure 17:1515–1527
30. Zhang T, Faraggi E, Zhou Y (2010) Fluctuations of backbone torsion angles obtained from nmr-determined structures and their prediction. Proteins: Struct Funct Bioinf 78:3353–3362
31. Zhang T, Faraggi E, Xue B, Dunker AK, Uversky VN, Zhou Y (2012) Spine-d: accurate prediction of short and long disordered regions by a single neural-network based method. J Biomol Struct Dyn 29:799–813
32. Moult J, Fidelis K, Kryshtafovych A, Tramontano A (2011) Critical assessment of methods of protein structure prediction (casp) round ix. Proteins: Struct Funct Bioinf 79:1–5
33. Faraggi E, Yaoqi Z, Kloczkowski A (2014) Accurate single-sequence prediction of solvent accessible surface area using local and global features. Proteins: Struct Funct Bioinf. doi:10.1002/prot.24682
34. CASP10 (2012) Official group performance ranking. http://www.predictioncenter.org/casp10/groups_analysis.cgi. Accessed 10 June 2012
35. Feng Y, Kloczkowski A, Jernigan R (2007) Four-body contact potentials derived from two protein datasets to discriminate native structures from decoys. Proteins: Struct Funct Bioinf 68:57–66
36. Feng Y, Kloczkowski A, Jernigan RL (2010) Potentials 'r'us web-server for protein energy estimations with coarse-grained knowledge-based potentials. BMC Bioinf 11:92
37. Gniewek P, Leelananda SP, Kolinski A, Jernigan RL, Kloczkowski A (2011) Multibody coarse-grained potentials for native structure recognition and quality assessment of protein models. Proteins: Struct Funct Bioinf 79:1923–1929
38. Zhou H, Zhou Y (2002) Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Sci 11:2714–2726
39. Yang Y, Zhou Y (2008) Ab initio folding of terminal segments with secondary structures reveals the fine difference between two closely related all-atom statistical energy functions. Protein Sci 17:1212–1219
40. Zhang J, Zhang Y (2010) A novel side-chain orientation dependent potential derived from random-walk reference state for protein fold selection and structure prediction. PLoS One 5:e15386
41. Zhang Y, Skolnick J (2004) Scoring function for automated assessment of protein structure template quality. Proteins: Struct Funct Bioinf 57:702–710
42. Xu J, Zhang Y (2010) How significant is a protein structure similarity with tm-score = 0.5? Bioinformatics 26:889–895
Chapter 11
Abstract
This chapter describes the implementation of a neural network-based predictive control system for driving a prosthetic hand. Nonlinearities associated with the electromechanical aspects of prosthetic devices present great challenges for precise control of this type of device. Model-based controllers may overcome this issue. Moreover, given the complexity of these kinds of electromechanical systems, neural network-based modeling is a good fit for modeling the fingers' dynamics. The results of simulations mimicking potential situations encountered during activities of daily living demonstrate the feasibility of this technique.
1 Introduction
Hugh Cartwright (ed.), Artificial Neural Networks, Methods in Molecular Biology, vol. 1260,
DOI 10.1007/978-1-4939-2239-0_11, © Springer Science+Business Media New York 2015
180 Cristian F. Pasluosta and Alan W.L. Chiu
control structure per se, the modulation of the grip force using artificial sensory information can be considered within this line of research. In their approach, force-derivative feedback was introduced into the force control loop to improve the sensitivity and to reduce the force overshoot at the beginning of grasping [20]. A hybrid force-velocity-position control was also implemented to address the overshoot issue [21]. Further, using the derivative of the shear force, an adaptive slip prevention system was implemented to improve the force control [19]. Another slip prevention algorithm that also minimizes object deformation was presented by Engeberg and Meek [27]. Andrecioli and Engeberg utilized an adaptive controller based on the detection of object stiffness [28]. Wettels et al. implemented a control system that adjusts the grip force according to a slippage detection system that uses a fluid-based tactile sensor to compute the normal and tangential components of the grip force [17]. Peerdeman et al. utilized the University of Bologna hand IV to test low-level control based on the intrinsically passive controller technique [29].
Consider now the low-cost mechatronic design of a prosthetic
hand shown in Fig. 2 [35]. This is a five-fingered hand with an
opposable thumb, with each finger containing two phalanges and
their tips slightly bent. The thumb, index and middle fingers move
independently (active fingers), while the ring and little fingers
(passive fingers) are mechanically coupled to the middle finger.
The fingers are flexed by DC gear motors by means of a tendon-cable
Fig. 2 Mechatronic design of a low-cost prosthetic hand, modified from ref. 34, with permission
Fig. 3 Neural network-based control scheme, with flex sensor, force sensor, neural network model, optimization controller, motor, and hand (from ref. 35, with permission)
[Figure: feedforward neural network model with inputs y(t−1), y(t−2), u(t−1), u(t−2), q(t−1), q(t−2) and output y(t)]
3 Optimization
U(t) = \left[\, u(t), \ldots, u(t + N_u - 1) \,\right]^{T} \quad (6)
In Eqs. 5 and 6, the parameters N2, N1 and Nu are the predic-
tion horizon, minimum prediction horizon, and control horizon,
respectively. The constant ρ restricts the changes in the control
input [36]. The input u(t) is supplied to the motor at each sample
point and it is bounded for convergence purposes of the optimiza-
tion algorithm.
The algorithm presented in refs. 36, 45 can be used to mini-
mize the objective function (in other words, to find the optimum
u(t)). This algorithm utilizes a gradient descent technique where
the future set of control inputs is determined by the update rule
indicated in Eq. 7:
U(t+1) = U(t) - \eta(t)\,\frac{\partial J}{\partial U(t)} \quad (7)

\eta(t) = \eta_0\, e^{\alpha\,(r(t) - y(t))} \quad (8)
The constant α is determined empirically. Expressing Eq. 5 in
matrix notation:
J\left(t, U(t)\right) = E(t)^{T} E(t) + \rho\,\Delta U(t)^{T} \Delta U(t) \quad (9)
where:
E(t) = \left[\, e(t+1), \ldots, e(t+N_2) \,\right]^{T} \quad (10)

e(t+i) = r(t+i) - \hat{y}(t+i), \quad i = 1, \ldots, N_2 \quad (11)

\Delta U(t) = \left[\, \Delta u(t), \ldots, \Delta u(t+N_u-1) \,\right]^{T} \quad (12)
We have that

\frac{\partial J}{\partial U(t)} = 2 E(t)^{T}\,\frac{\partial Y(t)}{\partial U(t)} + 2\rho\,\Delta U(t)^{T}\,\frac{\partial \Delta U(t)}{\partial U(t)} \quad (13)

where

Y(t) = \left[\, y(t+1), \ldots, y(t+N_2) \,\right]^{T} \quad (14)
\frac{\partial \Delta U(t)}{\partial U(t)} =
\begin{bmatrix}
1 & 0 & \cdots & 0 \\
-1 & 1 & \cdots & 0 \\
\vdots & \ddots & \ddots & \vdots \\
0 & \cdots & -1 & 1
\end{bmatrix}, \quad \text{an } N_u \times N_u \text{ matrix} \quad (15)
\frac{\partial Y(t)}{\partial U(t)} =
\begin{bmatrix}
\dfrac{\partial y(t+1)}{\partial u(t)} & 0 & \cdots & 0 \\
\dfrac{\partial y(t+2)}{\partial u(t)} & \dfrac{\partial y(t+2)}{\partial u(t+1)} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial y(t+N_2)}{\partial u(t)} & \dfrac{\partial y(t+N_2)}{\partial u(t+1)} & \cdots & \dfrac{\partial y(t+N_2)}{\partial u(t+N_u-1)}
\end{bmatrix} \quad (16)
A recursive algorithm is presented by Noriega and Wang [45]
to calculate Eq. 16:
\frac{\partial y(t+n)}{\partial u(t+m-1)} = 1 + \frac{\partial y(t+n-1)}{\partial u(t+m-1)}\,\frac{\partial \hat{f}(p)}{\partial \hat{y}(t+n-1)} \quad (17)

\frac{\partial y(t+n-1)}{\partial u(t+m-1)} = lw\,\mathrm{sech}^2(iw\,p + b_1)\,iw\,\frac{dp}{du} \quad (18)

\frac{\partial \hat{f}(p)}{\partial \hat{y}(t+n-1)} = lw\,\mathrm{sech}^2(iw\,p + b_1)\,iw\,\frac{dp}{dy} \quad (19)

Here, n = 1, \ldots, N_2, m = 1, \ldots, N_u, and

\frac{dp}{du} = [0, 0, 0, \ldots, 1, 0, \ldots, 0]^{T} \quad (20)

\frac{dp}{dy} = [1, 0, \ldots, 0, 0, \ldots, 0]^{T} \quad (21)
dy
The motor power consumption and undesirable oscillations
are reduced by turning off the motor when the measured force
falls within a tolerance range (which we set at ±5 % of the refer-
ence value).
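Putting Eqs. 7, 8, 13, and 15 together, a single optimization step can be sketched in Python/NumPy. This is a minimal illustration, not the authors' implementation: the Jacobian ∂Y(t)/∂U(t) is assumed to have been computed beforehand through the trained network (Eqs. 16–19), and the function name and default parameters are hypothetical.

```python
import numpy as np

def nmpc_step(U, E, dY_dU, u_prev, r_t, y_t, rho=0.1, eta0=0.01, alpha=1.0):
    """One gradient-descent update of the control sequence (Eqs. 7, 8, 13).

    U      -- current control inputs u(t)..u(t+Nu-1), shape (Nu,)
    E      -- prediction errors e(t+1)..e(t+N2), shape (N2,)  (Eq. 10)
    dY_dU  -- Jacobian of predicted outputs w.r.t. inputs, shape (N2, Nu) (Eq. 16)
    u_prev -- control input applied at the previous sample, u(t-1)
    """
    Nu = U.shape[0]
    # Eq. 15: dDeltaU/dU is an Nu x Nu lower-bidiagonal matrix
    # (1 on the diagonal, -1 on the first subdiagonal)
    dDU_dU = np.eye(Nu) - np.eye(Nu, k=-1)
    # Control increments Delta u(t)..Delta u(t+Nu-1) (Eq. 12)
    DU = np.diff(np.concatenate(([u_prev], U)))
    # Eq. 13: gradient of the objective J with respect to U, as printed
    grad = 2.0 * E @ dY_dU + 2.0 * rho * DU @ dDU_dU
    # Eq. 8: learning rate adapted to the current tracking error r(t) - y(t)
    eta = eta0 * np.exp(alpha * (r_t - y_t))
    # Eq. 7: gradient-descent update of the whole control sequence
    return U - eta * grad
```

Iterating such a step until the objective stops decreasing yields the control sequence U(t), whose first element u(t) is the bounded input supplied to the motor at each sample point.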
Table 1 Objects used to evaluate the step response of the neural network-based NMPC system
Fig. 6 Representative recording of the system step response: measured grip force (N) and motor voltage signal over 0–100 s (from ref. 34, with permission)
Fig. 7 Average closing and rise time for each object. Modified from ref. 34, with permission
Neural Network Predictive Control 189
Fig. 8 Average overshoot for each step and all objects. Modified from ref. 34, with permission
Fig. 9 Average steady state error for each object. Modified from ref. 34, with permission
Fig. 10 Block diagram of the control strategy (modified from ref. 34, with permission)
Fig. 11 Whole hand experiments (modified from ref. 35, with permission)
6 Notes
Acknowledgement
References
1. Smith RJ, Tenore F, Huberdeau D et al (2008) Continuous decoding of finger position from surface EMG signals for the control of powered prostheses. 30th Annual international conference of the IEEE Engineering in Medicine and Biology Society, 20–25 Aug 2008, pp 197–200
2. Matrone G, Cipriani C, Secco EL et al (2009) Bio-inspired controller for a dexterous prosthetic hand based on principal components analysis. Annual international conference of the IEEE Engineering in Medicine and Biology Society, 3–6 Sep 2009, pp 5022–5025
3. Jingdong Z, Li J, Shicai S et al (2006) A five-fingered underactuated prosthetic hand system. Proceedings of the IEEE international conference on mechatronics and automation, 25–28 Jun 2006, pp 1453–1458
4. Jingdong Z, Zongwu X, Li J et al (2005) Levenberg-Marquardt based neural network control for a five-fingered prosthetic hand. Proceedings of the IEEE international conference on robotics and automation, 18–22 Apr 2005, pp 4482–4487
5. Zajdlik J (2006) The preliminary design and motion control of a five-fingered prosthetic hand. Proceedings of the international conference on intelligent engineering systems, pp 202–206
6. Dapeng Y, Jingdong Z, Yikun G et al (2009) EMG pattern recognition and grasping force estimation: improvement to the myocontrol of multi-DOF prosthetic hands. IEEE/RSJ international conference on intelligent robots and systems, 10–15 Oct 2009, pp 516–521
7. Tsujiuchi N, Takayuki K, Yoneda M (2004) Manipulation of a robot by EMG signals using linear multiple regression model. Proceedings of the IEEE/RSJ international conference on intelligent robots and systems, 28 Sep–2 Oct 2004, pp 1991–1996
8. Jingdong Z, Zongwu X, Li J et al (2006) EMG control for a five-fingered underactuated prosthetic hand based on wavelet transform and sample entropy. International conference on intelligent robots and systems, 9–15 Oct 2006, pp 3215–3220
9. Weir RF, Ajiboye AB (2003) A multifunction prosthesis controller based on fuzzy-logic techniques. Proceedings of the 25th annual international conference of the IEEE Engineering in Medicine and Biology Society, 17–21 Sep 2003, pp 1678–1681
10. Pylatiuk C, Mounier S, Kargov A et al (2004) Progress in the development of a multifunctional hand prosthesis. 26th Annual international conference of the IEEE Engineering in Medicine and Biology Society, 1–5 Sep 2004, pp 4260–4263
11. Cipriani C, Antfolk C, Balkenius C et al (2009) A novel concept for a prosthetic hand with a bidirectional interface: a feasibility study. IEEE Trans Biomed Eng 56(11):2739–2743. doi:10.1109/TBME.2009.2031242
12. Panarese A, Edin BB, Vecchi F et al (2009) Humans can integrate force feedback to toes in their sensorimotor control of a robotic hand. IEEE Trans Neural Syst Rehabil Eng 17(6):560–567. doi:10.1109/TNSRE.2009.2021689
13. Rodriguez-Cheu LE, Casals A (2006) Sensing and control of a prosthetic hand with myoelectric feedback. The first IEEE/RAS-EMBS international conference on biomedical robotics and biomechatronics, 20–22 Feb 2006, pp 607–612
14. Dhillon GS, Horch KW (2005) Direct neural sensory feedback and control of a prosthetic arm. IEEE Trans Neural Syst Rehabil Eng 13(4):468–472. doi:10.1109/TNSRE.2005.856072
15. Lundborg G, Rosén B (2001) Sensory substitution in prosthetics. Hand Clin 17(3):481–488
16. Mangieri E, Ahmadi A, Maharatna K et al (2008) A novel analogue circuit for controlling prosthetic hands. IEEE biomedical circuits and systems conference, 20–22 Nov 2008, pp 81–84
17. Wettels N, Parnandi AR, Moon JH et al (2009) Grip control using biomimetic tactile sensing systems. IEEE/ASME Trans Mechatron 14(6):718–723
18. Tura A, Lamberti C, Davalli A et al (1998) Experimental development of a sensory control system for an upper limb myoelectric prosthesis with cosmetic covering. J Rehabil Res Dev 35(1):14–26
19. Engeberg ED, Meek SG (2008) Adaptive object slip prevention for prosthetic hands through proportional-derivative shear force feedback. IEEE/RSJ International conference on intelligent robots and systems, 22–26 Sep 2008, pp 1940–1945
20. Engeberg ED, Meek S (2008) Improved grasp force sensitivity for prosthetic hands through force-derivative feedback. IEEE Trans Biomed Eng 55(2):817–821
21. Engeberg ED, Meek SG, Minor MA (2008) Hybrid force-velocity sliding mode control of a prosthetic hand. IEEE Trans Biomed Eng 55(5):1572–1581
22. Zhao DW, Jiang L, Huang H et al (2006) Development of a multi-DOF anthropomorphic prosthetic hand. International conference on robotics and biomimetics, 17–20 Dec 2006, pp 878–883
23. Cipriani C, Zaccone F, Stellin G et al (2006) Closed-loop controller for a bio-inspired multi-fingered underactuated prosthesis. Proceedings of the IEEE international conference on robotics and automation, 15–19 May 2006, pp 2111–2116
24. Carrozza MC, Cappiello G, Micera S et al (2006) Design of a cybernetic hand for perception and action. Biol Cybern 95(6):629–644
25. Light CM, Chappell PH, Hudgins B et al (2002) Intelligent multifunction myoelectric control of hand prostheses. J Med Eng Technol 26(4):139–146
26. Connolly C (2008) Prosthetic hands from Touch Bionics. Ind Rob 35(4):290–293
27. Engeberg ED, Meek SG (2013) Adaptive sliding mode control for prosthetic hands to simultaneously prevent slip and minimize deformation of grasped objects. IEEE/ASME
Chapter 12
Abstract
Computer-aided diagnosis is a diagnostic procedure in which a radiologist uses the outputs of computer
analysis of medical images as a second opinion in the interpretation of medical images, either to help with
lesion detection or to help determine if the lesion is benign or malignant. Artificial neural networks (ANNs)
are usually employed to formulate the statistical models for computer analysis. Receiver operating charac-
teristic curves are used to evaluate the performance of the ANN alone, as well as the diagnostic perfor-
mance of radiologists who take into account the ANN output as a second opinion. In this chapter, we use
mammograms to illustrate how an ANN model is trained, tested, and evaluated, and how a radiologist
should use the ANN output as a second opinion in CAD.
Key words Artificial neural network (ANN), Medical image, Mammogram, CAD, Receiver operating
characteristic (ROC)
1 Introduction
Hugh Cartwright (ed.), Artificial Neural Networks, Methods in Molecular Biology, vol. 1260,
DOI 10.1007/978-1-4939-2239-0_12, © Springer Science+Business Media New York 2015
196 Bei Liu
2 Materials
2.2 Data Collection
An ANN is actually a mathematical model with many weight parameters. For the ANN architecture in Fig. 1, there are 54 weight parameters that need to be optimized in the ANN training;
the number of training cases should be much larger than the num-
ber of parameters to ensure the generalizability of the ANN model.
Therefore, medical images for a large number of patients need
to be collected, in order to build an ANN model for CAD applica-
tion. The patients are partitioned into three groups: training cases,
validation cases, and testing cases, to ensure generalizability of the
ANN. For a patient with multiple images, such as a breast MRI and
ANN Application in CAD 197
Fig. 1 8-6-1 ANN architecture. There are 54 internode connections, and thus 54
weight parameters
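The count of 54 follows directly from the fully connected 8-6-1 layout, since each node in one layer connects to every node in the next:

```python
# Fully connected 8-6-1 feedforward architecture (Fig. 1)
layers = [8, 6, 1]

# Internode connections between successive layers: 8*6 + 6*1 = 54
n_weights = sum(a * b for a, b in zip(layers, layers[1:]))
print(n_weights)  # 54
```

(If bias terms were also counted as free parameters, the total would be larger; the chapter counts internode connections only.)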
2.3 Performance Evaluation
The receiver operating characteristic (ROC) is used to evaluate the overall performance of the trained ANN model and the CAD system. An ROC curve completely describes the trade-off between
sensitivity and specificity at different operating points [11, 12],
where sensitivity or true positive fraction (TPF) is defined as the
probability that an actually positive case is diagnosed as positive
and specificity is defined as the probability that an actually negative
case is diagnosed as negative. More often, the false positive fraction
(FPF) is used, which is defined as the probability that an actually
negative case is diagnosed as positive. One can easily see that, by
definition, FPF = 1 − specificity. The plot of TPF vs. FPF is the ROC
curve (Fig. 2). The area under an ROC curve (AUC) is a very use-
ful summary index to describe the CAD system’s overall perfor-
mance: AUC can be viewed as the average sensitivity or average
specificity of the CAD system.
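An empirical ROC curve and its AUC can be computed directly from ANN output scores by sweeping the decision threshold; a minimal sketch (distinct from the binormal LABROC4 fitting used in Subheading 3.4):

```python
import numpy as np

def roc_points(scores, labels):
    """Empirical ROC points (FPF, TPF), sweeping the decision threshold
    down through the ANN output scores.
    labels: 1 = actually positive (malignant), 0 = actually negative (benign)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(-scores)                       # descending output score
    hits = labels[order]
    tpf = np.cumsum(hits) / hits.sum()                # sensitivity
    fpf = np.cumsum(1 - hits) / (1 - hits).sum()      # 1 - specificity
    return np.concatenate(([0.0], fpf)), np.concatenate(([0.0], tpf))

def auc(fpf, tpf):
    """Area under the empirical ROC curve by the trapezoidal rule."""
    return float(np.sum((fpf[1:] - fpf[:-1]) * (tpf[1:] + tpf[:-1]) / 2.0))
```

A perfectly separating ANN yields AUC = 1, while chance-level output yields AUC ≈ 0.5.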
Fig. 2 Three ROC curves. AUC is the average sensitivity or average specificity
3 Methods
3.1 Medical Image Processing
1. After collecting enough medical images (digital or film based) for patient cases to build an ANN model for CAD applications,
all the film-based images need to be digitized and stored in a
CAD server along with the digital images.
2. Image pre-processing is often necessary to remove noise and
artifacts in the images. Histogram equalization is applied on
each image to improve contrast.
3. Regions of interest (ROIs) are defined so that the ROIs, which
are simply rectangular regions, enclose the lesions. It is possi-
ble that there will be multiple ROIs on one medical image.
4. Lesions are segmented automatically using computer software.
Depending on the applications, many segmentation methods
can be used: from simple thresholding to region growing,
watershed, or an active contour model (snake). For microcalci-
fication lesions of the breast, the shape and size features
strongly depend on the segmentation algorithm used [15, 16];
therefore, various segmentation methods should be tried and
tested in order to choose a segmentation method that most
effectively fits the CAD application.
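The histogram equalization applied in step 2 can be sketched with NumPy. This is a generic illustration rather than the chapter's actual preprocessing software, and it assumes a non-constant image:

```python
import numpy as np

def equalize(img):
    """Histogram equalization of an 8-bit grayscale image (uint8 array)."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]            # smallest nonzero cumulative count
    # Remap each occurring gray level through the normalized cumulative histogram
    lut = np.round((cdf - cdf_min) * 255.0 / (cdf[-1] - cdf_min)).astype(np.uint8)
    return lut[img]
```

The remapping spreads the occupied gray levels over the full 0–255 range, which is what improves the visible contrast of the lesions.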
3.3 ANN Training
8. The image dataset is partitioned into three sub-datasets for ANN modeling: a training dataset, a validation dataset, and a
testing dataset. The goal of ANN training, which is an iterative
process, is to optimize the weight of each nodal connection to
minimize the cost function. The cost function is the sum of the
squared error between ANN outputs and training target values, which are defined from the ground truth established by biopsy. In
the task of classifying a breast lesion as benign or malignant,
the target values for benign cases and malignant cases are set to
be 0 and 1, respectively (see Notes 1 and 2).
9. Backpropagation is used for ANN training. In this method the
weight correction is ∆wk(i) = −η∂E/∂wk + α∆wk(i − 1) in the i-th
iteration, where w is the weight factor, E is the cost function,
and ∂E/∂wk is the gradient descent force that drives the cost
function to the minimum. The parameter η is the learning rate;
this controls how rapidly the ANN training converges, and the
parameter α is the momentum, which is used to reduce the
weight fluctuation. The optimal values of η and α are problem
dependent and are determined by trial and error for a specific
problem [21].
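The weight correction of step 9 can be written out directly. In this toy sketch the gradient ∂E/∂wk, which backpropagation would supply, is taken from a one-parameter quadratic cost so that the behavior of the update can be seen:

```python
def momentum_update(w, grad, dw_prev, eta=0.1, alpha=0.9):
    """One weight correction: dw(i) = -eta * dE/dw + alpha * dw(i-1)."""
    dw = -eta * grad + alpha * dw_prev
    return w + dw, dw

# Toy cost E(w) = w**2 with gradient dE/dw = 2*w: the iterates descend
# toward the minimum at w = 0, with momentum smoothing the trajectory
w, dw = 1.0, 0.0
for _ in range(200):
    w, dw = momentum_update(w, 2.0 * w, dw)
```

In a real network, `grad` is the backpropagated ∂E/∂wk for each connection, and the same update is applied weight by weight at every iteration.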
10. Since backpropagation is a gradient descent method, there is a
danger that the ANN training might get trapped in a local
minimum. It is generally impossible to be certain that the
global minimum has been located, so various initial weights
should be tried for ANN training in order to obtain a good
ANN model. The area under the ROC curve (AUC) is calcu-
lated from the validation dataset for each epoch, or every ten
epochs, and the ANNs with the maximum AUC values are
tested using the testing dataset. Since the testing dataset is not involved in the ANN training or ANN selection, it gives an unbiased estimate of the model's performance.
3.4 ROC Analysis and Observer Study
11. ROC analysis is used to evaluate the ANN performance, and LABROC4 is used for binormal ROC curve fitting [12, 22]. LABROC4 and other ROC analysis programs, such as CORROC (which calculates the p value for comparing two ROC curves), were developed at the University of Chicago and can be downloaded free from http://metz-roc.uchicago.edu/MetzROC/
software. The 95 % confidence interval for Az, which is actually
the AUC after binormal fitting, is also calculated.
12. An observer study is conducted to evaluate the performance of
the CAD. A number of radiologists should be recruited for an
observer study since the CAD performance is radiologist
dependent. Before the observer study, each radiologist is
trained so that they are familiar with the CAD system, includ-
ing how to use the software, and to become familiar and com-
fortable with the setting and the flow of the study. The
radiologists are given some details of how the ANNs are
trained, such as the image features and nonimage features that
are used for ANN training, the size of the training dataset, the
validation dataset and the testing dataset, as well as the Az
value of the ANN model applied on the testing case and its
standard error. The observers also read a number of cases with
and without computer aid, and compare those data with the
known outcomes, so that they can build up adequate confi-
dence in use of the CAD system. After this brief training,
observers are asked to view a set of medical images, which can
be the ANN testing images or other independent images, but
cannot be the training or validation images, with or without
computer-estimated likelihood of malignancy.
For example, in the CAD observer study conducted by Rana
et al. [10], four radiologists were asked to view a set of mam-
mograms with microcalcification lesions with and without com-
puter aid. Each radiologist drew a square ROI to enclose the
microcalcification lesion and the computer algorithm automati-
cally calculated the number of calcifications and segmented
them. For 93 out of 332 cases, observers were also asked to
place the number of calcifications into one of the four catego-
ries to help the computer algorithm set a threshold for calcifica-
tion identification [23]: (a) 3–5, (b) 6–10, (c) 11–30, or (d) 31
or more calcifications. The Az value of the ANN model used was
0.79 with a standard error of 0.05. With and without computer-
estimated likelihood of malignancy, each observer read both the
CC-view and MLO-view mammogram for each patient and
assigned a BI-RADS category for each patient: benign findings,
probably benign, suspicious abnormality, and highly suggestive
of malignancy. The other categories are not used because all the
mammograms have calcification lesions. ROC analysis was then
performed and compared with the radiologists’ diagnosis with
and without computer aid (see Notes 5 and 6).
4 Notes
References
1. Eadie LH, Taylor P, Gibson AP (2012) A systematic review of computer-aided diagnosis in diagnostic cancer imaging. Eur J Radiol 81:e70–e76
2. Baker JA, Kornguth PJ, Lo JY et al (1996) Artificial neural network: improving the quality of breast biopsy recommendations. Radiology 198:131–135
3. Chan H, Sahiner B, Petrick N et al (1997) Computerized classification of malignant and benign microcalcifications on mammograms: texture analysis using an artificial neural network. Phys Med Biol 42:549–567
4. Jiang Y, Nishikawa RM, Schmidt RA et al (1999) Improving breast cancer diagnosis with computer-aided diagnosis. Acad Radiol 6:22–33
5. Huo Z, Giger ML, Vyborny CJ et al (2002) Breast cancer: effectiveness of computer-aided diagnosis—observer study with independent database of mammograms. Radiology 224:560–568
6. Suzuki K, Li F, Sone S et al (2005) Computer-aided diagnosis scheme for distinction between benign and malignant nodules in thoracic low-dose CT by use of massive training artificial neural network. IEEE Trans Med Imaging 24:1138–1150
7. Richard HH, Lippman RP (1991) Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Comput 3:461–483
8. Swingler K (1996) Applying neural networks: a practical guide. Academic, London
9. Berry MJA, Linoff G (1997) Data mining techniques. Wiley, New York
10. Rana RS, Jiang Y, Schmidt RA et al (2007) Independent evaluation of computer classification of malignant and benign calcifications in full-field digital mammograms. Acad Radiol 14:363–370
11. Swets JA (1978) ROC analysis applied to the evaluation of medical imaging techniques. Invest Radiol 14:109–121
12. Metz CE (1986) ROC methodology in radiologic imaging. Invest Radiol 21:720–733
13. Swets JA (1986) Forms of empirical ROCs in discrimination and diagnostic tasks: implication for theory and measurement of performance. Psychol Bull 99:181–198
14. Hanley JA (1988) The robustness of the binormal assumption used in fitting ROC curves. Med Decis Making 8:197–203
15. Veldkamp WJH, Karssemeijer N (1996) Influence of segmentation on classification of microcalcifications in digital mammography. IEEE engineering in medicine and biology society, bridging disciplines for biomedicine, proceedings of the 18th annual international conference of the IEEE 3, pp 1171–1172
16. Paquerault S, Yarusso LM, Papaioannou J et al (2004) Radial gradient-based segmentation of mammographic microcalcifications: observer evaluation and effect on CAD performance. Med Phys 31:2648–2657
17. Shi J, Sahiner B, Chan HP et al (2008) Characterization of mammographic masses based on level set segmentation with new image features and patient information. Med Phys 35(1):280–290
18. Singh S, Maxwell J, Baker JA et al (2011) Computer-aided classification of breast masses: performance and interobserver variability of expert radiologists versus residents. Radiology 258:73–80
19. Matignon R (2005) Neural network modeling using SAS enterprise miner. AuthorHouse, Bloomington
20. Way TW, Sahiner B, Hadjiiski LM et al (2010) Effect of finite sample size on feature selection and classification: a simulation study. Med Phys 37(2):907–920
21. Rojas R, Feldman J (1996) Neural networks: a systematic introduction. Springer, New York
22. Metz CE, Herman BA, Shen JH (1998) Maximum likelihood estimation of receiver operating characteristic (ROC) curves from continuously distributed data. Stat Med 17:1033–1053
23. Salfity MF, Nishikawa RM, Jiang Y et al (2003) The use of a priori information in the detection of mammographic microcalcifications to improve their classification. Med Phys 30:823–831
24. Rojas R (1996) A short proof of the posterior probability classifier neural networks. Neural Comput 8:41–43
25. Liu B, Jiang Y (2013) A multitarget training method for artificial neural network with application to computer-aided diagnosis. Med Phys 40:011908. doi:10.1118/1.4772021
26. Rosen PP (2001) Rosen's breast pathology. Lippincott Williams & Wilkins, Philadelphia
27. Jiang Y, Nishikawa RM, Wolverton DE et al (1996) Malignant and benign clustered microcalcifications: automated feature analysis and classification. Radiology 198:671–678
28. Chan H, Sahiner B, Lam KL et al (1998) Computerized analysis of mammographic microcalcifications in morphological and texture feature spaces. Med Phys 25:2007–2019
29. Huo Z, Giger ML, Vyborny CJ (2001) Computerized analysis of multiple-mammographic views: potential usefulness of special view mammograms in computer-aided diagnosis. IEEE Trans Med Imaging 20:1285–1292
30. Metz CE, Shen J (1992) Gains in accuracy from replicated reading of diagnostic images: prediction and assessment in terms of ROC analysis. Med Decis Making 12:60–75
31. Liu B, Metz CE, Jiang Y (2004) An ROC comparison of four methods of combining information from multiple images of the same patient. Med Phys 31:2552–2563
32. Liu B, Metz CE, Jiang Y (2005) Effect of correlation on combining diagnostic information from two images of the same patient. Med Phys 32:3329–3338
Chapter 13
Abstract
Robust personal authentication is becoming ever more important in computer-based applications. Among
a variety of methods, biometrics offers several advantages, mainly in embedded-system applications. Hard and soft multi-biometrics, combined with hard- and soft-computing methods, can be applied to improve the personal authentication process and to broaden its applicability.
This chapter describes the embedded implementation of a multi-biometric (voiceprint and finger-
print) multimodal identification system based on hard computing methods (DSP) for feature extraction
and matching, an artificial neural network (ANN) for soft feature pattern matching, and a fuzzy logic
engine (FLE) for data fusion and decision.
Key words Artificial neural network, Fuzzy decision logic, Soft-biometric data, Multi-biometric
1 Introduction
More and more of the devices we deal with every day depend on
embedded systems. This trend can reasonably be expected to accel-
erate going forward. Because of this, as well as due to other factors,
the security of systems and of data is likely to continue to present
challenges that developers will attempt to meet by devising new
tools. There can be little doubt that many such methods will rely
on the biometric approach [1–4].
Biometrics is uniquely suited to authentication tasks in embed-
ded systems because it offers simultaneous solutions to the twin
problems of the meager demand for resources that we wish to
place on such systems and of the high security requirements that
their potential applications can be expected to mandate [5, 6]. It is
not hard to imagine wanting to limit demand on resources if we
start from the very basic concept of a system that has no keyboard,
is designed to have the smallest possible footprint, and can be
considered subsidiary to a larger system, machine, or process. If we
consider the embedded system’s role as gatekeeper to a much
Hugh Cartwright (ed.), Artificial Neural Networks, Methods in Molecular Biology, vol. 1260,
DOI 10.1007/978-1-4939-2239-0_13, © Springer Science+Business Media New York 2015
206 Mario Malcangi
larger and more complex reality, say the electronic ignition key to
a sophisticated, automated installation, it is not hard, either, to
envision the degree of security requirements for which a minuscule
device is ultimately responsible. As such devices proliferate, the
demand for lightweight yet robust authentication methods can
only grow.
Because biometrics depends on physiological traits or on dis-
tinctive behavior [7, 8] unique to the individual whose identity is
to be authenticated, or—indeed—a combination of the two, it can
rightly stake a claim to being superior, at least potentially, to cur-
rent and established authentication methods like personal identifi-
cation numbers (PINs), passwords, or smart cards. The individual’s
biometric data is always available, is nonrandomly unique, cannot
be transferred to another party, cannot be forgotten, is not subject
to theft, and cannot be guessed. These advantages mean that bio-
metrics provides very high security compared to traditional identi-
fication methods that rely on the possession of a token, such as a
card, key, or chip, or of personal knowledge, as in the case of a
password [9].
Despite the superiority of biometric authentication due to such
advantages and despite its rapid spread in microelectronics, mass-
market adoption has lagged. The primary reason widespread application of biometrics has not taken off is that it does not offer the surefire authentication that a password does: a correct password always works. Biometric features, in their
original form, are analog information. As a result, they are subject
to variation when captured by a biometric scanning device. This
applies to fingerprint readers, microphones for voice-pattern rec-
ognition, cameras for facial feature matching, and any other digital
system that has to match measured, analog, human traits. Of course
such features can be digitized, but the resulting data remain fuzzy nonetheless. Fuzziness can only be minimized so much, given that the original, analog features are, themselves, inherently fuzzy.
False acceptance rate (FAR) and false rejection rate (FRR) [10]
can, of course, be minimized with classical pattern-matching tech-
niques [11–13] but a 100 % correct acceptance rate cannot be
achieved this way. However, the intrinsic fuzziness of biometric
features makes it logical to look to soft-computing approaches
[14–16] in the design of processes that match biometric feature
patterns in the hope that nearly 100 % correct authentication might
be attained that way. For example, even in prohibitive conditions,
human beings always manage to recognize a familiar face. The
human mind processes biometric features fuzzily and neurally.
Soft computing is designed to handle exactly such variation and uncertainty. Biometric features vary constantly, resist analytical description, and inherently fall short of matching their owner 100 % of the time.
Biometric Authentication 207
2 Materials
2.1 Biometric Sensors
The biometric sensors are specifically designed to capture physiologic or behavioral signals generated by human beings. The sensor is the
first device of the signal chain. It captures and converts a physical
property into analog signals (commonly electrical) or digital sig-
nals (when it embeds the mixed-signal electronics).
2.1.1 Voiceprint Sensors
The voiceprint sensor is a voice-grade microphone system able to
capture the utterance, preserving the intelligibility of the speech.
●● Single microphone: captures the acoustic wave of the utterance
as a direct sound source (position sensitive; sensitive to sur-
rounding noise).
●● Dual microphone: captures the acoustic sound wave of the
utterance and the surrounding audio noise (partially position
sensitive; low sensitivity to surrounding noise).
●● Microphone array: captures the acoustic sound wave of the
utterance by a set of microphones spatially distributed (posi-
tion insensitive; highly insensitive to surrounding noise).
A single microphone has been used in this design (see Note 1).
2.1.2 Fingerprint Sensors
A fingerprint sensor is an electronic device capable of capturing the image of the fingerprint pattern and making it available as a bit map.
Several technologies have been applied to develop fingerprint
sensors:
●● Optical (special digital camera): captures visible light emitted
from a phosphor layer which illuminates the surface of the fin-
ger—advantages: noncontact, not electrostatic discharge sensi-
tive—disadvantages: capabilities affected by the quality of skin,
easily fooled.
●● Ultrasonic (based on medical ultrasonic imaging principles):
captures the images of a fingerprint by penetrating the epider-
mal layer of skin with high-frequency sound—advantages:
noncontact, not electrostatic discharge sensitive, capabilities
not affected by the quality of skin, very difficult to fool—disad-
vantages: the price is significantly higher than for a capacitive
fingerprint scanner.
●● Capacitance (based on electrical capacitance): consists of an
array of capacitors in which one of the two plates is the dermal
layer and the dielectric is the epidermal layer:
–– Passive: Capacitance is measured across each sensor capaci-
tor to form a pixel of the fingerprint image.
–– Active: A voltage is applied to the skin first, and then the
charge of each capacitor of the array is measured to form a
pixel of the fingerprint image.
2.2 Sensor Data Acquisition Chain
Sensor data acquisition concerns the conditioning and digital representation of the voiceprint and fingerprint information captured by the respective biometric sensors. Conditioning is necessary if the sensor is affected by nonlinearities and if it is not dynamically compatible with the analog-to-digital conversion (ADC) subsystem.
2.2.1 Microphone Sensor Data Acquisition
The microphone sensor data acquisition process is sensitive to several nonlinearities and mixed-signal constraints. The microphone chain path needs conditioning and analog-to-digital adaptation, so optimal data can be available at the processing stage:
●● Transducer linearization.
●● Automatic gain control.
●● Filtering.
●● Sampling.
●● Quantization.
The microphone nonlinearities need to be compensated to
avoid distorted measurements at the level of feature extraction.
This can be done in the analog domain by electronic circuitry
tuned on the specific microphone, or in the digital domain at the
preprocessing stage of the signal acquisition. In this design the
second option has been applied because it is more flexible and
adaptive. The microphone transfer function is estimated at calibra-
tion time, and then the nonlinearity compensated for by applying
at run time the inverse function to the captured signal.
Amplification is required because the small signals from the microphone must be adapted to the analog-to-digital converter (ADC) input dynamic range to obtain a higher signal-to-quantization-noise ratio (SQNR).
Low-pass filtering (antialiasing) and high-pass filtering (offset removal) are applied prior to sampling the microphone signal, minimizing frequency distortion of the pulse-code-modulated (PCM) stream to be quantized and optimizing the SQNR.
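The calibration-time estimate and run-time inverse application described above can be sketched as a table lookup. This is a minimal illustration, not the chapter's implementation; the calibration curve used here (a mildly compressive tanh-shaped transducer) and all names are assumptions.

```python
import numpy as np

# Hypothetical calibration data: microphone output measured for known
# reference input levels (the tanh shape is an illustrative assumption).
cal_input = np.linspace(-1.0, 1.0, 101)
cal_output = np.tanh(1.5 * cal_input) / np.tanh(1.5)

def linearize(samples, cal_input, cal_output):
    """Apply the inverse transfer function estimated at calibration time.

    For each captured sample, interpolate which *input* level would have
    produced it; cal_output must be monotonic for the inverse to exist.
    """
    return np.interp(samples, cal_output, cal_input)

# A distorted capture of the true levels 0.0, 0.25, -0.5, 0.9 ...
captured = np.tanh(1.5 * np.array([0.0, 0.25, -0.5, 0.9])) / np.tanh(1.5)
restored = linearize(captured, cal_input, cal_output)  # close to the true levels
```

In a real chain the same lookup would be applied to every PCM sample before feature extraction.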
2.2.2 Fingerprint Sensor Data Acquisition
The solid-state capacitive fingerprint sensor integrates all the required conditioning and digitization resources, so that optimal digital fingerprint image data are available at its output. The fingerprint image is captured at 513 dpi (dots per inch) resolution as a bitmap of 64,512 pixels (224 horizontal × 288 vertical), each quantized to an 8-bit depth (gray-level image). The fingerprint digital image acquisition is completed by a synchronous serial transfer to the application processor (DSP).
210 Mario Malcangi
2.3 Feature Extraction Algorithms
To extract the features, a set of digital signal processing algorithms is applied to the captured utterance.
2.3.1 Voiceprint Hard Features
The following formulae have been applied to measure the voiceprint hard features [23]:
●● Root mean square (RMS):

    RMS_j = \sqrt{ \frac{1}{N} \sum_{m=0}^{N-1} s_j(m)^2 }    (1)

●● Zero-crossing rate (ZCR):

    ZCR_j = \sum_{m=0}^{N-1} 0.5 \left| \operatorname{sign}(s_j(m)) - \operatorname{sign}(s_j(m-1)) \right|    (2)

●● Autocorrelation (AC):

    AC_j = \sum_{i=1}^{N} \sum_{k=1}^{N+1-i} s_j(k)\, s_j(i+k-1)    (3)

●● Cepstral linear prediction coefficients (CLPC):

    CLPC_m = a_m + \sum_{k=1}^{m-1} \left( \frac{k}{m} \right) c_k\, a_{m-k}    (4)

with
c_0 = r(0): first autocorrelation coefficient
a_m: prediction coefficients
s_j(n) = w_N(n)\, s(n): j-th windowed part of the uttered stream s
w_N(n) = 0.54 - 0.46 \cos\left( \frac{2\pi n}{N-1} \right): N-length Hamming weighting window
The above are short-time measurements, executed by multiplying the N-sample Hamming weighting window w_N(n) by the uttered speech stream s(n).
RMS is the root mean square of the windowed speech segment; this measurement helps to identify phonetic-unit endpoints.
ZCR is a time-domain measurement of the dominant frequency information. It is used to determine whether the currently processed speech segment is voiced or unvoiced, and to find the frequency band with the greatest energy concentration.
AC is a measurement of the speech pitch frequency. It preserves information about pitch-frequency amplitude while ignoring phase; phase is unimportant for the purpose of speech identification.
CLPC is the LPC-cepstral feature vector that models the vocal tract.
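As a concrete illustration of Eqs. (1) and (2), the RMS and ZCR tracks can be computed per Hamming-windowed frame. The frame length and hop size below are illustrative choices, not values specified in the chapter.

```python
import numpy as np

def hamming(N):
    """Hamming window: w(n) = 0.54 - 0.46 cos(2*pi*n/(N-1))."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

def short_time_features(s, N, hop):
    """Per-frame RMS (Eq. 1) and ZCR (Eq. 2) over Hamming-windowed
    segments of the speech stream s."""
    w = hamming(N)
    rms, zcr = [], []
    for start in range(0, len(s) - N + 1, hop):
        seg = s[start:start + N] * w
        rms.append(np.sqrt(np.mean(seg ** 2)))                   # Eq. (1)
        zcr.append(0.5 * np.sum(np.abs(np.diff(np.sign(seg)))))  # Eq. (2)
    return np.array(rms), np.array(zcr)

# A 100 Hz tone sampled at 8 kHz: each 50 ms frame holds 10 zero crossings
t = np.arange(8000) / 8000.0
rms, zcr = short_time_features(np.sin(2 * np.pi * 100 * t + 0.1),
                               N=400, hop=400)
```

A voiced frame shows high RMS and low ZCR; an unvoiced (fricative) frame shows the opposite pattern, which is what the endpointing and voicing decisions exploit.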
2.3.2 Voiceprint Soft Features
The following soft features are extracted from speech:
●● Speed.
●● Stress.
Speed is measured as the total duration of the speech utterance. Stress is measured as the ratio between the peak amplitude of the stressed vowel and the average amplitude of the whole utterance. Both voiceprint features are related to the way the person habitually speaks a requested word.
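A sketch of how these two scalars might be computed from an endpointed utterance. Locating the stressed vowel via the frame of peak RMS is our assumption; the chapter does not specify how it is found.

```python
import numpy as np

def voiceprint_soft_features(utterance, frame_rms, fs):
    """Speed: total utterance duration in seconds.
    Stress: peak (stressed-vowel) amplitude over the average
    amplitude of the whole utterance."""
    speed = len(utterance) / fs
    stress = float(np.max(frame_rms) / np.mean(frame_rms))
    return speed, stress

# One second of placeholder samples and a toy three-frame RMS track
speed, stress = voiceprint_soft_features(np.zeros(16000),
                                         np.array([0.1, 0.4, 0.1]),
                                         fs=16000)
```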
2.3.3 Fingerprint Hard Features
Several preprocessing steps are executed on the captured grayscale fingerprint image, taking it from a bitmap to its minutiae-based representation:
●● Orientation field and region of interest:
–– Normalization of the captured fingerprint image.
–– Image segmentation into blocks.
–– x and y gradient computation for each pixel in each block.
–– Local orientation for each pixel.
–– Orientation field correction.
●● Positioning (delta and core localization).
●● Ridging.
●● Ridge thinning.
Fingerprint hard features are the minutiae. The criteria used for minutiae extraction from the thinned ridge image are the following:
●● If a ridge pixel has more than two ridge pixels among its 8-nearest neighbors, it is a bifurcation.
●● If a ridge pixel has only one ridge pixel among its 8-nearest neighbors, it is a termination.
As the crest-valley image is encoded at two levels (1-0), the mapping rule over the eight neighbor pixels p_i of a ridge pixel is

    bifurcation if \sum_{i=0}^{7} p_i > 2, \quad termination if \sum_{i=0}^{7} p_i = 1    (5)
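The two criteria can be sketched as a scan over the thinned binary image; the tiny synthetic ridge below is made up for illustration.

```python
import numpy as np

def classify_minutiae(thinned):
    """Label minutiae in a thinned (0/1) ridge image by neighbor count:
    a ridge pixel whose 8-neighborhood sum exceeds 2 is a bifurcation;
    a sum of exactly 1 marks a termination."""
    minutiae = []
    h, w = thinned.shape
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if thinned[y, x] != 1:
                continue
            # sum of the 8 neighbors (exclude the center pixel itself)
            n = thinned[y - 1:y + 2, x - 1:x + 2].sum() - 1
            if n > 2:
                minutiae.append((x, y, "bifurcation"))
            elif n == 1:
                minutiae.append((x, y, "termination"))
    return minutiae

# A synthetic ridge: a horizontal line ending at (1, 2), forking at (4, 2)
img = np.zeros((5, 7), dtype=int)
img[2, 1:5] = 1            # main ridge
img[1, 5] = img[3, 5] = 1  # two branches leaving the fork
found = classify_minutiae(img)
```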
2.3.4 Fingerprint Soft Features
Two soft features are extracted from the captured fingerprint image:
●● Total area.
●● Mean intensity.
Both fingerprint features are related to the way the person approaches contact with the fingerprint sensor.
The total area is measured as the ratio between the total pixels available on the fingerprint scanning device and the pixels of the captured fingerprint image whose value exceeds an estimated peak noise level. The peak noise level is estimated at calibration time and applied at run time (enrollment and identification). The mean intensity is measured as the average intensity of all pixels whose value exceeds a threshold level set 10% above the peak noise level.
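The two measurements follow directly from the definitions above; the sample image and noise level are made up for illustration.

```python
import numpy as np

def fingerprint_soft_features(image, peak_noise):
    """Total area: ratio of the sensor's total pixels to the captured
    pixels exceeding the peak noise level (as defined in the text).
    Mean intensity: average of pixels above a threshold 10% higher
    than the peak noise level."""
    signal = image > peak_noise
    total_area = image.size / signal.sum()
    threshold = 1.10 * peak_noise
    mean_intensity = float(image[image > threshold].mean())
    return total_area, mean_intensity

img = np.array([[0.0, 50.0, 200.0],
                [0.0, 120.0, 180.0]])
area, intensity = fingerprint_soft_features(img, peak_noise=40.0)
```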
2.4 Hard Computing Pattern Matching
The captured fingerprint and voiceprint hard features have to be matched against the enrolled features (templates).
2.4.1 Voiceprint Hard Features
Two methods are applied to score the person's identity: one is based on the Mahalanobis distance, the other on a dynamic time warping-k-nearest neighbor (DTW-KNN) distance.
The Mahalanobis distance measurement is

    D_i(x) = (x - \bar{x})^T\, W^{-1}\, (x - \bar{x})    (6)

where W is the covariance matrix computed using the average and the standard deviation of the utterance features. The input pattern x is processed with reference to the utterance-averaged feature vector \bar{x} that represents the person to be identified. The distance D_i(x) is a score for the authorized user.
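Eq. (6) as written is the squared Mahalanobis form; a sketch with illustrative 2-D enrollment statistics (the real feature vectors of Subheading 2.3.1 are higher dimensional):

```python
import numpy as np

def mahalanobis_score(x, template_mean, cov):
    """Eq. (6): D(x) = (x - mean)^T W^{-1} (x - mean), scoring input x
    against the enrolled utterance-averaged template."""
    d = np.asarray(x) - np.asarray(template_mean)
    return float(d @ np.linalg.inv(cov) @ d)

# Illustrative enrollment statistics for a 2-D feature vector
template_mean = np.array([1.0, 2.0])
cov = np.diag([4.0, 1.0])  # per-feature variances from enrollment

score = mahalanobis_score([3.0, 2.0], template_mean, cov)
```

A deviation along the high-variance axis is penalized less than the same deviation along the low-variance axis, which is exactly why the covariance weighting is preferred over plain Euclidean distance here.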
The DTW-KNN method combines the dynamic-time-warping distance measurement with the k-nearest neighbor decision algorithm. The DTW clusters similar
[Figure: dynamic time warping aligns the input pattern (samples 1..N) to the template (samples 1..M) along a warp path constrained to a search space]
2.4.2 Fingerprint Hard Features
Fingerprint hard feature pattern matching is based on minutiae. Pattern matching consists of a procedure that first tries to align the template pattern and the input pattern, and then computes an overlapping score (Fig. 2). The score is a measurement of the authenticity of the person who enrolled the input.
2.5 Soft Computing Pattern Matching
The captured fingerprint and voiceprint soft features have to be matched against the enrolled features (templates). This is done using the artificial neural network (ANN) soft computing paradigm.
2.5.1 Artificial Neural Network (ANN)
The ANN is a signal processing and pattern classification paradigm inspired by the structure and functions of biological neural networks. Information signals flowing through the ANN modify its connections, enabling the learning process.
[Fig. 2 content: the input minutiae pattern is aligned to the template within a translation (Δ) and rotation (Δθ) search space around a reference minutia]
The activity of the i-th node at layer l + 1 for the k-th input pattern processed by the ANN, assuming M nodes in the l-th layer, is

    x_i^{l+1}(k) = f\left( \sum_{j=1}^{M} w_{ij}\, x_j^{l}(k) + \theta_i \right)    (7)

where f is the node activation function, w_{ij} are the connection weights, and θ_i is the node bias. Input and output layers have a linear activation function; a nonlinear (sigmoid) activation function is used for the hidden layer (Fig. 3).
Fig. 3 (a) FFBP-ANN for voiceprint and fingerprint soft feature classification, (b) linear activation function for
input and output layers, and (c) sigmoid activation function for hidden layer
2.6 Soft Computing Fusion and Decision
The fusion and decision logic is based on the fuzzy logic inferential paradigm: a fuzzy logic engine processes the crisp inputs (scores and features), applies a set of inferential rules to them, and makes the fuzzy decision.
2.6.1 Fuzzy Logic Engine
The fuzzy logic engine (FLE) used to implement the data and decision fusion and to infer the final decision is of zero-order Sugeno type (Fig. 4a). It fuzzifies the inputs (soft features and scores)
[Fig. 4 worked example: Rule 1, IF x IS A2 OR y IS B3 THEN k IS K2 (= 30), fires with strength max(0.8, 0.1) = 0.8; a second rule using AND fires with strength min(0.5, 0.4) = 0.4. The rule outputs are aggregated and defuzzified by weighted average, yielding 23.33]
Fig. 4 (a) Zero-order Sugeno-type fuzzy logic inference engine. (b) Aggregation and defuzzification
and computes the crisp output as the weighted average

    WA = \frac{\sum \mu(x)\, x}{\sum \mu(x)}    (9)

where μ(x) is the degree to which the inputs x belong to the appropriate fuzzy sets.
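For a zero-order Sugeno engine, the weighted average of Eq. (9) reduces to a few lines. The consequent value k1 = 10 below is an assumption, chosen so the sketch reproduces the 23.33 shown in Fig. 4b (k2 = 30 is given there).

```python
def sugeno_defuzzify(strengths, singletons):
    """Eq. (9): crisp output as the firing-strength-weighted average
    of the singleton rule consequents."""
    num = sum(mu * k for mu, k in zip(strengths, singletons))
    return num / sum(strengths)

# Rule strengths 0.8 (OR -> max) and 0.4 (AND -> min), as in Fig. 4a
out = sugeno_defuzzify([0.8, 0.4], [30.0, 10.0])
```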
2.7 System Modeling and Simulation Environments
The Matlab environment and the DSP toolbox are used to code the signal processing algorithms, to run them in simulation mode, and then to export the source code as ANSI-C. The Data Acquisition (DAQ) toolbox is used to collect data from the fingerprint sensor and from the microphone.
2.7.1 ANN
The FFBP-ANN is likewise modeled and simulated in the Matlab environment with the DSP toolbox, which is used to code the network, run it in simulation mode, and then export the source code as ANSI-C; the DAQ toolbox supplies its data from the fingerprint sensor and the microphone.
2.8 Digital Signal Processor
To implement the embedded system, a system-on-chip (SoC) processor is used: an application-specific processor (ASP) optimized for signal processing, with on-chip memory and peripherals. A 32-bit floating-point DSP, architecturally optimized to process data of different bit formats (8-bit, 16-bit, 32-bit), has been chosen. Thanks to this architecture it can efficiently process both the 8-bit image-pixel data and the 16-bit voice samples, while the 32-bit floating-point format ensures the required computing precision for both the image and the speech processing.
3 Methods
3.1 Feature Extraction Layer
The feature extraction layer implements the biometric information capture (voiceprint and fingerprint) and the signal processing algorithm (hard computing) methods for feature extraction.
[Figure: block diagram of the system. Hard features from the voiceprint and fingerprint feed the hard computing matchers, producing the voiceprint and fingerprint scores; soft features (total area and mean intensity, among others) feed the Artificial Neural Network, producing the soft score; all scores feed the Fuzzy Logic Engine, which outputs the authentication score]
3.1.2 Feature Extraction
Features (hard and soft) are extracted from the captured voiceprint and fingerprint by applying the hard computing algorithms. Data are assembled into structures suitable for processing at the pattern matching layer.
Voiceprint hard features are structured as time-sequenced vectors of data:
RMS(N)
ZCR(N)
AC(N)
CLPC(N)
where N is the vector length and each vector element is the feature measurement at the window application time.
Fingerprint hard features are structured as sequences of minutiae:
MINUTIAE(M)
where M is the vector length and each vector element is the minutiae data.
From the voiceprint, the two soft features are measured as:
SPEED
STRESS
Both are scalar data (one measurement executed on the whole captured voiceprint data stream).
From the fingerprint, the two soft features are measured as:
TOTAL_AREA
MEAN_INTENSITY
Both are scalar data (one measurement executed on the whole captured fingerprint data array).
3.2 Matching Layer
The matching layer implements the hard and soft computing scoring systems. Each applies the appropriate algorithms for scoring the input data against a set of templates belonging to the persons to be identified. This layer runs in two different modes, enrolling and identifying.
The enrolling mode is active when a new person is to be added to the set of accepted persons. The identifying mode is active when a person (allowed or not) accesses the system by furnishing a voiceprint and fingerprint. Both modes run the same pattern matching algorithms: the first to store the reference templates, the second to execute the scoring.
3.2.1 Enrollment Mode
Voiceprint enrollment consists in storing, as multiple templates, the features measured each time the allowed person utters a specific (requested) vocal sequence at the microphone. Fingerprint enrollment consists in storing, as multiple templates, the features measured each time the allowed person places the requested finger on the sensor.
3.2.2 Identifying Mode
The voiceprint and fingerprint identifying mode runs the pattern matching algorithms to score the input hard features against the stored templates. For each input (voiceprint and/or fingerprint) a score is available at the matching layer output.
3.2.3 Soft Biometric Training and Scoring
The soft biometric scoring is executed by running the FFBP-ANN: it takes the soft biometric features as inputs and outputs a score according to what it has learned about the soft features belonging to the authorized person. A training phase is required to embed this knowledge in the FFBP-ANN nodes (neurons). Training is run each time the soft features at the FFBP-ANN inputs belong to an authorized person: explicitly during enrollment (training mode), and implicitly each time the person is identified as authorized (evolving mode). Learning is executed by running the error back-propagation algorithm.
Back-propagation networks (BPNs) are trained according to a generalized least-mean-square (LMS) algorithm:

    w_j(k+1) = w_j(k) + \eta\, \left( x^t(k) - x(k) \right) f_j(k)    (10)

The weights w_j(k) are modified by the k-th input activity pattern f_j(k), so that an updated weight w_j(k+1) is available at the (k+1)-th input time. The modification is proportional to the difference between the target response x^t(k) and the current response x(k). The constant η controls the learning rate and lies in the range 0 < η < 1.
During learning, the difference between the target activity and the current activity at the output layer indicates the learning error. It is measured as the total error E:

    E = \sum_{k=1}^{K} \frac{1}{2} \sum_{i=1}^{N} \left( x_i^t(k) - x_i^L(k) \right)^2    (11)

This is the total error measured at the N nodes of the output layer L after the K patterns of the training set have been applied. When E is at its minimum, the weights at the end of training are the best for the L-layer BPN.
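Eqs. (10) and (11) in code form; this is a single delta-rule step and the summed error, not a full back-propagation implementation.

```python
import numpy as np

def lms_update(w, target, output, phi, eta=0.1):
    """Eq. (10): w_j(k+1) = w_j(k) + eta * (x^t(k) - x(k)) * f_j(k)."""
    return w + eta * (target - output) * phi

def total_error(targets, outputs):
    """Eq. (11): E = sum_k 0.5 * sum_i (x_i^t(k) - x_i^L(k))^2."""
    t, o = np.asarray(targets), np.asarray(outputs)
    return 0.5 * float(np.sum((t - o) ** 2))

# One update step: error 1.0 scaled by eta and the input activities
w = lms_update(np.zeros(2), target=1.0, output=0.0,
               phi=np.array([1.0, 0.5]), eta=0.5)
E = total_error([[1.0, 0.0]], [[0.5, 0.5]])
```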
Applying the best trained weight set to the three-layer FFBP (L = 3) enables this ANN to perform the scoring of the biometric features. The ANN also executes the data fusion among the soft biometric data from voiceprint and fingerprint.
3.3 Decision Layer
The decision layer implements the data fusion and decision fusion of the information from the matching layer. This is executed by the FLE, which evaluates four kinds of data input:
●● Voiceprint score.
●● Fingerprint score.
●● Soft-biometric measurements.
●● Soft-biometric score.
Prior to a run, the FLE needs to be tuned for processing the data inputs and producing the decision. Membership functions and the rule set have to be designed using the modeling and simulation toolbox.
Input data are fuzzified using optimally tuned membership functions (Figs. 6, 7, and 8a). Singleton membership functions are applied for the rule consequents (Fig. 8b).
The rules are tuned by combining the fuzzified inputs as follows (only the most significant are reported):
1. IF voiceprint_score IS high AND fingerprint_score IS high AND
soft_score IS high THEN authentication IS very high.
2. IF voiceprint_score IS medium AND fingerprint_score IS
medium AND soft_score IS high THEN authentication IS high.
3. IF voiceprint_score IS medium AND speech_speed IS high AND
soft_score IS medium THEN authentication IS high.
[Figs. 6 and 7: five-level membership functions (very low, low, medium, high, very high) for the voiceprint score, voice speed, voice stress, fingerprint score, fingerprint speed, and fingerprint stress inputs]
Fig. 8 (a) Membership functions to fuzzify inputs (soft-features score) and (b) to defuzzify outputs
(authentication)
4 Notes
References
1. Jain AK, Pankanti S, Prabhakar et al (2004) Biometrics: a grand challenge. Proc 17th international conference on pattern recognition (ICPR), vol 2, pp 935–942
2. Anil K, Jain AK (2012) Biometric recognition: an overview. In: Mordini E, Tzovaras D (eds) Second generation biometrics: the ethical, legal and social context. The International Library of Ethics, Law and Technology, vol 11. Springer, pp 49–79
3. Baird SL (2002) Biometrics: security technology. The Technology Teacher 61(5):18–22
4. Sujithra M, Padmavathi G (2012) Next generation biometric security system: an approach for mobile device security. Proc second international conference on computational science, engineering and information technology. ACM, New York, NY, pp 377–381
5. Corcoran P, Cucos A (2005) Techniques for securing multimedia content in consumer electronic appliances using biometric signatures. IEEE Trans Consumer Elect 51:545–551
6. Jain AK, Ross A, Pankanti S (2006) Biometrics: a tool for information security. IEEE Trans Inf Forensics Security 1(2):125–143
7. Jain AK, Nandakumar K, Lu X et al (2004) Integrating faces, fingerprint, and soft biometric traits for user recognition. Proc biometric authentication workshop, LNCS 3087, Prague, pp 259–269
8. Hong A, Jain S, Pankanti S (1999) Can multibiometrics improve performance? Proc AutoID'99, Summit, NJ, pp 59–64
9. Anil J, Lin H, Sharath P (2000) Biometric identification. Commun ACM 43(2):90–98
10. Cappelli R, Maio D, Maltoni D et al (2006) Performance evaluation of fingerprint verification systems. IEEE Trans Pattern Anal Mach Intell 28:3–18
11. Bishop C (1995) Neural networks for pattern recognition. Oxford University Press, Oxford
12. Ciota Z (2001) Improvement of speech processing using fuzzy logic approach. Proc IFSA World congress and 20th NAFIPS int conf, vol 2
13. Bosteels RTK, Kerre EE (2007) Fuzzy audio similarity measures based on spectrum histogram and fluctuation patterns. Proc int conf multimedia and ubiquitous engineering 2007, Seoul, Korea, 27–28 April
14. Malcangi M (2002) Soft-computing approach to fit a speech recognition system on a single-chip. Proc 2002 international workshop on system-on-chip for real-time applications, Banff, Canada, 6–7 July
15. Malcangi M (2004) Improving speech endpoint detection using fuzzy logic-based methodologies. Proc thirteenth Turkish symposium on artificial intelligence and neural networks, Izmir, Turkey, pp 10–11
16. Wahab A, Ng GS, Dickiyanto R (2005) Speaker authentication system using soft computing approaches. Neurocomputing 68:13–17
17. Kar B, Kartik B, Dutta PK (2006) Speech and face biometric for person authentication. Proc IEEE international conference on industrial technology, India, pp 391–396
18. Jang-Hee Y, Jong-Gook K et al (2007) Design of embedded multimodal biometric systems. Proc int conf on signal image technologies and internet based systems, SITIS 2007, pp 1058–1062
19. Jain AK, Dass SC, Nandakumar K (2004) Soft biometric traits for personal recognition systems. Proc international conf on biometric authentication, LNCS, vol 3072. Hong Kong, pp 731–738
20. Pak-Sum Hui H, Meng HM, Mak M (2007) Adaptive weight estimation in multi-biometric verification using fuzzy logic decision fusion. Proc IEEE international conference on acoustics, speech, and signal processing ICASSP'07, vol 1, pp 501–504
21. Runkler TA (1997) Selection of appropriate defuzzification methods using application specific properties. IEEE Trans Fuzzy Sys 5:72–79
22. Abiyev RH, Altunkaya K (2007) Neural network based biometric personal identification. Frontiers in the convergence of bioscience and information technologies, pp 682–687
23. O'Shaugnessy D (1987) Speech communication: human and machine. Addison-Wesley, Reading, MA
Chapter 14
Abstract
To behave in a robust and adaptive way, animals must extract task-relevant sensory information efficiently.
One way to understand how they achieve this is to explore regularities within the information animals
perceive during natural behavior. In this chapter, we describe how we have used artificial neural networks
(ANNs) to explore efficiencies in vision and memory that might underpin visually guided route naviga-
tion in complex worlds. Specifically, we use three types of neural network to learn the regularities within
a series of views encountered during a single route traversal (the training route), in such a way that the
networks output the familiarity of novel views presented to them. The problem of navigation is then
reframed in terms of a search for familiar views, that is, views similar to those associated with the route.
This approach has two major benefits. First, the ANN provides a compact holistic representation of the
data and is thus an efficient way to encode a large set of views. Second, as we do not store the training
views, we are not limited in the number of training views we use and the agent does not need to decide
which views to learn.
Key words Visual navigation, Computational neuroscience, Machine learning, Boosting, Restricted
Boltzmann machine, Infomax
1 Introduction
All animals are faced with the challenge of extracting useful infor-
mation from the environment and using that information to shape
their behavior in an adaptive way. A major goal for neuroscience is
to understand how neural circuits are organized to achieve that
process. This goal depends on understanding the information pro-
vided by the environment as well as the information required for a
particular behavior. For many animals, the primary sensory channel
is vision and scientists have long sought to understand how vision
drives behavior and underpins perception by asking questions such
Hugh Cartwright (ed.), Artificial Neural Networks, Methods in Molecular Biology, vol. 1260,
DOI 10.1007/978-1-4939-2239-0_14, © Springer Science+Business Media New York 2015
228 Andrew Philippides et al.
as “What does the frog’s eye tell the frog’s brain?”¹ or “How do
humans recognize their grandmother?”²
If we take the example of a very simple visually guided behav-
ior, we can begin to understand the style of information process-
ing that we see in biological systems. Many animals perform
phototaxis, where they try to move in the direction of the brightest
light, for instance to gain warmth or rise to the surface of the
sea. For some microorganisms, such as zooplankton, we know in
detail the visual circuits that produce this behavior. They have two
photoreceptors which detect light from one side of the body only.
These photoreceptors are directly connected to motile hair cells
[1] which beat rhythmically to propel the organism through the
world. Light on one side of the organism stimulates one photore-
ceptor, which inhibits the movement of the hair cells on the same
side of the organism. The undamped beating from hairs on the
“dark” side of the body will ensure a turn towards the light.
Evolution tends not to produce overelaborate solutions; we see
that the zooplankton has a minimal yet perfectly efficient sensory
circuit for the task in hand.
For more complex visually guided behaviors, however, it is
nontrivial to understand what an efficient sensory system would
look like and we are faced with the task of being objective about
animals with visual systems that are very different from ours, who
have different perspectives on the world [2, 3] and different
behavioral requirements. As it is objective in terms of the way it
treats sensory data, a good first step in this endeavor is to try to
explore regularities in the sensory information perceived by animals
during their natural behavior.
One available method, which can be used to objectively explore
sensory data, is to use artificial neural networks (ANNs), or machine
learning more broadly, to explore the information available in a
raw sensory array. Such uses of ANNs provide a mapping from the
sensory environment to behavior. Essentially ANNs can be used to
interrogate complex data and find the regularities and affordances
that biological agents might well be tuned to (rather than being
used as explicit models of animal brains). Here, we present case
studies describing how we have used ANNs to understand how
animals might use visual information for navigation.
¹ Here Lettvin and colleagues found that parts of the frog’s visual system are tuned to detect small moving objects. Thus the eye provides information to the frog about where to direct the tongue in the hope of catching a fly. McCulloch and Pitts, authors of this paper, are commonly held up as pioneers of neural network research. J. Y. Lettvin, H. R. Maturana, W. S. McCulloch, and W. H. Pitts, “What the Frog’s Eye Tells the Frog’s Brain,” Proc. IRE 47 (1959) 1940–1951.
² Lettvin again came up with the concept of a sparse code in the brain, where small numbers of individual neurons may be highly selective for concepts such as one’s grandmother. See Quian-Quiroga, R., Fried, I., and Koch, C. (2013) Brain Cells for Grandmother. Scientific American, Feb.
Visual Navigation 229
Fig. 1 (a) The world as seen by a human eye. The human field of view is quite large, almost 120° wide. However, we only have high-resolution vision at the center of our visual field, which we make up for by moving our eyes. (b) An ant’s eye view of the world. Ants have a horizontal field of view spanning almost 360°, but with homogeneous low resolution. Notice how you cannot recognize the house in the ant’s eye image. This picture shows a vertical angle of 45°, but ants do see at higher elevations, from which they extract celestial compass information
2 Our Approach
Insight: We can use the views experienced in the first route traversal
as training data for an ANN designed to perform familiarity judge-
ments. After training the ANN will be able to output a familiarity
score for any novel view presented to it.
The fact that ants are moving, and therefore facing, in the cor-
rect direction most of the time during the first route traversal
means that these training views implicitly define the movement
directions required to stay on the route. Thus, it is sufficient to
learn the views as they are experienced. After the initial route
learning, the problem of navigation is reframed as a search for
views similar to those associated with the route. If a novel view is
familiar, it is likely that a similar view has been experienced before
and so the ant is likely to be close to the route and heading in an
appropriate direction. So, by intermittently scanning the environ-
ment and moving in the direction that gives the most familiar view
an ant can reliably retrace its route.
To learn a route we thus train an ANN to learn the regularities
within a series of views in such a way that it can make a familiarity
judgement about any new view presented to it. This approach has
two major benefits. First, an ANN is an efficient way to encode a
series of views. We do not attempt to store every view experienced
along the route, but instead use them to train the ANN. Second,
and as a corollary, because we are not storing all the views used
for training, we are not limited in the amount of training views we
can use and so the agent does not need to decide when or which
views to learn.
There are many different possibilities for training ANNs for
this task. The only prerequisite for our approach is that the ANN
should provide a measure of the familiarity of a presented view. In
our case we also wanted biological plausibility, in the sense that the
vision, memory size, and learning behavior should be plausibly
achieved by an ant. These requirements have driven our research
trajectory.
are part of the route and views that are not part of the route. The
positive examples are simply the forward-facing views experienced
along the route. The negative views consisted of views from the
route taken facing to the left and right of the direction of move-
ment at an angle of 45° relative to the route heading (Fig. 2a).
Classifying views is a difficult task because of the high dimen-
sionality of the input if one adopts a pixel-by-pixel representation. In
order to make learning more tractable one can project this high-
dimensional space into a lower dimensional space by passing views
through a set of visual filters. Following Viola and Jones [17] we
chose to construct a classifier using randomly generated filters com-
posed of Haar-like features. Haar-like features act like edge detectors
or crude approximations to Gabor filters and are maximally activated
at high-contrast boundaries in the image (Fig. 2b). The set of out-
puts from these feature detectors form the basis of our image repre-
sentation and are used to train a classifier via the Adaboost learning
algorithm, a commonly used variant of boosting [18].
Boosting is a supervised learning technique for constructing a
strong classifier from a set of weak classifiers given a training set of
labelled positive and negative examples. A weak classifier is one
that performs only slightly better than chance. Conversely, a strong
classifier is one that performs arbitrarily well. A strong classifier is
constructed from a linear weighted combination of the outputs of
weak classifiers. In our context, each visual filter is a candidate weak
classifier, hj(x), and consists of a Haar feature fj, a threshold θj, and
a parity pj that determines whether the output should be greater or
less than the threshold
ì1 if p j f j ( x ) < p j q j
hj (x ) = í (1)
î0 else
This process is illustrated in Fig. 2c–d. We initially generate a large
pool (5,000) of such filters of random types and sizes, at random
positions in the visual field. We then use Adaboost to select
and combine a prespecified number, T, of these weak classifiers into
a strong classifier. By specifying T, we thus control the complexity
of the classifier. As an aside we highlight an interesting outcome of
the boosting process. Because Adaboost starts with a large pool of
feature detectors, and then performs feature selection to pick out
and use only those features that are most useful for the current
classification problem, we can interrogate sets of filters to see if
they relate to biologically salient features or portions of the world.
The Adaboost algorithm works as follows. At each iteration, the training data x_i (m on-route and l off-route views) are weighted according to a distribution that indicates the current importance of each example in the dataset, with initial weights w_{0,i} set to 1/(2m) or 1/(2l) for on- and off-route views, respectively. The weak classifiers are evaluated according to the weighted error
Fig. 2 Using an ANN as a classifier of on- and off-route views. (a) Training images are collected in three directions
every 4 cm along the training route. A forward-facing image is collected as an example of an on-route view and two
off-route views are also collected at ±45° relative to the route heading. (b) Examples of the four different classes of
Haar-like feature detectors, from left to right: a single block, edge detector, Gabor-like edge detector, and diagonal
edge detector. Each feature detector can vary in size, shape, and position. As images are panoramic, feature detectors
wrap around if they extend beyond the left or right edge of the image. White is a weight of 1, grey 0, and black
−1. (c–d) Constructing a weak classifier using a single Haar-like feature. Each training image is convolved with the
feature detector to determine an optimal threshold which determines the output for unseen data. (c) Distribution
of feature detector outputs for the two classes for a Haar feature f. A threshold, θ, is determined that optimally separates
the distributions (vertical bar). (d) The final weak classifier is defined by a feature detector, a threshold, and a
parity. For the example shown, the output of the weak classifier, h(x), for the image, x, is 1 if the feature detector
output exceeds a threshold of 200, or 0 otherwise. (e–f) Learning a circuit of the gantry workspace. The dark line
represents the training route. (e) The gantry workspace as viewed from above. (f) Performance of the algorithm from
ten different start positions (dashed lines) using a boosted classifier with 50 features
234 Andrew Philippides et al.
ε_{t,j} = Σ_{i=1}^{m+l} w_{t,i} |h_j(x_i) − y_i|    (2)
where y_i is 1 for on-route and 0 for off-route views. If the minimum
error across the classifiers at iteration t, ε_t, is 0, the algorithm
stops. Otherwise, the classifier associated with the lowest error is
denoted h_t, added to the strong classifier, and removed from the
pool of features. The weights associated with the training data are
then updated according to
w_{t+1,i} = w_{t,i} ( ε_t / (1 − ε_t) )^{1−c_i}    (3)
where c_i = 0 if x_i is classified correctly and 1 if not, with weights
normalized after each iteration, so they sum to 1. Essentially, this
process increases the weights of incorrectly classified views relative
to correctly classified views, thereby encouraging the next weak
classifier to focus on examples that were incorrectly classified at
the last iteration. Weak classifiers are added until the overall clas-
sification performance exceeds a threshold or the maximum num-
ber of weak classifiers, T, is reached. The final classification is then
H(x) = sign( Σ_{t=1}^{T} α_t h_t(x) )    (4)
During route recapitulation, views sampled over a range of headings
are passed into the classifier, and all of the viewing directions that
produce a positive classification contribute to a weighted average
that determines the direction of travel; a 5 cm step is then made in
this direction. The process is then iterated.
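The boosting loop of Eqs. 2–4 can be sketched as follows. This is a schematic reimplementation, not the authors' code: the feature detectors are passed in as plain weight vectors, α_t is taken as log((1 − ε_t)/ε_t) as in standard AdaBoost (the excerpt does not define it), and the final vote of Eq. 4 is implemented with the usual half-sum decision threshold.

```python
import numpy as np

def train_adaboost(X, y, features, T=50):
    """AdaBoost over thresholded feature detectors.
    X: (n, d) view vectors; y: 1 = on-route, 0 = off-route.
    features: list of (d,) detector weight vectors (e.g. Haar-like masks).
    Returns a list of weak classifiers (feature, theta, parity, alpha)."""
    y = np.asarray(y)
    m, l = np.sum(y == 1), np.sum(y == 0)
    # initial weights: 1/(2m) for on-route views, 1/(2l) for off-route views
    w = np.where(y == 1, 1.0 / (2 * m), 1.0 / (2 * l))
    pool = list(range(len(features)))
    strong = []
    for _ in range(T):
        if not pool:
            break
        best = None
        for j in pool:
            out = X @ features[j]                         # feature detector output
            for theta in np.unique(out):
                for parity in (1, -1):
                    h = (parity * out > parity * theta).astype(int)
                    err = np.sum(w * np.abs(h - y))       # weighted error (Eq. 2)
                    if best is None or err < best[0]:
                        best = (err, j, theta, parity, h)
        err, j, theta, parity, h = best
        if err == 0:                                      # perfect weak classifier: stop
            strong.append((features[j], theta, parity, 1.0))
            break
        beta = err / (1.0 - err)
        strong.append((features[j], theta, parity, np.log(1.0 / beta)))
        pool.remove(j)                                    # remove feature from the pool
        w = w * np.where(h == y, beta, 1.0)               # Eq. 3: c_i = 0 if correct
        w = w / w.sum()                                   # normalize so weights sum to 1
    return strong

def classify(x, strong):
    """Strong classifier of Eq. 4, with the usual half-sum decision threshold."""
    vote = sum(a * int(p * (x @ f) > p * t) for f, t, p, a in strong)
    return int(vote > 0.5 * sum(a for _, _, _, a in strong))
```

On a toy problem where a single pixel separates the classes, the loop recovers that pixel's detector as its first weak classifier and stops.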
As can be seen (Fig. 2f), the approach works well and robust
route navigation is achievable. As is typical of ants, the results show
that the agent navigates along a corridor defined by the training
path but influenced by noise and the environment. It was also
noted that the routes "smoothed out" some of the kinks present
in the original training run. While results depended on the number
of classifiers used (more classifiers resulted in better performance),
near-perfect results were achieved with 50 classifiers, and very
robust results could still be achieved with only 20. We have thus
shown that it is possible to learn a nontrivial route through an
environment using a simple view-classification strategy based on
positive and negative views collected during a single episode of
learning. By using an ANN as a classifier, we reframed the problem
of route navigation as a search for familiar views. The benefits of
the ANN are that it provides both a compact, holistic representation
of the information required for route navigation and a measure of
the expected uncertainty of the classification.
We have thus demonstrated the principle that ANNs can be used
to learn a holistic representation of the views that determine a route.
Secondly, we have shown that a coarse visual system can encode a
complex world because it is used to drive a specific behavior through
familiarity recall rather than object or place recognition.
Fig. 3 (a) The simulation environment. Top left : A typical simulated environment. Middle : High-resolution view
of the world from an ant’s perspective. View is panoramic, so forward is middle of the row and the view wraps
round from left to right. Bottom : Low-resolution representation of the view shown in middle panel. (b)
Architecture of a Restricted Boltzmann Machine. The visible nodes are associated with the input data and the
hidden nodes with the hidden variables of the model to be learnt. (c) Left: The training route taken through a
complex desert ant habitat. Upper right: A zoomed-in view of the familiarity landscape produced by the trained
network. Right: Ten recapitulated routes through the environment
Visual Navigation 237
P(v, h) = e^{−E(v,h)} / Z    (7)
where Z is a normalization constant. Thus, the probability assigned
to a single data point, v, is found by summing over all possible hid-
den vectors as shown in Eq. 8
P(v) = Σ_h P(v, h) = ( Σ_h e^{−E(v,h)} ) / Z    (8)
Following a maximum likelihood principle, training an RBM con-
sists of increasing the probability that the network assigns to the
observed data by adjustment of the weights and biases. This is
achieved by decreasing the energy of the observed data and increas-
ing the energy of unobserved configurations using some kind of
gradient descent algorithm [20, 21]. There are many possibilities,
but in the current work we employ an efficient approach called Fast
Persistent Contrastive Divergence; full details of the training pro-
cedure can be found in Hinton [21, 24] and its application to the
ant problem in Baddeley et al. [25].
Having trained an RBM, the probability of a visible datum (in
our case the visual scene) is found by summing over all possible
configurations of the hidden units as described by Eq. 8.
Unfortunately this computation is intractable for large h. However,
we can calculate the probability of a visible datum up to a normal-
ization constant by using an alternative, tractable, formulation for
the probability of a datum expressed in terms of the free energy,
F(v), as described by Eq. 9:
F(v) = −log Σ_h e^{−E(v,h)}    (9)
which allows us to write P(v) = e^{−F(v)} / Z′, where Z′ is a normalization
constant. For an RBM the free energy has an analytic form
given by Eq. 10:
F(v) = −Σ_i a_i v_i − Σ_j log(1 + e^{x_j})    (10)
where x_j = b_j + Σ_i w_{ij} v_i. Since we wish to compare the relative
probabilities of different views, the normalization constants cancel
out. This allows us to directly compare the probabilities that each
of the views in a scan is part of the training route by simply calcu-
lating their free energies.
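The free-energy comparison of Eqs. 9–10 is cheap to compute. The sketch below assumes a binary RBM with visible biases a, hidden biases b, and weight matrix W (the variable names and function names are ours):

```python
import numpy as np

def free_energy(v, a, b, W):
    """RBM free energy of a visible vector (Eq. 10):
    F(v) = -sum_i a_i v_i - sum_j log(1 + exp(x_j)),
    with x_j = b_j + sum_i W_ij v_i."""
    x = b + v @ W                           # hidden-unit inputs x_j
    return -np.dot(a, v) - np.sum(np.log1p(np.exp(x)))

def most_familiar(views, a, b, W):
    """Return the index of the view with the highest unnormalized probability,
    i.e. the lowest free energy; the constant Z' cancels when comparing views."""
    F = [free_energy(v, a, b, W) for v in views]
    return int(np.argmin(F))
```

Because only relative probabilities matter, choosing the lowest free energy in a scan is equivalent to choosing the most probable (most familiar) view.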
The routes we produced can be seen in Fig. 3c. As before, there
is a broad route corridor defined by the training data and the structure
of the environment. As with the classifier approach, this shows that an
ANN-based holistic memory can achieve robust route homing.
Fig. 4 (a) The two-layered network used with the Infomax learning rule. Circles
represent units and arrows denote connections between the input units on the top
and the novelty units on the bottom. There is no output from the network as such,
since the response of the network is a function of the novelty unit activations. (b)
Navigating using a trained ANN to assess scene familiarity. Three separate routes
(thick grey lines) learned in an environment containing both small and large objects.
For each of the three routes, which consisted of between 700 and 980 views taken
every 1 cm, we show three recapitulations (thin black lines). During route recapitulations
the headings at each step were subject to normally distributed noise
where x_j is the value of the jth input pixel. Weights are initialized
randomly from a uniform distribution in the range [−0.5, 0.5] and
then normalized so that the weights feeding into each novelty unit
have mean 0 and standard deviation 1. The Infomax
approach [27] decomposes each view into a fixed number of com-
ponents (determined by the number of hidden units in the net-
work) which remains constant, independent of the number of
views experienced. It does this by adjusting the weights so as to
maximize the information that the novelty units provide about the
input, by following the gradient of the mutual information. Since
two novelty units that are correlated carry the same information,
adjusting weights to maximize information will tend to decorrelate
the activities of the novelty units and the algorithm can thus be
used to extract independent components from the training data
[28]. The core update equation is
Δw_{ij} = (η/N) [ w_{ij} − (y_i + h_i) Σ_{k=1}^{N} h_k w_{kj} ]    (12)
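A single Infomax update (Eq. 12) might be sketched as below. Since the equation defining the unit activations falls outside this excerpt, we assume, as in the Infomax familiarity literature, that h = Wx and y = tanh(h); the initialization follows the description above, and the function names are ours.

```python
import numpy as np

def init_weights(n_novelty, n_pixels, rng):
    """Uniform [-0.5, 0.5] weights, then each novelty unit's incoming
    weights are normalized to mean 0 and standard deviation 1."""
    W = rng.uniform(-0.5, 0.5, (n_novelty, n_pixels))
    return (W - W.mean(axis=1, keepdims=True)) / W.std(axis=1, keepdims=True)

def infomax_update(W, x, eta=0.01):
    """One Infomax update (Eq. 12) for a single input view x:
    dW_ij = (eta / N) * (W_ij - (y_i + h_i) * sum_k h_k W_kj)."""
    N = W.shape[0]                 # number of novelty units
    h = W @ x                      # novelty unit inputs (assumed form)
    y = np.tanh(h)                 # novelty unit activations (assumed form)
    dW = (eta / N) * (W - np.outer(y + h, h @ W))
    return W + dW
```

Repeated application over the training views decorrelates the novelty units, as described in the text.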
with a standard deviation of 15°. (c) Including learning walks prevents return
paths from overshooting the goal. Without a learning walk the simulated ant over-
shoots and carries on in the direction it was heading as it approached the nest
location. By including the views experienced during a learning walk the simulated
ant, instead of overshooting, is repeatedly drawn back to the location of the
nest. Thick grey line: training path; thin black lines: recapitulations
References
1. Jékely G, Colombelli J, Hausen H et al (2008) Mechanism of phototaxis in marine zooplankton. Nature 456(7220):395–399
2. von Uexküll J (1931) Der Organismus und die Umwelt. In: Driesch H, Woltereck H (eds) Das Lebensproblem im Lichte der modernen Forschung. Quelle und Meyer, Leipzig, pp 189–224
3. Nagel T (1974) What is it like to be a bat? Philos Rev 83:435–450
4. Shettleworth SJ (2010) Clever animals and killjoy explanations in comparative psychology. Trends Cogn Sci 14(11):477–481. doi:10.1016/j.tics.2010.07.002
5. Wehner R (2009) The architecture of the desert ant’s navigational toolkit (Hymenoptera: Formicidae). Myrmecological News 12:85–96
6. Hölldobler B (1990) The ants. Harvard University Press, Cambridge, MA
7. Collett TS, Graham P, Harris RA et al (2006) Navigational memories in ants and bees: memory retrieval when selecting and following routes. Adv Study Behav 36:123–172. doi:10.1016/S0065-3454(06)36003-2
8. Zeil J, Hofmann MI, Chahl JS (2003) The catchment areas of panoramic snapshots in outdoor scenes. J Opt Soc Am A Opt Image Sci Vis 20:450–469
9. Philippides A, Baddeley B, Cheng K et al (2011) How might ants use panoramic views for route navigation? J Exp Biol 214(3):445–451. doi:10.1242/Jeb.046755
10. Wystrach A, Philippides A, Aurejac A et al (2014) Visual scanning behaviours and their role in the navigation of the Australian desert ant Melophorus bagoti. J Comp Physiol A:1–12
11. Mangan M, Webb B (2012) Spontaneous formation of multiple routes in individual desert ants (Cataglyphis velox). Behav Ecol 23(5):944–954. doi:10.1093/beheco/ars051
12. Wystrach A, Beugnon G, Cheng K (2012) Ants might use different view-matching strategies on and off the route. J Exp Biol 215(1):44–55. doi:10.1242/Jeb.059584
13. Collett M (2010) How desert ants use a visual landmark for guidance along a habitual route. Proc Natl Acad Sci U S A 107(25):11638–11643. doi:10.1073/pnas.1001401107
14. Graham P, Cheng K (2009) Ants use the panoramic skyline as a visual cue during navigation. Curr Biol 19(20):R935–R937
15. Wystrach A, Graham P (2012) What can we learn from studies of insect navigation? Anim Behav 84(1):13–20. doi:10.1016/j.anbehav.2012.04.017
16. Baddeley B, Graham P, Philippides A et al (2011) Holistic visual encoding of ant-like routes: navigation without waypoints. Adapt Behav 19(1):3–15. doi:10.1177/1059712310395410
17. Viola P, Jones M (2001) Rapid object detection using a boosted cascade of simple features. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), vol 1, pp I-511–I-518
18. Freund Y, Schapire R, Abe N (1999) A short introduction to boosting. J Jpn Soc Artif Intell 14(771–780):1612
19. Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507
20. Smolensky P (1986) Information processing in dynamical systems: foundations of harmony theory. In: Rumelhart D, McClelland J, the PDP Research Group (eds) Parallel distributed
Abstract
Prediction of the future value of a variable is of central importance in a wide variety of fields, including
economics and finance, meteorology, informatics, and, last but not least, medicine. For example,
in the therapy of Type 1 Diabetes (T1D), in which, for patient safety, glucose concentration in the blood
should be maintained within a defined normoglycemic range, the ability to forecast glucose concentration in
the short term (with a prediction horizon of around 30 min) might be sufficient to reduce the incidence
of hypoglycemic and hyperglycemic events. Neural Network (NN) approaches are suitable for prediction
purposes because of their ability to model nonlinear dynamics and to handle input signals from
different domains. In this chapter we illustrate the design of a jump NN glucose prediction algorithm
that exploits past glucose concentration data, measured in real-time by a minimally invasive continuous
glucose monitoring (CGM) sensor, and information on ingested carbohydrates, supplied by the patient
himself or herself. The methodology is assessed by tuning the NN on data of ten T1D individuals and then
testing it on a dataset of ten different subjects. Results with a prediction horizon of 30 min show that
prediction of glucose concentration in T1D via NN is feasible and sufficiently accurate. The average time
anticipation obtained is compatible with the generation of preventive hypoglycemic and hyperglycemic
alerts and the improvement of artificial pancreas performance.
Key words Type 1 diabetes, Forecast, Continuous glucose monitoring, Nonlinear modeling, Time
series modeling
1 Introduction
Hugh Cartwright (ed.), Artificial Neural Networks, Methods in Molecular Biology, vol. 1260,
DOI 10.1007/978-1-4939-2239-0_15, © Springer Science+Business Media New York 2015
245
246 Chiara Zecchin et al.
Fig. 1 Representative CGM signal (black dots linearly interpolated) measured by the Dexcom SEVEN PLUS
device and information on insulin doses (green stems) and meals (blue stems) of a T1D subject
4.1 Inputs Selection and Preprocessing
Inputs were selected through a multistep procedure, using data drawn from the training set.
1. First, a pool of possible candidate input signals was chosen,
based on a priori physiological knowledge of the glucose-insulin
system and of diabetes. In detail, the history of the
glucose concentration signal, measured by the CGM sensor,
Fig. 2 Block scheme of the proposed jump NN model for glucose prediction. z^{−k} indicates the k-step backward
shift operator
(1/N)(1 − z^{−N}) RaG(n) = ( RaG(n) − RaG(n − N) ) / N    (1)
with z^{−k} indicating the k-step backward shift operator, n the current
time instant, and N equal to six steps.
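The filter of Eq. 1 is a plain backward difference averaged over N samples; a minimal sketch (the function name is ours):

```python
import numpy as np

def mean_rate(ra_g, N=6):
    """Apply (1/N)(1 - z^-N) to a signal (Eq. 1): the average increment
    over the last N samples, i.e. (RaG(n) - RaG(n - N)) / N.
    The first N outputs are undefined and returned as NaN."""
    ra_g = np.asarray(ra_g, dtype=float)
    out = np.full_like(ra_g, np.nan)
    out[N:] = (ra_g[N:] - ra_g[:-N]) / N
    return out
```

For a linear ramp the filter simply returns the slope, which is the intended use: a smoothed rate of glucose appearance.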
Several preprocessing techniques are commonly applied before
the data are used for training the NN to accelerate convergence
and to ease the problem to be learned [37]. An essential operation
is a rough scaling of the data between the lower and upper bounds
of the neuron transfer function. We normalized inputs and output
so that they had zero mean and standard deviation equal to 1. This
step guarantees that, at the beginning of the training procedure, all
the signals have, potentially, the same importance and they all
belong to the approximately linear range of the tangent sigmoidal
activation function. Data normalization also limits the extent to
which larger numbers may dominate smaller ones and avoids pre-
mature saturation of hidden nodes, which would degrade the
learning process.
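The normalization step can be sketched as below; the chapter does not state how the scaling statistics were applied to unseen data, so we assume the common choice of estimating them on the training set only (the function name is ours):

```python
import numpy as np

def standardize(train, test):
    """Zero-mean, unit-standard-deviation scaling per signal (column),
    with statistics estimated on the training data and reused on test data."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0)
    return (train - mu) / sigma, (test - mu) / sigma
```

After this step all inputs start with comparable magnitudes and fall in the approximately linear range of the tangent sigmoid, as discussed above.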
4.2 Mathematical Representation of the Jump NN Model
Let us define N_in as the number of input signals and N_hn as the number of hidden
neurons. The jump NN predicts a signal that may be expressed, in vector-matrix form, as
ŷ(n + N | n) = Ω X(n) + Ψ Φ( Γ X(n) )    (2)
where X(n) indicates the column vector of size [N_in + 1] of input
signals at time step n, plus an input equal to 1 associated with the
weight accounting for the bias:
X(n) = [ 1, y(n), Δy(n), RaG(n), (1/N)(1 − z^{−N}) RaG(n) ]^T    (3)
4.3 Structure Optimization
The number of hidden neurons and hidden layers was chosen with
k-fold cross-validation on the training set. Several candidate NNs
with one hidden layer and an increasing number of hidden
neurons were evaluated. NNs with two hidden layers were also
assessed, but they did not significantly outperform NN architectures
with a single hidden layer. Note that the selected model is
usually a compromise between performance and structural simplicity:
the NN with the lowest number of neurons that
significantly outperforms simpler structures is chosen. In our case,
the best and most parsimonious architecture was a NN with one
hidden layer of five neurons. Thus, since the number of input
signals is equal to 4 and there is 1 output, there are 31 free parameters
to tune during training.
4.4 NN Training Procedure
The NN weights were optimized on the training and validation set
(see Subheading 5.1 for details) and the architecture was then
tested on an independent test set. The NN was trained with the
backpropagation Levenberg–Marquardt algorithm, applied in
batch mode, i.e., weights and biases are updated only after all the
inputs and targets have been presented to the NN, thus on the
basis of the average error over the entire training set. The
minimized objective function J was the mean square error:
J = (1/N_TS) Σ_{i=1}^{N_TS} (1/M_i) (ŷ_i − y_i)^T (ŷ_i − y_i) = (1/N_TS) Σ_{i=1}^{N_TS} (1/M_i) Σ_{k=1}^{M_i} ( ŷ_i(k + N | k) − y_i(k + N) )²    (5)
with N_TS the number of training time series and M_i the length of
the ith training time series. The training procedure was halted
using early stopping [38]. Ordinarily, a NN learns in stages,
improving its performance on the training set as the training session
progresses towards a local minimum of the error surface.
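Early stopping itself is optimizer-agnostic; the sketch below abstracts the batch Levenberg–Marquardt step as a generic `train_step` callable (an assumption for illustration, not the authors' implementation):

```python
def train_with_early_stopping(train_step, val_error, params,
                              max_epochs=200, patience=6):
    """Generic early-stopping loop: training halts when the validation
    error has not improved for `patience` consecutive epochs, and the
    best parameters seen on the validation set are returned."""
    best, best_err, wait = params, val_error(params), 0
    for _ in range(max_epochs):
        params = train_step(params)          # one optimizer epoch
        err = val_error(params)
        if err < best_err:                   # validation error improved
            best, best_err, wait = params, err, 0
        else:
            wait += 1
            if wait >= patience:             # no improvement: stop training
                break
    return best, best_err
```

This is the standard way to avoid overtraining: the validation set, not the training error, decides when to stop.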
Glucose Prediction by Jump NN 253
5.1 Database
The NN was optimized and tested on data from 20 T1D patients,
monitored for 2 or 3 consecutive days in real-life conditions. Data
were collected during the DIAdvisor™ project [39]. The glucose
concentration was measured by the Dexcom SEVEN PLUS CGM
sensor, which has a sampling time Ts of 5 min. Information on
ingested CHO was extracted from pictures of meals taken by the
patients with a standard mobile phone camera.
The database was divided into a training and validation set,
consisting of ten time series, and a test set, formed by the other
ten time series. The training and validation set was further randomly
divided into the training set, constituted by 70 % of the data,
used for minimizing the prediction error, and the validation set,
constituted by the remaining 30 % of the data, used for stopping
the training procedure.
5.2 Assessment Metrics
The quality of prediction obtained with the proposed jump NN
algorithm is quantitatively assessed by computing three metrics
commonly used in the glucose prediction literature. The considered
indexes, which measure different merits of the predicted glucose
profile, are:
1. The Root Mean Square Error (RMSE) (mg/dl) between the
predicted time series and the original glucose time series measured
by the CGM sensor, calculated as
RMSE = sqrt( (1/M) Σ_{i=1}^{M} ( ŷ(i + N | i) − y(i + N) )² )    (6)
2. The time gain (TG), computed as the prediction horizon (PH)
minus the estimated delay of the predicted profile:
TG = PH − delay    (7)
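The two metrics of Eqs. 6–7 reduce to a few lines; here `y_pred[i]` holds the prediction of `y[i + N]` made at time i (an indexing convention we assume):

```python
import numpy as np

def rmse(y, y_pred, N):
    """RMSE of Eq. 6 between the N-step-ahead predictions and the
    CGM reference: y_pred[i] is the prediction of y[i + N] made at time i."""
    err = y_pred[:len(y) - N] - y[N:]
    return float(np.sqrt(np.mean(err ** 2)))

def time_gain(PH, delay):
    """Time gain of Eq. 7: prediction horizon minus the estimated delay."""
    return PH - delay
```

With PH = 30 min and, say, an estimated delay of 12 min, the time gain would be 18 min of usable anticipation.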
6 Results
Fig. 3 CGM profile (black) and prediction obtained with the proposed jump NN (blue line). Stems indicate CHO
ingestion; horizontal thin lines represent the hypoglycemic and hyperglycemic thresholds
Table 1
Results obtained on the ten test subjects (with PH = 30 min), average
values and standard deviation
7 Conclusions
References
1. De Gooijer J, Hyndman R (2006) 25 years of time series forecasting. Int J Forecasting 22(3):443–473
2. Kruppa J, Ziegler A, König I (2012) Risk estimation and risk prediction using machine learning methods. Hum Genet 131(7):1–16
3. Iasemidis L (2011) Seizure prediction and its applications. Neurosurg Clin N Am 22(4):489–513
4. Bequette BW (2010) Continuous glucose monitoring: real-time algorithms for calibration, filtering, and alarms. J Diabetes Sci Technol 4(2):404
5. Sparacino G, Zanon M, Facchinetti A et al (2012) Italian contributions to the development of continuous glucose monitoring sensors for diabetes management. Sensors 12(10):13753–13780
6. Sparacino G, Facchinetti A, Maran A et al (2008) Continuous glucose monitoring time series and hypo/hyperglycemia prevention: requirements, methods, open problems. Curr Diabetes Rev 4(3):181–192
7. Sparacino G, Facchinetti A, Cobelli C (2010) "Smart" continuous glucose monitoring sensors: on-line signal processing issues. Sensors 10(7):6751–6772
8. Buckingham B, Cobry E, Clinton P et al (2009) Preventing hypoglycemia using predictive alarm algorithms and insulin pump suspension. Diabetes Technol Ther 11(2):93–97
9. Buckingham B, Chase HP, Dassau E et al (2010) Prevention of nocturnal hypoglycemia using predictive alarm algorithms and insulin pump suspension. Diabetes Care 33(5):1013–1017
10. Bode B, Gross K, Rikalo N et al (2004) Alarms based on real-time sensor glucose values alert patients to hypo- and hyperglycemia: the guardian continuous monitoring system. Diabetes Technol Ther 6(2):105–113
11. Garcia A, Rack-Gomer AL, Bhavaraju NC et al (2012) Dexcom G4AP: an advanced continuous glucose monitor for the artificial pancreas. J Diabetes Sci Technol 7(6):1436–1445
12. Cobelli C, Renard E, Kovatchev B (2011) Artificial pancreas: past, present, future. Diabetes 60(11):2672–2682
13. Thabit H, Hovorka R (2012) Closed-loop insulin delivery in type 1 diabetes. Endocrinol Metab Clin North Am 41(1):105–117
14. Zecchin C, Facchinetti A, Sparacino G et al (2014) Jump neural network for online short-time prediction of blood glucose from continuous monitoring sensors and meal information. Comput Methods Programs Biomed 113(1):144–152
15. Cryer PE (2007) Hypoglycemia, functional brain failure, and brain death. J Clin Invest 117(4):868–870
16. Williams G, John CP (2004) Handbook of diabetes. Blackwell, Oxford
17. The American Diabetes Association (2013) Standards of medical care in diabetes: 2013. Diabetes Care 36(S1):S11–S66
18. Tamborlane WV, Beck RW, Bode BW (2008) Continuous glucose monitoring and intensive treatment of type 1 diabetes. N Engl J Med 359(10):1464–1476
19. Battelino T, Phillip M, Bratina N et al (2011) Effect of continuous glucose monitoring on hypoglycemia in type 1 diabetes. Diabetes Care 34(4):795–800
20. Deiss D, Bolinder J, Riveline JP et al (2006) Improved glycemic control in poorly controlled patients with type 1 diabetes using real-time continuous glucose monitoring. Diabetes Care 29(12):2730–2732
21. Reifman J, Rajaraman S, Gribok A et al (2007) Predictive monitoring for improved management of glucose levels. J Diabetes Sci Technol 1(4):478–486
22. Gani A, Gribok AV, Rajaraman J et al (2009) Predicting subcutaneous glucose concentration in humans: data-driven glucose modeling. IEEE Trans Biomed Eng 56(2):246–254
23. Sparacino G, Zanderigo F, Corazza S et al (2007) Glucose concentration can be predicted ahead in time from continuous glucose monitoring sensor time-series. IEEE Trans Biomed Eng 54(5):931–937
24. Eren-Oruklu M, Cinar A, Quinn L et al (2009) Estimation of the future glucose concentrations with subject specific recursive linear models. Diabetes Technol Ther 11(4):243–253
25. Finan DA, Doyle FJ, Palerm CC et al (2009) Experimental evaluation of a recursive model identification technique for type 1 diabetes. J Diabetes Sci Technol 5(3):1192–1202
26. Castillo-Estrada G, del Re L, Renard E (2010) Nonlinear gain in online prediction of blood glucose profile in type 1 diabetic patients. In: 49th IEEE Conference on Decision and Control (CDC), pp 1668–1673
27. Eren-Oruklu M, Cinar A, Rollins DK et al (2012) Adaptive system identification for estimating future glucose concentrations and hypoglycemia alarms. Automatica 48(8):1892–1897
28. Turksoy K, Bayrak ES, Quinn L et al (2013) Hypoglycemia early alarm systems based on multivariable models. Ind Eng Chem Res 52:12329–12336
29. Pérez-Gandía C, Facchinetti A, Sparacino G et al (2010) Artificial neural network algorithm for online glucose prediction from continuous glucose monitoring. Diabetes Technol Ther 12(1):81–88
30. Pappada SM, Cameron BD, Rosman PM et al (2011) Neural network-based real-time prediction of glucose in patients with insulin-dependent diabetes. Diabetes Technol Ther 13(2):135–141
31. Daskalaki E, Prountzou A, Diem P et al (2012) Real-time adaptive models for the personalized prediction of glycemic profile in type 1 diabetes patients. Diabetes Technol Ther 14(2):168–174
32. Zecchin C, Facchinetti A, Sparacino G et al (2012) Neural network incorporating meal information improves accuracy of short-time prediction of glucose concentration. IEEE Trans Biomed Eng 59(6):1550–1560
33. McNelis PD (2005) Neural networks in finance: gaining predictive edge in the market. Elsevier Academic Press, London
34. Dalla Man C, Camilleri M, Cobelli C (2006) A system model of oral glucose absorption: validation on gold standard data. IEEE Trans Biomed Eng 53(12):2472–2478
35. Dalla Man C, Rizza RA, Cobelli C (2007) Meal simulation model of the glucose insulin system. IEEE Trans Biomed Eng 54(10):1740–1749
36. Facchinetti A, Sparacino G, Cobelli C (2011) Online denoising method to handle intraindividual variability of signal-to-noise ratio in continuous glucose monitoring. IEEE Trans Biomed Eng 58(9):2664–2671
37. Basheer IA, Hajmeer M (2000) Artificial neural networks: fundamentals, computing, design, and application. J Microbiol Methods 43(1):3
38. Amari S, Murata N, Müller KR, Finke M, Yang H (1996) Statistical theory of overtraining: is cross-validation asymptotically effective? In: Touretzky DS, Mozer MC, Hasselmo ME (eds) Advances in neural information processing systems 8, pp 176–182
39. DIAdvisor. Personal glucose predictive diabetes advisor. http://www.diadvisor.eu/. Accessed 22 Jan 2014
40. Gani A, Gribok AV, Lu Y et al (2010) Universal glucose models for predicting subcutaneous glucose concentration in humans. IEEE Trans Inf Technol Biomed 14(1):157–165
41. Facchinetti A, Sparacino G, Cobelli C (2010) Modeling the error of continuous glucose monitoring sensor data: critical aspects discussed through simulation studies. J Diabetes Sci Technol 4(1):4–14
42. Manohar C, Levine JA, Nandy DK et al (2012) The effect of walking on postprandial glycemic excursion in patients with type 1 diabetes and healthy people. Diabetes Care 35(12):2493–2499
43. Zecchin C, Facchinetti A, Sparacino G et al (2013) Physical activity measured by physical activity monitoring system correlates with glucose trends reconstructed from continuous glucose monitoring. Diabetes Technol Ther 15(10):836–844
Chapter 16
Abstract
Magnetron sputtering and optical lithography are standard techniques to prepare magnetic tunnel
junctions with lateral dimensions in the micrometer range. Here we present the materials and techniques
to deposit the layer stacks, define the structures, and etch the devices. In the end, we obtain tunnel junc-
tion devices exhibiting memristive switching for potential use as artificial synapses.
Key words Memristors, Artificial synapses, Magnetron sputtering, Optical lithography, Tunnel
junctions
1 Introduction
Hugh Cartwright (ed.), Artificial Neural Networks, Methods in Molecular Biology, vol. 1260,
DOI 10.1007/978-1-4939-2239-0_16, © Springer Science+Business Media New York 2015
261
262 Stefan Niehörster and Andy Thomas
2 Materials
3 Methods
3.2 Thin Film Deposition
1. The thin films are deposited by magnetron sputtering. The
base pressure inside the sputter chamber is less than
2.0 × 10−7 mbar, with a set point of 2.0 × 10−7 mbar. The sputter
pressure is approximately 1.3 × 10−3 mbar and is regulated via
the argon flow (20 sccm) and the throttle position (21 %) in
front of the turbo pump. In our case, the tantalum film is sputtered
from a 3″ target at a power of 65 W and the palladium
film from a 4″ target at a power of 115 W. The
machine-specific deposition rates are calibrated by X-ray reflectivity
measurements (0.1–0.6 nm/s).
2. We subsequently deposit a stack of 5 nm tantalum, 10 nm palla-
dium, and 2 nm tantalum on the silicon dioxide side of the wafer.
3. The sample is next moved in-situ into the oxidation chamber.
The base pressure inside this chamber is approximately
2.5 × 10−7 mbar, with a set point of 7.0 × 10−6 mbar. The
pressure during the oxidation process is approximately
2.0 × 10−3 mbar and depends on the oxygen flow (13 sccm) and
the throttle position (80 %) in front of the turbo-pump. The
microwave power to ionize the oxygen is set to 275 W and the
bias voltage to accelerate the oxygen ions towards the sample
is −80 V. The sample is oxidized for 150 s to obtain the tanta-
lum oxide tunnel barrier.
4. The sample (in-situ) is then moved back into the sputter cham-
ber and layers of 10 nm Ta, 5 nm Pd, 5 nm Ta, and 30 nm Au
are deposited. Gold needs a higher sputter pressure of approxi-
mately 4.5 × 10−3 mbar, which is realized by a smaller opening
of the throttle (8 %). The sputter power for gold is 29 W,
because of the smaller target size of 2″.
Fig. 1 (a) The complete mask layout. (b) The upper right edge of the layout design. The black parts cover the
sample (positive resist). The triangles in the frame are markers for easier positioning, in particular, if a second
lithography step is necessary. The mask has to be placed upside down on the sample, so the characters are
reversed to get a correct layout on the sample
Memristive Switching 265
3.4 Ion Beam Etching
1. Clean dust specks from the sample with nitrogen gas and lock
it in the etching chamber.
2. The base pressure inside the etching chamber should be
approximately 3.0–4.0 × 10−8 mbar and the etching (sputter)
pressure approximately 3.5 × 10−5 mbar. The pressure is regulated
by the argon flow (approximately 2.0 sccm). The cathode
current should be in the range 300–400 μA. The sample
holder should rotate to achieve an even etching rate over the
entire sample. We have to etch through the tantalum oxide
barrier into the palladium film (Fig. 2) (see Note 3).
3. After the etching procedure, it is possible to remove the
remaining squares of photoresist in an ultrasonic
bath in acetone for approximately 5 min (see Note 4) and
in ethanol for approximately 1 min. Only remove the photoresist
if no contact pads are needed, and skip Subheadings 3.5 and 3.6.
3.5 Insulator
1. Before removing the photoresist, we mark one sample edge
with a felt pen. Fill the spaces between the junction squares
with an insulator such as silicon nitride or tantalum oxide via
rf-sputtering. The insulator should be at least as thick as
the sum of the etched films (here ≥60 nm).
2. Now remove the photoresist as well as the felt pen mark with
acetone in an ultrasonic bath. In our case, the sample has to
stay 5–10 min in the ultrasonic bath in a beaker filled with
acetone because of the processed photoresist (see Note 4).
3. The felt pen mark covered the surface and so allows easy
removal of the insulator on top of it, thereby providing
access to the lower electrode.
4. This procedure makes it easier to contact the elements without
a short circuit and protects the sample from scratches and
other damage or adverse environmental conditions.
3.6 Contact Pads
It is possible to deposit contact pads on the elements if the junction pillars become very small. While it is easy to contact a junction pillar with lateral dimensions larger than 100 μm, it is very challenging if the sizes approach 10 μm. In our case, we added contact pads with a size of 50 × 100 μm2 (see Notes 5 and 6).
1. At first, it may be necessary to spin coat the sample with pho-
toresist as described in Subheading 3.3, step 1.
2. The contact pads are prepared using a lift-off process. We therefore need a mask with a layout matching the previously used junction mask, i.e., the whole sample is covered and the 50 × 100 μm2 holes are on top of the junction pillars. Place the mask on the sample and irradiate it with UV light.
3. Sputter a 5 nm film of tantalum and a 60 nm film of gold, as is
described in Subheading 3.2, steps 1 and 4.
4. Place the sample in a beaker filled with acetone for the lift-off process and leave it for 5–10 min in the ultrasonic bath (similar to Subheading 3.5).
4 Notes
Chapter 17
Abstract
Advancement of science and technology has prompted researchers to develop new intelligent systems that can solve a variety of problems such as pattern recognition, prediction, and optimization. The ability of the human brain to learn in a fashion that tolerates noise and error has attracted many researchers and provided the starting point for the development of artificial neural networks: the intelligent systems. Intelligent systems can acclimatize to the environment or data and can maximize the chances of success or improve the efficiency of a search. Due to massive parallelism, with large numbers of interconnected processors, and their ability to learn from data, neural networks can solve a variety of challenging computational problems. Neural networks have the ability to derive meaning from complicated and imprecise data; they are used to detect patterns and trends that are too complex for humans or other computer systems.
Solutions to the toughest problems will not be found through one narrow specialization; therefore we
need to combine interdisciplinary approaches to discover the solutions to a variety of problems. Many
researchers in different disciplines such as medicine, bioinformatics, molecular biology, and pharmacology
have successfully applied artificial neural networks. This chapter helps the reader in understanding the
basics of artificial neural networks, their applications, and methodology; it also outlines the network learn-
ing process and architecture. We present a brief outline of the application of neural networks to medical
diagnosis, drug discovery, gene identification, and protein structure prediction. We conclude with a summary of the results from our study using neural networks on tuberculosis data to diagnose active tuberculosis and to predict chronic vs. infiltrative forms of tuberculosis.
Key words Artificial neural networks, Intelligent systems, Bioinformatics, Drug discovery, Pattern
recognition, Tuberculosis, Protein structure prediction, Gene identification
1 Introduction
Hugh Cartwright (ed.), Artificial Neural Networks, Methods in Molecular Biology, vol. 1260,
DOI 10.1007/978-1-4939-2239-0_17, © Springer Science+Business Media New York 2015
270 Jerry A. Darsey et al.
1.1 Human Brain
The average human brain has about 100 billion neurons or nerve
cells, and a slightly greater number of neuroglial cells, which support
and protect neurons. Neurons are considered to be the information
transmission units of the brain, and each neuron is connected to up
to 10,000 other neurons [2]. Neurons are the basic units of the
nervous system, and have a cell body or soma, together with den-
drites (inputs) and axon (outputs), as shown in Fig. 1. Each neuron
is connected to other neurons through a synapse; typically a synapse
is filled with a neurotransmitter chemical and connects the axon of
one neuron and the dendrite of a second neuron. Signals are trans-
mitted from one neuron to another through synaptic space; these
signals can be either excitatory or inhibitory. Neurons are organized
and extensively connected to form a complex network in the human
brain, and act as messengers in sending and receiving impulses (sig-
nals). The complex network makes the human brain intelligent, with
learning, memory, recognition, and prediction capabilities.
Fig. 1 Basic structure of a neuron with the cell body or soma, dendrites as input nodes, and the axon as output node. Branches from the axon connect to the dendrites of other neurons in the complex network of the human brain
Artificial Neural Networks Applied to TB 271
1.2 Architecture of Artificial Neural Networks
Although the basic unit in the structure of an ANN is computationally similar to a neuron in the human brain, it is biologically much simpler. ANNs incorporate two fundamental elements of biological neural networks: neurons are represented as nodes and synapses as weighted interconnections [3]. Artificial neurons are organized into layers: the input layer, hidden layer(s), and an output layer. Neural networks with one hidden layer are sufficient to make decisions and predictions for simple non-linear approximations, but networks with two or more hidden layers are used in more complex applications. A neural network with one hidden layer is known as a simple perceptron, and one with more than one hidden layer is called a multilayer perceptron [4]. As shown in Fig. 2,
the input layer is made up of nodes, which contain an activation function. This layer communicates, through weighted connections, with one or more hidden layers, where the actual processing is done. The final hidden layer then links to the output layer, which generates the network outcome or predictions.
x1, x2, x3, … are inputs and w1, w2, w3, … are weighted connections expressing the importance of the inputs to the outputs (Fig. 2). The network's output can in principle be any value, but in many applications it is limited to the binary set {0, 1}. This output is determined by a threshold rule:
[Fig. 2: schematic of a network with inputs X1, X2, X3, weights W1, W2, W3, and input, hidden, and output layers]
\[
\text{output} =
\begin{cases}
0 & \text{if } \sum_j w_j x_j \le \text{threshold} \\[4pt]
1 & \text{if } \sum_j w_j x_j > \text{threshold}
\end{cases}
\]
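The threshold rule above can be sketched in a few lines of Python; the inputs, weights, and threshold are illustrative values, not taken from the chapter:

```python
def perceptron_output(inputs, weights, threshold):
    """Return 0 or 1 depending on whether the weighted sum of the inputs
    exceeds the threshold, as in the equation above."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0

# Illustrative example with three inputs x1..x3 and weights w1..w3
x = [1.0, 0.0, 1.0]
w = [0.4, 0.9, 0.3]
print(perceptron_output(x, w, threshold=0.5))  # weighted sum 0.7 > 0.5, so prints 1
```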
The working mechanism of ANNs
Artificial neural networks can be viewed as learning algorithms with nodes as activation functions and weights as the edges or connections between the nodes. Based on the connection pattern or architecture of the network, ANNs can be either feed-forward networks or recurrent (feedback) networks (Fig. 3).
1.2.2 Feedback Network or Recurrent Network
Feedback networks are dynamic, and the output is dependent on the previous layer.
When a new input pattern is presented, the neuron outputs are
computed. Because of the presence of feedback paths, the inputs to
each neuron are then modified, which leads the network to enter a
new state [6].
1.3 Learning Algorithm of Neural Networks
The learning process in ANNs can be viewed as the process of updating network architecture and adjusting weight patterns to efficiently achieve a specific task. A network learns from available training patterns, and its performance improves over time by iteratively adjusting the weight patterns in the network. The
major advantage of neural networks when compared to tradi-
tional expert systems is that they learn from a given set of repre-
sentative examples of training data rather than by following a set
of rules previously defined by human experts, as in traditional
computer systems [6].
The learning process or learning architecture of ANNs can be
designed from known, available information about the training
data set. A learning paradigm can be defined by having a model
from a known environment, and learning rules can be defined by
figuring out the update process for connection weights [6].
Identifying a procedure to adjust weights by use of a learning rule
and following a learning paradigm will define the learning process
or learning algorithm of the network.
The major learning paradigms of ANNs are
1. Supervised.
2. Unsupervised.
3. Hybrid.
In a supervised learning paradigm, a correct answer is provided
for each input pattern and the weights are adjusted based on the
degree to which network output agrees with that answer. In an
unsupervised paradigm, as the answers are not provided, the sys-
tem itself recognizes a correlation and organizes the patterns into
categories accordingly. The hybrid paradigm combines both super-
vised and unsupervised learning methods. In hybrid learning
methods, some weights are provided with correct answers, while
some are auto corrected.
The four basic learning rules for a learning paradigm of
ANNs are
1. Error-correction rules.
2. Boltzmann learning rule.
3. Hebbian rule.
4. Competitive learning rules.
1.3.1 Error-Correction Rule
All these learning rules can be applied to supervised, unsupervised, and hybrid paradigms. In the error-correction rule learning process, the error generated in each step is used to adjust the weights
to efficiently achieve a task [6]. During learning, the actual output y differs from the desired output d, so an expression based on the error (d − y) is used to adjust the weights and gradually reduce the error.
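A minimal sketch of one error-correction update for a single linear node might look like the following; the learning rate, training pattern, and desired output are illustrative values:

```python
def error_correction_step(weights, x, d, lr=0.1):
    """One error-correction update: move the weights in proportion
    to the error (d - y) between desired and actual output."""
    y = sum(w * xi for w, xi in zip(weights, x))      # actual output
    error = d - y                                     # desired minus actual
    return [w + lr * error * xi for w, xi in zip(weights, x)]

# Repeated presentations of one training pattern shrink the error step by step
w = [0.0, 0.0]
for _ in range(50):
    w = error_correction_step(w, x=[1.0, 2.0], d=1.0)
y = w[0] * 1.0 + w[1] * 2.0
print(round(y, 3))  # the actual output approaches the desired output 1.0
```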
1.4.1 Hebbian Rule
The Hebbian rule is based on neurobiological experiments and is the oldest learning rule [6]. An important property of this rule is that learning is done locally: the change in a weight depends only on the activities of the two neurons it connects. Orientation selectivity can emerge from a Hebbian training network.
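The locality of the Hebbian rule can be sketched as follows; the activities, learning rate, and initial weights are illustrative values:

```python
def hebbian_step(weights, pre, post, lr=0.01):
    """Hebbian update: each weight changes using only the activities of the
    two neurons it connects (pre- and post-synaptic), i.e., learning is local."""
    return [w + lr * p * post for w, p in zip(weights, pre)]

# Illustrative: only the connection with an active pre-synaptic input strengthens
w = hebbian_step([0.0, 0.0], pre=[1.0, 0.0], post=1.0, lr=0.5)
print(w)  # [0.5, 0.0]
```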
3 Methodology
[Fig. 4: node diagram with input vector V = (V1(m-1), …, Vn(m-1)), bias, and output OUTj, summarized by the transfer-function equations, where m is the layer number, W the weights, and F the transfer function]
\[
NET_j^{m} = \sum_i W_{ij}^{m} V_i^{m-1} + \theta_j^{m}, \qquad
OUT_j = F\!\left(NET_j^{m}\right), \qquad
F\!\left(NET_j^{m}\right) = \frac{1}{1 + \exp\!\left(-NET_j^{m}\right)}
\]
Fig. 4 The diagram of the node and transfer function of an artificial neural
network
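The node computation in Fig. 4 can be sketched directly; the weights and bias below are illustrative values:

```python
import math

def node_output(v_prev, weights, bias):
    """NET_j = sum_i W_ij * V_i(m-1) + bias_j; OUT_j = F(NET_j),
    with the logistic transfer function F(NET) = 1 / (1 + exp(-NET))."""
    net = sum(w * v for w, v in zip(weights, v_prev)) + bias
    return 1.0 / (1.0 + math.exp(-net))

print(node_output([0.0, 0.0], [0.5, -0.3], bias=0.0))  # NET = 0 gives OUT = 0.5
```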
4 Results
[Fig. 5: plot of Percentage Correct (y-axis, 0–70) versus Cycle/100 (x-axis, 0–2,400); inset bar chart of selection factors G3, G5, G10, G13, G14, G15, G42, G47, G50, G51, G54, G57, G58, G72, G80, G81, G82, G83 with positive or negative relative weight strengths]
Fig. 5 The neural network prediction of TB cases in the dataset. The inset shows the selection factors which promoted or inhibited the prediction, by their relative weight strength
[Fig. 6: plot of Percentage Correct (y-axis, 0–80) versus Cycle/1000 (x-axis, 0–120)]
Fig. 6 Prediction of TB-infected and control group (non-TB) patients with ANN. The inset shows the selection
factors and the weight strength of those factors
[Fig. 7: plot of Percentage Correct (y-axis, 0–80) versus Cycle/1000 (x-axis, 0–120)]
Fig. 7 The prediction of chronic and infiltrative TB from the data set. Inset shows factors which promoted or
inhibited prediction of each type of TB
5 Discussion and Conclusions
References
1. Agatonovic-Kustrin S, Beresford R (2000) Basic concepts of artificial neural network (ANN) modeling and its application in pharmaceutical research. J Pharm Biomed Anal 22(5):717–727. http://www.sciencedirect.com/science/article/pii/S0731708599002721
2. MacDonald M (2008) Your brain: the missing manual. Pogue, Sebastopol, CA
3. Gonzalez AJ, Dankel DD (1993) The engineering of knowledge-based systems. Prentice-Hall, Englewood Cliffs, NJ
4. Zahedi Z (1993) Intelligent systems for business: expert systems with neural networks. Wadsworth Inc., Belmont, CA
5. Nielsen MA (2014) Neural networks and deep learning, Chapter 3, Improving the way neural networks learn. Determination Press. http://neuralnetworksanddeeplearning.com/chap1.html
6. Jain AK, Mao JC, Mohiuddin KM (1996) Artificial neural networks: a tutorial. Computer 29(3):31
7. Dybowski R (2000) Neural computation in medicine: perspective and prospects. In: Artificial neural networks in medicine and biology. Proceedings of the ANNMAB-1 conference, Sweden
8. Kalate RN, Tambe SS, Kulkarni BD (2003) Artificial neural networks for prediction of mycobacterial promoter sequences. Comput Biol Chem 27(6):555–564
9. de Avila S, Silva GJLG (2011) Rules extraction from neural networks applied to the prediction and recognition of prokaryotic promoters. Genet Mol Biol 34(2):353–360
10. Zhang F, Kuo MD, Brunkhors A (2006) E. coli promoter prediction using feed-forward neural networks. Proceedings of the 28th IEEE EMBS Annual International Conference, New York City, USA, p 2025–2027
11. Holley HHL (1989) Protein secondary structure prediction with a neural network. Proc Natl Acad Sci U S A 86(1):152–156
12. Xu Y, Mural RJ, Einstein et al (1996) GRAIL: a multi-agent neural network system for gene identification. Proc IEEE 81(10):1544–1552
13. Hebsgaard PGK, Tolstrup N, Engelbrecht J et al (1996) Splice site prediction in Arabidopsis thaliana DNA by combining local and global sequence information. Nucleic Acids Res 24(17):3439–3452
14. Brunak S, Engelbrecht J, Knudsen (1991) Prediction of human mRNA donor and acceptor sites from the DNA sequence. J Mol Biol 220:49–65
15. Wu C, Shivakumar S, McLarty J (1995) Neural networks for full-scale protein
Abstract
In the classification of breast cancer subtypes using microarray data, hierarchical clustering is commonly
used. Although this form of clustering shows basic cluster patterns, more needs to be done to investigate
the accuracy of clusters as well as to extract meaningful cluster characteristics and their relations to increase
our confidence in their use in a clinical setting. In this study, an in-depth investigation of the efficacy of
three reported gene subsets in distinguishing breast cancer subtypes was performed using four advanced
computational intelligence methods—Self-Organizing Maps (SOM), Emergent Self-Organizing Maps
(ESOM), Fuzzy Clustering by Local Approximation of Memberships (FLAME), and Fuzzy C-means
(FCM)—each differing in the way they view data in terms of distance measures and fuzzy or crisp cluster-
ing. The gene subsets consisted of 71, 93, and 71 genes reported in the literature from three comprehen-
sive experimental studies for distinguishing Luminal (A and B), Basal, Normal breast-like, and HER2
subtypes. Given the costly procedures involved in clinical studies, the proposed 93-gene set can be used for
preliminary classification of breast cancer. Then, as a decision aid, SOM can be used to map the gene signature of a new patient and locate it with respect to all subtypes, giving a comprehensive view of the classification. This can be followed by a deeper investigation in the light of the observations made in this
study regarding overlapping subtypes. Results from the study could be used as the base for further refining
the gene signatures from later experiments and from new experiments designed to separate overlapping
clusters as well as to maximally separate all clusters.
Key words Breast cancer, Neural networks and fuzzy clustering, DNA microarray, Gene expression
signatures, Subtype classification, Gene clustering
1 Introduction
Hugh Cartwright (ed.), Artificial Neural Networks, Methods in Molecular Biology, vol. 1260,
DOI 10.1007/978-1-4939-2239-0_18, © Springer Science+Business Media New York 2015
286 Sandhya Samarasinghe and Amphun Chaiboonchoe
Table 1
Microarray studies on prognostic gene-expression profiles in breast cancer (adapted from Weigelt,
Peterse, and van’t Veer [22])
Microarray type | Informative genes | Outcomes | References
cDNA | 496 "intrinsic" genes | Identified 4 subtypes: Luminal-like/ER+ gene, ERBB2+ overexpressed, Basal-like, Normal breast-like | [6]
cDNA | 456 "intrinsic" genes | "Luminal A" tumors have a better outcome than "Luminal B" tumors; worst outcome is for "Basal-like" and "ERBB2+" tumors | [7]
cDNA | 534 intrinsic genes | Tested results from [7] using independent datasets | [8]
Oligonucleotide "70-gene MammaPrint" | 70 genes | Identified a "good signature" related to low metastasis risk and a "poor signature" associated with a high metastasis risk; sensitivity: 91 %, specificity: 73 % | [12]
Oligonucleotide | 70 genes | "Good signature" versus "poor signature"; sensitivity: 93 %, specificity: 53 % | [36]
Oligonucleotide | "metagenes" | Accuracy: 90 % | [37]
Oligonucleotide | 76 genes | "Good 76-gene signature" versus "poor 76-gene signature"; sensitivity: 93 %, specificity: 48 % | [38]
Oligonucleotide | 442 genes | Expression of "serum-activated" signature versus no expression; sensitivity: 91 %, specificity: 29 % | [39, 40]
(continued)
Table 1 (continued)
Microarray type | Informative genes | Outcomes | References
"Combined test set" data from Sorlie et al. [7, 8] (cDNA microarrays), van't Veer et al. [12] (custom Agilent oligo microarrays), and Sotiriou et al. [28] (cDNA microarrays) | 306 genes | Validated the breast cancer "intrinsic subtypes" by combining existing data sets | [27]
"Combined test set" data from Sorlie et al. [7, 8] (cDNA microarrays), Hu et al. [27], and Perreard et al. [41] [GEO: GSE2607] (Agilent Human A1, Agilent Human A2, and custom oligonucleotide) | 50 genes ("PAM50" subtype predictor) | Incorporated "intrinsic" subtypes into a risk model for breast cancer prognosis and prediction | [42]
Oligonucleotide Affymetrix U133a microarrays: Wang and colleagues [GEO:GSE2034], Miller and colleagues [GEO:GSE3494], and Pawitan and associates [GEO:GSE1456] | 306 genes | Identified relations between breast subtypes, pathway deregulation, and drug sensitivity | [43]
2 Materials
2.1 Dataset
There are three main datasets used in this study.
The first dataset was from Sorlie et al. [7]. Specifically, we used a 71-gene subset selected from the 456 intrinsic genes, as shown in
Fig. 1a: Group C—ERBB2+/HER2 (11 genes), Group D—Novel
unknown (12 genes), Group E—Basal-like (19 genes), Group F—
Normal breast-like (14 genes), and Group G—Luminal (15 genes).
These were derived from the raw data from a total of 85 cDNA
microarray experiments on tumor and normal breast tissues (71
ductal, five lobular, two ductal carcinoma in situ, three fibroade-
noma (benign tumor), and four normal breasts).
The dataset was retrieved from http://genome-www.stanford.
edu/breast_cancer/mopo_clinical/ or http://www.ncbi.nlm.nih.
gov/projects/geo/query/acc.cgi?acc=GSE3193.
The second dataset was from Sorlie et al. [8]. We used a
93-gene subset derived from 534 intrinsic genes (see Fig. 1b):
group C (ERBB2+/HER2 (11 genes)), group D (Luminal
Subtype B (28 genes)), group E (Basal-like) (24 genes), group F
(Normal breast-like) (5 genes), and group G (Luminal Subtype A)
(25 genes). These were derived from a total of 122 samples from
Norway/Stanford cohort which included the above mentioned 85
samples from the previous (2001) study used to classify cancer sub-
classes by average linkage hierarchical clustering.
The dataset was retrieved from http://genome-www.stanford.
edu/breast_cancer/robustness/ (also available at http://www.
ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE4335).
In the above studies the raw data (cDNA) were processed and
genes were selected with a criterion of at least fourfold expression
from the median in at least three of the samples tested. Then
differentially expressed genes were used to find the intrinsic genes
by calculating the “within-pair variance” score and the “between-
subject variance.” The ratio between the two scores is called D and
intrinsic genes are those with a small value of D. This resulted in 456 intrinsic genes from the first dataset and 534 intrinsic genes from the second dataset. The authors then analyzed these genes by hierarchical clustering, resulting in five distinct subtypes. Finally, a small number of novel genes representing each subtype were identified in these studies, as shown in the highlighted boxes in Fig. 1a and b for the 456 and 534 genes, respectively.
Breast Cancer Classification 291
Fig. 1 Hierarchical clustering analysis from three previous studies [7, 8, 26] (panels a, b, c). The arrows indicate the same subtypes clustered by different gene sets. The red box represents the Basal-like subtype, the blue box the Luminal A subtype, and the pink box the ERBB2+/HER2 subtype. (Names of the genes in some of these subtypes can also be found in Table 4)
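The D-based selection just described can be sketched as follows; the gene names, variance values, and the 0.5 cutoff are all hypothetical placeholders, not values from Sorlie et al.:

```python
def d_score(within_pair_variance, between_subject_variance):
    """D = within-pair variance / between-subject variance;
    intrinsic genes are those with a small D."""
    return within_pair_variance / between_subject_variance

# Hypothetical gene names and variance values; the 0.5 cutoff is also illustrative
scores = {"GENE_A": d_score(0.2, 2.0), "GENE_B": d_score(1.5, 1.6)}
intrinsic = [gene for gene, d in scores.items() if d < 0.5]
print(intrinsic)  # only GENE_A varies little within pairs relative to between subjects
```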
The last dataset was retrieved from Hu et al. [27]. This had a
71-gene subset derived from 306 intrinsic genes (see Fig. 1c):
Group C—Luminal (21 genes), Group D—HER2 (11 genes),
Group E—Interferon-regulated cluster containing STAT1 (12
genes), Group F—Basal-like (14 genes), and Group G—
Proliferation cluster (13 genes). A total of 311 tumors and 4 normal breast samples was created by combining three existing datasets of Sorlie et al. [7, 8], van't Veer et al. [12], and Sotiriou
et al. [28]. The gene subsets containing 71 of the 306 intrinsic
genes are highlighted in boxes in Fig. 1c. The dataset was retrieved
from the UNC Microarray Database (http://www.ncbi.nlm.nih.
gov/projects/geo/query/acc.cgi?acc=GSE1992 or https://
genome.unc.edu/pubsup/breastTumor/).
In our study reported in this chapter, we explored these small
subsets of intrinsic genes separately, specifically, 71 and 93-gene
subsets from Sorlie et al. [7, 8] and 71-gene subset from Hu et al.
[27]. Some additional details of these subsets and the studies are
given in Table 2.
Table 2
Summary of data, intrinsic genes, and identified subtypes in studies from 2000 to 2006 [6–8, 27]
| 2000 (Perou et al.) | 2001 (Sorlie et al.) | 2003 (Sorlie et al.) | 2006 (Hu et al.)
No. of samples | 42 individuals (36 ductal + 2 lobular + 1 ductal carcinoma in situ + 1 fibroadenoma + 3 normal breast) | 85 tissue samples representing 84 individuals (71 ductal + 5 lobular + 2 ductal carcinoma in situ + 3 fibroadenoma + 4 normal breast) | 122 individuals (including 84 from 2001) | 315 samples combined from Sorlie et al. [7, 8], van't Veer et al. [12], and Sotiriou et al. [28]
Intrinsic genes | 496 | 456 | 534 | 306
Subtypes | 1. Luminal-like/ER+ gene; 2. HER2; 3. Basal-like; 4. Normal breast-like | 1. Luminal; 2. HER2; 3. Basal-like; 4. Normal breast-like; 5. A novel unknown cluster | 1. Luminal A; 2. Luminal B; 3. HER2; 4. Basal-like; 5. Normal breast-like | 1. Luminal A; 2. Luminal B; 3. HER2; 4. Basal-like; 5. Normal breast-like, plus 2 new clusters: Proliferation, Interferon-regulated
\[
\text{Euclidean Distance} = d_j = \left\| \mathbf{x} - \mathbf{w}_j \right\| = \sqrt{\sum_{i=1}^{n} \left( x_i - w_{ij} \right)^2} \tag{1}
\]
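As a sketch, the distance in Eq. 1 and the search for the closest (winner) neuron might be written as follows; the input and weight vectors are illustrative values:

```python
def euclidean_distance(x, w):
    """d_j = ||x - w_j||, as in Eq. 1."""
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w)) ** 0.5

def winner(x, weight_vectors):
    """Index of the map neuron whose weight vector is closest to input x."""
    return min(range(len(weight_vectors)),
               key=lambda j: euclidean_distance(x, weight_vectors[j]))

print(winner([0.0, 1.0], [[0.0, 0.0], [0.0, 0.9], [1.0, 1.0]]))  # neuron 1 is closest
```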
The objective of learning in SOM networks is to accurately project
the input data onto the map neurons. Over repeated exposure to
inputs during learning, neurons learn to respond to inputs in such
a way that neurons that are closer on the SOM respond to inputs
that are closer in the actual data space. This feature, known as
topology preservation, is achieved by moving not only the winner
neuron but also its neighbor neurons towards the input in order to
minimize the distance between these neurons and the input. In
this process, the winner moves by a larger amount than its neigh-
bors according to a Neighbor Strength (NS) function.
For each neuron j ∈ Nc, where Nc is the neighborhood sur-
rounding the winner neuron c, learning involves adjusting the new
weight as follows:
\[
\mathbf{w}_j^{\,new} = \mathbf{w}_j^{\,old} + \varepsilon(t)\, NS(d,t) \left( \mathbf{x}_i - \mathbf{w}_j^{\,old} \right) \tag{2}
\]
where wjnew is the new weight, wjold is the previous weight, ε(t) is the learning rate, xi is the ith input vector, and NS is a neighborhood function defining the strength of weight adjustment of neighbors with respect to the winner at iteration t; NS can take the form of a linear or nonlinear function of the distance d from the winner. The learning rate ε(t) is user-defined and can take a linear, exponential, or similar form. Starting from a larger value, ε(t) decreases with iterations. NS(d,t)
also can be linear, exponential, Gaussian, etc. and shrinks with itera-
tions so that an initially large neighborhood size gets reduced to just
the winner, or the winner and its immediate neighbors at the end of
training. Training stops when there is no acceptable difference in
weight adjustment in successive iterations. In this study, we used the
Neural Networks Toolbox (Matlab) for training the SOM.
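A single learning step (Eq. 2) can be sketched as below; the one-dimensional map, the linearly decaying neighborhood function, and the learning rate are illustrative choices, and the chapter's own experiments used the Matlab toolbox rather than this code:

```python
def som_update(weights, x, winner_idx, eps, ns):
    """One SOM learning step (Eq. 2): move the winner and its neighbors
    toward input x, scaled by learning rate eps and neighbor strength NS(d)."""
    updated = []
    for j, w in enumerate(weights):
        d = abs(j - winner_idx)          # grid distance on a 1-D map (illustrative)
        strength = ns(d)
        updated.append([wi + eps * strength * (xi - wi) for wi, xi in zip(w, x)])
    return updated

# Illustrative: 3 neurons on a line, linearly decaying neighborhood strength
ns = lambda d: max(0.0, 1.0 - 0.5 * d)
w = som_update([[0.0], [0.5], [1.0]], x=[1.0], winner_idx=2, eps=0.4, ns=ns)
print([round(v[0], 2) for v in w])  # the winner's neighbor moves partway toward x
```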
SOM training results in a map where inputs that are closer in
the data space are projected onto neurons that are closer on the
SOM. Therefore, any large gaps between data in the data space are
shown as large distances between corresponding neurons on the
map, thereby revealing on the map potential clusters and cluster
boundaries in the actual data. With an appropriate clustering tool,
such as Ward Clustering, neurons that form clusters on the map
can be grouped to form classes for classification purposes. An
unknown input vector then will belong to the class that it is closest
to, i.e., the class with the shortest distance between its mean vector
and the input vector. This provides a crisp classification for the
input vector—one input vector belongs to one class.
In a trained SOM, each neuron represents the center of gravity of a number of input vectors. SOM results can be visualized in various ways. In one form, the U-matrix (unified distance matrix), the average distance between a neuron and its neighbors is shown with colors on the map. This can show distinct subtypes or gene clusters
\[
d_{ward} = \frac{n_r \, n_s}{n_r + n_s} \left\| x_r - x_s \right\|^2 \tag{3}
\]
where r and s symbolize two clusters and nr and ns are the number
of data points in those clusters. ||xr – xs|| gives the magnitude of
the distance between two cluster centers calculated using
Euclidean norm. The Ward distance increases with the increase in
the number of data points. The two clusters with minimum dward
are joined at each step of this process. The Ward index gives the likelihood of each cluster: the separation of clusters depends on the value of the index, and the higher the index, the more distinct the clusters.
\[
\text{Ward Index} = \frac{1}{NC} \times \frac{d_t - d_{t-1}}{d_{t-1} - d_{t-2}} = \frac{1}{NC} \times \frac{\Delta d_t}{\Delta d_{t-1}} \tag{4}
\]
where ∆dt is the distance between the centers of two clusters to be
merged and ∆dt−1 is the distance between the centers of the clusters
merged previously and prior to that step. NC is the number of
clusters remaining. The method is very effective when the dataset
is not too large. Therefore, it is quite suitable for use on a trained
SOM to cluster map neurons. In this case, it uses the neuron
weights generated by the trained SOM as input vectors.
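The Ward distance of Eq. 3 can be sketched as follows; the two small clusters are illustrative data, not neuron weights from the study:

```python
def ward_distance(cluster_r, cluster_s):
    """d_ward = (n_r * n_s) / (n_r + n_s) * ||mean_r - mean_s||^2, as in Eq. 3."""
    n_r, n_s = len(cluster_r), len(cluster_s)
    mean = lambda c: [sum(col) / len(c) for col in zip(*c)]
    m_r, m_s = mean(cluster_r), mean(cluster_s)
    sq_dist = sum((a - b) ** 2 for a, b in zip(m_r, m_s))
    return (n_r * n_s) / (n_r + n_s) * sq_dist

r = [[0.0, 0.0], [0.0, 2.0]]   # cluster with mean (0, 1)
s = [[4.0, 1.0]]               # singleton cluster at (4, 1)
print(round(ward_distance(r, s), 3))  # (2*1)/(2+1) * 16, about 10.667
```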
3.2 Emergent Self-Organizing Maps (ESOM)
ESOM is an extension of SOM which allows a map to grow in size from the initial map. ESOM is arranged on a toroidal grid using a larger number of neurons than a typical SOM [31]. An ESOM map can be visualized in three forms: U-Matrix (distance-based), P-Matrix (density-based)
\[
P(n) = p\!\left(w(n), X\right) \tag{5}
\]
where p(x,X) denotes an empirical density estimation at point x
(i.e., w(n) which is the weight vector of neuron n) in the data space
X and ESOM uses Pareto Density Estimation (PDE) [31], an
optimal information kernel density estimation.
Data can be analyzed by using publicly available ESOM soft-
ware (Databionics ESOM Tool) developed by Ultsch and Morchen
[31]. This software is available to download from http://
databionic-esom.sourceforge.net/.
\[
p^{t+1}(x) \approx \sum_{y \in KNN(x)} w_{xy} \, p^{t}(y) \tag{7}
\]
\[
E\left(\{p\}\right) = \sum_{\forall x} \left\| p(x) - \sum_{y \in KNN(x)} w_{xy} \, p(y) \right\|^2 \tag{8}
\]
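One FLAME membership-update iteration (Eq. 7) can be sketched as below; the objects, neighbor sets, and weights are illustrative, and the real algorithm derives the KNN weights from expression-profile distances:

```python
def flame_update(p, knn_weights):
    """One FLAME iteration (Eq. 7): each object's fuzzy membership vector
    becomes the weighted average of its K nearest neighbors' memberships."""
    new_p = {}
    for x, neighbors in knn_weights.items():
        k = len(p[x])
        new_p[x] = [sum(w * p[y][i] for y, w in neighbors) for i in range(k)]
    return new_p

# Illustrative: object "b" sits between supports "a" and "c" with equal weights
p = {"a": [1.0, 0.0], "b": [0.5, 0.5], "c": [0.0, 1.0]}
knn = {"a": [("a", 1.0)], "b": [("a", 0.5), ("c", 0.5)], "c": [("c", 1.0)]}
print(flame_update(p, knn)["b"])  # stays an equal mix of both clusters
```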
Finally, based on the fuzzy memberships the clusters can be represented in two ways: one gene to one cluster, or one gene to multiple clusters. FLAME results are given as line plots showing three patterns: (1) each cluster, including its CSO and the rest of its members; (2) all outliers; and (3) all CSOs. The most important is type (1), which shows the expression patterns for each cluster in a separate line plot. FLAME is integrated with Gene Expression Data Analysis Studio (GEDAS) (http://sourceforge.net/projects/gedas).
3.4 Fuzzy C-Means (FCM)
FCM is the most widely used fuzzy clustering algorithm; it was developed by Dunn in 1973 [32] and improved by Bezdek in 1981 [33]. It is based on the minimization of the following objective function:
\[
J = \sum_{i=1}^{c} J_i = \sum_{i=1}^{c} \sum_{k,\, x_k \in c_i} u_{ki}^{m} \, d\!\left(x_k - c_i\right), \quad 1 \le m < \infty \tag{9}
\]
where ci is the centroid of cluster i, d(xk − ci) is the distance between
ith centroid (ci) and kth input vector xk, c is the total number of clus-
ters, m is a fuzziness parameter which can be any real number greater
than 1, and uki is the degree of membership of xk in cluster i.
FCM is a clustering method that allows data to belong to more than one cluster with a degree of membership uki. The FCM algorithm starts with a predefined number of clusters in which each datum has an equal degree of belonging to every cluster. Fuzzy partitioning is carried out through an iterative optimization of the objective function shown above, with updates of the memberships uki and cluster centers ci. The iteration stops when a termination criterion is met. FCM is integrated with Gene Expression Data Analysis Studio (GEDAS).
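The objective in Eq. 9 can be evaluated as sketched below; the sketch uses squared Euclidean distance for d and sums every membership-weighted distance over all points and clusters (the standard FCM formulation), and the data, centroids, and memberships are illustrative values:

```python
def fcm_objective(data, centroids, memberships, m=2.0):
    """Eq. 9: J sums, over clusters i and points k, the membership u_ki raised
    to the fuzziness m times the (here squared Euclidean) distance d(x_k, c_i)."""
    J = 0.0
    for i, c in enumerate(centroids):
        for k, x in enumerate(data):
            dist_sq = sum((xv - cv) ** 2 for xv, cv in zip(x, c))
            J += (memberships[k][i] ** m) * dist_sq
    return J

# Illustrative 1-D data; each point belongs mostly to its nearest centroid
data = [[0.0], [1.0]]
centroids = [[0.0], [1.0]]
u = [[0.9, 0.1], [0.1, 0.9]]
print(round(fcm_objective(data, centroids, u), 4))  # small J for a good partition
```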
Each of the selected methods allows the user to adjust parameters. We set the parameters for each method as follows. SOM, based on correlation distance and batch learning, was run in the Matlab (http://www.mathworks.com/) Neural Networks toolbox, followed by an in-house algorithm implementing Ward clustering to define clusters. ESOM based on correlation distance
4 Results and Discussion
4.1 Gene Clustering Based on the 71 and 93 Genes by SOM, ESOM, FCM, and FLAME
In this section, the results from clustering the small subsets of 71 and 93 genes (extracted from the two intrinsic gene sets of 456 and 534, respectively) using SOM, ESOM, FLAME, and FCM are presented, highlighting how these two small gene subsets support the identified cancer subtypes, as assessed by their clustering patterns under these clustering algorithms. As highlighted earlier, the 71-gene subset from the 456 genes represented four subtypes: HER2 (refers to ERBB2+/HER2), Basal-like, Normal (i.e., Normal breast-like), and Luminal. The 93-gene subset from the 534 genes represented five subtypes: HER2, Basal-like, Normal, Luminal A, and Luminal B.
SOM was developed and the map was clustered using the Ward
method to find optimum clusters. Results of clusters and corre-
sponding Ward index are shown in Fig. 3. Maximum Ward likeli-
hood index for both datasets as evidenced by the largest Ward
likelihood index indicates that the optimum number of clusters in
each case is three. Table 3 shows details of each SOM-based gene
cluster in terms of the number of genes from the small gene subsets
of 71 or 93 represented in each cluster against the number of genes
in the original subtypes.
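The Ward clustering step applied to the SOM can be sketched with standard hierarchical-clustering routines. The following Python/SciPy example (an illustration on synthetic stand-in prototype vectors, not the in-house Matlab algorithm used here) applies Ward linkage and cuts the tree at three clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Stand-in for trained SOM prototype (weight) vectors: three tight groups in 5-D
prototypes = np.vstack([
    rng.normal(0.0, 0.1, (20, 5)),
    rng.normal(1.0, 0.1, (22, 5)),
    rng.normal(2.0, 0.1, (22, 5)),
])

Z = linkage(prototypes, method="ward")             # Ward minimum-variance merges
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the dendrogram at 3 clusters
print(sorted(np.bincount(labels)[1:].tolist()))    # sizes of the recovered clusters
```

In practice one would cut the tree at several values of t and compare a criterion such as the Ward likelihood index, as done for Fig. 3.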
The fact that both gene sets resulted in three clusters indicates
the strong likelihood that not all five subtypes are distinct. The
members in each SOM cluster for the first dataset (Table 3a) rep-
resent two or more subtypes. Cluster 1 is divided between Luminal
and Normal breast-like (hereafter referred to as either Normal-like
or Normal), which implies that Luminal and Normal-like have
some similarity in gene expression patterns. In contrast, the
93-gene subset extracted from the 546 “intrinsic” genes shows
clear separation of Luminal A into cluster 1 (Table 3b). However,
Table 3b indicates a weak similarity between Basal-like and Normal-
like in its cluster 2, which is predominantly Basal-like. Similarly, cluster 2 in Table 3a is predominantly Basal (13/19), with some Normal-like (3/14) genes and one gene of the HER2 (1/11) subtype.
298 Sandhya Samarasinghe and Amphun Chaiboonchoe
Fig. 3 Ward likelihood index for various numbers of clusters: (a) for 71 genes from the 456 “intrinsic” gene set; (b) for 93 genes from the 546 “intrinsic” gene set (x-axis: number of clusters; y-axis: Ward likelihood index). The larger the likelihood index, the more likely that number of clusters
Table 3
Proportion of genes belonging to the original subtypes found in each SOM cluster (numerator is the
number of genes belonging to a particular subtype found in an SOM cluster and the denominator is
the total number of genes belonging to that subtype)
Fig. 4 ESOM: P-Matrix (density-based) and 3D view of the P-Matrix for the two datasets. White dots represent the best-match (closest) vector for each data point. Red indicates higher density and blue lower density of input vectors
Fig. 5 Mapping of three cancer subtypes (Basal-like, HER2, and Normal-like, as found by the original authors from the 93-gene subset) onto ESOM; labeled genes include SMARCE1, KIAA0857, and MEN1
Table 4
Clustering results from the three methods: SOM, FLAME, and FCM using 71 and 93 genes (number in
brackets refers to number of genes from the original subtypes from Sorlie et al. [7, 8])
all but Normal-like and Basal-like genes were put into distinct
clusters. Specifically, some (60 %) Normal-like genes were clustered
with Basal-like and the rest of the Normal-like genes were consid-
ered as outliers (see Table 5b). Note though that the Normal-like
subset now has only five genes and this may have affected the results.
SOM (Table 5a) dedicated one cluster (cluster 1) to Luminal A;
and another (cluster 2) to Basal-like and Normal-like genes but put
roughly 25 % of genes in these subtypes in cluster 3 where all
Luminal B and HER2 genes have also been placed. Thus cluster 3
represents four subtypes with the majority (58 %) of the cluster
being genes from Luminal B followed by HER2 (26 %), Basal-like
(14 %) and only 2 % of Normal-like genes. Overall, FLAME shows more definite support for the original subtypes than SOM, with Lum A, Lum B, and HER2 in separate clusters and Basal and Normal-like sharing one cluster. This shows that FLAME’s choice of four clusters, as opposed to the original authors’ five subtypes, is based on
Table 5
Percentage of genes from each subtype found in each cluster of (a) SOM/Ward, (b) FLAME, and (c)
FCM based on the 93-gene subset
Subtype                | FLAME: Pearson, 4 clusters       | FLAME: Squared Pearson, 4 clusters                  | FCM: Pearson, 5 clusters                                | FCM: Pearson, 4 clusters             | FCM: Squared Pearson, 5 clusters
HER2 (11)              | Cluster 2 (11/11)                | Cluster 2 (11/11)                                   | Cluster 0 (11/11)                                       | Cluster 0 (11/11)                    | Cluster 0 (11/11)
Basal-like (24)        | Cluster 1 (24/27)                | Cluster 0 (13/46) + cluster 3 (11/12)               | Cluster 1 (8/8) + cluster 2 (15/27) + cluster 3 (1/26)  | Cluster 2 (23/28) + cluster 3 (1/26) | Cluster 2 (23/27) + cluster 4 (1/21)
Normal (5)             | Cluster 1 (3/27) + outlier (2/2) | Cluster 0 (3/46) + cluster 3 (1/12) + outlier (1/1) | Cluster 2 (5/27)                                        | Cluster 2 (5/28)                     | Cluster 2 (23/27) + cluster 4 (1/21)
Luminal subtype A (28) | Cluster 3 (28/28)                | Cluster 0 (28/46)                                   | Cluster 2 (7/27) + cluster 3 (1/26) + cluster 4 (20/21) | Cluster 1 (28/28)                    | Cluster 1 (8/9) + cluster 4 (20/21)
Luminal subtype B (25) | Cluster 4 (25/25)                | Cluster 0 (2/46) + cluster 1 (23/23)                | Cluster 3 (24/26) + cluster 4 (1/21)                    | Cluster 3 (25/26)                    | Cluster 3 (25/25)
The numbers within brackets are genes from the original author’s (Sorlie et al. [8]) subtype found in each cluster (numerator) and the total number of genes in that cluster
(denominator)
4.2 Classification of Subtypes: Clustering Patients Using SOM and FCM

Another useful investigation that can shed light on the selected gene subsets is clustering patients. Therefore, we conducted an SOM and FCM cluster analysis of patients based on the 93-gene subset extracted from the 534 intrinsic genes [8] and the 71-gene subset extracted
from the 306 intrinsic genes [27] (see Table 2, last column). The lat-
ter set has been proposed from a combined dataset from three sources
including the one that produced the above 534 intrinsic genes. The
results from the previous section show that each clustering method
produces one or more clusters that contain genes from one or more
subtypes. Fuzzy clustering is a soft clustering approach that allows a
measure of the strength of membership of an item in each cluster and
we utilize this aspect of FCM more in this patient clustering.
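How membership strengths can be exploited is illustrated below with an invented membership matrix: patients whose two strongest memberships are nearly equal can be flagged as shared between clusters. This is a sketch for illustration, not the analysis performed in the chapter:

```python
import numpy as np

# Hypothetical FCM membership matrix: rows = patients, columns = clusters
U = np.array([
    [0.92, 0.05, 0.03],   # clearly cluster 0
    [0.48, 0.45, 0.07],   # shared between clusters 0 and 1
    [0.10, 0.15, 0.75],   # clearly cluster 2
])

hard = U.argmax(axis=1)                       # conventional hard assignment
top2 = np.sort(U, axis=1)[:, -2:]             # two strongest memberships per patient
ambiguous = (top2[:, 1] - top2[:, 0]) < 0.10  # small margin => "shared" patient
print(hard.tolist(), ambiguous.tolist())      # → [0, 0, 2] [False, True, False]
```

The 0.10 margin is an arbitrary illustrative threshold; a domain-appropriate cutoff would be chosen in practice.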
4.2.1 SOM Clustering of Patients

Gene expression vectors, each containing expression results for all 93 genes (from the 534 intrinsic genes) for each of the 79 patients
were clustered using SOM and results are shown in Fig. 6. The first
map Fig. 6a (left) indicates the number of patients (hits) repre-
sented by each neuron, and Fig. 6a (right) shows the average dis-
tance between neurons where lighter color indicates smaller
distances and darker color indicates larger distances between neu-
rons. Such a distance map allows visualization of potential clusters.
Dark regions indicating larger gaps appear in at least three places, and these separate the samples into at least three groups.
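The idea behind such a distance map (U-matrix) can be sketched as follows: for each neuron, average the distances between its weight vector and those of its grid neighbours, so that large values mark cluster borders. This is a minimal NumPy illustration on a hand-made toy weight grid, not the Matlab toolbox routine:

```python
import numpy as np

def distance_map(weights):
    """Mean distance from each SOM neuron's weight vector to its grid neighbours.

    weights: array of shape (rows, cols, dim). Larger values mark cluster borders.
    """
    rows, cols, _ = weights.shape
    umat = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    dists.append(np.linalg.norm(weights[r, c] - weights[rr, cc]))
            umat[r, c] = np.mean(dists)
    return umat

# Toy 4x4 map: left half near 0, right half near 1 -> a "dark" border in the middle
w = np.zeros((4, 4, 3))
w[:, 2:, :] = 1.0
print(distance_map(w).round(2))
```

The two middle columns get nonzero values (the border between the two regions), while the homogeneous outer columns stay at zero.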
Fig. 6 (a) Projection of all patients (subtypes) onto the trained SOM based on the 93-gene subset (left), and U-matrix (right). (b) SOM sample hits (patients) for each “intrinsic” subtype according to the 93-gene subset from the original study (79 patients in total)
In an SOM, inputs that are similar are organized closer on the map, allowing one to see not only clusters but also biological similarity between different clusters
(subtypes). The SOM panels for the five subtypes show that most
patients of each subtype are organized in the same region; for
example, Normal-like patients are located at the bottom left-hand
corner of the map. Moreover, Normal-like is located opposite to
LumB; Basal-like is located opposite to Luminal A; and HER2 is
closer to Normal-like and Basal-like in that these three subtypes
together occupy a triangular region in the bottom right part of the
map. LumA and LumB occupy the top left triangle on the map.
This corresponds to classification of breast cancer molecular sub-
types based on the estrogen receptor (ER) status: ER-positive con-
sists of two main subtypes (Luminal A and B) and ER-negative
consists of three subtypes (Her-2, Basal-like, and Normal-like).
Next, we investigated whether the 71-gene subset extracted
from the 306 intrinsic gene set provides better accuracy in classifi-
cation of subtypes; results are shown in Fig. 7. There are 248
patients in this dataset. SOM produced similar results to those with
the previous dataset indicating similar relative positioning of sub-
types in the SOM panels. For example, Normal-like, Basal-like,
and HER2 subtypes occupy the top left region of the map and
LumA and LumB occupy the bottom left region. This reversal of
space occupied by subsets is not an issue as relative position is what
gives a map its meaning. In this case, using different intrinsic genes does not alter the final SOM result.
In an attempt to find final clusters on SOM, Ward clustering of
SOM neurons provided the Ward index vs. number of clusters rela-
tion for the 93 and 71 gene signatures as shown in Fig. 8. The opti-
mum number of clusters for these two signatures is four and two
clusters, respectively. This indicates that the 93 gene signature has
greater discriminating ability than the 71 gene signature that sees
more similarity between the subtypes leading to only two optimum
clusters. A summary of clustering results for the two gene sets is given in Table 7, which indicates that the 93 gene signature distinguished
the 28 LumA patients but put some Lum B patients with them in
the same cluster. It also separated most (90 %) of Normal-like and
Basal-like into separate clusters. It sees all HER2 patients together in
a cluster but it is shared with 60 % of LumB patients and a small
number of Basal-like and Normal-like patients. In contrast, the 71-gene subset cannot distinguish between LumA and B; it identifies all 64 Basal-like patients in one cluster, but that cluster also contains over 50 % of Normal-like and HER2 patients. The rest of the Normal-like and HER2
patients are clustered with LumA and B. Clearly, the 93 gene signa-
ture shows greater discriminating ability in this case.
4.2.2 FCM Clustering of Patients

In this section, we cluster the patients based on the 93 and 71 gene signatures using FCM in order to ascertain the efficacy of this clustering method on the selected datasets. Results are shown in
Fig. 7 SOM sample hits (patients) for each “intrinsic” subtype according to the small subset of 71 genes from
the 306 “intrinsic” genes (248 patients in total)
Table 8, where sections (a) and (b) are for the 93 gene signature and sections (c) and (d) are for the 71 gene signature. Table 8a, c show original
samples (patient IDs), FCM clusters, and samples in each of these
clusters. In Table 8b, d, FCM clusters are compared with the origi-
nal classification. For the 93-gene set, one FCM cluster has 100 %
members belonging to the Normal-like group, while other clusters
contain one or more subtypes. Cluster 1 has only LumA patients
Fig. 8 Ward likelihood index for various numbers of clusters of map neurons: (a) for the 93-gene subset extracted from the 546 “intrinsic” genes; (b) for the 71-gene subset extracted from the 306 “intrinsic” genes (x-axis: number of clusters; y-axis: Ward likelihood index). The larger the likelihood index, the more likely that number of clusters
Table 7
Members (patients) in each SOM cluster according to subtype based on: (a) 93-gene subset and (b)
71-gene subset
Patient cluster 1 2 3 4
(a) 93-gene subset from the 546 “intrinsic” genes
HER2 (11 patients) 11
Basal-like (19 patients) 17/19 2/19
Normal (10 patients) 9/10 1/10
Luminal A (28 patients) 28/28
Luminal B (11 patients) 4/11 1/11 6/11
(b) 71-gene subset from the 306 “intrinsic” genes
HER2 (29 patients) 15/29 14/29
Basal-like (64 patients) 64/64
Normal (19 patients) 11/19 8/19
Luminal A (90 patients) 90/90
Luminal B (46 patients) 46/46
Table 8
FCM (with fuzziness degree 1.2) classification results for two selected datasets: (a) and (b) for 93 genes from the 534 “intrinsic” genes; (c) and (d) for 71 genes from the 306 “intrinsic” genes
(a) Sample allocation from FCM (based on 93-gene subset from the 534 “intrinsic” genes)
(b) Percentage of patients found in each cluster according to subtype (based on 93 genes from 534
“intrinsic” genes)
Cluster 1 (%) Cluster 2 (%) Cluster 3 (%) Cluster 4 (%) Cluster 5 (%)
LumA (28) 57.15 42.85
LumB (11) 9.09 81.82 9.09
HER2 (11) 100
Basal (19) 5.27 94.73
Normal (10) 100
(c) Sample allocation from FCM (based on 71 genes from 306 “intrinsic” genes)
(d) Percentage of patients found in each cluster according to subtype (based on 71 genes from 306
“intrinsic” genes)
Fig. 9 SOM sample hits for each FCM cluster according to the subset of 93 genes from the 546 intrinsic genes; panels (a) Basal, (b) HER2, and (c) Luminal
5 Summary and Conclusions
Acknowledgements
References
1. Carey LA (2010) Through a glass darkly: advances in understanding breast cancer biology, 2000–2010. Clin Breast Cancer 10(3):188–195
2. Stadler ZK, Come SE (2009) Review of gene-expression profiling and its clinical use in breast cancer. Crit Rev Oncol Hematol 69(1):1–11
3. Marchionni L et al (2008) Systematic review: gene expression profiling assays in early-stage breast cancer. Ann Intern Med 148(5):358–369
4. Desmedt C, Ruiz-Garcia E, André F (2008) Gene expression predictors in breast cancer: current status, limitations and perspectives. Eur J Cancer 44:2714–2720
5. Uma RS, Rajkumar T (2007) DNA microarray and breast cancer – a review. Int J Human Genet 7(1):49–56
6. Perou C et al (2000) Molecular portraits of human breast tumours. Nature 406(6797):747–752
7. Sorlie T et al (2001) Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci 98(19):10869–10874
8. Sorlie T et al (2003) Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc Natl Acad Sci 100(14):8418–8423
9. Weigelt B et al (2005) Molecular portraits and 70-gene prognosis signature are preserved throughout the metastatic process of breast cancer. Cancer Res 65(20):9155–9158
10. Henshall S (2005) Review of classification of human breast cancer using gene expression profiling as a component of the survival predictor algorithm. Breast Cancer Online 7(12)
11. Loi S (2008) Review of comparison of gene expression profiles predicting progression in breast cancer patients treated with tamoxifen. Breast Cancer Online 11(12)
12. van’t Veer LJ et al (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature 415(6871):530–536
13. van de Vijver MJ et al (2002) A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med 347(25):1999–2009
14. West M et al (2001) Predicting the clinical status of human breast cancer by using gene expression profiles. Proc Natl Acad Sci 98(20):11462
15. Loi S et al (2005) Breast cancer gene expression profiling: clinical trial and practice implications. Pharmacogenomics 6(1):49–58
16. Hess KR et al (2006) Pharmacogenomic predictor of sensitivity to preoperative chemotherapy with paclitaxel and fluorouracil, doxorubicin, and cyclophosphamide in breast cancer. J Clin Oncol 24(26):4236–4244
17. Naderi A et al (2006) A gene-expression signature to predict survival in breast cancer across independent data sets. Oncogene 26(10):1507–1516
18. Wirapati P et al (2008) Meta-analysis of gene expression profiles in breast cancer: toward a unified understanding of breast cancer subtyping and prognosis signatures. Breast Cancer Res 10(4):R65
19. Weigelt B et al (2012) Challenges translating breast cancer gene signatures into the clinic. Nat Rev Clin Oncol 9(1):58–64
20. Reis-Filho JS, Pusztai L (2011) Gene expression profiling in breast cancer: classification, prognostication, and prediction. Lancet 378(9805):1812–1823
21. Arpino G et al (2013) Gene expression profiling in breast cancer: a clinical perspective. Breast 22(2):109–120
22. Weigelt B, Peterse JL, van’t Veer LJ (2005) Breast cancer metastasis: markers and models. Nat Rev Cancer 5(8):591–602
23. Kohonen T (1997) Exploration of very large databases by self-organizing maps. In: Proceedings of the International Conference on Neural Networks, vol 1, pp PL1–PL6
24. Ultsch A, Mörchen F (2005) ESOM-Maps: tools for clustering, visualization, and classification with Emergent SOM. Technical report 17. Data Bionics Research Group, University of Marburg, Marburg
25. Fu L, Medico E (2007) FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data. BMC Bioinformatics 8(1):3
26. Dembele D, Kastner P (2003) Fuzzy C-means method for clustering microarray data. Bioinformatics 19(8):973–980
27. Hu Z et al (2006) The molecular portraits of breast tumors are conserved across microarray platforms. BMC Genomics 7(1):96
28. Sotiriou C et al (2003) Breast cancer classification and prognosis based on gene expression profiles from a population-based study. Proc Natl Acad Sci U S A 100(18):10393–10398
29. Samarasinghe S (2006) Neural networks for applied sciences and engineering: from fundamentals to complex pattern recognition. CRC Press, Boca Raton, FL
30. Ward JH (1963) Hierarchical grouping to optimize an objective function. J Am Stat Assoc 58(301):236–244
31. Ultsch A, Mörchen F (2005) ESOM-Maps: tools for clustering, visualization, and classification with Emergent SOM. Technical report 46. CS Department, University Marburg, Marburg
32. Dunn JC (1973) A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J Cybern 3(3):32–57
33. Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Kluwer, Baltimore, MD, p 256
34. Dembélé D, Kastner P (2003) Fuzzy C-means method for clustering microarray data. Bioinformatics 19(8):973–980
35. Ozkan I, Turksen IB (2007) Upper and lower values for the level of fuzziness in FCM. Inform Sci 177(23):5143–5152
36. Van De Vijver MJ et al (2002) A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med 347(25):1999–2009
37. Huang E et al (2003) Gene expression predictors of breast cancer outcomes. Lancet 361(9369):1590–1596
38. Wang Y et al (2005) Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 365(9460):671–679
39. Chang HY et al (2005) Robustness, scalability, and integration of a wound-response gene expression signature in predicting breast cancer survival. Proc Natl Acad Sci U S A 102(10):3738–3743
40. Chang HY et al (2004) Gene expression signature of fibroblast serum response predicts human cancer progression: similarities between tumors and wounds. PLoS Biol 2(2):E7
41. Perreard L et al (2006) Classification and risk stratification of invasive breast carcinomas using a real-time quantitative RT-PCR assay. Breast Cancer Res 8(2):R23
42. Parker JS et al (2009) Supervised risk predictor of breast cancer based on intrinsic subtypes. J Clin Oncol 27(8):1160–1167
43. Bild AH et al (2009) An integration of complementary strategies for gene-expression analysis to reveal novel therapeutic opportunities for breast cancer. Breast Cancer Res 11(4):55
Chapter 19
Abstract
Quantitative Structure–Activity Relationships (QSARs) and Quantitative Structure–Property Relationships
(QSPRs) are mathematical models used to describe and predict a particular activity/property of com-
pounds. On the other hand, the Artificial Neural Network (ANN) is a tool that emulates the human brain
to solve very complex problems. The ever-growing need for new compounds in the drug industry requires alternatives to experimental methods to decrease development time and costs. This is where chemical
computational methods have a great relevance, especially QSAR/QSPR-ANN. This chapter shows the
importance of QSAR/QSPR-ANN and provides examples of its use.
Key words Artificial neural networks, Quantitative structure–activity relationship, Quantitative structure–
property relationship, Principal component analysis
1 Introduction
The first Neural Network (NN) model goes back to 1943 when it
was proposed by Warren McCulloch (1898–1969), a neurophysi-
ologist, and Walter Pitts (1923–1969), a mathematician. The first attempt to use NNs in computers was made in 1956 by a group of IBM researchers and neuroscientists from McGill University in
Canada. Neuroscience is not the only field that has contributed to the development of Artificial Neural Networks (ANN). For instance, the
network unit known as a Perceptron was proposed in 1957 by
Frank Rosenblatt (1928–1971), a psychologist, and is still in use
due to its simplicity.
In addition, the development of computer science has followed the path laid out by the mathematician John von Neumann. In 1947,
he designed a structure based on sequential processing of data and
instructions. This structure rigorously follows a sequence defined
by data stored in memory. Von Neumann’s architecture is based on
the logic of procedures, normally used for partial solutions in some
problems to link the specific problem with the result. Although the
Hugh Cartwright (ed.), Artificial Neural Networks, Methods in Molecular Biology, vol. 1260,
DOI 10.1007/978-1-4939-2239-0_19, © Springer Science+Business Media New York 2015
320 Narelle Montañez-Godínez et al.
2 Methodology
2.1 Software

In most examples of the use of ANN in QSAR, the methodology uses quantum-level calculations to optimize the molecular geometries of the data set, using semi-empirical or Density Functional Theory methods to generate descriptors, followed by selection among them to generate a new set of descriptors. The principal software tools used are Gaussian [7], MOPAC [8], and Hyperchem [9]. Numerical analysis can be performed with MATLAB [10] for linear regressions, an ANN for nonlinear regression, and SPSS [11] for multilinear regressions. Furthermore, Dragon [12] software is used in the calculation of descriptors.
QSAR/QSPR 325
3 QSAR/QSPR
For both financial and social reasons, the drug industry would like
to develop new drugs in much shorter times than is possible using
a purely experimental (lab-based) approach. A large number of
novel compounds are synthesized each year. The number of
molecules that can be obtained by organic synthesis is so high that the probability of randomly choosing a molecule with the desired biological activity is practically nil. In this regard, medicinal chem-
istry techniques that allow discovery and optimization of new
templates are used. However, a large fraction of these compounds
are not tested for physicochemical properties or biological activi-
ties, which still remain unknown owing to unavailability of the
compounds or possible risks of toxicity. A method able to predict,
within a realistic error boundary, the biological activities of untested
compounds is necessary to evaluate these molecular features in a rapid and inexpensive way.
Among these methods, computer-aided design has found its way into the rational design of new compounds; the most important
practice is known as Quantitative Structure–Activity Relationship
(QSAR). In recent years, numerous quantitative structure–activity/property relationship (QSAR/QSPR) models have been introduced for calculating biological activities from a variety of
numerical descriptors of chemical structures. These relationships
develop correlations between the descriptors of individual com-
pounds and their biological activity/chemical properties. This
approach has proved especially productive when coupled with solid
optimizing tools that help to narrow down the field of potential
molecules to just the most promising candidates, namely, Genetic
Algorithms, Ant Colony Optimization, Support Vector Machines,
and Artificial Neural Networks.
An inevitable difficulty, when dealing with molecular descrip-
tors, is that of collinearity, which frequently exists between inde-
pendent variables. In the worst case, this creates a serious problem in certain types of mathematical treatment, such as matrix inversion
[13], though collinearity of molecular descriptors is not normally
so complete that matrix inversion will fail. A better predictive
model can be obtained by orthogonalization of the variables by
means of principal component analysis (PCA) and the subsequent
method is called principal component regression (PCR) [14].
In order to decrease the dimensionality of the independent vari-
able space, a restricted number of principal components (PCs) are
used. For this reason, the choice of significant and informative PCs
is an important task in PCA–based calibration methods [15, 16].
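The PCA/PCR procedure can be sketched in a few lines of NumPy on synthetic, nearly collinear descriptors (all data here are invented for illustration; the PCA step uses the SVD of the centred descriptor matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)       # nearly collinear descriptor
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + 0.5 * x3 + rng.normal(scale=0.1, size=n)   # synthetic "activity"

# PCA via SVD of the centred descriptor matrix
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2                                          # keep the 2 informative PCs
scores = Xc @ Vt[:k].T                         # orthogonal regressors

# Principal component regression: ordinary least squares on the scores
A = np.column_stack([np.ones(n), scores])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ coef
r2 = 1 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(round(r2, 3))
```

Dropping the third, near-degenerate component removes the collinearity while losing almost no explanatory power.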
Diverse methods have been proposed to select the important
PCs for calibration purposes. The simplest and most frequent one
is a top–down variable selection where the factors are ranked in the
[Figure: modeling workflow — Data, Features, Learning, Model Validation]
to their binding sites and rate constants, with either structural fea-
tures or with atomic group or molecular properties such as solubil-
ity, lipophilicity, electronic effects, ionization, and stereochemistry.
The structural features were proposed by Free and Wilson [18]
and the use of molecular properties was proposed by Hansch and
Fujita [19], both published independently in 1964. QSAR has
been important in the integration of ADMET (absorption, distri-
bution, metabolism, excretion, and toxicity) profiling and predic-
tion, which is mostly dependent on a type of molecular descriptors
as described by the Lipinski Rule of 5. This procedure is applied to
improve our understanding of structure–activity links, to identify
hazardous compounds at low cost, and to reduce the number of
animals used for experimental toxicity testing. In the field of QSAR
and in the prediction of animal toxicity, various artificial intelli-
gence techniques have been proposed and developed such as ANN
and statistical understanding of knowledge networks [20–22].
Different tools have been used to predict the activity of molecules that may present some biological activity: MLR, nonlinear regression using ANN, and molecular simulations based on molecular recognition (docking) are all used in QSAR [23].
The calculation of chemical descriptors is an important tool for
QSAR applications, because it makes possible prediction of toxic
properties, harmful characteristics, and pharmacological biological
properties that a group of molecules could present. Many chemical
descriptors can be measured experimentally, but this is costly and time-consuming; development of QSAR tools is therefore useful to replace or augment measurement.
Most QSAR/QSPR practice uses quantum-mechanical descriptors.
QSAR analysis uses a variety of descriptors including:
● Topological,
● Quantum chemical,
● Galvez topological charge index,
● 2D-autocorrelation,
● Radial Distribution Functions (RDF),
● 3D-MoRSE (Molecule Representation of Structures based
on Electron diffraction),
● Weighted Holistic Invariant Molecular (WHIM),
● GEometry Topology and Atom-Weight AssemblY
(GETAWAY), and so on [17, 24].
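As a concrete, minimal example of a topological descriptor, the classic Wiener index (a member of this family, though not named in the list above) can be computed from nothing more than the molecular graph; the adjacency matrix below is the carbon skeleton of n-butane:

```python
import numpy as np

def wiener_index(adj):
    """Wiener index: sum of shortest-path bond distances over all atom pairs."""
    n = len(adj)
    dist = np.where(adj, 1.0, np.inf)          # direct bonds have distance 1
    np.fill_diagonal(dist, 0.0)
    for via in range(n):                       # Floyd–Warshall all-pairs shortest paths
        dist = np.minimum(dist, dist[:, [via]] + dist[[via], :])
    return int(dist[np.triu_indices(n, k=1)].sum())

# n-butane carbon skeleton: C1-C2-C3-C4 (a path graph)
butane = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
print(wiener_index(butane))   # 10 for n-butane
```

Branching lowers the index: the same four carbons arranged as isobutane (a star graph) give 9, which is how such indices encode shape.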
3.1 QSAR-ANN Application

In the literature, we can find work that reports applications in which only QSAR has been used, e.g., a QSAR model that describes the antispasmodic activity of molecules isolated from Mexican Medicinal Flora and some synthetic compounds based on stilbenoid bioisosteres [25]. Equally, there are reports of applications that use only
[Figure: calculated vs. experimental log Km scatter plots — (a) MLR-M86A: training (86 molecules), predicted (20 molecules); (b) ANN-M66B: training (66 molecules), predicted (40 molecules); outlying compounds labeled with NIH identifiers]
Fig. 6 Examples of data sets calculated by Multiple Linear Regression (solid line), ideal line (dashed line), and residual line limits (dotted line, cutoff = |1.0|). These were the best ANN calculated results [32]
Currently, QSAR analysis has not reached a level at which the pre-
dictive power is as good as the descriptive power, but when ANN
is included in the models, the predictive power is improved, com-
pared to a model that uses only MLR. With better computers and models, predictions will move closer to reality.
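The MLR-versus-ANN contrast can be illustrated on a synthetic example: a linear model cannot fit an "activity" that is the product of two descriptors, while a small one-hidden-layer network trained by gradient descent can. Everything below (data, network size, learning rate) is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 2))
y = X[:, 0] * X[:, 1]                      # nonlinear "activity": invisible to a linear model

def r2(y_true, y_pred):
    return 1 - ((y_true - y_pred) ** 2).sum() / ((y_true - y_true.mean()) ** 2).sum()

# Multiple linear regression baseline (intercept + two descriptors)
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
r2_mlr = r2(y, A @ coef)

# One-hidden-layer tanh network trained by full-batch gradient descent
W1 = rng.normal(0, 0.5, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, 8);      b2 = 0.0
lr = 0.1
for _ in range(6000):
    H = np.tanh(X @ W1 + b1)               # hidden activations
    err = (H @ W2 + b2) - y                # residuals
    gW2 = H.T @ err / len(X); gb2 = err.mean()
    dH = np.outer(err, W2) * (1 - H ** 2)  # backprop through tanh
    gW1 = X.T @ dH / len(X); gb1 = dH.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
r2_ann = r2(y, np.tanh(X @ W1 + b1) @ W2 + b2)
print(round(r2_mlr, 3), round(r2_ann, 3))
```

The linear fit explains almost none of the variance, while the small network recovers most of it, mirroring the descriptive-versus-predictive gap discussed above.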
Acknowledgement
References
1. de la Fuente Aparicio Ma J, Calonge Cano T (1999) Aplicaciones de las redes de neuronas en supervisión, diagnosis y control de procesos. Universidad Simón Bolívar, Caracas-Venezuela, p 11
2. Brégains JC (2007) Análisis síntesis y control de diagramas de radiación generados por distribuciones continuas y discretas utilizando algoritmos estadísticos y redes neuronales artificiales aplicaciones. Universidad de Santiago de Compostela, Galicia-Spain, p 104
3. Flores López R, Fernández Fernández JM (2008) Las redes neuronales artificiales: Fundamentos teóricos y aplicaciones prácticas. Netbiblo, S.L., Spain, pp 12–14, 17, 21
4. Pino Diez R, Gómez Gómez A, de Abajo Martínez N (2001) Introducción a la inteligencia artificial: Sistemas expertos, redes neuronales artificiales y computación evolutiva. Universidad de Oviedo, Asturias-Spain, pp 27, 28
5. Douali L, Villemin D, Cherqaoui D (2003) Neural networks: accurate nonlinear QSAR model for HEPT derivatives. J Chem Inf Comput Sci 43:1200–1207
6. Andrews R, Diederich J, Tickle AB (1995) Survey and critique of techniques for extracting rules from trained artificial neural networks. Knowl Base Syst 8:373–389
7. Frisch MJ, Trucks GW, Schlegel HB et al (2009) Gaussian 09. Gaussian, Inc., Wallingford, CT
8. Stewart JJP (2009) MOPAC 2009. Fujitsu Limited, Tokyo, Japan
9. HyperChem(TM) 2003 Professional 7.51. Hypercube, Inc., 1115 NW 4th Street, Gainesville, Florida 32601, USA
10. MATLAB Version 7.2.0.232 (R2006a) (2006) The MathWorks, Inc., Natick, MA
11. SPSS Version 13 for Windows (2004) Chicago, IL
12. Todeschini R, Consonni V, Pavan M (2002) Dragon Software version 2.1-2002, Pisani 13, Milano. Dragon Software and references therein
ARTIFICIAL NEURAL NETWORKS, METHODS IN MOLECULAR BIOLOGY
Index
Simulation ........... 90, 91, 93, 94, 99, 125, 207, 217, 221, 236, 319, 320, 325
Single layer perceptron (SLP) ........... 130, 133
6D-QSAR ........... 49, 50
SMILES ........... 48, 128, 129
Soft computing ........... 205–224
Sorting signal ........... 2–8, 12–14
SPSS ........... 322
STATISTICA ........... 54, 55, 58
Strand ........... 275
Structural factor ........... 17
Structure activity cliff ........... 121
Structure activity relationship (SAR) ........... 66, 67, 70, 79, 80, 120, 121, 125, 150, 158
Subtype classification ........... 288, 314
Support vector machine (SVM) ........... 54, 66, 68, 73, 77–80, 82, 124, 134–136, 174, 323
Symbolic regression ........... 37
Synergic effect ........... 8
Systems ........... 10, 22, 33, 53, 105, 125, 130, 176, 182, 205, 207, 219, 228, 229, 242, 261, 262, 272–274, 318, 319, 322
Systems biology ........... 33

T

TALOS ........... 18, 19, 21, 30
Training set ........... 42, 57, 59, 68, 69, 71–83, 95–97, 99, 111, 151, 152, 154, 155, 220, 232, 235, 237, 249, 251–253, 256, 276–278, 324
Training subset ........... 37
Transcription ........... 275
Translation ........... 275
Traversal ........... 230, 231, 242
True negative ........... 59, 60, 97, 113, 115
True positive ........... 59, 60, 97, 113, 197
Tuberculosis (TB) ........... 46, 275, 276, 278–282
Tunnel-junction ........... 261–267

U

Ultrasound ........... 195, 275
U-matrix ........... 293, 294, 307
Uniprot ........... 106, 107

V

Validation ........... 9, 19, 20, 42, 68–70, 72, 75, 78, 79, 84, 98, 104, 111–114, 116, 121, 150–152, 155, 157, 158, 168, 196, 197, 200–202, 252–254, 314, 327, 328
Validation subset ........... 37, 40
Vancomycin ........... 46
Variance ........... 113, 184, 202, 290, 293
Vesicle ........... 2–4, 6, 8, 10, 15
Virtual screening ........... 48, 50, 51, 55–61, 150,
Tantalum oxide ......................................... 262, 263, 265, 266 154–160, 275
Taxonomic group ..........................................................34, 35 Viscosity .......................................................................71, 75
Testing set ............................. 95, 97, 111, 112, 116, 150, 151 Vision memory .................................................................231
3D-QSAR .....................................49–52, 122, 150, 154, 156 Visual filter ............................................................... 232, 239
Threshold ....................24, 109, 113, 114, 116, 133, 158, 199, Visual navigation ......................................................227–243
201, 212, 215, 223, 232–234, 247, 248, 255, 256, Voice pattern recognition..................................................206
272, 277, 295, 296, 319 Voiceprint .......................... 207–213, 215, 218–221, 223, 224
Time-delayed inputs.........................................................183
Time series ............................................... 248, 249, 252–257 W
Tolerance .............................................................. 90, 95, 186
Ward clustering ................................................ 293, 296, 306
Top-down method............................................................200
Ward distance ...................................................................293
TOPKAT ...........................................................................83
Water-solubility .............................................. 70, 72–73, 120
Topological descriptors .................... 50, 51, 55, 71, 74, 80, 83
Watson .............................................................................173
Topological indices .............................................................51
Weight .........................80, 107, 116, 120, 129, 133, 134, 139,
TOPS-MODE ..................................................................51
171, 172, 175, 179, 184, 187, 196, 197, 200, 214,
Torsion angle .............................................. 18–21, 26, 28–30
220, 233, 237, 239, 241, 251–253, 273, 274, 276,
Toxicity ....................46, 47, 55, 56, 59, 60, 65–70, 76, 78–84,
279–281, 291–294
120, 121, 123, 323, 325, 326
Weka .............................................55, 58, 124, 129, 130, 145
Trafficking ...............................................2–4, 7, 8, 10, 11, 15
Window.....................22, 29, 51, 52, 137, 138, 141, 166–172,
Training .................................... 9, 37, 57, 68, 92, 95–97, 104,
175, 210, 218, 219
111–113, 133, 149, 167–168, 170, 172, 183, 196,
200–201, 220, 230, 235–239, 249, 252–253, 269, Y
291, 321
Training image ................................................. 230, 231, 233 Yeast ...............................................................................9, 14