Boinformatic Smell

APPLICABILITY DOMAIN AND MACHINE LEARNING MODEL - A
STRUCTURE-FUNCTION RELATIONSHIP OF FLAVOR AND FRAGRANCE
PRACTICAL WORK REPORT - STRUCTURE-FUNCTION RELATIONSHIP
LARISSA LOUREIRO MARTINS¹ AND JÉSSICA DE ASSIS SOFIATTI²

1-2. MSc Management of Flavor and Fragrance industry students at Université Côte d’Azur.
* Mailing address: larissa.loureiro8@icloud.com and jesofiatti@gmail.com
1. INTRODUCTION
Quantitative Structure-Activity Relationship (QSAR) models are mathematical models used to predict the physicochemical and biological properties of compounds by treating those properties as a function of descriptors found in an available database. This area of science allows scientists to be more accurate when designing drugs, predicting chemical reactions, or even predicting sweetness¹,².

2. METHODS
The proposed methodology was based on a review of the scientific literature on QSAR models, using database sites such as Scielo and Pubmed as sources, together with KNIME, a workflow system for life science research, and sweetness molecule databases. The objective was to gather information demonstrating the use of QSAR models in the Flavor field.
For this project, a database (DB) of sweetener compounds was used to build a machine learning model that predicts sweetness impact. The DB was processed with two objectives: studying its Applicability Domain (A), which consists of estimating how reliable the prediction for a molecule is by comparing that molecule with the ones in the DB, and setting up our Machine Learning Model (B), which was validated against experimental data.
A. Applicability Domain
For part A, the learning and test sets of the sweeteners DB were loaded into the software, and Molecule Type Cast nodes were used to obtain the molecules' fingerprints. In parallel, we used a Fingerprint Similarity node to compare both sets and establish the relation between them, as was previously done in PWI.
We also compared the chemical spaces of the test and learning sets to assess their similarity. The fingerprint information obtained for both sets was processed, concatenated, and then split into individual columns, one per Pubchem descriptor, using the Expand Bit Vector node. As in PWI, PCA Compute and PCA Apply nodes were added to reduce the data to three dimensions, and finally a 2D/3D Scatterplot node was added to plot the summarized information on a graph where we can visualize the distribution of the molecules from the two sets and evaluate their similarity.
B. Machine Learning Model
For part B, the same DB and the same Molecule Type Cast nodes as in part A were added. The next step was to add one Molecular Properties node to each DB, configured to generate the properties listed in Image 1. Linear Correlation and Correlation Filter nodes were then applied to the Learning Set data to calculate the correlation between the molecular properties and remove redundant columns (identical or almost identical information appearing twice), avoiding model bias.
Image 1. Chemical Properties
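The redundancy filter described in part B can be illustrated outside KNIME with a minimal Python sketch. It assumes a simple greedy rule (keep a column only if its absolute Pearson correlation with every column already kept stays at or below a threshold); this is not necessarily the strategy of the Correlation Filter node. The descriptor values, the 0.9 threshold, and the function names are hypothetical and are not taken from data_qsar.xlsx.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column has no linear correlation
    return cov / (sx * sy)


def correlation_filter(columns, threshold=0.9):
    """Greedy filter (assumed rule): keep a column only if |r| with every
    previously kept column is at or below the threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical descriptor values for five molecules (illustration only).
descriptors = {
    "element_count": [10, 20, 30, 40, 50],
    "bond_count": [9, 19, 29, 39, 49],  # moves in lockstep with element_count
    "xlogp": [1.0, -0.5, 2.0, 0.3, -1.2],
}

kept_columns = correlation_filter(descriptors)
print(kept_columns)
```

In this toy data the bond count duplicates the information in the element count, so it is the column that gets dropped, which mirrors the correlation between Element count, Bond count, and Molecular weight that a correlation matrix of such properties typically shows.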
C. Normalization and Cross-validation
To improve the model, a Normalizer node was added after the Molecular Properties node of the Learning set and linked to a Normalizer (Apply) node added to the Test set, so that both sets are scaled with the same parameters. This step normalizes the physical property data of the molecules.
Next, an X-Partitioner node was added to split the Learning set into smaller subsets. Each partition is used to build a linear regression model through the Linear Regression Learner and Regression Predictor nodes, with the X-Aggregator node combining the results of all partitions. Finally, the Linear Correlation node calculates the linear correlation between the experimental and predicted logS values for the Learning set.
Graphic 1. Chemical Space (Learning and Test set)
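The normalization and cross-validation steps of part C can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the KNIME workflow itself: it assumes min-max scaling, a single descriptor instead of the full set of molecular properties, and deterministic interleaved folds instead of the partitioning the X-Partitioner node may use. All data values and function names are hypothetical.

```python
def fit_minmax(values):
    """Learn min-max parameters on the learning set only."""
    lo, hi = min(values), max(values)
    return lo, (hi - lo) or 1.0  # guard against constant columns


def apply_minmax(values, params):
    """Reuse the learned parameters, as a Normalizer (Apply) step would."""
    lo, span = params
    return [(v - lo) / span for v in values]


def fit_line(x, y):
    """Ordinary least squares for y = a * x + b with one descriptor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((v - mx) ** 2 for v in x)
    a = sum((v - mx) * (w - my) for v, w in zip(x, y)) / sxx
    return a, my - a * mx


def cross_val_predict(x, y, k=3):
    """k-fold cross-validation: every point is predicted by a model
    trained without it, and the predictions are gathered in one list."""
    preds = [None] * len(x)
    for fold in range(k):
        test_idx = [i for i in range(len(x)) if i % k == fold]
        train_idx = [i for i in range(len(x)) if i % k != fold]
        a, b = fit_line([x[i] for i in train_idx], [y[i] for i in train_idx])
        for i in test_idx:
            preds[i] = a * x[i] + b
    return preds


# Hypothetical molecular weights and target values (illustration only).
learn_mw = [100.0, 150.0, 200.0, 250.0, 300.0, 350.0]
learn_y = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
test_mw = [125.0, 400.0]

params = fit_minmax(learn_mw)
learn_x = apply_minmax(learn_mw, params)
test_x = apply_minmax(test_mw, params)  # values may fall outside [0, 1]

cv_preds = cross_val_predict(learn_x, learn_y, k=3)
print(test_x, cv_preds)
```

Fitting the scaling parameters on the learning set and only applying them to the test set keeps information from the test molecules out of the model, which is the reason for pairing a Normalizer with a Normalizer (Apply) node.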

3. RESULTS AND DISCUSSION
For the evaluation of the Applicability Domain, a database of sweet compounds (data_qsar.xlsx) was used to build a machine learning model. The database was split into a learning set and a test set containing 155 and 153 molecules, respectively.
Based on the molecules present in the database, the learning set and the test set were compared: approximately 87% (224 molecules) are similar and 19% (49 molecules) are dissimilar. The results are shown in the diagram below.
Diagram 1. Learning and Test sets
A chemical space (Graphic 1) was built using both sets. Chemical space is a concept in cheminformatics that refers to the property space spanned by all possible molecules adhering to a given set of construction principles and conditions.
For the second part, we set up the machine learning model for sweetness prediction. The linear correlation matrix of the molecular properties (Image 2) shows a correlation between Element count, Bond count, and Molecular weight, since Molecular weight and Bond count are both linked to the number of atoms in a molecule.
Image 2. Linear correlation matrix
The prediction quality for each set was measured by R² (the linear correlation coefficient). The closer the value is to one, the better the prediction. The results are shown in Table 1.
Table 1. Linear correlation coefficient (R²)
Model                                 R² Training   R² Test
Baseline Model                        0.66          0.58
Baseline with Normalization           0.57          0.58
Reference Model with Normalization    0.60          0.70
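As a reading aid for Table 1, the sketch below shows one way such a value can be computed. It assumes that the reported R² is the squared Pearson correlation between experimental and predicted values; the report does not state the exact formula, and the coefficient of determination (1 minus the ratio of residual to total sum of squares) is a different quantity that can give a lower number. The example values and the function name are hypothetical.

```python
def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mo, mp = sum(observed) / n, sum(predicted) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return (cov * cov) / (var_o * var_p)


# Hypothetical experimental and predicted values (illustration only).
experimental = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

score = r_squared(experimental, predicted)
print(round(score, 3))
```

Because this measure only captures how well the two series vary together, a model whose predictions are systematically shifted or scaled can still score close to one.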

4. CONCLUSION
According to the results, the model performs well, but it is a simple model; obtaining better results would require a more complex one.
- The evaluation of the applicability domain indicates that the two data sets overlap well and that their chemical spaces are strongly similar, leading us to conclude that it is possible to proceed to the next step, building and testing a machine learning model.
- The R² values obtained indicate a good linear correlation, since they are closer to 1 than to 0, although there is room for improvement.

5. REFERENCES
[01] Hanser T, Barber C, Marchaland JF, Werner S. Applicability domain: towards a more formal definition. SAR QSAR Environ Res. 2016 Nov;27(11):893-909. doi: 10.1080/1062936X.2016.1250229. Epub 2016 Nov 9. PMID: 27827546.
[02] Chéron JB, Casciuc I, Golebiowski J, Antonczak S, Fiorucci S. Sweetness prediction of natural compounds. Food Chem. 2017 Apr 15;221:1421-1425. doi: 10.1016/j.foodchem.2016.10.145. Epub 2016 Nov 3. PMID: 27979110.

CHEMICAL SIMILARITY & CHEMICAL SPACE - A
STRUCTURE-FUNCTION RELATIONSHIP OF FLAVOR AND FRAGRANCE
PRACTICAL WORK REPORT - STRUCTURE-FUNCTION RELATIONSHIP
LARISSA LOUREIRO MARTINS¹ AND JÉSSICA DE ASSIS SOFIATTI²
1-2. MSc Management of Flavor and Fragrance industry students at Université Côte d’Azur.
* Mailing address: larissaloumartins@gmail.com and jesofiatti@gmail.com
1. INTRODUCTION
Quantitative Structure-Activity Relationship (QSAR) models are mathematical models used to predict the physicochemical and biological properties of compounds by treating those properties as a function of descriptors found in an available database. This area of science allows scientists to be more accurate when designing drugs, predicting chemical reactions, or even understanding the olfactive impact of a certain molecule on our olfactory system¹,².

2. METHODS
The proposed methodology was based on a review of the scientific literature on QSAR models, using database sites such as Scielo and Pubmed as sources, together with KNIME, a workflow system for life science research, and flavor and fragrance databases. The objective was to gather information demonstrating the use of QSAR models in the Flavor and Fragrance field.
For this project, the given chemsenses databases (DB) of Fragrance and Taste molecules were processed with the objective of studying the chemical similarity (A) and comparing the chemical spaces (B) between them.
A. Chemical Similarity
For part A, both databases were loaded into the software, and Molecule Type Cast nodes were used to extract the molecules' fingerprints, generating the data needed for the study. We then ran the Molecular Properties node (with the variables Element count, Hydrogen bond acceptors, Hydrogen bond donors, Molecular weight, and XLogP) to understand and compare the molecular properties of both DBs. Finally, we established a relation between the two sets of fingerprints with the Fingerprint Similarity node, considering molecules similar when the Tanimoto coefficient is higher than 0.85.
B. Chemical Space
In this part, the objective is to compare the chemical spaces covered by both DBs and draw conclusions about their similarity. Here, the information previously obtained from the Molecular Properties nodes is processed, concatenated, and normalized from 0 to 1. After that, a Principal Component Analysis (PCA) is performed to reduce the variables to three dimensions while preserving as much of the information they contain as possible. After processing the data with the PCA Compute and PCA Apply nodes, we run the 2D/3D Scatterplot node to obtain the summarized information mapped on a 3D graph and observe the chemical space similarities between the two DBs.

3. RESULTS AND DISCUSSION
For the evaluation of Chemical Similarity, a senses database (chemsensesDB.xlsx) was used, containing 1760 fragrance molecules and 806 flavor molecules.
These molecules were then statistically compared within each group regarding their chemical properties: Bond count, Element count, Hydrogen bond acceptors, Hydrogen bond donors, Molecular weight, and XLogP. The results are shown in the tables below.

Table 1. Fragrance Database
Chemical properties       Min.    Mean    Max.
Element count             3       27.1    121
Bond count                1       11.1    60
Hydrogen bond Acceptor    0       1.4     19
Hydrogen bond Donor       0       0.3     9
Molecular weight          30      160.8   846.4
XLogP                     -3.9    2.8     14.6

Table 2. Taste Database
Chemical properties       Min.    Mean    Max.
Element count             3       41.1    158
Bond count                1       21.5    86
Hydrogen bond Acceptor    0       4.8     28
Hydrogen bond Donor       0       2.8     17
Molecular weight          27      294.7   1128.5
XLogP                     -5.3    0.9     7.6

Observing both tables, taste molecules cover a wider range of chemical profiles, from smaller to larger molecules (27 to 1128 g/mol, versus 30 to 846 g/mol for fragrance molecules), with more hydrogen bond donors and acceptors and an XLogP range that extends to lower values. Since these variables give information about the polarity of the molecules, the taste set includes more polar molecules than the fragrance set.
Based on the chemical properties identified, a comparison was made between the groups, in which 806 similar molecules were identified; from that number, the percentages of similarity and dissimilarity were calculated. As a result, 61% of the fragrance molecules are dissimilar and 39% are similar, while for the flavor molecules 68% are dissimilar and 32% are similar. The results are shown in the diagram below.
For the second part, a chemical space was built using both the Fragrance and Taste databases. Chemical space is a concept in cheminformatics that refers to the property space spanned by all possible molecules adhering to a given set of construction principles and conditions.
In the chemical space constructed, the fragrance molecules are clustered in a certain region, as they need certain chemical properties, for example a lower molecular weight, to be volatile and reach the nasal cavity. In contrast, Taste molecules are more spread out in space, as they do not need such properties to be perceived by the taste system.

4. CONCLUSION
Through a cheminformatics model, it was possible to evaluate the chemical properties of Fragrance and Taste molecules from the given databases and compare them. The results show that:
- Taste molecules occupy a broader chemical space (from smaller to bigger molecules);
- Fragrance molecules are more numerous, even though they are concentrated in chemical space, being molecules with smaller molecular weight (in comparison with taste ones) and mostly apolar;
- 61% of the fragrance molecules are considered dissimilar and 39% are considered similar according to their chemical properties;
- 68% of the flavor molecules are considered dissimilar and 32% are considered similar according to their chemical properties.

5. REFERENCES
[01] Cherkasov A, Muratov EN, Fourches D, Varnek A, Baskin II, Cronin M, Dearden J, Gramatica P, Martin YC, Todeschini R, Consonni V, Kuz'min VE, Cramer R, Benigni R, Yang C, Rathman J, Terfloth L, Gasteiger J, Richard A, Tropsha A. QSAR modeling: where have you been? Where are you going to? J Med Chem. 2014 Jun 26;57(12):4977-5010. doi: 10.1021/jm4004285. Epub 2014 Jan 6. PMID: 24351051; PMCID: PMC4074254.
[02] Roy K, Kar S, Das RN. Chapter 9 - Newer QSAR Techniques. In: Roy K, Kar S, Das RN, editors. Understanding the Basics of QSAR for Applications in Pharmaceutical Sciences and Risk Assessment. Academic Press; 2015. p. 319-356.
