Abstract—FAIR principles and the Open Data initiatives have motivated the publication of large volumes of data. Specifically, in the biomedical domain, the size of the data has increased exponentially in the last decade, and with the advances in the technologies to collect and generate data, a faster growth rate is expected in the coming years. The available collections of data are characterized by the dominant dimensions of big data, i.e., they are not only large in volume, but they can also be heterogeneous and present quality issues. These data complexity problems affect the typical tasks of data management and, in particular, the task of integrating big biomedical data sources. We tackle the problem of big data integration and present a knowledge-driven framework able to extract and integrate data collected from structured and unstructured data sources. The proposed framework resorts to Natural Language Processing techniques to extract knowledge from unstructured data and short text. Furthermore, ontologies and controlled vocabularies, e.g., UMLS, are utilized to annotate the extracted entities and relations with terms from the ontology or controlled vocabulary. The annotated data is integrated into a knowledge graph. A unified schema is used to describe the meaning of the integrated data as well as the main properties and relations. As proof of concept, we show the results of applying the proposed framework to integrate clinical records from lung cancer patients with data extracted from open data sources like DrugBank and PubMed. The created knowledge graph enables the discovery of interactions between drugs in the treatments prescribed to lung cancer patients.

I. INTRODUCTION

The amount of publicly available data has experienced significant growth as the result of initiatives that encourage the publication of open data following the FAIR principles [18]. The biomedical domain is an exemplar field where the number and size of data sources have faced an exponential increase in the last decade and are projected to grow very rapidly in the next decade, reaching more than one zettabyte per year by 2025 [17]. In this era, transforming Big Data into actionable knowledge demands novel and scalable tools for enabling not only data ingestion and curation, but also efficient large-scale knowledge extraction, management, and analysis. Extracting knowledge from data requires the integration of data collected from heterogeneous data sources, which may publish their data in structured or unstructured formats. In fact, even structured data sources may comprise a large number of attributes that include notes, comments, or descriptions, which encode valuable knowledge about the entities published by the data source. The integration of the knowledge encoded in unstructured data sources or attributes requires the implementation of natural language techniques able to effectively recognize relevant entities and annotate them with terms from controlled vocabularies (e.g., SemRep [14], MetaMap¹, and Sophia [4]) or from existing knowledge graphs (e.g., DBpedia Spotlight [10]). Moreover, the problem of data integration has also been extensively treated in the literature, and several approaches have been proposed to integrate structured data (e.g., KARMA [7], LDIF [1], LIMES [13], MINTE [2], Sieve [11], Silk [6], and RapidMiner LOD Extension [15]). However, none of these approaches is able to manage the integration of structured and unstructured data.

Our Research Goal: We aim at defining a computational framework able to exploit knowledge from annotations and semantic descriptions, and to integrate equivalent entities into a knowledge graph. Formally, the problem of data integration can be framed as follows. Given a collection of datasets such as unstructured text or structured databases, the problem of semantic data integration is to identify whether two entities in the collection of datasets correspond to the same real-world entity. An entity in a dataset corresponds to a real-world concept, and its properties can be described in a structured or unstructured way. Once equivalent entities have been matched, fusion policies are applied to merge them into a single entity [2]. Considering the wide variety of entities, the state of the art has focused on methods that reduce manual work and maximize accuracy and precision.

Approach: We propose an approach that relies on both Natural Language Processing techniques for extracting entities from unstructured text and ontologies to decide if two or more entities correspond to the same real-world entity. As proof of concept, our annotation techniques utilized UMLS and DBpedia as background knowledge. Additionally, the framework resorts to a knowledge graph to identify when two entities are connected by relations that indicate that they are equivalent (e.g., owl:sameAs). We have evaluated our approach on
This work has been supported by the European Union's Horizon 2020 Research and Innovation Program for the project iASiS with grant agreement No 727658.
¹ https://metamap.nlm.nih.gov/
Authorized licensed use limited to: UNIVERSIDADE DE SAO PAULO. Downloaded on November 05,2020 at 20:37:45 UTC from IEEE Xplore. Restrictions apply.
(a) Clinical Records (b) Drug Adverse Events (c) Potential Interactions
Fig. 1: Motivating Example. Heterogeneous sources publishing knowledge about the conditions of a patient and the potential
adverse events. (a) Unstructured clinical records describe a patient's medical conditions. (b) Drug interactions and side effects
are the potential adverse events in a treatment. (c) Various biomedical repositories encode knowledge about the potential
interactions between drugs. The integration of all this knowledge is required to determine the effectiveness of a treatment.
[Figure labels: Knowledge Exploration and Traversal; Extraction; Knowledge Graph; Knowledge Discovery; Big Data Structured and Unstructured Data Sources]

tion and present the principal characteristics of our proposed knowledge-driven approach. We also outline the most important results that we have obtained so far.

The remainder of this poster paper is structured as follows: Section II motivates the data integration problem over biomedical data sets. Section III describes our knowledge-driven framework and summarizes the principal results of implementing this framework in the iASiS project. Finally, Section IV concludes and gives insights for future work.
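
The matching and fusion problem framed in the Introduction can be illustrated with a minimal sketch. It assumes that each extracted entity carries a set of ontology annotations, that two entities match when they share at least one annotation, and that fusion keeps the union of all property values. The `Entity` class, the `match` and `fuse_union` helpers, and the identifiers `CUI_A` and `CUI_B` are hypothetical names chosen for illustration; they are not the matching or fusion policies of the cited approaches.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """An entity extracted from one source, with its ontology annotations."""
    source: str
    label: str
    annotations: frozenset  # placeholder concept identifiers, not real UMLS codes
    properties: dict = field(default_factory=dict)


def match(a: Entity, b: Entity) -> bool:
    """Assumed matching rule: two entities match if they share an annotation."""
    return bool(a.annotations & b.annotations)


def fuse_union(a: Entity, b: Entity) -> Entity:
    """One possible fusion policy: keep every property value from both entities."""
    merged = {}
    for props in (a.properties, b.properties):
        for key, value in props.items():
            merged.setdefault(key, set()).add(value)
    return Entity(
        source=f"{a.source}+{b.source}",
        label=a.label,
        annotations=a.annotations | b.annotations,
        properties=merged,
    )


# Illustrative data only: one mention from a clinical note, two database rows.
note = Entity("clinical_note", "fingolimod", frozenset({"CUI_A"}),
              {"mentioned_in": "note_1"})
row = Entity("drug_source", "Fingolimod", frozenset({"CUI_A"}),
             {"name": "Fingolimod"})
other = Entity("drug_source", "Carboplatin", frozenset({"CUI_B"}))

if match(note, row):
    fused = fuse_union(note, row)
    print(fused.source, fused.properties)
```

Other fusion policies (for example, preferring the value from a trusted source) would replace `fuse_union` without changing the matching step.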
[Fig. 3 content: (1) a patient entity with the properties id, comorbidities, comorbidity drugs, diagnostic, cancer drugs, smoker, gender, familial antecedents, and tumor stage; (2) a drug interaction entity with two DrugBank identifiers and the interaction description "Carboplatin may increase the immunosuppressive activities of Fingolimod". Processing steps shown: entity and predicate linking, mapping rules, semantic enrichment.]
Fig. 3: An Example for KG Creation. Entity and predicate linking extracts relevant facts from clinical notes and short text
describing drug interactions. Mapping rules and semantic enrichment create entities for (1) a patient, and (2) drug interactions.
an overview of the proposed knowledge-driven framework. The framework comprises four components: 1) Knowledge extraction; 2) Semantic data integration; 3) Exploration and traversal; 4) Knowledge discovery. In the first component, facts from unstructured data sources are extracted and represented in the form of triples, i.e., subjects, predicates, and objects [12]. Ontologies and controlled vocabularies are used to guide the extraction process as well as to annotate the extracted facts with their terms, e.g., terms from UMLS are utilized to represent medical concepts and their meaning in a standard way. Once relevant entities are extracted and annotated with ontologies, semantic data integration methods are utilized to decide when two entities are equivalent based on their annotations and to integrate these entities in the knowledge graph.

Figure 3 depicts an example of the steps followed for knowledge extraction and knowledge graph creation. First, NLP methods (e.g., Menasalvas et al. [9] and Sakor et al. [16]) able to solve the problem of named entity recognition are used to extract relevant entities from unstructured clinical notes and to annotate the extracted concepts with terms from UMLS. The annotated entities are semantically described using a unified schema and represented as triples by executing a set of mapping rules. These mapping rules are expressed using the RDF Mapping Language (RML) [3] and allow for the creation of entities and all their properties in the knowledge graph (e.g., a patient entity in Figure 3). Additionally, short texts included as attributes in structured data sources, e.g., in the description of the drug interactions in DrugBank, are processed in order to perform entity and predicate linking. Then, mapping rules are also executed to represent the extracted entities and predicates as triples. Step 2 in Figure 3 illustrates the portion of the knowledge graph that represents the interactions between the drugs Carboplatin and Fingolimod. The DrugBank identifiers are utilized in the knowledge graph to identify the drugs and their relationship. Moreover, equivalences between terms in different vocabularies are maintained in the knowledge graph; Figure 4 illustrates an example of the portion of the knowledge graph that maintains the equivalences between the identifiers of DrugBank and UMLS for the drugs Carboplatin and Fingolimod (step 3 in the figure). These alignments are utilized during semantic data integration to create the portion of the knowledge graph that relates the drug identifiers with the identifiers from UMLS (step 4 in Figure 4). This rewriting allows for linking the drugs and their interactions with the drugs that have been prescribed to a patient (step 5 in Figure 4). Thus, the identifiers from UMLS are used not only to annotate all the entities in the knowledge graph, but also to match equivalent entities and integrate them. Once the knowledge graph is created, techniques by Sakor et al. [16] are used to link entities in the knowledge graph to entities in other knowledge graphs (e.g., DBpedia [8] and Bio2RDF [5]).

This framework has been used in the context of the EU H2020 project iASiS in order to transform biomedical data into actionable knowledge for the support of precision medicine. The current version of the iASiS knowledge graph has 1.3 billion triples and 46 RDF classes, with on average 6.98 relations per entity; each class is connected on average to 2.87 classes. Classes include drugs, publications, lung cancer patients, side effects, and drug interactions. In addition, each patient is associated (on average) with four drugs, 20 side effects, 42 scientific publications, and 2 pairs of drug interactions. Thus, clinicians can traverse the knowledge graph and identify not only potential drug interactions, but also the scientific publications that support their diagnostics and treatments.
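
The step in which annotated records become triples can be sketched in plain Python. This is a simplified stand-in for the effect of mapping rules, not RML itself: the `http://example.org/` namespace, the `record_to_triples` helper, and the patient id `p1` are assumptions made for illustration, and the property names only loosely follow the patient entity of Figure 3.

```python
EX = "http://example.org/"  # hypothetical namespace, not the project's vocabulary


def record_to_triples(record: dict, id_key: str = "id", cls: str = "Patient") -> list:
    """Turn one flat record into (subject, predicate, object) triples.

    Multi-valued attributes (lists or tuples) yield one triple per value.
    """
    subject = f"{EX}{cls}/{record[id_key]}"
    triples = [(subject, "rdf:type", f"{EX}{cls}")]
    for key, value in record.items():
        if key == id_key:
            continue
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            triples.append((subject, f"{EX}{key}", item))
    return triples


# Illustrative record; the id is made up, the drug names appear in Figure 3.
patient = {
    "id": "p1",
    "diagnostic": "Lung Adenocarcinoma",
    "cancerDrugs": ["Tocladesine", "Cisplatin"],
    "smoker": False,
}

triples = record_to_triples(patient)
for triple in triples:
    print(triple)
```

In a real pipeline the objects would be ontology identifiers (e.g., UMLS terms produced by the annotation step) rather than plain strings, and the rules would be declared in RML rather than hard-coded.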
Fig. 4: An Example for Data Integration. Alignments between drug identifiers in UMLS and DrugBank enable the integration
of drug interactions (3 and 4). A portion of the graph represents a patient and the interactions among the prescribed drugs (5).
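
Steps 3 to 5 of Figure 4 can be sketched as follows, under the assumption that alignments are available as a simple lookup from DrugBank-style identifiers to UMLS-style identifiers. All identifiers (`DB_X`, `CUI_X`, and so on) are placeholders rather than real codes, and the `rewrite` and `patient_interactions` helpers are hypothetical; the actual framework maintains these equivalences as relations inside the knowledge graph.

```python
# Step 3: alignments between identifiers of two vocabularies (placeholders).
same_as = {"DB_X": "CUI_X", "DB_Y": "CUI_Y", "DB_Z": "CUI_Z"}

# Interactions extracted from short text, keyed by DrugBank-style identifiers.
# Each entry reads: the first drug affects the second as described.
interactions = [
    ("DB_X", "DB_Y", "may increase the immunosuppressive activities"),
]


def rewrite(interactions, same_as):
    """Step 4: express each interaction with UMLS-style identifiers.

    Interactions with an unaligned drug are skipped.
    """
    rewritten = []
    for first, second, description in interactions:
        if first in same_as and second in same_as:
            rewritten.append((same_as[first], same_as[second], description))
    return rewritten


def patient_interactions(patient_drugs, rewritten):
    """Step 5: keep interactions whose two drugs are both prescribed."""
    drugs = set(patient_drugs)
    return [
        (first, second, description)
        for first, second, description in rewritten
        if first in drugs and second in drugs
    ]


rewritten = rewrite(interactions, same_as)
print(patient_interactions({"CUI_X", "CUI_Y", "CUI_Z"}, rewritten))
print(patient_interactions({"CUI_X", "CUI_Z"}, rewritten))
```

Because the patient's drugs and the interactions share one identifier space after the rewriting, the final lookup reduces to a membership test.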