
2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS)

Semantic Data Integration Techniques for Transforming Big Biomedical Data into Actionable Knowledge

Maria-Esther Vidal
TIB Leibniz Information Centre for Science and Technology, Germany
maria.vidal@tib.eu

Samaneh Jozashoori, Ahmad Sakor
L3S Institute, Leibniz University of Hannover, Germany
{jozashoori,sakor}@l3s.de

Abstract: FAIR principles and the Open Data initiatives have motivated the publication of large volumes of data. Specifically, in the biomedical domain, the size of the data has increased exponentially in the last decade, and with the advances in the technologies to collect and generate data, a faster growth rate is expected for the next years. The available collections of data are characterized by the dominant dimensions of big data, i.e., they are not only large in volume, but they can also be heterogeneous and present quality issues. These data complexity problems affect the typical tasks of data management, particularly the task of integrating big biomedical data sources. We tackle the problem of big data integration and present a knowledge-driven framework able to extract and integrate data collected from structured and unstructured data sources. The proposed framework resorts to Natural Language Processing techniques to extract knowledge from unstructured data and short text. Furthermore, ontologies and controlled vocabularies, e.g., UMLS, are utilized to annotate the extracted entities and relations with terms from the ontology or controlled vocabulary. The annotated data is integrated into a knowledge graph. A unified schema is used to describe the meaning of the integrated data as well as the main properties and relations. As proof of concept, we show the results of applying the proposed framework to integrate clinical records from lung cancer patients with data extracted from open data sources like DrugBank and PubMed. The created knowledge graph enables the discovery of interactions between drugs in the treatments prescribed to lung cancer patients.

I. INTRODUCTION

The amount of publicly available data has experienced significant growth as the result of initiatives that encourage the publication of open data following the FAIR principles [18]. The biomedical domain is an exemplar field where the number and size of data sources have increased exponentially in the last decade and are projected to grow very rapidly in the next decade, reaching more than one zettabyte per year by 2025 [17]. In this era, transforming Big Data into actionable knowledge demands novel and scalable tools for enabling not only data ingestion and curation, but also efficient large-scale knowledge extraction, management, and analysis. Extracting knowledge from data requires the integration of data collected from heterogeneous data sources, which can make the data available in structured and unstructured formats. In fact, even structured data sources may comprise a large number of attributes that include notes, comments, or descriptions, which encode valuable knowledge about the entities published by the data source. The integration of the knowledge encoded in unstructured data sources or attributes requires the implementation of natural language processing techniques able to effectively recognize relevant entities and annotate them with terms from controlled vocabularies (e.g., SemRep [14], MetaMap (footnote 1), and Sophia [4]) or from existing knowledge graphs (e.g., DBpedia Spotlight [10]). Moreover, the problem of data integration has also been extensively treated in the literature, and several approaches have been proposed to integrate structured data (e.g., KARMA [7], LDIF [1], LIMES [13], MINTE [2], Sieve [11], Silk [6], and RapidMiner LOD Extension [15]). However, none of these approaches is able to manage the integration of structured and unstructured data.
Our Research Goal: We aim at defining a computational framework able to exploit knowledge from annotations and semantic descriptions, and to integrate equivalent entities into a knowledge graph. Formally, the problem of data integration can be framed as follows. Given a collection of data sets such as unstructured text or structured databases, the problem of semantic data integration is to identify whether two entities in the collection of data sets match the same real-world entity. An entity in a data set corresponds to a real-world concept, and its properties can be described in a structured or unstructured way. Once equivalent entities have been matched, different fusion policies need to be applied for merging them into a single entity [2]. Considering the diverse nature of entities, the state of the art has focused on methods that reduce manual work and maximize accuracy and precision.
Approach: We propose an approach that relies on both Natural Language Processing techniques for extracting entities from unstructured text and ontologies to decide if two or more entities match the same real-world entity. As proof of concept, our annotation techniques utilize UMLS and DBpedia as background knowledge. Additionally, the framework resorts to a knowledge graph to identify when two entities are connected by relations that indicate that they are equivalent (e.g., owl:sameAs). We have evaluated our approach on
This work has been supported by the European Union's Horizon 2020 Research and Innovation Program for the project iASiS with grant agreement No 727658.
1 https://metamap.nlm.nih.gov/

2372-9198/19/$31.00 ©2019 IEEE
DOI 10.1109/CBMS.2019.00116

Authorized licensed use limited to: UNIVERSIDADE DE SAO PAULO. Downloaded on November 05,2020 at 20:37:45 UTC from IEEE Xplore. Restrictions apply.
[Figure 1 panel content, recovered from the garbled extraction; digits and some words were lost.
(a) Clinical record: a male patient with stage I lung adenocarcinoma; he is a nonsmoker and suffers from chronic obstructive pulmonary disease and multiple sclerosis; the note also mentions an alpha proteinase inhibitor and Fingolimod; his sister had cancer; he is prescribed Tocladesine and Cisplatin.
(b) Drug side effects: Nausea, Bleeding, Hair Loss, Headache. Drug interactions: Cisplatin may increase the immunosuppressive activities of Fingolimod.
(c) Scientific publications: Cisplatin may interact with the alpha proteinase inhibitor; Cisplatin may interact with Tocladesine (PubMed IDs not recoverable).]

(a) Clinical Records (b) Drug Adverse Events (c) Potential Interactions

Fig. 1: Motivating Example. Heterogeneous sources publishing knowledge about the conditions of a patient and the potential adverse events. (a) Unstructured clinical records describe a patient's medical conditions. (b) Drug interactions and side effects are the potential adverse events in a treatment. (c) Various biomedical repositories encode knowledge about the potential interactions between drugs. The integration of all this knowledge is required to determine the effectiveness of a treatment.

heterogeneous datasets collected by the partners of the iASiS project (footnote 2) and have observed that our integration techniques allow for the discovery of interactions between treatments that could not be observed in isolated datasets.
Contributions: In this paper, we summarize the main challenges that must be addressed to solve the problem of data integration and present the principal characteristics of our proposed knowledge-driven approach. We also outline the most important results that we have obtained so far.
The remainder of this poster paper is structured as follows: Section II motivates the data integration problem over biomedical data sets. Section III describes our knowledge-driven framework and summarizes the principal results of implementing this framework in the iASiS project. Finally, Section IV concludes and gives insights for future work.
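
The matching and fusion formulation given in this section can be made concrete with a small sketch. It is only an illustration under stated assumptions, not the implementation used in the framework: the entity identifiers, the `TERM:` annotation strings, and the union-style fusion policy are all hypothetical, and two entities are treated as equivalent when they share at least one annotation or are connected by an explicit equivalence link in the spirit of owl:sameAs.

```python
from itertools import combinations


def match_entities(entities, same_as_links):
    """Group entity ids that share an annotation or an explicit equivalence link."""
    parent = {entity_id: entity_id for entity_id in entities}

    def find(x):
        # Follow parent pointers to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Rule 1 (assumption): a shared vocabulary term signals the same real-world entity.
    for a, b in combinations(entities, 2):
        if entities[a]["annotations"] & entities[b]["annotations"]:
            union(a, b)
    # Rule 2: explicit equivalence links, e.g. derived from owl:sameAs relations.
    for a, b in same_as_links:
        union(a, b)

    clusters = {}
    for entity_id in entities:
        clusters.setdefault(find(entity_id), []).append(entity_id)
    return sorted(sorted(members) for members in clusters.values())


def fuse(cluster, entities):
    """Hypothetical union fusion policy: keep every label and annotation."""
    labels = sorted({entities[e]["label"] for e in cluster})
    annotations = sorted(set().union(*(entities[e]["annotations"] for e in cluster)))
    return {"members": cluster, "labels": labels, "annotations": annotations}


# Illustrative data only; identifiers and terms are placeholders.
entities = {
    "note:fingolimod": {"label": "Fingolimod", "annotations": {"TERM:fingolimod"}},
    "drugdb:fingolimod": {"label": "fingolimod", "annotations": {"TERM:fingolimod"}},
    "note:cisplatin": {"label": "Cisplatin", "annotations": {"TERM:cisplatin"}},
    "kg:cisplatin": {"label": "cisplatin", "annotations": set()},
}
same_as_links = [("note:cisplatin", "kg:cisplatin")]

clusters = match_entities(entities, same_as_links)
fused = [fuse(cluster, entities) for cluster in clusters]
```

Here the two Fingolimod mentions are merged through their shared annotation, while the two Cisplatin entities are merged only because of the explicit link. A union-find structure is used so that equivalence stays transitive when both rules apply.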

II. MOTIVATING EXAMPLE

We motivate the problem of data integration by presenting a set of heterogeneous biomedical data sources (Figure 1). The health conditions of a lung cancer patient are described in a clinical note; this note corresponds to unstructured data that encodes valuable knowledge for understanding the diagnosis, comorbidities, familial antecedents, and treatments of the patient. Albeit informative, the clinical notes are presented in a form that cannot be directly processed during data integration. Instead, Natural Language Processing techniques are required in order to extract relevant entities that describe the principal characteristics of a patient; for example, the words highlighted in bold in the text in Figure 1a. Then, as shown in Figure 1b, adverse events or side effects may occur as a consequence of interactions among the prescribed drugs. However, the information about interactions is usually represented in unstructured formats, e.g., as short text in DrugBank or in scientific publications from PubMed (see Figure 1c). The problem is how to extract all the relevant data from the myriad data sources, e.g., clinical notes, DrugBank, and PubMed, and integrate the extracted entities in a way that useful insights can be discovered, e.g., drug interactions and side effects.

2 http://project-iasis.eu/

Fig. 2: Knowledge-driven Framework. Data is collected from structured and unstructured data sources. (1) Knowledge extraction methods transform unstructured data and describe the extracted facts using ontologies. (2) Semantic data integration relies on annotations and semantic descriptions to integrate the data into a knowledge graph. (3) Federated query processing enables the exploration of the knowledge graph. (4) Knowledge discovery facilitates the uncovering of patterns.

III. OUR APPROACH

We present a knowledge-driven framework that resorts to a knowledge graph for describing and integrating data collected from heterogeneous data sources. A knowledge graph corresponds to a data structure that represents data, knowledge, and actionable insights using a graph data model. Figure 2 presents

[Figure 3 content, recovered from the garbled extraction; digits were lost.
(1) A patient entity: Patient {id}; Comorbidities {Chronic obstructive pulmonary disease, Multiple sclerosis}; drugComorbities {alpha proteinase inhibitor, Fingolimod}; Diagnostic {Lung Adenocarcinoma}; CancerDrugs {Tocladesine, Cisplatin}; Smoker {false}; Gender {Male}; FamilialAntedencents {Sister}; tumorStage (values not fully recoverable).
(2) A drug interaction entity: two DrugBank identifiers (not recoverable) and the interaction description "Carboplatin may increase the immunosuppressive activities of Fingolimod".
Processing stages shown: Entity and Predicate Linking, Mapping Rules, Semantic Enrichment.]

Fig. 3: An Example for KG Creation. Entity and predicate linking extracts relevant facts from clinical notes and short text
describing drug interactions. Mapping rules and semantic enrichment create entities for (1) a patient, and (2) drug interactions.

an overview of the proposed knowledge-driven framework. The framework comprises four components: 1) Knowledge extraction; 2) Semantic data integration; 3) Exploration and traversal; 4) Knowledge discovery. In the first component, facts from unstructured data sources are extracted and represented in the form of triples, i.e., subjects, predicates, and objects [12]. Ontologies and controlled vocabularies are used to guide the extraction process as well as to annotate the extracted facts with their terms, e.g., terms from UMLS are utilized to represent medical concepts and their meaning in a standard way. Once relevant entities are extracted and annotated with ontologies, semantic data integration methods are utilized to decide when two entities are equivalent based on their annotations and to integrate these entities in the knowledge graph.
Figure 3 depicts an example of the steps followed for knowledge extraction and knowledge graph creation. First, NLP methods (e.g., Menasalvas et al. [9] and Sakor et al. [16]) able to solve the problem of named entity recognition are used to extract relevant entities from unstructured clinical notes and to annotate the extracted concepts with terms from UMLS. The annotated entities are semantically described using a unified schema and represented as triples by executing a set of mapping rules. These mapping rules are expressed using the RDF Mapping Language (RML) [3] and allow for the creation of entities and all their properties in the knowledge graph (e.g., a patient entity in Figure 3). Additionally, short texts included as attributes in structured data sources, e.g., in the description of the drug interactions in DrugBank, are processed in order to perform entity and predicate linking. Then, mapping rules are also executed to represent the extracted entities and predicates as triples. Step 2 in Figure 3 illustrates the portion of the knowledge graph that represents the interactions between the drugs Carboplatin and Fingolimod. The DrugBank identifiers are utilized in the knowledge graph to identify the drugs and their relationship.
Moreover, equivalences between terms in different vocabularies are maintained in the knowledge graph; Figure 4 illustrates an example of the portion of the knowledge graph that maintains the equivalences between the identifiers of DrugBank and UMLS for the drugs Carboplatin and Fingolimod (step 3 in the figure). These alignments are utilized during semantic data integration to create the portion of the knowledge graph that relates the drug identifiers with the identifiers from UMLS (step 4 in Figure 4). This rewriting allows for linking the drugs and their interactions with the drugs that have been prescribed to a patient (step 5 in Figure 4). Thus, the identifiers from UMLS are used not only to annotate all the entities in the knowledge graph, but also to match equivalent entities and integrate them. Once the knowledge graph is created, techniques by Sakor et al. [16] are used to link entities in the knowledge graph to entities in other knowledge graphs (e.g., DBpedia [8] and Bio2RDF [5]).
This framework has been used in the context of the EU H2020 project iASiS in order to transform biomedical data into actionable knowledge for the support of precision medicine. The current version of the iASiS knowledge graph has 1.3 billion triples and 46 RDF classes, with on average 6.98 relations per entity; each class is connected on average to 2.87 classes. Classes include drugs, publications, lung cancer patients, side effects, and drug interactions. In addition, each patient is associated (on average) with four drugs, 20 side effects, 42 scientific publications, and 2 pairs of drug interactions. Thus, clinicians can traverse the knowledge graph and identify not only potential drug interactions, but also the scientific publications that support their diagnostics and treatments.
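
To illustrate how annotated entities become triples and how summary figures such as the number of classes or the average number of relations per entity can be derived, the following sketch uses plain Python as a stand-in for declarative RML mapping rules; it is not RML syntax and not the iASiS pipeline. The predicate names (`ex:hasComorbidity`, `ex:prescribedCancerDrug`, `ex:isSmoker`), the `umls:TERM_...` placeholders, and the way relations per entity are counted (non-type triples per subject) are assumptions made for this example.

```python
from collections import Counter


def patient_to_triples(record):
    """Mapping-rule-like function: emit (subject, predicate, object) triples."""
    subject = f"ex:patient/{record['id']}"
    triples = [(subject, "rdf:type", "ex:Patient")]
    # One multi-valued predicate per list attribute; predicate names are hypothetical.
    for attribute, predicate in (
        ("comorbidities", "ex:hasComorbidity"),
        ("cancer_drugs", "ex:prescribedCancerDrug"),
    ):
        for term in record[attribute]:
            triples.append((subject, predicate, term))
    triples.append((subject, "ex:isSmoker", record["smoker"]))
    return triples


def graph_statistics(triples):
    """Count classes and average non-type relations per subject (one possible reading)."""
    classes = {o for _, p, o in triples if p == "rdf:type"}
    relations = Counter(s for s, p, _ in triples if p != "rdf:type")
    subjects = {s for s, _, _ in triples}
    average = sum(relations.values()) / len(subjects)
    return {"classes": len(classes), "avg_relations_per_entity": average}


# Annotated record loosely modeled on the patient entity of Figure 3;
# the term identifiers are placeholders, not real UMLS codes.
record = {
    "id": "p1",
    "comorbidities": ["umls:TERM_COPD", "umls:TERM_MULTIPLE_SCLEROSIS"],
    "cancer_drugs": ["umls:TERM_TOCLADESINE", "umls:TERM_CISPLATIN"],
    "smoker": False,
}

triples = patient_to_triples(record)
stats = graph_statistics(triples)
```

In a real deployment the mapping would be written declaratively in RML and executed by an RML engine; the function above only mirrors the idea that each attribute of an annotated entity yields one or more triples about the same subject.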



Fig. 4: An Example for Data Integration. Alignments between drug identifiers in UMLS and DrugBank enable the integration of drug interactions (3 and 4). A portion of the graph representing a patient and the interactions among his prescribed drugs (5).
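
Steps 3 to 5 of Figure 4 can be sketched as a rewriting of interaction triples through identifier alignments, followed by a lookup over the drugs prescribed to a patient. This is a minimal sketch under assumptions: the `drugbank:ID_...` and `umls:CUI_...` strings are placeholders rather than real identifiers, and the predicates `ex:interactsWith` and `ex:prescribed` are hypothetical names.

```python
def rewrite_with_alignments(triples, alignments):
    """Replace aligned identifiers in subject and object position (step 4)."""
    return [
        (alignments.get(s, s), p, alignments.get(o, o))
        for s, p, o in triples
    ]


def interactions_among_prescribed(graph, patient):
    """Return interacting drug pairs where both drugs are prescribed to the patient (step 5)."""
    prescribed = {o for s, p, o in graph if s == patient and p == "ex:prescribed"}
    return sorted(
        (s, o)
        for s, p, o in graph
        if p == "ex:interactsWith" and s in prescribed and o in prescribed
    )


# Step 3: alignments between the two vocabularies (placeholder identifiers).
alignments = {
    "drugbank:ID_CARBOPLATIN": "umls:CUI_CARBOPLATIN",
    "drugbank:ID_FINGOLIMOD": "umls:CUI_FINGOLIMOD",
}
# Interaction extracted from short text, still expressed with DrugBank identifiers.
interaction_triples = [
    ("drugbank:ID_CARBOPLATIN", "ex:interactsWith", "drugbank:ID_FINGOLIMOD"),
]
# Patient triples annotated with UMLS identifiers.
patient_triples = [
    ("ex:patient/p1", "ex:prescribed", "umls:CUI_CARBOPLATIN"),
    ("ex:patient/p1", "ex:prescribed", "umls:CUI_FINGOLIMOD"),
]

# Without rewriting, the two sources use different identifiers and nothing links up.
before = interactions_among_prescribed(patient_triples + interaction_triples, "ex:patient/p1")
# After rewriting, the interaction becomes reachable from the patient's prescriptions.
graph = patient_triples + rewrite_with_alignments(interaction_triples, alignments)
after = interactions_among_prescribed(graph, "ex:patient/p1")
```

The comparison of `before` and `after` shows why the alignments matter: the interaction is present in both cases, but only the rewritten graph connects it to the patient.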

IV. CONCLUSIONS AND FUTURE WORK

We present a knowledge-driven framework able to integrate heterogeneous data into a knowledge graph. In the knowledge graph, the integrated data is described in terms of a unified schema and ontologies. Alignments between terms enable the matching and integration of equivalent entities. The framework is part of the iASiS platform, and clinicians are starting to evaluate its outcomes. In the future, we plan to develop curation techniques to allow clinicians to evaluate the knowledge represented in the knowledge graph.

REFERENCES

[1] A. Schultz, A. Matteini, R. Isele, P. N. Mendes, C. Bizer, and C. Becker. LDIF - A framework for large-scale linked data integration. In Proceedings of the 21st International World Wide Web Conference (WWW), Developers Track, Lyon, France, April 16-20, 2012.
[2] D. Collarana, M. Galkin, I. T. Ribón, M. Vidal, C. Lange, and S. Auer. MINTE: Semantically integrating RDF graphs. In Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics, WIMS 2017, Amantea, Italy, June 19-22, 2017, 2017.
[3] A. Dimou, M. V. Sande, P. Colpaert, R. Verborgh, E. Mannens, and R. V. de Walle. RML: A generic language for integrated RDF mappings of heterogeneous data. In Proceedings of the Workshop on Linked Data on the Web co-located with the 23rd International World Wide Web Conference (WWW 2014), 2014.
[4] G. Divita, Q. T. Zeng, A. V. Gundlapalli, S. Duvall, J. Nebeker, and M. H. Samore. Sophia: A expedient UMLS concept extraction annotator. In AMIA Annu Symp Proc, pages 467–76, 2014.
[5] M. Dumontier, A. Callahan, J. Cruz-Toledo, P. Ansell, V. Emonet, F. Belleau, and A. Droit. Bio2RDF release 3: A larger, more connected network of linked data for the life sciences. In Proceedings of the ISWC 2014 Posters & Demonstrations Track, a track within the 13th International Semantic Web Conference, ISWC 2014, Riva del Garda, Italy, October 21, 2014, pages 401–404, 2014.
[6] R. Isele and C. Bizer. Active learning of expressive linkage rules using genetic programming. Journal of Web Semantics, 23:2–15, 2013.
[7] C. A. Knoblock and P. A. Szekely. Exploiting Semantics for Big Data Integration. AI Magazine, 36(1):25–38, 2015.
[8] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer. DBpedia - A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2):167–195, 2015.
[9] E. Menasalvas, A. R. González, R. Costumero, H. Ambit, and C. Gonzalo. Clinical narrative analytics challenges. In Rough Sets - International Joint Conference, IJCRS 2016, Santiago de Chile, Chile, October 7-11, 2016, Proceedings, pages 23–32, 2016.
[10] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, pages 1–8, 2011.
[11] P. N. Mendes, H. Mühleisen, and C. Bizer. Sieve: Linked data quality assessment and fusion. In Proceedings of the 2012 Joint EDBT/ICDT Workshops, Berlin, Germany, March 30, 2012, pages 116–123, 2012.
[12] R. Navigli. Natural Language Understanding: Instructions for (Present and Future) Use. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, pages 5697–5702, 2018.
[13] A.-C. N. Ngomo and S. Auer. LIMES - A time-efficient approach for large-scale link discovery on the web of data. In IJCAI, pages 2312–2317, 2011.
[14] T. Rindflesch and M. Fiszman. The interaction of domain knowledge and linguistic structure in natural language processing: Interpreting hypernymic propositions in biomedical text. Journal of Biomedical Informatics, 36(6):462–477, 2003.
[15] P. Ristoski, C. Bizer, and H. Paulheim. Mining the web of linked data with RapidMiner. Web Semantics: Science, Services and Agents on the World Wide Web, 35:142–151, 2015.
[16] A. Sakor, I. O. Mulang', K. Singh, S. Shekarpour, M.-E. Vidal, J. Lehmann, and S. Auer. Old is gold: Linguistic driven approach for entity and relation linking of short text. In Proceedings of the NAACL HLT, 2019.
[17] Z. D. Stephens, S. Y. Lee, F. Faghri, R. H. Campbell, C. Zhai, M. J. Efron, R. Iyer, M. C. Schatz, S. Sinha, and G. E. Robinson. Big data: Astronomical or genomical? PLoS Biology, 13(7), 2015.
[18] M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, L. B. da Silva Santos, P. E. Bourne, et al. The FAIR guiding principles for scientific data management and stewardship. Scientific Data, 3, 2016.
