
Analytical and Bioanalytical Chemistry (2019) 411:6791–6800

https://doi.org/10.1007/s00216-019-02074-9

FEATURE ARTICLE

Challenges of big data integration in the life sciences


Sven Fillinger 1 & Luis de la Garza 1 & Alexander Peltzer 1 & Oliver Kohlbacher 2,3,4,5 & Sven Nahnsen 1

Received: 21 May 2019 / Revised: 8 July 2019 / Accepted: 6 August 2019 / Published online: 28 August 2019
© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Abstract
Big data has been reported to be revolutionizing many areas of life, including science. The term summarizes data that is unprecedentedly large, rapidly generated, heterogeneous, and hard to accurately interpret. The availability of such data has also brought new challenges: How to properly annotate data to make it searchable? What are the legal and ethical hurdles when sharing data? How to store data securely, preventing loss and corruption? The life sciences are not the only disciplines that must align themselves with big data requirements to keep up with the latest developments. The Large Hadron Collider, for instance, generates research data at a pace beyond any current biomedical research center. There are three recent major coinciding events that explain the emergence of big data in the context of research: the technological revolution for data generation, the development of tools for data analysis, and a conceptual change towards open science and data. The true potential of big data lies in pattern discovery in large datasets, as well as the formulation of new models and hypotheses. Confirmation of the existence of the Higgs boson, for instance, is one of the most recent triumphs of big data analysis in physics. Digital representations of biological systems have become more comprehensive. This, in combination with advances in machine learning, creates exciting new research possibilities. In this paper, we review the state of big data in bioanalytical research and provide an overview of the guidelines for its proper usage.

Keywords Big data · Bioanalytics · Data integration · Bioinformatics · Scalability

Sven Fillinger, Luis de la Garza, and Alexander Peltzer contributed equally to this work.

* Sven Nahnsen
  sven.nahnsen@uni-tuebingen.de

1 Quantitative Biology Center (QBiC), University of Tübingen, Auf der Morgenstelle 10, 72076 Tübingen, Germany
2 Center for Bioinformatics, University of Tübingen, 72076 Tübingen, Germany
3 Applied Bioinformatics, Department of Computer Science, 72076 Tübingen, Germany
4 Institute for Translational Bioinformatics, University Hospital of Tübingen, 71016 Tübingen, Germany
5 Biomolecular Interactions, Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany

Introduction

In 2004, Google announced the ambitious goal to digitize as many books as publicly available and legally feasible [1, 2]. The transformation of printed pages to digital images (digitization) was followed by the identification and indexing of the text blocks (datafication). The comprehensive annotation of these information blocks with metadata and the construction of efficient query interfaces enable entirely new research opportunities. Correlative analyses combine heterogeneous information and can be performed on big datasets that grow rapidly.

In the life sciences, similar concepts for digitizing biological material, e.g., through sequencing and datafying the information with bioinformatic tools, have emerged over the last two decades, paving the way for big data in life science research. While big data growth rates have been suggested as early as the 1940s [3], the life sciences have become important players in the field only for about the last 10 years. The widely accepted definition of big data is based on the four Vs, requiring data to be very large (volume), but also heterogeneous (variety), rapidly generated (velocity), and associated with uncertainty (veracity). Part of this veracity issue in the bioanalytical disciplines arises from the uncertainty whether a measured effect is truly an effect and not simply a measurement error. Thus, it should be noted that high-throughput data coming from a single source are not categorized as big data, regardless of the actual size, nor are data that allow for straightforward and accurate conclusions [4]. Similarly, datasets of low complexity and/or low generation rates, but of high volumes, are not big data and will simply be referred to as large data.

While the most recent research data in the biological and medical sciences show the alignment with big data properties, related science disciplines have been associated with big data much earlier, such as experimental particle physics. The Large Hadron Collider, installed at CERN in Geneva, Switzerland, is generating primary research data at a pace beyond any current biomedical research center [5]. Nonetheless, other disciplines such as environmental research [6], medicine (Cancer Genome Atlas Research Network ...), and the life sciences (The GTEx Consortium 2015) are largely benefiting from the model that was implemented by other disciplines for coping with big data.

Synoptically, the emergence of big data in the biomedical field can be attributed to three major coinciding events: (1) the technological revolution for data generation, (2) the development of tools for data analysis, and (3) a conceptual change in science practice towards open data.

Data generation

The technological innovation is visible in almost all fields where bioanalytical data is generated. While genomics and other next-generation sequencing–based technologies have been leading the field with respect to data amounts and the speed of their generation, other bioanalytical platforms such as mass spectrometry and imaging are catching up. In recent years, technological developments in mass spectrometers and liquid chromatography have led to a quantum leap in the quality and quantity of protein and metabolite measurements. This trend can clearly be identified using publicly available proteomics data and studies, as visualized in Fig. 1. Since 2005, we observe an exponential growth of publicly available proteomics data.

Fig. 1 Development of registered projects and repository size of the PRIDE database from 2005 until October 2018. Blue line: repository size in terabytes. Cyan line: total number of registered projects (source: https://doi.org/10.5281/zenodo.1464136)

Data analysis

The important developments in scientific disciplines such as bioinformatics and computational biology are akin to the technological development for data generation. The emergence of high-throughput technologies fostered bioinformatics tool development and the genesis of bioinformatics workflow solutions. These developments in applied computer science, together with the appreciation of data management as an important research field, continuously provide the technical means to share research data beyond the borders of research groups, institutions, and even countries to fully leverage the potential of big data. Querying scientific literature databases shows that the number of studies based on at least one omics type is growing rapidly. Furthermore, the number of published so-called multi-omics studies, that is, studies using datasets from more than one area in the life sciences, e.g., proteomics, genomics, and metabolomics, has increased starkly in recent years. Using a structured, scalable approach, as workflow engines tend to offer, enables researchers to analyze and integrate data from several heterogeneous sources. Both trends are shown in Fig. 2.

Fig. 2 Number of published multi-omics studies per year, based on reported PubMed keywords and publication date (source: https://doi.org/10.5281/zenodo.1464136)

Data sharing

The emergence of open data, the growing willingness to share primary datasets, and increasing resources of public data provide a tremendously important third cornerstone for the entry of big data into the life sciences. Finally, building on the three innovations, the true potential of big data lies in the discovery of patterns from unprecedentedly large datasets and the formulation of new models and hypotheses thereof.
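The integration of heterogeneous sources mentioned above can be illustrated with a deliberately small sketch that joins measurements from two omics layers by a shared sample identifier. Sample IDs, gene names, and values below are invented for illustration; real multi-omics integration involves normalization and far richer metadata.

```python
# Toy multi-omics join: keep only samples measured in both layers.
transcriptomics = {"S1": {"TP53": 120.0}, "S2": {"TP53": 95.5}}
proteomics = {"S1": {"TP53": 8.1}, "S3": {"TP53": 7.4}}

def integrate(layer_a, layer_b):
    """Pair the values of samples present in both omics layers."""
    shared = sorted(set(layer_a) & set(layer_b))
    return {sample: {"rna": layer_a[sample], "protein": layer_b[sample]}
            for sample in shared}

merged = integrate(transcriptomics, proteomics)
# Only sample "S1" appears in both layers and survives the join.
```

Workflow engines essentially automate many such steps, together with the bookkeeping that makes them reproducible at scale.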

The growing comprehensiveness of the digital description of biological systems, in combination with advances in machine learning and artificial intelligence, opens fascinating new possibilities for biomedical research. Here we review the state of big data in bioanalytical research and identify the general guidelines for its implementation.

Data ecosystems for big data application

The ability to integrate data is a core element of big data analyses. Research data requires integration at various levels. Figure 3 illustrates data integration around bioanalytical data (shown in the blue box), for which a scalable setup requires unambiguous metadata annotation of all entities (studies, organisms, samples) involved in a given experimental design. Denoted as step (1), the research data needs to be embedded into a data management infrastructure providing domain-specific rich metadata annotation, persistent storage, and human/machine-accessible query interfaces. Following the annotation, data can be processed by bioinformatics tools and pipelines. This processing step (2) is classically known as bioinformatics processing and involves many, ideally open, software applications and reference databases. Harmonized data processing of similar and related data allows integrating datasets from multiple experimental conditions and/or a multitude of omics levels. Figure 3 denotes as its fourth step the integration beyond the current setting with large-scale global resources (4). Enabling these four steps in an ideally automated setup enables the construction of a big data ecosystem that is scalable to easily integrate new datasets and methods on the local and the global levels. This system can be built in many different flavors that all share the intention of integrating large volumes of heterogeneous and rapidly growing datasets.

Managing bioanalytical data

At the core of the big data ecosystem are methods and systems for the efficient management of these data. Obviously, attempting to work with big data is a community effort, implying the necessity to make data accessible through both standardized human- and machine-readable interfaces. In recent years, scientists, funding agencies, and journals have made tremendous progress in formalizing these requirements through the FAIR guidelines [7]. Established to help researchers assess and establish good data management practices and stewardship, the concept of FAIR data is a prerequisite to performing sustainable big data-driven research. FAIR data must be Findable, Accessible, Interoperable, and Reusable. The application of the FAIR principles is a key enabler for reproducibility in science. These applications, along with the stringent deployment of data standards, are thoroughly discussed in [8].

Findability requires the data not only to be registered in a searchable resource, but also to be described with rich and standardized metadata. Implementations are diverse, from web interfaces to command lines, but they must enable researchers to submit queries and provide upload mechanisms enforcing a strict requirement for detailed metadata provisioning. Resources must also provide an application programming interface (API), enabling software developers to access the resource data from within their applications and enable, e.g., machine learning or data mining algorithms to operate on the data.
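As an illustration of programmatic access through such an API, the following sketch assembles a query URL and parses a JSON reply the way client code would handle a live response. The endpoint path, parameter names, and the canned response are assumptions for illustration only and do not reproduce the actual ICGC API; consult the portal documentation for the real interface.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and parameters, chosen for illustration.
BASE = "https://dcc.icgc.org/api/v1/projects"

def build_query(filters, size=10):
    """Assemble a REST query URL; filters travel as a JSON-encoded parameter."""
    return BASE + "?" + urlencode({"filters": json.dumps(filters), "size": size})

url = build_query({"primarySite": "Liver"}, size=5)

# A reply would typically arrive as JSON; here a canned example stands in
# for a live HTTP response.
canned_response = '{"hits": [{"id": "LIRI-JP", "primarySite": "Liver"}], "pagination": {"total": 1}}'
reply = json.loads(canned_response)
liver_projects = [hit["id"] for hit in reply["hits"]]
```

The same pattern, a parameterized URL plus a machine-readable exchange format, underlies most repository APIs regardless of domain.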

Fig. 3 Bioanalytical data ecosystem for big data applications. Bioanalytical research aiming at leveraging big data requires several layers of data integration. The middle level shows integration at the local (facility, research group) level, including thorough metadata annotation (1), as well as lending context (3). The bottom layer depicts community resources required for data processing (2). Lastly, the top layer illustrates the global integration and access via public data repositories (4) (source: https://doi.org/10.5281/zenodo.1464136)

A very common architectural concept for such an interface is the RESTful API (Representational State Transfer, often used with HTTP as the network protocol). Typically, data is then passed around in common information exchange formats, such as JSON (JavaScript Object Notation) or XML (Extensible Markup Language), which are easily machine- and human-readable. A good example of a detailed API description can be found on the ICGC data portal [9] (International Cancer Genome Consortium; http://docs.icgc.org/portal/api-endpoints), where cancer-specific controlled vocabularies can be used to find samples and raw datasets.

Accessibility refers to retrievable research data. A best practice in accessibility is realized through linking the metadata describing datasets found in central data repositories. Using these linked metadata, researchers are provided with related datasets in order to improve search results. Linked metadata corresponds to a biology-specific, standardized set of descriptions. Most public repositories provide the technical infrastructure and implementation of access interfaces and proper usage documentation.

However, the landscape of public data repositories has grown significantly, rendering the task of data access non-trivial. The available data repositories profoundly depend on the scientific domain and the type of data. A curated list of FAIR data repositories can be found on the Nature journal website (https://www.nature.com/sdata/policies/repositories). DataCite, a global non-profit organization, provides a registry (re3data) that is connected to a number of subscribed repositories and thus serves as a central access point to submit search queries. DataCite provides an extensive schema and documentation on how data can be annotated properly [10]. Careful annotation of data with metadata is critical for the comparison and interpretation of research data at a global scale.

Interoperability in the context of big life science data management requires that non-cooperating sites can make use of the data and integrate it with local data for reproducibility or further analysis. A common use case is using existing data to build new research hypotheses. Global standards that achieve interoperability and provide good practices, such as the aforementioned DataCite schema definition, are central to producing interoperable data and thereby facilitate the sharing of data.

Data reusability likewise builds upon rich metadata, if available. Life science researchers are not always aware of good practices for attributing data licenses to any data or their subsets. Such identification options help to clarify the terms of use and to avoid the cost- and time-intensive investment of untangling the legal basis and the application of country-specific copyright laws. The terms of use can in most cases be set by the copyright holder, which is usually the research institution, a company, or a single person that created the data.
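To make such terms of use machine-actionable, a dataset's metadata record can carry an explicit license field alongside the descriptive metadata that findability requires. The following minimal record is hypothetical; real schemas such as the DataCite schema define many more mandatory properties and controlled vocabularies.

```python
import json

# Hypothetical, simplified metadata record for illustration only.
record = {
    "identifier": "10.5281/zenodo.0000000",   # placeholder DOI
    "title": "Proteome profiling of liver tissue",
    "creators": ["Doe, Jane"],
    "license": "CC-BY-4.0",                   # explicit, machine-readable terms of use
    "organism": "Homo sapiens",
    "technology": "LC-MS/MS",
}

def missing_fields(metadata, required=("identifier", "title", "creators", "license")):
    """List mandatory fields absent from a record before it is submitted."""
    return [field for field in required if not metadata.get(field)]

serialized = json.dumps(record, indent=2)     # machine-readable exchange form
assert missing_fields(json.loads(serialized)) == []
```

Enforcing such checks at upload time is one way repositories implement the "strict requirement for detailed metadata provisioning" discussed above.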

The Digital Curation Centre (DCC) makes suggestions for data license selection (http://www.dcc.ac.uk/resources/how-guides/license-research-data). Two very permissive licenses, thus ensuring reusability, are the Open Data Commons (https://okfn.org/opendata) and the Creative Commons (https://creativecommons.org) licenses, which have become the de facto standard in publishing life science data.

However, we want to point out that there are substantial challenges in ensuring accessibility and reusability of data, especially, but not only, when it concerns human (meta)data. Access is usually controlled and requires written research interest proposals that also need to be signed by official legal representatives. Oftentimes institutions do not offer good support and staff for the transaction of these time-consuming processes. For example, in order to access certain projects hosted by the International Cancer Genome Consortium (ICGC), one needs to apply for (meta)data access via the Data Access Compliance Office (DACO) plus additional applications for the different raw data repositories, like PDC or GDC. As these applications expire within 24 months, it is necessary to keep track and apply for renewal. In addition, most researchers do not have a legal background and therefore are unsure about copyright laws, ownership regulations, and licensing models. In the worst case, data and software code are not shared at all due to this uncertainty, which contradicts the FAIR principles. While funding agencies mostly require mandatory data publishing, best practices for FAIR data (and software) are not yet fully implemented.

Preservation of data integrity

If data were obtained from several sources, a special emphasis should be placed on the assurance of data correctness. Data corruption or loss is the most severe damage that can occur to (scientific) data. Data loss can be observed after a simple disk failure, after (un)intended physical damage, or after accidental deletion. While regular backups to remote systems address this type of damage quite efficiently, an underestimated damage can occur through data corruption. Data corruptions are defined as events in which the bit-encoded content changes unintentionally. The reasons for this can be manifold, such as radiation affecting the memory (DRAM [11]), background noise in the network traffic, or device failure.

Technical solutions that minimize the risk of undetected data corruption are widely available. For example, at the hardware level, it is considered best practice to employ, e.g., Hamming-based error-correcting code memory (ECC [12]), which auto-corrects single-bit errors and detects double-bit errors. However, as these are regularly more costly and depend on more expensive hardware that supports this type of memory, regular laboratory desktop computer environments rarely employ such advanced hardware solutions. Paradoxically, in the research setting, investments for data generation are comparably easy to justify. A mass spectrometer, costing approx. 1 M€ with associated equipment, is a common investment in research institutes, yet allocating budget for FAIR data management is frequently not appreciated.

As a best practice recommendation, generated data should be stored together with a checksum file that contains the checksum derived from the data. The most common checksum algorithms for this purpose are based on cryptographic hash functions such as MD5 or SHA-1, or on CRC32 (polynomial division–based). While the abovementioned ones are not considered to be inherently secure against intended manipulation, they are sufficient for detecting random errors introduced in the data by the aforementioned causes. Notably, these checksum files should be created as early as possible in a digital object's lifetime and accompany the data during every data transfer step for validation. It is also considered best practice to deliver the data to local or remote data centers, as they usually provide the technical measures for sustainable data storage and delivery.

While there are solutions available that address these problems, we want to point out that this usually comes with a higher price for the hardware and needs domain expertise for the installation and maintenance of such storage solutions. Neglecting the proper provisioning of such a technical setup and best practice integrity preservation techniques inevitably leads to data that could have been altered without notice or even rendered unusable. For example, such alterations can potentially lead to different numerical measurement values and change the file content. Obviously, this can have severe consequences for further scientific evaluation of the data. For the sake of reproducibility and research data integrity, this should be minimized by implementing an appropriate big data infrastructure.

Bioinformatics data processing

With the availability of biological big data, the need to process considerable amounts of data in computations in a timely manner has become a pressing matter [13, 14]. Bioinformatic calculations are therefore turning more complex and require more computing resources to complete in a timely manner. Simple personal computers and monolithic approaches to resolving scientific questions in bioinformatics can no longer keep pace with the growth of available biological data, the complexity of computations, and the need for faster results. Structuring solutions to scientific questions using workflows not only continues to be seen as a good practice, but also helps users to obtain reproducible results in a scalable and timely manner.

There are several comprehensive bioinformatics software libraries [15–17] designed to be integrated into workflow engines, thus facilitating scientists to design scalable workflows. The primary annotation of experimental raw data with bioinformatics workflows requires the integration of many community resources, such as reference genomes and proteomes as well as spectral libraries.
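The checksum best practice recommended above can be sketched in a few lines: derive a digest from the raw file, store it in a companion file, and re-derive it after every transfer. The file name and content below are toy stand-ins for raw measurement data.

```python
import hashlib
import tempfile
from pathlib import Path

def file_checksum(path, algorithm="sha1", chunk_size=1 << 20):
    """Stream the file in chunks so arbitrarily large raw files fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_checksum_file(path):
    """Create the companion checksum file early in the digital object's lifetime."""
    sidecar = Path(str(path) + ".sha1")
    sidecar.write_text(file_checksum(path))
    return sidecar

def verify(path):
    """Re-derive the checksum after a transfer and compare with the recorded one."""
    recorded = Path(str(path) + ".sha1").read_text().strip()
    return file_checksum(path) == recorded

# Toy demonstration: a small file stands in for raw measurement data.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "measurement.raw"
    data.write_bytes(b"mz=445.12 intensity=10321\n")
    write_checksum_file(data)
    intact = verify(data)                                  # unchanged file passes
    data.write_bytes(b"mz=445.13 intensity=10321\n")       # silent one-digit change
    corrupted_detected = not verify(data)                  # mismatch is caught
```

As noted above, such digests detect random corruption but are not a defense against deliberate manipulation.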

Integration beyond institutions

Institutional research is building considerable amounts of data in its application domains and on specific research questions. Studying correlations in heterogeneous datasets, however, requires the integration of large volumes of data that exceed the capacities of single institutions. These needs have been largely appreciated by the respective communities, and the required mindset towards data sharing and open data is currently being pushed forward by peers, funding agencies, and journals. The infrastructure on a global scale is partially provided by public repositories that operate as central data hubs for scientific high-throughput data and allow reusability beyond the original research questions. Secondly, to make use of these global resources, the well-established process of copying the relevant data to a local infrastructure is no longer applicable; instead, the tool sets, e.g., bioinformatics workflows, will need to be brought to the data and not vice versa.

Public repositories

Over the last decades, multiple data-hosting platforms have been established for a variety of data. Most of these platforms aim at providing a centralized location for specific data types, e.g., mass spectrometry (MS)–based proteomics/metabolomics or next-generation sequencing (NGS)–based genomics or transcriptomics experiments.

PRIDE [18] (https://www.ebi.ac.uk/pride/archive/) aims at providing access to protein and peptide identifications as well as post-translational modifications. All entries into PRIDE are curated by a team of experts in the field, ensuring that data entering the repository is of high quality. Standardized datasets are furthermore kept in a preprocessed form, enabling direct comparisons between different experiments in the repository.

MetaboLights (Haug et al. 2013) is a cross-species data repository for metabolomics raw data and the corresponding metadata. Similar to PRIDE, it offers the deposition of mass spectrometric data in open standard formats and is the de facto standard repository for many journals publishing metabolomics studies.

SRA [19] (https://www.ncbi.nlm.nih.gov/sra) and ENA [20] (https://www.ebi.ac.uk/ena) provide means to store sequencing data from both Sanger and next-generation sequencing methodologies in a standardized manner. Results from various publications are archived publicly in these repositories, enabling researchers to utilize datasets for comparative studies in their own research projects.

The Gene Expression Omnibus (GEO) [21] provides public access to archived sets of microarray or other next-generation sequencing functional genomic data. This approach aims at providing a central archive for researchers to download and combine expression data within their own research projects in order to, for example, unravel novel transcriptome changes in different types of tissue.

One of the first projects in the cancer context was the International Cancer Genome Consortium (ICGC)'s database TCGA [9], which stores somatic and germline datasets for more than 25,000 patients and aims at providing a central data resource for personalized treatment and research in cancer genomics.

Other applications of big data in the life sciences have focused on the generation of large gene expression datasets such as GTEx [22], which covers more than 50 different sample tissues. Such an expression atlas can be used intensively to investigate reference gene expression in new measurements or to compare these against putative disease gene expression.

Bringing computation to data

The onset of large-scale research projects in genomics can be seen clearly in the size of the datasets created in recent years. While the first 1000 Genomes Project (including stage 3 of the project) [23] now provides access to more than 2500 whole genome sequencing (WGS) datasets, newer projects such as the Icelandic whole genome diversity project generated WGS samples of 15,220 Icelanders [24], a 6-fold increase.

Current projects such as the UK 100,000 Genomes Project [25] aim to sequence a total of 100,000 WGS samples, and the European Union has initialized a project among 13 member states to sequence a total of 1,000,000 individuals [26] in order to enable rare disease research as well as provide a basis for personalized medicine in healthcare.

Nevertheless, recent years have seen the generation of large life science databases that incorporate data from various sources and are utilized to derive new knowledge by applying data mining technologies to the large-scale data available.

All these approaches have in common that they intend to provide a solid data basis for research scientists. An obvious consequence is that they facilitate the means for future treatment methods that rely on large-scale databases of sequence as well as other omics data to answer specific questions that arise especially in the treatment of cancer and rare disease. The sheer mass of accumulated data in these repositories provides, unfortunately, a challenging environment for the classic analysis path that researchers were relying on before: downloading and integrating datasets in the petabyte (PB) range is oftentimes impractical on local computing resources for different reasons, primarily because it requires a good understanding of these resources and financial possibilities that are not accessible to many researchers. Additionally, acquiring huge datasets is very time consuming and computationally expensive, as the data integrity needs to be verified after the download. Last but not least, it produces high amounts of network traffic, which burdens especially smaller institutions with limited bandwidth capabilities.
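A back-of-the-envelope calculation illustrates why petabyte-scale downloads are impractical even before integrity verification is taken into account. The bandwidth figure is an assumed, idealized sustained rate, not a measurement.

```python
# Illustrative only: ideal transfer time, ignoring protocol overhead,
# contention, and retries, all of which make real transfers slower.

def transfer_days(size_bytes, bandwidth_bits_per_s):
    """Days needed to move `size_bytes` at a sustained line rate."""
    seconds = size_bytes * 8 / bandwidth_bits_per_s
    return seconds / 86400

petabyte = 10**15
gigabit = 10**9   # assumed sustained rate of a well-connected institution

days = transfer_days(petabyte, gigabit)   # roughly three months per petabyte
```

Even under these optimistic assumptions, a single petabyte occupies a 1 Gbit/s link for about 93 days, which is why moving workflows to the data is the more scalable direction.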

high amounts of network traffic which burdens especially Shivom (https://shivom.io/) and Nebula Genomics (https://
smaller institutions with limited bandwidth capabilities. nebula.org/) are initiatives that aim to resolve data sharing and
One example that provides a technical solution is the protection issues in the context of personalized medicine. The
International Cancer Genome Consortium (ICGC) data portal principle behind both concepts is a blockchain-based infra-
(https://dcc.icgc.org/), which provides controlled access to structure that ingests DNA samples from users, e.g., from
human genetic data in the context of cancer research. The personalized genetic testing companies such as 23andme or
sequencing raw data is then stored in different central data AncestryDNA, but also healthcare and sequencing providers.
repositories that are embedded in cloud computing Both raw and processed data belong to the data owner at any
environments such as AWS. The big advantage of this time, but the owner can share access to their data with re-
approach is that once a researcher has granted access to the searchers of their choice, enabled through the blockchain tech-
data repositories, computational workflows can be deployed nology running the service. Users can get tokens to gain ac-
on virtual machines with the data accessible via, i.e., the S3 cess to healthcare products (e.g., more or differently targeted
protocol. No raw data transfer is required (Downs et al. 2014). health reports) in return for providing researchers with more
A different model would be extending cloud server APIs from material. The distributed ledger of blockchain is likely to pro-
offering mere data access (data portals) to performing data oper- vide a solution to data privacy and IP issues, when it comes to
ations. For instance, the International Human Epigenome integrating many datasets in a big data type of experiment.
Consortium allows massive epigenome datasets to be aggregated
and analyzed efficiently on the server-side (Albrecht et al. 2016).
Furthermore, in some research contexts, especially in the
personalized medicine field, stricter data security efforts than Big data quo Vadis?
local institutions can typically provide are required, thus hin-
dering data sharing with researchers outside of data collecting Although bioinformatics as a general research field is existing
institutions that initially have access to, e.g., biopsy material since about 25 years, there have been significant changes in
from patients. Particularly the data sharing policies but also terms of especially data volume that is handled nowadays in
the hindrances that occur when large-scale data repositories bioinformatics workloads.
need to be shared among researchers, therefore, led to several Intriguingly, big data, if comprehensive enough, carries the
developments that aim to bring computing, and therefore data analysis, to the actual data hosts and repositories.

A limitation of the aforementioned cloud approach is that it requires data from various sources to be aggregated in a secure manner. This is particularly problematic when it comes to sensitive patient data. Another possibility is thus federated machine learning, where algorithms operate in a distributed fashion across individual data hubs such as hospitals or research institutes and allow for shared machine learning on the entire dataset. This approach provides privacy by design and is explored by initiatives such as the European FeatureCloud project (http://featurecloud.eu/).

Initiatives like the Global Alliance for Genomics and Health (GA4GH, www.ga4gh.org) aim at providing generalized policies and interaction interfaces for researchers across the globe. With such interfaces, researchers should be enabled to analyze data at various locations by directly initiating data analysis at these distinct locations and integrating the obtained results with their own research data. The biggest change to current big data analysis methods is the context in which such analysis is performed, e.g., computing summary metrics from cancer patients at a hospital and querying the initiative for patients with a similar mutational burden in their tumors. Such information could be incredibly beneficial for cancer treatment, as the treatment outcomes of other patients with a very similar mutational burden would allow for a direct assessment of the feasibility of certain cancer treatments.

potential to challenge the scientific method for new discoveries in science, through studying data correlations at a large scale. Theoretically, comprehensive data circumvents modeling the unknown and instead allows direct readout. However, translating this ambitious goal into the labs requires restructuring many of today's research routines.

The prevalent tendency to build and maintain project-specific and institutional data silos will prevent fostering the full potential of big data. Instead, the field's future is the creation of centralized (meta)data repositories or "data lakes" of structured and unstructured data, a development currently dominated by genomics and ongoing for other omics technologies. An early success of the paradigm shift towards sharing gave rise to initiatives such as GA4GH. These big data ecosystems finally allow globally integrative analyses that were not possible only years ago. As data protection and ethical concerns limit local computing facilities to data that has been collected at the respective institution, decentralized and standardized data-sharing measures have to be developed in the coming years to make big data in science a true success story. Following FAIR principles, big data ecosystems will need to provide a scalable, metadata-aware framework for future analyses in various research areas. Only if the research community conceptually prepares for sharing and integrating data across the globe will artificial intelligence (AI) methods be able to learn from data, fulfill the high expectations, and favorably contribute to the progress of biomedical research.
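The federated machine learning idea described above can be illustrated with a minimal sketch: each site computes a model update on its own data, and only parameters, never raw records, leave the site. This is a simplified federated gradient averaging on a least-squares problem; the class and function names (`Site`, `federated_fit`) are illustrative and do not correspond to the FeatureCloud API or any specific library.

```python
import numpy as np

class Site:
    """One data hub (e.g., a hospital) holding private samples."""
    def __init__(self, X, y):
        self.X, self.y = X, y

    def local_gradient(self, w):
        # Gradient of the least-squares loss on local data only;
        # raw X and y never leave the site
        return self.X.T @ (self.X @ w - self.y) / len(self.y)

def federated_fit(sites, dim, rounds=200, lr=0.1):
    """Coordinator: aggregates per-site gradients, never raw data."""
    w = np.zeros(dim)
    for _ in range(rounds):
        grads = [site.local_gradient(w) for site in sites]
        w -= lr * np.mean(grads, axis=0)
    return w

# Two "hospitals" with disjoint patients drawn from the same model
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
sites = [Site(X, X @ true_w)
         for X in (rng.normal(size=(100, 2)) for _ in range(2))]

w = federated_fit(sites, dim=2)
print(np.round(w, 2))  # converges toward [2.0, -1.0]
```

The jointly learned model matches what central training on the pooled data would yield, while each site retains full control over its records.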
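The distributed-query scenario sketched above, finding patients with a similar mutational burden across institutions, can likewise be expressed as sharing only summary counts. The snippet below is a toy illustration, not the actual GA4GH interfaces; site names and burden values are fabricated for the example.

```python
# Each site answers a query with an aggregate count only --
# no patient-level data crosses institutional boundaries.
def count_similar_patients(site_burdens, query_burden, tolerance):
    """Summary metric, computed where the data lives."""
    return sum(1 for b in site_burdens
               if abs(b - query_burden) <= tolerance)

# Tumor mutational burden (mutations/Mb) at three fictional sites
hospital_data = {
    "site_A": [3.1, 12.4, 8.0, 2.2],
    "site_B": [9.5, 10.1, 45.0],
    "site_C": [10.8, 1.4, 10.9, 7.7, 10.2],
}

query = 10.0  # burden of the patient currently being treated
tol = 1.5

# The coordinator aggregates per-site counts into a global answer
per_site = {name: count_similar_patients(burdens, query, tol)
            for name, burdens in hospital_data.items()}
print(per_site)                # {'site_A': 0, 'site_B': 2, 'site_C': 3}
print(sum(per_site.values()))  # 5 comparable patients in the federation
```

A clinician could then follow up on the five comparable cases, e.g., their treatment outcomes, through the same privacy-preserving channel.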
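What a metadata-aware, FAIR-compliant record might look like can be sketched as follows. The field names are illustrative (loosely inspired by DataCite-style schemas) and the identifier is a placeholder; the point is that each FAIR principle maps to a machine-actionable field.

```python
# Hypothetical dataset record covering the four FAIR principles:
# findable, accessible, interoperable, reusable.
dataset_record = {
    "identifier": "doi:10.xxxx/example-dataset",   # findable: persistent ID
    "access_url": "https://repository.example.org/datasets/42",
    "access_protocol": "https",                    # accessible: open protocol
    "measurement_type": "obo:OBI_0002117",         # interoperable: ontology term
    "organism": "ncbitaxon:9606",
    "license": "CC-BY-4.0",                        # reusable: clear license
    "provenance": {                                # reusable: origin is traceable
        "generated_by": "workflow:rnaseq-v1.0",
        "date": "2019-05-21",
    },
}

# A minimal check a metadata-aware framework could enforce on ingest
required = {"identifier", "access_url", "license", "provenance"}
missing = required - dataset_record.keys()
print("FAIR-complete" if not missing else f"missing: {missing}")
```

Validating such records at ingest time is one concrete way a data lake can remain searchable and reusable as it grows.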
Funding information This work was carried out with the support of the German Research Foundation (DFG) within project INF, SFB/TR 209 "Liver Cancer."

Compliance with ethical standards

Conflict of interest The authors declare that they have no conflict of interest.

Research involving human participants and/or animals Not applicable.

Informed consent Not applicable.
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Sven Fillinger is coordinating the IT and software development at the Quantitative Biology Center (QBiC), Tübingen. He is actively working on shaping the software and IT infrastructure there to contribute to efficient data management and analysis platforms.

Luis de la Garza is currently finishing his doctoral degree program in bioinformatics at the University of Tübingen. Throughout the years, he has specialized in workflows, workflow engines, and high-performance and distributed computing.

Alexander Peltzer is coordinating the Research & Development in Data Science team at the Quantitative Biology Center (QBiC), Tübingen. His main focus lies on the integration of efficient workflow management and execution systems in data science, with a detailed focus on cloud computing and the application of machine learning.

Oliver Kohlbacher is a Chair for Applied Bioinformatics at the University of Tübingen, Director of the Institute for Translational Bioinformatics at University Hospital Tübingen, and a Fellow at the Max Planck Institute for Developmental Biology. The lab's current research focus is on developing methods and tools for the analysis of biomedical high-throughput data and their application in translational research.

Sven Nahnsen is the Scientific Director of the Quantitative Biology Center (QBiC), Tübingen. His current research interests range from bioinformatics method development and data management to data science applications in the life and environmental sciences.