
Available online at www.sciencedirect.com

ScienceDirect

Big data analytics for personalized medicine


Davide Cirillo¹ and Alfonso Valencia¹,²

Big Data are radically changing biomedical research. The unprecedented advances in automated collection of large-scale molecular and clinical data pose major challenges to data analysis and interpretation, calling for the development of new computational approaches. The creation of powerful systems for the effective use of biomedical Big Data in Personalized Medicine (a.k.a. Precision Medicine) will require significant scientific and technical developments, including infrastructure, engineering, project and financial management. We review here how the evolution of data-driven methods offers the possibility to address many of these problems, guiding the formulation of hypotheses on systems functioning and the generation of mechanistic models, and facilitating the design of clinical procedures in Personalized Medicine.

Addresses
¹ Barcelona Supercomputing Center (BSC), C/Jordi Girona 29, 08034, Barcelona, Spain
² ICREA, Pg. Lluís Companys 23, 08010, Barcelona, Spain

Corresponding author: Cirillo, Davide (davide.cirillo@bsc.es)

Current Opinion in Biotechnology 2019, 58:161–167
This review comes from a themed issue on Systems biology
Edited by Maria Klapa and Yannis Androulakis
https://doi.org/10.1016/j.copbio.2019.03.004
0958-1669/© 2019 Elsevier Ltd. All rights reserved.

Introduction
In the last few decades, biomedical research has undergone an unprecedented transformation, leading to a novel paradigm of data-driven biomedical science that complements the well-developed hypothesis-driven knowledge discovery [1]. Along with high-throughput genome sequencing, marked by plummeting costs and increasing availability, healthcare organizations have started to embrace extraction of information from digitized clinical records and imaging data. The outstanding improvement in automated collection of massive data volumes is exemplified by community movements such as the Global Alliance for Genomics and Health (GA4GH, www.ga4gh.org), research infrastructures like ELIXIR [2] and Big Data to Knowledge (BD2K) [3], and international initiatives such as the International Cancer Genome Consortium (ICGC), the International Human Epigenome Consortium (IHEC), and the International Rare Disease Consortium (IRDiRC), among others (Table 1).

Access to ‘big data’, a term first introduced in 1997 in the context of data visualization [4], poses an exceptional and ambitious challenge for biomedical research, with special emphasis on Personalized Medicine [5]. A typical portrait of biomedical big data features heterogeneous, multi-spectral, incomplete, and imprecise observations [6]. Hence, data-intensive analytics require ad hoc capabilities for complex data representation and modeling, algorithmic optimization, and computational power. In healthcare, for instance, such systems are defined by five distinctive capabilities: patterns of care identification, unstructured data analysis, decision support, prediction, and traceability [7].

Nonetheless, data gathering moves faster than both data processing and data analysis, emphasizing the widening gap between the rapid technological progress in data acquisition and the comparatively slow functional characterization of biomedical information [8]. In this regard, the integration of molecular information, such as multi-omics data, with the phenotypic information of individual patients from electronic health records (EHRs) is becoming critically important. Considering the relevance of biomedical data protection, efficient and secure models for storage, integration, data-driven discovery, and interpretation must characterize big data analytics systems for Personalized Medicine. In the following, we review the main types and systems for big data management, analysis, and interpretation in Personalized Medicine.

Data types
Routinely collected patient information is gaining volume and complexity. For instance, neuroimaging is currently producing more than 10 petabytes of data every year with a staggering ninefold increase in data complexity (i.e. data acquisition modalities) over the last three decades [6]. In parallel, genomic data alone are expected to reach exascale dimensions by the next decade, largely exceeding other big data areas such as astronomy [9]. Strikingly, these data represent a tiny fraction of the over 2.5 quintillion bytes (roughly 2.3 billion gigabytes) of information generated every day [10].

Among big data types, imaging data can be considered the largest in volume as it covers not only gigapixel images, showing tissues and organisms at subcellular resolutions,


Table 1

Large international consortia focusing on Personalized Medicine

Initiative                                                              Research focus                            Link
Human Genome Diversity Project (HGDP)                                   General                                   www.hagsc.org/hgdp/
Global Network of Personal Genome Projects (PGP)                        General                                   www.personalgenomes.org/
The Encyclopedia of DNA Elements (ENCODE)                               General                                   www.encodeproject.org/
The NIH Roadmap Epigenomics Mapping Consortium (Roadmap)                General                                   www.roadmapepigenomics.org/
International Human Epigenome Consortium (IHEC)                         General                                   http://ihec-epigenomes.org/
Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE)   Cardiovascular and age-related diseases   www.chargeconsortium.com/
International Cancer Genome Consortium (ICGC)                           Cancer                                    https://icgc.org/
The Cancer Genome Atlas (TCGA)                                          Cancer                                    https://cancergenome.nih.gov/
International Rare Disease Consortium (IRDiRC)                          Rare diseases                             www.irdirc.org/

but also metadata and quantitative measurements. In this view, the development of integrative platforms for scalable analysis of imaging data coupled with genetic and functional annotations is of utmost importance [11].

Health information in digital format also includes the structured (e.g. ICD codes) and unstructured (e.g. symptom descriptions) content of EHRs, which were initially designed to communicate information between clinicians and now represent a valuable resource for research and model development [12]. Concurrently, massively parallel quantification technologies for genomic data, such as whole-genome sequencing (WGS) and whole-exome sequencing (WEX), are playing a key role in accelerating data-driven biomedical discovery. Along with genome sequencing, experimental platforms such as transcriptome sequencing (RNA-seq and ribosome profiling), proteome profiling (mass spectrometry) and interactome profiling (chromosome conformation capture, ChIP-seq, hybrid assays) are making biomedical information accessible faster and more cheaply [13].

Recent advances in genomics, including single-cell genome and transcriptome sequencing [9], circulating tumor DNA (ctDNA) identification through liquid biopsy [14], and sequencing of bacterial genomes in human samples (metagenomics), are already having a great impact in medicine, and are destined to be integrated in standard medical practice. In particular, long-read sequencing has proved successful for microbial composition analysis and de novo genome assembly, especially in combination with short-read sequencing and optical maps [15].

Together with imaging, multi-omics data, and EHRs, patient-generated health data (PGHD) from wearable and implantable devices are becoming an increasingly relevant big data type in Personalized Medicine [16]. Health and treatment history as well as lifestyle choices, tracked through mobile apps, foster patient engagement, which is crucial to improve the quality of healthcare services [12]. Moreover, real-time sensor devices for biometric measurements are promoting developments in new areas such as stream computing, concerned with processing and analysis of real-time flows of data [17].

Data management
Distinct areas in which big data are applied include drug and biomarker development, and basic research in cancer, rare diseases, neurodegeneration, diabetes, and cardiovascular pathologies, among others [18]. The development of these areas in the frame of Personalized Medicine is high in social and governmental agendas, requiring large collaborative efforts, collective expertise and distributed management (Figure 1). There are many examples of large international research platforms committed to achieving Personalized Medicine solutions, such as the personalized brain models for patients with intractable epilepsy [19] developed within the Human Brain Project (www.humanbrainproject.eu), a Flagship initiative of the European Commission. The model of research-by-consortium, initiated by community movements and increasingly adopted by government agencies and industry, is at the basis of all the large-scale biomedical projects, as in the case of the iconic Human Genome Project. For example, the European Commission has invested more than €2.6 billion in Personalized Medicine research through the FP7 and Horizon 2020 programs [20], and launched the International Personalized Medicine consortium (PerMed), whose success stories include the BLUEPRINT project for the study of epigenetic mechanisms of hematopoiesis [21]. National funding, infrastructures, institutional support, public–private consortia, and donations favor open access, user equity, and sustainability, especially in the case of expert-curated knowledgebases struggling with inappropriate funding models [22]. Sustaining the biomedical big data ecosystem, largely funded by short-term grants, demands coordinated international efforts and an open discussion on future directions and actions [23].

Biomedical big data exhibit unique features, such as highly distributed acquisition, format heterogeneity, and content sensitivity. In this regard, the General Data Protection Regulation (EU) 2016/679 (GDPR), setting


Figure 1

[World map; initiatives labelled by country or region]
Africa: Human Heredity and Health in Africa (H3Africa)
Australia: National Centre for Indigenous Genomics (NCIG)
Canada: Tomorrow Project
Catalonia: Undiagnosed Rare Diseases Catalunya (URDCat); Medicina Personalitzada a Catalunya-Cancer (MedPerCan)
China: China Kadoorie Biobank
Czech Republic: National Centre for Medical Genomics
Denmark: Genome Denmark
Estonia: Estonian Biobank (EGCUT)
Finland: FinnGen
France: Médecine France génomique 2025
Iceland: deCODE genetics
Italy: SardiNIA
Japan: The Initiative on Rare and Undiagnosed Diseases (IRUD)
Luxembourg: The National Centre of Excellence in Research on Parkinson’s disease
Mexico and Latin America: Slim Initiative in Genomic Medicine for the Americas (SIGMA)
Netherlands: Genome of the Netherlands (GoNL)
Saudi Arabia: Saudi Human Genome Project
Scotland: The Scottish Genomes Partnership
Singapore: Singapore Genome Variation Program
Spain: Spanish Undiagnosed Rare Diseases Program (SpainUDP)
Switzerland: Swiss Personalised Health network (SPHN)
United Kingdom: UK Biobank; Genomics England
United States: Precision Medicine Initiative (PMI) Cohort Program

Geographic scope of ongoing population-scale sequencing initiatives for Personalized Medicine (adapted from Ref. [45]).

the framework of ethical and anonymous data processing, will have a significant impact on the design of biomedical research activities. Blockchain-based cryptographic techniques for patient anonymization using smart contract technology [24] represent promising research lines that still require substantial investigation. Moreover, cloud computing is becoming a mainstream way to build and deliver software and storage solutions (IBM Cloud, Google Cloud, Amazon Web Services), by allowing stakeholders to use resources ‘on demand’ to foster reproducibility [25], and promoting requirements to limit security threats [26]. To be effective, biomedical data must be secure but also Findable, Accessible, Interoperable, and Reusable (FAIR) [27]. Indeed, maintaining the continuity of the main data sources, as in the case of the ELIXIR Data Platform (www.elixir-europe.org/platforms/data) [2], is crucial to prevent the seclusion of data in silos.

Fundamental in this scenario is high performance computing (HPC). Supercomputers and parallel processing are essential for addressing complex problems with intensive data tasks. European initiatives include the big data ecosystem European Open Science Cloud (EOSC, https://eosc-hub.eu/), and EuroHPC for exascale supercomputer development [28]. These initiatives aim to provide industry and public authorities with world-class HPC solutions along with premier data storage, management, and transport. In this regard, input/output hardware innovations play a major role in fostering effective big data handling. An interesting example is the IBM POWER9/NVIDIA architecture, specifically designed to support artificial intelligence and deep learning [29].

Data analysis and interpretation
The key question in biomedical research is how to extract knowledge from big data. Although some of the hardest challenges for computing systems are focused on extreme data analytics and data-intensive simulations, such as streaming data analysis [30] and virtual patient design [31], machine learning on high-dimensional data represents a prevalent concern. Biomedical big data entail ensembles of complementary information retrieved from heterogeneous sources, which can


be referred to as multi-view data, representing multiple facets of data instances in different feature spaces. Multi-view data can be investigated through several data-driven integrative workflows that generally require inference of associations among distinct entities [32].

Machine learning methods can be effectively applied to deliver integrative solutions for multi-view data in order to explain an event or predict an outcome. For instance, generalized linear models (GLMs) [33] comply with a broad model formulation where the outcome is linearly related to factors and covariates through a link function that permits the estimation of the model parameters, typically by maximum likelihood or Bayesian techniques. As features of biological data often exhibit some form of structure, such as groups of genes with similar functions, structural regularization with methods such as Sparse Group LASSO (Least Absolute Shrinkage and Selection Operator) is a common approach for supervised multi-view feature selection. Along with GLMs, common machine learning models for multi-view data are Bayesian models such as the naïve Bayes classifier, ensemble-learning models such as random forest, neural networks and more recently deep learning (Figure 2).

Biomedical data pose unique challenges for deep learning [34], a repertoire of highly accurate and flexible neural network-based machine learning techniques that in the last few years have been successfully applied to domain-specific applications. The rapid popularity of deep learning resides in the convergence of novel hardware improvements, easy-to-use software packages, and the availability of large datasets fitting vast parameter spaces.

Figure 2

[Schematic; panel labels: PATIENT, BIOSAMPLE, EXPERIMENTAL TECHNIQUE]
REGRESSION: Linear regression; Logistic regression
CLUSTERING: k-Means; Expectation-Maximization (EM)
REGULARIZATION: LASSO; Elastic Net
INSTANCE-BASED METHODS: k-Nearest Neighbor (kNN); Self-Organizing Map (SOM)
DECISION TREES: Decision Stump; Conditional Decision Trees
BAYESIAN METHODS: Naive Bayes; Bayesian Belief Networks (BBN)
ENSEMBLE METHODS: AdaBoost; Random Forest
NEURAL NETWORKS: Deep Boltzmann Machine (DBM); Convolution Neural Network (CNN)
RULE-BASED METHODS: Apriori algorithm; Eclat algorithm
DIMENSIONALITY REDUCTION: Principal Component Analysis (PCA); Multidimensional Scaling (MDS)

Machine learning algorithms for multi-view data analysis. Biosamples from several experimental techniques (e.g. genomic, proteomic, and metabolomic data) can be used to identify associations within and between multiple sets of patients, and generate integrative models for patient stratification.
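As a rough illustration of how two of the algorithm families named in Figure 2 might be chained for patient stratification, the sketch below standardizes two simulated views, concatenates them (a simple "early integration" strategy), reduces them with PCA, and clusters the patients with k-means. Everything here is an assumption made for the example: the view names `expression` and `methylation`, the simulated effect sizes, and the choice of early integration itself, which is only one of several integrative workflows.

```python
import numpy as np

def zscore(view):
    """Standardize each feature so that views on different scales are comparable."""
    return (view - view.mean(axis=0)) / view.std(axis=0)

def pca_scores(X, n_components):
    """Project centered data onto its leading principal components via SVD."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

def kmeans(X, init_centroids, n_iter=50):
    """Plain Lloyd's algorithm; returns one cluster label per row of X."""
    centroids = init_centroids.copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(len(centroids)):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return labels

# Simulated cohort (hypothetical): 60 patients in two subgroups, measured in two views.
rng = np.random.default_rng(1)
truth = np.repeat([0, 1], 30)
expression = rng.normal(size=(60, 12))   # e.g. transcript abundances
methylation = rng.normal(size=(60, 8))   # e.g. methylation levels
expression[truth == 1, :6] += 4.0        # subgroup-specific shift in some features
methylation[truth == 1, :4] += 4.0

# Early integration: standardize each view, then concatenate the feature spaces.
integrated = np.hstack([zscore(expression), zscore(methylation)])
scores = pca_scores(integrated, n_components=2)

# Seed the two centroids at the extremes of the first component (a simple deterministic choice).
init = scores[[scores[:, 0].argmin(), scores[:, 0].argmax()]]
labels = kmeans(scores, init)

# Cluster labels are arbitrary, so compare with the simulated groups up to a label swap.
agreement = max(np.mean(labels == truth), np.mean(labels != truth))
print(f"agreement with simulated groups: {agreement:.2f}")
```

Concatenation treats all views as one feature space, which is easy to implement but can let a large view dominate; per-view standardization mitigates this only partly, which is one reason dedicated multi-view methods exist.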


Table 2

Notable biomedical applications of Deep learning

Task                                           Data type                          Method  Reference
Transcription factor binding sites prediction  PBM, SELEX, ChIP-seq and CLIP-seq  CNN     [46]
Promoter-enhancer interaction prediction       Hi-C and genomic annotations       CNN     [47]
Metagenomic classification                     16S rRNA sequences                 RNN     [48]
Variant calling                                NGS data                           CNN     [49]
Disease risk                                   Gene variants                      CNN     [50]
De novo drug design                            Chemical compounds                 RNN     [51]
Hospitalization outcome prediction             EHRs                               RNN     [52]
Epileptic seizure prediction                   EEG                                RNN     [53]
Medical images analysis                        Skin lesion images                 CNN     [54]

CNN, convolutional neural network; RNN, recursive neural network; PBM, protein-binding microarrays; SELEX, systematic evolution of ligands by exponential enrichment; ChIP, chromatin immunoprecipitation; CLIP, cross-linking immunoprecipitation; Hi-C, high-resolution chromosome conformation capture; NGS, next generation sequencing; EEG, electroencephalogram.
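To give a feel for what the CNN entries in Table 2 compute on sequence data, the sketch below shows the basic building block in isolation: a one-hot encoded DNA sequence is scanned by a single convolutional filter, followed by a ReLU and global max pooling. This is a simplified illustration and not the architecture of any cited model. The filter weights are set by hand to recognize an example 7-mer rather than learned from data, and the sequences, the bias value, and all function names are assumptions.

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) matrix with one 1 per position."""
    x = np.zeros((len(seq), len(ALPHABET)))
    x[np.arange(len(seq)), [ALPHABET.index(base) for base in seq]] = 1.0
    return x

def conv1d_valid(x, kernel, bias=0.0):
    """Slide a (k, 4) filter along the sequence and return one score per window."""
    k = kernel.shape[0]
    return np.array([np.sum(x[i:i + k] * kernel) + bias
                     for i in range(x.shape[0] - k + 1)])

def relu(z):
    return np.maximum(z, 0.0)

# Hand-set filter (hypothetical): +1 for the motif base, -1 for any other base,
# so a window scores (matches - mismatches). A trained CNN would learn such weights.
motif = "TGACTCA"
kernel = 2.0 * one_hot(motif) - 1.0
bias = -(len(motif) - 1)  # only a perfect match stays positive after the ReLU

seq_with_motif = "ACGTTGACTCAGGCT"     # contains the motif starting at index 4
seq_without_motif = "ACGTTGACGCAGGCT"  # same sequence with one base changed

act_pos = relu(conv1d_valid(one_hot(seq_with_motif), kernel, bias))
act_neg = relu(conv1d_valid(one_hot(seq_without_motif), kernel, bias))

# Global max pooling summarizes "is the pattern present anywhere in the sequence?"
pooled_pos, pooled_neg = act_pos.max(), act_neg.max()
best_position = int(np.argmax(act_pos))
print(pooled_pos, pooled_neg, best_position)
```

In a full model many such filters are learned jointly and their pooled activations feed further layers, which is what allows sequence preferences to be inferred from binding data rather than specified in advance.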

Deep learning has largely been applied to biomedical data integration and modeling (Table 2). In particular, it has been effectively employed in the classification of medical images and videos, often in combination with processing of EHRs [35], and included in systems supporting physician-computer interactions [36].

Massive volumes of aggregated biomedical data often display different levels of granularity, that is, a variety of data dimensionalities, sample sizes, sources and formats [37]. In particular, small data sets, low numbers of samples, and class imbalance represent substantial problems typical of many areas of biomedicine. A number of techniques, such as data augmentation (e.g. generation of adversarial examples) and transfer learning (e.g. repurposing features of established models), are being explored to address these problems [38]. As for the frequent lack of ground-truth or expert-validated labels for training, novel machine learning approaches such as weak supervision [39] are used to automatically generate data labels to be used for deep learning model training, for example annotating a corpus of radiology reports to train an image classifier.

With a special emphasis on human decision making, a series of new technologies inspired by neuroscience and based on various forms of Natural Language Processing (NLP) and computational linguistics are referred to as cognitive computing [40]. Compared to other approaches, cognitive computing pursues a dynamic process of observation, interpretation, evaluation, and decision. Limitations relate to correct understanding of contextual meaning and information uncertainty. The most popular example of a cognitive system is IBM Watson, which has recently been applied to several cancer types [41] and neurological diseases [42]. Several reports describe the technical details of the Watson system [43], along with commentaries discussing its impact on society [44].

Conclusions
The Big Data paradigm shift is significantly transforming healthcare and biomedical research [9], having the potential to inspire systematic ways to process clinical and molecular information that spans the four dimensions of volume, velocity, variety, and veracity, referring to scale, rate, forms, and content of generated data [7]. At present, large amounts of multi-omics, imaging, medical device, and EHR data are available from large-scale cohort and population studies, revealing subtle differences in human genetics and allowing Personalized Medicine interventions, while engaging infrastructural and research management innovation and sustainability. Challenges in big data analytics are pointing to the development of effective applications in areas where finding connections and insights can be difficult due to data abundance and the complexity of biological systems. Advanced machine learning methods such as deep learning and platforms for cognitive computing represent the future toolbox for the data-driven analysis of biomedical big data. Encouraging progress in these areas will be indispensable for future innovation in healthcare and Personalized Medicine.

Funding
This work was supported by BBVA Foundation [“Precision Medicine from Big Data to Cognitive Computing (ref. 76/2016)”], and the IBM-BSC Joint Study Agreement (JSA) on Precision Medicine under the IBM-BSC Deep Learning Center Agreement.

Conflict of interest statement
Nothing declared.

Acknowledgements
Authors would like to acknowledge Alba Jené and Salvador Capella for their support and advice.

References and recommended reading
Papers of particular interest, published within the period of review, have been highlighted as:

• of special interest
• of outstanding interest

1. Kitchin R: Big Data, new epistemologies and paradigm shifts. Big Data Soc 2014, 1 205395171452848.


2. Durinx C, McEntyre J, Appel R, Apweiler R, Barlow M, Blomberg N, Cook C, Gasteiger E, Kim J-H, Lopez R et al.: Identifying ELIXIR core data resources. F1000Res 2016, 5.
3. Margolis R, Derr L, Dunn M, Huerta M, Larkin J, Sheehan J, Guyer M, Green ED: The National Institutes of Health’s Big Data to knowledge (BD2K) initiative: capitalizing on biomedical big data. J Am Med Inform Assoc 2014, 21:957-958.
4. Cox M, Ellsworth D: Application-controlled demand paging for out-of-core visualization. IEEE Vis 1997:235-244.
5. • Rehm HL: Evolving health care through personal genomics. Nat Rev Genet 2017, 18:259-267.
    This essay illustrates the impact of personalized genomics in everyday patient care, from preconception and newborn screening to pediatric and adult medicine. Preventive medicine and cost-saving solutions for health care system are discussed.
6. Dinov ID: Volume and value of big healthcare data. J Med Stat Inform 2016, 4.
7. • Wang Y, Kung L, Byrd TA: Big data analytics: understanding its capabilities and potential benefits for healthcare organizations. Technol Forecast Soc Change 2018, 126:3-13.
    The authors describe the main constituents of big data analytics architectures in healthcare. The main layers of such systems (i.e. data sources, aggregation, analysis, exploration, and governance) are discussed. Moreover, the authors identify five potential benefits (capabilities) of big data analytics by rigorously analyzing 26 big data implementation cases in healthcare.
8. Berger B, Peng J, Singh M: Computational solutions for omics data. Nat Rev Genet 2013, 14:333-346.
9. • Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, Efron MJ, Iyer R, Schatz MC, Sinha S, Robinson GE: Big Data: astronomical or genomical? PLoS Biol 2015, 13:e1002195.
    In this landmark article, authors compare genomics with other domains generating Big Data, namely astronomy and social media. The growth in data acquisition, storage, analysis capabilities, and distribution is projected in the next decade revealing a rapid exascale growth of genomic data.
10. Quintero D, Genovese WM, Kim K, Li MJMJ, Martins F, Nainwal A, Smolej D, Tabinowski M, Tiwary A, Redbooks IBM: IBM Software Defined Environment. IBM Redbooks; 2015.
11. Williams E, Moore J, Li SW, Rustici G, Tarkowska A, Chessel A, Leo S, Antal B, Ferguson RK, Sarkans U et al.: The image data resource: a bioimage data integration and publication platform. Nat Methods 2017, 14:775-781.
12. • Genes N, Violante S, Cetrangol C, Rogers L, Schadt EE, Chan Y-FY: From smartphone to EHR: a case report on integrating patient-generated health data. NPJ Digit Med 2018, 1:552.
    This case report explores the clinical implications of patient-generated health data (PGHD) paired with electronic health records (EHRs). In particular, participants’ data are collected from their mobile phones and transferred to the EHRs.
13. van Dijk EL, Jaszczyszyn Y, Naquin D, Thermes C: The third revolution in sequencing technology. Trends Genet 2018, 34:666-681.
14. Heitzer E, Perakis S, Geigl JB, Speicher MR: The potential of liquid biopsies for the early detection of cancer. NPJ Precis Oncol 2017, 1:36.
15. Weissensteiner MH, Pang AWC, Bunikis I, Höijer I, Vinnere-Petterson O, Suh A, Wolf JBW: Combination of short-read, long-read, and optical mapping assemblies reveals large-scale tandem repeat arrays with population genetic implications. Genome Res 2017, 27:697-708.
16. Collins FS, Varmus H: A new initiative on precision medicine. N Engl J Med 2015, 372:793-795.
17. Ta Van-Dai, Liu Chuan-Ming, Nkabinde GW: Big data stream computing in healthcare real-time analytics. 2016 IEEE International Conference on Cloud Computing and Big Data Analysis (ICCCBDA) 2016:37-42.
18. Lim MD: Consortium sandbox: building and sharing resources. Sci Transl Med 2014, 6:242cm6.
19. Proix T, Bartolomei F, Guye M, Jirsa VK: Individual brain structure and modelling predict seizure propagation. Brain 2017, 140:641-654.
20. Nimmesgern E, Norstedt I, Draghia-Akli R: Enabling personalized medicine in Europe by the European commission’s funding activities. Pers Med 2017, 14:355-365.
21. Fernández JM, de la Torre V, Richardson D, Royo R, Puiggròs M, Moncunill V, Fragkogianni S, Clarke L, BLUEPRINT Consortium, Flicek P et al.: The BLUEPRINT data analysis portal. Cell Syst 2016, 3:491-495.e5.
22. Gabella D, Durinx C, Appel R: Funding knowledgebases: towards a sustainable funding model for the UniProt use case. F1000Res 2017, 6 pii: ELIXIR-2051.
23. Bourne PE, Lorsch JR, Green ED: Perspective: sustaining the big-data ecosystem. Nature 2015, 527:S16-S17.
24. Kiyomoto S, Rahman MS, Basu A: On blockchain-based anonymized dataset distribution platform. 2017 IEEE 15th International Conference on Software Engineering Research, Management and Applications (SERA) 2017:85-92.
25. • Langmead B, Nellore A: Cloud computing for genomic data analysis and collaboration. Nat Rev Genet 2018, 19:208-219.
    Cloud computing is fundamental for large computational resources management. This review examines the advantages of cloud computing for genomic research, with special emphasis on data reproducibility and models for distributed collaboration. Aspects concerning security and costs control are discussed.
26. Tang J, Cui Y, Li Q, Ren K, Liu J, Buyya R: Ensuring security and privacy preservation for cloud data services. ACM Comput Surv 2016, 49:1-39.
27. Wilkinson MD, Dumontier M, Aalbersberg IJJ, Appleton G, Axton M, Baak A, Blomberg N, Boiten J-W, da Silva Santos LB, Bourne PE et al.: The FAIR guiding principles for scientific data management and stewardship. Sci Data 2016, 3:160018.
28. Alowayyed S, Groen D, Coveney PV, Hoekstra AG: Multiscale computing in the exascale era. J Comput Sci 2017, 22:15-25.
29. Sadasivam SK, Thompo BW, Kalla R, Startke WJ: IBM Power9 processor architecture. IEEE Micro 2017, 32:40-51.
30. Sutton J, Mahajan R, Akbilgic O, Kamaleswaran R: PhysOnline: online feature extraction and machine learning pipeline for real-time analysis of streaming physiological data. IEEE J Biomed Health Inform 2018, 23:59-65.
31. • Chase JG, Preiser J-C, Dickson JL, Pironet A, Chiew YS, Pretty CG, Shaw GM, Benyo B, Moeller K, Safaei S et al.: Next-generation, personalised, model-based critical care medicine: a state-of-the art review of in silico virtual patient models, methods, and cohorts, and how to validate them. Biomed Eng Online 2018, 17:24.
    This comprehensive review focuses on computational models of human physiology, from patient-specific models to virtual patients and virtual cohorts. The development of such models, to be used in silico to test medical interventions and treatment protocols, will define the next generation personalized care and help clinical trial optimization.
32. • Li Y, Wu F-X, Ngom A: A review on machine learning principles for multi-view biological data integration. Brief Bioinformatics 2018, 19:325-340.
    This technical review describes common implementations of machine learning methods for multi-view data analysis, such as general linear models, Bayesian methods, ensemble learning methods, kernel methods, and network-based methods.
33. Agresti A: Categorical Data Analysis. John Wiley & Sons; 2013.
34. LeCun Y, Bengio Y, Hinton G: Deep learning. Nature 2015, 521:436-444.
35. Shen D, Wu G, Suk H-I: Deep learning in medical image analysis. Annu Rev Biomed Eng 2017, 19:221-248.
36. • Topol EJ: High-performance medicine: the convergence of human and artificial intelligence. Nat Med 2019, 25:44-56.
    This excellent review addresses the pervasive impact of artificial intelligence on healthcare. This work surveys the major critical points in the evolution of innovative technology for medicine, such as the limits of


    automation of artificial intelligence medical algorithms, and the lack of prospective studies in real-world clinical environment.
37. Rector A, Rogers J, Bittner T: Granularity, scale and collectivity: when size does and does not matter. J Biomed Inform 2006, 39:333-349.
38. • Ching T, Himmelstein DS, Beaulieu-Jones BK, Kalinin AA, Do BT, Way GP, Ferrero E, Agapow P-M, Zietz M, Hoffman MM et al.: Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface 2018, 15.
    This significant work represents the most comprehensive and recent survey of deep learning applications in biomedical research and healthcare. The massive review covers topics comprising patient stratification, modeling of biological processes, disease treatment recommendation, and many others. Every topic is discussed in detail and accompanied with an overview of limitations and perspectives.
39. Ratner A, Bach SH, Ehrenberg HR, Fries JA, Wu S, Ré C: Snorkel: rapid training data creation with weak supervision. Proc VLDB Endowment 2017, 11:269-282.
40. Gupta S, Kar AK, Baabdullah A, Al-Khowaiter WAA: Big data with cognitive computing: a review for the future. Int J Inf Manage 2018, 42:78-89.
41. Patel NM, Michelini VV, Snell JM, Balu S, Hoyle AP, Parker JS, Hayward MC, Eberhard DA, Salazar AH, McNeillie P et al.: Enhancing next-generation sequencing-guided cancer care through cognitive computing. Oncologist 2018, 23:179-185.
42. Bakkar N, Kovalik T, Lorenzini I, Spangler S, Lacoste A, Sponaugle K, Ferrante P, Argentinis E, Sattler R, Bowser R: Artificial intelligence in neurodegenerative disease research: use of IBM Watson to identify additional RNA-binding proteins altered in amyotrophic lateral sclerosis. Acta Neuropathol 2018, 135:227-247.
43. Chen Y, Elenee Argentinis JD, Weber G: IBM Watson: how cognitive computing can be applied to big data challenges in life sciences research. Clin Ther 2016, 38:688-701.
44. Hernandez D, Greenwald T: IBM has a Watson dilemma. Wall Street J 2018. Retrieved from http://online.wsj.com.
45. Dubow T: Population-scale sequencing and the future of genomic medicine: learning from past and present efforts. RAND Corp Res Rep 2016. RR-1520-RE. Retrieved from https://www.rand.org.
46. Alipanahi B, Delong A, Weirauch MT, Frey BJ: Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol 2015, 33:831-838.
47. Singh S, Yang Y, Poczos B, Ma J: Predicting enhancer-promoter interaction from sequence with deep neural networks. bioRxiv 2016:085241. Retrieved from https://www.biorxiv.org.
48. Ditzler G, Polikar R, Rosen G: Multi-layer and recursive neural networks for metagenomic classification. IEEE Trans Nanobioscience 2015, 14:608-616.
49. Poplin R, Chang P-C, Alexander D, Schwartz S, Colthurst T, Ku A, Newburger D, Dijamco J, Nguyen N, Afshar PT et al.: Creating a universal SNP and small indel variant caller with deep neural networks. bioRxiv 2016:092890. Retrieved from https://www.biorxiv.org.
50. Zhou J, Theesfeld CL, Yao K, Chen KM, Wong AK, Troyanskaya OG: Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat Genet 2018, 50:1171-1179.
51. Segler MHS, Kogej T, Tyrchan C, Waller MP: Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent Sci 2018, 4:120-131.
52. Rajkomar A, Oren E, Chen K, Dai AM, Hajaj N, Hardt M, Liu PJ, Liu X, Marcus J, Sun M et al.: Scalable and accurate deep learning with electronic health records. NPJ Digit Med 2018, 1:1609.
53. Tsiouris KM, Pezoulas VC, Zervakis M, Konitsiotis S, Koutsouris DD, Fotiadis DI: A long short-term memory deep learning network for the prediction of epileptic seizures using EEG signals. Comput Biol Med 2018, 99:24-37.
54. Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, Thrun S: Dermatologist-level classification of skin cancer with deep neural networks. Nature 2017, 542:115-118.
