Extremely powerful computers are needed to help biologists to handle big-data traffic jams.
BY VIVIEN MARX

Biologists are joining the big-data club. With the advent of high-throughput genomics, life scientists are starting to grapple with massive data sets, encountering challenges with handling, processing and moving information that were once the domain of astronomers and high-energy physicists¹.

With every passing year, they turn more often to big data to probe everything from the regulation of genes and the evolution of genomes to why coastal algae bloom, what microbes dwell where in human body cavities and how the genetic make-up of different cancers influences how cancer patients fare².

The European Bioinformatics Institute (EBI) in Hinxton, UK, part of the European Molecular Biology Laboratory and one of the world's largest biology-data repositories, currently stores 20 petabytes (1 petabyte is 10¹⁵ bytes) of data and back-ups about genes, proteins and small molecules. Genomic data account for 2 petabytes of that, a number that more than doubles every year³ (see 'Data explosion').

This data pile is just one-tenth the size of the data store at CERN, Europe's particle-physics laboratory near Geneva, Switzerland. Every year, particle-collision events in CERN's Large Hadron Collider generate around 15 petabytes of data — the equivalent of about 4 million high-definition feature-length films. But the EBI and institutes like it face similar data-wrangling challenges to those at CERN, says Ewan Birney, associate director of the EBI. He and his colleagues now regularly meet with organizations such as CERN and the European Space Agency (ESA) in Paris to swap lessons about data storage, analysis and sharing.

All labs need to manipulate data to yield research answers. As prices drop for high-throughput instruments such as automated
13 JUNE 2013 | VOL 498 | NATURE | 255
© 2013 Macmillan Publishers Limited. All rights reserved
BIG DATA TECHNOLOGY
genome sequencers, small biology labs can become big-data generators. And even labs without such instruments can become big-data users by accessing terabytes (10¹² bytes) of data from public repositories at the EBI or the US National Center for Biotechnology Information in Bethesda, Maryland. Each day last year, the EBI received about 9 million online requests to query its data, a 60% increase over 2011.

Biology data mining has challenges all of its own, says Birney. Biological data are much more heterogeneous than those in physics. They stem from a wide range of experiments that spit out many types of information, such as genetic sequences, interactions of proteins or findings in medical records. The complexity is daunting, says Lawrence Hunter, a computational biologist at the University of Colorado Denver. "Getting the most from the data requires interpreting them in light of all the relevant prior knowledge," he says.

That means scientists have to store large data sets, and analyse, compare and share them — not simple tasks. Even a single sequenced human genome is around 140 gigabytes in size. Comparing human genomes takes more than a personal computer and online file-sharing applications such as Dropbox.

In an ongoing study, Arend Sidow, a computational biologist at Stanford University in California, and his team are looking at specific changes in the genome sequences of tumours from people with breast cancer. They wanted to compare their data with the thousands of other published breast-cancer genomes and look for similar patterns in the scores of different cancer types. But that is a tall order: downloading the data is time-consuming, and researchers must be sure that their computational infrastructure and software tools are up to the task. "If I could, I would routinely look at all sequenced cancer genomes," says Sidow. "With the current infrastructure, that's impossible."

In 2009, Sidow co-founded a company called DNAnexus in Mountain View, California, to help with large-scale genetic analyses. Numerous other commercial and academic efforts also address the infrastructure needs of big-data biology. With the new types of data traffic jam honking for attention, "we now have non-trivial engineering problems", says Birney.

[Photo: Andreas Sundquist says amounts of data are now larger than the tools used to analyse them. Credit: DNAnexus]

LIFE OF THE DATA-RICH

Storing and interpreting big data takes both real and virtual bricks and mortar. On the EBI campus, for example, construction is under way to house the technical command centre of ELIXIR, a project to help scientists across Europe safeguard and share their data, and to support existing resources such as databases and computing facilities in individual countries. Whereas CERN has one supercollider producing data in one location, biological research generating high volumes of data is distributed across many labs — highlighting the need to share resources.

Much of the construction in big-data biology is virtual, focused on cloud computing — in which data and software are situated in huge, off-site centres that users can access on demand, so that they do not need to buy their own hardware and maintain it on site. Labs that do have their own hardware can supplement it with the cloud and use both as needed. They can create virtual spaces for data, software and results that anyone can access, or they can lock the spaces up behind a firewall so that only a select group of collaborators can get to them.

Working with the CSC — IT Center for Science in Espoo, Finland, a government-run high-performance computing centre, the EBI is developing Embassy Cloud, a cloud-computing component for ELIXIR that offers secure data-analysis environments and is currently in its pilot phase. External organizations can, for example, run data-driven experiments in the EBI's computational environment, close to the data they need. They can also download data to compare with their own.

The idea is to broaden access to computing power, says Birney. A researcher in the Czech Republic, for example, might have an idea about how to reprocess cancer data to help the hunt for cancer drugs. If he or she lacks the computational equipment to develop it, he or she might not even try. But access to a high-powered cloud allows "ideas to come from any place", says Birney.

Even at the EBI, many scientists access databases and software tools on the Web and through clouds. "People rarely work on straight hardware anymore," says Birney. One heavily used resource is the Ensembl Genome Browser, run jointly by the EBI and the Wellcome Trust Sanger Institute in Hinxton. Life scientists use it to search through, download and analyse genomes from armadillo to zebrafish. The main Ensembl site is based on hardware in the United Kingdom, but when users in the United States and Japan had difficulty accessing the data quickly, the EBI resolved the bottleneck by hosting mirror sites at three of the many remote data centres that are part of Amazon Web Services' Elastic Compute Cloud (EC2). Amazon's data centres are geographically closer to the users than the EBI base, giving researchers quicker access to the information they need.

More clouds are coming. Together with CERN and ESA, the EBI is building a cloud-based infrastructure called Helix Nebula — The Science Cloud. Also involved are information-technology companies such as Atos in Bezons, France; CGI in Montreal, Canada; SixSq in Geneva; and T-Systems in Frankfurt, Germany.

Cloud computing is particularly attractive in an era of reduced research funding, says Hunter, because cloud users do not need to finance or maintain hardware. In addition to academic cloud projects, scientists can choose from many commercial providers, such as Rackspace, headquartered in San Antonio, Texas, or VMware in Palo Alto, California, as well as larger companies including Amazon, headquartered in Seattle, Washington, IBM in Armonk, New York, or Microsoft in Redmond, Washington.

[Figure: Data explosion — the amount of genetic sequencing data stored at the European Bioinformatics Institute takes less than a year to double in size. The chart shows terabases stored rising from near zero in 2004 to about 200 in 2012, with the annotation "Sequencers begin giving flurries of data". Source: EMBL–EBI]

BIG-DATA PARKING

Clouds are a solution, but they also throw up fresh challenges. Ironically, their proliferation can cause a bottleneck if data end up parked on several clouds and thus still need to be moved to be shared. And using clouds means entrusting valuable data to a distant service provider who may be subject to power outages or other disruptions. "I use cloud services for many things, but always keep a local copy of scientifically important data and software," says Hunter. Scientists experiment with different constellations to
[Figure: Head in the clouds — in cloud computing, large data sets are processed on remote Internet servers, rather than on researchers' local computers; storage and compute sit together on a cloud platform behind a firewall. Source: Aspera]

Most researchers tend to download remote data to local hardware for analysis. But this method is "backward", says Andreas Sundquist, chief technology officer of DNAnexus. "The data are so much larger than the tools, it makes no sense to be doing that." The alternative is to use the cloud for both data storage and computing. If the data are on a cloud, researchers can harness both the computing power and the tools that they need online, without the need to move data and software (see 'Head in the clouds').
Pennsylvania State University (PSU) in University Park, scientists can download Galaxy's platform of tools to their local hardware, or use it on the Galaxy cloud. They can then plug in their own data, perform analyses and save the steps in them, or try out workflows set up by their colleagues.

Spearheaded by Taylor and Anton Nekrutenko, a molecular biologist at PSU, the Galaxy project draws on a community of around 100 software developers. One feature is Tool Shed, a virtual area with more than 2,700 software tools that users can upload, try out and rate. Xu says that he likes the collection and its ratings, because without them, scientists must always check if a software tool actually runs before they can use it.

KNOWLEDGE IS POWER

Galaxy is a good fit for scientists with some computing know-how, says Alla Lapidus, a computational biologist in the algorithmic biology lab at St Petersburg Academic University of the Russian Academy of Sciences, which is led by Pavel Pevzner, a computer scientist at UCSD. But, she says, the platform might not be the best choice for less tech-savvy researchers. When Lapidus wanted to disseminate the software tools that she developed, she chose to put them on DNAnexus's newly launched second-generation commercial cloud-based analysis platform.

That platform is also designed to cater to non-specialist users, says Sundquist. It is possible for a computer scientist to build his or her own biological data-analysis suite with software tools on the Amazon cloud, but DNAnexus uses its own engineering to help researchers without the necessary computer skills to get to the analysis steps.

[Photo: Arend Sidow wants to move data mountains without feeling pinched by infrastructure.]

Catering for non-specialists is important when developing tools, as well as platforms. The Biomedical Information Science and Technology Initiative (BISTI) run by the US National Institutes of Health (NIH) in Bethesda, Maryland, supports development of new computational tools and the maintenance of existing ones. "We want a deployable tool," says Vivien Bonazzi, programme director in computational biology and bioinformatics at the National Human Genome Research Institute, who is involved with BISTI. Scientists who are not heavy-duty informatics types need to be able to set up these tools and use them successfully, she says. And it must be possible to scale up tools and update them as data volume grows.

Bonazzi says that although many life scientists have significant computational skills, others do not understand computer lingo enough to know that in the tech world, Python is not a snake and Perl is not a gem (they are programming languages). But even if biologists can't develop or adapt the software, says Bonazzi, they have a place in big-data science. Apart from anything else, they can offer insights that differ from those of their colleagues because of different needs and approaches to the science, she says.

Increasingly, big genomic data sets are being used in biotechnology companies, drug firms and medical centres, which also have specific needs. Robert Mulroy, president of Merrimack Pharmaceuticals in Cambridge, Massachusetts, says that his teams handle mountains of data that hide drug candidates. "Our view is that biology functions through systems dynamics," he says.

Merrimack researchers focus on interrogating molecular signalling networks in the healthy body and in tumours, hoping to find new ways to corner cancer cells. They generate and use large amounts of information from the genome and other factors that drive a cell to become cancerous, says Mulroy. The company stores its data and conducts analysis on its own computing infrastructure, rather than a cloud, to keep the data private and protected.

Drug developers have been hesitant about cloud computing. But, says Sundquist, that fear is subsiding in some quarters: some companies that have previously avoided clouds because of security problems are now exploring them. To assuage these users' concerns, Sundquist has engineered the DNAnexus cloud to be compliant with US and European regulatory guidelines. Its security features include encryption for biomedical information, and logs to allow users to address potential queries from auditors such as regulatory agencies, all of which is important in drug development.

CHALLENGES AND OPPORTUNITIES

Harnessing powerful computers and numerous tools for data analysis is crucial in drug discovery and other areas of big-data biology. But that is only part of the problem. Data and tools need to be more than close — they must talk to one another. Lapidus says that results produced by one tool are not always in a format that can be used by the next tool in a workflow. And if software tools are not easily installed, computer specialists will have to intervene on behalf of those biologists without computer skills.

Even computationally savvy researchers can get tangled up when wrestling with software and big data. "Many of us are getting so busy analysing huge data sets that we don't have time to do much else," says Steven Salzberg, a computational biologist at Johns Hopkins University in Baltimore, Maryland. "We have to spend some of our time figuring out ways to make the analysis faster, rather than just using the tools we have."

Yet other big-data pressures come from the need to engineer tools for stability and longevity. Too many software tools crash too often. "Everyone in the field runs into similar problems," says Hunter. In addition, research teams may not be able to acquire the resources they need, he says, especially in countries such as the United States, where an academic does not gain as much recognition for software engineering as for publishing a paper. With its dedicated focus on data and software infrastructure designed to serve scientists, the EBI offers an "interesting contrast to the US model", says Hunter.

US funding agencies are not entirely ignoring software engineering, however. In addition to BISTI, the NIH is developing Big Data to Knowledge (BD2K), an initiative focused on managing large data sets in biomedicine, with elements such as data handling and standards, informatics training and software sharing. And as the cloud emerges as a popular place to do research, the agency is also reviewing data-use policies. An approved study usually lays out specific data uses, which may not include placing genomic data on a cloud, says Bonazzi. When a person consents to have his or her data used in one way, researchers cannot suddenly change that use, she says. In a big-data age that uses the cloud in addition to local hardware, new technologies in encryption and secure data handling will be needed to address such concerns.

[Photo: Various data-transfer protocols handle problems in different ways, says Michelle Munson. Credit: Aspera]
Big data takes large numbers of people. BGI employs more than 600 engineers and software developers to manage its information-technology infrastructure, handle data and develop software tools and workflows. Scores of informaticians look for biologically relevant messages in the data, usually tailored to requests from researchers and commercial customers, says Xu. And apart from its stream of research collaborations, BGI offers a sequencing and analysis service to customers. Early last year, the institute expanded its offerings with a cloud-based genome-analysis platform called EasyGenomics.

In late 2012, it also bought the faltering US company Complete Genomics (CG), which offered human genome sequencing and analysis for customers in academia or drug discovery. Although the sale dashed hopes for earnings among CG's investors, it doesn't seem to have dimmed their view of the prospects for sequencing and analysis services. "It is now just a matter of time before sequencing data are used with regularity in clinical practice," says one investor, who did not wish to be identified. But the sale shows how difficult it can be to transition ideas into a competitive marketplace, the investor says.

[Image: A simplified array of breast-cancer subtypes, produced by researchers at Merrimack Pharmaceuticals, who use their own computational infrastructure to hunt for new cancer drugs. Credit: Merrimack Pharmaceuticals]

When tackling data mountains, BGI uses not only its own data-analysis tools, but also some developed in the academic community. To ramp up analysis speed and capacity as data sets grow, BGI assembled a cloud-based series of analysis steps into a workflow called Gaea, which uses the Hadoop open-source software framework. Hadoop was written by volunteer developers from companies and universities, and can be deployed on various types of computing infrastructure. BGI programmers built on this framework to instruct software tools to perform large-scale data analysis across many computers at the same time.

If 50 genomes are to be analysed and the results compared, hundreds of computational steps are involved. The steps can run either sequentially or in parallel; with Gaea, they run in parallel across hundreds of cloud-based computers, reducing analysis time rather like many people working on a single large puzzle at once. The data are on the BGI cloud, as are the tools. "If you perform analysis in a non-parallel way, you will maybe need two weeks to fully process those data," says Xu. Gaea takes around 15 hours for the same amount of data.

To leverage Hadoop's muscle, Xu and his team needed to rewrite software tools. But the investment is worth it because the Hadoop framework allows analysis to continue as the data mountains grow, he says.

They are still ironing out some issues with Gaea, comparing its performance on the cloud with its performance on local infrastructure. Once testing is complete, BGI plans to mount Gaea on a cloud such as Amazon for use by the wider scientific community.

Other groups are also trying to speed up analysis to cater to scientists who want to use big data. For example, Bina Technologies in Redwood City, California, a spin-out from Stanford University and the University of California, Berkeley, has developed high-performance computing components for its genome-analysis services. Customers can buy the hardware, called the Bina Box, with software, or use Bina's analysis platform on the cloud.

FROM VIVO TO SILICO

Data mountains and analysis are altering the way science progresses, and breeding biologists who get neither their feet nor their hands wet. "I am one of a small original group who made the first leap from the wet world to the in silico world to do biology," says Marcie McClure, a computational biologist at Montana State University in Bozeman. "I never looked back."

During her graduate training, McClure analysed a class of viruses known as retroviruses in fish, doing the work of a "wet-worlder", as she calls it. Since then, she and her team have discovered 11 fish retroviruses without touching water in lake or lab, by analysing genomes computationally and in ways that others had not. She has also developed software tools to find such viruses in the genomes of other species, including humans. Her work generates terabytes of data, which she shares with other researchers.

Given that big-data analysis in biology is incredibly difficult, Hunter says, open science is becoming increasingly important. As he explains, researchers need to make their data available to the scientific community in a useful form, for others to mine. New science can emerge from the analysis of existing data sets: McClure generates some of her findings from other people's data. But not everyone recognizes that kind of biology as an equal. "The cultural baggage of biology that privileges data generation over all other forms of science is holding us back," says Hunter.

A number of McClure's graduate students are microbial ecologists, and she teaches them how to rethink their findings in the face of so many new data. "Before taking my class, none of these students would have imagined that they could produce new, meaningful knowledge, and new hypotheses, from existing data, not their own," she says. Big data in biology add to the possibilities for scientists, she says, because data sit "under-analysed in databases all over the world". ■

Vivien Marx is technology editor at Nature and Nature Methods.

1. Mattmann, C. Nature 493, 473–475 (2013).
2. Greene, C. S. & Troyanskaya, O. G. PLoS Comput. Biol. 8, e1002816 (2012).
3. EMBL–European Bioinformatics Institute. EMBL-EBI Annual Scientific Report 2012 (EMBL–EBI, 2013).
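The sequential-versus-parallel pattern that Xu describes for Gaea can be sketched in miniature. This is a hypothetical illustration of the idea only — the function and genome names are invented, and BGI's actual workflow distributes Hadoop jobs across cloud machines rather than Python processes on one computer:

```python
# Toy sketch of fan-out analysis: the same per-genome step run
# sequentially, then in parallel across worker processes.
from multiprocessing import Pool

def analyse_genome(genome_id: str) -> tuple:
    # Stand-in for a real pipeline step such as alignment or variant calling.
    score = sum(ord(c) for c in genome_id) % 97  # dummy "result"
    return genome_id, score

GENOMES = [f"genome_{i:02d}" for i in range(50)]  # 50 genomes, as in the article

if __name__ == "__main__":
    # Sequential: each step waits for the previous one to finish.
    sequential = [analyse_genome(g) for g in GENOMES]

    # Parallel: the same steps fan out across workers, the way Gaea
    # fans analysis steps out across hundreds of cloud-based computers.
    with Pool(processes=8) as pool:
        parallel = pool.map(analyse_genome, GENOMES)

    # Same answers either way; on real workloads the parallel run
    # finishes in a fraction of the wall-clock time.
    assert parallel == sequential
```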