
TECHNOLOGY FEATURE

THE BIG CHALLENGES OF BIG DATA

As they grapple with increasingly large data sets, biologists and computer scientists uncork new bottlenecks.

[Image: Extremely powerful computers are needed to help biologists to handle big-data traffic jams. Credit: EMBL–EBI]

BY VIVIEN MARX

Biologists are joining the big-data club. With the advent of high-throughput genomics, life scientists are starting to grapple with massive data sets, encountering challenges with handling, processing and moving information that were once the domain of astronomers and high-energy physicists1.

With every passing year, they turn more often to big data to probe everything from the regulation of genes and the evolution of genomes to why coastal algae bloom, what microbes dwell where in human body cavities and how the genetic make-up of different cancers influences how cancer patients fare2. The European Bioinformatics Institute (EBI) in Hinxton, UK, part of the European Molecular Biology Laboratory and one of the world's largest biology-data repositories, currently stores 20 petabytes (1 petabyte is 10^15 bytes) of data and back-ups about genes, proteins and small molecules. Genomic data account for 2 petabytes of that, a number that more than doubles every year3 (see 'Data explosion').

This data pile is just one-tenth the size of the data store at CERN, Europe's particle-physics laboratory near Geneva, Switzerland. Every year, particle-collision events in CERN's Large Hadron Collider generate around 15 petabytes of data — the equivalent of about 4 million high-definition feature-length films. But the EBI and institutes like it face similar data-wrangling challenges to those at CERN, says Ewan Birney, associate director of the EBI. He and his colleagues now regularly meet with organizations such as CERN and the European Space Agency (ESA) in Paris to swap lessons about data storage, analysis and sharing.
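Annual doubling of the kind the EBI reports compounds quickly. A minimal sketch of the arithmetic (the growth function is generic; the 2-petabyte starting point is the figure quoted above):

```python
def projected_size(start_size, years, doubling_time_years=1.0):
    """Size after `years` of exponential growth with a fixed doubling time."""
    return start_size * 2 ** (years / doubling_time_years)

# 2 petabytes of genomic data, doubling every year:
print(projected_size(2, 5))   # petabytes after five years
print(projected_size(2, 10))  # petabytes after ten years
```

At that rate, genomic data alone would exceed the EBI's entire current 20-petabyte holding within four years.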

1 3 J U N E 2 0 1 3 | VO L 4 9 8 | NAT U R E | 2 5 5
© 2013 Macmillan Publishers Limited. All rights reserved
All labs need to manipulate data to yield research answers. As prices drop for high-throughput instruments such as automated genome sequencers, small biology labs can become big-data generators. And even labs without such instruments can become big-data users by accessing terabytes (10^12 bytes) of data from public repositories at the EBI or the US National Center for Biotechnology Information in Bethesda, Maryland. Each day last year, the EBI received about 9 million online requests to query its data, a 60% increase over 2011.

Biology data mining has challenges all of its own, says Birney. Biological data are much more heterogeneous than those in physics. They stem from a wide range of experiments that spit out many types of information, such as genetic sequences, interactions of proteins or findings in medical records. The complexity is daunting, says Lawrence Hunter, a computational biologist at the University of Colorado Denver. "Getting the most from the data requires interpreting them in light of all the relevant prior knowledge," he says.

That means scientists have to store large data sets, and analyse, compare and share them — not simple tasks. Even a single sequenced human genome is around 140 gigabytes in size. Comparing human genomes takes more than a personal computer and online file-sharing applications such as Dropbox.

In an ongoing study, Arend Sidow, a computational biologist at Stanford University in California, and his team are looking at specific changes in the genome sequences of tumours from people with breast cancer. They wanted to compare their data with the thousands of other published breast-cancer genomes and look for similar patterns in the scores of different cancer types. But that is a tall order: downloading the data is time-consuming, and researchers must be sure that their computational infrastructure and software tools are up to the task. "If I could, I would routinely look at all sequenced cancer genomes," says Sidow. "With the current infrastructure, that's impossible."

In 2009, Sidow co-founded a company called DNAnexus in Mountain View, California, to help with large-scale genetic analyses. Numerous other commercial and academic efforts also address the infrastructure needs of big-data biology. With the new types of data traffic jam honking for attention, "we now have non-trivial engineering problems", says Birney.

[Image: Andreas Sundquist says amounts of data are now larger than the tools used to analyse them. Credit: DNAnexus]

LIFE OF THE DATA-RICH

Storing and interpreting big data takes both real and virtual bricks and mortar. On the EBI campus, for example, construction is under way to house the technical command centre of ELIXIR, a project to help scientists across Europe safeguard and share their data, and to support existing resources such as databases and computing facilities in individual countries. Whereas CERN has one supercollider producing data in one location, biological research generating high volumes of data is distributed across many labs — highlighting the need to share resources.

Much of the construction in big-data biology is virtual, focused on cloud computing — in which data and software are situated in huge, off-site centres that users can access on demand, so that they do not need to buy their own hardware and maintain it on site. Labs that do have their own hardware can supplement it with the cloud and use both as needed. They can create virtual spaces for data, software and results that anyone can access, or they can lock the spaces up behind a firewall so that only a select group of collaborators can get to them.

Working with the CSC — IT Center for Science in Espoo, Finland, a government-run high-performance computing centre, the EBI is developing Embassy Cloud, a cloud-computing component for ELIXIR that offers secure data-analysis environments and is currently in its pilot phase. External organizations can, for example, run data-driven experiments in the EBI's computational environment, close to the data they need. They can also download data to compare with their own.

The idea is to broaden access to computing power, says Birney. A researcher in the Czech Republic, for example, might have an idea about how to reprocess cancer data to help the hunt for cancer drugs. If he or she lacks the computational equipment to develop it, he or she might not even try. But access to a high-powered cloud allows "ideas to come from any place", says Birney.

Even at the EBI, many scientists access databases and software tools on the Web and through clouds. "People rarely work on straight hardware anymore," says Birney. One heavily used resource is the Ensembl Genome Browser, run jointly by the EBI and the Wellcome Trust Sanger Institute in Hinxton. Life scientists use it to search through, download and analyse genomes from armadillo to zebrafish. The main Ensembl site is based on hardware in the United Kingdom, but when users in the United States and Japan had difficulty accessing the data quickly, the EBI resolved the bottleneck by hosting mirror sites at three of the many remote data centres that are part of Amazon Web Services' Elastic Compute Cloud (EC2). Amazon's data centres are geographically closer to the users than the EBI base, giving researchers quicker access to the information they need.

More clouds are coming. Together with CERN and ESA, the EBI is building a cloud-based infrastructure called Helix Nebula — The Science Cloud. Also involved are information-technology companies such as Atos in Bezons, France; CGI in Montreal, Canada; SixSq in Geneva; and T-Systems in Frankfurt, Germany.

Cloud computing is particularly attractive in an era of reduced research funding, says Hunter, because cloud users do not need to finance or maintain hardware. In addition to academic cloud projects, scientists can choose from many commercial providers, such as Rackspace, headquartered in San Antonio, Texas, or VMware in Palo Alto, California, as well as larger companies including Amazon, headquartered in Seattle, Washington; IBM in Armonk, New York; and Microsoft in Redmond, Washington.

DATA EXPLOSION
The amount of genetic sequencing data stored at the European Bioinformatics Institute takes less than a year to double in size. [Chart: terabases stored, 2004–2012, rising from near zero to about 200; annotation: 'Sequencers begin giving flurries of data'. Source: EMBL–EBI]

BIG-DATA PARKING

Clouds are a solution, but they also throw up fresh challenges. Ironically, their proliferation can cause a bottleneck if data end up parked on several clouds and thus still need to be moved to be shared. And using clouds means entrusting valuable data to a distant service provider who may be subject to power outages or other disruptions. "I use cloud services for many things, but always keep a local copy of scientifically important data and software," says Hunter.

Scientists experiment with different constellations to suit their needs and trust levels.

Most researchers tend to download remote data to local hardware for analysis. But this method is "backward", says Andreas Sundquist, chief technology officer of DNAnexus. "The data are so much larger than the tools, it makes no sense to be doing that." The alternative is to use the cloud for both data storage and computing. If the data are on a cloud, researchers can harness both the computing power and the tools that they need online, without the need to move data and software (see 'Head in the clouds'). "There's no reason to move data outside the cloud. You can do analysis right there," says Sundquist. Everything required is available "to the clever people with the clever ideas", regardless of their local computing resources, says Birney.

HEAD IN THE CLOUDS
In cloud computing, large data sets are processed on remote Internet servers, rather than on researchers' local computers. [Diagram: large data files move between storage, computers and a cloud platform behind a firewall. Source: Aspera]

Various academic and commercial ventures are engineering ways to bring data and analysis tools together — and as they build, they have to address the continued data growth. Xing Xu, director of cloud computing at BGI (formerly the Beijing Genomics Institute) in Shenzhen, China, knows that challenge well. BGI is one of the largest producers of genomic data in the world, with 157 genome-sequencing instruments working around the clock on samples from people, plants, animals and microbes. Each day, it generates 6 terabytes of genomic data. Every instrument can decode one human genome per week, an effort that used to take months or years and many staff.

DATA HIGHWAY

Once a genome sequencer has cranked out its snippets of genomic information, or 'reads', they must be assembled into a continuous stretch of DNA using computing and software. Xu and his team try to automate as much of this process as possible to enable scientists to get to analyses quickly.

Next, either the reads or the analysis, or both, have to travel to scientists. Generally, researchers share biological data with their peers through public repositories, such as the EBI or ones run by the US National Center for Biotechnology Information in Bethesda, Maryland. Given the size of the data, this travel often means physically delivering hard drives — and risks data getting lost, stolen or damaged. Instead, BGI wants to use either its own clouds or others of the customer's choosing for electronic delivery. But that presents a problem, because big-data travel often means big traffic jams.

Currently, BGI can transfer about 1 terabyte per day to its customers. "If you transfer one genome at a time, it's OK," says Xu. "If you sequence 50, it's not so practical for us to transfer that through the Internet. That takes about 20 days."

BGI is exploring a variety of technologies to accelerate electronic data transfer, among them fasp, software developed by Aspera in Emeryville, California, which helps to deliver data for film-production studios and the oil and gas industry as well as the life sciences. In an experiment last year, BGI tested a fasp-enabled data transfer between China and the University of California, San Diego (UCSD). It took 30 seconds to move a 24-gigabyte file. "That's really fast," says Xu.

Data transfer with fasp is hundreds of times quicker than methods using the normal Internet protocol, says software engineer Michelle Munson, chief executive and co-founder of Aspera. However, all transfer protocols share challenges associated with transferring large, unstructured data sets.

The test transfer between BGI and UCSD was encouraging because Internet connections between China and the United States are "riddled with challenges" such as variations in signal strength that interrupt data transfer, says Munson. The protocol has to handle such road bumps and ensure speedy transfer, data integrity and privacy. Data transfer often slows when the passage is bumpy, but with fasp it does not. Transfers can fail when a file is partially sent; with ordinary Internet connections, this relaunches the entire transfer. By contrast, fasp restarts where the previous transfer stopped. Data that are already on their way do not get resent, but continue on their travels.

Xu says that he liked the experiment with fasp, but the software does not solve the data-transfer problem. "The main problem is not technical, it is economical," he says. BGI would need to maintain a large Internet-connection bandwidth for data transfer, which would be prohibitively expensive, especially given that Xu and his team do not send out big data in a continuous flow. "If we only transfer periodically, it doesn't make any economic sense for us to have this infrastructure, especially if the user wants that for free," he says.

Data-sharing among many collaborators also remains a challenge. When BGI uses fasp to share data with customers or collaborators, it must have a software licence, which allows customers to download or upload the data for free. But customers who want to share data with each other using this transfer protocol will need their own software licences. Putting the data on the cloud and not moving them would bypass this problem; teams would go to the large data sets, rather than the other way around. Xu and his team are exploring this approach, alongside the use of Globus Online, a free Web-based file-transfer service from the Computation Institute at the University of Chicago and the Argonne National Laboratory in Illinois. In April, the Computation Institute team launched a genome-sequencing-analysis service called Globus Genomics on the Amazon cloud.

Munson says that Aspera has set up a pay-as-you-go system on the Amazon cloud to address the issue of data-sharing. Later this year, the company will begin selling an updated version of its software that can be embedded on the desktop of any kind of computer and will let users browse large data sets much like a file-sharing application. Files can be dragged and dropped from one location to another, even if those locations are commercial or academic clouds.

The cost of producing, acquiring and disseminating data is decreasing, says James Taylor, a computational biologist at Emory University in Atlanta, Georgia, who thinks that "everyone should have access to the skills and tools" needed to make sense of all the information. Taylor is a co-founder of an academic platform called Galaxy, which lets scientists analyse their data and share software tools and workflows for free.

Through Web-based access to computing facilities at Pennsylvania State University (PSU) in University Park, scientists can download Galaxy's platform of tools to their local hardware, or use it on the Galaxy cloud. They can then plug in their own data, perform analyses and save the steps in them, or try out workflows set up by their colleagues.

Spearheaded by Taylor and Anton Nekrutenko, a molecular biologist at PSU, the Galaxy project draws on a community of around 100 software developers. One feature is Tool Shed, a virtual area with more than 2,700 software tools that users can upload, try out and rate. Xu says that he likes the collection and its ratings because, without them, scientists must always check whether a software tool actually runs before they can use it.

KNOWLEDGE IS POWER

Galaxy is a good fit for scientists with some computing know-how, says Alla Lapidus, a computational biologist in the algorithmic biology lab at St Petersburg Academic University of the Russian Academy of Sciences, which is led by Pavel Pevzner, a computer scientist at UCSD. But, she says, the platform might not be the best choice for less tech-savvy researchers. When Lapidus wanted to disseminate the software tools that she developed, she chose to put them on DNAnexus's newly launched second-generation commercial cloud-based analysis platform.

That platform is also designed to cater to non-specialist users, says Sundquist. It is possible for a computer scientist to build his or her own biological data-analysis suite with software tools on the Amazon cloud, but DNAnexus uses its own engineering to help researchers without the necessary computer skills to get to the analysis steps.

Catering for non-specialists is important when developing tools, as well as platforms. The Biomedical Information Science and Technology Initiative (BISTI), run by the US National Institutes of Health (NIH) in Bethesda, Maryland, supports development of new computational tools and the maintenance of existing ones. "We want a deployable tool," says Vivien Bonazzi, programme director in computational biology and bioinformatics at the National Human Genome Research Institute, who is involved with BISTI. Scientists who are not heavy-duty informatics types need to be able to set up these tools and use them successfully, she says. And it must be possible to scale up tools and update them as data volume grows.

Bonazzi says that although many life scientists have significant computational skills, others do not understand computer lingo enough to know that in the tech world, Python is not a snake and Perl is not a gem (they are programming languages). But even if biologists can't develop or adapt the software, says Bonazzi, they have a place in big-data science. Apart from anything else, they can offer valuable feedback to their computationally fluent colleagues because of different needs and approaches to the science, she says.

Increasingly, big genomic data sets are being used in biotechnology companies, drug firms and medical centres, which also have specific needs. Robert Mulroy, president of Merrimack Pharmaceuticals in Cambridge, Massachusetts, says that his teams handle mountains of data that hide drug candidates. "Our view is that biology functions through systems dynamics," he says.

Merrimack researchers focus on interrogating molecular signalling networks in the healthy body and in tumours, hoping to find new ways to corner cancer cells. They generate and use large amounts of information from the genome and other factors that drive a cell to become cancerous, says Mulroy. The company stores its data and conducts analysis on its own computing infrastructure, rather than a cloud, to keep the data private and protected.

Drug developers have been hesitant about cloud computing. But, says Sundquist, that fear is subsiding in some quarters: some companies that have previously avoided clouds because of security problems are now exploring them. To assuage these users' concerns, Sundquist has engineered the DNAnexus cloud to be compliant with US and European regulatory guidelines. Its security features include encryption for biomedical information, and logs that allow users to address potential queries from auditors such as regulatory agencies, all of which is important in drug development.

[Image: Arend Sidow wants to move data mountains without feeling pinched by infrastructure. Credit: Stanford Univ./DNAnexus]

CHALLENGES AND OPPORTUNITIES

Harnessing powerful computers and numerous tools for data analysis is crucial in drug discovery and other areas of big-data biology. But that is only part of the problem. Data and tools need to be more than close — they must talk to one another. Lapidus says that results produced by one tool are not always in a format that can be used by the next tool in a workflow. And if software tools are not easily installed, computer specialists will have to intervene on behalf of those biologists without computer skills.

Even computationally savvy researchers can get tangled up when wrestling with software and big data. "Many of us are getting so busy analysing huge data sets that we don't have time to do much else," says Steven Salzberg, a computational biologist at Johns Hopkins University in Baltimore, Maryland. "We have to spend some of our time figuring out ways to make the analysis faster, rather than just using the tools we have."

Yet other big-data pressures come from the need to engineer tools for stability and longevity. Too many software tools crash too often. "Everyone in the field runs into similar problems," says Hunter. In addition, research teams may not be able to acquire the resources they need, he says, especially in countries such as the United States, where an academic does not gain as much recognition for software engineering as for publishing a paper. With its dedicated focus on data and software infrastructure designed to serve scientists, the EBI offers an "interesting contrast to the US model", says Hunter.

US funding agencies are not entirely ignoring software engineering, however. In addition to BISTI, the NIH is developing Big Data to Knowledge (BD2K), an initiative focused on managing large data sets in biomedicine, with elements such as data handling and standards, informatics training and software sharing. And as the cloud emerges as a popular place to do research, the agency is also reviewing data-use policies. An approved study usually lays out specific data uses, which may not include placing genomic data on a cloud, says Bonazzi. When a person consents to have his or her data used in one way, researchers cannot suddenly change that use, she says.

[Image: Various data-transfer protocols handle problems in different ways, says Michelle Munson. Credit: Aspera]
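The mismatch Lapidus describes, with one tool's output failing to fit the next tool's input, is often bridged with small adapter scripts. A sketch, assuming a hypothetical upstream tool that emits a simple tab-separated variant table and a downstream tool that wants minimal VCF-style body rows (the column names and coordinates are illustrative):

```python
import csv
import io

def tsv_variants_to_vcf_rows(tsv_text):
    """Re-shape a tab-separated table with chrom/pos/ref/alt columns into
    the eight-column VCF body layout (CHROM POS ID REF ALT QUAL FILTER INFO),
    filling fields the upstream tool never produced with '.' placeholders."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for record in reader:
        rows.append("\t".join([
            record["chrom"], record["pos"], ".",   # CHROM, POS, ID
            record["ref"], record["alt"],          # REF, ALT
            ".", "PASS", ".",                      # QUAL, FILTER, INFO
        ]))
    return rows

upstream_output = "chrom\tpos\tref\talt\nchr17\t41245466\tG\tA\n"
print(tsv_variants_to_vcf_rows(upstream_output))
```

Glue like this is exactly the work that falls to computer specialists when tools are not designed to interoperate in the first place.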

In a big-data age that uses the cloud in addition to local hardware, new technologies in encryption and secure transmission will need to address such privacy concerns.

Big data takes large numbers of people. BGI employs more than 600 engineers and software developers to manage its information-technology infrastructure, handle data and develop software tools and workflows. Scores of informaticians look for biologically relevant messages in the data, usually tailored to requests from researchers and commercial customers, says Xu. And apart from its stream of research collaborations, BGI offers a sequencing and analysis service to customers. Early last year, the institute expanded its offerings with a cloud-based genome-analysis platform called EasyGenomics.

In late 2012, it also bought the faltering US company Complete Genomics (CG), which offered human genome sequencing and analysis for customers in academia or drug discovery. Although the sale dashed hopes for earnings among CG's investors, it doesn't seem to have dimmed their view of the prospects for sequencing and analysis services. "It is now just a matter of time before sequencing data are used with regularity in clinical practice," says one investor, who did not wish to be identified. But the sale shows how difficult it can be to transition ideas into a competitive marketplace, the investor says.

[Image: A simplified array of breast-cancer subtypes, produced by researchers at Merrimack Pharmaceuticals, who use their own computational infrastructure to hunt for new cancer drugs. Credit: Merrimack Pharmaceuticals]

When tackling data mountains, BGI uses not only its own data-analysis tools, but also some developed in the academic community. To ramp up analysis speed and capacity as data sets grow, BGI assembled a cloud-based series of analysis steps into a workflow called Gaea, which uses the Hadoop open-source software framework. Hadoop was written by volunteer developers from companies and universities, and can be deployed on various types of computing infrastructure. BGI programmers built on this framework to instruct software tools to perform large-scale data analysis across many computers at the same time.

If 50 genomes are to be analysed and the results compared, hundreds of computational steps are involved. The steps can run either sequentially or in parallel; with Gaea, they run in parallel across hundreds of cloud-based computers, reducing analysis time rather like many people working on a single large puzzle at once. The data are on the BGI cloud, as are the tools. "If you perform analysis in a non-parallel way, you will maybe need two weeks to fully process those data," says Xu. Gaea takes around 15 hours for the same amount of data.

To leverage Hadoop's muscle, Xu and his team needed to rewrite software tools. But the investment is worth it, because the Hadoop framework allows analysis to continue as the data mountains grow, he says. They are still ironing out some issues with Gaea, comparing its performance on the cloud with its performance on local infrastructure. Once testing is complete, BGI plans to mount Gaea on a cloud such as Amazon for use by the wider scientific community.

Other groups are also trying to speed up analysis to cater to scientists who want to use big data. For example, Bina Technologies in Redwood City, California, a spin-out from Stanford University and the University of California, Berkeley, has developed high-performance computing components for its genome-analysis services. Customers can buy the hardware, called the Bina Box, with software, or use Bina's analysis platform on the cloud.

FROM VIVO TO SILICO

Data mountains and analysis are altering the way science progresses, and breeding biologists who get neither their feet nor their hands wet. "I am one of a small original group who made the first leap from the wet world to the in silico world to do biology," says Marcie McClure, a computational biologist at Montana State University in Bozeman. "I never looked back."

During her graduate training, McClure analysed a class of viruses known as retroviruses in fish, doing the work of a "wet-worlder", as she calls it. Since then, she and her team have discovered 11 fish retroviruses without touching water in lake or lab, by analysing genomes computationally and in ways that others had not. She has also developed software tools to find such viruses in the genomes of other species, including humans. Her work generates terabytes of data, which she shares with other researchers.

Given that big-data analysis in biology is incredibly difficult, Hunter says, open science is becoming increasingly important. As he explains, researchers need to make their data available to the scientific community in a useful form, for others to mine. New science can emerge from the analysis of existing data sets: McClure generates some of her findings from other people's data. But not everyone recognizes that kind of biology as an equal. "The cultural baggage of biology that privileges data generation over all other forms of science is holding us back," says Hunter.

A number of McClure's graduate students are microbial ecologists, and she teaches them how to rethink their findings in the face of so many new data. "Before taking my class, none of these students would have imagined that they could produce new, meaningful knowledge, and new hypotheses, from existing data, not their own," she says. Big data in biology add to the possibilities for scientists, she says, because data sit "under-analysed in databases all over the world". ■

Vivien Marx is technology editor at Nature and Nature Methods.

1. Mattmann, C. Nature 493, 473–475 (2013).
2. Greene, C. S. & Troyanskaya, O. G. PLoS Comput. Biol. 8, e1002816 (2012).
3. EMBL–European Bioinformatics Institute. EMBL-EBI Annual Scientific Report 2012 (EMBL–EBI, 2013).
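For readers who want the shape of the Gaea-style fan-out in code: the speed-up comes from per-genome steps being independent, so they can run side by side and the results gathered for the cross-genome comparison. A toy sketch, with threads standing in for Hadoop tasks and illustrative function names:

```python
from concurrent.futures import ThreadPoolExecutor

def analyse_genome(genome_id):
    """Stand-in for one independent per-genome pipeline
    (alignment, variant calling and so on)."""
    return genome_id, f"variants-for-{genome_id}"

def analyse_all(genome_ids, workers=8):
    """Run the independent per-genome steps concurrently, then gather
    the results for comparison — many hands on one large puzzle."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(analyse_genome, genome_ids))

results = analyse_all([f"genome-{i:02d}" for i in range(50)])
```

When the steps really are independent, wall-clock time falls roughly in proportion to the number of workers, which is the effect behind the two-weeks-versus-15-hours comparison Xu draws.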

