April 13, 2012

Wrangling Big Data to Fight Pediatric Cancer
Anatol Blass, Ph.D.
High-performance computing and the cloud are vastly improving scientists’ ability to simulate and analyze data, and
genetic sequencing and research are accessible to more scientists, researchers and medical professionals than ever before.
But a new bottleneck has emerged: we are drowning in data. The trick is how to
efficiently manage the volume and complexity of that data while making it secure yet
accessible to many.
In order to address the big data bottleneck, Dell is building a unique cloud
environment for a pediatric cancer trial in conjunction with the Translational Genomics
Research Institute (TGen) and the Neuroblastoma and Medulloblastoma Translational
Research Consortium (NMTRC).
The collaborative effort is creating a model for how to use HPC and cloud computing
to simplify information access and sharing and bridge the information gaps between
science and medicine. Through the trial, scientists and oncologists are identifying
targeted and personalized treatments for children fighting neuroblastoma.
The cloud will provide the additional computing capacity to support the “real time”
processing of patient tumors and prediction of the best drug therapy for a specific
patient, based on the genetic makeup of that child’s tumor.
This clinical trial involves dozens of scientific and medical partners across the country.
Providing information technology to analyze laboratory results and to support
collaboration across a secure network of clinical sites is crucial to creating a knowledge base that supports clinical decision-making.
Because TGen’s research is so cutting edge, scientists and doctors require flexibility to follow their research as it evolves. The
effort involves studying tumor samples from patients, getting the genomic sequencing data from lab instruments, analyzing that
data, reporting the findings to a tumor board and using the results to make decisions about the best treatment for the patient.
One of the chief challenges was that the newest lab instruments generate raw data at a rate that is growing faster than
Moore’s Law would predict. The quantity of data being produced from a single instrument is doubling about every 12
months, while over the same period the cost to analyze it is falling by half.
The end result is that the total amount of genomic data being generated is doubling nearly every six months. Moreover, the data
objects produced are complex files with important metadata properties about the samples they came from and the instruments that
produced them. And the files can be extremely large, up to 3TB depending on the instrument. The data associated with a particular
patient is currently about 200TB and growing. Because this is an active area of research, the data must be kept available to
validate and compare analysis algorithms.
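The six-month figure follows from compounding the two trends above. The sketch below is only illustrative arithmetic: it assumes that halving the analysis cost lets a fixed budget process twice as much data each year, which is an interpretation and not something measured in the trial. The function and variable names are hypothetical.

```python
import math

def doubling_time_months(annual_growth_factor: float) -> float:
    """Months needed for a quantity to double at a given yearly growth factor."""
    return 12 * math.log(2) / math.log(annual_growth_factor)

# From the article: output per instrument doubles about every 12 months.
per_instrument_growth = 2.0

# Assumption (not from the article): cost halving every 12 months means a
# fixed budget supports twice as much sequencing and analysis each year.
capacity_growth_from_cost = 2.0

# The two effects multiply, giving roughly 4x total data per year.
total_growth = per_instrument_growth * capacity_growth_from_cost

# Prints a value of about 6 (months) under these assumptions.
print(doubling_time_months(total_growth))
```

Under these assumptions, a fourfold yearly increase corresponds to a doubling time of about six months, which matches the figure quoted above.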
Additionally, for this clinical trial there are 11 participating sites both generating and analyzing data. A hybrid approach was
required to manage the data coming from the instruments and to keep large amounts of data accessible to all of the sites to
facilitate collaboration in a secure, cost-effective manner. It was also important to localize data near HPC capacity both in the cloud
and on premises to speed analysis and validation.
The cloud became the medium of exchange for data as well as analysis capabilities, allowing researchers to share their raw
information as well as algorithms for analyzing that data. As a result, TGen and its collaborators can quickly turn data into
knowledge, knowledge into diagnostics, diagnostics into therapies, and therapies into better quality of life for patients.
A colleague of mine coined the phrase “cloud-to-ground” to describe the architecture built to address these issues: an
environment that could manage data and not just archive it. We needed to create a virtual library of data that could be accessed by
researchers and allow data to be checked out and analyzed using HPC capabilities.
We are using Dell’s innovative technology to enable fluid integration between premise-based capabilities (the ground) and virtual
capabilities (the cloud). This provides the framework to move the data fluidly through the research lifecycle, protect it, and make it
available for future use. Data can be ingested at various sites, moved to the cloud and then made available for analysis either in
premise-based HPC environments or any HPC cloud environment.
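As a rough illustration of the “virtual library” idea, the sketch below models a catalog that records where copies of each data object live and, on check-out, prefers a copy already located at the compute site. It is not Dell’s or TGen’s implementation; every class, method and identifier here (DataObject, Catalog, check_out, the site names) is hypothetical and chosen only to make the flow concrete.

```python
from dataclasses import dataclass, field

@dataclass
class DataObject:
    """One sequencing file plus the metadata that must travel with it."""
    object_id: str
    sample_id: str      # which sample the data came from
    instrument: str     # which instrument produced it
    size_tb: float
    locations: set = field(default_factory=set)  # sites or clouds holding a copy

class Catalog:
    """Hypothetical 'virtual library' that tracks where each object lives."""

    def __init__(self):
        self._objects = {}

    def ingest(self, obj: DataObject, site: str) -> None:
        """Register an object first captured at a participating site."""
        obj.locations.add(site)
        self._objects[obj.object_id] = obj

    def replicate(self, object_id: str, target: str) -> None:
        """Record that a copy has been moved to another location."""
        self._objects[object_id].locations.add(target)

    def check_out(self, object_id: str, compute_site: str) -> str:
        """Return a location to read from, preferring one local to the compute site."""
        obj = self._objects[object_id]
        if compute_site in obj.locations:
            return compute_site
        return sorted(obj.locations)[0]  # deterministic fallback for the sketch

catalog = Catalog()
catalog.ingest(DataObject("run-001", "sample-A", "sequencer-1", 3.0), site="site-1")
catalog.replicate("run-001", "cloud")
print(catalog.check_out("run-001", "cloud"))   # a copy is already in the cloud
print(catalog.check_out("run-001", "site-2"))  # no local copy, falls back to another location
```

A real system would also need access control, integrity checks and actual data transfer; the sketch only shows how tracking locality lets analysis run next to the data.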
The unique challenges of personalized medicine require us to address data volume, complexity and locality, as well as
collaboration. By creating integrated hybrid cloud environments, we can harness the power of Big Data and unleash the potential of
personalized medicine.