
Unleash your inner (data) scientist:
The ability and audacity to scale your science with extensible cyberinfrastructure

Nirav Merchant
The University of Arizona &
iPlant Collaborative
nirav@email.arizona.edu

Topic Coverage
The Big Data and Data Scientist wave
What is cyberinfrastructure (CI)
Delivering a pragmatic CI ecosystem
What has the community built with our CI
Lifecycle of research and innovation
Continuing education and learning with CI
Future thoughts and challenges

Science Paradigms
1. Thousand years ago: science was empirical
describing natural phenomena, observations
2. Last few hundred years: theoretical branch
using models, generalizations
3. Last few decades: a computational branch
simulating complex phenomena
4. Today: data exploration (eScience)
unify theory, experiment, and simulation
Based on the transcript of a talk given by the late Jim Gray
to the National Research Council Computer Science and Telecommunications Board in Mountain View, CA,
on January 11, 2007

The Fourth Paradigm: Data-Intensive Scientific Discovery
Increasingly, scientific breakthroughs will be powered by advanced
computing capabilities that help researchers manipulate and explore
massive datasets.
The speed at which any given scientific discipline advances will depend
on how well its researchers collaborate with one another, and with
technologists, in areas of eScience such as databases, workflow
management, visualization, and cloud computing technologies.

http://research.microsoft.com/en-us/collaboration/fourthparadigm/

The Discovery Lifecycle

The Fourth Paradigm: Data-Intensive Scientific Discovery



Evolution of X-Info
The evolution of X-Info and Comp-X for each discipline X
(e.g. bioinformatics, computational biology)
How to codify and represent our knowledge
The Generic Problems:
Data ingest
Managing a petabyte
Common schema
How to organize it
How to reorganize it
How to share it with others
Query and visualization tools
Building and executing models
Integrating data and literature
Documenting experiments
Curation and long-term preservation

The Fourth Paradigm: Data-Intensive Scientific Discovery

Paradigm Shift
Classic paradigm: You produce data, analyze,
interpret (end to end)
Conventional paradigm: Consortium/centers
produce data and you consume it
New paradigm: Consortia/centers have
produced the data and are creating
cyberinfrastructure to tackle the grand challenges

Big Data
Extracting meaningful results from vast amounts of data (linked data)
Big data information assets demand cost-effective, innovative
forms of information processing for enhanced insight and decision
making.
Big data is only the beginning of extreme information
management
Big data technology: not all of it is new

Attributed to Gartner Consulting

A few words about Big Data and Data Science


The 2014 Gartner Technology Hype-Cycle
http://www.gartner.com/newsroom/id/2819918

Simple Formula for Success


The Reality

Excel, R
Perl
Python
ArcGIS
Java, Ruby
Fortran, C, C#
C++, MATLAB
etc.
and lots of glue...

Amazon
Azure
Rackspace
Campus HPC
XSEDE
Etc.

Simple Formula

http://cloudtweaks.com/2011/05/the-lighter-side-of-the-cloud-data-transfer/

Rise of the data janitors


The relevance
Bioinformatics has become too central to biology to
be left to specialist bioinformaticians.
Biologists are all bioinformaticians now
- Lincoln Stein Dec. 2008
http://genomebiology.com/2008/9/12/114

iPlant Collaborative: Vision

Enable life science researchers and educators to use and extend cyberinfrastructure

www.iPlantCollaborative.org

The iPlant Collaborative


We are a Cyberinfrastructure

Platforms, tools, datasets

Storage and compute

Training and support

From data to discovery

The iPlant Collaborative


And a virtual organization

Developer Expertise
Computational Capacity
Science Domain Expertise
Training
Administration and Organization

iPlant Collaborative: CI for Scalable Science


Facilitating the 4As of Computational Thinking approaches for
Life Sciences: Abstraction, Automation, Ability and Audacity
Allowing researchers and educators to establish and manage data
driven collaborations: Supporting distributed teams and virtual
organizations (VO) at global scale
Making efficient and coordinated use of CI resources from national,
regional, institutional and commercial providers: NSF XSEDE, iPlant,
campus HPC and high bandwidth connections to commercial cloud
providers
Adopting best practices from science domains where key CI
challenges have been solved: Astronomy, Particle Physics etc.
Community driven, self-provisioning, extensible and open source:
Development and prioritization driven through community
engagement, active engagement with CISE communities

iPlant Collaborative: Platform Philosophy

Strive to provide the CI Lego blocks


From the Danish 'leg godt' ('play well')
Also translates as 'I put together' in Latin
If desired functionality is not available, the
community can craft their own by using and
extending iPlant CI components (like Lego blocks)
Through these extensible and customized
platforms, create an ecosystem of interoperable
tools that benefit the broad community (and not
just a few lab groups)
Provide the tools to allow community to manage
their digital assets (cloud, HPC etc.)
Improve Computational Productivity

Who did we build it for?

iPlant: Platform for Big Data Collaborations

iPlant Collaborative: Products

Ease of use

Ready to use
Platforms

Extensible
Services

Established CI
Components

Foundational
Capabilities

iPlant: Cohesive Platform for Big Data lifecycle

Researchers like to share!


User Statistics

~27,000 user accounts
4,900 users with data
2,600 users (53% of users with data) made at least 1 share
2,100 shares per user
42 million files (58% shared)
59 million shares (1.1 million/month)
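
The figures above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch, not part of the original slides: the function names are made up for illustration, and the only inputs are numbers quoted on this slide. It confirms the 53% sharing rate and shows how many months of sharing the totals imply at the quoted monthly rate.

```python
def share_rate_percent(sharers: int, users_with_data: int) -> int:
    """Percentage of users with data who made at least one share (rounded)."""
    return round(100 * sharers / users_with_data)


def months_at_rate(total_shares: float, shares_per_month: float) -> float:
    """Number of months needed to reach total_shares at a steady monthly rate."""
    return total_shares / shares_per_month


if __name__ == "__main__":
    # Inputs are the numbers quoted on the slide; nothing else is assumed.
    print(share_rate_percent(2_600, 4_900))         # 53
    print(round(months_at_rate(59e6, 1.1e6), 1))    # months implied by the slide's rate
```

The steady monthly rate is a simplification; the slide does not say how sharing was distributed over time.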

Community Data Statistics


5 million files
55 million (1.0 million/month) shares

~1.1PB of User Managed data


Our users consume more than 5M service units (SUs) annually
(we graduate them to compete for their own allocations from XSEDE)

How is it being used?


Users build their own systems, powered by iPlant components but
managed by the users themselves
Consume specific components a la carte (Data Store, Atmosphere)
Directly use applications (DE)
Custom-design appliances (Atmosphere)
Publish their findings (PNAS, Nature)
Advocate use
Create learning material and courses

iPlant CI: What is the community building?


Many "1000s omes" projects manage their data & analysis
Execute large-scale workflows (25-50 TB of data, million+ CPU hours)
Data infrastructure to coordinate digitization efforts for multiple sites
Sharing, visualizing (3D) & analyzing high-resolution microscopy images
(40K x 40K) via web browser
Learning material, new course work, custom applications

And it goes way beyond plants and life science

iPlant Collaborative: Training data scientists


Partnership with Software Carpentry and Data Carpentry to
provide best practices necessary to make efficient use of CI
Allowing individual researchers and educators to utilize data and
computational infrastructure at scale (and encounter real challenges)
Community contributed material (built on iPlant CI)

Applied Cyberinfrastructure Concepts (ACIC)


Semester-long, project-based learning course: introduces fundamental
concepts, tools and resources for effectively managing common tasks
associated with analyzing large datasets.
Graduate + undergraduate course working on REAL research
workflows where scalability is a bottleneck
Provide familiarity with cyberinfrastructure (CI) resources available at the
University of Arizona campus, iPlant Collaborative, NSF XSEDE centers,
and cloud (FutureGrid and commercial providers such as Amazon).
Learning to apply relevant CI skills (for the final project) and developing
wiki-based documentation of these best practices.
Learning how to effectively collaborate in interdisciplinary team settings.
Deliver a functional solution to the stakeholder

From research question to reality

Why is it valuable?
Users are able to overcome data and computational bottlenecks
Share data of ANY size with ANYONE
Connect data and compute on a single platform
Manage their data and computations regardless of scale
Build their own apps and solutions (create their own communities,
e.g. iAnimal, iVirome)
Create custom appliances

iPlant: What worked


All major CI components have seen steady adoption (with few
exceptions)
Think tank to do tank transition was rapid
Evolved to a technology proving ground
Take research products (NSF funded) to production use for our
community
Running infrastructure is not fun, building is. Allowing people to
focus on science (while we streamline the CI)

iPlant: What worked


Evolution of training (Software Carpentry)
Sharing/collaboration
Give people an exit strategy (options) and they are happy to adopt
the solution
Provide feedback to CI component creators to improve (usability)
Expectation management: Do not expect the same experience
(cable cord-cutting vs. Netflix/Hulu)

What did not work


Managing distributed teams is harder in a VO (load balancing,
enthusiasm etc.)
Technology lifecycle is not synchronized across all products
Relying on multiple providers for a solution is challenging
(downtimes)
Changing/evolving needs of the community are hard to predict
Growth of users outpaces our cloud capabilities (see tweets)

Even the tech geeks notice

Connect with iPlant!


Get an account: http://user.iplantcollaborative.org
Email us: info@iplantcollaborative.org
Questions: http://ask.iplantcollaborative.org
Twitter: @iPlantCollab #iPlant
Facebook: facebook.com/iPlantCollab
LinkedIn: iplant.co/iPlantCollabLinkedIn
Google+: iplant.com/iPlantGooglePlus

Luck favors the brave


Analysis favors the organized