The Science and Fiction of Petascale Analytics

Jacek Becla
Stanford Linear Accelerator Center

SLAC
• Particle Physics
• Photon Science
• Astrophysics
• Petascale data management
50+ PB images, 20+ PB database

2008 MySQL Conference & Expo

Data Explosion
An enormous amount of digital information is produced
…and processed

Outline
• Data-intensive science & industry
• Reality
• Today’s trends
• Future
… of petascale analytics

Data-Intensive Scientific Community
• Multi-decade experiments
• Large, multi-tier collaborations
• Distributed, heterogeneous environment
• Specialized software
Contingency, customizations, recompilation, debuggability
Science open source

Science & Petabytes
Scientists were always drowning in data
Scientists are drowning in data
-- Jeannette M. Wing, Head Computer & Information Science & Engineering Directorate at NSF, 03/2008

Credit: Kirk Borne, GMU

Early 1990s

Science & Petabytes
High Energy Physics: BaBar
• 1999 – 2008
• Few TB/sec
– Small fraction saved
• Billions of collisions
• 4 PB data set
• Petabyte database

Science & Petabytes
High Energy Physics: LHC
• ½ PB/sec
– Small fraction saved
• Trillions of collisions
• 15 PB/year
– Starting later this year

Science & Petabytes
NASA: Earth Observing System
• 4 PB in 2005 (images)

Science & Petabytes
Photon Science
• Huge lasers
• Movies of molecules
– Few MB × 120 Hz
• Few PB/year
Credit: NIF, LLNL

Science & Petabytes
Genomics
• Trying to put together a database of all known DNA sequences
• Multi-petabytes

Science & Petabytes
Astronomy
• Huge telescopes
• Multi-gigapixel cameras
• Getting ready for…
– Trillions of observations
– 50+ PB of images
– 20+ PB database

Science & Petabytes
[Chart: data volume in PB (0 to 150) versus year (2000 to 2025) for BaBar, NASA, LHC, and LSST]

Science, Industry & Petabytes
[Chart: data volume in PB (0 to 150) versus year (2000 to 2025); labels: Google, Yahoo!, Microsoft; AT&T, Walmart, eBay, Facebook, few others; "?"]

Scientific Analytics Today
• Complex computations
– 100s of attributes per query
• Iterative, successively more restrictive
• Curiosity-driven questions
• 3 major query types
– Needle in haystack
– Correlations
– Time series
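
The three query types can be illustrated on a toy data set. The sketch below is only an illustration and is not taken from the talk: the `observations` records, their fields, and the three helper functions are hypothetical names chosen for the example.

```python
from statistics import mean

# Hypothetical toy data: (object_id, time, brightness, temperature)
observations = [
    (1, 0, 10.0, 5.0),
    (1, 1, 12.0, 6.0),
    (1, 2, 14.0, 7.0),
    (2, 0, 3.0, 9.0),
    (2, 1, 2.0, 8.0),
    (3, 0, 99.0, 1.0),
]

def needle_in_haystack(rows, min_brightness):
    """Highly selective filter: return the few rows that pass the cut."""
    return [r for r in rows if r[2] >= min_brightness]

def correlation(rows):
    """Pearson correlation between brightness and temperature."""
    xs = [r[2] for r in rows]
    ys = [r[3] for r in rows]
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def time_series(rows, object_id):
    """Brightness of one object, ordered by observation time."""
    ordered = sorted(rows, key=lambda r: r[1])
    return [r[2] for r in ordered if r[0] == object_id]
```

At petascale the same three shapes remain, but each one stresses the system differently: the first rewards indexing, the second needs many attributes at once, and the third needs data for one object kept together.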

Hunt for Higgs Boson
HEP: It’s All About “Events”
• Complex hierarchical tree-like structures with many relations
• Events are uncorrelated
✓ Needle in haystack
✗ Spatial correlations
✗ Time series within event
[Diagram: event tree with nodes Event; Tracker, Calor.; TrackList, Track; HitList, Hit]
Credit: Dirk Düllmann/CERN

Untangling the Universe
Astronomy: It’s All About “Astronomical Objects”
• Overlapping
• Moving
• Disappearing
• Highly correlated
✓ Needle in haystack
✓ Spatial correlations
✓ Time series

Understanding Dynamics of Biological Processes
✓ Needle in haystack
✓ Correlations
✓ Time series

Future Scientific Analytics
• Seamless integration with raw data
• Annotation and sharing
• Ubiquitous scientific data analytics
– Instead of analytics for elite scientists
• Mobile, anytime, anywhere
– On open source data

Industry & Analytics
• Most queries tool-generated
✓ Needle in haystack
✗ Correlations
✗ Time series
• Lots of summaries and aggregates
• Some very complex analytics
– detecting fraudulent activities
– understanding hacker patterns
– correlating ads with user behaviors
• Starting to realize huge potential of data/logs
Industrial analytics are becoming increasingly more complex

Scientific Approach to Petascale Analytics
HEP
• Relational model insufficient
• ODBMS didn’t take off
• Files + metadata in db
• Custom software
• Filtering & grouping
– Avoids small-granularity random reads
– Organized activity, introduces delay
Others
• RDBMS – good match
– but no multi-server setups yet
• Bigger systems
– files + metadata in db
• Raw data in files
– …or blobs inside database

Industrial Approach to Petascale Analytics
• Very few use databases for analytics
• Trend: Map/Reduce paradigm
– M/R, Hadoop, Dryad
– Bigtable, HBase, Hypertable
– Sawzall, Pig Latin, LINQ
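
For readers unfamiliar with the Map/Reduce paradigm, a minimal single-process sketch may help. It is not the API of Hadoop or any other system named above; the function names, the log format, and the hit-counting example are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record and collect (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Collapse each key's list of values into one result."""
    return {key: reducer(key, values) for key, values in groups.items()}

def map_reduce(records, mapper, reducer):
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)

# Hypothetical job: count hits per URL in a web log.
def hit_mapper(line):
    url = line.split()[1]
    return [(url, 1)]

def sum_reducer(key, values):
    return sum(values)

log_lines = ["GET /a", "GET /b", "GET /a"]
hits = map_reduce(log_lines, hit_mapper, sum_reducer)
```

In a real deployment the map and reduce calls run on many nodes and the shuffle moves data across the network; only the mapper and reducer are written by the user.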

Database… Map/Reduce… Files + Database…

Is it really so different?

Maybe Not! You Must…
• Manage lots of hardware
• Learn to deal with failures
• Parallelize
• Optimize
• Compromise
• Automate

Manage Lots of Hardware
• 6 GB/min → 100 MB/sec (1 disk)
• 1 PB/min → 150,000 disks
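
The arithmetic behind these figures can be checked with a short calculation. The sketch below assumes decimal units (1 PB = 10^15 bytes) and perfectly parallel sequential reads; `disks_needed` is a hypothetical helper, not something from the talk. Under those assumptions the quotient is about 167,000 disks, so the 150,000 on the slide reads as a round order-of-magnitude figure.

```python
MB = 10**6
GB = 10**9
PB = 10**15

def disks_needed(scan_bytes, seconds, disk_bytes_per_sec):
    """Disks required to read scan_bytes within the time budget.

    Assumes every disk streams at its full sequential rate in parallel.
    Uses integer ceiling division to avoid floating-point rounding.
    """
    per_disk = seconds * disk_bytes_per_sec
    return -(-scan_bytes // per_disk)

# One disk at 100 MB/sec delivers 6 GB in a minute.
per_disk_per_min = 100 * MB * 60

# Scanning 1 PB in one minute at 100 MB/sec per disk.
disks = disks_needed(1 * PB, 60, 100 * MB)
```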

Learn to Deal With Failures
Large number of disks =
Large number of nodes =
Constant state of failures =
Must recover transparently
– don't think RAID or high-end hardware will save you
Treat failures as normal state, not exceptions
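
One way to read "treat failures as normal state" is that a failed node is handled inside the read path and never surfaces to the caller. The sketch below illustrates that idea with replica failover; the exception class, `read_with_failover`, and the in-memory cluster are all hypothetical and stand in for whatever storage layer is actually used.

```python
class ReplicaUnavailable(Exception):
    """Raised by a reader when a node or disk cannot serve a chunk."""

def read_with_failover(chunk_id, replicas, read_chunk):
    """Try each replica in turn and return (data, nodes_that_failed).

    A failed node is recorded and skipped; the caller only sees an
    error if every replica is unavailable.
    """
    failed = []
    for node in replicas:
        try:
            data = read_chunk(node, chunk_id)
        except ReplicaUnavailable:
            failed.append(node)
            continue
        return data, failed
    raise RuntimeError(f"all replicas failed for chunk {chunk_id}: {failed}")

# Hypothetical in-memory cluster: node-a is down, node-b and node-c hold the chunk.
DOWN_NODES = {"node-a"}
STORAGE = {("node-b", 7): b"payload", ("node-c", 7): b"payload"}

def fake_read_chunk(node, chunk_id):
    if node in DOWN_NODES:
        raise ReplicaUnavailable(node)
    return STORAGE[(node, chunk_id)]

data, failed_nodes = read_with_failover(
    7, ["node-a", "node-b", "node-c"], fake_read_chunk
)
```

The list of failed nodes is returned rather than raised so that a background process can repair or replace them while queries keep running.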

Optimize or Go Bankrupt
• How to organize data?
• What to save? What to re-compute?
• How to partition?
• Row or column store?
• What to index?
• CPU/disk balance?
• How much to compress?
• How to formulate query?

Compromise or Die
• Performance killers
– Transactions
– Foreign keys

Automate
• 1 PB =
– 20 years of movies (HD)
– 2,000 years of MP3 (128 kbits/sec)
• Too much data to browse or comprehend
• Auto-load balance your data
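
The playback equivalents for 1 PB can be verified with a few lines of arithmetic. The sketch below assumes decimal units (1 PB = 10^15 bytes) and continuous playback; the helper names are hypothetical. The talk does not state an HD bitrate, so the code derives the bitrate that the "20 years" figure implies rather than assuming one.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
PB = 10**15

def playback_years(total_bytes, bits_per_second):
    """Years of continuous playback that total_bytes holds at a given bitrate."""
    return total_bytes * 8 / bits_per_second / SECONDS_PER_YEAR

def implied_bitrate(total_bytes, years):
    """Bitrate (bits/sec) at which total_bytes lasts exactly the given years."""
    return total_bytes * 8 / (years * SECONDS_PER_YEAR)

# MP3 at 128 kbits/sec: close to the 2,000 years quoted on the slide.
mp3_years = playback_years(PB, 128_000)

# 20 years of HD video corresponds to a stream of roughly 12 to 13 Mbits/sec.
hd_mbits_per_sec = implied_bitrate(PB, 20) / 1e6
```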

And Your Biggest Problem Is...
power and cooling
– tape is cool
– flash disks are coming

Hot or Not
Not
– Big, monolithic systems
– Shared all, shared disks
– Specialized hardware
Hot
– Lightweight, flexible specialized components with open interfaces
– Commodity hardware
– Shared nothing

Scale or Sophistication?
[Chart: scale (vertical axis) versus sophistication (horizontal axis), placing Map/Reduce, DBMS, and Matlab, SAS]
Map/Reduce
• Overhead too big for small problems
• Uses resources inefficiently
• Schema inside code
DBMS
• Costly scalability
• Progressively expensive fault tolerance
• Inflexible schema

What Is Next?
[Chart: scale (vertical axis) versus sophistication (horizontal axis), placing Map/Reduce, DBMS, and Matlab, SAS]
Map/Reduce: adding
• Schema
• SQL (Hive)
• More indexes
DBMS
• New, more scalable engines
• Brand new DBMSes
• Planning to scale
Matlab, SAS

Database Features Needed
Scientific Point of View
• Scalability up to 100s of petabytes (higher tomorrow)
• Parallelized single queries on commodity hardware
• Fault tolerant with intra-query failover
• Procedural user-defined functions/stored procedures that could be executed in parallel
• Shared scans
• Partial results
• Query pause/restart/abort
• Pre-execution query cost estimate
• Resource management system
• Support for arrays as a first-class column type
• Support for provenance of data elements
• Support for uncertainty of data elements
• Support for spatial and temporal operations
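
Of the features listed, shared scans are perhaps the least self-explanatory: several queries are answered during a single pass over the data instead of each query reading the table separately. The sketch below shows the idea in miniature; `shared_scan`, the query tuple layout, and the sample rows are hypothetical and do not reflect any particular database engine.

```python
def shared_scan(rows, queries):
    """Evaluate several queries in a single pass over the data.

    queries maps a query name to (predicate, accumulate, initial_value).
    """
    results = {name: init for name, (_, _, init) in queries.items()}
    for row in rows:  # the data is read only once
        for name, (predicate, accumulate, _) in queries.items():
            if predicate(row):
                results[name] = accumulate(results[name], row)
    return results

# Hypothetical catalog rows and two concurrent queries.
rows = [
    {"id": 1, "mag": 21.5},
    {"id": 2, "mag": 17.0},
    {"id": 3, "mag": 19.2},
]
queries = {
    "count_bright": (lambda r: r["mag"] < 20, lambda acc, r: acc + 1, 0),
    "max_mag": (lambda r: True, lambda acc, r: max(acc, r["mag"]), float("-inf")),
}
answers = shared_scan(rows, queries)
```

When the scan is bound by disk throughput, adding a query to an ongoing pass costs mostly CPU, which is why this matters at the data volumes discussed here.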

Will They Be There for LSST?
2009: choosing technology
2010 – 2014: construction
2014 – 2023: production
• Convincing database camp to build it all for us
– Working with many
• Testing pure Map/Reduce + Bigtable
– Collaborating with Google
• Prototyping with custom software plus off-the-shelf RDBMS
– Using MySQL

Summary
• Data avalanche
• Need scalable, sophisticated tools
• You are facing it too
Credit: ncids.org