Ticer Summer School

Thursday 24th August 2006

Dave Berry & Malcolm Atkinson National e-Science Centre, Edinburgh www.nesc.ac.uk
TICER Summer School, August 24th 2006 1

Digital Libraries, Grids & E-Science

• What is E-Science?
• What is Grid Computing?
• Data Grids
– Requirements
– Examples
– Technologies
• Data Virtualisation
• The Open Grid Services Architecture
• Challenges

What is e-Science?

• Goal: to enable better research in all disciplines
• Method: Develop collaboration supported by advanced distributed computation
– to generate, curate and analyse rich data resources
  • From experiments, observations, simulations & publications
  • Quality management, preservation and reliable evidence
– to develop and explore models and simulations
  • Computation and data at all scales
  • Trustworthy, economic, timely and relevant results
– to enable dynamic distributed collaboration
  • Facilitating collaboration with information and resource sharing
  • Security, trust, reliability, accountability, manageability and agility


Integrative Biology
Tackling two Grand Challenge research questions:
• What causes heart disease?
• How does a cancer form and grow?
Together these diseases cause 61% of all UK deaths.

Building a powerful, fault-tolerant Grid infrastructure for biomedical science.
Enabling biomedical researchers to use distributed resources such as high-performance computers, databases and visualisation tools to develop coupled multi-scale models of how these killer diseases develop.

Courtesy of David Gavaghan & IB Team

Biomedical Research Informatics Delivered by Grid Enabled Services

[Figure: the CFG Virtual Organisation. A portal and a data hub connect publicly curated data (Ensembl, OMIM, SWISS-PROT, MGI, HUGO, RGD) with private data held at Glasgow, Edinburgh, Oxford, Leicester, London and the Netherlands. Other labels in the diagram: "Synteny Grid Service" and "blast".]

http://www.brc.dcs.gla.ac.uk/projects/bridges/

eDiaMoND: Screening for Breast Cancer

[Figure: the eDiaMoND Grid linking screening, assessment/symptomatic work and training. Labels recoverable from the diagram:]
• Sources: patients, letters, radiology reporting systems, electronic patient records
• Screening: X-rays and case information (secondary capture or FFD); case and reading information; digital reading
• Assessment / Symptomatic: biopsy and other modalities (MRI, PET, ultrasound); symptomatic/assessment information; better access to case information and digital tools
• Training: manage training cases; perform training; supplement mentoring with access to digital training cases and sharing of information across clinics
• Tools: SMF, CAD, 3D images, temporal comparison
• Scope: 1 Trust to many Trusts; collaborative working; audit capability; epidemiology

Provided by eDiaMoND project: Prof. Sir Mike Brady et al.

E-Science Data Resources

• Curated databases
– Public, institutional, group, personal
• Online journals and preprints
• Text mining and indexing services
• Raw storage (disk & tape)
• Replicated files
• Persistent archives
• Registries
• …

EBank

Slide from Jeremy Frey

Biomedical data – making connections

12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg

Slide provided by Carole Goble: University of Manchester

Using Workflows to Link Services

• Describe the steps in a Scripting Language
• Steps performed by Workflow Enactment Engine
• Many languages in use
– Trade off: familiarity & availability
– Trade off: detailed control versus abstraction

• Incrementally develop correct process
– Sharable & Editable
– Basis for scientific communication & validation
– Valuable IPR asset

• Repetition is now easy
– Parameterised explicitly & implicitly
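
The split between a workflow description and an enactment engine can be sketched in a few lines of Python. This is only an illustration of the idea, not how Taverna or any other system named in this talk works; the step names and the `enact` function are hypothetical.

```python
from typing import Callable, List, Tuple

# A workflow is an ordered list of (step name, function) pairs.
# Each function takes the previous step's output and returns a new value.
Step = Tuple[str, Callable[[object], object]]


def enact(workflow: List[Step], initial: object) -> Tuple[object, List[str]]:
    """Toy enactment engine: run each step in order and log what was run."""
    value = initial
    log = []
    for name, func in workflow:
        value = func(value)
        log.append(f"{name} -> {value!r}")
    return value, log


# Hypothetical three-step analysis of a nucleotide string.
workflow: List[Step] = [
    ("strip_whitespace", lambda s: s.replace(" ", "")),
    ("to_upper", lambda s: s.upper()),
    ("count_gc", lambda s: sum(1 for base in s if base in "GC")),
]

# Repetition is a matter of calling enact again with a different parameter.
result, log = enact(workflow, "acat ttct ac")
```

Because the workflow is plain data, it can be shared, edited and re-run with new inputs, and the log gives a simple record of what was done.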

Workflow Systems
Language | WF Enact. | Comments
Shell scripts | Shell + OS | Common but not often thought of as WF. Depend on context, e.g. NFS across all sites
Perl | Perl runtime | Popular in bioinformatics. Similar context dependence – distribution has to be coded
Java | JVM | Popular target because JVM ubiquity – similar dependence – distribution has to be coded
BPEL | BPEL Enactment | OASIS standard for industry – coordinating use of multiple Web Services – low level detail – tools
Taverna | Scufl | EBI, OMII-UK & MyGrid http://taverna.sourceforge.net/index.php
VDT / Pegasus | Chimera & DAGman | High-level abstract formulation of workflows, automated mapping towards executable forms, cached result re-use
Kepler | Kepler | BIRN, GEON & SEEK http://kepler-project.org/

Workflow example

Taverna in MyGrid http://www.mygrid.org.uk/
“allows the e-Scientist to describe and enact their experimental processes in a structured, repeatable and verifiable way”
• GUI
• Workflow language
• Enactment engine

Notification

Pub/Sub for Laboratory data using a broker and ultimately delivered over GPRS
Comb-e-chem: Jeremy Frey
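
The publish/subscribe pattern behind this kind of notification can be sketched with a small in-memory broker. This is a minimal sketch only: the `Broker` class and topic names are hypothetical, and it does not model the Comb-e-chem broker or GPRS delivery.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Callback = Callable[[str, object], None]


class Broker:
    """Toy broker: publishers and subscribers only know topic names."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callback) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: object) -> int:
        """Deliver the message to every subscriber; return how many received it."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(topic, message)
        return len(callbacks)


broker = Broker()
received = []

# A subscriber (for example a phone client) registers interest in one topic.
broker.subscribe("lab/temperature", lambda topic, msg: received.append((topic, msg)))

delivered = broker.publish("lab/temperature", 21.5)   # has one subscriber
undelivered = broker.publish("lab/pressure", 1.0)     # nobody is listening
```

The laboratory instrument never needs to know who is listening, which is what lets new consumers be added without changing the data source.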

Relevance to Digital Libraries

• Similar concerns
– Data curation & management
– Metadata, discovery
– Secure access (AAA +)
– Provenance & data quality
– Local autonomy
– Availability, resilience

• Common technology
– Grid as an implementation technology


What is a Grid?
A grid is a system consisting of
− Distributed but connected resources and
− Software and/or hardware that provides and manages logically seamless access to those resources to meet desired objectives

[Figure: example resources: web server, handheld, workstation, database, server, cluster, printer, license, supercomputer, data center]

Source: Hiro Kishimoto GGF17 Keynote May 2006


Virtualizing Resources
[Figure: layers of resource virtualisation, top to bottom: access; type-specific interfaces (computers, storage, sensors, applications, information); common interfaces; Web services; resource-specific interfaces; resources]

Hiro Kishimoto: Keynote GGF17


Ideas and Forms

• Key ideas
– Virtualised resources
– Secure access
– Local autonomy

• Many forms
– Cycle stealing
– Linked supercomputers
– Distributed file systems
– Federated databases
– Commercial data centres
– Utility computing

Grid Middleware

[Figure: Grid middleware services (job-submit service, brokering service, registry service) sit above the virtualized resources (CPU compute resource service, data service, application service, printer service); arrows between the layers are labelled "Advertise" and "Notify"]

Hiro Kishimoto: Keynote GGF17


Key Drivers for Grids

• Collaboration
– Expertise is distributed
– Resources (data, software licences) are location-specific
– Necessary to achieve critical mass of effort
– Necessary to raise sufficient resources

• Computational Power
– Rapid growth in number of processors
– Powered by Moore’s law + device roadmap
– Challenge to transform models to exploit this

• Deluge of Data
– Growth in scale: Number and Size of resources
– Growth in complexity
– Policy drives greater data availability

Minimum Grid Functionalities

• Supports distributed computation
– Data and computation
– Over a variety of
  • Hardware components (servers, data stores, …)
  • Software components (services: resource managers, computation and data services)
– With regularity that can be exploited
  • By applications
  • By other middleware & tools
  • By providers and operations
– It will normally have security mechanisms
  • To develop and sustain trust regimes

Grid & Related Paradigms
Distributed Computing
• Loosely coupled
• Heterogeneous
• Single Administration

Cluster
• Tightly coupled
• Homogeneous
• Cooperative working

Grid Computing
• Large scale
• Cross-organizational
• Geographical distribution
• Distributed Management

Utility Computing
• Computing “services”
• No knowledge of provider
• Enabled by grid technology

Source: Hiro Kishimoto GGF17 Keynote May 2006


Why use / build Grids?

• Research Arguments
– Enables new ways of working
– New distributed & collaborative research
– Unprecedented scale and resources

• Economic Arguments
– Reduced system management costs
– Shared resources ⇒ better utilisation
– Pooled resources ⇒ increased capacity
– Load sharing & utility computing
– Cheaper disaster recovery

Why use / build Grids?

• Operational Arguments
– Enable autonomous organisations to
  • Write complementary software components
  • Set up, run & use complementary services
  • Share operational responsibility
  • Provide a general & consistent environment for Abstraction, Automation, Optimisation & Tools

• Political & Management Arguments
– Stimulate innovation
– Promote intra-organisation collaboration
– Promote inter-enterprise collaboration

Grids In Use: E-Science Examples

Data sharing and integration
− Life sciences, sharing standard data-sets, combining collaborative data-sets
− Medical informatics, integrating hospital information systems for better care and better science
− Sciences, high-energy physics

Simulation-based science and engineering
− Earthquake simulation

Capability computing
− Life sciences, molecular modeling, tomography
− Engineering, materials science
− Sciences, astronomy, physics

High-throughput, capacity computing for
− Life sciences: BLAST, CHARMM, drug screening
− Engineering: aircraft design, materials, biomedical
− Sciences: high-energy physics, economic modeling

Source: Hiro Kishimoto GGF17 Keynote May 2006


Database Growth
• EMBL DB: 111,416,302,701 nucleotides
• PDB: 33,367 protein structures

Slide provided by Richard Baldock: MRC HGU Edinburgh

Requirements: User’s viewpoint

• Find Data
– Registries & Human communication

• Understand data
– Metadata description, Standard / familiar formats & representations, Standard value systems & ontologies

• Data Access
– Find how to interact with data resource
– Obtain permission (authority)
– Make connection
– Make selection

• Move Data
– In bulk or streamed (in increments)

Requirements: User’s viewpoint 2

• Transform Data
– To format, organisation & representation required for computation or integration

• Combine data
– Standard database operations + operations relevant to the application model

• Present results
– To humans: data movement + transform for viewing
– To application code: data movement + transform to the required format
– To standard analysis tools, e.g. R
– To standard visualisation tools, e.g. Spitfire

Requirements: Owner’s viewpoint

• Create Data
– Automated generation, Accession Policies, Metadata generation
– Storage Resources

• Preserve Data
– Archiving
– Replication
– Metadata
– Protection

• Provide Services with available resources
– Definition & implementation: costs & stability
– Resources: storage, compute & bandwidth

Requirements: Owner’s viewpoint 2

• Protect Services
– Authentication, Authorisation, Accounting, Audit
– Reputation

• Protect data
– Comply with owner requirements
– Encryption for privacy, …

• Monitor and Control use
– Detect and handle failures, attacks, misbehaving users
– Plan for future loads and services

• Establish case for Continuation
– Usage statistics
– Discoveries enabled

Large Hadron Collider

• The most powerful instrument ever built to investigate elementary particle physics

• Data Challenge:
– 10 Petabytes/year of data
– 20 million CDs each year!

• Simulation, reconstruction, analysis:
– LHC data handling requires computing power equivalent to ~100,000 of today's fastest PC processors
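
The two data figures can be cross-checked with a short calculation. The sketch below assumes decimal units (1 PB = 10^15 bytes), which the slide does not state, and the average rate assumes data is spread evenly over a full year, which a real accelerator schedule would not do.

```python
PETABYTE = 10**15   # assumption: decimal petabytes
MEGABYTE = 10**6

data_per_year_bytes = 10 * PETABYTE   # 10 Petabytes/year, from the slide
cds_per_year = 20_000_000             # 20 million CDs, from the slide

# Capacity per CD implied by the two slide figures.
megabytes_per_cd = data_per_year_bytes / cds_per_year / MEGABYTE

# Sustained average rate if the data arrived evenly over 365 days (an assumption).
seconds_per_year = 365 * 24 * 3600
average_rate_mb_per_s = data_per_year_bytes / seconds_per_year / MEGABYTE

print(f"{megabytes_per_cd:.0f} MB per CD")
print(f"{average_rate_mb_per_s:.0f} MB/s averaged over a year")
```

The implied 500 MB per CD is of the same order as a real CD's capacity, so the two slide figures are consistent with each other.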

Composing Observations in Astronomy

No. & sizes of data sets as of mid-2002, grouped by wavelength

• 12 waveband coverage of large areas of the sky
• Total about 200 TB data
• Doubling every 12 months
• Largest catalogues near 1B objects

Data and images courtesy Alex Szalay, Johns Hopkins


Global In-flight Engine Diagnostics

• 100,000 aircraft
• 0.5 GB/flight
• 4 flights/day
• 200 TB/day
• Now BROADEN: significant in getting Boeing 787 engine contract

[Figure: in-flight data passes from the airline through a ground station and a global network (e.g. SITA) to the DS&S Engine Health Center, a data centre and a maintenance centre, with results delivered by internet, e-mail and pager]

Distributed Aircraft Maintenance Environment: Leeds, Oxford, Sheffield & York, Jim Austin
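
The 200 TB/day figure follows directly from the other three numbers on the slide. A short check, assuming decimal units (1 TB = 1000 GB); the yearly total is an extrapolation that is not on the slide.

```python
aircraft = 100_000        # from the slide
gb_per_flight = 0.5       # from the slide
flights_per_day = 4       # per aircraft, from the slide

gb_per_day = aircraft * gb_per_flight * flights_per_day
tb_per_day = gb_per_day / 1000          # assumption: 1 TB = 1000 GB

# Extrapolation, assuming the same volume every day of the year.
pb_per_year = tb_per_day * 365 / 1000

print(f"{tb_per_day:.0f} TB/day, about {pb_per_year:.0f} PB/year")
```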


Storage Resource Manager (SRM)

• http://sdm.lbl.gov/srm-wg/
• de facto & written standard in physics, …
• Collaborative effort
– CERN, FNAL, JLAB, LBNL and RAL

• Essential bulk file storage
– (pre) allocation of storage
  • abstraction over storage systems
– File delivery / registration / access
– Data movement interfaces
  • E.g. GridFTP

• Rich function set
– Space management, permissions, directory, data transfer & discovery

Storage Resource Broker (SRB)

• http://www.sdsc.edu/srb/index.php/Main_Page
• SDSC developed
• Widely used
– Archival document storage
– Scientific data: bio-sciences, medicine, geo-sciences, …

• Manages
– Storage resource allocation
  • abstraction over storage systems
– File storage
– Collections of files
– Metadata describing files, collections, etc.
– Data transfer services

Condor Data Management

• Stork
– Manages File Transfers
– May manage reservations

• NeST
– Manages Data Storage
– C.f. GridFTP with reservations
  • Over multiple protocols


Globus Tools and Services for Data Management

• GridFTP
– A secure, robust, efficient data transfer protocol
• The Reliable File Transfer Service (RFT)
– Web services-based, stores state about transfers
• The Data Access and Integration Service (OGSA-DAI)
– Service to access data resources, particularly relational and XML databases
• The Replica Location Service (RLS)
– Distributed registry that records locations of data copies
• The Data Replication Service
– Web services-based, combines data replication and registration functionality

Slides from Ann Chervenak

RLS in Production Use: LIGO

Laser Interferometer Gravitational Wave Observatory
• Currently use RLS servers at 10 sites
– Contain mappings from 6 million logical files to over 40 million physical replicas
• Used in customized data management system: the LIGO Lightweight Data Replicator System (LDR)
– Includes RLS, GridFTP, custom metadata catalog, tools for storage management and data validation

Slides from Ann Chervenak
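
The core idea of a replica catalogue, a mapping from one logical file name to many physical copies, can be sketched as follows. This is a toy in-memory illustration, not the RLS interface; the `ReplicaCatalog` class, the logical names and the `example.org` URLs are all hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set


class ReplicaCatalog:
    """Toy catalogue: one logical file name maps to a set of physical copies."""

    def __init__(self) -> None:
        self._replicas: DefaultDict[str, Set[str]] = defaultdict(set)

    def register(self, logical_name: str, physical_url: str) -> None:
        self._replicas[logical_name].add(physical_url)

    def lookup(self, logical_name: str) -> List[str]:
        # .get avoids creating an empty entry for unknown names.
        return sorted(self._replicas.get(logical_name, set()))


catalog = ReplicaCatalog()
catalog.register("lfn:run42/frame-0001", "gsiftp://site-a.example.org/data/frame-0001")
catalog.register("lfn:run42/frame-0001", "gsiftp://site-b.example.org/archive/frame-0001")

copies = catalog.lookup("lfn:run42/frame-0001")   # both registered copies
missing = catalog.lookup("lfn:run42/frame-9999")  # unknown name gives an empty list
```

A client can then pick whichever copy is closest or least loaded, while applications keep referring to the stable logical name.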

RLS in Production Use: ESG

• Earth System Grid: Climate modeling data (CCSM, PCM, IPCC)
• RLS at 4 sites
• Data management coordinated by ESG portal
• Datasets stored at NCAR
– 64.41 TB in 397,253 total files
– 1230 portal users
• IPCC Data at LLNL
– 26.50 TB in 59,300 files
– 400 registered users
– Data downloaded: 56.80 TB in 263,800 files
– Avg. 300 GB downloaded/day
– 200+ research papers being written

Slides from Ann Chervenak

gLite Data Management
Enabling Grids for E-sciencE

• FTS
– File Transfer Service

• LFC
– Logical file catalogue

• Replication Service
– Accessed through LFC

• AMGA
– Metadata services

INFSO-RI-508833


Data Management Services
FiReMan catalog
– Resolves logical filenames (LFN) to physical location of files and storage elements
– Oracle and MySQL versions available
– Secure services
– Attribute support
– Symbolic link support
– Deployed on the Pre-Production Service and DILIGENT testbed

gLite I/O
– Posix-like access to Grid files
– Castor, dCache and DPM support
– Has been used for the BioMedical Demo
– Deployed on the Pre-Production Service and the DILIGENT testbed

AMGA MetaData Catalog
– Used by the LHCb experiment
– Has been used for the BioMedical Demo

Medical Data Management
[Figure: a medical imager (DICOM, behind an SRM-DICOM interface) feeds the MDM Trigger; files move via GridFTP and gLite I/O; an application client uses the MDM Client Library]

Grid Catalogs
• File Catalog (Fireman)
• Encryption Keystore (Hydra)
• Metadata Catalog (AMGA)

Trigger (at the medical imager):
• Retrieve DICOM files from imager
• Register file in Fireman
• gLite EDS client: Generate encryption keys and store them in Hydra
• Register Metadata in AMGA

Client Library (at the application):
• Lookup file through Metadata (AMGA)
• Use gLite EDS client:
  – Retrieve file through gLite I/O
  – Retrieve encryption Key from Hydra
  – Decrypt data
• Serve it up to the application
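
The store and retrieve sequence above can be sketched with in-memory stand-ins for the three catalogues. Everything here is hypothetical: the dictionaries do not model the real Fireman, Hydra or AMGA interfaces, and the XOR cipher with a random key as long as the data is a toy chosen only to keep the example self-contained.

```python
import secrets
from typing import Dict, List

storage: Dict[str, bytes] = {}                     # encrypted contents by location
file_catalog: Dict[str, str] = {}                  # stands in for Fireman
keystore: Dict[str, bytes] = {}                    # stands in for Hydra
metadata_catalog: Dict[str, Dict[str, str]] = {}   # stands in for AMGA


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Toy cipher: XOR each byte with a key byte (applying it twice decrypts)."""
    return bytes(d ^ k for d, k in zip(data, key))


def trigger_store(logical_name: str, image: bytes, metadata: Dict[str, str]) -> None:
    """Trigger side: encrypt and store, then register file, key and metadata."""
    key = secrets.token_bytes(len(image))
    location = f"store://{logical_name}"
    storage[location] = xor_bytes(image, key)
    file_catalog[logical_name] = location
    keystore[logical_name] = key
    metadata_catalog[logical_name] = metadata


def client_fetch(attribute: str, value: str) -> List[bytes]:
    """Client side: look up by metadata, retrieve file and key, decrypt."""
    images = []
    for logical_name, attrs in metadata_catalog.items():
        if attrs.get(attribute) == value:
            encrypted = storage[file_catalog[logical_name]]
            images.append(xor_bytes(encrypted, keystore[logical_name]))
    return images


trigger_store("scan-001", b"DICOM bytes", {"modality": "MRI"})
trigger_store("scan-002", b"other bytes", {"modality": "PET"})
mri_images = client_fetch("modality", "MRI")
```

Keeping the keys in a separate store from the data means that access to the storage element alone is not enough to read an image.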


File Transfer Service
• Reliable file transfer
• Fully scalable implementation
– Java Web Service front-end, C++ Agents, Oracle or MySQL database support
– Support for Channel, Site and VO management
– Interfaces for management and statistics monitoring

• Gsiftp, SRM and SRM-copy support
• Support for MySQL and Oracle
• Multi-VO support
• GridFTP and SRM copy support


Commercial Solutions

• Vendors include:
– Avaki
– Data Synapse

• Benefits & costs
– Well packaged and documented
– Support
– Can be expensive
  • But look for academic rates


Data Integration Strategies

• Use a Service provided by a Data Owner
• Use a scripted workflow
• Use data virtualisation services
– Arrange that multiple data services have common properties
– Arrange federations of these
– Arrange access presenting the common properties
– Expose the important differences
– Support integration accommodating those differences

Data Virtualisation Services
• Form a federation
– Set of data resources – incremental addition
– Registration & description of collected resources
– Warehouse data or access dynamically to obtain updated data
– Virtual data warehouses – automating division between collection and dynamic access

• Describe relevant relationships between data sources
– Incremental description + refinement / correction

• Run jobs, queries & workflows against combined set of data resources
– Automated distribution & transformation

• Example systems
– IBM’s Information Integrator
– GEON, BIRN & SEEK
– OGSA-DAI is an extensible framework for building such systems
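
Running one query against a combined set of data resources can be sketched with two in-memory SQLite databases that use different schemas and units. This is a minimal illustration of the federation idea, not how OGSA-DAI or the other systems listed work; the table names, site names and the `federated_query` function are hypothetical.

```python
import sqlite3
from typing import List, Tuple


def make_source(schema_sql: str, insert_sql: str, rows: list) -> sqlite3.Connection:
    """Create an independent in-memory database standing in for one site."""
    conn = sqlite3.connect(":memory:")
    conn.execute(schema_sql)
    conn.executemany(insert_sql, rows)
    return conn


# Site A records lengths in metres; site B uses centimetres and other names.
site_a = make_source("CREATE TABLE samples (id TEXT, length_m REAL)",
                     "INSERT INTO samples VALUES (?, ?)",
                     [("s1", 1.5), ("s2", 0.4)])
site_b = make_source("CREATE TABLE specimen (code TEXT, length_cm REAL)",
                     "INSERT INTO specimen VALUES (?, ?)",
                     [("s3", 120.0), ("s4", 50.0)])

# Each member is registered with a view presenting the common properties
# (id, length_m); the unit conversion is the "transformation" step.
federation = [
    ("site_a", site_a, "SELECT id, length_m FROM samples"),
    ("site_b", site_b, "SELECT code AS id, length_cm / 100.0 AS length_m FROM specimen"),
]


def federated_query(min_length_m: float) -> List[Tuple[str, str, float]]:
    """Distribute the same selection to every member and combine the results."""
    results = []
    for name, conn, view_sql in federation:
        sql = f"SELECT id, length_m FROM ({view_sql}) WHERE length_m >= ? ORDER BY id"
        for row_id, length_m in conn.execute(sql, (min_length_m,)):
            results.append((name, row_id, length_m))
    return results


long_samples = federated_query(1.0)
```

Adding a new member means registering one more view; the query code itself does not change, which is the incremental-addition property listed above.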


Virtualisation variations

• Extent to which homogeneity obtained
– Regular representation choices – e.g. units
– Consistent ontologies
– Consistent data model
– Consistent schema – integrated super-schema
– DB operations supported across federation
– Ease of adding federation elements
– Ease of accommodating change as federation members change their schema and policies
– Drill through to primary forms supported

OGSA-DAI

• http://www.ogsadai.org.uk
• A framework for data virtualisation
• Wide use in e-Science
– BRIDGES, GEON, CaBiG, GeneGrid, MyGrid, BioSimGrid, e-Diamond, IU RGRBench, …

• Collaborative effort
– NeSC, EPCC, IBM, Oracle, Manchester, Newcastle

• Querying of data resources
– Relational databases
– XML databases
– Structured flat files

• Extensible activity documents
– Customisation for particular applications

The Open Grid Services Architecture

• An open, service-oriented architecture (SOA)
− Resources as first-class entities
− Dynamic service/resource creation and destruction
• Built on a Web services infrastructure
• Resource virtualization at the core
• Build grids from small number of standards-based components
− Replaceable, coarse-grained
− e.g. brokers
• Customizable
− Support for dynamic, domain-specific content…
− …within the same standardized framework

Hiro Kishimoto: Keynote GGF17


OGSA Capabilities
Execution Management
• Job description & submission
• Scheduling
• Resource provisioning

Data Services
• Common access facilities
• Efficient & reliable transport
• Replication services

Resource Management
• Discovery
• Monitoring
• Control

Self-Management
• Self-configuration
• Self-optimization
• Self-healing

Information Services
• Registry
• Notification
• Logging/auditing

Security
• Cross-organizational users
• Trust nobody
• Authorized access only

[Figure: the six capability groups surround OGSA, resting on OGSA “profiles” and a Web services foundation]

Hiro Kishimoto: Keynote GGF17


Basic Data Interfaces

Storage Management
− e.g. Storage Resource Management (SRM)

Data Access
− ByteIO
− Data Access & Integration (DAI)

Data Transfer
− Data Movement Interface Specification (DMIS)
− Protocols (e.g. GridFTP)

• Replica management
• Metadata catalog
• Cache management

Hiro Kishimoto: Keynote GGF17


The State of the Art

• Many successful Grid & E-Science projects
– A few examples shown in this talk

• Many Grid systems
– All largely incompatible
– Interoperation talks under way

• Standardisation efforts
– Mainly via the Open Grid Forum
– A merger of the GGF & EGA

• Significant user investment required
– Few “out of the box” solutions

Technical Challenges

• Issues you can’t avoid
– Lack of Complete Knowledge (LOCK)
– Latency
– Heterogeneity
– Autonomy
– Unreliability
– Scalability
– Change

• A Challenging goal
– balance technical feasibility
– against virtual homogeneity, stability and reliability
– while remaining affordable, manageable and maintainable

Areas “In Development”

• Data provenance
• Quality of Service
– Service Level Agreements

• Resource brokering
– Across all resources

• Workflow scheduling
– Co-scheduling

• Licence management
• Software provisioning
– Deployment and update

• Other areas too!

Operational Challenges

• Management of distributed systems
– With local autonomy

• Deployment, testing & monitoring
• User training
• User support
• Rollout of upgrades
• Security
– Distributed identity management
– Authorisation
– Revocation
– Incident response

Grids as a Foundation for Solutions

• The grid per se doesn’t provide
– Supported e-Science methods
– Supported data & information resources
– Computations
– Convenient access

• Grids help providers of these, via
– International & national secure e-Infrastructure
– Standards for interoperation
– Standard APIs to promote re-use

• But Research Support must be built
– Application developers
– Resource providers

Collaboration Challenges

• Defining common goals
• Defining common formats
– E.g. schemas for data and metadata

• Defining a common vocabulary
– E.g. for metadata

• Finding common technology
– Standards should help, eventually

• Collecting metadata
– Automate where possible

Social Challenges

• Changing cultures
– Rewarding data & resource sharing
– Require publication of data

• Taking the first steps
– If everyone shares, everyone wins
– The first people to share must not lose out

• Sustainable funding
– Technology must persist
– Data must persist


Summary

• E-Science exploits distributed computing resources to enable new discoveries, new collaborations and new ways of working
• Grid is an enabling technology for e-science
• Many successful projects exist
• Many challenges remain


UK e-Science

[Figure: map of UK e-Science centres. Labels recoverable from the diagram:]
• Globus Alliance
• e-Science Institute
• National Centre for e-Social Science
• Digital Curation Centre
• CeSC (Cambridge)
• Grid Operations Support Centre
• National Institute for Environmental e-Science
• Open Middleware Infrastructure Institute