
ELIXIR-Pilot Project

“BILS-ProteomeXchange integration using EUDAT resources”

Dr. Juan A. Vizcaíno, EMBL-EBI, juan@ebi.ac.uk


Dr. Fredrik Levander, BILS, fredrik.levander@bils.se

European Life Sciences Infrastructure for Biological Information


www.elixir-europe.org
Main people involved directly in this pilot

• Andy Jenkinson (Systems group)


• Rui Wang (PRIDE)
• Juan A. Vizcaíno (PRIDE)

• Fredrik Levander
• Samuel Lampa
• Janos Nagy
• Mikael Borg

• Jani Heikkinen

Juan A. Vizcaíno ELIXIR Webinar


juan@ebi.ac.uk 20 May 2015
Overview

• Short intro to PRIDE & ProteomeXchange, BILS and EUDAT

• Objectives of the pilot

• Report on the results

• Perspectives for the future and conclusions


PRIDE (PRoteomics IDEntifications) database

• PRIDE stores mass spectrometry (MS)-based proteomics data:
  • Peptide and protein expression data (identification and quantification)
  • Post-translational modifications
  • Mass spectra (raw data and peak lists)
  • Technical and biological metadata
  • Any other related information

• Full support for tandem MS approaches

http://www.ebi.ac.uk/pride
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2013

ProteomeXchange Consortium
• Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.

• Includes PeptideAtlas (ISB, Seattle), PRIDE (Cambridge, UK) and MassIVE (UCSD, San Diego).

• Tranche and Peptidome initially included but discontinued.

• Common identifier space (PXD identifiers)

• Two supported data workflows: MS/MS and SRM.

• Main objective: Make life easier for researchers

http://www.proteomexchange.org
Vizcaíno et al., Nat Biotechnol, 2014
ProteomeXchange data workflow: PRIDE
[Figure: ProteomeXchange data workflow. Submitted material: Results, Raw Data*, Metadata / Manuscript. Receiving repositories: PRIDE (MS/MS data), MassIVE (MS/MS data), PASSEL (SRM data). Other labels: ProteomeCentral, PeptideAtlas, UniProt/neXtProt, GPMDB, Other DBs, Journals. Legend: Researcher’s results, Reprocessed results, Raw data*, Metadata.]

Vizcaíno et al., Nat Biotechnol, 2014

PX Data workflow for MS/MS data
1. Mass spectrometer output files: raw data (binary files) or peak list spectra in a standardized format (mzML, mzXML).

2. Result files:
   a. Complete submissions: Result files can be converted to PRIDE XML or the mzIdentML data standard.
   b. Partial submissions: For workflows not yet supported by PRIDE, search engine output files will be stored and provided in their original form.

3. Metadata: Sufficiently detailed description of sample origin, workflow, instrumentation, submitter.

4. Other files: Optional files:
   a. QUANT: Quantification related results
   b. PEAK: Peak list files
   c. GEL: Gel images
   d. OTHER: Any other file type
   e. FASTA
   f. SP_LIBRARY
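
The distinction between complete and partial submissions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SubmissionFile` class, the `RAW`, `RESULT` and `SEARCH` labels and the `classify_submission` function are hypothetical names chosen for this example, and the logic is not the PX Submission Tool's actual implementation.

```python
from dataclasses import dataclass
from typing import Iterable

# Result formats that the slide lists for complete submissions.
SUPPORTED_RESULT_FORMATS = {"PRIDE XML", "mzIdentML"}


@dataclass(frozen=True)
class SubmissionFile:
    """One file in a dataset (hypothetical structure for this sketch)."""
    name: str
    file_type: str         # assumed labels: "RAW", "RESULT", "SEARCH", "PEAK", ...
    file_format: str = ""  # e.g. "mzIdentML"; empty when not relevant


def classify_submission(files: Iterable[SubmissionFile]) -> str:
    """Return "complete" or "partial" following the rules on the slide."""
    files = list(files)
    # Item 1: mass spectrometer output files are always expected.
    if not any(f.file_type == "RAW" for f in files):
        raise ValueError("mass spectrometer output files are missing")
    # Item 2a: result files in PRIDE XML or mzIdentML give a complete submission.
    if any(f.file_type == "RESULT" and f.file_format in SUPPORTED_RESULT_FORMATS
           for f in files):
        return "complete"
    # Item 2b: search engine output kept in its original form gives a partial one.
    if any(f.file_type == "SEARCH" for f in files):
        return "partial"
    raise ValueError("no result or search engine output files found")
```

For example, a dataset holding `run1.raw` plus an mzIdentML result file would be classified as complete, while the same raw file with only a native search engine output file would be classified as partial.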
Current PSI Standard File Formats for MS

PRIDE Components: Submission Process

[Figure: components of the submission process. Labels: PRIDE Converter 2, PRIDE XML, mzIdentML, PRIDE Inspector, PX Submission Tool.]
ProteomeXchange: 1,963 datasets up until 1st April, 2015
Origin (datasets per country):
396 USA
224 Germany
191 United Kingdom
106 Netherlands
105 China
104 France
94 Switzerland
75 Canada
55 Japan
55 Spain
54 Denmark
52 Sweden
50 Belgium
48 Australia
34 Austria
25 Norway
23 Taiwan
22 India
21 Finland
20 Ireland
20 Italy
16 Brazil
15 Russia
14 Republic of Korea
10 Israel
10 Singapore
…

Type:
613 PRIDE complete
1177 PRIDE partial
79 PeptideAtlas/PASSEL complete
69 MassIVE
25 reprocessed

Top species studied by at least 20 datasets:
839 Homo sapiens
232 Mus musculus
79 Arabidopsis thaliana
77 Saccharomyces cerevisiae
44 Rattus norvegicus
35 Escherichia coli
21 Bos taurus
21 Glycine max
~ 460 species in total

Datasets/year:
2012: 102
2013: 527
2014: 963
2015: 371

Data volume:
Total: ~102 TB
Number of all files: ~250,000
PXD000320-324: ~ 5 TB
PXD000065: ~ 1.4 TB

Publicly accessible:
959 datasets, 49% of all
88% PRIDE
9% PASSEL
3% MassIVE
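
As a quick consistency check on the figures on this slide, the per-type and per-year counts can be summed and compared with the headline total, and the public share recomputed. The variable names are chosen for this sketch only; all numbers are copied from the slide.

```python
# Counts copied from the slide (datasets up until 1st April, 2015).
TOTAL_DATASETS = 1963

datasets_by_type = {
    "PRIDE complete": 613,
    "PRIDE partial": 1177,
    "PeptideAtlas/PASSEL complete": 79,
    "MassIVE": 69,
    "reprocessed": 25,
}

datasets_by_year = {2012: 102, 2013: 527, 2014: 963, 2015: 371}

PUBLIC_DATASETS = 959

# Both breakdowns should add up to the headline total.
type_total = sum(datasets_by_type.values())
year_total = sum(datasets_by_year.values())

# Share of datasets that are publicly accessible, as a rounded percentage.
public_share_percent = round(100 * PUBLIC_DATASETS / TOTAL_DATASETS)

print(type_total, year_total, public_share_percent)  # prints: 1963 1963 49
```

Both breakdowns add up to 1,963, and 959 of 1,963 rounds to the 49% quoted on the slide.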

BILS – Bioinformatics Infrastructure for Life Sciences
• Distributed national research infrastructure
supported by the Swedish Research Council
• Coordination with other bioinformatics activities

• BILS provides:
• Bioinformatics support (consultancy)
• Bioinformatics infrastructure (data and tools)
Computing and storage are provided in collaboration with SNIC

• Bioinformatics network
• Nodes at each of the 6 large university cities
• Annual workshop
• Training
• Coordination with other bioinformatics activities
• Swedish node in ELIXIR

Main BILS proteomics support aims
• Data storage:
• Secure
• Long-term
• Metadata
• Automated
• Publishing
• Standardised formats

• Data processing:
• Accessible data processing workflows

Proteios: Software environment for proteomics
A multi-user platform for analysis and management of proteomics data

[Figure: Proteios setup. Labels: web browser access and analysis of own data only; BILS; Scripts; Public access to released raw data.]

Häkkinen et al. (2009) J Proteome Res

EUDAT
• EUDAT aims to contribute to building and operating a
Collaborative Data Infrastructure for European science.
• This involves a suite of co-ordinated and interoperable
services for preserving scientific data, and for making
them accessible to researchers.
• EUDAT collaborates with research communities across a
range of disciplines, from social sciences to
environmental science and including molecular biology (as
represented by ELIXIR).
• These communities have diverse structures, cultures and
scales but also share some common requirements
regarding the management of data.
http://www.eudat.eu
EUDAT services

http://www.eudat.eu
B2SAFE

EUDAT: B2SAFE AND iRODS
• B2SAFE aims to provide a software ecosystem for
persistently available data, including persistent
identification, abstracted data storage, and reliable
automated replication via auditable rules.

• It is built on top of the iRODS data management software
(http://irods.org) and integrates a PID system such as the
European Persistent Identifier Consortium (EPIC,
http://www.pidconsortium.eu) Handle API.
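
The combination of replication, integrity checking and persistent identification described above can be illustrated with a toy model. This sketch does not use iRODS, B2SAFE or the EPIC Handle API: the in-memory `registry`, the `replicate_with_pid` function and the `HANDLE_PREFIX` value are all hypothetical stand-ins, and a local directory copy plays the role of a remote replica.

```python
import hashlib
import shutil
import uuid
from pathlib import Path

HANDLE_PREFIX = "00000"  # hypothetical Handle prefix, for illustration only


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate_with_pid(source: Path, replica_dir: Path, registry: dict) -> str:
    """Copy a file, verify the copy, and record both locations under a new PID."""
    replica_dir.mkdir(parents=True, exist_ok=True)
    replica = replica_dir / source.name
    shutil.copy2(source, replica)

    # The replica only counts if its checksum matches the original.
    checksum = sha256_of(source)
    if sha256_of(replica) != checksum:
        raise IOError(f"replica of {source} is corrupt")

    # Stand-in for registering a persistent identifier with a PID service.
    pid = f"{HANDLE_PREFIX}/{uuid.uuid4()}"
    registry[pid] = {
        "checksum": checksum,
        "locations": [str(source), str(replica)],
    }
    return pid
```

Each call returns a PID that resolves, through the registry, to the checksum and to every known location of the file, which is the kind of per-file bookkeeping a PID system adds on top of plain replication.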

Overview

• PRIDE, ProteomeXchange, BILS and EUDAT

• Objectives of the pilot

• Report on the results

• Perspectives for the future and conclusions

Objective
• To integrate the data repositories for MS proteomics data
run by BILS (Sweden) and ProteomeXchange (via the PRIDE
database, EMBL-EBI, UK), using EUDAT’s B2SAFE software.

Plans at European level

[Figure: planned data flow at European level. Columns: National proteomics centers (Results, Raw Data, Metadata); Central repository (Results, Raw Data, Metadata); Data storage centers (Raw Data, Metadata). Steps: 1.- ELIXIR replication; 2.- EUDAT replication.]


Objective
• To integrate the data repositories for MS proteomics data
run by BILS (Sweden) and ProteomeXchange (via the PRIDE
database, EMBL-EBI, UK), using EUDAT’s B2SAFE software.

• This project will also show the potential of collaboration among
research infrastructures and e-infrastructures to better
manage the data deluge. It will help to evaluate the
requirements of such federated systems.

Overview

• Short intro to PRIDE & ProteomeXchange, BILS and EUDAT

• Objectives of the pilot

• Report on the results

• Perspectives for the future and conclusions

Timeline

• The pilot started when Jani Heikkinen (EUDAT) installed B2SAFE at EMBL-EBI (July 2014).

• The data workflow was defined in September/October 2014.

• Implementation work happened in parallel, with regular weekly calls from January 2015.

• The pilot is now finishing (May 2015).

Envisioned data workflow (September/October 2014)

• Default B2SAFE rules -> trigger replication of data from BILS to EBI

• PIDs assigned per file

Implementation process (1)
• B2SAFE 3.0.0 (including iRODS 3.3.1) was initially
installed at EMBL-EBI.
• However, BILS had moved already to iRODS v4.
• Incompatibility problems were found.
• It was decided to install iRODS 4.0 at the EBI, to solve
the incompatibility issue.
• At the time, B2SAFE was not officially supported with
iRODS version 4.0.3, so changes were necessary to the
original install procedure to accommodate 4.0.3.

Implementation process (2)
• EBI and BILS obtained Handle prefixes and made them available
within EPIC. The integration with iRODS was successfully tested.
• The next step was to configure B2SAFE and achieve a test
replication of a file from BILS to EBI using the B2SAFE PID
creation and file transfer rules.
• Unexpected delays:
• EBI experienced some network issues that affected
communications between the EBI and BILS iRODS.
• Two successive bugs were discovered. Both centered on the
rule execution engine and prevented B2SAFE from functioning.
• These bugs were solved by EUDAT & iRODS developers.

Implementation process (3)
• With workarounds now in place it was possible to manually
trigger a successful replication of a file from BILS to EBI.
• However it became apparent that the authorisation
mechanism employed by iRODS in a federation would make
the proposed submission workflow difficult to manage in a
production environment.
• This means every BILS researcher able to submit data
must have a user created for them on the EBI server first.
Alternative customised solutions could solve this issue by
decoupling the actions of researchers from the replication
itself. However this would inevitably add complexity.
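
The authorisation constraint described above can be modelled in a few lines. This is a toy model rather than iRODS code: the `Zone` class, its methods and the researcher names are hypothetical, and the `user#zone` naming is used here simply as a convenient way to tell accounts from different sites apart.

```python
class Zone:
    """Toy model of one site in a federation (not the iRODS API)."""

    def __init__(self, name: str):
        self.name = name
        self.known_users = set()  # remote accounts registered at this site

    def register_remote_user(self, user: str, home_zone: str) -> None:
        """An administrator creates an account for a user from another site."""
        self.known_users.add(f"{user}#{home_zone}")

    def accepts(self, user: str, home_zone: str) -> bool:
        """A replication triggered by this user succeeds only if they are known."""
        return f"{user}#{home_zone}" in self.known_users


def missing_accounts(submitters, home_zone: str, target: Zone) -> list:
    """List submitters who still need an account created at the target site."""
    return [u for u in submitters if not target.accepts(u, home_zone)]


# Hypothetical example: two BILS researchers, only one registered at EBI.
ebi = Zone("EBI")
ebi.register_remote_user("alice", "BILS")
print(missing_accounts(["alice", "bob"], "BILS", ebi))  # prints: ['bob']
```

The administrative cost grows with the number of submitting researchers, which is why decoupling researchers from the replication step (for example through a single service account) was considered, at the price of extra complexity.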

Implementation process (4)

• At this point (March 2015) the pilot had overrun (it was
expected to last 6 months), with more work required to
integrate the B2SAFE replication process with the PRIDE
submission pipeline.

• It was decided to halt the process and find an alternative
way to achieve the same goals using existing resources.

• A detailed report has been written and has been sent to all the
parties involved.

Implemented alternative solution

• Proteios is able to generate the metadata file needed
for the submission to ProteomeXchange via PRIDE.

• The PX submission tool was extended to support
loading of files not available locally at the moment
of submission (URLs are specified).

• As a proof of concept, dataset PXD002037 was
submitted to PRIDE. It is now publicly available.
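
The idea of mixing local files with files referenced by URL can be sketched as follows. This is an assumption-laden illustration, not the PX submission tool's code or file format: the function names, the accepted URL schemes and the example locations (including the `example.org` host) are all hypothetical.

```python
from urllib.parse import urlparse

# URL schemes treated as remote in this sketch (an assumption).
REMOTE_SCHEMES = {"http", "https", "ftp"}


def is_remote(location: str) -> bool:
    """True when the location is a URL rather than a local path."""
    return urlparse(location).scheme in REMOTE_SCHEMES


def split_locations(locations):
    """Separate files available locally from files to be fetched by URL."""
    local = [loc for loc in locations if not is_remote(loc)]
    remote = [loc for loc in locations if is_remote(loc)]
    return local, remote


# Hypothetical dataset: metadata on disk, raw data served by the data provider.
local, remote = split_locations([
    "/data/project42/submission_metadata.txt",
    "https://example.org/proteios/raw/run1.raw",
])
```

With such a split, the submitting side only needs the metadata locally, while the receiving repository can retrieve the large raw files directly from where they are already stored.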

PX submission tool updated to streamline BILS submissions

Submitted dataset (now publicly available in PRIDE)

Dataset tags in PRIDE Archive

- Datasets can be tagged with different attributes.
- Functionality available in the submission process.
- Stable URLs can be generated.
http://www.ebi.ac.uk/pride/archive/simpleSearch?q=&projectTagFilters=Bioinformatics%20Infrastructure%20for%20Life%20Sciences%20(BILS)%20network%20(Sweden)
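
The stable URL on this slide is the search endpoint plus the percent-encoded tag name. A minimal sketch of how such a URL could be built from a tag is shown below; the function name is hypothetical, and the endpoint and parameter name are simply taken from the URL on the slide.

```python
from urllib.parse import quote

# Endpoint as it appears in the URL on the slide.
SEARCH_ENDPOINT = "http://www.ebi.ac.uk/pride/archive/simpleSearch"


def tag_search_url(tag: str) -> str:
    """Build a tag-filtered search URL, encoding spaces as %20."""
    # Parentheses are kept literal to match the URL shown on the slide.
    encoded_tag = quote(tag, safe="()")
    return f"{SEARCH_ENDPOINT}?q=&projectTagFilters={encoded_tag}"


url = tag_search_url(
    "Bioinformatics Infrastructure for Life Sciences (BILS) network (Sweden)"
)
```

Calling the function with the BILS tag reproduces the URL given on the slide.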

Overview

• Short intro to PRIDE & ProteomeXchange, BILS and EUDAT

• Objectives of the pilot

• Report on the results

• Perspectives for the future and conclusions

At present and in the near future…
• EMBL-EBI is involved in the EUDAT 2020 project (PI is
Steven Newhouse).

• EMBL-EBI will therefore continue to collaborate with EUDAT
to gain experience in the use of this software.

• PRIDE will evaluate the situation in the future to decide whether
the originally envisioned submission pipeline (based on
B2SAFE and iRODS) should be implemented.

Conclusions
• The pilot establishes that the original use case is not the best
application of B2SAFE at the present time. However, the situation will
be kept under review by PRIDE.
• This conclusion is not a reflection on B2SAFE per se; indeed, B2SAFE
and iRODS have been found to be very flexible and are likely to be
interesting candidates for other use cases outside of PRIDE, elsewhere
in EMBL-EBI or ELIXIR.
• In particular, use cases focused on data management within or
between data centres (i.e. bipartite collaborations) or environments
where mature data submission, curation and archiving solutions do not
already exist.
• In addition, we recommend ELIXIR continues to explore EUDAT
services and their relevance in ELIXIR use cases.

Conclusions: Technical recommendations

• Incorporate a fully functional RESTful interface for iRODS
into B2SAFE that a client can use, to avoid installing
iCommands on the client machine.

• The security model should be adapted to allow anonymous
read/write (RW) access to a specified URL.

• If widespread deployment of EUDAT software is expected,
effort must be committed by EUDAT 2020 to make the
software more easily and quickly deployable by ‘ordinary’
system administrators.

Acknowledgements
• Henning Hermjakob
• Steven Newhouse

• Rafael Jimenez

• Bengt Persson

• EUDAT management & developers