Ambiguous HELM Requirements Specification - Phase 1

Document Details

Project | HELM

Status / Version | Version 1.0

Document Approval

Criteria | Role | Name | Date
Author | Project Manager | Claire Bellamy | 24th July 2015
Approver | Project Lead | Sergio Rotstein | 24th July 2015
Approver | Pistoia Alliance Executive Director | John Wise | 24th July 2015
Approver | Technical lead | Tianhong Zhang | 24th July 2015
Approver | HELM Standard Management Group Lead | Matthias Nolte | 24th July 2015
Approver | HELM Standard Management Group Team Members | Jan Holst-Jensen, Roland Knispel, Stefan Klostermann, Sven Neumeyer, Yohann Potier | 24th July 2015

Version Control Details

Version | Status | Date | Author | Description of Change(s)
1.0 | Issued | 24th July 2015 | Claire Bellamy, Jan Holst-Jensen, Roland Knispel, Stefan Klostermann, Sven Neumeyer, Matthias Nolte, Yohann Potier, Tianhong Zhang | First version released to support RFP.

Disclaimer

If you are reading this document on paper, it is an uncontrolled copy and you may not have the latest
version. If you are updating this document, you may not be using the current version. Please refer to
the electronic copy available within the HELM Google Drive for the authoritative version.
Ambiguous HELM Requirements Specification - Phase 1
Version: 1.0
Type: Specification

Table of Contents

1. INTRODUCTION
1.1 Definitions
1.2 Purpose
1.3 Background
1.3.1 HELM Code Current Configuration
1.3.2 HELM Code Work in progress
1.3.3 HELM Code Future state
1.4 Scope
1.4.1 Out of scope
1.4.2 Primary Stakeholders
1.5 References
1.5.1 Background Resources
1.5.2 Documentation
2. GENERAL DESCRIPTION
2.1 Ambiguity - structure types to be represented
2.1.1 Component Ambiguity
2.1.2 Composition Ambiguity
2.1.3 Connection Ambiguity
2.2 Use cases
2.3 Overview
2.3.1 Deliverables
2.4 User Characteristics
2.5 Design Constraints
2.5.1 Ambiguous HELM notation
2.5.2 Architectural constraints
2.6 Assumptions and Dependencies
3. OPERATIONAL REQUIREMENTS
3.1 Functional Requirements
3.1.1 Ambiguity implementation in the HELM Notation Toolkit
3.1.2 Additional HELM Notation Toolkit Functionality
3.1.3 HELM Notation Toolkit Web-Services
3.1.4 Chemical Toolkit Plugin API
3.1.5 Documentation
3.2 Non-Functional requirements
3.2.1 Environment Requirements
3.2.2 Quality Characteristics
3.2.3 Control Function Requirements

1. INTRODUCTION

1.1 Definitions

Term | Description
API | Application programming interface
HAbE | HELM Antibody Editor developed by Roche
HELM | Hierarchical Editing Language for Macromolecules
IS | Information system
URS | User Requirements Specification

1.2 Purpose
The purpose of this User Requirements Specification (URS) is to outline the requirements of an
extension to the HELM standard and its associated tools.

The first purpose of this extension is to enable the representation of ambiguous macromolecules,
meaning macromolecules where not all characteristics of the structure are, or can be, fully specified.
The intention is to enable HELM to capture the structural information that can be specified and to
report what is not known.

The second purpose of this work is to implement a set of APIs (Application Programming Interfaces)
including web-services for the toolkit. These are intended to:

1. Provide a single mechanism via which the toolkit can access the developer's chemical engine
of choice.
2. Abstract the toolkit functionality such that in the future the functionality can be accessed by a
thin client by using the web-services.

The intended audience of this document is the HELM project team members and groups interested
in implementing the requirements.

1.3 Background

The Hierarchical Editing Language for Macromolecules (HELM), an emerging notation standard,
enables the representation of a wide range of biomolecules (e.g. proteins, nucleotides, antibody drug
conjugates) through a hierarchical notation that represents complex macromolecules as polymeric
structures with support for unnatural components (e.g. unnatural amino acids) and chemical
modifications. HELM was created by Pfizer scientists. The Pistoia Alliance formalized the notation as
an open standard in early 2013 and publicly released a modified version of the previously proprietary
software tools to the Open Source community; these tools now serve as the reference implementation of
the HELM standard.

1.3.1 HELM Code Current Configuration

Two major extensions to the original code have been created and published (Figure 1):

• Exchangeable HELM, which enables the user to include the monomer definition with the
  HELM string, thus creating a format that can be used to exchange information between
  organizations.
• HAbE (HELM Antibody Editor). Created by Roche, HAbE enables the recognition and display
  of antibody domains and provides functionality and tools that perform related functions. One
  example is the automatic creation of Cys-Cys bonds.

The current HELM suite has two dependencies on commercial tools:

• yFiles: a graphing package that supports the graphical representation of HELM structures. A
  developer licence is required to work with the code. You do not need a licence to distribute
  your final system.
• MarvinBeans: a chemical engine and sketcher that is used to perform calculations, interpret
  extended SMILES strings and sketch monomers. A licence is required for each HELM editor
  user.

Figure 1: The current HELM suite and its dependencies

1.3.2 HELM Code Work in progress

Prospective HELM adopters have expressed concerns that these dependencies restrict their ability to
use HELM within their organization. To address this in the shorter term, work is being done by third
parties to create new versions of the HELM editor that can access different chemical tools for drawing
and calculation. The Roche HAbE team are also working to include ambiguity in the antibody editor.
By the time the requirements in this RFP are implemented there may therefore be minor changes to
the landscape. However, the project does not control these timescales and cannot guarantee that
this work will be complete.

1.3.3 HELM Code Future state


At present HELM is attracting increasing numbers of adopters, many of whom have requirements
dictated by the need to incorporate HELM into their enterprise environments. Two recurring
requirements are:

• Greater flexibility in accessing the functionality and in the choice of drawing tool/chemical
  engine, so they do not have to purchase a Marvin licence for every HELM editor user.
• The ability to represent ambiguity in macromolecules.

The project has considered these requirements and produced a road-map that divides the work into
two tranches.

1.3.3.1 Phase 1: this document


The first tranche will include the creation of an API to access chemical engines, web-services for
the toolkit, and ambiguity implemented in the toolkit only. This is the work that is specified in this
document.

Figure 2: The HELM suite roadmap

1.3.3.2 Phase 2: Future Work

In phase 2 the project intends to create a new thin client editor which includes the HAbE functionality.
This editor will provide user access to the representation of ambiguity in HELM structures. The
existing editor and HAbE will be retained on GitHub, but will not be developed further. The new editor
will not be dependent on yFiles.

Phase 2 does not form part of these requirements and is included for background information only.

1.4 Scope

The work defined in this specification consists of the following changes to the HELM toolkit:

1. To implement an extension to the HELM notation which allows the definition of structurally
ambiguous molecules. The line notation design specified in Ambiguous HELM Line Notation
Design.doc should be used as the definition for this work.
2. To create a collection of Chemical Toolkit APIs that allow abstracted integration of chemical
engines and the HELM toolkit.
3. To create a collection of web-services that allow access to the HELM toolkit functionality.
4. To document the notation and code changes.

1.4.1 Out of scope


Out of scope activities are:

• Any changes to the current HELM Editor and HAbE.
• Removal of yFiles, since this is part of the HELM editor and HAbE.
• Annotations which are not structurally relevant. Non-structural annotations should be in JSON
  format and enclosed between the 3rd and 4th $; however, the implementation of this does not
  form part of this work.
• Formulations, since these consist of chemically distinct components and HELM is therefore not
  a suitable method to describe them. Formulations are defined as:
   o A composition describing all chemical substances participating in forming the
     carrier/vehicle/capsule. This can include lipid nanoparticles or lipid-polymer hybrid
     nanoparticles. The composition reflects the percentage of each compound as well as
     the structures of the compounds. Typically the envelope is carrying a specific load,
     e.g. a siRNA or mRNA.
• Extension of the functionality beyond the requirements stated in this document.

1.4.2 Primary Stakeholders

The primary stakeholders are:

• The HELM notation management group, who have the responsibility for managing the HELM
  notation.
• The HELM open source dictator, who approves all code merges to the main trunk.
• The wider Pistoia Alliance HELM project team and steering committee.
• HELM adopters, particularly in-house IS teams responsible for implementing and managing
  HELM based systems and vendors supplying HELM compliant systems.
• The end users of HELM compliant systems (typically scientists) who use the systems.

1.5 References
1.5.1 Background Resources
[1] Zhang, T., et al. (2012), HELM: A Hierarchical Notation Language for Complex Biomolecule
Structure Representation, J. Chem. Inf. Model., vol. 52, pp. 2796-2806

http://pubs.acs.org/doi/full/10.1021/ci3001925

[2] HELM website

www.OpenHELM.org

[3] HELM resource centre (contains documentation and links to the code)

https://pistoiaalliance.atlassian.net/wiki/display/PUB/HELM+Resources

[4] Github repository for HELM code

https://github.com/PistoiaHELM

1.5.2 Documentation

[1] HELM notation specification v1.1

https://drive.google.com/file/d/0BybDwk56P1wFZnprdVlDWjI4QzQ/view?usp=sharing

[2] Ambiguous HELM line notation design

Ambiguous HELM Line Notation Design.doc (issued with this RFP)

2. GENERAL DESCRIPTION
2.1 Ambiguity - structure types to be represented
In order to understand the requirements of ambiguous HELM we need to take into account the
different types of ambiguity that could be present. A structure may be fully specified except for
one aspect, in which case there is much useful structural information that can be recorded. Equally,
very little of the structure may be specified, yet there is still some information that is worth
capturing.

To this end the project team has defined three types of ambiguity:

• Component
• Composition
• Connection

Ambiguities can be at the monomer, simple polymer or complex polymer levels.

The following examples illustrate the different ambiguities. These are chosen to represent each type,
but it is possible that a particular structure will include a combination of these types.

2.1.1 Component Ambiguity


Component ambiguity is where the structure of one or more of the components is not fully specified.

Examples

• PEG or bead coupled in a specified and defined manner to a specified and defined position
  on a monomer in a simple polymer.
• Parts of the variable regions of heavy chain and/or light chain for antibodies where the
  monomers in that region are not specified.
• Antibodies that have not been fully sequenced.
• The glycosylation moiety of proteins (if not fully specified as G0, G1, or G2).

2.1.2 Composition Ambiguity


Composition ambiguity is where it is not possible to fully specify the proportion of the components
relative to each other. The connection point may be specified or not. For example:

• mRNA mixtures, where you have an unknown mixture of:
   o two or more alternative residues at a particular position
   o two or more alternative sequences in a particular region
  A ratio may be known for the alternative combinations.
  Example: mRNA synthesized using a mixture of 50% Uridine and 50% Pseudouridine as
  the reagent. This will result in each Uridine position having the possibility of being either
  Uridine or Pseudouridine.

• Undetermined monomer due to the inability of analytical methods to identify that particular
  monomer. In this case only one distinct monomer exists (i.e. it is not a mixture) but the
  monomer cannot be unequivocally identified. The probability of each candidate monomer is
  given, but this should not be confused with the ratios detailed above.
   o Chromatograms from DNA sequencers sometimes fail to identify the nucleotide at a given
     position and show, for example, two significant peaks (instead of one) at that position:
     the distinct nucleotide could be A or G, but the peak for A is a bit lower than for G.
     This is an artifact of the technique, and the probability is e.g. 40% for A and 60% for G.
   o Sequences derived in different versions from different sources (patents, typos,
     unreadable letters), though it is likely that no probability values will be assigned here.
• There is an unknown level of methylation, e.g. methylation on Glycine in Fungi. This applies
  as well to Met oxidation, Trp oxidation, His oxidation, Lys glycation, Asn deamidation and
  Pyr Glu.
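
The distinction drawn above between a mixture ratio and an identification probability can be made
concrete with a small data model. The following is a minimal Java sketch only. The class and member
names (AmbiguousPosition, Kind, Alternative) and the monomer ID "PseudoU" are hypothetical and are
not taken from the HELM toolkit or from the line notation design.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical model of one sequence position with several candidate monomers. */
final class AmbiguousPosition {

    /** RATIO: a real mixture of molecules. PROBABILITY: one molecule of uncertain identity. */
    enum Kind { RATIO, PROBABILITY }

    /** One candidate monomer with an optional weight in percent (null when unknown). */
    static final class Alternative {
        final String monomerId;
        final Double weight;

        Alternative(String monomerId, Double weight) {
            this.monomerId = monomerId;
            this.weight = weight;
        }
    }

    private final Kind kind;
    private final List<Alternative> alternatives = new ArrayList<>();

    AmbiguousPosition(Kind kind) {
        this.kind = kind;
    }

    AmbiguousPosition add(String monomerId, Double weight) {
        if (weight != null && weight < 0) {
            throw new IllegalArgumentException("weight must not be negative");
        }
        alternatives.add(new Alternative(monomerId, weight));
        return this;
    }

    Kind kind() {
        return kind;
    }

    List<Alternative> alternatives() {
        return Collections.unmodifiableList(alternatives);
    }

    /** True when every alternative carries a weight and the weights sum to 100 percent. */
    boolean isFullyQuantified() {
        double total = 0;
        for (Alternative a : alternatives) {
            if (a.weight == null) {
                return false;
            }
            total += a.weight;
        }
        return !alternatives.isEmpty() && Math.abs(total - 100.0) < 1e-9;
    }

    public static void main(String[] args) {
        // mRNA synthesized from 50% Uridine and 50% Pseudouridine: a true mixture.
        AmbiguousPosition mixture = new AmbiguousPosition(Kind.RATIO)
                .add("U", 50.0).add("PseudoU", 50.0);
        // Sequencer peak that could be A (40%) or G (60%): one molecule, uncertain call.
        AmbiguousPosition call = new AmbiguousPosition(Kind.PROBABILITY)
                .add("A", 40.0).add("G", 60.0);
        // Conflicting literature sources: alternatives without any values.
        AmbiguousPosition unknown = new AmbiguousPosition(Kind.PROBABILITY)
                .add("A", null).add("T", null);
        System.out.println(mixture.kind() + " quantified: " + mixture.isFullyQuantified());
        System.out.println(call.kind() + " quantified: " + call.isFullyQuantified());
        System.out.println(unknown.kind() + " quantified: " + unknown.isFullyQuantified());
    }
}
```

Keeping the kind separate from the numeric values is one way to stop a 40%/60% sequencing
probability from being read as a 40:60 mixture.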

2.1.3 Connection Ambiguity


Connection ambiguity is where it is not possible to specify where components are connected to each
other. For example:

• Unknown attachment points between 2 defined simple polymers, e.g. ADC in a specified ratio
  (1:1; 1:1.5; 1:2 etc.)
• Unknown attachment points between 2 defined simple polymers, e.g. ADC in a non-specific
  ratio of one of the types below:
   o no ratio defined
   o ratio given as a decimal number (e.g. 1:2.1)
   o ratio given as an interval (e.g. 1:2.1-2.3)
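
The three ratio cases listed above (no ratio, a decimal number, an interval) could be held in a
single value type. This is an illustrative Java sketch under stated assumptions: the class name
ConnectionRatio and its methods are hypothetical, and the text form produced by toString() is not
the syntax of the ambiguous HELM line notation.

```java
/** Hypothetical value type for the "1:x" ratio attached to an ambiguous connection. */
final class ConnectionRatio {
    private final Double lower; // null when no ratio is defined
    private final Double upper; // equal to lower for an exact ratio

    private ConnectionRatio(Double lower, Double upper) {
        this.lower = lower;
        this.upper = upper;
    }

    /** Case "no ratio defined". */
    static ConnectionRatio undefined() {
        return new ConnectionRatio(null, null);
    }

    /** Case "ratio given as a decimal number", e.g. 1:2.1. */
    static ConnectionRatio exact(double value) {
        return interval(value, value);
    }

    /** Case "ratio given as an interval", e.g. 1:2.1-2.3. */
    static ConnectionRatio interval(double lower, double upper) {
        // Written with negations so that NaN is rejected as well.
        if (!(lower > 0) || !(upper >= lower)) {
            throw new IllegalArgumentException("expected 0 < lower <= upper");
        }
        return new ConnectionRatio(lower, upper);
    }

    boolean isDefined() {
        return lower != null;
    }

    boolean isInterval() {
        return isDefined() && !lower.equals(upper);
    }

    /** True when an observed ratio 1:value is compatible with this description. */
    boolean allows(double value) {
        return !isDefined() || (value >= lower && value <= upper);
    }

    @Override
    public String toString() {
        if (!isDefined()) {
            return "no ratio defined";
        }
        return isInterval() ? "1:" + lower + "-" + upper : "1:" + lower;
    }

    public static void main(String[] args) {
        System.out.println(ConnectionRatio.undefined());
        System.out.println(ConnectionRatio.exact(2.1));
        System.out.println(ConnectionRatio.interval(2.1, 2.3));
    }
}
```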

2.2 Use cases

There is no user interface component to this work, so no user orientated cases can be created.

Examples of the types of molecule to be represented are given in the general description section.

2.3 Overview
HELM is a notation. The definition and specification of the standard are documented in the HELM
notation specification V1.1.

Supporting the notation is the HELM software suite, which consists of the following components:

• The HELM Toolkit: The HELM Toolkit contains the functionality needed to implement a HELM-
  based system, enabling reading, analysis, and manipulation of HELM objects, as well as some
  monomer management. It is written in Java and delivered as a .jar file.
• The HELM Editor: The HELM Editor is a tool that enables the user to visualize and edit HELM
  molecules. It is dependent on the HELM toolkit. The latest version of the HELM editor can be
  found on http://pistoiahelm.github.io/. It is written in Java.
• HAbE (HELM Antibody Editor): A tool that analyses antibody structures, displays them at a
  domain level and allows specific actions to be taken such as connecting free Cys residues.

Currently the software is available as open source code on GitHub and in two compiled forms: Java
web start and an applet. Both are available from https://github.com/PistoiaHELM.

2.3.1 Deliverables
The output of this work shall consist of the following:

1. Code updates to the HELM toolkit to include ambiguity, add the additional functions and create
the HELM notation toolkit API.

2. New web-services for the toolkit.

3. New code to create the chemical toolkit API and two implementations: one using MarvinBeans
and the other using a free chemistry engine of the developer's choice.

4. Updates to the HELM specification document to include the ambiguity extension definition.

5. Release notes (it is acceptable for these to be generated from the code).

2.4 User Characteristics

As this work package only concerns the HELM toolkit, users do not directly interact with the
functionality. Therefore no users are identified in this section.

2.5 Design Constraints

2.5.1 Ambiguous HELM notation


The representation of ambiguity has been discussed extensively within the HELM standard
management group. A line notation which incorporates these ideas has been designed and is
available in Ambiguous HELM Line Notation Design.doc. This design should be used in any
implementation. Although changes may be made if limitations are found during development, all
changes must be agreed with the HELM Standard Management Group.

2.5.2 Architectural constraints

The following technologies shall be used (as applicable)


• Maven Archetype
• Tomcat
• Spring 4 (application context, dependency injection)
• Spring Data, JPA 2.1/Hibernate 4.3 (persistence)
• JAX-RS 2.0/Apache CXF 3.0 (stateless REST services)
• Spring Security
• TestNG + HSQLDB for in-memory tests
• Jenkins + CID custom tooling (CI build/deploy to Tomcat)

The final solution shall contain no dependencies on third party tools that require paid licences.

Any changes to the direct toolkit functionality shall be backwardly compatible.

2.6 Assumptions and Dependencies

It is assumed that respondents will modify the existing HELM code and not start from scratch.

The HELM editor is dependent on the toolkit. This RFP details changes to the toolkit, but it must be
possible to use the new toolkit code with the existing Java editor without further changes to the editor.
The functionality to handle ambiguity will not be accessible from the editor since the UI has no way of
entering it or representing it.

HELM is a live standard and, as such, any changes will affect its current users and should be made
backwardly compatible as far as possible. HELM is an open standard and it is not possible for the
project to be aware of the details of every group's use of HELM, so there will be unknown
dependencies. All development should aim to minimize disruption to existing users.

3. OPERATIONAL REQUIREMENTS

3.1 Functional Requirements

Key to priorities

E = Essential. The system must fulfil this requirement or it fails in a fundamental way and
cannot be deployed.
H = High. A requirement of high importance.
M = Medium. A requirement of medium importance.
L = Low. A requirement of low importance.

3.1.1 Ambiguity implementation in the HELM Notation Toolkit

Req # | Requirement | Priority (E,H,M,L)
FRS100 | The notation changes described in Ambiguous HELM Line Notation Design.doc must be implemented in the HELM notation toolkit. This includes all functions required to write and interpret ambiguity as defined. | E
FRS110 | The implementation shall be backwardly compatible with the current version of the toolkit. | E
FRS120 | There shall be a single mechanism to pass in original HELM and ambiguous HELM notation. | E
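
One possible reading of the single-mechanism requirement (FRS120) is a single entry point that
accepts either notation and works out which one it was given. The Java sketch below is
illustrative only. The class name HelmInput is hypothetical, and the detection rule is an
ASSUMPTION made for this example: that an ambiguous HELM string carries a version tag such as
"V2.0" after its final $, while an original HELM string ends with $. The actual rule is whatever
Ambiguous HELM Line Notation Design.doc defines.

```java
/** Hypothetical single entry point that accepts both notation versions. */
final class HelmInput {

    enum Version { HELM_1, HELM_2 }

    final Version version;
    final String notation;

    private HelmInput(Version version, String notation) {
        this.version = version;
        this.notation = notation;
    }

    /**
     * Classifies the input string.
     * ASSUMPTION: ambiguous HELM ends with a version tag starting with "V2" after the
     * last '$'; original HELM has nothing after the last '$'.
     */
    static HelmInput of(String notation) {
        if (notation == null || notation.trim().isEmpty()) {
            throw new IllegalArgumentException("HELM notation must not be empty");
        }
        String text = notation.trim();
        int lastDollar = text.lastIndexOf('$');
        if (lastDollar < 0) {
            throw new IllegalArgumentException("no '$' section separator found");
        }
        String tail = text.substring(lastDollar + 1);
        if (tail.isEmpty()) {
            return new HelmInput(Version.HELM_1, text);
        }
        if (tail.startsWith("V2")) {
            return new HelmInput(Version.HELM_2, text);
        }
        throw new IllegalArgumentException("unrecognised version tag: " + tail);
    }

    public static void main(String[] args) {
        // Callers use the same method for both kinds of input.
        System.out.println(HelmInput.of("PEPTIDE1{A.G.C}$$$$").version);
        System.out.println(HelmInput.of("PEPTIDE1{A.G.C}$$$$V2.0").version);
    }
}
```

Routing both notations through one method would also give a natural place to check backward
compatibility (FRS110), since every existing HELM string passes through the same call.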

3.1.2 Additional HELM Notation Toolkit Functionality

Req # | Requirement | Priority (E,H,M,L)
FRS400 | A new function shall be created in the HELM toolkit that: generates the image of monomers; generates the image of the full HELM molecule. Currently this function is present in the HELM editor, but it should be moved to the HELM notation toolkit. | H

FRS410 | There shall be a new function that performs the following monomer management actions: Store; Retrieve; Submit; Edit; Reconcile monomers (exchangeable HELM). Note that this service will initially link to the local HELM monomer XML file, not a centralised store. | E
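
The five monomer management actions named in FRS410 can be pictured as operations on a simple
store. The following Java sketch is hypothetical throughout: MonomerStore and Monomer are invented
names, an in-memory map stands in for the local monomer XML file, and the behaviour given to
"Submit" and "Reconcile" is an ASSUMPTION, since this specification names the actions without
defining them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for the local monomer XML file. */
final class MonomerStore {

    /** Minimal monomer: an ID within a polymer type plus a SMILES string. */
    static final class Monomer {
        final String polymerType;
        final String id;
        final String smiles;

        Monomer(String polymerType, String id, String smiles) {
            this.polymerType = polymerType;
            this.id = id;
            this.smiles = smiles;
        }

        String key() {
            return polymerType + ":" + id;
        }
    }

    private final Map<String, Monomer> monomers = new LinkedHashMap<>();
    private final List<Monomer> submitted = new ArrayList<>();

    /** Store: add a new monomer, refusing to overwrite an existing one. */
    void store(Monomer m) {
        if (monomers.containsKey(m.key())) {
            throw new IllegalStateException("monomer already exists: " + m.key());
        }
        monomers.put(m.key(), m);
    }

    /** Retrieve: look a monomer up, returning null when it is unknown. */
    Monomer retrieve(String polymerType, String id) {
        return monomers.get(polymerType + ":" + id);
    }

    /** Submit (ASSUMED meaning): queue a monomer for later review without storing it. */
    void submit(Monomer m) {
        submitted.add(m);
    }

    /** Edit: replace the definition of an existing monomer. */
    void edit(Monomer m) {
        if (!monomers.containsKey(m.key())) {
            throw new IllegalStateException("unknown monomer: " + m.key());
        }
        monomers.put(m.key(), m);
    }

    /** Reconcile (ASSUMED meaning): list incoming monomers that are missing or differ locally. */
    List<String> reconcile(List<Monomer> incoming) {
        List<String> conflicts = new ArrayList<>();
        for (Monomer m : incoming) {
            Monomer local = monomers.get(m.key());
            if (local == null || !local.smiles.equals(m.smiles)) {
                conflicts.add(m.key());
            }
        }
        return conflicts;
    }

    public static void main(String[] args) {
        MonomerStore store = new MonomerStore();
        store.store(new Monomer("PEPTIDE", "G", "NCC(=O)O"));
        // Monomers arriving inline with an exchangeable HELM string.
        List<Monomer> incoming = Arrays.asList(
                new Monomer("PEPTIDE", "G", "NCC(=O)O"),
                new Monomer("PEPTIDE", "A", "CC(N)C(=O)O"));
        System.out.println(store.reconcile(incoming));
    }
}
```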

3.1.3 HELM Notation Toolkit Web-Services


The following web-services shall be created to expose existing toolkit functionality.

The requirements must be implemented for both the current HELM specification (HELM 1.1) and
ambiguous HELM (HELM2.0) unless otherwise specified.

Req # | Requirement | Priority (E,H,M,L)
FRS300 | There shall be a service that checks the input HELM string for: conformance to the specification; availability of monomers in the current monomer database. | E
FRS310 | There shall be a service that performs the following monomer management actions: Store; Retrieve; Submit; Edit; Reconcile monomers (exchangeable HELM). | E
FRS330 | There shall be a service that converts standard HELM into canonical HELM and vice versa. HELM v1.1 structures only. | E
FRS340 | There shall be a service that calculates the following from a non-ambiguous HELM string: Molecular weight; Molecular formula; Extinction coefficient. HELM v1.1 structures only. | E

FRS350 | There shall be a service that renders an image of the molecule from the HELM string: image generation of the atom/bond representation of monomers; image generation of the atom/bond representation of the HELM molecule. HELM v1.1 structures only. | E
FRS360 | There shall be a service that converts: HELM string to natural analogue sequence (and vice versa); HELM string to FASTA (and vice versa). HELM v1.1 structures only. | E
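
To make the conversion in FRS360 concrete, the Java sketch below turns one simple peptide polymer
into a natural analogue sequence and a FASTA record. It is a simplified illustration, not the
toolkit's implementation. The class name, the small analogue lookup table and the use of "X" for
unknown monomers are assumptions made for this example, and the parser only handles a single
polymer of the form ID{monomer.monomer...} with no inline SMILES or ambiguity.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative conversion of one simple polymer to its natural analogue sequence. */
final class NaturalAnalogueSketch {

    /** Hypothetical lookup: modified monomer ID to natural analogue letter. */
    private static final Map<String, String> ANALOGUES = new HashMap<>();
    static {
        // Illustrative entry only; a real service would read the monomer library.
        ANALOGUES.put("meA", "A");
    }

    /** Converts e.g. "PEPTIDE1{A.[meA].C}" to "AAC". */
    static String toSequence(String simplePolymer) {
        int open = simplePolymer.indexOf('{');
        int close = simplePolymer.lastIndexOf('}');
        if (open < 0 || close < open) {
            throw new IllegalArgumentException("expected the form ID{monomer.monomer}");
        }
        StringBuilder sequence = new StringBuilder();
        for (String token : simplePolymer.substring(open + 1, close).split("\\.")) {
            // Multi-letter monomer IDs are wrapped in square brackets.
            String id = token.startsWith("[") && token.endsWith("]")
                    ? token.substring(1, token.length() - 1)
                    : token;
            if (id.length() == 1) {
                sequence.append(id);
            } else {
                String analogue = ANALOGUES.get(id);
                // ASSUMPTION: unknown monomers are written as X.
                sequence.append(analogue != null ? analogue : "X");
            }
        }
        return sequence.toString();
    }

    /** Wraps the sequence in a FASTA record that uses the polymer ID as its header. */
    static String toFasta(String simplePolymer) {
        String sequence = toSequence(simplePolymer); // validates the input first
        String name = simplePolymer.substring(0, simplePolymer.indexOf('{'));
        return ">" + name + "\n" + sequence;
    }

    public static void main(String[] args) {
        System.out.println(toSequence("PEPTIDE1{A.[meA].C}"));
        System.out.println(toFasta("PEPTIDE1{G.K}"));
    }
}
```

The reverse direction loses information: a sequence such as AAC cannot say whether the second
residue was originally A or a modified monomer, so "vice versa" can only produce natural monomers.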

3.1.4 Chemical Toolkit Plugin API

Req # | Requirement | Priority (E,H,M,L)
FRS200 | There shall be an API through which the HELM notation toolkit can access the chemical toolkit of the user's choice. The following calls shall be supported: SMILES validation; SMILES/MolFile import/conversion; HELM to chemical structure conversion; Molecular weight; Molecular formula; Canonicalization of ad-hoc chemical modifications; Specification of chemical attachment points on monomer structures; Molecule manipulation (bond breaking and forming). | E
FRS220 | A new call must be added to the chemical function API that performs: image generation of the atom/bond representation of monomers; image generation of the atom/bond representation of the HELM molecule. | M

FRS210 | The API layer must be tested against: a free tool of your choice (suggestions include RDKit or CDK); Marvin. The exact tool and version to be agreed with the project team prior to starting development work. | E
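
The plugin idea behind these requirements is that the notation toolkit calls an interface and
never a specific chemistry engine. The Java sketch below illustrates that shape only. Every name
in it (ChemicalToolkit, ToyToolkit, ChemicalToolkits) is hypothetical, the interface covers just
two of the calls listed in FRS200, and ToyToolkit is a deliberately trivial stand-in rather than an
adapter for Marvin, CDK or RDKit.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical abstraction; a real version would cover every call listed in FRS200. */
interface ChemicalToolkit {
    boolean isValidSmiles(String smiles);

    double molecularWeight(String molecularFormula);
}

/** Toy engine standing in for an adapter around a real chemistry library. */
final class ToyToolkit implements ChemicalToolkit {
    private static final Map<String, Double> ATOMIC_WEIGHTS = new HashMap<>();
    static {
        ATOMIC_WEIGHTS.put("C", 12.011);
        ATOMIC_WEIGHTS.put("H", 1.008);
        ATOMIC_WEIGHTS.put("N", 14.007);
        ATOMIC_WEIGHTS.put("O", 15.999);
        ATOMIC_WEIGHTS.put("S", 32.06);
    }
    private static final Pattern ELEMENT = Pattern.compile("([A-Z][a-z]?)(\\d*)");

    @Override
    public boolean isValidSmiles(String smiles) {
        // Placeholder check only: balanced round brackets. A real engine parses the string.
        int depth = 0;
        for (char c : smiles.toCharArray()) {
            if (c == '(') {
                depth++;
            }
            if (c == ')' && --depth < 0) {
                return false;
            }
        }
        return depth == 0 && !smiles.isEmpty();
    }

    @Override
    public double molecularWeight(String molecularFormula) {
        Matcher m = ELEMENT.matcher(molecularFormula);
        double total = 0;
        int consumed = 0;
        while (m.find()) {
            if (m.start() != consumed) {
                throw new IllegalArgumentException("unexpected text in formula");
            }
            Double weight = ATOMIC_WEIGHTS.get(m.group(1));
            if (weight == null) {
                throw new IllegalArgumentException("unsupported element: " + m.group(1));
            }
            int count = m.group(2).isEmpty() ? 1 : Integer.parseInt(m.group(2));
            total += weight * count;
            consumed = m.end();
        }
        if (consumed == 0 || consumed != molecularFormula.length()) {
            throw new IllegalArgumentException("could not read formula: " + molecularFormula);
        }
        return total;
    }
}

/** Hypothetical registry through which the notation toolkit picks an engine by name. */
final class ChemicalToolkits {
    private static final Map<String, ChemicalToolkit> REGISTERED = new HashMap<>();

    static void register(String name, ChemicalToolkit toolkit) {
        REGISTERED.put(name, toolkit);
    }

    static ChemicalToolkit byName(String name) {
        ChemicalToolkit toolkit = REGISTERED.get(name);
        if (toolkit == null) {
            throw new IllegalArgumentException("no chemical toolkit registered as " + name);
        }
        return toolkit;
    }

    public static void main(String[] args) {
        register("toy", new ToyToolkit());
        ChemicalToolkit engine = byName("toy"); // calling code sees only the interface
        System.out.println(engine.isValidSmiles("NCC(=O)O"));
        System.out.println(engine.molecularWeight("C2H5NO2")); // glycine
    }
}
```

With this shape, testing the API layer against two engines (FRS210) amounts to running the same
calls against two registered implementations.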

3.1.5 Documentation
The following documents shall be created:

Req # | Requirement | Priority (E,H,M,L)
FRS500 | A user guide as Word and PDF documents. The style must follow that of the current HELM Editor User Guide. | H
FRS510 | A document specifying the new web-services. | H

3.2 Non-Functional requirements


3.2.1 Environment Requirements
Req # | Requirement | Priority (E,H,M,L)
NFR100 | The following technologies shall be used (as applicable): Java; Maven Archetype; Tomcat; Spring 4 (application context, dependency injection); Spring Data, JPA 2.1/Hibernate 4.3 (persistence); JAX-RS 2.0/Apache CXF 3.0 (stateless REST services); Spring Security; TestNG + HSQLDB for in-memory tests; Jenkins + CID custom tooling (CI build/deploy to Tomcat). | E
NFR110 | The final code shall be delivered as a pull request which includes the new code as a fork from the HELM master branch in GitHub. | E
NFR120 | It is desirable that GitHub is used as a code repository during development. | L
NFR130 | The final solution shall contain no additional dependencies on third party tools that require paid licences for commercial usage. | E

3.2.2 Quality Characteristics


Req # | Requirement | Priority (E,H,M,L)
NFR200 | The proposed architecture must pass a review with the HELM project team before coding starts. | E
NFR210 | Code should be written in accordance with normal good practice and developers should write their own internal unit tests. | E
NFR220 | All code must pass regular code reviews with the HELM project team for good software engineering practice. A minimum of two reviews will be held; the first will be conducted no later than one third of the way through the planned development period. | E
NFR230 | Code must be self-documented through appropriate use of comments. Comments must be sufficient to create Javadoc reports that are self-explanatory. | E

NFR240 | The product must be of acceptable quality, as defined by an agreed maximum acceptable number of open defects of each severity categorisation (Critical, High, Medium, Low) at close of UAT. The following definitions will be used for these categories. Critical: prevents any use of the editor or prevents use of fundamental parts of the toolkit. High: a defect that is frequently encountered by normal use of the software and which prevents use of that functionality. Medium: a defect that is encountered by normal use of the software and which prevents use of an aspect of that functionality. Low: a defect that is rarely encountered, or does not prevent normal use of the software. | E
NFR250 | There must be a mechanism by which the project can automatically execute and validate test cases and report success/failure. | E

3.2.3 Control Function Requirements


Req # | Requirement | Priority (E,H,M,L)
NFR300 | All error messages must be easily understandable and consist of language that a non-IS/IT scientist can understand, and must also accurately indicate the nature of the problem. For example, low-level system error codes and messages should not be displayed to a user. An error log should be created that enables root cause analysis to be carried out. | H
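
NFR300 separates what the user sees from what is logged. A minimal Java sketch of that split is
shown below, using java.util.logging from the JDK. The exception class HelmUserException, the
method parseRepeatCount and the notion of a "repeat count" input are hypothetical and serve only
to illustrate the pattern.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical exception type whose message is written for scientists, not developers. */
final class HelmUserException extends RuntimeException {
    HelmUserException(String userMessage, Throwable cause) {
        super(userMessage, cause);
    }
}

final class ErrorReportingSketch {
    private static final Logger LOG =
            Logger.getLogger(ErrorReportingSketch.class.getName());

    /** Parses a whole number, translating low-level failures into plain language. */
    static int parseRepeatCount(String text) {
        if (text == null) {
            throw new HelmUserException("No repeat count was provided.", null);
        }
        try {
            return Integer.parseInt(text.trim());
        } catch (NumberFormatException e) {
            // Full technical detail, including the stack trace, goes to the error log
            // so that root cause analysis remains possible.
            LOG.log(Level.SEVERE, "Could not parse repeat count from input '" + text + "'", e);
            // The caller receives a message with no error codes or stack traces.
            throw new HelmUserException(
                    "The repeat count '" + text + "' is not a whole number. "
                            + "Please enter a value such as 3.", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseRepeatCount(" 3 "));
        try {
            parseRepeatCount("three");
        } catch (HelmUserException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Keeping the original exception as the cause means the log and the user message can always be
traced back to the same event.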
