
A context sensitive model for querying linked scientific data

Peter Ansell
Bachelor of Science/Bachelor of Business (Biomedical Science and IT)
Avondale College
December 2005

Bachelor of IT (Hons., IIA)


Queensland University of Technology
December 2006

A thesis submitted in partial fulfilment of the requirements for the degree of


Doctor of Philosophy
November, 2011

Principal Supervisor: Professor Paul Roe


Associate Supervisor: A/Prof James Hogan

School of Information Technology
Faculty of Science and Technology
Queensland University of Technology
Brisbane, Queensland, AUSTRALIA

© Copyright by Peter Ansell 2011. All Rights Reserved.
The author hereby grants permission to the Queensland University of Technology to
reproduce and redistribute publicly paper and electronic copies of this thesis document in
whole or in part.
Keywords: Semantic web, RDF, Distributed databases, Linked Data

Abstract

This thesis provides a query model suitable for context sensitive access to a wide range
of distributed linked datasets which are available to scientists using the Internet. The
model is designed based on scientific research standards which require scientists to provide replicable methods in their publications. Although there are query models available
that provide limited replicability, they do not contextualise the process whereby different
scientists select dataset locations based on their trust and physical location. In different
contexts, scientists need to perform different data cleaning actions, independent of the
overall query, and the model was designed to accommodate this function. The query
model was implemented as a prototype web application and its features were verified
through its use as the engine behind a major scientific data access site, Bio2RDF.org.
The prototype showed that it was possible to have context sensitive behaviour for each
of the three mirrors of Bio2RDF.org using a single set of configuration settings. The
prototype provided executable query provenance that could be attached to scientific
publications to fulfil replicability requirements. The model was designed to make it
simple to independently interpret and execute the query provenance documents using
context specific profiles, without modifying the original provenance documents. Experiments using the prototype as the data access tool in workflow management systems
confirmed that the design of the model made it possible to replicate results in different
contexts with minimal additions, and no deletions, to query provenance documents.

Contents

Acknowledgements xvii

1 Introduction 1
1.1 Scientific data 5
1.1.1 Distributed data 8
1.1.2 Science example 9
1.2 Problems 14
1.2.1 Data quality 15
1.2.2 Data trust 17
1.2.3 Context sensitivity and replication 20
1.3 Research questions 21
1.4 Thesis contributions 21
1.5 Publications 21
1.6 Research artifacts 22
1.7 Thesis outline 23

2 Related work 25
2.1 Overview 25
2.2 Early Internet and World Wide Web 30
2.2.1 Linked documents 31
2.3 Dynamic web services 31
2.3.1 SOAP Web Service based workflows 32
2.4 Scientific data formats 33
2.5 Semantic web 35
2.5.1 Linked Data 36
2.5.2 SPARQL 37
2.5.3 Logic 38
2.6 Conversion of scientific datasets to RDF 39
2.7 Custom distributed scientific query applications 40
2.8 Federated queries 43

3 Model 47
3.1 Overview 47
3.2 Query types 52
3.2.1 Query groups 57
3.3 Providers 58
3.3.1 Provider groups 59
3.3.2 Endpoint implementation independence 60
3.4 Normalisation rules 61
3.4.1 URI compatibility 64
3.5 Namespaces 66
3.5.1 Contextual namespace identification 67
3.6 Model configurability 68
3.6.1 Profiles 68
3.6.2 Multi-dataset locations 69
3.7 Formal model algorithm 70
3.7.1 Formal model specification 72

4 Integration with scientific processes 75
4.1 Overview 75
4.2 Data exploration 76
4.2.1 Annotating data 77
4.3 Integrating Science and Medicine 78
4.4 Case study: Isocarboxazid 79
4.4.1 Use of model features 84
4.5 Web Service integration 89
4.6 Workflow integration 90
4.7 Case Study: Workflow integration 92
4.7.1 Use of model features 95
4.7.2 Discussion 97
4.8 Auditing semantic workflows 98
4.9 Peer review 101
4.10 Data-based publication and analysis 102

5 Prototype 107
5.1 Overview 107
5.2 Use on the Bio2RDF website 108
5.3 Configuration schema 109
5.4 Query type 112
5.4.1 Template variables 115
5.4.2 Static RDF statements 118
5.5 Provider 119
5.6 Namespace 119
5.7 Normalisation rule 121
5.7.1 Integration with other Linked Data 124
5.7.2 Normalisation rule testing 126
5.8 Profiles 126

6 Discussion 129
6.1 Overview 129
6.2 Model 130
6.2.1 Context sensitivity 131
6.2.2 Data quality 133
6.2.3 Data trust 137
6.2.4 Provenance 139
6.3 Prototype 141
6.3.1 Context sensitivity 141
6.3.2 Data quality 144
6.3.3 Data trust 150
6.3.4 Provenance 153
6.4 Distributed network issues 155
6.5 Prototype statistics 158
6.6 Comparison to other systems 162

7 Conclusion 165
7.1 Critical reflection 167
7.1.1 Model design review 167
7.1.2 Prototype implementation review 170
7.1.3 Configuration maintenance 171
7.1.4 Trust metrics 172
7.1.5 Replicable queries 173
7.1.6 Prominent data quality issues 173
7.2 Future research 175
7.3 Future model and prototype extensions 178
7.4 Scientific publication changes 181

A Glossary 185

B Common Ontologies 187

List of Figures

1.1 Example Scientific Method 3
1.2 Flow of information in living cells 4
1.3 Integrating scientific knowledge 5
1.4 Use of information by scientists 6
1.5 Storing scientific data 7
1.6 Data issues 9
1.7 Datasets related to flow of information in living cells 10
1.8 Different methods of referencing Uniprot in Entrez Gene dataset 11
1.9 Different methods of referencing Entrez Gene in Uniprot dataset 12
1.10 Different methods of referencing Entrez Gene and Uniprot in HGNC dataset 12
1.11 Indirect links between DailyMed, Drugbank, and KEGG 13
1.12 Direct links between DailyMed, Drugbank, and KEGG 14
1.13 Bio2RDF website 23

2.1 Linked Data Naive Querying 27
2.2 Non Symmetric Linked Data 28
2.3 Distributed data access versus local data silo 29
2.4 Heterogeneous datasets 34
2.5 Linked Data access methods 46
2.6 Using Linked Data to retrieve information 46

3.1 Comparison of model to federated RDF query models 49
3.2 Query parameters 51
3.3 Example: Search for Marplan in Drugbank 51
3.4 Search for all references to disease 54
3.5 Search for references to disease in a particular namespace 55
3.6 Search for references to disease using a new query type 56
3.7 Search for all references in a local curated dataset 57
3.8 Query groups 58
3.9 Providers 59
3.10 Provider groups 60
3.11 Normalisation rules 62
3.12 Single query template across homogeneous datasets 65

4.1 Medicine related RDF datasets 80
4.2 Links between datasets in Isocarboxazid case study 81
4.3 Integration of prototype with Semantic Web Pipes 93
4.4 Semantic tagging process 100
4.5 Peer review process 101
4.6 Scientific publication cycle 103

5.1 URL resolution using prototype 110
5.2 Simple system configuration in Turtle RDF file format 111
5.3 Optimising a query based on context 113
5.4 Public namespaces and private identifiers 114
5.5 Template parameters 116
5.6 Uses of static RDF statements 118
5.7 Context sensitive use of providers by prototype 120
5.8 Namespace overlap between configurations 122

6.1 Enforcing syntactic data quality 136
6.2 Different paging strategies 145

7.1 Overall solutions 168

List of Tables

2.1 Bio2RDF dataset sizes 43

3.1 Normalisation rule stages 63

5.1 Normalisation rule methods in prototype 123

6.1 Context change scenarios 132
6.2 Comparison of data quality abilities in different systems 150
6.3 Comparison of data trust capabilities of various systems 153
6.4 Bio2RDF prototype statistics 161

Statement of Original Authorship

The work contained in this thesis has not been previously submitted to meet
requirements for an award at this or any other higher education institution.
To the best of my knowledge and belief, the thesis contains no material pre-
viously published or written by another person except where due reference is
made.

Signature:
Peter Ansell

Date:

Acknowledgements

I dedicate this thesis to my wife, Karina Ansell. She has been very supportive of my
research throughout my candidature. I would also like to thank my supervisors for the
effort they put into guiding me through the process.

Chapter 1

Introduction

There are many different ways of sharing information using current communications
technology. A business may write a marketing letter to a customer, or they may email
the information with a link to their website. A grandparent may send a birthday
card to their grandchild, or they may write a note on a social networking website.
A politician may make a speech on television, or they may self-publish a video on a
video-sharing website. A scientist may publish an article in a paper based journal,
while simultaneously publishing it on a website. In each case, the information needs
to be shared with others to have an effect. In some cases, the information needs to
be processed by humans, while in other cases computers must be able to process and
integrate the data. In all cases, the data must be accessible by the interested parties.
In a social network a person may want to plan and share details about an upcoming
event. For example, they may want to plan a menu based on the expected guests at
a dinner party. If the social network made it possible for people to share their meal
preferences with their friends, the host may be able to use this data to determine which
dishes would be acceptable. Although the host could print out a list of meal preferences
from the social networking site and compare them to each of their recipes, it would be
more efficient for the host to automatically derive a list of recipes that match the guests'
criteria. For the process to work effectively, the preferences on the social networking site
must be matched with the ingredients on the recipe website.
Business and social networks do not require their information to be publicly available
for it to be useful, especially as the privacy of information is paramount in attracting
users. In the dinner party example, the guests do not need to publicly disclose prefer-
ences, and the recipe website does not require information about each guest individually.
In order for the menu planning system to work, it needs the guests to include informa-
tion about which ingredients and recipes they like and dislike. This information is then
accessible to their social networked friends, who can use the recipe site without their
friends having to be members.
Social networks and business-to-business networks rely heavily on data integration,
for example, the foods on the social networking site need to be matched to the meal
ingredients on the recipe site. However, in both cases there are a small number of
types of data to represent. For example, social networks may only need to be able to

represent people or agents, and networks or groups, while businesses may only need
to exchange information about suppliers, customers and stock. These conditions may
have contributed to the lack of a large set of interlinked data available about business or
social networks, as the low level of complexity makes it possible for humans to examine
and manually link datasets as needed. The dinner party example is not functional at
this time, as there are currently no datasets available with information about recipe
ingredients that have been used by social networks to create meal preference profiles.
However, both social networks and businesses may be encouraged to publish more information in the future based on schemas such as FOAF [57] and GoodRelations [68], including the recently published LinkedOpenCommerce site1.
Networks are also important for scientists as they provide a fast and effective way for
them to publicly provide and exchange all of the relevant data for their publications with
other scientists in order for their data and results to be peer reviewed and replicated [51].
They perform experiments using data from public and private datasets to derive new
results using variations of the scientific method, possibly similar to those described by
Kuhn [81], Crawford and Stucki [43], or Godfrey-Smith [54]. An example of a workflow
describing one of the many scientific methods is shown in Figure 1.1.
In some sciences, such as biology and chemistry, there are a large number of different
sources of curated data that contain links between datasets, as compared to other
sciences where there may be a limited number of sources of data for scientists to integrate
and crosscheck their experiments against. In cellular biology scientists need to integrate
various types of data, including data about genetic material, genes, translated genes,
proteins, regulatory networks, protein functions, and others, as shown in Figure 1.2.
This data is generally distributed across a number of datasets that focus either on a
particular disease or organism, such as cancer [120] or flies [150], or on a particular type
of information, such as proteins [19].
These biological datasets contain data from many different scientists and it is important that it is accurately described and regularly updated. Scientists continually update
these datasets using new data along with links to relevant data items in other datasets.
For example, Figure 1.3 shows the changes necessary to reflect the new discovery that
a gene found in humans is effectively identical to a gene in mice.
There are a wide range of ways that well linked data can be used by scientists,
including the motivation and hypothesis for an experiment, during the design of the
experiment, and in the publication of the results, as shown in Figure 1.4. For example,
the scientist would use the fact that the gene was equivalent in humans and mice as
the motivation, with the hypothesis that the gene may cause cancer in mice. The
characteristics of the cancer, including the fact that it occurs in bones would be used as
part of the experiment design, and the results would contain the implication that the
gene in mice causes cancer. The combined knowledge about the genes, the cancer, and
the resulting implications, would be used by peer reviewers to determine whether the
resulting knowledge was publishable.

1 http://linkedopencommerce.com/
Chapter 1. Introduction 3

Scientist forms a
Scientist reads
testable hypothesis
published material

Scientist designs an
Scientist analyses previous experiment
experiments

Scientist performs
experiment

Scientist analyses data

Scientist makes Scientist forms results


changes as into a publication
required

Scientist submits
publication to a journal

Journal editor sends anonymous copies


of the publication to peers for review

Peers review work with


reference to its validity
and its agreeance with previous
published work

Journal editor takes peer reviews and


decides whether to publish the work

Article is published as part of a journal issue


(Electronic and/or paper)

Citations by peers Critical reviews by peers

Figure 1.1: Example Scientific Method



Figure 1.2: Flow of information in living cells


Original figure source: Kenzelmann, Rippe, and Mattick. [2006].
doi:10.1038/msb4100086

As the amount of data grows, and more scientists contribute data about different
types of concepts, it is typical for scientists to store data about concepts such as genes
and diseases in different locations. Scientists are then able to access both datasets
efficiently based on the types of data that they need for their research. The combined
data is discoverable by scientists accessing multiple datasets, using the links contained
in each dataset for correlation, as shown in Figure 1.5. In that example, the data needs
to be integrated for future researchers to study the similar effects of the gene in both
humans and mice. Scientists can use the link specifying that the genes are identical to
imply that the gene may cause cancer in mice, before testing this hypothesis with an
experiment.
Although most scientific publications only demonstrate the final set of steps that
were used to determine the novel result, the process of determining goals and eliminating strategies which are not useful is important. The resulting publications would
contain links to the places where the knowledge could be found. The information in
the publication would make it possible for other scientists to reproduce the experiment,
and provide evidence for the knowledge that the gene causes cancer in both humans
and mice.
Scientific peers may want to reproduce or extend an experiment using their own
resources. These experiments are very similar to the structure of the peer’s experiments,
but the data access and processing resources need to be changed to fit the scientist’s
context.

Figure 1.3: Integrating scientific knowledge (diagram: initially, gene ABCC1 is recorded as found in humans and causing bone cancer, while gene GHFC is recorded as found in mice; once the two genes are asserted to be identical genes, the identity can be used to imply that the gene may also cause cancer in mice)

For example, they may have personally curated a copy of a dataset, and wish
to substitute it with the publicly available copy that was originally used. This dataset
may not have the same structure, and it may not be accessible using the same method
as the public dataset. In the example, a peer may have a dataset that contains empirical
data related to the expression of a gene in various mice and wish to use this as part of
the experiment. If the curated dataset contains enough information to satisfy the goals
of the experiment, the peer should be able to easily substitute the datasets to replicate
the experiment.

1.1 Scientific data


There are a large number of publicly available scientific datasets that are useful in
various disciplines. In disciplines such as physics, datasets are mostly made up of
direct numeric observations, and there is little relationship between the raw data from
different experiments. In other sciences, particularly those based around biology, most
data describes non-numeric information such as gene sequences or relationships between
proteins in a cell, and there is a clear relationship between the concepts in different
datasets. In the case of biology, there is a clear incentive to integrate different types
of data, compared to physics where it is possible to perform isolated experiments and process the raw data without direct correlations to shared concepts such as curated gene networks.
Figure 1.4: Use of information by scientists (table relating each step of the scientific method, from exploration, hypothesis, experiment, and analysis of data through conclusions, article writing, peer review, and publication, to example methods and sources such as PubMed, NCBI Gene, Diseasome, PLoS One, myExperiment, CPAN, CRAN, SourceForge, BioMoby, Bio2RDF, and Taverna, and to formats including HTML, XML, TSV, SOAP, RDF, workflows, LaTeX, and BibTeX)


Figure 1.5: Storing scientific data (diagram contrasting unlinked information, where human gene, cancer, and mouse gene records are stored separately, with linked information, where an identical-genes link between gene ABCC1 and gene GHFC supports the implied knowledge that the mouse gene may also cause bone cancer)

In biology, an experiment relating to the functioning of a cell in one organism may share a number of conceptual similarities with another experiment examining the effects of a drug on a different organism. The similarities between the experiments are
identifiable, and are commonly shared between scientists by including links in published
datasets. For example, a current theory about how genetic information influences the
behaviour of cells in living organisms is shown in Figure 1.2, illustrating a small number
of data types that are required for biologists to integrate when processing their data.
In situations where data is heavily tied to particular experiments there may not be a
case for adding links to other datasets. In other cases the experimental results may only
be relevant when they are interpreted in terms of shared concepts and links between
datasets that can be identified separately from the experiment. If the links and concepts
cannot easily be recognised, then the experimental results may not be interpreted fully.
A scientist needs to be able to access data using links and recognise the meaning of the
link.
Given the magnitude and complexity of the scientific data available, which includes
a range of small and large datasets shown in Figure 4.1 and a range of concepts shown
in Figure 1.2, scientists cannot possibly maintain it all in a single location. In practice,
scientists distribute their data across different datasets, in most cases based on the type
of information that the data represents. For example, some datasets contain information
about genes, while others contain information about diseases such as cancer. Scientists
typically use the World Wide Web to access these datasets, although there are other
methods including Grids [85] and custom data access systems, such as LSIDs [41], that
can be used to access distributed data.

1.1.1 Distributed data

Science is recognised to be fundamentally changing as a result of the electronic publication of datasets that supplement paper and electronic journal articles [71]. It is simple to publish data electronically, as the World Wide Web is essentially egalitarian, with very low barriers to entry. In comparison, academic journals, which are based on peer review and authoritarian principles, have higher barriers to entry, although they should not be designed to judge an article based on the social standing of its source. Although it is simple to publish datasets, it is comparatively hard to get the data recognised and linked by other datasets.
In practice, there are a limited number of scientists curating these distributed
datasets, as most scientists are users rather than curators or submitters of data. In
addition, the scientists using these distributed datasets may be assisted by research
assistants, so the ideal “scientist” performing experiments or analysing results using
different distributed datasets will likely be a group of collaborating researchers and
assistants.
There are both technical and infrastructure problems that may make it hard for a
new dataset to be linked to from existing scientific datasets. These issues range from
outdated information, to the size of the dataset, and the way the data is made available
for use by scientists.
Data may be outdated or not linked if it is difficult for the maintainer of a dataset to
obtain or correlate the data from another dataset. If a dataset maintainer is not sure that
a piece of data from another dataset is directly relevant, they are less likely to link to the
data. As distributed data requires that datasets give labels to pieces of information, it
is important that different datasets use compatible methods for identifying each item.
Although most datasets allow other datasets to freely reuse their identifiers without
copyright restrictions, there are some closed, commercial datasets that put restrictions
on reuse of their identifiers. This limits the usefulness of these identifiers, as scientists
cannot openly critique the closed datasets because the act of referring to the closed
dataset identifier in an open dataset could be illegal2. By comparison, these datasets
freely link to other datasets, so if they were open, they would be valuable sources of
data for the scientific community.
If there are no accurate, well-linked, sources of data, a scientist may need to manually
curate the available data and host it locally. In some cases scientists may have no issues
trusting a dataset to be updated regularly and contain accurate links to other datasets,
but there may be a contextual reason why they prefer to use data from a different
location, including a third party middleware service and their own computing resources.
The contextual reasons may include a more regularly operational data provider or a
locally available copy of the data. Knowledge of these and other operational constraints
are necessary for other scientists to verify and replicate the research in future.
2 http://lists.w3.org/Archives/Public/public-lod/2011Aug/0117.html

1.1.2 Science example

The science of biology offers a large number of public linked datasets, in comparison to
chemistry where there are a large number of datasets, but the majority are privately
maintained and commercially licensed. Some specific issues surrounding data access
in biology are shown in Figure 1.6, along with the general issues affecting scientific
distributed data. Medical patient datasets, for example, may be encoded using an HL7
file format standard3. These datasets, however, are generally private, and doctors may
benefit from simple access to both biology and public medicine datasets for internal
integration without widespread publication of the resulting documents. For example,
datasets describing drugs, side effects and diseases are directly relevant, and would be
very useful if they could be linked to and integrated easily [37].

Figure 1.6: Data issues (summary of the main data sharing problems at each level: general data that is physically distributed, possibly legally protected, and in need of shareable labels; business, where trusted business-to-business interfaces and labelling standards such as EAN UCC-13 exist for a limited number of operations; social networks, where privacy of shared information and identities split across networks dominate; medicine, where privacy of patient information and differing labelling standards are central and clear links to science data are desirable; science, where there is no standard for labelling and publishing data despite many public, interlinked datasets; and biology, where a small number of large datasets and a large number of small datasets link and publish data in different ways)

3 http://www.hl7.org/implement/standards/index.cfm?ref=nav

Figure 1.2 shows the major concepts currently recognised in the field of cellular biology, along with their relationships. The combined cycle describes how scientists link information about different parts of the genomic cycle, using references to other datasets containing linked concepts that are thought to be causally relevant. For example, as shown in Figure 1.7, the Entrez Gene dataset contains genomic information, and this information can be translated to form proteins, which have relevant information in the Uniprot dataset, among others. The Uniprot dataset also attempts to compile relevant links to Entrez Gene, among other datasets, to allow scientists to identify the genes that are thought to be used to create particular proteins.

Figure 1.7: Datasets related to flow of information in living cells (network of linked datasets: Entrez Gene, Reactome, CPath, Uniprot, Pfam, Drugbank, and Sider)


Original figure source: Kenzelmann, Rippe, and Mattick. [2006].
doi:10.1038/msb4100086

These datasets are vital for scientists who require information from multiple sections
of the biological cycle to complete their experiments. This thesis examines an example
involving a scientist who is required to determine and account for genetic causes of side effects from drugs. The necessary published information to complete these types of experiments is available in linked scientific datasets including DrugBank [144], HGNC (dataset with a single textual symbol for each human gene) [125], NCBI Entrez Gene
(dataset with information about genes) [93], Uniprot (dataset with information about
proteins) [139], and Sider (dataset with side effect information for drugs) [80].
At a conceptual level, this experiment should be relatively easy to perform using
current methods such as web browsing or workflows. However, there are various reasons
why it is difficult for scientists to perform the example experiment and publish the results, including:

• References different in each format and dataset



Uniprot reference information available in Entrez Gene

In HTML: mRNA and Protein(s) :


UniProtKB/Swiss-Prot <a href="http://www.uniprot.org/entry/P27338">P27338</a>
HTML source URL : http://www.ncbi.nlm.nih.gov/gene/4129

In ASN.1: type other, heading "UniProtKB", source {}, comment { { type other, source { { src { db "UniProtKB/
Swiss-Prot", tag str "P27338" }, anchor "P27338" } } } }
ASN.1 source URL : http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&retmode=asn.1&id=4129

In XML: <Dbtag><Dbtag_db>UniProtKB/Swiss-Prot</Dbtag_db><Dbtag_tag><Object-id><Object-
id_str>P27338</Object-id_str></Object-id></Dbtag_tag></Dbtag>
XML source URL : http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&retmode=xml&id=4129

Figure 1.8: Different methods of referencing Uniprot in Entrez Gene dataset

• Many file formats, including Genbank XML, Tab-separated-values and FASTA

• Lack of interlinking: one-way link from Sider to Drugbank

• Experimental replication methods vary according to the sources used

• Custom methods are generally hard to replicate in different contexts

There are different methods of referencing between the Entrez Gene dataset and the
Uniprot dataset. There are multiple file formats available for each item in the Entrez
Gene dataset, with each format using a different method to reference the same Uniprot
item, as shown in Figure 1.8. In a similar way, there are multiple file formats available for the equivalent Uniprot protein; however, each format also uses a different method to reference the same Entrez Gene item, as shown in Figure 1.9. Both of these datasets are curated and used by virtually every biologist studying the relationship between genes and proteins. The HGNC gene symbol that is equivalent to these items is represented using different file formats, each of which uses a different method of referencing external
data links. In the case of Entrez Gene, HGNC also contains two separate properties,
with very similar semantic meanings, that link to the same targets in Entrez Gene, as
shown in Figure 1.10.
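
To make the impact of these differences concrete, the following minimal Python sketch extracts the UniProt accession from the three Entrez Gene serialisations shown in Figure 1.8 and reduces each to a single prefixed identifier. The snippets are copied from the figure, but the regular expressions, the uniprot: prefix, and the function itself are illustrative assumptions rather than part of the prototype described later in this thesis. Every additional format would require another ad hoc pattern, which is the kind of maintenance burden that the normalisation rules introduced in Chapter 3 aim to make explicit and reusable.

    import re

    # Hypothetical raw snippets, copied from the formats shown in Figure 1.8;
    # in practice these would be fetched from the Entrez Gene source URLs.
    SNIPPETS = {
        "html": 'UniProtKB/Swiss-Prot <a href="http://www.uniprot.org/entry/P27338">P27338</a>',
        "asn1": 'db "UniProtKB/Swiss-Prot", tag str "P27338"',
        "xml": '<Dbtag_db>UniProtKB/Swiss-Prot</Dbtag_db><Dbtag_tag><Object-id>'
               '<Object-id_str>P27338</Object-id_str></Object-id></Dbtag_tag>',
    }

    # One ad hoc pattern per serialisation; each pattern is an assumption about
    # how the UniProt accession appears in that format, not an official schema.
    PATTERNS = {
        "html": re.compile(r'uniprot\.org/entry/([A-Z0-9]+)'),
        "asn1": re.compile(r'tag str "([A-Z0-9]+)"'),
        "xml": re.compile(r'<Object-id_str>([A-Z0-9]+)</Object-id_str>'),
    }

    def extract_uniprot_reference(fmt, text):
        """Return a single normalised identifier, e.g. 'uniprot:P27338'."""
        match = PATTERNS[fmt].search(text)
        if match is None:
            raise ValueError("no UniProt reference found in " + fmt + " snippet")
        return "uniprot:" + match.group(1)

    if __name__ == "__main__":
        for fmt, text in SNIPPETS.items():
            # All three serialisations resolve to the same normalised reference.
            print(fmt, "->", extract_uniprot_reference(fmt, text))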
Although some datasets are not linked in either one or both directions, there may
be ways to make the data useful. For instance, the DailyMed website contains links
to DrugBank, but DrugBank does not link back to DailyMed directly. DrugBank does
however link to KEGG [76], which links to DailyMed, as shown in Figure 1.11. There
are links to PubChem [31] from DrugBank and KEGG, however, they are not identical,
as KEGG links to a PubChem substance, while DrugBank links to both PubChem com-
pounds and PubChem substances. In this example, described more fully in Section 4.4,
a scientist or doctor may want to examine the effects of a drug in terms of its side effects
and any genetic causes, however, this information isn’t simple to obtain given the links
and data access methods that are available for distributed linked datasets.
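
The indirect linking problem can be illustrated with a small sketch that searches for a chain of links between two datasets when no direct link exists. The adjacency list below encodes only the dataset-level links described in this section; it is an illustration of the traversal a scientist would otherwise perform by hand, not a description of how these websites, or the prototype, resolve links.

    from collections import deque

    # Dataset-level links mentioned in this section; an assumption-level summary,
    # not a complete map of the real datasets.
    LINKS = {
        "DailyMed": ["DrugBank"],
        "DrugBank": ["KEGG", "PubChem Compound", "PubChem Substance"],
        "KEGG": ["DailyMed", "PubChem Substance"],
    }

    def find_path(start, goal):
        """Breadth-first search for a chain of links from start to goal."""
        queue = deque([[start]])
        seen = {start}
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                return path
            for neighbour in LINKS.get(path[-1], []):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(path + [neighbour])
        return None

    if __name__ == "__main__":
        # DrugBank has no direct link to DailyMed, but a path exists via KEGG.
        print(find_path("DrugBank", "DailyMed"))  # ['DrugBank', 'KEGG', 'DailyMed']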
In addition to the directly available data from these sites, there are restructured and annotated versions available from tertiary data providers including Bio2RDF [24], Chem2Bio2RDF [39], Neurocommons [117], and the LODD group [121].

Entrez Gene information available in Uniprot

In RDF/XML: <rdfs:seeAlso rdf:resource="http://purl.uniprot.org/geneid/4129" />


RDF/XML source URL : http://www.uniprot.org/uniprot/P27338.rdf

In XML: <dbReference type="GeneID" id="4129" />


XML source URL : http://www.uniprot.org/uniprot/P27338.xml

In HTML: Genome annotation databases :


GeneID: <a href="http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&term=4129">4129</a>
HTML source URL : http://www.uniprot.org/uniprot/P27338.html

In Text File: DR GeneID; 4129; -.


Text source URL : http://www.uniprot.org/uniprot/P27338.txt

In GFF : No References available in this format


GFF source URL : http://www.uniprot.org/uniprot/P27338.gff

In FASTA : No References available in this format


FASTA source URL : http://www.uniprot.org/uniprot/P27338.fasta

Figure 1.9: Different methods of referencing Entrez Gene in Uniprot dataset

Uniprot and Entrez Gene reference information available


in HGNC

In HTML:

Entrez Gene ID 4129 <a href="http://view.ncbi.nlm.nih.gov/gene/4129">Gene</a>


UniProt ID (mapped data supplied by UniProt) P27338 <a href="http://www.uniprot.org/uniprot/
P27338">UniProt</a>

HTML source URL : http://www.genenames.org/data/hgnc_data.php?hgnc_id=6834

In Tab Separated Values Format:

HGNC ID Approved Symbol Approved Name Status Entrez Gene ID Entrez Gene ID (mapped data
supplied by NCBI) UniProt ID (mapped data supplied by UniProt)
6834 MAOB monoamine oxidase B Approved 4129 4129 P27338

Tab Separated Values Format source URL : http://www.genenames.org/cgi-bin/hgnc_downloads.cgi?


title=HGNC+output
+data&col=gd_hgnc_id&col=gd_app_sym&col=gd_app_name&col=gd_status&col=gd_pub_eg_id&col=
md_eg_id&col=md_prot_id&status=Approved&status=Entry
+Withdrawn&status_opt=2&level=pri&=on&where=gd_hgnc_id+%3D+%276834%
27&order_by=gd_hgnc_id&limit=&format=text&submit=submit&.cgifields=&.cgifields=level&.cgifields=
chr&.cgifields=status&.cgifields=hgnc_dbtag

Figure 1.10: Different methods of referencing Entrez Gene and Uniprot in HGNC
dataset

Figure 1.11: Indirect links between DailyMed, Drugbank, and KEGG (diagram of the chain of links between the DailyMed, DrugBank, KEGG Drug, PubChem, and Sider records for the same drug: the DailyMed record links to the DrugBank card http://www.drugbank.ca/cgi-bin/getCard.cgi?CARD=DB01247, the DrugBank record http://www.drugbank.ca/drugs/DB01247 links to the KEGG Drug record http://www.genome.jp/dbget-bin/www_bget?drug:D02580 and to PubChem, and the KEGG record links back to DailyMed, but there is no direct link from DrugBank to DailyMed)

These versions of the datasets contain more links to other datasets than the original data; however, the annotated information may not be as trusted as the original information that is directly accessible using the publisher's website. If scientists rely on extra annotations provided by these tertiary sources, the experiments may not be simple to replicate using the original information, making it difficult for peer reviewers to verify the conclusions unless they use the annotated datasets. Although tertiary sources may not be as useful as primary sources, they provide a simpler understanding of the links between different items, as shown by replicating the example from Figure 1.11 using the annotated datasets in Figure 1.12.
Although it is simpler to visualise the information using the annotated versions,
there are still difficulties. For example, the DailyMed annotated dataset provided by
the LODD group does not utilise the same identifiers for items as the official version,
although the official version uses two different methods itself. The DailyMed item
identified by both “AC387AA0-3F04-4865-A913-DB6ED6F4FDC5” and “6788” in the
official version is identified as “2892” in the LODD version, and there are no references
to indicate that the LODD published data is equivalent to the DailyMed published data.
Figure 1.12: Direct links between DailyMed, Drugbank, and KEGG (the equivalent items in the annotated datasets: http://bio2rdf.org/drugbank_drugs:DB01247, http://bio2rdf.org/dr:D02580, http://www4.wiwiss.fu-berlin.de/dailymed/page/drugs/2892, http://bio2rdf.org/pubchem:148916, http://bio2rdf.org/pubchem:3759, http://bio2rdf.org/pubchem:17396751, and http://www4.wiwiss.fu-berlin.de/sider/resource/drugs/3759)

A tertiary provider may accidentally refer to two different parts of a dataset as if they were one set. For example, the Bio2RDF datasets do not currently distinguish between the PubChem compounds dataset and the PubChem substances dataset. This error results in the use of identifiers in the examples above, such as http://bio2rdf.org/pubchem:3759, which may refer either to a compound, http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=3759, or an unrelated substance, http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?sid=3759. In these cases, the two identifiers would be very difficult to disambiguate unless a statement clearly describes either a compound or a substance.
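
A minimal sketch of the disambiguation that would be required is shown below: the numeric accession is rewritten into two distinct namespaces depending on whether the official PubChem URL identifies a compound (cid) or a substance (sid). The pubchem.compound: and pubchem.substance: prefixes are hypothetical forms used for illustration only; they are not the identifiers currently published by Bio2RDF.

    import re

    # Hypothetical, unambiguous namespace prefixes; the Bio2RDF identifiers
    # discussed above use a single ambiguous "pubchem" namespace.
    COMPOUND_PREFIX = "pubchem.compound:"
    SUBSTANCE_PREFIX = "pubchem.substance:"

    # Patterns for the official PubChem summary URLs, which do distinguish
    # compounds (cid) from substances (sid).
    CID_PATTERN = re.compile(r"summary\.cgi\?cid=(\d+)")
    SID_PATTERN = re.compile(r"summary\.cgi\?sid=(\d+)")

    def disambiguate(source_url):
        """Map a PubChem summary URL to an unambiguous prefixed identifier."""
        cid = CID_PATTERN.search(source_url)
        if cid:
            return COMPOUND_PREFIX + cid.group(1)
        sid = SID_PATTERN.search(source_url)
        if sid:
            return SUBSTANCE_PREFIX + sid.group(1)
        raise ValueError("not a recognised PubChem summary URL: " + source_url)

    if __name__ == "__main__":
        print(disambiguate("http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=3759"))
        # pubchem.compound:3759
        print(disambiguate("http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?sid=3759"))
        # pubchem.substance:3759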

1.2 Problems
A number of concepts need to be defined to understand the motivating factors for
this research. These concepts will be used to identify and solve problems that were
highlighted using examples in Section 1.1.2.

Data cleaning : The process of verifying and modifying data so that it is useful
and coherent, including useful references to related data. For example, this may
include manipulating the Entrez and Uniprot datasets so that the links can be
directly resolved to get access to information about the referenced data.

Data quality : A judgment about whether data is useful and coherent. For example,
this may be the opinion of a scientist about whether the datasets were useful,
along with which manipulations would be necessary to improve them.

Trust : A judgment by a scientist that they are comfortable using data from a particular source in the context of their current experiment. In the example, scientists
need to be confident that the sources of data they are using will be both non-trivial
and scientifically acceptable to their peers.

Provenance : The details of the retrieval method and locations of data items that
were used in an experiment, as distinct from the provenance information related
to the original author and curator of the data item [129]. The distributed nature
of scientific data was illustrated by compartmentalising the logical data and links
from Figure 1.3 to match the way it is physically stored, as shown in Figure 1.5.
In terms of this example, the provenance of the asserted relationship between the
mouse gene and the human cancer, would include references to data located in
each of the other datasets.

Context : The factors that are relevant to the way a scientist performs or replicates an experiment. These factors influence the way different scientists approach the same problems. For example, a scientist may publish a method using datasets distributed across many locations, while a peer may reproduce or extend the published method solely using resources at their institution (a brief sketch of how such contextual details might be recorded follows these definitions).
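
The sketch below shows the minimal record a scientist might keep so that a peer could re-run a query in a different context. It is purely illustrative: the class, field names, endpoint URL, and rule name are assumptions for this example, not the configuration or provenance schema of the prototype described in later chapters.

    from dataclasses import dataclass, field
    from typing import List

    # A hypothetical, simplified record combining query provenance with the
    # contextual details a peer would need in order to replicate the query.
    @dataclass
    class QueryProvenanceRecord:
        query_type: str                 # e.g. "search-by-gene-symbol"
        parameters: dict                # the user's inputs
        providers: List[str] = field(default_factory=list)           # endpoints actually queried
        normalisation_rules: List[str] = field(default_factory=list) # rules applied to results
        profile: str = "default"        # the context: mirror, local curation, trust choices

    record = QueryProvenanceRecord(
        query_type="search-by-gene-symbol",
        parameters={"symbol": "MAOB"},
        providers=["http://hgnc.example.org/sparql"],    # hypothetical endpoint
        normalisation_rules=["normalise-uniprot-uris"],  # hypothetical rule name
        profile="local-mirror",
    )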

These factors are relevant to the way scientists work with data, as shown in Section 1.1. Scientists are not currently able to clean and query datasets to fix errors
and produce accurate results using methods that are simple to replicate. The lack of
replicability may result in lower data quality for published datasets due to the lack of
reuse and verification. Although scientists can access and trust parts of a small number
of large public datasets, they are not able to easily trust the large number of small
datasets, or even all of a small selection of large datasets. Trust requires knowledge of
the state of the dataset. In addition, scientists need to understand the usefulness of the
data in reference to other related datasets. It is difficult to understand the usefulness
of an approach while there are barriers to replicating existing research that uses similar
linked datasets.
Some of the problems discussed here are more relevant to the use of workflows, as
compared to manually performing an experiment by hand. For example, it is more
important that data is trusted for large scale workflows, as the intermediate steps are
required to be correct for the final, published, results to be useful. In comparison,
workflows make it easy to generate query provenance, as the precedents for each value
can be explicitly linked through the structure of the workflow. In manual processes, the
query provenance needs to be compiled by the scientist alongside their results. Contextual factors can adversely affect both workflow and manual processes: workflows can allow for context variables as workflow inputs, but if they do not, replication in a different context becomes difficult. Similarly, manual processes are designed to be closely replicated, making it possible to modify context variables at the desired steps.

1.2.1 Data quality


Data should be accessible to different scientists in a consistent manner so that it can be
easily reused. The usefulness of data can be compromised by errors or inconsistencies
at different levels. Errors may occur when the data does not conform to its declared
schema; when dataset links are not consistent or normalised; when properties are not
standardised; or, when the data is factually incorrect. Some of these errors can be
corrected when a query is performed, while others require correction before queries can
be executed.
Data may be compromised by one or more fields that do not match the relevant
schema definition, possibly due to changes in the use of the dataset after its initial
creation [69]. In some of these cases, automated methods may be available to identify
and remove this information, including community accepted vocabularies that define
the expected syntactic and semantic types of linked resources. In other cases such as
relational databases, the structure of the database may need to be modified in order to
solve the issue.
The decentralised nature of the set of linked scientific datasets makes dynamic verification of multiple datasets hard. Although a syntactic validator may be able to verify
the completeness of a record, a semantic validator may rely on either a total record, or
the total dataset, in order to decide whether the data is consistent. In many cases the
results of a simple query may not return a complete record, as a scientist may only be
interested in a subset of the fields available.
The usefulness of linked data lies in the use of accurate, easily recognisable links
between datasets. Datasets can be published with direct links to other records, or they
may be published using identifiers that can be used to name a record without directly
linking to it. In many cases, the direct links contain an identifier that can be used to
name a record without linking to it, as shown in a variety of examples in Section 1.1.2.
In some cases there are two equivalent identifiers for a record. Figure 1.11 shows two
ways of referencing the relevant DailyMed record using both “AC387AA0-3F04-4865-A913-DB6ED6F4FDC5” and “6788” as identifiers.
It may be necessary for scientists to recognise identifiers inside links, independent
of the range of links that could be constructed, as the same identifier may be present
in two otherwise different links. If two different identifiers can be used equivalently
for a record, then normalising the identifier to a single form will not adversely affect
the semantic meaning of the record. However, there are cases where the two identifiers
denote two different records with data representing the same thing. In this case, the
two identifiers are synonyms rather than equivalent, meaning that they need to be
distinguished so that scientists can retain the ability to describe both items [112]. This
is particularly important when integrating data from two heterogeneous sources, where
it may be important in future that the identifiers are separate so that their provenance
can be determined.
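
The distinction drawn above can be sketched using the DailyMed identifiers discussed in this chapter: equivalent identifiers for the same record may be rewritten to a canonical form, while synonyms from heterogeneous sources are kept separate and explicitly linked so that their provenance is preserved. The prefixes and mappings below are illustrative assumptions, not actual published identifiers.

    # Equivalent identifiers for one DailyMed record: safe to rewrite in place.
    EQUIVALENT = {
        "dailymed:6788": "dailymed:AC387AA0-3F04-4865-A913-DB6ED6F4FDC5",
    }

    # Synonyms: separate records in other datasets describing the same thing,
    # kept distinct so that the provenance of each record is preserved.
    SYNONYMS = {
        "dailymed:AC387AA0-3F04-4865-A913-DB6ED6F4FDC5": ["lodd.dailymed:2892"],
    }

    def normalise(identifier):
        """Rewrite an alternate identifier to its canonical equivalent."""
        return EQUIVALENT.get(identifier, identifier)

    def related(identifier):
        """Return synonym identifiers without collapsing them into one."""
        return SYNONYMS.get(normalise(identifier), [])

    if __name__ == "__main__":
        print(normalise("dailymed:6788"))  # canonical DailyMed identifier
        print(related("dailymed:6788"))    # ['lodd.dailymed:2892']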
The issue of variable data quality is more prominent when data is aggregated from
different sources, as incorrect or inconsistent information is either very visible, or causes
large gaps in the results. Both of these outcomes make it difficult or impossible to trust
the results. This is particularly relevant to the increasingly important discipline of
data mining, where large databases are analysed to help scientists form opinions about
apparent trends [112]. It is necessary in science predominantly because of the tendency of scientific datasets to not always remove entries for incorrect items, and to import references at one point in time without regularly checking in the future to verify that the references are still semantically valid [69].
The use of multiple linked datasets may highlight existing data quality issues; however, links between distributed datasets may still be useful when the data is factually incorrect, as the link may be important in determining the underlying issue. In comparison to textual search engines, which rely on ever more sophisticated algorithms to
guess the meaning of documents, scientists generate purpose built datasets and curate
the links to other datasets to improve the quality of their data.

1.2.2 Data trust

Data trust is a complex concept, with one paper, Gil and Artz [52], identifying 19 factors
that may affect the trust that can be put into a particular set of data. In the context
of this research, data trust is evaluated at the level of datasets and the usefulness of
different queries on each dataset. At a high level, different scientists determine whether
they trust a dataset, and peers use this information to determine whether or not the
dataset is trusted enough to be used as a source for queries. Although this information
is implicitly visible in publications, a computer understandable annotation would allow
scientists to efficiently perform queries across only their trusted datasets. If queries do
not return the expected results, then the dataset that was responsible for the incorrect
data could be explicitly annotated as untrusted for future queries, and past queries
could be replicable regardless of the current situation, but not necessarily trusted.
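
As a concrete illustration, such an annotation could be as simple as a per-context table recording which dataset providers are currently trusted, used to filter the providers a query is distributed to. The context names, provider names, and structure below are hypothetical; the profiles actually used by the model and prototype are described in Chapters 3 and 5.

    # A hypothetical trust annotation: each context records which dataset
    # providers it currently trusts, and a query is only distributed to
    # providers trusted in the active context.
    TRUST_ANNOTATIONS = {
        "lab-workstation": {"uniprot-mirror-local": True, "hgnc-public": True},
        "public-mirror": {"uniprot-public": True, "hgnc-public": True,
                          "third-party-annotations": False},  # marked untrusted
    }

    def trusted_providers(context, candidates):
        """Filter candidate providers down to those trusted in this context."""
        annotations = TRUST_ANNOTATIONS.get(context, {})
        return [p for p in candidates if annotations.get(p, False)]

    if __name__ == "__main__":
        candidates = ["uniprot-public", "hgnc-public", "third-party-annotations"]
        print(trusted_providers("public-mirror", candidates))
        # ['uniprot-public', 'hgnc-public']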
It is difficult to define a global trust level, due to evidence that knowledge is not
only segmented, but its meaning varies according to context [64]. In business process
modelling, where it is necessary to explicitly represent methods and knowledge related
to the business activities, one paper, Bechky [21], concluded that within a business it
is hard to represent knowledge accurately for a variety of reasons including job specificity and training differences. Some of these may map to science due to its similarities with expert knowledge holders and segregated disciplines. Another paper, Rao and Osei-Bryson [113], focused on analysis of the data quality in internal business knowledge management systems, with the conclusion that there are a number of analytical
dimensions that are necessary to formally trust any knowledge, with some overlap in
the dimensions to the factors identified in Gil and Artz [52].
Trust in a set of data is related to data quality, as any trust metric requires that
datasets contain meaningful statements. If a data trust metric is to extend past overall
datasets to particular data items, there is a need to precisely designate a level of trust to
data items based on general categories of trustworthiness. The continuous evolution of
knowledge and data, including the different levels of trust given to scientific theories
throughout their lifespan compared to raw data, makes it hard to define the exact meaning
of every data item in a dataset against a relatively static shared meaning for the item.
The lack of exact meaning at this level, due to social and evolutionary issues, means
that there is no automated general mechanism for detecting and resolving differences
in opinion about meaning.
The focus of many current distributed scientific data access models is on the use of
scientific concept ontologies as the basis for distributing queries [4, 15, 17, 25, 30, 53, 74,
75, 82, 100, 103, 107, 108, 117, 118, 133, 143]. These systems do not contain methods for
describing the trust that scientists have in the dataset, including different levels of trust
based on the context of each query, as ontology researchers assume that datasets will be
semantically useful enough to be integrated transparently in all relevant experiments.
The datasets that are available to these systems need to remain independent from a
query in order for the system to be able to automatically decide how to distribute the
query based on high level scientific concepts and properties. These systems assume that
both the syntactic and semantic data quality will be high enough to support realistic
and complex queries across multiple linked datasets without having scientists examine
or filter the results at each stage of the query process.
If a dataset is well maintained and highly trusted within a domain it will be more
consistent and accurate, and the way the data is represented should match the common
theories about what meaning the data implies. If the level of trust in a dataset is
not high across a community, the various degrees of trust that cannot be described in
current ontologies, such as uncertainty, will lead to differences in implementation. The
use of different implementations will lead to differences in the trust in the knowledge
that is described using the resulting data. Data trust in science is gained based on a
particular historical context. A scientific theory may be currently accepted to be part
of the Real layer in Critical Realist [96] terms due to a large degree of evidence pointing
to realistic causal mechanisms, but in the future if the context changes the observations
may be interpreted differently. Such a change in recognition does not mean that the world
actually changed to suit the new theory; it does mean, however, that a community of
scientists will gradually accept the new theory and regard knowledge based on the old
theory as possibly suspect.
Strassner et al. [137] attempted to utilise ontologies as the basis for managing the
knowledge about a computer network, and to use this knowledge to automate the process
of network management in different contexts. They found that a lack of sufficiently
diverse datasets to experiment on and the lack of congruence between different ontologies
made it difficult to utilise an ontology based approach as part of a network management
system. In comparison, there are a large number of diverse scientific linked datasets,
and there are issues relating to the use of incongruent ontologies as the datasets have
been modelled by different organisations, each of which has a different view on the
meaning of the data.
As the number of scientific datasets grows, the ability of any particular scientist to
judge the suitability of a given set of datasets will likely decrease, as no scientist can
know exactly what each database contains.
Not every piece of information is trustworthy, and even trustworthy datasets can contain
inaccuracies including out of date links to other datasets, or errors due to insufficient
scientific evidence for some claims [35]. Scientists have to be free to reject sources of
data that they feel are not accurate enough to give value to the investigations in their
current context. However, there is no current method which allows them to do this using
a distributed linked dataset query model. Current theoretical models include many
different factors [52], but no method for scientists to use to integrate their knowledge
of trust with their scientific experiments across linked datasets. Scientists can record
and evaluate the data and process provenance of their workflows, however, this does not
recognise the concept of trust in a dataset provided at a particular location [34, 95, 148].
Although some trust systems can be improved using community based ratings, these
systems are still prone to trust issues based on the nature of the community. Scientists
require the ability to act autonomously to discover new results as necessary without
being forced to confine themselves to the average of the current community opinion
about a dataset or data item. They require a system that provides access to multiple
datasets without requiring that all scientists utilise it completely for it to be useful,
as this may prevent them from extending it locally as they require. In order to
remain simple to manage, a data access model designed with this in mind cannot hope
to systematically assign trust values to every item in every dataset in a way that will
match all scientists' opinions. At one extreme such values would be globally populated
automatically by an algorithm, while at the other they would be sparsely populated by a
range of scientists. Neither of these outcomes enables scientists to contextually trust
different datasets more accurately than a manual review of relevant publications.
An autonomous trust system does not require or imply that the data is only mean-
ingful to some scientists. It recognises that the factual accuracy, along with the com-
pleteness of the records contribute to each scientist’s trust in the dataset and its links to
other datasets. This does not imply that private investigations are more valid or useful
than published results or shared community based opinions. It does imply that as part
of the exploratory scientific method, trusted datasets must be developed and refined in
local scenarios before being published. In this way it matches the traditional scientific
publication methodology, where results are locally generated and peer reviewed before
being published. In order to be useful, a linked scientific data access model may be able
to provide for both pre and post published data, and trust based selection of published
data to allow transitions when necessary.
Trust in the content of a document is distinct from trusting that the document was
an unchanged representation of the original document. Typically trust in the computer
sciences, particularly networking and security, is defined as the ability to distinguish
between unchanged and changed representations of information. Although it may be
useful to include this aspect in a trusted scientific data access model, it is not necessary
as scientists decide on their level of trust based on the overall content available from a
particular data provider. In addition a scientist may have different levels of trust for the
content derived from a particular provider depending on their research context, requiring
that, in two different contexts, the same representation be assigned two different levels
of trust.
1.2.3 Context sensitivity and replication

Modern scientists need to be able to formulate and execute computer understandable
queries to analyse the data that is available to them within their particular context. In
terms of this research, context is defined as the factors that affect the way the query is
executed, and not necessarily different contextual results. In previous models, such as
SemRef [133], context sensitivity is defined as the process of reinterpreting a complex
query in terms of a number of schema mappings. Scientists may wish to perform a
completely different query in a different context, and the limitation to reinterpreting
a complex structured query makes it impossible to provide alternative, structurally
incompatible queries to match a given scientific question. It is instead necessary for a
context sensitive query system to provide simple mappings between the scientist’s query,
given as a minimal set of parameters rather than a structured query, and any number
of ways that the particular query is implemented on different datasets. Although this
does not provide a semantically rich link between the scientist’s query and the actual
queries that are executed on the data providers, it provides flexibility that is necessary
in a system which is designed to be used in different contexts to generate equivalent
results.

Context sensitivity in this research relates to both the scientist’s requirements and
the locations and representations of data. If scientists have the resources to locally host
all of the datasets, then they are in a position to think about optimising the performance
by using a method such as that proposed by Souza [133]. However, in these cases, the
scientist still has to determine their level of trust based on the intermediate results.
This may not be available if the entire question is answered using joins across datasets
in a single database without a scientist being able to verify the component steps. They
could perform these checks if they use the datasets solely as a data access point, with
higher level filters and joins on the results according to their context.

Although there are systems that may assist them when constructing queries across
different datasets, the scientist needs to verify whether the results of each interdataset
query match their expectations in order to develop trust in their results. The data
quality, along with their prior level of trust in the datasets, and the computer under-
standable meaning attached to the data, are important to whether the scientist accepts
the query and results or chooses to formulate it differently.

Some systems attempt to gather all possible sources for all queries into a single
location, for example, SRS (Sequence Retrieval System) [46]. The aggregation of similar
datasets into single locations provides for efficient queries, and, assuming the datasets
are not updated by the official providers too often, can be maintainable and trustworthy.
In terms of this research the minimum requirement for scientists to replicate queries is
that the data items must be reliably identified independent of context.
1.3 Research questions


A list of research questions was created to highlight some of the current scientific data
access problems described in Section 1.2.

1. What data quality issues are most prevalent for scientists working with multiple
distributed datasets?

2. What data cleaning methods are suitable for scientists who need to produce results
from distributed datasets that are easily replicable by other scientists?

3. What query model features are necessary for scientists to be able to identify
trustworthy datasets and queries?

4. What query provenance documentation is necessary for scientists to perform
queries across distributed datasets in a manner that can be replicated in a different
context with a minimal amount of modification to the queries?

5. Can a distributed dataset query model be implemented, deployed, and used to
effectively access linked scientific datasets?

1.4 Thesis contributions


This thesis introduces a new context sensitive model for querying distributed linked
scientific datasets that addresses the problems of data cleaning, trust, and provenance as
discussed in Section 1.2. The model allows users to define contextual rules to automate
the data cleaning process. It provides context sensitive profiles which define the trust
that scientists have in datasets, queries, and the associated data cleaning methods. The
model enables scientists to share methodologies through the publication and retrieval
of their queries’ provenance. The thesis describes the implementation of the model in a
prototype web application and validates the model based on examples that were difficult
to resolve without the use of the model and the prototype implementation.
The contributions will be evaluated by comparing the model and implementation
features and methodology against other similar models in terms of their support for
context-sensitive, clean, and documented access to heterogeneous, distributed, linked
datasets. These features are necessary to provide support to current and future scientists
who perform experiments and analyse data based on public datasets.

1.5 Publications
• Ansell. [2011] : Model and prototype for querying multiple linked scientific datasets.
Future Generation Computer Systems. doi:10.1016/j.future.2010.08.016

• Ansell, Hogan, and Roe. [2010]. Customisable query resolution in biology and
medicine. In Proceedings of the Fourth Australasian Workshop on Health Infor-
matics and Knowledge Management (HIKM2010).
• Ansell. [2009]. Collaborative development of cross-database Bio2RDF queries.
Presentation at EResearch Australasia 09.

• Ansell. [2008]. Bio2RDF: Providing named entity based search with a common
biological database naming scheme. Presentation at BioSearch08 HCSNet Next-
Generation Search Workshop on Search in Biomedical Information.

• Belleau, Ansell, Nolin, Idehen, and Dumontier. [2008]. Bio2RDF’s SPARQL
Point and Search Service for Life Science Linked Data. Poster at Bio Ontologies
2008 workshop.

Other publications:

• Ansell, Buckingham, Chua, Hogan, Mann, and Roe. [2009]. Enhancing BLAST
Comprehension with SilverMap. Presentation at 2009 Microsoft eScience Work-
shop.

• Ansell, Buckingham, Chua, Hogan, Mann, and Roe. [2009]. Finding Friends
Outside the Species: Making Sense of Large Scale BLAST Results with Silvermap.
Presentation at EResearch Australasia 09.

• Rosemann, Recker, Flender, and Ansell. [2006]. Understanding context-awareness
in business process design. Presentation at Australasian Conference on Information
Systems.

1.6 Research artifacts


The research was made up of a model and a prototype implementation of the model. The
model was designed to overcome the issues that were identified related to data access in
science in Section 1.1. The prototype was then implemented to provide evidence for the
usefulness of the model. The prototype was implemented using Java and Java Servlet
Pages (JSP). It contained approximately 35,000 lines of program code4 . The prototype
included a human-understandable HTML interface, as shown in Figure 1.13, along with
a variety of computer-understandable RDF file formats including RDF/XML, Turtle,
and RDFa. The 0.8.2 version of the prototype was downloaded around 300 times from
Sourceforge 5 .
The prototype was primarily tested on the Bio2RDF website. It provided the Bio2RDF
project with a way to resolve queries easily across its distributed linked datasets, including
datasets using the normalised Bio2RDF resource URIs [24] and datasets using other methods for
identifying data and links to other data. Bio2RDF is a project aimed at providing
browsable, linked, versions of approximately 40 biological and chemical datasets, along
with datasets produced by other primary and tertiary sources where possible. The
datasets produced by Bio2RDF can be colocated in a single database, although datasets
are all publicly hosted in separate query endpoints.
4 http://www.ohloh.net/p/bio2rdf/analyses/latest
5 http://sourceforge.net/projects/bio2rdf/files/bio2rdf-server/bio2rdf-0.8.2/
Figure 1.13: Bio2RDF website

The configuration information necessary for the prototype to be used as the engine
for the Bio2RDF website was represented using approximately 27,000 RDF statements.
A large number of datasets were linked back to their original web interfaces and descrip-
tions of the licenses defining the rights that users of particular datasets have. In total
there were 170 namespaces that were configured to provide links back to the original
web interface as part of the Bio2RDF query process, and there were 176 namespaces
that were configured to provide links back to a web page that describes the rights that
users of the dataset have.
The Bio2RDF website was accessed 7,415,896 times during the testing period and the
prototype performed 35,694,219 successful queries and 1,078,354 unsuccessful queries on
the set of distributed datasets to resolve the user queries. The website was accessed
by at least 10,765 unique IP addresses, although two of the three mirrors were located
behind reverse proxy systems that did not provide information about the IP of the user
accessing the website. The three instances of the prototype running on the mirrors were
synchronised using a fourth instance of the software that provided the most up to date
version of the Bio2RDF configuration information to each of the other mirrors at 12
hourly intervals.

1.7 Thesis outline

The thesis starts with an introduction to the problems that form the motivation for this
research in Chapter 1. The related research areas are described in Chapter 2, starting
from a broad historical perspective and going down to related work that attempts to
solve some of the same issues. Chapter 3 contains a description of the model that was
created in order to make it simpler for scientists to deal with the issues described in
Chapter 1. Chapter 4 describes how the model and prototype implementation can be
integrated with scientific practices. The prototype that was created as an implementa-
tion of the model is described in Chapter 5. Chapter 6 contains a discussion of the issues
that were addressed by the model and prototype, along with outstanding issues, and a
description of the methodology and philosophy that were used to guide the research at
a high level. A conclusion and a brief description of potential future research directions
are given in Chapter 7.
Chapter 2

Related work

2.1 Overview
For most of scientific history, data and theories have been accumulated in personal
notebooks and letters to fellow scientists. The advent of mechanical publishing in
the Renaissance period allowed scientists to mass produce journals that detailed their
discoveries. This provided a broader base on which to distribute information, making
the discovery cycle shorter. However, data still needed to be processed manually, and
publication restrictions made it hard to publish raw data along with results and theories.
The use of electronic computers to store and process information gave scientists efficient
ways to permanently store their data, allowing for future processing. However, initially
electronic storage costs were prohibitive, and scientists generally did not share raw data
due to these costs. The creation of a global network to electronically link computer
systems provided the impetus for scientists to begin sharing large amounts of raw data,
as the costs were finally reduced to economic levels. This chapter provides a short
summary of a range of data access and query methods that have been proposed and
implemented.
Networked computers, particularly those that make up the Internet, are now vital
to the process of scientific research. The Internet is used to provide access to information
that scientists can retrieve as they require, with over 1000 databases in the biology
discipline [50].
discipline [50]. A scientist needs to request information from many different locations
to process it using their available computing resources. There are many ways that sci-
entists could request the information, making it difficult for other scientists to replicate
their research based on their publications. They can publish computer understandable
instructions detailing ways for other scientists to replicate their research, but other
scientists need to be able to execute the instructions locally. This is difficult as the
contextual environment is different between locations, including different operating sys-
tems, database engines, and dataset versions. An initial step towards easy replication
by peers may be to use workflow management systems to integrate the queries and
results into replicable bundles. Applications such as Taverna [105] make it possible to
use workflows to access distributed data sources.
The methods that are most commonly used by scientific workflow management
systems are Web Services, in particular SOAP XML Web Services. Although XML
is a useful data format, it relies on users identifying the context that the data is given
in, and documents from different locations may not be easy to integrate as the XML
model only defines the document structure, and not the meaning of different parts of the
document. Web Services can be useful as data access and processing tools, especially if
scientists are regularly working with program code, so they are useful if the context is
identifiable by the scientist. However, the use of Web Services does not make it simple
to switch between the use of distributed services and local services, as users need to
implement the computer code necessary to match the web service interface to their own
data. The method of data access is not distinctly separated from either the queries or
the data processing steps within workflows, so scientists need to know how to either
reverse engineer the web service, modify a workflow to match their context if the web
service is not available, or access the data locally using another method.
A scientist needs to be able to identify links between different pieces of data to
develop scientifically significant theories across different datasets. The XML data for-
mat, for example, is used as the basis for Web Services and many other data formats.
However, XML and other common scientific data formats do not contain a standard
method for identifying data items or links to other data items in a way that is generi-
cally recognisable. In order to link between data items in different documents, a generic
data format, RDF (Resource Description Framework), was developed. RDF is able to
show links between different data items using Uniform Resource Identifiers (URIs) as
identifiers for the data items. RDF statements are made up of URIs and literals, such
as strings and numbers, which provide links between relevant data items and descriptions
of the properties of those items, respectively. RDF is used as the basis for the Semantic Web, which
is ideally characterised by the use of computer understandable data by computers to
provide intelligent responses to questions.
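
As a minimal illustration of this structure (the URIs and the ex:encodes property below
are invented for the example and do not come from any particular dataset), a few RDF
statements in the Turtle syntax might read as follows:

    @prefix ex:    <http://example.org/genedata/> .
    @prefix other: <http://other.example.org/proteindata/> .
    @prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .

    # A data item identified by a URI, with a literal property attached to it.
    ex:gene_4129  rdfs:label  "Monoamine oxidase B" .

    # A link from the same data item to a record published by another organisation.
    ex:gene_4129  ex:encodes  other:protein_P27338 .

Because both statements use the URI ex:gene_4129, an RDF parser will merge them into
a single node in the resulting graph, regardless of which document each statement came
from.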
The original focus for the Semantic Web community was to define the way the com-
munity agreed meanings for links, commonly known as ontologies or vocabularies, would
be represented. This resulted in the development of OWL (Web Ontology Language),
as a way of defining precisely what meaning could be implied from a statement given in
RDF. Although this was useful, it did not fulfil the goals of the Semantic Web, as it is a
format, rather than a dataset, and the datasets suffered due to both data quality prob-
lems and lack of contextual links between the datasets that were published. The W3C
community recognised this as an issue. Tim Berners-Lee formed a set of guidelines,
known as Linked Data1 , describing a set of best practices using URIs in RDF docu-
ments. Linked Data uses HTTP URIs to access further RDF documents containing
useful information about the link.
According to the best practices given by the Linked Data community, URIs should
be represented using the HTTP protocol, which is widely known and implemented, so
many different people can get access to information about an item using the URI that
was used to identify the item. When a Linked Data URI is resolved, the results should

1 http://www.w3.org/DesignIssues/LinkedData.html
contain relevant further Linked Data URIs that contain contextually related items, as
shown in Figure 2.1. This provides a data access mechanism that can be used to browse
between many different types of data in a similar way to the HTML document web.
The Linked Data guidelines encourage scientists to publish results that contain useful
contextual links, including links to data items in other scientific disciplines. Although
some scientists wish there to be a single authoritative URI for each scientific data item,
RDF is designed to allow unambiguous links between multiple representations of a data
item using different properties. In addition, Linked Data does not specify that there
needs to be a single URI for each data item. A scientist needs to be able to create
alternative URIs so they can reinterpret data in terms of a different schema or new
evidence, without requiring the relevant scientific community to accept and implement
the change beforehand.

Figure 2.1: Linked Data Naive Querying (retrieve a Linked Data URI; parse the RDF
document and store it in a local database; analyse the document for references to other
Linked Data and retrieve any new references; then perform queries on the local database)

Despite its general usefulness, plain Linked Data is not an effective method for
systematic querying of distributed datasets, particularly where links are not symmetric,
as shown in Figure 2.2. Even when all Linked Data URIs are symmetric, many URIs
need to be resolved and locally stored before there is enough information to make
complex queries possible. In the traditional document web, textual searches on large
sets of data are made possible using a subset of the documents on each site. Although
this sparse search methodology is very useful for textual documents, it is not useful for
very specific scientific queries on large, structured, linked, scientific datasets.
The RDF community developed a query language named SPARQL (SPARQL Query
Language for RDF) 2 to query RDF datasets using graph matching techniques. SPARQL
provides a way to perform queries on a remote RDF dataset without first discovering
and resolving all of the Linked Data URIs to RDF documents. However, it is impor-
tant that resolvable Linked Data URIs are present so that the data at each step is
independently accessible outside of the context of a SPARQL query.
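
For example, a small SPARQL query (continuing the invented URIs from the earlier
Turtle sketch, and not tied to any specific endpoint) matches graph patterns directly
against a remote dataset, so only the matching bindings, rather than every underlying
document, need to be transferred:

    PREFIX ex:   <http://example.org/genedata/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    # Find the items that the gene record links to, along with their labels,
    # without resolving every Linked Data URI beforehand.
    SELECT ?related ?label
    WHERE {
      ex:gene_4129  ex:encodes  ?related .
      ?related      rdfs:label  ?label .
    }
    LIMIT 10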
SPARQL is useful for constructing complex graph based queries on distributed
datasets, however, it was not designed to provide concurrent query access to multi-
ple distributed datasets. In order for scientists to take advantage of publicly accessible
2 http://www.w3.org/TR/rdf-sparql-query/
Figure 2.2: Non Symmetric Linked Data (two Linked Data URIs about the concept X,
<http://firstorganisation.org/resource/concept/X> and
<http://otherorganisation.org/concept:X>, where a link exists in one direction but no
link exists in the other)

scientific datasets, without requiring them to store all of the data locally, queries must
be distributed across different locations. Figure 2.3 shows the concept of data producers
interlinking their datasets with other sources before offering the resulting data on the
network contrasted with the data silo approach where all data is stored locally, and the
maintenance process is performed by a single organisation rather than by each of the
data producers.
Federated SPARQL systems distribute queries across many datasets by splitting
SPARQL queries up according to the way each dataset is configured [40]. An example
of a dataset configuration syntax is VoiD 3 , which includes predicates and types, labels
for overall datasets, and statistics about interlinks between datasets. The component
queries are aggregated by the local SPARQL application before returning the results to
the user. Although there is no current standard defining Federated SPARQL, a future
standard may not be useful for scientists if it does not recognise the need for the dataset
to be filtered or cleaned prior to, or in the act of querying. In addition, the method
and locations used to access distributed datasets are embedded directly into queries
with current Federated SPARQL implementations4 , in a similar way to web services
and workflow management systems.
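
A sketch of this embedding, using the SERVICE keyword from the draft SPARQL
federation extension and two hypothetical endpoint locations, shows how the locations
become part of the query text itself, so that moving to a mirror or a local copy requires
editing the query:

    PREFIX ex: <http://example.org/vocab/>

    SELECT ?gene ?diseaseLabel
    WHERE {
      # The endpoint locations are hard-coded into the query.
      SERVICE <http://endpoint-one.example.org/sparql> {
        ?gene  ex:associatedWith  ?disease .
      }
      SERVICE <http://endpoint-two.example.org/sparql> {
        ?disease  ex:label  ?diseaseLabel .
      }
    }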
Federated SPARQL strategies allow scientists to perform queries across distributed
linked datasets, but the cost is that the datasets must all use a single URI to identify
an item in each dataset, or they must all provide equivalency statements to all of the
other URIs for an item. This requirement is not trivial, and a solution has not been
proposed that provides access to heterogeneous datasets in a case where the providers
do not agree on a single URI structure. Proposals for single global URIs for each dataset
have not so far been implemented by multiple organisations, and even when they are
3 http://vocab.deri.ie/void/
4 http://esw.w3.org/SPARQL/Extensions/Federation
Figure 2.3: Distributed data access versus local data silo (in distributed dataset
maintenance, each publisher curates and interlinks its own source with the others, and
relevant data is accessed across the network; in data silo maintenance, the sources are
copied across the network to a single organisation that curates and interlinks them, and
the data is stored and accessed locally)
implemented in the future, all datasets would have to follow them for the system to
be effective. Although Linked Data URIs are useful, they provide a disincentive to
single global URIs. If a URI needs to be resolved through a proxy provided by a single
organisation, there are many different social and economic issues to deal with. This
organisation may or may not continue to have funding in the future and they may not
respond promptly to change requests. Each of these issues reduces the usefulness that
the single URI may otherwise provide.
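
The equivalency statements mentioned above typically take a form similar to the
following sketch, shown here for the two URIs from Figure 2.2 using the owl:sameAs
property; a pair of statements like this would need to be published and maintained for
every record shared between every pair of providers:

    @prefix owl: <http://www.w3.org/2002/07/owl#> .

    # Two organisations' URIs asserted to denote the same concept X.
    <http://firstorganisation.org/resource/concept/X>
        owl:sameAs  <http://otherorganisation.org/concept:X> .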
There are non-SPARQL based models and implementations that allow scientists to
make use of multiple datasets as part of their experiments. They are generally based on
the concept of wrapping currently available SQL databases, SOAP Web Services, and
distributed processing grids to provide for complex cross-dataset queries. The wrappers rely on a
schema mapping between the references in the query and the resources that are available
in different datasets. In some systems distributed queries rely on the ability to map
high level concepts from ontologies to the low level data items that are actually used in
datasets. These types of mappings require the datasets to be factually correct according
to the high level conceptual structure before the query can be completed, as there is
no way to simply ask for data independent of the high level concepts. In these cases it
may be possible to encode the low level data items directly as high level concepts, but
this reduces the effectiveness of the high level concepts in other queries.

2.2 Early Internet and World Wide Web


Although it grew out of ARPANET, a project funded by the US defence department,
the distributed network of computers that is now known as the Internet has been used
to transmit scientific information in primitive forms since it was created. Email, the
initial communication medium, was created to enable people in different locations to
share electronic letters, including attached electronic documents. By its nature, email
is generally private, although many public email lists are used to communicate within
communities. Initially, the lack of data storage space made it hard for scientists to
create and curate large datasets, resulting in a bias towards communication of scientific
theories compared to raw data.
The World Wide Web was created by scientists at CERN in Switzerland to share
information using the Internet. It is based on the HTML (HyperText Markup Language)
5 and HTTP (Hypertext Transfer Protocol) 6 specifications. The HTML specification
enables document creators to specify layouts for their documents, and include links to
other documents. The use of HTTP, and in some cases HTTPS and FTP, form the
core transport across which HTML and other documents are served using the Internet.
HTML links, generally HTTP URIs, are created without a specific purpose, other than
to provide a navigational tool.
HTML is regularly used by scientists to browse through data. HTML versions of
scientific datasets display links to scientific data both within the dataset and in other
5 http://www.w3.org/TR/html4/
6 http://tools.ietf.org/html/rfc2616
datasets using HTTP URLs. These interfaces enable scientists to navigate between
datasets using the links that the data producer determined were relevant.
With recent improvements in computing infrastructure, including very cheap storage
space and large communication bandwidths, scientists are now able to collaborate in
realtime with many different peers using the Internet. The massive scale electronic
sharing of data, which allows this collaboration, has been described as a “4th paradigm”
for science [71]. The first three paradigms are generally recognised as being based on
theory, empirical experimentation, and numerical simulation, respectively. Real-time
collaboration and data sharing enables scientists to process information more efficiently,
arguably resulting in shorter times between hypotheses, experiments, and publications.

2.2.1 Linked documents

Initially the Web was used to publish static documents containing hardcoded links
between documents. To overcome the maintenance effort required to keep these static
documents up to date, systems were created to enable queries on remote datasets,
without having to download the entire dataset. The results of these queries may be
provided in many different formats, but there is no general method for identifying links
between datasets, and in some cases no description of the meaning for links that are
given. In HTML, where links are commonly provided, the URL link function is not
necessarily standardised between datasets, making it potentially hard to match the
reference to references in an HTML page from a different data provider.
Linked documents, most commonly created using HTML, are popular due to the ease
with which they can be produced. However, the semantic content of HTML documents
is limited to structural knowledge about where the link is in the document, and what
the surrounding elements are. Clearly linked documents provide information about the
existence of a relationship between documents, but they do not propose meanings for
the relationship. They can be annotated with meaningful information, with a popular
example being the Dublin Core specifications 7 . Dublin Core can be used to annotate
documents with standard properties such as author and title. Textual search engines,
supplemented with knowledge about the existence of links, make it possible to efficiently
index the entire web, including annotations such as those provided by Dublin Core.
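
For instance, a document can be annotated in RDF with Dublin Core properties such as
title and creator (the document URI and values below are hypothetical):

    @prefix dc: <http://purl.org/dc/elements/1.1/> .

    <http://example.org/documents/report-42>
        dc:title    "Example report" ;
        dc:creator  "A. Scientist" .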

2.3 Dynamic web services


Dynamic processing services were created to avoid having to retrieve entire data files
across the Internet before being able to determine if the data is useful. Initially these
were created using ad hoc scripts that processed users' inputs against a local datastore
and returned the results. However, each new service had to be understood by humans
and implemented using custom computer programs. Web services
can be created that enable computers to understand the interface specification, and pro-
grammatically access the service, without a human programmer having to understand
7 http://dublincore.org/
each of the parameters and manually write the interface code.


Programmatic web services may be loosely or tightly specified. One example of a
tight specification, where the interactions are defined in a strict contract, is the SOAP
protocol 8 . The SOAP protocol enables computers to directly obtain the interface
descriptions that in prior cases needed to be understood by humans. Web services are used by scientists
to avoid issues with finding data in other ways, as the data is able to be manipulated in
program code with the SOAP interface being the specification of what to expect. Web
services may be just used for data retrieval, but they are also commonly used as specific
query interfaces to enable scientists to use remote computing resources to analyse their
data. Although the SOAP specification allows for multiple distributed endpoints to be
used with a single Web Service, in most cases only single data providers are given with
alternative ways to access the single provider. This causes issues with overloading of
particular services and can in some cases stop scientists from processing their data if
the service is unavailable.
The nature of SOAP web services, with a tightly bound XML data contract, enables
them to be reverse engineered and implemented locally. However, this process requires
the re-creation of the code that was used to get the data out of a database or processing
application, something which is not necessarily trivial. The use of Semantic markup to
describe and compose Web Services has been used with limited success, although the
lack of success may be due to the nature of the ontologies rather than a defect in the
model [99].
Although web processing services are useful, the data may be returned in a format
that is not understandable by someone outside of the domain. In the case of SOAP,
users are required to understand XML, or alternatively, write a piece of code to translate
it to their desired format. In the case of other web processing services, the data may
be returned in any of a number of data formats, some of which may be common to
similar processing services. Although the sharing of specific data using these dynamic
web services is useful, scientists currently end up having to spend a large amount of
their time translating between data formats as required by the different services. The
translation between different data formats also makes it hard to recognise references
between data items without first understanding the domain and the use of different
methods by each domain to construct the references.

2.3.1 SOAP Web Service based workflows

Current scientific workflow management systems focus on data access using a combina-
tion of SOAP Web Service calls and locally created code. Although this is a useful way
to access data reliably as it includes fault aware SOAP calls, it creates a reliance on
XML syntax as the basic unit of data interchange. The tightly defined XML specifica-
tion, detailed in the SOAP contract, prevents multiple results being merged into single
result documents easily if the specification did not allow this initially. Although there
is some work being done to enhance the semantic structure of the XML returned by
8 http://www.w3.org/TR/soap/
web services, it does not contain a clear way to integrate web services which have been
implemented by different organisations. A discovery protocol such as Simple Semantic
Web Architecture and Protocol (SSWAP) provides a way of discovering the semantic
content published in web services, but it does not provide a way for scientists to di-
rectly use the web services discovered by the protocol [102]. The ontology description
of the inputs, operations, and outputs of web services is useful in addition to the WSDL
description which defines the structure of the data inputs and outputs, but it needs to
integrate with the data, and provide a mapping between data structures before it would
be useful as a data access protocol.
Perhaps the biggest results in this area have been attained by the manually coded
Web Service wrappers created by the SADI project [143]. Although it is useful, the
mechanism for identifying which datasets are relevant to each provider in the background
is done on a case by case basis inside of wrappers. The scientist's high level query cannot
be used in different contexts to integrate data that was not created using identical
identifiers and the same properties. SADI uses a single query to determine which sub-
queries are necessary. If future datasets follow the exact same structure and the identifiers
are not changed, then the single query can be adapted or reused to fit future data providers.
The SADI model allows for future datasets to be represented using different data structures,
as long as they can be transformed using a manually coded wrapper or using an OWL
based ontology transformation, as the data structure given in the query is fundamental
to the central query distribution algorithm. This restriction ties the scientist closely
to the actual data providers that are used. This makes it difficult to share or extend
queries in different contexts. The parts of the query that need to be modified may not
be easy to identify, given that each query is executed as a combined workflow.
For example, although the datasets are equivalent, SADI would find it difficult to
query across the data providers shown in Figure 2.4, as there is both variation in the
references and variation in the data structures. In the example, some information for the
results needs to come from each dataset, but there is no single way to resolve the query,
as there are different ways of tracing from the drug to the fact that the gene is located
on chromosome X depending on whether the information about its gene sequence is
required. In addition to this, the property names are duplicated between datasets for
different purposes. In the Drug dataset, the property "Target’s gene" contains a two
part reference to a dataset along with the identifier in the dataset, while in the Disease
dataset, the property contains a symbol without a reference to a dataset, and the symbol
is not identical to the identifier in the Drug dataset.
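
A rough Turtle rendering of the two link styles described above makes the mismatch
concrete; the vocabulary terms are invented for the illustration, and only the values are
taken from Figure 2.4:

    @prefix ex: <http://example.org/vocab/> .

    # Drug dataset: a two part reference made up of a dataset name and an identifier.
    <http://example.org/drug/Marplan>
        ex:targetsGene  [ ex:dataset "ncbi_gene" ; ex:identifier "4129" ] .

    # Disease dataset: a bare gene symbol with no reference to a dataset.
    <http://example.org/disease/Neuritis>
        ex:targetsGene  "MAOB" .

Neither link can be matched to the other, or to the gene records themselves, without
additional mapping information.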

2.4 Scientific data formats


Scientific data that will be used by computer programs, as opposed to the HTML
that is used by humans, is presented in many different formats. These formats are
generally text formats, as binary specifications that are common for some programs
are not transportable and useable in different scenarios, something which is important
to scientists. Although in the past scientists created new file formats for each area,

Figure 2.4: Heterogeneous datasets (a query asking for the name, location, genetic
structure, and taxonomy information for genes and diseases relevant to the drug Marplan
must combine a Drug dataset, a Disease dataset, the NCBI Gene dataset, and the HGNC
Gene dataset, which refer to the same gene using different references, including the NCBI
Gene identifier 4129, the HGNC identifier 6834, and the gene symbol MAOB, to produce
the useful results: Monoamine oxidase B, X chromosome, AGGCT..., Human, Neuritis)

new scientific document formats are generally standardised on using XML as the basic
format. The common use of XML means that documents are easily parseable, and if
they follow a particular schema, they can be validated easily by programs.
The use of XML along with XML Schema 9 , makes it simple to construct complete
documents and verify that they are syntactically valid. However, the strict nature of
XML schema verification does not allow extension of the document. In addition, XML
does not contain a native method for describing links between documents. Although
HTML contains a native method for linking between documents, it is not designed to
describe the links in terms of recognisable properties.
In addition to XML based formats, scientists share documents using the ASN.1
standard (ISO/IEC 8825-1:2008) 10 and custom, domain specific, data formats based
on computer parseable syntax descriptions encoded using EBNF (ISO/IEC 14977:1996)
11 . These documents may support links to different datasets, but they require knowledge
of the way each link is specified to determine its destination, and the data item that
the link refers to may not be directly resolvable using the link.

2.5 Semantic web


The Semantic Web was designed to give meaning to the masses of data that are repre-
sented using electronic documents. The term was defined using examples which focus
on the ability of computerised agents to automatically reason across multi-discipline
distributed datasets without human intervention to efficiently process large amounts of
information and generate novel complex conclusions to simple questions [27]. The Se-
mantic Web is commonly described using idealistic scenarios where data quality issues
are non-existent, or are assumed not to affect the outcome, data from different locations
is generally trustworthy, and where different data providers all map their data to shared
vocabularies. However, the slow process of developing the necessary tools has shown
that the achievable operational level of a global Semantic Web may be very limited. For
instance, Lord et al. [91] found that, based on their experience, “[the] inappropriateness
of automated service invocation and composition is likely to be found in other scientific
or highly technical domains.”
A completely automated Semantic Web is still possible, however, it is more likely
that human driven applications can derive novel benefits from the extra structure and
conditions provided by semantically marked up, shared, data. In comparison to agent
based approaches, human data review incorporates curation and query validation pro-
cesses at each step.
Although the Semantic Web (SW) may be a user navigable dataset which includes
references to the current HTML web, the major features are independent of the way the
web is used today. These features include a focus on computer understandability, and
a common context between documents to use as the basis for interpreting any meaning
9 http://www.w3.org/TR/xmlschema-1/
10 http://www.iso.org/iso/catalogue_detail.htm?csnumber=54011
11 http://www.iso.org/iso/catalogue_detail.htm?csnumber=26153
that may be given.


A scientist needs to unambiguously describe the elements of a query using terminolo-
gies that are compatible with the datasets they have available, including some commonly
used ontologies described in Appendix B. This process requires a way to apply rules to
each of the terms in a query to determine whether they are recognisable and not am-
biguous, and convert them to fit the context of each dataset as necessary. This process
is not trivial, and no general solution has been found, although some solutions focus on
the use of Wikipedia as a central authority to disambiguate terms [18]. Unfortunately,
Wikipedia only contains a subset of terms needed by scientists, as it is defined by its
community as a general knowledge reference, rather than a scientific knowledge refer-
ence. Although new terms could be added at any time to Wikipedia, they can just as
quickly be deleted, meaning that it can’t be permanently used as a reference.
The Semantic Web is currently being developed using the RDF (Resource Descrip-
tion Framework) specification for data interchange 12 . RDF is a graph based data format
that provides a way to contextually link information from different datasets using URI
links and predicates, which are also represented as URIs. These predicate URIs can be
defined in vocabularies. In comparison to domain specific data formats and XML, RDF
is designed to be an extensible model, so information from different locations can be
meaningfully merged by computers.
RDF is based on URIs, of which the HTTP URLs that are commonly used by
the document web, are a subset. The RDF model specifies that the use of a URI in
different statements implies a link between the statements. The URI should contain
enough information to make it easy for someone to get information about the item
if they understand the protocol used by the URI. In the past, scientists have been
hesitant to create HTTP URIs due to perceived technical and social issues surrounding
the HTTP protocol and the DNS system that HTTP is commonly used with [41]. Some
RDF distributions such as the OBO schema recommend instead that scientists represent
links in RDF using other mechanisms such as key-value pairs with namespace prefixes
in one RDF triple and the identifier in another triple. This use of RDF does not enable
computers to automatically merge different RDF statements, as there are no shared
URIs between the statements.
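
The difference can be sketched as follows; the property names are illustrative rather than
the actual OBO schema terms:

    @prefix ex:   <http://example.org/vocab/> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    # Key-value style: the namespace prefix and the identifier are split across
    # triples attached to a blank node, so there is no URI to merge on.
    _:ref1  ex:namespace   "geneid" ;
            ex:identifier  "4129" .

    # Linked Data style: a single shared, resolvable URI identifies the record.
    <http://bio2rdf.org/geneid:4129>  rdfs:label  "Monoamine oxidase B" .

Two datasets that both describe the gene using the first style never share a node, while
two datasets that both use the shared URI are merged automatically.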

2.5.1 Linked Data

The Semantic Web community has defined guidelines for creating resolvable, shared,
identifiers, in the form of HTTP URIs, under the banner “Linked Data” 13 . The guide-
lines codify a way to gain access to factual data across many different locations, without
relying on locally available databases or custom protocols. Although its specifications
are independent of many of the goals of the Semantic Web, it is designed to produce a
web of contextually linked data, that may then be incrementally improved to represent
a semantically meaningful web of information.
12 http://www.w3.org/TR/rdf-syntax-grammar/
13 http://www.w3.org/DesignIssues/LinkedData.html
In comparison to linked documents, Linked Data provides the basis for publishing
links between data items that are not solely identifiers for electronic documents. Using
Linked Data, an HTTP link can be resolved to a document that can be processed and
interpreted as factual data by computers without intervention by a human.
The Linked Data guidelines refer to the ability to use Uniform Resource Identifiers
(URIs) to represent both an item and the way to get more information about the
item. Although this in itself is not new, the Linked Data design recommends some
conventions that enable virtually all current computing environments to get access to the
information. These conventions are centred around the use of the Hyper Text Transfer
Protocol (HTTP) and the Domain Name System (DNS), along with the reliance of
these systems on TCP/IP (Transmission Control Protocol/Internet Protocol). The
documents representing Linked Data are transferred in response to HTTP requests as
files encoded in one of the available RDF syntaxes based on HTTP Content Negotiation
(CN) headers.
The RDF specification is neutral regarding the interpretation of data, other than to say
that URIs used in different parts of a document should be merged into a single node in
an abstract graph representing the document. There are a large number of RDF parsers
available, and the RDF specification is not proprietary, making RDF-based Linked Data
accessible to a large number of environments currently using the linked document web.
Typical web documents can also be integrated as Linked Data, with the most common
method being RDFa 14 , which can be layered onto a traditional HTML document. The
use of RDFa makes it possible to include computer understandable data and visually
marked up information in the same document, using HTTP URIs for semantic links to
other documents.
Scientists can utilise Linked Data to publish their data with appropriately annotated
semantic links to other datasets. Linked Data systems are not centralised, so there is
no single authority to define when datasets can be linked to other datasets. This means
that scientists can use Linked Data to publish their results with reference to either
accepted scientific theories or new theories as necessary, and have the links recognised
and merged automatically by others.

2.5.2 SPARQL

Although Linked Data is useful as an alternative to linked documents, it is not a useful


replacement for dynamic web services. In terms of its use by scientists, Linked Data
focuses on resolving identifiers to complete data records, as opposed to the filtered or
aggregated results of queries. In order to effectively query using Linked Data, scientists
would need to resolve all of the URIs that they knew about and perform SPARQL
queries on the resulting dataset. This is not efficient, as there may be millions of URIs
published as part of each dataset, and there are no guarantees that the documents will
not be updated in future.
Although Linked Data URIs are useful for random data access, they can be used in
14 http://www.w3.org/TR/rdfa-syntax/
published datasets to enable queries over multiple datasets using SPARQL. SPARQL,
a query language for RDF databases, is a graph based matching language that is based
on the RDF triple. The RDF semantics specify that co-occurrences of a URI can be
represented as a single node on an abstract graph representing a set of RDF triples. In
some ways SPARQL is similar to SQL, which is used to perform queries on relational
databases, although it is not predicated on a set of tables with defined columns to define
valid queries based on. There are many SPARQL endpoints that are publicly available,
including a large number of scientific datasets 15 .

SPARQL queries are constructed based on the structures of the RDF statements in
each dataset. In addition, SPARQL queries can be used to generate results without using
an RDF database. For example, relational databases can be exposed as RDF datasets
using SPARQL queries [29]. SPARQL queries are difficult to transport between
different data providers unless the RDF statements in each location are identical. They
rely on merging different triples into graphs using either URIs or Blank Nodes (which
are locally referenceable RDF nodes). It is rare that different datasets use the same
URIs to denote every data record and link.
In practice, different datasets use different URIs due to two main issues. Firstly,
there is a lack of standardisation for URIs, as the original data provider may not define
Linked Data HTTP URIs for their data records. Secondly, even if the original data
provider defines Linked Data HTTP URIs, other entities may customise the data and
wish to produce their own version using different URIs. In terms of SPARQL queries it
is very useful for all datasets to use the same URIs to identify an item, as it is
difficult to transport SPARQL queries between data providers. However, in terms of
simple data access it is more useful to use locally resolvable Linked Data HTTP URIs.

2.5.3 Logic

RDF and Linked Data are gradually expanding and improving to provide a basis on
which semantically complex queries can be answered using formal logic. The logic
component specifies the way that RDF statements relate to each other and to queries
on the resulting sets of RDF statements. In the context of RDF this requires the use of
computer understandable logic to identify the meaning of URIs and Blank Nodes based
on the statements they appear in. The most popular and mature logic systems are
currently created using Description Logics (DL), with the most popular language being
the family of OWL (Web Ontology Language) variants. OWL-DL, a popular variant,
is based on the theory of Description Logics, where the universe is represented using a
set of facts, which are all assumed to be true for the purposes of the theory.
The theory is by design monotonic, meaning that the addition of new statements will
not change current entailments, where entailments are additional statements that are
added based on the set of logic rules, which may also be statements under consideration.
A non-monotonic theory would make it possible for extra statements to contradict, pos-
sibly override, or remove, current entailments, something that would currently require
15 http://esw.w3.org/SparqlEndpoints
one or more statements to be discarded before any reasoning was able to be consistently
performed.
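
A small illustration of monotonicity, using RDFS subclass reasoning over invented terms,
is given below; once the first two statements entail the third, adding further statements
to the graph can introduce new entailments but can never retract this one:

    @prefix ex:   <http://example.org/vocab/> .
    @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    ex:MAOB    rdf:type         ex:Enzyme .
    ex:Enzyme  rdfs:subClassOf  ex:Protein .

    # Entailed under RDFS semantics: ex:MAOB rdf:type ex:Protein .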
Given that scientists are forced to consider multiple theories as possibly valid before
they are consistently proven to be valid using empirical experimentation, they should not
have to apply a single logic to the entire Semantic Web (also termed by Tim Berners-Lee
as the Giant Global Graph (GGG) 16 ). However, there are projects that seek to apply
ontologies to all interlinked scientific datasets to facilitate queries on the GGG. Scientists
require the ability to decide whether statements, or entire datasets are irrelevant to their
query, including the consequences of logic based entailment.

2.6 Conversion of scientific datasets to RDF


RDF versions of scientific datasets have been created by projects such as Bio2RDF
[24], Neurocommons [117], Flyweb [149], and Linked Open Drug Data (LODD) [121]
to bootstrap the scientific Semantic Web, or at least the scientific Linked Data web.
Where possible, the RDF documents produced by these organisations utilise HTTP
URIs to link to RDF documents from other datasets as they are produced, including
references to other organisations. This is useful, as it matches the basic Linked Data
17 goals which are designed to ensure that data represented in RDF is accessible and
contextually linked to related data.
In some cases, the same dataset is provided by two different organisations using two
or more different URIs. For example, there is information about the NCBI Entrez Gene
dataset in Neurocommons using the URI http://purl.org/commons/record/ncbi_gene/4129,
and there is similar information in Bio2RDF (including the Neurocommons data where
it is available) using the URI http://bio2rdf.org/geneid:4129. If the
schemas used by these organisations are different, this forces scientists to accept one
representation at the expense of the other when they create SPARQL queries.
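To make the consequence concrete, the short sketch below (illustrative Python, not part of the thesis prototype; only the two URIs are taken from the text above) builds the same conceptual request, a description of Entrez Gene 4129, against the two URI schemes. Neither generated query string can be sent unchanged to the other provider.

# Illustrative sketch: the same conceptual request expressed against two
# different URI schemes for the same Entrez Gene record.
RECORD_URIS = {
    "neurocommons": "http://purl.org/commons/record/ncbi_gene/4129",
    "bio2rdf": "http://bio2rdf.org/geneid:4129",
}

def describe_query(provider_name):
    """Build a provider-specific SPARQL query for the same conceptual request."""
    uri = RECORD_URIS[provider_name]
    return "CONSTRUCT { <%s> ?p ?o } WHERE { <%s> ?p ?o }" % (uri, uri)

for name in RECORD_URIS:
    print(name, describe_query(name), sep=": ")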
The data cleaning process is a vital part of the RDF conversion process. All data
is dirty in some way [70]. The most essential data cleaning process is the identification
of identical resources in different datasets. In non-RDF datasets, identifiers are not
typically URIs, so they require context in order to be unambiguously linked with other
datasets. The lack of context means that syntactically it is not easy to verify whether
two identifiers refer to the same record. The successful conversion of scientific datasets
to RDF, with defined links between the datasets, makes it possible for computers to
automatically merge information. Even if the data is not semantically valid, there may
be unexpected but insightful results due to the implications of the merges and rules that
are applied to the combined data.
RDF conversion may incur a cost in relation to query performance and storage
space, as an RDF version of a traditional relational database is likely to be larger
than a relational database dump format such as SQL. However, the RDF version is
functionally equivalent and can be automatically merged with other RDF datasets.
Compared to the equivalent relational database, an RDF database is typically larger, as
RDF databases are very highly normalised, while relational database schemas tend to
be designed with a tradeoff between normalisation and query efficiency. The size of the
database limits the efficiency of queries, a factor which makes it hard to practically merge
a large number of datasets into a single RDF database. For scientific datasets the file
size expansion, in the author's experience, can be anywhere from 3 to 10 times, depending
on the ratio of plain text to URIs in the datasets.

2.7 Custom distributed scientific query applications


A number of systems have been developed for the single purpose of querying distributed
datasets. Some of these systems are focused on a single domain, such as Distributed
Annotation System (DAS) [110] which provides access to distributed protein and gene
annotations, while others are focused on a mix of domains related to a single topic,
such as the range of datasets related to cancer research that are accessible using the
caGrid infrastructure. Other systems, such as OGSA-DAI, are built on grid computing
resources, while others, such as BioMart, are designed to provide simple methods of
mirroring datasets locally, along with the ability for scientists to visually construct
queries by picking resources from lists provided by the different datasets.
The Distributed Annotation System allows distributed, standardised querying on
biological data. However, it does not use RDF, as it is focused on a very specific
domain. The biological datasets are represented using discipline-specific interfaces and
file formats which can be understood by most bioinformatics software. In comparison
to an RDF based query solution, the current DAS implementation suffers in that it
requires software updates to all of the relevant sites to support any new classes of in-
formation. RDF based systems can be extended without having multiple data models
implemented in their software. RDF based solutions enable users to extend an original
record structure in a completely valid way using their own predicates, enabling cus-
tomised configuration-based additions as well as the up to date annotation data that
DAS is designed to provide.
A variety of different computing and storage grids have been set up in the past to
allow scientists to access large scale computing resources without having a supercomputer
available locally. An example of a large scientific grid is the caGrid, developed to enhance
the transfer of information between cancer research sites [127]. It includes an example
of a query system that requires a complex ontology to determine where queries and
subqueries go. It is able to utilise SPARQL as a query language, but it translates queries
directly to an internal query format, using ontologies to define mappings between different
services.
It does not allow users to perform these queries outside of the caGrid computing
resources, as it requires the use of its internal query engine to execute queries. Although
software to communicate with caGrid is provided for general use, it is highly specific
and would be difficult to replicate in other contexts. It is able to provide provenance,
based on the internal queries that are executed as part of the query, but its emphasis on
ontologies as the sole mapping tool means it cannot be used as a generic model for
querying many data sources, irrespective of their syntactic data quality. The mappings
are defined in a central registry, making it difficult for scientists to define their own
mappings or to integrate their own datasets with different, unpublished mappings.
OGSA-DAI is a general purpose grid based mapping facility, but it does not have
a single standard data format, so it relies on services providing mappings between all
possible formats, in a similar way to general workflow engines [59]. For scientists to use
the resulting documents to follow links between datasets, the identifiers used by each
dataset must be mapped. This is not simple, as the system uses a single binary unique
identifier property that is presumed to be large enough for the distributed system but is
not used in any other models, so mapping applications must rely on other properties to
decide which data items in the system match linked data from other applications.
The OGSA-DAI model allows users to integrate different datasets, but these changes
may not be replicable, as the record of the changes may not be visible in any of the
provenance information that could be produced by the system. It contains some degree
of user context sensitivity, through the use of profiles that decide what level of ontology
support is required by different users, but this context sensitivity does not extend to
the query or dataset sections. In particular, users must completely understand the
structure of the underlying datasets to successfully perform a query, including the
ontologies, the way the data is arranged, and its syntactic structure, such as whether
the data appears in lists or in single items.
The understanding of the generic underlying architecture does not necessarily in-
clude identifying links between datasets, or the concept of data transformations outside
of the metadata that is provided by the service. The metadata is solely provided by
services, although this may be extended through software extensions to support user
defined metadata. The overall focus of the system on a grid infrastructure, as opposed
to an abstract set of linked datasets, makes it unsuitable for general use in local contexts
where users may not be able to simulate the grid infrastructure. The model does not
seem to be designed for heterogeneous systems where users want to integrate datasets
from grid sources, local sources, and other non-grid related sources.
A general purpose scientific example of software that can be used to mirror datasets
locally, and to perform queries both locally and on the foreign data sources, is BioMart [130].
It can be used to construct queries across different datasets, although there are no global
references internally, so it requires users to know which fields map to other datasets, or
to rely on the mappings provided by the dataset author. The datasets are curated by their
authors, who then publish the resource and register it with the BioMart directory. The
mapping language that is used to map queries between datasets does not provide the
ability for users to customise the mappings that are used by the distributed datasets,
or to utilise any dataset that is not available using either the BioMart format or a generic
SQL database. As the mappings are based on the knowledge of the scientists who have chosen
to query datasets from the mart, the context of the links is hard to define outside of
the BioMart software.
The query that is executed by the BioMart software can be used as part of a prove-
nance record, as it contains references to the datasets that were used, but it focuses
solely on the name of the dataset, and there are no references to where to potentially
find the dataset other than to use the internal BioMart conventions for searching for
datasets with that name. The ability of the software to mirror datasets locally provides
some degree of context, although the process relies on the scientist’s ability to arrange
for the data quality to be verified and corrected before they perform queries.
Some systems aim to merge and clean scientific data for use in local repositories
[4, 70, 74, 141]. Some systems attempt to dynamically merge data into single local
repositories according to users’ instructions [72]. Some projects have managed to provide
localised strategies using high performance computing to handle the resulting large set
of scientific data [28, 60, 106]. However, it is impractical to expect every dataset to be
copied into a single local database for complex queries by users without access to these
high performance computing facilities. These facilities incur an associated management
overhead that is not practical for many researchers, particularly if the datasets are
regularly updated.
The BIRN project, although able to query across distributed datasets, relies on a
complete ontology mapping to distribute and perform queries [17]. It relies on databases
using the relational model, with a single ontology type field to denote the type of each
record. This is not practical for large cross-discipline datasets; it suits BIRN because it
is focused on a single domain, biomedicine, where the ontologies map cleanly between
datasets. Arbitrary queries for items, where one does not know in advance what properties
will be available, are hard in a relational model due to the lack of recognition of links between
datasets. In a similar way, [75] and [145] are useful systems, but they lack the ability
to provide users with arbitrary relations, and therefore users must still understand the
way the data is represented to distinguish links from other attributes. These models do
not recognise differences in syntactic data quality, except for cases where mappings
can be derived using the structure of the record, and there is no allowance provided for
semantic data quality normalisation methods.
The majority of systems that distribute SPARQL queries across a number of end-
points convert single SPARQL queries into multiple sub-queries, before joining and
filtering the results to match the original query. These systems, broadly known as
Federated SPARQL, generally require that users configure the system with specific
knowledge of the properties and types used by each dataset [143]. In many cases they also
assume that the URI for a particular item will be the same in all datasets to enable
transparent, OWL (Web Ontology Language) inference based joining of results [1, 85, 111],
in a similar way to BIRN, although using the RDF model.
The SRS system [46] provides an integrated set of biological datasets, with a custom
query language and internal addressing scheme. Although the internal identifiers are
unambiguous in the context of the SRS system, they do not have a clear meaning when
Chapter 2. Related work 43

used in other contexts. In comparison to the many formats offered by SRS, and the
native document formats of particular scientific datasets, the use of RDF for both doc-
uments and query results provides a single method, normalised namespace based URIs,
to reference items from any of the involved datasets. SRS provides an internal query
language that makes use of the localised database, giving it performance advantages over
the similar distributed RDF datasets provided by Bio2RDF. The approximate numbers
of RDF statements required to represent each of the 14 largest databases in the Bio2RDF
project are shown in Table 2.1, illustrating the scale of the information currently provided
in distributed RDF datasets.
In comparison to the variety of RDF datasets available, SRS requires that there is
only a single internal identifier for each record, enabling efficient indexing and queries.
Although SRS provides versions of each record in a range of commonly used formats, the
central data structures are proprietary rather than open and standardised, so it is difficult
to provide interoperability between sites. This reduces the ability of other scientists to
replicate results if they do not have access to an institution with an SRS instance
containing the same datasets.
The OpenFlyData model aims to abstract over multiple RDF databases by using
them as sources for a large local repository, before directing queries at that virtual
dataset. However, it does not aim to systematically normalise the information or direct
queries at particular component datasets [150], and it does not contain a method for
scientists to express the amount of trust that they personally have in each dataset.

Database Approximate RDF statements

PDB 14,000,000,000
Genbank 5,000,000,000
Refseq 2,600,000,000
Pubmed 1,000,000,000
Uniprot Uniref 800,000,000
Uniprot Uniparc 710,000,000
Uniprot Uniprot 220,000,000
IProClass 182,000,000
NCBI Entrez Geneid 156,000,000
Kegg Pathway 52,000,000
Biocyc 34,000,000
Gene Ontology (GO) 7,400,000
Chebi 5,000,000
NCBI Homologene 4,500,000

Table 2.1: Bio2RDF dataset sizes

2.8 Federated queries


Many systems attempt to automatically reduce queries to their components, and dis-
tribute partial queries across multiple locations. These systems presume that a number
of conditions will be satisfied by all of the accessible data providers, including data
quality, semantic integrity, and complete results for queries. In science it is necessary to
rely on datasets in multiple locations instead of locally loading all datasets. This is due
to the size of the datasets, and the continual updating that occurs, including changes
to links between datasets based on changes in knowledge.
In terms of this research into arbitrary multiple scientific dataset querying, the most
common federated query systems are based on SPARQL queries that are split across a
number of different RDF databases. Most systems also allow the translation of SPARQL
queries to SQL queries [29, 40, 45]. Others are focused on a small number of cooperating
organisations such as BIRN [17] and caGRID [120], or they are only focused on a single
topic such as DAS [110]. These systems require estimates of the number of results that
will be returned by any particular endpoint to optimise the way the results are joined.
These specialised systems also require that there is a single authoritative schema for all
users of the system, although there may be mappings between this schema and each
dataset.
Some federated SPARQL systems also require query designers to insert the URLs
of each of the relevant SPARQL endpoints into their queries by redefining the meaning
of a SPARQL keyword, making complex and non-RDF datasets inaccessible and intro-
ducing a direct dependency on the endpoint, which reduces the ability of scientists to
transport the query according to their context [147]. Federated SPARQL systems that
focus on RDF query performance improvements may not require users to specify which
predicates they are accessing [63].
There are a number of patterns that are implemented by federated query systems,
including Peer-to-Peer (P2P) data distribution, statistics based query distribution, and
the use of search engines to derive knowledge about the location of relevant data.
The first type of federated query relies on an even distribution of facts across the lo-
cations to efficiently process queries in parallel. An even distribution of facts is generally
achieved by registering the peers with a single authority, and distributing statements
from a central location. However, there may be alternatives that self-balance the system
without the use of a single authority. In either case, the locations all need to accept
that they are part of the single system and they must all implement the same model
to effectively distribute the information without knowledge about the nature of the
information. This method is generally implemented in situations where queries need
to be parallelised to be efficient, while the data quality is known, or not important to
the results of the query. This method is not suitable for a large group of autonomous
distributed datasets, as updates to any of the datasets will require that virtually
every peer is updated.
The second type of federated query is designed to be used across autonomous, multi-
location datasets. It requires knowledge about the nature of the facts in each dataset,
and uses this knowledge to provide a mapping between the overall query, and the dataset
locations. The majority of federated SPARQL systems are designed around the basic
concept of a high level SPARQL query being mapped to one or more queries, including
SPARQL, SQL, Web Services, and others. As the system requires detailed statistics
to be efficient, most research focuses on this area, with the data quality assumed to
be very high, and in most cases, the semantic quality to be very good, resulting in
simple schema mappings between different datasets based solely on the information in
the overall query. These methods have difficulties with queries that include unknown
predicates, as this aspect is used to make overall decisions about which locations will
be relevant. The VoiD specification includes a basic description of prefixes that can
be used to identify namespaces, but if the URI structure is complex, the unnormalised
prefix alone is not useful in deciding which locations will be relevant [3, 40]. VoiD also
assumes that all references to a dataset will use equivalent URIs, so any datasets that
use their own URIs, perhaps to enable their users to resolve a slightly different
representation of the document locally, will not work with it.
The third type of federated query is designed to be used against a central knowledge
store. This store contains directions about where facts about particular resources can
be found. The most common type of central knowledge store is in the form of a search
engine. This strategy dramatically reduces the amount of information that is necessary
for a client, while reducing the time required to find resources, assuming that the search
engine has complete coverage of the relevant datasets. In many cases, items can be
directly resolved using their identifier, especially if the authority is recognised, and the
URI can be used to discover queryable sources of information about the item. However,
a search engine may provide more information than a custom mashup that is created and
maintained by users, at the expense of the maintenance and running costs surrounding
a very large database.
In contrast to federated query systems, a pure Linked Data approach is extremely
inefficient in the long term, as it fails to appreciate the scale of the resources that are in
use. It inevitably requires users to have alternative access to large datasets, for example,
a SPARQL endpoint, for any useful, efficient queries. Some systems may provide a mix
of federated query strategies, including a search engine together with a basic knowledge
of which datasets are located in each endpoint. These methods are shown in Figure 2.5.
For example, the Linked Data methods used to resolve information about a given URI
are shown in Figure 2.6, including the possibility of an arbitrary crawl depth based on
the discovered URIs.
These federated systems, along with Linked Data and SPARQL queries, are very
brittle. If one of the SPARQL endpoints or Linked Data URIs is unavailable, the
system may break down. Scientific results need to be replicable by peers as easily as
possible. In order to overcome this brittleness, an ideal system needs to provide a context
sensitive query model that scientists can personally trust by virtue of their knowledge of
the involved datasets and the way different representations of a dataset are related. An
ideal scientific data access system cannot rely on the use of specific properties
or URIs. By decoupling the scientist’s query from the data locations, an ideal system
is free to negotiate with alternative endpoints that may provide equivalent information
using alternative URIs or properties. It can negotiate based on the semantic meaning
of the query, recognising multiple Linked Data URIs as equivalent, and different sets of
properties as equivalent in the current context, without requiring the properties or URIs
to always be equivalent.

[Figure 2.5: Linked Data access methods. The figure summarises the common Linked Data query methods: resolving the original HTTP URI to HTML+RDFa, N3, or RDF/XML; using a Linked Data resolver; querying a SPARQL endpoint with a graph URI, a service description, and known links; and using a search engine.]
[Figure 2.6: Using Linked Data to retrieve information. The figure shows an HTTP GET on the original HTTP URI (e.g. http://ns1.org/resource/id2), a crawl of all discovered URIs to an arbitrary depth, a pre-crawled search engine, and a SPARQL query (DESCRIBE <http://ns1.org/resource/id2>) against an endpoint such as http://ns1mirror.org/sparql with a graph URI and a service description listing the URI prefix http://ns1.org/resource/.]


Chapter 3

Model

3.1 Overview
Scientists face a number of difficulties when they attempt to access data from multiple
linked scientific datasets, including data cleaning, data trust, and provenance related
issues, as described in Chapter 1. A model was designed to make it easy for scientists
to access data using a set of queries across a set of relevant data providers, with de-
sign features to support data cleaning operations and trusted, replicable queries. The
model addresses the trust, quality, context sensitivity and replication issues discussed
in Section 1.1.
The main design goal for the model is to provide for replicability using a mapping
layer between a scientist’s query and the actual queries that are executed. The mapping
takes two parallel forms: one maps the scientist’s query to a concrete query that can
then be directly executed, and the other maps the notion of datasets to data locations.
The scientist’s query can be mapped to different languages and interfaces as necessary,
while the textual query strings allow more than one namespace to be inferred from any
query, enabling future scientists to merge different uses of the model without affecting
backwards compatibility.
The mapping layers ensure that no one system or organisation can be a sole point
of failure for replicating a query, as long as data is freely and openly accessible. In
addition, the queries are designed to be replicated based solely on textual provenance
documents, as compared to similar mapping systems that require coded mapping pro-
grams as common components in the mapping workflow. Coded, particularly compiled,
mapping programs would limit the ability to implement the model in future using dif-
ferent technologies.
Scientists can choose to modify the system to suit their current local context, without
affecting the way other scientists perform the same query using public data providers.
This makes it possible for scientists to have contextually applicable data access to
the relevant linked scientific datasets across their research. Scientists do not have to
publicise their lack of trust or their opinion of the data quality in different datasets
to extend the system, as there is no global point of entry for the system. When the
scientist is ready to publish their results, other scientists should be able to examine

47
48 Chapter 3. Model

their queries, and the provenance of their results, as textual documents which should
lower a technological barrier to reuse, compared to implementation specific systems and
global registries.
Datasets may not be trusted if they contain out of date information; they may not
cleanly link to other datasets; or they may cleanly link to other datasets, and be up to
date, but may still not be the best choice for a scientist in their context. The model
is aimed at providing scientists with more direct control over which data providers are
accessed when queries are replicated, as current Linked Data and distributed SPARQL
query models do not fully support this notion, as shown in Figure 2.6. The model
attempts to provide a way for scientists to describe the nature of the distributed linked
dataset queries that are performed in response to a set of their queries, with an emphasis
on collaboration with other scientists and replication by scientists with access to different
facilities containing similar data.
The model enables context sensitive, distributed linked data access by mapping a
single user query to many different locations, denormalising the query based on the
location, and normalising the results to form a single set of homogeneous results. It
maps user queries to query types based on query matching expressions. This mapping
generates a set of parameters for each query type, including a set of namespaces that
are unique to the query type. For each of these parameter sets, a set of providers
which implement the query type is chosen, with providers included or excluded based on
the namespaces identified in the query parameters. The query parameters are then
processed using normalisation rules assigned to the provider, in two stages: one applied
to the query parameters, and one applied to the query after the parameters are inserted
into the query template. The resulting query may optionally be compiled and transformed
using abstract syntax transformation rules.
The query is then executed based on the type of the provider. For example, a
SPARQL query could be submitted using an SQL interface or an HTTP interface, or a
query could be a simple HTTP GET to the endpoint URL. The results are normalised
for each query on each provider in two additional stages, before and after parsing to
RDF triples. The RDF triples from all of the queries on all of the chosen providers are
then aggregated into a single pool of RDF statements with two additional normalisation
stages, before and after serialising the results to an RDF representation.
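The outline below is a minimal sketch in Python of the order in which these stages are applied for a single user query. The method names and the stage labels are assumptions made for illustration; the prototype itself is configuration driven rather than coded this way.

def answer_query(user_query, query_types, providers, parse_rdf, merge_pool, serialise):
    """Illustrative outline of the model's staged query pipeline (assumed API)."""
    pool = []
    for query_type in query_types:
        parameters = query_type.match(user_query)      # query matching expressions
        if parameters is None:
            continue                                   # query type not relevant
        for provider in providers:
            # Providers are included or excluded based on query type and namespace.
            if not provider.supports(query_type, parameters):
                continue
            # Stage 1: normalisation rules applied to the query parameters.
            params = provider.apply_rules("parameters", parameters)
            # Stage 2: rules applied after the parameters fill the query template.
            query = provider.apply_rules("query", query_type.fill_template(provider, params))
            raw_results = provider.execute(query)      # SPARQL, HTTP GET, etc.
            # Stages 3 and 4: rules applied before and after parsing to RDF triples.
            raw_results = provider.apply_rules("before_parsing", raw_results)
            pool.extend(provider.apply_rules("after_parsing", parse_rdf(raw_results)))
    # Stages 5 and 6 (simplified here): rules applied before and after the pooled
    # statements are serialised to a single RDF representation.
    return serialise(merge_pool(pool))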
A comparison of the basic components of the model to typical federated RDF
based query models is shown in Figure 3.1. The typical query models, described
in Section 2.8, all fail to see data quality as an inherent issue, preferring to focus
on semantic or RDF statement level normalisation without allowing for any syntactic
normalisation. The query model described here allows for a wide range of normalisation
methods, along with the ability to transparently integrate data from providers which
have not agreed on a particular URI model, as the typical query models all require
that URIs be identical, or that all providers have access to, and have therefore agreed
on, mappings between the different URIs.
Figure 3.1: Comparison of model to federated RDF query models

In the example from Section 1.1.2, scientists need to access data from a variety of
datasets including genomic, drug, and chemical data. However, they are unable to easily
distinguish between trusted data and untrusted data, as some datasets may be useful
for some queries but not others, and the datasets are very large, including numerous
links between datasets. They are unable to systematically alert other scientists to
irregularities in the data quality, as the current data access and publishing methods do
not provide ways for scientists to provide corrections or modifications as part of the
data access model.
The model is designed around the process of scientists asking questions that can be
answered using data from linked datasets. The question contains parameters including
the names of datasets and identifiers for data items that are relevant, along with other
information that defines the nature of the query, as shown conceptually in Figure 3.2.
For example, if the scientist in the example wanted to access information about a Drug
named Marplan in DrugBank, they could access the data using the steps shown in
Figure 3.3. To distinguish between different parts of a dataset, each dataset in the
model is given one or more namespaces based on the way the dataset has assigned
identifiers to data items internally. The parameters, which in the example include the
DrugBank Drug namespace, and the search term, “Marplan”, are matched against the
known types of queries and data providers to determine which locations and services
can provide answers to the question. These parameters are applied to templates for
the query types and providers. The data providers may not all be derived from the
DrugBank dataset, as shown by the inclusion of an equivalent search on the DailyMed
dataset in addition to the DrugBank query.
Data providers are defined in terms of their query interfaces, which types of queries
they are useful for, what namespaces they contain information about, and what data
normalisation rules are necessary. This information is used to match each provider to
a user’s query, before constructing queries for each provider based on the parameters,
namespaces and data normalisation rules. Scientists can configure different providers to
use relevant data cleaning rules in the context of both queries and providers, an
improvement on previous models that require global or location specific data cleaning
rules.
The model includes profiles that allow scientists to state their contextual preferences
about which providers, query types, and rules are most useful to them. The preferred
providers, query types, and normalisation rules will then be selected to answer questions,
even if previous scientists did not use any of the same methods to answer the question
in the past. A profile can explicitly include or exclude any provider, query type, or
normalisation rule, as well as specifying what behaviour to apply to elements that do not
exactly match the profile. To allow more than one profile to be used consistently, the
profiles are ordered, so scientists can override specific parts of any other profiles without
having to recreate or edit the existing profiles. Profiles are designed to allow scientists to
customise the datasets and queries that they want to support without having to change
provenance records that were created by previous scientists. This feature is unavailable
in other models that allow scientists to answer questions using multiple distributed
linked datasets.
[Figure 3.2: Query parameters. The data access parameters (query type Search by Namespace, namespace Drugbank Drugs, search term Marplan) are matched against two DrugBank data providers, the query is executed through each provider’s search interface, and the results are integrated into a single set of resolved data.]

[Figure 3.3: Example: Search for Marplan in Drugbank. The same parameters are matched to the DrugBank and DailyMed SPARQL endpoints at www4.wiwiss.fu-berlin.de, each with a templated SPARQL regular expression search (the DailyMed query only returns links to statements that reference DrugBank), and the results are integrated.]
Scientists can take advantage of the model to access a large range of linked datasets.
They can then provide other scientists with computer understandable details describing
how they answered questions, making it simple to replicate the answers. This goal,
generally known as process provenance, is supported by the model through the use of
a minimal number of components, with direct links between items, such as the links
between providers and query types. The model can be used to expose the provenance
for any scientist’s query consistently because each of the components is declared using
rules or relationships in the provenance record. This enables other scientists to feed
the provenance record into their implementation of the model to replicate the query
either as it was originally executed, or with substitutions of providers, queries, and/or
normalisation rules using profiles.
Complete provenance tracking requires a combination of the state of the datasets, the
scientific question that was being answered, any normalisation or data cleaning processes
that were applied to the data, and the places that the data was sourced from. The model
is able to produce query provenance details containing the necessary features to replicate
the data access, including substitutions and additions using profiles. The creation of
annotations relating to the state of a dataset is difficult in general, as scientific datasets
do not always contain identifiable revisions; however, if scientists have access to these
annotations they can include them using query types and providers that complement
their queries.
The model requires a single data format to integrate the results of queries from dif-
ferent information providers without ambiguity. This makes it possible to use and trust
multiple data providers without requiring them to have the same or even a compatible
data schema. Other distributed query models require a global schema, making it diffi-
cult to substitute data providers. The single data format, without a reliance on a single
schema, enables scientists to easily integrate data from different locations, gradually
modifying data quality rules as necessary, without initially having perfect data quality
and a global schema, as is assumed by the theory proposed in Lambrix and Jakoniene
[83]. A single extensible data format also enables scientists to include provenance annota-
tions with the data, whereas most other provenance methods need the provenance to be
stored in separate data files [17, 35, 60, 74].

3.2 Query types


The model directs query parameters to the relevant data providers and normalisation
rules using query types. For example, a scientist may initially need to find all references
to a disease, as shown in Figure 3.4, while later only needing to find references from
datasets that they trust, as shown in Figure 3.7, or are interested in, as shown in
Figure 3.5 and Figure 3.6. In each case the scientist needs to describe the way each
query would be structured for each dataset. If there are different methods of data access,
different structures, or different datasets, they need to be encapsulated in separate query
types to determine which providers, namespaces, and normalisation rules are relevant to each
query type.
Each query type represents a query, independent of the actual providers that it may
be applied to, even if it is designed to be specific to a particular provider. However, each
query type is able to recognise the namespaces that it is targeted at without reference
to providers. If a namespace is recognised it may be used to restrict the set of providers
that make the relevant query available in the context of that namespace. This structure
makes it possible to extend or replace queries that other scientists have created for all
namespaces, without changing the syntax. For example, a scientist may want to add
annotations to the data description for a namespace. They do not have to modify any
of the other query types that make it possible to get the base information in order to
add their annotations, and they do not want their annotation query to be used as a
data source for any other namespaces. In the model they simply create a new query
type that recognises the query parameters used for the base information query
types and restrict it to their namespace, as long as the namespace prefix is identifiable
in the parameter set.
When other scientists wish to replicate the query in a different context, e.g., in a
different laboratory or using a different database, they can create semantically equivalent
queries that use another data access interface. They can reuse the generic parameters as
part of their new query type, minimising the number of external changes that need to
be made for an application to support the new context. Using profiles, the model allows
scientists to easily substitute query types without having to modify the original query
types. The profiles dynamically include and exclude query types without changes to
provenance records, other than additions where necessary.
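A minimal sketch of how ordered profiles can drive this inclusion and exclusion is shown below. The dictionary keys, the default behaviour flag, and the endpoint names are illustrative assumptions rather than the prototype's actual configuration vocabulary.

def profile_allows(item, ordered_profiles, include_unmentioned=True):
    """Decide whether a provider, query type, or normalisation rule is used.
    The first profile in the ordered list that mentions the item wins, so a
    local profile can override a public one without editing it."""
    for profile in ordered_profiles:
        if item in profile.get("include", []):
            return True
        if item in profile.get("exclude", []):
            return False
    # No profile mentioned the item: fall back to the configured default.
    return include_unmentioned

# Example: a locally curated endpoint replaces a public one for this scientist.
public_profile = {"include": ["sider_public_endpoint"]}
local_profile = {"exclude": ["sider_public_endpoint"],
                 "include": ["sider_curated_endpoint"]}
for endpoint in ("sider_public_endpoint", "sider_curated_endpoint"):
    print(endpoint, profile_allows(endpoint, [local_profile, public_profile]))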
In the example shown in Figure 3.3, there was one set of parameters defined for
the two query types, with the profile defining which of the query types were used in
each situation. It was necessary to create two query types so that the scientist had a
reference to use when defining which strategy they preferred. A use case for this would
be to enable a scientist to transparently substitute their limited query into a public
provenance record which originally used any and all possible datasets as sources for the
query. They would need to do this to be sure that they could validate the results using
datasets that they had personally curated.
In comparison with typical Linked Data and Federated SPARQL query models,
as shown in Figure 3.1, the model introduces a new layer between an overall query
and the data providers. This layer makes it possible for scientists to perform query
dependent normalisation, trust specific endpoints in relation to queries without having
to specify a particular endpoint in their query, and enables scientists to reliably recreate
the query using information in the provenance record. Such contextual normalisation and
trust are not possible if the configured information is generically applied to all queries,
as the replicated query would be linked back to datasets and endpoints as the basic
model elements, instead of query types, which can contextually define namespaces and
parameter relevance. These namespaces are more abstract than datasets in models such
as VoiD, which include references to URI structures and endpoints as part of the core
dataset description.
[Figure 3.4: Search for all references to disease. The parameters (query type Find references, namespace diseasome_diseases, identifier 1689) are sent to the DrugBank, DailyMed, Sider, and MediCare datasets and the results are integrated.]

[Figure 3.5: Search for references to disease in a particular namespace. An additional search namespace parameter (sider_drugs) excludes the DrugBank, DailyMed, and MediCare providers, so only the Sider endpoint receives the templated SPARQL reference search.]
[Figure 3.6: Search for references to disease using a new query type. The query type Find references (Only Sider Drugs) uses the same SPARQL reference search, but only the Sider provider is configured to handle it, so the DrugBank, DailyMed, and MediCare providers are not queried.]

Each query type defines a relevant set of parameters for each type of data access
interface that may be used on a provider using the query type. If a provider with a new
type of data access interface is included, current query type definitions can be updated
to include a new parameterised template. These changes allow scientists to migrate
between datasets gradually, perhaps while they test the trustworthiness or reliability
of a new dataset or interface, without requiring them to change the way they use the
model to access their data. This change is important as it provides a configuration
driven method of migration, where other methods require scientists to change the way
they access the data interface to match a new dataset.
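As a sketch of this idea, a query type can simply carry one parameterised template per interface type, so supporting a new interface means adding one more template entry. The dictionary layout below is an assumption, with the templates mirroring the DOI examples in Figure 3.8 and a hypothetical DOI used purely for illustration.

# Illustrative query type with one template per data access interface.
doi_query_type = {
    "name": "Construct document from DOI",
    "parameters": ["namespace", "identifier"],
    "templates": {
        "sparql": ("CONSTRUCT { <http://bio2rdf.org/{namespace}:{identifier}> ?p ?o } "
                   "WHERE { <http://bio2rdf.org/{namespace}:{identifier}> ?p ?o }"),
        "http_get": ("http://bioguid.info/openurl.php?"
                     "display=rdf&id={namespace}:{identifier}"),
    },
}

def fill_template(query_type, interface, namespace, identifier):
    """Substitute the user's parameters into the template for one interface."""
    template = query_type["templates"][interface]
    return template.replace("{namespace}", namespace).replace("{identifier}", identifier)

print(fill_template(doi_query_type, "http_get", "doi", "10.1000/example"))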
Query types and data providers can be assigned namespaces. Each of the assigned
namespaces is recognised by the model based on parameters that the scientist uses
in their query. In some cases the assigned namespace does not need to match the
namespace in the query in order for the query type or data provider to be relevant. In
the example shown in Figure 3.4, the scientist indicated that they wanted references
to an item, including references from other datasets. This meant that although it was
important to know the namespace and identifier for the item, any query type or provider
would be relevant.
Similarly, if the scientist wanted to restrict their search for references to a specific
namespace they could add another parameter, and specify that the parameter was a
namespace that the model needed to use to plan which query types and data providers
would be relevant, as shown in Figure 3.5. If they wanted to avoid changing the query
parameters to ensure that the query was compatible with their current processing tools,
they could create a new query type with the same parameter set and only apply it to
datasets containing the namespace as shown in Figure 3.6.
Although it is easier to maintain applications that use the model when a new query
type is created, the model itself is easier to maintain if namespaces are given as parameters,
as the query may then be reusable in the context of other namespaces. It is important to
note that the actual queries executed on the data providers did not need to change in
the example shown in Figure 3.6; however, the different requirements made it necessary
to use the same query in different query types.

[Figure 3.7: Search for all references in a local curated dataset. A profile restricts the public DrugBank, DailyMed, Sider, and MediCare providers and instead includes a locally curated Sider endpoint (http://myuniversity.edu/sider_curated/sparql) for the same reference search.]

3.2.1 Query groups


In some cases there are different ways to construct queries that answer the same question,
depending on the dataset that is used to answer it. In some circumstances,
only one of these queries needs to be performed for a scientist to get sufficient infor-
mation to answer their question. However, together with normalisation rules, two queries
may be able to generate the same results from two separate providers. In these cases a
scientist may wish to group queries together to make their process more efficient, while
still having redundancy in case a provider is unresponsive. Query types that are grouped
together should have the same parameters, so that scientists do not have to know they
are using different queries, unless they check the provenance.
To reduce the number of provider queries that are unnecessarily performed, the
model includes an optional item that can be used to group query types into formal
groups according to a scientist’s knowledge that the query types are equivalent. The
groups ensure that the same query is not performed on an equivalent provider more
than once because it is known to produce a redundant set of results. The groups also
ensure that the same query may only need to be executed on one of many providers, or
provider groups.
A query group must be treated as a transparent layer that only exists to exhibit
the semantic equivalence of each of the query types it contains. A user can optimise
the behaviour of the model by choosing all or any of the query types out of each group
based on the local context. As shown in Figure 3.8, the query types within a group are
not structurally equivalent: they answer the same question, but the actual queries are
not equivalent, as the query parameters need to be translated in four different ways to
match the data provider interfaces. The example namespace, Digital Object Identifier
(DOI), is inherently difficult to query because no single organisation maintains a
complete copy of the dataset. There are currently only a small number of DOI
registration agencies, and there is useful information about DOIs available in many
different domain specific datasets, so there needs to be a variety of different types of
queries.
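The sketch below (illustrative Python; the attribute names are assumptions) shows how grouping can be used to run at most one query type per group, while ungrouped query types always run.

def select_query_types(candidates, preferred_names=()):
    """Keep every ungrouped query type, but at most one query type per group,
    preferring any explicitly nominated names."""
    chosen, seen_groups = [], set()
    for query_type in candidates:
        group = query_type.get("group")
        if group is None:
            chosen.append(query_type)
        elif group not in seen_groups:
            seen_groups.add(group)
            members = [q for q in candidates if q.get("group") == group]
            preferred = [q for q in members if q["name"] in preferred_names]
            chosen.append(preferred[0] if preferred else members[0])
    return chosen

doi_group = [
    {"name": "Bio2RDF URI", "group": "construct-from-doi"},
    {"name": "Blue Obelisk DOI", "group": "construct-from-doi"},
    {"name": "Dublin Core Identifier", "group": "construct-from-doi"},
]
print([q["name"] for q in select_query_types(doi_group, preferred_names=("Blue Obelisk DOI",))])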
[Figure 3.8: Query groups. The query group Construct document from DOI, with common parameters (namespace doi, identifier 10.*), contains four query types matched to four data providers: an untemplated HTTP GET on the BioGUID REST interface, and templated Bio2RDF URI, Blue Obelisk DOI, and Dublin Core Identifier SPARQL queries.]

3.3 Providers
Scientists gain access to datasets using many different methods; for example, they may
use HTTP web based interfaces, Perl scripts which access SQL databases, or SOAP
XML based web services. The model accommodates different access methods using data
providers that are dataset endpoints which support one or more query types. These
providers may be implemented in different ways, but each provider is linked to query
types, namespaces and normalisation rules. These links give the model information
about how a query is going to be formed, which datasets are relevant, and what data
quality issues exist. Each provider represents a single method of access to either the
whole, or part of, a dataset, and different providers can utilise the same access method
and location for a dataset depending on the scientist’s context.
Each provider is associated with a set of data endpoints that can all be accessed in
the same way. Each of the endpoints must handle all of the query types and namespaces
that are given for the provider. In the example shown in Figure 3.9, there are three
providers representing four endpoints, three actual queries, and two query types. Each
of the endpoints connected to each provider can be used interchangeably, reducing the
query load on redundant endpoints. The two endpoints connected to the Blue Obelisk
DOI query type are not equivalent, as each provider can only have one templated
SPARQL Graph URI, so they must be placed in a provider group before they can be
considered to be equivalent.
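The sketch below shows a provider definition with two interchangeable endpoints, and how a provider-specific SPARQL Graph URI can be injected through the {sparqlGraphUriStart} and {sparqlGraphUriEnd} placeholders used in Figure 3.9. The field names and the helper function are illustrative assumptions (the prototype stores this configuration declaratively), and the DOI identifier is hypothetical.

cpath_provider = {
    "name": "Bio2RDF Mirrored CPath DOI",
    "endpoints": ["http://cu.cpath.bio2rdf.org/sparql",
                  "http://quebec.cpath.bio2rdf.org/sparql"],  # interchangeable mirrors
    "graph_uri": "http://bio2rdf.org/cpath",
    "query_types": ["Bio2RDF URI"],     # query types this provider implements
    "namespaces": ["doi"],              # namespaces it contains data about
    "normalisation_rules": [],          # rules applied to its queries and results
}

def apply_graph_placeholders(template, provider):
    """Wrap the query in a GRAPH block only when the provider defines one."""
    graph_uri = provider.get("graph_uri")
    start = "GRAPH <%s> { " % graph_uri if graph_uri else ""
    end = " }" if graph_uri else ""
    return (template.replace("{sparqlGraphUriStart}", start)
                    .replace("{sparqlGraphUriEnd}", end))

template = ("CONSTRUCT { <http://bio2rdf.org/doi:10.1000/example> ?p ?o } "
            "WHERE { {sparqlGraphUriStart}"
            "<http://bio2rdf.org/doi:10.1000/example> ?p ?o "
            "{sparqlGraphUriEnd}}")
print(apply_graph_placeholders(template, cpath_provider))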
[Figure 3.9: Providers. Three providers expose four endpoints for the Construct document from DOI query group: a mirrored Bio2RDF CPath provider with two interchangeable endpoints and the SPARQL Graph URI http://bio2rdf.org/cpath, a FarmBio NMRShiftDB provider with the graph URI urn:graphs/nmrshiftdb, and a localhost Blue Obelisk DOI provider with no graph URI; the templated queries use {sparqlGraphUriStart} and {sparqlGraphUriEnd} placeholders.]

3.3.1 Provider groups

Providers are thin encapsulation layers over endpoints. However, they do not provide
the necessary elements on their own to encapsulate data sources that do not have
identical interfaces. Although a simple solution is to implement two providers, this is
not efficient, as they will both be used in parallel for all of the relevant queries. A more
useful solution groups semantically equivalent providers, allowing the implementation to
recognise that providers do not all have to be implemented in the same way to be
equivalent in terms of the RDF triples that they will return in response to the same
user queries.
In the example shown in Figure 3.10, two providers have equivalent SPARQL Graph
URIs, so they do not need to be grouped, as the interface URL can simply be duplicated
on a single provider. Two other providers match the same query and can be used
equivalently, but they have different SPARQL Graph URIs, so it is necessary to use a
provider group to make sure that the model can recognise that the two providers are
equivalent.
[Figure 3.10: Provider groups. The two CPath endpoints share the same SPARQL Graph URI and so can sit on a single provider, while the NMRShiftDB and localhost Blue Obelisk DOI providers have different graph URIs and are placed in an NMRShiftDB provider group to mark them as equivalent.]

Provider groups do not functionally change the nature of a query unless users decide
not to use all redundant, equivalent providers as sources for a query. In this case,
the provider groups encapsulate a set of heterogeneous providers which are equivalent,
and which could be considered substitutes for any of the query types on all of the
providers, based on the query groups that the query types are members of. Any of the
providers could then be chosen at random, or using another strategy, to answer a given
query.

3.3.2 Endpoint implementation independence


The system does not require SPARQL endpoint access, as other resolution methods
could be designed and implemented as part of the current provider model; however,
the result of each query needs to be in RDF to be included with the results of other
queries. In particular, if RDF wrappers are written for the Web Services that currently
form the bulk of the bioinformatics processing services, they can be included as sources
of information for queries that they are relevant to.
In some cases, there is a need to provide simple static mappings between data items
and other known URIs. The number and scale of these mappings may make it more
efficient to dynamically generate the mapping on request. In these cases, the mapping
tool could be located in the model if the information content required for the mapping
is provided in the query, or it can be created with a lightweight widget that takes
the initial data item as input and provides the relevant RDF snippet as output. In
Figure 3.8, there are four different providers: one direct HTTP URL that can
be accessed using HTTP GET, while the others must be accessed using HTTP POST.
The information available from each provider may be equivalent; however, the endpoints
must all be accessed differently in response to the same query.
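For example, a minimal sketch of two access methods that both return RDF for the same query, a templated HTTP GET resolver and a SPARQL endpoint accessed with HTTP POST, is shown below. The helper functions are assumptions for illustration and use only the Python standard library.

import urllib.parse
import urllib.request

def fetch_via_get(url_template, namespace, identifier):
    """Resolve a templated REST URL (e.g. the BioGUID resolver in Figure 3.8)."""
    url = (url_template
           .replace("{namespace}", urllib.parse.quote(namespace))
           .replace("{identifier}", urllib.parse.quote(identifier)))
    with urllib.request.urlopen(url) as response:
        return response.read()                       # RDF serialisation as bytes

def fetch_via_sparql_post(endpoint, sparql_query):
    """Send a SPARQL query to an endpoint using an HTTP POST form submission."""
    data = urllib.parse.urlencode({"query": sparql_query}).encode("utf-8")
    request = urllib.request.Request(
        endpoint, data=data,
        headers={"Accept": "application/rdf+xml"})   # ask for RDF results
    with urllib.request.urlopen(request) as response:
        return response.read()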

3.4 Normalisation rules


Normalisation rules are transformations that allow scientists to define what changes are
necessary to the results of queries from particular providers to form a clean set of results.
Although the ability to assign transformations to the results of queries is not novel, in
the context of this work, it enables scientists to utilise simple alternatives to coding or
arbitrary workflow technology in most cases, although the transformations themselves
may be complex depending on the situation. Normalisation rules allow different scien-
tists to express their opinions about data by changing the way it is represented to fit
their context. The changes may be simply related to data quality, but they may also
be related to trust, as untrusted elements can be removed or the weight of an assertion
can be changed using the rules. The level of trust that scientists put into the results of
queries could be increased by publishing these modifications along with the data, so as
not to portray the information as being directly sourced from the original provider.
As well as allowing scientists to define and share their ideas about what changes
need to be made to data sourced from particular providers, normalisation rules can be
formulated to apply to more than one provider as applicable, making them available
as reusable tools that may apply to many entities. This may reduce the complexity of
dealing with a large number of datasets, if at least some of them are compatible, even
if they do not match the normalised form that this model expects.
The data cleaning process is more difficult for science datasets compared to business
datasets. The most well known data cleaning methods involve business scenarios, where
the conventions for what clean data looks like are well recognised. For example, it is
easy to identify whether a customer record is complete if it contains a valid address and
phone number, and the email address fits a simple pattern. In comparison, in science
it is difficult to identify whether a gene record is complete, as it may be valid, but
not be linked directly to a known gene regulation network, or it may be valid but not
have a protein sequence, as it is a newly discovered type of gene that is not used to
generate proteins. The scientific data quality conditions rely on a meaningful, semantic
understanding of the data, where data is assumed to be clean in business scenarios if it
is syntactically correct. The normalisation rules used with the model need to support
both syntactic and semantic normalisation, within the bounds of the amount of data
that is accessible to the model.
In the example shown in Figure 3.11, there are three different DOI schemas, the
BIBO, Blue Obelisk and Prism vocabularies. The datasets in the example contain
three different types of URIs, Bio2RDF, BioGUID and NMRShiftDB, so there must be
both semantic and syntactic normalisation rules to integrate the results into a single ho-
mogeneous document. In the case of one property, the Dublin Core Identifier, there are
four alternatives that have not been normalised to demonstrate the range of data that
is available. The results of the query may include data from the four endpoints, how-
ever, in the example, there are no results from NMRShiftDB or the Bio2RDF endpoints
for this DOI. The schema that was chosen by the scientist as the most applicable was
the Prism vocabulary, so the normalisation rules mapping the Prism vocabulary to the
BIBO and Blue Obelisk vocabularies are not shown. The URI scheme that was preferred
was the normalised Bio2RDF URI, so the mapping rules are shown for NMRShiftDB
and BioGUID to Bio2RDF. The normalisation rules were only applied to the results
from BioGUID in the example as it was the only provider that returned information,
and it contained redundant information given using two different vocabularies.
[Figure 3.11 shows a query distributed across four data providers: a REST interface at http://bioguid.info/openurl.php and SPARQL endpoints at http://cpath.bio2rdf.org/sparql, http://pele.farmbio.uu.se/nmrshiftdb/sparql and http://uniprot.bio2rdf.org/sparql. Three templated query types (Bio2RDF URI, Blue Obelisk DOI and Dublin Core Identifier) belong to the query group "Construct document from DOI", with the common parameters namespace "doi" and identifier "10.1080/0968768031000084163". URI normalisation rules replace http://bioguid.info/pmid: with http://bio2rdf.org/pubmed: and http://pele.farmbio.uu.se/nmrshiftdb/?bibId= with http://bio2rdf.org/nmrshiftdb_bib:, while a schema normalisation rule replaces http://purl.org/ontology/bibo/doi with http://prismstandard.org/namespaces/2.0/basic/doi.]

Figure 3.11: Normalisation rules

In this example the model is able to syntactically normalise the data, but without
some further understanding of the way the DOI was meant to be represented and
used, it cannot semantically verify that the data is accurate. Further rules, that are
implemented after the data is imported into the system or after it is merged with other
results, could contain the semantically valid conditions that could be used to remove or
modify the results.
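A minimal sketch of the syntactic rules in this example, expressed as textual replacements over a serialised document; the patterns are taken from Figure 3.11, while the sample statement is purely illustrative.

import re

URI_RULES = [
    (re.compile(re.escape("http://bioguid.info/pmid:")),
     "http://bio2rdf.org/pubmed:"),
    (re.compile(re.escape("http://pele.farmbio.uu.se/nmrshiftdb/?bibId=")),
     "http://bio2rdf.org/nmrshiftdb_bib:"),
]
SCHEMA_RULES = [
    (re.compile(re.escape("http://purl.org/ontology/bibo/doi")),
     "http://prismstandard.org/namespaces/2.0/basic/doi"),
]

def normalise(document, rules):
    # Apply each textual rule to the serialised results from a provider.
    for pattern, replacement in rules:
        document = pattern.sub(replacement, document)
    return document

sample = ('<http://bioguid.info/pmid:15456405> '
          '<http://purl.org/ontology/bibo/doi> '
          '"10.1080/0968768031000084163" .')
print(normalise(normalise(sample, URI_RULES), SCHEMA_RULES))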
Normalisation rules are used at various stages of the query process, as highlighted in
Table 3.1. Normalisation rules are applied at the level of data providers to provide the
most contextual normalisation with the least disruption to generally used query types.
For each query type on each provider, the normalisation rules are used on each of these
separate streams. All normalisation rules from all streams may be applicable to the
two final stages after all results are merged into the final pool of results, and after these
complete results are output to the desired file format.

Stage Mode
1. Query variables Textual
2. Query template before parsing Textual
3. Query template after parsing In-Memory
4. Results before parsing Textual
5. Results after parsing In-Memory
6. Results after merging In-Memory
7. Results after serialisation Textual

Table 3.1: Normalisation rule stages

Initially, normalisation rules are used for each query type before template variables
are substituted into the relevant template. After the template is ready, normalisation
rules may be applicable to the entire template, including when the template is in textual
form or after the template is parsed by the system to produce an in-memory represen-
tation of the query. This enables scientists to process the variables independently of the
template, before processing the template to determine if it is consistent, if this process
cannot be performed using the variables alone.
The normalisation rules are then used on the textual representation of the results,
if this is available with the relevant communication method. The normalised text is
then parsed into memory using the single data model that the system recognises, and
normalisation rules are applied to the in memory representation of the results before
merging results from different streams. After the results are merged, normalisation
rules are applied to the in-memory representation of the results before serialising the
complete pool of results to an output format. In most cases, this output format will
be the requested format. If the requested format is not compatible with the single data
model, the pool of results must either be serialised in a different way based on the
in-memory representation, or a textual transformation must be applied to the serialised
output.
The same normalisation method would not be used for each stage. For example,
the query variables are normalised using textual transformations, while the parsed in-
memory results are normalised using algorithms based on the common data model. An
implementation would need to support each of the different methods that were used
with normalisation rules attached to each relevant provider before a query could be
successfully executed. There is no guarantee that leaving out a normalisation rule will
produce a consistent or useful result, unless the scientist has directed in their profiles
that the normalisation rule is not useful to them.
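The following sketch shows how textual rules might declare the directions and stages they apply to, following Table 3.1 and the input/output distinction used in Section 3.7.1; the rule structure and the example pattern are assumptions rather than the prototype's configuration format, and the in-memory stages would use graph operations instead of regular expressions.

import re

QUERY_STAGES = {"query variables", "query template before parsing"}
RESULT_STAGES = {"results before parsing", "results after serialisation"}

class TextualRule:
    def __init__(self, input_match, input_replace, output_match, output_replace, order=0):
        # Input rules denormalise queries for the endpoint; output rules map
        # endpoint-specific results back to the normalised form.
        self.input_match = re.compile(input_match)
        self.input_replace = input_replace
        self.output_match = re.compile(output_match)
        self.output_replace = output_replace
        self.order = order

def apply_textual_rules(rules, stage, text):
    for rule in sorted(rules, key=lambda r: r.order):
        if stage in QUERY_STAGES:
            text = rule.input_match.sub(rule.input_replace, text)
        elif stage in RESULT_STAGES:
            text = rule.output_match.sub(rule.output_replace, text)
    return text

# Hypothetical rule for an endpoint that stores bare PubMed identifiers.
rule = TextualRule(
    input_match=re.escape("http://bio2rdf.org/pubmed:"), input_replace="",
    output_match=r"\b(\d{6,8})\b", output_replace=r"http://bio2rdf.org/pubmed:\1",
)
print(apply_textual_rules([rule], "query variables", "http://bio2rdf.org/pubmed:15456405"))
print(apply_textual_rules([rule], "results before parsing", "pmid 15456405"))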

3.4.1 URI compatibility

Although normalisation rules are generic transformations, they are a useful tool when
applied to the case of queries distributed across a set of linked datasets. Although
many datasets contain references to other datasets, the format for references to the
same dataset is not generally consistent across datasets. Even if a resolvable URI is
given as a reference, the resolution point is contained in the reference, along with the
context specific information that defines which item is being linked to. A naive approach
to resolving the URI would be to use the protocol and endpoint that are given to find
more information about the item. However, if the resolution point is not controlled
by the user, any data quality measures using normalisation rules, or redundancy using
multiple providers, may not be utilised. To solve this issue, normalised URIs can be
created that scientists can easily resolve using resources that they control. If normalised
URIs are given as a basic standard, they can be further utilised to change the references
in documents to match the scientist’s context, making it simple to continue using the
model described here to maintain control over the way the data is represented and
the way references are resolved. This enables scientists to work with data from many
datasets, but choose where to source the information from according to their context.
Although this may tend to encourage the explosion of many different equivalent
URIs, if the URIs are resolvable by other users, the new URIs may still be useful. If
there is a standard format for the URIs, and a mapping between the local URIs and
that set can be published as normalisation rules, scientists could recognise the link and
normalise it before resolving it to useful information. This can be performed without
continuously relying on one scientist for infrastructure, as long as a rough network is
established to share information about providers and normalisation rules, along with
any useful query types and namespace information.
A goal of this research is to both make information recognisable and trustworthy,
while simultaneously making it accessible in many different contexts. Along with social
agreements about link and record representations, the use of this model to represent
what data is available, and what the access methods are, makes it possible to decentralise
infrastructure. Even if there is still a global resolution point, the use of the resolution
point can be mirrored in other circumstances to minimise interruptions to research.
The model is able to do this without a new file format, access protocol or resolution
interface.
By contrast, the LSID protocol was designed to solve data interoperability and access
issues for life sciences. However, it has not been extensively used. There are likely a
range of reasons, but with respect to this research, it may not have been widely adopted
due to its insistence on a completely new data access protocol, a range of resolution
methods, and the lack of a single data format, although it did require the use of RDF
for metadata.

The use of LSIDs would have enabled scientists to recognise the LSID protocol based
on the URI, and access original data format files using one of the supported interfaces.
However, it did not include the ability to apply transformations to the information,
as it insisted that data remain permanently static after its publication. In addition,
scientists still needed to be able to process and supply data in different formats, making
it difficult, or impossible, to merge data from different sources.
Federated SPARQL approaches require that the URIs used on one endpoint to
represent a resource must be exactly the same as the URIs used on other endpoints to
represent that resource, as any other URI cannot be presumed to be equivalent without
specific advice. The normalisation rules in this model are not explicitly linked to the
namespace model because they are more relevant to providers than to namespaces, and
the definition of whether the URI is equivalent is dependent on the provider rather than
the namespace definition. An example of this can be seen in Figure 3.12, where different
providers have different names for the same item and the same query template can be
used with all of the providers.

[Figure 3.12 shows a search for references to a disease (query type "Find references", namespace "diseasome_diseases", identifier "1689") distributed using a single SPARQL query template, CONSTRUCT { ?s ?p <${normalisedURI}> . } WHERE { ?s ?p <${endpointSpecificURI}> . }. The template is normalised for each provider, for example to http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseases/1689 on the DrugBank endpoint and to urn:diseasome:diseases:1689 on the Dailymed endpoint, both mapped back to http://bio2rdf.org/diseasome_diseases:1689. The providers include the DrugBank, Dailymed, Sider and MediCare SPARQL endpoints at http://www4.wiwiss.fu-berlin.de/ and a private disease review endpoint, and the results are integrated into a normalised set of data.]

Figure 3.12: Single query template across homogeneous datasets

The use of normalisation rules enables the URIs used on each provider to be nor-
malised before being returned to users, so scientists can recognise the URI, and the
software can recognise where to perform further queries that relate to that namespace.
In order not to remove the link to the previous context, normalisation rules can be
used together with a static statement of equivalency back to the alternative URIs that
were used on any of the endpoints, so scientists can have a list of URIs that relate
to particular resources, although they may still focus on the normalised URI to make
further querying with the model easier. A Federated SPARQL system could be de-
signed using the configuration model, without references to query types or parameters,
by recognising what normalisation rules and namespaces are applied to each known
provider, removing the reliance on the preconfigured query types for arbitrary SPARQL
endpoints. Although the namespaces are designed to be recognised in the context of
the query types, the model could be implemented with named standard parameters to
provide efficient access to the places where namespaces are known to be located.
The model allows the definition of rules that remove particular statements from the
results. These rules make it possible to reduce the amount of information as necessary
without changing the interface for the provider.
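A minimal sketch of such a removal rule, applied at the parsed-results stage and assuming rdflib as the in-memory model; the predicate and sample statements are illustrative only.

from rdflib import Graph, URIRef

def remove_statements(graph, predicate_uri):
    # Drop every statement that uses the given predicate from the in-memory
    # results, without changing how the provider itself is accessed.
    graph.remove((None, URIRef(predicate_uri), None))
    return graph

results = Graph()
results.parse(data="""
@prefix dc: <http://purl.org/dc/elements/1.1/> .
<http://bio2rdf.org/pubmed:15456405> dc:identifier "pubmed:15456405" ;
    dc:title "An example title" .
""", format="turtle")

remove_statements(results, "http://purl.org/dc/elements/1.1/identifier")
for statement in results:
    print(statement)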

3.5 Namespaces
The namespaces component is centred around identifying the ways that a dataset is
structured. There are two main considerations for the decision to split a dataset into
namespaces: the way that identifiers are used and the way that properties are used.
In a traditional relational database, primary keys do not have to be given contexts,
because they inherit their nature from the table they are located in. In non-relational
datasets there is still a need to define unique identifiers for items, however as the con-
straints are not given by the structure of the database, the identifiers must be assigned
using some other method. This model does not seek to convert all datasets into a rela-
tional compatible model, where each aspect is normalised and given a unique identifier
set based on its data type, as there may be benefits to sharing an identifier set across
different types of items in the same dataset. These benefits include improved acces-
sibility stemming from a simpler method of searching the dataset without previously
knowing its internal schema. Namespaces can be used to define the scope of an internal
identifier, so scientists can know the context that the identifier was defined in, although
the namespace may not have a clear semantic meaning if data items of different types
are mixed in a single namespace.
The second consideration, relating to properties, stems from the requirement that
the model supports multiple datasets that do not have a single schema for all data
items. Scientists cannot be assured that a given property will exist, although queries
for an item may not cause errors. To specify which properties actually exist, namespaces
may be used to link terminologies, such as ontologies, to datasets, specifically through
particular providers of those datasets. If a provider is an interface for a particular
property, then a namespace can be used to indicate that the provider is an interface for
that property. This is possible even if the namespace is not being used to identify the
data items using the property or the location of the property definition.
In some cases, the set of queries attached to a provider can be used to provide
hints about which properties may be supported by a provider, particularly when the
query type has been annotated as being relevant to a particular property via a known
namespace.
The normalisation of URIs is based on the existence of identifiers that represent
parts of datasets that have unique identifiers available inside of them. These identifiers
need to be easily translated into different forms without ambiguity. As there is not
always a direct link between the normalisation rule component of the model and the
namespaces, any links that are made between the normalisation rules that implement the
translation and the namespaces are informative, rather than normative. Not all normalisation rules
relate to namespaces and identifiers, and not all namespaces require normalisation rules.
If the model required links between normalisation rules and namespaces, the coverage
would not be complete, and the link may not be obvious outside of the context of
providers.
The use of the authority property in the namespace definitions enables scientists to
define the provenance of a namespace along with its use, as the namespace URI is used
as the link by all other parts of the model. This provides scientists with a point of
contact if they have queries about the definition or use of the namespace.
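A minimal sketch of a namespace definition combining an identifier pattern, a preferred URI template and an authority; the specific values below are assumptions for illustration rather than published definitions.

import re

class Namespace:
    def __init__(self, prefix, identifier_regex, uri_template, authority):
        self.prefix = prefix
        self.identifier_regex = re.compile(identifier_regex)
        self.uri_template = uri_template
        self.authority = authority

    def is_valid_identifier(self, identifier):
        # Syntactic check only; semantic validity still depends on the dataset.
        return self.identifier_regex.fullmatch(identifier) is not None

    def preferred_uri(self, identifier):
        if not self.is_valid_identifier(identifier):
            raise ValueError(f"{identifier!r} is not valid in namespace {self.prefix}")
        return self.uri_template.format(prefix=self.prefix, identifier=identifier)

doi = Namespace(
    prefix="doi",
    identifier_regex=r"10\.\d{4,9}/\S+",
    uri_template="http://bio2rdf.org/{prefix}:{identifier}",
    authority="http://bio2rdf.org",
)
print(doi.preferred_uri("10.1080/0968768031000084163"))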

3.5.1 Contextual namespace identification

The model does not require that namespaces be identified globally. A particular iden-
tifier, generally associated with a single namespace, may have different meanings when
applied to different queries, so it may be identified as one namespace in one query, and
another namespace in another query. The contextual nature of namespaces makes them
useful for integrating different datasets into the same system. Although it may be more
applicable to normalisation rules, the ability to decide what the meaning of a namespace
is for a particular query makes it possible to modify the common representation for a
document to match the locally understood representation. This may include avoiding
making large assertions about things that are not necessarily applicable to one dataset
or query, even though they may be applicable to other datasets. The design of the model
allows scientists to include only specific namespaces in both query types and providers,
so scientists can decide that a namespace is only applicable to a set of queries without
requiring reference to providers of information, or the namespace can be deemed only
relevant to a particular provider.
The namespace mechanism is designed with the idea that references from the en-
vironment would be mapped into the model in one or more ways, allowing different
scientists to assign the same external reference, such as a short prefix that identifies
different things in different contexts, to more than one namespace. This allows
scientists to reuse other model based information without semantic conflicts with their
own models. There are likely to be many situations where this support is necessary once
the model is deployed independently across different areas by different authorities. To
allow computational identification of conflicts, the namespace mechanism must contain
a direct link to the authority that assigned the namespace. This allows scientists to
identify cases where a single authority has assigned the same reference, such as a prefix,
to more than one case, indicating that the authority did not have a clear purpose for
the string. It also allows scientists to trust namespaces based on the authority, although
if there are untrusted sources of data the reference to the authority may not carry the
weight that a set of namespace definitions from a completely trusted authority would.

3.6 Model configurability


The system is designed so that semantically similar queries will be compatible with many
different data site patterns. Public definitions of which queries are semantically similar,
or relevant in a particular case, can be included or excluded, and local, private and
efficient queries can be added or substituted using profiles. The namespace classification
is designed so that it relies on functional keys, although the functional key may look like
a URI. The functional key may match more than one namespace classification, enabling
scientists to take advantage of currently published classifications in some cases, while
concurrently defining their own internally recognised, and maintained, namespaces.

3.6.1 Profiles

The profile component was included to allow varying levels of flexibility with respect
to the inclusion or exclusion of each of the profileable components, i.e., query types,
providers and normalisation rules. Profiles make it simple for scientists to customise the
local configuration by adding their own RDF configuration snippets into the set of RDF
statements that are used to configure the system. The semantics surrounding profiles
provide for input from both scientists and public configuration creators, although local
scientists can always override the recommendations given by configuration creators. The
profiles are also stacked, enabling scientists to define higher or lower level profiles to
extend and filter the set of publicly published configurations. For a query to be executed
on a particular provider, both the query and the provider must be included by one of
the profiles. If this process results in the inclusion of the provider for that query, the
profile list can also be used to determine which of the normalisation rules are used to
transform the input and output.
If a lower level, more relevant profile explicitly excludes or includes a component,
then the higher level, less relevant profiles will not be processed for that component.
If a profile specifies that it is able to implicitly accept query types, providers, or
normalisation rules, then the default include or exclude behaviour specified by the
author of the configuration is used to decide whether the component matches the
profile; the same default behaviour applies if no profile matches at all. The default
include or exclude behaviour was designed to allow a configuration editor to suggest
to scientists using the configuration that they are either not likely to find the
component relevant, or that the component relates to an endpoint that they are not
likely to be able to access from the public web.
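A minimal sketch of this layered decision, with plain dictionaries standing in for the RDF configuration; the URIs and key names are assumptions.

def is_included(item, profiles, default_include=True):
    # Profiles are checked from most to least relevant; the first explicit
    # decision wins, and implicit acceptance falls back to the default
    # behaviour suggested by the configuration author.
    for profile in profiles:
        if item["uri"] in profile.get("exclude", ()):
            return False
        if item["uri"] in profile.get("include", ()):
            return True
        if profile.get("implicitly_accept", False):
            return item.get("default_include", default_include)
    # No profile made a decision at all.
    return item.get("default_include", default_include)

local_profile = {"exclude": {"http://example.org/provider/public-mirror"}}
public_profile = {"implicitly_accept": True}
provider = {"uri": "http://example.org/provider/public-mirror", "default_include": True}
print(is_included(provider, [local_profile, public_profile]))  # False: local override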
The simplest possible configuration consists of a single query type and a single
provider, as shown in Figure 5.2. The query type needs to be configured with a regular
expression that matches scientist queries. The provider needs to be configured with both
a reference to the query type, and an endpoint URL that can be used to resolve queries
matching the query type. Although the example is trivial, in that the user’s query is
directly passed to another location, it provides an overview of the features that make
up a configuration. One particular feature to be noted is the use of the profile directive
to process profile exclude instructions first, and then include in all other cases. In this
example, there are no profiles defined, resulting in the query type and the provider
being included in the processing of queries that match the definitions.
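A sketch of what that simplest configuration might contain, using Python dictionaries in place of the RDF statements that the prototype actually uses; the URIs, key names and endpoint address are illustrative assumptions.

minimal_configuration = {
    "query_types": [
        {
            "uri": "http://example.org/query/resolve-uri",
            # Regular expression matching scientist queries of the form
            # "namespace:identifier".
            "input_regex": r"^(?P<namespace>[\w-]+):(?P<identifier>.+)$",
            "template": (
                "CONSTRUCT { <http://bio2rdf.org/{namespace}:{identifier}> ?p ?o } "
                "WHERE { <http://bio2rdf.org/{namespace}:{identifier}> ?p ?o }"
            ),
        }
    ],
    "providers": [
        {
            "uri": "http://example.org/provider/local-sparql",
            # The provider references the query type and supplies the endpoint
            # that resolves queries matching it.
            "query_types": ["http://example.org/query/resolve-uri"],
            "endpoint": "http://localhost:8890/sparql",
            "method": "POST",
        }
    ],
    "profiles": [],  # No profiles: the query type and provider are included.
}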

Although profile decisions need to be binary to ensure replicability, the layering of


different profiles makes it possible to have various levels of trust. A complex layered
set of profiles may be confusing to users, but scientists can preview the included and
excluded items independent of an actual query. This makes it possible to assert levels
of trust before designing experiments.

If the set of configurations that the scientist uses for data access is refreshed period-
ically, possibly from an untrusted source, scientists would need to preview the changes,
especially if any profiles have implicit includes, and a configuration is not completely
trusted. This may be a security issue, but if static configuration sources are used, then
the profiles will have the effect of securing the system as opposed to making it more
vulnerable. This security results from the choice of providers based on their relevancy,
and, if needed, from the ability to restrict all queries to a set of curated providers that
will not change when a public configuration is updated, as may happen at regular intervals.

3.6.2 Multi-dataset locations

In the case where a particular query needs to operate on a provider where more than
one dataset needs to be present, the model allows for the query to specify which are
the namespace parameters, along with a definition of how many of the namespace
parameters need to match against namespaces on any eligible providers. This allows for
the optimisation of queries, without relying on scientists to create new query types for
each case where possible.

This allows scientists to create local data silos where they store copies of each
of the most used datasets, while still being able to link to distributed SPARQL query
endpoints in some cases. This allows scientists to gain the benefits of the distributed
and aggregated datasets without losing the consistent linking and normalised documents
that this model can provide.

The model allows groupings of both equivalent providers and endpoints, and allows
scientists to contribute their information to a shared configuration. The shared
configuration is not limited to a single authority, as the configurations for the model are
designed to be shareable, and independent of the authority that created them originally,
apart from their use of Linked Data URIs as identifiers for items in the configuration
documents.

3.7 Formal model algorithm


The formal definition of how the model is designed to work is given in this section. It
does not contain some features that were included in the prototype, such as the choices
based on redundancy that are provided by provider and query groupings; however,
these can be included by restricting the range of the loops over query groups (qg) and
each of the provider groups (pg). The use of query and provider groups is optional, to
allow for the simplest mechanism of providing and publishing configuration information
using the model. If a query type, or a provider does not appear in any group, they
should be included as long as they pass the tests for inclusion for the scientist query.

1. Let u be the scientist query

2. Let r be the overall set of inputs given in u;

3. Let f be the ordered set of profiles that are enabled;

4. Let π be the pool of facts that will contain the results of the query;

5. Let Q be the set of query types in the global set of query types (GQ) that match
u

6. Let QG be the set of query groups in the global set of query groups (QGS) that
match u

7. Let P be the global set of providers

8. Let P G be the global set of provider groups

9. Let N be the global set of namespaces

10. Let GN R be the global set of normalisation rules to be used to modify queries
and data resulting from queries

11. For each query group qg in QG

12. For each query type q in qg;

(a) Check that q matches the constraints of the set of profiles in f , and if it does
not match then go to the next q
(b) If q matches the conditions in f ;
(c) Let np be the set of namespace parameters identified from r in the context
of q
(d) Let nd be a set of the set of namespace definitions in N which can be mapped
from each element of np
(e) Let nc be the set of namespace conditions for q
(f) If nd does not match q according to nc, then go to the next q

(g) If nd matches q according to nc;

(h) For each provider group pg in P G

(i) For each provider p in pg;

i. Check that p matches the constraints of the set of profiles in f , and if it


does not match then go to the next p
ii. If p matches the conditions in f ;
iii. If p matches q according to nc and nd, including the condition that nd
may not need to match p if this condition is included in nc
iv. Let s be the set of normalisation rules that are attached to p, and let η be
an empty set that will hold the normalised results from p;
v. Substitute inputs from r in any templates in q, using rules that are
defined in p and relevant to query parameter denormalisation stage
vi. Substitute inputs from r in any templates in p, using rules that are
defined in p and relevant to query parameter denormalisation stage
vii. Parse q into an in memory representation mq if necessary for the com-
munication type of provider p
viii. Normalise mq based on any normalisation rules defined in p that are
relevant to the parsed query normalisation stage
ix. Let α be the document that is returned from querying q on p in the
context of r
x. Normalise α according to the rules defined for p that are relevant to the
before results parsing stage
xi. Parse the normalised α and then normalise the facts based on the nor-
malisation rules defined in p that are relevant to the parsed results before
pool stage
xii. Then include the normalised results in η
xiii. Let β be the set of non-resolved facts found by parsing the static tem-
plates from q after substituting parameters from the combination of r,
q, p, and nd
xiv. Normalise β according to s, and include the results in η
xv. Include η in π

13. Normalise statements from π which match the after pool results stage in the
ordered set of normalisation rules relevant to all providers in P

14. Serialise facts in π

15. Normalise the serialised facts which match the after pool serialisation stage in the
ordered set of normalisation rules relevant to all providers in P

16. Output the serialised document as the result of the scientist query u
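The following condensed sketch mirrors the control flow of the algorithm above. The helper callables stand in for the configuration-specific behaviour (profile matching, namespace checking, template substitution, endpoint access, parsing, normalisation at each stage and serialisation); they are placeholders, not part of the prototype's API.

def run_query(user_query, profiles, query_groups, provider_groups, helpers):
    pool = []                                        # pi: the pool of facts
    inputs = helpers["extract_inputs"](user_query)   # r
    for query_group in query_groups:                 # qg in QG
        for query_type in query_group:               # q in qg
            if not helpers["profile_allows"](query_type, profiles):
                continue
            namespaces = helpers["namespace_parameters"](inputs, query_type)  # np -> nd
            if not helpers["namespaces_match"](namespaces, query_type):       # nc
                continue
            for provider_group in provider_groups:   # pg in PG
                for provider in provider_group:      # p in pg
                    if not helpers["profile_allows"](provider, profiles):
                        continue
                    if not helpers["provider_matches"](provider, query_type, namespaces):
                        continue
                    rules = provider["normalisation_rules"]                    # s
                    query = helpers["substitute"](query_type, provider, inputs, rules)
                    query = helpers["normalise_parsed_query"](query, rules)
                    raw = helpers["execute"](provider, query)                  # alpha
                    raw = helpers["normalise_results_text"](raw, rules)
                    facts = helpers["parse"](raw)
                    facts = helpers["normalise_parsed_results"](facts, rules)  # eta
                    static = helpers["static_facts"](query_type, provider, inputs, namespaces)  # beta
                    facts += helpers["normalise_parsed_results"](static, rules)
                    pool.extend(facts)
    pool = helpers["normalise_pool"](pool)            # after pool results stage
    document = helpers["serialise"](pool)
    return helpers["normalise_serialised"](document)  # after pool serialisation stage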



3.7.1 Formal model specification


The model requires that the configuration of each item includes a number of mandatory
features. The list of features that are required for each part are given in this section.
This list of features may be extended by prototypes if they need information that is spe-
cific to the communication methods that they are using. For example, an HTTP based
web application may require additional information for each query type to determine
the way different user queries match based on their use of different HTTP methods such
as HTTP GET and HTTP POST.

Query type

A query type should contain the following features:

• Input query parameters : The information from the user, including namespaces
where required.

• Namespace distribution rules : i.e., which combinations of namespaces are accept-
able, and, if it is not clear, how the namespace identifiers are given in the query.

• Whether namespaces are relevant to this query : If namespaces are not relevant,
the namespace distribution parameters are ignored.

• Whether to include endpoints which do not fit the namespace distribution param-
eters but are marked as being general purpose.

• Query templates : The templates for the queries on different endpoint imple-
mentations should be given in the query, with parameters in the template where
available.

• Static templates : Templates that are relevant to the use of the query, but do not
rely on information from the endpoint, other than the provider parameters. These
templates represent facts that can be directly derived from the queries, and will
not be subject to any normalisation rules that are specific to this query type and
the provider it is executed using, with the exception of normalisation rules that
are applied after the facts are included in the output pool.

Provider

A Provider should contain the following features:

• Namespaces : Which namespace combinations are available using this provider.


This may be omitted in some cases if the provider may contain information about
any namespace, but it is hard to pinpoint which ones it may actually contain at
any point in time.

• Whether this provider is general purpose : This is used to match against any
query types that are applied to this server that claim to be applicable to general
purpose endpoints.

• Query types : Which query types are known to be relevant to this provider. If a
system does not require preconfigured queries to function using the model these
may be omitted. An example of where they may be omitted may be in the use
of the normalisation, namespace and provider portions in a federated SPARQL
implementation.

• Endpoint communication type : The communication framework that is used for


this provider. For example, HTTP GET, HTTP POST, etc.

• Endpoint query protocol : The method that is used to query the endpoint. For
example, SPARQL, SOAP/XML etc.

• Normalisation rules : Which rules are necessary to use with the given queries on
the given namespaces for this endpoint.

Normalisation rule

A normalisation rule should contain at least one of the following features:

• Input rule: used to map normalised information into whatever scheme the relevant
endpoints require.

• Output rule: used to map endpoint specific data back into normalised information.

• Type of rule: For example, Regular expression matching or SPARQL.

• Relevant stages : When the rule is to be applied. For example, before query
creation, after query creation, before results import to the normalised data model,
after import to the normalised data model, or after serialisation to the output data
format. SPARQL rules are only useful after results have been imported into the
normalised data model, while regular expressions may be useful at any stage where
the query or results are in textual form. Query optimisation functions are only
relevant after the query has been created, rather than during the process of replacing
templates in the query to form a full query, which may be parsed before being
executed depending on the communication method of the relevant data provider.
The normalisation rules that are applied to the results pool and the serialised results
document are not specific to the query type or provider that they were included
with, so variables such as the provider location, which may be used in the earlier
stages, will not be available for these rules.

• Order : Indicates where the rule will be applied in a given set of rules within a
stage.

Normalisation rule test

The model has an optional testing component that is used to validate the usefulness of
normalisation rules. The expected input and expected output are not defined in a rule,
as they are defined separately in each test. A normalisation rule test should contain the
following features, and a sketch of running such a test is given after the list:

• Expected input : The input to use.

• Expected output : The output that is expected.

• Set of rules : The set of rules that must be applied to the input to derive the
expected output.
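A minimal sketch of running one of these tests, with the rule application left as a caller-supplied function; the rule and values reuse the illustrative BioGUID-to-Bio2RDF mapping from earlier in this chapter.

def run_rule_test(test, apply_rule):
    # Apply the test's set of rules to the expected input and compare the
    # outcome with the expected output.
    payload = test["expected_input"]
    for rule in test["rules"]:
        payload = apply_rule(rule, payload)
    return payload == test["expected_output"]

test = {
    "expected_input": "http://bioguid.info/pmid:15456405",
    "expected_output": "http://bio2rdf.org/pubmed:15456405",
    "rules": [("http://bioguid.info/pmid:", "http://bio2rdf.org/pubmed:")],
}
print(run_rule_test(test, lambda rule, text: text.replace(rule[0], rule[1])))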

Namespace

A Namespace should contain the following features:

• The identifier used to designate this namespace : These items can be mapped
to both this and other namespaces as applicable. These may include traditional
prefixes, but also other items such as predicates and types that are available.

• The expected format for identifiers : A regular expression that can be used to
identify syntactically invalid identifiers for this namespace.

• The preferred URI structure : A template containing the authority, namespace


prefix, identifier, and any other variables or constants necessary to construct a
URI for this namespace.

• The authority that is responsible for assigning the scientist level items to this
namespace : This is in effect a mapping authority, although multiple authorities
can be used concurrently, so there is no central authority.

Profile

A Profile should contain the following features:

• Explicitly included items : A list of items that should be used with this profile,
and the scientist should be told that they will use this if it is the first matching
profile on their sorted profile list.

• Explicitly excluded items : A list of items that should not be used with this profile,
and the scientist should be told they should not be used if this is the first matching
profile on their sorted profile list.

• Default inclusion and exclusion preference : This is used to assign a default value
to any items that do not specify inclusion or exclusion preferences, as the use of
profiles is not mandatory.

• Implicit inclusion preference : Whether items in each category should be included


if the result of the profile rules indicates the item can be implicitly included.
Chapter 4

Integration with scientific processes

4.1 Overview
Scientists need to be able to concurrently use different linked datasets in their research.
The model described in Chapter 3 enables scientists to query various linked datasets
using a single replicable method. This chapter contains a variety of ways scientists
can integrate the results of these queries with their programs and workflows. The
integration relies on the model to provide data quality, data trust, and provenance
support. Scientists are able to use the model to reuse and modify work produced by
other scientists to fit their current context, even if they need to use different methods
and datasets from those used by the original scientist.
The model allows scientists to explore linked data and integrate annotations relating
to data they find interesting. In comparison to other techniques, the annotations do
not need to modify the original dataset. The resulting RDF triples can be integrated
into computer operated workflows, as the data resulting from the use of the model
is mandated to be computer understandable and processable using a common query
language regardless of which discipline or location the data originated in.
A case study is given in Section 4.4 describing the ways that the model and pro-
totype, described in Chapter 5, can be used in an exploratory medical and scientific
scenario. The case highlights one way of using the model to publish links to results
along with publications. The use of resolvable URLs as footnotes in Section 4.4 is one
method of integrating computer resolvable data references into publications, although
other methods have also been proposed in the data-based publication literature, includ-
ing embedding references into markup [37, 58] and adding comments into bibliographic
datasets [123].
Scientists need a variety of tools to analyse their data. One method that scientists
use to manage their processing is to integrate data and queries using workflow man-
agement systems. If the workflow is able to process and filter the single data model
that is mandated by the query model, the scientist can avoid or simplify many of the
common data transformation steps that are required to use workflow management sys-
tems to process and channel data between different services. A case study is given in
Section 4.7 using an RDF based workflow to demonstrate the advantages of the model


and prototype compared to current workflow and data access systems.


In many cases it is difficult for a scientist to determine what the sources of data
were in a large research project. Provenance information, including the queries and
locations of data, is designed to be easily accessible using the model, as each of the query
types is directly referenced in the relevant providers, and providers in turn reference
normalisation rules and namespaces by name. These explicit links make it easy to
generate and permanently store a portable provenance description. The provenance
information provided by the model enables scientists to use provenance as part of their
experiments. An example of how this information can be used to give evidence for the
results of a scientific workflow is given in Section 4.7.1.
Scientific results, and the processes that were used to derive the results, need to be
reviewed by peers before publication. These steps require other scientists to attempt
to replicate the results, and challenge any conclusions based on their interpretation
of the assumptions, data, methods, and results that were described by the publishing
scientist. The model is useful for abstracting queries away from high level processing
systems, but the high level processing systems are required to integrate the results of
different queries. Scientific peers need to replicate and challenge the results in their own
contexts as part of the scientific review process. An example of how this can occur is
given in Section 4.7.1.

4.2 Data exploration


Although most scientific publications only demonstrate the final set of steps that were
used to determine the novel result, the process of determining goals and eliminating
unsuccessful strategies is important. Historically, the scientific method required the
scientist to propose a testable hypothesis explaining an observable phenomenon. This
initial hypothesis may have come from a previous direct observation or a previous study.
Ananiadou and McNaught [6] note that, “hypothesis generation relies on background
knowledge and is crucial in scientific discovery.” The source of the initial hypothesis
affects how data is collected and analysed. It may also affect how conclusions based on
the results can be integrated into the public body of scientific knowledge [51]. Variations
on this method in the current scientific environment may involve exploration through
published datasets before settling on a hypothesis.
The model described in this thesis, focusing on replicating queries across linked
scientific datasets, provides a useful ad hoc exploration tool. It is designed to provide
consistent access to information regardless of the data access method. The model focuses
on replication, so exploratory activities can be tracked and replicated. This provides a
novel and powerful tool for scientists who rely on a range of linked scientific datasets
for their work. The method used by the model is unique, as it provides an abstraction
over the methods for distributing queries using namespace prefixes and query mapping
from scientists' queries to actual queries on datasets.
Other RDF based strategies rely on scientists writing complete SPARQL queries
which are resolved dynamically using a workflow approach to provide access to the
final results. These strategies may be simpler than writing workflows that rely on
multiple different data formats, but they are still a barrier to adoption by the wider
scientific community. They rely on datasets to be normalised, and scientists to be able
to identify a single location for each dataset to include it in their query. This makes it
difficult to replicate queries in different contexts, as peers may need to have access to an
equivalent set of RDF statements and they may need to modify the query to use their
alternative location. By contrast the model provides the necessary provenance details
to enable replication using substitution and exclusion without removing anything from
the provenance details.
An implementation could interactively provide query building features, using both
known and ad hoc datasets as needed. The essential data exploration is related to
the number of links between datasets and how valuable they are. It is useful to have
the ability to continuously explore information, without first requiring an application
to understand a separate interface for each dataset and data provider. Scientists then
need to be able to send future queries to processing services based on the data they
have previously retrieved from the linked datasets. These features are provided by the
normalisation of URIs in the prototype, which makes it possible to determine future
locations based on the URIs from past results.

4.2.1 Annotating data

Scientists can create annotations that contain direct references to the data along with
new information about the data. This enables them to keep track of interesting items
throughout the exploration process. If these annotations are stored in a suitable place
they can be integrated into the description for each item using a query and provider that
are targeted at the annotation data store. An annotation service was implemented and
the model was used to access the annotations, as shown in Figure 4.4. It demonstrates
that novel data can easily be included with published data for any item based on its
URI. The annotations contained details about the author of the annotation, as the
annotation prototype required scientists to login with their OpenID, and the OpenID
identifying URI was included in the annotation.
Scientists can use tags to keep track of their level of trust in a dataset by annotating
items in the dataset gradually, and then compiling reports about namespaces using
the tag information. The reports can be selective, based on the annotations of trusted
colleagues, and can be setup using new query types and profiles. These tags were
used to annotate data records from the published Gene Ontology [16] and Uniprot
Keywords [114] datasets, as well as an unpublished BioPatML dataset [92]. The tagging
functionality was implemented in a web application that scientists can use to annotate
any URIs 1 .

1 http://www.mquter.qut.edu.au/RDFTag/NewTerm.aspx

The tag information could be directly integrated with the document that they resolve
if they know they want to see annotations together with the original data in their
context. In current workflows, every instance where the dataset was accessed would
need to be subdivided into two steps, one for the original data and one for the annotated
data. When used with normalisation rules from the model, annotations could be used to
selectively remove case specific information, although other scientists would not receive
this information until they resolved the annotation to RDF. If useful information was
removed from the original dataset, annotations could be used to include it again.

4.3 Integrating Science and Medicine

The model is designed so that it can be easily customised by users. Extensions can
range from additional sources and queries, to the removal of sources or queries for
efficiency or other contextual reasons. The profiles in the model provide a simple way
for scientists to select which sources of information they want to use without having
to make choices about every published information source. In the context of Health
Informatics, a hospital may want to utilise information from drug information sites,
such as DrugBank and DailyMed, together with their private medical files.
The hospital could map references in their medical files to disease or drug datasets
such as DrugBank or DailyMed and publish the resulting information in a single data
model, or provide a compatible interface to their current system with a single reference
data model. The data from their systems could then be integrated with the DrugBank
information transparently, as the model requires that the data model be able to be
used for transparent merging of data from arbitrary locations. The hospital could then
create a mapping between the terminologies stored in their internal dataset and those
used by DrugBank and related datasets and use those to syntactically normalise their
annotations. These mappings could be made using one of the available SQL to SPARQL
converters such as the Virtuoso RDF Views mechanism [45] or the D2RQ server [29].
To distinguish private records from external public datasets, hospitals would create
a namespace for their internal records, along with providers matching the internal ad-
dresses used for queries about their records. This novel, private information, is able to be
transparently included in the model through the use of private provider configurations
indicating the source of the information.
If the hospital then wanted to map a list of diseases into their files, they could find
a source for disease descriptions, such as Diseasome [55], and either find existing links
through DrugBank [144] and DailyMed 2 , or they could use text mining to discover
common disease names between their records and disease datasets [7]. For scientific
research, resources like Diseasome are linked to bioinformatics datasets such as the
NCBI Entrez Gene [93], PDB [26], PFAM [48], and OMIM [5] datasets. This linkage is
defined explicitly using namespaces and identifiers in those namespaces. This enables
the use of directly resolvable links and enables links to be discovered; for instance, there
may be links from patients and clinical trials to genetic factors, implying that patients
may be eligible for particular clinical trials. Patients could be directly linked to genes
using the data model without reference to diseases, and new diseases could be described
internally without requiring a prior outside publication.

2 http://dailymed.nlm.nih.gov/dailymed/about.cfm


Medical procedures and drugs inevitably have some side effects. There is a list of
potential side effects available in the Sider dataset [80], enabling patients and doctors
to both have access to equal information about the potential side effects of a course
of medication. Side effects that are discovered by the hospital could be recorded using
references to the public Sider record, reducing the possibility that the effect would be
missed in future cases.
The continuous use of a single data model enables scientists to transition between
datasets using links without having to register a new file model for each dataset, al-
though the single data model may have different file formats that are each associated
with input and output from the model. For ease of reading, the URI links, which can
be resolved to retrieve the relevant information used in the following case study, are
footnoted. The links between the datasets were observed by resolving the HTTP URIs
and finding relevant RDF statements inside the document that link to other URIs.
In comparison to other projects that aim to integrate Science and Medicine, such
as BIRN [17], the model used here does not require each participating institution to
publish their datasets using RDF, as long as the data is released using a license that
allows RDF versions of the data to be created by others. This enables others to integrate
datasets that may not be in the same field or may not be currently maintained by an
organisation. The semantic quality of the resulting information may not be as high as
in the specialised neuroscience datasets that BIRN was developed to support, as each
of the organisations involved in BIRN is required to be active and maintained.
The clinical information that was integrated in BIRN consisted of public de-identified
patient data, whereas this model allows hospitals to retrieve public information and in-
tegrate it with their private information without having to de-identify it before being able
to distribute queries across the data.
does not need to be publicly published to be recognised by the model or prototype, al-
though it does need to be published in other systems that rely on centralised authorities
for dataset locations and query federation.
Private data is not ideal, as it hides semantically useful links from being reused.
In some cases, notably hospitals, privacy laws in some countries require that
personal health records remain private unless explicitly released.

4.4 Case study: Isocarboxazid


This case study is given in the context of both Science and Medicine. It is based
around a drug known generically as “Isocarboxazid”. This drug is also known by the
brand name “Marplan”. The aims of this case study are to discover potential relation-
ships between this drug and patients with reference to publications, genes and proteins,
which may possibly affect the course of their treatment. In cases where patients are
known to have adverse reactions or they do not respond positively to treatment, al-
ternatives may be found by examining the usefulness of drugs that are designed for
similar purposes. The doctor and patient could use the exploratory processes that are
explained in the case study to determine whether current treatments are suited to the
patient with respect to their history and genetic makeup. If necessary, the case study
methods could also be used to find alternative treatments.

Figure 4.1: Medicine related RDF datasets

The case study utilises a range of datasets that are shown in Figure 4.1. These
datasets are sourced from Bio2RDF, LODD, Neurocommons, and DBpedia. The origi-
nal image for the LODD datasets map can be found online 3 .
A portion of the case study, highlighting the links between items in the relevant
datasets, can be seen in Figure 4.2. This case study includes URIs that can be resolved
using the Bio2RDF website, which contains an implementation of the model. For the
purposes of this case study, the relevant entry in DrugBank is known 4 , although it
could also be found using a search 5 . According to this DrugBank record, Isocarboxazid
is “[a]n MAO inhibitor that is effective in the treatment of major depression, dysthymic
disorder, and atypical depression”.
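A minimal sketch of resolving one of these footnoted URIs to RDF, assuming the requests and rdflib libraries and that the Bio2RDF resolver honours content negotiation for RDF/XML; mirrors and formats may differ in practice.

import requests
from rdflib import Graph

def resolve_bio2rdf(uri):
    # Ask the resolver for an RDF/XML representation and parse it into the
    # common data model so it can be merged with other results.
    response = requests.get(uri, headers={"Accept": "application/rdf+xml"}, timeout=30)
    response.raise_for_status()
    graph = Graph()
    graph.parse(data=response.text, format="xml")
    return graph

# The DrugBank record for Isocarboxazid used in this case study.
graph = resolve_bio2rdf("http://bio2rdf.org/drugbank_drugs:DB01247")
print(len(graph), "statements retrieved")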
The DrugBank entry for Isocarboxazid contains links to the CAS (Chemical Ab-
stracts Service) registry 6 , which in turn contains links to the KEGG (Kyoto Ency-
clopedia of Genes and Genomes) Drug dataset 7 . The link back to DrugBank from
the KEGG Drug dataset and others in this case could also have been discovered using
only the original DrugBank namespace and identifier 8 . The brand name drug dataset,
Dailymed, also contains a description for Marplan 9 , which is linked from Sider 10 and
the US MediCare dataset 11 . These alternative URIs could be used to identify more
datasets with information about the drug.

3 http://www4.wiwiss.fu-berlin.de/lodd/lodd-datasets_2009-08-06.png
4 http://bio2rdf.org/drugbank_drugs:DB01247
5 http://bio2rdf.org/searchns/drugbank/marplan
6 http://bio2rdf.org/cas:59-63-2
7 http://bio2rdf.org/dr:D02580
8 http://bio2rdf.org/links/drugbank_drugs:DB01247
9 http://bio2rdf.org/dailymed_drugs:2892
10 http://bio2rdf.org/sider_drugs:3759

[Figure 4.2 maps the links followed in this case study, connecting the DrugBank record for Isocarboxazid/Marplan (drugbank_drugs:DB01247) with records in Dailymed, Sider, MediCare, CAS, KEGG Drug, PubChem, Diseasome (Brunner syndrome), the DrugBank targets for monoamine oxidase A and B, HGNC, NCBI Entrez Gene (human, mouse and rat), PDB, PFAM, MeSH and related PubMed articles.]

Figure 4.2: Links between datasets in Isocarboxazid case study

The record for Isocarboxazid in the Sider dataset has a number of typical depres-
sion side-effects to watch for, but it also has a potential link to Neuritis 12 , a symptom
which is different to most of the other 39 side effects that are more clearly depression
related. Along with side effects, there are also known drug interactions available using
the DrugBank dataset. An example of these is an indication of a possible adverse re-
action between Isocarboxazid and Fluvoxamine 13 . If Fluvoxamine was already being
given to the patient, other drugs may need to be investigated, as alternatives to prevent
the possibility of a more serious Neuritis side effect. DrugBank contains a simple cate-
gorisation system that might reveal useful alternative Antidepressants 14 , in this case,
such as Nortriptyline 15 .

Dailymed contains a list of typically inactive ingredients in each brand-name drug,


such as Lactose 16 , which may factor into a decision to use one version of a drug
over others. The DrugBank entry for Isocarboxazid also contains links to Diseasome,
for example, Brunner syndrome 17 , which are linked to the OMIM (Online Mendelian
Inheritance in Man) entry for Monoamine oxidase A (MAOA) 18 .

DrugBank also contains a list of biological targets that Isocarboxazid is known to affect 19. If Isocarboxazid was not suitable, drugs which also affect this gene, Monoamine oxidase B (MAOB) 20 21 22 23, the protein 24, or the protein family 25, might also cause a similar reaction. The negative link (in this case derived using text mining techniques) between the target gene, monoamine oxidase B 26, and Huntington's Disease 27 28, might influence a doctor's decision to give the drug to a patient with a history of Huntington's.
The location of the MAOB gene on the X chromosome in Humans might warrant an investigation into gender related issues associated with the original drug, Isocarboxazid. The homologous MAOB genes in Mice 29 and Rats 30 are also located on chromosome X, indicating that they might be useful targets for non-Human trials studying gender related differences in the effects of the drug.
11 http://bio2rdf.org/medicare_drugs:13323
12 http://bio2rdf.org/sider_sideeffects:C0027813
13 http://bio2rdf.org/drugbank_druginteractions:DB00176_DB01247
14 http://bio2rdf.org/drugbank_drugcategory:antidepressants
15 http://bio2rdf.org/drugbank_drugs:DB00540
16 http://bio2rdf.org/dailymed_ingredient:lactose
17 http://bio2rdf.org/diseasome_diseases:1689
18 http://bio2rdf.org/omim:309850
19 http://bio2rdf.org/drugbank_targets:3939
20 http://bio2rdf.org/symbol:MAOB
21 http://bio2rdf.org/hgnc:6834
22 http://bio2rdf.org/geneid:4129
23 http://bio2rdf.org/mgi:96916
24 http://bio2rdf.org/uniprot:P27338
25 http://bio2rdf.org/pfam:PF01593
26 http://bio2rdf.org/geneid:4129
27 http://bio2rdf.org/mesh:Huntington_Disease
28 http://bio2rdf.org/mesh:D006816
29 http://bio2rdf.org/geneid:109731
30 http://bio2rdf.org/geneid:25750

The Human gene MAOA can be found in the Traditional Chinese Medicine (TCM) dataset [47] 31, as can MAOB 32, although there were no direct links from the Entrez Geneid dataset to the TCM dataset. TCM has a range of herbal remedies listed as being relevant to the MAOB gene 33, including Psoralea corylifolia 34. Psoralea corylifolia is also listed as being relevant to another gene, Superoxide dismutase 1 (SOD1) 35 36. SOD1 is known to be related to Amyotrophic Lateral Sclerosis 37 38, although the relationship back to Brunner Syndrome and Isocarboxazid, if any, may only be exploratory given the range of datasets in between.
LinkedCT is an RDF version of the ClinicalTrials.gov website that was set up to register basic information about clinical trials [65]. It provides access to clinical information, and consequently is a rough guide to the level of testing that various treatments
have had. The drug and disease datasets mentioned above link to individual clinical
interventions in LinkedCT, enabling a path between the drugs, affected genes and trials
relating to the drugs. Although there are no direct links from LinkedCT to Marplan
at the time of publication, a namespace based text search returns a list of potentially
interesting items 39 . An example of a result from this search is a trial 40 conducted by
John S. March, MD, MPH 41 of Duke University School of Medicine and overseen by
the US Government 42 .

The trial references published articles, including one titled “The case for practical
clinical trials in psychiatry” 43 44 . These articles are linked to textual MeSH (Medical
Subject Headings) terms such as “Psychiatry - methods” 45 , indicating an area that
the study may be related to. The trial is linked to specific primary outcomes and the
frequency with which the outcomes were tested, giving information about the scientific
methods in use 46 .

Although LinkedCT is a useful resource, as with any other resource, there are difficulties with the data being complete and correct. An example of possible issues with completeness in LinkedCT comes from recent studies about the use of ClinicalTrials.gov, which indicate that a reasonable percentage of clinical trials either do not publish results, do not register with ClinicalTrials.gov, or do not reference the ClinicalTrials record in publications resulting from the research [94, 116]. These issues may be reduced if researchers were required to register all drug trials and reference the entry in any publications.

31 http://bio2rdf.org/linksns/tcm_gene/geneid:4128
32 http://bio2rdf.org/linksns/tcm_gene/geneid:4129
33 http://bio2rdf.org/linksns/tcm_medicine/tcm_gene:MAOB
34 http://bio2rdf.org/tcm_medicine:Psoralea_corylifolia
35 http://bio2rdf.org/tcm_gene:SOD1
36 http://bio2rdf.org/geneid:6647
37 http://bio2rdf.org/mesh:Amyotrophic_Lateral_Sclerosis
38 http://bio2rdf.org/mesh:D000690
39 http://bio2rdf.org/searchns/linkedct_trials/marplan
40 http://bio2rdf.org/linkedct_trials:NCT00395213
41 http://bio2rdf.org/linkedct_overall_official:12333
42 http://bio2rdf.org/linkedct_oversight:2283
43 http://bio2rdf.org/linkedct_reference:22113
44 http://bio2rdf.org/pubmed:15863782
45 http://bio2rdf.org/mesh:D011570Q000379
46 http://bio2rdf.org/linkedct_primary_outcomes:55439

Doctors and patients do not have to know what the URI for a particular resource is, as search functionality is available. This searching can either be focused on particular namespaces or it can be performed over the entire known set of RDF datasets, although the latter will inevitably be slower than a focused search, as some datasets are up to hundreds of gigabytes in size, representing billions of RDF statements. An example of this may be a search for "MAOB" 47, which reveals resources that were not included in this brief case study.

4.4.1 Use of model features


The case study about the drug "Isocarboxazid" in Section 4.4 utilises features from the model to access data from multiple heterogeneous linked datasets. The case study uses the query type feature to determine what each component of the question means, and maps the question to the relevant query types. The query types from the Bio2RDF configuration that were utilised in the case study range from basic resolution of a record from a dataset, to identifying records that were linked to an item in particular datasets, and searches over both the global group of datasets and specific datasets. These methods make it simple to replicate the case study using the URIs, although additions of new query types and providers may be necessary to utilise local or alternate copies of the relevant datasets.

Basic data resolution

The Bio2RDF normalised URI form can be used to get access to any single data record in Bio2RDF. The pattern is "http://bio2rdf.org/namespace:identifier", where "namespace" identifies part of a dataset that contains an item identified by "identifier". This query provides access to Bio2RDF as Linked Data. It is simple to implement using SPARQL, with either a CONSTRUCT or a DESCRIBE query
being useful. The SPARQL Basic Graph Pattern (BGP) that is used in this query is “
<http://bio2rdf.org/namespace:identifier> ?p ?o . ”, although the URI may be mod-
ified using Normalisation Rules as appropriate to each data provider. The SPARQL
pattern may not be necessary if the query is resolved using a non-SPARQL provider.
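For illustration, a minimal Python sketch of this kind of CONSTRUCT-based resolution against a single SPARQL provider is shown below; the endpoint URL and the use of the SPARQLWrapper library are assumptions for the example and are not part of the Bio2RDF configuration.

    from SPARQLWrapper import SPARQLWrapper, RDFXML

    # Sketch only: the endpoint URL is a placeholder, not a documented provider.
    record = "http://bio2rdf.org/drugbank_drugs:DB01247"
    endpoint = SPARQLWrapper("http://example.org/sparql")
    endpoint.setQuery("""
    CONSTRUCT { <%s> ?p ?o }
    WHERE     { <%s> ?p ?o }
    """ % (record, record))
    endpoint.setReturnFormat(RDFXML)
    graph = endpoint.query().convert()   # an rdflib Graph holding the single record
    print(graph.serialize(format="xml"))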
An example of a question in the case study that may not use SPARQL is the ini-
tial query. The process of resolving the URI “http://bio2rdf.org/drugbank_drugs:
DB01247”, may require the model to use the Linked Data implementation at the Free
University of Berlin site, provided by the LODD project, or it may use the SPARQL end-
point. In this case, the namespace is “drugbank_drugs”, and the identifier is “DB01247”.
According to the Bio2RDF configuration, the namespace prefix “drugbank_drugs”, is
identified as being part of the namespace “http://bio2rdf.org/ns:drugbank_drugs”
by matching the namespace prefix against the preferred prefix that is part of the
RDF description for the namespace. This namespace is attached to the provider
“http://bio2rdf.org/provider:fuberlindrugbankdrugslinkeddata”, which will at-
tempt to retrieve data by resolving the Linked Data "http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB01247" to an RDF document.

47 http://bio2rdf.org/search/MAOB


The DrugBank provider in Bio2RDF contains links to normalisation rules that are
necessary to convert the URIs in the Free University of Berlin document to their
Bio2RDF equivalents. This enables the referenced URIs to be resolved using the
Bio2RDF system, which contains other redundant methods of resolving the URI besides
the Linked Data method. In this case, the document is known to contain incorrectly formatted Bio2RDF URIs. This is a data quality issue that can be fixed using the normalisation rules method, in a similar way to how the useful, but non-Bio2RDF, Linked Data URIs were changed.
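For illustration, the kind of rewriting such a rule performs can be sketched as a regular expression substitution; the pattern below is a guess at the general shape of the rule, not the rule actually shipped with the Bio2RDF configuration.

    import re

    # Illustrative normalisation: rewrite FU Berlin DrugBank drug URIs into the
    # normalised Bio2RDF form.  The exact patterns used in the Bio2RDF
    # configuration may differ from this sketch.
    fu_berlin_drug = re.compile(
        r"http://www4\.wiwiss\.fu-berlin\.de/drugbank/resource/drugs/(\w+)")

    def normalise(uri):
        return fu_berlin_drug.sub(r"http://bio2rdf.org/drugbank_drugs:\1", uri)

    print(normalise("http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB01247"))
    # -> http://bio2rdf.org/drugbank_drugs:DB01247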
There are some situations where it may be difficult to get a single document to completely represent a data record. If the only data providers return partial data records, such as Web Services that map the record to another dataset, then many Web Services may need to be resolved to compile a single document for the total data record.
In some cases, the complete record may have parts of its representation derived from
different datasets, or at least different locations. However in many cases, the record is
available from a single location, so it is simpler to maintain and modify the record. In
both cases, the model allows the data record to be compiled and presented as a single
document to the scientist, including any necessary changes based on their context.

References in other datasets

Scientists are able to take advantage of references that dataset providers have deter-
mined are relevant. They are not easily able to discover new references, as there has
previously been no way of asking for all of the references to a particular item. Al-
though the Linked Data proposals are useful as a minimum standard, they do not allow
scientists to perform these operations directly, as they only focus on direct resolution
of basic data items. Query types were created to handle both global and namespace
targeted searches, extending the usefulness of the Linked Data approach to include link
and search services in addition to simple record resolution. These queries are useful for
scientists who utilise datasets that do not always contain complementary links to other
datasets, and are therefore a way to increase the usability and quality of these datasets.
The omission of a reference may not imply that the data is of a low semantic quality, but
in syntactic terms it is useful to know where an item is referenced in another dataset.
The global reference search in Bio2RDF has the pattern “http://bio2rdf.org/
links/namespace:identifier”, where “namespace” identifies part of a specific dataset
that contains an item identified by “identifier”, and the query will find references to
that item. In the case study, there was a need at one point to find all references to a
drug, without knowing exactly where the references might come from. The Bio2RDF
URI http://bio2rdf.org/links/drugbank_drugs:DB01247 was used to find any items
that contained references to the drug. Although the canonical form for the data item
of interest is http://bio2rdf.org/drugbank_drugs:DB01247, the data normalisation
rules will change this to the URI that is known to be used on each data provider. This
process is not easy to automate, as it requires a mapping step. A general mapping
that created multiple queries for every known dataset would be very inefficient for any
datasets that had more than one or two known URI forms. In the case of Bio2RDF the
mapping process was manually curated based on examinations of sample data records
to increase the efficiency of queries.
As the global references search requires queries on all endpoints, by design, it is
not efficient in very large sets of data. A more efficient version was designed to take
advantage of any knowledge about a namespace that a reference may be found in. This
form has the pattern “http://bio2rdf.org/linksns/targetnamespace/namespace:
identifier”, where “namespace” identifies part of a specific dataset that contains an
item identified by “identifier”, and the query will find references to that item in data
providers that are known to contain data items in the namespace “targetnamespace”.
This was used in many places in the case study where the scientist knew that there
were links between datasets, and wanted to identify the targets efficiently.
The SPARQL BGP that is used in both cases is likely to be similar to the pattern " ?s ?p <http://bio2rdf.org/namespace:identifier> . ", although in some cases a specific predicate (second part of the triple) will be necessary, and the object (third part of the triple) will be the literal "identifier" or something similar.
The model maintainer may be able to specify which data providers are known to
contain some links to a namespace, to assist scientists in identifying linked records
without having to do either a global reference search, or know which namespaces to
use for a targeted search. Using this knowledge, the model can be set up so that during the data resolution step, there is a targeted search for references in a select number of namespaces to dynamically include these references in the basic data resolution. In some cases, such as with widely used taxonomies, it is not efficient to utilise this step, but in general it is useful and can provide references that scientists may not have expected. This functionality is vital for namespaces that may not be derived from single datasets, or where the datasets are not available for querying, as the semantic quality of the referenced data can be increased using a generic set of syntax rules.
An example of this is the “gene” namespace that was referenced in the case study.
Each identifier in the gene namespace is a symbol that is used to identify a gene in a
particular species. Although these are curated by the HGNC (HUGO Gene Nomen-
clature Committee) organisation for humans, the use of these symbols is more widely
spread, including in the gene namespace, where the symbols are denoted as being human
using the NCBI Taxonomy identifier “9606”, in the form http://bio2rdf.org/gene:
9606-symbol, where “symbol” is the name given to a human gene. An example of a
mouse gene symbol is http://bio2rdf.org/gene:10090-ace, which is similar to the
symbols that are available for the human version of the gene, http://bio2rdf.org/
gene:9606-ace and http://bio2rdf.org/symbol:ACE. In cases where the identifiers
are not standardised, or different namespaces have different standards for the same
set of identifiers, there are likely to be data quality issues similar to this case. As
URI paths are case sensitive, http://bio2rdf.org/symbol:ACE is different to http:
//bio2rdf.org/symbol:ace, and datasets using one version will not directly match
datasets using the other version.


The use of RDF enables the addition of a statement linking two identifiers. However, this is not a good solution as it requires scientists to apply reasoning to the data while assuming that the data is of perfect semantic quality. In the prototype it is possible to apply normalisation rules to template variables, such as the "identifier" element in the symbol namespace, that modify queries on particular data providers based on a previous examination of the data to establish which convention a dataset follows.
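A minimal sketch of such a rule, assuming a provider that stores lower-case gene symbols, is shown below; the pattern and the provider convention are illustrative and are not taken from the Bio2RDF configuration.

    import re

    # Illustrative only: before querying a provider known to use lower-case gene
    # symbols, rewrite the identifier part of the normalised symbol URI.  The
    # real normalisation rules are configured per provider.
    def to_provider_convention(uri):
        return re.sub(r"(http://bio2rdf\.org/symbol:)(\w+)",
                      lambda m: m.group(1) + m.group(2).lower(), uri)

    print(to_provider_convention("http://bio2rdf.org/symbol:ACE"))
    # -> http://bio2rdf.org/symbol:ace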
In some cases, the RDF datasets do not contain URIs for particular referenced
items. Dataset owners do this to avoid any impression that the URIs would have
ongoing support from their organisation. Some organisations have best practices to
publish references to other data items using a tuple formed using one property for a
namespace prefix, such as “namespace” above, and another property for the identifier,
such as “identifier” above. This restricts the usefulness of the resulting data as there
is no simple mechanism for using this tuple to get more information about the item.
Although the lack of URIs is simple to get around in many cases, it makes queries
harder, as the relevant properties need to be targeted along with a textual literal, which
may not be handled as efficiently as URIs.
In some cases a Linked Data RDF provider may argue that the underlying dataset did not contain URI references to another dataset, so they do not have to provide the links. One case where this is prominent is the lack of references out from DBpedia to Bio2RDF, due to the fact that Bio2RDF is not a clear representative of the relevant data providers, although DBpedia is not a representative of the Wikipedia group in the first instance. In these cases, the model is useful in determining which DBpedia items reference a given Bio2RDF namespace and identifier and using these for reference searches; however, there is a lack of support in the SPARQL query language with respect to creating URIs inside queries, so the resulting documents still contain the textual references.
In all of the cases where the dataset does not use a URI to denote a reference to
another item, the query types feature in the model is utilised to make it possible to
use multiple different queries on the dataset, without the scientist having to specify in
their request that they want to use the alternative query types. As an example, these
alternative queries are utilised for the inchi, inchikey 48, and doi 49 namespaces. These public namespaces are well defined, so there are not generally syntactic data quality issues; however, they do not have RDF Linked Data compatible URIs, so some datasets have resorted to using literals, along with a number of different predicates, for references to these namespaces.
The use of a large number of different data providers for these public namespaces
is difficult from a trust position, as it is not always easy to identify where statements
originate from in RDF, unless one of the Named Graph extensions to RDF is utilised, such as TriG or TriX. In the case of pure RDF triples, it is not easy to identify after the fact what trust levels were relevant. In the case of high throughput utilisation of this
48 http://www.iupac.org/inchi/
49 http://www.doi.org/about_the_doi.html
information, it is not appropriate to rely on a scientist applying their trust policies on


information after queries are performed. The use of profiles to restrict the set of data
providers is the solution provided in the model. The profiles are designed so that they
can be easily distributed to other users, so organisations can use the profiles to define
their trust in different providers. However, profiles are user selected, so any profiles that
appear in query provenance can be excluded during replication simply by not selecting
them.

Text search

RDF URIs are useful, but they are not always easy to find. In particular, Linked Data does not specify any mechanism for finding initial URIs given a question. In some cases
this may not be an issue, if identifiers are textual and a URI can easily be created with
knowledge of the identifier. In many cases identifiers are privately defined by datasets
and the identifier has no semantic meaning apart from its use as part of the dataset.
In these cases it is useful to be able to search across different datasets. One solution
to this may be to rely on a search engine, but there are questions about the range of a
search engine, and the ability to restrict it to what is useful and trusted.
The majority of the data providers for text search queries were SPARQL endpoints. Although the official SPARQL 1.0 specification only supports one form of text search, regular expressions, some SPARQL endpoints were known to support additional forms of full text search that are more efficient, so searches on these endpoints were performed using a different query type that was based on the same query structure.
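A sketch of the portable, regular-expression form of this search is shown below; the endpoint URL and the choice of rdfs:label as the searched property are assumptions for the example, and endpoints with proprietary full-text indexes were queried using a different template.

    from SPARQLWrapper import SPARQLWrapper, JSON

    endpoint = SPARQLWrapper("http://example.org/sparql")  # hypothetical provider
    endpoint.setQuery("""
    SELECT DISTINCT ?s ?label WHERE {
      ?s <http://www.w3.org/2000/01/rdf-schema#label> ?label .
      FILTER regex(str(?label), "marplan", "i")
    } LIMIT 50
    """)
    endpoint.setReturnFormat(JSON)
    for row in endpoint.query().convert()["results"]["bindings"]:
        print(row["s"]["value"], row["label"]["value"])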
In order for scientists to utilise the system without having to know these details,
the two forms of SPARQL search, along with other URL based search providers, were
implemented to match against two standard URI patterns. The first pattern is for a
text search across all datasets that have text search interfaces, http://bio2rdf.org/
search/searchTerm, where “searchTerm” is the text that a scientist wants to search
for. The second pattern is used to target the searches at datasets which contain a
particular namespace, http://bio2rdf.org/searchns/targetNamespace/searchTerm, where "targetNamespace" identifies the namespace to target. The "targetNamespace" is not guaranteed to be
the only namespace that results are returned from, as the query and the interface may
be generically applicable to any data in a provider. The process could be optimised
further in cases where the interface allowed the query to only target data items that
had URIs which matched the target namespace, but this process may not always be
effective or efficient as the URIs may not have a base structure that is easily or efficiently
identifiable in a SPARQL query.
An experiment to improve the accuracy in some cases was completed, but it was
not efficient, and queries regularly timed out, returning no results. The main efficiency
problem was related to the use of SPARQL 1.0. It is not optimal for searches that rely
on the structure of URIs, as it relies on regular expressions, which are not efficient when
applied to large amounts of text. For the same reason, the targeted reference search
only relies on the namespace being part of the data provider definition, so results can
be returned from any namespace that was present in the dataset that was queried. In
small scenarios, where the number of triples is quite small 50 , the identifier-independent
part of the URI can be identified as belonging to a particular namespace.
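For illustration only, the kind of URI-structure restriction discussed here can be written in SPARQL 1.0 as a regular expression filter over the subject URI; as noted, queries of this style regularly timed out on large endpoints, and the endpoint URL below is hypothetical.

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Sketch only: restrict reference matches to subjects whose URIs begin with
    # a target namespace prefix.  The "." is left unescaped to keep the sketch
    # simple; this approach was found to be too slow in practice.
    endpoint = SPARQLWrapper("http://example.org/sparql")
    endpoint.setQuery("""
    SELECT ?s WHERE {
      ?s ?p <http://bio2rdf.org/geneid:4129> .
      FILTER regex(str(?s), "^http://bio2rdf.org/hgnc:")
    } LIMIT 100
    """)
    endpoint.setReturnFormat(JSON)
    results = endpoint.query().convert()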
The text search function demonstrates a novel application of the model as an entry
point to Linked Data models. In terms of data quality, it is perhaps less effective
than a single large index, where results can be ranked across a global set of data, as
the prototype implementation of the model only allows ranking inside data providers.
However, in terms of data trust, the sites that were utilised were able to be selected in
a simple manner using profiles, so as to restrict the possibility that the search terms
accidentally matched on an irrelevant topic.
The exact provenance for a search using the model can be recorded, and can be used in combination with the results to validate the usefulness of the results to a particular
question. If the results have data quality issues, a scientist can encode rules that can
be used to clean the results to suit their purposes, and view the results of this in
their work, along with which rules were applied in their provenance. If a new text
search method needs to be created to replicate results, it can be included using profiles. Profiles can also be used to filter out-of-date query types. For example, when SPARQL 1.1 is standardised, the SPARQL 1.0 query types can be filtered from historical query provenance and replaced with more efficient SPARQL 1.1 queries.

4.5 Web Service integration

The query type and provider sections of the model can be transparently extended in
future to allow scientists to patch into web services to use them directly as sources for
queries. A similar ontology based architecture, SADI [143], annotates and wraps up web
services to query them as part of Federated SPARQL queries; however, it makes some assumptions that make it unsuitable as a replicable, context-sensitive, linked scientific data query model.
The SADI method relies on service providers annotating their web services in terms
of universally used predicates and classes. SADI also requires queries to be stated using
known predicates for all parts of the query, with at least one known URI to start the
search from, as the system is designed to replicate a workflow using a SPARQL query.
This prevents queries for known predicates being executed without knowledge of any
URIs, and it also prevents queries that have known URIs without scientists knowing
what predicates exist for the URI. SADI is designed as an alternative to cumbersome
workflow management systems. In this respect it is successful, as the combination of a
starting URI and known predicates are all that is required for traditional web services
to be invoked, as they are specifically designed for operations on known pieces of infor-
mation. SADI is limited in this respect compared to arbitrary Federated SPARQL in-
terpreters that allow any combination of known operations, known information, and/or unknown information and operations.
50 1000 triples is small in this context and 5 million+ is large

In comparison to the separate namespace and normalisation rules model described


here, SADI requires wrapper designers to know beforehand what the structure is for
URIs that all datasets use, with the resulting transformations encoded into the pro-
gram. In comparison, the normalisation rules part of this query model only requires
configuration designers to know how to transform the URIs from a single endpoint to
and from the normalised URI model, and scientists can extend that knowledge using
their own rules. This reduces the amount of maintenance that is required by the system,
as URIs that are passed between services are normalised to a known structure as part of
the output step for prior services before being converted to the URIs that are expected
by the next services.
If a scientist is able to locally host a SADI service registry, then they are able to
trust each of the records. However, SADI does not contain a method for trusting data providers in the context of particular queries. It is important that scientists be
able to trust datasets differently based on their queries, so that all known information
may be retrieved in some contexts, but in other contexts, only trusted datasets may be
queried. Although this may come at the expense of scientists defining new templates
for queries, the model is able to provide a definite level of trust for each query based on
the scientist’s profiles.
The design could be useful as a way to integrate documents derived from the model; however, the assumption that all queries can be resolved in a Linked Data fashion is inefficient, and there needs to be an allowance for more efficient queries that do not follow the cell-by-cell model. SADI technically follows a cell-by-cell data retrieval model to answer queries; however, some queries may be executed intelligently behind the scenes in Web Services to reduce the overall time, making the system efficient in particular circumstances, but not based on the system design. The model and prototype can both be used to perform efficient queries, although resolution of individual Linked Data URIs is also possible if scientists are not aware of a way to make a query more efficient.

4.6 Workflow integration


Workflow management systems provide a useful way to integrate different parts of a scientific experiment. Although they may not be used as often as ad-hoc scripts, they can make it easier to replicate experiments in different contexts, and they make it easier for scientists to understand the way the results are derived. The model and prototype
were not designed to be substitutes for a workflow management system, as they are
designed to abstract out the context specific details such as trust and data quality.
By abstracting out the contextual details, each query can be replicated in future using
the model configuration given by the replicating user, based on the original published
configuration given by the original publisher.
An evaluation of a selection of workflows on the popular workflow sharing website
myExperiment.org revealed that the majority of these scientific workflows have large
components that are concerned with parsing text, creating lists of identifiers, pushing
those lists into processing services and interpreting the results for insertion into other
processing services. These workflows are unnecessarily complex due to the requirements
in these steps to understand each output format and know how to translate these into
the next input format. Hence they are not as viable for reuse in the general scientific
community as a system that relies on a generic data format. In addition, each of the data
access steps contain hardcoded references to the locations of the data providers. Even
if the data locations are given as workflow parameters, the alternative data providers
would need to subscribe to the same query method. The query format assumptions are
encoded in the preceding steps of the workflow, and the data cleaning assumptions for
the given data provider are encoded in the steps following the data access.
In the model, the query format assumptions and data cleaning assumptions are
linked from the relevant providers. If another provider is available for a query, but the
query format is different, a new query type can be created and linked to a new provider.
This new provider can reuse the data cleaning rules that were linked to from the original
provider, or new rules can be created for this provider. Each of these substitutions is
controlled using profiles, so the entire configuration does not need to be recreated to
replicate the workflow. In addition, if the user only wants to remove data cleaning rules,
providers, or query types, they only need to create new profile and combine it with the
original configuration. All of these changes are independent of the workflow, as long as
the location of the web application is specified as a workflow parameter so that future
users can easily change it.
Scientists can use annotations and provenance in scientific workflows to provide
novel conclusions that can be used to evaluate the workflow in terms of its scientific
meaning and the trust in the data providers. The model provides query provenance, and
scientists can create query types to insert data provenance information into the result
set of a query. These provenance sources give a workflow system a clear understanding
of the origin of the data that it is processing.
In the model, the RDF syntax is used to specify provenance details. This is use-
ful because it enables concepts to be described using arbitrary terminologies, without
requiring a change to the model standard to support different aspects of provenance.
Scientists can also combine provenance and results RDF statements to explain the effect
of workflow parameters on the workflow results.
There are domain specific biological ontologies available that aim to formalise hier-
archies of concepts, however, these are not well integrated with workflow management
systems currently [61, 104, 124, 146]. Researchers have effectively proven that it is prac-
tical to integrate semantic information into electronic laboratory books such as those
by Hughes et al. [73] and Pike and Gahegan [109]. There have also been attempts to
formally describe the experimental process for various disciplines such as geoscience and
the microarray biology community [142]. There have been uses of simple dataset iden-
tifier information to provide enhancements to workflows, with the most common being
PubMed and Gene Ontology overlaps, where PubMed articles are described in terms of
the Gene Ontology [44, 86, 132]. However, these systems do not involve the context of
the scientist in the process, making it hard for scientists to replicate the results if they
need to extend them or process their provenance.


Scientists can interpret the provenance of their data along with their results by
including trusted, clean, structured data from various locations along with provenance
records. The model and prototype described in this thesis allow for this integration using query types, and produce process provenance records in RDF that can be used to configure the prototype to replicate the results.

4.7 Case Study: Workflow integration


The model links into scientific workflow management systems more easily than current
systems due to its reliance on a single data model, where documents at each step are
parsed using a common toolset. Although many workflow management systems do not
include native support for RDF as a data format, the Semantic Web Pipes program
[87] is designed with RDF as its native format, including SPARQL as a query language
inside the workflow. It was used to demonstrate ways to use the results of queries on
the prototype to trigger other queries based on the criteria that are specified in the
workflow. A future version of the model could expand it into a workflow engine, but
for the prototype it was decided to use an external tool for workflow purposes. The
following case study examines the usefulness of an RDF based workflow management
system to process data using the prototype as the data access system.
The basic input pattern that was used for these Semantic Web Pipes workflows contained a set of parameters including the base URL for the user's instance of the prototype web application. An implementation of the model was used to resolve a query
to an RDF document, as shown in Figure 4.3. The RDF document was then parsed
into an in-memory RDF model and SPARQL queries were performed to filter the useful
information that required further investigation. In some cases where workflows were
chained together, the results of the SPARQL query were used to form parameters for
further queries on the prototype. The results of these further queries were included in
the results of the workflow, as RDF statements from different locations can be combined
in memory and output to form single documents. In some cases, the initial RDF triples
were used as part of the output statements for the workflow. This makes it possible to
keep track of what information was used to derive the final output, as the RDF model
provides the necessary generic structure to enable different documents to be merged
without any specific rules.
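A rough Python analogue of this input pattern is sketched below; the Semantic Web Pipes workflows themselves are not reproduced here, the base URL is treated as a parameter, and the filter query is illustrative rather than the one used in the actual workflows.

    from rdflib import Graph

    base = "http://bio2rdf.org"   # workflow parameter: base URL of a prototype instance

    graph = Graph()
    graph.parse(base + "/linksns/geneid/pubmed:11976288")   # resolve a query to RDF

    # Filter the in-memory model for items that need further investigation; the
    # shape of this query is a guess for illustration.
    results = graph.query("""
    SELECT DISTINCT ?gene WHERE {
      ?gene ?p <http://bio2rdf.org/pubmed:11976288> .
    }
    """)
    next_parameters = [str(row.gene) for row in results]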
These small workflows were then linked together using the Pipe Call mechanism
in the Semantic Web Pipes program to create larger workflows. This structure makes
it possible to remove internal dependencies in the model on the way each query is
implemented, so data quality and data trust can be changed in the prototype, without
having to change the workflow where possible.
These workflows were used to evaluate the effectiveness of the prototype as a data access tool specifically for biologists and chemists, as they focused on biological and chemical scientific datasets. The workflows were integrated with the aim of verifying
the way that biological networks are annotated and curated using literature sources,
and further discovering new networks that may not have been annotated previously, but which share common sources of knowledge.

Figure 4.3: Integration of prototype with Semantic Web Pipes
The case study uses the Ecocyc dataset as the basis for the biological network
knowledge about the E. coli bacterium. Ecocyc contains references out to the Pubmed
literature dataset to verify the source of the knowledge. The Pubmed literature dataset
is also used to annotate the NCBI Entrez gene information database. The Entrez dataset in turn is referenced from a large number of other datasets, such as the Uniprot protein dataset and the Affymetrix microarray dataset. The microarray dataset can be
used to identify which methods are available for experimentally verifying which of the
proteins from the Ecocyc dataset appear in a biological sample. Along with the links
to Entrez gene, the Affymetrix dataset also includes links to Uniprot. These links are useful, as they appear from the results of the workflow to be highly specific, although the actual level of trust could be assigned by a scientist, who could use either this set of links or the set of links from Entrez gene to further identify the items. The Ecocyc dataset
is also annotated using the Uniprot dataset, which provides a method of verifying the
consistency of the data that was discovered using the Ecocyc dataset originally. In the
Ecocyc dataset, the Uniprot links are given from the functional elements, such as genes
and polypeptides (i.e., proteins).
The workflow was used with different promoters, and the results were manually
verified to indicate that the results were scientifically relevant. The results generally
contained proteins which were known to be in the same family, as the workflow was
structured to select these relationships in favour of more general relationships. However,
the resulting proteins were known to be experimentally verifiable, and the biological
networks could be considered in terms of these relationships. Although the results of
this workflow may be trivial, they highlight the ease with which these investigations
can be made compared to non-RDF workflows. In traditional workflows, each of the
data items must be manually formatted to match the input for a service. The results
must also be parsed using knowledge of the expected document format, including the
need to parse link identifiers, or have a special service created to perform the dataset
link analysis and give these identifiers as simple values, and any data normalisation
processes must be created in a way that suits the data format.
In the previous case study the links between data items were used for exploration,
so it was suitable for a scientist to identify relevant links at each step of each case.
In comparison, the use of the workflow in this case enables a scientist to repeat the experiment easily, with different inputs as necessary. The workflows in this case were
created in a way that allowed scientists to create a workflow for a single purpose, and
then use it in another workflow by linking to it. The relevant information from one step
was processed into a list of parameters for the next step using an embedded SPARQL
engine. The relevant inputs and results were then combined in the output as a single
RDF document.
The model and prototype were used to enable the workflow to avoid dealing directly with issues such as data quality and remote SPARQL queries. These factors are difficult to replicate
and maintain if they are embedded into the workflow. Each of the workflows include
a parameter that is used as the base URL for queries. This means that users do not
have to utilise a globally accessible implementation of the model if they have one locally
installed. This ensures that scientists can reliably replicate the workflows using their
own context, if they have the datasets locally installed, with local endpoints and the
local implementation used in their workflows as guided by their profile settings.
The workflow was used to verify the consistency of the statements about a partic-
ular promoter, known as “gadXp” 51 , that is thought to be involved in several Ecocyc
regulatory relationships. This promoter is known to be annotated with a reference to a
single Pubmed article, titled “Functional characterization and regulation of gadX, a gene
encoding an AraC/XylS-like transcriptional activator of the Escherichia coli glutamic
acid decarboxylase system”.
This article was used to identify 6 Entrez gene records which were annotated as
being related to the article, using the query form, “/linksns/geneid/pubmed:11976288”
52 . The results of this query contained a set of RDF statements that indicated which
genes were related to the article. These genes were matched against known sets of
probes that could be experimentally verified with microarrays, using a query similar to
the following, “/linksns/affymetrix/geneid:NNNNNN”, where NNNNNN is one of the
genes that were identified as being sourced from common literature.
Each of the 8 microarray probes that were returned by this query was then examined specifically with reference to any Uniprot annotations it contained, to avoid transferring the entire documents across the network. This was performed using the query form "/linkstonamespace/uniprot/affymetrix:XXNNNNNN", where each of the affymetrix URIs was a microarray probe that matched from the Entrez gene set.
The 11 Uniprot links were then matched back against Ecocyc to determine which
Ecocyc records were relevant. If the datasets were internally consistent, the resulting
8 Ecocyc records, and the Uniprot records they are linked to, should refer to similar
proteins. In this example, the Ecocyc and Uniprot records all referred to proteins from
51 http://bio2rdf.org/ecocyc:PM0-2441
52 http://bio2rdf.org/linksns/geneid/pubmed:11976288
the “glutamate decarboxylase” family, which gadXp is a member of. Each of the Ecocyc
and Uniprot records were then examined in the context of all of the known datasets to
identify references that could be used by the scientist to verify the relevance of each
record. This step was performed using the query form, “/links/namespace:identifier”,
where namespace was either ecocyc or uniprot based on the set of proteins that were
identified.
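The chain of steps described in this case study could be approximated outside of Semantic Web Pipes with a sketch such as the following; the URI filtering shown is illustrative, and the real workflows performed these selections with SPARQL queries over the intermediate documents.

    from rdflib import Graph

    BASE = "http://bio2rdf.org"   # workflow parameter: base URL of the prototype

    def resolve(path):
        """Dereference a prototype query URL and return the parsed RDF graph."""
        g = Graph()
        g.parse(BASE + path)
        return g

    # Step 1: Entrez gene records annotated with the Pubmed article.
    genes = resolve("/linksns/geneid/pubmed:11976288")
    gene_ids = {str(s).rsplit("/", 1)[-1]                  # e.g. "geneid:NNNNNN"
                for s in genes.subjects() if "/geneid:" in str(s)}

    # Step 2: Affymetrix probes linked to each gene.
    probe_ids = set()
    for gene_id in gene_ids:
        probes = resolve("/linksns/affymetrix/" + gene_id)
        probe_ids |= {str(s).rsplit("/", 1)[-1]
                      for s in probes.subjects() if "/affymetrix:" in str(s)}

    # Step 3: Uniprot references for each probe, using the query form quoted above.
    for probe_id in probe_ids:
        uniprot_refs = resolve("/linkstonamespace/uniprot/" + probe_id)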
There may be cases where the Pubmed article was not specific to the promoter
and contained a large number of irrelevant references to genes in the Entrez dataset.
In these cases the workflow would need to be redesigned to verify the references using
another strategy, such as a strategy using Uniprot references, along with other datasets.
Although the typical usage of the prototype is to resolve queries about single entities,
there is no restriction on queries that combine multiple entities in a single query. This
would reduce the time taken to resolve the query, as a large part of the resolution time for
distributed queries on the Internet relates to the inherent latency in the communication
channels.

4.7.1 Use of model features

The workflow case study uses the basic query operations outlined in Section 4.4.1,
along with others described here, where they are useful for further optimising the way
the scientist constructs and maintains the workflows related to their research.

References to other datasets

The URI model does not require that the leading part of a URI can be used to identify
the namespace, although this assumption is used by other systems such as VoiD. This
assumption is necessary if scientists want a simple method of determining which refer-
ences in a data item are located in particular namespaces. This is an optimisation, but
it is necessary to avoid having to resolve an entire document, and every reference. The
optimisation is useful if the namespace can be identified from the leading part of the
URI, and the URI can be normalised using this leading part of the URI.
The data normalisation rules implemented in the prototype can be used to iden-
tify links to other namespaces if a partial normalised URI, e.g., http://bio2rdf.org/
namespace: can be transformed to the prefix of the unnormalised URI. This is neces-
sary, as the user does not know the identifier for any of the references that they are
searching for.
In cases where a data item contains references that can be identified easily, this query is useful for optimising the amount of information that needs to be transferred. This was used in the workflow to optimise the amount of information transferred, as the complete Affymetrix records were not required to determine their relevance to the workflow.
An alternative to this method is to recognise the predicate, and rely on the predicate
to optimise the amount of information that needs to be transferred. This alternative
is also able to be used with services that do not support partial URI queries, but they
do support queries that can be optimised based on the relationship between the record
and its references.

Human readable label for data

The references in linked datasets are given as plain, anonymous identifiers. This is not useful for scientists, as they need a human readable label to recognise what the reference refers to. This functionality is necessary to avoid retrieving entire documents
describing each of the references for a document just to display the most informative
interface for a scientist.
Although this query may be optimised to avoid transferring entire records, it is also
possible to filter an entire record using deletion rules. This is particularly relevant if
the linked dataset is only accessible using Linked Data HTTP URI resolution, and the
results need to be restricted to simple labels for the purposes of the query.
In the workflow, this query was used to identify Affymetrix records, as the entire record was not required for any part of the workflow, and the queries that were performed on the Affymetrix dataset did not include labels. It was useful to include human readable labels for scientists who wanted to verify the results manually.
In the Bio2RDF configuration, this query is implemented as “http://bio2rdf.org/
label/namespace:identifier”, where “namespace:identifier” is the identifying portion
from the normalised URI, “http://bio2rdf.org/namespace:identifier”. In the ex-
ample, the label for the Affymetrix data item identified by “http://bio2rdf.org/
affymetrix:gadB_b1493_at” is found by resolving the URL “http://bio2rdf.org/
label/affymetrix:gadB_b1493_at”.
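A sketch of how such a label document could be consumed is shown below; the assumption that the label is provided using rdfs:label is illustrative, as other properties may also be returned.

    from rdflib import Graph
    from rdflib.namespace import RDFS

    labels = Graph()
    labels.parse("http://bio2rdf.org/label/affymetrix:gadB_b1493_at")
    for subject, label in labels.subject_objects(RDFS.label):
        print(subject, label)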

Provenance record integration

Scientists can use the provenance information in their workflow to provide for decisions
based on which datasets and query types were used. There are many different cases
where provenance information is useful and necessary, but in the context of distributed
queries, the provenance record needs to contain at least the locations where the queries
were performed, what the queries in each location were, and which datasets were poten-
tially part of the results based on the queries and the data providers. The provenance
record can then be used to determine the legitimacy of the query, including any trust
and data quality metrics that can be created using annotations on the query types, data
providers, and normalisation rules.
The prototype exposes the provenance record using the prefix “/queryplan/”. For
example, the provenance for “http://bio2rdf.org/geneid:917300” can be found using
the URL “http://bio2rdf.org/queryplan/geneid:917300”. In the workflows in this
case study, these records were used to determine the source of statements, and the
query types that were used, to provide information about the link between Entrez gene
and Uniprot datasets, including the version of each dataset that was used. If there are
multiple versions in use, the information may not be trustworthy, as any deletions or
fixes that were provided between the versions may not be visible. If the versions do not match the most current versions, the datasets may also not be trustworthy.
The RDF statements in the document were then filtered on the basis of a Dublin Core version property ("http://purl.org/dc/terms/version") pointing to the release version of the dataset that was used. If data provenance information is to be included it
needs to be retrieved using a separate query type, as the provenance record is designed
to be only reliant on information that was in the model configuration of the server that
resolved the query to provide replicability down to the level of dataset versions. In the
context of the model, data provenance is informal information about how a particular
fact was derived, rather than how it was accessed. Current provenance models focus
on integrating data provenance chains into the data store, which is possible with the
model as it aims to be independent of datastores to enable replicability in future if
a datastore is not available. Although it is not the focus of this research there are
numerous studies available [34, 35, 95, 119, 149]. For this research, scientists need to
choose which queries they want to record the provenance for, and retrieve, store, and
process those provenance records before processing the information further.
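A sketch of retrieving a query plan and extracting the dataset version statements described here follows; the exact subjects that carry the Dublin Core version property are assumed for illustration.

    from rdflib import Graph, URIRef

    DCTERMS_VERSION = URIRef("http://purl.org/dc/terms/version")

    plan = Graph()
    plan.parse("http://bio2rdf.org/queryplan/geneid:917300")

    # Extract any dataset version statements from the provenance record.
    for provider, _, version in plan.triples((None, DCTERMS_VERSION, None)):
        print(provider, version)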

4.7.2 Discussion

The workflow management system was used to process and integrate different queries on
the prototype. The example workflows demonstrated the context-sensitive abilities of
the model to make it simple to personalise the data access operations, including the use
of different data normalisation rules and locations. They also demonstrated the way the
prototype could be used to make components of the overall query more efficient when
the scientist was aware of the nature of the data that was required without requiring
them to be aware of this aspect before starting to design the workflow. The workflow
was used to identify the provenance of some queries, although the necessary size of
the provenance records made it difficult to keep this information throughout the entire
workflow.
Including a reference to the base URL for the prototype in each workflow made it simple to change the context. In contrast to other methods such as direct access to Web Services, the base URL makes it possible for scientists to change the location of the data, and normalise the data in the model to match the structure of the statements that are used by the workflow. With direct access to Web Services, scientists would have to change the workflow to match a new set of Web Services if there were multiple sources of data, and encode each data normalisation rule into each workflow that required access to the Web Service.
The datasets and query definitions that were used were not identified at every stage,
as the provenance records were found to be very large in some cases where a large part
of the configuration was applicable to each query. The overall configuration file for
the prototype was in the range of 2-3MB when represented using RDF/XML. A global
links query, as was performed in the last step of the case study, returned provenance
records that were about 500KB in RDF/XML. The other provenance queries returned
documents ranging from 12KB for the query plan for examining references to uniprot in
affymetrix records to 100KB for uniprot and ecocyc record resolutions, as the Uniprot
dataset contains a large number of links to other namespaces. The size of the prove-
nance records for ecocyc records was increased because of a number of automatically
generated data providers that were created to insert links to HTML and other non-RDF
representations of records. The size of the provenance record is proportional to the number of data providers, RDF normalisation rules, and namespaces that were relevant to each query.
The workflow, along with the HTTP URIs that form the identifiers for data in the
workflow, was useful for linking the relevant services. It was useful for abstracting the
details of the queries from the scientific experiment, so that the queries on the datasets
could be changed, or optimised, without the scientist needing to change their workflow.
A scientist could change the data providers, including new normalisation rules to fit
the data quality expected by the workflows. They could redirect the location of the
prototype that they used by changing a single parameter in the workflow, making it
possible to change a public interface while providing the old interface for select purposes.
The scientist could also choose to extend the workflow using the model further by
following the pattern used by the workflow to make up new queries.

4.8 Auditing semantic workflows


Scientists can use the provenance information along with their results to audit the
datasources, queries, and filters that were used to produce their results. The provenance
model needs to be taken into account to validate the design of a workflow in terms of
whether the queries were successful and accurate. Although the provenance model does
not directly provide a proof, it can be used along with review methods to validate the set
of queries, and the datasets that were accessed. Assertions based purely on provenance
models are separated in this research from assertions which investigate the scientific
meaning which provided the motivation for the workflow.
Useful optimisation conclusions based on both the syntactic and semantic levels are
available as part of the provenance model. However, review methods were not designed
as part of the model other than to provide a simple flag to confirm that the configuration
for a query type or data provider had been curated. The provenance model is based on
the RDF configuration model, and as such is designed to allow for extensions, including
new relationships between items in the model while keeping backward compatibility.
Any new relationships that use Linked Data HTTP URIs in RDF would be naively
navigable, making it possible for scientists to use future provenance records in a similar
way to their processing of the current model. The current model provides URI links from
Providers to Query Types, Namespaces, and Normalisation Rules, links from Query
Types directly to Namespaces, and links from Profiles to Providers, Query Types and
Normalisation Rules.
If a scientist did want to make workflow optimisations, they could use either of two different approaches: one based on the number of observed statements about a specific item, and one based on a common semantic link between items. The first method is more suited to purely
workflow based provenance which attempts to streamline workflow executions based on
knowledge about failures of specific processing tasks, the length of time taken to execute
a specific workflow processor, or the amount of data which has to be transferred between
different physical sites to accommodate grid or distributed processing. The second
method relies on knowledge about the scientific tasks or data items which are being
utilised by each of the workflow designs and executions which are under consideration
by the auditing agent. The common link may refer to an antecedent item which was
related to the current item by means of a key of some sort as opposed to relating two
items which were both processed by a given processor and were ontologically similar.
Even without reasoning about the meaning of information derived from the semantic
review metadata, any data access methods embedded in the metadata can be utilised
to provide a favourites or tagging system which can quickly identify widely trusted data
providers and query types. This system may be integrated with other tagging systems,
using the model to access multiple data providers that could provide tags for an item.
In a scientific scenario where scientists are performing curation on data, for example,
gene identification in biology, the tagging could be used for both data and process
annotations. An example of how this process may be designed can be seen in Figure 4.4.
With respect to the first set, it is possible that specific semantic tags can be ac-
commodated within a provenance ontology, an example of which may be found in the
myGrid provenance ontology53 . The myGrid ontology is not directly suitable, as it as-
sumes that the workflow model that is being represented fits within the SCUFL model,
as shown by their use of DataCollection elements, which form the basis for the distinc-
tion between single data elements and multiple data elements being the output from a
SCUFL workflow. The most notable reason for not using the myGrid ontology, as is, for
extended research into the topic, is that it forms a number of key provenance elements
using strings, where the model is able to give unambiguous identifiers to each item, and
allow these to be referenced using URIs.
In contrast to the idea that only one ontology is acceptable in each domain, multi-
ple ontologies can exist, with both referencing each other where applicable and sensible.
This is particularly relevant where a process provider may have a description of their
server published using an ontology which is compatible with one workflow system, but
many parts of the description can also be represented the same way under another
ontology. If this were the case, and someone published a structure outlining the relationships between the two ontologies, then a reasoner could present the common pieces of information for both, even if the scientist did not anticipate this. This information
is also useful if the provider of the original server did not publish semantic descriptions
of their process, requiring an external provider to publish the descriptions and keep
them up to date. These external descriptions must then be relied upon to provide ac-
curate information in relation to auditing a given provenance log. The model provides
the basic elements that scientists can use to analyse a provenance log to determine the
range of sources that were used.


Figure 4.4: Semantic tagging process. [Flowchart omitted: a scientist collects samples, selects an interesting region, annotates the region with a tag, gives the tag a meaning, gives the region extended attributes, creates a semantic wiki page for the resource, and gives the tag extended semantic attributes; each step is illustrated with example URIs such as http://mquter.qut.edu.au/data/scientistA/region13445, http://mquter.qut.edu.au/tags/xba2, http://moat.mquter.qut.edu.au/tag/xba2, http://semwiki.mquter.qut.edu.au/xba2, http://dbpedia.org/resource/xba2 and http://mquter.qut.edu.au/provenance/20070813-22, with the legend "Data transformed by process into".]



4.9 Peer review


Direct access to scientific data is increasingly being required by peer reviewers to ver-
ify the results given in proposed publications, as the data is not easily replicable for
disciplines where the experiment relies on data from a particular animal or plant, for
example. This peer review process, shown in Figure 4.5, is predicated on the indepen-
dence of the peers and in some cases the authors are unknown, although in many fields
it may not be difficult for peers to identify the authors of a paper based on the style and
references. The transfer of this data to the peers in a completely transparent process
would require the data to be transferred to the publisher and the peers would receive
the information in an anonymised form from the publisher.
The model is useful for this process, as the namespace components of the URIs can
be transparently redirected to the publisher's temporary server or to the peers' copy of
the data, assuming that the publisher and peers require a detailed review on the as yet
unpublished data. Although it is useful to provide temporary access to unpublished
datasets, the dataset may eventually need to be published along with the publication.
At this point, a final solution must be given to support the relevant queries, with a
minimal number of modifications to the original processing methods.
Figure 4.5: Peer review process. [Flowchart omitted: the peer reviewer reads the published material, critiques the validity of the hypothesis, critiques the design of the experiment, analyses previous experiments, replicates the experiment if needed, and verifies the analysis process; the peer reviewers return their opinions to the journal editor, who takes the peer reviews and decides whether to publish the work; the article is published as part of a journal issue (electronic and/or paper), followed by citations by peers and post-publication reviews by peers.]

In some disciplines, the dataset may be either too large to individually publish each
item, or the publication may refer to evaluations of public data. In these cases, the
publication may provide a set of HTTP URI links to documents which can be used
to explore the information. This ability is not new, as DOIs have already been used for this purpose, but linking directly into the queries, so that both peer reviewers and future informal reviewers can obtain information about the component queries that were used, could be a valuable part of the publication chain in disciplines
such as genomics. It would be most valuable if there is a known relationship between
the items in the query results and other data items, as is encouraged in this thesis by
the use of HTTP URIs and RDF documents.
The prototype described in Chapter 5 publishes the details and results of individual
queries using HTTP URIs, for example, http://bio2rdf.org/myexp_workflow:656.
Related HTTP URIs are available for data access provenance records, for example,
http://bio2rdf.org/queryplan/myexp_workflow:656. Higher level provenance de-
tails are handled by other systems, such as a workflow management system, that is
suited to the task. For example, the MyExperiment website provides RDF descriptions
and HTTP URIs for each of the workflows and datasets that are uploaded to their
site, including the original URI for the workflow referred to earlier in the paragraph,
http://www.myexperiment.org/workflows/656.
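
Such links can also be consumed programmatically. A minimal sketch is given below, assuming the rdflib Python library, that the Bio2RDF URIs above still resolve, and that the rdfxml format prefix and the queryplan prefix can be combined in the order shown; none of this is guaranteed by the text above.

    from rdflib import Graph

    # Assumed URL patterns: the rdfxml format prefix and the queryplan prefix
    # are combined here in an assumed order; the live services may have changed.
    record_url = "http://bio2rdf.org/rdfxml/myexp_workflow:656"
    provenance_url = "http://bio2rdf.org/rdfxml/queryplan/myexp_workflow:656"

    graph = Graph()
    for url in (record_url, provenance_url):
        graph.parse(url, format="xml")  # fetch and parse each RDF/XML document

    print(len(graph), "statements describing the workflow and its query provenance")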

4.10 Data-based publication and analysis

In order for computers to be utilised as part of the scientific process, data needs to
be in computer readable forms [98]. A major source of communication for scientists is
published articles. As these publications restrict the amount of information that can be
included, scientists have developed text mining algorithms to interpret and extract the
natural language data from published articles [38, 49, 90, 126, 134]. However, recently
a few journals and conferences have started supporting the use of computer readable
publications, which include links to related data, and in some cases include the meaning
of the text and links [128]. The use of semantic markup on text and links to related data
provide opportunities for scientists to automate otherwise manually organised processes.
The model is designed to take advantage of these extensions to provide scientists with
integrated methods of solving their scientific questions using the interactions shown in
Figure 4.6.
Marking up of data, and its use in automating processes, presents many challenges
that were not present in the initial world wide web, where the emphasis was to make
data computer readable and able to be distributed between locations. Making data
readable simply requires that one arrange for a specific format based on a few simple
categories of data. For instance, true and false facts from boolean logic, integers and
real numbers from mathematics theory, and textual forms from natural language, are
sufficient as a basis for data readability. Other forms such as image and spatial data
can be represented using these basic forms, although they will take more computational
power to process and integrate.
Figure 4.6: Scientific publication cycle. [Diagram omitted: a scientist collects data experimentally, uses publicly accessible external datasets and local datasets, reads and integrates previously published work, collaborates and reviews publications, makes conclusions and writes them up for publication, linking to datasets and citing previously published material; the resulting publications join the scientist's own publications and the published material read by other scientists.]

The creation of semi-accessible data using plaintext metadata creates a quasi-semantic form which may be easier for scientists to personally understand, but does not ensure that a computer will be able to unambiguously utilise the data [2]. This plaintext metadata may come in the form of dataset fields or file annotations; however, these are difficult to relate intuitively to the real world properties that motivated the initial hypothesis.
The creation of structures that group and direct the associations between metadata ele-
ments is the final pre-requisite for making clear decisions. Such decisions may be based
on the domain structures that the computer can access, possibly using structures typi-
cally associated with scientific ontologies [14, 20, 22, 32, 77, 79, 84, 101, 104, 135, 146].
Although there may be disagreements about the use of data, semantic markup fixes
the original meaning while allowing the scientist to interpret the data using their own
knowledge of the meaning of the term [97, 109].
Although a computer will then be able to utilise the data more effectively, the
domain ontologies may still not be universally relevant due to a lack of world knowledge
introduced by their creators [122]. This lack is an inherent limitation to be overcome through cross-validation and evolution of the domain ontologies in an iterative manner, based on previous attempts to use the structures. The ability to gradually evolve and normalise the use of domain ontologies is provided through the context sensitive
normalisation rules in the model. The ontologies that are used in different locations
can be normalised where possible to give a simple basis on which to further utilise data,
without having to deal with complex mapping issues in each processing application.
References to data items need to be recognised by scientists regardless of the format
to remove the barriers to querying data based publications. This may not be a simple
exercise for the scientist, as there may be different URI formats that are used in dif-
ferent locations to identify an item. The model and prototype provide simple ways to
represent data references, and resolve references from multiple locations. The prototype
requires the use of the RDF format to integrate data from different locations. RDF does
not have to be presented in specialised RDF files. The RDFa format allows scientists
to mark up existing XML-like documents with structured content. This allows a trans-
parent integration with current web documents, which allows for scientists to present
and publish documents, and maintain structured data in the same document, with the
allowance for structured links to other documents using Linked Data principles.
The use of the model with RDFa is implemented in the HTML results pages gen-
erated by the prototype. The documents contain valid RDFa, however, the dynamic
nature of the documents generated by the model makes it hard to take advantage
of some of the advanced syntactic features of RDFa. RDFa is designed to allow for
efficient manual markup of XML documents using compulsory shorthand namespace
declarations, and dynamically generating an entire document reduces the usability of the information, as the namespaces must be artificially generated
to fit with the RDFa design principles.
As part of the publication process, scientists are required to give details about the
methods they used to generate their results. Using the model, they are able to publish
both the process and data details using a combination of workflow file formats, and links
to the relevant provenance records. Using these methods scientists can integrate less
curated data and workflows, without requiring methods of re-interpreting some pieces
of data using statistical or interpretative processes such as text mining or other forms
of pattern analysis. It may be necessary for scientists to continue to use these methods in some cases, including for past publications, but the future use of semantic annotations in relation to data, processes, and results on publications will increase the scalability of the analysis processes.
Semantically marked up publications contain links between textual language and
defined concepts, which were previously not available and could not previously be used
to validate or mediate the analysis process. These links can easily be generated using
the model as the data access method, and these documents can be simply integrated
with other RDF documents using standard RDF graph merging semantics.
The results may need to be derived through the use of a program that is not simple to
represent in workflow terminology, such as a Java, C# or C++ program. In these cases,
the prototype could still be used to resolve queries on datasets, given the wide range of
tools available to resolve and process HTTP requests and RDF documents. However,
the results may not be as simple to replicate as when the programs are represented in
other formats which can be easily transported between different scientists.
Chapter 5

Prototype

5.1 Overview

A prototype web application was created to demonstrate the scientific usefulness of


the data access model described in Chapter 3. The prototype was implemented as
a stateless, dynamic configuration driven, web application that relies on RDF as its
common document model. It was configured to provide access to a large number of
scientific datasets from different disciplines, including biology, medicine and chemistry.
It follows the Linked Data principles in making it possible for scientists to construct
URIs using the HTTP protocol, and make the URIs resolvable to useful information,
including relevant links to other Linked Data URIs that can be resolved using the
prototype. This enables it to be integrated easily with the cases described in Chapter 4.

The prototype makes it possible for scientists to access alternative descriptions of


published dataset records using their own criteria, including adding and deleting state-
ments from the record. HTTP URIs that identify different records in another scheme
can be resolved by the prototype using normalisation rules to convert the URIs to
locally resolvable versions. Scientists can then publish their own HTTP URIs that are
resolved using their instance of the prototype. In terms of provenance, scientists are able
to control the versions of the datasets that are used by the prototype, so they can then
publish URIs that point to their instance and be able to personally ensure that the data
in the resulting records will not materially change without changes to the configuration
of the prototype. For example, a scientist may resolve http://localhost/prototype/
namespace:identifier, and the prototype may be configured to interpret that URL
as being equivalent to http://publicprototype.org/namespace:identifier. This
functionality is necessary for scientists to be able to customise and trust their data,
without having to publish their documents using the official globally authoritative URI
for each record if it would not be resolved using their instance of the prototype.
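
A minimal sketch of this interpretation is shown below, assuming a simple prefix substitution rather than the full normalisation rule machinery described later in this chapter; the hosts are the example hosts used above.

    # Illustrative only: swap the host prefix between the local prototype URI
    # and the equivalent public URI used in the example above.
    LOCAL_PREFIX = "http://localhost/prototype/"
    PUBLIC_PREFIX = "http://publicprototype.org/"

    def to_public(uri):
        # Rewrite a locally resolvable URI into its public equivalent.
        return uri.replace(LOCAL_PREFIX, PUBLIC_PREFIX, 1)

    def to_local(uri):
        # Rewrite a public URI so that it resolves against the local instance.
        return uri.replace(PUBLIC_PREFIX, LOCAL_PREFIX, 1)

    print(to_public("http://localhost/prototype/namespace:identifier"))
    # http://publicprototype.org/namespace:identifier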

The prototype was used extensively on the publicly accessible Bio2RDF website,
allowing biologists and chemists to retrieve data from many different linked scientific


datasets. The prototype software was released as Open Source Software on the Sourceforge site (http://sourceforge.net/projects/bio2rdf/). Scientists can download this software and customise it with their own do-
main names to provide access to their local datasets without reference to the Bio2RDF
datasets.
The prototype accepts and publishes its configuration files in all of the major RDF
formats. This makes it possible for scientists to reuse both the Bio2RDF configuration,
containing biology and chemistry datasets, and any other configuration sources as part
of their implementation so that they can avoid reconfiguring access to these public data
providers. If a scientist has access to alternative data providers or query types compared
to the Bio2RDF configuration, they can use the profile mechanism to selectively ignore
the data providers that they are replacing without having first removed them from the
configuration document.
Scientists are able to share the details of their query with other scientists using
the RDF provenance document for the query. The prototype provides access to the
provenance record for any query using URIs similar to http://localhost/prototype/
queryplan/namespace:identifier. The provenance record contains all of the relevant
configuration information that the prototype relied on to process the query. It enables
other scientists to replicate the results without requiring any further configuration.
This makes it possible to set up an instance of the prototype using provenance records
as configuration sources, possibly without having access to the entire original configu-
ration that was initially used. This ability provides for direct replication of published
experiments wherever possible.
The prototype can be configured to access data using both internal access and
external access methods depending on the context of the user. Scientists can use this
functionality to provide limited access to their results using the prototype as a proxy.
Peers can replicate the experiment on the same datasets to review the results, without
the scientist having to expose direct access to their organisation's entire database. The
profiles, and the profile directives on each query type and provider, are used by peers to
distinguish between the data providers that they can access, as opposed to those that
the scientist used internally.

5.2 Use on the Bio2RDF website

The prototype was used as the engine for the Bio2RDF website, with an example shown
in Figure 1.13. In this role, it was used to access both Bio2RDF and non-Bio2RDF
hosted datasets. Queries on the Bio2RDF website are normalised using the Bio2RDF
URI conventions, with the exception of equivalence references in results that link to
non-Bio2RDF Linked Data URIs.
The prototype configuration, created for the http://bio2rdf.org/ website, con-
tains references to 1603 namespaces, 103 query types, 386 providers and 166 normali-
sation rules. The Bio2RDF configuration provides links out from the RDF documents
to the HTML Web for 142 of the namespaces. These providers contain URL templates
that are used to create URLs for the official HTML version for all identifiers inside
of the namespace. The Bio2RDF namespaces can be resolved using query types and
providers using 12921 different permutations. The Bio2RDF configuration can be found
in N3 format at http://config.bio2rdf.org/admin/configuration/n3.
Figure 5.1 shows an example of the steps required for the prototype to retrieve a
list of labels for the Gene Ontology (GO) [62] item with identifier “0000345”, known as
“cytosolic DNA-directed RNA polymerase complex”. It uses the Bio2RDF configuration
and illustrates the combination of a generic query type, along with a query type that is
customised for the GO dataset. The queries are designed so that the generic query can
be useful on any information provider, while the custom GO query will be restricted to
providers that contain GO information because it uses an RDF predicate that only GO
datasets contain. If another dataset was available to retrieve labels for GO terms using
a different query, then a custom query definition could be added in parallel to these two
queries without any parallel side effects. Further examples of how the prototype is used
on the Bio2RDF website can be found in Section 4.4.1 and Section 4.7.1.
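
The distinction between the two query types can be made concrete with a sketch of their templates. Purely as an illustration, and not the actual Bio2RDF query definitions, SPARQL templates along the following lines would behave as described: the generic template matches any rdfs:label, while the GO-specific template uses a predicate that only GO datasets contain. The placeholders are the normalised and endpoint specific URI template variables described later in this chapter.

    # Hypothetical SPARQL templates for the two query types discussed above;
    # the real Bio2RDF templates are defined in the configuration and may differ.
    GENERIC_LABEL_TEMPLATE = """
    CONSTRUCT { <${normalisedStandardUri}>
                  <http://www.w3.org/2000/01/rdf-schema#label> ?label }
    WHERE     { <${endpointSpecificUri}>
                  <http://www.w3.org/2000/01/rdf-schema#label> ?label }
    """

    # Restricted to GO providers because only GO datasets contain this predicate.
    GO_LABEL_TEMPLATE = """
    CONSTRUCT { <${normalisedStandardUri}>
                  <http://bio2rdf.org/ns/go#name> ?label }
    WHERE     { <${endpointSpecificUri}>
                  <http://bio2rdf.org/ns/go#name> ?label }
    """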
The prototype made it possible for the Bio2RDF website to easily include and
extend data from other RDF based scientific data providers. This made it possible
to perform advanced queries on datasets using normalised, namespace-based, URIs to
locate datasets and distribute queries.
The prototype implements some URI patterns that are unique to Bio2RDF, and
would need to be changed by modifying the URL Rewriting configuration file inside of
the prototype. The query is interpreted by the prototype and the actual query passed
to the model excludes the file format, whether the query is requesting a query plan,
and the page offset for the results. As these are prefixed in a known optional order,
scientists can always consistently identify the different instructions to the prototype for
each query, independent of their query types.

5.3 Configuration schema


The configuration documents for the prototype were created based on a schema that can
be found at http://purl.org/queryall/. The schema gives the definitions for each
of the schema URIs, along with whether an RDF object or a literal is expected for
each property. A simple configuration that was created using the model can be seen in
Figure 5.2.
In order for the configuration schema to evolve without interrupting past implemen-
tations, the prototype was built to understand different versions and to process them
accordingly. The prototype schema went through 4 revisions as part of this research,
with each implementation able to provide backwards compatible documents based on
schemas in previous versions. In some revisions, forwards compatibility with new versions was affected.
Forwards compatibility is important in order for scientists to replicate queries using current software where possible.

Figure 5.1: URL resolution using prototype. [Flowchart omitted: the URL http://bio2rdf.org/n3/label/go:0000345 is split into the host http://bio2rdf.org/, the response format n3 (RDF N3), and the user query label/go:0000345. Two query types match the regular expression ^label/([\w-]+):(.+): http://bio2rdf.org/query:labelsearch, which is useful for all namespaces, and http://bio2rdf.org/query:labelsearchforgo, which is only useful for the namespace http://bio2rdf.org/ns:go. The first matching group, "go", is the prefix of that namespace, so for both query types the provider http://bio2rdf.org/provider:mirroredobo is selected, the endpoint http://cu.go.bio2rdf.org/sparql is chosen, and the SPARQL template from the query type is filled with the variables from the user query. The generic query returns an rdfs:label statement for http://bio2rdf.org/go:0000345 and the GO-specific query returns a go name statement; the normalisation rule http://bio2rdf.org/rdfrule:ontologyhashtocolon changes <http://bio2rdf.org/ns/go#> to <http://bio2rdf.org/ns/go:> in the results. The results of all matching queries are parsed into an in-memory RDF database and returned to the user in the requested N3 format, giving rdfs:label, go name and dc:title values of "cytosolic DNA-directed RNA polymerase complex" for the item.]



@prefix query: <http://purl.org/queryall/query:> .
@prefix provider: <http://purl.org/queryall/provider:> .
@prefix profile: <http://purl.org/queryall/profile:> .
@prefix : <http://example.org/> .

:myquery a query:Query ;
    query:inputRegex "(.*)" ;
    profile:profileIncludeExcludeOrder profile:excludeThenInclude .

:myprovider a provider:Provider ;
    provider:resolutionStrategy provider:proxy ;
    provider:resolutionMethod provider:httpgeturl ;
    provider:isDefaultSource "true"^^<http://www.w3.org/2001/XMLSchema#boolean> ;
    provider:endpointUrl "http://myhost.org/${input_1}" ;
    provider:includedInQuery :myquery ;
    profile:profileIncludeExcludeOrder profile:excludeThenInclude .

Figure 5.2: Simple system configuration in Turtle RDF file format

For example, in the first revision, the RDF Type definitions for a provider did not include which protocols it supported, as these were
only given in the resolution method. However, the current version requires the RDF
Type definitions to recognise the type before parsing the rest of the configuration,
independent of the resolution method. In this case, a scientist would need to add the
RDF Type to each of the providers before loading the configuration when they wished
to replicate queries using that version. In general, any new features that are added to
the model may affect forward compatibility if the default values for the relevant features
are not accurate for past implementations.
The configuration schema is defined using RDF so that the same query engine can be
used to query the configuration as the one that parses and queries the scientific data. In
addition, the use of RDF enables previous versions to ignore elements that they do not
recognise without affecting the other statements. This enables past implementations to
highlight statements that they did not recognise to flag possible backwards compatibility
issues with new configuration schemas, while enabling them to function using a best
effort approach.
If the configuration is dynamically generated using the prototype, a particular con-
figuration schema revision may be requested and the prototype will make a best effort
to translate the current model into something that will be useful to past versions. For
example, the Bio2RDF mirrors all request their current implemented version from the
configuration server, even if the configuration server supports a new version. However,
the use of RDF makes it possible to include new features in the returned documents
without affecting the old clients who are required to flag and ignore unrecognised prop-
erties in configuration documents.
Backwards and forwards compatibility issues have several implications for scientists.
If a query provenance record, including the relevant definitions from the overall con-
figuration, is published as a file attachment to a publication, it will use the current
configuration version. The query provenance record contains a property for each of the
query bundles that were relevant to the query, indicating which schema version was
used to generate that bundle. Scientists can use this information to accurately identify
a complying implementation to execute the query provenance record against when they
wish to replicate it. It is recommended that scientists publish the actual query prove-
nance records, or at minimum the entire configuration file that was relevant to their
research.
If the query provenance record is published using links to a live instance of the
prototype, the links may not be active in the future, so replicability may be hampered.
However, if the links are active, they may refer to either the current version of the
configuration syntax, as the software may have been updated, or they may explicitly
refer to the version that was active when the research was performed. The Bio2RDF
prototype allows fetching of configuration information using past versions, for example,
http://config.bio2rdf.org/admin/configuration/3/rdfxml will attempt to fetch
the current configuration in RDF/XML format using version 3 of the configuration syn-
tax, while http://config.bio2rdf.org/admin/configuration/rdfxml will attempt
to retrieve the configuration using the latest version of the configuration syntax.
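
A minimal sketch of retrieving both documents is given below, assuming the rdflib Python library and that the configuration server is still available, which may no longer be the case.

    from rdflib import Graph

    # Version specific and version independent configuration URLs; these may no
    # longer resolve if the configuration server has been retired.
    versioned_url = "http://config.bio2rdf.org/admin/configuration/3/rdfxml"
    latest_url = "http://config.bio2rdf.org/admin/configuration/rdfxml"

    config_v3 = Graph().parse(versioned_url, format="xml")
    config_latest = Graph().parse(latest_url, format="xml")

    # Because the configuration is plain RDF, statements using properties from a
    # newer schema can simply be flagged and ignored by an older client.
    print(len(config_v3), "statements (version 3 syntax)")
    print(len(config_latest), "statements (latest syntax)")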
If possible, all three options should be simultaneously available to scientists when it
comes time to publish research that uses the prototype, as all three have advantages,
and together, they reduce the number of disadvantages. For example, if they just
publish the file, scientists do not have access to any new features that they may have
implemented, or fixes to normalisation rules that they may have found necessary since
the publication.
If they just publish the version specific configuration URL, they ensure that any
future normalisation rule, provider and query type changes will be available, as long as
the URL is resolvable. The version specific configuration URL should also limit the
possibility for unknown future features to have a semantic impact. However, if it is not
obvious how to change this URL to the current version, or the version that the scientist
has access to, it may not be as useful to them.
If they just publish the version independent URL, then future scientists may not
know how to get access to the past version. Overall, the availability of all three options
conclusively identifies all of the necessary information: the configuration syntax version
that was used, in the RDF statements making up the query provenance record; the
original configuration information; the current configuration information available in
the original configuration version; and the current configuration information available
in the current configuration version.

5.4 Query type


Query types are templated queries that contain the variables which are relevant to the
query. The meaning of the variables is unique to each query type, as different query
types may interpret the same question in different ways. Scientists can take advantage of
this ability to create new compatible query types independent of other query types that
may have previously been defined, and may still be used, in the context of some data
providers. This has a direct effect on the ability of scientists to contextually normalise
information, as the previous query interface may not have differentiated between factors
that the scientist believes are relevant to their current context.


An example of a situation where a scientist may want to optimise a current query
to fit their context can be found in Figure 5.3. Depending on the scientist’s trust and
data quality preferences expressed in their profiles, all of the query types and providers
illustrated in Figure 5.3 may be used to resolve their query. This behaviour is reliant on
the design of the pre-configured query type as a place where the namespace and data
providers can be abstracted from their query. The abstraction makes it possible for the
normalisation rules to then be independent from the way the query is defined, apart
from including markers in the query to determine where endpoint specific behaviour is
required.
In some situations a query may need to be modified to suit different data providers.
This can be achieved without changing the way the query is posed using a new query
type that is relevant to the namespaces in the data provider and the original query
variables.

Figure 5.3: Optimising a query based on context. [Diagram omitted: a typical query for <http://myscientist.org/concept:A/101> is matched by a widely available query type that matches *:* (namespace:identifier) and identifies a useful endpoint for the namespace "concept"; an optimised query for the same URI is matched by a locally available concurrent query type that matches concept:A/* and performs the query on the scientist's local database of concepts known to start with A/.]

The prototype implements query types as wrappers around templates that are ei-
ther SPARQL queries or RDF/XML documents. These templates are used in con-
junction with providers to include RDF statements in the results set. The wrappers
require information derived from the input parameters. In the case of the prototype,
the input parameters are derived by matching regular expressions to the application
specific path section of the URL that was resolved by a scientist. For example, if the
scientist attempted to resolve http://localhost/prototype/namespace:identifier,
where “http://localhost/prototype/” was the path to the prototype application,


then the input parameters would be derived by matching “namespace:identifier”
against the regular expressions for each known query type. If a match occurred, then
the query type would be utilised, and the matching groups from the regular expression
would be used as parameters.
The prototype interprets the parameters as, possibly, a namespace, and either public
or private, according to the configuration of each query type, as shown in Figure 5.4.
Namespace parameters are matched against prefixes in all of the known namespaces to
allow scientists to integrate their datasets with public datasets, without having to previ-
ously agree on a single authoritative URI for the involved namespaces. The distinction
between public and private is used to automatically identify a set of parameters that
could be normalised separately from the public namespace prefixes. A public parame-
ter may be normalised to lower case characters if a template required this for instance,
where the scientist may not want private, internal, parameters to be normalised. This
distinction enables the scientist to control the data quality of the private identifiers
separately from the public namespace prefixes that they have more control over. In
the prototype, matching groups default to being private, and not namespaces. In the
example above, “namespace:identifier” may be matched against a query type using
the regular expression “([\w-]+):(.+)”. The query type may define matching group number 1, “([\w-]+)”, to be a public parameter, and for it to be a namespace prefix.
The second matching group, “(.+)”, defaults to being private and not a namespace.
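
A minimal sketch of this matching behaviour is shown below, using illustrative function and variable names rather than the prototype's actual code; it also fills in the BioGUID provider template URL from Figure 5.4 using the matching groups.

    import re

    # Illustrative sketch: the first matching group is treated as a public
    # namespace prefix and the second as a private identifier.
    QUERY_TYPE_PATTERN = re.compile(r"([\w-]+):(.+)")

    def match_user_query(path):
        match = QUERY_TYPE_PATTERN.match(path)
        if match is None:
            return None  # this query type does not apply to the user query
        namespace_prefix = match.group(1)    # public, declared to be a namespace
        private_identifier = match.group(2)  # private by default
        return namespace_prefix, private_identifier

    # Fill in the provider template URL from Figure 5.4 with the matching groups.
    template = "http://bioguid.info/openurl.php?display=rdf&id={input_1}:{input_2}"
    prefix, identifier = match_user_query("doi:10.1038/msb4100086")
    print(template.format(input_1=prefix, input_2=identifier))
    # http://bioguid.info/openurl.php?display=rdf&id=doi:10.1038/msb4100086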

Figure 5.4: Public namespaces and private identifiers. [Diagram omitted: the user query http://localhost/doi:10.1038/msb4100086 is matched by a query type against the Bio2RDF URI pattern ([\w-]+):(.+), giving two groups; only the first matching group is declared public, and it is also declared to be a namespace, so the second group is private. The prefix "doi" identifies the Bio2RDF DOI namespace http://bio2rdf.org/ns:doi, which is handled by the BioGUID DOI resolver provider with the template URL http://bioguid.info/openurl.php?display=rdf&id={input_1}:{input_2}, replaced to give http://bioguid.info/openurl.php?display=rdf&id=doi:10.1038/msb4100086.]

Scientists can extend the configuration to support new query methods by creating
and using a new URI for the relevant parts of the query type, along with a new resolution
method and resolution strategy on the compatible providers. This makes it possible
to add a Web Service for instance, where previously the query was resolved using a
SPARQL endpoint, without formally agreeing on the interface in either case. Although
it hides some of the semantic information about a query, it is important that scientists
do not have to have a community agree about the semantic meaning for a query before
they are able to use it. The use of the query in different scenarios may imply its meaning,
but the data access model and prototype do not require the meaning to be established
to use or share queries.

5.4.1 Template variables

Scientists need to be able to make up generic query templates that include variables
which are filled in when the query executes. This pattern is similar to the way resources are typically retrieved. The prototype, however, allows the variables to be defined
in different ways for each of a set of query types that may be known to be relevant to the
query. In contrast to typical methods that provide either named parameters, e.g., HTTP
GET parameters and Object Oriented Programming languages, or complex structured
objects, SOAP XML, the prototype allows scientists to define a parameter based on the
path from the URL that the scientist tried to resolve. For example, the scientist may
resolve http://localhost/prototype/concept:Q/212, and the prototype would use
“concept:Q/212” as its input. This input could match in different ways depending on the purpose of each query type. The prototype uses regular expression
matching groups to recognise information in queries. These groups are available as
variables to be used in templates in various ways, allowing denormalised queries and
result normalisation to take place.
Template variables, of the general form “${variablename}” can be inserted in tem-
plates, and if they match in the context of the query type and the URL being resolved,
they will be replaced. In the example shown in Figure 5.5, the two query types each
match on the same user query. The matching process creates two input variables in
the BioGUID query, and one input variable in the general query. In the general case, the
template variable does not represent a namespace, while in the BioGUID case, the first
input variable represents a public variable, and it is known to be the prefix for a names-
pace. The prototype is then able to use the first input variable to both choose a provider
based on the namespaces that have this prefix, and replace the template variable in the
URL using the input. In the general case, there is no namespace variable, so any default
providers for the query type are chosen, and the template variable is replaced in the
context of these providers.
The matching groups are available as template variables in different ways depending
on the level of normalisation required. If the second matching group in the previous
example needs to be converted to upper case, the variable “${uppercase_input_2}”
could be included in the template. Variables need to be encoded so that the content
of the variables will never interfere with the document syntax rules. For example, to
properly encode the uppercased input variable used above, the template variable needs
to be changed to “${urlEncoded_uppercase_input_2}”.

Figure 5.5: Template parameters. [Diagram omitted: the user query doi:10.1038/msb4100086, from the resolved URL http://localhost/doi:10.1038/msb4100086, is matched by two query types. The HTTP GET query type matches everything with the pattern (.+), one group, with no namespaces recognised; its default Bio2RDF DOI provider handles all namespaces and resolves the query using the public Bio2RDF mirrors with the template URL http://bio2rdf.org/rdfxml/{input_1}, giving http://bio2rdf.org/rdfxml/doi:10.1038/msb4100086. The Bio2RDF URI query type matches the pattern ([\w-]+):(.+), two groups, with the namespace prefix as the first group; the BioGUID DOI provider handles the namespace http://bio2rdf.org/ns:doi with the template URL http://bioguid.info/openurl.php?display=rdf&id={input_1}:{input_2}, giving http://bioguid.info/openurl.php?display=rdf&id=doi:10.1038/msb4100086. Note that {input_1} is interpreted differently for each provider.]

This will replace all URI reserved characters with their “%XX” versions, where “XX” is the reserved character code according to the relevant RFC (http://tools.ietf.org/html/rfc3986#section-2.1). In the prototype, the %XX variables are always
represented in uppercase, so “:” is encoded as “%3A” and not “%3a”.
In some cases data quality depends on the level of standardisation. In the case of
URIs, some applications differ in the way they encode spaces. One set of applications
encode spaces (“ ”) as the plus (“+”) character, and other applications encode it using
percent encoding as “%20”. The prototype allows for both cases: a URI can be encoded using the “+” character for spaces by changing “urlEncoded” in template variables to “plusUrlEncoded”. For example, the example given above could be changed to “${plusUrlEncoded_uppercase_input_2}”. If the dataset does not consistently use
either method then the prototype may be able to provide limited access to the data by
using two different query types on the same provider to attempt to access data using
both conventions.
For example, this template variable is used on the “dbpedia” namespace to retrieve
information about resources starting with the prefix http://dbpedia.org/resource/.
There are some URIs starting with this prefix which contain special characters in the
name such as http://dbpedia.org/resource/Category:People_associated_with_
Birkbeck%2C_University_of_London, as the percent encoding is required to complete
the name, but it interferes with the colon character after “Category”. It is difficult
to retrieve information from dbpedia, if the item is a category, and the name of the
category contains a reserved character, as each case would require a normalisation
rule to encode the reserved character. In the example above, the prototype will at-
tempt to retrieve information about the non-percent-encoded URI, .../Category:
People_associated_with_Birkbeck,_University_of_London.
When the identifier from the dbpedia namespace is fully percent-encoded, the URL
would be .../Category%3APeople_associated_with_Birkbeck%2C_University_of_
London. Neither of these URIs will return information because they do not exactly
match anything in the RDF database that is hosting DBpedia.
The URI RFC (http://tools.ietf.org/html/rfc3986#section-2.2) includes a condition that URIs are equivalent if they only differ by a percent encoded element in a part of the URI where the reserved characters will not interfere with the syntax, in this case, the path section of the URI. Given this, it may be satisfactory to make another namespace, for example “dbpediacategory”,
and associate part of the “dbpedia” namespace with that namespace instead to avoid
the issue, for example, the URI would be http://myexample.org/dbpediacategory:
People_associated_with_Birkbeck%2C_University_of_London, which does not con-
tain conflicting percent encoded characters.
URLs in RDF/XML must also have the XML reserved characters encoded, as some
of these characters are not encoded by the URI encoding (http://www.w3.org/TR/REC-xml/#sec-predefined-ent). This template variable
in the example using RDF/XML would be “${xmlEncoded_urlEncoded_input_2}”.
The order of conversion is right to left, so the input is converted to uppercase before it
is URL encoded, and then the resulting string is XML encoded.
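
A minimal sketch of these conversions is given below, using Python's standard library as a stand-in for the prototype's own encoding functions, whose treatment of characters such as "/" may differ.

    from urllib.parse import quote, quote_plus
    from xml.sax.saxutils import escape

    # Stand-ins for the template variable conversions described above; these are
    # not the prototype's actual implementation.
    def uppercase(value):
        return value.upper()

    def url_encoded(value):
        # Percent-encode URI reserved characters; Python emits uppercase "%XX".
        return quote(value, safe="")

    def plus_url_encoded(value):
        # Encode spaces as "+" rather than "%20".
        return quote_plus(value)

    def xml_encoded(value):
        # Escape the XML reserved characters "&", "<" and ">".
        return escape(value)

    # Conversions are applied right to left: uppercase, then URL encode, then
    # XML encode, as described in the text above.
    print(xml_encoded(url_encoded(uppercase("msb 4100086/extra"))))
    print(plus_url_encoded("People associated with Birkbeck"))  # spaces as "+"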
A query type may define intermediate templates which take the input parameters
and use them to create other templates. An example of this in the prototype is the
allowance to define for each query what a normalised URI would look like, given the
parameters. This enables scientists to have another variable to use in queries, where they
do not necessarily want to deal with each input, as that may make the template harder to
port to other query types. The normalised URI in the previous example may be defined
using “${defaultHostAddress}${input_1}${defaultSeparator}${input_2}”. In the
template, the expected public host is given, where it may in some cases be resolvable
to a Linked Data document. Along with the default separator, this template is used to
derive the normalised URI given the abstract details, which may not be directly related
to the URI that was used by the scientist to resolve the query with the prototype. This
allows scientists to use the normalised URI matching the public URI scheme when they
do not need to publish URIs that will resolve using their instance of the prototype.
In many situations scientists may require others to resolve the document using their instance in order to easily replicate the queries; however, both options are available.
The ability for the prototype to recognise intermediate templates was necessary
so that scientists can reliably normalise complete URIs to the appropriate form that is
required for each endpoint, as the normalisation rules assume that they will be operating
on the complete URI. For example, it is not possible to reliably normalise a URI, for
example http://myorganisation.org/concept:B/222 to its endpoint specific version,
for example, http://worldconcepts.org/concepts.html?scheme=B&id=222 without
including the protocol and host along with the parameters that were interpreted by

the query type. If only the parameters "concept:(A-D)/(\d+)" were included in the
rule, then the normalisation phase may change http://worldconcepts.org/concepts.
html?scheme=B&id=222 to a URL such as http://worldconcepts.org/concepts:B/
222 without changing the host or protocol for the URI.
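
A minimal sketch of this two step process is shown below, building the normalised URI from the intermediate template and then applying a full-URI regular expression rule, using the example hosts above; the rule pattern is illustrative rather than the prototype's actual rule syntax.

    import re

    # Build the normalised URI from the intermediate template, then apply a
    # full-URI regular expression rule to obtain the endpoint specific URL.
    default_host_address = "http://myorganisation.org/"
    default_separator = ":"
    input_1, input_2 = "concept", "B/222"

    normalised_uri = (default_host_address + input_1
                      + default_separator + input_2)
    # http://myorganisation.org/concept:B/222

    # The rule operates on the complete URI, including protocol and host, so
    # that the endpoint specific form can be produced reliably.
    endpoint_specific = re.sub(
        r"^http://myorganisation\.org/concept:([A-D])/(\d+)$",
        r"http://worldconcepts.org/concepts.html?scheme=\1&id=\2",
        normalised_uri)
    print(endpoint_specific)
    # http://worldconcepts.org/concepts.html?scheme=B&id=222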

5.4.2 Static RDF statements

The prototype allows scientists to include pre-defined, static, templates that can be
used to insert information into a document in the context of a data provider and a
user's query. Scientists can use this functionality to keep track of which URIs each
endpoint used, along with their relationship to the URI used by the prototype. This
makes it possible for scientists to describe the context of their document in relation
to other documents, which may be describing equivalent real world things in different
ways. This is in contrast to typical recommendations that specify that there should be a
single URI for each concept, so that others can simply query for descriptions of the item
without knowing what context each alternative description is given in. The RDF model
is designed to provide ways of inter-relating concepts, including different descriptions of
the same item. The prototype allows scientists to set up basic rules about the context
surrounding their use of the URIs, and the RDF model, along with HTTP URIs, are
used to implement this.
Each of the alternative URIs may be a Linked Data URI that can be resolved to a
subset of the data that was found using the normalised URI structure. Even traditional
web URIs, that do not resolve to RDF information, can be useful. For example, a
scientist’s program may require a particular data format, and the traditional web URI
can point to that location, as shown in Figure 5.6.
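
A minimal sketch of such a static insertion is given below, assuming the rdflib Python library and using dc:identifier and rdfs:seeAlso as stand-ins for the prototype's actual predicates, which are not specified here.

    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import DC, RDFS

    # Hypothetical predicates: dc:identifier and rdfs:seeAlso stand in for the
    # links shown in Figure 5.6; the prototype's actual vocabulary may differ.
    record = URIRef("http://example.org/namespace:identifier")
    format_x_url = URIRef(
        "http://namespacefoundation.edu/cgi-bin/formatX.cgi?id=identifier")
    alternative_uri = URIRef(
        "http://othersource.org/record/namespace2/identifier")

    static_statements = Graph()
    static_statements.add((record, DC.identifier, Literal("namespace:identifier")))
    static_statements.add((record, RDFS.seeAlso, format_x_url))     # non-RDF format
    static_statements.add((record, RDFS.seeAlso, alternative_uri))  # alternative URI

    print(static_statements.serialize(format="turtle"))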

Figure 5.6: Uses of static RDF statements. [Diagram omitted: the normalised URI http://example.org/namespace:identifier is given a Dublin Core identifier "namespace:identifier", a scientific data format URL http://namespacefoundation.edu/cgi-bin/formatX.cgi?id=identifier, and an alternative Linked Data URI http://othersource.org/record/namespace2/identifier, whose RDF data source URL (from the provider endpoint) is http://namespacefoundation.edu/cgi-bin/RDFendpoint.cgi?id=identifier.]



5.5 Provider
Datasets are available in different locations, and the method of accessing each dataset
may be different. Scientists are able to implement multiple access methods across
multiple locations for a range of query types using providers. This makes it possible for
scientists to change the method, location, and query that they use for data access
without having to change their programs, as providers are contextually linked to the
query types, which are linked to the overall query. The prototype makes it possible
to use a range of methods for data access by substituting providers and query types,
enabling scientists to replicate queries in their context without restrictions that may
have been imposed on the original scientist.
Each dataset may need more than one data quality normalisation change in order
for the results to be complete. Each provider can be configured with more than one
normalisation rule, which makes it possible to layer normalisation rules. If scientists
want to change the URIs that are given in a document to republish the results using the
URI of their prototype, they can add a normalisation rule at a higher level than each of
the other normalisation rules. This normalisation rule will change the normalised URI
into one that points to the prototype.
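
A minimal sketch of this layering is shown below, using simple ordered regular expression rules rather than the prototype's full rule objects; the hosts are the examples used elsewhere in this chapter.

    import re

    # Ordered result normalisation rules for a provider: dataset specific
    # clean-up first, then a final layer that republishes the results using the
    # scientist's own prototype base URI.
    RESULT_RULES = [
        (r"http://namespacefoundation\.edu/", "http://myexample.org/"),
        (r"http://myexample\.org/", "http://localhost/prototype/"),
    ]

    def normalise_results(document):
        for pattern, replacement in RESULT_RULES:
            document = re.sub(pattern, replacement, document)
        return document

    print(normalise_results("<http://namespacefoundation.edu/namespace:identifier>"))
    # <http://localhost/prototype/namespace:identifier>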
In some cases, the scientist may not know exactly which set of data normalisations
are required for a particular dataset. In these cases the scientist may need to have
multiple slightly different queries performed on the dataset as part of the overall query.
Each provider can be used multiple times to respond to the same overall query, although
it will only be used once in the context of a query type.
The prototype uses HTTP based methods, including HTTP GET, and an HTTP
POST method for SPARQL endpoints. These make it possible to use a large range
of data providers, including all of the Linked Data scientific data providers, and the
SPARQL endpoints that have been made available as alternatives for the Linked Data
interfaces to the scientific data providers. In Figure 5.7, there are two providers, which
are used in different contexts. In one context, myexample.org uses the original data
provider, with the associated normalisation rule, and in the context of a scientist who has
curated and normalised the dataset, the scientist could easily substitute their provider,
and exclude the original data provider using their profile.

5.6 Namespace
The concept of a namespace encompasses both the representation of data and the origin
of the datasets. Although its use in the implementation is limited to simple declarations
that a feature such as a part of a dataset exists on the provider in the context of the
queries it is defined to operate on, it could easily be extended to include statistics. This
may provide for a generalised model where queries are ad hoc, and the parameters are
directly defined by the scientist instead of in the context of each query type.
Regular expression matching groups in the prototype identify the namespace sections in the scientist's query, which are used to identify a set of namespaces.

Figure 5.7: Context sensitive use of providers by prototype. [Diagram omitted: the user query http://myexample.org/namespace:identifier is resolved by myexample.org using the common profile with an untemplated query against the original data provider, via HTTP GET from http://namespacefoundation.edu/cgi-bin/RDFendpoint.cgi?id=identifier, which requires the original data normalisation rule changing http://namespacefoundation.edu/ to http://myexample.org/. Scientist A resolves the same query using the Scientist A profile, primarily with a SPARQL DESCRIBE query against a curated, normalised, local data provider via HTTP POST to http://localhost:1234/sparql, while the original data provider is excluded by the Scientist A profile.]



The matching groups are prefixes that are mapped to a set of URIs from the configuration
settings for the prototype. The current model and implementation require that scientist
inputs are given as a single string. The use of a single arbitrary string enables semanti-
cally similar queries to choose the parts of the string that are relevant without affecting
other queries. For example, one query type may take the input and send it as part
of an HTTP GET query to an endpoint without investigation into the meaning of the
scientist’s query, while other query types may require knowledge about the meaning of
the query and only partially use the information in a query.
The use of indirect namespace references, using identical prefixes which are attached
to multiple arbitrary URIs, makes it possible to integrate prototype configurations which
would not be possible if the namespace was directly named with a single URI. It is nec-
essary to integrate different sources in a distributed query model, as there is no single
authority which is able to name each of the namespaces authoritatively. A hypothetical
example of where different configurations can be integrated seamlessly using the pro-
totype is shown in Figure 5.8, where it would be difficult to integrate them otherwise
as it would require changes to one of the configurations to make its namespace URIs
match the others. In addition, the ability to define aliases for namespaces makes it
necessary to rely on the prefix instead of the URI, as there is not a single relationship
between a URI and a prefix. The scientist needs to identify providers and query types
in their contextual profiles as the basis of selecting which providers will actually be used
to resolve the query, rather than namespaces, as it is ambiguous from their query which
one should automatically be chosen.

5.7 Normalisation rule


A principal issue in the area of linked scientific datasets relates to the inability of scientists to query easily across different datasets, due to the inability of the relevant communities to agree on both the structure of a universal URI scheme and the
mechanism for how those URIs will be resolved. This model neatly avoids these issues
by reducing URIs to their components, in this case, a namespace prefix, and a private
identifier. The principal information components, which have been used by scientists in
the past to embed non-resolvable references into their documents, are used to create a
series of normalisation steps for each dataset, assuming a starting point that matches
the scientists preferred URI structure. This standard URI is then transformed using
the normalisation rules as necessary given the context of each provider, so that queries
successfully execute with the correct results, before normalising the results so that
scientists can understand the meaning without having to manually translate between
URIs that are not relevant to their scientific questions.
Normalisation rules can be defined and applied to providers to access datasets that
follow different conventions to what a scientist expects. Normalisation rules are initially
applied to query templates using template variables as well as to the query at other
stages.

Figure 5.8: Namespace overlap between configurations. [Diagram omitted: the user query http://myexample.org/doi:identifier can be resolved using two configuration sources. Configuration A contains namespace alpha with the preferred prefix "doi" and original producer http://www.doi.org/, while Configuration B contains namespace beta with the preferred prefix "doid", the alias "doi", and original producer http://mydigitalorganisationid.example.org/.]



The prototype implements the model's requirement for ordered normalisation rules with a staged rule application model. In the model, the query and the results go through
different stages where normalisation can occur. For each stage from Table 3.1, the
supported normalisation methods that the prototype implements are listed in Table 5.1.
At each stage, the normalisation rules are sorted based on a global ordering variable
where indexes were sorted based on whether the stage was designed for de-normalisation
or normalisation. The reversal of the order priority in the latter normalisation stages
made it possible to integrate the normalisation and denormalisation rules for a single
purpose into a single rule. This made it easier to manage the configuration information,
as many rules would otherwise have required splitting in two.

Stage                              Method                       Order Priority

1. Query variables                 Regular Expressions          Low to High
2. Query template before parsing   Regular Expressions          Low to High
3. Query template after parsing    None implemented             Low to High
4. Results before parsing          Regular Expressions, XSLT    High to Low
5. Results after parsing           SPARQL CONSTRUCT             High to Low
6. Results after merging           SPARQL CONSTRUCT             High to Low
7. Results after serialisation     Regular Expressions          High to Low

Table 5.1: Normalisation rule methods in prototype
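As an illustration of the staged ordering described above, the following minimal Java sketch (not the prototype's actual code; the interface and method names are hypothetical) sorts the rules that apply at a given stage by their global order value, ascending for the de-normalisation stages (1 to 3) and descending for the normalisation stages (4 to 7), before applying them in sequence.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Hypothetical sketch of staged, ordered rule application.
    final class StagedRuleApplication {

        interface Rule {
            int order();                           // global ordering value shared by all stages
            String apply(int stage, String input); // regex, XSLT or SPARQL based transformation
        }

        // Stages 1-3 de-normalise (low to high), stages 4-7 normalise (high to low).
        static String applyStage(List<Rule> rules, int stage, String input) {
            List<Rule> ordered = new ArrayList<>(rules);
            Comparator<Rule> byOrder = Comparator.comparingInt(Rule::order);
            ordered.sort(stage <= 3 ? byOrder : byOrder.reversed());
            String current = input;
            for (Rule rule : ordered) {
                current = rule.apply(stage, current);
            }
            return current;
        }
    }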

In the prototype, URIs that should match what appears in an endpoint take the form of "${endpointSpecificUri}". This template variable is derived by applying a set of Regular Expression normalisation rules to the standard URI template that is available in the query type, after substituting the relevant variables. The prototype uses this method as a demonstration of the usefulness of high level normalisation rules to a set of distributed linked datasets. The use of the endpoint specific URI template, and its normalised version, "${normalisedStandardUri}", enables queries to be created that are not possible in other models.
The prototype initially supports three types of rules: Regular Expression string replacements, SPARQL CONSTRUCT queries for selection or deletion of RDF triples, and XSLT transformations that convert XML results documents into RDF statements so they can be integrated with other RDF statements. Other types of rules can be created by implementing a subclass of the base normalisation rule class and creating properties to be used in configurations. Each rule needs to be serialisable to a set of RDF triples to enable replication solely based on the provenance record.
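A minimal sketch of how such a rule hierarchy could look is given below; the class names, fields, and the RDF vocabulary used in the serialisation are assumptions for illustration, not the prototype's actual definitions.

    import java.util.regex.Pattern;

    // Hypothetical base class for normalisation rules; the prototype's real class
    // and property names may differ.
    abstract class NormalisationRule {
        protected final String uri;   // rule identifier, referenced from provenance records
        protected final int order;    // global ordering value

        protected NormalisationRule(String uri, int order) {
            this.uri = uri;
            this.order = order;
        }

        /** Applied to template variables before the query is sent (de-normalisation). */
        public abstract String denormalise(String input);

        /** Applied to results documents (normalisation). */
        public abstract String normalise(String input);

        /** Each rule must be serialisable to RDF for replicable provenance records. */
        public abstract String toRdf();
    }

    // Regular Expression rule with matching de-normalisation and normalisation sub-rules.
    class RegexNormalisationRule extends NormalisationRule {
        private final Pattern inputMatch;
        private final String inputReplace;
        private final Pattern outputMatch;
        private final String outputReplace;

        RegexNormalisationRule(String uri, int order,
                               String inputMatch, String inputReplace,
                               String outputMatch, String outputReplace) {
            super(uri, order);
            this.inputMatch = Pattern.compile(inputMatch);
            this.inputReplace = inputReplace;
            this.outputMatch = Pattern.compile(outputMatch);
            this.outputReplace = outputReplace;
        }

        @Override public String denormalise(String input) {
            return inputMatch.matcher(input).replaceAll(inputReplace);
        }

        @Override public String normalise(String input) {
            return outputMatch.matcher(input).replaceAll(outputReplace);
        }

        @Override public String toRdf() {
            // Illustrative Turtle serialisation using a made-up vocabulary.
            return "<" + uri + "> a <http://example.org/ns#RegexRule> ;\n"
                 + "  <http://example.org/ns#order> " + order + " ;\n"
                 + "  <http://example.org/ns#inputMatch> \"" + inputMatch.pattern() + "\" ;\n"
                 + "  <http://example.org/ns#inputReplace> \"" + inputReplace + "\" .";
        }
    }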
Regular Expression rules contain equivalent normalisation and de-normalisation sub-rules. The de-normalisation part applies only to template variables in queries, while the normalisation stage applies to entire result sets, although it is not applied to static RDF insertions, as these provide the essential references between normalised and non-normalised data. The same template variables that are applied to query templates are also available in static RDF insertions. This is important because normalisation is, in most cases, a syntactic step, and although it is useful, it is also useful to highlight the diversity of URIs, where scientists may only have been aware of the normalised version previously and may not have correlated other URIs with their local version. The static RDF templates that are attached to query types are not normalised along with the results, although they will be normalised after the results are merged into the overall pool, as there is no way of knowing where individual RDF triples came from at that point.
For example, the variables used in a SPARQL query may be modified using a regular expression prior to being substituted into the templates inside the query, before the query is then processed and modified using SPARQL Basic Graph Pattern manipulation, although there is currently no standard rule syntax for this manipulation. Although the parsed template query normalisation stage is necessary in the model for completeness, it was not viewed as essential to the prototype in terms of data quality, context, or data trust. In each case, namespaces or profiles could be used to eliminate query types where the template would not be useful, rather than using normalisation rules to fix the query at this stage.
The SPARQL rules operate in one of two modes, either deleting or keeping matches. The major reason for this restriction is the lack of software support for executing the yet to be standardised SPARQL Update Language (SPARUL) DELETE queries (http://www.w3.org/TR/2010/WD-sparql11-update-20100126/), which would give scientists more flexibility in the way deletions and retentions operate. Each of the different stages of a query is applicable to different types of rules. For instance, SPARQL transformations are not applicable to most queries, as they require parsed RDF triples to operate on. In comparison, Regular Expressions are not relevant to parsed RDF triples, so they are only relevant to queries, the unparsed results of queries, and the final results document.
The RDF results from a query may be modified as textual documents before being
imported to an internal RDF document, and they may be modified again after all of
the results are pooled in a single RDF pool. In combination with the order attribute
and the SPARQL Insert or Delete modes, normalisation rules can be implemented to
perform any set of data transformations on a scientist’s data.
In the absence of a standardised SPARQL Update Language (SPARUL), and given the lack of an implementation of the draft in the OpenRDF Sesame software (http://www.openrdf.org/) used by the prototype, normalisation rules in the prototype are limited to regular expressions and matching SPARQL CONSTRUCT or RDF triple results. When SPARUL and other SPARQL 1.1 changes are standardised, including textual manipulation and URI creation, the prototype could easily be extended to support the use of these features for arbitrary normalisation.

5.7.1 Integration with other Linked Data

It is a challenge to integrate and use Linked Data from various sources. Apart from
operational issues such as sites not responding or empty results due to silent SPARQL
query failures, there are issues that scientists can consistently handle using the proto-
type, including URI conventions and known mistakes from individual datasets in the
context of particular query types.


The prototype can be configured to use different URI conventions, including those matching the Bio2RDF normalised URI specification [24], and other proposals such as the Shared Names proposal, although Shared Names is not yet finalised.
The prototype implements HTTP URI patterns that are not handled or seen by the core model implementation. For example, the choice of file format for the results is made by prefixing the query with "/fileformat/". In comparison, other systems append the file format to the URI string, but this prevents their models from handling cases where the actual query ends in the extension, for example, ".rdf", as there may be no way to distinguish the suffix from an actual query for the literal string ".rdf". The model is designed to provide access to data using an arbitrary range of queries, whereas the majority of Linked Data sites solely provide access to single records using a single URI structure. If all of the identifiers in a dataset are numeric, for example, a real query would never end in ".rdf", so there is no confusion about the meaning of the query.
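A minimal sketch of this prefix-based approach is shown below; the recognised format tokens, the class name, and the default format are assumptions rather than the prototype's actual parsing code.

    import java.util.Arrays;
    import java.util.List;

    // Hypothetical sketch: strip a leading file format token from the request path so
    // that queries ending in ".rdf" or similar are never misinterpreted as a format suffix.
    final class FormatPrefixParser {

        private static final List<String> KNOWN_FORMATS =
                Arrays.asList("rdfxml", "ntriples", "turtle", "html", "json");

        /** Returns a two element array: the chosen format and the remaining query string. */
        static String[] parse(String path) {
            String trimmed = path.startsWith("/") ? path.substring(1) : path;
            int slash = trimmed.indexOf('/');
            if (slash > 0 && KNOWN_FORMATS.contains(trimmed.substring(0, slash))) {
                return new String[] { trimmed.substring(0, slash), trimmed.substring(slash + 1) };
            }
            // No recognised prefix: fall back to a default format and treat the whole
            // path, including any ".rdf"-like ending, as the query itself.
            return new String[] { "rdfxml", trimmed };
        }

        public static void main(String[] args) {
            System.out.println(Arrays.toString(parse("/html/geneid:4129")));
            System.out.println(Arrays.toString(parse("/searchns:term.rdf")));
        }
    }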
In some cases it may be necessary to integrate normalisation rules created using different normalised URI schemes. The ordered normalisation rule implementation allows for this easily, by creating a new normalisation rule with an order above any of the orders that the included normalisation rules use. This normalisation rule would then consistently convert the results of all of the imported normalisation rules into the locally accepted normalisation standard.
It would be simpler for single configuration maintainers if every dataset were assigned a single set of normalisation rules to correct queries. However, this would prevent scientists from integrating their provenance and configurations with those of other scientists without manually combining the rules that were relevant to each dataset. The prototype enables scientists to use other prototypes' configuration documents and provenance documents as configuration sources.
This process is automatic as long as there are no URI collisions, where inconsistent definitions for query types or other items are provided by different sources. However, this should remain a social issue, as scientists should use URIs that are under their control if they expect their work to be consistent with that of other users. In cases where scientists wish to modify definitions for any model items, they should change the subject URI for the triples, and change any dependent items by changing the reference along with the subject URI for the dependent item. The original URIs then need to be added to the scientist's profile to ignore them and use the new items.
The migration method using new URIs and profiles enables scientists to attach their
own data quality and trust rules to any providers. However, if they only wish to change
data in the context of a particular query type on a particular provider, they would need
to create at least one new provider to include the new normalisation rule. If the query
type was still used, they could not ignore it completely without redefining all of its uses.
In this case, the original provider would need to be ignored using the profile and a copy
made that did not contain the query type.
The model allows different sets of normalisation rules to be used with the same
query type on the same namespace on a single dataset. This situation requires two
providers with the same query types and namespaces assigned to them, with only the
normalisation rules differing. The Bio2RDF prototype used this pattern in cases where
datasets contained inconsistent triples and it was not known which normalisation rules
would be necessary for a particular query on a particular namespace beforehand.

5.7.2 Normalisation rule testing

The prototype included tests that specify example inputs and expected outputs for normalisation rules. The tests include the requirement that normalisation rules with both input and output (i.e., de-normalisation and normalisation) components must be able to transform the results of the input (de-normalisation) rule back into a normalised version using the output rule. However, if the normalisation rule does not contain an input rule, or does not contain an output rule, the tests are only performed on the existing half of the rule.
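The round-trip requirement could be expressed as a test along the following lines; this is a JUnit-style sketch that reuses the hypothetical RegexNormalisationRule class from the earlier sketch, with illustrative example.org URIs.

    import static org.junit.Assert.assertEquals;

    import org.junit.Test;

    // Hypothetical round-trip test: the output (normalisation) sub-rule must undo the
    // effect of the input (de-normalisation) sub-rule on the example value.
    public class RegexNormalisationRuleTest {

        @Test
        public void denormaliseThenNormaliseReturnsOriginal() {
            RegexNormalisationRule rule = new RegexNormalisationRule(
                    "http://example.org/rules/geneid", 100,
                    "http://example\\.org/geneid:(\\d+)",          // input match
                    "http://endpoint.example.org/gene/$1",         // input replace
                    "http://endpoint\\.example\\.org/gene/(\\d+)", // output match
                    "http://example.org/geneid:$1");               // output replace

            String normalised = "http://example.org/geneid:4129";
            String endpointForm = rule.denormalise(normalised);

            assertEquals("http://endpoint.example.org/gene/4129", endpointForm);
            assertEquals(normalised, rule.normalise(endpointForm));
        }
    }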
Rule testing makes it possible for scientists both to validate the way a rule works on their examples, and to demonstrate to other users what the rule is intended to do if that is not clear from the description and the rule itself. Independent validation is important for convincing other scientists that rules are useful for data cleaning, as there is no single authority to curate rules. Although the model is completely decentralised by design, to allow for context sensitive changes, authorities could produce and curate rules, and scientists could use all or some of the rules based on their profiles.

5.8 Profiles

In the prototype, configuration information is serialisable to RDF, which can then be included with configuration information from other sites to form a large pool of possible datasets and the associated infrastructure. This pool of information can be organised using profiles so that scientists can exclude items that are not useful to them, and optionally replace the functionality by including an alternative, perhaps a trusted dataset that has a known provenance and is hosted locally. The prototype was designed as a layered system, so that it can be stated explicitly which profiles are more relevant than others.
The configuration files specify a global ordering number for each profile, and higher numbers take preference over lower numbers, so scientists can generally find a simple way to override all other profiles with their own. The profiles that will be used in a particular situation can be specified by the operator of the prototype system without reference to the order, which is obtained from each of the profiles. The global ordering makes sense because some profiles will contain instructions to always include anything that has not been restricted so far, and these profiles should be consistently visible so that other users generate the same results when using the profile in their set of profiles.
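As a sketch of how the layered, ordered profiles could be consulted, assuming as a simplification that profiles with higher order values are checked first and that an undecided profile defers to the next one, consider the following; the enum, interface, and method names are illustrative, not taken from the prototype.

    import java.util.Comparator;
    import java.util.List;

    // Hypothetical sketch of layered profile resolution for a provider, query type
    // or normalisation rule identified by its URI.
    final class ProfileResolver {

        enum Verdict { INCLUDE, EXCLUDE, UNDECIDED }

        interface Profile {
            int order();
            Verdict decide(String itemUri);   // yes / no / maybe for the given item
        }

        /**
         * Consult profiles from the highest order value down. The first definite
         * verdict wins; if every profile is undecided, fall back to a default.
         */
        static boolean isIncluded(List<Profile> profiles, String itemUri,
                                  boolean includeByDefault) {
            return profiles.stream()
                    .sorted(Comparator.comparingInt(Profile::order).reversed())
                    .map(p -> p.decide(itemUri))
                    .filter(v -> v != Verdict.UNDECIDED)
                    .findFirst()
                    .map(v -> v == Verdict.INCLUDE)
                    .orElse(includeByDefault);
        }
    }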

The overall implementation allows scientists to source statements from many different locations. Each of these locations should be trusted if the system is to execute safely. Untrusted configurations may include queries that are so complex that they time out, or that accidentally cause denial of service attacks on data providers. Ideally, scientists should review queries before they are executed, although they may not be able to recognise where issues will occur without experimenting with the different ways the profiles work. An example of a situation where the profiles may be accidentally misused is where there are multiple mirrors for a website, such as Bio2RDF, and the mirrors could hypothetically rely on each other for resolution of some queries. This could easily lead to circular dependencies, which the system could not detect without analysing the query plan for each query before executing it. The profiles are designed so that these potential dependencies can be removed by scientists.
The ordered profile system is useful in a range of scenarios. Its simplicity makes it valuable in many situations, as the ordering of the profiles relies only on knowledge about the desired effects and the effects of each of the profiles individually. This makes it possible for scientists who know the order in which the profiles are given to consistently determine which query types, providers, and normalisation rules will be used, without having to ask other users how their prototype was set up internally. Although RDF provides a native facility for ordered lists, the implementation method makes it impossible to extend a list dynamically, as the last element of the list explicitly references a terminating value (rdf:nil). It is important that future scientists be able to easily extend a list without having to modify the document containing the original definition. They can do this using numerically ordered profiles and normalisation rules. In addition, explicit ordering provides a degree of control over which processes can be performed in parallel, as items with the same order should not interfere with each other.
Chapter 6

Discussion

6.1 Overview
This chapter presents a discussion of the issues which were relevant to the design of the
data access model described in Chapter 3; the use of the model by scientists as described
in Chapter 4; and the implementation of the prototype web application described in
Chapter 5. The model is designed to provide query access to multiple linked scientific
datasets that are not necessarily cleanly or accurately linked. The model provides
a set of concepts that enable scientists to normalise data quality, enforce data trust
preferences, and provide a simple way of replicating their queries using provenance
information. The prototype is a web application that implements the model to provide
a way for scientists to get simple access to queries across linked datasets. This discussion
includes an analysis of the advantages and disadvantages of the model and prototype
with respect to the current problems that scientists face when querying these types of
datasets.
Simple methods of querying across linked scientific datasets are important for sci-
entists, particularly as the number of published datasets grows due to the use of the
internet to efficiently distribute large amounts of data to different physical locations [67].
The model was evaluated based on the degree to which it supports the needs of scientists to access multiple scientific datasets as part of their research. The model provided useful features that were not possible in other similar data access models, including abstract mapping from queries to query types and normalisation rules, using namespaces where necessary. This enables the model to substitute different, semantically equivalent queries in different contexts without requiring changes to the scientist's high level processing methods.
Scientists can use the model to transform data based on their knowledge of data quality issues across the range of datasets they require for their research, without requiring each query to determine which data quality issues are relevant to its operation. This makes it possible to perform multiple semantically unique queries on a provider, with each query requiring different data quality modifications. The model allows scientists to trust that the results of their query were derived from a trusted location using an appropriate query and normalisation standards. Scientists can evaluate the process level provenance for their queries, including the exact queries that were performed on each data provider, and they are able to interpret this information in a simple way due to the object and reference based design of the model.
The prototype implementation was created to examine the practical benefits of the model in current scientific research. As part of this, case studies were examined, including a case integrating science and medicine in Section 4.4, and an evaluation of the usefulness of the prototype together with workflow management systems in Section 4.7. These case studies, together with experience using and configuring the prototype, revealed a number of advantages and disadvantages relating to the prototype and the initial model design.
The prototype enables scientists to easily construct alternative views of scientific
data, and it enables them to publish these views using HTTP URIs that other scien-
tists can use if they have access to the server the prototype is running on. The prototype
was successful in acting as a central point to access numerous datasets, using typical
Linked Data URIs for record access and atypical Linked Data URIs as ways to perform
efficient queries on datasets. The prototype was found to be a useful way of distribut-
ing configuration information, in both bulk form and in a condensed form as part of
provenance records. This information was then able to be customised using a list of
acceptable profiles that were statically configured for each instance of the prototype,
making it possible for users to reliably trust information, as long as they already trusted
the sites they obtained the overall configurations from. The prototype did not attempt
to implement an access control system, so scientists would need to physically restrict
access, or implement a proxy system that provided access control to each instance of
the prototype.
In terms of scientists using the system, it is important that data is able to be re-
liably transferred between locations on demand. This process may be disrupted by
issues such as network instability, query latency, and disruption to the data providers
due to maintenance or equipment failure, while there are inherent limitations includ-
ing bandwidth and download limits that affect both clients and data providers. The
model was designed to provide abstractions for both data providers and query types, giving redundancy across both queries and data locations. The prototype
was implemented with a customisable algorithm to decide when a data provider had
failed enough times to warrant ignoring it and focusing on other redundant methods of
resolving the query. The redundancy in the model, together with users specifying profile
preferences, made it possible to reduce latency on queries by choosing physically close
data providers and optimising the amount of information that needed to be transferred.

6.2 Model
The model is evaluated based on the degree to which it addresses the research questions that were outlined in Chapter 1, specifically context sensitivity, data quality, data trust, and
provenance. The evaluation focuses on the advantages of the model design decisions
in reference to the examples given in Chapter 1, with specific reference to the way the
model can be used to replicate queries in different contexts.

6.2.1 Context sensitivity

Scientists work in many different contexts during their careers. These contexts include different universities and research groups at a high level, and many different data processing methods on different datasets. Some of these datasets may be sensitive, and the experiments may need to be withheld from public view, at least temporarily. Scientists use linked datasets in many different facets of their research, making it necessary to have a simple method of adjusting the way data access occurs to match their current context.
Scientists require that their experiments are able to be replicated in the future, with
the expectation that the results will not be materially different from the original results.
The model includes the ability to easily switch or integrate new data providers or queries,
along with the relevant normalisation rules. This means that even if the original data
provider is unavailable, the overall experiment could still be replicated if the scientist
can find an alternative method of retrieving the results. This is novel, compared to
alternatives such as workflows, because it recognises the difference between the data
location, queries on that data, and any changes that are required. Scientists can use this
flexibility in the model to substitute components individually, according to their context,
without requiring changes to the high level processing methods. The recognition of the
context that data is accessed in, without requiring the use of a particular method, means
that scientists can share their experimental workflows, and other scientists can replicate
the results using their own infrastructure if necessary, with changes to the configuration
of their prototype rather than workflow changes.
Scientific datasets may develop independently, making it difficult to integrate datasets,
even if they are based on the same domain. For example, a scientist may wish to inte-
grate two drug related datasets, DailyMed and DrugBank, but they may be unable to
do so automatically because they do not contain the same data fields, although there
are links between records in the two datasets. If a scientist requires that the data from
another set of results to be represented differently to fit into the way their processes
expect, the model allows the scientist to modify the query provenance without affecting
the way the original set of results was derived. The scientist can use normalisation rules
to avoid having to contact the original data provider to modify their results. They would
do this using a data provider, that accesses the original set of results, and modifies the
data using rules, without performing any further queries. In the DailyMed/DrugBank
example, the scientist would find the common data fields based on their opinion, and
normalise data from one or both datasets using normalisation rules.
There are a number of common scenarios that the model is useful for in terms of
transitioning between different contexts and transitioning between different datasets, as
shown in Table 6.1. The table shows the changes necessary to keep the status quo, so
scientists using the model for data access do not need to modify their processing systems
where the model can be modified instead. In each case, a scientist would have to ensure
that their profiles included the new providers, query types and normalisation rules, and
excluded outdated providers, query types and normalisation rules. An important thing
to note with respect to these scenarios is that it may never be necessary to produce a
new query type in reaction to a change in a current dataset, as query types are designed
to be generic, where possible, to keep a stable interface between scientists and a large
number of constantly updated datasets.

Scenario                    Definite changes                  Possible changes

Data provider shuts down    New provider for replacement      New query types and
                                                              normalisation rules
Data structures change      New normalisation rules and       New query types
                            add rules to providers
New scientific theory       New normalisation rules and       New providers and
                            add rules to providers            query types
New dataset                 New providers                     New normalisation rules
                                                              and query types
Data identifiers change     New normalisation rules           New providers

Table 6.1: Context change scenarios

The ability to group query types and providers based on a contextual semantic understanding of functionality makes it possible for a scientist to easily find alternatives if their context makes it difficult to use a particular query type or provider. For example, a scientist may decide that the Bio2RDF published version of the UniProt dataset is easier to use than the official UniProt dataset, and they could link together the query types, providers and normalisation rules relevant to Bio2RDF and publish them together using a profile. Although this ability is useful, the model does not require scientists to specify why they did not find a particular query type or provider semantically useful. In each case, the scientist would make a decision about whether they want to use each item. Possible implementations of the model could include a range of inclusion metrics, such as those found in Mylonas et al. [101]. In contrast to Mylonas et al. [101] and
other similar models, this model does not require that scientists and/or data providers
adhere to a single global ontology. It does not even require them to specifically choose
a single ontology for all of their research, as they would need to do in these systems
just to access data from a range of distributed datasets. If different scientists interpret
the same query in different ways, the provenance records will highlight the differences
to other scientists, and other scientists may choose providers based on these records for
some of their own experiments.
The model is designed based on the assumption that the data within a namespace
that can be found using a particular identifier will be consistent over time, although
this may not be the case in many datasets. Although the model makes it possible
for scientists to modify and change queries and data providers transparently to over-
come changes, it does not make it simple for scientists to determine which queries and
providers are relevant at the current time, compared to in the past. It is assumed that
scientists will keep track of the provenance of previous experiments if they want to
be guaranteed that the model will use the same sources in resolving their query, and
scientists need to manually verify that the dataset was not modified.
The issue of temporal reasoning, where scientists simultaneously specify that dif-
ferent data providers are relevant at different points in time, could be included in the
profiles system through the use of time constrained profiles, but it would also need to
be included in the data providers if they include multiple versions of items. If datasets
change identifiers for a data record, or delete data describing a record, and the change could not be reverted using normalisation rules, scientists would need to modify their processes to replicate their past experiments, something which is not ideal, but necessary in some circumstances.
For example, if a scientist discovered that a gene was not real, or that it was a duplicate of another gene, a genome dataset may choose to remove all of the information from the record describing the gene and replace references to the gene with references to the other gene. This would disrupt any current experiments that relied on the modified or deleted gene record containing meaningful information, and the scientists' workflows would need to be modified to allow for genes that have been replaced by references to new genes. Although this does not reflect a change in the real world, the fact that scientists changed their opinion is important to the way the system functions. Scientists use labels for their concepts, and a label needs to be used consistently by a scientist in order for their work to be as successful as possible.
In comparison to other models, this means that scientists are able to have inde-
pendent opinions about the meaning of different pieces of scientific data. This is never
possible if scientists are required to adhere to a single global ontology, as there is no
way to recognise additions or deletions in experiments which are designed to produce novel results. There may be benefits in well developed disciplines for scientists to
adopt a single set of properties, even if the objects or classes were still variable. For
example, in the field of medicine, health care providers may adopt a single standard
to ensure that there is no confusion between documents produced by, and distributed
between, different doctors. However, doctors participating in medical trials may need
to forego the benefits of this standardised data interpretation to test new theories using
unconventional properties.

6.2.2 Data quality

Scientists need to reformat data to eliminate syntactic data quality mistakes,
such as malformed properties or schemas, before queries across datasets will be consis-
tent. This is a major issue for scientists, as these issues must be fixed before they can
use higher level, semantic data quality procedures to verify that the information content
is consistent. The model is designed to be useful before data quality issues have been
resolved, along with being useful when the syntactic and semantic issues are resolved
using permanent dataset changes or normalisation rules.
Many syntactic data quality issues require human interaction to both verify that
there is an issue, and to decide what the easiest way to fix the issue is. Although
this process is best done on a copy of the dataset that the scientist can modify, any
modifications that the scientist can perform as part of a query can be accommodated
using the model with normalisation rules on queries and results. The model makes it
possible for scientists to then publish the rules along with their queries, so that other
scientists can verify the results independently, avoiding the normalisation rules that did
not suit them using profiles as necessary, without having to redesign the query. If a
scientist does need to republish the entire dataset locally, a custom profile can be used
to isolate provider references to the original dataset and replace them with references
to the cleaned dataset.
It is difficult to use automated mechanisms to diagnose semantic issues on distributed
linked datasets, as the process may require information from multiple geographically
different locations, which are not efficient to retrieve if the relevant parts of the datasets
are highly interlinked. The model succeeds in cases where the necessary information can
be retrieved to enable the model to semantically clean data locally, or where queries can
be constructed in ways that allow scientists to clean the data in the query. In general,
the model relies on being able to construct queries that match the semantic meaning
contained in the data provider, with the normalisation occurring on the results. This
is an issue for any distributed query methodology, as it is based around a fundamental
distinction between locally available data that can be processed using efficient methods,
including in memory and on disk, and distributed data that needs to be physically
transported between locations before efficient localised methods can be used to verify
that the data is semantically valid. If scientists require semantically valid trusted data
that is not efficient to obtain in a distributed scenario, they can set up a local data provider and co-locate the datasets to provide semantic data quality using efficient local processes.
In comparison to federated SPARQL systems, the model is able to be used to di-
agnose and fix semantic issues, without requiring data to be provided using SPARQL
in every location. The use of provider specific syntax rules to fix syntactic issues after
a query is executed provides a more stable basis for semantic reasoning, as providers
are free to change and migrate to new standards while queries that were applicable
to past versions of the dataset may still be useful after syntactic normalisation. The
model can be used to impose current standards on datasets regardless of the decision or
timing from dataset publishers. For example, the query model was used throughout this
project to impose the current Bio2RDF standards on each of the datasets, even though
the Bio2RDF project itself did not always have the resources to physically change the
datasets each time the best practices in the community changed.
The model may be used to easily link together data of differing qualities, potentially
lowering the quality of the more useful data. This factor is inevitable in a model that
allows users to merge information from different sources. Although it may be useful to
enforce a single schema on the results, this would only enforce syntactic data quality,
as shown in Figure 6.1. In the example, the syntax normalisation does not result in
more useful data, and may even make it more difficult for scientists to process the data
if they assume that a disease reference will be a link, rather than just a textual label.
The use of a single data format, without relying on a single schema, makes it possible
for scientists to include their private annotations, which may not have been possible in
other scenarios. If the user then republished their annotations along with the original
document, the quality of the combined document may be lower than the original data.
This is a feature of the model, but the resulting issues must be dealt with at a higher
level, through the creation of methods that can be used to segregate the data based on its origin and the desired query, such as RDF quads.
Many systems assume that the data quality issue can be solved in a context-
independent way, relying on a single domain ontology to normalise all documents from
every source to a common representation. However, the fact that ontologies are logically relevant to everything does not imply that they are syntactically relevant to everything. A number of limited scope systems, such as the BIRN network and the
SADI network, have succeeded in integrating data from a range of biological sources, but
they could not easily integrate annotations, or internal scientific studies that violated
one of the core ontological principles of the system. The model is designed to provide
a neutral platform on which scientists can pick and choose their desired sources using
providers, and their desired ontologies, based on the results of normalisation rules.
The cost of this neutral platform is that queries must be specified in terms of the
goals, and not the query syntax, so the system cannot interpret complex queries into sets
of sub queries without users having first defined templates for the sub queries with links
from the appropriate providers to the templates. However, the neutral platform, partic-
ularly including the representation of normalisation rules in textual formats, provides
simple ways for any scientist to choose their personally relevant set of queries, providers
and rules. They can group these sets using profiles, which other users can include or omit based on the identifier that was given to the profile.
Workflow management systems are designed to allow users to filter and clean data at
every stage. The filtering and cleaning stages however are typically directly embedded
in the workflow, making it difficult to distinguish between data processing steps and
cleaning steps. This is important if the cleaning steps are rendered unnecessary in a
future context, but the data processing steps are still necessary to execute the work-
flow. In addition, the locations of the datasets are embedded in workflows, so if users
independently clean and verify a local copy of a dataset, the workflow cannot easily be
modified to both ignore the cleaning steps and to fetch data from the local copy. The
model is designed with this goal in mind, so it natively supports omission and addition of both cleaning steps (normalisation rules) and dataset locations (providers).
In the workflow integration described in Section 4.6, the prototype was used to produce workflows that focused on the semantic filtering process. The query provenance for each query was fetched when necessary by prefixing the query URI with “/queryplan/”.
Figure 6.1: Enforcing syntactic data quality. (A "Find references" query for the disease diseasome_diseases:1689 returns a DrugBank record for the drug Marplan (Isocarboxazid) that refers to the disease through the reference Diseasome/1689, and a DailyMed record for the generic drug Isocarboxazid that is linked to the disease only through the textual label "Brunner syndrome"; after syntax normalisation, one result contains the disease reference diseasome_diseases:1689 while the other still contains only the label "Brunner syndrome".)

6.2.3 Data trust

The model is useful for restricting queries to trusted data sources. It
allows scientists to restrict their experiments to a set of data providers, along with the
query types and normalisation rules that they find useful. The trust mechanism is novel
in the context of restricting queries on linked datasets based on scientific trust. This
enables scientists to predefine trusted sources, and then perform queries that are known
to be contextually trustworthy, as profiles are limited in their scope.
The trust factors include the ability to distinguish between trusted items and unreviewed, implicitly useful items, with both available depending on the overall level of trust that the scientist places in the system. It is not practical or useful for a scientist to apply trust factors to every dataset; however, at the query level, in the context of particular datasets, a scientist is able to eliminate untrusted items, allow explicitly trusted items, and make it known to other scientists that the remaining items are as yet unreviewed and, in their opinion, could be used or not used depending on the context of the research. The range of undecided datasets may include datasets that are outside the scientist's discipline expertise, datasets that are not curated to a level that the scientist finds satisfactory, or simply datasets that were not useful for the experiments currently being focused on.
The profiles are useful for temporary exploration, and are necessary if scientists are to easily apply widely used query types to particular data providers without redefining the query types or providers. This may be common in an open situation where other scientists have knowledge of additional query types or data providers that are not reviewed according to the profile, but that can be included or excluded based on the overall criteria that the other scientist applies. This aspect of data trust is reflected in the way the provenance is applied in future situations, as scientists need to decide what preference to apply to the extra data providers and query types that they know about but do not find in a provenance record. Scientific experiments may be replicable using different query types; for instance, the original scientist may have used SQL queries rather than the SPARQL queries that are currently available, although both queries have the same semantic meaning and can therefore be substituted.
If scientists were interested in determining trust in a social context, they could
digitally sign and share their profiles, along with resolvable Linked Data URIs to indicate
where to find the relevant data providers, query types and normalisation rules. The
model and prototype were not evaluated at this level as the model was designed to
make it possible for scientists to define and share their personal trust preferences, rather than to decide at the community level whether something was trusted. The trust algorithm that was used for the prototype could be substituted with a community based algorithm that used numeric thresholds to represent the complex relationship between trust and community preferences. In the trust system that was implemented for the prototype, the scientist only needed to define whether they wanted to use an item, rather than whether others should want to use the item. For instance, a review site such as that discussed by Heath and Motta [66] would make the model more flexible in social terms, but it would not change the fundamental methods used by the model to choose or
ignore data providers, query types and normalisation rules based on the overall context
of the current user.
The model may be extended easily to include a system of numeric thresholds, such as those used by many distributed trust and review models. The model relies on a verdict of either yes, maybe, or no for each item on each profile, and maybe decisions are passed down through a set of layered profiles, based on the scientist's trust, until a yes or no is returned, or all of the profiles fail to match. A numeric threshold would require the scientist to arbitrarily assign an acceptable value to the maybe verdicts for each profile. The threshold could be defined as an average over a group of scientists, although there are debates about whether the wisdom of crowds [138] is more useful for trust purposes than the wisdom of known friends [131]. For example, the rating of known scientists may be more useful than the rating of a general crowd of scientists. The use of numeric thresholds would still follow the yes, maybe, and no system for each profile, in order for the system to allow for a complex definition of context sensitivity that could be directly replicated by others in the same way as the current model.
The profiles component of the model does not distinguish between model elements that are untrusted, those that are excluded because a scientist has a more efficient way to access the data provider, those where the implementation of the query type is thought to be incorrect, and those where the normalisation rule is not deemed necessary in a particular context, although it may be trusted in other contexts. The limited semantic information given by the profile is sufficient for the model to consistently interpret the scientist's intentions, but it does not make the exact intentions clear to other
scientists. An implementation of the model may provide different definitions for the
semantic reasons behind a particular inclusion and exclusion, in order for other scientists
to more accurately reason about the information, but the essential steps required to
resolve queries using the model would be unchanged as scientists would need to define
exact criteria in the implementation regarding which parameters specified inclusion and
exclusion, and what to default to if the set of parameters did not match these sets.
In the model, the statements from different data providers are merged into a pool,
before returning the complete set of results to the scientist. It is not possible to distin-
guish between statements from different locations using the current RDF specification
without using reification or RDF quads. Reification is not practical in any large scale scenario, as it turns each triple into a set of four triples. In contrast, RDF quads implementations enlarge the results by only one extra URI per triple, adding a graph URI to the current model and using that URI to denote the location of the data provider for the statement. There is, however, limited support for RDF quads in client applications, as the concept and the available formats (N-Quads, TriG, TriX) have not yet been standardised. The prototype supports data available as quads, with the graph URIs being the URIs of the providers that were used to derive each of the triples in the graph.
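A minimal sketch of the idea: each triple in the merged pool is written out with the URI of the provider it was resolved from as a fourth, graph, component, in the style of the not yet standardised N-Quads format. The URIs shown are illustrative, not real provider or record identifiers.

    // Hypothetical sketch: serialise a pooled triple together with the URI of the
    // provider it came from, as a fourth (graph) component in N-Quads style.
    final class ProviderQuadWriter {

        static String toQuad(String subject, String predicate, String object,
                             String providerUri) {
            return "<" + subject + "> <" + predicate + "> <" + object + "> <"
                    + providerUri + "> .";
        }

        public static void main(String[] args) {
            System.out.println(toQuad(
                    "http://example.org/record:1",
                    "http://example.org/ns#seeAlso",
                    "http://example.org/record:2",
                    "http://example.org/provider/localMirror"));
        }
    }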

6.2.4 Provenance

The model is designed to consistently distribute semantically similar queries across multiple datasets in a transparent manner, so the initial form, location and quality of
the data does not need to be encoded into data processing scripts and tools. In doing
so, it promotes the reuse and replication of the results in the future by being able to
publish the query as part of a process provenance record that contains information
about the context of the query and what information is needed to replicate the query.
The model can be used to track the provenance of the data, but that would require
extra annotations on providers. This was not feasible, given that most datasets in science
only contain simple data provenance annotations such as the dates when records were
last updated, and these dates are not useful for scientists in terms of allowing them to
exactly replicate their experiments.
In scientific workflow models that have previously explored the area of provenance,
the location of the data and the filters and normalisation rules applied to the data are not separate parts of the model. An advantage of the way the provenance information
is derived from the model is the ability to locate the original data without requiring the
use of data normalisation rules. This enables scientists to work with a data model that
other scientists may be more familiar with in some cases, compared to their preferred
normalised version. This is useful for cases where scientists require the original data
to be unchanged, or the data is accessed in a different way to match their current
collaboration. In the model, data normalisation is not a deciding factor in whether to
use a provider, unlike systems that use single ontologies to distribute data, where the
deciding factor is primarily whether the scientist’s query matches the expected data
normalisation rules.
In a similar way, the normalised data can be useful if it contains references, such as
HTTP URIs, that can be resolved by other scientists to get access to information that
is used as evidence in publications. The normalised information may be cumbersome to
reproduce without the standard model of specifying what queries were used on which
datasources and which changes were made to the information. Scientists can publish
links to the underlying query plan, and peers can access this information and process
it in a similar way to a workflow, but with the benefit of knowing where they can
substitute their own datasources and which normalisation elements were used, in order
to review the work as part of the peer review process.
The design choice of separating the providers of information from queries that sci-
entists may want to pose on arbitrary endpoints improves the way that other scientists
can understand the methodology that was used to derive the answer. This is an im-
portant step in comparison to current workflow tools which rely on scientists to guess
which parts of a workflow are related to data access, and which parts are related to data
cleaning and format transformation. It allows scientists to choose which data providers
are used for queries, and allows them to change the mix of providers without interfering
with the methods used to integrate the information with processing tools.
In terms of provenance, the model allows scientists to completely recreate their
queries, assuming that the data is still accessible at some location. It does not specify
the meaning of each query, as this is not relevant to the data access layer, and it does
not specify the provenance of individual data items inside of data providers, as that is
the responsibility of the dataset. In particular, it requires scientists to understand the
way that the query parameters are linked to the overall question before the provenance
can be interpreted, as this requires community negotiation to define how the parameters
match the scientific question.
Other provenance systems focus on understanding the way that the linked datasets
are related and what levels of trust can be placed in each data item, as opposed to
focusing on what is required to reproduce the results of each query [149]. The prove-
nance of the results does depend on this information, but it needs scientists to interpret
the information that is available from the model, including any annotations on data
providers, and any provenance information that is derived as part of the query.
The inclusion of data item provenance in the model was restricted by the requirement
that the model be simple to create and understand. If the model included annotations
about data items in the overall configuration, it would not be maintainable or dis-
tributable. It is the responsibility of the scientist to store the provenance information
describing their queries, using the information that is generated by the model. This is
consistent with the design of the model as a way for scientists to define the way their
queries are distributed across their trusted datasets.
The final complete configuration of 27,487 RDF statements for the Bio2RDF website was serialised in RDF/XML format, resulting in a file size of 3,468 kilobytes. Provenance records could be stored relatively efficiently if the configuration triples that are duplicated between provenance records are not all stored multiple times. The provenance record also stays small because elements such as query types and providers are linked using URIs, so it is easy to determine whether the information about a query type has already been included in the provenance record and avoid adding it multiple times. In addition, the RDF model treats multiple copies of a statement as a single statement, so the number of statements included in the results is optimised automatically.
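The duplicate avoidance described here can be sketched as follows: because elements are identified by URI and RDF treats repeated statements as a single statement, adding an element's triples to a set backed record is idempotent. This is a simplified sketch using serialised triple strings as set members, not the prototype's storage code.

    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;

    // Hypothetical sketch: a provenance record that stores each serialised triple once,
    // so re-adding an already referenced query type or provider is a no-op.
    final class ProvenanceRecord {

        private final Set<String> triples = new LinkedHashSet<String>();
        private final Set<String> includedElements = new LinkedHashSet<String>();

        void addElement(String elementUri, List<String> elementTriples) {
            if (includedElements.add(elementUri)) {
                triples.addAll(elementTriples);   // set semantics drop exact duplicates
            }
        }

        int size() {
            return triples.size();
        }
    }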
In comparison with workflow provenance models, which attempt to exactly replicate the data processing actions using the same data access interface in the future, the model allows scientists to merge and change the data processing actions through substitution, based on the user query parameter matching functions. The parameter matching functions allow multiple query types to be used transparently, in comparison to workflows and SPARQL based projects such as SADI, which provide replicability by relying on a single input data structure producing a single output data structure from a single data location for each part of the processing cycle. The model makes it possible to integrate heterogeneous services at the data access level without having to change the processing steps, as is necessary in current models.
A provenance model for scientific data processing needs to allow scientists in the
future to replicate and contrast results using independent data. Although RDF provides
the basis for this replication, current data query models that are based on SPARQL do
not support independent replication without human intervention. The extension of these
data access models to remove direct references to data locations and properties used in
the inputs and results makes independent replication possible through configuration changes. The emphasis in the provenance literature on replication in the future ignores the evidence that data providers will eventually cease to exist or become inaccessible.
There are tools such as WSsDAT [89] which are designed based on evidence that communications between data providers are variable at any given time, and may not last beyond the lifetime of the scientific grant related to the study. Although SPARQL based query models could theoretically generate the same results from different locations, in practice they rely on either hard coding this information into the data provider's service description (see the SPARQL 1.1 Service Description draft specification, http://www.w3.org/TR/sparql11-service-description/), or hard coding this information into the middleware or client's data processing code, as in SADI. Neither of these solutions is as flexible as a model that separates the translation and data normalisation steps from the definitions of queries. The model proposed in this thesis allows scientists to replicate their data processing in the future without having to rely on particular data providers, and without facing the daunting task of changing their processing code to support alternative data providers; replication using independent data sources can be achieved by declaratively exchanging a past provider for a current one with a change to the profile in use.

6.3 Prototype

The prototype was implemented as a proof of concept for the model that can be used to validate the advantages and disadvantages of the model, as described in Section 6.2. The evaluation of the prototype is based on its usefulness both as a privately deployed software package and as the engine behind the Bio2RDF website. Some features of the model, such as context sensitivity and data trust, apply mostly to private deployments of the prototype, due to the emphasis on scientists forming opinions about datasets, while the other features apply equally to both public and private deployments, as they relate to the broader scenario of scientists easily getting access to data while still understanding the provenance of the process involved, and any data quality changes that were applied in the process.

6.3.1 Context sensitivity

The use of RDF as an internal model makes it possible for scientists to translate results
from any query into other formats as necessary. The output format is able to identify
links explicitly, so any reference fields in other formats can be mapped from the RDF
document, and other fields can be mapped as necessary from the structures in the
document.
The prototype provides access to the paging facility that is included in the model and a subset of the queries, so that scientists can choose to fetch only particular parts of the results set at a time, although they can get access to all of the information by resolving the document for each page offset until they recognise that no new information is being returned. A scientist can use these functions to direct the model to return a maximum number of results from each provider. There is no recognition of relevance across the results returned from different providers, although it could be provided using the deletion rules to select the best results as necessary. This enables the prototype to support both gradual and complete resolution of queries, through the use of a number of data providers to represent the different stages of the query in cases where the data provider allows paging to occur.
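A sketch of this paging behaviour is shown below, assuming a hypothetical fetchPage function that resolves the document for a given page offset and returns its statements as strings; the loop stops once a page contributes no new information.

    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.function.IntFunction;

    // Hypothetical sketch of client-side paging: keep resolving page offsets until a
    // page contributes no statements that have not already been seen.
    final class PagedResolver {

        static Set<String> fetchAll(IntFunction<List<String>> fetchPage) {
            Set<String> seen = new LinkedHashSet<>();
            for (int offset = 1; ; offset++) {
                List<String> page = fetchPage.apply(offset);
                if (page.isEmpty() || !seen.addAll(page)) {
                    break;   // nothing new was returned for this offset
                }
            }
            return seen;
        }
    }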
The prototype allows scientists to host the application in arbitrary locations, as
long as scientists put in normalisation rules where necessary to change the structure of
the URIs that are given in the results of each query. In the case of the HTML results
pages, the location is automatically changed to match the way the scientist accessed the
prototype, but this is not the case currently with the RDF representations, where it is
viewed as more important to have standardised URIs in order for the scientist to easily
identify the concept they were querying for without reference to the location of the
prototype. It is particularly relevant that scientists can replicate a public Linked Data
website using the prototype to avoid overburdening the public website with automated
queries that can be performed by local software directly accessing local data.
The prototype allows scientists to extend public resources with any of their own re-
sources, and keep their resources private as necessary. This enables scientists to extend
what are traditionally very static Linked Data sources, without revealing their infor-
mation, or hosting the entire dataset locally. This context sensitivity is an important
feature on its own, but it also makes it possible to solve the data quality and data
trust issues in local situations, and control the data provenance, rather than just obtain
the information about what an independent data provider thinks about a dataset. In this way it is a generic version of DAS, in which the data must be genome or protein records and other annotations are not accessible.
The prototype software allows scientists to individualise the configuration and be-
haviour of the software according to their contextual needs. It includes the ability to
resolve public Linked Data URIs using the prototype, while supporting a simple method
for extending the public representation of the record with RDF information directing
scientists to other queries that are related to the item. However, this behaviour is not
common, and scientists may not fully understand the difference between their local
query and the public version, including how to give a reference to the local version of
the data in their research, especially if the RDF retrieved by the general public using
the published URI never includes their information.
The prototype requires that scientists utilise a local prototype installation to resolve
Chapter 6. Discussion 143

documents according to their trust settings. This either requires scientists to change
the URIs in the results to match the DNS name for their local resolver, or it requires
them to publish the documents using the local prototype URIs. This is not a design
deficiency of the prototype, as it is necessary to fully support local installations of the
prototype software. It is, however, an issue that needs to be resolved using a social
strategy to support the data quality and data trust features, as local results can be intentionally different from the results of the same query executed on a public service.

Although it is simpler for scientists to use commonly recognised Linked Data URIs
for a particular item, it requires other scientists to know how to change the URIs in any
of their documents to work with the scientist’s prototype resolver. If they do not know to
change the URIs, then resolving the Linked Data URI will only get them the statements
that were available at the authority. The prototype is designed to allow scientists to
install their own prototype and utilise the configurations from other scientists to derive
their own context sensitive options for resolving the URI, including their own trust
settings to determine which sources of information they trust in the context of each
query.

Some recommendations for identifying data items, and links to items in other
datasets focus on standardising a single URI for each item in a dataset, and utilis-
ing that URI wherever the data item is referred to. A standard URI is useful for integrating datasets, but if there are contextual differences in the way the item is represented, including novel statements, then it is not appropriate to use the standard URI, as its trusted global meaning is limited to the statements published by the original authority. In order for the prototype to deliver the unique set of statements that a scientist believes to be trusted, it is more appropriate that the URI be changed, although the model and the prototype support both alternatives.

The Linked Data movement does not focus on supporting complex queries using
HTTP URIs, although the principles of Linked Data apply to queries in much the same way as they apply to raw data records. In terms of evaluating the contextual abilities of this prototype, there
are no current recommendations about how to interlink data items with queries that
may be performed using the data item. For example, in the Bio2RDF configuration,
there are many different queries that may be performed depending on the context that
the scientist desires.

It is not practical to enumerate all of the query options in an RDF document that
is resolved using the data item URI, as it would expand the size of the document, and
detract from the information content that is directly relevant to the item. An example
of this is the resolution of the URI “http://bio2rdf.org/geneid:4129”. This URI can be modified to suit a number of different operations, such as locating the HTML page that authoritatively defines this item, “http://bio2rdf.org/html/geneid:4129”, locating the authoritative ASN.1 formatted document for the item, “http://bio2rdf.org/asn1/geneid:4129”, and searching for references to the item in another namespace, such as the HGNC namespace, “http://bio2rdf.org/linksns/hgnc/geneid:4129”.

If URIs that could be resolved to all of the possible queries for links in other namespaces were included, there would currently be at least 1622 extra statements corresponding to a link search for each of the namespaces in the Bio2RDF configuration. These URIs can be usefully created by scientists without impairing basic data item resolution. If a user requires these query options, extra statements could be included to identify the meaning of each URI. It is not necessarily evident to a computer that the URI “http://bio2rdf.org/linksns/hgnc/geneid:4129” will perform a query on the HGNC namespace. These extra statements would need to link the URI to the namespace, which in Bio2RDF is identified by the URI “http://bio2rdf.org/ns:hgnc”. The computer could decide whether to resolve the namespace URI based on the predicates that were used in the extra RDF statements. It is not practical for Bio2RDF to include each of the possible query URIs in the document that is resolved using “http://bio2rdf.org/geneid:4129”, although some are included, such as the location of the HTML page. In particular, a unique “linksns” URI can be created dynamically for each namespace, resulting in a large number of RDF statements that would need to be created and transported but may never be used.
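The following sketch illustrates how such operation URIs can be derived from a normalised Bio2RDF item URI by simple string manipulation; the helper names are hypothetical, while the URI patterns themselves are those discussed above.

/**
 * Illustration of how the Bio2RDF operation URIs discussed above can be
 * derived from a normalised item URI by inserting a path segment after the
 * authority. The helper names are hypothetical; only the URI patterns
 * themselves are taken from the text.
 */
public class Bio2RdfUriExamples {

    private static final String AUTHORITY = "http://bio2rdf.org/";

    /** e.g. "html" -> http://bio2rdf.org/html/geneid:4129 (assumes the item URI starts with the authority). */
    static String operationUri(String itemUri, String operation) {
        String localPart = itemUri.substring(AUTHORITY.length());
        return AUTHORITY + operation + "/" + localPart;
    }

    /** e.g. "hgnc" -> http://bio2rdf.org/linksns/hgnc/geneid:4129 */
    static String linksNamespaceUri(String itemUri, String targetNamespace) {
        return operationUri(itemUri, "linksns/" + targetNamespace);
    }

    public static void main(String[] args) {
        String item = "http://bio2rdf.org/geneid:4129";
        System.out.println(operationUri(item, "html"));      // HTML page for the item
        System.out.println(linksNamespaceUri(item, "hgnc"));  // link search in the HGNC namespace
    }
}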

6.3.2 Data quality

The use of the prototype on the Bio2RDF website provided insights into the data quality
issues that exist across the current linked scientific datasets. The issues have a range of
causes, including a lack of agreement on a standard Linked Data URI to use for each data item, a range of URI syntax differences within and across datasets, and the inability to retrieve complete result sets from some servers, which cuts out possibly important results and makes it hard to consistently apply semantic normalisation rules to query results.
There is a large variation in the way references are denoted across the range of
scientific data formats, including differing levels of support for links. These issues are
not necessarily fixed by the use of RDF as a format for the presentation of these datasets,
as the link representation method in the original data format cannot always be fixed
by generic rules. The prototype was useful for fixing a number of the different link
representations, although the lack of support for string manipulation in SPARQL made
it difficult to support complex rules that would enable the prototype to dynamically
create URIs out of properties where the currently available RDF representation did not
specify a URI (see http://www.w3.org/2009/sparql/wiki/Feature:IriBuiltIn). The prototype was used to change references and data items that did not conform to the normalised data structures that were decided on by the Bio2RDF project. These references included URIs that were created by other organisations and could simply be replaced with Bio2RDF URIs to provide scientists with a way to resolve further information within the Bio2RDF infrastructure. The model also provided the ability to define query templates, so that the normalised URI would be used in output, and the endpoint specific reference, whether it be a URI or another form, could be used in the query. It was successful in integrating the datasets provided by projects ranging from the Bio2RDF endpoints to the LODD, FlyWeb, and the datasets provided by the Pharmaceutical Biosciences group at Uppsala University in Sweden.
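A minimal sketch of the query template idea is shown below, assuming a simple ${...} placeholder syntax; the variable names and the endpoint-specific URI are illustrative, not the prototype's actual template language.

import java.util.Map;

/**
 * Sketch of the query template idea described above: the template used
 * against a specific endpoint refers to the endpoint's own identifier form,
 * while the CONSTRUCT output uses the normalised Bio2RDF URI. The template
 * variable syntax and names are assumptions for illustration only.
 */
public class QueryTemplateExample {

    static String expand(String template, Map<String, String> bindings) {
        String result = template;
        for (Map.Entry<String, String> binding : bindings.entrySet()) {
            result = result.replace("${" + binding.getKey() + "}", binding.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Endpoint-specific reference used inside the query pattern,
        // normalised URI used in the constructed output.
        String template =
                "CONSTRUCT { <${normalisedUri}> ?p ?o } "
              + "WHERE { <${endpointSpecificUri}> ?p ?o }";

        // The endpoint-specific URI below is a hypothetical example.
        String query = expand(template, Map.of(
                "normalisedUri", "http://bio2rdf.org/geneid:4129",
                "endpointSpecificUri", "http://example.org/internal/gene/4129"));

        System.out.println(query);
    }
}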
The prototype enables scientists to understand the way the information is repre-
sented in each dataset relevant to their query, as the query can be completely planned
without executing it. This makes it possible for scientists to understand each step of
their overall query, without relying on a single method to incrementally perform a large
query on any of the available datasets, depending on the nature of the query. A major data quality factor in the execution of distributed queries is the way results can be obtained from each data provider. In some cases, the entire result set can be returned without limitation, but many data providers set limits on the number of results that can be returned from each query. This restriction has a direct impact on data quality, as the scientist must know what the limits are, and repeatedly perform as many incremental queries as necessary for the query to function properly.
The prototype provides a way to incrementally page through results from queries un-
til the scientist is satisfied. Some query types will return generic information that is
not page dependent. Query types can be created to include a parameter indicating
whether paging is relevant. For other query types that are page dependent, scientists
can incrementally get results until the number of results is constant, and less than the
server is internally configured to support. This method is compared to other models in
Figure 6.2, with other systems providing a variety of methods ranging from returning
a fixed sample of the top ranked results to always returning all results.

Figure 6.2: Different paging strategies. [Figure omitted: the diagram contrasts paging of query results with two alternatives: returning all results (high bandwidth usage if there are a large number of results, and possibly hidden issues with fixed size result sets) and returning only the top ranked results (no paging, a bias towards results containing both concepts, and potentially no scientifically useful ranking method).]

Some public data providers are known to place both incremental and overall restrictions on the number of results that can be returned for a given query. This may be necessary, as the data provider may be offered on a voluntary basis for light use. There
may be overall restrictions with the size of document returned by distributed queries.
If the data provider is an HTTP server, there is the option for the server to return a
status code indicating that the server’s bandwidth is restricted for some reason, and the
document cannot be returned because of this. This restriction is in place for the public
DBpedia provider, http://dbpedia.org, for example, although larger documents may
be available if scientists enable GZIP compression on the results, something that is not
required by the HTTP or Linked Data standards.
Each query needs to be executed in a way that would minimise the chances of
the size restriction being triggered, although it is not possible using current SPARQL
standards to know how large a document may be. This may include constructing
Linked Data URI resolutions as SPARQL queries with limits on the number of results
that would be returned. The prototype avoids these issues by allowing scientists to
pose multiple queries on the data provider, allowing them to iteratively get as many
results as necessary. Scientists may find it necessary to create their own mirror of the
data so that they can execute queries without the limitations that the public provider
imposed. The prototype allows this, as the model is designed to make the data access methods transparent to the scientist's request, making it simple for scientists to override parts of the prototype configuration with their own context. For example, they could
override the “wikipedia” namespace that DBpedia provides so that queries about it were
redirected to their mirror instead of the public endpoint.
The prototype provides the ability for scientists to customise the quality of the
data resulting from their queries. It is limited by several factors, including the lack of transformation functions in SPARQL, the lack of support in RDF and SPARQL for representing or normalising scientific units, and semantic ambiguity about results from different datasets.
The use of RDF as the model that was used to represent the knowledge provides a
number of semantic advantages, as entities from different datasets can be merged, where
other similar models do not allow or encourage universal references to items outside of
a single dataset.
The prototype uses an in memory SPARQL implementation for high level data
transformations. However, the current SPARQL 1.0 recommendation does not include
support for many transformation operations. Notably, there is no way to create a
URI out of one or more RDF Literal strings. This makes it currently hard to use string
identifiers to normalise references in all documents to URIs if the original data producer
did not originally use a URI, as string based regular expressions are not suitable for
the process. This has an impact on the data quality normalisation rules that can be
performed by the prototype, as URIs can only be created by inserting the value into
a URI inside the original query, or the statically inserted RDF. This only works for
queries where the URI was known prior to the query being executed.
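One hedged workaround, sketched below, is to mint URIs from literal identifiers as a post-processing step outside of the query; the template placeholder and method names are assumptions for illustration. (The later SPARQL 1.1 recommendation adds IRI, CONCAT and ENCODE_FOR_URI functions that allow the same construction inside a query.)

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/**
 * Post-processing workaround for the missing URI-construction functions in
 * SPARQL 1.0: once a literal identifier has been returned in a result set,
 * a URI can be minted from it outside of the query. The template and method
 * names here are illustrative, not the prototype's API.
 */
public class LiteralToUriExample {

    /** Mint a normalised URI from a literal identifier returned by a query. */
    static String mintUri(String uriTemplate, String literalIdentifier) {
        // Encode the identifier so that reserved characters are safe inside a URI
        // (URLEncoder is used here as a simple stand-in for proper IRI encoding).
        String encoded = URLEncoder.encode(literalIdentifier, StandardCharsets.UTF_8);
        return uriTemplate.replace("${identifier}", encoded);
    }

    public static void main(String[] args) {
        String template = "http://bio2rdf.org/geneid:${identifier}";
        System.out.println(mintUri(template, "4129")); // http://bio2rdf.org/geneid:4129
    }
}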
In terms of scientific data quality, it is important to know the units that are attached to particular numeric values. For example, it is important to be able to work with numbers, such as an experimental result of “3.35”, while knowing the units in which the number is represented, such as “mL”. The RDF format has not been extended to include annotations for these units in a similar way to the native language annotation feature that is currently included in the RDF specification (see http://www.w3.org/DesignIssues/InterpretationProperties). This issue forms a large part of the normalisation process, as scientists expect data normalisation to result in a single unit being represented for each value. As units are not supported directly by the underlying data format, the prototype would need some other method of recognising the unit attached to a numeric value before a rule could be used to convert between units; otherwise scientists must remain aware that values may be expressed in unsuitable units.
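One possible approach, sketched below under the assumption that the unit is carried alongside the value as an explicit label, is to convert values to a target unit during post-processing; the conversion table and class names are illustrative and are not part of the prototype.

import java.util.Map;

/**
 * One possible way to normalise units during post-processing, given that RDF
 * literals carry no unit annotation: pair each value with an explicit unit
 * label and convert to a target unit before merging results. The unit table
 * and class names are illustrative assumptions, not part of the prototype.
 */
public class UnitNormalisationExample {

    // Conversion factors into litres for a few volume units.
    private static final Map<String, Double> TO_LITRES = Map.of(
            "mL", 0.001,
            "L", 1.0,
            "uL", 0.000001);

    record Measurement(double value, String unit) {}

    static Measurement toLitres(Measurement m) {
        Double factor = TO_LITRES.get(m.unit());
        if (factor == null) {
            throw new IllegalArgumentException("Unknown unit: " + m.unit());
        }
        return new Measurement(m.value() * factor, "L");
    }

    public static void main(String[] args) {
        // An experimental result of 3.35 mL, as in the example above.
        Measurement raw = new Measurement(3.35, "mL");
        Measurement normalised = toLitres(raw);
        System.out.println(normalised.value() + " " + normalised.unit()); // approximately 0.00335 L
    }
}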
The prototype implements a simple normalisation rule mechanism. It supports simple regular expression transformations, but does not utilise transformations such as RIF rules or OWL reasoning. These transformations may be required in some cases, but the
prototype demonstrates the practical value of the model in relation to linked scien-
tific datasets using simple regular expressions for the majority of normalisations, with
the ability to define SPARQL queries to perform more complex normalisation tasks.
Some non-RDF transformation approaches, such as XQuery and XSLT, require a direct
mapping between the dataset and the result formats. In these cases there is no oppor-
tunity to utilise a single intermediate model, such as RDF, that can have statements
unambiguously merged together in an unordered manner.
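A minimal sketch of such a regular expression normalisation rule is shown below; the specific pattern is a hypothetical example rather than one of the rules used by Bio2RDF.

import java.util.regex.Pattern;

/**
 * Minimal sketch of a regular expression normalisation rule of the kind
 * described above: an input pattern and a replacement template are applied
 * to the serialised results returned by a provider. The specific pattern
 * shown here is a hypothetical example, not one of the Bio2RDF rules.
 */
public class RegexNormalisationRule {

    private final Pattern inputPattern;
    private final String replacement;

    public RegexNormalisationRule(String inputRegex, String replacement) {
        this.inputPattern = Pattern.compile(inputRegex);
        this.replacement = replacement;
    }

    public String apply(String resultDocument) {
        return inputPattern.matcher(resultDocument).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // Rewrite a hypothetical third-party gene URI into the normalised Bio2RDF form.
        RegexNormalisationRule rule = new RegexNormalisationRule(
                "http://example\\.org/gene/(\\d+)",
                "http://bio2rdf.org/geneid:$1");
        String input = "<http://example.org/gene/4129> <http://purl.org/dc/terms/title> \"example label\" .";
        System.out.println(rule.apply(input));
        // <http://bio2rdf.org/geneid:4129> <http://purl.org/dc/terms/title> "example label" .
    }
}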
Many RDF datasets have avoided issues relating to the creation of URIs by exten-
sively using Blank Nodes, which are unaddressable by design. This affects data quality,
as scientists cannot reference, or normalise a reference, to a particular node, and there is
no way, at least in the current RDF 1.0 specification, to normalise Blank Nodes consis-
tently to URIs, by design. The use of Blank Nodes makes it difficult to optimise results: if Blank Nodes are used extensively, an entire RDF graph may need to be transferred between locations so that no elements are left that cannot be dereferenced in future.
For example, an optimised query using URIs as identifiers could result in one triple
out of a thousand triples in a dataset. The same query on a dataset that uses Blank
Nodes may need to transfer all one thousand triples so that the scientist could properly
interpret the triple in terms of other information. If the graph uses URIs, the scientist
can selectively determine which other triples are relevant to the results using the URIs
as independent identifiers. The prototype provides the ability to optimise queries and
return normalised URIs that can be matched against any suitable dataset to determine
where other relevant statements are located.
The use of RDF as the results format may highlight existing semantic data quality
issues. The data quality issues may occur when information is improperly linked to
other information, and the use of computer understandable RDF enables a comput-
erised system to discover that the link is inconsistent with other knowledge. In science,
however, there is a constant evolution of knowledge, so the use of RDF to automatically
discover inconsistent information relies on an up to date encyclopaedia of verified knowl-
edge. The data quality of the encyclopaedia must be assumed to be perfect before it
can be used with current, widely implemented, reasoning techniques such as OWL [36],
although there have been explorations into how to make imperfect sources of knowledge
useful for the reasoning process [42]. Data quality issues at the semantic level, such
as these, are highlighted by the use of the model with syntax normalisation rules, as
the sources of data can be integrated with ontologies that represent a high standard of
curated knowledge, although there may not be enough information available to fix the
issue without a higher level processing tool.
There may also be inherent record level issues with data when similar, but distinct,
queries are executed on different datasources, with the multiple results being merged
into the returned document. If there is likely to be ambiguity about the meaning of the
information from different datasources, then ideally the queries should not be combined, as scientists would then need to recognise that multiple datasources were being used, and accommodate this in the way they process the information, something that the
model attempts to avoid. However, in practice, the different sets of results may both be
valuable, and scientists would then need to notice the different structures used by the
different datasources, or customise their profiles to choose one query over the other, or
allow for the different structures in their processing applications.
The example shown in Figure 5.1 illustrates a data quality issue, as there are multiple RDF predicates which may be used to describe a label or title for the item, and scientists would need to recognise this. In the example, each datasource used its own predicate for the name of an item in the Gene Ontology. Some scientists may find the distinction useful, while other scientists may create normalisation rules to convert the data to a standard predicate, and include those rules in their provenance records to highlight the difference to other scientists. In comparison to other projects, this allows scientists to specify
their intent, and share this intent with other scientists, without having to embed the
normalisation and access aspects into the other parts of their processing workflows.
In many cases, the domain experts may not be experts in the methods of formally
representing their knowledge in ontologies. Data created by the domain expert may
be consistent, but it may not include many conditions that could be used to verify
its consistency. In order for other scientists to trust the data quality, they can create
new rules and apply them to the information. This would require either low level
programming code in other systems, or a change to the globally recognised ontology
framework, neither of which can be used in provenance records to distinguish between
the original statements, and the changes that were applied by others.
In order for the model to be used to verify the semantic data quality of a piece of data,
the context of the item in conjunction with other items may be required. This process
may require that a large range of data items are included in the output, so that rules
can be processed. In practice, this level of rule based reasoning would require scientists
to download entire datasets locally, and preprocess the data to verify its consistency,
before using it as an alternative to any of the publicly accessible versions of the dataset.
The use of RDF triples makes it difficult to attribute statements to particular data provenance histories, and therefore the data quality is harder to identify. However, it is
possible if the scientist is willing to use different URIs for their annotations compared
to the URI for the relevant data item. Different reasoning tools have different levels
of reliance on the URI that is used for each statement about an item. In OWL for
instance, there is no link between a single URI and a distinct “Thing”, as statements
with different URIs can be merged transparently to form a set of facts about a single
“Thing”. If users do not rely on OWL reasoning, it is possible to distinguish between
statements made by different scientists by examining the URI that is used. This is
principally because RDF allocates a particular URI to the Subject position of an RDF
triple, and users can create queries that rely on the URI being in the Subject position
if they are not using reasoning tools. Reasoning tools such as OWL may discover an
inverse relationship based on the URI of the Predicate in the RDF triple, and interpret
the RDF triple differently, with the item URI being in the Object position in the RDF
triple. This makes it difficult to conclusively use the order of the triple to define which
authority was responsible for the statement.
In some cases, the prototype may need to be configured to perform multiple queries
on a single endpoint to deal with inconsistencies between the structure of the data in
the endpoint and the normalised form. This is inefficient, but it is necessary in some
cases to properly access data where it is not known which form the data will be in for
an endpoint.
The prototype partially supports RDF Quads as there are already datasets, including
Bio2RDF datasets, which utilise RDF Quads to provide provenance information about
records. However, the lack of standardisation for any of the various RDF Quads file
serialisations, and the resulting lack of standardisation between the various toolsets,
made it necessary to restrict the prototype to RDF triples for the current time. Any
moves to require RDF Quads should require a backwards compatible serialisation to
RDF Triples, which is not yet available outside of the very verbose RDF Triple reification
model.
A summary of the data quality support of a range of scientific data access systems
is shown in Table 6.2. It describes a range of ways in which different systems modify data, including rule-based, workflow-based and custom-code data normalisation. The proto-
type uses rules that are distributed across providers, as shown in Figure 3.1 with the
context sensitive design model. The federated query model in the figure is represen-
tative of the general method used by the other rule based systems, as they rely on a
single, complete decomposition of a complex query to determine which endpoints are
relevant.
The prototype is fundamentally different from other systems in the way it separates queries from the locations where the queries are to be executed, preventing users from specifying locally useful rules as global rules for the query. When this is
used along with the contextual definition of namespaces it makes the overall query
replicable in a number of locations, where the queries in other systems are localised to
the structure and quality of the data as provided in the currently accessible location.
The configuration uses URI links between query types, providers and normalisation rules that make it possible for data quality rules to be discoverable. Communities of scientists can take advantage of this behaviour to migrate or replace definitions of rules in response to changes in either data sources or current standards for data representation. This benefit is available without having to change references either to the data normalisation rules in any queries, due to the separation of rules from query types, or to the rules in any providers where the new queries do not require new normalisation rules. This ability does not necessarily affect the provenance of queries, as the provenance recording system has full access to the exact rules that were used to normalise data when the query is executed, providing for exact replication on data sources that did not materially change.

System              | Advantages                     | Disadvantages
Prototype           | Rules, syntactic and semantic  | –
SADI                | Rule based semantic            | Custom code for syntactic
BioGUID             | Clean linked identifiers       | Custom code for each service
Taverna             | Custom workflows               | Hard to reuse and combine
Semantic Web Pipes  | Custom workflows               | Not transportable

Table 6.2: Comparison of data quality abilities in different systems

6.3.3 Data trust

The prototype demonstrated the practical considerations that were necessary to support
data trust, including the open source distribution of the program to enable scientists to
create and modify configurations without having to negotiate with the community to
get a central repository modified to suit their research. The configurations that were
relevant to different scientists were published in different web locations, and scientists
were able to configure the prototype to use their locally created configuration. These configurations were then used to set up the state of a server, so that the server knew about all of the providers, query types, and normalisation rules in the provided configu-
rations. When a scientist performed a query on the prototype, the pool of configuration
information was filtered using the profiles that the server was pre-configured to obey.
The pool of configuration information was used to allow three servers in the Bio2RDF
group to each choose the datasets and queries that they wanted to support by assigning
each server a different profile.
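The following simplified sketch illustrates the profile filtering step; the prototype models profiles and providers as RDF configuration objects, so the classes and URIs shown here are illustrative assumptions only.

import java.util.List;
import java.util.Set;

/**
 * Simplified sketch of the profile filtering step described above: the pooled
 * configuration is reduced to the providers whose profile is accepted by the
 * server. The real prototype models profiles, providers and query types as
 * RDF configuration objects; the classes here are illustrative only.
 */
public class ProfileFilterExample {

    record Provider(String uri, String profileUri) {}

    /** Keep only the providers belonging to one of the profiles this server obeys. */
    static List<Provider> filterByProfiles(List<Provider> pool, Set<String> acceptedProfiles) {
        return pool.stream()
                .filter(p -> acceptedProfiles.contains(p.profileUri()))
                .toList();
    }

    public static void main(String[] args) {
        // Hypothetical providers and profiles.
        List<Provider> pool = List.of(
                new Provider("http://example.org/provider/localMirror", "http://example.org/profile/siteA"),
                new Provider("http://example.org/provider/remoteEndpoint", "http://example.org/profile/siteB"));

        // A mirror configured to obey only its own profile sees only its own providers.
        System.out.println(filterByProfiles(pool, Set.of("http://example.org/profile/siteA")));
    }
}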
In comparison to methods that use endpoint provided metadata to describe service functionality, the prototype can use a combination of local files and publicly accessible web documents for configuration information. The VoiD specification presents a way of specifying the structure of data available in different SPARQL endpoints, but it does not include trust information [3]. VoiD documents are produced using RDF, so they can be stored locally and interpreted directly, unless they rely on the URL from which they were derived to provide context, something that is popular in some RDF publishing environments. A scientist could trust VoiD information in a similar way to the configuration information for the prototype if the RDF files were available and verified
locally. However, in contrast to VoiD, the prototype configurations can be used to trust
specific queries on specific data providers, as shown by the inclusion of a layer between
providers of data and the query in the context-sensitive model in Figure 3.1.
If a trust mechanism was included explicitly in VoiD it would not include the scien-
tist’s query as part of the context that was used to describe the trust. Trust, in VoiD,
would need to be specified using the structure of the data as the basic reference for
syntactic data trust, although semantic data trust could be recognised using a reference
back to the original data provider. In comparison to the model described here, VoiD
relies on a description of a direct link between datasets and the endpoints that can be
used to access the datasets. In VoiD there is no direct concept of endpoints, so scientists
cannot easily substitute their own endpoints, given a description of the dataset. In the
same way, they cannot also directly state their trust in a particular endpoint, whether it
is a trust based on service availability, data quality, or the other trust factors described
in Gil and Artz [52] and Rao and Osei-Bryson [113].
The mechanism that is used to describe services in the recent SADI model does not allow transparent reviews of the changes that were made to data items, as many of the rules related to data normalisation are only visible in code; the extensive use of Web Services makes it simpler to encapsulate this layer in auto-generated programming code rather than expose it as queries that can be studied by scientists. This means
that a scientist needs to trust a datasource without knowing which method is being
used to resolve their queries, something that is possible, but not as advantageous as the
alternatives where scientists see all of the relevant information and can decide about it
independent of the internal state of a server that is out of their control. In particular,
SADI requires that scientists accept all data sources as semantically equal, and code
is typically linked to a single data location making it necessary to use a single syntax
to describe each datasource in each location. The scientist must trust the results of
the overall SPARQL query that is submitted to the SADI server, as the results are
combined and filtered before the results of the query are returned to scientists. In
particular, any semantic rules that are applied to the data must exactly match or the
data will be discarded or an error returned to the user in lieu of giving them access to
whatever data is available. This makes it particularly difficult for SADI to provide for
optimised queries where there is not enough data in the results of the query to satisfy
the semantic rules for the classes of data that were returned. An entire RDF record
needs to be resolved to reliably determine whether the item described by the record fits
into a particular semantic class.
The provenance information for a SADI query, if available, would require the system
to evaluate the entire query again to accurately indicate which services were used, as
the model behind SADI requires incremental execution of the query to come to a final
solution. This may be similar to the requirements for a query executed using VoiD,
as it contains information about both SPARQL predicates, which may not be specified
in a query, along with information about how to match the starting portion of a URI
with a given dataset. In contrast to this, the prototype is designed to only execute the
query in increments, so the information necessary to perform the query in future can
be obtained without performing the query beforehand. This ability is vital if scientists
are to trust the way a particular query will execute before performing the query.
The prototype implements the data trust feature using user specified preferences to
define which profiles are trusted, and where the configuration information is located.
This information is used to provide the context for all queries on the instance of the
prototype. A scientist does not need to decide about this information for every query on
the prototype, as they should have a basic level of trust in the way the query is going to
be performed before executing the query. The scientist is able to make a final decision
about the trust that they are going to put into the results when the query is complete,
and the provenance information is available. However, in the prototype the provenance
information is derived using a different HTTP request. This means that the HTTP
requested query provenance may not exactly match the data providers and/or query
types that were used to derive the results of the previous query due to random choices
of providers and network instability causing providers to be temporarily ignored. These
issues are not present for internal Java based requests where the provenance and the
results can be returned from the same request.
The prototype does not insert information into the results about which provider and query type combinations were used to derive particular pieces of data. This
means that scientists do not have a simple way of evaluating which of the sources of
data were responsible for obtaining an incorrect result, if the identifiers for the data
items do not contain this information. This is mostly due to the restrictions placed
on the prototype by the RDF triple model with relation to annotations at the triple
level. The RDF model provides a reification function, but it would expand the size
of the results for a query by at least three hundred percent. Although an RDF quad
format could be used to make it possible to add information about provider and query
type combinations above the level of triples, with some quad syntaxes being relatively
efficient, the RDF quad formats are not widely implemented, and are not standardised,
and changing would require all scientists to understand the RDF quad syntaxes, as
there is no backwards compatibility with the triple models.
The method of configuring the prototype using multiple documents as sources of configuration information relies on the assumption that URIs for different elements, such as providers and profiles, are unique to a given document. Although this is not a realistic assumption to make in a broadly distributed scenario, the configuration sources were all trusted, and they were set up so that within a particular document it was easy to verify if the URI for an element was accidentally
duplicated. Another method needs to be used to authenticate the document to fully
trust the configuration information given by a web document.
The configuration documents could be authenticated by having scientists manually
check through the contents of the file, and then have them sign the particular file using
a digital signature. This is not viable in dynamic scenarios, as there are legitimate
reasons for making both minor and major changes to the configuration file without
the scientist having to lose trust in the configuration information. The major risk is that an item will be defined in more than one document, causing unintentional changes to a provider, query type, or other object. A future implementation could allow
users to specify a list of URIs representing model objects that could be trusted from
particular configuration documents. This strategy would ensure that scientists have the
opportunity to verify that their configuration information does not clash with any other
configuration information.
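A sketch of this possible future strategy is shown below, assuming a simple mapping from configuration document URLs to the object URIs trusted from each document; all URLs and URIs are hypothetical.

import java.util.Map;
import java.util.Set;

/**
 * Sketch of the future strategy suggested above: each configuration document
 * is paired with an explicit whitelist of object URIs that may be trusted
 * from it, so an accidental redefinition in another document is ignored.
 * The document URLs and object URIs are hypothetical.
 */
public class TrustedConfigurationExample {

    /** documentUrl -> object URIs that this installation trusts from that document */
    private final Map<String, Set<String>> trustedUris;

    public TrustedConfigurationExample(Map<String, Set<String>> trustedUris) {
        this.trustedUris = trustedUris;
    }

    public boolean accept(String documentUrl, String objectUri) {
        return trustedUris.getOrDefault(documentUrl, Set.of()).contains(objectUri);
    }

    public static void main(String[] args) {
        TrustedConfigurationExample trust = new TrustedConfigurationExample(Map.of(
                "http://example.org/config/main.n3",
                Set.of("http://example.org/provider/geneidSparql")));

        // A provider definition is only accepted from the document it is whitelisted for.
        System.out.println(trust.accept("http://example.org/config/main.n3",
                "http://example.org/provider/geneidSparql"));   // true
        System.out.println(trust.accept("http://example.org/config/other.n3",
                "http://example.org/provider/geneidSparql"));   // false
    }
}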

System              | Advantages                         | Disadvantages
Prototype           | Configured data trust              | Multi-datasource results
SADI                | –                                  | Assume all data is factual
BioGUID             | Simple record translation          | –
Taverna             | Data tracing through workflow      | Typically single locations
Semantic Web Pipes  | Data tracing for single workflows  |

Table 6.3: Comparison of data trust capabilities of various systems

6.3.4 Provenance
The prototype can provide a detailed set of configuration information about the methods
that would be used to resolve any query. This provenance record contains the necessary
information for another implementation to completely replicate the query. However,
the provenance information is not necessarily static. The model provides a unique way
of ensuring that scientists can make changes to the provenance record without deleting
the original data providers. It does this by loosely coupling query types with input
parameters, and by linking from providers to the query types they support, so query
types do not have to be updated to reflect the existence of new data providers. The
prototype provides network resiliency across both data providers, so that a single broken
data provider will not result in partial loss of the access to the data, even if different
data providers use different query methods.
The provenance record for each query includes a plan for the query types and
providers that would be used, even if the query was not executed. An implementation could generate the query plan link and insert it into the document, although many users may not require the query plan to use the document, so it was not inserted in the Bio2RDF documents by default. A complete solution would require a special application that knew how to split the URI into its namespace component along with any identifiers that were present in the item. It would use the namespace to get a list of query types and the relevant providers for the query types, using the namespace as the deciding factor. In particular, the provenance for the URI is retrieved using the implementation by modifying the URI and resolving the modified URI.
The provenance record includes information about the profiles, normalisation rules,
query types, and providers, in the context of the query. It contains a subset of the
overall configuration, which has been processed using the chosen profiles to create the
relevant queries. The prototype relies on template variables from queries, which are
replaced with values in the provenance record. These changes mean that the prove-
nance record is customised for the query that it was created for. In cases where the
URI format does not change and the namespace stays the same, the query plan would
contain most of the relevant providers and query types for a similar query. For exam-
ple, both http://bio2rdf.org/queryplan/geneid:4129 and http://bio2rdf.org/
queryplan/geneid:4128 contain the same providers and query types, as there is no
query type that distinguishes between identifiers in the “geneid” namespace.
An alternative to modifying provenance records to derive new possible queries would
be to generate further provenance records using the prototype after it is configured using
the provenance record as a configuration source. The scientist could then trust that the
resulting queries were semantically accurate in terms of the query model.
The prototype did not provide a way to discover what queries would be useful
for a particular normalised URI without resolving the URI. The document that was resolved for the normalised URI could have static insertions, which were used in the prototype to indicate which other queries may be applicable. For example, the document
resolved at http://bio2rdf.org/geneid:4129 contains a reference to the URL http:
//bio2rdf.org/asn1/geneid:4129, and the predicate indicates that the URL could be
used to find an ASN.1 formatted document containing the given record.
The provenance implementation in the prototype does not include data provenance.
There are many different ways of denoting data provenance, and they could be used
together with the prototype as long as the sections of the record can be segregated.
This is an issue with RDF, as the different segments of the record cannot easily be
segregated without extending the RDF model from triples to quads, where the extra
context URI for each triple forms a group of statements that can then be given data
provenance. The model emphasises data quality, which includes normalisation, and data
trust, which includes proper annotation of statements with trust estimations. These
conflicting factors make it difficult for both the prototype and the model to support
provenance at a very low granular level. In the prototype, the choice of which strategy
to use is left to the scientist, although the prototype only currently supports RDF
triples.
The strategy taken for the Bio2RDF website is to emphasise data quality over the
location of the datasets and the provenance of individual RDF triples. The low level
data strategy, where the data provenance of each statement is given, makes it possible
to analyse the data in different ways, including the ability to make up integrated prove-
nance and data quality queries. However, a normalised strategy which does not include
all of the data provenance as part of the results of queries allows processing applica-
tions to focus on the scientific data without having to allow for the extra processing
complexity that is required if the RDF model includes the context reference.
The annotation of specific RDF triples is difficult, and some linked datasets require that scientists extend the original RDF triple model to a quad model, adding an extra Named Graph URI to each triple. This is possible using the model, but it is not encouraged in general, as it reduces the number of places that the RDF documents will
be recognised, as there is no recognised standard that defines RDF quad documents.


The data item provenance is useful as a justification of results in publications. How-
ever, if the data providers update the versions of the data items without keeping older
data items available, the model is unable to recreate the queries, particularly if the data
items are updated without changing the identifier. In part, this formed the basis for the LSID project [41], and implementations which used LSIDs such as the myGrid project [148]. The social contract surrounding LSIDs required that the data resolved for each identifier be unchanged in future so that past queries could be exactly reproduced. LSIDs failed to gain a large following outside of biological science communities, and many initial adopters are instead moving to HTTP URI Linked Data based systems. The use of more liberally defined HTTP URIs does not mean that the emphasis on keeping data accessible needs to be lost.
The ability to reproduce large sets of queries exactly on large datasets is a basic
requirement in the internet age. Textual details of a methodology were previously the
only scientific requirement. The model provides the ability to reproduce the same query,
with the publication containing the information required to submit the same queries to
the same datasets. It is flexible enough to work with both constantly updated and static
datasets. To reproduce the exact results, the data provider object in a query provenance document may be changed from the updated endpoint to the location of a static version that has not been updated since the query was executed.
The cost of continuously providing public query access to old copies of datasets may not be economical given the benefits of being able to exactly reproduce queries. This may be leading many data producers to avoid long-term solutions such as Handles and DOIs, which rely on the data producer paying for the costs of keeping at least the metadata about a record available in perpetuity. The model aims to promote a context sensitive approach which does not rely on an established authority to provide continuous access to the provenance or metadata about records. It is possible to publish provenance documents using schemes such as DOIs if data providers are willing to pay the costs associated with the continual storage of metadata about items by the global DOI authority.

6.4 Distributed network issues


The prototype was designed to be resilient to data access issues that could affect the
replicability and context sensitivity of queries resolved using the model. Issues such
as computer networks being unavailable, or sites that were overloaded and failed to
respond to a query would have produced a noticeable effect on the results. The effect
was reduced by monitoring and temporarily avoiding providers that were unresponsive. This monitoring was performed at the provider level and, at a lower level, by collating the DNS names that were responsible for a large number of errors.
The prototype relies on simple random choices across the endpoint URLs that are given for each provider to balance the query load across all of the functioning endpoints if they are not included in query groups or provider groups. In comparison to Federated SPARQL, where each of the services is defined as a single endpoint in the query, query groups and multiple providers for the same query types allow the system to be load balanced in a replicable manner. In addition, it was important for the Bio2RDF website that results were consistent across a large number of queries.
In a large scale multi-location query system, such as the set of scientific databases
described in Table 2.1, a broken or overwhelmed data provider will reduce both the
correctness and speed of queries. The implementation includes the ability to detect
errors based on the endpoint address and avoid contacting them after a certain threshold
is exceeded. If there are backup data providers that can execute the same semantic
query, they can be used to provide redundancy. Although this may have an effect
on the completeness of queries, as long as the backup data providers are used for a
query until a success is found, it was viewed as a suitable tradeoff. It avoids very long
timeouts when sites do not instantly return errors which reduces query responsiveness
for scientists. After a suitable period of time, the endpoints may be unblocked, although
if they fail to respond again they would again be blocked.
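A simplified sketch of this blacklisting behaviour is shown below; the error threshold and blacklisting period are illustrative values rather than the prototype's configuration.

import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/**
 * Simplified sketch of the blacklisting behaviour described above: an
 * endpoint is avoided once its error count passes a threshold, and is
 * unblocked again after the blacklisting period expires. The threshold and
 * period are illustrative values, not the prototype's configuration.
 */
public class EndpointBlacklist {

    private static final int ERROR_THRESHOLD = 5;
    private static final Duration BLACKLIST_PERIOD = Duration.ofMinutes(10);

    private final Map<String, Integer> errorCounts = new HashMap<>();
    private final Map<String, Instant> blacklistedAt = new HashMap<>();

    /** Record a failed request against an endpoint, blacklisting it if necessary. */
    public synchronized void recordError(String endpointUrl) {
        int errors = errorCounts.merge(endpointUrl, 1, Integer::sum);
        if (errors >= ERROR_THRESHOLD) {
            blacklistedAt.put(endpointUrl, Instant.now());
        }
    }

    /** An endpoint is usable if it has never been blacklisted or its period has expired. */
    public synchronized boolean isUsable(String endpointUrl) {
        Instant since = blacklistedAt.get(endpointUrl);
        if (since == null) {
            return true;
        }
        if (Instant.now().isAfter(since.plus(BLACKLIST_PERIOD))) {
            // Unblock, but keep watching: another run of errors re-blacklists it.
            blacklistedAt.remove(endpointUrl);
            errorCounts.put(endpointUrl, 0);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        EndpointBlacklist blacklist = new EndpointBlacklist();
        for (int i = 0; i < 5; i++) {
            blacklist.recordError("http://example.org/sparql");
        }
        System.out.println(blacklist.isUsable("http://example.org/sparql")); // false
    }
}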
The statistics gathering accumulated the total latency of queries made by the pro-
totype on providers. The statistics derived from the Bio2RDF mirrors indicated that
there were 298,807 queries that had low latency errors (i.e., the time for a query to fail was more than 0 and less than or equal to 300 ms (milliseconds)). By comparison, there were 328,857 queries that had a total error latency of more than 300 ms. This indicates
that there was a roughly even number of cases where an endpoint had failed completely
and quickly returned an error and where it was working, but busy, and timed out be-
fore being able to complete a request. If an endpoint was blacklisted, it would appear
a maximum number of times per blacklisting period in the statistics, making it difficult
to interpret the statistics. During the research, the blacklisting period varied between
10 minutes and 60 minutes.
The server timeout values across the length of the Bio2RDF monitoring period
ranged from 90 seconds at the beginning to 30 seconds at the end. A small number of
queries, 284, had server timeout values of less than 5 seconds. A server timeout only
occurred if the endpoint stopped responding as it was processing the query, so higher
latencies were possible. In some cases, the query did not time out after 30 seconds, presumably because data was still being transferred. The highest total error latency for
a query was 2,121,788 ms for a query on November 10, 2009. It contained two queries
that failed to respond correctly, with the overall query taking 1,201,622 ms to complete.
The highest response time for an overall query was 1,279,359 milliseconds (ms), for
a query on November 4, 2009. It contained 1 failed query, which took 345,502 ms (345
seconds) to fail, while the 8 successful queries that made up part of the query took a
combined total of 8,152,005 ms to complete. This indicated that there were queries
that legitimately took a long time to complete. Many of the queries did complete relatively quickly considering the network communication that was required: 6,662,019 overall queries completed within 3 seconds of the user requesting the data, and 7,022,205 queries completed within 5 seconds.

The Bio2RDF website and datasets were not hosted using high performance servers; however, the proportion of unsuccessful queries was still only 3 percent, as shown in Table 6.4, indicating that the datasets were relatively accessible with the commodity
hardware that was used. Two of the mirrors were limited to single servers with 4 core
processors and between 1 and 4GB of RAM, while the other mirror was spread across
two dual-quad-core machines with 32 GB and 16 GB of RAM respectively. The requests
from users to Bio2RDF were spread equally across all of the servers using DNS round
robin techniques, so there was no specific load balancing to prefer the system with the
higher specifications.
The current implementation lacks the ability to notify scientists in a systematic manner when particular parts of their query failed; however, in future this information may also be sent to scientists along with the results. This would enable decisions
about whether to trust the results based on which queries failed to execute. Since
the prototype was designed as a stateless web service, this information is not easily
retrievable outside of the comments included in the results if they used a format that
supports textual comments. The query is not actually executed when the scientist
requests the provenance for a query, so that avenue is not applicable.
Using RDF and SPARQL in the prototype, there is no way to estimate what the
number of results for each query is going to be, as SPARQL 1.0 does not contain a
standard count facility, and it cannot be easily derived from the query without previously
having statistics available. It would also incur a performance hit to require each results
set to be counted, either before or after the actual results are retrieved.
The model and prototype do not include support for specifying the statistics related
to each dataset. In order to use the model as the basis for a federated query application,
where a single query is submitted, planned, filtered, and joined across the data providers,
the model would need to be extended to include descriptions about the statistics relevant
to each namespace. This functionality was not included, as the model was designed to
be relevant to the contextual needs of many different scientists, including those where
scientists do not have the internet resources available to process very large queries, as
is required by similar large grid processing applications.
The prototype is efficient in part due to its assumptions that queries can be executed
in parallel, as the system is stateless, and the results can be normalised and included in
a pool of statements, without having to join or reduce the results from different queries
until after they are resolved and normalised. This design does not favour queries that
require large amounts of information from geographically distant datasets.
The prototype was implemented using Java. The Java security settings do not allow
an unprivileged program to specify which IP is going to be used for a particular HTTP
request, as this requires access to a lower layer of the TCP/IP stack than the high level
HTTP interface allows. The main consequence of this limitation is that the prototype
is not able to understand where failures occur in cases where single DNS names are dis-
tributed across multiple IP addresses. The mirrored Bio2RDF SPARQL endpoint DNS
entries were changed in response to this to create a one-to-one relationship between
DNS entries and IPs, which could then be used to reliably detect and avoid errors with-
out affecting the functioning mirrors. For example, there are two IP addresses for the
DNS entry “geneid.bio2rdf.org”, but there is only one IP address for each of the specific location DNS entries, i.e., “cu.geneid.bio2rdf.org” and “quebec.geneid.bio2rdf.org”. This
may be a problem for the use of the prototype with other websites that rely on DNS
level abstractions to avoid scientists having to know how many mirrors are available for
a given endpoint.
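The following short sketch illustrates the underlying DNS observation: the standard Java API can list the addresses behind a round-robin DNS name, but the high-level HTTP interface does not let the caller choose which address is used. The hostname is taken from the command line because the Bio2RDF mirror names above may no longer resolve.

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;

/**
 * Illustration of the DNS observation made above: a round-robin DNS name can
 * resolve to several IP addresses, which a Java program can list, but the
 * high-level HTTP API does not let the caller pick which address is used for
 * a given request.
 */
public class DnsMirrorInspection {

    public static void main(String[] args) throws UnknownHostException {
        String hostname = args.length > 0 ? args[0] : "localhost";
        InetAddress[] addresses = InetAddress.getAllByName(hostname);
        // A name used for DNS round robin will typically list more than one address here.
        System.out.println(hostname + " resolves to " + Arrays.toString(addresses));
    }
}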
Following a policy of having DNS entries mapping to single IP addresses enables an
implementation to understand which endpoints are unresponsive. However, it restricts
the ability of the scientist to choose which endpoints are geographically close. Both
strategies enable the best responses in different conditions. The actual endpoint that was used can be reported to the scientist in the provenance record, with any alternatives noted as possible substitutes.

6.5 Prototype statistics

This research could not be evaluated using numeric comparisons alone, as the research questions focus on determining the social and technical consequences of different model design
features on queries across distributed linked datasets. However, the use of the prototype
on the Bio2RDF website was monitored and the results are shown here to indicate the
extent to which the model features were relevant to Bio2RDF.
The prototype was configured to match the conventions for the http://bio2rdf.
org/ website. The configuration necessary for the Bio2RDF website contains 1,626
namespaces, 107 query types, 447 providers and 181 normalisation rules. In order to
provide links out from the RDF documents to the rest of the web, 169 of the namespaces
have providers that are able to create URLs for the official HTML versions of the dataset.
The Bio2RDF namespaces can be queried using 13,614 different combinations of query types and providers.
The Bio2RDF website resolved 7,415,896 user queries between June 2009 and May
2010. The prototype was used with three similar profiles on three mirrors of the
Bio2RDF website; two in Canada and one in Australia. The configurations were sourced
from a dedicated instance of the prototype (http://config.bio2rdf.org/admin/configuration/n3). Profiles were used to provide access to each site’s internally accessible datasets, while retaining redundant access to datasets
in other locations. The reliability of the website was more important than its response
latency, so redundant access to datasets at other locations increased the likelihood of
getting results for a user’s query.
Analysis of the Bio2RDF use of the prototype generated a number of statistics,
including those shown in Table 6.4. These statistics revealed that the prototype was
widely used, with an average of 17 user queries per minute over the 10 months. These
queries generated 35,694,219 successful queries on providers, with 1,078,354 unsuccessful
queries (3%). A large number of unsuccessful queries were traced to a small number of
the large Bio2RDF databases that were not handling the query load effectively. There
were periods where the entire dataset hosting facility at individual Bio2RDF locations
was out of use due to hardware failures or maintenance, which caused the other mirrors
to regularly retry these providers during the outage.

The statistics gathering system was not designed to provide complete information
about every combination of query type and provider. In particular, there were at least
10,697,566 queries that were executed on providers that did not require external queries
to complete, as the providers were designated as no-communication. These queries were
not included in the successful or unsuccessful queries. No-communication providers are
fillers for templates to generate RDF statements to add to the resulting documents, so
they did not provide information about the efficiency of the system. The 36,772,573
remote queries were performed using 18,455,634 query types on providers that were
configured to perform SPARQL queries.

Each of the user queries required the prototype to execute an average of 4.96 queries
over an average of 3.73 of the 447 providers. The 447 providers do not represent distinct
endpoints due to the way query types, namespaces, normalisation rules are all contex-
tually related to a provider in the model and not an endpoint. The large number of
unique providers and queries reflects the difficulty in constructing alternative workflows
or processes to contain these queries, along with the normalisation rules that are applied
to each provider.

Each of the 447 providers could be removed using profiles, and if necessary
replaced using alternative definitions without needing to change the definitions of any
query types, namespaces, or normalisation rules in the master Bio2RDF configuration.
By contrast, current Federated SPARQL models expose endpoint URLs directly in
queries. This creates a direct link between queries and endpoints, which means that
queries cannot be replicated, based on query provenance, using any other endpoint
without modifying the query provenance.

Federated SPARQL models may use object types to optimise queries. This strat-
egy works for systems where everyone uses the same URI to designate a record, and
all records in a namespace are represented using consistent object type URIs. These
assumptions hold in an environment where every namespace is similar to a typical
normalised relational database table, with the same properties for every record, and
numeric primary keys for each record. However, as the Bio2RDF case demonstrated,
primary keys are not always numeric, so they may require syntactic normalisation. In
addition, there may be multiple object types in a single namespace, even if the object
types are in reality equivalent but represented using different URIs or schemas.

In Bio2RDF, there were a few namespaces that clearly highlighted issues with
relational-database-like assumptions that are used by Federated SPARQL implemen-
tations. For example, the Wikipedia dataset has been republished in RDF form by
the DBpedia group. The record identifiers in DBpedia are derived from the title
of articles on Wikipedia, so references to the equivalent Wikipedia URL should be
treated equivalently to the DBpedia URIs. In the Bio2RDF normalised URI scheme, the main namespace prefix was “wikipedia”, with an alternate prefix of “dbpedia”, as the dataset is commonly referred to by that name in the Semantic Web community. In the prototype, the two different URIs http://bio2rdf.org/wikipedia:Test and http://bio2rdf.org/dbpedia:Test would both resolve to documents, although the resulting documents both contained http://bio2rdf.org/wikipedia:Test as the subject of their RDF triples.
In Federated SPARQL, the whole community would need to decide on a single
structure for URIs, or decide on a single object type for each namespace. In Bio2RDF,
different structures were supported by adding a new namespace and applying it and any
corresponding normalisation rules to the differing providers. For example, the prototype was used by Bio2RDF to access and normalise the URIs in the Chembl dataset, available from Uppsala University in Sweden (http://rdf.farmbio.uu.se/chembl/sparql). There are a number of namespaces inside this dataset, and a set of namespaces was created to suit its structure.
The prototype was also experimentally used to create a Linked Data infrastructure
for the original URIs. In order to do this, alternative namespaces needed to be created
for the same dataset, as the namespace given in the original URIs did not contain the
“chembl_” prefix that Bio2RDF used to designate the namespace as being part of the
Chembl dataset.
It would not be possible using Federated SPARQL to create these alternative Linked
Data access points, as Federated SPARQL requires single URIs, so each query could
only contain one of the many published URIs. This would make it impossible for anyone
except for the original publisher to customise the data, and it is necessary to customise
both the source and structure of data to satisfy the data quality, data trust and context
sensitive research questions for this thesis.
The 447 providers were mapped to 1,626 namespaces. The majority of these names-
paces represented ontologies published by OBO, converted to Bio2RDF URIs and lo-
cated in a single Bio2RDF SPARQL endpoint. In the OBO case, it would be very
difficult to separate the different ontologies based on object types, as the OWL vocab-
ulary is universally used to describe the objects in all of the namespaces. The only
alternative method for segregating the dataset into namespaces is to search for a given
prefix in each URI, as each of the ontologies contains a different URI prefix.
In SPARQL 1.0, it is necessary to use regular expressions to query for a particular prefix in a URI. This is very inefficient and not necessarily simple. To avoid this, optimised SPARQL queries were constructed to use the Virtuoso free text search extension. In Federated SPARQL, these extensions would need to be embedded in the query, making it difficult to replicate sets of these queries on another system, as they would all need to be modified. However, in Bio2RDF, the equivalent regular expression query was created and used on non-Virtuoso endpoints, without changing the query that users refer to.
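As a rough sketch of this arrangement, the same user facing label search could be attached to two provider specific templates, one for Virtuoso endpoints and one portable variant; the query text, the use of rdfs:label, and the ${input_1} parameter name are assumptions for illustration rather than the actual Bio2RDF configuration:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    # Template for Virtuoso endpoints, using the free text index extension.
    CONSTRUCT { ?s rdfs:label ?label . }
    WHERE {
      ?s rdfs:label ?label .
      ?label bif:contains "'${input_1}'" .
    }

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    # Equivalent portable template for non-Virtuoso endpoints, using a regular expression.
    CONSTRUCT { ?s rdfs:label ?label . }
    WHERE {
      ?s rdfs:label ?label .
      FILTER regex(str(?label), "${input_1}", "i")
    }

Because each template is attached to a different provider, the query that users refer to does not change; the prototype selects whichever template matches the provider being queried.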
In the case of OBO, these query and namespace features enabled Bio2RDF to create
a small number of providers to represent similar queries across all of the OBO ontologies.
There were at least 10,765 unique users of the Bio2RDF website during the statistics
gathering period, although two of the mirrors did not recognise distinct users using
their IP addresses as a result of the web applications being hosted in a reverse proxy
configuration.

Statistic Number
Total no. of resolutions 7,415,896
No. of resolutions within 3 seconds of request 6,662,019
No. of resolutions within 5 seconds of request 7,022,205
Average no. of resolutions per minute over 10 months 17.16
Average latency of queries (ms) 1532
Resolutions with low latency errors (1-300 ms in total) 298,807
Resolutions with high latency errors (more than 300 ms in total) 328,857
Total number of queries on external endpoints 36,772,573
No. of successful provider queries by prototype 35,694,219
No. of unsuccessful provider queries by prototype 1,078,354
Average no. of queries for each resolution 4.96
Average no. of endpoints for each resolution 3.73
Number of unique users by IP Address 10,765
Current number of providers 447

Table 6.4: Bio2RDF prototype statistics

The prototype software was released as Open Source Software and had an aggregate of 2,685 downloads from the SourceForge site (http://sourceforge.net/projects/bio2rdf/files/). The most recent version of the software released on SourceForge, 0.8.2, was downloaded around 300 times in its compiled form, and around 30 times in its source form at the time of writing.
The normalisation rules in the Bio2RDF case provided a way to make the normalised
Bio2RDF URI useful in cases where other data providers used their own URIs. This
is necessary for data providers that independently publish their own Linked Data ver-
sions of datasets using their own DNS authority as the resolving entity for the URI.
For example, the NCBI Entrez Gene data item about “monoamine oxidase B”, with
the identifier “4129” has at least three identifying URIs, including http://purl.org/
commons/record/ncbi_gene/4129, http://bio2rdf.org/geneid:4129, and the NCBI
URI http://www.ncbi.nlm.nih.gov/gene/4129. The normalisation rules made it pos-
sible to retrieve data using any of these URIs and integrate the data into a single doc-
ument that can be interpreted reliably using a single URI as the reference for the item.
In addition, the 150+ providers that were designed to provide links to the traditional
document web did not require normalisation, as the identifiers were able to be placed
into the links.
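As a minimal sketch of the effect of such a rule, the rewrite of the PURL based gene record URIs above into the normalised Bio2RDF form could be expressed as the following SPARQL 1.1 transformation; in the prototype the equivalent rules were defined as regular expressions in the configuration, so this illustrates the outcome rather than the actual mechanism:

    # Rewrite subjects such as http://purl.org/commons/record/ncbi_gene/4129
    # into the normalised form http://bio2rdf.org/geneid:4129 before the
    # results are merged with triples from other providers.
    CONSTRUCT { ?normalised ?p ?o . }
    WHERE {
      ?s ?p ?o .
      FILTER strstarts(str(?s), "http://purl.org/commons/record/ncbi_gene/")
      BIND(IRI(CONCAT("http://bio2rdf.org/geneid:",
                      strafter(str(?s), "http://purl.org/commons/record/ncbi_gene/")))
           AS ?normalised)
    }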
In the case of Bio2RDF, there was a relatively small number of normalisation rules
compared to the number of providers because there were in practice a limited number
of locations that each dataset was available from, and the semantic structure of the
data did not need to be normalised between providers in most cases. The majority of
rules defined URI transformations so that data from different locations could be directly
integrated based on RDF graph merging rules.

In comparison, the SADI system requires that the data normalisation be written in
code, as opposed to being configured in a set of rules. It must encode each of these URIs
and predicates into the relevant code and decide beforehand what the authoritative URI
will be. This makes it simpler to deploy services to a single repository, but makes it very
difficult for users to change the authoritative URIs and predicates to match their own
unique context. Having said that, the SADI system provides SPARQL based access to
Web Services using its mapping language, so it is useful for integrating RDF methods
with the current pool of scientific Web Services.
SADI has been initially built on a selection of premapped BioMoby services, making
it able to potentially access over 1500 web services when the mapping is complete. In
comparison, the Bio2RDF configuration built for the prototype relies mostly on RDF
database access using direct SPARQL queries, making it possible to avoid performing
multiple queries to get a single record from a single database, as is necessary using
many BioMoby Web Services. The SADI system however is being gradually integrated
with the Bio2RDF databases, although the normalisation rules that make the prototype
successful across the large range of Bio2RDF and non-Bio2RDF RDF datasources will
need to be integrated into the SADI system manually. In addition, SADI, like its SQL
based predecessor described in Lambrix and Jakoniene [83], has not dealt with the issue
of context sensitivity or replicability if the dataset structures materially change, as the
Bio2RDF datasources have done significantly in the last 2 years.

6.6 Comparison to other systems

There are a number of other systems that are designed using a model that is based
on a single list of unique characteristics of each datasource, such as the properties and
classes of information available in each location. These systems use this information
to decompose a query into subqueries and join the results based on the original query.
The prototype focuses on the flexible, content-agnostic, RDF model as its basis for
automatically joining all results. Although some other systems are designed to resolve
SPARQL queries, they do not focus on RDF as the sole model, with many utilising the SPARQL Results Format (http://www.w3.org/TR/rdf-sparql-XMLres/) as the standard format for query results in the same way previous systems use SQL result rows.
The SPARQL Results Format is a useful method of communicating results of com-
plex queries without having to convert the traditional relational row-based results pro-
cessing algorithms into the fully flexible RDF model. However, it does not enable the
system to directly integrate the results from two locations without knowing what each
of the results mean in terms of the overall query. If the query is not a typical query that
is looking for a set of result rows, but rather for a range of information about
items, then the variables from different results will not join easily, as they are defined
in the query as text, and they do not form RDF documents. However, any SPARQL
SELECT query, which would otherwise return results using the SPARQL Results For-
mat, can be rewritten as an equivalent SPARQL CONSTRUCT query which returns
RDF triples.
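A minimal illustration of this rewriting, using an arbitrary label lookup as the query:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    # SELECT form: bindings are returned as rows in the SPARQL Results Format.
    SELECT ?record ?label
    WHERE { ?record rdfs:label ?label . }

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    # Equivalent CONSTRUCT form: the same bindings are returned as RDF triples,
    # which can be merged directly with results from other providers.
    CONSTRUCT { ?record rdfs:label ?label . }
    WHERE { ?record rdfs:label ?label . }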
The prototype can be integrated with other RDF based systems using SPARQL
CONSTRUCT/DESCRIBE queries, URL resolution, or other methods as they are cre-
ated in future. In all of the other frameworks the distribution of queries across different
locations relies on knowledge about the structure of the data that will be returned,
as queries are planned based on the way that the data from different endpoints will
be joined. In the model and prototype, the distribution is based on knowledge about
the high level datasources that are accessible using each endpoint, with the possibility
that all datasources will be available using a single default endpoint without needing to
specify each of the datasources. These methods make it possible to construct complex,
efficient queries, without restricting users to particular types of queries or restricting
the system to cases where it knows the complete details of the data accessible in each
endpoint.
The prototype provides users with a way of integrating any of the RDF datasources
that they have access to, including Web Services wrapped using SADI [143], distributed
RDF databases wrapped using DARQ [111], and plain RDF documents that are ac-
cessible using URLs. This makes it possible for scientists to develop solutions in any
infrastructure and make it possible for other scientists to directly examine or consume
their results, whether or not the other scientists have a prior understanding of the on-
tologies that they have used to describe their results. Although SADI provides direct
access to RDF datasources using specifically designed code sources, its entire processing
method relies on scientists knowing exactly which predicates were used to describe each
dataset. This makes it necessary for each community to use a single set of predicates
for any of the datasets that they wish to access with their query, or produce code for
the translation rules to make the SADI engine see a single set of predicates.
The prototype provides an environment where multiple sources of data from different
contexts, such as open academic, and closed business models, can be combined, and
recombined in different ways by others, without having to publish limited access datasets
completely in globally accessible locations. In DARQ, for example, the statistics relevant
to each dataset must be accessible by each location to decide whether the query is
possible, and if so whether there is an efficient way to join the results. In the prototype
the relevant pieces of information about each dataset can be published, and if the dataset
is not intended to be directly accessed, the prototype can be used to publish the set of
operations that the public can perform with the dataset. In SADI, it is necessary to use
a single service repository, as the query distribution plan is compiled in a single location
using both the coded programs and the service details, making it difficult to transport
current queries, or integrate service repositories, even if a scientist can establish their
own repository for their own private needs.
The prototype makes it possible to integrate different datasets by encoding the nor-
malisation rules in a transportable configuration file and making sure that namespaces
164 Chapter 6. Discussion

are always recognised using an indirect method, so that there is not a single global defi-
nition of what URI is used to identify each namespace or dataset. In SADI, namespaces
are either recognised using custom code for each URI structure, based on the prefix definitions on the Life Science Resource Name website (http://lsrn.org/), or they are recognised using the literal prefix string representing the namespace.
In comparison, the prototype makes it possible for scientists to internally redefine
namespaces using the same prefix as an external namespace, while linking it to their own
URI to make it possible and simple to distinguish between their use of the same prefix
and other uses of the prefix. It is not possible to use a prefix to name more than one
namespace if one relies on Inverse Functional Properties to name an RDF object using a database name prefix and the identifier, without using a unique predicate URI, as the Gene Ontology recommends in its RDF/XML format guidelines (http://www.geneontology.org/GO.format.rdfxml.shtml). The prototype encourages Linked Data style, HTTP URI references to resources so that they can be accessed using the prototype's query methods, using HTTP resolution if available, or using the entire HTTP URI as an identifier if another method is used in the future.

Chapter 7

Conclusion

The research presented in this thesis aims to make it possible for scientists to work
with distributed linked scientific datasets, as described in Section 1.1. Current systems,
described in Chapter 2, that attempt to access these datasets encounter problems in-
cluding data quality, data trust, and provenance which were defined and described in
Section 1.2. These problems, shown in Figure 1.6, are inherent in current distributed
data access systems.
Current systems assume data quality is relatively high; that scientists understand
and trust the datasets that are publicly available in various locations; and that other
scientists will be able to replicate the entire results set using the exact method as
published. In each case, scientists need to have control over the relevant methods
and datasources to integrate, replicate, or extend the results in their particular context.
They must be able to do this with minimal effort. Solutions such as copying all datasets
to their local site before normalising and storing them in a specialised database, or
alternatively relying exclusively on public web services that may not be locally available
or reliable, do not form a good long term solution for access to both local and public
datasets.
The query model described in Chapter 3 enables scientists to integrate semantically
similar queries on multiple datasources into a single, high quality, normalised output for-
mat without relying exclusively on a particular set of data providers. The query model
distinguishes between user queries, templated query types, and the data providers, mak-
ing it possible to add and modify the way existing user queries are resolved by adding
or modifying templated query types or data providers. In addition, users can create
new normalisation rules to modify queries or the results of queries without requiring
any community negotiation, as is necessary with single global ontology systems that
have been created in the past.
The query model was implemented in the prototype web application as described in
Chapter 5. The prototype makes it possible for scientists to expose their queries and
configuration information as HTTP URIs. The use of HTTP URIs enables scientists to
use the information in different processing applications, while sharing the computer un-
derstandable query methods and results with other scientists to enable further scientific
research.

The prototype is designed to be configured in a simple manner. The configuration of a prototype installation can be built from scratch, or public sources of configuration information, such as the Bio2RDF configuration (http://config.bio2rdf.org/admin/configuration/n3), can be imported and selectively used based on the user's profiles. Scientists can add and substitute new providers without requiring changes to current providers and queries.
The use of profiles from the query model enables scientists to choose local sources in
preference to non-local sources where they are available. Profiles allow scientists to
ignore datasources, queries, or normalisation rules that are untrusted, without actually
removing them from published configuration files that they import into their prototype.
Namespaces, corresponding to sets of unique identifiers, can be assigned to data
providers and queries to perform efficient queries using the prototype. Namespaces
enable the prototype to distribute queries across providers who provide the same data-
source, without relying on the method that was used to implement either of the providers,
or any declarations by the data producer about which datasets are present in a location.
The RDF based provenance and configuration information that is provided by the
prototype can be processed together with the RDF data when it is necessary to explain
the origin of the data along with the data cleaning and normalisation methods that were
relevant. This enables scientists to integrate the prototype with workflow management
systems, although there is not yet widespread support for RDF processing in workflow
management systems. Chapter 4 described the integration of the model with medicine and science, and showed that it can be used for exploratory purposes in addition to regular workflow processing tasks.
The model can be used to access untrusted datasets. This is necessary for ex-
ploration and evaluation purposes. However, it is simultaneously possible to restrict
trusted queries to only specific datasources as deemed necessary by a scientist. This
would reduce the number of datasets that they need to evaluate before trusting the
results of a query.
The model is implemented as a prototype web application that can be used to
perform complete queries on each data provider for a particular query, and normalise the
results. The scientist is then responsible for performing further queries. In comparison
to other systems, the prototype does not attempt to automatically optimise the amount
of information that is being transferred and it does not answer arbitrary SPARQL
queries. It has native support for de-normalising queries to match data in particular
endpoints, and normalising the results, something that other systems do not cater for,
but which is important for consistent context-sensitive integration of distributed linked
scientific datasets.
The use of the prototype in both private and public scenarios, including the public Bio2RDF.org website (http://bio2rdf.org/) and private uses of the open source software package (http://sourceforge.net/projects/bio2rdf/), provides evidence for the success of the model in improving the ability of scientists to perform queries across distributed linked scientific datasets.

The scientific case studies described in Chapter 4 and the related systems described
in Chapter 2, show that there are no other models or systems that are designed to enable
scientists to independently access and normalise a wide variety of data from distributed
scientific datasets. Other systems generally fail by assuming constant data quality and
semantic meaning, uniform references, or they require scientists to have all datasets
locally available.
The prototype was shown in Section 4.7 to be simple to integrate with workflow
management systems using HTTP URI resolution as the access method and RDF as
the data model, with no normalisation steps necessary in the workflows. Workflows are
ideally suited to filtering and parallelisation, and they were used in this way to make it
possible to link the results of queries using the prototype with further queries.
The discussion in Chapter 6 shows that the prototype is useful by virtue of the
way it was used by the Bio2RDF project to integrate a large range of heterogeneous
datasets published by different organisations. In addition, the Bio2RDF configuration
was manipulated using profiles to both make the Bio2RDF mirrors work efficiently,
and for private users to add and remove elements of the configuration using private
configurations.
The model and prototype make it possible for scientists to understand and define
the data quality, data trust, and provenance requirements related to their research by
being able to perform queries on datasets, including understanding where datasets are
located and exactly which syntax and semantic conventions they use. In reference to
the issues shown in Figure 1.6, the model and prototype provide practical support as
illustrated in Figure 7.1.
The model and prototype could be improved in the future to include more features
such as linked queries and named parameters that may be beneficial to scientists, as
described in Section 7.3. In particular, the prototype could be extended to support
different types of normalisation, including logic-based rules and query transformations.

7.1 Critical reflection

The model provided a simple way to align queries with datasets, and the prototype
provided verification of the model's usefulness through its use as the manager for queries
to the Bio2RDF website. The prototype phase influenced the design of the model, as
the implementation of various features required changes to the model. In addition to
this, the Bio2RDF datasets were simultaneously updated, requiring the configuration
for the prototype to be maintained and updated out of step with the development and
release cycle.

7.1.1 Model design review

The model provides a number of loosely linked elements, centralised around providers.
Although they act as a central point of linkage, providers still contain a many to many
relationship with endpoints. Each provider may have more than one, redundant, endpoint, while each endpoint may be associated with more than one provider, in the context of different namespaces and/or query types.

[Figure 7.1: Overall solutions. The figure summarises how the model addresses the data, medicine, science, and biology issue areas using features such as the RDF data model, Linked Data resolution, namespaces, profiles, query types, normalisation rules, and HTTP URIs.]
The entry points, query types, provide a many to many relationship between queries,
namespaces and query types. This enables the construction of arbitrary relationships
between queries and namespaces, as namespaces can be recreated using different URIs
to avoid collisions for the same prefix, and normalised using the preferred and alternate
prefix structures. Although the prototype only implemented a single query to namespace
mapping method, using Regular Expressions, new mapping methods can be created to
transform namespaces using intelligent mechanisms, including database lookups and
SPARQL queries to transform the input in arbitrary ways.
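For example, a future mapping method could consult the configuration itself with a small SPARQL query to translate an alternate prefix into the preferred prefix of a namespace entry; the predicate names below are hypothetical placeholders rather than the actual prototype schema:

    PREFIX ex: <http://example.org/queryall-schema#>
    # Hypothetical lookup: find the preferred prefix of the namespace entry
    # that declares "dbpedia" as one of its alternate prefixes.
    SELECT ?preferred
    WHERE {
      ?namespaceEntry ex:alternatePrefix "dbpedia" ;
                      ex:preferredPrefix ?preferred .
    }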
In comparison to provider-based normalisation rules, this mechanism is not easy to
share between query types, but in all of the Bio2RDF cases the preferred and alternate
prefix feature of namespace entries was enough to normalise namespaces. In a future re-
vision these rules could be defined separately, or normalisation rules could be integrated
with both query types and providers.
The loosely linked query elements are very useful, particularly in the way query
replication can be performed without having to modify hard links between queries and
data sources. However, it makes it difficult to determine a priori which elements will
be used for a particular type of query without a concrete example, particularly in
a large configuration such as that used for Bio2RDF. For example, consider the DOI (Digital Object Identifier) namespace, which was largely represented in RDF datasets using textual references. Since a query can define itself as being relevant to all namespaces, the list
of namespace URIs that are attached to any query type does not immediately identify
the queries that were relevant. This made it difficult to audit the DOI datasets without
performing queries to see which datasets were available.
As namespaces needed to be independent of the required query model, to support
arbitrary queries, a similar mechanism was created for providers, the default provider
setting. In addition, to enable queries to require explicit annotation of providers with a
namespace before sending a query to a provider, query types were able to specify that
they were not compatible with default providers. This design choice was very useful
according to the criteria for this thesis, but in the majority of cases, these features were
not required by Bio2RDF, and made it slightly more difficult to verify the configuration
was acting as desired.
However, in cases where there was a definite query, it was always possible to de-
termine the relevant query types, namespace entries, providers and normalisation rules
without performing any network operations. In comparison, some federated SPARQL
systems such as SADI, cannot determine which datasources will be required for a query,
as they need to continue to plan each query while they are executing it. In addition, it
is impossible, by design, to get all properties of a data record from a SADI SPARQL
query, as by design there are no unknown predicates in a query. This inherent lack of
knowledge about the extent of a single query makes it difficult to replicate using the
resulting query provenance. The query needs to be executed again using the full SADI
service registry to be correct.
In comparison, queries using the model described in this thesis are replicable solely
using the query provenance information, if the content of the datasets do not change.
This makes it practical to publish the query provenance for others to use in similar
ways, even if they do not use the same query, or they want to selectively add or remove
providers to match their context. If the namespace parameters do not change, the query
plan for the first query will be identical for future queries, eliminating the necessity of
a central registry to plan similar queries after the first query.
Although it would have been possible to define parameters at the query group level,
this would have a negative effect on replicability and context sensitivity. If all queries
in a query group are required to accept the same parameters, then general query types,
which could answer a wide range of queries, would not be allowed. This would have
a negative effect on replicability if these general query types are necessary to provide
redundant access to data. The model defines parameters at the level of query type to
provide for any possible mapping between a query and the parameters for that query.
This makes it possible to pass through queries directly to lower levels, or deconstruct
them into parameter sets for direct resolution. If a query type inherited its parameter
mappings from its query group, it may not be possible to implement in these situations
without creating an entirely new query group, which defeats the purpose of query groups
as a semantic grouping of similarly purposed query types.

7.1.2 Prototype implementation review

The prototype was implemented using Java Servlets, along with the URLRewrite library
to support HTTP URI queries in a lightweight manner, and the Sesame RDF library to
support internal RDF transformations. This made it very flexible during development,
as the URL structures were easy to modify using URLRewrite when necessary, and
the Servlets did not need to know about the query method as they received a filtered
path string. In addition, queries to the model are designed to be stateless, reducing the
complexity of the application.
The processing code was implemented using a single Servlet. Each instance of the
servlet used a single Settings class to maintain knowledge about the configuration files,
and a similarly scoped BlacklistController class to maintain knowledge about failed
queries on endpoints, and the frequency of client queries. The settings class contained
maps that made it possible to efficiently lookup references between configuration objects
using their RDF URIs, as objects were not tightly linked in memory using Java object
references.
The servlet was able to process multiple queries concurrently per second. However,
the basic design would be difficult to optimise if there were a range of different nor-
malisation rules in the common queries. For example, the RDF document needs to be
processed as a stream of characters before being imported to an abstract RDF store to
be processed for each query. Then, these RDF triples are integrated with RDF triples
from other query types to be processed as a pool of RDF triples, in order to perform
semantic normalisations that were not possible with the results from each provider.
Then, in order to perform textual normalisation, the pool of RDF triples needed to be
serialised to an RDF string that could have an arbitrary number of normalisations per-
formed on it. All of the string normalisation stages could require the entire string, for
example, an XSLT transformation. This made it difficult to generate a smooth pipeline.
In comparison, the BioMart transformation pipeline assumes that records are in-
dependent of each other, so that it can process them in a continuous stream, using
ASCII control characters (\t for the end of a field and \n for the end of a record) in
a byte stream to designate the start and end of records. The Bio2RDF case required
that the existence of relational database style records with identical fields in each were
not necessary to process data, with search services able to provide arbitrary results,
for example. It also required that the results could be normalised from each endpoint
individually, to fix errors in particular datasets, and that results could be normalised
across all endpoints for a query, to make it possible to perform semantic queries across
datasets.
The servlet handled documents containing up to 30,000 RDF triples regularly, as
this range was required to process queries on the RDF triples in the configuration,
although most results were limited to 2000-5000 triples. In smaller datasets, records
were very small if they were only sourced from their official dataset, but if they were
highly linked to from other datasets, the results were much larger, resulting in a very
large document. In other cases, records are relatively small only if internal links are not
taken into account. For example, in the Wikipedia dataset, the interarticle links are
so numerous that DBpedia was forced to omit the page links dataset from the default
Linked Data URI, as it was causing lightweight Linked Data clients to crash due to
the large number of triples. The Bio2RDF installation of the prototype was able to
process and deliver the full Wikipedia records, including the page links, although the
dbpedia.org SPARQL endpoint had a high traffic flow so SPARQL queries sometimes
failed to complete.

7.1.3 Configuration maintenance

The configuration defines the types of queries, data providers, normalisation rules, and
namespace entries, that are to be used to resolve queries. It is flexible, and when used
with each user's set of profiles, it can be used differently based on the user's context,
rather than on something that is built into the basic configuration.
In comparison to Federated SPARQL systems, it is not limited by the basic descrip-
tions of the statements and classes in an endpoint. Normalisation rules can be written
to reform the URIs, properties and classes that are the core of Federated SPARQL
configurations. However, this flexibility requires more effort, as the Federated SPARQL
descriptions can be completely generated without human input, and most are statically
generated by producers rather than generated by users.

In order to automatically generate configuration information for the prototype, additional information is required, defining the rules and queries that would be used for
each namespace. This registry would then be used together with Federated SPARQL
descriptions to generate a complete configuration which could then be selectively cus-
tomised by scientists using profiles.
The majority of the configuration management tasks involve identifying changes
to datasets and how to either take advantage of the change, or isolate the change if
it will interfere with their current queries. For example, if a new data quality issue
was identified for an endpoint, a new normalisation rule could be added to the data
providers. If the data provider previously was thought to be consistently applicable to
a range of endpoints, the data provider may need to be split into two data providers to
isolate the data quality issue.
The RDF statements that make up the prototype configuration files can be validated
using a schema. The prototype schema is generated dynamically by the prototype
itself. The current version can be resolved from http://purl.org/queryall/, which
in turn resolves the schema using an instance of the prototype. Past versions can be
generated using past prototype versions, or by passing the current prototype the desired
configuration version as part of the schema request.

7.1.4 Trust metrics

Trust metrics are used to give users opinions about the trustworthiness of items of unknown quality. They generally take the form of numeric scales; for example, there may be a 5 star hotel, or an 8 star movie. The model does not attempt to define trust metrics
inside the model. In centralised registries it is appropriate to compile information from
all users and generate ratings based on the wisdom of the crowds. However, the model
is designed to be extensible based on unique contexts.
If a Web Service is given a trust metric based on its uptime, it may not be relevant
to a user that reimplemented the service exactly in their own location. Similarly, if
a user perceives that a data provider is generating incorrect, or non-standard data,
they may give the data provider a low rating. However, the model provides the opportunity
for others to generate wrappers to improve data quality, using normalisation rules for
transformations. The wrappers are defined in a declarative manner, enabling them to
publish the resulting rules for others to examine and reuse. If others reuse the wrapper,
then effectively they are trusting the dataset. This contradicts the trust opinions given
to the basic data provider interface, even though the data provider is providing the
same information.
Instead, the trust in a data provider was defined based solely on whether the dataset
was used or not. Although this definition provides clues about useful data providers,
it does not reliably identify false negatives. That is, it does not correctly categorise
providers that are not used, but would be trusted if there were no other contextual
factors, such as the performance capacity of the provider or the physical distance to the
provider.

The data trust research question, defined in Section 1.3, relies on scientists identi-
fying the positive aspects of trust. Scientists are able to identify trusted datasets and
queries based on examining published configurations that used the datasets or queries.
They can then form their own opinions to generate further publications, which would
increase the rank for a particular provider in the view of future peers.

7.1.5 Replicable queries


The model provides a framework for scientists to use when they wish to define and
communicate details of their experiments to other scientists. The query provenance
details combine the replication details with the data cleaning methods that
were thought to be necessary for the queries to be successful and useful. They also
contain indications about positive trust ratings and the general data quality of each
data provider.
The prototype can take query provenance details and execute the original query in
the current scientist’s context, with selections and deletions defined as extra RDF triples.
That is, future scientists do not have to remove any triples from a query provenance
document to replicate it, as any changes are additions, together with changes to the
profiles used to replicate the queries.
If a new normalisation rule is needed, it can be added to the original provider, by
extending the original provider definition with an extra RDF triple. This extension can
occur in another document if necessary, as the RDF triples are all effectively linked after
they are loaded into an RDF triplestore by the prototype. In addition, non-RDF systems
that may be used by scientists currently can be integrated using normalisation rules to
transform the content to RDF, including XML using XSLT and textual transformations
including formats such as CSV [88], or they can be accessed using queries to conversion
software such as BioMart [130] or D2R [29]. Any textual queries are supported using
the model, including URLs and HTTP POST queries.
If a query does not function properly in the scientist's context, they can ignore the
query type using their profile, and create a replacement using a different URI. They can
then link this new query type URI into the original provider.
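As a sketch of how such an extension could be expressed, the additional statements could be published in a separate RDF document or applied as a SPARQL 1.1 Update request; the resource URIs and the predicate names below are hypothetical placeholders rather than the actual prototype schema, which is resolvable from http://purl.org/queryall/:

    PREFIX ex: <http://example.org/queryall-schema#>
    # Attach a locally defined replacement query type and an extra normalisation
    # rule to the original provider, without deleting any existing statements.
    INSERT DATA {
      <http://example.org/provider/original>
          ex:includedQueryType  <http://example.org/query/replacementQueryType> ;
          ex:normalisationRule  <http://example.org/rule/extraNormalisationRule> .
    }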

7.1.6 Prominent data quality issues


The use of the prototype as the engine for the Bio2RDF website revealed a range of
data quality issues, many of which were introduced in Section 1.2. These included data
URIs that did not match between different locations, a range of predicates that were
either not defined in ontologies, or defined to be redundant with other ontologies, and
a lack of consistency in the use of some predicates between locations.
The issue of different URIs for the same data records was identified as a major issue
by the Bio2RDF community. Although it was not an issue for the Bio2RDF datasets, it
affected the ability to source extra links from other providers. The Bio2RDF resolver has
a major requirement to be a single resolver for information about different data records,
while including references to the original URIs after the normalised URI is resolved,
whether they are resolvable or not. This requirement made it necessary to include both the
original URI and the normalised URI in some cases, meaning that the URI normalisation
couldn’t be applied in a single step. Some statements needed to be included without
being normalised together with the results from each endpoint. In order to allow for all
possibilities, there were two stages included after these statements were inserted, one
with access to RDF triples, and one with access to the serialised results in whatever
format was requested. For example, http://www4.wiwiss.fu-berlin.de/drugbank/
resource/drugs/DB00001 was equivalent to http://bio2rdf.org/drugbank_drugs:
DB00001, and when the original RDF triples were imported the URI needed to change, but after the equivalence RDF triples were inserted, the URIs needed to remain unchanged.
This was achieved by applying the URI normalisation rules to the “after results import”
stage, before the equivalence triples were inserted and further normalisation was done
in the “after results to pool” and “after results to document” stages.
The Bio2RDF datasets did however have issues with inconsistent URIs over time for
predicates and classes. These issues were solved over a period of time, as best practices
were identified in the larger RDF community. However, the lack of resources for main-
tenance of the RDF datasets meant that the Bio2RDF resolver needed to recognise and
normalise 5 different URI styles while the migrations of older datasets were performed.
For example, an ontology URI was originally written as http://bio2rdf.org/
bio2rdf#property. This was acceptable, but not ideal, as the property would not
be resolved, due to the fragment “#property”, being stripped before HTTP resolution.
In order to isolate these cases from other URI resolutions, the ontology URI standard
was changed to http://bio2rdf.org/ns/bio2rdf#property. The base URL in this
URI was designed to resolve to a full ontology document containing a definition for the
property. However, this was not ideal in cases where the property needed to be re-
solved specifically, so it was changed to http://bio2rdf.org/ns/bio2rdf:property.
This URI was completely resolvable and distinguishable from other URIs, but it was
not consistent with some other ontologies that were defined using the normal Bio2RDF
URI format, http://bio2rdf.org/namespace:identifier. In response to this, a new
namespace was created for each Bio2RDF ontology, for example, http://bio2rdf.org/
bio2rdf_resource:property.
At one stage another alternative was also experimentally used, http://ontology.
bio2rdf.org/bio2rdf:property, however the use of a single domain name for all URIs
was too valuable to ignore. The normalisation rules for each of these styles were, in
practice, applied in series to eventually arrive at the current URI structure, although
they could have all been modified to point to the current URI format.
In some cases, datasets define their own properties that overlap with current prop-
erties. Although this research did not aim to provide a general solution to ontology
equivalence, the normalisation rules, particularly SPARQL rules, can be used to trans-
late between different, but equivalent sets of RDF triples. It is particularly important
for the future of Linked Data if there are standard properties that can be relied on
to be consistent. For example, the RDF Schema label property could be relied on to
contain a human readable label for a URI in RDF, although its meaning was redefined
by OWL2, and can hence no longer be used for non-ontology purposes in cases where
OWL2 may be used.
One term in particular was identified as being inconsistently interpreted, possibly
due to a lack of specification and best practice information. The Dublin Core “identifier”
property is designed to provide a way to uniquely identify an item in cases where the
URI may be different. For example, it may be used to provide the unique identifier
from the dataset that it was originally derived from. In the drugbank case above this
would be “DB00001”. However, it may also be used to represent the namespace prefixed
identifier, for example, “drugbank_drugs:DB00001”. The SADI designers favour another
specification that matches their Federated SPARQL strategy, that of a blank node with
a key value pair identifying the dataset using a URI in one triple and the identifier in
another triple attached to the blank node (https://groups.google.com/group/sadi-discuss/msg/d3e4289082098428). Although this functionality may be useful,
it requires a single URI to identify the dataset, which requires everyone to accept a
single URI for every dataset. In the Bio2RDF experience outlined here, that is unlikely
to ever happen.

7.2 Future research

This thesis produced a viable model for scientists to use in cases where other scientists
will need to replicate the research using their own resources. It incorporates the nec-
essary elements for dynamically cleaning data, in a declarative manner so that other
scientists could replicate and extend the research in future. However, it left open re-
search questions related to studying the behaviour of scientists using the model with
respect to trust and data provenance.
The model provides the necessary functions for scientists to individually trust any
or all of the major elements, including query types, providers and normalisation rules.
The trusted elements would then be represented in the query provenance record that a
model implementation could generate for each query. Scientists could substitute their
own implementations for these elements without removing the original implementations
from the original provenance records.
This thesis did not study the reactions of scientists to the complexity surrounding
this functionality, as it enables arbitrary changes to both queries and data, in any
sequence. For example, there were two flags implemented in the prototype to control
the implementation of the profiles feature. The first defined whether an element should
be included if it implicitly matched a profile, and the second defined whether an element
should be included if it was not excluded by any profile and did not implicitly match
any profile. Although these rules are simple to define, different combinations of these
two flags may change the results of a query in a range of different ways.
These flags were used to simplify the maintenance of the Bio2RDF system, as they
reduced the complexity of the resulting profiles. The first flag, “include when implicitly
matching”, was enabled and the second flag, “include when no profiles match flag”
was disabled. Future research could study the behaviour of the model using different
combinations of these flags, in combination with the settings on each profile defining
whether to implicitly include any query type or provider that advertised itself as being
suitable for implicit inclusion.
Although they were mostly implemented to simplify maintenance, the flags, along
with other flags inside of named profiles, enable scientists to dynamically switch between
different levels of trust without changing the other configuration information. A very
low level of trust would set both of the overall flags to disabled and include a profile with
all of its flags set to disabled. This level of trust would require that they specifically
include all of the query types, providers and normalisation rules that they wish to use.
In comparison, a very high level of trust may be represented by setting both of the
overall flags to enabled, without using any profiles.
The query provenance records produced by the prototype included a subset of the
complete configuration that the scientist had enabled at the time they executed the
request. This may reduce the ability of scientists to change the trust levels when repli-
cating queries using the provenance record. Setting broader levels of trust will only ever
include the elements in the record, even if there were other possibilities in the prototype
configuration. If, in the Bio2RDF example, the entire configuration were included in each provenance record, the provenance size would increase, depending on which elements
matched a particular query. The size increase could be significant in terms of future
processing of the query using the provenance record and storing and publishing the
provenance record.
Future research could investigate the viability of this model, and the viability of a
large scale combined data and query provenance model with respect to its usefulness
and economic cost. For example, the query provenance, including the entire Bio2RDF
configuration so that future scientists could fully experiment with novel profile and trust
combinations, could be stored in a database. If it were stored using a single graph for each of the query provenance RDF records, then for the 7.4 million queries resolved during the 11 month statistics gathering period there would be in excess of 222 billion triples in the resulting database, as the configuration contained approximately 30 thousand triples (7.4 million records multiplied by roughly 30 thousand triples each). There were
additional triples for each provenance record detailing the actual endpoints that were
used and the exact queries that were used for each endpoint, which would increase the
actual number of triples.
For the evaluation of this thesis, a reduced set of this provenance was stored in a
database, with the records only containing URIs pointing to the full query type and
provider definitions that would be present in the full record. Most provenance research
assumes that this cost will be either offset by the usefulness of the provenance in future,
or it will be taken up by a new commercial provenance industry, perhaps modelled on
the current advertising industry. A full examination of this would be required before
recommending any large scale use of provenance information as a method of satisfying
trust and replicability requirements. In comparison, the publication of a single large
configuration, such as the Bio2RDF configuration, along with the methods used for
each of the queries used to produce a given scientific conclusion, would enable flexible
replicability using a range of trust levels, without incurring the permanent storage cost
that is associated with permanent provenance, as 30 thousand triples can be stored and
communicated relatively easily compared to the alternative of 222 billion.
In the Bio2RDF configuration there was a single profile for each of the three mirrors,
along with a single default profile for other users of the prototype. These profiles were
used to exclude items from the configuration based on knowledge about the location
of the provider. Most legitimate changes in the context of the Bio2RDF mirrors were
fixed by changing the master configuration, as they were bug fixes that improved the
results of a query.
If a provider was not available any more, it was set to be excluded by default,
without being removed. This made it possible to keep knowledge of the provider, if,
for example, they reappeared at a different location, without continuing to attempt to
use the provider. If a dataset changed so that a normalisation rule was not required
any more, the normalisation was either changed to be excluded by default, or it was
removed from the provider, depending on whether it may have some use in the future
again. Future queries against the Bio2RDF website would use the new configuration,
however past provenance records would still include their original elements, even if the
query could not be easily replicated due to permanent disappearance of the provider,
or a non-backwards-compatible data change.
Future research could examine the effects of modifications to the Bio2RDF master
configuration, perhaps in conjunction with past research into migration of ontologies
[136]. There are conflicting objectives influencing migration of configurations using the
model. Firstly, if a public provider changes their data, then the URI for the provider
would not be changed, as otherwise current profiles would be affected. The normalisation rules
and namespaces applied to the provider would be added or deleted to match the real
changes. However, if the old version of the provider was referenced in past provenance
records, then they could not be included together with the current configuration, or
they could introduce conflicting normalisation rules with unknown combined effects. In
this case it would be advantageous to change the URI of the provider, however, this
would affect profiles that trusted the provider in both cases, as they may need to be
changed to explicitly include the current URI to continue to have the same effect.
The research in this thesis does not examine ways to include and use identity in-
formation in configurations, other than to include it in namespaces to recognise the
authority that defined a particular namespace. Future research could extend this thesis
to examine the benefits of adding in identity information, including individual scientists
and the organisations they work for. One benefit might be to automatically identify the
necessary citations for publications, based on the use of a particular group of datasets.
Another might be to recognise scientific laboratory contributions, including the actual
contributions from individual scientists in a team, which may only currently be identified
as contributions from the authors on a resulting publication.

7.3 Future model and prototype extensions


The model may be extended in future to suit different scenarios, including those where
complex novel queries are processed using information that is known about data providers.
As part of this it may be extended to include references to statistics as a way of opti-
mising large multi-dataset queries. The prototype can be extended to provide different
ways of creating templates and different types of normalisation rules to suit any of the
current normalisation stages.
These changes would require the normalisation process to be intelligently applied
to both the queries and the results of queries, although the way this may work is not
immediately obvious. The independence of namespaces from normalisation rules may be
an issue in automatically generating queries, as currently the prototype only includes an
informative link between normalisation rules and any namespaces that they are related
to. The rationale for this is that the namespaces make it possible to identify many
different datasets using a single identifier, while the normalisation rules independently
make location specific queries functional in some locations without any reference to
other locations. Future research could examine this aspect, along with other semantic
extensions to the model to facilitate semantic reasoning on the RDF triples that make
up the configuration of the prototype.
The scientific community has endeavoured on a number of occasions to create
schemes that require a single URI for all objects, regardless of how that would af-
fect information retrieval using the URIs, as the gains for federated queries are thought
to be more valuable than the losses for context dependent queries and URI resolution.
Although the model encourages a normalised URI scheme for queries to effectively trans-
late queries and join results using the normalised data model across endpoints, it does
not require it, so scientists can arbitrarily create rules and use URIs that manipulate
queries and results outside of the normalised URI scheme.
The model could also be extended to recognise links between namespaces to indi-
cate which namespaces are synonyms or subsets of each other. This could be used by
scientists to identify data providers as being relevant, even if the provider did not in-
dicate that the namespace was relevant. This would enable scientists to at least discover
alternative data providers, even if the related queries and normalisation rules may not
be directly applicable as they may be specific to another normalised URI scheme.
The implementation of named parameters would need to ensure that, if two services were implemented differently, they would still be used in parallel without having conflicting named parameters. Regular Expression query types are currently limited to numerically indexed parameters. If other query types wished to be backwards or cross compatible with Regular Expression query types, they would need to support
similar parameters. For example, they could support the “input_NN” convention used
by Regular Expressions.
Named parameters may restrict the way future scientists replicate queries if they
have semantic connotations. For example, the two simplest named parameters, “namespace”
and “identifier” from http://example.org/${namespace}:${identifier},
may conflict if “identifier” actually meant a namespace. For example, to get data about
“namespace”, one may use http://example.org/${ns}:${namespace}, which semantically
connotes two namespaces, “ns” and “namespace”, resulting in a possible semantic
conflict with other query types.
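As a purely illustrative sketch, the following Java fragment shows how a single parameter map could serve both the numerically indexed “input_NN” convention and named parameters such as ${namespace} and ${identifier}; the class and method names are hypothetical and are not taken from the prototype.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical sketch of template parameter substitution; not the
    // prototype's actual implementation.
    public class TemplateFiller {

        // Replace ${name} placeholders with the supplied parameter values.
        static String fill(String template, Map<String, String> parameters) {
            String result = template;
            for (Map.Entry<String, String> entry : parameters.entrySet()) {
                result = result.replace("${" + entry.getKey() + "}", entry.getValue());
            }
            return result;
        }

        public static void main(String[] args) {
            Map<String, String> parameters = new LinkedHashMap<>();
            // Named parameters, as discussed above.
            parameters.put("namespace", "geneid");
            parameters.put("identifier", "4129");
            // The same values exposed under the numerically indexed
            // "input_NN" convention used by Regular Expression query types.
            parameters.put("input_1", "geneid");
            parameters.put("input_2", "4129");

            // A named parameter template and a numerically indexed template
            // can then be filled from the same parameter map.
            System.out.println(fill("http://example.org/${namespace}:${identifier}", parameters));
            System.out.println(fill("http://example.org/${input_1}:${input_2}", parameters));
        }
    }
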
The current model is designed as a way of accessing data for single step queries, as
opposed to multi-step workflows. However, the complexity of queries can be arbitrarily
defined according to the behaviour of the query on a provider with ordered normalisation
rules filtering the results. In future, this may be extended to include linked queries that
directly reference other queries. The model for this behaviour would need to decide
what information would be required for scientists to be able to create new sub-queries
without having to change the main query to fit their context.
The current model assumes that the parameters are not directly related to a partic-
ular query interface, so that they can be interpreted differently based on the context of
the user. This is similar to workflow management systems and programming languages,
which both require particular query interfaces to be in place to enable different contex-
tual implementations. This research aimed to explore a different method that removed
a direct connection between the user's query and the way data is structured in physical
data providers. Given that this freedom was based on the goal of context-sensitivity,
the model may not be viable for use with linked queries. It may, however, be reasonable
to modify the model to include explicit interfaces that query types implement, as the
query implementation would be separated from the interface and the data providers. In
effect, the model only requires that the separation occur so that each part can be freely
substituted based on the scientist’s context, without materially changing the results if
the scientist can determine an equivalent way to derive the same results.
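A minimal Java sketch of such an explicit interface is given below; the interface, implementation, and query string are assumptions made for illustration rather than the prototype's actual types.

    import java.util.Map;
    import java.util.Set;

    // Hypothetical interface that query types could implement, keeping the
    // query implementation separate from the interface and the data providers.
    interface QueryTypeInterface {
        // Namespaces this query type claims to be relevant for.
        Set<String> relevantNamespaces();

        // Build a provider-neutral query from context-free parameters.
        String buildQuery(Map<String, String> parameters);
    }

    // One substitutable implementation; a different context could swap in
    // another implementation of the same interface without changing callers.
    class LabelLookupQueryType implements QueryTypeInterface {
        @Override
        public Set<String> relevantNamespaces() {
            return Set.of("geneid", "uniprot");
        }

        @Override
        public String buildQuery(Map<String, String> parameters) {
            return "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o . "
                    + "FILTER(?s = <http://bio2rdf.org/" + parameters.get("namespace")
                    + ":" + parameters.get("identifier") + ">) }";
        }
    }

    public class QueryTypeDemo {
        public static void main(String[] args) {
            QueryTypeInterface queryType = new LabelLookupQueryType();
            System.out.println(queryType.buildQuery(
                    Map.of("namespace", "geneid", "identifier", "4129")));
        }
    }
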
The model and prototype can be used to promote the reuse of identifiers and the
standardisation of URIs. Perhaps surprisingly, normalisation rules can be used to in-
tegrate systems that would not otherwise be linked using URIs. This is because, at the
point where the RDF statements from the provider are combined with other results,
novel unnormalised RDF statements relating the normalised URIs to any other
equivalent strings or URIs can also be inserted into the pool of RDF results. This
provides a solution to the issue of URI sustainability, as in the future, any other nor-
malised URI scheme can provide normalisation rules that map the future scheme back
to current URIs if a particular organisation did not maintain or update their data using
currently recognised URIs. Using this system, scientists can publish their data using
URIs that have not yet been approved by a standards body with the knowledge that
they can migrate in future without reducing the usefulness of their queries.
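The sketch below illustrates this idea using plain string handling over N-Triples; the provider URI scheme, the normalised scheme and the use of owl:sameAs as the linking predicate are assumptions made for this example only.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch: a rule that rewrites provider-specific URIs into a
    // normalised scheme and also inserts an unnormalised statement linking the
    // normalised URI back to the original provider URI.
    public class NormalisingPool {

        // Hypothetical provider scheme and normalised scheme.
        static final String PROVIDER_PREFIX = "http://provider.example/record/";
        static final String NORMALISED_PREFIX = "http://bio2rdf.org/example:";
        static final String EQUIVALENCE_PREDICATE = "<http://www.w3.org/2002/07/owl#sameAs>";

        // Apply the rule to one N-Triples line and collect the pooled results.
        static List<String> normalise(String ntriplesLine) {
            List<String> pool = new ArrayList<>();
            String normalised = ntriplesLine.replace(PROVIDER_PREFIX, NORMALISED_PREFIX);
            pool.add(normalised);
            // If the subject was rewritten, also record the link between the
            // normalised URI and the original provider URI in the pool.
            if (!normalised.equals(ntriplesLine) && ntriplesLine.startsWith("<" + PROVIDER_PREFIX)) {
                String originalSubject = ntriplesLine.substring(0, ntriplesLine.indexOf('>') + 1);
                String normalisedSubject = originalSubject.replace(PROVIDER_PREFIX, NORMALISED_PREFIX);
                pool.add(normalisedSubject + " " + EQUIVALENCE_PREDICATE + " " + originalSubject + " .");
            }
            return pool;
        }

        public static void main(String[] args) {
            String line = "<http://provider.example/record/1234> "
                    + "<http://www.w3.org/2000/01/rdf-schema#label> \"example record\" .";
            normalise(line).forEach(System.out::println);
        }
    }
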
The prototype should be extended in future to allow static unnormalised RDF state-
ments to derive information from the results of queries. This could be implemented by
adding an optional SPARQL SELECT template to each query type that could be used to
fill template values dynamically based on the actual results returned from the provider
between the parsing and inclusion in the pool normalisation stages. This method would
provide further value to the model as it allows scientists to derive partially normalised
results without relying on variables being present in their query. This would make it
possible to dynamically insert unnormalised links to URIs that are not transformable
using textual methods. For example, this would allow scientists to dynamically derive
the relationship between the two incompatible DailyMed reference schemes shown in
Figure 1.11 (i.e., “AC387AA0-3F04-4865-A913-DB6ED6F4FDC5” is textually incom-
patible with “6788”), even if the rest of the result statements contained normalised
URIs.
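A minimal sketch of this extension is shown below using Apache Jena, chosen here purely for illustration and not implied to be the prototype's RDF library; the vocabulary, URIs and template syntax are likewise assumptions.

    import java.io.StringReader;

    import org.apache.jena.query.Query;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.QueryFactory;
    import org.apache.jena.query.QuerySolution;
    import org.apache.jena.query.ResultSet;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;

    // Illustrative sketch: run a SELECT template over the parsed results from a
    // provider, then use each binding to fill a static statement template.
    public class SelectTemplateSketch {
        public static void main(String[] args) {
            // Stand-in for results already parsed from a provider.
            String parsedResults = "<http://example.org/dailymed_drugs:1234> "
                    + "<http://example.org/vocab/setId> "
                    + "\"AC387AA0-3F04-4865-A913-DB6ED6F4FDC5\" .";
            Model results = ModelFactory.createDefaultModel();
            results.read(new StringReader(parsedResults), null, "N-TRIPLES");

            // A SELECT template attached to the query type (hypothetical).
            String selectTemplate =
                    "SELECT ?drug ?setId WHERE { ?drug <http://example.org/vocab/setId> ?setId }";
            // A static statement template filled from each binding (hypothetical).
            String statementTemplate = "<${drug}> <http://www.w3.org/2002/07/owl#sameAs> "
                    + "<http://example.org/setid:${setId}> .";

            Query query = QueryFactory.create(selectTemplate);
            try (QueryExecution execution = QueryExecutionFactory.create(query, results)) {
                ResultSet bindings = execution.execSelect();
                while (bindings.hasNext()) {
                    QuerySolution solution = bindings.next();
                    String derived = statementTemplate
                            .replace("${drug}", solution.getResource("drug").getURI())
                            .replace("${setId}", solution.getLiteral("setId").getString());
                    // The derived statement would be added to the pool of results.
                    System.out.println(derived);
                }
            }
        }
    }
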
The system does not need to rely on Linked Data HTTP URIs and RDF. If there is
another well known resolution method, with a suitable arbitrary data model, developed
in the future, the prototype could easily be changed to suit the situation, and the model
may not be affected. The model does not rely on HTTP requests for input, so it does
not need to follow other web conventions regarding the meaning of HTTP requests for
different items. For example, the model could be designed to fit with 303 redirects, but
it would be entirely arbitrary to require this for direct Java queries into the system,
so it does not require it. The metadata which forms the basis for the extended 303
discussion can be transported across a different channel as necessary. It is up to the
scientist using the data to keep the RDF triples relating to the document separate from
the RDF triples relating to the document provenance, if they do not wish to process
them together.
The model allows scientists to change where their data is provided without changing
their URIs, or relying on entire communities accepting the new source as authoritative
for a dataset. For example, some schemes that are not designed to be directly resolvable
to information, such as ISBNs, may be resolvable using various methods now and in
the future without the scientist changing either their workflows, or their preferred URI
structure. This eliminates the major issue that is limiting the success of distributed
RDF based queries without imposing a new order, such as LSIDs.
The prototype focuses on RDF as its authoritative communication syntax because
of its simple extensibility and relatively well supported community. If there was another
syntax that both allowed multiple information sources to be merged to a useful combined
document by machine–without semantic conflict–and enabled universal references, such
as those provided by URIs, it may be interchangeable with RDF as the single format
of choice. The provision for multiple file formats for the RDF syntax may also be
used in future to generate more concise representations of information, as the currently
standardised file formats, RDF/XML and NTriples, are verbose and may prevent the
transfer of large amounts of information as efficiently as a binary RDF format, for
example.
The prototype is designed to be backwards compatible as far as possible with older
style configurations, since the configuration syntax changed as the model and prototype
developed. The configuration file is designed based on the features in the model, so it
makes it possible for other implementations to take the configuration file and process
it using the model theory as the semantic reference. The prototype has the ability to
recognise configuration files designed for other implementations, although it may not be
able to use them if they require functionality that is not implemented in, or is
incompatible with, the prototype.
The prototype was used to evaluate the practical methods for distributing trusted
information about which queries are useful on different data providers, including refer-
ences to the rules that scientists used with the data and the degree of trust in a data
provider. However, the profiles mechanism is currently limited to include and exclude
instructions; these do not directly define particular levels of trust, other than through
the use of the profile by scientists in their work. It could be extended in future to
provide a semantically rich reference to the meaning for a profile inclusion or exclusion.
This would enable scientists to have additional trust criteria when deciding whether a
profile is applicable to them.
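A minimal sketch of the current binary behaviour is given below; the class shape and names are assumptions for illustration, and a semantically richer profile would attach trust metadata to each inclusion or exclusion rather than relying on the bare set membership used here.

    import java.util.Set;

    // Illustrative sketch of a profile reduced to include and exclude
    // instructions; not the prototype's actual profile implementation.
    public class ProfileSketch {
        private final Set<String> included;
        private final Set<String> excluded;
        private final boolean includeByDefault;

        ProfileSketch(Set<String> included, Set<String> excluded, boolean includeByDefault) {
            this.included = included;
            this.excluded = excluded;
            this.includeByDefault = includeByDefault;
        }

        // Binary decision: should this provider or query type be used?
        boolean isIncluded(String itemUri) {
            if (excluded.contains(itemUri)) {
                return false;
            }
            if (included.contains(itemUri)) {
                return true;
            }
            return includeByDefault;
        }

        public static void main(String[] args) {
            ProfileSketch mirrorProfile = new ProfileSketch(
                    Set.of("http://example.org/provider/localMirror"),
                    Set.of("http://example.org/provider/remoteOnly"),
                    true);
            System.out.println(mirrorProfile.isIncluded("http://example.org/provider/localMirror")); // true
            System.out.println(mirrorProfile.isIncluded("http://example.org/provider/remoteOnly"));  // false
            System.out.println(mirrorProfile.isIncluded("http://example.org/provider/other"));       // true by default
        }
    }
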
In this study, a constraint that was used for profile selection by the configuration
maintainers was the physical location of a data provider. This constraint was not
related to the main objectives of data quality, data trust and replicability, although it
was related to the context sensitivity objective. The profiles were not actually annotated
with the ideal physical location for users of the profile. It was deemed not to be vital
to the research goals and there were only three locations in the Bio2RDF case study.
Although the prototype would still need to have a binary decision about whether to
include or exclude an item, the scientist could have different operating criteria for the
prototype based on the distance to the provider. This may be implemented in future for
each instance of the prototype, or it may be implemented using additional parameters to
each query. This may include using the IP address of the request with geolocation
techniques to determine the best location to redirect the user to, or using the public
IP address of the server to geolocate the closest data providers based on the geographic
locations that could be attached to each provider.
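The sketch below illustrates the second option; the mirror names, coordinates and client location are placeholders, and a real deployment would derive the client location from IP geolocation rather than from hard-coded values.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Illustrative sketch: choosing the closest data provider from geographic
    // coordinates attached to each provider. All values are placeholders.
    public class ClosestProviderSketch {

        // Great-circle distance in kilometres (haversine formula).
        static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
            double earthRadiusKm = 6371.0;
            double dLat = Math.toRadians(lat2 - lat1);
            double dLon = Math.toRadians(lon2 - lon1);
            double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                    + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                    * Math.sin(dLon / 2) * Math.sin(dLon / 2);
            return 2 * earthRadiusKm * Math.asin(Math.sqrt(a));
        }

        public static void main(String[] args) {
            // Hypothetical mirrors with attached coordinates (latitude, longitude).
            Map<String, double[]> providers = new LinkedHashMap<>();
            providers.put("mirror-one", new double[] {-27.47, 153.03});
            providers.put("mirror-two", new double[] {46.81, -71.21});
            providers.put("mirror-three", new double[] {52.37, 4.90});

            // Client location, for example derived from the request's IP address.
            double clientLat = -33.87, clientLon = 151.21;

            String closest = null;
            double best = Double.MAX_VALUE;
            for (Map.Entry<String, double[]> provider : providers.entrySet()) {
                double d = distanceKm(clientLat, clientLon,
                        provider.getValue()[0], provider.getValue()[1]);
                if (d < best) {
                    best = d;
                    closest = provider.getKey();
                }
            }
            System.out.printf("Closest provider: %s (%.0f km)%n", closest, best);
        }
    }
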
The model is agnostic to the nature of the normalisation rules, making it simple to
implement rules for more methods than just Regular Expressions and SPARQL CON-
STRUCT queries. In particular, the prototype could be extended in future to support
logic-based rules to either decide whether entire results were valid, or to eliminate in-
consistent statements based on the logic rules. Validation is possible using the model
currently, as the normalisation rule would simply remove all of the invalid result
statements from a given provider so that they would not be included, or it could include
RDF triples in the response indicating a warning for possibly invalid results. However,
it could also be possible to have formal validation rules, without modifying the provider
interface, as the rule interface and its implementation can be extended independently.
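The sketch below shows how the rule interface could be extended independently of its implementations, with a Regular Expression rewriting rule alongside a simple validation rule; the interface and class names are assumptions for illustration only.

    import java.util.regex.Pattern;

    // Illustrative sketch: the rule interface and its implementations can be
    // extended independently, so a validation rule can sit alongside a
    // regular-expression rewriting rule.
    interface NormalisationRule {
        // Transform (or reject) a block of results from a provider.
        String apply(String results);
    }

    // Rewrites provider URIs into the normalised URI scheme.
    class RegexRule implements NormalisationRule {
        private final Pattern pattern;
        private final String replacement;

        RegexRule(String regex, String replacement) {
            this.pattern = Pattern.compile(regex);
            this.replacement = replacement;
        }

        @Override
        public String apply(String results) {
            return pattern.matcher(results).replaceAll(replacement);
        }
    }

    // Drops all results from a provider when a validity check fails; an
    // alternative would be to keep the results and add a warning triple.
    class ValidationRule implements NormalisationRule {
        private final Pattern mustMatch;

        ValidationRule(String requiredRegex) {
            this.mustMatch = Pattern.compile(requiredRegex);
        }

        @Override
        public String apply(String results) {
            return mustMatch.matcher(results).find() ? results : "";
        }
    }

    public class RuleDemo {
        public static void main(String[] args) {
            String results = "<http://provider.example/record/42> "
                    + "<http://www.w3.org/2000/01/rdf-schema#label> \"record 42\" .";
            NormalisationRule rewrite = new RegexRule(
                    "http://provider\\.example/record/", "http://bio2rdf.org/example:");
            NormalisationRule validate = new ValidationRule("rdf-schema#label");
            System.out.println(validate.apply(rewrite.apply(results)));
        }
    }
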

7.4 Scientific publication changes


The scientific method, particularly with regard to peer review, has developed to match
its environment, with funding models favouring scientists that continually generate new
results and publish in the most prestigious journals and conferences [43]. The growth of
the Internet as a publishing medium is gradually changing the status quo that academic
publishers have previously sustained. In some academic disciplines, notably maths and
physics, free and open publication databases such as arXiv5 are gradually displacing
commercially published journals as the most common communication channel.
In the case of arXiv, the operational costs are mostly paid for by the institutions
that use the database the most 6 . In the absence of a commercial profit motive, arXiv
publishes articles using an instant publication system that is very low cost. Given
that many of the funding models are provided by governments, ideally, the results
of the research should be publicly available 7 . However, the initial issues with open
access publication, the perceived absence of a genuine business model for open access
publishers, and the absence of an authentic open peer review system 8 , make it difficult
to resolve underlying issues relating to data access and automated replication of analysis
processes.
The model proposed in this thesis encourages scientists to link data, along with
publication of the queries related to a publication. In doing so, it allows scientists to
freely distribute their methodology, both attached to a publication, and on its own,
if they annotate the process description. Even if the underlying data is sensitive and
private, the publication of the methodology in a computer understandable form allows
for limited verification. Ideally datasets should be published along with articles and
computer understandable process models, to allow simple, complete replication. Both
publishers and specialised data repositories such as Dryad [140]9 or DataCite [33]10
could fill this niche.
The current scientific culture makes it difficult to change the system, even though
technologically there is no reason why a low cost electronic publication model could
not be sustainable. Critical peer review systems on the internet tend to be anonymous,
as are many traditional peer review methods. However, on the internet, peer review
systems typically do not include verification of the peer’s identity. In groups of scientists
within a field, this should not be an issue though, and there is technology available to
determine trust levels in an otherwise anonymous internet system, via PGP key signing
as part of a “Web of Trust” [56]. In the case of this thesis, such a system encourages scientists
to cross-verify publications that they have personally replicated using the published
methodology, although possibly in a different context. However, there is currently
no financial incentive to verify published research, as grants are mostly given for new
research. Lowering the barriers to replication and continuous peer review will reduce
the financial disincentive, although it remains to be seen whether it would be sufficient
to change the current scientific publishing culture.
Current scientific culture is designed to encourage a different perception of career
researchers, compared with professional scientists or engineers. This may be evidenced
in the order that authors are named on publications, even if it was patently obvious
to their peers and readers of the publication that they did not perform the bulk of the
research. It may also be evidenced by the omission of lab assistants from publication
credit completely, in cases where publications are monetarily valuable. In the context
of this research, it may prevent the direct identification of query provenance, as the
raw data contributions in a publication may be attributed to the lead author on the
publication, which would make it difficult to investigate the actual collection methods
if they did not actually collect the data.

5 http://arxiv.org/
6 http://arxiv.org/help/support/arxiv_busplan_July2011
7 http://www.researchresearch.com/index.php?option=com_news&template=rr_2col&view=article&articleId=1102230
8 http://www.guardian.co.uk/science/2011/sep/05/publish-perish-peer-review-science
9 http://datadryad.org/
10 http://www.datacite.org

Researchers require the publication in order to continue receiving grants, while the
income of professional scientists is linked to an employment contract that does not
include publications as a success criterion. In evolutionary terms, this design favours the
recognition of the career researchers who analyse and integrate the results. It offers
no competitive advantage to those who do not succeed or fail based on the results. In
terms of data provenance, this results in the data being attributed to the scientist, which
may be accurate, considering their role in cleaning the raw data. Data provenance may
be more important than query provenance in terms of verification; however, in terms
of replicability, query provenance may be more useful, as it is more clearly useful for
machines. Data provenance is generally informational; however, it could be used by
machines to select and verify versions of data items and authors if a viable model and
implementation are created in future.
Appendix A

Glossary

BGP A Basic Graph Pattern in SPARQL that defines the way the SPARQL query
maps onto the underlying RDF graph.

Context Any environmental factor that affects the way a scientist wishes to perform
their experiment. These may include location, time, network access to data items,
or the replication of experiments with improved methodologies and a different view
on the data quality of particular datasets.

Data item A set of properties and values reflecting some scientific data that are ac-
cessed as a group using a unique identifier from a linked dataset.

Data provider An element of the model proposed in this thesis to represent the nor-
malisation rules and query types on particular namespaces that are relevant to a
given location.

Link The use of an identifier for another data item in the properties attached to a data
item in a dataset.

Linked dataset Any set of data that contains links to other data items using identifiers
from the other datasets.

Namespace An element of the model proposed in this thesis to represent parts of
linked datasets that contain unique identifiers, to make it possible to directly link
to these items using only knowledge of the namespace and the identifier.

Normalisation rule An element of the model proposed in this thesis to represent any
rules that are needed to transform data that is given to endpoints, and returned
from endpoints.

Profiles An element of the model proposed in this thesis to represent the contextual
wishes of a scientist using the system in a given context.

Query type An element of the model proposed in this thesis to represent a query
template and associated information that is used to identify data providers that
may be relevant to the query type.


RDF Resource Description Framework : A generic data model used to represent linked
data including direct links between datasources using URIs. It is used by the pro-
totype to make it possible to directly integrate data from different data providers
into a single set of results for each scientist’s query.

Scientist Any member of a research team who is tasked with querying data and ag-
gregating the results.

SPARQL SPARQL Query Language for RDF : A query language that matches graph
patterns against sets of RDF triples to resolve a query.

URI Uniform Resource Identifier : A string of characters that are used to identify
shared items on the internet. URIs are used as the standard syntax for identifying
unique items in RDF and SPARQL.
Appendix B

Common Ontologies

There are a number of common ontologies that have been used to make RDF documents
on the internet recognisable by different users. RDF documents containing terms from
these ontologies are meaningful to different users through a shared understanding of the
meaning surrounding each term.

RDF Resource Description Framework Syntax : The basic RDF ontology is represented
in RDF, requiring user agents to understand its basic elements without reference
to any other external specification.1

RDFS RDF Schema : Provides a basic set of terms which are required to describe RDF
documents and to represent restrictions on items inside of RDF documents.2

OWL Web Ontology Language : Enables more complex assertions about data than
are possible in RDF Schema. It is generally assumed to be the common language
for new web ontologies, although some endeavour to represent theirs in RDFS
because of its restricted format, which makes reasoning more reliable. The use of
some limited subsets of OWL is guaranteed to provide semantically complete
reasoning.3

DC Dublin Core Elements : The commonly accepted set of ontology terms for describing
metadata about online resources such as documents. It is referred to as the
legacy set of terms, with the dcterms namespace being the current recommendation,
although there are many ontologies which still utilise this form, including
the dcterms set which attempts to reference these terms wherever applicable to
provide backward compatibility.4

DCTERMS Dublin Core Terms : Provides a revised set of document metadata
elements including the elements in the original DC set, with domain and range
specifications to further define the allowed contextual use of the DC terms. The
original DC ontology was intentionally designed to provide a set of terms which
did not have side-effects, and could therefore be used more widely without prejudice,
but as the Semantic Web developed this was found to be unnecessarily vague,
with user agents desiring to validate and integrate ontologies in non-textual forms.5

FOAF Friend Of A Friend : Provides an ontology which attempts to describe relationships
between agents on the internet in social networking circles.6

WOT Web Of Trust : Provides an ontology which formalises a trust relationship based
on distributed PGP keys and signatures. It can be integrated with the FOAF
model by utilising the model of hashed mailbox addresses which is unique to the
distributed FOAF community.7

SKOS Structured Knowledge Organisation System : Aims to be able to represent
hierarchical and non-hierarchical semi-ordered thesauri and similar collections of
literary terms. It is utilised by the dbpedia project to represent the knowledge
which is encoded inside of Wikipedia, and hence has been practically verified to
some extent.8

1 http://www.w3.org/1999/02-22-rdf-syntax-ns
2 http://www.w3.org/2000/01/rdf-schema
3 http://www.w3.org/2002/07/owl
4 http://purl.org/dc/elements/1.1/
5 http://purl.org/dc/terms/
6 http://xmlns.com/foaf/0.1/
7 http://xmlns.com/wot/0.1/
8 http://www.w3.org/2004/02/skos/core
Bibliography

[1] Gergely Adamku and Heiner Stuckenschmidt. Implementation and evaluation


of a distributed rdf storage and retrieval system. In Proceedings of the 2005
IEEE/WIC/ACM International Conference on Web Intelligence (WI’05), pages
393–396, Los Alamitos, CA, USA, 2005. IEEE Computer Society. ISBN 0-7695-
2415-X. doi: 10.1109/WI.2005.73.

[2] Hend S. Al-Khalifa and Hugh C. Davis. Measuring the semantic value of folk-
sonomies. In Innovations in Information Technology, 2006, pages 1–5, November
2006. doi: 10.1109/INNOVATIONS.2006.301880.

[3] K Alexander, R Cyganiak, M Hausenblas, and J Zhao. Describing linked datasets.


Proceedings of the Workshop on Linked Data on the Web (LDOW2009) Madrid,
Spain, 2009. URL http://hdl.handle.net/10379/543.

[4] R. Alonso-Calvo, V. Maojo, H. Billhardt, F. Martin-Sanchez, M. García-Remesal,


and D. Pérez-Rey. An agent- and ontology-based system for integrating public
gene, protein, and disease databases. J. of Biomedical Informatics, 40(1):17–29,
2007. ISSN 1532-0464. doi: 10.1016/j.jbi.2006.02.014.

[5] Joanna Amberger, Carol A. Bocchini, Alan F. Scott, and Ada Hamosh. Mckusick’s
online mendelian inheritance in man (omim). Nucleic Acids Research, 37(suppl
1):D793–D796, 2009. doi: 10.1093/nar/gkn665.

[6] Sophia Ananiadou and John McNaught, editors. Text mining for biology and
biomedicine. Artech House, Boston, 2006.

[7] Sophia Ananiadou and John McNaught. Text mining for biology and biomedicine.
Computational Linguistics, 33(1):135–140, 2007. doi: 10.1162/coli.2007.33.1.135.

[8] Peter Ansell. Bio2rdf: Providing named entity based search with a common
biological database naming scheme. In Proceedings of BioSearch08: HCSNet Next-
Generation Search Workshop on Search in Biomedical Information, November
2008.

[9] Peter Ansell. Collaborative development of cross-database bio2rdf queries. In


eResearch Australasia 2009, Novotel Sydney Manly Pacific, November 2009.


[10] Peter Ansell. Model and prototype for querying multiple linked scientific datasets.
Future Generation Computer Systems, 27(3):329–333, March 2011. ISSN 0167-
739X. doi: 10.1016/j.future.2010.08.016.

[11] Peter Ansell, Lawrence Buckingham, Xin-Yi Chua, James Hogan, Scott Mann,
and Paul Roe. Finding friends outside the species: Making sense of large scale
blast results with silvermap. In Proceedings of eResearch Australasia 2009, 2009.

[12] Peter Ansell, Lawrence Buckingham, Xin-Yi Chua, James Michael Hogan, Scott
Mann, and Paul Roe. Enhancing blast comprehension with silvermap. In 2009
Microsoft eScience Workshop, October 2009.

[13] Peter Ansell, James Hogan, and Paul Roe. Customisable query resolution in
biology and medicine. In Proceedings of the Fourth Australasian Workshop on
Health Informatics and Knowledge Management - Volume 108, HIKM ’10, pages
69–76, Darlinghurst, Australia, Australia, 2010. Australian Computer Society,
Inc. ISBN 978-1-920682-89-7. URL http://portal.acm.org/citation.cfm?id=
1862303.1862315.

[14] Mikel Egana Aranguren, Sean Bechhofer, Phillip Lord, Ulrike Sattler, and
Robert D. Stevens. Understanding and using the meaning of statements in a
bio-ontology: recasting the gene ontology in owl. BMC Bioinformatics, 8:57,
2007. doi: 10.1186/1471-2105-8-57.

[15] Yigal Arens, Chin Y. Chee, Chun-Nan Hsu, and Craig A. Knoblock. Retrieving
and integrating data from multiple information sources. International Journal of
Intelligent and Cooperative Information Systems, 2:127–158, 1993. URL http:
//www.isi.edu/info-agents/papers/arens93-ijicis.pdf.

[16] Michael Ashburner, Catherine A. Ball, Judith A. Blake, David Botstein, Heather
Butler, J. Michael Cherry, Allan P. Davis, Kara Dolinski, Selina S. Dwight,
Janan T. Eppig, Midori A. Harris, David P. Hill, Laurie Issel-Tarver, Andrew
Kasarskis, Suzanna Lewis, John C. Matese, Joel E. Richardson, Martin Ringwald,
Gerald M. Rubin, and Gavin Sherlock. Gene ontology: tool for the unification
of biology. the gene ontology consortium. Nature Genet., 25:25–29, 2000. doi:
10.1038/75556.

[17] Vadim Astakhov, Amarnath Gupta, Simone Santini, and Jeffrey S. Grethe. Data
integration in the biomedical informatics research network (birn). Data Integration
in the Life Sciences, pages 317–320, 2005. doi: 10.1007/11530084_31.

[18] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. Dbpedia:


A nucleus for a web of open data. Lecture Notes in Computer Science, 4825:722,
2007.

[19] Amos Bairoch, Rolf Apweiler, Cathy H. Wu, Winona C. Barker, Brigitte Boeck-
mann, Serenella Ferro, Elisabeth Gasteiger, Hongzhan Huang, Rodrigo Lopez,
Michele Magrane, Maria J. Martin, Darren A. Natale, Claire O’Donovan, Nicole


Redaschi, and Lai-Su L. Yeh. The universal protein resource (uniprot). Nucleic
Acids Research, 33(suppl_1):D154–159, 2005. doi: 10.1093/nar/gki070.

[20] JB Bard and SY Rhee. Ontologies in biology: design, applications and future
challenges. Nat Rev Genet, 5(3):213–222, 2004. doi: 10.1038/nrg1295.

[21] Beth A. Bechky. Sharing meaning across occupational communities: The trans-
formation of understanding on a production floor. Organization Science, 14(3):
312–330, 2003. ISSN 10477039. URL http://www.jstor.org/stable/4135139.

[22] S. Beco, B. Cantalupo, L. Giammarino, N. Matskanis, and M. Surridge. Owl-


ws: a workflow ontology for dynamic grid service composition. In e-Science and
Grid Computing, 2005. First International Conference on, page 8 pp., 2005. doi:
10.1109/E-SCIENCE.2005.64.

[23] François Belleau, Peter Ansell, Marc-Alexandre Nolin, Kingsley Idehen, and
Michel Dumontier. Bio2rdf’s sparql point and search service for life science linked
datasets. Poster at the 11th Annual Bio-Ontologies Meeting, 2008.

[24] François Belleau, Marc-Alexandre Nolin, Nicole Tourigny, Philippe Rigault, and
Jean Morissette. Bio2rdf: Towards a mashup to build bioinformatics knowledge
systems. Journal of Biomedical Informatics, 41(5):706–716, 2008. doi: 10.1016/j.
jbi.2008.03.004.

[25] S Bergamaschi, S Castano, and M Vincini. Semantic integration of semistructured


and structured data sources. SIGMOD Record, 28:54–59, 1999. doi: 10.1145/
309844.309897.

[26] Helen M. Berman, John Westbrook, Zukang Feng, Gary Gilliland, T. N. Bhat,
Helge Weissig, Ilya N. Shindyalov, and Philip E. Bourne. The protein data bank.
Nucleic Acids Research, 28(1):235–242, 2000. doi: 10.1093/nar/28.1.235.

[27] Tim Berners-Lee, James Hendler, and Ora Lassila. The semantic web. Sci-
entific American Magazine, May 2001. URL http://www.sciam.com/article.
cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21.

[28] A Birkland and G Yona. Biozon: a system for unification, management and
analysis of heterogeneous biological data. BMC Bioinformatics, 7:70, 2006. doi:
10.1186/1471-2105-7-70.

[29] C. Bizer and A. Seaborne. D2rq-treating non-rdf databases as virtual rdf graphs.
In Proceedings of the 3rd International Semantic Web Conference (ISWC2004),
2004.

[30] U. Bojars, J.G. Breslin, A. Finn, and S. Decker. Using the semantic web for
linking and reusing data across web 2.0 communities. Web Semantics: Science,
Services and Agents on the World Wide Web, 6(1):21–28, February 2008. doi:
10.1016/j.websem.2007.11.010.

[31] Evan E. Bolton, Yanli Wang, Paul A. Thiessen, and Stephen H. Bryant. Chapter
12 pubchem: Integrated platform of small molecules and biological activities. In
Ralph A. Wheeler and David C. Spellmeyer, editors, Annual Reports in Computa-
tional Chemistry, volume 4 of Annual Reports in Computational Chemistry, pages
217–241. Elsevier, 2008. doi: 10.1016/S1574-1400(08)00012-1.

[32] Shawn Bowers and Bertram Ludäscher. An ontology-driven framework for data
transformation in scientific workflows. In Data Integration in the Life Sciences,
pages 1–16. Springer, 2004. doi: 10.1007/b96666.

[33] Jan Brase, Adam Farquhar, Angela Gastl, Herbert Gruttemeier, Maria Heijne,
Alfred Heller, Arlette Piguet, Jeroen Rombouts, Mogens Sandfaer, and Irina Sens.
Approach for a joint global registration agency for research data. Information
Services and Use, 29(1):13–27, 01 2009. doi: 10.3233/ISU-2009-0595.

[34] Peter Buneman, Adriane Chapman, and James Cheney. Provenance management
in curated databases. In SIGMOD ’06: Proceedings of the 2006 ACM SIGMOD
international conference on Management of data, pages 539–550, New York, NY,
USA, 2006. ACM Press. ISBN 1-59593-434-0. doi: 10.1145/1142473.1142534.

[35] Eithon Cadag, Peter Tarczy-Hornoch, and Peter Myler. On the reachability of
trustworthy information from integrated exploratory biological queries. Data Inte-
gration in the Life Sciences, pages 55–70, 2009. doi: 10.1007/978-3-642-02879-3_
6.

[36] Diego Calvanese, Giuseppe Giacomo, Domenico Lembo, Maurizio Lenzerini, Ric-
cardo Rosati, and Marco Ruzzi. Using owl in data integration. Semantic Web
Information Management, pages 397–424, 2010. doi: 10.1007/978-3-642-04329-1_
17.

[37] Michael N Cantor and Yves A Lussier. Putting data integration into practice:
using biomedical terminologies to add structure to existing data sources. AMIA
Annu Symp Proc, pages 125–129, 2003.

[38] Monica Chagoyen, Pedro Carmona-Saez, Hagit Shatkay, Jose M Carazo, and Al-
berto Pascual-Montano. Discovering semantic features in the literature: a foun-
dation for building functional associations. BMC Bioinformatics, 7:41, 2006. doi:
10.1186/1471-2105-7-41.

[39] B Chen, X Dong, D Jiao, H Wang, Q Zhu, Y Ding, and D Wild. Chem2bio2rdf:
a semantic framework for linking and data mining chemogenomic and systems
chemical biology data. BMC Bioinformatics, 11:255, 2010. doi: 10.1186/
1471-2105-11-255.

[40] Kei-Hoi Cheung, H Robert Frost, M Scott Marshall, Eric Prud’hommeaux,


Matthias Samwald, Jun Zhao, and Adrian Paschke. A journey to semantic web
query federation in the life sciences. BMC Bioinformatics, 10(Suppl 10):S10, 2009.
ISSN 1471-2105. doi: 10.1186/1471-2105-10-S10-S10.

[41] Tim Clark, Sean Martin, and Ted Liefeld. Globally distributed object identifica-
tion for biological knowledgebases. Brief Bioinformatics, 5(1):59–70, 2004. doi:
10.1093/bib/5.1.59.

[42] Paulo C.G. Costa and Kathryn B. Laskey. Pr-owl: A framework for bayesian
ontologies. In Proceedings of the International Conference on Formal Ontology in
Information Systems (FOIS 2006). November 9-11, 2006, Baltimore, MD, USA,
2006. IOS Press. URL http://hdl.handle.net/1920/1734.

[43] Susan Crawford and Loretta Stucki. Peer review and the changing research record.
Journal of the American Society for Information Science, 41(3):223–228, 1990.
ISSN 1097-4571. doi: 10.1002/(SICI)1097-4571(199004)41:3<223::AID-ASI14>
3.0.CO;2-3.

[44] Andreas Doms and Michael Schroeder. GoPubMed: exploring PubMed with the
Gene Ontology. Nucl. Acids Res., 33(Supplement 2):W783–786, 2005. doi: 10.
1093/nar/gki470.

[45] Orri Erling and Ivan Mikhailov. Rdf support in the virtuoso dbms. In Proceedings
of the 1st Conference on Social Semantic Web (CSSW), pages 7–24. Springer,
2007. doi: 10.1007/978-3-642-02184-8_2.

[46] T. Etzold, H. Harris, and S. Beulah. Srs: An integration platform for databanks
and analysis tools in bioinformatics. In Bioinformatics: Managing Scientific Data,
chapter 5, pages 35–74. Elsevier, 2003.

[47] Y Fang, H Huang, H Chen, and H Juan. Tcmgenedit: a database for as-
sociated traditional chinese medicine, gene and disease information using text
mining. BMC Complementary and Alternative Medicine, 8:58, 2008. doi:
10.1186/1472-6882-8-58.

[48] Robert D. Finn, Jaina Mistry, John Tate, Penny Coggill, Andreas Heger,
Joanne E. Pollington, O. Luke Gavin, Prasad Gunasekaran, Goran Ceric, Kristof-
fer Forslund, Liisa Holm, Erik L. L. Sonnhammer, Sean R. Eddy, and Alex Bate-
man. The pfam protein families database. Nucleic Acids Research, 38(suppl 1):
D211–D222, 2010. doi: 10.1093/nar/gkp985.

[49] R. Gaizauskas, N. Davis, G. Demetriou, Y. Guo, and I. Roberts. Integrating text


mining into distributed bioinformatics workflows: a web services implementation.
In Services Computing, 2004. (SCC 2004). Proceedings. 2004 IEEE International
Conference on, pages 145–152, 15-18 Sept. 2004. doi: 10.1109/SCC.2004.1358001.

[50] Michael Y. Galperin. The molecular biology database collection: 2008 update.
Nucl. Acids Res., 36(suppl-1):D2–4, 2008. doi: 10.1093/nar/gkm1037.

[51] Robert Gentleman. Reproducible research: A bioinformatics case study. Statis-


tical Applications in Genetics and Molecular Biology, 4(1), 2005. doi: 10.2202/
1544-6115.1034.

[52] Yolanda Gil and Donovan Artz. Towards content trust of web resources. Web
Semantics: Science, Services and Agents on the World Wide Web, 5(4):227–239,
December 2007. doi: 10.1016/j.websem.2007.09.005.

[53] C. A. Goble, R. D. Stevens, G. Ng, S. Bechhofer, N. W. Paton, P. G. Baker,


M. Peim, and A. Brass. Transparent access to multiple bioinformatics information
sources. IBM Syst. J., 40(2):532–551, 2001. ISSN 0018-8670. doi: 10.1147/sj.402.
0532.

[54] Peter Godfrey-Smith. Theory and Reality: An Introduction to the Philosophy of


Science. University Of Chicago Press, 2003.

[55] Kwang-Il Goh, Michael E. Cusick, David Valle, Barton Childs, Marc Vidal, and
Albert-László Barabási. The human disease network. Proceedings of the Na-
tional Academy of Sciences, 104(21):8685–8690, May 2007. doi: 10.1073/pnas.
0701361104.

[56] Jennifer Golbeck. Weaving a web of trust. Science, 321(5896):1640–1641, 2008.


doi: 10.1126/science.1163357.

[57] Mike Graves, Adam Constabaris, and Dan Brickley. Foaf: Connecting people on
the semantic web. Cataloging & Classification Quarterly, 43(3):191–202, 2007.
ISSN 0163-9374. doi: 10.1300/J104v43n03_10.

[58] Tom Gruber. Collective knowledge systems: Where the social web meets the
semantic web. Web Semantics: Science, Services and Agents on the World Wide
Web, 6(1):4–13, February 2008. doi: 10.1016/j.websem.2007.11.011.

[59] Miguel Esteban Gutiérrez, Isao Kojima, Said Mirza Pahlevi, Oscar Corcho, and
Asunción Gómez-Pérez. Accessing rdf(s) data resources in service-based grid in-
frastructures. Concurrency and Computation: Practice and Experience, 21(8):
1029–1051, 2009. doi: 10.1002/cpe.1409.

[60] LM Haas, PM Schwarz, and P Kodali. Discoverylink: a system for integrated


access to life sciences data sources. IBM Syst. J, 40:489–511, 2001. URL http:
//portal.acm.org/citation.cfm?id=1017236.

[61] Carole D. Hafner and Natalya Fridman Noy. Ontological foundations for biology
knowledge models. In Proceedings of the Fourth International Conference on In-
telligent Systems for Molecular Biology, pages 78–87. AAAI Press, 1996. ISBN
1-57735-002-2.

[62] MA Harris, J Clark, A Ireland, J Lomax, M Ashburner, R Foulger, K Eilbeck,


S Lewis, B Marshall, and C Mungall. The gene ontology (go) database and
informatics resource. Nucleic Acids Res, 32(1):D258–261, 2004. doi: 10.1093/


nar/gkh036.

[63] A. Harth, J. Umbrich, A. Hogan, and S. Decker. Yars2: A federated repository


for querying graph structured data from the web. Lecture Notes in Computer
Science, 4825:211, 2007. doi: 10.1007/978-3-540-76298-0_16.

[64] Olaf Hartig and Jun Zhao. Using web data provenance for quality assessment.
In In: Proc. of the Workshop on Semantic Web and Provenance Management at
ISWC, 2009.

[65] O Hassanzadeh, A Kementsietsidis, L Lim, RJ Miller, and M Wang. Linkedct:


A linked data space for clinical trials. arXiv:0908.0567v1, August 2009. URL
http://arxiv.org/abs/0908.0567.

[66] Tom Heath and Enrico Motta. Ease of interaction plus ease of integration: Com-
bining web 2.0 and the semantic web in a reviewing site. Web Semantics: Science,
Services and Agents on the World Wide Web, 6(1):76–83, February 2008. doi:
10.1016/j.websem.2007.11.009.

[67] John J. Helly, T. Todd Elvins, Don Sutton, David Martinez, Scott E. Miller,
Steward Pickett, and Aaron M. Ellison. Controlled publication of digital scientific
data. Commun. ACM, 45(5):97–101, 2002. ISSN 0001-0782. doi: 10.1145/506218.
506222.

[68] Martin Hepp. Goodrelations: An ontology for describing products and services of-
fers on the web. In Proceedings of the 16th International Conference on Knowledge
Engineering and Knowledge Management (EKAW2008), September 29 - October
3, 2008, Acitrezza, Italy, volume 5268, pages 332–347. Springer LNCS, 2008. doi:
10.1007/978-3-540-87696-0_29.

[69] Katherine G. Herbert and Jason T.L. Wang. Biological data cleaning: a case
study. International Journal of Information Quality, 1(1):60 – 82, 2007. doi:
10.1504/IJIQ.2007.013376.

[70] Mauricio A. Hernández and Salvatore J. Stolfo. Real-world data is dirty: Data
cleansing and the merge/purge problem. Data Mining and Knowledge Discovery,
2(1):9–37, January 1998. doi: 10.1023/A:1009761603038.

[71] Tony Hey, Stewart Tansley, and Kristin Tolle, editors. The Fourth Paradigm:
Data-Intensive Scientific Discovery. Microsoft Research, Redmond, Wash-
ington, 2009. URL http://research.microsoft.com/en-us/collaboration/
fourthparadigm/.

[72] Shahriyar Hossain and Hasan Jamil. A visual interface for on-the-fly biological
database integration and workflow design using vizbuilder. Data Integration in
the Life Sciences, pages 157–172, 2009. doi: 10.1007/978-3-642-02879-3_13.

[73] Gareth Hughes, Hugo Mills, David De Roure, Jeremy G. Frey, Luc Moreau, M. C
Schraefel, Graham Smith, and Ed Zaluska. The semantic smart laboratory: a
system for supporting the chemical escientist. Org. Biomol. Chem., 2:3284 –
3293, 2004. doi: 10.1039/B410075A.

[74] Zachary Ives. Data integration and exchange for scientific collaboration. Data In-
tegration in the Life Sciences, pages 1–4, 2009. doi: 10.1007/978-3-642-02879-3_
1.

[75] Hai Jin, Aobing Sun, Ran Zheng, Ruhan He, and Qin Zhang. Ontology-
based semantic integration scheme for medical image grid. Cluster Comput-
ing and the Grid, IEEE International Symposium on, 0:127–134, 2007. doi:
10.1109/CCGRID.2007.77.

[76] Minoru Kanehisa, Susumu Goto, Miho Furumichi, Mao Tanabe, and Mika Hi-
rakawa. Kegg for representation and analysis of molecular networks involving
diseases and drugs. Nucleic Acids Research, 38(suppl 1):D355–D360, 2010. doi:
10.1093/nar/gkp896.

[77] P. D. Karp. An ontology for biological function based on molecular interactions.


Bioinformatics, 16(3):269–285, Mar 2000. doi: 10.1093/bioinformatics/16.3.269.

[78] Marc Kenzelmann, Karsten Rippe, and John S Mattick. Rna: Networks & imag-
ing. Molecular Systems Biology, 2(44), 2006. doi: 10.1038/msb4100086.

[79] Jacob Köhler, Stephan Philippi, and Matthias Lange. Semeda: ontology based
semantic integration of biological databases. Bioinformatics, 19(18):2420–2427,
Dec 2003. doi: 10.1093/bioinformatics/btg340.

[80] Michael Kuhn, Monica Campillos, Ivica Letunic, Lars Juhl Jensen, and Peer Bork.
A side effect resource to capture phenotypic effects of drugs. Mol Syst Biol, 6,
January 2010. doi: 10.1038/msb.2009.98.

[81] Thomas Samuel Kuhn. The Structure of Scientific Revolutions. University Of


Chicago Press, 3rd edition, 1996.

[82] Camille Laibe and Nicolas Le Novere. Miriam resources: tools to generate and
resolve robust cross-references in systems biology. BMC Systems Biology, 1(1):58,
2007. ISSN 1752-0509. doi: 10.1186/1752-0509-1-58.

[83] P. Lambrix and V. Jakoniene. Towards transparent access to multiple biological


databanks. In Proceedings of the First Asia-Pacific bioinformatics conference
on Bioinformatics 2003-Volume 19, pages 53–60. Australian Computer Society,
Inc. Darlinghurst, Australia, Australia, 2003. URL http://portal.acm.org/
citation.cfm?id=820189.820196.

[84] Ning Lan, Gaetano T Montelione, and Mark Gerstein. Ontologies for proteomics:
towards a systematic definition of structure and function that scales to the genome
level. Current Opinion in Chemical Biology, 7(1):44–54, February 2003. doi:


10.1016/S1367-5931(02)00020-0.

[85] Andreas Langegger, Martin Blochl, and Wolfram Woss. Sharing data on the grid
using ontologies and distributed sparql queries. In 18th International Conference
on Database and Expert Systems Applications, 2007. DEXA ’07, pages 450–454,
2007. doi: 10.1109/DEXA.2007.4312934.

[86] A. Lash, W.-J. Lee, and L. Raschid. A methodology to enhance the semantics of
links between pubmed publications and markers in the human genome. In Fifth
IEEE Symposium on Bioinformatics and Bioengineering (BIBE 2005), pages 185–
192, Minneapolis, Minnesota, USA, October 2005. doi: 10.1109/BIBE.2005.3.

[87] D. Le-Phuoc, A. Polleres, M. Hauswirth, G. Tummarello, and C. Morbidoni.


Rapid prototyping of semantic mash-ups through semantic web pipes. In Pro-
ceedings of the 18th international conference on World wide web, pages 581–590.
ACM New York, NY, USA, 2009. URL http://www2009.org/proceedings/pdf/
p581.pdf.

[88] Timothy Lebo, John S. Erickson, Li Ding, Alvaro Graves, Gregory Todd Williams,
Dominic DiFranzo, Xian Li, James Michaelis, Jin Guang Zheng, Johanna Flores,
Zhenning Shangguan, Deborah L. McGuinness, and Jim Hendler. Producing and
using linked open government data in the twc logd portal (to appear). In David
Wood, editor, Linking Government Data. Springer, New York, NY, 2011.

[89] Peter Li, Yuhui Chen, and Alexander Romanovsky. Measuring the dependability
of web services for use in e-science experiments. In Dave Penkler, Manfred Re-
itenspiess, and Francis Tam, editors, Service Availability, volume 4328 of Lecture
Notes in Computer Science, pages 193–205. Springer Berlin / Heidelberg, 2006.
doi: 10.1007/11955498_14.

[90] Sarah Cohen-Boulakia and Frédérique Lisacek. Proteome informatics ii: Bioinfor-


matics for comparative proteomics. Proteomics, 6(20):5445–5466, 2006. doi:
10.1002/pmic.200600275.

[91] Phillip Lord, Sean Bechhofer, Mark D. Wilkinson, Gary Schiltz, Damian Gessler,
Duncan Hull, Carole Goble, and Lincoln Stein. Applying semantic web services
to bioinformatics: Experiences gained, lessons learnt. In Sheila A. McIlraith,
Dimitris Plexousakis, and Frank van Harmelen, editors, The Semantic Web –
ISWC 2004, volume 3298 of Lecture Notes in Computer Science, pages 350–364.
Springer Berlin / Heidelberg, 2004. doi: 10.1007/978-3-540-30475-3_25.

[92] Stefan Maetschke, Michael W. Towsey, and James M. Hogan. Biopatml : pattern
sharing for the genomic sciences. In 2008 Microsoft eScience Workshop, 7-9 De-
cember 2008., University Place Conference Center & Hotel, Indianapolis, 2008.
URL http://eprints.qut.edu.au/27327/.

[93] D. Maglott, J. Ostell, K. D. Pruitt, and T. Tatusova. Entrez gene: gene-centered


information at ncbi. Nucleic Acids Res., 35:D26–D31, 2007. doi: 10.1093/nar/
gkl993.

[94] Sylvain Mathieu, Isabelle Boutron, David Moher, Douglas G. Altman, and
Philippe Ravaud. Comparison of registered and published primary outcomes in
randomized controlled trials. JAMA, 302(9):977–984, 2009. doi: 10.1001/jama.
2009.1242.

[95] S. Miles, P. Groth, S. Munroe, S. Jiang, T. Assandri, and L. Moreau. Extracting


causal graphs from an open provenance data model. Concurrency and Compu-
tation: Practice and Experience, 20(5):577–586, 2007. doi: 10.1002/cpe.1236.

[96] J. Mingers. Real-izing information systems: Critical realism as an underpinning


philosophy for information systems. Information and Organization, 14(2):87–103,
April 2004. doi: 10.1016/j.infoandorg.2003.06.001.

[97] Olivo Miotto, Tin Tan, and Vladimir Brusic. Rule-based knowledge aggregation
for large-scale protein sequence analysis of influenza a viruses. BMC Bioinformat-
ics, 9(Suppl 1):S7, 2008. ISSN 1471-2105. doi: 10.1186/1471-2105-9-S1-S7.

[98] Barend Mons. Which gene did you mean? BMC Bioinformatics, 6:142, 2005. doi:
10.1186/1471-2105-6-142.

[99] Michael Mrissa, Chirine Ghedira, Djamal Benslimane, Zakaria Maamar, Florian
Rosenberg, and Schahram Dustdar. A context-based mediation approach to com-
pose semantic web services. ACM Trans. Internet Technol., 8(1):4, 2007. ISSN
1533-5399. doi: 10.1145/1294148.1294152.

[100] C. J. Mungall, D. B. Emmert, and Consortium FlyBase. A chado case study:


an ontology-based modular schema for representing genome-associated biological
information. Bioinformatics, 23:i337–i346, 2007. doi: 10.1093/bioinformatics/
btm189.

[101] P.H. Mylonas, D. Vallet, P. Castells, M. Fernández, and Y. Avrithis. Per-


sonalized information retrieval based on context and ontological knowledge.
The Knowledge Engineering Review, 23(Special Issue 01):73–100, 2008. doi:
10.1017/S0269888907001282.

[102] Rex Nelson, Shulamit Avraham, Randy Shoemaker, Gregory May, Doreen Ware,
and Damian Gessler. Applications and methods utilizing the simple semantic
web architecture and protocol (sswap) for bioinformatics resource discovery and
disparate data and service integration. BioData Mining, 3(1):3, 2010. ISSN 1756-
0381. doi: 10.1186/1756-0381-3-3.

[103] Natalya F. Noy. Semantic integration: a survey of ontology-based approaches.


SIGMOD Rec., 33(4):65–70, 2004. ISSN 0163-5808. doi: 10.1145/1041410.
1041421.

[104] P. V. Ogren, K. B. Cohen, G. K. Acquaah-Mensah, J. Eberlein, and L. Hunter.


The compositional structure of gene ontology terms. Pac Symp Biocomput, 9:214–
225, 2004. URL http://psb.stanford.edu/psb-online/proceedings/psb04/
ogren.pdf.

[105] Tom Oinn, Mark Greenwood, Matthew Addis, M. Nedim Alpdemir, Justin Ferris,
Kevin Glover, Carole Goble, Antoon Goderis, Duncan Hull, Darren Marvin, Peter
Li, Phillip Lord, Matthew R. Pocock, Martin Senger, Robert D. Stevens, Anil
Wipat, and Chris Wroe. Taverna: lessons in creating a workflow environment for
the life sciences. Concurrency and Computation: Practice and Experience, 18(10):
1067–1100, 2006. ISSN 1532-0626. doi: 10.1002/cpe.993.

[106] A. Paschke. Rule responder hcls escience infrastructure. In Proceedings of the


3rd International Conference on the Pragmatic Web: Innovating the Interactive
Society, pages 59–67. ACM New York, NY, USA, 2008. doi: 10.1145/1479190.
1479199.

[107] C. Pasquier. Biological data integration using semantic web technologies.


Biochimie, 90(4):584–594, 2008. doi: 10.1016/j.biochi.2008.02.007.

[108] Steve Pettifer, David Thorne, Philip McDermott, James Marsh, Alice Villeger,
Douglas Kell, and Teresa Attwood. Visualising biological data: a semantic ap-
proach to tool and database integration. BMC Bioinformatics, 10(Suppl 6):S19,
2009. ISSN 1471-2105. doi: 10.1186/1471-2105-10-S6-S19.

[109] William Pike and Mark Gahegan. Beyond ontologies: Toward situated representa-
tions of scientific knowledge. International Journal of Human-Computer Studies,
65(7):674–688, July 2007. doi: 10.1016/j.ijhcs.2007.03.002.

[110] Andreas Prlić, Ewan Birney, Tony Cox, Thomas Down, Rob Finn, Stefan Gräf,
David Jackson, Andreas Kähäri, Eugene Kulesha, Roger Pettett, James Smith,
Jim Stalker, and Tim Hubbard. The distributed annotation system for integration
of biological data. In Data Integration in the Life Sciences, pages 195–203, 2006.
doi: 10.1007/11799511_17.

[111] B. Quilitz and U. Leser. Querying distributed rdf data sources with sparql. Lecture
Notes in Computer Science, 5021:524, 2008. doi: 10.1007/978-3-540-68234-9_39.

[112] Erhard Rahm and Hong Hai Do. Data cleaning: Problems and current approaches.
IEEE Data Engineering Bulletin, 23:2000, 2000. URL http://sites.computer.
org/debull/A00DEC-CD.pdf.

[113] Lila Rao and Kweku-Muata Osei-Bryson. Towards defining dimensions of knowl-
edge systems quality. Expert Systems with Applications, 33(2):368–378, 2007.
ISSN 0957-4174. doi: 10.1016/j.eswa.2006.05.003.

[114] N. Redaschi. Uniprot in rdf: Tackling data integration and distributed annotation
with the semantic web. Nature Precedings, 2009. doi: 10.1038/npre.2009.3193.1.

[115] Michael Rosemann, Jan C. Recker, Christian Flender, and Peter D. Ansell. Un-
derstanding context-awareness in business process design. In 17th Australasian
Conference on Information Systems, Adelaide, Australia, December 2006. URL
http://eprints.qut.edu.au/6160/.

[116] Joseph S. Ross, Gregory K. Mulvey, Elizabeth M. Hines, Steven E. Nissen, and
Harlan M. Krumholz. Trial publication after registration in clinicaltrials.gov: A
cross-sectional analysis. PLoS Med, 6(9):e1000144, 09 2009. doi: 10.1371/journal.
pmed.1000144.

[117] A. Ruttenberg, J.A. Rees, M. Samwald, and M.S. Marshall. Life sciences on the
semantic web: the neurocommons and beyond. Briefings in Bioinformatics, 10
(2):193, 2009. doi: 10.1093/bib/bbp004.

[118] Satya S Sahoo, Olivier Bodenreider, Joni L Rutter, Karen J Skinner, and Amit P
Sheth. An ontology-driven semantic mashup of gene and biological pathway infor-
mation: Application to the domain of nicotine dependence. Journal of Biomedical
Informatics, Feb 2008. doi: 10.1016/j.jbi.2008.02.006.

[119] Satya S. Sahoo, Olivier Bodenreider, Pascal Hitzler, and Amit Sheth. Provenance
context entity (pace): Scalable provenance tracking for scientific rdf datasets.
In Proceedings of the 22nd International Conference on Scientific and Statistical
Database Management (SSDBM 2010), 2010. URL http://knoesis.wright.
edu/library/download/ProvenanceTracking_PaCE.pdf.

[120] Joel Saltz, Scott Oster, Shannon Hastings, Stephen Langella, Tahsin Kurc,
William Sanchez, Manav Kher, Arumani Manisundaram, Krishnakant Shanbhag,
and Peter Covitz. cagrid: design and implementation of the core architecture of
the cancer biomedical informatics grid. Bioinformatics, 22(15):1910–1916, 2006.
doi: 10.1093/bioinformatics/btl272.

[121] Matthias Samwald, Anja Jentzsch, Christopher Bouton, Claus Kallesoe, Egon
Willighagen, Janos Hajagos, M Marshall, Eric Prud’hommeaux, Oktie Hassan-
zadeh, Elgar Pichler, and Susie Stephens. Linked open drug data for pharmaceu-
tical research and development. Journal of Cheminformatics, 3(1):19, 2011. ISSN
1758-2946. doi: 10.1186/1758-2946-3-19.

[122] Marco Schorlemmer and Yannis Kalfoglou. Institutionalising ontology-based se-


mantic integration. Applied Ontology, 3(3):131–150, 2008. ISSN 1570-5838. URL
http://eprints.ecs.soton.ac.uk/18563/.

[123] Ronald Schroeter, Jane Hunter, and Andrew Newman. The Semantic Web: Re-
search and Applications, volume 4519, chapter Annotating Relationships Between
Multiple Mixed-Media Digital Objects by Extending Annotea, pages 533–548.
Springer Berlin / Heidelberg, 2007. doi: 10.1007/978-3-540-72667-8_38.

[124] S. Schulze-Kremer. Adding semantics to genome databases: towards an on-


tology for molecular biology. Proceedings of the 5th International Conference
on Intelligent Systems for Molecular Biology, 5:272–275, 1997. URL https:
//www.aaai.org/Papers/ISMB/1997/ISMB97-042.pdf.

[125] Ruth L. Seal, Susan M. Gordon, Michael J. Lush, Mathew W. Wright, and El-
speth A. Bruford. genenames.org: the hgnc resources in 2011. Nucleic Acids
Research, 2010. doi: 10.1093/nar/gkq892.

[126] Hagit Shatkay and Ronen Feldman. Mining the biomedical literature in the ge-
nomic era: an overview. J Comput Biol, 10(6):821–855, 2003. doi: 10.1089/
106652703322756104.

[127] E. Patrick Shironoshita, Yves R Jean-Mary, Ray M Bradley, and Mansur R


Kabuka. semcdi: A query formulation for semantic data integration in cabig.
J Am Med Inform Assoc, Apr 2008. doi: 10.1197/jamia.M2732.

[128] David Shotton. Semantic publishing: the coming revolution in scientific journal
publishing. Learned Publishing, 22(2):85–94, APRIL 2009. doi: 10.1087/2009202.

[129] Y. Simmhan, B. Plale, and D. Gannon. A survey of data provenance in e-science.


SIGMOD Record, 34(3):31–36, 2005. doi: 10.1145/1084805.1084812.

[130] D. Smedley, S. Haider, B. Ballester, R. Holland, D. London, G. Thorisson, and


A. Kasprzyk. Biomart – biological queries made easy. BMC genomics, 10(1):22,
2009. doi: 10.1186/1471-2164-10-22.

[131] Miriam Solomon. Groupthink versus the wisdom of crowds: The social episte-
mology of deliberation and dissent. The Southern Journal of Philosophy, 44(S1):
28–42, 2006. doi: 10.1111/j.2041-6962.2006.tb00028.x.

[132] Young C. Song, Edward Kawas, Ben M. Good, Mark D. Wilkinson, and Scott J.
Tebbutt. Databins: a biomoby-based data-mining workflow for biological path-
ways and non-synonymous snps. Bioinformatics, 23(6):780–782, Jan 2007. doi:
10.1093/bioinformatics/btl648.

[133] Damires Souza. Using Semantics to Enhance Query Reformulation in Dy-


namic Distributed Environments. PhD thesis, Universidade Federal de Per-
nambuco, April 2009. URL http://www.bdtd.ufpe.br/tedeSimplificado/tde_
arquivos/22/TDE-2009-07-06T123119Z-6016/Publico/dysf.pdf.

[134] Irena Spasić and Sophia Ananiadou. Using automatically learnt verb selectional
preferences for classification of biomedical terms. J Biomed Inform, 37(6):483–497,
Dec 2004. doi: 10.1016/j.jbi.2004.08.002.

[135] Robert D. Stevens, C Goble, I Horrocks, and S Bechhofer. Building a bioin-


formatics ontology using oil. IEEE Transactions on Information Technology in
Biomedicine, 6:135–141, June 2002. ISSN 1089-7771. doi: 10.1109/TITB.2002.
1006301.

[136] Ljiljana Stojanovic, Alexander Maedche, Boris Motik, and N. Stojanovic. User-
driven ontology evolution management. In Proceedings of the 13th European Con-
ference on Knowledge Engineering and Knowledge Management EKAW, pages 53–
62, Madrid, Spain, 2002. URL http://www.fzi.de/ipe/publikationen.php?
id=804.

[137] John Strassner, Sven van der Meer, Declan O’Sullivan, and Simon Dobson. The
use of context-aware policies and ontologies to facilitate business-aware network
management. Journal of Network and Systems Management, 17(3):255–284,
September 2009. doi: 10.1007/s10922-009-9126-4.

[138] James Surowiecki. The Wisdom of Crowds. Random House, New York, 2004.
ISBN 0-385-72170-6.

[139] The UniProt Consortium. The universal protein resource (uniprot) in 2010. Nu-
cleic Acids Research, 38(suppl 1):D142–D148, 2010. doi: 10.1093/nar/gkp846.

[140] Todd J. Vision. Open data and the social contract of scientific publishing. Bio-
Science, 60(5):330–331, 2010. ISSN 00063568. doi: 10.1525/bio.2010.60.5.2.

[141] Stephan Weise, Christian Colmsee, Eva Grafahrend-Belau, Björn Junker, Chris-
tian Klukas, Matthias Lange, Uwe Scholz, and Falk Schreiber. An integration and
analysis pipeline for systems biology in crop plant metabolism. Data Integration
in the Life Sciences, pages 196–203, 2009. doi: 10.1007/978-3-642-02879-3_16.

[142] Patricia L. Whetzel, Helen Parkinson, Helen C. Causton, Liju Fan, Jennifer
Fostel, Gilberto Fragoso, Laurence Game, Mervi Heiskanen, Norman Morrison,
Philippe Rocca-Serra, Susanna-Assunta Sansone, Chris Taylor, Joseph White, and
Jr Stoeckert, Christian J. The MGED Ontology: a resource for semantics-based
description of microarray experiments. Bioinformatics, 22(7):866–873, 2006. doi:
10.1093/bioinformatics/btl005.

[143] Mark D Wilkinson, Benjamin Vandervalk, and Luke McCarthy. Sadi semantic
web services – cause you can’t always get what you want! In Proceedings of IEEE
APSCC Workshop on Semantic Web Services in Practice (SWSIP 2009), pages
13–18, 2009. doi: 10.1109/APSCC.2009.5394148.

[144] David S. Wishart, Craig Knox, An Chi Guo, Dean Cheng, Savita Shrivastava,
Dan Tzur, Bijaya Gautam, and Murtaza Hassanali. Drugbank: a knowledgebase
for drugs, drug actions and drug targets. Nucleic Acids Research, 36(suppl 1):
D901–D906, 2008. doi: 10.1093/nar/gkm958.

[145] Limsoon Wong. Technologies for integrating biological data. Brief Bioinform, 3
(4):389–404, 2002. doi: 10.1093/bib/3.4.389.

[146] Alexander C. Yu. Methods in biomedical ontology. Journal of Biomedical Infor-


matics, 39(3):252–266, June 2006. doi: 10.1016/j.jbi.2005.11.006.

[147] J. Zemánek, S. Schenk, and V. Svátek. Optimizing sparql queries over disparate
rdf data sources through distributed semi-joins. In Proceedings of ISWC 2008
Poster and Demo Session, 2008. URL http://ftp.informatik.rwth-aachen.
de/Publications/CEUR-WS/Vol-401/iswc2008pd_submission_69.pdf.

[148] Jun Zhao, Carole Goble, Robert Stevens, and Daniele Turi. Mining taverna’s
semantic web of provenance. Concurrency and Computation: Practice and Expe-
rience, Online Publication, 2007. doi: 10.1002/cpe.1231.

[149] Jun Zhao, Graham Klyne, and David Shotton. Provenance and linked data in bio-
logical data webs. In Linked Open Data Workshop at The 17th International World
Wide Web Conference, 2008. URL http://data.semanticweb.org/workshop/
LDOW/2008/paper/4.

[150] Jun Zhao, Alistair Miles, Graham Klyne, and David Shotton. Openflydata: The
way to go for biological data integration. Data Integration in the Life Sciences,
pages 47–54, 2009. doi: 10.1007/978-3-642-02879-3_5.
