Life sciences on the Semantic Web: the Neurocommons and beyond*
Alan Ruttenberg, Jonathan A. Rees, Matthias Samwald and M. Scott Marshall
Submitted: 18th September 2008; Received (in revised form): 14th January 2009
Abstract
Translational research, the effort to couple the results of basic research to clinical applications, depends on the ability to effectively answer questions using information that spans multiple disciplines. The Semantic Web, with its emphasis on combining information using standard representation languages, access to that information via standard web protocols, and technologies to leverage computation, such as in the form of inference and distributable query, offers a social and technological basis for assembling, integrating and making available biomedical knowledge at Web scale. In this article, we discuss the use of Semantic Web technology for assembling and querying biomedical knowledge from multiple sources and disciplines. We present the Neurocommons prototype knowledge base, a demonstration intended to show the feasibility and benefits of using these technologies. The prototype knowledge base can be used to experiment with and assess the scalability of current tools and methods for creating such a resource, and to elicit issues that will need to be addressed in order to expand the scope and use of it. We demonstrate the utility of the knowledge base by reviewing a few example queries that provide answers to precise questions relevant to the understanding of disease. All components of the knowledge base are freely available at http://neurocommons.org/, enabling readers to reconstruct the knowledge base and experiment with this new technology.
Keywords:
Semantic Web; ontology; data integration; life science; medicine; neuroscience
INTRODUCTION
Understanding complex biological systems is a crucial challenge for modern biomedical science and informatics. In order to answer questions that might accelerate translational medicine, knowledge from different disciplines, research methodologies and repositories must be collected and integrated. However, the data and knowledge that measure and describe biomedical phenomena are scattered across numerous information systems, each with its own terminologies, identifier schemes, and data formats. One collation counts more than 1000 publicly accessible molecular biology databases [1]. There is little schema or ontology reuse between these. Beyond these lies a bulk of biomedical knowledge published in journals, monographs, and textbooks. Making effective computational use of all this knowledge is an important contemporary challenge.
Given this situation, it is difficult for researchers to find all available information about a subject of interest, and to organize it so that it can be found and understood. Scientists who would attempt to form a comprehensive view of a biological phenomenon face tedious and error-prone computing tasks such as converting data formats and information schemas, querying different databases and combining
*In memory of our friend and colleague William Bug, Ontological Engineer.
Corresponding author. Dr M. Scott Marshall, Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands. Tel: +31(0)20 5257522; Fax: +31(0)20 5257490; E-mail: marshall@science.uva.nl
Alan Ruttenberg is a principal scientist at Science Commons where he is part of the Neurocommons development team. He is a coordinating editor of the OBO Foundry, and co-chair of the W3C OWL Working group.
Jonathan A. Rees pursues dual interests in computer science and biology. He is an author of the Scheme 48 programming language and is an appointed member of the W3C Technical Architecture Group.
Matthias Samwald is a researcher at DERI Galway (Galway, Ireland), the Konrad Lorenz Institute for Evolution and Cognition Research (Altenberg, Austria) and at the Section on Medical Expert and Knowledge-Based Systems, Medical University of Vienna (Austria). He is a member of the W3C HCLS Interest Group.
M. Scott Marshall is a researcher at the Informatics Institute of the University of Amsterdam. He is the unit lead of the Adaptive Information Disclosure group of the VL-e project and co-chair of the W3C HCLS Interest Group.
BRIEFINGS IN BIOINFORMATICS. page 1 of 12. doi:10.1093/bib/bbp004
© The Author 2009. Published by Oxford University Press. For Permissions, please email: journals.permissions@oxfordjournals.org
 
Briefings in Bioinformatics Advance Access published March 12, 2009
 
the results of these queries, wrestling with a variety of uncoordinated application interfaces, reading articles and extracting and integrating relevant facts from them. Most of such a scientist’s resources are spent on working through the complexities of information systems instead of understanding the complexities of biological reality, which is the actual goal of biomedical research [2].
Instead of ushering in a new era of biomedical insight, the growing abundance of data on the web has intensified the need to develop new approaches to manage and integrate it. If we fail to do so, knowledge will remain fractured, encoded in a myriad of representational dialects, and effectively inaccessible to the majority of researchers.
As a means to change this situation, we have become interested in helping establish a Semantic Web for science [3, 4]. By our assessment, the Semantic Web adds to existing Web standards and practices encouraging clearly specified names for things, classes, and relationships, organized and documented in ontologies, with data expressed using standardized well-specified knowledge representation languages. Such a combination could enable computationally assisted management of information, ease the integration of different sources into a coherent system, and make knowledge more widely and easily accessible. As with the existing synergy between Internet and intranet, these technologies continue to enhance the ability to work with knowledge that spans public and organizational boundaries, an essential capability in an ecosystem of biomedical research that includes academia, pharmaceutical companies, medical clinics and government agencies.
A number of recent Semantic Web standards provide a part of the technical basis for such a vision, building on existing Web practices such as the ubiquitous use of Uniform Resource Identifiers (URIs) as globally unique names and documentation locators.
The Resource Description Framework (RDF) [5], RDF Schema (RDFS) and the Web Ontology Language (OWL) [6] are standards for knowledge representation. RDF(S) (we use RDF(S) to refer to both RDF and RDF Schema) provides a basic syntax, datatypes and the ability to use classes and instances. OWL goes beyond RDF(S) in offering more expressive ways of specifying classes, relations between classes, properties and relationships between instances. OWL is expressive enough to state inconsistent assertions, therefore going beyond RDF(S) and enabling tools that can profitably check consistency in the service of improving data quality. The query language SPARQL [7] is a first standard for posing queries against repositories of knowledge expressed in these languages. Reasoners such as Pellet [8] are able to compute implications of statements made in OWL, as well as perform consistency checking.
The Neurocommons prototype is a knowledge base built as a first step towards Web scale integration of scientific knowledge. With it, we are already able to demonstrate how Semantic Web technologies can be applied in biomedical research, for instance by helping scientists more easily answer questions about background science and connections between different research disciplines. The prototype serves as one test bed for exploring the technical, social and legal processes that will be needed to achieve a future in which the results of research are placed seamlessly into the Web of science. It also demonstrates the productive use of existing ontologies and exposes the need for their augmentation and future development. Through our experience working with the SenseLab project [9], the OBO Foundry [10], and with members of the W3C Semantic Web for Health Care and Life Sciences Interest Group [11], we can report insights on methods of collaboration that can work in practice. The prototype is based on the Virtuoso open source triple store (http://virtuoso.openlinksw.com/) as an OWL and RDF repository, and comes with open access data.
The knowledge base has been released with the express purpose of allowing others to replicate, experiment with and extend it.
We see this prototype as a step towards the Semantic Web for science. Below we present the construction of the prototype, review related efforts, assess gaps and propose next steps, and set forward what we see as some challenges for both the short and long term.
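As a rough illustration of the layering described above (RDF(S) for classes and instances, OWL for statements that can be checked for consistency), the following Python sketch stores statements as subject-predicate-object triples, derives implied class membership by following rdfs:subClassOf, and flags an instance that falls into two classes declared disjoint. It is a toy under stated assumptions, not how Pellet or Virtuoso work: the class and instance URIs under http://example.org/ and the three helper functions are hypothetical; only the rdf:type, rdfs:subClassOf and owl:disjointWith URIs come from the standards.

```python
# Toy triple store: each statement is a (subject, predicate, object) tuple.
# Every URI under http://example.org/ is hypothetical, chosen for illustration.
EX = "http://example.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
OWL_DISJOINT = "http://www.w3.org/2002/07/owl#disjointWith"

triples = {
    (EX + "PyramidalNeuron", RDFS_SUBCLASS, EX + "Neuron"),
    (EX + "Neuron", RDFS_SUBCLASS, EX + "Cell"),
    (EX + "Cell", OWL_DISJOINT, EX + "Gene"),
    (EX + "cell42", RDF_TYPE, EX + "PyramidalNeuron"),
}


def superclasses(cls, graph):
    """Return cls plus every class reachable from it via rdfs:subClassOf."""
    seen, stack = {cls}, [cls]
    while stack:
        current = stack.pop()
        for s, p, o in graph:
            if s == current and p == RDFS_SUBCLASS and o not in seen:
                seen.add(o)
                stack.append(o)
    return seen


def inferred_types(instance, graph):
    """All classes the instance belongs to, whether asserted or implied."""
    types = set()
    for s, p, o in graph:
        if s == instance and p == RDF_TYPE:
            types |= superclasses(o, graph)
    return types


def is_consistent(instance, graph):
    """Return False if the instance is in two classes declared disjoint."""
    types = inferred_types(instance, graph)
    for s, p, o in graph:
        if p == OWL_DISJOINT and s in types and o in types:
            return False
    return True
```

With these statements, `inferred_types(EX + "cell42", triples)` includes the Cell class even though it was never asserted directly, and adding a triple that types `cell42` as a Gene makes `is_consistent` return False. A real reasoner covers far more of OWL than this single disjointness rule.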
THE NEUROCOMMONS KNOWLEDGE BASE
Technical goals of the Neurocommons KB
The Neurocommons prototype explores what future life sciences data standards should be like in order to promote integration. We had a number of specific goals in building the prototype. First, we wanted to be able to exercise the ability to ask and get precise
answers to questions. Second, we wanted to show that the emerging Semantic Web technologies could accommodate data at a scale appropriate to a knowledge base. There have been a number of biomedical knowledge prototypes that use relatively small amounts of information and are therefore unconvincing. Entrez Gene and PubMed together provide an essential basis for bioinformatics work, so we chose to include content from these resources as a baseline.
We wanted to use modern knowledge representation techniques in order to escape the tendency for representation to be too closely tied with storage technology, in particular the biases introduced by the limitations of the relational model (e.g. difficulty in working with hierarchical and nested structures), and in order to work towards representations that were not tied to a specific end. If knowledge is to be shared on a Semantic Web, and be available for new and unanticipated uses (i.e. not the ones for which the data was created), we must attempt to represent knowledge in such a way that it is clearly expressed yet application neutral. We are not yet successful in all cases. In some cases, the magnitude of the work made it infeasible. In other cases, the current state of OWL is such that it is insufficiently expressive to handle all such representation. However, after applying the principles of the OBO Foundry [12], we were able to succeed in some demonstrations of data integration.
Finally, culminating a long debate on what might be suitable identifiers for entities that are the subject matter of biomedicine, we wanted to prototype a mechanism and protocol for minting URIs that achieved univocity, persistence, manageability and conformance to Semantic Web protocols. (For further discussion of this point see http://neurocommons.org/page/Common_Naming_Project and discussions at http://lists.w3.org/Archives/Public/public-semweb-lifesci/.)
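To make two of the naming goals above more tangible, here is a minimal Python sketch of a registry that mints one URI per database record. It is not the Neurocommons naming scheme, which the text does not spell out. The `UriRegistry` class, its methods, the `http://example.org/record/` base and the `source/local_id` path layout are all assumptions made for illustration. The sketch shows only univocity (a URI names exactly one record) and persistence (asking again returns the same URI).

```python
from urllib.parse import quote


class UriRegistry:
    """Hypothetical registry that hands out one stable URI per record."""

    def __init__(self, base="http://example.org/record/"):
        self.base = base          # assumed base; not a real naming authority
        self._by_key = {}         # (source, local_id) -> URI
        self._by_uri = {}         # URI -> (source, local_id)

    def mint(self, source, local_id):
        """Return the URI for a record, creating it on first request."""
        key = (source, str(local_id))
        if key in self._by_key:
            # Persistence: a record keeps the URI it was first given.
            return self._by_key[key]
        # Percent-encode both parts so "/" in an identifier cannot
        # collide with the path separator.
        uri = self.base + quote(key[0], safe="") + "/" + quote(key[1], safe="")
        if uri in self._by_uri:
            # Univocity: refuse to let one URI name two different records.
            raise ValueError("URI already names a different record: " + uri)
        self._by_key[key] = uri
        self._by_uri[uri] = key
        return uri

    def lookup(self, uri):
        """Return the (source, local_id) a URI names, or None if unknown."""
        return self._by_uri.get(uri)
```

For example, `UriRegistry().mint("entrezgene", 1234)` yields `http://example.org/record/entrezgene/1234`, and a second call with the same arguments returns the identical string. Manageability and conformance to Semantic Web protocols (for instance, what the URI returns when dereferenced) are outside what this sketch covers.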
Data sources
The scientific focus of the Neurocommons project is to support disease research for neurological diseases. In an attempt to force the design to be general, we strove to provide background knowledge that would support our own focus as well as other specializations and chose a number of sources based on an assessment of value for query, ease of acquisition, effort required to represent them in OWL and type of data. The knowledge base includes basic information about genes taken from Entrez Gene; the full set of OBO ontologies, including the Gene Ontology [13], the Gene Ontology Annotations (GOA) [14] that associate gene products with functions, processes and structures; the OWL version of GALEN [15]; links to the literature in the form of gene to article links from Entrez and GO, the medical subject heading definitions and article associations from PubMed, as well as selected information associated with each article. Where we had a choice of species-specific information, we include that about human and commonly studied model organisms: mouse, rat, fly, nematode, dog, cow, yeast, zebrafish, chimpanzee, pig, chicken and frog. Homolog information relating genes in these species to each other is taken from Homologene (http://www.ncbi.nlm.nih.gov/sites/entrez?db=homologene). In order to get some experience with queries that include reagent information, we incorporate the Addgene (http://www.addgene.com) plasmid catalog. Of these data sources, the OBO ontologies are provided in OWL, whereas most of the others needed to be translated.
There is a broad range of databases that relate to neuroscience; our selection was primarily limited by the not-insignificant effort to represent their subject matter. These sources include: metadata associated with the Allen Brain Atlas [16] images; NeuronDB, a database of a selection of neuronal properties from the SenseLab project; the Swanson-1998 rat portion of the Brain Architecture and Management System (BAMS) [17] database, which includes gross neural circuitry as well as some molecular expression information; and the PDSP Ki database [18] of compound affinity to neuron receptors. Results of an early information extraction pilot (http://sw.neurocommons.org/2007/text-mining.html) run against a portion of neuroscience-related abstracts are also included.
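The value of loading these sources into one repository is that a single question can cross them. The Python sketch below illustrates the idea with made-up data: gene records, annotations and homolog groups arrive as separate sets of triples, are merged into one graph, and are then joined to find genes in one species that are homologous to genes carrying a given annotation. The identifiers (`gene:H1`, `go:TERM_A`, `group:G1`), the predicate names (`inTaxon`, `annotatedWith`, `inHomologGroup`) and both functions are hypothetical and do not reflect the vocabulary actually used in the knowledge base, where such a question would be posed in SPARQL.

```python
def match(graph, s=None, p=None, o=None):
    """Yield triples matching the pattern; None acts as a wildcard."""
    for t in graph:
        if ((s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)):
            yield t


# Statements as they might arrive from three separate sources (made-up data).
gene_records = {
    ("gene:H1", "inTaxon", "taxon:human"),
    ("gene:M1", "inTaxon", "taxon:mouse"),
    ("gene:M2", "inTaxon", "taxon:mouse"),
}
annotations = {
    ("gene:H1", "annotatedWith", "go:TERM_A"),
}
homologs = {
    ("gene:H1", "inHomologGroup", "group:G1"),
    ("gene:M1", "inHomologGroup", "group:G1"),
    ("gene:M2", "inHomologGroup", "group:G2"),
}

# Integration step: once everything is triples, merging is a set union.
graph = gene_records | annotations | homologs


def homologs_of_annotated(graph, term, taxon):
    """Genes in `taxon` sharing a homolog group with a gene annotated `term`."""
    results = set()
    for gene, _, _ in match(graph, p="annotatedWith", o=term):
        for _, _, group in match(graph, s=gene, p="inHomologGroup"):
            for other, _, _ in match(graph, p="inHomologGroup", o=group):
                in_taxon = any(match(graph, s=other, p="inTaxon", o=taxon))
                if other != gene and in_taxon:
                    results.add(other)
    return results
```

Calling `homologs_of_annotated(graph, "go:TERM_A", "taxon:mouse")` on this toy graph returns the one mouse gene that shares a homolog group with the annotated human gene. The join only works because all three sources refer to the same gene by the same name, which is why shared identifiers matter for integration.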
NAMES AND THE NAMED
A central tenet of the Web is that entities (known as ‘resources’ in web parlance) are identified or named by URIs. When the Web was being developed, the primary entities that were manipulated by Web tools, and therefore needed names, were the Web pages themselves and their contents: images, other attached resources, and other pages that were included as links. The URLs that served as the names of the Web pages and their contents