Subj3ct – A Subject Identity Resolution Service

Graham Moore and Khalil Ahmed
Networked Planet Ltd, Innovation House, Mill Street, Oxford, UK {graham.moore,khalil.ahmed}@networkedplanet.com

Abstract. Subj3ct is an online service that provides a resolution capability between identifiers for subjects and the web addresses of encoded statements about a subject. This paper explains the underlying concepts and protocols of the service and discusses different usage scenarios.

1 Introduction

Subj3ct [1] is an online Subject Identity Resolution service. Subject Identifiers [2] are URIs that are used to unambiguously identify some ‘notion’, ‘concept’ or ‘subject’ about which people want to make statements. The service stores identifiers for subjects and the web addresses of representations of statements about a subject. Clients of the service can look up a subject identity record either to choose an appropriate identifier to use in an application or to find out where on the web more information about a given subject is available. All information stored in Subj3ct comes from some source, and that source is recorded. Clients can use this provenance information to decide which identifiers they want to trust and use.

Subj3ct is a registry service for subject identifiers for the Semantic Web. The goal is to provide the functionality required to realize the vision of the Semantic Web, or Web 3.0. Using the Subj3ct portal and APIs it is possible to register XML feeds that provide information and identifiers for the concepts that people or applications are interested in, and to connect those identifiers with identifiers created by other people for the same concepts. By enabling the sharing and publication of subject identifiers on the web, Subj3ct is a platform that enables communication between applications and human beings across the Web.

2 What are Subjects?

Subjects are the cornerstone of the Semantic Web, or Web 3.0 as it has been called. Quite simply, a Subject is absolutely anything you want to talk about: a person, a place, a piece of music, a feeling, anything at all. What makes a subject special is that every subject has its own identifier. An identifier is just a unique URL (a web address) that Web 3.0 applications use when they want to refer to that Subject. Although the basic concept is very simple, using subjects successfully in Web 3.0 faces three significant challenges:
Maicher, L.; Garshol, L. M. (Eds.): Linked Topic Maps. Fifth International Conference on Topic Maps Research and Applications, TMRA 2009 Leipzig, Germany, November 12–13, 2009 Revised Selected Papers. Leipziger Beiträge zur Informatik. ISBN 978-3-941608-06-1

1. Any number of different identifiers could be created for the same subject. The creators of identifiers need some way to agree that their different identifiers actually identify the same Subject, and users of identifiers need to be able to see all the identifiers that identify a given Subject in one place. Subj3ct allows users to assert that an identifier has another equivalent identifier and uses trust metrics to give an indication of how much those declarations should be trusted.

2. Identifiers by themselves are not useful. Just being able to identify a subject is not quite enough for an application to be useful. To be truly useful, it should be possible to use the identifier to find resources that talk about that subject. Subj3ct allows users to specify the web pages, pictures, RDF files, topic maps or other Internet resources that provide information relevant to a particular subject.

3. Identifiers need to be trusted. Before using an identifier, assuming that two identifiers really do talk about the same Subject, or following a suggested resource for an identifier, a user needs to have some level of trust in the identifiers being used and in the information they are given about those identifiers. Subj3ct provides users with a simple system of trust metrics based on identifying the users who can create identifiers under specific web domains and on correlating statements made by one user with those made by other users.

3 Subj3ct Glossary

The following is a short glossary of some key terms that are central to the Subj3ct service.

Feed. A Feed is a web resource that provides information about subjects and that can be crawled and parsed by the Subj3ct crawler. Currently we support syndication feeds that use the ATOM [3] syntax with some Subj3ct-specific extensions, and SKOS [4] RDF/XML. As Subj3ct develops we will gradually add support for other types of information resources. Any user can register a feed with Subj3ct: feeds that have been suggested anonymously are given the lowest level of trust; feeds that are registered by a user signed in with a Subj3ct account are given higher trust; and feeds registered by a Registered Publisher are given the highest trust level.

Registered Publisher. A Registered Publisher is an individual or organization that has registered with Subj3ct and claimed control over a particular Namespace. Feeds registered by a Registered Publisher are accorded the highest level of trust when they are located under the Namespace that the publisher owns and provide information about identifiers under that same Namespace.

Equivalent Subject Identifier. Any Subject Identifier can have any number of equivalent subject identifiers. Equivalence here simply means that the two identifiers are used to identify the same subject. To prevent accidental or intentional hijacking of identifiers, Subj3ct only records statements of equivalence that come from the same feed that provided the subject identifier. When a feed declares that identifier X is equivalent to identifier Y, the feed that provided identifier Y may reciprocate with a declaration that identifier Y is equivalent to identifier X. Alternatively, the owner of identifier Y may not agree that X and Y identify the same concept, in which case they are under no obligation to make the reciprocal statement.

The statement that identifier X is equivalent to identifier Y is seen as an outgoing equivalence for identifier X, meaning that X is the subject that the equivalence statement is about. The same statement is seen as an incoming equivalence for identifier Y. When you view the record for a subject, the equivalence statements it participates in are divided into incoming and outgoing equivalences. Note that while the person who registered the identifier has control over the outgoing equivalence statements, they have no control over the incoming statements. This is a bit like a page on the Web: the creator of the page has control over where that page links out to, but has no control over who links back to the page.

Subject. A subject is any concept that you want to talk about or share between people or applications. Subjects are identified using Subject Identifiers.

Subject Identifier. A Subject Identifier is a unique URI that is assigned to a particular Subject. A Subject Identifier can have any number of Equivalent Subject Identifiers and any number of Subject Web Addresses associated with it.

Subject Web Address. A Subject Web Address is a resource on the web that provides information relevant to a particular subject. A Subject Web Address is always provided in connection with a Subject Identifier, which identifies the Subject that the resource is related to. A single resource may be relevant to many subjects, but the best resources are those that are related to only a single subject (such as an encyclopedia entry that describes that subject in some detail, or a series of RDF statements about the subject). However, because a single Subject may have many Subject Identifiers, it is possible that the same Subject Web Address is provided for a number of Subject Identifiers. Any user can provide a Subject Web Address for a Subject Identifier, but a reasonable level of trust is assigned only when the user is a Registered Publisher of either the Subject Identifier or the Subject Web Address.

Subject Identity Record. A subject identity record is the data structure returned by Subj3ct for a given subject identifier. This record can be accessed by people via the portal or by machines using the API. The record contains information about the subject, including the identifier, equivalent identifiers and links to web resources that are said to contain more information about the subject. It is intended that these records can be cached and maintained by intermediate resolution services similar to Subj3ct.
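The distinction between outgoing and incoming equivalences described in the glossary can be illustrated with a small Python sketch. This is only an illustration of the concept, not the data model Subj3ct actually uses: the `Equivalence` class, the helper functions and the example.org identifiers are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Equivalence:
    """A statement 'source is equivalent to target' (hypothetical model)."""
    source: str  # identifier the statement is about (X)
    target: str  # identifier declared equivalent to it (Y)
    feed: str    # feed that made the statement (its provenance)


def split_equivalences(identifier, statements):
    """Return (outgoing, incoming) equivalence statements for an identifier."""
    outgoing = [s for s in statements if s.source == identifier]
    incoming = [s for s in statements if s.target == identifier]
    return outgoing, incoming


def is_reciprocated(statement, statements):
    """True if some other statement declares the reverse equivalence."""
    return any(
        s.source == statement.target and s.target == statement.source
        for s in statements
    )


# Hypothetical identifiers and feeds, used only for illustration.
X = "http://example.org/psi/x"
Y = "http://example.com/psi/y"
Z = "http://example.net/psi/z"

statements = [
    Equivalence(X, Y, "http://example.org/feed.atom"),  # X says: X = Y
    Equivalence(Z, X, "http://example.net/feed.atom"),  # Z says: Z = X
]

outgoing_x, incoming_x = split_equivalences(X, statements)
```

Here the publisher of X controls only the first statement; the second arrives from another feed, and neither has been reciprocated by its target.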

4 The Subj3ct Feed Processing Service

Subj3ct acts as a registry and clearinghouse service for subjects. It periodically crawls sources of subject information that have been registered either by the publisher of that information or by other interested parties, and imports the identifiers used for subjects, any statements of equivalence between subjects and any links provided to related resources for subjects. The main way for publishers to expose subject information is to publish it in an ATOM feed. The following fragment shows a single ATOM entry being used to express the identity of a subject and also linking to equivalent subjects and to online resources about the subject.

    <entry>
      <title>Hierarchical Relation Type</title>
      <id>http://www.techquila.com/psi/hierarchy/index.atom/1</id>
      <updated>2003-12-04T14:39:10Z</updated>
      <summary>
        A type of association type. Associations which are typed by a
        topic which is an instance of this type represent a parent-child
        relationship between two or more topics.
      </summary>
      <link rel="SubjectIdentifier"
            href="http://www.techquila.com/psi/hierarchy/#hierarchical-relation-type" />
      <link rel="SubjectRepresentation"
            href="http://www.techquila.com/tmsinia4.html" />
      <link rel="SubjectEquivalence"
            href="http://www.networkedplanet.com/psi/hierarchy/hierarchical-relation-type" />
    </entry>

Subj3ct doesn’t require identifier publishers to use any ATOM XML extensions; instead the semantics of each link are indicated through the ‘rel’ attribute. The allowed values are ‘SubjectIdentifier’, ‘SubjectRepresentation’, and ‘SubjectEquivalence’. A feed can contain many entries, and each entry results in information about a subject being stored in the Subj3ct database.
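To make the role of the ‘rel’ attribute concrete, the following Python sketch shows how a consumer might read such an entry. It is not the Subj3ct crawler: the `parse_entry` function and the shape of the returned dictionary are assumptions made for illustration, and the sketch assumes the entry is in the standard Atom namespace (the fragment above omits the namespace declaration, which would normally sit on the enclosing feed element).

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"  # standard Atom namespace (RFC 4287)

# The entry from the paper, with an explicit namespace declaration added.
ENTRY_XML = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Hierarchical Relation Type</title>
  <id>http://www.techquila.com/psi/hierarchy/index.atom/1</id>
  <updated>2003-12-04T14:39:10Z</updated>
  <link rel="SubjectIdentifier"
        href="http://www.techquila.com/psi/hierarchy/#hierarchical-relation-type" />
  <link rel="SubjectRepresentation"
        href="http://www.techquila.com/tmsinia4.html" />
  <link rel="SubjectEquivalence"
        href="http://www.networkedplanet.com/psi/hierarchy/hierarchical-relation-type" />
</entry>"""

# Maps each allowed 'rel' value to a key in the (hypothetical) result dict.
REL_KEYS = {
    "SubjectIdentifier": "identifiers",
    "SubjectRepresentation": "representations",
    "SubjectEquivalence": "equivalences",
}


def parse_entry(entry):
    """Collect the subject information carried by one Atom entry element."""
    title = entry.findtext(f"{{{ATOM_NS}}}title", default="")
    subject = {
        "name": title.strip(),
        "identifiers": [],
        "representations": [],
        "equivalences": [],
    }
    for link in entry.findall(f"{{{ATOM_NS}}}link"):
        key = REL_KEYS.get(link.get("rel"))
        href = link.get("href")
        if key and href:  # links with other rel values are ignored
            subject[key].append(href)
    return subject


subject = parse_entry(ET.fromstring(ENTRY_XML))
```

Links whose ‘rel’ value is not one of the three listed are skipped, so ordinary Atom links such as ‘alternate’ would not affect the result.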

This imported information is then given a trust score based on the identity of the person who registered the information source, the URI of the subject information source itself and the URIs used for the identifiers and related resources. This provenance information is a key aspect of Subj3ct as it allows for webs of mutual trust to be asserted between co-operating parties while still exposing identifiers and equivalences in an open world environment.

4.1 Trust Scores and Identifiers in Subj3ct

In Subj3ct, all statements about identifiers are assigned a basic Trust Score. The Trust Score provides a rough measure of how likely it is that the creator of an identifier agrees with any statement being made about the identifier.

NOTE: The Trust Score does not tell you whether you should trust the creator of an identifier or the individual who creates statements about that identifier, nor does it assert anything about the content that might be found if you follow the URLs contained in those statements.

The Trust Score for a statement is calculated from the URL of the feed that contains the statement and the URLs contained in the statement itself. The highest score is assigned if all the URLs contained in the statement are within the same parent directory as the URL of the feed itself. A lower score is assigned if the URLs contained in the statement are all within the same DNS sub-domain as the feed itself. The lowest score is assigned if the URLs contained in the statement are from different domains or sub-domains from the feed. In addition, much higher trust scores are assigned to statements from feeds that are registered through an active Subj3ct user account. At the other end of the scale, zero trust is assigned to statements that come from a feed that was suggested anonymously and that refer to URLs in a different domain or sub-domain from the feed itself.
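The tiering described above can be sketched in Python as follows. This is a rough illustration only: the paper does not give the actual numeric scores or the exact matching rules, so the constants, the function names, the use of host-name equality as a stand-in for "same DNS sub-domain", and the reduction of the registration levels to a single boolean are all assumptions.

```python
from urllib.parse import urlsplit

# Hypothetical tier values; the real Subj3ct scores are not given in the paper.
SAME_DIRECTORY = 3    # all URLs under the feed's parent directory
SAME_HOST = 2         # all URLs on the same host as the feed
DIFFERENT_HOST = 1    # at least one URL on another host
REGISTERED_BONUS = 10  # hypothetical boost for feeds registered via an account


def locality(feed_url, statement_urls):
    """Classify how close the statement's URLs are to the feed's own URL."""
    feed = urlsplit(feed_url)
    # Parent directory of the feed, e.g. "/psi/hierarchy/" for ".../index.atom".
    feed_dir = feed.path.rsplit("/", 1)[0] + "/"
    urls = [urlsplit(u) for u in statement_urls]
    same_host = all(u.hostname == feed.hostname for u in urls)
    if same_host and all(u.path.startswith(feed_dir) for u in urls):
        return SAME_DIRECTORY
    if same_host:
        return SAME_HOST
    return DIFFERENT_HOST


def trust_score(feed_url, statement_urls, registered_account):
    """Combine URL locality with how the feed was registered."""
    tier = locality(feed_url, statement_urls)
    if registered_account:
        return tier + REGISTERED_BONUS
    # Anonymous suggestion pointing at another domain: zero trust.
    return 0 if tier == DIFFERENT_HOST else tier


FEED = "http://www.techquila.com/psi/hierarchy/index.atom"
identifier_score = trust_score(
    FEED,
    ["http://www.techquila.com/psi/hierarchy/#hierarchical-relation-type"],
    registered_account=True,
)
anonymous_foreign_score = trust_score(
    FEED,
    ["http://www.networkedplanet.com/psi/hierarchy/hierarchical-relation-type"],
    registered_account=False,
)
```

With these assumed values, the identifier statement from the techquila feed lands in the top tier, while an anonymously suggested statement pointing at another domain receives zero.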

5 The Subj3ct API

As well as a portal that can be used by humans to search for identifiers, Subj3ct also provides an API for accessing information about subjects. The API consists of the following operations:

Get Subject Identity Record By Subject Identifier. The format for this request is

    http://api.subj3ct.com/subjects?identifier={identifierUri}&format={format}

The request will return a SubjectIdentityRecord for the given subject identifier in the format specified in the optional format parameter.

Search for Subject Identity Records by Web Resource. The format for this request is

    http://api.subj3ct.com/subjects/webaddresses?uri={representationUri}&skip={skip}&take={take}&format={format}

The request will return a search result. Subjects are included in the result if they have a corresponding web resource whose URI matches the URI specified. The search result is returned in the format specified in the optional format parameter.

Search for Subject Identity Records by Identifier URI. The format for this request is

    http://api.subj3ct.com/subjects/identifiers?uri={uri}&skip={skip}&take={take}&format={format}

The request will return a search result. Subjects are included in the result if the start of their subject identifier matches the query term provided. The take and skip parameters allow for accessing paged results.

Search for Subject Identity Records by Text Query. The format for this request is

    http://api.subj3ct.com/subjects/search?query={query}&skip={skip}&take={take}&format={format}

The request will return a search result. Subjects are included in the result if the name, description or any part of their subject identifier matches the query term provided. The take and skip parameters allow for accessing paged results.

The subject identity record has the following structure, illustrated here by an example XML instance:

    <Subject xmlns="http://subj3ct.com/schema/data"
             xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
      <Identifier>http://psi.ontopedia.net/Axel_Borge</Identifier>
      <Provenance>http://www.garshol.priv.no/tmphoto/subj3ct.atom</Provenance>
      <Name>Axel Borge</Name>
      <RecordLink>http://api.subj3ct.com/subjects?identifier=http%3a%2f%2fpsi.ontopedia.net%2fAxel_Borge</RecordLink>
      <Trust>25</Trust>
      <EquivalenceStatements />
      <RepresentationStatements>
        <RepresentationStatement>
          <RepresentationUri>http://psi.ontopedia.net/Axel_Borge</RepresentationUri>
          <Provenance>http://psi.ontopedia.net/ontopedia/subject/atom_stream</Provenance>
          <Trust>20</Trust>
        </RepresentationStatement>
        <RepresentationStatement>
          <RepresentationUri>http://www.garshol.priv.no/tmphoto/person.jsp?id=t83420</RepresentationUri>
          <Provenance>http://www.garshol.priv.no/tmphoto/subj3ct.atom</Provenance>
          <Trust>125</Trust>
        </RepresentationStatement>
      </RepresentationStatements>
    </Subject>

The subject identity record (which can also be retrieved in JSON and SKOS format) can then be used by client applications to mash together the semantic data referenced by the representation links, based on identifiers for similar subjects.
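As an illustration of how a client might use these operations, the Python sketch below builds request URLs from the templates given above and picks representation links out of a subject identity record by trust score. It makes no network calls. The function names, the default skip and take values and the `min_trust` threshold are assumptions made for this example; since the paper does not list the accepted values of the format parameter, the sketch leaves it unset.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

API_BASE = "http://api.subj3ct.com/subjects"
RECORD_NS = {"s": "http://subj3ct.com/schema/data"}


def record_url(identifier, fmt=None):
    """URL for 'Get Subject Identity Record By Subject Identifier'."""
    params = {"identifier": identifier}
    if fmt is not None:
        params["format"] = fmt
    return API_BASE + "?" + urlencode(params)


def search_url(kind, term, skip=0, take=10, fmt=None):
    """URL for a search; kind is 'webaddresses', 'identifiers' or 'search'."""
    key = "query" if kind == "search" else "uri"
    params = {key: term, "skip": skip, "take": take}
    if fmt is not None:
        params["format"] = fmt
    return f"{API_BASE}/{kind}?{urlencode(params)}"


# Abbreviated copy of the example record from the paper.
RECORD_XML = """<Subject xmlns="http://subj3ct.com/schema/data">
  <Identifier>http://psi.ontopedia.net/Axel_Borge</Identifier>
  <Name>Axel Borge</Name>
  <Trust>25</Trust>
  <EquivalenceStatements />
  <RepresentationStatements>
    <RepresentationStatement>
      <RepresentationUri>http://psi.ontopedia.net/Axel_Borge</RepresentationUri>
      <Trust>20</Trust>
    </RepresentationStatement>
    <RepresentationStatement>
      <RepresentationUri>http://www.garshol.priv.no/tmphoto/person.jsp?id=t83420</RepresentationUri>
      <Trust>125</Trust>
    </RepresentationStatement>
  </RepresentationStatements>
</Subject>"""


def best_representations(record_xml, min_trust=0):
    """Representation URIs with at least min_trust, most trusted first."""
    root = ET.fromstring(record_xml)
    found = []
    path = "s:RepresentationStatements/s:RepresentationStatement"
    for stmt in root.findall(path, RECORD_NS):
        uri = stmt.findtext("s:RepresentationUri", default="", namespaces=RECORD_NS)
        trust = int(stmt.findtext("s:Trust", default="0", namespaces=RECORD_NS))
        if trust >= min_trust:
            found.append((trust, uri.strip()))
    return [uri for _, uri in sorted(found, reverse=True)]


lookup = record_url("http://psi.ontopedia.net/Axel_Borge")
trusted = best_representations(RECORD_XML, min_trust=100)
```

The URL produced by `record_url` differs from the RecordLink in the example record only in the letter case of the percent-escapes, which is not significant.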

6 Embedding Subjects

One of the important mechanisms for creating a habit of using subject identifiers is to make very simple tools available to content authors and systems developers. The API is one such tool; another is the set of ‘embeddable’ mark-up that Subj3ct provides for inclusion in things such as blog posts and other published content. The ability to embed references to subjects as part of content provides authors with a powerful way to classify content using common vocabulary terms.

The following is an RDFa fragment, using Dublin Core, that can be included in a page to indicate that the page is about that subject.

    <span xmlns:dc="http://purl.org/dc/elements/1.1/"
          rel="dc:subject"
          href="http://psi.ontopedia.net/Axel_Borge">Axel Borge</span>

The powerful thing about this is that Google and other search engines are beginning to recognise and index semantic mark-up. This reinforces the cycle of identifiers, content and usage.
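A publishing tool could generate such a fragment from a subject identifier and a display name. The short Python sketch below shows one way to do this; the `subject_span` function is a hypothetical helper written for this example and is not part of the embeddable mark-up that Subj3ct provides.

```python
from html import escape

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core elements namespace


def subject_span(identifier, label):
    """Return an RDFa span stating that the enclosing page is about a subject."""
    return (
        f'<span xmlns:dc="{DC_NS}" rel="dc:subject" '
        f'href="{escape(identifier, quote=True)}">{escape(label)}</span>'
    )


snippet = subject_span("http://psi.ontopedia.net/Axel_Borge", "Axel Borge")
```

Escaping the identifier and the label keeps the output well-formed even when an identifier contains characters such as `&` in a query string.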

7 Applications for Subjects and Subj3ct

Subjects are the cornerstone of the Semantic Web. They can be used to classify resources, to identify key entities in text, or to discover relevant information quickly. Below are three categories of usage.

Resource Classification. The identifiers for subjects are much better ways to say what a piece of content is about than simple textual tags. Tags are limited because they are usually language-specific and because it is too easy for the same tag to refer to very different concepts (consider the tag "Paris", for example: is it referring to a place or a person?). By comparison, Subject identifiers are URIs, which are inherently language-neutral and always refer to a particular subject regardless of the language used to name or describe that subject. Subject identifiers are also more specific than tags: they always refer to the subject that they were made for, and because they are unique URIs, one identifier cannot be confused with another. Identifiers can be used anywhere that traditional tags or controlled vocabularies are used, including in HTML documents, in tagging applications such as your favourite online photo sharing application, and even in desktop applications such as word processors.

Classification has traditionally been a manual activity, requiring a human being to view the content and then pick the terms used to classify that content from some controlled list. More recently classification has started to be done (with varying levels of accuracy) by computers. Subject identifiers can be used to record the results of either manual or automated classification.

Entity Identification. As well as classification of entire resources (such as the tagging of a document with the key terms that it is ‘about’), Subject identifiers can also be used to record the results of identifying the key entities (things, people, places, events) discussed in the content of a document. Online services such as OpenCalais [5] attempt to automatically find these entities in text content and return identifiers for the entities they find. By registering these identifiers with Subj3ct and connecting the identifiers returned by these services to other identifiers created through other means, it is possible not only to label key entities within text but also to link to many other relevant resources.

Information Discovery. Subj3ct provides an online search form and APIs that allow users and applications to find subjects in three ways: by using an identifier gleaned from somewhere (maybe sent in an email to a user, attached to a document as classification metadata, or embedded in content as a result of entity identification); by searching from a resource to find the subjects that claim to be relevant to that resource; or by searching against the name and description registered for the subject by the creator of the subject identifier. The resources related to an identifier can be HTML pages or other human-readable content, but they can equally be resources that provide information in RDF [6] or Topic Maps syntax [7]. Semantic Web applications can make use of the Subj3ct API to quickly discover whether there are any useful data sources in their preferred syntax and use these to enrich the application experience for the end-user.

8 Current Limitations

A lack of multilingual names for subjects was a major limitation of earlier versions. At the time of publication this limitation has been addressed: the Subj3ct service now allows subject identifier authors to register multiple names for a subject.

The ability to remove an identifier is not currently offered. This is partly because we think that once an identifier comes into existence it can be in use, and it makes no sense to retract it. Its description and name can, however, be updated.

A general limitation is that while Subj3ct now contains many identifiers and equivalences, it still lacks tools for easily using the service in editorial or information-finding contexts. This is a focus for future work.

9 Relationship to Other Work

One obvious comparison can be made with the ‘sameAs’ service [8]. This service aggregates identifiers for resources, together with equivalent ones as defined using owl:sameAs [9]. The main differences between sameas.org and Subj3ct are:

1. Subj3ct distinguishes identifiers for subjects from identifiers for the addresses of statements about those subjects.
2. Subj3ct records who registered the identifier, allowing clients to filter equivalent identifiers.
3. Subj3ct puts a trust score onto an identifier based on the domain from which it was registered.
4. Subj3ct works at two levels: identifiers, and addresses for resources that contain statements about the subject. In this sense it provides a DNS [10] for the semantic web, whereas sameAs only provides unregulated resource equivalence capabilities.

10 Conclusion

Subj3ct is a key service in providing a resolution mechanism between subject identifiers, equivalent identifiers and web resources that contain further statements about a subject.

References
1. Subj3ct.com, http://www.subj3ct.com
2. Graham Moore, Lars Marius Garshol (eds), Topic Maps Data Model, http://www.isotopicmaps.org/sam/sam-model
3. The Atom Syndication Format, RFC 4287, http://atompub.org/rfc4287.html
4. SKOS (Simple Knowledge Organization System), http://www.w3.org/2004/02/skos
5. OpenCalais, http://www.opencalais.com
6. RDF/XML Syntax Specification, http://www.w3.org/TR/rdf-syntax-grammar
7. XTM Topic Maps Syntax, http://www.isotopicmaps.org/sam/sam-xtm
8. sameAs service, http://sameas.org
9. OWL Web Ontology Language, http://www.w3.org/TR/owl-ref
10. DNS, http://en.wikipedia.org/wiki/Domain_Name_System