
Aaron B. Helton

St. Edward’s University

MCIS 6309.01

Scaffolding the Semantic Web

April 12, 2008



Abstract

Tim Berners-Lee described a vision of the World Wide Web in which its information

could be understandable to both humans and machines, enabling machines to act on that

intelligence in ways that only humans had been able to do before. In practical terms, this means being able to answer specific questions informed by disparate data sources, to provide trustworthy and meaningful connections between data, and to enable the continual reuse of data in any way

that can be conceived. While a number of tools and specifications have been created to facilitate

the implementation of the Semantic Web, very few practical implementations have appeared to

date. Further, most of the current use cases listed on the World Wide Web Consortium's (W3C)

site indicate that it is being implemented in concentrated, domain-specific ways (Herman & Stephens, 2007). This paper demonstrates a methodology by which one can approach the

creation of new Semantic Web applications for the synthesis and creation of knowledge. It uses

as its case example a small subset of the United States tourism industry.

Introduction to the Semantic Web

Before any exploration is made on how best to achieve a Semantic Web implementation,

one must understand what the Semantic Web is and how it can be of use. Tim Berners-Lee

described the Semantic Web as a Web in which computers “become capable of analyzing all the

data on the Web…machines talking to machines.” (“Semantic Web,” 2008) What this means in

practical terms is that the Semantic Web is composed of data that has been annotated with

categorical and descriptive metadata, which allows it to be used and reused in countless forms.

In theoretical terms, it represents a shift from proprietary, incompatible data formats to a unified

and predictable data format (this includes the metadata) with programmatically determinable

characteristics, such as categories, connections to other data, and rules describing how to process

the data. In time this will allow machines to form the network paths between data points such

that one only has to ask a question, and a concise and accurate answer drawing from multiple

relevant data sources is presented; other possibilities include truly intelligent agents, or computer

processes that act on behalf of humans, and the ability to take vast stores of seemingly

disparate data and recombine them into new information. The net effect of this development will

be greater potential for knowledge creation and discovery and ultimate reusability of data.

With some recent news reporting on the Semantic Web, one might believe the concept to

be fairly new. In fact, Berners-Lee made the statement quoted above in 1999, making the Semantic Web almost ten years old. Despite this, only a handful of Semantic Web implementations exist.

Among the young implementations are such sites as GeoNames (http://geonames.org) and Twine

(http://twine.com). Until recently, even search engines, which could benefit the most from the Semantic Web, had steered clear; in March 2008, Yahoo! announced its intention to add Semantic Web capabilities to its own search engine (Arrington, 2008). In an official blog post regarding

this announcement, Yahoo! Search Product Management Director Amit Kumar suggested that

development and adoption of the Semantic Web has lagged because “[w]ithout a killer semantic

web app for consumers, site owners have been reluctant to support standards like RDF, or even

microformats” (Kumar, 2008). Thus the next step toward the realization of the Semantic

Web is the creation of applications that pave the way.

Components of the Semantic Web

XML, XML Schema, and RDF

XML, or Extensible Markup Language, provides the basic syntactical elements to

describe the resources on the Semantic Web, but provides no indication of any meanings

associated with these resources. XML Schema provides structural guidelines for the documents

that describe resources (Bray et al., 2006).

The specification that confers meaning on resources is the Resource Description Framework (RDF), a language for describing Semantic Web resources. RDF defines elemental attributes of resources, or the sets of attributes that belong to a resource or a class of resources. It is through RDF that resources also gain some annotation about basic relationships with other resources. Assertions in RDF are called triples, and they consist of subject-predicate-object groupings (Manola & Miller, 2004). For instance, consider two resources, a poem

called “The Waste Land” and an author named T.S. Eliot. In an RDF triple, one could make the

assertion that the poem called “The Waste Land” has the author T.S. Eliot. This establishes a

basic relationship between a poem and an author, both resources. Further, each of these

resources has a number of attributes to describe them. The poem resource has a name and any

other information that might be useful in distinguishing this poem from a group of poems. The

author resource also has a name, in this case the name of a person; since person is another

common resource type, additional information could be linked via the author’s person resource.

In this way, RDF helps to build out a basic structure of the Semantic Web.
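
As an illustration of how such a triple might be written in practice, consider the following sketch. It uses the Python rdflib library and an invented http://example.org/ namespace, neither of which is prescribed by RDF itself; the property names (hasAuthor, hasTitle, hasName) are likewise made up for the example.

```python
from rdflib import Graph, Literal, Namespace

# Invented namespace for this illustration; any URI base would do.
EX = Namespace("http://example.org/")

g = Graph()
g.bind("ex", EX)

poem = EX.TheWasteLand
author = EX.TSEliot

# The triple: subject (the poem), predicate (hasAuthor), object (the author).
g.add((poem, EX.hasAuthor, author))

# Each resource also carries its own descriptive attributes.
g.add((poem, EX.hasTitle, Literal("The Waste Land")))
g.add((author, EX.hasName, Literal("T.S. Eliot")))

# Serialize the small graph as Turtle (rdflib 6+ returns a string here).
print(g.serialize(format="turtle"))
```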

RDFS and OWL

Extending RDF is the RDF Schema (RDFS) specification. RDFS allows a richer

ontological vocabulary to be constructed, and defines the domain and range of particular

assertions. While RDF is primarily concerned with elemental resources, RDFS is capable of

expressing resource classes as resources in their own right. Classes and subclasses have

members and are members of other classes, and RDFS describes the relationships between these

classes. This establishes a hierarchy upon the RDF network.

The Web Ontology Language (OWL) has largely superseded RDFS as the language for authoring ontologies, though its core features are much the same. It makes assertions about class membership and class relationships via axioms dictating any constraints (McGuinness & Van Harmelen, 2004). A reasonable assertion based on the poem, author, and person example

could say that “All authors are also people.” This simple statement establishes that every

resource that is an author is also a person resource; thus author is a subclass of person (note that

no reverse assertion can be made here; not every person is an author, which is a reasonable

approach in most circumstances). It is through such assertions that the Semantic Web gains its

usefulness, since arbitrary connections between resources can be explored in the pursuit of new

knowledge.
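
A minimal sketch of how the "all authors are also people" assertion could be stated, again using rdflib and the invented example.org namespace; RDFS subclassing is shown here for brevity, though the equivalent statement can be made in OWL.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/")
g = Graph()
g.bind("ex", EX)

# Declare two classes and assert that every Author is also a Person.
g.add((EX.Person, RDF.type, RDFS.Class))
g.add((EX.Author, RDF.type, RDFS.Class))
g.add((EX.Author, RDFS.subClassOf, EX.Person))

# An instance: T.S. Eliot is typed only as an Author here; a reasoner
# applying RDFS semantics would conclude that he is a Person as well.
# The reverse does not hold: a Person is not necessarily an Author.
g.add((EX.TSEliot, RDF.type, EX.Author))
```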

SPARQL

As all of the above specifications can be ultimately reduced to an RDF document once

instantiated, it is natural that there be some way to ask questions of the data represented by the

various resources and assertions. SPARQL (which stands for the recursively named SPARQL

Protocol and RDF Query Language) is the query language invented for this purpose. Given an

ontology suited for the particular resources, one could construct a query to find the missing parts of an RDF triple. SPARQL ties all of the core specifications together so that meaningful information can be assembled in an easily understood syntax (Prud'Hommeaux & Seaborne, 2008).
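
Continuing the hypothetical poem and author graph from above, the question "who wrote The Waste Land?" might be posed as a SPARQL query through rdflib roughly as follows (the ex: property names remain invented for illustration):

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.TheWasteLand, EX.hasTitle, Literal("The Waste Land")))
g.add((EX.TheWasteLand, EX.hasAuthor, EX.TSEliot))
g.add((EX.TSEliot, EX.hasName, Literal("T.S. Eliot")))

# "Who is the author of the poem titled 'The Waste Land'?"
query = """
PREFIX ex: <http://example.org/>
SELECT ?name WHERE {
    ?poem ex:hasTitle "The Waste Land" ;
          ex:hasAuthor ?author .
    ?author ex:hasName ?name .
}
"""

for row in g.query(query):
    print(row.name)  # prints: T.S. Eliot
```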

Putting It All Together

All of this may seem to be academic so far. Indeed, the dearth of usable Semantic Web

applications seems to reinforce this idea. What is missing is a methodology for turning this loose

on any data set. The challenges in implementing Semantic Web specifications in the first place

mostly involve where to start. For instance, it is all well and good to suggest that content creators

go forth and add semantics to their content, but it does nothing for the vast troves of content

already in existence. Nor does it address the very real problem that most publishing software for

the Web (or, for that matter, any other delivery platform) has no semantic capability, meaning

that anyone who even wishes to publish semantically annotated documents to the Web must do

so manually or not at all. Given no incentive on the part of content creators to add semantic

annotation to their content, past, present or future, and no ready mechanism to ensure its

automatic annotation, it is no wonder the Semantic Web has not materialized yet.

These challenges can be overcome. The biggest challenges are how to determine

annotation for existing documents and how to ensure automatic annotation going forward. A

number of projects aimed at these challenges already exist. For conversion of existing content, MIT’s

SIMILE (Semantic Interoperability of Metadata and Information in unLike Environments)

project has a number of programs available. “SIMILE seeks to enhance inter-operability among

digital assets, schemata/vocabularies/ontologies, metadata, and services.” (SIMILE:About, 2007)

A related set of projects, called Microformats, also aims to lower the burden of content

annotation by defining a number of RDF-compatible data formats (HTML representations of the vCard and iCal specifications, for instance). Microformats are “a set of simple, open data formats built upon existing and widely adopted standards” (About microformats, n.d.). These efforts are predicated

on the idea that certain kinds of content are already annotated in standardized ways, such as

when they conform to some published RFC specification (email message headers are a perfect example). However, these are still limited to published specifications, and they do not take into

account any heuristic information about non-standard content, such as blocks of free text.

One option for adding semantic information to existing content is to explore text mining

in such a way that previously unspecified details can be programmatically determined. In reality,

this entails a statistical breakdown of a document’s words and how the word distribution and

terminology fit with a pre-defined set of word clusters. The goal is automatic classification.

This approach was documented in 1999 by a group at the University of Wolverhampton, UK,

who proposed to use the Dewey Decimal Classification (DDC) for automatic classification of Web

documents, an approach that should work in all but the most specialized fields. In their own

words: "DDC is considered appropriate because it is a universal classification scheme covering

all subject areas and geographically global information. It is familiar to anyone accustomed to

using a library and has multilingual scope. The hierarchical nature enables the users of a search

engine to refine their search from rough classifications to increasingly more accurate ones."

(Jenkins, Jackson, Burden, & Wallis, 1999)
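
A toy illustration of the underlying idea, not the Wolverhampton system itself: count how a document's words overlap with keyword clusters standing in for classification headings, and pick the best-scoring heading. The clusters below are invented; a real classifier would derive them from DDC or another scheme and use far more sophisticated statistics.

```python
from collections import Counter
import re

# Invented keyword clusters standing in for classification headings.
CLUSTERS = {
    "Recreation & parks": {"park", "camping", "fishing", "swimming", "trail"},
    "Dining": {"restaurant", "menu", "cuisine", "dining", "chef"},
    "Lodging": {"hotel", "motel", "reservation", "suite", "lodging"},
}

def classify(text: str) -> str:
    """Return the heading whose keywords best match the document's words."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {
        label: sum(words[w] for w in keywords)
        for label, keywords in CLUSTERS.items()
    }
    return max(scores, key=scores.get)

print(classify("The state park allows fishing and swimming on marked trails."))
# prints: Recreation & parks
```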

Such an automated classification system makes possible the gathering of domain-specific

information about a given document, and its efficacy in this regard is limited only by the skill

with which the underlying ontology has been authored. Further, if such a system can perform

classification against defined domains, it should be capable of performing such classification

against any other arbitrary but well-defined ontology. Lists of domain keywords can help to

establish links between documents that, while not necessarily falling within the same domain,

nevertheless may have some overlap between them. This fusion of knowledge from multiple

domains could fuel knowledge discovery and creation.

At this point, some thoughts about feasibility are in order. Given that few or no existing

publishing systems perform any semantic markup on their documents, the whole exercise may

still appear to be infeasible at worst and costly at best. It is not trivial to retrofit publishing

systems with automatic classifiers, nor is it trivial to have content owners retroactively process

their published documents. If one were creating a semantic search appliance for use inside a

company where the document volume is more manageable, such an undertaking would be

possible. The proposed system, however, while relying on a tightly defined domain, is not

intended for the internal use of any company. It is much more akin to one of the large search engines, such as Google or Yahoo!. Therefore it makes the most sense to equip the

automatic classification system with some mechanism for discovering content that it can process

and to which it can add its semantic markup. Google does this by following links within

documents; since this system is not a general purpose search application, such an approach may

not be appropriate. Adding a user-driven social networking approach, however, is.

Extending the Core: Wisdom of the Masses and FOAF

Social networking sites such as Digg (http://digg.com), Facebook (http://facebook.com),

LinkedIn (http://linkedin.com), and a host of others have shown that large groups of people are

often good at making certain kinds of decisions. Digg is probably the most exemplary and most

applicable in this case, as it shows how content can be added once and moderated by a large

group of people. It is certainly possible for such a system, like any other, to be subverted by special interests, but such subversion appears to be the exception rather than the rule.

Thus the semantic search appliance should have some way for people to add information and

provide input on the efficacy of any other information submitted to the site, trusting that once the

normative forces of large crowds have taken hold, such information will be as reliable as possible

(for additional reference, one need only turn to Wikipedia (http://www.wikipedia.org), the

online encyclopedia, as a perfect example of how crowdsourcing can work well). Domain

experts will soon emerge, and information that they author or add will eventually be regarded as

trustworthy.

Next, the other popular component of social networks is the social aspect itself. The hallmark of social networks is the ability to build networks of people based on shared interests, shared abilities, or any other ad hoc grouping. On the Semantic Web, this concept is captured by the Friend of a Friend (FOAF) vocabulary. Such networks serve to reinforce the filters created by user submit/user moderate

mechanisms. In a sense, these groupings are ad hoc ontologies of people or other resources that

work to build the so-called Web of Trust. So a semantic search application should also possess

social network capabilities to help flesh out the gossamer strands between resources.
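
A brief sketch of how such a grouping might be expressed with the FOAF vocabulary in rdflib; the two people and their shared interest are invented for illustration.

```python
from rdflib import Graph, Literal, Namespace

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
EX = Namespace("http://example.org/")

g = Graph()
g.bind("foaf", FOAF)

alice, bob = EX.alice, EX.bob

# Two people, a "knows" link between them, and a shared topic of interest.
g.add((alice, FOAF.name, Literal("Alice")))
g.add((bob, FOAF.name, Literal("Bob")))
g.add((alice, FOAF.knows, bob))
g.add((alice, FOAF.topic_interest, EX.TexasStateParks))
g.add((bob, FOAF.topic_interest, EX.TexasStateParks))
```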

The system, then, looks like this. With XML as its core syntax and SPARQL as its

primary query language, all resources (people, documents, events, places, or any other arbitrary

thing) are either indexed automatically or added intentionally, processed and annotated with

semantic metadata (RDF) according to various defined ontologies (OWL), and their ultimate fate is

determined by user moderation (a simple vote tally is all that should be required). This
automates the building of ad hoc ontologies based on formal ontologies, shared interests and the

Web of Trust, and exposes domain expertise in the process via FOAF.
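
In outline, the workflow just described might be sketched as follows. This is illustrative pseudocode in Python form; the class and function names are placeholders rather than an existing system, and the classifier is a stand-in for the automatic classification discussed earlier.

```python
from dataclasses import dataclass, field

@dataclass
class Resource:
    uri: str
    triples: list = field(default_factory=list)  # RDF annotations for the resource
    votes_up: int = 0
    votes_down: int = 0

    @property
    def score(self) -> int:
        # A simple vote tally decides how prominently the resource is shown.
        return self.votes_up - self.votes_down

def classify(text: str) -> str:
    """Placeholder for the automatic classifier sketched earlier."""
    return "Recreation & parks" if "park" in text.lower() else "General"

def ingest(uri: str, document_text: str) -> Resource:
    """Index a discovered or user-submitted document and annotate it."""
    resource = Resource(uri=uri)
    resource.triples.append((uri, "ex:classifiedAs", classify(document_text)))
    # ...further annotation against the relevant OWL ontologies would go here...
    return resource

def moderate(resource: Resource, approve: bool) -> None:
    """Record one user's judgment on the resource."""
    if approve:
        resource.votes_up += 1
    else:
        resource.votes_down += 1
```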

Case: A Semantic Search Application for U.S. Tourism

Armed with this methodology, consider something more concrete. Figures for U.S.

tourism earnings from all sources are not readily available, but the World Tourism

Organization’s 2007 report on tourism estimates U.S. figures for internationally-sourced tourism

at some USD 85.7 billion for 2006 (Tourism Highlights 2007 Edition, 2007). Since this

represents only international receipts, the total figure from all sources is undoubtedly far higher

than this. Thus any system that seeks to improve the ability of people to make connections

between different kinds of tourism data should be economically worthwhile.

What kinds of questions make sense for U.S. tourism? First, and most obviously, people who are seeking to travel may ask questions about where to go in the first place,

where to stay when they get there, where they can eat, where they can buy supplies or simply

shop, what events are happening while they will be there, and anything else that might fit their

interests better. To date, there is no single application or site capable of pulling together all of

this information and presenting it to someone asking these kinds of questions. The process

involves multiple visits either to a search engine or to the various websites that serve the

respective silos of information, and the tourist may still be left with incomplete information or a

sense that he or she could have gotten a better deal. Even asking a simple question like “What is

the nearest state park to my house?” yields less than satisfactory results. It can be answered, but

in multiple steps and with some guesswork. Add to this question any other attributes to answer,

and the number of steps before an answer is reached begins to increase rapidly. For instance,

one might rephrase the preceding question as “What state park is nearest to my house and allows
fishing and swimming, and what restaurants are within ten miles of the park?” Needless to say,

answering such a question, while certainly doable via a number of websites, nevertheless

presents a challenge. And yet there is value in being able to do so, especially for restaurant

owners within that radius.

The first step toward being able to answer such questions is to gather the data from its

various repositories. As it happens, GeoNames provides geotagged information that can answer

the question of where a state park is located (although querying to find the nearest one to another

point can only be done visually on the GeoNames site at present), and Google (via its own maps

service) can piece together which restaurants are near the park (again, establishing a specific

radius does not seem to be possible). What is missing from easily searchable data, however, is

any information regarding the services available at a given state park. As an example, the State

of Texas does maintain a list of state parks (TPWD: Find a park, n.d.), and each state park has a

standard listing. Each listing also includes a link to a PDF document where the

services information resides. These documents are good candidates for the automatic

classification system to index, as their consistent structure makes it fairly easy to develop an

ontology for the document type. So the proposed system for this case will begin with these kinds

of documents where possible, searching for instances of attributes that can provide semantic

meaning to the contents. Other such documents will be sought out and added to the index, either

by the application administrator or by end users. Since some sensible ontologies have to be

developed for each type, and there are numerous structures to contend with, this presents the

greatest challenge. This is the point at which reliance on end users becomes necessary (but see

below for issues that may arise from this approach).
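
As a sketch of what the attribute search might look like for a single park listing, the text extracted from a services document could be scanned for keywords tied to activities in the ontology and turned into RDF assertions about that park. The property and class names (ex:allowsActivity, ex:Fishing, and so on) are invented for illustration.

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")

# Activities the ontology knows about, mapped to keywords to look for
# in the extracted text of a park's services document.
ACTIVITY_KEYWORDS = {
    EX.Fishing: ("fishing", "angling"),
    EX.Swimming: ("swimming", "swim area"),
    EX.Camping: ("camping", "campsite"),
}

def annotate_park(g: Graph, park_uri, services_text: str) -> None:
    """Add an ex:allowsActivity triple for every activity found in the text."""
    text = services_text.lower()
    for activity, keywords in ACTIVITY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            g.add((park_uri, EX.allowsActivity, activity))

g = Graph()
annotate_park(g, EX.SomeStatePark,
              "Activities include swimming, fishing, and primitive camping.")
```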


After the process of indexing these documents has begun, the user moderation process

can begin. Users from a given region are often aware of attractions and events of which people from outside the region are largely unaware, so it makes sense to let them publish information about

these things. As new users discover such information, they will be allowed to agree or disagree

with the information presented (it could very well be outdated) and publish their own version of what they think is correct. In this way duplication is not necessarily prohibited,

but the normative forces of the knowledgeable users will act to promote the most accurate or

most useful information. As new information is constantly added, indexed, tagged, moderated,

and consumed, the system will become better at answering certain kinds of questions. Especially

useful to business owners, advertisers and end users is the wealth of ancillary information

returned around even a simple search, including service and retail listings, restaurants, suppliers,

and event listings.

Turning back to the complex question from earlier in the case, the system now has some

capability to provide an answer. The question “What state park is nearest to my house and

allows fishing and swimming, and what restaurants are within ten miles of the park?” can now be

answered with a minimum of information provided by the end user (in this case the user need

only specify, in addition to the search criteria, the location representing his/her house). Further,

because the system has gathered some information that also has data connections nearby, options

the user may not have considered can be presented alongside the originally sought

information.
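
Assuming the indexing and annotation steps above have produced properties along these lines, the complex question might reduce to a single SPARQL query. The ex: property names and the precomputed distance from the user's home are assumptions made for illustration, not part of any existing vocabulary; in a real system the distance calculation would likely come from a geospatial extension rather than plain SPARQL.

```python
from rdflib import Graph

g = Graph()  # in practice, populated by the indexing and annotation pipeline

# Nearest parks that allow fishing and swimming, with their nearby restaurants.
query = """
PREFIX ex: <http://example.org/>
SELECT ?park ?restaurant WHERE {
    ?park a ex:StatePark ;
          ex:allowsActivity ex:Fishing , ex:Swimming ;
          ex:distanceFromHomeMiles ?miles ;
          ex:nearbyRestaurant ?restaurant .
}
ORDER BY ?miles
LIMIT 10
"""

for park, restaurant in g.query(query):
    print(park, restaurant)
```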

Such options give advertisers of services and goods that are in any way related to this sought information the opportunity to market to someone who is more likely than a casual browser to be looking for something they offer. As advertisers continually seek ways to
narrow the focus of their marketing campaigns for cost effectiveness, a system that delivers a

higher percentage of a target demographic should find wide appeal. Additionally, interesting

analytics can be gleaned from looking at the search and moderation trends, which could inform

future marketing campaigns.

The final consideration for a Semantic Web application is managing user trust. Since the

application in question is intended for public use, the trustworthiness of any user-submitted

information cannot be fully verified. In a perfect scenario, no user would seek to fraudulently

promote any submitted resource, or, if such subversion did occur, the system would be normative

enough to keep such activity in check. In most cases, this should be true. Given a large enough

user base, subversive activity should occur rarely and its effects should be brief in duration.

However, there may be occasions where this is not the case, especially where less well known or

less popular resources are concerned. In the case of the tourism search application, a

concentrated effort to promote certain businesses (the most likely occurrence) could undermine

the perception of trustworthiness that the site has built over time. Thus there should be some

mechanism by which such activity can be reported and dealt with, a sort of “management by

exception” approach. If advertisers are allowed to buy prominence in any listings of related

information, then those advertisements should be clearly labeled as such. In this case, the key to

building user trust is to create the perception that errors will be weeded out and corrected over

time. Whether this means allowing collaborative modification of listing information à la

Wikipedia or setting up a system to deal with iteratively corrected duplicates is left to the

implementer of the system in question. Some combination of these could be used, but under no

circumstances should these mechanisms, together with user moderation, be allowed to present misleading information

to the end user.


Conclusion

For just one subset of one very large industry, it has been demonstrated that the Semantic

Web holds great potential. From this, it is easy to see how it could be applied to virtually any

domain. It will allow the drawing of knowledge from disparate data sources toward the goal of

machine-understandable information, whereupon machines can take action in ways that only

humans have been able to do. With these guidelines, new applications capable of answering

almost any question, discovering new knowledge, and informing knowledge creation can be

developed and implemented.


References

About microformats. (n.d.). Retrieved Apr. 12, 2008, from http://microformats.org/about/.

Arrington, M. (2008, Mar. 13). Yahoo Embraces The Semantic Web - Expect The Internet To

Organize Itself In A Hurry. Retrieved Apr. 12, 2008, from

http://www.techcrunch.com/2008/03/13/yahoo-embraces-the-semantic-web-expect-the-

web-to-organize-itself-in-a-hurry/.

Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E., Yergeau, F., & Cowan, J. (2006, August

16). Extensible Markup Language (XML) 1.1 (Second Edition). Retrieved Apr. 14, 2008,

from http://www.w3.org/TR/2006/REC-xml11-20060816/.

Herman, I., & Stephens, S. (2007, December 4). Semantic Web Education and Outreach Interest

Group Case Studies and Use Cases. Retrieved Feb. 13, 2008, from

http://www.w3.org/2001/sw/sweo/public/UseCases/.

Jenkins, C., Jackson, M., Burden, P., & Wallis, J. (1999). Automatic RDF Metadata Generation

for Resource Discovery. Computer Networks: The International Journal of Computer

and Telecommunications Networking, 31(11-16), 1305-1320. Retrieved Apr. 12, 2008,

from http://wotan.liu.edu/docis/lib/goti/rclis/dbl/connet/(1999)31%253A11%252F16%253C1305%253AARMGFR%253E/people.cs.uct.ac.za%252F~hbrown%252FCAKMS%252Farticles%252FAutoma.pdf.

Kumar, A. (2008, Mar. 13). Yahoo! Search Blog: The Yahoo! Search Open Ecosystem. Retrieved

Apr. 12, 2008, from http://www.ysearchblog.com/archives/000527.html.

Manola, F., & Miller, E. (2004, Feb. 10). RDF Primer. Retrieved Apr. 14, 2008, from

http://www.w3.org/TR/rdf-primer/.

McGuinness, D., & Van Harmelen, F. (2004, Feb. 10). OWL Web Ontology Language Overview. Retrieved Apr. 14, 2008, from http://www.w3.org/TR/owl-features/.

Prud'Hommeaux, E., & Seaborne, A. (2008, January 15). SPARQL Query Language for RDF.

Retrieved Feb. 13, 2008, from http://www.w3.org/TR/rdf-sparql-query/.

SIMILE:About - SIMILE. (2007, Feb. 27). Retrieved Apr. 12, 2008, from

http://simile.mit.edu/wiki/SIMILE:About.

Semantic Web. (2008, Apr. 11). Retrieved Apr. 14, 2008, from

http://en.wikipedia.org/wiki/Semantic_Web.

Tourism Highlights 2007 Edition. (2007). Retrieved Apr. 12, 2008, from http://unwto.org/facts/eng/pdf/highlights/highlights_07_eng_hr.pdf.
