Aaron B. Helton
MCIS 6309.01
Abstract
Tim Berners-Lee described a vision of the World Wide Web in which its information
could be understandable to both humans and machines, enabling machines to act on that
intelligence in ways that only humans had been able to do before. In practical terms, this means
being able to answer specific questions informed by disparate data sources, provide trustworthy
and meaningful connections between data, and enable the continual reuse of data in any way
that can be conceived. While a number of tools and specifications have been created to facilitate
the implementation of the Semantic Web, very few practical implementations have appeared to
date. Further, most of the current use cases listed on the World Wide Web Consortium's (W3C)
site indicate that it is being implemented in concentrated, domain-specific ways (Herman, I., &
Stephens, S., 2007). This paper demonstrates a methodology by which one can approach the
creation of new Semantic Web applications for the synthesis and creation of knowledge. It uses
as its case example a small subset of the United States tourism industry.
Before any exploration is made on how best to achieve a Semantic Web implementation,
one must understand what the Semantic Web is and how it can be of use. Tim Berners-Lee
described the Semantic Web as a Web in which computers “become capable of analyzing all the
data on the Web…machines talking to machines.” (“Semantic Web,” 2008) What this means in
practical terms is that the Semantic Web consists of data that has been annotated with
categorical and descriptive metadata, which allows it to be used and reused in countless forms.
In theoretical terms, it represents a shift from proprietary incompatible data formats to a unified
and predictable data format (this includes the metadata) with programmatically determinable
characteristics, such as categories, connections to other data, and rules describing how to process
the data. In time this will allow machines to form the network paths between data points such
that one only has to ask a question, and a concise and accurate answer drawing from multiple
relevant data sources is presented; other possibilities include truly intelligent agents, or computer
processes that act on the behalf of humans, and the ability to take vast stores of seemingly
disparate data and recombine them into new information. The net effect of this development will
be greater potential for knowledge creation and discovery and ultimate reusability of data.
With some recent news reporting on the Semantic Web, one might believe the concept to
be fairly new. In fact, Berners-Lee made the statement quoted above in 1999, making the Semantic Web
almost ten years old. Despite this, only a handful of Semantic Web implementations exist.
Among the young implementations are such sites as GeoNames (http://geonames.org) and Twine
(http://twine.com). Until recently, even search engines, which could benefit the most from the
Semantic Web, had steered clear; Yahoo! has now announced its intention to add Semantic Web
capabilities to its own search engine. (Arrington, M., 2008) In an official blog post regarding
this announcement, Yahoo! Search Product Management Director Amit Kumar suggested that
development and adoption of the Semantic Web have lagged because “[w]ithout a killer semantic
web app for consumers, site owners have been reluctant to support standards like RDF, or even
microformats.” (Kumar, A., 2008) Thus the next step toward the realization of the Semantic
describe the resources on the Semantic Web, but provides no indication of any meanings
associated with these resources. XML Schema provides structural guidelines for the documents
that describe resources. (Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E., Yergeau, F., &
Framework, or RDF. This is a language to describe Semantic Web resources. RDF defines
elemental attributes of resources, or the sets of attributes that belong to a resource or a class of
resources. It is through RDF that resources also gain some annotation about basic relationships
with other resources. Assertions in RDF are called triples, and they consist of subject-predicate-
object groupings. (Manola, F., & Miller, E., 2004) For instance, consider two resources, a poem
called “The Waste Land” and an author named T.S. Eliot. In an RDF triple, one could make the
assertion that the poem called “The Waste Land” has the author T.S. Eliot. This establishes a
basic relationship between a poem and an author, both resources. Further, each of these
resources has a number of attributes that describe it. The poem resource has a name and any
other information that might be useful in distinguishing this poem within a group of poems. The
author resource also has a name, in this case the name of a person; since person is another
common resource type, additional information could be linked via the author’s person resource.
In this way, RDF helps to build out a basic structure of the Semantic Web.
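The poem and author example can be sketched in a few lines of Python. This is only an illustration of the subject-predicate-object idea, not RDF syntax: the identifiers (poem:TheWasteLand, person:TSEliot) and the predicate names (hasAuthor, hasName) are assumptions made for this sketch and do not come from any published vocabulary.

```python
# Each assertion is a (subject, predicate, object) triple.
# Identifiers and predicate names are hypothetical, chosen for illustration.
triples = [
    ("poem:TheWasteLand", "hasAuthor", "person:TSEliot"),
    ("poem:TheWasteLand", "hasName", "The Waste Land"),
    ("person:TSEliot", "hasName", "T.S. Eliot"),
]


def describe(resource, triples):
    """Collect every predicate/object pair asserted about one resource."""
    return {p: o for s, p, o in triples if s == resource}


poem = describe("poem:TheWasteLand", triples)
# Follow the hasAuthor link from the poem to the author's own description.
author = describe(poem["hasAuthor"], triples)
print(author["hasName"])
```

Because the object of one triple can be the subject of another, following the hasAuthor link leads from the poem to everything asserted about its author.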
Extending RDF is the RDF Schema (RDFS) specification. RDFS allows a richer
ontological vocabulary to be constructed, and defines the domain and range of particular
assertions. While RDF is primarily concerned with elemental resources, RDFS is capable of
expressing resource classes as resources in their own right. Classes and subclasses have
members and are members of other classes, and RDFS describes the relationships between these
Web Ontology Language (OWL) has largely superseded RDFS as the language for
authoring ontologies; it keeps the core RDFS features and adds more expressive ones. It makes assertions about class
membership and class relationships via axioms dictating any constraints. (McGuinness, D., &
Van Harmelen, F., 2004) A reasonable assertion based on the poem, person, author example
could say that “All authors are also people.” This simple statement establishes that every
resource that is an author is also a person resource; thus author is a subclass of person (note that
no reverse assertion can be made here; not every person is an author, which is a reasonable
approach in most circumstances). It is through such assertions that the Semantic Web gains its
usefulness, since arbitrary connections between resources can be explored in the pursuit of new
knowledge.
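The "all authors are also people" assertion can be illustrated with a small Python sketch. It is not OWL and not a reasoner; it only shows how a subclass statement lets membership in one class imply membership in another, and why the reverse does not hold. The class names, the resource names, and the JaneReader instance are hypothetical.

```python
# Hypothetical class hierarchy: each class maps to its direct superclass.
SUBCLASS_OF = {"Author": "Person"}

# Hypothetical instance data: resource -> the class asserted for it.
ASSERTED_TYPE = {"TSEliot": "Author", "JaneReader": "Person"}


def classes_of(resource):
    """Return the asserted class plus every superclass reachable from it."""
    found = []
    cls = ASSERTED_TYPE.get(resource)
    while cls is not None:
        found.append(cls)
        cls = SUBCLASS_OF.get(cls)
    return found


def is_a(resource, cls):
    """True if the resource belongs to cls directly or through a subclass."""
    return cls in classes_of(resource)


print(is_a("TSEliot", "Person"))    # an author is also a person
print(is_a("JaneReader", "Author")) # a person is not necessarily an author
```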
SPARQL
As all of the above specifications can be ultimately reduced to an RDF document once
instantiated, it is natural that there be some way to ask questions of the data represented by the
various resources and assertions. SPARQL (which stands for the recursively named SPARQL
Protocol and RDF Query Language) is the query language invented for this purpose. Given an
ontology suited for the particular resources, one could construct a query that leaves some part of
an RDF triple unknown and asks for the values that fill it. SPARQL ties all of the core specifications together so that some meaningful
information can be put together easily and in an easily understood syntax. (Prud'Hommeaux, E., &
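The idea of a query that asks for the unknown part of a triple can be sketched in Python as follows. This is not SPARQL and does not use any SPARQL library; it only mimics a single triple pattern, with terms beginning with "?" standing for variables. The data and predicate names are hypothetical.

```python
# Hypothetical triples; identifiers and predicates are illustrative only.
triples = [
    ("poem:TheWasteLand", "hasAuthor", "person:TSEliot"),
    ("poem:TheWasteLand", "hasName", "The Waste Land"),
    ("person:TSEliot", "hasName", "T.S. Eliot"),
]


def match(pattern, triples):
    """Return one dict of variable bindings per triple matching the pattern.

    Terms starting with '?' are variables; all other terms must match exactly.
    """
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable used twice must bind to the same value both times.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            results.append(binding)
    return results


# "Who is the author of The Waste Land?"
print(match(("poem:TheWasteLand", "hasAuthor", "?who"), triples))
```

Any position of the pattern may be left as a variable, so the same function can also list every resource that has a name by matching ("?thing", "hasName", "?name").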
All of this may seem to be academic so far. Indeed, the dearth of usable Semantic Web
applications seems to reinforce this idea. What is missing is a methodology for turning this loose
on any data set. The challenges in implementing Semantic Web specifications in the first place
mostly involve where to start. For instance, it is well and good to suggest that content creators
go forth and add semantics to their content, but it does nothing for the vast troves of content
already in existence. Nor does it address the very real problem that most publishing software for
the Web (or, for that matter, any other delivery platform) has no semantic capability, meaning
that anyone who even wishes to publish semantically annotated documents to the Web must do
so manually or not at all. Given no incentive on the part of content creators to add semantic
annotation to their content, past, present or future, and no ready mechanism to ensure its
automatic annotation, it is no wonder the Semantic Web has not materialized yet.
These challenges can be overcome. The biggest challenges are how to determine
annotation for existing documents and how to ensure automatic annotation going forward. A
number of projects aimed at these already exist. For conversion of existing content, MIT’s SIMILE
project has a number of programs available. “SIMILE seeks to enhance inter-operability among
A related set of projects, called Microformats, also aims to lower the burden of content
annotation by defining a number of RDF-compatible data formats (hCard and hCalendar, the HTML
counterparts of the vCard and iCalendar specifications, for instance). Microformats are “a set of simple, open data formats built upon
existing and widely adopted standards.” (About microformats, n.d.) These efforts are predicated
on the idea that certain kinds of content are already annotated in standardized ways, such as
when they conform to some published RFC specification (SMTP, or email, headers are a perfect
example). However, these are still limited to published specifications, and they do not take into
account any heuristic information about non-standard content, such as blocks of free text.
One option for adding semantic information to existing content is to explore text mining
in such a way that previously unspecified details can be programmatically determined. In practice,
this entails a statistical breakdown of a document’s words and how the word distribution and
terminology fit with a pre-defined set of word clusters. The goal is automatic classification.
This approach was documented in 1999 by a group at the University of Wolverhampton, UK,
who proposed to use the Dewey Decimal Classification (DDC) for automatic classification of Web
documents, an approach that should work in all but the most specialized fields. In their own
all subject areas and geographically global information. It is familiar to anyone accustomed to
using a library and has multilingual scope. The hierarchical nature enables the users of a search
engine to refine their search from rough classifications to increasingly more accurate ones."
(Jenkins, C., Jackson, M., Burden, P., & Wallis, J., 1999)
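A minimal sketch of classification against pre-defined word clusters is given below, in Python. It is not the method of Jenkins et al. and does not use DDC; the two clusters and their labels are hypothetical stand-ins, and the scoring is a plain count of cluster words found in the document.

```python
import re
from collections import Counter

# Hypothetical word clusters keyed by a class label. A real system would
# derive these from a classification scheme rather than hard-code them.
CLUSTERS = {
    "recreation": {"park", "fishing", "swimming", "camping", "trail"},
    "dining": {"restaurant", "menu", "chef", "cuisine", "dinner"},
}


def classify(text):
    """Return the label whose cluster words occur most often, or None."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {
        label: sum(words[w] for w in cluster)
        for label, cluster in CLUSTERS.items()
    }
    best = max(scores, key=scores.get)
    # A document sharing no words with any cluster stays unclassified.
    return best if scores[best] > 0 else None


print(classify("The park offers fishing, swimming and a camping area."))
```

A production classifier would weight terms and normalize for document length; the raw count is used here only to keep the idea visible.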
Such an automated classification system makes possible the gathering of domain specific
information about a given document, and its efficacy in this regard is limited only by the skill
with which the underlying ontology has been authored. Further, if such a system can perform
against any other arbitrary but well-defined ontology. Lists of domain keywords can help to
establish links between documents that, while not necessarily falling within the same domain,
nevertheless may have some overlap between them. This fusion of knowledge from multiple
At this point, some thoughts about feasibility are in order. Given that few or no existing
publishing systems perform any semantic markup on their documents, the whole exercise may
still appear to be infeasible at worst and costly at best. It is not trivial to retrofit publishing
systems with automatic classifiers, nor is it trivial to have content owners retroactively process
their published documents. If one were creating a semantic search appliance for use inside a
company where the document volume is more manageable, such an undertaking would be
possible. The proposed system, however, while relying on a tightly defined domain, is not
intended for the internal use of any company. It is more closely akin to the large search
engines, such as Google or Yahoo!. Therefore it makes the most sense to equip the
automatic classification system with some mechanism for discovering content that it can process
and to which it can add its semantic markup. Google does this by following links within
documents; since this system is not a general purpose search application, such an approach may
LinkedIn (http://linkedin.com), and a host of others have shown that large groups of people are
often good at making certain kinds of decisions. Digg is probably the most exemplary and most
applicable in this case, as it shows how content can be added once and moderated by a large
group of people. It is certainly possible, like any other system, for something like this to be
subverted by special interests, but such subversion appears to be the exception and not the rule.
Thus the semantic search appliance should have some way for people to add information and
provide input on the efficacy of any other information submitted to the site, trusting that once the
normative forces of large crowds have taken hold, such information will be as reliable as possible
(for additional reference, one need only turn to Wikipedia (http://www.wikipedia.org), the
Online Encyclopedia, as a perfect example of how crowdsourcing can work well). Domain
experts will soon emerge, and information that they author or add will eventually be regarded as
trustworthy.
Next, the other popular component of social networks is the actual social aspect. In fact,
the hallmark of social networks is the ability to build networks of people based on shared
interests, shared ability, or any other ad hoc grouping. The general concept is known as Friend
of a Friend, or FOAF. Such networks serve to reinforce the filters created by user submit/user moderate
mechanisms. In a sense, these groupings are ad hoc ontologies of people or other resources that
work to build the so-called Web of Trust. So a semantic search application should also possess
social network capabilities to help flesh out the gossamer strands between resources.
The system, then, looks like this. With XML as its core syntax and SPARQL as its
primary query language, all resources (people, documents, events, places, or any other arbitrary
thing) are either indexed automatically or added intentionally, processed and annotated with
semantic metadata (RDF) à la various defined ontologies (OWL), and their ultimate fate is
determined by user moderation (a simple vote tally is all that should be required). This
automates the building of ad hoc ontologies based on formal ontologies, shared interests and the
Web of Trust, and exposes domain expertise in the process via FOAF.
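The "simple vote tally" can be sketched as follows in Python. The class name, method names, and listing identifiers are hypothetical; the one design assumption made here is that each user holds a single vote per resource, so a later vote replaces an earlier one.

```python
from collections import defaultdict


class Moderation:
    """Keeps one agree/disagree vote per user for each submitted resource."""

    def __init__(self):
        self._votes = defaultdict(dict)  # resource -> {user: +1 or -1}

    def vote(self, resource, user, agree):
        # Voting again overwrites the same user's earlier vote.
        self._votes[resource][user] = 1 if agree else -1

    def tally(self, resource):
        return sum(self._votes.get(resource, {}).values())

    def ranked(self):
        """Resources ordered from highest to lowest tally."""
        return sorted(self._votes, key=self.tally, reverse=True)


moderation = Moderation()
moderation.vote("listing-a", "user1", True)
moderation.vote("listing-a", "user2", True)
moderation.vote("listing-b", "user1", False)
print(moderation.ranked())
```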
Armed with this methodology, consider something more concrete. Figures for U.S.
tourism earnings from all sources are not readily available, but the World Tourism
Organization’s 2007 report on tourism estimates U.S. figures for internationally-sourced tourism
at some USD 85.7 billion for 2006. (Tourism Highlights 2007 Edition, 2007) Since this
represents only international receipts, the total figure from all sources is undoubtedly far higher
than this. Thus any system that seeks to improve people's ability to make connections
What kinds of questions make sense for U.S. tourism? First, and these are probably
obvious, people who are seeking to travel may ask questions about where to go in the first place,
where to stay when they get there, where they can eat, where they can buy supplies or simply
shop, what events are happening while they will be there, and anything else that might fit their
interests better. To date, there is no single application or site capable of pulling together all of
this information and presenting it to someone asking these kinds of questions. The process
involves multiple visits either to a search engine or to the various websites that serve the
respective silos of information, and the tourist may still be left with incomplete information or a
sense that he or she could have gotten a better deal. Even asking a simple question like “What is
the nearest state park to my house?” yields less than satisfactory results. It can be answered, but
in multiple steps and with some guesswork. Add any further criteria to this question,
and the number of steps before an answer is reached begins to increase rapidly. For instance,
one might rephrase the preceding question as “What state park is nearest to my house and allows
fishing and swimming, and what restaurants are within ten miles of the park?” Needless to say,
answering such a question, while certainly doable via a number of websites, nevertheless
presents a challenge. And yet there is value in being able to do so, especially for restaurant
The first step toward being able to answer such questions is to gather the data from its
various repositories. As it happens, GeoNames provides geotagged information that can answer
the question of where a state park is located (although querying to find the nearest one to another
point can only be done visually on the GeoNames site at present), and Google (via its own maps
service) can piece together which restaurants are near the park (again, establishing a specific
radius does not seem to be possible). What is missing from easily searchable data, however, is
any information regarding the services available at a given state park. As an example, the State
of Texas does maintain a list of state parks (TPWD: Find a park, n.d.), and each state park has a
standard listing. Also included in each listing is a link to a PDF document where the
services information resides. These documents are good candidates for the automatic
classification system to index, as their consistent structure makes it fairly easy to develop an
ontology for the document type. So the proposed system for this case will begin with these kinds
of documents where possible, searching for instances of attributes that can provide semantic
meaning to the contents. Other such documents will be sought out and added to the index, either
by the application administrator or by end users. Since some sensible ontologies have to be
developed for each type, and there are numerous structures to contend with, this presents the
greatest challenge. This is the point at which reliance on end users becomes necessary (but see
can begin. Users from a given region are often aware of attractions and events of which people from
outside the region are largely unaware, so it makes sense to let them publish information about
these things. As new users discover such information, they will be allowed to agree or disagree
with the information presented (it could very well be outdated) and to publish
what they believe to be correct. In this way duplication is not necessarily prohibited,
but the normative forces of the knowledgeable users will act to promote the most accurate or
most useful information. As new information is constantly added, indexed, tagged, moderated,
and consumed, the system will become better at answering certain kinds of questions. Especially
useful to business owners, advertisers and end users is the wealth of ancillary information
returned around even a simple search, including service and retail listings, restaurants, suppliers,
Turning back to the complex question from earlier in the case, the system now has some
capability to provide an answer. The question “What state park is nearest to my house and
allows fishing and swimming, and what restaurants are within ten miles of the park?” can now be
answered with a minimum of information provided by the end user (in this case the user need
only specify, in addition to the search criteria, the location representing his or her house). Further,
because the system has also gathered information connected to nearby resources, options
the user may not have considered can be presented along with the originally-sought
information.
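Once park services and restaurant locations are available as structured data, the complex question reduces to a filter, a nearest-neighbor search, and a radius check. The Python sketch below shows that reduction under stated assumptions: every park, restaurant, coordinate, and activity listed is hypothetical, the function names are invented for this example, and distance is computed with the haversine great-circle formula rather than by road.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def miles_between(a, b):
    """Great-circle (haversine) distance between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(h))


# Hypothetical records standing in for indexed, annotated resources.
PARKS = [
    {"name": "Park A", "at": (30.10, -97.50), "activities": {"fishing", "swimming"}},
    {"name": "Park B", "at": (30.02, -97.45), "activities": {"hiking"}},
    {"name": "Park C", "at": (31.50, -98.90), "activities": {"fishing", "swimming", "camping"}},
]
RESTAURANTS = [
    {"name": "Diner X", "at": (30.12, -97.55)},
    {"name": "Cafe Y", "at": (30.90, -97.50)},
]


def plan_trip(home, wanted, radius_miles=10):
    """Nearest park offering all wanted activities, plus restaurants near it."""
    candidates = [p for p in PARKS if wanted <= p["activities"]]
    if not candidates:
        return None, []
    park = min(candidates, key=lambda p: miles_between(home, p["at"]))
    nearby = [r["name"] for r in RESTAURANTS
              if miles_between(park["at"], r["at"]) <= radius_miles]
    return park["name"], nearby


print(plan_trip((30.00, -97.40), {"fishing", "swimming"}))
```

In this sample data Park B is closest to the house but is skipped because it does not list fishing or swimming, which is the kind of filtering the user would otherwise do by hand across several sites.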
Such options give advertisers of goods and services related to the information sought
the opportunity to market to someone who is more likely than a casual browser to be looking for
something they offer. As advertisers continually seek ways to
narrow the focus of their marketing campaigns for cost effectiveness, a system that delivers a
higher percentage of a target demographic should have wide appeal. Additionally, interesting
analytics can be gleaned from looking at the search and moderation trends, which could inform
The final consideration for a Semantic Web application is managing user trust. Since the
application in question is intended for public use, the trustworthiness of any user-submitted
information cannot be fully verified. In a perfect scenario, no user would seek to fraudulently
promote any submitted resource, or, if such subversion did occur, the system would be normative
enough to keep such activity in check. In most cases, this should be true. Given a large enough
user base, subversive activity should occur rarely and its effects should be brief in duration.
However, there may be occasions where this is not the case, especially where less well known or
less popular resources are concerned. In the case of the tourism search application, a
concentrated effort to promote certain businesses (the most likely occurrence) could undermine
the perception of trustworthiness that the site has built over time. Thus there should be some
mechanism by which such activity can be reported and dealt with, a sort of “management by
exception” approach. If advertisers are allowed to buy prominence in any listings of related
information, then those advertisements should be clearly labeled as such. In this case, the key to
building user trust is to create the perception that errors will be weeded out and corrected over
time. Whether this means allowing collaborative modification of listing information à la
Wikipedia or setting up a system to deal with iteratively corrected duplicates is left to the
implementer of the system in question. Some combination of these could be used, but under no
circumstances should these mechanisms and the user moderation provide misleading information
For just one subset of one very large industry, it has been demonstrated that the Semantic
Web holds great potential. From this, it is easy to see how it could be applied to virtually any
domain. It will allow the drawing of knowledge from disparate data sources toward the goal of
machine-understandable information, whereupon such machines can take action in ways only
humans have been able to do. With these guidelines, new applications capable of answering
almost any question, discovering new knowledge, and informing knowledge creation can be
Arrington, M. (2008, Mar. 13). Yahoo Embraces The Semantic Web - Expect The Internet To
http://www.techcrunch.com/2008/03/13/yahoo-embraces-the-semantic-web-expect-the-
web-to-organize-itself-in-a-hurry/.
Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E., Yergeau, F., & Cowan, J. (2006, August
16). Extensible Markup Language (XML) 1.1 (Second Edition). Retrieved Apr. 14, 2008,
from http://www.w3.org/TR/2006/REC-xml11-20060816/.
Herman, I., & Stephens, S. (2007, December 4). Semantic Web Education and Outreach Interest
Group Case Studies and Use Cases. Retrieved Feb. 13, 2008, from
http://www.w3.org/2001/sw/sweo/public/UseCases/.
Jenkins, C., Jackson, M., Burden, P., & Wallis, J. (1999). Automatic RDF Metadata Generation
from
http://wotan.liu.edu/docis/lib/goti/rclis/dbl/connet/(1999)31%253A11%252F16%253C1
305%253AARMGFR%253E/people.cs.uct.ac.za%252F~hbrown%252FCAKMS
%252Farticles%252FAutoma.pdf.
Kumar, A. (2008, Mar. 13). Yahoo! Search Blog: The Yahoo! Search Open Ecosystem. Retrieved
Manola, F., & Miller, E. (2004, Feb. 10). RDF Primer. Retrieved Apr. 14, 2008, from
http://www.w3.org/TR/rdf-primer/.
McGuinness, D., & Van Harmelen, F. (2004, Feb. 10). OWL Web Ontology Language Overview
Prud'Hommeaux, E., & Seaborne, A. (2008, January 15). SPARQL Query Language for RDF.
SIMILE:About - SIMILE. (2007, Feb. 27). Retrieved Apr. 12, 2008, from
http://simile.mit.edu/wiki/SIMILE:About.
Semantic Web. (2008, Apr. 11). Retrieved Apr. 14, 2008, from
http://en.wikipedia.org/wiki/Semantic_Web.
Tourism Highlights 2007 Edition. (n.d.). Retrieved Apr. 12, 2008, from
unwto.org/facts/eng/pdf/highlights/highlights_07_eng_hr.pdf.