Exploiting a Web of Semantic Data for Interpreting Tables

Zareen Syed, Tim Finin, Varish Mulwad, and Anupam Joshi
Computer Science and Electrical Engineering, University of Maryland, Baltimore County, Baltimore MD 21250
{zarsyed1, finin, varish1, joshi}@umbc.edu

ABSTRACT
Much of the world’s knowledge is contained in structured documents like spreadsheets, database relations and tables in documents found on the Web and in print. The information in these tables might be much more valuable if it could be appropriately exported or encoded in RDF, making it easier to share, understand and integrate with other information. This is especially true if it could be linked into the growing linked data cloud. We describe techniques to automatically infer a (partial) semantic model for information in tables using both table headings, if available, and the values stored in table cells and to export the data the table represents as linked data. The techniques have been prototyped for a subset of linked data that covers the core of Wikipedia.

Copyright is held by the authors. Second Web Science Conference, 26-27 April 2010, Raleigh NC, USA.

Keywords
Semantic Web, linked data, human language technology, information retrieval

1. INTRODUCTION
Twenty-five years ago Doug Lenat started a project to develop Cyc [18] as a broad knowledge base filled with the common sense knowledge and information needed to support a new generation of intelligent systems. The project was visionary and ambitious, requiring broad ontological scope as well as detailed encyclopedic information. While the research and development around Cyc has contributed much to our understanding of building complex, large-scale knowledge bases, relatively few applications have been built that exploit it. Not long after the Cyc project began, Tim Berners-Lee proposed a distributed hypertext system based on standard Internet protocols [2]. The Web that resulted fundamentally changed the ways we share information and services, both on the public Internet and within organizations. One success story is Wikipedia, the familiar Web-based, collaborative encyclopedia comprising millions of articles in dozens of languages. The original Web proposal contained the seeds of another effort that is beginning to gain traction: a Semantic Web designed to enable computer programs to share and understand structured information easily as a Web of Data. Resources like Wikipedia and the Semantic Web's linked data [3] are now being integrated to provide experimental knowledge bases containing both general purpose knowledge and a host of specific facts about significant people, places, organizations, events and other entities of interest. The results are finding immediate applications in many areas, including improving information retrieval, text mining, and information extraction.

2. BACKGROUND AND MOTIVATION
It is unlikely that the Web of data will be realized by having millions of people manually entering RDF data. We need systems that can automatically generate linked data from existing sources, be they unstructured (e.g., free text), semi-structured (e.g., text embedded in forms or wikis) or structured (e.g., data in spreadsheets and databases). Extracting data and its implicit schema from tables is a problem that is common to several areas of computer science, including databases [26], where it has long been a part of the database integration problem, Web systems [5], where search engines need to know how to index tabular data, and the Semantic Web [17], where organizations want to publish their traditional databases as linked RDF data. While a number of pragmatic systems have been built, there are many difficult research issues yet to be solved.

Several manual systems exist for specifying a mapping from relational tables and spreadsheets to RDF [4]. In earlier work we developed RDF123 [13] as an application and web service for converting data in simple spreadsheets to an RDF graph. Users control how the spreadsheet's data is converted to RDF by constructing a graphical RDF123 template that specifies how each row in the spreadsheet is converted, as well as metadata for the spreadsheet and its RDF translation. The template can map spreadsheet cells to a new RDF node or to a literal value. Labels on the nodes in the map can be used to create blank nodes or labeled nodes, attach an XSD datatype, and invoke simple functions (e.g., string concatenation). The template itself is stored as a valid RDF document, encouraging reuse and extensibility. While systems like RDF123 are both practical and useful, they suffer, in general, from several shortcomings, which we discuss below.

Figure 1. RDF123 is an application and web service for converting data in simple spreadsheets to an RDF graph.
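To make the row-to-graph conversion concrete, the sketch below shows the kind of mapping an RDF123 template encodes, written directly in Python with the rdflib library. It is only an illustration, not RDF123's actual template mechanism, and the column names, namespace and property URIs are assumptions made for the example.

    from rdflib import Graph, Literal, Namespace, RDF
    from rdflib.namespace import XSD

    # Assumed example vocabulary and spreadsheet columns; RDF123 itself lets the
    # user define this mapping graphically and stores the template as RDF.
    EX = Namespace("http://example.org/vocab#")
    DBP = Namespace("http://dbpedia.org/resource/")

    def row_to_rdf(graph, row):
        """Map one spreadsheet row (a dict of column name to cell string) to triples."""
        city = DBP[row["city"].replace(" ", "_")]
        graph.add((city, RDF.type, EX.City))
        graph.add((city, EX.partOf, DBP[row["state"].replace(" ", "_")]))
        graph.add((city, EX.mayor, Literal(row["mayor"])))
        graph.add((city, EX.population, Literal(int(row["population"]), datatype=XSD.integer)))

    g = Graph()
    row_to_rdf(g, {"city": "Boston", "state": "Massachusetts",
                   "mayor": "Thomas Menino", "population": "610000"})
    print(g.serialize(format="turtle"))

In RDF123 this per-row logic is captured declaratively in the graphical template, which is itself stored as RDF, rather than in code.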

First, they require significant manual work to specify the mapping from the table to an appropriate representation in RDF, especially for users unfamiliar with the Semantic Web who only want to convert their data into some RDF vocabulary. Second, users who specify the mapping must decide what RDF terms to use to describe the instances in the table and the relationships between them. For example, if the table described people, the RDF data should include instances of a person class. But there are many widely used ontologies that contain a class intended to represent people; FOAF, Cyc and others define properties that this class has. Choosing from among them, or even knowing what the possibilities are, is a major problem. This is where existing semantic web search engines such as Swoogle [6] can be useful, since they index a variety of ontologies based on the terms in them. We can also search for ontologies that might be based on synonyms of the words used in the table [13] by using open sources such as WordNet; Han et al. [14] describe a system that suggests terms for relational tables based on the words in their schema column headers. Third, they offer no support for producing linked data, i.e., RDF data where objects are referenced not by strings (e.g., "Baltimore") but by RDF instances (e.g., http://dbpedia.org/resource/Baltimore) that are part of, or linked to part of, the linked data cloud. Finally, doing all this assumes that we have a machine representation of the relevant concepts.

Figure 2. The "ontology context" for the word "student". Each oval stands for an ontology and the gray area denotes the ontology context of the student ontology; some connecting lines in the gray area are omitted for simplicity. The Person ontology has a conditional probability of 0.8 accompanying the Student ontology, while the other direction has a conditional probability of only 0.01.

The Web itself is dynamic and constantly evolving. The traditional Web, a static "read only web" that permitted a flow of content from provider to viewer, evolved into Web 2.0, the "read write web" characterized by collaborative content development and sharing using blogs, social bookmarking sites, collaborative editing with wikis, and similar applications. The trend is now moving to the next step: exploiting the collaboratively developed content generated by Web 2.0 applications to build richer applications. Another challenge confronting the Web is the integration of information from heterogeneous sources and in heterogeneous representations, i.e., structured, semi-structured and unstructured representations. In earlier work we tackled this second problem, and these problems are interlinked: a solution to one involves solving the others.

Interpreting a table means recognizing the concepts in the table and mapping columns to the properties of the concepts according to the information in the table. Column titles carry very important information for finding the underlying concepts: they can be the names of the concepts, the names of the attributes of the concepts, or the names of the connections between the concepts (attributes and connections can all be viewed as properties of a concept); they work, in effect, as identifiers for the concepts. The column titles, as well as the values in the rows, can be mapped to concepts and can disambiguate one another, and we can also establish that some column title represents a class. We can learn what properties a class is associated with from the indexed ontologies and data documents, we can learn the co-occurrences of two concepts or classes from RDF data documents, and we can also learn concepts from the instance rows. For example, if the column titles include "Beetle" and "Habitat", then we can conclude that "Beetle" refers to the insect beetle and not the car, and we would avoid an incorrect interpretation.

As another example, suppose a table column has the title medicine and the entries Motrin, Advil, Alleve and Voveron. Using Linked Data sources or Wikipedia, we can find out that while these are all names of medicines, they are specifically members of a category called Non Steroidal Anti Inflammatory Drugs. We may also find out from existing drug ontologies or linked open data sources that medicines have doses and conditions they are prescribed for, which might help us interpret the other columns.

Finally, suppose we see a column with the entries Nilgiri, Aravali, Jwalamukhi and Karakoram in a table, and another column with numbers that seem to be latitudes and longitudes. We could infer from DBpedia that these are names of mountains, an interpretation that would make the latitude/longitude entries correct. We could then search for other interpretations and find that these are also the names of dorms at IIT-Delhi, which might also be consistent with a column entitled residents in the table. Linked data sources and Wikipedia can thus provide a sanity check for our interpretations: not only can the mountain interpretation be inferred using linked data and Swoogle [6], but we can also check whether the latitudes and longitudes provided match those that DBpedia lists.
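The coordinate sanity check described above could be implemented against DBpedia's public SPARQL endpoint roughly as follows. This is our own illustration rather than part of the system described in this paper, and the entity name, properties and tolerance are assumptions.

    from SPARQLWrapper import SPARQLWrapper, JSON

    def dbpedia_coordinates(entity):
        """Fetch the WGS84 latitude/longitude DBpedia records for a resource, if any."""
        sparql = SPARQLWrapper("https://dbpedia.org/sparql")
        sparql.setReturnFormat(JSON)
        sparql.setQuery("""
            PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
            SELECT ?lat ?long WHERE {
              <http://dbpedia.org/resource/%s> geo:lat ?lat ; geo:long ?long .
            }""" % entity)
        rows = sparql.query().convert()["results"]["bindings"]
        return [(float(r["lat"]["value"]), float(r["long"]["value"])) for r in rows]

    def coordinates_plausible(entity, table_lat, table_long, tolerance=0.5):
        """Check whether the lat/long pair found in the table roughly matches DBpedia."""
        return any(abs(lat - table_lat) <= tolerance and abs(lon - table_long) <= tolerance
                   for lat, lon in dbpedia_coordinates(entity))

    # e.g. coordinates_plausible("Karakoram", 35.9, 76.5) for the mountain interpretation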

3. APPROACH
We have been exploring the use of Web-derived knowledge bases through the development of Wikitology [9], a hybrid knowledge base of structured and unstructured information extracted from Wikipedia and augmented by RDF data from DBpedia and other Linked Data resources. Wikitology is not unique in using Wikipedia to form the backbone of a knowledge base; see [22] and [25] for examples. The core idea exploited by most approaches is to use references to Wikipedia articles and categories as terms in an ontology: for instance, the reference to the Wikipedia page on weapons of mass destruction can be used to represent the WMD concept, and the page on Alan Turing to represent that individual person. These basic Wikipedia pages are further augmented by category pages (e.g., biological weapons) that represent concepts associated with sets of related articles and with other categories. Wikipedia pages are also rich with other data having semantic impact, including links to and from other Wikipedia articles, links to disambiguation pages, redirection pages, in- and out-links, infobox properties and values, popularity values computed by search engines, and history pages indicating when and how often a page has been edited. However, Wikitology is unique in integrating knowledge available in different structured and unstructured forms and in providing an integrated query interface that enables applications to exploit broader types of knowledge resources, such as free text, relational tables, link graphs and triples, through a single interface.

Most of the knowledge available in Wikipedia is in the form of natural language text whose intended meaning is self evident to humans, who can read the text and examine the images on the pages. The articles in Wikipedia represent concepts and, taken as a whole, form a consensus-based concept space developed and kept current by a diverse collection of people. Wikipedia's infoboxes flesh out this nascent ontology with links corresponding to properties and relations. There are many advantages in deriving a knowledge base from Wikipedia, such as broad coverage (over 3M articles), high quality of content [16] and availability in multiple languages. Wikipedia provides a way for ordinary people to contribute knowledge because it is familiar and easy to use, and this collaborative development process leads to a consensus model that is kept current and up-to-date; incorporating these qualities in knowledge bases like Cyc would be very expensive in terms of time, effort and cost. Efforts like DBpedia, Freebase, YAGO and the linked data cloud are focused on making knowledge available in structured forms. Exploiting Wikipedia and related knowledge sources to develop a novel hybrid knowledge base brings the advantages inherent to Wikipedia, and our hybrid KB can complement these valuable structured resources by integrating knowledge available in other forms and providing flexible access to it.

An example of an application that illustrates the potential of this approach is extracting linked data from the content found in structured documents such as spreadsheets, HTML tables and structured reports. The data model implicit in a simple table can be inferred by analyzing the column headers and string values, and the result mapped into appropriate target ontologies as linked data. The process involves and integrates data model extraction, ontology alignment, entity extraction and linking, and interpretation. Consider, as an example, processing the simple table shown in Figure 3. The column headers strongly suggest the type of information in the columns: city and state might match classes in a target ontology such as DBpedia, and mayor and population could match properties in the same or related ontologies. Examining the data values provides additional information that can confirm some possibilities and disambiguate between possibilities for others. The strings in column one, which initially are just strings, can be recognized as entity mentions and linked to known entities in the Wikitology knowledge base [9] that are instances of the DBpedia:Place class. A named entity recognizer can identify the column two values as person mentions, and Wikitology can resolve some of these to individuals in its KB; moreover, from DBpedia we can find confirming evidence for many of the individual relationships between these people and places.

Consider the population column. The likely target ontology schema supports its interpretation as a property from places to integers, but does not help in choosing whether it should be associated with the cities (column one) or their states (column two). The DBpedia ontology associates the mayor property with the domain city, and population with populated place, a class that includes both cities and states. Several heuristics can be employed, such as preferring the closest column or the first column, but these can, as in this case, suggest different selections. Examining the data values, the DBpedia background KB offers direct evidence that the integer values are closer to the populations of the cities than of the states. The data aligns best with the known values for the narrower legal city entity, and here the KB can also help disambiguate between the two population-related properties associated with cities, one for the city proper and another for its broader metropolitan area. Additional analysis can automatically generate a narrower description in OWL-DL, such as major cities located in the United States.

Figure 3. A simple table with columns for city, state, mayor and population.

Figure 4. Linked data (serialized in Turtle) that might be generated from the table in Figure 3, resolving values to familiar linked data entities when possible:

    @prefix dbp: <http://dbpedia.org/resource/> .
    @prefix dbpo: <http://dbpedia.org/ontology/> .
    @prefix cyc: <http://www.cyc.com/2004/06/04/cyc#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    dbp:Boston dbpo:PopulatedPlace/leaderName dbp:Thomas_Menino ;
        cyc:partOf dbp:Massachusetts ;
        dbpo:populationTotal "610000"^^xsd:integer .
    dbp:New_York_City …

4. WIKITOLOGY
One of the challenges in integrating information available in different forms is selecting an appropriate data structure to represent or link data available in those different forms and to provide a query interface giving an integrated view. A knowledge base incorporating information available in different forms can better meet the needs of real world applications than one exposing knowledge in a more restricted way, such as through SQL, SPARQL or keyword queries alone. In our approach, an information retrieval (IR) index is the basic information substrate: we represent the concepts (articles) as documents in the index using the Lucene IR library [15]. To integrate the index with other forms of knowledge, we enhance it with fields containing instance data taken from other data structures, and with links to related information, such as graphs or triples in an RDF triple store, that is related to the Wikipedia concept. The different fields present in the Wikitology index include the title, redirects, first sentence of the article, article content, linked concepts, Wikipedia categories, DBpedia infobox properties and values, and types (Freebase entity types, DBpedia and Yago types). Using a specialized IR index enables applications to query the knowledge base using both text (e.g., words in a document) and structured constraints (e.g., rdfs:type = yago:Person), and approaches for indexing and querying IR indices are more efficient and scalable than triple stores. Wikitology has been used successfully for a variety of tasks such as concept prediction, cross-document coreference resolution [10] and entity linking [9].
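The hybrid text-plus-structure queries described above can be illustrated with a small stand-in for the Lucene index, here built with the pure-Python Whoosh library. The documents, field values and type labels are invented for the example and are not the actual Wikitology index contents.

    import tempfile
    from whoosh.fields import Schema, TEXT, KEYWORD, ID
    from whoosh.index import create_in
    from whoosh.qparser import MultifieldParser
    from whoosh.query import Term

    # A toy two-document index with a free-text contents field and a structured types field.
    schema = Schema(title=ID(stored=True), contents=TEXT, types=KEYWORD(stored=True))
    ix = create_in(tempfile.mkdtemp(), schema)
    writer = ix.writer()
    writer.add_document(title=u"Alan Turing",
                        contents=u"British mathematician, logician and computer scientist",
                        types=u"yago:Person dbpedia:Scientist")
    writer.add_document(title=u"Baltimore",
                        contents=u"A major city in the state of Maryland",
                        types=u"yago:City dbpedia:Place")
    writer.commit()

    # Combine free text ("words in a document") with a structured constraint
    # (only entities typed yago:Person), in the spirit of the Wikitology interface.
    with ix.searcher() as searcher:
        text_query = MultifieldParser(["title", "contents"], ix.schema).parse(u"mathematician")
        for hit in searcher.search(text_query, filter=Term("types", u"yago:Person")):
            print(hit["title"], hit["types"])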

Figure 5. Wikitology is a hybrid knowledge base storing information in structured and unstructured forms and reasoning over it using a variety of techniques.

4.1 Page Rank Approximations
Earlier work on the entity linking task [19] defined in the Knowledge Base Population track at the 2009 Text Analysis Conference [20] identified Google PageRank as an important feature for linking entities mentioned in text to Wikipedia entities. However, it is not possible to query Google for the PageRank of every Wikipedia article, so we have developed an approach for approximating it, and in addition to the fields representing different types of content extracted from Wikipedia and related resources we have introduced a field with PageRank approximations for Wikipedia articles. We discuss our approach below.

We selected 4296 random articles from Wikipedia and queried Google for their PageRank, resulting in the distribution shown in Table 1. We used these articles to train a classifier, with the number of in- and out-links and the article length as classification features and the PageRank as the class label. We used 10-fold cross-validation to evaluate four common classification algorithms: SVM, decision trees with and without pruning, and naïve Bayes. The pruned decision tree algorithm gave the highest accuracy of 53%, followed by 52%, 50% and 41% for the unpruned decision tree, SVM and naïve Bayes, respectively. We also examined different feature combinations to discover which was best, with results shown in Table 2; the combination of in-links and page length gave the best accuracy, closely followed by in-links only.

We then recomputed the accuracy, considering a prediction to be correct if the predicted and actual PageRank differ by at most one. This resulted in an overall accuracy of about 94%, which we considered sufficient for our application, and for this ±1 accuracy the best result was obtained using in-links only. An analysis of our features revealed that using only the number of in-links provided our best approximation of Google's PageRank metric for a Wikipedia article. Since the majority of Wikipedia articles have PageRank values of three or four, having a high accuracy for these classes indicates that we can approximate the PageRank for the majority of Wikipedia articles with high accuracy. Based on these observations, we currently use the pruned decision tree (J48) algorithm with the in-links-only feature set to approximate the PageRank of Wikipedia articles.

Table 1. Google PageRank distribution of the 4296 sampled Wikipedia articles.
    PageRank:  0    1    2    3     4     5    6   7
    Articles:  44   18   100  1129  2124  794  84  3

Table 2. Accuracy and ±1 accuracy for different feature combinations: all features; all minus page length; all minus in-links; all minus out-links; in-links only; page length only; out-links only.

4.2 Discovering Ontology Elements from Wikipedia Page Links
We have developed an unsupervised and unrestricted approach to discovering an infobox-like ontology by exploiting the inter-article links within Wikipedia, which represent relations between concepts. In our approach we consider the linked concepts as candidate fillers for slots related to the primary article/concept. We first identify and rank candidate slots and fillers for an entity, then classify entities based on their ranked slots, and finally derive a class hierarchy. We discover and exploit the "ISA" relation between fillers (linked concepts) and WordNet nodes to serve as candidate slot labels. There are several cases where the filler is subsumed by the slot label; for example, the infobox in the article on "Michael_Jackson" mentions pop, rock and soul as fillers for the slot Genre, and all three of these are a type of Genre. We compared the predicted slots/properties for different entity classes (such as Politician, Basketball Player, etc.) with the DBpedia infobox ontology and the Freebase resource, and for the set of properties found to be common we manually evaluated the subject-object pairs of each property by verifying the ground truth from the respective Wikipedia articles. The average accuracy for the properties was found to be above 80%. Our results demonstrate that there are certain types of properties, evident in the link structure of resources like Wikipedia, that can be predicted with high accuracy using little or no linguistic analysis and can be used to enrich the existing infobox ontology as well as the Wikitology knowledge base itself.
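Returning to the PageRank approximation of Section 4.1, the shape of that experiment can be sketched as follows. This is not the original Weka/J48 setup, and the data here is randomly generated purely to make the snippet self-contained; only the feature choice and the two accuracy measures mirror the description above.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_predict

    rng = np.random.default_rng(0)

    # Synthetic stand-in for the 4296 sampled articles: in-links, out-links and page
    # length as features, Google PageRank (0-7) as the class label.
    n = 4296
    in_links = rng.integers(1, 5000, size=n)
    out_links = rng.integers(1, 500, size=n)
    page_len = rng.integers(500, 100000, size=n)
    X = np.column_stack([in_links, out_links, page_len])
    y = np.clip((np.log10(in_links) * 2).astype(int), 0, 7)  # fake labels tied to in-links

    # In-links-only feature set, mirroring the feature analysis described above.
    clf = DecisionTreeClassifier()  # scikit-learn's tree in place of Weka's J48
    pred = cross_val_predict(clf, X[:, [0]], y, cv=10)

    print("exact accuracy:", np.mean(pred == y))
    print("±1 accuracy:   ", np.mean(np.abs(pred - y) <= 1))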

5. FROM TABLES TO LINKED DATA
Producing linked data to represent a table is a complex task that requires developing an overall interpretation of the intended meaning of the table as well as attention to the details of choosing the right URIs to represent both the schema and the instances. A table may be defined as a two-dimensional presentation of logical relationships between groups of data [24]. It is often the case that a table column header represents the type of information in that column, such as cities, countries, artists or movies, whereas the values in the column represent instances of that type; an individual entry in a particular row and column represents an entity mention, and the other values in the same row represent information related to that instance. The labels in the table headers, as well as the values in the rows, can be used to interpret the information contained in the table, and linking the table headers and the instances in the rows to concepts in a knowledge base provides more context and links to other related concepts.

We break the task down into the following sub-tasks:
• Associating classes or types with each column based on the column header and values, and selecting the best one
• Linking cell values to entities, if appropriate
• Associating properties that represent the implied relations between columns, and selecting the best ones
• Analyzing the table globally to decompose or normalize it into one or more sub-tables, if appropriate, and identifying the key of each resulting sub-table
While these tasks could be addressed serially, the problems are intertwined; our current prototype approaches them mostly serially, with some interactions. We first link the instances in the table and predict a class for each column; after linking the instances we can enumerate the relations present between columns and select a particular relation to describe the relationship between each pair of columns. We describe our approach below.

We used an earlier version of the Wikitology knowledge base for the entity linking task defined in the Knowledge Base Population track of the 2009 Text Analysis Conference [20]. That task is defined as follows: given an entity mention string and an article containing the mention, link it to the right Wikipedia entity; the entity mention identifies the entity to link and the article provides additional context about it. Similarly, it is not trivial to link table headers or cell values to concepts in the KB, since the same concept may be expressed in a variety of ways in the table headers and data rows and there might not be enough context available in the table. In the case of tabular data, instead of a document, the different parts of the table serve as the context that disambiguates the entity or concept.

5.1 Class Prediction
There are several fields in Wikitology's specialized IR index that can be exploited to infer a (partial) semantic model for information available in tabular form. The "title" field contains the Wikipedia article or concept titles. The "redirects" field contains the Wikipedia redirects to the article or concept; a Wikipedia redirect page is a pseudo page whose title is an alternate name or misspelling for the article (e.g., Condoleeza_Rice for Condoleezza_Rice and Mark_Twain for Samuel_Clemens), and an attempt to access a redirect page results in the Wikipedia server returning the canonical page. The result is that the Wikitology pages for a term are effectively indexed under these variant titles. The "firstSentence" field holds the first sentence of the article, which usually defines the concept and mentions its type as well. The "contents" field contains the article text, including the title, redirects, DBpedia infobox properties and values, as well as the linked categories. The "types" field contains the DBpedia classes, YAGO classes, WordNet classes and Freebase classes (i.e., Person, Location and Organization) associated with the concept. The "linkedConcepts" field lists all the concepts linked from the article, and the "propertiesValues" field contains the DBpedia infobox property and value pairs for a concept.

In order to link the instances in the table rows to entities in Wikitology, we gather the context for each instance in the following way. We process the information in the table using a custom query module for Wikitology that maps the information in different parts of the table to different components of the Wikitology index: the table header suggests the type of the instance, whereas the values in the same row suggest concepts and literals associated with the instance. Our custom query module maps the instance mention to the "title" field, giving a boost of 4.0 to the instance mention itself, and also maps it to the "redirects" field. The entity mention and the column header are mapped to the "firstSentence" field, the table header is mapped to the "types" field, and the values in the same row are mapped to the "contents", "linkedConcepts" and "propertiesValues" fields. The complete query is given in Algorithm 1.

Algorithm 1: WikitologyLinkInstance
Description: Different parts of the table are mapped to different components of the Wikitology index to get the top N Wikitology concepts matching the instance in the table.
Input: instance mention "Mention", instance row data "RowData", instance column header "ColHeader", number of entries to return "TopN"
Output: top N matching concepts vector
    Query = title:(Mention) redirects:(Mention)
            firstSentence:(Mention) firstSentence:(ColHeader)
            linkedConcepts:(RowData) linkedConcepts:(Mention)^4.0
            types:(ColHeader) propertiesValues:(RowData)
            contents:(Mention)^4.0 contents:(RowData)
    TopNConceptVector = Wikitology.searchQuery(Query, TopN)
    Return TopNConceptVector
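Algorithm 1 is essentially string assembly over the index fields and can be transcribed directly into Python. The helper below only builds the Lucene-style query string; executing it requires the Wikitology index itself, so the searchQuery step is not shown.

    def wikitology_link_instance_query(mention, row_data, col_header):
        """Build the Algorithm 1 query; the ^4.0 boosts favor the mention itself."""
        return (
            f"title:({mention}) "
            f"redirects:({mention}) "
            f"firstSentence:({mention}) firstSentence:({col_header}) "
            f"linkedConcepts:({row_data}) linkedConcepts:({mention})^4.0 "
            f"types:({col_header}) "
            f"propertiesValues:({row_data}) "
            f"contents:({mention})^4.0 contents:({row_data})"
        )

    # Example: the cell "Boston" in a column headed "City", using the rest of its row.
    print(wikitology_link_instance_query(
        mention="Boston",
        row_data="Massachusetts Thomas Menino 610000",
        col_header="City"))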

We query Wikitology in this way to get the top N candidate Wikipedia entities for each cell string, together with their types or classes, and then use the predicted class labels as additional evidence when linking the instances in a column (Section 5.2). To associate the best class with the set of strings in a column we do the following:
1. Let S be the set of k strings in a table column (e.g., Baltimore, Boston, New York, Philadelphia, …) and let C be the set of all classes associated with the candidate entities for those strings (e.g., place, person, populatedPlace, …).
2. From the top N Wikipedia entities predicted for each string, we vote for each class in the set C based on the entity's rank, 1/R (1 for 1st, 1/2 for 2nd, 1/3 for 3rd and so on). We also incorporate the normalized PageRank of the top N Wikipedia entities using a weighted sum with w = 0.25:
    Score = w × (1/R) + (1 − w) × PageRank
3. We create a (sparse) matrix V[i,j], for 0 < i ≤ length(S) and 0 < j ≤ length(C), holding the value we assign to interpreting string i as being in class j; the default value V[i,j] = 0 is used when string i is not mapped to class j at all.
4. To pick the best class for the column, we find the class j that maximizes the sum of V[i,j]. We normalize the weight for each class by the number of strings present in the column, and a class label is considered for selection as the best class for the column only if its normalized weight is greater than a threshold weight; we use 0.3 as the threshold.
We repeat the process for the four types of classes existing in Wikitology, i.e., DBpedia classes, YAGO classes, WordNet classes and Freebase classes.

5.2 Entity Linking
We use the predicted types as additional evidence to link the instances to the right Wikipedia entities. We re-query Wikitology by updating the original query in Algorithm 1, adding the predicted types mapped to the "typesRef" field in the index. The "typesRef" field is an un-tokenized field and supports exact match, so querying it restricts the results to only those entities whose type exactly matches the input type. The Wikitology query module returns a vector of the top N matching Wikitology entities, and we select the top-ranked predicted entity that matches one of the predicted types as the right match.

5.3 Relation Enumeration
To describe the relations between columns in a table, we take all pairs of entities present in the same row (which have already been linked to Wikipedia) and query DBpedia for the set of relations between each pair. For example, the relations between Massachusetts and Boston are:
    http://dbpedia.org/ontology/capital
    http://dbpedia.org/ontology/largestCity
    http://dbpedia.org/ontology/PopulatedPlace/capital
    http://dbpedia.org/ontology/PopulatedPlace/largestCity
    http://dbpedia.org/property/capital
    http://dbpedia.org/property/largestcity
In this way we can enumerate the set of relations between a pair of columns by listing the relations between the pair of entities in the respective columns for each row.

5.4 Relation Selection
After relation enumeration we apply a relation selection step to choose a representative relation between each pair of columns in a table. The relation that appears the maximum number of times between the pairs of entities in the two columns is selected as the representative relation.
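The class-voting step of Section 5.1 can be summarized in a few lines of Python, shown here as a sketch using the parameters given above (w = 0.25, threshold 0.3). The top_n_entities function standing in for the Wikitology query is hypothetical.

    from collections import defaultdict

    def vote_score(rank, page_rank, w=0.25):
        # Score = w * (1/R) + (1 - w) * PageRank, with PageRank normalized to [0, 1]
        return w * (1.0 / rank) + (1 - w) * page_rank

    def predict_column_class(column_strings, top_n_entities, threshold=0.3):
        """column_strings: the cell values of one column.
        top_n_entities(s): stand-in for the Wikitology query of Algorithm 1; returns a
        ranked list of (entity, classes, normalized_page_rank) tuples for string s."""
        votes = defaultdict(float)  # the matrix V summed over strings, keyed by class
        for s in column_strings:
            for rank, (entity, classes, page_rank) in enumerate(top_n_entities(s), start=1):
                for cls in classes:
                    votes[cls] += vote_score(rank, page_rank)
        normalized = {cls: v / len(column_strings) for cls, v in votes.items()}
        candidates = {cls: v for cls, v in normalized.items() if v > threshold}
        return max(candidates, key=candidates.get) if candidates else None

Running this once per vocabulary (DBpedia, YAGO, WordNet, Freebase) yields the per-vocabulary column labels evaluated in the next section.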

6. EVALUATION
We have developed an initial prototype system to evaluate our approach, using a small collection of tables, each with four or five columns: a sample data set of five tables obtained from Google Squared [11]. The table topics and columns are given in Table 3.

Table 3. The topics and columns of the five tables used in our evaluation, obtained from Google Squared.
    US States: State, Capital City, Largest City, Governor
    US Presidents: President, Party, Vice President, Preceded by, Succeeded by
    Basketball players: Name, Team, College, Nationality, Place of Birth
    US cities: City, State, Mayor, Population
    Rivers in Europe: River, Origin, Mouth, Length, Basin Countries

For each column we predicted class labels from four vocabularies: the DBpedia Ontology, Yago, WordNet and Freebase (the Freebase types currently include only Person, Location and Organization). The accuracies of the class labels predicted for a column from each of the four vocabularies are given in Table 4. The Freebase class labels predicted for a column were the most accurate, whereas the Yago and WordNet class labels were the least accurate. We excluded the "Population" column of the US cities table and the "Length" column of the Rivers in Europe table when calculating the accuracy: these columns contain integers, so linking their values as entities would be inappropriate, and they most likely describe a property of some other column in the table; a different approach will be needed to deal with such columns. The columns that were labeled incorrectly are "Nationality" from the Basketball players table and "Mayors" and "States" from the US cities table.

Table 4. Results for the class prediction task for entities in a given column for the tables in our small collection (Places: 11 columns, Persons: 7, Organizations: 3; 21 columns in total), for class labels drawn from the DBpedia Ontology, Yago, WordNet and Freebase.

In the five tables we had a total of 171 entities, which were classified into Persons, Places and Organizations. The results for entity linking are given in Table 5. Out of the 171 entities we were able to link 131 correctly, giving an accuracy of 76.60%. Our system was able to link persons with the highest accuracy (90.76%), while it had only moderate success in linking places and organizations.

Table 5. Results for the entity linking task for the entity mentions found in our sample tables.
    Person: 59 of 65 correct (90.76%)
    Places: 45 of 73 correct (61.64%)
    Organization: 20 of 30 correct (66.67%)
    Total: 131 of 171 correct (76.60%)

On further evaluation, we realized that certain columns will need specialized querying for their entities to be linked correctly. The "Nationality" column in the Basketball Players table gave a low accuracy for entity linking because its values are mostly adjectival forms of place names such as "Australian". In our earlier work with entity linking [9] we used a separate module to replace adjectival forms of place names with nouns, i.e., "Australian" is replaced with "Australia", and this helped in linking to the right country articles in Wikipedia. We are currently not employing any specialized heuristics for person names. The "Mayors" column of the US cities table gives each name as a first-name initial followed by the last name; in our earlier work [9], if a name initial was present in the entity mention we included it in the query followed by a wild card, so as to also match entity names containing the full first name rather than the initial. This helped in giving a higher score to "Chris Broad" than to "Dan Broad" for the entity mention "C. Broad". The "States" column of the US cities table gives all the states as acronyms, and our system presently does not handle acronyms well; a better solution than our present querying mechanism would be to identify that a column contains acronyms and expand them to full names before querying. We plan to add mechanisms that exploit a collection of acronyms and abbreviations for common entities.

7. DISCUSSION AND FUTURE WORK
We have developed an approach for linking the data in a table and its headers to concepts in the DBpedia ontology, and we have also proposed approaches for describing the relations evident in the tabular data through relation enumeration and relation selection. Our work on predicting ontology elements from Wikipedia page links suggests that we can discover several properties/slots by exploiting the associated links; we can extend the same approach to predict the properties of entities in a table based on the other entities present in the same row, and we might also be able to exploit the evidence from multiple rows in the same table. There may be many concepts present in structured data like tables that are not in the DBpedia ontology, so we would also like to explore approaches to discovering new concepts and resources, as well as relations and properties, not already existing in the ontology.

The technique of resolving strings against Wikitology is a productive one that works for many cases. Since it is unreasonable to expect that any system will be completely accurate, we are evaluating strategies for handling data that cannot be linked to linked data entities. In some cases it may be that neither the column header nor its values provide good evidence to predict the intended meaning; here a possible solution is to create a unique property for the column, link it to the table key, and represent the cell values as appropriate literals. We also recognize the need to handle a number of special cases, and we are developing a framework into which we can plug column type identifiers for common types such as numbers, dates, addresses, chemical compounds, stock ticker symbols, ISBN numbers, etc. We have currently experimented with a small data set which we were able to verify manually; in the future we plan to run our experiments on larger data sets.

Many problems remain. We mention two general categories that are shared with other Web information use cases: temporal qualification and handling inconsistencies. The first has to do with interpreting entities and relations in a table that might refer to one time period by using a collection of knowledge describing facts and situations from essentially many time periods. The second arises when the data, both in the table to be interpreted and in the reference knowledge bases and sources, are inconsistent due to mistakes, noise or differing opinions.

8. CONCLUSIONS
Realizing the Web of Data vision requires making a significant amount of useful data available in RDF and linked into the growing linked data cloud. Achieving this will require developing techniques to automate, or mostly automate, the extraction of linked data from unstructured, semi-structured and structured sources. We have described techniques to automatically infer a (partial) semantic model for the information in tables using both the table headings, if available, and the values stored in table cells, and to export the data the table represents as linked data. The techniques have been prototyped for a subset of linked data that covers the core of Wikipedia. Initial evaluation shows that the techniques are promising and can be employed for automated extraction of linked data from structured data in the form of tables.

9. ACKNOWLEDGMENTS
The research described in this paper was supported in part by a Fulbright fellowship, a gift from Microsoft Research, NSF award IIS-0326460 and the Human Language Technology Center of Excellence.

10. REFERENCES

. on Agents and Artificial Intel-ligence. V. [7] Etzioni.. Sachs. T. October 2008. 2003. Mayfield..A Database to RDF Mapping Language. Cambridge University Press. November 2004. Valencia. G. 16th ACM Conf. Finin. Y. v6.. C. 2nd Int. (1992). J. 17th Int. Proc. M. and Peng.T. and Dittrich.. [20] McNamee.. Semantic Web Conf. Reddivari. Mayfield. [15] Hatcher. ISWC/ASWC 2007: 225-238 [18] Lenat. Addison-Wesley. November. Wang. October 2007. A. M. A. on Information and Knowledge Manage-ment. [13] Han. S.. (2010). Using Wikitology for cross-document entity coreference resolution. T. C. D... Gerber. [17] Hu. (2008). WWW (Posters) 2003. Cost.. and Pereira. and Weikum. Sun. J. "Open Information Extraction from the Web".. J. Conf. R.. 51(12): 68-74. Washington DC. J. [25] Wu. H. Z.. and Zhang. Han. S. In Proc. 8th Int... Y. Web tables: Exploring the power of tables on the web. (2008). Conf. P.com/squared [12] Halevy. L. F. Doshi. Halevy. Piatko. [23] Syed. Web Semantics: Science. pp. M. [6] Ding... . and Piatko. Using Wikitology for cross-document entity coreference resolution. C. Finin.. L. 243-252. McNamee. AAAI Press. McNamee. [19] McNamee. B.. P. and Yesha. P. Three Decades of Data Integration . Y. R. Wikipedia as an Ontology for Describing Documents. Z. Proceedings of the 18th IFIP World Computing Congress. C. Communications of the ACM. Y.. Finin. T... pp.. Dredze. Kasneci. of Electronic Publishing'92. C. D.google. AAAI Press. Garera. [22] Suchanek.. (2008).. N. Norvig. IEEE Intelligent Systems. Conf. A. Swoogle: A Search and Metadata Engine for the Semantic Web. T. D. and Syed. (2004). pp. W.. [9] Finin.. L. G. [11] Google Squared – http://www. D. [5] Cafarella. Lauw. Z. T.. Springer. and Weld. Banko. Automatically Refining the Wikipedia Infobox Ontology. ACM Press. A. Proc. C. (1990). Dreyer. O. n3. and Weld. Parr. [24] Vanoirbeek. Finin. Formatting structured tables.. Proceedings of the Seventh International Semantic Web Conference. (2009). AAAI Press. (2009).All Problems Solved?. Building Large Knowledge Bases. K. RDF123: from Spreadsheets to RDF. on the World Wide Web.. Syed. (2009) Finding Semantic Web Ontology Terms from Words. T.. P. 1992. Discovering Simple Mappings Between Relational Database Schemas and Ontologies. J. Z. (2008). Qu. Elsevier. 3–12. In: Proceedings of the AAAI Spring Symposium on Learning by Reading and Learning to Read. and Gospodnetic. and Joshi. 2004. A. Sachs. T. The Unreasonable Effectiveness of Data. 203-217. (2009). Proceedings of Biodiversity Information Standards Annual Conference (TDWG 2007). [8] Finin.. on Information and Knowledge Management (CIKM'04). Mayfield. F. Pan.. [10] Finin. Wang. M. T. [16] Hu. Apr.. Finin. Proc. H. [26] Ziegler. D2R Map .. (2008). pp 538--549. M..S. 2004. C. National Institute of Standards and Technology. F. E. 8-12. P. J. Z.. Overview of the TAC 2009 knowledge base population track.. Thirteenth ACM Conf. O. pp. 635-644... Lucene in action. and Vuong. Springer.. [21] Parr. E. Joshi. Yago: A Large Ontology from Wikipedia and WordNet. D. and Piatko. New York. (2008). Washington DC. P. and Joshi. and Dang. on Web-logs and Social Media. Soderland. Proc. In proc. Yarowsky. AAAI Spring Symposium on Learning by Reading and Learning to Read. Measuring Article Quality in Wikipedia: Models and Evalua- tion. L. 2nd Int. Syed. T. (2009).. P. Proc. pp. Proceedings of the VLDB. (2007).. HLTCOE Approaches to Knowledge Base Population at TAC 2009..[4] Bizer. Reading MA. pp. (2009).. A. D. March/April 2009. and Guha. 
In Proceedings of the 2009 Text Analysis Conference. Services and Agents on the World Wide Web. [14] Han. A. 2008. RDF123 and Spotter: Tools for generating OWL and RDF for biodiversity data in spreadsheets and unstructured text. J.. of Text Analysis Conference 2009. (2007). Creating and Exploiting a Web of Semantic Data. Rao. WWW'03. 291{309.. Lim. Sachs. 24(2). ACM.
