
Natural Language Generation for the Semantic Web:

Unsupervised Template Extraction

Daniel Duma


MSc Speech and Language Processing
Philosophy, Psychology and Language Sciences
University of Edinburgh

2012

Abstract

I propose an architecture for a Natural Language Generation system that automatically learns sentence templates, together with statistical document planning, from parallel RDF data and text. To this end, I design, build and test a proof-of-concept system (“LOD-DEF”) trained on un-annotated text from the Simple English Wikipedia and RDF triples from DBpedia, with the communicative goal of generating short descriptions of entities in an RDF ontology. Inspired by previous work, I implement a baseline triple-to-text generation system and I conduct human evaluation of the LOD-DEF system against the baseline and human-generated output. LOD-DEF significantly outperforms the baseline on two of three measures: non-redundancy and structure and coherence.


Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Daniel Duma)


Acknowledgements

I am indebted to the many people who have, directly or indirectly, contributed to this effort.

First, to my parents, Eugenia and Calin Duma, for without them I would not be here to tell this story, and to Decebal Duma, for his financial support and a wealth of stories to entertain friends with.

Second, to everyone who helped in some way to the completion of this thesis, and especially to Austin Leirvik, Ben Dawson, Cristian Kliesch, Dan Maftei and Magda Aniol for supplementing my lack of knowledge with patient explanations and helpful hints.

Third, to my supervisor, Ewan Klein, for being a continuous source of encouragement and for his many helpful pointers along the way.

Finally, I want to thank everyone who has made this year of my life something more than one never-ending night in DSB. And to you, caffeine, for packing three days into one.


Table of Contents

Chapter 1  Introduction and background
  1.1 Introduction
  1.2 Overview of this thesis
  1.3 Semantic Web and Linked Data
  1.4 RDF triples: data format for the Semantic Web
  1.5 DBpedia: the hub of the LOD Cloud
  1.6 Natural Language Generation
    1.6.1 Shallow vs. Deep NLG

Chapter 2  Previous approaches
  2.1 Hand-coded, rule-based
    2.1.1 Assessment
  2.2 Generating directly from RDF
    2.2.1 Assessment
  2.3 Unsupervised trainable NLG
    2.3.1 Assessment
  2.4 Automatic summarisation

Chapter 3  Design
  3.1 Design overview
  3.2 Goal
  3.3 Tasks
    3.3.1 Aligning data and text
    3.3.2 Extracting templates
    3.3.3 Dealing with Linked Open Data
    3.3.4 Modelling different classes
    3.3.5 Document planning
  3.4 Baseline
    3.4.1 Coherent text

Chapter 4  Implementation: Training
  4.1 Obtaining the data
    4.1.1 Wikipedia text
    4.1.2 DBpedia triples
  4.2 Tokenizing and text normalisation
  4.3 Aligning: Named Entity Recognition
    4.3.1 Surface realisation generation
    4.3.2 Spotting
  4.4 Class selection
  4.5 Coreference resolution
  4.6 Parsing
  4.7 Syntactic pruning
  4.8 Store annotations
  4.9 Post-processing
    4.9.1 Cluster predicates into pools
    4.9.2 Purge and filter sentences
    4.9.3 Compute n-gram probabilities and store model

Chapter 5  Implementation: Generation
  5.1 Retrieve RDF triples
  5.2 Choose best class for entity
  5.3 Chart generation
  5.4 Viterbi generation
  5.5 Filling the slots

Chapter 6  Experiments
  6.1 Problems with the data
  6.2 Performance of the system
    6.2.1 Spotting performance
    6.2.2 Parser performance
    6.2.3 Class selection performance
    6.2.4 Template extraction
    6.2.5 Examples of errors in output

Chapter 7  Evaluation
  7.1 Approach
  7.2 Selection of data
  7.3 Human generation
  7.4 LOD-DEF generation
  7.5 Human rating
  7.6 Results
  7.7 Discussion

Chapter 8  Conclusion and future work
  8.1 Conclusion
  8.2 Future work

Appendix A: Human generation
Appendix B: Human evaluation
References


Chapter 1

Introduction and background

1.1 Introduction

The next generation of the web is in the making. The amount of information on the Semantic Web is growing fast; this open, structured, explicitly meaningful machine-readable data on the Web is already forming a web of data, a “giant global graph consisting of billions of RDF statements from numerous sources covering all sorts of topics” (Heath & Bizer, 2011).

This information space is however designed to be used by machines rather than humans (Gerber et al., 2006), and we humans are meant to access it via intelligent user agents, such as information brokers, search agents and information filters (Decker et al., 2000), which on the whole are expected to take the shape of question-answering systems (Bontcheva & Davis, 2009).

A crucial element of such a question-answering system is then the ability to communicate with the user using natural language (Bontcheva & Davis, 2009), both understanding user input and generating natural language to relay information to the user.

This is why the role of Natural Language Generation is potentially key in the Semantic Web vision (Galanis & Androutsopoulos, 2007): for applications generating textual output, the text presented to users can be generated or updated by NLG systems using data on the web as input. However, Natural Language Generation systems have traditionally relied on hand-built templates and schemas and many expert-hours of work. While this has been a successful approach in several domains (e.g. Androutsopoulos et al., 2001), it is frequently observed that it does not scale well, is not easy to transfer across domains, and requires many expert man-hours, which makes it expensive and impractical for many applications (Busemann & Horacek, 1998). The scale and decentralised nature of the Semantic Web suggest it is one of these applications.

Recent initiatives by organisations and governments, coupled with efforts in text mining, have made large knowledge bases publicly available, such as census results, biomedical databases and more general ones like DBpedia. These now contain information also found in natural language texts, starting with the very ones the information was mined from. I propose here that, given the widespread availability on the web of these parallel text and data resources, and the mature state of key Natural Language Processing technologies, NLG systems could be automatically trained from these resources with little or no human intervention. The wealth of research done in trainable NLG and automatic summarisation suggests that it is feasible for these systems to learn how to generate natural language by analysing existing human-authored natural language text in an unsupervised fashion. This would make them inexpensive to build and deploy, easier to transfer to other domains, and potentially multilingual, which would contribute massively to making the Semantic Web vision a reality.

The aim of this project is to propose an architecture for an NLG system that can automatically learn sentence templates and document planning from parallel data and text, for the communicative goal of describing entities in an RDF ontology. The emphasis of the system is on expressing literal, non-temporal, factual data.

The system is trained from text and data by performing four main actions: given parallel data and text about a number of entities, first it aligns the text and the data by finding literal values in the text. Second, it extracts and modifies sentences that express these values, so as to use them as sentence templates. Third, it collects statistics about the frequency with which a spotted property follows another. Finally, it determines the class of entity that the text and data describe and builds a separate model for that class.

The nature of this project is exploratory rather than exhaustive. To this end, I design, build and evaluate a proof-of-concept system (LOD-DEF) trained on text from the Simple English Wikipedia and data from DBpedia. No part of its architecture is exhaustively optimised, and all the modules in the pipeline can be seen as baselines for their specific function. The system is thus in itself a baseline implementation; its goal is to explore the feasibility of this approach and, I hope, to inspire others to improve upon it.

1.2 Overview of this thesis

In this chapter, I present the case for trainable Natural Language Generation for the Semantic Web and provide an overview of these two technology areas.

In Chapter 2 I review the recent approaches most relevant to the present one, on which this project is either based, or to which it is theoretically related.

Chapter 3 considers the project from a design standpoint: it formulates the goals of the project and the criteria it must abide by, identifies the problem areas and lays out the design of the solutions. In this chapter I also present the full specification of the non-trainable baseline generation system implemented.

In Chapter 4 the training pipeline is laid out and technically detailed with step-by-step examples. The same is done for the generation pipeline in Chapter 5.

In Chapter 6 a number of experiments are discussed, reporting the performance of the system on different metrics and analysing some interesting findings.

Chapter 7 contains the detailed analysis of the human evaluation of the system. Chapter 8 contains the conclusion and suggests possible directions for future work.

1.3 Semantic Web and Linked Data

Where the terms Semantic Web and Web of Data refer to a vision of the web to come (Berners-Lee et al., 2001), the term Linked Data is more concrete, referring to "a set of best practices for publishing and interlinking structured data on the Web" (Heath & Bizer, 2011). These best practices (the "Linked Data Principles") require the use of a number of web-based open formats for publishing this data, such as HTTP as a transport layer, URIs as identifiers for resources and RDF for linking datasets (see section 1.4 for a detailed explanation). Essentially, "the basic idea of Linked Data is to apply the general architecture of the World Wide Web to the task of sharing structured data on global scale" (Heath & Bizer, 2011). To realise the SW vision, inference must be added to this Linked Data.

Linked Data does not need to be open, i.e. accessible to anyone, but increasingly organisations are publishing Linked Open Data (LOD). The rate of publication of this information has been steadily increasing over the past years, forming a big "data cloud" containing, among others, "geographic locations, people, companies, books, scientific publications, films, music, television and radio programmes, genes, proteins, drugs and clinical trials, statistical data, census results, online communities and reviews" (Heath & Bizer, 2011).

Figure 1.1 Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch (Cyganiak & Jentzsch, 2011)

As of late, there were over 31 billion (10⁹) RDF triples (statements) in datasets linked in the LOD Cloud (Bizer et al., 2011), from 295 different datasets. Figure 1.1 represents these nodes as bubbles, interconnected by edges.

It is these edges that are most interesting; perhaps the key aspect of this effort is that these published datasets are explicitly linked together by using common vocabularies and ontologies. Both the vocabularies and the data can be published by any organisation or individual, leading to the somewhat famous observation "Anyone can say Anything about Anything" (Klyne & Carroll, 2002). Throughout this thesis, "Semantic Web" and "Web of Data" are both taken to mean Linked Open Data, thus referring to data published in adherence to the Linked Data Principles 1.

1.4 RDF triples: data format for the Semantic Web

Resource Description Framework (RDF) is the default and recommended data format for the Web of Data (Heath & Bizer 2011). RDF triples represent the simplest statement that can be made, involving two entities (nodes in a conceptual graph) and a relation between them (an edge). These are often called subject, predicate and object, and a triple must contain all three of them. Another way of reading this information is that the subject has a property (predicate), the value of which is the object. I use both naming conventions throughout this thesis.

An example of a triple would be:

http://dbpedia.org/resource/United_States_of_America

http://dbpedia.org/property/leaderName

http://dbpedia.org/resource/Barack_Obama

Figure 1.1 Example of an RDF triple

where United_States_of_America has a property leaderName, the value of which is Barack_Obama. This could be conceptually read as “the name of the leader of the USA is Barack Obama”. This triple then connects two entities in the graph, United_States_of_America and Barack_Obama, via the edge leaderName.

A central aspect of RDF triples is the fact that subjects and predicates must necessarily be URIs (Uniform Resource Identifiers). The concept of a URL (Uniform Resource Locator) is perhaps a familiar one, given how commonplace the use of web addresses (e.g. http://www.google.com) has become. A URI differs from a URL in that, although it must also be globally unique, it does not need to be "dereferenceable", that is, if we point a web browser to that address, the browser may not be able to load a web page to show. It is recommended good practice that URIs be made dereferenceable (Heath & Bizer, 2011), but it is not required.

Objects of triples can either be URIs (e.g. "http://dbpedia.org/resource/Barack_Obama" in the previous example) or literal values, such as character strings (e.g. "Hogwarts"), dates (e.g. "1066-10-12"^^xsd:date) or numbers in different formats. A literal may have either a data type or a language suffix (e.g. @en for English), but not both. A language suffix automatically identifies the value as a string literal.

A number of serializations of RDF exist, for a number of purposes; the one in use throughout this document is N3/Turtle (Berners-Lee & Connolly, 2011). This serialization is intended to be easier for humans to access directly: among other things, it allows for the shortening of namespaces to make triples easier to read. In this serialisation, we can define a number of prefixes for namespaces:

@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbprop: <http://dbpedia.org/property/> .
@prefix dbont: <http://dbpedia.org/ontology/> .

This allows us to write the previous triple as:

dbpedia:United_States_of_America dbprop:leaderName dbpedia:Barack_Obama .

1 See http://lab.linkeddata.deri.ie/2010/star-scheme-by-example/ for an intuitive overview of Linked Open Data.


When accessed, each of the URIs in this triple would be expanded back to the values in Figure 1.1. Finally, Notation 3 permits defining and using a default namespace, identified by a single colon. Henceforth, the default namespace used in examples throughout this document is <http://dbpedia.org/resource/> for ease of reading:

:United_States_of_America dbprop:leaderName :Barack_Obama.

Ontologies can be built on top of RDF using a standard class inheritance mechanism via the predicate rdf:type. RDF implements multiple inheritance, that is, an instance can belong to any number of classes. On top of this basic framework, more complex mechanisms to allow for inheritance and reasoning have been implemented, most importantly for this project RDFS (RDF Schema) and OWL (Web Ontology Language). Different extensions of OWL (OWL Lite, OWL DL, OWL Full) can encode different types of formal logic (Bechhofer et al., 2004), but this is outside the scope of this project.

A property, being a URI, can have properties itself. For every property, its rdfs:domain property restricts the classes and instances which can have this property, and its rdfs:range specifies what values this property can take.
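
For instance, such a property declaration could look as follows in Turtle (the specific domain and range shown are an illustrative assumption, not necessarily DBpedia's actual schema):

dbont:birthPlace rdfs:domain dbont:Person .
dbont:birthPlace rdfs:range dbont:Place .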

There are many standard prefixes for namespaces defining vocabularies with widely-used and well-defined semantics. Two examples are foaf (“Friend Of A Friend”) and dc (“Dublin Core”). Very important to the design of the system presented here are the widely used properties foaf:name, the description of which simply stands for “a name for some thing” (Brickley & Miller, 2010), and rdfs:label, which “may be used to provide a human-readable version of a resource's name” (Brickley & Guha, 2004).

Beyond storage, the emphasis of the LOD approach is that this data can be queried in highly complex ways. The default query language for this is SPARQL (Prud'hommeaux & Seaborne, 2008) which is a structured query language based on matching patterns of triples, allowing for highly complex filters using logic, inference and regular expressions. Some examples of these queries, expressed in natural language, could be:

Skyscrapers in China that have more than 50 floors

Albums from the Beach Boys that were released between 1980 and 1990
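
For illustration, the first of these could be expressed roughly as the following SPARQL query (the class and property names dbont:Skyscraper, dbont:location and dbont:floorCount are assumed here for the sake of the example):

SELECT ?skyscraper WHERE {
  ?skyscraper rdf:type dbont:Skyscraper .
  ?skyscraper dbont:location :China .
  ?skyscraper dbont:floorCount ?floors .
  FILTER (?floors > 50)
}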

1.5 DBpedia: the hub of the LOD Cloud

At the centre of the LOD cloud (Figure 1.1) lies DBpedia. This multilingual knowledge base contains knowledge that has been extracted from the infobox systems of Wikipedias in 15 different languages (Mendes et al., 2012). Infoboxes are human-authored tables of information, akin to collections of attribute-value pairs, that appear on the side of an article on Wikipedia. They contain factual information such as dates, population sizes, titles of national anthems, etc., in a format that is easy to mine for data. This extracted data is stored as RDF triples, using a number of standard vocabularies for properties (e.g. foaf, dc).

As described in (Mendes et al., 2012), "the DBpedia Ontology organizes the knowledge on Wikipedia in 320 classes which form a subsumption hierarchy and are described by 1,650 different properties. It features labels and abstracts for 3.64 million things in up to 97 different languages of which 1.83 million are classified in a consistent ontology, including 416,000 persons, 526,000 places, 106,000 music albums, 60,000 films, 17,500 video games, 169,000 organizations, 183,000 species and 5,400 diseases."

Due to its status as the unofficial hub of the LOD cloud (being used to interlink many datasets), and its breadth of coverage, the data in DBpedia seems to be an ideal starting dataset for any approach that aims to generate natural language using data in the LOD Cloud.

1.6 Natural Language Generation

Natural Language Generation is the process of creating text in natural language (e.g. English) from an input of conceptual information. NLG allows for adapting the text to specific communicative goals and to user preferences, and for generating text in different natural languages using the same underlying representation. In what is perhaps the standard reference of the field, Reiter & Dale (2000), the authors propose and describe a standard architecture for a Natural Language Generation system. While the approach to NLG in the present project is far from this level of sophistication, it is relevant to present an overview of this architecture here to put the task at hand in context.

According to Reiter & Dale (2000), the architecture of a Natural Language Generation system is modularised, formed of a number of distinct, well-defined and easily integrated modules which perform different functions. A graphical representation of this architecture is provided in Figure 1.2.

It is sometimes observed that NLG consists in making choices (what to say, in what order, with what words, etc.). These choices depend on each other, but can be separated into different levels, forming a pipeline. In this pipeline, domain data in some internal semantic representation is input at one end and natural language text is output at the other.



Figure 1.2 Natural Language Generation Architecture (Based on Reiter and Dale, 2000)

The pipeline consists of three main stages, implemented by as many components:

1. In the document planning stage, the data to be included in the generated text is chosen (content determination), as is the order in which to present it (document structuring). These processes produce an intermediate representation of the structure of the document, labelled "document plan" in this diagram, typically a tree structure.

Document planning takes domain data as input together with a communicative goal, that is, the purpose of the text that is to be generated, such as "describing an entity", "recommending a restaurant" or "comparing flights", depending on the application. The communicative goal typically determines both content determination and document structuring.

Document planning, as well as the other components in this pipeline, can be informed, among others, by: discourse strategies (helping realise the communicative goal), dialogue history (in a dialogue system), constraints upon the output (the resulting text might need to fit in a constrained space, etc.).

Most importantly however, it can be informed by a user model, capturing preferences or specific circumstances that characterise the target audience of the text. Depending on the application, system and communicative goal, this could mean e.g. a preference on sentence length or for the ordering of an argument in the case of a recommendation.

2. In the microplanning stage, the document plan is taken as input to a number of sub- processes, which are to a great extent dependent on each other.


a. Lexicalisation is choosing the content words required in order to express the content selected by the previous module.

b. Aggregation consists in joining together short sentences or chunks of text to create longer sentences. Both coordination and subordination strategies may play a role in this.

c. Referring Expression Generation (REG) deals with how to refer to entities in the discourse. There are multiple ways in which we can refer to the same real-world entity. For example, “Barack Obama” might be referred to as “President Obama”, “Obama”, “the President”, or simply “he”, depending on the communicative context. A distinction is usually made between the first time an entity is mentioned (“initial reference”) and “subsequent reference”. Depending on other factors such as pragmatic considerations, we might want to avoid repetition by using personal pronouns and other referring expressions. These also depend on style considerations of the textual domain. For instance, in newswire it might be preferred to use “President Obama” and “the President” instead of “he”.

3. The surface realisation component takes text specifications as input and outputs surface text. Surface realisation often adopts an “overgenerate and rank” approach, where a number of possible surface realisations are generated and then ranked using a language model (i.e. how likely that sequence of words is) or other scoring functions.

This pipeline allows for much control over the output text, permits a high degree of confidence that the text will be grammatical and semantically accurate by design, and most importantly, allows for adapting and adjusting the output text according to a user model. This has been called the “deep” model for generation, in contrast with “shallow” methods based on templates, as outlined in the following.

1.6.1 Shallow vs. Deep NLG

It has been noted that there is a continuum between shallow and deep NLG methods (Busemann, 2011). Considering “canned text” to be at the shallow end of the scale and “deep” NLG to be at the other, a number of intermediate approaches can be situated in between them, depending on what modules and functionality they implement, as represented in Figure 1.3.

Figure 1.3 Shallow to deep NLG transition (based on Busemann, 2011): a scale running from shallow to deep, from prefabricated texts and "fill in the slots", through flexible templates, aggregation and sentence planning, to full document planning.

The approach presented herein stands on the shallow end of this scale, as the generation is not inherently knowledge-based or theoretically motivated (Busemann & Horacek, 1998), but based on sentence templates with "slots" in them. These templates are sequences of text tokens of two types: static text (words or punctuation) and placeholders for values, linking that slot to the value of a property or variable. A template as we define it here could take the form of:

[name] was born on [dateOfBirth].

In this example, [name] and [dateOfBirth] are the names of properties whose values would be substituted in that sentence in lieu of the properties themselves (e.g. “John Doe was born on 14 October 1066.”).
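
As a minimal sketch of how such a slot template can be instantiated (an illustration only, not the LOD-DEF code; the regular expression and property names are assumptions):

import re

def fill_template(template, properties):
    # Replace each [property] slot with the corresponding value, if one is available.
    return re.sub(r"\[([^\]]+)\]",
                  lambda m: properties.get(m.group(1), m.group(0)),
                  template)

print(fill_template("[name] was born on [dateOfBirth].",
                    {"name": "John Doe", "dateOfBirth": "14 October 1066"}))
# "John Doe was born on 14 October 1066."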

Templates can deal to a large extent with the issue of lexicalisation, for they already contain many of the words used and as such are a lexical choice, and with that of aggregation, as they can contain complex grammatical structures where only values have to be substituted in.

The approach presented herein also incorporates characteristics of deeper NLG, by performing a kind of document planning as described in 3.3.5.


Chapter 2

Previous approaches

A number of previous approaches to Natural Language Generation for the Semantic Web have been adopted. Of these, a majority have been concerned with verbalising OWL ontologies (cf. Stevens et al., 2011; Hewlett & Kalyanpur, 2005; Liang et al., 2012), and the verbalisation of factual data has remained somewhat under-addressed. In the following I situate my work in the context of previous efforts by providing an overview of the most relevant ones.

2.1 Hand-coded, rule-based

In a first category, there have been a number of approaches to NLG for the SW that employed a deep NLG architecture like the one described in section 1.6. Perhaps the most interesting of these to date is the NaturalOWL system (Galanis & Androutsopoulos, 2007), which could potentially be more easily applied across domains. It builds upon the M-PIRO system (Androutsopoulos et al., 2001), used for multilingual generation of museum exhibit descriptions, adapting it to use OWL ontologies and RDF data. The classes and properties in the ontologies are explicitly annotated with text in multiple languages to carry out the generation, which enables the system to generate multilingual text from RDF data.

2.1.1 Assessment

NaturalOWL is a versatile and powerful system, including a full NLG pipeline adapted from an already-successful system with commercial applications. It can achieve high quality output, and is multilingual by design. In essence, this system is a solid implementation for Linked Open Data of the NLG Architecture described in section 1.6. As such, we can see in it the same benefits and shortcomings. Great control over the output comes with a requirement for many expert man-hours and limited transferability between textual domains. Furthermore, the approach requires publishers of Linked Data to provide non-trivial annotations of the ontologies they publish. It remains to be seen to what extent this is a realistic expectation.

2.2 Generating directly from RDF

A competing approach is generating directly from RDF with few hand-coded rules, particularly representative of which is the Triple-Text system of Sun & Mellish (2007). The authors note that RDF predicates typically encode rich linguistic information, that is, their URIs are meaningful chunks of natural language. Sun & Mellish (2007) have exploited this information to automatically generate natural English text from triples without using domain dependent knowledge.

Their approach, the Triple-Text (TT) system, is based on processing the predicate of the triple. Words forming predicates that are meaningful in English are typically concatenated into one string with no spaces or space-equivalent characters (e.g. underscores), but they are also typically "camel-cased", that is, written with uppercase characters marking the word boundaries (e.g. "hasProperty", "wasBornIn"). This makes it easy to tokenise the predicate into its building blocks (e.g. "has" + "property", "was" + "born" + "in"). This sequence of tokens is then assigned part-of-speech (POS) tags and classified into one of six categories, depending on its format (e.g. "has" + [unit]* + [noun]). For each category a different rule is applied to build the output sentence (e.g. "has" + det* + "units" + "noun").

As an example, given the triple “North industryOfArea manufacturing_sector”, the system generates the sentence “The industry of the North is manufacturing sector.”
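
A minimal sketch of this kind of camel-case tokenisation (an illustration only, not the Triple-Text implementation):

import re

def split_predicate(predicate):
    # e.g. "wasBornIn" -> ["was", "born", "in"]; "hasProperty" -> ["has", "property"]
    return [t.lower() for t in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", predicate)]

print(split_predicate("wasBornIn"))
print(split_predicate("hasProperty"))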

2.2.1 Assessment

Simple as it is, this approach is interesting in many ways: it is reasonably domain-independent, it is very fast, inexpensive and intuitive to deploy and can provide an immediate lexicalisation of triples to natural language text without the need for domain-dependent knowledge.

However, its shortcomings severely limit its applications. First, generation from single RDF triples is limited by the fact that the relations encoded in a triple are only between two entities. Human discourse is on average much richer, often including relations that can involve more than two entities, like ditransitive verbs, which require a subject, a direct object and an indirect object (e.g. "John gave the book to Mary"). The authors point this out and suggest that the next step is generating from multiple triples. Second, no mechanism is provided to perform document planning (as described in 1.6) when dealing with a collection of triples that should be lexicalised together. Finally, the output is not always grammatically correct and it cannot be easily adapted to a specific domain, as it does not take into account the ambiguity inherent to natural language (e.g. polysemy) and relies on using the same words found in the RDF predicates for realisation.

The baseline implemented as part of this project (see section 3.4) draws much inspiration from this approach, extending it to use rdfs:label properties for verbalisation and combining it with a baseline Referring Expression Generation algorithm.

2.3 Unsupervised trainable NLG

Perhaps the most relevant previous work on trainable NLG is that of Duboue & McKeown (2003), who describe a system that learns content determination rules statistically from parallel text and semantic input data. They collected the information in the knowledge base for this application by crawling websites with celebrity fact-sheets and obtained the biography texts for these celebrities by crawling other websites.

They align the data with the text (i.e. the "Matching" stage) using a two-step approach. In the first step they identify the spans of text that are verbatim copies of values in the data. The second step is building statistical language models and using the entropy of these to determine if other spans of text are indicative of other values being expressed. There is an amount of inference and reasoning involved in this approach, such as deciding that "young" describes someone within a certain age range. This is specifically applied to short biographies.


2.3.1 Assessment

This work focuses on a limited domain and only on content determination. The output of their system is however still exclusively dependent on hand-written rules and is specifically targeted at the constrained domain of biography generation. However, their approach to automatic learning of content determination is undoubtedly far superior to what I present in this paper. The approach taken in this project is equivalent to the baseline of Duboue & McKeown (2003), or the first matching step in their algorithm: only literal values found in the data are matched in the text.

2.4 Automatic summarisation

Very related to the approach presented herein is the wealth of work done in the field of automatic summarisation, which consists in creating a summary of a text by automatically choosing the most relevant information in it and collating it 2 . A subfield of automatic summarisation, frequently called “text-to-text” Natural Language Generation (to differentiate it from full “concept-to-text” NLG) deals with the generation of documents by extracting information from multiple input documents.

This can be seen as very related to this project, insofar as multi-document summarisation also deals with the extraction of sentences and with concatenating them in an organised way to create a new document. However, a main difference stands out: text-to-text NLG only deals with processing documents that are all about the same entity, subject or topic and extracting the most relevant sentences from those documents to create a new document. This stands in contrast with the problem we are tackling here: we want to generate natural language describing an instance in an ontology for which there may be no such text available. We then need to identify sentences about an entity that will be transferable, that is, will be true of other entities of the same type, more particularly, sentences that express values of properties that other entities of that type will have.

Where those sentences are not directly available in the text, we can try to modify them to make them transferable. This is much related to the frequently addressed task of sentence compression in automatic summarisation, which consists in creating a summary of a single sentence. Often this is addressed using tree operations, where a sentence is parsed and the parse tree is modified, with the most frequent operation being the deletion of constituents. Where for summarisation these constituents are removed because they are deemed of lesser importance, in the present approach they are deleted where there is no evidence for them in the data.

A number of previous approaches to this deletion problem exist (e.g. Cohn & Lapata, 2009; Filippova & Strube, 2008), but here I specifically borrow the term syntactic pruning from the work of Gagnon & Da Sylva (2006). Their approach is to parse the sentences with a dependency parser and apply a number of hand-built pruning rules and filters to simplify and compress those sentences. The approach presented here is similarly rule-based.

2 Methods for summarisation are generally classified into extractive, i.e. extracting sentences from the text based on their salience score and joining them, and abstractive, i.e. producing a new, shorter text (Gagnon & Da Sylva, 2006). Both of these categories are relevant here.


Chapter 3

Design

3.1 Design overview

The present approach is based on two main intuitions. One is that if we can identify sentences in text expressing factual information about an entity that are transferable (i.e. would also be true of another entity of the same class) we can use them as sentence templates for that class.


Figure 3.1 System overview

This requires that we first identify literal values in the text that are the values of properties in the data, and then select and edit the sentences to make sure they express no information that would not be true of another entity, i.e. information that is not a value in the input data. An example of this would be "[foaf:name] was one of the greatest [rdf:type].", as this template contains a value judgement that is not supported by the data.

The second intuition is that we can model the content of an article by collecting statistics about the properties whose values we have spotted in the text.



Figure 3.2 Aligning data and text: spotting property values

As an illustration of this intuition, consider the RDF triples and text shown in Figure 3.2. We can paraphrase the data in RDF as “:Johann_Sebastian_Bach is an entity of type German Composers, his death place is Leipzig, his birth place is Eisenach”, etc.

In this particular example, we can align all the values of all the RDF properties with spans of text (illustrated by arrows in Figure 3.2), and this is here called spotting. Although it is often the case that string literals in the RDF can be matched with identical strings in the text, sometimes a conversion of these values is required. For example, the value “1750-07-28” is matched with the non-identical string “28 July 1750”, as these different formats for dates represent in fact the same value.
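
As a minimal sketch of the kind of value normalisation this requires (an illustration only, not the LOD-DEF code; the chosen output formats are assumptions):

from datetime import date

def date_surface_forms(iso_value):
    # Plausible English surface realisations of an xsd:date literal such as "1750-07-28".
    d = date.fromisoformat(iso_value)
    return {d.strftime("%d %B %Y").lstrip("0"), d.strftime("%B %Y"), iso_value}

print(date_surface_forms("1750-07-28"))
# {'28 July 1750', 'July 1750', '1750-07-28'}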

Having spotted the values, we can assume we have spotted the properties that generated them, which then allows us to replace each of those values in the text by a symbolic link to the property that generated it and extract the sentence template:

[foaf:name] (b. [dbont:birthPlace], [dbont:birthDate], d. [dbont:deathPlace], [dbont:deathDate]) was a [dbprop:shortDescription].

This template contains no information that is not supported by the data, and is therefore transferable and can be instantiated for any other entity of the same class (in this case, yago:GermanComposers 3 ) for which we have the same properties, e.g.:

Ludwig van Beethoven (b. Bonn, 17 December 1770, d. Vienna, 26 March 1827) was a German composer.

This approach can be seen to a certain extent as a conceptual hybrid between shallow NLG systems, where the information to be represented is stored in a symbolic data structure, and text-to-text NLG, where the content determination and document structuring are automatically learned from text, and the surface realisation is carried out via templates that are also automatically learned from text.

3 Throughout this thesis, I refer to classes used by DBpedia with the namespace prefix yago:, defined as @prefix yago: <http://dbpedia.org/class/yago/>

The hypothesis presented here is that this system, trained to learn document planning and sentence templates, will perform better in subjective human evaluation than a simple baseline generating directly from English RDF predicates. The system is also ranked against human-generated text for the same data, which is expected to perform better, and so the hypothesis is that this is an upper bound.

3.2 Goal

The system must generate descriptions, approximately equivalent to Wikipedia “stubs”, for any entity in an RDF ontology, focussing on factual data. This is therefore the one hard-wired communicative goal of the system.

This system must be inexpensive and fast to deploy, using readily available resources and avoiding any manual annotation of data, while also keeping hand-written rules to a minimum and avoiding domain-specific ones (e.g. specific for biographies, descriptions of cities, etc.).

At the same time, it should significantly overcome the shortcomings of direct generation from RDF triples by performing content determination and document structuring as described in section 1.6. Given that the rules for this are not present in the data, they must be statistically acquired from text aligned with the data.

Crucially, the system must be able to extract sentence templates from the training text for use in generation. These templates should verbalise the values of properties identified in them in the training text. Most importantly, these sentences should be transferable between instances of the same class: the system should only extract sentences that would hold true for other instances of the same class with different property values.

Also, this system is specifically targeted to use Linked Open Data, which implies that it should exhibit a degree of robustness to inconsistencies and redundancies in the data. Linked Open Data, unlike a relational database, has a very flexible schema, so the system should be able to deal with e.g. a property value being available for one instance of a class but not for another. Likewise, it should not depend on hand-picked lists of predicates, with some exceptions for very widely used ones (e.g. rdfs:label).

As opposed to e.g. Stevens et al. (2011), the aim is not to fully process OWL semantics; the focus here is exclusively on factual, literal, non-temporal data that can be expressed in quantities and string literals.

Finally, it should be easy to adapt to other domains. A specific format of text and a specific schema for the data should not be required, as long as one article of description text per entity is available, together with RDF triples that contain properties aligned with an instance in an ontology. It should also be conceivable to adapt the system to other languages for which the required resources exist (e.g. a trained parser).


3.3 Tasks

3.3.1 Aligning data and text

Automatically aligning the RDF data with the text can be seen as a case of Named Entity Recognition, a well-established task used for such processes as Text Mining and Sentiment Analysis, which consists in identifying and annotating a number of “entities” in a text (Feldman & Sanger 2007). These entities can be names or values, such as quantities (both as digits and as string literals), dates, etc.

General-purpose NER systems are usually based on three different matching techniques: dictionaries (also known as gazetteers), regular expressions and trained classifiers (Conditional Random Fields, MaxEnt, etc.). In this implementation, the NER task only uses gazetteers, regular expressions and heuristics for normalising and recognising quantities. Despite this being a very simple approach, for this specific project it is adequate as a baseline. Given that we have prior knowledge about what entities we expect to find in the text, the problem is limited to finding them. Cases of ambiguity are much reduced and the problem is limited to recognising values when they appear.

Selecting the RDF properties whose values are to be spotted in the text is dependent on the domain of the text and on the way data is encoded in RDF/OWL. We can think of this as a window over the graph, defined by a number of edges or relations. The ideal distance in edges to consider is dataset-specific: depending on the design of the dataset, very complex relationships may exist between nodes in the graph. In the present approach, only triples that have as subject the main topic of the article being processed (the title entity) are retrieved (i.e. s p o, where s = title entity URI). These triples can have any property and any value. This was deemed to be sufficient for this dataset, and retrieving longer paths through the graph was found to significantly increase the complexity of the extraction.
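
For example, the retrieval for one article amounts to a query of the following form (shown here for the Bach article as an illustration of the pattern, not a quote of the thesis code):

SELECT ?p ?o WHERE { :Johann_Sebastian_Bach ?p ?o . }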

3.3.2 Extracting templates

The system must extract sentence templates with “slots” in them as described in 1.6.1. These slots correspond to spotted properties of the class being described. As others before (Sripada et al., 2003), I find Grice’s Cooperative Principle and its maxims (Grice, 1975) capture fundamentally important aspects of an NLG system. Crucial to the approach presented herein is the maxim of quality: “contribute only what you know to be true; do not say that for which you lack evidence”. As per this maxim, the output of the system is desired to be truthful, which means that we should make sure the textual output is supported by evidence in the input data. The sentence templates extracted should be then transferable, that is, they should hold true for any entity of a given class that has the same properties as the entity for which the original sentence was written.

In order to realise this, similarly to (Gagnon & Da Sylva, 2006), the system needs to parse the source sentences to prune them of constituents, but in our case, these constituents are those for which we have no evidence in the data.

Parsing consists in taking a sentence in natural language and determining what its most likely parse tree would be, i.e. how its grammatical constituents are clustered and nested. It is important to note here that parsing is an area of active research and, due to natural language ambiguity, far from a solved problem; parsing is therefore another step in the pipeline that is likely to be a significant source of errors.

3.3.3 Dealing with Linked Open Data

A brief examination of data from DBpedia brings key issues to the fore. First, properties of an instance can have multiple values. For example, let us examine the following triples:

:Carl_Maria_von_Weber dbont:birthPlace :Eutin.
:Carl_Maria_von_Weber dbont:birthPlace :Holstein.

The property dbont:birthPlace here has two values, both of them are correct, and both of them may be necessary, as Eutin is contained in Holstein. It would perhaps have been better to store this information only as the more finely-grained value (:Eutin) and to leave the task of performing inference to reach the coarser-grained container (:Holstein) to the smart agent. However, given that both these values are in the data, the system must be able to deal with this. Our approach is to group together spotted entities in text that are the values of the same property and to keep track of properties spotted as lists in text. When generating, we can use this information to determine if only one surface realisation must be chosen from the options given or if all of them must be shown as a list.

Second, multiple properties can have the same value. These properties may have very different meanings, as in the following example:

:Carl_Orff dbont:birthPlace :Munich.
:Carl_Orff dbont:deathPlace :Munich.

This creates a different problem, with arguably more difficult solutions, as it requires disambiguating between surface realisations. A number of approaches could be applied to this, some variant of the EM algorithm for instance.

Third, there are many properties that have the same meaning and yet are often present for the same entity, which makes them completely redundant. These redundant triples are either kept for backwards compatibility or due to an incomplete alignment of vocabularies when aggregating different sources of data. Consider, for example, these triples:

:Johann_Sebastian_Bach dbont:birthPlace :Eisenach.
:Johann_Sebastian_Bach dbprop:birthPlace :Eisenach.
:Johann_Sebastian_Bach dbont:placeOfBirth :Eisenach.
:Johann_Sebastian_Bach dbprop:placeOfBirth :Eisenach.

It is immediately clear in this example that the four predicates shown are actually one and the same in meaning, and their values confirm this. The equivalence of these predicates is well known and documented (Mendes et al., 2012), but it is not retrievable from the triples themselves. OWL implements a system to link equal URIs, via the owl:sameAs property (Bechhofer et al., 2004), which is available for some properties but not for others 4.

The solution to the last two problems is to compute predicate similarity (essentially by counting the times two predicates have the same value) and to group significantly co-occurring properties together into predicate "pools". The similarity of two predicates is their similarity coefficient, conceptually identical to Dice's coefficient. This is computed by dividing the number of times the two properties have the same value for an entity of class C by the number of times they both appear for entities of that class. This is only computed over the set of entities of that class for which both properties are defined (i.e. have a value other than a null string).

4 While we would expect this to be a solved problem when dealing only with data from a self-contained knowledge base such as DBpedia, it can be seen as a good example of a common problem of LOD. As such, this is an opportunity to rise to the challenge and provide a solution for it.

For example, the four predicates seen above are frequently grouped into a single pool, which takes the name of the most frequent of these predicates (or the first in alphabetical order in case of a tie).

Once this similarity metric has been computed, it is used to discard those sentence templates that contain conflicting predicates in their slots, e.g. where "birthPlace" and "deathPlace" have been spotted with the same value although they belong to different "pools", that is, over all the text their Dice coefficient is smaller than a constant. Other options were considered, such as a semantic similarity metric between the rdfs:label of the predicates and the context words. Duboue & McKeown (2003) have a different approach to clustering which includes hand-input rules for inference (e.g. people whose age is 1 < age < 25 are labelled "young"). For this baseline implementation, however, the simplest option was chosen.
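
A minimal sketch of this similarity coefficient (an illustration of the description above, not the thesis implementation; the data layout is an assumption):

def predicate_similarity(values_a, values_b):
    # values_a, values_b: dicts mapping entity URI -> value of the predicate (or None).
    # Only entities of the class for which both predicates are defined are considered.
    shared = [e for e in values_a if values_a[e] is not None and values_b.get(e) is not None]
    if not shared:
        return 0.0
    same = sum(1 for e in shared if values_a[e] == values_b[e])
    return same / len(shared)

print(predicate_similarity(
    {":Bach": ":Eisenach", ":Orff": ":Munich"},
    {":Bach": ":Eisenach", ":Orff": ":Munich"}))   # 1.0, i.e. candidates for the same pool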

3.3.4 Modelling different classes

The system has a single communicative goal: the description of an instance of a class. However, descriptions of entities belonging to different classes can be seen as belonging to different textual sub-domains. What is relevant in the description of, for instance, a rock band, is unlikely to be relevant (or even apply) to the description of a species of animal. Similarly, the same predicates need not be expressed in the same order for all classes. Finally, sentence templates can also be expected to depend on the class of the entity, both because of the predicates they realise and of the lexical items and structures used in them. To a certain degree, this also applies to classes that belong to a common super-class, e.g. while both being instances of the super-class “Company”, the airline KLM and the software company Oracle should intuitively be treated as instances of different classes.

This means we need to choose, from those available, the class that an entity would be most prototypical of, as defined by Prototype Theory (Rosch, 1973). This class must have the right granularity or level of detail, both for training and generating. It must not be too general that the statements are too generic or irrelevant, or too specific that the extracted templates do not generalise to other entities of the same class.

This task is surprisingly nontrivial, as the standard class inheritance mechanism implemented by RDF (i.e. “entity rdf:type class”) allows for multiple class inheritance, and as an initial exploratory analysis of the data shows, this mechanism is very frequently used and a single entity typically belongs to a number of classes (for example, J.S. Bach belongs to both “ComposersForPipeOrgan” and “PeopleFromEisenach” 5 ). This makes choosing the “right” class for an entity a problem that requires nontrivial inference to solve.

5 It could be argued that this information would be much better encoded using properties, e.g. ”:Johann_Sebastian_Bach :composedForInstrument :PipeOrgan” ”:Johann_Sebastian_Bach :bornIn :Eisenach” This, however, would require extracting (mining) this information from the names of Wikipedia categories.


For the present approach I develop a baseline class selection algorithm, essentially consisting in computing a score for each class based on term frequency scores (i.e. the count of times a word appears in the class names of the entity) and selecting the n-highest. The intuition is that category overlap can help determine which class is more central to the entity, and help choose the class with an adequate granularity 6 . That is, if an entity belongs to a number of classes with “composer” in the name, the degree of confidence should be higher that the entity’s class should be “composer” or a subclass of it. A detailed explanation of this is offered in section 4.4.
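
A rough sketch of this kind of term-frequency scoring over class names (an illustration only; the actual algorithm is detailed in section 4.4 and may differ):

from collections import Counter
import re

def class_scores(class_names):
    # Count how often each word appears across all of the entity's class names,
    # then score each class name by the summed counts of its words.
    tokenise = lambda name: [w.lower() for w in re.findall(r"[A-Z][a-z]+", name)]
    counts = Counter(w for name in class_names for w in tokenise(name))
    return {name: sum(counts[w] for w in tokenise(name)) for name in class_names}

print(class_scores(["GermanComposers", "ComposersForPipeOrgan", "PeopleFromEisenach"]))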

3.3.5 Document planning

N-grams are sequences of items in succession, of size n, which capture the probability of a sequence of items appearing (Jurafsky & Martin 2009, pp. 117-124). N-gram models are frequently used to model language, and have also been applied to capturing the likelihood of a sequence of concepts, rather than words, in text. This approach has been previously applied by e.g. Galley et al. (2001), who used it to aid in document structuring for a dialogue system. Their "word-concept n-grams" are in our situation equivalent to RDF predicates.

Duboue and McKeown (2003) also used n-grams to refine their statistical model for content determination. Here I implement their baseline: we aim to learn content determination by collecting unigrams (1-grams) of spotted predicates in the text. If the frequency of a predicate is below a threshold in the articles for a given class of entity, the system should not output it, even if an instance of this class has this property in the data.

According to Reiter and Dale (2000), document structuring carries out more complex selection and ordering of information than just sequencing; it treats the text as a tree structure and clusters related items of information. Given that the sentence templates extracted from the text express several properties, we can think of them as partial trees, part of the bigger tree required for document structuring, so I expect that extracting these templates and ordering them in the right way will yield good results.
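
A minimal sketch of collecting such predicate n-gram statistics from the sequences of properties spotted in training articles (an illustration only, not the thesis code; the data layout is an assumption):

from collections import Counter

def predicate_ngrams(spotted_sequences, n=2):
    # spotted_sequences: one list of spotted predicates per training article.
    counts = Counter()
    for seq in spotted_sequences:
        padded = ["<s>"] + list(seq) + ["</s>"]
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return counts

print(predicate_ngrams([
    ["foaf:name", "dbont:birthDate", "dbont:birthPlace", "dbont:deathDate"],
    ["foaf:name", "dbont:birthDate", "dbont:deathDate"],
]))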

3.4 Baseline

In order to evaluate this approach, a comparable baseline approach is necessary. The baseline generation system I implement here is exclusively based on direct generation from RDF triples. It is loosely based on Sun & Mellish (2007), in that single triples are used to generate single sentences and I use a shallow linguistic analysis of the words in the predicate to determine the structure of these sentences.

However, a number of differences stand out. First, as opposed to Sun & Mellish (2007), this baseline does not directly split out the "camel case" in predicates. RDF predicates, being URIs themselves, have properties like rdfs:label, and these triples are available on DBpedia. Given this, I first attempt to retrieve the rdfs:label of the predicate for use in generation. It is an underlying assumption of this approach that labels in all the languages we are concerned with (here exclusively English) will be available in the triples. If a label is not available, the system backs off to splitting the words in the predicate URI.

6 This does not attempt to provide a definitive solution to this problem, but to solve it to a satisfactory degree for the present application. A more sophisticated approach would perhaps have to account for the fact that the "right" class necessarily depends on the application and the context, for instance when deciding whether Mauritius is an IslandCountry or an AfricanCountry, when it is just as prototypical a member of either category.

The first sentence created by the baseline is an expression of the class of the entity, formed by the name of the entity (i.e. its rdfs:label) followed by “is a” and the rdfs:label of the class of the entity, e.g. “Johann Sebastian Bach is a German composer.” This class is chosen using the class selection algorithm detailed in section 4.4.

All subsequent sentences are composed according to the following logic:

- If the retrieved or created label starts with an auxiliary verb (i.e. "is" or "has"), the article "a" is inserted after that first word, the first word of the sentence is made to be the nominative personal pronoun (i.e. "he", "she", "it"), and the value(s) are appended to the sentence, separated with a colon. For example, from London_Heathrow_Airport foaf:isPrimaryTopicOf <http://en.wikipedia.org/wiki/London_Heathrow_Airport>, the resulting sentence is: "It is a primary topic of <http://en.wikipedia.org/wiki/London_Heathrow_Airport>".

- If the label starts with "was", the sentence is created using the template "[personal pronoun] [property label] [values]."

- Otherwise, the sentence is created with "[possessive pronoun] [property label] [is/are] [values]."

The predicate text is converted to the plural if there are several values, and several values are presented as a list, e.g. "His names are X, Y and Z". It is an implied assumption that predicate labels will be in the singular.
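
A minimal sketch of this per-triple composition logic (an illustration only, not the actual baseline code; the function name and the naive pluralisation are assumptions):

def baseline_sentence(label, values, pronoun="it", possessive="its"):
    # Compose one sentence from a predicate label and its value(s), following the rules above.
    value_text = values[0] if len(values) == 1 else ", ".join(values[:-1]) + " and " + values[-1]
    words = label.split()
    if words[0].lower() in ("is", "has"):
        return f"{pronoun.capitalize()} {words[0].lower()} a {' '.join(words[1:])}: {value_text}."
    if words[0].lower() == "was":
        return f"{pronoun.capitalize()} {label} {value_text}."
    plural = len(values) > 1
    label_text = (label + "s") if plural else label   # naive pluralisation of the label
    return f"{possessive.capitalize()} {label_text} {'are' if plural else 'is'} {value_text}."

print(baseline_sentence("birth place", ["Eisenach"], pronoun="he", possessive="his"))
# "His birth place is Eisenach."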

3.4.1 Coherent text

As opposed to Sun & Mellish (2007), who only generated single sentences from single triples, we are dealing with a collection of triples encoding information about a single entity, and we wish to present this information as a coherent text, made up of several sentences connected using coherence devices like coreference.

For this, the baseline implements a very simple Referring Expression Generation algorithm, which operates in the following way: The initial reference to the title entity expresses the full name of the entity being described, by retrieving its foaf:name or rdfs:label for the language we are generating in (throughout this paper, always “en” for English). Subsequent references to that entity will use a personal pronoun (e.g. “he”, “she”, “it”). The system specifically retrieves the value for foaf:gender for the title entity whose description is being generated and chooses the right pronoun and possessive based on its value (e.g. “he” and “his” for “Male”, “she” and “her” for “Female”).
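A minimal sketch of this pronoun choice, assuming a hypothetical helper get_value(entity, predicate) that returns the foaf:gender literal for the entity (or None):

PRONOUNS = {"male": ("he", "his"), "female": ("she", "her")}

def referring_pronouns(entity_uri, get_value):
    # Choose personal/possessive pronouns from foaf:gender; default to neuter.
    gender = (get_value(entity_uri, "foaf:gender") or "").strip().lower()
    return PRONOUNS.get(gender, ("it", "its"))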


This is a simple approach that produces acceptable results in English. It is perhaps relevant to note here that different languages will have different requirements for the treatment of grammatical gender, and a more complex approach would probably be necessary. In the baseline implementation of the system, no attempt is made to order the sentences. Also, no attempt is made at document planning, with one exception: simple heuristics and an "ignore list" filter out of the input those properties with inappropriate values. Specifically, values containing more than ten words and values which are integers below 31 are ignored, together with a number of predicates such as purl:subject. This essentially helps filter out output that would be too verbose and visually strident, with the aim of making the baseline competitive in evaluation.


Chapter 4

Implementation: Training

In this chapter I present an implementation of the training phase according to the design principles outlined in Chapter 3: the LOD-DEF system. The system is trained on a corpus of text documents, each of which is an article from the Simple English Wikipedia, and RDF triples for the same entity, retrieved at runtime from DBpedia Live. The full training pipeline is described in this chapter: what steps are taken to train the NLG model from the text and data. The diagram in Figure 4.1 represents the pipeline for each article, while the post-processing steps are represented in Figure 4.3.

4.1 Obtaining the data

4.1.1 Wikipedia text

The text from the Simple English Wikipedia was downloaded from the Wikipedia dump site 7 . This is one large compressed XML file, containing the latest revision of all the articles on the SEW, together with a small amount of metadata, e.g. author information for the most recent edit. This text is in "wiki markup" format, including in it information like infoboxes, inter-language links, markers such as stub and redirection, etc.

This text then had to be processed to remove all mark-up unnecessary to the task, leaving only the article text and the links to other articles. This step is done only once, before running the training pipeline on any article. Steps 2-7 apply to every processed article.

4.1.2 DBpedia triples

RDF triples from DBpedia are retrieved at runtime for each article via SPARQL queries. The standard SPARQL endpoints used were the ones provided by DBpedia Live 8 . The endpoints proved quite problematic: they frequently experienced high load, which meant slower response times, and they were offline for several hours on several occasions. To deal with this and to speed up testing, I implemented an offline triple cache using SQLite3, as an alternative to setting up a dedicated triplestore for this project, which would have been a considerable expenditure of time and resources.
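A rough sketch of such a cache using Python's built-in sqlite3 module follows; fetch_from_endpoint is a hypothetical stand-in for the actual SPARQL call, and the table layout is illustrative only.

import json
import sqlite3

class TripleCache:
    # Minimal offline cache: store the (pred, obj) rows returned for a subject URI.
    def __init__(self, path="triples.db"):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS triples (subject TEXT PRIMARY KEY, rows TEXT)")

    def get(self, subject, fetch_from_endpoint):
        row = self.db.execute("SELECT rows FROM triples WHERE subject = ?", (subject,)).fetchone()
        if row:
            return json.loads(row[0])          # cache hit: no network round-trip
        rows = fetch_from_endpoint(subject)    # cache miss: query the SPARQL endpoint
        self.db.execute("INSERT OR REPLACE INTO triples VALUES (?, ?)", (subject, json.dumps(rows)))
        self.db.commit()
        return rows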

Triples are retrieved using the SPARQL query:

SELECT ?pred, ?obj WHERE { <http://dbpedia.org/resource/%s> ?pred ?obj.}

7 http://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2

8 http://live.dbpedia.org/sparql and http://dbpedia-live.openlinksw.com/sparql


This returns all the triples in DBpedia with the entity marked by %s as subject 9 , using its "wiki link" as search keyword. The triples returned by this query are first filtered by language, as we are only concerned with triples for English in this case 10 . Triples with literal values in English often do not have a language suffix, as English is often treated as the "default" language, so we need to accept both "en" and "" (the empty string) as valid languages. Second, triples using a predicate from an "exclude" list are filtered out. These are predicates that were found to be useless for our purpose and to induce noise, add data overhead and increase training time. Examples of these predicates are "http://www.w3.org/2002/07/owl#sameAs" and "http://purl.org/dc/terms/subject" (the values of which are already available as rdf:type properties connected to YAGO classes).
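The client-side filtering just described could look roughly like the sketch below; the exclude list is abbreviated, and the (predicate, object, language tag) row format is an assumption about how the endpoint results are represented.

EXCLUDED_PREDICATES = {
    "http://www.w3.org/2002/07/owl#sameAs",
    "http://purl.org/dc/terms/subject",
    # ... further noisy predicates
}

def filter_triples(rows, languages=("en", "")):
    # Keep only English (or untagged) literals and drop excluded predicates.
    kept = []
    for pred, obj, lang in rows:
        if pred in EXCLUDED_PREDICATES:
            continue
        if lang is not None and lang not in languages:
            continue
        kept.append((pred, obj))
    return kept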


Figure 4.1 Training pipeline for each article

9 The special %s marker is interpreted by a formatting function which substitutes it for a parameter string.

10 Using the FILTER keyword in SPARQL queries resulted in much longer response times from the endpoint, together with occasional time-outs, and more so the more complex the filter, so I decided to use it sparingly and to filter the results client-side for increased robustness and faster execution. This is perhaps less than ideal within the vision of the Semantic Web, but given the limitations of the available servers it is often more practical to retrieve more information than needed and filter it client-side rather than relying on filtering by the SPARQL endpoint.


4.2 Tokenizing and text normalisation

The first step in the pipeline is to tokenize the Wikipedia text, i.e. separate the text into words, punctuation and sentences. Several standard tokenizers were tested for this task, and their results proved quite unsatisfactory 11 . Therefore a custom-built algorithm is used, which takes into account the format of Wikipedia mark-up. Tokens are individual elements of the sentence, so punctuation is tokenized individually: commas, colons, semicolons, parentheses, brackets, etc. are all treated as individual tokens. The exception is the apostrophe ('), so clitics like "n't" and the genitive "'s" are tokenized as part of a single word, remaining attached to the root word.

These rules are applied in order to facilitate the next processing step, the spotting of values in text. To further ease this, a number of processing stages normalise values found in the text, mainly using Regular Expression matching and replacement. As an example, if a number is expressed in words (e.g. “fourteen”) it is converted to its equivalent in digits as a string (“14”).

The same date may appear in the Wikipedia in several formats, e.g. “17 Aug 2012”, “17 Aug, 2012”, “17 August 2012”. We could either normalise these first in the text or perform a RegEx matching for each generated date for each processed text. In the interest of processing ease and efficiency, we do it once and normalise all dates found to one standard format: YYYY-MM-DD (year-month-day) using all digits. This is the same format used in xsd:date values (ISO 8601 12 ), which further eases spotting.
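As an illustration, the date normalisation can be done with a regular-expression substitution along the following lines; the pattern shown is a sketch that covers only the "day month year" formats mentioned above.

import re

MONTHS = {"jan": "01", "feb": "02", "mar": "03", "apr": "04", "may": "05", "jun": "06",
          "jul": "07", "aug": "08", "sep": "09", "oct": "10", "nov": "11", "dec": "12"}

DATE_RE = re.compile(r"(\d{1,2})\s+([A-Za-z]+),?\s+(\d{4})")   # "17 Aug 2012", "17 August, 2012"

def normalise_dates(text):
    # Rewrite "day month year" dates to ISO 8601 YYYY-MM-DD.
    def repl(m):
        day, month, year = m.groups()
        month_num = MONTHS.get(month[:3].lower())
        if month_num is None:
            return m.group(0)          # not a recognised month: leave untouched
        return "{}-{}-{:0>2}".format(year, month_num, day)
    return DATE_RE.sub(repl, text)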

4.3 Aligning: Named Entity Recognition

4.3.1 Surface realisation generation

In a first step, we generate possible surface realisations for the triples retrieved. This is equivalent to building a gazetteer list. The way it is done depends on the type of object of each triple.

If the object is a URI, the rdfs:label for this URI (again identified by %s in the query) is retrieved via a SPARQL query:

SELECT ?label WHERE { <%s> <http://www.w3.org/2000/01/rdf-schema#label> ?label. FILTER (langMatches(lang(?label),"") || langMatches(lang(?label),"en") ) }

If the object is a typed literal, that is, it has an associated data type, the conversion depends on the data type. Strings (xsd:string) are taken as they are with no modification, while xsd:int, xsd:decimal and xsd:double are converted to integers, and xsd:float is converted to a float and rounded to two decimals.
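A sketch of this conversion, assuming each object value is paired with its xsd datatype URI (names and exact rounding behaviour are illustrative):

def literal_surface_form(value, datatype):
    # Convert a typed literal to the string used for spotting in text.
    if datatype is None or datatype.endswith("string"):
        return value                                  # strings are kept as they are
    if datatype.endswith(("int", "integer", "decimal", "double")):
        return str(int(float(value)))                 # rendered as an integer
    if datatype.endswith("float"):
        return str(round(float(value), 2))            # rounded to two decimals
    return value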

4.3.2 Spotting

A second step clusters tokens together to allow maximum-span matching: all surface realisations generated in the step above are ordered by length, longest first.

11 As an example, the Punkt sentence splitter included with the NLTK Python library would consistently fail to separate sentences like "…was born in Bath.Later in life…" or "…was born in [Bath].[London] was his first…".

12 http://books.xmlschemata.org/relaxng/ch19-77041.html


For each surface realisation r, if a series of tokens K matches r, assuming a whitespace character between the elements of K, then the tokens in K are concatenated into a single token t, with one whitespace character inserted between every two tokens of K.

Flexible matching is then done between each token t and each surface realisation r using a regular expression to accept whitespace, hyphens or any other character occurring between two words.
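The flexible match could be built roughly as follows; the set of separator characters tolerated here is an assumption, since the original implementation accepts "any other character" between two words.

import re

def flexible_pattern(realisation):
    # Build a regex that tolerates whitespace or hyphens between the words of a
    # surface realisation (the real implementation may allow further characters).
    words = [re.escape(w) for w in realisation.split()]
    return re.compile(r"[\s\-]?".join(words), re.IGNORECASE)

def matches(token, realisation):
    return flexible_pattern(realisation).fullmatch(token) is not None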

4.4 Class selection

This step could in practice be located anywhere in the pipeline, as it only affects what "class model" will be updated with the learned templates and n-gram counts (as detailed in 4.9). Class models are the data structures holding the extracted sentence templates and stored annotations from the training documents, together with other information after post-processing. Determining what class models the title entity 13 belongs to at this stage makes it more straightforward to save this information with no intervening temporary data structure.

As pointed out earlier, determining the “right” class for an entity is not straightforward. As a working example, consider the rdf:type triples available for Johann_Sebastian_Bach:

:Johann_Sebastian_Bach rdf:type :AnglicanSaints, yago:ComposersForViolin, foaf:Person, yago:ComposersForCello, yago:GermanComposers, yago:GermanClassicalOrganists, yago:PeopleCelebratedInTheLutheranLiturgicalCalendar, yago:ComposersForPipeOrgan, yago:ComposersForLute, yago:OrganistsAndComposersInTheNorthGermanTradition, yago:18th-centuryGermanPeople, yago:PeopleFromSaxe-Eisenach, yago:BaroquEComposers, yago:PeopleFromEisenach, yago:ClassicalComposersOfChurchMusic.

As the example illustrates, entities in DBpedia are aligned with YAGO classes 14 , which are automatically mined from crowd-sourced Wikipedia categories.

We choose the class using the following steps:

1. We retrieve the rdfs:label values for each of the classes. Using a bag-of-words approach, we put all these labels in a single list of words.

2. We add to this vector the words from the first sentence of the Wikipedia article.

3. We remove the stopwords from this list, i.e. prepositions, conjunctions, articles (“for”, “from”, “and”, “a”, “the”, etc.).

4. We compute term frequency (tf) scores for each word in this list, i.e. count how many times they occur in it.

13 The title entity is the entity that is the main topic of the article on the Wikipedia and whose URI is the subject of the triples in DBpedia.

14 YAGO is a freely available knowledge base, derived from Wikipedia, WordNet and GeoNames (Kasneci et al., 2008).


5. We compute a normalized sum of tf scores for every class label, using the formula:

$$ \mathrm{score}(w) = \frac{1}{M}\sum_{i=1}^{N} tf(w_i), \qquad M = \begin{cases} N & \text{if } N > 1 \\ 5 & \text{if } N = 1 \end{cases} $$

where w is the class label string, w_i is the ith element (word) in the string, tf is the term frequency score and N is the total number of elements in w. Note: tf(w_i) will return 0 when w_i is a stopword. M is adjusted here to reflect a dispreference against one-word class names.

6. We order all the classes by their score in descending order and select the n highest as the classes the entity belongs to. We train several models at the same time, given that we cannot be confident that the class we chose is the only one the entity is prototypical of. During the experiments, we set n to 5. (A small sketch of this scoring follows below.)
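The sketch below illustrates steps 1-6; the stopword list is abbreviated, and the adjustment M = 5 for one-word labels reflects the reconstruction of the formula above, so it should be read as an assumption rather than the exact implementation.

from collections import Counter

STOPWORDS = {"for", "from", "and", "a", "the", "in", "of"}   # illustrative subset

def score_class_labels(class_labels, first_sentence):
    # Score each candidate class label by a normalized sum of term frequencies.
    bag = []
    for label in class_labels:
        bag.extend(label.lower().split())
    bag.extend(first_sentence.lower().split())
    bag = [w for w in bag if w not in STOPWORDS]      # step 3: remove stopwords
    tf = Counter(bag)                                 # step 4: term frequencies
    scores = {}
    for label in class_labels:
        words = label.lower().split()
        n = len(words)
        m = n if n > 1 else 5                         # dispreference against one-word labels
        scores[label] = sum(tf.get(w, 0) for w in words if w not in STOPWORDS) / float(m)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)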

As an example, training for :Johann_Sebastian_Bach, the n-best list is shown in Table 4.1. For each of these classes, a class model is created (or updated if already existent).

rdf:type                   Score
yago:GermanComposers       6.0
yago:BaroquEComposers 15   3.3
yago:ComposersForViolin    3.0
yago:ComposersForCello     3.0
yago:ComposersForLute      3.0

Table 4.1 Class scores, n-best list

4.5 Coreference resolution

Coreference resolution can be a complex task and is an active area of research. A number of well-known algorithms exist for it, but given the domain of text we are dealing with, for our purposes a very simple approach to coreference resolution is sufficient.

A very simple heuristic is used for coreference resolution: the first pronoun appearing in the text is assumed to refer to the entity the text is describing (the title entity), and so are all forms of it throughout the text. The title entity, that is, the entity the article is about, its main topic, is henceforth referred to as “$self”. Consider the following example, taking the first two sentences from the article on Johann Sebastian Bach. Coreferent spans of text are in bold face:

"Johann Sebastian Bach (b. Eisenach, 21 March 1685; d. Leipzig, 28 July 1750) was a German composer and organist. He lived in the last part of the Baroque period."

15 Note that the spelling “BaroquEComposers” is taken verbatim from the data. This just offers a hint of how careful one must be when dealing with automatically-mined data like that of DBpedia and YAGO. We explore this issue in more depth in 6.1.


The first pronoun to appear is the “He” starting the second sentence. From this moment on, “he” will be assumed to be coreferential with the entity :Johann_Sebastian_Bach, and so will be the form “his”. For each coreferent token, a coreference annotation is stored.
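A minimal sketch of this heuristic over a token list; the pronoun paradigms are abbreviated and the annotation format is illustrative.

PARADIGMS = [{"he", "his", "him", "himself"},
             {"she", "her", "hers", "herself"},
             {"it", "its", "itself"}]

def annotate_self_coreference(tokens):
    # Mark as coreferent with $self the first pronoun seen and all forms of the same paradigm.
    annotations = [None] * len(tokens)
    forms = None
    for i, tok in enumerate(tokens):
        low = tok.lower()
        if forms is None:
            forms = next((p for p in PARADIGMS if low in p), None)
        if forms is not None and low in forms:
            annotations[i] = "$self"
    return annotations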

4.6 Parsing

For parsing, I employ the Stanford parser with the pre-trained PCFG English model 16 (Klein and Manning, 2003), a widely-used, state-of-the-art, self-contained parser, which also provides a number of pre-trained probabilistic models for other languages. Distributed as a Java Archive (.jar), it is easy to interface or use from the command line or other programming languages 17 . A number of freely licensed and open-sourced parsers were considered (e.g. C&C parser, NLTK Viterbi parser, Berkeley parser) and the final choice was motivated by its robustness, speed, and ease of interfacing.

Of all the sentences containing at least one coreferential token, the ones that also contain at least one spotted property value that is not coreferential with $self are selected as template candidates. For each of these sentences, a specially prepared pre-parse version is created, where for each spotted entity (or each annotation) a placeholder variable is created. This variable takes the name “var_n”, where n is an automatic counter, with values from 1 to N, the number of tokens in the sentence with an annotated spotted entity. So, for instance, given the sentence:

“Carl Maria von Weber (born Eutin, Holstein, baptised 20 November 1786; died 5 June, 1826 in London was one of the most important German composers of the early Romantic period.”

After date normalisation and spotted entity substitution, this sentence becomes:

“Var_1 (born Var_2, Var_3, baptised 1786-11-20; died var_4 in var_5 was one of the most important Var_6 of the early Romantic period.”

The parser assigns a noun (NN) Part-Of-Speech tag by default to unknown words, which all the placeholders are in this case. This is conceptually consistent with the fact that they are spotted entities in the text, and can therefore be nouns. This is done in order to preserve each spotted entity as a single unit. The parser will often nest entities formed by more than one word in the parse tree in ways that complicate the subsequent retrieval of those entities, and even more so the pruning of the tree.

4.7 Syntactic pruning

While parsing can be helpful for a number of tasks (e.g. it can inform coreference resolution by identifying the subject of the sentence), here it is only deemed necessary in order to carry out syntactic pruning to ensure the templates are transferable.

16 Version 1.6.5, from 30/11/2010. The Python library used was designed for the older API and incompatible with more recent versions.

17 This implementation uses jPype (http://jpype.sourceforge.net/) to interface the Java Virtual Machine from Python.


For this, it is considered here that the following grammatical categories require support in the data: nouns, adjectives, adverbs and numerals. The corresponding tags of these categories returned by the parser are: N* (e.g. NNS – plural), JJ*, RB*, CD*. The asterisk here is meant as a wildcard for zero or more characters: NN* should match both NNS and NN. These tags are the ones used in the Penn Treebank, the annotated corpus the Stanford parser English PCFG model was trained on.

By “require support” it is meant that the words with those corresponding tags must have been aligned to values found in the data, i.e. must have a “spotted” predicate. Note that several grammatical categories do not require support, most relevantly verbs. This is because what verbs do require is objects, and it is these that require support. This concept can be seen as very related to techniques of Relation Extraction (Sarawagi, 2008).

The pruning proceeds in three stages:

Stage 1: Each leaf of the tree (i.e. word in the sentence) whose Part-of-Speech tag matches one of the masks (N*, JJ*, RB*, CD*) and which does not have a “spotted” value in the data is deleted.

Stage 2: Context-Free Grammar rules are inverted. For example, if in a standard CFG a constituent is expanded via the rule NP -> DET + N (a Noun Phrase is formed by a head Noun and a DETerminer), if the head noun of an NP is deleted, then the whole NP must be deleted too, together with all the constituents it may contain. The rules used in stage 2 are:

o VP requires V leaf
o NP requires N* leaf (N, NP, NN, NNS, all valid)
o PP requires P and NP
o Verb requires object: either VB* or VP containing it must have sister constituent to the right 18
o WH-phrase (WDT, WP, WP$) requires VP or S leaf
o Coordinating conjunction (CC) requires sister nodes to its left and right of the same type

The rules in this stage are applied successively and repeatedly to the parse tree until no modification is made.

Stage 3:

o Resulting parses are first filtered by the number of spotted entities they still contain. If there is not at least one token that refers to $self and one spotted value, we discard the template candidate.
o Finally, we apply a number of rules to ensure correct punctuation by deleting empty brackets, duplicate commas, etc.

An example of this processing applied to the previous sentence from section 4.6 can be seen in Figure 4.2:

18 This accounts for inconsistencies in the parse trees returned by the Stanford parser.


In stage 1 the following leaves require evidence in the data, and as this is missing, they are deleted: “1786-11-20” (CD), “most” (RBS), “important” (JJ), “early” (JJ), “Romantic” (JJ), “period” (NN).

In stage 2, the NP containing “period”, having lost the only NN that supported it, is deleted. The ADJP “most important”, having no leaves in it, is also deleted.

In post-processing rules, “one of the” is substituted by “a”.

This pruning produces the following pruned template candidate:

[foaf:name] (born [dbont:birthPlace], baptised died [dbont:deathDate] in [dbont:deathPlace]) was a [rdf:type].

This is not a perfectly correct template, given the presence of “baptised” in it with no complement. Although it is extracted and stored, this template is not judged grammatical enough as per the criteria defined in section 6.2.4.


Figure 4.2 Parse tree, with removed constituents underlined.

4.8 Store annotations

At this stage, a list of annotations from step 3 (NER) is collected and saved for the model as a separate list for each document processed. This is done by iterating through all the tokens in the text, ignoring sentence boundaries and storing only annotations that do not refer to the title entity (e.g. foaf:name, rdfs:label, dbprop:name). This will be used in step 4.9.3 to compute counts and probabilities for predicate n-grams.

An example list of annotations for an article would be:

{foaf:name, rdfs:label, dbprop:name}, {dbont:birthDate}, {dbont:birthPlace}, {dbont:deathDate}, {dbont:deathPlace}, {rdf:type}, {dbont:knownFor}



Figure 4.3 Training: post-processing steps

4.9 Post-processing

Post-processing is applied to every class model independently, executing the following steps in succession.

4.9.1 Cluster predicates into pools

First the owl:sameAs property is retrieved for every property spotted (i.e. for every entry in the 1-gram list) and it and its object are added to a single “predicate pool”. Second, the lists of annotations for the analysed articles are processed and properties that are seen to have the same value with high frequency are grouped together in pools. This is done by computing a similarity coefficient based on the Dice coefficient formula:

$$ \mathrm{sim}(p_1, p_2) = \frac{2 \cdot \mathrm{same}(p_1, p_2)}{\mathrm{count}(p_1) + \mathrm{count}(p_2)} $$

This is twice the number of times the two predicates have the same value, divided by the number of times each appears individually. The coefficient is only computed over the set D of entities of class C for which both p1 and p2 are defined (i.e. have a value other than the null string). If the coefficient is above a threshold, the predicates are deemed to be equivalent; experimentally, the value of this threshold is set to 0.9. As an example, foaf:name, rdfs:label and dbprop:name are clustered together in a single pool for all classes used in the experiments.
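A sketch of this computation, assuming values_by_entity maps each entity of the class to a dictionary of predicate-value pairs; the restriction of the denominator to the set D is one reading of the description above.

def predicate_similarity(p1, p2, values_by_entity):
    # Dice-style coefficient over the entities where both predicates are defined (set D).
    D = [e for e, vals in values_by_entity.items() if vals.get(p1) and vals.get(p2)]
    if not D:
        return 0.0
    same = sum(1 for e in D if values_by_entity[e][p1] == values_by_entity[e][p2])
    # twice the number of agreements over the individual occurrence counts (here both |D|)
    return 2.0 * same / (len(D) + len(D))

# predicates with similarity above ~0.9 would be clustered into one pool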

The predicate pool is identified by the most frequent predicate of those in the pool. If they are equally frequent, the first one in the list returned by the sorting function is used. This most frequent predicate is then substituted in all the sentence templates in that model in place of all


the other predicates in the pool. This is mostly for aesthetic reasons and to simplify the generation. It does not affect the behaviour of the system, but makes the visualisation of the extracted templates more intuitive.

4.9.2 Purge and filter sentences

After the predicate pools have been built, each sentence in the model is checked for conflicts between predicates. This is done to account for the fact that predicates with different and possibly opposed meanings can have the same object.

For every sentence template t, for every slot s in t, if s contains more than one predicate after the clustering carried out in the previous step, the predicates are checked for similarity. The Dice coefficient of each pair of predicates is checked (having been computed previously), and if it falls below a threshold (experimentally set to 0.1), the sentence template is discarded.

Ideally, sentences that express the same set of properties would be filtered here according to some criteria (e.g. length in tokens, number of symbol tokens present, an overall character-length preference, etc.), and the best or n-best ones would be kept. This is, however, not implemented in the LOD-DEF system.

4.9.3 Compute n-gram probabilities and store model

The n-gram counts collected are adjusted to reflect probabilities using Maximum Likelihood Estimation. A very simple smoothing technique is applied, equivalent to add-α smoothing with a very small α. Trigrams are used throughout this implementation. The model is finally stored in a file, for which Python’s built-in serialisation is used.
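A sketch of the trigram estimation with add-α smoothing, padding each document's predicate sequence with two empty-string markers as in the n-grams of Figure 4.4; the value of α and the function names are illustrative.

from collections import Counter

ALPHA = 1e-4   # illustrative small alpha

def trigram_probabilities(annotation_lists, vocabulary):
    # MLE trigram probabilities over predicate sequences, with add-alpha smoothing.
    tri, bi = Counter(), Counter()
    for preds in annotation_lists:
        seq = ["", ""] + list(preds)          # pad with "beginning of document" markers
        for i in range(2, len(seq)):
            tri[(seq[i - 2], seq[i - 1], seq[i])] += 1
            bi[(seq[i - 2], seq[i - 1])] += 1
    V = len(vocabulary)
    def prob(w1, w2, w3):
        return (tri[(w1, w2, w3)] + ALPHA) / (bi[(w1, w2)] + ALPHA * V)
    return prob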

Class: yago:GermanComposers

Templates:
(1) [foaf:name] ([dbont:birthDate] -- [dbont:deathDate]) was a [purl:description].
(2) [foaf:name] (born [dbont:birthDate] [dbont:birthPlace] ; died [dbont:deathDate] [dbont:deathPlace]) was a [rdf:type].

Pools:
(1) {foaf:name, rdfs:label, dbprop:name, dbprop:caption, dbprop:cname}
(2) {rdf:type, purl:description, dbprop:shortDescription}
(3) {dbprop:dateOfDeath, dbprop:deathDate, dbont:deathDate}
(4) {dbprop:dateOfBirth, dbprop:birthDate, dbont:birthDate}
(5) {dbont:knownFor}

n-grams:
("", "", foaf:name)
("", foaf:name, dbont:birthDate)
(dbont:birthDate, dbont:deathDate, purl:description)
(dbont:deathDate, purl:description, dbont:knownFor)

Figure 4.4 Example trained model for yago:GermanComposers

The resulting output of the training pipeline is not just one class model, but a model collection, as for every entity a number of class models may be created or updated.


Chapter 5

Implementation: Generation

The generation algorithm described here takes as input a collection of trained class models as defined in the previous chapter, and the URI of an entity for which to generate a description article.

5.1 Retrieve RDF triples

Values are retrieved from the SPARQL endpoint for each predicate pool saved in the model. A pool may contain any number of different predicates, but since these are considered completely equivalent, only the values of one of them are retrieved and saved for the whole pool: a query is made for each predicate in turn until values are returned, and these are then stored for the whole pool.

5.2 Choose best class for entity

The first step of this procedure is identical to that detailed in section 4.4, with the difference that there is no article text to be considered, so nothing is added to the vector of words and step 2 (as defined in 4.4) is omitted. This generates an n-best list (experimentally, n=5) of classes. For each of these classes, if a model is found in the model collection, a score is computed for this model. This score is the number of pools in the model that would get instantiated through the sentence templates available in the model. Only pools for which there are values in the triples for the entity are considered. For example, from the n-best list for :Johann_Sebastian_Bach from Table 4.1, if two models were available such that:

Model                                                yago:GermanComposers   yago:ComposersForCello
Number of templates                                  2                      3
Pools for which there are values in the data         7                      6
Pools that would be instantiated through templates   5                      6

The chosen model would be yago:ComposersForCello, even though it received a lower score in the first step, because a higher number of values would (potentially) be expressed through templates. This does not take into account the fact that, due to the constraint on the number of uses of property values, not all of these templates might be instantiated.

The motivation behind this choice is that an extracted sentence template is expected to generate higher quality text, so a model instantiating more predicates through extracted templates is preferred. This is especially important for the subjective human evaluation conducted as part of this project (see Chapter 7 for details).


5.3 Chart generation

We use chart generation: all sentence templates in the model for which there are enough triples in the data are put on a chart and combinations of them are generated. The following steps are taken:

1. For each template t, where S(t)_i is the ith slot in it, we discard templates for which there is no value for some S(t)_i in the set of retrieved property values V: every template t must have a value in V for every one of its slots.

2. For each pool in the model, a simple sentence template is generated in exactly the same way as for the baseline and added to the chart. This is done in order to deal with the situation where pools (spotted properties in the training text) would not be expressed for a lack of a template expressing them.

5.4 Viterbi generation

We now need to select and order sentence templates from the chart to produce a combination. Ideally, we would want to find the combination of sentences that expresses all the values of the pools in the model, yet uses as many extracted templates and as few simple generated ones as possible.

In order to deal with the combinatorial explosion, instead of an "overgenerate and rank" approach, we apply the Viterbi criterion (Jurafsky & Martin, 2009). This means that we compute scores for all the options at every step, select the one with the highest score and discard all the others, thus only ever keeping one possible combination. This is not guaranteed to be the optimal solution to the requirements outlined above, but it is a satisfactory trade-off between quality and speed and keeps the algorithm simple and the generation running in polynomial time. The computational complexity of the algorithm presented in Figure 5.1 is O(n^2 log n).

used_ngram_list = null predicate (beginning of document)
combination = new list of sentence templates
(1) Do while len(candidates) < len(chart):
    considered = new list of templates
    (2) Do for each template in chart:
        If template not in combination:
            If template does not require more uses of pools than allowed 19:
                Compute n-gram score of template using only the first pool
                Add to score a tenth of the number of pools used by template 20
                Add sentence with score to considered list
    If no templates were considered, exit loop (1)
    Take the template with the highest score, add it to combination
    Increase the counter of times used of each pool used by template
    Add all pools used to used_ngram_list

Figure 5.1 Pseudocode for the Viterbi generation algorithm

19 One pool does not have to satisfy this constraint; this is the one representing the name of the entity being described, “$self”, clearly identified based on the fact that it contains rdfs:label.

20 This has the effect that, where more than one template would have the same n-gram score, the one using more pools (i.e. the longest one) is selected.


To illustrate this algorithm, consider we are generating an article for the entity :Woody_Woodpecker using the following example model (Figure 5.2), where two templates were extracted from text. Consider also that in the input data there are only values available for pools (1) to (4), so pool (5) has no value.

Class: yago:FictionalAnthropomorphicCharacters

Templates:
(1) [foaf:name] is a [rdf:type] created by [dbprop:creator].
(2) [foaf:name] first appeared in [dbprop:first].

Pools:
(1) {foaf:name, rdfs:label, dbprop:name} = "Woody Woodpecker"
(2) {rdf:type} = "fictional anthropomorphic characters"
(3) {dbprop:creator} = "Walter Lantz", "Ben Hardaway", "Alex Lovy"
(4) {dbprop:significantother} = "Winnie Woodpecker"
(5) {dbprop:first} = *empty*

Figure 5.2 Example model for generation

Having selected the model, the templates for which the required values are available are put on a chart (Figure 5.3). Here, template (2) requires values from pool (5), for which no values were found in the RDF triples, so it is not added to the chart. Template (1) fulfils all requirements and is added. Next, simple templates are generated for each of the four pools except pool (1), as this one contains rdfs:label.

(1) [$self] is a [rdf:type] created by [dbprop:creator].
(2) [$self] is a [rdf:type].
(3) [$self-possessive] creator is [dbprop:creator].
(4) [$self-possessive] significant other is [dbprop:significantother].

Figure 5.3 Chart for generation

In the first iteration, none of the pools have been used, which makes all templates on the chart selectable. Considering that the stored n-gram probabilities have rdf:type as the most likely predicate to follow the null property (beginning of document), and given that only the first property expressed is considered when computing the n-gram score, both (1) and (2) would have the same score. However, given the formula adds to this score a value proportional to the number of properties that would be instantiated by the template, template (1) is chosen and added to the final combination. This has two pools marked as used once: pool (2) and (3). This still leaves one pool that does not refer to $self to be expressed: number (4).

In the second iteration, templates (1), (2) and (3) cannot be selected as candidates to follow in the combination, as they require properties that have already been used once. Only template (4) is available for selection, so independently of its score it will be added next. The final template combination is then (1,4).

5.5 Filling the slots

For every template in the combination created in the step above, we must select values for its slots. For slots that refer to $self, the title entity, LOD-DEF implements a very simple Referring Expression Generation algorithm, similar to the baseline described in section 3.4.

The initial reference to an entity is its foaf:name or rdfs:label. For classes which have been observed to be referred to using a singular pronoun with grammatical gender ("he" and "she"), as was done for the baseline, the system specifically retrieves the value of foaf:gender for the


entity whose description is being generated and chooses the right pronoun based on it. This objective would ideally be attained by performing inference on the classes the entity belongs to or by checking with the SPARQL endpoint whether the entity is of class “Person” (using any of the available URIs identifying a person, e.g. foaf:Person). However, due to the experience dealing with remote data as detailed in 6.1, in the implementation we trust the text rather than the data. If no foaf:gender value is available for an entity for which “he” and/or “she” referring pronouns were observed in training, the fallback gender is the most frequent one observed during training.

For all other slots, if only one value is available for the pool, it is rendered depending on its type (e.g. dates are formatted from 1066-10-14 to "14 October 1066"). Finally, a number of regular expressions help keep the output grammatically correct, by adjusting spaces between punctuation tokens, changing the article "a" to "an" before a word starting with a vowel, etc. Continuing the previous example, the resulting output is:

Woody Woodpecker is a fictional anthropomorphic character created by Walter Lantz, Ben Hardaway and Alex Lovy. His significantother is Winnie Woodpecker.
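The post-editing steps mentioned above (date rendering and the a/an adjustment) might be sketched as follows; both are rough heuristics, not the actual regular expressions used.

import re

MONTH_NAMES = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]

def render_date(iso_date):
    # "1066-10-14" -> "14 October 1066"
    year, month, day = iso_date.split("-")
    return "{} {} {}".format(int(day), MONTH_NAMES[int(month) - 1], int(year))

def fix_articles(text):
    # Change the article "a" to "an" before a word starting with a vowel.
    return re.sub(r"\ba(?= [aeiouAEIOU])", "an", text)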


Chapter 6

Experiments

6.1 Problems with the data

Months of testing support the conclusion that a great degree of caution must be exercised when relying on DBpedia data. To begin with, as mentioned before, the schema is rather unreliable. Redundancy is high, as very often several properties with the same meaning are provided (e.g. dbprop:birthPlace, dbprop:placeOfBirth and dbont:birthPlace all have the same meaning). These properties are meant to have owl:sameAs links to identify them as equal, yet when these triples exist they always point back to the same URI (e.g. dbprop:dateOfBirth owl:sameAs dbprop:dateOfBirth). As detailed in Chapter 3 and Chapter 4, this is addressed by the LOD-DEF system by learning pools of equivalent predicates.

Further, it is remarkable that the rdf:type properties on DBpedia link to supposed class URIs with incorrect spellings like "SpanishFootballCluBs", "CarManuFACturers" and "BaroquEComposers". It remains unclear what the reasons behind these spellings are, but these are clearly errors in the data, as there are no triples in the triplestore with these URIs as subjects. Their correctly-spelled counterparts do have triples, e.g. yago:CarManufacturers rdfs:label "Car manufacturers"@en.

6.2 Performance of the system

Evaluation of the system’s performance is somewhat problematic due to the cumulative error rate introduced by the amount of interdependent modules in the architecture pipeline. Each stage depends on output from the previous stage, so for instance an error at the spotting stage is sure to impact the extraction of a sentence template.

I manually evaluate here two main aspects of the system: the success of the template extraction process and the class selection algorithm. For this, the training pipeline was run to train a single model collection for the classes in Table 6.1, for a maximum of 30 entities of each class. These classes are the same ones used for human evaluation (see Chapter 7), although the entities the model was trained on need not be the same ones. Other aspects, such as spotting performance, are evaluated through examples and critical discussion.

yago:EnglishComedians

yago:CarManuFACturers

yago:AmericanPopSingers

yago:AfricanCountries

yago:FictionalAnthropomorphicCharacters

yago:SpanishFootballCluBs

yago:ArgentineFootballers

yago:SingingCompetitions

yago:SpeedMetalMusicalGroups

yago:GermanComposers

yago:CapitalsInEurope

dbont:TelevisionShow

Table 6.1 Classes used for testing


6.2.1 Spotting performance

As only a gazetteer is applied in this baseline, the performance of the spotting depends exclusively on the extent to which the value literals in the RDF triples mirror those found in the text. In the case of spelling differences or naming inconsistencies, the spotting will fail.

There are inconsistencies in the data, such as the spelling of names across languages. For example, for a single entity we find "George Frideric Handel" in the triple values and "George Fredrick Handel (German: Georg Friedrich Händel)" in its associated article text. Note that neither of the two spellings found in the text can be matched to the one in the triple values.

Another example is the category name "Argentine Footballers". This string is seldom spotted in the text, as the surface realisation is "football players". The two terms are synonymous, and ideally the system should be able to determine that the surface realisation of "footballer" is "football player". Clearly, then, this task requires a more sophisticated approach than literal string matching. DBpedia provides a lexicalisation dataset which can be applied to this task and is indeed used by DBpedia Spotlight (Mendes & Jakob, 2011). Another option for a robust NER solution is OpenCalais (Butuc, 2009).

6.2.2 Parser performance

Although the parser introduces a significant error rate into the system due to inconsistencies in nesting constituents, I do not directly evaluate its performance here, lacking a gold standard to compare against for this specific domain. It should however be noted that a different PCFG model could help improve performance. Also, it was noted before that other approaches to sentence compression use dependency parsing. This was tested for the present project but was deemed unsuitable due to the low accuracy of the output from the parsers tested. A different dependency parser might perhaps be better suited to the task.

6.2.3 Class selection performance

The class selection algorithm was manually evaluated, by comparing the n-best classes identified by the algorithm with the first line in the full English Wikipedia article for that entity.

The criteria adopted were as follows. Consider two sets, A and B, where A is the set of classes mentioned in the first description sentence in the text, and B is the set of n-best classes chosen by the class selection algorithm. The criteria for establishing matches between these sets are shown in Table 6.2.

Match type      Criterion
No match        No element of A is equivalent to an element in B
Partial match   At least one element in A is equivalent to an element in B
Correct match   A_1 (the first class mentioned in the text) is also in B

Table 6.2 Match criteria

For this testing, n was set dynamically depending on the number of classes available for an entity. For entities belonging to 9 or more classes, n is set to 5. For entities with fewer than 9 classes, n is set to half the number of classes, rounded up.

To give an example, for the entity :Woody_Woodpecker, the best class as chosen by the algorithm is yago:FictionalAnthropomorphicCharacters, whereas the article text begins "Woody Woodpecker


is an animated cartoon character, an anthropomorphic acorn woodpecker”. This is judged to be a correct match, although “animated” and “cartoon” do not appear in the class name.

Similarly, for an entity of class yago:CapitalsInEurope, where no class with “city” or “capital” in the name is available, yago:PopulatedPlace is accepted as equivalent of “city”.

For testing, a set C is also defined: B is the set of n-best classes chosen directly from the triples, and C is the set chosen after adding the first sentence from the article text to the bag of words.

 

                  Set B (n-best)   Set C (n-best with text)
Entities tested   95               95
No match          1                1
Partial match     0                0
Correct match     94               94

Table 6.3 Results of class selection evaluation

The results of the evaluation suggest this is a robust algorithm, with almost a 100% correct match rate as defined above. This evaluation is admittedly dependent on subjective interpretation of the meaning and overlapping of class names, so further testing, refining of criteria and ideally evaluation by other humans (like the one in Chapter 7) should be undertaken.

The one entity for which no match was found was :Life_in_Hell. Revealingly, it is said to be of class yago:FictionalAnthropomorphicCharacters, yet the Wikipedia states "Life in Hell was a weekly comic strip […] The strip features anthropomorphic rabbits and a gay couple." While the entity clearly contains fictional characters, it was decided that it should be of class "comic strip" or equivalent.

6.2.4 Template extraction

The templates extracted were manually judged on transferability (Y/N) and on their grammaticality. Grammaticality was judged according to the criteria outlined in Table 6.4.

Scores   Meaning
5        Perfectly grammatical
4        Minor punctuation defects (e.g. stranded commas)
3        Missing determiner or stranded conjunction
1-2      Lack of verb or no meaning

Table 6.4 Grammaticality scores and criteria

Intuitively there are also different degrees of non-transferability, but this was not taken into account here. The judging was binary: if a template was not perfectly transferable, it was judged not transferable at all.


Item                                   total
Processed articles                     268
Sentences considered for extraction    199
Discarded during pruning               98 (49%)
Discarded after pruning (filtered)     26 (13%)
Extracted templates                    74 (37%)
Non-transferable                       14 (19%)
Transferable templates                 60 (81%)
Average grammaticality                 4.15
5-stars grammaticality                 34
4-stars grammaticality                 9
Final accuracy                         43/74 (58%)

Table 6.5 Extracted templates statistics

For the purposes of evaluation, here I adopt as the final performance metric (accuracy) of the template extraction process the percentage of final extracted templates that are both transferable and have a grammaticality score of 4 or 5. As Table 6.5 shows, of a total of 74 extracted templates, 60 (81%) are transferable, of which 43 (58% of total extracted templates) have a rating of 4 or 5 on grammaticality.

Note here that of the 14 non-transferable templates reported, 3 were purged in the post- processing stage because of conflicting predicates in their slots. However, this is not taken into account here, as this is an independent step with a different purpose.

The final accuracy metric can clearly be improved on, and one way of doing so is to refine the pruning rules. The development of these rules did not follow a data-driven approach, but was based on first principles of Context-Free Grammar and on a cursory examination of the data. It only became apparent during evaluation that these rules were not sufficient to ensure the grammaticality and transferability of the extracted templates.

6.2.5 Examples of errors in output

"Casablanca of Morocco is Rabat." Here the spotting failed mainly due to the low quality of the data. During training, the title entity's dbprop:largestCity property had the value "capital" as a string literal. This prompted the extraction of the template [dbprop:largestCity] of [dbprop:commonName] is [dbont:capital]. This property has no rdfs:range specified in the schema, which means it can take any value. This is unfortunate, as it could be argued that its values should be of type City, and a string literal like "capital" here is of little use and adds noise to the data.

“Her active is 1981.” What this means is that the person who is the title entity has been active since 1981, but the rdfs:label for this does not say so.

“Although Hyundai Motor Company started in Public.” Here, the pruning rules were clearly not enough to make this sentence grammatical. Either the “although” should have been removed or the whole template dropped.

[foaf:surname] is married to [rdf:type] [dbprop:spouse]. This template for EnglishComedians may well happen to be true once instantiated in text, but only if the spouse of the title entity is of the same type (i.e. EnglishComedians). This means that the sentence is not transferable and should be identified as such and discarded.

"Mercyful Fate is a Speed metal musical group from, Denmark and Copenhagen." This sentence is mildly ungrammatical due to a stray comma. While "Denmark and Copenhagen" is an odd combination, it is due to the generation algorithm, not to the extraction.


Chapter 7

Evaluation

7.1 Approach

Given the exploratory nature of this project, the evaluation relies on human evaluation of the system's output, judged under equal conditions against output from two other systems: the baseline described in section 3.4 and expert human output. I adopt a two-panel (i.e. two separate groups of subjects) approach to compare the three generation systems, very similar to the evaluation undertaken by Sun & Mellish (2007). Humans in Panel A generate descriptions of the same 12 entities, and humans in Panel B rate the different outputs of System A (baseline), System B (LOD-DEF) and System C (human generation) across a number of dimensions.

The hypothesis is that LOD-DEF will be rated higher on average in human evaluation than a system generating exclusively from English words in RDF predicates. For comparison with an upper bound, the system is also ranked against human-generated text for the same data. Human-generated text need not always be an upper bound in subjective evaluation, but given the simplicity of the two NLG systems, this is the hypothesis here.

Given the relatedness of the present approach with automatic summarisation, three of the criteria for evaluation used by the Document Understanding Conference 2007 were found to be very appropriate for the task at hand. The texts are rated on grammaticality, non-redundancy, and structure and coherence. No direct evaluation of content determination is carried out: here it is evaluated implicitly through the dimension of “non-redundancy”, given that its main effect in this implementation is filtering out redundant and unnecessary information.

7.2 Selection of data

Classes used for evaluation were not chosen at random. Given that one of the aims was to evaluate the effect of the sentence templates as opposed to the baseline, I purposefully applied a bias towards classes for which a higher number of templates were extracted and more properties were spotted in text, which correlates with classes for which more factual information (strings and quantities) was available on the DBpedia 21 . This aimed to ensure richer output was generated by the LOD-DEF system, to allow for a more meaningful rating from human judges and so that we can evaluate the performance of the system at document structuring.

Within these constraints an attempt was made to select classes as varied as possible. While the final test set contains four instances of subclasses of "Person", these are markedly different kinds of person, with several different RDF properties. This proportion (roughly 30%) is also close to the proportion of entities of type "Person" available in the consistent DBpedia ontology, approximately 23% (Mendes et al., 2012).

21 This also correlates with subclasses of Person.


Subjects: 1, 2

Entities and classes:
Jennifer Jane Saunders (English Comedians)
Hyundai Motor Company (Car Manufacturers)
Nicole Scherzinger (American Pop Singers)
Fernando Gago (Argentine Footballers)
American Idol (Singing Competitions)
Mercyful Fate (Speed Metal Musical Groups)
William Herschel (German Composers)
Belgrade (Capitals in Europe)
Winx Club (Television Shows)
Morocco (African Countries)
Woody Woodpecker (Fictional Characters)
Real Zaragoza (Football Clubs)

Table 7.1 Entities chosen for evaluation and subject generating each

For each class, the aim was to select a lesser known instance and so prevent the subjects’ adding extraneous information to the output. For example, for “Fictional Character”, instead of “Mickey Mouse”, “Woody Woodpecker” was chosen, still a widely known character but arguably one that is less heavy with associations.

During the development of the LOD-DEF system, the development set of entities against which I adjusted the several subcomponents was formed mostly of instances of yago:GermanComposers. The template extraction system was adjusted in order to extract more grammatical sentences from this set, which was also included in evaluation.

William Herschel is not best known for being a German composer, but rather an astronomer. Although he was both, this is an instance where the algorithm failed due to the data available. Given that at the time of organising the survey this was not known, the survey reflects it as such. An important observation, however, is that he is best known for discovering the planet Uranus; in the context of the article, as this is not specified, it could be assumed to be a piece of music.

7.3 Human generation

Panel A is given triples related to the chosen entities and instructions on how to proceed. Panel A is formed by two native speakers of English, both of them linguistics postgraduate students. Their task is to write summary descriptions of the entities the data is about by expressing as much of this data as possible in text.

The triples are grouped by the entity they relate to, one entity on each page. The information is printed in a human-friendly format, where the rdfs:label is retrieved for every predicate, followed by the equal sign and a list of n values, which are all the values the property has. If a value is a URI, the rdfs:label for that URI is retrieved and printed instead. Otherwise the value is presented with its literal value. For example, "birth date = 1958-07-06", "place of birth = Sleaford, Lincolnshire, England". The full instructions used for this experiment can be found in Appendix A.

Triples given to Panel A were selected from the same ones identified by the LOD-DEF system as pools. Triples were then curated and filtered by hand to further remove redundancy and to randomize the order in which they are presented. As I have already pointed out, much factual information is encoded in Wikipedia categories, and thus in the names of YAGO classes. For this reason, only one class is included in the triples, which I manually picked as the one intuitively and subjectively considered more representative from the available rdf:type triples.


I avoided giving the subjects examples of what kind of output was expected, thus taking care not to prime them. I did include an example of generating from one triple as a warning to avoid including extraneous information.

7.4 LOD-DEF generation

For the training of the LOD-DEF system, a separate model collection was trained on a maximum of 30 entities of each of the manually chosen classes (see Table 7.1). These entities were taken in the order returned by the SPARQL endpoint, where articles on the Simple English Wikipedia were available for them.

7.5 Human rating

Subjects were asked to complete an online survey. For this survey, the same 12 entities (Table 7.1) were described by the three systems, which produced 36 short texts, rated by 25 subjects. The participants self-identified as having an upper-intermediate or higher level of English. The texts were presented to the subjects in pseudo-random order, to avoid texts about the same entity occurring within a page of each other (four texts were presented on every page). This avoided direct side-by-side comparison. For each text, each subject was asked to rate it on a scale of 1 (lowest) to 5 (highest) on the following three criteria, adapted from the DUC 2007 criteria 22 : grammaticality, non-redundancy, and structure and coherence. For the full description of these criteria and the instructions given to the subjects, see Appendix B.

It was not disclosed to the subjects until the end of the experiment that humans generated the texts of one of the systems being tested.

7.6 Results

An exploratory analysis of the data collected showed clear differences between the mean ratings of the three systems (Table 7.2). To establish the significance of these differences, I conducted a One-Way ANOVA (as opposed to paired-rank t-tests, to adjust for the comparisons made) for each of the three criteria the texts were rated on. All three ANOVAs were statistically significant: grammaticality (F(2,72)=119.001, p < 0.001), non-redundancy (F(2,72)=129.053, p < 0.001) and structure and coherence (F(2,72)=129.053, p < 0.001). I conducted Tukey's post-hoc test to establish which comparisons were significant for each; Table 7.3, Table 7.4 and Table 7.5 show the differences in means and the results of the Tukey tests.

For the ratings on structure and coherence, three outliers were found to affect the normality of the distribution of System C's scores. The outliers were removed, the assumption of normality then held (as reported by the Shapiro-Wilk test), and the ANOVA and Tukey test were run both with and without the outliers. The same main effects were found in both models, so I report the main effects with the outliers included.

System         Grammaticality   Non-redundancy   Structure and coherence
A (baseline)   2.29             1.89             1.95
B (LOD-DEF)    2.58             3.03             2.70
C (humans)     4.48             4.66             4.49

Table 7.2 Means

22 http://www-nlpir.nist.gov/projects/duc/duc2007/quality-questions.txt

 

Baseline vs. LOD-DEF   Grammaticality   Non-redundancy   Structure and coherence
Difference             0.29             1.14             0.75
Significance           p = 0.151        p < 0.001        p < 0.001
Significant            No               Yes              Yes

Table 7.3 Differences and significance

LOD-DEF vs. Humans   Grammaticality   Non-redundancy   Structure and coherence
Difference           1.14             1.63             1.79
Significance         p < 0.001        p < 0.001        p < 0.001
Significant          Yes              Yes              Yes

Table 7.4 Differences and significance

Humans vs. Baseline   Grammaticality   Non-redundancy   Structure and coherence
Difference            2.19             2.77             2.54
Significance          p < 0.001        p < 0.001        p < 0.001
Significant           Yes              Yes              Yes

Table 7.5 Differences and significance

7.7 Discussion

As expected, expert human generation is an upper bound in this evaluation, being consistently superior to the other two systems tested. LOD-DEF does not improve on the perceived grammaticality of the baseline, but it does significantly outperform the baseline on non-redundancy and on structure and coherence.

The difference between the average score of humans and that of the baseline is at its lowest for grammaticality, as the output of both the baseline and LOD-DEF was judged surprisingly high on this criterion. LOD-DEF scored very slightly higher than the baseline (a difference of 0.29 in means), but this is not statistically significant (p = 0.151). The largest improvement of LOD-DEF over the baseline is on the non-redundancy metric, with a difference of 1.14.

The fact that, in spite of the simple approach taken and the many errors in output (as discussed in the previous chapter), LOD-DEF still significantly outperforms the baseline on both non- redundancy and structure and coherence is very encouraging. These results suggest that automatic training of NLG systems is a promising approach that should be pursued further.


Chapter 8

Conclusion and future work

8.1 Conclusion

This project has focussed on describing, implementing and testing a trainable shallow Natural Language Generation system for factual Linked Open Data based on the extraction of sentence templates and document planning via content n-grams.

The main contributions of this work are:

Describing a full architecture for this system, including both the training and generation stages. To my knowledge this system in its whole represents a new approach to trainable NLG, never before tried in its entirety.

Building a baseline implementation of this architecture, the LOD-DEF system, and both evaluating its performance at template extraction and class selection, and conducting human evaluation of this system against a baseline and human-generated output.

Showing that even an exceedingly simple system such as LOD-DEF is rated significantly higher than the baseline in human evaluation. In essence, this project shows that the approach is promising and should be pursued further.

Judged against the criteria outlined in Chapter 3, it is clear that much could be improved. Most importantly, more work is needed on template extraction: with little extra effort, the system could improve on its current performance, in which only 58% of extracted templates are both transferable and grammatical.

This project was met with a measure of success. However, were I to start again now, I would approach it in different ways. First, I would perhaps focus on one of the many problems I have tackled, e.g. the class selection algorithm, and investigate it more thoroughly. Second, while building this whole system from the ground up was a highly instructive experience, I would strive to use or adapt an existing architecture.

This three-month project started life as a PhD research proposal. It is immediately apparent that this is but one twelfth of that initial project, and many interesting lines of research had to be abandoned for lack of time and experience. Sourced both from the original proposal and from the findings of this project, in the next section I offer some directions for future work.

8.2 Future work

First, the implementation described here is but a baseline. As I have already suggested, more robust systems exist for every main module of this application: Named Entity Recognition, parsing, coreference resolution, etc.

Within the same shallow approach, substituting more robust modules into the pipeline would surely help improve the results, as the analyses of errors in Chapter 6 show. For the NER task, using an established system like DBpedia Spotlight (Mendes & Jakob, 2011) or OpenCalais (Butuc, 2009) would be a first step, which would also allow integrating inference into the selection of the data to be spotted in text.
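As a pointer for this future step, the sketch below shows one way such an annotation call might look. The endpoint URL, request parameters and response fields are assumptions about the public DBpedia Spotlight demo service, not part of this project's code.

```python
# Hedged sketch: annotating text with a DBpedia Spotlight web service.
# The endpoint URL, parameters and response structure are assumptions
# about the public demo service, not code used in this project.
import json
import urllib.parse
import urllib.request

SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"  # assumed endpoint

def spot_entities(text, confidence=0.5):
    """Return (surface form, DBpedia URI) pairs found in the text."""
    params = urllib.parse.urlencode({"text": text, "confidence": confidence})
    request = urllib.request.Request(
        SPOTLIGHT_URL + "?" + params,
        headers={"Accept": "application/json"})
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [(r.get("@surfaceForm"), r.get("@URI"))
            for r in payload.get("Resources", [])]

if __name__ == "__main__":
    print(spot_entities("Jennifer Saunders was born in Sleaford, Lincolnshire."))
```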

Also, the architecture implemented for this project duplicates readily-available general-purpose architectures, of which the General Architecture for Text Engineering (GATE) (Cunningham et al. 1996) is a prototypical example.

But beyond these improvements on the shallow approach, the crucial steps involve moving towards deeper natural language understanding and, with it, deeper generation. The most sophisticated approaches to document planning use another level of abstraction from the text: discourse relations. The original aim was to automatically extract these relations, for which there exist a number of approaches (e.g. Soricut & Marcu, 2003).

Whether we think of the rhetorical relations in a text as a tree (as in Rhetorical Structure Theory – Mann & Thompson, 1988) or as a graph (e.g. Segmented Discourse Representation Theory – Asher & Lascarides, 2003), it is clear that the structure and coherence of a text are more than just a succession of properties and their values.

This move would probably have to be accompanied by the application of relation extraction techniques, ideally informed by a deeper understanding of the argument structure of predicates in natural language text, that is, what arguments verbs take and their thematic roles. The FrameNet and VerbNet projects, coupled with WordNet, are likely to play a role in this (Shi & Mihalcea, 2005).
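As a small illustration of the kind of lexical look-up involved, the sketch below queries WordNet and VerbNet through NLTK for a verb's senses and VerbNet classes. It assumes the relevant NLTK corpora have been downloaded and is only indicative of how such resources could be consulted; it is not part of the system described here.

```python
# Illustration only: looking up a verb's WordNet senses and VerbNet classes,
# the kind of lexical resources that could inform relation extraction.
# Requires the corpora: nltk.download('wordnet'); nltk.download('verbnet')
from nltk.corpus import wordnet as wn, verbnet

verb = "marry"  # e.g. from "Jennifer Saunders married Adrian Edmondson"

# WordNet verb senses give a coarse indication of the predicate's meaning.
for synset in wn.synsets(verb, pos=wn.VERB)[:3]:
    print(synset.name(), "-", synset.definition())

# VerbNet class identifiers group verbs by shared argument structure.
print(verbnet.classids(lemma=verb))
```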

These steps would allow us to move towards the automatic learning of rules for deeper generation. A number of more general-purpose NLG architectures exist (e.g. NaturalOWL as described before, but also others not specifically targeted to the Semantic Web like OpenCCG – White, 2008).

With better identification of the relations between the spotted entities in text and an understanding of the rhetorical relations between sentences, we could extract full document-planning and aggregation rules, which could be converted for use by one of those systems.

Finally, an interesting problem for Named Entity Recognition is that of vagueness (Klein & Rovatsos, 2011), when dealing, for instance, with large numbers. For example, the population of a country is a figure in the millions, which is often reported in text as "about 30 million people" but appears as an exact number in the data (e.g. 27,543,216).
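A hedged sketch of the matching problem: to spot such vague mentions, the exact figure from the data could be rounded before comparison with the text. The rounding scheme and function names below are only one plausible choice for illustration, not a method proposed in this thesis.

```python
# Sketch: match a vague textual figure ("about 30 million") against an
# exact data value by rounding to a small number of significant digits.
from math import floor, log10

def approximate(value, significant_digits=2):
    """Round a number to the given number of significant digits."""
    if value == 0:
        return 0
    digits = significant_digits - int(floor(log10(abs(value)))) - 1
    return round(value, digits)

def verbalise_millions(value):
    """Render a large figure the way a text might report it."""
    return f"about {approximate(value, 1) / 1_000_000:g} million people"

population = 27_543_216
print(approximate(population))         # 28000000
print(verbalise_millions(population))  # about 30 million people
```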

In brief, the approach presented here does no more than scratch the surface.


Appendix A: Human generation

The following text was given to each of the two subjects. It is followed by the first set of data for generation, as an example.

Description generation

In this experiment you are required to write short descriptions based on some given information.

You will find a block of information at the beginning of each page. This information consists of facts about a real-world entity (a person, a country, etc.) You might have never heard of that person or thing but this does not matter.

Your task is to write a short description of this entity based on the information at the beginning of the page. You are writing for a general audience with no previous knowledge about the entity you are describing (but with knowledge of other entities of the same type, e.g. countries) and you want to get all this information across (think of an article on the English Wikipedia). Use the blank area of each page to write your text. Feel free to copy and paste names and other chunks of text.

Please do not use any other information resources for this (i.e. don't look it up on Google or Wikipedia until you have finished the experiment). It is essential you write the text that seems more natural to you from the given information only.

Please do not include any value judgements (e.g. “the best”, “one of the greatest”, “a very famous”, “the most important”) unless these are present in the information provided.

The information is in random order. You should report it in the order that seems more logical to you in a description.

You can use any format you prefer for dates, numbers and other amounts. You can use any grammatical construction and vocabulary.

It is very important that you include in your text no other information and that you use all the information that you can infer from what is given (that is relevant).

Example:

From this information:

name = John date of death = 1666-02-01

You could write:

1. John died in 1666.

2. John died accompanied by his wife and 3 pigs, in a barge that was pushed blazing into

Dunsapie Loch.

3. John died on 1st Feb 1666.


(1) is bad because it omits date information. (2) is bad because it adds extraneous information. (3) is good.

[new page]

Information:

Category: English comedians active = 1981, spouse = Adrian Edmondson, description = British comedienne, place of birth = Sleaford, Lincolnshire, England, birth name = Jennifer Jane Saunders, spouse = Adrian Edmondson, birth date = 1958-07-06, notable work = Various in French & Saunders, Edina Monsoon in Absolutely Fabulous, Fairy Godmother in Shrek 2, name = Jennifer Saunders,

Text (write text here):


Appendix B: Human evaluation

Hello! This is an evaluation questionnaire that compares output from 3 Natural Language Generation systems, that is, software that takes data and outputs text in English. Two of these I have created myself.

There are 36 very short text snippets in this questionnaire, generated directly from information on the Semantic Web. It should take you about 25 minutes.

All you need to do is rate each text from 1 (very poor) to 5 (very good) on these measures:

Grammaticality
Non-redundancy
Structure and Coherence.

Don't worry, these are all explained at the top of each page. NOTE: I assume that your level of English is upper-intermediate or above. Let's go!

These are the criteria for rating the texts, please take a moment to read them. They appear again at the beginning of every page.

Grammaticality The text should have no system-internal formatting, capitalization errors or obviously ungrammatical sentences (e.g., fragments, missing components) that make the text difficult to read.

Non-redundancy There should be no unnecessary repetition in the text. Unnecessary repetition might take the form of whole sentences that are repeated, or repeated facts, or the repeated use of a noun or noun phrase (e.g., "Bill Clinton") when a pronoun ("he") would suffice.

Structure and Coherence The text should be well-structured and well-organized.

1. Very Poor

2. Poor

3. Barely Acceptable

4. Good

5. Very Good


Full text generated by systems B and C, and example of A

System A (baseline)

1 Jennifer Saunders is an English television actor. Her birth date is 6 July 1958. Her description is British comedienne. Her spouse is Adrian Edmondson. Her genres are Comedy and Parody. Her caption is Saunders in November 2008. Her birth name is Jennifer Jane Saunders. Her wordnet type is synset-actor-noun-1. Her nationality is British people. Her medium is Television, film. Her short description is British comedienne. Her place of birth is Sleaford, Lincolnshire, England. She is a primary topic of : Jennifer Saunders. Her page is Jennifer Saunders. Her notable works are Edina Monsoon in Absolutely Fabulous, Various in French & Saunders and Fairy Godmother in Shrek 2. Her name is Jennifer Saunders. She has a photo collection : Jennifer Saunders. Her label is Jennifer Saunders. Her given name is Jennifer. Her surname is Saunders. Her birth places are Sleaford and Lincolnshire.

4 Morocco is an Arab LeaguE member state. Its cctld is .ma. Its geometry is POINT(-6.85 34.0333). Its area total (km2)s are 446550.0 and 446739.2791875256. Its sovereignty types are Monarchy and Independence. Its demonym is Moroccan. Its time zone is Western European Time. Its lats are 34.0333 and 32.0. Its established events are from France, Alaouite dynasty, Mauretania and from Spain. Its percentage of area water is 250.0. Its leader names are Abdelillah Benkirane, Abdelilah Benkirane and Mohammed VI of Morocco. Its points are 34.03333333333333 -6.85 and 32.0 -6.0. Its gdp ppp is 1.62617E11. Its image maps are 29.0 and Morocco on the globe .svg. Its official languagess are Berber, Arabic and Arabic language. Its government types are Parliamentary system, Constitutional Monarchy and Unitary state. Its leader titles are Prime Minister of Morocco, List of heads of government of Morocco, List of rulers of Morocco and King of Morocco. Its currency is Moroccan dirham. Its conventional long name is Kingdom of Morocco. Its legislature is Parliament of Morocco. Its national anthem is "Cherifian Anthem".

Its longs are -6.85 and -6.0. Its percent waters are 250.0. Its languages type is Native languages. Its time zone dst is Western European Summer Time. It has a photo collection : Morocco. Its page is Morocco. Its north is Mediterranean Sea. Its anthem is Cherifian Anthem. Its homepage is http://www.maroc.ma/PortailInst/An/. Its longname is Kingdom of Morocco. Its established dates are 7 April 1956, 1666, 2 March 1956 and 110. Its founding dates are 7 April 1956 and 2 March 1956. Its capital is Rabat. Its largest city is Casablanca. Its lower house is Assembly of Representatives of Morocco. Its ethnic groups are North African Arabs, Berber people and Berber Jews. Its gdp nominal is 9.9241E10. It is a primary topic of : Morocco. Its gdp ppp per capita is 5052.0. Its longew is W. Its drives on is right. Its common name is Morocco. Its languages is Berber, Moroccan Arabic, Hassaniya. Its southwest is Atlantic Ocean. Its area footnote is or 710,850 km2. Its population density (/sqkm)s are 71.622 and 71.6. Its official languages are Arabic language and Berber languages. Its latns is N. Its gdp nominal per capita is 3083.0. Its calling codes are Telephone numbers in Morocco and %2B212. Its hdi category is medium. Its northeast is Mediterranean Sea. Its label is Morocco. Its languages are Hassaniya language, Moroccan Arabic, Arabic language and Berber languages. Its titles are Languages, Geographic locale and International membership. Its currency code is MAD. Its national motto is "God, Homeland, King". Its mottoes are "God, Homeland, King", (Berber) and (Arabic). Its northwest is Atlantic Ocean. Its name is Morocco. Its west is Atlantic Ocean. Its upper house is Assembly of Councillors.

5 Woody Woodpecker is a Fictional anthropomorphic character. His last appearance is The New Woody Woodpecker Show. His portrayers are Kent Rogers, Grace Stafford, Ben Hardaway, Daniel Webb, Mel Blanc, Cherry Davis, Danny Webb and Billy West. His families are Splinter and Knothead and Scrooge Woodpecker. His creators are Walter Lantz, Ben Hardaway and Alex Lovy. His significantother is Winnie Woodpecker. His caption is 1951.0. He is a primary topic of : Woody Woodpecker. His species is Woodpecker. His last is I Know What You Did Last Night. His first is Knock Knock. His gender is Male. He has a photo collection : Woody Woodpecker. His labels are Woody Woodpecker. His page is Woody Woodpecker. His first appearance is Knock Knock (1940 cartoon). His name is Woody Woodpecker. His homepage is www.woodywoodpecker.com.

System B (LOD-DEF)

1 Jennifer Saunders (6 July 1958, Sleaford and Lincolnshire) is a British comedienne. Her spouse is Adrian Edmondson. Her active is 1981. Her place of birth is Sleaford, Lincolnshire, England. Her notable works are Edina Monsoon in Absolutely Fabulous, Various in French & Saunders and Fairy Godmother in Shrek 2. Her nationality is British people.

2 Hyundai Motor Company is a Car manuFACturer. Hyundai Motor Company started on 29 December 1967. Although Hyundai Motor Company started in Public. Its parent company is Hyundai Motor Group. Its founded by is Hyundai Motor Company. Its location country is South Korea. Its subsid is Hyundai Motor India Limited. Its location cities are Seoul. Its key people is Chung Mong-koo. Its products is Automobiles, commercial vehicles, engines. Its key person is Chung Mong-koo. Its products are Commercial vehicle and Internal combustion engine. Its production is 2943529.

3 Nicole Scherzinger (born 29 June 1978) is an American female singer. Scherzinger worked in Hawaii and Honolulu. Her labels are Polydor Records, Interscope Records and A&M Records. Her associated musical artists are Days of the New, Pussycat Dolls and Eden's Crush. Her titles are "Jai Ho! ", "Poison" and Dancing with the Stars (US) winner. Her alternative names is Kea, Nicole. Her befores are Donny Osmond and Kym Johnson. Her years is Season 10.

4 Morocco (called as Kingdom of Morocco) is an African country. Casablanca of Morocco is Rabat. Morocco's leader names are Abdelillah Benkirane, Abdelilah Benkirane and Mohammed VI of Morocco. Its west is Atlantic Ocean. Its official languagess are Berber, Arabic and Arabic language. Its northeast is Mediterranean Sea. Its demonym is Moroccan. Its founding dates are 7 April 1956 and 2 March 1956. Its established events are from France, Alaouite dynasty, Mauretania and from Spain. It is an African country. Its demonym is Moroccan. Its established dates are 7 April 1956, 1666, 2 March 1956 and 110. Its leader titles are King and Prime Minister. Its largest city is Casablanca.

5 Woody Woodpecker is a Fictional anthropomorphic character created by Walter Lantz, Ben Hardaway and Alex Lovy. His species is Woodpecker. His first is Knock Knock. His last is I Know What You Did Last Night. His first appearance is Knock Knock (1940 cartoon). His significantother is Winnie Woodpecker.

6 Real Zaragoza's clubname is Real Zaragoza. Its nats are Italy, Portugal, Mexico, ESP, Serbia, ITA, Croatia, BRA, Paraguay, Hungary, Argentina and Spain. Its league is La Liga. Its titles are Inter-Cities Fairs Cup, UEFA Cup Winners' Cup and UEFA Cup Winners%27 Cup. Its founded is 1932. Its fullname is Real Zaragoza, S.A.D. It is a Spanish football cluB.

7 Fernando Gago's teams are Real Madrid C.F., Boca Juniors and Argentina national football team. His clubss are Real Madrid C.F. and Boca Juniors. His birth date is 10 April 1986. His playername is Fernando Gago. His fullname is Fernando Rubén Gago. His currentclub is Real Madrid C.F. His dateofbirth is 10 April 1986. He is an Argentina international footballer.

8 American Idol is a Creative Work run by the 19 Entertainment and FremantleMedia.Its presenters are Brian Dunkleman and Ryan Seacrest. Its judges are Randy Jackson, Ryan Seacrest and Mariah Carey.

9 Mercyful Fate is an Speed metal musical group from, Denmark and Copenhagen. Their former band members are Timi Hansen, Snowy Shaw and Michael Denner. Their associated musical artists are Fate (band), Arch Enemy, Force of Evil (band), Spiritual Beggars, Memento Mori (band), King Diamond (band), Brats (band), Black Rose (band) and Metallica. Their band members are Hank Shermann, Mike Wead, King Diamond and Sharlee D'Angelo. Their record labels are Roadrunner Records, Combat Records, Rave On (record label) and Metal Blade Records. Their labels are Roadrunner Records, Combat Records, Metal Blade Records and Rave On %28record label%29. Their years active is 1981.

10 Friedrich Wilhelm Herschel (15 November 1738 in Holy Roman Empire, Hanover and Electorate of Brunswick- Lüneburg – 25 August 1822 in England, Berkshire and Slough) was a German composer. His known fors are Uranus and Infrared.

11 City of Belgrade is the populated place. Its is a part of : Belgrade%23 Municipalities. Its leader names are Party of United Pensioners of Serbia, Dragan Đilas, Democratic Party (Serbia), Milan Krkobabić and Socialist Party of Serbia. Its population demonym is Belgrader. Its official name is Belgrade. It is a populated place. Its native names are Град Београд, Београд and Beograd.

12 Winx Club is a Nickelodeon, Rai Due, 4Kids TV, 4KidsTV and Rai 2 series made by Alfred R. Kahn, Norman J. Grossfeld and Joanna Lee. Its director is Iginio Straffi. Its first aired is 28 January 2004. It is a Creative Work. Its country is Italy.

System C (humans)

1 Jennifer Jane Saunders (Born 06/07/1958) is an English comedienne, originally from Sleaford, Lincolnshire. Jennifer has been active as a comedienne since 1981 and a selection of her most notable roles include Edina Monsoon in Absolutely Fabulous, the Fairy Godmother in Shrek 2, whilst also appearing in French and Saunders. Her spouse is Adrian Edmondson.

2 Hyundai Motor Company is a South Korean company based in Seoul and is part of the Hyundai Motor Group. The company was founded on the 29th of December, 1967 by Chung Ju-yung. They make various products ranging from automobiles, commerical vehicles and internal combustion engines.

3 Nicole Prescovia Elikolani Valiente (also known as Nicole Scherzinger, Kea, Nicole) was born on the 29th of August 1978, in Honolulu, Hawaii, USA and is a singer from the noughties. She is associated with a variety of different acts including Days of the New, Pussycat Dolls and Eden’s Crush. Record labels that she has been signed to include A&M Records, Polydor Records and Interscope Records.

4 Morocco (or Kingdom of Morocco) is a country that is part of the continent of Africa. The capital city is Rabat and the largest city is Casabalanca. The total area of Morocco is approximately 445739 Km2, with a hdi categorisation of medium. The country has a population density of 186 people per square mile with the official population demonym being Moroccan. It is geographically located with the Mediterranean Sea to the Northeast and the Atlantic Ocean to the Southwest of the country. Arabic and Berber are the officially spoken languages, although Hassaniva is also spoken. The present leaders in the country are King Mohammed VI and the Prime Minister is Abdelilah Benkirane. The modern country was officially established in 1956 on the 7th of April, which marks the independence from the Alouite Dynasty of France. Earlier reports of the country’s establishment relate to 1666, with the event of Mauretania from Spain.

5 Woody Woodpecker is a fictional cartoon woodpecker. Created by Ben Hardaway, Walter Lantz and Alex Lovy, his first appearance was in the 1940 cartoon ‘Knock Knock’. Since then various people have portrayed the character including, Mel Blanc, Billy West, Kent Rogers, Ben Hardaway, Daniel Webb, Grace Stafford, Cherry Davis and Danny Webb. The last appearance which Woody Woodpecker was featured in was ‘I Know What You Did Last Night’. Other related characters include his significant other, Winnie Woodpecker.

6 Real Zaragoza are a Spanish football team, who play in the Spanish league La Liga. Their ground is La Romareda, Aragon in Zaragoza. Founded in 1932, the club have won the Inter-Cities Fairs Cup and the UEFA cup Winners’ Cup. Players for the team come from a variety of different nations including Argentina, Italy, Hungary, Serbia, Croatia, Paraguay, Mexico, Spain, Portugal and Brazil.

7 Fernando Rubén Gago, born on 10 April 1986, is an Argentine footballer. Gago currently plays for the club Real Madrid C.F., as well as for the team Boca Juniors and the Argentine national football team; he has played in four other clubs prior to joining Real Madrid. Gago has thus far scored no goals for his national team.

8 American Idol is a singing competition aired on television by the Fox Broadcasting Company, and produced by FremantleMedia and 19 Entertainment. It was first aired on 11 June 2002. The programme is presented by Brian Dunkleman and Ryan Seacrest, and the panel of judges is composed of Mariah Carey, Randy Jackson, Simon Cowell, Steven Tyler, Ellen DeGeneres, Paula Abdul, Jennifer Lopez and Kara Dio Guardi. Its producers are Shane Drake, Ken Warwick, Bruce Gowers, Nigel Lythgoe, Gregg Gelfand, John Pritchett and Andrew Scheer.

9 Mercyful Fate is a speed metal musical group from Copenhagen, Denmark. It has been associated with the bands Metallica, Arch Enemy, King Diamond, Memento Mori, Brats, Black Rose, Force of Evil, Spiritual Beggars and Fate. The band has been active since 1981. Its current members are King Diamond, Hank Shermann, Sharlee D’Angelo, Mike Wead and Bjarne T. Holm; past members are Snowy Shaw, Michael Denner, Timi Hansen and Kim Ruzz. It has released records on the labels Roadrunner Records, Combat Records, Metal Blade Records and Rave On.

10 William Herschel (born Friedrich Wilhelm Herschel) was a German composer. He was born on 15 November 1738 in Hanover, Electorate of Brunswick-Lüneburg, Holy Roman Empire. Herschel was known for the pieces Uranus and Infrared. He died on 25 August 1822 in Slough, Berkshire, England.

11 Belgrade (officially the City of Belgrade, native name Beograd) is the capital of Serbia. It is a city with an area of 359.96 km2 and forms part of the Belgrade Municipalities. Its City Council is ruled by the Socialist Party of Serbia and the Party of United Pensioners of Serbia; the current Mayor is Milan Krkobabić, and the Deputy Mayor is Dragan Đilas. Belgrade was established prior to 279 BC. The population demonym of Belgrade is Belgrader.

12 Winx Club is an Italian animated television show aired in stereo on the networks 4Kids TV, Rai 2, and Nickelodeon. It is directed by Iginio Straffi and released on 28 January 2004, and has so far run for 104 episodes over four seasons. It is narrated by Joanna Lee, Alfred R. Kahn and Norman J. Grossfeld.


Order of the articles in the survey:

Page   Text number   Generated by system
1      1             C
1      3             B
1      6             B
1      4             A
2      7             C
2      5             A
2      2             C
2      8             B
3      10            A
3      4             B
3      9             C
3      11            C
4      12            A
4      2             B
4      3             A
4      5             B
5      10            C
5      1             A
5      8             C
5      12            B
6      3             C
6      9             B
6      2             A
6      7             B
7      11            A
7      6             C
7      8             A
7      1             B
8      12            C
8      11            B
8      9             A
8      4             C
9      10            B
9      5             C
9      7             A
9      6             A


References

Androutsopoulos, I., Kokkinaki, V., Dimitromanolaki, A., Calder, J., Oberlander, J., Not, E. (2001). Generating Multilingual Personalized Descriptions of Museum Exhibits – The M-PIRO Project. Retrieved from http://arxiv.org/ftp/cs/papers/0110/0110057.pdf

Asher, N. & Lascarides, A. (2003). Logics of Conversation. Studies in Natural Language Processing. Cambridge University Press.

Bechhofer, S., van Harmelen, F., Hendler, J., Horrocks, I., McGuinness, D.L., Patel-Schneider, P. & Stein, L.A. World Wide Web Consortium (W3C). (2004). OWL Web Ontology Language Reference. Retrieved from http://www.w3.org/TR/owl-ref/

Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The Semantic Web. Scientific American. Retrieved from http://campus.fsu.edu/bbcswebdav/users/bstvilia/lis5916metadata/readings/scientific-american_0.pdf

Berners-Lee, T. & Connolly, D. (W3C). (2011). Notation3 (N3): A readable RDF syntax. Retrieved from http://www.w3.org/TeamSubmission/n3/

Bizer, C., Jentzsch, A. & Cyganiak, R. (2011). State of the LOD Cloud. Retrieved from http://www4.wiwiss.fu-berlin.de/lodcloud/state/

Bontcheva, K., & Davis, B. (2009). Natural Language Generation from Ontologies. In J. Davies, M. Grobelnik, & D. Mladenic (Eds.), Semantic Knowledge Management: Integrating Ontology Management, Knowledge Discovery and Human Language Technology (pp. 113–127). Springer.

Brickley, D., & Guha, R.V. (W3C). (2004). RDF Vocabulary Description Language 1.0: RDF Schema. Retrieved from http://www.w3.org/TR/2004/REC-rdf-schema-20040210/

Brickley, D. & Miller, L. (2010). FOAF Vocabulary Specification 0.98. Retrieved from http://xmlns.com/foaf/spec/20100809.html

Busemann, S., & Horacek, H. (1998). A Flexible Shallow Approach to Text Generation, (2945), 10. Computation and Language. Retrieved from http://arxiv.org/abs/cs.CL/9812018

Busemann, S. (2011). Shallow Text Generation. Retrieved from http://www.coli.uni-saarland.de/courses/LT1/2011/slides/shallow-nlg-lecture_WS1112.pdf

Butuc, M.G. (2009). Semantically enriching content using OpenCalais. Retrieved from www.eed.usv.ro/SistemeDistribuite/2009/Butuc1.pdf

Cohn, T., & Lapata, M. (2009). Sentence compression as tree transduction. Journal of Artificial Intelligence , 1–38. Retrieved from http://eprints.pascal-network.org/archive/00005887/

Cunningham, H., Wilks, Y. & Gaizauskas, R.J. (1996). GATE: a general architecture for text engineering. Proceedings of the 16th Conference on Computational Linguistics, Volume 2, pp. 1057–1060.


Cyganiak, R. & Jentzsch, A. (2011). The Linking Open Data cloud diagram. Retrieved from http://lod-cloud.net/

Decker, S., Van Harmelen, F., Broekstra, J., Erdmann, M., Fensel, D., Horrocks, I., Klein, M., Melnik, S. (2000). The Semantic Web: on the respective Roles of XML and RDF. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.81.6109&rep=rep1&type=pdf

Duboue, P. A., & McKeown, K. R. (2003). Statistical acquisition of content selection rules for natural language generation. Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pp. 121–128. Retrieved from http://dl.acm.org/citation.cfm?id=1119371

Feldman, R. & Sanger, J. (2007). The Text Mining Handbook - Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press.

Filippova, K., & Strube, M. (2008). Dependency tree based sentence compression. Proceedings of the Fifth International Natural Language Generation Conference (INLG '08), 25. doi:10.3115/1708322.1708329

Gagnon, M., & Sylva, L. D. (2006). Text Compression by Syntactic Pruning. Advances in Artificial Intelligence, 312–323. Springer.

Galanis, D., & Androutsopoulos, I. (2007). Generating multilingual descriptions from linguistically annotated OWL ontologies: the NaturalOWL system. Proceedings of the Eleventh European Workshop on Natural Language Generation, 143–146. Retrieved from http://dl.acm.org/citation.cfm?id=1610188

Galley, M., Fosler-Lussier, E., & Potamianos, A. (2001). Hybrid natural language generation for spoken dialogue systems. In Proceedings of the 7th European Conference on Speech Communication and Technology (Interspeech-Eurospeech). September 3-7, 2001. Aalborg, Denmark.

Grice, H. P. (1975). Logic and Conversation. In P. Cole & J. L. Morgan (Eds.), Syntax and Semantics, Vol 3: Speech Acts (pp. 43–58). New York, New York, USA: Academic Press.

Heath, T. & Bizer, C., (2011). Linked Data: Evolving the Web into a Global Data Space (1st edition). Synthesis Lectures on the Semantic Web: Theory and Technology, 1:1, 1-136. Morgan & Claypool.

Hewlett, D. Kalyanpur, A. Kolovski, V. Halaschek-Wiener, C. (2005). Effective NL paraphrasing of ontologies on the Semantic Web. In Workshop on End-User Semantic Web Interaction, 4th Int. Semantic Web conference, Galway, Ireland. Retrieved from http://www.mindswap.org/papers/nlpowl.pdf

Jurafsky, D. & Martin, J.H. (2009). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. Prentice Hall: New Jersey.

Kasneci, G., Ramanath, M., Suchanek, F., & Weikum, G. (2008). The YAGO-NAGA Approach to Knowledge Discovery. Retrieved from http://dl.acm.org/citation.cfm?id=1519103.1519110


Klein, D. & Manning, C. (2003). Accurate Unlexicalized Parsing. Proceedings of the 41st Meeting of the Association for Computational Linguistics, pp. 423–430.

Klein, E. and Rovatsos, M. (2011). Temporal vagueness, coordination and communication. In Nouwen, R., Schmitz, H.-C., van Rooij, R., and Sauerland, U., editors, Vagueness in Communication, LNCS. Springer.

Klyne, G., & Carroll, J. (W3C). (2002). Resource Description Framework (RDF): Concepts and Abstract Data Model. Retrieved from http://www.w3.org/TR/2002/WD-rdf-concepts-20020829/

Liang, S.F., Stevens, R., Scott, D. & Rector, A. (2012). OntoVerbal: a Protege plugin for verbalising ontology classes. Proceedings of the Third International Conference on Biomedical Ontology , (ICBO'2012), Graz, Austria.

Mann, W.C. & Thompson, S.A. (1988). Rhetorical structure theory: Toward a functional theory of text organization.

Mendes, P., & Jakob, M. (2011). DBpedia Spotlight: shedding light on the web of documents. Proceedings of the 7th International Conference on Semantic Systems, 1–8. Retrieved from http://dl.acm.org/citation.cfm?id=2063519

Mendes, P., Jakob, M., & Bizer, C. (2012). DBpedia: A Multilingual Cross-Domain Knowledge Base. Retrieved from http://www.wiwiss.fu-berlin.de/en/institute/pwo/bizer/research/publications/Mendes-Jakob-Bizer-DBpedia-LREC2012.pdf

Prud'hommeaux, E., & Seaborne, A. (W3C). (2008). SPARQL Query Language for RDF, Retrieved from http://www.w3.org/TR/rdf-sparql-query/

Reiter, E., & Dale, R. (2000). Building Natural Language Generation systems. Cambridge University Press.

Rosch, E.H. (1973). Natural categories. Cognitive Psychology, 4(3), 328–350. doi:10.1016/0010-0285(73)90017-0

Sarawagi, S. (2008). Information Extraction, 1(3), 261–377. doi:10.1561/150000000

Shi, L. & Mihalcea, R. (2005). Putting Pieces Together: Combining FrameNet, VerbNet and WordNet for Robust Semantic Parsing. Computational Linguistics and Intelligent Text Processing. Lecture Notes in Computer Science, DOI: 10.1007/978-3-540-30586-6_9

Soricut, R., & Marcu, D. (2003). Sentence level discourse parsing using syntactic and lexical information. Proceedings of the 2003 Conference of the North …, (June), 149–156. Retrieved from http://dl.acm.org/citation.cfm?id=1073475

Sripada, S. G., Reiter, E., Hunter, J., & Yu, J. (2003). Generating English summaries of time series data using the Gricean maxims. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03), 187. doi:10.1145/956755.956774

Stevens, R., Malone, J., Williams, S., Power, R., & Third, A. (2011). Automating generation of textual class definitions from OWL to English. Retrieved from http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3102894/

Sun, X. & Mellish, C. (2007). An Experiment on “Free Generation” from Single RDF triples. Retrieved from www.aclweb.org/anthology/W07/W07-2316.pdf


White, M. (2008). OpenCCG Realizer Manual. Documentation of the OpenCCG Realizer. Retrieved from https://svn.kwarc.info/repos/lamapun/lib/LaMaPUn/External/Math-CCG/docs/realizer-manual.pdf
