
Evaluation of M-PIRO system text output

Athanasios N. Karasimos

Master of Science
Speech and Language Processing
Department of Theoretical and Applied Linguistics
University of Edinburgh
2003
Copyright Athanasios N. Karasimos 2003, printed by LaserJet HP6C

Declaration
I hereby declare that this MSc dissertation is of my own composition and that it contains no material previously submitted for the award of any other degree. The work reported in this MSc dissertation has been executed by myself, except where due acknowledgement is made in the text.

(Athanasios N. Karasimos)


Acknowledgements
I wish to express my gratitude to all those without whom this task would have been much harder to accomplish. Above all, thank you, Amy and Colin, for being so patient, helpful, supportive and co-operative, and for being there every time I needed you. You made many things clear and easy with your useful and clever advice. It has been said that the beginning and the end of a dissertation are the supervisors. Many thanks to all the people who participated in both experiments; without their participation the evaluation would have been impossible. I appreciate the time you spent on my experiment. Special thanks to my classmate Behzad, to whom I owe my dissertation's topic; I am grateful for all our useful conversations. Many thanks to Ellen Burk for guiding me out of the statistics labyrinth. Finally, on the Greek front, thanks to Aggeliki, Efi, Stavroula and Stathis for their care and support throughout the whole year. Additionally, special thanks to Alexander Melengoglou, who offered his valuable knowledge and comments. I could not be here now without the strong support and love of my family. Thank you, Anna and Sotiria, for your corrections.


To my mother, my uncle and especially to my sister, Helen, ( ).


Abstract
Half of the problem in Natural Language Generation (NLG) is the evaluation of an NLG system. In the last few decades, research on evaluation has increased and taken some serious steps in this direction. This study describes an evaluation of a multilingual personalized information objects system (M-PIRO), which dynamically generates descriptions for exhibits in a virtual museum exhibition. In the evaluation, the learning outcomes of subjects who read two sets of texts about coins and vessels were compared with those of subjects who read the same text sets with a different text structure. The aim was to show that the text type factors, comparison and aggregation, are essential for better performance. Several types of data were collected through post-session tests of factual recall and a questionnaire about the evaluated system. Results showed that performance measures did differ between subjects in the two conditions (presence and absence of the text type factors); additionally, the data analysis revealed that text difficulty and the subjects' impression of learning were also statistically significant. These issues are all considered in order to determine whether the goal of M-PIRO is achieved and to suggest some improvements to it. The study concludes with an outline of future work.


Contents

Declaration
Acknowledgements
Abstract
Contents
Index of tables, graphs and pictures
1. Introduction
    1.1. Natural Language Generation Systems
    1.2. Evaluating Natural Language Generation Systems
    1.3. Purpose and Outline of the study
2. The M-PIRO NLG System
    2.1. The ILEX NLG System
        2.1.1. The ILEX Dynamic Hypertext System
        2.1.2. The evaluation of the ILEX System: Dynamic vs. Static version
    2.2. The M-PIRO System
        2.2.1. The M-PIRO Domain and Generation Architecture
        2.2.2. The M-PIRO Authoring Tool
3. Aggregation and Comparison in the M-PIRO
    3.1. Aggregation
        3.1.1. What is aggregation?
        3.1.2. The implementation of aggregation in the M-PIRO System
    3.2. Comparison
        3.2.1. What is comparison?
        3.2.2. The implementation of comparison in M-PIRO System
4. The Pilot Experiment
    4.1. Introduction
    4.2. Method
        4.2.1. Designing and choosing the exhibit texts
        4.2.2. Subjects
        4.2.3. Procedure
    4.3. Results and Discussion
5. The Main Experiment
    5.1. Introduction
    5.2. Method
        5.2.1. Designing and choosing the exhibit texts
        5.2.2. Subjects
        5.2.3. Procedure
    5.3. Results
6. General Discussion
    6.1. The results of both experiments
        6.1.1. Interpreting the results
        6.1.2. Ordering effect: a possible flaw in experimental design
    6.2. Suggestions and improvements
    6.3. Future work
    6.4. Conclusion
Bibliography
Appendix I: The M-PIRO generated texts for the Main Experiment
    Coins Text Sequence [English] with comparison and aggregation
    Coins Text Sequence [English] without comparison and aggregation
    Vessels Text Sequence [English] with comparison and aggregation
    Vessels Text Sequence [English] without comparison and aggregation
    Coins Text Sequence [Greek] with comparison and aggregation
    Coins Text Sequence [Greek] without comparison and aggregation
    Vessels Text Sequence [Greek] with comparison and aggregation
    Vessels Text Sequence [Greek] without comparison and aggregation
Appendix II: What did you learn from the virtual exhibition?
    The Questions for the Coins Text Sequence [English]
    The Questions for the Vessels Text Sequence [English]
    The Questions for the Coins Text Sequence [Greek]
    The Questions for the Vessels Text Sequence [Greek]
    Questionnaire
Appendix III: The Statistical guide


Index of tables, graphs and pictures

Table 2.1. The M-PIRO pipeline generation architecture
Table 2.2. Part of M-PIRO entity hierarchy organization of types and levels
Table 3.1. The relationships between two representation
Table 3.2. Short and long conjunctions for similarity and contrast
Table 3.3. The comparison ordering based on information importance for M-PIRO system
Table 4.1. Part of the lekythos text generated by M-PIRO with and without comparison and aggregation
Table 4.2. Some questions from the factual recall text (all the questions are in Appendix II). The correct answers are in bold type
Table 4.3. The results of the participants in both versions of the pilot experiment
Graph 4.4. The performance of the participants based on the option of comparison and aggregation
Graph 4.5. The performance of the participants based on the group factor
Table 4.6. The questionnaire results of the pilot experiment
Picture 5.1. A web page from the vessels sequence that contains the Spherical Corinthian Aryballos
Table 5.2. Two examples of the vessels texts with more complicated comparisons
Table 5.3. The results of the participants in both languages of the main experiment
Graph 5.4. The score performance per person depending on text type factors (Greek Version)
Graph 5.5. The score performance per person depending on text type factors (English Version)
Graph 5.6. The results per participant depending on the group factor [Greek version]
Graph 5.7. The results per participant depending on the group factor [English version]
Graph 5.8. The performance for all the participants depending on the genre factor [both version]
Graph 5.9. Box plots for performance depending on genre and text type factors [English version]
Graph 5.10. Box plots for performance depending on genre and text type factors [Greek version]
Graph 5.11. The performance of Greek participants depending on text difficulty (coins vs. vessels)
Graph 5.12. The performance of English participants depending on text difficulty (coins vs. vessels)
Graph 5.13. The questionnaire results summary of the English participants for both groups
Graph 5.14. The questionnaire results summary of the Greek participants for both groups
Graph 6.1. Box plots for the performance of the participants depending on the language factor


Chapter 1

Introduction
1.1. Natural Language Generation Systems
Natural Language Generation (NLG) is the assembly of text word by word using knowledge of morphology, syntax, semantics and text structure (O'Donnell et al., 2001). As a branch of computational linguistics, cognitive science and artificial intelligence, NLG is the process of constructing natural language output from non-linguistic (symbolic or numeric) input, in particular of mapping some underlying representation of information to a meaningful, understandable and specific presentation of that information in spoken and/or written form in a human language. A complete NLG system has to take many decisions to produce an appropriate output. The goal of NLG can be characterized as the inverse of that of Natural Language Understanding (NLU), where the concern is to map from text to some underlying representation of its meaning (Mellish & Dale, 1998; Jurafsky & Martin, 2000: 764-794; O'Donnell et al., 2001).

The generation process in an NLG system typically consists of the following five main stages:

Content determination, in which the system decides what information should be included in the text and what should be omitted; this selection depends upon a variety of contextual factors and on the particular user to whom the text is directed.

Document structuring, in which it is decided how the text should be organized and structured; that is, the NLG system has to choose the appropriate structure to convey the information already selected.

Lexical selection, in which the system chooses the particular words, word types and phrases that are required to communicate the specified information; it may also be possible to vary the words used for stylistic effect.

Sentence structure¹, which involves two processes: aggregation, in which the system must apportion the selected content into phrase, clause and sentence-sized chunks and, often in the interests of fluency, place several pieces of information into one sentence; and referring expression generation, in which the system determines which properties of an entity to use when referring to that entity.

Surface realization, in which the system maps the underlying text representation into natural text consisting of grammatically correct sentences.

The M-PIRO NLG system, which is discussed in Chapter 2 and evaluated in this study, is a dynamic hypertext² natural language generation system.
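The five stages above can be illustrated with a deliberately small sketch. It is not the M-PIRO implementation: the `Fact` record, the importance scores, the lexicon templates and the example facts are all hypothetical and serve only to show how each stage hands its result to the next.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str     # entity the fact is about
    predicate: str   # relation name
    value: str       # filler of the relation
    importance: int  # hypothetical score, higher means more important


# Hypothetical input facts about one exhibit.
FACTS = [
    Fact("exhibit1", "material", "clay", 2),
    Fact("exhibit1", "inventory", "A-1234", 1),
    Fact("exhibit1", "type", "lekythos", 3),
    Fact("exhibit1", "period", "classical", 2),
]

# Hypothetical fixed ordering of predicates and phrase templates.
ORDER = ["type", "period", "material", "inventory"]
LEXICON = {
    "type": "is a {value}",
    "period": "was created during the {value} period",
    "material": "is made of {value}",
    "inventory": "has the inventory number {value}",
}


def determine_content(facts, min_importance):
    """Content determination: keep only sufficiently important facts."""
    return [f for f in facts if f.importance >= min_importance]


def structure_document(facts):
    """Document structuring: put the facts into a fixed order."""
    return sorted(facts, key=lambda f: ORDER.index(f.predicate))


def select_lexical_items(facts):
    """Lexical selection: choose a verb phrase for each fact."""
    return [LEXICON[f.predicate].format(value=f.value) for f in facts]


def structure_sentences(phrases, max_per_sentence=2):
    """Sentence structure: aggregate phrases and pick a referring expression."""
    sentences = []
    for start in range(0, len(phrases), max_per_sentence):
        subject = "This exhibit" if start == 0 else "It"
        sentences.append((subject, phrases[start:start + max_per_sentence]))
    return sentences


def realize(sentences):
    """Surface realization: produce the final string."""
    return " ".join(f"{subj} {' and '.join(vps)}." for subj, vps in sentences)


def generate(facts, min_importance=2):
    selected = determine_content(facts, min_importance)
    ordered = structure_document(selected)
    phrases = select_lexical_items(ordered)
    return realize(structure_sentences(phrases))


print(generate(FACTS))
```

With the default threshold the inventory number is left out, and the remaining three facts are aggregated into two sentences, the second of which uses a pronoun instead of repeating the full description.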

1.2. Evaluating Natural Language Generation Systems


Over the last 15 years, the level of interest and concern expressed by natural language processing (NLP) researchers with regard to evaluation has increased substantially. In early NLG work, the quality of a system's output was assessed by the system authors themselves, who underestimated the value of evaluation for the improvement of an NLG system. In contrast, it is now widely accepted in the language processing community that researchers should value the evaluation of a system and pay more attention to its results, since it plays an essential role in the development of NLG systems, techniques and strategies. Mellish and Dale (1998: 349) claim that NLG is exactly half of the problem of natural language processing work; the other half is the evaluation of a system. According to them, the first serious work in the field of evaluation took place in 1990, at a workshop on the theme of the Evaluation of Natural Language Generation Systems, and in the papers of Meteer and McDonald (1991) and Moore (1991). The main issue they tried to clarify is the difference between the evaluation of a system and the evaluation of its underlying theories. Building on these papers, which dealt with this and some other essential problems, empirical work in this field increased noticeably.

¹ For some researchers (Reiter & Dale, 2000, Building Natural Language Generation Systems; Melengoglou, 2002) lexical selection, aggregation and referring expression generation are part of microplanning.
² Dynamic hypertext refers to an NLG system which creates its output dynamically at run-time, when the user requests it; such an output text is generated on demand, not pre-written by a human author.

Evaluation can have many objectives and can consider several different dimensions of an NLG system or theory. Some evaluation objectives combine aspects of a system and its underlying theory, which makes the task harder. Mellish and Dale (1998: 353-354) suggest that the evaluation of NLG techniques must be broken down into three main categories:

Evaluating properties of the theory is the assessment of the characteristics of some theory underlying an NLG system or a part of it; implementing the theory (e.g. Rhetorical Structure Theory) helps us to judge whether or not it is appropriate for the system.

Evaluating properties of the system is the assessment of specific characteristics of some NLG system; it might be a comparison of two NLG systems or their algorithms, of the performance of two different versions of one NLG system, or of the generator's output with the characteristics of a corpus of target texts.

Applications potential is the evaluation of the potential utility of an NLG system in some environment, for instance whether its use provides a better solution than some other approach.

Previous approaches to the evaluation of NLG theories are very few. The main problem is that a good NLG system is based on a theory; nevertheless, several practical problems must be solved during its construction and, therefore, the solutions may be unconnected to the underlying theory. There have been some evaluations of grammars based on particular theories, such as Rhetorical Structure Theory (Mann & Thompson, 1988). Robin tested his revision-based theory with a natural corpus for the domain of baseball summaries. As Mellish and Dale (1998: 355) report, this kind of evaluation is dangerous, since most reported work on NLG evaluates theories indirectly, through the systems that implement them.
In contrast to the evaluation of NLG theories, the question "How good is my NLG system?" is much easier to answer. There are different kinds of system aspects that can be evaluated. Accuracy evaluation assesses the relationship between input and output, that is, whether the generated text conveys the desired meaning to the reader (Jordan et al., 1993). Fluency and intelligibility evaluation concerns the quality of the generated text and includes notions such as syntactic correctness, stylistic appropriateness, organization and coherence. Although it is unclear how these notions can be measured, Minnis (1993) made some proposals for their evaluation. There are quite a few evaluations in this field, such as that of Bangalore et al. (2000), who evaluated the system FERGUS and found that the understandability and quality of its output were correlated with each other³. Finally, task evaluation involves observing how well a task is performed by using the NLG system. Task evaluation is usually used for Machine Translation (MT) systems; however, it has also been used for NLG systems, such as IDAS (Levine & Mellish, 1995), PEBA-II (Milosavljevic & Oberlander, 1998), ILEX⁴ (Cox et al., 1999) and AMVF (Carenini & Moore, 2000).

There are also some issues and general problems which arise when evaluating an NLG system. Firstly, the major problem is that of assessing the system's output, which raises the question of what the output should be. A fluency and intelligibility evaluation can deal with this issue but lacks objective criteria, whereas the results of a task evaluation reflect the properties of the system only indirectly. Secondly, measurable attributes of the system's performance have to be evaluated, and thirdly, these attributes must be compared with something else; otherwise it is hard to say that something is good or bad, easy or difficult, acceptable or not. Additionally, it is essential to obtain adequate test data: an evaluation without sufficient data will unsurprisingly fail. Finally, human subjects may not agree with one another, and it would be wise to handle disagreement among human judges by not taking vague judgements into account; the authors should therefore guide⁵ the subjects to give specific, objective and clear opinions rather than vague criticisms and thoughts.

³ They evaluated two different versions of FERGUS (Flexible Empiricist/Rationalist Generation Using Syntax) using evaluation metrics (accuracy) which are useful to them as relative quantitative assessments of different models.
⁴ For more details see section 2.1.2.
⁵ However, the subjects should not be steered towards what the authors desire, for instance towards saying what the authors want to hear.


1.3. Purpose and Outline of the study


The purpose of this dissertation is to present and describe the evaluation of the M-PIRO NLG system and to draw useful conclusions for the further improvement of the system and the future development of NLG systems. The hypotheses of this evaluation are the following: firstly, we expect that a text with comparison and aggregation will help the subjects perform better and learn more. Secondly, performance will differ depending on the difficulty of a text. Thirdly, the subjects will judge a text with comparison and aggregation to be more fluent and natural than a text without these factors. Finally, they will feel more comfortable and more certain that they have learned more from a text with the above factors.

The remainder of this dissertation is organized as follows. Chapter 2 provides an overview of the ILEX NLG system (2.1.1) and its evaluation (2.1.2). It then presents the M-PIRO NLG system and describes its domain and generation architecture (2.2.1) and its authoring tool (2.2.2). Chapter 3 examines both comparison and aggregation in the M-PIRO system; it also provides the literature background and a description of the implementation of these factors in the current system. Chapter 4 describes the pilot experiment that preceded the main evaluation experiment: after the presentation of the main purpose (4.1), the method is illustrated (4.2) and the results are presented and analyzed (4.3). Chapter 5 presents the design of the main experiment. As its purpose and design were largely the same as those of the pilot (5.1-5.2), the emphasis is on the analysis of the results (5.3). Chapter 6 closes the dissertation with a general discussion of the results of this study (6.1.1) and of an ordering effect noticed in the experimental design (6.1.2). Furthermore, it offers suggestions and improvements (6.2) and future work (6.3) for the M-PIRO system.


Chapter 2

The M-PIRO NLG System


2.1. The ILEX NLG System
2.1.1. The ILEX Dynamic Hypertext System
ILEX⁶ (the Intelligent Labelling Explorer) is a dynamic hypertext generation system developed at the University of Edinburgh (1997-2000) in collaboration with the National Museum of Scotland. Its task was to generate labels (descriptions) for the objects of a virtual museum exhibition in several languages, using a single knowledge database that stores information in a language-neutral form. The labels that ILEX generated were personalized: they were tailored opportunistically depending on the user's interests and the user's interaction history with the system. The user model of ILEX supports label generation for the categories of child, adult and expert. It models users in terms of their relation to information, such as the interest, the importance and the level of assimilation of the information, and provides values for each predicate type. Since the authors cannot predict the exact nature of the user, ILEX lets users directly control the text displayed for the objects and browse the collection of objects in any order; based on the authors' prior assumptions about the values of the users' relation to information, the system responds to the user's requests and adapts the structure of its labels to the user. ILEX's aim is therefore to produce exhibit descriptions that a real curator might give, so that visitors feel as if they are in a real museum with a guide.

The opportunistic text tailoring is achieved in ILEX through the use of referring expressions, comparison expressions, nominal anaphora and approaches derived from Rhetorical Structure Theory. Built on Rhetorical Structure Theory⁷, aggregation is organized into nucleus-satellite relations ("Like most Arts and Crafts style jewels, it has an elaborate design") and multi-nuclear relations ("This jewel is a necklace and was made by a British designer called Edward Spencer"). For comparison expressions it uses the user's navigation log: it introduces an already known concept ("This necklace is also in Arts and Crafts style"), makes simple comparisons with the previously visited exhibit ("For instance the previous item uses oval-shaped stones (in other words it features rounded stones). However this necklace does not feature rounded stones") and steers clear of repeating information which has already been mentioned (Cox et al., 1999; Melengoglou, 2002).

Nevertheless, ILEX is not without problems and flaws (Isard et al., 2003). Much of the information was captured by interviewing a curator and then hand-coding taxonomic information and other assertions. The authors typed text strings literally rather than expressing them in terms of knowledge-base objects stored in a language-neutral form; it was therefore hard to present information in any language other than the original input language. Related to this, there are some problems with the grammars (the English and Spanish grammars work well, but for other languages the grammars would have to be rebuilt). Furthermore, the level values of the adult, expert and child user types do not essentially change the text structure. Finally, the system is less modular than desirable.

⁶ For an extended discussion of this system, see O'Donnell et al. (2001); see also Milosavljevic and Oberlander (1998), Cox et al. (1999) and O'Donnell et al. (2000).
⁷ For more details and an extensive discussion of the theory, see Mann and Thompson (1988).
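The use of a navigation log for comparison expressions, as described above, can be sketched as follows. This is only an illustration of the idea and not ILEX's algorithm: the function `describe`, the dictionary keys and the example jewels are hypothetical, and the avoidance of repeated information is left out for brevity.

```python
def describe(exhibit, log):
    """Return sentences for `exhibit`, given `log`, the exhibits visited so far.

    Each exhibit is a dict with the hypothetical keys 'kind', 'style' and 'stones'.
    """
    kind, style = exhibit["kind"], exhibit["style"]
    sentences = []

    # An already known concept is introduced with "also".
    if any(previous["style"] == style for previous in log):
        sentences.append(f"This {kind} is also in {style} style.")
    else:
        sentences.append(f"This {kind} is in {style} style.")

    # A simple contrast with the most recently visited exhibit.
    if log and log[-1]["stones"] != exhibit["stones"]:
        sentences.append(
            f"The previous item uses {log[-1]['stones']} stones. "
            f"However, this {kind} uses {exhibit['stones']} stones."
        )

    log.append(exhibit)  # record the visit for later descriptions
    return sentences


visit_log = []
brooch = {"kind": "brooch", "style": "Arts and Crafts", "stones": "oval-shaped"}
necklace = {"kind": "necklace", "style": "Arts and Crafts", "stones": "square-cut"}

for item in (brooch, necklace):
    print(" ".join(describe(item, visit_log)))
```

The same exhibit therefore receives a different description depending on what the user has already seen, which is the sense in which the text is tailored opportunistically.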

2.1.2. The evaluation of the ILEX System: Dynamic vs. Static version
The paper by Cox, O'Donnell and Oberlander (1999) describes an evaluation of ILEX, the Intelligent Labelling Explorer. In the evaluation, learning outcomes in subjects who used the dynamic ILEX system were analysed and contrasted with those of subjects who used a static version of the system. Their goal was to attempt to isolate learning effects specifically due to dynamic hypertext generation (Cox et al., 1999). In previous work (Levine & Mellish, 1995), the IDAS system, which used natural language generation techniques in the automatic generation of hypertexts, was evaluated; the users' task was to retrieve relevant information to answer specific questions. However, that evaluation did not use any comparison group and built its results and discussion on the users' page visits and navigation logs. Since the aim of Cox, O'Donnell and Oberlander was not to compare their dynamic hypermedia with a traditional media system or to observe aspects of hypermedia such as configurations of links, they used two different versions of ILEX: a traditionally configured version with static pages and no user modeling, and the original intelligent system with dynamically generated text containing referring expressions and comparisons based on the user model (Cox et al., 1999). Both versions contained the same six jewels. Three instruments were devised for use in the evaluation: (1) a recall test of factual knowledge about jewels in the exhibition, (2) a curator task⁸ and (3) a usability questionnaire. Twenty subjects were allocated to the dynamic ILEX system and ten subjects to the static version of the system.

The results were quite interesting. Both groups scored similarly on the two tests in performance terms; however, the log data revealed that the dynamic system users made more visits to the case of jewels than static subjects, and "made proportionately more navigation-related button clicks than their static ILEX counterparts" (Cox et al., 1999: 7). Based on these results they claim that the users did not benefit from the properties of the dynamic version, since they did not score better; additionally, according to the log data, the learning pattern and performance varied and were achieved in different ways. Nevertheless, there were some flaws in this experiment. The same number of users was not used for both versions. Moreover, the subjects were not exposed to the same experimental conditions, since they used different versions; a main effect for groups⁹ could therefore have occurred, and it was not reported whether it did. Furthermore, the time allowed was too long for only six exhibits and might have affected the performance results, since the time conditions were not realistic (in a normal case no one would spend ninety minutes on a twelve-paragraph text about six exhibits).
They claimed that any learning effect is almost entirely dependent on the navigation route, as did Levine & Mellish (1995); I maintain that learning effects go beyond any log and navigation route, since learning depends on many factors (Mellish & Dale, 1998; Jurafsky & Martin, 2000).

Other experiments have been carried out to test different properties of an NLG system and to evaluate the system on that basis. Properties of the underlying theory, properties of the system and the applications potential have all been evaluated. Nonetheless, the lack of a specific evaluation theory and the disagreement about subjective qualities like fluency, readability, good style and appropriateness constitute an essential drawback in the evaluation of an NLG system. Some researchers failed to evaluate systems properly because they used immeasurable phenomena or subjective criteria. According to Mellish & Dale's (1998) approach to evaluation, there are some important issues and problems that must be solved in an evaluation design; these will be discussed in the pilot experiment section.

⁸ This task consisted of presentations of jewels not seen in the exhibition; the subjects were asked to examine each exhibit and classify it by answering multiple-choice questions.
⁹ For more about statistics terminology see Appendix III.

2.2. The M-PIRO System


The M-PIRO¹⁰ NLG system (Multilingual Personalized Information Objects) is a more recent project of the Information Societies Programme of the European Union that also generates descriptions of virtual museum objects (exhibits). It is a descendant of the ILEX system and has focused on developing language-engineering technology for personalized information objects, specifically on multilingual information delivery (Isard et al., 2003). It incorporates high-quality speech output, an authoring tool, improved user modeling and a modular core generation engine (Androutsopoulos et al., 2002).

2.2.1. The M-PIRO Domain and Generation Architecture


Stage                 Task                                               Output
(input)               domain database + domain semantics                 domain model
CONTENT SELECTION     selection of facts to convey to the user           information to be conveyed
TEXT PLANNING         ordering of facts + document structure (RST)       text structure
MICRO-PLANNING        lexicalisation + referring expression generation   document specifications
SURFACE REALISATION   text generation                                    exhibit description

Table 2.1: The M-PIRO pipeline generation architecture

10 The project's consortium consisted of the University of Edinburgh (Scotland), ITC-irst (Italy), NCSR Demokritos (Greece), the University of Athens (Greece) and the Foundation of the Hellenic World (Greece).


Table 2.1 outlines the stages in the M-PIRO NLG architecture (Androutsopoulos et al., 2002) and the process for generating a text.

Domain authoring
The knowledge base contains all the necessary information about entities and relationships; entities may be abstract or concrete. The major task is to define the hierarchy of entity types, that is, entities and sub-entities (e.g. exhibit and material: statue and marble), which can contain further levels; for example, the entity statue has the sub-types complex statue, kouros, imperial portrait, etc. Similarly, the relations between entities are expressed using fields: the domain author can define fields for each entity type, which are then inherited by all entity types below it in the hierarchy (Isard et al., 2003). For example, creation-period applies to statue and to all its descendants, such as complex statue, kouros and imperial portrait, and it must be filled by an entity of the historical-period group (archaic, classical, hellenistic and roman).
Basic Type   Entity Type   Entity Level        Entity
Copy
a-location
exhibit      statue        complex-statue
                           kouros
                           portrait            exhibit7
                           imperial portrait   exhibit17
             coin
             jewel
             relief

Table 2.2: Part of M-PIRO entity hierarchy organization of types and levels
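
The field inheritance described above can be illustrated with a minimal Python sketch. It is not M-PIRO's actual data model: the class `EntityType`, the helper `check_filler` and the representation of a field as a mapping from field name to filler type are assumptions made for illustration. Only the entity names, the field creation-period and the four historical periods come from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class EntityType:
    """A node in the entity-type hierarchy (hypothetical structure)."""
    name: str
    parent: Optional["EntityType"] = None
    # Fields declared directly on this type: field name -> required filler type.
    own_fields: Dict[str, str] = field(default_factory=dict)

    def all_fields(self) -> Dict[str, str]:
        """Collect the fields declared here and on every ancestor."""
        inherited = self.parent.all_fields() if self.parent else {}
        return {**inherited, **self.own_fields}


exhibit = EntityType("exhibit")
statue = EntityType("statue", parent=exhibit,
                    own_fields={"creation-period": "historical-period"})
kouros = EntityType("kouros", parent=statue)
imperial_portrait = EntityType("imperial-portrait", parent=statue)

HISTORICAL_PERIODS = {"archaic", "classical", "hellenistic", "roman"}


def check_filler(entity_type: EntityType, field_name: str, value: str) -> bool:
    """Return True if the value is a legal filler for the field on this type."""
    filler_type = entity_type.all_fields().get(field_name)
    if filler_type == "historical-period":
        return value in HISTORICAL_PERIODS
    return False
```

In this sketch kouros and imperial portrait obtain creation-period from statue without declaring it themselves, while exhibit, which sits above statue, does not have the field.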

Microplanning expressions
Each field has associated information that specifies how the relationship it represents can be expressed as a sentence. As mentioned in the introduction, microplanning involves lexical selection, aggregation and referring expression generation; the specifications for these are known as microplanning expressions. Either clause plans are created, in which a verb is selected from a pull-down menu of verbs, or templates are used, which build the expression from strings and references to the two entities whose relationship is expressed by the field. Furthermore, the microplanning expressions are populated from the language-dependent lexicon, which contains entries for nouns and verbs for lexical selection. Articles and prepositions are domain-independent and are therefore stored as a separate resource.

Three languages
M-PIRO can generate text in three languages: English, Greek and Italian. The grammar for English is based mainly on the ILEX grammar; the grammar for Greek was built from scratch on the basis of the English one, and the Italian grammar was based on the ILEX Spanish one. As already mentioned, the lexicon now has a larger domain-independent part and a full inflection system, especially for Greek and Italian because of their rich morphological systems. Moreover, M-PIRO supports high-quality speech output (Festival¹¹ for English and Italian and DEMOSTHeNES¹² for Greek).

User Modeling
One major advantage of the system is that the user modelling information is stored separately from the domain and linguistic resources, in a personalization server. Each time users interact with the system, they give their personal details; thus, the system always has access to the user's personal profile and to information on the history of the user's interactions with the collection. The authors also defined user types for adults, experts and children. Each entity type field has values for interest, importance and repetitions for each user type. The repetitions value is used to calculate the assimilation score and rate per user (a low repetition rate for experts, a high one for children). The microplanning expressions and the lexicon entries also vary with the user type. There is a comparison module based on a list of important information, and there is an aggregation module that uses techniques such as simple conjunction, relative clauses and syntactic embedding to join single facts together; these two modules are discussed in more detail in Chapter 3.

11 Developed by the University of Edinburgh. For more details see the official web pages of Festival (http://www.cstr.ed.ac.uk/projects/festival/) and M-PIRO (http://www.ltg.ed.ac.uk/mpiro, http://mpiro.ime.gr).
12 Developed by the University of Athens. For more details see the official web pages of DEMOSTHeNES (http://laertis.di.uoa.gr/speech/synthesis/demosthenes/) and M-PIRO.
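
The per-user-type values attached to a field can be pictured with the following Python sketch. The structure `FieldSettings`, the rule in `is_assimilated` and all interest and importance numbers are assumptions for illustration, not M-PIRO's stored values. The repetition counts for experts and adults follow the profile defaults reported in section 4.2.1; the count for children is invented, chosen only to respect the stated ordering (low for experts, high for children).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSettings:
    """Per-user-type values attached to one entity-type field."""
    interest: int
    importance: int
    repetitions: int  # mentions needed before the fact counts as assimilated


# Hypothetical settings for the field "creation-period". The interest and
# importance numbers and the child repetition count are invented.
CREATION_PERIOD = {
    "expert": FieldSettings(interest=2, importance=2, repetitions=1),
    "adult": FieldSettings(interest=2, importance=2, repetitions=2),
    "child": FieldSettings(interest=1, importance=2, repetitions=3),
}


def is_assimilated(user_type: str, times_conveyed: int) -> bool:
    """Assumed rule: a fact is assimilated once it has been conveyed often enough."""
    return times_conveyed >= CREATION_PERIOD[user_type].repetitions
```

Under this assumed rule, a fact mentioned once would already count as known for an expert but not for an adult or a child, so the system could safely repeat it to the latter two.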

Modularity
M-PIRO's system architecture is significantly more modular than that of its predecessor ILEX, which lacks modularity. In particular, the linguistic resources, database and user-modeling subsystems are now separate from the systems that perform the natural language generation and speech synthesis, which gives the system a satisfactory level of modularity; of course, it is not possible to move to a new application domain without specifying both what will be talked about and what vocabulary will be used when talking about it.

2.2.2. The M-PIRO Authoring Tool


According to the authors (Androutsopoulos et al., 2002; Spiliotopoulos et al., 2002; Isard et al., 2003), exhibit authoring is a simpler process than domain authoring: it consists of defining specific instances of entities and filling the entities' fields with the corresponding information. The authoring tool previews the output of the generation system on the basis of the current state of the database. It is tailored to allow domain experts to manipulate not only the contents of the database, but also its structure and the domain-dependent linguistic resources that control how the information in the database is rendered in natural language. The difficult part of the authoring is done by an expert, e.g. a museum curator, who designs the hierarchy and adds the basic types, entity types, microplans, etc. The easier part, the exhibit authoring described above, is done by non-experts, who add particular entities. So an expert will add the entity type amphora, but a non-expert can then add many particular amphora entities, e.g. exhibit1 and exhibit18, and use the microplans which the expert has built to add information about each particular entity.

Domain and exhibit authoring can be used together to check information (given by a designer or curator) and create a text. For example, the domain authoring can define the basic types material and exhibit, a relation made-of, a specific material [marble], an entity type statue that is a subtype of exhibit and an entity type imperial portrait that is a subtype of statue [portrait of Octavian Augustus]; the text will be "This exhibit is an imperial portrait. It is made of marble." The tool is designed to be used by people, such as museum curators, who have no experience in language technology [one of the basic usability rules of Nielsen (2000)]. Finally, authors can create the types of visitors (adult, expert, child) and attach fields and microplanning expressions to their properties.
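
The authoring example above can be reproduced with a small Python sketch of template-style microplans. The names `TEMPLATES`, `describe` and `indefinite`, the dictionary layout of an entity and the pronoun rule (the first sentence uses "This exhibit", later ones "It") are assumptions for illustration; M-PIRO's real microplanning also supports clause plans and proper referring expression generation.

```python
def indefinite(noun: str) -> str:
    """Prefix a noun phrase with 'a' or 'an' (rough English heuristic)."""
    article = "an" if noun[0].lower() in "aeiou" else "a"
    return f"{article} {noun}"


# Hypothetical template microplans, one per field.
TEMPLATES = {
    "type": "{subject} is {value}.",
    "made-of": "{subject} is made of {value}.",
}


def describe(entity: dict) -> str:
    """Realise each filled field of an entity with its template."""
    sentences = []
    for i, (field_name, value) in enumerate(entity["fields"]):
        # Assumed referring rule: full reference first, pronoun afterwards.
        subject = "This exhibit" if i == 0 else "It"
        if field_name == "type":
            value = indefinite(value)
        sentences.append(TEMPLATES[field_name].format(subject=subject, value=value))
    return " ".join(sentences)


# An entity as a non-expert author might fill it in (identifier is illustrative).
portrait = {"id": "exhibit17",
            "fields": [("type", "imperial portrait"), ("made-of", "marble")]}
print(describe(portrait))
# prints: This exhibit is an imperial portrait. It is made of marble.
```

The split mirrors the division of labour in the text: the expert supplies the templates once, and the non-expert only fills in field values for each new entity.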


Chapter 3

Aggregation and Comparison in the M-PIRO


3.1. Aggregation
3.1.1. What is aggregation?¹³
In a Natural Language Generation system, aggregation is part of the microplanning stage, where texts are composed of verb-based, clause-sized propositions. These propositions are likely to contain repetitions and redundancies, which make the text seem boring, non-fluent and unnatural to human readers (Melengoglou, 2002). Aggregation is used to overcome this problem. The literature answers the question of what aggregation is in various ways, and many researchers have tried to define it. Summarising these attempts, aggregation can be considered to be the generation of fluent, more readable and less boring text by eliminating redundancy and combining the text components semantically at any level to achieve a more concise and coherent text. The effect of aggregation can be seen very clearly in the following example from Reape and Mellish (1999), in which two propositions with obvious common features are combined to produce a single sentence:

1. The car is here.
2. The car is blue.
[1+2]. The blue car is here.

The goal of aggregation is to produce a text which is concise, coherent, cohesive and fluent; however, these goals cover most of the territory of the default communicative goals of NLG systems in general. Linguistic theories separate aggregation into several types: conceptual aggregation (reducing the number of propositions in the text while increasing the complexity of the values of conceptual roles), discourse aggregation (any operation that improves a discourse structure), semantic aggregation (the combination of two semantic entities into one, or semantic grouping and logical transformations), syntactic aggregation (grouping subjects or predicates, the most common form), lexical aggregation (lexicalization, or the choice of the particular lexemes to realize lexical predicates and structures) and referential aggregation (the referring expressions).

The input to aggregation is usually a tree containing the ordering of the facts and the dependencies that relate them. In this tree, aggregation detects shared components among neighbouring text-tree nodes and combines them in an attempt to remove redundancies and repetitions from the resulting text. The most common type of aggregation is simple conjunction or disjunction, which joins facts together by means of coordination or contrast, with words like "and", "but" and "or". Another common type is syntactic embedding, which subordinates a clause to a proposition and sets it off with commas ("Alexander, the king of Macedonia, conquered the Persians"). Generally, according to Melengoglou et al. (2003), the choice of particular aggregation operations seems to be highly domain-specific.

13 For extended discussion, see Reape & Mellish (1999).

3.1.2. The implementation of aggregation in the M-PIRO System


Aggregation in the M-PIRO project receives as input a sequence of semantic representations of facts, which are either classifying facts or facts presenting attributes. The facts are connected by rhetorical relations, which have to be made explicit to the aggregation module, since otherwise the intended meaning could well be lost in the generated text. Conversely, aggregation will never affect the meaning of a text when the relations are specified. The aggregation module can combine a classifying fact with a fact presenting an attribute into a complex sentence, and two facts presenting attributes into a compound sentence. Melengoglou (2002) built the rules of the M-PIRO aggregation module, which are capable of producing a text that is more concise and readable. They are grouped into three major combinations. Aggregating identity-attribute pairs includes type-comma ("This exhibit is a drachma, created during the classical period"), type-qualifier ("This portrait depicts Alexander the Great, a king from Macedonia") and type-semicolon ("In the background you can see rows of columns, temples and other buildings; in the foreground there is a ship and a statue of a male form"). Aggregating attribute pairs includes simple conjunction ("This coin originates from Patras and it is now exhibited in the Numismatic Museum of Athens") and shared subject-predicate ("This stamnos was decorated by the painter of Dinos with the red figure technique and is made of clay"). Finally, aggregating nucleus-satellite pairs includes syntactic embedding ("This is another relief, a tomb stele").

There is a hierarchy among the rules, since the system must select the proper aggregation rule to apply and reject the others. Some rules therefore have higher priority, because the resulting text is less redundant and more readable and they help to clarify the meaning. According to these priorities, syntactic embedding is the most important rule in the set, while simple conjunction is the least significant one. Additionally, there is no clear preference between type-comma and shared subject-predicate. After the appropriate rule has been chosen, its parameters must be specified, such as the maximum number of facts the system should convey in a sentence and the verification of sentence quality.

When the rules are applied in sequence, there are two main steps in the aggregation algorithm. The first step concerns the important user-modelling parameter of the maximum facts per sentence, which determines the number of facts that Exprimo should convey to a particular type of user in each sentence; short sentences may be suitable for small children, but long sentences may be better suited to adults, who find short sentences irritating and boring (Melengoglou et al., 2003). Moreover, any conflict in the application of two adjacent aggregation rules must be eliminated; the system should therefore apply the right aggregation rule to the new linguistic structure and give up the other. This choice is necessary for sentence quality, since there are two kinds of restrictions: user-modelling restrictions and text quality restrictions.
For all these restrictions, Melengoglou's module offered suggestions and proposals that resolved the problems and the conflicts. The following sentences illustrate the effect of different values of the max facts per sentence parameter on four propositions generated by the M-PIRO system.

Max facts = 1: This exhibit is a stamnos. It was decorated by the painter of Dinos. It was painted with the red figure technique. It is made of clay.
Max facts = 2: This exhibit is a stamnos, decorated by the painter of Dinos. It was painted with the red figure technique and is made of clay.
Max facts = 3: This exhibit is a stamnos, decorated by the painter of Dinos with the red figure technique. It is made of clay.
Max facts = 4: This exhibit is a stamnos; it was decorated by the painter of Dinos with the red figure technique and is made of clay.
Table 3.1.: A set of propositions generated by M-PIRO with different values of the max facts per sentence parameter
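
The way the max facts per sentence parameter partitions the four facts of Table 3.1 can be sketched in Python as below. This is only the grouping step, under the assumption that facts are taken in their planned order and packed greedily; the function `group_facts` and the fact labels are hypothetical, and the choice of aggregation rule that turns each group into a sentence is not modelled.

```python
from typing import List


def group_facts(facts: List[str], max_facts: int) -> List[List[str]]:
    """Split an ordered list of facts into sentence-sized groups."""
    if max_facts < 1:
        raise ValueError("max_facts must be at least 1")
    return [facts[i:i + max_facts] for i in range(0, len(facts), max_facts)]


# The four facts behind Table 3.1, written as illustrative labels.
FACTS = ["type: stamnos", "decorated-by: painter of Dinos",
         "technique: red figure", "made-of: clay"]

for limit in (1, 2, 3, 4):
    groups = group_facts(FACTS, limit)
    print(limit, [len(g) for g in groups])
# prints:
# 1 [1, 1, 1, 1]
# 2 [2, 2]
# 3 [3, 1]
# 4 [4]
```

The group sizes agree with the sentence boundaries in Table 3.1: four sentences for a limit of one, two sentences of two facts each for a limit of two, a three-fact sentence followed by a single fact for a limit of three, and one sentence for a limit of four.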

3.2. Comparison
3.2.1. What is comparison?
Comparison is like the ancient Roman god Janus, who has two faces: it consists of two related but very different aspects, similarity and contrast. Similarity is prototypically signalled by connectives like "also" and "too", and contrast by connectives like "whereas" and "while". Unfortunately, the literature contains few in-depth studies of comparison in general. The Rhetorical Structure Theory of Mann and Thompson (1998) included only a very elementary discussion of comparison and of relations applicable to an NLG system. The lack of articles and of an extended discussion of comparison in linguistic theories made the task of implementing it in an NLG system even harder and led to a few controversial suggestions and solutions. Comparison is used as a means of enhancing the experience of users, both by facilitating learning and by broadening their knowledge through introducing them to new and relevant items in the domain. Similarity deals with two propositions which contain some common components.

1. Michael is German and teaches linguistics.
2. Maria is Italian and teaches linguistics.
[1+2=>]. Michael is German and teaches linguistics. Maria is Italian; she also teaches linguistics.

On the other hand, contrast deals with two propositions which contain contrary components or two different aspects of a relation or entity.


3. Michael is German. He teaches linguistics.
4. Angelina is English. She teaches history.
[3+4=>]. Michael is German and teaches linguistics. Angelina is English, but she teaches history (or: She is also a teacher; however, she teaches history).

Comparisons are seldom thought of as something in and of themselves; rather, they are considered to be part of a larger explanation. According to Milosavljevic (1999), they are in fact a central part of descriptions. She claims that it is widely accepted that when describing a new concept to a hearer, the hearer's understanding of the new concept can be encouraged by drawing analogies with understood concepts or solutions to problems (1999: 28). In addition, Tversky's theory (Keane et al., 2001) treats the similarity of two entities, A and B, as a weighted function of the intersection of the features of A and B, less the sum of a weighted function of the distinctive features of each entity; in particular, a new entity is constructed on the basis of the properties the two entities share. The relationships between two representations may be (a) commonalities (a property pair of the entities matches), (b) alignable differences (a property pair of the entities has different values) and (c) non-alignable differences (a value of a property pair is absent) [see table 3.2].
                            stamnos                      drachma
Commonalities               Creation period: classical   Creation period: classical
Alignable differences       Material: clay               Material: silver
Non-alignable differences   Painted by: Dinos            -----

Table 3.2.: The relationships between two representations
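
The three relationship types can be computed mechanically once each exhibit is represented as a set of property-value pairs. The Python sketch below does this for the stamnos and drachma of Table 3.2; the function `compare_properties` and the dictionary representation are assumptions for illustration, not part of M-PIRO.

```python
from typing import Dict, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def compare_properties(a: Dict[str, str], b: Dict[str, str]) -> Dict[str, Dict[str, Pair]]:
    """Classify every property of two entities by how its values relate."""
    result: Dict[str, Dict[str, Pair]] = {
        "commonalities": {}, "alignable": {}, "non_alignable": {}}
    for prop in sorted(set(a) | set(b)):
        va, vb = a.get(prop), b.get(prop)
        if va is None or vb is None:
            # One entity has no value for this property.
            result["non_alignable"][prop] = (va, vb)
        elif va == vb:
            result["commonalities"][prop] = (va, vb)
        else:
            result["alignable"][prop] = (va, vb)
    return result


stamnos = {"creation-period": "classical", "material": "clay", "painted-by": "Dinos"}
drachma = {"creation-period": "classical", "material": "silver"}
relations = compare_properties(stamnos, drachma)
```

Here the creation period is a commonality and could support a similarity statement, the material is an alignable difference and could support a contrast, and the painter is non-alignable because the drachma has no such property.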

Discourse analysis and pragmatics have dealt with the problem of comparison, but only superficially. Various groups of conjunctions have been proposed for similarity, such as "both", "similarly", "another way in which these two ... are similar", "in the same way", "these ... are similar in that" and "likewise", and for contrast, such as "different in many ways", "is different", "whereas", "another difference", "but", "also differ in", "however" and "while" (more details in table 3.3.). Nevertheless, the conditions for combining information, the preference for some conjunctions over others, the appropriateness of a conjunction and the resulting changes to the text structure have not been examined in depth, so the comparison modules of recent NLG systems remain quite simplistic and sometimes show weaknesses caused by monotonous expressions and a lack of variety.

SIMILARITY
Short conjunctions: Similarly, | Likewise, | ...the same... | ...the same as... | ...also... | ..., too. | both
Long conjunctions: In the same way, | X is similar to Y in that (they)... | X and Y are similar in that (they)... | Like X, Y [verb]... | In like manner, | One way in which X is similar to Y is (that)... | Another way in which X is similar to Y is (that)...

CONTRAST
Short conjunctions: However, | In contrast, | By contrast, | ..., but | ..., yet
Long conjunctions: On the other hand, | even though + [sentence] | although + [sentence] | whereas + [sentence] | unlike + [sentence] | while + [sentence] | nevertheless,

Table 3.3.: Short and long conjunctions for similarity and contrast

3.2.2. The implementation of comparison in M-PIRO System


An essential aim in generating comparisons was to avoid making individual, full-clause references to previously seen exhibits, which tend to be boring, distracting and irritating and would make the educational value of comparison questionable. The M-PIRO comparison module therefore attempts to overcome this problem by using the class hierarchy to group previously examined exhibits into broader categories and to make either group references or short individual references, when the system knows that the name of the exhibit is sufficient to make a unique reference (Androutsopoulos et al., 2002) [see table 3.3.].

Comparison ordering (importance) sculpted-by potter-is potter-is original location painted-by creation-period original location painting-technique used creation-period made of painting-technique used made-of GREEK

Table 3.4.: The comparison ordering based on information importance for M-PIRO system

ENGLISH

The system can make two kinds of comparison: similarity ("like the stamnos, this lekythos was created during the classical period") and contrast ("Unlike the previous coins which were made of silver, this stater is made of gold"). When the system is asked for the next exhibit, it stores the information about the exhibit that can be compared (see the previous table); in particular, it locates the target predicates. As a first step in forming a comparison, Exprimo completes the domain class hierarchy tree for the previously examined exhibits that belong to the same exhibit subclass as the current exhibit (we assume that the user has visited only exhibits from the subclass vessel). The system includes all the potential comparators collected for each past exhibit. The next step is to remove, first, subsets (to avoid making full-clause references to previous exhibits) and, second, similar subclasses that are not directly related to the exhibit. Finally, the system removes the weakest relatives, by checking for identical comparators that are higher in the hierarchy, and distant relatives, by checking the direct relatives of the exhibit in the current focus.

For example, suppose the system checks the comparator made-of for an archaic stater (silver); the previous exhibit was a classical tetradrachm made of silver and the one before it a drachma made of silver. The super-class is now coins [made-of silver]; the system therefore removes the other entities (tetradrachm and drachma), uses the super-class entry (coins) and generates the comparison "like the previous coins, this stater is made of silver". However, if both previous exhibits were of the same sub-entity, such as tetradrachm, and both were made of silver, it would be meaningless to keep the super-class entry, and the generated comparison would be "like the tetradrachms, this stater is made of silver". Finally, the system uses the phrases "another" and "like the (previous) X" for similarity and "unlike the (previous) X" for contrast.
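
The stater example can be traced with the simplified Python sketch below. It covers only the similarity case for the made-of comparator and only the final choice between a type reference and a super-class reference; the pruning of subsets and distant relatives is not modelled. The table `SUPERCLASS`, the function `similarity_comparison` and the naive pluralisation with "s" are assumptions for illustration, not Exprimo's implementation.

```python
from typing import List, Optional, Tuple

# Hypothetical fragment of the class hierarchy: type -> super-class (plural label).
SUPERCLASS = {"drachma": "coins", "tetradrachm": "coins", "stater": "coins"}


def similarity_comparison(current_type: str, value: str,
                          history: List[Tuple[str, str]]) -> Optional[str]:
    """Build a 'Like ...' comparison for the made-of field, or return None.

    history holds (exhibit type, material) pairs for previously seen exhibits.
    """
    parent = SUPERCLASS.get(current_type)
    matching = [t for t, material in history
                if material == value and SUPERCLASS.get(t) == parent]
    if parent is None or not matching:
        return None
    if len(set(matching)) == 1:
        # All matching exhibits share one type: name that type directly.
        only = matching[0]
        referent = f"the {only}s" if len(matching) > 1 else f"the {only}"
    else:
        # Several types match: generalise to the shared super-class.
        referent = f"the previous {parent}"
    return f"Like {referent}, this {current_type} is made of {value}."


mixed = [("tetradrachm", "silver"), ("drachma", "silver")]
same = [("tetradrachm", "silver"), ("tetradrachm", "silver")]
print(similarity_comparison("stater", "silver", mixed))
print(similarity_comparison("stater", "silver", same))
# prints:
# Like the previous coins, this stater is made of silver.
# Like the tetradrachms, this stater is made of silver.
```

A contrast variant would follow the same grouping logic but select previous exhibits whose value differs and open the sentence with "Unlike".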


Chapter 4

The Pilot Experiment


4.1. Introduction
Ultimately, the pilot experiment was intended to examine not only performance on the texts within subjects and between subjects, but also what kind of text structure the subjects preferred and how much they thought they had learnt from a text with or without comparison and aggregation. It was decided that the best way to answer all these questions would be to give the subjects two thematically different text sets produced by the M-PIRO system, one with the comparison and aggregation options enabled and the other without them; the subjects would then take a factual recall test and be asked to decide which set seemed more natural and fluent to them. There were, however, many parameters to consider and test before proceeding to the main experiment.

Earlier studies that evaluated natural language generation systems tested different kinds of system properties, such as accuracy (Jordan, Dorr & Benoit 1993), fluency/intelligibility (Minnis 1998) and task performance (IDAS: Levine & Mellish, 1995; ILEX: Cox et al., 1999), and could therefore provide different points of reference for the present evaluation experiment. These experiments and approaches evaluated different aspects and properties of an NLG system. Mellish & Dale (1998: 349-350) tried to distinguish between the evaluation of systems and the evaluation of their underlying theories, and to distinguish both of these from task evaluation; each of these aspects is considered by looking at how evaluation has been carried out in the field so far. Although evaluation has increased substantially over the last few decades (from the 1980s onwards), little work has been done on it, either in linguistics or in experimental design theory for natural language generation systems. According to Mellish & Dale (1998) and Bangalore et al. (2000), the problem is the confusion caused by the failure to distinguish properly between natural language understanding and natural language generation; perhaps the most important of the reasons is that from a practical perspective we are faced with a world where there is a great deal of textual material whose value might be leveraged by the successful or partial achievement of the goals of NLU (Mellish & Dale, 1998: 352). The only work on task evaluation was done by Levine & Mellish (1995) for IDAS and Cox et al. (1999) for the ILEX system. As already noted in section 2.1.2, the evaluation of ILEX failed to support the hypothesis that the dynamic hypertext version would improve the performance of the subjects. It was therefore very important to address these issues and problems (Mellish & Dale, 1998) by evaluating M-PIRO carefully and to test the solutions before running the main experiment.

For the purpose of this study, the two variables to be tested were comparison and aggregation¹⁴. The major problem in evaluating an NLG system is that of assessing the output. Since there is no objective criterion for comparing the appropriateness of texts, it was decided to assess the output with and without the two variables. It was therefore critical to choose appropriate text output for the experiment and to expose all the subjects to the same conditions [unlike the Cox et al. (1999) experiment]. Furthermore, the knowledge that the human subjects gained from the texts was measured, since the human subjects are the most valuable resource. Finally, the last problem that had to be solved before the main experiment was how to handle disagreement among human judges, since humans will not always agree about subjective qualities like good style, coherence and readability (Mellish & Dale, 1998: 363). It is therefore preferable to avoid consulting explicit human judgements. After answers to all these questions had been found and the best parameters chosen, the only way to test them was with a pilot experiment, which would show how adequate the data, texts and design were.

14 Comparison is the main factor on which the user's history and the interaction with other texts are built; aggregation, apart from making a text more fluent and natural, improves the interaction within the text, since the pieces of information are no longer "bricks and tiles situated without order".


4.2. Method
4.2.1. Designing and choosing the exhibit texts
The M-PIRO Authoring Tool¹⁵ was used to choose the exhibits, to design and run the pilot experiment, and to preview the generated texts. The authoring tool is very useful not only for introducing new entities, sub-entities, information and exhibits, but also for inspecting the knowledge database for each exhibit and for choosing which exhibit order would be optimal for the experiment.

The core of the experimental procedure was the two text variables that help us to evaluate the system, aggregation and comparison. The first decision was how these variables were going to be used in the experiment and whether they would be tested together or separately. There could be a text with comparison and aggregation, a text with comparison only, a text with aggregation only and a text with neither. The ILEX evaluation showed that comparison alone, which draws on the user's history (comparing with what the user has already visited), failed to help users perform better and learn more than users whose history had been turned off (Cox et al., 1999). Moreover, the aggregation option combines facts within an exhibit's text and could not by itself help the user to remember more details. It was therefore decided to use one text with comparison and aggregation and another without either. Consequently, for the first text the option "Disable the user's history" was not selected, and four facts per sentence (max) were selected for aggregation in the user's profile; for the second text the user's history was disabled and only one fact per sentence was selected.

Continuing with the design of the experiment, it was decided to use the user model profile for adults. This profile has as default values four facts per sentence (two more than the child profile) and two repetitions for assimilation (one more than the expert profile). Because of the time limitations of the experiments (both pilot and main), only one profile could be used and, therefore, the participants were only adults. Moreover, none of the adult subjects should be an expert in, or have been acquainted at some point with, numismatics¹⁶, angiology¹⁷ or archaeology.

The second core part of the experiment was the exhibit texts and the decision whether the variables were going to be tested within or between subjects. Although in the evaluation of ILEX some subjects used the dynamic version and the others the static one, it was decided to give each subject two text sets, one with comparison and aggregation and the other without them. However, these sets could not contain exactly the same exhibits, as it would then be impossible to evaluate performance. Two completely different text sets were therefore chosen, each containing six exhibits, with all the exhibits in a set belonging to the same main entity. Mixing exhibits from different entities, such as statues, portraits and jewels, in one set was avoided, since that would make the text much more difficult for the user and M-PIRO could not produce comparison pairs between unrelated exhibits. The first text set had only coin exhibits¹⁸: a drachma, a classical tetradrachm, a tetradrachm, an archaic stater, a Hellenistic stater and a coin of the reign of Commodus. The second text set had only vessel exhibits: a Hadra ware hydria, a black kantharos cylix, a classical cylix painted by Eucharides, a lekythos painted by Amasis and a lekythos.

15 For more details see section 2.2.2, The M-PIRO Authoring Tool.

Lekythos (with comparison and aggregation)
This exhibit is another lekythos. Like the black kantharos, it was created during the classical period. It dates from between 470 and 460 B.C. It shows an athlete preparing to throw his javelin. This lekythos was painted with the red figure technique. In antiquity, javelin throwing was intimately bound up with the Greek way of life. Before it became a feature of sporting life, the javelin was one of the weapons used by ancient Greeks in war and hunting. A javelin is a sharp, wooden spear about the height of a tall man. This exhibit is a lekythos. The lekythos was originally from Attica. Like this previous lekythos, it was originally from Attica.

Lekythos (without comparison and aggregation)


This exhibit is a lekythos. It was created during the classical period. It dates from between 475 and 470 B.C. It depicts an athlete preparing to perform a jump. This lekythos was painted with the red figure technique. The origin of the long jump lies in the challenge presented by traversing the cliffs, ravines and rough terrain of the Greek countryside, and, accordingly, by the challenge of waging war on this terrain. It was a complicated sport in which the athletes used special weights, the halteres, to

16 Numismatics is the science whose field is the history and study of (ancient) coins and medals.
17 Angiology is the science whose field is the history and study of ancient vessels.
18 The full texts for the coins and the vessels in both versions are in Appendix I (English and Greek texts).



increase their momentum and the distance of the jump. On this lekythos, the athlete holds the weights in his hands and is about to jump away from the springboard. In order to win, he needs not only great speed and strong legs but also precise coordination between his hands and feet as they make contact with the springboard. This is why the long jump was occasionally accompanied by music, which helped the jumper pace his rhythm. This lekythos was originally from Attica.
Table 4.1.: Part of the lekythos text generated by M-PIRO with and without comparison and aggregation

Moreover, we considered that the two text sets had to belong to different main entities; if both text sets contained, for example, different exhibits that were all vessels, the design would be biased, since the users would no longer be naive when they read the second text set. In addition, it was interesting to see how the subjects would perform on easy and difficult texts, specifically the easy text set of coins and the difficult text set of vessels. Hence, half of the participants read the coins text sequence with comparison and aggregation and the vessels text sequence without them, and the other half read the coins text sequence without comparison and aggregation and the vessels text sequence with them. Finally, two instruments were devised for use in the evaluation. They consisted of a recall test of factual knowledge about the coins and vessels in the exhibition and a usability questionnaire. The test was administered to subjects printed on paper. The factual recall test, which was introduced to the subjects with the title 'What did you learn from the virtual exhibition', was a multiple-choice test. Almost half of the questions were about combined and contrasted facts between the exhibits, with varying difficulty. Four examples (two from each question set) are shown below:
8. During which period were the two cylixes created? (archaic, classical, Hellenistic, roman)
12. Which color is the background of a red-painting technique decorated exhibit? (red, black, white, clay, blue)
3. The tetradrachms are made of ____. (gold, silver, bronze, marble, there is no info in the texts, different material for each one)
12. Whose picture is on Hellenistic stater? (King Perseus, Alexander the Great, Athena, Apollo)
Table 4.2.: Some questions from the factual recall test (all the questions are in Appendix II). The correct answers are in bold type.


4.2.2. Subjects
The subjects who took part in the pilot experiment were 8 native speakers of Greek and English aged 23 to 31, four for each language version of M-PIRO. Four of them were male and the other four were female. At the time of the experiment they were all MSc or PhD students at the University of Edinburgh; the Greek subjects had spent 1-5 years studying in Britain, which did not affect their understanding of Greek scientific text. Although all the Greek participants had taken some elementary courses in Ancient Greek history and archeology at Gymnasium and Lyceum, they were naive, as they had no previous experience either with the subject of vessels and coins or with natural language system text output. Similarly, the English participants were naive, as they had no previous experience with archeological texts about vessels and coins and/or with natural language system text output. Additionally, only one of the English participants had taken a course in the Greek language and was therefore familiar with the Greek words. Furthermore, none of the subjects had any history of reading problems (like dyslexia) or comprehension problems. Finally, all the participants were naive as far as the goal of the experiment was concerned.

4.2.3. Procedure
The experiment took place in the Computer Micro Lab room of the Department of Theoretical and Applied Linguistics of the University of Edinburgh and in my flat. The subjects were usually alone in very quiet conditions so that no one could disturb them while they were reading the texts. The M-PIRO text output was printed on A4 pages. Before the experiment started, the subjects were given a short introduction about its nature. They were told that they were going to read two different texts, in particular two different text sets of six museum exhibits each, which had been generated by an NLG system. Moreover, they were informed about what kind of information was in the texts; nevertheless, nothing was explained to them about the text structure or the text difficulty. Additionally, they were told that they should try to learn and remember the descriptions and references related to the exhibits, since they were going to answer a set of 13 multiple-choice questions for each text set without having any text in front of them. They


had fifteen minutes to read each exhibit text sequence and they could ask anything before starting to read. When it was specified that their task would be to answer some questions, all the participants wanted to know what kind of questions they were going to answer and whether they had to learn the texts by heart. In response, they were given some examples and told that they should read the texts like any other text or document. When the time ran out or they felt that they could answer the questions, they were given further instructions. They were told that they had to choose only one answer unless it was written in the question that there were two correct answers. In particular, they were asked not to answer any question about which they had no clue or remembered nothing, since it was possible to choose the right answer by chance and that would not be good for the statistical analysis. It was pointed out that it was natural not to remember everything and that they should not feel uncomfortable because of their unanswered questions. The subjects were encouraged to ask any questions before the beginning of the experiment or make any comments after the end of the experiment. First, they read the first text set and answered the corresponding questions; afterwards, they read the second text set and answered the questions again. I chose randomly which text they would read first; therefore, some participants first read the text set with comparison and aggregation and the others the text set without comparison and aggregation; likewise, some participants first read the coins texts and the others the vessels texts. This procedure covered all possible combinations of text order, since it was necessary to examine the ordering effect and its possible flaw in the experimental design19.
Testing the order of the two text groups in the pilot experiment would give some important information for the design of the main experiment. Finally, the participants were asked to fill in a questionnaire for both text sets; for this session, they had the opportunity to check the texts again and answer the questions without time pressure. At the end, they were interviewed: they were informed about the purpose of the experiment and discussed their own critical comments and ideas for the experiment.
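The orderings covered by the pilot procedure can be listed with a small sketch. It assumes, as described above, that each participant reads one text set per genre and one per text type, so fixing what comes first determines what comes second; the function and variable names are illustrative.

```python
from itertools import product

TEXT_TYPES = ("with comparison and aggregation",
              "without comparison and aggregation")
GENRES = ("coins", "vessels")


def session_plan(first_text_type, first_genre):
    """Return the two reading sessions for one participant, in order."""
    # The second session uses the other text type and the other genre.
    second_text_type = TEXT_TYPES[1 - TEXT_TYPES.index(first_text_type)]
    second_genre = GENRES[1 - GENRES.index(first_genre)]
    return ((first_genre, first_text_type), (second_genre, second_text_type))


# Every combination of "which text type first" and "which genre first".
all_plans = [session_plan(t, g) for t, g in product(TEXT_TYPES, GENRES)]

for plan in all_plans:
    print(plan)
```

With two binary choices there are four distinct plans, which is what "all possible combinations of text order" amounts to in this design.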

19 More details and discussion of this problem appear in a later section of this dissertation; see 6.1.2, Ordering effects: a possible flaw in experimental design.


4.3. Results and Discussion


The 8 double-session data results, one for each participant, were marked by the author, saved in Word (Microsoft Office 2000) and then exported to Excel and SPSS 11.5 for further analysis, processing and creation of tables and graphs.

                  Coins    Vessels    Difference
Group A (Eng)     12       11,5       -1.5
Std. Deviation    2,82     2,12       +0,70
Group B (Eng)     11       6          +5
Std. Deviation    1,41     0          +1,41
Group A (Gr)      9        11         +2
Std. Deviation    2,82     1,41       +1,41
Group B           7        5,5        +1,5
Std. Deviation    2,82     2,12       +0,70

Table 4.3.: The results of the participants in both versions of the pilot experiment
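The standard deviations in Table 4.3 can be read more easily with a small worked check. With eight participants spread over two groups and two languages, each cell rests on two scores (an inference from the subject numbers in 4.2.2, not something the table states). Assuming the sample standard deviation (n - 1) was used, two scores a and b give s = |a - b| / sqrt(2). The score pairs below are hypothetical and only illustrate the arithmetic.

```python
import math
import statistics


def sample_sd_of_pair(a, b):
    """Sample standard deviation (n - 1 denominator) of two scores."""
    return statistics.stdev([a, b])


# Hypothetical score pairs that differ by 0, 2, 3 and 4 points.
for gap in (0, 2, 3, 4):
    sd = sample_sd_of_pair(10, 10 + gap)
    # For two values the sample SD reduces to |a - b| / sqrt(2).
    assert math.isclose(sd, gap / math.sqrt(2))
    print(gap, round(sd, 2))
```

Under these assumptions, gaps of 2, 3 and 4 points give 1.41, 2.12 and 2.83, close to the 1,41, 2,12 and 2,82 reported in the table (the last value appears to be truncated rather than rounded), and a standard deviation of 0 means the two scores in that cell were identical.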

There were 13 multiple-choice questions and the highest possible score was 15. The results of the experiment hinted not only that the participants scored better on the questions for the text with comparison and aggregation, but also that they preferred the text with these options as more fluent and natural. However, there was one exception among the participants: one participant scored much better on the text without comparison and aggregation, and she claimed in the questionnaire that 'this funny comparison thing' between the vessels texts made her tired and that she did not like the text, as the repetitions did not help her remember details. Nevertheless, this case was a rare exception. Furthermore, despite the grouping effect in the Greek participants, it seems that comparison and aggregation had a greater effect in the vessels texts, since the differences between the groups were 6 and 5,5. This fact supports the hypothesis that comparison and aggregation help the user to learn more and remember some details better, as users do not usually have many difficulties with easy texts. Furthermore, the standard deviations did not allow us to make clearer comments: it was expected that the standard deviations of the texts with comparison and aggregation would be smaller than those of the texts without them; unfortunately, this was not supported by the data of the pilot.


[Graph: Pilot Experiment Performance. Score by subject (1-8); series: text with comparison and aggregation, text without comparison and aggregation.]

Graph 4.1.: The performance of the participants based on the option of comparison and aggregation (the first four are the Greek participants and the other four the English)
[Graph: Pilot Experiment Performance. Score by subject (1-8); series: coins with comp.-aggr., vessels without comp.-aggr., coins without comp.-aggr., vessels with comp.-aggr.]

Graph 4.2.: The performance of the participants based on the group factor.

Based on the results reported above, and despite the fact that they were not statistically significant (F(1, 6) = 2.748, sig. = .148) for the text type factors (comparison and aggregation), I decided to work with both text sets per participant in the main experiment. The main effect for groups (p < .05, sig. = .031) was the main cause of this statistical insignificance and, therefore, it was expected that it would not appear in the main experiment, especially since the within-subjects performance supported one of my hypotheses. As illustrated in the graphs, there was clearly an effect on participants' performance on the questions depending on the text type factors. Moreover, the participants characterized the coins texts as easy and the vessels texts as


difficult/very difficult. Finally, the text set with comparison and aggregation was almost always preferred (with the one exception). The results of the pilot did not allow for many useful assumptions on what to expect in the main experiment, as there did not seem to be consistency among the groups. For instance, when looking at the scores of each group in both versions, it is a big surprise that the English participants performed better than the Greeks. Although a comparison between these two versions is unfair because of the subjects' different background knowledge and how it interfered in the pilot, the results of the pilot raised some questions which may be answered in the main experiment. In addition, the comments of the participants gave helpful guidelines for the main experiment. They found the vessels text questions too difficult and suggested adding some dummy answers to the multiple-choice options; they also thought that more questions would not be a problem. Furthermore, it was suggested that each exhibit's text be split over two or three pages. Additionally, they considered that the time should be about twenty minutes, because less would not be enough and more would become boring. Finally, the participants who first read the vessels text (with or without comparison and aggregation) complained that the difficulty of the text exhausted them, since it was tough and almost all the information was completely new to them, and that it was therefore harder to read the second text set with the coins; they would have preferred to read the coins text set first. This ordering effect may be a possible flaw for the main experiment and is discussed in detail in section 6.1.2. However, it was a major problem which had to be solved somehow.

                                            Coins text     Vessels text
How interesting have you found the text?    Interesting    Neutral
How difficult are the questions?            Easy           Difficult
Did you enjoy the texts?                    Yes            Yes
Which text is more fluent and natural?      7 subjects chose the text with comparison and aggregation; 1 subject chose the text without comparison
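The significance value reported for the text type factors in the pilot can be sanity-checked with a short calculation. The sketch assumes that the reported statistic (2.748 on 1 and 6 degrees of freedom) is an F ratio, and it relies on the fact that an F(1, n) statistic is the square of a t statistic with n degrees of freedom, whose distribution has a closed form when n is even. The function name is illustrative and the code does not reproduce the SPSS analysis itself.

```python
import math


def f_upper_tail_p_df1_one(f_value, df2):
    """Upper-tail p-value of an F(1, df2) statistic, for even df2 only."""
    if df2 < 2 or df2 % 2 != 0:
        raise ValueError("this closed form needs an even df2 >= 2")
    # With t^2 = F and theta = atan(t / sqrt(df2)):
    ratio = f_value / df2
    sin_theta = math.sqrt(ratio / (1.0 + ratio))
    cos_sq = 1.0 / (1.0 + ratio)
    # P(|T| < t) = sin(theta) * (1 + 1/2 cos^2 + (1*3)/(2*4) cos^4 + ...),
    # with the series running up to the power cos^(df2 - 2).
    term, series = 1.0, 1.0
    for k in range(1, df2 // 2):
        term *= (2 * k - 1) / (2 * k) * cos_sq
        series += term
    return 1.0 - sin_theta * series


p_value = f_upper_tail_p_df1_one(2.748, 6)
print(round(p_value, 3))
```

Under the stated assumption the result agrees with the reported significance of .148 to three decimals, which suggests that the pair "1, 6" in the text is the pair of degrees of freedom of the test.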

Table 4.6.: The questionnaire results of the pilot experiment

To sum up, the results of the pilot study made it fairly clear that a text with comparison and aggregation helps subjects remember, learn and


perform better than a text without these factors. This outcome was considered encouraging for the main experiment. It would be interesting to find out whether there would be a difference in performance within and between subjects and, if so, how big it would be depending on the difficulty and stiffness of the text itself. The data from the pilot experiment were not enough to support any firm assumptions; nevertheless, they supported some of our hypotheses and gave us ideas for new ones, such as whether comparison and aggregation make the text easier, how the feeling of learning depends on the text type, and what leads participants to choose a text as more fluent and natural.


Chapter 5

The Main Experiment


5.1. Introduction
The main experiment set out to evaluate two language versions (English and Greek) of the M-PIRO NLG system by testing the text structure factors of comparison and aggregation; it would support or refute our four hypotheses. As the pilot experiment results revealed, the participants not only scored better in the texts with comparison and aggregation, but also showed a bigger score difference in the difficult text sets. What was now anticipated was a difference in performance depending on the text type and not on any other factor such as the group, the genre or the language. This would hopefully show that the participants' scores were related to the text factors (comparison and aggregation) and that there would be no statistical significance for the group, genre and language factors. Moreover, it was anticipated that the participants would characterize the text with the text structure factors as natural and fluent. Since the results of the pilot revealed bigger differences in difficult texts, our hypothesis that the text factors are more useful and helpful in demanding texts could be supported. However, the results of the pilot revealed a statistically significant main effect for group, which was not expected to occur in the main experiment. Despite the limitations of the pilot experiment, the pilot did leave room for speculation on what to expect from the main experiment. Finally, we hoped that the results would help us make a useful evaluation of the system, which would be a springboard for future improvement of the M-PIRO system.

5.2. Method
5.2.1. Designing and choosing the exhibit texts
The main experiment demanded a slight change of the vessels sequence, since the participants asked for vessels exhibits with more common characteristics. The coins text sequence remained the same as in the pilot experiment. However, the main experiment


was run in different conditions, since the participants did not read the texts from printed pages, but from on-line web pages. Due to some malfunctions of the on-line Greek version of the M-PIRO system, it was decided that all the web pages (English and Greek) would be built manually. For this purpose a simple HTML editor was used, based on the web page skeleton of the organization (IME) for the M-PIRO presentation. The experiment demanded some slightly more sophisticated changes to this skeleton (font type, size, text properties). In the middle of the page was the picture of the exhibit and the corresponding text generated by the Authoring Tool. At the bottom of this text there were links for more information about the same exhibit. These links made the pages more similar to the real M-PIRO pages; in addition, the participants of the pilot had complained about the length of the texts and had suggested splitting the texts into two or three pages. Furthermore, there were also links to the previous and next exhibits that the participant would visit (Picture 5.1). Finally, the text properties20 were: font type Times New Roman, style normal, size 12 and paragraph leading 1,5 (, 2000).

Picture 5.1.: A web page from the vessels sequence that contains the Spherical Corinthian Aryballos

20 According to (2000), research showed that users lost 20% of the text read on a web page compared with users who read the text on printed paper; therefore, it was suggested that the properties of printed pages be simulated in the design of the web pages, since that would reduce the loss of text information. For extended discussion, see (2000).


The core experimental procedure of the main experiment was again the two text variables that help us evaluate the system, aggregation and comparison. As the pilot results showed, the right combinations had been chosen; therefore, the text factors would be either present together in the text or absent together. Aggregation was set to four facts per sentence. Additionally, the user model type remained the adult type, since no subjects were selected who were below the age of eighteen, who had studied or were studying archeology, or who were experts in the field of numismatics or angiology. For reasons already presented in the pilot experiment section (4.2.1), two completely different text sets were chosen, each containing six exhibits, with all the exhibits in a set belonging to the same main entity type. Although exhibits of very different types had been avoided, the vessels sequence selected for the pilot contained exhibits with only a few common characteristics; therefore, it was necessary to substitute some exhibits with others. The first text set had only coin exhibits, which were the same as in the pilot. The second text set had only vessel exhibits, which were a kylix, a classical kylix painted by Eucharides, a stamnos painted by Dinos, a spherical Corinthian aryballos, a lekythos painted by Amasis and a lekythos. This vessels text set had more common characteristics than the pilot set; it was a bit easier and contained more complicated comparisons, not only with the previous vessel but also with all the previous vessels or with a vessel visited two or three exhibits before, as shown in Table 5.2.

Spherical Corinthian aryballos
This exhibit is an aryballos. Like the kylixes, this aryballos was created during the archaic period. It dates from the late 7th century B.C. and it belongs to the corinthian type. It is spherical in shape. On the body a wide zone is distinguished with pairs of comastes among supplementary patterns, mainly rosettes (jewels representing roses with open radiate leaves). "Comastes" were the participants in "comous", feasts in honour of Dionysus. Unlike the previous vessels, which were painted with the red figure technique, this aryballos was decorated with the black figure technique. The black figure technique is the opposite of the red figure technique. This aryballos was found in the Temple of Hera, on Delos, an island. Today this aryballos is located in the Archaeological Museum of Delos.

Lekythos
This exhibit is another lekythos. Like the stamnos21, it was created during the

21 The stamnos exhibit was visited third in the user's visiting order and this lekythos sixth.



classical period. It dates from between 470 and 460 B.C. It shows an athlete preparing to throw his javelin. This lekythos was painted with the red figure technique. In antiquity, javelin throwing was intimately bound up with the Greek way of life. Before it became a feature of sporting life, the javelin was one of the weapons used by ancient Greeks in war and hunting. A javelin is a sharp, wooden spear about the height of a tall man. This exhibit is a lekythos. Like the previous lekythos, it was originally from Attica. It is now exhibited in the National Archaeological Museum of Athens.
Table 5.2.: Two examples of the vessels texts with more complicated comparisons

Finally, as in the pilot experiment, two instruments were devised for use in the evaluation. They consisted of a recall test of factual knowledge about the coins and vessels in the exhibition, and a usability questionnaire. The test was administered to subjects printed on paper or as a Word document. The factual recall test, which was introduced to the subjects with the title 'What did you learn from the virtual exhibition', was a multiple-choice test. Some questions were changed because they were confusing or difficult or had not been answered by the pilot experiment subjects; additionally, two more questions were added to each set, so that each became a set of 15 multiple-choice questions. Almost half of the questions were about combined and contrasted facts between the exhibits, with varying difficulty. Finally, the questionnaire had seven questions about both texts. Both the question sets and the questionnaire are in Appendix II.

5.2.2. Subjects
The subjects who took part in the Greek version of the main experiment were twenty-four22 native speakers of Greek. The Greek subjects, aged 20 to 54, were twelve male and twelve female; they had spent time living abroad and practising a foreign language in its natural environment, but this experience did not affect their ability to understand a Greek scientific text perfectly. They were undergraduate, postgraduate or PhD students, or they were working and held a university diploma. Although all the Greek participants had taken some elementary courses in Ancient Greek History, they were naive, since they had no previous experience either with texts on the subject of vessels and coins or with natural language generation text output; however, most of them were

22 The data of twenty subjects for each version were to be examined. We ran the experiment with more than twenty because there were some problems, such as very low scores (performance), misunderstanding of the instructions and tired subjects.


familiar with the vocabulary and had visited exhibits of this kind in an archeological museum. The subjects who took part in the English version of the main experiment were twenty-two native speakers of English and foreigners who had lived in England for many years. The English subjects, aged 23 to 35, were twelve male and ten female. Almost none of the English participants had taken any elementary course in Ancient Greek History and Archeology; however, some of them had some experience with natural language generation text output and had participated in elementary courses in the Greek language. Furthermore, none of the subjects had any history of reading problems (like dyslexia) or comprehension problems. Finally, all the participants were naive as to the goal of the experiment.

5.2.3. Procedure
The main experiment took place in the Computer Micro Lab room of the Department of Theoretical and Applied Linguistics of the University of Edinburgh and in some subjects' residences, since they could not come to the Micro Lab or there was too much noise there for the participants. For this purpose the versions were published on my personal website, so that it was possible to run the experiment on the subjects' laptops and desktops. The subjects were usually alone in very quiet conditions so that no one could disturb them while they were reading the texts. The procedure was similar to that of the pilot experiment. Before the experiment started, the subjects were given a short introduction about its nature. They were told that they were going to read two different texts, in particular two text sets of six museum exhibits each, which had been generated by an NLG system. Additionally, they were informed about what kind of information was in the web pages; nonetheless, nothing was explained to them about the text structure or the text difficulty. Furthermore, they were told that they should try to learn and remember the descriptions and references related to the exhibits, since they were going to answer a set of 15 multiple-choice questions for each set of texts without having any text in front of them. Moreover, they were shown how to navigate within the subtexts and to the other exhibits' texts, since it was necessary to understand the navigation options. They had


twenty to twenty-five minutes to read each exhibit text sequence and they could ask anything before starting to read. Again, as in the pilot experiment, it was made clear that they did not need to learn the information by heart, and they were given some example questions. When the time ran out or they felt that they could answer the questions, they were given further instructions. They were told that they had to choose only one answer unless it was written in the question that there were two possible answers. In particular, they were asked again not to answer any question about which they remembered nothing, since it was possible to choose the right answer by chance and that would affect the result of the statistical analysis. Again, it was made clear that it was preferable not to answer questions to which they did not know the answer. Finally, the subjects were encouraged to ask any questions before the beginning of the experiment or of the second session. First of all, the group to which each subject would belong was chosen randomly. The group A subjects read the coins texts without comparison and aggregation and the vessels texts with these factors; the group B subjects, on the other hand, read the coins texts with comparison and aggregation and the vessels texts without them. They first read the coins text set and answered the corresponding questions; afterwards, they read the vessels text set and answered the questions again. Following the pilot experiment results, it was decided that the texts would not be given in random order, since the subjects had claimed that the difficult text (vessels) made them tired; however, this fixed order may be a possible flaw, which is discussed later. Finally, the participants were asked to fill in a 7-question questionnaire for both text sets; for this session, they had the opportunity to check the texts again and answer the questions without time pressure.
At the end, they were interviewed, informed about the purpose of the experiment, shown how their texts differed from the other group's texts, and asked to discuss their own critical comments on the texts.
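The group assignment described above can be sketched as follows. The procedure states only that the group was chosen randomly; the equal split into two groups and the use of a seeded shuffle are assumptions made here for illustration, as are all names in the code.

```python
import random

# Session order is fixed (coins first, then vessels); only the text type differs.
GROUP_PLANS = {
    "A": (("coins", "without comparison and aggregation"),
          ("vessels", "with comparison and aggregation")),
    "B": (("coins", "with comparison and aggregation"),
          ("vessels", "without comparison and aggregation")),
}


def assign_groups(participant_ids, seed=None):
    """Randomly split participants into two equally sized groups (assumed)."""
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    # The first half of the shuffled list goes to group A, the rest to group B.
    return {pid: ("A" if i < half else "B") for i, pid in enumerate(shuffled)}


assignment = assign_groups(range(1, 21), seed=2003)
for pid in sorted(assignment):
    print(pid, assignment[pid], GROUP_PLANS[assignment[pid]])
```

Shuffling and then splitting keeps the two groups the same size, which a separate coin flip for each participant would not guarantee.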

5.3. Results
The 20 double-session data results, one for each participant, were marked by the author, saved in Word (MS Office 2000) and then exported to SPSS 11.5 for further analysis, process and creation of tables. Depending on the purpose of the analysis of the


merged data, variables were each time arranged in columns and filtered. The resulting file was then exported to Excel for further processing and creation of graphs.
                                               Comp. aggr.                     No comp. aggr.
                                               Coins           Vessels         Coins           Vessels
ENGLISH  Group A  mean (mode)                  -               13,7 (14,5)     11,4            -
                  std. deviation (skewness)    -               1,948 (-.942)   3,238 (.458)    -
                  min. / max.                  -               10 / 16         7 / 16          -
         Group B  mean (mode)                  12,4 (13)       -               -               9,4 (10)
                  std. deviation (skewness)    2,792 (-.248)   -               -               2,796 (-.782)
                  min. / max.                  8 / 16          -               -               4 / 13
GREEK    Group A  mean (mode)                  -               12,2 (11)       11,1 (11)       -
                  std. deviation (skewness)    -               2,048 (.055)    1,728 (-.513)   -
                  min. / max.                  -               9 / 15          8 / 13          -
         Group B  mean (mode)                  12,9 (12)       -               -               10 (10)
                  std. deviation (skewness)    2,558 (-.708)   -               -               2,160 (-1.24)
                  min. / max.                  8 / 16          -               -               5 / 13

Table 5.3.: The results of the participants in both languages of the main experiment

There were 15 multiple-choice questions and the highest possible score was 17. Table 5.3 displays the descriptive statistics of the scores on the two versions of M-PIRO for both groups. The results in the Greek version showed generally better performance in the texts with comparison and aggregation. This tendency was observed in both groups, even though the difference in within-subjects means was smaller in group A, where the text factors were in the vessels texts (Group A: 12,2 - 11,1 = 1.1; Group B: 12,9 - 10 = 2.9). Furthermore, it was shown that text difficulty interacted with the text type factors, since the between-groups difference was greater in the demanding vessels texts (vessels: 12,2 - 10 = 2.2; coins: 12,9 - 11,1 = 1.8); besides, the score means for coins texts with or without comparison and aggregation were higher than those for vessels texts. Strangely, the standard deviations for both groups were higher in the texts with comparison and aggregation, which showed an unexpectedly greater spread of scores in these texts.
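The differences quoted for the Greek version can be reproduced from the four means given in the text. The sketch below only restates that arithmetic; the means are those reported above (with decimal commas written as points), while the dictionary layout and function names are illustrative.

```python
# Greek-version means as reported in the text, keyed by (genre, text type).
greek_means = {
    ("vessels", "with"): 12.2,     # group A
    ("coins", "without"): 11.1,    # group A
    ("coins", "with"): 12.9,       # group B
    ("vessels", "without"): 10.0,  # group B
}

# Which two cells each group contributed to: (with factors, without factors).
GROUP_CELLS = {
    "A": (("vessels", "with"), ("coins", "without")),
    "B": (("coins", "with"), ("vessels", "without")),
}


def within_subject_gain(group):
    """Mean with the text factors minus mean without, inside one group."""
    with_cell, without_cell = GROUP_CELLS[group]
    return round(greek_means[with_cell] - greek_means[without_cell], 1)


def between_group_gap(genre):
    """Mean with the text factors minus mean without, for one genre."""
    return round(greek_means[(genre, "with")] - greek_means[(genre, "without")], 1)


print(within_subject_gain("A"), within_subject_gain("B"))
print(between_group_gap("vessels"), between_group_gap("coins"))
```

The within-subjects figures compare a group with itself across genres, whereas the between-groups figures hold the genre fixed and compare the two groups.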

Similarly to the performance of the Greek participants, the results for the English version revealed generally better performance on the texts with comparison and aggregation. This tendency occurred in both groups and, as in the Greek results, the within-subjects difference in means was smaller for group A (Group A: 13.7 - 11.4 = 2.3; Group B: 12.9 - 10 = 2.9). Additionally, the comparison and aggregation factors clearly affected performance on the difficult texts, where the difference between subjects was almost 4 score marks (vessels: 13.7 - 10 = 3.7; coins: 12.9 - 11.4 = 1.5); surprisingly, the score mean for the vessels text with comparison and aggregation was higher than that for the coins text. Finally, unlike the Greek participants, the English subjects showed a narrower spread of scores (standard deviation) on the texts with comparison and aggregation, as was expected.

In summary, one could say that participants in both languages and groups are helped by the two text type factors, as they scored better on the texts with these two factors. The findings indicate that comparison and aggregation, together with text difficulty, have a positive effect on subject performance. It appears that the essential changes in text structure (interaction within and between texts) and the difficulty of the subject area of the texts (numismatics vs. ancient pottery) result in substantial differences in performance on all four texts of the experiment. Nevertheless, the standard deviations suggest that the language factor may be responsible for the unexpected difference in the Greek version; it appears that the performance of the Greek participants varies less on a text that is not fluent and natural.
[Chart: Main experiment performance (factors). Score (y-axis) for each of the 20 subjects (x-axis); series: text with comp.-aggr., text without comp.-aggr.]

Graph 5.4.: The score performance per person depending on text type factors (Greek Version)

[Chart: Main experiment results per person (English version). Score (y-axis) for each of the 20 subjects (x-axis); series: text with comp.-aggr., text without comp.-aggr.]

Graph 5.5.: The score performance per person depending on text type factors (English Version)

Descriptive statistics supported our first expectations for the main experiment results. For a deeper and more adequate data analysis, it was essential to run statistical tests to verify the significance of the variables used in the experiment. Statistical significance was tested using ANOVA (see Appendix III). I ran a two-way repeated measures ANOVA for the Greek results. The test revealed that the text type factor (comparison and aggregation) was statistically significant for the within-subjects contrasts (F(1, 18) = 48.322, p < .001, Sig. = .000); moreover, unsurprisingly in this experiment, the group factor was not statistically significant (F(1, 18) = 0.048, Sig. = .829). Furthermore, the interaction between the two factors (text type*group) was statistically significant (F(1, 18) = 9.785, p < .01, Sig. = .006), which means that the members of the two groups responded differently to the two texts. Similarly, the English data showed the same pattern: the main effect of the text type factor was statistically significant (F(1, 16) = 23.332, p < .005, Sig. = .000) and the group factor was not statistically significant for the between-subjects contrasts (F(1, 16) = 6.531, Sig. = .152). Finally, the interaction between text type and group was statistically significant (F(1, 16) = 2.263, p < .05, Sig. = .032). Overall, the summary results were statistically significant for the text type factor (F(1, 38) = 81.231, p < .001, Sig. = .000) and for the interaction text type*group (F(1, 38) = 5.870, p < .05, Sig. = .020), but not for the group factor (F(1, 38) = 1.63, Sig. = .209). These results reject the null hypothesis and support our hypothesis H1 that the performance of the participants depends on the variables of comparison and aggregation in the text structure. Regardless of group, the participants performed equally well, and it therefore seems that the dynamically interactive version of the M-PIRO system does help the participants to learn more and remember better. However, although the group factor does not appear to affect performance, it was essential to test the possible influence of the gender and language factors.
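To make the structure of these tests more concrete, the sketch below shows how the two within-subjects F values of a 2 (group) x 2 (text type) design can be obtained from per-subject difference scores. This is not the SPSS procedure used for the reported results. It relies on the standard equivalence that holds when the repeated factor has only two levels, and the scores in it are hypothetical, invented solely for illustration. The function name is likewise an assumption.

```python
from statistics import mean, variance


def mixed_2x2_f_tests(group_a, group_b):
    """F tests for a 2 (group) x 2 (text type) mixed design.

    Each group is a list of (score_with, score_without) pairs, one pair
    per subject. Returns (F_text_type, F_interaction, (df1, df2)).
    """
    # Per-subject difference: text with comp.-aggr. minus text without.
    d_a = [with_ca - without_ca for with_ca, without_ca in group_a]
    d_b = [with_ca - without_ca for with_ca, without_ca in group_b]
    n_a, n_b = len(d_a), len(d_b)
    df_error = n_a + n_b - 2

    # Pooled within-group variance of the difference scores.
    pooled = ((n_a - 1) * variance(d_a) + (n_b - 1) * variance(d_b)) / df_error
    scale = pooled * (1 / n_a + 1 / n_b)

    # Text type main effect: is the unweighted mean difference zero?
    main = (mean(d_a) + mean(d_b)) / 2
    f_text_type = main ** 2 / (scale / 4)

    # Text type x group interaction: do the mean differences differ by group?
    f_interaction = (mean(d_a) - mean(d_b)) ** 2 / scale

    return f_text_type, f_interaction, (1, df_error)


# HYPOTHETICAL scores (with comp.-aggr., without comp.-aggr.); not study data.
example_group_a = [(12, 11), (14, 12), (13, 10)]
example_group_b = [(13, 10), (14, 10), (15, 10)]

f_tt, f_int, dfs = mixed_2x2_f_tests(example_group_a, example_group_b)
print(round(f_tt, 2), round(f_int, 2), dfs)  # 54.0 6.0 (1, 4)
```

With ten subjects per group, as in the Greek data, the error degrees of freedom returned by this sketch would be 18, which matches the (1, 18) reported above.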
[Chart: Main experiment results for groups (Greek version). Score (y-axis) for each of the 20 subjects (x-axis); series: Coins (no comp.-aggr.), Vessels (comp.-aggr.), Coins (comp.-aggr.), Vessels (no comp.-aggr.)]

Graph 5.6.: The results per participant depending on the group factor [Greek version]

[Chart: Main experiment results for groups (English version). Score (y-axis) for each of the 20 subjects (x-axis); series: Coins (no comp.-aggr.), Vessels (comp.-aggr.), Coins (comp.-aggr.), Vessels (no comp.-aggr.)]

Graph 5.7.: The results per participant depending on the group factor [English version]

In the course of the data analysis, it was necessary to confirm that performance was not affected by main effects of factors external to the text, such as group, gender (labelled "genre" in the graphs) and language. The results of the Greek participants revealed that neither the gender factor (F(1, 16) = 1.098, Sig. = .310) nor the interaction between the main effects of gender and group (F(1, 16) = 0.607, Sig. = .447) was statistically significant in the test of between-subjects effects; however, although the interaction text type*gender*group was not statistically significant (F(1, 16) = 0.650, Sig. = .432), the two-way interaction text type*gender had a significant effect on performance (F(1, 16) = 7.959, p < .02, Sig. = .012), which is attributed to the absence of the gender effect. The English results, on the other hand, were slightly different. There was no statistical significance in the interactions group*gender (F(1, 16) = 3.762, Sig. = .070), text type*gender (F(1, 16) = 1.945, Sig. = .182) and text type*group*gender (F(1, 16) = 0.823, Sig. = .378); however, the between-subjects gender factor was statistically significant (F(1, 16) = 4.673, p < .05, Sig. = .046). Finally, the language factor was not statistically significant (F(1, 32) = 0.106, Sig. = .747), and neither were any of the interactions of language with the other factors.

In total, it seems that factors other than text type do not affect the performance of the participants. Also, participants were assigned to their groups at random. However, the English results suggest that the female participants scored significantly better than the male participants. It is likely that this significance is more a coincidence than a regular main effect; I believe that with a larger number of subjects the main effect of gender would not occur. The fact that the Greek female participants scored better, but not significantly better, than the males supports this view. Nonetheless, interactions among more than two factors cannot be relied upon, since there are many complicating issues that do not allow us to draw firm conclusions. The main effect of the language factor cannot be explained briefly; although this factor was not significant, there are issues to be raised and it will therefore be discussed in the following chapter (6.1.).
[Chart: Subject performance by gender (titled "Genre factor"). Score (y-axis) against text type factor (x-axis: compaggr, nocomagg); series: male and female scores for Greek, English and both languages]

Graph 5.8.: The performance for all the participants depending on the gender factor [both versions]

[Box plots, two panels: M-PIRO text with comparison and aggregation; M-PIRO text without comparison and aggregation. Score (y-axis) by gender (x-axis, labelled "Genre": female, male); N = 9 and 10]

Graph 5.9.: Box plots for performance depending on gender and text type factors [English version]

[Box plots, two panels: M-PIRO text without comparison and aggregation; M-PIRO text with comparison and aggregation. Score (y-axis) by gender (x-axis, labelled "Genre": female, male); N = 11 and 9]

Graph 5.10.: Box plots for performance depending on gender and text type factors [Greek version]

Since the only differences in performance between the two groups were due to the comparison and aggregation factors (and the gender factor in the English version), it was essential to also test the hypothesis about the difficulty of the text, as the exploration of the descriptive statistics showed a considerable difference between the vessels and coins score means. Again, a two-way repeated measures ANOVA was used. The results of the Greek participants showed that text difficulty was highly significant for their performance (F(1, 18) = 12.527, p < .005, Sig. = .002); likewise, the interaction text type*text difficulty, which was statistically significant (F(1, 18) = 9.729, p < .01, Sig. = .006), supported the descriptive statistics for the marginal means. Unsurprisingly, in the English data analysis the text difficulty factor was found to be significant at the .001 level (F(1, 18) = 67.5, Sig. = .000). The large difference reported between the means for the vessels text sets might also depend on the significant interaction of the main effects of text type and text difficulty (F(1, 18) = 9.138, p < .01, Sig. = .007).

There is little room left for doubt about the reason for the variance in performance; the analysis of the results clearly supports the hypothesis that the difference would be bigger for hard texts. To sum up, the data analysis so far suggests that the text type (comparison and aggregation) and text difficulty factors are the major variables for a text structure. The first factor helps the user to learn more and to remember information from a text better, and the second makes the presence of comparison and aggregation necessary in difficult texts generated by an NLG system. In both language versions the data analysis is broadly the same and indicates that the main factors act similarly and independently of the language.
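The marginal means referred to in this analysis can be approximated from the cell means of Table 5.3. The sketch below does this for the Greek version only, averaging the two cell means for each topic and for each text type. It assumes equal group sizes (ten participants per group), so an unweighted average of cell means is appropriate; the tuple layout and the function name are illustrative assumptions.

```python
from statistics import mean

# Greek cell means from Table 5.3 as (topic, text type, mean score).
GREEK_CELLS = [
    ("vessels", "comp-aggr", 12.2),  # Group A
    ("coins", "none", 11.1),         # Group A
    ("coins", "comp-aggr", 12.9),    # Group B
    ("vessels", "none", 10.0),       # Group B
]


def marginal_mean(cells, position, level):
    """Unweighted mean of the cell means whose field `position` equals `level`.

    position 0 selects by topic, position 1 selects by text type.
    Assumes equal group sizes, so cell means can be averaged directly.
    """
    return mean(cell[2] for cell in cells if cell[position] == level)


coins_mean = marginal_mean(GREEK_CELLS, 0, "coins")
vessels_mean = marginal_mean(GREEK_CELLS, 0, "vessels")
with_mean = marginal_mean(GREEK_CELLS, 1, "comp-aggr")
without_mean = marginal_mean(GREEK_CELLS, 1, "none")

print("Topic:     coins", round(coins_mean, 2), "| vessels", round(vessels_mean, 2))
print("Text type: with", round(with_mean, 2), "| without", round(without_mean, 2))
```

On these figures the coins texts average higher than the vessels texts, and the texts with comparison and aggregation average higher than those without, which is the pattern the ANOVA tests above examine.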
[Box plots, two panels: Vessels texts performance; Coins texts performance. Score (y-axis) by group of texts (x-axis); N = 10 per group]

Graph 5.11.: The performance of Greek participants depending on text difficulty (coins vs. vessels)

[Box plots, two panels: M-PIRO coins text; M-PIRO vessels text. Score (y-axis) by group of texts (x-axis); N = 10 per group]

Graph 5.12.: The performance of English participants depending on text difficulty (coins vs. vessels)

The questionnaire helped not only to continue with other aspects of the M-PIRO evaluation, but also to support two more hypotheses about the text generation outputs. The questionnaire did not show large differences between groups A and B. However, it revealed that the text with comparison and aggregation was always found more interesting than the other text. Moreover, the participants had the feeling that they learnt more from the text with these factors. Finally, these texts were characterized as more natural and fluent than the others.
ENGLISH VERSION                                                         Coins text    Vessels text

GROUP A: Vessels (comp. aggr.)
1. Did you find the texts interesting?                                  neutral       interesting
2. Are the questions difficult?                                         neutral       neutral
3. Have you enjoyed the texts?                                          yes           yes
4. Which text was more natural/fluent?                                  3/10          7/10
5. I thought there was too much inconsistency in the exhibits text      agree         agree
6. The information for the exhibits was                                 normal        more than normal
7. How much did you learn from the texts?                               enough        a lot

GROUP B: Coins (comp. aggr.)
1. Did you find the texts interesting?                                  neutral       almost not interesting
2. Are the questions and the text difficult?                            neutral       difficult
3. Have you enjoyed the texts?                                          neutral       neutral
4. Which text was more natural/fluent?                                  9/10          1/10
5. I thought there was too much inconsistency in the exhibits text      neutral       neutral
6. The information for the exhibits was                                 normal        more than normal
7. How much did you learn from the texts?                               enough        a little

Table 5.13.: The questionnaire results summary of the English participants for both groups

GREEK VERSION                                                           Coins text    Vessels text

GROUP A: Vessels (comp. aggr.)
1. Did you find the texts interesting?                                  neutral       interesting
2. Are the questions and the text difficult?                            neutral       difficult
3. Have you enjoyed the texts?                                          neutral       neutral
4. Which text was more natural/fluent?                                  0/10          10/10
5. I thought there was too much inconsistency in the exhibits text      neutral       disagree
6. The information for the exhibits was                                 normal        more than normal
7. How much did you learn from the texts?                               enough        many

GROUP B: Coins (comp. aggr.)
1. Did you find the texts interesting?                                  interesting   neutral
2. Are the questions and the text difficult?                            neutral       very difficult
3. Have you enjoyed the texts?                                          yes           neutral
4. Which text was more natural/fluent?                                  9/10          1/10
5. I thought there was too much inconsistency in the exhibits text      agree         neutral
6. The information for the exhibits was                                 normal        more than normal
7. How much did you learn from the texts?                               much          enough

Table 5.14.: The questionnaire results summary of the Greek participants for both groups

As with the descriptive statistics analysis of the factual recall test, it was essential to run statistical tests to test my hypotheses. When someone reads a text, it is important to feel that the text supports the effort to learn from it. Therefore, we ran an ANOVA on the seventh question to test this. The learning feeling was found to be significant at the .001 level (F(1, 18) = 42.552, Sig. = .000) for the English participants. Additionally, the interaction text type*feeling also supported our hypothesis, since it was highly significant (F(1, 18) = 26.552, p < .001, Sig. = .000). The Greek results did not differ from the English; both the learning feeling (F(1, 18) = 48.6, p < .001, Sig. = .000) and the interaction (F(1, 18) = 21.423, p < .001, Sig. = .000) proved to be statistically significant.

Finally, the answers to the question about the naturalness and fluency of the texts reflect the subjects' tendency to confuse fluency with interest. For the English participants the results showed that the fluency factor was not significant for the between-subjects effects, although the interaction fluency*text type was significant at the .02 level (F(1, 18) = 7.195, Sig. = .016). Similarly, the results of the Greek participants did not reveal any significance for the fluency factor. However, 95% of the Greek participants and 80% of the English participants preferred the text with comparison and aggregation as the more natural and fluent one. The ANOVA analysis seems inconsistent with our hypothesis that comparison and aggregation will lead readers to characterize the text as fluent and natural. Nonetheless, as has already been mentioned, half of the participants seemed confused by these questions and their answers were vague and unclear; the lack of specific criteria for what is fluent and natural, and the contrast between two thematically different texts, can explain this result.
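The 95% and 80% preference figures can be traced back to the answers to question 4 in Tables 5.13 and 5.14. A minimal sketch of that calculation follows. The counts are the numbers of participants in each group who chose the text with comparison and aggregation, as read from those tables; the data structure and function name are assumptions made for the example.

```python
# Answers to question 4 ("Which text was more natural/fluent?"):
# (participants choosing the text with comp.-aggr., group size),
# read from Tables 5.13 (English) and 5.14 (Greek).
CHOSE_COMP_AGGR = {
    "English": {"A": (7, 10), "B": (9, 10)},
    "Greek": {"A": (10, 10), "B": (9, 10)},
}


def preference_percent(groups):
    """Percentage of all participants who chose the comp.-aggr. text."""
    chosen = sum(count for count, _ in groups.values())
    total = sum(size for _, size in groups.values())
    return 100 * chosen / total


for language, groups in CHOSE_COMP_AGGR.items():
    print(language, preference_percent(groups))
```

In both versions the preference is stronger in Group B, where the text with comparison and aggregation was the coins text for the English participants and the Greek participants alike.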
To sum up, the results of the main experiment point clearly in the direction that the text type factors are important for an NLG system and very effective for the current M-PIRO system. Even though one cannot exclude the possibility of an order confound due to the experimental design, it is likely that the difficulty of the texts makes an essential difference to the performance scores of the English and Greek participants. The test of between-subjects effects showed clearly that there was no significance for the group, gender and language factors. When the participants were asked about what they had learnt from the texts, they showed a significant preference for the texts with comparison and aggregation, as they thought that they learnt more from these texts. By contrast, the confusion and the subjective criteria for judging text quality did not allow us to draw useful conclusions about the fluency and naturalness of the texts.

Chapter 6

General Discussion
6.1. The results of both experiments
6.1.1. Interpreting the results
The basic aim that this study set out to explore, whether or not comparison and aggregation are helpful and useful for the M-PIRO user to perform better and learn more, appears to have been achieved. It also seems that the difficulty of thematically different texts does make a difference to the measurable performance of subjects, and in particular that it interacts with the two text factors to show how important comparison and aggregation become as the texts get harder. On this point the M-PIRO system follows the motto of helping the users to help themselves by speaking the users' language.

Considering that the data were collected separately in the two experiments, one could state that the text type factors of comparison and aggregation helped almost all the participants of the four groups to perform better on the texts that included them. Whatever the text order was, it seems clear that these factors, by making the texts look like those written by real curators, crucially support the users in their attempt to learn essential information from an electronic exhibition. This becomes clearer when we consider the varying difficulty of the texts, where the difference in users' performance was bigger for the vessels text. If we set aside the possibility of an order confound for the moment, the likely reason is that the vessels texts had more information, fewer common characteristics and a lot of variety, and that vessels were a rarer topic than coins.

However, we should not ignore the ceiling effect [23]. The multiple-choice question set contained 15 questions and the highest possible score was 17, since all the questions were equally weighted except for the two questions with two possible answers. The ceiling effect is more obvious for the subjects with a very good memory or who learnt a lot. For someone who scored 15 or 16 out of 17 on the text without comparison and aggregation, there is not much room for improvement. Such a subject made one or two mistakes and would only very rarely perform better on the same text with comparison and aggregation. In these cases, it appears that the users did not really benefit from the presence or absence of the text type factors. Therefore, we are uncertain how much these factors would have helped if the ceiling score had been higher. Although this ceiling effect makes it harder to represent the performance of the users reliably, it does magnify the between-subjects differences; a difference of two or four points in a user's performance corresponds to roughly ten or twenty percent on a scale of 100. Accordingly, the low ceiling makes variation in the users' progress all the more important.

[23] The ceiling effect refers to the upper limit on the performance score that a subject can obtain in an experiment. If the distance between the lowest and highest scores is short, the interpretation of the results may become harder.

As mentioned in section 5.3., it was originally believed that the confusion over the question about the naturalness and fluency of the texts reflected a misunderstanding of the question and the lack of criteria for what is fluent and natural. The absence of any specific criterion for naturalness, fluency, goodness and appropriateness, and the lack of objective judgements, are still considered the main reasons for this confusion. However, it must not be ignored that comparison and aggregation may not by themselves make the text more natural and fluent. Some participants complained about the use of some connectors and conjunctions or the unusual appearance of some referring expressions. It has to be pointed out here that almost all the Greek participants expressed a strange feeling during the interviews: they thought that the texts they read were a translation of original (probably English) texts.
Although their feeling was unspecific and vague, they based their answer to the question on it. This is a perfect example of the involvement of human subjects' judgement, which must be avoided since such judgements are completely subjective opinions. Despite all the above reasons for confusion, when they were asked to clear their minds of feelings and interest in the texts, it appears that, acting subconsciously, they chose the right text. The texts were not only structurally but also thematically different; this further strengthens the hypothesis that comparison and aggregation are an important part of a natural and fluent text. We firmly believe that if, for instance, the users had been asked to judge the same text thematically, with and without the text type factors, all of them would have chosen the right text.

Still considering the results of both experiments, the language factor was, surprisingly, not statistically significant. We expected that the Greek participants would perform better than the English. The results showed almost the opposite. In both experiments, the English participants scored better than the Greeks on the vessels texts, but worse on the coins text. We cannot interpret the results of the two languages together, because there are two very important differences: the Greek subjects had the advantage of language familiarity and of supporting background knowledge. All the names of the exhibits were Greek names which were not translated into English, but phonetically transcribed. Therefore, the Greeks were familiar with the vocabulary and, even though some words were unknown to them, they could easily guess what these words might mean; the English subjects, on the other hand, knew almost none of these words and could certainly not guess what they might mean. For instance, in the question about the feast (comuses), the Greeks could easily associate this word with fun, pleasure and entertainment, because it is the first component of the Greek compound word for comedy, and then simply connect it with Dionysus, the god of wine and pleasure.

Background knowledge is inseparably related to language. All the Greek participants had taken some courses in Ancient Greek History; therefore, they were familiar with portrayals of the daily life of ancient Greek society, the twelve ancient Greek gods, wars, generals, emperors and the history of the greatest city-states. For instance, they knew where Alexander the Great was from and when he reigned, and the history of the drachma coin, which was the official currency of Greece until 2001. However, this knowledge interfered with their memory and misled them. When they were asked whose picture is on the Hellenistic stater found in Macedonia, they ignored the fact that the stater was an Athenian coin, and most of them answered Alexander the Great, since he dominated that period, particularly in Macedonia; moreover, in the question about the origin of the archaic stater, they answered wrongly that it was from Attica or Athens, because it was an Athenian coin (it originated from Croton, South Italy).

In contrast, the English participants did not have any of this information and knowledge. They answered the questions based on the texts, since they did not have any background knowledge that could interfere with their memory. For the English subjects it was "I remember that or not", but for the Greek subjects it was "I know that or not"! Language familiarity and background knowledge had some advantages and drawbacks for both versions; but, most importantly, they changed the conditions for the Greek and English participants so fundamentally that we cannot safely draw summary results, conclusions and assumptions for both language versions together, but only separately.
[Box plots, two panels: M-PIRO text with comparison and aggregation; M-PIRO text without comparison and aggregation. Score (y-axis) by language version (x-axis: english, greek); N = 20 per language]

Table 6.1.: Box plots for the performance of the participants depending on the language factor

6.1.2. Ordering effect: a possible flaw in experimental design


When running the pilot experiment, I tried to give the texts to the subjects in random order, so that some of them read the coins text sequence first and the others read the vessels text sequence first; additionally, some of the participants first read the text with comparison and aggregation, whereas the others first read the text without the text type factors. This was an attempt to randomize the variables of text type and difficulty in the pilot experiment, so that I could observe how the ordering effect might affect the results of the participants. The results of the pilot experiment showed that the order of the text sets did not seriously affect performance, since the participants almost always scored better on the text with the text type factors (see section 4.3). At the same time, a new problem arose: the subjects who read the vessels text set first admitted, while still doing the experiment, that they had become tired from the first text set and that this was going to affect their performance on the coins text set. Following the criticisms of the subjects who participated in the pilot, it was decided that, since participants' tiredness would be a serious flaw in the main experiment, all the participants would be exposed to the same conditions; in particular, the text sets would not be given in random order, but first the coins texts and then the vessels texts.

However, this decision created an ordering effect. Before the experiment, the participants were naive about it and could not imagine what the texts and questions would be like, although there was a short introduction. After the first set of texts and questions, the conditions changed. The subjects had read some texts and questions; therefore, they knew what to expect from the next set of texts and questions. They were no longer naive about the nature of the experiment. Because they thought that the texts would be similar, they started reading the next text set while thinking about the information that might be needed for the corresponding questions. The element of surprise was lost for the second text. Therefore, they might have performed worse if they had read that text first and not second. The possibility of such a flaw introduces a complication into the interpretation of the results. It seems that the subjects should score better on the second text, since they knew what they would find in the texts and the questions. Some critics could easily claim that, if the order of the texts had been random, the results would have been different. No one can be sure about this possibility. However, there are two facts that argue strongly against this hypothesis.
To minimize the ordering effect, two thematically completely different text sets were chosen; in this way, the advantage of guessing what the questions would be like was eliminated, since the second text set was bigger and harder than the first and, most importantly, unrelated to it. Furthermore, the analysis of the results revealed that the participants who scored better on the vessels text set were affected by the text type factors and not by the fact that they knew what kind of information they would find; none of the participants scored better on the vessels text when it did not include comparison and aggregation in its structure. Therefore, it should be pointed out once more that the possibility of the experimental design introducing a priming effect does not undermine the basic conclusion of this evaluation and study. The results of the pilot and main experiments make it very clear that comparison and aggregation help the users to perform better and to learn and remember more, and that they are major parts of creating a coherent, concise, fluent and natural text. The responses we obtained from the readers imply that almost all of them can perform better on a text with these text type factors and clearly prefer such a text as more fluent, coherent and natural.

6.2. Suggestions and improvements


The interviews with the participants after the evaluation task and the personal interaction with the M-PIRO system allow us to make some suggestions. The suggestions are some simple observations and thoughts that may improve the system; these suggestions concern both text type factors. The already built aggregation module for the M-PIRO system is multilingual. However, the Greek language has a completely different morphology and syntax compared to English and it may be wise to parameterize the module in some cases. Therefore, the rules and their hierarchy should be slightly different. Firstly, the commatype aggregation is sometimes misplaced. For instance, , , , the relative clause is separated incorrectly with a comma from the main clause. In Greek language the relative clause that carries a crucial meaning for the specification of the noun phrase which it refers, can not be separated by a comma. There are some temple of Hera in Mediterranean and therefore, it is necessary to specify in which temple the vessel was found. On the other hand, the information that Delos is an island it not essential for understanding where this area is; there is only one Delos. A good solution of this problem is to replace the relative adverb with the relative pronoun , which must always follow a comma. The relatives clauses in Greek language is most common type of aggregation and they are used either the nominal relatives or the adverbial relatives clauses embedded. Furthermore, another common type of aggregation is the past participants placed embedded or at the end of the sentence instead of relative clause; for instance the sentence , can be

Athanasios N. Karasimos

53

Evaluation of M-PIRO System

[...], making the text even more natural and concise. Furthermore, the pro-drop option[24] needs to be used more in Greek. This would make the text look more natural and fluent; I would strongly claim that it may eliminate the participants' feeling that they are reading a translated text. Not using the pro-drop option does not in itself make the texts unnatural or boring. However, pro-drop is widely used in natural Greek, the strong inflectional system of the language usually prevents any misunderstanding, and texts that use this option are definitely preferable. For instance, in [...], the underlined noun phrase should be dropped. At the same time, in cases where the subject is necessary, we can use the adjective phrase [...] (the specific) without the exhibit noun; the text will be less boring if the phrases that refer to the exhibit are varied.

As already described in section 3.2.2, the comparison module uses a different importance order for English and for Greek[25]. The lack of literature ties the hands of the developers of any NLG system, since they have to build such a module without much background or underlying theory. First and foremost, serious work on similarity and contrast is needed in the literature. Based on the comments of the participants, who really liked the comparison option but characterized it as very simple and monotonous, we suggest some ideas that may be possible to implement. Firstly, we can enhance the comparison vocabulary: as well as "like", we can use "just as", "just like", "likewise" and "similarly", with appropriate changes in sentence structure. For the contrast sentences, the word "unlike" can be replaced with "contrary to", or the sentence can be split with a semicolon (for more about comparison expressions, see section 3.2.1). For instance, the sentence "unlike the previous coins, which are made of silver, this stater is made of gold" can become: the previous coins are made of silver; however, (nevertheless/
[24] Pro-drop means that a language has the option of dropping the subject, since the morphology of the verb helps to avoid any misunderstanding. This option is used very often in Greek.
[25] I think that this comparison importance list should be the same for both languages, although there is a practical reason for it to differ. The information importance may also need to be reordered.


nonetheless) this stater is made of gold. Furthermore, it is also possible to make some within-text comparisons, such as between the original location and the current location. If an exhibit was found in a different area from where it is located today, Exprimo may introduce a contrast: [...] (Although it was originally from Attica, today it is located in the MFA). Nonetheless, we cannot fill the texts with too many comparisons, as O'Donnell et al. (2000) and Lisowska (2002) discuss. A text with too many comparisons would be boring and irritating but, most importantly, the users would be confused and would forget much of the information, and the educational goal of the system would fail.

The data analysis clearly revealed that text difficulty has a statistically significant effect on the participants' performance; the difference was greater when the texts were more difficult. We therefore suggest introducing a difficulty value for each entity, set according to the curators' judgement. Exhibits belonging to entity types that still exist and are part of our daily life, such as coins, jewels and reliefs, should have much easier texts than vessels, statues and armoury/weaponry. So, when Exprimo generates texts for exhibits with a high difficulty value, it would reduce the aggregation value by one, so that the generated text is simpler and less complicated. This would also allow more comparison expressions to be used, since a text with a high aggregation value cancels some comparisons because they cannot be present in that text structure.

In the same spirit, extending the user models would be interesting. New types of user model (O'Donnell, 1997) can make the system more flexible and ideal for user-tailored NLG. These new types are not exactly user models, but rather information models.
When building a user model, we know that a naive user prefers definitions, clarifications and restatements, whereas an expert user prefers more generalizations and specialized information. It would therefore be useful to support the users' interests with some further text options. Besides the standard text, the system could offer a general-specific text, for a user who wants to learn more about the entity group and not specifically about this exhibit; a when-and-where text, for a user who wants to learn more about when the exhibits were created, the creation period and


the place where they originated; and a how-and-why text, for a user who wants to be informed about the art, the techniques, the potters, the sculptors and the painters, and about why an exhibit was painted, decorated or sculpted with a specific technique. These texts may sometimes contain the same information and facts; however, they would differ crucially in their text structure.
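
The difficulty suggestion made earlier in this section can be illustrated with a minimal Python sketch. It is not part of Exprimo or of any other M-PIRO component: the table CURATOR_DIFFICULTY, the threshold HIGH_DIFFICULTY and the function effective_aggregation are hypothetical names, and the numeric values are assumptions chosen only for illustration.

```python
# Hypothetical difficulty values that a curator might assign per exhibit type.
# The scale (1 = easy, 3 = difficult) is an assumption, not M-PIRO data.
CURATOR_DIFFICULTY = {
    "coin": 1,
    "jewel": 1,
    "relief": 1,
    "vessel": 3,
    "statue": 3,
    "weapon": 3,
}

HIGH_DIFFICULTY = 3  # assumed threshold for a "high value of difficulty"


def effective_aggregation(base_level: int, exhibit_type: str, minimum: int = 0) -> int:
    """Return the aggregation level to use for one exhibit.

    For exhibit types judged difficult, the requested level is reduced by one
    (never below `minimum`); otherwise it is returned unchanged.
    Unknown exhibit types are treated as easy.
    """
    difficulty = CURATOR_DIFFICULTY.get(exhibit_type, 1)
    if difficulty >= HIGH_DIFFICULTY:
        return max(minimum, base_level - 1)
    return base_level
```

With these assumed values, a vessel requested at aggregation level 3 would be generated at level 2, while a coin would keep level 3.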

6.3. Future work


Since the results of the main evaluation experiment made it strongly evident that differences in performance are indeed shown by the majority of adult readers, we can express some expectations about the results of an evaluation with expert and child participants. Although this study was able to provide definite answers about the role of comparison and aggregation in the M-PIRO NLG system for adult users, caution is required in interpreting the results because of a possible ordering effect, so the first step may be to eliminate this factor. In a similar experiment, both texts would be matched for difficulty and given to the subjects in random order; in this case there would be no interaction between the text type factor and text difficulty to examine, but the results of the present evaluation allow us to expect the same conclusions. Additionally, it would be really interesting to run the same experiment with child participants; of course, we would use the same exhibits but a different user model (child). We strongly believe that the text type factors would make a clear difference to the participants' performance; the difference may be even bigger than the one found in this experiment. On the other hand, an evaluation with expert users may reveal that comparison and aggregation make no difference to the participants' performance, particularly in easy texts. For an expert in this field, the presence of these text type factors will be of minor help, since their background and supporting knowledge is very strong and much of the information in the texts will already be known. It would therefore be remarkable to find any significant effect of the text type factors; such a result would suggest that the system is flawless, but it would be a tremendous surprise, as experts do not actually need any help from the text structure.
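
As an illustration of how the ordering effect mentioned above might be controlled, the following Python sketch assigns presentation orders to participants so that every possible order of the conditions is used about equally often. It is only a sketch under stated assumptions: the function name counterbalance, the participant labels and the condition labels are hypothetical and do not come from the experiment reported here.

```python
import itertools
import random


def counterbalance(participants, conditions, seed=0):
    """Assign each participant one presentation order of the conditions.

    Participants are shuffled with a seeded random generator and then dealt
    round-robin over all permutations of `conditions`, so each order is used
    almost equally often (exactly equally when the number of participants is
    a multiple of the number of orders).
    """
    rng = random.Random(seed)  # seeded so the assignment can be reproduced
    orders = list(itertools.permutations(conditions))
    shuffled = list(participants)
    rng.shuffle(shuffled)
    return {person: orders[i % len(orders)] for i, person in enumerate(shuffled)}


# Hypothetical usage: four participants, two text types.
assignment = counterbalance(
    ["p1", "p2", "p3", "p4"],
    ["with comparison and aggregation", "without comparison and aggregation"],
)
```

Dealing the shuffled participants over the permutations, rather than drawing an order independently for each person, is a design choice that keeps the groups balanced even in a small sample.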


These expectations lead to another kind of evaluation, namely a fluency/intelligibility/accuracy evaluation of the system. This would be a two-part experiment. We could evaluate the system by giving the participants one text generated by the M-PIRO system and another written by a curator; their performance would then be used to test how fluent, natural and coherent the generated texts are. In addition, the participants would be asked to choose between the two texts or to rate them on a scale for coherence, appropriateness and fluency. Moreover, it would be extremely useful to run this experiment with museum curators, whose opinions are very important and the most appropriate for this kind of evaluation; in addition, an interview with them would reveal crucial thoughts, essential comments, useful suggestions and definitely more objective ideas for improving any future version of the M-PIRO system or any other NLG system specialized in virtual museum exhibitions or similar areas.

The educational goal of the M-PIRO system is that users should be able to recognise new exhibits after interacting with the system; the aim is to provide users with knowledge and information that will be useful to them. M-PIRO has the potential to teach users to look at, describe and classify exhibits, and perhaps to take away useful skills for recognizing and describing novel exhibits in a museum or a book. It is therefore essential to test whether this goal is achieved, and a museum test would reveal this. New exhibits (not seen in the exhibition), similar to those the users had already visited, would be presented, with subjects instructed to examine the photograph and then classify each exhibit in terms of its name, painting technique or creation period. These questions would be multiple choice. This test should be run after a task evaluation test with adults or children.
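
The scoring of such a museum test could be sketched as follows in Python. This is an assumed scoring scheme, not one defined in this dissertation: the function score_classification and the dictionary layout are hypothetical, and the answer key simply reuses two vessels described in Appendix I.

```python
def score_classification(answers, answer_key):
    """Return the fraction of correct answers for each attribute.

    `answer_key` maps an exhibit to its correct attribute values;
    `answers` has the same shape and holds one participant's choices.
    A missing answer counts as incorrect.
    """
    totals = {}
    correct = {}
    for exhibit, attributes in answer_key.items():
        for attribute, truth in attributes.items():
            totals[attribute] = totals.get(attribute, 0) + 1
            if answers.get(exhibit, {}).get(attribute) == truth:
                correct[attribute] = correct.get(attribute, 0) + 1
    return {attribute: correct.get(attribute, 0) / totals[attribute] for attribute in totals}


# Hypothetical answer key built from two vessels of Appendix I.
answer_key = {
    "kylix": {"name": "kylix", "technique": "red figure", "period": "archaic"},
    "aryballos": {"name": "aryballos", "technique": "black figure", "period": "archaic"},
}
```

Reporting one score per attribute, rather than a single total, would show whether users learn names, techniques and periods equally well.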
Finally, this evaluation experiment also needs to be extended to the M-PIRO speech output. Because of the time limitations of this dissertation, I decided not to include speech output in the main evaluation experiment. However, a speech output evaluation is not the mirror image of the text output evaluation already carried out. If someone uses the speech option (Festival and DEMOSTHeNES), it means either that they choose not to use the text output for some (personal) reason, or that they do not have the option of text, because of eyesight problems or reading problems (dyslexia), or because they are using a guide device in a museum. Therefore, the loss of information will be greater than for users of the text output (acoustic memory is shorter than visual memory). It is expected that


the participants will perform similarly to those in the current experiment, but with lower scores. Despite some problems with the naturalness of Festival and DEMOSTHeNES, it would be fascinating to examine how the naturalness and quality of a text-to-speech (TTS) output interact with the naturalness, fluency and coherence of an NLG system's text output.

6.4. Conclusion
The current evaluation showed that the text type factors, comparison and aggregation, helped the users to help themselves: they performed better, and learned and remembered more from the texts that they read. Additionally, it seems that these factors join textual and informational elements, producing texts that are more natural and concise and very close to a real curator's text. It is therefore essential for every NLG system that respects its users to include these factors and to integrate them carefully into the text structure, since in this way its aim (a natural generated text) would be achieved. However, evaluation of any aspect of natural language generation is a complex issue. Presumably, the quality of most generation systems can only be assessed at the system level in a task setting (rather than by taking quantitative measures or by asking humans for quality assessments; Bangalore et al., 2000). Such evaluations are costly (especially in time) and, unfortunately, some developers of NLG systems cannot make evaluation the basis of their work in generation, treating it instead as a short step in research, implementation and development. It is thus clear that it is not yet generally accepted that evaluation is half of the general natural language processing problem. Evaluation is usually treated dismissively, is underused, and is bandied about merely to demonstrate an NLG system's power. It is important to build a strong theory of evaluation. The first, essential steps were taken by Mellish and Dale (1998), who tried fairly successfully to unite past and present research. Their study must be continued and strengthened by clarifying the areas of evaluation and introducing specific, clear criteria for concepts that usually confuse participants. Additionally, it is worth finding a way to guide participants towards objective opinions and clear, explicit comments. At the same time, and above all, the developers of an NLG system have to


appreciate the value of evaluation and recognise that evaluation is the only safe path to a system's improvement.


Bibliography
[...], (2000). [...].
Androutsopoulos I., Calder J., Callaway C., Clark R., Dimitromanolaki A., Hughson I., Isard A., Matheson C., Melengoglou A., Not E., Oberlander J., Spiliotopoulos D., Varges S., Xydas G., (2002). Generation Components and Documentation for Prototype D4.5, M-PIRO deliverables, www.ltg.ed.ac.uk/mpiro/internal/deliverables
Bangalore S., Rambow O., Whittaker S., (2000). Evaluation Metrics for Generation, AT&T Labs Research, In Proceedings of the 1st International Conference on Natural Language Generation (INLG 2000), Mitzpe Ramon, Israel, pp. 1-8.
Carenini G., Moore J., (2000). An empirical study of influence of User Tailoring on Educative Argument Effectiveness, First International Conference on Natural Language Generation, Mitzpe Ramon, Israel, pp. 47-54.
Carletta J., Isard A., Isard S., Kowtko J., Doherty-Sneddon G., Anderson A., (). The Reliability of a Dialogue Structure coding scheme, pp. 1-8.
Cheng H., (2000). Experimenting with interaction between Aggregation and Text structure, submitted to ANLP'00, pp. 1-6.
Cox R., O'Donnell M., Oberlander J., (1999). Dynamic vs. Static hypermedia in museum education: an evaluation of ILEX, the intelligent labelling explorer, Proceedings of the Artificial Intelligence in Education conference (AI-ED99), Le Mans, July 1999. Amsterdam: IOS Press, pp. 1-8.
Huang X., Fiedler A., (1996). Paraphrasing and Aggregating Argumentative Text using text structure, In Proc. of the 8th INLG, Herstmonceux, pp. 21-30.
Isard A., Matheson C., Oberlander J., Androutsopoulos I., (2003). Speaking the users' language, Intelligent Systems 1-2/2003, pp. 40-45.
Jordan P. W., Dorr B. J., Benoit J. W., (1993). A first-pass approach for evaluating machine translation systems, Machine Translation 8, Kluwer Academic Publishers, Dordrecht, Netherlands, pp. 43-58.


Jurafsky D., Martin J. H., (2000). Speech and Language Processing: An introduction to Natural Language Processing, Computational Linguistics and Speech Recognition, Prentice Hall, New Jersey.
Keane M. T., Smyth B., O'Sullivan F., (2001). Dynamic Similarity: A processing perspective on Similarity, In Ramscar M. & U. Hahn (Eds.) Similarity & Categorization, pp. 179-192, Oxford: Oxford University Press.
Knott A., (2002). Similarity and contrast relations and inductive rules, Proceedings of the Seventh ESSLLI Student Session, pp. 1-4.
Knott A., O'Donnell M., Oberlander J., Mellish C., (1997). Defeasible rules in Context Selection and Text structuring, Proceedings of the 6th European Workshop on Natural Language Generation, March 24-26, 1997, Gerhard-Mercator University, Duisburg, Germany, pp. 1-11.
Knott A., Oberlander J., O'Donnell M., Mellish C., (2000). Beyond elaboration: the interaction of relations and focus in coherent text, T. Sanders, J. Schilperoord and W. Spooren (eds.) Text representation: linguistic and psycholinguistic aspects. Amsterdam: Benjamins, pp. 1-12.
Lisowska A., (2002). The Design and implementation of an Architecture for using Comparisons in the M-PIRO domain, MSc Dissertation, University of Edinburgh.
Mann W. C., Thompson S. A., (1988). Rhetorical Structure Theory: a theory of text organization, Text 8, pp. 243-281.
Melengoglou A., (2002). Multilingual Aggregation in M-PIRO System, MSc Dissertation, University of Edinburgh, pp. 1-55.
Melengoglou A., Matheson C., Oberlander J., (2003 in press). User adapted aggregation in the M-PIRO generation system, pp. 1-8.
Mellish C., Dale R., (1998). Evaluation in the context of natural Language generation, Computer Speech and Language (1998) 12, pp. 349-373.
Meteer M. & McDonald D., (1991). Evaluation for generation, in Natural Language Processing Systems Evaluation Workshop, (J. G. Neal & S. M. Walter eds.), pp. 127-131, NY Rome Laboratory.


Milosavljevic M., Oberlander J., (1998). Dynamic Hypertext Catalogues: Helping users to help themselves, Proceedings of the 9th ACM Conference on Hypertext and Hypermedia (HT'98), pp. 1-14.
Milosavljevic M., (1999). The automatic generation of Comparison in descriptions of entities, PhD thesis, Department of Computing, Macquarie University.
Minnis S., (1998). Constructive machine translation evaluation, Machine Translation 8, pp. 67-76.
Moore J., (1991). Evaluating Natural Language generation facilities in intelligent systems, in Natural Language Processing Evaluation Workshop, (J. G. Neal & S. M. Walter eds.), pp. 133-140, NY Rome Laboratory.
O'Donnell M., (1997). Variable-length On-Line Document Generation, Proceedings of the 6th European Workshop on Natural Language Generation, March 24-26, Gerhard-Mercator University, Duisburg, Germany.
O'Donnell M., Mellish C., Oberlander J., Knott A., (2001). ILEX: an architecture for a dynamic hypertext generation system, Natural Language Engineering 7(#), pp. 225-250, Cambridge University Press.
O'Donnell M., Knott A., Oberlander J., Mellish C., (2000). Optimizing text quality in generation from relational databases, Proceedings of the 1st International Natural Language Generation Conference, 12-16 June 2000, Mitzpe Ramon, Israel, pp. 1-8.
[...], (1993). [...].
Reape M., Mellish C., (1999). Just what is aggregation anyway?, 7th European Workshop on Natural Language Generation (EWNLG 99), pp. 1-10.
Shaw J., (1995). Conciseness through Aggregation in Text Generation, 33rd ACL (Student Session), 1995, pp. 329-331.
Spiliotopoulos D., Androutsopoulos I., Stamatakis K., Dimitromanolaki A., Karkaletsis V., Spyropoulos C., (2002). M-PIRO Authoring Tool [User Manual], National Center of Scientific Research (NCSR) Demokritos, pp. 1-77.


Stenning K., Cox R., Oberlander J., (1995). Contrasting the cognitive effects of graphical and sentential logic teaching: reasoning, representation and individual differences, Language and Cognitive Processes 10 (1995), pp. 333-354.
Stone M., Doran C., (1997). Sentence Planning as Description Using Tree adjoining Grammar, ACL-EACL97, Madrid, Spain, pp. 1-8.
Weissberg R., Buker S., (1990). Writing up Research, Prentice Hall Regents, New Jersey.


Appendix I
The M-PIRO generated Texts for the Main Experiment
Coins Text Sequence [English] with comparison and aggregation
Drachma
This exhibit is a drachma, created during the classical period. It dates from circa the 5th century B.C. It has an image of Athena crowned with a branch of olive, her tree, on its obverse. On the other side there is a picture of her owl. This drachma is made of silver and it was originally from Attica.

This drachma was created during the classical period. The classical period was defined by the rise in the political supremacy of Athens (its 'golden age') and the expansion of the Greek world under the rule of Alexander the Great of Macedonia. The classical period covers the time between 480 and 323 B.C. Today this drachma is located in the Agora Museum.
Classical tetradrachm
This exhibit is a tetradrachm, created during the classical period. It dates from between 440 and 420 B.C. It has an image of Athena crowned with a branch of olive, her tree, on its obverse. On the reverse there is a picture of her owl. Like the drachma, this tetradrachm is made of silver. Currently it is in the Numismatic Museum of Athens. This exhibit is a tetradrachm. Tetradrachms were coins with the value of four drachmas. Like the drachma, this


tetradrachm originates from Attica.


Tetradrachm
This exhibit is another tetradrachm; it was created during the hellenistic period and it dates from the 2nd century B.C. It has a picture of a shield with a bust in its centre, as was customary on macedonian coins. Like the previous coins, this tetradrachm is made of silver. From 168 BC, after the defeat of Perseus, king of Macedonia, by the Roman Aemilius Paulus in Pydna, Macedonia devolved to Roman control. The Romans divided the region into four smaller administrative districts, the so-called merides (sectors). This coin comes from the `first merida' of Macedonia. This tetradrachm was created during the hellenistic period. The hellenistic period covers the chaotic period from the Death of Alexander the Great and the subsequent dissolution of his empire to the victory of the Romans over the Greeks at the Battle of Actium, comprising a truly cosmopolitan or international range of artistic trends. The hellenistic period covers the time between 323 and 31 B.C. Unlike the previous coins, which originate from Attica, this tetradrachm was originally from Macedonia. It is now exhibited in the Numismatic Museum of Athens.

Archaic stater
This exhibit is a stater, created during the archaic period. It dates from between 530 and 510 B.C. It has a picture of a tripod on each side. A tripod is a vessel with three legs and it was the sacred symbol of the god Apollo. Like the previous coins, this stater is made of silver. It originates from Croton. This stater was created during the archaic period, the archaic period covers the time between 700 and 480 B.C. It marked the beginnings of Greek monumental stone sculpture and other developments in the naturalistic representation of the human figure. This stater originates from Croton. Croton was a Greek colony in Southern Italy. Currently this stater is in the Numismatic Museum of Athens.


Hellenistic stater
This exhibit is another stater; it was created during the hellenistic period and it dates from between 220 and 189 B.C. It has the head of Athena with a Corinthian helmet on the reverse side, as was common in coins of the Ancient World. On the obverse side is a female figure, the personification of Aetolia, seated on Macedonian and Galatic shields. The scene refers to the fight of the Aetolians against the Macedonians and the Galatians. Unlike the previous coins, which are made of silver, this stater is made of gold. Today it is located in the Numismatic Museum of Athens. This stater was originally from the Aetolian league. The Aetolian league was among the leagues that played an important political role.

Coin of the reign of Commodus


This exhibit is another coin; it was created during the roman period and it dates from between 177 A.D. and 192 B.C. It shows a view of the harbour of Patras from the sea. In the background you can see rows of columns, temples and other buildings; in the foreground there is a ship and a statue of a male form. This coin originates from Patras and it is now exhibited in the Numismatic Museum of Athens. This exhibit is a coin. The drachma originates from Attica. The side of a coin which displays the principal symbol (`heads' in `heads or tails') is known as the obverse. This coin was created during the roman period, the roman period covers the time between 31 B.C. and the 4th century A.D. It represents a period when the Greek world was under the rule of the Romans, for whom and under whose patronage many Greek artists worked.


Coins Text Sequence [English] without comparison and aggregation
Drachma


This exhibit is a drachma. It was created during the classical period. It dates from circa the 5th century B.C. It has an image of Athena crowned with a branch of olive, her tree, on its obverse. On the other side there is a picture of her owl. This drachma is made of silver. It originates from Attica.

This drachma was created during the classical period. The classical period was defined by the rise in the political supremacy of Athens (its 'golden age') and the expansion of the Greek world under the rule of Alexander the Great of Macedonia. The classical period covers the time between 480 and 323 B.C. This drachma was originally from Attica. It is now exhibited in the Agora Museum.

Classical tetradrachm
This exhibit is a tetradrachm. It was created during the classical period. It dates from between 440 and 420 B.C. It has an image of Athena crowned with a branch of olive, her tree, on its obverse. On the reverse there is a picture of her owl. This tetradrachm is made of silver. Today it is located in the Numismatic Museum of Athens.

This exhibit is a tetradrachm. Tetradrachms were coins with the value of four drachmas. This tetradrachm was created during the classical period. The classical


period was defined by the rise in the political supremacy of Athens (its 'golden age') and the expansion of the Greek world under the rule of Alexander the Great of Macedonia. The classical period covers the time between 480 and 323 B.C. This tetradrachm originates from Attica.
Tetradrachm

This exhibit is a tetradrachm. It was created during the hellenistic period. It dates from the 2nd century B.C. It depicts a shield with a bust in its centre, as was customary on macedonian coins. This tetradrachm is made of silver. From 168 BC, after the defeat of Perseus, king of Macedonia, by the Roman Aemilius Paulus in Pydna, Macedonia devolved to Roman control. The Romans divided the region into four smaller administrative districts, the so-called merides (sectors). This coin comes from the `first merida' of Macedonia.
This tetradrachm was created during the hellenistic period. The hellenistic period covers the chaotic period from the Death of Alexander the Great and the subsequent dissolution of his empire to the victory of the Romans over the Greeks at the Battle of Actium, comprising a truly cosmopolitan or international range of artistic trends. The hellenistic period covers the time between 323 and 31 B.C. This tetradrachm was originally from Macedonia. It is now exhibited in the Numismatic Museum of Athens.

Archaic stater
This exhibit is a stater. It was created during the archaic period. It dates from between 530 and 510 B.C. It has a picture of a tripod on each side. A tripod is a vessel with three legs and


it was the sacred symbol of the god Apollo. This stater is made of silver. It was originally from Croton. This stater was created during the archaic period. The archaic period covers the time between 700 and 480 B.C. It marked the beginnings of Greek monumental stone sculpture and other developments in the naturalistic representation of the human figure. This stater originates from Croton. Croton was a Greek colony in Southern Italy. Today this stater is located in the Numismatic Museum of Athens.


Hellenistic stater
This exhibit is a stater. It was created during the hellenistic period. It dates from between 220 and 189 B.C. It has the head of Athena with a Corinthian helmet on the reverse side, as was common in coins of the Ancient World. On the obverse side is a female figure, the personification of Aetolia, seated on Macedonian and Galatic shields. The scene refers to the fight of the Aetolians against the Macedonians and the Galatians. This stater is made of gold. It is now exhibited in the Numismatic Museum of Athens. This stater was originally from the Aetolian league. The Aetolian league was among the leagues that played an important political role.

Coin of the reign of Commodus

This exhibit is a coin. It was created during the roman period. It dates from between 177 A.D. and 192 B.C. It shows a view of the harbour of Patras from the sea. In the background you can see rows of columns, temples and other buildings; in the foreground there is a ship and a statue of a male form. This coin originates from Patras. Currently it is in the Numismatic Museum of Athens.
This exhibit is a coin. The drachma originates from Attica. The side of a coin which displays the principal symbol (`heads' in `heads or tails') is known as the obverse. This coin was created during the roman period. The roman period covers the time between 31 B.C. and the 4th century A.D. It represents a period when the Greek world was under


the rule of the Romans, for whom and under whose patronage many Greek artists worked.


Vessels Text Sequence [English] with comparison and aggregation
Kylix


This exhibit is a kylix, created during the archaic period. It dates from between 510 and 500 B.C. It depicts a discus thrower preparing to throw the discus. This kylix was decorated with the red figure technique. The discus thrower on this kylix is weighing the discus in his hands as he gets ready to throw it. Discus throwing techniques have changed little since ancient times, but the discus itself, had various differences. In antiquity it was initially made of stone, and later of bronze, lead or iron. It weighed between 1.3 and 6.6 kilos and so the athlete required both strength and precision to direct its course.

This exhibit is a kylix. The kylix was painted with the red figure technique. Kylixes are wine cups with a shallow bowl placed on top of a base. They have two horizontal handles. This kylix was painted with the red figure technique and it was originally from Attica. Today it is located in the MFA.
This exhibit is a kylix. The kylix was decorated with the red figure technique. It was created during the archaic period. The archaic period marked the beginnings of Greek monumental stone sculpture and other developments in the naturalistic representation of the human figure. The archaic period covers the time between 700 and 480 B.C. This kylix was decorated with the red figure technique.

Classical kylix
This exhibit is a another kylix, created during the archaic period. It depicts a young man who is sitting down and writing with a stylus (pen).


This kylix was painted by Eucharides. Like the previous kylix, it was painted with the red figure technique. During the archaic period, the most popular method of writing must have been wooden tablets coated with wax, on which letters were written with the stylus and could easily be rubbed out and rewritten. This exhibit is a kylix. The kylix was decorated with the red figure technique. It dates from between 500 and 480 B.C. Like this previous kylix, it was decorated with the red figure technique. The red figure technique is the opposite of the black figure technique. This kylix is now exhibited in University museum of Pensylvania.
This exhibit is a kylix. The kylix was painted with the red figure technique. Like this previous kylix, it was painted with the red figure technique. In the red figure technique, the background was painted black. The figures, which were predesigned, had the natural color of the clay.

Stamnos
This exhibit is a stamnos. Unlike the previous vessels, which were created during the archaic period, this stamnos was created during the classical period. It shows Dionysus (centre) being garlanded by maenads in a state of ecstasy. One maenad (left) is filling a skyphos with wine, another (right) is playing a drum. This stamnos was decorated by the painter of Dinos with the red figure technique and is made of clay. This stamnos was created during the classical period, the classical period covers the time between 480 and 323 B.C. It was defined by the rise in the political supremacy of Athens (its 'golden age') and the expansion of the Greek world under the rule of Alexander the Great of Macedonia. This stamnos dates from circa 420 B.C. Like the kylixes, this stamnos was painted with the red figure technique. It originates from Attica. This exhibit is a stamnos. Stamnoses are pots which were mostly used for storing and mixing. They have two small horizontal handles on the side and a round body with a short neck. Like the kylixes, this stamnos was painted with the red figure technique. It was originally from Attica. Currently it is in the Archaeological Museum of Napoli.

Spherical Corinthian aryballos

This exhibit is an aryballos. Like the kylixes, this aryballos was created during the archaic period. It dates from the late 7th century B.C. and it belongs to the corinthian type. It is spherical in shape. On the body a wide zone is distinguished with pairs of comastes among supplementary patterns, mainly rosettes (jewels representing roses with open radiate leaves). "Comastes" were the participants in "comous", feasts in honour of Dionysus. Unlike the previous vessels, which were painted with the red figure technique, this aryballos was decorated with the black figure technique. The black figure technique is the opposite of the red figure technique. This aryballos was found in the Temple of Hera, on Delos, an island. Today this aryballos is located in the Archaeological Museum of Delos.
This exhibit is an aryballos. Aryballoses were pots used by athletes to hold oil, with which they cleaned themselves after exercising. Each athlete probably had his personal aryballos. They are usually ball shaped with one or two handles. Some have the shape of animals, birds, or human heads. Unlike the previous vessels, which were painted with the red figure technique, this aryballos was painted with the black figure technique.

Lekythos
This exhibit is a lekythos, created during the archaic period. It dates from circa 550 B.C. It has a picture of a wedding scene: two newlyweds on a carriage escorted by relatives and friends. The event of the bride's transport to her new house was very important, because a marriage was considered valid only after the bride and the groom started living together. This lekythos was decorated by Amasis. Like the aryballos, it was decorated with the black figure technique. This exhibit is a lekythos. Lekythoses are basically oil bottles. They are vessels with a tall shape, usually forming an ellipse; they have a base, a single vertical handle, a narrow neck and a small mouth. This lekythos was painted by Amasis. Amasis is thought to have been both a maker and a painter of pots. This lekythos originates from Attica and currently it is in the Met.

Lekythos
This exhibit is another lekythos. Like the stamnos, it was created during the classical period. It dates from between 470 and 460 B.C. It shows an athlete preparing to throw his javelin. This lekythos was painted with the red figure technique. In antiquity, javelin throwing was intimately bound up with the Greek way of life. Before it became a feature of sporting life, the javelin was one of the weapons used by ancient Greeks in war and hunting. A javelin is a sharp, wooden spear about the height of a tall man. This exhibit is a lekythos. The lekythos was originally from Attica. Like this previous lekythos, it was originally from Attica. It is now exhibited in the National Archaeological Museum of Athens.

Vessels Text Sequence [English] without comparison and aggregation

Kylix
This exhibit is a kylix. It was created during the archaic period. It dates from between 510 and 500 B.C. It depicts a discus thrower preparing to throw the discus. This kylix was decorated with the red figure technique. The discus thrower on this kylix is weighing the discus in his hands as he gets ready to throw it. Discus throwing techniques have changed little since ancient times, but the discus itself, had various differences. In antiquity it was initially made of stone, and later of bronze, lead or iron. It weighed between 1.3 and 6.6 kilos and so the athlete required both strength and precision to direct its course.

This exhibit is a kylix. Kylixes are wine cups with a shallow bowl placed on top of a base. They have two horizontal handles. This kylix was created during the archaic period. The archaic period marked the beginnings of Greek monumental stone sculpture and other developments in the naturalistic representation of the human figure. The archaic period covers the time between 700 and 480 B.C. This kylix was painted with the red figure technique.
This kylix was decorated with the red figure technique. The red figure technique is the opposite of the black figure technique. In it, the background was painted black. The figures, which were predesigned, had the natural color of the clay. This kylix originates from Attica. Currently it is in the Museum of Fine Arts.

Classical kylix
This exhibit is a kylix. It was created during the archaic period. It dates from between 500 and 480 B.C. It shows a young man who is sitting down and writing with a stylus (pen). This kylix was painted with the red figure technique. During the archaic period, the most popular method of writing must have been wooden tablets coated with wax, on which letters were written with the stylus and could easily be rubbed out and rewritten. This kylix was painted by Eucharides. Today it is located in University museum of Pennsylvania.

Stamnos
This exhibit is a stamnos. It was created during the classical period. It dates from circa 420 B.C. It has a picture of Dionysus (centre) being garlanded by maenads in a state of ecstasy. One maenad (left) is filling a skyphos with wine, another (right) is playing a drum. This stamnos was decorated with the red figure technique. It is made of clay.

This exhibit is a stamnos. Stamnoses are pots which were mostly used for storing and mixing. They have two small horizontal handles on the side and a round body with a short neck. This stamnos was created during the classical period. The classical period was defined by the rise in the political supremacy of Athens (its 'golden age') and the expansion of the Greek world under the rule of Alexander the Great of Macedonia. The classical period covers the time between 480 and 323 B.C. This stamnos was decorated by the painter of Dinos.
This stamnos was originally from Attica. It is now exhibited in the Archaeological Museum of Napoli.

Spherical Corinthian aryballos


This exhibit is an aryballos. It was created during the archaic period. It belongs to the corinthian type. It is spherical in shape. This aryballos was painted with the black figure technique. On the body a wide zone is distinguished with pairs of comastes among supplementary patterns, mainly rosettes (jewels representing roses with open radiate leaves). "Comastes" were the participants in "comous", feasts in honour of Dionysus.

This aryballos was decorated with the black figure technique. The black figure technique is the opposite of the red figure technique. This aryballos was found in the Temple of Hera. The Temple of Hera is on Delos. This location is an island.

Lekythos
This exhibit is a lekythos. It was created during the archaic period. It depicts a wedding scene: two newlyweds on a carriage escorted by relatives and friends. The event of the bride's transport to her new house was very important, because a marriage was considered valid only after the bride and the groom started living together. This lekythos was painted by Amasis. It was painted with the black figure technique. It originates from Attica.

This exhibit is a lekythos. Lekythoses are basically oil bottles. They are vessels with a tall shape, usually forming an ellipse; they have a base, a single vertical handle, a narrow neck and a small mouth. This lekythos dates from circa 550 B.C. It was decorated by Amasis. Amasis is thought to have been both a maker and a painter of pots. Currently this lekythos is in the New York Metropolitan Museum of Art.
Lekythos
This exhibit is a lekythos. It was created during the classical period. It dates from between 470 and 460 B.C. It has a picture of an athlete preparing to throw his javelin. This lekythos was painted with the red figure technique. In antiquity, javelin throwing was intimately bound up with the Greek way of life. Before it became a feature of sporting life, the javelin was one of the weapons used by ancient Greeks in war and hunting. A javelin is a sharp, wooden spear about the height of a tall man.

This exhibit is a lekythos. It was created during the classical period. It dates from between 470 and 460 B.C. It has a picture of an athlete preparing to throw his javelin. This lekythos was painted with the red figure technique. In antiquity, javelin throwing was intimately bound up with the Greek way of life. Before it became a feature of sporting life, the javelin was one of the weapons used by ancient Greeks in war and hunting. A javelin is a sharp, wooden spear about the height of a tall man.
This lekythos was originally from Attica. Today it is located in the National Archaeological Museum of Athens.

Coins Text Sequence [Greek] with comparison and aggregation


[Greek version of the coins texts; the Greek characters were lost in extraction.]

Coins Text Sequence [Greek] without comparison and aggregation


[Greek version of the coins texts; the Greek characters were lost in extraction.]

Vessels Text Sequence [Greek] with comparison and aggregation


[Greek version of the vessels texts; the Greek characters were lost in extraction.]

Vessels Text Sequence [Greek] without comparison and aggregation


[Greek version of the vessels texts; the Greek characters were lost in extraction.]

Appendix II
What did you learn from the virtual exhibition?
The Questions for the Coins Text Sequence [English]
1. What kind of objects are all the exhibits that you have already seen?
   a. drachmas
   b. tetradrachms
   c. staters
   d. coins
   e. glaukas
2. Which images are on the drachma exhibit? [choose two answers]
   a. Jupiter
   b. Alexander the Great
   c. An olive tree
   d. An owl
   e. A port
   f. Aemilius Paulus
   g. Athena
   h. Apollo
   i. A temple
3. The tetradrachms are made of _____.
   a. gold
   b. silver
   c. bronze
   d. marble
   e. there is no info in the texts
   f. different material for each one
4. From which area are the first two exhibits (drachma and classical tetradrachm)?
   a. Macedonia
   b. Attica
   c. Athens
   d. Croton
   e. Aetolia
5. Were the tetradrachms created during the same period?
   a. No, the first one was created classical period and the second was created archaic period.
   b. Yes, both were created during
6. From which area does the archaic stater originate?
   a. Attica
   b. Macedonia
   c. Aetolia
   d. Croton
7. What is the difference between the Hellenistic stater and the previous exhibits?
   a. creation period
   b. original location
   c. material
   d. images
   e. museum location
8. Which of the six exhibits is the newest of all?
   a. The Archaic stater
   b. The Classical tetradrachm
   c. The Drachma
   d. The Commodus coin
   e. The Classical tetradrachm
   f. The Hellenistic stater
9. The _____ period marked the beginnings of Greek monumental stone sculpture and other developments in the naturalistic representation of the human figure.
   a. Classical
   b. Hellenistic
   c. Archaic
   d. Roman
10. Who was defeated by the Roman Aemilius Paulus in Pydna (tetradrachm)?
   a. Alexander the Great
   b. Philippus
   c. Perseus
   d. Commodus
11. Where are most of the exhibits now?
   a. Numismatic Museum of Athens
   b. Agora Museum
   c. Archeological Museum of Thessaloniki
   d. Museum di Napoli
12. Whose picture is on the Hellenistic stater?
   a. King Perseus
   b. Alexander the Great
   c. Athena
   d. Apollo
13. What is the value of a tetradrachm?
   a. one drachma
   b. two drachmas
   c. three drachmas
   d. four drachmas
   e. five drachmas
14. The Romans divided Macedonia into four _____.
   a. regions
   b. merides
   c. kingdoms
   d. territories
   e. trajectories
15. What is in the foreground of the Roman coin of the reign of Commodus?
   a. a temple and a statue
   b. a ship and a male statue
   c. a port and a ship
   d. a stadium and athletes
   e. a tripod and some smoke

The Questions for the Vessels Text Sequence [English]


1. What kind of objects are the exhibits?
   a. Kylixes
   b. Stamnoses
   c. Aryballoses
   d. Lekythoses
   e. Vessels
2. Which exhibit was painted by Eucharides?
   a. The Classical kylix
   b. The Stamnos
   c. The Aryballos
   d. None of them
3. What are the common characteristics between the two kylixes? [choose 2 answers]
   a. creation period
   b. original location
   c. made of same material
   d. painting technique
   e. museum location
   f. similar pictures
4. The last lekythos was created during the classical period like the _____.
   a. Other lekythos
   b. Aryballos
   c. Classical kylix
   d. Stamnos
5. Which vessel has a spherical shape?
   a. The Stamnos
   b. The Aryballos
   c. The Kylix
   d. The first lekythos
6. Which exhibits originate from Attica?
   a. The two lekythoses
   b. The two kylixes
   c. The Aryballos and Stamnos
   d. The Aryballos and the second lekythos
   e. The Stamnos and classical kylix
   f. The Lekythoses and kylixes
7. Like the aryballos, the first lekythos
   a. was created during the Hellenistic period.
   b. has a spherical shape.
   c. was decorated with the black figure technique.
   d. is an oil bottle.
   e. is located in the Archeological Museum of Athens.
8. Which picture does the stamnos exhibit show?
   a. A marriage feast
   b. A man preparing to throw the javelin
   c. Dionysus surrounded by maenads
   d. A young man sitting down and writing with a stylus
9. During which period were wooden tablets coated with wax the most popular method of writing?
   a. The Classical period
   b. The Hellenistic period
   c. The Roman period
   d. The Byzantine period
   e. The Archaic period
10. Which color is the background of an exhibit decorated with the red figure technique?
   a. red
   b. black
   c. white
   d. clay
   e. blue
11. In honor of which god were the comous feasts held?
   a. Hera
   b. Athena
   c. Apollo
   d. Dionysus
12. What is the difference between the aryballos exhibit and the previous vessels? [choose 2 answers]
   a. The previous vessels were decorated with the black figure technique.
   b. The previous vessels were created during the Classical period.
   c. The previous vessels were created during the Archaic period.
   d. The previous vessels were decorated with the red figure technique.
13. For which of the exhibits is the painter unknown?
   a. The Classical kylix
   b. The Stamnos
   c. The Aryballos
   d. The first lekythos
14. What is the characteristic which the fewest of these exhibits have in common?
   a. the creation period
   b. the painting technique
   c. original location
   d. museum location
15. Each athlete probably had his personal _____.
   a. lekythos
   b. aryballos
   c. kylix
   d. kantharos

The Questions for the Coins Text Sequence [Greek]


[Greek version of the coins questions; the Greek characters were lost in extraction.]

The Questions for the Vessels Text Sequence [Greek]


[Greek version of the vessels questions; the Greek characters were lost in extraction.]

Questionnaire
1. Did you find the texts interesting? [1=not interesting, 5=very interesting]
   Vessels text   1.  2.  3.  4.  5.
   Coins text     1.  2.  3.  4.  5.
2. Are the questions difficult? [1=very easy, 5=very difficult]
   Vessels text   1.  2.  3.  4.  5.
   Coins text     1.  2.  3.  4.  5.
3. Have you enjoyed the texts? [1=yes, 2=neutral, 3=no]
   Vessels text   1.  2.  3.
   Coins text     1.  2.  3.
4. Which text (quality, more natural and fluent) did you like more and why?
5. I thought there was too much inconsistency in the exhibit texts. [1=strongly agree, 5=strongly disagree]
   Vessels text   1.  2.  3.  4.  5.
   Coins text     1.  2.  3.  4.  5.
6. The information for the exhibits was: [1=very little, 5=too much]
   Vessels text   1.  2.  3.  4.  5.
   Coins text     1.  2.  3.  4.  5.
7. How much did you learn from these texts? [1=almost nothing, 5=a great deal]
   Vessels text   1.  2.  3.  4.  5.
   Coins text     1.  2.  3.  4.  5.


Appendix III
The Statistical guide
When we design an experiment, we use hypothetico-deductive reasoning to refine a conjecture into a yes/no question, e.g. will people learn more if the texts contain comparison and aggregation? We then turn the yes/no question into a pair of competing, mutually exclusive hypotheses. The Null Hypothesis H0 (nothing happens) is that people will not learn more if the texts contain comparison and aggregation, and the Alternative Hypothesis H1 is that they will learn more if the texts contain comparison and aggregation.

The independent variable (i.v.) is the potential cause: a variable whose levels the experimenter selects, e.g. comparison and aggregation (levels: presence and absence) and text difficulty (levels: easy and difficult). The dependent variable (d.v.) is the potential effect: the variable that is measured to obtain the results, e.g. performance. Statistical significance testing is the procedure used to test the two statistical hypotheses and to decide whether the null hypothesis can be rejected. To be statistically significant (*), the p-value must be equal to or less than .05 [less than .01 is very significant (**) and less than .001 is very very significant (***)].

The ANOVA is appropriate for experimental designs with more than 2 conditions. Its assumptions are an interval/ratio measurement of the dependent variable, a normal distribution, and homogeneity of variance (a test of the appropriateness of the variables for the ANOVA). We have a 2-way ANOVA since there are 2 i.v.s. Moreover, the ANOVA is a related design, since the experiment involves either 1-to-1 matching of cases in different conditions or experience of different levels of a variable (absence and presence of text type factors) by the same case (e.g. vessels text); hence it is a repeated measures or within-subjects design.
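The star notation for significance levels described in this guide can be sketched as a small helper. This is only an illustration: the function name `significance_stars` is hypothetical and is not part of the analysis reported in the dissertation.

```python
def significance_stars(p_value: float) -> str:
    """Map a p-value to the star notation used in this guide.

    Thresholds follow the guide: p <= .05 is significant (*),
    p < .01 is very significant (**), p < .001 is very very significant (***).
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value <= 0.05:
        return "*"
    return ""  # not significant: the null hypothesis is not rejected
```

An empty string means the result is not significant at the .05 level, so the null hypothesis is retained.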
In general, if, in a factorial experiment, the mean scores are not the same at all levels of a factor (like text type, i.e. comparison and aggregation), that factor is said to have a main effect. For instance, since there are marked differences among the marginal means in table 5.1, it would appear that there may be main effects of both factors. The effect of the text type factor (comparison and aggregation) at one particular level of another factor (text difficulty) is known as a simple main effect of the first factor at a specified level of the second. The difficulty factor has different simple main effects at different levels of the text type factor. When one factor does not have the same simple main effects at all levels of another, the two factors are said to interact. The analysis of variance of data from a factorial experiment offers tests not only for the presence of main effects of each factor considered separately but also for interaction between the factors. An interaction shows that the effect of one factor changes across the levels of the other factor, rather than staying the same at every level.
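The notions of marginal mean, simple main effect and interaction can be illustrated with a small numerical sketch. The cell means below are made-up numbers chosen for illustration, not results from this dissertation, and the helper names are hypothetical. The sketch assumes equal numbers of cases per cell, so that marginal means are plain averages of cell means. It only describes the pattern of means; the ANOVA itself is what tests whether such differences are statistically significant.

```python
from statistics import mean

# Hypothetical cell means (NOT data from the dissertation).
# Keys: (text type, difficulty); values: mean score in that condition.
cell_means = {
    ("with", "easy"): 12.0,
    ("with", "difficult"): 10.0,
    ("without", "easy"): 11.0,
    ("without", "difficult"): 6.0,
}

def marginal_mean(cells, factor_index, level):
    """Average the cell means over every cell where the factor takes `level`."""
    return mean(v for k, v in cells.items() if k[factor_index] == level)

def simple_main_effect(cells, difficulty):
    """Effect of text type (with minus without) at one level of difficulty."""
    return cells[("with", difficulty)] - cells[("without", difficulty)]

# Main effect of text type: difference between its marginal means.
main_effect_text_type = (
    marginal_mean(cell_means, 0, "with") - marginal_mean(cell_means, 0, "without")
)

# Simple main effects of text type at each level of difficulty.
effect_easy = simple_main_effect(cell_means, "easy")
effect_difficult = simple_main_effect(cell_means, "difficult")

# If the simple main effects differ, the two factors interact.
interaction_contrast = effect_difficult - effect_easy
```

With these made-up numbers the simple main effect of text type is larger for difficult texts than for easy ones, which is the pattern the guide calls an interaction.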
