André Martins
Unbabel and Instituto de Telecomunicações
Portugal
ISBN 978-1-945626-36-4
Introduction
This volume contains the contributions accepted for Software Demonstrations at the EACL 2017
conference, which takes place from the 4th to the 7th of April 2017 in Valencia, Spain.
The Demonstrations program's primary aim is to present software systems and solutions related to all
areas of computational linguistics and natural language processing. The program encourages the early
exhibition of research prototypes, but also includes interesting mature systems.
The number of submissions for software demonstrations again exceeded the number of available slots,
which forced us to introduce a strict selection process. The selection followed predefined criteria of
relevance for the EACL 2017 conference: novelty, overall usability, and applicability to more than one
linguistic level and more than one computational linguistics field/task. We would like to thank the
general chairs and the local organizers, without whom it would have been impossible to put together
such a strong demonstrations program.
We hope you will find these proceedings useful and that the software systems presented here will
inspire new ideas and approaches in your future work.
General Chair (GC):
Mirella Lapata (University of Edinburgh)
Co-Chairs (PCs):
Phil Blunsom (University of Oxford)
Alexander Koller (University of Potsdam)
Publication Chairs:
Maria Liakata
Chris Biemann
Workshop Chairs:
Laura Rimell
Richard Johansson
Tutorial Co-Chairs:
Alex Klementiev
Lucia Specia
Publicity Chair:
David Weir
Table of Contents
A Web-Based Interactive Tool for Creating, Inspecting, Editing, and Publishing Etymological Datasets
Johann-Mattis List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Multilingual CALL Framework for Automatic Language Exercise Generation from Free Text
Naiara Perez and Montse Cuadros . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Nematus: a Toolkit for Neural Machine Translation
Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler,
Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry and Maria
Nadejde . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Autobank: a semi-automatic annotation tool for developing deep Minimalist Grammar treebanks
John Torr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Marine Variable Linker: Exploring Relations between Changing Variables in Marine Science Literature
Erwin Marsi, Pinar Øzturk and Murat V. Ardelan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Building Web-Interfaces for Vector Semantic Models with the WebVectors Toolkit
Andrey Kutuzov and Elizaveta Kuzmenko . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
EACL 2017 Software Demonstrations Program
17.30–19.30 Session 1
A Web-Based Interactive Tool for Creating, Inspecting, Editing, and Publishing Etymological Datasets
Johann-Mattis List
Wednesday 5th April 2017 (continued)
17.30–19.30 Session 2
Thursday 6th April 2017 (continued)
Building Web-Interfaces for Vector Semantic Models with the WebVectors Toolkit
Andrey Kutuzov and Elizaveta Kuzmenko
COVER: Covering the Semantically Tractable Questions
Michael Minock
KTH Royal Institute of Technology, Stockholm, Sweden
Umeå University, Umeå, Sweden
minock@kth.se, mjm@cs.umu.se
Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3–7, 2017, pages 1–4
© 2017 Association for Computational Linguistics
Figure 1: Overall processing paths of COVER.
as stop assignments and are assigned to the empty element; 'cities' is assigned to the relation City and is marked as condition type; 'in' is assigned the foreign key attribute state from the relation City to the relation State and is marked as a condition type; 'Ohio' is assigned to the value 'Ohio' of attribute name of the relation State and is marked as a condition type. The knowledge to determine such mappings comes from a very simple domain lexicon which assigns phrases to database elements. (For example, 'Ohio' in the lexicon is assigned to the database element representing State.name='Ohio'.) Note that matching is based on word stems; thus, for example, 'cities' matches 'city'.

2.2 Phase 2: Determining Valid Mappings

Valid mappings must observe certain constraints. For example, the question "what is the capital of New York?" is valid when 'New York' maps to the state, but not when 'New York' maps to the city. "What is the population of the Ohio River?" has no valid mappings because, under the given database, there is no way to fit the population attribute with the River table. Specifically, the following are the six properties that valid mappings must observe:

1. There exists a unique focus element. There always exists a unique focus element, which is the attribute or relation upon which the question is seeking information. For efficiency, this is actually enforced in phase 1.

2. All relation focus markers are on mentioned relations. If a relation is a focus, then it must be mentioned. A relation is mentioned if either it is explicitly named in the question (e.g. 'cities') or a primary value¹ of the relation (e.g. 'New York' for the relation City) is within the question.

3. All attribute focus markers are on mentioned, rooted attributes. An attribute focus marker (e.g. 'what') must not only explicitly match an attribute (e.g. 'population'); such an attribute must also be rooted. An attribute is rooted if the relation of the attribute is mentioned.

4. Non-focus attributes satisfy correspondences. Unless they are the focus, attributes must pair with a value (e.g. in "cities with name Springfield"), or, in the case that the attribute is a foreign key (e.g. "cities in the state..."), the attribute must pair with the relation or primary value of the foreign key (e.g. "cities in Ohio").

5. Value elements satisfy correspondences. Values are either primary (e.g. "New York") or must be paired with either an attribute (e.g. "... city with the name of New York ...") or, via ellipsis, with a relation (e.g. "... the city New York").

6. All mentioned elements are connected. The elements assigned by the mapping must form a connected graph over the underlying database schema.

This leads to the core definition of this work:

Definition 1 (Semantically Tractable Question) For a given question q and lexicon L, q is semantically tractable if there exists at least one valid mapping over q.

2.3 Phase 3: Generating Logical Formulas

Given a set of valid mappings, COVER's third phase is to generate one or more MRL expressions for each valid mapping. To achieve this we unfold the connected graph of valid mappings (see property 6 in Section 2.2) into meaningful full graphs.

¹ Primary values (Li and Jagadish, 2014) stand, or for the most part stand, for a tuple of a relation. For example, 'Springfield' is a primary value of City even though it is not a key value.
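The lexicon lookup with stem matching described above can be made concrete in a few lines. This is a hypothetical sketch, not COVER's actual code: the lexicon entries, the element encoding, and the naive stemmer are all assumptions for illustration.

```python
# Hypothetical sketch of Phase-1 lexicon lookup with stem matching.
# The lexicon maps word stems to database elements; tokens without an
# entry receive the empty element (a "stop assignment").

LEXICON = {
    "city": ("relation", "City"),
    "state": ("relation", "State"),
    "population": ("attribute", "City.population"),
    "ohio": ("value", "State.name='Ohio'"),
    "in": ("foreign-key", "City.state -> State"),
}

def stem(word):
    """Very naive stemmer, just enough to map 'cities' to 'city'."""
    word = word.lower()
    if word.endswith("ies"):
        return word[:-3] + "y"
    if word.endswith("s"):
        return word[:-1]
    return word

def map_tokens(question):
    """Assign each token a database element; None is a stop assignment."""
    return [(tok, LEXICON.get(stem(tok)))
            for tok in question.rstrip("?").split()]

mapping = dict(map_tokens("what cities are in Ohio?"))
```

Here 'cities' reaches the relation City via its stem, while words such as 'are' fall through to the empty element.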
graphs. This is complicated in the self-joining case, where graphs will have multiple vertices for a given relation. For example, the valid mapping for "what states border states that border New York?" maps to two database relations: State and Borders. But the corresponding unfolded graph will be over three instantiations of State and two of Borders. Our algorithm gives a systematic and exhaustive process to compute all reasonable unfolded graphs. Then, for each unfolded graph, we generate the set of possible attachments of conditions and selections. In the end this gives a set of query objects which may be directly expressed in SQL for application over the database, or as first-order logic expressions for satisfiability testing via an automatic theorem prover.

Case                 Coverage        Avg/Med # interpretations
full                 780/880 (89%)   7.59/2
no-equiv reduction   780/880 (89%)   19.65/2

Table 1: Evaluation over GEOQUERY

3 Auxiliary Processes

Figure 1 shows two auxiliary processes that further constrain what are valid mappings as well as which MRL expressions are returned.

Theorem proving is also used to enforce pragmatic constraints. For example, we remove queries that do not bear information, or that have redundancies within them that violate Gricean principles. This is principally achieved by determining whether a set-fetching query necessarily generates a single or no tuple where the answer is already in the question. For example, a query retrieving "the names of states where the name of the state is New York and the state borders a state" does not bear information. Finally, it should be noted that one can add arbitrary domain rules (e.g. states have one and only one capital) to constrain deduction. This would allow more query equivalencies and cases of pragmatics violations to be recognized.
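The unfolding of a self-joining graph into separately aliased relation instantiations can be sketched as follows. The State/Borders schema follows the running example, but the function, alias naming, and rendered query shape are assumptions, not COVER's implementation.

```python
# Sketch: rendering an unfolded query graph as SQL, giving each relation
# instantiation its own alias. This is what a self-join such as
# "what states border states that border New York?" requires: three
# instantiations of State and two of Borders.

def render_sql(nodes, edges, conditions, select):
    """nodes: (alias, relation) pairs; edges/conditions: SQL predicates."""
    from_clause = ", ".join(f"{rel} AS {alias}" for alias, rel in nodes)
    where_clause = " AND ".join(edges + conditions)
    return f"SELECT {select} FROM {from_clause} WHERE {where_clause}"

sql = render_sql(
    nodes=[("s1", "State"), ("b1", "Borders"), ("s2", "State"),
           ("b2", "Borders"), ("s3", "State")],
    edges=["s1.name = b1.state1", "b1.state2 = s2.name",
           "s2.name = b2.state1", "b2.state2 = s3.name"],
    conditions=["s3.name = 'New York'"],
    select="s1.name",
)
```

Each unfolded graph yields one such query object; attachments of conditions and selections vary the `conditions` and `select` parts.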
5 Discussion

To apply COVER, consider our strong, though achievable, assumptions: a) the user's mental model matches the database; b) no word or phrase of a question maps only to incorrect element/type pairs; c) all words in the question are syntactically attached. Under such assumptions, every question that we answer is semantically tractable, and thus a correct interpretation is always included within any non-empty answer. Let us discuss the feasibility of these assumptions.

With respect to assumption a, it is difficult to determine if a user's mental model matches a database, but, in general, natural language interfaces do better when the schema is based on a conceptual (e.g. Entity-Relationship) or domain model, as is the case for GEOQUERY. Still, COVER is not yet fully generalized for EER-based databases (e.g. multi-attribute keys, isa hierarchies, union types, etc.). We shall study more corpora (e.g. QALD (Walter et al., 2012)) to see what type of conceptual models are required.

With respect to assumption b, an over-expanded lexicon can lead to spurious interpretations, but that is not so serious as long as the right interpretation still gets generated: any number of additional noisy entries could be added and our strong assumption b) would remain true. While we are investigating how to automatically generate 'adequate' lexicons (e.g. by adapting domain-independent ontologies, expanding domain-specific lists, or using techniques from automatic paraphrase generation), the question of how the lexicon is acquired and accessed (e.g. querying over an API would require probing for named entities rather than simple hash look-up, etc.) is orthogonal to the contribution of this work.

We make assumption c because we want to guarantee that COVER lives up to the promise of always having the correct interpretation within its non-empty result sets. Still, the control points for syntactic analysis are very clearly laid out in COVER. Given that interfaces should be usable by real users, keeping the interpretation set manageable while trying to keep the correct one in the set is useful, especially if a database is particularly ambiguous. Exploring methods to integrate off-the-shelf syntactic parsers to narrow the number of interpretations while not excluding correct interpretations will be future work. Specifically, we will evaluate which off-the-shelf parsers and which notions of 'attachment' perform best.

A final question is: is the semantically tractable class fundamental? COVER has generalized the class from the earlier PRECISE work, and nothing seems to block its further extension to questions requiring negation, circular self-joins, etc. Still, we expended considerable effort trying to extend the class to include queries with maximum cardinality conditions (e.g. "What are the states with the most cities?"). This effort foundered on defining a decidable semantics. But we also witnessed many cases where spurious valid mappings were let in while a correct valid mapping was not found ('most' seems to be a particularly sensitive word). Further study is required to determine the natural limit of the semantically tractable question class. How far can we go? Is there a limit? Our intuition says 'yes'. A related question is: is there a hierarchy of semantically tractable classes where the number of interpretations blows up as we extend the number of constructs we are able to handle? Again, our intuition says 'yes'. COVER will be the basis of the future study of these questions.

References

Ion Androutsopoulos, Graeme D. Ritchie, and Peter Thanisch. 1995. Natural language interfaces to databases. NLE, 1(1):29–81.

Ann A. Copestake and Karen Sparck Jones. 1990. Natural language interfaces to databases. Knowledge Eng. Review, 5(4):225–249.

Fei Li and H. V. Jagadish. 2014. Constructing an interactive natural language interface for relational databases. PVLDB, 8(1):73–84.

Michael Minock. 2017. Technical description of COVER 1.0. CoRR.

Ana-Maria Popescu, Oren Etzioni, and Henry A. Kautz. 2003. Towards a theory of natural language interfaces to databases. In IUI-03, January 12-15, 2003, Miami, FL, USA, pages 149–157.

Ana-Maria Popescu, Alex Armanasu, Oren Etzioni, David Ko, and Alexander Yates. 2004. Modern natural language interfaces to databases. In COLING, pages 141–147.

Sebastian Walter, Christina Unger, Philipp Cimiano, and Daniel Bär. 2012. Evaluation of a layered approach to question answering over linked data. In ISWC, Boston, MA, USA, pages 362–374.

Guogen Zhang, Wesley W. Chu, Frank Meng, and Gladys Kong. 1999. Query formulation from high-level concepts for relational databases. In UIDIS, pages 64–75.
Common Round: Application of Language Technologies
to Large-Scale Web Debates
Hans Uszkoreit, Aleksandra Gabryszak, Leonhard Hennig, Jörg Steffen,
Renlong Ai, Stephan Busemann, Jon Dehdari, Josef van Genabith,
Georg Heigold, Nils Rethmeier, Raphael Rubino, Sven Schmeier,
Philippe Thomas, He Wang, Feiyu Xu
Deutsches Forschungszentrum für Künstliche Intelligenz, Germany
firstname.lastname@dfki.de
Figure 1: Architecture of Common Round.
... lengthy and noisy information.

Besides the dialogue structure, assessing postings is an important aspect in the Common Round forum. The assessment reflects the community view of a main debate claim, an argument or evidence. The degrees of consent a user can express are based on the common 5-level Likert-type scale (Likert, 1932). Based on the ratings, a first-time viewer of a question can get a quick overview of ...

... bate participants. Additionally, our platform incorporates a web search to enable finding supporting evidence in external knowledge sources. Table 2 shows an example of the text analytics results and the association with external material.

[Table 2 (excerpt): the posting "Cannabis is a good way to reduce pain" with the NER result "substance disorder" and the relation MayTreatDisorder]
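The rating-based overview described above can be sketched as a small aggregation over Likert values. The summary format and the scale interpretation (1 = strong dissent, 5 = strong consent) are assumptions for illustration, not the platform's actual code.

```python
from collections import Counter

# Sketch: summarising 5-level Likert ratings for a debate claim so that a
# first-time viewer gets a quick overview (mean consent plus distribution).

def summarize(ratings):
    dist = Counter(ratings)
    mean = sum(ratings) / len(ratings)
    return {"mean": round(mean, 2),
            "distribution": {level: dist.get(level, 0) for level in range(1, 6)}}

overview = summarize([5, 4, 4, 2, 5, 3, 4])
```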
are marked. Our NMT is based on an encoder–decoder with attention design (Bahdanau et al., 2014; Johnson et al., 2016), using bidirectional LSTM layers for encoding and unidirectional layers for decoding. We employ a character-based approach to better cope with rich morphology and OOVs in Common Round user posts. We train on English–German data from WMT-16² and use transfer learning on 30K crawled noisy in-domain segments for Common Round domain adaptation. As our NMT system is new, we compare with state-of-the-art PBMT: while both perform in the top ranks in WMT-16 (Bojar et al., 2016), NMT is better on Common Round data (NMT from 29.1 BLEU on WMT-16 data to 22.9 car and 23.8 wind power, against 30.0 to 13.6 and 17.7 for PBMT) and better adapts using the supplementary crawled data (25.8 car and 28.5 wind against 14.1 and 19.3).³

7 Conclusion & Future Work

We presented Common Round, a new type of web-based debate platform supporting large-scale decision making. The dialogue model of the platform supports users in making clear and substantial debate contributions by labeling their posts as pro/con arguments, pro/con evidences, comments and answers. The user input is automatically analyzed and enhanced using language technology such as argument mining, text analytics, retrieval of external material and machine translation. The analysis results are fed back to the user interface as an additional information layer. For future work we suggest a more fine-grained automatic text analysis, for example by detecting claims and premises of arguments as well as types and validity of evidences. Moreover, we will conduct a thorough evaluation of the system.

Acknowledgments

This research was supported by the German Federal Ministry of Education and Research (BMBF) through the projects ALL SIDES (01IW14002) and BBDC (01IS14013E), by the German Federal Ministry of Economics and Energy (BMWi) through the projects SDW (01MD15010A) and SD4M (01MD15007B), and by the European Union's Horizon 2020 research and innovation programme through the project QT21 (645452).

² http://www.statmt.org/wmt16/translation-task.html
³ All EN→DE, similar for reverse.

References

D. Bahdanau, K. Cho, and Y. Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

F. Bex, J. Lawrence, M. Snaith, and C. Reed. 2013. Implementing the argument web. Commun. ACM, 56:66–73.

O. Bojar, R. Chatterjee, C. Federmann, Y. Graham, B. Haddow, M. Huck, A. Jimeno Yepes, P. Koehn, V. Logacheva, C. Monz, M. Negri, A. Neveol, M. Neves, M. Popel, M. Post, R. Rubino, C. Scarton, L. Specia, M. Turchi, K. Verspoor, and M. Zampieri. 2016. Findings of the 2016 Conference on Machine Translation. In Proc. of WMT.

F. Boltuzic and J. Snajder. 2015. Identifying prominent arguments in online debates using semantic textual similarity. In ArgMining@HLT-NAACL.

W. Drozdzynski, H. Krieger, J. Piskorski, U. Schäfer, and F. Xu. 2004. Shallow processing with unification and typed feature structures – foundations and applications. Künstliche Intelligenz, 1:17–23.

C. Egan, A. Siddharthan, and A. Z. Wyner. 2016. Summarising the points made in online political debates. In ArgMining@ACL.

M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, M. Hughes, and J. Dean. 2016. Google's multilingual neural machine translation system: Enabling zero-shot translation. CoRR, abs/1611.04558.

S. Krause, H. Li, H. Uszkoreit, and F. Xu. 2012. Large-scale learning of relation-extraction rules with distant supervision from the web. In ISWC.

Q. V. Le and T. Mikolov. 2014. Distributed representations of sentences and documents. In ICML.

R. Likert. 1932. A technique for the measurement of attitudes. Archives of Psychology, 22(140):1–55.

M. Mintz, S. Bills, R. Snow, and D. Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In IJCNLP.

G. Petasis and V. Karkaletsis. 2016. Identifying argument components through TextRank. In ArgMining@ACL.

P. J. Rousseeuw. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math., 20:53–65.

S. Salter, T. Douglas, and D. Kember. 2016. Comparing face-to-face and asynchronous online communication as mechanisms for critical reflective dialogue. JEAR, 0(0):1–16.

S. Schmeier. 2013. Exploratory Search on Mobile Devices. DFKI-LT Dissertation Series, Saarbruecken, Germany.
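The benefit of the character-based approach for OOVs, mentioned in the system description above, can be illustrated with a toy sketch. The vocabularies here are made up; this is not the system's actual preprocessing.

```python
# Sketch: why character-level models avoid most OOVs. A word-level
# vocabulary maps unseen forms (common in morphologically rich user
# posts) to <unk>, while a small character inventory still covers them.

word_vocab = {"the", "wind", "power", "car"}
char_vocab = set("abcdefghijklmnopqrstuvwxyzäöüß ")

def word_tokens(sentence):
    """Word-level view: unseen words collapse to <unk>."""
    return [w if w in word_vocab else "<unk>" for w in sentence.lower().split()]

def char_covered(sentence):
    """Character-level view: covered as long as every character is known."""
    return all(ch in char_vocab for ch in sentence.lower())

tokens = word_tokens("the Windkraft debate")   # 'Windkraft' is OOV here
covered = char_covered("the Windkraft debate")  # but character-covered
```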
A Web-Based Interactive Tool for Creating, Inspecting, Editing, and
Publishing Etymological Datasets
Johann-Mattis List
Max Planck Institute for the Science of Human History
Kahlaische Straße 10
07745 Jena
list@shh.mpg.de
[Figure 1: Panel interaction in the EDICTOR: editing, filtering, and activation link the EDICTOR's panels (e.g. Cognate, Alignments, Phonology)]

ID  DOCULECT  CONCEPT    TRANSCRIPTION  ...
1   German    Woldemort  valdəmar       ...
2   English   Woldemort  wɔldəmɔrt      ...
3   Chinese   Woldemort  fu⁵¹ti⁵¹mɔ³⁵   ...
4   Russian   Woldemort  vladimir       ...
... ...       ...        ...            ...
10  German    Harry      haralt         ...
11  English   Harry      hæri           ...
12  Russian   Harry      gali           ...
... ...       ...        ...            ...

Figure 2: Basic file format in the EDICTOR

2.2 User Interface

The EDICTOR is divided into different panels which allow the user to edit or analyse the data in different ways. The core module is the Word List panel, which displays the data in its original form and can be edited and analysed as one knows it from spreadsheet applications. For more complex tasks of data editing and analysis, such as cognate assignment or phonological analysis, additional panels are provided. Specific modes of interaction between the different panels allow for a flexible interaction between different tasks. Using drag-and-drop, users can arrange the panels individually or hide them completely. Figure 1 illustrates how the major panels of the EDICTOR interact with each other.

2.3 Technical Aspects

For direct online usage, the tool can be accessed via its project website.

3 Data Editing in the EDICTOR

3.1 Editing Word List Data

Editing data in the Word List panel of the EDICTOR is straightforward: values are inserted in text fields which appear when clicking on a given field or when browsing the data using the arrow keys of the keyboard. Additional keyboard shortcuts allow for quick browsing. For specific data types, automatic operations are available which facilitate the input or test what the user inserts. Transcription, for example, supports SAMPA input. The segmentation of phonetic entries into meaningful sound units is also carried out automatically. Sound segments are highlighted with specific background colors based on their underlying sound class, and sounds which are not recognized as valid IPA symbols are highlighted in warning colors (see the illustration in Figure 3). The users can decide themselves in which fields they wish to receive automatic support, and even Chinese input using an automatic Pīnyīn converter is provided.

[Figure 3: Conversion and segmentation in the EDICTOR: the SAMPA input wOld@mO:Rt is converted to IPA wɔldəmɔːʁt and segmented (ID, DOCULECT, CONCEPT, SEGMENTS; e.g. 4 English Woldemort w ɔ l d ə m ɔː ʁ t; 22 Chinese Woldemort f u ⁵¹ d i ⁵¹ m ɔ ³⁵)]
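The automatic sound-class highlighting described in Section 3.1 can be sketched as a simple lookup. The class table here is a made-up fragment for illustration; the EDICTOR's actual sound-class model is much richer.

```python
# Toy sketch of sound-class tagging as in the EDICTOR's Word List panel:
# each segment of a segmented transcription is mapped to a coarse class
# (used for background coloring); unknown segments are flagged in the
# way invalid IPA symbols get warning colors.

SOUND_CLASSES = {
    "w": "consonant", "l": "consonant", "d": "consonant",
    "m": "consonant", "r": "consonant", "t": "consonant",
    "ɔ": "vowel", "ə": "vowel", "a": "vowel", "i": "vowel",
}

def tag_segments(segments):
    return [(s, SOUND_CLASSES.get(s, "WARNING: unknown segment"))
            for s in segments]

tags = tag_segments("w ɔ l d ə m ɔ r t".split())
```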
3.2 Cognate Assessment

Defining which words in multilingual word lists are cognate is still a notoriously difficult task for machines (List, 2014). Given that the majority of datasets are based on manually edited cognate judgments, it is important to have tools which facilitate this task while at the same time controlling for typical errors. The EDICTOR offers two ways to edit cognate information, the first assuming complete cognacy of the words in their entirety, and the second allowing the user to assign only specific parts of words to the same cognate set. In order to carry out partial cognate assignment, the data needs to be morphologically segmented in a first stage, for example with help of the Morphology panel of the EDICTOR (see Section 4.2). For both tasks, simple and intuitive interfaces are offered which allow the user to browse through the data and to assign words to the same cognate set.

[Figure 4: Aligning words in the EDICTOR: alignment matrix for German, English, Russian, and Chinese, with an IGNORE column for unalignable parts]

3.3 Phonetic Alignment

Since historical-comparative linguistics is essentially based on sequence comparison (List, 2014), alignment analyses, in which words are arranged in a matrix in such a way that corresponding sounds are placed in the same column, underlie all cognate sets. Unfortunately, they are rarely made explicit in classical etymological dictionaries. In order to increase explicitness, the EDICTOR offers an alignment panel. The alignment panel is essentially realized as a pop-up window showing the sounds of all sequences which belong to the same cognate set. Users can edit the alignments by moving sound segments with the mouse. Columns of the alignment which contain unalignable parts (like suffixes or prefixes) can be explicitly marked as such. In addition to manual alignments, the EDICTOR offers a simple alignment algorithm which can be used to pre-analyse the alignments. Figure 4 shows an example for the alignment of four fictive cognates in the EDICTOR.

4 Data Analysis in the EDICTOR

4.1 Analysing Phonetic Data

Errors are inevitable in large datasets, and this holds also and especially for phonetic transcriptions. Many errors, however, can be easily spotted by applying simple sanity checks to the data. A straightforward way to check the consistency of the phonetic transcriptions in a given dataset is provided in the Phonology panel of the EDICTOR. Here, all sound segments which occur in the segmented transcriptions of one language are counted and automatically compared with an internal set of IPA segments. Counting the frequency of segments is very helpful to spot simple typing errors, since segments which occur only one time in the whole data are very likely to be errors. The internal segment inventory adds a structural perspective: if segments are found in the internal inventory, additional phonetic information (manner, place, etc.) is shown; if segments are missing, this is highlighted. The results can be viewed in tabular form and in the form of a classical IPA chart.

4.2 Analysing Morphological Data

The majority of words in all languages consist of more than one morpheme. If historically related words differ regarding their morpheme structure, this poses great problems for automatic approaches to sequence comparison, since the algorithms usually compare words in their entirety. German Großvater 'grandfather', for example, is composed of two different morphemes, groß 'large' and Vater 'father'. In order to analyse multi-morphemic words historically, it is important to carry out a morphological annotation analysis. In order to ease this task, the Morphology panel of the EDICTOR offers a variety of straightforward operations by which morpheme structure can be annotated and analysed at the same time. The core idea behind all operations is a search for similar words or morphemes in the same language. These colexifications are then listed and displayed in the form of a bipartite word family network in which words are linked to morphemes, as illustrated in Figure 5. The morphology analysis in the EDICTOR is no miracle cure for morpheme detection, and morpheme boundaries still need to be annotated by the user. However, the dynamically produced word family networks as well as the explicit listing of words sharing the same subsequence of sounds greatly facilitate this task.
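The sanity check from Section 4.1, counting all segments of a language and flagging those that occur only once, can be sketched in a few lines (the data and the hapax cutoff are illustrative, not the EDICTOR's code):

```python
from collections import Counter

# Sketch of the Phonology panel's consistency check: count every segment
# in one language's segmented transcriptions; segments that occur exactly
# once ("hapax" segments) are likely transcription errors.

def suspicious_segments(transcriptions):
    counts = Counter(seg for word in transcriptions for seg in word.split())
    return sorted(seg for seg, n in counts.items() if n == 1)

words = ["v a l d ə m a r", "h a r a l t", "v l a d i m i r1"]
suspicious = suspicious_segments(words)  # 'r1' occurs once: a likely typo
```

In the EDICTOR such segments would additionally be compared against the internal IPA inventory to separate rare-but-valid sounds from genuine typos.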
[Figure 5 shows the words faːtər «the father», ʃtiːf+faːtər «the stepfather», ɡroːs+mʊtər «the grandmother», and ɡroːs+faːtər «the grandfather» linked to the morphemes FATHER, LARGE, NON-BIOLOGICAL MOTHER, and FATHER-OF-SPOUSE.]

Figure 5: Word family network in the EDICTOR: The morphemes (in red) link the words around German Großvater 'grandfather' (in blue).

4.3 Analysing Sound Correspondences

Once cognate sets are identified and aligned, searching for regular sound correspondences in the data is a straightforward task. The Correspondences panel of the EDICTOR allows the user to analyse sound correspondence patterns across pairs of languages. In addition to a simple frequency count, however, conditioning context can be included in the analysis. Context is modeled as a separate string that provides abstract context symbols for each sound segment of a given word. This means essentially that context is handled as an additional tier of a sequence. This multi-tiered representation is very flexible and also allows the user to model suprasegmental context, like tone or stress. If users do not provide their own tiers, the EDICTOR employs a default context model which distinguishes consonants in syllable onsets from consonants in syllable offsets.

5 Customising the EDICTOR

The EDICTOR can be configured in multiple ways, be it while editing a dataset or before loading the data. The latter is handled via URL parameters passed to the URL that loads the application. In order to facilitate the customization procedure, a specific panel for customisation allows the users to define their default settings and creates a URL which users can bookmark to have quick access to their preferred settings.

The EDICTOR can be loaded in read-only mode by specifying a "publish" parameter. Additionally, server-side files can be directly loaded when loading the application. This makes it very simple and straightforward to use the EDICTOR to publish raw etymological datasets in a visually appealing format, as can be seen from this exemplary URL: http://edictor.digling.org?file=Tujia.tsv&publish=true&preview=500.

This paper presented a web-based tool for creating, inspecting, editing, and publishing etymological datasets. Although many aspects of the tool are still experimental, and many problems still need to be solved, I am confident that – even in its current form – the tool will be helpful for those working with etymological datasets. In the future, I will develop the tool further, both by adding more useful features and by increasing its consistency.

Acknowledgements

This research was supported by the DFG research fellowship grant 261553824 "Vertical and lateral aspects of Chinese dialect history". I thank Nathan W. Hill, Guillaume Jacques, and Laurent Sagart for testing the prototype.

References

Russell D. Gray and Quentin D. Atkinson. 2003. Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature, 426(6965):435–439.

Gerhard Jäger. 2015. Support for linguistic macrofamilies from weighted alignment. PNAS, 112(41):12752–12757.

Johann-Mattis List and Robert Forkel. 2016. LingPy. Max Planck Institute for the Science of Human History, Jena. URL: http://lingpy.org.

Johann-Mattis List, Simon J. Greenhill, and Russell D. Gray. 2017. The potential of automatic word comparison for historical linguistics. PLOS ONE, 12(1):1–18.

Johann-Mattis List. 2014. Sequence comparison in historical linguistics. Düsseldorf University Press, Düsseldorf.

Guillaume Segerer and S. Flavier. 2015. RefLex. Laboratoire DDL, Paris and Lyon. URL: http://reflex.cnrs.fr.

Sergej Anatol'evič Starostin. 2000. STARLING. RGGU, Moscow. URL: http://starling.rinet.ru.

Supplementary Material

Supplements for this paper contain a demo video (https://youtu.be/IyZuf6SmQM4), the application website (http://edictor.digling.org), the source (v. 0.1, https://zenodo.org/record/48834), and the development version (http://github.com/digling/edictor).
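The multi-tiered context representation from Section 4.3 can be sketched by pairing an alignment with a context tier and counting correspondence patterns. The tier symbols (C/V) and the fictive cognate pair are illustrative assumptions, not the EDICTOR's default context model.

```python
from collections import Counter

# Sketch of multi-tiered sound-correspondence counting: a context tier
# assigns an abstract symbol to every alignment position, so each
# correspondence between two languages is counted together with its
# context. Gap positions ('-') are skipped.

def correspondences(aligned_a, aligned_b, context):
    """All three sequences are position-aligned; '-' marks a gap."""
    return Counter(
        (a, b, ctx)
        for a, b, ctx in zip(aligned_a, aligned_b, context)
        if a != "-" and b != "-"
    )

lang_a  = "d o - t ə r".split()   # fictive aligned cognates
lang_b  = "d ɔ x t a -".split()
context = "C V C C V C".split()   # C = consonant slot, V = vowel slot

counts = correspondences(lang_a, lang_b, context)
```

A richer tier (e.g. syllable onset vs. offset, or tone symbols) slots into the same `context` argument without changing the counting logic.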
WAT-SL: A Customizable Web Annotation Tool for Segment Labeling
Johannes Kiesel and Henning Wachsmuth and Khalid Al-Khatib and Benno Stein
Faculty of Media
Bauhaus-Universität Weimar
99423 Weimar, Germany
<first name>.<last name>@uni-weimar.de
Figure 1: Screenshot of an exemplary task selection page in WAT-SL, listing three assigned tasks.

In WAT-SL, the annotation process is split into tasks, usually corresponding to single texts. Together, these tasks form a project.

2.1 Annotating Segments

The annotation interface is designed with a focus on easy and efficient usage. It can be accessed with any modern web browser. In order to start, annotators require only a login and a password. When logging in, an annotator sees the task selection page that lists all assigned tasks, including the annotator's current progress in terms of the number of labeled segments (Figure 1). Completed tasks are marked green, and the web page automatically scrolls down to the first uncompleted task. This allows annotators to seamlessly interrupt and continue the annotation process.

After selecting a task to work on, the annotator sees the main annotation interface (Figure 2). The design of the interface aims at clarity and self-descriptiveness, following the templates of today's most popular framework for responsive websites, Bootstrap. As a result, we expect that many annotators will feel familiar with the style.

The central panel of the annotation interface shows the text to be labeled and its title. One design objective was to obtain a non-intrusive annotation interface that remains close to just displaying the text in order to maximize readability. As shown in Figure 2, we decided to indicate segments only by a shaded background and a small button at the end. To ensure a natural text flow, line breaks are possible within segments. The button reveals a menu for selecting the label. When moving the mouse cursor over a label, the label description is displayed to prevent a faulty selection. Once a segment is labeled, its background color changes, and the button displays an abbreviation of the respective label. To assist the annotators in forming a mental model of the annotation interface, the background colors of labeled segments match the label colors in the menu. All labels are saved automatically, avoiding any data loss in case of power outages, connection issues, or similar problems.

In some cases, texts might be over-segmented, for example due to an automatic segmentation. If this is the case, WAT-SL allows annotators to mark a segment as being continued in the next segment. The interface will then visually connect these segments (cf. the buttons showing "->" in Figure 2).

Finally, the annotation interface includes a text box for leaving comments to the project organizers. To simplify the formulation of comments, each segment is numbered, with the number being shown when the mouse cursor is moved over it.

2.2 Curating Annotations

After an annotation process is completed, a curation phase usually follows in which the annotations of different annotators are consolidated into one. The WAT-SL curation interface enables an efficient curation by mimicking the annotation interface with three adjustments (Figure 3): First, segments for which the majority of annotators agreed on a label are pre-labeled accordingly. Second, the menu shows for each label how many annotators chose it. And third, the label description shows (anonymized) which annotator chose the label, so that curators can interpret each label in its context. The curation may be accessed under the same URL as the annotation in order to allow annotators of some tasks to be curators of other tasks.

2.3 Running an Annotation Project

WAT-SL is a platform-independent and easily deployable standalone Java application, with its few configurations stored in a simple "key = value" file. Among others, annotators are managed in this file by assigning a login, a password, and a set of tasks to each of them. For each task, the organizer of an annotation project creates a directory (see below). WAT-SL uses the directory name as the task name throughout. Once the Java archive file we provide is executed, it reads all configurations and starts a server. The server is immediately ready to accept requests from the annotators.
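A configuration in the "key = value" format described in Section 2.3 might look as follows. This is a hypothetical sketch: the key names are illustrative, since WAT-SL's actual configuration keys are not given in the text.

```ini
# Hypothetical WAT-SL-style configuration; key names are illustrative.
# Each annotator gets a login, a password, and a set of assigned tasks;
# task names correspond to the per-task directory names.
annotator.alice.password = s3cret
annotator.alice.tasks = editorial-01, editorial-02
annotator.bob.password = hunter2
annotator.bob.tasks = editorial-02, editorial-03

# Server settings
port = 8080
```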
Figure 2: Screenshot of the annotation interface of WAT-SL, capturing the annotation of one news editorial within the argumentative segment labeling project described in Section 3. In particular, the screenshot illustrates how the annotator selects a label (Testimony) for one segment of the news editorial.
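The majority-based pre-labeling used in the curation interface (Section 2.2) can be sketched as follows; this is a minimal illustration of the idea, not the authors' Java implementation.

```python
from collections import Counter

def curate(segment_labels):
    """Pre-label segments where a strict majority of annotators agree.

    segment_labels: one list of labels per segment (one label per annotator).
    Returns (pre_labeled, disputed): segment index -> agreed label, plus the
    indices a curator still has to decide on.
    """
    pre_labeled, disputed = {}, []
    for i, labels in enumerate(segment_labels):
        label, count = Counter(labels).most_common(1)[0]
        if count > len(labels) / 2:  # strict majority among annotators
            pre_labeled[i] = label
        else:
            disputed.append(i)
    return pre_labeled, disputed

# Three annotators, four segments (labels as in the case study of Section 3):
pre, open_cases = curate([
    ["assumption", "assumption", "anecdote"],  # majority: assumption
    ["anecdote", "testimony", "statistics"],   # no majority: goes to curator
    ["testimony", "testimony", "testimony"],   # unanimous
    ["other", "other", "anecdote"],            # majority: other
])
```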
3 Case Study: Labeling Argument Units

WAT-SL was already used successfully in the past, namely, for creating the Webis-Editorials-16 corpus with 300 news editorials split into a total of 35,665 segments, each labeled by three annotators and finally curated by the authors (Al-Khatib et al., 2016). In particular, each segment was assigned one of eight labels, where the labels include six types of argument units (e.g., assumption and anecdote), a label for non-argumentative segments, and a label indicating that a unit is continued in the next segment (see the "->" label in Section 2). On average, each editorial contains 957 tokens in 118 segments.

The annotation project of the Webis-Editorials-16 corpus included one task per editorial. The editorial's text had been pre-segmented beforehand with a heuristic unit segmentation algorithm. This algorithm was tuned towards oversegmenting a text in case of doubt, i.e., to avoid false negatives (segments that should have been split further) at the cost of more false positives (segments that need to be merged). Note that WAT-SL allows fixing such false positives using "->".

For annotation, we hired four workers from the crowdsourcing platform upwork.com. Given the segmented editorials, each worker iteratively chose one assigned editorial (see Figure 1), read it completely, and then selected the appropriate label for each segment in the editorial (see Figure 2). This annotation process was repeated for all editorials, with some annotators interrupting their work on an editorial and returning to it later on. All editorials were labeled by three workers, resulting in 106,995 annotations in total. The average time per editorial taken by a worker was ~20 minutes.

To create the final version of the corpus, we curated each editorial using WAT-SL. In particular, we automatically kept all labels with majority agreement, and let one expert decide on the others. Also, difficult cases where annotators tend to disagree were identified in the curation phase.

From the perspective of WAT-SL, the annotation of the Webis-Editorials-16 corpus served as a case study, which provided evidence that the tool is easy to learn and master. From the beginning, the workers used WAT-SL without noteworthy problems, as far as we could see from monitoring the interaction logs. Also, the comment area turned out to be useful, i.e., the workers left several valuable suggestions and questions there.

4 Conclusion and Future Work

This paper has presented WAT-SL, an open-source web annotation tool dedicated to segment labeling. WAT-SL is designed for easy configuration and deployment, while allowing project organizers to tailor its annotation interface to the project needs using standard web technologies. WAT-SL aims for simplicity by featuring a self-explanatory interface that does not distract from the text to be annotated, as well as by always showing the annotation progress at a glance and saving it automatically. The curation of the annotations can be done using exactly the same interface. In case of oversegmented texts, annotators can merge segments. Moreover, all relevant annotator interactions are logged, allowing project organizers to monitor and analyze the annotation process.

Recently, WAT-SL was used successfully on a corpus with 300 news editorials, where four remote annotators labeled 35,000 argumentative segments. In the future, we plan to create more dedicated annotation tools in the spirit of WAT-SL for other annotation types (e.g., relation identification). In particular, we plan to improve the annotator management functionalities of WAT-SL and reuse them for these other tools.

References

Khalid Al-Khatib, Henning Wachsmuth, Johannes Kiesel, Matthias Hagen, and Benno Stein. 2016. A News Editorial Corpus for Mining Argumentation Strategies. In Proceedings of the 26th International Conference on Computational Linguistics, pages 3433–3443.

Hamish Cunningham. 2002. GATE, a General Architecture for Text Engineering. Computers and the Humanities, 36(2):223–254.

Pontus Stenetorp, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou, and Jun'ichi Tsujii. 2012. BRAT: A Web-based Tool for NLP-assisted Text Annotation. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 102–107.

Seid Muhie Yimam, Chris Biemann, Richard Eckart de Castilho, and Iryna Gurevych. 2014. Automatic Annotation Suggestions and Custom Annotation Layers in WebAnno. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 91–96.
TextImager as a Generic Interface to R
Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3–7, 2017, pages 17–20.
© 2017 Association for Computational Linguistics
flexible architecture with parallel and distributed processing capabilities (Hemati et al., 2016). The front-end is a web application that makes NLP processes available in a user-friendly way with responsive and interactive visualizations (Hemati et al., 2016). TextImager already integrates many third-party tools. One of these is R. This section describes the technical integration and utilization of R in TextImager.

3.1 R / OpenCPU

R is a software environment for statistical computing. It compiles and runs on a variety of UNIX and Windows platforms. One of our goals is to provide an easy-to-use interface for R with a focus on NLP. To this end, we use OpenCPU to integrate R into TextImager. OpenCPU provides an HTTP API that exposes the functionalities of R-packages (Ooms, 2014). The OpenCPU software can be used directly in R; alternatively, it can be installed on a server. We decided for the latter variant. In addition, we used the opencpu.js JavaScript library, which simplifies API use in JavaScript and allows for calling R functions directly from TextImager.

To minimize the communication effort between client and server and to encapsulate R scripts from TextImager's functionality, we created a so-called TextImager-R-package that takes TextImager data, performs all R-based calculations and returns the results. This package serves for converting any TextImager data to meet the input requirements of any R-package. In this way, we can easily add new packages to TextImager without changing the HTTP request code. Because some data and models have a long build time, we used OpenCPU's session feature to keep this data on the server and access it in future sessions, so that we do not have to recreate it. This allows the user to quickly execute several methods, even in parallel, without recalculating them each time.

3.3 OpenCPU Output Integration

Visualizing the results of a calculation or sensitivity analysis is an important task. That is why we provide interactive visualizations to make the information available to the user more comprehensible. This allows the user to interact with the text and the output of R, for example, by highlighting any focal sentence in a document by hovering over a word or sentence graph generated by means of R.

3.4 R-packages

This section gives an overview of the R-packages embedded into the pipeline of TextImager.

tm The tm-package (Feinerer and Hornik, 2015) is a text mining R-package containing preprocessing methods for data importing, corpus handling, stopword filtering and more.

lda The lda-package (Chang, 2015) provides an implementation of Latent Dirichlet Allocation algorithms. It is used to automatically identify topics in documents, to get top documents for these topics and to predict document labels. In addition to the tabular output, we developed an interactive visualization which assigns the decisive words to the appropriate topic (see Figure 1). This visualization makes it easy to differentiate between the two topics and see which words characterize these topics. In our example, we have recognized words such as players, season and game as one topic (sportsman) and party, state and politically as a different topic (politics). The parameters of every package can be set at runtime. The utilization and combination of TextImager and R makes it possible to calculate topics not only based on wordforms, but also taking other features into account, like lemma, POS tags, morphological features and more.
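The OpenCPU RPC convention used for the integration (an HTTP POST to /ocpu/library/{package}/R/{function}) can be sketched from the client side as follows. This is a generic sketch, not TextImager's opencpu.js code: the server address and port are assumptions, and the request is only constructed here, not sent.

```python
import json
import urllib.request

# Assumed address of a local OpenCPU server; adjust to your deployment.
OCPU = "http://localhost:5656/ocpu"

def build_call(package, function, **args):
    """Build an OpenCPU RPC request: POST /ocpu/library/{pkg}/R/{fn}/json.

    The /json postfix asks OpenCPU to return the function result as JSON.
    """
    url = f"{OCPU}/library/{package}/R/{function}/json"
    data = json.dumps(args).encode()
    return urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )

# Example: prepare a call to stats::rnorm(n = 3) on the server.
req = build_call("stats", "rnorm", n=3)
# urllib.request.urlopen(req) would execute it on a running OpenCPU server.
```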
Figure 1: Words assigned to their detected topics

(…) and soccer players have been clustered into their own groups.

igraph The igraph-package (Csardi and Nepusz, 2006) provides multiple network analysis tools. We created a document network by linking each word with its five most similar words based on word embeddings. The igraph-package also allows laying out the graph and calculating different centrality measures. In the end, we can export the network in many formats (GraphML, GEXF, JSON, etc.) and edit it with graph editors. We also built an interactive network visualization (see Figure 3) using sigma.js to make it interoperable with TextImager.

Figure 3: Network graph based on word embeddings

tidytext The tidytext package (Silge and Robinson, 2016) provides functionality to create datasets following the tidy data principles (Wickham, 2014). We used it with our tm-based corpus to calculate TF-IDF information for documents. In Figure 4 we see the tabular output with information like tf, idf and tf-idf.

stringdist The stringdist-package (van der Loo, 2014) implements distance calculation methods like Cosine, Jaccard, OSA and others. We implemented functionality for calculating sentence similarities and provide an interactive visual representation. Each node represents a sentence of the selected document and the links between them represent the similarity of those sentences. The thicker the links, the more similar the sentences are. By interacting with the visualization, corresponding sections of the document panel get highlighted, so that the similar sentences can be seen (see Figure 5). The bidirectional interaction functionality enables easy comparability.

Figure 5: Depiction of sentence similarities.

stats We used functions from the R-package stats (R Development Core Team, 2008) to calculate a hierarchical cluster analysis based on the sentence similarities. This allows us to cluster similar sentences and visualize them with an interactive dendrogram. In Figure 6 we selected one of these clusters, and the document panel immediately adapts and highlights all the sentences in this cluster.

Figure 6: Similarity-clustered sentences.

An interesting side effect of integrating these tools into TextImager's pipeline is that their outputs can be combined in a way that yields higher-level text annotations and analyses. In this way, we contribute to an integration of two heavily expanding areas, namely NLP and statistical modeling.

4 Future work

In already ongoing work, we focus on big data as exemplified by Wikipedia. We also extend the number of built-in R-packages in TextImager.

5 Scope of the Software Demonstration

The beta version of TextImager is online. To test the functionalities of R as integrated into TextImager, use the following demo: http://textimager.hucompute.org/index.html?viewport=demo&file=R-Demo.xml.

References

Jonathan Chang. 2015. lda: Collapsed Gibbs Sampling Methods for Topic Models. R package version 1.4.2.

Gabor Csardi and Tamas Nepusz. 2006. The igraph software package for complex network research. InterJournal, Complex Systems:1695.

Maciej Eder, Jan Rybicki, and Mike Kestemont. 2016. Stylometry with R: a package for computational text analysis. R Journal, 8(1):107–121.

Ingo Feinerer and Kurt Hornik. 2015. tm: Text Mining Package. R package version 0.6-2.

David Ferrucci and Adam Lally. 2004. UIMA: an architectural approach to unstructured information processing in the corporate research environment. Natural Language Engineering, 10(3-4):327–348.

Fritz Günther, Carolin Dudschig, and Barbara Kaup. 2015. LSAfun: An R package for computations based on latent semantic analysis. Behavior Research Methods, 47(4):930–944.

Wahed Hemati, Tolga Uslu, and Alexander Mehler. 2016. TextImager: a distributed UIMA-based system for NLP. In Proceedings of the COLING 2016 System Demonstrations.

Jeroen Ooms. 2014. The OpenCPU system: Towards a universal interface for scientific computing through separation of concerns. arXiv:1406.4806 [stat.CO].

R Development Core Team. 2008. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.

Julia Silge and David Robinson. 2016. tidytext: Text mining and analysis using tidy data principles in R. JOSS, 1(3).

M.P.J. van der Loo. 2014. The stringdist package for approximate string matching. The R Journal, 6:111–122.

Hadley Wickham. 2014. Tidy data. Journal of Statistical Software, 59(1):1–23.

Fridolin Wild. 2015. lsa: Latent Semantic Analysis. R package version 0.73.1.
GraWiTas: a Grammar-based Wikipedia Talk Page Parser
Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3–7, 2017, pages 21–24.
© 2017 Association for Computational Linguistics
2 Wikipedia Talk Pages

The talk page of a Wikipedia article is the place where editors can discuss issues with the article content and plan future contributions. There is a talk page for every article on Wikipedia. As any other page in Wikipedia, it can be edited by anybody by manipulating the underlying text file, which uses Wiki markup, a dedicated markup syntax. When a user visits the talk page, this syntax file is interpreted by the Wikipedia template engine and turned into the displayed HTML.

Although Wiki markup includes many different formatting (meta-)commands, the commands used on talk pages are fairly basic. Some guidelines as to how to comment on talk pages are defined in the Wikipedia Talk Page Guidelines2. These rules specify, for example, that new topics should be added to the bottom of the page, that a user may not alter or delete another user's post, that a user should sign a post with their username and a timestamp, and that, to indicate which older post they are referring to, users should indent their own post.

The following snippet gives an example of the talk page syntax:

== Meaning of second paragraph ==
I don't understand the second paragraph ... [[User:U1|U1]] [[User Talk:U1|talk]] 07:24, 2 Dec 2016 (UTC)
:I also find it confusing...[[User:U2|U2]] 17:03, 3 Dec 2016 (UTC)
::Me too... [[User:U3|U3]] [[User Talk:U3|talk]] 19:58, 3 Dec 2016 (UTC)
:LOL, the unit makes no sense [[User:U4|U4]] [[User Talk:U1|talk]] 00:27, 6 Dec 2016 (UTC)
Is the reference to source 2 correct? [[User:U3|U3]] [[User Talk:U3|talk]] 11:41, 4 Dec 2016 (UTC)

The first line marks the beginning of a new discussion topic on the page. The following lines contain comments on this topic. All authors signed their comments with a signature consisting of some variation of a link to their user profile page (in Wiki markup, hyperlinks are wrapped in '[[' and ']]') and a timestamp. User U2 replies to the first post and thus indents their text by one tab; in Wiki markup this is done with a colon ':'. The third comment, which is a reply to the second one, is indented by two colons, leading to more indentation on the page when it is displayed.

While this structure is easily understood by humans, parsing talk page discussions automatically is far from trivial for a number of reasons. The discussion is not stored in a structured database format, and many editors use slight variations of the agreed-upon syntax when writing their comments. For example, there are multiple types of signatures that mark the end of a comment. Some editors do not indent correctly or use other formatting commands in the Wikipedia arsenal.

In addition, some Wikipedia talk pages contain transcluded content. This means that the content of a different file is pasted into the file at the specified position. Also, pages that become "too large" are archived3, meaning that the original page is moved to a different name and a new, empty page is created in its place.

3 GraWiTas

Our parser, GraWiTas, consists of three components, covering the whole process of retrieving and parsing Wikipedia talk pages. The first two components gather and preprocess the needed data. They differ in the data source used: the crawler component extracts the talk page content of given Wikipedia URLs, while the dump component processes full Wikipedia XML dumps. The actual parsing is done in the core parser component, where all comments from the Wiki markup of a talk page are extracted and exported into one of several output formats.

3.1 Core Parser Component

The main logic of the core parser lies in a system of rules that essentially defines what to consider as a comment on a talk page. Simply put, a comment is a piece of text, possibly with an indentation, followed by a signature and some line ending. As already discussed, signatures and indentations are relatively fuzzy concepts, which means that the rules need to define a lot of cases and exceptions. There are also some template elements in the Wikipedia syntax that can change the interpretation of a comment, e.g. outdents4.

Grammars are a theoretical concept that matches such a rule system very well. All different cases can be specified in detail, and it is easy to build more complex rules out of multiple

2 https://en.wikipedia.org/wiki/Wikipedia:Talk_page_guidelines, as seen on Feb. 14, 2017
3 https://en.wikipedia.org/wiki/Help:Archiving_a_talk_page, as seen on Feb. 14, 2017
4 https://en.wikipedia.org/wiki/Template:Outdent, as seen on Feb. 14, 2017
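The comment rule described in Section 3.1 (indentation, text, signature, line ending) can be approximated with a toy regular expression. This is an illustration only: the actual parser uses a full grammar and handles many more signature and indentation variants than this single pattern.

```python
import re

# Toy version of the rule: optional indentation (colons), comment text,
# a signature (user-page link), optional extra links, and a timestamp.
COMMENT = re.compile(
    r"^(?P<indent>:*)"                          # reply depth = number of colons
    r"(?P<text>.*?)\s*"                         # the comment text itself
    r"\[\[User:(?P<user>[^|\]]+)\|[^\]]*\]\]"   # link to the user profile page
    r".*?"                                      # optional extra links (user talk)
    r"(?P<time>\d{2}:\d{2}, \d{1,2} \w{3} \d{4}) \(UTC\)\s*$",
    re.M,
)

page = """== Meaning of second paragraph ==
I don't understand the second paragraph ... [[User:U1|U1]] [[User Talk:U1|talk]] 07:24, 2 Dec 2016 (UTC)
:I also find it confusing...[[User:U2|U2]] 17:03, 3 Dec 2016 (UTC)"""

comments = [(len(m["indent"]), m["user"], m["time"])
            for m in COMMENT.finditer(page)]
```

Note how the topic header line yields no match, while each comment line yields its indentation depth, author, and timestamp.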
Figure 1: Schematic representation of the core parser component: A table (top right) is extracted from a talk page (top left) using a grammar-based rule set. It can then be transformed into various output formats, e.g. a one-mode (user- / comment-) graph (bottom left and right) or a two-mode graph (bottom center).

The extracted table shown in the figure:

ID | Reply To | User | Comment | Time
1 | - | U1 | I don’t understand the second paragraph. | 07:24, 2 Dec 2016
2 | 1 | U2 | I also find it confusing... | 17:03, 3 Dec 2016
3 | 2 | U3 | Me too... | 19:58, 3 Dec 2016
4 | 2 | U4 | LOL, the unit makes no sense. | 00:27, 6 Dec 2016
5 | - | U3 | Is the reference to source 2 correct? | 11:41, 4 Dec 2016
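The transformation from the extracted comment table to a one-mode user graph, as depicted in Figure 1, can be sketched as follows; this is a toy illustration of the output format, not GraWiTas's implementation.

```python
# Derive a one-mode user graph from the extracted comment table,
# linking the author of each reply to the author of its parent comment.
comments = {            # id -> (reply_to, user), values from the table above
    1: (None, "U1"),
    2: (1, "U2"),
    3: (2, "U3"),
    4: (2, "U4"),
    5: (None, "U3"),
}

user_edges = {}         # (replier, parent author) -> number of replies
for reply_to, user in comments.values():
    if reply_to is not None:
        parent_user = comments[reply_to][1]
        edge = (user, parent_user)
        user_edges[edge] = user_edges.get(edge, 0) + 1
```

A two-mode graph would instead keep both users and comments as nodes, linking each user to the comments they wrote and each comment to its parent.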
in the dump or only some particular ones characterised by a list of article titles.

The dump component can be used for large-scale studies that look at all talk pages of Wikipedia at once. Compared to the crawler, obtaining the Wiki markup is faster. However, this comes at the price that users have to download and maintain the large XML dump files.

4 Evaluation

We ran a very small evaluation study to assess the speed and accuracy of our core parser. For analysing the speed, we generated large artificial talk pages from smaller existing ones and measured the time our parser took. Using an Intel(R) Core(TM) i7 M 620 @ 2.67GHz processor, we obtained parsing times under 50 ms for file sizes smaller than 1 MB (∼2,000 comments). For files up to 10 MB (∼20,000 comments) it took around 300 ms. This leads to an overall parsing speed of around 30 MB/s for average-sized comments. However, real-world talk pages very rarely even pass the 1 MB mark.

To evaluate accuracy, we picked a random article with a talk page and verified the extracted comments manually. Hereby, an accuracy of 0.97 was achieved. Although such a small evaluation is of course far from comprehensive, it shows that our parser is, on the one hand, fast enough to parse large numbers of talk pages and, on the other hand, yields a high enough accuracy for real-world applications.

5 Related Work

There have been previous attempts at parsing Wikipedia talk pages. Massa (2011) focused on user talk pages, where the conventions are different from article talk pages. Ferschke et al. (2012) rely on indentation for extracting relationships between comments, but use the revision history to identify the authors and comment creation dates. However, they do not offer any information on whether archived, transcluded and outdented content is handled correctly, and their implementation has not been made public. Laniado et al. (2011) infer the structure of discussions from indentation, similar to our approach. They use user signatures to infer comment metadata (author and date). Their parser is available on GitHub6. However, it does not handle transcluded talk page content, nor the outdent template. Their parser outputs a CSV file with custom fields.

6 Conclusion

The possibility of parsing Wikipedia talk pages into networks provides many new avenues for NLP researchers interested in collaboration patterns. A usable parser is a prerequisite for this type of research. Unlike previous implementations, our parser correctly handles many quirks of Wikipedia talk page syntax, from archived and transcluded talk page contents to outdented comments. It can produce output in a number of standardised formats, including GML, as well as a custom text file format. Finally, it requires very little initial setup. The GraWiTas source code is publicly available7, together with a web app demonstrating the core parser component.

Acknowledgments

This work was supported by the Deutsche Forschungsgemeinschaft (DFG) under grant No. GRK 2167, Research Training Group "User-Centred Social Media".

References

Oliver Ferschke, Iryna Gurevych, and Yevgen Chebotar. 2012. Behind the article: Recognizing dialog acts in Wikipedia talk pages. In Proc. of EACL'12, pages 777–786, Stroudsburg, PA, USA. ACL.

Bryan Ford. 2004. Parsing expression grammars: A recognition-based syntactic foundation. SIGPLAN Not., 39(1):111–122.

David Laniado, Riccardo Tasso, Yana Volkovich, and Andreas Kaltenbrunner. 2011. When the Wikipedians talk: Network and tree structure of Wikipedia discussion pages. In Proc. of ICWSM, Barcelona, Spain, July 17-21, 2011.

Jun Liu and Sudha Ram. 2011. Who does what: Collaboration patterns in the Wikipedia and their impact on article quality. ACM Trans. Manag. Inform. Syst., 2(2).

Paolo Massa. 2011. Social networks of Wikipedia. In HT'11, Proc. of ACM Hypertext 2011, Eindhoven, The Netherlands, June 6-9, 2011, pages 221–230.

Fernanda B. Viegas, Martin Wattenberg, Jesse Kriss, and Frank Van Ham. 2007. Talk before you type: Coordination in Wikipedia. In Proc. of HICSS'07. IEEE.

6 https://github.com/sdivad/WikiTalkParser, as seen on Feb. 14, 2017
7 https://github.com/ace7k3/grawitas
TWINE: A real-time system for
TWeet analysis via INformation Extraction
Debora Nozza, Fausto Ristagno, Matteo Palmonari,
Elisabetta Fersini, Pikakshi Manchanda, Enza Messina
University of Milano-Bicocca / Milano
{debora.nozza, palmonari, fersini,
pikakshi.manchanda, messina}@disco.unimib.it
f.ristagno@campus.unimib.it
Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3–7, 2017, pages 25–28.
© 2017 Association for Computational Linguistics
Figure 1: TWINE system overview.
Figure 3: TWINE Map View snapshot.
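The decoupling described below, with Kafka absorbing the latency of the Information Extraction step and MongoDB storing source and processed data, can be illustrated with in-memory stand-ins. This is a toy sketch: a queue plays the role of the Kafka topic, a plain list plays the MongoDB collection, and the "entity extraction" is deliberately trivial.

```python
import json
import queue

tweet_stream = queue.Queue()   # stand-in for the Kafka topic
collection = []                # stand-in for the MongoDB collection

def fetch(raw_json):
    """Producer: push raw tweets onto the topic as fast as they arrive."""
    tweet_stream.put(raw_json)

def process_one():
    """Consumer: run (here: trivial) information extraction at its own pace."""
    tweet = json.loads(tweet_stream.get())
    # Naive placeholder for NEEL: treat capitalized words as entity mentions.
    tweet["entities"] = [w for w in tweet["text"].split() if w.istitle()]
    collection.append(tweet)

fetch('{"id": 1, "text": "Visiting Milan and Rome next week"}')
process_one()
```

Because the producer and consumer only share the queue, a slow extraction step never blocks the tweet fetching, which is exactly what the Kafka layer provides at scale.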
can exchange data reliably. The Apache Kafka platform5 permits us to store and process the data in a fault-tolerant way and to absorb the latency caused by the Information Extraction processing.

Database. All the source and processed data are stored in a NoSQL database. In particular, we chose a MongoDB6 database because of its flexibility, horizontal scalability, and a representation format that is particularly suitable for storing Twitter contents.

Frontend host and API web server. The presence of these two server-side modules is motivated by the need to make the TWINE user interface independent of its functionalities. In this way, we improve the modularity and flexibility of the entire system.

5 https://kafka.apache.org/
6 http://www.mongodb.org/

2.2 User Interface

TWINE provides two different visualisations of the extracted information: the Map View, which shows the different geo-tags associated with tweets in addition to the NEEL output, and the List View, which better emphasizes the relation between the text and its named entities.

The Map View (Figure 3) provides in the top panel a textual search bar where users can insert keywords related to their topic of interest (e.g. italy, milan, rome, venice). The user can also, from left to right, start and stop the stream fetching process, clear the current results, change the View, and apply semantic filters related to the geo-localization and KB resource characteristics, i.e. type and classification confidence score.

Then, in the left-hand panel the user can read the content of each fetched tweet (text, user information and recognized named entities) and
directly open it in the Twitter platform.

The center panel can be further divided into two sub-panels: the top one shows the information about the Knowledge Base resources related to the linked named entities present in the tweets (image, textual description, type as symbol, and the classification confidence score), and the bottom one provides the list of the recognized named entities for which no correspondence exists in the KB, i.e. NIL entities.

These two panels, the one that reports the tweets and the one with the recognized and linked KB resources, are responsive. For example, by clicking on the entity Italy in the middle panel, only tweets containing a mention of the entity Italy will be shown in the left panel. Conversely, by clicking on a tweet, the center panel will show only the related entities.

In the right-hand panel, the user can visualize the geo-tags extracted from the tweets: (i) the original geo-location where the post was emitted (green marker), (ii) the user-defined location from the user account's profile (blue marker), and (iii) the geo-location of the named entities extracted from the tweets, if the corresponding KB resource has latitude-longitude coordinates (red marker).

Finally, a text field is present at the top of the first two panels to filter the tweets and KB resources that match specific keywords.

The List View is reported in Figure 4. Differently from the Map View, the focus is on the link between the words, i.e. recognized named entities, and the corresponding KB resources. In the reported example, this visualisation is more intuitive for catching the meaning of Dolomites and Gnocchi, thanks to a direct connection between the named entities and the snippet and image of the associated KB resources.

3 Conclusion

We introduced TWINE, a system that provides an efficient real-time data analytics platform for streams of social media content. The system is supported by a scalable and modular architecture and by an intuitive and interactive user interface.

As future work, we intend to implement a distributed solution in order to manage huge quantities of data faster and more easily. Additionally, the currently integrated modules will be improved: the NEEL pipeline will be replaced by a multilingual and more accurate method, and the web interface will include more insights such as the user network information, a heatmap visualization and a time control filter.
28
Alto: Rapid Prototyping for Parsing and Translation

Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 29–32
© 2017 Association for Computational Linguistics

Formalism                                Reference                      Algebra(s)
Context-Free Grammars (CFGs)             (Hopcroft and Ullman, 1979)    String
Hyperedge Replacement Grammars (HRGs)    (Chiang et al., 2013)          Graph
Tree Substitution Grammars               (Sima'an et al., 1994)         String / Tree
Tree-Adjoining Grammars (TAGs)           (Joshi et al., 1975)           TAG string / TAG tree
Synchronous CFGs                         (Chiang, 2007)                 String / String
Synchronous HRGs                         (Chiang et al., 2013)          String / Graph
String-to-Tree Transducer                (Galley et al., 2004)          String / Tree
Tree-to-Tree Transducer                  (Graehl et al., 2008)          Tree / Tree
serve as abstract syntactic representations, along the lines of derivation trees in TAG.

Next, notice the column "english" in Fig. 2. This column describes how to interpret the derivation tree in an interpretation called "english". It first specifies a tree homomorphism, which maps derivation trees into terms over some algebra by applying certain rewrite rules bottom-up. For instance, the "boy" node in the example derivation tree is mapped to the term t1 = ∗(the, boy). The entire subtree then maps to a term of the form ∗(?1, ∗(wants, ∗(to, ?2))), as specified by the row for "wants2" in the grammar, where ?1 is replaced by t1 and ?2 is replaced by the analogous term for "go". The result is shown at the bottom of the middle panel in Fig. 3. Finally, this term is evaluated in the underlying algebra; in this case, a simple string algebra, which interprets the symbol ∗ as string concatenation. Thus the term in the middle panel evaluates to the string "the boy wants to go", shown in the "value" field.

Figure 2: An example IRTG with an English and a semantic interpretation (Alto screenshot).

The example IRTG also contains a column called "semantics". This column describes a second interpretation of the derivation tree, this time into an algebra of graphs. Because the graph algebra is more complex than the string algebra, the function symbols look more complicated. However, the general approach is exactly the same as before: the grammar specifies how to map the derivation tree into a term (bottom of the rightmost panel in Fig. 3), and then this term is evaluated in the respective algebra (here, the graph shown at the top of the rightmost panel).

Thus the example grammar is a synchronous grammar which describes a relation between strings and graphs. When we parse an input string w, we compute a parse chart that describes all grammatically correct derivation trees that interpret to this input string. We do this by computing a decomposition grammar for w, which describes all terms over the string algebra that evaluate to w; this step is algebra-specific. From this, we calculate a regular tree grammar for all derivation trees that the homomorphism maps into such a term, and then intersect it with the wRTG. These operations can be phrased in terms of generic operations on RTGs; implementing them efficiently is a challenge which we have tackled in Alto. We can compute the best derivation tree from the chart, and map it into an output graph. Similarly, we can also decode an input graph into an output string.

3 Algorithms in Alto

Alto can read IRTG grammars and corpora from files, and implements a number of core algorithms, including: automatic binarization of monolingual and synchronous grammars (Büchse et al., 2013); computation of parse charts for given input objects; computing the best derivation tree; computing the k-best derivation trees, along the lines of Huang and Chiang (2005); and decoding the best derivation tree(s) into output interpretations. Alto supports PCFG-style probability models with both maximum likelihood and expectation maximization estimation. Log-linear probability models are also available, and can be trained with maximum likelihood estimation. All of these functions are available through command-line tools, a Java API, and a GUI, seen in Figs. 2 and 3.

We have invested considerable effort into making these algorithms efficient enough for practical use. In particular, many algorithms for wRTGs in Alto are implemented in a lazy fashion, i.e. the rules of the wRTG are only calculated when needed; see e.g. Groschwitz et al. (2015; 2016). Obviously, Alto cannot be as efficient for well-established tasks like PCFG parsing as a parser that was implemented and optimized for this specific grammar formalism. Nonetheless, Alto is fast enough for practical use with treebank-scale grammars, and for less mainstream grammar formalisms it can be faster than specialized implementations. For instance, Alto is the fastest published parser for Hyperedge Replacement Grammars (Groschwitz et al., 2015). Alto contains multiple algorithms for computing the intersection and inverse homomorphism of RTGs, and a user can choose the combination that works best for their particular grammar formalism (Groschwitz et al., 2016).

The most recent version adds further performance improvements through the use of a number of pruning techniques, including coarse-to-fine parsing (Charniak et al., 2006). With these, Section 23 of the WSJ corpus can be parsed in a couple of minutes.

4 Extending Alto

As explained above, Alto can capture any grammar formalism whose derivation trees can be described with a wRTG, by interpreting these trees into different algebras. For instance, the difference between Context-Free and Tree-Adjoining Grammars in Alto is that CFGs use the simple string algebra outlined in Section 2, whereas for TAG we use a special "TAG string algebra" which defines string wrapping operations (Koller and Kuhlmann, 2012). All algorithms mentioned in Section 3 are generic and do not make any assumptions about what algebras are being used. As explained above, the only algebra-specific step is to compute decomposition grammars for input objects.

In order to implement a new algebra, a user of Alto simply derives a class from the abstract base class Algebra, which amounts to specifying the possible values of the algebra (as a Java class) and implementing the operations of the algebra as Java methods. If Alto is also to parse objects from this algebra, the class needs to implement a method for computing decomposition grammars for the algebra's values. Alto comes with a number of algebras built in, including string algebras for Context-Free and Tree-Adjoining grammars as well as tree, set, and graph algebras. All of these can be used in parsing. By parsing sets in a set-to-string grammar, for example, Alto can generate referring expressions (Engonopoulos and Koller, 2014).
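Alto's actual extension point is the Java abstract class Algebra; purely as an illustrative sketch (in Python, with hypothetical class and method names, not Alto's real API), the contract an algebra fulfils can be pictured as naming its operations and evaluating terms bottom-up, here for the string algebra from Section 2:

```python
# Illustrative sketch only -- Alto's real extension point is the Java
# abstract class Algebra; the names below are hypothetical analogues.
# An algebra supplies its operations; terms are evaluated bottom-up.

class Algebra:
    def operation(self, name):
        """Return the function implementing the named operation."""
        raise NotImplementedError

    def constant(self, symbol):
        """Interpret a constant symbol as a value of the algebra."""
        return symbol

    def evaluate(self, term):
        # A term is either a constant symbol or a tuple (operation, subterms...).
        if isinstance(term, str):
            return self.constant(term)
        op, *args = term
        return self.operation(op)(*[self.evaluate(a) for a in args])


class StringAlgebra(Algebra):
    """Values are strings; '*' concatenates its arguments with a space."""

    def operation(self, name):
        if name == "*":
            return lambda x, y: x + " " + y
        raise ValueError("unknown operation: " + name)


# The term produced by the "english" homomorphism in Section 2:
t1 = ("*", "the", "boy")
term = ("*", t1, ("*", "wants", ("*", "to", "go")))
print(StringAlgebra().evaluate(term))  # -> the boy wants to go
```

In Alto itself, such a class would additionally implement the decomposition-grammar computation so that objects of the new algebra can be parsed, as described above.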
Finally, Alto has a flexible system of input and output codecs, which can map grammars and algebra values to string representations and vice versa. A key use of these codecs is reading grammars in native input formats and converting them into IRTGs. Users can provide their own codecs to maximize interoperability with existing code.

5 Conclusion and Future Work

Alto is a flexible tool for the rapid implementation of new formalisms. This flexibility is based on a division of concerns between the generation and the interpretation of grammatical derivations. We hope that the research community will use Alto on newly developed formalisms and on novel combinations of existing algebras. Further, the toolkit's focus on balancing generality with efficiency supports research using larger datasets and grammars. In the future, we will implement algorithms for the automatic induction of IRTG grammars from corpora, e.g. string-to-graph corpora such as the AMRBank, for semantic parsing (Banarescu et al., 2013). This will simplify the prototyping process for new formalisms even further, by making large-scale grammars for them available more quickly. Furthermore, we will explore ways of incorporating neural methods into Alto, e.g. in terms of supertagging (Lewis et al., 2016).

Acknowledgements

This work was supported by DFG grant KO 2916/2-1.

References

L. Banarescu, C. Bonial, S. Cai, M. Georgescu, K. Griffitt, U. Hermjakob, K. Knight, P. Koehn, M. Palmer, and N. Schneider. 2013. Abstract meaning representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop.

M. Büchse, A. Koller, and H. Vogler. 2013. Generic binarization for parsing and translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics.

E. Charniak, M. Johnson, M. Elsner, J. Austerweil, D. Ellis, I. Haxton, C. Hill, R. Shrivaths, J. Moore, M. Pozar, and T. Vu. 2006. Multilevel coarse-to-fine PCFG parsing. In Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL.

D. Chiang, J. Andreas, D. Bauer, K. M. Hermann, B. Jones, and K. Knight. 2013. Parsing graphs with hyperedge replacement grammars. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics.

D. Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.

H. Comon, M. Dauchet, R. Gilleron, F. Jacquemard, D. Lugiez, S. Tison, M. Tommasi, and C. Löding. 2007. Tree Automata Techniques and Applications. Published online: http://tata.gforge.inria.fr/.

N. Engonopoulos and A. Koller. 2014. Generating effective referring expressions using charts. In Proceedings of the INLG and SIGDIAL 2014.

M. Galley, M. Hopkins, K. Knight, and D. Marcu. 2004. What's in a translation rule? In HLT-NAACL 2004: Main Proceedings.

J. Graehl, K. Knight, and J. May. 2008. Training tree transducers. Computational Linguistics, 34(3):391–427.

J. Groschwitz, A. Koller, and C. Teichmann. 2015. Graph parsing with s-graph grammars. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing.

J. Groschwitz, A. Koller, and M. Johnson. 2016. Efficient techniques for parsing with tree automata. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

J. E. Hopcroft and J. Ullman. 1979. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading, Massachusetts.

L. Huang and D. Chiang. 2005. Better k-best parsing. In Proceedings of the Ninth International Workshop on Parsing Technology.

A. K. Joshi, L. S. Levy, and M. Takahashi. 1975. Tree adjunct grammars. Journal of Computer and System Sciences, 10(1):136–163.

A. Koller and M. Kuhlmann. 2011. A generalized view on parsing and translation. In Proceedings of the 12th International Conference on Parsing Technologies.

A. Koller and M. Kuhlmann. 2012. Decomposing TAG algorithms using simple algebraizations. In Proceedings of the 11th International Workshop on Tree Adjoining Grammars and Related Formalisms (TAG+11).

M. Lewis, K. Lee, and L. Zettlemoyer. 2016. LSTM CCG parsing. In Proceedings of NAACL-HLT.

K. Sima'an, R. Bod, S. Krauwer, and R. Scha. 1994. Efficient disambiguation by means of stochastic tree substitution grammars. In Proceedings of the International Conference on New Methods in Language Processing.
CASSANDRA: A multipurpose configurable voice-enabled human-computer-interface

Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 33–36
© 2017 Association for Computational Linguistics
speech data from the user, processing and recognition (all using external services), as well as the (b) presentation of data to the user. The second entity acts as a server and is responsible for receiving text input from the endpoint, identifying a scenario from a limited set, extracting parameters, performing the necessary operations and returning information to the endpoint in the form of text, images and sound.

Because we wanted to keep our system as scalable as possible, Automatic Speech Recognition and Text-to-speech synthesis operations are carried out on external servers. The endpoint can be configured to access both Cassandra's own ASR and TTS services (described here) as well as the ASR and TTS servers provided by the Google Speech API.

2.1 Cassandra's TTS synthesis system

Text-to-speech synthesis refers to the conversion of any arbitrary (unrestricted) text into an audio signal. The unrestricted requirement makes this task difficult and, while state-of-the-art systems produce remarkable results in terms of intelligibility and naturalness, the recipe for producing synthetic voices which are indistinguishable from natural ones has not yet been found. This limitation is caused by the fact that natural language understanding still poses serious challenges for machines and by the fact that the surface form of the text does not provide sufficient cues for the prosodic realization of a spontaneous and expressive voice (Taylor, 2009).

TTS synthesis involves two major steps: (a) extraction of features from text and (b) conversion of symbolic representations into actual speech. The text processing step (step a) is usually composed of low-level text-processing tasks such as part-of-speech tagging, lemmatization, chunking, letter-to-sound conversion, syllabification etc. The signal processing task (step b) consists of selecting an optimal set of speech parameters (given the features provided by step a) and generating an acoustic signal that best fits these parameters. Text processing is performed by our in-house developed natural language processing pipeline called Modular Language Processing for Lightweight Applications (MLPLA) (Zafiu et al., 2015). Most of the individual modules have been thoroughly described in our previous work (Boros, 2013; Boros and Dumitrescu, 2015), so we only briefly list them here: the part-of-speech tagger is a neurally-inspired approach (Boros et al., 2013b) achieving 98.21% accuracy on the "1984" novel by G. Orwell using morphosyntactic descriptors (MSDs) (Erjavec, 2004) in the tag-set; all the lexical processing modules were described in (Boros, 2013). For syllabification we used the onset-nucleus-coda (ONC) tagging strategy proposed in (Bartlett et al., 2009), and chunking is performed using a POS-based grammar described in (Ion, 2007).

Our TTS implements both unit-selection (Boros et al., 2013a) and statistical parametric speech synthesis. For Cassandra we chose to use parametric synthesis with our implementation of the STRAIGHT filter (Kawahara et al., 1999).

Our TTS system primarily supports English, German, French and Romanian and can be accessed for demonstration purposes from our web portal (link will be provided after evaluation). For other languages Cassandra uses Google TTS, which provides statistical parametric speech synthesis for a large number of languages.

2.2 Cassandra's speech recognition

Cassandra's Automatic Speech Recognition (ASR) service was created by our partners, the Speech and Dialogue Research Laboratory (http://speed.pub.ro), and it is a scalable and extensible on-line Speech-to-Text (S2T) solution. A demo version of the S2T system resides in the cloud or in SpeeD's IT infrastructure and can be accessed on-line using the Web API or a proprietary protocol. Client applications can be developed using any technology that is able to communicate through TCP/IP sockets with the server application. The server application can communicate with several clients, serving them either simultaneously or sequentially.

The speech-to-text system can be configured to transcribe different types of speech, from a number of domains and languages. It can be configured to instantiate multiple speech recognition engines (S2T Transcribers), each of these engines being responsible for transcribing speech from a specific domain (for example, TV news in Romanian, medical-related speech in Romanian, country names in English, etc.). The speech recognition engines are based on the open-source CMU Sphinx speech recognition toolkit. The ASR systems with small vocabulary and grammar language models were evaluated in depth in (Cucu
et al., 2015), while the ASR system with large vocabulary and statistical language model was evaluated in depth in (Cucu et al., 2014). The former have word error rates between 0 and 5%, while the large-vocabulary ASR has a word error rate of about 16%.

2.3 Natural language processing

The core of Cassandra is driven by a natural language processing system that is responsible for receiving a text and, after analysis and parameter extraction, acting based on a predefined scenario. A scenario is the equivalent of a frame in frame-based dialogue systems. At any time, the end-user is able to alter Cassandra's configuration by creating scenarios or editing existing ones. A scenario is defined by its name and 3 components:

Component 1 contains a list of example sentences with parameters preceded by a special character '$' which is used to identify them (e.g. "Set the temperature on $value degrees.").

Component 2 tells the system what actions to take, depending on the scenario. Actions are described in a JSON with the following structure: (a) gadget - a unique identifier telling the system what module must be used for processing (e.g. a light, an A/C unit, the security system, etc.); (b) gadget parameters - a free-form JSON structure that tells the system what parameters to pass on to the gadget (e.g. turn the lights on=1 or off=0, set A/C to X degrees, etc.). The values of these parameters can be either constants, predefined system variables or actual values extracted from the text. A gadget can return a text response.

Component 3 is used for feedback. After processing, this JSON is returned to the end-point which initiated the session and is used to relay information back to the user. This is also a JSON with 2 attributes: (a) friendly response - a text response that, if present, will be synthesized and played back to the user; the friendly response can include system variables, parameters extracted from the text or the response returned by the gadget; (b) launch intent - a structure that informs the end-point that it must launch an external Android Intent with a given set of parameters; Cassandra comes with a predefined set of Intents for image preview, audio playback or video playback.

Scenario identification and parameter extraction are performed in three sequential steps: (a) the scenario is identified using Long Short-Term Memory neural networks and word embeddings; (b) parameter start/stop markers are then added using a deep neural network classifier trained on a window of 4 words; (c) next, parameter types are added using a window of six words, in which the parameters are ignored. Theoretically, parameter identification (c) and boundary detection (b) could be carried out simultaneously. However, we found that splitting this task into two steps works significantly better, mainly because it mitigates the data sparsity issue. An interesting observation is that using word embeddings in the scenario identification step allowed the system to identify the query "It's too hot" as an air conditioning activity, though the training data only contained examples such as "Set AC to $value", "Set the temperature to $value degrees" etc.

Because we wanted to keep the system as language-independent as possible, the only external resource required for language adaptation is a word embeddings file produced with word2vec (Mikolov et al., 2014). The process is simple: given a large enough corpus (e.g. a Wikipedia dump), one can obtain good embeddings by running the word2vec tool on the tokenized text.

3 Standard gadgets and usage scenarios

At submission time we had already implemented 4 demo scenarios:

Scenario 1 is a standard query system, in which the user can ask Cassandra various questions (for example "how is the weather") to which the system will respond using a Knowledge Base (KB). The KB was built using Wikipedia for Romanian and English.

Scenario 2 is a home automation system that allows the user to control home appliances using a natural voice interface. There are several predefined tasks such as multimedia, lighting, climate and security system control. Communication with these devices is performed through KNX (EN 50090, ISO/IEC 14543), which is a purpose-built standardized network communication protocol that is simple and scalable.

Scenario 3 is a set of hard-coded short questions/answers that increase Cassandra's appeal. For example, this Q/A set contains a game called "shrink". It enables the system to perform word associations in a manner similar to what a psychologist would ask a patient to do. This particular game proved to be very appealing to the users in
our test group, primarily because it made the system seem more "life-like".

Scenario 4 is a business-oriented demo. We integrated Cassandra with a Document Management System (DMS) and created an application that filters documents based on associated meta-data. The interaction is described using a large JSON in which all parameters are optional, missing parameters being ignored by the system. By doing so we are able to respond to queries such as: "Give me all invoices", "Give me all invoices issued to Acme Computers", "Give me all invoices issued to Acme Computers that are due in two weeks" etc. We find this scenario particularly interesting from a commercial point of view, especially because it gives users access to the DMS without forcing them to master any specific search skills. In a similar way, an application could be built to give users access to databases and enable them to construct complex queries and reports with no SQL knowledge whatsoever: "Give me a list of customers that bought tablets over the last 6 months and group them by age".

4 Conclusions and future development

Cassandra is an open-source, freely available personal assistant and, to our knowledge, the only system that is extensible to several languages, not only English, with minimal effort. We intend to further develop this system and extend the basic set of applications to suit the most common usage scenarios, as well as to offer more complex NLP-powered business scenarios that integrate with various existing software and hardware implementations. A necessary next step in the near future is to create an open-source repository that will enable the creation of a community of developers around it.

Acknowledgements: This work was supported by UEFISCDI, under grant PN-II-PT-PCCA-2013-4-0789, project Assistive Natural-language, Voice-controlled System for Intelligent Buildings (2013-2017).

References

Susan Bartlett, Grzegorz Kondrak, and Colin Cherry. 2009. On the syllabification of phonemes. In Proceedings of Human Language Technologies: North American Chapter of the Association for Computational Linguistics, pages 308–316. Association for Computational Linguistics.

Tiberiu Boros and Stefan Daniel Dumitrescu. 2015. Robust deep-learning models for text-to-speech synthesis support on embedded devices. In Proceedings of the 7th International Conference on Management of computational and collective intElligence in Digital EcoSystems, pages 98–102. ACM.

Tiberiu Boros, Radu Ion, and Stefan Daniel Dumitrescu. 2013a. The RACAI text-to-speech synthesis system.

Tiberiu Boros, Radu Ion, and Dan Tufis. 2013b. Large tagset labeling using feed forward neural networks. Case study on Romanian language. In ACL (1), pages 692–700.

Tiberiu Boros. 2013. A unified lexical processing framework based on the margin infused relaxed algorithm. A case study on the Romanian language. In RANLP, pages 91–97.

Horia Cucu, Andi Buzo, Lucian Petrică, Dragoş Burileanu, and Corneliu Burileanu. 2014. Recent improvements of the SpeeD Romanian LVCSR system. In Communications (COMM), 2014 10th International Conference on, pages 1–4. IEEE.

Horia Cucu, Andi Buzo, and Corneliu Burileanu. 2015. The SpeeD grammar-based ASR system for the Romanian language. Romanian Journal of Information Science and Technology, 18(1):33–53.

Tomaz Erjavec. 2004. MULTEXT-East version 3: Multilingual morphosyntactic specifications, lexicons and corpora. In LREC.

Radu Ion. 2007. Word sense disambiguation methods applied to English and Romanian. PhD thesis, Romanian Academy, Bucharest.

Hideki Kawahara, Ikuyo Masuda-Katsuse, and Alain de Cheveigné. 1999. Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds. Speech Communication, 27(3):187–207.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2014. word2vec.

Paul Taylor. 2009. Text-to-speech Synthesis. Cambridge University Press.
An Extensible Framework for Verification of Numerical Claims

Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 37–40
© 2017 Association for Computational Linguistics
highlight the ease of incorporating new information sources to fact-check, which may be unavailable during training. To validate the extensibility of the system, we complete an additional evaluation of the system using claims taken from Vlachos and Riedel (2015). We make the source code publicly available to the community.

2 Design Considerations

We developed our fact-checking approach in the context of the HeroX challenge (http://herox.com/factcheck), a competition organised by the fact-checking organization Full Fact (https://fullfact.org). The types of claims the presented system can fact-check were restricted to those which require looking up a value in a KB, similar to the one in Figure 1. To learn a model to perform the KB look-up (essentially a semantic parsing task), we extend the work of Vlachos and Riedel (2015), who used distant supervision (Mintz et al., 2009) to generate training data, obviating the need for manual labeling. In particular, we extend it to handle simple temporal expressions in order to fact-check time-dependent claims appropriately, e.g. population in 2015. While the recently proposed semantic parser of Pasupat and Liang (2015) is also able to handle temporal expressions, it makes the assumption that the table against which the claim needs to be interpreted is known, which is unrealistic in the context of fact checking.

Furthermore, the system we propose can predict relations from the KB on which the semantic parser has not been trained, a paradigm referred to as zero-shot learning (Larochelle et al., 2008). We achieve this by learning a binary classifier that assesses how well the claim "matches" each relation in the KB. Finally, another consideration in our design is algorithmic accountability (Diakopoulos, 2016), so that the predictions and the decision process used by the system are interpretable by a human.

3 System Overview

Given an unverified statement, the objective of this system is to identify a KB entry to support or refute the claim. Our KB consists of a set of un-normalised tables that have been translated into simple Entity-Predicate-Value triples through a simple set of rules. In what follows we first describe the fact-checking process used during testing (Section 3.1) and then how the relation matching module is trained and the features used for this purpose (Section 3.2).

Figure 2: Relation matching step and filtering

3.1 Fact Checking

In our implementation, fact checking is a three-step process, illustrated in Figure 2. Firstly, we link named entities in the claim to entities in our KB and retrieve a set of tuples involving the entities found. Secondly, these entries are filtered in a relation matching step: using the text in the claim and the predicate as features, we classify whether each tuple is relevant. Finally, the values in the matched triples from the KB are compared to the value in the statement to deduce the verdict as follows: if there is at least one value with absolute percentage error lower than a threshold defined by the user, the claim is labeled true, otherwise false.

We model relation matching as a binary classification task using logistic regression implemented in scikit-learn, predicting whether a predicate is a match for the given input claim. The aim of this step is to retain tuples for which the predicate in the Entity-Predicate-Value tuple can be described by the surface forms present in the
38
claim (positive-class) and discard the remainder syntactic features, we consider the words in the
(negative-class). We chose not to model this as a span between the entity and number as well as the
multi-class classification task (one class per pred- dependency path between the entity and the num-
icate) to improve the extensibility of the system. ber. To generalise to unseen relations, we include
A multi-class classifier requires training instances the ability to add custom indicator functions; for
for every class, thus would not be applicable to our demonstration we include a simple boolean
predicates not seen during training. Instead, our feature indicating whether the intersection of the
aim is to predict the compatibility of a predicate words in the predicate name and the sentence is
w. r. t. the input claim. not empty.
For each of the candidate tuples ri ∈ R, a fea-
ture vector is generated using lexical and syntactic 4 Evaluation
features present in the claim and relation: φ(ri , c).
This feature vector is inputted to a logistic re- The system was field tested in the HeroX fact
gression binary classifier and we retain all tuples checking challenge - 40 general-domain claims
where θT φ(ri , c) ≥ 0. chosen by journalists. Given the design of our sys-
3.2 Training Data Generation

The training data for relation matching is generated using distant supervision and the Bing search engine. We first read the table and apply a set of simple rules to extract subject, predicate, object tuples, and for each named entity and numeric value we generate a query containing the entity name and the predicate. For example, the entry (Germany, Population:2015, 81413145) is converted to the query "Germany" Population 2015. The queries are then executed on the Bing search engine and the top 50 web-page results are retained. We extract the text from the web pages using a script built around the BeautifulSoup package (https://www.crummy.com/software/BeautifulSoup/). This text is parsed and annotated with co-reference chains using the Stanford CoreNLP pipeline (Manning et al., 2014).

Each sentence containing a mention of an entity and a number is used to generate a training example. The examples are labeled as follows: if the absolute percentage error between the value in the KB and the number extracted from the text is below a threshold (an adjustable hyperparameter), the training instance is marked as positive; sentences which contain a number outside of this threshold are marked as negative instances. We make an exception for numbers tagged as dates, where an exact match is required.

For each claim, the feature generation function φ(r, c) outputs lexical and syntactic features from the claim and a set of custom indicator variables. Additionally, we include a bias feature for every unique predicate in the KB. For our lexical and syntactic features, we consider the words in the span between the entity and the number as well as the dependency path between them. To generalise to unseen relations, we include the ability to add custom indicator functions; for our demonstration we include a simple boolean feature indicating whether the intersection of the words in the predicate name and the sentence is non-empty.

4 Evaluation

The system was field-tested in the HeroX fact checking challenge: 40 general-domain claims chosen by journalists. Given the design of our system, we were restricted to claims that can only be answered by a KB look-up (11 claims in total), returning the correct truth assessment for 7.5 of them according to the human judges. The half-correct one was due to not providing a fully correct explanation. To fact check a statement, the user only needs to enter the claim into a text box. The only requirement is to provide the system with an appropriate KB. We therefore ensured that the system can readily incorporate new tables taken from encyclopedic sources such as Wikipedia and the World Bank. In our system, this step is achieved by simply importing a CSV file and running a script to generate the new instances that train the relation matching classifier.

Analysis of our entry to this competition showed that two errors were caused by incorrect initial source data, and one partial error was caused by recalling a correct property but making an incorrect deduction. Of the numerical claims that we did not attempt, we observed that many required looking up multiple entries and performing a more complex deduction step, which was beyond the scope of this project.

We further validated the system by evaluating its ability to make veracity assessments on the simple numerical claims in the data set collected by Vlachos and Riedel (2015). Of the 4,255 claims about numerical properties of countries and geographical areas in this data set, our KB contained the information to fact check 3,418. The system recalled KB entries for 3,045 claims (89.1%). We observed that the system was consistently unable to fact check two properties (undernourishment and renewable freshwater per capita). Analysis of these failure cases revealed too great a lexical difference between the test claims and the training data our system generated; the claims in the test cases were comparative in nature (e.g. country X has a higher rate of undernourishment than country Y), whereas the training data generated using the method described in Section 3.2 consist of absolute claims.

A high number of false positive matches were generated; e.g. for a claim about population, other irrelevant properties were also recalled. For the 3,045 matched claims, 17,770 properties were matched from the KB with a score greater than or equal to the logistic score of the correct property. This means that for every claim there were, on average, 5.85 incorrect properties also extracted from the KB. In our case, this did not yield false positive assignment of truth labels to the claims, because the absolute percentage error between the incorrectly retrieved properties and the claimed value fell outside the threshold we defined; thus these incorrectly retrieved properties never resulted in a true verdict, allowing the correct one to determine the verdict.

5 Conclusions and Future Work

The core capability of the system demonstration we presented is to fact check natural language claims against relations stored in a KB. Although the range of claims is limited, the system is a field-tested prototype and has been evaluated on a published data set (Vlachos and Riedel, 2015) and on real-world claims presented as part of the HeroX fact checking challenge. In future work, we will extend the semantic parsing technique used and apply our system to more complex claim types. Additionally, further work is required to reduce the number of candidate relations recalled from the KB. While this was not an issue in our case, we believe that ameliorating it will enhance the ability of the system to assign a correct truth label where there exist properties with similar numerical values.

Acknowledgements

This research is supported by a Microsoft Azure for Research grant. Andreas Vlachos is supported by the EU H2020 SUMMA project (grant agreement number 688139).

References

Giovanni Luca Ciampaglia, Prashant Shiralkar, Luis M. Rocha, Johan Bollen, Filippo Menczer, and Alessandro Flammini. 2015. Computational fact checking from knowledge networks. PLoS ONE, 10(6):e0128193.

Sarah Cohen, Chengkai Li, Jun Yang, and Cong Yu. 2011. Computational journalism: A call to arms to database researchers. In CIDR, volume 2011, pages 148–151.

Nicholas Diakopoulos. 2016. Accountability in algorithmic decision making. Communications of the ACM, 59(2):56–62.

Naeemul Hassan, Bill Adair, James T. Hamilton, Chengkai Li, Mark Tremayne, Jun Yang, and Cong Yu. 2015a. The quest to automate fact-checking.

Naeemul Hassan, Chengkai Li, and Mark Tremayne. 2015b. Detecting check-worthy factual claims in presidential debates. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pages 1835–1838. ACM.

Hugo Larochelle, Dumitru Erhan, and Yoshua Bengio. 2008. Zero-data learning of new tasks. In AAAI, volume 1, page 3.

Michal Lukasik, Kalina Bontcheva, Trevor Cohn, Arkaitz Zubiaga, Maria Liakata, and Rob Procter. 2016. Using Gaussian processes for rumour stance classification in social media. arXiv preprint arXiv:1609.01962.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of ACL System Demonstrations, pages 55–60.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of ACL, pages 1003–1011.

Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. arXiv preprint arXiv:1508.00305.

Vahed Qazvinian, Emily Rosengren, Dragomir R. Radev, and Qiaozhu Mei. 2011. Rumor has it: Identifying misinformation in microblogs. In Proceedings of EMNLP, pages 1589–1599.

Andreas Vlachos and Sebastian Riedel. 2014. Fact checking: Task definition and dataset construction. ACL 2014, page 18.

Andreas Vlachos and Sebastian Riedel. 2015. Identification and verification of simple claims about statistical properties. In Proceedings of EMNLP.

Brett Walenz, You Will Wu, Seokhyun Alex Song, Emre Sonmez, Eric Wu, Kevin Wu, Pankaj K. Agarwal, Jun Yang, Naeemul Hassan, Afroza Sultana, et al. 2014. Finding, monitoring, and checking claims computationally based on structured data. In Computation + Journalism Symposium.
ADoCS: Automatic Designer of Conference Schedules
Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 41–44
© 2017 Association for Computational Linguistics
The bag-of-words, or vector model, representation derived from the texts configures a Euclidean space where several distances can be applied in order to estimate similarities between elements. Nevertheless, in some cases we need to average several criteria and unify them to obtain a single metric that quantifies dissimilarities, from which we can derive a distance matrix. This is the case we address here, since for each paper we have three different features: title, abstract and keywords. In these situations we cannot directly apply clustering methods based on centroids, such as K-means, since there is no Euclidean space defined for the elements. One way to solve this type of problem is to apply algorithms that use only the dissimilarity or distance matrix as input for the clustering process. ADoCS works with a new algorithm called CSCLP (Clustering algorithm with Size Constraints and Linear Programming) (Vallejo, 2016) that uses as inputs only the size constraints of the sessions and the dissimilarity/distance matrix.

Clustering algorithms obtain better results if a proper method for selecting the initial points is used (Fayyad et al., 1998). For this reason, the initial points in our clustering algorithm are chosen using a popular method: the Buckshot algorithm (Cutting et al., 1992). In CSCLP, the initial points are used as pairwise constraints (cannot-link constraints, in semi-supervised clustering terminology) for the formation of the clusters, and with binary integer linear programming (BILP) the membership and assignment of the instances to the clusters is determined, satisfying the size constraints of the sessions. In this way, the original clustering problem with size constraints becomes an optimisation problem. Details of the algorithm can be found in (Vallejo, 2016).

3 ADoCS Tool

In this section we include a description of the ADoCS tool. A web version of the tool can be found at https://ceferra.shinyapps.io/ADoCS.

On the left part of the web interface we find a panel where we can upload a csv file containing the information of the papers to be clustered. In the panel, there are several controls where we can configure some features of this csv file: concretely, the separator of fields (comma by default) and how literals are parsed (single quote by default). Additionally, we find three control bars (Title, Keywords and Abstract) with values between 0 and 1 that establish the weights of each of these factors for the computation of distances between papers. By default, the three bars are set to 0.33 to indicate that the three factors will have the same weight in computing the distances. The values of the weights are normalised in such a way that they always sum to 1. The user can also configure whether the TF-IDF transformation is applied or not, as well as the metric that is employed to compute the distance between elements. These controls are responsive, i.e., when the user modifies one of the values, the distance matrix is recomputed, along with all the components that depend on this matrix.

Once the file is correctly uploaded, the application enables the function tabs that give access to the functionality of the web system. There are five application tabs:

• Papers: This tab contains information about the dataset. We include here the list of papers. For each paper, we show the number, Title, Keywords and Abstract. In order to improve the visualisation, a check box can be employed to show additional information about the papers.

• Dendrogram: In this part, a dendrogram generated from the distance matrix is shown. The distance between papers is computed considering the weights selected by the user and the methodology detailed in Section 2.

• MDS: In this tab, a Multidimensional Scaling algorithm is employed over the distance matrix to generate a 2D plot of the similarity of the papers. Once the clusters are arranged, the membership of the papers to each cluster is denoted by colour.

• Wordmap: This application tab includes a word map representation showing the most popular terms extracted from the abstracts of the papers in the dataset.

• Schedule: In this part, the user can configure the number and size of the sessions and execute the CSCLP algorithm, described in Section 2, to build the groups according to the similarity between papers.
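The way the three weight bars combine the per-field dissimilarities can be sketched as follows (in Python for concreteness; ADoCS itself is an R/Shiny application, and the function names are illustrative, not taken from its source):

```python
# Sketch of the Title/Keywords/Abstract weight bars: the three bar values are
# renormalised so they sum to 1, then used to mix the per-field dissimilarities
# of a paper pair into a single combined distance.

def normalise_weights(w_title, w_keywords, w_abstract):
    """Rescale the three bar values so they always sum to 1."""
    total = w_title + w_keywords + w_abstract
    return (w_title / total, w_keywords / total, w_abstract / total)

def combined_distance(d_title, d_keywords, d_abstract, weights):
    """Weighted average of the per-field dissimilarities for one paper pair."""
    wt, wk, wa = weights
    return wt * d_title + wk * d_keywords + wa * d_abstract

# Default bar settings (0.33 each) give the three fields equal weight:
weights = normalise_weights(0.33, 0.33, 0.33)
d = combined_distance(0.9, 0.3, 0.6, weights)
```

Applying this to every pair of papers yields the dissimilarity matrix that the dendrogram, MDS and Schedule tabs consume.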
The typical use of this tool starts by loading a csv file with the information of the papers. Once the data is in the system, we can use the first tab, Papers, to know the number of papers in the conference. We can also inspect specific information about each paper (title, keywords, abstract or other information included in the file). The tabs Dendrogram, MDS and Wordmap can be used for exploring the similarities between papers and for discovering the most common terms in the papers of the conference. Finally, we can execute the semi-supervised algorithm to configure the sessions in the Schedule tab. In this tab, we can introduce the number of desired sessions as well as the distribution of the session sizes. For instance, if we have 40 papers and we want 5 sessions of 5 papers and 5 sessions of 3 papers, we introduce "10" in the Number of Sessions text box and "5,5,5,5,5,3,3,3,3,3" in the Size of Sessions text box. When the user pushes the compute button, if the sum of the number of papers in the distribution is correct (i.e. equal to the total number of papers in the conference), the algorithm of Section 2 is executed in order to find clusters that satisfy the size restrictions expressed by the session distribution. As a result, a list of the papers with their assigned cluster is shown in the tab. Finally, we can download the assignments as a csv file by pushing the Save csv button.

Figure 1: Example of the WordMap of the abstracts of the papers of the Twenty-Eighth Conference on Artificial Intelligence (AAAI-2014)

• proxy: The proxy package (Meyer and Buchta, 2016) allows the computation of distances between instances. The package contains implementations of popular distances, such as Euclidean, Manhattan or Minkowski.
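The consistency check described above (the session sizes must sum to the total number of papers) can be sketched as follows; the helper names are hypothetical, not taken from the ADoCS code:

```python
# Minimal sketch of the Schedule tab's input check: the session sizes entered
# in "Size of Sessions" must sum to the number of papers, and their count must
# match "Number of Sessions", before the CSCLP algorithm is run.

def parse_sizes(size_field):
    """Parse a comma-separated size list such as "5,5,5,5,5,3,3,3,3,3"."""
    return [int(s) for s in size_field.split(",")]

def valid_session_plan(n_papers, n_sessions_field, size_field):
    sizes = parse_sizes(size_field)
    return len(sizes) == int(n_sessions_field) and sum(sizes) == n_papers

# The example above: 40 papers, 10 sessions (five of size 5, five of size 3).
ok = valid_session_plan(40, "10", "5,5,5,5,5,3,3,3,3,3")
bad = valid_session_plan(40, "10", "5,5,5,5,5,3,3,3,3,4")  # sizes sum to 41
```

Only when this check passes does the clustering with size constraints proceed.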
tool is available for general use. In this case, we have uploaded the application to the free server http://www.shinyapps.io/.

5 Conclusions and Future Work

Arranging papers to create an appropriate conference schedule, with sessions containing papers on common topics, is a tedious task, especially when the number of papers is high. Machine learning offers techniques that can automate this task with the help of NLP methods for extracting features from the papers. In this context, organising a conference schedule can be seen as semi-supervised clustering. In this paper we have presented the ADoCS system, a web application that is able to create a set of clusters according to the similarity of the documents analysed. The groups are formed following the size distribution configured by the user. Although the application initially focuses on grouping conference papers, other related tasks in clustering documents with restrictions could be addressed thanks to the versatility of the interface (different metrics, TF-IDF transformation).

As future work, we are interested in developing conceptual clustering methods to extract topics from the created clusters.

Acknowledgments

This work has been partially supported by the EU (FEDER) and Spanish MINECO grant TIN2015-69175-C4-1-R, LOBASS, by MINECO in Spain (PCIN-2013-037) and by Generalitat Valenciana PROMETEOII/2015/013.

References

Sugato Basu, Mikhail Bilenko, and Raymond J. Mooney. 2004. A probabilistic framework for semi-supervised clustering. In Proceedings of the tenth ACM SIGKDD conference, pages 59–68. ACM.

Winston Chang, Joe Cheng, JJ Allaire, Yihui Xie, and Jonathan McPherson, 2016. shiny: Web Application Framework for R. R package version 0.14.

Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey. 1992. Scatter/gather: a cluster-based approach to browsing large document collections. Proceedings of the 15th annual international ACM SIGIR conference, pages 318–392.

Usama Fayyad, Cory Reina, and Paul S. Bradley. 1998. Initialization of iterative refinement clustering algorithms. Proceedings of ACM SIGKDD, pages 194–198.

Ian Fellows, 2014. wordcloud: Word Clouds. R package version 2.5.

Nuwan Ganganath, Chi-Tsun Cheng, and Chi K. Tse. 2014. Data clustering with cluster size constraints using a modified k-means algorithm. Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), International Conference, IEEE, pages 158–161.

Valerio Grossi, Anna Monreale, Mirco Nanni, Dino Pedreschi, and Franco Turini. 2015. Clustering formulation using constraint optimization. Software Engineering and Formal Methods, pages 93–107.

Guobiao Hu, Shuigeng Zhou, Jihong Guan, and Xiaohua Hu. 2008. Towards effective document clustering: a constrained k-means based approach. Information Processing & Management, 44(4):1397–1409.

Monica Jha. 2015. Document clustering using k-medoids. International Journal on Advanced Computer Theory and Engineering (IJACTE), 4(1):2319–2526.

David Meyer and Christian Buchta, 2016. proxy: Distance and Similarity Measures. R package version 0.4-16.

David Meyer, Kurt Hornik, and Ingo Feinerer. 2008. Text mining infrastructure in R. Journal of Statistical Software, 25(5):1–54.

R Core Team, 2015. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Tadej Škvorc, Nada Lavrac, and Marko Robnik-Šikonja. 2016. Co-Bidding Graphs for Constrained Paper Clustering. In 5th Symposium on Languages, Applications and Technologies (SLATE'16), volume 51, pages 1–13.

Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. 2005. Introduction to data mining. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.

Stefan Theussl and Kurt Hornik, 2016. Rglpk: R/GNU Linear Programming Kit Interface. R package version 0.6-2.

Diego F. Vallejo. 2016. Clustering de documentos con restricciones de tamaño. Master's thesis, Higher Technical School of Computer Engineering, Polytechnic University of Valencia - UPV.

Shaohong Zhang, Hau-San Wong, and Dongqing Xie. 2014. Semi-supervised clustering with pairwise and size constraints. International Joint Conference on Neural Networks (IJCNN), IEEE, pages 2450–2457.

Shunzhi Zhu, Dingding Wang, and Tao Li. 2010. Data clustering with size constraints. Knowledge-Based Systems, 23(8):883–889.
A Web Interface for Diachronic Semantic Search in Spanish
Figure 1: Architecture of Diachronic Explorer
Figure 2: Similar words to cáncer (cancer) using a simple search in 1900 (a) and 2009 (b)

Simple: The user enters a target word and selects a specific period of time (e.g. from 1920 to 1939), and the system returns the 20 most similar terms (on average) within the selected period of time. The user can select a word cloud image to visualize the output. Figure 2 shows the similar words to cáncer ('cancer') returned by a simple search in two different periods: 1900 (a) and 2009 (b). Notice that this word is similar to peste and tuberculosis at the beginning of the 20th century, while nowadays it is more similar to technical words, such as tumor or carcinoma.

Track record: As in the simple search, the user enters a word and a time period. However, this type of search returns the specific similarity scores for each year of the period, instead of the similarity average. In addition, the system allows the user to select any word from the set of all words (and not just the top 20) semantically related to the target one in the searched time period.

3 Conclusions

In this abstract we described a system, Diachronic Explorer, that allows linguists to analyze the meaning change of Spanish words in the written language across time. The system is based on distributional models obtained from a chronologically structured language resource, namely Google Books Syntactic Ngrams. The models were created using dependency-based contexts and a strategy for reducing the vector space, which consists in selecting the more informative and relevant word contexts.

As far as we know, our system is the first attempt to build diachronic distributional models for the Spanish language. Besides, it uses NoSQL storage technology to scale easily as new data is processed, and provides an interface enabling useful types of word searching across time. Other similar works for English are reported in different papers (Wijaya and Yeniterzi, 2011; Gulordava and Baroni, 2011; Jatowt and Duh, 2014; Kim et al., 2014; Kulkarni et al., 2015).

An interesting direction of research could involve the use of the system infrastructure for holding other types of language variety, namely diatopic variation. Distributional models can be built from text corpora organized not only with diachronic information, but also with dialectal features. For instance, we could adapt the system to search for meaning changes across different diatopic varieties: Spanish from Argentina, Mexico, Spain, Bolivia, and so on. The structure of our system is generic enough to deal with any type of variety, not only that derived from the historical axis.

Acknowledgments

This work has received financial support from a 2016 BBVA Foundation Grant for Researchers and Cultural Creators, TelePares (MINECO, ref: FFI2014-51978-C2-1-R), a Juan de la Cierva formación Grant (Ref: FJCI-2014-22853), the Consellería de Cultura, Educación e Ordenación Universitaria (accreditation 2016-2019, ED431G/08) and the European Regional Development Fund (ERDF).

References

Pablo Gamallo and Stefan Bordag. 2011. Is singular value decomposition useful for word similarity extraction. Language Resources and Evaluation, 45(2):95–119.

Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. 2011. Quantitative Analysis of Culture Using Millions of Digitized Books. Science, 331(6014):176–182.

Derry Tanti Wijaya and Reyyan Yeniterzi. 2011. Understanding Semantic Change of Words over Centuries. In Proceedings of the 2011 International Workshop on DETecting and Exploiting Cultural diversiTy on the Social Web (DETECT 2011), pages 35–40, Glasgow, Scotland. ACM.
Multilingual CALL Framework for Automatic Language Exercise
Generation from Free Text
2 Related Work

There exist countless tools, both web-based and desktop software, that facilitate the creation of teaching material for general purposes, some of the best known being Moodle (https://moodle.org/), Hot Potatoes (https://hotpot.uvic.ca/), ClassTools.net (https://www.classtools.net/), and ClassMarker (https://www.classmarker.com/). However, the focus of such tools is on enabling users to adapt the pedagogical content they are interested in, whatever it is, to certain exercise formats (i.e., quizzes, open clozes, crosswords, drag-and-drops, and so on). That is, these tools offer support neither for assessing the pedagogical suitability of the contents nor for other exercise-dependent tasks, such as building clues for quizzes or distractors for multiple-choice exercises.

In the domain of language learning in particular, exercise authoring very often implies a lot of word list curation, searching for texts that contain certain linguistic patterns or expressions, retrieving definitions, and similar tasks. The availability of resources for language teachers that simplify these processes, such as on-line dictionaries or teaching-oriented lexical databases (e.g., English Profile, http://www.englishprofile.org/), depends on the target language. To the best of this paper's authors' knowledge, there does not exist at the moment an exercise generation and evaluation framework specific to learning Basque, Spanish, English, and French that not only automates formatting the content given by the user for several exercises but also incorporates natural language processing (NLP) techniques to ease the authoring process. Volodina et al. (2014) describe a similar framework named Lärka. Lärka designs multiple-choice exercises in Swedish for linguistics or Swedish learners. The questions are based on controlled corpora; that is, the users cannot choose the texts they will be working on.

3 Workflow description

All the exercise formats mentioned share a common building process. Users must choose a language, provide a text in that language, and choose a pedagogical target from the options given. Pedagogical targets are based on PoS tags and morphological features. Different possible targets have been implemented for each language, depending on the language's characteristics and the richness of the parsing models available. For instance, exercises in Spanish can target the subjunctive conjugation or the definite/indefinite articles. In English, one can target, for example, past and present participle tenses. Once the initial configuration has been set, the workflow continues as follows:

3.1 Extraction of candidate items

The given text is segmented, tokenized and tagged with IXA pipes (Agerri et al., 2014), using the latest models provided with the tools. The tokens with a PoS tag or morphological feature selected by the user as the pedagogical target are chosen as candidate items. In the case of the SS format, candidate items are sentences containing tokens with the relevant PoS tags or morphological features. If the text does not contain any candidate item, the application alerts the user that it is not possible to build an exercise with the given configuration.

3.2 Final item selection

The system has two ways of getting the final items from the candidates identified: the user can choose whether to select them or let the application do it randomly. In the latter case, the user can set an upper bound on the number of items generated. The system never yields two contiguous items, since this would substantially increase the difficulty of answering them.

3.3 Exercise generation

Once the final items have been chosen, the actual questionnaire must be designed. This depends on the format chosen by the user and the specific settings available for that format:

Fill-in-the-Gaps (FG). The system substitutes the chosen items with gaps that have to be filled in with the appropriate words. Users can choose to show, for all the languages, at least three types of clues to help learners do the exercise: the lemmas of the correct words, their definitions, or a word bank of all the correct words (and how many times they occur). For Spanish and English, the system is also capable of prompting the morphological features of the words to be guessed, in addition to the lemma. This feature is interesting for training on singulars and plurals, grammatical gender, verbal tenses, and so on. Moreover, depending on the pedagogical target chosen, the system automatically disables certain types of clues. It would not
make much sense, for instance, to give the lemmas of the correct words if the pedagogical target were prepositions, given that prepositions cannot be lemmatized. The user can also choose not to give any help.

To generate clues based on lemmas and/or morphological features, the system turns to the linguistic annotation given by the IXA pipes during candidate extraction. The annotation contains all the necessary information for each token in the text.

Retrieving definitions requires additional processing, since a word's definition depends on the context it appears in. That is, choosing the correct definition of a word in the text provided by the user translates to disambiguating the word. The application relies on the Babelfy (Moro et al., 2014) and BabelNet (Navigli and Ponzetto, 2010) APIs in order to do so. It passes the whole text as the context and asks Babelfy to assign a single sense to the words chosen as targets of the exercise. Then, it retrieves from BabelNet the definitions associated with those senses. Babelfy is not always able to assign a sense to a word; when this happens, the application returns the lemma of the word as its clue.

Multiple Choice (MC). Again, the exercise consists of gaps in the text to be filled with the correct words. In this case, the learner is given a set of words from which to choose an answer. This set of words contains the right answer and some incorrect words called "distractors". When this exercise format is chosen, the system automatically generates as many distractors as specified by the user. This is achieved by consulting, for each correct answer, a word embedding model and retrieving the most similar words. The models are word2vec (Mikolov et al., 2013b; Mikolov et al., 2013a) models trained on Leipzig University's corpora (http://corpora.uni-leipzig.de/) with the Gensim library for Python (http://radimrehurek.com/gensim/). Thus, distractors tend to be words that appear often in contexts similar to the right answer, but are not semantically or grammatically correct. Distractors are then transformed to the same case as the correct word and finally shuffled for their visualization.

Shuffled Sentences (SS). This exercise consists in ordering a given set of words to formulate a grammatical sentence. In this case, the system substitutes the chosen sentences with gaps and shows each sentence shuffled as a lead.

3.4 Evaluation

The user can answer the questionnaire they have designed and get it assessed by the system. For all the exercise formats, the evaluation consists in comparing the correct answers with the input received from the user. An answer is right only when it is identical to the correct answer.

4 The demonstrator

The application has a clean interface and is easy to follow. An exercise can be designed, answered and evaluated by visiting fewer than four pages:

The home page. On the home page, the user chooses a language (Spanish, Basque, English or French) and an exercise format (FG, MC, or SS). It leads to the configuration page of the exercise chosen.

The configuration page. All the configuration pages share a common structure. A text field occupies the top of the page. This is where the user introduces the text they want to work with. Below the text field are the configuration options, presented as radio-button lists. All the exercise formats require that users choose a pedagogical target. Then come the format-specific settings, the only section of the page that varies.

For an FG exercise, two properties must be set: the type of clue and how the clues must be visualized. They can be shown below the gapped text with a reference to the gap they belong to, or as description boxes of the gaps (i.e., "tooltips").

In an MC exercise, users must choose the number of distractors they want the system to create.

For the SS mode, users can set an upper limit on the number of sentences that will be selected as candidates.

Finally, the user can choose whether to let the system select the items of the exercise or to choose them themselves among the candidates that the system generates. In the former case, the system asks the user how many items it should create at most, and takes the user directly to the "Answer and evaluate" page. In the latter case, the user is taken to the "Choose the items" page.

If the text does not contain any token that meets the pedagogical target, the system asks the user to provide a different text or to change the configuration. That is, the application allows users to know with a single click whether the texts they choose are suitable for the pedagogical objective they have set, without having had to read the text thoroughly beforehand.
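The distractor selection described in Section 3.3 can be sketched as follows; the tiny embedding table below stands in for the word2vec models the system loads with Gensim, and all names and vectors are illustrative:

```python
# Sketch of MC distractor generation: rank the vocabulary by cosine similarity
# to the correct answer, take the top n words (excluding the answer itself),
# and match the answer's casing before shuffling. The toy 2-d vectors stand in
# for real word2vec embeddings.
import math

EMBEDDINGS = {
    "cat": (1.0, 0.1), "dog": (0.9, 0.2), "car": (0.1, 1.0),
    "truck": (0.15, 0.95), "kitten": (0.95, 0.15),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def distractors(answer, n):
    """The n words closest to `answer` in the embedding space."""
    target = EMBEDDINGS[answer.lower()]
    ranked = sorted(
        (w for w in EMBEDDINGS if w != answer.lower()),
        key=lambda w: cosine(EMBEDDINGS[w], target),
        reverse=True,
    )
    # Match the casing of the correct word, as the system does before shuffling.
    return [w.capitalize() if answer[0].isupper() else w for w in ranked[:n]]

opts = distractors("Cat", 2)
```

As the paper notes, such neighbours tend to occur in similar contexts to the right answer, which is what makes them plausible but incorrect choices.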
Choose the items. On this page users choose the items they want to create among the words that meet the pedagogical target they chose. The interface shows the whole text with all the available candidates selected. They can be toggled simply by clicking on them. There are also buttons to select all and remove all the candidates. Once users are satisfied with their selection, they are led to the "Answer and evaluate" page.

Answer and evaluate. This is the page where the final questionnaire is displayed and can be filled in. The appearance varies a little depending on the format chosen. FG exercises consist of the given text and gaps where the correct words should be written. If the tooltip clues option has been enabled, clues appear in description boxes when hovering over the gaps; otherwise, they appear listed below the text. Word bank clues appear boxed to the right of the text. In MC exercises, the choices are radio-button groups listed to the right of the text. As for SS exercises, page-wide gaps replace the chosen sentences, and the shuffled words are given above the gaps.

At the bottom of this page there is a link to download the exercise in Moodle CLOZE syntax. The downloaded file can be imported into Moodle in order to generate a quiz.

Users can fill in the exercise they designed and submit it to get the percentage of correct answers. Furthermore, the answers are colored in green or red, depending on whether they are correct or not, respectively. This way learners can try to answer again.

5 Conclusions

We have described a web-based framework for language learning exercise generation. It is available for Spanish, Basque, English, and French. Users can design three types of exercises to train diverse skills in these languages. The framework is highly configurable and lets the user choose whether they want to select the exercise items or have the system do it.

As future work, the system should be improved in various ways. To begin with, the current system delegates to the user the task of making the exercises more or less difficult by choosing the items themselves. The application will be endowed with technology that allows users to automatically create exercises which vary in difficulty starting from the same text.

Another aspect that can be improved is the fact that exercise items are created from unigrams. It would be very interesting if the application were capable of generating multi-word candidates. This would be useful to revise collocations or phrasal verbs, for instance. In the same regard, the applicability of the system would increase if it based the item candidate generation not only on PoS tags or morphological features, but on other criteria as well, like the semantics of the text.

There is also room for improvement in the configurability of the application. Users should be allowed to control two key features: definition clues in FG exercises and MC distractors. Currently, the application imposes the definitions and distractors it generates, instead of presenting them as options for the users to choose.

Finally, we plan to implement more formats that add to the available three.

Acknowledgements

This work has been funded by the Spanish Government (project TIN2015-65308-C5-1-R).

References

Rodrigo Agerri, Josu Bermudez, and German Rigau. 2014. IXA pipeline: Efficient and Ready to Use Multilingual NLP tools. In LREC, volume 2014, pages 3823–3828.

Tomas Mikolov, Greg Corrado, Kai Chen, and Jeff Dean. 2013a. Efficient Estimation of Word Representations in Vector Space. CoRR, abs/1301.3.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeff Dean. 2013b. Distributed Representations of Words and Phrases and their Compositionality. In NIPS, pages 3111–3119.

Andrea Moro, Francesco Cecconi, and Roberto Navigli. 2014. Multilingual word sense disambiguation and entity linking for everybody. In Proceedings of the 2014 International Conference on Posters & Demonstrations Track, Volume 1272, pages 25–28. CEUR-WS.org.

Roberto Navigli and Simone Paolo Ponzetto. 2010. BabelNet: Building a very large multilingual semantic network. In ACL, pages 216–225.

Elena Volodina, Ildikó Pilán, Lars Borin, and Therese Lindström Tiedemann. 2014. A flexible language learning platform based on language resources and web services. In LREC, pages 3973–3978.
Audience Segmentation in Social Media
Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 53–56
© 2017 Association for Computational Linguistics
Figure 1: Example author biographies.

Our segmentation approach only relies on the 'on-topic' content of an author, not on all of the author's posts—for two reasons: Firstly, many social media sources such as forums do not allow retrieving content based on an author name. Secondly, social media content providers charge per document. It would not be economical to download all content for all authors who talk about a certain topic. We found that the combination of the author's content with the author's name(s) and biography significantly enhances both precision and recall of audience segmentation.

2.1 Author Location

WASM detects the permanent location of an author, i.e., the city or region where the author lives, by matching the information found in their biography against large, curated lists of city, region and country names. We use population information (preferring the location with higher population) to disambiguate city names in the absence of any region or country information, e.g., an author with the location Stuttgart is assigned to Stuttgart, Germany, not Stuttgart, AR, USA.

2.2 Author Demographics

WASM identifies three demographic categories:

Marital status: When the author is married, the value is true, otherwise unknown. For the classification we identify keywords (encoded in dictionaries) and patterns in the author's biography and content. Our dictionaries and patterns match, for example, authors describing themselves as married or husband, or authors that use phrases such as my spouse loves running or at my wife's birthday. Thus, WASM tags the authors in Figure 1 as married.

Parental status: When the author has children, the value is true, otherwise unknown. Similarly to the marital status classification, keywords and patterns are matched in the authors' biographies and contents. Example keywords are father or mom; example patterns are my kids or our daughter. WASM tags both authors in Figure 1 as having children.

Gender: Possible values are male, female and unknown. The classification of gender relies on a similar keyword and pattern matching as described previously. In addition, it relies on matching the author name against a built-in database of 150,000 first names that span a variety of cultures. For ambiguous first names such as Andrea (which identifies females in Germany and males in Italy), we take both the language of the author content and the detected author location into account to pick the appropriate gender. WASM classifies the first author in Figure 1 as male, the second author as female.

2.3 Author Behavior

WASM identifies three types of author behavior: Users own a certain product or use a particular service. They identify themselves through phrases such as my new PhoneXY, I've got a PhoneXY or I use ServiceXY. Prospective users are interested in buying a product or service. Example phrases are I'm looking for a PhoneXY or I really need a PhoneXY. Churners have either quit using a product or service, or have a high risk of leaving. They are identified by phrases such as Bye Bye PhoneXY or I sold my PhoneXY.

Author behavior is classified by matching keywords and patterns in the authors' biographies and content—similarly to the demographic analysis described above. It allows us to understand: What do actual customers of a certain product or service talk about? What are the key factors why customers churn?

Figure 2 shows an example analysis where the topics of interest were three retailers: a discounter, an organic market and a regular supermarket.2 The top of Figure 2 summarizes what is relevant for authors that WASM identified as users, i.e., customers of a specific retailer. The bottom part shows two social media posts from users.

2.4 Author Interests

WASM defines a three-tiered taxonomy of interests, which is inspired by the IAB Tech Lab Content Taxonomy (Interactive Advertising Bureau, 2016). The first tier represents eight top-level interest categories such as art and entertainment

2 The retailers' names are anonymized.
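The population-based disambiguation rule described for author locations (prefer the most populous match when the biography gives no region or country) can be sketched as follows. The gazetteer rows, field layout and population figures are illustrative stand-ins, not WASM's actual data:

```python
# Toy gazetteer rows: (city, region, country, population). Data illustrative.
GAZETTEER = [
    ("Stuttgart", "Baden-Württemberg", "Germany", 632_000),
    ("Stuttgart", "AR", "USA", 2_700),
]

def resolve_city(name, region=None, country=None):
    """Match a biography location string against the gazetteer,
    preferring the most populous city when no region/country is given."""
    rows = [r for r in GAZETTEER if r[0] == name]
    if region:
        rows = [r for r in rows if r[1] == region]
    if country:
        rows = [r for r in rows if r[2] == country]
    # With no further qualifiers, the highest-population match wins.
    return max(rows, key=lambda r: r[3]) if rows else None

print(resolve_city("Stuttgart"))  # the German Stuttgart, by population
```

A bare location string thus resolves to Stuttgart, Germany, while an explicit region such as AR still selects the smaller US city.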
Figure 2: What matters to users of supermarkets.

or sports. The second tier comprises about 60 more specific categories. These include music and movies under art and entertainment, and ball sports and martial arts under sports. The third level represents fine-grained interests, e.g., tennis and soccer below ball sports.

The fine-grained interests on the third level are identified in author biographies and content with the help of dictionaries and patterns. From the biography of the first author in Figure 1, we infer that he is interested in music and baseball (music lover and baseball fan). For the second author we infer an interest in machine knitting (I am an avid machine knitter). Note that a simple occurrence of a certain interest, e.g. a sport type, is usually not enough: it has to be qualified in the author content by matching specific patterns. A match in a biography, however, typically qualifies as a bona fide interest—excluding, e.g., negation structures.

Figure 3 visualizes the connections between the three retailers mentioned in the previous section and coarse-grained interest categories. This reveals that authors who write about Organic Market CD are uniquely interested in animals and that Discounter AB does not attract authors who are interested in sports. These insights allow targeted advertisements or co-selling opportunities.

Figure 3: Author interests related to retailers.

2.5 Account Type Classification

WASM classifies social media accounts into organizational or personal. This helps customers to understand who is really driving the social media conversations around their brands and products: is it themselves (through official accounts), a set of news sites, or is it an 'organic' conversation with lots of personal accounts involved?

Organizational accounts include companies, NGOs, newspapers, universities, and fan sites. Personal accounts are non-organizational accounts from individual authors. Account types are distinguished by cues from the authors' names and biographies. For example, company-indicating expressions such as Inc or Corp and patterns like Official Twitter account of <some brand> indicate that the account represents an organization, whereas a phrase like Views are my own points to an actual person. The biographies in Figure 1 contain many hints that both accounts are personal, e.g., agent nouns (director, developer, lover) and personal and possessive pronouns (I, my, our).

3 Implementation

3.1 Text Analysis Stack

The WASM author segmentation components are created using the Annotation Query Language (AQL; Chiticariu et al., 2010). AQL is an SQL-style programming language for extracting structured data from text. Its underlying relational logic supports a clear definition of rules, dictionaries, regular expressions, and text patterns. AQL optimizes the rule execution to achieve higher analysis throughput.

The implemented rules harness linguistic preprocessing steps, which consist of tokenization, part-of-speech tagging and chunking (Abney, 1992; Manning and Schütze, 1999)—the latter also expressed in AQL. Rules in AQL support the combination of tokens, parts of speech, chunks, string constants, dictionaries, and regular expressions. Here is a simplified pattern for identifying whether an author has children:

create view Parent as
extract pattern
  <FP.match> <A.match>{0,1} /child|sons?|kids?/
from FirstPersonPossessive FP, Adjectives A;

It combines a dictionary of first person possessive pronouns with an optional adjective (identified by its part of speech) and a regular expression matching child words. The pattern matches phrases such as my son and our lovely kids.

3.2 System Architecture

We wanted to provide users with analysis results as quickly as possible. Hence, we created a data streaming architecture that retrieves documents from social media and analyzes them on the fly, while the retrieval is still in progress. Moreover, we wanted to run all analyses for all users on a single, multi-tenant system to keep operating costs low. The data stream that our text analysis components see contains social media content for different users. Hence, we built components that can switch between different analysis processing configurations with minimal overhead.

The text analysis components run as Scala or Java modules within Apache Spark™. We exploit Spark's Streaming API for our data streaming requirements. The data between the components flows through Apache Kafka™. The combination of Spark and Kafka allows for processing scale-out, and resilience against component failures.

The multi-tenant text analysis processing is separated into user-specific and language-specific analysis steps. We optimized the user-specific analysis steps (such as detecting products and brands a particular user cares about) to have virtually no switching overhead. The language-specific steps (e.g., sentiment detection or author segmentation) are invariant across customers. To achieve the required processing throughput, we launch one language-specific rule set per processing node, and analyze multiple documents in parallel with this language-specific rule set.

The analysis pipeline runs on a cluster of 10 virtual machines, each with 16 cores and 32 GB RAM. Our customers run 300+ analyses per day, analyzing between 50,000 and 5,000,000 documents each.

4 Evaluation

Our customers are willing to accept smaller segment sizes (i.e., lower recall) as long as the precision of the segment identification is high. This design goal of our analysis components is reflected in the evaluation results for author behavior and account types on English social media documents that we present here.3

The retail dataset mentioned in previous sections consists of 50,354 documents by 41,209 authors from all social media sources. WASM classifies 1,695 authors as users—without any additional effort by the customer. Compare that with the task of running a survey in the retailer's stores (and its competition) to get similar insights. We manually annotated author behavior for 500 documents from distinct authors. The evaluated precision is 90.0% at a recall of 58.3%.

The evaluation of account types is based on a random sample of 50,124 Tweets, which comprises 43,193 distinct authors. 36,682 of the authors provide biographies. We use this as the "upper recall bound" for our biography-based approach. Our system assigns an account type for 18,657 authors, which corresponds to a recall of 50.9%. It classifies 16,981 as personal and 1,676 as organizational accounts. We manually annotated 500 of the classified accounts, which results in a precision of 97.4%.

5 Conclusion and Future Work

This paper presented real-life use cases that require understanding audience segments in social media. We described how IBM Watson Analytics for Social Media™ segments social media authors according to their location, demographics, behavior, interests and account types. Our approach shows that the author biography in particular is a rich source of segment information. Furthermore, we show how author segments can be created at a large scale by combining natural language processing and the latest developments in data analytics platforms. In future work, we plan to extend this approach to additional author segments as well as additional languages.

3 An evaluation of each audience feature for each supported language is beyond the scope of this paper.

References

Steven P. Abney. 1992. Parsing By Chunks, pages 257–278. Springer Netherlands, Dordrecht.

Laura Chiticariu, Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick Reiss, and Shivakumar Vaithyanathan. 2010. SystemT: An Algebraic Approach to Declarative Information Extraction. In Proceedings of ACL, pages 128–137.

Interactive Advertising Bureau. 2016. IAB Tech Lab Content Taxonomy. Online, accessed: 2016-12-23, https://www.iab.com/guidelines/iab-quality-assurance-guidelines-qag-taxonomy/.

Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, USA.
The arText prototype: An automatic system for writing specialized texts
Medicine: Research article, Review article, Medical history, Abstract, Bachelor's thesis
Tourism: Informative article, Travel blog post, Report, Rules and regulations, Business plan
Public Administration: Allegation, Cover letter, Letter of complaint, Claim, Application

Table 1: Specialized fields and textual genres included in arText.

work) and front end (HTML, CSS, JavaScript, with AJAX and jQuery); Google Analytics is integrated into the site to measure traffic.

Documents can be exported in four formats: PDF, TXT, HTML and ARTEXT. Previously saved documents can be uploaded using the ARTEXT format, and the website includes a detailed user manual and a contact section for comments, questions and suggestions.

ArText can be accessed at http://sistema-artext.com/,3 and has been optimized for the Google Chrome browser. To use arText, click on "Start using arText" and pick one of the 15 textual genres mentioned above. This brings you to the text editor, where you can start writing using the text editor and the three modules integrated into arText: Structure, Contents and Phraseology; Format and Spellchecking; and Lexical and Discourse-based Recommendations.

2.1 Module 1. Structure, Contents and Phraseology

The left-hand column helps users structure and draft documents. Its interactive template includes typical sections, contents and phraseology for each textual genre. This information was extracted from da Cunha and Montané (2016), a corpus-based analysis following van Dijk (1989)'s textual approach. Specifically, users can insert:
- Typical document sections
- Typical contents found in each section
- Phraseology related to each of these contents

The text editor displays the sections which typically appear in a given textual genre. For example, the template for a "claim" to be submitted to the public administration includes the following sections:
- Header
- Addressee
- Introductory clause
- Supporting details
- Request
- Closing

A drop-down menu in the left-hand column provides sample texts for each section, including section titles, where appropriate. For example, the "Supporting details" section includes two different contents:
- Grounds for the claim
- Attachments

When users click on a specific content, arText displays a list of sample phrases that can be incorporated into the final text. For example, "Attachments" includes the following phrases:
- Attached please find [document name].4
- The following supporting documents are attached: [list of documents].

Users can click on a stock phrase to include it in the text.

2.2 Module 2. Format and Spellchecking

The toolbar at the top of the screen includes an open source spellchecker (WebSpellChecker Ltd.) and various formatting options, e.g. to change font or font size; insert bullet points, images, tables and links; cut, copy and paste; print; and search. Since online storage is not provided, the user's manual includes instructions for uploading an image to Google Drive and inserting it into a document produced using arText.

2.3 Module 3. Lexical and Discourse-based Recommendations

By clicking on the review button in the right-hand column, users can see a series of lexical and discourse-related recommendations for improving their texts. These recommendations are derived from da Cunha and Montané (2016), which is based on Cabré (1999)'s Communicative Theory of Terminology and Mann and Thompson (1988)'s Rhetorical Structure Theory. The module includes a series of algorithms based on linguistic rules and two NLP tools: the Freeling shallow parser (Atserias et al., 2006) and the DiSeg discourse segmenter (da Cunha et al., 2012).

3 A demo is available at https://canal.uned.es/mmobj/index/id/54433
4 Users are instructed to fill in the information indicated in square brackets (e.g. names, dates and numbers).

This module includes 11 main recommendations, all of which are displayed in the right-hand column, when appropriate. A subset of these recommendations is assigned to each textual genre, and all recommendations are adapted to the linguistic characteristics of each genre (da Cunha and Montané, 2016). Recommendations cover the following 11 topics:
1. Spelling out acronyms
2. Using acronyms systematically
3. Providing definitions
4. Using the passive voice
5. Using the 1st person systematically
6. Using subjectivity indicators
7. Repeating words
8. Using long sentences
9. Segmenting long sentences
10. Considering alternative discourse markers
11. Varying discourse markers

By clicking on a given recommendation, users can see a more detailed explanation and suggestions. In some cases, arText also highlights phrases or content in the text editor. For example, one lexical recommendation, "Spelling out acronyms," highlights acronyms that are not spelled out when they first appear in the text (i.e. arText would highlight "COPD" if "chronic obstructive pulmonary disease" did not appear next to this acronym the first time the term was used). Some recommendations also actively engage users in the revision process. For example, the recommendation "Repeating words" shows a list of repeated words; when users click on a word in the right-hand column, all occurrences of this word in the text are highlighted. This recommendation is not displayed for highly specialized textual genres (e.g. research articles and abstracts), since lexical variation is usually avoided in these types of texts (Cabré, 1999).

Some recommendations focus on the discourse level. For example, "Segmenting long sentences" highlights one or more long sentences; users can click to see suggestions for splitting them into shorter sentences. In this case, arText proposes these discourse segments in the right-hand column. The number of words used to determine long sentences differs for each textual genre, following da Cunha and Montané (2016).

Another discourse level recommendation refers to "Varying discourse markers." In this case, arText displays a list of discourse markers repeated in the text. When users click on one of these markers, all of its occurrences are highlighted, and a list of alternative discourse markers used to express the same relationship (e.g. Cause, Restatement, Contrast and Condition, etc.) is displayed in the right-hand column. For instance, for the discourse marker "that is," used to express Restatement, arText suggests the alternatives "in other words," "that is to say," "i.e." and "to put it another way."

3 Evaluation

Real and ad hoc texts were used to test arText's algorithms and linguistic rules and improve the system. Subsequently, the prototype was launched and data-driven and user-driven evaluations were conducted.

The data-driven evaluation was based on a test corpus with 24 texts corresponding to one textual genre from each domain; the corpus comprised eight medical abstracts, eight tourism-related informative articles and eight applications to the public administration. The linguistic characteristics of these texts were manually annotated, and the manual annotation and arText results were compared. Precision and recall were measured for a series of recommendations; the results are presented in Table 2.5

Recommendation ID      Precision  Recall
1                      0.76       0.68
2                      0.75       0.94
3                      x          x
4                      1          0.94
5: sing. verbal forms  0.87       0.97
5: pl. verbal forms    1          0.99
6                      -          -
9                      0.74       0.87
10                     1          1

Table 2: Data-driven results.

5 In light of the degree of specialization, Recommendation 4 did not apply for any of the textual genres included in the test corpus. No cases for which Recommendation 6 applies were found in the test corpus.

Recommendation 7 did not apply in the medical subcorpus; in the tourism and public administration subcorpora, 91.67% and 94.70% of detected words, respectively, were repeated in the text. For Recommendation 8, 100% of highlighted sentences in the medicine and tourism subcorpora were long sentences according to the thresholds for abstracts and informative articles; no long sentences appeared in the public administration
subcorpus, so this recommendation could not be tested for this genre. No cases of Recommendation 11 were found in the medical and administration subcorpora; in the tourism corpus, 100% of detected discourse markers were repeated in the text, and adequate alternatives were proposed.

The user-driven evaluation aimed to determine how useful arText is. A survey designed using Google Forms focused on accessibility, the usefulness of the three modules and general issues. Three doctors, three tourism professionals, and 25 laypersons completed the survey; all laypersons were between 30-50 years old and had both higher education experience and internet skills. In general, respondents found arText to be user-friendly and useful; 100% of them would recommend the system to other people. Respondents found the section on structure to be the most useful module, while the approach to uploading images was considered the system's greatest weakness.

4 Conclusions and Future Work

This paper describes a prototype of an automatic system to assist users in writing specialized texts. The online arText editor helps users draft texts for 15 textual genres in three specialized domains: medicine, tourism and the public administration. It lays out the structure for each section of the document, suggests appropriate contents and stock phrases for each section, and detects typical linguistic errors. This innovative system is the first tool that considers lexical, textual and discourse features for specific textual genres. Moreover, the arText project is based on the idea that academic research can be shared with and used constructively by the general public.

In the future, the results of the data-driven evaluation will be utilized to improve arText's algorithms. A second user-driven evaluation will include a broader population (e.g. students). Finally, arText may be adapted to other textual genres, specialized domains and languages.

Acknowledgments

This article is part of the "Automatic system to help in writing specialized texts in domains relevant to Spanish society" research project, funded by "2015 BBVA Foundation Grants for Researchers and Cultural Creators". It was also supported by a Ramón y Cajal contract (RYC-2014-16935). We would like to thank Josh Goldsmith for the proofreading of the text, and the evaluation survey respondents and the research groups ACTUALing and IULATERM for their support.

References

Sandra M. Aluísio, Iris Barcelos, Jandir Sampaio, and Osvaldo N. Oliveira Jr. 2001. How to learn the many unwritten "rules of the game" of the academic discourse: A hybrid approach based on critiques and cases to support scientific writing. In Proceedings of the IEEE International Conference on Advanced Learning Technologies.

Jordi Atserias, Bernardino Casas, Elisabet Comelles, Meritxell González, Lluís Padró, and Muntsa Padró. 2006. FreeLing 1.3: Syntactic and semantic services in an open-source NLP library. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), pages 48–55.

M. Teresa Cabré. 1999. La Terminología. Representación y comunicación. Institute for Applied Linguistics, Barcelona.

Iria da Cunha and M. Amor Montané. 2016. Un análisis lingüístico de los géneros textuales del ámbito de la administración con repercusión en la vida de los ciudadanos. In Proceedings of the XV Symposium of the Ibero-American Terminology Network (RITerm 2016), pages 61–62.

Iria da Cunha, Eric SanJuan, Juan-Manuel Torres-Moreno, Marina Lloberes, and Irene Castellón. 2012. DiSeg 1.0: The first system for Spanish discourse segmentation. Expert Systems with Applications, 39(2):1671–1678.

Iria da Cunha, M. Amor Montané, and Alba Coll. In press. Detección de géneros textuales que presentan dificultades de redacción: un estudio en los ámbitos de la administración, la medicina y el turismo. In Proceedings of the 34th International Conference of the Spanish Association of Applied Linguistics.

Jianmin Dai, Roxanne B. Raine, Rod Roscoe, Zhiqiang Cai, and Danielle S. McNamara. 2011. The Writing-Pal tutoring system: Development and design. Journal of Engineering and Computer Innovations, 2(1):1–11.

Tomi Kinnunen, Henri Leisma, Monika Machunik, Tuomo Kakkonen, and Jean-Luc Lebrun. 2012. SWAN scientific writing assistant: A tool for helping scholars to write reader-friendly manuscripts. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 20–24.

William C. Mann and Sandra A. Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3):243–288.

Teun A. van Dijk. 1989. La ciencia del texto. Paidós, Barcelona.
QCRI Live Speech Translation System
Fahim Dalvi, Yifan Zhang, Sameer Khurana, Nadir Durrani, Hassan Sajjad
Ahmed Abdelali, Hamdy Mubarak, Ahmed Ali, Stephan Vogel
Qatar Computing Research Institute
Hamad bin Khalifa University, Doha, Qatar
{faimaduddin, yzhang}@qf.org.qa
Abstract

We present QCRI's Arabic-to-English speech translation system. It features modern web technologies to capture live audio, and broadcasts Arabic transcriptions and English translations simultaneously. Our Kaldi-based ASR system uses the Time Delay Neural Network architecture, while our Machine Translation (MT) system uses both phrase-based and neural frameworks. Although our neural MT system is slower than the phrase-based system, it produces significantly better translations and is memory efficient.1

1 The demo is available at https://st.qcri.org/demos/livetranslation.

Figure 1: Speech translation system in action. The Arabic transcriptions and English translations are shown in real-time as they are decoded.

1 Introduction

We present our Arabic-to-English SLT system consisting of three modules: the web application, Kaldi-based speech recognition, and a phrase-based/neural MT system. It is trained and optimized for the translation of live talks and lectures into English. We used a Time Delay Neural Network (TDNN) for our speech recognition system, which has a word error rate of 23%. For our machine translation system, we deployed both the traditional phrase-based Moses and the emerging Neural MT system. The trade-off between efficiency and accuracy (BLEU) barred us from picking only one final system. While the phrase-based system was much faster (translating 24 tokens/second versus 9.5 tokens/second), it was also roughly 5 BLEU points worse (28.6 versus 33.6) compared to our Neural MT system. We therefore leave it up to the user to decide whether they care more about translation quality or speed. The real-time factor for the entire pipeline is 1.18 using phrase-based MT and 1.26 using neural MT.

Our system is also robust to common English code-switching, frequent acronyms, as well as dialectal speech. Both the Arabic transcriptions and the English translations are presented as results to the viewers. The system is built upon modern web technologies, allowing it to run on any browser that has implemented these technologies. Figure 1 presents a screenshot of the interface.

2 System Architecture

The QCRI live speech translation system is primarily composed of three fairly independent modules: the web application, speech recognition, and machine translation. Figure 2 shows the complete work-flow of the system. It mainly involves the following steps: 1) Send audio from a broadcast instance to the ASR server; 2) Receive transcription from the ASR server; 3) Send transcription to the MT server; 4) Receive translation from the MT server; 5) Sync results with the backend system; and 6) Multiple watch instances sync results from the backend. Steps 1-5 are constantly repeated as new audio is received by the system through the broadcast page. Step 6 is also periodically repeated to get the latest results on the watch page. Both the speech recognition and machine translation modules have a standard API that can be used to send and receive information. The web application connects with the API and runs independently of the system used for transcription and translation.

Figure 2: Demo system overview

2.1 Web application

The web application has two major components;

2.2 Speech transcription

We use the speech-to-text transcription system that was built as part of QCRI's submission to the 2016 Arabic Multi-Dialect Broadcast Media Recognition (MGB) Challenge. Key features of the transcription system are given below:

Data: The training data consisted of 1,200 hours of transcribed broadcast speech data collected from the Aljazeera news channel. In addition we had 10 hours of development data (Ali et al., 2016). We used data augmentation techniques such as speed and volume perturbation, which increased the size of the training data to three times the original size (Ko et al., 2015).

Speech Lexicon: We used a grapheme-based lexicon (Killer and Schultz, 2003) of size 900k. The lexicon is constructed using the words that occur more than twice in the training transcripts.

Speech Features: Features used to train all the acoustic models are 40-dimensional hi-resolution Mel Frequency Cepstral Coefficients (MFCC hires), extracted for each speech frame, concatenated with 100-dimensional i-Vectors per
the frontend and the backend. The frontend is cre- speaker to facilitate speaker adaptation (Saon et
ated using the React Javascript framework to han- al., 2013).
dle the dynamic User Interface (UI) changes such Acoustic Models: We experimented with three
as transcription and translation updates. The back- acoustic models; Time Delayed Neural Networks
end is built using NodeJS and MongoDB to han- (TDNNs) (Peddinti et al., 2015), Long Short-Term
dle sessions, data associated with these sessions Memory Recurrent Neural Networks (LSTM) and
and authentication. The frontend presents the user Bi-directional LSTM (Sak et al., 2014). Perfor-
with three pages; the landing page, the watch page mance of the BLSTM acoustic model in terms of
and the broadcast page. The landing page allows Word Error Rate is better than the TDNN, but
the user to either create a new session or work TDNN has a much better real-time factor while
with an existing one. The watch page regularly decoding. Hence, for the purpose of the speech
syncs with the backend to get the latest partial or translation system, we use the TDNN acoustic
final transcriptions and translations. The broadcast model. The TDNN model consists of 5 hidden lay-
page is meant for the primary speaker. This page ers, each layer containing 1024 hidden units and is
is responsible for recording the audio data and col- trained using Lattice Free Maximum Mutual Infor-
lecting the transcriptions and translations from the mation (LF-MMI) modeling framework in Kaldi
ASR and MT systems respectively. Both partial (Povey et al., 2016). Word Error Rate comparison
transcriptions and translations are also presented of different acoustic models can be seen in Table
to the speaker as they are being decoded. To avoid 1. For further details, see Khurana and Ali (2016).
very frequent and abrupt changes, the rate of up- Language Model: We built a Kneser Ney
date of partial translations was configured based smoothed trigram language model. The vocab size
on a MIN NEW WORDS parameter, which defines is restricted to the 100k most frequent words to
the minimum number of new words required in improve the decoding speed and the real-time fac-
the partial transcription to trigger the translation tor of the system. The choice of using a trigram
service. Both the partial and the final results are model instead of an RNN as in our offline systems
also synced to the backend as they are made avail- was essential in keeping the decoding speed at a
able, so that the viewers on the watch page can reasonable value.
experience the live translation. Decoder Parameters: Beam size for the de-
62
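The six-step broadcast/watch loop from Section 2 can be sketched as below. The ASR and MT calls are stubbed out with placeholder functions (the real system talks to HTTP APIs whose endpoints are not given here); this is only an illustration of the data flow, not the authors' implementation.

```python
# Illustrative sketch of the broadcast/watch cycle from Section 2.
# The ASR/MT functions are stand-ins, not the real QCRI services.

def run_broadcast_cycle(audio_chunk, backend, asr=None, mt=None):
    """Steps 1-5: audio -> transcription -> translation -> backend sync."""
    asr = asr or (lambda a: f"transcript({a})")   # stand-in for the ASR server
    mt = mt or (lambda t: f"translation({t})")    # stand-in for the MT server
    transcription = asr(audio_chunk)              # steps 1-2
    translation = mt(transcription)               # steps 3-4
    backend["transcription"] = transcription      # step 5: sync with backend
    backend["translation"] = translation
    return backend

def watch_sync(backend):
    """Step 6: a watch instance pulls the latest results from the backend."""
    return dict(backend)
```

Steps 1-5 would run repeatedly as new audio arrives from the broadcast page, while `watch_sync` is polled periodically by each watch page.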
Model   %WER
TDNN    23.0
LSTM    20.9
BLSTM   19.3

Table 1: Recognition results for the LF-MMI trained recognition systems. The LM used for decoding is a tri-gram. Data augmentation is used before training.

[…] (Junczys-Dowmunt et al., 2016) decoder to use our neural models on the CPU. Because of computation constraints, we reduced the beam size from 12 to 5 with a minimal loss of 0.1 BLEU points.

The primary factors in our final decision were 3-fold: overall quality, translation time and computational constraints. The translation time has to be small for a live translation system. The performance of the four systems on the official IWSLT test-sets is shown in Figure 3.
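The MIN_NEW_WORDS throttling described in Section 2.1 can be sketched as a simple trigger condition: a new partial translation is requested only once the partial transcription has grown by at least that many words. The parameter name comes from the paper; the exact trigger logic here is our own illustration.

```python
# Sketch of the MIN_NEW_WORDS update throttling from Section 2.1.
# The threshold value below is illustrative, not the system's setting.

MIN_NEW_WORDS = 3

def should_translate(prev_partial, new_partial, min_new_words=MIN_NEW_WORDS):
    """Trigger the translation service only when enough new words arrived."""
    prev_len = len(prev_partial.split())
    new_len = len(new_partial.split())
    return new_len - prev_len >= min_new_words
```

This keeps the displayed translation from changing abruptly on every decoded word.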
We analyzed the real-time performance of the entire pipeline, including the lag induced by translation after the transcription is complete. With an average real-time factor of 1.1 for the speech recognition, our system keeps up with normal speech without any significant lag. The distribution of the real-time factor of speech recognition and translation for the in-house test sets is shown in Figure 4.

Figure 4: Real-time analysis for all audio segments in our in-house test sets

3 Conclusion

This paper presents the QCRI live speech translation system for real-world settings such as lectures and talks. Currently, the system works very well for Arabic, including frequent dialectal words, and also supports code-switching for the most common acronyms and English words. Our future aim is to improve the system in several ways: by having a tighter integration between the speech recognition and translation components, by incorporating more dialectal speech recognition and translation, and by improving punctuation recovery of the speech recognition system, which will help machine translation to produce better translation quality.

References

Ahmed Ali, Peter Bell, James Glass, Yacine Messaoui, Hamdy Mubarak, Steve Renals, and Yifan Zhang. 2016. The MGB-2 challenge: Arabic multi-dialect broadcast media recognition. In SLT.

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Nadir Durrani, Helmut Schmid, and Alexander Fraser. 2011. A joint sequence translation model with integrated reordering. In Proceedings of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT'11), Portland, OR, USA.

Nadir Durrani, Hassan Sajjad, Hieu Hoang, and Philipp Koehn. 2014. Integrating an unsupervised transliteration model into statistical machine translation. In EACL, volume 14, pages 148–153.

Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and Stephan Vogel. 2016. QCRI machine translation systems for IWSLT 16. In Proceedings of the 15th International Workshop on Spoken Language Translation, IWSLT '16, Seattle, WA, USA.

Marcin Junczys-Dowmunt, Tomasz Dwojak, and Hieu Hoang. 2016. Is neural machine translation ready for deployment? A case study on 30 translation directions. CoRR, abs/1610.01108.

Sameer Khurana and Ahmed Ali. 2016. QCRI advanced transcription system (QATS) for the Arabic multi-dialect broadcast media recognition: MGB-2 challenge. In Spoken Language Technology Workshop (SLT) 2016 IEEE.

Mirjam Killer and Tanja Schultz. 2003. Grapheme based speech recognition. In INTERSPEECH.

Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. 2015. Audio augmentation for speech recognition. In INTERSPEECH.

Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. 2015. A time delay neural network architecture for efficient modeling of long temporal contexts. In INTERSPEECH, pages 3214–3218.

Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahrmani, Vimal Manohar, Xingyu Na, Yiming Wang, and Sanjeev Khudanpur. 2016. Purely sequence-trained neural networks for ASR based on lattice-free MMI.

Hasim Sak, Andrew W. Senior, and Françoise Beaufays. 2014. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In INTERSPEECH.

George Saon, Hagen Soltau, David Nahamoo, and Michael Picheny. 2013. Speaker adaptation of neural network acoustic models using i-vectors. In ASRU, pages 55–59.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Edinburgh neural machine translation systems for WMT 16. In Proceedings of the First Conference on Machine Translation, pages 371–376, Berlin, Germany, August. Association for Computational Linguistics.
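The real-time factors quoted above (1.18 for phrase-based MT, 1.26 for neural MT, 1.1 for ASR alone) follow the usual definition: processing time divided by audio duration, with values at or below 1.0 meaning the system keeps up with live speech. A small sketch of the arithmetic, with made-up segment timings (the additive ASR+MT decomposition here is our own illustration, not the paper's measurement method):

```python
# Real-time factor (RTF): processing time / audio duration.
# Timings below are illustrative, not measurements from the paper.

def real_time_factor(processing_seconds, audio_seconds):
    return processing_seconds / audio_seconds

def pipeline_rtf(asr_seconds, mt_seconds, audio_seconds):
    # Total pipeline RTF: ASR plus MT processing time over the same audio.
    return (asr_seconds + mt_seconds) / audio_seconds
```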
Nematus: a Toolkit for Neural Machine Translation
Neural Machine Translation (NMT) (Bahdanau et al., 2015; Sutskever et al., 2014) has recently established itself as a new state-of-the-art in machine translation. We present Nematus1, a new toolkit for Neural Machine Translation.

Nematus has its roots in the dl4mt-tutorial.2 We found the codebase of the tutorial to be compact, simple and easy to extend, while also producing high translation quality. These characteristics make it a good starting point for research in NMT. Nematus has been extended to include new functionality based on recent research, and has been used to build top-performing systems for last year's shared translation tasks at WMT (Sennrich et al., 2016) and IWSLT (Junczys-Dowmunt and Birch, 2016).

Nematus is implemented in Python, and based on the Theano framework (Theano Development Team, 2016). It implements an attentional encoder–decoder architecture similar to Bahdanau et al. (2015). Our neural network architecture differs from theirs in some aspects, and we will discuss the differences in more detail. We will also describe additional functionality, aimed at enhancing usability and performance, which has been implemented in Nematus.

• We implement a novel conditional GRU with attention.

• In the decoder, we use a feedforward hidden layer with tanh non-linearity rather than a maxout before the softmax layer.

• In both encoder and decoder word embedding layers, we do not use additional biases.

• Compared to the Look, Generate, Update decoder phases in Bahdanau et al. (2015), we implement Look, Update, Generate, which drastically simplifies the decoder implementation (see Table 1).

• Optionally, we perform recurrent Bayesian dropout (Gal, 2015).

• Instead of a single word embedding at each source position, our input representations allow multiple features (or "factors") at each time step, with the final embedding being the concatenation of the embeddings of each feature (Sennrich and Haddow, 2016).

• We allow tying of embedding matrices (Press and Wolf, 2017; Inan et al., 2016).

We will here describe some differences in more detail:

1 Available at https://github.com/rsennrich/nematus
2 https://github.com/nyu-dl/dl4mt-tutorial
Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 65–68
c 2017 Association for Computational Linguistics
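The factored-input bullet above (Sennrich and Haddow, 2016) amounts to keeping one embedding table per input feature and concatenating the looked-up vectors. A minimal numpy sketch, with made-up factor names and sizes (Nematus' actual Theano code is not reproduced here):

```python
# Sketch of factored input embeddings: each source position carries several
# features (e.g. subword, POS), each with its own embedding table, and the
# final input vector is the concatenation of the per-factor embeddings.
# Factor names and dimensions are illustrative.
import numpy as np

rng = np.random.default_rng(0)
# one embedding table per factor: factor name -> (vocab_size, dim)
tables = {
    "subword": rng.normal(size=(100, 6)),
    "pos": rng.normal(size=(20, 2)),
}

def embed_token(factor_ids):
    """factor_ids: {"subword": id, "pos": id} -> concatenated embedding."""
    parts = [tables[name][idx] for name, idx in factor_ids.items()]
    return np.concatenate(parts)
```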
Table 1: Decoder phase differences

RNNSearch (Bahdanau et al., 2015):
  Look:     $c_j \leftarrow s_{j-1}, C$
  Generate: $y_j \leftarrow s_{j-1}, y_{j-1}, c_j$
  Update:   $s_j \leftarrow s_{j-1}, y_j, c_j$

Nematus (DL4MT):
  Look:     $c_j \leftarrow s_{j-1}, y_{j-1}, C$
  Update:   $s_j \leftarrow s_{j-1}, y_{j-1}, c_j$
  Generate: $y_j \leftarrow s_j, y_{j-1}, c_j$

Given a source sequence $(x_1, \ldots, x_{T_x})$ of length $T_x$ and a target sequence $(y_1, \ldots, y_{T_y})$ of length $T_y$, let $h_i$ be the annotation of the source symbol at position $i$, obtained by concatenating the forward and backward encoder RNN hidden states, $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$, and let $s_j$ be the decoder hidden state at position $j$.

decoder initialization Bahdanau et al. (2015) initialize the decoder hidden state $s$ with the last backward encoder state,

$$s_0 = \tanh\left(W_{init} \overleftarrow{h}_1\right)$$

with $W_{init}$ as trained parameters.3 We use the average annotation instead:

$$s_0 = \tanh\left(W_{init} \frac{\sum_{i=1}^{T_x} h_i}{T_x}\right)$$

conditional GRU with attention Nematus implements a novel conditional GRU with attention, cGRU$_{att}$. A cGRU$_{att}$ uses its previous hidden state $s_{j-1}$, the whole set of source annotations $C = \{h_1, \ldots, h_{T_x}\}$ and the previously decoded symbol $y_{j-1}$ in order to update its hidden state $s_j$, which is further used to decode symbol $y_j$ at position $j$:

$$s_j = \mathrm{cGRU}_{att}(s_{j-1}, y_{j-1}, C)$$

Our conditional GRU layer with attention mechanism, cGRU$_{att}$, consists of three components: two GRU state transition blocks and an attention mechanism ATT in between. The first transition block, GRU$_1$, combines the previously decoded symbol $y_{j-1}$ and the previous hidden state $s_{j-1}$ in order to generate an intermediate representation $s'_j$ with the following formulations:

$$s'_j = \mathrm{GRU}_1(y_{j-1}, s_{j-1}) = (1 - z'_j) \odot \bar{s}'_j + z'_j \odot s_{j-1},$$
$$\bar{s}'_j = \tanh\left(W' E[y_{j-1}] + r'_j \odot (U' s_{j-1})\right),$$
$$r'_j = \sigma\left(W'_r E[y_{j-1}] + U'_r s_{j-1}\right),$$
$$z'_j = \sigma\left(W'_z E[y_{j-1}] + U'_z s_{j-1}\right),$$

with $\bar{s}'_j$ being the proposal hidden state, and $r'_j$ and $z'_j$ being the reset and update gate activations. In this formulation, $W'$, $U'$, $W'_r$, $U'_r$, $W'_z$, $U'_z$ are trained model parameters; $\sigma$ is the logistic sigmoid activation function.

The attention mechanism, ATT, inputs the entire context set $C$ along with the intermediate hidden state $s'_j$ in order to compute the context vector $c_j$ as follows:

$$c_j = \mathrm{ATT}(C, s'_j) = \sum_{i=1}^{T_x} \alpha_{ij} h_i,$$
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{kj})},$$
$$e_{ij} = v_a^\top \tanh\left(U_a s'_j + W_a h_i\right),$$

where $\alpha_{ij}$ is the normalized alignment weight between the source symbol at position $i$ and the target symbol at position $j$, and $v_a$, $U_a$, $W_a$ are trained model parameters.

Finally, the second transition block, GRU$_2$, generates $s_j$, the hidden state of the cGRU$_{att}$, by looking at the intermediate representation $s'_j$ and the context vector $c_j$ with the following formulations:

$$s_j = \mathrm{GRU}_2(s'_j, c_j) = (1 - z_j) \odot \bar{s}_j + z_j \odot s'_j,$$
$$\bar{s}_j = \tanh\left(W c_j + r_j \odot (U s'_j)\right),$$
$$r_j = \sigma\left(W_r c_j + U_r s'_j\right),$$
$$z_j = \sigma\left(W_z c_j + U_z s'_j\right),$$

similarly, with $\bar{s}_j$ being the proposal hidden state, and $r_j$ and $z_j$ being the reset and update gate activations, with trained model parameters $W$, $U$, $W_r$, $U_r$, $W_z$, $U_z$.

Note that the two GRU blocks are not individually recurrent; recurrence only occurs at the level of the whole cGRU layer. This way of combining RNN blocks is similar to what is referred to in the literature as deep transition RNNs (Pascanu et al., 2014; Zilly et al., 2016), as opposed to the more common stacked RNNs (Schmidhuber, 1992; El Hihi and Bengio, 1995; Graves, 2013).

deep output Given $s_j$, $y_{j-1}$, and $c_j$, the output probability $p(y_j \mid s_j, y_{j-1}, c_j)$ is computed by a softmax activation, using an intermediate representation $t_j$.
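The GRU$_1$ → ATT → GRU$_2$ step above can be traced with a small numpy sketch. All dimensions and parameter values below are made up for illustration; Nematus itself implements this in Theano.

```python
# Minimal numpy sketch of one cGRU_att decoder step (GRU_1 -> ATT -> GRU_2),
# following the equations above. Parameters are random illustrative values.
import numpy as np

rng = np.random.default_rng(1)
d, e, Tx = 4, 3, 5          # hidden size, embedding size, source length
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
P = lambda *shape: rng.normal(scale=0.1, size=shape)  # random parameter helper

# GRU_1 parameters (inputs: E[y_{j-1}], s_{j-1})
W1, U1 = P(d, e), P(d, d)
Wr1, Ur1, Wz1, Uz1 = P(d, e), P(d, d), P(d, e), P(d, d)
# attention parameters; annotations h_i have size 2d (forward;backward concat)
va, Ua, Wa = P(d), P(d, d), P(d, 2 * d)
# GRU_2 parameters (inputs: c_j of size 2d, s'_j)
W2, U2 = P(d, 2 * d), P(d, d)
Wr2, Ur2, Wz2, Uz2 = P(d, 2 * d), P(d, d), P(d, 2 * d), P(d, d)

def gru(x, h, W, U, Wr, Ur, Wz, Uz):
    r = sigmoid(Wr @ x + Ur @ h)                 # reset gate
    z = sigmoid(Wz @ x + Uz @ h)                 # update gate
    hbar = np.tanh(W @ x + r * (U @ h))          # proposal state
    return (1 - z) * hbar + z * h

def cgru_att_step(s_prev, y_emb, C):
    s_tmp = gru(y_emb, s_prev, W1, U1, Wr1, Ur1, Wz1, Uz1)      # GRU_1 -> s'_j
    scores = np.array([va @ np.tanh(Ua @ s_tmp + Wa @ h) for h in C])
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                                         # alignment weights
    c = alpha @ C                                                # context vector c_j
    return gru(c, s_tmp, W2, U2, Wr2, Ur2, Wz2, Uz2)             # GRU_2 -> s_j
```

Note that only the whole `cgru_att_step` is recurrent (it consumes `s_prev` and returns the next state); neither GRU block is recurrent on its own, matching the deep-transition view in the text.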
3 Training Algorithms

By default, the training objective in Nematus is cross-entropy minimization on a parallel training corpus. Training is performed via stochastic gradient descent, or one of its variants with adaptive learning rate (Adadelta (Zeiler, 2012), RmsProp (Tieleman and Hinton, 2012), Adam (Kingma and Ba, 2014)).

Additionally, Nematus supports minimum risk training (MRT) (Shen et al., 2016) to optimize towards an arbitrary, sentence-level loss function. Various MT metrics are supported as loss function, including smoothed sentence-level BLEU (Chen and Cherry, 2014), METEOR (Denkowski and Lavie, 2011), BEER (Stanojevic and Sima'an, 2014), and any interpolation of implemented metrics.

To stabilize training, Nematus supports early stopping based on cross entropy, or an arbitrary loss function defined by the user.

4 Usability Features

In addition to the main algorithms to train and decode with an NMT model, Nematus includes features aimed towards facilitating experimentation with the models, and their visualisation. Various model parameters are configurable via a command-line interface, and we provide extensive documentation of options, and sample set-ups for training systems.

Nematus provides support for applying single models, as well as using multiple models in an ensemble – the latter is possible even if the model architectures differ, as long as the output vocabulary is the same. At each time step, the probability distribution of the ensemble is the geometric average of the individual models' probability distributions. The toolkit includes scripts for beam search decoding, parallel corpus scoring and n-best-list rescoring.

Nematus includes utilities to visualise the attention weights for a given sentence pair, and to visualise the beam search graph. An example of the latter is shown in Figure 1. Our demonstration will cover how to train a model using the command-line interface, and will show various functionalities of Nematus, including decoding and visualisation, with pre-trained models.4

Figure 1: Search graph visualisation for DE→EN translation of "Hallo Welt!" with beam size 3.

4 Pre-trained models for 8 translation directions are available at http://statmt.org/rsennrich/wmt16_systems/

5 Conclusion

We have presented Nematus, a toolkit for Neural Machine Translation. We have described implementation differences to the architecture by Bahdanau et al. (2015); due to the empirically strong performance of Nematus, we consider these to be of wider interest.

We hope that researchers will find Nematus an accessible and well documented toolkit to support their research. The toolkit is by no means limited to research, and has been used to train MT systems that are currently in production (WIPO, 2016).

Nematus is available under a permissive BSD license.

Acknowledgments

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements 645452 (QT21), 644333 (TraMOOC), 644402 (HimL) and 688139 (SUMMA).

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the International Conference on Learning Representations (ICLR).

Boxing Chen and Colin Cherry. 2014. A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 362–367, Baltimore, Maryland, USA.
Michael Denkowski and Alon Lavie. 2011. Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 85–91, Edinburgh, Scotland.

Salah El Hihi and Yoshua Bengio. 1995. Hierarchical Recurrent Neural Networks for Long-Term Dependencies. In NIPS, volume 409.

Yarin Gal. 2015. A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. ArXiv e-prints.

Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.

Hakan Inan, Khashayar Khosravi, and Richard Socher. 2016. Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling. CoRR, abs/1611.01462.

Marcin Junczys-Dowmunt and Alexandra Birch. 2016. The University of Edinburgh's systems submission to the MT task at IWSLT. In The International Workshop on Spoken Language Translation (IWSLT), Seattle, USA.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Razvan Pascanu, Çağlar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. 2014. How to Construct Deep Recurrent Neural Networks. In International Conference on Learning Representations 2014 (Conference Track).

Ofir Press and Lior Wolf. 2017. Using the Output Embedding to Improve Language Models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Valencia, Spain.

Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Minimum Risk Training for Neural Machine Translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany.

Milos Stanojevic and Khalil Sima'an. 2014. BEER: BEtter Evaluation as Ranking. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 414–419, Baltimore, Maryland, USA.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, pages 3104–3112, Montreal, Quebec, Canada.

Theano Development Team. 2016. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688.

Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5 - rmsprop.

WIPO. 2016. WIPO Develops Cutting-Edge Translation Tool For Patent Documents, Oct. http://www.wipo.int/pressroom/en/articles/2016/article_0014.html.

Matthew D. Zeiler. 2012. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.

Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. 2016. Recurrent highway networks. arXiv preprint arXiv:1607.03474.
A tool for extracting sense-disambiguated example sentences through user feedback

Beto Boullosa†, Richard Eckart de Castilho†, Alexander Geyken‡, Lothar Lemnitzer‡ and Iryna Gurevych†

† Ubiquitous Knowledge Processing Lab, Department of Computer Science, Technische Universität Darmstadt
http://www.ukp.tu-darmstadt.de
‡ Berlin-Brandenburg Academy of Sciences
http://www.bbaw.de
Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 69–72
c 2017 Association for Computational Linguistics
tor of good examples based on a hand-written, rule-based grammar, determining the quality of a sentence according to its grammatical structure combined with some simpler features as used by GDEX. None of those works, however, focused on differentiating example sentences according to the word senses or the target words.

Cook et al. (2013) used Word Sense Induction (WSI) to identify novel senses for words in a dictionary. They utilized hierarchical LDA (Latent Dirichlet Allocation) (Teh et al., 2006), a variation of the original LDA (Blei et al., 2003) topic model, to identify novel word senses, later combining this approach with GDEX to allow extracting good example sentences according to word senses (Cook et al., 2014). However, they obtained "encouraging rather than conclusive" results, especially due to limitations of the LDA approach in linking identified topics with word senses.

Our work explores and develops a similar approach of using topic modeling and WSI to cluster sentences according to the senses of a target word, but we take a step further, using the initial clusters as seeds for a series of interactive classification steps. The training data for each classification step are sentences whose confidence scores, as calculated by the system, exceed a threshold, plus sentences manually labeled by the user. The process leads to a user-driven refinement of the labeling.

3 System description

The computer-assisted, interactive sense disambiguation process supported by our system involves: 1) importing sentences into a searchable index; 2) retrieving sentences containing a specific word (lemma); 3) clustering the selected sentences, providing a starting point for interactive classification; 4) training a multi-class classifier from the initial clusters, which trains on the most representative sentences for each cluster and is then used to label the rest of the sentences; a sentence is "representative" if the confidence score calculated by the system exceeds a configurable threshold; 5) refining the classifier by interactively correcting sense assignments for selected sentences.

The system supports multiple users working in so-called projects, which define, among other things, the list of stopwords available for clustering and classification and the location where the sentences should be retrieved from.

A project contains jobs, corresponding to tasks performed over a certain headword (actually defined by its lemma and POS tag). Tasks include searching for initial sentences, clustering and classification. Users can work on many jobs in parallel and in isolation, which allows calculating inter-rater agreement on the sense disambiguation task.

As for the technologies, the tool was developed using Java Wicket and relying on Solr1, Mallet2, DKPro Core (Eckart de Castilho and Gurevych, 2014), Tomcat and MySQL. The next subsections describe the system in more detail.

1 http://lucene.apache.org/solr/
2 http://mallet.cs.umass.edu

3.1 Searching

After starting a job, the user goes to the Search page to look for sentences containing the desired lemma. They are shown in a KWIC (Keyword in Context) table with their ids. When satisfied with the results, the user selects the stopwords to use in the next steps and goes to the Clustering phase.

3.2 Clustering

On the clustering page, the selected sentences are automatically divided into clusters corresponding to topics that ideally relate to the senses of the target word. The user manually configures how many clusters the topic modeling generates and controls the hyper-parameters to fine-tune the process. We currently use Mallet's LDA implementation for topic modeling (McCallum, 2002).

The clustering page lists the selected sentences according to the generated clusters. Each cluster is shown in its own column, with a word cloud on top containing the main words related to it, which helps the user to assess the cluster's quality and meaning. The word sizes correspond to their LDA-calculated weights. The user can change hyper-parameters and regenerate clusters as often as desired, before proceeding to the classification step.

3.3 Classification

In the classification step, the user interactively refines the results, giving feedback to the automatic classification in order to improve the labeling of the sentences according to word senses. The initial automatic labels correspond to the results of the clustering phase. The user starts analyzing each sentence and decides if the automatically assigned sense label is appropriate or if a different label needs to be assigned manually. The initial default […]
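The clustering step in Section 3.2 can be illustrated with a few lines of Python: build token counts over the retrieved sentences, fit an LDA topic model with a user-chosen number of clusters, and assign each sentence to its highest-probability topic. The authors use Mallet's LDA from Java; the scikit-learn code below is purely an illustration of the same idea, not their implementation.

```python
# Illustrative sketch of sentence clustering via LDA (Section 3.2),
# using scikit-learn instead of the authors' Mallet/Java stack.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def cluster_sentences(sentences, n_clusters=2, stopwords=None):
    vectorizer = CountVectorizer(stop_words=stopwords)
    counts = vectorizer.fit_transform(sentences)
    lda = LatentDirichletAllocation(n_components=n_clusters, random_state=0)
    doc_topics = lda.fit_transform(counts)   # per-sentence topic distribution
    return doc_topics.argmax(axis=1)         # hard cluster assignment
```

The per-topic word weights (available from the fitted model) are what would drive the word clouds shown at the top of each cluster column.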
Figure 1: Classification page

User      Assist  Accuracy  Time (mm:ss)
Annot. 1  No      0.96      8:05
Annot. 1  Yes     0.95      6:05
Annot. 2  No      0.90      10:05
Annot. 2  Yes     0.90      7:55

[…] approach for Word Sense Disambiguation, known for its efficiency (Hristea, 2013). We use the Mallet implementation of Naive Bayes, with the sentence tokens as features.
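The interactive classification loop (Sections 2 and 3.3) trains a Naive Bayes classifier with sentence tokens as features and auto-labels only the sentences whose confidence exceeds a threshold, leaving the rest for the user. The sketch below uses scikit-learn rather than the authors' Mallet implementation, and the threshold value is illustrative.

```python
# Sketch of threshold-gated Naive Bayes sense labeling (Section 3.3).
# scikit-learn stands in for Mallet; threshold and data are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def classify_with_threshold(train_texts, train_labels, new_texts, threshold=0.8):
    vec = CountVectorizer()
    clf = MultinomialNB()
    clf.fit(vec.fit_transform(train_texts), train_labels)
    probs = clf.predict_proba(vec.transform(new_texts))
    labels = []
    for row in probs:
        best = row.argmax()
        # below-threshold sentences stay unlabeled, pending user feedback
        labels.append(clf.classes_[best] if row[best] >= threshold else None)
    return labels
```

User corrections would be added to `train_texts`/`train_labels` before the next refinement round, which is the user-driven loop the paper describes.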
a slight loss of quality in the results of the first annotator when using assistance. This might indicate a negative influence of system suggestions on the annotator, or it could be attributable to a more difficult random selection of the samples in certain cases. These effects call for further investigation. However, using the tool outside the evaluation setup, we noted subjective speedups from developing smart strategies to optimize the use of the information provided by the machine (e.g. quickly annotating sentences containing specific context words or sorting by sense and confidence score). Thus, we expect the automatic assistance to have a larger impact in actual use than the present evaluation can show.

5 Conclusions and future work

We have introduced a novel system for interactive word sense disambiguation in the context of example sentence extraction. Using a combination of unsupervised and supervised methods, corpus processing and user feedback, our approach aims at helping lexicographers to properly assess and refine a list of pre-analyzed example sentences according to the senses of a target word, alleviating the burden of doing this manually. Using various state-of-the-art techniques, including clustering and classification methods for word sense induction, and incorporating user feedback into the process of shaping word senses and associating sentences to them, the tool can be a valuable addition to a lexicographer's toolset. As next steps, we plan to focus on the extraction of "good" examples, adding support for ranking sentences in different sense clusters according to operational criteria like in Didakowski et al. (2012). We also plan to extend the evaluation and to observe the strategies that users develop, in order to discover if they can inform further improvements to the system.

Acknowledgments

This work has received funding from the European Union's Horizon 2020 research and innovation programme (H2020-EINFRA-2014-2) under grant agreement No. 654021. It reflects only the author's views and the EU is not liable for any use that may be made of the information contained therein. It was further supported by the German Federal Ministry of Education and Research (BMBF) under the promotional reference 01UG1416B (CEDIFOR) and by the German Research Foundation under grant No. EC 503/1-1 and GU 798/21-1 (INCEpTION).

References

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, March.

Paul Cook, Jey Han Lau, Michael Rundell, Diana McCarthy, and Timothy Baldwin. 2013. A lexicographic appraisal of an automatic approach for detecting new word-senses. In Proceedings of eLex 2013, pages 49–65, Tallinn, Estonia.

Paul Cook, Michael Rundell, Jey Han Lau, and Timothy Baldwin. 2014. Applying a word-sense induction system to the automatic extraction of diverse dictionary examples. In Proceedings of the XVI EURALEX International Congress, pages 319–328, Bolzano, Italy.

Jörg Didakowski, Lothar Lemnitzer, and Alexander Geyken. 2012. Automatic example sentence extraction for a contemporary German dictionary. In Proceedings of the XV EURALEX International Congress, pages 343–349, Oslo, Norway.

Richard Eckart de Castilho and Iryna Gurevych. 2014. A broad-coverage collection of portable NLP components for building shareable analysis pipelines. In Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT, pages 1–11, Dublin, Ireland, August.

Florentina T. Hristea. 2013. The naïve bayes model in the context of word sense disambiguation. In The Naïve Bayes Model for Unsupervised Word Sense Disambiguation: Aspects Concerning Feature Selection, pages 9–16. Springer, Heidelberg, Germany.

Adam Kilgarriff, Miloš Husák, Katy McAdam, Michael Rundell, and Pavel Rychlý. 2008. GDEX: Automatically finding good dictionary examples in a corpus. In Proceedings of the XIII EURALEX International Congress, pages 425–432, Barcelona, Spain.

Robert Lew. 2015. Dictionaries and their users. In International Handbook of Modern Lexis and Lexicography. Springer, Heidelberg, Germany.

Andrew Kachites McCallum. 2002. MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu.

Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. 2006. Hierarchical dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.
Lingmotif: Sentiment Analysis for the Digital Humanities
Antonio Moreno-Ortiz
University of Málaga
Spain
amo@uma.es
Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 73–76
c 2017 Association for Computational Linguistics
many cases. For example, we can classify the word kill as negative and then include phrases such as kill time with a neutral valence. When this is not possible, the options are to include the item with the more statistically probable polarity, or simply to leave it out when the chances of it occurring with one polarity or the other are similar.
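The multiword-entry strategy just described can be sketched as a longest-match lexicon lookup. This is an illustrative sketch, not Lingmotif’s actual implementation; the lexicon entries and valence scores are invented for the example:

```python
# Sketch (not Lingmotif's actual code) of longest-match lexicon lookup:
# multiword entries such as "kill time" override the polarity of "kill".

LEXICON = {
    ("kill",): -1,          # negative as a single word
    ("kill", "time"): 0,    # neutral as part of the phrase "kill time"
}
MAX_NGRAM = max(len(k) for k in LEXICON)

def score_tokens(tokens):
    """Scan left to right, always preferring the longest matching entry."""
    scores, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_NGRAM, len(tokens) - i), 0, -1):
            entry = tuple(tokens[i:i + n])
            if entry in LEXICON:
                scores.append((" ".join(entry), LEXICON[entry]))
                i += n
                break
        else:
            i += 1  # token not in the lexicon
    return scores

print(score_tokens("we kill time before the flight".split()))
# the phrase match wins: [('kill time', 0)]
```

Because the longest entry is tried first, the neutral phrase reading suppresses the negative single-word reading whenever both apply.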
A single input text can be typed or pasted in the text box area, or text files can be loaded, depending on the selected input type. Loading files allows the user to select one of them and analyze it in single-mode, or select the complete set of files.

3.1 Single-document mode

For every text analyzed, either in single or multi-document mode, Lingmotif produces a number of metrics for each individual text. The two metrics that summarize the text’s overall sentiment are the Text Sentiment Score (TSS) and the Text Sentiment Intensity (TSI). Both are displayed by means of visual, animated (Javascript) gauges at the top of the results page. The numeric indexes (on a 0-100 scale) are categorized in ranges, from “extremely negative” to “extremely positive”, to make numeric results more intuitively interpretable by the user. Both gauges are also color- and intensity-coded in the red (negative) to green (positive) range (see Figure 2).

For long texts, Lingmotif will also generate a sentiment profile, which is a visual representation of the text’s internal structure and organization in terms of sentiment expression. This Javascript graph is interactive: hovering the data points will display the lexical items that make up that particular text segment (see Figure 3).

Figure 3: Document sentiment profile

The next three sections of the results document are shown in Figure 4 below. First is the quantitative data tables, which include common text metrics and a breakdown of the sentiment analysis data for the analyzed text. The final results section shows the input text after processing, where the identified sentiment items are color-coded to represent their polarity. This makes it possible to know exactly what the analyzer found in the text.

Figure 4: Sentiment scores gauges

3.2 Multi-document analysis

Multiple input texts can be analyzed in one of several modes (see below). When in multi-document mode, Lingmotif will analyze documents one by one, generating one HTML file for each, although they will not be displayed on the browser, just saved to the output folder. When the analysis is finished, a single results page will be displayed. This page is a summary of results, and is different from the single-document results page: the gauges for TSS and TSI are now the average for the analyzed set, and the detailed analysis section contains a quantitative analysis of each of the files in the set. The first column in this table shows the title of the document (file name without extension) as a hyperlink to the HTML file for that particular file.

Available multi-document analysis modes are the following:

• Classifier (default): a stacked bar graph and data table are offered showing classification results based on their TSS category. The graph offers a visualization of results (see Figure 6); both its legend and the graph itself are interactive. A table summarizing the classification results is also offered.
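The TSS banding and the Classifier mode’s grouping can be sketched as follows. The paper does not publish Lingmotif’s exact cut-off values, so the band boundaries below are assumptions for the sketch:

```python
# Illustrative only: the exact Lingmotif cut-offs are not published,
# so the band boundaries below are assumed for the sketch.
BANDS = [
    (20, "extremely negative"),
    (40, "negative"),
    (60, "neutral"),
    (80, "positive"),
    (100, "extremely positive"),
]

def tss_category(score):
    """Map a 0-100 Text Sentiment Score to a coarse label."""
    for upper, label in BANDS:
        if score <= upper:
            return label
    raise ValueError("score must be in the 0-100 range")

# Classifier mode groups a document set by these categories.
docs = {"a.txt": 12, "b.txt": 55, "c.txt": 91}
by_band = {}
for name, score in docs.items():
    by_band.setdefault(tss_category(score), []).append(name)
print(by_band)
```

The grouped dictionary corresponds to the stacked-bar summary described for Classifier mode: one bar segment per TSS category, listing the files that fall into it.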
4 Conclusions

Lingmotif goes beyond what SA classifiers have to offer. It offers automatic identification of sentiment-laden words and phrases, as well as text segments. Its many visual representations of the text’s structure from a sentiment perspective make

References

A. Aue and M. Gamon. 2005. Customizing sentiment classifiers to new domains: A case study.
RAMBLE ON: Tracing Movements of Popular Historical Figures
Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 77–80
© 2017 Association for Computational Linguistics
text, a set of destinations together with their coordinates and a date, each representative of the place where a person moved or lived at the given time-point.

Figure 1: Information extraction workflow.

Input Data In our approach, information extraction is performed on Wikipedia biographical pages. In the first step, these pages are cleaned up by removing infoboxes, tables and tags, keeping only the main body as raw text.

Pre-processing Raw text is processed using PIKES (Corcoglioniti et al., 2015), a suite of tools for extracting frame-oriented knowledge from English texts. PIKES integrates Semafor (Das et al., 2014), a system for Semantic Role Labeling based on FrameNet (Baker et al., 1998), whose output is used to identify predicates related to movements and their arguments, because its high-level organization in semantic frames is a useful way to generalize over predicates. PIKES also includes Stanford CoreNLP (Manning et al., 2014). Its modules for Named Entity Recognition and Classification (NERC), coreference resolution and recognition of time expressions are used to detect for each text: (i) mentions related to the person who is the subject of the biography; (ii) locations and organizations that can be movement destinations; (iii) dates.

Frame Selection Starting from the frames related to the Motion frame in FrameNet and a manual analysis of a set of biographies, we identified 45 candidate frames related to direct (e.g. Departing) or indirect (e.g. Residence) movements of people. After a manual evaluation of these 45 frames on a set of biographies annotated with PIKES, we removed 16 of them from the list of candidate frames because of the high number of false positives; these include, for example, Escaping, Getting underway and Touring. Combining the information from the CoreNLP modules in PIKES with the remaining 29 frames listed in Table 1, our application extracts a list of candidate sentences, each containing a date and a movement of the subject together with a destination. These represent the geographical position of a person at a certain time.

Table 1: List of frames selected for RAMBLE ON: Arriving, Attending, Becoming a member, Being employed, Bringing, Cause motion, Colonization, Come together, Conquering, Cotheme, Departing, Detaining, Education teaching, Fleeing, Inhibit movement, Meet with, Motion, Receiving, Residence, Scrutiny, Self motion, Sending, Speak on topic, State continue, Temporary stay, Transfer, Travel, Used up, Working on.

Georeferencing To georeference all the destinations mentioned in the candidate sentences, RAMBLE ON uses Nominatim (https://nominatim.openstreetmap.org/). Due to errors by the NERC module (e.g., Artaman League annotated as a geographical entity), some destinations can lack coordinates and are thus discarded. Moreover, for each biography, the places and dates of birth and death of the subject are added as taken from DBpedia.

Output Data Details about the movements extracted from each Wikipedia biography, e.g. date, coordinates and the original snippet of text, are saved in a JSON file as shown in the example below. This output format accepts additional fields with information about the subject of the biography, e.g. gender or occupation domain, that could be extracted from other resources.

"name": "Maria_Montessori",
"movements": [{
    "date": 19151100,
    "place": "Italy",
    "latitude": 41.87194,
    "longitude": 12.5673,
    "predicate_id": "t3147",
    "predicate": "returned",
    "resource": "FrameNet",
    "resource_frame": "Arriving",
    "place_frame": "@Goal",
    "snippet": "Montessori's father died in November 1915, and she returned to Italy."
}]
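A record in this format can be consumed directly. The sketch below uses only fields from the example above; the reading of the packed date (YYYYMMDD, with 00 for an unknown day) is an inference from the sample value 19151100, and the filtering helper is illustrative, not part of RAMBLE ON:

```python
import json

# A record mirroring the example above; the packed-date reading
# (YYYYMMDD, with 00 for an unknown day) is inferred from the sample.
record = json.loads("""
{"name": "Maria_Montessori",
 "movements": [{"date": 19151100, "place": "Italy",
                "latitude": 41.87194, "longitude": 12.5673,
                "predicate": "returned", "resource_frame": "Arriving"}]}
""")

def unpack_date(packed):
    """Split a packed YYYYMMDD integer into (year, month, day)."""
    return packed // 10000, packed // 100 % 100, packed % 100

def in_span(movements, start, end):
    """Keep movements whose packed date falls inside [start, end],
    as the Navigator's time-span filter does on the client side."""
    return [m for m in movements if start <= m["date"] <= end]

for m in in_span(record["movements"], 19140000, 19200000):
    year, month, day = unpack_date(m["date"])
    print(record["name"], year, month, m["place"])
# → Maria_Montessori 1915 11 Italy
```

Because packed dates sort numerically in chronological order, the span test reduces to a plain integer comparison.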
4 Ramble On Navigator

Movements as extracted with the procedure described in Section 3 are graphically presented on an interactive map that visualizes trajectories between places. The interface, called RAMBLE ON Navigator, is built using technology based on web standards (HTML5, CSS3, Javascript) and open-source libraries for data visualization and geographical representation, i.e. d3.js and Leaflet. Through this interface (see Figure 2) it is possible to filter the movements on the basis of the time span or to search for a specific individual. Moreover, if information about nationality and domain of occupation is provided in the JSON files, the Navigator allows further filtering of the search. Hovering the mouse on a trajectory, the snippet of text from which it was automatically extracted appears on the bottom left. Information about all the movements related to a place is displayed when hovering on a spot on the map. The trajectories have an arrow indicating the route destination and are dashed if the movement described by the snippet started before the selected time span.

The online version of the Navigator shows the output of the case study presented in Section 5, while the stand-alone application also allows uploading another set of data.

5 Case Study

We relied on the Pantheon dataset (Yu et al., 2016) to identify a list of notable figures to be used in our case study. We chose Pantheon since it provides a ready-to-use set of people already classified into categories based on their domain of occupation (e.g., Arts, Sports), birth year, nationality and gender. More specifically, we considered 2,407 individuals from Europe and North America living between 1900 and 1955. First we downloaded the corresponding Wikipedia pages, as published in April 2016, collecting a corpus of more than 7.5 million words. Then we used the workflow described in Section 3 and we enriched the output data with the categories taken from Pantheon. We manually refined the output by removing the sentences wrongly identified as movements (14.02%), for example those not referring to the subject of the biography (e.g., When communist North Korea invaded South Korea in 1950, he sent in U.S. troops). The final dataset resulted in 2,929 sentences from 1,283 biographies, since 1,124 individuals had no associated movements. This may be due to either an actual lack of sentences concerning movements or errors in the automatic processing, e.g., missed identification of places or dates. Table 2 shows the distribution per domain of the individuals with associated movements. Moreover, we corrected the coordinates of places wrongly georeferenced (6.7%). The extracted movements are evoked by predicates associated to 66 different lemmas (lexical units in FrameNet). The most frequent lemmas (> 100 occurrences) are: return (567), move (556), visit (253), travel (188), attend (182), go (153), live (111), arrive (107).

DOMAIN                # of individuals   # of movements
Arts                  647 (348)          788
Science & Technology  591 (318)          631
Humanities            502 (276)          709
Institutions          483 (255)          633
Public Figure         69 (30)            55
Sports                59 (30)            54
Business & Law        38 (13)            22
Exploration           18 (13)            37
TOTAL                 2,407 (1,283)      2,929

Table 2: Domain distribution of individuals in our use case. In brackets, the number of individuals with movements.

6 Conclusion and Future Work

We presented an automatic approach for the extraction and visualisation of motion trajectories, which is easy to extend to different datasets, and which can provide insights for studies in many fields, e.g., history and sociology.

In the future, we will mainly focus on improving the system coverage. Currently, missing trajectories are mainly due to (i) the presence of predicates not recognized as lexical units in FrameNet, e.g. exile; (ii) the lack of information in the English Wikipedia biography, and (iii) the presence of sentences with complex temporal structures, e.g., Cummings returned to Paris in 1921 and remained there for two years before returning to New York. These issues can be dealt with by adding missing predicates to FrameNet, extending PIKES to other languages, and experimenting with different systems for temporal information processing (Llorens et al., 2010). We also plan to apply the methodology presented in (Aprosio and Tonelli, 2015) to automatically recognize the Wikipedia text passages dealing with biographical information, so as to discard sections containing useless information.
Figure 2: Screenshot of Ramble On Navigator.
Autobank: a semi-automatic annotation tool for developing deep
Minimalist Grammar treebanks
John Torr
School of Informatics
University of Edinburgh
11 Crichton Street, Edinburgh, UK
john.torr@cantab.net
Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 81–86
© 2017 Association for Computational Linguistics
structure[1] differs considerably from that of the PTB[2]. Furthermore, given the many competing analyses for any given construction in the Minimalist literature, no single MG treebank will be universally accepted. It is therefore hoped that Autobank will stimulate wider interest in broad-coverage statistical Minimalist parsing by providing researchers with a relatively quick and straightforward way to engineer their own MGs and treebanks, or to modify existing ones.

Autobank works as follows: the PTB first undergoes an initial preprocessing phase. Next, the researcher builds a seed MG corpus by annotating lexical items on PTB trees with MG categories and then selecting from among a set of candidate parses which are output using these categories by MGParse, an Extended Directional Minimalist Grammar (EDMG) parser (see Torr and Stabler (2016)). The system then extracts various dependency mappings between the source and target trees. Next, a set of candidate parses is generated for the remaining trees in the PTB, and these are scored using the dependency mappings extracted from the seeds. In this way, the source corpus effectively adopts a disambiguation role in lieu of any statistical model. The basic architecture of the system, which was implemented in Python and its Tkinter module, is given in fig 1.

2 Preprocessing

Autobank includes a module for preprocessing the PTB which corrects certain mistakes and adds some additional annotations carried out by various researchers since the treebank’s initial release. For example, following Hockenmaier and Steedman (2002), we have corrected cases where verbs were incorrectly labelled with the past tense tag VBD instead of the past participle tag VBN. We also extend the PTB tag set to include person, number and gender[3] information on nominals and pronominals, in order to constrain reflexive binding and agreement phenomena in MGbank.

The semantic role labels of PropBank (Palmer et al., 2005) and NomBank (Meyers et al., 2004) have also been added onto PTB non-terminals[4], along with the head word and its span. For instance, the AGENT subject NP Jack in the sentence, Jack helped her, would be annotated with the tag ARG0{helped<1,2>}. We have also added the additional NP structure from Vadas and Curran (2007), the additional structure for hyphenated compounds included in the OntoNotes 5 (Weischedel et al., 2012) version of the PTB, and the additional structure and role labels for coordination phrases recently released by Ficler and Goldberg (2016). For all structure added, any function tags are redistributed accordingly[5].

Finally, PropBank includes additional antecedent-trace co-indexing which we have also imported, and some of this implies the need for additional NP structure beyond what Vadas and Curran have provided. For instance, in the phrase, the unit of New York-based Lowes Corp that *T* makes kent cigarettes, the original annotation has the unit and of New York-based Lowes Corp as separate sister NP and PP constituents (with an SBAR node sister to both), both of which are co-indexed with the subject trace (*T*) position in PropBank. In such cases we have added an additional NP node resolving the two constituents into a single antecedent NP.

3 The manual annotation phase

Autobank provides a powerful graphical user interface enabling the researcher to construct an (ED)MG by relabelling PTB preterminals with MG categories and selecting the correct MG parse from a set of candidates generated by the parser. The main annotation environment is shown in fig 2. The PTB tree and its MG candidates are respectively displayed on the top and bottom of the screen. Between these are a number of buttons allowing for easy navigation through the PTB, including a regular expression search facility for locating specific construction types by searching both the bracketing and the string. The user can also choose to focus on sentences of a given string length. On the left, the sentence is displayed from top to bottom, each word with a drop-down menu listing all MG categories so far associated with that word’s PTB preterminal category. Like

[1] MGs are a mildly context-sensitive and computational interpretation of Chomsky’s (1995) Minimalist Program.
[2] For example, Minimalist trees contain many more null heads and traces of phrasal movement, along with shell and Xbar phrase structures, cartographic clausal and nominal structures, functional heads, head movements, covert movements/Agree operations etc.
[3] We used the NLTK name database to derive gender on proper nouns.
[4] Among other things, these crucially distinguish adjuncts from arguments, raising/ECM from subject/object control, and promise-type subject control from object control/ECM.
[5] We use a modified version of Collins’ (1999) head finding rules for this task as well as for the dependency extraction.
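The VBD/VBN correction described in Section 2 can be illustrated with a simple heuristic. This is a sketch, not the actual procedure of Hockenmaier and Steedman (2002); the have-auxiliary trigger is an assumed simplification for the example:

```python
# Illustrative sketch only (not the actual Hockenmaier & Steedman
# correction): a verb tagged VBD directly after a form of 'have'
# is retagged as a past participle (VBN).
HAVE = {"have", "has", "had", "having"}

def fix_vbd(tagged):
    """tagged: list of (word, tag) pairs; returns a corrected copy."""
    fixed = []
    for i, (word, tag) in enumerate(tagged):
        if tag == "VBD" and i > 0 and tagged[i - 1][0].lower() in HAVE:
            tag = "VBN"
        fixed.append((word, tag))
    return fixed

sent = [("Jack", "NNP"), ("has", "VBZ"), ("helped", "VBD"), ("her", "PRP")]
print(fix_vbd(sent))
# [('Jack', 'NNP'), ('has', 'VBZ'), ('helped', 'VBN'), ('her', 'PRP')]
```

A real correction pass must also handle intervening adverbs and passives; the point here is only the shape of the retagging step.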
[Figure 1: Penn Treebank → Preprocessing → Manual Annotation Phase (Seed MGbank) → Automatic extraction of source-target dependency mappings → Automatic Annotation Phase (Auto MGbank) → Finished? — yes: MGbank; no: return to the Manual Annotation Phase]
CCG, EDMG is a strongly lexicalized and derivational formalism, with all subcategorization and linear order information encoded on lexical items. EDMG categories are sequences of features ordered from left to right which are checked and deleted as the derivation proceeds. For example, the hypothetical category, d= =d v[6], could be used to represent a transitive verb that first looks to its right for an object with d as its first feature (i.e. a DP), before selecting a DP subject on its left and thus yielding a constituent of category v, i.e. a VP.

MGParse category features can also include subcategorization and agreement properties and requirements, and allow for percolation of such features up the tree via a simple unification mechanism[7]. In lieu of any statistical model, this enables the human annotator to tightly constrain the grammar and thus reduce the amount of local and global ambiguity present during manual and automatic annotation. For example, the category: v{+TRANS.x}= +case{+ACC.-NULL} =d lv{TRANS.x}, could be used for the null causative light verb (so-called little v) that is standardly assumed in Minimalism to govern a main transitive verb. It specifies that its VP complement must have the property TRANS and that the ob-

[6] This category (which is actually used for ditransitives by MGParse) is similar to the CCG category (S\NP)/NP.
[7] As in CCG, such unification is limited to atomic property values rather than the sorts of unbounded feature structures found in Head-Driven Phrase Structure Grammars; unification here is therefore not re-entrant.
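The checking and deletion of ordered features in a category like d= =d v can be illustrated with a toy merge operation. This is not MGParse: it ignores movement, percolation and property unification, and only shows how selector features are matched and consumed left to right:

```python
# Toy illustration (not MGParse itself) of checking and deleting ordered
# MG features: 'd=' selects a d-phrase on the right, '=d' on the left.

def merge(head, arg):
    """head/arg are (features, yield) pairs; returns the merged constituent
    or None if the head's first selector does not match the argument."""
    feats, phon = head
    afeats, aphon = arg
    sel, rest = feats[0], feats[1:]
    if sel == afeats[0] + "=":      # e.g. 'd=' : select to the right
        return (rest, phon + " " + aphon)
    if sel == "=" + afeats[0]:      # e.g. '=d' : select to the left
        return (rest, aphon + " " + phon)
    return None

verb = (["d=", "=d", "v"], "helped")   # the transitive category d= =d v
obj  = (["d"], "her")
subj = (["d"], "Jack")

vp = merge(merge(verb, obj), subj)     # object first, then subject
print(vp)
# (['v'], 'Jack helped her')
```

Each merge deletes the head’s first selector (and, implicitly, the argument’s matching feature), so after both arguments are taken only the category feature v remains, mirroring the derivation described in the text.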
ject whose case feature it will check (in this case via covert movement) must have the property ACC but must not have the property NULL, and that following selection of an AGENT DP specifier the resulting vP will have the property TRANS. Furthermore, any additional properties carried by the VP complement (such as PRES, 3SG etc.) will be percolated onto the vP (or rather onto its selectee (lv) feature) owing to the x variable.

Both overt and null (i.e. phonetically silent) MG categories can be added to the system and later modified using the menu system at the top of the screen. Any time the user attempts to modify a category, the system will first reparse any trees in the seed set containing that category to ensure that the same Xbar tree can still be generated for the sentence in question following the modification[8].

Once the user has selected an MG category for each word in the sentence, clicking parse causes MGParse to return all possible parses using these categories. Parsing without annotating some or all of the words in the sentence is also possible: MGParse will simply try all available MG categories already associated with a word’s PTB preterminal in the seed set. Once returned, the trees can be viewed in several formats, including multiple MG derivation tree formats, and Xbar tree and MG (bare phrase structure) derived tree formats. Candidate parses can be viewed side-by-side for comparison, and there is the option to perform a diff on the bracketings, and to eliminate incorrect candidates from consideration. Once the user has identified the correct tree, they can click add to seeds to save it; seeds can be viewed and removed at any point using the native file system.

There will inevitably be occasions when the parser fails to return any parses. In these cases it is useful to build up the derivation step-by-step to identify the point where it fails, and Autobank provides an interface for doing just this (see fig 3). Whereas in annotation mode null heads were kept hidden for simplicity, in derivation mode the entire null lexicon is available, along with the overt categories the user selected on the main annotation screen. Other features of the system include a test sentence mode, a corpus stats display, a facility for automatically detecting the best candidate MG parse (to test the performance of the automatic annotator, and speed up annotation), a parser settings menu, and an option for backing up all data.

The extreme succinctness of the (strongly) lexicalized (ED)MG formalism, as discussed in Stabler (2013), means that seed set creation can be a relatively rapid process. Like CCGs, MGs have abstract rule schemas which generalize across categories, thus dramatically reducing the size of the grammar. Taking CCGbank’s 1300 categories as an approximate upper bound, a researcher working five days a week on the annotation process and adding 20 MG categories per day to the system should have added enough MG categories to parse all the sentences of the PTB within 3 months.

4 The automatic annotation phase

Once the user has created an initial seed set, they can select auto generate corpus from the corpus menu. This prompts the system to extract a set of dependency mappings and lexical category mappings holding between each seed MG tree and its PTB source tree. To achieve this, the system traverses the PTB and MG trees and extracts a Collins-style dependency tuple for every non-head child of every non-terminal in each tree. The tuples include the head child and non-head child categories, the parent category, any relevant function tags, the directionality of the dependency, and the head child’s and non-head child’s head words and spans. Where there are multiple tuples in a tree with the same head and non-head word spans, these are grouped together into a chain.

For example, for the sentence the doctor examined Jack, the AGENT subject NP in a PTB-style tree would yield the following dependency: [VP, examined, <2,3>, NP, doctor, <1,2>, S, [ARG0, SUBJ], left]. Many Minimalists assume that AGENT subjects are base-generated inside the verb phrase (Koopman and Sportiche, 1991) in spec-vP, before moving to the surface subject position in spec-TP. Although in its surface position the subject is a dependent of the T(ense) head, it is the lexical verb which is the semantic head of the extended projection (i.e. of the CP clause containing it), and both syntactic and semantic heads are used here when generating the tuples[9]. Two tuples are therefore extracted for the dependency

[8] In general, subcategorization properties and requirements are the only features on an MG category that can be modified without first removing all seed parses containing that category, as they do not affect the Xbar tree’s geometry (though they can license or prevent its generation).
[9] This also allows the system to capture certain systematic changes in constituency and recognize, for instance, that temporal adjuncts tagged with a TMP label and attaching to VP in the PTB should attach to TP in the MG tree.
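The tuple and chain bookkeeping described above can be sketched as follows. The field layout mirrors the bracketed example in the text, but the function names and tuple encoding are illustrative, not Autobank’s actual data structures:

```python
from collections import defaultdict

# Sketch of the tuple/chain bookkeeping described above (field names are
# illustrative): tuples sharing head and non-head spans form one chain.

def make_tuple(head_cat, head, head_span, dep_cat, dep, dep_span,
               parent, tags, direction):
    return (head_cat, head, head_span, dep_cat, dep, dep_span,
            parent, tags, direction)

def group_into_chains(tuples):
    chains = defaultdict(list)
    for t in tuples:
        key = (t[2], t[5])          # (head span, non-head span)
        chains[key].append(t)
    return dict(chains)

# 'the doctor examined Jack': surface (TP) and deep (vP) subject positions
# of the same dependency form a single chain.
t_surface = make_tuple("T'", "examine", (2, 3), "DP", "doctor", (1, 2),
                       "TP", ("ARG0", "SUBJ"), "left")
t_deep = make_tuple("v'", "examine", (2, 3), "DP", "doctor", (1, 2),
                    "vP", ("ARG0", "SUBJ"), "left")
chains = group_into_chains([t_surface, t_deep])
print(len(chains), len(chains[((2, 3), (1, 2))]))   # 1 chain of 2 tuples
```

Grouping by the pair of word spans is what lets the two positions of a moved subject (spec-TP and spec-vP) be recognised as one and the same dependency.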
Figure 3: The step-by-step derivation builder.

between the verb and the subject, and together they form a chain representing the latter’s deep and surface positions. The system then establishes mappings from PTB tuples/chains to MG tuples/chains which share the same head and non-head spans (this includes instances where the relation between the head and dependent has been reversed).

Each mapping is stored in three forms of varying degrees of abstraction. In the first, all word-specific information (i.e. the head and non-head words and their spans) is deleted; in the second the spans and non-head word are deleted, but (a lemmatized version of) the head word is retained, while in the third the spans are deleted but both the (lemmatized) head and non-head words are retained. The fully reified mapping for our subject dependency, for instance, would be: [VP, examine, NP, doctor, S, [ARG0, SUBJ], left] —> [[T’, examine, DP, doctor, TP, left], [v’, examine, DP, doctor, vP, left]]. Including both abstract and reified mappings allows the system to recognise not only general phrase-structural correspondences, but also more idiosyncratic mappings conditioned by specific lexical items, as in the case of idioms and light verb constructions, for instance.

The system next parses the remaining sentences, selecting a set of potential MG categories for each word in each sentence using the lexical category mappings[10]. Whenever a dependency mapping is discovered which has previously been seen in the seeds, the MG tree containing it is awarded a point; the candidate with the most points is added to the Auto MGbank. The abstract mapping above, for instance, ensures that for transitive sentences, trees containing the subject trace in spec-vP are preferred. The user can choose to specify the number and maximum string length of the trees that are automatically generated (together with a timeout value for the parser) and can then inspect the results and transfer any good trees into the seed corpus, thereby rapidly expanding it.

5 Conclusion

Autobank is a GUI tool currently being used to semi-automatically construct MGbank, a deep Minimalist version of the PTB. Minimalism is a lively theory, however, and in continual flux. Autobank was therefore designed with reusability in mind, in the hope that other researchers will use it to create alternative Minimalist treebanks, either from scratch or by modifying an existing one, and to stimulate greater interest in computational Minimalism and statistical MG parsing. The system could also potentially be adapted for use with other source and (lexicalised) target formalisms.

[10] Note that there will be many MG categories for every PTB category, which will make parsing quite slow for certain sentences. To ameliorate this, once enough seeds have been added, an MG supertagger (see Lewis and Steedman (2014)) will be trained and used to reduce the amount of lexical ambiguity; to improve things further, multiple sentences will be processed in parallel during automatic annotation.
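The three storage forms and the point-based scoring can be sketched as follows. The representation details are illustrative, not MGbank’s actual data structures; the sketch only shows how matching any of the three abstraction levels of a seed mapping earns a candidate a point:

```python
# Sketch of the three storage forms described above and the point-based
# scoring of candidate trees (representation details are illustrative).

def abstractions(mapping):
    """mapping: (head_cat, head, dep_cat, dep, parent, tags, direction).
    Returns the fully abstract, head-lemma, and fully reified forms."""
    hc, head, dc, dep, parent, tags, d = mapping
    return [
        (hc, None, dc, None, parent, tags, d),   # word-specific info deleted
        (hc, head, dc, None, parent, tags, d),   # lemmatized head retained
        (hc, head, dc, dep, parent, tags, d),    # head and non-head retained
    ]

def score(candidate_mappings, seen):
    """One point per mapping form previously observed in the seed set."""
    return sum(1 for m in candidate_mappings
               for form in abstractions(m) if form in seen)

seed = ("VP", "examine", "NP", "doctor", "S", ("ARG0", "SUBJ"), "left")
seen = set(abstractions(seed))
cand = ("VP", "examine", "NP", "patient", "S", ("ARG0", "SUBJ"), "left")
print(score([cand], seen))   # matches the 2 more abstract forms
```

A candidate with a different non-head word still scores on the abstract and head-lemma forms, which is exactly how general phrase-structural correspondences outvote idiosyncratic mismatches while fully reified matches reward idiom-specific mappings.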
Acknowledgements

Many thanks to Ed Stabler for providing the initial inspiration for this project, and for his subsequent guidance and encouragement last summer. The work described in this paper was funded by Nuance Communications Inc. and the Engineering and Physical Sciences Research Council. I would also like to thank the reviewers for their comments, and also my supervisors Mark Steedman and Shay Cohen for their help and guidance during the development of the Autobank system and the writing of this paper.

References

John Chen, Srinivas Bangalore, and K. Vijay-Shanker. 2006. Automated extraction of tree-adjoining grammars from treebanks. Natural Language Engineering, 12(3):251–299.

Noam Chomsky. 1995. The Minimalist Program. MIT Press, Cambridge, Massachusetts.

Michael Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.

Jessica Ficler and Yoav Goldberg. 2016. Coordination annotation extension in the Penn Treebank. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, Berlin, Germany, Volume 1: Long Papers, pages 834–842.

Julia Hockenmaier and Mark Steedman. 2002. Acquiring compact lexicalized grammars from a cleaner treebank. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC), pages 1974–1981.

Tim Hunter and Chris Dyer. 2013. Distributions on minimalist grammar derivations. In Proceedings of the 13th Meeting on the Mathematics of Language (MoL 13), pages 1–11, Sofia, Bulgaria, August. The Association of Computational Linguistics.

Adam Meyers, Ruth Reeves, Catherine Macleod, Rachel Szekely, Veronika Zielinska, Brian Young, and Ralph Grishman. 2004. The NomBank project: An interim report. In Proceedings of the HLT-NAACL Workshop: Frontiers in Corpus Annotation.

Martha Palmer, Dan Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.

Edward Stabler. 1997. Derivational minimalism. In Christian Retoré, editor, Logical Aspects of Computational Linguistics (LACL’96), volume 1328 of Lecture Notes in Computer Science, pages 68–95, New York. Springer.

Edward Stabler. 2013. Two models of minimalist, incremental syntactic analysis. Topics in Cognitive Science, 5:611–633.

Mark Steedman. 1996. Surface Structure and Interpretation. Linguistic Inquiry Monograph 30. MIT Press, Cambridge, MA.

John Torr and Edward P. Stabler. 2016. Coordination in minimalist grammars: Excorporation and across the board (head) movement. In Proceedings of the Twelfth International Workshop on Tree Adjoining Grammar and Related Formalisms (TAG+12), pages 1–17.

David Vadas and James Curran. 2007. Adding noun phrase structure to the Penn Treebank. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 240–247, Prague, June. ACL.

Ralph Weischedel, Sameer Pradhan, Lance Ramshaw, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Nianwen Xue, Martha Palmer, Jena D. Hwang, Claire Bonial, Jinho Choi, Aous Mansouri, Maha Foster, Abdel aati Hawwary, Mitchell Marcus, Ann Taylor, Craig Greenberg, Eduard Hovy, Robert Belvin, and Ann Houston. 2012. OntoNotes Release 5.0 with OntoNotes DB Tool v0.999 beta.
Chatbot with a Discourse Structure-Driven Dialogue Management
Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 87–90
© 2017 Association for Computational Linguistics
Figure 1: Sample dialogue. User questions and responses are aligned on the left, and system responses on the right.
2 Personalized and Interactive Domain Exploration Scenarios

Most chat bots are designed to imitate human intellectual activity by maintaining a dialogue. The purpose of a chat bot with iterative exploration is to provide the most efficient and effective information access for users. A vast majority of chat bots nowadays, especially those based on deep learning, try to build a plausible sequence of words (Williams, 2012) to serve as an automated response to a user query. Such most-plausible responses, sequences of syntactically and semantically compatible words, are not necessarily the most informative. Instead, in this demo we focus on a chat bot that helps a user navigate to the exact, insightful answer as fast as possible.

For example, if a user formulates the query Can I use one credit card to pay for another, the chat bot attempts to recognize the user’s intent and the background knowledge about this user needed to establish a proper context. The chat bot leverages an observation learned from the web that an individual would usually want to pay with one credit card for another to avoid a late payment fee when cash is unavailable, as long as the user does not specify other circumstances.

To select a suitable answer from a search engine, a user first reads snippets one by one and then proceeds to the linked document to consult it in detail. Reading answer #n does not usually help to decide which answer should be consulted next, so the user proceeds to answer #n+1. Since answers are sorted by popularity, for a given user there is no better way than to just proceed from top to bottom on the search results page. On the contrary, the chat bot allows convergence in this answer navigation session, since answer #n+1 is suggested based on additional clarification submitted after answer #n is consulted by the user. The chat bot provides topics of answers for a user to choose from. These topics give the user a chance to assess how her request was understood, on one hand, and to restore the knowledge area associated with her question, on the other. In our examples, topics include balance transfer, using funds on a checking account, or canceling your credit card. A user is prompted to select a clarification option, drill into either of these options, or decline all options and request a new set of topics which the chat bot can identify.

3 Navigation with the Extended DT

On the web, information is usually represented in web pages and documents, with a certain section structure. Answering questions, forming topics of candidate answers and attempting to provide an answer based on a user-selected topic are operations which can be represented with the help of a structure that includes the DTs of the texts involved. When a certain portion of text is suggested to a user as an answer, this user might want to drill into something more specific, ascend to a more general level of knowledge, or make a side move to a topic at the same level. These user intents of navigating from one portion of text to another can be represented in most cases as coordinate or subordinate discourse relations between these portions.

Rhetorical Structure Theory (RST), proposed
by (Mann and Thompson, 1988), became popular as a framework for parsing the discourse relations and discourse structure of a text; it represents texts by labeled hierarchical structures such as DTs. A DT expresses the author's flow of thoughts at the level of a paragraph or multiple paragraphs. But DTs become fairly inaccurate when applied to larger text fragments or documents. To enable a chat bot with a capability to form and navigate a DT for a corpus of documents with sections and hyperlinks, we introduce the notion of an extended DT. To avoid inconsistency, we further refer to the original DTs as conventional. To construct conventional discourse trees we used one of the existing discourse parsers (Joty et al., 2013).

The intuition behind navigation is illustrated in Fig. 2. Three areas in this chart denote three documents, each containing three section headers and section content. For section content, discourse trees are built for the text. A navigation starts with the root node of the section that matches the user query most closely. Then the chat bot attempts to build a set of possible topics, which are the satellites of the root node of the DT. If the user accepts a given topic, the navigation continues along the chosen edge. Otherwise, when no topic covers the user interest, the chat bot backtracks the DT and proceeds to the other section (possibly of another document) which matched the original user query second best. Finally, the chat bot arrives at the DT node where the user is satisfied (shown by a hexagon) or gives up and starts a new exploratory session. Inter-document and inter-section edges for relations such as elaboration play a role in knowledge exploration navigation similar to that of the internal edges of a conventional discourse tree.

4 Relying on Q/A rhetoric agreement to select the most suitable answers

In the conventional search approach, as a baseline, Q/A match is measured in terms of keyword statistics such as TF*IDF. To improve search relevance, this score is augmented by item popularity, item location or a taxonomy-based score (Galitsky et al., 2013; Galitsky et al., 2014). The feature space includes Q/A pairs as elements, and a separation hyperplane splits this feature space. We combine DT(Q) with DT(A) into a single tree with the root Q/A pair (Fig. 3). We then classify such pairs into correct (with high agreement) and incorrect (with low agreement). Having multiple answer candidates, we select those with a higher classification score. The classification algorithm is based on tree kernel learning over conventional discourse trees. Their regular nodes are rhetoric relations, and terminal nodes are elementary discourse units (phrases, sentence fragments) which are the subjects of these relations.

Figure 3: Forming the Request-Response pair as an element of a training set

We selected the Yahoo! Answers set of question-answer pairs with a broad range of topics. Out of the set of 4.4 million user questions we selected 20,000 which included more than two sentences. Answers for most of the questions are fairly detailed, so no filtering was applied to answers. There are multiple answers per question and the best one is marked. We consider the pair Question - Best Answer as an element of the positive training set and Question - Other Answer as an element of the negative training set. To derive the negative training set, we either randomly selected an answer to a different but somewhat related question, or formed a query from the question and obtained an answer from web search results.

5 Evaluation

We compare the efficiency of information access using the proposed chat bot with that of a major web search engine such as Google, for queries where both systems have relevant answers. For a search engine, misses are search results preceding the one relevant for a given user. For a chat bot, misses are answers which cause a user to choose other options suggested by the chat bot, or to request other topics.

Topics of questions include personal finance. Twelve users (the authors' colleagues) asked the chat bot 15-20 questions each, reflecting their financial situations, and stopped when they were either satisfied with an answer or dissatisfied and gave up. The same questions were sent to Google, and evaluators had to click on each search results snippet to get the document or a web page and decide on
cused on imitation of a human intellectual activity.
The command-line demo1 and source code2 are available online under the Apache License. The source code is a sub-project of Apache OpenNLP (https://opennlp.apache.org/). Since the Bing search engine API is actively used for web mining to obtain candidate answers, a user would need to use her own Bing key for commercial use, available at https://datamarket.azure.com/dataset/bing/search.

Figure 4: Comparing conventional search engine with chat bot in terms of a number of iterations
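The topic-offer/descend/backtrack loop of Section 3 can be sketched in a few lines of Python. This is purely an illustrative sketch: the node class, the `accept` callback and the toy tree are our assumptions, whereas the authors' system derives real discourse trees from a parser (Joty et al., 2013).

```python
# Illustrative sketch of DT-guided answer navigation (Section 3).
# DTNode, accept() and the toy tree are assumptions, not the authors' code.

class DTNode:
    """A node of a (conventional or extended) discourse tree."""
    def __init__(self, text, satellites=None):
        self.text = text                    # EDU / section text
        self.satellites = satellites or []  # candidate topics to offer

def navigate(roots, accept):
    """Walk ranked section roots, descending while the user accepts topics.

    roots  -- section root nodes, best query match first
    accept -- callback that picks one offered topic, or None to decline
    """
    for root in roots:                      # next-best section = backtracking
        node = root
        while node.satellites:
            topic = accept(node.satellites)  # offer satellites as topics
            if topic is None:
                break                        # no topic fits: try next section
            node = topic                     # descend along the chosen edge
        else:
            return node.text                 # leaf reached: final answer
    return None                              # user gave up

# Toy run: a user who always picks the first offered topic.
tree = DTNode("credit cards",
              [DTNode("balance transfer",
                      [DTNode("pay card B's minimum due with card A")])])
print(navigate([tree], lambda topics: topics[0]))
```

The `while/else` form returns the reached node only when navigation ends by exhausting satellites, not when the user declines every topic.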
Marine Variable Linker: Exploring Relations between
Changing Variables in Marine Science Literature
Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 91–94
© 2017 Association for Computational Linguistics
Figure 1: Example of text annotation according to conceptual model
Figure 2: Example of event query composition

3 User interface

Although graph search queries can be written by hand, it takes time, effort and a considerable amount of expertise. In addition, large tables are difficult to read and navigate, lacking an easy way to browse the results, e.g., to look up the source sentences and articles for extracted events. Moreover, users need to have a local installation of all required software and data. The Marine Variable Linker (MVL) is intended to solve these problems. Its main function is to enable non-expert users (marine scientists) to easily search the graph database in an interactive way and to present search results in a browsable and visual way. It is a web application that runs on any modern platform with a browser (Linux, Mac OS, Windows, Android, iOS, etc.). It is a graphical user interface, which relies on familiar input components such as buttons and selection lists to compose queries, and uses interactive tables and graphs to present search results. Hyperlinks are used for navigation and to connect related information, e.g. the webpage of the source journal article.

Figure 2 shows an example of a search query for events consisting of two search rules: (1) the variable equals iron and (2) the event type is increase. In addition, the search is with specialisations, which means that it includes variables that can be generalised to iron, such as particulate iron or iron in clam mactra lilacea. More rules can be added to narrow down, or widen, the event search. Clicking the search button will bring up a new table of matching event types, showing the instance counts, predicates and variables (not shown here). Clicking on any row in this event types table will bring up the corresponding event instances table, which shows all the actual mentions of this event in journal articles. Each row in the instance table shows a sentence, year of publication and source.

Once events are defined, one can search for relations between these events, where queries can be composed in a similar fashion as for events. The first kind of relation is cooccurs, which means that two events co-occur in the same sentence. When two events are frequently found together in a single sentence, they tend to be associated in some way, possibly by correlation. The second kind of relation is causes, which means that two events in a sentence are causally related, where one event is the cause and the other the effect. Causality must be explicitly described in the sentence, for example, by words such as causes, therefore, leads to, etc.

Relation search results are presented in two ways. The relation types table contains all pairs of event types, specifying their relation, event predicates, event variables and counts. Figure 3 provides an example for an open-ended search query where the cause is an event with variable iron (including its specialisations, direction of change unspecified), whereas the effect is left open (i.e., can be any event). The corresponding relation graph is shown in Figure 4. The nodes are event types, with red triangles for increases, blue triangles for decreases and green diamonds for changes.

Figure 3: Example of search results for causal relations types (top) and a selected instance (bottom)

Figure 4: Example of relation graph

Clicking on a row in the table or a node in the graph brings up a corresponding instances table (cf. bottom of Figure 3), showing sentences, years and citations of articles containing the given relation. The events are marked in colour: red for increasing, blue for decreasing and green for changing. Hovering the mouse over the document icon will show a citation for the source article, whereas clicking it will open the article's web page containing the sentence in a new window.

A demo of an MVL instance indexing 75,221 marine-related abstracts from over 30 journals is currently freely accessible on the web.2 Source code for the text mining system and the graphical user interface is freely available.3

2 http://baleen.idi.ntnu.no/demos/megamouth-abs/
3 https://github.com/OC-NTNU

4 Discussion and Future work

Our system still makes many errors. Variables and events are sometimes incorrectly extracted (e.g. the variable more iron in Figure 3 ought to be just iron), often due to syntactic parsing errors, and many are missed altogether (e.g. the decline in the particulate organic carbon quotas in the same figure), because events can be expressed in so many different and complex ways. Yet we feel that even with a fair amount of noise, the current proof-of-concept already offers practical merit in battling the literature deluge. We will continue to work on improving recall and precision, as well as usability aspects of the interface. One aspect under active development is the integration of new and better algorithms for causal relation extraction based on machine learning from manually annotated data. The knowledge graph can be searched efficiently using Cypher queries, which opens up many other interesting opportunities for knowledge discovery. We hope our system will attract interest from marine and climate scientists, raising awareness of the potential of text mining, as progress will crucially depend on building a community similar to that in biomedical text mining.

Acknowledgments

Financial aid from the European Commission (OCEAN-CERTAIN, FP7-ENV-2013-6.1-1; no:603773) is gratefully acknowledged.
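The cooccurs/causes distinction described above can be illustrated with a minimal sketch. The real MVL pipeline works on syntactic parses stored in a graph database; the marker list and the `relate` function below are our own deliberately naive, string-level approximation.

```python
# Naive, string-level sketch of the two MVL relation types.
# The marker list and relate() are simplified illustrative assumptions;
# the actual system requires causality to be marked in the parsed sentence.

CAUSAL_MARKERS = ("causes", "caused", "leads to", "therefore", "results in")

def relate(event_a, event_b, sentence):
    """Classify the relation between two events within one sentence."""
    if event_a not in sentence or event_b not in sentence:
        return None                 # the events must co-occur in the sentence
    if any(marker in sentence for marker in CAUSAL_MARKERS):
        return "causes"             # causality is explicitly marked
    return "cooccurs"               # same sentence, no explicit marker

s1 = "Increased iron supply leads to a decline in particulate organic carbon."
s2 = "Iron concentration and primary production varied over the transect."
print(relate("iron supply", "particulate organic carbon", s1))  # causes
print(relate("Iron concentration", "primary production", s2))   # cooccurs
```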
Neoveille, a Web Platform for Neologism Tracking
Emmanuel Cartier
Université Paris 13 Sorbonne Paris Cité - LIPN - RCLN UMR 7030 CNRS
99 boulevard Jean-Baptiste Clément
93430 Villetaneuse, FRANCE
emmanuel.cartier@lipn.univ-paris13.fr
Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 95–98
© 2017 Association for Computational Linguistics
3.1.3 Neologism Life-cycle(s)

Neology is one of the aspects of linguistic change, together with necrology and stability. One important aspect is thus to model the lifecycle of a neologism, from its first occurrence to its potential disappearance or conventionalization. (Traugott and Trousdale, 2013) have proposed three salient states: innovation, propagation and conventionalisation, linking each to several more-or-less obvious properties. With these models in mind, NLP has developed several algorithms to track and study the lifecycle of neologisms.

3.2 Computational Models of Neology

Computational Linguistics began to work on linguistic change only recently, mainly because it needs to have at hand large diachronic electronic corpora. Neology is moreover still considered a secondary topic, as novel lexical units represent less than 5 percent of lexical units in corpora, according to several studies. But linguistic change is the complementary aspect of the synchronic structure. Every lexical unit is subject to time; form and meaning can change, due to diastratic events and situations. The advent of electronic (long- and short-term) diachronic corpora, scientific research and advances on word formation, and machine learning techniques able to manage big corpora have permitted the emergence of neology tracking systems. Apart from a better knowledge of language lifecycle(s), these tools would permit updating lexicographic resources, computational assets and parsers.

From the CL point of view, the main questions are: how can we automatically track neologisms, categorize them and follow their evolution, from their first appearance to their conventionalisation or disappearance? At best, can we induce neology-formation procedures from examples and therefore predict potential neologisms?

3.3 Existing Neology Tracking Systems

3.3.1 The Exclusion Dictionary Architecture (EDA)

The main achievement of neology tracking consists in systems extracting novel forms from monitor corpora, using lexicographic resources as a reference exclusion dictionary to induce unknown words, what we can call the "exclusion dictionary architecture" (EDA). The first system is due to (Renouf, 1993) for English: a monitor corpus and a reference dictionary from which unknown words can be derived. Further filters then apply to eliminate spelling errors and Proper Nouns. Subsequent developments all replicate this architecture: OBNEO (Cabré and De Yzaguirre, 1995), NeoCrawler (Kerremans et al., 2012), Logoscope (Gérard et al., 2014) and more recently Neoveille (Cartier, 2016).

Four main difficulties arise from these architectures. First, EDA cannot track semantic neologisms, as they use existing lexical units to convey innovative meanings. Second, the design of a reference exclusion dictionary is not that obvious, as it requires the existence of a machine-readable dictionary: this entails specific procedures to apply this architecture to less-resourced languages, and the availability of an up-to-date machine-readable dictionary for better-resourced languages. Third, the EDA architecture is not sufficient in itself: among unknown forms, most are Proper Nouns, spelling mistakes and other cases derived from corpus boilerplate removal, which entails a post-processing phase to sort out these cases. Fourth, these systems do not take into account the sociological and diatopic aspects of neologisms, as they limit their corpora to specific domains: an ideal system should be able to extend its monitoring to new corpora and maintain diastratic meta-data to characterize novel forms. To the best of our knowledge, Neoveille (Cartier, 2016) is the only system implementing this aspect.

3.3.2 Semantic Neology Approaches

As for semantic neology, three approaches have been proposed recently, none of them being exploited in an operational system. The first one stems from the idea that meaning change is linked to domain change: every text, and thus the constituent existing lexical units, is assigned one or more topics; if a lexical unit emerges in a new domain, a change in meaning should have occurred (Gérard et al., 2014). The main drawback of this approach is that it is limited to specific semantic change (it cannot tackle conventional metaphors if they appeared in the same domain, nor detect extension or restriction of meaning) and mainly limited to Nouns.

Another approach is linked to the distributional paradigm: "You shall know a word by the company it keeps" (Firth, 1957). The main idea is to retrieve from a large corpus all the collocates or collostructions, and classify them according to several metrics. The most salient resulting context words represent the "profile" (Blumenthal, 2009) or "sketch" (Kilgarriff et al., 2004) of a lexical unit for the given synchronic period. The most elaborate system is surely the Sketch Engine, which proposes for every lexical unit its "sketch", i.e. a list of occurrences for any user-defined syntactic schema (for example modifiers, nominal subject, object and indirect object for verbs), sorted by one or several association measures. This system can be improved in two main ways: first, it does not propose complete syntactic schemas for lexical units like verbs (it is limited to either a SUBJ-VERB or VERB-OBJ relation, but does not propose SUBJ-VERB-OBJ relations); second, it does not propose a clustering of occurrences, whereas distributional semantics could fill the gap and propose distributional classes at any place in the schema, instead of a flat list of occurrences.

A third approach consists in tracking semantic change by applying the second aspect of the distributional hypothesis: lexical units sharing most of their contexts are most likely to be semantically similar. This assumption has been applied to many computational semantic tasks. Applied to semantic change, if you have at your disposal a set of diachronic corpora, you can build the semantic vectors of any lexical unit for several periods, and track the changes from one period to another. First experiments have been proposed by (Hamilton et al., 2016). The main advantage of this approach resides in the fact that it proposes for a given word a list of semantically similar words, among which synonyms and hypernyms, which permits clearly explicating the meaning of a word. The main drawback of this approach is that it is unable to distinguish meanings for polysemous units. Another relative drawback lies in the fuzzy notion of similarity, which results in semantically only slightly similar words (analogy), or even opposite words (antonymy). But this approach is clearly of great help to humanly grasp the meaning of a word.

In the Neoveille Project, we are currently developing an approach combining the Sketch approach with semantic distributions on the main lexical unit and its arguments.

3.3.3 Tracking the Lifecycle of Neologisms

In our view, we postulate that neologisms are new form-meaning pairs (lexical units) and thus exist from their first occurrence. Tracking the lifecycle of neologisms requires fixing criteria to identify the main phases: emergence, dissemination, conventionalization (Traugott and Trousdale, 2013). In operational systems, the main tool to follow the life of a neologism is the timeline rendering the absolute or relative frequency of the lexical unit. In Neoveille, these figures are relative to specific diastratic and diatopic parameters, visually enabling one to distinguish emergence, spread and conventionalization. These analyses are available for each identified neologism by clicking on the stats icon (see website, last neologisms menu).

4 Neoveille Tracking System Architecture

The Neoveille architecture aims at enabling a complete synergy between NLP systems and expert linguists: expert linguists are not able to monitor the vast amount of textual data, whereas automatic processes can help tackle this amount; experts can accurately decide if a word is or is not a neologism. Our current point of view is that linguists must have the last word on what is and is not a neologism, and on the linguistic description; but as knowledge and descriptions grow with time, we will build supervised machine learning techniques able to predict potential neologisms.

The Neoveille web architecture has five main components:

1. A corpora manager: corpora are the main feed for NLP systems, and we propose to linguists a system enabling them to choose their corpora and to add several meta-data to them. The corpora, once defined by the user, are retrieved on a daily basis, indexed and searched for neologisms. Corpora management is available in the restricted area on the left menu;

2. An advanced search engine on the corpora: not only can corpora be monitored, but the system should also propose a search engine with advanced capabilities: advanced querying, filtering and faceting of results. The Neoveille search engine is available in the restricted area on the left menu; based on Apache Solr, it enables querying the corpora in a multifactorial manner, with facets and visual filters;

3. Advanced data analytics explicating the lifecycle of neologisms and their diachronic, diatopic and diastratic parameters: Neoveille provides such a data analytics framework by combining meta-data with text mining. These visual analyses are available for every neologism by clicking the stats icon;

4. A linguistic description component for neologisms: this module, whose microstructure has been set up for several years, could be used for knowledge of neology in a given language, and could also be used by a supervised machine learning system, as these features include a lot of formal properties. This component is accessible in the restricted area on the left menu;

5. Formal and semantic neologism tracking with state-of-the-art techniques. The formal and semantic neologism components are accessible in the restricted area on the left menu. They work for the seven languages of the project.

5 Conclusion and perspectives

This short presentation has evoked the design and the overall architecture of a software for linguistic analysis focusing on linguistic change in contemporary monitor corpora. It has several interesting properties:

• real-time tracking, analysis and visualization of linguistic change;

• complete synergy between Computational Linguistics processing and linguistic experts' intuitions and knowledge, especially the possibility of editing automatic results by experts and the exploitation of linguistic annotations by machine learning processes;

• modularity of software: corpora management, state-of-the-art search engine including analysis and visualization, neologism mining, neologism linguistic description, lifecycle tracking.

This project, currently focusing on seven languages, is on the path to extend to other languages.

References

P. Blumenthal. 2009. Éléments d'une théorie de la combinatoire des noms. Cahiers de lexicologie 94 (2009-1), 11-29.

Maria Teresa Cabré and Luis De Yzaguirre. 1995. Stratégie pour la détection semi-automatique des néologismes de presse. TTR : traduction, terminologie, rédaction, 8 (2), p. 89-100.

Emmanuel Cartier. 2016. Néoveille, système de repérage et de suivi des néologismes en sept langues. Neologica, 10, Revue internationale de néologie, p. 101-131.

John R. Firth. 1957. Papers in Linguistics 1934–1951. London, Oxford University Press.

Dirk Geeraerts. 2010. Theories of Lexical Semantics. Oxford University Press.

Paul Gevaudan and Peter Koch. 2010. Sémantique cognitive et changement sémantique. Grandes voies et chemins de traverse de la sémantique cognitive, Mémoire de la Société de linguistique de Paris, XVIII, pp. 103-145.

Christophe Gérard, Ingrid Falk, and Delphine Bernhard. 2014. Traitement automatisé de la néologie : pourquoi et comment intégrer l'analyse thématique? Actes du 4e Congrès mondial de linguistique française (CMLF 2014), Berlin, p. 2627-2646.

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. ACL 2016.

Daphné Kerremans, Susanne Stegmayr, and Hans-Jörg Schmid. 2012. The NeoCrawler: identifying and retrieving neologisms from the internet and monitoring on-going change. Kathryn Allan and Justyna A. Robinson, eds., Current Methods in Historical Semantics, Berlin etc.: de Gruyter Mouton, 59-96.

A. Kilgarriff, R. Pavel, Pavel S., and Tugwell D. 2004. The Sketch Engine. Proceedings of Euralex, pages 105–116, Lorient.

Antoinette Renouf. 1993. Sticking to the Text: a corpus linguist's view of language. ASLIB Proceedings, 45 (5), p. 131-136.

Jean-François Sablayrolles. 2016. Les néologismes. Collection Que sais-je? Presses Universitaires de France.

Hans-Jörg Schmid. 2015. The scope of word-formation research. Peter O. Müller, Ingeborg Ohnheiser, Susan Olsen and Franz Rainer, eds., Word-Formation. An International Handbook of the Languages of Europe. Vol. I.

Elizabeth Closs Traugott and Graeme Trousdale. 2013. Constructionalization and Constructional Changes.
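The exclusion-dictionary architecture (EDA) of Section 3.3.1 can be sketched as follows. The tiny dictionary, the capitalisation-based proper-noun heuristic and the spelling/boilerplate filter are deliberately naive illustrations of the pipeline, not Neoveille's actual resources.

```python
# Minimal EDA sketch: candidate neologisms are corpus forms absent from a
# reference dictionary, post-filtered for proper nouns and noisy tokens.
# Dictionary contents and filter rules here are toy assumptions.

import re

REFERENCE_DICTIONARY = {"the", "office", "workers", "are", "now",
                        "a", "team", "of"}

def candidate_neologisms(tokens):
    """Return unknown forms, excluding proper-noun-like and noisy tokens."""
    out = []
    for tok in tokens:
        if tok.lower() in REFERENCE_DICTIONARY:
            continue                          # known form: excluded
        if tok[0].isupper():
            continue                          # crude proper-noun filter
        if not re.fullmatch(r"[a-z\-]+", tok):
            continue                          # crude spelling/boilerplate filter
        out.append(tok)
    return out

tokens = "The office workers are now a team of coworking-addicts".split()
print(candidate_neologisms(tokens))           # ['coworking-addicts']
```

A real deployment would also record diastratic and diatopic meta-data for each candidate, as argued in Section 3.3.1.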
Building Web-Interfaces for Vector Semantic Models with the WebVectors
Toolkit
Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 99–103
© 2017 Association for Computational Linguistics
only to supply a trained model or models for one's particular language or research goal. The toolkit can be easily adapted for specific needs.

Figure 1: Computing associates for the word 'linguistics' based on the models trained on English Wikipedia and BNC.

2 Deployment

The toolkit serves as a web interface between distributional semantic models and users. Under the hood it uses the following software:
Figure 2: Visualizing the positions of several words in the semantic space

Figure 3: Analogical inference with several models
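Analogical inference of the kind shown in Figure 3 is typically computed as a nearest-neighbour search around an offset vector. The tiny hand-made three-dimensional vectors below are our own illustration; WebVectors queries real word2vec-style models instead.

```python
# Toy sketch of analogical inference: the answer to
# "man is to king as woman is to ?" is the nearest neighbour of
# vec(king) - vec(man) + vec(woman) by cosine similarity.
# The vectors are illustrative assumptions, not trained embeddings.

from math import sqrt

VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

def analogy(a, b, c):
    """Word closest to vec(b) - vec(a) + vec(c), excluding the inputs."""
    target = [x - y + z
              for x, y, z in zip(VECTORS[b], VECTORS[a], VECTORS[c])]
    return max((w for w in VECTORS if w not in {a, b, c}),
               key=lambda w: cosine(VECTORS[w], target))

print(analogy("man", "king", "woman"))  # queen
```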
rusvectores.org (Kutuzov and Kuzmenko, 2016). It features 4 Continuous Skipgram and Continuous Bag-of-Words models for Russian trained on different corpora: the Russian National Corpus (RNC)4, the RNC concatenated with the Russian Wikipedia dump from November 2016, a corpus of 9 million random Russian web pages collected in 2015, and a Russian news corpus (spanning the time period from September 2013 to November 2016). The corpora were linguistically pre-processed in the same way, lending the models the ability to better handle the rich morphology of Russian. RusVectores is already being employed in academic studies in computational linguistics and digital humanities (Kutuzov and Andreev, 2015; Kirillov and Krizhanovskij, 2016; Loukachevitch and Alekseev, 2016) (several other research projects are in progress as of now).

One can use the aforementioned services as live demos to evaluate the WebVectors toolkit before actually employing it in one's own workflow.

5 Conclusion

The main aim of WebVectors is to quickly deploy web services processing queries to word embedding models, independently of the nature of the underlying training corpora. It allows making complex linguistic resources available to a wide audience in almost no time. We continue to add new features aiming at a better understanding of embedding models, including sentence similarities, text classification and analysis of correlations between different models for different languages. We also plan to add models trained using other algorithms, like GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2016).

We believe that the presented open source toolkit and the live demos can popularize distributional semantics and computational linguistics among the general public. Services based on it can also promote interest among present and future students and help to make the field more compelling and attractive.

4 http://www.ruscorpora.ru/en/

References

Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 183–192, Sofia, Bulgaria, August. Association for Computational Linguistics.

Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 1.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Knut Hofland. 2000. A self-expanding corpus based on newspapers on the web. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC-2000).

B. L. Iomdin, A. A. Lopukhina, K. A. Lopukhin, and G. V. Nosyrev. 2016. Word sense frequency of similar polysemous words in different languages. In Computational Linguistics and Intellectual Technologies. Dialogue, pages 214–225.

A. N. Kirillov and A. A. Krizhanovskij. 2016. Model' geometricheskoj struktury sinseta [The model of geometrical structure of a synset]. Trudy Karel'skogo nauchnogo centra Rossijskoj akademii nauk, (8).

Andrey Kutuzov and Igor Andreev. 2015. Texts in, meaning out: neural language models in semantic similarity task for Russian. In Computational Linguistics and Intellectual Technologies. Dialogue, Moscow. RGGU.

Andrey Kutuzov and Elizaveta Kuzmenko. 2015. Comparing neural lexical models of a classic national corpus and a web corpus: The case for Russian. Lecture Notes in Computer Science, 9041:47–58.

Andrey Kutuzov and Elizaveta Kuzmenko. 2016. WebVectors: a toolkit for building web interfaces for vector semantic models (in press).

Natalia Loukachevitch and Aleksei Alekseev. 2016. Gathering information about word similarity from neighbor sentences. In International Conference on Text, Speech, and Dialogue, pages 134–141. Springer.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 142–150. Association for Computational Linguistics.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.
102
Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013a. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems 26, pages 3111–3119.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012). European Language Resources Association (ELRA).
InToEventS: An Interactive Toolkit for Discovering and Building Event
Schemas
Abstract
Event Schema Induction is the task of learning a representation of events (e.g., bombing) and the roles involved in them (e.g., victim and perpetrator). This paper presents InToEventS, an interactive tool for learning these schemas. InToEventS allows users to explore a corpus and discover which kinds of events are present. We show how users can create useful event schemas using two interactive clustering steps.
1 Introduction
An event schema is a structured representation of an event: it defines a set of atomic predicates or facts and a set of role slots that correspond to the typical entities that participate in the event. For example, a bombing event schema could consist of atomic predicates (e.g., detonate, blow up, plant, explode, defuse and destroy) and role slots for a perpetrator (the person who detonates, plants or blows up), an instrument (the object that is planted, detonated or defused) and a target (the object that is destroyed or blown up). Event schema induction is the task of inducing event schemas from a textual corpus. Once the event schemas are defined, slot filling is the task of extracting the instances of the events and their corresponding participants from a document.
In contrast with information extraction systems that are based on atomic relations, event schemas allow for a richer representation of the semantics of a particular domain. But, while there has been a significant amount of work in relation discovery, the task of unsupervised event schema induction has received less attention. Some unsupervised approaches have been proposed (Chambers and Jurafsky, 2011; Cheung et al., 2013; Chambers, 2013; Nguyen et al., 2015). However, they all end up assuming some form of supervision at document level, and the task of inducing event schemas in a scenario where there is no annotated data is still an open problem. This is probably because, without some form of supervision, we do not even have a clear way of evaluating the quality of the induced event schemas.
In this paper we take a different approach. We argue that there is a need for an interactive event schema induction system. The tool we present enables users to explore a corpus while discovering and defining the set of event schemas that best describes the domain. We believe that such a tool addresses a realistic scenario in which the user does not know in advance the event schemas they are interested in and needs to explore the corpus to better define their information needs.
The main contribution of our work is to present an interactive event schema induction system that can be used by non-experts to explore a corpus and easily build event schemas and their corresponding extractors.
The paper is organized as follows. Section 2 describes our event schema representation. Section 3 describes the interactive process for event schema discovery. Section 4 gives a short overview of related work. Finally, Section 5 concludes.
Figure 1: Event Schema Definition
Proceedings of the EACL 2017 Software Demonstrations, Valencia, Spain, April 3-7 2017, pages 104–107
© 2017 Association for Computational Linguistics
Figure 2: Event Clustering
a distance threshold that defines a semantic ball. For example, the injured predicate is represented by the literal injure and the syntactic relation object. This tuple is designed to represent the fact that a victim is a person or organization who has been injured, and whose corresponding word vector representation is inside a given semantic ball.
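The semantic-ball membership test described above can be illustrated with a small sketch (ours, not the InToEventS code; the vectors, center and radius are toy values, while the real system uses 300-dimensional word embeddings): a candidate filler belongs to a slot if its word vector lies within a fixed cosine distance of the slot's center.

```python
from math import sqrt

def cosine_distance(u, v):
    """1 minus the cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def in_semantic_ball(vector, center, radius):
    """A word fills the slot if its embedding lies inside the ball of the
    given cosine-distance radius around the slot's center vector."""
    return cosine_distance(vector, center) <= radius

# Toy 3-d "embeddings" standing in for real word vectors.
center_injure_obj = [0.9, 0.1, 0.0]   # center of the (injure, object) slot
victim = [0.85, 0.15, 0.05]           # close to the center
bomb = [0.0, 0.2, 0.95]               # far from the center

print(in_semantic_ball(victim, center_injure_obj, radius=0.1))  # → True
print(in_semantic_ball(bomb, center_injure_obj, radius=0.1))    # → False
```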
Figure 3: Slots Clustering
a terrorist planted a bomb; thankfully the police defused it before it blew up); and (2) literals with similar meaning usually describe the same atomic predicates (e.g., destroy and blast).
We first extract all unique verbs and all nouns with a corresponding synset in WordNet labeled as noun.event or noun.act; these are our predicate literals. Then, for each pair of literals, we compute their distance taking into account the probability that they will appear nearby in a document sampled from the corpus and the distance of the corresponding literals in a word embedding vector space.[1]
Once the distance is computed, we run agglomerative clustering. In the interactive step we let the user explore the resulting dendrogram and choose a distance threshold that will result in an initial partition of the predicate literals into event trigger sets (see Figure 2). In a second step, the user can merge or split the initial clusters. In a third step, the user selects and labels the clusters of interest; these become incomplete event schemas. Finally, using the word embedding representation, the user has the option of expanding each event trigger set by adding predicate literals that are close in the word vector space.
3.2 Second Step: Role Induction
In this step we complete the event schemas of the previous step with corresponding semantic roles or slots. This process is based on a simple idea: let us assume, for instance, a bombing event with triggers {attack, blow up, set off, injure, die, ...}. We can intuitively describe a victim of a bombing event as "someone who dies, is attacked or injured", that is: "PERSON: subject of die, object of attack, object of injure".
Recall that a slot is a set of predicates represented with a tuple composed of a literal predicate and a syntactic relation, e.g. kill-subject. Additionally, each slot has an entity type. In a first step, for each predicate in the event trigger set, we extract from the corpus all unique tuples of the form predicate-syntactic relation-entity type. The extraction of such tuples uses the universal dependency representation computed by the Stanford CoreNLP parser and named entity classifier.
In a second step, we compute a distance between each pair of tuples based on the average word embeddings of the arguments observed in the corpus for a given tuple. For example, to compute a vector for (die, subject, PERSON), we identify all entities of type PERSON in the corpus that are the subject of the verb die and average their word embeddings.
Finally, as we did for event induction, we run agglomerative clustering and offer the user an interactive graphic visualization of the resulting dendrogram (Figure 3). The user can explore different cluster settings and store those that represent the slots of interest.
Once the event schemas have been created, we can use them to annotate documents. Figure 4 shows an example of an annotated document.
[1] For experiments we used the pre-trained 300-dimensional GoogleNews model from word2vec.
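As a rough illustration of the clustering step (our sketch, not the InToEventS implementation, which clusters real word embeddings and renders an interactive dendrogram), a pure-Python average-linkage agglomerative clustering with a distance cutoff might look like this; the literals and 2-d vectors are made up:

```python
from math import dist  # Euclidean distance, Python 3.8+

def agglomerative(points, threshold):
    """Average-linkage agglomerative clustering: repeatedly merge the two
    closest clusters until the smallest inter-cluster distance exceeds the
    threshold; returns clusters as lists of point indices."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Average pairwise distance between two clusters.
        pairs = [(i, j) for i in a for j in b]
        return sum(dist(points[i], points[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        d, x, y = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters)) for j in range(i + 1, len(clusters))
        )
        if d > threshold:  # the user-chosen dendrogram cutoff
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Toy 2-d "embeddings" of predicate literals (real ones are 300-d).
literals = ["detonate", "explode", "blow_up", "arrest", "detain"]
vectors = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]

for cluster in agglomerative(vectors, threshold=1.0):
    print([literals[i] for i in cluster])
# → ['detonate', 'explode', 'blow_up']
#   ['arrest', 'detain']
```

A larger threshold yields fewer, coarser event trigger sets; a smaller one yields more, finer ones, which is exactly the knob the interactive step exposes.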
Figure 4: Slot Filling
References
Nathanael Chambers and Dan Jurafsky. 2011. Template-based information extraction without the templates. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics.
ICE: Idiom and Collocation Extractor for Research and Education
varied from an F1-score of about 39% to 70% for the collocation task. A comparison showed that their combination schemes outperformed existing techniques, such as MWEToolkit (Ramisch et al., 2010) and Text-NSP (Banerjee and Pedersen, 2003), with the best method achieving an F1-score of around 40% on the gold-standard dataset, which is within the range of human performance.
For idiom extraction, ICE uses the semantics-based methods introduced by Verma and Vuppuluri (2015). The salient feature of these methods is that they use Bing search for the definition of a given phrase and then check the compositionality of the phrase definition against combinations of the words obtained when a "define word" query is issued, where the word belongs to the phrase. If there is a difference in the meaning, that phrase is considered an idiom. In Verma and Vuppuluri (2015), the authors showed that their method outperforms AMALGr (Schneider et al., 2014). Their best method achieved an F1-score of about 95% on the VNC tokens dataset.
Figure 1: Collocation extractor diagram
Thus, ICE includes extraction methods for idioms and collocations that are state-of-the-art. Other tools exist for collocation extraction, e.g., see (Anagnostou and Weir, 2006), in which four methods including Text-NSP are compared.
in the code). These methods are connected sequentially. This means that if something is considered a collocation in one component, it will be added to the list of collocations and will not be given to the next component (yes/no arrows in the diagram). Table 1 shows the description of each component in the collocation extractor diagram.
The Ngram Extractor receives all sentences and generates n-grams ranging from bigrams up to 8-grams. It uses NLTK sentence and word tokenizers for generating tokens. Then, it combines the generated tokens, taking care of punctuation, to generate the n-grams.
Dictionary Check uses WordNet (Miller, 1995) as a dictionary and looks up each n-gram to see whether it exists in WordNet (a collocation should exist in the dictionary). All n-grams that are considered non-collocations are given to the next component as input.
The next component is Online Dictionary. It searches online dictionaries to see whether the n-gram exists in any of them. It uses the Bing Search API[3] to search for n-grams on the web.
Web Search and Substitution is the next component in the pipeline. This method uses the Bing Search API to obtain hit counts for a phrase query. Then each word in the n-gram is replaced by 5 random words (one at a time), and the hit counts are obtained. At the end, we have a list of hit counts. These values are used to differentiate between collocations and non-collocations.
The last component in the collocation extraction pipeline is Web Search and Independence. The idea of this method is to check whether the probability of a phrase exceeds the probability we would expect if the words were independent. It uses hit counts to estimate these probabilities, which are then used to differentiate between collocations and non-collocations.
When running the collocation extraction function, one of the third and fourth components should be selected.
The Idiom Extractor diagram is relatively simpler. Given the input n-gram, it creates n + 1 sets. The first contains the (stemmed) words in the meaning of the phrase. The next n sets contain the stemmed words in the meaning of each word in the n-gram. It then applies the set difference operator to the n pairs containing the first set and each of the n sets. The Or subsystem considers a phrase an idiom if at least one word survives one of the subtractions (the union of the difference sets should be non-empty). For the And subsystem, at least one word has to survive every subtraction (the intersection of the difference sets should be non-empty).
Performance. ICE outperforms both Text-NSP and MWEToolkit. On the gold-standard dataset, ICE's F1-score was 40.40%, MWEToolkit's F1-score was 18.31%, and Text-NSP's was 18%. We also compared our idiom extraction with the AMALGr method (Schneider et al., 2014) on their dataset; the highest F1-score achieved by ICE was 95%, compared to 67.42% for AMALGr. For a detailed comparison of ICE's collocation and idiom extraction algorithms with existing tools, please refer to (Verma et al., 2016) and (Verma and Vuppuluri, 2015).
Sample Code. Below is sample code for using ICE's collocation extraction as part of a bigger system. For idiom extraction, use the IdiomExtractor class instead of CollocationExtractor.
    >>> input = ["he and Chazz duel with all keys on the line."]
    >>> from ICE import CollocationExtractor
    >>> extractor = CollocationExtractor.with_collocation_pipeline("T1", bing_key="Temp", pos_check=False)
    >>> print(extractor.get_collocations_of_length(input, length=3))
    ["on the line"]
Educational Uses. ICE also has a web-based interface for demonstration and educational purposes. A user can type a sentence into an input field and get a list of the idioms or collocations in the sentence. A screenshot of the web-based interface is shown in Figure 3.[4]
3 Conclusion
ICE is a tool for extracting idioms and collocations, but it also has functions for part of speech
[3] http://datamarket.azure.com/dataset/bing/search
[4] The web interface is accessible through https://shahryarbaki.ddns.net/collocation/
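The Or/And subtraction logic of the Idiom Extractor can be sketched in a few lines (a toy illustration with hand-made meaning sets; ICE itself builds these sets from Bing definition lookups and stemming):

```python
def idiom_or(phrase_meaning, word_meanings):
    """Or subsystem: the phrase is an idiom if at least one word of its
    meaning survives at least one subtraction, i.e. the union of the
    difference sets (phrase_meaning minus each word's meaning) is non-empty."""
    union = set()
    for wm in word_meanings:
        union |= phrase_meaning - wm
    return len(union) > 0

def idiom_and(phrase_meaning, word_meanings):
    """And subsystem: at least one word must survive every subtraction,
    i.e. the intersection of the difference sets is non-empty."""
    diffs = [phrase_meaning - wm for wm in word_meanings]
    return len(set.intersection(*diffs)) > 0

# Idiomatic case, e.g. "kick the bucket": toy meaning sets where the
# phrase meaning shares nothing with the literal word meanings.
phrase = {"die"}
words = [{"strike", "foot"}, {"pail"}]
print(idiom_or(phrase, words), idiom_and(phrase, words))      # → True True

# Compositional case: toy sets where each word's definition already
# covers the phrase meaning, so nothing survives any subtraction.
phrase2 = {"vehicle"}
words2 = [{"vehicle", "color"}, {"vehicle", "wheels"}]
print(idiom_or(phrase2, words2), idiom_and(phrase2, words2))  # → False False
```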
Table 1: Components of Collocation Extraction Subsystem of ICE

Component                   | Description
----------------------------|--------------------------------------------------------
Ngram Extractor             | Generates bigrams up to 8-grams
Dictionary Check            | Looks up the n-gram in a dictionary (WordNet)
Online Dictionary           | Looks up the n-gram in online dictionaries (Bing)
Web Search and Substitution | Hit counts for the phrase query and 5 queries generated by randomly changing the words in the n-gram
Web Search and Independence | Checks whether the probability of the phrase exceeds the probability we would expect if the words were independent
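For illustration, the Ngram Extractor stage in Table 1 could be approximated as follows (a simplified stand-in with a naive regex tokenizer; ICE itself uses NLTK's sentence and word tokenizers and handles punctuation more carefully):

```python
import re

def extract_ngrams(sentence, n_min=2, n_max=8):
    """Generate all n-grams from bigrams up to 8-grams, mirroring the
    Ngram Extractor component (naive tokenizer instead of NLTK)."""
    tokens = re.findall(r"[A-Za-z']+", sentence.lower())
    ngrams = []
    for n in range(n_min, min(n_max, len(tokens)) + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

print(extract_ngrams("He kicked the bucket."))
# → ['he kicked', 'kicked the', 'the bucket', 'he kicked the',
#    'kicked the bucket', 'he kicked the bucket']
```

Each of these candidates would then flow through the Dictionary Check and web-search components of the pipeline.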
Figure 3: Screen-shot of the online version of ICE
tagging and n-gram extraction. All the components of ICE are connected as a pipeline, hence every part of the system can be changed without affecting the other parts. The tool is available online at https://github.com/shahryarabaki/ICE as a Python package and also as a website. The software is licensed under the Apache License, Version 2.0.
References
Nikolaos K. Anagnostou and George R. S. Weir. 2006. Review of software applications for deriving collocations. In ICT in the Analysis, Teaching and Learning of Languages, Preprints of the ICTATLL Workshop 2006, pages 91–100.
Satanjeev Banerjee and Ted Pedersen. 2003. The design, implementation, and use of the ngram statistics package. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 370–381. Springer.
Araly Barrera and Rakesh Verma. 2012. Combining syntax and semantics for automatic extractive single-document summarization. In CICLING, volume LNCS 7182, pages 366–377.
Araly Barrera, Rakesh Verma, and Ryan Vincent. 2011. Semquest: University of Houston's semantics-based question answering system. In Proceedings
George A. Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.
Carlos Ramisch, Aline Villavicencio, and Christian Boitet. 2010. Multiword expressions in the wild?: the mwetoolkit comes in handy. In Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations, pages 57–60. Association for Computational Linguistics.
Nathan Schneider, Emily Danchik, Chris Dyer, and Noah A. Smith. 2014. Discriminative lexical semantic segmentation with gaps: running the MWE gamut. Transactions of the Association for Computational Linguistics, 2:193–206.
Violeta Seretan. 2013. A multilingual integrated framework for processing lexical collocations. In Computational Linguistics - Applications, pages 87–108. Springer - Netherlands.
Rakesh Verma and Vasanthi Vuppuluri. 2015. A new approach for idiom identification using meanings and the web. Recent Advances in Natural Language Processing, pages 681–687.
Rakesh Verma, Vasanthi Vuppuluri, An Nguyen, Arjun Mukherjee, Ghita Mammar, Shahryar Baki, and Reed Armstrong. 2016. Mining the web for collocations: IR models of term associations. In International Conference on Intelligent Text Processing and Computational Linguistics.
Bib2vec: Embedding-based Search System for Bibliographic Information
liographic information of papers. Each paper consists of several categories. Categories are divided into two groups: a textual category Ψ (e.g., titles
Figure 1 (labels): Paper, Element, Non-textual Category 1
Table 1: Summary of our data set and model

Category  | Type        | #Elements Original | #Elements Processed | Min. Freq.
----------|-------------|--------------------|---------------------|-----------
text      | textual     | 59,276             | 10,994              | 20
author    | non-textual | 17,260             | 2,609               | 5
reference | non-textual | 10,871             | 10,871              | 1
year      | non-textual | 16                 | 16                  | 1
paper-id  | non-textual | 19,475             | 19,475              | 1

and brackets from abstracts and titles. We then lowercased the words, and made phrases using the word2phrase tool.[2]
We prepared 5 categories: author, paper-id, reference, year and text. author consists of the list of authors without distinguishing the order of the authors. paper-id is a unique identifier assigned to each paper, and this mimics the paragraph vector model (Le and Mikolov, 2014). reference includes the paper ids of reference papers in this data set. Although ids in paper-id and reference are shared, we did not assign the same vectors to the ids since they are different categories. year is the publication year of the paper. text includes words and phrases in both abstracts and titles, and it belongs to the textual category Ψ, while every other category is treated as a non-textual category Φi. We regard elements as unknown elements when they appear less often than the minimum frequencies in Table 1.
We split the data set into training and test. We prepared 17,475 papers for training and the remaining 2,000 papers for evaluation. For the test set, we regarded the elements that do not appear in the training set as unknown elements. We set the dimension d of vectors to 300 and show the results with the linear function.
4.2 Evaluation
We automatically built multiple-choice questions and evaluated the accuracy of our model. We also compared some results of our model with those of the author-topic model.
Our method models elements in several categories and allows us to estimate relationships among the elements with high flexibility, but this makes the evaluation complex. Since it is tough to evaluate all the possible combinations of inputs and targets, we focused on relationships between authors and other categories. We prepared an evaluation data set that requires estimating an author from other elements. We removed a (not unknown) author from each paper in the evaluation set to ask the system to predict the removed author considering all the other elements in the paper. Choosing the correct author from all the authors can be extremely difficult, so we prepared 10 selection candidates. In order to evaluate the effectiveness of our model, we compared the accuracy on this data set with that of logistic regression. As a result, when we use our model, we got 74.3% (1,486 / 2,000) in accuracy, which was comparable to 74.1% (1,482 / 2,000) by logistic regression.
Table 2 shows examples of the search results using our model. The leftmost column shows the authors we input to our model. In the rightmost two columns, we manually picked up words and authors belonging to a certain topic described in Sim et al. (2015) that can be considered to correspond to the input author. This table shows that our model can predict related words or similar authors favorably well, although the words are inconsistent with those by the author-topic model.
Figure 3 shows a screenshot of our system. The left-hand box shows words in the word cloud related to the query and the right-hand box shows the close authors. We can input a query by putting it in the textbox or clicking one of the authors in the right-hand box, and select a similarity measure by selecting a radio button.
4.3 Discussion
When we train the model, we did not use elements in category Ψ as context. This reduced the computational costs, but it might have disturbed the accuracy of the embeddings. Furthermore, we used the averaged vector for the textual category Ψ, so we do not consider the importance of each word. Our model might also ignore the inter-dependency among elements since we applied skip-grams. To resolve these problems, we plan to incorporate attention (Ling et al., 2015) so that the model can pay more attention to certain elements that are important to predict other elements.
We also found that some elements have several aspects. For example, words related to an author spread over several different tasks in NLP. We may be able to model this by embedding multiple vectors (Neelakantan et al., 2014).
5 Conclusions
This paper proposed a novel embedding method that represents several elements in bibliographic information with high representation ability and
[2] https://github.com/tmikolov/word2vec
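The author-prediction evaluation above can be mimicked with a toy scorer (our sketch, not the paper's system, which learns skip-gram vectors over categories; all vectors and names here are made up): average the vectors of the observed elements and rank candidate authors by dot product.

```python
def predict_author(context_vectors, candidates):
    """Average the context element vectors and return the candidate
    author whose vector has the highest dot product with the average
    (toy version of the multiple-choice evaluation)."""
    d = len(next(iter(candidates.values())))
    avg = [sum(v[i] for v in context_vectors) / len(context_vectors)
           for i in range(d)]

    def score(name):
        return sum(a * b for a, b in zip(avg, candidates[name]))

    return max(candidates, key=score)

# Made-up 3-d vectors for one paper's elements (title words, year, refs).
context = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.7, 0.0, 0.2]]
# The paper's setup offers 10 candidates; three suffice to illustrate.
authors = {
    "author_a": [0.9, 0.1, 0.1],   # close to the context average
    "author_b": [0.0, 0.9, 0.1],
    "author_c": [0.1, 0.1, 0.9],
}
print(predict_author(context, authors))  # → author_a
```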
Table 2 (Our Model vs. Author Topic-Model):

Input Author   | Relevant Words (Our Model)                       | Similar Authors (Our Model)              | Topic Words (ATM)                | Topic Authors (ATM)
---------------|--------------------------------------------------|------------------------------------------|----------------------------------|-------------------------------------
Philipp Koehn  | machine translation, hmeant, human translators   | Hieu Hoang, Alexandra Birch, Eva Hasler  | alignment, translation, align    | Chris Dyer, Qun Liu, Hermann Ney
Ryan McDonald  | dependency parsing, extrinsic, hearing           | Keith Hall, Slav Petrov, David Talbot    | parse, sentense, parser          | Michael Collins, Joakim Nivre, Jens Nilson

Figure 3: Screen shot of the system with the search results for the query "Ryan McDonald".
flexibility, and presented a system that can search for relationships among the elements in the bibliographic information. Experimental results in Table 2 show that our model can predict related words or similar authors favorably well. We plan to extend our model with other modifications such as incorporating attention and embedding multiple vectors per element. Since this model has high flexibility and scalability, it can be applied not only to papers but also to a variety of bibliographic information in broad fields.
References
Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In ICML, pages 1188–1196.
Wang Ling, Yulia Tsvetkov, Silvio Amir, Ramon Fermandez, Chris Dyer, Alan W Black, Isabel Trancoso, and Chu-Cheng Lin. 2015. Not all contexts are created equal: Better word representations with variable attention. In EMNLP, pages 1367–1372.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119.
The SUMMA Platform Prototype
Figure 1 (annotations):
- Data feed modules / Live stream monitoring.
- Journalists access the system through a web-based GUI.
- RethinkDB stores media content and meta-information.
- RabbitMQ handles communication between the database and the NLP modules that augment the monitored data.
- Docker provides containerization of the various components.
spectrum of news sources, but also allows journalists to perform deeper analysis by enhancing their ability to search through broadcast media across languages in a way that other monitoring platforms do not support.
1.2 Monitoring Internal News Production
Deutsche Welle is Germany's international public service broadcaster. It provides international news and background information from a German perspective in 30 languages worldwide, 8 of which are used within SUMMA. News production within Deutsche Welle is organized by language and regional departments that operate and create content fairly independently. Interdepartmental collaboration and awareness, however, is important to ensure a broad, international perspective. Multilingual internal monitoring of world-wide news production (including underlying background research) helps to increase awareness of the work between the different language news rooms, decrease latency in reporting, and reduce the cost of news production within the service by allowing adaptation of existing news stories for particular target audiences rather than creating them from scratch.
1.3 Data Journalism
The third use case is data journalism. Measurable data is extracted from the content available in and produced by the SUMMA platform, and graphics are created from such data. The data journalism dashboard will be able to provide, for instance, a graphical overview of trending topics over the past 24 hours or a heatmap of storylines. It can place geolocations of trending stories on a map. Customised dashboards can be used to follow particular storylines. For the internal monitoring use case, it will visualize statistics of content that was reused by other language departments.
2 System Architecture
Figure 1 shows an overview of the SUMMA Platform prototype. The Platform is implemented as an orchestra of independent components that run as individual containers on the Docker platform. This modular architecture gives the project partners a high level of independence in their development.
The system comprises the following individual processing components.
2.1 Data Feed Modules and Live Streams
These modules each monitor a specific news source for new content. Once new content is available, it is downloaded and fed into the database via a common REST API. Live streams are automatically segmented into logical segments.
2.2 Database Back-end
RethinkDB[3] serves as the database back-end. Once new content is added, RethinkDB issues processing requests to the individual NLP processing modules via RabbitMQ.
[3] www.rethinkdb.com
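The processing loop in Sections 2.1-2.2 (new content triggers requests that NLP modules consume and answer) can be illustrated with an in-memory stand-in; this sketch uses Python's queue module in place of RabbitMQ and a dict in place of RethinkDB, so all names and the token-count "NLP" step are illustrative only:

```python
import queue

# Stand-ins: a dict for the RethinkDB document store,
# a queue.Queue for the RabbitMQ processing-request channel.
database = {}
requests = queue.Queue()

def ingest(doc_id, text):
    """Data feed module: store new content, then emit a processing request."""
    database[doc_id] = {"text": text}
    requests.put(doc_id)

def nlp_worker():
    """NLP module: consume pending requests and augment the stored
    documents (here with a trivial token count instead of real NLP)."""
    while not requests.empty():
        doc_id = requests.get()
        doc = database[doc_id]
        doc["n_tokens"] = len(doc["text"].split())

ingest("d1", "breaking news from Valencia")
nlp_worker()
print(database["d1"]["n_tokens"])  # → 4
```

The real platform gains fault isolation from running each such module in its own Docker container; the queue decouples ingestion rate from NLP processing rate.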
Figure 2 (annotations):
- Storyline Index: quick access to current storylines.
- Storyline Summary: multi-item/document summary of news items in the storyline.
The lingua franca within SUMMA is English. Machine translation based on neural networks is used to translate content into English automatically. The back-end MT systems are trained with the Nematus toolkit (Sennrich et al., 2017); translation is performed with AmuNMT (Junczys-Dowmunt et al., 2016).
2.5 Entity Tagging and Linking
Depending on the source language, Entity Tagging and Linking is performed either natively or on the English translation. Entities are detected with TurboEntityRecognizer, a named entity recognizer within TurboParser[4] (Martins et al., 2009). Then, we link the detected mentions to the knowledge base with a system based on our submission to TAC-KBP 2016 (Paikens et al., 2016).
2.7 Deep Semantic Tagging
The system also has a component that performs semantic parsing into Abstract Meaning Representations (Banarescu et al., 2013), with the aim of eventually incorporating them into storyline generation. The parser was developed by Damonte et al. (2017). It is an incremental left-to-right parser that builds an AMR graph structure using a neural network controller. It also includes adaptations to German, Spanish, Italian and Chinese.
2.8 Knowledge Base Construction
This component provides a knowledge base of factual relations between entities, built with a model based on Universal Schemas (Riedel et al., 2013), a low-rank matrix factorization approach. The entity relations are extracted jointly across multiple languages, with entity pairs as rows and a set of structured relations and textual patterns as columns. The relations provide information about how various entities present in news
[4] https://github.com/andre-martins/TurboParser
documents are connected.
2.9 Storyline Construction and Summarization
Storylines are constructed via online clustering, i.e., by assigning storyline identifiers to incoming documents in a streaming fashion, following the work in Aggarwal and Yu (2006). The resulting storylines are subsequently summarized via an extractive system based on Almeida and Martins (2013).
3 User Interface
Figure 2 shows the current web-based SUMMA Platform user interface in the storyline view. A storyline is a collection of news items concerning a particular "story" and how it develops over time. Details of the layout are explained in the figure annotations.
4 Future Work
The current version of the Platform is a prototype designed to demonstrate the orchestration and interaction of the individual processing components. The look and feel of the page may change significantly over the course of the project, in response to the needs, requirements and feedback of the use case partners, the BBC and Deutsche Welle.
5 Availability
The public release of the SUMMA Platform as open source software is planned for April 2017.
6 Acknowledgments
This work was conducted within the scope of the Research and Innovation Action SUMMA, which has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 688139.
References
Charu C. Aggarwal and Philip S. Yu. 2006. A framework for clustering massive text and categorical data streams. In SIAM Int'l. Conf. on Data Mining, pages 479–483. SIAM.
Miguel B. Almeida and Andre F. T. Martins. 2013. Fast and robust compressive summarization with dual decomposition and multi-task learning. In ACL, pages 196–206.
Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. Linguistic Annotation Workshop.
Marco Damonte, Shay B. Cohen, and Giorgio Satta. 2017. An incremental parser for abstract meaning representation. In EACL.
Marcin Junczys-Dowmunt, Tomasz Dwojak, and Hieu Hoang. 2016. Is neural machine translation ready for deployment? A case study on 30 translation directions. CoRR, abs/1610.01108.
Ondřej Klejch, Ondřej Plátek, Lukáš Žilka, and Filip Jurčíček. 2015. CloudASR: platform and service. In Int'l. Conf. on Text, Speech, and Dialogue, pages 334–341. Springer.
André F. T. Martins, Noah A. Smith, and Eric P. Xing. 2009. Concise integer linear programming formulations for dependency parsing. In ACL, pages 342–350.
Peteris Paikens, Guntis Barzdins, Afonso Mendes, Daniel Ferreira, Samuel Broscheit, Mariana S. C. Almeida, Sebastião Miranda, David Nogueira, Pedro Balage, and André F. T. Martins. 2016. SUMMA at TAC knowledge base population task 2016. In TAC, Gaithersburg, Maryland, USA.
Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukáš Burget, Ondřej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlíček, Yanmin Qian, Petr Schwarz, Jan Silovský, Georg Stemmer, and Karel Veselý. 2011. The Kaldi speech recognition toolkit. In ASRU.
Sebastian Riedel, Limin Yao, Benjamin M. Marlin, and Andrew McCallum. 2013. Relation extraction with matrix factorization and universal schemas. In HLT-NAACL, June.
Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. 2017. Nematus: a toolkit for neural machine translation. In EACL Demonstration Session, Valencia, Spain.
Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J. Smola, and Eduard H. Hovy. 2016. Hierarchical attention networks for document classification. In NAACL, San Diego, CA, USA.
Author Index

Miranda, Sebastião, 116
Mitchell, Jeff, 116
Miwa, Makoto, 112
Mokry, Jozef, 65
Montané, M. Amor, 57
Moreno-Ortiz, Antonio, 73
Moretti, Giovanni, 77
Mori, Koki, 112
Morillo Alcívar, Paulina Adriana, 41
Mubarak, Hamdy, 61
Nadejde, Maria, 65
Narayan, Shashi, 116
Nguyen, An, 108
Nogueira, David, 116
Nozza, Debora, 25
Obamuyide, Abiola, 116
Teichmann, Christoph, 29
Thomas, Philippe, 5
Thorne, James, 37
Tonelli, Sara, 77
Tong, Sibo, 116
Torr, John, 81
Uslu, Tolga, 17
Uszkoreit, Hans, 5
V. Ardelan, Murat, 91
Vallejo Huanga, Diego Fernando, 41
van der Kreeft, Peggy, 116
van Genabith, Josef, 5
Verma, Rakesh, 108
Vlachos, Andreas, 37, 116
Vogel, Stephan, 61, 116
Vuppuluri, Vasanthi, 108
Wachsmuth, Henning, 13
Wang, He, 5
Wang, Yang, 116
Weber, Susanne, 116