
IDENTIFICATION OF DISCOURSE BOUNDARIES USING ANAPHORICALLY ANNOTATED TEXT

Muhammad Aatif
Department of Computer Science
University of Peshawar
MS Computer Science (2017-18)
Prof. Dr. Mohammad Abid Khan

Free SlideSalad PowerPoint Template Copyright (C) SlideSalad.com All rights reserved.
Presentation Outline

Introduction
Literature Review
The Problem We Solve
Research Objectives
Methodology
Phrase Detectives Corpus 2.1.4
Results & Discussion
Summary
Publication

Introduction
Discourse / Discourse Unit
A piece of text consisting of at least one sentence, where the sentences (if more than one) are linked to each other.

[Ali is son of Jamila. He is ten years old. Kashif is his elder brother.] [Kashif is a student of MS Computer Science. He studies in University of Peshawar.] (Inconsistent Annotation)

Labels marked in the example: an antecedent and an anaphoric device.

Identification of Discourse Boundaries is Important

It helps in: text understanding, simplification, translation, summarization, paraphrasing, Question-Answering systems, etc.

Therefore, an algorithm for the identification of Discourse Boundaries (DBs) is developed. It needs preprocessed text.
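
The slides do not spell out the algorithm itself, so the following is only a minimal sketch of one way anaphoric links could drive boundary placement. It assumes the preprocessed input is a list of sentences plus (anaphor sentence, antecedent sentence) index pairs. The function name `discourse_units`, the link format, and the choice to link only pronouns in the sample are all assumptions made for illustration.

```python
from typing import List, Tuple

def discourse_units(sentences: List[str],
                    links: List[Tuple[int, int]]) -> List[List[str]]:
    """Group sentences into discourse units (hypothetical sketch).

    links holds (anaphor_sentence, antecedent_sentence) index pairs.
    A boundary is placed after sentence i when no link connects a
    sentence at or before i with a sentence after i.
    """
    n = len(sentences)
    # reach[i] = furthest sentence index tied to sentence i by a link
    reach = list(range(n))
    for a, b in links:
        lo, hi = min(a, b), max(a, b)
        reach[lo] = max(reach[lo], hi)

    units, current, end = [], [], 0
    for i, sentence in enumerate(sentences):
        current.append(sentence)
        end = max(end, reach[i])
        if i == end:  # no open link crosses this point: close the unit
            units.append(current)
            current = []
    return units

sentences = [
    "Ali is son of Jamila.",
    "He is ten years old.",
    "Kashif is his elder brother.",
    "Kashif is a student of MS Computer Science.",
    "He studies in University of Peshawar.",
]
# Assumed pronoun links only: He -> Ali, his -> Ali, He -> Kashif
links = [(1, 0), (2, 0), (4, 3)]
units = discourse_units(sentences, links)
```

With these assumed links, the sketch yields two units of three and two sentences, which matches the bracketing in the example text.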

Literature Review
The literature focuses mainly on anaphora resolution, text understanding, text simplification, discourse markers, improving the performance of existing models, text summarization, and performance comparison of NLP systems, among others.

However, most of this work operates at the word level or sentence level, which is a poor way of processing natural text because natural text is highly coherent.

The relevant NLP systems need a unit of processing that is semantically and referentially complete and describes an idea or sub-idea entirely. This is the motivation.
Given such a unit, a machine can process the input text unit by unit.

The Problem We Solve
The problem we solve is the identification of DBs in English text.
Once it is solved, the discourse unit (DU) can be used as the unit of processing in the relevant NLP systems.

Impact: Great impact on many NLP systems such as text understanding, simplification, translation, summarization, and Question-Answering systems.

Example of a Question Answering System:
Text Database -> Selected Doc -> Selected Page(s) -> Selected DU


Example of a Text Summarization System:
Input text doc -> Division at Discourse Level (DU1, DU2, DU3, ..., DUn) -> Fetch ideas to summarize -> Summarization Level
More abstract summaries could be created by combining the DUs at levels.

Therefore, once the problem under focus is solved, the relevant NLP systems should improve in accuracy and processing efficiency and produce more effective and useful results.

Research Objectives

Objective #1
To create an anaphorically annotated corpus of English text.

Objective #2
To design an innovative algorithm based on the knowledge of
the anaphorically annotated text.

Objective #3
To implement the algorithm for the identification of DBs.

Methodology

Main Goal: Identification of DBs.

2 possible approaches: 1) Use or 2) Don't use a corpus.

Used a corpus-based approach. Why? 1) Standard way in NLP 2) Not used before

Python 3.6 & Spyder IDE

Performed experiments on 24 docs from the corpus.

Applicable to Dialogue

Phrase Detectives Corpus 2.1.4

Creating an anaphorically annotated corpus was a difficult stage. Multiple options were considered; however, we were lucky to find Phrase Detectives Corpus 2.1.4 (PD2).

PD2 is an anaphorically annotated corpus. 1st version 2016, 2nd version 2019. Its two subsets are Silver & Gold.

542 docs, 408153 tokens & 49990 markables (any linguistic expression of interest).

On average, 12.6 annotations per markable. The Gold subset has fewer errors than the Silver subset because ...

PD2 has been in the making for more than 11 years, and the work is still going on.

Results & Discussion

Accuracy rate before enhancement: 88.72%
Accuracy rate after enhancement: 97.66%

Error categories:

1. Missing Annotation
Markables not annotated at all. Hence, the algorithm is completely unaware of their existence.

2. Incorrect Annotation
Markables are annotated but incorrectly.

3. Inconsistent Annotation
Annotations which violate the annotation rules of the corpus.

4. Algorithm Errors
Errors produced by the developed algorithm.
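
To put the two reported accuracy rates (88.72% and 97.66%) side by side, the short calculation below derives the gain in percentage points and the relative reduction in error rate. Only the two accuracy figures come from the slides. The derived quantities and variable names are illustrative, and the sketch assumes both rates were measured on the same set of documents.

```python
# Reported accuracy rates (from the slides), in percent
before, after = 88.72, 97.66

# Absolute gain in percentage points
gain = after - before

# Error rates implied by the accuracy figures
error_before = 100 - before
error_after = 100 - after

# Share of the original errors removed by the enhancement
relative_error_reduction = (error_before - error_after) / error_before

print(f"gain: {gain:.2f} points")
print(f"error rate: {error_before:.2f}% -> {error_after:.2f}%")
print(f"relative error reduction: {relative_error_reduction:.1%}")
```

Under that assumption, the enhancement adds 8.94 percentage points and removes roughly 79% of the original errors.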

Summary

Contributed an algorithm for the identification of DBs using anaphorically annotated text.

Achieved a good accuracy rate. The work is unique, useful & has great impact on a group of other NLP systems.

Input text needs to be: anaphorically annotated & in XML form.

Next steps: making the system wholly automatic & incorporating the algorithm in other NLP systems to check its usefulness.
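
The slides state only that the input must be anaphorically annotated and in XML form; the actual corpus schema is not shown. The sketch below therefore uses a made-up schema to illustrate how sentence-level anaphoric links might be read from such input. The element and attribute names (`s`, `m`, `id`, `ante`) and the function `sentence_links` are hypothetical and are not the PD2 format.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation format: <s> = sentence, <m> = markable,
# "ante" = id of the markable this one refers back to.
SAMPLE = """
<doc>
  <s><m id="m1">Ali</m> is son of Jamila.</s>
  <s><m id="m2" ante="m1">He</m> is ten years old.</s>
  <s><m id="m3">Kashif</m> is <m id="m4" ante="m1">his</m> elder brother.</s>
</doc>
"""

def sentence_links(xml_text):
    """Return (anaphor sentence index, antecedent sentence index) pairs."""
    root = ET.fromstring(xml_text.strip())
    home = {}      # markable id -> index of the sentence containing it
    pending = []   # (sentence index, antecedent id) for each anaphor
    for idx, sent in enumerate(root.iter("s")):
        for markable in sent.iter("m"):
            home[markable.get("id")] = idx
            if markable.get("ante") is not None:
                pending.append((idx, markable.get("ante")))
    # Keep only links whose antecedent was actually found in the document
    return [(idx, home[ante]) for idx, ante in pending if ante in home]

links = sentence_links(SAMPLE)
```

A real corpus reader would need to follow the corpus's own documented schema; this only shows the general shape of the preprocessing step.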

Publication

Identification of Discourse Boundaries Using Anaphorically Annotated Text

Submitted for publication on 26.01.2020 to the Journal of Information Communication Technologies & Robotic Applications (JICTRA), an 'X' category journal recognized by HEC.

The critical review phase has started.

THANK YOU !
QUESTIONS
