
Fundamental Text Analysis Pipeline and Recognizing Textual Entailment

Sunanda Bansal
sunanda.basnal@mail.concordia.ca

Abstract

In this paper I present the details of my work for Project #1 and the experiments performed for Project #2 in partial fulfillment of the Natural Language Analysis course. I developed a fundamental pipeline for textual analysis, on the basis of which I analyzed the possible options for recognizing named entities in the Movie Reviews Corpus provided in the Natural Language Toolkit (NLTK). Using the Recognizing Textual Entailment (RTE) Challenge corpora provided for the RTE tasks in NLTK, I trained classifiers using some additional features that were based on the textual pipeline from the first project. I tested all possible combinations of features and present here the best performing models, which achieved an 8.64% increase in performance on average on the RTE test sets.

Introduction

Natural Language Processing deals with the understanding and generation of natural language in both formats - text and speech. Today, in 2018, there is a large amount of textual data available in digital format which can be a great source for extracting information. Attempting to understand text is one of the major areas of study in natural language processing. The fundamental steps of analyzing text are, generally speaking, tokenization, Part of Speech tagging, parsing and Named Entity Recognition. I analyze the fundamental pipeline, and the challenges and possible solutions encountered while dealing with datasets like the Movie Reviews Corpus, which is casefolded. The objective is also to enhance the performance of each module of this fundamental pipeline. This pipeline is further used for analyzing the text and hypothesis and improving the performance above a baseline for the task of Recognizing Textual Entailment.

Recognizing Textual Entailment is the task of determining whether, given a text and a hypothesis, a human would generally consider the hypothesis an entailment of the text or not. Using the Recognizing Textual Entailment (RTE) Challenge corpora provided for the RTE tasks in NLTK, I try to engineer some features and then compare and analyze the performance of models and classifiers. I used the fundamental textual pipeline developed in the initial phases of Project #1 to facilitate feature engineering in RTE. I compared the performance of a Naive Bayes classifier and a Support Vector Machine over a feature space of 12 features, 4 of which form the baseline. The feature space is discussed in detail in the Approach subsection of the Recognizing Textual Entailment section of this paper.

1 Fundamental Textual Pipeline

The fundamental steps of processing a text so that it can be analyzed to extract useful information require that it be broken down into small units, while still preserving the sequence and grouping of these units within a sentence. These units then need at least very basic grammatical information, namely which Part of Speech role each unit might be playing in the sentence. Moving on to the step of looking at the syntactic structure to identify constituents and dependencies, we get a starting point to analyze and find patterns to extract the information we need.

This pipeline was to be tested on the Movie Reviews Corpus of NLTK during development. The Movie Reviews Corpus is entirely casefolded, and each token in this corpus is separated by at least a space, and sometimes a newline character. These two features of the corpus proved challenging for the task of improving the pipeline's performance.
1.1 Design

The fundamental textual pipeline that I created uses the following sequence of NLTK modules -

Sentence Splitting - Using the sentence tokenizer, nltk.tokenize.sent_tokenize, I split the whole text into sentences.

Tokenization - Using the word tokenizer, nltk.tokenize.word_tokenize, I split each sentence into tokens.

Part of Speech Tagging - nltk.pos_tag was used to tag each token of each sentence with a part of speech. I also used the Stanford POS tagger to compare the tagger outputs.

Parsing - I used the Stanford parser to parse each sentence for both constituencies and dependencies.

Named Entity Recognition - I used a simple grammar and captured consecutive tokens tagged as proper nouns as named entities.

NE → (<NNP>|<NNPS>)+
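As an illustration, the following is a minimal sketch of this sequence of modules, assuming NLTK and its standard tokenizer and tagger models are available; the helper name analyze_text is mine, and the Stanford parsing step is omitted for brevity.

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# Simple grammar from Section 1.1: one or more consecutive proper-noun tags.
NE_GRAMMAR = "NE: {(<NNP>|<NNPS>)+}"

def analyze_text(text):
    """Run sentence splitting, tokenization, POS tagging and NE chunking."""
    chunker = nltk.RegexpParser(NE_GRAMMAR)
    results = []
    for sentence in sent_tokenize(text):      # sentence splitting
        tokens = word_tokenize(sentence)      # tokenization
        tagged = nltk.pos_tag(tokens)         # part-of-speech tagging
        tree = chunker.parse(tagged)          # NE chunking with the simple grammar
        results.append((tokens, tagged, tree))
    return results

In the actual pipeline, each sentence would additionally be passed to the Stanford parser to obtain the constituency and dependency parses.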

1.2 Approach

I focused my work on improving the performance of the pipeline at the initial levels, since errors in the initial modules propagated all the way to the later modules and gave bizarre results. There was a dire need for post-processing after sentence splitting, since the sentence splitting tool provided by NLTK is very basic.

Periods: Sentence splitting was poor across acronyms, as the splitter would consider a period inside an acronym to be a sign of sentence termination. This is incorrect because the sentence does not actually end there, and it therefore affects the structure provided by the parser for each half of the split sentence. It is possible that the space between every token, even between the letters of an acronym and its periods, was the cause of the sentence splitter tripping up. I used the following post-processing after the sentence splitting step (a sketch appears at the end of this subsection) -

Step 1 - Check whether a single character [a-z] followed by a period occurs at the end of the sentence. This most likely means that the period was not used as a sentence terminator.

Step 2 - While this pattern continues, concatenate the strings, and append the final string, where the pattern stops continuing, as the trailing part of the sentence.

Parenthesis: The corpora exhibit inconsistent use of parentheses throughout the files, with several issues concerning sentence splitting: splitting within nested parentheses, parentheses spanning multiple lines and sentence terminator symbols ( . ? ! ), dangling parentheses - where the author started a sentence inside a parenthesis but forgot to close it - and the use of parentheses to mark bullet points like a), 3), c.), (i). I created the following fix in the post-processing of the sentence splitting step -

Step 1 - Identify bullet points of the pattern a), b) and split the sentence right before [0-9a-b] \). After this step, no parenthesis should be mistaken for a dangling one.

Step 2 - Identify dangling parentheses - Count the number of opening and closing parentheses. If the counts are not equal, then there is an extra opening parenthesis. Append the following sentences (up to 5) until a closing parenthesis is found. If no closing parenthesis is found, then we have a dangling parenthesis, and I split the sentence at that parenthesis.
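The following is a minimal sketch of the period-related post-processing described above, assuming the input is the list of sentences produced by sent_tokenize; the function name merge_acronym_splits and the exact regular expression are mine and may differ from the project's implementation. The dangling-parenthesis fix of the second step could be implemented with a similar loop over parenthesis counts.

import re

# A sentence ending in a lone lowercase letter followed by a period (e.g. the
# space-separated "t . v ." style of the casefolded Movie Reviews Corpus) was
# most likely split on an acronym period rather than a real sentence boundary.
ACRONYM_END = re.compile(r"(?:^|\s)[a-z]\s*\.$")

def merge_acronym_splits(sentences):
    """Merge sentences that were split on a period belonging to an acronym."""
    merged, buffer = [], ""
    for sentence in sentences:
        buffer = (buffer + " " + sentence).strip() if buffer else sentence
        # While the pattern continues, keep concatenating; otherwise flush.
        if not ACRONYM_END.search(buffer):
            merged.append(buffer)
            buffer = ""
    if buffer:
        merged.append(buffer)
    return merged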
1.3 Analysis

Many errors were resolved with my fixes, and improvement at the tagger and parser levels could clearly be observed. The project required that our pipeline be tested, for analysis, on the Nokia Reviews dataset provided a few days before the deadline. This dataset was smaller, and was casefolded just like the Movie Reviews dataset. Hence, similarly, most of the proper nouns cannot be recognized as proper nouns, which in turn also impacts the recognition of named entities.

This data also presented a new issue: the reviews are enumerated and separated by two newlines, which should be considered the beginning of a separate sentence. For this dataset, we could perhaps first divide the data into paragraphs, separated by newlines, and then split those paragraphs into sentences at the sentence terminator symbols. This way we should be able to split the "Review 2" part from the beginning sentence of the review. I analyzed the above but did not implement the fix, since this dataset was only meant to test the performance of the pipeline. I observed that -

1. The file has a header that is taken as a complete sentence. It does not parse well at all. It needs to be addressed by removing the sequence of non-informative characters that are there for the sake of design.

2. The data has some gibberish in between, possibly because of encodings, which does not parse well.

3. "i" is taken to be a List Item Marker or a plural/common noun, which is wrong, because it should be tagged as a pronoun.

2 Named Entity Recognition

The second part of Project #1 required Named Entity Recognition to be implemented at three levels - after tokenization, after POS tagging and after parsing. In my case, I found the most scope for precise work at the tokenization level. Most of my improvements and fixes from the first part focused on tokenization and sentence splitting, so this module gave me more accurate clues at this level than at the other levels. The POS tagger performance, and similarly the parser output, would have to be improved before any useful pattern could be identified at those levels.

I fixed some issues at the POS tagging level, which improved the performance of the parser, but I did not delve too much into improving the tagger or parser results, given the time limit, because that was not the focus of this task; Named Entity Recognition was.

2.1 Approach and Analysis

Grammar: I used a simple grammar and captured consecutive tokens tagged as singular/plural proper nouns as named entities.

NE → (<NNP>|<NNPS>)+

Acronyms: I found acronyms to be a shortened form of a named entity, so I identified the acronyms and tagged them as proper nouns. This worked at various levels, part of which was done for the first part of Project #1. First, I fixed the sentence splitting for acronyms, using the algorithms already discussed in the previous section. Second, I combined each acronym into one token and tagged it as a proper noun (NNP). Since my grammar simply picks up sequences of proper nouns, it picks up the acronyms as named entities.

Though a lot of named entities have been captured well, my regular expression often captures partial named entities. For example, in cases where names are abbreviated, like E. B. White, my regular expression only captures E. B. as a named entity, which is correct but not complete. It would be easier to capture the whole name if there were more clues that associate the parts together. I found the POS tags to be inconsistent in this case, so they could not be used. If the names were properly capitalized, I might have been able to capture the latter part of the name with the POS tag NNP, which, from my observation, takes capitalization into account. In support of this observation, my simple rule with the Stanford POS tagger on the test set captures almost all the named entities, simply because the names are capitalized and tagged as NNP, whereas on the casefolded texts it often does not give accurate proper noun tags.

Titles Followed by Person Names: I found the names of people to be frequently preceded by titles, of which the three that I found to be fairly common and easy to spot on a cursory look are mr., mrs. and dr. I identify named entities associated with titles at the tokenization level, capturing the one word that follows these titles. With better tagger performance we might be able to look at the tags of the following words so that we can capture named entities spanning 2-3 words or more, but my POS tagger did not give me tags good enough to build a pattern for capturing more than one word following the titles.

Quotes and Parenthesis: In the Movie Reviews dataset, I often found names of movies or people in parentheses or quotes. I extracted the phrases where there are up to 3 words within the scope of parentheses and quotes, since I noticed that the names are often no longer than 3 words. If a sequence in parentheses is longer than 3 words, it was most likely a comment by the reviewer, as per the few reviews I superficially scanned. There is further scope for work in identifying names that are longer than 3 words but are present in a comma-separated format within parentheses.

It is difficult to capture the words between single quotes, because single quotes are also used as apostrophes in the text. Special cases need to be constructed in order to avoid capturing the sequences between apostrophes.
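As an illustration of the tokenization-level heuristics described in this section, the following sketch promotes acronym-like tokens to proper nouns, captures the single word following a title, and collects short parenthesized or quoted phrases. The function names, the acronym pattern, and the assumption that acronyms and titles have already been merged into single tokens are mine.

import re

TITLES = {"mr.", "mrs.", "dr."}              # assumes the title keeps its period as one token
ACRONYM = re.compile(r"^(?:[a-z]\.){2,}$")   # e.g. "u.s.a." after merging the acronym tokens

def retag_acronyms(tagged_tokens):
    """Promote acronym-like tokens to NNP so the simple NE grammar picks them up."""
    return [(w, "NNP") if ACRONYM.match(w) else (w, t) for w, t in tagged_tokens]

def names_after_titles(tokens):
    """Capture the single word that follows a title such as mr., mrs. or dr."""
    return [(tok, tokens[i + 1]) for i, tok in enumerate(tokens[:-1]) if tok in TITLES]

def short_bracketed_phrases(tokens, max_len=3):
    """Collect phrases of up to max_len tokens enclosed in parentheses or double quotes."""
    closers = {"(": ")", '"': '"'}
    phrases, closer, current = [], None, []
    for tok in tokens:
        if closer is None and tok in closers:
            closer, current = closers[tok], []
        elif closer is not None and tok == closer:
            if 0 < len(current) <= max_len:
                phrases.append(" ".join(current))
            closer, current = None, []
        elif closer is not None:
            current.append(tok)
    return phrases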
3 Recognizing Textual Entailment

In Project #2, I developed features for the task of recognizing textual entailment on the RTE dataset provided in the Natural Language Toolkit. The goal was to conduct different experiments and analyze the results as well as the errors. In order to analyze and compare the results, a baseline model was used.

3.1 Baseline

The baseline considered was taken from Chapter 6 of (Bird et al., 2009).

Word Overlap - Returns the number of words that are common to the text and the hypothesis.

Word Hypothesis Extra - Returns the number of words that are present in the hypothesis but not in the text.

Named Entity Overlap - Returns the number of named entities that are common to the text and the hypothesis.

Named Entity Hypothesis Extra - Returns the number of named entities that are present in the hypothesis but not in the text.

3.2 Approach

Overlapping words and named entities are good baseline features because more common words do signify, to a certain extent, that the content of the text and the hypothesis is similar. Similarly, extra words and named entities in the hypothesis are a good indication of the hypothesis containing more content, which may not be entailed by the text. Though these features are good for a baseline, they may not be good enough, since they do not look at other important aspects of the text and hypothesis. Therefore, extending the intuitions from the baseline, I added the following features:

Overlapping Lemmas - Many a time, words with the same lemma appear in the text and the hypothesis. This feature counts such overlaps, if any.

Overlapping POS Tuples - With the intuition that the word as well as the role of the word in the sentence is important, I counted the overlapping (word, pos_tag) tuples in the text and the hypothesis.

Overlapping Dependency Triples - With the intuition and intention of taking dependencies into account as a form of syntactic analysis, I counted the overlapping dependency triples in the text and the hypothesis, as provided by Stanford's dependency parser.

Ambiguous Words - I observed, in the training corpus, that there were instances where the text and the hypothesis used the same word in completely different senses, which could be somewhat disambiguated by looking at the Part of Speech tags. In this feature I count the number of words, if any, that are present in both the text and the hypothesis, but with different Part of Speech tags.

VERBS AND ADJECTIVES: I was interested in the role of verbs and adjectives in the determination of entailment, at a shallow level, so I developed 4 features that individually focus on the following aspects of both.

Similarity - It is not necessary for the hypothesis to use the exact same verb or adjective as the text; many a time, a close synonym suffices. Therefore, in order to use this similarity, which may increase the chances of entailment, I use the path_similarity score over all senses of all verbs and adjectives, using VERB and ADJ respectively from WordNet.

Lemma Overlap - Though Word Lemma Overlap captures the lemma overlap for all words, I specifically wanted to experiment with the usefulness of verb and adjective lemma overlap in predicting entailments.

In order to develop these features I used the fundamental textual analysis pipeline and obtained the tagged, constituency-parsed and dependency-parsed results for each text and hypothesis pair.
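As an illustration, the following is a minimal sketch of a few of the added features - lemma overlap, POS tuple overlap, ambiguous words, and a verb similarity score based on WordNet path_similarity. The function names, the use of WordNetLemmatizer and the maximum-over-senses scoring are my assumptions, not necessarily the exact implementation used in the project.

from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemma_overlap(text_words, hyp_words):
    """Overlapping Lemmas: overlap counted over lemmas instead of surface forms."""
    text_lemmas = {lemmatizer.lemmatize(w) for w in text_words}
    hyp_lemmas = {lemmatizer.lemmatize(w) for w in hyp_words}
    return len(text_lemmas & hyp_lemmas)

def pos_tuple_overlap(text_tagged, hyp_tagged):
    """Overlapping POS Tuples: common (word, pos_tag) pairs."""
    return len(set(text_tagged) & set(hyp_tagged))

def ambiguous_words(text_tagged, hyp_tagged):
    """Ambiguous Words: words present in both text and hypothesis but with different tags."""
    text_tags = {w: t for w, t in text_tagged}
    hyp_tags = {w: t for w, t in hyp_tagged}
    return sum(1 for w, t in hyp_tags.items() if w in text_tags and text_tags[w] != t)

def max_verb_similarity(text_verbs, hyp_verbs):
    """Verb Similarity: best WordNet path_similarity over all sense pairs."""
    best = 0.0
    for tv in text_verbs:
        for hv in hyp_verbs:
            for s1 in wn.synsets(tv, pos=wn.VERB):
                for s2 in wn.synsets(hv, pos=wn.VERB):
                    score = s1.path_similarity(s2)
                    if score is not None and score > best:
                        best = score
    return best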
3.3 Evaluation

I ran experiments for all possible combinations of the feature space with two classifiers - a Naive Bayes classifier and a Support Vector Machine. I observed that many of the resulting models had 100% precision, and many had 100% recall, on one or another of the test sets. I removed all such experiments because, in those cases, all the test instances were being indiscriminately predicted to be not an entailment or to be an entailment, respectively. I have performance data for 5681 experimental models, out of which I selected, for analysis, the few that outdid the others on one test set or another; the results can be seen in Table 1. The models that were selected on the basis of their RTE 1, RTE 2 and RTE 3 performances increased the baseline performance by 6-8%.

Model        Classifier    RTE 1  RTE 2  RTE 3  RTE 10  RTE 30
Baseline     Naive Bayes   0.62   0.61   0.66   0.67    0.5
             Features: Word Overlap, Word Hypothesis Extra, NE Overlap, NE Hypothesis Extra
Best RTE 10  Naive Bayes   0.63   0.64   0.64   0.83    0.51
             Features: Word Overlap, Word Hypothesis Extra, NE Hypothesis Extra, Verb Lemma Overlap, Adjective Similarity, Adjective Lemma Overlap
Best RTE 30  Naive Bayes   0.57   0.57   0.57   0.25    0.67
             Features: NE Overlap, Overlapping POS Tags, Ambiguous Words, Adjective Similarity, Adjective Lemma Overlap
Best RTE 1   SVM           0.68   0.66   0.68   0.71    0.54
             Features: Word Overlap, Word Hypothesis Extra, Overlapping Lemmas, Verb Similarity, Verb Lemma Overlap, Adjective Similarity
Best RTE 2   SVM           0.67   0.67   0.68   0.71    0.55
             Features: Word Overlap, Word Hypothesis Extra, Overlapping POS Tags, Overlapping Dependency Triples, Overlapping Lemmas, Verb Similarity, Verb Lemma Overlap, Adjective Lemma Overlap
Best RTE 3   SVM           0.66   0.66   0.7    0.71    0.54
             Features: Word Overlap, NE Overlap, NE Hypothesis Extra, Overlapping POS Tags, Overlapping Dependency Triples, Ambiguous Words, Overlapping Lemmas, Verb Similarity, Adjective Similarity, Adjective Lemma Overlap

Table 1: Best Performing Models

3.4 Analysis

I analyzed the results for the rest of the models. Using the most_informative_features function of the NLTK library, I observed that the NE Hypothesis Extra feature is consistently the most informative feature in most of the cases where it is part of the model's feature set. Close seconds are Adjective Similarity, Overlapping POS Tags, Verb Similarity and Word Hypothesis Extra. Overlapping Dependency Triples did not turn out to be as informative as expected.
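The following is a minimal sketch of the kind of experiment described in Section 3.3, using the NLTK RTE pairs and a toy two-feature extractor; the feature function shown here is only illustrative, and the actual experiments iterated over all combinations of the 12 features. It assumes scikit-learn is available for the SVM wrapper and uses the standard RTE file ids from the NLTK corpus.

from nltk.corpus import rte
from nltk.classify import NaiveBayesClassifier, SklearnClassifier, accuracy
from sklearn.svm import SVC

def toy_features(pair):
    """Illustrative feature dictionary for one RTE text/hypothesis pair."""
    text_words = set(pair.text.lower().split())
    hyp_words = set(pair.hyp.lower().split())
    return {"word_overlap": len(text_words & hyp_words),
            "word_hyp_extra": len(hyp_words - text_words)}

train = [(toy_features(p), p.value)
         for p in rte.pairs(["rte1_dev.xml", "rte2_dev.xml", "rte3_dev.xml"])]
test = [(toy_features(p), p.value)
        for p in rte.pairs(["rte1_test.xml", "rte2_test.xml", "rte3_test.xml"])]

nb = NaiveBayesClassifier.train(train)
svm = SklearnClassifier(SVC()).train(train)

print("Naive Bayes accuracy:", accuracy(nb, test))
print("SVM accuracy:", accuracy(svm, test))
nb.show_most_informative_features(5)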

Future Work
There is scope for improvement in the tagger and the parser. For the NER task, temporal and numerical expressions could be identified using regular expressions. For the RTE task, including parse tree information, or matching constituents, could be useful. With much more data to train on, a neural network based classifier could be trained to predict entailments. Most of my work in RTE was based on extending the intuition of overlapping words; the intuition of extra words and named entities in the hypothesis could be extended to expand and experiment with the feature space.

References
Bird, Steven, Ewan Klein and Edward Loper. 2009. Natural Language Processing with Python, volume 1. O'Reilly Media Inc.
