NANYANG TECHNOLOGICAL UNIVERSITY

FINAL YEAR PROJECT

NLP Search Engine

Author: THAI BINH DUONG

Supervisor: PROF. CHAN CHEE KEONG
Examiner: PROF. CHONG YONG KIM

April 23, 2011

Abstract

Even though many natural language systems have been developed successfully and commercialized, none has yet proved versatile enough for a wide variation of tasks. One exception is probably IBM's Watson, which during the course of this project won against two human champions in a Jeopardy contest and showed for the first time that full scale interaction and reasoning in natural language are finally within the reach of modern technology. In this project, the user input query, which is in natural language form, is analyzed and represented in FOPC (First Order Predicate Calculus), a form suitable as input for higher layer tasks such as logic. In the process, referents or implications from the question-answer pair may also be deduced.

Acknowledgment
I owe my deepest gratitude to my supervisor, Professor Chan Chee Keong, who is very considerate and cheerful at the same time, and who has given me the opportunity to work on this project, which has been very enjoyable. I also want to express my gratitude to my counsellor Mr. Frank Boon Pink, Ms. Jasmine, Ms. Joanne Quek, and Professor Gwee Bah Hwee for being behind me all the time. My big thanks to Professor Francis Bond, whose teaching got me interested in natural language processing. And last but not least, my no-word-can-describe-this gratitude and love to my family and my friends Long and John for their support. Thanks to all of you for everything. All the mistakes in this project are my own.

Contents
Abstract
Acknowledgment
1 Introduction
  1.1 Literature Review
  1.2 Goals and Objectives
  1.3 Scope
  1.4 Report Organization
2 System Design
  2.1 Requirements
  2.2 Designs
3 Tools and Resources
  3.1 Tools
    3.1.1 Python
    3.1.2 Natural Language Processing Toolkit
  3.2 Resources
    3.2.1 Wordnet
    3.2.2 Brown
    3.2.3 Semcor
    3.2.4 Question Corpus
  3.3 Download and Installation
  3.4 Getting Started
4 Text Clean Up
  4.1 Introduction
  4.2 Tokenizer
  4.3 Spell Checker
    4.3.1 Introduction
    4.3.2 Method
    4.3.3 Results
    4.3.4 Evaluation
    4.3.5 Future Work
5 Part of Speech Tagger
  5.1 Introduction
  5.2 Method
  5.3 Mathematical Background
    5.3.1 Language Model
    5.3.2 Estimating Lambda Value
  5.4 Operation Explanation
  5.5 Results
  5.6 Evaluation and Discussion
6 Meaning Representation
  6.1 Introduction
  6.2 Background Theory
    6.2.1 First Order Predicate Calculus
    6.2.2 Formal Logic
  6.3 Semantic Analysis: A Study Case
  6.4 Design
    6.4.1 Operation Explanation
    6.4.2 Presentation
  6.5 Results
    6.5.1 Semantics Analysis
    6.5.2 Meaning Representation
  6.6 Evaluations
7 Conclusion

List of Figures
2.1 Simplified system DFD
4.1 Spell checker flow chart
4.2 Stemmer flow chart
4.3 Spelling correction module flow chart
5.1 POS tagger flow chart

List of Tables
4.1 Transformation steps and their patterns
4.2 Weighed transformation
5.1 Accuracies
5.2 Accuracies

Chapter 1 Introduction

Search engines are vital for fast and accurate information retrieval. Most search engines to date are based on keywords, metadata and ranking algorithms to return the results that are most likely to match the input queries. This can be very irrelevant, as in the early search engines back in the 90s, or it can be very effective, as in Google's case. Google's power lies in their gigantic knowledge base, and in being able to process such a huge knowledge base and half a billion queries per day. In fact, this suggests that in Google's method the real content of a web page plays a less significant role than it should, and the engine can still be tricked into giving a page a higher rank than it really deserves, a practice also known as search engine optimization (SEO). Nevertheless, to be fair, a traditional search engine is probably the most suitable choice for many needs, by letting the user do the final and usually most difficult task: read through the contents and choose whatever suits their needs. Other interesting search engines that might be more useful than Google when it comes to more specific tasks are GazoPa, TinyEyes and Stock photography, all of which are similar image search engines, each with their distinct features; Bing, which is great for lifestyle; and Wolfram Alpha, the world's most academic search engine, which is also a natural language search engine.

A natural language search engine, on the other hand, will in theory be able to respond to questions from users, as opposed to keywords only, and be able to analyze the actual contents of a web page to determine the level of relevance. A natural language search engine can be broken down into basic natural language tasks that we perform daily: analysis, sense disambiguation, language generation, and so on. In fact, any natural language task can be grouped into the areas below:
• Phonetics and Phonology: The study of linguistic sounds.
• Morphology: The study of meaningful components of words.
• Syntax: The study of structural relationships between words.
• Semantics: The study of meaning.
• Pragmatics: The study of how language can be used to accomplish goals or in different situations.
• Discourse: The study of linguistic units larger than one single utterance.

Human attempts to build automata that mimic humanlike behavior date back thousands of years and are still going strong: from ancient machines programmed with pegs and ropes, to the mechanical marvel of a robotic lion by Leonardo da Vinci in 1515. And what kind of science fiction would it be without a humanlike machine, either in the form of a lovable and talkative android or an intelligent supercomputer with its own evil will and desire? Despite a long history of envisioning, striving and many brilliant minds, it was not until 1936, when the first freely programmable computer, the 'Z1 Computer', was invented, that humankind had the facility to realize this long standing dream.

As the technology evolves, humans also invent more methods to communicate effectively with these systems: from keyboard and mouse, to touch screen, speech recognition, eye and motion tracking, and even brain signals. But the holy grail of communication will be what we have developed through generations and what we are most naturally familiar with: our mother tongue, or more generally natural languages. The linguistic tasks that we humans perform almost effortlessly every day turn out to be challenging indeed. Even though many systems have been successfully developed and commercialized [7], none has proved versatile enough for a wide variation of tasks. In this project, the main goal is to derive the implication given a pair of question and answer. Along the way, minor tasks such as spelling and syntax analysis will also be explored.

1.1 Literature Review

Research on linguistics has been carried out in many other fields long before the computer science era and is known under different names in different fields: computational linguistics in linguistics, speech recognition in electrical engineering, and computational psycholinguistics in psychology. In computer science, it is known as natural language processing.

In 1966, Weizenbaum proposed ELIZA, a computer simulation of a Rogerian psychotherapist [21]. Using almost no information on human thought or feeling, and only clever decomposition and recomposition rules triggered by a system of ranked keywords, it sometimes produced amazingly human-like conversation. It was so convincing that Weizenbaum reported some attendants refused to believe that ELIZA was not human even after the situation was explained to them [21]. Many chat bots were developed based on ELIZA, each with their unique discourse, hence their different conversation styles.

In 1997 the infamous PC game Fallout was released with a feature that allowed the player to interact with in-game characters using natural language input. It performed poorly considering a closed context and was discarded in the following sequel. It is still highly exciting to see such a feature though, which I believe is how machines should, and can, learn: just as how we used to learn when we were kids, in plain language.

In May 2009, Wolfram Alpha was launched. Users around the world for the first time had access to a search engine which can operate using natural language both as input and output. Information retrieval and display is thorough and well organized, as if it had been carefully prepared by a human. The drawback is that it is pretty useless for nonacademic purposes.

In early 2011, an IBM machine once more defeated human champions in an intellectual contest. This time it was IBM's Watson against two champions on the Jeopardy quiz show, which required contestants to figure out which question should have been asked given a statement.

Despite the fact that Watson had the entire data of Wikipedia loaded in its RAM, and that it exhibited weird behavior at some points, chances are that not so far into the future, things such as interacting freely with an intelligent system in natural language will start to penetrate and change the way we are using machines today.

1.2 Goals and Objectives

Be able to analyse a natural language input query, which is typically a question or a question and answer pair, and return its meaning representation in a format suitable for logic operation. Additional spelling correction might be performed if required.

1.3 Scope

Currently, the program is able to perform word stemming, spelling correction, part-of-speech tagging, chunking and partial meaning representation from a natural language input.

1.4 Report Organization

The main body of this report consists of 4 chapters. The first one, chapter 2, will cover the overall design of the system, as well as the requirements, specifications, tools and resources. A guide for installation is also provided. Next, chapter 4 will cover spelling correction. Part of speech (POS) tagging methodology, results and performance are discussed in chapter 5. Finally, chapter 6 will discuss how meaning can be extracted and represented using syntax analysis information and the FOPC scheme.

Chapter 2 System Design

2.1 Requirements

At the end of the project, the program should be able to analyse a natural language query, which is typically a question or a pair of question and answer, and return its meaning representation in a format suitable for logic operation. Some necessary intermediate processes are:
• Text cleaning up: including tokenizing and spelling correction if required.
• Part-of-speech (POS) tagging using a second order hidden Markov model (HMM): accuracy should approach 90%.
• Meaning deduction and representation using first order predicate calculus (FOPC).

2.2 Designs

At the core of the program are three separate modules and a central database. The three modules are:
• spelling module: for spelling correction.
• tagging module: for POS tagging.
• sense module: for semantics analysis.
A simplified data flow diagram is shown in figure 2.1.

Figure 2.1: Simplified system DFD
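The figure itself is not reproduced in this text. As a rough, hedged illustration of the intended flow (text clean up, then tagging, then semantic analysis), the sketch below chains off-the-shelf NLTK components as stand-ins for the project's own spelling, tagging and sense modules; it is not the project's actual code.

import nltk

def process_query(query):
    """Rough stand-in for the pipeline of figure 2.1:
    text clean up -> POS tagging -> input for semantic analysis."""
    tokens = nltk.word_tokenize(query)   # stand-in for the clean-up and spelling module
    tagged = nltk.pos_tag(tokens)        # stand-in for the HMM tagging module
    # The sense module (chapter 6) would consume `tagged` and return a
    # FOPC-style representation; that step is project-specific and omitted here.
    return tagged

print(process_query("Which country is north of America?"))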

. portability. flexible. it’s unlikely that Python will ever be as fast as C. is an easy-to-use.. . popular. these terms mean (but not limited to) readability. • Graphics: Pixar. 3. Python emphasizes concepts such as quality.Chapter 3 Tools and Resources 3.1 3. Python’s speed of development is just as important as C’s speed of execution. As a general-purpose language. In modern software context. 6 . To be fair.. In short. software and data for Python. Both C and Python have their distinct strengths and roles..2 Natural Language Processing Toolkit Natural Language Processing Toolkit (NLTK) is an open source natural language processing libraries.1 Tools Python Python. and open source programming language designed to optimize development speed. .1. . Yahoo. . . . • And many mores. fast development speed. In this project. productivity.. . • Numerical application: NASA. However. Battlefield 2. Disney. . since Python programs use highly optimized structures and libraries. Python can be used for almost anything computers are capable of. and integration. Zope. NLTK was used mainly for its corpus and probability module. Paint Shop Pro. objectoriented. • Games: Civilization 4. A few organizations currently using Python are: • Web developments: Google. they tend to run near or even quicker than the speed of C language somehow. text processing power and web work suitability. Blender. named after Monty Python a British band of comedians.1. National Weather Service. mature. .

3.0. NLTK is shipped with Wordnet 3.2. 3. and type the commands: >>> import nltk >>> nltk.014. which is useful for text analysis and artificial intelligence. The samples were divided into categories and subdivisions. • Python 2. Some of them are shipped with NLTK. Synsets are interlinked by means of semantic and lexical relations. A full installation will require approximately 800 MB of free disk space.2 Resources Below are collection of corpora used in the process. verbs.312 wordsl of running text of edited English prose printed in the United States during the calendar year 1961. Nouns. A tagged version of the corpus.3 Download and Installation Download size is about 17 MB.7 • PyYAML • NLTK After installing all packages. The result is a meaningful hierarchical network of related words.3 Semcor A subset of Brown corpus in which words are also tagged with their sense along with their part of speech.2 Brown The Brown University Standard Corpus of Present-Day American English (or just Brown corpus) consists of 1. 3.4 Question Corpus Collections of ‘which’ and ‘what’ questions tagged with part of speech and intention of the question.2.2. adjectives and adverbs are grouped into sets of synonyms (synsets).download() 7 . Available at Rada Mihalcea’s page 3. run the Python IDLE (see Getting Started). Available at Rada Mihalcea’s page 3.2. in which every word was tagged with its part of speech is also available in NLTK. each expressing a distinct concept.3.1 Wordnet Wordnet is a lexical database for English language.
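Once the data has been downloaded, the Wordnet corpus described in section 3.2.1 can be queried directly from NLTK. The short session below is illustrative only; the method names follow current NLTK releases (older releases expose some of these as attributes), and the exact synsets printed depend on the installed WordNet version.

from nltk.corpus import wordnet as wn

# Synsets group synonymous lemmas; each synset carries a gloss (definition).
for synset in wn.synsets('bird')[:3]:
    print(synset.name() + ' - ' + synset.definition())

# Semantic relations link synsets into a hierarchy, e.g. hypernyms (is-a links).
print(wn.synsets('bird')[0].hypernyms())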

6 ->IDLE Check that the Python interpreter is listening by typing the following command at the prompt. ’County’. If you did not install the data to one of the above central locations. (This assumes you downloaded the Brown Corpus): >>> from nltk. ’Fulton’. Click on the File menu and select Change Download Directory.] 3. set this to C:\nltk data. showing the NLTK Downloader. It should print Monty Python! >>> print "Monty Python!" 8 .words()[:50] [’The’.py extension. You can also open up an editor with File ->New Window and type in a program. Save your program to a file with a . ’Grand’. .4 Getting Started The simplest way to run Python is via IDLE. you will need to set the NLTK DATA environment variable to specify the location of the data.. It opens up a window and you can enter commands at the >>> prompt. select the packages or collections you want to download. Next. Right click on My Computer.. then run it using Run ->Run Module. select Properties>Advanced>En Variables>User Variables>New.A new window should open.. In Windows: Start ->All Programs ->Python 2. ’Jury’. For central installation.. the Integrated Development Interface.corpus import brown >>> brown. ’said’. Test that the data has been installed as follows.

the amount of memory saving can be substantial considering a very large and noisy corpus such as html documents. "’".2 Tokenizer Here is the code for the tokenizer: import re TOK=r’(?:\b([\w][\w\-\’]*[\w]))|([ˆ\s\w])’ def tokenizer(sent.1 Introduction Usually the very first step of every text processing tasks. ’at’. A result example for an noisy input is shown below. ’?’] 9 . ’/’. this is not the case if B is False. ’"’. filtered out odd characters and spell checked before being used for further processing. ’?’. "’". "y’re". The expression A and B or C is an equivalent to: if A is True: return B else: return C However. ’gaping’. ’?’.sent)] The function uses a regular expression define in pattern (default value is TOK) to search for words and punctuations in the input string sent. >>> tokenizer("W-H-A-T ’y’re’ looking/gaping at \"huh\" ??? ") [’W-H-A-T’.. ’looking’. A clever cleaning up can benefit the project in many ways. ’"’. user input query will be tokenized (including punctuation). ’huh’. Even if it’s just a simple procedure which filters out non-desired characters. 4.pattern=TOK): return [item[0] and item[0] or item[1] \ for item in re. As in this project context.Chapter 4 Text Clean Up 4.findall(pattern.
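The `A and B or C` idiom above silently misbehaves when B is falsy (for example an empty match), as noted. In later Python versions the same selection is usually written with a conditional expression; a minimal equivalent is sketched below purely to illustrate the point, it is not a change to the project's tokenizer.

import re

TOK = r"(?:\b([\w][\w\-\']*[\w]))|([^\s\w])"

def tokenizer2(sent, pattern=TOK):
    # 'item[0] if item[0] else item[1]' picks the non-empty group of each
    # match without the falsy-value pitfall of the 'and/or' idiom.
    return [item[0] if item[0] else item[1]
            for item in re.findall(pattern, sent)]

print(tokenizer2("W-H-A-T 'y're' looking/gaping at \"huh\" ???"))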

• Suffix and prefix: stopt (stopped). brid (bird) • Finger slip: what (qhat). we will only focus on errors that result in nonexistent words. realy (really). As for this project. though there was considerable variation between good spellers and poor spellers. Section 4.5% to 23% for bibliographic database was reported in [2]. • Real word error: hole (hope). [17] found that real-word errors account for about a quarter to a third of all spelling errors. happend (happened). Some methods are: • N-gram technique • Rule based technique • Minimum edit distance technique • Probability technique • Neural nets technique • Acceptance based technique • Expectation based technique • .1 Spell Checker Introduction [17] reports a rate of 25 errors per thousand for handwritten essays by secondary school students. 1 (number one) or l (letter l). there are different methods for detection and correction [14]. 10 . perhaps more if you include word-division errors. On the other hand. no3 (now). • Order of letters in words: gril (girl). Spelling error rate and it’s significance varies depend on the application fields. For the project’s context where user types in the input query. them (then). and they will have significant impact on the output results There are many sources of spelling errors: • Spelling by sound: wuns (ones). lirbary (library) • Confusion of homophones: there/their.3 4.3.2 will explain in detail the method use in this project. sed (said). • Similar shape in optical recognition: rn (m).4.3. • Not hearing the sound: umrella (umbrella). while the rest of this chapter will evaluate and discuss on the performance. a rate of 0. the rate of errors will be high. Depend on what kind of spelling errors. two/too/to...
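Of the techniques listed in this section, the minimum edit distance idea is worth spelling out, since the method used later in this chapter (transmuteI, section 4.3.2) is a weighted variant of it. The sketch below is the classic dynamic-programming formulation, shown only for background; it is not the project's implementation.

def edit_distance(a, b):
    """Classic Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    m, n = len(a), len(b)
    # dist[i][j] = distance between a[:i] and b[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[m][n]

print(edit_distance('gril', 'girl'))   # 2 with plain Levenshtein (1 if swaps were counted)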

PorterStemmer() However. • wn dic. Figure 4.1: Spell checker flow chart Stemmer A handy and readily availbale stemmer called Porter Stemmer can be called using the following code: import nltk stemmer = nltk.4. Compiled using Wordnet.txt: Compiled using the English lexicon in NLTK.2 Method A few lexicons were prepared before hand: • dict en. • small dict. • *. This leads to many awkward results as shown below.1 shows a simplified flow chart for the spell checker Figure 4.stem. >>> stemmer. its POS tag and associated definition.txt: The corpus of every lemma.stem(’goes’) ’goe’ >>> stemmer.stem(’grocery’) ’groceri’ 11 .stem(’propose’) ’propos’ >>> stemmer.txt: Compiled using Wordnet and stop word lexicon in NLTK. the stemmer doesn’t validate the return solutions.3.exc: Corpora of morphological exceptions (mice/mouse).

it is the opposite of feature fitc. add. It can also return the definition associating with the part of speech of the token during the process if required. can): length difference.py. Figure 4.2: Stemmer flow chart The stemmer utilises extra information such as exceptions and part of speech tags for validation.py Spelling Correction The underlined approach was to figured out the similarity between 2 tokens. This weights the unique. change. rather than common characters. sums up their lengths then weights.>>> stemmer. a more sophisticated stemmer was developed. • transmuteI(word.can): level of character disorder. 4 basic (distance 1) transform steps are: delete.can): steps required to transform word into can. 12 . hence a better accuracy. The experiment results indicated that feature transmuteI alone was sufficient. The other features can be used for selecting potential candidates to speed up the process.2. can): scans for match strings between word and cand. can): in some sense. and swap (adjacent letters only). Feature fitc was chosen because of its high speed and usually resulted in a neither too broad or too restricted candidate set.stem(’groceries’) ’groceri’ For this reason. • fitm(word. These features are: • fitl(word. • fitc(word. • fitorder(word. Source code can be found in module morphy. It’s flow chart is shown in figure 4. Another version that also return the meaning associated with the word can be found in module wn dict fast. A few features were chosen and combined either by linear or cascade combination.

5)] 0. 1.0). 0. 1. Figure 4.0).0).Detail Operation Explanation Refer to figure 4. The source code can be found in module morphy.wei) in classifiers[:-1])/sig The C(properties) is a vector of features together with their weights for linear interpolation as can be seen in the function C(interpolation). (transmuteI. properties=[ #(fitorder. (fitc.3: Spelling correction module flow chart Let’s explore the module at its very top layer.2 for the stemmer operation. #(fitl. 1.3.can)*wei for (prop. The last element in the feature vector is used for 13 .0).py. 1. #(fitm. #(fitm. A flow chart for the spelling correction is shown in figure 4.wei) in classifiers[:-1]) return 1.0*sum(prop(word.67)] def interpolation(word.classifiers=properties): ’Linear interpolation of various classifiers.0). #(fitc. #(fitl. 1. 1.7)] 0. last element is used for candidate sig = sum(wei for (prop.can.0). #(transmute.

Obviously. Furthermore. their common must approach their lengths for f itc to be significant.can) l=0 for string in common: l+=len(string) return 0. 1]. This can be achieved through a four steps process: • Search for the common string between the two words. The function C(transmuteI) will calculate the shortest distance and will figure out the intermediate transformation steps if required as well. we’ll say the distance between the two is 4. when C(word) or C(can) becomes longer. can. Calculating transmuteI is not very straightforward as in the case of f itc. Sum up their lengths then weight""" (word.lower()) common = common_strings_mk2(word. • Swap two adjacent character. • Delete one character.can) = (word. only C(transmuteI) will be used as feature and C(fitc) as candidate filter.0/len(can)) The code is self explained.0/len(word)+1. These steps are: • Add one character. • Weight the transformation for calculating transmuteI. C(fitc) is calculated by using equation (4.1). Below are the implementation code for C(fitc): def fitc(word. • Change one character into a different one.candidate filtering. can): """ The function scan for match strings btw word and candidate. As stated above. The transformation from C(word) to C(can) can be assumed to go through a series of basic transformation steps. there are infinite number of ways to transform one word to another. • Assign transformation steps. 14 .lower(). but there should be a limited number of shortest paths.5*l*(1. the more similar the words are. The function implies that the bigger the value.1) Note that f itc is in range [0. • Segment and align the two words based on the common string. If 4 steps is required for the transformation. f itc = 1 × length(common) × 2 1 1 + length(word) length(candidate) (4.

.’e’. ’b’. ’cd’. This is a greedy process. The process bases on the fact that the transformation steps will decide how the column vector look like. so these transformations should be weighed less.’l’. assign transformation steps. which means it tries to group as many adjacent characters as possible. weight the transformation for calculating transmuteI..3.2 shows the weighed transformations. search for the common string between the two words.. ’cd’.Firstly.. i. ’c’ ’’ ’c’ ’d’ ’’ ’d’ ’c’ ’d’ ’’ ’’ but not ’c’ ’c’ Delete ’c’ Change ’c’ to ’d’ ’c’ ’’ ’d’ ’d’ Swap ’c’ and ’d’ ’c’ ’’ Add ’c’ Table 4. ’’. ’k’.verbose=False): .’l’. ’’. nearer distance. ’’. can): "Return list of substrings that match.3. [ 2. ’cd’. 12]] Thirdly. ’m’]. for instance ’abc’ as opposed to ’a’. 15 . 6. ’m’] Secondly. >>> common_strings_mk2(’abcdefghklm’. ’b’. 8.’g’. ’’.’21becdfhlkm’) [[’a’. ’’. Table 4.1: Transformation steps and their patterns Finally. Sample code for searching function: def common_strings_mk2(word. in order of appearance from left to right" . Table 4. ’h’.’e’. [’2’. >>> align(’abcdefghklm’.’1’. Some transformation are more likely to happen than others.e. ’f’. ’f’.’21becdfhlkm’) [’b’.can.2 showed the summary of the patterns and their corresponding steps. ’h’. segment and align the two words based on the common string. ’k’. ’m’]. ’h’. The desired output of this process is shown in the example below.’c’. 10.’b’.’c’ or ’ab’. def align(word. ’f’. 4. ’k’. ’’.

o to u.5’.can)) originally was the steps require to tranform word to can #Basic steps are add.Swap Swap Change All steps Between vowels Between (g.can): "Measure distance (˜steps require to tranform word to can)" #tagseq(align(word. (r.0 After weighing the transformation steps. ’1’. 3]. 5]. (h.type. ’Del’. and (h.0.0.2: Weighed transformation 0.5 0. ’l and k’. 7].2) Function (4.2) implies that returned value of the function is equal to 1. and swap (adjacent chars) #Has been modified to weight the steps instead. [1.0.5 0.5 4.0. ’Swap’.. (h. converges to zero when distance approaches infinity. change. ’Change’.h). u to o. ’e’. y to i Not above cases Table 4. [1. delete. ’a to 2’. 0].3.can)])) >>> transmuteI(’abcdefghklm’. (c.can. #For example: ’swap a and e:0.0/(6+sum([val[0] for val in tagseq(word. ’Add’. Some sample codes and outputs of a few keys functions were shown below: def tagseq(word.’21becdfhlkm’) 0.verbose=False): "Figure out the transmute steps given aligned sequences\n\ return value: [[weight. 1]. ’change a to e’: 1. ’g’.index]]" . ’Add’. [1.h). and reduces by one-third when distance is 3. >>> tagseq(’abcdefghklm’. ’e’. ’Del’.r) a to e. 11]] def transmuteI(word.0. (h..’21becdfhlkm’) [[1. [1.k).t).3 Results Below are a few demonstrations for the stemmer. transmuteI is calculated as follow: transmuteI = 6 6 + sum(weighed transf ormations) (4.0.0 only if the distance is zero.s). e to a.involved chars.0.0 return 6.5 1. 16 . [1.p). i to e or y.

’stops’.3. ’nos’. ’rely’. ’that’. ’n’]] Some demonstration for the spelling correction. ’khat’. ’grin’. ’realty’. ’nor’. ’nog’. ’really’. ’chat’] >>> correct(’no3’) [’nox’. ’brio’. ’relay’. ’grail’. ’grill’. ’n’]] >>> morphy(’groceries’) [[’grocery’. ’brim’. ’stoat’] >>> correct(’happend’) [’happen’. ’grim’. ’ghat’. ’hat’. ’bris’. ’stop’. ’bird’. ’append’] >>> correct(’realy’) [’reply’. ’realm’.’. ’v’]] >>> morphy(’propose’) [[’propose’. ’buns’] >>> correct(’sed’) [’sad’] >>> correct(’stopt’) [’stout’. ’nod’.1 >>> correct(’umrella’) [’umbrella’] >>> correct(’libary’) [’library’] >>> correct(’qhat’) [’what’. ’braid’. ’girl’. ’bid’. ’ready’. ’grip’. ’brie’.>>> morphy(’goes’) [[’go’. ’non’. ’grid’. ’now’. ’qat’. ’bride’. ’brit’. ’noe’. ’nov’. ’noc’. ’nob’. ’v’]] >>> morphy(’grocery’) [[’grocery’. ’brig’. ’no. ’mealy’] 4. ’gris’.4 Evaluation The stemmer worked perfectly well for the test samples. ’not’. The error samples were taken in 4. ’aril’] >>> correct(’wuns’) [’uns’. 17 . ’noi’. ’real’.3. ’brad’. ’grid’. ’redly’. ’quat’. ’arid’] >>> correct(’gril’) [’grit’. ’no’] >>> correct(’brid’) [’rid’.

there might be no unique correct answer. However. which was one of Woody Allen’s words for ‘love’ since he thought ‘love’ was too weak of a word. From the above testing experiment. Further experiments by quickly skimming through a dictionary. and ‘happend’. ’curve’] So in conclusion. the spell checked failed in 4 cases: ‘wuns’.uk/headers/0643.ox.3.ac.On the other hand. 18 . the corpus is not readily usable. A corpus reader to make this corpus readily available in NLTK would be very useful. 4. catching one glance at a long word and rapidly typing it into the computer showed that the spell checker rarely failed. these four cases couldn’t be helped since they were errors due to spelling by sound and the spell checker wasn’t designed for this type of errors. One of the few failed cases was ‘lurve’. the spell checker has done its job well. The best solution therefore would be letting the user choose from a list of suggested words. >>> correct(’lurve’) [’lure’. ‘stopt’. However.5 Future Work A spelling corpus is freely available at ota. So we would say a spell checker fails only if it lefts out the intended solutions. ‘sed’.xml The corpus can be used for a statistical spell checker or accuracy test. it’s hard to define a baseline or perform an accuracy test for the spelling checker because due to undetermined nature of spelling errors.
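Until such a corpus reader exists, a small harness along the following lines could serve as the accuracy test mentioned above. The (misspelling, intended word) pairs are taken from the examples earlier in this chapter, and `correct` stands for the project's own suggestion function, so the final line is only indicative.

def evaluate(suggest, pairs):
    """Count how often the intended word appears among the suggestions.
    `suggest` is any function mapping a misspelling to a list of candidates."""
    hits = 0
    for wrong, intended in pairs:
        if intended in suggest(wrong):
            hits += 1
    return float(hits) / len(pairs)

# Pairs drawn from the error examples earlier in this chapter.
test_pairs = [('umrella', 'umbrella'), ('libary', 'library'),
              ('qhat', 'what'), ('gril', 'girl'),
              ('stopt', 'stopped'), ('happend', 'happened')]

# e.g. print(evaluate(correct, test_pairs)) once the spelling module is imported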

some syntax analysis will be necessary to understand the utterance’s sense correctly. this utterance can either mean that the speaker wants to eat at some nearby location. George no hurt you!”. “I am driving a car”. However. “I want to eat someplace that’s close to NTU. the clause “because she was rich” can modify either the state of being married (the whole utterance) or just the cause (the verb only). we can understand an utterance while paying little to no regard to its syntax. a great chapter in American life came to a close and a greater chapter began. or there is ambiguity. “He didn’t marry her because she was rich. but closer inspection reveals a quite simple structure: a sentence pre modifier ( When Ulysses S. given the usual structure of the verb eat. Let’s consider a few examples: 1.” 2. The same goes for example 3. or some random jungle talk “George good. Grant and Robert E. In fact most of natural language tasks can be viewed ask resolving ambiguity at some points. Considering example 2. Lee met in the parlor of a modest house at Appomattox Court House.” 3. Lee met in the parlor of a modest house at Appomattox Court House. Grant and Robert E. When Ulysses S.” 4. Virginia to work out the terms for the surrender of Lee’s Army of Northern 19 . Virginia to work out the terms for the surrender of Lee’s Army of Northern Virginia. the meaning can be “being out of town before 4PM and arriving in town only then”.Chapter 5 Part of Speech Tagger 5. Example 4 might look complicated at first. For example. at times when the syntax becomes complicated. depending on which part of the sentence that the preposition phrase “until 4PM” modifies.1 Introduction Most of the time. “He won’t be in town until 4PM. Considering example 1. or “arriving in town at some earlier time but not staying as late as 4PM”. or that the person wants to swallow the place. The latter is much less likely to happen in real world.

Our current HMM model will fail in this case unless the term “whole new house” happens to appear more often than the others in usual context. A HMM model has some major advantages compared to a decision tree model. we’ll explore the syntax analysis.3.2 Method The language is assumed to be a second order hidden Markov model (HMM). • It would required an adept knowledge in linguistics to capture all the grammar rules for using in the decision tree.] As those examples above illustrated.e. HMM model also has drawbacks: • True model of the language is not known and can only be approximated. a good syntax analysis will make the task of meaning extraction easier and more precise than intuition alone. • Most words in English are unambiguous. which appeared quite far after the empty location. i. which means that the choice of a word is only depend on its previous two words. Let’s consider an example: “I painted my neighbour’s whole new . • It is impossible to have a databse of every intance of a language. such as face. .Virginia. 20 . considering our context which is part of speech tagging. Some examples are “adjective usually precede noun”. house. or “to must be follow by noun phrase or bare infinitive verb”. A tagger based on HMM model will estimate the probability of various sequences and return the most probable sequence or sequences. pants. Section 5. since a strict definition of a grammar rule is not easy to define. green except for the front door yesterday. that is they only have one part of speech. The following sections will continue to elaborate on the program operation and its performance.” Which word will be likely to fill in the empty space? It should be something that can be painted on. board.) followed by a series of simple sentences ([a great chapter in American life came to a close][ and a greater chapter began. 5. But only “house” are related to “door”. But many of the most common words in English are not. Necessary mathematical calculation is covered in section 5. POS tagging. or cow. the HMM model works much more efficiently since grammar rules restrict which classes can stand next to each other.2 will provide some information about method used. This of course is not true. the HMM model should account for these unseen events (also known as smoothing) • The application filed should be similar to the training data. but not totally false either. . This kind of database is usually simpler to prepare but time consumming. In this chapter. In fact [9]stated that over 40 • A single HMM tagger can be reused for many languages or applications if provided with proper training data which in this case is a set of sentences tagged with each word’s part of speech. However.

• A database of sufficient size for training must be available.
• Unable to validate the returned solutions.

5.3 Mathematical Background

5.3.1 Language Model

The tagging problem can be defined for the trigram model as follows: given a sequence [t1, t2] (the POS tags of the two preceding words), find the probability of the sequence being followed by word w3 which has POS t3, i.e.

P(t3, w3 | t1, t2) = P(w3 | t3) × P(t3 | t1, t2)    (5.1)

The formula can be derived by making the assumption that w3 depends only on t3. Similarly, the bigram and unigram models can be defined:

P(t3, w3 | t2) = P(w3 | t3) × P(t3 | t2)    (5.2)
P(t3, w3) = P(w3 | t3) × P(t3)    (5.3)

A solution can be found by linear interpolation of 5.1, 5.2 and 5.3:

argmax over tag sequences of the product over 1 ≤ i ≤ n of P(wi | ti) (λ1 × P(ti | ti−2, ti−1) + λ2 × P(ti | ti−1) + λ3 × P(ti))    (5.4)

in which λ1, λ2, λ3 are the weights of the classifiers, λ1 + λ2 + λ3 = 1, and t−1 = t−2 = BOS (Beginning of Sentence). It should be pointed out that the returned solution is the most probable path, i.e. the highest product of probabilities; this is not the same as the product of the highest probabilities. One method for finding this path is known as the Viterbi algorithm [10]. Logarithms are used when the numbers get small.

In this project, trigram, bigram and unigram classifiers were linearly interpolated into a tagger called lihmmtagger. In case this tagger fails to tag a word, a prefix tagger, a regular expression tagger and a default tagger will be called, in that order, to prevent error propagation caused by failing to tag a word. The source code can be found in module tagger.py.

5.3.2 Estimating Lambda Value

For each trigram (t1, t2, t3) in the training data, compare the following values:

(C(t1, t2, t3) − 1) / (C(t1, t2) − 1),   (C(t2, t3) − 1) / (C(t2) − 1),   (C(t3) − 1) / (N − 1)

Depending on which of them is the maximum, increase the corresponding lambda by a certain amount. The reason for the minus 1 is that the trigram currently in use is treated as an observed event, so the actual counts must be reduced by 1. For this reason trigrams which have been seen only once are skipped.
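The procedure just described can be sketched as follows. This is a minimal illustration in the style of deleted interpolation, assuming plain count dictionaries built from a tagged corpus; the increment used here (the raw trigram count) is an assumption for simplicity, whereas the project increases the winning lambda by fixed amounts, and the function is not the project's actual implementation.

from collections import Counter

def estimate_lambdas(tagged_sents):
    """Sketch of deleted-interpolation style estimation of lambda1, lambda2,
    lambda3 from sentences tagged with parts of speech."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for sent in tagged_sents:
        tags = ['BOS', 'BOS'] + [t for (_, t) in sent]
        uni.update(tags)
        bi.update(zip(tags, tags[1:]))
        tri.update(zip(tags, tags[1:], tags[2:]))
    n = sum(uni.values())
    lam = [0.0, 0.0, 0.0]   # trigram, bigram, unigram weights
    for (t1, t2, t3), c in tri.items():
        if c < 2:           # trigrams seen only once are skipped, as in the text
            continue
        # leave the current trigram out of the counts, hence the "- 1"
        cands = [(c - 1.0) / (bi[(t1, t2)] - 1) if bi[(t1, t2)] > 1 else 0.0,
                 (bi[(t2, t3)] - 1.0) / (uni[t2] - 1) if uni[t2] > 1 else 0.0,
                 (uni[t3] - 1.0) / (n - 1)]
        lam[max(range(3), key=lambda i: cands[i])] += c
    total = sum(lam) or 1.0
    return [x / total for x in lam]

# e.g. lambdas = estimate_lambdas(nltk.corpus.brown.tagged_sents()[:2000])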

which automatically assigns the most common tags which is ‘NN’ (147169 counts in 1071233.4 Operation Explanation The tagger flow chart is shown in figure 5. bigram and unigram tagger. • affixtagger: a tagger bases on prefix and suffix of words. • accuracy: take a tagged corpus as test set and return the percentage of correctly tagged words. • lihmmtagger: a linear interpolation of trigram. • setimport: import previous training information. that is 14%) • retagger: a tagger bases on regular expression. • tagit: tag a word.1: POS tagger flow chart There are 4 tagger classes: • A default tagger. 22 . • tagthem: tag a set of words.1 Figure 5. Each tagger is a separate class and has some common methods: • train: takes a tagged corpus as trainning data and export training information for later use since training might take a long time.5.

• test: An improved accuracy test.56 27.5 showed accuracy test for the linear interpolation HMM tagger.6 5. but this will affect the accuracy as well. In order to improve speed of processing but limiting the affect on accuracy at the same time.2: Accuracies Variance 132 24 Best 78. As for the HMM tagger.38 6. The drawback is it requires much more computational effort. the outcomes might not be very useful for other tasks such as information extraction due to misleading tags.72 67. A search beam (1000 to 2000) must be applied to limit the search space.5 Results Below are some return scores in accuracy test. the scores were close to each other.54 Mean w/o testset segmenting 64. However.04 78. It was surprising to achieve such decent accuracy just by knowing the ending or starting character.15 85.6 Evaluation and Discussion The suffix tagger had slightly better accuracy and deviation than prefix tagger.03 Worst 45. Take a tagged coprus as test set.68 Standard deviation 2.25 67. Even though the mean accuracy for 2 method of testing is approximate to each other. Testset size 120 1010 Mean 63. This is a much lower score than the other HMM tagger. 5. and since the tagger searches for the most probable path. the search space can easily reach several hundred thousands in just 10 or 15 words. Therefore searching for the most probable path is more desirable.45 Table 5.1: Accuracies Table 5. simply going from left to right and choosing the most likely tags yield a decent accuracy and speed. Interpolation of them fell somewhere in between. segmenting the test set gave more insights to the accuracy scores than the latter. which are the most common sentence length in Brown corpus.84 4.14 28. Return average accuracy and the standard deviations. So suffix tagger was chosen. This implied 23 . Tagger Regular expression Suffix tagger Prefix tager HMM tagger (most likely tags only) Mean (percent) 7.42 Table 5. Hence it can be assumed that the tagger will work 67% of the time. There are 170 possible tags. which shouldn’t be the case. Despite the large difference in testing scale.45 58.23 6. complex sentences which consist of more than one clauses will be broken and tagged individually since words across clauses have little syntax relation. divide it into smaller sets and perform accuracy test on each set.

The performance improvements however was unknown. The experiment logs are provided in the database. but can also degrade the performance dramatically. even though the performance score was not as high as expected. 24 . ’‘‘’). ’. (’with’.that the language model might go wrong at some points such as smoothing process or estimating lambda values. ’NN’). Furthermore. ’BEZ’). So in conclusion. On the other hand. segmenting the sentence into smaller clauses proved to have better accuracy. This might due to punctuations often have very high frequency of appearance. Analyzing experiment logs suggested that punctuations can improve. The speed is much slower than comparing to the other HMM tagger. The tagger assigned ‘what’ with a ‘‘‘’ tag instead of ‘WDT’. (’?’. In some occasions during debugging process. (’is’. ’DT’). More specifically. considering the project’s context where the inputs are user search queries. ’NN’). (’jazz’. since punctuations are more likely to be tagged correctly. ("’’". the POS tag ‘‘‘’ has a count of 6160 as opposed to a count of 4834 for ‘WDT’. ’WDT’). let’s consider a sample from the log: (’‘‘’. whether their benefits outgrow their disadvantage is unresolved at the moment. By segmenting the sentences into clauses. (’vow’. the speed was practically double. punctuations will not be a big issue. ’IN’). Even though it is possible to modify the program to carry out experiments for the sake of verifying the above problem. they also have the effect of limiting the error propagation. Therefore should the punctuations be considered during the training or tagging process. "’’"). (’What’. . .’). For instance. | . (’?’. (’this’. especially for lengthy sentences.’)] NoCls: 2 1 1 3 3 6 12 12 12 72 432 432 432 | ‘‘ ‘‘ | ‘‘ WDT | WDT BEZ | BEZ IN | IN DT | DT NN | NN NN | NN ’’ | . ’. the program are good enough for practical usage. when applied to the semantics module (discussed in the next chapter) the returned results were promising.

some background theory on FOPC and formal logic will be cover in 6.3 will explore how these theory can be apply to a specific case.Chapter 6 Meaning Representation 6. their meaning and the kind of real world knowledge that is needed to perform the involved tasks. It makes sense since most of the grammar rules are about how different words can be combined to form a sentence. Semantic Networks and Frames. This class of verbs is known as transitive verb. This implies that a different system other than grammar is necessary for meaning representation. 25 . In this chapter. for example is some verbs must not stand alone on itself. What is needed is a representation that can bridge the gap between linguistic inputs. consider some of everyday language tasks that require some form of semantic processing: • Answering an question. • Following a recipe. The rest of this chapter will then explain how the program achieved the desired result describe in 6.3. • Realizing that you’ve been insulted. a few examples have illustrated that in many cases syntax analysis is useful but not necessary for meaning comprehension. Three notable schemes are First Order Predicate Calculus (FOPC). a fair number of representational schemes have been invented to capture the meaning of the natural language inputs for use in processing systems. Over the years. It is clear that none of morphological or syntactical representation thus far will get us very far on these tasks.2.1 Introduction In the previous chapter. The exception. and in someway represents the sense of the verbs. 6. like “give”. However.

The semantics of FOPC Capturing meaning of a sentence involves identifying the terms and predicates corresponding to various grammar elements of the sentence.2 6. Unambiguous Representations Ambiguities exist in all aspects of all languages. The conclusions might be not explicitly represented in the knowledge base. Some means of determining that certain interpretations are more or less preferable than others is needed. 26 . Some basic building elements of FOPC are: Constants Refer to specific object in the world.6. FOPC contants refer to exactly one object. Inference The system’s ability to derive valid conclusions based on meaning representations of inputs and its knowledge base. Like in programming language. however have more than one contants refer to them. Verifiability The system’s ability to compare an affair described by a representation to affairs modeled in a knowledge base.2. Objects can.1 Background Theory First Order Predicate Calculus Desiderata For Representation Scheme There are basics requirements that a meaning representation must fulfill. to be useful a representation scheme must be expressive enough to cover a wide range of subject matters such as time and tense. Variables Allow the system to match unknown entity to a known object in knowledge base so that the entire propositions is matched. Expressiveness Finally. Canonical Form The phenomenon of distinct inputs that should be assigned the same meaning representations. but are logically derivable from the available propositions.
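These building blocks can be tried out directly with NLTK's semantics package. The short session below is illustrative only: the parser syntax shown (Expression.fromstring) is that of recent NLTK releases, and the predicate and constant names are made up for the example.

from nltk.sem import Expression

read = Expression.fromstring

claim = read('LocationOf(NTU)')                 # a predicate applied to a constant
rule = read('all x.(witch(x) -> burn(x))')      # a universally quantified formula

# Lambda notation and beta-reduction: applying \x.burn(x) to a term.
applied = read(r'\x.burn(x)(witch1)')
print(applied.simplify())                       # -> burn(witch1)
print(rule.free())                              # closed formula: no free variables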

2. i. constants.2 Formal Logic The word ‘formal’ means by form. • Contradiction: is a simultaneous acceptance and rejection of some remarks. Usage of quantifiers make these two uses possible. and ‘all’ quantifier that refers to all objects in a class. Examples: LocationOf(NTU) Variables Give the system the ability to draw inferences or make assertions about objects without having to refer to any particular ones.Functions Functions in FOPC can be expressed as attributes of objects. and variables and not a formular. Formular An equivalent so sentence in grammar representation. starts by a premise and ends by a conclusions. Contradiction isn’t allowed in formal logic (also known as consistency principle) • Monotonicity principle: A proof cannot be invalidated by adding premises. FOPC functions have the same syntax as a single argument predicate. 27 . they are in fact terms since they actually refer to unique objects. Lambda Notation Enable formal functionality for replacing of a specifics variable by a specifics term. Formulars therefore can be assigned with True or False values depending on whether the information they encoded are in accord with the knowledge base or not. functions. properties or relations between terms. Quantifiers Variable can be used to making statements about either a particular unknown object. • Claims: Are declarative remarks about states of the world. however. rather than by shape or meaning. FOPC formular is a representation of objects. • Proof and disproof: A formal proof is a logical argument which convinces by following formal rules.e. The two basic operator in FOPC are ‘exist’ quantifier that denotes one particular unknown object. Examples: lambda x P(x)(A) --> P(A) 6. since proof obeys rules. or all the objects in some arbitrary classes of objects. not on meaning or facts. Note that the arguments of formulars must be terms. Some necessary terms to read and understand formal logic are: • Argument: Is a line of reasoning.

It floats.So. or both. There are four connectives: – And: If we accept ‘A and B’ then we are forced to accept both A and B simultaneously. therefore. A witch! . she’s made of wood.Q5: What also floats in water? . logically.A1: Burn them! .. therefore B 6. For instance here is an example of ‘If-then’ elimination rule: (If A then B) also A.. only B.3 Semantic Analysis A Study Case Semantic analysis is the process whereby meaning representation of linguistic inputs is created.. The names suggest that the connectives are introduced or eliminated in the final proof.. If she.And.• Formal rules: is an intermediate step in an logical arguments.More witches! . Each connective has two rules associated with them: introduction and elimination.. .A5: A duck! . weighs the same as a duck. .Q2: What do you burn apart from witches? .. why do witches burn? . – If-then: If we accept ‘if A then B’ and A. we are forced to accept B. Remove the supports! A representation scheme based on FOPC and logic proof for the above excerpt will look something like this: lambda_x (witches(x) --> burn(x)) (1) premise lambda_x (burn(x)-->wood(x)) (2) premise 28 .A4: No. Consider the following excerpt from the movie “Monty Python and the Holy Grail” . .Q3: So.We shall use my largest scales. – Not: If we accept ‘not A’ then accept A lead to a contradiction. .. – Or: If we accept ‘A or B’ then we can either accept only A.. Rules show how can larger proof be made out of one or more smaller proofs.. • Connectives: the logic connectives are used to build larger claims out of smaller claim.A3: ’Cause they’re made of wood? ..Q1: Tell me: What do you do with witches? .A witch! .Q4: How do we tell if she’s made of wood? Does wood sink in water? .A2: Wood.

.lambda_x (witches(x)-->wood(x)) (3) implication-introduction (1) (2) lambda_x (wood(x)-->float(x)) (4) premise float(duck) (5) premise lambda_x lambda_y ((weigh(x)=weigh(y)) and float(y) -->float(x)) (6) premise "We shall use my largest scales. Remove the supports!’ weigh(duck)=weigh(girl) (7) given lambda_x ((weigh(x)=weigh(duck)) and float(duck) -->float(x)) (8) lambda-reduction (6) (weigh(girl)=weigh(duck)) and float(duck) -->float(girl) (9) lambda-reduction (8) (weigh(girl)=weigh(duck)) and float(duck) (10) And-introduction (7) (5) float(girl) (11) Implication-elimination (10) (9) wood(girl) (12) Implication-elimination (11) (4) Let’s assume: lamda_x (not_witches(x)-->not_wood(x)) (13) given lamda_x (wood(x)-->witches(x)) (14) backward chaining (13) (3) wood(girl)-->witches(girl) (15) lambda-reduction (14) witches(girl) (16) Implication-elimination (16) (12) "A witch!. this is true since there is one invalid proof at step (12). However. .A witch!" The claim at the end is nonsensical according to our intuition. it is still quite possible for the claim to be true if we rewrite the proof as below: lambda_x (wood(x)-->float(x)) (4) premise 29 . Using logic as shown above..

the implication must satisfy the following two rules: 1. In the above example.1) premise lambda_x (human(x) and float(x) --> wood(x)) (4. Considering an example: Q: Who is Albert? A: He is a genius. while predicate and argument extraction are handled by sentence class.2) backward chaining (4. The left hand side must be meaningful. or the implication will be pretty much useless.3) lambda-reduction human(girl) (4.lambda_x (human(x) and not_wood(x) --> not_float(x)) (4.3) A new premise (4. and they refers to ‘genius’ and ‘Albert’ respectively.1) (4) human(girl) and float(girl) --> wood(girl) (4. preposition phrase or verb phrase can be modeled using the class phrase Some key attributes of the class phrase are: 30 . Amazingly enough.4. Logically. a sound implication would be Albert --> genius or Albert --> he.4 6. the final claim which is supposed to be humorous and nonsensical is true logically. referents and their syntactical roles in the sentences. 2.1 Design Operation explanation Implication Derivation The program derives the implication by figuring out what are the symbols. and the right hand side of the implication must be either the argument or predicate of the answer. The above two rules and referent identification process are embedded in deduct class. ‘who’ and ‘he’ are symbols.4) (4.4) given wood(girl) (12) Implication-elimination (11) (4. which is reasonable is introduced. and implication such as who --> he is not very informative. 6. Following the above two rules. The left hand side of the implication must be either the argument or predicate of the question.2). phrase Class As the name suggested. phrases such as noun phrase.

Discourse of the sentence. It provides convenient access to many components of a sentence such as POS tags. • modifiers .String of POS tags. Currently declarative.Jungle talk version of the original sentence. • wttups . • negate . • SubPre . It is worh noting that the process relies solely on syntactical information provided by the Sentence class and pays no regard to the actual meaning of the tokens. . Some key attributes of the class deduct are: • lhs . Some key attributes of the class Sentence are: • type . yes/no question and some of Wh-questions are supported.Symbols and referents. • jungle . subject. deduct Class To derive implications from a pair of question and answer. • rhs .FOPC representation. • root . Sentence Class This class is used to represent an instance of sentence.beginning and ending indexes of root of the phrase.Sentence type. Currently noun phrase. • wtstr . • refpairs . • type .Tuples of POS tags. hence the name syntax driven semantics analysis.modifiers of the phrase.Lists of subjects and predicates. Each phrase has an associated function to search for that particular phrase in a sentence. predicate.beginning and ending indexes of the phrase. Unimplemented at the moment.Whether the sentence is negative or not • referent .• string .Instance of the sentence on the left hand side (the question). . preposition phrase. • jdiscourse . verb phrase and adjective phrase are supported • span .the actual string of the phrase. • presentation .Symbols and referents pairs 31 . • FOPC . sentence type. including postmodifiers and premodifiers.Type of phrases.List of possible implications.Instance of the sentence on the right hand side (the answer).
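To make the two rules and the role of the deduct class concrete, here is a deliberately simplified, self-contained sketch of how question-side and answer-side terms could be paired into candidate implications. It mirrors the behaviour described above but is not the project's source, and the input format (plain subject and predicate strings) is an assumption.

def candidate_implications(q_parts, a_parts,
                           symbols=('who', 'what', 'which', 'he', 'she', 'it', 'they')):
    """Pair question terms with answer terms into lhs --> rhs candidates:
    the lhs comes from the question, the rhs from the answer, and bare
    symbols (wh-words, pronouns) are dropped from the lhs because
    implications such as who --> he are uninformative."""
    implications = []
    for lhs in q_parts:
        if lhs.lower() in symbols:              # rule 1: lhs must be meaningful
            continue
        for rhs in a_parts:                     # rule 2: rhs taken from the answer
            implications.append('%s-->%s' % (lhs, rhs))
    return implications

# 'Who is Albert?' / 'He is a genius.'
print(candidate_implications(['who', 'Albert'], ['genius']))   # -> ['Albert-->genius']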

FOPC formulas with quantifiers.FOPC formulas without quantifiers. • Lambda .3. • preamble . • presentations . • connective .Formula’s connective. constant. • var .5 6. the class atom can be used to generate a proper representation.2 Design Presentation At the moment.1 Results Semantics Analysis Below are the output implications for sample pairs of questions and answers.List of possible representations.a1). 6.4. if we are able to provide these information. atom Class On top of generating basic atomic representations.5.Formula’s predicate. • amble . or function).Variable symbol. • quantifier . and some statements are modified into questions for compatibility or manually tagged if they were tagged wrongly.’ >>> deduct(q1.6. lambda reduction and combining smaller expressions are also supported. the module will need information such as what should the quantifier be.Trigger lambda reduction. Some of them are from the study case in section 6. More specifically.presentation [’witches/NNS-->burn/VB’] 32 . what should the term be (variable. However.Formula’s quantifier. • term . the representation module is not integrated into the semantics analysis modules discussed above since the extracted information is not detail enough for the module to generate an accurate representation. q1="What do you do with witches?" a1=’Burn them.Formula’s term • predicate .

q2='What do you burn apart from witches?'
a2='Wood.'
>>> deduct(q2,a2).presentation
['burn/VB-->wood/NN']

q3='Why do witches burn?'
a3="Because they are made of wood"
>>> deduct(q3,a3).presentation
['witches/NNS-->made/VBN of/IN wood/NN']

q4='does/DOZ wood/NN sink/VB in/IN water/NN'
a4='No, it floats.'
>>> deduct(q4,a4,1).presentation
['wood/NN-->floats/VBZ', 'water/NN-->floats/VBZ']

q5="What also floats in water?"
a5='A duck'
>>> deduct(q5,a5).presentation
['floats/VBZ in/IN water/NN-->duck/NN']

q6='why/WRB does/DOZ she/PPS weight/VB the/AT same/AP as/CS a/AT duck/NN'
a6='Because she is made of wood'
>>> deduct(q6,a6,1).presentation
['weight/VB same/AP as/CS a/AT duck/NN-->wood/NN', 'duck/NN-->made/VBN of/IN wood/NN']

q8='who is Albert'
a8='He is a genius'
>>> deduct(q8,a8).presentation
['albert/NP-->genius/NN', 'albert/NP-->he/PPS']

q9='which country is north of America'
a9='Canada is north of America'
>>> deduct(q9,a9).presentation
['north/NR of/IN america/NP-->canada/NP']

q10='who is the most stupid guy on earth'
a10='Dummies'

>>> deduct(q10,a10).presentation
['the/AT most/QL stupid/JJ guy/NN on/IN earth/NN-->dummies/NNS']

6.5.2 Meaning Representation

Below are a few demonstrations including atomic terms, lambda reduction and combining smaller atoms into a bigger representation.

>>> t1=atom('witches',None,'var')
>>> t1.presentation
'lambda_x_witches Isa(x_witches,witches) connective '
>>> t2=atom('burn',None,'var')
>>> t2.presentation
'lambda_x_burn Isa(x_burn,burn) connective '
>>> t3=atom(t2,t1)
>>> t3.presentation
'lambda_x_burn Isa(x_burn,burn) lambda_x_witches Isa(x_witches,witches) connective Role_of_x_witches(x_burn,x_witches)'
>>> t4=atom(t3,Lambda=[[t1,'girl']])
>>> t4.presentation
'lambda_x_burn Isa(x_burn,burn) Isa(girl,witches) connective Role_of_girl(x_burn,girl)'

6.6 Evaluations

Further experiments on the deduction program showed that even though the program analyses the sentence in a rather simple minded way, it worked unexpectedly well for simple sentences, and for some slightly more complex sentences.

>>> Q='Why is he kicking that poor dog?'
>>> A='Because it bites him.'
>>> deduct(Q,A).presentation
['dog/NN-->bites/VBZ him/PP', 'dog/NN-->bites/VBZ him/PP']

Q='Why does someone die?'
A='Because he is old'
>>> deduct(Q,A).presentation
['someone/PN-->old/JJ']
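The atom demonstrations in section 6.5.2 above can be imitated by a much smaller toy class; the sketch below (the class MiniAtom and the helpers combine and reduce_lambda are invented names, not the project's atom API) only reproduces the string format of the presentations to show the idea of combination and lambda reduction.

# Toy stand-in for the atom demonstrations above (illustrative only).

class MiniAtom:
    def __init__(self, pred):
        var = 'x_' + pred
        self.binders = [var]           # variables still bound by a lambda
        self.isas = [(var, pred)]      # Isa(var, pred) facts
        self.role = ''                 # filled in when two atoms are combined
        self.head_var = var

    @property
    def presentation(self):
        parts = []
        for var, pred in self.isas:
            prefix = 'lambda_%s ' % var if var in self.binders else ''
            parts.append('%sIsa(%s,%s)' % (prefix, var, pred))
        return ' '.join(parts) + ' connective ' + self.role

def combine(head, arg):
    # Roughly what atom(t2, t1) does: arg fills a role of head.
    out = MiniAtom(head.isas[0][1])
    out.binders = head.binders + arg.binders
    out.isas = head.isas + arg.isas
    out.head_var = head.head_var
    out.role = 'Role_of_%s(%s,%s)' % (arg.head_var, head.head_var, arg.head_var)
    return out

def reduce_lambda(a, var, value):
    # Roughly what the Lambda=[[t1,'girl']] argument does: substitute a constant.
    a.binders = [b for b in a.binders if b != var]
    a.isas = [(value if v == var else v, p) for v, p in a.isas]
    a.role = a.role.replace(var, value)
    return a

t1, t2 = MiniAtom('witches'), MiniAtom('burn')
t3 = combine(t2, t1)
t4 = reduce_lambda(t3, 'x_witches', 'girl')
print(t4.presentation)
# lambda_x_burn Isa(x_burn,burn) Isa(girl,witches) connective Role_of_girl(x_burn,girl)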

Some current limitations are:

• Not all types of questions, as well as sentences with clauses and commas, are supported. Currently only yes/no questions, 'who', 'whom', 'which', 'what', 'why' and simple sentences are available.
• Unable to extract information such as tense, quantity or entity relationships, which are necessary for the FOPC representation scheme.
• The analysis bases only on syntactical roles and does not consider the actual meaning and the relative locations of the tokens.

Despite the limitations, the program does work in simple contexts. Hence it can either be used as a layer in a multi level process in which each layer solves a particular problem, or be improved to be able to deal with complex sentences and extract more useful information.

As for the representation module, it worked as expected for simple formulas. As the formulas get more complex, connectives proved to be an ambiguity issue. Let's consider an example:

- Every restaurant has a menu

The meaning representation of the sentence might take either one of the below two forms:

- all x Restaurant(x) then exist e, y Having(e) and Haver(e, x) and Isa(y, Menu) and Had(e, y)
- exist y Isa(y, Menu) and all x Isa(x, Restaurant) then exist e Having(e) and Haver(e, x) and Had(e, y)

In the worst case, a sentence with n quantifiers will have n! representations.
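A small sketch of why the number of readings grows factorially is given below; the helper name scopings and the way the quantifier prefix is glued onto the matrix are simplifying assumptions, not the project's representation code.

# Every ordering of the quantifier prefix is a different candidate reading,
# so n quantifiers give up to n! scopings (illustrative sketch only).

from itertools import permutations

def scopings(quantifiers, matrix):
    return [' '.join(order + (matrix,)) for order in permutations(quantifiers)]

quants = ('all x Isa(x,Restaurant) then', 'exist y Isa(y,Menu) and')
matrix = 'exist e Having(e) and Haver(e,x) and Had(e,y)'
for reading in scopings(quants, matrix):
    print(reading)
# 2 quantifiers -> 2 (= 2!) readings; 3 would give 6, and n gives n!.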

Chapter 7

Conclusion

This has been a very enjoyable project for me. I have had fun learning Python and studying some very interesting aspects of the language that I used to take for granted. Though I felt that I could have done much more if I had received more formal training in linguistics, I was contented with the achievement. Though I failed to achieve the initial goals, I managed to retrieve other precious things in return: I became less arrogant, realized how magnificent this world and its people are, and collected memorable moments such as the excitement of getting the program running for the first time, or having my heart sink to my feet when hearing about IBM Watson's triumph. My deepest gratitude is owed to my supervisor, professor Chan Chee Keong; thank you. I really am glad to have had the opportunity to work on this project.


