
Natural Language Processing (CSE4022)

By
N. Ilakiyaselvan
Computational Challenges in Other Languages
Spelling Correction
Non-word spelling error example
acress
Candidate generation:
• Words with similar spelling
– Small edit distance to error
• Words with similar pronunciation
– Small edit distance of pronunciation to error

Damerau-Levenshtein edit distance
• Minimal edit distance between two strings,
where edits are:
– Insertion
– Deletion
– Substitution
– Transposition of two adjacent letters
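For illustration, here is a minimal Python sketch of this distance (the restricted "optimal string alignment" variant); the name dl_distance is our own, not from any particular library:

def dl_distance(s, t):
    # Dynamic programming table: d[i][j] = distance between
    # the first i letters of s and the first j letters of t.
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # Transposition of two adjacent letters
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]

print(dl_distance("acress", "caress"))  # 1 (one transposition)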

Words within 1 of acress

Error    Candidate Correction   Correct Letter   Error Letter   Type
acress   actress                t                -              deletion
acress   cress                  -                a              insertion
acress   caress                 ca               ac             transposition
acress   access                 c                r              substitution
Candidate generation
• 80% of errors are within edit distance 1
• Almost all errors within edit distance 2

• Also allow insertion of space or hyphen


– thisidea → this idea
– inlaw → in-law
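A sketch of distance-1 candidate generation in Python, in the style of Norvig's well-known spelling corrector (edits1 is our own name; space and hyphen are included among the insertable characters, per the bullet above):

def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    # Every string one edit away from `word`.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    substitutes = [L + c + R[1:] for L, R in splits if R for c in alphabet]
    inserts = [L + c + R for L, R in splits for c in alphabet + " -"]
    return set(deletes + transposes + substitutes + inserts)

# Real candidates are the generated strings found in the dictionary:
#   candidates = edits1("acress") & vocabulary
# Edit distance 2 is edits1 applied again to each result of edits1.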

Unigram Prior probability
Counts from the 404,253,213 words in the Corpus of Contemporary American English (COCA)

word      Frequency of word   P(word)
actress   9,321               .0000230573
cress     220                 .0000005442
caress    686                 .0000016969
access    37,038              .0000916207
across    120,844             .0002989314
acres     12,874              .0000318463
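A few lines of Python reproduce the table and pick the most probable candidate under the prior alone (a full noisy-channel corrector would multiply this prior by a channel model P(error | word)):

counts = {"actress": 9321, "cress": 220, "caress": 686,
          "access": 37038, "across": 120844, "acres": 12874}
N = 404_253_213  # total tokens in COCA

def prior(word):
    # Unigram prior: P(word) = count(word) / N
    return counts.get(word, 0) / N

print(max(counts, key=prior))  # across (highest prior by far)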
Issues in spelling
• If very confident in correction
– Autocorrect
• Less confident
– Give the best correction
• Even less confident
– Give a correction list
• Unconfident
– Just flag as an error

Information Retrieval
• Information Retrieval (IR) is finding material
(usually documents) of an unstructured nature
(usually text) that satisfies an information need
from within large collections (usually stored on
computers).

– These days we frequently think first of web search, but there are many other cases:
• E-mail search
• Searching your laptop
• Corporate knowledge bases
• Legal information retrieval


Basic assumptions of Information Retrieval

• Collection: A set of documents


– Assume it is a static collection for the moment

• Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task


How good are the retrieved docs?


• Precision: fraction of retrieved docs that are relevant to the user’s information need
• Recall: fraction of relevant docs in the collection that are retrieved
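These two definitions translate directly into code; a minimal sketch over sets of document ids:

def precision_recall(retrieved, relevant):
    # retrieved, relevant: sets of document ids
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 docs retrieved, 2 of them relevant, 3 relevant docs in total:
print(precision_recall({1, 2, 3, 4}, {2, 4, 5}))  # (0.5, 0.666...)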

Question Answering
One of the oldest NLP tasks (punched card systems in 1961)
Simmons, Klein, McConlogue. 1964. Indexing and
Dependency Logic for Answering English Questions.
American Documentation 15:30, 196-204

Apple’s Siri
[Screenshot: Apple’s Siri]

Wolfram Alpha
[Screenshot: Wolfram Alpha]
Types of Questions in Modern Systems
• Factoid questions
– Who wrote “The Universal Declaration of Human
Rights”?
– How many calories are there in two slices of apple
pie?
– What is the average age of the onset of autism?
– Where is Apple Computer based?
• Complex (narrative) questions:
– In children with an acute febrile illness, what is
the efficacy of acetaminophen in reducing
fever?
– What do scholars think about Jefferson’s position on dealing with pirates?
Commercial systems: mainly factoid questions

Question                                                Answer
Where is the Louvre Museum located?                     In Paris, France
What’s the abbreviation for limited partnership?        L.P.
What currency is used in China?                         The yuan
What kind of nuts are used in marzipan?                 almonds
What instrument does Max Roach play?                    drums
What is the telephone number for Stanford University?   650-723-2300
Paradigms for QA
• IR-based approaches
– TREC; IBM Watson; Google
• Knowledge-based and Hybrid approaches
– IBM Watson; Apple Siri; Wolfram Alpha; True Knowledge (Evi)

IR-based Factoid QA
• QUESTION PROCESSING
– Detect question type, answer type, focus, relations
– Formulate queries to send to a search engine
• PASSAGE RETRIEVAL
– Retrieve ranked documents
– Break into suitable passages and rerank
• ANSWER PROCESSING
– Extract candidate answers
– Rank candidates
• using evidence from the text and external sources
IR-based Factoid QA

[Pipeline diagram: the question goes through Question Processing (query formulation plus answer type detection); queries run against an indexed document collection (Document Retrieval); retrieved documents are broken into and reranked as relevant passages (Passage Retrieval); Answer Processing then extracts and ranks candidate answers.]
Knowledge-based approaches (Siri)

• Build a semantic representation of the query


– Times, dates, locations, entities, numeric quantities
• Map from this semantics to query structured
data or resources
– Geospatial databases
– Ontologies (Wikipedia infoboxes, DBpedia, WordNet, Yago)
– Restaurant review sources and reservation services
– Scientific databases
Hybrid approaches (IBM Watson)
• Build a shallow semantic representation of the
query
• Generate answer candidates using IR methods
– Augmented with ontologies and semi-structured data
• Score each candidate using richer knowledge
sources
– Geospatial databases
– Temporal reasoning
– Taxonomical classification
Question Processing
Things to extract from the question
• Answer Type Detection
– Decide the named entity type (person, place) of the
answer
• Query Formulation
– Choose query keywords for the IR system
• Question Type classification
– Is this a definition question, a math question, a list
question?
• Focus Detection
– Find the question words that are replaced by the answer
• Relation Extraction
– Find relations between entities in the question
Answer Type Detection: Named Entities
• Who founded Virgin Airlines?
– PERSON
• What Canadian city has the largest
population?
– CITY.
Answer Type Taxonomy
Xin Li, Dan Roth. 2002. Learning Question Classifiers. COLING'02

• 6 coarse classes
– ABBREVIATION, ENTITY, DESCRIPTION, HUMAN, LOCATION,
NUMERIC
• 50 finer classes
– LOCATION: city, country, mountain…
– HUMAN: group, individual, title, description
– ENTITY: animal, body, color, currency…

Answer Types
[Figure: example answer types from the taxonomy]

More Answer Types
[Figure: further example answer types]
Text Normalization
• Every NLP task needs to do text
normalization:
1. Segmenting/tokenizing words in running text
2. Normalizing word formats
3. Segmenting sentences in running text
Text Normalization
How many words?
• I do uh main- mainly business data processing
– Fragments, filled pauses
• Seuss’s cat in the hat is different from other cats!
– Lemma: same stem, part of speech, rough word sense
• cat and cats = same lemma
– Wordform: the full inflected surface form
• cat and cats = different wordforms
Issues in Tokenization
• Finland’s capital → Finland Finlands Finland’s ?
• what’re, I’m, isn’t → what are, I am, is not
• Hewlett-Packard → Hewlett Packard ?
• state-of-the-art → state of the art ?
• Lowercase → lower-case lowercase lower case ?
• San Francisco → one token or two?
• m.p.h., PhD. → ??
Word Tokenization in Chinese
• Also called Word Segmentation
• Chinese words are composed of characters
– Characters are generally 1 syllable and 1
morpheme.
– Average word is 2.4 characters long.
• Standard baseline segmentation algorithm:
– Maximum Matching (also called Greedy)
Maximum Matching
Word Segmentation Algorithm
• Given a wordlist of Chinese, and a string.
1) Start a pointer at the beginning of the string
2) Find the longest word in dictionary that
matches the string starting at pointer
3) Move the pointer over the word in string
4) Go to 2
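A minimal Python sketch of this loop, assuming the wordlist is a set of strings and falling back to a single character when nothing matches:

def max_match(text, wordlist):
    words, pointer = [], 0
    while pointer < len(text):
        # Try the longest possible word first (greedy).
        for end in range(len(text), pointer, -1):
            if text[pointer:end] in wordlist or end == pointer + 1:
                words.append(text[pointer:end])
                pointer = end
                break
    return words

With a toy English wordlist this reproduces both examples on the next slide:

wordlist = {"the", "cat", "in", "hat", "table", "down", "there",
            "theta", "bled", "own"}
print(max_match("thecatinthehat", wordlist))     # ['the', 'cat', 'in', 'the', 'hat']
print(max_match("thetabledownthere", wordlist))  # ['theta', 'bled', 'own', 'there']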
Max-match segmentation

• Thecatinthehat → the cat in the hat

• Thetabledownthere → the table down there (intended)
  – but max-match yields: theta bled own there

• Doesn’t generally work in English!
• But works astonishingly well in Chinese
  – 莎拉波娃现在居住在美国东南部的佛罗里达。
  – 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
  – (Sharapova now lives in Florida, in the southeastern United States.)
• Modern probabilistic segmentation algorithms do even better
Normalization
• Need to “normalize” terms
– Information Retrieval: indexed text & query terms
must have same form.
• We want to match U.S.A. and USA

• We implicitly define equivalence classes of terms


– e.g., deleting periods in a term
• Alternative: asymmetric expansion:
– Enter: window Search: window, windows
– Enter: windows Search: Windows, windows, window
– Enter: Windows Search: Windows

• Potentially more powerful, but less efficient
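As a sketch, the symmetric equivalence-class approach amounts to mapping every term to a canonical form at both index and query time; a toy version of the period-deletion rule above, plus case-folding:

def normalize(term):
    # Map each term to its equivalence-class representative:
    # delete periods and case-fold.
    return term.replace(".", "").lower()

print(normalize("U.S.A.") == normalize("USA"))  # True: both index as "usa"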


Case folding
• Applications like IR: reduce all letters to lower case
– Since users tend to use lower case
– Possible exception: upper case in mid-sentence?
• e.g., General Motors
• Fed vs. fed
• SAIL vs. sail
• For sentiment analysis, MT, Information extraction
– Case is helpful (US versus us is important)
Lemmatization

• Reduce inflections or variant forms to base form
  – am, are, is → be
  – car, cars, car’s, cars’ → car
• the boy’s cars are different colors → the boy car be different color
• Lemmatization: have to find the correct dictionary headword form
• Machine translation
  – Spanish quiero (‘I want’) and quieres (‘you want’) have the same lemma as querer ‘want’
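In practice one rarely writes a lemmatizer from scratch. A quick sketch using NLTK's WordNet-based lemmatizer (assumes nltk is installed and its wordnet data downloaded; it covers English only, so the Spanish example would need a different tool):

from nltk.stem import WordNetLemmatizer  # requires nltk.download("wordnet")

wnl = WordNetLemmatizer()
print(wnl.lemmatize("cars"))           # car
print(wnl.lemmatize("are", pos="v"))   # be
print(wnl.lemmatize("colors"))         # color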
Morphology
• Morphemes:
– The small meaningful units that make up words
– Stems: The core meaning-bearing units
– Affixes: Bits and pieces that adhere to stems
• Often with grammatical functions
Stemming
• Reduce terms to their stems in information retrieval
• Stemming is crude chopping of affixes
– language dependent
– e.g., automate(s), automatic, automation all
reduced to automat.

For example, the text
  "for example compressed and compression are both accepted as equivalent to compress"
stems to
  "for exampl compress and compress ar both accept as equival to compress"
Porter’s algorithm
The most common English stemmer

Step 1a:
  sses → ss   (caresses → caress)
  ies → i     (ponies → poni)
  ss → ss     (caress → caress)
  s → ø       (cats → cat)

Step 1b:
  (*v*)ing → ø   (walking → walk; but sing → sing)
  (*v*)ed → ø    (plastered → plaster)

Step 2 (for long stems):
  ational → ate   (relational → relate)
  izer → ize      (digitizer → digitize)
  ator → ate      (operator → operate)
  …

Step 3 (for longer stems):
  al → ø     (revival → reviv)
  able → ø   (adjustable → adjust)
  ate → ø    (activate → activ)
  …
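NLTK ships an implementation of the Porter stemmer; a quick check of several of the examples above (assumes nltk is installed):

from nltk.stem import PorterStemmer

ps = PorterStemmer()
for w in ["caresses", "ponies", "walking", "sing",
          "automate", "automatic", "automation"]:
    print(w, "->", ps.stem(w))
# caresses -> caress, ponies -> poni, walking -> walk, sing -> sing;
# automate, automatic, automation all -> automat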
Viewing morphology in a corpus
Why only strip -ing if there is a vowel?

(*v*)ing → ø   (walking → walk)
sing → sing

tr -sc 'A-Za-z' '\n' < shakes.txt | grep 'ing$' | sort | uniq -c | sort -nr

 1312 King
  548 being
  541 nothing
  388 king
  375 bring
  358 thing
  307 ring
  152 something
  145 coming
  130 morning

tr -sc 'A-Za-z' '\n' < shakes.txt | grep '[aeiou].*ing$' | sort | uniq -c | sort -nr

  548 being
  541 nothing
  152 something
  145 coming
  130 morning
  122 having
  120 living
  117 loving
  116 Being
  102 going
Dealing with complex morphology is sometimes necessary
• Some languages require complex morpheme segmentation
  – Turkish
  – Uygarlastiramadiklarimizdanmissinizcasina
  – ‘(behaving) as if you are among those whom we could not civilize’
  – Uygar ‘civilized’ + las ‘become’
    + tir ‘cause’ + ama ‘not able’
    + dik ‘past’ + lar ‘plural’
    + imiz ‘1pl possessive’ + dan ‘ablative’
    + mis ‘past’ + siniz ‘2pl’ + casina ‘as if’
