
Chat bots

What are chatbots?

• A chatbot is a conversational agent that interacts with users using natural language.

• Started as an attempt to fool humans.

• Chatbots have numerous applications, such as customer service and call centers.
Need for chatbots?
• Widespread use of personal machines
• Better Human Computer Interaction

• “To express their interest, wishes, or queries directly and naturally, by speaking, typing, and pointing”.
Need for chatbots?

• You: Hello
• Op: Hi. This is Railway Enquiry
• You: What is the status of train 2803?
• Op: It’s right on time. The train will leave CST at
5:45 pm. Is there anything else I could assist you
with?
• You : No, thank you
• Op: You are welcome. Indian Railways wishes you a nice and happy journey.
History

• ELIZA
o Developed in the 1960s
o Looks for pronouns and verbs
o ‘You’ becomes ‘I’ and vice versa
o User: You are a dork.
o ELIZA: What makes you think I am a dork?

• PARRY (1972)
o Simulated a paranoid schizophrenic
History (contd)

• RACTER
o 1984
o Generated English-language prose at random
o Generated a book: The Policeman's Beard Is Half-Constructed
o (An excerpt from the book was shown on the slide.)
History (contd.)

• ALICE
o 1994
o Three-time Loebner Prize winner
o Richard Wallace
o http://www.alicebot.org/
Talk Outline
• How do chat bots work?

• Use of AIML for storing language models.

• Applications
o Intelligent Tutoring Systems
o Other applications here

• Conclusion
How Do Chatbots Work
• Chatbots don’t have a real understanding of conversation
• They are based on pattern matching
• Chatterbots have a set of input and output rules
• They recognize cue words from the user and respond with a pre-calculated response
• E.g.: Human: “I am feeling very worried today.”
  Chatterbot: “Why are you feeling worried lately?”
• Other chatterbots learn through user interactions
ELIZA
• One of the first chatterbots designed in 1966
• Parodies a therapist rephrasing statements of the
user and posing them back as questions
• Works by simple parsing and substitution of key
words into canned phrases
• People get emotionally caught up by ELIZA’s confident replies, forgetting that it is a machine
Example of Conversation
• A sample conversation between ELIZA and a user:
– Human: Well, my boyfriend made me come here.
ELIZA: Your boyfriend made you come here?
Human: He says I'm depressed much of the time.
ELIZA: I am sorry to hear you are depressed.
Human: It's true. I'm unhappy.
ELIZA: Do you think coming here will help you not to be
unhappy?
• Able to elicit emotional responses from users even though it was not programmed to do so
• Demonstrates the ELIZA effect
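
A minimal Python sketch of the keyword-matching and pronoun-reflection idea described above (an illustration, not ELIZA's actual rules):

import re

REFLECTIONS = {"i": "you", "me": "you", "my": "your", "you": "I", "your": "my", "am": "are", "are": "am"}

RULES = [
    (r"i am (.*)", "Why do you think you are {0}?"),
    (r"you are (.*)", "What makes you think I am {0}?"),
    (r"i feel (.*)", "Why are you feeling {0} lately?"),
    (r"(.*)", "Please tell me more."),          # fallback canned phrase
]

def reflect(text):
    # Swap first- and second-person words ('you' becomes 'I' and vice versa).
    return " ".join(REFLECTIONS.get(w, w) for w in text.lower().split())

def respond(utterance):
    for pattern, template in RULES:
        m = re.match(pattern, utterance.lower().strip(" .!?"))
        if m:
            return template.format(*[reflect(g) for g in m.groups()])

print(respond("You are a dork."))   # -> What makes you think I am a dork?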
Jabberwacky
• No fixed rules and principles programmed into it
• Learns language and context through human
interaction. Stores all conversations and comments
which are used to find appropriate responses
• Problems faced due to this approach:-
– Continuous changing of subject and conversation
– May respond in a bad-tempered and rude manner
• Was designed to pass the Turing test and has won the Loebner Prize contest
ALICE Chatbot System
• ALICE (Artificial Linguistic Internet Computer Entity) is inspired by ELIZA
• Applies heuristic pattern matching rules to input
to converse with user
• ALICE is composed of two parts
– Chatbot engine
– Language Model
• Language models are stored in AIML (Artificial Intelligence Mark-up Language) files
Structure of AIML
• AIML consists of data objects which are made up of units
called topics and categories
• A topic has a name attribute and categories associated with it
• Categories consist of pattern and template and are the basic
unit of knowledge
• Pattern consists of only words, spaces and wildcard symbols _
and *.
Types of ALICE/AIML Categories
• Atomic categories: do not have wildcard symbols.

• Default categories: have wildcard entries * or _.


Continued
• Recursive categories: map one pattern onto another; used for
o Symbolic Reduction
o Divide and Conquer
o Synonyms
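
The category examples shown on the original slides are not reproduced above. As an assumed, minimal illustration (not taken from the slides), the category types might look like this, held in a Python string that an AIML interpreter could load:

AIML_EXAMPLES = """
<category>                                 <!-- atomic category: no wildcards -->
  <pattern>HELLO</pattern>
  <template>Hi there!</template>
</category>

<category>                                 <!-- default category: wildcard * -->
  <pattern>WHAT IS *</pattern>
  <template>I do not know what <star/> is.</template>
</category>

<category>                                 <!-- recursive category: srai reduces the synonym -->
  <pattern>HI</pattern>                    <!-- HI onto the existing HELLO pattern -->
  <template><srai>HELLO</srai></template>
</category>
"""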
ALICE Pattern Matching Algorithm

• Normalization is applied to each input: punctuation is removed, the input is split into one or more sentences, and the text is converted to uppercase.
E.g.: Do you, or will you eat me?
Converted to: DO YOU OR WILL YOU EAT ME

• The AIML interpreter then tries to match the input word by word, looking for the longest pattern match, which is expected to be the best one.
Algorithm
• Assume the user input starts with word X.
• The root of the tree structure is a folder of the file system that contains all patterns and templates.
• The pattern matching uses a depth-first technique:

1. If the folder has a subfolder starting with the wildcard _, enter “_/” and try to match all remaining words of the input. If no match:

2. Go back to the folder and look for a subfolder starting with the word X; if found, enter “X/” and try to match the tail of the input after X. If no match:

3. Go back to the folder and look for a subfolder starting with *; if found, enter “*/” and try all suffixes of the input following X to see if one matches. If no match was found, change directory back to the parent of this folder and put X back at the head of the input.
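
A rough Python sketch of the normalize-then-match procedure above, with the wildcard priority _ , exact word, * (an assumed toy implementation, not the actual AIML interpreter):

import re

def normalize(text):
    # Strip punctuation, split into sentences, convert to uppercase.
    sentences = re.split(r"[.?!]+", text)
    return [re.sub(r"[^A-Z0-9 ]", "", s.upper()).split() for s in sentences if s.strip()]

def match(words, node):
    # node is a nested dict (the "folders"); a template is stored under the key None.
    if not words:
        return node.get(None)                    # all input consumed: return template if any
    first, rest = words[0], words[1:]
    for key in ("_", first, "*"):                # try '_', then the exact word, then '*'
        if key in node:
            if key in ("_", "*"):
                # a wildcard may absorb one or more words before matching continues
                for i in range(1, len(words) + 1):
                    result = match(words[i:], node[key])
                    if result is not None:
                        return result
            else:
                result = match(rest, node[key])
                if result is not None:
                    return result
    return None                                  # backtrack: caller puts the word back

# Toy pattern tree for "DO YOU *"
tree = {"DO": {"YOU": {"*": {None: "Maybe I do, maybe I don't."}}}}
print(match(normalize("Do you, or will you eat me?")[0], tree))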
Dialogue Corpus Training Dataset
ALICE tries to mimic real human conversations. Training ALICE on ‘real’ human dialogues and conversational rules proceeds as follows:

• Read the dialogue text from the corpus.
• Convert the dialogue transcript to AIML format.
• Use the output AIML to retrain ALICE.
Other approaches
• First word approach:
The first word of utterance is assumed to be a good clue to an
appropriate response. Try matching just the first word of the
corpus utterance.

• Most significant word approach:


Look for word in the utterance with the highest “information
content”. This is usually the word that has the lowest
frequency in the rest of the corpus.
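
A possible sketch of the most-significant-word idea (illustrative only):

from collections import Counter

def most_significant_word(utterance, corpus_utterances):
    # The word with the lowest corpus frequency carries the highest "information content".
    freq = Counter(w.lower() for u in corpus_utterances for w in u.split())
    return min(utterance.lower().split(), key=lambda w: freq[w])

corpus = ["what time does the train leave", "is the train on time", "thank you very much"]
print(most_significant_word("is the train late today", corpus))   # e.g. "late"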
Intelligent Tutoring Systems
• Intended to replace classroom instruction
– textbook
– practice or “homework helpers”
• Modern ITSs stress practice
• Typically support practice in two ways
– product tutors – evaluate final outcomes
– process tutors – give hints and feedback
Learner Modelling
• Modelling of the affective state of learner
– student's opinion, self-confidence
• Model to infer learner's knowledge
• Target Motivation
– just like expert human tutors do
– instructions can be adjusted
Open learner Modelling
• Extension of traditional learner modelling
– makes the model a visible and interactive part of the system
– displays ITS' internal belief of the learner's knowledge
state
• distinct records of learner's and system's belief
– like an information bar
– learner might challenge system's belief
ITS that use Natural Language
• Improved natural language might close the gap
between human tutor and ITS
• Pedagogical agents or avatars
– uses even non-verbal traits like emotions
– act as peers, co-learners, competitors, helpers
– ask and respond to questions, give hints and explanations,
provide feedback, monitor progress
Choice of Chatbots
• Feasibility of integrating natural language with open
learner model requires
– Keeping the user “on topic”
– Database connectivity
– Event driven by database changes
– Web integration
– An appropriate corpus of semantic reasoning knowledge
Chatbots for Entertainment
• Aim has been to mimic human conversation
• ELIZA – to mimic a therapist, idea based on
keyword matching.
• Phrases like “Very interesting, please go on”
• simulate different fictional or real personalities using
different algorithms of pattern matching
• ALICE – built for entertainment purposes
• No information saved or understood.
Chatbots in Foreign Language Learning
• An intelligent Web-Based teaching system for
foreign language learning which consists of:
– natural language mark-up language
– natural language object model in Java
– natural language database
– a communication response mechanism which considers
the discourse context and the personality of the users and
of the system itself.
• Students felt more comfortable and relaxed
• Repeat the same material without being bored
Chatbots in Information Retrieval
• Useful in Education – Language, Mathematics
• FAQchat system - queries from teaching resources
to how to book a room
• FAQchat over Google
– direct answers at times while Google gives links
– number of links returned by the FAQchat is less than
those returned by Google
• Based essentially on keyword matching
Chatbots in IR – Yellow Pages
• The YPA allows users to retrieve information from
British Telecom’s Yellow pages.
• YPA system returns addresses and if no address
found, a conversation is started and the system asks
users more details.
• Dialog Manager, Natural Language front-end, Query
Construction Component, and the Backend database
• YPA answers questions such as “I need a plumber
with an emergency service?”
Chatbots in Other Domains
• Happy Assistant helps access e-commerce sites to
find relevant information about products and
services
• Sanelma (2003) is a fictional person to talk with in a
museum
• Rita (real-time Internet technical assistant), an eGain
graphical avatar, is used in the ABN AMRO Bank to
help customers with financial tasks such as wire
money transfers (Voth, 2005).
Conclusion
• Chatbots are effective tools when it comes to
education, IR, e-commerce, etc.
• Downsides include abuse by malicious users, as seen on
Yahoo Messenger.
• The aim of chatbot designers should be: to build
tools that help people, facilitate their work, and their
interaction with computers using natural language;
but not to replace the human role totally, or imitate
human conversation perfectly.
References
• Bayan Abu Shawar and Eric Atwell, 2007 “Chatbots: Are they Really
Useful?” : LDV Forum - GLDV Journal for Computational
Linguistics and Language Technology.
http://www.ldv-forum.org/2007_Heft1/Bayan_Abu-Shawar_and_Eric_Atwell.pdf
• Kerly, A., Hall, P., and Bull, S. 2007. Bringing chatbots into education:
Towards natural language negotiation of open learner models.
Know.-Based Syst. 20, 2 (Mar. 2007), 177-185.
• Lane, H.C. (2006). Intelligent Tutoring Systems: Prospects for
Guided Practice and Efficient Learning. Whitepaper for the Army's
Science of Learning Workshop, Hampton, VA. Aug 1-3, 2006.
• http://en.wikipedia.org/wiki/Chatterbot
• ALICE. 2002. A.L.I.C.E AI Foundation, http://www.alicebot.org/
Introduction to Information Retrieval

Introduction to
Information Retrieval
Introducing Information Retrieval
and Web Search
Introduction to Information Retrieval

Information Retrieval
▪ Information Retrieval (IR) is finding material (usually
documents) of an unstructured nature (usually text)
that satisfies an information need from within large
collections (usually stored on computers).

▪ These days we frequently think first of web search, but


there are many other cases:
▪ E-mail search
▪ Searching your laptop
▪ Corporate knowledge bases
▪ Legal information retrieval

2
Introduction to Information Retrieval

Unstructured (text) vs. structured


(database) data in the mid-nineties

3
Introduction to Information Retrieval

Unstructured (text) vs. structured


(database) data today

4
Introduction to Information Retrieval Sec. 1.1

Basic assumptions of Information Retrieval


▪ Collection: A set of documents
▪ Assume it is a static collection for the moment

▪ Goal: Retrieve documents with information that is


relevant to the user’s information need and helps
the user complete a task

5
Introduction to Information Retrieval

The classic search model


User task: Get rid of mice in a politically correct way
  ↓ (Misconception?)
Info need: Info about removing mice without killing them
  ↓ (Misformulation?)
Query: how trap mice alive
  ↓
Search engine → Results (over the Collection); query refinement loops back
Introduction to Information Retrieval Sec. 1.1

How good are the retrieved docs?


▪ Precision : Fraction of retrieved docs that are
relevant to the user’s information need
▪ Recall : Fraction of relevant docs in collection that
are retrieved

▪ More precise definitions and measurements to follow later

7
Introduction to Information Retrieval

Introduction to
Information Retrieval
Term-document incidence matrices
Introduction to Information Retrieval Sec. 1.1

Unstructured data in 1620


▪ Which plays of Shakespeare contain the words Brutus
AND Caesar but NOT Calpurnia?
▪ One could grep all of Shakespeare’s plays for Brutus
and Caesar, then strip out lines containing Calpurnia?
▪ Why is that not the answer?
▪ Slow (for large corpora)
▪ NOT Calpurnia is non-trivial
▪ Other operations (e.g., find the word Romans near
countrymen) not feasible
▪ Ranked retrieval (best documents to return)
▪ Later lectures
10
Introduction to Information Retrieval Sec. 1.1

Term-document incidence matrices

Brutus AND Caesar BUT NOT Calpurnia
(Matrix entry is 1 if the play contains the word, 0 otherwise.)
Introduction to Information Retrieval Sec. 1.1

Incidence vectors
▪ So we have a 0/1 vector for each term.
▪ To answer query: take the vectors for Brutus, Caesar
and Calpurnia (complemented) bitwise AND.
▪ 110100 AND 110111 AND 101111 = 100100
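
The same bitwise AND, as a small Python sketch using the vectors from the slide:

# Answering "Brutus AND Caesar AND NOT Calpurnia" with 0/1 incidence vectors (one bit per play).
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

answer = brutus & caesar & ~calpurnia & 0b111111   # complement Calpurnia, then bitwise AND
print(format(answer, "06b"))                       # -> 100100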

12
Introduction to Information Retrieval Sec. 1.1

Answers to query
▪ Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.

▪ Hamlet, Act III, Scene ii


Lord Polonius: I did enact Julius Caesar I was killed i’ the
Capitol; Brutus killed me.

13
Introduction to Information Retrieval Sec. 1.1

Bigger collections
▪ Consider N = 1 million documents, each with about
1000 words.
▪ Avg 6 bytes/word including spaces/punctuation
▪ 6GB of data in the documents.
▪ Say there are M = 500K distinct terms among these.

14
Introduction to Information Retrieval Sec. 1.1

Can’t build the matrix


▪ 500K x 1M matrix has half-a-trillion 0’s and 1’s.

▪ But it has no more than one billion 1’s. Why?


▪ matrix is extremely sparse.

▪ What’s a better representation?


▪ We only record the 1 positions.

15
Introduction to Information Retrieval

Introduction to
Information Retrieval
The Inverted Index
The key data structure underlying modern IR
Introduction to Information Retrieval Sec. 1.2

Inverted index
▪ For each term t, we must store a list of all documents
that contain t.
▪ Identify each doc by a docID, a document serial number
▪ Can we use fixed-size arrays for this?

Brutus 1 2 4 11 31 45 173 174


Caesar 1 2 4 5 6 16 57 132
Calpurnia 2 31 54 101

What happens if the word Caesar


is added to document 14?
18
Introduction to Information Retrieval Sec. 1.2

Inverted index
▪ We need variable-size postings lists
▪ On disk, a continuous run of postings is normal and best
▪ In memory, can use linked lists or variable length arrays
▪ Some tradeoffs in size/ease of insertion Posting

Brutus 1 2 4 11 31 45 173 174


Caesar 1 2 4 5 6 16 57 132
Calpurnia 2 31 54 101

Dictionary Postings
Sorted by docID (more later on why). 19
Introduction to Information Retrieval Sec. 1.2

Inverted index construction


Documents to be indexed: Friends, Romans, countrymen.

Tokenizer
Token stream Friends Romans Countrymen

Linguistic modules
Modified tokens friend roman countryman

Indexer friend 2 4
roman 1 2
Inverted index
countryman 13 16
Introduction to Information Retrieval Sec. 1.2

Inverted index construction


Documents to be indexed: Friends, Romans, countrymen.

Tokenizer
Token stream Friends Romans Countrymen
Linguistic modules (more on these later)
Modified tokens friend roman countryman

Indexer friend 2 4
roman 1 2
Inverted index
countryman 13 16
Introduction to Information Retrieval

Initial stages of text processing


▪ Tokenization
▪ Cut character sequence into word tokens
▪ Deal with “John’s”, a state-of-the-art solution
▪ Normalization
▪ Map text and query term to same form
▪ You want U.S.A. and USA to match
▪ Stemming
▪ We may wish different forms of a root to match
▪ authorize, authorization
▪ Stop words
▪ We may omit very common words (or not)
▪ the, a, to, of
Introduction to Information Retrieval Sec. 1.2

Indexer steps: Token sequence


▪ Sequence of (Modified token, Document ID) pairs.

Doc 1: I did enact Julius Caesar I was killed i’ the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Introduction to Information Retrieval Sec. 1.2

Indexer steps: Sort


▪ Sort by terms
▪ And then docID

Core indexing step


Introduction to Information Retrieval Sec. 1.2

Indexer steps: Dictionary & Postings


▪ Multiple term
entries in a single
document are
merged.
▪ Split into Dictionary
and Postings
▪ Doc. frequency
information is
added.
Why frequency?
Will discuss
later.
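
A toy Python sketch of the indexer steps just described (token/docID pairs, sort, merge into dictionary and postings with document frequencies); it is illustrative, not the book's implementation:

from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def tokenize(text):
    # crude tokenizer + normalization stand-in (real systems do much more)
    return [t.strip(".;,'").lower() for t in text.split() if t.strip(".;,'")]

# 1. sequence of (term, docID) pairs
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in tokenize(text)]

# 2. sort by term, then docID (core indexing step)
pairs.sort()

# 3. merge duplicates: dictionary (term -> document frequency) and postings (sorted docIDs)
postings = defaultdict(list)
for term, doc_id in pairs:
    if not postings[term] or postings[term][-1] != doc_id:
        postings[term].append(doc_id)

for term in ["brutus", "caesar", "capitol"]:
    print(term, "df=%d" % len(postings[term]), postings[term])
# brutus df=2 [1, 2]; caesar df=2 [1, 2]; capitol df=1 [1]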
Introduction to Information Retrieval Sec. 1.2

Where do we pay in storage?


Terms and counts (the dictionary), pointers, and lists of docIDs (the postings)

IR system implementation:
• How do we index efficiently?
• How much storage do we need?

26
Introduction to Information Retrieval

Introduction to
Information Retrieval
Query processing with an inverted index
Introduction to Information Retrieval Sec. 1.3

The index we just built


▪ How do we process a query? Our focus
▪ Later - what kinds of queries can we process?

29
Introduction to Information Retrieval Sec. 1.3

Query processing: AND


▪ Consider processing the query:
Brutus AND Caesar
▪ Locate Brutus in the Dictionary;
▪ Retrieve its postings.
▪ Locate Caesar in the Dictionary;
▪ Retrieve its postings.
▪ “Merge” the two postings (intersect the document sets):

2 4 8 16 32 64 128 Brutus
1 2 3 5 8 13 21 34 Caesar

30
Introduction to Information Retrieval Sec. 1.3

The merge
▪ Walk through the two postings simultaneously, in
time linear in the total number of postings entries

2 4 8 16 32 64 128 Brutus
1 2 3 5 8 13 21 34 Caesar

If the list lengths are x and y, the merge takes O(x+y)


operations.
Crucial: postings sorted by docID.
31
Introduction to Information Retrieval Sec. 1.3

The merge
▪ Walk through the two postings simultaneously, in
time linear in the total number of postings entries

2 4 8 16 32 64 128 Brutus
2 8
1 2 3 5 8 13 21 34 Caesar

If the list lengths are x and y, the merge takes O(x+y)


operations.
Crucial: postings sorted by docID.
32
Introduction to Information Retrieval

Intersecting two postings lists


(a “merge” algorithm)

33
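
The intersection figure is not reproduced above; a minimal Python sketch of the linear merge looks like this:

def intersect(p1, p2):
    # Walk the two docID-sorted postings lists in O(x + y) time.
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # -> [2, 8]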
Introduction to Information Retrieval

Introduction to
Information Retrieval
Phrase queries and positional indexes
Introduction to Information Retrieval Sec. 2.4

Phrase queries
▪ We want to be able to answer queries such as
“stanford university” – as a phrase
▪ Thus the sentence “I went to university at Stanford”
is not a match.
▪ The concept of phrase queries has proven easily
understood by users; one of the few “advanced search”
ideas that works
▪ Many more queries are implicit phrase queries
▪ For this, it no longer suffices to store only
<term : docs> entries
Introduction to Information Retrieval Sec. 2.4.1

A first attempt: Biword indexes


▪ Index every consecutive pair of terms in the text as a
phrase
▪ For example the text “Friends, Romans,
Countrymen” would generate the biwords
▪ friends romans
▪ romans countrymen
▪ Each of these biwords is now a dictionary term
▪ Two-word phrase query-processing is now
immediate.
Introduction to Information Retrieval Sec. 2.4.1

Longer phrase queries


▪ Longer phrases can be processed by breaking them
down
▪ stanford university palo alto can be broken into the
Boolean query on biwords:
stanford university AND university palo AND palo alto

Without the docs, we cannot verify that the docs


matching the above Boolean query do contain the
phrase.

Can have false positives!


Introduction to Information Retrieval Sec. 2.4.1

Extended biwords
▪ Parse the indexed text and perform part-of-speech-tagging
(POST).
▪ Bucket the terms into (say) Nouns (N) and
articles/prepositions (X).
▪ Call any string of terms of the form NX*N an extended
biword.
▪ Each such extended biword is now made a term in the
dictionary.
▪ Example: catcher in the rye
N X X N
▪ Query processing: parse it into N’s and X’s
▪ Segment query into enhanced biwords
▪ Look up in index: catcher rye
Introduction to Information Retrieval Sec. 2.4.1

Issues for biword indexes


▪ False positives, as noted before
▪ Index blowup due to bigger dictionary
▪ Infeasible for more than biwords, big even for them

▪ Biword indexes are not the standard solution (for all


biwords) but can be part of a compound strategy
Introduction to Information Retrieval Sec. 2.4.2

Solution 2: Positional indexes


▪ In the postings, store, for each term the position(s) in
which tokens of it appear:

<term, number of docs containing term;


doc1: position1, position2 … ;
doc2: position1, position2 … ;
etc.>
Introduction to Information Retrieval Sec. 2.4.2

Positional index example

<be: 993427;
1: 7, 18, 33, 72, 86, 231; Which of docs
2: 3, 149; 1,2,4,5
could contain “to be
4: 17, 191, 291, 430, 434;
or not to be”?
5: 363, 367, …>

▪ For phrase queries, we use a merge algorithm


recursively at the document level
▪ But we now need to deal with more than just
equality
Introduction to Information Retrieval Sec. 2.4.2

Processing a phrase query


▪ Extract inverted index entries for each distinct term:
to, be, or, not.
▪ Merge their doc:position lists to enumerate all
positions with “to be or not to be”.
▪ to:
▪ 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...
▪ be:
▪ 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...
▪ Same general method for proximity searches
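
A small Python sketch of the positional merge for a two-word phrase such as “to be”, using the postings shown above (a position of the second word must be exactly one past a position of the first):

def phrase_two_words(pos_index, t1, t2):
    # Docs where some position of t2 immediately follows a position of t1.
    hits = {}
    for doc in pos_index[t1].keys() & pos_index[t2].keys():
        matches = [p for p in pos_index[t1][doc] if p + 1 in set(pos_index[t2][doc])]
        if matches:
            hits[doc] = matches
    return hits

index = {
    "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]},
    "be": {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}
print(phrase_two_words(index, "to", "be"))   # -> {4: [16, 190, 429, 433]}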
Introduction to Information Retrieval Sec. 2.4.2

Proximity queries
▪ LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
▪ Again, here, /k means “within k words of”.
▪ Clearly, positional indexes can be used for such
queries; biword indexes cannot.
▪ Exercise: Adapt the linear merge of postings to
handle proximity queries. Can you make it work for
any value of k?
▪ This is a little tricky to do correctly and efficiently
▪ See Figure 2.12 of IIR
Introduction to Information Retrieval Sec. 2.4.2

Positional index size


▪ A positional index expands postings storage
substantially
▪ Even though indices can be compressed
▪ Nevertheless, a positional index is now standardly
used because of the power and usefulness of phrase
and proximity queries … whether used explicitly or
implicitly in a ranking retrieval system.
Introduction to Information Retrieval Sec. 2.4.2

Positional index size


▪ Need an entry for each occurrence, not just once per document
▪ Index size depends on average document size (Why?)
▪ Average web page has <1000 terms
▪ SEC filings, books, even some epic poems … easily 100,000 terms
▪ Consider a term with frequency 0.1%:

Document size | Postings | Positional postings
1,000         | 1        | 1
100,000       | 1        | 100
Introduction to Information Retrieval Sec. 2.4.2

Rules of thumb
▪ A positional index is 2–4 times as large as a non-positional
index

▪ Positional index size 35–50% of volume of original


text

▪ Caveat: all of this holds for “English-like” languages


Introduction to Information Retrieval Sec. 2.4.3

Combination schemes
▪ These two approaches can be profitably combined
▪ For particular phrases (“Michael Jackson”, “Britney
Spears”) it is inefficient to keep on merging positional
postings lists
▪ Even more so for phrases like “The Who”
▪ Williams et al. (2004) evaluate a more sophisticated
mixed indexing scheme
▪ A typical web query mixture was executed in ¼ of the time
of using just a positional index
▪ It required 26% more space than having a positional index
alone
Word Classes and Part-of-Speech
(POS) Tagging

CS4705
Julia Hirschberg

CS 4705
Garden Path Sentences

• The old dog


…………the footsteps of the young.
• The cotton clothing
…………is made of grows in Mississippi.
• The horse raced past the barn
…………fell.

2
Word Classes

• Words that somehow ‘behave’ alike:


– Appear in similar contexts
– Perform similar functions in sentences
– Undergo similar transformations
• ~9 traditional word classes of parts of speech
– Noun, verb, adjective, preposition, adverb, article,
interjection, pronoun, conjunction

3
Some Examples

•N noun chair, bandwidth, pacing


•V verb study, debate, munch
• ADJ adjective purple, tall, ridiculous
• ADV adverb unfortunately, slowly
•P preposition of, by, to
• PRO pronoun I, me, mine
• DET determiner the, a, that, those

4
Defining POS Tagging

• The process of assigning a part-of-speech or


lexical class marker to each word in a corpus:

WORDS: the, koala, put, the, keys, on, the, table
TAGS: N, V, P, DET
5
Applications for POS Tagging
• Speech synthesis pronunciation
– Lead Lead
– INsult inSULT
– OBject obJECT
– OVERflow overFLOW
– DIScount disCOUNT
– CONtent conTENT
• Parsing: e.g. Time flies like an arrow
– Is flies an N or V?
• Word prediction in speech recognition
– Possessive pronouns (my, your, her) are likely to be followed by
nouns
– Personal pronouns (I, you, he) are likely to be followed by verbs
• Machine Translation
6
Closed vs. Open Class Words

• Closed class: relatively fixed set


– Prepositions: of, in, by, …
– Auxiliaries: may, can, will, had, been, …
– Pronouns: I, you, she, mine, his, them, …
– Usually function words (short common words which play a role
in grammar)
• Open class: productive
– English has 4: Nouns, Verbs, Adjectives, Adverbs
– Many languages have all 4, but not all!
– In Lakhota and possibly Chinese, what English treats as
adjectives act more like verbs.

7
Open Class Words
• Nouns
– Proper nouns
• Columbia University, New York City, Arthi
Ramachandran, Metropolitan Transit Center
• English capitalizes these
• Many have abbreviations
– Common nouns
• All the rest
• German capitalizes these.

8
– Count nouns vs. mass nouns
• Count: Have plurals, countable: goat/goats, one goat, two
goats
• Mass: Not countable (fish, salt, communism) (?two fishes)
• Adjectives: identify properties or qualities of
nouns
– Color, size, age, …
– Adjective ordering restrictions in English:
• Old blue book, not Blue old book
– In Korean, adjectives are realized as verbs
• Adverbs: also modify things (verbs, adjectives,
adverbs)
– The very happy man walked home extremely slowly
yesterday.
9
– Directional/locative adverbs (here, home, downhill)
– Degree adverbs (extremely, very, somewhat)
– Manner adverbs (slowly, slinkily, delicately)
– Temporal adverbs (Monday, tomorrow)
• Verbs:
– In English, take morphological affixes (eat/eats/eaten)
– Represent actions (walk, ate), processes (provide, see),
and states (be, seem)
– Many subclasses, e.g.
• eats/V ⇒ eat/VB, eat/VBP, eats/VBZ, ate/VBD,
eaten/VBN, eating/VBG, ...
• Reflect morphological form & syntactic function
How Do We Assign Words to Open or
Closed?
• Nouns denote people, places and things and can
be preceded by articles? But…
My typing is very bad.
*The Mary loves John.
• Verbs are used to refer to actions, processes, states
– But some are closed class and some are open
I will have emailed everyone by noon.
• Adverbs modify actions
– Is Monday a temporal adverbial or a noun?

11
Closed Class Words
• Idiosyncratic
• Closed class words (Prep, Det, Pron, Conj, Aux,
Part, Num) are generally easy to process, since we
can enumerate them….but
– Is it a particle or a preposition?
• George eats up his dinner/George eats his dinner up.
• George eats up the street/*George eats the street up.
– Articles come in 2 flavors: definite (the) and indefinite
(a, an)
• What is this in ‘this guy…’?

12
Choosing a POS Tagset

• To do POS tagging, first need to choose a set of


tags
• Could pick very coarse (small) tagsets
– N, V, Adj, Adv.
• More commonly used: Brown Corpus (Francis &
Kucera ‘82), 1M words, 87 tags – more
informative but more difficult to tag
• Most commonly used: Penn Treebank:
hand-annotated corpus of Wall Street Journal, 1M
words, 45-46 subset
– We’ll use for HW1

13
Penn Treebank Tagset

14
Using the Penn Treebank Tags

• The/DT grand/JJ jury/NN commented/VBD


on/IN a/DT number/NN of/IN other/JJ
topics/NNS ./.
• Prepositions and subordinating conjunctions
marked IN (“although/IN I/PRP..”)
• Except the preposition/complementizer “to” is just
marked “TO”
• NB: PRP$ (possessive pronoun) vs. $

15
Tag Ambiguity

• Words often have more than one POS: back


– The back door = JJ
– On my back = NN
– Win the voters back = RB
– Promised to back the bill = VB
• The POS tagging problem is to determine the
POS tag for a particular instance of a word

16
Tagging Whole Sentences with POS is Hard

• Ambiguous POS contexts


– E.g., Time flies like an arrow.
• Possible POS assignments
– Time/[V,N] flies/[V,N] like/[V,Prep] an/Det arrow/N
– Time/N flies/V like/Prep an/Det arrow/N
– Time/V flies/N like/Prep an/Det arrow/N
– Time/N flies/N like/V an/Det arrow/N
– …..

17
How Do We Disambiguate POS?

• Many words have only one POS tag (e.g. is, Mary,
very, smallest)
• Others have a single most likely tag (e.g. a, dog)
• Tags also tend to co-occur regularly with other
tags (e.g. Det, N)
• In addition to conditional probabilities of words
P(wn|wn−1), we can look at POS likelihoods
P(tn|tn−1) to disambiguate sentences and to assess
sentence likelihoods

18
Some Ways to do POS Tagging

• Rule-based tagging
– E.g., the EngCG ENGTWOL tagger
• Transformation-based tagging
– Learned rules (statistic and linguistic)
– E.g., Brill tagger
• Stochastic, or, Probabilistic tagging
– HMM (Hidden Markov Model) tagging

19
Rule-Based Tagging

• Typically…start with a dictionary of words and


possible tags
• Assign all possible tags to words using the
dictionary
• Write rules by hand to selectively remove tags
• Stop when each word has exactly one (presumably
correct) tag

20
Start with a POS Dictionary

• she: PRP
• promised: VBN,VBD
• to: TO
• back: VB, JJ, RB, NN
• the: DT
• bill: NN, VB
• Etc… for the ~100,000 words of English

21
Assign All Possible POS to Each Word

She: PRP
promised: VBN, VBD
to: TO
back: VB, JJ, RB, NN
the: DT
bill: NN, VB

22
Apply Rules Eliminating Some POS

E.g., Eliminate VBN if VBD is an option when


VBN|VBD follows “<start> PRP”
She: PRP
promised: VBN, VBD
to: TO
back: VB, JJ, RB, NN
the: DT
bill: NN, VB

23
Apply Rules Eliminating Some POS

E.g., Eliminate VBN if VBD is an option when


VBN|VBD follows “<start> PRP”
She: PRP
promised: VBD
to: TO
back: VB, JJ, RB, NN
the: DT
bill: NN, VB

24
EngCG ENGTWOL Tagger

• Richer dictionary includes morphological and


syntactic features (e.g. subcategorization frames)
as well as possible POS
• Uses two-level morphological analysis on input
and returns all possible POS
• Apply negative constraints (> 3744) to rule out
incorrect POS
Sample ENGTWOL Dictionary

26
ENGTWOL Tagging: Stage 1
• First Stage: Run words through FST morphological
analyzer to get POS info from morph
• E.g.: Pavlov had shown that salivation …
Pavlov PAVLOV N NOM SG PROPER
had HAVE V PAST VFIN SVO
HAVE PCP2 SVO
shown SHOW PCP2 SVOO SVO SV
that ADV
PRON DEM SG
DET CENTRAL DEM SG
CS
salivation N NOM SG

27
ENGTWOL Tagging: Stage 2
• Second Stage: Apply NEGATIVE constraints
• E.g., Adverbial that rule
– Eliminate all readings of that except the one in It isn’t
that odd.
Given input: that
If
(+1 A/ADV/QUANT) ; if next word is adj/adv/quantifier
(+2 SENT-LIM) ; followed by E-O-S
(NOT -1 SVOC/A) ; and the previous word is not a verb like
consider which allows adjective
complements (e.g. I consider that odd)
Then eliminate non-ADV tags
Else eliminate ADV

28
Transformation-Based (Brill) Tagging

• Combines Rule-based and Stochastic Tagging


– Like rule-based because rules are used to specify tags in
a certain environment
– Like stochastic approach because we use a tagged corpus
to find the best performing rules
• Rules are learned from data
• Input:
– Tagged corpus
– Dictionary (with most frequent tags)

* 29
Transformation-Based Tagging

• Basic Idea: Strip tags from tagged corpus and try to learn them
by rule application
– For untagged, first initialize with most probable tag for each word
– Change tags according to best rewrite rule, e.g. “if word-1 is a
determiner and word is a verb then change the tag to noun”
– Compare to gold standard
– Iterate
• Rules created via rule templates, e.g. of the form “if word-1 is an
X and word is a Y then change the tag to Z”
– Find rule that applies correctly to most tags and apply
– Iterate on newly tagged corpus until threshold reached
– Return ordered set of rules
• NB: Rules may make errors that are corrected by later rules

* 30
Sample TBL Rule Application

• Labels every word with its most-likely tag


– E.g. race occurrences in the Brown corpus:
• P(NN|race) = .98
• P(VB|race)= .02
• is/VBZ expected/VBN to/TO race/NN tomorrow/NN
• Then TBL applies the following rule
– “Change NN to VB when previous tag is TO”
… is/VBZ expected/VBN to/TO race/NN tomorrow/NN
becomes
… is/VBZ expected/VBN to/TO race/VB tomorrow/NN
* 31
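
A toy Python sketch of this TBL step (initialize with most-likely tags, then apply the learned rule); the most-likely-tag dictionary here is assumed for illustration:

# Rule: "Change NN to VB when the previous tag is TO".
most_likely = {"is": "VBZ", "expected": "VBN", "to": "TO", "race": "NN", "tomorrow": "NN"}

def initial_tagging(words):
    # Step 1: label every word with its most likely tag.
    return [(w, most_likely[w]) for w in words]

def apply_rule(tagged, from_tag="NN", to_tag="VB", prev_tag="TO"):
    # Rewrite the tag wherever the triggering environment holds.
    out = list(tagged)
    for i in range(1, len(out)):
        if out[i][1] == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (out[i][0], to_tag)
    return out

sent = "is expected to race tomorrow".split()
print(apply_rule(initial_tagging(sent)))
# [('is','VBZ'), ('expected','VBN'), ('to','TO'), ('race','VB'), ('tomorrow','NN')]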
Learning Rules

• 2 parts to a rule
– Triggering environment
– Rewrite rule
• The range of triggering environments of templates
(from Manning & Schutze 1999:363)
Schema: ti−3 ti−2 ti−1 ti ti+1 ti+2 ti+3
(Schemas 1–9 each mark one of these triggering positions with *; table from Manning & Schutze 1999:363, not reproduced here.)
TBL Tagging Algorithm

• Step 1: Label every word with most likely tag (from


dictionary)
• Step 2: Check every possible transformation & select
one which most improves tag accuracy (cf Gold)
• Step 3: Re-tag corpus applying this rule, and add rule to
end of rule set
• Repeat 2-3 until some stopping criterion is reached, e.g.,
X% correct with respect to training corpus
• RESULT: Ordered set of transformation rules to use on
new data tagged only with most likely POS tags

* 33
TBL Issues

• Problem: Could keep applying (new)


transformations ad infinitum
• Problem: Rules are learned in ordered sequence
• Problem: Rules may interact
• But: Rules are compact and can be inspected by
humans

* 34
Methodology: Error Analysis

• Confusion matrix over tags (e.g. VB, TO, NN):
– E.g. which tags did we most often confuse with which other tags?
– How much of the overall error does each confusion account for?
More Complex Issues

• Tag indeterminacy: when ‘truth’ isn’t clear


Caribbean cooking, child seat
• Tagging multipart words
wouldn’t --> would/MD n’t/RB
• How to handle unknown words
– Assume all tags equally likely
– Assume same tag distribution as all other singletons in
corpus
– Use morphology, word length,….
Summary

• We can develop statistical methods for identifying


the POS of word sequences which come close to
human performance – high 90s
• But not completely “solved” despite published
statistics
– Especially for spontaneous speech
• Next Class: Read Chapter 6:1-5 on Hidden
Markov Models
Information Extraction
Introduction
• information extraction
• relation extraction
• knowledge graphs
Three tasks related to events
• event extraction; event coreference
• temporal expression extraction and normalization
• template filling
• UMLS relations
• Infoboxes
• RDF triples
• Freebase
• Is-a (or) hypernym
Relation Extraction Algorithms
• Five main classes:
– handwritten patterns
– supervised machine learning
– semi-supervised learning
– unsupervised learning
– distant supervision
Using Patterns to Extract Relations
• Hearst patterns

• Consider the following sentence:

• Agar is a substance prepared from a mixture of red algae, such as Gelidium,

for laboratory or industrial use

• lexico-syntactic pattern
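
A crude Python sketch of applying the “X such as Y” Hearst pattern to the example sentence; real systems match over NP chunks rather than a raw-text regex:

import re

sentence = ("Agar is a substance prepared from a mixture of red algae, "
            "such as Gelidium, for laboratory or industrial use")

# "X ... such as Y" -> is-a(Y, X); a regex over raw text is only an approximation.
m = re.search(r"(\w+(?:\s\w+)?),?\s+such as\s+([A-Z]\w+)", sentence)
if m:
    hypernym, hyponym = m.group(1), m.group(2)
    print(f"is-a({hyponym}, {hypernym})")   # -> is-a(Gelidium, red algae)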
Relation Extraction via Supervised Learning
Feature-based supervised relation classifiers.
Example:
• American Airlines, a unit of AMR, immediately matched the move,
spokesman Tim Wagner said
• Syntactic structure
• syntactic path
• Constituent paths between M1 and M2
NP ↑ NP ↑ S ↑ S ↓ NP
• Dependency-tree paths
Airlines ←subj matched ←comp said →subj Wagner
Neural supervised relation classifiers
• TACRED relation extraction dataset

• 1-of-N classifier to assign one of the 43 labels

• BERT-style encoders such as RoBERTa

• SpanBERT

• hand-labeled data

• semi-supervised and unsupervised approaches


Semi supervised Relation Extraction via
Bootstrapping
• Seed patterns
• Seed paths
• Bootstrapping
• confidence values
• Track two factors:
hits(p): the set of tuples in T that p matches while looking in D
finds(p): The total set of tuples that p finds in D

• noisy-or
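
One common formulation of these confidence scores (following Jurafsky & Martin; treat the exact form here as an assumption):

Conf_RlogF(p) = ( |hits(p)| / |finds(p)| ) · log( |finds(p)| )

Conf(t) = 1 − ∏_{p ∈ P'} (1 − Conf(p))
(the noisy-or combination over the set P' of patterns that found tuple t)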
Distant Supervision for Relation Extraction
• Distant supervision
Unsupervised Relation Extraction
• open information extraction or Open IE
• For example, the ReVerb system extracts a relation from a sentence s in 4
steps:
• 1. Run a part-of-speech tagger and entity chunker over s
• 2. For each verb in s, find the longest sequence of words w that start with a
verb and satisfy syntactic and lexical constraints, merging adjacent matches.
• 3. For each phrase w, find the nearest noun phrase x to the left which is not a
relative pronoun, wh-word or existential “there”. Find the nearest noun
phrase y to the right.
• 4. Assign confidence c to the relation r = (x,w, y) using a confidence
classifier and return it
Evaluation of Relation Extraction
• Supervised
• Semi-supervised
• Unsupervised
• Tuples
• The estimated precision P̂ is:
P̂ = (# of correctly extracted relations in the sample) / (total # of extracted relations in the sample)
Extracting Times
• Temporal Expression Extraction

• Temporal Normalization

• Extracting Events and their Times

• Template Filling

• Machine Learning Approaches to Template Filling


Information Retrieval
Introduction
• Information Retrieval
• ad hoc retrieval
• Query
• Documents
Term weighting and document scoring
Document Scoring
• We score document d by the cosine of its vector d with the query vector q:
score(q, d) = cos(q, d) = (q · d) / (|q| |d|)
• Using the tf-idf values and spelling out the dot product as a sum of products:
score(q, d) = Σ_{t ∈ q} [ tf-idf(t, q) / |q| ] × [ tf-idf(t, d) / |d| ]
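
A small Python sketch of cosine scoring with tf-idf weights over a toy collection (illustrative; real systems score via an inverted index rather than full vectors):

import math
from collections import Counter

docs = {
    "d1": "sweet sorrow parting is such sweet sorrow",
    "d2": "good night good night",
}
N = len(docs)
df = Counter(t for text in docs.values() for t in set(text.split()))

def tfidf_vector(text):
    # log-scaled term frequency times idf; terms unseen in the collection are dropped
    tf = Counter(text.split())
    return {t: (1 + math.log10(c)) * math.log10(N / df[t]) for t, c in tf.items() if df[t]}

def cosine(q_vec, d_vec):
    dot = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
    norm = lambda v: math.sqrt(sum(w * w for w in v.values())) or 1.0
    return dot / (norm(q_vec) * norm(d_vec))

query = tfidf_vector("sweet night")
for name, text in docs.items():
    print(name, round(cosine(query, tfidf_vector(text)), 3))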
Inverted Index
• It consists of two parts:
• dictionary
• Postings
Evaluation of Information-Retrieval Systems
• Precision and recall are then defined as:
Precision = |relevant ∩ retrieved| / |retrieved|
Recall = |relevant ∩ retrieved| / |relevant|
• precision-recall curve
• interpolated precision
• mean average precision
IR-based Factoid Question Answering
• IR-based QA
• retrieve and read
• reading comprehension
Entity Linking
• entity linking
• Wikification

Linking based on Anchor Dictionaries and Web Graph


• Anchor texts
• Mention Detection
• Mention Disambiguation
Knowledge-based Question Answering
• Knowledge-Based QA from RDF triple stores
• QA by Semantic Parsing
Using Language Models to do QA
Classic QA Model
• (1) Question Processing

• (2) Candidate Answer Generation

• (3) Candidate Answer Scoring

• (4) Answer Merging and Confidence Scoring.


Evaluation of Factoid Answers

• Mean Reciprocal Rank (MRR)

• MRR is defined as:
MRR = (1 / |Q|) Σ_{i=1..|Q|} 1 / rank_i
(rank_i is the rank of the first correct answer for question i; questions with no correct answer contribute 0)
Thank You
RNN
Introduction

• A recurrent neural network (RNN) is any network that contains a


cycle within its network connections
• Elman Networks or simple recurrent networks.
• These networks are useful in their own right and serve as the basis for
more complex approaches like the Long Short-Term Memory (LSTM)
networks
Simple recurrent neural network
Simple recurrent neural network illustrated
as a feed forward network.
Inference in RNNs
• Forward inference
• To compute an output yt for an input xt , we need the activation
value for the hidden layer ht .
• To calculate this, we multiply the input xt with the weight matrix W
• the hidden layer from the previous time step ht−1 with the weight matrix U
• suitable activation function, g
• activation value for the current hidden layer, ht
• h_t = g(U h_{t−1} + W x_t)
• y_t = f(V h_t)
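
A minimal NumPy sketch of one step of forward inference, following the two equations above (toy, randomly initialized weights):

import numpy as np

def rnn_step(x_t, h_prev, U, W, V, g=np.tanh):
    # h_t = g(U h_{t-1} + W x_t);  y_t = f(V h_t), with f taken to be softmax here
    h_t = g(U @ h_prev + W @ x_t)
    z = V @ h_t
    y_t = np.exp(z - z.max()); y_t /= y_t.sum()      # softmax over the output layer
    return h_t, y_t

# toy dimensions: input 4, hidden 3, output 5
rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(3, 3)), rng.normal(size=(3, 4)), rng.normal(size=(5, 3))
h = np.zeros(3)
for x in rng.normal(size=(6, 4)):                    # unroll over a 6-step input sequence
    h, y = rnn_step(x, h, U, W, V)
print(y)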
Training
• Backpropagation Through Time
RNNs as Language Models
• teacher forcing
• Weight Tying
Training RNNs as language models.
Generation with RNN-Based Language
Models
• To begin, sample a word in the output from the softmax distribution
that results from using the beginning of sentence marker, <s>, as the
first input.
• Use the word embedding for that first word as the input to the
network at the next time step, and then sample the next word in the
same fashion.
• Continue generating until the end of sentence marker, </s>, is
sampled or a fixed length limit is reached.
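
A sketch of that sampling loop; model.initial_state(), model.step() and model.embed() are assumed interfaces for illustration, not a real library API:

import numpy as np

def generate(model, word2id, id2word, max_len=20, rng=np.random.default_rng()):
    h = model.initial_state()
    word = word2id["<s>"]                             # start with the beginning-of-sentence marker
    output = []
    for _ in range(max_len):                          # stop at </s> or a fixed length limit
        h, probs = model.step(h, model.embed(word))   # forward one time step
        word = rng.choice(len(probs), p=probs)        # sample the next word from the softmax
        if id2word[word] == "</s>":
            break
        output.append(id2word[word])
    return " ".join(output)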
Autoregressive generation
Other Applications of RNNs
Autoregressive generation with an
RNN-based neural language model.
Part-of-speech tagging with sequence
labeling
RNNs for Sequence Classification
Stacked RNNs
Bidirectional RNNs
A bidirectional RNN for sequence
classification
Managing Context in RNNs: LSTM
• It is quite difficult to train RNNs for tasks that require a network to
make use of information distant from the current point of processing.
• To see this, consider the following example in the context of language
modeling
• The flights the airline was cancelling were full.

- vanishing gradients problem


Long Short-Term Memory
Introduction
• Divide the context management problem into two sub-problems.
• removing information no longer needed from the context
• adding information likely to be needed for later decision making.

• The gates in an LSTM share a common design pattern;


• each consists of a feed forward layer
• followed by a sigmoid activation function
• followed by a pointwise multiplication with the layer being gated.
Gates
• forget gate

• add gate

• output gate
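
Written out, the standard gate equations (in the same style as the simple-RNN equations earlier; σ is the sigmoid, ⊙ element-wise multiplication):

f_t = σ(U_f h_{t−1} + W_f x_t)        (forget gate)
k_t = c_{t−1} ⊙ f_t                   (drop no-longer-needed context)
i_t = σ(U_i h_{t−1} + W_i x_t)        (add/input gate)
g_t = tanh(U_g h_{t−1} + W_g x_t)     (candidate information)
j_t = g_t ⊙ i_t
c_t = j_t + k_t                       (new context / cell state)
o_t = σ(U_o h_{t−1} + W_o x_t)        (output gate)
h_t = o_t ⊙ tanh(c_t)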
Gated Units, Layers and Networks
Thank you
