2018/19—Sem I
• Information Retrieval (IR) provides a list of potentially relevant documents in response
to a user's query.
[Figure: IR system architecture. Components: User Interface, Text Operations, Searching,
Index, Text Database, Ranking. Flow labels: user need, text, retrieved docs, ranked docs.]
• Depending on how index terms are treated, there are three classic IR models: Boolean,
Vector and Probabilistic models.
• Based on set theoretic (set theory and Boolean algebra) concepts.
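The set-theoretic view of the Boolean model can be illustrated with a small sketch. It assumes a made-up three-document collection and a simple inverted index; the names `build_index` and `postings` are illustrative, not part of any particular IR system.

```python
# Toy collection; the document texts are made up for illustration.
docs = {
    1: "information retrieval ranks documents",
    2: "machine translation of documents",
    3: "information extraction from text",
}

def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)
    return index

index = build_index(docs)
all_ids = set(docs)

def postings(term):
    """Return the set of documents containing the term (empty if unseen)."""
    return index.get(term, set())

# Boolean operators map directly onto set operations.
both = postings("information") & postings("documents")      # AND
either = postings("retrieval") | postings("translation")    # OR
without = postings("information") - postings("retrieval")   # AND NOT
negated = all_ids - postings("documents")                   # NOT

print(both, either, without, negated)
```

A document either satisfies the query or it does not, so this sketch returns unranked sets.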
♦ These term weights are used to compute the degree of similarity between each
document stored in the system and the user query.
♦ Retrieved documents can be sorted in decreasing order of similarity to get a ranked
list of documents.
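A minimal sketch of similarity-based ranking follows. It assumes raw term frequencies as the term weights and cosine similarity as the similarity measure; the source does not fix either choice, and practical systems often use other weights such as tf-idf. The documents and the query are made up.

```python
import math
from collections import Counter

def tf_vector(text):
    """Represent a text as a bag of words weighted by raw term frequency."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (similarity, doc_id) pairs sorted in decreasing order of similarity."""
    q = tf_vector(query)
    scored = [(cosine(q, tf_vector(text)), doc_id) for doc_id, text in docs.items()]
    return sorted(scored, reverse=True)

docs = {
    "d1": "information retrieval ranks documents",
    "d2": "machine translation of documents",
    "d3": "speech recognition",
}
ranking = rank("information retrieval", docs)
print(ranking)
```

Unlike Boolean matching, a document that shares only some query terms still receives a non-zero score, which is what makes ranking possible.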
• Advantages:
• Captures the IR problem with the assumption that, for a given user query, there is a set
of documents which contains exactly the relevant documents and no others.
♦ This set of documents is the ideal answer set.
♦ Given the description of this ideal answer set, there would be no problem in
retrieving its documents.
• The querying process can then be thought of as a process of specifying the properties
of an ideal answer set.
♦ Initially guess the properties (we can start by using index terms).
♦ User feedback is then initiated and used to improve the probability that the
user will find document d with query q.
♦ Measure of document similarity to the query:
sim(d, q) = P(d relevant-to q) / P(d non-relevant-to q)
• The main advantage is that documents are ranked in decreasing order of their
probability of being relevant.
• The performance of IR systems can be evaluated by using two commonly used metrics:
precision and recall.
• Recall is the fraction of the relevant documents that have been retrieved.
• Precision is the fraction of the retrieved documents that are relevant.
Recall = |relevant ∩ retrieved| / |relevant|
Precision = |relevant ∩ retrieved| / |retrieved|
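The two metrics can be computed directly from the sets of relevant and retrieved documents. The sketch below uses made-up document ids; the function name `precision_recall` is illustrative.

```python
def precision_recall(relevant, retrieved):
    """Compute (precision, recall) from two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgement: four documents are relevant, the system returned three.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d1", "d2", "d5"}
p, r = precision_recall(relevant, retrieved)
print(p, r)
```

Here two of the three retrieved documents are relevant (precision 2/3), and two of the four relevant documents were found (recall 1/2).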
• NLP is widely used to improve the performance of IR systems.
♦ Stop word removal: filters out words with very low discrimination value for
retrieval purposes.
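Stop word removal can be sketched in a few lines. The stop list below is a tiny illustrative assumption; real systems use much longer, language-specific lists.

```python
# Tiny illustrative stop list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

terms = remove_stop_words(
    "The performance of the system is measured in precision and recall"
)
print(terms)
```

The remaining words are the ones that discriminate between documents and are therefore worth indexing.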
• Example:
Text:
Firm XYZ is a full service advertising agency specializing in direct and
interactive marketing. Located in Bole, Addis Ababa, Firm XYZ is looking for
an Assistant Account Manager to help manage and coordinate interactive
marketing initiatives. Experience in online marketing and/or the advertising
field is a plus. Depending on the experiences of the applicants, the company
pays an attractive salary of Birr 3,000 - Birr 5,000 per month.
Extracted Information:
INDUSTRY: Advertising
POSITION: Assistant Account Manager
LOCATION: Bole, Addis Ababa.
COMPANY: Firm XYZ
SALARY: Birr 3,000 - Birr 5,000 per month
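One simple way to fill such a template is with hand-written patterns. The sketch below is an assumption about how this could be done for this single advertisement; the regular expressions are tailored to the example text and would not generalize without more work.

```python
import re

text = (
    "Firm XYZ is a full service advertising agency specializing in direct and "
    "interactive marketing. Located in Bole, Addis Ababa, Firm XYZ is looking for "
    "an Assistant Account Manager to help manage and coordinate interactive "
    "marketing initiatives. Experience in online marketing and/or the advertising "
    "field is a plus. Depending on the experiences of the applicants, the company "
    "pays an attractive salary of Birr 3,000- Birr 5,000 per month."
)

# Hypothetical extraction patterns, one per slot of the template.
PATTERNS = {
    "INDUSTRY": r"(\w+) agency",
    "POSITION": r"looking for an? ([A-Z][a-z]+(?: [A-Z][a-z]+)*)",
    "LOCATION": r"Located in ([^,]+, [^,]+),",
    "COMPANY": r"Firm [A-Z]+",
    "SALARY": r"Birr [\d,]+\s*-\s*Birr [\d,]+ per month",
}

def extract(text):
    """Fill each slot with the first match of its pattern, if any."""
    record = {}
    for slot, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            # Use the first capture group when the pattern defines one.
            record[slot] = match.group(1) if match.groups() else match.group(0)
    if "INDUSTRY" in record:
        record["INDUSTRY"] = record["INDUSTRY"].capitalize()
    return record

record = extract(text)
for slot, value in record.items():
    print(f"{slot}: {value}")
```

Patterns like these only work because the domain (job advertisements) is narrow and the phrasing is predictable.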
• IE is applied to a narrowly restricted domain.
♦ Quantities: three quintals of teff, 3000 Birr, ሶስት ኩንታል ጤፍ, 3ሺ ብር, etc.
♦ etc.
• Relation Detection and Classification involves identification of relations between entities.
• Examples:
Examples:
He was born on October 2, 1938.
He was born in the middle of the Second Italo-Ethiopian War.
He was born two years after the Second Italo-Ethiopian War broke out.
The pump circulates the water every 2 hours.
♦ Temporal Normalization
• Example:
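As an illustration of temporal normalization, the sketch below detects absolute dates written like "October 2, 1938" and maps them to ISO 8601 form. It is a minimal sketch under stated assumptions: it handles only this one date format with English month names, and relative expressions (such as "two years after ...") or recurring ones (such as "every 2 hours") would need anchoring and richer rules that are not shown here.

```python
import re
from datetime import datetime

# Matches absolute dates such as "October 2, 1938" (English month names only).
DATE_PATTERN = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{1,2}, \d{4}\b"
)

def normalize_dates(sentence):
    """Return (expression, ISO 8601 date) pairs for absolute dates in the sentence."""
    results = []
    for match in DATE_PATTERN.finditer(sentence):
        expression = match.group(0)
        try:
            value = datetime.strptime(expression, "%B %d, %Y").date()
        except ValueError:
            continue  # looks like a date but is not valid, e.g. "February 31, 2000"
        results.append((expression, value.isoformat()))
    return results

print(normalize_dates("He was born on October 2, 1938."))
print(normalize_dates("The pump circulates the water every 2 hours."))
```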
• Machine Translation (MT) refers to the translation of texts from one natural language to
another by means of a computerized system.
• The process of direct translation involves morphological analysis, lexical transfer, local
reordering, and morphological generation.
• Pros:
♦ Fast
♦ Simple
♦ Inexpensive
• Cons:
♦ Unreliable
♦ Not powerful
♦ Rule proliferation
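The four stages of direct translation can be sketched as a toy English-to-Amharic pipeline. Everything here is an illustrative assumption: the analysis table and lexicon cover only a few words, the word pairs are taken from the English-Amharic example sentences in these notes, and the "generation" step is trivial because the toy lexicon already stores the inflected past-tense forms.

```python
# Toy resources covering a handful of words; purely illustrative.
ANALYSES = {"ate": ("eat", "PAST"), "bought": ("buy", "PAST")}
LEXICON = {"Abebe": "አበበ", "besso": "በሶ", "eat": "በላ", "buy": "ገዛ"}

def analyse(tokens):
    """Morphological analysis: map each word to (lemma, tense or None)."""
    return [ANALYSES.get(t, (t, None)) for t in tokens]

def transfer(analysed):
    """Lexical transfer: replace each lemma by its target-language entry."""
    return [(LEXICON[lemma], tense) for lemma, tense in analysed]

def reorder(items):
    """Local reordering: move the verb (the item carrying tense) to the end (SVO to SOV)."""
    verbs = [item for item in items if item[1] is not None]
    others = [item for item in items if item[1] is None]
    return others + verbs

def generate(items):
    """Morphological generation: here it only joins the words and adds the full stop."""
    return " ".join(word for word, _ in items) + "።"

def translate(sentence):
    tokens = sentence.rstrip(".").split()
    return generate(reorder(transfer(analyse(tokens))))

print(translate("Abebe ate besso."))
print(translate("Abebe bought besso."))
```

Each new construction needs another rule or lexicon entry, which hints at why rule proliferation is listed among the drawbacks.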
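[Figure: syntactic transfer. The source-language syntactic structure (S → NP VP, for a
sentence beginning "Abebe broke" followed by Det N) is mapped by structure transfer to the
target-language syntactic structure (S → NP VP, with the verb last: N N V).]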
♦ Offers the ability to deal with more complex source language phenomena than
the direct approach.
• Cons:
EVENT: breaking
TENSE: past
AGENT: Abebe
PATIENT: window
DEFINITENESS: definite
[Figure: Analysis → Interlingua → Generation]
• Interlingual translation is suitable for multilingual machine translation. Its main
drawback is that defining an interlingua is difficult, and may even be impossible, for a
wider domain.
• Parameters of probabilistic models are derived from the analysis of bilingual text
corpora.
Bilingual Corpus of English and Amharic (Example)
English                    Amharic
Abebe ate besso.           አበበ በሶ በላ።
Abebe bought besso.        አበበ በሶ ገዛ።
Abebe threw the stone.     አበበ ድንጋዩን ወረወረው።
Abebe went to school.      አበበ ወደ ትምህርት ቤት ሄደ።
Kebede bought a car.       ከበደ መኪና ገዛ።
Kebede bought the car.     ከበደ መኪናውን ገዛው።
Kebede bought the car.     ከበደ መኪናዋን ገዛት።
Almaz made tea.            አልማዝ ሻይ አፈላች።
Almaz made tella.          አልማዝ ጠላ ጠነሰሰች።
[Figure: source text → target text]
• A language model tries to ensure that words come in the right order.
♦ It captures some notion of grammaticality.
• Given an English string e, a language model assigns it a probability p(e).
♦ A good English string gets a high p(e); a bad English string gets a low p(e).
♦ p(e) can be calculated with:
  - a statistical grammar such as a probabilistic context-free grammar; or
  - an n-gram language model.
• Trigram probabilities
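A trigram model estimates the probability of each word given the two preceding words, p(w3 | w1, w2) ≈ count(w1 w2 w3) / count(w1 w2), and multiplies these estimates over the sentence. The sketch below is a minimal illustration trained on four short English sentences; the add-one smoothing and the `<s>` / `</s>` boundary markers are assumptions made here so that unseen trigrams do not get zero probability.

```python
from collections import Counter

corpus = [
    "Abebe ate besso",
    "Abebe bought besso",
    "Kebede bought a car",
    "Almaz made tea",
]

def trigrams(sentence):
    """Pad the sentence with boundary markers and list its trigrams."""
    tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

trigram_counts = Counter()
context_counts = Counter()
vocab = {"</s>"}
for sentence in corpus:
    vocab.update(sentence.split())
    for w1, w2, w3 in trigrams(sentence):
        trigram_counts[(w1, w2, w3)] += 1
        context_counts[(w1, w2)] += 1

def p_trigram(w1, w2, w3, k=1.0):
    """Add-k smoothed estimate of p(w3 | w1, w2)."""
    return (trigram_counts[(w1, w2, w3)] + k) / (context_counts[(w1, w2)] + k * len(vocab))

def p_sentence(sentence):
    """Probability of a sentence as the product of its trigram probabilities."""
    p = 1.0
    for w1, w2, w3 in trigrams(sentence):
        p *= p_trigram(w1, w2, w3)
    return p

good = p_sentence("Abebe bought besso")
bad = p_sentence("besso Abebe bought")
print(good > bad)
```

The well-ordered string scores higher than the scrambled one because its trigrams were seen in training.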
♦ Bayes’ rule:
p(e|f) = p(f|e) * p(e) / p(f)
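Applied to translation, Bayes' rule gives the usual decomposition into a translation model and a language model. In the short derivation below, f is the source sentence and e ranges over candidate translations; the hat notation for the best translation is introduced here for illustration. Since p(f) is the same for every candidate e, it can be dropped from the maximization.

```latex
\begin{align*}
\hat{e} &= \arg\max_{e} \, p(e \mid f) \\
        &= \arg\max_{e} \, \frac{p(f \mid e)\, p(e)}{p(f)} \\
        &= \arg\max_{e} \, p(f \mid e)\, p(e)
\end{align*}
```

Here p(f|e) is the translation model probability and p(e) is the language model probability.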
[Figure: the source sentence አበበ በሶ በላ። is mapped to the target text e: Abebe ate besso.]
• A decoder searches for the best sequence of transformations that translates a source
sentence.
♦ Look up all translations of every source word or phrase, using word or phrase
translation table.
♦ Recombine the target-language phrases so as to maximize the translation model
probability * the language model probability.
♦ The search over all possible combinations can get very large, so we need to find
ways of limiting the search space.
• Decoding is, therefore, a search problem that can be reformulated as a classic
Artificial Intelligence problem, i.e. searching for the shortest path in an implicit graph.
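The decoding steps above can be sketched as a small beam search. This is a toy illustration, not how any particular decoder is implemented: the word translation table, the bigram language model, and all probabilities are made-up assumptions, and translation is word by word rather than phrase by phrase.

```python
import math

# Hypothetical translation model scores for each source word (made-up values).
TRANSLATIONS = {
    "አበበ": {"Abebe": 1.0},
    "በሶ": {"besso": 0.9, "barley": 0.1},
    "በላ": {"ate": 0.8, "eats": 0.2},
}

# Hypothetical bigram language model; unseen bigrams get a small floor probability.
BIGRAMS = {
    ("<s>", "Abebe"): 0.5,
    ("Abebe", "ate"): 0.4,
    ("ate", "besso"): 0.3,
    ("besso", "</s>"): 0.5,
}
FLOOR = 0.001

def lm(prev, word):
    return BIGRAMS.get((prev, word), FLOOR)

def decode(source, beam_size=3):
    """Return the best-scoring target word sequence found by beam search."""
    # A hypothesis is (log score, covered source positions, target words so far).
    beam = [(0.0, frozenset(), ("<s>",))]
    for _ in source:
        candidates = []
        for score, covered, output in beam:
            for i, word in enumerate(source):
                if i in covered:
                    continue
                for target, tm in TRANSLATIONS[word].items():
                    new_score = score + math.log(tm) + math.log(lm(output[-1], target))
                    candidates.append((new_score, covered | {i}, output + (target,)))
        # Pruning: keep only the best few hypotheses to limit the search space.
        beam = sorted(candidates, key=lambda h: h[0], reverse=True)[:beam_size]
    finished = [(s + math.log(lm(out[-1], "</s>")), out[1:]) for s, _, out in beam]
    return max(finished)[1]

print(decode(["አበበ", "በሶ", "በላ"]))
```

Working with log probabilities turns the product of model scores into a sum, which is what allows the search to be read as a shortest-path problem (the lowest total cost corresponds to the highest probability).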
• Pros:
♦ Has a way of dealing with lexical ambiguity
♦ Can deal with idioms that occur in the training data
♦ Can be built for any language pair that has enough training data (language
independent)
♦ No need for language experts (requires minimal human effort)
• Cons:
♦ Does not explicitly deal with syntax
• Economic reasons:
♦ Low cost
♦ Rapid prototyping
• Practical reasons:
♦ Many language pairs don't have NLP resources, but do have parallel corpora
• Quality reasons:
♦ Uses chunks of human translated text as its building blocks
♦ Produces state-of-the-art results (when very large data sets are available)
• Parallel corpus
♦ For example, Negarit Gazette (ነጋሪት ጋዜጣ) is a useful resource for English-to-
Amharic machine translation or vice versa.
• Decoder
♦ For example, Pharaoh (phrase-based decoder that builds phrase tables from
Giza++ word alignments and produces best translation for new input using the
phrase table plus SRILM language model)
• Fundamental idea:
♦ People do not translate by doing deep linguistic analysis of a sentence.
♦ They translate by decomposing a sentence into fragments, translating each of
those, and then composing the translated fragments properly.
• Uses the principle of analogy in translation
• Example:
Given the following translations:
የመጽሃፉ ዋጋ ከ500 ብር በላይ ነው The price of the book is more than 500 Birr
የቤቱ ዋጋ ከ500 ብር በላይ ነው The price of the house is more than 500 Birr
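The analogy between the two example pairs can be made explicit by comparing them token by token: whatever differs on the source side is aligned with whatever differs on the target side, and the shared remainder becomes a reusable template. The sketch below is a crude illustration under the assumption that the compared sentences have equal length and differ in exactly one token; the alignment is only as fine as whitespace tokens allow.

```python
def differing_tokens(sentence_a, sentence_b):
    """Return the token pairs at positions where two equal-length sentences differ."""
    a, b = sentence_a.split(), sentence_b.split()
    if len(a) != len(b):
        raise ValueError("this sketch only handles sentences of equal length")
    return [(x, y) for x, y in zip(a, b) if x != y]

examples = [
    ("የመጽሃፉ ዋጋ ከ500 ብር በላይ ነው", "The price of the book is more than 500 Birr"),
    ("የቤቱ ዋጋ ከ500 ብር በላይ ነው", "The price of the house is more than 500 Birr"),
]

(src_a, tgt_a), (src_b, tgt_b) = examples
source_diff = differing_tokens(src_a, src_b)   # [("የመጽሃፉ", "የቤቱ")]
target_diff = differing_tokens(tgt_a, tgt_b)   # [("book", "house")]

# If exactly one token differs on each side, align the differing fragments.
fragments = {}
if len(source_diff) == 1 and len(target_diff) == 1:
    fragments[source_diff[0][0]] = target_diff[0][0]
    fragments[source_diff[0][1]] = target_diff[0][1]

# The shared remainder becomes a template with a slot X.
template_src = src_a.replace(source_diff[0][0], "X")
template_tgt = tgt_a.replace(target_diff[0][0], "X")

print(fragments)
print(template_src, "<->", template_tgt)
```

A new sentence that matches the source template could then be translated by filling the slot with a known fragment translation.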
• Locating similar sentences
• Aligning sub-sentential fragments
• Combining multiple fragments of example translations into a single sentence
• Determining when it is appropriate to substitute one fragment for another
• Selecting the best translation out of many candidates
• Pros:
♦ Uses fragments of human translations which can result in higher quality
• Cons:
♦ May have limited coverage depending on the size of the example database, and
flexibility of matching heuristics
• Machine translation systems discussed so far have their own pros and cons.
• Hybrid systems exploit the synergy of rule-based, statistical and example-based
machine translation.
♦ Rules can be post-processed by statistics and/or examples
♦ Statistics guided by rules and/or examples
• Example:
Segment 1: translated by rule-based MT
Segment 2: translated by example-based MT
• Natural Languages (NLs) are increasingly becoming important interface styles in
Human-Computer Interaction (HCI).
• The growing popularity of natural language interfaces is due to the growing need for
people to interact with computer systems in order to:
♦ get answers for real world questions; or
♦ make conversation in a coherent way about various topics.
• Two of the most important applications of NLP that deal with such issues are Question
Answering and Dialogue Systems.
• Question Answering (QA) System:
♦ A system that provides an answer, or a text containing the answer, to a question
formulated in natural language.
• Dialogue System (DS):
♦ A system that converses with human beings in a coherent way.
♦ An extension of QA system, i.e., a two-way QA system.
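[Figure: a QA system takes a question and returns an answer drawn from a document
collection; a dialogue system exchanges questions and answers in both directions and can
issue a clarification request.]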
• The modality of Dialogue Systems can be text-based, spoken-dialogue, graphical user
interface, or multi-modal.
• Text-Based:
♦ The conversation is made by making use of natural language texts.
♦ For example, ELIZA.
• Spoken Dialogue:
♦ The conversation is made by making use of voice.
♦ For example, HMIHY (how may I help you) developed at AT&T for call routing.
• Graphical User Interface:
♦ The conversation is made by making use of images.
♦ For example, Dialogue Boxes in Windows applications.
• Multimodal:
♦ The conversation is made by any combination of the above three modalities.
• Dialogue Systems differ in the degree to which the human or the computer takes the
initiative.
• Computer-Initiative:
♦ Computer maintains tight control
♦ Human is highly restricted
♦ E.g., Dialogue Boxes
• Human-Initiative:
♦ Human maintains tight control
♦ Computer is highly restricted
♦ E.g., ELIZA
• Mixed-Initiative:
♦ Human and computer have flexibility to specify constraints
♦ Mainly research prototypes
• Currently, Dialogue Systems are used in specific domains such as:
♦ Customer service: Responding to customers' general questions about products
and services, e.g., answering questions about applying for a bank loan.
♦ Help desk: Responding to internal employee questions, e.g., responding to
human resource questions.
♦ Website navigation: Guiding customers to relevant portions of complex
websites, e.g., helping people determine where information or services reside
on a company's website.
♦ Guided selling: Providing answers and guidance in the sales process,
particularly for complex products being sold to novice customers.
♦ Technical support: Responding to technical problems, such as diagnosing a
problem with a device.
General Architecture of Spoken Dialogue System
[Figure: the architecture includes an I/O Server, a Dialogue Manager, and a Knowledge Base.]
SHOW:
  FLIGHTS:
    ORIGIN:
      CITY: Addis Ababa
      DATE: Tuesday
      TIME: morning
    DESTINATION:
      CITY: London
      DATE:
      TIME:
• “Frame and slot semantics” can be generated by a semantic grammar.
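A very small semantic grammar can be approximated with patterns that tie surface phrases directly to frame slots. The sketch below is an assumption about how the flight frame could be filled: the rules, the example utterance, and the choice to attach the date and time to the origin are all illustrative.

```python
import re

# Hypothetical semantic grammar: each rule ties a surface pattern to a frame slot.
CITY = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
DAY = r"Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday"
RULES = [
    (("ORIGIN", "CITY"), rf"from ({CITY})"),
    (("DESTINATION", "CITY"), rf"to ({CITY})"),
    (("ORIGIN", "DATE"), rf"on ({DAY})"),
    (("ORIGIN", "TIME"), r"\b(morning|afternoon|evening)\b"),
]

def parse(utterance):
    """Fill the flight frame from an utterance; unfilled slots stay None."""
    frame = {"SHOW": {"FLIGHTS": {
        "ORIGIN": {"CITY": None, "DATE": None, "TIME": None},
        "DESTINATION": {"CITY": None, "DATE": None, "TIME": None},
    }}}
    flights = frame["SHOW"]["FLIGHTS"]
    for (section, slot), pattern in RULES:
        match = re.search(pattern, utterance)
        if match:
            flights[section][slot] = match.group(1)
    return frame

frame = parse("Show me morning flights from Addis Ababa to London on Tuesday")
print(frame)
```

Slots that remain empty (here the destination date and time) are what a dialogue manager would typically ask the user about next.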