Natural Language Processing (1)

Zhao Hai 赵海
Department of Computer Science and Engineering
Shanghai Jiao Tong University

2010-2011 
zhaohai@cs.sjtu.edu.cn

Outline

• Course Goals
• Course Schedule
• Course Requirements
• Overview

21/7/10 09:26 2
Course Goals

• Introduce the know-how of NLP, especially NLU, including research
highlights, crucial technologies and application achievements;
• Train students to read and evaluate new academic papers from an
important international conference in related areas, such as the ACL
conference;
• Encourage students to present and discuss their comments on the papers;
• Accomplish a practical NLP system through a course project.

Course Schedule (1)
1. Overview (2 lhs = 2 lecture hours)

1.1 Natural Language Understanding (NLU)
1.2 Different Levels of Language Analysis
1.3 Applied Approaches in NLU Systems
1.4 Applications of NLU

Course Schedule (2)
2. Lexicons and Lexical Analysis (11 lhs)

2.1 Lexicon: A Language Resource
2.2 A Lexicon for English Words: WordNet
2.3 Generative Lexicon
2.4 Finite State Models and Morphological Analysis
2.5 Collocations
2.6 Statistical Inference: n-gram Models over Sparse Data

Course Schedule (3)
3. Syntactic Processing (14 lhs)

3.1 Basic English Syntax
3.2 Grammars and Parsing
3.3 Features and Augmented Grammars
3.4 Grammars for Natural Language
3.5 Toward Efficient Parsing
3.6 Ambiguity Resolution: Statistical Methods

Course Schedule (4)
4. Semantic Interpretation (6 lhs)

4.1 Semantics and Logical Form
4.2 Linking Syntax and Semantics
4.3 Ambiguity Resolution
4.4 Other Strategies for Semantic Interpretation

Course Schedule (5)
5. Machine Learning Approaches for Natural Language Processing (6 lhs)

5.1 Main machine learning approaches
  Maximum entropy
  K-nearest neighbor
  Support vector machine
  Structure learning
5.2 A Case Study: training a part-of-speech tagger from a labeled corpus

Course Schedule (6)
6. Course Discussion (1 lh)

6.1 Discussion of the Given Course Content
6.2 How to Prepare for the Paper Reading
6.3 Other Related Issues

Course Schedule (7)
7. Student Workshop (2 lhs)

7.1 ACL/EMNLP Paper Reading Groups
7.2 Summary and Comment Writing
7.3 Presentation and Discussion

Course Schedule (8)
Curriculum Schedule

Time: The 3rd and 4th classes, Monday morning;
The 1st and 2nd classes, Wednesday morning;
The 1st-8th weeks.
Location:

Course Requirements (1)
1. Final Grade

• Attendance and Assignments: 30%
• ACL/EMNLP Paper Summary, Comment and Presentation: 30%
• Course Project: 40%

Course Requirements (2)
2. Texts and References

• James Allen. Natural Language Understanding (2nd ed.).
The Benjamin/Cummings Publishing Company, Inc., 1995.
• Christopher D. Manning and Hinrich Schütze. Foundations of
Statistical Natural Language Processing. The MIT Press, 1999.

Course Requirements (3)

• ACL Anthology
  http://www.aclweb.org/anthology-new/
• Other Related References.

Course Requirements (4)
4. FTP Site and Contact Email

• Old version: ftp://ftp.cs.sjtu.edu.cn/yao-tf/nlu/
• The latest: http://bcmi.sjtu.edu.cn/~zhaohai/lessons/nlp2011/index.html
• Email: zhaohai@cs.sjtu.edu.cn

Overview (1)
Natural Language Understanding (1)
What is Natural Language?

• Natural language means human language. The most common way that
people communicate is by speaking or writing in one of the natural
languages, such as English, Chinese, German, or French.
• Natural language has two forms: written and spoken.

Overview (2)
Natural Language Understanding (2)
NLP & NLU (1)

• NLP (Natural Language Processing) covers all methods for the pure
processing of language by algorithmic, statistical, heuristic and
other means.
• NLU (Natural Language Understanding) refers to the real
understanding of a text that is formulated in some natural language.

Semantic or pragmatic issue?

Overview (3)
Natural Language Understanding (3)
NLP & NLU (2)

• Information Retrieval (IR) → NLP
• Information Extraction (IE) → NLP with NLU (Shallow Parsing)
• Summarization → NLP with NLU (Shallow and Deep Parsing)
• Question Answering (QA) → NLP with NLU (Shallow or Deep Parsing)
• Machine Translation (MT) → NLP with NLU (Deep Parsing)
• Natural Language Generation (NLG) → NLP
• …

Overview (4)
Natural Language Understanding (4)
Why is NLU a Difficult Task? (1)

• Complexity of the target representation into which the matching is
being done
In fact, understanding natural language means transforming it from one
representation into another. Extracting meaningful information from the
source representation often requires the use of additional knowledge.

Overview (5)
Natural Language Understanding (5)
Why is NLU a Difficult Task? (2)

• Type of mapping
Mappings may be one-to-one, many-to-one, one-to-many, or many-to-many.
One-to-many mappings require a great deal of domain knowledge beyond
the input to make the correct choice among target representations.
For example (one-to-many): a) a tall giraffe vs. b) a tall poodle
(a poodle is a small dog with thick, curly hair and a proud bearing).
The word “tall” stands for a very different height in each case.

Overview (6)
Natural Language Understanding (6)
Why is NLU a Difficult Task? (3)

• Level of interaction of the components of the source representation
In many natural language sentences, changing a single word can
alter the interpretation of the entire structure. As the number of
interactions increases, so does the complexity of the mapping.

Overview (7)
Natural Language Understanding (7)
Why is NLU a Difficult Task? (4)

• Modifier attachment problem
The sentence “Give me all the employees in a division making more than
$50,000” does not make clear whether the speaker wants all employees
making more than $50,000, or only those in divisions making more than
$50,000.
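The two readings can be made concrete as two different database queries. The sketch below is only an illustration: the employee records, division figures and function names are all invented for the example and do not come from any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Employee:
    name: str
    division: str
    salary: int


# Hypothetical toy data: names, divisions and figures are invented.
EMPLOYEES = [
    Employee("Ann", "auto", 60_000),
    Employee("Bob", "auto", 40_000),
    Employee("Cy", "aircraft", 30_000),
]
DIVISION_EARNINGS = {"auto": 45_000, "aircraft": 80_000}
THRESHOLD = 50_000


def employees_making_more(employees, threshold):
    """Reading 1: 'making more than ...' attaches to 'employees'."""
    return [e.name for e in employees if e.salary > threshold]


def employees_in_divisions_making_more(employees, earnings, threshold):
    """Reading 2: 'making more than ...' attaches to 'division'."""
    return [e.name for e in employees if earnings[e.division] > threshold]


reading_1 = employees_making_more(EMPLOYEES, THRESHOLD)
reading_2 = employees_in_divisions_making_more(EMPLOYEES, DIVISION_EARNINGS, THRESHOLD)
print(reading_1)  # ['Ann']
print(reading_2)  # ['Cy']
```

The same English sentence yields two disjoint answers on this toy data, which is why the system has to resolve the attachment before it can query anything.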

Overview (8)
Natural Language Understanding (8)
Why is NLU a Difficult Task? (5)

• Quantifier scoping problem
In logic, some words such as “the”, “each”, or “what” express
“universal” (∀) or “existential” (∃) quantification. They can have
several readings.
• Elliptical utterances
The interpretation of a query may depend on previous queries
and their interpretations. E.g., asking “Who is the manager of the
automobile division?” and then saying “Of aircraft?”
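To illustrate the quantifier scoping problem above, here are the two readings of a standard textbook sentence, “Every man loves a woman”. The sentence and the predicate names are chosen for illustration only and are not taken from the slides.

```latex
% Reading 1: the universal quantifier has wide scope
% (each man may love a different woman)
\forall x\,\bigl(\mathrm{man}(x) \rightarrow \exists y\,(\mathrm{woman}(y) \land \mathrm{loves}(x, y))\bigr)

% Reading 2: the existential quantifier has wide scope
% (there is one particular woman whom every man loves)
\exists y\,\bigl(\mathrm{woman}(y) \land \forall x\,(\mathrm{man}(x) \rightarrow \mathrm{loves}(x, y))\bigr)
```

Both formulas use exactly the same predicates; only the order of the quantifiers differs.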

Overview (9)
Natural Language Understanding (9)
Computational Linguistics (1)

Research in Computational Linguistics, the use of computers in the
study of languages, started soon after computers became available in
the 1940s. This discipline, along with AI and other disciplines,
promoted the progress of NLU.

Overview (10)
Natural Language Understanding (10)
Computational Linguistics (2)
[Figure: Computational Linguistics and related fields. Labels: Engineering
Science, Bioscience, Psychology, Cognitive Science, AI, Computer Science,
Philosophy, Linguistics]

Overview (11)
Natural Language Understanding (11)
Symbolic Processing

In NLU, we mainly use machines to manipulate different symbols.
E.g., computers were readily used on written text to compile:
• word indexes (lists of word occurrences) and
• concordances (indexes including a line of context for each
occurrence).
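Both structures above can be sketched in a few lines. This is a minimal illustration, not a historical implementation: the function names, the regular expression used for tokenization and the `width` parameter are all assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Return (lowercased word, start offset) pairs for every word in the text."""
    return [(m.group().lower(), m.start()) for m in re.finditer(r"[A-Za-z']+", text)]


def word_index(text):
    """Word index: map each word to the token positions where it occurs."""
    index = defaultdict(list)
    for position, (word, _) in enumerate(tokenize(text)):
        index[word].append(position)
    return dict(index)


def concordance(text, keyword, width=20):
    """Concordance: one line of surrounding context per occurrence of keyword."""
    lines = []
    for word, start in tokenize(text):
        if word == keyword.lower():
            end = start + len(word)
            left = text[max(0, start - width):start]
            right = text[end:end + width]
            lines.append(f"{left}[{text[start:end]}]{right}")
    return lines


sample = "Green frogs have large noses. Green ideas have large noses."
print(word_index(sample)["green"])  # [0, 5]
for line in concordance(sample, "have", width=12):
    print(line)
```

The word index only records where a word occurs, while the concordance also keeps a window of context around each occurrence.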

Overview (12)
Natural Language Understanding (12)
Machine Translation (1)

• In 1949, Warren Weaver proposed that computers might be
useful for “the solution of world-wide translation problems”.
• However, even after more than 50 years of effort, current
systems still produce output of limited quality, which is suitable
for assimilation of foreign-language documents, but not for the
production of publishable material.

Overview (13)
Natural Language Understanding (13)
Machine Translation (2)

• Through practice, researchers have realized that human
language translation is a complex cognitive ability involving
knowledge of different kinds:
  • the structure of sentences;
  • the meaning of words;
  • a model of the listener (user model);
  • the rules of conversation (dialogue translation);
  • an extensive shared body of general information about the world.

Overview (14)
Natural Language Understanding (14)
Machine Translation (3)

• Some forms of translation for information access are already
available on the web at no cost, e.g.
  http://babelfish.altavista.com/tr
  http://translate.google.com/?hl=zh-CN&tab=wT#auto|en|
The increasing demand for these services will give a push to improve
their quality;
• The translation providers will find ways to increase vocabularies and
translation quality semi-automatically from terminological resources,
bilingual corpora and similar sources.

Overview (15)
Natural Language Understanding (15)
Machine Translation (4)

• Clearly, any systematic collection of lexical and
terminological information in the form of domain-specific
ontologies will help to build better MT systems for these
domains.
• Conversely, the construction of ontologies can be facilitated
by automatic alignment of existing translations, as this will
naturally lead to a clustering of the vocabulary along the
relevant semantic distinctions.

Overview (16)
Natural Language Understanding (16)
Investigation Goals

AI researchers in natural language processing expected their
work to lead both to:
• the development of practical, useful language
understanding systems and
• a better understanding of language and the nature of
intelligence.

Overview (17)
Different Levels of Language Analysis (1)
Six Analysis Levels for Written Texts

• Morphological Analysis (Lexical Analysis)
• Syntactic Analysis (Deep & Shallow Parsing)
• Semantic Analysis
• Pragmatic Analysis
• Discourse Analysis (Text Analysis)
• World Knowledge Analysis (is it possible?)

Overview (18)
Different Levels of Language Analysis (2)
Morphological Analysis (1)

• It is the identification of a word stem from a full word form (and
sometimes also the identification of the syntactic category of the
stem).
• For example, the word “friendly” is formed from the noun (stem)
“friend” and the suffix “-ly”, which transforms a noun into an adjective.
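The “friendly” example can be reproduced with a very small suffix-stripping analyzer. This is a sketch under strong assumptions: the suffix table and the stem lexicon below are hypothetical and contain only a couple of entries, whereas a real morphological analyzer (for instance a finite-state one) would be far more complete.

```python
# Hypothetical suffix table: (suffix, category of the stem, category of the result).
SUFFIX_RULES = [
    ("ly", "noun", "adjective"),    # friend + -ly -> friendly
    ("ness", "adjective", "noun"),  # kind + -ness -> kindness
]

# Hypothetical mini-lexicon of known stems and their syntactic categories.
STEM_LEXICON = {"friend": "noun", "kind": "adjective"}


def analyze(word):
    """Return (stem, suffix, stem category, word category), or None if no rule applies."""
    for suffix, stem_category, word_category in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # Accept the split only if the stem is known with the expected category.
            if STEM_LEXICON.get(stem) == stem_category:
                return stem, "-" + suffix, stem_category, word_category
    return None


print(analyze("friendly"))  # ('friend', '-ly', 'noun', 'adjective')
print(analyze("family"))    # None
```

Checking the stem against a lexicon is what keeps the analyzer from splitting “family” into “fami” + “-ly”.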

Overview (19)
Different Levels of Language Analysis (3)
Morphological Analysis (2)

• Most systems that analyze natural language text typically start
by segmenting the text into meaningful tokens.
• In general, this procedure includes tokenization (segmentation),
normalization (stemming), POS (part-of-speech) tagging, and named
entity / phrase identification.
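The first three steps of this procedure can be sketched as a tiny pipeline. Everything here is a simplifying assumption: the tag lexicon is hypothetical, normalization only strips a plural “-s”, and tagging is a plain dictionary lookup rather than a trained tagger. Named entity / phrase identification is left out.

```python
import re

# Hypothetical tag lexicon; a real tagger would be trained on a labeled corpus.
TAG_LEXICON = {
    "green": "ADJ", "large": "ADJ",
    "frog": "NOUN", "nose": "NOUN",
    "have": "VERB",
}


def tokenize(text):
    """Tokenization: split text into word tokens and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def normalize(token):
    """Normalization: lowercase and strip a plural -s (a crude stand-in for stemming)."""
    token = token.lower()
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def tag(tokens):
    """POS tagging: look up each normalized token; unknown words get 'UNK'."""
    tagged = []
    for token in tokens:
        if not token[0].isalnum():
            tagged.append((token, "PUNCT"))
        else:
            tagged.append((token, TAG_LEXICON.get(normalize(token), "UNK")))
    return tagged


print(tag(tokenize("Green frogs have large noses.")))
```

Each stage consumes the output of the previous one, which is why errors made during tokenization tend to propagate to the later steps.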

Overview (20)
Different Levels of Language Analysis (4)
Syntactic Analysis (1)

• Its goal is to break down given textual units, e.g. sentences,
into smaller constituents, to assign categorical labels to them, and
to identify the grammatical relations that hold between the
various parts.
• In most parsers, the grammar is separated from the processing
components. The grammar consists of a lexicon and rules that
syntactically and semantically combine words and phrases into
larger phrases and sentences.

Overview (21)
Different Levels of Language Analysis (5)
Syntactic Analysis (2)

• The output of a shallow parser is less complete than that of
a deep parser; that is, it is not a full phrase-structure tree.
• A shallow parser may identify some phrasal constituents, such
as noun phrases, without indicating their internal structure or
their function in the sentence.
• It has the advantages of efficiency and robustness.
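A noun-phrase chunker is a typical shallow parser. The sketch below assumes POS-tagged input with a hypothetical tag set (DET, ADJ, NOUN, VERB, PUNCT) and simply groups runs of determiner, adjective and noun tags; it is one possible illustration, not a description of any particular system.

```python
def chunk_noun_phrases(tagged_tokens):
    """Return flat NP chunks: maximal runs of DET/ADJ/NOUN tags, trimmed to end in a NOUN."""
    chunks, current = [], []
    # The sentinel entry flushes a run that reaches the end of the sentence.
    for word, tag in list(tagged_tokens) + [("", "END")]:
        if tag in {"DET", "ADJ", "NOUN"}:
            current.append((word, tag))
            continue
        # Drop trailing determiners/adjectives so every chunk ends in a noun.
        while current and current[-1][1] != "NOUN":
            current.pop()
        if current:
            chunks.append(" ".join(w for w, _ in current))
        current = []
    return chunks


tagged = [
    ("The", "DET"), ("green", "ADJ"), ("frogs", "NOUN"), ("have", "VERB"),
    ("large", "ADJ"), ("noses", "NOUN"), (".", "PUNCT"),
]
print(chunk_noun_phrases(tagged))  # ['The green frogs', 'large noses']
```

The output is a flat list of phrases: nothing is said about the internal structure of each noun phrase or about which one is the subject, and the single left-to-right pass is what makes the approach fast and robust.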

Overview (22)
Different Levels of Language Analysis (6)
Syntactic Analysis (3)

• The challenge is to find syntactic parsers that are at the same
time fast and robust, that deliver a detailed analysis that is
correct with high probability, and that are easy to adapt to
special domains.
• One of the current research emphases is to integrate shallow
syntactic parsers with deeper syntactic approaches.

Overview (23)
Different Levels of Language Analysis (7)
Semantic Analysis (1)

• The goal of semantic analysis is to assign meanings to
utterances whose meaning is complete. This covers word meaning
and the combination of word meanings, and the result is a
context-independent meaning.

Overview (24)
Different Levels of Language Analysis (8)
Semantic Analysis (2)

• The task of semantic analysis can be divided into several
subtasks, depending on the linguistic level where it takes place.
• The most important subtasks are the semantic tagging of
ambiguous words and phrases, and the resolution of referring
expressions.

Overview (25)
Different Levels of Language Analysis (9)
Pragmatic Analysis

• It describes the relationships between the symbols of texts (talks)
and their producers / users.
• Note that here these are the writers / readers and speakers /
hearers who are present.
• In other words, the context of situation has a significant impact
on the interpretation of a discourse.

Overview (26)
Different Levels of Language Analysis (10)
Discourse Analysis

• Extracting the knowledge contained in texts requires more than
the resolution of local semantic ambiguities.
• Discourse analysis needs to consider the global argumentative
structure of texts. In addition, it also analyzes the relationships
between sentences in a text.
• This analysis is especially important for pronouns and temporal
constituents.

Overview (27)
Different Levels of Language Analysis (11)
World Knowledge Analysis

• It analyzes and infers the general world knowledge that each
language user must have, e.g. other users’ beliefs and goals in a
conversation.

Overview (28)
Different Levels of Language Analysis (12)
Examples

Consider each example below as a candidate for the initial
sentence of a book about natural language processing:
1. Language is one of the fundamental aspects of human behavior
and is a crucial component of our lives.
2. Green frogs have large noses.
3. Green ideas have large noses.
4. Large have green ideas nose.

Overview (29)
Applied Approaches in NLU Systems (1)
Historical Categories

Following Winograd (1972), NLU approaches can be grouped
according to how they represent and use knowledge of their
subject matter. On this basis, they can be divided into four
historical categories.

Overview (30)
Applied Approaches in NLU Systems (2)
Historical Categories

• The earliest approach, with limited results in specific,
constrained domains (BASEBALL, SAD-SAM, STUDENT and ELIZA);
• Text-based approach (PROTOSYNTHEX-I and Semantic Memory);
• Limited logic-based approach (SIR, TLC, DEACON and CONVERSE);
• Knowledge-based approach (LUNAR, SHRDLU, MARGIE, SAM and LIFER).

Overview (31)
Applied Approaches in NLU Systems (3)
BASEBALL [Bert Green, 1963]

An information retrieval program with a large database of facts
about all American League games over a given year. It accepted
input questions from the user, limited to one clause with no
logical connectives.

Overview (32)
Applied Approaches in NLU Systems (4)
SAD-SAM [Lindsay, 1963]

• Syntactic Appraiser and Diagrammer -- Semantic Analyzing
Machine. Programmed by Robert Lindsay in 1963 at CMU.
• It uses a Basic English vocabulary (1,700 words) and follows
a context-free grammar.
• It parses input from left to right, builds derivation trees, and
passes them to SAM, which extracts the semantically relevant
information to build family trees and find answers to questions.

Overview (33)
Applied Approaches in NLU Systems (5)
ELIZA [Weizenbaum, 1966]

• It was built at MIT in 1966 and was the most famous
pattern-matching natural language system. The system assumes the role
of a Rogerian, or “nondirective”, therapist in its dialog with the
user.
• It operated by matching the left sides of its rules against the
user’s last sentence, and using the appropriate right side to
generate a response. Rules were indexed by keywords so only a
few had to be matched against a particular sentence. Some rules
had no left side, so they could apply anywhere.

Overview (34)
Applied Approaches in NLU Systems (6)
ELIZA: Sample Data
Word    Rank  Pattern         Outputs

alike   10    ?X              In what way?
                              What resemblance do you see?
are     3     ?X are you ?Y   Would you prefer it if I weren’t ?Y?
        3     ?X are ?Y       What if they were not ?Y?
always  5     ?X              Can you think of a specific example?
                              When?
                              Really, always?
what    2     ?X              Why do you ask?
                              Does that interest you?
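The sample data above can be turned into a small keyword-driven matcher. This is a sketch, not Weizenbaum's program: it assumes that a higher rank means higher priority, it writes ?X and ?Y as regular expressions, and the `turn` parameter and the fallback reply are inventions for the example.

```python
import re

# Rules transcribed from the sample data: keyword -> (rank, [(pattern, outputs)]).
# Writing ?Y as a named regex group is an encoding chosen for this sketch.
RULES = {
    "alike": (10, [(r".*", ["In what way?", "What resemblance do you see?"])]),
    "are": (3, [
        (r".*\bare you (?P<Y>.+)", ["Would you prefer it if I weren't {Y}?"]),
        (r".*\bare (?P<Y>.+)", ["What if they were not {Y}?"]),
    ]),
    "always": (5, [(r".*", ["Can you think of a specific example?", "When?", "Really, always?"])]),
    "what": (2, [(r".*", ["Why do you ask?", "Does that interest you?"])]),
}


def respond(sentence, turn=0):
    """Use the highest-ranked keyword in the sentence and its first matching pattern."""
    text = sentence.lower().rstrip(".!?")
    words = set(re.findall(r"[a-z']+", text))
    # Sort by descending rank; ties are broken alphabetically for determinism.
    keywords = sorted((w for w in words if w in RULES), key=lambda w: (-RULES[w][0], w))
    for keyword in keywords:
        for pattern, outputs in RULES[keyword][1]:
            match = re.fullmatch(pattern, text)
            if match:
                # `turn` cycles through alternative outputs for the same rule.
                return outputs[turn % len(outputs)].format(**match.groupdict())
    return "Please go on."  # hypothetical fallback, not part of the sample data


print(respond("Men are all alike."))  # In what way?
print(respond("Dogs are loyal."))     # What if they were not loyal?
```

“Men are all alike.” contains both “are” and “alike”; under the assumed ranking the rule for “alike” wins, so the reply is “In what way?”.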

Overview (35)
Applied Approaches in NLU Systems (7)
ELIZA: A Dialogue
User:  Men are all alike.
ELIZA: In what way?
User:  They’re always bugging us about something or other.
ELIZA: Can you think of a specific example?
User:  Well, my boyfriend made me come here.
ELIZA: Your boyfriend made you come here.
User:  He says I’m depressed much of the time.
ELIZA: I am sorry to hear you are depressed.
User:  It’s true, I am unhappy.
ELIZA: Do you think coming here will help you not to be unhappy?
User:  I need some help; that much seems certain.
ELIZA: What would it mean to you if you got some help?
User:  Perhaps I could learn to get along with my mother.
ELIZA: Tell me more about your family.
User:  My mother takes care of me.
ELIZA: Who else in your family takes care of you?
User:  My father.
ELIZA: Your father.
User:  You are like my father in some ways.
ELIZA: What resemblance do you see?

Overview (36)
Applied Approaches in NLU Systems (8)
SIR [Bertram Raphael, 1968]

SIR (Semantic Information Retrieval) was a prototype
“understanding” machine, since it could accumulate facts and
then make deductions about them in order to answer questions.
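The idea of accumulating facts and then deducing answers can be sketched with a toy fact base. This is not SIR's actual design: the class name, the restricted “X is a Y” sentence form and the search strategy are all assumptions chosen to keep the example short.

```python
import re


class FactBase:
    """Accumulates 'X is a Y' facts and answers 'is X a Y?' by chaining them."""

    def __init__(self):
        self.parents = {}  # term -> set of direct superclasses

    def tell(self, sentence):
        """Store one fact given as '[A/An] X is a/an Y.'"""
        match = re.fullmatch(r"(?:an? )?(\w+) is an? (\w+)\.?", sentence.strip().lower())
        if not match:
            raise ValueError(f"unsupported statement: {sentence!r}")
        child, parent = match.groups()
        self.parents.setdefault(child, set()).add(parent)

    def ask(self, child, parent):
        """Deduce by transitivity: depth-first search over the stored is-a links."""
        stack, seen = [child.lower()], set()
        while stack:
            term = stack.pop()
            if term == parent.lower():
                return True
            if term not in seen:
                seen.add(term)
                stack.extend(self.parents.get(term, ()))
        return False


kb = FactBase()
kb.tell("John is a boy.")
kb.tell("A boy is a person.")
print(kb.ask("john", "person"))  # True
print(kb.ask("person", "john"))  # False
```

Neither input sentence states that John is a person; the answer is deduced by chaining the two stored facts.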

Overview (37)
Applied Approaches in NLU Systems (9)
LUNAR [William Woods, 1973] (1)

• LUNAR answered questions about the rock samples brought
back from the moon using two databases -- the chemical analyses
and the literature references.
• Specifically, it helped geologists access, compare, and
evaluate chemical analysis data on moon rock and soil
composition obtained from the Apollo 11 mission.

Overview (38)
Applied Approaches in NLU Systems (10)
LUNAR [William Woods, 1973] (2)

• It operated by translating a question entered in English into an
expression in a formal query language. The translation was done
with an ATN parser coupled with a rule-driven semantic
interpretation procedure.

Overview (39)
Applications of NLU (1)
Text-Based Applications

• Finding appropriate documents on certain topics from a
database of texts;
• Extracting information from messages or articles on certain
topics;
• Translating documents from one language to another;
• Summarizing texts for certain purposes.

Overview (40)
Applications of NLU (2)
Dialogue-Based Applications

• Question-answering systems, where natural language is used
to query a database;
• Automated customer service over the telephone;
• Tutoring systems, where the machine interacts with a student;
• General cooperative problem-solving systems.

Overview (41)
CL Research Topics (1)
Call for Papers from ACL-2006 (1)

• pragmatics, semantics, syntax, grammars and the lexicon; (NLU)
• phonetics, phonology and morphology; (NLU)
• lexical semantics and ontologies; (NLU)
• word segmentation, tagging and chunking; (NLU)
• parsing, generation and summarization; (NLU)
• language modeling, spoken language recognition and understanding; (NLU)
• linguistic, psychological and mathematical models of language;
• information retrieval, question answering, information extraction and
text mining; (NLU)

Overview (42)
CL Research Topics (2)
Call for Papers from ACL-2006 (2)

• machine learning for natural language; (NLU)
• corpus-based modeling of language, discourse and dialogue; (NLU)
• multi-lingual processing, machine translation and translation aids; (NLU)
• multi-modal and natural language interfaces and dialogue systems; (NLU)
• applications, tools and resources such as annotated corpora; and (NLU)
• evaluation of systems.

Overview (43)
CL Research Topics (3)
Call for Papers from ACL-2010 (1)

• Discourse, dialogue, and pragmatics
• Grammar engineering
• Information extraction
• Information retrieval
• Knowledge acquisition
• Large scale language processing
• Language generation
• Language processing in domains such as bioinformatics, legal, medical, etc.
• Language resources, evaluation methods and metrics, science of annotation
• Lexical/ontological/formal semantics
• Machine translation
• Mathematical linguistics, grammatical formalisms
• Mining from textual and spoken language data

Overview (44)
CL Research Topics (4)
Call for Papers from ACL-2010 (2)

• Multilingual language processing
• Multimodal language processing (including speech, gestures, and other
communication media)
• NLP applications and systems
• NLP on noisy unstructured text, such as emails, blogs, SMS
• Phonology/morphology, tagging and chunking, word segmentation
• Psycholinguistics
• Question answering
• Semantic role labeling
• Sentiment analysis and opinion mining
• Spoken language processing
• Statistical and machine learning methods
• Summarization
• Syntax, parsing, grammar induction
• Text mining
• Textual entailment and paraphrasing
• Topic and text classification
• Word sense disambiguation

Overview (45)
CL Research Topics (5)
Accepted Regular Paper Statistics for JSCL-2005

• lexical, syntactic, semantic and discourse analysis, 24 papers, 29.3%;
• resource building and related techniques, 12 papers, 14.6%;
• machine translation techniques, systems and evaluation, 8 papers, 9.7%;
• intelligent retrieval, 30 papers, 36.6%;
• others, 8 papers, 9.7%.

Overview (46)
Assignments (1)

1. Read the roadmap of NLP on the following website:
http://elsnet.dfki.de/roadmap.php?version=flash
