
Unit-1: Introduction

IS 7118: Natural Language Processing (NLP)


1st Year, 2nd Semester, M.Sc(IS)
(Slides are adapted from the textbook by Jurafsky & Martin)
Instructor: Prof. Rama Krishna Rao Bandaru
Syllabus
1) Introduction
2) Regular Expressions
3) Finite-State Methods
4) Morphology
5) N-gram Language models
6) Smoothing and Evaluation
7) Part of Speech Tagging
8) Context Free Grammar
9) Parsing
10) Semantics
(Brief introduction to Python and the NLTK toolkit)
IS 7118 NLP Unit-1: Introduction, 2
Prof. R.K.Rao Bandaru
Text Books
1. Speech and Language Processing: An
Introduction to Natural Language Processing,
Computational Linguistics, and Speech
Recognition, By Daniel Jurafsky and James
H. Martin, Prentice-Hall, 2000.

2. Natural Language Processing with Python,
By Edward Loper, Ewan Klein, and Steven
Bird, Stanford, July 2007
References
3) Foundations of Statistical Natural Language Processing,
By Christopher D. Manning and Hinrich Schütze
4) Natural Language Understanding, By James Allen.
5) “A Beginner’s Guide to Python” available at
http://wiki.python.org/moin/BeginnersGuide
6) Python 3 For Absolute Beginners by Tim Hall and J-P
Stacey, Apress publishing
7) Python Text Processing with NLTK 2.0 Cookbook,
By Jacob Perkins

Linguistic Committees & Conferences

• Proceedings of major conferences (related to Natural
Language Processing):
– ACL (Association for Computational Linguistics)
– European Chapter of the ACL
– COLING (International Conference on Computational
Linguistics)
– ANLP (Applied Natural Language Processing, by ACL)
– ACL SIGDAT, SIGNLL, and other SIG (Special Interest Group)
workshops, such as WVLC (Workshop on Very Large Corpora)
– EMNLP (Empirical Methods in Natural Language Processing)
– DARPA HLT (Defense Advanced Research Projects Agency
Human Language Technology Workshops)
Grading Policy
• Quiz 10%
• Assignment1 10%
• Assignment2 10%
• Paper Review 20%
• Final Exam 50%

✓ ATTENDANCE MANDATORY

Course Objective
• Objective: This Natural Language Processing (aka
Computational Linguistics or Human Language Technology)
course introduces computer systems that can interpret,
learn and generate natural languages.
• Goal: The goal of this new field is to get computers to
perform useful tasks involving human language, i.e., ‘to
acquire deep understanding of broad language (not just
string processing or keyword matching)’
– Note: This course deals only with text-based language
processing (not speech)

What is Natural Language?
• A language is a set of sentences that may be used as signals to
convey semantic information
• Natural language is a language that is spoken or written by
humans for general purpose communication.
• Natural language processing (NLP) is a field of computer
science and linguistics concerned with the interactions
between computers and human (natural) languages. (Wikipedia)
• Natural Language Processing covers everything a computer
needs in order to understand natural language and to
generate it.
• NLP encompasses a broad set of techniques for automated
generation, manipulation and analysis of natural or human
languages.
Why Should You Care?

Three trends
1. An enormous amount of information is now available
in machine-readable form as natural language text
(newspapers, web pages, medical records, financial
filings, etc.)
2. Conversational agents are becoming an important
form of human-computer communication
3. Much of human-human interaction is now mediated
by computers via social media

Applications
• Let’s take a quick look at a few important
application areas
– Text analytics
– Question answering
– Information extraction
– Machine translation
– Summarization
– Language comprehension
– Coreference, etc.

Text Analytics
• Data-mining of weblogs, microblogs, discussion forums,
message boards, user groups, and other forms of user generated
media
– Product marketing information
– Political opinion tracking
– Social network analysis
– Buzz analysis (what’s hot, what topics are people talking
about right now)
• Google processes 20 PB a day (2008)
• WaybackMachine has 3 PB + 100 TB/month (3/2009)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• 80% of data is unstructured (IBM, 2010)
• More importantly, the data is heterogeneous
Information Extraction & Sentiment Analysis
Attributes:
– zoom
– affordability
– size and weight
– flash
– ease of use
Review snippets for the attribute “size and weight”:
✓ • nice and compact to carry!
✓ • since the camera is small and light, I won't need to carry
around those heavy, bulky professional cameras either!
✗ • the camera feels flimsy, is plastic and very light in weight you
have to be very delicate in the handling of this camera
Information Extraction
Input email:
Subject: curriculum meeting
Date: January 15, 2012
To: Dan Jurafsky

Hi Dan, we’ve now scheduled the curriculum meeting.
It will be in Gates 159 tomorrow from 10:00-11:30.
-Chris

Extracted result (Create new Calendar entry):
Event: Curriculum mtg
Date: Jan-16-2012
Start: 10:00am
End: 11:30am
Where: Gates 159
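
To make the email-to-calendar example concrete, here is a minimal Python sketch that pulls the room, date, and time range out of the message body with one hand-written regular expression. The pattern, the function name `extract_event`, and the field names are assumptions made for illustration; real information extraction systems need far more robust patterns or trained models.

```python
import re
from datetime import date, timedelta

# Values taken from the example email on the slide.
EMAIL_DATE = date(2012, 1, 15)
BODY = ("Hi Dan, we've now scheduled the curriculum meeting. "
        "It will be in Gates 159 tomorrow from 10:00-11:30.")

# Hand-written pattern for this one sentence shape (an assumption,
# not a general solution): "in <Room> <number> today|tomorrow from H:MM-H:MM"
PATTERN = re.compile(
    r"in (?P<where>[A-Z]\w+ \d+) (?P<day>today|tomorrow) "
    r"from (?P<start>\d{1,2}:\d{2})-(?P<end>\d{1,2}:\d{2})"
)

def extract_event(body, sent_on):
    """Return a dict describing the meeting, or None if nothing matches."""
    match = PATTERN.search(body)
    if match is None:
        return None
    # Resolve the relative day word against the date the email was sent.
    offset = 1 if match.group("day") == "tomorrow" else 0
    return {
        "where": match.group("where"),
        "date": sent_on + timedelta(days=offset),
        "start": match.group("start"),
        "end": match.group("end"),
    }

event = extract_event(BODY, EMAIL_DATE)
print(event)
```

Note that "tomorrow" can only be resolved by combining the body with the email's Date header, which is why the sent date is passed in as a separate argument.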
Information Extraction

Question Answering
• Traditional information retrieval provides
documents/resources that provide users with
what they need to satisfy their information
needs.
• Question answering on the other hand directly
provides an answer to information needs posed
as questions.

Watson Jeopardy

QA/NL Interaction

Machine Translation

The automatic translation of texts between languages is one of the
oldest non-numerical applications in Computer Science.
In the past 10 years or so, MT has gone from a niche academic
curiosity to a robust commercial industry.

Machine Translation

• Helping human translators


• Fully automatic

Enter Source Text:

这 不过 是 一 个 时间 的 问题 .

Translation from Stanford’s Phrasal:

This is only a matter of time.

Machine Translation

Google Translate: Russian

English – Russian

Summarization

2013: Summly -> Yahoo!

Language Comprehension

Language Technology

mostly solved
• Spam detection
– Let’s go to Agra!
– Buy V1AGRA … ✗
• Part-of-speech (POS) tagging
– ADJ ADJ NOUN VERB ADV
– Colorless green ideas sleep furiously.
• Named entity recognition (NER)
– PERSON ORG LOC
– Einstein met with UN officials in Princeton

making good progress
• Sentiment analysis
– Best roast chicken in San Francisco!
– The waiter ignored us for 20 minutes.
• Coreference resolution
– Carter told Mubarak he shouldn’t run again.
• Word sense disambiguation (WSD)
– I need new batteries for my mouse.
• Parsing
– I can see Alcatraz from the window!
• Machine translation (MT)
– 第13届上海国际电影节开幕…
– The 13th Shanghai International Film Festival…
• Information extraction (IE)
– You’re invited to our dinner party, Friday May 27 at 8:30
– Party, May 27, add

still really hard
• Question answering (QA)
– Q. How effective is ibuprofen in reducing fever in patients with acute febrile illness?
• Paraphrase
– XYZ acquired ABC yesterday
– ABC has been taken over by XYZ
• Summarization
– The Dow Jones is up
– The S&P500 jumped
– Housing prices rose
– Economy is good
• Dialog
– Where is Citizen Kane playing in SF?
– Castro Theatre at 7:30. Do you want a ticket?
Example: A Spoken Dialogue System

Linguistic Knowledge
• Phonetics and Phonology
• Morphology
• Syntax
• Semantics
• Pragmatics
• Discourse

(Diagram labels, top to bottom: Discourse, Pragmatics, Semantics, Syntax)

Phonetics and Phonology
• The study of linguistic sounds and their
relations to words

Morphology
• The study of the internal structure of words and
how words can be modified.
– It concerns the way words are built up from
smaller meaning-bearing units
• Parsing complex words into their components

Syntax
• The study of the structural relationships between words in
a sentence.
– It concerns how words are put together to form
correct sentences and what structural role each
word has

Semantics
• The study of the meaning of words, and how
these form the meaning of sentences
– It concerns what words mean and how these
meanings combine in sentences to form sentence
meanings
• Realizing lexical relations among words

Pragmatics
• The study of how language is used to
accomplish goals and the influence of context
on meaning
– It concerns how sentences are used in different
situations and how use affects the interpretation of
the sentence
• Understanding the aspects of a language that
depend on situation and world knowledge

Discourse
• The study of linguistic units larger than a single
utterance
– It concerns how the immediately preceding
sentences affect the interpretation of the next
sentence

Why is NLP hard?
• Different ways of parsing a sentence
• Word category ambiguity
• Word sense ambiguity
• Words can mean more than the sum of their parts
• Imparting world knowledge is difficult ("the blue pen ate the
ice-cream")
• Fictitious worlds ("people on Mars can fly")
• Defining scope ("people like ice-cream": does this mean all
people like ice cream?)
• Language is changing and evolving
• Complex ways of interaction between the kinds of
knowledge
• Exponential complexity at each point in using the knowledge
Why is NLP hard?
• Different words/sentences express the same meaning
– The third season of the year
• Fall
• Autumn
– Book delivery time
• When will my book arrive?
• When will I receive my book?
• One word/sentence can have different meanings
– Fall
• The third season of the year
• Moving down towards the ground or towards a lower position
– The door is open
• Expressing a fact
• A request to close the door
Ambiguity is pervasive
• Natural language is highly
ambiguous and must be
disambiguated.
– I saw the man on the hill with a
telescope.
– I saw the Grand Canyon flying to LA.
– Time flies like an arrow.
– Horse flies like a sugar cube.
– Time runners like a coach.
– Time cars like a Porsche.

Ambiguity is Ubiquitous
• Speech Recognition
– “recognize speech” vs. “wreck a nice beach”
– “youth in Asia” vs. “euthanasia”
• Syntactic Analysis
– “I ate spaghetti with chopsticks” vs. “I ate spaghetti with meatballs.”
• Semantic Analysis
– “The dog is in the pen.” vs. “The ink is in the pen.”
– “I put the plant in the window” vs. “Ford put the plant in Mexico”
• Pragmatic Analysis
– From “The Pink Panther Strikes Again”:
– Clouseau : Does your dog bite?
Hotel Clerk: No.
Clouseau : [bowing down to pet the dog] Nice doggie.
[Dog barks and bites Clouseau in the hand]
Clouseau : I thought you said your dog did not bite!
Hotel Clerk: That is not my dog.
Ambiguity is Explosive
• Ambiguities compound to generate enormous
numbers of possible interpretations.
• In English, a sentence ending in n prepositional
phrases has over 2^n syntactic interpretations (cf.
Catalan numbers).
– “I saw the man with the telescope”: 2 parses
– “I saw the man on the hill with the telescope.”: 5 parses
– “I saw the man on the hill in Texas with the telescope”:
14 parses
– “I saw the man on the hill in Texas with the telescope at
noon.”: 42 parses
– “I saw the man on the hill in Texas with the telescope at
noon on Monday”: 132 parses
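
The parse counts listed above (2, 5, 14, 42, 132) are consecutive Catalan numbers. The short Python sketch below reproduces them, assuming that a sentence with n trailing prepositional phrases corresponds to the Catalan number C(n+1); that indexing is inferred from the counts on the slide, and the function names are chosen for illustration.

```python
from math import comb

def catalan(k):
    """k-th Catalan number: C(k) = (2k choose k) / (k + 1)."""
    return comb(2 * k, k) // (k + 1)

def pp_attachment_parses(n):
    """Parses for a sentence ending in n prepositional phrases.

    Assumption: the count is the Catalan number C(n + 1), which
    matches the slide's figures for n = 1..5.
    """
    return catalan(n + 1)

parse_counts = [pp_attachment_parses(n) for n in range(1, 6)]
print(parse_counts)  # [2, 5, 14, 42, 132]
```

For comparison, 2^n for n = 1..5 is 2, 4, 8, 16, 32, so the number of parses pulls away from the exponential bound quickly as phrases are added.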
Ambiguity: Another Example
• Find at least 5 meanings of this sentence:
– I made her duck
• I cooked waterfowl for her benefit (to eat)
• I cooked waterfowl belonging to her
• I created the (ceramic?) duck she owns
• I caused her to quickly lower her upper body
• I waved my magic wand and turned her into
undifferentiated waterfowl

Ambiguity is Pervasive
• I caused her to quickly lower her head or body
– Lexical category: “duck” can be a noun or verb
• I cooked waterfowl belonging to her.
– Lexical category: “her” can be a possessive (“of her”)
or dative (“for her”) pronoun
• I made the (ceramic) duck statue she owns
– Lexical Semantics: “make” can mean “create” or
“cook”, and about 100 other things as well

Ambiguity is Pervasive

• Phonetics!
– I mate or duck
– I’m eight or duck
– Eye maid; her duck
– Aye mate, her duck
– I maid her duck
– I’m aid her duck
– I mate her duck
– I’m ate her duck
– I’m ate or duck
– I mate or duck
Problem
• Remember our pipeline...

Morphological Processing -> Syntactic Analysis -> Semantic Interpretation -> Context

Really it’s this

[Figure: the same pipeline under ambiguity. A single Morphological Processing box fans out into many Syntactic Analysis boxes, which in turn fan out into many more Semantic Interpretation boxes.]

Dealing with Ambiguity

• Three possible approaches:
1. Tightly coupled interaction among processing
levels; knowledge from other levels can help
decide among choices at ambiguous levels.
2. Pipeline processing that ignores ambiguity as it
occurs and hopes that other levels can eliminate
incorrect structures.
3. Probabilistic approaches based on making the
most likely choices, or passing along n-best choices.
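
The difference between committing to the most likely choice and passing along n-best choices can be sketched in a few lines of Python. The tag probabilities below are invented for illustration (they do not come from any trained model), and `n_best` is a hypothetical helper name.

```python
import heapq

def n_best(candidates, n):
    """Return the n highest-scoring (label, probability) pairs, best first."""
    return heapq.nlargest(n, candidates.items(), key=lambda item: item[1])

# Invented scores for the word "duck" in "I made her duck".
tag_scores = {"NOUN": 0.55, "VERB": 0.40, "ADJ": 0.05}

# Most-likely choice: commit to one analysis right away.
best = n_best(tag_scores, 1)

# n-best: keep two analyses alive so that a later stage
# (for example the parser) can still pick the verb reading.
top2 = n_best(tag_scores, 2)

print(best)
print(top2)
```

Keeping more than one candidate costs extra work downstream, but it avoids discarding a reading that only later context can confirm.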

Models and Algorithms
• By models we mean the formalisms that are
used to capture the various kinds of linguistic
knowledge we need.
• State machines
• Rule-based approaches
• Logical formalisms
• Probabilistic models

Algorithms
• Algorithms are then used to manipulate the
knowledge representations needed to tackle the
task at hand.
• Many of the algorithms that we’ll study will
turn out to be transducers: algorithms that take
one kind of structure as input and output another.
• Unfortunately, ambiguity makes this process
difficult. This leads us to employ algorithms that
are designed to handle ambiguity of various
kinds.
Paradigms

• In particular:
– State-space search
• To manage the problem of making choices during processing when we
lack the information needed to make the right choice
– Dynamic programming
• To avoid having to redo work during the course of a state-space search
– CKY, Earley, Minimum Edit Distance, Viterbi, Baum-Welch
– Classifiers
• Machine learning based classifiers that are trained to make decisions
based on features extracted from the local context
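
As one illustration of the dynamic programming paradigm, here is a minimal Python sketch of minimum edit distance, one of the algorithms named above. It fills a table of subproblem results so that no prefix pair is recomputed. Insertions and deletions cost 1; the substitution cost is a parameter (`sub_cost`, a name chosen here), since conventions differ on whether a substitution costs 1 or 2.

```python
def min_edit_distance(source, target, sub_cost=1):
    """Minimum edit distance between two strings via dynamic programming."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] = cost of turning source[:i] into target[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i          # delete all i characters
    for j in range(1, cols):
        dist[0][j] = j          # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + 1,                            # deletion
                dist[i][j - 1] + 1,                            # insertion
                dist[i - 1][j - 1] + (0 if same else sub_cost) # match or substitution
            )
    return dist[-1][-1]

print(min_edit_distance("intention", "execution"))              # 5
print(min_edit_distance("intention", "execution", sub_cost=2))  # 8
```

Each table cell is computed once from its three neighbours, so the running time is proportional to the product of the two string lengths rather than to the number of possible edit sequences.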

Even more uncertainty: ‘Morphing’ in vision

Why else is natural language understanding difficult?
• non-standard English
– Great job @justinbieber! Were SOO PROUD of what youve
accomplished! U taught us 2 #neversaynever & you yourself
should never give up either♥
• segmentation issues
– the New York-New Haven Railroad (two possible segmentations)
• idioms
– dark horse
– get cold feet
– lose face
– throw in the towel
• neologisms
– unfriend
– Retweet
– bromance
• world knowledge
– Mary and Sue are sisters.
– Mary and Sue are mothers.
• tricky entity names
– Where is A Bug’s Life playing …
– Let It Be was recorded …
– … a mutation on the for gene …

But that’s what makes it fun!
Making progress on this problem…
• The task is difficult! What tools do we need?
– Knowledge about language
– Knowledge about the world
– A way to combine knowledge sources
• How we generally do this:
– probabilistic models built from language data
• P(“maison” → “house”) high
• P(“L’avocat général” → “the general avocado”) low
– Luckily, rough text features can often do half the
job.
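
The idea of a probabilistic model built from language data can be sketched as relative-frequency estimation in Python. The observation counts below are invented for illustration (they are not drawn from any real corpus), and `estimate_translation_probs` is a hypothetical helper; the sketch only shows how counts turn into a high P for a common translation and a low P for a rare one.

```python
from collections import Counter, defaultdict

def estimate_translation_probs(pairs):
    """Estimate P(target | source) as count(source, target) / count(source)."""
    counts = defaultdict(Counter)
    for source_word, target_word in pairs:
        counts[source_word][target_word] += 1
    return {
        source_word: {
            target_word: count / sum(target_counts.values())
            for target_word, count in target_counts.items()
        }
        for source_word, target_counts in counts.items()
    }

# Invented toy observations of (French word, English word) pairs.
observed = (
    [("maison", "house")] * 9 + [("maison", "home")] * 1
    + [("avocat", "lawyer")] * 8 + [("avocat", "avocado")] * 2
)

probs = estimate_translation_probs(observed)
print(probs["maison"]["house"])    # 0.9
print(probs["avocat"]["avocado"])  # 0.2
```

With estimates like these, a system can prefer "lawyer" over "avocado" for "avocat" without any hand-written rule about legal titles.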
Bigger Applications
• Intelligent computer systems
• NLU interfaces to databases
• Computer aided instruction
• Information retrieval
• Intelligent Web searching
• Data mining
• Machine translation
• Speech recognition
• Natural language generation
• Question answering
NLP History – Pre-statistics
• (1) Colorless green ideas sleep furiously.
• (2) Furiously sleep ideas green colorless.
– “It is fair to assume that neither sentence (1) nor (2) (nor indeed any part
of these sentences) had ever occurred in an English discourse. Hence, in
any statistical model for grammaticalness, these sentences will be ruled
out on identical grounds as equally "remote" from English. Yet (1),
though nonsensical, is grammatical, while (2) is not.” (Chomsky 1957)
• 70s and 80s: more linguistic focus
– Emphasis on deeper models, syntax and semantics
– Toy domains / manually engineered systems
– Weak empirical evaluation

History: Two Generations of NLP
• Hand-crafted Systems –Knowledge Engineering [1950s–]
– Rules written by hand; adjusted by error analysis
– Require experts who understand both the systems and
domain
– Iterative guess-test-tweak-repeat cycle

• Automatic, Trainable (Machine Learning) Systems [1985–]
– The tasks are modeled in a statistical way
– More robust techniques based on rich annotations
– Perform better than rules (Parsing 90% vs. 75% accuracy)

NLP – Machine learning and empiricism
• “Whenever I fire a linguist our system performance
improves.”–Jelinek, 1988
• 1990s: Empirical Revolution
– Corpus-based methods produce the first widely used tools
– Deep linguistic analysis often traded for robust
approximations
– Empirical evaluation is essential
• 2000s: Richer linguistic representations used in
statistical approaches, scale to more data!
• 2010s: you decide!
End of Unit-1

???
