You are on page 1of 209

Natural Language Processing

Chapter 2

Regular Expressions, Text Normalization,


Edit Distance

Department of Computer Science and Engineering,


Shahjalal University of Science and Technology, Sylhet
Regular Expression (RE)

A language for specifying text search strings.

Regular Expressions Introduction 2 / 65


Regular Expression (RE)

A language for specifying text search strings.

Formally,
- an algebraic notation for characterizing a set of strings.

Regular Expressions Introduction 2 / 65


Basic RE Patterns: Simplest Kind

The simplest kind of RE is


- a sequence of simple characters.

Regular Expressions Basic Patterns 3 / 65


Basic RE Patterns: Simplest Kind

The simplest kind of RE is


- a sequence of simple characters.

Regular Expressions Basic Patterns 3 / 65


Basic RE Patterns: Simplest Kind

The simplest kind of RE is


- a sequence of simple characters.

We’ll show REs delimited by slashes (/).

Regular Expressions Basic Patterns 3 / 65


Basic RE Patterns: Simplest Kind

The simplest kind of RE is


- a sequence of simple characters.

We’ll show REs delimited by slashes (/).


But slashes (/) are not part of the REs.

Regular Expressions Basic Patterns 3 / 65


Basic RE Patterns: Case Sensitivity

Regular expressions are case sensitive.

Regular Expressions Basic Patterns 4 / 65


Basic RE Patterns: Case Sensitivity

Regular expressions are case sensitive.

Lower case /s/ is distinct from uppercase /S/.

Regular Expressions Basic Patterns 4 / 65


Basic RE Patterns: Case Sensitivity

Regular expressions are case sensitive.

Lower case /s/ is distinct from uppercase /S/.

The pattern /woodchucks/ will not match the string Woodchucks.

Regular Expressions Basic Patterns 4 / 65


Basic RE Patterns: Disjunction

The string of characters inside the square braces [ and ] specifies


- a disjunction of characters to match.

Regular Expressions Basic Patterns 5 / 65


Basic RE Patterns: Disjunction

The string of characters inside the square braces [ and ] specifies


- a disjunction of characters to match.

Regular Expressions Basic Patterns 5 / 65


Basic RE Patterns: Range

Where there is a well-defined sequence associated with a set of


characters,
- the brackets can be used with the dash (-) to specify
any one character in a range.

Regular Expressions Basic Patterns 6 / 65


Basic RE Patterns: Range

Where there is a well-defined sequence associated with a set of


characters,
- the brackets can be used with the dash (-) to specify
any one character in a range.

Regular Expressions Basic Patterns 6 / 65


Basic RE Patterns: Negation

If the caret ^ is the first symbol after the open square brace [,
- the resulting pattern is negated.

Regular Expressions Basic Patterns 7 / 65


Basic RE Patterns: Negation

If the caret ^ is the first symbol after the open square brace [,
- the resulting pattern is negated.
If the caret ^ occurs anywhere else,
- it usually stands for a caret ^.

Regular Expressions Basic Patterns 7 / 65


Basic RE Patterns: Negation

If the caret ^ is the first symbol after the open square brace [,
- the resulting pattern is negated.
If the caret ^ occurs anywhere else,
- it usually stands for a caret ^.

Regular Expressions Basic Patterns 7 / 65


Basic RE Patterns: Optionality

How can we talk about optional elements,


- like an optional s in woodchuck and woodchucks?

Regular Expressions Basic Patterns 8 / 65


Basic RE Patterns: Optionality

Use the question mark /?/,


- which means “zero or one instances of the previous character”.

Regular Expressions Basic Patterns 9 / 65


Basic RE Patterns: Optionality

Use the question mark /?/,


- which means “zero or one instances of the previous character”.

Regular Expressions Basic Patterns 9 / 65


Basic RE Patterns: Problem

Consider the language of certain sheep, which consists of strings that


look like the following:
baa!
baaa!
baaaa!
baaaaa!
...

Regular Expressions Basic Patterns 10 / 65


Basic RE Patterns: Problem

Consider the language of certain sheep, which consists of strings that


look like the following:
baa!
baaa!
baaaa!
baaaaa!
...
How can we represent the language using RE?

Regular Expressions Basic Patterns 10 / 65


Basic RE Patterns: Kleene *

Use the *

Regular Expressions Basic Patterns 11 / 65


Basic RE Patterns: Kleene *

Use the *
Pronounced “cleany star”.

Regular Expressions Basic Patterns 11 / 65


Basic RE Patterns: Kleene *

Use the *
Pronounced “cleany star”.
Means “zero or more occurrences of the immediately previous
character or regular expression”.

Regular Expressions Basic Patterns 11 / 65


Basic RE Patterns: Kleene *

An integer (a string of digits) is thus /[0-9][0-9]*/.

Regular Expressions Basic Patterns 12 / 65


Basic RE Patterns: Kleene *

An integer (a string of digits) is thus /[0-9][0-9]*/.


Why isn’t it just /[0-9]*/?

Regular Expressions Basic Patterns 12 / 65


Basic RE Patterns: Kleene +

Have to write the regular expression for digits twice,


- so there is a shorter way to specify “at least one” of some
character.

Regular Expressions Basic Patterns 13 / 65


Basic RE Patterns: Kleene +

Have to write the regular expression for digits twice,


- so there is a shorter way to specify “at least one” of some
character.
Kleene + means “one or more occurrences of the immediately
preceding character or regular expression”.

Regular Expressions Basic Patterns 13 / 65


Basic RE Patterns: Kleene +

Have to write the regular expression for digits twice,


- so there is a shorter way to specify “at least one” of some
character.
Kleene + means “one or more occurrences of the immediately
preceding character or regular expression”.
An integer (a string of digits) is thus /[0-9]+/.

Regular Expressions Basic Patterns 13 / 65


Basic RE Patterns: Kleene +

Have to write the regular expression for digits twice,


- so there is a shorter way to specify “at least one” of some
character.
Kleene + means “one or more occurrences of the immediately
preceding character or regular expression”.
An integer (a string of digits) is thus /[0-9]+/.
Two ways to specify the sheep language: /baaa*!/ or /baa+!/.

Regular Expressions Basic Patterns 13 / 65


Basic RE Patterns: Wildcard

The period (/./),


- a wildcard expression that matches any single character
(except a carriage return).

Regular Expressions Basic Patterns 14 / 65


Basic RE Patterns: Wildcard

The period (/./),


- a wildcard expression that matches any single character
(except a carriage return).

Regular Expressions Basic Patterns 14 / 65


Basic RE Patterns: Wildcard

The wildcard (.) is often used together with the Kleene *


- to mean “any string of characters”.

Regular Expressions Basic Patterns 15 / 65


Basic RE Patterns: Wildcard

The wildcard (.) is often used together with the Kleene *


- to mean “any string of characters”.
Suppose we want to find any line in which a particular word, for
example, aardvark, appears twice.

Regular Expressions Basic Patterns 15 / 65


Basic RE Patterns: Wildcard

The wildcard (.) is often used together with the Kleene *


- to mean “any string of characters”.
Suppose we want to find any line in which a particular word, for
example, aardvark, appears twice.
We can specify this with the regular expression /aardvark.*aardvark/.

Regular Expressions Basic Patterns 15 / 65


Basic RE Patterns: Anchors

Special characters that anchor regular expressions to particular places


in a string.

Regular Expressions Basic Patterns 16 / 65


Basic RE Patterns: Anchors

Special characters that anchor regular expressions to particular places


in a string.
The most common anchors are
- the caret ^ and the dollar sign $.

Regular Expressions Basic Patterns 16 / 65


Basic RE Patterns: Anchor caret ^

The caret ^
- matches the start of a line.

Regular Expressions Basic Patterns 17 / 65


Basic RE Patterns: Anchor caret ^

The caret ^
- matches the start of a line.
The pattern /^The/ matches
- the word The only at the start of a line.

Regular Expressions Basic Patterns 17 / 65


Basic RE Patterns: The caret ^

The caret ^ has three uses:

Regular Expressions Basic Patterns 18 / 65


Basic RE Patterns: The caret ^

The caret ^ has three uses:


To match the start of a line,

Regular Expressions Basic Patterns 18 / 65


Basic RE Patterns: The caret ^

The caret ^ has three uses:


To match the start of a line,
To indicate a negation inside of square brackets,

Regular Expressions Basic Patterns 18 / 65


Basic RE Patterns: The caret ^

The caret ^ has three uses:


To match the start of a line,
To indicate a negation inside of square brackets,
Just to mean a caret ^.

Regular Expressions Basic Patterns 18 / 65


Basic RE Patterns: Anchor dollar sign $

The dollar sign $ matches


- the end of a line.

Regular Expressions Basic Patterns 19 / 65


Basic RE Patterns: Anchor dollar sign $

The dollar sign $ matches


- the end of a line.
⊔$
- a useful pattern for matching a space at the end of a line.

Regular Expressions Basic Patterns 19 / 65


Basic RE Patterns: Anchor dollar sign $

The dollar sign $ matches


- the end of a line.
⊔$
- a useful pattern for matching a space at the end of a line.
/^The dog\.$/
- matches a line that contains only the phrase The dog.

Regular Expressions Basic Patterns 19 / 65


Basic RE Patterns: Anchor dollar sign $

The dollar sign $ matches


- the end of a line.
⊔$
- a useful pattern for matching a space at the end of a line.
/^The dog\.$/
- matches a line that contains only the phrase The dog.
We have to use the backslash (\) here
- since we want the . to mean “period” and not the wildcard.

Regular Expressions Basic Patterns 19 / 65


Basic RE Patterns: Anchors more

Two other anchors:


- \b matches a word boundary
- \B matches a non-boundary.

Regular Expressions Basic Patterns 20 / 65


Basic RE Patterns: Anchors more

Two other anchors:


- \b matches a word boundary
- \B matches a non-boundary.
/\bthe\b/
- matches the word the but not the word other.

Regular Expressions Basic Patterns 20 / 65


Basic RE Patterns: “word” definition

Based on the definition of “words” in programming languages.

Regular Expressions Basic Patterns 21 / 65


Basic RE Patterns: “word” definition

Based on the definition of “words” in programming languages.


Technically, a “word” for the purposes of a regular expression is
defined as
- any sequence of digits, underscores, or letters.

Regular Expressions Basic Patterns 21 / 65


Basic RE Patterns: “word” definition

Based on the definition of “words” in programming languages.


Technically, a “word” for the purposes of a regular expression is
defined as
- any sequence of digits, underscores, or letters.
/\b99\b/

Regular Expressions Basic Patterns 21 / 65


Basic RE Patterns: “word” definition

Based on the definition of “words” in programming languages.


Technically, a “word” for the purposes of a regular expression is
defined as
- any sequence of digits, underscores, or letters.
/\b99\b/
will match the string 99 in There are 99 bottles of beer on the wall
(because 99 follows a space)

Regular Expressions Basic Patterns 21 / 65


Basic RE Patterns: “word” definition

Based on the definition of “words” in programming languages.


Technically, a “word” for the purposes of a regular expression is
defined as
- any sequence of digits, underscores, or letters.
/\b99\b/
will match the string 99 in There are 99 bottles of beer on the wall
(because 99 follows a space)
but not 99 in There are 299 bottles of beer on the wall (since 99
follows a number).

Regular Expressions Basic Patterns 21 / 65


Basic RE Patterns: “word” definition

Based on the definition of “words” in programming languages.


Technically, a “word” for the purposes of a regular expression is
defined as
- any sequence of digits, underscores, or letters.
/\b99\b/
will match the string 99 in There are 99 bottles of beer on the wall
(because 99 follows a space)
but not 99 in There are 299 bottles of beer on the wall (since 99
follows a number).
But it will match 99 in $99 (since 99 follows a dollar sign ($), which is
not a digit, underscore, or letter).

Regular Expressions Basic Patterns 21 / 65


Break

Regular Expressions Break 22 / 65


Basic RE Patterns: Disjunction operator

Search for either the string cat or the string dog.

Regular Expressions Disjunction, Grouping, and Precedence 23 / 65


Basic RE Patterns: Disjunction operator

Search for either the string cat or the string dog.


Why can’t we say /[catdog]/?

Regular Expressions Disjunction, Grouping, and Precedence 23 / 65


Basic RE Patterns: Disjunction operator

Search for either the string cat or the string dog.


Why can’t we say /[catdog]/?
Use the disjunction operator, also called the pipe symbol |.

Regular Expressions Disjunction, Grouping, and Precedence 23 / 65


Basic RE Patterns: Disjunction operator

Search for either the string cat or the string dog.


Why can’t we say /[catdog]/?
Use the disjunction operator, also called the pipe symbol |.
The pattern /cat|dog/ matches
- either the string cat or the string dog.

Regular Expressions Disjunction, Grouping, and Precedence 23 / 65


Basic RE Patterns: Precedence

How can I specify both guppy and guppies?

Regular Expressions Disjunction, Grouping, and Precedence 24 / 65


Basic RE Patterns: Precedence

How can I specify both guppy and guppies?


We cannot simply say /guppy|ies/,
- because that would match only the strings guppy and ies.

Regular Expressions Disjunction, Grouping, and Precedence 24 / 65


Basic RE Patterns: Precedence

How can I specify both guppy and guppies?


We cannot simply say /guppy|ies/,
- because that would match only the strings guppy and ies.
Use the parenthesis operators ( and ).

Regular Expressions Disjunction, Grouping, and Precedence 24 / 65


Basic RE Patterns: Precedence

How can I specify both guppy and guppies?


We cannot simply say /guppy|ies/,
- because that would match only the strings guppy and ies.
Use the parenthesis operators ( and ).
The pattern /gupp(y|ies)/ would specify that
- we meant the disjunction only to apply to the suffixes y and ies.

Regular Expressions Disjunction, Grouping, and Precedence 24 / 65


Basic RE Patterns: Operator precedence
hierarchy

Counters have a higher precedence than sequences,


- /the*/ matches theeeee but not thethe.

Regular Expressions Disjunction, Grouping, and Precedence 25 / 65


Basic RE Patterns: Operator precedence
hierarchy

Counters have a higher precedence than sequences,


- /the*/ matches theeeee but not thethe.

Sequences have a higher precedence than disjunction,


- /the|any/ matches the or any but not thany or theny.

Regular Expressions Disjunction, Grouping, and Precedence 25 / 65


Basic RE Patterns: A Simple Example
Write a RE to find cases of the English article the.

Regular Expressions NLP Errors 26 / 65


Basic RE Patterns: A Simple Example
Write a RE to find cases of the English article the.
Solution: /the/

Regular Expressions NLP Errors 26 / 65


Basic RE Patterns: A Simple Example
Write a RE to find cases of the English article the.
Solution: /the/
Miss the word when it begins a sentence and hence is capitalized (i.e.,
The)

Regular Expressions NLP Errors 26 / 65


Basic RE Patterns: A Simple Example
Write a RE to find cases of the English article the.
Solution: /the/
Miss the word when it begins a sentence and hence is capitalized (i.e.,
The)
Solution: /[tT]he/

Regular Expressions NLP Errors 26 / 65


Basic RE Patterns: A Simple Example
Write a RE to find cases of the English article the.
Solution: /the/
Miss the word when it begins a sentence and hence is capitalized (i.e.,
The)
Solution: /[tT]he/
Still incorrectly return texts with the embedded in other words (e.g.,
other or theology)

Regular Expressions NLP Errors 26 / 65


Basic RE Patterns: A Simple Example
Write a RE to find cases of the English article the.
Solution: /the/
Miss the word when it begins a sentence and hence is capitalized (i.e.,
The)
Solution: /[tT]he/
Still incorrectly return texts with the embedded in other words (e.g.,
other or theology)
Solution: /\b[tT]he\b/

Regular Expressions NLP Errors 26 / 65


Basic RE Patterns: A Simple Example
Write a RE to find cases of the English article the.
Solution: /the/
Miss the word when it begins a sentence and hence is capitalized (i.e.,
The)
Solution: /[tT]he/
Still incorrectly return texts with the embedded in other words (e.g.,
other or theology)
Solution: /\b[tT]he\b/
Might want to find the in some context where it might also have
underlines or numbers nearby (the_ or the25)

Regular Expressions NLP Errors 26 / 65


Basic RE Patterns: A Simple Example
Write a RE to find cases of the English article the.
Solution: /the/
Miss the word when it begins a sentence and hence is capitalized (i.e.,
The)
Solution: /[tT]he/
Still incorrectly return texts with the embedded in other words (e.g.,
other or theology)
Solution: /\b[tT]he\b/
Might want to find the in some context where it might also have
underlines or numbers nearby (the_ or the25)
Solution: /[^a-zA-Z][tT]he[^a-zA-Z]/

Regular Expressions NLP Errors 26 / 65


Basic RE Patterns: A Simple Example
Write a RE to find cases of the English article the.
Solution: /the/
Miss the word when it begins a sentence and hence is capitalized (i.e.,
The)
Solution: /[tT]he/
Still incorrectly return texts with the embedded in other words (e.g.,
other or theology)
Solution: /\b[tT]he\b/
Might want to find the in some context where it might also have
underlines or numbers nearby (the_ or the25)
Solution: /[^a-zA-Z][tT]he[^a-zA-Z]/
Won’t find the word the when it begins a line.

Regular Expressions NLP Errors 26 / 65


Basic RE Patterns: A Simple Example
Write a RE to find cases of the English article the.
Solution: /the/
Miss the word when it begins a sentence and hence is capitalized (i.e.,
The)
Solution: /[tT]he/
Still incorrectly return texts with the embedded in other words (e.g.,
other or theology)
Solution: /\b[tT]he\b/
Might want to find the in some context where it might also have
underlines or numbers nearby (the_ or the25)
Solution: /[^a-zA-Z][tT]he[^a-zA-Z]/
Won’t find the word the when it begins a line.
Solution: /(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/

Regular Expressions NLP Errors 26 / 65


NLP Errors

Two kind of Errors:


false positives:
- strings that we incorrectly matched like other or there

Regular Expressions NLP Errors 27 / 65


NLP Errors

Two kind of Errors:


false positives:
- strings that we incorrectly matched like other or there
false negatives:
- strings that we incorrectly missed, like The.

Regular Expressions NLP Errors 27 / 65


NLP Errors: Two efforts

Two kind of Efforts:


Increasing precision (minimizing false positives)

Regular Expressions NLP Errors 28 / 65


NLP Errors: Two efforts

Two kind of Efforts:


Increasing precision (minimizing false positives)
Increasing recall (minimizing false negatives)

Regular Expressions NLP Errors 28 / 65


A More Complex Example

Write a RE for
- any machine with at least 6 GHz and 500 GB of disk space
for less than $1000.

Regular Expressions A More Complex Example 29 / 65


A More Complex Example

Write a RE for
- any machine with at least 6 GHz and 500 GB of disk space
for less than $1000.

HOME WORK: Section 2.1.4

Regular Expressions A More Complex Example 29 / 65


More RE Operators: Aliases

Aliases for common sets of characters.

Regular Expressions More Operators 30 / 65


More RE Operators: Aliases

Aliases for common sets of characters.

Regular Expressions More Operators 30 / 65


More RE Operators: Counters

RE operators for counting.

Regular Expressions More Operators 31 / 65


More RE Operators: Counters

RE operators for counting.

Regular Expressions More Operators 31 / 65


More RE Operators: Backslashed

Some characters that need to be backslashed.

Regular Expressions More Operators 32 / 65


More RE Operators: Backslashed

Some characters that need to be backslashed.

Regular Expressions More Operators 32 / 65


Text Normalization

Text Normalization 33 / 65
Corpora

Corpus: a computer-readable collection of text or speech.

Text Normalization Corpora 34 / 65


Corpora

Corpus: a computer-readable collection of text or speech.


plural: Corpora

Text Normalization Corpora 34 / 65


Corpora: Example

The Brown corpus

Text Normalization Corpora 35 / 65


Corpora: Example

The Brown corpus


A million word collection of English texts

Text Normalization Corpora 35 / 65


Corpora: Example

The Brown corpus


A million word collection of English texts
from different genres (newspaper, fiction, non-fiction, academic, etc.)

Text Normalization Corpora 35 / 65


Corpora: Example

The Brown corpus


A million word collection of English texts
from different genres (newspaper, fiction, non-fiction, academic, etc.)
assembled at Brown University in 1963-64 (Kucera et al., 1967)

Text Normalization Corpora 35 / 65


Corpora: Example

The SUMono corpus

Text Normalization Corpora 36 / 65


Corpora: Example

The SUMono corpus


A 27 million word collection of Bangla texts

Text Normalization Corpora 36 / 65


Corpora: Example

The SUMono corpus


A 27 million word collection of Bangla texts
from different genres (newspaper, fiction, non-fiction, academic, blogs,
etc.)

Text Normalization Corpora 36 / 65


Corpora: Example

The SUMono corpus


A 27 million word collection of Bangla texts
from different genres (newspaper, fiction, non-fiction, academic, blogs,
etc.)
assembled at Shahjalal University of Science and Technology in
2010-14 (Mumin et al., 2014)

Text Normalization Corpora 36 / 65


Words

How many words are in the following Brown sentence?


He stepped out into the hall, was delighted to encounter a water brother.

Text Normalization Words 37 / 65


Words

How many words are in the following Brown sentence?


He stepped out into the hall, was delighted to encounter a water brother.
13 words, if we don’t count punctuation marks as words

Text Normalization Words 37 / 65


Words

How many words are in the following Brown sentence?


He stepped out into the hall, was delighted to encounter a water brother.
13 words, if we don’t count punctuation marks as words
15 words, if we count punctuation

Text Normalization Words 37 / 65


Words

How many words are in the following SUMono sentence?


আর তাই ণীকে ছা ছা ীেদর পাশাপািশ যখন বসলাম, তখন কারও কারও
মেন হ ল, িফের গিছ িব িবদ ালেয়র সই িদন েলায়।

Text Normalization Words 38 / 65


Words

How many words are in the following SUMono sentence?


আর তাই ণীকে ছা ছা ীেদর পাশাপািশ যখন বসলাম, তখন কারও কারও
মেন হ ল, িফের গিছ িব িবদ ালেয়র সই িদন েলায়।
17 words, if we don’t count punctuation marks as words

Text Normalization Words 38 / 65


Words

How many words are in the following SUMono sentence?


আর তাই ণীকে ছা ছা ীেদর পাশাপািশ যখন বসলাম, তখন কারও কারও
মেন হ ল, িফের গিছ িব িবদ ালেয়র সই িদন েলায়।
17 words, if we don’t count punctuation marks as words
20 words, if we count punctuation

Text Normalization Words 38 / 65


Words

Lemma: a set of lexical forms having


- the same stem
- the same major part-of-speech
- and the same word sense.
This stem is called Lemma of this set of lexical forms.

Text Normalization Words 39 / 65


Words

Lemma: a set of lexical forms having


- the same stem
- the same major part-of-speech
- and the same word sense.
This stem is called Lemma of this set of lexical forms.
wordform: the full inflected or derived form of the word.

Text Normalization Words 39 / 65


Words

Lemma: a set of lexical forms having


- the same stem
- the same major part-of-speech
- and the same word sense.
This stem is called Lemma of this set of lexical forms.
wordform: the full inflected or derived form of the word.
For example: cats versus cat?

Text Normalization Words 39 / 65


Words

Lemma: a set of lexical forms having


- the same stem
- the same major part-of-speech
- and the same word sense.
This stem is called Lemma of this set of lexical forms.
wordform: the full inflected or derived form of the word.
For example: cats versus cat?
These two words have the same lemma cat but are different
wordforms.

Text Normalization Words 39 / 65


Words

How many words are there in a corpus?

Text Normalization Words 40 / 65


Words

How many words are there in a corpus?


Two ways of talking about words: Types and Tokens

Text Normalization Words 40 / 65


Words

How many words are there in a corpus?


Two ways of talking about words: Types and Tokens
Types: the number of distinct words in a corpus.

Text Normalization Words 40 / 65


Words

How many words are there in a corpus?


Two ways of talking about words: Types and Tokens
Types: the number of distinct words in a corpus.
Tokens: the total number of running words.

Text Normalization Words 40 / 65


Words

How many words are there in a corpus?


Two ways of talking about words: Types and Tokens
Types: the number of distinct words in a corpus.
Tokens: the total number of running words.
How many word Types and word Tokens are in the following SUMono
sentence?
আর তাই ণীকে ছা ছা ীেদর পাশাপািশ যখন বসলাম, তখন কারও
কারও মেন হ ল, িফের গিছ িব িবদ ালেয়র সই িদন েলায়।

Text Normalization Words 40 / 65


Words

How many words are there in a corpus?


Two ways of talking about words: Types and Tokens
Types: the number of distinct words in a corpus.
Tokens: the total number of running words.
How many word Types and word Tokens are in the following SUMono
sentence?
আর তাই ণীকে ছা ছা ীেদর পাশাপািশ যখন বসলাম, তখন কারও
কারও মেন হ ল, িফের গিছ িব িবদ ালেয়র সই িদন েলায়।
17 Tokens and 16 Types, if we ignore punctuation

Text Normalization Words 40 / 65


Words

How many words are there in a corpus?


Two ways of talking about words: Types and Tokens
Types: the number of distinct words in a corpus.
Tokens: the total number of running words.
How many word Types and word Tokens are in the following SUMono
sentence?
আর তাই ণীকে ছা ছা ীেদর পাশাপািশ যখন বসলাম, তখন কারও
কারও মেন হ ল, িফের গিছ িব িবদ ালেয়র সই িদন েলায়।
17 Tokens and 16 Types, if we ignore punctuation
20 Tokens and 18 Types, if we count punctuation

Text Normalization Words 40 / 65


Text Normalization

Commonly applied text normalization process:


Tokenizing (segmenting) words

Text Normalization Text Normalization 41 / 65


Text Normalization

Commonly applied text normalization process:


Tokenizing (segmenting) words
Normalizing word formats

Text Normalization Text Normalization 41 / 65


Text Normalization

Commonly applied text normalization process:


Tokenizing (segmenting) words
Normalizing word formats
Segmenting sentences

Text Normalization Text Normalization 41 / 65


Tokenization

Tokenization: the task of segmenting running text into tokens.

Text Normalization Tokenization 42 / 65


Normalization

Normalization: the task of putting tokens in a standard format,


choosing a single normal form for tokens with multiple forms.

Text Normalization Tokenization 43 / 65


Normalization

Normalization: the task of putting tokens in a standard format,


choosing a single normal form for tokens with multiple forms.
For example: USA and US

Text Normalization Tokenization 43 / 65


Lemmatization

Lemmatization: the task of determining that several words have the


same root, despite their surface differences.

Text Normalization Lemmatization 44 / 65


Lemmatization

Lemmatization: the task of determining that several words have the


same root, despite their surface differences.
For example: am, are, and is have the shared lemma be

Text Normalization Lemmatization 44 / 65


Lemmatization

Lemmatization: the task of determining that several words have the


same root, despite their surface differences.
For example: am, are, and is have the shared lemma be
The lemmatized form of a sentence like
He is reading detective stories would thus be
He be read detective story.

Text Normalization Lemmatization 44 / 65


Morphological parsing

How is lemmatization done?

Text Normalization Lemmatization 45 / 65


Morphological parsing

How is lemmatization done?


The most sophisticated methods for lemmatization involve
- complete morphological parsing of the word.

Text Normalization Lemmatization 45 / 65


Morphological parsing

Morphology is the study of the way words are built up


- from smaller meaning-bearing units called morphemes.

Text Normalization Lemmatization 46 / 65


Morphological parsing

Morphology is the study of the way words are built up


- from smaller meaning-bearing units called morphemes.
Two broad classes of morphemes can be distinguished:

Text Normalization Lemmatization 46 / 65


Morphological parsing

Morphology is the study of the way words are built up


- from smaller meaning-bearing units called morphemes.
Two broad classes of morphemes can be distinguished:
stems: the central morpheme of the word, supplying the main meaning

Text Normalization Lemmatization 46 / 65


Morphological parsing

Morphology is the study of the way words are built up


- from smaller meaning-bearing units called morphemes.
Two broad classes of morphemes can be distinguished:
stems: the central morpheme of the word, supplying the main meaning
affixes: adding “additional” meanings of various kinds.

Text Normalization Lemmatization 46 / 65


Morphological parsing

Morphology is the study of the way words are built up


- from smaller meaning-bearing units called morphemes.
Two broad classes of morphemes can be distinguished:
stems: the central morpheme of the word, supplying the main meaning
affixes: adding “additional” meanings of various kinds.
fox consists of one morpheme: fox

Text Normalization Lemmatization 46 / 65


Morphological parsing

Morphology is the study of the way words are built up


- from smaller meaning-bearing units called morphemes.
Two broad classes of morphemes can be distinguished:
stems: the central morpheme of the word, supplying the main meaning
affixes: adding “additional” meanings of various kinds.
fox consists of one morpheme: fox
foxes consists of two morphemes: fox and -es

Text Normalization Lemmatization 46 / 65


Stemming Vs Lemmatization

Stemming algorithms work


- by cutting off the end or the beginning of the word,
- taking into account a list of common prefixes and suffixes
that can be found in an inflected word.

Text Normalization Lemmatization 47 / 65


Stemming Vs Lemmatization

Stemming algorithms work


- by cutting off the end or the beginning of the word,
- taking into account a list of common prefixes and suffixes
that can be found in an inflected word.
This indiscriminate cutting can be successful in some occasions, but
not always.

Text Normalization Lemmatization 47 / 65


Stemming Vs Lemmatization

Stemming algorithms work


- by cutting off the end or the beginning of the word,
- taking into account a list of common prefixes and suffixes
that can be found in an inflected word.
This indiscriminate cutting can be successful in some occasions, but
not always.

Text Normalization Lemmatization 47 / 65


Stemming Vs Lemmatization

Lemmatization takes into consideration the morphological analysis of


the words.

Text Normalization Lemmatization 48 / 65


Stemming Vs Lemmatization

Lemmatization takes into consideration the morphological analysis of


the words.
To do so, it is necessary to have detailed dictionaries
- which the algorithm can look through to link the form
back to its lemma.

Text Normalization Lemmatization 48 / 65


Stemming Vs Lemmatization

Lemmatization takes into consideration the morphological analysis of


the words.
To do so, it is necessary to have detailed dictionaries
- which the algorithm can look through to link the form
back to its lemma.

Text Normalization Lemmatization 48 / 65


Edit Distance

Edit Distance 49 / 65
Edit Distance

Measuring how similar two strings are.

Edit Distance 50 / 65
Edit Distance

Measuring how similar two strings are.


In spelling correction,
- the user typed some erroneous string—let’s say graffe
- we want to know what the user meant.

Edit Distance 50 / 65
Edit Distance

Measuring how similar two strings are.


In spelling correction,
- the user typed some erroneous string—let’s say graffe
- we want to know what the user meant.
In coreference,
- the task of deciding whether two strings such as the
following refer to the same entity:
Stanford President John Hennessy
Stanford University President John Hennessy

Edit Distance 50 / 65
Edit Distance

Edit distance gives us a way


- to quantify both of these intuitions about string similarity.

Edit Distance 51 / 65
Edit Distance

Edit distance gives us a way


- to quantify both of these intuitions about string similarity.
More formally, the minimum edit distance between two strings is
defined as
- the minimum number of editing operations needed to
transform one string into another.

Edit Distance 51 / 65
Edit Distance

Edit distance gives us a way


- to quantify both of these intuitions about string similarity.
More formally, the minimum edit distance between two strings is
defined as
- the minimum number of editing operations needed to
transform one string into another.
Editing operations like insertion, deletion, substitution.

Edit Distance 51 / 65
Edit Distance

What is the gap between intention and execution?

Edit Distance 52 / 65
Edit Distance

What is the gap between intention and execution?


Visualize the problem as an alignment between the two strings.

Edit Distance 52 / 65
Edit Distance
What is the gap between intention and execution?
Visualize the problem as an alignment between the two strings.

Edit Distance 52 / 65
Edit Distance
What is the gap between intention and execution?
Visualize the problem as an alignment between the two strings.

delete an I

Edit Distance 52 / 65
Edit Distance
What is the gap between intention and execution?
Visualize the problem as an alignment between the two strings.

delete an I
substitute E for N

Edit Distance 52 / 65
Edit Distance
What is the gap between intention and execution?
Visualize the problem as an alignment between the two strings.

delete an I
substitute E for N
substitute X for T

Edit Distance 52 / 65
Edit Distance
What is the gap between intention and execution?
Visualize the problem as an alignment between the two strings.

delete an I
substitute E for N
substitute X for T
insert C

Edit Distance 52 / 65
Edit Distance
What is the gap between intention and execution?
Visualize the problem as an alignment between the two strings.

delete an I
substitute E for N
substitute X for T
insert C
substitute U for N

Edit Distance 52 / 65
Edit Distance

We can also assign a particular cost or weight to each of these


operations.

Edit Distance 53 / 65
Edit Distance

We can also assign a particular cost or weight to each of these


operations.
The simplest weighting factor: Levenshtein distance (Levenshtein,
1966)

Edit Distance 53 / 65
Edit Distance

We can also assign a particular cost or weight to each of these


operations.
The simplest weighting factor: Levenshtein distance (Levenshtein,
1966)
Each of the three operations has a cost of 1

Edit Distance 53 / 65
Edit Distance

We can also assign a particular cost or weight to each of these


operations.
The simplest weighting factor: Levenshtein distance (Levenshtein,
1966)
Each of the three operations has a cost of 1
Assume that the substitution of a letter for itself, for example, T for T,
has zero cost.

Edit Distance 53 / 65
Edit Distance

We can also assign a particular cost or weight to each of these


operations.
The simplest weighting factor: Levenshtein distance (Levenshtein,
1966)
Each of the three operations has a cost of 1
Assume that the substitution of a letter for itself, for example, T for T,
has zero cost.
What is the Levenshtein distance between intention and execution?

Edit Distance 53 / 65
Edit Distance

We can also assign a particular cost or weight to each of these


operations.
The simplest weighting factor: Levenshtein distance (Levenshtein,
1966)
Each of the three operations has a cost of 1
Assume that the substitution of a letter for itself, for example, T for T,
has zero cost.
What is the Levenshtein distance between intention and execution?
The Levenshtein distance between intention and execution is 5.

Edit Distance 53 / 65
Edit Distance

An alternative version of Levenshtein distance:

Edit Distance 54 / 65
Edit Distance

An alternative version of Levenshtein distance:


each insertion or deletion has a cost of 1 and substitutions are not
allowed.

Edit Distance 54 / 65
Edit Distance

An alternative version of Levenshtein distance:


each insertion or deletion has a cost of 1 and substitutions are not
allowed.
equivalent to allowing substitution, but giving each substitution a cost
of 2 since any substitution can be represented by one insertion and one
deletion

Edit Distance 54 / 65
Edit Distance

An alternative version of Levenshtein distance:


each insertion or deletion has a cost of 1 and substitutions are not
allowed.
equivalent to allowing substitution, but giving each substitution a cost
of 2 since any substitution can be represented by one insertion and one
deletion
Using this version, what is the Levenshtein distance between intention
and execution?

Edit Distance 54 / 65
Edit Distance

An alternative version of Levenshtein distance:


each insertion or deletion has a cost of 1 and substitutions are not
allowed.
equivalent to allowing substitution, but giving each substitution a cost
of 2 since any substitution can be represented by one insertion and one
deletion
Using this version, what is the Levenshtein distance between intention
and execution?
Using this version, the Levenshtein distance between intention and
execution is 8.

Edit Distance 54 / 65
Minimum Edit Distance

How do we find the minimum edit distance?

Edit Distance 55 / 65
Minimum Edit Distance

How do we find the minimum edit distance?


Think of this as a search task
- in which we are searching for the shortest path
from one string to another
- thru a sequence of edits

Edit Distance 55 / 65
Minimum Edit Distance

How do we find the minimum edit distance?


Think of this as a search task
- in which we are searching for the shortest path
from one string to another
- thru a sequence of edits

Edit Distance 55 / 65
Minimum Edit Distance

How do we find the minimum edit distance?


Think of this as a search task
- in which we are searching for the shortest path
from one string to another
- thru a sequence of edits

We can do this by using dynamic programming.

Edit Distance 55 / 65
Dynamic programming

A class of algorithms

Edit Distance 56 / 65
Dynamic programming

A class of algorithms
Apply a table-driven method
- to solve problems by combining solutions to sub-problems.

Edit Distance 56 / 65
Dynamic programming

A class of algorithms
Apply a table-driven method
- to solve problems by combining solutions to sub-problems.

Edit Distance 56 / 65
Dynamic programming

A class of algorithms
Apply a table-driven method
- to solve problems by combining solutions to sub-problems.

Examples: the Viterbi algorithm, the CKY algorithm for parsing

Edit Distance 56 / 65
Minimum Edit Distance Algorithm

Given two strings,


- the source string X of length n
- the target string Y of length m

Edit Distance 57 / 65
Minimum Edit Distance Algorithm

Given two strings,


- the source string X of length n
- the target string Y of length m
We’ll define D[i, j] as the edit distance between X[1..i] and Y[1..j],
- i.e., the first i characters of X and the first j characters of Y.

Edit Distance 57 / 65
Minimum Edit Distance Algorithm

Given two strings,


- the source string X of length n
- the target string Y of length m
We’ll define D[i, j] as the edit distance between X[1..i] and Y[1..j],
- i.e., the first i characters of X and the first j characters of Y.
The edit distance between X and Y is thus D[n, m].

Edit Distance 57 / 65
Minimum Edit Distance Algorithm

Base conditions:
D(i, 0) = del-cost(i)
D(0, j) = ins-cost(j)

Edit Distance 58 / 65
Minimum Edit Distance Algorithm

Base conditions:
D(i, 0) = del-cost(i)
D(0, j) = ins-cost(j)
Recurrence conditions:

Edit Distance 58 / 65
Minimum Edit Distance Algorithm

Base conditions:
D(i, 0) = del-cost(i)
D(0, j) = ins-cost(j)
Recurrence conditions:

Applying alternative version of Levenshtein distance:

Edit Distance 58 / 65
Minimum Edit Distance Algorithm

Edit Distance 59 / 65
Minimum Edit Distance Algorithm

Edit Distance 60 / 65
Minimum Edit Distance Algorithm

Edit Distance 60 / 65
Minimum Edit Distance Algorithm

Edit Distance 60 / 65
Minimum Edit Distance Algorithm

Edit Distance 60 / 65
Minimum Edit Distance Algorithm

Edit Distance 60 / 65
Minimum Edit Distance Algorithm

Edit Distance 60 / 65
Minimum Edit Distance Algorithm

Edit Distance 60 / 65
Minimum Edit Distance Algorithm

Edit Distance 60 / 65
Minimum Edit Distance Algorithm

Edit Distance 60 / 65
Minimum Edit Distance Algorithm

Edit Distance 60 / 65
Minimum Edit Distance Algorithm

Edit Distance 60 / 65
Minimum Edit Distance Algorithm

Edit Distance 60 / 65
Minimum Edit Distance Algorithm

Edit Distance 60 / 65
Minimum Edit Distance Algorithm

Edit Distance 60 / 65
Minimum cost alignment

Augment the minimum edit distance algorithm


- to store backpointers in each cell.

Edit Distance 61 / 65
Minimum cost alignment

Augment the minimum edit distance algorithm


- to store backpointers in each cell.
Perform a backtrace from the final cell to the initial cell.

Edit Distance 61 / 65
Minimum cost alignment

Augment the minimum edit distance algorithm


- to store backpointers in each cell.
Perform a backtrace from the final cell to the initial cell.
Each complete path between the final cell and the initial cell is a
minimum distance alignment.

Edit Distance 61 / 65
Minimum cost alignment
Augment the minimum edit distance algorithm
- to store backpointers in each cell.
Perform a backtrace from the final cell to the initial cell.
Each complete path between the final cell and the initial cell is a
minimum distance alignment.

Edit Distance 61 / 65
Home Work

Home Work 62 / 65
Home Work

Exercises
2.1 RE

Home Work 63 / 65
Home Work

Exercises
2.1 RE
2.2 RE

Home Work 63 / 65
Home Work

Exercises
2.1 RE
2.2 RE
2.3 ELIZA-like program

Home Work 63 / 65
Home Work

Exercises
2.1 RE
2.2 RE
2.3 ELIZA-like program
2.4 Minimum edit distance

Home Work 63 / 65
Home Work

Exercises
2.1 RE
2.2 RE
2.3 ELIZA-like program
2.4 Minimum edit distance
2.5 Minimum edit distance

Home Work 63 / 65
Home Work

Exercises
2.1 RE
2.2 RE
2.3 ELIZA-like program
2.4 Minimum edit distance
2.5 Minimum edit distance
2.6 Minimum edit distance

Home Work 63 / 65
Home Work

Exercises
2.1 RE
2.2 RE
2.3 ELIZA-like program
2.4 Minimum edit distance
2.5 Minimum edit distance
2.6 Minimum edit distance
2.7 Minimum cost alignment

Home Work 63 / 65
References I

Kucera, H., Kučera, H., & Francis, W. N. (1967). Computational analysis


of present-day american english. Brown university press.
Levenshtein, V. I. (1966). Binary codes capable of correcting deletions,
insertions, and reversals. In Soviet physics doklady (Vol. 10, pp.
707–710).
Mumin, M. A. A., Shoeb, A. A. M., Selim, M. R., & Iqbal, M. Z. (2014).
Sumono: A representative modern bengali corpus. SUST Journal of
Sc. and Tech., 21(1), 78–86. Retrieved from journals.sust.edu/
uploads/archive/1a5537abd87524887e5e68b2975082fb.pdf

References 64 / 65
THANK YOU

References Thank You 65 / 65

You might also like