Natural Language Processing
Chapter 2
Regular Expressions, Text Normalization, Edit Distance

Department of Computer Science and Engineering,
Shahjalal University of Science and Technology, Sylhet
Regular Expression (RE)
A language for specifying text search strings.
Formally,
- an algebraic notation for characterizing a set of strings.

Basic RE Patterns: Simplest Kind
The simplest kind of RE is
- a sequence of simple characters.
We’ll show REs delimited by slashes (/).
But slashes (/) are not part of the REs.

Basic RE Patterns: Case Sensitivity
Regular expressions are case sensitive.

Lower case /s/ is distinct from uppercase /S/.
The pattern /woodchucks/ will not match the string Woodchucks.
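
A minimal sketch of literal matching and case sensitivity, assuming Python's standard re module (the slides do not name a particular regex engine). The sample sentences are illustrative.

```python
import re

# The slashes in /woodchucks/ are notation only; the pattern itself is "woodchucks".
pattern = re.compile(r"woodchucks")

found_lower = pattern.search("interesting links to woodchucks and lemurs")
found_upper = pattern.search("Woodchucks are rodents")

print(found_lower is not None)  # True
print(found_upper is None)      # True: the capital W does not match

# re.IGNORECASE is Python's option for relaxing case sensitivity when wanted.
found_any_case = re.search(r"woodchucks", "Woodchucks are rodents", re.IGNORECASE)
print(found_any_case is not None)  # True
```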

Basic RE Patterns: Disjunction
The string of characters inside the square brackets [ and ] specifies
- a disjunction of characters to match.

Basic RE Patterns: Range
Where there is a well-defined sequence associated with a set of characters,
- the brackets can be used with the dash (-) to specify any one character in a range.
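
The bracket and range notations can be tried out as follows. This is a small sketch using Python's re module; the example strings are chosen for illustration.

```python
import re

# Brackets: match either w or W at the start of the word.
both_cases = re.findall(r"[wW]oodchuck", "Woodchuck or woodchuck")
print(both_cases)  # ['Woodchuck', 'woodchuck']

# Brackets: the first a, b, or c in the string.
first_abc = re.search(r"[abc]", "In uomini, in soldati").group()
print(first_abc)  # 'a'

# Ranges: the first upper case letter and the first digit.
first_upper = re.search(r"[A-Z]", "we should call it 'Drenched Blossoms'").group()
first_digit = re.search(r"[0-9]", "Chapter 1: Down the Rabbit Hole").group()
print(first_upper, first_digit)  # D 1
```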

Basic RE Patterns: Negation
If the caret ^ is the first symbol after the open square bracket [,
- the resulting pattern is negated.

If the caret ^ occurs anywhere else inside the brackets,
- it usually stands for a literal caret ^.
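
A short sketch of negated brackets, assuming Python's re module. One engine-specific detail: in Python a caret outside brackets is always an anchor, so a literal caret there has to be escaped as \^.

```python
import re

# [^A-Z]: the first character that is not an upper case letter.
not_upper = re.search(r"[^A-Z]", "Oyfn pripetchik").group()
print(not_upper)  # 'y'

# [^Ss]: the first character that is neither S nor s.
not_s = re.search(r"[^Ss]", "Ssh, quiet").group()
print(not_s)  # 'h'

# [e^]: the caret is not first, so it is a literal caret.
e_or_caret = re.search(r"[e^]", "look up ^ now").group()
print(e_or_caret)  # '^'

# Outside brackets, Python needs the caret escaped to match it literally.
literal = re.search(r"a\^b", "look up a^b now").group()
print(literal)  # 'a^b'
```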

Basic RE Patterns: Optionality
How can we talk about optional elements,

- like an optional s in woodchuck and woodchucks?

Use the question mark /?/,

- which means “zero or one instances of the previous character”.
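
As a quick illustration (a sketch using Python's re module; the colou?r example is an extra illustration beyond the woodchuck example on the slide):

```python
import re

# The ? makes the preceding s optional.
woodchucks = re.findall(r"woodchucks?", "one woodchuck, two woodchucks")
print(woodchucks)  # ['woodchuck', 'woodchucks']

# The ? makes the preceding u optional.
colours = re.findall(r"colou?r", "color and colour")
print(colours)  # ['color', 'colour']
```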

Basic RE Patterns: Problem
Consider the language of certain sheep, which consists of strings that look like the following:
baa!
baaa!
baaaa!
baaaaa!
...
How can we represent the language using RE?

Basic RE Patterns: Kleene *
Use the Kleene * (pronounced “cleany star”).
It means “zero or more occurrences of the immediately previous character or regular expression”.

An integer (a string of digits) is thus /[0-9][0-9]*/.

Why isn’t it just /[0-9]*/?
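
One way to see the answer: because * allows zero occurrences, /[0-9]*/ also matches the empty string, which is not an integer. A small check, assuming Python's re module and using fullmatch so that the whole string must match:

```python
import re

# Zero digits still satisfy [0-9]*, so the empty string matches.
star_matches_empty = re.fullmatch(r"[0-9]*", "") is not None
print(star_matches_empty)  # True

# Requiring one digit first rules out the empty string.
safe_rejects_empty = re.fullmatch(r"[0-9][0-9]*", "") is None
safe_accepts_number = re.fullmatch(r"[0-9][0-9]*", "2024") is not None
print(safe_rejects_empty, safe_accepts_number)  # True True
```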

Basic RE Patterns: Kleene +
Writing the regular expression for digits twice is awkward,
- so there is a shorter way to specify “at least one” of some character.

Kleene + means “one or more occurrences of the immediately preceding character or regular expression”.
An integer (a string of digits) is thus /[0-9]+/.
Two ways to specify the sheep language: /baaa*!/ or /baa+!/.
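
A small check that the two sheep patterns accept the same strings, assuming Python's re module. The helper name is_sheep and the list of non-sheep strings are made up for this sketch.

```python
import re

def is_sheep(pattern, s):
    """Return True if the whole string s is in the language of pattern."""
    return re.fullmatch(pattern, s) is not None

sheep = ["baa!", "baaa!", "baaaa!", "baaaaa!"]
not_sheep = ["ba!", "b!", "baa"]

for pattern in (r"baaa*!", r"baa+!"):
    print(pattern,
          all(is_sheep(pattern, s) for s in sheep),
          any(is_sheep(pattern, s) for s in not_sheep))
# both patterns print: True False

integers = re.findall(r"[0-9]+", "room 101, floor 7")
print(integers)  # ['101', '7']
```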

Basic RE Patterns: Wildcard
The period (/./),
- a wildcard expression that matches any single character (except a carriage return).

The wildcard (.) is often used together with the Kleene *
- to mean “any string of characters”.

Suppose we want to find any line in which a particular word, for example, aardvark, appears twice.
We can specify this with the regular expression /aardvark.*aardvark/.
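
A sketch of the wildcard in use, assuming Python's re module (where the period by default matches any character except a newline). The sample lines are illustrative.

```python
import re

# beg.n: any single character between "beg" and "n".
wild = re.findall(r"beg.n", "begin, begun, beg'n, began")
print(wild)  # ['begin', 'begun', "beg'n", 'began']

# aardvark.*aardvark: the word twice on one line, with anything in between.
twice = re.compile(r"aardvark.*aardvark")
lines = ["the aardvark saw another aardvark", "one aardvark only"]
hits = [twice.search(line) is not None for line in lines]
print(hits)  # [True, False]
```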

Basic RE Patterns: Anchors
Special characters that anchor regular expressions to particular places in a string.
The most common anchors are
- the caret ^ and the dollar sign $.

Basic RE Patterns: Anchor caret ^
The caret ^
- matches the start of a line.
The pattern /^The/ matches
- the word The only at the start of a line.

Basic RE Patterns: The caret ^
The caret ^ has three uses:
- to match the start of a line,
- to indicate a negation inside of square brackets,
- just to mean a caret ^.

Basic RE Patterns: Anchor dollar sign $
The dollar sign $ matches
- the end of a line.

/⊔$/ (where ⊔ denotes a space)
- a useful pattern for matching a space at the end of a line.

/^The dog\.$/
- matches a line that contains only the phrase The dog.
We have to use the backslash (\) here
- since we want the . to mean “period” and not the wildcard.
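
The anchors can be tried on a small multi-line string. This sketch assumes Python's re module, where the re.MULTILINE flag is needed for ^ and $ to match at every line rather than only at the ends of the whole string. The text is made up.

```python
import re

text = "The dog.\nThe dog barked.\nSee The dog.\nends with space \n"

# ^The: only at the start of a line (lines 1 and 2).
starts = re.findall(r"^The", text, flags=re.MULTILINE)
print(len(starts))  # 2

# ^The dog\.$: a line containing only that phrase.
only_phrase = re.findall(r"^The dog\.$", text, flags=re.MULTILINE)
print(only_phrase)  # ['The dog.']

# A space at the end of a line (line 4).
trailing = re.findall(r" $", text, flags=re.MULTILINE)
print(len(trailing))  # 1

# Without the backslash, the period is a wildcard and also accepts "The dogs".
unescaped = re.search(r"^The dog.$", "The dogs") is not None
escaped = re.search(r"^The dog\.$", "The dogs") is not None
print(unescaped, escaped)  # True False
```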

Basic RE Patterns: Anchors more
Two other anchors:

- \b matches a word boundary
- \B matches a non-boundary.
/\bthe\b/
- matches the word the but not the word other.

Basic RE Patterns: “word” definition
Based on the definition of “words” in programming languages.

Technically, a “word” for the purposes of a regular expression is defined as
- any sequence of digits, underscores, or letters.

/\b99\b/
- will match the string 99 in There are 99 bottles of beer on the wall (because 99 follows a space),
- but not 99 in There are 299 bottles of beer on the wall (since 99 follows a number).
- But it will match 99 in $99 (since 99 follows a dollar sign ($), which is not a digit, underscore, or letter).
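
The boundary cases above can be checked directly. A sketch assuming Python's re module; the sentence used for \bthe\b is made up.

```python
import re

# \bthe\b finds the standalone word, not "the" inside other or then.
the_words = re.findall(r"\bthe\b", "the other one, then the end")
print(the_words)  # ['the', 'the']

# \B requires a non-boundary on both sides, so it finds "the" inside other.
inner = re.findall(r"\Bthe\B", "other")
print(inner)  # ['the']

b99 = re.compile(r"\b99\b")
in_99 = b99.search("There are 99 bottles of beer on the wall") is not None
in_299 = b99.search("There are 299 bottles of beer on the wall") is not None
in_dollar = b99.search("It costs $99") is not None
print(in_99, in_299, in_dollar)  # True False True
```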


Basic RE Patterns: Disjunction operator
Search for either the string cat or the string dog.

Why can’t we say /[catdog]/?
- Because square brackets match a single character: /[catdog]/ matches any one of the letters c, a, t, d, o, or g.

Use the disjunction operator, also called the pipe symbol |.
The pattern /cat|dog/ matches
- either the string cat or the string dog.

Basic RE Patterns: Precedence
How can I specify both guppy and guppies?

We cannot simply say /guppy|ies/,
- because that would match only the strings guppy and ies.

Use the parenthesis operators ( and ).
The pattern /gupp(y|ies)/ would specify that
- we meant the disjunction only to apply to the suffixes y and ies.

Basic RE Patterns: Operator precedence hierarchy
Counters have a higher precedence than sequences,
- /the*/ matches theeeee but not thethe.
Sequences have a higher precedence than disjunction,
- /the|any/ matches the or any but not thany or theny.
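
The disjunction, grouping, and precedence claims can be verified with whole-string matching. A sketch assuming Python's re module; the helper name accepts is made up for this example.

```python
import re

def accepts(pattern, s):
    """True if pattern matches the entire string s."""
    return re.fullmatch(pattern, s) is not None

# Pipe versus brackets.
print(accepts(r"cat|dog", "cat"), accepts(r"cat|dog", "dog"))  # True True
print(accepts(r"[catdog]", "cat"), accepts(r"[catdog]", "c"))  # False True

# Without parentheses the disjunction covers the whole sequences guppy and ies.
print(accepts(r"guppy|ies", "guppies"), accepts(r"guppy|ies", "ies"))      # False True
print(accepts(r"gupp(y|ies)", "guppy"), accepts(r"gupp(y|ies)", "guppies"))  # True True

# Counters bind tighter than sequences: the* repeats only the e.
print(accepts(r"the*", "theeeee"), accepts(r"the*", "thethe"))  # True False
print(accepts(r"(the)*", "thethe"))                              # True

# Sequences bind tighter than disjunction.
print(accepts(r"the|any", "thany"))  # False
```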

Basic RE Patterns: A Simple Example
Write a RE to find cases of the English article the.

Solution: /the/
- Misses the word when it begins a sentence and hence is capitalized (i.e., The).

Solution: /[tT]he/
- Still incorrectly returns texts with the embedded in other words (e.g., other or theology).

Solution: /\b[tT]he\b/
- We might want to find the in some context where it might also have underlines or numbers nearby (the_ or the25).

Solution: /[^a-zA-Z][tT]he[^a-zA-Z]/
- Won’t find the word the when it begins a line.

Solution: /(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/
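
Each refinement can be checked against the case that motivated it. A sketch assuming Python's re module; the variable names p1 to p5, the helper found, and the test strings are made up for illustration.

```python
import re

p1 = r"the"
p2 = r"[tT]he"
p3 = r"\b[tT]he\b"
p4 = r"[^a-zA-Z][tT]he[^a-zA-Z]"
p5 = r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)"

def found(pattern, s):
    """True if pattern matches somewhere in s."""
    return re.search(pattern, s) is not None

print(found(p1, "The dog"))                        # False: misses capitalized The
print(found(p2, "The dog"), found(p2, "other"))    # True True: other is a false positive
print(found(p3, "other"), found(p3, "see the25"))  # False False: \b rejects the25
print(found(p4, "see the25 now"), found(p4, "the dog"))  # True False: misses line start
print(found(p5, "the dog"), found(p5, "say the"), found(p5, "other"))  # True True False
```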

NLP Errors
Two kinds of errors:

false positives:
- strings that we incorrectly matched, like other or there
false negatives:
- strings that we incorrectly missed, like The.

NLP Errors: Two efforts
Two kinds of effort:

- Increasing precision (minimizing false positives)
- Increasing recall (minimizing false negatives)
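
To make the two efforts concrete, the sketch below scores a pattern against a tiny hand-labeled list using the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN). The function name, the labeled examples, and the use of Python's re module are assumptions for illustration, not part of the slides.

```python
import re

def precision_recall(pattern, labeled):
    """labeled is a list of (text, should_match) pairs."""
    tp = fp = fn = 0
    for text, should_match in labeled:
        matched = re.search(pattern, text) is not None
        if matched and should_match:
            tp += 1  # correctly matched
        elif matched:
            fp += 1  # false positive
        elif should_match:
            fn += 1  # false negative
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical gold labels: only the article itself should match.
labeled = [("the", True), ("The", True), ("other", False), ("there", False)]

loose = precision_recall(r"the", labeled)
strict = precision_recall(r"\b[tT]he\b", labeled)
print(loose)   # precision 1/3, recall 1/2
print(strict)  # (1.0, 1.0)
```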

A More Complex Example
Write a RE for
- any machine with at least 6 GHz and 500 GB of disk space for less than $1000.
HOME WORK: Section 2.1.4

More RE Operators: Aliases
Aliases for common sets of characters.

More RE Operators: Counters
RE operators for counting.

More RE Operators: Backslashed
Some characters that need to be backslashed.
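
A few of these operators in action. This is a sketch using the alias, counter, and escape syntax of Python's re module (\d, \w, \s, {n}, {n,m}, {n,} and backslash escapes); the example strings are made up, and other regex engines may differ in detail.

```python
import re

# Aliases: \d is a digit, \w a letter, digit, or underscore, \s whitespace.
digits = re.findall(r"\d+", "Party of 5 on 12/07")
words = re.findall(r"\w+", "Blue_moon, 2 words")
pieces = re.split(r"\s+", "in  Concord\tnow")
print(digits)  # ['5', '12', '07']
print(words)   # ['Blue_moon', '2', 'words']
print(pieces)  # ['in', 'Concord', 'now']

# Counters: exactly 3, between 2 and 4, at least 2.
exactly_three = re.fullmatch(r"a{3}", "aaa") is not None
two_to_four = re.fullmatch(r"[0-9]{2,4}", "12345") is not None
at_least_two = re.fullmatch(r"[0-9]{2,}", "12345") is not None
print(exactly_three, two_to_four, at_least_two)  # True False True

# Backslashed characters: match *, ., and ? literally.
literal = re.search(r"\*\.\?", "K*.?A").group()
print(literal)  # '*.?'
```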

Text Normalization
Corpora
Corpus: a computer-readable collection of text or speech.

plural: Corpora

Corpora: Example
The Brown corpus
- a million-word collection of English texts
- from different genres (newspaper, fiction, non-fiction, academic, etc.)
- assembled at Brown University in 1963-64 (Kucera et al., 1967)

Corpora: Example
The SUMono corpus
- a 27-million-word collection of Bangla texts
- from different genres (newspaper, fiction, non-fiction, academic, blogs, etc.)
- assembled at Shahjalal University of Science and Technology in 2010-14 (Mumin et al., 2014)

Words
How many words are in the following Brown sentence?

He stepped out into the hall, was delighted to encounter a water brother.
- 13 words, if we don’t count punctuation marks as words
- 15 words, if we count punctuation
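
Both counts for the Brown sentence can be reproduced with a simple regular-expression tokenizer. This is a sketch assuming Python's re module; the two patterns are one simple choice, not a general-purpose tokenizer.

```python
import re

sentence = "He stepped out into the hall, was delighted to encounter a water brother."

# Words only: runs of letters, digits, or underscores.
words = re.findall(r"\w+", sentence)

# Words plus each punctuation mark as its own item.
words_and_punct = re.findall(r"\w+|[^\w\s]", sentence)

print(len(words))            # 13
print(len(words_and_punct))  # 15
```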

Words
How many words are in the following SUMono sentence?

আর তাই ণীকে ছা ছা ীেদর পাশাপািশ যখন বসলাম, তখন কারও কারও
মেন হ ল, িফের গিছ িব িবদ ালেয়র সই িদন েলায়।

- 17 words, if we don’t count punctuation marks as words
- 20 words, if we count punctuation

Words
Lemma: a set of lexical forms having
- the same stem
- the same major part-of-speech
- and the same word sense.
This stem is called the lemma of this set of lexical forms.

Wordform: the full inflected or derived form of the word.

For example: cats versus cat?
- These two words have the same lemma cat but are different wordforms.

Words
How many words are there in a corpus?

Two ways of talking about words: Types and Tokens
- Types: the number of distinct words in a corpus.
- Tokens: the total number of running words.

How many word Types and word Tokens are in the following SUMono sentence?
আর তাই ণীকে ছা ছা ীেদর পাশাপািশ যখন বসলাম, তখন কারও
কারও মেন হ ল, িফের গিছ িব িবদ ালেয়র সই িদন েলায়।

- 17 Tokens and 16 Types, if we ignore punctuation
- 20 Tokens and 18 Types, if we count punctuation
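
The same kind of count can be sketched in code. The English sentence below is an illustrative stand-in chosen for this example (it is not the SUMono sentence), and the tokenizer is the same simple regular-expression assumption as before, using Python's re module.

```python
import re

sentence = "They picnicked by the pool, then lay back on the grass and looked at the stars."

# Ignoring punctuation.
tokens = re.findall(r"\w+", sentence)
types = set(tokens)
print(len(tokens), len(types))  # 16 14 (the occurs three times)

# Counting punctuation marks as words.
tokens_p = re.findall(r"\w+|[^\w\s]", sentence)
types_p = set(tokens_p)
print(len(tokens_p), len(types_p))  # 18 16
```

Note that They and the count as different types here because no case folding is applied.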

Text Normalization
Commonly applied text normalization process:
- Tokenizing (segmenting) words
- Normalizing word formats
- Segmenting sentences

Tokenization
Tokenization: the task of segmenting running text into tokens.

Normalization
Normalization: the task of putting tokens in a standard format, choosing a single normal form for tokens with multiple forms.
For example: USA and US

Lemmatization
Lemmatization: the task of determining that several words have the same root, despite their surface differences.

For example: am, are, and is have the shared lemma be
The lemmatized form of a sentence like
He is reading detective stories would thus be
He be read detective story.

Morphological parsing
How is lemmatization done?

The most sophisticated methods for lemmatization involve
- complete morphological parsing of the word.

Morphology is the study of the way words are built up
- from smaller meaning-bearing units called morphemes.

Two broad classes of morphemes can be distinguished:
- stems: the central morpheme of the word, supplying the main meaning
- affixes: adding “additional” meanings of various kinds.

fox consists of one morpheme: fox
foxes consists of two morphemes: fox and -es

Stemming Vs Lemmatization
Stemming algorithms work
- by cutting off the end or the beginning of the word,
- taking into account a list of common prefixes and suffixes that can be found in an inflected word.

This indiscriminate cutting can be successful on some occasions, but not always.

Lemmatization takes into consideration the morphological analysis of the words.
To do so, it is necessary to have detailed dictionaries
- which the algorithm can look through to link the form back to its lemma.
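
The contrast can be sketched with two toy functions. Everything here is hypothetical and deliberately tiny: the suffix list, the lemma dictionary, and the function names are made up for illustration and are not a real stemmer or lemmatizer.

```python
# Hypothetical suffix list; real stemmers use far more careful rules.
SUFFIXES = ("ies", "ing", "es", "s")

def naive_stem(word):
    """Cut off the first listed suffix found, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical dictionary linking word forms back to their lemmas.
LEMMAS = {"am": "be", "are": "be", "is": "be",
          "reading": "read", "stories": "story"}

def lemmatize(word):
    """Look the form up in the dictionary; unknown words are left unchanged."""
    return LEMMAS.get(word.lower(), word)

sentence = "He is reading detective stories".split()
stemmed = [naive_stem(w) for w in sentence]
lemmatized = [lemmatize(w) for w in sentence]
print(stemmed)     # ['He', 'is', 'read', 'detective', 'stor']
print(lemmatized)  # ['He', 'be', 'read', 'detective', 'story']
```

The stemmer gets read right but turns stories into stor and cannot relate is to be, while the dictionary lookup reproduces the lemmatized sentence He be read detective story.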

Edit Distance
Measuring how similar two strings are.

In spelling correction,
- the user typed some erroneous string, let’s say graffe
- we want to know what the user meant.
In coreference,
- the task of deciding whether two strings such as the following refer to the same entity:
Stanford President John Hennessy
Stanford University President John Hennessy

Edit distance gives us a way
- to quantify both of these intuitions about string similarity.

More formally, the minimum edit distance between two strings is defined as
- the minimum number of editing operations needed to transform one string into another.
Editing operations include insertion, deletion, and substitution.

What is the gap between intention and execution?

Visualize the problem as an alignment between the two strings.
I N T E * N T I O N
* E X E C U T I O N
- delete an I
- substitute E for N
- substitute X for T
- insert C
- substitute U for N

We can also assign a particular cost or weight to each of these operations.
The simplest weighting factor: Levenshtein distance (Levenshtein, 1966)
- Each of the three operations has a cost of 1.
- Assume that the substitution of a letter for itself, for example, T for T, has zero cost.
What is the Levenshtein distance between intention and execution?
- The Levenshtein distance between intention and execution is 5.

An alternative version of Levenshtein distance:
- Each insertion or deletion has a cost of 1 and substitutions are not allowed.
- This is equivalent to allowing substitution, but giving each substitution a cost of 2, since any substitution can be represented by one insertion and one deletion.
Using this version, what is the Levenshtein distance between intention and execution?
- Using this version, the Levenshtein distance between intention and execution is 8.
Minimum Edit Distance
How do we find the minimum edit distance?

Think of this as a search task
- in which we are searching for the shortest path from one string to another
- through a sequence of edits.

We can do this by using dynamic programming.

Dynamic programming
A class of algorithms that
- apply a table-driven method
- to solve problems by combining solutions to sub-problems.
Examples: the Viterbi algorithm, the CKY algorithm for parsing
Minimum Edit Distance Algorithm
Given two strings,
- the source string X of length n
- the target string Y of length m

We’ll define D[i, j] as the edit distance between X[1..i] and Y[1..j],
- i.e., the first i characters of X and the first j characters of Y.
The edit distance between X and Y is thus D[n, m].

Base conditions:
D[0, 0] = 0
D[i, 0] = D[i-1, 0] + del-cost(X[i])
D[0, j] = D[0, j-1] + ins-cost(Y[j])

Recurrence conditions:
D[i, j] = min( D[i-1, j] + del-cost(X[i]),
               D[i, j-1] + ins-cost(Y[j]),
               D[i-1, j-1] + sub-cost(X[i], Y[j]) )

Applying alternative version of Levenshtein distance:
D[i, j] = min( D[i-1, j] + 1,
               D[i, j-1] + 1,
               D[i-1, j-1] + (2 if X[i] ≠ Y[j], else 0) )
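
A minimal sketch of the table-filling algorithm in Python, assuming insertion and deletion costs of 1. The function name and the sub_cost parameter are choices made for this example: sub_cost=1 gives the plain Levenshtein distance and sub_cost=2 the alternative version.

```python
def min_edit_distance(source, target, sub_cost=1):
    """Minimum edit distance with insert/delete cost 1 and the given substitution cost."""
    n, m = len(source), len(target)
    # D[i][j] = distance between the first i chars of source and first j chars of target.
    D = [[0] * (m + 1) for _ in range(n + 1)]

    # Base conditions: delete everything, or insert everything.
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + 1
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + 1

    # Recurrence: best of deletion, insertion, and substitution (free if letters agree).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if source[i - 1] == target[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + 1,
                          D[i][j - 1] + 1,
                          D[i - 1][j - 1] + substitution)
    return D[n][m]

print(min_edit_distance("intention", "execution"))              # 5
print(min_edit_distance("intention", "execution", sub_cost=2))  # 8
```

Python strings are indexed from 0, so X[i] in the recurrence corresponds to source[i - 1] in the code.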
Minimum cost alignment
Augment the minimum edit distance algorithm
- to store backpointers in each cell.

Perform a backtrace from the final cell to the initial cell.

Each complete path between the final cell and the initial cell is a minimum distance alignment.
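
A sketch of the augmented algorithm in Python. To stay short it stores a single backpointer per cell, so it recovers one minimum cost alignment out of possibly several; the function name, the operation labels, and the use of None for a gap are choices made for this example.

```python
def min_cost_alignment(source, target, sub_cost=1):
    """Return (distance, ops), where ops is a list of (operation, source_char, target_char)."""
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]

    for i in range(1, n + 1):
        D[i][0], back[i][0] = i, "del"
    for j in range(1, m + 1):
        D[0][j], back[0][j] = j, "ins"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            candidates = [
                (D[i - 1][j - 1] + (0 if same else sub_cost), "match" if same else "sub"),
                (D[i - 1][j] + 1, "del"),
                (D[i][j - 1] + 1, "ins"),
            ]
            # Keep the cheapest option and remember which move produced it.
            D[i][j], back[i][j] = min(candidates, key=lambda c: c[0])

    # Backtrace from the final cell to the initial cell.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        op = back[i][j]
        if op in ("match", "sub"):
            ops.append((op, source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif op == "del":
            ops.append((op, source[i - 1], None))
            i -= 1
        else:
            ops.append((op, None, target[j - 1]))
            j -= 1
    ops.reverse()
    return D[n][m], ops

cost, ops = min_cost_alignment("intention", "execution")
print(cost)  # 5
for op in ops:
    print(op)
```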
Home Work
Exercises
2.1 RE
2.2 RE
2.3 ELIZA-like program
2.4 Minimum edit distance
2.7 Minimum cost alignment
References
Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Brown University Press.
Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady (Vol. 10, pp. 707–710).
Mumin, M. A. A., Shoeb, A. A. M., Selim, M. R., & Iqbal, M. Z. (2014). SUMono: A representative modern Bengali corpus. SUST Journal of Sc. and Tech., 21(1), 78–86. Retrieved from journals.sust.edu/uploads/archive/1a5537abd87524887e5e68b2975082fb.pdf
