
VANCOUVER SUMMER PROGRAM

Package G (Linguistics): Computation for Natural Language Processing


Class 9
Instructor: Michael Fry
THE PLAN TODAY
• Go over Quiz 2
• Go over Assignment 1
• Construct an n-gram model
• Generate text using the n-gram model
• Most common
• Random
• Probabilistic
• Learn about functions
• Build end-to-end model of generated sentences to speech
• NLTK methods
• Tokenizing, Stemming, Lemmatization
BUT FIRST

• Next week:
• Monday is a holiday
• Tuesday is an optional workshop
• Meaning I won’t prepare anything, but you can come with questions/projects and we can
work on them together
• Assignment 3 is due
• Wednesday is your final exam
• 8:30-11:30am, Buchanan B209 (like your normal morning class)
• And, that’s a wrap
• I’ll do the marking as soon as feasible
N-GRAM MODEL
• So, how do we build an n-gram model?
• Steps:
• Read through the corpus
• Identify each gram that occurs
• For each gram, memorize the n-1 following grams
• When done, calculate the probability of each n-gram
N-GRAM MODEL
• Here’s a simple example using Green Eggs and Ham by Dr. Seuss

Corpus:
    Do you like
    Green eggs and ham
    I do not like them,
    Sam-I-am.
    I do not like
    Green eggs and ham.

Grams: do (3), you (1), like (3), green (2), eggs (2), and (2), ham (2), I (2), not (2), them (1), Sam-I-am (1)

Bigrams: do you (1), you like (1), like green (2), green eggs (2), eggs and (2), and ham (2), ham I (1), I do (2), do not (2), not like (2), like them (1), them Sam-I-am (1), Sam-I-am I (1)
N-GRAM MODEL
• What’s a good way to track N-grams in Python? We need a word (key) that
connects to the next-n-words (value)
• Dictionary

bigrams = {'do': {'you': 1, 'not': 2},
           'you': {'like': 1},
           'like': {'green': 2, 'them': 1},
           'green': {'eggs': 2},
           'eggs': {'and': 2},
           'and': {'ham': 2},
           'ham': {'I': 1},
           'I': {'do': 2},
           'them': {'Sam-I-am': 1},
           'Sam-I-am': {'I': 1}}
N-GRAM MODEL
• Let’s build one
• Choose a dataset from nltk.corpus.gutenberg.fileids()
• Go through the words (not all of them; say, the first thousand) and build a bigram model (sketched after this list)
• To access words, use: nltk.corpus.gutenberg.words('austen-emma.txt')
• You don’t have to use austen-emma.txt, just one of them
• Remember to match the format of the bigram dictionary:
• Each key is a word with a value that is another dictionary
• In the nested dictionary, the key is a word that followed the first key and value is the frequency with
which it followed it
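• Here’s one possible sketch of those steps (a sketch, not the only way; the variable names are my own):

import nltk
# you may need nltk.download('gutenberg') the first time

# take the first thousand words of one Gutenberg text
words = nltk.corpus.gutenberg.words('austen-emma.txt')[:1000]

# build the bigram dictionary: {word: {next_word: count, ...}, ...}
bigrams = {}
for first, second in zip(words, words[1:]):
    if first not in bigrams:
        bigrams[first] = {}
    bigrams[first][second] = bigrams[first].get(second, 0) + 1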
N-GRAM MODEL
• Once you have your bigram dictionary, you can build a sentence using it
• There are lots of ways to implement this; one option (sketched after this list):
• Provide a first word and the number of words you want in your sentence
• Access the bigrams from your first word as a key
• Select the next word with the highest occurrence
• Repeat until you reach the number of words in your sentence
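• A sketch of this option, assuming the bigrams dictionary built above (first_word and num_words are my own names):

first_word = 'the'
num_words = 8
sentence = [first_word]
current = first_word
for _ in range(num_words - 1):
    options = bigrams[current]               # e.g. {'you': 1, 'not': 2}
    current = max(options, key=options.get)  # take the most frequent follower
    sentence.append(current)
# (assumes every word we reach has at least one recorded follower)
print(' '.join(sentence))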
N-GRAM MODEL
• Another possible way to generate sentences is to use the random module (sketched after this list)
• import random
• Provide the first word and number of desired words
• Get the possible next words from your bigram list
• Use random.randint(lower_bound, upper_bound) to select a random word
• Repeat with your new word until you reach the desired number of words
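• A sketch of the random version (same assumptions as above):

import random

first_word = 'the'
num_words = 8
sentence = [first_word]
current = first_word
for _ in range(num_words - 1):
    options = list(bigrams[current])                        # candidate next words
    current = options[random.randint(0, len(options) - 1)]  # randint includes both endpoints
    sentence.append(current)
print(' '.join(sentence))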
N-GRAM MODEL
• A final option is to use probabilities (probably the best option when thinking about how the brain might actually work); a sketch follows the diagram below
• Provide a word and the number of desired words in the sentence
• Access the bigram list
• Find the probability of each following-word option
• Use random to generate a number between 0 and 1, then select the word whose probability range that number falls in
N-GRAM MODEL
• In other words, divide up the space between 0 and 1, and assign each gram to a
range. Pick a random number using random.random() and see which word's range it
falls in.
• You need to know the probability of each word, meaning its frequency over the sum of all possibilities

"like" "them" "green"


1/4 1/4 2/4
0.25 0.25 0.5
0.0 0.25 0.5 1.0
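• A sketch of the probabilistic version (same assumptions as the earlier sketches):

import random

first_word = 'the'
num_words = 8
sentence = [first_word]
current = first_word
for _ in range(num_words - 1):
    options = bigrams[current]
    total = sum(options.values())        # e.g. 4 for {'like': 1, 'them': 1, 'green': 2}
    r = random.random()                  # random float in [0, 1)
    cumulative = 0.0
    for word, count in options.items():
        cumulative += count / total      # upper edge of this word's range
        if r < cumulative:               # r landed in this word's range
            current = word
            break
    sentence.append(current)
print(' '.join(sentence))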
THE PLAN TODAY
• Go over Quiz 2
• Go over Assignment 1
• Construct an n-gram model
• Generate text using the n-gram model
• Most common
• Random
• Probabilistic
• Learn about functions
• Build end-to-end model of generated sentences to speech
• NLTK methods
• Tokenizing, Stemming, Lemmatization
NEW PYTHON BASIC: FUNCTIONS
• We now have three ways to generate a sentence; it would be nice to set these up as separate functions we can call whenever we want to generate something
• Functions are blocks of code that are assigned a name and can be executed whenever you want
• They:
• start with the def keyword
• require a colon (like any block)
• require whitespace indentation
NEW PYTHON BASIC: FUNCTIONS
• Here’s a super simple example:
def print_hello():
    print('Hello world!')
• Now we can call the function whenever we want by typing print_hello()
print_hello()
print_hello()
• Brackets are crucial! Using brackets "calls" the function (makes it work); typing print_hello with no brackets just "refers" to the function (nothing will happen)
NEW PYTHON BASIC: FUNCTIONS
• You can send information into a function. This is called "passing an argument"
• You can return information from a function using the keyword return.
def pluralize(word):
    if word == 'mouse':
        return 'mice'
    elif word == 'goose':
        return 'geese'
    else:
        return word + 's'

print(pluralize('table'))
print(pluralize('mouse'))
NEW PYTHON BASIC: FUNCTIONS
• Let’s practice (possible solutions are sketched below):
• Write a function that returns the length of a list (just like len(), but don’t use len())
• Write a function that takes a path_to_file argument and a split_by argument and returns
the lines of the file
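• Possible solutions (sketches, one of many ways; the function names are my own):

def list_length(items):
    # count the items one at a time instead of calling len()
    count = 0
    for _ in items:
        count += 1
    return count

def get_lines(path_to_file, split_by):
    # read the whole file and split it on split_by
    with open(path_to_file) as f:
        return f.read().split(split_by)

print(list_length(['a', 'b', 'c']))   # 3
# 'notes.txt' is a hypothetical file path
lines = get_lines('notes.txt', '\n')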
