
COS4861/102/0/2022

Tutorial Letter 102/0/2022

Natural Language Processing


COS4861

Year module

School of Computing

This tutorial letter contains assignment 01


University of South Africa
Define Tomorrow.
ASSIGNMENT 01
Due Date: 2022 May 18
Total Marks: 80
Unique Assignment Number: 123456789

ONLY FOR YEAR MODULE


Please note that it is your responsibility to check that your assignment is registered on the
assignment database. You can do this by visiting myUnisa.

Question 1: 30 Marks

(1.1) Design a Finite State Automaton (FSA) that recognises the finish times for top runners (5)
in a marathon. It should handle all times up to five hours. Times are reported to the
nearest 10th of a second. Make sure that “second”, “minute”, and “hour” have proper
singular or plural endings when appropriate.

(1.2) Write a regular expression for each of the following. You may use Python or Perl notation.
(a) The set of all alphabetic strings. (1)
(b) The set of all strings that have two words separated by the word ‘of’. For example, (1)
‘Axis of Evil’.
(c) The set of all strings consisting of digits that contain a repeated two-digit pattern (3)
in positions 4, 10, and 15. Strings can only be between 20 and 25 digits in length.
For example: ‘01234567834901342345’

(1.3) Consider the following Regular Expression (RE):

/^(\+\+\+)(-{2,3})(\:{1})\s[0-9_]+\s\3\2\1$/

(a) Describe the strings matched by the RE. (3)


(b) Give a non-trivial example of a pattern (string of characters) matched by the RE. (2)
For your chosen example write down the contents of registers 1, 2 and 3.
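
The "registers" referred to in (1.3) are the numbered capture groups of an RE, and \1, \2 and \3 are backreferences to whatever those groups captured. As a minimal illustration, the sketch below uses a made-up pattern that is unrelated to the RE above; the pattern and the helper name `registers` are assumptions chosen only to show how Python's `re` module exposes the group contents.

```python
import re

# Hypothetical pattern, NOT the RE from (1.3):
# group 1 captures "ab", group 2 captures one or more "c",
# and \2\1 requires the same two captured strings again, in reverse order.
PATTERN = re.compile(r"^(ab)(c+)\s\2\1$")

def registers(text):
    """Return the captured groups as a tuple, or None if there is no match."""
    match = PATTERN.match(text)
    return match.groups() if match else None

print(registers("abcc ccab"))  # ('ab', 'cc')
print(registers("abcc cab"))   # None, because \2 must repeat "cc" exactly
```

The same idea applies to any RE with backreferences: the text matched by the n-th pair of parentheses is stored in register n and must reappear wherever \n occurs.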

(1.4) This question introduces the concept of measuring the performance of a solution.
Please do the parts in sequence.
(a) Write your own RE which will match an email address. (3)
(b) Download the text-file called ’training.txt’ from the additional resources section to (3)
test your RE. Test your RE by using a grep utility, or programmatically (read the file,
perform the RE match, and print the results). Calculate the accuracy of your RE
using the following formula: P = c/N. Set N = 10 (there are 10 email addresses in
the file), and set c equal to the number of email addresses your RE found.


If your original RE did not match all the emails in the file, you may modify it. Write
down the final RE that you come up with, as well as its accuracy count. Did your
RE catch anything that wasn’t an email address?
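
The programmatic route described in (1.4)(b) can be sketched as follows. This is only an outline under stated assumptions: the function names `find_matches` and `accuracy` are hypothetical, and the sample lines and the phone-number pattern are stand-ins so that the sketch does not supply an email RE.

```python
import re

def find_matches(lines, pattern):
    """Collect every substring in `lines` that `pattern` matches."""
    regex = re.compile(pattern)
    found = []
    for line in lines:
        found.extend(regex.findall(line))
    return found

def accuracy(c, n):
    """P = c / N: c items found out of N items expected."""
    return c / n

# Hypothetical stand-in data and pattern (phone numbers, not emails).
sample = ["call 555-1234 today", "no number here", "fax: 555-9876"]
hits = find_matches(sample, r"\d{3}-\d{4}")
print(hits)                    # ['555-1234', '555-9876']
print(accuracy(len(hits), 2))  # 1.0
```

For the assignment, the lines would come from the downloaded file (for example via `open("training.txt").readlines()`), and printing the matches makes it easy to spot anything caught that is not an email address.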

(1.5) Now download ’test.txt’ and test your RE against it. Provide the accuracy of your RE. (3)
Set N = 5 and redo the calculation from above to show how your RE performed. List the
email addresses that your RE did not catch, and explain why they weren’t caught in
each case.

(1.6) Convert the following Non-deterministic Finite State Automaton (NFSA) to a Deterministic (6)
Finite State Automaton (DFSA). Refer to the Additional Resources on myUnisa for an
example of how this could be done. Give an RE that represents the DFSA. (Please note
that you will also be expected to convert an NFSA to a DFSA and write down the RE
that it represents in the examination.) Show the steps that you follow and draw the
resulting DFSA.

[Figure: NFSA state diagram with start state 1 and transitions labelled 0 and 1]

Question 2: 5 Marks

Write an FST to implement the Soundex algorithm as described in exercise 3.5 of the textbook.

Question 3: 25 Marks

NOTE: Your lecturer will give you instructions on how to handle the programming parts of your
assignment via the module site. Be sure to follow those instructions carefully.
You will be developing a library of NLP routines and functions. It is thus important to ensure that
you implement the code as functions, or in classes as methods. And the code that shows their
operation should simply make calls to these functions/methods.
Please note that you must implement the algorithms yourself; you may not make use of
existing libraries. Using, or copying from, existing libraries such as the NLTK, the Apache NLP, or
any other such project will result in a 0% mark.

(3.1) Computing the minimum edit distance by hand, determine whether connect is closer (5)
to commute or to contact. Show how you get to the edit distances (use an edit distance
grid). Use a cost of 1 for insertions and deletions and a cost of 2 for substitutions.
Please note that there is a mistake in the 2009 edition of the textbook. The second
last line of the minimum edit distance algorithm in Fig.3.25 should read:

distance[i − 1, j − 1] + sub-cost(source_j, target_i).
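
For context, the corrected line is the middle term of the update step in the minimum edit distance recurrence. The sketch below writes out the whole step under two assumptions: that i indexes the target and j indexes the source, as in the corrected line, and that the insertion and deletion terms are named ins-cost and del-cost by analogy with sub-cost. The cost values are those given in (3.1); a substitution cost of 0 for identical symbols is the usual convention and is assumed here.

```latex
\[
\mathrm{distance}[i,j] = \min
\begin{cases}
\mathrm{distance}[i-1,\,j] + \text{ins-cost}(\mathrm{target}_i) \\
\mathrm{distance}[i-1,\,j-1] + \text{sub-cost}(\mathrm{source}_j,\ \mathrm{target}_i) \\
\mathrm{distance}[i,\,j-1] + \text{del-cost}(\mathrm{source}_j)
\end{cases}
\]
\[
\text{ins-cost} = \text{del-cost} = 1, \qquad
\text{sub-cost}(s,t) =
\begin{cases}
0 & \text{if } s = t \\
2 & \text{otherwise}
\end{cases}
\]
```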

(3.2) Write a program to implement the minimum edit distance algorithm, and compare (10)
the results to the hand-computed results. Don’t hard-code the algorithm for this
question – allow the user to enter two words, and present them with the edit-
distance. Allow the user to specify the costs associated with substitutions, insertions,
and deletions. Add an option to print out the distance matrix.

(3.3) Implement a simplistic chatbot on any domain using the RE substitution method (10)
defined in section 2.1.6. You may not simply implement the ELIZA bot from the textbook.
Use at least five substitutions.
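
To illustrate what is meant by the RE substitution method, here is a deliberately small sketch with only two rules. Everything in it is hypothetical: the library help-desk domain, the rule patterns, the response templates and the function name `respond` are assumptions made for illustration, and it falls short of the five substitutions required.

```python
import re

# Hypothetical rules: (compiled pattern, response template).
# \1 in a template is replaced by the text captured from the user's input.
RULES = [
    (re.compile(r".*\bI want to borrow (.+)", re.IGNORECASE),
     r"Which edition of \1 are you looking for?"),
    (re.compile(r".*\bwhen (?:is|are) (.+) due\b.*", re.IGNORECASE),
     r"Let me check the due date for \1."),
]
DEFAULT = "Please tell me more."

def respond(utterance):
    """Apply the first rule whose pattern matches, else a default reply."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        if pattern.match(text):
            return pattern.sub(template, text, count=1)
    return DEFAULT

print(respond("I want to borrow Dune."))
print(respond("When is my book due?"))
print(respond("Hello"))
```

Keeping the rules in a list and the matching logic in one function fits the requirement that the code be organised as callable functions or methods; rule order matters because the first matching rule wins.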

Question 4: 20 Marks

(4.1) Do Exercises 4.2 and 4.3 in your textbook. Your corpora should each have at least (10)
800-1000 words to make your output meaningful. Remember, your output should indicate
P(w) for each unigram, and P(w_i | w_{i−1}) for each bigram.
Remember to add your corpora to your git repo, otherwise we won’t be able to test it.
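
As a rough guide to the two quantities the output should report, the sketch below computes unsmoothed (maximum likelihood) estimates from a token list. It rests on assumptions that are not part of the question: whitespace tokenisation, no sentence-boundary markers, and the hypothetical function names `unigram_probs` and `bigram_probs`.

```python
from collections import Counter

def unigram_probs(tokens):
    """P(w) = C(w) / total number of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

def bigram_probs(tokens):
    """P(w | prev) = C(prev, w) / C(prev as the first word of a bigram)."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    prev_counts = Counter(tokens[:-1])
    return {(prev, w): c / prev_counts[prev]
            for (prev, w), c in pair_counts.items()}

# Tiny stand-in corpus; a real corpus would be read from a file.
tokens = "the cat sat on the mat".split()
print(unigram_probs(tokens)["the"])          # 2 of 6 tokens
print(bigram_probs(tokens)[("the", "cat")])  # 0.5
```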

(4.2) Add Laplace (add-one) smoothing, as well as Good-Turing discounting. Print out the (10)
unsmoothed, Laplace-smoothed, and Good-Turing discounted values.
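
For reference, the standard textbook forms of the two estimators are sketched below. The notation is an assumption of this sketch: C(·) is a count in the corpus, V is the vocabulary size, and N_c is the number of distinct n-grams that occur exactly c times.

```latex
\[
P_{\text{Laplace}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\,w_i) + 1}{C(w_{i-1}) + V}
\]
\[
c^{*} = (c + 1)\,\frac{N_{c+1}}{N_c}
\]
```

Here c* is the Good-Turing discounted count that replaces the raw count c before probabilities are computed.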

©2022, UNISA (v2022.0.0)
