
MANIPAL ACADEMY OF HIGHER EDUCATION

B.Tech V Semester Sessional Examination September 2023


Natural Language Processing [DSE 3155]
Scheme of Evaluation
Exam Date & Time: 27-Sep-2023 (10:45 AM - 12:45 PM)

Marks: 30 Duration: 120 mins.


TYPE: MCQ
Q. No. Question and Solution Marks
Q1 Identify the suitable definition for morphology 0.5

1. The study of the rules governing the sounds that form words

2. The study of the rules governing sentence formation

3. The study of the rules governing word formation

4. The study of the rules governing sounds


Q2 In the Porter stemmer algorithm, if the condition is “(*d)”, 0.5
which of the following is the correct interpretation?

1. The stem ends with a vowel followed by a consonant.

2. The stem ends with a double vowel

3. The stem ends with a double consonant

4. The stem ends in CVC, where the second C is not W, X, or Y


Q3 Which of the following is a category of non-concatenative 0.5
morphology?

1. Reduplication

2. Affixation

3. Compounding

4. None of the above


Q4 (a|b)* is equivalent to 0.5

1. b*a*

2. (a*b*)*
3. a*b*

4. None of the above


Q5 /ba{3,4}d\d/ matches 0.5

1. baad

2. baaad6

3. baaadd

4. baaad
Q6 Which sentence best describes inflectional morphology? 0.5

1. Adding a morpheme to produce a new word but the same lexeme

2. Adding a morpheme to produce a new word and different lexeme

3. Adding a morpheme to produce the same word but a different lexeme

4. Adding a morpheme to produce the same sentence but a different lexeme
Q7 Humans usually write ’m to state am. Under which type of 0.5
morphology can this example be categorized?

1. Plural Noun

2. Cliticization

3. Singular Noun

4. Inflectional
Q8 In a finite state transducer, the simple pairs of symbols in the 0.5
alphabet Σ are called

1. Default pairs

2. Fragments

3. Finite alphabets

4. Feasible pairs
Q9 Morphotactics is a model for 0.5

1. Spelling modifications that may occur during affixation

2. Explaining the model of morpheme ordering

3. All affixes in the English language

4. Listing stems with basic information
Q10 The number of states in a minimal deterministic finite 0.5
automaton corresponding to the language L = {a^n | n >= 4} is

1. 4

2. 5

3. 3

4. 6

TYPE: DESCRIPTIVE
Q. No. Question and Solution Marks
Q11 Design the Minimum Edit Distance algorithm. Apply the same to compute the
minimum edit distance between the words PEACEFUL and PAECFLU. (Total: 4 Marks)

Solution:

Algorithm: (1 Mark or 2 Marks)

Computation of the minimum edit distance between the words PEACEFUL and
PAECFLU (insertion and deletion cost 1, substitution cost 2): (1 Mark)
L 8 7 6 5 6 5 4 5
U 7 6 5 4 5 4 5 4
F 6 5 4 3 4 3 4 5
E 5 4 3 2 3 4 5 6
C 4 3 2 3 2 3 4 5
A 3 2 1 2 3 4 5 6
E 2 1 2 1 2 3 4 5
P 1 0 1 2 3 4 5 6
# 0 1 2 3 4 5 6 7
# P A E C F L U

Tracing (any one possible case): (1 Mark)

L 8 7 6 5 6 5 4 5
U 7 6 5 4 5 4 5 4
F 6 5 4 3 4 3 4 5
E 5 4 3 2 3 4 5 6
C 4 3 2 3 2 3 4 5
A 3 2 1 2 3 4 5 6
E 2 1 2 1 2 3 4 5
P 1 0 1 2 3 4 5 6
# 0 1 2 3 4 5 6 7
# P A E C F L U

Total cost = 5
The operations performed are: 3 deletions and 2 insertions.
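
For reference, a minimal Python sketch of the dynamic-programming computation used above (insertion and deletion cost 1, substitution cost 2, matching the table; the function name is illustrative):

    # Minimum edit distance DP, as in the table above.
    def min_edit_distance(source, target):
        n, m = len(source), len(target)
        D = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            D[i][0] = i                              # i deletions
        for j in range(1, m + 1):
            D[0][j] = j                              # j insertions
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = 0 if source[i - 1] == target[j - 1] else 2
                D[i][j] = min(D[i - 1][j] + 1,       # deletion
                              D[i][j - 1] + 1,       # insertion
                              D[i - 1][j - 1] + sub) # substitution / copy
        return D[n][m]

    print(min_edit_distance("PEACEFUL", "PAECFLU"))  # 5, as in the table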
Q12 State any two differences between DFA and NFA. Also, design the following: (Total: 4 Marks)
a) A Deterministic Finite Automaton, with transition table, for the language of
strings ending with ’01’ over the input alphabet ∑ = {0, 1}.
b) A Non-Deterministic Finite Automaton, with transition table, for accepting
strings in which there are two 0’s separated by a number of positions that is a
multiple of 4, over the input alphabet ∑ = {0, 1}.

Solution:
Any two differences between DFA and NFA (for example): (0.5 Mark)
1. In a DFA, the next possible state is distinctly set; in an NFA, each pair of
state and input symbol can have many possible next states.
2. Every DFA is an NFA, but not every NFA is a DFA.
3. A DFA cannot use empty-string (ε) transitions; an NFA can.
4. A DFA allows only one move for a single input symbol; in an NFA there can be
a choice (more than one move) for a single input symbol.

a) Deterministic Finite Automaton with transition table for the language
accepting strings ending with ’01’ over the input alphabet ∑ = {0, 1}: (1 Mark)

Transition Table: (0.5 Mark)

Present state    Next state on input 0    Next state on input 1
→q0              q1                       q0
q1               q1                       q2
*q2              q1                       q0
b) Non-Deterministic Finite Automaton with transition table for accepting
strings in which two 0’s are separated by a number of positions that is a
multiple of 4, over the input alphabet ∑ = {0, 1}: (1.5 Marks)

Transition Table: (0.5 Mark)

Present state    Next state on input 0    Next state on input 1
→q0              q0, q1                   q0
q1               q2, q5                   q2
q2               q3                       q3
q3               q4                       q4
q4               q1                       q1
*q5              q5                       q5
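
A quick way to sanity-check the NFA table is to simulate it on a few strings. Below is a small Python subset-construction sketch (the transition dictionary is transcribed from the table above; function and variable names are illustrative):

    # Minimal simulation of the NFA above.
    NFA_DELTA = {
        ('q0', '0'): {'q0', 'q1'}, ('q0', '1'): {'q0'},
        ('q1', '0'): {'q2', 'q5'}, ('q1', '1'): {'q2'},
        ('q2', '0'): {'q3'},       ('q2', '1'): {'q3'},
        ('q3', '0'): {'q4'},       ('q3', '1'): {'q4'},
        ('q4', '0'): {'q1'},       ('q4', '1'): {'q1'},
        ('q5', '0'): {'q5'},       ('q5', '1'): {'q5'},
    }

    def nfa_accepts(s):
        current = {'q0'}                  # start state
        for ch in s:
            # Take every transition available from every current state.
            current = set().union(*(NFA_DELTA.get((q, ch), set())
                                    for q in current))
        return 'q5' in current            # q5 is the accepting state

    print(nfa_accepts('00'))      # True:  the two 0's are 0 symbols apart
    print(nfa_accepts('010'))     # False: 1 symbol apart
    print(nfa_accepts('011110'))  # True:  4 symbols apart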
Q13 Assemble the various steps involved in the Byte Pair Encoding algorithm for
tokenization. (Total: 3 Marks)
Apply Byte Pair Encoding tokenization to find the tokens generated for the test
string “lowest oldest newest”.
Consider a training corpus with the frequency of various words as follows:
{“old”: 7, “older”: 3, “finest”: 9, “lowest”: 4, “new”: 5, “newer”: 4}.

Solution:
Steps involved in the Byte Pair Encoding algorithm: (0.5 Mark)
Step 1: The BPE token learner begins with a vocabulary that is just the set of
all individual characters.
Step 2: It then examines the training corpus, chooses the two symbols that are
most frequently adjacent (say ‘A’, ‘B’), adds a new merged symbol ‘AB’ to the
vocabulary, and replaces every adjacent ‘A’ ‘B’ in the corpus with the new ‘AB’.
Step 3: It continues to count and merge, creating longer and longer character
strings, until k merges have been done, creating k novel tokens; k is thus a
parameter of the algorithm. The resulting vocabulary consists of the original
set of characters plus the k new symbols.
Constructing the vocabulary using the training corpus: (2 Marks)

Initial corpus (words split into characters, with end-of-word marker _):

7  o l d _
3  o l d e r _
9  f i n e s t _
4  l o w e s t _
5  n e w _
4  n e w e r _

Initial vocabulary: _, o, l, d, e, r, f, i, n, s, t, w

Merges performed, in order (the most frequent adjacent pair is merged at each
step, and each merge adds one new symbol to the vocabulary; only the corpus
entries changed by each merge are shown):

Step  Pair merged     New symbol   Corpus entries after the merge
1     (s, t)          st           9 f i n e st _    4 l o w e st _
2     (st, _)         st_          9 f i n e st_     4 l o w e st_
3     (e, st_)        est_         9 f i n est_      4 l o w est_
4     (o, l)          ol           7 ol d _          3 ol d e r _
5     (ol, d)         old          7 old _           3 old e r _
6     (n, e)          ne           5 ne w _          4 ne w e r _
7     (ne, w)         new          5 new _           4 new e r _
8     (e, r)          er           3 old er _        4 new er _
9     (er, _)         er_          3 old er_         4 new er_
10    (n, est_)       nest_        9 f i nest_
11    (i, nest_)      inest_       9 f inest_
12    (f, inest_)     finest_     9 finest_
13    (old, _)        old_         7 old_
14    (o, w)          ow           4 l ow est_
15    (l, ow)         low          4 low est_
16    (low, est_)     lowest_      4 lowest_
17    (new, er_)      newer_       4 newer_
18    (old, er_)      older_       3 older_

Final corpus:

7 old_    3 older_    9 finest_    4 lowest_    5 new_    4 newer_

Final vocabulary: _, o, l, d, e, r, f, i, n, s, t, w, st, st_, est_, ol, old,
ne, new, er, er_, nest_, inest_, finest_, old_, ow, low, lowest_, newer_, older_

Testing on “lowest oldest newest”: (0.5 Mark)

lowest is recognized as lowest_
oldest is recognized as old est_
newest is recognized as new est_
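
For illustration, a compact Python sketch of the BPE learner described above (helper names are illustrative; ties between equally frequent pairs are broken arbitrarily here, so the exact merge order may differ from the worked table):

    from collections import Counter

    def learn_bpe(word_freqs, num_merges):
        # Words start as character sequences with the end-of-word marker '_'.
        corpus = {tuple(w) + ('_',): f for w, f in word_freqs.items()}
        merges = []
        for _ in range(num_merges):
            # Count every adjacent symbol pair, weighted by word frequency.
            pairs = Counter()
            for symbols, freq in corpus.items():
                for a, b in zip(symbols, symbols[1:]):
                    pairs[(a, b)] += freq
            if not pairs:
                break
            best = max(pairs, key=pairs.get)   # ties broken arbitrarily
            merges.append(best)
            # Replace every adjacent occurrence of the best pair.
            new_corpus = {}
            for symbols, freq in corpus.items():
                out, i = [], 0
                while i < len(symbols):
                    if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                        out.append(symbols[i] + symbols[i + 1])
                        i += 2
                    else:
                        out.append(symbols[i])
                        i += 1
                new_corpus[tuple(out)] = freq
            corpus = new_corpus
        return merges

    merges = learn_bpe({"old": 7, "older": 3, "finest": 9,
                        "lowest": 4, "new": 5, "newer": 4}, 18)
    print(merges)   # merge order may differ from the table when counts tie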

Q14 List and demonstrate with examples the use of regular expression operators
for counting. (Total: 3 Marks)

Solution:

Listing the operators: (2 Marks); any example for each: (1 Mark)

1) Zero or more occurrences of any letter: [A-Za-z]*
2) One or more occurrences of any letter: [A-Za-z]+
3) Zero or one occurrence of any letter: [A-Za-z]?
4) Exactly 4 occurrences of any letter: [A-Za-z]{4}
5) 4 to 10 occurrences of any letter: [A-Za-z]{4,10}
6) At least 4 occurrences of any letter: [A-Za-z]{4,}
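
These operators can be checked directly, for example with Python’s re module (a small illustrative script):

    import re

    # Each pattern is anchored to the whole string with re.fullmatch,
    # so a non-None result means the counting operator accepts the input.
    print(bool(re.fullmatch(r'[A-Za-z]*', '')))              # True: zero or more
    print(bool(re.fullmatch(r'[A-Za-z]+', 'cat')))           # True: one or more
    print(bool(re.fullmatch(r'[A-Za-z]?', 'c')))             # True: zero or one
    print(bool(re.fullmatch(r'[A-Za-z]{4}', 'word')))        # True: exactly 4
    print(bool(re.fullmatch(r'[A-Za-z]{4,10}', 'letters')))  # True: 4 to 10
    print(bool(re.fullmatch(r'[A-Za-z]{4,}', 'abc')))        # False: fewer than 4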

Q15 Design a finite state transducer with an E-insertion orthographic rule for
parsing the plural form of the string “crash”. (Total: 3 Marks)

Solution:

Lexical level:       C R A S H +N +PL
Intermediate level:  C R A S H ^  S  #
Surface level:       C R A S H E  S

(0.5 Mark)

• A two-level morphology system is used for parsing or generating. (1.5 Marks)
• The lexicon transducer maps between the lexical level (stems and
morphological features) and an intermediate level (which represents the
concatenation of morphemes).
• A host of transducers, each representing a single spelling rule, run in
parallel; they map between the intermediate level and the surface level.
• Putting the spelling rules in parallel or in series (cascading) is a design
choice.
• The architecture is a two-level cascade of transducers: the output from one
transducer acts as input to the next.
• The cascade can be run top-down to generate a string, or bottom-up to parse it.
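
The effect of the E-insertion rule on the intermediate tape can be sketched with a regular-expression substitution (this imitates what the rule transducer does; it is not the transducer itself, and the function name is illustrative):

    import re

    # '^' marks the morpheme boundary and '#' the word boundary,
    # as in the intermediate level above.
    def apply_e_insertion(intermediate):
        # Insert 'e' when a stem ending in s, z, x, ch or sh is
        # followed by the plural suffix 's'.
        surface = re.sub(r'(s|z|x|ch|sh)\^s#', r'\1es', intermediate)
        # Remove remaining boundary symbols when the rule does not apply.
        return surface.replace('^', '').replace('#', '')

    print(apply_e_insertion('crash^s#'))  # crashes
    print(apply_e_insertion('fox^s#'))    # foxes
    print(apply_e_insertion('cat^s#'))    # cats (rule does not apply)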

Q16 Analyze the noisy channel model for spell checking using Bayesian
inference. (Total: 3 Marks)

Solution:
• In the noisy channel model, it is imagined that the surface form is actually
a “distorted” form of an original word that has passed through a noisy
channel. (1 Mark)
• Language is generated and passed through a noisy channel.
• This channel introduces “noise” in the form of substitutions or other changes
to the letters, making it hard to recognize the “true” word.
• Goal: to build a model of the channel. Given this model, we then find the
true word by passing every word of the language through the model of the noisy
channel and seeing which one comes closest to the misspelled word. (1 Mark)
• The decoder passes each hypothesis through a model of this channel and picks
the word that best matches the surface noisy word.

Using the Bayesian model in the noisy channel:

• This noisy channel model is a kind of Bayesian inference.
• We see an observation x (a misspelled word) and our job is to find the word w
that generated this misspelled word.
• Out of all possible words in the vocabulary V, we want to find the word w
such that P(w|x) is highest. (1 Mark)

The prior probability and the conditional (channel) probability can be used to
find the most likely original word given the actually observed signal.
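
Concretely, this is the standard noisy channel formulation via Bayes’ rule (notation matching the description above):

\hat{w} = \operatorname*{argmax}_{w \in V} P(w \mid x)
        = \operatorname*{argmax}_{w \in V} \frac{P(x \mid w)\, P(w)}{P(x)}
        = \operatorname*{argmax}_{w \in V} P(x \mid w)\, P(w)

where P(w) is the prior probability (language model), P(x | w) is the channel (error) model, and P(x) can be dropped because it is the same for every candidate word w.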

Q17 Describe the Porter stemmer algorithm. Apply the algorithm to stem the
words “Characterization” to “Characterize” and “Multidimensional” to
“Multidimension”. (Total: 3 Marks)

Solution:
• The Porter stemmer is a widely used stemming algorithm. Consonants and
vowels play an important role in this algorithm. (1 Mark)
• A list ccc... of length greater than 0 is denoted by C, and a list vvv... of
length greater than 0 is denoted by V.
• Any word, or part of a word, therefore has one of the four forms:
    CVCV ... C → collection, management
    CVCV ... V → conclude, revise
    VCVC ... C → entertainment, illumination
    VCVC ... V → illustrate, abundance
• These may all be represented by the single form
    [C]VCVC ... [V]
  where the square brackets denote arbitrary presence of their contents.
• Using (VC)^m to denote VC repeated m times, this may again be written as
    [C](VC)^m[V]
• m is called the measure of any word or word part when represented in this
form.
• The rules for removing a suffix are given in the form
    (condition) S1 -> S2
• This means that if a word ends with the suffix S1, and the stem before S1
satisfies the given condition, S1 is replaced by S2. The condition is usually
given in terms of m.

Stemming CHARACTERIZATION: (1 Mark)
• The stem has m > 0 (since m = 3 for CHARACTER) and the word ends with
“IZATION”.
• “IZATION” is replaced with “IZE”, because of the rule (m > 0) IZATION -> IZE.
• The new stem is CHARACTERIZE.

Stemming MULTIDIMENSIONAL: (1 Mark)
• The stem has m > 1 (since m = 5 for MULTIDIMENSION) and the word ends with
“AL”.
• “AL” is deleted (replaced with null).
• No further rule changes the stem.
• Finally, the output is MULTIDIMENSION.
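
The measure m and one rule application can be sketched in Python (a simplification that ignores Porter’s special handling of ‘y’; helper names are illustrative):

    import re

    def measure(stem):
        # Collapse vowel runs to V and the remaining runs to C, giving the
        # [C](VC)^m[V] form; m is the number of VC occurrences.
        cv = re.sub(r'[aeiou]+', 'V', stem.lower())
        cv = re.sub(r'[^V]+', 'C', cv)
        return cv.count('VC')

    def apply_ization_rule(word):
        # The rule (m > 0) IZATION -> IZE from the worked example above.
        if word.lower().endswith('ization'):
            stem = word[:-len('ization')]
            if measure(stem) > 0:
                return stem + 'ize'
        return word

    print(measure('character'))                    # 3
    print(measure('multidimension'))               # 5
    print(apply_ization_rule('characterization'))  # characterize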

Q18 Discuss the spelling errors that may occur during typing and in OCR.
Justify with examples wherever applicable.

Solution:
Spelling errors that occur while typing are characterized as: (2 Marks)
• Substitutions: mistyping “the” as “thw”
• Insertions: mistyping “the” as “ther”
• Deletions: mistyping “the” as “th”
• Transpositions: mistyping “the” as “hte”

OCR errors are grouped into five classes:
• Substitutions: misrecognizing “the” as “thw”
• Multi-substitutions: misrecognizing “computer” as “cemputur”
• Space insertions: misrecognizing “number” as “num ber”
• Space deletions: misrecognizing “the number” as “thenumber”
• Failures
