1. Efforts to develop NLP applications for Amharic (and other local languages) have not been very
successful so far. What do you think are the problems?

Ambiguity
- A word, term, phrase, or sentence could mean several possible things.
- Computer languages, by contrast, are designed to be unambiguous.

Variability
- There are many ways to express the same thing.
- Computer languages also have variability, but the equivalence of expressions can be detected automatically.

2. Discuss the pros and cons of using root forms versus surface forms of words for the
development of an electronic Amharic dictionary.

3. Explain why morphological analyzers and/or synthesizers are more important for the
development of high-level NLP applications for Amharic than for English.

Morphological analysis finds the minimal units of a word that carry linguistic information for further
processing, and it plays a critical role in the development of natural language processing (NLP)
applications.

Morphological analysis of highly inflected, morphologically rich languages like Amharic is a difficult
task because of the complexity of the morphology.

Since morphological analysis of English is comparatively easy, and since analyzed morphemes are the
basis for most NLP applications, morphological analyzers and/or synthesizers are more important for
Amharic than for English.

4. Compare and contrast the issues that you would consider in the development of Amharic and
English grammar checkers.

5. Discuss the use of an Amharic morphological analyzer in the development of the following NLP
applications:

a. Amharic spelling checker

In a spelling checker, each morpheme of a word has to be checked for correctness with respect to the
surrounding text. This requires knowledge of the language's morphology, which is exactly what
morphological analysis provides.
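As an illustrative sketch (not a real Amharic system), a morphology-aware checker can accept a word only if it decomposes into a known stem plus a known suffix; the stem and suffix lists below are invented placeholders, not an actual lexicon:

```python
# Hypothetical sketch of a morphology-based spell checker.
# STEMS and SUFFIXES are illustrative placeholders, not a real Amharic lexicon.
STEMS = {"met'haf", "bet"}        # placeholder stem forms
SUFFIXES = {"", "och", "u"}       # placeholder plural/definite suffixes

def is_well_formed(word):
    """Accept a word if it splits into a known stem + known suffix."""
    for i in range(len(word), 0, -1):
        stem, suffix = word[:i], word[i:]
        if stem in STEMS and suffix in SUFFIXES:
            return True
    return False

print(is_well_formed("met'hafoch"))  # True: stem "met'haf" + suffix "och"
print(is_well_formed("met'hafxx"))   # False: unknown suffix
```

A real checker would replace the lookup tables with a full morphological analyzer, but the control flow (segment, then validate each morpheme) is the same.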

b. Amharic information retrieval

Before retrieval, the text has to be morphologically processed so that inflected surface forms are
reduced to a common form for the index vocabulary.

c. Amharic-to-English machine translation

d. Latent semantic analysis

6. Consider the following Amharic sentence: "አዲሱ መጽሃፍ የእኔ አይደለም"

a. Write a context-free grammar that recognizes the sentence.

NP – Noun Phrase, VP – Verb Phrase, ADJ – Adjective, NAME – Noun, PRO – Pronoun, V – Verb

S → NP VP

NP → ADJ NP

NP → NAME

VP → PRO V

ADJ → አዲሱ

NAME → መጽሃፍ

PRO → የእኔ

V → አይደለም
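The grammar above can be checked mechanically. The following is a minimal recursive-descent recognizer written for exactly these rules, as a sketch rather than a general parser:

```python
# Minimal recursive-descent recognizer for the grammar:
#   S -> NP VP ; NP -> ADJ NP | NAME ; VP -> PRO V
LEXICON = {"አዲሱ": "ADJ", "መጽሃፍ": "NAME", "የእኔ": "PRO", "አይደለም": "V"}

def parse_np(tags, i):
    # NP -> ADJ NP | NAME
    if i < len(tags) and tags[i] == "ADJ":
        return parse_np(tags, i + 1)
    if i < len(tags) and tags[i] == "NAME":
        return i + 1
    return None

def parse_vp(tags, i):
    # VP -> PRO V
    if i + 1 < len(tags) and tags[i] == "PRO" and tags[i + 1] == "V":
        return i + 2
    return None

def recognizes(sentence):
    tags = [LEXICON.get(w) for w in sentence.split()]
    j = parse_np(tags, 0)          # S -> NP VP
    return j is not None and parse_vp(tags, j) == len(tags)

print(recognizes("አዲሱ መጽሃፍ የእኔ አይደለም"))  # True
```

This directly mirrors the top-down derivation in part (b): each function tries the rewrite rules for its nonterminal against the input.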

b. Parse the sentence with:

1. Top-down strategy

Parsing starts with the symbol S and then searches through different ways of rewriting the
symbols until the input sentence is generated.

S => NP VP

=>ADJ NP VP

=> አዲሱ NP VP

=> አዲሱ NAME VP

=> አዲሱ መጽሃፍ PRO V

=> አዲሱ መጽሃፍ የእኔ V

=> አዲሱ መጽሃፍ የእኔ አይደለም

2. Bottom-up strategy

Parsing starts with the words of the sentence and uses the production rules backward to reduce the
sequence of symbols until it consists solely of S.

አዲሱ መጽሃፍ የእኔ አይደለም

ADJ መጽሃፍ የእኔ አይደለም

ADJ NAME የእኔ አይደለም

ADJ NAME PRO አይደለም

ADJ NAME PRO V

ADJ NP PRO V

NP PRO V

NP VP

S
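The reduction steps above can be mechanized with a naive leftmost-reduction loop. The rule ordering here is tuned to this one sentence, so this is a sketch rather than a general shift-reduce parser:

```python
# Naive reducer mirroring the bottom-up steps above: repeatedly
# rewrite the leftmost matching right-hand side until nothing applies.
RULES = [                       # (right-hand side, left-hand side)
    (("አዲሱ",), "ADJ"), (("መጽሃፍ",), "NAME"),
    (("የእኔ",), "PRO"), (("አይደለም",), "V"),
    (("NAME",), "NP"), (("ADJ", "NP"), "NP"),
    (("PRO", "V"), "VP"), (("NP", "VP"), "S"),
]

def reduce_sentence(tokens):
    symbols = list(tokens)
    changed = True
    while changed:
        changed = False
        for rhs, lhs in RULES:
            for i in range(len(symbols) - len(rhs) + 1):
                if tuple(symbols[i:i + len(rhs)]) == rhs:
                    symbols[i:i + len(rhs)] = [lhs]   # apply rule backward
                    changed = True
                    break
            if changed:
                break
    return symbols

print(reduce_sentence("አዲሱ መጽሃፍ የእኔ አይደለም".split()))  # ['S']
```

A real bottom-up parser would manage ambiguity with a stack and backtracking (or a chart), but this greedy loop reproduces the exact reduction sequence shown above.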

c. Show the semantic representation of the sentence using:

1. First-order predicate calculus

የእኔ አይደለም (አዲሱ, መጽሃፍ)

2. λ-calculus, along with the parse tree

λx λy የእኔ አይደለም(y, x) (መጽሃፍ)(አዲሱ) → λy የእኔ አይደለም(y, መጽሃፍ) (አዲሱ)

→ የእኔ አይደለም(አዲሱ, መጽሃፍ)
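The β-reduction above can be mirrored with ordinary Python closures; the predicate name `not_mine` is a stand-in for "የእኔ አይደለም":

```python
# The two β-reduction steps, mirrored with Python lambdas.
# "not_mine" stands in for the predicate "የእኔ አይደለም".
sem = lambda x: lambda y: f"not_mine({y}, {x})"

step1 = sem("መጽሃፍ")     # λy. not_mine(y, መጽሃፍ)
result = step1("አዲሱ")    # not_mine(አዲሱ, መጽሃፍ)
print(result)
```

Each function application consumes one argument, exactly as each reduction step above consumes one λ.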

Explain the use of an Amharic WordNet in the development of the following Amharic NLP applications:

a. Word sense disambiguation

b. Information retrieval

c. English-to-Amharic machine translation

d. Semantic networks

7. Explain why anaphora resolution is more complex in Amharic than in English.

8. Sketch a two-level morphology (using a finite-state transducer) that can generate and/or parse
Amharic derivational and inflectional morphology for the following examples.

Answer:

In finite-state morphology, words are represented as a correspondence between a lexical level and a
surface level: the surface level is the concatenation of the letters that make up the actual spelling of
the word, while the lexical level is the concatenation of the morphemes making up the word. On this
basis, the two-level morphology is sketched below.

Example 1 (ጠማማ + plural → ጠማማዎች):

Lexical:  ጠ    ማ    ማ    +Adj   +PLU
Surface:  ጠ    ማ    ማ    ε      ዎች

States: S0 → S1 → S2 → S3 → S4 → S5 → S6

Example 2 (ሰባራ + plural/definite → ሰባራዎቹ):

Lexical:  ሰ    ባ    ራ    +Adj   +PLU
Surface:  ሰ    ባ    ራ    ε      ዎቹ

States: S0 → S1 → S2 → S3 → S4 → S5
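A rough procedural sketch of the generation direction of this transducer. The tag realizations (+Adj as ε, +PLU as ዎች) follow the first tape pair above; the function is illustrative, not a true two-level implementation:

```python
# Toy generation direction of the transducer sketched above:
# lexical "ጠ ማ ማ +Adj +PLU"  ->  surface "ጠማማዎች".
def generate_surface(lexical):
    """Map a space-separated lexical tape (stem symbols plus tags) to a surface form."""
    surface = []
    for symbol in lexical.split():
        if symbol == "+Adj":
            continue                 # +Adj realizes as zero (ε) on the surface
        elif symbol == "+PLU":
            surface.append("ዎች")     # plural marker, per the first example
        else:
            surface.append(symbol)   # stem characters copy through unchanged
    return "".join(surface)

print(generate_surface("ጠ ማ ማ +Adj +PLU"))  # ጠማማዎች
```

A real two-level system would run the same rules in both directions (generation and analysis) over paired lexical/surface tapes.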

Evaluation Method:

Assignment – 10%

Final Exam – 50%

Project – 40%. The Dr. will assign each group of two students one of his 13 topics.

Total Assignments: 2

Assignment I: Morphology Generation

List as many words as possible that can be generated from the root "¨<eÉ"

Some examples are:

¨cÅ ¨cÅ‹ ¨cÉG< ¨cÉ” ¨cÉI ¨cÉi ¨cÇ‹G<

Hint: try to follow a pattern; the more the merrier. A classmate has written around 850.

Assignment II: Sketching a finite state automaton

NLP Final Exam: Dr. Yaregal

Total Questions: 12    Total Points: 100

1. [6] NLP is more difficult than artificial language processing. Explain why.
2. [8] Using NLP, describe how an Amharic spelling checker can be developed.
3. [6] Explain (using examples) why NLP is more difficult in the Amharic language than in the
English language.
4. [10] Sketch a two-level morphology for the following examples: (Create just one diagram)
Root | Noun | Noun | Noun (plural + definite)

ÓMØ ÑLß ›eÑLß ›eÑLà‡

p`Ø q^ß ›eq^ß ›eq^à‡

U`Ø S^ß ›eS^ß ›eS^à‡

5. [6] Describe the four classes of the Chomsky hierarchy of grammars with their equivalent automata.
6. [20] "d K›e‚` SêHñ” cØ…M
a. Create a finite state automaton that recognizes the above Amharic text.
c. b. ???
c. i. Create the first order predicate calculus
ii. Write the lambda calculus along with a parse tree
d. Create the Vauquois Triangle; at each step, show how "d K›e‚` SêHñ” cØ…M is
translated to its English equivalent "Kassa had given the book to Aster".
e.
7. [6] Write the components of Information Retrieval that use NLP.
Answer: Information Retrieval (IR) provides a list of potentially relevant documents in response to a user's
query. IR is a process performed through the following components:

Text Database, Database Manager Module, User Interface, Text Operations, Query Operations, Searching,
Ranking, Indexing, Index, User, Text, Index File, Query . . .
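A toy end-to-end sketch of the indexing and searching components listed above; the "text operation" here is just lowercasing and splitting, standing in for real morphological normalization, and the documents are invented examples:

```python
# Minimal IR sketch: text operation -> indexing -> query operation -> searching.
docs = {1: "NLP helps Amharic retrieval", 2: "Amharic morphology is rich"}

index = {}                                          # term -> set of document ids
for doc_id, text in docs.items():
    for term in text.lower().split():               # text operation
        index.setdefault(term, set()).add(doc_id)   # indexing

def search(query):
    """Return ids of documents containing every query term (searching)."""
    terms = query.lower().split()                   # query operation
    results = [index.get(t, set()) for t in terms]
    return set.intersection(*results) if results else set()

print(sorted(search("Amharic")))             # [1, 2]
print(sorted(search("Amharic morphology")))  # [2]
```

Ranking (e.g. by term frequency) would then order these candidate documents before they are shown to the user.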

8. [8] List (with examples) five NLP applications that can be developed using an Amharic wordnet.
Answer:
1. Machine Translation (MT) – the translation of texts from one natural language to another by
means of a computerized system.
2. Information retrieval and extraction
3. Word Sense Disambiguation
4. Language teaching and translation applications
5. Audio and video retrieval and Translation

9. [8] Write the Noisy Channel Equation for a statistical machine translation and speech recognition and
discuss the analogy between their components.
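As a sketch of the answer: under the noisy-channel view, statistical MT chooses the English sentence ê for a foreign sentence f, and speech recognition chooses the word sequence Ŵ for an acoustic observation O:

```latex
\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} \underbrace{P(f \mid e)}_{\text{translation model}}\,\underbrace{P(e)}_{\text{language model}}

\hat{W} = \arg\max_{W} P(W \mid O) = \arg\max_{W} \underbrace{P(O \mid W)}_{\text{acoustic model}}\,\underbrace{P(W)}_{\text{language model}}
```

The analogy: both apply Bayes' rule and drop the constant denominator (P(f) or P(O)), factoring the posterior into a channel model (translation model vs. acoustic model) and a source language model.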

10. [6] Describe (with pros and cons) the top-down and bottom-up approaches of text summarization
implementations.
Answer: There are two computational approaches to Text Summarization:

♦ Top-Down Text Summarization


♦ Bottom-Up Text Summarization
a) The top-down approach is query-driven summarization; it can be implemented using Information
Extraction (IE). The user needs only certain types of information, so the system needs a particular
criterion of interest, used to focus the search.

Pros: higher quality; supports abstracting; suits a user who wants only a small, specific amount of
information.

Cons: lower speed; still needs to scale up to robust open-domain summarization; IE works only for very
particular templates.

b) The bottom-up approach is text-driven summarization: the user wants whatever information is
important, and the system applies strategies (importance metrics) over a representation of the whole
text. It can be implemented using Information Retrieval (IR).

Pros: robust; good for query-oriented summaries.

Cons: lower quality; unable to manipulate information at abstract levels.
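As a sketch of the bottom-up (text-driven) idea, an extractive summarizer can use raw word frequency as its importance metric; the sentences below are invented examples:

```python
# Tiny bottom-up extractive sketch: score sentences by summed word
# frequency over the whole text and keep the top n. Illustrative only.
from collections import Counter

def summarize(sentences, n=1):
    words = [w.lower() for s in sentences for w in s.split()]
    freq = Counter(words)                        # importance metric
    def score(s):
        return sum(freq[w.lower()] for w in s.split())
    return sorted(sentences, key=score, reverse=True)[:n]

sents = ["NLP analyzes text", "NLP summarizes text automatically", "Dogs bark"]
print(summarize(sents))  # ['NLP summarizes text automatically']
```

This illustrates the trade-off in the cons above: the method is robust but purely surface-level, so it cannot manipulate information at an abstract level the way an IE-based top-down system can.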

11. [10] Describe how a language model (N-grams) can be used in:


a. Spelling correction
b. Parsing
e. OCR
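For (a), here is a sketch of how an N-gram model ranks candidate corrections; the toy corpus and the candidate list are invented for illustration:

```python
# Toy bigram language model for spelling correction: pick the
# candidate that the model scores higher in its left context.
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_prob(w1, w2):
    """Maximum-likelihood estimate P(w2 | w1)."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

# Misspelling "cta" after "the": candidates "cat" or "cut".
best = max(["cat", "cut"], key=lambda w: bigram_prob("the", w))
print(best)  # cat
```

The same idea carries over to parsing and OCR: among several hypotheses (parse trees, character sequences), the language model prefers the one with the higher probability under the N-gram counts.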

12. [6] Explain why:


Optical Character Recognition (OCR) is classified into three types: machine-printed OCR, offline
handwritten OCR, and online handwritten OCR.
a. Machine printed OCR is easier than handwritten OCR
Machine-printed OCR is easier because template matching is effective for recognizing standardized
machine-printed characters. Handwriting recognition (or truly general-purpose machine-printed
recognition) is much harder, since template matching would require a stored template for every variant,
and handwriting systems usually have difficulty segmenting unconstrained text into individual characters.

b. Online handwritten OCR is easier than offline handwritten OCR


Online handwritten OCR can use the directional features generated by pen-tip movements, which makes it
easier than offline handwritten OCR. In the offline case, the text is digitized with a scanner and stored as
an image or PDF, which requires additional processing.

BONUS: [5] Write (BRIEFLY) the most important things you have learnt in this course.

I learnt that NLP is the computerized approach to analyzing text, based on both a set of theories and a set of
technologies. But analyzing text is not an easy task because of two constraints. Ambiguity: a word, term, phrase,
or sentence could mean several possible things; for example, the word "ገና" can be a time reference ("yet/still") or
can mean the Christian holiday (Christmas). Variability: there are many ways to express the same thing; for
example, መምህር and አስተማሪ refer to the same entity (teacher).

I learnt the definition of NLP as an interdisciplinary field (Computer Science, Electrical Engineering, Linguistics,
psycholinguistics …) of study dealing with computational techniques (models and algorithms) for analyzing and
representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-
like language processing for a range of tasks or applications. We can think of Natural Language Processing (NLP) as
the sum of Natural Language Generation (NLG) and Natural Language Understanding (NLU).

NLP is applied in different key areas and works cooperatively with many fields, from early applications such as
MT (Machine Translation) to IR & IE (Information Retrieval and Information Extraction), Deep Learning, WSD
(Word Sense Disambiguation), and so on.

I believe I will do my thesis in this area because it is a highly researchable area; especially, we Ethiopians should
represent and use the concepts of NLP better than what exists today.

13. [6] NLP is more difficult than artificial language processing. Explain why.
Computers are an order of magnitude faster, and for artificial languages a well-designed specification
exists. Natural languages, however, have ambiguity and variability.
NLP is hard because of:
• Ambiguity
- A word, term, phrase, or sentence could mean several possible things.
- Computer languages are designed to be unambiguous.
• Variability
- There are many ways to express the same thing.
- Computer languages also have variability, but the equivalence of expressions can be detected
automatically.

14. [8] Using NLP, describe how an Amharic spelling checker can be developed.
15. [6] Explain (using examples) why NLP is more difficult in the Amharic language than in the English language.
There are three types of morphological structure: isolating, agglutinative, and inflectional.

In languages with an isolating morphological structure, each morpheme tends to stand on its own as a
word in the sentence, so there is little or no change in word morphology; such languages (English, to a
large degree) do not require extensive morphological analysis. In languages with an agglutinative
structure, words are built by combining several morphemes, but the morphemes are easily separable,
so again no extensive morphological analyzer is needed. In languages with an inflectional structure,
such as Amharic, morphemes are fused together and a complex morphological analyzer is required to
separate them; morphemes may be fused in several ways, such as affixation and doubling of all or part
of a word (reduplication).

16. [10] Sketch a two-level morphology for the following examples: (Create just one diagram)
Root | Noun | Noun | Noun (plural + definite)

ÓMØ ÑLß ›eÑLß ›eÑLà‡

p`Ø q^ß ›eq^ß ›eq^à‡

U`Ø S^ß ›eS^ß ›eS^à‡

17. [6] Describe the four classes of the Chomsky hierarchy of grammars with their equivalent automata.
Type 0 (unrestricted)
Production rule: there is no restriction on the production rules; any string containing at least one
nonterminal may appear on the left-hand side. Equivalent automaton: Turing machine.
Example production rules:
S → SS
S → ABC
AB → BA
BA → AB
AC → CA
CA → AC
BC → CB
CB → BC
A → a
B → b
C → c
S → ε

17
Type 1 (context-sensitive)
Production rule: α₁Bα₂ → α₁γα₂, where B is a single nonterminal on the left-hand side, α₁ and α₂ are
(possibly empty) strings of terminal and nonterminal symbols forming the left and right context, and γ is
a nonempty string. S → ε is allowed only if S does not appear on any right-hand side. Equivalent
automaton: linear bounded automaton.
This grammar class can express subject-verb agreement, as in "the student comes" and "the students come".
S → NP VP [S=Sentence, NP= Noun Phrase, VP= Verb Phrase]
NP → Det Nsing [Det= Determiner, Nsing= Noun (singular)]
NP → Det Nplur [Nplur= Noun (plural)]
Nsing VP → Nsing Vsing [Vsing= Verb (singular)]
Nplur VP → Nplur Vplur [Vplur= Verb (plural)]
Det → the
Nsing → student
Nplur → students
Vsing → comes
Vplur → come
Note: Context-Sensitive Languages/Grammars are subsets of Unrestricted Languages/ Grammars.
Type 2 (context-free)
Production rule: exactly one nonterminal on the left-hand side, but anything on the right-hand side.
Equivalent automaton: pushdown automaton.
These rules are used to describe grammars of natural languages that are context-free. For example, past
tenses in English are context-free with respect to the subject: both "the students came" and "the student
came" are grammatically correct. The following production rules represent such a context-free grammar.

S → NP VP [S=Sentence, NP= Noun Phrase, VP= Verb Phrase]


NP → Det N [Det= Determiner, N= Noun]
VP→ V [V= Verb]
Det → the
N→ student
N→ students
V→ came.

Context-Free Grammars are important since they are:


• Restricted enough to build efficient parsers
• Powerful enough to describe the syntax of most programming languages
Note: Context-Free Languages/Grammars are subsets of Context-Sensitive Languages/Grammars.
Type 3 (regular)

Production rule: exactly one nonterminal on the left-hand side, and one terminal plus at most one
nonterminal on the right-hand side. Equivalent automaton: finite automaton.
Examples:
A → aB Right Regular Grammar
A → Ba Left Regular Grammar
A →a
Note: Regular Languages/Grammars are subsets of Context-Free Languages/Grammars.
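The correspondence between right-regular rules and finite automata can be made concrete. For the illustrative grammar S → aS | b (the language a*b, invented here as an example), nonterminals become states and each terminal becomes a transition:

```python
# FSA for the right-regular grammar S -> aS | b (language a*b):
# the nonterminal S is a state, and each rule is a labeled transition.
def accepts(string):
    state = "S"
    for ch in string:
        if state == "S" and ch == "a":
            state = "S"            # rule S -> aS: consume 'a', stay in S
        elif state == "S" and ch == "b":
            state = "F"            # rule S -> b: consume 'b', reach final state
        else:
            return False           # no transition: reject
    return state == "F"

print(accepts("aaab"))  # True
print(accepts("aba"))   # False
```

This is why regular grammars are exactly the languages finite automata can recognize: the derivation always carries at most one nonterminal, which plays the role of the current state.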

18. [20] "d K›e‚` SêHñ” cØ…M


a. Create a finite state automaton that recognizes the above Amharic text.
b. ???
c. i. Create the first order predicate calculus (FOPC)
ii. Write the lambda calculus along with a parse tree
d. Create the Vauquois Triangle; at each step, show how "d K›e‚` SêHñ” cØ…M is translated to its English
equivalent "Kassa had given the book to Aster".
e.
19. [6] Write the components of Information Retrieval that use NLP.
20. [8] List (with examples) five NLP applications that can be developed using an Amharic wordnet.
21. [8] Write the Noisy Channel Equation for a statistical machine translation and speech recognition and
discuss the analogy between their components.
22. [6] Describe (with pros and cons) the top-down and bottom-up approaches of text summarization
implementations.
23. [10] Describe how a language model can be used in (N-Grams):
a. Spelling correction c. Parsing e. OCR
b. ??? d. ???
24. [6] Explain why:
a. Machine printed OCR is easier than handwritten OCR
b. Online handwritten OCR is easier than offline handwritten OCR
BONUS: [5] Write (BRIEFLY) the most important things you have learnt in this course.

Exam 2017/18
1. Discuss the difficulties of developing an Amharic grammar checker.
Natural Language Understanding (NLU) –
A full NLU system would be able to:
• paraphrase an input text
• translate the text into another language
• answer questions about the contents of the text
• draw inferences from the text.
Importance of NLP

• Natural language is the preferred medium of communication for people.


• Computers can do useful things for us if we can communicate with them in natural language.
• NLP bridges the communication gap between people and computers.
Key contributors in this field of NLP include:

- Noam Chomsky, in his work on generative grammars


- Claude Shannon, in his work on applied probabilistic models to automata for language.
- John Backus and Peter Naur, in their work on context-free grammars for programming languages.

Semantics
Semantics deals with the meaning of words, phrases and sentences.
Lexical semantics – the meanings of the component words
Compositional semantics – how components combine to form larger meanings
Discourse – Discourse level deals with the properties of the text as a whole that convey meaning by making
connections between component sentences.
Several methods are used in discourse processing, two of the most common being:

20
-Anaphora resolution replacing words such as pronouns, which are semantically vacant, with the appropriate entity
to which they refer; and
-Discourse/text structure recognition– determining the functions of sentences in the text (which adds to the
meaningful representation of the text)

Approaches to NLP
Rule-based Approach
Rule-based systems are based on explicit representation of facts about language
through well-understood knowledge representation schemes and associated
algorithms.
Rule-based systems usually consist of a set of rules, an inference engine, and a
workspace or working memory.
Knowledge is represented as facts or rules in the rule-based approach.
The inference engine repeatedly selects a rule whose condition is satisfied and
executes the rule.
The primary source of evidence in rule-based systems comes from human-developed
rules (e.g. grammatical rules) and lexicons.
Rule-based approaches have been used in tasks such as information extraction, text
categorization, ambiguity resolution, and so on.
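The rules/inference-engine/working-memory loop described above can be sketched in a few lines; the facts and rules here are invented morphological examples:

```python
# Minimal forward-chaining sketch of a rule-based system:
# facts live in working memory; the inference engine repeatedly fires
# any rule whose conditions are satisfied until nothing new is derived.
facts = {"word_ends_with_och"}              # working memory
rules = [                                   # (conditions, conclusion)
    ({"word_ends_with_och"}, "word_is_plural"),
    ({"word_is_plural"}, "use_plural_agreement"),
]

changed = True
while changed:                              # the inference engine loop
    changed = False
    for conditions, conclusion in rules:
        if conditions <= facts and conclusion not in facts:
            facts.add(conclusion)           # execute (fire) the rule
            changed = True

print(sorted(facts))
```

Real systems add conflict resolution (choosing among several applicable rules) and far richer condition matching, but the select-and-execute cycle is the same.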

Statistical Approach
Statistical approaches employ various mathematical techniques and often use large
text corpora to develop approximate generalized models of linguistic phenomena
based on actual examples of these phenomena provided by the text corpora without
adding significant linguistic or world knowledge.
The primary source of evidence in statistical systems comes from observable data (e.g.
large text corpora).
Statistical approaches have typically been used in tasks such as speech recognition,
parsing, and part-of-speech tagging.

Connectionist Approach
A connectionist model is a network of interconnected simple processing units with
knowledge stored in the weights of the connections between units.
Similar to the statistical approaches, connectionist approaches also develop
generalized models from examples of linguistic phenomena.
What separates connectionism from other statistical methods is that connectionist
models combine statistical learning with various theories of representation.
In addition, in connectionist systems, linguistic models are harder to observe due to the
fact that connectionist architectures are less constrained than statistical ones.
Connectionist approaches have been used in tasks such as word-sense disambiguation,
language generation, syntactic parsing, limited-domain translation tasks, and so on.

