1. Efforts to develop NLP applications for Amharic (and other local languages) have not been very
successful so far. What do you think are the problems?
• Ambiguity
-A word, term, phrase or sentence could mean several possible things.
-Computer languages are designed to be unambiguous.
• Variability
-Lots of ways to express the same thing.
-Computer languages also have variability, but the equivalence of expressions can be detected automatically.
2. Discuss the pros and cons of taking root forms versus surface forms of words for the
development of an electronic Amharic dictionary.
3. Explain why morphological analyzers and/or synthesizers are more important for the
development of high-level NLP applications for Amharic than for English.
Morphological analysis helps to find the minimal units of a word that hold linguistic information for further
processing, and it plays a critical role in the development of natural language processing (NLP) applications.
Morphological analysis of highly inflected and morphologically rich languages like Amharic is a difficult task
because of the complexity of the morphology.
Since morphological analysis of English is relatively easy, and since analyzed morphemes are the
basis for the development of most NLP applications, morphological analyzers and/or synthesizers are
more important for Amharic than for English.
4. Compare and contrast the issues that you would consider in the development of Amharic and
English grammar checkers.
5. Discuss the use of an Amharic morphological analyzer in the development of the following NLP
applications.
In a spell checker application, each morpheme of a word has to be checked for correctness with respect to
the textual context. This requires knowledge of the morphology of the language, which can be obtained
through morphological analysis.
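The morpheme-checking idea above can be sketched as follows. The prefix/root/suffix lists are invented toy romanizations for illustration, not a real Amharic lexicon:

```python
# A minimal sketch of a morpheme-based spell checker.
# The prefix/root/suffix lists are toy examples, not a real Amharic lexicon.

PREFIXES = {"", "be", "le"}        # hypothetical prefixes
ROOTS = {"bet", "sew", "hager"}    # hypothetical roots
SUFFIXES = {"", "och", "u"}        # hypothetical suffixes

def is_well_formed(word):
    """A word is accepted if it splits into a known prefix + root + suffix."""
    for i in range(len(word) + 1):
        for j in range(i, len(word) + 1):
            if (word[:i] in PREFIXES and
                    word[i:j] in ROOTS and
                    word[j:] in SUFFIXES):
                return True
    return False

def check_spelling(words):
    """Return the words that have no valid morphological segmentation."""
    return [w for w in words if not is_well_formed(w)]
```

A real system would replace the toy lists with a full morphological analyzer, but the principle is the same: a word is flagged when no segmentation into known morphemes exists.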
In information retrieval, before retrieving information the text has to be processed to generate a semantic vocabulary.
d. latent semantic analysis
NP – Noun Phrase, VP – Verb Phrase, ADJ – Adjective, NAME – Name (Noun), PRO – Pronoun, V – Verb
S → NP VP
NP → ADJ NP
NP → NAME
VP → PRO V
ADJ → አዲሱ
NAME → መጽሃፍ
PRO → የእኔ
V → አይደለም
1. Top-down strategy
Parsing starts with the symbol S and then searches through different ways of rewriting the
symbols until the input sentence is generated.
S => NP VP
=> ADJ NP VP
=> አዲሱ NP VP
=> አዲሱ NAME VP
=> አዲሱ መጽሃፍ VP
=> አዲሱ መጽሃፍ PRO V
=> አዲሱ መጽሃፍ የእኔ V
=> አዲሱ መጽሃፍ የእኔ አይደለም
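The top-down search described above can be sketched as a breadth-first search over sentential forms, using the grammar and the Amharic terminals from the notes:

```python
# A sketch of top-down parsing: breadth-first search over sentential forms,
# expanding the leftmost non-terminal at each step.

from collections import deque

RULES = {
    "S": [["NP", "VP"]],
    "NP": [["ADJ", "NP"], ["NAME"]],
    "VP": [["PRO", "V"]],
    "ADJ": [["አዲሱ"]],
    "NAME": [["መጽሃፍ"]],
    "PRO": [["የእኔ"]],
    "V": [["አይደለም"]],
}

def parse_top_down(sentence, max_steps=10000):
    """Return True if the sentence (list of words) can be derived from S."""
    queue = deque([["S"]])
    steps = 0
    while queue and steps < max_steps:
        steps += 1
        form = queue.popleft()
        if form == sentence:
            return True
        if len(form) > len(sentence):
            continue  # no rule in this grammar shrinks a sentential form
        # expand the leftmost non-terminal
        for i, sym in enumerate(form):
            if sym in RULES:
                for rhs in RULES[sym]:
                    queue.append(form[:i] + rhs + form[i + 1:])
                break
    return False
```

This is a naive search for illustration; practical parsers (e.g. chart parsers) avoid re-deriving the same forms.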
2. Bottom-up strategy
Parsing starts with the words in the sentence and uses the production rules backward to reduce
the sequence of symbols until it consists solely of S.
አዲሱ መጽሃፍ የእኔ አይደለም
=> ADJ NAME PRO V
=> ADJ NP PRO V
=> NP PRO V
=> NP VP
=> S
→ የእኔ አይደለም(አዲሱ, መጽሃፍ)
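The same grammar can be parsed bottom-up by exhaustively trying reductions; this is a naive search for illustration, not an efficient shift-reduce parser:

```python
# A sketch of naive bottom-up parsing: repeatedly replace an occurrence of a
# rule's right-hand side in the sequence with its left-hand side, until S.

RULES = [
    ("S", ["NP", "VP"]),
    ("NP", ["ADJ", "NP"]),
    ("NP", ["NAME"]),
    ("VP", ["PRO", "V"]),
    ("ADJ", ["አዲሱ"]),
    ("NAME", ["መጽሃፍ"]),
    ("PRO", ["የእኔ"]),
    ("V", ["አይደለም"]),
]

def parse_bottom_up(sentence):
    """Return the sequence of reduction steps ending in ['S'], or None."""
    def search(form, trace):
        if form == ["S"]:
            return trace
        for lhs, rhs in RULES:
            n = len(rhs)
            for i in range(len(form) - n + 1):
                if form[i:i + n] == rhs:
                    reduced = form[:i] + [lhs] + form[i + n:]
                    result = search(reduced, trace + [reduced])
                    if result is not None:
                        return result
        return None

    tokens = sentence.split()
    return search(tokens, [tokens])
```

The returned trace reproduces the reduction sequence shown above, one step per list.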
Explain the use of an Amharic WordNet in the development of the following Amharic NLP applications.
b. information retrieval
d. semantic networks
8. Sketch a two-level morphology (using a finite state transducer) that helps generate and/or parse
Amharic derivational and inflectional morphology for the following examples.
Answer:
In finite-state morphology, words are represented as a correspondence between the lexical level and the
surface level, where the surface level represents the concatenation of letters that make up the actual
spelling of the word, and the lexical level represents the concatenation of morphemes making up the word.
Based on this, the two-level morphology is sketched below.
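As a rough illustration of the lexical-to-surface correspondence (not a full two-level transducer with contextual rules), the sketch below maps lexical symbols to surface strings; the romanized forms and morpheme tags are invented for illustration:

```python
# A sketch of a two-level (lexical:surface) correspondence as a set of
# symbol pairs. The romanized forms and tags are hypothetical examples.

# lexical:  b e t +PL      (root 'bet' plus a plural morpheme)
# surface:  b e t o c h

PAIRS = {
    ("b", "b"), ("e", "e"), ("t", "t"),
    ("+PL", "och"),   # the plural morpheme is realized as 'och'
    ("+DEF", "u"),    # the definite morpheme is realized as 'u'
}

def generate(lexical_symbols):
    """Map a lexical string (list of symbols) to its surface form."""
    surface = []
    for sym in lexical_symbols:
        matches = [s for (l, s) in PAIRS if l == sym]
        if not matches:
            raise ValueError(f"no transition for {sym!r}")
        surface.append(matches[0])
    return "".join(surface)
```

A real two-level analyzer would also encode contextual spelling rules (e.g. symbol changes at morpheme boundaries) as transducer states; here every pair applies unconditionally.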
Evaluation Method:
Assignment – 10%
Project – 40% – The Dr. will assign one of his 13 topics to each group of two students.
Total Assignments: 2
List as many words as possible that can be generated from the root “¨<eÉ”
Hint: Try to follow a pattern; the more the merrier; a classmate has written around 850;
NLP Final Exam: Dr. Yaregal
1. [6] NLP is more difficult than artificial language processing. Explain why.
2. [8] Using NLP, describe how an Amharic spelling checker can be developed.
3. [6] Write (using examples) why NLP is more difficult in Amharic than in English.
4. [10] Sketch a two-level morphology for the following examples: (Create just one diagram)
Root    Noun    Noun    Noun (plural + definite)
5. [6] Describe the four classes of Chomsky hierarchy of grammars with their equivalent automata.
6. [20] "d K›e‚` SêHñ” cØ…M
a. Create a finite state automaton that recognizes the above Amharic text.
b. ???m
c. i. Create the first order predicate calculus
ii. Write the lambda calculus along with a parse tree
d. Create the Vauquois Triangle; at each step show how "d K›e‚` SêHñ” cØ…M is
translated to its English equivalent “Kassa had given the book to Aster”.
e.
7. [6] Write the components of Information Retrieval that use NLP.
Answer – Information Retrieval (IR) provides a list of potentially relevant documents in response to a user's query.
Information retrieval is a process performed through the following components.
Text Database, Database Manger Module, User Interface, Text Operation, Query Operation, Searching, Ranking,
Indexing, Index, User, Text, Index file, Query . . .
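The indexing, searching and ranking components can be sketched with a toy inverted index; simple term-overlap counting stands in here for a real ranking function:

```python
# A minimal sketch of IR components: text operation (tokenization),
# indexing (inverted index), query operation, searching, and ranking.

from collections import defaultdict

def build_index(docs):
    """Indexing: map each term to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():   # text operation: tokenization
            index[term].add(doc_id)
    return index

def search(index, query):
    """Searching + ranking: score documents by number of matching terms."""
    scores = defaultdict(int)
    for term in query.lower().split():      # query operation
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return sorted(scores, key=scores.get, reverse=True)
```

Real systems replace the overlap count with weighted ranking (e.g. tf-idf), but the component structure is the same.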
8. [8] List (with examples) five NLP applications that can be developed using an Amharic wordnet.
Answer:
1. Machine Translation (MT) - refers to a translation of texts from one natural language to another by
means of a computerized system.
2. Information retrieval and extraction
3. Word Sense Disambiguation
4. Language teaching and translation applications
5. Audio and video retrieval and Translation
9. [8] Write the Noisy Channel Equation for a statistical machine translation and speech recognition and
discuss the analogy between their components.
10. [6] Describe (with pros and cons) the top-down and bottom-up approaches of text summarization
implementations.
Answer: There are two computational approaches to text summarization:
a) Top-down approach is considered query-driven summarization: the user wants only a small, specific piece of
information. It can be implemented using Information Extraction (IE).
Pros: higher quality, and supports abstracting.
Cons: low speed, and it still needs to scale up to robust open-domain summarization; IE works only for very
particular templates.
b) Bottom-up approach is considered text-driven summarization: the user wants any information that is
important. The system uses strategies (importance metrics) over a representation of the whole text. The
bottom-up approach can be implemented using Information Retrieval (IR).
BONUS: [5] Write (BRIEFLY) the most important things you have learnt in this course.
I learnt that NLP is the computerized approach to analyzing text that is based on both a set of theories and a set of
technologies. But it is not an easy task to analyze texts because of two constraints. Ambiguity: a word, term, phrase,
or sentence could mean several possible things; for example, the word “ገና” can mean a time reference (“yet/still”) or
the Christian holiday (Ethiopian Christmas). Variability: there are lots of ways to express the same thing; for example,
መምህር and አስተማሪ refer to the same entity.
I learnt the definition of NLP as an interdisciplinary field (computer science, electrical engineering, linguistics,
psycholinguistics, …) of study dealing with computational techniques (models and algorithms) for analyzing and
representing naturally occurring texts at one or more levels of linguistic analysis, for the purpose of achieving human-
like language processing for a range of tasks or applications. We can think of Natural Language Processing (NLP) as
the sum of Natural Language Generation (NLG) and Natural Language Understanding (NLU).
NLP is implemented in different key areas and works cooperatively with many fields, beginning from early times, such
as Machine Translation (MT), Information Retrieval and Information Extraction (IR & IE), Deep Learning, Word Sense
Disambiguation (WSD) and so on.
I believe I will do my thesis in this area because it is a highly researchable area; especially, we Ethiopians should
represent and use the concepts of NLP better than what exists today.
13. [6] NLP is more difficult than artificial language processing. Explain why.
For an artificial language, a well-designed, unambiguous specification exists, so processing it is straightforward.
Natural languages, however, have ambiguity and variability.
NLP is hard because of:
• Ambiguity
-A word, term, phrase or sentence could mean several possible things.
-Computer languages are designed to be unambiguous.
• Variability
-Lots of ways to express the same thing.
-Computer languages also have variability, but the equivalence of expressions can be
detected automatically.
14. [8] Using NLP, describe how an Amharic spelling checker can be developed.
15. [6] Write (using examples) why NLP is more difficult in Amharic than in English.
Normally there are three types of morphological structure: isolating, agglutinative and inflectional.
Languages with an isolating morphological structure have morphemes that each stand as a word in a
sentence; in these languages there is little or no change in the morphology of a word, so such languages
(like English) do not require extensive study of morphological analysis. Languages with an agglutinative
morphological structure have words that are combined from different morphemes, but the morphemes are
easily separable, so an extensive morphological analyzer is not needed to analyze them. In languages with
an inflectional morphological structure, morphemes are fused together and require a complex morphological
analyzer to separate them. Morphemes may be fused together in several ways, such as affixation and
doubling of all or part of a word.
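The claim that agglutinative morphemes are easily separable can be illustrated with a classic agglutinative example, Turkish evlerimde ("in my houses" = ev + ler + im + de); the suffix list below is a toy, not a real analyzer:

```python
# A sketch of why agglutinative morphemes are easy to separate: greedy
# suffix stripping with a small toy suffix list (Turkish-style example).

SUFFIXES = ["de", "im", "ler"]   # locative, possessive, plural (toy list)

def segment(word, roots=frozenset({"ev"})):
    """Strip known suffixes from the right until a known root remains."""
    morphemes = []
    while word not in roots:
        for suf in SUFFIXES:
            if word.endswith(suf):
                morphemes.insert(0, suf)
                word = word[: -len(suf)]
                break
        else:
            return None   # no segmentation found
    return [word] + morphemes
```

For an inflectional (fusional) language, no such simple stripping works, because a single fused affix may encode several grammatical categories at once and may alter the root itself.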
16. [10] Sketch a two-level morphology for the following examples: (Create just one diagram)
Root Noun Noun Noun (plural + definite)
17. [6] Describe the four classes of Chomsky hierarchy of grammars with their equivalent automata.
Type 0 (unrestricted)
Production rule:
There is no limitation on the production rules; the left-hand side can be any string of symbols
containing at least one non-terminal.
Example (a grammar generating all strings with equal numbers of a's, b's and c's, in any order):
S → SS
S → ABC
AB → BA
BA → AB
AC → CA
CA → AC
BC → CB
CB → BC
A → a
B → b
C → c
S → ε
Type 1 (context-sensitive)
Production rule:
α1 B α2 → α1 β α2
There is exactly one non-terminal symbol B on the left-hand side that is rewritten; α1 and α2 are (possibly empty)
strings of terminal and non-terminal symbols forming the left and right context, and β is a non-empty string.
S → ε is allowed only if S does not appear on the right-hand side of any rule.
This is used, for example, for subject–verb agreement: the student comes vs. the students come.
S → NP VP [S=Sentence, NP= Noun Phrase, VP= Verb Phrase]
NP → Det Nsing [Det= Determiner, Nsing= Noun (singular)]
NP → Det Nplur [Nplur= Noun (plural)]
Nsing VP → Nsing Vsing [Vsing= Verb (singular)]
Nplur VP → Nplur Vplur [Vplur= Verb (plural)]
Det → the
Nsing → student
Nplur → students
Vsing → comes
Vplur → come
Note: Context-Sensitive Languages/Grammars are subsets of Unrestricted Languages/ Grammars.
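The agreement rules above can be checked mechanically by treating them as string rewriting over symbol sequences; the rule order below is one derivation of the students come:

```python
# A sketch applying the context-sensitive rules above as string rewriting
# over symbol sequences, deriving "the students come".

RULES = [
    (["S"], ["NP", "VP"]),
    (["NP"], ["Det", "Nplur"]),
    (["Nplur", "VP"], ["Nplur", "Vplur"]),   # agreement: VP rewritten in context
    (["Det"], ["the"]),
    (["Nplur"], ["students"]),
    (["Vplur"], ["come"]),
]

def rewrite(form, lhs, rhs):
    """Replace the first occurrence of lhs in form with rhs."""
    for i in range(len(form) - len(lhs) + 1):
        if form[i:i + len(lhs)] == lhs:
            return form[:i] + rhs + form[i + len(lhs):]
    raise ValueError("left-hand side not found")

form = ["S"]
for lhs, rhs in RULES:
    form = rewrite(form, lhs, rhs)
# form is now ["the", "students", "come"]
```

Note how the third rule rewrites VP only in the context of Nplur; the singular derivation would use the Nsing rules instead.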
Type 2 (context-free grammar)
Production rule:
Exactly one non-terminal on the left-hand side, but anything on the right-hand side.
These rules are used to describe grammars of natural languages that are context-free. For
example, past tenses of English are context-free with respect to the subject. Thus, it is
grammatically correct to construct both sentences: the students came and the student came. The
following production rules can be used to represent such a context-free grammar.
S → NP VP
NP → Det N
VP → V
Det → the
N → student
N → students
V → came
Type 3 (regular grammar)
Production rule:
Exactly one non-terminal on the left-hand side, and one terminal and at most one non-terminal
on the right-hand side.
Examples:
A → aB Right Regular Grammar
A → Ba Left Regular Grammar
A →a
Note: Regular Languages/Grammars are subsets of Context-Free Languages/Grammars.
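The equivalence with finite automata can be illustrated by reading each right-regular rule A → aB as "in state A, on input a, go to state B"; the toy grammar below (S → aS, S → b) generates the language a*b:

```python
# A sketch of the equivalence between a right-regular grammar and a finite
# automaton. The toy grammar S -> aS | b generates the language a*b.

TRANSITIONS = {("S", "a"): "S"}   # S -> aS : in state S, read 'a', stay in S
ACCEPT_ON = {("S", "b")}          # S -> b  : a terminal-only rule ends the word

def accepts(word, state="S"):
    """Simulate the automaton derived from the grammar."""
    for i, ch in enumerate(word):
        if i == len(word) - 1:
            return (state, ch) in ACCEPT_ON
        if (state, ch) not in TRANSITIONS:
            return False
        state = TRANSITIONS[(state, ch)]
    return False
```

The same construction works for any right-regular grammar: non-terminals become states and terminal-only rules become accepting moves.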
Exam 2017/18
1. Discuss the difficulties of developing an Amharic grammar checker.
Natural Language Understanding (NLU) –
A full NLU system would be able to:
• paraphrase an input text
• translate the text into another language
• answer questions about the contents of the text
• draw inferences from the text.
Importance of NLP
Semantics
Semantics deals with the meaning of words, phrases and sentences.
Lexical semantics – the meanings of the component words
Compositional semantics – how components combine to form larger meanings
Discourse – Discourse level deals with the properties of the text as a whole that convey meaning by making
connections between component sentences.
Several methods are used in discourse processing, two of the most common being:
-Anaphora resolution – replacing words such as pronouns, which are semantically vacuous, with the appropriate
entity to which they refer; and
-Discourse/text structure recognition – determining the functions of sentences in the text (which adds to the
meaningful representation of the text).
Approaches to NLP
Rule-based Approach
Rule-based systems are based on explicit representation of facts about language
through well-understood knowledge representation schemes and associated
algorithms.
Rule-based systems usually consist of a set of rules, an inference engine, and a
workspace or working memory.
Knowledge is represented as facts or rules in the rule-based approach.
The inference engine repeatedly selects a rule whose condition is satisfied and
executes the rule.
The primary source of evidence in rule-based systems comes from human-developed
rules (e.g. grammatical rules) and lexicons.
Rule-based approaches have been used in tasks such as information extraction, text
categorization, ambiguity resolution, and so on.
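The rules / inference-engine / working-memory structure described above can be sketched as follows; the facts and the single rule are toy examples invented for illustration:

```python
# A minimal sketch of a rule-based system: the working memory holds facts,
# and the inference engine repeatedly fires any rule whose condition is
# satisfied, adding its conclusion to the working memory.

facts = {("pos", "book", "noun"), ("pos", "the", "det")}   # working memory

RULES = [
    # if a determiner and a noun are known, conclude a noun phrase (toy rule)
    (lambda f: ("pos", "the", "det") in f and ("pos", "book", "noun") in f,
     ("phrase", "the book", "NP")),
]

def infer(facts, rules):
    """Inference engine: fire rules until no new fact can be added."""
    changed = True
    while changed:
        changed = False
        for condition, conclusion in rules:
            if condition(facts) and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts
```

Real rule-based NLP systems encode the conditions as grammatical rules over a lexicon rather than hand-written lambdas, but the control loop is the same.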
Statistical Approach
Statistical approaches employ various mathematical techniques and often use large
text corpora to develop approximate generalized models of linguistic phenomena
based on actual examples of these phenomena provided by the text corpora without
adding significant linguistic or world knowledge.
The primary source of evidence in statistical systems comes from observable data (e.g.
large text corpora).
Statistical approaches have typically been used in tasks such as speech recognition,
parsing, part-of-speech tagging, and so on.
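A minimal example of the statistical approach: estimating bigram probabilities from a toy corpus by maximum likelihood (no smoothing), as a real system would from a large text corpus:

```python
# A sketch of the statistical approach: bigram probabilities estimated by
# maximum likelihood from a toy corpus (no smoothing).

from collections import Counter

corpus = "the student comes . the students come .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p(word, prev):
    """P(word | prev) = count(prev word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]
```

With a realistic corpus such a model captures linguistic regularities (here, agreement statistics) without any hand-written rules, which is exactly the contrast with the rule-based approach above.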
Connectionist Approach
A connectionist model is a network of interconnected simple processing units with
knowledge stored in the weights of the connections between units.
Similar to the statistical approaches, connectionist approaches also develop
generalized models from examples of linguistic phenomena.
What separates connectionism from other statistical methods is that connectionist
models combine statistical learning with various theories of representation.
In addition, in connectionist systems, linguistic models are harder to observe due to the
fact that connectionist architectures are less constrained than statistical ones.
Connectionist approaches have been used in tasks such as word-sense disambiguation,
language generation, syntactic parsing, limited domain translation tasks, and so on.