
ADMAS UNIVERSITY

SCHOOL OF POSTGRADUATE STUDIES

MSC PROGRAM

AMHARIC NEWS TEXT CLASSIFICATION USING DEEP LEARNING APPROACH

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE

REQUIREMENTS FOR THE DEGREE OF MSC IN COMPUTER SCIENCE

BY: Solomon Melkamu

Advisor: Micheal Melese (Ph.D)

JULY, 2020
ADDIS ABABA, ETHIOPIA

Declaration
I declare that this thesis is my original work and has not been presented for a degree in any other
university. This thesis is submitted for examination with my approval as university advisor.

Advisor------------------------------

Date --------------------------------

Certificate of Approval
This is to certify that the thesis prepared by Solomon Melkamu, entitled Amharic News Text
Classification Using Deep Learning Approach and submitted in partial fulfilment of the
requirements for the degree of MSc in Computer Science, complies with the regulations of the
university and meets the accepted standards with respect to originality and quality.

Name of candidate -------------------------------signature ---------------------date --------------

Name of Advisor------------------------------- Signature---------------------Date---------------

External examiner: ____________________Signature: ____________Date: __________

Internal examiner: ____________________Signature: ____________Date: __________.

Acknowledgments
Above all, I thank the Almighty God for his priceless support during the period of my study; without
his will this would not have been possible. Next, my greatest thanks go to my respectful advisor, Dr.
Micheal Melese, for his guidance, advice and critical comments toward the successful accomplishment
of this study.

Table of Contents
Declaration
Certificate of Approval
Acknowledgments
List of Tables
List of Figures
List of Acronyms
ABSTRACT
CHAPTER ONE
INTRODUCTION
1.1 Background
1.2 Statement of the Problem
1.3 Objective of the Study
1.3.1 General Objective
1.3.2 Specific Objective
1.4 Significance of the Study
1.5 Scope and Limitation of the Study
1.6 Methodology of the Study
1.6.1 Literature Review
1.6.2 Data Source
1.6.3 Tools and Experimentation Procedure
1.6.4 News Classification Evaluation Techniques
1.7 Organization of the Thesis
CHAPTER TWO
Literature Review
Text Classification
Introduction
2.1 Definition of Automatic Text Classification
2.2 Approaches to Text Classification
2.2.1 Manual Classification
2.2.2 Automatic Classification
2.3 Basic Concepts of Text Classification
2.3.1 Challenges for Machine Learning
2.4 Types of Machine Learning
2.4.1 Supervised Learning
2.4.2 Unsupervised Learning
2.4.3 Semi-supervised Learning
2.4.4 Reinforcement Learning
2.5 Use of Automatic News Text Classification
2.6.2 Text Classification Method
2.6.4 Classification Algorithms
2.6.6 Functions
2.6.7 Trees
2.6.8 Rules
2.6.9 Misc
2.6.10 Ensemble
2.7 Evaluation of Text Classifiers
2.7.1 Training versus Test Sets
2.7.2 Performance Measures
2.8 Review of Related Research Works on Amharic Text Classification
CHAPTER THREE
Amharic Language and Its Writing System
Introduction
3.1 Amharic Language
3.1.1 Amharic Characters
3.2 Amharic Number System
3.3 Problems of the Amharic Writing System
3.3.1 Inconsistency of Compound Words
3.3.2 Inconsistency of Abbreviations
3.3.3 Transliteration Problems
CHAPTER FOUR
Methodology
Introduction
4.1 Architecture of Amharic News Text Classification
4.2 Data Source
4.3 Data Collection
4.4 Document Preprocessing
4.5 Tokenization
4.6 Normalization
4.7 Stop Words and Numbers Removal
4.8 Data Conversion
CHAPTER FIVE
Experimentation and Performance Discussion
Introduction
5.1 The Results of Experimentation and Testing
5.1.1 Testing
5.1.2 Naïve Bayes Test
5.1.3 J48 Algorithm Test
5.1.4 Sequential Minimal Optimization Classification Test
5.1.5 Comparison of Classification Algorithms
CHAPTER SIX
Conclusions and Recommendations
6.1 Conclusions
6.2 Recommendations
REFERENCES

List of Tables
Table 2.1: Related Research Works on Automatic Amharic Text Classification

Table 3.1: Amharic Characters

List of Figures
Figure 4.1 Architecture of Automatic Amharic News Text Classification

List of Acronyms
CSV: Comma Separated Values
CNN: Convolutional Neural Network
DL: Deep Learning
ENA: Ethiopian News Agency
ICT: Information and Communication Technologies
ML: Machine Learning
NLP: Natural Language Processing
RNN: Recurrent Neural Network

ABSTRACT
Text classification is one of the important methods in natural language processing that classifies huge
amounts of text documents into different categories. Currently, Amharic news institutions collect and
store large amounts of news electronically and classify the news subjectively, using a manual
classification system. As the volume of electronic news information increases, we wanted to support
the task with deep learning techniques in order to better find, filter, and manage these news resources.
The objective of this research was to design and develop an Amharic news text classification prototype
model using a deep learning approach to categorize Amharic news items into their domains (classes).
The research used supervised text classification techniques to build different types of classifiers
generated from the Amharic text documents. A total of 4,432 news items were used for this research.
The methodology followed several steps: document collection and document preprocessing techniques
such as tokenization, normalization, stop word and number removal, and stemming (removal of affixes
from words); preprocessing was done using Python to facilitate the experimentation process of our
news classification. The news classification and its analysis were accomplished using the Jupyter
editor in the Anaconda package. The data, the pre-processed Amharic news items, were organized into
categories of three classes, six classes and nine classes for experimentation purposes. The
experimentation was undertaken based on 80% training and 20% test data. Three supervised learning
classifiers, namely feed forward neural network, simple-RNN and LSTM-RNN, were used to
categorize the Amharic news items. The result of this research indicated that such classifiers are
applicable to classifying Amharic news items automatically. However, the classifiers work well when
the categories contain almost evenly distributed news items, and their accuracy was better when the
number of classes was smaller. The best results were obtained by the feed forward neural network,
simple-RNN and LSTM-RNN classifiers with three classes: 79%, 88.75% and 82.5%, respectively.
This thesis indicated that recurrent classifiers, simple-RNN and LSTM, are more applicable to
supervised automatic categorization of Amharic news items.

Keywords: Natural language processing, Deep Learning, Text classification, Recurrent classifier
algorithms, Amharic news

CHAPTER ONE

INTRODUCTION

1.1. Background
Information has become one of the resources that people need as a basic necessity. The Internet
provides a huge collection of information to satisfy social and commercial services. Even though
the Internet brings massive information, this information explosion creates information overload
that confuses us when we choose relevant information from a huge corpus. In order to make
contents useful, we need a modern system for delivering only what is needed, at the right time
and in the right format [1]. Modern information technologies and web-based services are faced
with the problem of selecting, filtering and managing growing amounts of textual information to
which access is usually critical. Machine learning is a general inductive process that
automatically builds an automatic text classifier by learning, from a set of pre-classified
documents, the characteristics of the categories of interest [2]. Machine learning is considered a
subfield of Artificial Intelligence and is concerned with the development of techniques and
methods that enable the computer to learn.
Machine learning is one of the modern ways of providing information and is required by a
variety of applications, including local language spell-checkers, document classification and
clustering, and machine translation [3]. This approach is more economical and effective than
manual classification systems. There are many different machine learning techniques and
algorithms that have been used for text categorization, for example, Support Vector Machines
(SVMs), decision trees, decision rules, neural networks, nearest neighbor classifiers, regression
models and Bayesian learning methods [4].
Bayesian learning is a statistical classification method that provides a probabilistic approach to
inference, based on the assumption that the quantities of interest are governed by probability
distributions and that optimal decisions can be made by reasoning about these probabilities
together with observed data.
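As an illustration of this probabilistic reasoning, the following minimal sketch implements a multinomial Naive Bayes text classifier in pure Python. The tiny English training snippets and the two category names are hypothetical stand-ins (real input would be Amharic news items); the sketch only shows the mechanics of estimating P(class) and P(word | class) with Laplace smoothing:

```python
from collections import Counter, defaultdict
import math

# Hypothetical labelled snippets, standing in for Amharic news items.
train = [
    ("sport", "the team won the match"),
    ("sport", "player scored a goal in the game"),
    ("economy", "the bank raised interest rates"),
    ("economy", "inflation and market prices rose"),
]

# Count documents per class and words per class; collect the vocabulary.
class_counts = Counter(label for label, _ in train)
word_counts = defaultdict(Counter)
vocab = set()
for label, text in train:
    for w in text.split():
        word_counts[label][w] += 1
        vocab.add(w)

def predict(text):
    scores = {}
    for label in class_counts:
        # log P(class): the class prior
        score = math.log(class_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for w in text.split():
            # log P(word | class) with add-one (Laplace) smoothing
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("the player won the game"))    # expected: sport
print(predict("interest rates and prices"))  # expected: economy
```

The same mechanics underlie the Naïve Bayes experiments discussed later; a realistic classifier would simply be trained on a much larger, preprocessed corpus.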

Deep learning is a subfield of machine learning concerned with algorithms inspired by the
structure and function of the brain, called artificial neural networks (ANNs) [5]. This particular
kind of machine learning achieves great power and flexibility by learning to represent the world
as a nested hierarchy of concepts, with each concept defined in relation to simpler concepts, and
more abstract representations computed in terms of less abstract ones.

Deep learning in the field of natural language processing has shown better performance through
well-designed models [6]. It requires more data but less linguistic expertise to train and operate
automatically, and it is capable of dealing with complex problems and tasks. Therefore, a deep
neural network classifier for news classification is a justifiable approach. In the proposed study,
we apply a deep learning approach to news text classification for the Amharic language to
enhance classification accuracy.
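To make the idea of a neural network classifier concrete, the following is a minimal sketch of the forward pass of a feed forward network (one of the three classifiers used in this study) mapping a small bag-of-words vector to class probabilities. The layer sizes and weight values are illustrative assumptions, not a trained model:

```python
import math

def softmax(z):
    # Turn raw scores into probabilities that sum to 1.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def relu(v):
    return [max(0.0, x) for x in v]

def dense(x, W, b):
    # Fully connected layer: y_j = sum_i x_i * W[i][j] + b_j
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

# Hypothetical tiny network: 4-dim bag-of-words input, one hidden
# layer of 3 ReLU units, 2 output classes. Weights are illustrative.
W1 = [[0.5, -0.2, 0.1],
      [0.3, 0.8, -0.5],
      [-0.4, 0.2, 0.9],
      [0.1, -0.6, 0.4]]
b1 = [0.0, 0.1, -0.1]
W2 = [[1.0, -1.0],
      [-0.5, 0.5],
      [0.2, 0.3]]
b2 = [0.0, 0.0]

def classify(x):
    h = relu(dense(x, W1, b1))        # hidden representation
    return softmax(dense(h, W2, b2))  # class probabilities

probs = classify([1.0, 0.0, 1.0, 0.0])
print(probs)  # two probabilities summing to 1
```

In practice such a network is built with a deep learning library and its weights are learned by back-propagation from the labelled corpus; recurrent variants (simple-RNN, LSTM) replace the first layer with one that reads the words in sequence.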

This proposed study applies a deep learning approach, using recurrent neural network
techniques, to news text classification for the Amharic language to enhance classification
accuracy [7, 8]. Classification is the process of finding a set of models that distinguish data
classes for the purpose of using the model to predict the class of objects whose class label is
unknown. The derived model is based on the analysis of a set of training data, i.e., objects whose
class label is known [9].

Text classification is the process of finding a set of models (functions) that identify data classes
and using the model to predict the class of objects whose class label is unknown [10]. Text is the
main form of communicating knowledge. Technically speaking, text is any string of language,
usually one that is more than one sentence long [11]. Text is composed of symbols from a finite
alphabet and has been created everywhere, in many forms (paper and electronic) and languages.
We use the term document to denote a single unit of information, typically text in a digital form,
but it can also include other media. Documents can be a complete logical unit, like a research
article, a book or a manual. The derived model is based on pattern relationships learned from
training data having known classes. Manual classification, by contrast, is based on human
judgment: organizations and users search for information and save it in categories meaningful to
themselves. News classification has become one of the key applications for organizing news and
for learning text classifiers from examples [12].

News is the reporting of current information on television, radio, the Internet, and in newspapers
and magazines to a mass audience [13]. Amharic news text classification is conducted on a large
collection of electronic news documents, yet the management of Amharic news classification is
still processed manually, as in file-based systems [13]. Manual classification is difficult in a
situation where time and speed become constraints in Amharic news institutions. Therefore,
automatic classification of news documents becomes important in order to save time and money.
Amharic serves as the working language of the Federal Government of Ethiopia and is one of the
Semitic languages, spoken in many parts of Ethiopia as well as around the globe [14]. Since
Amharic has many speakers and a huge collection of documents, it needs many NLP systems,
particularly information retrieval and text classification systems. Supervised algorithms assume
that the category structure or hierarchy of a text database is already known. They require a
training set of labeled documents and return a function that maps documents to the pre-defined
class labels. Knowing the category structure in advance and generating a correctly labeled
training set are very challenging or even impossible in large and dynamic text databases.

This study designs a supervised deep-learning text classification mechanism for the news
classification task. In simple terms, this is the development of algorithms that enable the
machine to learn and perform tasks and activities. Over time, many techniques and
methodologies have been developed for machine learning tasks, including recurrent neural
networks.

1.2 Statement of the problem


In news institutions, large amounts of news from different available sources such as Walta, ENA,
and Fana are processed every day. Government bodies, reporters, editors and other stakeholders
need an automatic news classification system that is supported by artificial intelligence
capability. The news staff, such as editors and reporters, write the content of each news item and
categorize it by its content subjectively. Because news items are categorized by editors
subjectively, finding a specific news item from its class becomes difficult. Manual classification
becomes very difficult because it requires human expertise to categorize documents based on
their relationship patterns [15]. Manual classification is also time consuming, which is contrary
to the fact that speed is a major factor in news institutions.

The electronic Amharic news is stored in an unorganized way, so it is difficult to retrieve news
from the news corpus. News searching becomes a challenging task when news is needed by
stakeholders like government bodies, editors and journalists, and both news searching and new
category allocation are slow and time consuming. One problem with the news categories is the
lack of a clear distinction between some of them. Another source of classification error is that
there is no strict process to validate the categories given by reporters to the news items they are
entering into the system; this classification error was found even in news items that are not
difficult to classify. Automatic text classification also plays an important role in a wide variety of
more flexible, dynamic information management tasks, for example, real-time sorting of files
into folder hierarchies, topic identification to support topic-specific processing operations,
structured search, or finding documents that match long-term standing interests or more dynamic
task-based interests.

The increasing availability of text documents in electronic form increases the importance of
using automatic methods to analyze their content, because the method of using domain experts to
identify new text documents and allocate them to well-defined categories is time-consuming and
expensive, has limits, and does not provide a continuous measure of the degree of confidence
with which the allocation was made [16]. As a result, the identification and classification of text
documents based on their contents are becoming imperative [17].

Text classification is the task of deciding to which category a document belongs among a set of
pre-specified classes of documents. Automatic classification schemes can greatly facilitate the
process of categorization, but they require preparation of the source data before the classifiers
can use it. Text categorization starts by transforming the documents, which typically are strings
of characters, into a representation suitable for the learning algorithm and the classification task.
The proposed study on the Amharic language has a great advantage for a significant number of
the language's speakers if language applications such as machine translation, information
extraction, question answering, information retrieval, text classification, and text summarization
systems are developed. Hence, document categorization is required for managing and filtering
important information.
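The transformation from character strings into a representation suitable for a learning algorithm can be sketched as a simple bag-of-words vectorizer. The two English documents below are hypothetical placeholders; a real pipeline for Amharic would also apply tokenization, normalization, stop word removal and stemming before counting:

```python
# Build a vocabulary from a tiny corpus and turn each document into a
# term-count vector (minimal bag-of-words sketch).
docs = ["the match was won", "prices rose in the market"]

# Sorted vocabulary fixes one column per distinct word.
vocab = sorted({w for d in docs for w in d.split()})

def vectorize(doc):
    words = doc.split()
    # One count per vocabulary word, in vocabulary order.
    return [words.count(w) for w in vocab]

for d in docs:
    print(vectorize(d))
```

Each document becomes a fixed-length numeric vector with one position per vocabulary word, which is the form a classifier can consume; weighting schemes such as TF-IDF refine the raw counts.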

This study attempted to design an automatic Amharic news classification system using a
supervised deep-learning approach. The concern of this study was therefore to build a deep
learning prototype that shows how news text classification can be modernized for good
organization and fast retrieval purposes using the recurrent neural network (RNN) technique. In
order to solve the above problems and to reduce the burden on human experts, more efforts have
been undertaken by researchers and developers in the area of automatic Amharic news text
classification, and it is through research that one determines which techniques and tools are best
for this task. This view was pointed out by [18], [19] using different machine learning
approaches.

Even though those scholars tried to develop text classification models using different algorithms
and scored reasonable text classification performances, more improvement is still needed to
build the realistic NLP systems that the Amharic language seriously needs. Previous work on
Amharic news classification was performed using both knowledge-based approaches and
corpus-based machine learning mechanisms, and it built promising prototypes that can lead to
the fabrication of news classification applications. It remains necessary to develop and improve
classification accuracy and similarity performance by investigating the application of a
supervised deep learning approach, specifically the recurrent neural network (RNN) technique,
for Amharic news text.

After comparing different methods, long short-term memory (LSTM), feed forward neural
network and simple-RNN methods were selected as the deep learning methods to produce a
good Amharic news classification model. Therefore, the main objective of this study was to
design and develop a deep learning prototype that shows how news classification can be
modernized for good organization and fast retrieval purposes. To this end, this study attempts to
find solutions to the following research questions.

• What is the effect of the number of classes and the size of the documents used on the
performance and efficiency of the classifying algorithms?
• With what accuracy do the deep learning methods classify Amharic news text?
• Which classification algorithm is more suitable for creating a classification model for
Amharic news text documents?

1.3 Objective of the study


1.3.1 General Objective

The general objective of this research study was to design and develop an Amharic news text
classification prototype model using deep learning approach.

1.3.2 Specific Objective

The specific objectives of this research are:

• To review related literature on the concepts, techniques and tools of text classification,
particularly in the area of machine learning.
• To preprocess the structured/labeled news items or documents to make them ready for the
classifying process.
• To build and train models using feed forward neural network, simple-RNN and LSTM
classifiers.
• To test and evaluate the performance of the classifying algorithms for Amharic news text
classification.
• To recommend directions for upcoming research in the area of automatic Amharic news
text classification.

1.4. Significance of the study


In addition to being an academic exercise to fulfill the requirements of the program, this
research is believed to produce results that can indicate the applicability of a general Amharic
automatic news item classifier based on supervised machine-learning techniques. The output of
this study can be used as an input to the development of a full-fledged automatic news item
classification system for the Amharic language for ENA. This study also contributes to future
research in Amharic natural language processing by indicating the application of a general
Amharic news text classifier with different approaches. In addition, it can serve as an initiative
for further study in the area of supervised machine-learning-based Amharic text classification
with different algorithms, including feed forward, simple-RNN and LSTM. The result of this
study can also be used as a guideline for further investigations into the possibilities of automatic
news item classification system development for the Amharic language using a deep learning
approach. Finally, it can be used as an input to any automatic Amharic text classification effort
by different media for their news management.

1.5 Scope and Limitation of the Study


The scope of this study is limited to investigating the possibility of designing an Amharic news
text classification system using a deep learning approach. In this study, only three classifying
back-propagation algorithms, namely feed forward neural network, simple-RNN and LSTM,
were used, and an evaluation was made to compare their performance in the classification of
Amharic news text. For the experiments, nine Amharic news item categories were considered:
Adega (Accidents), Economy, Temhert (Education), Politica (Politics), Maheberawi Guday
(Social Issues), Sport, Hege (Law and Justice), Technology (Science and Technology) and Tena
(Health). All of these news categories were used for the experiment in three groups: the first
group contains three news categories, the second group contains six news categories, and the
third group contains all nine news categories. All of these news categories are available in
Ethiopia (ENA, FANA and Walta Information Center), which were considered for this study.
Although many kinds of text classification, such as phrase-based, sentence-based, knowledge-
based and corpus-based, are available, we were limited to corpus-based text classification,
preparing a corpus from 4,432 news documents. The study was also limited to the classification
of news items from the ENA news source.

1.6 Methodology of the Study


The following methods were employed to achieve the stated objectives.
1.6.1 Literature Review

An extensive study of the available literature on classification algorithms was carried out to
identify tools and techniques applicable to Amharic news items. To gain a good understanding
of text classification, relevant published documents were reviewed, including books, journal
articles, previous related research works and electronic publications on the Internet. The review
covered the concept of text classification, its approaches, the methods used for preprocessing
and classifier construction, and how a supervised machine learning approach can be applied to
news text classification, in order to obtain an in-depth understanding of the area and to find
useful approaches for Amharic news articles. The Amharic writing system was also reviewed to
understand the language characteristics that are helpful for preprocessing Amharic news items.

1.6.2 Data source

To gain a clear understanding of the current manual classification scheme and the problem area,
we observed and discussed with the editors, the ICT coordinator and other ENA employees how
they carry out the news classification operation, how news items are created and distributed, and
which software they use for news management.
We collected the news data from the Ethiopian News Agency (ENA) in order to automate the
tedious manual classification work, and the researcher used this data to build the news
classification model. The data sets were converted to plain text for preprocessing, which had
two steps: first, removing irrelevant characters from the corpus; second, performing the
remaining preprocessing tasks. Both steps were done using Python. The nine categories
contained 4,432 Amharic news items, all of which were used for experimentation after
preprocessing.

1.6.3 Tools and Experimentation Procedure

A number of tools were used to develop the Amharic news text classifier. For the Amharic news
text preprocessing tasks, such as normalization, tokenization, stemming, stop-word and number
removal, and term weighting, Python 3.8.4 integrated with NLTK was employed. Python was
chosen for its convenience and power in text processing, its robustness in handling and
processing Amharic characters (Fidels), and its familiarity to the researcher. It was used to
process the Amharic texts and prepare representative news items with keywords. All Amharic
news documents, converted from the "html" to "text" file format, were processed using this
software.
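As an illustration, the core preprocessing steps (normalization, tokenization, stop-word and number removal) could be sketched in plain Python as follows. The homophone map, stop-word list and punctuation set below are hypothetical examples for demonstration, not the actual rules used in this study:

```python
import re

# Hypothetical homophone-normalization map and stop-word list,
# for illustration only; the study's actual rules may differ.
HOMOPHONES = {"ሐ": "ሀ", "ኀ": "ሀ", "ሠ": "ሰ", "ፀ": "ጸ", "ዐ": "አ"}
STOP_WORDS = {"ነው", "እና", "ላይ"}            # sample Amharic stop words
PUNCT = re.compile(r"[።፣፤፥፦፡\d]+")          # Ethiopic punctuation and digits

def preprocess(text):
    text = "".join(HOMOPHONES.get(ch, ch) for ch in text)    # normalization
    text = PUNCT.sub(" ", text)                              # punctuation/number removal
    return [t for t in text.split() if t not in STOP_WORDS]  # tokenize + stop-word removal

tokens = preprocess("ኢትዮጵያ ሰላም ነው።")
```

A stemming step would follow tokenization in the full pipeline; it is omitted here for brevity.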

Many text datasets can have millions of news items representing the document collection in the
dataset but not all words of news documents in the dataset are useful for automatic classification.
Moreover, some attributes may be irrelevant for a given classification task. For example, if the
task is to classify news items to the major news categories, attributes such as the news creation
date are likely to be irrelevant. After preprocessing, the experimental dataset was rearranged
into a format suitable for the Keras library on top of the Anaconda Navigator platform.

 The preprocessed dataset is rearranged into an attribute-weight matrix by:

 taking the 'News Category' in the dataset as a field (column) of the relation,
 taking each document instance as a separate record (row) of the relation,
 using the weight of each attribute in a document as the value of the fields for the instance
the document represents,
 considering class labels as nominal attributes in the dataset.
 News items are stored in per-category directories; the folder name of each directory is
used as the category name.
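The rearrangement above can be illustrated with a small TF-IDF sketch in Python. The tokens and category names are invented for demonstration and do not come from the actual corpus:

```python
import math
from collections import Counter

# Toy tokenized documents with their category labels (hypothetical data).
docs = [(["sport", "goal"], "Sport"),
        (["bank", "market"], "Economy"),
        (["goal", "match"], "Sport")]

vocab = sorted({t for toks, _ in docs for t in toks})   # term fields (columns)
N = len(docs)
df = Counter(t for toks, _ in docs for t in set(toks))  # document frequency

def tfidf_row(tokens):
    tf = Counter(tokens)
    return [tf[t] * math.log(N / df[t]) for t in vocab]  # attribute weights

# One record (row) per document; the final field is the nominal class label.
matrix = [tfidf_row(toks) + [label] for toks, label in docs]
```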

1.6.4 Evaluation Techniques

The developed model was evaluated through system performance testing using prepared test
cases. Evaluation methods are needed to compare various text classifiers; a classifier can be
evaluated by measuring its efficiency and its effectiveness. Efficiency refers to the ability of a
classifier to run fast and is typically measured by elapsed processor time, usually along two
dimensions: learning efficiency (i.e., the time a deep learning algorithm takes to generate a
classifier from a set of training examples) and classification efficiency (i.e., the time the
classifier takes to assign appropriate categories to a new document). The most common
evaluation criterion for text classification systems, however, is effectiveness.

This refers to the ability to take the right decisions when classifying new incoming documents.
There are several commonly used measures of effectiveness, but no agreement on a single
measure for all applications; which measure is preferable depends on the characteristics of the
test data set and on the user's interests. The absence of one optimal measure of effectiveness
makes it difficult to compare the relative effectiveness of classifiers.
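Among the common effectiveness measures are per-category precision, recall and the F1 measure. A minimal sketch, using invented label lists rather than real experimental output, is:

```python
# Hypothetical predicted vs. true category labels for six test documents.
true = ["Sport", "Sport", "Economy", "Health", "Economy", "Sport"]
pred = ["Sport", "Economy", "Economy", "Health", "Sport", "Sport"]

def prf(category):
    tp = sum(t == p == category for t, p in zip(true, pred))          # true positives
    fp = sum(p == category and t != category for t, p in zip(true, pred))
    fn = sum(t == category and p != category for t, p in zip(true, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```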

1.7 Contribution
Since this study applies deep learning, which is currently used across text processing areas
including text extraction and generation, information retrieval and text classification, our work
can serve as reference material for further study. It shows how deep learning technology can be
applied not only to Amharic text but also gives insight for other Ethiopian local languages. In
addition, the news documents were prepared and stored as a corpus alongside the ENA news
archive, so any researcher who needs the corpus can obtain it directly from us.

1.8 Thesis Organization

This study is organized into six chapters. Chapter one covers the introduction, statement of
the problem, objectives of the study, methodology, scope and significance of the study.
Chapter two discusses different text classification approaches, document preprocessing and
representation, and gives an overview of the different classification and evaluation techniques.
Chapter three gives a highlight of the Amharic writing system. Chapter four discusses the
details of the methodology adopted, and chapter five presents the experimental results and
findings of the study. Chapter six summarizes the findings of the study with conclusions and
recommendations.
Chapter two

Literature Review

Text Classification

Introduction
As information technologies such as the Internet and the web continue to expand, accelerating
the growth of document collections, we need suitable mechanisms to cope with this radical
growth in data traffic. The volume of documents grows exponentially, and information has
great value in our day-to-day life. The emergence of the current digital age has brought
organizations both new opportunities and unprecedented challenges; a major challenge is
managing the volume of information resulting from the continuous expansion of the Internet
and advances in information technology. As document collections grow, automatic document
classification becomes the key technology for organizing large collections and providing users
with more relevant information. It is believed that grouping similar documents together into
classes helps users find relevant information more quickly and allows them to focus their search
in the appropriate direction [20].

2.1 Definition of Automatic Text Classification


The term automatic text classification is sometimes used in the literature to mean different
things. The first sense refers to "the automatic identification of a set of categories and
the grouping of documents under them", a task usually called text clustering [21]. This
technique classifies documents without pre-defined categories: the classification process is
expected to create the classes or categories based on the similarity that exists among the
documents. This is unsupervised learning, where there is no hint at all about the correct
outputs. The second sense refers to "the automatic assignment of documents to a predefined
set of categories" [21], a task also referred to as text categorization. Text categorization is a
supervised learning problem [23]; [22] state that supervised learning is any situation in which
both the inputs and outputs of a component can be perceived. The machine learning approach
gives high-accuracy classifiers and is significantly less expensive than manual construction,
because the algorithm automatically constructs the decision rule itself [24]. Classifiers built by
means of supervised machine learning techniques nowadays achieve impressive levels of
effectiveness, making automatic classification a qualitatively, and not only economically, viable
alternative to manual classification [21]. Text classification, also known as topic spotting, is
thus the activity of labeling natural language texts with thematic categories from a predefined
set [21], i.e. the task of assigning predefined categories to free-text documents [25].

2.2. Approaches to Text Classification


2.2.1. Manual Classification

Manual classification requires individuals to assign each document to one or more classes. These
individuals are usually domain experts thoroughly versed in the category structure or taxonomy
being used. It is often applied in library and technical collections as well as in call centers and
form-processing environments. Manual classification can achieve a high degree of accuracy,
although even domain experts will occasionally disagree on how to categorize a document.
However, manual classification is labor-intensive and therefore more costly than automated
techniques.

2.2.2 Automatic classification


To address the problems of manual classification, automatic text classification has been explored
as an alternative approach using different techniques. In contrast to manual classification,
automatic classification offers the advantages of automation, efficiency and consistency. It
employs either rule-based (knowledge engineering) or machine learning techniques.
2.3 Basic concepts of text classification
With the massive growth in the number of electronic documents (e.g. e-books or web pages) for
general reference or specific search purposes, it is quite difficult, if not totally impossible, for
library and information professionals to manually categorize and index documents. This is the
problem of information overload [26]. In fact, it is very time-consuming to classify a rich
mixture of electronic documents solely by manual methods. To alleviate this problem, initial
ideas for applying automatic document classification methods to the categorization of electronic
documents have been explored in experimental settings [27]-[28].

According to [21], automatic text classification is the task of building software tools to
automatically assign labels to a document based on selected features of that document. Until
the late 1980s, the most popular automatic text classification method was based on the
knowledge engineering approach, in which a set of manually defined rules is applied to
classify documents. However, the main problem of this approach is the knowledge acquisition
bottleneck: domain experts must be available and heavily consulted in designing the
classification rules.

In fact, it is very time-consuming to elicit document classification knowledge even when domain
experts are abundantly available, which is unlikely in the real world. In the machine learning
approach to text categorization, a general inductive process automatically builds a classifier for a
category ci by observing the characteristics of a set of documents di manually classified under ci
by a domain expert; from these characteristics, the inductive process learns the characteristics
that a new unseen document should have in order to be classified under ci. The classification
problem is thus an activity of supervised learning, since the learning process is "supervised" by
the knowledge of the categories and of the training instances that belong to them.

2.3.1. Challenges for Machine Learning


The unstructured format of natural language text and the diversity of target concepts
associated with the categories present interesting challenges to the content-based
application of machine learning algorithms. The large number of input features that
seem necessary for the construction of classifiers overwhelms most text categorization
systems.
For most machine learning algorithms, increasing the number of features means
that they have to use more training examples to obtain the same level of text
categorization performance. This large number of training examples and features may
be computationally intractable for most machine learning algorithms, by requiring
unacceptably large processing time and memory [29]. There are usually many features
that appear in most documents in a classification task. Such words can be considered
irrelevant, in the sense that they are evenly distributed across documents
and, as a result, have no discriminating power. It is important for the efficiency and
effectiveness of the system to select an efficient subset of features, by removing these
irrelevant ones. However, it is a difficult task since a reasonable feature subset size
might be different across the categories and some informative features for a given
category could be distributed across several categories. For example, depending on the
level of concept complexity, some categories require a large number of features to
describe their concepts while others need a relatively small number of features. Also,
informative features in the overlapping categories might be evenly distributed across
such overlapping categories and could be considered as irrelevant ones [30].

2.4. Machine learning text classification


Machine learning can be defined as the way machines generalize from experience, presented
either as training data or as interaction with their surroundings. Whereas rule-based approaches
classify documents with manually written rules, statistical text categorization uses machine
learning methods to learn classification rules automatically from human-labeled training
documents. Modern classification approaches now employ machine learning techniques. Deep
learning, a growing branch of machine learning concerned with the creation of computer
programs that learn from experience, now solves many of these problems.
In the case of news text classification, the task to be learnt is an objective function from a set of
functions called the hypothesis space, which maps candidate news documents onto one or more
classes. A set of pre-classified documents provides the experience necessary for a classifier to
learn the objective function. This set is typically marked up by a human, and is the only human
input required to operate a supervised classifier: the learning and subsequent classification can be
done automatically. It is possible to learn from a document set that has not already been
classified (this is unsupervised learning in contrast to supervised learning), but the vast majority
of approaches do not consider this possibility. Classification performance is typically
significantly higher with supervised learning [31].

2.5. Types of Machine learning


Modern classification approaches employ machine learning, which is concerned with the
creation of computer programs that learn from experience. As described above, the task to be
learnt in text classification is an objective function, drawn from a set of functions called the
hypothesis space, that maps candidate documents onto one or more categories, and it is learned
from a set of pre-classified documents. Machine learning methods are commonly grouped into
four types, discussed in turn below: supervised, unsupervised, semi-supervised and
reinforcement learning [31].

2.5.1 Supervised Learning

Most approaches to automated text classification require a human subject expert to initiate the
learning process by manually classifying or assigning a number of training documents to each
category. The classification system first analyses the statistical occurrences of each word in the
news documents and then constructs a model, or classifier, for each category that is used to
classify documents automatically. The system refines its model, in a sense learning the
categories, as new documents are processed. In supervised learning, classification is seen as
learning from training examples: supervision takes place when the data are labeled with
predefined classes by human experts, as if a "teacher" gives the classes into which the test data
are then classified too. Supervised learning is commonly used in real-world applications such as
face and speech recognition, product or movie recommendation, network anomaly detection,
and sales forecasting. It can be further divided into two types: regression and classification
learning [72].

Regression trains on and predicts a continuous-valued response, whereas classification attempts
to find the appropriate class label, for example after analyzing positive/negative opinions. In
supervised learning, the learning data come with descriptions, labels, targets or desired outputs,
and the objective is to find a general rule that maps inputs to outputs. This kind of learning data
is called labeled data. The learned rule (model) is then used to label new data with unknown
outputs.

2.5.2. Unsupervised Learning

These systems identify a group, or clusters of related documents as well as the relationship
between these documents. Commonly referred to as clustering, this approach eliminates the need
for training sets because it does not require a pre-existing category structure. However, clustering
algorithms are not always good at selecting categories that are intuitive to human users. For this
reason, clustering generally works hand-in-hand with the supervised learning techniques. When
learning data contains only some indications without any description or labels, it is up to the
coder or the algorithm to find the structure of the underlying data, to discover hidden patterns, or
to determine how to describe the data. This kind of learning data is called unlabeled data [66].
Suppose there are many data points and we want to classify them into several groups without
knowing exactly what the classification criteria should be. An unsupervised learning algorithm
tries to classify the given dataset into a certain number of groups in an optimum way.
Unsupervised learning algorithms are extremely powerful tools for analyzing data and
identifying patterns and trends.
2.5.3. Semi-supervised Learning

In the previous two types, labels are either present for all observations in the dataset or absent
for all of them. Semi-supervised learning falls in between. In many practical situations the cost
of labeling is quite high, since it requires skilled human experts, so when labels are absent in the
majority of observations but present in a few, semi-supervised algorithms are the best candidates
for model building. These methods exploit the idea that, even though the group memberships of
the unlabeled data are unknown, the data carry important information about the group
parameters, starting from a few seed instances labeled by experts with low human effort [45].

2.5.4. Reinforcement Learning

Reinforcement learning aims at using observations gathered from interaction with the
environment to take actions that maximize a reward or minimize a risk. A reinforcement
learning algorithm (called the agent) continuously learns from the environment in an iterative
fashion; in the process, the agent learns from its experiences of the environment until it has
explored the full range of possible states. Reinforcement learning is a type of machine learning,
and thereby also a branch of artificial intelligence. It allows machines and software agents to
automatically determine the ideal behavior within a specific context in order to maximize
performance. Simple reward feedback, known as the reinforcement signal, is required for the
agent to learn its behavior.

Many different algorithms tackle this issue. In fact, reinforcement learning is defined by a
specific type of problem, and all its solutions are classed as reinforcement learning algorithms.
In this problem, an agent must decide the best action to select based on its current state; when
this step is repeated, the problem is known as a Markov Decision Process [66].

[Figure: Venn-style diagram showing deep learning overlapping supervised, unsupervised,
semi-supervised and reinforcement learning]

Figure 2.1: The relation between deep learning and machine learning

2.5 Use of News Automatic Text Classification


Document distribution: Text classification permits the efficient automatic distribution of
documents via email or fax, eliminating the time-consuming manual process of faxing or
mailing. This can be achieved by first classifying the documents according to sender and
message type.

Text filtering: Text filtering is the activity of classifying text documents that contain specific
keywords. Typical cases of filtering procedures are e-mail filters and newsfeed filters.

Mail routing: Large enterprises are currently automating their document processing by means of
workflow management systems, allowing an image of a document to circulate through the
company rather than the original. In particular, they aim for a uniform treatment of incoming
mail, whether electronic or on paper. A bottleneck in this approach is entering documents into
the right workflow, a process that involves a superficial interpretation of the contents of the
document and is time-consuming and error-prone.
News monitoring: In knowledge-based companies such as stock exchanges, many people are
engaged in scanning newspapers and other information sources for items concerning the
national or international economy, or individual companies on the stock market. The results
are sent to the people who should be informed.

2.6. Text Classification Method

For the classification task, learning and classification can be divided into the following two
steps:

Preprocessing/indexing: the mapping of document contents into a logical view (e.g. a vector
representation of documents) that can be used by a classification algorithm. Text operations
and statistical operations are used to extract important content from a document.

Learning/classification: based on the logical view of the document, learning or classification
takes place. It is important that the same preprocessing/indexing methods are used for both
learning and classification.
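A toy sketch of the two steps, in which a single `index` function is shared by training and classification; the data and names are illustrative, and a simple nearest-centroid rule stands in for a real learner:

```python
import math

def index(text):
    # Preprocessing/indexing step: map a document to its logical view,
    # here a bag-of-words frequency vector.
    vec = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a.get(k, 0) * v for k, v in b.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "Learning": index one example document per category (hypothetical text).
train = {"Sport": index("goal match team goal"),
         "Economy": index("bank market price")}

def classify(text):
    # Classification step uses the SAME indexing as learning.
    doc = index(text)
    return max(train, key=lambda c: cosine(doc, train[c]))
```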

2.6.1. Classification Algorithms

Supervised algorithms assume that the category structure or hierarchy of a text database is
already known. They require a training set of labeled documents and return a function that
maps documents to the pre-defined class labels [45]. As discussed previously, knowing the
category structure in advance and generating a correctly labeled training set are very
challenging, or even impossible, in large and dynamic text databases. A wide range of
classification algorithms have been developed over time, with different underlying models and
different theories of how a classifier should be built. These algorithms have different inductive
biases that affect their performance on a data set, so it is important to find the inductive bias
that best fits the data. This can be done empirically by applying a set of different machine
learning algorithms and selecting the one that performs best [32]. This section discusses the
most popular supervised algorithms:
Linear Regression
Linear regression is a standard algorithm that expresses the numerical class as a linear
combination of the attributes. The coefficients of these attributes are calculated using the
least-squares method [32].
RBF Network
An RBF network is a radial basis function network, a type of neural network with three layers:
an input layer with a node for each attribute; a hidden layer where each node has a Gaussian
radial basis function as activation function, created using the K-means clustering method [36];
and an output layer containing a node for each class with a sigmoid activation function.
Support Vector Machines
Support Vector Machines (SVM) is a technique introduced by Vapnik in 1995, based on the
Structural Risk Minimization principle. It is designed for solving two-class pattern recognition
problems: the task is to find the decision surface that separates the positive and negative
training examples of a category with maximum margin.
K-Nearest Neighbors Classification

k-NN (k-nearest neighbor) classification is a popular instance-based learning method [33] that
has been shown to be a strong performer in text categorization [47]. It is one of the simplest
machine learning algorithms for beginners: it makes predictions from previously available data,
classifying new data into categories based on their characteristics. kNN is a supervised machine
learning algorithm used mostly for classification; it stores the available data and uses them to
measure the similarity of new cases. The k in kNN is a parameter denoting the number of
nearest neighbors included in the "majority voting" process, whereby each element's neighbors
"vote" to determine its class. kNN works best on a small, noise-free dataset in which all data
are labeled. The algorithm is not fast, does not learn to handle unclean data, and is not a good
choice for large datasets.
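A from-scratch sketch of the majority-voting idea, on invented two-feature data rather than real news vectors:

```python
import math
from collections import Counter

# Toy labeled points: (feature vector, category); values are illustrative.
train = [((1.0, 1.0), "Sport"), ((1.2, 0.8), "Sport"),
         ((4.0, 4.2), "Economy"), ((4.1, 3.9), "Economy")]

def knn(query, k=3):
    # Sort training points by Euclidean distance to the query, then
    # majority-vote on the labels of the k closest ones.
    nearest = sorted(train, key=lambda p: math.dist(query, p[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```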

Linear Regression
Linear regression is among the most popular machine learning algorithms. It establishes a
relation between two variables by fitting a linear equation through the observed data; in other
words, this type of algorithm observes various features in order to come to a conclusion. If the
number of variables is greater than two, the algorithm is called multiple linear regression.
Linear regression is a supervised machine learning algorithm that works well in Python. It is a
powerful statistical tool that can be applied to predicting consumer behavior, estimating
forecasts and evaluating trends; a company can, for example, use linear analysis to forecast
sales for a future period. With two variables, one is explanatory and the other dependent: the
dependent variable represents the value to be researched or predicted, the explanatory variable
is independent, and the dependent variable always depends on the explanatory one. The point
of linear regression is to see whether there is a significant relationship between the two
variables and, if there is, what exactly it represents. Linear regression is considered a simple
machine learning algorithm and is therefore popular among scientists. Besides linear regression
there is also logistic regression; the difference is as follows.

Logistic Regression

Logistic regression is one of the basic machine learning algorithms. It is a binomial classifier
with only two states or values, to which you can assign the meanings yes and no, true and
false, on and off, or 1 and 0. This kind of algorithm classifies input data as category or
non-category: the input data is compressed and then analyzed. Unlike linear regression,
logistic algorithms make predictions using a nonlinear function. Logistic regression algorithms
are used for classification, not for regression tasks; the "regression" in the name reflects the
fact that the algorithm uses a linear model of the feature space inside the nonlinear function.
Logistic regression is a supervised machine learning algorithm which, like linear regression,
works well in Python. From a practical point of view, if the output of a study is expected in
terms such as sick/healthy or cancer/no cancer, logistic regression is a suitable algorithm to use.
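A minimal from-scratch sketch of the idea on a one-feature toy problem; the data, learning rate and iteration count are arbitrary choices for illustration:

```python
import math

# Toy data: feature value -> binary label (e.g. keyword count vs. category).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 0, 1, 1, 1]

w, b = 0.0, 0.0
for _ in range(2000):                         # stochastic gradient descent
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(w * x + b)))  # sigmoid: the nonlinear link
        w -= 0.1 * (p - y) * x                # log-loss gradient step
        b -= 0.1 * (p - y)

def predict(x):
    # Classify as "category" when the predicted probability exceeds 0.5.
    return 1 / (1 + math.exp(-(w * x + b))) > 0.5
```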
Learning Vector Quantization

The Learning Vector Quantization algorithm, or LVQ, is one of the more advanced machine
learning algorithms. Unlike the kNN, the LVQ algorithm represents an artificial neural network
algorithm. In other words, it aims to recreate the neurology of the human brain. The LVQ
algorithm uses a collection of codebook vectors as a representation. Those are basically lists of
numbers, which have the same input and output qualities as your training data.

Support Vector Machines

SVMs are among the most popular machine learning algorithms. The Support Vector Machines
algorithm is suitable for extreme cases of classification, i.e. when the decision boundary of the
input data is unclear. The SVM serves as the frontier that best segregates the input classes.
SVMs can be used on multidimensional datasets: the algorithm transforms a non-linear space
into a linear space. In two dimensions you can visualize the boundary as a line and thus have
an easier time identifying the correlations. In practice, SVMs have already been used in a
variety of fields.
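The space-transformation idea can be seen on the classic XOR example, which is not linearly separable in two dimensions but becomes separable after adding a product feature. The weights below are chosen by hand purely for illustration; an actual SVM would find the maximum-margin hyperplane automatically:

```python
# XOR truth table: label +1 where exactly one input is 1, else -1.
points = {(0, 0): -1, (0, 1): 1, (1, 0): 1, (1, 1): -1}

def lift(x1, x2):
    # Explicit feature map into 3-D: the extra x1*x2 coordinate makes
    # XOR linearly separable (a kernel would do this implicitly).
    return (x1, x2, x1 * x2)

# A separating hyperplane in the lifted space (hand-picked, not max-margin).
w, b = (1.0, 1.0, -2.0), -0.5

def classify(x1, x2):
    z = lift(x1, x2)
    return 1 if sum(wi * zi for wi, zi in zip(w, z)) + b > 0 else -1
```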

Neural Network
Neural network analysis is basically a prediction tool modeled on how the human brain works.
NNs are trained to recognize certain patterns or behaviors when fed with a large data set, and
can then determine predictors of a dependent variable. Thus, a neural network can be defined
as a distributed processor that creates knowledge based on experience and makes that
knowledge available for future use. Neural networks are a set of algorithms, modeled loosely
on the human brain, that are designed to recognize patterns. They interpret sensory data
through a kind of machine perception, labeling or clustering raw input. The patterns they
recognize are numerical, contained in vectors, into which all real-world data, be it images,
sound, text or time series, must be translated.

Neural networks form the backbone of deep learning. The goal of a neural network is to find an
approximation of an unknown function. Neural networks are made up of numerous
interconnected conceptualized artificial neurons, which pass data between themselves and which
have associated weights that are tuned based upon the network's experience. Neurons have
activation thresholds which, if met by a combination of their associated weights and the data
passed to them, cause them to fire; combinations of fired neurons result in learning [48]-[49].

2.6.2 Types of neural network


Different neural networks are governed by different concepts, and each has its unique
strengths and weaknesses. In the paragraphs below, different types of neural networks and their
applications are briefly discussed [37].

Feed-forward Neural Network – Artificial Neuron

Feed-forward neural networks consist of neurons that are organized into layers. The first layer is
called the input layer, the last layer is called the output layer, and the layers in between are
hidden layers.
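This layered structure can be sketched as a single forward pass. The sizes and weights below are arbitrary, untrained illustrative values, not a model from this thesis:

```python
import numpy as np

# Forward pass through a tiny feed-forward network:
# input layer (3 units) -> hidden layer (4 units) -> output layer (2 units).
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))   # input -> hidden weights
W2 = rng.normal(size=(4, 2))   # hidden -> output weights

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())    # subtract max for numerical stability
    return e / e.sum()

x = np.array([0.5, -1.2, 3.1])   # one input example
hidden = relu(x @ W1)            # hidden layer activations
output = softmax(hidden @ W2)    # probabilities over 2 output classes

print(output.shape)
```

The softmax output sums to one, so it can be read as a probability distribution over the output classes.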

Radial Basis Function Neural Network

Radial basis function neural networks are function approximation models that can be trained by
examples to implement a desired input–output mapping [38].

Multilayer Perceptron
Multilayer perceptron neural networks are the most commonly used feed-forward neural
networks due to their fast operation, ease of implementation, and modest training set
requirements. The hidden layer processes the input information and transmits it to the output
layer. An MLPNN model with an insufficient or excessive number of neurons in the hidden
layer is likely to suffer from poor generalization and over-fitting. There is no analytical method
for determining the number of neurons in the hidden layer; it is typically found by trial and
error [40].

Recurrent Neural Network (RNN)


A Recurrent Neural Network is a type of artificial neural network in which the output of a
particular layer is saved and fed back to the input, which helps predict the outcome of the layer.

The first layer is formed in the same way as in a feed-forward network, that is, from the
product of the sum of the weights and features. In subsequent layers, however, the recurrent
process begins: each step's input is combined with the saved output of the preceding step,
which aids the prediction of that step's outcome. RNN use cases tend to be connected to
language models, in which predicting the next letter in a word or the next word in a sentence
depends on the data that comes before it [37].

The network self-corrects using back-propagation when its prediction is wrong. This type of
neural network is widely used in tasks such as text-to-speech conversion.

Figure 2.2 shows the recurrent neural network
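The feedback of the previous state into the next step can be sketched as follows. This is a minimal numpy illustration with arbitrary weights, not the networks trained later in this thesis:

```python
import numpy as np

# One recurrent layer: the hidden state from the previous step is fed
# back in alongside each new input, as described above.
rng = np.random.default_rng(1)
Wx = rng.normal(size=(4, 3))   # input -> hidden weights
Wh = rng.normal(size=(3, 3))   # previous hidden -> hidden weights

def rnn_forward(inputs):
    h = np.zeros(3)                   # initial hidden state
    for x in inputs:
        h = np.tanh(x @ Wx + h @ Wh)  # new state depends on the old state
    return h

seq = [rng.normal(size=4) for _ in range(5)]  # a sequence of 5 inputs
h_final = rnn_forward(seq)
print(h_final.shape)
```

Because every step reuses the same Wx and Wh, the final state summarizes the whole sequence, which is what makes this architecture suit language data.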

Modular Neural Network


In this type of neural network, sub-tasks are performed by different, independently functioning
networks [37]. These networks' computation processes do not interact with each other; each
achieves its output by working independently. The neural network separates a large process into
independent components, which makes computation faster. This speed is achieved because
the sub-networks are not connected to one another. Figure 7 shows the modular neural network
with a visual representation.

Convolutional Neural Network

Convolutional neural networks (CNNs), a variant of deep learning, were motivated by
neurobiological research on locally sensitive and orientation-selective nerve cells in the visual
cortex. Convolutional Neural Networks are a special kind of multi-layer neural network with
the following characteristic: a CNN is a feed-forward network that can extract topological
properties from an image. Like almost every other neural network, it is trained with a version of
the back-propagation algorithm.

CNNs are designed to recognize visual patterns directly from pixel images with little to no
preprocessing. They can recognize patterns with extreme variability, such as handwritten text
and natural images. A CNN typically consists of convolution layers, subsampling layers, and a
fully connected layer. These layers can either be completely interconnected or pooled. Before
passing the result to the next layer, the convolutional layer applies a convolution operation to
the input. Thanks to this operation, the network can be much deeper while having far fewer
parameters.
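The parameter sharing behind this efficiency can be illustrated with a one-dimensional convolution; the toy signal and kernel below are invented for illustration:

```python
import numpy as np

# One small filter (3 shared weights) is slid across the whole input,
# instead of learning one weight per input position. This sharing is
# why convolutional layers need far fewer parameters.
def conv1d(signal, kernel):
    k = len(kernel)
    return np.array([signal[i:i + k] @ kernel
                     for i in range(len(signal) - k + 1)])

signal = np.array([0., 0., 1., 1., 1., 0., 0.])  # toy 1-D input
kernel = np.array([1., 0., -1.])                 # a simple edge detector

out = conv1d(signal, kernel)
print(out.tolist())
```

The filter responds with opposite signs at the rising and falling edges of the signal, showing how one tiny set of weights detects the same pattern anywhere in the input.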


2.6.3 Deep Learning


Artificial Intelligence encompasses the simulation of human intelligence processes by machines
and special computer systems; examples include learning, reasoning and self-correction.
Applications of AI include speech recognition, expert systems, image recognition and machine
vision. Machine learning is the branch of artificial intelligence that deals with systems and
algorithms that can learn from new data and data patterns.
Fig 2.3 Deep learning in relation to ML and AI
Deep Learning is an advanced form of Machine Learning that uses the concepts of Neural
Networks to solve highly computational use cases involving the analysis of multi-dimensional
data. It automates the process of feature extraction, so that very minimal human intervention is
needed. Deep learning is a set of techniques inspired by the mechanisms of the human brain.
The two primary deep learning architectures used in text classification are Convolutional
Neural Networks (CNN) and Recurrent Neural Networks (RNN) [50].

This machine learning method needs far more training samples than traditional machine
learning algorithms, often on the order of millions of labeled examples. On the other hand,
traditional machine learning techniques reach a certain threshold where adding more training
samples no longer improves their overall accuracy, whereas deep learning classifiers keep
producing better results as more data is added.

Deep Learning Tools

TensorFlow: one of the best-known frameworks, TensorFlow is used for natural language
processing, text classification and summarization, speech recognition, translation and more. It
is flexible and offers a comprehensive list of libraries and tools that let you build and deploy
ML applications. TensorFlow finds most of its application in developing deep learning
solutions with Python, since deep learning uses many hidden layers (depth) in comparison to
traditional machine learning networks. Most of the data in the world is unstructured and
unlabeled, which makes TensorFlow one of the best libraries to use. In a TensorFlow graph,
nodes represent operations while edges stand for the multidimensional data arrays (tensors)
flowing between them.

Torch – “Torch is a scientific computing framework with wide support for machine learning
algorithms that puts GPUs first. It is easy to use and efficient, thanks to an easy and fast scripting
language, LuaJIT, and an underlying C/CUDA implementation.”

Theano: "Theano is a Python library that lets you define, optimize, and evaluate mathematical
expressions, especially ones with multi-dimensional arrays (numpy.ndarray). Using Theano it is
possible to attain speeds rivaling hand-crafted C implementations for problems involving large
amounts of data. It can also surpass C on a CPU (Central Processing Unit) by many orders of
magnitude by taking advantage of recent GPUs (Graphical Processing Units)."

Microsoft Cognitive Toolkit: Most effective for image, speech and text-based data, MCTK
supports both CNN and RNN. For complex layer-type, users can use high-level language, and
the fine granularity of the building blocks ensures smooth functioning.

CAFFE: One of the deep learning tools built for scale, Caffe helps machines to track speed,
modularity and expression. It uses interfaces with C, C++, Python, MATLAB and is especially
relevant for convolution neural networks.

Chainer: A Python-based deep learning framework, Chainer provides automatic differentiation


APIs based on the define-by-run approach (a.k.a. dynamic computational graphs). It can also
build and train neural networks through high-level object-oriented APIs.

Deeplearning4j: Deeplearning4j is a JVM-based, industry-focused, commercially supported,


distributed deep-learning framework. The most significant advantage of using Deeplearning4j is
speed. It can skim through massive volumes of data in very little time.

2.6.4 Benefits of using deep learning


Maximum utilization of unstructured data: Research from Gartner revealed that a huge
percentage of an organization's data is unstructured, because the majority of it exists in different
formats such as pictures, text etc. For most machine learning algorithms it is difficult to analyze
unstructured data, which means it remains unutilized, and this is exactly where deep learning
becomes useful. Deep learning can use different data formats to train its algorithms and still
obtain insights relevant to the purpose of the training. For instance, deep learning algorithms
can uncover existing relations between industry analysis, social media chatter, and more to
predict upcoming stock prices of a given organization.

Ability to deliver high-quality results: Humans get hungry or tired and sometimes make
careless mistakes. With neural networks this is not the case. Once trained properly, a deep
learning model can perform thousands of routine, repetitive tasks in a relatively short period of
time compared to a human being. In addition, the quality of the work never degrades, unless the
training data contains raw data that does not represent the problem you are trying to solve.

Elimination of unnecessary costs: Recalls are highly expensive, and for some industries a recall
can cost an organization millions of dollars in direct costs. With the help of deep learning,
subjective defects that are hard to train for, such as minor product labeling errors, can be
detected. Deep learning models can also identify defects that would be difficult to detect
otherwise. When capturing consistent images becomes challenging for various reasons, deep
learning can account for those variations and learn valuable features that make the inspections
robust.

Elimination of the need for data labeling: Data labeling can be an expensive and time-consuming
job. With a deep learning approach, the need for well-labeled data becomes less pressing, as the
algorithms excel at learning without explicit guidance. Other types of machine learning
approaches are not nearly as successful at this type of learning.

3.6 Evaluation of Text Classifiers


As mentioned in previous chapters, there are many learning algorithms useful for text
classification purposes. Some algorithms may be more appropriate than others for a certain
application domain, type or size of data. In order to select or evaluate the best algorithm, the
following criteria are important [51]-[52]:
- Predictive accuracy of the classifier: the ability of the model/algorithm to correctly predict
the class label of new or previously unseen data. This is the most common criterion.
- Speed of learner and classifier: the computational costs involved in building the classifier
and in classifying a new document, respectively.
- Robustness: the ability of the model to make correct predictions given noisy data or data
with missing values.
- Scalability: the ability to construct the model efficiently given large amounts of data.
- Interpretability: the level of understanding and insight that is provided by the model.

3.6.2 Training versus Test sets


The machine learning approach depends on the availability of an initial corpus of documents
pre-classified under categories. Therefore, prior to classifier construction, the initial corpus
is split into a training set and a test set, not necessarily of equal size. Most researchers use
20% [53] or 30% [54] of the data for the test set and the remainder for the training set. The
training set is used to inductively build the classifier by observing the characteristics of the
documents. In many research settings, once a classifier has been built it is desirable to evaluate
its effectiveness; the test set is used for this purpose. Each document from the test set is fed to
the classifier, and the classifier's decisions are compared with the expert decisions. The
documents in the test set must not participate in any way in the inductive construction of the
classifier; otherwise, the experimental results obtained would likely be unrealistically good,
and the evaluation would not be considered scientific [21].
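An 80/20 split of a labeled corpus can be sketched as follows. The corpus here is a synthetic placeholder of (document, category) pairs, not the thesis data set:

```python
import random

# Sketch of an 80/20 train/test split over a pre-classified corpus.
docs = [(f"doc{i}", i % 3) for i in range(100)]  # 100 labeled placeholder documents

random.seed(42)
shuffled = docs[:]
random.shuffle(shuffled)   # shuffle so the split is not biased by corpus order

split = int(0.8 * len(shuffled))
train_set, test_set = shuffled[:split], shuffled[split:]

print(len(train_set), len(test_set))
```

The key property is that the two sets are disjoint and together cover the corpus, so no test document can leak into classifier construction.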

3.6.3 Performance Measures


A measure of classification effectiveness is based on how often the classifier's decisions
match the expert decisions. Classification effectiveness is usually measured in terms
of the classical IR notions of precision and recall, adapted to the case of TC [21]. Recall (R) is
the percentage of the documents of a given category that are classified correctly. Precision (P)
is the percentage of the documents predicted for a given category that are classified correctly.
These can be formalized as
R = NCP/NC

and P = NCP/NP respectively, where NC is the number of testing documents for a given
category c, NP is the number of documents that are predicted as category c by the classifier,
and NCP is the number of documents that are classified correctly [55]. Classification accuracy
is another measure of performance, represented by c/n, where n is the total number of test
instances and c is the number of test instances correctly classified by the system [21].
Accuracy (error rate) is the rate of correct (incorrect) predictions made by the model over a
data set. The results can be summarized in a confusion matrix, a matrix showing the predicted
and actual classifications. A confusion matrix is of size L x L, where L is the number of
different label values [56].
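Building such an L x L confusion matrix can be sketched as follows; the category names and predictions below are made up for illustration:

```python
# Confusion matrix sketch: rows = actual class, columns = predicted class.
labels = ["economy", "politics", "sport"]          # L = 3 label values
actual    = ["economy", "politics", "sport", "sport", "economy", "politics"]
predicted = ["economy", "sport",    "sport", "sport", "politics", "politics"]

idx = {lab: i for i, lab in enumerate(labels)}
L = len(labels)
matrix = [[0] * L for _ in range(L)]
for a, p in zip(actual, predicted):
    matrix[idx[a]][idx[p]] += 1       # count one (actual, predicted) pair

for lab, row in zip(labels, matrix):
    print(lab, row)
```

Correct decisions accumulate on the diagonal; every off-diagonal cell pinpoints which category was confused with which.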

Precision and recall, the evaluation parameters of information retrieval, are also used in text
classification. Precision measures the exactness of a classifier: it is the ratio of the number of
documents classified correctly to the total number of documents assigned to a given category. A
high precision means fewer false positives, while a lower precision means more false positives.
Classifier performance can be expressed by four evaluation metrics, accuracy, precision, recall
and F1-measure, as shown below:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F-measure = (2 × Recall × Precision) / (Recall + Precision)

Accuracy = (TP + TN) / (TP + TN + FP + FN)
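The four formulas above can be computed directly from confusion counts; the counts below are illustrative values, not results from this study:

```python
# Evaluation metrics for one class, computed from confusion counts.
TP, FP, FN, TN = 40, 10, 5, 45   # illustrative true/false positive/negative counts

precision = TP / (TP + FP)                            # exactness
recall = TP / (TP + FN)                               # completeness
f_measure = 2 * recall * precision / (recall + precision)  # harmonic mean
accuracy = (TP + TN) / (TP + TN + FP + FN)            # overall correctness

print(round(precision, 3), round(recall, 3),
      round(f_measure, 3), round(accuracy, 3))
```

Note that F-measure sits between precision and recall, penalizing a classifier that is strong on one but weak on the other.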
3.7 Review of Related Research Works on Amharic Text Classification
Much research has been done on automatic Amharic text classification using statistical
methods for the classification of Amharic news.
According to [57], classification error by experts is the major factor that reduces the
performance of the classifier. That study recommends giving due attention to the development
of standard Amharic preprocessing tools such as a spell checker, thesaurus and stop word list,
so that researchers on Amharic text classification can focus on one aspect at a time.

According to [58], the classification accuracy of Naïve Bayes and KNN decreases if the
categories in the training data contain fewer documents. That work added that if the categories
in the training data are not evenly distributed, classification is affected negatively. This
indicated that KNN performs poorly on large datasets and that determining the value of K is
difficult. The development of Amharic preprocessing tools is likewise raised as an open issue.
[59] found that decision tree and SVM classifiers showed better accuracy for categories with a
large number of documents in the training set than for those with fewer documents. It also
noted that SVM and decision tree classifiers achieve good accuracy at the expense of
computational cost. That work therefore recommends other classifiers with lower processing
cost and better accuracy, as well as the development of a standard Amharic preprocessing tool
to aid the text classification task.

Name                     | Categories considered | Method used                     | Accuracy
Zelalem Sintayehu (2001) | 3                     | Cosine Similarity               | 85.05%
Surafel Teklu (2003)     | 3                     | KNN                             | 89.61%
                         |                       | Naïve Bayes                     | 95.73%
Yohannes Lacework (2007) | 15                    | LMT                             | 79.72%
                         |                       | LibSVM                          | 81.15%
Worku (2009)             | 9                     | ANN (Artificial Neural Network) | 70.8%
Zeleke A. (2010)         | 12                    | LibSVM (phrase-based approach)  | 72.01%
Alemu K. (2010)          | 8                     | LibSVM (hierarchical approach)  | 80.34%
Animut B. (2012)         | 10                    | Naïve Bayes                     | 69.71%

Table 2.1: Related research works and results on automatic Amharic text classification

Summary

The literature review discussed automatic text classification and machine learning approaches,
including supervised, unsupervised, semi-supervised and reinforcement learning, deep learning,
and types of neural networks. Since this study employs a deep learning approach for the
Amharic news text classification task, common and well-known algorithms such as long
short-term memory (LSTM), feed-forward neural networks and simple RNN methods were
reviewed for classifying Amharic news text and producing a good news classification model.
Related works on Amharic news text classification were also discussed along with their
reported accuracy.
Chapter Three

Amharic Language and Its Writing System

Introduction
Document classification is one of the techniques that many organizations use for grouping
collections of information into classes based on the similarity or 'likeness' of the documents.
Amharic is one of the languages through which information is produced and disseminated in
Ethiopia. The Ethiopian News Agency (ENA) is an organization in which text classification is
applied to Amharic news items. Though the process is manual, ENA uses a news classification
scheme to store news into different classes such as 'economy', 'politics', 'sport', etc. and later
retrieve it.

3.1 Amharic Language

Amharic is a Semitic language and the national language of Ethiopia (ኢትዮጵያ). The majority of
the 25 million or so speakers of Amharic can be found in Ethiopia, but there are also speakers in
a number of other countries, particularly Eritrea (ኤርትራ), Canada, the USA and Sweden. The
name Amharic (ኣማርኛ - amarəñña) comes from the district of Amhara (አማራ) in northern
Ethiopia, which is thought to be the historic centre of the language. Amharic is the working
language of the Federal Government of Ethiopia and is spoken and written as a first or second
language in many parts of the country [59].

The Ethiopic script was first displayed on a computer around 1986. At the time, the challenge
in the computer representation of the script was developing a software package that could
handle character design, keyboard layout and printer set-up.
The work by ESTC started an enthusiastic rush by different IT companies and teams of
individuals to develop Ethiopic software, which led to a lack of standardization. Nowadays
there are more than 35 Ethiopic software products available, each with its own character set,
encoding system, typeface names and keyboard layout. Ethiopia is a multi-lingual country
where more than 80 languages are used in day-to-day communication [60]. Although many
languages are spoken in Ethiopia, according to the Federal Democratic Republic Population
Census Commission, Amharic is dominant in that it is spoken as a mother tongue by more than
21 million people (29.3% of the total population) (Census Summary of Ethiopia, 2007).
Amharic is written with a version of the Ge'ez script known as ፊደል (Fidel). There is no
standard way to transliterate Amharic into the Latin alphabet. Amharic is named after the
district of Amhara, which is thought to be the historic centre of the language [61].
3.2 Amharic Writing System

The Amharic writing system is taken from the Ge'ez alphabet, and Amharic is written using the
Ge'ez script horizontally from left to right. This writing direction is one of the major differences
between Amharic and other Semitic languages like Arabic and Hebrew [62]. Ge'ez, which
remained the ecclesiastical and literary language of Ethiopia until the 16th century, gradually
gave way to Amharic, which was used both in speech and writing in the royal courts. Amharic
began to be used for literary purposes at the beginning of the 19th century, as the administrative
state changed its way of communication from an oral to a written one [57].

3.2.1 Amharic Characters

In the Amharic language, each character has seven different forms, called orders, which reflect
the seven vowel sounds (ä, u, i, a, e, ə, o). The seven orders represent syllable combinations
consisting of a consonant followed by a vowel. The first order is the basic form, comprising 34
base characters; the other six orders are non-base characters derived from the base forms by
changing the vowel. In other words, the six non-base forms or orders show the different forms
of the basic characters (first orders). The 34 basic characters and their respective six derived
orders give a total of 34 * 7 = 238 distinct characters (fidels). Sample orders for Amharic
characters are listed below.
First   Second  Third   Fourth  Fifth   Sixth   Seventh
Order   Order   Order   Order   Order   Order   Order

ሀ       ሁ       ሂ       ሃ       ሄ       ህ       ሆ
Ha      Hu      Hi      Ha      He      H       Ho

ለ       ሉ       ሊ       ላ       ሌ       ል       ሎ
Le      Lu      Li      La      Le      L       Lo

መ       ሙ       ሚ       ማ       ሜ       ም       ሞ
Me      Mu      Mi      Ma      Me      M       Mo

Table 3.1: Amharic characters (example)

 Characteristics of Amharic Characters


The Amharic writing system is often called a syllabary rather than an alphabet, because the
seven orders of Amharic characters indicated above represent syllable combinations consisting
of a consonant and a following vowel. The non-basic forms (vocalizations) are derived from
the basic forms (consonants) by attaching small appendages (diacritic marks) to the right, left,
top, or bottom in a more or less regular modification. Some are formed by adding strokes,
others by adding loops or other forms of differentiation to each core character. The Amharic
writing system is also vulnerable to various problems that make it difficult to automate
information retrieval for the language. These writing problems have a negative effect on the
performance of different machine learning approaches in text classification. Some of the
problems are discussed in the following sections.
Formation of Compound Nouns

Compound nouns are sometimes written as two separate words [63]. For example, ብርድ-ልብስ,
which means "blanket", may be written as ብርድ ልብስ or ብርድልብስ; and ክፍለከተማ, which means
"sub city", may be written as ክፍለ-ከተማ. This inconsistency occurs throughout Amharic texts
and should be considered in automatic classification [57].
Punctuation mark

In the Amharic language, words were traditionally separated by two dots (፡ ሁለት ነጥብ);
however, blank spaces are now generally used. The end of a sentence is marked by a
square-formed four dots (። አራት ነጥብ), and the symbols (፣ ነጠላ ሰረዝ) and (፤ ድርብ ሰረዝ) represent
a comma and semicolon respectively. Moreover, the language borrows some punctuation marks
from foreign languages, such as !, ", /, \, etc. According to [64], there are about seventeen
punctuation marks used in the Amharic language; however, existing Amharic software does
not make use of some of them.

Numerals
According to Bender et al., Amharic number characters are derived from Greek letters, and
some were modified to look like the Amharic Fidel. Each of the symbols has a horizontal stroke
above and below. Numbering starts from one; there are single characters for the numbers one to
ten and separate characters for the multiples of ten (twenty to ninety), hundred, and thousand.
There is no symbol for zero in the Amharic script. Ethiopic numbers are used mostly for
writing dates and page numbers in text [63].

3.2.2 Amharic Number System


The Amharic number system consists of twenty single characters, which are derived from
Greek letters; some were modified to look like the Amharic 'Fidel' [63]. These twenty
characters represent the numbers one to ten, the multiples of ten, hundred and thousand. The
number system has no representation for zero (0) and is not suitable for arithmetic computation.
Hence, the Amharic number system is mostly used to write dates, especially in the calendar,
while the Hindu-Arabic numeral system is used for arithmetic purposes [63].
3.3. Problems of Amharic Writing System
Due to the various problems of the Amharic writing system, it is difficult to automate
information retrieval for the Amharic language. These writing problems also have a negative
effect on the performance of different machine learning approaches in text classification. Some
of the problems are discussed in the following sections.

Redundancy of Some Characters

In the Amharic writing system there is unnecessary redundancy of some characters (fidels) with
the same pronunciation (sound) and meaning. Although these characters have the same
pronunciation and meaning, they are represented with different symbols. The various forms of
a character each have their own meaning in Ge'ez; however, there is no clear rule that shows
their purpose and use in the Amharic writing system [63].
These characters (fidels) are ሀ, ሐ and ኃ; ሰ and ሠ; አ and ዐ; and ጸ and ፀ. Because of these
different forms of a single character, a single word such as "religion" can be written in different
forms as "ሀይማኖት", "ሃይማኖት", "ሐይማኖት", and "ኃይማኖት", although they all have the same
meaning. This results in an increase in the number of words representing a document and an
increase in the vector dimension, which is a challenge in Amharic document retrieval, document
classification and document clustering.
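One common remedy, sketched below, is to map the redundant fidels to a single canonical form before indexing. The exact mapping table here is an assumption for illustration; this thesis's own normalization rules may differ:

```python
# Character (fidel) normalization sketch: variants with the same sound
# are mapped to one canonical character. The mapping below is an
# illustrative assumption, not the thesis's exact rule set.
NORMALIZE = str.maketrans({
    "ሐ": "ሀ", "ኃ": "ሀ", "ሃ": "ሀ",   # ha-family variants -> ሀ
    "ሠ": "ሰ",                        # se variant -> ሰ
    "ዐ": "አ",                        # a variant -> አ
    "ፀ": "ጸ",                        # tse variant -> ጸ
})

def normalize(text):
    return text.translate(NORMALIZE)

# The four spellings of "religion" from the text collapse to one form,
# so they are counted as a single index term.
variants = ["ሀይማኖት", "ሃይማኖት", "ሐይማኖት", "ኃይማኖት"]
print(len({normalize(v) for v in variants}))
```

Applying such a mapping before feature extraction shrinks the vocabulary and removes the spurious vector dimensions described above.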

3.3.1 Inconsistency of Compound Words


In the Amharic writing system, two different words can be joined by a hyphen, a forward slash
or a space to form compound words, so there is inconsistency in the writing of compound
words. Compound words are sometimes written as two separate words and considered two
independent words; at other times they are treated as a single word, by fusing the two words,
by inserting a hyphen between them, or by using their short form.

For instance, a word can be written as "ጽህፈት ቤት", "ጽህፈት-ቤት" or "ጽ/ቤት", all having the same
meaning. This shows that a single compound word can be written in several ways, which
results in an increase in the dimension of the vector space. In addition, considering compound
words as two separate words can result in loss of the original meaning. For example, if the
compound word "ጽህፈት ቤት" is treated as the two separate words "ጽህፈት" and "ቤት", the result
is two independent words with different meanings, and the original meaning of the compound
word is lost.

3.3.3 Inconsistency of Abbreviations


In the Amharic language, it is common to use a forward slash ("/") or period (".") to write some
Amharic words in abbreviated form. For example, the short form of ዓመተ ምህረት can be written
as ዓ/ም, ዓ.ም or ዓም, which results in inconsistent abbreviation of Amharic words. These
different representations of the same word create a high-dimensional vector space and have a
negative effect on the performance of learning algorithms.

3.3.4 Transliterations Problem


In the Amharic language there are some words that are borrowed from foreign languages. In the
Amharic writing system, such words are written in different ways, with spelling variation. For
instance, the word "Computer" can be written as "ኮምፒዩተር" or "ኮምፒውተር". All the
above-mentioned problems of the language have a negative effect when applying machine
learning approaches to document text classification. Hence, to solve such problems and to
increase the performance of learning algorithms in document classification, a number of
document preprocessing activities were performed before index terms were selected.

Summary

For the sake of Amharic news classification, the Amharic language was discussed. The
Amharic language has its own script, including the Amharic writing system, Fidel, punctuation
marks and numeral symbols. Amharic characters also have a redundant nature, so a
harmonization mechanism is needed during the preprocessing of news documents.
Chapter Four

Methodology

4.1. Introduction
To classify text documents using supervised machine learning techniques, the documents should
first be preprocessed. In the preprocessing step, the documents are transformed into a
representation suitable for applying the learning algorithms. In this chapter, the data source and
the data filtering mechanisms carried out are explained in detail. The preprocessing algorithms
developed to process Amharic news texts, such as character (Fidel) normalization, stemming,
and stop word and number removal, are also discussed.

4.2. Data collection


The document data set used for the experiments consists of Amharic news texts collected from
ENA and used by previous researchers; the classification of these news items was done
manually. The total number of categories collected and considered in this study is nine, with a
total of 3,213 Amharic news items or documents.

4.3. Data Source


The data source for this research is the electronic Amharic news items collected from FANA
and ENA. The news items from April 2006 to October 2013 from ENA and from December
2019 to April 2020 from FANA are used as the data source for the experiments of this study.
However, not all of these news items are useful for the classification experiment because of
errors made during data entry and manual classification. Therefore, data filtering was applied
in order to retain only relevant news documents.

4.4. Research design


Research can follow quantitative or qualitative design methods. Quantitative research design
expects a numerical finding, such as a formula or numerical value, whereas qualitative research
design seeks theoretical concepts that are measured in terms of quality rather than quantity.
Since this study seeks a model that can be measured in terms of a numeric value using an
accuracy measure, an experimental research design under the quantitative design methods was
used.

4.5. Architecture of Amharic News Text classification


Amharic news items are used as input to the system. Then preprocessing tasks such as
normalization (changing varying Amharic characters with a similar sound to one common
form, and changing punctuation marks to spaces), tokenization, stop word and number removal,
stemming, term weighting and dimension reduction are carried out.
Amharic news classification using supervised has its own architecture. These are document
collection, document preprocessing and representation, classification and evaluating the results.
Some of the documents are collected from ENA manually and then preprocessing of documents
is held. In the document preprocessing stage transliteration, tokenization, normalization, stop
words and numbers removal and stemming were done.
Once all these document preprocessing and representation activities were done, the datasets
were prepared in an appropriate format and given to the learning algorithms. The learning
algorithms process this dataset and group them into the appropriate classify to its category and
finally the performances of those classification algorithms were evaluated using different
classification evaluation metrics. The details of each phase are discussed in the following
sections.
[Figure: training news documents are assembled, preprocessed and converted to a text
representation, then fed to the supervised deep learning algorithm (a recurrent neural network);
the resulting classifier/model assigns test news documents to produce classified news
documents.]

Figure 4. 1: Architecture of Automatic Amharic News Text Classification

4.6. Document Preprocessing


Document preprocessing is important to improve the accuracy, efficiency, and scalability of the
classification process. In order to get better experimental results, language-dependent document
preprocessing should be performed before automatic classification is implemented. Text or
document preprocessing is the step by which the text is made suitable for the learning
algorithm. The preprocessing includes the removal of non-informative words or characters from
the text. It is the first step in the preparation of documents to present them in a format suitable
for classification. The processes of tokenization, normalization and stemming are language-
dependent, and in this study the characteristics of the Amharic language were considered in the
development of the algorithms. The document preprocessing task was implemented using the
Python programming language (Python 3.8.4).

4.6.1. Tokenization
In the process of tokenization, documents are broken into individual tokens. Typically, words
and sequences of words are extracted from documents to enhance the performance of the
learning algorithms. In the tokenization process, features that are irrelevant and noisy for
classification, such as punctuation marks and any non-Amharic characters, were removed from
the documents in the collection using Python. These features are not relevant to represent the
content of documents and they have no contribution in discriminating one document or category
from another [65].
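The tokenization step described above can be sketched as follows. This is an illustrative regular-expression approach, not the exact code of the study: it keeps only runs of Ethiopic syllable characters and treats everything else (punctuation, digits, non-Amharic characters) as separators.

```python
import re

# Ethiopic syllables occupy U+1200-U+135A; the Ethiopic punctuation
# and digit signs that follow them in the Unicode block (U+1360
# onward) are deliberately excluded, so they act as separators.
ETHIOPIC_WORD = re.compile(r"[\u1200-\u135A]+")

def tokenize(text):
    """Break a document into Amharic word tokens, dropping punctuation
    marks and any non-Amharic characters."""
    return ETHIOPIC_WORD.findall(text)

print(tokenize("ኢትዮጵያ፣ አዲስ አበባ 2020!"))
```

Note that Ethiopic punctuation such as ፣ falls outside the chosen character range, so it is discarded along with Latin letters and Arabic numerals.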

4.6.2. Normalization
In the Amharic writing system, characters with the same sound have different symbols. These
different symbols must be considered as similar because they have no effect on meaning. As
a result, in this study, all different symbols of the same sound were converted to one common
form. In order to exploit this equivalence, the algorithm of [66] was used. Thus, for example, if
the character was one of ሐ፣ ሓ፣ ሃ፣ ኀ፣ ኃ or ኸ (all of them with a similar sound, h), then it was
converted to ሀ. By the same token, all orders of ሠ (with the sound s) were changed to their
equivalent respective orders of ሰ, all orders of ዐ (with the sound a) were changed to their
equivalent respective orders of አ, and all orders of ፀ (with the sound tse) were changed to their
equivalent respective orders of ጸ.
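The substitution can be implemented with a simple translation table. The sketch below lists only the first-order variants named above; a complete implementation would map every vowel order of each variant character in the same way.

```python
# Translation table for homophone Amharic characters.  Only the
# first-order variants named in the text are shown; a full table
# would cover all orders of ሐ, ኀ, ሠ, ዐ and ፀ as well.
NORMALIZATION_TABLE = str.maketrans({
    "ሐ": "ሀ", "ሓ": "ሀ", "ሃ": "ሀ", "ኀ": "ሀ", "ኃ": "ሀ", "ኸ": "ሀ",
    "ሠ": "ሰ",
    "ዐ": "አ",
    "ፀ": "ጸ",
})

def normalize(text):
    """Convert variant symbols with the same sound to one common form."""
    return text.translate(NORMALIZATION_TABLE)

print(normalize("ሠላም"))  # the ሠ variant is rewritten with ሰ
```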

4.6.3. Stop Words and Numbers Removal


Not all terms or words in a document are equally relevant to represent the contents of the
document. Some extremely common words that appear in almost all documents of the
collection are not relevant to represent the content of documents or to discriminate one
document from others. There are some common Amharic words that appear in almost all the
documents in the collection. These words, which are encountered very frequently and carry no
useful information about the content and thus the category of documents, are called stop words.
Stop words should be excluded from the vocabulary, which leads to a drastic reduction in the
dimensionality of the feature space.

To remove stop words, a stop word list can be used, or the stop words can be determined from
their frequency, which is said to be more efficient and language-independent [67]. Hence, in
this work, stop word removal was performed using two approaches: a stop word list and term
frequency thresholding. Stop words are language-specific and often domain-specific. However,
for the Amharic language there is no standard stop word list. In this study, two kinds of stop
word lists were prepared by considering the stop word lists used by previous researchers such
as [68].

The first stop word list consists of some Amharic words such as “ነበር”, “ነው”, “ሆነ”, etc., which
appear in almost all documents and are used to provide structure in the language rather than
content. The second stop word list contains news-specific Amharic words such as “ዘገበ”, “ገለፁ”,
etc. Such words have no discriminating power among the Amharic documents in the collection
and were removed from the document collection.

I. News-Specific Stop Words


Reporters and journalists most of the time report an incident to the public. As a result, they use
vocabulary peculiar to this purpose. Examples of such words, which they use very frequently,
are those for “notifying” (አስታወቀ፣ አመለከተ፣ ጠቆመ). Basically, they use these words when
reporting about an official or organizational press release. In fact, these words are pure verbs
and are usually found at the end of a sentence.
II. Common Stop Words
Like other languages, some words in Amharic are used very frequently in the normal usage of
the language. There are common stop words which are used for grammatical purposes, such as
ነው፣ ሆኖም፣ ነበር፣ እና፣ ነገር ግን፣ etc., which are not informative for distinguishing one document
from another. Because of the unavailability of comprehensive standard stop word lists from
previous researchers, the researcher of this study was required to develop stop word lists.
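The two removal approaches, a fixed stop word list and document-frequency thresholding, can be combined as in the following sketch. The word lists here are tiny illustrative samples; the lists compiled for this study are much longer.

```python
# Sample entries only: the study's common and news-specific lists are larger.
COMMON_STOP_WORDS = {"ነው", "ነበር", "ሆነ", "እና", "ነገር"}
NEWS_STOP_WORDS = {"ዘገበ", "ገለፁ", "አስታወቀ", "አመለከተ", "ጠቆመ"}
STOP_WORDS = COMMON_STOP_WORDS | NEWS_STOP_WORDS

def remove_stop_words(tokens, doc_freq=None, n_docs=1, max_df=0.9):
    """Drop list-based stop words; if per-term document frequencies are
    supplied, also drop terms appearing in more than max_df of all
    documents (term frequency thresholding)."""
    kept = [t for t in tokens if t not in STOP_WORDS]
    if doc_freq is not None:
        kept = [t for t in kept if doc_freq.get(t, 0) / n_docs <= max_df]
    return kept

print(remove_stop_words(["ኢትዮጵያ", "ነው", "ዘገበ"]))
```

The `max_df` cutoff of 0.9 is an arbitrary value chosen for illustration; the thresholds actually used in the study are not reproduced here.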

4.6.4. Stemming
Stemming is the process of reducing words to their base form, or stem, by removing suffixes or
prefixes. For example, the Amharic words የግብርና, ግብርናዎች and የግብርናዎች are all reduced to the
stem or root word ግብርና. Notice that one effect of stemming is to reduce the number of distinct
words in a text corpus and to increase the frequency of occurrence of some individual words.
As shown in the above example, if the terms ግብርና, የግብርና, ግብርናዎች and የግብርናዎች each occur
in the document with a frequency of one, the frequency of ግብርና increases from one to four by
stemming the other three terms into their base form. An exception list of words to which the
affix (suffix and prefix) removal algorithms cannot be applied was also prepared.

This is because removing a suffix or prefix from some Amharic words results in the loss of the
original meaning, or for that matter of any meaning. This may mislead the learning or clustering
algorithms and produce a different solution. For example, if we remove the suffix ‘ን’ (‘n’)
from the word ‘ኮንስትራክሽን’ (‘construction’), the result is ኮንስትራክሽ (‘konstrakxi’), which has
no meaning in the Amharic language. Similarly, if we remove the prefix ‘ከ’ (‘ke’) from the
word ‘ከተማ’ (‘town’), we get a combination of characters ‘ተማ’ (tema) which is not in the
Amharic lexicon.
Removal of prefixes and suffixes: prefixes are characters attached at the beginning of a word,
depending on the context of a sentence, and suffixes are characters attached at the end of a
word. Common Amharic prefixes such as “የ”, “ከ”, “በ”, “ለ” and “እንደ” and common Amharic
suffixes such as “ም”, “ን”, “ዎች” and “ዊነት” were removed from the whole document collection,
except from the words in the exception list.
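A minimal affix stripper following these rules might look as follows. The prefix, suffix and exception lists are the small samples named above, not the full inventories used in the study.

```python
# Longer affixes are listed first so they are matched before shorter ones.
PREFIXES = ["እንደ", "የ", "ከ", "በ", "ለ"]
SUFFIXES = ["ዊነት", "ዎች", "ም", "ን"]
# Words whose apparent affixes are part of the stem itself.
EXCEPTIONS = {"ኮንስትራክሽን", "ከተማ"}

def stem(word):
    """Strip one common prefix and one common suffix, unless the word
    is in the exception list."""
    if word in EXCEPTIONS:
        return word
    for p in PREFIXES:
        if word.startswith(p) and len(word) > len(p):
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) > len(s):
            word = word[: -len(s)]
            break
    return word

# የግብርናዎች, ግብርናዎች and የግብርና all reduce to the stem ግብርና,
# while the exception word ከተማ is left untouched.
```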

4.6.5. Data Conversion


We supplied the Amharic corpus to the Python code using the UTF-8 encoding; the dataset was
processed without converting it to any other encoding. The documents in the corpus were saved
as plain text.

4.7. Techniques
Supervised learning requires human-annotated texts. Accordingly, each news item was labeled
with its category through its directory: for example, sport news was stored in the folder sport,
so the folder name acts as the class label. For classifier construction, this study used a recurrent
neural network, which is one of the state-of-the-art deep learning techniques. Feed-forward
neural networks, such as densely connected networks and convnets, have no memory: they
process inputs independently, without storing any state between inputs.

A recurrent neural network is an artificial intelligence component that imitates the behavior of
biological intelligence: it processes information incrementally while maintaining an internal
model of what it is processing, built from past information and constantly updated as new
information comes in. A recurrent neural network (RNN) processes sequences by iterating
through the sequence elements and maintaining a state containing information relative to what
it has seen so far [72].

The LSTM is inspired by the logic gates of a computer. To control a memory cell, we need a
number of gates. One gate is needed to read out the entries from the cell (as opposed to reading
any other cell); we refer to this as the output gate. A second gate is needed to decide when to
read data into the cell; we refer to this as the input gate. Last, we need a mechanism to reset the
contents of the cell, governed by a forget gate [74].
Fig 3. 1 LSTM structure [75]
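As an illustration only (the experiments themselves used the Keras LSTM layer, not this code), a single LSTM cell step with the three gates described above can be written in a few lines of NumPy; the weight shapes below are arbitrary choices for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.  W, U and b hold the parameters of the input
    (i), forget (f) and output (o) gates and the candidate update (g),
    stacked along the first axis."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b       # all four pre-activations at once
    i = sigmoid(z[0 * n:1 * n])      # input gate: when to read data in
    f = sigmoid(z[1 * n:2 * n])      # forget gate: when to reset the cell
    o = sigmoid(z[2 * n:3 * n])      # output gate: when to read the cell out
    g = np.tanh(z[3 * n:4 * n])      # candidate cell contents
    c = f * c_prev + i * g           # gated memory-cell update
    h = o * np.tanh(c)               # hidden state exposed to the next step
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4                   # toy sizes for the example
W = rng.normal(size=(4 * n_hid, n_in))
U = rng.normal(size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, U, b)
```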
A recurrent neural network is a type of deep learning algorithm which follows a sequential
approach, whereas in feed-forward neural networks we assume that each input and output is
independent of all other layers. The models in this study were built using the Keras framework
on top of the TensorFlow data structures [72,75]. Keras is an easy-to-learn, high-level Python
library that runs on top of the TensorFlow framework [75].

Consider the following steps to train a recurrent neural network [75]:

Step 1: Input a specific example from dataset.

Step 2: Network will take an example and compute some calculations using randomly initialized
variables.

Step 3: A predicted result, here an Amharic news category, is then computed.

Step 4: Comparing the actual result generated with the expected value produces an error.
Step 5: The error is then propagated back through the same path and the variables are adjusted
accordingly.

Step 6: The steps from 1 to 5 are repeated until we are confident that the variables declared to get
the output are defined properly.

Step 7: A systematic prediction is made by applying these variables to new, unseen
input (news from the test set).
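As an illustration only, steps 1 to 6 can be sketched as a toy training loop on a one-weight model fitted to y = 2x; the actual experiments used Keras models, not this code.

```python
import random

random.seed(0)
# Toy data following y = 2x: the "dataset" of step 1.
data = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0)]

w = random.random()          # step 2: randomly initialized variable
lr = 0.05                    # learning rate (arbitrary for the example)
for epoch in range(200):     # step 6: repeat until the variable settles
    for x, y_true in data:   # step 1: take an example from the dataset
        y_pred = w * x       # step 3: compute a prediction
        error = y_pred - y_true   # step 4: compare with the expected value
        w -= lr * error * x       # step 5: propagate the error, adjust w

print(round(w, 3))           # step 7: w can now predict unseen inputs
```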

Back propagation is a learning technique for computing partial derivatives using the chain rule
of composition, and it is well suited to neural networks. The LSTM is a commonly used
recurrent neural network which controls the decision on what inputs should be taken within the
specified neuron. It controls what should be computed and what output should be generated,
based on loop learning with memory [75].

4.8. Classifier Evaluation measures

The evaluation of a document classifier is typically conducted experimentally rather than
analytically. The reason is that, in order to evaluate a system analytically, we would need a
formal specification of the problem that the system is trying to solve, and the central notion of
text classification, the membership of a document in a category, is subjective and hard to
formalize.

The experimental evaluation of a model usually measures its effectiveness, i.e. its ability to
take the right classification decisions. The class assignments of a multi-class classifier can be
evaluated using a confusion matrix. A confusion matrix is a tool for analyzing how well a
classifier recognizes instances of different classes.

From the confusion matrix, several measures can be derived: recall, precision and F-score. The
recall measure contains information about whether classification errors are dominated by false
negatives, while the precision measure contains information about whether the classification
errors are dominated by false positives. The tradeoff between recall and precision can be
controlled by setting classifier parameters. Both measures should typically be used to describe
the overall performance, as neither is particularly informative by itself. The third measure, the
F-score, is an average of recall and precision [72].

We chose accuracy to measure the performance of the news classification model. Accuracy
(respectively, the error rate) is the rate of correct (incorrect) predictions made by the model
over a data set. The average accuracy results can also be represented in confusion matrix form.

Accuracy = (t-pos + t-neg) / (t-pos + t-neg + f-pos + f-neg)
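Using the same t-pos/t-neg/f-pos/f-neg counts, all four measures can be computed directly from the confusion matrix; a small illustrative helper:

```python
def classification_measures(t_pos, t_neg, f_pos, f_neg):
    """Accuracy, precision, recall and F-score from confusion-matrix counts."""
    accuracy = (t_pos + t_neg) / (t_pos + t_neg + f_pos + f_neg)
    precision = t_pos / (t_pos + f_pos)   # errors dominated by false positives?
    recall = t_pos / (t_pos + f_neg)      # errors dominated by false negatives?
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score

# Hypothetical counts for illustration: 40 true positives, 45 true
# negatives, 5 false positives, 10 false negatives.
acc, p, r, f = classification_measures(40, 45, 5, 10)
print(acc, p, r, f)
```

Here the F-score is the harmonic mean of precision and recall, the usual way the "average" of the two is taken.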


Chapter Five

Experimentation and Performance Discussion

Introduction
This chapter discusses the results obtained from the experiments using the deep learning
approach. The experiments are performed based on the concepts discussed in the previous
chapters. They were done using three document classification techniques, namely the feed-
forward neural network, the simple recurrent neural network, and the long short-term memory
(LSTM) recurrent neural network. The back propagation algorithm is the foundation of both
feed-forward and recurrent neural learning for the classification task, and underlies our model
building, evaluation and testing. The results obtained from these deep learning classification
methods are discussed in detail in the next sections.

5.1 The Results of Experimentation and Testing


For the supervised experiments, all the selected classifiers were compared on the same data and
the same set of categories: a total of nine classes and 4,453 documents were used in the
experimentation process. All documents were labeled with the pre-defined classes provided by
ENA. To test the performance of the feed-forward neural network, the simple recurrent neural
network and the long short-term memory (LSTM) recurrent neural network at an increasing
number of classes, different pre-defined numbers of classes and the corresponding pre-classified
documents were used to conduct the experiments.
The nine categories were divided into three settings, and the experiments were done on three,
six and nine classes using 967, 1,065 and 4,453 documents respectively. The first experiment
was done on three classes, politica (politics), sport and tena (health), which contain a relatively
equal number of news items. The second experiment was performed on six classes: economy,
politics, sport, health, technology and social issues. The third experiment was performed on
nine categories, which also contain a relatively equal number of news items: economy, politics,
sport, health, technology, social issues, education, law, and culture and tourism.
5.1.1 Testing

A classifier evaluation is important because it allows assessing how reliably a given classifier
will label data. In this study, the performance of the selected automatic text classifiers for the
application of Amharic news item classification is tested. Estimating the accuracy of a
hypothesis is relatively straightforward when data is plentiful. To obtain an unbiased estimate
of future accuracy, it is important to test the hypothesis on test examples chosen independently
of the training examples and the hypothesis. When evaluating a learned hypothesis, estimating
the accuracy with which it will classify future instances is critical [69].

To the researcher’s knowledge, there is no standard established text corpus for Amharic text
classification testing. Hence, Amharic news items from ENA were selected by the researcher
for the experiments. After the documents were indexed and modeled, training and testing were
performed. In order to examine the applicability of machine learning algorithms to Amharic
news items, the feed-forward neural network, the simple recurrent neural network and the long
short-term memory (LSTM) recurrent neural network were compared on the same data set of
categories.

5.1.2. Feed-forward neural network Test

As discussed in chapter two, the neural network is one of the general algorithms of machine
learning. The feed-forward neural network is a special type of neural network classifier adopted
into deep learning for model building. It uses the simple back propagation algorithm, taking
only the given training inputs. In our context, the input was unclassified news items to be
assigned to their corresponding domains.
The first experiment was conducted on three news classes, namely politica, sport and tena,
which contain a relatively equal number of news items; 967 news items were used. The second
experiment was performed on six classes (economy, politics, sport, health, technology and
social issues) on top of a 1,065-item dataset. All datasets were prepared from the nine-class
news corpus with 4,453 news items. The nine categories were economy, politics, sport, health,
technology, social issues, education, law, and culture and tourism.
For instance, consider the three-class classification using the feed-forward neural network in
Table 5.1 below.
Train on 100 samples, validate on 798 samples
Epoch 1/10
100/100 [==============================] - 1s 14ms/step - loss: 0.6541 - acc: 0.5600
Epoch 2/10
100/100 [==============================] - 1s 7ms/step - loss: 0.5032 - acc: 0.6600
Epoch 3/10
100/100 [==============================] - 1s 7ms/step - loss: 0.3939 - acc: 0.7300
Epoch 4/10
100/100 [==============================] - 1s 7ms/step - loss: 0.3151 - acc: 0.7200
Epoch 5/10
100/100 [==============================] - 1s 7ms/step - loss: 0.2468 - acc: 0.7400
Epoch 6/10
100/100 [==============================] - 1s 7ms/step - loss: 0.1829 - acc: 0.7400
Epoch 7/10
100/100 [==============================] - 1s 7ms/step - loss: 0.1287 - acc: 0.7400
Epoch 8/10
100/100 [==============================] - 1s 7ms/step - loss: 0.0774 - acc: 0.7700
Epoch 9/10
100/100 [==============================] - 1s 7ms/step - loss: 0.0261 - acc: 0.7800
Epoch 10/10
100/100 [==============================] - 1s 8ms/step - loss: -0.0214 - acc: 0.7900

Table 5: 1 Sample performance result


In Table 5.1 above, the highest accuracy of the news model was 79%, obtained using the feed-
forward neural network. We can conclude that the feed-forward neural network algorithm
classified news items with 79%, 37% and 32% accuracy for the three-, six- and nine-category
sizes respectively.
The classification accuracy for this test is shown in Table 5.2 below, together with the types of
news classes used during the experimentation: experiments on three, six and nine classes using
the feed-forward neural network (simple back propagation algorithm).

Classes: three (politics, sport and health); six (economy, politics, sport, health, technology and
social issues); nine (economy, politics, sport, health, technology, social issues, education, law,
and culture and tourism).

                              Three classes   Six classes   Nine classes
Feed-forward neural network        79%            37%            32%

Table 5: 2 Feed-forward neural network scores on different category sizes and types

Interpretation and discussions

From the table above, the highest accuracy was 79% and the lowest was 32%, obtained when
the number of classes is three and nine respectively. The six-class setting yields an accuracy of
37%. From this we can conclude that as the number of classes increases, the accuracy
decreases. This study points out that smaller news domains are better for building Amharic
news models.

5.1.3. Simple recurrent neural network (SimpleRNN) Test


SimpleRNN is a neural network implementation for deep learning based classification. Unlike a
feed-forward network, which processes each input independently, a recurrent neural network is
an artificial intelligence component that imitates the behavior of biological intelligence: it
processes information incrementally while maintaining an internal model of what it is
processing, built from past information and constantly updated as new information comes
in [72].

A recurrent neural network (RNN) processes sequences by iterating through the sequence
elements and maintaining a state containing information relative to what it has seen so far. It
loops over the time steps, and at each time step t it considers its current state and the input at
t [72], trained with back propagation through time, which is merely an application of back
propagation to sequence models with a hidden state.
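This loop over time steps can be made concrete in a few lines of NumPy; the following is an illustrative simple-RNN state update (the study itself used the Keras SimpleRNN layer, not this code), with arbitrary toy sizes.

```python
import numpy as np

def simple_rnn(inputs, W, U, b):
    """Iterate over the sequence, keeping a state h that summarizes
    everything seen so far: h_t = tanh(W x_t + U h_{t-1} + b)."""
    h = np.zeros(U.shape[0])
    for x in inputs:                    # one iteration per sequence element
        h = np.tanh(W @ x + U @ h + b)  # current input combined with old state
    return h                            # final state summarizing the sequence

rng = np.random.default_rng(1)
seq = rng.normal(size=(5, 3))           # 5 time steps, 3 features each
W = rng.normal(size=(4, 3))             # input-to-hidden weights
U = rng.normal(size=(4, 4))             # hidden-to-hidden (recurrent) weights
b = np.zeros(4)
h = simple_rnn(seq, W, U, b)
```

The final state `h` would then be fed to a dense softmax layer to predict the news category.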

Fig 5. 1 Recurrent neural network

In Table 5.3 below, the highest accuracy of the news model was 88.75%, obtained using the
SimpleRNN network. We can conclude that the SimpleRNN algorithm classified news better
than the feed-forward neural network. News items were classified correctly 88.75%, 52.5% and
27.5% of the time for the three-, six- and nine-category sizes respectively. The performance and
the class lists are shown in Table 5.3 below.
Classes: three (politics, sport and health); six (economy, politics, sport, health, technology and
social issues); nine (economy, politics, sport, health, technology, social issues, education, law,
and culture and tourism).

                Three classes   Six classes   Nine classes
SimpleRNN          88.75%          52.5%          27.5%

Table 5: 3 Experiments on three, six, and nine classes using SimpleRNN

Interpretation and discussions

SimpleRNN provides good generalization performance that can be applicable to real-world
applications such as machine translation, information retrieval and information extraction [].
The performance achieved using SimpleRNN was 88.75% using supervised techniques. It
performed well with the three-class dataset. Therefore, it is potentially applicable for the
development of news classification systems for Amharic. The three-, six- and nine-class tests
showed diminishing performance as the class size increased [72].

The reason behind the attractive performance of SimpleRNN is the small number of categories
(three). From this, we can conclude that as the number of classes increases, the accuracy
decreases [72]. This study points out that smaller news domains are better for building Amharic
news models.

5.1.4. Long Short-Term Memory (LSTM) recurrent neural network Test
Although we used sequential neural networks, recurrent neural networks are much more
common in practice for solving complex problems. The most widely used recurrent networks
are long short-term memory (LSTM) networks [74]. The LSTM controls the decision on what
news inputs are entered and computed, and the classified news is generated as output.

Classes: three (politics, sport and health); six (economy, politics, sport, health, technology and
social issues); nine (economy, politics, sport, health, technology, social issues, education, law,
and culture and tourism).

           Three classes   Six classes   Nine classes
LSTM           82.5%          38.75%         32.5%

Table 5: 4 LSTM performance results on three, six and nine categories

Interpretation and discussions

The LSTM is very useful for learning temporal or sequential data. We used the common relu
activation function for the hidden (dense) layers, and the output layer used the softmax
activation function. The performance results shown in Table 5.4 were obtained by building a
Keras model to classify the data over 10 epochs. The LSTM classifies the news into their
categories using three, six and nine categories, scoring 82.5%, 38.75% and 32.5% respectively.

Like the feed-forward and SimpleRNN models, as the number of classes increases the accuracy
of the LSTM decreases [72].
The performance of the LSTM was 82.5%, which is less than the 88.75% gained by SimpleRNN
using the three-category setting (politics, sport and health). The reason for the lower
performance of the LSTM is that it is highly computationally intensive and overfits our news
data easily. If we applied overfitting minimization techniques, we would get better
performance [75].
Chapter Six

Conclusions and Recommendations

6.1 Conclusions

The rapid growth of information and communication technology has resulted in the creation
of a large volume of text in electronic form. As the volume of information continues to
increase, there is growing interest in helping people better find, filter, and manage these
resources. Text classification, the assignment of natural language documents to one or more
predefined categories based on their content, is an important component of many information
organization and management tasks. The automated classification of texts into pre-specified
categories, although dating back to the early 1960s, has witnessed booming interest in the last
ten years, due to the increased availability of documents in digital form and the ensuing need
to organize them [21].

The objectives of this study have been to prepare processing tools for Amharic news text
classification and to test the applicability of automatic text classifiers for Amharic text
classification based on document content. The representation and quality of the instance data
are first and foremost. If there is much irrelevant and redundant information, or noisy and
unreliable data, then knowledge discovery during the training phase is more difficult. It is well
known that data preparation and filtering steps take a considerable amount of processing time
in machine learning (feature extraction and selection, etc.); the product of data pre-processing
is the final training set [70]. To this end, much attention was given to the pre-processing of the
source data by developing language-dependent pre-processing tools for the Amharic news text
items. The Amharic news items were therefore prepared in different databases. Only the
headline, the slug and the keywords were considered when building models, assuming that
they contain features which represent the document; this also reduces the processing and
learning time of the algorithms. Procedures to facilitate these steps were prepared. Furthermore,
the following tools were developed: a stop word removal tool for the news texts, removal of
extraneous and numeric characters, a tokenization tool, Amharic word identification, and affix
removal for Amharic words.

These tools are important for feature reduction and data cleaning in the automatic text
classification process. After making the Amharic news text suitable for the classification tool
used in this study (Python), data transformation to an appropriate file format was employed
before the training and classification tasks. From this study on the automatic classification of
Amharic news items, it can be concluded that proper data pre-processing techniques that take
natural language processing into account increase the effectiveness of automatic classification.

The best results (accuracies) were obtained from the simple recurrent neural network and the
long short-term memory (LSTM) recurrent neural network classifiers when the number of
instances was approximately equal in each class: 88.75% and 82.5% respectively for the three-
class category.
Relatively lower accuracies were obtained for the feed-forward neural network: 37% on the
six-class category and 32% on the nine-class category.

Both the simple recurrent neural network and the long short-term memory (LSTM) recurrent
neural network classifiers showed better accuracy than the other classifier for Amharic news
items. This study concluded that the supervised deep learning approach can be applied to
Amharic news text classification tasks.

In this study, three classification algorithms, the simple recurrent neural network, the long
short-term memory (LSTM) recurrent neural network and the feed-forward neural network,
were applied to an Amharic news dataset. Based on the experiments done in this study, the
following concluding remarks were made. As the number of classes and documents increases,
the accuracy produced by the different classifiers decreases, and relatively high computational
resources are required. Moreover, it was learnt that considering categories with an equal number
of news items increases the performance of the classifiers.

Much research has been conducted in different contexts to devise methods that can turn threats
into opportunities for the wise use of information and counteract information overload.
Classification is one of the methods that can be employed to organize information for effective
and efficient use. Manual classification is hardly possible with the incredible increase in the
volume of information; as a result, automatic classification with deep learning representations
is a selected area of research. Thus, an approach different from previous works has been
proposed and implemented in this research.

Preprocessing and testing were the main steps in the accomplishment of this study.
Preprocessing and labeling of the data were carried out before the datasets were fed into the
classifiers. Due to the problems of the Amharic writing system and the unavailability of
Amharic document processing tools, the focus of this research was on developing tools that
facilitate efficient automatic classification of Amharic documents.

6.2 Recommendations
The outputs of this research indicate that deep learning techniques are applicable to the
automatic classification of Amharic news items. Work on the representation and quality of
more data has to be done to ensure automatic processing of Amharic news texts under all
circumstances. The stop word list used in this research was compiled during the data
preparation and is mostly news-specific. The availability of a standard stop word list would
definitely facilitate research in the area of automatic classification; therefore, a standard
Amharic stop word list should be developed. The availability of a standardized text corpus
promotes text classification research; nevertheless, there is no established text corpus for
Amharic news text classification and checking, hence there is a need for a standardized text
corpus.
Recommendations for further research are put forward to improve the performance of document
classification and to explore other algorithms and applications of supervised document
classification, especially for local languages.

The classifiers used in this research have shown good accuracy. Nevertheless, there is a need
to look for other classifiers with less processing cost and better accuracy.
In this research, the researcher tried to correct some of the spelling errors manually, which is
not exhaustive for the purpose of this research. Spelling errors have an effect on attribute
reduction and selection for automatic classification, indicating a need for an Amharic spell
checker. To obtain better performance in Amharic news classification, suitable parameters for
the model were also explored, and it was found that they help to improve the results.

Amharic is a morphologically rich language; as a result there are many variants of words in
Amharic, a construction of a thesaurus for the language helps to bring the variants together and
standardization. This research considers only the classification of main classes, it does
not consider the classification of sub-classes, and therefore, research in this direction would also
improve the classification quality.

This research considers single-label classification, which assigns a document to exactly one
of the pre-defined classes. Assigning a document to more than one class (multi-label
classification) remains an issue to be studied.
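The difference between the two settings can be sketched in miniature: single-label picks the one highest-scoring class, while multi-label makes an independent yes/no decision per class. The class names, scores, and the 0.5 threshold below are illustrative assumptions, not results from this thesis:

```python
# Single-label: pick the one class with the highest score (argmax).
# Multi-label: keep every class whose score clears a threshold.
# The class names and scores are made up for illustration.

def single_label(scores):
    """Assign the document to exactly one class."""
    return max(scores, key=scores.get)

def multi_label(scores, threshold=0.5):
    """Assign the document to every class whose score clears the threshold."""
    return sorted(c for c, s in scores.items() if s >= threshold)

scores = {"sport": 0.81, "politics": 0.64, "economy": 0.12}
print(single_label(scores))  # sport
print(multi_label(scores))   # ['politics', 'sport']
```

In a neural model this corresponds to replacing the softmax output layer with per-class sigmoid outputs and thresholding them, which is the usual route to multi-label classification.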

A number of studies have been conducted on Amharic text document classification. However, to
the knowledge of the researcher, all previous studies used Amharic news items only. Future
researchers could also apply document classification techniques to other real-world problems,
such as the classification and clustering of research papers and e-mail messages. Moreover,
document classification techniques can be extended to other local languages where large
document collections are available. In future work, these modifications will be tried
experimentally to see whether they achieve better performance.
REFERENCES

[1] Lars Asker, Atelach Alemu, Bjo¨rn Gamba¨ck, Samuel Eyassu, and Lemma Nigussie (2009).
Classifying Amharic web news, Info Retrieval.

[2] Sebastiani, F. (2005). Determining the semantic orientation of terms through gloss
classification. In Proceedings of the 14th ACM International Conference on Information and
Knowledge Management (pp. 617-624).

[3] Kabita Thaoroijam (2014). A Study on Document Classification using Machine Learning
Techniques, IJCSI International Journal of Computer Science Issues,

[4] Crimmins (2001), Francis. Classification. http://dev.panopficearch.com/classification.html

[5] LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444.

[6] Jason B. (2017). Deep Learning for Natural Language Processing.

[7] Bender, et al. The Ethiopian Writing System. In Languages in Ethiopia. London:
Oxford University Press, 1976.

[8] Witten, I. H. & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and
Techniques (2nd Ed.), Morgan Kaufmann Publishers, San Mateo, CA.
[9] Han and Kamber, 2004. Data Mining Concepts Models Methods & Algorithms, 5.4 (2001):
1-18.

[10] Nidhi and Vishal (2012). Algorithm for Punjabi Text Classification, International Journal of
Computer Applications (0975 8887) Volume 37 No.11, pp. 30-32.

[11] Russell, Stuart, and Peter Norvig. "Intelligent agents." Artificial intelligence: A modern
approach 74 (1995): 46-47.
[12] Ashis Kumar and Rikta Sen (2014). Supervised Learning Methods for Bangla Web
Document Categorization, international Journal of Artificial Intelligence & Applications
(IJAIA),

[13] Pope, Mark W. (2007). Automatic classification of online news headlines.

[14] Bender, et.al. The Ethiopian Writing System. In Languages in Ethiopia. London:
Oxford University Press, 1976.

[15] Shi, Chenye, et al. "Chinese SNS blog classification using semantic similarity." 2013 Fifth
International Conference on Computational Aspects of Social Networks. IEEE, 2013.

[16] Olivier, D.V., 2000, Mining e-mail Authorship, Proceedings of Sixth ACM SIGKDD,
International Conference on Knowledge Discovery and Data Mining,
Boston, USA.

[17] Zhang, Y., Gong, L. and Wang, Y. (2005). An improved TF-IDF approach for text
classifications. J Zhejiang Univ SCI, 6A(1), pp. 49-55, China.
[18] Diriba Megersa (2002). An Automatic Sentence Parser for Oromo Language. Master's thesis,
School of Information Studies for Africa, Addis Ababa University, Addis Ababa.

[19] Rafael A. Calvo, Jae-Moon L., and Xiaobo L. (2004). Managing content with automatic
document classification. Journal of Digital Information, volume 5, issue 2.

[20] Hammouda, K. M., & Kamel, M. S. (2002, December). Phrase-based document similarity
based on an index graph model. In 2002 IEEE International Conference on Data Mining, 2002.
Proceedings. (pp. 203-210). IEEE.

[21] Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing
Surveys (CSUR), 34(1), pp. 1-47.

[22] Russell, Stuart, and Peter Norvig. "Intelligent agents." Artificial intelligence: A modern
approach 74 (1995): 46-47.
[23] Rasmussen, (1992). "Ecological interface design: Theoretical foundations." IEEE
Transactions on systems, man, and cybernetics 22.4 (1992).

[24] Nigam, K. P. (2001). Using unlabeled data to improve text classification. Carnegie Mellon
University, Pittsburgh, PA, School of Computer Science.

[25] Yan, L., Zhang, Y., Yang, L. T., & Ning, H. (Eds.). (2008). The Internet of Things:
From RFID to the Next-Generation Pervasive Networked Systems. CRC Press.

[26] Friedman, Marc, Alon Y. Levy, and Todd D. Millstein. "Navigational plans for data
integration." AAAI/IAAI 1999 (1999): 67-73.

[27] Chander, A., Mitchell, J. C., & Shin, I. (2001, June). Mobile code security by Java byte code
instrumentation. In Proceedings DARPA Information Survivability Conference and Exposition II.
DISCEX'01 (Vol. 2, pp. 27-40). IEEE.

[28] Chung, Y. M., & Noh, Y. H. (2003). Developing a specialized directory system by
automatically classifying Web documents. Journal of Information Science, 29(2), 117-126.

[29] Park, S. C., Park, M. K., & Kang, M. G. (2003). Super-resolution image reconstruction: a
technical overview. IEEE signal processing magazine, 20(3), 21-36.

[30] Kang, M. G. (2004). Super-resolution image reconstruction: a technical overview. IEEE
Signal Processing Magazine, 20(3), 21-36.

[31] Robinson H., (2003), Feature Selection and representation in text categorization, Available
at:http://citeseerx.ist.psu.edu/viewdoc/summary

[32] Stig-Erland, and Roland Olsson. "Improving decision tree pruning through automatic
programming." Proceedings of the Norwegian Conference on Informatics (NIK-2007)
(November 2007, Holmenkollen Park Hotel Rica, Oslo). 2007.
[33] Mitchell, Vincent-Wayne (1999). "Consumer perceived risk: conceptualisations and
models." European Journal of Marketing.

[34] Wang, M. C., Haertel, G. D., and Walberg, H. J. (1993). "Toward a knowledge base for
school learning." Review of Educational Research 63.3: 249-294.

[35] Witten, Ian H., et al. (2005). "Kea: Practical automated keyphrase extraction." Design
and Usability of Digital Libraries: Case Studies in the Asia Pacific. IGI Global, 129-152.

[36] Davis, Martin. Engines of Logic: Mathematicians and the Origin of the Computer.
New York: Norton, 2000.

[37] Prabhu, A., Varma, G., & Namboodiri, A. (2018). Deep expander networks: Efficient deep
networks from graph theory. In Proceedings of the European Conference on Computer Vision
(ECCV) (pp. 20-35).

[38] Karayiannis, N. B., & Randolph-Gips, M. M. (2003). On the construction and training of
reformulated radial basis function neural networks. IEEE Transactions on Neural
Networks, 14(4), 835-846.

[39] Landwehr, Niels, Mark Hall, and Eibe Frank (2005). "Logistic model trees." Machine
Learning 59.1-2: 161-205.

[40] Orhan, Umut, Mahmut Hekim, and Mahmut Ozer. "EEG signals classification using the K-
means clustering and a multilayer perceptron neural network model." Expert Systems with
Applications 38, no. 10 (2011): 13475-13481.

[41] Breiman, Leo, et al. Classification and Regression Trees. CRC Press, 1984.

[42] Svozil, D., Kvasnicka, V., & Pospichal, J. (1997). Introduction to multi-layer feed-forward
neural networks. Chemometrics and intelligent laboratory systems, 39(1), 43-62.
[43] Freund, R., et al. (1999). "DNA computing based on splicing: The existence of universal
computers." Theory of Computing Systems 32, no. 1: 69-112.

[44] Cohen, Lou (1995). Quality Function Deployment: How to Make QFD Work for You.
Prentice Hall.

[45] Dougherty, James, Ron Kohavi, and Mehran Sahami (1995). "Supervised and unsupervised
discretization of continuous features." Machine Learning Proceedings 1995. Morgan Kaufmann,
194-202.

[46] Frank, Eibe, and Ian H. Witten. "Generating accurate rule sets without global optimization."
(1998).

[47] Yang, Y., & Liu, X. (1999, August). A re-examination of text categorization methods.
In Proceedings of the 22nd annual international ACM SIGIR conference on Research and
development in information retrieval (pp. 42-49).

[48] Liping, Yang, Liu Guifang, and Zhang Yanni. "The Breed of Strains with Resistance for
Lily." Journal of Northeast Forestry University 31.6 (2003): 33-35.

[49] Caruso, Eugene M., Daniel T. Gilbert, and Timothy D. Wilson. "A wrinkle in time:
Asymmetric valuation of past and future events." Psychological Science 19.8 (2008): 796-801.

[50] LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, algorithm

[51] Witten, Ian H., and Eibe Frank. "Data Mining: Practical Machine Learning Tools and
Technologies with Java Implementations." (2000).

[52] Han, Jiawei, Micheline Kamber, and Jian Pei. "Data mining: concepts and
techniques." Data Mining Concepts Models Methods & Algorithms, 5.4 (2001): 1-18.

[53] McCallum, Andrew, and Kamal Nigam. "A comparison of event models for naive Bayes text
classification." AAAI-98 Workshop on Learning for Text Categorization. Vol. 752. No. 1. 1998.
[54] Koller, Daphne, and Mehran Sahami. Hierarchically classifying documents using very few
words. Stanford InfoLab, 1997.

[55] Salton, Gerard. 1989. Automatic Text Processing: The Transformation, Analysis, and
Retrieval of Information by Computer. Reading: Addison-Wesley Publishing
Company.

[56] Kohavi, R. (1996). Scaling up the accuracy of Naive-Bayes classifiers: a decision-tree
hybrid. In Proceedings of the Second International Conference on Knowledge Discovery and Data
Mining, pp. 202-207.

[57] Zelalem S. (2001). Automatic classification of Amharic news items. (Master's Thesis,
Department of Information Science, Addis Ababa University, Ethiopia).

[58] Surafel Teklu (2003). Automatic categorization of Amharic news text: A machine learning
approach. (Master's Thesis, Department of Information Science, Addis Ababa University, Ethiopia).

[59] Yohannes A. (2007). Automatic Amharic text classification. (Master's Thesis, Department
of Information Science, Addis Ababa University, Ethiopia).

[60] Tessema M. (2007). Search Engine for Amharic Web Documents. ISBN: 978-3-639-19632-0.
VDM Verlag, Germany.

[61] Quinlan, J. Ross (1986). "Induction of decision trees." Machine Learning 1.1: 81-106.

[62] Wapedia (2009). Retrieved on April 24, 2009, from http://wapedia.mobi/am/.

[63] Bender, et al. The Ethiopian Writing System. In Languages in Ethiopia. London:
Oxford University Press, 1976.

[64] Beletu R. (1982). A Graphemic Analysis of the Writing System of Amharic. Paper for the
requirement of the degree of Bachelor of Arts in Linguistics, Addis Ababa University.
[65] Arzucan O. (2002). Supervised and Unsupervised Machine Learning Techniques for Text
Document Categorization. Boğaziçi University.

[66] Lakechew Y. (2011). Unsupervised Amharic news classification. (Master's Thesis,
Department of Information Science, Addis Ababa University, Ethiopia).

[67] Yancong Z. and Hyuk Cho (2001). Classification Algorithms on Text Document. Available at
http://www.cs.utexas.edu/users/hyukcho/classificationAlgorithm.html. Visited in December, 2008.

[68] Alemu (2010). Hierarchically classifying Amharic news. (Master's Thesis, Department of
Information Science, Addis Ababa University, Ethiopia).
[69] Mitchell, T. M. (1997). Machine Learning. McGraw Hill.

[70] Kotsiantis, S. B. (2007). Supervised Machine Learning: A Review of Classification
Techniques. Informatica, Vol. 31, No. 3, pp. 249-268. Slovene Society Informatika.

[71] Abera Diriba Gemechu (2009). Automatic Classification of Afaan Oromo News Text: The
Case of Radio Fana. Department of Information Science, Addis Ababa University.

[72] Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python:
Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc.

[73] Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola. Dive into Deep Learning,
Release 0.14.1, Jul 19, 2020.
[74] Tutorial, K. (2017). The Ultimate Beginner's Guide to Deep Learning in Python.
Appendix A
I. Amharic Punctuation Marks (Atelach, 2002)

No.  Punctuation mark          Symbol   Purpose
1    Four dots (double colon)  ።        Marks the end of a sentence
2    Colon                     ፡        Separates words in a sentence; not common
3    White space                        Separates words in a sentence; current practice
4    Question mark             ?        Placed at the end of questions
5    Exclamation mark          !        Used at the end of exclamatory sentences
6    Comma                     ፣        Used like the comma
7    Semicolon                 ፤        Used like the semicolon
8    Three dots                …        For deliberate omission of words, phrases, or sentences
9    Quotation marks           ‹‹ ››    Used at the beginning and end of a quoted word, phrase, etc.
10   Parenthesis               ( )      To enclose elaboration
11   Stroke                    /        Separates date, month, etc.
12   Mocking mark              >        Placed at the end of a mocking sentence

II. Amharic Numbers (Zelalem, 2001)
