
BAHIR DAR UNIVERSITY

BAHIR DAR INSTITUTE OF TECHNOLOGY

SCHOOL OF GRADUATE STUDIES

FACULTY OF COMPUTING

MSC THESIS ON:

ETHIOPIAN ORTHODOX TEWAHIDO CHURCH AQUAQUAM


ZEMA CLASSIFICATION MODEL USING DEEP LEARNING
APPROACH

BY:

BAYE TADESSE DAGNEW

March 16, 2023

Bahir Dar, Ethiopia


ETHIOPIAN ORTHODOX TEWAHIDO CHURCH AQUAQUAM
ZEMA CLASSIFICATION MODEL USING DEEP LEARNING
APPROACH

BAYE TADESSE DAGNEW

A thesis submitted to the School of Research and Graduate Studies of Bahir Dar Institute of Technology, BDU, in partial fulfillment of the requirements for the Degree of Master of Science in Information Technology in the Faculty of Computing.

ADVISOR NAME: TESFA TEGEGNE (PHD)

March 16, 2023

BAHIR DAR, ETHIOPIA


DECLARATION

I, the undersigned, declare that this thesis comprises my own work. In compliance with internationally accepted practices, I have acknowledged and referenced all materials used in this work. I understand that non-adherence to the principles of academic honesty and integrity, or misrepresentation/fabrication of any idea/data/fact/source, will constitute sufficient ground for disciplinary action by the University and can also evoke penal action from the sources which have not been properly cited or acknowledged.

Name of the student: Baye Tadesse Signature___


Date of submission: March 2023
Place: Bahir Dar

This thesis has been submitted for examination with my approval as a university advisor.

Advisor Name: Dr. Tesfa Tegegne


Advisor signature: ______

ACKNOWLEDGEMENTS

First of all, praise is due to Almighty God, whose compassion and mercy allowed me to finalize this work. I would also like to thank the Ever-Virgin St. Mary, Mother of our Lord.
I would like to extend my thanks to my thesis advisor, Dr. Tesfa Tegegne, for guiding me and for his help in commenting on my drafts and forwarding useful suggestions that helped me shape my work.
I am most grateful to my family, sisters, and brothers. They have always loved me and supported my every choice. I am also thankful for the great joys and happiness.

LIST OF ABBREVIATIONS

ANN Artificial Neural Network

BLSTM Bi-Directional Long Short Term Memory

CNN Convolutional Neural Network

CRNN Convolutional Recurrent Neural Network

DIP Digital Image Processing

DSP Digital Signal Processing

EOTC Ethiopian Orthodox Tewahdo Church

FC Fully Connected

FFT Fast Fourier Transform

GLCM Gray Level Co-occurrence Matrix

Hz Hertz

KNN K-Nearest Neighbors

LSTM Long Short Term Memory Network

MFCC Mel Frequency Cepstral Coefficients

MIR Music Information Retrieval

MP3 Moving Picture Experts Group Layer-3 Audio (audio format)

NLP Natural Language Processing

RAM Random Access Memory

ReLU Rectified Linear Unit

RGB Red, Green, and Blue

SAE Stacked Auto-Encoder

STFT Short-Time Fourier Transform

SVM Support Vector Machine

WAV Waveform Audio File Format (audio format/file extension)

ZCR Zero Crossing Rate

Table of Contents

DECLARATION ................................................................................................................. i
ACKNOWLEDGEMENTS ............................................................................................... iii
LIST OF ABBREVIATIONS ............................................................................................ iv
LIST OF TABLES ............................................................................................................. ix
LIST OF FIGURES ............................................................................................................. x
ABSTRACT......................................................................................................................... xi
CHAPTER ONE: INTRODUCTION ................................................................................. 1
1.1. Background .......................................................................................................... 1
1.2. Statement of Problem ........................................................................................... 3
1.3. Objective of the study .......................................................................................... 4
1.3.1. General Objective ......................................................................................... 4
1.3.2. Specific objectives ........................................................................................ 4
1.4. Methods ................................................................................................................ 5
1.4.1. Dataset Collection ......................................................................................... 5
1.4.2. Research Design............................................................................................ 5
1.4.3. Tools and Programming language ................................................................ 5
1.4.4. Development environment ............................................................................ 7
1.5. Scope of the Study................................................................................................ 7
1.6. Significance of the study ...................................................................................... 8
1.7. Evaluation Technique ........................................................................................... 8
1.8. Organizations of the study.................................................................................... 8
CHAPTER TWO: LITERATURE REVIEW ................................................................... 10
2.1. Introduction ........................................................................................................ 10
2.2. Overview of audio signal ................................................................................... 10
2.3. EOTC Zema bet ................................................................................................. 11
2.3.1. Aquaquam Bet ............................................................................................ 11
2.3.2. Aquaquam zema types ................................................................................ 12
2.4. Music information retrieval (MIR) .................................................................... 13
2.5. Music & audio representations ........................................................................... 13
2.6. Digital signal processing .................................................................................... 14
2.7. Audio signal processing ..................................................................................... 15
2.7.1. Signal Terminologies .................................................................................. 15
2.7.2. Audio signal ................................................................................................ 16
2.8. Audio data acquisition ........................................................................................ 17
2.9. Preprocessing audio............................................................................................ 17
2.9.1. Audio noise reduction ................................................................................. 17
2.9.2. Audio segmentation .................................................................................... 18
2.9.3. Waveform representation ............................................................................ 18
2.10. Digital image processing ................................................................................ 19
2.10.1. Spectrogram ................................................................................................ 19
2.11. Feature extraction ........................................................................................... 20
2.11.1. Deep feature ................................................................................................ 20
2.11.2. Content-Based Features .............................................................................. 21
2.12. Classification .................................................................................................. 24
2.12.1. Deep Learning approach ............................................................................. 24
2.13. Evaluation Technique ..................................................................................... 32
2.14. Overfitting ...................................................................................................... 34
2.15. Related work ................................................................................................... 35
2.16. Conclusion ...................................................................................................... 38
CHAPTER THREE: METHODOLOGY ......................................................................... 39
3.1. Introduction ........................................................................................................ 39
3.2. Model Architecture ............................................................................................ 39
3.2.1. Aquaquam Zema Acquisition ..................................................................... 41
3.2.2. Preprocessing .............................................................................................. 42
3.2.3. Spectrogram Generation ............................................................................. 45
3.2.4. Image Resize ............................................................................................... 48
3.3. Feature Extraction .............................................................................................. 49
3.4. Convolution Neural Network (CNN) As Feature Extraction and classification 50
3.5. Training .............................................................................................................. 51
3.6. Testing Phase...................................................................................................... 54
3.7. Conclusion.......................................................................................................... 55

CHAPTER FOUR: EXPERIMENT, RESULT AND DISCUSSION .............................. 56
4.1. Introduction ........................................................................................................ 56
4.2. Dataset Preparation ............................................................................................ 56
4.3. Experiment and Result ....................................................................................... 57
4.3.1. Experiment based on the length of audio data using end-to-end CNN with
Spectrogram feature ................................................................................... 57
4.3.2. Experiment on the end-to-end CNN model with spectrogram and MFCC
feature ......................................................................................................... 69
4.3.3. Experiment on CNN as feature extraction and SVM as classifier .............. 71
4.3.4. Test the given dataset on domain experts ................................................... 73
4.4. Discussion .......................................................................................................... 74
4.5. Summary ............................................................................................................ 75
CHAPTER FIVE: CONCLUSION, CONTRIBUTION AND RECOMMENDATION . 76
5.1. Conclusion.......................................................................................................... 76
5.2. Contribution ....................................................................................................... 78
5.3. Recommendation ................................................................................................ 79
Reference .......................................................................................................................... 80
Appendix ........................................................................................................................... 86

LIST OF TABLES

Table 2. 1: Summary of comparisons between activation functions ................................. 32


Table 2. 2: Summary of related work ............................................................................... 37
Table 3.1: Source of audio dataset and number of audio zema dataset ........................... 41
Table 3. 2: Pseudo code for segmenting audio ................................................................. 45
Table 3. 3: Algorithm for generating Mel-spectrogram image ........................................ 46
Table 3. 4: Image resize algorithm ................................................................................... 49
Table 4.1: Collected audio datasets table styles ............................................................... 57
Table 4.2: Performance evaluation metrics result for 5 second segmentation experiment
........................................................................................................................................... 59
Table 4. 3: Performance result of 10 second segment experiment ................................... 62
Table 4. 4: Performance evaluation metrics result for 20 second segment experiment .... 67
Table 4.5: comparisons between different audio dataset segmentation experiment ........ 68
Table 4. 6: comparisons between CNN model with spectrogram and MFCC feature ..... 70
Table 4.7: Performance evaluation result of SVM classifier with CNN training ............. 71
Table 4.8: Comparison between CNN model Softmax classifier and SVM classifier ...... 73
Table 4. 9: Testing result of domain experts .................................................................... 73

LIST OF FIGURES

Figure 2. 1: Architecture of Artificial Neural Network .................................................... 27


Figure 2. 2: SVM Class Classification ............................................................................ 28
Figure 2.3: The architecture of convolutional neural network ........................................ 30
Figure 3. 1: the proposed model architecture for aquaquam zema classification ........... 40
Figure 3.2: Shows the steps to reduce noise from the original audio file . ..................... 44
Figure 3. 3: Shows spectrogram images for each type of music genre ............................ 48
Figure 4. 1: The 5 second experiment training model summary in 100 epochs ............... 59
Figure 4. 2: Training accuracy curve of 5 second experiment ......................................... 60
Figure 4.3: Training loss curve of 5 second experiment .................................................. 61
Figure 4. 4: the accuracy graph of 5 second segment experiment model ........................ 61
Figure 4. 5: The 10 second segment experiment training model summary ..................... 62
Figure 4.6: The Training accuracy curve of 10 second experiment model ....................... 63
Figure 4.7: The Training loss curve of 10 second experiment model ............................... 64
Figure 4.8: The accuracy graph of 10 second segment experiment ................................. 65
Figure 4.9: The 20 second segment experiment training model summary in 100 epochs 66
Figure 4.10: The Training accuracy curve of 20 second segment experiment ................. 67
Figure 4.11: The Training loss curve of 20 second segment experiment ......................... 67
Figure 4.12: the accuracy graph of 20 second segment experiment ................................. 68
Figure 4.13: Training summary of CNN model with MFCC feature ............................... 70
Figure 4.14: Training summary of CNN with SVM classifier ........................................... 71
Figure 4.15: The Training accuracy curve of SVM classifier experiment ....................... 72
Figure 4.17: Accuracy graph of SVM classifier ............................................................... 73

ABSTRACT

Research has been done to automatically classify music genres based on their sound for music information retrieval (MIR) systems, music data analysis, and music transcription purposes. As the amount of music data increases, indexing and retrieving it becomes more challenging. Previous studies in this area concentrated on song classification, identification, prediction, and genre distinction for modern music services. Aquaquam zema classification is one category of music information retrieval.

One of the traditional forms of education in the Ethiopian Orthodox Tewahdo Church (EOTC) is aquaquam zema, in which the priests perform with a measured sound while dressed in secular clothes. It is closely related to music and rhythm because it is a secular art. What primarily pushes us in this direction is the knowledge gap between modern and traditional education on the zema genres, because the majority of students in this traditional school do not have a complete understanding of the zema genres. To help with this constraint, we create a model that categorizes the sound signal of aquaquam zema into its genre. Aquaquam zema can be categorized into five major zema kinds: Zimame (ዝማሜ), Qum (Tsinatsel) (ቁም), Meregd (መረግድ), Tsifat (ጽፋት), and Amelales (አመላለስ).

For this classification, we obtained the audio data from the Aquaquam bet, recording it using smartphones, and retrieved additional data from websites. After data collection, we preprocess the audio, segment it into predetermined lengths, convert it to visual representations, and produce spectrogram images. We then create a model by extracting features and classifying them using a deep learning approach. We built a full-featured CNN model with a Softmax classifier and achieved 97.5% training accuracy and 91.76% test accuracy with the proposed model.

Keywords: Aquaquam Zema, Deep Learning, spectrogram, Feature Extraction and


Classification

CHAPTER ONE: INTRODUCTION

1.1. Background

Music plays a very important and impactful role in people's lives, and it provides insight into various cultures (Asim & Ahmed, 2017). Therefore, it is essential to identify and classify music according to the corresponding genres to fulfill people's needs categorically. The widespread usage of the Internet has brought about significant changes in the music industry, as well as leading to all kinds of other change (Shah et al., 2022). Audio genre classification of music has become an important task with real-world applications such as classification, prediction, distinction, and speaker identification for voice verification. Music genre classification is considered to be one possible way of managing a large digital music database (Boxler, 2020).

Audio classification is the process of analyzing and identifying any type of audio,
sound, noise, musical notes, or any other similar type of data to classify them
accordingly. The audio data that is available to us can occur in numerous forms, such as
sound from acoustic devices, musical chords from instruments, human speech, or even
naturally occurring sounds like the chirping of birds in the environment. Modern deep
learning techniques allow us to achieve state-of-the-art results for tasks and projects
related to audio signal processing. Genres can be defined as categorical labels created by
humans to identify or characterize the style of music (Jawaherlalnehru et al., 2018). One
way to categorize and organize songs is based on the genre, which is identified by some
characteristics of the music such as rhythmic structure, harmonic content, and
instrumentation (Yaslan & Cataltepe, 2014).

Zema can be defined as a method of tactical shouting or producing a sound that gives a pleasing feeling when it is listened to by our senses. It is tactical and has its own formula for saying the song; depending on this mechanism, it is possible to say that every zema is a sound, but the reverse is not true (Tadese, 2018). Zema genre classification is one specific task of automatic audio genre classification technology, which falls under the discipline of music information retrieval; it enables the machine to recognize and classify melody.

Ethiopia has been endowed with rich secular, spiritual, and cultural heritages which are expressions of our identity. The traditional school of the Ethiopian Orthodox Church (EOTC) is one of these spiritual and cultural heritages, from which the Ethiopian Orthodox Christian's personality, celebrity, and identity are developed (Tsegaye, 1975). There are different schools in which different kinds of educational specializations are offered, namely Nibab Bet, Kidasie Bet, Zema Bet, Kine Bet, Aquaquam Bet, and Metshifit Bet (Kenew, n.d.).

Teklie Aquaquam zema is one of the traditional types of education that the Ethiopian Orthodox Tewahedo Church (EOTC) offers under Aquaquam. The priests wear secular clothes and perform with a measured sound. Because it is a secular art, it has a strong connection with music and rhythm. Teklie Aquaquam has been with the community for a long time, since the 19th century; the legendary Aleka Gebrehana started it in the name of his son Aleka Teklie, and it is still known as Teklie Aquaquam (Addisie, 2019). Priests in the middle country and northern Ethiopia still use it (Addisie, 2019). Aquaquam was initially introduced to Gondar by Aleka Gebrehana. He was an accomplished instructor in the EOTC and was born in 1814 in Nabega Giorgis, in the area of Fogera (kalawi). Aquaquam is utilized for a variety of social activities and in a wide range of contexts, for example: problems to test young minds; speeches with examples to aid in the work of the judiciary; myths and tales to foster tradition, tame character, and teach about nature; and ironic music and songs used to expose and cultivate underlying hostility.

Teklie Aquaquam, one of the basic spiritual zemas in the EOTC, has basically five types of zema: Zimame (ዝማሜ), Qum (Tsinatsel) (ቁም), Meregd (መረግድ), Tsifat (ጽፋት), and Amelales (አመላለስ), with the sub-categories Neaus-meregd (ንዑስ መረግድ), Abey-Meregd (ዓብይ መረግድ), and Woreb (ወረብ); these all have their own zema features. Aquaquam also has its own special chant that contains Mezimur, Abun, Ezil, Wazema, Esme-Lealem, Zik, Araray, Anigergary, and Selam.

1.2. Statement of Problem

For the last few years, the Ethiopian Orthodox Church school has provided people with a unique education that allowed them to develop their knowledge and creativity across a variety of fields. As noted in (Kenew, n.d.), students in traditional schools travel long distances on foot to learn. In the past, the majority of the flock sent their male children to attend this spiritual academic institution, but that trend has greatly decreased and become less prevalent in recent years. According to (Ibrahim et al., 2017), speech recognition is important for human-computer interaction; accordingly, we use a deep learning model to classify the audio genre and address the above problem. The key factor that motivated us to perform this study was that most learners, as well as some disciples who have passed through the traditional school, cannot properly identify each zema genre because, unlike St. Yared zema, aquaquam has no zema notation. Technologies play an important role in a variety of areas such as education, music classification, audio segmentation, genre prediction, and the like.

The second motivation is the knowledge gap between modern and traditional education on zema genres. Most of the flock are unable to know this sweet style of Aquaquam due to the problem of giving the highest priority to modern education, so this technology must be addressed in their modern education. Nowadays, music genre classification is important to categorize multiple audio signals and for music information retrieval (MIR) (Mhatre, 2020). Another important issue that needs to be addressed in this study is that the flock may not get suitable conditions to attend this spiritual school due to different problems; in particular, zema teaching requires being present in the school frequently. Attending the school every day may be difficult for those who have other kinds of work, even though the lessons give great opportunities to easily learn and understand the types of zema with their characteristics. Aquaquam lessons take more time than other lessons, so students do not choose them in terms of time. The paper (Lidy & Schindler, n.d.) presents a popular audio genre classification method; the deep neural network is appropriate for genre problems and gives good classification performance. Therefore, aquaquam zema classification using deep learning is a big issue for EOTC students so that they can learn as they choose.

To this end, this research attempts to answer the following questions:

 Which segmentation size is appropriate for Aquaquam zema?

 Which feature extraction technique is appropriate for Aquaquam zema?

 To what extent can deep learning classify Aquaquam zema?

1.3. Objective of the study

The objective of this study can be described in two categories, the general objective and the specific objectives, as follows.

1.3.1. General Objective

The main objective of this study is to develop an Ethiopian Orthodox Tewahido Church aquaquam zema classification model using a deep learning approach.

1.3.2. Specific objectives

To achieve the general objective of this study, the following specific objectives are formulated:

 To review literature on music genre classification

 To collect the aquaquam zema genre from well-known aquaquam bets

 To determine the segmentation size of aquaquam zema

 To determine the feature extraction method and classifier that are appropriate for aquaquam zema

 To develop a model that classifies aquaquam zema accordingly

 To evaluate the performance of the model

1.4. Methods

To classify aquaquam zema, we followed an experimental research methodology to achieve the objectives and address the problems. Experimental research is research conducted with a scientific approach using two sets of variables: the input variables are manipulated and the output variables are measured to understand their causal effects.

1.4.1. Dataset Collection

The data were recorded from EOTC aquaquam zema bets. The first set was recorded from Bahir Dar Abune Gebremenfes Kidus Aquaquam scholars and the second from Andabet Debre Mihret Kidest Kidanemiheret scholars. In addition, data were retrieved from https://www.ethiopianorthodox.org and http://debelo.org.

1.4.2. Research Design

The proposed model contains three components: preprocessing, feature extraction, and classification. In the preprocessing stage, noise reduction and segmentation are applied, and then the zema audio signals are converted into spectrogram images. In the feature extraction stage, the system identifies the features of aquaquam zema using a deep feature extraction method and then classifies them based on their genre, namely Zimame, Qum, Meregd, Tsifat, and Amelales. To develop a model to classify aquaquam zema we use a deep learning approach.
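
For illustration only, such a CNN classifier could be sketched in Keras as follows; the input size, layer widths, and other settings below are assumptions, not the exact architecture developed in Chapter 3:

from tensorflow.keras import layers, models

# Minimal illustrative CNN: a spectrogram image goes in, a softmax over the
# five zema genres (Zimame, Qum, Meregd, Tsifat, Amelales) comes out.
model = models.Sequential([
    layers.Input(shape=(128, 128, 3)),             # assumed spectrogram image size
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(5, activation="softmax"),         # one output per zema class
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()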

1.4.3. Tools and Programming language

To implement our system we use Anaconda, an open-source distribution of Python and other languages that simplifies package management and deployment in science, machine learning, and deep learning applications.

According to (Athulya & Sindhu, 2021), the Python programming language is used for displaying, editing, processing, analyzing, and simulating the proposed model for classifying zema, and for designing and implementing the audio and image processing tasks. Python is suitable and powerful for simulating a musical instrument classification system and is open source. TensorFlow is an open-source library for developing software that makes machines intelligent based on dataflow graphs; its main aim is scientific research in machine learning and neural networks. Keras (using TensorFlow as a backend) is used to build the models. We use the librosa library to transform each audio file into a spectrogram.
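
As a minimal sketch of this conversion (the file name, sampling rate, and mel parameters below are assumptions, not the exact settings used in Chapter 3), a segmented audio clip could be turned into a mel-spectrogram image with librosa and Matplotlib as follows:

import numpy as np
import matplotlib.pyplot as plt
import librosa
import librosa.display

# Load one audio segment (hypothetical file name) at an assumed 22,050 Hz rate.
y, sr = librosa.load("qum_segment_001.wav", sr=22050)

# Compute a mel-scaled spectrogram and convert its power values to decibels.
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)
S_db = librosa.power_to_db(S, ref=np.max)

# Save the spectrogram as an image with no axes, ready to feed to the CNN.
plt.figure(figsize=(3, 3))
librosa.display.specshow(S_db, sr=sr, hop_length=512)
plt.axis("off")
plt.savefig("qum_segment_001.png", bbox_inches="tight", pad_inches=0)
plt.close()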

The Anaconda packages and tools used in this implementation include:

1) Graphviz: – It is graph visualization software.


2) Jupyter Notebook 6.4.8:- Jupyter notebooks are a great way to run deep-learning
experiments. A notebook allows you to break up a long experiment into smaller pieces that
can be executed independently, which makes development interactive. All the
experiments in this research were run in Jupyter.
3) Keras 2.8.0:- It is a deep learning framework or a library providing high-level
building blocks for developing deep learning models.
4) Librosa 1.2.0: - This library for music and audio analysis provides the components
required for building music information retrieval systems. We employ it to extract
features from the audio.
5) Matplotlib 3.5.1: – It is a python 2D plotting library.
6) NumPy 1.21.5: – It is a package for manipulating multidimensional arrays (tensors).
Each piece of data needed for deep learning must be represented by a tensor of a
particular size, and NumPy was used to store and manage the arrays.
7) OpenCV 4.5.5: – It is a resource library for computer vision issues. We read audio
data from disk using the library, and then break it down into the individual picture
elements that make up the audio.
8) PyDub: – It is a library that has an easy-to-use high-level interface for manipulating
audio data.
9) Python 3.9.12:- Python is the programming language used to implement the models.
The abundance of libraries for data manipulation and frameworks for deep learning
and data processing led us to choose Python.
10) Scikit-learn 1.0.2: It is a machine learning library with various features and tools.

11) Seaborn 0.9.0: – It is a matplotlib-based data visualization package. It offers a
sophisticated user interface for creating visually appealing and useful statistical
graphs.

1.4.4. Development environment

In addition to the above software packages and libraries, we used an Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz 2.71 GHz processor, 4 GB of RAM, and the Windows 10 64-bit operating system.

1.5. Scope of the Study

This study focuses only on teklie aquaquam classification with audio files with the file extension mp3 or wav. These audio files come from the books of the aquaquam zema scholars, such as Mezimur, Abun, Ezil, Wazema, Eseme-Lealem, Zik, Araray, Anigergary, and Selam. Aquaquam zema is classified into five major classes, Zimamie (ዝማሜ), Qum (Tsinatsel) (ቁም), Meregd (መረግድ), Tsifat (ጽፋት), and Amelales (አመላለስ), and the sub-categories Woreb (ወረብ), Neaus-meregd (ንዑስ መረግድ), and Abey-Meregd (ዓብይ መረግድ). This paper concerns the five major classes of zema (Zimamie, Qum, Meregd, Tsifat, and Amelales); each zema has its own features, including rhythm and sound. Feature extraction is done by frequency- and time-domain audio processing of the audio data. To clearly distinguish each type of zema, segmentation is performed.

This study does not consider zema with instruments, video, or any textual types of input data in the dataset, but descriptions may be used. We do not focus on non-EOTC zema; some other zemas are Saint Yared zema, but when the reality is observed these are not categorized under music. This research does not concern the remaining aquaquam class Woreb (ወረብ) or the sub-zemas Neaus-meregd (ንዑስ መረግድ) and Abey-Meregd (ዓብይ መረግድ), because, as Melake Mihret Tewubo Ayenew said, these subcategories differ from the main zema types based on the instrument that the priests chant with.

1.6. Significance of the study

From a scientific perspective, the significance of the study is that it becomes an input for researchers to do additional investigation on aquaquam and St. Yared zemas; for those who are interested in knowing aquaquam zema, it can be used as a guide and can also prevent the dropout and confusion of scholars. Aquaquam students can easily get guidance and direction using such audio files, since when students migrate from one area to another they may not find scholars. An additional significance of the study is that it provides a starting point for future researchers to use various algorithms to improve the classification of other spiritual zemas with this method.

From a practical perspective, the main benefit is to give aquaquam zema scholars the opportunity to learn easily and to specialize in a relatively short period of time. Another benefit of this study is that it provides a new dataset, since we create a dataset for aquaquam zema for the very first time; it offers supportive information for students who learn in the traditional spiritual school, and it gives modern-education students a motive to learn the traditional school teachings in parallel. Finally, this study also provides information to foreign tourists who want some knowledge about aquaquam zema types.

1.7. Evaluation Technique

After training the developed classification model, its performance was evaluated on a selected test dataset using accuracy, which is the percentage of zemas classified correctly. Other performance measures are precision, recall, and F-score. The model is also evaluated using a confusion matrix, which summarizes the performance of the selected signal processing techniques; it is a contingency table, or error matrix, with a specific table layout that allows visualization of the performance of an algorithm.
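
As a small sketch of how these metrics could be computed with the Scikit-learn library listed in Section 1.4.3 (the label lists below are made-up examples, not results of this thesis):

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

labels = ["Zimame", "Qum", "Meregd", "Tsifat", "Amelales"]

# Hypothetical true and predicted zema classes for a handful of test clips.
y_true = ["Zimame", "Qum", "Meregd", "Tsifat", "Amelales", "Qum"]
y_pred = ["Zimame", "Qum", "Meregd", "Tsifat", "Qum", "Qum"]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("recall   :", recall_score(y_true, y_pred, average="macro", zero_division=0))
print("f1-score :", f1_score(y_true, y_pred, average="macro", zero_division=0))
print(confusion_matrix(y_true, y_pred, labels=labels))  # rows: true class, columns: predicted class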

1.8. Organizations of the study

This section describes the overall explanation of the remaining chapters in a short and
precise way. The subsequent chapters contain the following items. There are five chapters

in this thesis. The thesis' broad structure was covered in the first chapter. This thesis'
remaining sections are structured as follows:

Literature Review (Chapter 2), The Aquaquam Zema, audio composition, music
classification systems, audio feature processing, extraction, and feature classification
models in machine learning and deep learning algorithms are covered in detail in the
reviewed literature presented in this chapter. We also talk about digital image processing
and music information retrieval. And last, discussions of similar works. Methodology
(Chapter 3), This chapter discusses in detail a method that was utilized to support the
thesis, such as model design, discussion of dataset collecting and preparation approaches,
audio signal preprocessing techniques, feature extraction, and classification model.
Experiment Result and Discussion (Chapter 4), this chapter describes the experiment
design and experiment results. Conclusion and Recommendation (Chapter 5), the final
chapter's Conclusions section summarizes the main ideas of our discussion and the most
important findings of our research. Limitations and possible solutions for future work are
discussed in the recommendation section.

CHAPTER TWO: LITERATURE REVIEW

2.1. Introduction

This section covered the literature review on the idea of an audio signal, in-depth
information about the study, the background of Teklie, Aquaquam zema compositions,
types of zema, representation of recorded audio zema into waveforms and some deep
learning approaches for feature extraction and classification have described. Lastly, the
assessment metrics and related research are briefly mentioned.

2.2. Overview of audio signal

Music is an important part of everyday life. Around the world, it exists in many different forms and styles because musical preferences vary from person to person (Eyob Alemu, Ephrem Afele Retta, 2019). Music Information Retrieval (MIR) is a highly interdisciplinary research field, broadly defined as extracting information from music, together with its applications. Often music means the audio content, although its scope otherwise extends to other types of musical information, e.g., lyrics, music metadata, or user listening history. Musical genre is one of the best ways of classifying music. Mostly this classification is done by labeling music manually. Such classification is very much needed to recommend music, as people tend to listen to the music of their preferred genre. Many musical applications such as 'Spotify' and 'SoundCloud' use genre classification to suggest music to their users.

Music genre classification is becoming an outstanding research area because of its contribution to the growth of music information retrieval systems and to better manipulation of musical records; it provides access for an in-depth study and better explanation of music contents. It takes a lot of time and effort for a human specialist to manually describe a musical genre (Frehiwot Terefe, 2019). The efficiency of a set of characteristics depends on the application; therefore, the key challenge in designing audio classification models is the creation of descriptive features for a particular application. In reality, audio tells a lot about a clip's mood, the music part, the noise, and the speed or slowness of the pace, and the human brain can also classify on the basis of audio alone (Müller & Ellis, 2011). A music genre can be classified by analyzing a lot of features in the music; therefore, classifying musical genres automatically is a complicated task. However, there is a noticeable pattern within the same kind of genre if we analyze the relationship pattern between pieces by extracting and analyzing different features.

2.3. EOTC Zema bet

In the traditional church education system, various programs, namely Nebab-Bet, Zema-Bet, Qedase-Bet, Aquaquam-Bet, Qene-Bet, and Metshafit-Bet, have been offered. Nebab-Bet (school of reading) deals with the skills of reading, writing, and arithmetic; Qedasse-Bet (school of liturgy) deals with the kind of prayer during a mass service; Qene-Bet (school of poetry) deals with the highly elaborated, strict, and multiplied form of 'geez' poetry; and Zema-Bet (school of hymn or music) deals with the hymns of St. Yared (the renowned church education composer), his musical notations, and the 3R's (reading, writing, and arithmetic). The program that deals with the chanting of St. Yared with its typical kinds of dance and musical instruments is known as Aquaquam-Bet (school of swaying and chanting) (Wreqenh, n.d.). Zema Bets study Ethiopian Orthodox Church music, divided into many specialized fields that cover music instrument playing, textual readings of the liturgy, vocal performance, and body movement.

2.3.1. Aquaquam Bet

Aquaquam bet is one of the EOTC schools, where scholars are taught aquaquam. Aquaquam (religious dance and movements in which drums and sistrums are used) is one of the basic tenets of zema schooling. It is a type of subject in which a student is expected to blend the style of his body movement with what he sings while he plays the church music instruments during chanting. The music instruments dominantly used in the Ethiopian Orthodox Church are the Ṣänaṣəl (sistrum), Mäquamia, and Käbäro (Debebe, 2017). Aquaquam is a term for Ethiopian Christian liturgical dance and its instrumental accompaniment, integrated with sistrum, drum, and prayer staff (Debebe, 2017). Aquaquam is a spiritual dance performed in the church system, followed by chanting accompanied by cymbals and drums. This wisdom of the few has come down from ancient times. Aquaquam, which is performed with the aid of staffs (prayer sticks), sistra, drums, and other instruments, is different from pure liturgical music. Mahlet music is an ancient kind of performance for re-enactments that was passed down as a St. Yared heritage (Tsegaye, 1975). The name Tekle Aquaquam was given after Chief Tekle Gebrehana, the chief's son. It is stated that Chief Gebrehana received this aquaquam from angels and that he alone taught this rare education to his son, Chief Teklie Gebrehana. Takale Akuakuam is an improvised version of the earlier aquaquam style which was given at Gondar Baeta. It was the famous Alaqa Gabrahana who introduced this new version of aquaquam (Tsegaye, 1975).

2.3.2. Aquaquam zema types

According to Yenta Zera Biruk Dawit (personal communication, Aquaquam scholar at Bahirdar Abune Gebremenfes Kidus Church, September 10, 2021), Teklie aquaquam has its own chant principles: each zema is performed with its own steps and special wearing style, and Teklie Aquaquam is one of the fundamental spiritual zemas in the EOTC. It consists of the following types of zema: Zimame (ዝማሜ), Qum (Tsinatsel) (ቁም), Meregd (መረግድ), Tsifat (ጽፋት), Amelales (አመላለስ), and Woreb (ወረብ), with the sub-categories Neaus-meregd (ንዑስ መረግድ) and Abey-Meregd (ዓብይ መረግድ).

Zemamie: The term "zamame" means "to sing" and is derived from the Geez language. It is Chief Gebrehana's debut composition. It is performed shortly after the melody in the Mahlet, when the song is all sung together. It is referred to as a pioneer driving and a pioneer being driven. The choir walks forward and backward as usual, swings the marker to the right, clears the area, and occasionally punches the ground (Melak Mihret Tewubo Ayenew, personal communication, Aquaquam and zema scholar at Andabet Kidanemihret Betekrstian, May 21, 2022).

Meregd: the priests sing with a very slow and broad motion of sound. This song is a different way of increasing the body movement, even though the drum pattern has been established for years; the rhythm is different.

Tsifat: the singing is faster than Meregd. It is a rhythm that has a different speed from Qum and Meregd, and the movement of the body and hands is more visible at this time than when it is not fast. In this type of song the drum is beaten fast.

Amelales: Amelales means to walk back from the drum with a loud voice and sing while walking. The chanters hum the book while walking in a gentle voice; they jump up and sing it in a gentle voice and in a pleasant manner.

2.4. Music information retrieval (MIR)

Music information retrieval is a set of computational processes for extracting descriptive information from recorded music, useful for many kinds of music classification tasks. Digital signal processing, music theory, musicology, music psychology, music education, human-computer interaction, and the library sciences are just a few of the areas to which MIR is related (Tzanetakis & Cook, 2010).

The core objective of several of its sub-disciplines, as well as an essential component of many MIR system characteristics, is automatic music classification (Ren et al., 2010). In research related to MIR, features are extracted from recordings that are first represented in digital form; this condition makes the audio feature extraction process closely associated with the concept of DSP.

2.5. Music & audio representations

An audio signal is a signal that contains information in the audible frequency range.
Audio representation refers to the extraction of audio signal properties, or features, that
are representative of the audio signal composition (both in temporal and spectral domain)
and audio signal behavior over time (Choi & Sandler, 2017).

The work of (William J. Pielemeier, 1996) shows that an audio time-frequency representation is more suitable for obtaining detailed information about an audio signal. There are millions of musical data items available in digital databases that represent music in different forms: it can be in textual form, as in the lyrics of a song; it can be a poster for a new music album; it can be the audio formats used in recordings; and another form is the Musical Instrument Digital Interface, known as MIDI. These are different representations of music and audio in general with which one needs to be acquainted, especially for studies and research in music information retrieval.

Sound is a mechanical wave, an oscillation of pressure transmitted through a solid, liquid, or gas. The perception of sound in any organism is limited to a certain range of frequencies (20 Hz to 20,000 Hz for humans). The other main form of sound representation, the time-frequency representation, is a two-dimensional matrix that represents the frequency contents of an audio signal over time. By applying the Short-Time Fourier Transform (STFT) to an audio signal we can generate a spectrogram image; spectrogram is mostly used to describe a TF representation that does not have any explicit phase representation (Mu et al., 2021).

2.6. Digital signal processing

Digital signal processing is a very important concept in many audio-related activities. In most cases, sampling is the first step in digital signal processing; it is the digitization process in which the continuous analog signal is represented in digital form by measuring the signal level of the analog signal at consistent time intervals (Mckay, 2010).

Signals may be continuous (analog), as they are in the natural world, or digital, as they are in modern digital equipment like computers. Only digital signals can be stored and processed by computers. Therefore, before being stored and processed by computers, image, audio, and video signals must be converted to a digital format (Selam, 2020). DSP performs a wide range of signal processing operations using digital processors, such as computers or more advanced dedicated digital signal processors. DSP is mostly used in audio signal processing, speech synthesis, radar, seismology, sonar, and voice recognition (Birku, 2021).

2.7. Audio signal processing

The computational techniques for purposeful acoustic signal or sound manipulation are
the focus of audio signal processing. Making a machine understand the sound of events
by hearing is done through audio processing. People, musical instruments, animals, and
other items are the origin of the sounds. Audio signal processing is important for different
application areas such as music, speech recognition, and environmental sound recognition
for surveillance, information retrieval, and communication (Kasehun, 2021). Like a
human, the machine can hear the sound of audio music, which aids in decision-making
based on the established rules and regulations. In order to comprehend and examine the
sonic characteristics, audio signal processing techniques are required.

2.7.1. Signal Terminologies

Signals are emitted from a source in the form of sound, and sound has its own components: loudness, timbre, and pitch. Pitch, the frequency at which the waveform repeats itself, is the fundamental frequency of the sound. Loudness is a measurement of the strength of the sound waves, whereas timbre is a more complex concept that depends on the harmonic content of the signal.

A signal, or waveform, is a quantity that varies with time or space and that generally transmits data. The distinctions between analog versus digital and continuous time versus discrete time are also made when addressing waveform processing problems. Although these terms are sometimes used interchangeably, the two sets of terms should be given different definitions.

Analog signal: a waveform that is continuous in time and takes on a continuous range of amplitude values. Analog waveforms or analog signals are derived from acoustic sources of data. The signals are represented mathematically as functions of continuous variables; analog signals are continuous in time with continuous amplitude (Smith, 1999).

Digital signal: implies that both time and amplitude are quantized. In digital models the signals are represented as a sequence of numbers that takes only a finite set of values; these types of signals have discrete time. As we know, computers understand any form of input as numeric values, that is, in the form of 0s and 1s (Sharma et al., 2020).

I. Frequency: the rate at which the signal's waveform oscillates, measured in cycles per second (hertz) over a given time.
II. Pitch: the perceived frequency of the sound's fundamental, that is, the frequency with which the waveform is repeated.
III. Loudness: a measurement of the volume (strength) of the sound wave.
IV. Timbre: the color or quality of the music that lets the listener judge which music or audio is being played; it is determined by the harmonic content of the signal.
V. Mel spectrogram: a spectrogram where the frequencies are converted to the Mel scale.
VI. Fourier transform: a mathematical function that takes a time-domain signal as input and decomposes it into its frequency components as output, converting the signal between the time-domain and frequency-domain representations (see the sketch after this list).
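
The following small sketch illustrates item VI with NumPy on a synthetic signal (the 440 Hz and 880 Hz tones are made-up test values, not data from this study):

import numpy as np

sr = 22050                          # assumed sampling rate in Hz
t = np.arange(0, 1.0, 1.0 / sr)     # one second of time samples
# A toy time-domain signal: a 440 Hz tone plus a quieter 880 Hz overtone.
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

spectrum = np.fft.rfft(signal)                   # frequency-domain representation
freqs = np.fft.rfftfreq(len(signal), 1.0 / sr)   # frequency of each bin in Hz
magnitudes = np.abs(spectrum)

# The strongest component should come out near 440 Hz.
print("dominant frequency: %.1f Hz" % freqs[np.argmax(magnitudes)])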

2.7.2. Audio signal

Audio signal: The audio signal is often called raw audio, in contrast to other representations that are transformations of it. A digital audio signal consists of audio samples that specify the amplitudes at time steps. In the majority of MIR works, researchers assume that the music content is given as a digital audio signal. The raw audio signal has not been the most popular choice; researchers have preferred 2D representations such as the STFT and spectrograms, because learning a network starting from the raw audio signal requires an even larger dataset.

2.8. Audio data acquisition

All audio classification and detection tasks start with this step. For the desired purpose, audio data is retrieved from a stored database or by recording the sound of things. Audio data is collected using a digital recorder, a smartphone, or other compatible technology. Different sound recording formats, including MP3, WAV, OGG, AAC, and AC3, are available. As acquisition provides the initial input for this investigation, it is the early stage of audio signal processing; it deals with acquiring the audio files required for research in various audio file formats and converting the signals into spectrogram images. Except in situations where processing is not included, it is an important part of the study. It involves taking a primary source of audio data and recording it with a sound recorder to capture it accurately in an uncontrolled environment, as well as taking a secondary source of audio data that has already been recorded. A recording of audio is typically raw and needs additional processing and analysis before it can be used for a particular purpose.

2.9. Preprocessing audio

Real-world data generally contains noise and missing values and may be in an unusable format that cannot be used directly by machine learning models. Data preprocessing is required to clean the data and make it suitable for a machine learning model, which also increases the accuracy and efficiency of the model.

To create a reliable and suitable audio signal representation, pre-processing of the input audio signals is essential. In the actual world, background noise and foreground acoustic objects are typically present in audio signals captured with a microphone or smartphone. This audio cannot immediately be utilized as an input for classification using machine learning, because duplication in the signals must first be eliminated. Noise reduction, equalization, low-pass filtering, and segmenting the initial audio signal into smaller audio and silence events to be used in feature extraction are all steps in the preprocessing process (Babaee et al., 2018).
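
One of these steps, low-pass filtering, could be sketched with SciPy as follows; the file name, cutoff frequency, and filter order are assumptions for illustration, not the preprocessing settings actually used in this thesis:

import librosa
from scipy.signal import butter, filtfilt

# Load a hypothetical raw recording at an assumed 22,050 Hz sampling rate.
y, sr = librosa.load("raw_recording.wav", sr=22050)

# Design a 4th-order Butterworth low-pass filter with an assumed 4 kHz cutoff,
# which attenuates high-frequency hiss above that cutoff.
b, a = butter(N=4, Wn=4000, btype="low", fs=sr)

# Apply the filter forwards and backwards so no phase shift is introduced.
y_filtered = filtfilt(b, a, y)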

2.9.1. Audio noise reduction

One of the key digital signal processing (DSP) applications is noise reduction. Recorded audio and music signals tend to be disrupted by noise, and these noises occur more frequently in recorded audio than in music. A study showed that there are approximately three silences in a sentence, while in music silences do occur, but the transients (which occur at the start and end of silence) in speech are noisier than the transients associated with music (Wolfe, 2002).

2.9.2. Audio segmentation

An important preprocessing step for audio signal processing used in numerous applications is audio segmentation. Audio data is always available, but it is frequently in an unstructured format. It is important to classify it into regularized qualities in order to make it easier to use. The option to segment a long audio stream into other audio categories is also helpful (Li & You, 2015). The ability to divide an audio stream into homogeneous regions is useful in a variety of applications. As a result, audio segmentation is the process of dividing an uninterrupted audio stream into acoustically comparable or homogeneous category sections. Finding acoustic variations in an audio signal is the aim of audio segmentation. For automatic indexing and information retrieval of all instances of a particular speaker, this segmentation produces useful information such as division into speaker signals and speaker identities (Bhandari, 2016). In audio processing, windowing has the purpose of minimizing the discontinuities in the audio signal at the start and end of each frame (Science, 2021)(J.Nalini & Palanivel, 2013). It is applied to the frame's filtered frequency to shape it. For applications involving digital signal processing, window functions such as Hamming, Blackman, flat top, force, Hanning, exponential, and Kaiser are available. Depending on the application area and the signal content of the data, one may choose among these window functions.
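
As a minimal sketch of fixed-length segmentation using the PyDub library listed in Chapter 1 (the file name and the 10-second segment length are assumptions; the thesis experiments with 5-, 10-, and 20-second segments):

from pydub import AudioSegment

audio = AudioSegment.from_file("zimame_sample.mp3")    # hypothetical input file
segment_ms = 10 * 1000                                 # PyDub slices are indexed in milliseconds

for i, start in enumerate(range(0, len(audio), segment_ms)):
    chunk = audio[start:start + segment_ms]
    if len(chunk) < segment_ms:                        # skip the trailing partial chunk
        continue
    chunk.export(f"zimame_{i:03d}.wav", format="wav")  # one fixed-length segment per file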

2.9.3. Waveform representation

A waveform is a representation used to display audio or sound waves: it displays the changes in amplitude that occur over time, and it is a visualization tool with Python methods to show the audio. A waveform with lower amplitudes indicates a low-pitched or soft sound, while higher amplitudes indicate louder or higher-pitched sounds. Audio signals are generated when an object vibrates: a human voice is generated when the vocal chords vibrate, and a piano sound is generated when its strings vibrate from the act of hammering them. The generated sound waves travel through the air, causing the air molecules to oscillate; these oscillations cause rapid displacements of air particles in the form of compression and rarefaction. When these waves hit the ear drums, they cause certain nerves to vibrate and generate an electrical signal to the brain, which is then perceived by humans as sound; alternatively, a sound recorder or a microphone receives these waves and translates them into the intended sound. Audio signals are considered to be non-stationary signals because of the frequent changes in their frequency content; in music the pitch does not change much compared to speech, which changes continuously (KHASHANA, 2020).

2.10. Digital image processing

2.10.1. Spectrogram

The STFT and the mel-spectrogram have been the most popular input representations for music classification (Choi & Sandler, 2017). Although sample-based deep learning methods have been introduced, 2-dimensional representations will still be useful in the near future for efficient training. Mel-spectrograms provide an efficient and perceptually relevant representation compared to the STFT (Ullrich, Karen, Jan Schlüter, n.d.).

An additional form of audio representation, known as a spectrogram, is a two-dimensional (2D) plot of frequency against time that also has a third dimension displayed as color. Sonographs or voicegrams are other names for them in the context of audio. The amplitude of a frequency at a specific time is represented by each value of the spectrogram in terms of color intensity. A spectrogram can be described as a spectrum of frequencies that changes with time, where time is on the X-axis, running from oldest (left) to newest (right); frequency is plotted on the Y-axis, with the lowest frequencies at the bottom and the highest frequencies at the top; and the values represent the amplitude of a specific frequency at a specific moment. Spectrograms have been used in a variety of audio analysis tasks, including sound event classification (Lee et al., 2009) and speaker recognition (Dennis et al., 2011).

The spectrogram is a kind of heat map in which intensity is depicted by varying colors and brightness. Lighter colors on the spectrogram represent lower intensity and lower amplitudes, while darker colors correspond to progressively stronger (or louder) amplitudes. A spectrogram can also be seen as a visual representation of the signal strength, or the "loudness", of a signal over time at various frequencies present in a particular waveform (Badshah et al., 2017). When analyzing the spectral content of
audio, the Fourier transform is a helpful tool. In order to determine the magnitude and
phase of each frequency component, it translates signals from the time domain to the
frequency domain. A more efficient version of the Discrete Fourier Transform (DFT), the
Fast Fourier Transform (FFT), can be used to analyze discrete sequences in real time,
such as audio sample values. FFT, in essence, correlates the frequencies present in the
signal and classifies them in distinct steps. Environmental audio contains both stationary
and brief frequencies, hence the Short-Time Fourier Transform (STFT), which shows
how the spectrum changes over time, is preferred. To generate spectra for each sequential
time segment of the original signal, STFT utilizes a sliding FFT window. A power
spectrogram estimate is created by stacking the squared magnitude of each spectrum. The
end result is a dynamic spectrum with time and frequency plotted on opposite axes. These
spectra's values describe the loudness of a certain frequency at a specific time.
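As a rough illustration of this pipeline, the following Python sketch (assuming librosa and matplotlib, a hypothetical file name, and common default parameters) computes an STFT power spectrogram in decibels and displays it.

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("zema_sample.wav", sr=22050)

# STFT over short overlapping windows, then squared magnitude -> power spectrogram.
stft = librosa.stft(y, n_fft=2048, hop_length=512)
power = np.abs(stft) ** 2

# Convert power to decibels so that louder components appear as stronger colors.
power_db = librosa.power_to_db(power, ref=np.max)

librosa.display.specshow(power_db, sr=sr, hop_length=512, x_axis="time", y_axis="hz")
plt.colorbar(format="%+2.0f dB")
plt.title("Power spectrogram (STFT)")
plt.show()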

2.11. Feature extraction

Most of the time, the output of audio classification serves as the input to audio identification. These processes reduce the search space, speed up processing, and help retrieve better results. Audio features can be divided into temporal and spectral features, which capture the temporal and spectral characteristics of an audio signal, respectively.

2.11.1. Deep feature

Deep learning has proven to be a powerful technique for extracting high-level features from low-level information. Features extracted from the hidden layers of various deep learning models are called deep features. Deep features can be extracted from any deep learning model, such as convolutional neural networks (CNNs), deep neural networks (DNNs), recurrent neural networks (RNNs), deep stacked auto-encoders (SAE), unidirectional long short-term memory networks (LSTM), bi-directional long short-term memory networks (BLSTM), and other similar models.

A spectrogram is a visual representation of the frequency spectrum of a signal as it varies with time. A common format is an image that indicates frequency on the vertical axis, time on the horizontal axis, and, as a third dimension, the amplitude of a particular frequency at a given moment, represented by the intensity of the color (Cruz et al., 2019). The magnitude or amplitude is represented by the color, with brighter colors corresponding to higher values in decibels. Technically speaking, we may transform a waveform into a spectrogram, which is analogous to a picture. Researchers have found that computer vision techniques can be applied successfully to spectrograms. Therefore, by identifying patterns in the spectrogram, a deep learning model can extract the dominant audio content for each time frame in a waveform. The STFT represents a signal in the time-frequency domain by computing discrete Fourier transforms (DFT) over short overlapping windows. After converting the audio signal into a spectrogram, features can be extracted from it automatically; since we use convolutional neural networks, which have layers that filter each image, they are particularly powerful for this feature extraction. Spectrograms have been used in many audio-based tasks including musical genre recognition, text-to-speech, and emotion recognition (Boxler, 2020).

2.11.2. Content-Based Features

Content-based features are extracted from raw audio signals. These features can be
represented using mean, standard deviation, variance, feature histograms, MFCC
aggregation, and area moments. This subsection goes through a select few of these features and briefly explains them. Manually extracted or content-based features can be split into time-domain and frequency-domain features (Bahuleyan, 2018).

Zero Crossing Rate (ZCR):

One significant time-domain property is the zero crossing rate. This feature, which is frequently employed in audio classification, is very beneficial yet inexpensive to compute. Typically, it is defined as the number of zero crossings in the time domain that occur within one second (Senevirathna & Jayaratne, 2015).

The zero-crossing rate (ZCR), the rate at which a discrete-time signal switches from positive to negative or back, is the frequency at which the signal changes its sign. The zero crossing frequency is a straightforward indicator of a signal's frequency content. It is a measure of the number of times in a given frame that the amplitude of the audio signal passes through a value of zero, as shown in (Raju, N., Arjun, N., Manoj, S., Kabilan, K., Shivaprakaash, 2013). In order to use ZCR to distinguish unvoiced sounds from noise and environment, the waveform can be shifted before computing the ZCR; this is particularly useful if the noise is small. In addition, ZCR is a rough indicator of signal frequency and is much lower for voiced speech than for unvoiced speech (Ngo, 2011). ZCR is an important parameter for voiced/unvoiced classification and for endpoint detection (Raju, N., Arjun, N., Manoj, S., Kabilan, K., Shivaprakaash, 2013). The disadvantage of ZCR is that it does not take noise into account (Bormane & Dusane, 2013).

ZCR = \frac{1}{T-1} \sum_{t=1}^{T-1} \mathbb{1}\{ s_t s_{t-1} < 0 \}   (1)

where s is a signal of length T and the indicator function \mathbb{1}\{A\} is equal to 1 when A is true and is equal to 0 otherwise.
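For illustration, the following Python sketch (assuming librosa and a hypothetical file name) computes both the frame-wise ZCR provided by librosa and the signal-level count of Equation (1) directly.

import numpy as np
import librosa

y, sr = librosa.load("zema_sample.wav", sr=22050)

# Frame-wise ZCR computed by librosa (fraction of sign changes per frame).
zcr_frames = librosa.feature.zero_crossing_rate(y, frame_length=2048, hop_length=512)
print("mean frame-wise ZCR:", zcr_frames.mean())

# Direct form of Equation (1): count sign changes over the whole signal,
# normalized by the number of sample pairs.
sign_changes = np.sum(y[:-1] * y[1:] < 0)
zcr_manual = sign_changes / (len(y) - 1)
print("signal-level ZCR:", zcr_manual)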

Mel Frequency Cepstral Coefficient (MFCC):

MFCCs are used to understand and represent the physiological properties of human perception depending on the frequencies of the signal, because they are closely related to the human auditory system. MFCC is a short-time power spectral representation of sound. MFCC extracts features from the cepstrum of the audio signal by converting the input signal from the time domain into the frequency domain. MFCCs are commonly used in audio recognition and are finding increased use in music information retrieval and genre classification systems. A Mel Frequency Cepstrum (MFC) is a representation of the short-term power spectrum of a sound, whereas the Mel Frequency Cepstral Coefficients (MFCC) are the series of coefficients that make up the MFC. MFCCs have also been shown to be an effective representation for music (Beth Logan, 2000).

The discrete cosine transform (DCT) can be used to translate the mel spectrum coefficients (and consequently their logarithm) into the time domain:

C_i(n) = \sum_{m=1}^{M} S(m) \cos\!\left[\frac{\pi n (m - \tfrac{1}{2})}{M}\right]   (2)

where n is the index of the MFCC, C_i(n) is the n-th MFCC coefficient of the i-th frame, S(m) is the logarithmic power spectrum of the audio signal, and M is the number of triangular filters (Xie et al., 2012).
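As a brief illustration, the following Python sketch (assuming librosa, a hypothetical file name, and typical parameter choices) extracts 13 MFCCs per frame; librosa internally builds the mel power spectrogram, takes its logarithm, and applies the DCT of Equation (2).

import librosa

y, sr = librosa.load("zema_sample.wav", sr=22050)

# 13 MFCCs per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=2048, hop_length=512)
print(mfcc.shape)   # (13, number_of_frames)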

Spectral Centroid:

This property tells us where the spectral center of mass is located and is obtained as the weighted average of the frequencies. It helps to predict the "brightness" of a sound, so it is very useful when measuring the "timbre" of an audio signal. Therefore, it helps to distinguish between higher-pitch and lower-pitch audio signals. Equation 3 calculates the spectral centroid:

SC = \frac{\sum_{k} k \, X(k)}{\sum_{k} X(k)}   (3)

where X(k) is the amplitude of bin k in the DFT (Discrete Fourier Transform) spectrum.

Chroma Features:

This feature is an interesting and powerful representation of audio. We know that any tone belongs to one of the musical octaves. To classify a tone into a pitch class we can use the tone height: if the tone height is low it may belong to class "C", and if it is very high it may belong to class "B". There is, however, another measurement of a tone class, called "chroma". It is a powerful tool for audio analysis in which tones can be categorized in a meaningful way, and one of its properties is that it captures the harmonic and melodic characteristics of the audio.
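For illustration, the following Python sketch (assuming librosa and a hypothetical file name) extracts the spectral centroid and chroma features described above.

import librosa

y, sr = librosa.load("zema_sample.wav", sr=22050)

# Spectral centroid per frame (Equation 3): the "brightness" of the sound.
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)

# Chroma: energy folded into the 12 pitch classes, capturing harmonic content.
chroma = librosa.feature.chroma_stft(y=y, sr=sr)

print(centroid.shape)   # (1, number_of_frames)
print(chroma.shape)     # (12, number_of_frames)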

2.12. Classification

Audio classification is the assignment of sounds to one of several categories, as in environmental sound classification and speech recognition. The task is the same as in dog-and-cat image classification or spam-and-ham text classification; the only difference is the type of data, which here is audio files of certain lengths rather than images or text. Audio classification is used in a wide variety of applications, including audio blocking, music genre identification, natural language classification, ambient noise classification, and the detection and identification of different types of noise. It is also used in chat bots to take their performance to the next level.

There is a considerable number of real-world applications for audio classification. For example, it is very helpful to be able to search sound effects automatically from a very large audio database in film post-processing, which contains sounds of explosions, windstorms, earthquakes, animals, and so on (Senevirathna & Jayaratne, 2015). Audio classification, or sound classification, can be referred to as the process of analyzing audio recordings. This technique has multiple applications in the field of AI and data science, such as chat bots, automated voice translators, virtual assistants, music genre identification, and text-to-speech applications. Audio classification can take multiple forms, such as acoustic data classification or acoustic event detection, music classification, natural language classification, and environmental sound classification (Analytics Vidhya, accessed 10/2022).

2.12.1. Deep Learning approach

Deep learning is a broad term that refers to machine learning and artificial intelligence methods that imitate human learning and behavior based on specific human brain functions. It is a crucial component of data science that directs modeling using data-driven techniques in the context of prediction and statistics. To develop such human-like adaptability and learning ability and to function accordingly, powerful procedures, commonly called algorithms, must be at work. With the help of deep learning, we can accomplish the task of music genre classification without the need for handcrafted features.

Several layers of neural networks, which are nothing more than a collection of decision-making networks pre-trained to perform a task, are dynamically built and deep learning algorithms run through them. Each of these is subsequently processed through basic layered representations before moving on to the following layer. Deep learning commonly uses supervised learning, an approach for creating artificial intelligence in which a computer algorithm is trained on input data that has been labeled for a particular output. The model is trained until it can recognize the underlying patterns and connections between the output labels and the input data, which allows it to produce correct labels when fed new, previously unseen data. Supervised learning is good at classification and regression problems, such as determining what category a news article belongs to or predicting the volume of sales for a given future date. The goal of supervised learning is to interpret the data in light of a particular circumstance. In neural network methods, the supervised learning process is enhanced by monitoring the model's outputs in real time and adjusting the architecture to bring the system closer to the desired accuracy. The available labeled data and the algorithm are two factors that affect the level of accuracy that is obtained (Shah et al., 2022).

2.12.1.1. Long Short Term Memory Networks (LSTMs)

Recurrent neural networks (RNNs) with the capability to learn and adapt to long-term dependencies are known as LSTMs. An LSTM can remember and recall information from the past for a long time, and this is its default behavior. LSTMs are designed to retain information over time, and hence they are widely used in time-series prediction because they can keep a memory of previous inputs. They are often described as having a chain-like structure consisting of four interconnected layers that communicate with one another in various ways. Along with time-series prediction, they can be used to build voice recognizers, advance medicinal research, and create musical loops.

LSTMs work sequentially: first, they tend to forget extraneous information acquired in a prior step; they then selectively update a subset of the cell-state values before generating a subset of the cell state as output.

2.12.1.2. RNN

Recurrent neural networks are a different type of feed-forward network. Here, each neuron in the hidden layers receives an input after a specific time delay. The recurrent neural network mostly uses information from earlier iterations. For instance, in order to predict the next word in any sentence, one must be familiar with the words that came before it. In addition to processing the inputs, it also distributes the weights over time, and it does not let the size of the model increase with the size of the input. However, the problems with this recurrent neural network are that it has slow computational speed, it does not consider any future input for the current state, and it has difficulty recalling information from the distant past.

The recurrent network is a class of artificial neural network primarily designed to find patterns in data sequences such as text, genomes, handwriting, spoken language, and numerical time-series data from sensors, stock markets, and governmental organizations. RNNs work by feeding the output produced at time t-1 back in at time t; similarly, the output determined at time t is fed in at time t+1, and this process is repeated over an input of any length. Another property of RNNs is that they store historical information, and the input size does not increase even if the model size is increased. RNNs are typically drawn unfolded over time to illustrate this behavior.

2.12.1.3. Artificial Neural Network (ANN)

An artificial neural network is one of the most powerful supervised computational systems. It consists of extremely large numbers of simple processors and interconnections. ANNs have the properties of high adaptability and high error tolerance together with efficient and reliable classification performance (Babaee et al., 2018). An ANN is a computing system that, in terms of its structure, mode of operation, and capacity for learning, mimics the biological human brain. Numerous neurons are coupled together in an ANN in order to process activities and provide useful outputs (data-flair, [Accessed 12/9/2022]). The brain's structure adapts to present problems by learning from prior experiences. The input, hidden, and output layers are the three layers that make up an ANN. Many nodes, or neurons, are present in each layer. Each hidden-layer or output-layer neuron is connected to the neurons of the preceding layer. The functionality of the three layers demonstrates how the neural network learns. By adjusting the weights of the connections during model training, the neural network feeds the input data forward until it matches the input data with the intended class. ANNs are applied when high dimensionality, noise, non-linearity, and inaccuracy occur in the data (Govender et al., 2012).

In an artificial neural network, each neuron receives a number of input signals X_i through its connections. A set of real-valued weights W_i describes the connection strengths. The neuron's activation level, \sum_i W_i X_i, is determined by the cumulative strength of its input signals. A threshold function f computes the neuron's final output state (Tsantekidis et al., 2022).

Figure 2. 1: Architecture of Artificial Neural Network

2.12.1.4. Support Vector Machine (SVM)

SVM is also a widely used approach for audio identification. In fact, SVM is more widely used for audio classification than for identification, but for completeness we discuss it here. It is a statistical learning algorithm for classifiers. SVM is used to solve many practical problems such as face detection, three-dimensional (3-D) object recognition, and so on. SVM is a statistics-based pattern classification technique introduced by Vapnik (1995). SVM is based on the concept of structural risk minimization (SRM). A learning machine's risk (R) is bounded by the sum of the empirical risk (Remp) and a confidence interval ψ, i.e. R ≤ Remp + ψ (Avendaño-Valencia & Fassois, 2015). SVMs utilize a kernel function to map a nonlinearly separable vector into a higher-dimensional space to make it linearly separable. The idea of decision planes, which establish the decision boundaries, serves as the foundation for the SVM's operating principle. A decision plane is a boundary that distinguishes between collections of objects with different class memberships. Data from two classes can always be separated by a hyperplane given the right nonlinear mapping to a high enough dimension. Single SVMs are binary classifiers that can be extended by integrating several of them together for solving multiclass problems (David and Lerner, 2004).

To find the ideal separating hyperplane between two classes of data, the SVM method is applied. Assume dataset T has two separable classes and a total of k samples, where these samples are represented as (x1, y1), (x2, y2), …, (xk, yk). The class label is represented as y ∈ {-1, 1} and is the binary value of the two classes; x ∈ R^n, where R^n is an n-dimensional space.

Figure 2. 2: SVM Class Classification

2.12.1.5. Convolutional Neural Network (CNN)

Convolutional Neural Networks (CNNs) are a variant of feed-forward artificial neural networks that typically operate on data with a known grid-like topology in 1, 2, or 3 dimensions (Narayan & Gardent, 2020). The Convolutional Neural Network (CNN) is a particularly powerful deep learning method that was introduced by (LeCun et al., 1989). A CNN is a neural network that automatically extracts helpful features from data points without manual fine-tuning. Similar to a typical multilayer neural network, a CNN is made up of one or more convolutional layers followed by one or more fully connected layers. CNN is a feed-forward neural network technique used in computer vision and audio applications to comprehend and extract image attributes of objects.

In a variety of application areas, including object detection, audio and video processing, natural language processing, music instrument recognition, and speech recognition, CNNs produce better results. CNNs are computationally efficient for image processing because they extract the features of the input data automatically, without any human interference. CNN-based models such as AlexNet, VGG, or ResNet have been used for sound classification (Simonyan & Zisserman, 2015).

VGG16 consists of 13 convolution layers with 3x3 kernels, 5 Max Pooling layers with
pool size 2x2 filters, 2 fully connected layers, and finally a softmax layer.
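As a hedged illustration only, the following Keras sketch shows how a VGG16 convolutional base could be instantiated for 224 x 224 spectrogram inputs and topped with a five-class Softmax head; the head sizes and the choice of untrained (random) weights are assumptions for this sketch, not the exact configuration used in this thesis.

from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

# Convolutional base of VGG16 (13 conv layers plus 5 max-pooling layers), without the original top.
base = VGG16(weights=None, include_top=False, input_shape=(224, 224, 3))

# Attach a small classification head; the 5 output units correspond to the five zema classes in this work.
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(5, activation="softmax"),
])
model.summary()   # prints the layer structure described in the text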

2.12.1.5.1. CNN architectures

In fully connected layers, each unit (neuron) is connected to all of the units in the previous layer. In CNNs, however, each unit is connected to only a small number of units in the previous layer, and all units are connected to the previous layer in the same way, with the same weights and structure. CNNs use the convolution operation instead of general matrix multiplication in at least one of their layers. This is unlike other neural networks, where every neuron in a layer is fully connected to the layer above it; in convolutional neural networks, just a portion of the neurons in the next layer is connected to the neurons in the layer above.

The basic layers of a CNN are convolution layers, pooling layers, and fully connected layers.

Figure 2.3: The architecture of convolutional neural network

The layer that gives the network its name is the convolutional layer. This layer first extracts the input image's features. It carries out a process known as convolution. In the context of CNNs, convolution is a linear operation involving element-wise multiplication between the input image and the filter. The main process in the convolutional layer is the convolution operation. A filter is a small matrix used to detect patterns in an input image; the filter values are initialized with random numbers using different methods.

The pooling layer simplifies the information obtained from the output of the convolution layer. It receives the feature output from the convolution layer and produces a condensed feature output. It is used to gradually reduce the size of the input representation; therefore, it reduces the number of parameters required and the amount of computation needed, and it controls overfitting. The output of the pooling procedure is invariant to minor translations of the input data, which is helpful for computational efficiency and also enhances the network's resiliency. In other words, even if a learned feature is slightly translated in the input, the network can still pick it up if a convolution layer is followed by a pooling layer. Pooling is frequently utilized in CNN designs since a feature's precise position is rarely as crucial as the fact that it exists at all. A common type of pooling is max-pooling, where the maximum value in a window is taken as the output representing that region. As in the convolution case, a window of size w x h is swept over the input, producing the output.

CNNs most commonly use ReLU as the activation function (Bahuleyan, 2018). It is not sufficient to merely stack many linear transformations on top of one another in order to model complex interactions, because this causes the layers to collapse into one. We need to include nonlinearities in our models because convolution is a linear operation and the matrix multiplication in a fully connected layer is likewise linear (the bias terms can also be added by appending a bias dimension of ones to the input). By applying an activation function to each output of a layer, we avoid collapsing the model and hence make it much more powerful. The most widely used activation function is the rectified linear unit (ReLU) (Tekniska Högskola et al., 2018).

It is defined as: f(x) = max(0, x)   (4)

Non-linear activation is useful for improving the classification and learning capabilities of the network. POOL layers perform non-linear downsampling operations aimed at reducing the spatial size of the representation while simultaneously decreasing the number of parameters, the possibility of overfitting, and the computational complexity of the network. ReLUs have some advantageous properties, such as efficient computation and efficient gradient propagation.

A fully connected layer is the only layer in the model where every neuron in the previous layer is connected to every neuron in the next layer. It classifies the input image based on the training data, using the features produced by the previous layer's output. Since these features are just arrays of numbers, we can feed them into a fully connected neural network (in which all neurons are connected). The layers are fully connected, or dense, in the sense that no parameter sharing takes place: every parameter in the fully connected layer interacts with its own part of the input, in contrast with convolutional layers, where a parameter in one kernel filter interacts with many different data points in the input. Fully connected layers also usually serve to condense the output of the network into a final vector of the same size as the number of classes the model can distinguish between.

It is common practice to insert a POOL layer between CONV layers. Typical pooling functions are max and average. FC layers have neurons that are fully connected to all the activations in the previous layer and are applied after the CONV and POOL layers. In the higher layers, multiple FC layers and one CLASS layer perform the final classification. A widely used activation function in the CLASS layer is SoftMax.

The usual choices of last-layer activation function for various types of task are summarized in Table 2.1.

Table 2. 1: Summary of comparison between activation functions

Task / property                                                      Activation function
Binary classification                                                Sigmoid
Hidden layers (better than sigmoid; output lies between -1 and 1)    Tanh
Complex relationships between classes                                ReLU

2.13. Evaluation Technique

Metrics are used to validate the implementation of proposed solutions. Various metrics
were used to quantify the performance of the proposed model. These are accuracy,
precision, recall and f1 score. Before we discuss metrics, we need to know some
terminology. True positives (TP) refer to the number of positive data points correctly labeled by
the classifier. A true negative (TN) is a negative data point correctly labeled by the classifier. A
false positive (FP) is a negative data point falsely labeled as positive, and a false negative (FN) is
a positive data point falsely labeled as negative.

To evaluate our models, we chose the following measures of success, or evaluation metrics; a short scikit-learn sketch computing them follows the list:

a. Accuracy: the percentage of sound clips that were correctly categorized for a certain genre. The formula is as follows:

   \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}   (4)

b. Precision: can be considered a metric of exactness. In other words, it measures the proportion of predicted positives that are genuinely positive.

   \text{Precision} = \frac{TP}{TP + FP}   (5)

c. Recall (sensitivity): is a metric of completeness, i.e., the proportion of actual positives that are correctly identified as such.

   \text{Recall} = \frac{TP}{TP + FN}   (6)

d. F1 Score (harmonic mean of precision and recall): is a measurement that combines precision and recall into one value, giving equal weight to each. It is defined as follows:

   F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}   (7)

e. Confusion Matrix: a useful n × n table that helps visualize the performance of a classification machine learning model. There are four different combinations of predicted and true values. For a given genre label g and a sound clip s, TP (true positive) is when s is correctly classified as g; FP (false positive) is when the model classified s as g but its true genre is different; FN (false negative) is when the model did not assign g to the sound clip even though its true genre label is g; and TN (true negative) is when the model correctly identified that s does not have genre label g (Lau, 2020).
f. P/R AUC (area under the precision-recall curve): since the accuracy metric alone cannot adequately assess a classifier, we introduce a complementary metric known as the area under the precision-recall curve. The precision-recall curve summarizes the trade-off between the true positive rate and the positive predictive value of the predictions. The area under the curve summarizes the integral, or an approximation, of the area under the precision-recall curve.
g. Macro Average: the average obtained by giving each class the same weight and dividing by the total number of classes. The formula is as follows:

   \text{Macro Avg} = \frac{1}{N}\sum_{i=1}^{N} M_i   (8)

   where M_i is the per-class value of the metric and N is the number of classes.

h. Weighted Average: the average obtained by multiplying each class's metric by its weight (the number of samples in that class) and dividing by the total number of samples. The formula is as follows:

   \text{Weighted Avg} = \frac{\sum_{i=1}^{N} w_i M_i}{\sum_{i=1}^{N} w_i}   (9)

   where w_i is the number of samples in class i.
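As an illustration, the following Python sketch (using scikit-learn and placeholder label lists standing in for real model predictions) computes these metrics.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, classification_report)

# y_true and y_pred are placeholder label lists, not real experimental outputs.
y_true = [0, 1, 2, 2, 1, 0, 3, 4, 4, 3]
y_pred = [0, 1, 2, 1, 1, 0, 3, 4, 3, 3]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="macro"))
print("recall   :", recall_score(y_true, y_pred, average="macro"))
print("F1 score :", f1_score(y_true, y_pred, average="weighted"))
print(confusion_matrix(y_true, y_pred))

# classification_report prints per-class metrics plus the macro and weighted
# averages defined in Equations (8) and (9).
print(classification_report(y_true, y_pred))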

2.14. Overfitting

Overfitting refers to the situation in which a model learns statistical regularities specific to the training set, ending up memorizing irrelevant noise instead of the underlying signal, and subsequently performing poorly on new data. Overfitting occurs frequently in neural networks: a neural network is more likely to overfit if it is more expressive (has more layers and weights). This is one of the main challenges in machine learning, as an overfitted model is not generalizable to never-before-seen data. In that sense, a test set plays an essential role in the proper performance evaluation of machine learning models, as discussed in the previous section. A straightforward check for recognizing overfitting to the training data is to monitor the loss and accuracy of the training and validation sets. If the model performs much better on the training set than on the validation set, then the model has likely been overfitting to the training data.

Several approaches have been proposed to minimize overfitting. The best solution for reducing overfitting is to obtain more training data: a model trained on a larger dataset typically generalizes better. However, that is not always possible with audio data. Other solutions include regularization with dropout or weight decay, batch normalization, and data augmentation, as well as reducing architectural complexity. Dropout is a regularization technique in which randomly selected activations are set to 0 during training so that the model becomes less sensitive to specific weights in the network (Hinton et al., 2012).

The following are the most common ways to fight overfitting (Miceli et al., 2018); a brief Keras-style sketch of dropout and early stopping is given after this list.

Reducing the network's size: the number of learnable parameters in the model, which is determined by the number of layers and the number of units in each layer, is what we refer to as the "network size".
Adding weight regularization: requiring the layer's weights to take only small values, which results in a more uniform distribution of weight values. This is accomplished by including in the network's loss function a cost for using big weights.
Early stopping: early stopping is one technique for teaching neural networks to pick up only the signal and disregard the noise. The majority of the signal is located in the overall shape, and sometimes the color, of an image, whereas a significant quantity of noise is found in its finer details. Therefore, the simplest and cheapest regularization method is to stop training when the network starts getting worse, i.e., early stopping (if we want to ignore the fine-grained details and capture only the general information present in the data).
Adding dropout: several of a layer's output features are turned off, or randomly removed, during training (set to 0). By randomly training small network segments at a time, dropout makes a large network behave like a small one, and small networks do not overfit. This is because small neural networks do not have much expressive power: they cannot learn the more granular details (noise) that tend to be the source of overfitting and have room to capture only the big, obvious, high-level features.
Data augmentation: when only limited training data is available, as is common in computer vision, it is a potent method for reducing overfitting.
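The following is a minimal, hedged Keras-style sketch of two of these countermeasures, dropout and early stopping; the layer sizes, dropout rate, and training data are placeholders, not the configuration used in this work.

from tensorflow.keras import layers, models, callbacks

# A toy classifier head with dropout between dense layers.
model = models.Sequential([
    layers.Input(shape=(128,)),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),              # randomly disables half of the activations during training
    layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Early stopping: halt training once the validation loss stops improving.
early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True)

# x_train, y_train, x_val, y_val are placeholders for real feature/label arrays:
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[early_stop])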

2.15. Related work

Nowadays aquaquam zema has become popular in the EOTC as well as in the country, as a base of modern education and music. Most spiritual activities are performed with aquaquam zema, and popular musicians originate from this zema even though it does not teach music; they replace their spiritual task with music. However, there is no prior work on aquaquam zema genre classification, so most of the related literature concerns music genre classification. A few studies have been done on Saint Yared zema notations, instrumental Saint Yared song identification and classification, and music genre identification and classification (Kasehun, 2021) (Yeshanew, 2020). A work on automatic pentatonic scale identification of the Begena identified well-known Ethiopian traditional lyre (Bagana) song scales, such as Selamta, Wanen, Chernet, and Bati Major (wendimu, 2020); however, it uses only instrumental zemas for genre classification. A research work titled EOTC drum sound classification was proposed; this work focuses on drum sound to predict the mahelet zema type using GLCM, CNN, and hybrid techniques with an SVM algorithm. However, this work uses instrumental zema data with frequency-based feature extraction (Kasehun, 2021). The research in (Asim & Ahmed, 2017) classifies music genres using K-NN and SVM classifiers with MFCC images of the audio signal, achieving accuracy rates of 64% and 77%, respectively; however, the researchers focus on acoustic features and manual feature extraction. (Lau, 2020) carried out an experiment on deep-learning music genre classification using a convolutional neural network approach against five traditional off-the-shelf classifiers. Feature selection included both spectrograms and content-based features. The classifiers were evaluated on the popular GTZAN dataset, and the experiments showed about 66% accuracy for each classifier on the test data.

(Choi & Sandler, n.d.) use the MagnaTagATune dataset to apply a content-based automatic music tagging technique utilizing fully convolutional neural networks (FCNs), and the reported result is 89% over the tags utilized. (Tzanetakis & Cook, 2010) explore the classification of audio signals into musical genre hierarchies, as they believe music genres are categories created by people to label pieces of music based on similar characteristics. They propose three feature sets representing rhythmic, timbral texture, and pitch characteristics. The following supervised machine learning classifiers were adopted for their experiments: k-Nearest Neighbour and Gaussian Mixture models. They achieve a classification accuracy of 61% on the GTZAN dataset.

(Vishnupriya & Meenakshi, 2018) present a new approach for audio classification, using a CNN for both training and classification with MFCC features of the audio signal; the result is 76% classification performance. These studies were carried out in various fields and disciplines using various algorithms, strategies, and methodologies to improve performance and accuracy; they are summarized in Table 2.2 below.

Table 2. 2: Summary of related work

1. Nasrullah, Zain; Zhao, Yue (2019). "Music Artist Classification with Convolutional Recurrent Neural Networks". Problem: artist classification is based only on frame-level features. Method: CRNN for classification. Result and limitation: uses only instrumental artistic data.
2. Kasahun A. (2021). "Ethiopian Orthodox Tewahido Church Drum Sounds Classification Using Machine Learning". Problem: previous researchers focused only on western music classification; spiritual instrumental music sound has received no attention. Method: LPCC and MFCC features with GMM prediction. Result and limitation: uses samples of drum sound; prediction accuracy 82.4%.
3. Muhammad Asim Ali, Zain Ahmed Siddiqui (2017). "Automatic Music Genres Classification using Machine Learning". Problem: most researchers focus only on spectral statistical and timbre features for western music classification. Method: K-NN and SVM for classification with MFCC features for each audio. Result and limitation: focuses on acoustic features and manual feature extraction.
4. Birku L. (2021). "Saint Yared kum zema classification using convolutional neural network". Problem: Yared zema is the base of music but has received no attention; the declining number of scholars motivates classification. Method: spectrogram features with CNN for feature extraction and classification. Result and limitation: classification of instrumental zema remains a gap; uses only a kum zema dataset; result 88% accuracy.
5. Alan Kai Hassen, H. Janben, D. Assenmacher, Mike P. (2018). "Classifying Music Genres Using Image Classification Neural Networks". Problem: other researchers use handcrafted features and small network layers. Method: spectrogram features with a ResNet18 classification model. Result and limitation: uses only the instrumental GTZAN dataset; result 84.7% accuracy.
6. Vishnupriya S., K. Meenakshi (2018). "Automatic Music Genre Classification using Convolution Neural Network". Problem: data distribution is the main issue for researchers; previous studies use only 4 classes with no music data distribution. Method: Mel spectrum and MFCC feature extraction with a CNN algorithm for classification. Result and limitation: accuracy of 76% with 10 classes; uses instrumental music, which remains a gap.
7. Oramas, S.; Nieto, O.; Barbieri, F.; Serra, Xavier (2017). "Multi-label music genre classification from audio, text, and images using deep features". Problem: multimodal classification of music; no previous work combines multimodal (audio, text, visual) features. Method: multi-level classification with SVM and CNN approaches. Result and limitation: uses traditional handcrafted features; does not consider the visual sound class.

2.16. Conclusion

In this chapter, comprehensive literature has been reviewed on aquaquam zema, music information retrieval, genre classification, audio representation, and digital image processing. It explained how audio signals are generally represented as images through pre-processing stages. The algorithms and approaches used for feature extraction and classification have been addressed in this context. Furthermore, parameters and components such as deep features, content-based features, the spectral centroid, LSTM, SVM, ANN, and CNN were described in detail. Top-down concepts of the different layers of a CNN and its components have been explained. Finally, the different evaluation metrics and related works were presented.

CHAPTER THREE: METHODOLOGY

3.1. Introduction

The rapid growth of digital technology has created many media files such as phone calls, music files, ambient sounds, audio recordings of meetings, voicemails, and radio and television broadcasts. Manual indexing is therefore unreliable and very time-consuming, so an effective automated content analysis and search system is important to deal with this overabundance of content. Segmentation and classification of audio signals is a very difficult task due to the presence of non-stationarities and discontinuities in audio signals. Automatic music classification and annotation are still viewed as challenging tasks due to the difficulty of extracting and selecting optimal audio features. This chapter introduces all the tasks associated with the proposed model architecture, describes the phases of the process as clearly as possible, and describes each aspect of the work, covering sound processing, feature extraction, and finally the assignment of zema to the target classes.

3.2. Model Architecture

The proposed model architecture has the following stages: audio data acquisition, preprocessing, segmentation, audio spectrogram generation, feature extraction, and classification. During the audio data collection phase, we collected audio files covering all the activities needed to turn the initial raw data into the final dataset. The audio data preprocessing phase normalizes the audio into an understandable format. The audio segmentation phase divides the audio signal into homogeneous segments. At the spectrogram stage, we visualize each homogeneous segment of the given audio as an image. The feature extraction phase uses deep feature extraction to find associations between different objects in the spectrogram image. Finally, for classification, the researcher used a CNN architecture with an input layer, convolutional layers, and a fully connected layer, followed by a SoftMax classifier to classify the Mel spectrograms into the appropriate classes. The model is constructed by training on the training dataset for the specified classes, while the validation dataset is utilized to validate the model parameters. A test dataset is also used to evaluate the performance of the trained model.

The proposed model is shown in Figure 3.1. Details for each phase are provided in the
sections that follow.

Figure 3. 1: the proposed model architecture for aquaquam zema classification

This model was developed to classify aquaquam zema into five correct classes. It includes activities ranging from voice input to classification: spectral gating is used for noise reduction to improve audio signal quality and reduce unwanted signals, feature extraction extracts time-frequency features, and finally, for classification, a grid search technique is applied to select the CNN parameter values. The architecture is designed based on audio and image generation and classification procedures.

3.2.1. Aquaquam Zema Acquisition

Collecting voice data is the primary objective of this phase, as it is impossible to conduct the study without collecting voice data from various sources. Speech data acquisition is the process used to prepare data for the classification model. The required audio files were obtained from EOTC specialist spiritual schools as well as from spiritual websites. We acquired audio recordings of zema from aquaquam scholars of the Bahir Dar Abune Gebremenfes Kidus church and the Andabet Debre Mihret Kidest Kidanemiheret church, and from the popular websites https://www.ethiopianorthodox.org and http://debelo.org. In order to minimize noise when recording audio, we used a smartphone recording in night mode with Step Voice Record, in a location not easily affected by external sounds and electromagnetic waves.

The collected data is manually classified into a proper class by Andabet Debre Mihret
Kidest Kidanemiheret Church and Bahir Dar Abune Gebremenfes Kidus church
Aquaquam scholars. After that all labeled recorded data was checked by Melake Mihret
Tewubo Ayene, Lealem Getahun, Ashenafi Tadesse, and Kelemework from Andabet
Debre Mihret Kidest Kidanemiheret Church Aquaquam scholar and Bahir Dar Abune
Gebremenfes Kidus church Aquaquam scholar. The collected records are tested and
grouped into the five main classes manually by experts in Andabet Debre Mihret Kidest
Kidanemiheret Church at Andabet.

Table 3.1: Sources of the audio dataset and number of audio zema recordings

No.  Source of Record
1    Andabet Debre Mihret Kidest Kidanemiheret Church Aquaquam scholars
2    Bahir Dar Abune Gebremenfes Kidus church Aquaquam scholars
3    https://www.ethiopianorthodox.org
4    http://debelo.org

Type of Record    Number of Recorded Data
Zimame            280
Kum               340
Meregd            341
Tsifat            310
Amelales          273
Total             1544 (from 4 sources, 5 zema types)

Each sample is saved in WAV, MP3, or AMR file format at a sampling rate of 22050 Hz and a 16-bit depth, because most previous works used these settings; to reduce redundancy in the signal, a mono channel is used (Parida et al., 2022). After that, all these data are properly preprocessed and the necessary features are extracted. In total, 1544 audio recordings were collected from spiritual school professionals and websites. The developed CNN classifier is applied to fixed-size segments of the audio recordings to generate a set of class labels that characterize the overall signal. Each audio recording is about 60 seconds long. The dataset is split into training and testing sets for constructing the zema classification model: 80% of the audio samples of each zema type are used for training and the remaining 20% are used for testing.

3.2.2. Preprocessing

Audio quality is one of the fundamental issues that must be addressed before proceeding to the other learning or training steps (Wilson, AD and Fazenda, 2013). Data are often obtained from manual recordings, which are usually unreliable and in different formats. When tackling machine learning problems, more than half of the time is spent ensuring data quality. The initial raw data can contain various issues such as noise, distortion, and other extraneous signal details that can affect model performance. The downloaded data were also manually converted and filtered to ensure all audio files are in the same format: most of the recorded audio files are in WAV format, while the websites allow downloading files in MP3 format only, and the recorded data can also be obtained in AMR and WAV formats. The audio data for this investigation was gathered from different sources and is commonly not suitable for direct use in training the model (Mamun et al., 2017).

Preprocessing also focuses on segmenting the audio file at equal time intervals so that spectrogram images can be generated properly and accurately.

A detailed description of the characteristics of the audio used in this study is provided in the literature review chapter. This section describes the preprocessing activities performed on the audio data. After the audio recordings have been verified by experts in the area, the process of preparing reliable recordings for the next step of genre classification proceeds as follows:

3.2.2.1. Noise Removal

Noise removal is the process of deleting unnecessary sounds found in the recording, such as people's conversation, unwanted environmental sound, instrumental sound, and repetitions. For any audio processing system, noise reduction comes first. For this reason, the noise signals in the zema sounds are reduced before creating the Mel-spectrogram image. In order to eliminate undesirable signals depending on the frequency content of the sound's components, digital filtering techniques are required (Seed et al., 2020). For this preprocessing we must distinguish whether an audio segment is silent (no speech is produced), unvoiced (the vocal cords are not vibrating, resulting in an aperiodic or random speech waveform), or voiced (the vocal cords are tensed and hence vibrate periodically).

Noise reduction algorithms aim to change signals to a greater or lesser degree. The common methods for removing or reducing noise are optimal linear filtering methods; some algorithms in this family are Wiener filtering, Kalman filtering, and the spectral subtraction technique, in which a filter or transformation is applied to the noisy signal (H.E.V et al., 2007) (Natrajan & Nadu, 2017). In this work, the spectral subtraction noise reduction technique was applied to each sample of audio data.
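As an illustration of the idea (not the exact implementation used in this work), the following Python sketch performs a basic magnitude spectral subtraction with librosa, assuming the first few STFT frames contain only background noise and using hypothetical file names.

import numpy as np
import librosa
import soundfile as sf

# Load the noisy recording (file name is hypothetical).
y, sr = librosa.load("noisy_zema.wav", sr=22050)

# STFT of the signal.
stft = librosa.stft(y, n_fft=2048, hop_length=512)
magnitude, phase = np.abs(stft), np.angle(stft)

# Estimate the noise magnitude spectrum from the first few frames,
# which are assumed to contain background noise only.
noise_estimate = magnitude[:, :10].mean(axis=1, keepdims=True)

# Subtract the noise estimate and clip negative values (basic spectral subtraction).
clean_magnitude = np.maximum(magnitude - noise_estimate, 0.0)

# Rebuild the waveform with the original phase and save it.
clean = librosa.istft(clean_magnitude * np.exp(1j * phase), hop_length=512)
sf.write("denoised_zema.wav", clean, sr)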

Figure 3.2: Shows the steps to reduce noise from the original audio file (Ibrahim et al.,
2017).

3.2.2.2. Segmentation of Audio

Segmentation literally means dividing a given object into parts (or segments) based on a defined set of characteristics. It is required so that the model can classify the zema signal by learning from the audio features extracted from the segmented audio, because minimizing the length of the sample recordings helps to reduce the time taken for feature extraction (Frehiwot Terefe, 2019). Another advantage of segmentation is that it is an important preprocessing step, especially for audio data analysis, because it lets us split a noisy and lengthy audio signal into short homogeneous segments (handy short sequences of audio) which are used for further processing.

Segmentation is applied here in a supervised mode on discrete audio segments. The technique used to segment a given long recording into homogeneous segments is a thresholding method, meaning that a fixed time interval is assigned and the audio is chunked accordingly. Because the features are tied to the segmentation boundaries, a segment length of 20 seconds was used. Below is sample pseudocode showing how the longest audio files are segmented into several segments.

Table 3. 2: Pseudo code for segmenting audio

Input: long audio file
Output: segmented audio files
Begin:
    Read the long audio file from the folder
    Assign the segment size            // equal to 20 seconds
    Cut the audio into equal-length segments
    Return the segmented audio
End
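A minimal Python sketch of this segmentation step is given below, assuming librosa and soundfile are available and using hypothetical file names; it simply cuts the loaded signal into consecutive 20-second chunks.

import librosa
import soundfile as sf

SEGMENT_SECONDS = 20          # segment length used in this work
INPUT_FILE = "long_zema.wav"  # hypothetical long recording

y, sr = librosa.load(INPUT_FILE, sr=22050, mono=True)
samples_per_segment = SEGMENT_SECONDS * sr

# Cut the long signal into consecutive 20-second chunks and save each one.
for i in range(len(y) // samples_per_segment):
    start = i * samples_per_segment
    segment = y[start:start + samples_per_segment]
    sf.write(f"segment_{i:03d}.wav", segment, sr)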

3.2.2.3. Waveform Representation

The wavelet transform is a transformation that can be used to analyze the spectral and temporal properties of non-stationary signals such as audio files. The waveform displays two-dimensional measurements of sound: time and amplitude (volume). Additionally, it displays the start and end of each sound. It shows the distribution of sample values across time and is crucial for comparing the loudness of the sounds before and after the noise is reduced (Tekniska Högskola et al., 2018).

We use the librosa library to generate the waveform of each audio file. However, it is difficult to imagine what the audio actually sounds like from a waveform alone.

3.2.3. Spectrogram Generation

Different audio classification techniques exist nowadays for implementing deep learning convolutional neural networks on audio signals; basically, two main mechanisms are mostly used. One mechanism works directly on the audio signal and extracts features from the signal, whereas the other converts the audio data into a spectrogram image and extracts from that image the features that enable us to classify each zema into its proper class. In this work we use spectrogram images to represent the collected aquaquam zema audio for training and testing our model: first, the WAV-file audio signals in our dataset are converted from a one-dimensional signal into a two-dimensional spectrogram image and saved as an image file in PNG format.

A spectrogram is a visual depiction of the spectrum of frequencies of an audio signal as it varies with time; hence it includes both the time and frequency aspects of the signal. A spectrogram is a representation of frequency content over time found by taking the squared magnitude of the short-time Fourier transform (STFT) of a signal (McFee et al., 2015). It is obtained by applying the STFT to the signal; in the simplest terms, the STFT of a signal is calculated by applying the Fast Fourier Transform (FFT) locally on small time segments of the signal. We use the librosa library to transform each audio file into a spectrogram. The Mel spectrogram is a portrayal of the sound in terms of time and frequency, fragmented into bins that distribute frequencies and times on the Mel frequency scale. A short-time Fourier transform is applied to the raw audio signal of every recording to create the spectrograms. The STFT is defined as follows:

\text{STFT}\{x(n)\}(m, \omega) = \sum_{n=-\infty}^{\infty} x[n]\, w[n - m]\, e^{-j\omega n}   (10)

where x[n] is the input signal and w[n] is the window function (Nasrullah & Zhao, 2019). Spectrograms are two-dimensional graphical representations of an audio signal: the x-axis represents time and the y-axis represents frequency. Audio signals can also be converted into Mel spectrograms, where the y-axis represents Mel frequency bins instead.

Figure 3.3 illustrates how a spectrogram captures both frequency content and temporal
variation for 20 second audio samples in the zema dataset.

Table 3. 3: Algorithm for generating the Mel-spectrogram image

i. Result: spectrogram image
ii. Input: audio signal x and sampling rate (Sr)
iii. Apply Butterworth noise removal
iv. Frame the signal into short frames using a Hamming window
v. Apply the STFT to each frame
vi. Convert the frequency scale to the Mel scale
vii. Apply logarithmic compression relative to maximum loudness
viii. Return the image
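The algorithm above could be realized with librosa roughly as in the following sketch; note that the Butterworth noise-removal step is omitted here, and the figure size, FFT parameters, and file names are illustrative assumptions.

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("segment_000.wav", sr=22050)

# Mel power spectrogram (STFT -> mel filter bank), then log compression to decibels.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

# Save the spectrogram as a PNG image with no axes, ready to be resized and fed to the CNN.
fig = plt.figure(figsize=(3.6, 3.6), dpi=100)   # approximately a 360 x 360 pixel image
plt.axis("off")
librosa.display.specshow(mel_db, sr=sr, hop_length=512)
plt.savefig("segment_000.png", bbox_inches="tight", pad_inches=0)
plt.close(fig)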

Figure 3. 3: Shows spectrogram images for each type of music genre

3.2.4. Image Resize

The generated spectrogram images have a size of 360 x 360 pixels. However, large images not only occupy more space in memory but also result in a larger neural network, thus increasing both the space and time complexity. The generated images are therefore resized to 224 x 224, which is used as the input size of the proposed model to improve performance. An image size of 224 x 224 (height and width) gave higher accuracy and was more convenient for the model compared with other sizes, so it was selected as the image size for the generated spectrogram images. A spectrogram has an inherent trade-off: if the resolution in the time domain is high, the frequency resolution decreases, and if the frequency resolution is high, the time resolution is lowered; that is, the spectrogram suffers from a conflicting window effect. To mitigate this drawback, the resolution of the 2D spectrogram data with respect to time was maintained through image processing, and the data dimensions were reduced before extracting the features. A 2D image resizing method was used to reduce the image size of the spectrogram, which is composed of 2D data. To reduce the image size, bi-cubic interpolation was applied, and the resolution of the spectrogram with respect to time was maintained. Bi-cubic interpolation is obtained by executing cubic interpolation along the x- and y-axes. The bi-cubic interpolation resampling technique was selected for image resizing because it is effective in a wide range of image processing applications. The interpolated value at a point (x, y) is

p(x, y) = \sum_{i=0}^{3} \sum_{j=0}^{3} a_{ij}\, x^{i} y^{j}   (11)

where the coefficients a_{ij} are determined from the 4 x 4 neighborhood of the point.

Table 3. 4: Image resize algorithm

Result: resized image
i. Input: a spectrogram image I
ii. image_data = original spectrogram image I
iii. I = im.resize(m, m)
iv. Return I
v. End
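A minimal Python sketch of this resizing step, assuming the Pillow (PIL) library and hypothetical file names, is shown below.

from PIL import Image

# Open the generated spectrogram image and resize it to the model input size
# using bi-cubic interpolation, as described above.
img = Image.open("segment_000.png").convert("RGB")
resized = img.resize((224, 224), Image.BICUBIC)
resized.save("segment_000_224.png")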

3.3. Feature Extraction

CNNs and other deep learning algorithms can use different features to classify audio genres. The features that can be extracted from audio include Mel spectrograms, Mel frequency cepstral coefficients (MFCC), the spectral centroid, spectral rolloff, the zero crossing rate, and so on; all of these features are important for characterizing each class. In this study, we extracted Mel spectrograms from the audio waveforms, since they are regarded as a detailed and accurate representation of audio signals. Because they have a modest computational cost, are relatively simple to extract, and provide reliable performance, it is common to extract such features specifically for use with a CNN. Sub-band energy separation, multiple frequency band demodulation, and other spectrogram (spectral-temporal) elements are features for evaluating speaker qualities at variable time-frequency resolutions.

Spectrogram features are learned by deep neural networks (DNNs): the Mel spectrogram, or any other relevant audio feature, is fed to the DNN as input, and the machine learns the features needed to understand and classify the Mel spectrogram into the appropriate classes. Feature extraction is the process of computing a compact numerical representation of a segment of an audio signal. Feature extraction is the basis of classification, because the extracted features are what distinguish the classes in this research. The Mel spectrogram was chosen for this model because of the performance it has shown in different NLP and audio classification problems (Panwar et al., 2017). Researchers report that Mel-spectrogram characteristics are good for sound recognition and audio classification (Badshah et al., 2017); in addition, the Mel scale has been shown to be similar to the human auditory system (Doob, 1957). Feature vector extraction is done using the librosa package in Python, which is specifically designed for audio analysis.

3.4. Convolutional Neural Network (CNN) as Feature Extraction and Classification

CNNs are neural networks designed to process data that is available in the form of multidimensional arrays (grids): signals and sequences as one-dimensional arrays, images and audio spectrograms as two-dimensional arrays, or video as three-dimensional arrays (Lecun et al., 2015). In this study we use a CNN for both feature extraction and classification because, as mentioned in the literature review, CNNs are powerful for audio and image classification. Convolutional neural networks (CNNs) have been widely used for the task of image classification (Hinton et al., 2012): the 3-channel (RGB) matrix representation of an image is fed into a CNN, which is trained to predict the image class. In this study, the sound wave is represented as a spectrogram, which in turn can be treated as an image (Nanni et al., 2017) (Lidy & Schindler, n.d.). The task of the CNN is to use the spectrogram to classify the genre (one of five classes). The audio signals are therefore transformed into spectrogram images in order to employ a CNN for audio.

The inputs for CNN-based feature extraction are spectrogram images of size M x N. Each image travels through the CNN's input, hidden, and output layers. Consequently, by applying convolution, activation, and pooling operations to the Mel-spectrogram image, the pertinent feature information is produced to categorize the aquaquam zema.

Once the spectrograms are generated, they are sent as input to the classification model. The classification model classifies the spectrograms into the proper genre classes: zemamie, tsenatsel, mereged, tsifat, and amelales. CNNs are built by repeatedly concatenating five classes of layers: convolutional (CONV), activation (ACT), and pooling (POOL) layers, followed by a last stage that typically contains fully connected (FC) layers and a classification (CLASS) layer. The CONV layer performs feature extraction by convolving the input with filters. After each CONV layer, a non-linear ACT layer is applied. As we have mentioned, VGG is one of the earliest CNN models used for signal processing. It is well known that the early CNN layers capture general characteristics of sounds, such as wavelength and amplitude, while later layers capture more specific characteristics such as the spectrum and the cepstral coefficients of the waves. This makes a VGG-style model suitable for the MIR task. In this work we use a VGG16-based CNN model for both feature extraction and classification of our aquaquam zema.

3.5. Training

In this phase, the audio data is presented in a visual format and features of the audio
file are extracted. Audio files are used to generate spectrograms, the transformed images
are fed directly into the CNN to learn features, different layers filter the spectrograms,
and the predefined classes are used for classification. 1210 spectrogram images were
generated from all the audio files in the dataset; seventy percent of this data was used
for training and the rest was used in the testing phase.

Convolution layer: Convolution layers apply a definite number of convolutional filters
to the spectrogram of an audio signal; the output of this layer is called a feature map.
There are several convolution layers in the training phase. The proposed CNN model
was developed by the researcher using five convolutional layers. A 224 × 224 × 3
picture is the input to the first convolution layer. This size was chosen arbitrarily, but it
lies within the range in which CNNs perform best in the literature, which is 64 up to
360. These are simply neural networks that use convolutional layers, also known as
conv layers, which are based on the mathematical operation of convolution.

As mentioned above, these CNN layers take an input of 224 pixels in width and 224
pixels in height, and the 3 indicates the number of (RGB) channels; the input images
thus carry the input features that are fed to the convolution layers. Even though the size
of the spectrogram image is determined by the algorithm that generates it from the
audio file, most researchers adopt this size in their work. The number of kernels is used
to control the depth of the output volumes. In our model we used 32, 64, 64, 96 and 128
filters; the number of filters increases as we go deeper toward the fully connected layers
and the Softmax classifier. We also used a 3 × 3 filter size for each layer, and to
determine the number of pixels skipped (horizontally and vertically) on each
convolution operation we used a stride of two (2, 2); since the stride determines the
output size, a stride of two halves the image size both vertically and horizontally. Filter
sizes of 5 × 5 and 7 × 7 were also experimented with, but they gave lower accuracy than
the 3 × 3 filter size.

The output size of the convolution layer after each convolution operation is expressed
by the following formula (Tekniska Högskola et al., 2018). Thus, if we have an I × I
image and an F × F filter and we convolve with a stride S and padding P, the size of the
output is:

Lp = (I + 2P - F)/S + 1 (12)

where Lp represents the output size, I the input size, F the kernel size, P the number of
zero paddings, and S the stride.
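As an illustration of Equation (12), the small helper below computes the output size for assumed values (a 224-pixel input, a 3 × 3 kernel, stride 2, and zero or one pixel of padding); these numbers are examples only, not results reported in this thesis.

def conv_output_size(i, f, p, s):
    # Output width/height of a convolution: floor((I + 2P - F) / S) + 1
    return (i + 2 * p - f) // s + 1

print(conv_output_size(224, 3, 0, 2))  # -> 111 (no padding)
print(conv_output_size(224, 3, 1, 2))  # -> 112 (one pixel of zero padding)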

Pooling layer: The width and height of the input volumes in the CNN are reduced via
pooling. Max-pooling was used to build the CNN model because it performed better
than average pooling and combined pooling operations. Max pooling reduces the
dimensionality of the data by taking the maximum value from each block. A pooling
size of two (2 × 2) and a stride of two are used after each consecutive convolution layer.
The stride determines the number of pixels skipped horizontally and vertically during
the pooling operation. For audio classification, the audio images are downsized in order
to speed up CNN classification (Costa et al., 2016). Downsizing images reduces the
number of neurons in the convolutional layers as well as the number of trainable
parameters of the network. Downsizing is accomplished by keeping only one pixel out
of every four pixels in each 2 × 2 sub-window of the image.

Activation layer: The activation function is the rectified linear unit (ReLU), except for
the neurons of the last layer, which use Softmax, as mentioned above; it is important
that the number of neurons in the last layer equals the number of classes. As is well
known, convolution is a linear operation, which is usually not enough to capture the
representations of features; thus, we employ ReLUs to achieve non-linear behavior.

The ReLU activation function is defined as f(x) = max(0, x). ReLUs naturally produce
sparse feature representations in the hidden layers, since components below 0 are cut
off. In contrast to the sigmoid, ReLUs do not saturate at 1, and for positive inputs the
derivative stays at 1 rather than shrinking toward 0, which helps avoid the vanishing-
gradient problem to some degree. ReLUs also converge faster than the traditional
sigmoid and tanh activations.
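The following two lines, using arbitrary sample values, simply illustrate the definition f(x) = max(0, x) given above.

import numpy as np

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(np.maximum(0.0, x))  # -> [0.  0.  0.  1.5 3. ]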

Fully connected layer: Fully connected layers extract global features from the local
feature maps. Training is performed using backpropagation for 100 epochs. Once
trained, the output of the fifth layer is used for feature extraction.

Our convolutional neural network architecture was built using Keras and consists of an
input layer followed by five convolutional blocks. Each convolutional block consists of
a convolutional layer with 3 × 3 filters, 2 × 2 strides and mirrored padding, a ReLU
activation function, and max pooling with a 3 × 3 window size and 2 × 2 strides. The
convolutional blocks have 32, 64, 64, 96 and 128 filters, respectively. After the five
convolutional blocks, the 2D feature maps are flattened into a 1D array. Lastly, the final
layer is a dense fully connected layer that uses a Softmax activation function to output
the probabilities for each of the five label classes. By determining the probability of
each target class out of all classes, the softmax function assigns the input to one of the
five classes; the class with the highest probability is selected as the predicted label for a
given input.
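A minimal Keras sketch of the architecture described above is given below. It is an approximation rather than the author's exact code: 'same' padding stands in for the mirrored padding mentioned in the text, and the remaining settings are assumed defaults.

from tensorflow.keras import layers, models

def build_model(input_shape=(224, 224, 3), num_classes=5):
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    # Five convolutional blocks with 32, 64, 64, 96 and 128 filters
    for filters in (32, 64, 64, 96, 128):
        model.add(layers.Conv2D(filters, (3, 3), strides=(2, 2),
                                padding="same", activation="relu"))
        model.add(layers.MaxPooling2D(pool_size=(3, 3), strides=(2, 2),
                                      padding="same"))
    # Flatten the 2D feature maps and classify into the five zema classes
    model.add(layers.Flatten())
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model

model = build_model()
model.summary()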

3.6. Testing Phase

This phase proceeds in essentially the same way as the training phase. The images must
be preprocessed in the same way as for training; taking a different path can present
inputs to the network that cannot be classified, leading to incorrect classification.
Similarly, feature learning is performed as in training, using the learning model built
during training. A different set of input images from the training dataset is used, called
the test dataset.

To improve the performance and accuracy of the model, classify the data based on its
underlying characteristics, and reduce possible losses, the following techniques were
used during training and validation of the model:

Batch normalization: A method of standardizing the inputs to a layer for each mini-batch
while training very deep neural networks. This stabilizes the learning process and
significantly reduces the number of training epochs needed to train a deep network.

Batch size: Since the datasets are large, they are split into batches. The number of
training examples in each split is the batch size. Each batch is the input to the neural
network for a single iteration, and the forward and backward passes optimize the
network against the labels of that batch.

Epochs: An epoch is one pass of the whole dataset forward and backward through the
neural network. To train the model, the number of epochs should be greater than one;
as the number of epochs increases, the weights of the network are updated more often
and the model moves from underfitting toward the optimum, or even to overfitting.

Optimizer: An optimizer is an optimization algorithm that minimizes the loss function
by changing and adapting the values of the weights and biases of the network. There are
many types, such as Stochastic Gradient Descent, Adam, Adamax and RMSprop. The
Adam optimization algorithm is computationally efficient and consumes less memory
than other classes of stochastic gradient descent algorithms, and most researchers prefer
it.

Loss: The loss function is the key unit for estimating the error between the predicted
and the original value. The aim is to approach zero loss in the training phase, ensuring a
close match between the estimated and expected values; to achieve this, the neuron
weights are adjusted with the optimization function until better predictions are obtained.
In the testing phase we simply apply the feature extraction and learning of the training
phase and the same SoftMax classifier to the given test data.
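The sketch below shows, in a hedged way, how the training settings described in this section and in Chapter 4 (Adam optimizer, categorical cross-entropy loss, batch size 32, 100 epochs) could be wired together in Keras; the variable names and the validation split are illustrative assumptions, and model refers to the sketch given in Section 3.5.

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# train_images/train_labels and test_images/test_labels are assumed to hold
# the spectrogram arrays and their one-hot encoded class labels
history = model.fit(train_images, train_labels,
                    validation_split=0.2,
                    batch_size=32,
                    epochs=100)

test_loss, test_acc = model.evaluate(test_images, test_labels)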

3.7. Conclusion

This deep learning model was developed to classify Aquaquam zema into five main
types. First, the model receives the input audio data. This data requires preprocessing
such as noise reduction and segmentation to obtain speech segments of consistent time
interval and size. Each audio segment is converted into a spectrogram image, and the
spectrograms are resized to the same width and height. The scaled spectrogram image is
then the input to the convolutional neural network. From the input image, the
convolution layers of the CNN extract features and the pooling layers perform
dimensionality reduction. Finally, the spectrogram image pixels are flattened into a
vector and classified by the fully connected layers with SoftMax.

CHAPTER FOUR: EXPERIMENT, RESULT AND DISCUSSION

4.1. Introduction

This chapter describes data collection, the tools used to develop the prototype, and
system performance. First, we describe the collection, preparation, and preprocessing of
the data used as input to the proposed system; the next section describes the tools used
to develop the model; and the last part concerns the performance of the proposed system.

4.2. Dataset Preparation

This section describes dataset collection, preparation, and preprocessing. As the
required data for this study could not be found in published sources or previous studies,
we prepared our own dataset. To the best of our knowledge, there are no previous
studies on Aquaquam zema classification using audio spectrogram features; therefore,
we had no choice but to prepare our own dataset, representing part of aquaquam zema,
to build the audio classifier. For this experiment, zema recordings were collected from
Bahir Dar Abune GebreMenfes Kidus church, Andabet Kidanemihret church, and the
EOTC website, and used to implement the classification of aquaquam.

After collecting the data, we manually classified each recording into the proper class;
this manual classification was performed and verified by aquaquam scholars. The
dataset consists of five zema: Zimame, Qum (Tsinatsel), Meregd, Tsifat (ጽፋት), and
Amelales. Each recording belongs to one of the five zema, with 60-second snippets
each, stored as 22050 Hz, 16-bit mono *.wav audio files. The researchers selected this
sampling rate because humans cannot hear signals above 20 kHz, so higher frequencies
need not be considered, while at a sampling rate as low as 8 kHz, as in telephony,
different genres of music can still be differentiated; we therefore chose a sampling rate
between these two limits. Next, the collected data were transformed to make the dataset
suitable for the algorithm used, drawing on knowledge of the problem domain. The
audio data must be segmented into equal-sized pieces to obtain uniform time intervals,
and the segmented audio files are then converted into a visual representation, a
spectrogram in *.png format. The data fed to the convolutional network are these
spectrogram images in *.png format. The default size of the generated spectrogram
image is 360 × 360; the images are then resized using the librosa library to a uniform
224 × 224. The number of data collected for each class is shown below.

We used 80% of the total dataset of 1544 samples for training and the remaining 20%
(314 samples) for testing to create our model; twenty percent of the training data was
also used for validation. Each sample was recorded at a 22050 Hz sampling rate with
16-bit depth. The required features were then extracted after all of these data had been
correctly preprocessed.
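An illustrative sketch of this preprocessing, assuming librosa and soundfile are available, is given below; the directory handling, clip length, and variable names are assumptions, not the exact implementation used for this thesis (the original scripts are referenced in the Appendix).

import os
import librosa
import soundfile as sf

CLIP_SECONDS = 20
SR = 22050

def segment_file(path, out_dir):
    # Split one recording into equal 20-second *.wav segments
    y, sr = librosa.load(path, sr=SR, mono=True)
    samples_per_clip = CLIP_SECONDS * sr
    for i in range(len(y) // samples_per_clip):
        clip = y[i * samples_per_clip:(i + 1) * samples_per_clip]
        name = f"{os.path.splitext(os.path.basename(path))[0]}_{i}.wav"
        sf.write(os.path.join(out_dir, name), clip, sr)

# After the clips are converted to spectrogram images and labelled, an 80/20
# split into training and test sets could be made, for example, with
# sklearn.model_selection.train_test_split(X, y, test_size=0.2, stratify=y).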

Table 4.1: Collected audio datasets

No.  Class     Number of data (wav)  Duration of each wav  Number of spectrogram images
1    Zimame    280                   20 second             280
2    Kum       340                   20 second             340
3    Meregd    341                   20 second             341
4    Tsifat    310                   20 second             310
5    Amelales  273                   20 second             273
     Total     1544                                        1544

4.3. Experiment and Result

The following experiments were carried out in this work. The first experiment uses an
end-to-end CNN model with spectrogram features and fixed-length segmentation. The
second experiment uses a CNN model with handcrafted features. The last experiment
uses a multi-class SVM classifier with CNN features.

4.3.1. Experiment based on the length of audio data using end-to-end CNN with
spectrogram feature

Deep feature extraction based on CNN was used, and softmax was used for
classification. For the model's training and testing, 1230 and 314 samples were used,
respectively. To construct the CNN model, the effects of the number of layers, pooling
operations, and activation functions were examined together, because there is no
accepted procedure for choosing accurate parameter values.

The architecture forms a basic sequential model, as seen in Figure 3.1. The model's
initial layer is a convolutional layer with 32 filters and a kernel size of (3, 3). The input
must be a 224 × 224 RGB image, because image sizes of less than 224 resulted in poor
performance. The spatial dimensions of the output are subsequently reduced by the
max-pooling layer using the stride parameter; the stride, a 2-tuple of numbers,
represents the step of the convolution. The step value is usually left at the default of
(3, 3) and kept constant across all convolutional layers to ensure that the size of the
output volume is small. Each pooling layer is followed by a ReLU activation function.

In order to answer the research question 'What segmentation size is appropriate for the
aquaquam zema classification model?', we conducted experiments based on the length
of the audio data, preparing the dataset in 5-, 10-, and 20-second segments. The first
experiment below uses 5-second audio segmentation, followed by experiments with 10-
and 20-second segmentation.

4.3.1.1. Experiment on the 5-second audio segmentation dataset with the end-to-end
CNN model

The sequential CNN model used in this experiment is summarized in Figure 4.5. The
model's first layer has 32 filters, and the input shape is a 224 × 224 × 3 RGB image.
The spatial dimensions of the output are subsequently reduced using the max-pooling
layer and the stride parameter. Figure 4.1 below shows that the proposed classification
model achieves a training accuracy of 96.1% and a validation accuracy of 89.8% on our
data, with an overall loss of 0.46.

Figure 4.1: The 5 second experiment training model summary in 100 epochs

With the 5-second audio dataset, the model achieved 96% training accuracy and 89.8%
testing accuracy using the softmax function. The experiment's overall average loss is
1.38. Both the training and test accuracy of this experiment are lower than those of the
10-second audio experiment (the test accuracy by less than 1%), and the aggregate loss
is also higher than in the 10-second experiment.

The CNN model performance is expressed using precision, recall, and F1-score in Table
4.2, which reports the evaluation metrics for the 5-second audio dataset using the CNN
model with ReLU activation and a Softmax classifier.

The following evaluation measures demonstrate how well the model performed for each
class and how well it solved the problem. The precision values of amelales, kum,
Meregd, tsifat, and zemame indicate that 0.05, 0.16, 0.08, 0.01, and 0.20 of the
predictions, respectively, were false positives drawn from other classes, and the recall
values indicate that 0.15, 0.05, 0.07, 0.08, and 0.28, respectively, were false negatives
assigned to other classes.

Table 4.2: Performance evaluation metrics result for 5 second segmentation experiment

Class Precision Recall F1-Score Support

Amelales 0.95 0.85 0.90 41
Kum 0.84 0.95 0.89 85
Mereged 0.92 0.93 0.93 73
Tsifat 0.99 0.92 0.95 76
Zemame 0.80 0.72 0.76 39

Accuracy 0.90 314


Macro Avg 0.90 0.88 0.88 314
Weighted Avg 0.90 0.90 0.90 314

Test Result: 89.809 Loss: 1.465
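The per-class precision, recall, and F1-score values such as those in Table 4.2 can be computed, for example, with scikit-learn's classification report; in this hedged sketch y_true and y_pred are assumed to hold the true and predicted class indices of the 314 test spectrograms.

from sklearn.metrics import classification_report

class_names = ["Amelales", "Kum", "Mereged", "Tsifat", "Zemame"]
print(classification_report(y_true, y_pred, target_names=class_names))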

The training accuracy was consistently higher than the validation accuracy, as the
training loss and accuracy curves in the figures below clearly demonstrate. As the
number of epochs increases, the model accuracy increases and the model loss decreases.
The accuracy and loss of the proposed model for training and testing are illustrated
diagrammatically below.

Figure 4.2: Training accuracy curve of 5 second experiment

Figure 4.3: Training loss curve of 5 second experiment

Figure 4.4: The accuracy graph of 5 second segment experiment model

4.3.1.2. Experiment on the 10-second audio segmentation with the end-to-end CNN
model

This experiment was conducted with a batch size of 32 and 100 epochs. Additional
convolution, pooling, and activation layers with larger convolution filter counts are
used, followed by a further convolution and activation layer. The model was compiled
using the Adam optimizer, a dropout rate of 0.3, and the categorical (multiclass)
cross-entropy loss function. The summary statistics for the model are shown in Figure
4.5 below.

Figure 4.5: The 10 second segment experiment training model summary

The performance (accuracy and loss) of the proposed model is shown in Figures 4.6 and
4.7. Training the model takes almost an hour in the Python Anaconda environment. As
clearly shown in Table 4.3 below, the proposed classification model achieves a training
accuracy of 97.5% and a validation accuracy of 90.13% on our data, with an overall
loss of 0.89.

The performance of the 10-second end-to-end CNN model is expressed using precision,
recall, and F1-score.

Table 4.3: Performance result of 10 second segment experiment

Class Precision Recall F1-Score Support

Amelales 0.99 0.89 0.94 41


Kum 0.87 0.85 0.86 85
Mereged 0.90 0.95 0.93 73
Tsifat 0.98 0.86 0.92 76
Zemame 0.74 0.92 0.82 39

Accuracy 0.90 314


Macro Avg 0.90 0.90 0.90 314
Weighted Avg 0.91 0.90 0.90 314

Test Result: 90.204 Loss: 0.892

Table 4.3 reports the evaluation metrics of the proposed model with Softmax. These
metrics show how well the model performed for each class. The precision values of
amelales, kum, Meregd, tsifat, and zemame indicate that 0.01, 0.13, 0.10, 0.02, and 0.26
of the predictions, respectively, were false positives drawn from other classes, and the
recall values indicate that 0.11, 0.15, 0.05, 0.14, and 0.08, respectively, were false
negatives assigned to other classes.

The accuracy and loss of the proposed model on our 10-second dataset are shown in
Figures 4.6 and 4.7. As the training loss and accuracy curves in the figures below
clearly show, the training accuracy was greater than the validation accuracy along the
curve; as the number of epochs increases, the model accuracy rises and the model loss
decreases. Below is a diagrammatic overview of the accuracy and loss of the proposed
model for training and testing.

Figure 4.6: The Training accuracy curve of 10 second experiment model

Figure 4.7: The Training loss curve of 10 second experiment model

The training and validation accuracy-loss curves make it evident that, as the number of
epochs rises, the loss values approach zero and the accuracy approaches 100%. Before
20 epochs, the training and validation accuracy oscillate around one another, but from
epoch 20 onward the training accuracy always exceeds the validation accuracy.
Likewise, while the training and validation losses fluctuate up to around epoch 20, the
training loss is substantially smaller than the validation loss after this point. At the end
of training, the training accuracy had increased to 97.5%, while the testing accuracy had
reached 90.2%. The training loss progressively decreased toward 0, while the test loss
settled at 0.89. The gap between the training and validation accuracy remains small
throughout the curve, which indicates low overfitting and that the model performed
well for aquaquam zema classification.

Figure 4.8: The accuracy graph of 10 second segment experiment

4.3.1.3. Experiment on the 20-second audio segmentation with the end-to-end CNN
model

This experiment was conducted with convolutional layers, pooling layers, and a fully
connected classification layer with softmax. The model's initial layer is a convolutional
layer with 32 filters and a kernel size of (3, 3). The input must be a 224 × 224 RGB
image, because image sizes of less than 224 resulted in poor performance. The spatial
dimensions of the output are subsequently reduced by the max-pooling layer using the
stride parameter; the stride, a 2-tuple of numbers, represents the step of the convolution.
The step value is usually left at the default of (3, 3) and kept constant across all
convolutional layers to ensure that the size of the output volume is small. Each pooling
layer is followed by a ReLU activation function.

This experiment was also conducted with a batch size of 32 and 100 epochs. The model
was compiled using the Adam optimizer, a dropout rate of 0.3, and the categorical
(multiclass) cross-entropy loss function. This experiment achieved 97.5% training
accuracy and 91.76% testing accuracy using the same Adam optimizer, and the
experiment's overall average loss is 0.827. The training accuracy of this experiment is
higher than that of the 10-second audio experiment, the test accuracy is about 1.4%
greater than in the previous 10-second segment experiment, and the aggregate loss is
lower than in the 5- and 10-second experiments.

The summary statistics for the model are shown in Figure 4.9 below.

Figure 4.9: The 20 second segment experiment training model summary in 100 epochs

The evaluation metrics in Table 4.4 show the model's performance for each class. The
precision values of amelales, kum, Meregd, tsifat, and zemame indicate that 0.19, 0.08,
0.10, 0.02, and 0.06 of the predictions, respectively, were false positives drawn from
other classes, and the recall values indicate that 0.04, 0.16, 0.02, 0.11, and 0.12,
respectively, were false negatives assigned to other classes.

Figure 4.10: The Training accuracy curve of 20 second segment experiment

Figure 4.11: The Training loss curve of 20 second segment experiment

Table 4.4: Performance evaluation metrics result for 20 second segment experiment

Class Precision Recall F1-Score Support

Amelales 0.81 0.96 0.88 41
Kum 0.91 0.84 0.87 85
Mereged 0.90 0.98 0.94 73
Tsifat 0.98 0.89 0.94 76
Zemame 0.94 0.88 0.91 39

Accuracy 0.92 314


Macro Avg 0.91 0.91 0.91 314
Weighted Avg 0.92 0.92 0.92 314

Test Result: 91.76 loss: 0.827

As the training loss and accuracy curves in the figure below clearly show, the training
accuracy was greater than the validation accuracy along the curve; as the number of
epochs increases, the model accuracy rises and the model loss decreases. Below is a
diagrammatic overview of the accuracy and loss of the proposed model for training and
testing.

Figure 4.12: The accuracy graph of 20 second segment experiment

Table 4.5: Comparison between different audio segmentation experiments

Model  Length of audio  Train accuracy  Val_accuracy  Loss rate  Test accuracy
CNN    5 second         96%             89.8%         1.465      89.8%
CNN    10 second        97.5%           90%           0.89       90.2%
CNN    20 second        97.7%           90.13%        0.827      91.76%

As the results in Table 4.5 show, the length of the audio affects the classification of
Ethiopian Orthodox Tewahido Church (EOTC) aquaquam zema. This experiment
answers the question of which segmentation size is appropriate for the aquaquam zema
classification model; it was conducted with 5-, 10-, and 20-second audio segments, each
transformed into a 224 × 224 × 3 spectrogram image. All experiments were built on an
end-to-end CNN model with ReLU activation and a softmax classifier. The 5-second
experiment gives 96% training accuracy, 89.8% test accuracy, and an overall loss of
1.465; both training and test accuracy are lower than in the 10- and 20-second
experiments, and the loss is higher. The 10-second experiment gives 97.5% training
accuracy, 90.0% test accuracy, and an overall loss of 0.89; its training and test accuracy
are higher than those of the 5-second model but lower than those of the 20-second
model, and its loss of 0.89 is lower than in the 5-second experiment and higher than in
the 20-second experiment.

Generally, 20-second audio segmentation is the most appropriate for the classification
of aquaquam zema with the spectrogram feature and the end-to-end CNN model for
both training and classification. In the 20-second experiment the results show 97.7%
training accuracy, 91.76% test accuracy, and an overall loss of 0.827; all results are
higher than in the 5- and 10-second models, and the loss of 0.827 is also lower than in
the 5- and 10-second experiments.

4.3.2. Experiment on the end-to-end CNN model with spectrogram and MFCC
feature

In order to answer the research question 'Which feature extraction method is
appropriate for Aquaquam zema classification?', the researchers experimented with
MFCC audio feature extraction using the same training model.

This experiment was also conducted with a batch size of 32 and 100 epochs. The model
was compiled using the Adam optimizer, a dropout rate of 0.3, and the categorical
(multiclass) cross-entropy loss function. This experiment achieved 95.6% training
accuracy and 83.63% testing accuracy on the same dataset size, and the experiment's
overall average loss is 1.077. The results of this experiment are shown below.
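A brief hedged sketch of MFCC extraction with librosa for a single clip is given below; the file name and the number of coefficients (13) are illustrative assumptions rather than values stated in this thesis.

import numpy as np
import librosa

y, sr = librosa.load("kum_clip_001.wav", sr=22050, mono=True)

# Compute 13 MFCCs per frame, then summarize the clip by the mean of each
# coefficient over time so that every clip yields a fixed-length vector
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
mfcc_vector = np.mean(mfcc, axis=1)
print(mfcc_vector.shape)  # -> (13,)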

Figure 4.13: Training summary of CNN model with MFCC feature

Table 4.6: Comparison between CNN model with spectrogram and MFCC feature

class Precision recall f1-score support


Amelales 1.00 0.74 0.85 41
Kum 0.85 0.92 0.88 85
Mereged 0.96 0.80 0.88 73
Tsifat 0.98 0.83 0.90 76
Zemame 0.46 0.89 0.31 39

Accuracy 0.84 314


Macro Avg 0.85 0.84 0.82 314
Weighted Avg 0.89 0.84 0.85 314
Test Result: 83.673 loss: 1.077
As can be seen in Figure 4.13, the MFCC experiment was conducted with 80% training
and 20% test data, 100 epochs, and a batch size of 32 for the first convolution layer.
The final accuracy is 83.67% with a total loss of 1.077; this result is lower than the
accuracy, and worse than the loss, obtained by the experiment that used the spectrogram
feature extraction model. In general, the CNN model with the MFCC feature showed
poorer performance than the CNN model with the spectrogram feature for classification
of EOTC aquaquam zema.

4.3.3. Experiment on CNN as feature extraction and SVM as classifier

In this experiment, we use the CNN as a feature extractor on the 1544 spectrogram
*.png samples; 1230 of these samples are used for training and fed to the SVM
classifier, and the remaining 314 samples are used for testing. As shown in Figure 4.14,
the training accuracy is 86.24% and the final test accuracy is 81.21%, with a total loss
of 1.434.
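The following is a hedged sketch of how a trained CNN (such as the one sketched in Section 3.5) can be reused as a feature extractor for an SVM classifier; the layer index, variable names, and SVM settings are assumptions, not the author's exact code.

from tensorflow.keras import models
from sklearn.svm import SVC

# Take the activations of the layer before the softmax output as features
feature_extractor = models.Model(inputs=model.input,
                                 outputs=model.layers[-2].output)

train_features = feature_extractor.predict(train_images)
test_features = feature_extractor.predict(test_images)

# Train a multi-class SVM on the CNN features and evaluate it
svm = SVC(kernel="rbf", C=1.0)
svm.fit(train_features, train_labels_int)   # integer class labels assumed
print("Test accuracy:", svm.score(test_features, test_labels_int))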

Figure 4.14: Training summary of CNN with SVM classifier

The average accuracy recorded in this experiment was 81.21%. The following table
presents the precision, recall, F1-score, macro average, weighted average, and accuracy
of a single run.

Table 4.7: Performance evaluation result of SVM classifier with CNN training

class Precision recall f1-score support


amelales 0.92 0.89 0.93 41
kum 0.90 0.94 0.95 85
mereged 0.91 0.93 0.92 73
tsifat 0.96 0.94 0.97 76
zemame 0.78 0.76 0.76 39

accuracy 0.81 314


macro avg 0.79 0.80 0.79 314
weighted avg 0.80 0.80 0.80 314
Test Result: 81.21 loss: 1.434

As clearly shown in the training loss and accuracy curves in Figures 4.15 and 4.16, the
training accuracy was higher than the testing accuracy throughout the curve; as the
number of epochs increases, the accuracy of the model rises and the loss of the model
decreases.

Figure 4.15: The Training accuracy curve of SVM classifier experiment

Figure 4.16: Accuracy graph of SVM classifier

Table 4.8: Comparison between CNN model Softmax classifier and SVM classifier

Model  Classifier  Train Accuracy  Val_accuracy  Loss rate  Test accuracy
CNN    Softmax     97.7%           90.13%        0.827      91.76%
CNN    SVM         86.2%           82.52%        1.434      81.21%

As the results in Table 4.8 show, the choice of classifier affects the classification of
Ethiopian Orthodox Tewahido Church (EOTC) aquaquam zema. The comparison was
made between the CNN model with the softmax classifier and with the SVM classifier.
The table shows the overall accuracy and loss of the training and testing phases of the
developed model relative to the other classifier: the softmax classifier has the best
accuracy, especially in the testing phase, and also the lowest loss. The results show
86.2% training accuracy, 81.21% test accuracy, and a loss of 1.434 for the SVM
classifier, versus 97.7% training accuracy, 91.76% test accuracy, and a total loss of
0.827 for the Softmax classifier. In general, the end-to-end softmax classifier
outperforms the SVM classifier for the aquaquam zema classification model using
spectrogram image features.

4.3.4. Test the given dataset on domain experts

To test the dataset with domain experts, we manually prepared a test set from the audio
data for Aquaquam scholars from Andabet Debre Mihret Kidist Kidanemihret church
and EOTC scholars from Kidist Mariam church, Mierafe Mariam Kebelie, Andabet
Woreda. We used scholars at three levels of education, Kutir, Qenie, and Aquaquam
zema; the respondents were asked to identify the class of each 20-second aquaquam
zema clip. We selected 3 scholars from the Kutir school, 3 from the Qenie school, and 4
from the Aquaquam school. We prepared 154 test clips (10% of the 1544 total samples)
drawn from the five classes of the aquaquam zema genre; the results are shown in Table
4.9 below.

Table 4.9: Testing results of domain experts

                                        Level of education
Class     No. of data  Duration   Kuter (3 scholars)     Qenie (3 scholars)     Aquaquam (4 scholars)
                                  No. answered  %        No. answered  %        No. answered  %
Amelales  28           20 sec     0             0.00     0             0.00     26            92.85
Kum       34           20 sec     0             0.00     3             8.82     30            88.23
Mereged   34           20 sec     0             0.00     0             0.00     26            76.47
Tsifat    31           20 sec     0             0.00     0             0.00     24            77.42
Zemamie   27           20 sec     0             0.00     4             14.81    25            92.60
Total     154          -          0             0.00     7             4.54     131           85.06
As the domain expert responses in Table 4.9 show, among participants at the various
educational levels, the 3 Kuter scholars were unable to classify or identify the zema
genre (0.0%); the 3 Qenie respondents recognized the Kum and Zemamie aquaquam
zema to some extent, giving a total response rate of 4.54% on the test data; and the 4
Aquaquam zema scholars correctly classified 85.06% of the test data into the proper
class. Contrasting these domain expert results with our end-to-end deep learning CNN
model, we find that the CNN model is more accurate than the domain experts in
classifying the 20-second clips into the aquaquam zema classes.

4.4. Discussion

Based on the experiments performed, the proposed aquaquam zema classification model
has been analyzed using Mel-spectrogram images and the performance evaluation
metrics. Changing the segmentation length of the audio data produced different
performances of the aquaquam zema classification model. We also observed the effect
of removing noise from the audio dataset and resizing the transformed images. The
segmentation investigation, conducted with 5-, 10-, and 20-second audio lengths, gives
test accuracies of 89.8%, 90.2%, and 91.76%, and training accuracies of 96%, 97.5%,
and 97.72%, respectively. The 20-second segmentation recorded the best classification
performance for aquaquam zema compared with the 5- and 10-second segmentation
techniques. We also conducted an experiment that changed the audio feature to MFCC;
the spectrogram image gives better performance for the classification of aquaquam
zema.

Table 4.6 and Figure 4.13 show the performance of the CNN model with the MFCC
feature, 95.6% training accuracy and 83.67% test accuracy, compared with 97.7%
training accuracy and 91.76% test accuracy for the spectrogram feature. Another
experiment compared classifiers using the spectrogram feature, CNN with softmax and
CNN features with an SVM, which were trained, tested, and validated in order to
classify the aquaquam zema and build the classification model. Consequently, different
test accuracies were recorded using the different parameters and methods in the
experiments above. As Table 4.8 shows, softmax and SVM achieve 97.7% and 86.2%
training accuracy and 91.76% and 81.21% test accuracy, respectively. Lastly, we tested
the study with experts from different levels of education; the result shows 85.06% test
accuracy. Therefore, the model developed using a deep learning CNN with a softmax
classifier recorded the highest accuracy for aquaquam zema classification.

4.5. Summary

The dataset preparation and implementation aspects of the proposed models are lastly co
vered in full in this chapter. The dataset for this study was gathered from Bahir Dar Abun
e Gebre Menfes Kidu Church, Andabet Debre Mihret Kidist Kidanemihret Church schola
rs, and the EOTC website using a smartphone device. After the data was gathered, we per
formed segmentation on each audio sample to create equal audio lengths. We then conver
ted each audio sample to a spectrogram image with a size of 224 x 224 *.png for the stud
y. Then, the experimental evaluation results of the proposed model for automatic classific
ation of the aquaquam zema are analyzed. Finally, three different segmentation technique
s are used in the tests. We also examine the results between in-depth features and handcra
fted texture features and briefly address softmax and multi-class SVM. Evaluation measu
res accuracy, f1-score, recall, precision, macro average, and weighted average are used to
describe the model's performance in classifying aquaquam zema into amelales, kum, mer
egd, tsifat, and zemamie.

CHAPTER FIVE: CONCLUSION, CONTRIBUTION AND
RECOMMENDATION

5.1. Conclusion

In this study, we used deep learning to develop a model for classifying Teklie
aquaquam zema in the EOTC. For the purposes of music information retrieval, voice
recognition, and sound identification, previous work has classified music genres, audio
sounds, and instrumental sounds by capturing features such as melody, harmony, and
rhythm. The retrieval of music-related information is therefore related to our research
area: music information retrieval is a group of computational models that extract
descriptive data from recorded music. From this perspective, Teklie Aquaquam is one
of the core spiritual zema schools in the EOTC, and aquaquam zemas are among the
chants, melodies, or zemas performed in these schools. Everything is portrayed as a
sound or as music. It consists of the following types of zema: Zimame (ዝማሜ), Qum
(Tsinatsel) (ቁም), Meregd (መረግድ), Tsifat (ጽፋት), Amelales (አመላለስ), and Woreb
(ወረብ), with the sub-categories Neaus-Meregd (ንዑስ መረግድ) and Abey-Meregd (ዓብይ
መረግድ).

This study consists of data acquisition, preprocessing, segmentation, feature extraction,
and classification.
The data acquisition was carried out by recording aquaquam scholars with a smartphone
and downloading audio from the EOTC website; after collecting the dataset, aquaquam
zema scholars performed the manual indexing, assigning each recording to its class.
Preprocessing was then applied: the collected data were prepared as 22050 Hz, 16-bit
mono *.wav files, a noise reduction technique was applied, and segmentation was
performed to equalize the length of the audio. After splitting the audio into segments,
each segment was converted to a spectrogram image in *.png format, and each image
was scaled to 224 by 224 in order to extract features.
Because audio data cannot be directly recognized by the model, feature extraction from
spectrogram images was used to transform the audio data into a comprehensible format.
Amelales, Kum, Meregd, Tsifat, and Zemame were the classes used in training the
classification model, which was based on a dataset of 1544 samples. The model was
created with end-to-end CNN deep learning, since deep learning is a branch of machine
learning in which the data representation, as well as the rules for processing the data,
are learned automatically.
We used the Python Anaconda environment to implement the code. Keras with the
TensorFlow backend was used, various libraries were loaded, and Librosa was used
specifically for the audio files since audio signal processing was performed. We
conducted experiments to evaluate the proposed model's effectiveness and to answer
the research questions. To answer the question of which segmentation size is suitable
for aquaquam zema classification, we compared three tests, 5-second, 10-second, and
20-second segmentation; the 20-second segmentation performs best, with 91.76% test
accuracy. The other research question, which feature extraction technique is suitable for
Aquaquam Zema classification, was addressed with two experiments employing audio
spectrogram features and MFCC features. Adam was used as the optimizer to train the
9,719,429 trainable parameters with the audio spectrogram feature; the classifier's test
accuracy was 91.76%, while the accuracy achieved with MFCC features was 83.673%.
Compared with the SVM classifier experiment, the outcome demonstrates that CNN
with softmax performs better than SVM. For the proposed EOTC aquaquam zema
classification model, the CNN with a softmax classifier, 20-second segmentation, and
the spectrogram feature is the most effective and robust; the results show 97.7% training
accuracy, 91.76% test accuracy, and an overall loss of 0.827.

5.2. Contribution

We developed this model for EOTC aquaquam zema classification, which can be
employed in the scientific world and contributes to the knowledge legacy of the
researcher for a specific society. The classification covers Amelales (አመላለስ), Kum
(ቁም), Meregd (መረግድ), Tsifat (ጽፋት) and Zimamie (ዝማሜ). Since no preprocessed and
feature-extracted aquaquam zema data existed for this work, we prepared such a dataset
of aquaquam zema, consisting of songs without any instrumental sound. After collecting
the data, we applied segmentation, noise reduction, and generation of a visual
representation of the audio signal. This study's essential contribution is the application
of fixed-time-interval segmentation to audio data and the extraction of spectrogram
features using an end-to-end CNN with a softmax classifier. Developing segmentation
and feature extraction methods appropriate for aquaquam zema classification is the
other contribution of this study. Finally, our research demonstrated that Ethiopian
Orthodox church Aquaquam Zema can be classified using an end-to-end deep learning
algorithm, and compared with manual feature extraction, this model saves time.

5.3. Recommendation

The findings of this study are helpful and can be applied to the field of audio
classification for applications involving music information retrieval. The purpose of
this effort is a spiritual aquaquam zema classification model. Setting a course for future
work in relation to ours is crucial to improving performance, so additional examination
of this experiment using various methods is required. The following areas are suggested
for future research. In this work, increasing the audio segmentation interval further is
unlikely to be more effective; instead, adjusting the spectrogram image size may
enhance the classification model's performance and accuracy. Combining visual
representations of the audio data with acoustic feature extraction techniques may
maximize the accuracy and performance of classifier models. The classifier model is
also greatly affected by the size of the dataset: its capability improves as the amount of
input data rises. Using other classifiers may likewise improve performance and
accuracy. Applying various preprocessing and augmentation approaches to the audio
data, such as adding noise, varying the loudness, time stretching, pitch shifting, and
adding textual features that provide extra information, may produce excellent results.
Aquaquam zema classification may also yield positive results when applying other
pre-trained deep learning models, SVM classification techniques, and other
machine-learning algorithms.

Reference

Addisie, Y. (2019). የተክሌ አቋቋም ቃላዊ ታሪክ እና ፋይዳው [The oral history and significance of Teklie Aquaquam].

Asim, M., & Ahmed, Z. (2017). Automatic Music Genres Classification using Machine
Learning. International Journal of Advanced Computer Science and Applications,
8(8), 337–344. https://doi.org/10.14569/ijacsa.2017.080844

Athulya, K. M., & Sindhu, S. (2021). Deep learning based music genre classification
using spectrogram. Icicnis.

Avendaño-Valencia, L. D., & Fassois, S. D. (2015). Natural vibration response based


damage detection for an operating wind turbine via Random Coefficient Linear
Parameter Varying AR modelling. Journal of Physics: Conference Series, 628(1),
273–297. https://doi.org/10.1088/1742-6596/628/1/012073

Babaee, E., Anuar, N. B., Wahid, A., Wahab, A., Shamshirband, S., & Chronopoulos, A.
T. (2018). An Overview of Audio Event Detection Methods from Feature Extraction
to Classification. Applied Artificial Intelligence, 00(00), 1–54.
https://doi.org/10.1080/08839514.2018.1430469

Badshah, A. M., Ahmad, J., Rahim, N., & Baik, S. W. (2017). Speech Emotion
Recognition from Spectrograms with Deep Convolutional Neural Network.

Bahuleyan, H. (2018). Music Genre Classification using Machine Learning Techniques.


April.

BAYE, W. (2020). INSTRUMENTAL SONG OF SAINT YAREDIC CHANT DERIVED


AUTOMATIC PENTATONIC SCALE IDENTIFICATION : BAGANA.

Logan, B. (2000). Mel Frequency Cepstral Coefficients for Music Modeling.

Bhandari, G. M. (2016). Different Audio Feature Extraction using Segmentation. 2(09),


1–5.

BIRKU, L. A. (2021). SAINT YARED KUM ZEMA CLASSIFICATION USING


CONVOLUTIONAL NEURAL NETWORK.

Bormane, D. S., & Dusane, M. M. (2013). A Novel Techniques for Classification of


Musical Instruments. Information and Knowledge Management, 1–8, 3(10), 1–9.

Boxler, D. (2020). Machine Learning Techniques Applied to Musical Genre Recognition.

Choi, K., & Sandler, M. (n.d.). Automatic tagging using deep convolutional neural
networks.

Choi, K., & Sandler, M. (2017). A Tutorial on Deep Learning for Music Information
Retrieval.

Costa, Y. M. G., Oliveira, L. S., & Silla Jr., C. N. (2016). An evaluation of convolutional
neural networks for music classification using spectrograms. Applied Soft
Computing Journal. https://doi.org/10.1016/j.asoc.2016.12.024

Cruz, D. A., Lopez, S. S., & Camargo, J. E. (2019). Automatic Identification of


Traditional Colombian Music Genres based on Audio Content Analysis and
Machine Learning Techniques.

Debebe, S. (2017). The Teaching-learning Processes in the Ethiopian Orthodox Church


Accreditation Schools of Music (.

Dennis, J., Member, S., Tran, H. D., Li, H., & Member, S. (2011). Spectrogram Image
Feature for Sound Event Classification in Mismatched Conditions. 18(2), 130–133.

Doob, L. W. (1957). An Introduction to the Psychology of Acculturation. Journal of


Social Psychology, 45(2), 143–160. https://doi.org/10.1080/00224545.1957.9714298

Eyob Alemu, Ephrem Afele Retta, E. A. (2019). Pentatonic Scale ( Kiñit )


Characteristics for Ethiopian Music Genre Classification.

Frehiwot Terefe. (2019). Pentatonic Scale ( Kiñit ) Characteristics for Ethiopian Music
Genre Classification.

Govender, P., Pillay, N., & Moorgas, K. E. (2012). ANN’s vs. SVM’s for Image
Classification. 134, 22–24.

Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R.
(2012). Improving neural networks by preventing co-adaptation of feature detectors.
1–18. http://arxiv.org/abs/1207.0580

Ibrahim, Y. A., Odiketa, J. C., & Ibiyemi, T. S. (2017). For Human Computer
Interaction: An Overview. XV.

Jawaherlalnehru, G., Jothilakshmi, S., Nadu, T., & Nadu, T. (2018). Music Genre
Classification using Deep Neural Networks 1. Research Scholar Department of
Computer Science & Engineering, 4(4), 935–940.

Kasehun, A. (2021). ETHIOPIAN ORTHODOX TEWAHIDO CHURCH DRUM


SOUNDS CLASSIFICATION USING MACHINE.

Kenew, S. (n.d.). tradetional eotc.pdf.

KHASHANA, Z. (2020). FEATURE EXTRACTION TECHNIQUES IN MUSIC


INFORMATION RETRIEVAL.

Lau, D. (2020). Music Genre Classification: A Comparative Study Between Deep-Learning
And Traditional Machine Learning Approaches. 1433596.

Lecun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.
https://doi.org/10.1038/nature14539

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., &
Jackel, L. D. (1989). Backpropagation Applied to Handwritten Zip Code
Recognition. Science Signaling, 7(329), 541–551.
https://doi.org/10.1126/scisignal.2005580

Lee, H., Pham, P., & Ng, A. Y. (2009). Unsupervised feature learning for audio
classification using convolutional deep belief networks. 1–9.

Li, F., & You, Y. (2015). An Automatic Segmentation Method of Popular Music Based on
SVM and Self-similarity. 301, 15–25. https://doi.org/10.1007/978-3-319-15554-8

Lidy, T., & Schindler, A. (n.d.). Parallel Convolutional Neural Networks for Music
Genre and Mood Classification. 1–4.

Mamun, A. Al, Kadir, I., Shahariar, A. K. M., Rabby, A., Azmi, A. Al, Frequency, M., &
Coefficient, C. (2017). Bangla Music Genre Classification Using Neural Network.
2014, 397–403.

McFee, B., Raffel, C., Liang, D., Ellis, D., McVicar, M., Battenberg, E., & Nieto, O.
(2015). librosa: Audio and Music Signal Analysis in Python. Proceedings of the
14th Python in Science Conference, Scipy, 18–24. https://doi.org/10.25080/majora-
7b98e3ed-003

Mckay, C. (2010). Automatic Music Classification with jMIR. January, 600.


jmir.sourceforge.net

Mhatre, K. (2020). Music Genre Classification using Neural Networks Page No : 2167.
IX(V), 2167–2172.

Miceli, P. A., Blair, W. D., & Brown, M. M. (2018). Isolating Random and Bias
Covariances in Tracks. In 2018 21st International Conference on Information
Fusion, FUSION 2018. https://doi.org/10.23919/ICIF.2018.8455530

Mu, W., Yin, B., Huang, X., Xu, J., & Du, Z. (2021). Environmental sound classification
using temporal-frequency attention based convolutional neural network. Scientific
Reports, 11(1), 1–14. https://doi.org/10.1038/s41598-021-01045-4

Müller, M., & Ellis, D. P. W. (2011). Signal Processing for Music Analysis. 0(0).

Nanni, L., Lucio, D. R., & Brahnam, S. (2017). Pattern Recognition Letters.
https://doi.org/10.1016/j.patrec.2017.01.013

Narayan, S., & Gardent, C. (2020). Deep Learning Approaches to Text Production.

Synthesis Lectures on Human Language Technologies, 13(1), 1–199.
https://doi.org/10.2200/S00979ED1V01Y201912HLT044

Nasrullah, Z., & Zhao, Y. (2019). Music Artist Classification with Convolutional
Recurrent Neural Networks. Proceedings of the International Joint Conference on
Neural Networks, 2019-July. https://doi.org/10.1109/IJCNN.2019.8851988

Ngo, K. (2011). Digital signal processing algorithms for noise reduction, dynamic range
compression, and feedback cancellation in hearing aids (Issue July).
http://homes.esat.kuleuven.be/~kngo/phd_KimNgo.pdf%5Cnpapers2://publication/u
uid/F7540D9A-EE74-47EA-B488-408B9B6DA85E

Panwar, S., Das, A., Roopaei, M., & Rad, P. (2017). A deep learning approach for
mapping music genres. 2017 12th System of Systems Engineering Conference, SoSE
2017. https://doi.org/10.1109/SYSOSE.2017.7994970

Parida, K. K., Srivastava, S., & Sharma, G. (2022). Beyond Mono to Binaural:
Generating Binaural Audio from Mono Audio with Depth and Cross Modal
Attention. Proceedings - 2022 IEEE/CVF Winter Conference on Applications of
Computer Vision, WACV 2022, Ild, 2151–2160.
https://doi.org/10.1109/WACV51458.2022.00221

Raju, N., Arjun, N., Manoj, S., Kabilan, K., Shivaprakaash, K. (2013). Obedient Robot
with Tamil Mother Tongue. Journal of Artificial Intelligence 6, 161–167, 4(1), 88–
100.

Ren, J. M., Chen, Z. S., & Jang, J. S. R. (2010). On the use of sequential patterns mining
as temporal features for music genre classification. ICASSP, IEEE International
Conference on Acoustics, Speech and Signal Processing - Proceedings, 2294–2297.
https://doi.org/10.1109/ICASSP.2010.5495955

Science, C. (2021). ETHIOPIAN ORTHODOX TEWAHIDO CHURCH DRUM SOUNDS


CLASSIFICATION USING MACHINE.

Seed, A. T., Saber, Z. R., Sana, A. M., & Hameed, M. A. (2020). Eliminating unwanted
signals in sound by using digital signal processing system. Indonesian Journal of
Electrical Engineering and Computer Science, 18(2), 829–834.
https://doi.org/10.11591/ijeecs.v18.i2.pp829-834

Selam, M. (2020). Automatic Classification of Ethiopian Traditional Music Using
Audio-Visual [...]. MSc thesis, Department of Computer Science, College of Natural
and Computational Sciences. June.

Senevirathna, E. D. N. W., & Jayaratne, L. (2015). Audio Music Monitoring : Analyzing


Current Techniques for Song Recognition and Identification. 4(3), 23–34.
https://doi.org/10.5176/2251-3043

Shah, M., Pujara, N., Mangaroliya, K., Gohil, L., Vyas, T., & Degadwala, S. (2022).
Music Genre Classification using Deep Learning. Proceedings - 6th International
Conference on Computing Methodologies and Communication, ICCMC 2022, 974–
978. https://doi.org/10.1109/ICCMC53470.2022.9753953

Sharma, G., Umapathy, K., & Krishnan, S. (2020). Trends in audio signal feature
extraction methods. Applied Acoustics, 158, 107020.
https://doi.org/10.1016/j.apacoust.2019.107020

Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale
image recognition. 3rd International Conference on Learning Representations,
ICLR 2015 - Conference Track Proceedings, 1–14.

Tekniska Högskola, L., Lexfors, L., & Johansson, M. (2018). Audio representation for
environmental sound classification using convolutional neural networks.
http://lup.lub.lu.se/luur/download?func=downloadFile&recordOId=8964345&fileOI
d=8964346

Tsantekidis, A., Passalis, N., & Tefas, A. (2022). Deep reinforcement learning. In Deep
Learning for Robot Perception and Cognition. https://doi.org/10.1016/B978-0-32-
385787-1.00011-7

Tsegaye, M. (1975). Traditional Education of the Ethiopian Orthodox Church and Its
Potential for Tourism Development (1975-present) ".

Tzanetakis, G., & Cook, P. (2010). Musical genre classification of audio signals using
geometric methods. European Signal Processing Conference, 10(5), 497–501.

Ullrich, Karen, Jan Schlüter, T. G. (n.d.). BOUNDARY DETECTION IN MUSIC


STRUCTURE ANALYSIS USING CONVOLUTIONAL NEURAL NETWORKS.

Vishnupriya, S., & Meenakshi, K. (2018). Automatic Music Genre Classification using
Convolution Neural Network. 2018 International Conference on Computer
Communication and Informatics (ICCCI), 1–4.

William J. Pielemeier, M. (1996). Pielemeier TF Analysis Musical Signals.pdf.

Wilson, AD and Fazenda, B. (2013). Perception & evaluation of audio quality in music
production.

Wolfe, J. (2002). Speech and music, acoustics and coding, and what music might be
“for.” Proceedings of the 7th International Conference on Music Perception and
Recognition, 2002, 10–13.

Wreqenh, l/seltanat habt mareyam. (n.d.). zema.pdf.

Xie, C., Cao, X., & He, L. (2012). Algorithm of abnormal audio recognition based on
improved MFCC. Procedia Engineering, 29, 731–737.

https://doi.org/10.1016/j.proeng.2012.01.032

Yaslan, Y., & Cataltepe, Z. (2014). Audio Music Genre Classification Using Different
Classifiers and Feature Selection Methods Audio Music Genre Classification Using
Different Classifiers and Feature Selection Methods. January 2006.
https://doi.org/10.1109/ICPR.2006.282

Yeshanew, A. (2020). Developing an Automatic Chant Prediction in Mahelet based on


drums and cymbals sound for Ethiopian Orthodox Tewahdo Church.

Appendix

Appendix 1: implementation code for audio segmentation in 20 second

Appendix 2: implementation code for transforming spectrogram image

Appendix 3: implementation code for CNN training model

