R16 BEST ETHIOPIA Birhanu Belay (Classification Model For Ethiopian Traditional Music Video Using CNN)

DSpace Institution
DSpace Repository http://dspace.org

Information Technology thesis
2023-07-14
Classification Model for Ethiopian

Traditional Music Video Using
Convolutional Neural Network
Birhanu, Belay Moges
http://ir.bdu.edu.et/handle/123456789/15687
Downloaded from DSpace Repository, DSpace Institution's institutional repository
BAHIR DAR UNIVERSITY
BAHIR DAR INSTITUTE OF TECHNOLOGY
SCHOOL OF RESEARCH AND POSTGRADUATE STUDIES
COMPUTING FACULTY
Classification Model for Ethiopian Traditional Music Video Using

By
Birhanu Belay Moges
BAHIR DAR, ETHIOPIA

July 14, 2023
September 29, 2023
Classification Model for Ethiopian Traditional Music Video Using
By
Birhanu Belay Moges
A thesis submitted to the school of Research and Graduate Studies ofBahir Dar
Institute of Technology, BDU in partial fulfillment of the requirements for the degree of
Msc. degree in the Information Technology in the computing faculty
Advisor Name:
Seffi Gebeyehu (Asst.prof)
BahirDar, Ethiopia
September 29, 2023

Acknowledgments
My Grateful thanks go to my advisor Seffi Gebeyehu for his valuable comment throughout
preparing this thesis. I would like to thank all participants for their genuine responses, offices,
schools, and friends who supported me in this thesis preparation. Finally, I would like to express
my heartfelt thanks to my all our families for your love, support, and continuous encouragement
I
Table of content
Acknowledgments .................................................................................................... I
Table of content ...................................................................................................... II
Abbreviations .......................................................................................................... V
List of Figures ........................................................................................................ VI
List of Tables ........................................................................................................VII
List of equestions ................................................................................................ VIII
Abstract .....................................................................................................................1
CHAPTER ONE: INTRODUCTION ....................................................................2
1.1. Background .....................................................................................................2
1.1.1 Ethiopian traditional music and music videos .......................................................... 2
1.1.2 Video classification ...................................................................................................... 3
1.2. Statement of the problem...............................................................................3
1.3. The objective of the study ..............................................................................6
1.3.1 General Objective ......................................................................................................... 6
1.3.2 Specific Objectives ........................................................................................................ 6
1.4. Scope of study..................................................................................................7
1.5. Significance of the study ................................................................................7
1.6. Methodology ....................................................................................................7
1.6.1. Literature review ............................................................................................. 8
1.6.2. Data set preparation ........................................................................................ 8
1.6.3. Data preprocessing and Feature Extraction .................................................. 8
1.6.4. Design and Evaluation ..................................................................................... 8
1.7. Organization of the study ..............................................................................9
CHAPTER TWO: LITERATURE REVIEW.....................................................10
2.1Introduction ........................................................................................................ 10
2.2Ethiopian traditional music and video clips...................................................... 10
2.2.1 Music dance of Tigray ......................................................................................... 10
2.2.2 Music dance of Afar .............................................................................................. 11
II
2.2.3 Music dance of Amhara (Gojjam) ....................................................................... 12
2.2.4 Music dance of Amhara (Agew-Awi) .................................................................. 12
2.2.5 Music dance of Oromo/Shoa/ ............................................................................... 13
2.2.6 Music dance of SNNP/guragie/ ............................................................................ 13
2.2.7 Music dance of SNNP/wolayta/ ............................................................................ 14
2.2.8 Music dance of Benshangul Gumuz people ........................................................ 14
2.3 Image/Video Signal ............................................................................................ 15
2.3.1Imageandvideoscenesegmentation ............................................................................. 17
2.3.2 Image and video feature description ......................................................................... 18
2.3.3 Object Recognition in image/video ........................................................................... 19
2.4 Deeplearning ....................................................................................................... 20
2.4.1 Evaluating deep neural network models .................................................................. 23
2.4.2ConvolutionalNeuralNetwork .................................................................................... 25
2.4.3 ActionClassification.................................................................................................... 26
2.5 Related works ..................................................................................................... 27
2.5.1 Video Classification Using Machine Learning ......................................................... 27
2.5.2 Video Classification Using Deep Learning ............................................................... 29
CHAPTER THREE: RESEARCH METHODOLOGY ....................................33
3.1. Introduction.................................................................................................... 33
3.2. Ethiopian Traditional music videos corpus ................................................. 33
3.3. Design Ethiopian Traditional music videos classification ........................... 34
3.4. Preprocessing ................................................................................................. 35
3.4.1. Video segmentation ............................................................................................... 35
3.4.2. Image frame extraction ........................................................................................ 36
3.4.3. Person detection .................................................................................................... 36
3.4.4. Gray scale conversion ........................................................................................... 36
3.5. Video Feature Extraction .............................................................................. 37
3.5.1. HOG/Local feature extraction ............................................................................. 37
3.5.2. Optical flow feature extraction ............................................................................ 37
3.5.3. Deep feature extraction ........................................................................................ 38
3.5.4. Combine /Fusing feature ...................................................................................... 38
3.6. Video classification ........................................................................................ 38
III
CHAPTER FOUR: EXPERIMENT AND IMPLEMENTATION ...................40
4.1 Introduction.................................................................................................... 40
4.2 Dataset ............................................................................................................ 40
4.3 Implementation .............................................................................................. 41
4.3.1 Tools ............................................................................................................................. 41
4.4 Test result ....................................................................................................... 42
4.4.1 Model test with CNN feature ............................................................................... 42
4.4.2 Model test with combined HOG , Optical flow, and CNN features ................ 43
4.5 Discussion ....................................................................................................... 44
CHAPTER FIVE: CONCLUSION AND FUTURE WORK ............................45
5.1 Conclusion ...................................................................................................... 45
5.2 The contribution of the study ........................................................................ 46
5.3 Future work .................................................................................................... 46
Reference.................................................................................................................47
Appendix A: Sample code of CNN end to end classififcation ...........................52
Appendex B: sample code for conbined features of HOG, Optical flow amd
CNN ........................................................................................................................55
Appendix C: Sample sequential frames ...............................................................56
IV
V
Abbreviations
ANN Artificial Neural Network
AI Artificial Intelligence
CNN Convolution Neural Network
DFT Discrete Fourier Transform
DSP Digital Signal Processing
ELU Exponential Linear Unit
FN False Negative
FP False Positive
GPU Graphical Processing Unit
GRU Gated Recurrent Unit
HOF Histogram of Optical Flow
HOG Histogram of Oriented Gradients
KNN K Nearest Neighbor
LBP Local Binary Patterns
LIBSVM Library for Support Vector Machine
LSTM Long Short-Term Memory
MATLAB Matrix Laboratory
NB Naïve Bayes
ReLu Rectified Linear Unit
RGB Red Green Blue
RNN Recurrent Neural Network
SGD Stochastic Gradient Descent
SIFT Scale Invariant Fourier Transform
STFT Short-Time Fourier Transform
STIP Space-Time Interest Point
SVM Support Vector Machine
TN True Negative
TP True Positive
V
List of Figures
Figure 2.2.1 Music dance of Tigray people (Tigray Traditional Dance Photo, Ethiopia Africa, n.d.) ............... 11
Figure 2.2.2 Music dance of Afar people (Traditional Dances – Afar Culture & Tourism Bureau, n.d.) ....... 12
Figure 2.2.3 Music dance of Gojjam ........................................................................................................................ 12
Figure 2.2.4 Music dance of Agew-Awi people........................................................................................................ 13
Figure 2.2.5 Music dance of Oromo/shoa people .................................................................................................... 13
Figure 2.2.6 Music dance of SNNP/Guragie people ................................................................................................ 14
Figure 2.2.7 Music dance of SNNP/wolayta people................................................................................................. 14
Figure 2.2.8 Music dance of Benshangul-Gumuz people ....................................................................................... 15
Figure 2.9 Convolutional kernel(Trask, 2019) ........................................................................................................ 26
Figure 3.3 Proposed classification model for Ethiopian traditional music video ................................................. 35
VI
List of Tables
Table 1 Summary of related works .......................................................................................................................... 32
Table 2 Dataset of Ethiopian traditional music videos ........................................................................................... 34
Table 3 dataset for Ethiopian music video classification ........................................................................................ 41
VII
List of equestions
Equation 1 Accuracy ................................................................................................................................................. 23
Equation 2 Precision .................................................................................................................................................. 24
Equation 3 Recall ....................................................................................................................................................... 24
Equation 4 F1 Score .................................................................................................................................................. 24
Equation 5 Macro Average .......................................................................................................................................24
Equation 6 Weighted Average .................................................................................................................................. 25
Equation 7 Orientation of gradient .......................................................................................................................... 37
VIII
Abstract
In Ethiopia, there are several nations and ethnic groups which have their own language, culture,
and custom. They play their traditional music and perform their own dance style, while
traditional music is recorded as a video and uploaded on the internet for further use. But
accessing traditional music videos from the internet is tiresome for the user. Therefore, to get
those uploaded music videos easily from the internet it is required to develop a video
classification model. From this perspective, this study aims to develop a classification model for
Ethiopian traditional music videos to overcome the problem of inconveniency.To build this
model we follow experimental research methods,for implementing an algorithm and empirically
evaluating the efficiency and effectiveness and a purposive sampling technique is employed,
because to collect the typical traditional music video that describes all the nations included in
the study. From the selected 8 nations and ethnic groups 80 video clips were collected as a
sample from different online video sharing platforms, and we get 10248 sequences of images for
a dataset.Convolutional Neural Network (CNN),Optical Flow and Histogram Orientation
Gradient (HOG) are used for feature extraction and also Convolutional Neural Network (CNN)
is used as a classifier with python open-source tool.Finally the system has been tested using
sequence of images, which were collected for testing and training purpose. Experimental results
show that, when we use only CNN end to end classifier, the classification accuracy is 83% and
when we use the combination of HOG, Optical flow with CNN features the classifier achieves
good classification accuracy 86%.

Key words: HOG, CNN, Optical Flow, Feature extraction, classifier
1
CHAPTER ONE: INTRODUCTION
1.1. Background
Now a day entertainment industries have a big role in the political, economic, and social affairs
of the countries. Especially in the globalized world, every country has thewillingness to spread
and put their interest in the mind of others using a different form of entertainment. Therefore,
music is one element of the entertainment industries, are listening, and watching music videos to
entertain and enjoy themselves. In this respect, music videos are used as a tool to promote and
create images of their cultures and customs in-depth for the rest of the world over the
Internet(Throsby, 2014).
1.1.1 Ethiopian traditional music and music videos

Our country Ethiopia has several nations and nationalities, which have their language, culture,
and traditional practices to express them. According to their culture, they have their musical
instruments and dancing styles which are performed by individuals or a group(Getachew, 2015).
Traditional music plays a vital role not only in cultural and social activities like marriage,
hunting, farming, gathering, rituals of birth, and puberty but also fosters social cohesiveness. The
folk songs, folk music, dance, and musical instruments provide space for the manifestation of
cultural expressions and ethnic diversity of nations, nationalities, and people of Ethiopia.
The music tradition of Ethiopia is closely associated with folk dance which adds to the artistic
value of the presentation. The music performance involves singing and dancing of a group of
villagers and moving from one village to another promoting sustenance of cultural heritage and
strengthening community cohesiveness. Besides, the use of costumes, mask, musical instruments
in ceremonies and rituals promote an aura of sacredness. The participation of the audience in
music and dance performance and terms of hand-clapping, finger-popping, foot-tapping, and
vocal prompting isthe manifestation of appreciation, approval, and motivation for performing
artists. The cultural dimensions of traditional music and dance among indigenous communities
of Ethiopia are complex and interlinked withthe great tradition of the country (Pati et al., 2015).
In recent years, a large number of our country's music videos are recorded, produced, and
uploaded on the internet through different service providers, so that these music videos are
available online everybody can get access to watch those music videos instantly. However, due
2
to their similarity and large numbers, it is difficult for individual users and music experts to
identify and classify which music belongs to which nations and nationalities across the country,
and they spend much more time getting their interest.
1.1.2 Video classification

Video classification is an application area of computer vision that incorporates text, audio, and
video data.The purpose of video classification is to categorize various videos in different
parameters such as their similarity, or distinct(Ramesh & Mahesh, 2019).For performing the
automatic classification of video, a large number of approaches and techniques have been
attempted. The most common approaches are audio-based, text-based, and video-based
approaches(Ramesh, 2018). The video-based approach is frequently used as compared with
others for video classification because humans perceive more information from visual features
(Darji & Mathpal, 2017). This research work suits a video-based approach. Video classification
will conduct based on the features, to extract these features extraction algorithms are required
that can solve the classification problem appropriately(Rachmadi et al., 2017).However, this
research work is conducted to improve the performance of video classification as compared to
related works by using CNN.CNN is one of the deep learning approaches, which has different
levels of convolutional layers, each level is responsible for feature extraction and CNN is a
successful feature learner and classifier(Albawi et al., 2018; Saif et al., 2018; O’Shea & Nash,
2015).
1.2. Statement of the problem

Ethiopia has more than 80 ethnic groups. These various nations and ethnic groups of Ethiopia
have their distinct music, cultures, and traditions. Ethiopian dances are performed according to
the occasions or situations so we didn’t simply classify them according to their function but
rather according to their uniqueness and individuality. Therefore, there are over 150 unique
dance movements across Ethiopia(Getachew, 2015). Ethiopian traditional music videos are the
composition of different traditional musical instruments and dances. Due to the diversified nature
of Ethiopian traditional music are difficult to identify and distinguish, which music video
belongs to which nation/ethnic group, even for Ethiopian.Especially classified and categorize the
music using itslanguage, lyrics, and melody are difficult. For instance, Gondar, Gojjam, Wollo,
3
and Shoa are located within one specified region, their ethnicity, and languages are the same but
their traditional dancing style and choreography are quite different.
The search mechanism of videos still depends on the content tag(Tefera, 2016). Even though
current Ethiopian traditional music video classifications are done by experts by watching and
listening leading to the problem of being complicated, inefficient, time-consuming, and tiresome.
So to overcome the aforementioned problems video classification model is needed as also
supported in the study of (Bhatt, 2015; Ramesh & Mahesh, 2019; Sun et al., 2017).
(Tefera, 2016)researched the classification of Ethiopian traditional dance usingvisual and audio
features. Inhis research, he considered10 traditional styles of songs namely Eskista (only
theGondar type), Gurage, Asosa, Somali, Oromo, Afar, Hararghe, Tigrinya, Wolaita,
andGambella.He appliedHOG on each frametodetectthe region of interest i.e. danceperformer.
Then he extracted local features relatedto body shape.To achieve that heusedthe LogGabor filter.
Asa feature selection technique, the author hasused Principal ComponentAnalysis,and finally,
the selected features wereused directlyby the classifier. The learning algorithm usedto build the
classifier model wasSVM. Theperformance of the classifierwhenusingonly visual feature was
82.59% and when he combined visual and audio features the classifier scored86.2%. In this
research, the selected featureis focused on the shape andwebelieve if other features such asthe
movementof thedancers were integrated,theperformance of the classifierwould havebeen
increased.Besides, the authorwas very selectivewhen choosingvideo clips thatwere included in
thedataset, such asbackground andwe believe theperformance of the modelwouldbevery low on
real-timedata that includes different types of background.
Thework doneby(Subramanian & Mekonnen, 2016) aimed at classifying ten Ethiopian

traditional dances namely:Afar, Benshangul,Gambella, Eskista,Gurage,Hararghe, Oromo,
Somali, Tigrinya, and Wolayta. They used optical estimation to estimate the motion of
thedancers as away of feature extraction andHOGtodetectthe region of interest. Asan optical
flow estimation method, the authors useHorn-Schunck and Lucas-Kanade optical flow algorithm.
Both works are the same but their difference liesin the learning algorithm used. While the
learning algorithmused in (Subramanian & Mekonnen, 2016) is SVM, the learning algorithm
used in (Subramanian & Mekonnen, 2016) is ANN. Whenusing SVM the maximumaccuracy the
4
authorsgot is99.7 and it is achievedwhen the Lucas-Kanade method isused. The maximum
accuracy they acquired when ANN isused is82.0% and it is again achieved whenthe Lucas-
Kanade method isused.Even though these two works include the motion information in the
feature extraction step, they werevery selective when choosingdata (with no background
movement) making their model lessperformingina real environment.Also, they only train and test
their system with only oneperson asadanceperformer resulting in using a dataset that did not
represent Ethiopian traditional dance which is mostly performed in groups.
The research work conducted by (Mulugeta, 2020) attempts to mergeaudio and visual features to
improve the performance of the classification model, using a pre-trained network called VGG-16
to extract visual information and a custom pre-trained network to extract audio features from the
video. Inthe end, the classifier result showed that the only audio features performance was 85%
and using only visual features performance was 78%. But the researcher recommended that the
combination of audio and video features it could increase the cumulative result of the proposed
classifier by up to 85% (Mulugeta, 2020). Hence this study is focused on classification by
combining traditional music audio and video features.
Still, when we come to the real world to distinguish music videos visual features are more
important because we have discussed the music video clips. Video clip is visual art by nature it
incorporates dances and different movements. But as we mentioned in the above (Mulugeta,
2020), stated that audio features are more powerful than visual features to classify and categorize
Ethiopian traditional music videos. Hence, visual features are not efficient to classify music
video clips. However, previously stated researchers attempt to extract visual features using a
deeplearning pre-trained architecture of VGG16. Video clip incorporate different dance
movements especially in Ethiopian traditional music videos are full of dance movements and
styles, these dance movements are obviously different from the others according to their culture,
and the performer or dancers use their different body parts. One nation may use basically a
movement of a shoulder, waist, hand, neck, or leg either a combination of two or more parts at a
time. Therefore, to classify these different national music video clips we have to extract all visual
features accordingly. However those previously worked research papers used different feature
extraction techniques to draw features from the input, and classifiers based on their advantages
and gets different performance result, still it is not efficient. The first paper is used HOG feature
5
extraction techniques, this technique can be represented and drawn features of images shapeand
orientation from an image by measuring the gradient orientation but it can’t be extract motions,
the second paper is used Optical flow estimation for motion estimation and HOG for feature
extraction. The third paper is used a deep learning pre-trained architecture of VGG16 to extract
visual features. VGG16 is a convolutional network that is used for classification. However CNNs
(Convolutional Neural Networks) are highly effective in analyzing and extracting features from
image and video data, but there is a gap in certain types of features that may not be able to
extract. Such as orientation, spatiotemporal (human motion) features.
This study attempts to answer the following research questions;
 What techniques shall be used to extract features from the video?

 How to enhance the CNN feature extraction techniqueto achieve better classification
accuracy?
 What learning algorithm should be used for the puporse of the classification?
1.3. The objective of the study
1.3.1 General Objective

The general objective of this research is to design a Classification model for Ethiopian traditional
music videos using the Convolution Neural network.
1.3.2 Specific Objectives

The specific objectives of this research are:
 To review previous research works focused on video classification using different techniques.
 To collect and preprocess Ethiopian traditional music videos from different sources
 To Identify techniques or tools for feature extraction from music videos
 To build a classification model using a CNN
 To select an optimal classification algorithm among the known classifiers
 To evaluate the performance of the classification model using the testing dataset,
6
1.4. Scope of study
In this research, data set preparation, preprocessing, feature extraction, and classification
techniques have been explored to design a classification model for Ethiopian traditional music
video using CNN (convolution Neural Network). In video classification, features are drawn
from three modalities; text, audio, and visual features. However, forthis study, we considered
visual features because as compared with others humans perceive more information from visual
features (Darji & Mathpal, 2017).
Due to the availability of Traditional music videos, the study covers only 8 ethnic groups/nations
selected from 6 Ethiopian regions. Theselected regions are Afar, Tigray, Amhara, Oromia,
BenishangulGumuz, and South nations and nationalities. From each region, one/two nations are
selected for data collection and implementation purposes.
1.5. Significance of the study

The outcome of the study has a better achievement for music video clip classification, especially
for studying Ethiopian music and culture. Within the past few years, there has been a tremendous
amount of interest in computer vision and auditions are to understand human behavior from
videos especially;
 This study has a great contribution when there will be an increase in multimedia data
access through the internet, music video clip data indexing,tagging and retrieval becomes
more and more important. Not only in the retrieval but also for content filtering.
 The current study is important in digitizing the cultural heritage of the country. Since
soon everything will be digital so that it will be simple to handle a vast video that will be
available.
 This study is important for video summarization and for dividing music videos for
different applications.
1.6. Methodology
Methodology is a set of systematic approaches and techniques in research that are used as a
roadmap for the research that answersto a question of how the research was conducted. This
study follows an experimental research methodology. Experimental research is the process of
7
implementing an algorithm and empirically evaluating the efficiency and effectiveness of those
algorithms to solve real-world problems (Igwenagu, 2017).
1.6.1. Literature review

Literature review is a summary of previously worked and related research on the topic. In this
research, in order to understand the basic concept of video and image classifications, feature
extraction techniques, procedures, and tools different previously worked research papers, books,
journals, proceedings, etc. are reviewed. After that, for feature extraction and video
classification, appropriate tools and techniques are selected.
1.6.2. Data set preparation

For this study, we collect different music videos from each region that can be representing their
cultures. Therefore, the sampling technique is purposive sampling because we have to collect a
variety of traditional music videos from each region which represents the region and a specific
nation. For this study, the data collected 8 ethnic groups from 6 regions and one/two nation or
ethnic groups from each region respectively. The selected regions are Afar, Tigray, Amhara,
Oromia, BenishangulGumuz, and South nations and nationalities. From each region, one or more
nations are selected. These sample videos are collected from different online video-sharing
platforms YouTube (https://www.youtube.com),Dire Tube(http://www.diretube.com),
SodereTube (http://www.soderetube.com), AddisZefen (http://www.addiszefen.com).
1.6.3. Data preprocessing and Feature Extraction

After the selection of relevant music videos, data preprocessing is conducted to convert raw data
into useful and acceptable forms for a system. HOG, Optical flow, and CNN are used as feature
extraction tools from the given video. Because CNN is capable to extract deep features and HOG
is used to extract shape and orientation features and Optical flow to understand and extract
human motions as well. Key frame detection is employed to detect representative and descriptive
frames according to human action.
1.6.4. Design and Evaluation

For design and implementation purposes, we use the CNN classification model because CNN is
a remarkable, successful, and better classification algorithm (Sharma et al., 2018; Debayle et al.,
2018; Al-Saffar et al., 2017; Karpathy et al., 2014). Python is appropriately employed in the
8
study due to its open-source package, the fastest general-purpose programming language and it
can be run on all platforms (Linner et al., 2021).To evaluate the performance or efficiency and
effectiveness of the model we use Precision, Recall, and Specificity in predicting the outcome of
new observations and test data that have been not used to train the model.
1.7. Organization of the study

The study is structured into five chapters. Chapter One discussed the overview and general
background of the study, statement of the problem, objectives, methodologies, and scope of the
study. Chapter two presents Ethiopian traditional music videos and a review of related works.
Chapter three shows the methods, techniques, algorithms, and architecture to solve the problem.
Chapter Four highlights the experiments and results of the proposed system. And chapter Five
summarizes the conclusion, recommendation, andfuture works.
9
CHAPTER TWO:LITERATURE REVIEW
2.1 Introduction
This chapter gives general information and background about the research topic and
previously worked research. It includes attempts of different researchers to solve the problem,
what techniques, algorithm they use, and the procedures they follow.
In recent years, the internet has been creating good opportunities for music industries to earn
much more money than before. Therefore vast and variety of music videos and movies are
uploaded per day and available on the internet and it’s accessible across the globe through
different agents or providers(Throsby, 2014). But individual users to entertain themselves did
get music videos easily as per their request and preference due to the variety and volume of
music videos.
Currently, across the world searching videos depend on the content tag and text-based user
annotation, but it is inefficient(Tefera, 2016). However, current Ethiopian traditional music
video classifications are done by experts by listening and watching a music video. So that the
study allows classifying a given video automatically based on its visual features.
2.2Ethiopian traditional music and video clips

Ethiopia is a country thathas more than 110 million people and 10 regional states(Ethiopia
Population (2021) - Worldometer). Under these regional states, more than 85 nations/ethnic
groups are there. These nations/ethnic groups in the country have their unique language,
traditional practices, customs, thoughts/beliefs, musical instruments, and dancing styles
which can be performed as an individual or a group on different occasions. According to
analysts, there are over 150 unique dance movements across Ethiopia(Getachew, 2015).
Nowadays, Ethiopian traditional music artists are flourishingand the music video clips are
dramatically increased and have millions of views. For this research work, we have selected 8
traditional music video styles from 6 regions. Based on the availability of music videos. The
selected categories are the music of Tigray, Awi Agew, Gojjam, Shoa Oromo, Gurage,
Wolayta, Benshagul Gumz, andAfar.
2.2.1 Music dance of Tigray

Tigray is one of the regions located in northern Ethiopia, Tigrayan, Irob, andKunama ethnic
10
groups are lived there.TheTigrayans have traditional music dance styles which are
performed in groups and havesmoothcirculardance movementsincorporatedwith
shoulderneck and jumping movements(Traditional Ethiopian Music and Ethiopian Culture –
Tigray, n.d.).Thiscirculardanceiscommonlyknownaskuda.Intheirmusic,itisvery
commontoheartraditionalEthiopianinstrumentssuch asthe masinqo,Kerar, and Kebero.
Figure 2.2.1 Music dance of Tigray people (Tigray Traditional Dance Photo, Ethiopia Africa, n.d.)
2.2.2 Music dance of Afar

Afar is a region which is located in northern Ethiopia. The place is commonly known as the
origin of mankind because ‘Lucy’ and other ancestors werefound in Afar.
Afar people have different kinds of traditional dances for various ceremonies like Weddings,
rituals, and holidays. The dances are performed together as well as in specific age and sex
groups.
The traditional dance of Afar people is characterized by hand and leg movement. During the
dance, the men wear traditional cloth (“Kutta”) and they will tie the knife (“gorade”) and will
up their hands and wave their hands from left to right in the air and at the same time jump up.
Their Dancing style is mainly jumping to the air by pointing their hand in the up
direction(Traditional Dances – Afar Culture & Tourism Bureau, n.d.).A commonly known
dance style is the ‘Lale’ dance style played by both sexeson any occasion.
11
Figure 2.2.2 Music dance of Afar people (Traditional Dances – Afar Culture & Tourism Bureau, n.d.)
2.2.3 Music dance of Amhara (Gojjam)

Amhara People are an ethnic group which is located in the northern and central parts of
Ethiopia. Gojjamisinthenorthwestpartof thecountrylocatedinthe Amhararegion.Liketheother
Amharapeople,theirdanceinvolves themovementof the neck,shoulder,andchest
withseveralvariations ofmovementsandsteps.Theirdanceiswellknownfortheshakingof
theupperbodyorchest and shows warriors' expressions on their face and body(Introducing
Ethiopian Folk Dance_ Mocha Ethiopia Dance Group (English Site), 2002; Tefera,
2016).
Figure 2.2.3 Music dance of Gojjam
2.2.4 Music dance of Amhara (Agew-Awi)

AgewAwiis oneofthe ethnic groupsthat lived in theAmharaRegionofEthiopia. This is located
on the border between Amhara and Benshangul-Gumuz Region.The dance of Agew is full of
interesting movements and most likely thewomendance
withholdinganumbrella,madeofbamboowhilethe mendancewith holdingChira and tirumba.
12
Figure 2.2.4 Music dance of Agew-Awi people
2.2.5 Music dance of Oromo/Shoa/

ShoaOromosareamongthevariantOromoslocatedinthecentralpartofEthiopia.They haveavery
uniquedancemovementthatinvolvesthefastmovementoftheheadby the women. Also,
theyhaveauniquecostume,thewomenwearleather-madewildtwo-
piececostumedecoratedwithshellswhilethemenwearfurskinlion’smaneonthehead andusea
stickfordance.
Figure 2.2.5 Music dance of Oromo/shoa people
2.2.6 Music dance of SNNP/guragie/

Southern Nations, Nationalities, and Peoples (SNNP) are one of the regions located in
southern Ethiopia. SNNP is an extremely ethnically-diverse region of Ethiopia, inhabited by
more than 80 ethnic groups, of which over 45 (or 56%) are indigenous to the region
(Subramanian & Mekonnen, 2016). These ethnic groups are distinguished by different
languages, cultures, and socioeconomic aspects. The dance style of the region is very much
diverse since this region is a diversity of many nations and nationalities. The two most
famous dance styles of the region are the “Gurage” dance which is known as “Goragigna”
13
and “Wolayta” dance which is known as “Wolaytigna”. In this study, both Goragigna and
Wolayitgna dance is investigated.
Gurage peoples are symbolized as hard workers throughout the country, and their dancing
style brings such an image of hard-working people. The dance style is characterized by foot
motifs that go with the whole-body movement. This dance is performed in two ways; the first
part is short and slow motion. The second part of the dance might be performed repeatedly
for three or four times with short intervals depending on the dancer’s interest and
energy(Subramanian & Mekonnen, 2016).
Figure 2.2.6 Music dance of SNNP/Guragie people
2.2.7 Music dance of SNNP/wolayta/

Wolaytaisoneof theethnicgroupsfoundin southernEthiopia. TheWolaytasareknownfor their
rich culture and traditional music dances. Their dance has a distinct movement of the waist.
Their music is playing a prominent role in nationalentertainment in Ethiopia today. The
unique and fast-paced Wolayta tunes have influenced several styles and rhythm as it
continues to shape the identity of Ethiopian musical diversity.
Figure 2.2.7 Music dance of SNNP/wolayta people
2.2.8 Music dance of Benshangul Gumuz people

Benshangul-Gumuz is one of the regions located in the western region of Ethiopia and the
border between Ethiopia and Sudan. ThedanceofBenshangul-Gumuzismainly characterized
14
by theup-and-downmovementofthewholebody withtherhythmofthemusic.Among
severaltraditional dancesavailableintheregion, “Akusha”
isonethemostpredominantdancestyle.Bothmen'sandwomen'sdancestyleisalmostverysimilar
and performed in a group.
Figure 2.2.8 Music dance of Benshangul-Gumuz people
2.3 Image/Video Signal

Images are signals thatdescribe howaparameter varies overthe surface. They have their
information encoded in the spatialdomain, theimage equivalent of the timedomain. In
otherwords, features inan image are representedby edges, not sinusoids. This means that the
spacing and number of pixels aredeterminedby how small of features needto be seen, rather
thanby the formal constraints of the sampling theorem(Dyer & Harms, 1993).
Images are signals with special characteristics. First, they area measure ofaparameter over
space (distance),while most signals area measure ofaparameter over time, such as audio.
Second, they containagreatdeal of information. For example, more than10 megabytes can be
requiredto store one second of televisionvideo. This is more thana thousand timesgreater than
fora similar lengthvoice signal. Third, the final quality is oftena subjective human evaluation,
rather than an objective criterion. These special characteristics have made imageprocessinga
distinct subgroupwithinDSP(Dyer & Harms, 1993)
Video is the dynamic form of static images. Inotherwords, thevideo is composed ofa series of
static images ina certain order, and each image is calleda ‘frame’(Bovik, 2005).Video
referstopictorial (visual) information, including still images and time-varying images.A time-
varyingimage is such that the spatial intensity pattern changes with time.Hence,a time-
15
varying image is a Spatiotemporal intensity pattern, denoted by sc(x1, x2, t), where x1 and x2
are the spatial variables and t is a temporal variable. It is formed by projecting a time-varying
three-dimensional spatial scene into the two-dimensionalimage plane. The temporal
variations in the 3-D scene areusuallydueto the movement of objects in the scene. Thus, time-
varying images reflectaprojection of a 3-D moving object into the 2-D image plane asa
function of time. Digitalvideo correspondsto a spatiotemporally sampled version of this time-
varying image(Bovik, 2005; Mulugeta, 2020).
Animportant feature of digitalimages andvideos is that they are multidimensional signals,

meaning that they are functions of more thana single variable. In the classic study of digital
signalprocessing, the signals areusuallya one-dimensional function of time. Images, however,
area function of two, andperhaps three space dimensions, whereas digitalvideo as a function
includesa third (or fourth) time dimension aswell. The dimension ofa signal is the number of
coordinates that are required to indexa givenpoint in the image(Mulugeta, 2020).
People start the research in imageprocessingby simulating humanvision to see,understand,

and even explain the realworldusing three techniques: image segmentation, image analysis,
and image understanding (Bovik, 2005). While imageprocessing refers to the technique of
performinga series of operations on an imageto achieve somedesired purposes, Image
Analysis mainly aims todetect and measure the object of interest in the digitalimage,to create
specificdescriptions. It isaprocess from an image to values or symbols,which generates some
non-imagedescriptions or representationsby extractinguseful data or information first. Its
content can be divided into several parts, such as feature extraction, objectdescription,
targetdetection, scene matching, and recognition. They describe the characteristics and
features of the target in the image. Similarto image analysis,video analysis is alsoa broad
conceptthat includes the task ofvisual objecttracking, human action analysis, abnormal
behaviordetection, and so on.
Ina real-world environment,due to the changes inillumination, the target movement

complexity, occlusion, color similarity between targets and background, and background
clutter, it is difficulttodesigna robust algorithm for targetdetection and tracking(Mulugeta,
2020).
16
2.3.1Imageandvideoscenesegmentation
Weareonlyinterestedin regionsofinputsthatarecalled targetorforeground,andit
generallycorrespondstocertainspecificoruniqueproperties regions in the image orvideo frame.
On theother side,a region in which we are not interested is called background.
Segmentation means dividing an imageorvideo frame intoseveralspecific and uniqueproperty

partsor subsets according to theprinciple and extracting the target of interest forahigher level
of analysis andunderstanding(Mulugeta, 2020). Therefore, it is the basis of
featureextraction,recognition,and tracking.Thereisneitherageneral methodforimage
segmentation so far, noris an objective criterion for judging whether the segmentation
issuccessful.
Thepurpose of segmentation is to isolatea meaningful entity froman image orvideo sequence.

These targets typically correspondto some specific,unique areas ofan image or video frame.
Theuniqueness here can be thegrayscale value of a pixel, object contour curve, color, texture,
movement information, etc. Such uniqueness can beusedto represent an object as the
characteristics between regions of different objects. The target can correspondtoa single
region or multiple regions. To identify and analyze the targets, it is necessary to isolate and
extract them, such that further identification and understanding can be carried out.
Image segmentation can be built on two basic concepts: similarity and discontinuity. The
similarity of pixels means thatpixelsoftheimageinacertainregionhavesomesimilar
characteristics,suchaspixelgrayscale,or textureformedbythearrangementof pixels.
Thediscontinuity referstothediscontinuity ofsomefeaturesofpixels,suchasthe
mutationofgrayscalevalue,themutationof color, and the mutation of texture structure.Image
segmentation isgenerally achievedby considering the image color,grayscale, edge, texture,
andother spatial information. Currently, image segmentation algorithms can be divided into
two categories: structural segmentation and non-structural segmentation.
Structural segmentation methods are based on the characteristicsofthelocalareaof

imagepixels,includingthresholdsegmentation,region growing,edgedetection,texture
analysis,etc.Thesemethods assure that the features of these areas areknown in advance and are
obtainedduringprocessing.
Non-structural segmentation methodsincludestatisticalpattern recognition,

neuralnetworkmethods,andmethodsusingpriorknowledgeof relationshipsbetween
17
objects.Forexample,thesnakemethodwhich usesan activecontourmodel tosegment objects isa
framework in computervision fordelineating an object outline fromapossibly noisy2D
image.Because there is no temporal information inimage segmentation, it cannot beusedtoget
satisfactory segmentation results onthe video sequence. The efficiency ofthesegmentation
algorithm can beimprovedby considering the time correlation ofvideo frames. Therefore,
video segmentation jointlyuses spatial and temporal informationto achieve this goal.
2.3.2 Image and video feature description

Anotherstepinimage/videosegmentationistodescribethecharacteristicsof thescene withaseries
of symbols or rules and then identify, analyze, and categorize the descriptions.Image features
referto their originalproperties or attributes. It is one of the most basic
attributesusedtodistinguish different images. Some are natural and can beperceivedby
thevision, such as the brightness of the region, edges, textures or color, etc.Some artificial
(man-made) onesneed the be transformed or measured, such as transform
spectrum,histogram, moment, etc(Mulugeta, 2020).
Featureextractionisaprocessofmeasuring theintrinsic,essential,andimportant
featuresorattributesof theobjectandquantizingtheresultordecomposingand symbolizing the
objectto form the feature vectoror symbol string and relational map(Mulugeta, 2020).
IngeneralFeaturedescriptors,refertoaseriesofsymbolsthatareusedfordescribingthe
characteristics ofanimageorvideoobject.Agooddescriptorshouldbeinsensitivetothe
target’sscale,rotation,translation,etc.Featuredescriptorsgenerally fall intotwomain
groups:globalandlocal.
Aglobalfeatureiscalculatedfromallthepixelsof animage,anditdescribes theimage as

awhole.Commonly used features include color, texture, shape features, etc.Compared
toglobal features,localfeaturesdescribethelocal regionsinanimage with
betteruniqueness,invariability,androbustnessandthey havebetterrobustnessto
backgroundclutter,localocclusion,andilluminationchange.Localfeaturesmaybepoints,
edges,orblobsinanimage,whichhavetheadvantageofdescribingthecharacteristics of
pixelsorcolorsinlocalregions.Duetoitsexcellentperformance,localfeatureshave
attractedmoreandmoreresearch attention.In particular,somelocalfeatureswithstrong
robustnesstoillumination and occlusionhavebeen proposedin recentyears,such
asMoraveccornerdetector,Harris cornerdetector, Smallestunitvaluesegmentassimilating
18
nucleuscornerdetector (SUSAN),ScaleInvariantFeatureTransform (SIFT),Difference
ofGaussian(DOG)operator,Histogram OrientationGradient(HOG),
SpeedUpRobustFeatures(SURF),Maximally StableExtremeRegions(MSER),Local
BinaryPattern(LBP)(Mulugeta, 2020).
2.3.3 Object Recognition in image/video

Therecognitionabilityofhumankind more powerful. Evenwithdramaticscale
changes,largedisplacement,andheavyocclusion,peoplecanstillidentifytheobjects.
Incomputervision,imagerecognitionmainly referstothetaskof recognizingobjectsin
animageorvideosequence.We employ somecomputational modelstoextractfeatures fromatwo-
dimensionalimagetoformthedigitaldescriptionandthenestablishaclassifier for
classificationandrecognition.
The classifier can be dividedintothree categories:Generative/ProductiveModel

(includingprobabilitydensitymodel), DiscriminativeModel(decisionboundarylearning
model).Recently,deeplearning-basedmodelshavebeen widely appliedin object
recognitiontasks,which can be viewed as another category.
The generative model is also called the productivemodel,which triestoestimatethejoint

probabilitydistributionof trainingsamples(observations)andtheirlabels.Popular
generativemethodsincludeGaussian Mixture Model(GMM),
NaïveBayesModel(NBM),MixtureofMulti-nominalModel (MMM),MixtureofExpertsSystem
(MES), HiddenMarkovModels(HMM),LatentDirichletAllocation (LDA), Sigmoidal Belief
Networks (SBN), Bayesian Networks (BN), Markov Random Fields (MRF), etc.
The discriminativemodel is also called Conditional Model, or Conditional Probability

Model. The objective of this model is to look for the optimal classification surface between
differentcategories,which reflectsthedifferencebetweenheterogeneousdata. The
commonlyused methodsincludeLinearDiscriminant Analysis, LogisticRegression, Artificial
Neural Networks, Support Vector Machine, Nearest Neighbor, Boosting trees, Conditional
Random Fields, etc(Mulugeta, 2020).
Different from the traditional object recognition pipeline of “feature extractionclassification”,

object recognition can be achieved in an “end-to-end” way through deep learning. Almost all
object recognition tasks have been shined by deep learning nowadays, and the performance is
greatly improved (Mulugeta, 2020).
19
2.4Deeplearning
Deep learning is a subset of machine learning, which is a field dedicated to the study and
development of machines that can learn (sometimes to eventually attain general artificial
intelligence). In industry, deep learning is used to solve practical tasks in a variety of fields
such as computer vision (image), natural language processing (text), and automatic speech
recognition (audio). In short, deep learning is a subset of methods in the machine learning
toolbox, primarily using artificial neural networks, which are a class of algorithms loosely
inspired by the human brain(Trask, 2019). Furthermore, Deep Learning is a mathematical
framework for learning representations from data (Ketkar, 2017).
Deep learning has made enormous success in different areas, such as computer vision, speech
recognition, and natural language processing. Inspired bythis success, many researchers are
moving from using traditional (shallow) machine learning algorithms to using deep learning
to solve problems in music information retrieval and human activity recognition.When
designing a model using these algorithms, the performance of the model heavily depends on
the feature engineering task rather than on the classifier itself (Rafi et al., 2021). However,
the deep neural network automatically learns advanced feature layer by layer automatically
and effectively instead of time-consuming for feature engineering(Song et al., 2018) which
led to the success of many problems in different areas. The idea of training data using the
neural network has been there for a long time but it has become popular due to several
reasons such as the availability of a large amount of data and advancement of hardware
technologies such as Graphical Processing Unit (GPU).
Modern deep learning involves numbers of successive layers of representations and they have
all learned automatically from the inputfor training data. Meanwhile, other approaches to
machine learning tend to focus on learning only one or two layers of representations of the
data. In deep learning, these layered representations arelearned via models called neural
networks. The term neural network is referred to as neuronbiology, but although some of the
central concepts in deep learning were developed in part by drawing inspiration from our
understanding of the brain, deep-learning models are not models of the brain (Ketkar, 2017).
A simple definition of a neural network is one or more weight that is multiplied by input data
to make a prediction(Trask, 2019). The steps involved in deep learning are,learn, compare
and predict.
20
Prediction is what the neural network will tell you, given input data, such as given
temperature, how likely it is going to rain. But the prediction is not always right. Therefore, it
will compare with the truth value and adjust the weight. Comparing gives a measurement of
how many predictionsare missed. A good way to measure error is one of the most important
and complicated subjects of deep learning. One simple way of measuring error is a mean
squared error (MSE). Learning tells each weight how it can change to reduce the error. It is
all about error attribution or the art of figuring out how each weight played its part in creating
error. It is the blame game of deep learning: gradient descent(Trask, 2019).At the end of the
day, learning is about one this: adjusting the weights either up or down so the error is
reduced. Therefore, we can say learning in a neural network is a search problem. We are
searching for the best possible configuration of weights so the network’s error falls to 0 and
predicts perfectly.
The specification of what the layer does to its input data is stored in the layer’s weights. In
technical terms, we would say that the transformation implemented by layers is parameterized
by its weight. In this context, learning means finding a set of values for the weights of all
layers in a network, such that the network will correctly map example inputs to their
associated targets. A deep neural network can contain tens of millions of parameters and we
have to find the correct value for all of them.
To control the output of the neural network, how far the output is from what is expected
needs to be measured. This is the job of the loss function. The loss function takes the
predictions of the network and the true target (what we want the network to output) and
computes adistance score (loss score), capturing how well the network has done on this
specific example. This score is used as a feedback signal to adjust the value of the weights a
little, in a direction that will lower the loss score. This adjustment is the job of the optimizer,
which is implemented by what is called the Backpropagation algorithm.
In the learning process, what the neural network does is search for a correlation between the
input layer and output layer of the training data. If the data do not correlatewith the input and
output we create an intermediate layer (dataset) that does correlate with the output layer. The
goal is to train the network so that even though there is no correlation between the input and
output dataset, the layer (dataset) inserted in the middle (which is called the hidden layer) will
correlate with the output layer(Mulugeta, 2020).
21
If a neural network is trained by adding hidden layers as it is, it would not converge. The
solution is to make the hidden layers sometimes correlate and sometimes not correlate. This is
called conditional correlation or sometimes correlation. This gives it a correlation of its own.
Sometimes correlation is achieved by turning off nodes in the hidden layers based on a
condition, for example, whenever it is negative (Relu or Rectified Linear Unit). This logic is
called nonlinearity. Without the tweak, the neural network is linear(Mulugeta, 2020).
Activationfunctionis afunction appliedtotheneuronsinalayerduringpredictionsuch

asReLu(ReLuisafunctionthathastheeffectofturningallnegativenumbersto0)(Trask, 2019).
Otheractivationfunctions are:
Sigmoid:itconverts theinfinityamountofinputtooutputbetween0and1.It is
usedbothinhiddenandoutputlayers.
Tanh:itisbetterthana sigmoidforhiddenlayers.Itisdifferentthana sigmoid
becauseitsoutputliesbetween-1and1.Thatmeansitcanthrowinsomenegative
correlation.
Softmax:itraiseseachinputvalueexponentiallyandthendividesbythelayer’ssum.
Agradientis thederivativeofthe tensoroperation.Itis thegeneralizationoftheconceptof
derivationstothe functionofmultidimensionalinputs(tofunctionsthattaketensorsas aninput).
Givenadifferentiable function,itistheoreticallypossibletofinditsminimumanalytically: itis
knownthatafunction’sminimumisapointwherethederivativeis0,soallwehave todoisfindall
thepointswherethederivativegoesto0andcheckforwhichof these points thefunctionhas
thelowestvalue.Appliedtoa neuralnetwork, thatmeansfinding analytically thecombination
ofweightvaluesthatyieldsthesmallestpossibleloss
function.Inpractice,aneuralnetworkfunctionconsistsofmanytensoroperationschained
together,eachof whichhasa simpleknownderivative.Chainoffunctionscanbederived
usingthechainrule:f(g(x))= f’(g(x))*g’(x). Applyingthechainruletothecomputation of
gradientvaluesof neuralnetworksgivesrisetoan algorithmcalledBackpropagation.
Backpropagationstartswiththefinallossvalueandworksbackwardfromthetoplayer
tothebottomlayers, applying the chain rule to compute the contribution that each parameter
had in the loss value(Mulugeta, 2020).
There are 3 ways in which we can apply gradient descent to the network(Trask, 2019): Full,
batch, and stochastic gradient descent.Stochastic gradient descent updates
weightsoneexampleatatime(theideaof learning oneexampleata
22
time).Itperformsapredictionandweightupdateforeachtraining exampleseparately.
Ititeratesthrough theentiredatasetmany timesuntilitcanfinda weightconfiguration that works
well for all the training examples.(Full) gradientdescent updates weights one dataset at a
time. Instead of updating the weights once foreach training example, the network calculates
the average weight delta over the entire dataset, changing the weight only each time it
computes a full average.Batch gradient descent updates weights after n examples. Instead of
updating the weight after just one example or after the entire dataset of examples, a batch size
of examples is chosen (typically between 8 and 256), after which the weights are updated. We
can use this method to increase the speed of training and the rate of convergence.
Deep learning offered better performance on many problems. Also, deep learning makes
problem-solving much easier by completely automating what used to be the most crucial step
in a machine-learning workflow: feature engineering. Humans have to go to great lengths to
make the initial input data more amenable to processing by these
methods:theyhadtomanuallyengineergoodlayersofrepresentations fortheirdata. This
iscalledfeatureengineering.Deeplearning on theotherhand,completely automatesthis step.
2.4.1Evaluatingdeepneural network models

Evaluatingadeepneuralnetworkmodelneedsthesplittingoftheavailabledataintothree
sets:training,validation,andtest. Wetrainonthetrainingdataandevaluatethemodelon
thevalidationdata. Oncethemodelis ready,itistestedonefinaltimeonthetestdata.
Someterminologies arenecessarytoknowbeforediscussingtheevaluationmetrics.
True positive(TP)referstothenumberofpositivedatapointsthatwerecorrectlylabeledbythe
classifier.
Truenegative(TN)isthenegativedatapointsthatwerecorrectlylabeledbythe classifier.False-
positive(FP)isthenegativedata pointsthatwereincorrectlylabeledas
positiveandFalsenegative(FN)ispositivedatapointsthatweremislabeledasnegative.
Toevaluateourmodels, wechosethefollowingmeasureofsuccessorevaluationmetrics(Mulugeta,
2020):
a) Accuracy(recognitionrate):Theaccuracyofaclassifieronagiventestsetis the percentageofthe
testsetthatiscorrectlyclassifiedbytheclassifier(Jiawei Han,Micheline Kamber, 2012).
Thatis,
𝑇𝑃 + 𝑇𝑁
𝑎𝑐𝑐𝑢𝑟𝑎𝑐𝑦 =
𝑃+𝑁
Equation 1 Accuracy
b) Precision:canbethoughtofasameasureofexactness.Inotherwords,itisthe
23
percentage of test data labeled as positive is positive(Jiawei Han,Micheline Kamber, 2012).
𝑇𝑃
𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 =
𝑇𝑃 + 𝐹𝑃
Equation 2 Precision
c) Recall(sensitivity):isameasureofcompleteness, i.e. whatpercentageoftestdata

arelabeledas such(Jiawei Han,Micheline Kamber, 2012).
𝑇𝑃 𝑇𝑃
𝑟𝑒𝑐𝑎𝑙𝑙 = =
𝑇𝑃 + 𝐹𝑁 𝑃
Equation 3 Recall
d) F1Score(harmonicmeanofprecisionandrecall):isameasurementthatcombines
precisionandrecallintoone.Bydoingthat,itgivesequalweighttoprecisionand
recall (Jiawei Han,Micheline Kamber, 2012). It is defined as the following:
2 𝑥 𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 𝑥 𝑟𝑒𝑐𝑎𝑙𝑙
𝐹=
𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 𝑟𝑒𝑐𝑎𝑙𝑙
Equation 4 F1 Score
e) ConfusionMatrix:is atableofatleastsizembym.Anentry,CMijinthefirstm
rowsandmcolumnsindicates thenumberoftestrows of classithatwaslabeled bytheclassifier
asclassj. Fortheclassifiertohavegoodaccuracy,ideallymostof
thetupleswouldberepresentedalongthediagonal oftheconfusionmatrix,from
CM1,1toCMm,mwiththerestoftheentrybeingzero.Thatis, ideallyFPandFNare aroundzero(Jiawei
Han,Micheline Kamber, 2012).
f) P/RAUC(areaundert h e precision-recallcurve):Sincet h e measureofaccuracyisnot
enoughtotrulyevaluateaclassifier. Weputanalternativemeasurementcalled areaunderthe
precision-recall curve.The precision-recall curvesummarizesthetrade-off
betweentheTPrateandthepositivepredictivevalue fortheprediction.The areaunder
thecurvesummarizedtheintegral oranapproximationoftheareaunderthe precision-recallcurve.
g) MacroAverage:Thiswillcomputetheaverageby givingeachclassthesame
weightanddividebythetotalnumberofclasses(Jiawei Han,Micheline Kamber, 2012).Itis
calculatedas follows:
∑ 𝑐𝑖
𝑚𝑎𝑐𝑟𝑜 𝑎𝑣𝑒𝑟𝑎𝑔𝑒 =
𝑛
Equation 5 Macro Average
h) WeightedAverage:Thiswillcomputetheaveragebymultiplyingeachclassby
itsweight(samplesineachclass)anddividebya totalnumberofsamples(Jiawei Han,Micheline
Kamber, 2012).Itiscalculated as follows:
24
∑ (𝑐𝑖 𝑥 𝑤𝑖 )
𝑤𝑒𝑖𝑔ℎ𝑡𝑒𝑑 𝑎𝑣𝑒𝑟𝑎𝑔𝑒 =
∑ 𝑤𝑖
Equation 6 Weighted Average
2.4.2ConvolutionalNeuralNetwork
Overfittingisoftencausedbyhavingmoreparameters thannecessarytolearnaspecific
dataset.Whenneuralnetworkshavelots of parametersbutnotmanytrainingexamples,
overfittingisdifficult to avoid(Trask, 2019). Regularization is a means of countering
overfitting. But it is not the only technique to prevent overfitting.
As mentioned before, overfitting is concerned with the ratio between the number of weights
in the model and the number of data points it has to learn those weights. Thus, there is a
better method to counter overfitting. When possible, it is preferable to use something loosely
defined as structure. The structure is selectively choosing to reuse weights for multiple
purposes in a neural network because the same pattern needs to be detected in multiple
places(Trask, 2019).
Normally removing parameters make the model less expressive (less able to learn
patterns),butifwearecleverinwherewereuseweights,themodel canbeequally
expressivebutmorerobusttooverfitting.Perhapssurprisingly,thistechniquealsotends
tomakethemodelsmaller(becausetherearefeweractualparameters tostore).Themost
famousandwidelyused structureinneuralnetworksiscalledconvolution, andwhenused
asalayeritiscalledconvolutionallayer(Trask,
2019).Convolutionalnetworksaresimplyneuralnetworksthatuse convolution in place of general
matrix multiplication in at least one of their layers (Heaton, 2018).
In the convolution layer, lots of very small linear layers are reused in every position, instead
of a single big one. The core idea behind a convolution layer is that instead of having a large,
dense linear layer with a connection from every input to every output, lots of very small
linear layers are used, usually with fewer than 25 inputs and a single output, which are used
in every input position. Every mini-layer is called a convolutional kernel, but it is nothing
more than a baby linear layer with a small number of inputs and a singleoutput.
25
Figure 2.9 Convolutional kernel(Trask, 2019)
As Shown in Figure 9, the single 3x3 convolutional kernel predicts its current location, move
one pixel to the right, then predict again, move another pixel to the right, and so on. Once it
has scanned across the image, it will move down a single pixel and scan back to the
left,repeating until it has predicted every possible position within the image. The result will
be the smaller square of kernel predictions, which are used as input to the nextlayer.
A typical layer of a convolutional network consists of three stages(Heaton, 2018). In the first
stage, the layer performs several convolutions in parallel to produce a set of linear activations.
In the second stage, each linear activation is run through a nonlinear activation function, such
as the rectified linear activation function. In the third stage, a pooling function is used to
modify the output of the layer further. The pooling function replaces the output of the
network at a certain location with a summary statistic of the nearby output. For example, the
max-pooling operation reports the maximum output within a rectangular neighborhood.
The fundamental difference between a densely connected layer and a convolution layer is the
dense layer learn global pattern (patterns involving all pixels) in their input features, whereas
convolution layers learn local patterns (patterns found in the kernel). This key characteristic
gives convolutional networks two interesting properties (Ketkar, 2017):
 The pattern they learn is translation invariant.

 They can learn spatial hierarchies of patterns.
The first convolution layer learned small local patterns such as edges, the second convolution
layer learned larger patterns made of the features of the first layers, and so on.
2.4.3 ActionClassification
Action classification in the video needs to solve two problems. The first one is extracting the
appearance information of the moving human body and the second one is simulating the
26
temporal dynamics of action(Suarez & Naval, 2020). CNN can capture complex spatial
appearance information from an image. However, using CNN for video classification will
cause the loss of temporal context information in the video sequence. Therefore, to model
the temporal context information, the deep learning models have been categorized into
4(Suarez & Naval, 2020). Action recognition based on 3D CNN (Ji et al., 2013), action
recognition based on Spatio-temporal two-stream networks(Simonyan & Zisserman, 2014),
action recognition based on RNN, and action recognition based on skeleton sequence(Li et
al., 2017).
2.5Related works
In this section we are going to consider all previously worked traditional music video
classififcation and as human action recognition (action classification) problems and trends.
2.5.1 Video Classification Using Machine Learning

(Kumar et al., 2017) tried to classify Indian classical dance to extract Indian classical dance
poses from both online and offline videos given a specific dance pose sequence as input.
Among the constraints, the authors have stated are video resolution, frame rate, background
lighting, scene change rate, and blurring. Automatic dance motion extraction is complicated
due to complex poses and actions performed at different speeds in the sink to music or vocal
sound and in the case of Indian classical dance the features representing the dancer should
focus on the entire human body shape. The Indian dance forms are a set of complex body
signatures produced from rotation, bending, and twisting of fingers, hands, and body along
with their motion trajectory and spatial location. As the authors stated there are 8 different
classical Indian dance forms and extracting these complex movements from online videos and
classification requires a complex set of algorithms working in sequence. To solve the
problem, the authors proposed to use silhouette detection and background elimination, human
object extraction, local texture with a shape reference model, and 2D point cloud to represent
the dancer's pose. Five features are calculated that represent the exact shape of the dancer in
the video sequence. For recognition, a multiclass multilabel Adaboost (MCMLA) algorithm
is proposed to classify query dance videos based on the dance dataset. Wavelet reconstructed
local binary patterns were used for feature representation preserving the local shape content
of hands and legs. Two fusion models were proposed for feature fusion. Early fusion at the
segmentation stage with PCA-based Haar wavelet and LBP is used and late fusion using the
Zernike moments, Hu moments, and shape signature is used with Haar and LBP is proposed.
27
The Adaboost can effectively recover the query video frames from the dataset, by shape-
texture observation model defined by discrete wavelet transform (DWT) and local binary
patterns (LBP). The early and late features and classifiers performance tests show that the
proposed late fusion features and multiclass multilabel Adaboost classifier give better
classification accuracy and speed compared to AGM and SVM. The authors conducted a total
of 8 experiments.
The first 4 experiments were for early fusion and the last 4 experiments were for late fusion.
Ofthe 4 experiments, two of them were for online videos, and the rest two were for offline
videos. For the offline videos, the accuracy ranges from 0.74 (early fusion) –0.99 (late
fusion), and for the online videos, the accuracy ranges from 0.66 (early fusion) –0.99 (late
fusion). In this research, the authors used only shape features thatcannot be used directly by
Ethiopian music classifiers. Adding additional information to the classifier such as the
movement of dancers would give more clues when it comes to Ethiopian dance movements.
Also, extracting features manually is not the ideal solution now a day in image processing
problems.
(Tefera, 2016)researched the classification of Ethiopian traditional dance using visual

features. In his research, he considered 10 traditional styles of songs namely Eskista (only the
Gondar type), Gurage, Asosa, Somali, Oromo, Afar, Hararghe, Tigrinya, Wolaita, and
Gambella. He applied HOG on each frame to detect the region of interest i.e. dance
performer. Then he extracted local features related to body shape. To achieve that he used the
Log Gabor filter. As a feature selection technique, the author has used Principal Component
Analysis, and finally, the selected features were used directly by the classifier. The learning
algorithm used to build the classifier model wasSVM. The performance of the classifier when
using SVM was 86.2%. In this research, the selected feature is focused onthe shape and we
believe if other features such as the movement of the dancers were integrated, the
performance of the classifier would have been increased. Also, the author was very selective
when choosing video clips that were included in the dataset, such as background and we
believe the performance of the model would be very low on real-time data that includes
different types of background.
The work done by(Subramanian & Mekonnen, 2016)aims at classifying ten Ethiopian
traditional dances namely: Afar, Benshangul, Gambella, Eskista, Gurage, Hararghe, Oromo,
Somali, Tigrinya, and Wolayta. They used optical estimation to estimate the motion of the
28
dancers as a way of feature extraction and HOG to detect the region of interest. As an optical
flow estimation method, the authors use Horn-Schunck and Lucas-Kanade optical flow
algorithm. Both works are the same but their difference lies in the learning algorithm used.
While the learning algorithm used in(Subramanian & Mekonnen, 2016) is SVM, the learning
algorithm used in(Subramanian & Mekonnen, 2016) is ANN. When using SVM the
maximum accuracy the authors got is 99.7 and it is achieved when the Lucas-Kanade method
is used. The maximum accuracy they acquired when ANN is used is 82.0% and it is again
achieved when the Lucas-Kanade method is used. Even though these two works include the
motion information in the feature extraction step, they were very selective when choosing
data (with no background movement) making their model less performingin a real
environment. Also, they only train and test their system with only one person as a dance
performer resulting in using a dataset that did not represent Ethiopian traditional dance which
is mostly performed in groups.
2.5.2 Video Classification Using Deep Learning

(Ji et al., 2013)develop a 3D CNN model for action recognition. Their model extracts feature
from both the spatial and temporal dimensions by performing 3D convolutions, thereby
capturing the motion information encoded in multiple adjacent frames. They develop a 3D
CNN architecture that generates multiple channels of information from adjacent video frames
and performs convolution and subsampling separately in each channel. The final feature
representation is obtained by combining information from all channels. To boost the
performance, the authors proposed regularizing the 3D CNN model by augmenting the model
with auxiliary outputs computed as high-level motion features. Besides, combining the
outputs with high-level features was proposed. Their method could apply to scenes such as
noise, background interference, and occlusion.The authors evaluated the model on the TREC
Video Retrieval Evaluation (TRECVID 2008) data, which consists of surveillance video data
recorded at London Gatwick Airport and KTH dataset and they get an accuracy of 0.8853
AUC when the false positive rate is 1%. Their experimental results show itseffectiveness in a
real environment but the performance of their Training a model that is designed using 3D
CNN from scratch requires a large amount of video data, memory, and processing power
which are impossible to have at the time this research was conducted. Therefore, designing an
architecture that can process both visual and sequential information more efficiently is
needed.
29
(Simonyan & Zisserman, 2014)aim at extending deep ConvNets to action recognition in
video data. Their architecture was based on two separate recognition streams (spatial and
temporal) that are implemented as ConvNets and are then combined by late fusion. The
spatial stream performs action recognition from still video frames, whereas the temporal
stream is trained to recognize action from motion in the form of dense optical flow. They
stated that decoupling the spatial and temporal networks allowed them to use transfer
learning. The datasets used by the authors were UCF-101 and HMDB-51. They get an
accuracy of 88.0% on the UCF-101 dataset when the fusion is done by SVM and 59.4% on
the HMDB-51 dataset.
(Li et al., 2017) tried to recognize a 3D human action using an end-to-end deep CNN. They
proposed a representation of the dataset using skeleton sequence rather than RGB or depth. A
pre-trained VGG-19 model was used by the authors to perform the learning. They first
performed data pre-processing to normalize and filter the skeleton joints. Secondly, they
transformed a skeleton sequence with an arbitrary number of frames into RGB images.
Finally, they fed the generated images into VGG-19 (Simonyan & Zisserman, 2015) model,
which is a pre-trained model with ImageNet (Russakovsky et al., 2015) to perform fine-
tuning and action recognition. Finally, they performed end-to-end fine-tuning and perform the
learning process. For the hyperparameter set, they selected 40 for the size of mini-batches for
the SGD. In addition, the initial learning rate of 0.001, which was decreased by multiplying it
by 0.1 at every10,000th iteration was set. Moreover, 50 epochs were chosen by the authors to
stop theFine-tuning process. They tested their proposed method on the NTU RGB-D dataset,
which is at that time the largest skeleton-based human action recognition dataset, and
achieved82.1% for cross-view and 75.2% for cross-subject when evaluating. They also
experimented using AlexNet(Krizhevsky et al., 2012) as a pre-trained model but got lower
accuracy (72.96 % and 75.87%). We do not have a skeleton-based dataset that shows the
Ethiopian traditional dance dataset. Also, performing fine-tuning requires access to GPU and
needs a large amount of training time.
(Ijjina & Mohan, 2014) investigated the problem of human action recognition by proposing
2D CNN for human action recognition using action bank features. Their main motivation was
the use of a frequency spectrogram of audio data for speech recognition using CNN. As it is
stated in their paper, human action recognition in videos is a challenging task due to the
existence of time dimension, i.e. the length of a video capturing the same action may be
30
different. Consequently, they proposed an approach that uses action bank features. Action
bank features of a video capture the similarity information of the video against the videos in
an action bank. Thus, videos containing similar actions may contain similar patterns in
action-bank features. The action bank feature is given as an input to the pattern recognition
module for action recognition which is implemented using CNN. The classifier utilizesthis
similarity information to assign an action label to the input video. The use of the action bank
feature to represent video results in a fixed-size representation of videos irrespective of their
length. Therefore, deep neural network architecture can be trained to classify a video in a
single pass without using a temporal window on video frames. The authors consider a CNN
classifier to recognize and classify human action from the local pattern in the action bank
features due to the success of deep CNN architectures for recognizing local visual patterns.
UCF50 dataset is used by the authors and the CNN is trained using batch propagation
algorithm in batchmode. The performance of the CNN classifier during training was
evaluated after every 5 epochs on the test dataset. By considering 5 epochs as an iteration, the
authors train the CNN 200 times (1000 epochs) maintaining the CNN classifier with
minimum error at the end of an iteration. They evaluated the proposed solution using two
methods. The first one is 5-fold cross-validation and the second one is splitting the entire data
into 3 groups to perform 3-fold cross-validation. They achieved an accuracy of 94.02% using
5-fold and 93.76% using the 3-fold. The same as the above research(Simonyan & Zisserman,
2014).
The research work conducted by (Mulugeta, 2020) attempts to mergeaudio and visual
features to improve the performance of the classification model, using a pre-trained network
called VGG-16 to extract visual information and a custom pre-trained network to extract
audio features from the video. In the end, the performance of the classifier when using only
audio features the performance of the classifier was 85% whereas using only visual features
we were able to get an accuracy of 78%. But by adding an audio feature to the video-only
classifier, we were able to increase the accuracy by 7% units making the final performance of
the proposed system 85%(Mulugeta, 2020). However, when we come to the real world to
distinguish music videos visual features are more important because we have discussed the
music video clips. Video clips are visual by nature it incorporates dances and different
movements. The research stated that audio features are more powerful than visual features to
classify and categorize Ethiopian traditional music videos.
31
Table 1 Summary of related works
Authors Title Finding Limitations
(Tefera, 2016) Content- -Use HOGfor feature Frame-level feature

basedclassificationof extraction extraction is better to
Ethiopiannations'music -LIBSVM machine learning represent video. May
videoclips using SVM algorithm for classification use deep learning or
and 82.2 % accuracy CNN forbetter
classification
(Subramanian Content-based -To extract motion features - May use deep

& Mekonnen, classification of optical flow and HOG are learning or CNN for
2016) Ethiopian traditional used. better feature
dance videos using -SVM for classification extraction and
optical flow and classification
histogram of oriented - One-person features
gradient used
(Russo & Jo, Classification of sports -Combined CNN extracted - Transfer learning
2019) videos with a features with only works whenthe
combination of temporal information from initial and target
deep learning models RNN to formulate the problem models are
and transfer learning general quite similar.
model to solve the problem.
-Transfer learning is applied
with VGG 16 model, 94%
accuracy
(Simões et al., Movie genre -Using CNN for -The performance of
2016) classification with classification and feature the classification
CNN extraction, an accuracy of model is varied
73.45% dependingon the
genre
(Mulugeta, Automatic -Combine audio and visual -Not capable to

2020) classification of feature handle direction and
Ethiopian traditional - Use CNN &RNN motion features.
music using audio- - Accuracy 85%
visual
Table 2. 1
All of the above have their own contributions on video classification, but to be more effective
and efficient they need to be extract additional information from an image. Therefore, what
makes this study is different is it uses a combined of three related feature extraction
techniques HOG, Optical flow and CNN to extract information from images.
32
CHAPTER THREE: RESEARCH METHODOLOGY
3.1. Introduction
This chapter discusses detail techniques, tools, and methods that are used to design and model
the Ethiopian traditional music video representation and classification. Also, it explains the
detail of the components of the model and their algorithms.
The process of video classification consists of many steps, such as pre-processing,

segmentation, feature extraction, and classification. In this study, an attempt is made to adopt
video classification techniques for categorizing Ethiopian Nations’ music and dance from a
music video source.
3.2. Ethiopian Traditional music videos corpus

In music video classification using visual features, collecting music video clips and changing
tothe sequence of image frames allows us to represent the video clips. For this, a total of
8nationsand ethnic groups’music videos are selected. Since the research incorporate only
these 8 nations/ethnic groups’ music type and dance movements, due to the shortage of
availability of music videos. As shown in Table 3.1, 80 video clips are collected, from the
popular websites YouTube, Dire tube, SodereTube , AddisZefen with the help of music
experts to identify which video clips are manually labeled, and which music video is belongs
to which nation or ethnic group respectively.Also, music videos are converted into a suitable
format for segmentation. Then the video is segmented and changed intoasequence of image
frames. Different preprocessing techniques like person detection,and size normalization is
applied tothis sequence of image frames.
33
Table 2 Dataset of Ethiopian traditional music videos
S.No Typeof nations’ music video clip Size of Source of videoclip

collected video
1 Tigrignamusicvideo 20 YouTube, DireTube,

Addiszefen, SoderTube
2 Afar musicvideo 20 ”
3 Gojjammusicvideo 20 ”
4 Awimusicvideo 20 ”
5 Shoa oromomusicvideo 20 ”
6 Guragiemusicvideo 20 ”
7 Wolaytamusicvideo 20 ”
8 Benshangul Gumuzmusicvideo 20 ”
160
Total
3.3. Design Ethiopian Traditional music videos classification
Themodel is designed to classify Ethiopian traditional music videos based on visual features.
As discussed in section 1.6.4, we useda convolutional neural network for classification but
due to the availability of data, the dataset is too small. Therefore there would be overfitting.
To overcome the problem of small datasets or overfitting we use transfer learning, which
improves the model with low processing power and memory. The visual features are
extracted using a pre-trained network called VGG-16 prior to the model training and saved on
storage. Another hand the shape, and gradient orientation features are extracted using HOG,
motion/spatio temporal featuresare also extracted using Optical flow and concatenate the
feature vector of three separately extracted features then given to the convolution layer.
34
Preprocessing Frame feature
extraction
Video segmentation
CNN
Combined features
Input video Image frame extraction
HOG
Person detection
Optical
Gray scale conversion flow
Outpu Softmax
Figure 10 Proposed classification model for Ethiopian traditional music video
Following that, the extracted features will be fed to fully connected layers or dense layers.
The last step in the whole process is prediction. Prediction is performed by the last
component which is the prediction module. The concatenated features from the previous
process, and the output of the dense layer will be one of the 8 selected Ethiopian traditional
music (Gojjam, Awi, Afar, Shoa Oromo, Tigray, Guragegna, Wolayta, Benshagul).
An input to the CNN feature extracting module (VGG16) is 100 by 100 pixel RGB image and
the extracted feature output from this module is 3 by 3 by 512 3D tensor. VGG16 is a deep
learning architecture trained on ImageNet dataset. It is composed of 2- dimensional
convolution layer followed by 2D max-pooling layer. This module will extract all the visual
information found in each image and give it to the next layer to be aggregated to form the
feature of the video. The 3 by 3 by 512 tensor features extracted from the first module is
given to the image to video aggregating module. For this classification purpose, we
selected60 frames because we use 3-second video and an average of 20fps. Then it reshapes
the feature and outputs a feature of size 60 by 160 to be given to the next module.
3.4. Preprocessing
This section discusses the activity performed after collecting the music video to convert raw
data into a useful and acceptable form for a system.
3.4.1. Video segmentation

Video segmentation is the process of splitting the video into smaller video clips(Chenliang
Xu, et al, 2012). The reason for video segmentation is to identify and select the most
representative and descriptive video from the video because the full music video doesn’t
35
contain dance, it takes unnecessary time to process and sometimes a single image can
describe the dance of a nation. Frames are a sequence of images defined in terms of time.
Then videos are split based on frame rate. Frame rate is a parameter that is used to specify
the number of frames to be extracted from a video per second. The frame rate varies based
on the video format used(Chenliang Xu, et al, 2012). We segmented the music video clips
with 20 frames per second (fps) and we used VirtualDub free tool by specifying the frame
rate.
3.4.2. Image frame extraction

Imageframeextractionistheprocessofconvertingavideointosequenceofimageframes.Inthis
study, sincefeaturesareextractedfromthe sequenceofimages,imageframeextractionfromthe
segmented video allows extractingthe sequenceof imageframes thatcontains dancers. As
discussedabovethe frame rate for frame extraction isdonewith20fps.
3.4.3. Person detection

This is the most important stepindetecting visualfeaturesof videosequences. Person
detectionallows to identify theregionofinterest and enables to extract features from only the
detected area.The study deployed it since it helps to minimize the
computationalcostneededduringfeatureextractionanditcan
increasetheaccuracyoftherecognitionsystembyonlyidentifyingthemotionarea.Forthis(Dalal et
al., 2010)proposeaverysuccessful edge and gradient-based descriptor called Histogram of
OrientedGradient(HOG)fordetectingandlocating humansinstillimages.The object/person
detector detects peoplein an inputimageusingthe Histogram of Oriented Gradient (HOG)
features anda trained CNN classifier.
3.4.4. Gray scale conversion

A grayscale (gray level) image is simply having a color range of monochromatic shades
from black and white. At this step, the final sequence of frames is converted to grayscale
images. When images RGB color information is converted to black and white colors called
gray images. In fact a ‘gray’ color is one in which the red, green, and blue components all
have equal intensity in RGB space, and so it is only necessary to specify a single intensity
value for each pixel, as opposed to the three intensities needed to specify each pixel in a
full-color image.
The main reason for grayscale conversion is to simplify the algorithm and reduces
computational requirements. Indeed, this study is focused on the dance movement and
styles of nation, so that RGB color information doesn’t have an impact, which is why
36
grayscale conversion is applied. The conversion of a(Mulugeta, 2020)color image to a gray
scale is done using the pythonrgb2gray() function.
3.5. Video Feature Extraction

Feature extraction is a crucial part of the dimensionality reduction process, in which, the raw
data is divided or reduced to more manageable and descriptive representative groups in order
to make the process will be easier. The most important characteristic of feature extraction
helps to get basic features and variables and aggregate them to describe the actual data. In this
studythree feature extraction techniques, HOG/local feature extraction, Optical flow, and
CNN are employed to extract important features from the data set. Because Each feature
extraction techniques captures different patterns in the image, such as texture (HOG), shape
and spatial information (CNN), movement and direction (optical flow). The combination of
these features results in a more robust and discriminative representation of the image that is
able to handle various image variations.Then we combined/merged the features extracted
from both local feature extraction and CNN techniques.
3.5.1. HOG/Local feature extraction

As a feature extraction HOG is a commonly used technique in computer vision and image
processing for object detection. This feature descriptor technique counts the occurrences of
gradient orientation in localized portions of an image. HOG has advantages of edge and
gradient representation of local shape, and degree of invariance. HOG features are computed
using magnitude and orientation. Based on the accumulation of gradient directions over a
pixel of a small spatial region called a cell, and the image is divided into cells of size NxN
pixels. The orientation of all pixels is computed and accumulated in M-bins histogram of
orientation, and then the cell histograms are concatenated to construct the final features
vector. The orientation 𝜃𝑥,of the gradient in each pixel, is computed by the following
equation:(Mulugeta, 2020)
𝐿(𝑥, 𝑦 + 1) − 𝐿(𝑥, 𝑦 − 1)
𝜃 , = tan
𝐿(𝑥 + 1, 𝑦) − 𝐿(𝑥 − 1, 𝑦)
Equation 7 Orientation of gradient
3.5.2. Optical flow feature extraction

Optical flow is the motion of objects between every two consecutive frames of a sequence.
This motion is caused by the movement of the object that is being captured or by the camera
that is taking a picture of it. Consider an object whose intensity is I (x, y, t), which changes
over time to by dx and dy after dt. At this point, the new intensity is I (x+dx, y+dy, t+dt). In
37
video classification optical flow is used to estimate the direction and speed of the moving
objects in a video. The optical flow is estimated using the Farneback method, OpenCV
provides a function cv2.calcOpticalFlowFarneback to look into dense optical flow.
3.5.3. Deep feature extraction

In this thesis, to extract deep featureswe use CNN as a feature extraction method because of
its powerfulness in image processing. Images with size MxN are the inputs in CNN- based
feature extraction. This image passed through many layers the so-called hidden layers and by
applying operations like convolution, and pooling on the image we get useful information for
the recognition stage. We used therectified linear activation function (ReLU) a piecewise
linear because it is powerful and outputs the input directly. We also used maximum pooling
because it reduces the dimension of the feature vector and the noise on the images.
The first stage is providing the input image by specifying its height and width. The height and
width of the image are set to 64x64. Bicubic Interpolation image resampling technique is
selected for image resizing. Because it is effective in all applications of image processing.
This size is selected randomly and it is within the range that CNN performs best in the
literature which is 64 up to 360. The next stage in the network are convolution and pooling
layers. The network contains four convolution and four pooling layers. The kernel size of the
convolution and pooling layer is set to be three and two respectively. Finally, the feature
obtained from the last pooling layer is fed to flatten layer and this layer changes the feature
from multi-dimensional to one- dimensional vector.
3.5.4. Combine /Fusing feature

After extracting features by using a histogram of orientation gradients, optical
flowandconvolutional neural network, we combined both features to generate one
feature.Combining hog, cnn, and optical flow features for image classification is helpful
because the classification can recongnize different features from from the image, like texture,
shape, movement, and direction. Each features looks different things in the image. These
features make the image more distinct.
3.6. Video classification
Classification is the process of predicting the music video based on the feature that gives into
a system to learn. After extracting the features the system learned from the given extracted
features, this process enables the capability of the classification modelto distinguish different
38
nations’ music videos accordingly. However, there are many classification algorithms that
has been used for classification in previously worked research. In a previous study
(Subramanian & Mekonnen, 2016; Tefera, 2016), SVM was used to classify the music
video.In another study, CNN was used by Russo and Jo ( 2019) to classify sport videos, and
(Mulugeta, 2020; Simões et al., 2016)in their study also used CNN for classifying movie
genre and music videos respectively. This paper is used CNN as a classifier because CNN is a
remarkable, successful, and better classification algorithm(Debayle et al., 2018; Sharma et al.,
2018).At the end of the classification model, the splitting of data into training and validation
are performed. In the training phase, the model learn the feature of the training dataset, and in
the validation phase model learns the feature of the testing dataset.
39
CHAPTER FOUR: EXPERIMENT AND IMPLEMENTATION
4.1 Introduction
This chapter discusses the capability of the proposed model according to the research
problem and the research questions are answered. In the next section adetail of the dataset
used in a proposed model. In section 4.3,we describe the implementation and tools used in the
proposed model. 4.4,we present the test result of our model, we test CNN classifier with two
features (CNN end to end and the combined features ofHOG, Optical flow and CNN)
independently. In section 4.5 we compare the performance of CNN feature and combined
feature and also compare our model with the previous worked Ethiopian traditional music
video classification models.
4.2 Dataset
In this research work, we didn’t get readymade music video data. Therefore it requires
tolerance and working with others to collect the music video of each nation and label them
accordingly. Due to the limitation of the availability of music videos we use a total of 8
nation’s music videos. Since the research incorporate only these 8 nations/ethnic groups
music type and dance movements. As shown in Table 4.1, 160 video clips are collected, from
the popular websites Youtube, Dire tube, SodereTube , AddisZefen with the help of music
experts to identify which video clips are manually labeled, and which music video is belongs
to which nation respectively.Also, music videos are converted into a suitable format for
segmentation. Then the video is segmented and changed into a sequence of image frames.
Different preprocessing techniques like person detection, and size normalization is applied to
this sequence of image frames. We use 3 seconds of extracted video from each music video
and extract an average of 20fps, which means that 60 frames were used to represent each
video clips.
40
Table 3 dataset for Ethiopian music video classification
No Type of nations’ music video clip Number of video Number of frames

clips
1. Tigrigna music video 20 1200
2. Afar music video 20 1200
3. Gojjam music video 20 1200
4. Awi music video 20 1200
5. Shoa oromo music video 20 1200
6. Guragie music video 20 1200
7. Wolayta music video 20 1200
8. Benshangul Gumuz music video 20 1200
Total 160 9600
As we mentioned before wedid notusethewholevideo.Instead, weextracted3

secondsofvideoclipfromeachvideothat shows theEthiopiantraditionaldance.
4.3 Implementation
Experiments are done based on the modeldeveloped with Keras (TensorFlow as a backend)
on Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz, 1992 Mhz, 4 Core(s), 8 Logical
Processor(s), and 8 GB of RAM. The model is trained based on different parameters for the
classifier, the data is partitioned into training and testing datasets such that 80 percent of the
data is assigned for training the model and 20 percent of the data is allotted for testing.
4.3.1 Tools
In this research, different tools and techniques are employed in order to achieve research
objectives.
a) Python3:Theprogramminglanguagethatisusedtoimplementthemodels
inpython.Wedecidedtousepythonbecauseof therichness oflibrariesindata
manipulationandframeworks in the deeplearninganddataprocessingarea.
b) Keras 2.3.1:Itis adeeplearningframeworkoralibraryprovidinghigh-level
41
buildingblocksfordevelopingdeeplearningmodels.
c) Scikit-learn 0.22:Itisamachine-learninglibrary withvariousfeaturesand tools.

d) JupyterNotebook:Jupyternotebooksarea greatwaytorundeep-learning
experiments.Itallowsyou tobreakup alongexperimentintosmallerpiecesthat
canbeexecutedindependently whichmakesthedevelopmentinteractive.All the
experimentsin this researchwere runinJupyter.
e) OpenCV: It is a library to solve computer vision problems. We use the library to
read video data from the disk and dismantle it to the image pieces that constitute
the video.
f) Matplotlib 3.1.2: It is a python 2D plotting library.
4.4 Test result

This section is describing the performance evaluation of the proposed model. There are
different evaluationmatrices such as accuracy,precision,recall,F1score,
confusionmatrix,P/RAUC,macroaverage, weightedaverage, cross-entropy,Mattheus
correlation coefficient, and Cohen’s kappa. But in this paper, we used
accuracy,precision,recall,F1score, confusionmatrix,macroaverage,andweightedaverage,
because these matrices show how the proposed system performs the classification of videos
using test data.
In this section, three results showedthe accuracy of our model in the training and validation
phase using CNN classifiers. As we have discussed so far, our features are extracted by using
HOG, CNN, andOptical flow hence we have four features (i.e feature extracted by HOG,
feature extracted by CNN, feature extracted by Optical flow,and feature extracted by CNN,
Optical flow and HOG in combination). Therefore, each feature was tested by the classifier.
The data set is categorized into training and testing data: 80% of our data is for training and
the remaining 20% is for testing set.
4.4.1 Model test with CNN feature

An experiment is conducted using features extracted by CNN that have 241 dimensions and
8097 training samples used as input for CNN classifier and 2151 samples are used for testing
purposes. The average accuracy recorded was 83%.The classification result is shown below.
42
precision recall f1-score support
afar (Class 0) 0.73 0.96 0.83 196

agew (Class 1) 1.00 0.74 0.85 461
benishangul (Class 2) 0.65 1.00 0.79 128
gojjam (class 3) 0.87 0.88 0.87 192
guragie (class 4) 0.47 0.98 0.64 138
oromo (class 5) 0.95 0.70 0.81 460
tigray (class 6) 0.92 0.88 0.90 286
wolayita(class 7) 0.98 0.87 0.92 290
accuracy 0.83 2151

macro avg 0.82 0.88 0.83 2151
weighted avg 0.88 0.83 0.84 2151
Figure4.1 classification report of CNN features
afar (Class 0) [189 0 0 0 7 0 0 0]

agew (Class 1)[ 52 342 6 11 38 7 1 4]
benishangul (Class 2) [ 0 0 128 0 0 0 0 0]
gojjam (class 3) [ 10 0 8 168 1 5 0 0]
guragie (class 4) [ 3 0 0 0 135 0 0 0]
oromo (class 5) [ 0 0 46 8 60 323 22 1]
tigray (class 6) [ 5 0 0 7 15 5 253 1]
wolayita(class 7) [ 0 0 9 0 29 0 0 252]]
Figure4.2 confusion matrix of CNN features

The above confusion matrix result shows that there are perfectly classified and verified
classes. Of the 8 classes, ‘wolayita’ has 92% accuracy,and class ‘guragie, benishangul, and
‘oromo’ has 64%, 79%, and 81% accuracy respectively. Accordingly class of ‘guragie’ is has
least classification result because a sample from its class is miss classified in ‘Agew’ and
‘oromo’classes.
4.4.2 Model test with combinedHOG , Optical flow, and CNNfeatures

Features extracted by HOG and CNN was combined. The size of the feature vector of the
combined feature is 145. The model classified the combined feature using CNN and the test
result is clearly shown in the following classification report and confusion matrix. The
average accuracy from 20 iteration is 86%.
43
precision recall f1-score support
afar (Class 0) 0.86 0.78 0.82 284

agew (Class 1) 1.00 0.85 0.92 401
benishangul (Class 2) 0.66 0.91 0.77 144
gojjam (class 3) 0.94 0.90 0.92 202
guragie (class 4) 0.37 1.00 0.54 106
oromo (class 5) 0.97 0.88 0.92 375
tigray (class 6) 0.99 0.93 0.96 295
wolayita(class 7) 1.00 0.75 0.85 344
accuracy 0.86 2151

macro avg 0.85 0.88 0.84 2151
weighted avg 0.92 0.86 0.87 2151
Figure4.3classification report of combined features

Based on the above experiment result, our model has scored 86% accuracy, the unseen
features are properly extracted and data are correctly classified. The remaining 14% are
incorrectly classified. The result shows that 54% (benishangul) is the minimum and
96%(tigray) is maximum and correctly classified classes.
4.5 Discussion
The expermetal results showed, that the classification accuracy of the CNN model using CNN
features and a combination of CNN and HOG features is 83%and 86% respectively. As the
result showed the features extracted by CNN havea better recognition rate when we compare
them with features extracted by HOG. When theclassifieris trained on the combined feature,
the accuracy achieved is 86.00% which is higher than the result recorded from each feature
extraction technique independently. Because these algorithms have their own advantages.
When we compare perfectly classified and poorly classified results, The dance of ‘guragie’ is
more misclassified, most of its data are classified under ‘afar’ and ‘wolayita’ dance. These
three classes has similar dance movements, mainly performend with the combination of hand
and leg movements.
The performanceof theclassifier ischallengedbythe dancer’sclothing,women’shairstyle,body

size,posture anduse of traditionalobject,since we used powerful algorithms and their
combinations.
As (Alam et al., 2021) research work,a system designed and developed for the COVID-19
identification from X-ray images is effective, with high accuracy and minimum complexity by
44
combining the features extracted by histogram oriented gradient(HOG) features and
convolutional neural network(CNN).
CHAPTER FIVE: CONCLUSION AND FUTURE WORK

5.1 Conclusion
Now-adays, Music video clips areflooded on the internet daily, because it brings happiness,
joy, and wealth for both the providers, musicians, distributors, and even for the audions. Our
country Ethiopia has morethan 80 ethnic groups and each ethnics groups they have their own
belief, language, culture and music style. The difference between these ethnics groups music
is not only language, rhythm rather they have their own traditional dancing movements.
Therefore, to distinguish the music of one ethnic group from the others we can see their
dancing movements and how they perform.
Ethiopian music video clips are increasingly uploaded, shared, and accessible across the glob
over the internet. However, they require some points, like rhythm, text or visual dancing
movements to classify or categorize the music video clips belonging to each ethnic group.
The objective of this study is to build a classification model. For this classification purpose, it
uses the visual feature. For this study, we selected 8 ethnic groups and collected 10 music
videoclips from each ethnic group as a dataset. The video clips are segmented into smaller
video clips containing important part of the dance using a free VirtualDub tool. We change
segmented frames again changes to a sequence of the image at a frame rate of 60 sequences
of images.From these sequence of image frames dance movements and the dancer is detected
using HOG, Optical flow and CNN to extract the features and CNN for classification.
The model has been tested using the images which were captured for testing and training
purposes we use Python programing language. After we extracted the features of the images
the classification model is built using a CNN classifier. Experimental results show that the
developed system by using CNN classiﬁer using segmented images achieves good
classiﬁcation performance with classification accuracy,CNN features and a combination of
HOG and Optical flow features scored 83% and 86% accuracy respectively.
Finally, this work showed that it is possible to classify Ethiopian traditional music using a
combination of different feature extraction techniques and classifiers, to reduce the time taken
for extracting features manually.
45
5.2 The contribution of the study
 The contribution of this work is to show that different feature extraction techniques
can be combined for better classification results. We extracted visual features related
to the appearance of the dancers such as shape, orientation, outfit, and movement by
using both CNN and HOG. This makes the research is different in terms of feature
extraction technique.
 Finally, this work showed that it is possible to classify Ethiopian traditional music
video using a hybrid feature extraction algorithm.
5.3 Future work

We have achieved good result but it can be improved and developed better, so we recommend
that it be worked on the following points so that it can be better and more efficient.
 Using more representational data extraction techniques, specially to handle object

information which is used in music video
 Instead of optical flow using direct 2D/3D tracking techniques, because it can be more
accurate in capturing motions.
 Experiment with different parameters such as learning rate, batch size and number of
size
 Using diverse and wide range of data in terms of light, background, and poses
46
Reference
Al-Saffar, A. A. M., Tao, H., & Talab, M. A. (2017). Review of deep convolution neural
network in image classification. Proceeding - 2017 International Conference on Radar,
Antenna, Microwave, Electronics, and Telecommunications, ICRAMET 2017, 2018-
Janua, 26–31. https://doi.org/10.1109/ICRAMET.2017.8253139
Alam, N. A., Ahsan, M., Based, A., Haider, J., & Kowalski, M. (2021). COVID-19 Detection
from Chest X-ray Images Using Feature Fusion and Deep COVID ‐ 19 Detection from
Chest X ‐ Ray Images Using Feature Fusion and Deep Learning. February.
https://doi.org/10.3390/s21041480
Albawi, S., Mohammed, T. A., & Al-Zawi, S. (2018). Understanding of a convolutional

neural network. Proceedings of 2017 International Conference on Engineering and
Technology, ICET 2017, 2018-Janua(April 2018), 1–6.
https://doi.org/10.1109/ICEngTechnol.2017.8308186
Bhatt, N. (2015). A survey on video classification techniques. Journal of Emerging

Technologies and Innovative Research, 2(3), 607–610.
Bovik, A. (2005). Handbook of image & video procces. In Elsevier Academic Press, (Vol. 1,
Issue 4).
Chenliang Xu, Caiming Xiong, and Jason J. CorsoMitchell, J. C. (2012). Lecture Notes in
Computer Science.
Dalal, N., Triggs, B., Dalal, N., Triggs, B., Gradients, O., International, D., Dalal, N., &
Triggs, B. (2010). Histograms of Oriented Gradients for Human Detection To cite this
version : HAL Id : inria-00548512 Histograms of Oriented Gradients for Human
Detection.
Darji, M. C., & Mathpal, D. (2017). A Review of Video Classification Techniques.

International Research Journal of Engineering and Technology(IRJET), 4(6), 1888–
1891. https://irjet.net/archives/V4/i6/IRJET-V4I6593.pdf
Debayle, J., Hatami, N., & Gavet, Y. (2018). Classification of time-series images using deep
convolutional neural networks. 23. https://doi.org/10.1117/12.2309486
Dyer, S. A., & Harms, B. K. (1993). Digital Signal Processing. In Advances in Computers
(Vol. 37, Issue C). https://doi.org/10.1016/S0065-2458(08)60403-9
47
Ethiopia Population (2021) - Worldometer. (n.d.).
Getachew, F. (2015). Traditional music in Ethiopia.
Heaton, J. (2018). Ian Goodfellow , Yoshua Bengio , and Aaron Courville : Deep learning :
The MIT. Genetic Programming and Evolvable Machines, 19(1), 305–307.
https://doi.org/10.1007/s10710-017-9314-z
Igwenagu, C. (2017). Fundamentals of research methodology and data collection. June, 19–
21.
Ijjina, E. P., & Mohan, C. K. (2014). Human action recognition based on recognition of linear
patterns in action bank features using convolutional neural networks. Proceedings - 2014
13th International Conference on Machine Learning and Applications, ICMLA 2014,
178–182. https://doi.org/10.1109/ICMLA.2014.33
Introducing Ethiopian folk dance_ Mocha Ethiopia Dance Group (English site). (2002).
http://saba.air-nifty.com/mocha_ethiopia_dance_e/introducing-ethiopian-fol.html
Ji, S., Xu, W., Yang, M., & Yu, K. (2013). 3D Convolutional neural networks for human
action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence,
35(1), 221–231. https://doi.org/10.1109/TPAMI.2012.59
Jiawei HanUniversity of Illinois at Urbana–ChampaignMicheline KamberJian PeiSimon

Fraser University. (2012). Data Mining Data mining. In Mining of Massive Datasets
(Vol. 2, Issue January 2013).
https://www.cambridge.org/core/product/identifier/CBO9781139058452A007/type/book
_part
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Li, F. F. (2014). Large-
scale video classification with convolutional neural networks. Proceedings of the IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, 1725–1732.
https://doi.org/10.1109/CVPR.2014.223
Krizhevsky, B. A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep
Convolutional Neural Networks. Communications of the ACM, 60(6), 84–90.
Kumar, K. V. V., Kishore, P. V. V., & Anil Kumar, D. (2017). Indian Classical Dance
Classification with Adaboost Multiclass Classifier on Multifeature Fusion. Mathematical
Problems in Engineering, 2017. https://doi.org/10.1155/2017/6204742
48
Li, C., Sun, S., Min, X., Lin, W., Nie, B., & Zhang, X. (2017). End-to-end learning of deep
convolutional neural network for 3D human action recognition. 2017 IEEE International
Conference on Multimedia and Expo Workshops, ICMEW 2017, July, 609–612.
https://doi.org/10.1109/ICMEW.2017.8026281
Linner, S., Lischewski, M., & Richerzhagen, M. (2021). Introduction to the Basics. March.
Mulugeta, S. (2020). COLLEGE OF NATURAL AND COMPUTATIONAL SCIENCES

AUTOMATIC CLASSIFICATION OF ETHIOPIAN TRADITIONAL MUSIC USING
AUDIO-VISUAL Selam Mulugeta Alemseged A Thesis Submitted to the Department of
Computer Science in Partial Fulfilment of the Degree of Master of Sci. June.
O’Shea, K., & Nash, R. (2015). An Introduction to Convolutional Neural Networks.

December. http://arxiv.org/abs/1511.08458
Pati, R. N., Yousuf, S. B., & Kiros, A. (2015). Cultural Rights of Traditional Musicians in
Ethiopia: Threats and Challenges of Globalisation of Music Culture. International
Journal of Social Sciences and Management, 2(4), 315–326.
https://doi.org/10.3126/ijssm.v2i4.13620
Rachmadi, R. F., Uchimura, K., & Koutaki, G. (2017). Video classification using compacted
dataset based on selected keyframe. IEEE Region 10 Annual International Conference,
Proceedings/TENCON, 873–878. https://doi.org/10.1109/TENCON.2016.7848130
Rafi, Q. G., Noman, M., Prodhan, S. Z., & Alam, S. (2021). Comparative Analysis of Three
Improved Deep Learning Architectures for Music Genre Classification. April, 1–14.
https://doi.org/10.5815/ijitcs.2021.02.01
Ramesh, M. (2018). Significance of various Video Classification Techniques and Methods : A

Retrospective. 118(8), 523–526.
Ramesh, M., & Mahesh, K. (2019). A Preliminary Investigation on a Novel Approach for
Efficient and Effective Video Classification Model. March, 266–269.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A.,
Khosla, A., Bernstein, M., Berg, A. C., & Fei-Fei, L. (2015). ImageNet Large Scale
Visual Recognition Challenge. International Journal of Computer Vision, 115(3), 211–
252. https://doi.org/10.1007/s11263-015-0816-y
Russo, M. A., & Jo, K. (2019). Classification of sports videos with combination of deep
49
learning models and transfer learning. 2019 International Conference on Electrical,
Computer and Communication Engineering (ECCE), 1–5.
Saif, S., Tehseen, S., & Kausar, S. (2018). A survey of the techniques for the identification
and classification of human actions from visual data. Sensors (Switzerland), 18(11).
https://doi.org/10.3390/s18113979
Sharma, N., Jain, V., & Mishra, A. (2018). An Analysis of Convolutional Neural Networks
for Image Classification. Procedia Computer Science, 132(Iccids), 377–384.
https://doi.org/10.1016/j.procs.2018.05.198
Simões, G. S., Wehrmann, J., Barros, R. C., & Ruiz, D. D. (2016). Movie genre classification
with Convolutional Neural Networks. Proceedings of the International Joint Conference
on Neural Networks, 2016-Octob(July), 259–266.
https://doi.org/10.1109/IJCNN.2016.7727207
Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action
recognition in videos. Advances in Neural Information Processing Systems, 1(January),
568–576.
Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale
image recognition. 3rd International Conference on Learning Representations, ICLR
2015 - Conference Track Proceedings, 1–14.
Song, G., Wang, Z., Han, F., Ding, S., Song, G., Wang, Z., Han, F., Ding, S., Learning, T.,
Song, G., Wang, Z., Han, F., & Ding, S. (2018). Transfer Learning for Music Genre
Classification To cite this version : HAL Id : hal-01820925. 0–8.
Suarez, J. J. P., & Naval, P. C. (2020). A survey on deep learning techniques for video
anomaly detection. ArXiv, 20–25.
Subramanian, K., & Mekonnen, A. (2016). Content Based Classification of Ethiopian

Traditional Dance Videos Using Optical Flow and Histogram of Oriented Gradient.
International Journal of Innovative in Engineering and Technology, 6(3), 371–380.
Sun, J., Wang, J., & Yeh, T.-C. (2017). Video Understanding: From Video Classification to
Captioning. 1–9. http://cs231n.stanford.edu/reports/2017/pdfs/709.pdf
Tefera, M. G. (2016). Content-Based Classification of Ethiopian Nations Music Video Clip.
50
Throsby, D. (2014). The music industry in the new millennium : Global and local
perspectives THE MUSIC INDUSTRY IN THE NEW MILLENNIUM : Global and Local
Perspectives David Throsby ( Macquarie University , Sydney ). July.
Tigray Traditional Dance photo, Ethiopia Africa. (n.d.).
Traditional Dances – Afar Culture & Tourism Bureau. (n.d.).
Traditional Ethiopian Music and Ethiopian Culture – tigray. (n.d.).
Trask, A. W. (2019). Grokking Deep Learning MEAP V12.
51
Appendix A: Sample code of CNN end to end classififcation
52
53
54
Appendex B: sample code for conbined features of HOG, Optical flow
amd CNN
55
Appendix C: Sample sequential frames
Agew
Gojjam
Tigre
56
57

R16 BEST ETHIOPIA Birhanu Belay (Classification Model For Ethiopian Traditional Music Video Using CNN)

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

R16 BEST ETHIOPIA Birhanu Belay (Classification Model For Ethiopian Traditional Music Video Using CNN)

Uploaded by

Copyright:

Available Formats

DSpace Institution

DSpace Repository http://dspace.org

Classification Model for Ethiopian

Birhanu, Belay Moges

BAHIR DAR INSTITUTE OF TECHNOLOGY

SCHOOL OF RESEARCH AND POSTGRADUATE STUDIES

Classification Model for Ethiopian Traditional Music Video Using

Convolutional Neural Network

BAHIR DAR, ETHIOPIA

Convolutional Neural Network

Birhanu Belay Moges

Msc. degree in the Information Technology in the computing faculty

Seffi Gebeyehu (Asst.prof)

September 29, 2023

model we follow experimental research methods,for implementing an algorithm and empirically

a dataset.Convolutional Neural Network (CNN),Optical Flow and Histogram Orientation

good classification accuracy 86%.

1.1.1 Ethiopian traditional music and music videos

1.1.2 Video classification

1.2. Statement of the problem

Thework doneby(Subramanian & Mekonnen, 2016) aimed at classifying ten Ethiopian

This study attempts to answer the following research questions;

 What techniques shall be used to extract features from the video?

1.3. The objective of the study

1.3.1 General Objective

1.3.2 Specific Objectives

1.5. Significance of the study

1.6.1. Literature review

1.6.2. Data set preparation

1.6.3. Data preprocessing and Feature Extraction

1.6.4. Design and Evaluation

1.7. Organization of the study

2.2Ethiopian traditional music and video clips

2.2.1 Music dance of Tigray

2.2.2 Music dance of Afar

2.2.3 Music dance of Amhara (Gojjam)

Figure 2.2.3 Music dance of Gojjam

2.2.4 Music dance of Amhara (Agew-Awi)

2.2.5 Music dance of Oromo/Shoa/

Figure 2.2.5 Music dance of Oromo/shoa people

2.2.6 Music dance of SNNP/guragie/

Figure 2.2.6 Music dance of SNNP/Guragie people

2.2.7 Music dance of SNNP/wolayta/

Figure 2.2.7 Music dance of SNNP/wolayta people

2.2.8 Music dance of Benshangul Gumuz people

Figure 2.2.8 Music dance of Benshangul-Gumuz people

2.3 Image/Video Signal

Animportant feature of digitalimages andvideos is that they are multidimensional signals,

People start the research in imageprocessingby simulating humanvision to see,understand,

Ina real-world environment,due to the changes inillumination, the target movement

Segmentation means dividing an imageorvideo frame intoseveralspecific and uniqueproperty

Thepurpose of segmentation is to isolatea meaningful entity froman image orvideo sequence.

Structural segmentation methods are based on the characteristicsofthelocalareaof

Non-structural segmentation methodsincludestatisticalpattern recognition,

2.3.2 Image and video feature description

Aglobalfeatureiscalculatedfromallthepixelsof animage,anditdescribes theimage as

2.3.3 Object Recognition in image/video

The classifier can be dividedintothree categories:Generative/ProductiveModel

The generative model is also called the productivemodel,which triestoestimatethejoint

The discriminativemodel is also called Conditional Model, or Conditional Probability

Different from the traditional object recognition pipeline of “feature extractionclassification”,

Activationfunctionis afunction appliedtotheneuronsinalayerduringpredictionsuch

2.4.1Evaluatingdeepneural network models