BAHIR DAR UNIVERSITY
COMPUTING FACULTY
By
A thesis submitted to the School of Research and Graduate Studies of Bahir Dar
Institute of Technology, BDU in partial fulfillment of the requirements for the degree of
Advisor Name:
Bahir Dar, Ethiopia
Acknowledgments
My grateful thanks go to my advisor Seffi Gebeyehu for his valuable comments throughout
the preparation of this thesis. I would like to thank all participants for their genuine responses, and the offices,
schools, and friends who supported me in preparing this thesis. Finally, I would like to express
my heartfelt thanks to all my family for their love, support, and continuous encouragement.
Table of Contents
Acknowledgments .................................................................................................... I
Table of Contents ...................................................................................................... II
Abbreviations .......................................................................................................... V
List of Figures ........................................................................................................ VI
List of Tables ........................................................................................................VII
List of Equations ................................................................................................ VIII
Abstract .....................................................................................................................1
CHAPTER ONE: INTRODUCTION ....................................................................2
1.1. Background .....................................................................................................2
1.1.1 Ethiopian traditional music and music videos .......................................................... 2
1.1.2 Video classification ...................................................................................................... 3
1.2. Statement of the problem...............................................................................3
1.3. The objective of the study ..............................................................................6
1.3.1 General Objective ......................................................................................................... 6
1.3.2 Specific Objectives ........................................................................................................ 6
1.4. Scope of study..................................................................................................7
1.5. Significance of the study ................................................................................7
1.6. Methodology ....................................................................................................7
1.6.1. Literature review ............................................................................................. 8
1.6.2. Data set preparation ........................................................................................ 8
1.6.3. Data preprocessing and Feature Extraction .................................................. 8
1.6.4. Design and Evaluation ..................................................................................... 8
1.7. Organization of the study ..............................................................................9
CHAPTER TWO: LITERATURE REVIEW.....................................................10
2.1 Introduction ........................................................................................................ 10
2.2 Ethiopian traditional music and video clips...................................................... 10
2.2.1 Music dance of Tigray ......................................................................................... 10
2.2.2 Music dance of Afar .............................................................................................. 11
2.2.3 Music dance of Amhara (Gojjam) ....................................................................... 12
2.2.4 Music dance of Amhara (Agew-Awi) .................................................................. 12
2.2.5 Music dance of Oromo/Shoa/ ............................................................................... 13
2.2.6 Music dance of SNNP/guragie/ ............................................................................ 13
2.2.7 Music dance of SNNP/wolayta/ ............................................................................ 14
2.2.8 Music dance of Benshangul Gumuz people ........................................................ 14
2.3 Image/Video Signal ............................................................................................ 15
2.3.1 Image and video scene segmentation ............................................................. 17
2.3.2 Image and video feature description ......................................................................... 18
2.3.3 Object Recognition in image/video ........................................................................... 19
2.4 Deep learning ....................................................................................... 20
2.4.1 Evaluating deep neural network models .................................................................. 23
2.4.2 Convolutional Neural Network .................................................................. 25
2.4.3 Action Classification.................................................................................... 26
2.5 Related works ..................................................................................................... 27
2.5.1 Video Classification Using Machine Learning ......................................................... 27
2.5.2 Video Classification Using Deep Learning ............................................................... 29
CHAPTER THREE: RESEARCH METHODOLOGY ....................................33
3.1. Introduction.................................................................................................... 33
3.2. Ethiopian Traditional music videos corpus ................................................. 33
3.3. Design Ethiopian Traditional music videos classification ........................... 34
3.4. Preprocessing ................................................................................................. 35
3.4.1. Video segmentation ............................................................................................... 35
3.4.2. Image frame extraction ........................................................................................ 36
3.4.3. Person detection .................................................................................................... 36
3.4.4. Gray scale conversion ........................................................................................... 36
3.5. Video Feature Extraction .............................................................................. 37
3.5.1. HOG/Local feature extraction ............................................................................. 37
3.5.2. Optical flow feature extraction ............................................................................ 37
3.5.3. Deep feature extraction ........................................................................................ 38
3.5.4. Combine /Fusing feature ...................................................................................... 38
3.6. Video classification ........................................................................................ 38
CHAPTER FOUR: EXPERIMENT AND IMPLEMENTATION ...................40
4.1 Introduction.................................................................................................... 40
4.2 Dataset ............................................................................................................ 40
4.3 Implementation .............................................................................................. 41
4.3.1 Tools ............................................................................................................................. 41
4.4 Test result ....................................................................................................... 42
4.4.1 Model test with CNN feature ............................................................................... 42
4.4.2 Model test with combined HOG, Optical flow, and CNN features ................ 43
4.5 Discussion ....................................................................................................... 44
CHAPTER FIVE: CONCLUSION AND FUTURE WORK ............................45
5.1 Conclusion ...................................................................................................... 45
5.2 The contribution of the study ........................................................................ 46
5.3 Future work .................................................................................................... 46
Reference.................................................................................................................47
Appendix A: Sample code of CNN end-to-end classification ...........52
Appendix B: Sample code for combined features of HOG, Optical flow and CNN ........................................................55
Appendix C: Sample sequential frames ...............................................................56
Abbreviations
ANN Artificial Neural Network
AI Artificial Intelligence
CNN Convolutional Neural Network
DFT Discrete Fourier Transform
DSP Digital Signal Processing
ELU Exponential Linear Unit
FN False Negative
FP False Positive
GPU Graphical Processing Unit
GRU Gated Recurrent Unit
HOF Histogram of Optical Flow
HOG Histogram of Oriented Gradients
KNN K Nearest Neighbor
LBP Local Binary Patterns
LIBSVM Library for Support Vector Machine
LSTM Long Short-Term Memory
MATLAB Matrix Laboratory
NB Naïve Bayes
ReLU Rectified Linear Unit
RGB Red Green Blue
RNN Recurrent Neural Network
SGD Stochastic Gradient Descent
SIFT Scale Invariant Feature Transform
STFT Short-Time Fourier Transform
STIP Space-Time Interest Point
SVM Support Vector Machine
TN True Negative
TP True Positive
List of Figures
Figure 2.2.1 Music dance of Tigray people (Tigray Traditional Dance Photo, Ethiopia Africa, n.d.) ............... 11
Figure 2.2.2 Music dance of Afar people (Traditional Dances – Afar Culture & Tourism Bureau, n.d.) ....... 12
Figure 2.2.3 Music dance of Gojjam ........................................................................................................................ 12
Figure 2.2.4 Music dance of Agew-Awi people........................................................................................................ 13
Figure 2.2.5 Music dance of Oromo/shoa people .................................................................................................... 13
Figure 2.2.6 Music dance of SNNP/Guragie people ................................................................................................ 14
Figure 2.2.7 Music dance of SNNP/wolayta people................................................................................................. 14
Figure 2.2.8 Music dance of Benshangul-Gumuz people ....................................................................................... 15
Figure 2.9 Convolutional kernel (Trask, 2019) ........................................................................................................ 26
Figure 3.3 Proposed classification model for Ethiopian traditional music video ................................................. 35
List of Tables
Table 1 Summary of related works .......................................................................................................................... 32
Table 2 Dataset of Ethiopian traditional music videos ........................................................................................... 34
Table 3 dataset for Ethiopian music video classification ........................................................................................ 41
List of Equations
Equation 1 Accuracy ................................................................................................................................................. 23
Equation 2 Precision .................................................................................................................................................. 24
Equation 3 Recall ....................................................................................................................................................... 24
Equation 4 F1 Score .................................................................................................................................................. 24
Equation 5 Macro Average .......................................................................................................................................24
Equation 6 Weighted Average .................................................................................................................................. 25
Equation 7 Orientation of gradient .......................................................................................................................... 37
Abstract
In Ethiopia, there are several nations and ethnic groups, each with its own language, culture,
and customs. They play their traditional music and perform their own dance styles, and this
traditional music is recorded as video and uploaded to the internet for further use. But
accessing traditional music videos on the internet is tiresome for the user. Therefore, to retrieve
those uploaded music videos easily, a video classification model needs to be developed. From
this perspective, this study aims to develop a classification model for Ethiopian traditional
music videos to overcome this inconvenience. To build this model, an experimental methodology
is followed, implementing algorithms and empirically evaluating their efficiency and
effectiveness, and a purposive sampling technique is employed to collect typical traditional
music videos that describe all the nations included in the study. From the selected 8 nations and
ethnic groups, 80 video clips were collected as a sample from different online video-sharing
platforms, and we obtained 10,248 sequences of images. Optical flow and Histogram of Oriented
Gradients (HOG) are used for feature extraction, and a Convolutional Neural Network (CNN)
is used as a classifier, implemented with open-source Python tools. Finally, the system has been
tested using sequences of images collected for training and testing purposes. Experimental results
show that when we use only the CNN end-to-end classifier, the classification accuracy is 83%,
and when we use the combination of HOG and optical flow with CNN features, the classifier
achieves a better result.
CHAPTER ONE: INTRODUCTION
1.1. Background
Nowadays, entertainment industries have a big role in the political, economic, and social affairs
of countries. Especially in the globalized world, every country has the willingness to spread
and plant its interests in the minds of others using different forms of entertainment. Music is
one element of the entertainment industries, and people listen to and watch music videos to
entertain and enjoy themselves. In this respect, music videos are used as a tool to promote and
create in-depth images of cultures and customs for the rest of the world over the
Internet (Throsby, 2014).
The music tradition of Ethiopia is closely associated with folk dance, which adds to the artistic
value of the presentation. The music performance involves a group of villagers singing and
dancing and moving from one village to another, promoting the sustenance of cultural heritage
and strengthening community cohesiveness. Besides, the use of costumes, masks, and musical
instruments in ceremonies and rituals promotes an aura of sacredness. The participation of the
audience in music and dance performances, in terms of hand-clapping, finger-popping,
foot-tapping, and vocal prompting, is a manifestation of appreciation, approval, and motivation
for the performing artists. The cultural dimensions of traditional music and dance among
indigenous communities of Ethiopia are complex and interlinked with the great tradition of the
country (Pati et al., 2015).
In recent years, a large number of our country's music videos have been recorded, produced, and
uploaded to the internet through different service providers, so these music videos are
available online and everybody can access and watch them instantly. However, due to their
similarity and large numbers, it is difficult for individual users and music experts to
identify and classify which music belongs to which nations and nationalities across the country,
and they spend much time finding what interests them.
and Shoa are located within one specified region, and their ethnicity and languages are the same,
but their traditional dancing styles and choreography are quite different.
The search mechanism for videos still depends on content tags (Tefera, 2016). Moreover,
current Ethiopian traditional music video classification is done by experts by watching and
listening, which is complicated, inefficient, time-consuming, and tiresome.
So, to overcome the aforementioned problems, a video classification model is needed, as also
supported by the studies of (Bhatt, 2015; Ramesh & Mahesh, 2019; Sun et al., 2017).
(Tefera, 2016) researched the classification of Ethiopian traditional dance using visual and audio
features. In his research, he considered 10 traditional styles of songs, namely Eskista (only
the Gondar type), Gurage, Asosa, Somali, Oromo, Afar, Hararghe, Tigrinya, Wolaita,
and Gambella. He applied HOG on each frame to detect the region of interest, i.e., the dance
performer. Then he extracted local features related to body shape, using the Log-Gabor filter.
As a feature selection technique, the author used Principal Component Analysis, and finally,
the selected features were used directly by the classifier. The learning algorithm used to build the
classifier model was SVM. The performance of the classifier when using only visual features was
82.59%, and when he combined visual and audio features the classifier scored 86.2%. In this
research, the selected features focused on shape, and we believe that if other features, such as the
movement of the dancers, were integrated, the performance of the classifier would have
increased. Besides, the author was very selective when choosing video clips included in
the dataset, such as their background, and we believe the performance of the model would be very low on
real-world data that includes different types of background.
When using SVM, the maximum accuracy the authors got is 99.7% and it is achieved when the Lucas-Kanade
method is used. The maximum accuracy they acquired when ANN is used is 82.0%, and it is again achieved when the
Lucas-Kanade method is used. Even though these two works include motion information in the
feature extraction step, they were very selective when choosing data (with no background
movement), making their model perform poorly in a real environment. Also, they trained and tested
their system with only one person as the dance performer, resulting in a dataset that did not
represent Ethiopian traditional dance, which is mostly performed in groups.
The research work conducted by (Mulugeta, 2020) attempts to merge audio and visual features to
improve the performance of the classification model, using a pre-trained network called VGG-16
to extract visual information and a custom pre-trained network to extract audio features from the
video. In the end, the classifier results showed that the performance with only audio features was 85%
and with only visual features was 78%. The researcher noted that combining audio and
video features could increase the cumulative result of the proposed
classifier up to 85% (Mulugeta, 2020). Hence, that study focused on classification by
combining traditional music audio and video features.
Still, when we come to the real world, visual features are more important for distinguishing
music videos, because a video clip is visual art by nature: it incorporates dances and different
movements. But as mentioned above, (Mulugeta, 2020) stated that audio features are more
powerful than visual features for classifying and categorizing Ethiopian traditional music videos;
in that work, visual features alone were not efficient for classifying music video clips. The
previously discussed researchers attempted to extract visual features using the deep learning
pre-trained architecture VGG16. Video clips incorporate different dance movements; Ethiopian
traditional music videos in particular are full of dance movements and styles, and these dance
movements obviously differ from one another according to their culture. The performers or
dancers use different body parts: one nation may mainly use a movement of the shoulder, waist,
hand, neck, or leg, or a combination of two or more parts at a time. Therefore, to classify these
different national music video clips, we have to extract all visual features accordingly. The
previous research papers used different feature extraction techniques to draw features from the
input, and classifiers chosen for their advantages, and obtained different performance results,
but they are still not efficient. The first paper used the HOG feature extraction technique; this
technique can represent the shape and orientation features of an image by measuring gradient
orientations, but it cannot extract motion. The second paper used optical flow estimation for
motion estimation and HOG for feature extraction. The third paper used the deep learning
pre-trained architecture VGG16 to extract visual features. VGG16 is a convolutional network
that is used for classification. CNNs (Convolutional Neural Networks) are highly effective in
analyzing and extracting features from image and video data, but there is a gap in certain types
of features that they may not be able to extract, such as orientation and spatiotemporal (human
motion) features.
To review previous research works focused on video classification using different techniques.
To collect and preprocess Ethiopian traditional music videos from different sources.
To identify techniques or tools for feature extraction from music videos.
To build a classification model using a CNN.
To select an optimal classification algorithm among the known classifiers.
To evaluate the performance of the classification model using the testing dataset.
1.4. Scope of study
In this research, dataset preparation, preprocessing, feature extraction, and classification
techniques have been explored to design a classification model for Ethiopian traditional music
videos using a CNN (Convolutional Neural Network). In video classification, features are drawn
from three modalities: text, audio, and visual. However, for this study, we considered
visual features because, compared with the others, humans perceive more information from visual
features (Darji & Mathpal, 2017).
Due to the limited availability of traditional music videos, the study covers only 8 ethnic groups/nations
selected from 6 Ethiopian regions. The selected regions are Afar, Tigray, Amhara, Oromia,
Benishangul Gumuz, and the Southern Nations and Nationalities. From each region, one or two nations are
selected for data collection and implementation purposes.
1.6. Methodology
Methodology is a set of systematic approaches and techniques used in research as a
roadmap that answers the question of how the research was conducted. This
study follows an experimental research methodology. Experimental research is the process of
implementing an algorithm and empirically evaluating the efficiency and effectiveness of those
algorithms to solve real-world problems (Igwenagu, 2017).
Python is used in this study due to its open-source packages; it is a fast, general-purpose programming language and it
can be run on all platforms (Linner et al., 2021). To evaluate the performance, or the efficiency and
effectiveness, of the model, we use Precision, Recall, and Specificity in predicting the outcome of
new observations and test data that have not been used to train the model.
CHAPTER TWO: LITERATURE REVIEW
2.1 Introduction
This chapter gives general information and background about the research topic and
previous research works. It includes the attempts of different researchers to solve the problem,
the techniques and algorithms they use, and the procedures they follow.
In recent years, the internet has been creating good opportunities for music industries to earn
much more money than before. Therefore, a vast variety of music videos and movies are
uploaded every day and are available on the internet, accessible across the globe through
different agents or providers (Throsby, 2014). But individual users cannot easily get
music videos as per their request and preference, due to the variety and volume of
music videos.
Currently, across the world, searching for videos depends on content tags and text-based user
annotation, but this is inefficient (Tefera, 2016). Moreover, current Ethiopian traditional music
video classification is done by experts by listening to and watching the music videos. This
study therefore allows classifying a given video automatically based on its visual features.
Nowadays, Ethiopian traditional music artists are flourishing, and the music video clips have
dramatically increased, with millions of views. For this research work, we have selected 8
traditional music video styles from 6 regions, based on the availability of music videos. The
selected categories are the music of Tigray, Awi Agew, Gojjam, Shoa Oromo, Gurage,
Wolayta, Benshangul Gumuz, and Afar.
groups live there. The Tigrayans have traditional music dance styles which are
performed in groups and feature smooth circular dance movements incorporated with
shoulder, neck, and jumping movements (Traditional Ethiopian Music and Ethiopian Culture –
Tigray, n.d.). This circular dance is commonly known as kuda. In their music, it is very
common to hear traditional Ethiopian instruments such as the masinqo, kerar, and kebero.
Figure 2.2.1 Music dance of Tigray people (Tigray Traditional Dance Photo, Ethiopia Africa, n.d.)
Afar people have different kinds of traditional dances for various ceremonies like weddings,
rituals, and holidays. The dances are performed together as well as in specific age and sex
groups.
The traditional dance of the Afar people is characterized by hand and leg movements. During the
dance, the men wear traditional cloth ("Kutta"), tie on a knife ("gorade"), raise their hands and
wave them from left to right in the air while jumping up at the same time.
Their dancing style is mainly jumping into the air while pointing their hands upward
(Traditional Dances – Afar Culture & Tourism Bureau, n.d.). A commonly known
dance style is the 'Lale', played by both sexes on any occasion.
Figure 2.2.2 Music dance of Afar people (Traditional Dances – Afar Culture & Tourism Bureau, n.d.)
Figure 2.2.4 Music dance of Agew-Awi people
and the "Wolayta" dance, which is known as "Wolaytigna". In this study, both the Goragigna and
Wolaytigna dances are investigated.
Gurage people are known as hard workers throughout the country, and their dancing
style reflects such an image of hard-working people. The dance style is characterized by foot
motifs that go with whole-body movement. This dance is performed in two parts: the first
part is short and in slow motion. The second part of the dance might be performed repeatedly,
three or four times with short intervals, depending on the dancer's interest and
energy (Subramanian & Mekonnen, 2016).
by the up-and-down movement of the whole body with the rhythm of the music. Among
several traditional dances available in the region, "Akusha"
is one of the most predominant dance styles. Men's and women's dance styles are almost identical
and are performed in a group.
Images are signals with special characteristics. First, they are a measure of a parameter over
space (distance), while most signals are a measure of a parameter over time, such as audio.
Second, they contain a great deal of information. For example, more than 10 megabytes can be
required to store one second of television video. This is more than a thousand times greater than
for a similar-length voice signal. Third, the final quality is often a subjective human evaluation,
rather than an objective criterion. These special characteristics have made image processing a
distinct subgroup within DSP (Dyer & Harms, 1993).
Video is the dynamic form of static images. In other words, video is composed of a series of
static images in a certain order, and each image is called a 'frame' (Bovik, 2005). Video
refers to pictorial (visual) information, including still images and time-varying images. A time-
varying image is such that the spatial intensity pattern changes with time. Hence, a time-
varying image is a spatiotemporal intensity pattern, denoted by $s_c(x_1, x_2, t)$, where $x_1$ and $x_2$
are the spatial variables and $t$ is a temporal variable. It is formed by projecting a time-varying
three-dimensional spatial scene onto the two-dimensional image plane. The temporal
variations in the 3-D scene are usually due to the movement of objects in the scene. Thus, time-
varying images reflect a projection of a 3-D moving object onto the 2-D image plane as a
function of time. Digital video corresponds to a spatiotemporally sampled version of this time-
varying image (Bovik, 2005; Mulugeta, 2020).
2.3.1 Image and video scene segmentation
We are only interested in regions of inputs that are called the target or foreground, which
generally correspond to regions with certain specific or unique properties in the image or video frame.
On the other side, a region in which we are not interested is called the background.
Image segmentation can be built on two basic concepts: similarity and discontinuity. The
similarity of pixels means that pixels of the image in a certain region have some similar
characteristics, such as pixel grayscale, or the texture formed by the arrangement of pixels.
Discontinuity refers to the discontinuity of some features of pixels, such as a
mutation of the grayscale value, a mutation of color, or a mutation of texture structure. Image
segmentation is generally achieved by considering the image color, grayscale, edges, texture,
and other spatial information. Currently, image segmentation algorithms can be divided into
two categories: structural segmentation and non-structural segmentation.
Feature extraction is a process of measuring the intrinsic, essential, and important
features or attributes of the object and quantizing the result, or decomposing and symbolizing the
object to form a feature vector, symbol string, or relational map (Mulugeta, 2020).
In general, feature descriptors refer to a series of symbols that are used for describing the
characteristics of an image or video object. A good descriptor should be insensitive to the
target's scale, rotation, translation, etc. Feature descriptors generally fall into two main
groups: global and local.
Smallest Univalue Segment Assimilating Nucleus corner detector (SUSAN), Scale Invariant Feature Transform (SIFT), Difference
of Gaussian (DOG) operator, Histogram of Oriented Gradients (HOG),
Speeded-Up Robust Features (SURF), Maximally Stable Extremal Regions (MSER), and Local
Binary Patterns (LBP) (Mulugeta, 2020).
Deep learning has made enormous success in different areas, such as computer vision, speech
recognition, and natural language processing. Inspired by this success, many researchers are
moving from using traditional (shallow) machine learning algorithms to using deep learning
to solve problems in music information retrieval and human activity recognition. When
designing a model using these traditional algorithms, the performance of the model heavily depends on
the feature engineering task rather than on the classifier itself (Rafi et al., 2021). However,
a deep neural network automatically and effectively learns advanced features layer by layer
instead of requiring time-consuming feature engineering (Song et al., 2018), which has
led to success on many problems in different areas. The idea of training data using
neural networks has been around for a long time, but it has become popular due to several
reasons, such as the availability of large amounts of data and advances in hardware
technologies such as the Graphical Processing Unit (GPU).
Modern deep learning involves a number of successive layers of representations, and they are
all learned automatically from the input training data. Meanwhile, other approaches to
machine learning tend to focus on learning only one or two layers of representations of the
data. In deep learning, these layered representations are learned via models called neural
networks. The term neural network is a reference to neurobiology, but although some of the
central concepts in deep learning were developed in part by drawing inspiration from our
understanding of the brain, deep-learning models are not models of the brain (Ketkar, 2017).
A simple definition of a neural network is one or more weights that are multiplied by input data
to make a prediction (Trask, 2019). The steps involved in deep learning are: predict, compare,
and learn.
Prediction is what the neural network will tell you given input data; for example, given the
temperature, how likely it is to rain. But the prediction is not always right. Therefore, it
is compared with the true value, and the weights are adjusted. Comparing gives a measurement of
how far the predictions missed. How to measure error well is one of the most important
and complicated subjects of deep learning. One simple way of measuring error is the mean
squared error (MSE). Learning tells each weight how it can change to reduce the error. It is
all about error attribution, or the art of figuring out how each weight played its part in creating
the error. It is the blame game of deep learning: gradient descent (Trask, 2019). At the end of the
day, learning is about one thing: adjusting the weights either up or down so the error is
reduced. Therefore, we can say learning in a neural network is a search problem. We are
searching for the best possible configuration of weights so the network's error falls to 0 and it
predicts perfectly.
The specification of what a layer does to its input data is stored in the layer's weights. In
technical terms, we would say that the transformation implemented by a layer is parameterized
by its weights. In this context, learning means finding a set of values for the weights of all
layers in a network such that the network will correctly map example inputs to their
associated targets. A deep neural network can contain tens of millions of parameters, and we
have to find the correct value for all of them.
To control the output of the neural network, how far the output is from what is expected
needs to be measured. This is the job of the loss function. The loss function takes the
predictions of the network and the true target (what we want the network to output) and
computes a distance score (loss score), capturing how well the network has done on this
specific example. This score is used as a feedback signal to adjust the values of the weights a
little, in a direction that will lower the loss score. This adjustment is the job of the optimizer,
which is implemented by what is called the backpropagation algorithm.
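As a concrete illustration, the following minimal sketch wires a loss function and an optimizer together. Keras is an assumption here (the thesis states only that Python is used), and the tiny network is illustrative, not code from this study.

```python
# Hedged sketch: connecting a loss function (the feedback signal) to an
# optimizer that adjusts the weights via backpropagation. Illustrative only.
from tensorflow.keras import layers, models

net = models.Sequential([layers.Input(shape=(4,)), layers.Dense(1)])
# The loss scores the distance between predictions and targets; the
# optimizer uses that score to move the weights in a loss-lowering direction.
net.compile(optimizer="sgd", loss="mean_squared_error")
```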
In the learning process, what the neural network does is search for a correlation between the
input layer and the output layer of the training data. If the input data do not correlate with the
output, we create an intermediate layer (dataset) that does correlate with the output layer. The
goal is to train the network so that even though there is no direct correlation between the input and
output datasets, the layer (dataset) inserted in the middle (which is called the hidden layer) will
correlate with the output layer (Mulugeta, 2020).
If a neural network is trained by simply adding hidden layers as they are, it will not converge. The
solution is to make the hidden layers sometimes correlate and sometimes not correlate. This is
called conditional correlation, or sometimes correlation. This gives the hidden layer a correlation of its own.
Sometimes correlation is achieved by turning off nodes in the hidden layers based on a
condition, for example, whenever the value is negative (ReLU, or Rectified Linear Unit). This logic is
called nonlinearity. Without this tweak, the neural network is linear (Mulugeta, 2020).
There are three ways in which we can apply gradient descent to the network (Trask, 2019): full,
batch, and stochastic gradient descent. Stochastic gradient descent updates
weights one example at a time (the idea of learning one example at a
time). It performs a prediction and weight update for each training example separately.
It iterates through the entire dataset many times until it can find a weight configuration that works
well for all the training examples. (Full) gradient descent updates weights one dataset at a
time. Instead of updating the weights once for each training example, the network calculates
the average weight delta over the entire dataset, changing the weights only each time it
computes a full average. Batch gradient descent updates weights after n examples. Instead of
updating the weights after just one example or after the entire dataset of examples, a batch size
of examples is chosen (typically between 8 and 256), after which the weights are updated. We
can use this method to increase the speed of training and the rate of convergence.
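The three variants can be contrasted in a few lines of code. The sketch below is illustrative only (it is not from the thesis): a single linear layer trained with mean squared error, where only the update interval distinguishes stochastic, full, and batch gradient descent.

```python
# Hedged sketch of the three gradient descent variants described above.
import numpy as np

def train(X, y, mode="batch", batch_size=32, lr=0.01, epochs=10):
    """Fit weights w so that X @ w approximates y."""
    w = np.zeros(X.shape[1])
    n = len(X)
    if mode == "stochastic":
        step = 1            # update after every single example
    elif mode == "full":
        step = n            # one update per full pass over the dataset
    else:
        step = batch_size   # update after each mini-batch of examples
    for _ in range(epochs):
        for i in range(0, n, step):
            Xb, yb = X[i:i + step], y[i:i + step]
            grad = Xb.T @ (Xb @ w - yb) / len(Xb)  # gradient of the MSE
            w -= lr * grad                         # adjust weights to reduce error
    return w

# All three variants recover roughly the same weights on a toy problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=256)
for mode in ("stochastic", "full", "batch"):
    print(mode, train(X, y, mode=mode))
```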
Deep learning offers better performance on many problems. Also, deep learning makes
problem-solving much easier by completely automating what used to be the most crucial step
in a machine-learning workflow: feature engineering. Humans had to go to great lengths to
make the initial input data more amenable to processing by these
methods: they had to manually engineer good layers of representations for their data. This
is called feature engineering. Deep learning, on the other hand, completely automates this step.
b) Precision: can be thought of as a measure of exactness. In other words, it is the
percentage of test data labeled as positive that is actually positive (Jiawei Han, Micheline Kamber, 2012):

$$\text{precision} = \frac{TP}{TP + FP}$$

Equation 2 Precision
d) F1 Score (harmonic mean of precision and recall): is a measurement that combines
precision and recall into one. By doing that, it gives equal weight to precision and
recall (Jiawei Han, Micheline Kamber, 2012). It is defined as follows:

$$F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

Equation 4 F1 Score
e) Confusion Matrix: is a table of at least size m by m. An entry $CM_{ij}$ in the first m
rows and m columns indicates the number of test rows of class i that were labeled by the classifier
as class j. For the classifier to have good accuracy, ideally most of
the tuples would be represented along the diagonal of the confusion matrix, from
$CM_{1,1}$ to $CM_{m,m}$, with the rest of the entries being zero. That is, ideally FP and FN are around zero (Jiawei
Han, Micheline Kamber, 2012).
f) P/R AUC (area under the precision-recall curve): since the measure of accuracy is not
enough to truly evaluate a classifier, we use an alternative measurement called the area under the
precision-recall curve. The precision-recall curve summarizes the trade-off
between the TP rate and the positive predictive value for the prediction. The area under
the curve summarizes the integral, or an approximation, of the area under the precision-recall curve.
g) Macro Average: computes the average by giving each class the same
weight and dividing by the total number of classes (Jiawei Han, Micheline Kamber, 2012). It is
calculated as follows:

$$\text{macro average} = \frac{\sum_i c_i}{n}$$

Equation 5 Macro Average
h) Weighted Average: computes the average by multiplying each class score by
its weight (the number of samples in each class) and dividing by the total number of samples (Jiawei Han, Micheline
Kamber, 2012). It is calculated as follows:

$$\text{weighted average} = \frac{\sum_i (c_i \times w_i)}{\sum_i w_i}$$

Equation 6 Weighted Average
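In practice, these measures do not have to be computed by hand. The following hedged sketch uses scikit-learn (an assumption; the thesis does not name its metrics library) to compute the confusion matrix and the macro and weighted averages defined in Equations 5 and 6.

```python
# Illustrative labels only. "macro" weights every class equally, while
# "weighted" scales each class score by its number of samples (its support).
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score)

y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 1, 2, 1, 1, 0, 2, 0]

print(confusion_matrix(y_true, y_pred))  # ideally diagonal-heavy
for avg in ("macro", "weighted"):
    print(avg,
          precision_score(y_true, y_pred, average=avg, zero_division=0),
          recall_score(y_true, y_pred, average=avg, zero_division=0),
          f1_score(y_true, y_pred, average=avg, zero_division=0))
```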
2.4.2 Convolutional Neural Network
Overfitting is often caused by having more parameters than necessary to learn a specific
dataset. When neural networks have lots of parameters but not many training examples,
overfitting is difficult to avoid (Trask, 2019). Regularization is a means of countering
overfitting, but it is not the only technique to prevent overfitting.
As mentioned before, overfitting is concerned with the ratio between the number of weights
in the model and the number of data points it has to learn those weights from. Thus, there is a
better method to counter overfitting. When possible, it is preferable to use something loosely
defined as structure. Structure is selectively choosing to reuse weights for multiple
purposes in a neural network because the same pattern needs to be detected in multiple
places (Trask, 2019).
Normally, removing parameters makes the model less expressive (less able to learn
patterns), but if we are clever in where we reuse weights, the model can be equally
expressive but more robust to overfitting. Perhaps surprisingly, this technique also tends
to make the model smaller (because there are fewer actual parameters to store). The most
famous and widely used structure in neural networks is called convolution, and when used
as a layer it is called a convolutional layer (Trask,
2019). Convolutional networks are simply neural networks that use convolution in place of general
matrix multiplication in at least one of their layers (Heaton, 2018).
In the convolution layer, lots of very small linear layers are reused in every position, instead
of a single big one. The core idea behind a convolution layer is that instead of having a large,
dense linear layer with a connection from every input to every output, lots of very small
linear layers are used, usually with fewer than 25 inputs and a single output, at every
input position. Every mini-layer is called a convolutional kernel, but it is nothing
more than a baby linear layer with a small number of inputs and a single output.
Figure 2.9 Convolutional kernel (Trask, 2019)
As shown in Figure 2.9, the single 3x3 convolutional kernel makes a prediction at its current location, moves
one pixel to the right, then predicts again, moves another pixel to the right, and so on. Once it
has scanned across the image, it moves down a single pixel and scans back to the
left, repeating until it has predicted at every possible position within the image. The result is
a smaller square of kernel predictions, which are used as input to the next layer.
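The scanning behavior can be made concrete with a few lines of NumPy. The sketch is illustrative only (it is not code from the thesis): one 3x3 kernel producing a smaller grid of predictions.

```python
# A single 3x3 kernel sliding over an 8x8 image one pixel at a time.
import numpy as np

image = np.random.rand(8, 8)   # stand-in for a grayscale input
kernel = np.random.rand(3, 3)  # one small linear layer ("baby" linear layer)

out = np.empty((6, 6))         # 8 - 3 + 1 = 6 valid positions per axis
for i in range(6):
    for j in range(6):
        # the mini-layer's prediction at this position: a weighted sum
        out[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)
print(out.shape)  # the smaller square of kernel predictions
```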
A typical layer of a convolutional network consists of three stages (Heaton, 2018). In the first
stage, the layer performs several convolutions in parallel to produce a set of linear activations.
In the second stage, each linear activation is run through a nonlinear activation function, such
as the rectified linear activation function. In the third stage, a pooling function is used to
modify the output of the layer further. The pooling function replaces the output of the
network at a certain location with a summary statistic of the nearby output. For example, the
max-pooling operation reports the maximum output within a rectangular neighborhood.
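In Keras (an assumption; the framework is not named at this point in the text), one such layer with all three stages might be sketched as follows:

```python
# Hedged sketch of a typical convolutional layer: convolution, then a
# nonlinear activation (ReLU), then a max-pooling summary stage.
from tensorflow.keras import layers, models

stage = models.Sequential([
    layers.Input(shape=(64, 64, 1)),
    layers.Conv2D(32, kernel_size=3, padding="same"),  # stage 1: convolutions
    layers.Activation("relu"),                         # stage 2: nonlinearity
    layers.MaxPooling2D(pool_size=2),                  # stage 3: max pooling
])
stage.summary()
```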
The fundamental difference between a densely connected layer and a convolution layer is that the
dense layer learns global patterns (patterns involving all pixels) in its input features, whereas
convolution layers learn local patterns (patterns found within the kernel window). This key characteristic
gives convolutional networks two interesting properties (Ketkar, 2017).
2.4.3 Action Classification
Action classification in video needs to solve two problems: the first is extracting the
appearance information of the moving human body, and the second is modeling the
temporal dynamics of the action (Suarez & Naval, 2020). CNNs can capture complex spatial
appearance information from an image. However, using a CNN for video classification will
cause the loss of temporal context information in the video sequence. Therefore, to model
the temporal context information, deep learning models have been categorized into
four groups (Suarez & Naval, 2020): action recognition based on 3D CNNs (Ji et al., 2013), action
recognition based on spatio-temporal two-stream networks (Simonyan & Zisserman, 2014),
action recognition based on RNNs, and action recognition based on skeleton sequences (Li et
al., 2017).
2.5 Related works
In this section, we consider previous work on traditional music video
classification, framed as human action recognition (action classification) problems and trends.
The Adaboost can effectively recover the query video frames from the dataset by a shape-
texture observation model defined by the discrete wavelet transform (DWT) and local binary
patterns (LBP). The early and late feature and classifier performance tests show that the
proposed late fusion features and multiclass multilabel Adaboost classifier give better
classification accuracy and speed compared to AGM and SVM. The authors conducted a total
of 8 experiments.
The first 4 experiments were for early fusion and the last 4 were for late fusion.
Of the 4 experiments, two were for online videos and the other two were for offline
videos. For the offline videos, the accuracy ranges from 0.74 (early fusion) to 0.99 (late
fusion), and for the online videos, the accuracy ranges from 0.66 (early fusion) to 0.99 (late
fusion). In this research, the authors used only shape features, which cannot be used directly by
Ethiopian music classifiers. Adding additional information to the classifier, such as the
movement of the dancers, would give more clues when it comes to Ethiopian dance movements.
Also, extracting features manually is not the ideal solution nowadays in image processing
problems.
The work done by (Subramanian & Mekonnen, 2016) aims at classifying ten Ethiopian
traditional dances, namely: Afar, Benshangul, Gambella, Eskista, Gurage, Hararghe, Oromo,
Somali, Tigrinya, and Wolayta. They used optical flow estimation to estimate the motion of the
dancers as a way of feature extraction, and HOG to detect the region of interest. As optical
flow estimation methods, the authors use the Horn-Schunck and Lucas-Kanade optical flow
algorithms. The two works are the same, but their difference lies in the learning algorithm used:
one used SVM and the other used ANN. When using SVM, the
maximum accuracy the authors got is 99.7% and it is achieved when the Lucas-Kanade method
is used. The maximum accuracy they acquired when ANN is used is 82.0%, and it is again
achieved when the Lucas-Kanade method is used. Even though these two works include the
motion information in the feature extraction step, they were very selective when choosing
data (with no background movement), making their model perform poorly in a real
environment. Also, they trained and tested their system with only one person as the dance
performer, resulting in a dataset that did not represent Ethiopian traditional dance, which
is mostly performed in groups.
(Simonyan & Zisserman, 2014) aim at extending deep ConvNets to action recognition in
video data. Their architecture was based on two separate recognition streams (spatial and
temporal) that are implemented as ConvNets and are then combined by late fusion. The
spatial stream performs action recognition from still video frames, whereas the temporal
stream is trained to recognize action from motion in the form of dense optical flow. They
stated that decoupling the spatial and temporal networks allowed them to use transfer
learning. The datasets used by the authors were UCF-101 and HMDB-51. They achieved an
accuracy of 88.0% on the UCF-101 dataset when the fusion was done by SVM, and 59.4% on
the HMDB-51 dataset.
(Li et al., 2017) tried to recognize 3D human actions using an end-to-end deep CNN. They
proposed a representation of the dataset using skeleton sequences rather than RGB or depth. A
pre-trained VGG-19 model was used by the authors to perform the learning. They first
performed data pre-processing to normalize and filter the skeleton joints. Secondly, they
transformed a skeleton sequence with an arbitrary number of frames into RGB images.
They then fed the generated images into the VGG-19 (Simonyan & Zisserman, 2015) model,
which is pre-trained on ImageNet (Russakovsky et al., 2015), to perform fine-
tuning and action recognition. Finally, they performed end-to-end fine-tuning and carried out the
learning process. For the hyperparameters, they selected a mini-batch size of 40 for
SGD. In addition, an initial learning rate of 0.001 was set, decreased by multiplying it
by 0.1 at every 10,000th iteration. Moreover, 50 epochs were chosen by the authors to
stop the fine-tuning process. They tested their proposed method on the NTU RGB-D dataset,
which was at that time the largest skeleton-based human action recognition dataset, and
achieved 82.1% for cross-view and 75.2% for cross-subject evaluation. They also
experimented using AlexNet (Krizhevsky et al., 2012) as a pre-trained model but got lower
accuracy (72.96% and 75.87%). We do not have a skeleton-based dataset for
Ethiopian traditional dance. Also, performing fine-tuning requires access to a GPU and
needs a large amount of training time.
(Ijjina & Mohan, 2014) investigated the problem of human action recognition by proposing a
2D CNN for human action recognition using action bank features. Their main motivation was
the use of frequency spectrograms of audio data for speech recognition using CNNs. As
stated in their paper, human action recognition in videos is a challenging task due to the
existence of the time dimension, i.e., the length of videos capturing the same action may be
different. Consequently, they proposed an approach that uses action bank features. Action
bank features of a video capture the similarity information of the video against the videos in
an action bank. Thus, videos containing similar actions may contain similar patterns in
action bank features. The action bank feature is given as an input to the pattern recognition
module for action recognition, which is implemented using a CNN. The classifier utilizes this
similarity information to assign an action label to the input video. The use of the action bank
feature to represent videos results in a fixed-size representation of videos irrespective of their
length. Therefore, the deep neural network architecture can be trained to classify a video in a
single pass without using a temporal window on video frames. The authors consider a CNN
classifier to recognize and classify human action from the local patterns in the action bank
features due to the success of deep CNN architectures for recognizing local visual patterns.
The UCF50 dataset is used by the authors, and the CNN is trained using the backpropagation
algorithm in batch mode. The performance of the CNN classifier during training was
evaluated after every 5 epochs on the test dataset. By considering 5 epochs as an iteration, the
authors trained the CNN 200 times (1000 epochs), keeping the CNN classifier with the
minimum error at the end of an iteration. They evaluated the proposed solution using two
methods: the first is 5-fold cross-validation, and the second is splitting the entire data
into 3 groups to perform 3-fold cross-validation. They achieved an accuracy of 94.02% using
5-fold and 93.76% using 3-fold cross-validation. The same limitation as in the above research
(Simonyan & Zisserman, 2014) applies.
The research work conducted by (Mulugeta, 2020) attempts to merge audio and visual
features to improve the performance of the classification model, using a pre-trained network
called VGG-16 to extract visual information and a custom pre-trained network to extract
audio features from the video. In the end, the performance of the classifier when using only
audio features was 85%, whereas using only visual features it achieved an accuracy of 78%.
By adding audio features to the video-only classifier, the accuracy increased by 7
percentage points, making the final performance of the proposed system 85% (Mulugeta, 2020).
However, when we come to the real world, visual features are more important for
distinguishing music videos, as discussed above: video clips are visual by nature and
incorporate dances and different movements. That research stated that audio features are more
powerful than visual features for classifying and categorizing Ethiopian traditional music videos.
Table 1 Summary of related works

(Russo & Jo, 2019): Classification of sports videos with a combination of deep learning models and transfer learning. Approach: combined CNN-extracted features with temporal information from an RNN to formulate a general model; transfer learning applied with the VGG 16 model, reaching 94% accuracy. Limitation: transfer learning only works when the initial and target models are quite similar.

(Simões et al., 2016): Movie genre classification with CNN. Approach: using CNN for classification and feature extraction, an accuracy of 73.45%. Limitation: the performance of the classification model varies depending on the genre.
All of the above works have their own contributions to video classification, but to be more effective
and efficient they need to extract additional information from the images. Therefore, what
makes this study different is that it uses a combination of three related feature extraction
techniques, HOG, optical flow, and CNN, to extract information from images.
CHAPTER THREE: RESEARCH METHODOLOGY
3.1. Introduction
This chapter discusses in detail the techniques, tools, and methods that are used to design and model
the Ethiopian traditional music video representation and classification. It also explains the
details of the components of the model and their algorithms.
Table 2 Dataset of Ethiopian traditional music videos

No.   Category                        Number of video clips
1     Tigray music video              20
2     Afar music video                20
3     Gojjam music video              20
4     Awi music video                 20
5     Shoa Oromo music video          20
6     Guragie music video             20
7     Wolayta music video             20
8     Benshangul Gumuz music video    20
      Total                           160
3.3. Design Ethiopian Traditional music videos classification
The model is designed to classify Ethiopian traditional music videos based on visual features.
As discussed in section 1.6.4, we used a convolutional neural network for classification, but
due to the limited availability of data, the dataset is too small, so there would be overfitting.
To overcome the problem of small datasets, or overfitting, we use transfer learning, which
improves the model with low processing power and memory. The visual features are
extracted using a pre-trained network called VGG-16 prior to the model training and saved to
storage. On the other hand, the shape and gradient orientation features are extracted using HOG,
and motion/spatio-temporal features are extracted using optical flow; the feature vectors of the
three separately extracted features are then concatenated and given to the convolution layer.
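As an illustration of the motion branch, the sketch below computes dense optical flow between two consecutive grayscale frames with OpenCV's Farneback method. This is a hedged example: the thesis does not name the exact optical flow algorithm used at this stage, so both the method and its parameters are assumptions.

```python
import cv2
import numpy as np

prev = (np.random.rand(64, 64) * 255).astype(np.uint8)  # placeholder frame t-1
curr = (np.random.rand(64, 64) * 255).astype(np.uint8)  # placeholder frame t

flow = cv2.calcOpticalFlowFarneback(prev, curr, None, pyr_scale=0.5, levels=3,
                                    winsize=15, iterations=3, poly_n=5,
                                    poly_sigma=1.2, flags=0)
# Convert the (dx, dy) field to magnitude/angle and flatten it into a motion
# feature vector that can be concatenated with the HOG and CNN features.
magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
motion_feature = np.concatenate([magnitude.ravel(), angle.ravel()])
```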
[Figure 3.3: Proposed classification model for Ethiopian traditional music video. The pipeline runs: input video → preprocessing (video segmentation, image frame extraction, person detection, grayscale conversion) → frame feature extraction (CNN, HOG, optical flow) → combined features → softmax → output.]
Following that, the extracted features will be fed to fully connected (dense) layers.
The last step in the whole process is prediction, performed by the last component,
the prediction module. It takes the concatenated features from the previous
process, and the output of the dense layer will be one of the 8 selected Ethiopian traditional
music classes (Gojjam, Awi, Afar, Shoa Oromo, Tigray, Guragegna, Wolayta, Benshangul).
The input to the CNN feature extraction module (VGG16) is a 100 by 100 pixel RGB image, and
the extracted feature output from this module is a 3 by 3 by 512 3D tensor. VGG16 is a deep
learning architecture trained on the ImageNet dataset. It is composed of 2-dimensional
convolution layers followed by 2D max-pooling layers. This module extracts all the visual
information found in each image and gives it to the next layer to be aggregated to form the
feature of the video. The 3 by 3 by 512 tensor features extracted from the first module are
given to the image-to-video aggregating module. For this classification purpose, we
selected 60 frames, because we use 3-second videos at an average of 20 fps. The module then
reshapes the features and outputs a feature of size 60 by 160 to be given to the next module.
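A minimal sketch of this feature extraction step is shown below, assuming Keras/TensorFlow (which distributes the ImageNet-pretrained VGG16 weights); the random frames stand in for real preprocessed video frames.

```python
# Extract 3x3x512 deep features from 100x100 RGB frames with a frozen VGG16.
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input

base = VGG16(weights="imagenet", include_top=False, input_shape=(100, 100, 3))
base.trainable = False  # transfer learning: keep the pre-trained weights fixed

frames = np.random.rand(60, 100, 100, 3) * 255.0  # placeholder for 60 frames
features = base.predict(preprocess_input(frames))
print(features.shape)  # (60, 3, 3, 512): one 3x3x512 tensor per frame
```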
3.4. Preprocessing
This section discusses the activity performed after collecting the music video to convert raw
data into a useful and acceptable form for a system.
$$\theta(x, y) = \tan^{-1}\left(\frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}\right)$$

Equation 7 Orientation of gradient
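For illustration, the hedged sketch below computes a HOG descriptor with scikit-image; the implementation and its cell/block parameters are assumptions, since the thesis states only that HOG is used. Internally, HOG bins pixels by the gradient orientation θ(x, y) of Equation 7.

```python
import numpy as np
from skimage.feature import hog

frame = np.random.rand(64, 64)  # placeholder for a preprocessed grayscale frame
hog_vector = hog(frame, orientations=9, pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2), feature_vector=True)
print(hog_vector.shape)  # a 1D descriptor ready to be fused with other features
```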
The first stage is providing the input image by specifying its height and width; both are set to 64x64 pixels. The Bicubic Interpolation resampling technique is selected for image resizing because it is effective across image-processing applications. This size was chosen within the 64 to 360 pixel range at which CNNs are reported to perform best in the literature. The next stages in the network are the convolution and pooling layers: the network contains four convolution and four pooling layers, with kernel sizes of three and two respectively. Finally, the feature obtained from the last pooling layer is fed to a flatten layer, which changes the feature from a multi-dimensional to a one-dimensional vector. A sketch of this layout is given below.
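A sketch of this preprocessing and network layout, assuming Keras and OpenCV; only the kernel sizes and layer counts are specified above, so the filter counts (32/64/128/128) and 'same' padding are illustrative assumptions:

import cv2
from tensorflow.keras import layers, models

def resize_frame(frame):
    # Bicubic interpolation resampling to the 64x64 input size
    return cv2.resize(frame, (64, 64), interpolation=cv2.INTER_CUBIC)

model = models.Sequential([
    layers.Input(shape=(64, 64, 3)),
    layers.Conv2D(32, 3, activation="relu", padding="same"),   # 3x3 kernels
    layers.MaxPooling2D(2),                                    # 2x2 pooling
    layers.Conv2D(64, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(2),
    layers.Conv2D(128, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(2),
    layers.Conv2D(128, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(2),
    layers.Flatten(),   # 4x4x128 feature map -> one-dimensional vector
])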
Classification is the process of predicting the music video class from the features given to the system to learn. After the features are extracted, the system learns from them; this gives the classification model the capability to distinguish the music videos of different nations. Many classification algorithms have been used in previous research: SVM was used to classify music videos (Subramanian & Mekonnen, 2016; Tefera, 2016), CNN was used by Russo and Jo (2019) to classify sport videos, and CNN was also used for classifying movie genres (Simões et al., 2016) and music videos (Mulugeta, 2020). This work uses CNN as the classifier because CNN is a remarkably successful classification algorithm (Debayle et al., 2018; Sharma et al., 2018). Before training the classification model, the data is split into training and validation sets. In the training phase the model learns the features of the training dataset, and in the validation phase the model is evaluated on data it has not seen.
CHAPTER FOUR: EXPERIMENT AND IMPLEMENTATION
4.1 Introduction
This chapter discusses the capability of the proposed model with respect to the research problem, and the research questions are answered. The next section gives a detailed description of the dataset used in the proposed model. In section 4.3 we describe the implementation and the tools used. In section 4.4 we present the test results of our model: we test the CNN classifier with two feature sets (CNN end-to-end, and the combined features of HOG, Optical flow, and CNN) independently. In section 4.5 we compare the performance of the CNN features and the combined features, and also compare our model with previously developed Ethiopian traditional music video classification models.
4.2 Dataset
In this research work no ready-made music video dataset was available, so collecting and labeling the music videos of each nation required patience and collaboration with others. Due to the limited availability of music videos, we use the music videos of a total of 8 nations/ethnic groups; the research therefore incorporates only the music types and dance movements of these 8 groups. As shown in Table 3, 160 video clips were collected from the popular websites YouTube, Dire Tube, SodereTube, and AddisZefen, with the help of music experts who manually identified and labeled which music video belongs to which nation. The music videos were then converted into a format suitable for segmentation, and each video was segmented and changed into a sequence of image frames. Different preprocessing techniques such as person detection and size normalization are applied to this sequence of image frames. We use 3 seconds of extracted video from each music video at an average of 20 fps, which means that 60 frames were used to represent each video clip, as sketched below.
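A sketch of this frame-extraction step, assuming OpenCV (extract_frames is an illustrative helper, not the thesis's code):

import cv2

def extract_frames(clip_path, n_frames=60):
    """Read up to 60 frames from one 3-second segmented clip (~20 fps)."""
    cap = cv2.VideoCapture(clip_path)
    frames = []
    while len(frames) < n_frames:
        ok, frame = cap.read()
        if not ok:          # clip shorter than expected
            break
        frames.append(frame)
    cap.release()
    return frames           # ~20 fps x 3 s = 60 image frames per clip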
Table 3 Dataset for Ethiopian music video classification [table body not recoverable from the source]
4.3 Implementation
Experiments are done with a model developed in Keras (with TensorFlow as the backend) on an Intel(R) Core(TM) i7-8565U CPU @ 1.80 GHz (1992 MHz, 4 cores, 8 logical processors) with 8 GB of RAM. The model is trained with different parameters for the classifier, and the data is partitioned into training and testing datasets such that 80 percent of the data is assigned for training the model and 20 percent is allotted for testing, as sketched below.
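A minimal sketch of this 80/20 partition and training run, assuming scikit-learn and that X (per-clip feature arrays), y (nation labels), and model come from the preceding steps; the epoch and batch settings are illustrative:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

history = model.fit(X_train, y_train,
                    validation_data=(X_test, y_test),
                    epochs=50, batch_size=16)   # illustrative settings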
4.3.1 Tools
In this research, different tools and techniques are employed in order to achieve research
objectives.
a) Python 3: The programming language used to implement the models. We decided to use Python because of the richness of its libraries for data manipulation and its frameworks in the deep learning and data processing area.
b) Keras 2.3.1: A deep learning framework (library) providing high-level building blocks for developing deep learning models.
4.4 Test Results
[Tables: per-class precision, recall, f1-score, and support for the CNN-only features and for the combined HOG, Optical flow, and CNN features; table bodies not recoverable from the source.]
4.5 Discussion
The experimental results showed that the classification accuracy of the CNN model is 83% using the CNN features alone and 86% using the combined features, respectively. As the results show, the features extracted by CNN have a better recognition rate than the features extracted by HOG. When the classifier is trained on the combined features, the achieved accuracy of 86.00% is higher than the result recorded from each feature extraction technique independently, because these algorithms have complementary advantages. Comparing perfectly classified and poorly classified results, the 'Guragie' dance is misclassified most often, with most of its data classified as the 'Afar' and 'Wolayta' dances; these three classes have similar dance movements, mainly performed with a combination of hand and leg movements. A sketch of how this per-class confusion can be inspected follows.
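A short sketch, assuming scikit-learn and integer-encoded labels, of how such per-class confusion can be inspected (X_test, y_test, and model are assumed from section 4.3):

import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

y_pred = np.argmax(model.predict(X_test), axis=1)
print(confusion_matrix(y_test, y_pred))       # rows: true class, columns: predicted
print(classification_report(y_test, y_pred))  # per-class precision/recall/f1/support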
Similarly, in the research work of Alam et al. (2021), a system designed and developed for COVID-19 identification from X-ray images achieved high accuracy with minimal complexity by combining features extracted with the histogram of oriented gradients (HOG) and a convolutional neural network (CNN).
CHAPTER FIVE: CONCLUSION
5.1 Conclusion
Ethiopian music video clips are increasingly uploaded, shared, and accessed across the globe over the internet. However, classifying or categorizing the music video clips belonging to each ethnic group requires cues such as rhythm, text, or the visual dancing movements. The objective of this study is to build a classification model based on visual features. For this study we selected 8 ethnic groups and collected 10 music videos from each ethnic group as a dataset. The videos were segmented into smaller clips containing the important parts of the dance using the free VirtualDub tool, and each segmented clip was then changed into a sequence of 60 image frames. From these image frames the dancer and the dance movements are detected, HOG, Optical flow, and CNN are used to extract the features, and CNN is used for classification.
The model has been tested using the images captured for testing and training purposes, implemented in the Python programming language. After extracting the features of the images, the classification model was built using a CNN classifier. Experimental results show that the developed system achieves good classification performance on the segmented images: the CNN features alone scored 83% accuracy, and the combined HOG, Optical flow, and CNN features scored 86%.
Finally, this work showed that it is possible to classify Ethiopian traditional music using a combination of different feature extraction techniques and classifiers, reducing the time taken to extract features manually.
5.2 The contribution of the study
The contribution of this work is to show that different feature extraction techniques can be combined for better classification results. We extracted visual features related to the appearance of the dancers, such as shape, orientation, outfit, and movement, using CNN, HOG, and Optical flow. This makes the research different in terms of its feature extraction technique.
Finally, this work showed that it is possible to classify Ethiopian traditional music videos using a hybrid feature extraction algorithm.
References
Al-Saffar, A. A. M., Tao, H., & Talab, M. A. (2017). Review of deep convolution neural
network in image classification. Proceeding - 2017 International Conference on Radar,
Antenna, Microwave, Electronics, and Telecommunications, ICRAMET 2017, 2018-
Janua, 26–31. https://doi.org/10.1109/ICRAMET.2017.8253139
Alam, N. A., Ahsan, M., Based, A., Haider, J., & Kowalski, M. (2021). COVID-19 detection from chest X-ray images using feature fusion and deep learning. Sensors, 21(4), 1480. https://doi.org/10.3390/s21041480
Bovik, A. (2005). Handbook of image and video processing (Vol. 1, Issue 4). Elsevier Academic Press.
Xu, C., Xiong, C., & Corso, J. J. (2012). Lecture Notes in Computer Science.
Dalal, N., & Triggs, B. (2010). Histograms of oriented gradients for human detection. HAL Id: inria-00548512.
Debayle, J., Hatami, N., & Gavet, Y. (2018). Classification of time-series images using deep
convolutional neural networks. 23. https://doi.org/10.1117/12.2309486
Dyer, S. A., & Harms, B. K. (1993). Digital Signal Processing. In Advances in Computers
(Vol. 37, Issue C). https://doi.org/10.1016/S0065-2458(08)60403-9
Ethiopia Population (2021) - Worldometer. (n.d.).
Heaton, J. (2018). Ian Goodfellow, Yoshua Bengio, and Aaron Courville: Deep learning: The MIT Press. Genetic Programming and Evolvable Machines, 19(1), 305–307. https://doi.org/10.1007/s10710-017-9314-z
Igwenagu, C. (2017). Fundamentals of research methodology and data collection. June, 19–
21.
Ijjina, E. P., & Mohan, C. K. (2014). Human action recognition based on recognition of linear
patterns in action bank features using convolutional neural networks. Proceedings - 2014
13th International Conference on Machine Learning and Applications, ICMLA 2014,
178–182. https://doi.org/10.1109/ICMLA.2014.33
Introducing Ethiopian folk dance_ Mocha Ethiopia Dance Group (English site). (2002).
http://saba.air-nifty.com/mocha_ethiopia_dance_e/introducing-ethiopian-fol.html
Ji, S., Xu, W., Yang, M., & Yu, K. (2013). 3D Convolutional neural networks for human
action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence,
35(1), 221–231. https://doi.org/10.1109/TPAMI.2012.59
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Li, F. F. (2014). Large-
scale video classification with convolutional neural networks. Proceedings of the IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, 1725–1732.
https://doi.org/10.1109/CVPR.2014.223
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90.
Kumar, K. V. V., Kishore, P. V. V., & Anil Kumar, D. (2017). Indian Classical Dance
Classification with Adaboost Multiclass Classifier on Multifeature Fusion. Mathematical
Problems in Engineering, 2017. https://doi.org/10.1155/2017/6204742
Li, C., Sun, S., Min, X., Lin, W., Nie, B., & Zhang, X. (2017). End-to-end learning of deep
convolutional neural network for 3D human action recognition. 2017 IEEE International
Conference on Multimedia and Expo Workshops, ICMEW 2017, July, 609–612.
https://doi.org/10.1109/ICMEW.2017.8026281
Linner, S., Lischewski, M., & Richerzhagen, M. (2021). Introduction to the Basics. March.
Pati, R. N., Yousuf, S. B., & Kiros, A. (2015). Cultural Rights of Traditional Musicians in
Ethiopia: Threats and Challenges of Globalisation of Music Culture. International
Journal of Social Sciences and Management, 2(4), 315–326.
https://doi.org/10.3126/ijssm.v2i4.13620
Rachmadi, R. F., Uchimura, K., & Koutaki, G. (2017). Video classification using compacted
dataset based on selected keyframe. IEEE Region 10 Annual International Conference,
Proceedings/TENCON, 873–878. https://doi.org/10.1109/TENCON.2016.7848130
Rafi, Q. G., Noman, M., Prodhan, S. Z., & Alam, S. (2021). Comparative Analysis of Three
Improved Deep Learning Architectures for Music Genre Classification. April, 1–14.
https://doi.org/10.5815/ijitcs.2021.02.01
Ramesh, M., & Mahesh, K. (2019). A Preliminary Investigation on a Novel Approach for
Efficient and Effective Video Classification Model. March, 266–269.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A.,
Khosla, A., Bernstein, M., Berg, A. C., & Fei-Fei, L. (2015). ImageNet Large Scale
Visual Recognition Challenge. International Journal of Computer Vision, 115(3), 211–
252. https://doi.org/10.1007/s11263-015-0816-y
Russo, M. A., & Jo, K. (2019). Classification of sports videos with combination of deep learning models and transfer learning. 2019 International Conference on Electrical, Computer and Communication Engineering (ECCE), 1–5.
Saif, S., Tehseen, S., & Kausar, S. (2018). A survey of the techniques for the identification
and classification of human actions from visual data. Sensors (Switzerland), 18(11).
https://doi.org/10.3390/s18113979
Sharma, N., Jain, V., & Mishra, A. (2018). An Analysis of Convolutional Neural Networks
for Image Classification. Procedia Computer Science, 132(Iccids), 377–384.
https://doi.org/10.1016/j.procs.2018.05.198
Simões, G. S., Wehrmann, J., Barros, R. C., & Ruiz, D. D. (2016). Movie genre classification
with Convolutional Neural Networks. Proceedings of the International Joint Conference
on Neural Networks, 2016-Octob(July), 259–266.
https://doi.org/10.1109/IJCNN.2016.7727207
Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action
recognition in videos. Advances in Neural Information Processing Systems, 1(January),
568–576.
Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale
image recognition. 3rd International Conference on Learning Representations, ICLR
2015 - Conference Track Proceedings, 1–14.
Song, G., Wang, Z., Han, F., & Ding, S. (2018). Transfer learning for music genre classification. HAL Id: hal-01820925, 0–8.
Suarez, J. J. P., & Naval, P. C. (2020). A survey on deep learning techniques for video
anomaly detection. ArXiv, 20–25.
Sun, J., Wang, J., & Yeh, T.-C. (2017). Video Understanding: From Video Classification to
Captioning. 1–9. http://cs231n.stanford.edu/reports/2017/pdfs/709.pdf
Throsby, D. (2014). The music industry in the new millennium: Global and local perspectives. Macquarie University, Sydney. July.
Appendix A: Sample code of CNN end to end classification
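The original listing for this appendix did not survive the source conversion. Below is a minimal, hedged sketch of an end-to-end CNN classifier over the 8 nation classes, extending the layer stack sketched in section 3.4; the layer widths and training settings are illustrative assumptions, not the thesis's exact code.

from tensorflow.keras import layers, models

NUM_CLASSES = 8  # Gojjam, Awi, Afar, Shoa Oromo, Tigray, Guragie, Wolayta, Benshangul Gumuz

model = models.Sequential([
    layers.Input(shape=(64, 64, 3)),          # resized RGB frame
    layers.Conv2D(32, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(2),
    layers.Conv2D(64, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(2),
    layers.Conv2D(128, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(2),
    layers.Conv2D(128, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Training as in section 4.3 (80/20 split), e.g.:
# model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50)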
Appendix B: Sample code for combined features of HOG, Optical flow and CNN
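As with Appendix A, the original listing is not recoverable. The following hedged sketch shows a classifier head over the concatenated HOG, Optical flow, and VGG-16 features (for example, as produced by the illustrative combined_frame_features() helper sketched in section 3.3). A dense head is used here for simplicity, and the layer sizes are illustrative assumptions.

from tensorflow.keras import layers, models

def build_combined_classifier(feature_dim, num_classes=8):
    # feature_dim: length of one concatenated HOG + flow + CNN vector
    return models.Sequential([
        layers.Input(shape=(feature_dim,)),
        layers.Dense(512, activation="relu"),
        layers.Dropout(0.5),        # regularization for the small dataset
        layers.Dense(num_classes, activation="softmax"),
    ])

# Usage sketch, assuming X holds the combined feature vectors:
# model = build_combined_classifier(feature_dim=X.shape[1])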
Appendix C: Sample sequential frames
[Sample sequential frames for the Agew, Gojjam, and Tigre dances; frame images not reproduced.]