A Fraud Detection System For Mobile Applications
Abstract
As technology advances rapidly, the mobile app market is growing fast and the number of mobile apps is increasing day by day. Every app developer wants their app to rank as high as possible in the popularity list. To maximize popularity, some app developers use unfair and deceptive means: they use “Bot-farms” and “human water armies” to download their apps and post fraudulent ratings and reviews. The challenge is to detect such fraudulent activities. In this paper we develop a system that detects such fraudulent behavior by mining the apps' historical records. We mine three types of evidences from these records, namely ranking based evidences, review based evidences and rating based evidences, and then aggregate the three evidences.

Index Terms – Ranking based evidences, Rating based evidences, Review based evidences, Mining Leading Session, Sentiment analysis.

Introduction
In today's world technology is advancing rapidly, and mobile devices are part of this technology. The number of mobile users is increasing day by day. 6.2 percent of the world's population has mobile devices, so the mobile application is also a well known concept. There are now over 3.6 million applications in the Google Play store and 2.2 million applications in the Apple App Store. The number of application developers is also increasing steadily, so there is huge competition among them. Because there is such a massive number of applications, it is difficult for users to choose the right ones. Every application developer wants their applications to be popular so that they get the maximum number of downloads and thus the maximum revenue. The application leader board is the platform from which we can judge how popular an application is. Leader boards are the best way to popularize an application: a top position in the leader board indicates that the application is popular. Top ranked applications generally have more downloads and earn millions of dollars. Application developers therefore tend to investigate different ways to reach a higher position in the leader board [3]. An application with a large number of downloads, ratings and reviews is usually ranked high in the leader board. There are different ways to popularize an application, which fall into two main categories. White hat promotion is the legal way to promote an application, whereas black hat promotion is the illegal way. Shady application developers generally use black hat techniques, that is, fraudulent or deceptive means for boosting their application in the popularity list. This is usually implemented by “Bot-farms” or “Human Water Armies”, which inflate application downloads, ratings and reviews in a very short time [2][3]. If we observe carefully, we find that an application is not always ranked high in the leader board but only in some periods, called leading events, which form different leading sessions. Fraud for mobile applications therefore occurs in these sessions, and detecting fraud in an application amounts to finding fraud in its leading sessions. Therefore we first need to find the leading sessions of a mobile application and then evaluate those sessions against the application's historical records. The main objective of our work is to detect fraudulent behavior for mobile applications by mining the apps' historical records. We check whether there is any fraud signature in any leading session. We first collect three types of evidences from user feedback, namely 1) ranking based evidences, 2) rating based evidences and 3) review based evidences. Since this project mines user feedback, we consider two types of user feedback: rating based and review based. We generally rate an app while downloading it or after seeing its performance, so rating is one of the important evidences for judging an app; but, as discussed above, there are techniques with which the rating can be inflated [2]. Many people download applications after reading user reviews, so shady application developers may inflate their applications with fake comments. We have therefore designed a system that detects whether any such activity has been carried out to increase the popularity of an application. We first determine the active periods, using an algorithm we designed for this purpose. By mining the application's historical ranking records we get the leading events, and by combining adjacent leading events we get leading sessions. We then evaluate these leading sessions against the three types of evidences. We first separate the statistical and textual reviews and map these reviews to each leading session. For textual reviews we will apply Natural
Language Processing (NLP) and obtain the sentiment of the reviews. Then, for each leading session, we determine the overall sentiment and check whether there is any anomaly pattern.

Review of Literature
Patil Rohini et al. [1] discussed that almost everyone uses a mobile phone these days and, with it, a mobile app store. Many applications are available from these stores, but some of them may be used for data theft. Such applications should be detected and made identifiable to users. They proposed a web application that processes the applications' historical records with different techniques and presents the results as graphs, from which the applications can then be compared.
Ranjitha.R et al. [13] state that ranking fraud is the key challenge in the mobile application market. According to them, ranking fraud refers to fraudulent or deceptive activities whose purpose is to bump up an app in the popularity list.
Hengshu Zhu et al. [2] give us the idea of mining active periods, namely the leading sessions of the applications. They also identified various types of evidences, mainly rating based evidences, review based evidences and ranking based evidences. Zhu and his co-authors also proposed an optimization based aggregation method to integrate all the different types of evidences.
Shivakumar Swamy N et al. [4] discussed that data mining techniques can be used for detecting fraud. They discussed different techniques that can be used to detect anomalies in datasets and give a brief description of some of them.
Abhilash TP et al. [3] give us the overall idea of ranking fraud. They first introduced the concept of active periods for a mobile application. They showed that an application is not always ranked high in the leader board but only in some active periods known as leading sessions. They also give the basic idea of how active periods can be obtained by mining the application's historical records. Further, they investigate three types of evidences through statistical hypothesis tests.
L. Velmurugan [5] gives the basic idea of the techniques used in misuse detection and anomaly detection. This paper also gives an overview of mining the leading sessions, and an algorithm for fraudulent ranking behavior detection with Concept Vector Based Review Evidence Analysis (CVBREA).

Proposed System
The objective of our work is to find fraudulent ranking behavior for mobile applications. Fraud generally happens in the leading sessions. A fraudulent application developer uses deceptive or unfair means to raise the app's ranking in the leaderboard. Such apps are detected by building leading sessions from the leading events, which show the phases of achievement: the rising phase, the maintaining phase and the recession phase, in which we find the app's ranking behavior from historical ranking records. When these phases are checked over time, the ranking of genuine apps stays stable over time periods, whereas the ranking of fraudulent apps fluctuates. Therefore we have to characterize fraud evidences from the apps' historical ranking records. We can consider two other types of evidences based on the apps' historical reviews and ratings.
In ranking based evidences, each leading event must show a specific ranking pattern from the history of the app's ranking behavior [10].
In rating based evidences, once an app is published, a user can rate it after downloading it. The user rating is one of the most important means of advertising an application. Apps with higher ratings attract more users to download them and rank high on the leaderboard. Therefore rating is an important evidence for ranking fraud.
In review based evidences, similar to ratings, the app store also allows users to write their feedback as app reviews. Users describe their experience with a particular mobile app in these reviews, so reviews also play an important role in ranking fraud. Finally, the evidence based aggregation method is used for integrating all the evidences [11].

Proposed Approach
We first read the datasets and, after preprocessing them, separate the statistical reviews and the textual reviews. The statistical reviews are then mapped to sessions, and each session is checked separately. If the sessions are found to be evenly distributed, the chance of the reviews being fake is low; if they change abruptly, the chance of the reviews being fake is high. For example, if the mean review for session S1 is excellent but the mean review for session S2 drops sharply, the reviews in session S1 may not be genuine and might be paid reviews. After completing the statistical reviews we consider the textual reviews and apply NLP to them. The NLP process consists of two parts: Parts of Speech tagging, which finds the part of speech of each input word, and Chunking, which removes all the unnecessary parts of speech from the reviews and yields only the action words. We process all these action words and determine the overall sentiment. Then we check these sentiments session wise to find whether the reviews are fake or genuine. The composite results of both the statistical reviews and the textual reviews identify the true nature of the reviews and generate the results.
Advantages of proposed system: the proposed framework is extensible and can be continued by considering other evidences for ranking fraud detection.
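As a rough illustration of the textual-review branch of this approach, the following Python sketch marks a review as Good or Bad. It is only a sketch under stated assumptions: it substitutes tokenization, stop word removal, crude suffix stripping and a lexicon lookup for the POS tagging and chunking steps, and the word lists, the function names and the rule that a zero score counts as Bad are all hypothetical choices made for this example.

```python
import re
from typing import Dict, List

# Hypothetical, deliberately tiny word lists for illustration only.
STOP_WORDS = {"a", "the", "and", "for", "from", "is", "in", "it", "this", "to"}
LEXICON: Dict[str, int] = {
    "good": 1, "great": 1, "love": 1, "excellent": 1,
    "bad": -1, "crash": -1, "slow": -1, "hate": -1,
}


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def crude_stem(word: str) -> str:
    """Strip one common suffix. This is NOT the Porter stemmer,
    only a stand-in so the example has no external dependencies."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def review_sentiment(text: str) -> str:
    """Return 'Good' if the summed lexicon score is positive, else 'Bad'.
    Mapping a zero score to 'Bad' is an arbitrary assumption."""
    tokens = [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]
    score = sum(LEXICON.get(t, 0) for t in tokens)
    return "Good" if score > 0 else "Bad"


if __name__ == "__main__":
    print(review_sentiment("This app is great and I love it"))
    print(review_sentiment("The app crashes and is slow"))
```

A real system would replace the lexicon lookup with the POS tagging and chunking pipeline described above; the per-review labels would then be grouped by session.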
Proposed System Architecture
Fig1: Block diagram of System Architecture
Mining Leading Session
Fraud usually happens in leading sessions, so leading sessions are the basis for detecting fraud in mobile apps [13][8]. There are two main steps for mining leading sessions. First, we determine the leading events from the application's historical ranking records. Second, we merge adjacent leading events to obtain a leading session [9]. Observation shows that an application is not always ranked high in the leader board, but only during specific periods called leading events. A fraudulent mobile application generally shows different ranking patterns in each leading session compared to normal applications. Therefore the problem of identifying fraudulent behavior in mobile applications reduces to finding suspicious leading sessions [3].
Leading sessions are the active periods for mobile applications. Our first task is to find the leading sessions of a mobile application. We have used an algorithm for mining the leading sessions from the application's historical ranking records.
The pseudo code of mining leading session for an application a:
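The two steps described above can be sketched in Python as follows. This is a minimal sketch, not the authors' algorithm: it assumes that a leading event is a maximal run of consecutive time points at which the app's rank is at or above a ranking threshold `k`, and that two events belong to the same session when fewer than `max_gap` time points separate them. The names `k`, `max_gap` and all function names are hypothetical.

```python
from typing import List, Tuple

Event = Tuple[int, int]  # (start_index, end_index), both inclusive


def mine_leading_events(ranks: List[int], k: int) -> List[Event]:
    """Step 1: find maximal runs of time points where rank <= k
    (a smaller rank number means a higher leader board position)."""
    events: List[Event] = []
    start = None
    for t, rank in enumerate(ranks):
        if rank <= k:
            if start is None:
                start = t          # a new leading event begins
        elif start is not None:
            events.append((start, t - 1))  # the event ended at t - 1
            start = None
    if start is not None:          # event still open at the end of the record
        events.append((start, len(ranks) - 1))
    return events


def merge_into_sessions(events: List[Event], max_gap: int) -> List[Event]:
    """Step 2: merge adjacent leading events separated by less than max_gap."""
    sessions: List[Event] = []
    for start, end in events:
        if sessions and start - sessions[-1][1] < max_gap:
            sessions[-1] = (sessions[-1][0], end)  # extend the current session
        else:
            sessions.append((start, end))          # open a new session
    return sessions


def mine_leading_sessions(ranks: List[int], k: int, max_gap: int) -> List[Event]:
    return merge_into_sessions(mine_leading_events(ranks, k), max_gap)


if __name__ == "__main__":
    daily_ranks = [500, 120, 90, 400, 80, 70, 600, 600, 600, 600, 50]
    print(mine_leading_sessions(daily_ranks, k=100, max_gap=3))
```

Each returned session is an index range into the ranking record, which can then be used to select the ratings and reviews that fall inside that period.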
Identifying evidences

Ranking based evidences:
Leading events are the active periods for mobile applications. First we should study the basic characteristics of leading events to obtain fraud evidences. By analyzing the apps' historical ranking records, we observe that apps' ranking behaviors in a leading event always satisfy a specific ranking pattern, which consists of three different ranking phases [9][12]:
Rising phase: The period in which an application's ranking rises to a top position in the leader board.
Maintaining phase: The period for which an application remains in its top position.
Recession phase: The period in which the app's rank declines.

Rating based evidences:
After downloading an app, users generally rate it. The rating given by users is one of the most important features for the popularity of the app. Shady application developers usually inflate their apps with fake ratings so that they get the maximum number of downloads. Thus rating based evidence is also an important factor that needs to be considered. Ratings generally range from one to five; here we use a threshold value to classify the ratings into two groups. Ratings less than or equal to three are considered negative ratings, and ratings above three are considered positive ratings.

Review based evidences:
Users give feedback on a particular application after downloading it or after experiencing its performance. Many people go through this feedback before downloading an application, so fraud may also happen in user reviews. Shady application developers may inflate their application with false comments or reviews. So review based evidence is also a very important factor for detecting fraudulent behavior in mobile applications. Reviews are usually given in natural language, so they need to be preprocessed with Natural Language Processing (NLP) [6][7].

Preprocessing of reviews
1. Tokenization: The process of breaking a stream of text into words, phrases, symbols or other meaningful elements called tokens.
2. Stop word removal: Stop words are commonly used words such as a, the, and, for, from, is, in etc.
3. Stemming: Stemming is done to find the root word. For English, the Porter stemmer algorithm can be used to remove suffixes and obtain the stem.
We have to find the overall user reviews, map these reviews to sessions and check them for fraudulent behavior. We have created a module for finding the overall sentiment of the reviews.

Fig2: Block diagram of sentiment analysis module

Algorithm:
1. Read all feedback information
2. Divide the information into sessions
3. For each session find the feedback obtained, to get the list
S1 F1
S2 F2
S3 F3
…
Sn Fn
where Si is the session, and Fi is the feedback from that session
4. Check if the feedbacks have a common trait: if (F1 = F2 and F2 = F3 and .... Fn-1 = Fn) then the review is genuine; else, if there is an abrupt shift in the pattern, the feedback might be non-genuine.

For the NLP based technique,
1. Read all feedback information
2. For each feedback, find action words using the POS Tagging and Chunking process
3. Evaluate the sentiment from the feedback and mark the feedback as Good or Bad
4. Divide the feedback into sessions
5. For each session find the feedback obtained, to get the list
S1 F1
S2 F2
S3 F3
…
Sn Fn
where Si is the session, and Fi is the feedback from that session
6. Check if the feedbacks have a common trait: if (F1 = F2 and F2 = F3 and .... Fn-1 = Fn) then the review is genuine; else, if there is an abrupt shift in the pattern, the feedback might be non-genuine.
The results from both algorithms are combined to conclude whether the given feedback is genuine or not.

Then, to check how the reviews change in each session, we used K-means clustering. We made two clusters and calculated the distance between the two clusters.
Fig 17: Centroids of Clusters
Fig4: Cluster of sessions (plot title: Cluster of all the session)
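The session-wise check and the two-cluster comparison can be sketched as follows. This is a minimal sketch under several assumptions that are not in the text above: the per-session feedback Fi is taken to be the fraction of positive ratings in session Si (using the threshold of three stated under rating based evidences), the strict equality test F1 = F2 = ... is relaxed to a hypothetical `tolerance`, and K-means is written out in one dimension with k = 2 rather than taken from a library.

```python
from statistics import mean
from typing import Dict, List, Tuple


def rating_polarity(rating: int, threshold: int = 3) -> int:
    """Ratings <= threshold are negative (0); ratings above it are positive (1)."""
    return 1 if rating > threshold else 0


def session_scores(ratings_by_session: Dict[str, List[int]]) -> Dict[str, float]:
    """Assumed feedback Fi: fraction of positive ratings in session Si."""
    return {s: mean(rating_polarity(r) for r in rs)
            for s, rs in ratings_by_session.items()}


def has_abrupt_shift(scores: List[float], tolerance: float = 0.3) -> bool:
    """True if two consecutive sessions differ by more than `tolerance`
    (a hypothetical relaxation of the strict F1 = F2 = ... = Fn test)."""
    return any(abs(b - a) > tolerance for a, b in zip(scores, scores[1:]))


def two_means_1d(values: List[float], iterations: int = 20) -> Tuple[float, float]:
    """Plain 1-D K-means with k = 2; returns the centroids (low, high)."""
    low, high = min(values), max(values)
    for _ in range(iterations):
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        if not near_high:          # all sessions fall into a single cluster
            break
        low, high = mean(near_low), mean(near_high)
    return low, high


if __name__ == "__main__":
    ratings = {"S1": [5, 5, 4, 5], "S2": [5, 4, 5, 5], "S3": [1, 2, 5, 1]}
    scores = list(session_scores(ratings).values())
    low, high = two_means_1d(scores)
    print(scores, has_abrupt_shift(scores), high - low)
```

A large distance between the two centroids, with only one or a few sessions in one of the clusters, corresponds to the kind of shift in user feedback that the approach treats as a possible fraud signature.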
Results and Discussions
We have collected historical records of the applications. By mining the historical ranking records we obtained the leading sessions.
An application with 3 months of historical ranking records with k=180 is shown in the following graph:
Fig3: Leading sessions
Overall ratings and overall reviews are calculated for all the sessions. We then list the overall rating and the overall review sentiment for all the sessions.
Fig 16: Overall User Review for all the session
In the clusters of Fig4 we can see that there is only one element in the second cluster, which is far from the first cluster; this indicates a shift in the user reviews. Thus there may be some fraud signatures in the user reviews in that session. We can now consider different applications, form clusters of their sessions, and then tabulate their centroids and the distances between the centroids to compare the results.

Conclusion
We have designed a system for detecting fraudulent behavior in mobile applications. Many shady application developers use unethical methods to increase the popularity of their applications. We have shown that this type of fraud generally happens in the active period, that is, the leading session of the application. We have designed an algorithm for mining those leading sessions from the application's historical ranking records. The system aims to detect fraud based on three types of evidences: ranking based evidences, rating based evidences and review based evidences. Further, an optimization based aggregation method combines all three evidences to detect the fraud. A unique perspective of this approach is that all the evidences can be modeled by statistical hypothesis tests, so it is easy to extend with other evidences from domain knowledge to detect ranking fraud.

References
[1]. Patil Rohini, Kale Pallavi, Jathade Pournima, Prof. Pamkaj Agarkar, “MobSafe: Forensic Analysis for Android Application and Detection of Fraud apps Using Cloud Stack and Data Mining”, International Journal of Advance Research in Computer Engineering and technology (IJARCET), Volume 4, Issue 10, October 2015.
[2]. Hengshu Zhu, Hui Xiong, Yong Ge, and Enhong Chen, “Discovery of Ranking Fraud for Mobile Apps”, IEEE Transactions On Knowledge And Data Engineering, Vol. 27, No. 1, January 2015.
[3]. Abhilash T P, L Dinesha, “Ranking Detection and Avoidance Frauds in Mobile Apps Store”, International Journal of Advance Networking and Application, ISSN NO: 0975-0282.
[4]. Shivakumar Swamy N, Prof. Sanjeev C. Lingareddy, “Fraud Detection Using Data Mining Techniques”, International Journal of Innovations in Engineering and Technology (IJIET).
[5]. L. Velmurugan, “Latent Relation Analysis based Discovering Fraudulent Ranking Identification on Mobile Web Apps”, Indian Journal of Science and Technology, Vol 8(34), DOI: 10.17485/ijst/2015/v8i34/74505, December 2015.
[6]. Shreya Banker and Rupal Patel, “A Brief Review of Sentiment Analysis Methods”, International Journal of Information Sciences and Techniques (IJIST), Vol.6, No.1/2, March 2016.
[7]. Xuanfan Wu, “Matrics, Techniques and Tools of Anomaly Detection: A Survey”, (Online). Available: https://www.cse.wustl.edu/~jain/cse567-17/ftp/mttad/index.html
[8]. Raghuveer Dagade, Prof. Lomesh Ahire, “Review: A Ranking Fraud Detection System For Mobile Apps”, International Journal of Innovative Research in Computer and Communication Engineering, Vol. 3, Issue 11, November 2015.
[9]. Javvaji Venkataramaiah, Bommavarapu Sushen, Mano. R, Dr. Gladis pushpa Rathi, “An Enhanced Mining Leading Session Algorithm For Fraud App Detection in Mobile Application”, International Journal of Scientific Research in Engineering (IJSRE), Vol. 1 (4), April, 2017.
[10]. Anuja A. Kadam, Pushpanjali M. Chouragade, “A Review Paper on: Malicious Application Detection in Android System”, International Journal of Computer Applications (0975 – 8887), National Conference on Recent Trends in Computer Science & Engineering (MEDHA 2015).
[11]. L. Velmurugan, “Latent Relation Analysis based Discovering Fraudulent Ranking Identification on Mobile Web Apps”, Indian Journal of Science and Technology, Vol 8(34), DOI: 10.17485/ijst/2015/v8i34/74505, December 2015.
[12]. Ranjitha.R, Mathumita.K, Meena.S and S Hariharan, “Discovery of Ranking of Fraud for Mobile Apps”, International Journal of Innovative Research And Management (IJIREM), ISSN: 2350-0557, Vol-3, Issue 3, May 2016.
[13]. S. Karthika and N. Sairam, “A Naiive Bayesian Classifier for Educational Qualification”, International journal of Science and Technology, ISSN (Print): 0974-6846, ISSN (Online): 0974-5645, Vol 8(16), July 2015.
ICACCP-2019 1570507786
Long-term Static Music Emotion Recognition: A supervised learning approach to model user emotion profile for Music Recommender Systems

Rwiddhi Chakraborty*, Electronics and Communication Engineering, Heritage Institute of Technology, Kolkata, India, rwiddhi.chakraborty.ece18@heritageit.edu
Aniket Dutta*, Electronics and Communication Engineering, Heritage Institute of Technology, Kolkata, India, aniket.dutta.ece18@heritageit.edu
Shubhayu Das*, Electronics and Communication Engineering, Heritage Institute of Technology, Kolkata, India, shubhayu.das.ece18@heritageit.edu
Chandrima Roy, Electronics and Communication Engineering, Heritage Institute of Technology, Kolkata, India, chandrima.roy@heritageit.edu
* Authors contributed equally to this work

Abstract - In this paper we describe and approach Static Music Emotion Recognition (MER) as a Supervised Learning problem. Here we propose a paradigm to capture the user's emotional tastes, using Arousal and Valence annotations from the user, for various modern applications, primarily in music recommendation systems. The primary aim is to predict the emotional content of a piece of classical music according to the taste of a particular user. We show that our Static MER model gives a satisfactory performance, even with a genre as emotionally complex as Classical Music. Moreover we use the entire duration of the pieces (unlike other works in this area, which use shortened clips of the pieces) in this problem. We obtained satisfactory results using comparatively smaller feature sets - 68 features for Valence and 70 for Arousal. Finally, we propose an architecture for a music recommender system that can integrate this approach effectively. This is the first time in this research space that a two pronged approach - smaller feature sets and entire duration of music pieces - has been used successfully, with potential for far reaching commercial applications in the present day.
Index Terms - Music Emotion Recognition; Music Recommendation Systems; Music Information Retrieval; Supervised Learning; Support Vector Regression; Random Forest Regression; Artificial Neural Networks

I. INTRODUCTION
The problem of addressing Music Emotion Recognition (or MER) as a supervised learning problem started about a decade ago [1]. Inspired by the pioneering work of Russell [2], MER uses a 'dimensional' approach to model emotion, rather than a 'categorical' approach. Russell's dimensional model, which he called the 'circumplex model', works on the basis of two metrics, viz. Valence (pleasantness; positive or negative affective states) and Arousal (activation; energy and stimulation level). Figure 1 shows the 'circumplex model', where different regions of the 2D plot signify different kinds of emotion.
Fig. 1. Russell's Circumplex Model.
The problem of identifying emotion in music, having multiple modern use cases, was taken up by various research groups in the past decade, each having different approaches to the problem. This led to the MER regression problem being further broken down into 'Static MER' and 'Dynamic MER'. The former assumes emotion in a piece of music to be independent of time; in other words, each piece of music has a definite emotion it aims to express and can be denoted by a single point in the Valence-Arousal space. The latter, Dynamic MER, assumes that the emotion in music changes with time, and hence each piece of music follows a contour in the Valence-Arousal space with respect to time. In the past decade, the majority of researchers interested in Music Emotion Recognition have chosen the Dynamic MER approach, with the hope of capturing the variations of emotional expression in music. However, we suggest here that Static MER is more relevant and efficient when the use cases rely on being able to differentiate between two pieces of music based on an individual's emotional response. Under the Static MER
hypothesis, this is a single comparison and is particularly useful for a use case like 'music recommendation systems'.
In order to suggest music to their users, music streaming websites such as 'Pandora' use advanced recommendation systems which attempt to build an emotional profile of a user [3]. In one of their recent projects, 'Spotify' used 'Valence' as one of the features in their recommendation algorithm [4]. The emotional classification of a piece of music in such applications could be done by in-house music experts, which would be subject to the problem of subjectivity. Alternatively, it could be done using a machine learning approach that builds a listener's emotional profile, catering to each listener's emotional taste separately. In this work we reduce the problem of building a listener profile for such applications to a simple prediction problem. The task is to show that with a reasonable number of Valence and Arousal annotations of different pieces of music, a regression algorithm could satisfactorily predict a listener's response to new pieces of music.
What is novel about our present approach is that here we attempt to train a regressor to predict the Valence-Arousal values of entire pieces of classical music (about 10-20 minutes long) according to the taste of the user. Moreover the feature set we use is considerably smaller compared to that in other MER works. This ensures easy and efficient integration with use cases that require the prediction of a piece's Valence and Arousal values according to a user's taste.
The rest of this paper is organized as follows. First, in section two we present the data-set we used for this work and how we acquired our ground truth annotations. There we also discuss the feature extraction paradigm used to define each piece of music. In section three we discuss the pre-processing and the model training stage. In section four, we present and analyze our results, and in section five our results are summarized and compared with other related works. Section six briefly discusses the use and relevance of this approach if implemented on a larger scale with respect to a music recommendation system.

II. DATASET AND METHODOLOGY
The data used in this work has been created by us and consists of ground truth labels from only one listener, and is available at Github [https://github.com/ShubhayuDas/StaticMER dataset], and archived in Zenodo [https://doi.org/10.5281/zenodo.1283520]. We created this dataset using two publicly available and open source repositories [5] and [6].

A. Music Recordings and Ground Truth acquisition
The music data used in this work is the open source MusicNet [5] data-set, created by researchers at the University of Washington. It contains 330 classical music recordings by famous composers like Bach, Schubert, Mozart, Beethoven etc. We shall extract relevant musical features on a long term basis from these recordings to train and test our regression models. The feature extraction process is explained in detail in the next subsection. As mentioned in the introduction section, the use case of our work is to predict the valence and arousal values of a new piece of music based on the user's taste. Our intention is not to find the absolute emotion of a musical piece, in which case the problem of subjectivity of the annotations must be accounted for. We are not concerned with removing subjectivity from our ground truth. We want to capture it, since the aim is to predict or replicate the listener's response itself (in terms of Valence and Arousal), for pieces outside the training set. Hence the entire ground truth data was acquired from a single person. Our data-set being that of classical music pieces, the labeling was done by a professional classical musician, teacher and conductor, Mr. Anubrata Ghatak [7].

B. Feature Extraction
For our audio feature extraction, we have used the pyAudioAnalysis [6] library. Table 1 lists each of the features we have used. Furthermore each of these features can be classified as follows:
• The time-domain features (Features 1 to 3 in Table 1) are directly extracted from the raw signal samples. These features encompass information viz. Loudness, Noise, Energy, Abruptness, etc.;
• The frequency-domain features (Features 4 to 34 in Table 1, apart from the MFCCs) are based on the magnitude of the Discrete Fourier Transform. The cepstral domain (used by the MFCCs or 'Mel-Frequency Cepstral Coefficients') results are found by applying the Inverse DFT on the logarithmic spectrum. These features encompass information regarding Timbre Texture, Tonality, Harmony, Multiplicity (or number of pitches heard) etc.;
• Features 35 and 36 are tempo related features - these encompass information regarding the tempo and to some extent the overall dominant rhythm.
Further details about each feature are available in the pyAudioAnalysis feature extraction documentation [6].

TABLE I
LIST OF FEATURES USED
Feature no.   Feature name
1             Zero Crossing Rate
2             Energy
3             Entropy of Energy
4             Spectral Centroid
5             Spectral Spread
6             Spectral Entropy
7             Spectral Flux
8             Spectral Rolloff
9-21          MFCCs or Mel Frequency Cepstral Coefficients
22-33         Chroma Vector
34            Chroma Deviation
35            BPM Rate
36            BPM Dominance

There are two algorithmic stages involved in the long term audio feature extraction, for features 1 to 34 (see Table 1):
• Short-term feature extraction is carried out first. It splits the input signal into short-term windows (or frames) and computes a number of features for each frame. This process leads to a sequence of short-term feature vectors for the whole signal. We have used a short-term window size of 50 ms and a step size of 25 ms. So, if one short-term frame starts at 1.00 sec and ends at 1.05 sec, the next starts at 1.025 sec and ends at 1.075 sec. Hence there is a 50 percent overlap. This extracts the set of 34 features (Features 1 to 34 in Table 1) from each short-term frame of the recording.
• Then a mid-term window and step is specified. For each segment, after the short-term feature extraction is carried out, the feature sequence from each mid-term segment is used for computing feature statistics (e.g. the mean and standard deviation of the ZCR). Therefore, each mid-term segment is represented by the mean and standard deviation over each of its short-term feature sequences. The mid-term window we used is 2 seconds long with a mid-term step of 0.2 seconds (i.e. 90 percent overlap).
• Only after extracting these short-term features and mid-term features are the averages taken, in order to produce one long-term feature vector per audio file. Hence in the features 1-34 mentioned above which follow this

based on the audio features used: (1) Temporal-Spectral-Rhythm features dataset (330 X 70), (2) Temporal-Spectral features (330 X 68), (3) with Fisher-Score feature selection on the Temporal-Spectral-Rhythm dataset (330 X 60):
• (1) Using Temporal-Spectral and Rhythm/Tempo features: Rhythmic and Tempo features (Features 35 and 36 in Table 1) were chosen over and above the Spectral and Time Domain Features (Features 1-34 in Table 1). Hence the data-set taken as input was a 330 X 70 feature matrix.
• (2) Using Spectral-Temporal features: Here only the Spectral and Time Domain Features (Features 1 to 34 in Table 1) with their mean and standard deviation were chosen. Hence the data-set taken as input was a 330 X 68 feature matrix.
• (3) With Feature Selection: Here the Generalized Fisher Score was used for feature selection [10], which selects each feature independently according to its score under the Fisher criterion. The feature selection was applied on the 330 X 70 feature matrix to select the top most informative features (total 28 features).
In this paper we aim to give a Model and Feature-set wise analysis of performance. The subsequent section explains the results categorized by the choice of the Feature-set.
26
paradigm, each consists of two parameters - one is its
27
mean and the other is its standard deviation over the mid-
TABLE II
28
term features. P ERFORMANCE M EASURE IN TERMS OF M EAN S QUARED E RROR , M EAN
29
We thus obtain a total of 68 features from this process A BSOLUTE E RROR , R 2 S CORE ; FOR VALENCE AND A ROUSAL WITH
30
(34 + 34). Along with the rhythm, we end up with a (330 RESPECT TO DIFFERENT REGRESSION MODELS AND THE
FEATURE - VECTOR USED
31
X 70) feature matrix for the entire set of songs. Different
32
combinations of these were used while training. We elaborate
33
Model- Arousal Valence
on this in the next section. Feature MSE MAE R2 MSE MAE R2
34
SVR1-70 0.1301 0.2953 0.2497 0.1812 0.3723 0.0452
35
SVR2-68 0.1375 0.3075 0.2074 0.1796 0.3735 0.0535
36
III. P RE - PROCESSING AND M ODEL T RAINING SVR3-28 0.1702 0.323 0.0185 0.1898 0.3305 -0.0004
37
We have chosen a Support Vector Regressor (SVR) as RER1-70 0.1368 0.3057 0.2112 0.1715 0.3745 0.0958
RFR2-68 0.134 0.3 0.2275 0.1685 0.3678 0.1121
38
our primary regression model. A Random Forest Regressor RFR3-28 0.1518 0.3134 0.1245 0.1591 0.3617 0.1616
39
(RFR) was chosen to compare the results obtained from SVR. ANN1-70 0.1754 0.3678 -0.0111 0.1999 0.4185 -0.0533
40
A two layer Artificial Neural Network (ANN) with 90 ANN2-68 0.1712 0.3556 0.0128 0.2111 0.4357 -0.1124
ANN3-28 0.1981 0.3864 -0.1423 0.1967 0.4083 -0.0367
41
nodes in each hidden layer was chosen to see if it works
42
with the problem this projects deals with. Hence there are
43
a total of three regression models. The single pre-processing
44
step we performed was to perform max-min normalization IV. R ESULTS AND A NALYSIS
45
[8] to normalize the features. On our primary data-set the The performance of the regression models are evaluated
46
in terms of Mean Squared Error(MSE), Mean Absolute Er-
330 X 70 matrix, we employed 10-fold Cross Validation [9].
47
ror(MAE) and the R2 Score.
The algorithm chooses each fold exactly once, and the 10
48
From Table 2 and Figures 2 and 3 we can draw the
folds are created randomly. The absence of repeating items
49
following inferences:
in each iteration of the algorithm eliminates the possibility
50
of overfitting, i.e. training on the same data set repeatedly to • The Support Vector Regressor tests the best for Arousal
51
obtain inaccurate results. To obtain the best results through prediction with the primary data-set (with 70 features).
52
hyper-parameter tuning, Grid Search [9] was employed while • The Support Vector Regressor also tests the best for
53
conducting Cross Validation. This method systematically de- Valence prediction with the feature selected data-set(with
54
55
fines grid of all possible combination of parameters involved 28 features).
56
and returns the best set of parameters for which the model • The Neural Network architecture consistently performs
57
gives the best outcome. We repeated this methodology for our the worst. One explanation for this would be the com-
60
three models. Moreover Our data was split into 3 catagories paratively smaller data-set we have used, since neural
61
62
63
64
65
3
networks generally perform better on large data-sets. It may not be true for our specific problem, but is a possible explanation for the poor performance.
• Using Fisher scoring to find the top 28 features does not consistently improve performance for each model type. There are two possible explanations for this: either the Fisher score is not the appropriate measure of feature information in this case, or, more simply, decreasing the number of features does not in fact reduce redundancy and our current feature set is sufficient.

Fig. 2. Mean Square Error and Mean Absolute Error values for the Arousal Models run on the Arousal Test Set.

Fig. 3. Mean Square Error and Mean Absolute Error values for the Valence Models on the Valence Test Set.

V. SUMMARY AND DISCUSSION

In this work we mapped a set of recordings' musical and acoustic features to a particular listener's emotional response, in terms of Valence and Arousal. The best case error in terms of Mean Squared Error (MSE) between the actual and the predicted values is 0.1591 for Valence and 0.1301 for Arousal. In terms of Mean Absolute Error (MAE) our best case error values are 0.2953 and 0.3617 for Arousal and Valence respectively. The predictions made by the algorithm reflect the emotional taste of the listener, i.e. the one who provides the ground truth.

Even though the definition of our problem is different from typical works on Static MER, we wanted to compare our prediction accuracy. Most works on MER, however, use different performance metrics which are not comparable, and cater to different problem statements. Moreover, many of them have used different ranges for their outputs (e.g. Reference [11] and many other works use a range of -0.5 to 0.5 for Valence and Arousal). The work by [12] achieved an accuracy of 40.6% and 67.4% for Valence and Arousal, in terms of R-squared statistics. However, R-squared says nothing about prediction error. Even with the MSE exactly the same, and no change in the coefficients, R-squared can be tailored to be anywhere between 0 and 100% just by changing the range of the independent variable(s). Moreover, R-squared does not measure goodness of fit and can be arbitrarily low when the model is completely correct. Hence the metric we have used is Mean Squared Error, which is more suited to the emotion prediction problem we are trying to address.

One work that is closest to ours is by Yang et al. [13], where they attempted to predict the general emotion in a piece of music. However, their focus was to find the absolute emotion of a piece of music, where the ground truth annotations are assumed to be the general consensus. Using the circumplex model and a regression approach they try to tackle the problem of subjectivity. We, on the other hand, embrace it, by trying to make predictions with respect to an individual's subjective emotional taste. In their prediction task they achieved a best case performance of 0.1798 and 0.1731 for Valence and Arousal, in terms of Mean Absolute Error (MAE). Despite the difference in the MAE values, we believe our prediction performances are similar, given that the recordings we used were of varied length, about 10-20 minutes long, which is how our work is primarily different from other works. Most works, including the ones mentioned above, use musical pieces of fixed duration in their data-set. In [14] the pieces are less than 1 minute long. Moreover, the music data-sets used by most other works are a combination of genres like rock, pop, etc., which are far less complex than classical music in terms of emotional content and variation.

The novelty in this work lies in the fact that our predictions are user specific. We use comparatively smaller feature descriptors than other works, viz. [11], which uses a 128-dimensional feature set, and [12], where a 98-dimensional feature set was used, and yet they are informative enough to perform comparably on the prediction task.

VI. FUTURE SCOPE: MUSIC RECOMMENDATION SYSTEMS

Having shown that a regressor can satisfactorily map a piece of music to the emotional response of a listener, given their honest ground truth annotations, in this section we discuss how such a model could help build an emotion profile descriptor which could easily be integrated within existing Music Recommender Systems: one that encapsulates the user's emotional tastes in music, as well as predicts the region in which a new piece of music may lie. One's profile could be described as a hash table of Song Id and a 2-D vector of Valence and Arousal annotations from the user (see Figure
4). The feature set being what describes the music, a varied distribution of musical pieces, when mapped to the Arousal-Valence space using a regression model as presented in this work, would adequately represent a user's musical tastes. The hash table could therefore be integrated with existing music recommendation algorithms, where arousal and valence are a part of user ratings.

Fig. 4. An example of the hash tables of two different users.

Such a model would facilitate suggestions based on emotion. The similarity of users could be quantified based on a cosine similarity of the Arousal and Valence ratings of the songs heard and rated by both users. Moreover, a personalized prediction model for each individual listener, as presented here, could directly be used to recommend new pieces of music. One could analyze the common patterns in the Arousal and Valence values of consecutive songs, and make new suggestions previously unheard by the listener. Suggestions made with such an approach may seem completely unrelated to a random observer, but are likely to be relished by the particular listener. Moreover, intelligent approaches could be used for deciding which set of songs gives a unique direction to the user's music emotion tastes. One approach would be to cluster the songs in the Arousal-Valence space and randomly suggest songs from the less probable clusters to enforce serendipity. After getting the Valence and Arousal ratings, one could update the hash table if the values of Valence and Arousal provided are within a close threshold of the average of the user's major cluster (the major cluster being the cluster with the highest probability of being assigned).

This paradigm of modeling one's music emotion preferences opens up a number of options for music recommender systems to include emotion in their analysis. Their effectiveness would depend on the performance of the emotion prediction model for each listener. We thus appeal to the music emotion and music recommendation system research community: a benchmark database with a framework for efficient ground truth collection from multiple users would go a long way in the growth of this unique paradigm of emotion based music recommendation systems.

REFERENCES

[1] Y. H. Yang, Y. C. Lin, Y. F. Su, and H. H. Chen, (2008). A regression approach to music emotion recognition. IEEE Trans. Audio, Speech Lang. Process. 16, 2, 448-457, in press.
[2] J. A. Russell, (1980). A circumplex model of affect. Journal of Personality and Social Psychology. 39, 6, 1161-1178, in press.
[3] Howe (2009). "Pandora's Music Recommender", a case study. Available at: https://pdfs.semanticscholar.org/f635/6c70452b3f56dc1ae07b4649a80239afb1b6.pdf (online)
[4] K. Tiffany, (2018). TL;DR: You can now play with Spotify's recommendation algorithm in your browser; It's fun if you know what valence means! Available at: https://www.theverge.com/tldr/2018/2/5/16974194/spotify-recommendation-algorithm-playlist-hack-nelson (online)
[5] J. Thickstun, Z. Harchaoui, S. M. Kakade, (2017). Learning Features of Music from Scratch. International Conference on Learning Representations (ICLR), University of Washington, in press. Retrieved from https://homes.cs.washington.edu/~thickstn/musicnet.html
[6] T. Giannakopoulos, (2015). pyAudioAnalysis: An Open-Source Python Library for Audio Signal Analysis. PloS one. 10, 12, in press. Retrieved from https://github.com/tyiannak/pyAudioAnalysis
[7] A. Ghatak, n.d. Classical Musician, Conductor and Teacher at Kolkata Music Academy
[8] MinMaxScaler (n.d.). Retrieved from http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html (online)
[9] GridSearchCV (n.d.). Retrieved from http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html (online)
[10] Q. Gu, Z. Li, J. Han, (2012). "Generalized Fisher Score for Feature Selection". CoRR, abs/1202.3725, in press.
[11] F. Weninger, F. Eyben, B. Schuller, (2013). The TUM Approach to the MediaEval Music Emotion Task Using Generic Affective Audio Features. In Proceedings of MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain, in press.
[12] R. Panda, B. Rocha, R. Paiva, (2013). "Dimensional music emotion recognition: Combining standard and melodic audio features". In Proceedings of International Symposium on Computer Music Modelling & Retrieval, in press.
[13] Y. H. Yang, Y. C. Lin, Y. F. Su, and H. H. Chen, (2007). "Music Emotion Classification: A Regression Approach". 2007 IEEE International Conference on Multimedia and Expo, in press.
[14] A. Aljanaki, F. Wiering, and R. C. Veltkamp, (2015). "Audio segmentation based approach for improved emotion recognition". In Proceedings of the 16th ISMIR Conference, Malaga, Spain, in press.
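As an illustration of the emotion profile discussed in Section VI, the following is a minimal sketch, not the implementation used in this work. It assumes that a profile is a plain Python dictionary mapping a Song Id to a (Valence, Arousal) pair, and that user similarity is the cosine similarity over the ratings of songs rated by both users. The song identifiers, the rating values and the function name cosine_similarity are hypothetical.

```python
import math


def cosine_similarity(profile_a, profile_b):
    """Cosine similarity of two users' (Valence, Arousal) ratings.

    Only songs rated by both users are compared. Returns None when
    there is no shared song or when one rating vector is all zeros.
    """
    shared = sorted(set(profile_a) & set(profile_b))
    if not shared:
        return None
    # Flatten the shared ratings into one vector per user.
    vec_a = [value for song in shared for value in profile_a[song]]
    vec_b = [value for song in shared for value in profile_b[song]]
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return None
    return dot / (norm_a * norm_b)


# Hypothetical profiles: Song Id -> (Valence, Arousal).
user_1 = {"song_001": (0.8, 0.6), "song_002": (-0.4, 0.1), "song_003": (0.2, -0.7)}
user_2 = {"song_001": (0.7, 0.5), "song_003": (0.1, -0.6), "song_004": (0.9, 0.9)}

similarity = cosine_similarity(user_1, user_2)
print(similarity)
```

Restricting the comparison to shared songs follows the description in Section VI; a real recommender would also have to decide how to treat users with very few songs in common, which this sketch leaves open.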
ICACCP-2019 1570508889
uses two PTZ cameras with different roles: one is called the full-shot PTZ camera and the other the movement PTZ camera.

Real-time person tracking has been implemented before, but not in the way we have done it, with real-time broadcasting of the footage without any delay. The Lecture Capturing System (LCS) combines lecture capture (including the audience if necessary), screen sharing, face recognition based remote login and attendance marking for online participants, and real-time viewing of the lecture, with added features such as intelligently generated video chapters that follow the lecture slides played alongside. Together these make it a complete e-learning package. Thorough research on e-learning systems has led to the identification of some of the most influential factors used in the field of information systems research, more specifically the characteristics, limitations, weaknesses, and strengths of web-based learning systems. Student variables, such as technical issues and adapting to new ways of learning, are important variables that influence student learning, especially in a collaborative e-learning environment. In particular, this research helps to better understand the characteristics of students and what they expect from learning management systems. This can help developers achieve the most effective deployment of such systems and improve their strategic decision making about technology: they can decide on the approach that best fits their students before implementing any new technology.

Features of a set of commercially available e-learning platforms were compared with the Lecture Capturing System.

Panopto is an easy-to-use video platform for training, presenting, and communicating that enables users to record videos and rich media presentations and push them out to subscribers in many different formats. It is mainly focused on recording and later pushing the recorded stream to the users. BigBlueButton is an open-source web collaboration software utilized by education organizations for e-learning and training. It enables users to conduct web-conferencing and share documents, audio and video files for online learning. The software's "whiteboard" feature allows presenters to mark valuable topics in the presentation.

Echo360 combines video management with lecture capture and active learning to increase student success. Echo360 keeps notes linked to class presentations and videos so that students can jump straight from their own words to those of the instructor and replay the entire learning experience. Videos are uploaded and processed in real time so the optimized version is available as soon as class is over.

Kaltura offers the broadest set of video management and creation tools on the market, tightly integrated with every LMS, for uses ranging from flipped classrooms to live sports broadcasts.

Even though there are many e-learning applications in the market, there is no system which handles all the necessary requirements that LCS has managed to implement.

III. SYSTEM ARCHITECTURE

The system is a cloud-based web application which is capable of supporting multiple enterprise customers. Remote students can access the system after logging in via biometric authentication (facial recognition). A PTZ IP camera provides a continuous video stream of the lecturer and audience to the central server via the Kurento media server [18] during a lecture session. The server broadcasts this stream live to the remote students. The lecturer's screen is also shared if required. The attendance of remote students is marked by capturing their faces through their webcams and running a neural network algorithm on the server for identification. Lecturers are also able to do an offline lecture recording and then upload it to the server, where it will be split into chapters and the audio converted to text. Figure 1 shows the system with the main components and their interactions.

Figure 1: System Architecture

IV. METHODOLOGY

This section describes how the system was designed and implemented, explaining the process of each functionality, their flow in the system, and how they interact with each other. The system was implemented using technologies such as Nodejs, Python and ReactJS, with MongoDB for storage.

A. Face Recognition based authentication

A student or a lecturer can log in to the system using the webcam. Initially, the administrator of the Lecture Capturing System should register the user by uploading quality images and relevant details of the user. After registering a user, the server will train the face recognition classifier with the newly uploaded images of the user along with the existing images of other users. Thereafter the user will be authenticated by the face recognition process through the webcam only if the confidence reported by the face recognition classifier is greater than 90%. If it is less than 90%, the user will not be authenticated. In terms of security, session handling will take place after the face recognition based login.

In terms of the general architecture of the face recognition based login function, two steps are considered:
1. Detection stage
The LCS system searches for the face region (displayed by a rectangle) in the whole video stream.
2. Recognition stage
The face image obtained above is compared with the face images trained in the database, and the registered user is predicted.

If the face recognition is successful, the recognition result will be displayed in white text inside a green rectangle on the webcam feed along with the confidence
percentage. If it fails, the system will pop up a warning. Face recognition in the system follows three main steps.

1. Prepare training data
The OpenCV computer vision library, Python and Numpy are used as dependencies to implement the face recognition function in the system [12]. OpenCV provides two pre-trained and ready-to-use face detection classifiers called the Haar classifier and the LBP classifier [9]. Both the Haar Cascade classifier and the Local Binary Patterns (LBP) classifier are used to detect and recognize faces in this system. LBP is a type of visual descriptor used for classification in computer vision [10]. The LBP classifier is used due to its main advantages: shorter training time, a high accuracy rate in difficult lighting conditions (which is useful when detecting faces through the webcam), and being computationally simple and fast [19]. A formal description of the LBP algorithm is given in Figure 2.

Figure 2: LBP Algorithm Equation

The training dataset consists of 30 images for each user, and each user is assigned a label (e.g. s1, s2) upon registering to the system. This step reads all the images of a person and applies face detection to each one using the LBP classifier. It then adds each face to the face vectors together with the corresponding person label. Finally, the data preparation step produces the following face and label vectors [19].

Figure 3: Face and label vectors

2. Train face recognizer
The face and label vectors returned from the data preparation step (the 'getImagesAndLabels' function in Figure 3) will be converted to a Numpy array and passed to the OpenCV Haar Cascade face recognizer for training [19]. The statistical data returned from the face recognizer will be saved in a YAML file.

3. Prediction
Once the user has navigated to the login page, the server automatically detects the face with the Haar classifier, predicts the face by calling the trained OpenCV Haar face recognizer, returns the predicted name of the user associated with the label, and live streams the response from the server (recognized face plot, name of the user, confidence threshold) to the login page. Thereafter, the user will be logged in to the system after clicking the face login button.

B. Audio and video conferencing

An IP camera will be used to track the lecturer's movement and gestures in front of the camera and produce the necessary PTZ signals to pan, tilt and zoom accordingly, thus ensuring that the lecturer's actions are always recorded without missing any detail. This video recording will be immediately compressed 'on-the-fly' in order to reduce its file size, then streamed live and also saved in the database for backup purposes and later viewing. Therefore, the students have the choice of attending the lecture via the live stream or listening to the lecture later. This is very beneficial to students since they can attend lectures remotely, without being physically present. During the live streaming session, the system will decide which video resolution (e.g. 480p, 720p, 1080p) to use for playback at the student's end depending on the speed of his/her internet connection.

Live streaming is achieved via Kurento, which is a WebRTC media server and a set of client APIs. During the live streaming session, the lecturer also has the ability to share his/her entire computer screen with the participating students if required, making certain that not even the most minute detail is missed. In the case of having low bandwidth to support this feature, the lecturer has the option to disable the IP cameras in order to save bandwidth. This mode of lecturing provides better participation and interaction between the lecturers and students. For example, if a student wants to ask a question, control would be given to that student by the lecturer, and the application would support audio only, video only, or both audio and video sources of that student. The lecturer has the ability to take back control of the audio and video sources of the system when required.

C. Lecture Capture and Movement of IP Camera

An IP camera will be used to track the lecturer's movement and gestures in front of the camera and produce the necessary PTZ signals to pan, tilt and zoom in accordance with the lecturer's movement.

1. Lecture Detection
OpenCV object detection using Haar feature-based cascade classifiers is an effective object detection method. It is a machine learning based approach where a cascade function is trained from a lot of positive and negative images. It is then used to detect objects in other images. Here we will work with face detection. Initially, the algorithm needs a lot of positive images and negative images to train the classifier. Then we need to extract features from them. For this, the Haar features shown in Figure 4 are used. Each feature is a single value obtained by subtracting the sum of pixels under the white rectangle from the sum of pixels under the black rectangle.

Figure 4: Haar features

Now, all the possible sizes and locations of each kernel are used to calculate lots of features. Many of these calculations are irrelevant; consider Figure 5. The top row shows two good features. The first feature selected seems to focus on the
property that the region of the eyes is often darker than the region of the nose and cheeks. The second feature selected relies on the property that the eyes are darker than the bridge of the nose. But the same windows applied to the cheeks or any other place are irrelevant.

Figure 5: Features Calculated

For this, we apply each and every feature on all the training images. For each feature, the algorithm finds the best threshold which will classify the faces as positive or negative. We select the features with the minimum error rate, which means they are the features that most accurately classify the face and non-face images.

This lecture tracking script uses OpenCV's Haar Cascade Classifier to do the person detection task. First, it initializes a face cascade using the frontal face Haar cascade [19], [20]. Then it starts to detect and track the largest face it can find. If it is not tracking a face, or has lost the tracked face, it again uses the Haar cascade detector to detect a face and then a correlation tracker from the dlib library to follow it. Detection requires scanning the whole frame with a sliding window, with the algorithm trying to find the features of a person in each window position. This is too expensive to perform in each frame if we want to run our person tracker on restricted hardware like a budget laptop. For this reason, we combine the person detector with a correlation tracker. The correlation tracker expects a region of interest and starts tracking the pixels inside that region. In subsequent frames, it tries to find where the pixels have most likely moved. This is much faster and more robust than trying to find the person in each and every frame again.

2. PTZ Camera Movement
Open Network Video Interface Forum (ONVIF) is an open industry standard that provides interoperability among IP security devices such as security cameras, video recorders, software, and access control systems [21].
Since we are using ONVIF protocols to move the camera, the system is compatible with devices from different vendors, so LCS will be supported by most manufacturers of IP based security devices, giving it the added benefit of not limiting the system to a specific brand of IP cameras.
Because of this, we have used this protocol to work with our tracking algorithm to move the camera accordingly. The custom functions can move the camera in any direction and zoom in and out to focus on the lecturer as required. The detection and tracking algorithm calls the move function only if a significant movement is detected, ensuring that it doesn't break focus for small movements of the lecturer.

D. Easy Screen Share

The ability to share either the complete screen or a selected application window right from the web browser, and to stream it alongside the main live stream, makes the lecturer's task easier and more efficient. This is helpful because, most of the time, when the lecturer plugs his laptop into the main projector in a normal classroom, all of his desktop content, including open tabs in the web browser, is visible to the students; privacy is a concern, and it is a hassle for the lecturer to switch screen sharing on and off all the time. With this Easy Screen Share feature the lecturer can stream his webcam footage alongside the shared screen if required (for example, if he is in front of the laptop and blocking the main camera view). This extension simply initializes socket.io and configures it in a way that a single audio/video/screen stream can be shared/relayed among users without any bandwidth/CPU usage issues. It uses RTCMultiConnection, a WebRTC library that is used for WebRTC streaming [22].

E. Gesture-Based Camera Control

When a student who is physically present in the classroom has a doubt, the lecturer has to direct the camera towards the audience by performing a gesture at the camera. The lecturer has to show his hand with all five fingers unfolded. The system will then analyze and recognize the gesture using OpenCV and Python and turn the camera towards the audience so that the remotely logged in users get the picture of what is happening in the classroom. Once the student has finished asking the question, the lecturer turns the camera back to the normal position by either pressing a button on the interface or performing a gesture at his/her webcam.

F. Open Broadcaster Software (OBS) Studio plugin

A plugin was implemented for the OBS Studio [23] software which allows a lecturer to do an offline recording of a lecture and then upload it directly to the server.
After recording the lecturer's desktop screen while s/he conducts a lecture, the designed plugin uploads this video to the remote server based on predefined settings at the click of a button. These settings can be changed by the lecturer to suit their needs (e.g. upload the video now or at a later scheduled time). Once the video is uploaded to the remote server, the next level of processing is done at the server.

G. Video Thumbnails/Chapters Creation

The lecturer can view the list of videos which have been uploaded to the server. Out of this list, the lecturer can select a video to be converted into a series of thumbnail chapters. Each frame in the video is analyzed by the PySceneDetect algorithm [24], which is implemented in Python. The PySceneDetect algorithm makes use of the OpenCV, NumPy, and FFmpeg libraries for execution. There are two main detection methods which PySceneDetect uses:

Threshold-detection - Compares the intensity/brightness of the current video frame with a set threshold, and triggers a scene cut/break when this value crosses the threshold. The frame intensity is computed by averaging the Red-Green-Blue (RGB) values for every pixel in the frame, yielding a single floating-point number representing the average pixel value (from 0.0 to 255.0).

Content-aware detection - Finds areas where the difference between two subsequent video frames exceeds the threshold value that is set and then triggers a scene cut. This allows you to detect cuts between scenes both containing content, rather than how most traditional scene detection methods work. With a properly set threshold, this method can even detect minor, abrupt changes [25]. This method takes in
the threshold and minimum-scene-length in frames (optional) as input parameters.
It compares the difference in content between adjacent frames against a set threshold/score, which, if exceeded, triggers a scene cut. It checks for changes in color and intensity between video frames, namely the average HSV color space difference (difference in hue, saturation, and luminance of the frame) [26]. If this calculated value is much higher than the preceding and following values, it means that there has been a scene change. This process is repeated for the entire length of the video clip until the entire video clip is analyzed and all the video chapters are created.

Following this process is the real-time speech transcription (audio-to-text conversion) of each video chapter. First, the audio is extracted from the video chapters in mp3 format by the FFmpeg library. Next is the speech transcription procedure, which is achieved via the Watson Speech-To-Text algorithm. This service leverages machine intelligence to transcribe the human voice accurately [27]. The service combines information about grammar and language structure with knowledge of the composition of the audio signal.

Therefore, the end result would be a set of videos along with their respective audio, presentation slide, and text. These videos would be stored in the database so that students can access them any time after the lecture session to further understand and clarify their knowledge.

H. Facial Recognition based Attendance Marking
Using facial recognition, the attendance is marked automatically for the students who are present in the lecture room and also for the students who are logged in remotely through the Lecture Capturing System during the live streaming lecture session. The administrator and the lecturer are able to view, modify and filter the attendance of students. A student is able to view his/her attendance with the aid of the filtering options available. One noticeable advantage of this feature is that it adds an extra layer of security to the system, ensuring that only authorized persons gain access to the university's content. A clear advantage of this method of biometric authentication of students can be noted during an online exam, where it verifies that the person on the other end is actually who they claim to be. Also, this feature will solve the problem of students marking attendance for other students.

I. Bandwidth Management
The size of the data passed from the client to the node server and vice versa is reduced using bandwidth optimization techniques such as compression and clustering. The administrator can monitor bandwidth using the bandwidth monitoring dashboard, which shows traffic usage, system information, CPU load, alerts for exceeded predefined thresholds and for attacks, and more. The dashboard is accessible only to the administrator of the system.

J. Quota Management
The administrator is able to manage the internet quota allocation for users from the dashboard. The list of users, along with their usage statistics such as used quota and remaining quota, can be viewed filtered by user type (e.g. lecturer, student), month, and year. This monthly quota can be edited by the administrator for a single user (e.g. a specific lecturer's id) or for all users of a particular user type (e.g. all students).

V. RESULTS AND DISCUSSION
For face recognition-based login, the LBP classifier reported an accuracy of 70.33% for a particular user in face detection, with a shorter training time. The Haar classifier, in contrast, reported 81.05% for a user but took longer to train, so the LBP classifier underperformed the Haar classifier when detecting faces. Since the face recognition-based login of the lecture capturing system should be fast and should maintain an accuracy above 80%, the Haar classifier is the preferred solution and is currently used in the lecture capturing system. However, these results were obtained using 30 medium-quality training images per user, captured from a webcam. The results reported by the classifiers were therefore not satisfactory, and the accuracy may change if more high-quality training images are allocated to each user.

Comparison between Haar Classifier and LBP Classifier
1. LBP classifier is faster than Haar classifier.
2. Haar classifier uses floats to do all the calculations while LBP classifier uses integers.
3. LBP classifier is less accurate than Haar classifier.
4. Haar-like features in the Haar Cascade classifier work best for frontal face detection.
5. Haar features are good at detecting edges and lines, which is effective in face detection.

Accuracy rates mentioned in the table below are derived using the formula:
Accuracy % = 100 - Confidence Index
The confidence index is zero when the result is considered a perfect match in detection or recognition. If the face is not matched, an 'unknown' label is put on the face.

Average                                                                LBP classifier   Haar classifier
Accuracy of recognition                                                60%              90%
Processing time for encoding and training a user with 30 face images   1.8 min          2.1 min

The classifiers were also tested with training sets of 50 to 100 face images per user. A laptop with an 8th-generation i7 processor and 8 GB of RAM was used to obtain the results below.

Images per user   LBP Classifier   Haar Classifier
50 images         2.5 min          4 min
100 images        3.9 min          6.9 min

The amount of main memory used to execute the algorithm is defined as memory used and is given in MB.

Algorithm     LBP Algorithm   Haar-Like Algorithm
Memory Used   123 MB          290 MB

With regard to the video chapter creation feature, each frame in the video is analyzed by an algorithm for changes in color and intensity, namely the average HSV color space difference (difference in hue, saturation, and luminance of the frame). If this calculated value is much higher than the preceding and following values, it means that there has been a scene change. Therefore, the video is split at this time frame. This process is repeated for the entire length of the video clip until the entire
video clip is analyzed and all the video chapters are created. This process was run over repeated cycles of 100 iterations, which produced an average accuracy of 95%.

Previously, many methods have been introduced as e-learning platforms. This research has taken a different path by replicating a complete classroom-like experience.

Live streaming with recording sessions of a lecture would help all students, even if they were present in the lecture itself. The ability to review what was missed when a student arrives late, and the ability to go through a previous lecture before attending the next one, would result in a huge academic improvement.

The main purpose of this system is to offer an effective way to help students access learning materials and information from anywhere, and to quickly recap any forgotten or missed lectures via the earlier recordings of the sessions.

The outcome is a system and a method for interactive, Internet-based video conferencing multicast that uses a video production studio with a live instructor giving lectures in real time to the participating students. The video conference multicasting permits students to interact with the instructor during the course of the lecture and to later browse the recorded session without a hassle.

VI. CONCLUSION AND FUTURE WORK
This paper examines an innovative approach that is best suited to developing a lecture capturing system that provides a complete classroom experience to remotely logged in students. This system is distinct from existing products in being a comprehensive product that includes biometric authentication, gesture detection, live streaming of lectures, automated attendance marking, offline recording of lectures, bandwidth management and desktop screen capturing all in one.

This research work has been developed mainly to address problems in Sri Lankan universities, specifically the lack of interactivity between the lecturer and the students. Though this research focuses on universities, it has the potential to be used in other fields such as business conferences. In the next stage, the research team will focus on improving the accuracy of the face recognition and gesture detection models by testing other algorithms. The research team will also focus on minimizing bandwidth costs by testing out bandwidth optimization techniques. It is hoped that, for any person who expects to build a similar system or any other real-time system, the results of this research will be an aid and will provide insight into the performance, accuracy and reliability level that can be expected with the combination of tools, technologies, and programming approach considered in this paper.

REFERENCES
[1] W., “Use of E-Learning,” Universiti Teknologi. Malaysia, Johor, Malaysia, 2018.
[2] K. Kumar and T. S. Sheng, “Real Time Target Tracking with Pan Tilt Zoom Camera,” presented at the Digital Image Computing, Adelaide, 2009.
[3] P. D. Z. Varcheie, “Online Body Tracking by a PTZ Camera in IP Surveillance System,” Department of Computer Engineering and Software Engineering, Station Centre-ville, Montréal, (Québec), Canada, 2009.
[4] T. G. Dries Hulens, “Autonomous lecture recording with a PTZ camera,” presented at the Canadian Conference on Computer and Robot Vision, Belgium, 2014.
[5] M. M. M. H. R. Jacko, “Remote control of the PTZ camera system for lecture rooms,” Department of Computers and Informatics, 2015.
[6] B. Wulff, “OpenTrack - Automated Camera Control for Lecture Recordings,” IEEE International Symposium on Multimedia, 2011.
[7] C.-F. C. a. P.-C. S. Yong-Quan CHEN, “A Tabletop Lecture Recording System,” in International Conference on Consumer Electronics-Taiwan, Taiwan, 2015.
[8] Y.-T. T. S. C. a. S.-W. C, “Chiung-Yao Fang, ‘Chiung-Yao Fang, You-Ting Tsai, Shuan Chu, and Sei-Wang Chen,’” Department of Computer Science and Information Engineering, Taiwan, 2015.
[9] “The Way Online Video Streaming Works Has Changed.” [Online]. Available: https://www.panopto.com/blog/the-way-video-works-online-has-changed.
[10] “Video Analytics & Engagement Dashboard - Panopto Video Platform.” [Online]. Available: https://www.panopto.com/features/video-cms/video-analytics.
[11] “Face Detection using OpenCV and Python: A Beginner’s Guide.”
[12] “Analytics to improve student success - Echo360.” [Online]. Available: https://echo360.com/platform/analytics/. [Accessed: 15-May-2018].
[13] “Automated Student Attendance Management System Using Face Recognition | Ise A Orobor and Ofualagba Godswill - Academia.edu.” [Online]. Available: http://www.academia.edu/37437099/Automated_Student_Attendance_Management_System_Using_Face_Recognition. [Accessed: 10-Oct-2018].
[14] V. Mankar and S. G Bhele, “A Review Paper on Face Recognition Techniques,” International Journal of Advanced Research in Computer Engineering & Technology, vol. 1, pp. 339–346, Oct. 2012.
[15] F. Ahmad, “Image-based Face Detection and Recognition.” [Online]. Available: https://arxiv.org/ftp/arxiv/papers/1302/1302.6379.pdf.
[16] “WebRTC 1.0: Real-time Communication Between Browsers.” [Online]. Available: https://www.w3.org/TR/webrtc/. [Accessed: 10-Oct-2018].
[17] P. Braun, M. Sipos, P. Ekler, and F. Fitzek, “On the Performance Boost for Peer To Peer WebRTC-based Video Streaming with Network Coding,” 2017.
[18] “What’s Kurento - Kurento.” [Online]. Available: https://www.kurento.org/whats-kurento. [Accessed: 14-Mar-2018].
[19] “OpenCV library.” [Online]. Available: https://opencv.org.
[20] “OpenCV: Face Detection using Haar Cascades.” [Online]. Available: https://docs.opencv.org/3.4.2/d7/d8b/tutorial_py_face_detection.html.
[21] “Onvif.” [Online]. Available: https://www.onvif.org/onvif/ver20/util/operationIndex.html.
[22] “WebRTC Home | WebRTC.” [Online]. Available: https://webrtc.org.
[23] “Open Broadcaster Software | Home.” [Online]. Available: https://obsproject.com/. [Accessed: 26-Mar-2018].
[24] “Command Reference - PySceneDetect v0.5 documentation.” [Online]. Available: https://pyscenedetect-manual.readthedocs.io/en/latest/cli/commands.html. [Accessed: 09-Oct-2018].
[25] B. Castellano, A Python/OpenCV-based scene detection program, using threshold/content analysis on a given video: Breakthrough/PySceneDetect. 2018.
[26] “Introduction - PySceneDetect.” [Online]. Available: https://pyscenedetect.readthedocs.io/en/latest/.
[27] “Watson Speech to Text,” 28-Nov-2016. [Online]. Available: https://www.ibm.com/watson/services/speech-to-text/.
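
The two scene detection methods described in Section G can be illustrated with a short sketch. This is not PySceneDetect's implementation: it is a simplified, standard-library-only approximation in which a frame is assumed to be a flat list of (R, G, B) tuples, and the function names, the threshold values (12.0 and 30.0), and the handling of the minimum scene length are assumptions chosen only for illustration.

```python
import colorsys


def mean_intensity(frame):
    """Average of every R, G and B value in the frame (0.0 to 255.0)."""
    return sum(r + g + b for r, g, b in frame) / (3.0 * len(frame))


def threshold_cuts(frames, threshold=12.0):
    """Frame indices where the mean intensity crosses the threshold."""
    cuts = []
    previous_above = None
    for index, frame in enumerate(frames):
        above = mean_intensity(frame) >= threshold
        if previous_above is not None and above != previous_above:
            cuts.append(index)
        previous_above = above
    return cuts


def content_score(previous, current):
    """Mean absolute per-pixel HSV difference, scaled to 0..255.

    Hue wrap-around is ignored to keep the sketch short.
    """
    total = 0.0
    for (r1, g1, b1), (r2, g2, b2) in zip(previous, current):
        h1, s1, v1 = colorsys.rgb_to_hsv(r1 / 255.0, g1 / 255.0, b1 / 255.0)
        h2, s2, v2 = colorsys.rgb_to_hsv(r2 / 255.0, g2 / 255.0, b2 / 255.0)
        total += abs(h1 - h2) + abs(s1 - s2) + abs(v1 - v2)
    return 255.0 * total / (3.0 * len(current))


def content_cuts(frames, threshold=30.0, min_scene_len=1):
    """Frame indices where the HSV difference to the previous frame
    exceeds the threshold and the current scene is long enough."""
    cuts = []
    last_cut = 0
    for index in range(1, len(frames)):
        score = content_score(frames[index - 1], frames[index])
        if score > threshold and index - last_cut >= min_scene_len:
            cuts.append(index)
            last_cut = index
    return cuts


# Tiny synthetic clip: two dark frames, two red frames, two blue frames.
dark = [(0, 0, 0)] * 4
red = [(255, 0, 0)] * 4
blue = [(0, 0, 255)] * 4
frames = [dark, dark, red, red, blue, blue]

print(threshold_cuts(frames))  # [2]
print(content_cuts(frames))    # [2, 4]
```

In this synthetic clip the threshold method only finds the transition out of the dark frames, because the red and blue frames have the same average intensity, while the content-aware method also finds the cut between the two scenes that both contain content.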
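
The accuracy formula used in Section V (Accuracy % = 100 - Confidence Index) can be written as a small helper. This is only a sketch: the function names and the cut-off `max_confidence_index` (40.0) used to decide when a face is labelled 'unknown' are hypothetical, since the paper does not state the cut-off that was used.

```python
def accuracy_percent(confidence_index):
    """Accuracy % = 100 - Confidence Index (0 means a perfect match)."""
    return 100.0 - confidence_index


def label_for(name, confidence_index, max_confidence_index=40.0):
    """Return the predicted name, or 'unknown' when the match is too weak.

    max_confidence_index is a hypothetical cut-off chosen for illustration.
    """
    if confidence_index <= max_confidence_index:
        return name
    return "unknown"


print(accuracy_percent(0.0))            # 100.0
print(label_for("student_01", 10.0))    # student_01
print(label_for("student_01", 75.0))    # unknown
```

Under this formula, a confidence index of 18.95 corresponds to the 81.05% accuracy reported for the Haar classifier.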