A Fraud Detection System for Mobile Applications
Abstract

Since technology is advancing rapidly, the mobile app market is growing very fast and the number of mobile apps is increasing day by day. Every app developer wants their app to rank as high as possible in the popularity list, so to maximize popularity some app developers resort to unfair and tricky means: they use "bot farms" and "human water armies" to download their apps and provide fraudulent ratings and reviews. The challenge is to detect such fraudulent activities. In this paper we develop a system which can detect such fraudulent behavior by mining an app's historical records. We mine three types of evidences from the app's historical records, namely ranking based evidences, review based evidences and rating based evidences, and then calculate an aggregation of these three evidences.

Index Terms – Ranking based evidences, Rating based evidences, Review based evidences, Mining Leading Session, Sentiment analysis.
Introduction

In today's world technology is advancing expeditiously, and mobile devices are a part of this technology. The number of mobile users is increasing day by day; 6.2 percent of the world's population has mobile devices, so the mobile application is also a well-known concept. To date there are over 3.6 million applications in the Google Play Store and 2.2 million applications in the Apple App Store. The number of application developers is also increasing gradually, so there is huge competition among application developers. As there is a massive number of applications, it is difficult for us to choose the right ones. Every application developer wants their applications to be popular so that they can get the maximum number of downloads and thus the maximum revenue. The applications leaderboard is a platform from which we can gauge how popular an application is, and these leaderboards are the best way to popularize an application: a topmost position in the leaderboard indicates that the application is popular. Top-ranked applications generally have more downloads and earn millions of dollars, so application developers have a tendency to investigate different ways of getting a higher position in the leaderboard [3]. An application with a large number of downloads, ratings and reviews is usually ranked high in the leaderboard.

There are different ways to popularize an application, mainly divided into two categories: white hat promotion is the legal way to promote an application, whereas black hat promotion is the illegal way. Shady application developers generally use black hat techniques to promote their applications, using fraudulent or tricky means to boost their application in the popularity list. This is usually implemented by "bot farms" or "human water armies" which fill up application downloads, ratings and reviews in a very short time [2][3]. If we observe carefully we find that an application is not always ranked high in the leaderboard but only in certain periods, composed of leading events that form leading sessions; fraud for mobile applications particularly occurs in these sessions. So detecting fraud in an application is nothing but finding fraud in its leading sessions. Therefore we first need to find the leading sessions of a mobile application, and then evaluate those sessions against the application's historical records.

The main objective of our work is to detect fraudulent behavior in mobile applications by mining the apps' historical records. We check whether there is any fraud signature in any leading session. We first collect three types of evidences, namely 1) ranking based evidences, 2) rating based evidences and 3) review based evidences. Since this project mines user feedback, we consider two types of user feedback: rating based and review based. We generally rate an app while downloading it or after seeing its performance, so rating is one of the important evidences for judging an app; but, as discussed above, there are techniques with whose help the rating can be inflated [2]. Many people download applications after reading user reviews, so shady application developers may inflate their applications with fake comments. Here we have designed a system that detects whether any such activities have been carried out to increase the popularity of an application. We first determine the active periods, using an algorithm we have designed for this purpose. By mining the application's historical ranking records we get the leading events, and by combining adjacent leading events we get leading sessions. We then evaluate these leading sessions against the three types of evidences. We first separate the statistical and the textual reviews and map them to each leading session. To the textual reviews we apply Natural Language Processing (NLP) to get the sentiment of the reviews. Then for each leading session we determine the overall sentiment and check whether there is any anomalous pattern.
Review of Literature

Patil Rohini et al. [1] discussed that almost everyone uses a mobile phone these days, and with it a mobile app store. We can get any number of applications from these app stores, but some applications may be used for data robbery; such applications should be detected and made identifiable to users. They proposed a web application that processes an application's historical records with different techniques and presents the results in graph form; comparisons between applications are then made from the graphs.

Ranjitha R. et al. [12] state that ranking fraud is the key challenge in the mobile application market. According to them, ranking frauds are fraudulent or vulnerable activities whose purpose is to bump an app up the popularity list.

Hengshu Zhu et al. [2] give us the idea of mining active periods, namely the leading sessions of applications. They also identified various types of evidences, mainly rating based, review based and ranking based evidences, and further proposed an optimization-based aggregation method to integrate all the different types of evidences.

Shivakumar Swamy N. et al. [4] discussed that data mining techniques can be used for detecting fraud. They discussed different techniques that can be used to detect anomalies in datasets and gave a brief description of some of them.

Abhilash T P et al. [3] give us the overall idea of ranking fraud. They first introduced the concept of active periods for mobile applications, showing that an application is not always ranked high in the leaderboard but only in some active periods known as leading sessions. They also give the basic idea of how active periods can be obtained by mining an application's historical records. Further, they investigate three types of evidences through statistical hypothesis tests.

L. Velmurugan [5] gives the basic idea of the techniques used in misuse detection and anomaly detection. This paper also gives an overview of mining the leading session, as well as an algorithm for fraudulent ranking behavior detection with Concept Vector Based Review Evidence Analysis (CVBREA).

Proposed System

The objective of our work is to find fraudulent ranking behavior in mobile applications. Fraud generally happens in the leading sessions. Fraudulent application developers use tricky or unfair means to push their app's ranking up the leaderboard. Detection of such apps is done by building leading sessions from the leading events, which show the phases of achievement: the rising phase, the maintaining phase and the recession phase, in which we observe an app's ranking behavior from its historical ranking records. When these phases are checked over time, the ranking of a genuine app stays consistent over time periods, whereas that of a fraudulent app fluctuates. Therefore we have to characterize fraud evidences from the apps' historical ranking records, and we also consider two other types of evidences based on the apps' historical reviews and ratings.

In ranking based evidences, each leading event must show a specific ranking pattern in the history of the app's ranking behavior [10].

In rating based evidences: once an app is published, users can rate it after downloading. User rating is one of the most important forms of advertising for applications; apps with higher ratings attract more users to download them and rank higher on the leaderboard. Therefore rating is an important evidence of ranking fraud.

In review based evidences: similar to ratings, app stores also allow users to write their feedback as app reviews. Users describe their experience with a particular mobile app through reviews, so reviews also play an important role in ranking fraud.

Finally, an evidence-based aggregation method is used to integrate all the evidences [11].
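The paper does not state the aggregation formula itself; the sketch below assumes a simple weighted linear combination of the three per-session evidence scores, with illustrative weights and an illustrative threshold (Zhu et al. [2] derive their aggregation by optimization, which is not reproduced here):

```python
# Minimal sketch of evidence aggregation for one leading session.
# The weights and the 0-1 scaling of the scores are assumptions made
# for illustration; the paper only states that the evidences are combined.

def aggregate_evidences(rank_score, rating_score, review_score,
                        w_rank=0.4, w_rating=0.3, w_review=0.3):
    """Combine three per-session evidence scores (each scaled to [0, 1])
    into a single fraud score; higher means more suspicious."""
    assert abs(w_rank + w_rating + w_review - 1.0) < 1e-9
    return w_rank * rank_score + w_rating * rating_score + w_review * review_score

# Example: a session with a suspicious ranking pattern but middling ratings.
score = aggregate_evidences(rank_score=0.9, rating_score=0.4, review_score=0.7)
print(f"fraud score = {score:.2f}")  # flag the session if it exceeds a chosen threshold
```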
Proposed Approach

We first read the datasets and, after preprocessing, separate the statistical reviews from the textual reviews. The statistical reviews are then mapped to sessions, and each session is checked separately. If the sessions are found to be evenly organized, the chances of the reviews being fake are low; if they are abruptly organized, the chances of the reviews being fake are high. For example, if for session S1 the mean review is excellent but for session S2 the mean review drops sharply, the reviews in session S1 are probably not genuine and might be paid reviews. After completing the statistical reviews we consider the textual reviews and apply NLP to them. The NLP process consists of two parts: Part-of-Speech tagging, which finds the part of speech of each input word, and chunking, which removes all the unnecessary parts of speech from the reviews and keeps only the action words. We process these action words and determine the overall sentiment, then check the sentiments session-wise to find whether the reviews are fake or genuine. The composite results of both the statistical reviews and the textual reviews identify the true nature of the reviews and generate the results.
Advantages of the proposed system: the proposed framework is extensible and can be continued by considering other evidences for ranking fraud detection.
Proposed System Architecture

Fig. 1: Block diagram of the system architecture
Mining Leading Session

As fraud usually happens in leading sessions, leading sessions are the basis for detecting fraud in mobile apps [13][8]. There are mainly two steps in mining leading sessions. First, we determine the leading events from the application's historical ranking records. Second, we merge adjacent leading events to obtain a leading session [9]. Observation shows that an application is not always ranked high in the leaderboard but only at some specific times, called leading events. A fraudulent mobile application generally exhibits different ranking patterns in its leading sessions compared to normal applications, so the problem of identifying fraudulent behavior in a mobile application reduces to finding its vulnerable leading sessions [3].

Leading sessions are the active periods of a mobile application, and our first task is to find them. We have used an algorithm for mining the leading sessions from an application's historical ranking records. The pseudo code for mining the leading sessions of an application a is sketched below.
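The pseudo code itself did not survive in the source; the following minimal Python reconstruction follows the two steps described above, with an assumed top-K ranking threshold `K` and an assumed merge gap `phi` in days (both hypothetical parameters):

```python
# Reconstruction of the two-step mining procedure described above.
# `records` is a list of (day, rank) pairs sorted by day; K and phi are
# assumed parameters: the rank threshold for a leading event and the
# maximum gap (in days) allowed when merging adjacent events.

def mine_leading_sessions(records, K=300, phi=7):
    # Step 1: find maximal runs of days where the app ranks within the top K.
    events = []
    start = prev = None
    for day, rank in records:
        if rank <= K:
            if start is None:
                start = day
            prev = day
        elif start is not None:
            events.append((start, prev))
            start = None
    if start is not None:
        events.append((start, prev))

    # Step 2: merge adjacent events separated by fewer than phi days.
    sessions = []
    for event in events:
        if sessions and event[0] - sessions[-1][-1][1] < phi:
            sessions[-1].append(event)   # extend the current session
        else:
            sessions.append([event])     # start a new session
    return sessions
```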
57  
60  
61  
62  
63  
64  
65  
3
Identifying evidences

Ranking based evidences:

Leading events are the active periods of mobile applications, so we should first study the basic significance of leading events in order to obtain fraud evidences. By analyzing the apps' historical ranking records, we observe that an app's ranking behavior in a leading event always satisfies a specific ranking pattern consisting of three different ranking phases, namely [9][12]:

 Rising phase: when an application's ranking increases to a top position in the leaderboard, it is in the rising phase.
 Maintaining phase: the period for which an application remains in its top position is known as the maintaining phase.
 Recession phase: the phase of decline of the app's rank is called the recession phase.
Rating based evidences:

After downloading an app, users generally rate it. The rating given by users is one of the most important features for the popularity of an app. Shady application developers usually inflate their apps with fake ratings so that they can get the maximum number of downloads, so rating based evidence is also an important factor that needs to be considered. Generally, ratings are between one and five; here we consider a threshold value to classify the ratings into two parts: ratings less than or equal to three are considered negative, and ratings above three are considered positive.
Review based evidences:

Users give their feedback on a particular application after downloading it or after experiencing its performance. Many people go through these feedbacks before downloading an application, so fraud may happen in user reviews as well: shady application developers may inflate their application with false comments or reviews. Review based evidence is therefore also a very important factor for detecting fraudulent behavior in mobile applications. Reviews are usually given in natural language, so they need to be preprocessed with Natural Language Processing (NLP) [6][7].

Preprocessing of reviews

1. Tokenization: the process of breaking a stream of text into words, phrases, symbols or other meaningful elements called tokens.
2. Stop word removal: stop words are commonly used words such as a, the, and, for, from, is, in, etc.
3. Stemming: stemming is done to find the root word. For English, the Porter stemmer algorithm can be used to remove suffixes and obtain the stem.
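The paper does not name a library for these steps; a minimal sketch using NLTK, whose tokenizer, stop-word list and Porter stemmer match the three steps above, could look like this:

```python
# Minimal review-preprocessing sketch with NLTK: tokenize, drop stop words,
# then stem with the Porter algorithm, as in steps 1-3 above.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)       # tokenizer model
nltk.download("stopwords", quiet=True)   # stop-word list

def preprocess(review):
    tokens = word_tokenize(review.lower())                          # 1. tokenization
    stop = set(stopwords.words("english"))
    tokens = [t for t in tokens if t.isalpha() and t not in stop]   # 2. stop word removal
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]                        # 3. stemming

print(preprocess("This app is amazing and works flawlessly"))
# approximate output: ['app', 'amaz', 'work', 'flawlessli']
```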
We then have to find the overall user reviews, map these reviews to sessions, and check for fraudulent behavior. We have created a module for finding the overall sentiment of the reviews.

Block diagram of sentiment analysis module

Fig. 2: Block diagram of the sentiment analysis module
Algorithm:
1. Read all the feedback information.
2. Divide the information into sessions.
3. For each session, find the feedback obtained, to get the list S1 F1, S2 F2, S3 F3, ..., Sn Fn, where Si is a session and Fi is the feedback from that session.
4. Check whether the feedbacks have a common trait: if (F1 = F2 and F2 = F3 and ... and Fn-1 = Fn), the reviews are genuine; else, if there is an abrupt shift in the pattern, the feedback might be non-genuine.
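A minimal sketch of steps 1-4, assuming per-session mean ratings as the feedback values Fi and a tolerance `eps` (an assumed parameter, since a strict equality test is too brittle for real-valued means):

```python
# Steps 1-4 above: group feedback by session, compute each session's mean
# rating, and flag an abrupt shift between consecutive sessions.
from statistics import mean

def session_feedback(feedback):
    """feedback: list of (session_id, rating) pairs -> {session_id: mean rating}."""
    by_session = {}
    for sid, rating in feedback:
        by_session.setdefault(sid, []).append(rating)
    return {sid: mean(r) for sid, r in sorted(by_session.items())}

def is_genuine(feedback, eps=1.0):
    """Genuine if consecutive session means never differ by more than eps."""
    means = list(session_feedback(feedback).values())
    return all(abs(a - b) <= eps for a, b in zip(means, means[1:]))

data = [("S1", 5), ("S1", 5), ("S2", 4.5), ("S3", 1), ("S3", 2)]
print(is_genuine(data))  # False: the drop from S2 to S3 is abrupt
```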
For the NLP-based technique:
1. Read all the feedback information.
2. For each feedback, find the action words using POS tagging and the chunking process.
3. Evaluate the sentiment of the feedback and mark the feedback as Good or Bad.
4. Divide the feedback into sessions.
5. For each session, find the feedback obtained, to get the list S1 F1, S2 F2, S3 F3, ..., Sn Fn, where Si is a session and Fi is the feedback from that session.
6. Check whether the feedbacks have a common trait: if (F1 = F2 and F2 = F3 and ... and Fn-1 = Fn), the reviews are genuine; else, if there is an abrupt shift in the pattern, the feedback might be non-genuine.

The results from both algorithms are combined to conclude whether the given feedback is genuine or not.
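A minimal sketch of steps 2-3 with NLTK: POS tagging, a chunk grammar that keeps adjectives and verbs as the "action words", and a tiny sentiment lexicon (the lexicon and the chunk grammar are illustrative assumptions; the paper does not specify them):

```python
# Steps 2-3 above: POS-tag a review, chunk out adjectives/verbs as action
# words, and score them against a small sentiment lexicon.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# Illustrative lexicon; a real system would use a full sentiment dictionary.
LEXICON = {"great": 1, "love": 1, "excellent": 1, "crash": -1, "bad": -1, "hate": -1}

def review_sentiment(review):
    tagged = nltk.pos_tag(nltk.word_tokenize(review.lower()))
    # Keep only adjectives (JJ*) and verbs (VB*) as the action words.
    grammar = nltk.RegexpParser(r"ACTION: {<JJ.*|VB.*>+}")
    tree = grammar.parse(tagged)
    words = [w for subtree in tree.subtrees(lambda t: t.label() == "ACTION")
             for w, _ in subtree.leaves()]
    score = sum(LEXICON.get(w, 0) for w in words)
    return "Good" if score >= 0 else "Bad"

print(review_sentiment("I love this excellent app"))      # Good
print(review_sentiment("It keeps crashing, really bad"))  # likely Bad
```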
Results and Discussions

We have collected the historical records of the applications. By mining the historical ranking records we obtained the leading sessions. An application with 3 months of historical ranking records and k = 180 is shown in the following graph:

Fig. 3: Leading sessions

The overall rating and the overall review sentiment are then calculated and listed for every session.

Fig. 16: Overall user review for all the sessions

Then, to check how the reviews change across sessions, we used K-means clustering: we made two clusters and calculated the distance between them.

Fig. 17: Centroids of the clusters

Fig. 4: Clusters of the sessions

In the clusters above we can see that there is only one element in the second cluster, and it is far from the first cluster, which indicates a shift in the user reviews. Thus there may be fraud signatures in the user reviews of that session. We can now consider different applications, build clusters of their sessions, and tabulate the centroids and the distances between them to compare the results.
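A minimal sketch of this clustering step with scikit-learn, assuming one numeric overall review score per session (the values are illustrative):

```python
# Cluster per-session review scores into two groups and measure how far
# apart the cluster centroids are; an isolated, distant cluster suggests
# a session whose reviews shifted abruptly.
import numpy as np
from sklearn.cluster import KMeans

# Illustrative per-session overall review scores (one value per session).
scores = np.array([[4.6], [4.5], [4.7], [4.4], [1.8]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scores)
print("labels:", km.labels_)  # e.g. [0 0 0 0 1] - one outlying session
centroid_distance = np.linalg.norm(km.cluster_centers_[0] - km.cluster_centers_[1])
print("distance between centroids:", round(float(centroid_distance), 2))
```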
Conclusion

We have designed a system for detecting fraudulent behavior in mobile applications. Many shady application developers use unethical methods to increase the popularity of their applications. We have shown that this type of fraud generally happens in the active periods, i.e. the leading sessions, of an application, and we have designed an algorithm for mining those leading sessions from the application's historical ranking records. The system aims to detect fraud based on three types of evidences: ranking based, rating based and review based. Further, an optimization-based aggregation method combines all three evidences to detect the fraud. A unique perspective of this approach is that all the evidences can be modeled by statistical hypothesis tests, so the system is easy to extend with other evidences from domain knowledge to detect ranking fraud.
References

[1] Patil Rohini, Kale Pallavi, Jathade Pournima, Prof. Pankaj Agarkar, "MobSafe: Forensic Analysis for Android Application and Detection of Fraud Apps Using Cloud Stack and Data Mining", International Journal of Advance Research in Computer Engineering and Technology (IJARCET), Vol. 4, Issue 10, October 2015.
[2] Hengshu Zhu, Hui Xiong, Yong Ge, and Enhong Chen, "Discovery of Ranking Fraud for Mobile Apps", IEEE Transactions on Knowledge and Data Engineering, Vol. 27, No. 1, January 2015.
[3] Abhilash T P, L Dinesha, "Ranking Detection and Avoidance Frauds in Mobile Apps Store", International Journal of Advance Networking and Application, ISSN: 0975-0282.
[4] Shivakumar Swamy N, Prof. Sanjeev C. Lingareddy, "Fraud Detection Using Data Mining Techniques", International Journal of Innovations in Engineering and Technology (IJIET).
[5] L. Velmurugan, "Latent Relation Analysis based Discovering Fraudulent Ranking Identification on Mobile Web Apps", Indian Journal of Science and Technology, Vol. 8(34), DOI: 10.17485/ijst/2015/v8i34/74505, December 2015.
[6] Shreya Banker and Rupal Patel, "A Brief Review of Sentiment Analysis Methods", International Journal of Information Sciences and Techniques (IJIST), Vol. 6, No. 1/2, March 2016.
[7] Xuanfan Wu, "Metrics, Techniques and Tools of Anomaly Detection: A Survey". (Online). Available: https://www.cse.wustl.edu/~jain/cse567-17/ftp/mttad/index.html
[8] Raghuveer Dagade, Prof. Lomesh Ahire, "Review: A Ranking Fraud Detection System For Mobile Apps", International Journal of Innovative Research in Computer and Communication Engineering, Vol. 3, Issue 11, November 2015.
[9] Javvaji Venkataramaiah, Bommavarapu Sushen, Mano R., Dr. Gladis Pushpa Rathi, "An Enhanced Mining Leading Session Algorithm For Fraud App Detection in Mobile Application", International Journal of Scientific Research in Engineering (IJSRE), Vol. 1(4), April 2017.
[10] Anuja A. Kadam, Pushpanjali M. Chouragade, "A Review Paper on: Malicious Application Detection in Android System", International Journal of Computer Applications (0975-8887), National Conference on Recent Trends in Computer Science & Engineering (MEDHA 2015).
[11] L. Velmurugan, "Latent Relation Analysis based Discovering Fraudulent Ranking Identification on Mobile Web Apps", Indian Journal of Science and Technology, Vol. 8(34), DOI: 10.17485/ijst/2015/v8i34/74505, December 2015.
[12] Ranjitha R., Mathumita K., Meena S., and S. Hariharan, "Discovery of Ranking of Fraud for Mobile Apps", International Journal of Innovative Research And Management (IJIREM), ISSN: 2350-0557, Vol. 3, Issue 3, May 2016.
[13] S. Karthika and N. Sairam, "A Naive Bayesian Classifier for Educational Qualification", Indian Journal of Science and Technology, ISSN (Print): 0974-6846, ISSN (Online): 0974-5645, Vol. 8(16), July 2015.

Long-term Static Music Emotion Recognition: A supervised learning approach to model user emotion profile for Music Recommender Systems

Rwiddhi Chakraborty*, Electronics and Communication Engineering, Heritage Institute of Technology, Kolkata, India, rwiddhi.chakraborty.ece18@heritageit.edu

Aniket Dutta*, Electronics and Communication Engineering, Heritage Institute of Technology, Kolkata, India, aniket.dutta.ece18@heritageit.edu

Shubhayu Das*, Electronics and Communication Engineering, Heritage Institute of Technology, Kolkata, India, shubhayu.das.ece18@heritageit.edu

Chandrima Roy, Electronics and Communication Engineering, Heritage Institute of Technology, Kolkata, India, chandrima.roy@heritageit.edu

* Authors contributed equally to this work
Abstract—In this paper we describe and approach Static Music Emotion Recognition (MER) as a supervised learning problem. We propose a paradigm to capture the user's emotional tastes, using Arousal and Valence annotations from the user, for various modern applications, primarily music recommendation systems. The primary aim is to predict the emotional content of a piece of classical music according to the taste of a particular user. We show that our Static MER model gives satisfactory performance, even with a genre as emotionally complex as classical music. Moreover, we use the entire duration of the pieces (unlike other works in this area, which use shortened clips) in this problem. We obtained satisfactory results using comparatively smaller feature sets - 68 features for Valence and 70 for Arousal. Finally, we propose an architecture for a music recommender system that can integrate this approach effectively. This is the first time in this research space that a two-pronged approach - smaller feature sets and the entire duration of music pieces - has been used successfully, with potential for far-reaching commercial applications in the present day.

Index Terms—Music Emotion Recognition; Music Recommendation Systems; Music Information Retrieval; Supervised Learning; Support Vector Regression; Random Forest Regression; Artificial Neural Networks
I. INTRODUCTION

The problem of addressing Music Emotion Recognition (or MER) as a supervised learning problem started about a decade ago [1]. Inspired by the pioneering work of Russell [2], MER uses a 'dimensional' approach to model emotion, rather than a 'categorical' approach. Russell's dimensional model, which he called the 'circumplex model', works on the basis of two metrics: Valence, i.e. pleasantness, or positive versus negative affective states, and Arousal, i.e. activation, or energy and stimulation level. Figure 1 shows the circumplex model, where different regions of the 2D plot signify different kinds of emotion.

Fig. 1. Russell's Circumplex Model.

The problem of identifying emotion in music, which has multiple modern use cases, was taken up by various research groups in the past decade, each with a different approach. This led to the MER regression problem being further broken down into 'Static MER' and 'Dynamic MER'. The former assumes the emotion in a piece of music to be independent of time; in other words, each piece of music has a definite emotion it aims to express and can be denoted by a single point in the Valence-Arousal space. The latter, Dynamic MER, assumes that the emotion in music changes with time, so each piece of music follows a contour in the Valence-Arousal space with respect to time. In the past decade, the majority of researchers interested in Music Emotion Recognition have chosen the Dynamic MER approach, in the hope of capturing the variations of emotional expression in music. However, we suggest here that Static MER is more relevant and efficient when the use cases rely on being able to differentiate between two pieces of music based on an individual's emotional response. Under the Static MER hypothesis, this is a single comparison, which is particularly useful for a use case like music recommendation systems.
In order to suggest music to their users, music streaming services such as 'Pandora' use advanced recommendation systems which attempt to build an emotional profile of a user [3]. In one of its recent projects, 'Spotify' used 'Valence' as one of the features in its recommendation algorithm [4]. The emotional classification of a piece of music in such applications could be done by in-house music experts, which would be subject to the problem of subjectivity. Alternatively, it could be done with a machine learning approach that builds a listener's emotional profile, catering to each listener's emotional taste separately. In this work we break the problem of building a listener profile for such applications down to a simple prediction problem. The task is to show that, with a reasonable number of Valence and Arousal annotations of different pieces of music, a regression algorithm can satisfactorily predict a listener's response to new pieces of music.

What is novel about our present approach is that we attempt to train a regressor to predict the Valence-Arousal values of entire pieces of classical music (about 10-20 minutes long) according to the taste of the user. Moreover, the feature set we use is considerably smaller than those in other MER works. This ensures easy and efficient integration with use cases that require the prediction of a piece's Valence and Arousal values according to a user's taste.

The rest of this paper is organized as follows. In section two we present the data-set we used for this work, how we acquired our ground truth annotations, and the feature extraction paradigm used to describe each piece of music. In section three we discuss the pre-processing and the model training stage. In section four we present and analyze our results, and in section five our results are summarized and compared with other related works. Section six briefly discusses the use and relevance of this approach if implemented on a larger scale within a music recommendation system.
II. DATASET AND METHODOLOGY

The data used in this work has been created by us and consists of ground truth labels from only one listener. It is available on GitHub [https://github.com/ShubhayuDas/StaticMER dataset] and archived in Zenodo [https://doi.org/10.5281/zenodo.1283520]. We created this dataset using two publicly available, open-source repositories, [5] and [6].

A. Music Recordings and Ground Truth acquisition

The music data used in this work is the open source MusicNet [5] data-set, created by researchers at the University of Washington. It contains 330 classical music recordings by famous composers such as Bach, Schubert, Mozart and Beethoven. We extract relevant musical features on a long-term basis from these recordings to train and test our regression models; the feature extraction process is explained in detail in the next subsection. As mentioned in the introduction, the use case of our work is to predict the valence and arousal values of a new piece of music based on the user's taste. Our intention is not to find the absolute emotion of a musical piece, in which case the problem of subjectivity of the annotations would have to be accounted for. We are not concerned with removing subjectivity from our ground truth; we want to capture it, since the aim is to predict or replicate the listener's response itself (in terms of Valence and Arousal) for pieces outside the training set. Hence the entire ground truth data was acquired from a single person. Our data-set being one of classical music pieces, the labeling was done by a professional classical musician, teacher and conductor, Mr. Anubrata Ghatak [7].

B. Feature Extraction

For our audio feature extraction we have used the pyAudioAnalysis [6] library. Table 1 lists each of the features we have used. Each of these features can be classified as follows:

• The time-domain features (Features 1 to 3 in Table 1) are extracted directly from the raw signal samples. These features encompass information such as loudness, noise, energy and abruptness.
• The frequency-domain features (Features 4 to 34 in Table 1, apart from the MFCCs) are based on the magnitude of the Discrete Fourier Transform. The cepstral-domain results (used by the MFCCs, or Mel-Frequency Cepstral Coefficients) are found by applying the inverse DFT to the logarithmic spectrum. These features encompass information regarding timbre texture, tonality, harmony, multiplicity (the number of pitches heard), etc.
• Features 35 and 36 are tempo-related features; these encompass information regarding the tempo and, to some extent, the overall dominant rhythm.

Further details about each feature are available in the pyAudioAnalysis feature extraction documentation [6].

TABLE I
LIST OF FEATURES USED

#       Feature name
1       Zero Crossing Rate
2       Energy
3       Entropy of Energy
4       Spectral Centroid
5       Spectral Spread
6       Spectral Entropy
7       Spectral Flux
8       Spectral Rolloff
9-21    MFCCs (Mel-Frequency Cepstral Coefficients)
22-33   Chroma Vector
34      Chroma Deviation
35      BPM Rate
36      BPM Dominance

There are two algorithmic stages involved in the long-term audio feature extraction for features 1 to 34 (see Table 1):
• Short-term feature extraction is carried out first. It splits the input signal into short-term windows (or frames) and computes a number of features for each frame, leading to a sequence of short-term feature vectors for the whole signal. We used a short-term window size of 50 ms and a step size of 25 ms; so, if one short-term frame starts at 1.000 s and ends at 1.050 s, the next starts at 1.025 s and ends at 1.075 s, giving a 50 percent overlap. This extracts the set of 34 features (Features 1 to 34 in Table 1) from each short-term frame of the recording.
• Then a mid-term window and step are specified. For each mid-term segment, after the short-term feature extraction is carried out, the feature sequence from that segment is used to compute feature statistics (e.g. the mean and standard deviation of the ZCR). Each mid-term segment is therefore represented by the mean and standard deviation over each of its short-term feature sequences. The mid-term window we used is 2 seconds, with a mid-term step of 0.2 seconds (i.e. 90 percent overlap).
• Only after extracting these short-term and mid-term features are the averages taken, in order to produce one long-term feature vector per audio file. Hence each of features 1-34, which follow this paradigm, consists of two parameters: its mean and its standard deviation over the mid-term features.

We thus obtain a total of 68 features from this process (34 + 34). Along with the two rhythm features, we end up with a 330 X 70 feature matrix for the entire set of songs. Different combinations of these were used while training; we elaborate on this in the next section.
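A minimal sketch of this two-stage extraction with pyAudioAnalysis, assuming the library's current module and function names (these have changed across versions) and a hypothetical input file piece.wav:

```python
# Long-term feature vector for one recording: short-term frames (50 ms / 25 ms)
# -> mid-term statistics (2 s / 0.2 s) -> average over all mid-term segments.
import numpy as np
from pyAudioAnalysis import audioBasicIO, MidTermFeatures

fs, signal = audioBasicIO.read_audio_file("piece.wav")  # hypothetical input file
signal = audioBasicIO.stereo_to_mono(signal)

mid_feats, short_feats, names = MidTermFeatures.mid_feature_extraction(
    signal, fs,
    2.0 * fs, 0.2 * fs,       # 2 s mid-term window, 0.2 s step (90% overlap)
    0.050 * fs, 0.025 * fs)   # 50 ms short-term frames, 25 ms step (50% overlap)

# One long-term vector per file: the mean of each mid-term statistic over time.
# In the setup described above this yields 34 means + 34 stds = 68 values;
# newer library versions may also append delta features.
long_term = np.mean(mid_feats, axis=1)
print(long_term.shape)
```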
III. PRE-PROCESSING AND MODEL TRAINING

We have chosen a Support Vector Regressor (SVR) as our primary regression model. A Random Forest Regressor (RFR) was chosen to compare against the results obtained from the SVR, and a two-layer Artificial Neural Network (ANN) with 90 nodes in each hidden layer was chosen to see whether it works for the problem this project deals with. Hence there are a total of three regression models. The single pre-processing step we performed was max-min normalization [8] of the features. On our primary data-set, the 330 X 70 matrix, we employed 10-fold Cross Validation [9]: the algorithm chooses each fold exactly once, and the 10 folds are created randomly. The absence of repeating items in each iteration of the algorithm reduces the risk of overfitting, i.e. of training on the same data repeatedly and obtaining misleading results. To obtain the best results through hyper-parameter tuning, Grid Search [9] was employed while conducting Cross Validation; this method systematically defines a grid of all possible combinations of the parameters involved and returns the set of parameters for which the model gives the best outcome. We repeated this methodology for all three models. Moreover, our data was split into three categories based on the audio features used: (1) Temporal-Spectral-Rhythm features (330 X 70), (2) Temporal-Spectral features (330 X 68), and (3) Fisher-Score feature selection applied to the Temporal-Spectral-Rhythm data-set (330 X 28):

• (1) Using Temporal-Spectral and Rhythm/Tempo features: the rhythmic and tempo features (Features 35 and 36 in Table 1) were chosen over and above the spectral and time-domain features (Features 1-34 in Table 1), so the data-set taken as input was a 330 X 70 feature matrix.
• (2) Using Spectral-Temporal features: only the spectral and time-domain features (Features 1 to 34 in Table 1), with their means and standard deviations, were chosen, so the data-set taken as input was a 330 X 68 feature matrix.
• (3) With Feature Selection: the Generalized Fisher Score was used for feature selection [10]; it scores each feature independently according to the Fisher criterion. Feature selection was applied to the 330 X 70 feature matrix to select the most informative features (28 in total).

In this paper we aim to give a model- and feature-set-wise analysis of performance. The subsequent section presents the results categorized by the choice of feature set.
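A minimal sketch of this training procedure with scikit-learn, pairing the max-min scaler with an SVR inside a 10-fold grid search (the parameter grid is an illustrative assumption; the paper does not list its grid):

```python
# Max-min normalization, 10-fold CV and grid search for the SVR model,
# mirroring the procedure described above. X stands in for the 330 x 70
# feature matrix and y for the listener's Arousal (or Valence) annotations.
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.random((330, 70))   # stand-in for the real feature matrix
y = rng.random(330)         # stand-in for the real annotations

pipe = Pipeline([("scale", MinMaxScaler()),   # max-min normalization [8]
                 ("svr", SVR())])
param_grid = {"svr__C": [0.1, 1, 10],         # illustrative grid, not the paper's
              "svr__gamma": ["scale", 0.01, 0.1]}

search = GridSearchCV(pipe, param_grid,
                      cv=KFold(n_splits=10, shuffle=True, random_state=0),
                      scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```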
TABLE II
PERFORMANCE MEASURED IN TERMS OF MEAN SQUARED ERROR, MEAN ABSOLUTE ERROR AND R2 SCORE, FOR VALENCE AND AROUSAL, WITH RESPECT TO THE DIFFERENT REGRESSION MODELS AND FEATURE VECTORS USED

Model-Feature   Arousal                   Valence
                MSE     MAE     R2        MSE     MAE     R2
SVR1-70         0.1301  0.2953  0.2497    0.1812  0.3723  0.0452
SVR2-68         0.1375  0.3075  0.2074    0.1796  0.3735  0.0535
SVR3-28         0.1702  0.3230  0.0185    0.1898  0.3305  -0.0004
RFR1-70         0.1368  0.3057  0.2112    0.1715  0.3745  0.0958
RFR2-68         0.1340  0.3000  0.2275    0.1685  0.3678  0.1121
RFR3-28         0.1518  0.3134  0.1245    0.1591  0.3617  0.1616
ANN1-70         0.1754  0.3678  -0.0111   0.1999  0.4185  -0.0533
ANN2-68         0.1712  0.3556  0.0128    0.2111  0.4357  -0.1124
ANN3-28         0.1981  0.3864  -0.1423   0.1967  0.4083  -0.0367

IV. RESULTS AND ANALYSIS

The performance of the regression models is evaluated in terms of Mean Squared Error (MSE), Mean Absolute Error (MAE) and the R2 score. From Table 2 and Figures 2 and 3 we can draw the following inferences:

• The Support Vector Regressor tests best for Arousal prediction with the primary data-set (with 70 features).
• The Support Vector Regressor also tests best for Valence prediction with the feature-selected data-set (with 28 features).
• The Neural Network architecture consistently performs the worst. One explanation would be the comparatively small data-set we have used, since neural networks generally perform better on large data-sets. This may not be true for our specific problem, but it is a plausible explanation for the poor performance.
• Using Fisher scoring to find the top 28 features does not consistently improve performance for each model type. There are two possible follow-ups to this: either Fisher scoring is not the appropriate measure of feature information in this case, or, more simply, decreasing the number of features does not in fact reduce redundancy, i.e. our current feature set is sufficient.

Fig. 2. Mean Squared Error and Mean Absolute Error values for the Arousal models run on the Arousal test set.

Fig. 3. Mean Squared Error and Mean Absolute Error values for the Valence models on the Valence test set.
V. SUMMARY AND DISCUSSION

In this work we mapped a set of recordings' musical and acoustic features to a particular listener's emotional response, in terms of Valence and Arousal. The best-case errors in terms of Mean Squared Error (MSE) between the actual and the predicted values are 0.1591 for Valence and 0.1301 for Arousal. In terms of Mean Absolute Error (MAE), our best-case error values are 0.2953 and 0.3617 for Arousal and Valence respectively. The predictions made by the algorithm reflect the emotional taste of the listener, i.e. of the one who provides the ground truth.

Even though the definition of our problem differs from typical works on Static MER, we wanted to compare our prediction accuracy. Most works on MER, however, use different performance metrics which are not comparable and cater to different problem statements. Moreover, many of them use different ranges for their outputs (e.g. reference [11] and many other works use a Valence and Arousal range from -0.5 to 0.5). The work in [12] achieved an accuracy of 40.6% and 67.4% for Valence and Arousal in terms of R-squared statistics. However, R-squared says nothing about the prediction error: even with the MSE exactly the same and no change in the coefficients, R-squared can be tailored to lie anywhere between 0 and 100% just by changing the range of the independent variable(s). Moreover, R-squared does not measure goodness of fit and can be arbitrarily low even when the model is completely correct. Hence the metric we have used is Mean Squared Error, which is better suited to the emotion prediction problem we are trying to address.

One work that is close to ours is by Yang et al. [13], who attempted to predict the general emotion in a piece of music. However, their focus was to find the absolute emotion of a piece, where the ground truth annotations are assumed to be the general consensus; using the circumplex model and a regression approach, they try to tackle the problem of subjectivity. We, on the other hand, embrace it, by making predictions with respect to an individual's subjective emotional taste. In their prediction task they achieved a best-case performance of 0.1798 and 0.1731 for Valence and Arousal in terms of Mean Absolute Error (MAE). Despite the difference in the MAE values, we believe our prediction performances are similar, given that the recordings we used were of varied length, spanning about 10-20 minutes - which is how our work primarily differs from other works. Most works, including the ones mentioned above, use fixed-duration musical pieces in their data-sets; in [14] the pieces are less than 1 minute long. Moreover, the music data-sets used by most other works combine genres like rock and pop, which are far less complex than classical music in terms of emotional content and variation.

The novelty of this work lies in the fact that our predictions are user-specific. We use comparatively smaller feature descriptors than other works - viz. [11], which uses a 128-dimensional feature set, and [12], where a 98-dimensional feature set was used - that are nevertheless informative enough to perform comparably on the prediction task.

VI. FUTURE SCOPE: MUSIC RECOMMENDATION SYSTEMS

Having shown that a regressor can satisfactorily map a piece of music to the emotional response of a listener, given their honest ground truth annotations, in this section we discuss how such a model could help build an emotion profile descriptor which could easily be integrated within existing Music Recommender Systems - one that encapsulates the user's emotional tastes in music, as well as predicts the region in which a new piece of music may lie. One's profile could be described as a hash table of Song Id and a 2-D vector of Valence and Arousal annotations from the user (see Figure 4).
The feature set being what describes the music, a varied distribution of musical pieces, when mapped to the Arousal-Valence space using a regression model as presented in this work, would adequately represent a user's musical tastes. The hash table could therefore be integrated with existing music recommendation algorithms where arousal and valence are a part of user ratings.

Fig. 4. An example of the hash tables of two different users.

Such a model would facilitate suggestions based on emotion. The similarity of users could be quantified by the cosine similarity of the Arousal and Valence ratings of the songs heard and rated by both users. Moreover, a personalized prediction model for each individual listener, as presented here, could directly be used to recommend new pieces of music. One could analyze the common patterns in the Arousal and Valence values of consecutive songs and make new suggestions previously unheard by the listener. Suggestions made with such an approach may seem completely unrelated to a random observer, but are likely to be relished by the particular listener.
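One reading of this similarity computation, sketched with a plain dictionary as the hash table (the song ids, ratings and the per-song averaging are illustrative assumptions):

```python
# Each user's profile is a hash table: song id -> (valence, arousal)
# annotation. Similarity between two users is taken here as the mean
# cosine similarity of their annotations over the songs both have rated.
import numpy as np

user_a = {"song1": (0.8, 0.6), "song2": (-0.2, 0.9), "song3": (0.1, -0.4)}
user_b = {"song1": (0.7, 0.5), "song2": (-0.4, 0.8)}

def cosine(u, v):
    u, v = np.asarray(u, float), np.asarray(v, float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def user_similarity(p, q):
    common = p.keys() & q.keys()   # songs heard and rated by both users
    if not common:
        return 0.0
    return float(np.mean([cosine(p[s], q[s]) for s in common]))

print(round(user_similarity(user_a, user_b), 3))
```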
Moreover, intelligent approaches could be used for deciding which sets of songs give a unique direction to the user's music emotion tastes. One approach would be to cluster the songs in the Arousal-Valence space and randomly suggest songs from the less probable clusters to enforce serendipity. After getting the Valence and Arousal ratings, one could update the hash table if the values provided are within a close threshold of the average of the user's major cluster (the major cluster being the one with the highest probability of being assigned).

This paradigm of modeling one's music emotion preferences opens up quite a number of options for music recommender systems to include emotion in their analysis; the effectiveness of these would depend on the performance of the emotion prediction model for each listener. We thus appeal to the music emotion and music recommendation system research community that a benchmark database, with a framework for efficient ground truth collection from multiple users, would go a long way towards the growth of this unique paradigm of emotion-based music recommendation.
REFERENCES

[1] Y. H. Yang, Y. C. Lin, Y. F. Su, and H. H. Chen, (2008). A regression approach to music emotion recognition. IEEE Trans. Audio, Speech Lang. Process. 16, 2, 448-457.
[2] J. A. Russell, (1980). A circumplex model of affect. Journal of Personality and Social Psychology. 39, 6, 1161-1178.
[3] Howe (2009). "Pandora's Music Recommender", a case study. Available at: https://pdfs.semanticscholar.org/f635/6c70452b3f56dc1ae07b4649a80239afb1b6.pdf (online)
[4] K. Tiffany, (2018). TL;DR: You can now play with Spotify's recommendation algorithm in your browser. Available at: https://www.theverge.com/tldr/2018/2/5/16974194/spotify-recommendation-algorithm-playlist-hack-nelson (online)
[5] J. Thickstun, Z. Harchaoui, S. M. Kakade, (2017). Learning Features of Music from Scratch. International Conference on Learning Representations (ICLR), University of Washington. Retrieved from https://homes.cs.washington.edu/~thickstn/musicnet.html
[6] T. Giannakopoulos, (2015). pyAudioAnalysis: An Open-Source Python Library for Audio Signal Analysis. PLoS ONE. 10, 12. Retrieved from https://github.com/tyiannak/pyAudioAnalysis
[7] A. Ghatak, n.d. Classical Musician, Conductor and Teacher at Kolkata Music Academy.
[8] MinMaxScaler (n.d.). Retrieved from http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html (online)
[9] GridSearchCV (n.d.). Retrieved from http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html (online)
[10] Q. Gu, Z. Li, J. Han, (2012). "Generalized Fisher Score for Feature Selection". CoRR, abs/1202.3725.
[11] F. Weninger, F. Eyben, B. Schuller, (2013). The TUM Approach to the MediaEval Music Emotion Task Using Generic Affective Audio Features. In Proceedings of the MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain.
[12] R. Panda, B. Rocha, R. Paiva, (2013). "Dimensional music emotion recognition: Combining standard and melodic audio features". In Proceedings of the International Symposium on Computer Music Modelling & Retrieval.
[13] Y. H. Yang, Y. C. Lin, Y. F. Su, and H. H. Chen, (2007). "Music Emotion Classification: A Regression Approach". 2007 IEEE International Conference on Multimedia and Expo.
[14] A. Aljanaki, F. Wiering, and R. C. Veltkamp, (2015). "Audio segmentation based approach for improved emotion recognition". In Proceedings of the 16th ISMIR Conference, Malaga, Spain.

Cloud-based Lecture Capturing System
Abstract—This paper examines an innovative approach to enhance current e-learning procedures, particularly in universities. The "Lecture Capturing System" (LCS) is a cloud-based web application which uses enhanced techniques to provide an interactive e-learning experience to its users. It uses a facial-recognition-based authentication process to allow remote users to log in to the system. A Pan-Tilt-Zoom (PTZ) IP camera captures and tracks the lecturer during the lecture session, and this is streamed live to remotely logged-in students. The lecturer can also share the computer screen if required. The camera intelligently identifies specific gestures performed by the lecturer and rotates accordingly, with the aid of gesture-analyzing algorithms. Attendance of remote online students is marked automatically during a live-streamed lecture by multiple facial recognition processes executing on the server. Offline recording of lectures is also supported, after which the video is split into a series of chapters/thumbnails and the audio is converted to text, each chapter representing a presentation slide and the relevant text. Bandwidth and quota are managed intelligently to ensure the best possible transmission rate with minimum data consumption, in order to avoid filling the link to capacity, which would result in network congestion and poor network performance. This system is revolutionary and is capable of taking e-learning to the next level, as it provides a complete classroom experience and much more to remote users. It also has the ability to support multiple enterprise customers.

Keywords—PTZ camera control, gesture detection, biometric authentication and attendance, video thumbnails creation, bandwidth and quota management
I. INTRODUCTION

E-learning has become one of the newest trends, not only in the educational sector but also in businesses. As a result, students tend to prefer e-learning to being physically present at a lecture, due to various issues such as manually taking down notes, the inability to instantly understand the content of the lecture, and long travel times to get to the lecture. There are also times when a student misses an important lecture for various reasons and is never able to catch up.

Key features of the Lecture Capturing System:

· Automatically focusing on the lecturer in real time.
· Sharing the lecturer's computer screen with the students, so that they can see what the lecturer is doing on his/her computer, such as coding or annotating a PowerPoint slide.
· The lecturer communicating with students by voice/video at a student's request, so that everyone can see the conversation and clear any doubts regarding the student's question.
· Generating readable text content after the lecture by intelligently converting the lecturer's voice into text, a helpful feature as it lets students revise the lecture more efficiently.
· Video thumbnails for each chapter, making it easier for students to watch the relevant part they need without going through the whole video.
· Intelligent bandwidth management to ensure the least possible data bandwidth is used to transfer the videos to the students.
· Capturing the audience and focusing on a guest speaker in real time, which helps remotely logged-in students experience the live classroom environment.
· Biometric authentication of remotely logged-in users via facial recognition, acting as a secondary layer of security and marking the attendance of remote students.

LCS stands out from existing systems by being a comprehensive product that includes all of the above-mentioned features in one.

II. LITERATURE REVIEW

Many research efforts have addressed the needs of smart e-learning systems. Below are some of the software functionalities and technologies developed prior to our research; undertaking a literature survey helped us find and build upon the following.

Regardless of the enormous growth of e-learning in education and its perceived benefits, the efficiency of such e-learning systems will not be fully realized if students are not inclined to accept and use them. "Use of E-Learning", a research study, was conducted to find university students' purpose in using e-learning [1]. Its findings indicate that the content of e-learning and self-efficacy have a positive impact on, and are substantially associated with, perceived usefulness and student satisfaction.

In [2], the tracking of an object and the control of the camera are handled by one computer in real time. The main contribution of that paper is a method for target representation, localization and detection which takes into account both foreground and background properties and is more discriminative than the common color-histogram-based back-projection.

Online body tracking by a PTZ camera has been done before, to automatically track a single person and focus on that person [3], [4]. An online human body tracking method by an IP PTZ camera based on fuzzy-feature scoring was developed: at every frame, candidate targets are detected by extracting moving targets using optical flow, sampling, and appearance, and the target is determined among the samples using a fuzzy classifier. Results show that the system has a good target detection precision (> 88%), and the target is almost always localized within 1/4th of the image diagonal from the image center [3].

Remote control of PTZ camera systems for lecture rooms has also been addressed [5]. This consists of a simple and inexpensive software solution for the remote management of PTZ camera systems, providing users with the ability to remotely control the PTZ camera system from one place with simultaneous image-capturing ability. However, this software solution does not support real-time tracking of a person, just several predefined presets, so that feature can be improved to real-time operation in LCS.

OpenTrack - Automated Camera Control for Lecture Recordings [6] records lecture sessions automatically, without the need for a human camera operator. A Tabletop Lecture Recording System [7] presents a lecture recording system that employs gestures and digital cameras to facilitate remote distance teaching. Virtual Cameraman [8]
uses two PTZ cameras with different utilities: one is named the full-shot PTZ camera and the other the movement PTZ camera.

Real-time person tracking has been implemented before, but not quite as we have done it: real-time broadcasting of the footage without any delay. The complete package of lecture capturing along with the audience if necessary, screen sharing, face-recognition-based remote login and attendance marking for online participants, and viewing the lecture in real time with added features, such as intelligently generated chapters on the video according to the lecture slides played alongside, makes the Lecture Capturing System a complete e-learning package.

Thorough research related to e-learning systems has led to the identification of some of the most influential factors used in the field of information systems research: more specifically, the characteristics as well as the limitations, weaknesses and strengths of web-based learning systems. Student variables, such as technical issues and adapting to new ways, are important variables that influence student learning, especially in a collaborative e-learning environment. In particular, this research helps to better understand the characteristics of students and to comprehend what they expect from learning management systems. This can help developers achieve the most effective deployment of such systems, and it also helps them improve their strategic decision-making about technology in the future: they can decide on the approach that best fits their students before implementing any new technology.
Features of a set of commercially available e-learning platforms were compared with the Lecture Capturing System. Panopto is an easy-to-use video platform for training, presenting, and communicating that enables users to record videos and rich media presentations and push them out to subscribers in many different formats; it is mainly focused on recording and later pushing the recorded stream to users. BigBlueButton is open-source web collaboration software used by educational organizations for e-learning and training. It enables users to conduct web conferencing and share documents, audio, and video files for online learning, and its "whiteboard" feature allows presenters to mark valuable topics in the presentation. Echo360 combines video management with lecture capture and active learning to increase student success; it keeps notes linked to class presentations and videos so that students can jump straight from their own words to those of the instructor and replay the entire learning experience, and videos are uploaded and processed in real time so the optimized version is available as soon as class is over. Kaltura offers a broad set of video management and creation tools, tightly integrated with every LMS, covering everything from flipped classrooms to live sports broadcasts. Even though there are many e-learning applications in the market, there is no single system that handles all the necessary requirements in the way LCS has managed to implement.

III. SYSTEM ARCHITECTURE
The system is a cloud-based web application capable of supporting multiple enterprise customers. Remote students can access the system after logging in via biometric authentication (facial recognition). A PTZ IP camera provides a continuous video stream of the lecturer and audience to the central server via the Kurento media server [18] during a lecture session. The server broadcasts this stream live to the remote students, and the lecturer's screen is also shared if required. The attendance of remote students is marked by capturing their faces through their webcams and running a neural network algorithm on the server for identification. Lecturers are also able to make an offline lecture recording and then upload it to the server, where it will be split into chapters and its audio converted to text. Figure 1 shows the system with the main components and their interactions.

Figure 1: System Architecture

IV. METHODOLOGY
This section describes how the system was designed and implemented, explaining the process of each functionality, their flow in the system, and how they interact with each other. The system was implemented using cutting-edge technologies such as Node.js, Python, ReactJS, and MongoDB for storage.

A. Face Recognition based authentication
A student or a lecturer can log in to the system using the webcam. Initially, the administrator of the Lecture Capturing System should register the user by uploading quality images and the relevant details of the user. After registering a user, the server will train the face recognition classifier with the newly uploaded images of that user along with the existing images of other users. Thereafter, the user will be authenticated through the webcam-based face recognition process only if the confidence threshold of the face recognition classifier is greater than 90%; if it is less than 90%, the user will not be authenticated. In terms of security, session handling takes place after the face recognition based login.

In terms of the general architecture of the face recognition based login function, two steps are considered:
1. Detection stage: the LCS searches for the face region (displayed by a rectangle) in the whole video stream.
2. Recognition stage: the face image obtained above is contrasted with the face images trained in the database, and the registered user is predicted.
If face recognition succeeds, the recognition result is displayed in white text inside a green rectangle on the webcam feed along with the confidence percentage. If it fails, the system pops a warning. Face recognition in the system follows three main steps.
1. Prepare training data
The OpenCV computer vision library, Python, and NumPy are used as dependencies to implement the face recognition function in the system [12]. OpenCV provides two pre-trained, ready-to-use face detection classifiers, the Haar classifier and the LBP classifier [9]. The Haar Cascade classifier and the Local Binary Patterns (LBP) classifier are used to detect and recognize faces in this system. LBP is a type of visual descriptor used for classification in computer vision [10]. The LBP classifier is used due to its main advantages, such as a shorter training time, a high accuracy rate in difficult lighting conditions (useful when detecting faces through the webcam), and being computationally simple and fast [19]. A formal description of the LBP algorithm is given in Figure 2.

Figure 2: LBP Algorithm Equation

The training dataset consists of 30 images for each user, and each user is assigned a label (e.g. s1, s2) upon registering to the system. This step reads all the images of a person and applies face detection to each one using the LBP classifier. Each detected face is then added to a face vector together with the corresponding person's label. Finally, the data preparation step produces the face and label vectors shown in Figure 3 [19].

Figure 3: Face and label vectors

2. Train face recognizer
The face and label vectors returned from the data preparation step (the 'getImagesAndLabels' function in Figure 3) are converted to a NumPy array and passed to the OpenCV face recognizer for training [19]. The statistical model returned from the face recognizer is saved in a YAML file.

3. Prediction
Once the user navigates to the login page, the server automatically detects the face using the Haar classifier, predicts the identity by calling the trained OpenCV face recognizer, returns the predicted name of the user associated with the label, and live-streams the response from the server (recognized face plot, name of the user, confidence threshold) to the login page. Thereafter, the user is logged in to the system after clicking the face login button.
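As a concrete illustration of these three steps, the sketch below uses OpenCV's contrib face module, with a Haar cascade for detection and an LBPH recognizer for recognition. The dataset layout, file names, and helper name are assumptions made for illustration; the paper's own 'getImagesAndLabels' implementation may differ.

```python
# Minimal sketch of the prepare/train/predict pipeline described above.
# Assumes opencv-contrib-python is installed and training images are laid
# out as dataset/<user>/<image>.jpg; these paths are illustrative only.
import os
import cv2
import numpy as np

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
recognizer = cv2.face.LBPHFaceRecognizer_create()

def get_images_and_labels(root="dataset"):
    """Step 1: read every user's images, detect the face, build vectors."""
    faces, labels = [], []
    for label_id, user in enumerate(sorted(os.listdir(root))):
        for name in os.listdir(os.path.join(root, user)):
            gray = cv2.imread(os.path.join(root, user, name),
                              cv2.IMREAD_GRAYSCALE)
            for (x, y, w, h) in detector.detectMultiScale(gray, 1.2, 5):
                faces.append(gray[y:y + h, x:x + w])
                labels.append(label_id)
    return faces, labels

# Step 2: train the recognizer and persist the model as a YAML file.
faces, labels = get_images_and_labels()
recognizer.train(faces, np.array(labels))
recognizer.write("trainer.yml")

# Step 3: predict a login frame. A lower "confidence" means a closer
# match, so the paper's accuracy figure corresponds to 100 - confidence.
label, confidence = recognizer.predict(faces[0])
if 100 - confidence > 90:          # the 90% threshold used by the system
    print(f"authenticated as label {label}")
else:
    print("warning: face not recognized")
```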
B. Audio and video conferencing
An IP camera is used to track the lecturer's movement and gestures in front of the camera and to produce the necessary PTZ signals to pan, tilt, and zoom accordingly, thus ensuring that the lecturer's actions are always recorded without missing any detail. The video recording is immediately compressed on the fly to reduce its file size, then streamed live and also saved in the database for backup and later viewing. Students therefore have the choice of attending the lecture via the live stream or watching it later, which is very beneficial since they can attend lectures without being physically present. During the live streaming session, the system decides which video resolution (e.g. 480p, 720p, 1080p) to use for playback at the student's end depending on the speed of his/her internet connection, as sketched below.

Live streaming is achieved via Kurento, a WebRTC media server with a set of client APIs. During the live streaming session, the lecturer also has the ability to share his/her entire computer screen with the participating students if required, making certain that not even the most minute detail is missed. If the available bandwidth is too low to support this feature, the lecturer has the option to disable the IP cameras in order to save bandwidth. This mode of lecturing provides better participation and interaction between lecturers and students. For example, if a student wants to ask a question, the lecturer can hand control to that student, and the application supports audio-only, video-only, or combined audio-video sources from that student; the lecturer can take back control of the system's audio and video sources whenever required.
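As an illustration of the playback-quality decision mentioned above, a simple mapping from measured connection speed to a supported resolution might look as follows. The bitrate cut-offs are assumptions; the paper does not state its exact thresholds.

```python
# Illustrative sketch only: the cut-off bitrates below are assumed,
# not taken from the paper, which does not state its exact thresholds.
def choose_resolution(downlink_mbps: float) -> str:
    """Pick a playback resolution from the student's measured bandwidth."""
    if downlink_mbps >= 5.0:      # comfortable headroom for full HD
        return "1080p"
    if downlink_mbps >= 2.5:      # enough for HD without stalling
        return "720p"
    return "480p"                 # fallback for slow connections

print(choose_resolution(3.2))     # -> "720p"
```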
C. Lecture Capture and Movement of IP Camera
An IP camera is used to track the lecturer's movement and gestures in front of the camera and to produce the necessary PTZ signals to pan, tilt, and zoom along with the lecturer's movement.

1. Lecture Detection
OpenCV object detection using Haar feature-based cascade classifiers is an effective object detection method. It is a machine learning based approach in which a cascade function is trained from a large number of positive and negative images and then used to detect objects in other images; here we work with face detection. Initially, the algorithm needs many positive and negative images to train the classifier, from which features must be extracted. For this, the Haar features shown in Figure 4 are used. Each feature is a single value obtained by subtracting the sum of pixels under the white rectangle from the sum of pixels under the black rectangle.

Figure 4: Haar features

Now, all possible sizes and locations of each kernel are used to calculate a large number of features, many of which are irrelevant; consider Figure 5. The top row shows two good features. The first selected feature focuses on the property that the region of the eyes is often darker than the region of the nose and cheeks. The second relies on the property that the eyes are darker than the bridge of the nose. But the same windows applied to the cheeks or anywhere else are irrelevant.

Figure 5: Features Calculated

Next, each feature is applied to all the training images. For each feature, the best threshold for classifying faces as positive or negative is found, and the features with the minimum error rate are selected, i.e. the features that most accurately separate face and non-face images.

The lecture tracking script uses OpenCV's Haar Cascade classifier for the person detection task. First, it initializes a face cascade using the frontal-face Haar cascade [19], [20]. It then detects and tracks the largest face it can find; if it is not yet tracking a face, or the tracked face is lost, it again uses the Haar cascade detector to find the face and then a correlation tracker from the dlib library to follow it. Detection requires scanning the whole frame with a sliding window, with the algorithm trying to find the features of a person at each window position. This is too expensive to perform on every frame if the person tracker is to run on restricted hardware such as a budget laptop. For this reason, the person detector is combined with a correlation tracker: the tracker takes a region of interest and tracks the pixels inside that region, and in subsequent frames it finds where those pixels have most likely moved. This is much faster and more robust than trying to find the person in each and every frame again, as shown in the sketch below.
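The following is a condensed sketch of this detect-then-track loop, combining OpenCV's frontal-face Haar cascade with dlib's correlation tracker. The camera index and the tracking-quality threshold are illustrative assumptions, not values taken from the paper.

```python
# Sketch of the detect-then-track loop: run the (expensive) Haar detector
# only when no face is being tracked, otherwise let the (cheap) dlib
# correlation tracker follow the region of interest between frames.
import cv2
import dlib

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
cap = cv2.VideoCapture(0)              # camera index is an assumption
tracker = dlib.correlation_tracker()
tracking = False

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # dlib expects RGB
    if not tracking:
        # Expensive path: scan the whole frame with the Haar detector.
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, 1.2, 5)
        if len(faces):
            # Start tracking the largest face the detector can find.
            x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
            tracker.start_track(
                rgb, dlib.rectangle(int(x), int(y), int(x + w), int(y + h)))
            tracking = True
    elif tracker.update(rgb) < 7.0:    # low confidence: face lost, so
        tracking = False               # fall back to re-detection
    else:
        box = tracker.get_position()   # this box drives the PTZ commands
```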
2. PTZ Camera Movement
Open Network Video Interface Forum (ONVIF) is an open industry standard that provides interoperability among IP security devices such as security cameras, video recorders, software, and access control systems [21]. Since ONVIF protocols are used to move the camera, devices from different vendors remain compatible, so LCS is supported by most IP-based security device manufacturers, with the added benefit of not limiting the system to a specific brand of IP camera. Because of this, the protocol is used together with our tracking algorithm to move the camera accordingly. The custom functions can move the camera in any direction and zoom in and out to focus on the lecturer as required. The detection and tracking algorithm only calls the move function when a significant movement is detected, ensuring that focus is not broken by a slight movement of the lecturer.
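A minimal sketch of issuing such a move through ONVIF with the python-onvif package follows. The camera address, credentials, and velocity values are placeholders, and the exact client API differs between ONVIF libraries.

```python
# Sketch of an ONVIF ContinuousMove; host, port, and credentials are
# placeholders, and library details vary (python-onvif / onvif-zeep).
from onvif import ONVIFCamera

cam = ONVIFCamera("192.168.1.64", 80, "admin", "password")
media = cam.create_media_service()
ptz = cam.create_ptz_service()
profile = media.GetProfiles()[0]

def move(pan: float, tilt: float, zoom: float = 0.0):
    """Issue a ContinuousMove; velocities are normalized to [-1.0, 1.0]."""
    req = ptz.create_type("ContinuousMove")
    req.ProfileToken = profile.token
    req.Velocity = {"PanTilt": {"x": pan, "y": tilt}, "Zoom": {"x": zoom}}
    ptz.ContinuousMove(req)

def stop():
    ptz.Stop({"ProfileToken": profile.token})

# e.g. pan right slowly while the tracked face drifts off-centre:
move(0.3, 0.0)
```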
D. Easy Screen Share
The ability to share either the complete screen or a selected application window right from the web browser, and to stream it alongside the main live stream, makes lecturing easier and more efficient. This is helpful because, most of the time, when a lecturer plugs a laptop into the main projector in a normal classroom, all of the desktop content and open browser tabs are visible to the students; privacy is a concern, and it is a hassle for the lecturer to switch screen sharing on and off all the time. With the Easy Screen Share feature, the lecturer can stream webcam footage alongside the shared screen if required (for example, when sitting in front of the laptop and blocking the main camera view). This extension simply initializes socket.io and configures it so that a single audio/video/screen stream can be shared and relayed across users without bandwidth or CPU usage issues. It uses RTCMultiConnection, a WebRTC library for WebRTC streaming [22].

E. Gesture-Based Camera Control
When a student who is physically present in the classroom has a doubt, the lecturer directs the camera towards the audience by performing a gesture at the camera: showing a hand with all five fingers unfolded. The camera then analyzes and recognizes the gesture using OpenCV and Python and turns towards the audience so that remotely logged-in users see what is happening in the classroom. Once the student has finished asking the question, the lecturer turns the camera back to its normal position, either by pressing a button on the interface or by performing another gesture at his/her webcam. One possible way to recognize the open-hand gesture is sketched below.
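The paper does not detail its recognition algorithm, so the following is only one plausible OpenCV-based approach under that assumption: segment the hand and count the convexity defects (the gaps between fingers) of its contour. The skin-colour range and depth threshold are illustrative.

```python
# Rough sketch of one way to recognize the "open hand" gesture with
# OpenCV; thresholds and the skin-colour range are assumptions.
import cv2

def looks_like_open_hand(frame_bgr) -> bool:
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, (0, 30, 60), (20, 150, 255))  # rough skin range
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return False
    hand = max(contours, key=cv2.contourArea)
    hull = cv2.convexHull(hand, returnPoints=False)
    defects = cv2.convexityDefects(hand, hull)
    if defects is None:
        return False
    # Deep defects roughly correspond to the gaps between fingers;
    # four deep gaps suggest five unfolded fingers.
    deep = sum(1 for i in range(defects.shape[0])
               if defects[i, 0, 3] / 256.0 > 20)  # defect depth in pixels
    return deep >= 4
```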
F. Open Broadcaster Software (OBS) Studio plugin
A plugin was implemented for the OBS Studio [23] software which allows a lecturer to record a lecture offline and then upload it directly to the server. After recording the lecturer's desktop screen while s/he conducts a lecture, the plugin uploads this video to the remote server at the click of a button, based on predefined settings. These settings can be changed by the lecturer to suit their needs (e.g. upload the video now or at a later scheduled time). Once the video is uploaded to the remote server, the next level of processing is done on the server.

G. Video Thumbnails/Chapters Creation
The lecturer can view the list of videos which have been uploaded to the server and select a video from this list to be converted into a series of thumbnail chapters. Each frame in the video is analyzed by the PySceneDetect algorithm [24], which is implemented in Python and makes use of the OpenCV, NumPy, and FFmpeg libraries. There are two main detection methods which PySceneDetect uses:

Threshold detection - Compares the intensity/brightness of the current video frame with a set threshold and triggers a scene cut/break when this value crosses the threshold. The threshold value is computed by averaging the Red-Green-Blue (RGB) values of every pixel in the frame, yielding a single floating-point number representing the average pixel value (from 0.0 to 255.0).

Content-aware detection - Finds places where the difference between two subsequent video frames exceeds the set threshold value and then triggers a scene cut. This allows cuts to be detected between two scenes that both contain content, unlike most traditional scene detection methods, and with a properly set threshold it can even detect minor, abrupt changes [25]. The method takes the threshold and, optionally, the minimum scene length in frames as input parameters. It compares the difference in content between adjacent frames against the set threshold/score, which, if exceeded, triggers a scene cut. It checks for changes in color and intensity, namely the average HSV color space difference (the difference in hue, saturation, and luminance of the frame) between video frames [26]. If this calculated value is much higher than the preceding and following values, there has been a scene change. This process is repeated for the entire length of the video clip until the whole clip is analyzed and all the video chapters are created.
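A minimal sketch of driving this chapter detection from Python with the v0.5-era PySceneDetect API (the version cited in [24]) is shown below; the file name and threshold value are illustrative.

```python
# Sketch of chapter creation with PySceneDetect's content-aware detector.
from scenedetect import VideoManager, SceneManager
from scenedetect.detectors import ContentDetector

video_manager = VideoManager(["lecture.mp4"])   # input file is illustrative
scene_manager = SceneManager()
scene_manager.add_detector(ContentDetector(threshold=30.0))

video_manager.start()
scene_manager.detect_scenes(frame_source=video_manager)

# Each (start, end) pair becomes one chapter of the lecture video.
for start, end in scene_manager.get_scene_list():
    print(f"chapter: {start.get_timecode()} - {end.get_timecode()}")
video_manager.release()
```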
Following this process is the real-time speech transcription (audio-to-text conversion) of each video chapter. First, the audio is extracted from the video chapters in mp3 format by the FFmpeg library. Next is the speech transcription procedure, which is achieved via the Watson Speech-to-Text service. This service leverages machine intelligence to transcribe the human voice accurately [27], combining information about grammar and language structure with knowledge of the composition of the audio signal. A sketch of this two-step procedure follows.
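The sketch below assumes the ibm-watson Python SDK; the API key, service URL, and file names are placeholders, and SDK details vary between versions.

```python
# Sketch of the per-chapter transcription step: extract mp3 audio with
# FFmpeg, then send it to Watson Speech to Text. Credentials, URL, and
# file names are placeholders.
import subprocess
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# 1) Extract the audio track of one chapter as mp3 (FFmpeg CLI).
subprocess.run(["ffmpeg", "-i", "chapter1.mp4", "-vn",
                "-acodec", "libmp3lame", "chapter1.mp3"], check=True)

# 2) Transcribe it with the Watson Speech to Text service.
stt = SpeechToTextV1(authenticator=IAMAuthenticator("API_KEY"))
stt.set_service_url(
    "https://api.us-south.speech-to-text.watson.cloud.ibm.com")
with open("chapter1.mp3", "rb") as audio:
    result = stt.recognize(audio=audio,
                           content_type="audio/mp3").get_result()

text = " ".join(r["alternatives"][0]["transcript"]
                for r in result["results"])
print(text)
```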
Therefore, the end result is a set of videos along with their respective audio, presentation slides, and text. These videos are stored in the database so that students can access them at any time after the lecture session to further understand and clarify their knowledge.

H. Facial Recognition based Attendance Marking
Using facial recognition, attendance is marked automatically both for the students who are present in the lecture room and for the students who are logged in remotely through the Lecture Capturing System during the live streaming lecture session. The administrator and the lecturer are able to view, modify, and filter the attendance of students, and a student is able to view his/her own attendance with the aid of the available filtering options. A noticeable advantage of this feature is that it adds an extra layer of security to the system, ensuring that only authorized persons gain access to the university's content. Another clear advantage of this method of biometric authentication appears during an online exam, where it verifies that the person on the other end is actually who they claim to be. This feature also solves the problem of students marking attendance for other students. A minimal sketch of recording a recognized face as an attendance entry follows.
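The sketch below assumes a MongoDB collection as the storage layer (the paper names MongoDB as its store); the collection and field names are illustrative, while the 90% threshold and the "accuracy = 100 - confidence" convention follow the description above.

```python
# Sketch: turn a recognized face into an attendance record in MongoDB.
# Collection/field names are assumptions; the threshold follows the paper.
import datetime
from pymongo import MongoClient

attendance = MongoClient("mongodb://localhost:27017")["lcs"]["attendance"]

def mark_attendance(recognizer, face_img, session_id, labels):
    label, confidence = recognizer.predict(face_img)
    if 100 - confidence > 90:                 # same threshold as login
        attendance.update_one(
            {"session": session_id, "student": labels[label]},
            {"$setOnInsert": {"time": datetime.datetime.utcnow()}},
            upsert=True)                      # one record per student
        return True
    return False
```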
I. Bandwidth Management
The size of the data passed between the client and the Node server is reduced using bandwidth optimization techniques such as compression and clustering. The administrator can monitor bandwidth using the bandwidth monitoring dashboard, which shows traffic usage, system information, CPU load, alerts that flag exceeded predefined threshold settings and attacks, and much more; it is accessible only to the administrator of the system.
J. Quota Management
The administrator is able to manage the internet quota allocation for users from the dashboard. The list of users, along with usage statistics such as used quota and remaining quota, can be viewed and filtered by user type (e.g. lecturer, student), month, and year. This monthly quota can be edited by the administrator for a single user (e.g. a specific lecturer's id) or for all users of a particular user type (e.g. all students).

V. RESULTS AND DISCUSSION
In terms of face recognition based login, the LBP classifier reported an accuracy level of 70.33% for a particular user in face detection while maintaining a shorter training time. In contrast, the Haar classifier reported 81.05% for a user and took a longer training time; the LBP classifier thus underperformed the Haar classifier when detecting faces. Since the lecture capturing system's face recognition based login should be fast and should maintain an accuracy level greater than 80%, the Haar classifier is the solution currently used in the lecture capturing system. However, these results were derived by allocating 30 medium-quality training images captured from a webcam for each user; the results reported by the classifiers were therefore not fully satisfactory, and the accuracy can change if each user is allocated more high-quality images for training.

Comparison between the Haar classifier and the LBP classifier:
1. The LBP classifier is faster than the Haar classifier.
2. The Haar classifier uses floats for all its calculations, while the LBP classifier uses integers.
3. The LBP classifier is less accurate than the Haar classifier.
4. Haar-like features in the Haar Cascade classifier work best for frontal face detection.
5. Haar features are good at detecting edges and lines, which is effective in face detection.

Accuracy rates mentioned in the table below are derived using the formula:
Accuracy % = 100 - Confidence Index
The confidence index returns zero for a perfect match in detection or recognition; otherwise, an 'unknown' label is put on the face.

                                     LBP classifier   Haar classifier
Average accuracy of recognition      60%              90%
Processing time for encoding and
training a user with 30 face images  1.8 min          2.1 min

Training time was also tested with face datasets in the 50-100 images-per-user range. A laptop with an 8th-generation Intel i7 and 8 GB of RAM was used to obtain the results below.

Images per user   LBP classifier   Haar classifier
50 images         2.5 min          4 min
100 images        3.9 min          6.9 min

The amount of main memory used to execute each algorithm is defined as memory used and given in MB.

               LBP algorithm   Haar-like algorithm
Memory used    123 MB          290 MB

With regards to the video chapter creation feature, each frame in the video is analyzed by an algorithm for changes in color and intensity, namely the average HSV color space difference (the difference in hue, saturation, and luminance of the frame). If this calculated value is much higher than the preceding and following values, there has been a scene change, and the video is split at this time frame. This process is repeated for the entire length of the video clip until the whole clip is analyzed and all the video chapters are created.
This process was run over repeated 100-iteration cycles, producing an average accuracy of 95%.

Previously, many methods have been introduced as e-learning platforms, but this research has taken a different path by replicating a complete classroom-like experience. Live streaming combined with recorded sessions of a lecture helps all students, even those who were present at the lecture itself: the ability to revise what was missed when a student arrives at a lecture late, and the ability to go through a previous lecture before attending the next one, can result in a large academic improvement.

The main purpose of this system is to offer an effective way to help students access learning materials and information from anywhere and to quickly recap any forgotten or missed lectures via the earlier recordings of the sessions. The system provides a method for an interactive Internet-based video conferencing multicast operation which uses a video production studio with a live instructor giving lectures in real time to the participating students. The video conference multicasting permits students to interact with the instructor during the course of the lecture and to later browse the recorded session without hassle.

VI. CONCLUSION AND FUTURE WORK
This paper examines an innovative approach that is well suited to developing a lecture capturing system that provides a complete classroom experience to remotely logged-in students. The system stands out from other existing products by being a comprehensive product that includes biometric authentication, gesture detection, live streaming of lectures, automated attendance marking, offline recording of lectures, bandwidth management, and desktop screen capturing all in one.

This research work has been developed mainly to address problems in Sri Lankan universities, specifically the lack of interactivity between the lecturer and the students. Though this research focuses on universities, it has the potential to be used in other fields such as business conferences. In the next stage, the research team will focus on improving the accuracy of the face recognition and gesture detection models by testing other algorithms, and on minimizing bandwidth costs by testing further bandwidth optimization techniques. It is hoped that, for anyone who expects to build a similar system or any other real-time system, the results of this research will be an aid and will provide insight into the performance, accuracy, and reliability that can be expected from the combination of tools, technologies, and programming approaches considered in this paper.
REFERENCES
[1] W., "Use of E-Learning," Universiti Teknologi Malaysia, Johor, Malaysia, 2018.
[2] K. Kumar and T. S. Sheng, "Real Time Target Tracking with Pan Tilt Zoom Camera," presented at Digital Image Computing, Adelaide, 2009.
[3] P. D. Z. Varcheie, "Online Body Tracking by a PTZ Camera in IP Surveillance System," Department of Computer Engineering and Software Engineering, Station Centre-ville, Montréal, Québec, Canada, 2009.
[4] T. G. Dries Hulens, "Autonomous lecture recording with a PTZ camera," presented at the Canadian Conference on Computer and Robot Vision, Belgium, 2014.
[5] M. M. M. H. R. Jacko, "Remote control of the PTZ camera system for lecture rooms," Department of Computers and Informatics, 2015.
[6] B. Wulff, "OpenTrack - Automated Camera Control for Lecture Recordings," IEEE International Symposium on Multimedia, 2011.
[7] Y.-Q. Chen, C.-F. C., and P.-C. S., "A Tabletop Lecture Recording System," in International Conference on Consumer Electronics-Taiwan, Taiwan, 2015.
[8] C.-Y. Fang, Y.-T. Tsai, S. Chu, and S.-W. Chen, "Virtual Cameraman," Department of Computer Science and Information Engineering, Taiwan, 2015.
[9] "The Way Online Video Streaming Works Has Changed." [Online]. Available: https://www.panopto.com/blog/the-way-video-works-online-has-changed.
[10] "Video Analytics & Engagement Dashboard - Panopto Video Platform." [Online]. Available: https://www.panopto.com/features/video-cms/video-analytics.
[11] "Face Detection using OpenCV and Python: A Beginner's Guide." [Online].
[12] "Analytics to improve student success - Echo360." [Online]. Available: https://echo360.com/platform/analytics/. [Accessed: 15-May-2018].
[13] I. A. Orobor and O. Godswill, "Automated Student Attendance Management System Using Face Recognition." [Online]. Available: http://www.academia.edu/37437099/Automated_Student_Attendance_Management_System_Using_Face_Recognition. [Accessed: 10-Oct-2018].
[14] V. Mankar and S. G. Bhele, "A Review Paper on Face Recognition Techniques," International Journal of Advanced Research in Computer Engineering & Technology, vol. 1, pp. 339-346, Oct. 2012.
[15] F. Ahmad, "Image-based Face Detection and Recognition." [Online]. Available: https://arxiv.org/ftp/arxiv/papers/1302/1302.6379.pdf.
[16] "WebRTC 1.0: Real-time Communication Between Browsers." [Online]. Available: https://www.w3.org/TR/webrtc/. [Accessed: 10-Oct-2018].
[17] P. Braun, M. Sipos, P. Ekler, and F. Fitzek, "On the Performance Boost for Peer To Peer WebRTC-based Video Streaming with Network Coding," 2017.
[18] "What's Kurento - Kurento." [Online]. Available: https://www.kurento.org/whats-kurento. [Accessed: 14-Mar-2018].
[19] "OpenCV library." [Online]. Available: https://opencv.org.
[20] "OpenCV: Face Detection using Haar Cascades." [Online]. Available: https://docs.opencv.org/3.4.2/d7/d8b/tutorial_py_face_detection.html.
[21] "Onvif." [Online]. Available: https://www.onvif.org/onvif/ver20/util/operationIndex.html.
[22] "WebRTC Home | WebRTC." [Online]. Available: https://webrtc.org.
[23] "Open Broadcaster Software | Home." [Online]. Available: https://obsproject.com/. [Accessed: 26-Mar-2018].
[24] "Command Reference — PySceneDetect v0.5 documentation." [Online]. Available: https://pyscenedetect-manual.readthedocs.io/en/latest/cli/commands.html. [Accessed: 09-Oct-2018].
[25] B. Castellano, PySceneDetect: a Python/OpenCV-based scene detection program using threshold/content analysis on a given video. Breakthrough/PySceneDetect, 2018.
[26] "Introduction - PySceneDetect." [Online]. Available: https://pyscenedetect.readthedocs.io/en/latest/.
[27] "Watson Speech to Text," 28-Nov-2016. [Online]. Available: https://www.ibm.com/watson/services/speech-to-text/.