
Pakistan Journal of Engineering and Technology, PakJET

Multidisciplinary | Peer Reviewed | Open Access


Volume: SI, Number: 01, Pages: 213-218, Year: 2020

Dual Language Sentiment Analysis Model for YouTube Videos Ranking Based on
Machine Learning Techniques
Samina Yasin1, Kareem Ullah1, Shoaib Nawaz2, Muhammad Rizwan3, and Zahid Aslam2
1 University of Agriculture, Faisalabad, Pakistan.
2 The Islamia University of Bahawalpur, Pakistan.
3 Khwaja Fareed University of Engineering and Information Technology, Pakistan.
Corresponding author: Samina Yasin (e-mail: samina.bintayyasin@gmail.com).

Abstract-YouTube is the biggest social media platform for viewing and sharing video; about 500 minutes of video are uploaded to
YouTube every second. With this much data, a user who searches YouTube often cannot find the desired results, and even when the
results match the searched keyword, the user may still not find video content of the desired quality. It takes too much time to
reach the desired video content. In Asia, two languages are mostly used for communication: 1. English, 2. Roman Urdu. Our
proposed work is on dual language sentiment analysis of YouTube videos. The proposed model helps the user rank videos based on
English as well as Roman Urdu feedback. We build a model that takes the reviews in English and Roman Urdu, combines their
scores, and ranks the video based on English and Roman Urdu sentiment analysis. We deploy different machine learning
algorithms and find that the best classifier for dual language sentiment analysis is logistic regression with count vectorization
feature extraction, with 87% model accuracy.
Keywords: YouTube, Sentiment analysis, Machine learning, Roman Urdu.

I. INTRODUCTION

About YouTube: YouTube is the most popular video platform, with two billion active users [1-3]. Most users are between 18 and 34 years old; they watch and share videos. YouTube is watched in more than 100 countries and in more than 80 languages, and about 500 minutes of video are uploaded every second. About one billion hours of video are watched on YouTube every day. Most users log in from mobile devices; according to Google Press, 70% of users watch YouTube on mobile devices. YouTube has built and deployed local versions for most countries [4-8].
Another feature is the channel: people record videos, upload them to YouTube, and earn a lot of money based on Google's subscriber and watch-time policies. Google has policies and standard operating procedures (SOPs) for this, which are displayed on the YouTube website. People with a good number of subscribers earn six-figure incomes from YouTube, and this is increasing by 40% year on year [9].
YouTube is a Google product. Google has many products with different features; on YouTube the main features are video uploading and thumbnails. After watching a video, the user can give feedback in three ways.
1) If the user likes the video, he or she clicks thumbs-up.
2) If the user does not like the video, he or she clicks thumbs-down.
3) If the user wants to talk with the video uploader, he or she gives feedback in the form of comments in the comments section.
Comments are written in simple text and are further divided into two types.
a) Parent comment.
b) Child comment / reply to the parent comment.

Parent Comments:
These are the main comments: the feedback that viewers leave after watching the video. Other users can also like and dislike a parent comment [1-2].

Child Comments:
Child comments are replies to a main (parent) comment. They extend the discussion of the parent comment, for example about whether it is true, and some people reply to a reply, which builds a tree of comments. Replies can also be liked and disliked by other people.

c) YouTube Feedback processes:
People belong to different countries and speak different languages. Since YouTube is watched in more than 80 languages, people from different countries watch the same video content and give feedback on it either in their own language or in English. This means that one video may receive multilingual comments and replies as feedback [6]. The video content on YouTube is massive [10-13]. When people search on
YouTube, they sometimes do not get the desired results or the video they are searching for. They change the search keywords and still cannot find the desired video unless they watch the videos themselves, and on average people find the desired result only after watching 5 or 6 videos.
Another approach people adopt is to read the comments before watching the video; based on other people's feedback, the searcher gets an assessment of the quality of the video content.
Researchers use these comments and their sentiment to rank a video before its content is watched. Other techniques for video ranking exist, and some novel approaches have also been developed.
In Asia, feedback appears mostly in two languages: English and Roman Urdu [14]. Roman Urdu has no grammatical rules and a non-standard syntax, and it is widely used by people in Asia; the logic and concepts of its words are the same as in standard Urdu, but the syntax is non-standard. Roman Urdu users alone number more than 6.06 million in Asia, and they use both languages for daily communication, mostly on social media [15].
People leave feedback on YouTube videos in both languages, but most researchers rank videos based on English sentiment analysis only [13, 15] and ignore the Roman Urdu feedback. The video is still ranked, but the sentiment score is not the actual score, because the Roman Urdu sentiment score is ignored [14]. We propose a model that helps the user rank videos using the actual feedback in both languages.

d) Problem statement:
Video ranking and qualitative analysis are the most important factors when searching on the YouTube website. When people search for a video, they often cannot find the desired one, so they use the comments section to assess the quality of the video content. Sometimes there are only 20 or 30 comments, but when there are more than 100 or 1000, how can we assess the quality of the video? We need a machine to find the productive comments. This is mostly possible for English comments but not for dual-language comments [9]. English has its own resources, such as bag of words, SentiWordNet [8], GloVe, Word2vec [10] and FastText, and these help us; ranking a video from Roman Urdu comments together with English comments is what our proposed model does.

e) Contribution
1. Our proposed model performs sentiment analysis of YouTube comments with dual language support.
2. The model helps researchers and YouTubers find the best video based on dual language feedback.
3. The actual score of the video can be obtained from our proposed model.
4. The model works on English and Roman Urdu individually as well as on both languages together.

f) Significance of proposed study:
Performance and fast results are what every user needs. People do not have hours to spend searching for a better video. Even people from the computer science field face this time-wasting issue on YouTube; how, then, does a non-computer-literate person find the best results and the best-quality video? Our proposed model answers these questions, and it is an especially productive and efficient method in Asia. The proposed method also applies to business reviews: today people first check the reviews of a product and then make a purchase, and most mobile phone websites let users post reviews of cell phones, which are very helpful for new buyers.

g) Objectives:
1. Generate the actual sentiment results of the video.
2. Rank the video based on its qualitative content.
3. Build a model that supports dual language sentiment analysis of YouTube video feedback.
4. Reduce searching time and help the user find the best-quality video efficiently.
5. Rank a video whether it has English feedback, Roman-Urdu feedback, or a combination of both languages.

II. REVIEW OF LITERATURE

Video ranking or video recommendation [13] is a process that helps the user get a video with qualitatively good content. It is still difficult to find quality content among YouTube videos. There are many techniques: sentiment analysis of the comments is used for ranking, and recommendation based on the sentiment score is also used in YouTube ranking [11]. When a user searches YouTube and finds video content, the user does not know which video has quality content. One study works with sentiment score values rather than the sentiment alone: the average sentiment score and the percentile score on the replies to the main comment help the user find videos labeled Recommended, Not-Recommended, May be Recommended or Highly Recommended [10-13].
In the YouTube feedback section, users give their feedback in the form of comments. If there are only 10 or 20 comments, they are easy to read and the quality of the video is easy to assess; if the comments number more than
hundreds or thousands, then machine learning is helpful, and based on machine learning we can assess the quality of the video content. Most people analyze the comments and assess the quality of a video from the replies. Multi-class classification of YouTube videos has also been used for specific classification: a model classifies the comments, the classified comments give the user information, and the video is ranked based on them. That study uses Instant Scraper as the data scraper and Beautiful Soup as the framework for data scraping, and the data are saved as comma-separated values. The model used in that method is a support vector machine with 84% model accuracy. Nowadays, most people use YouTube to watch online videos and share their opinions and emotions about any topic in the comments. The authors of [9] found that most people watch online videos rather than read newspapers or books, and showed the importance of YouTube videos using the example of Annoying Orange, which is among the most popular in views and comments. The authors of [12] investigated the most popular vloggers and discussed three success factors. The authors of [15] examined the most popular videos on YouTube and reported their numbers of likes, dislikes and views. Sentiment analysis is also an important part of identifying the opinions, feelings and expressions of people about any video. The authors of [12-14] also showed the importance of YouTube video ranking and considered different categories of videos and the different growth of YouTube videos.

Dual-language or multilingual sentiment analysis is also a much-needed approach in the analysis of comments and textual data. In one study, two languages are used, English and Spanish, on tweets; the study proposes a single model that can produce sentiment analysis for the two languages [9-10]. The authors use different feature extraction methods, namely words and lemmas, psychometric features, part-of-speech (POS) tags, bigrams of words, trigrams of words, and combined features, and achieve 66% accuracy with four different feature extraction methods [9].

III. MATERIALS AND METHODS

A) Data Collection:
We use data scrapers (Instant Scraper, Beautiful Soup) to grab data from YouTube, and we manually annotate all of the data for training purposes. We use the classes English-Positive, English-Negative, RU-Positive and RU-Negative. We first clean the data and then annotate all values.
Another source is Kaggle, where a Roman Urdu dataset, as well as an English one, is already available; we also include that data in our experiments to obtain the desired results. The dataset scraper helps to scrape the datasets.
The steps to obtain the dataset, the tools used, and where we found the data are as follows.
1. We collected the Roman-Urdu dataset from Kaggle.
2. We used data scraping tools to download the comments of 10 videos on different topics.
3. Sentiment annotation of the Roman-Urdu data already exists on Kaggle.
4. We used a sentiment package to find the sentiment of the English feedback and the LabelEncoder of scikit-learn to encode the language and sentiment labels of the dataset.

B) Methodology:
Here are the steps we follow to perform the experiments:
1. Exploring Data
2. Cleaning Data
3. Annotation Process
4. Feature Engineering
5. Model Building (Machine Learning)
6. Model Evaluation
7. Performance Improvement
8. Final Improved Model
After collecting the dataset, we explored it and found a Python package that works well for language detection; after language detection we perform the sentiment analysis. The package is langdetect, which can be installed with pip:
    pip install langdetect
The package link is https://pypi.org/project/langdetect/

Fig 1: Process model framework.

The process model of our proposed framework is shown in Fig. 1. The process model for the methodology is shown in Fig. 2, which gives a pictorial view of the process.
The data consist of the following statistics for both languages. The values collected from Kaggle were cleaned, and we then confirmed that the dataset is sufficient for our experiments. The visualization of the English and RU dataset follows.

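The language detection step described under Methodology can be sketched as follows. This is a minimal illustration, not the authors' code: the routing rule (a detected code of `en` means English, anything else is sent to the Roman Urdu model) is our assumption, since langdetect does not, to our knowledge, ship a dedicated Roman Urdu profile. The function names `route_by_language` and `load_detector` are hypothetical.

```python
from typing import Callable


def route_by_language(comment: str, detect: Callable[[str], str]) -> str:
    """Return 'english', 'roman_urdu', or 'unknown' for one comment.

    Assumption (not from the paper): a comment detected as 'en' is English,
    and any other detected code is routed to the Roman Urdu model.
    """
    text = comment.strip()
    if not text:
        return "unknown"
    try:
        code = detect(text)
    except Exception:
        # langdetect raises an exception when the text has no usable features
        # (for example, digits or emoji only).
        return "unknown"
    return "english" if code == "en" else "roman_urdu"


def load_detector() -> Callable[[str], str]:
    """Return langdetect's detect function, seeded for repeatable results."""
    from langdetect import DetectorFactory, detect

    DetectorFactory.seed = 0  # langdetect is non-deterministic without a seed
    return detect
```

With langdetect installed, `route_by_language("This video is great", load_detector())` would route one comment; passing the detector as an argument keeps the routing rule easy to test and to replace with a better Roman Urdu detector.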
Fig 2: Methodology of process model.

The length of each comment in both languages is shown in Fig. 3.

FIGURE 3: Comments ratio (pie chart titled Comments with Positive and Negative slices; labels 43% and 57%).

C) Description of the Dataset:
The dataset consists of two languages:
1. English,
2. Roman-Urdu (RU) with the balanced classes.

FIGURE 4: Histogram of positive and negative comments

We find that negative comments have a greater text length than positive comments, as shown in Fig. 4. The summary statistics of the dataset are shown here.
We build word clouds for both classes and both languages and find that the negative comments outnumber the positive ones; the word clouds show the most frequent words in the dataset for the positive class and for the negative class. The word cloud for the class Negative-Sentiment is shown in Fig. 5.

FIGURE 5: Word Cloud for Class Positive-Sentiment

These are the word clouds of both languages for both classes, in which the positive and negative words are shown.

D) Data Cleaning:
After exploring the data, we perform the cleaning process on the dataset. We apply different techniques for cleaning the data. We first remove all null or NaN values, remove the stop words for the English language and the Roman-Urdu (RU) stop words, convert all values to lowercase and
remove punctuation, special characters, spaces, numbers and unnecessary text. We also remove duplicate values.

E) Features Selection:
We adopted three feature selection techniques: Count Vectorization, TFIDF vectorization and Word2vec vectorization. We then split the dataset into 80% for training and 20% for testing. We first deploy two machine learning algorithms with these three feature selection methods. The combinations deployed are listed below.
1. Logistic Regression with Count Vectorization.
2. Logistic Regression with TFIDF (Unigram, Bigram, Uni-bi-gram).
3. Logistic Regression with Word2vec.
4. Random Forest with Count Vectorization.
5. Random Forest with TFIDF (Unigram, Bigram, Uni-bi-gram).
6. Random Forest with Word2vec.
The results for these machine learning algorithms are given in the table. For each of the three feature selection methods we prepare the train and test sets: first with TFIDF, then with count vectorization, and then with Word2vec. We then use a for loop to feed the data for the different features to the two algorithms, Logistic Regression and Random Forest, and test the model accuracy. We also test a linear regressor and a RANSAC regressor using TFIDF; these give results below those of Logistic Regression, which reaches 87% accuracy with an F1 score of 0.87. The other experiments and results are shown in Fig. 6, with the following abbreviations.

LogR = Logistic Regression
RF = Random Forest
LR = Linear Regressor
RANSAC = Random Sample Consensus

FIGURE 6: Algorithm comparison with features (bar chart titled Algo. Comparison with Features Performance; bars for LogR and RF with Count Vectorization, TFIDF and Word2vec, and for LR and RANSAC; vertical axis from 0.65 to 0.90; trend line y = -0.0124x + 0.8732, R² = 0.4615).

Logistic Regression with count vectorization gives 87% accuracy with precision 0.84, recall 0.91 and an F1 score of 0.87, which are the best results in our deployment. We also compare the linear regressor with the RANSAC regressor: the value reported for the linear regressor is 73.43% and for the RANSAC regressor 82.19%.

FIGURE 7: Process model evaluation.

The model evaluation process is shown in Fig. 7 with the following equations, where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives.
ACCURACY = (TP + TN) / (TP + TN + FP + FN)
PRECISION = TP / (TP + FP)
RECALL = (TP) / (TP + FN)
TRUE POSITIVE RATE = TP / (TP + FN)
FALSE POSITIVE RATE = FP / (FP + TN)
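
The evaluation equations can be checked with a few lines of Python. The sketch below computes them from the four confusion-matrix counts. The F1 score is not defined in the equations above, so the standard harmonic mean of precision and recall is assumed, and the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the evaluation measures from confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # also the true positive rate
    # Assumed standard definition: harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


if __name__ == "__main__":
    # Illustrative counts only, not taken from the paper's experiments.
    for name, value in classification_metrics(tp=91, tn=83, fp=17, fn=9).items():
        print(f"{name}: {value:.4f}")
```

The guards against zero denominators matter for small or one-sided test sets, where a class may never be predicted.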

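The comparison loop described under Features Selection (clean the text, split 80/20, then try each feature method with each classifier) can be sketched with scikit-learn as below. This is an illustration under assumptions, not the authors' implementation: Word2vec is left out because it needs an additional library, stop-word and duplicate removal are omitted for brevity, and the hyperparameters (`max_iter`, `n_estimators`, `seed`) and the names `clean` and `compare_models` are our own choices.

```python
import re

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def clean(text: str) -> str:
    """Lowercase, then drop punctuation, digits, special characters, extra spaces."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def compare_models(texts, labels, seed=0):
    """Return {(model, feature): test accuracy} for an 80/20 stratified split."""
    cleaned = [clean(t) for t in texts]
    x_train, x_test, y_train, y_test = train_test_split(
        cleaned, labels, test_size=0.2, random_state=seed, stratify=labels
    )
    # Factories so that every combination gets a fresh, unfitted object.
    feature_makers = {
        "count": lambda: CountVectorizer(),
        "tfidf_uni_bigram": lambda: TfidfVectorizer(ngram_range=(1, 2)),
    }
    model_makers = {
        "LogR": lambda: LogisticRegression(max_iter=1000),
        "RF": lambda: RandomForestClassifier(n_estimators=50, random_state=seed),
    }
    scores = {}
    for f_name, make_features in feature_makers.items():
        for m_name, make_model in model_makers.items():
            pipeline = make_pipeline(make_features(), make_model())
            pipeline.fit(x_train, y_train)
            predictions = pipeline.predict(x_test)
            scores[(m_name, f_name)] = float(accuracy_score(y_test, predictions))
    return scores


if __name__ == "__main__":
    # Made-up comments for illustration only; 1 = positive, 0 = negative.
    positive = ["great video, very helpful", "bohat acha video hai", "zabardast!"]
    negative = ["worst video ever", "bakwas video hai", "bilkul bekar"]
    texts = (positive + negative) * 5
    labels = ([1] * 3 + [0] * 3) * 5
    for combo, acc in compare_models(texts, labels).items():
        print(combo, round(acc, 2))
```

Fitting the vectorizer inside the pipeline keeps the vocabulary from being learned on the test split, which would otherwise inflate the reported accuracy.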
CONFUSION MATRIX     PREDICTED POSITIVE     PREDICTED NEGATIVE
Actual Positive      True Positive          False Negative
Actual Negative      False Positive         True Negative

Table I: Comparison of algorithms.

IV. CONCLUSION
In our proposed method we collect data from YouTube using different scrapers, tools and websites. After cleaning and annotation, and with the help of feature engineering, we prepare the data for model training. We then build models using different machine learning algorithms, find the one that is comparatively best for our dataset, and carry that algorithm forward for improved results. The proposed model, with its application to dual language sentiment analysis of YouTube video content, helps us find the best-quality YouTube video content, as given in Tab. I. The proposed method shows that Logistic Regression with count vectorization outperforms the others. We also compare the performance of our three feature selection methods on the three machine learning algorithms. More than 6.06 million people use the two languages in their daily communication, so the method is productive, effective and in demand. Our results show that the method is effective and efficient in processing dual language sentiment analysis.

V. FUTURE WORK
In future work, we will use the language detection Python package together with Google Translate to translate comments in multiple languages into English and then perform sentiment analysis to rank YouTube videos by the quality of their content.

REFERENCES:
[1] Abbas SM. Improved context-aware youtube recommender system with user feedback analysis. Bahria University Journal of Information & Communication Technologies (BUJICT). 2017 Dec 29;10(2).
[2] Alam M, ul Hussain S. Sequence to sequence networks for Roman-Urdu to Urdu transliteration. In: 2017 International Multi-topic Conference (INMIC) 2017 Nov 24 (pp. 1-7). IEEE.
[3] Alanis I, Rodriguez MA. Sustaining a dual language immersion program: Features of success. Journal of Latinos and Education. 2008 Oct 10;7(4):305-19.
[4] Asghar MZ, Ahmad S, Marwat A, Kundi FM. Sentiment analysis on youtube: A brief survey. arXiv preprint arXiv:1511.09142. 2015 Nov 30.
[5] Benkhelifa R, Laallam FZ. Opinion extraction and classification of real-time youtube cooking recipes comments. In: International Conference on Advanced Machine Learning Technologies and Applications 2018 Feb 22 (pp. 395-404). Springer, Cham.
[6] Bilal M, Israr H, Shahid M, Khan A. Sentiment classification of Roman-Urdu opinions using Naïve Bayesian, Decision Tree and KNN classification techniques. Journal of King Saud University-Computer and Information Sciences. 2016 Jul 1;28(3):330-44.
[7] Chauhan GS, Meena YK. YouTube Video Ranking by Aspect-Based Sentiment Analysis on User Feedback. In: Soft Computing and Signal Processing 2019 (pp. 63-71). Springer, Singapore.
[8] Chauhan GS, Meena YK. YouTube Video Ranking by Aspect-Based Sentiment Analysis on User Feedback. In: Soft Computing and Signal Processing 2019 (pp. 63-71). Springer, Singapore.
[9] Covington P, Adams J, Sargin E. Deep neural networks for youtube recommendations. In: Proceedings of the 10th ACM conference on recommender systems 2016 Sep 7 (pp. 191-198).
[10] Soomro TR, Ghulam SM. Current Status of Urdu on Twitter. Sukkur IBA Journal of Computing and Mathematical Sciences. 2019 Sep 5;3(1):28-33.
[11] Nawaz S, Rizwan M, Rafiq M. Recommendation Of Effectiveness Of Youtube Video Contents By Qualitative Sentiment Analysis Of Its Comments And Replies. Pakistan Journal of Science. 2019 Dec 31;71(4):91.
[12] Nguyen HT, Le Nguyen M. Multilingual opinion mining on YouTube–A convolutional N-gram BiLSTM word embedding. Information Processing & Management. 2018 May 1;54(3):451-62.
[13] Pereira RB, Plastino A, Zadrozny B, Merschmann LH. Categorizing feature selection methods for multi-label classification. Artificial Intelligence Review. 2018 Jan 1;49(1):57-78.
[14] Vilares D, Alonso MA, Gómez-Rodríguez C. Supervised sentiment analysis in multilingual environments. Information Processing & Management. 2017 May 1;53(3):595-607.
[15] Xia R, Xu F, Zong C, Li Q, Qi Y, Li T. Dual sentiment analysis: Considering two sides of one review. IEEE transactions on knowledge and data engineering. 2015 Feb 26;27(8):2120-33.
