Becir Isakovic
Dino Keco
Nejdet Dogru
Department of Information Technologies
International Burch University
Abstract—Social media is a very important factor in analyzing modern society as a whole, including its values, norms, and behaviors, because it is a part of our everyday life. This study is oriented towards analyzing social media so that users can define their own preferences for following (analyzing) a specific social media source. A web application has been developed that allows a user to follow specific Facebook accounts and to categorize the Facebook posts on those accounts based on user-defined taxonomies. The results of this study are various reports generated from the Facebook posts and their statistics, clustered according to the user-defined taxonomies. The benefit of this project is that any user can track in real time when people are talking about a given topic, which gives anyone better insight into society as a whole: its values, its norms, what its members find interesting, and more. The tool is also useful for companies that want to track user feedback on their products on social networks.

Keywords—social media analysis; Facebook; big data; NoSQL database; parallel programming

I. INTRODUCTION

Social media has entered every segment of our lives, from private and personal to business and professional, and companies conduct many of their business activities through their social media pages. Correct analysis of social media data is therefore very important, both for companies that want to identify customers' behaviors and needs and for governments and social institutions that want to understand the needs of society. Analysis of the vast amount of data (big data) collected from social networks allows companies to gather useful insights about their products, services, and customers [1]. Such analysis can be used to measure client satisfaction, marketing and promotion success, and customer perception of brands and products; moreover, it can enhance many segments of a business, from marketing to sales. Despite all the benefits that come from social media analytics, many businesses fail to recognize the full potential and power of social media. Similarly, governments fail to recognize the importance of social media and the effects it can have on society.

The article [2] underlines the importance of social media in everyday life and thus the importance of proper analysis of these 'communities'. It also proposes several machine learning algorithms (matrix factorization, neural networks) and techniques (group recommendation using trust neighbors, a tag-based recommendation algorithm) that can be used to address this problem.

In [3], Tripathy, Agrawal, and Rath explained how supervised machine learning methods can be used for sentiment analysis. The authors used the Naive Bayes and Support Vector Machine algorithms to determine sentiment in movie reviews. After comparing the accuracy of these algorithms, they found that the Support Vector Machine performed better than the other algorithms used.

Another interesting approach to analyzing social media was demonstrated by Alp and Oducucu [4]. Their paper attempts to extract topical information from tweets by using hashtags. They compared different methods, and the best performing was Latent Dirichlet Allocation (LDA) with several tweets merged together in an attempt to improve performance. The drawbacks of their approach are that it is highly computationally intensive and that it performs poorly on short tweets.

Chung-Hong Lee conducted an interesting study on evaluating the relatedness of disastrous stories and events [5]. The author used social media messages to describe real-world events through relatedness analysis, mining the content of these messages and then comparing the findings with the results of relatedness analysis on Twitter microblogs. This online unsupervised method proved to be quick and accurate for near real-time event identification.

Fanpage Karma [6] is a platform that allows businesses to analyze their own and their competitors' accounts across several social media platforms (Facebook, Google+, Pinterest, YouTube, Twitter, and Instagram). It provides a number of reports for the accounts being analyzed and compares their respective performance.

LikeAlyzer [7] gives users the ability to check any Facebook page. Its advantage is that it does not require access to Facebook Insights. It has an easy-to-use interface and gives the user a variety of reports about the analyzed page.

Klear [8] is a platform for both influencer identification and analysis. It enables users to search for influencers in different locations and categories (celebrities, power users, ...). It also retrieves the top content of an account on several social media platforms (Facebook, Instagram, and Twitter).

Unlike the other platforms for social media analysis, our application gives anybody the ability to analyze anything on Facebook through different categories. It gives insight into the values and norms of society and into how one aspect of life (e.g., politics) can influence another (e.g., vulgar speech).
The focus of this study is to demonstrate the benefits of analyzing social network data using big data tools. It introduces a particular implementation of social media analysis that lets anybody impose their own criteria and choose the data sources they want to analyze. In this manner, it enables every user to see how a particular event in a particular location is reflected in people's behavior and opinions. For convenience, we conclude this paper with one particular case that shows how society reacts to an event in its environment.

The remainder of this paper is organized as follows. Section 2 explains the nature of the data and how it is collected, together with the details of the system architecture developed to analyze the collected data. Section 3 demonstrates the output of the developed system, while Section 4 discusses the performance and usefulness of the proposed system. The paper is concluded in Section 5 by highlighting the importance of the research and future work.

II. METHODS AND MATERIALS

A. Web Application Functionality

This information system is a website that presents statistical information to the end user about user-defined categories in social networks. The system is designed to provide better insight into how much and when people are speaking about predefined topics such as politics, sports, and culture. Valuable reports can then be created from this category information in order to analyze citizens' opinions for different purposes. More specifically, the system retrieves data from the Facebook pages of popular target portals in the region or from personal accounts, and then groups the data into categories. The user can choose which category he/she is interested in and is served with different kinds of graphs and charts that track how much the followers of a portal/account are speaking about the chosen category in a defined period. Moreover, the user can filter events of interest using a date range filter.

B. System Architecture

The SMA WEB APP [9, 10] is a Java Spring based web application used for real-time analysis of data on the Facebook social network. The main feature that SMA uses for analysis is the taxonomy. A taxonomy consists of a set of categories, and each category has keywords bound to it. For every keyword, there is a list of synonyms that represent the same word. (Synonyms need not have the same meaning as the keyword; they are used to make categories more distinguishable.) The taxonomy in the SMA WEB APP is defined per user, so each user of the system has his/her own taxonomy, which gives the user a personalized analysis. Fig. 1 shows the system architecture of the SMA WEB APP.

The SMA WEB APP is used to interact with the system in order to create and manage accounts and their settings. As part of the settings, the user can configure the sources that he/she wants to follow on the Facebook social network. Sources are URLs of Facebook pages. The user can also modify the taxonomy as needed. Moreover, the user has access to various reports on the social media data processed by the Categorization Engine. Reports such as the weekday punch card, keyword frequency analysis, time-based heatmaps, category distribution, category timeline, posts timeline, individual post analysis, and others are available in the SMA WEB APP. For data persistence, a Mongo database [11] is used in a replicated and sharded configuration. The SMA WEB APP works in a multi-process and multi-threaded environment in order to speed up data gathering.
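The taxonomy structure used by the system (categories, each with keywords, each keyword with synonyms, defined per user) can be illustrated with a small sketch. The sketch is written in Python purely for brevity, whereas the SMA WEB APP itself is Java based; the names Keyword, Category, Taxonomy, and categorize, the user id, and the case-insensitive substring matching are all assumptions made for illustration and are not taken from the actual implementation. The example values mirror the "politics" category of Fig. 2.

```python
from dataclasses import dataclass, field


@dataclass
class Keyword:
    # A keyword and the (not necessarily linguistic) synonyms bound to it.
    term: str
    synonyms: list[str] = field(default_factory=list)


@dataclass
class Category:
    # A category groups several keywords.
    name: str
    keywords: list[Keyword] = field(default_factory=list)


# Hypothetical per-user storage: user id -> that user's categories.
Taxonomy = dict[str, list[Category]]


def categorize(post: str, categories: list[Category]) -> list[tuple[str, str, str]]:
    """Return (category, keyword, matched term) for every term found in the post.

    Assumption: a term matches if it occurs as a case-insensitive substring.
    """
    text = post.casefold()
    hits = []
    for category in categories:
        for keyword in category.keywords:
            for term in [keyword.term, *keyword.synonyms]:
                if term.casefold() in text:
                    hits.append((category.name, keyword.term, term))
    return hits


politics = Category("politics", [
    Keyword("Hillary Clinton", ["Democratic Party"]),
    Keyword("Donald Trump", ["Republican Party"]),
])
taxonomies: Taxonomy = {"user-1": [politics]}

hits = categorize("The Republican Party held a rally.", taxonomies["user-1"])
print(hits)  # [('politics', 'Donald Trump', 'Republican Party')]
```

Returning the matched term together with the category keeps track of which keyword or synonym placed a post in a category, which is the information the reports rely on.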
C. Categorization Engine

Our categorization engine allows every user to impose his/her own constraints according to which data will be grouped and analyzed. Every user can define a number of different categories. These categories can be general topics such as politics, sport, or culture, or more specific topics such as winter tires, glasses, or ventilators. Further, every category can have many keywords. These keywords tend to be more specific concepts that help the categorization engine group posts more accurately. Finally, every keyword has its synonyms. These synonyms do not have to be synonyms in the linguistic sense; rather, they are there to make the categorization more accurate. Once the engine has categorized a post, we know not only which category the post belongs to but also which keyword or synonym placed it in that category. Fig. 2 shows how categories are created. As seen in Fig. 2, the "politics" category was created using two keywords (Hillary Clinton and Donald Trump), and each keyword has one synonym (Democratic Party and Republican Party, respectively). More keywords, and more synonyms for each keyword, can be used when creating categories.

Fig. 2 Category creation process

III. RESULTS

This section demonstrates the usefulness of the developed application through two use cases. The first use case analyzes the Facebook pages of the four most popular news portals (klix.ba, avaz.ba, sportsport.ba, ekskluziva.ba) in order to identify the most discussed topics in Bosnia and Herzegovina. It is assumed that the most discussed topics on these Facebook pages indicate people's topics of interest. The second use case analyzes people's reaction to the court decision of March 31, 2016 concerning Vojislav Šešelj, a well-known politician who is believed to be a war criminal.

In total, 570 MB of data was retrieved from the Facebook pages of the most popular portals in Bosnia and Herzegovina (klix.ba, avaz.ba, sportsport.ba, ekskluziva.ba). This data included 214,037 feeds from 83,039 different people between March 20 and April 12, 2016, of which 17,868 were posts and 196,169 were comments. The most frequent words in this dataset are used as categories for further analysis.

The category list for our system was created from these most frequently used words in order to analyze social media activity in Bosnia and Herzegovina. Using this category list, we analyzed how frequently the categories are mentioned on those pages, as well as on which days and at what time of day.

Fig. 3 depicts the categories determined through offline analysis and their presence on the Facebook pages of the four portals.

It can be seen from the chart that social interactions comprise 39% of social activities, while, surprisingly, vulgar speech comprises 24% of posts and comments on the target Facebook pages. These findings are interesting and valuable when it comes to analyzing society as a whole.

TABLE I shows the most frequent words in the collected data. Since the Facebook pages were in the Bosnian language, the most frequent words were naturally Bosnian. English translations are given in parentheses under each word in TABLE I. TABLE I gives an idea of the topics discussed on the Facebook pages we monitor.

Fig. 4 and Fig. 5 present the number of posts or comments on social media according to the day of the week and the time of day, respectively. The number of posts and comments is almost the same on each day, except that it is slightly higher from Tuesday to Friday than on Saturday, Sunday, and Monday. One reason could be that people spend time with their families and friends during the weekend and focus on work on Monday, so that they do not have enough time for following or writing on social media.
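A per-weekday count of posts and comments, of the kind summarized above, could be computed roughly as in the following Python sketch. It assumes that each post or comment carries an ISO 8601 timestamp string; the function name weekday_counts and the sample timestamps are made up for illustration and do not come from the collected dataset or from the SMA WEB APP code.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def weekday_counts(timestamps: list[str]) -> dict[str, int]:
    """Count items per day of the week; days without items are reported as 0."""
    # datetime.weekday() returns 0 for Monday through 6 for Sunday.
    counts = Counter(datetime.fromisoformat(ts).weekday() for ts in timestamps)
    return {name: counts.get(index, 0) for index, name in enumerate(WEEKDAYS)}


# Made-up sample timestamps (assumption, not real data).
sample = [
    "2016-03-29T10:15:00",  # Tuesday
    "2016-03-29T23:59:00",  # Tuesday
    "2016-03-30T21:40:00",  # Wednesday
    "2016-04-02T09:05:00",  # Saturday
]
per_day = weekday_counts(sample)
print(per_day["Tuesday"], per_day["Wednesday"], per_day["Saturday"])  # 2 1 1
```

The same pattern, keyed on the hour of the timestamp instead of the weekday, would give a time-of-day distribution.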
Fig. 3 Presence of categories in social media
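The presence of categories shown in Fig. 3 is presumably the share of categorized posts and comments that falls into each category. Under that assumption, the computation could look like the Python sketch below. The function name category_shares and the sample labels are hypothetical, and the sample counts are chosen only to make the arithmetic easy to follow; they are not the study's data.

```python
from collections import Counter


def category_shares(labels: list[str]) -> dict[str, float]:
    """Return the percentage of items per category, rounded to one decimal."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: round(100 * n / total, 1) for category, n in counts.items()}


# Made-up category labels, one per categorized post or comment (assumption).
labels = (["social interactions"] * 5
          + ["vulgar speech"] * 3
          + ["politics"] * 2)
shares = category_shares(labels)
print(shares)
# {'social interactions': 50.0, 'vulgar speech': 30.0, 'politics': 20.0}
```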