
International Conference on Innovations in Power and Advanced Computing Technologies [i-PACT2017]

A SURVEY ON BIG DATA ANALYTICS USING SOCIAL MEDIA DATA

P. Victer Paul, K. Monica, M. Trishanka


Department of Information Technology
Sri Manakula Vinayagar Engineering College, Pondicherry, India
victerpaul@gmail.com, monicaranirajan@gmail.com, trishanka.manoj@gmail.com

Abstract— Analytics is important in every field for making decisions based on facts. Social media analytics is the process of collecting information from social media platforms, websites and blogs in order to reach effective business conclusions. The use of social media has become the latest trend in today’s world. Social data analytics is not just about collecting the likes and comments shared by individuals; it has become a platform for many brands to run promotions. Applications such as marketing and elections widely use social data to make predictive decisions. Some of the approaches followed are forming hypotheses, digging deep into the data and mapping events. This analytics can also be applied in areas such as business, changes in amendments, education and demonetization. The challenges faced are that the metrics formed from social media should reach the right people and that unstructured data is difficult to process. This paper discusses the theme, model, performance evaluation, advantages and disadvantages of each work covered in the literature survey.

Keywords— social media; insights; datasets; analysis; challenges

I. INTRODUCTION

Big data is a huge amount of data that cannot be processed using traditional methods. The data to be processed may be structured, unstructured or semi-structured. It comes from various sources such as sensors, transactions, cameras, microphones and many more. Analytics is performed on these large data sets in order to obtain undiscovered insights. Petabytes of data are generated every day, and the total has now reached the zettabyte scale. All social media platforms, on which billions of users are connected, generate huge amounts of data of any type, such as text, images, audio, video and GIFs.

Big data is used in many applications because every field now generates tons of data that must be processed at some point. Let us discuss the various applications in which Big data is being used. The foremost is banking, where the Securities and Exchange Commission (SEC) uses Big data to monitor market activity. At present, it uses network analytics and natural language processing to follow trends in the markets. Retail traders, big banks and many others use analytics for making decisions. The next application is the media. Big data is widely used in media, since content now reaches users through many handheld devices. Here analysis is done on consumer insights, social media content and usage patterns. Real-world examples of applications using Big data are Spotify and Amazon Prime. The next widely used sector is healthcare. This sector has access to huge data sets, as it handles a large number of patients and their data. Analytics is used to link the health-related data of patients in order to track them at regular intervals. One such application is Obamacare, which has made use of Big data in multiple ways. Then comes education, where staff and students now use Big data to learn various analysis tools and related technologies. Similarly, manufacturing, government, insurance, retail, transportation and energy have made great use of Big data.

The term “characteristics of Big data” recalls the 3 V’s: Volume, Variety and Velocity. The list has now grown to 6 V’s. The first is volume. Big data in general refers to huge volumes of data generated by machines, systems and the user interactions seen on social media, which makes the analysis process tedious. Next is variety, which refers to the different types of data that we come across: structured, unstructured or semi-structured. Data used to be stored in spreadsheets and then moved to databases, but it now arrives in various formats such as emails, audio, video and GIFs. Next is velocity, which deals with the rate at which data is produced; data flows can become too heavy to handle. Next is veracity, which concerns abnormal data. Generated data is not always clean, so it must be processed to filter out what is needed and to avoid collecting “dirty data”. Next is validity, which refers to the correctness of the data. Data must be accurate during analysis; only then can it help in making the right choices. The last is volatility, which refers to the duration for which the data remains valid. When data becomes irrelevant, it must not be used for analysis. These are the characteristics of Big data.

The main issue that organizations face is collecting accurate data and making the best use of it. After collection, the organization must find the right technology to work with in order to bring out the best insights. Then comes access to the data: most organizations fail to find the right platform to work with the data generated. A survey has mentioned that 55% of the Big data projects are never

978-1-5090-5682-8 /17/$31.00 ©2017 IEEE

completed, because the software is not easy to handle. In addition, it requires internal resources to be maintained. There is also a shortage of data scientists and developers with the domain knowledge needed to find the right objectives. The final concern is security, that is, data protection.

II. LITERATURE SURVEY

This section deals with studies in social media analytics and demonetization.

A. Can Twitter save lives? A Broad scale study on Visual Social Media Analytics for Public Safety

Theme: According to Dennis Thom et al. [1], social media is nowadays used for commercial purposes rather than for safety. The paper gives a clear account of a crisis intelligence field study based on Twitter data from the 2013 floods in Germany. The Scatter Blogs system was introduced to implement further techniques. The second phase sketched out a system based on the feedback about Scatter Blogs, and the two systems are compared.

Model: In [1], a new system known as Scatter Blogs was introduced. It is a visual social media analytics solution designed for analysts so that they can make decisions using social media. It serves as an alternative because not every required software package is available in the needed version.

Performance Evaluation: Visual analytics is used with situation assessment as a crucial element. The Scatter Blogs system provided the experts with a variety of tools and techniques from current research. The authors then completely redesigned the system and named it Event Digest. Existing techniques include LeadLine, a topic-modeling approach that shows sudden bursts as segments in a visualization. To find peaks in Twitter activity, a custom-designed algorithm called TwitInfo is used. LDA topic modeling and natural language processing were used in the visualization process. Both the field study and the user study thus yielded feedback on the newly introduced system, Scatter Blogs [8-10].

Advantage:
Visual analytics is used with situation assessment as a crucial element, and the positive reviews are as follows:
• The usefulness for social and system monitoring was positively commented on.
• It is a tool to harvest the public’s opinions and responses.
• Gathering information with this tool was faster and more extensive than through conventional reporting channels.

Disadvantage:
• There were difficulties in understanding the currently active filters and modules.
• More image and video content would be preferable to word-based content.
• Experts expect automation techniques to be complemented by human understanding and knowledge.

B. Big Data Analytics in Healthcare

Theme: The field of healthcare Big data analytics, its benefits, the methodologies used, the outcomes and the challenges faced are discussed by Wullianallur Raghupathi et al. [2].

Model: Big data can reduce inefficiency in the areas of clinical operations, public health, research and development, pre-adjudication fraud analysis, genomic analytics, evidence-based medicine, patient profile monitoring, and device and remote profile monitoring. The sources and data types involved include machine-to-machine data, web and social media data, biometric data, big transaction data and human-generated data.

Performance Evaluation: For analytics to be performed, the data has to be pooled. First the data sources are identified, and the data are collected and transformed for analytics. Then an analytics platform or tool is selected, for example Hadoop, IBM BigInsights or Cloudera. Next, various big data analytics techniques are applied to the data [11-12].

Advantage:
• The outcomes of the models are validated and presented to the stakeholders for action.
• Big data analytics in healthcare provides a sophisticated technology to derive insights from medical records and various information stores in order to reach valuable conclusions.

Disadvantage:
• Analytics in healthcare still has to become user friendly and menu driven, with abundant analytics models, algorithms and methods readily available in pull-down menus for large-scale maintenance.
• Moreover, administrative problems such as standards and partnerships have to be considered.

C. Big Social Data Analytics for Public Health: Facebook Engagement and Performance

Theme: In [3], Nadiya Straton et al. analyze Facebook data using unsupervised learning and observe that user engagement in public health has recently been increasing.

Model: Data from 153 public health organizations are gathered from their Facebook walls using the Social Data Analytics Tool (SODATO). Clustering is used to discover patterns and interesting correlations in large datasets.

Performance Evaluation: Of several algorithms, K-Means is used here for clustering. Comparing the clusters gives a clear idea of which posts are popular and which are less popular. Analysis can also be done based on some category

types; K-Means is suitable for understanding the features of each category individually. Using K-Means clustering and frequency analysis, the results identify a low-engagement cluster made up of posts with very little user engagement. From 2013 onwards there is an increase in the share of the high and medium clusters. The increase in the average span of user engagement shows that posts are attracting greater interest, as healthcare posts have become more popular in recent years. The analysis shows that photo and link posts gain the most attention, falling in the medium and high engagement clusters. About 50% of the medium-engagement and 50% of the high-engagement posts are found in the period from 10:00 to 16:00, with fewer in the evening.

Advantage:
• The K-Means algorithm makes the clustering task of unsupervised machine learning simpler.
• The clustering process is easy to configure and provides more options to the users.
• The risk of a single point of failure is limited.

Disadvantage:
• Month, season and day are not decisive features for placing posts in a specific cluster.
• To communicate more effectively, public health organizations should avoid plain status updates and favor better visualization and posts with photos and links. It is noticed that posts that are active and available attract more users.
• Future work includes analysis of the textual content of the public health data using supervised machine learning methods for discipline-specific models, such as disease-specific models, inner natures and so on.

D. SoLoMo analytics for Telco Big Data monetization

Theme: Reference [4] observes that the world can be accessed through the mobile internet, which has been widely helpful in business. The algorithms and technologies used in the paper extract insights from the social, location and mobile data of individuals.

Model: In the paper by H. Cao et al., several platforms are used, each for a different purpose. IBM InfoSphere BigInsights, IBM InfoSphere Streams and IBM Netezza Data Warehouse are the tools used to address the big data challenges. The system draws on telecom data for several kinds of profiling: social network, location-based and mobile usage profiling.

Performance Evaluation: The paper explains that users across the world are socially connected through various mediums. Useful insights are extracted from the social, location and mobile usage data of the users. The IBM InfoSphere Streams component is used to stream data in real time. It takes in data such as call detail records (CDRs), which describe the telecom transactions. These data are finally stored in the IBM Netezza Data Warehouse. The next component, IBM InfoSphere BigInsights, helps to extract an individual’s social neighborhood and key locations. Next comes the telco analytics data warehouse, where the huge data sets are stored so that clustering and statistics can be performed; it also supports in-database MapReduce. SoLoMo insights include visualization, social network analysis and various kinds of profiling. Social profiling is based on key performance indicators. Location profiling is based on low-, middle- and high-level features, which include various factors. Mobile profiling analyzes the call detail records, which are usually structured. It also includes modeling web pages using the Open Directory Project (ODP) and a URL pattern analyzer under various categories.

Advantage:
• The analysis was useful in finding like-minded communities using points of interest.
• It helps to indicate the interaction between subscribers.
• Netezza provides native support for MapReduce.

Disadvantage:
• The IBM InfoSphere component does not support social network analysis.
• The location data obtained through call detail records is not as precise as that of the Global Positioning System (GPS).

E. Impact of Demonetization on Indian Economy

Theme: India has the highest rate of currency circulation, and 87% of the currency circulates in the form of Rs. 500 and Rs. 1000 notes. This currency becomes a source of income either through work for organizations or through illegal enterprise. The paper [5] by Dr. Partap Singh et al. presents the impacts that followed demonetization in India.

Model: In the paper, demonetization is the act of doing away with certain currency notes. The restriction of these currency notes followed from the decision to get rid of corruption. The move may improve the economic system followed in India and lower the amount of cash in circulation. It could also be a step towards making the nation digital.

Performance Evaluation: The paper relies on secondary data, that is, data that has been collected from various sources such as surveys and censuses. These data are collected by government officials and other departments who conduct surveys for research and analysis purposes. Usually the demonetization process starts well in advance, and the official date is announced only after most of the process is complete. Demonetization is not new to India. It has happened before with very high-value notes such as Rs. 10,000, but those notes were not in wide circulation among citizens, so the country was not affected as it is now. The notes restricted in the recent demonetization made up 87% of the currency in circulation among the people. Hence it brought adverse effects that people could not manage. Analysis of demonetization helps other researchers who wish

to analyze the process, as well as government institutions and officials.

Advantage:
• The analysis would be useful for government institutions the next time they plan to perform the same process.
• Demonetization helped greatly in bringing out black money.

Disadvantage:
• The recent demonetization has affected honest ordinary taxpayers who work for daily wages.
• The government did not plan for the new currency to be circulated, which left people standing in queues at ATMs.

III. OBSERVATION

It has been observed that each paper uses different techniques, each with its own pros and cons. The techniques used overall are Scatter Blogs, SoLoMo and IBM BigInsights, each serving a purpose such as analysis, storage or streaming real-time data. We have decided to take the positive features of all the algorithms, tools and techniques used in the papers discussed above. Social media analytics is the main resource for performing analysis. It is the process of developing informative tools to collect and summarize the data. The use of social media keeps growing as the days pass. It has been put to many positive as well as negative uses, depending on the attitude of the users. Facebook, YouTube and Twitter are in the first, third and tenth positions in terms of usage. In general, the analysis process comprises three stages: Capture, Understand and Present. Topic modeling is also one of the important techniques used in the above papers. Hence social media analytics is highly used in the analysis performed on demonetization.

IV. FUTURISTIC SOLUTION

According to the literature study, many analysis techniques have been applied to social data, chiefly to gather insights and make efficient decisions that are useful to users who wish to do research on the problem. The first step is to collect data from social networks such as Twitter and Facebook. The data can be retrieved through the API (Application Programming Interface) of the respective platform. Sometimes payment is required to retrieve the data, as platforms may keep their resources secured. The next step is to perform analysis on the retrieved data. This step can make use of RStudio and the Flume platform to store the data and perform the analysis. The outcome is presented in formats such as graphs, line charts or pie charts. The analytics process can be used in applications such as predicting elections, sports and demonetization. It stands as a support for all researchers who wish to analyze the impacts and effects.

V. CONCLUSION

This paper presents the introduction, characteristics, applications and limitations of Big data, followed by an introduction to social data analytics, which is the source for performing analysis. The literature study covers the proposed work and the tools and techniques used in each paper. The paper also contains the overall observations and the futuristic solution, which discusses the steps involved in analysis and the applications. The conclusion and the references for the surveyed papers are also included.

REFERENCES
[1] Dennis Thom, Robert Kruger, Thomas Ertl, “Can Twitter Save Lives? A Broad-scale Study on Visual Social Media Analytics for Public Safety”, IEEE Transactions on Visualization and Computer Graphics, 2015.
[2] Wullianallur Raghupathi and Viju Raghupathi, “Big data Analytics in Healthcare: promise and potential”, Health Information Science and Systems, 2014.
[3] Nadiya Straton, Kjeld Hansen, Raghava Rao Mukkamala, Abid Hussain, Tor-Morten Grønli, Henning Langberg and Ravi Vatrapu, “Big Social Data Analytics for Public Health: Facebook Engagement and Performance”, IEEE 18th International Conference on e-Health Networking, Applications and Services (Healthcom), 2016.
[4] H. Cao, W.S. Dong, L.S. Liu, C.Y. Ma, W.H. Qian, J.W. Shi, C.H. Tian, Y. Wang, D. Konopnicki, M. Shmueli-Scheuer, D. Cohen, N. Modani, H. Lamba, A. Dwivedi, A. A. Nanavati, M. Kumar, “SoLoMo analytics for telco Big Data monetization”, IBM Journal of Research and Development, Vol. 58, No. 5/6, Paper 9, September/November 2014.
[5] Dr. Partap Singh, Virender Singh, “Impact of Demonetization on Indian Economy”, 3rd International Conference on Recent Innovations in Science, Technology, Management and Environment, 18th December 2016.
[6] Weiguo Fan, Michel D. Gordan, “Unveiling the Power of Social Media Analytics”.
[7] Dr. Sunitha V Gangier, Ranganatha B, “Demonetization and its impact on Social Development”, Indian Journal of Applied Research, January 2017.
[8] P. Victer Paul, T. Vengattaraman, P. Dhavachelvan, “Improving efficiency of Peer Network Applications by formulating Distributed Spanning Tree”, Third International Conference on Emerging Trends in Engineering & Technology (ICETET-2010), IEEE, India, May 2010, pp. 813-818.
[9] P. Victer Paul, R. Baskaran, P. Dhavachelvan, “A Novel Population Initialization Technique for Genetic Algorithm”, IEEE International Conference on Circuit, Power and Computing Technologies (ICCPCT), March 2013, India, pp. 1235-1238. ISBN: 978-1-4673-4921-5.
[10] N. Saravanan, R. Baskaran, M. Shanmugam, M.S. SaleemBasha and P. Victer Paul, “An Effective Model for QoS Assessment in Data Caching in MANET Environments”, International Journal of Wireless and Mobile Computing, Inderscience, Vol. 6, No. 5, 2013, pp. 515-527. ISSN: 1741-1092.
[11] P. Victer Paul, A. Ramalingam, R. Baskaran, P. Dhavachelvan, K. Vivekanandan, R. Subramanian and V.S.K. Venkatachalapathy, “Performance Analyses on Population Seeding Techniques for Genetic Algorithms”, International Journal of Engineering and Technology (IJET), Vol. 5, No. 3, Jun-Jul 2013, pp. 2993-3000. ISSN: 0975-4024.
[12] P. Victer Paul, N. Saravanan, S.K.V. Jayakumar, P. Dhavachelvan and R. Baskaran, “QoS enhancements for global replication management in peer to peer networks”, Future Generation Computer Systems, Elsevier, Volume 28, Issue 3, March 2012, pp. 573-582. ISSN: 0167-739X.
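
As an illustration of the clustering step summarized in Section II-C, the following minimal sketch groups posts into low, medium and high engagement clusters with K-Means. It is not the implementation used in [3]: the engagement scores are hypothetical sample values, the single feature (likes + comments + shares) is an assumption, and the function names `assign` and `kmeans_1d` as well as the deterministic choice of starting centroids are introduced here only for illustration.

```python
from statistics import mean


def assign(values, centroids):
    """Group each value with its nearest centroid (absolute distance)."""
    clusters = [[] for _ in centroids]
    for v in values:
        nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
        clusters[nearest].append(v)
    return clusters


def kmeans_1d(values, k=3, max_iter=100):
    """Lloyd's K-Means on one-dimensional data; returns (centroids, clusters)."""
    ordered = sorted(values)
    # Deterministic start (an assumption of this sketch): k evenly spaced
    # values taken from the sorted data instead of random seeds.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(max_iter):
        clusters = assign(values, centroids)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = updated
    return centroids, assign(values, centroids)


# Hypothetical engagement scores (likes + comments + shares) for eleven posts.
engagement = [2, 3, 5, 4, 40, 45, 50, 38, 300, 320, 280]
centroids, clusters = kmeans_1d(engagement, k=3)
for label, centre, members in zip(("low", "medium", "high"), centroids, clusters):
    print(f"{label:<6} centroid={centre:<7} posts={members}")
```

Because the starting centroids are taken in ascending order from one-dimensional data, the clusters come back ordered from lowest to highest engagement, which is why the labels can be attached by position. A study on real data would normally use several features and a library implementation with multiple random restarts.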

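
The three analysis stages named in Section III (Capture, Understand, Present) can be sketched as a small pipeline. This is only an illustration under stated assumptions: `capture` returns hypothetical sample posts in place of a real platform API call, `understand` is reduced to a simple term-frequency count, and `present` prints a text bar chart rather than a graph, line chart or pie chart. All function names and the stop-word list are introduced here for the example.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "is", "to", "of", "and", "in", "at", "for", "are", "on", "after"}


def capture():
    """Capture stage: stand-in for an API request; returns hypothetical posts."""
    return [
        "Long queues at the ATM after demonetization",
        "Demonetization is a step to curb black money",
        "ATM queues are shorter today",
        "Digital payments rise after demonetization",
    ]


def understand(posts, top_n=3):
    """Understand stage: count how often each non-stop-word term occurs."""
    words = []
    for post in posts:
        tokens = re.findall(r"[a-z]+", post.lower())
        words.extend(t for t in tokens if t not in STOPWORDS)
    return Counter(words).most_common(top_n)


def present(term_counts):
    """Present stage: render the counts as a plain-text bar chart."""
    return "\n".join(f"{term:<15}{'#' * count} ({count})" for term, count in term_counts)


print(present(understand(capture())))
```

In practice the capture stage would call the API of the respective platform, and the understand stage could be replaced by topic modeling or clustering without changing the overall structure.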