
BIG DATA AND HADOOP: SENTIMENT ANALYSIS FROM TWITTER DATA USING FLUME AND HIVE


 
TWITTER

INTRODUCTION
• Twitter.com is a popular microblogging website.
• Tweets are frequently used to express a user's opinion or emotion on a particular subject.
• Twitter enables users to send updates in the form of reviews or messages to a group
of followers.
• Thousands of blogs, millions of tweets and billions of emails are written each day.
Among social media platforms, Twitter is one that is gaining high popularity
nowadays. It provides a fast and efficient way to analyze customers'
perspectives toward a product or service.
• There are firms that poll Twitter to analyze sentiment on a particular topic.
• The challenge is to gather all such relevant data, then detect and summarize the overall
sentiment on a topic.
• Sentiment analysis is a method used to gauge a customer's perceptions of, or his/her
reviews about, a product.
• Based on the opinion reflected, a tweet can be classified as positive, negative or
neutral.
SENTIMENT ANALYSIS
INTRODUCTION
• "Sentiment analysis" is also known as "opinion mining".
• Sentiment analysis is a field of Natural Language Processing dedicated to the
exploration of subjective opinions or feelings collected from various sources about a particular
subject.
• Sentiment Analysis is a set of tools to identify and extract opinions and use them for the
benefit of the business operation.
• Such algorithms dig deep into the text and find the passages that point to the attitude towards
the product in general or towards its specific elements.
• In other words, opinion mining and sentiment analysis offer an opportunity to explore the
mindset of the customers and study the state of the product from the customer's point of view.
• This makes sentiment analysis a great tool for:
• Market research
• Reputation management
• Precision targeting
• Marketing analysis
• Public relations (PR)
• Product reviews
• Net promoter scoring
• Product feedback
• Customer service.
How does Sentiment Analysis work?
• Sentiment analysis is predominantly a classification algorithm aimed at finding an
opinionated point of view, determining its disposition, and highlighting the information of
particular interest in the process.
• What is an "opinion" in sentiment analysis? You all know the general definition of
an opinion: “a view or judgment formed about something.”
• Well, from the data science standpoint, an opinion is much more than this:  
• On the one hand, it is a subjective assessment of something based on personal
empirical experience. It is partially rooted in objective facts and partly ruled by
emotions.
• On the other hand, an opinion can be interpreted as a sort of dimension in the data
regarding a particular subject. It is a set of signifiers that in combination present a
point of view, i.e., an aspect of the particular issue. Think of it as one of
the rings of Saturn.
Sentiment Analysis is applied for the following operations:
• Find and extract the opinionated data (aka sentiment data) on a specific platform
(customer support, reviews, etc.)
• Determine its polarity (positive or negative)
• Define the subject matter (what is being talked about in general and specifically)
• Identify the opinion holder (on its own and in correlation with the existing audience
segments)
• Depending on the purpose, a sentiment analysis algorithm can be applied at the following
scopes:
• Document-level - for the entire text.
• Sentence-level - obtains the sentiment of a single sentence.
• Sub-sentence level - obtains the sentiment of sub-expressions within a sentence.
• Given its subjective matter, mining an opinion is a tricky affair. Opinions differ; some
are more valuable than others. Four subcategories further characterize an opinion:
• Direct opinion is one that directly states something. For example, "the
responsiveness of the buttons in application X is poor." Here you have a legitimate point.
• Comparative Opinion is the one where X is compared with Y based on specific criteria.
For example, “the responsiveness of the button in application X is worse than in
application Y.” In addition to being an insight into your product, it also serves as micro
competitive research.
• Explicit opinion is where everything is clearly defined. For example, “this chair is
rocking.”
• Implicit opinions are implied but not clearly stated. For example, "the app started
lagging in two days." It is important to note that implicit opinions may also contain idioms
and metaphors, which complicates the sentiment analysis process.
Types of Sentiment Analysis
• To understand how to apply sentiment analysis in the context of your business
operation - you need to understand its different types.
• 1st type. Fine-grained sentiment analysis involves determining the polarity of the
opinion. It can be a simple binary positive/negative differentiation, or it can use a
more fine-grained scale (for example: very positive, positive, neutral, negative,
very negative), depending on the use case (for example, five-star Amazon reviews).
• 2nd type. Emotion detection is used to identify signs of specific emotional states
presented in the text. Usually, a combination of lexicons and machine learning
algorithms determines what is expressed and why.
• 3rd type. Aspect-based sentiment analysis goes deeper. Its purpose is to identify
opinions regarding a specific element of the product, for example the brightness
of the flashlight in a smartphone. Aspect-based analysis is commonly used in
product analytics to keep an eye on how the product is perceived and what its
strong and weak points are from the customer's point of view.
• 4th type. Intent Analysis is all about the action. Its purpose is to determine what
kind of intention is expressed in the message. It is commonly used in customer
support systems to streamline the workflow.
Sentiment Analysis Algorithms
• There are two major Sentiment Analysis methods.
• Rule-based approach
• Rule-based sentiment analysis relies on an algorithm with a clearly defined
description of the opinions to identify. This includes identifying subjectivity, polarity, or the
subject of an opinion.
• The rule-based approach involves a basic Natural Language Processing routine, with
the following operations on the text corpus:
• Stemming
• Tokenization
• Part of speech tagging
• Parsing
• Lexicon analysis (depending on the relevant context)
• Here's how it works (a HiveQL sketch of this scoring follows at the end of this subsection):
• There are two lists of words: one includes only positive words, the other only
negative ones.
• The algorithm goes through the text and finds the words that match either list.
• After that, the algorithm calculates which type of word is more prevalent in the text.
If there are more positive words, the text is deemed to have a positive polarity.
• The rule-based algorithm delivers results of a sort, but it lacks the flexibility and
precision that would make it truly usable. For instance, the rule-based approach doesn't
take context into account.
• However, it can be used for general purposes of determining the tone of the messages,
which may come in handy for customer support.
• These days, rule-based sentiment analysis is commonly used to lay the groundwork for the
subsequent implementation and training of a machine learning solution.
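
To make the word-list scoring concrete, here is a minimal HiveQL sketch of the idea (HiveQL is the query language used later in this project). The tables tweet_words(tweet_id, word), positive_words(word) and negative_words(word) are hypothetical stand-ins, not part of the pipeline described below:

-- Label each tweet by whichever word list it matches more often.
-- tweet_words, positive_words and negative_words are assumed tables.
SELECT
  t.tweet_id,
  CASE
    WHEN SUM(IF(p.word IS NOT NULL, 1, 0)) > SUM(IF(n.word IS NOT NULL, 1, 0)) THEN 'POSITIVE'
    WHEN SUM(IF(p.word IS NOT NULL, 1, 0)) < SUM(IF(n.word IS NOT NULL, 1, 0)) THEN 'NEGATIVE'
    ELSE 'NEUTRAL'
  END AS polarity
FROM tweet_words t
LEFT OUTER JOIN positive_words p ON (t.word = p.word)
LEFT OUTER JOIN negative_words n ON (t.word = n.word)
GROUP BY t.tweet_id;

Note how the query simply counts matches; nothing here understands negation or context, which is exactly the limitation described above.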
• Automatic Sentiment Analysis
• Automated sentiment analysis is the approach that truly digs into the text and delivers
insights.
• This type of sentiment analysis uses machine learning to figure out the gist of the message.
• It offers high precision and accuracy, and it can process information against
numerous criteria without getting too complicated.
• The automatic approach involves supervised machine learning classification algorithms.
• Types of sentiment analysis algorithms:
• Linear Regression
• Naive Bayes
• Support Vector Machines
• RNN derivatives LSTM and GRU.
• Sentiment analysis is an incredibly valuable technology for businesses because it allows
getting realistic feedback from your customers in an unbiased (or less biased) way.
Approaches to collect Twitter data using Flume

 
• The following steps show how to fetch the data, process it, store it in
HDFS, and further process this work using different techniques.
The steps are:
• 1. Create a Twitter application.
• 2. Fetch data using Flume.
• 3. Query using HQL.

 Create Twitter Application


• The present study requires Twitter data, which is obtained by creating a Twitter
application. The following steps are used to create a Twitter application:
• First, open dev.twitter.com/app in a browser, sign in to the Twitter account,
and work from the Twitter Application window,
where you can create, delete, and manage Twitter Apps.
Click on Create New App
• In the next step, click on the Create New App button. At that point you will get an application
form to fill in your details.
• Now, scroll down, tick the option Yes, I agree, and then click Create your Twitter application.

• At this point the new App will be created. The new app is used to generate the Consumer Key,
Consumer Secret, Access Token, and Access Token Secret. These will be edited into the
flume.conf file. While getting data from Twitter, these keys are used to fetch the tweets
that are live in the account.
• These keys are found under the Access Tokens tab, where you can observe a button with the
name Create my access token. By clicking this we can produce the access token.

Now click on Create my access token.
Now, we open the flume.conf file in the
directory /usr/lib/flume/conf and then
change the following keys in the file.
These keys are obtained from the
page above.

• The Consumer Key (API Key), Consumer Secret (API Secret), Access Token and
Access Token Secret are used to configure the Flume agent.
• Also add the keywords that we want to extract from Twitter. Here, we are
extracting data on demonetization. A sample configuration along these lines is
sketched below.
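
As an illustration, a minimal flume.conf along these lines might look as follows. This sketch assumes the Cloudera TwitterSource (shipped in flume-sources-1.0-SNAPSHOT.jar); the four key values are placeholders, and the agent/channel/sink names and the HDFS path are our own choices:

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Twitter source: fill in the four keys generated above.
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <consumer secret>
TwitterAgent.sources.Twitter.accessToken = <access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>
TwitterAgent.sources.Twitter.keywords = demonetization

# HDFS sink: tweets are written as raw JSON events under this path.
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text

# In-memory channel buffering events between source and sink.
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100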
 Fetching the data using Flume:
• After creating an application on the Twitter developer site, we need to fetch the data
from Twitter. For that:
• We will use the consumer key and consumer secret together with the access token and
access token secret.
• Further, we can fetch the required data from Twitter; it arrives in JSON format, and
we put this data into HDFS at the location where we save all the data that comes
from Twitter.
• The configuration file is used to get the real-time data from Twitter. All the details
or points of interest need to be filled in the flume-twitter.conf file, i.e. the configuration
file of Flume.
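
With the configuration in place, the agent can be started with Flume's standard launcher (the paths and the agent name here match the sketch above and would be adjusted to the local installation):

flume-ng agent --conf /usr/lib/flume/conf \
  --conf-file /usr/lib/flume/conf/flume.conf \
  --name TwitterAgent -Dflume.root.logger=INFO,console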
• Query using HQL
• After the above setup, when Flume is run, the Twitter data will automatically be
saved into HDFS in the directories where we set the storage path for the Twitter
data extracted using Flume.
• From the collected data we will create a table "mytweets_raw" where the filtered data
is kept in a formatted structure, clearly showing that we have converted the
unstructured data into structured, organized data.
• After loading the real-time data into the Hive table, more tables are created, such as a
dictionary table which stores each word and its polarity, and a tweets-sentiment table
which contains every tweet id and its sentiment. Several more such tables are created
and different operations are performed on the data; a sketch of these queries follows.
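
As a rough illustration of these steps, the dictionary join and per-tweet scoring could look like the HiveQL below. The column names (id and text on mytweets_raw, word and polarity on the dictionary) are assumptions about the table layouts, not the project's exact schemas:

-- Dictionary table: one word per row with a signed polarity score.
CREATE TABLE dictionary (word STRING, polarity INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Explode each tweet's text into words, join against the dictionary,
-- and average the matched polarities per tweet.
CREATE TABLE tweets_sentiment AS
SELECT
  w.id,
  CASE
    WHEN AVG(d.polarity) > 0 THEN 'POSITIVE'
    WHEN AVG(d.polarity) < 0 THEN 'NEGATIVE'
    ELSE 'NEUTRAL'
  END AS sentiment
FROM (
  SELECT id, word
  FROM mytweets_raw
  LATERAL VIEW explode(split(lower(text), '\\s+')) t AS word
) w
JOIN dictionary d ON (w.word = d.word)
GROUP BY w.id;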
[Figure: sample of the fetched Twitter data containing our keyword]

[Figure: data files in the Flume directory in HDFS]
The sentiments of the tweets were calculated by using polarity. The output shows the
sentiment of each tweet, i.e. whether it is positive, negative or neutral in nature. The output
table consists of the tweet id and the sentiment. As every tweet contains a unique id, it
is easy to analyze the sentiment of every tweet.

TWEET ID SENTIMENT
1024177955965161472 NEUTRAL
1024177960763498497 NEUTRAL
1024178084713455616 POSITIVE
1024178114283413506 POSITIVE
1024178196848287745 NEGATIVE
1024179212901654528 NEGATIVE
HIVE:
• Hive and Flume are tools of the Hadoop Big Data ecosystem, and they are efficient for
extracting and loading data.
• Hive is basically used for managing and querying structured data, whereas Flume is used
for collecting, aggregating and moving large amounts of streaming event data.
• There are different methods for processing real-time streaming data, such as writing
custom code or using MapReduce.
• Using Apache Hive and Apache Flume, this work can be done easily and in less
time. The operations are performed on the stored data.
Data Pre-processing -Clean-ups / formatting  
• Data Pre-processing: Get the "clean" data and transform it into the format we need.
• Data goes through a series of steps during preprocessing:
• Data Cleaning: Data is cleansed through processes such as filling in missing values,
smoothing the noisy data, or resolving the inconsistencies in the data.
• Data Integration: Data with different representations are put together and conflicts
within the data are resolved.
• Data Transformation: Data is normalized, aggregated and generalized.
• Data Reduction: This step aims to present a reduced representation of the data in a data
warehouse.
• Data Discretization: Involves reducing the number of values of a continuous
attribute by dividing its range into intervals.
Data preprocessing transforms the data
into a basic form that makes it easy to work with.
Data Normalization:
•Data normalization brings attributes that are
expressed in different measurement units and
ranges onto a common scale.
Data Cleaning:
•Data cleaning is an important data
preprocessing technique and part of the data
mining (DM) process. It removes erroneous data,
reduces unnecessary information, and also
handles missing data.
Data reduction:
•Data reduction applies a set of techniques to
obtain a reduced representation of the original
data that still approximately suits the input
of the DM task.
Pre-processing of extracted data:
• After retrieval of the tweets, applying a sentiment analysis tool to the raw tweets in most cases
results in very poor performance. Therefore, preprocessing techniques are necessary for
obtaining better results. We extract tweets, i.e. short messages, from Twitter and use them as raw
data. This raw data needs to be preprocessed.
So, preprocessing involves the following steps, which build up to the construction of n-grams (a HiveQL sketch of the first steps follows this list):
Filtering: Filtering is nothing but cleaning of raw data. In this step, URL links (e.g.
http://twitter.com), special Twitter words (e.g. "RT", which means ReTweet), user names
(e.g. @Ron, where the @ symbol indicates a user name) and emoticons are removed.
Tokenization: Tokenization is nothing but segmentation of sentences. In this step, we tokenize
or segment the text by splitting it on spaces and punctuation marks to form a bag
of words.
Removal of Stop words: Articles such as "a", "an", "the" and other stop words such as "to", "of",
"is", "are", "this", "for" are removed in this step.
Construction of n-grams: Sets of n-grams can be made out of consecutive words. Negation words
such as "no" and "not" are attached to the word which follows or precedes them. For instance,
"I do not like remix music" produces bigrams such as "do+not" and "not+like". This
procedure improves the accuracy of the classification, because negation plays an important
role in sentiment analysis: it is a very common linguistic construction that affects polarity.
Classification: When performing classification, every tweet is checked for positive and negative
words from a fixed list. Then the average is taken for both the positive and the negative matches;
the tweet is labeled positive or negative depending on which score is higher, and if the score is
0 it is neutral.
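
A rough HiveQL sketch of the filtering, tokenization and stop-word steps is shown below. The regular expressions, the stop-word list and the view name cleaned_words are illustrative choices, and mytweets_raw is again assumed to expose id and text columns:

-- Strip URLs, the RT marker and @user names, lower-case the text,
-- explode it into one word per row, and drop empty tokens and stop words.
CREATE VIEW cleaned_words AS
SELECT id, word
FROM (
  SELECT
    id,
    lower(regexp_replace(text, 'http\\S+|\\bRT\\b|@\\w+', ' ')) AS clean_text
  FROM mytweets_raw
) c
LATERAL VIEW explode(split(clean_text, '[^a-z]+')) t AS word
WHERE word != ''
  AND word NOT IN ('a', 'an', 'the', 'to', 'of', 'is', 'are', 'this', 'for');

For the n-gram step, Hive's built-in sentences() and ngrams() functions can help; emoticon removal would need an additional pattern.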
[Diagram: pre-processed data → empirical classification → positive / negative / neutral]
FEATURES OF HIVE
Some of Hive's features are discussed below:
Framework: Apache Hive is built on top of the Hadoop Distributed File System (HDFS).
Large datasets: It helps to query large datasets residing in distributed storage.
Warehouse: Hive can also be described as a distributed data warehouse.
Language: Hive queries data using a SQL-like language called HiveQL (HQL).
Declarative language: HiveQL is a declarative language like SQL.
Table structure: Table structures are similar to tables in a relational database.
Multi-user: Multiple users can simultaneously query the data using HiveQL.
Data Analysis: To perform more detailed data analysis, Hive allows writing custom
MapReduce processes.
ETL support: It is possible to extract/transform/load (ETL) data easily.
Data Formats: Hive offers structure on a variety of data formats.
Storage: Hive allows access to files stored in HDFS, as well as in other data storage systems such
as Apache HBase.
 
Format conversion: Hive also allows converting between a variety of data formats, and
doing so is very simple.
Hive SerDe interface:
• Before applying logical queries to the data, we need to make sure that the Hive
table can appropriately translate the JSON-formatted data, using a JSON validator.
• Hive expects input files that use a delimited row format, yet our fetched data is
in JSON format, which will not work as-is.
• Instead, we can use the Hive SerDe interface to determine how to
translate the data. SerDes are the interfaces which tell Hive how it should
modify/change the data so that Hive can process it. So we have added the jar file "hive-
serdes-1.0-SNAPSHOT.jar" to the directory /usr/local/hive/lib. This is used
by the Hive shell to extract the clean data from the downloaded data into the Hive table.


• By using the Hive jar file and the custom SerDe, we can store the unstructured data into the
Hive table named "mytweets_raw" in a structured format (a sketch of the table definition
follows this list).
• This is also our input data, on which sentiment analysis is done. The dataset was taken from
the social media platform Twitter using the Twitter Streaming API (application program interface).
• Further, it was passed through Apache Flume.
• These fetched datasets (tweets) are stored in HDFS.
• The tweet data in the Flume directory represents the list of Twitter data extracted, containing
the keyword specified in the configuration file.
• We can check the files by downloading them and viewing the tweets relating to the keyword.
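
A hedged sketch of such a table definition follows. It mirrors, in abbreviated form, the widely used schema from Cloudera's Twitter example; the exact columns in the project may differ, and the HDFS location is assumed to match the Flume sink path configured earlier:

-- Register the custom SerDe and map the raw JSON tweets onto columns.
ADD JAR /usr/local/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;

CREATE EXTERNAL TABLE mytweets_raw (
  id BIGINT,            -- unique tweet id
  created_at STRING,    -- tweet timestamp
  text STRING,          -- tweet body used for sentiment scoring
  `user` STRUCT<screen_name:STRING, friends_count:INT, followers_count:INT>
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';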

• Dictionaries:
• For analyzing the tweets, we have to take polarity into consideration using various types of
dictionaries.
• Lexical Dictionary: It mainly consists of common English words, which help us analyze
the tweets by matching each word in a tweet with the words in the lexical dictionary. It also
contains idioms, phrases, headwords and multiwords.
• Acronym Dictionary: It is used to expand all the abbreviations and acronyms which will further
generate words which can be analyzed using lexical dictionary.
• Emoticon Dictionary: A tweet containing emoticons can be analyzed by using this dictionary.
Emoticons are basically a textual portrayal of the tweeter's mood, which conveys some meaning.
• Stop Words Dictionary: These are words in the tweet which do not have any polarity, so they
need not be analyzed. They are eliminated and tagged as stop words. We maintain a dictionary
with a list of all stop words, for example: able, are, both, etc.
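
For illustration, a word/polarity lexicon (for instance an AFINN-style tab-separated list; the file path here is a placeholder) could be loaded into the dictionary table from the earlier sketch like this:

-- Load a tab-separated word/polarity lexicon into the dictionary table.
LOAD DATA LOCAL INPATH '/tmp/afinn.txt'
OVERWRITE INTO TABLE dictionary;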
Supporting Tools:
Enguity
Social Mention
Streamcrab

Conclusion:
• Sentiment Analysis deals with the perception of the product and understanding of
the market through the lens of sentiment data.
• There are many sources of public and private information out of which you can
harness an insight into the customer’s perception of the product and general market
situation.
• This project not only analyses the sentiments of the users but also computes other
results, like the user with the maximum friends/followers, top tweets, etc. Hence, Hadoop
can also be used effectively to compute such results in order to determine the
current trends with respect to particular topics. This can be very useful in the
marketing sector.
