
MINI PROJECT REPORT

on
SENTIMENT ANALYSIS OF TWEETS DATA USING DEEP
LEARNING AND BIG DATA APPROACH
SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENT FOR THE
UNIVERSITY OF MUMBAI FOR THE DEGREE OF
Bachelor of Engineering
by
Mr. Javed Khan ARMIET/COMP/16KJ015
Ms. Nuzhat Ansari ARMIET/COMP/16AN031
Mr. Hamza Ansari ARMIET/CS 15AH02
Under the guidance of
PROF. VIVEK PANDEY

ALAMURI RATNAMALA INSTITUTE OF ENGINEERING AND TECHNOLOGY

Affiliated to
UNIVERSITY OF MUMBAI

Department of Information Technology


Academic Year – 2018-2019

CERTIFICATE

This is to certify that the Project-I entitled SENTIMENT ANALYSIS OF
TWEETS DATA USING DEEP LEARNING AND BIG DATA APPROACH
submitted by JAVED KHAN, NUZHAT ANSARI, HAMZA ANSARI bearing PIN
ARMIET/COMP/16KJ015, ARMIET/COMP/16AN031, ARMIET/CS15AH02 on
this First Half - 2019 in partial fulfilment of the requirements for the award of the
Degree of Bachelor of Engineering in Computer of University of Mumbai is a bonafide
work to the best of my/our knowledge and may be placed before the Examination Board
for their consideration.

HEAD OF THE DEPARTMENT PRINCIPAL

GUIDE
(Prof. Vivek Pandey)

Date:

ACKNOWLEDGEMENT

We would like to take this opportunity to express our heartfelt gratitude to the people whose
help and co-ordination have made this seminar a success. We thank Prof. Vivek Pandey for the knowledge,
guidance and co-operation provided in the process of making this project.

We owe the success of this seminar to our guide and convey our thanks to them. We would like to
express our heartfelt gratitude to our HOD, Prof. Ankit Sanghvi, and to all the teachers and staff members of
the Computer Engineering Department for their full support. We would also like to thank our principal for
the conducive environment in the institution.

We are also grateful to the library staff of ARMIET for the numerous books and magazines
made available for handy reference and for the use of the internet facility.

Lastly, we are also indebted to all those who have indirectly contributed to making this
seminar a success.

CONTENTS

SR NO TOPIC NAME PAGE NO

i. MINI PROJECT REPORT 1


ii. CERTIFICATE 2
iii. ACKNOWLEDGEMENT 3
iv. CONTENTS 4
v. ABSTRACT 5
1. INTRODUCTION 6
2. LITERATURE SURVEY 12
3. PROBLEM STATEMENT 15
3.1 EXISTING SYSTEM 16
3.2 SCOPE 16
3.3 PROPOSED SYSTEM 17
3.4 SYSTEM REQUIREMENTS 19
4. METHODOLOGY FOR IMPLEMENTATION 20
5. CONCLUSION 22
6. REFERENCES 24

ABSTRACT

Social sites like Twitter help millions of people share their thoughts and feelings about a particular
topic. A tweet is a short and simple form of expression. Detecting sentiment in text has a wide range
of applications, including identifying anxiety or depression in individuals and measuring the
well-being or mood of a community. In this review paper we therefore focus on sentiment analysis of
Twitter data. Sentiment can be expressed in many ways, such as facial expressions and gestures, speech,
and written text. Sentiment analysis of text documents is essentially a content-based classification
problem involving concepts from the domains of Natural Language Processing and Machine Learning.
Research on sentiment analysis of Twitter data can be approached from several aspects. This paper
surveys the different types of sentiment analysis and the techniques used to extract the data, and
presents a comparative study of different approaches and techniques of sentiment analysis that use
Twitter as the data source.

1. INTRODUCTION


Social sites such as Twitter, Google+, Instagram, Facebook and YouTube have gained great popularity
these days. Sentiment analysis, also known as opinion mining, falls under computational linguistics
and data mining. With the growing use of social sites, researchers have started to apply analysis
techniques to public data to perform sentiment analysis in areas such as politics, sociology, economy,
entertainment and finance.

Sentiment analysis mainly aims to detect the public's mood, behaviour, sentiments, thoughts and
opinions from the texts provided. Most of the data available on social sites, almost 80%, is
unstructured, which makes it more difficult to analyse and to draw a judgement from. Making a decision
often requires the opinions of many people, especially when the decision involves valuable resources.
The World Wide Web now gives people new tools to share their ideas. Sentiment analysis concentrates
only on detecting polarity, i.e. whether a text is positive, negative or neutral. Twitter is a
microblogging site that allows people to express and share their ideas in a large number of short
messages, which are also used for marketing and networking. For example, film producers may be eager
to know the public's opinions about their movies. Nowadays, gathering opinions and drawing conclusions
about people's likes and dislikes has become a most important perspective. As the internet grows
bigger, its horizons become wider.
Social media and microblogging platforms such as Facebook, Twitter and Tumblr dominate in spreading
condensed news and trending topics across the globe at a rapid pace. A topic becomes trending when
more and more users contribute their opinions and judgements, which makes it a valuable source of
online perception. These topics are generally intended to spread awareness or to promote public
figures, political campaigns during elections, product endorsements, and entertainment such as movies
and award shows. Large organizations and firms take advantage of people's feedback to improve their
products and services, which in turn helps to enhance their marketing strategies. There is huge
potential for discovering and analysing interesting patterns in the vast volume of social media data
for business-driven applications.

Sentiment analysis is the prediction of emotions in a word, sentence or corpus of documents. It is
intended to serve as an application to understand the attitudes, opinions and emotions expressed
within an online mention. The intention is to gain an overview of the wider public opinion behind
certain topics. Precisely, it is a paradigm of categorizing conversations into positive, negative or
neutral labels.

Many people use social media sites to network with other people and to stay up to date with news and
current events. These sites (Twitter, Facebook, Instagram, Google+) offer people a platform to voice
their opinions. For example, people quickly post reviews online as soon as they watch a movie and then
start a series of comments discussing the acting in the movie. This kind of information gives people a
basis to evaluate and rate the performance not only of a movie but also of other products, and to
judge whether it will be a success. The vast information on these sites can be used for marketing and
social studies. Sentiment analysis therefore has wide applications, including emotion mining, polarity,
classification and influence analysis.

Twitter is an online networking site driven by tweets, which were originally limited to 140
characters; the limit has since been extended to 280 characters. The character limit encourages the
use of hashtags for text classification. Currently, around 6500 tweets are published per second, which
results in approximately 561.6 million tweets per day. These streams of tweets are generally noisy and
reflect multi-topic, changing-attitude information in an unfiltered and unstructured format. Sentiment
analysis can be performed at different levels: at a coarse level, entire documents are analysed, while
at a fine level, individual attributes are analysed.

However, analysing the sentiment expressed in tweets is not an easy job. Many challenges are involved
in terms of the tonality, polarity, lexicon and grammar of the tweets. Tweets tend to be highly
unstructured and non-grammatical, which makes it difficult to interpret their meaning. Moreover, slang
words, acronyms and out-of-vocabulary words are quite common in tweets. Categorizing such words by
polarity is tough for the natural language processors involved.

1.1 Twitter
Performing sentiment analysis on tweets is challenging. Research in this field has produced
various techniques for training a model and then testing it to check its effectiveness. The aim is
to classify tweets into different sentiments accurately. The words used in tweets often differ from
standard English dictionary words, and the constantly evolving use of slang can quickly make an
approach outdated. Twitter also permits user references, URLs, emoticons and hashtags, which
require different processing from ordinary words. All of these are problems faced in the
pre-processing stage of the system.

1.2 SENTIMENT ANALYSIS ON DATASET


1.2.1 Data collection
Data in the form of raw tweets is retrieved using the “Twitter4J” library (a Java library that can also
be called from Scala), which provides a package for the real-time Twitter streaming API. The API
requires us to register a developer account with Twitter and fill in parameters such as the consumer
key, consumer secret and token secret. The API can return all random tweets or filter the data by
keywords. Filters retrieve tweets that match specific criteria defined by the developer. We used this
to retrieve tweets related to specific keywords taken as input from users. Initially, we must at least
set an application name and mode.

1.2.2 Data Processing


Data processing is the process of splitting the tweets into individual words called tokens. Tweets can
be split into tokens using whitespace or punctuation characters. The features can be unigrams or
bigrams, depending on the classification model used. The bag-of-words model is one of the most
extensively used models for classification. It assumes that the text to be classified is a bag, or
collection, of individual words with no link or interdependence between them. The simplest way to
incorporate this model in our project is to use unigrams as features. A unigram representation is just
the collection of individual words in the text to be classified, so we split each tweet using
whitespace. Tweets are normalized by converting them to lowercase, which makes comparison with a
dictionary easier.
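
The tokenization described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the project's actual code: the function names tokenize and ngrams and the sample tweet are assumptions made for the example.

```python
import re


def tokenize(tweet, split_on_punctuation=False):
    """Lowercase a tweet and split it into tokens."""
    text = tweet.lower()
    if split_on_punctuation:
        # Treat any run of non-word characters as a separator.
        return [tok for tok in re.split(r"\W+", text) if tok]
    # str.split() with no argument splits on any run of whitespace.
    return text.split()


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample tweet, used only for illustration.
tokens = tokenize("Great movie, LOVED the acting")
unigrams = ngrams(tokens, 1)  # bag-of-words features
bigrams = ngrams(tokens, 2)   # pairs of adjacent tokens
```

Note that splitting on whitespace alone leaves punctuation attached to a token ("movie," in the sample), which is one reason the filtering step that follows is needed.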

1.2.3 Data Filtering
A tweet acquired after data processing still contains raw information that may or may not be
useful for our application. These tweets are therefore filtered further by removing stop words,
numbers and punctuation.

Stop words: Tweets contain stop words, which are extremely common words such as “is”, “am” and
“are” that carry no additional information. These words serve no purpose for the analysis, so they
are removed using a list stored in stopfile.dat. We compare each word in a tweet with this list
and delete the words that match the stop list.
(Code snippet for stop words removal)

Removing non-alphabetical characters: Symbols such as “#” and “@” and numbers hold no relevance
for sentiment analysis and are removed using pattern matching. Regular expressions are used to
match alphabetical characters only, and the rest are ignored. This helps to reduce the clutter
from the Twitter stream.
(Code snippet for removing non-alphabets)

Stemming: It is the process of reducing derived words to their roots.
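
A minimal Python sketch of the three filtering steps is given below. It rests on several assumptions: the stop list is a small in-memory set rather than the contents of stopfile.dat, the stemmer is a deliberately crude suffix stripper standing in for a real algorithm such as Porter's, and all function names are hypothetical.

```python
import re

# Illustrative stop list; the report loads its list from stopfile.dat.
STOP_WORDS = {"is", "am", "are", "the", "a", "an"}


def remove_non_alpha(tweet):
    """Replace every character that is not a letter or whitespace with a space."""
    return re.sub(r"[^A-Za-z\s]", " ", tweet)


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop list."""
    return [tok for tok in tokens if tok not in stop_words]


def naive_stem(word):
    """Strip a few common suffixes; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def filter_tweet(tweet):
    """Apply character filtering, stop-word removal and stemming in order."""
    cleaned = remove_non_alpha(tweet).lower()
    tokens = remove_stop_words(cleaned.split())
    return [naive_stem(tok) for tok in tokens]


# Hypothetical sample tweet, used only for illustration.
filtered = filter_tweet("The actors are AMAZING!!! #movie 2019")
```

The crude stemmer produces roots such as "amaz" that are not dictionary words; this is normal for stemming, but it means the sentiment word lists would have to be stemmed in the same way before comparison.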

1.2.4 Feature Extraction


Term frequency-inverse document frequency (TF-IDF) is a method used in text mining to measure
the importance of a term to a document in the corpus. The recommended API for it is the
DataFrame-based API. This feature is useful where we need to find trending topics or to create
word clouds. However, this project is more focused on finding sentiment in Twitter streams, so
TF-IDF is not implemented.
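
Although the project does not implement this feature, the idea can be illustrated with a small pure-Python sketch. It uses one common unsmoothed variant of the weighting (term frequency multiplied by the logarithm of the inverse document frequency); library implementations typically add smoothing and normalisation, so their numbers will differ. The function name and the toy corpus are assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF weights for a list of tokenised documents.

    tf(t, d) = count of t in d / number of tokens in d
    idf(t)   = log(N / df(t)), where df(t) is the number of documents containing t
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document

    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus of three tokenised tweets, used only for illustration.
corpus = [["good", "movie"], ["bad", "movie"], ["good", "acting"]]
weights = tf_idf(corpus)
```

In this toy corpus "bad" and "acting" each occur in only one tweet and so receive a higher weight than "movie" or "good", which occur in two.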

1.2.5 Classification Algorithm

• Naïve Bayes - In machine learning, naive Bayes classifiers are a family of simple
"probabilistic classifiers" based on applying Bayes' theorem with strong
(naive) independence assumptions between the features. Naive Bayes has been studied
extensively since the 1960s. It was introduced (though not under that name) into the text
retrieval community in the early 1960s, and remains a popular (baseline) method for text
categorization, the problem of judging documents as belonging to one category or the
other (such as spam or legitimate, sports or politics, etc.) with word frequencies as the
features. With appropriate pre-processing, it is competitive in this domain with more
advanced methods including support vector machines. It also finds application in
automatic medical diagnosis. Naive Bayes classifiers are highly scalable, requiring a
number of parameters linear in the number of variables (features/predictors) in a learning
problem. Maximum-likelihood training can be done by evaluating a closed-form
expression, which takes linear time, rather than by expensive approximation as used for
many other types of classifiers.
• Maximum Entropy - The principle of maximum entropy states that the probability
distribution which best represents the current state of knowledge is the one with
largest entropy, in the context of precisely stated prior data (such as a proposition that
expresses testable information). Another way of stating this: Take precisely stated prior
data or testable information about a probability distribution function. Consider the set of
all trial probability distributions that would encode the prior data. According to this
principle, the distribution with maximal information entropy is the best choice.
• Support Vector Machine - In machine learning, support-vector machines (SVMs,
also support-vector networks) are supervised learning models with associated
learning algorithms that analyze data used for classification and regression analysis.
Given a set of training examples, each marked as belonging to one or the other of two
categories, an SVM training algorithm builds a model that assigns new examples to one
category or the other, making it a non-probabilistic binary linear classifier (although
methods such as Platt scaling exist to use SVM in a probabilistic classification setting).
An SVM model is a representation of the examples as points in space, mapped so that the
examples of the separate categories are divided by a clear gap that is as wide as possible.
New examples are then mapped into that same space and predicted to belong to a
category based on which side of the gap they fall.
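
To make the first of these classifiers concrete, the following is a minimal sketch of a multinomial Naive Bayes classifier over bag-of-words features, written from scratch in Python. It is not the implementation used in any of the surveyed work: the class name, the add-one (Laplace) smoothing and the four training examples are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words features with add-one smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)        # documents per class
        self.word_counts = defaultdict(Counter)    # word frequencies per class
        self.vocab = set()
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # log P(class) + sum of log P(word | class), using the naive
            # assumption that words are independent given the class.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                count = self.word_counts[label][tok] + 1  # add-one smoothing
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Tiny hypothetical training set, used only for illustration.
train_docs = [["good", "movie"], ["great", "acting"],
              ["bad", "movie"], ["terrible", "plot"]]
train_labels = ["pos", "pos", "neg", "neg"]
model = NaiveBayes().fit(train_docs, train_labels)
prediction = model.predict(["great", "movie"])
```

Training only counts words per class, which is the closed-form, linear-time estimation mentioned above. Working in log space avoids numerical underflow when many small probabilities are multiplied.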
1.2.6 Sentiment Analysis
Sentiment analysis is done using a custom algorithm that finds the polarity of a tweet, as
described below.

Finding polarity: To discover the polarity, we used a simple algorithm that counts the positive
and negative words in a tweet. Separate lists were made for positive and negative words. The next
step is to compare every word in a tweet against both lists. If the current word matches a word in
the positive list, the score is incremented by 1, and if it matches a word in the negative list,
the score is decremented by 1. More positive words therefore lead to a higher sentiment score.

Sentiment analysis output: The output contains a list of tweets in real time, with their sentiment
scores on the left-hand side. The first tweet has a score of -2, which is due to two negative
keywords. The next two tweets are positive, as they contain keywords like “good” and “great”. Both
these words are in the positive word list. Note that a tweet with a score of 0 is left out of the
final output. Neutral tweets serve no purpose here, as they do not convey any sentiment towards
the product.
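
The scoring rule described above can be sketched as follows. The word lists here are short placeholders rather than the lists used in the project, and the function names and sample tweets are hypothetical.

```python
# Placeholder word lists; the project uses its own, longer lists.
POSITIVE_WORDS = {"good", "great", "love", "excellent"}
NEGATIVE_WORDS = {"bad", "poor", "hate", "terrible"}


def sentiment_score(tokens):
    """Add 1 for each positive word and subtract 1 for each negative word."""
    score = 0
    for tok in tokens:
        if tok in POSITIVE_WORDS:
            score += 1
        elif tok in NEGATIVE_WORDS:
            score -= 1
    return score


def score_tweets(tweets):
    """Return (score, tweet) pairs, skipping neutral tweets with a score of 0."""
    results = []
    for tweet in tweets:
        score = sentiment_score(tweet.lower().split())
        if score != 0:
            results.append((score, tweet))
    return results


# Hypothetical tweets, used only for illustration.
scored = score_tweets([
    "bad plot and terrible acting",
    "a good film",
    "the film opens friday",
])
```

One limitation of this rule is that a tweet with equal numbers of positive and negative words also scores 0 and is dropped together with the genuinely neutral tweets.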

2. LITERATURE SURVEY

Twitter is a popular social networking website where users post and interact with messages known as
“tweets”. It serves as a means for individuals to express their thoughts or feelings about different
subjects. Various parties, such as consumers and marketers, have performed sentiment analysis on
such tweets to gather insights into products or to conduct market analysis. With the recent
advancements in machine learning algorithms, the accuracy of sentiment analysis predictions has been
able to improve. In this report, we attempt to conduct sentiment analysis on tweets using various
machine learning algorithms. We attempt to classify the polarity of a tweet as either positive or
negative. If a tweet has both positive and negative elements, the more dominant sentiment should be
picked as the final label.
We use a dataset from Kaggle, which was crawled and labelled positive/negative. The data comes with
emoticons, usernames and hashtags, which need to be processed and converted into a standard form.
Useful features, such as unigrams and bigrams, also need to be extracted from the text as a form of
representation of the tweet. We use various machine learning algorithms to conduct sentiment
analysis using the extracted features.

Singh, Prabhsimran, Ravinder Singh Sawhney, and Karanjeet Singh Kahlon. "Sentiment analysis
of demonetization of 500 & 1000-rupee banknotes by Indian government." ICT Express
(2017).[2] This paper discusses and examines the government policy of demonetization from the
citizens' point of view. The authors use this point of view to approach sentiment analysis on
a Twitter data set. Tweets are collected state-wise, i.e. by geo-location, for the analysis.
The sentiment analysis is used to classify the country into the categories happy, sad, very sad,
neutral and no affect. The tweets are collected based on keywords and hashtags such as
#demonetization.

Gautam, Geetika, and Divakar Yadav. "Sentiment analysis of twitter data using machine learning
approaches and semantic analysis." Contemporary computing (IC3), 2014 seventh international
conference on. IEEE, 2014.[3] This paper applies sentiment analysis to the classification of
customer reviews. The authors use three supervised machine learning algorithms (Naive Bayes,
Maximum Entropy and SVM), followed by semantic analysis, which is used to calculate similarity
alongside all three. They use Python and the Natural Language Toolkit to train and classify the
methods. The Naive Bayes approach gives a better result than Maximum Entropy and SVM.

Fang, Xing, and Justin Zhan. "Sentiment analysis using product review data." Journal of Big Data
2.1 (2015).[4] This paper tackles sentiment polarity categorization, one of the basic problems
of sentiment analysis. Online product reviews collected from Amazon.com are used as data.
Experiments are carried out for both sentence-level and review-level categorization. Naïve Bayes,
Random Forest and SVM are the classification techniques used. The open-source scikit-learn
software, a machine learning package for Python, is used for this study.

Amolik, Akshay, et al. "Twitter sentiment analysis of movie reviews using machine learning
techniques." International Journal of Engineering and Technology 7.6 (2016).[5] The authors propose
an improved model for sentiment analysis of Twitter data about reviews of upcoming Bollywood and
Hollywood movies. With the help of Naive Bayes and SVM, they are able to classify those tweets
accurately. Naive Bayes is better than SVM in precision but slightly lower in accuracy and
recall. The accuracy can be increased by increasing the training data.

3. PROBLEM STATEMENT


3.1 EXISTING SYSTEM:

There are many traditional methods for gathering knowledge and staying updated on the latest
technology. Some of the methods are listed below:
• Surveys and Questionnaires
• Interviews
• Feedback

3.2 SCOPE:

Sentiment analysis is a uniquely powerful tool for businesses that are looking to measure attitudes,
feelings and emotions regarding their brand. To date, the majority of sentiment analysis projects have
been conducted almost exclusively by companies and brands through the use of social media data,
survey responses and other hubs of user-generated content. By investigating and analyzing customer
sentiments, these brands are able to get an inside look at consumer behaviours and, ultimately, better
serve their audiences with the products, services and experiences they offer.

The future of sentiment analysis is going to continue to dig deeper, far past the surface of the number of
likes, comments and shares, and aim to reach, and truly understand, the significance of social media
interactions and what they tell us about the consumers behind the screens. This forecast also predicts
broader applications for sentiment analysis – brands will continue to leverage this tool, but so will
individuals in the public eye, governments, non-profits, education centres and many other organizations.

3.3 PROPOSED SYSTEM:

• Input – Keyword:
Take a subject keyword as input, collect data related to that keyword, and perform sentiment
analysis on it.
• Retrieval of Tweets:
The retrieved data can be of different types: structured, semi-structured and unstructured. R or
Python can be used to collect data from Twitter.
• Data Pre-Processing:
Pre-processing filters the data by removing incomplete and noisy data.
The following tasks are involved in pre-processing:
  ◦ Removal of retweets
  ◦ Removal of special characters and numbers
  ◦ Stemming
  ◦ Tokenization
• Detection of Sentiment:
The main and fundamental task in sentiment analysis is to classify the polarity of the given tweets.
Polarity identification is done by using different lexicons. The polarity is of three types:
Positive, Negative or Neutral.
• Algorithm of Classification:
  ◦ Supervised Learning
    i. Naïve Bayes
    ii. Maximum Entropy
    iii. Support Vector Machine
  ◦ Unsupervised Learning
    i. Lexicon Based Method
    ii. Dictionary Based Method
    iii. Corpus Based Method
• Output Analysis:
After the analysis is done, the result will be presented in a graphical format.
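
As an illustration of the retweet-removal part of pre-processing, the sketch below drops retweets and strips Twitter-specific markup before the later steps run. It assumes that retweets can be recognised by the conventional "RT @" prefix, which is a heuristic rather than a guarantee; the function names and sample tweets are hypothetical.

```python
import re


def is_retweet(text):
    """Heuristic: classic retweets start with the prefix 'RT @'."""
    return text.startswith("RT @")


def strip_twitter_markup(text):
    """Remove URLs and user mentions, and keep hashtag words without the '#'."""
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"@\w+", " ", text)          # user mentions
    text = text.replace("#", " ")              # keep the hashtag word itself
    return " ".join(text.split())              # collapse repeated whitespace


def preprocess(tweets):
    """Drop retweets, then clean the markup of the remaining tweets."""
    return [strip_twitter_markup(t) for t in tweets if not is_retweet(t)]


# Hypothetical tweets, used only for illustration.
cleaned = preprocess([
    "RT @someone: hello world",
    "Loved it @bob #great https://t.co/x",
])
```

Removing retweets before counting avoids the same opinion being counted many times, which would otherwise skew the sentiment percentages.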

3.4 PROPOSED SYSTEM ARCHITECTURE:


Sentiment analysis is extremely useful in social media monitoring as it allows us to gain an overview of
the wider public opinion behind certain topics. Social media monitoring tools like Brandwatch
Analytics make that process quicker and easier than ever before, thanks to real-time monitoring
capabilities.

The applications of sentiment analysis are broad and powerful. The ability to extract insights from social
data is a practice that is being widely adopted by organizations across the world.

3.5 SYSTEM REQUIREMENTS:

3.5.1 S/W REQUIREMENT

1. Python
2. R

3.5.2 H/W REQUIREMENT

1. Access to a high-speed network connection (e.g. cable or DSL, not dial-up)
2. Processor: i3 or better processor (8th-generation i7 processor recommended)
3. Operating System: Windows 7 or Windows 10 with all current updates installed
4. Memory: 2+ gigabytes of RAM
5. Hard drive: 512 gigabytes
6. Sound card and speakers; headset with microphone that plugs into your sound card (not a USB
connection)
7. Monitor with 1024 x 768 pixel resolution or better

4. METHODOLOGY FOR IMPLEMENTATION

4.1 EXTRACTION

There are many notable works and tools focusing on text mining on social networks. The approach to
extract sentiment from tweets is as follows:
1. Start by downloading and caching the sentiment dictionary.
2. Download the Twitter testing data sets and input them into the program.
3. Clean the tweets by removing the stop words.
4. Tokenize each word in the data set and feed the tokens into the program.
5. Compare each word with the positive and negative sentiment words in the dictionary, then
increment the positive count or the negative count accordingly.
6. Finally, based on the positive and negative counts, compute the percentage of each sentiment
to decide the polarity.
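
Steps 3 to 6 above can be sketched in Python as follows. The sentiment dictionary is passed in as two plain sets of words, which is an assumption about its format; the function names, default arguments and sample tweets are likewise hypothetical.

```python
from collections import Counter


def polarity(tokens, positive_words, negative_words):
    """Label one tokenised tweet by comparing positive and negative counts."""
    pos = sum(1 for tok in tokens if tok in positive_words)
    neg = sum(1 for tok in tokens if tok in negative_words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"


def polarity_percentages(tweets, positive_words, negative_words,
                         stop_words=frozenset()):
    """Return the percentage of positive, negative and neutral tweets."""
    labels = Counter()
    for tweet in tweets:
        # Steps 3 and 4: remove stop words and tokenise on whitespace.
        tokens = [t for t in tweet.lower().split() if t not in stop_words]
        # Step 5: compare each token with the sentiment dictionary.
        labels[polarity(tokens, positive_words, negative_words)] += 1
    total = len(tweets)
    if total == 0:
        return {}
    # Step 6: convert the counts into percentages.
    return {label: 100.0 * labels[label] / total
            for label in ("positive", "negative", "neutral")}


# Hypothetical dictionary and tweets, used only for illustration.
percentages = polarity_percentages(
    ["good good bad", "bad day", "hello world", "great"],
    positive_words={"good", "great"},
    negative_words={"bad"},
)
```

Reporting percentages rather than raw counts makes results comparable across queries that return different numbers of tweets.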

4.2 IMPLEMENTATION
4.2.1 Implementation
In this paper, we used Python to implement sentiment analysis. The packages used include tweepy and
textblob. The required libraries can be installed with the following commands:
• pip install tweepy
• pip install textblob
The second step is to download the dictionary by running the following command:
python -m textblob.download_corpora
TextBlob is a Python library for text processing, and it uses NLTK for natural language processing.
A corpus is a large and structured set of texts; the downloaded corpora are needed for analysing tweets.

4.2.2 Connect to Twitter using APIs


To connect to Twitter and query the latest tweets, we need to create an account on Twitter and define
an application. Users need to go to apps.twitter.com/app/new and generate the API keys.
For security reasons, the API keys are not shown.

4.2.3 Sample Results
The following is sample output of the program for the query ‘fake news’, based on the last 300
tweets from Twitter.
Positive tweets percentage: 16.39 %
Negative tweets percentage: 72.13 %
Neutral tweets percentage: 11.47 %
Positive tweets:
tweet: @Nigel_Farage @PoppyLegion Least we forget: Farage is rich. Brexit makes him richer. He is establishment. He is a l… https://t.co/FhZSCBVHJs
tweet: @kirk0071 @Scavino45 @WhiteHouse @POTUS @realDonaldTrump Thanks for the good belly laugh this morning. Your HateTru… https://t.co/AWHXoC84LJ
tweet: @rolandsmartin Roland I like you brother but you really need to distant yourself from Donna Brazile,she's been comp… https://t.co/zqRCsVu98d
Negative tweets:
tweet: RT @Independent: If you saw these tweets, you were targeted by Russian Brexit propaganda https://t.co/Cc8IvQApbY
tweet: Behind Fox News' Baseless Seth Rich Story: The Untold Tale https://t.co/TXcDP1oQ5H
tweet: RT @JackPosobiec: Fake news called the Poland independence day parade a “Nazi march.” Sick https://t.co/OZA3xUopl1

5. CONCLUSION

Because of the large number of real-world applications, discovering people's opinions is important
for better decision making, and there is therefore exciting new research in the field of sentiment
analysis. People increasingly express their opinions on the web, which increases the need to
analyse opinionated online content for various real-world applications. There is huge scope for
improving the existing sentiment analysis models. In this technical paper, we have discussed
the importance of social network analysis and implemented a Python program to perform
sentiment analysis. The support vector machine was identified as the best data classification
technique; it works no differently on other genres, and these topics can be explored in the
future. Our proposed system classifies a tweet as positive, negative or neutral after it has gone
through the pre-processing stage and supervised classification. In this setting, POS tagging and
tweet features give the best results with SVM. Several other types of machine learning algorithm
may also be useful for solving this type of problem. In future, this technique can be extended
to richer linguistic analysis such as topic modelling.

22
6. REFERENCES


[1] Liu, Bing. "Sentiment analysis and opinion mining." Synthesis lectures on human language
technologies 5.1 (2012).

[2] Singh, Prabhsimran, Ravinder Singh Sawhney, and Karanjeet Singh Kahlon. "Sentiment analysis of
demonetization of 500 & 1000-rupee banknotes by Indian government." ICT Express (2017).

[3] Gautam, Geetika, and Divakar Yadav. "Sentiment analysis of twitter data using machine learning
approaches and semantic analysis." Contemporary computing (IC3), 2014 seventh international
conference on. IEEE, 2014.

[4] Fang, Xing, and Justin Zhan. "Sentiment analysis using product review data." Journal of Big Data
2.1 (2015).

[5] Amolik, Akshay, et al. "Twitter sentiment analysis of movie reviews using machine learning
techniques." International Journal of Engineering and Technology 7.6 (2016).
