
MACHINE LEARNING

Sentiment Analysis
Submitted in partial fulfillment of the requirements for the degree MBA-BUSINESS
ANALYTICS

Supervisor
Dr. Kusum Lata
University School of Management & Entrepreneurship

NAME:- Namrit Mehta (2K19/BMBA/11)
RITIKA (2K19/BMBA/13)

ACKNOWLEDGEMENT

We, second-year students of MBA Business Analytics at the University School of
Management and Entrepreneurship, Delhi Technological University, take this
opportunity to express our gratitude to everyone who supported us throughout the
project.

We are sincerely grateful to Dr. Kusum Lata, who guided us through the completion of
this report. We would also like to thank our teacher for sharing knowledge about
the critical aspects of the topics related to this report and for helping us
whenever needed.

Namrit Mehta
RITIKA
ABSTRACT

Sentiment analysis (also known as opinion mining or emotion AI) refers to the use of natural
language processing, text analysis, computational linguistics, and biometrics to systematically
identify, extract, quantify, and study affective states and subjective information. Sentiment
analysis is widely applied to voice-of-the-customer materials such as reviews and survey responses,
online and social media, and healthcare materials, for applications ranging from marketing to
customer service to clinical medicine.

Analysis of data from social media can also produce interesting insights into public opinion
on almost any product, service or personality. Social data is one of the
most effective and accurate indicators of public sentiment. The explosion of Web 2.0 has led to
an increase in activity in podcasting, blogging, tagging, RSS contributions, social bookmarking,
and social networking. As a result, there has been an explosion of interest in opinion mining,
which is used for many purposes. Sentiment analysis, or opinion mining, is the computational
treatment of the opinions, sentiments and subjectivity of text. In this paper we discuss a method
that allows the use and interpretation of Twitter data to determine public opinion.
Building a sentiment analysis system is an approach to measuring customer perceptions
computationally. This paper reports on the design of a sentiment analysis system that extracts a
large number of tweets and is trained on them. The results classify customer feedback expressed
in tweets as positive or negative and are presented in a pie chart on a web page built with PHP,
CSS and HTML.
INTRODUCTION
Millions of people use social networking sites to express their thoughts, feelings, and concerns
about their daily life. People write about anything, from social activities to comments about
products. Online communities provide a platform where consumers can inform and
influence others. In addition, social media gives businesses a
platform to communicate with their customers, whether to advertise or to speak directly to
customers by responding to their feedback on products and services. At the same time,
consumers have more control over what they want to see and how they
respond. As a result, the successes and failures of a company are shared publicly and spread by
word of mouth. Social networking can also change consumer behavior and decisions: for
example, it has been noted that 87% of internet users are influenced in their purchase decisions
by customer reviews. Therefore, if an organization can quickly find out what
its customers are thinking, it can plan a timely response and come up with a
good strategy to compete with its competitors.

In this project, we use machine learning and natural language processing techniques to
understand the patterns and characteristics of tweets and predict the sentiment (if any) they express.
Specifically, we build a computational model that can classify a given tweet as positive, negative
or neutral based on the sentiment it expresses. The positive and negative classes contain
polar tweets that express a sentiment. The neutral class, however, may contain objective or
subjective tweets in which the user either expresses a neutral opinion or expresses no opinion at all.
Examples of each category can be found in Table 1. The decision to use three classes was made to
reflect the problem realistically and is in line with ongoing research in the field. The tests performed on our
sentiment predictor show that our system is among the best performing systems in
this field. We use our sentiment predictor to create an integrated visualization tool that helps
businesses interpret and visualize public perceptions of their brand and products. This tool
enables the user not only to visualize the distribution of sentiment across the data set, but also
to analyze sentiment along the dimensions of time, location, and the influence of the user.
Table 1: Example tweets for each class

Class      Tweet
positive   @hon1paris: I <3 1D too! #muchlove
negative   The new Transformers suck!! Wasted my time and money!!!
neutral    Well, I guess the govt did what it could. More needed though!
           I plan to wake up early in the morning #early2bad

Machine Learning Background

Figure: Overview of Supervised Sentiment Classification of Tweets

Before we can review the research on Twitter sentiment analysis, we need to
explain the general process for dealing with this problem. Supervised text classification,
a machine learning approach in which a class predictor is inferred from labeled training
data, is the standard method for the sentiment classification of tweets. An overview of this
method, adapted to the sentiment classification of tweets, is illustrated in the figure above.

First, a data set of labeled tweets is compiled. Each tweet in this set has been marked
as positive, negative or neutral by human annotators based on the sentiment they
perceive after reading the tweet.

After that, a feature extractor creates a feature vector for each labeled tweet, where the
values of this vector should characterize the sentiment. Once feature vectors have been extracted
from every tweet in the labeled data set, they are fed to a classification algorithm that
attempts to determine the relationship between each value (called a feature) in the
vector and the sentiment label. The most popular classification algorithms used for this task
are Support Vector Machines (SVM), Naive Bayes and Maximum Entropy.
Studies have compared several classification algorithms and highlighted the
above-mentioned algorithms as the most effective (Pang et al., 2002). The relationships
learned from the inputs by these algorithms are stored in the trained model.
When a new instance is presented to the model, the learned relationships are used to
predict its sentiment.
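
The train-then-predict flow described above can be sketched with a very small Naive Bayes classifier. This is only an illustration under stated assumptions: the function names and the four-tweet training set are hypothetical, the feature extractor is a plain whitespace split, and add-one smoothing is assumed; it is not the classifier used in this project.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_tweets):
    """Learn class counts and per-class word counts from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_tweets:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary

def predict(model, text):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token in vocabulary:  # words never seen in training are ignored
                score += math.log(
                    (word_counts[label][token] + 1) / (total_words + len(vocabulary))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training set, for illustration only
training_set = [
    ("i love this phone", "positive"),
    ("great service and great food", "positive"),
    ("i hate this movie", "negative"),
    ("terrible service wasted my money", "negative"),
]
model = train_naive_bayes(training_set)
```

Calling `predict(model, "terrible movie")` then returns the label whose training tweets share the most vocabulary with the new tweet.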

Problem Statement:

Despite the availability of software that extracts data about people's sentiment on a particular
product or service, organizations and other data workers still face problems with data extraction.

• Sentiment Analysis of Web Based Applications Focuses on a Single Tweet Only.

With the rapid growth of the World Wide Web, people are using social media platforms
such as Twitter to produce large volumes of opinion in the form of tweets that are available
for sentiment analysis. From a human point of view, this amounts to a vast quantity of
information, which makes it difficult to extract sentences, read them, analyze them tweet by
tweet, and summarize and organize them into an understandable format in a timely manner.

• Difficulty of Sentiment Analysis with Informal English

Informal language refers to the use of colloquialisms and slang in communication, following
the conventions of spoken language, such as contractions (for example, 'won't' for 'will not').
Not all systems can detect sentiment in informal language, and this can hinder the process of
analysis and decision-making.

Emoticons are pictorial representations of a person's facial expression which, in the
absence of body language and prosody, serve to draw the recipient's attention to the
mood or temper of the sender's verbal communication, and so enhance and alter its interpretation.
For example, :) conveys a positive attitude. Existing systems do not have enough data to
allow them to infer sentiment from emoticons. Because people often turn to emoticons to express
what they cannot put into words, the inability to analyze them puts the organization at a
loss. Short forms are widely used, even in short message service (SMS) texts, and are
used even more often on Twitter to reduce the number of characters, because
Twitter limits tweets to 140 characters. For example, 'TBA' stands for 'to be announced'.

Objective
The purpose of the research is, first, to study sentiment analysis in microblogging
with a view to analyzing feedback from customers on an organization's products;
and second, to develop a customer review system for a product that allows an
organization or individual to gauge sentiment and turn a large number of tweets into a
useful format.
METHODOLOGY:
We developed a sentiment analysis system using the standard machine learning approach
as explained in the background section.

Twitter
Twitter is a popular real-time microblogging service that allows users to share short messages known as
tweets, limited to 140 characters. Users write tweets to express their opinion on various topics
related to their daily life. Twitter is an ideal platform for drawing out the general public's views on
specific issues. A collection of tweets is used as the main corpus for sentiment analysis, which
refers to the use of opinion mining or natural language processing.

Twitter, with 500 million users and millions of messages a day, has already become a valuable
asset for organizations, which can monitor their reputation and brands by extracting and analyzing the
sentiment of public tweets about their products, services, market and even competitors.
It has been highlighted that social media has generated, across the growing global web, large
volumes of opinion in the form of tweets, reviews, blogs and discussion groups and forums
that are available for analysis, making the web a fast, inclusive and easily accessible source for
sentiment analysis.

Microblogging with E-commerce

A microblogging platform like Twitter is similar to a standard
blogging platform, except that each post is short. Twitter allows only a small number of characters
and is designed for quick transfer of information or exchange of ideas. Small businesses and large
organizations alike are harnessing the power of microblogging as an e-commerce marketing tool.
Microblogging platforms have developed over a period of a few years to the point where
e-commerce websites promote themselves through an external microblogging platform such as Twitter.

Sharing, collaboration and community-based interaction open up e-commerce and introduce a bright
new space: the microblogging platform has empowered companies to
build a brand image, use it as an important marketing channel, improve product sales, engage in
customer interaction and carry out other relevant business activities. In fact, companies that
produce such products have begun to monitor microblogs in order to get a general idea of the
sentiment about their product. These companies often read user feedback and respond to users on microblogs.
Social Media
Social media has been defined as a group of Internet-based applications that build on the ideas and
technologies of Web 2.0 and that allow user-generated content to be created and exchanged. In an
Internet World Start interview, it was pointed out that the number of Internet users is increasing and
that they continue to spend a lot of time on social media, with a figure of 88 billion minutes
reported for 2011. On the other
hand, businesses that use social networking sites to find and communicate with customers may find
that social networking harms productivity.
Because content can so easily be posted publicly, private information can also spread through
the social world.

On the contrary, it has been argued that the benefits of participating in social media go beyond
simple social sharing to building an organization's reputation and creating job opportunities and income. In addition, it
has been suggested that social media is used for advertising and promotion by companies, professional
search, recruitment, online learning and electronic commerce. E-commerce, or electronic commerce,
refers to the purchase and sale of goods or services online, which can take place via social media such as
Twitter, which is convenient because of its 24-hour availability, easy customer service and global reach.

Among the reasons why businesses tend to use social media more are to gain an understanding of
consumer behavior trends, to gather market intelligence, and to have an opportunity to learn about
customer reviews and opinions.
Data Processing:

Data processing includes tokenization, which is the process of splitting tweets into individual
words called tokens. Tokens can be split on whitespace or punctuation marks. A token can
be a unigram or a bigram, depending on the classification model used. The bag-of-words model is one of the
most widely used models in classification. It is based on the idea that a text can be represented as a
bag, or set, of individual words that have no link to or dependence on one another. An easy way to incorporate this
model into our project is to use n-grams as features. Since the text is treated as just a collection
of individual words, we split each tweet on whitespace. For example, the
tweet "Met Aziz today !!" is split at each whitespace as follows:
{"Met", "Aziz", "today", "!!"}
The next step in data processing is normalization by converting the tweet to lower case. Tweets are
converted to lower-case letters to make comparison with the dictionary easier.
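
Whitespace tokenization, lower-casing and n-gram extraction as described above can be sketched as follows. The function names `tokenize` and `ngrams` are hypothetical and chosen only for this illustration.

```python
def tokenize(tweet):
    """Split a tweet on whitespace and convert each token to lower case."""
    return tweet.lower().split()

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("Met Aziz today !!")
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
```

With `n=1` every token becomes its own feature (unigram), and with `n=2` each pair of adjacent tokens becomes a feature (bigram).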
Data filtering:
The tweet obtained after data processing still contains raw material that we may or
may not find useful in our application. Therefore, these tweets are filtered further by
removing stop words, numbers and punctuation marks.
Stop words: Tweets contain stop words, which are very common words such as "is", "am" and "are"
that carry no additional information. These words serve no purpose, and they are removed using a
list stored in steffile.dat.
We compare each word in the tweet with this list and delete the words that match
the stop list.
Deleting non-alphabetical characters: Symbols such as "#" and "@" and numbers are not important for
sentiment analysis and are removed using pattern matching. Regular expressions are
used to match only alphabetic characters, and everything else is ignored. This helps reduce clutter
from the Twitter stream.
Stemming: This is the process of reducing words to their roots.
For example, words such as "fishing" and "fishes" share the root "fish". The
stemming library is Stanford NLP, which also offers various algorithms such as Porter stemming.
In our case, we did not use any stemming algorithm due to time constraints.
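
A minimal sketch of the filtering step follows, assuming a small in-memory stop list in place of the list the project loads from a file. The names `STOP_WORDS` and `filter_tokens` are hypothetical.

```python
import re

# Hypothetical stop list; the project reads its own list from a file
STOP_WORDS = {"is", "am", "are", "the", "a"}

def filter_tokens(tokens, stop_words=STOP_WORDS):
    """Strip every non-alphabetic character, then drop empty tokens and stop words."""
    cleaned = []
    for token in tokens:
        letters_only = re.sub(r"[^a-z]", "", token.lower())
        if letters_only and letters_only not in stop_words:
            cleaned.append(letters_only)
    return cleaned

filtered = filter_tokens(["#NLP", "is", "fun", "2day", "!!", "@you"])
```

Note that stripping digits turns a token such as "2day" into "day", which is one of the costs of this simple pattern-matching approach.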
Feature extraction: TF-IDF (term frequency-inverse document frequency) is a feature vectorization
method used in text mining to reflect the importance of a term to a document in a corpus. In Spark,
the recommended API is the DataFrame-based API. This feature is
useful in cases where we need to find the top topics or create word clouds. However, this project
is focused on obtaining sentiment from the Twitter stream, so TF-IDF is not computed.
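
Although TF-IDF is not used in this project, the weighting itself is short enough to sketch. The example below assumes the smoothed inverse document frequency log((N + 1) / (df + 1)), where N is the number of documents and df the number of documents containing the term; the function name and the three toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document
    weights = []
    for doc in documents:
        term_freq = Counter(doc)
        weights.append({
            term: count * math.log((n_docs + 1) / (doc_freq[term] + 1))
            for term, count in term_freq.items()
        })
    return weights

# Hypothetical toy corpus, already tokenized
docs = [["good", "phone"], ["bad", "phone"], ["good", "good", "service"]]
weights = tf_idf(docs)
```

Terms that occur in few documents, such as "service" here, receive a higher weight than terms that occur in many.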
Sentiment Analysis
Sentiment analysis is performed using a custom algorithm that finds the polarity as described below.
Finding polarity:
To find the polarity, we used a simple algorithm that counts the positive and negative words
in a tweet. Separate lists were made for positive and for negative words. The next step is to
compare every word in a tweet against both of these lists. If the current word matches a word in
the positive list, the score is incremented by 1, and if it matches a word in the negative list, the
score is decremented by 1. More positive words lead to higher sentiment scores. However, Stanford NLP,
which provides more complex algorithms, can be used to predict sentiment more accurately.
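
The counting algorithm described above can be sketched as follows. The two word lists are hypothetical stand-ins for the project's own positive and negative lists.

```python
# Hypothetical word lists; the project keeps its own positive and negative lists
POSITIVE_WORDS = {"good", "great", "love", "happy"}
NEGATIVE_WORDS = {"bad", "terrible", "hate", "sad"}

def polarity_score(tokens):
    """Add 1 for each positive word and subtract 1 for each negative word."""
    score = 0
    for token in tokens:
        if token in POSITIVE_WORDS:
            score += 1
        elif token in NEGATIVE_WORDS:
            score -= 1
    return score
```

A tweet with one positive and one negative word scores 0, which shows a limit of pure counting: mixed tweets and tweets with no listed words are indistinguishable.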

Twitter Sentiment Analysis


Sentiment can be found in comments or tweets and provides useful clues for many
different purposes. Sentiment can be divided into two groups,
negative and positive. Sentiment analysis is a natural language processing method for measuring
an expressed opinion or sentiment within a selection of tweets.

Sentiment analysis refers to the general process of extracting polarity and subjectivity from
semantic orientation, which refers to the strength of words and the polarity of text or phrases. There are two
main approaches to extracting sentiment: the lexicon-based approach and the machine-learning-based approach.

1. Lexicon-based Approach
Lexicon-based methods use a predefined dictionary in which each word is associated with a
specific sentiment. Lexicon methods vary depending on the context in which they were created
and involve calculating the orientation of a document from the semantic orientation of the
words or phrases in the document. In addition, the lexicon-based approach looks for
opinion-bearing words in the corpus and then predicts the opinion expressed in the text. The
lexicon method has been described with the following basic paradigm:
i. Preprocess each tweet post by removing punctuation marks
ii. Initialize the total polarity score to 0 -> s = 0
iii. Check whether each token is in the dictionary; if the token is positive, s is
increased (+), and if the token is negative, s is decreased (-)
iv. Look at the total polarity score of the post: if s > threshold, label the tweet post as
positive; if s < threshold, label the tweet post as negative

One advantage of a learning-based approach, however, is that it can adapt and build
trained models for specific purposes and contexts. Conversely, it depends on the availability of labeled
data, so its applicability to new data is limited, because labeling data can be costly or even
prohibitive for some tasks.

2. Machine-learning-based
Machine learning methods often rely on supervised classification, where sentiment
detection is framed as a binary problem, positive or negative. This approach requires labeled data to train
classifiers. It also makes clear that aspects of the local context of a word need to be taken into
account, such as negation (e.g. not good) and intensification (e.g. very good). The basic paradigm for
creating a feature vector has been described as:

i. Apply a part-of-speech tagger to each tweet post
ii. Collect all the adjectives across all tweets
iii. Create a popular word set composed of the top N adjectives
iv. Go through all tweets in the test set to create the following:
• Number of positive words
• Number of negative words
• The presence, absence or frequency of each word

Some examples of switch negation have been shown, where negation simply reverses the polarity of a
dictionary word, changing "beautiful" into "not beautiful". Another example is a sentence with two
negated words of opposite polarity: "She is not terrific but not terrible either."

In this case, the negation of a strongly negative or positive value reflects a mixed opinion that is
captured better by a shifted value than by a reversed one. However, it has been argued that, despite
its limitations, the machine learning approach is better suited to Twitter than the lexicon-based method.

In addition, machine learning methods can generate a fixed number of the most popular
words, each of which is assigned an integer value based on the frequency of the word on Twitter.

In supervised machine learning, you usually have an input X, which goes into your prediction
function to get Ŷ. You can then compare your prediction with the true value Y. This gives you the
cost, which you use to update the parameters θ. The following picture summarizes the process.
To perform sentiment analysis on a tweet, you first have to represent the text (i.e. "I am happy because I am
learning NLP") as features, you then train your logistic regression classifier, and then you can use it to classify
the text.

Note that in this case, you classify either 1, for a positive sentiment, or 0, for a negative sentiment.
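
The loop of predicting Ŷ from X, computing a cost against Y and updating θ can be sketched with a small logistic regression trained by batch gradient descent. Everything concrete here is an assumption made for illustration: the three features (a bias term and counts of positive and negative words), the four training rows, the learning rate and the number of iterations.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(theta, x):
    """Prediction function: probability that feature vector x is positive."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def cost(theta, X, Y):
    """Average cross-entropy between predictions and true labels."""
    total = 0.0
    for x, y in zip(X, Y):
        p = predict_proba(theta, x)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(X)

def train(X, Y, learning_rate=0.1, iterations=1000):
    """Batch gradient descent on the cross-entropy cost."""
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        gradient = [0.0] * len(theta)
        for x, y in zip(X, Y):
            error = predict_proba(theta, x) - y
            for j, xj in enumerate(x):
                gradient[j] += error * xj / len(X)
        theta = [t - learning_rate * g for t, g in zip(theta, gradient)]
    return theta

# Hypothetical features: [bias, positive-word count, negative-word count]
X = [[1, 3, 0], [1, 2, 1], [1, 0, 2], [1, 1, 3]]
Y = [1, 1, 0, 0]
theta = train(X, Y)
```

A tweet is then classified as 1 (positive) when `predict_proba(theta, x)` exceeds 0.5 and as 0 (negative) otherwise.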

Vocabulary & Feature Extraction

Given a tweet, or text, you can represent it as a vector of dimension V,
where V corresponds to your vocabulary size. If you had a tweet that said "I am happy because I am learning NLP", then
you would put 1 at the index corresponding to each word in the tweet, and 0 otherwise. As you can see, as V
becomes larger, the vector becomes more sparse. In addition, we end up with many more features and end up training
V parameters θ. This can lead to greater training time and greater prediction time.
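
The vocabulary-sized representation described above can be sketched as follows. The two-tweet corpus and the function names are hypothetical.

```python
def build_vocabulary(tweets):
    """Map every distinct lower-cased token in the corpus to an index."""
    vocabulary = {}
    for tweet in tweets:
        for token in tweet.lower().split():
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary

def to_feature_vector(tweet, vocabulary):
    """Binary vector of length V: 1 where a vocabulary word occurs in the tweet."""
    vector = [0] * len(vocabulary)
    for token in tweet.lower().split():
        if token in vocabulary:
            vector[vocabulary[token]] = 1
    return vector

# Hypothetical corpus, for illustration only
corpus = ["I am happy because I am learning NLP", "I am sad"]
vocabulary = build_vocabulary(corpus)
features = to_feature_vector("I am happy", vocabulary)
```

Even in this tiny corpus most entries of `features` are 0, and the proportion of zeros grows as the vocabulary grows.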
Approach:
Sentiment analysis was chosen as the central theme of all the visualizations. Numerous
powerful, effective tools have already been published for analysis and visualization that is not
related to sentiment. We identified four major areas where targeted visualizations may be
effective for brand managers.
● Time:
Analysis of change in sentiment over time was a common theme amongst most tools we studied.
Visualization involving time can enable users to identify sudden changes in sentiment trends,
which can help pinpoint the events that may have caused them. By adding
interactivity to such a visualization, the scale of the graph can be adjusted, allowing users to study
both long-term and short-term patterns.
● ​Geographic Location:
Visualization involving maps is also common. Such visuals can help users see how the
sentiment distribution differs across diverse markets. As before, adding interaction to such a visualization
can enable users to study sentiment changes within a city as well as changes across markets on different
continents.
● ​User Influencing Power:
This dimension had not been studied in depth before. User influencing power is an
important metric that businesses are concerned with, especially on social media. A negative
sentiment from a highly influential figure on Twitter can ripple out to their followers and spread the damage.
Therefore, understanding the sentiment distribution along this dimension is essential. With
interaction features, users of the tool can
change the visualizations to focus only on tweets by users who have higher influencing power.
For this project, user influencing power is assumed to be directly correlated with the number of
followers.
● ​Tweet Platform:
This information is relevant to businesses that have different offerings on different platforms
(e.g. iOS, Android, Web). It can be visualized how sentiment differs based on which platform
was used to post the tweet. One use case for such information is mobile app brands that
have a different application on each platform, possibly offering different user experiences.

The power of sharing feedback and emotions about a brand through Twitter lies in its lightning
propagation speed. Feedback sent by a user can reach the company instantly. To exploit this
power, our objective was to build a real-time analysis tool. As tweets arrive at the system, the
visualizations adapt continuously to the newly added information.
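
The data behind such visualizations amounts to counting sentiment labels per dimension, optionally restricted to influential users. The sketch below assumes each tweet is a record with a predicted label, a posting platform and a follower count; the field names, the sample records and the function `sentiment_by` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical records: each tweet carries a predicted label plus metadata
tweets = [
    {"label": "positive", "platform": "iOS", "followers": 120},
    {"label": "negative", "platform": "Android", "followers": 50000},
    {"label": "negative", "platform": "iOS", "followers": 900},
    {"label": "positive", "platform": "Web", "followers": 75000},
]

def sentiment_by(tweets, key, min_followers=0):
    """Count labels per value of `key`, keeping only sufficiently followed users."""
    distribution = defaultdict(Counter)
    for tweet in tweets:
        if tweet["followers"] >= min_followers:
            distribution[tweet[key]][tweet["label"]] += 1
    return {group: dict(counts) for group, counts in distribution.items()}

by_platform = sentiment_by(tweets, "platform")
influential = sentiment_by(tweets, "platform", min_followers=10000)
```

The same function could group by a time bucket or a location field if the records carried one, and re-running it as new tweets arrive keeps the counts current.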

Techniques of Sentiment Analysis:

The semantic concepts of entities extracted from tweets can be used to measure the overall
correlation of a group of entities with a given sentiment polarity. Polarity, in its most basic
form, means whether a text or sentence is positive or negative. Sentiment analysis has several
techniques for assigning polarity, such as:

Natural Language Processing (NLP)

NLP techniques are based on machine learning, especially statistical learning, which uses a general
learning algorithm combined with a large sample of data to learn the rules. Sentiment analysis has been
treated as a Natural Language Processing task at many levels of granularity.
Starting as a document-level classification task, it has been handled at the sentence level and more
recently at the phrase level. NLP is a field of computer science that involves enabling computers
to derive meaning from human language input as a way of interacting with the real
world.

Case-Based Reasoning (CBR)

Case-Based Reasoning (CBR) is one of the techniques available for implementing sentiment analysis. CBR is
known for recalling past problems that have been solved successfully and reusing similar
solutions to solve current problems that are closely related. Among the identified benefits of
CBR are that it does not require an explicit domain model, so elicitation becomes a task of gathering
case histories, and that a CBR system can learn by acquiring new knowledge as cases. This, together with the
use of database techniques, makes the maintenance of large volumes of information easier.

Artificial Neural Network (ANN)


An Artificial Neural Network (ANN), also known simply as a neural network, is a
mathematical technique that interconnects a group of artificial neurons. It processes information using a
connectionist approach to computation. An ANN is used to determine the relationship between input and output or to
find patterns in data.
Support Vector Machine (SVM)
A Support Vector Machine (SVM) is used to detect the sentiment of tweets. It has been reported that an SVM is capable of
extracting and analyzing sentiment with 70% to 81.3% accuracy on a test set. In that work, training data was collected
from three different Twitter sentiment detection websites, which mainly use pre-built
sentiment lexicons to label each tweet as positive or negative. Using an SVM trained on this
labeled data, the authors obtained 81.3% accuracy on the sentiment classification task.

Apache Spark: It is an open-source cluster computing platform for processing batch and streaming data, and
it can read from and write to storage systems such as HDFS or a database server. It can run on top of Hadoop and
interacts well with other Apache software. Apache Spark is an in-memory processing system used to
process big data, and it emerged as a more advanced alternative to Hadoop MapReduce. Although it builds on the
MapReduce programming model, it is reported to process data up to 100 times faster in memory and ten times faster on disk.
Its architecture is based on Resilient Distributed Datasets (RDDs): read-only
data sets that are partitioned and distributed across different nodes to provide fault
tolerance. It overcomes the MapReduce limitation that data is written to
disk after each reduce step, which is costly for iterative algorithms that load the same data repeatedly in a loop.
By keeping data in memory, the delays involved are minimized, which makes it faster. RDD operations
are evaluated lazily: execution is deferred and the
computation is represented as a Directed Acyclic Graph (DAG). The generated DAG serves as the plan
for analyzing and scheduling the operations. In addition, Spark has an edge
over other technologies because it is much easier to use thanks to the many APIs available. Its
benefits also include high-level built-in libraries that provide support for SQL,
machine learning, graph processing and streaming data. It can access data from various storage
sources such as HDFS, Cassandra, HBase and S3.

Scala: It is a high-level language that supports both the functional
and the object-oriented programming models. This gives it an edge over Java, which
requires more code for the same functionality. A major success of Scala is
that Apache Spark itself is written in Scala, and many packages are available for Apache Spark in
Scala. Therefore, we chose to work in Scala rather than Python or Java.

IntelliJ IDEA: It is an Integrated Development Environment (IDE) for creating, running and testing code. The
full product is closed source, but a Community Edition is provided free of charge.
It provides support for the SBT plugin, which is used to import Apache Spark dependencies and to build
the project. IntelliJ IDEA is used along with the SBT plugin; SBT is a
build tool and an alternative to the Maven build tool.
SBT makes it easy to define dependencies and import libraries.

Application Programming Interface(API)

The Alchemy API performs better than the others in terms of the quality and quantity of the extracted
entities. Over time, tweets were collected through the Python Twitter Application Programming
Interface (API). Python can automatically calculate the frequency of messages
being retweeted every 100 seconds, sort the top 200 messages by retweet frequency,
and store them in the designated database. Since the Python Twitter API only includes
Twitter messages from the last six days, the collected data needs to be stored in a separate
database.
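
The ranking step described above, keeping the most frequent messages in a batch, can be sketched with the standard library alone. The function name and the sample batch are hypothetical; collecting tweets from the API and writing to the database are outside the scope of this sketch.

```python
from collections import Counter

def top_messages(messages, limit=200):
    """Return the `limit` most frequent messages with their counts."""
    return Counter(messages).most_common(limit)

# Hypothetical batch of collected message texts
batch = ["sale today", "new phone", "sale today", "sale today", "new phone", "hello"]
ranked = top_messages(batch, limit=2)
```

Running this on each batch collected in a fixed time window gives the list that would then be written to the database.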
Implementation & Result:
CONCLUSION

Twitter is a source of vast unstructured and noisy data sets that can be used to find interesting
patterns and trends. Python has shown flexibility in extracting live streams of data and can
store the collected data in HDFS and other common storage systems. Spark's
processing power enables the project to scale across multiple nodes, thus supporting
distributed computing. Real-time data analysis enables business organizations to keep track of
their services and also creates opportunities for promotion, marketing and periodic
improvement. Our heartfelt thanks go to Dr. Kusum Lata for the feedback given on the whole
project, from the initial proposal to the conclusion, and for the important lessons we learned along the
way, including team interaction and the challenges involved in software development efforts.
