A18 CU6051NA A1 CW Coursework 16034872 Anjil Shrestha PDF
I confirm that I understand my coursework needs to be submitted online via Google Classroom under the
relevant module page before the deadline in order for my assignment to be accepted and marked. I am fully
aware that late submissions will be treated as non-submission and a mark of zero will be awarded.
Table of contents
1. Introduction
1.1. AI, ML, NLP & Sentiment Analysis
1.2. Problem Domain
2. Background
2.1. Sentiment analysis and its approaches
2.1.1. Approaches
2.2. Research works done on Sentiment Analysis
2.3. Current applications of Sentiment analysis
3. Solution
3.1. Approach to solving the problem
3.2. Explanation of the AI algorithm
3.3. Pseudocode
3.4. Flowchart
4. Conclusion
4.1. Analysis of the work done
4.2. Solution addressing the real-world problems
4.3. Further work
5. References
Table of figures
Figure 1 Machine learning types (Morgan, 2018)
Figure 2 Relation between AI, NLP, ML and Sentiment Analysis
Figure 3 Sentiment Analysis Overview
Figure 4 Different approaches to Sentiment Analysis
Figure 5 Bayes Theorem
Figure 6 Flowchart of algorithm
Table of tables
Table 1 Labeled training data
Table 2 Bag of words
Table of Abbreviations
I. AI – Artificial Intelligence
II. ML – Machine Learning
III. NLP – Natural Language Processing
IV. NLTK – Natural Language Toolkit
V. SVM – Support Vector Machine
VI. CNN – Convolutional Neural Network
VII. RNN – Recurrent Neural Network
VIII. RNTN – Recursive Neural Tensor Network
CU6051NA Artificial Intelligence
1. Introduction
1.1. AI, ML, NLP & Sentiment Analysis
AI is a broad field of study that is incorporated in a variety of technologies, and machine
learning is one of them. Machine learning is the subfield of Artificial Intelligence that allows
software applications to automatically learn and improve from experience without being
explicitly programmed. It focuses on constructing algorithms that receive input data and use
statistical analysis to predict an output, updating that output as new data becomes available.
(Rouse, 2018) (Reese, 2017)
The primary goal of machine learning is to allow computers to learn automatically, without
human intervention or explicit programming. The process involves searching data for patterns
and adjusting program actions accordingly. Machine learning comprises supervised learning,
unsupervised learning and reinforcement learning, each of which has a different process for
training on data and fitting the model.
Another technology that AI has incorporated is Natural Language Processing (NLP). NLP is
a fundamental element of AI for communicating with an intelligent system using natural
language. Some well-known applications of NLP are speech recognition, text translation and
sentiment analysis. In essence, NLP is about building a system that can understand human
language. For a machine to understand a language, it must first learn how to do so, and this
is where machine learning is used within NLP. (BOUKKOURI, 2018) (Expertsystem, 2018)
Sentiment analysis is one of the applications of NLP: it is the process of determining whether
a piece of writing is positive, negative or neutral. It is a text classification task that aims to
estimate the sentiment polarity of a body of text based solely on its content, i.e. the text is
assigned a value that says whether the expressed opinion is positive (polarity=1), negative
(polarity=0), or neutral. For a machine to extract sentiment from a piece of text, it needs to be
trained on a pre-labeled dataset of positive, negative and neutral content. This means that
techniques from both NLP and ML are required for a system to perform sentiment analysis.
1.2. Problem Domain
Due to the growth of the internet, data is now generated on such a scale that going through
every piece of it is humanly impossible. In business, data is very useful for uncovering
problems, and analyzing it helps in planning the next steps for improving the business. One of
the most important parts of running a business is taking account of public opinion and
customer feedback on its brands and services. With huge volumes of customer feedback, it
becomes hard to determine whether the services are flourishing or customers dislike the
services or products. Public opinion on a particular product is what makes that product
improve over time, and it is very challenging to determine whether opinions are positive or
negative when there are so many of them. (Stecanella, 2017) (Gupta, 2017)
Coursera is a huge online learning platform. It provides thousands of courses and has
thousands of viewers or customers. Viewers leave feedback on their learning experiences, and
this feedback also runs into the thousands. Determining by hand whether each piece of
feedback is positive, negative or neutral, among thousands of others, is humanly impossible.
Feedback is very important because it allows the performance of a particular course to be
tracked and supports further business decisions. Sentiment analysis can be used to identify
and extract subjective information, which will help the business to understand the social
sentiment of its courses.
2. Background
2.1. Sentiment analysis and its approaches
Sentiment analysis is not a straightforward procedure; many factors determine the sentiment
of speech or text. Text information can be categorized into two main types: facts and opinions.
Opinions are of two types: direct and comparative. Direct opinions give an opinion about an
entity directly, for example, “This course is helpful”. In comparative opinions, the opinion is
expressed by comparing one entity with another, for example, “The teaching method of course
A is better than that of course B”. Such opinions, collected in raw form, can be given structure
with the help of sentiment analysis systems. (Stecanella, 2017)
There are various types of sentiment analysis. Some systems focus on polarity (positive,
negative, neutral), while others detect feelings and emotions or identify intentions. The
polarity of a text is associated with particular feelings such as anger, sadness or worry (i.e.
negative feelings) or happiness, love or enthusiasm (i.e. positive feelings). Lexicons and
machine learning algorithms are used to detect feelings and emotions in texts. It gets tricky
when a system resorts to lexicons alone, as the way people express their emotions varies a
lot, and so do the lexical items they use.
2.1.1. Approaches
Many methods and algorithms have been introduced that extract sentiment from texts.
Computational linguistics is such a large field that research is still ongoing to improve the
accuracy that these methods provide. Sentiment analysis systems are classified as follows:
In the rule-based approach, a set of rules is defined, via some kind of scripting language, that
identifies subjectivity, polarity, or the subject of an opinion. The inputs that may be used in
this approach include classic NLP techniques like tokenization, part-of-speech tagging,
stemming and parsing, as well as other resources such as lexicons. (Stecanella, 2017)
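A rule-based system of this kind can be sketched in a few lines. The example below is only an
illustration: the lexicon, its scores, the negation words and the function names are all
hypothetical, and a real system would use a much larger lexicon and more careful rules.

```python
import re

# Hypothetical toy lexicon: the words and their scores are illustrative only.
LEXICON = {"helpful": 1, "useful": 1, "thanks": 1, "boring": -1, "waste": -1}
# Hypothetical negation words that flip the score of the word that follows them.
NEGATIONS = {"not", "never", "no"}


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower().replace("’", "'"))


def rule_based_polarity(text):
    """Sum the lexicon scores, flipping the sign of a word that follows a negation."""
    tokens = tokenize(text)
    score = 0
    for i, token in enumerate(tokens):
        value = LEXICON.get(token, 0)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            value = -value
        score += value
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(rule_based_polarity("Helpful course and materials."))  # positive
print(rule_based_polarity("This course is not helpful"))     # negative
```

Rules like the negation flip are easy to read but brittle, which is one reason lexicon-only
systems struggle with the variety of ways people express opinions.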
The automatic approach relies on machine learning techniques to learn from data. In this
approach the task is modeled as a classification problem, where a classifier is fed a text and
returns the corresponding sentiment, e.g. positive, negative or neutral. The classifier is
implemented in two steps. The first step is training: a feature extractor turns each training
text into a feature vector, and the pairs of feature vectors and tags (e.g. positive, negative,
or neutral) are fed into the machine learning algorithm to generate a model. The second step
is prediction: unseen text inputs are transformed into feature vectors by the same feature
extractor, and the predicted tags are generated when those feature vectors are fed into the
model. Under supervised learning, the classification algorithms that are widely used are Naïve
Bayes, Logistic Regression, Support Vector Machines and Neural Networks.
(Walaa Medhat, 2014)
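The role of the feature extractor in both steps can be sketched as follows. This is a minimal
illustration, not a production extractor: the class name BagOfWordsExtractor, its methods and
the sample texts are assumptions made for the example.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower().replace("’", "'"))


class BagOfWordsExtractor:
    """Hypothetical minimal feature extractor: learns a vocabulary, then counts words."""

    def fit(self, texts):
        # The vocabulary is fixed from the training texts only.
        words = sorted({tok for text in texts for tok in tokenize(text)})
        self.index = {word: i for i, word in enumerate(words)}
        return self

    def transform(self, text):
        vector = [0] * len(self.index)
        for tok in tokenize(text):
            if tok in self.index:  # words unseen during training are ignored
                vector[self.index[tok]] += 1
        return vector


train_texts = ["Helpful course and materials.", "Boring."]
train_tags = ["positive", "negative"]

# Step 1 (training): build (feature vector, tag) pairs for the learning algorithm.
extractor = BagOfWordsExtractor().fit(train_texts)
training_pairs = [(extractor.transform(t), tag) for t, tag in zip(train_texts, train_tags)]

# Step 2 (prediction): the same extractor turns unseen text into a feature vector.
unseen_vector = extractor.transform("Boring materials, boring course.")
print(training_pairs)
print(unseen_vector)
```

Reusing the fitted extractor at prediction time matters: the model only understands vectors
laid out in the vocabulary order it saw during training.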
The third approach combines the best of both the rule-based and the automatic approaches.
Combining the two can improve the accuracy and precision of the result.
2.2. Research works done on Sentiment Analysis
Many research works have been carried out on sentiment analysis. In one survey, Pang and
Lee describe the existing techniques and approaches for opinion-oriented information
retrieval. Their survey includes material on the summarization of evaluative text and on
broader issues regarding privacy, manipulation, and economic impact that the development of
opinion-oriented information-access services gives rise to. (Bo Pang, 2008)
In another study, the authors used web blogs to construct corpora for sentiment analysis
and used the emoticons assigned to blog posts as indicators of users’ mood. SVM and CRF
learners were used to classify sentiment at the sentence level. Additionally, several
strategies were investigated to determine the overall sentiment of a document. The study
concluded that the winning strategy is to take the sentiment of the last sentence of the
document as the sentiment of the whole document. (Changhua Yang, 2007)
Alec Go and team performed sentiment classification using Twitter to collect training data.
Various classifiers were trained on a corpus constructed from positive and negative samples
identified by emoticons. Among the classifiers used, the Naïve Bayes classifier obtained the
best result, with accuracy of up to 81% on their test set, but the method performed poorly
when used with three classes (“negative”, “positive” and “neutral”). (Alec Go, 2009)
Alexander Pak and Patrick Paroubek used Twitter as a corpus for sentiment analysis and
opinion mining. Their paper covers procedures for the automatic collection of a corpus and
approaches to performing linguistic analysis of the collected corpus. Using the corpus, they
further built a sentiment classifier that is able to determine the polarity (positive, negative
and neutral) of a document. (Alexander Pak, 2008)
2.3. Current applications of Sentiment analysis
Sentiment analysis has become a key tool for making sense of data at a time when 2.5
quintillion bytes of data are generated every day. It has helped companies to gain key insights
and to automate all kinds of processes and analytics for improving business. Sentiment
analysis is being used for various purposes. In a company that manufactures different types
of products, sentiment analysis has helped track the performance of a product in the market
by collecting sentiment from customer feedback and reviews.
Sentiment analysis is being applied in various areas. Some common ones are:
• Brand Monitoring
• Customer Support
• Customer Feedback
• Product Analytics
• Market Research and Analysis
• Workforce Analytics & Voice of the Employee
• Spam filtering
3. Solution
3.1. Approach to solving the problem
Taking account of the above research and explanations, it is clear that sentiment analysis can
be used in various areas such as:
• Brand Monitoring
• Customer Support
• Customer Feedback
• Product Analytics, etc.
The ideal solution for the above areas is the use of machine learning techniques and
algorithms, incorporating some NLP techniques in data preprocessing. Supervised learning is
the preferred approach for the task of predicting sentiment. Kaggle holds many datasets for
sentiment analysis, and for this particular task the labeled dataset of Coursera’s course
reviews is to be used as the training dataset. There are many algorithms available to fit the
model with. Among neural networks there are algorithms like RNN, CNN and RNTN, and
among non-neural models there are Naïve Bayes, SVM, FastText and Deepforest. For the given
task, Naïve Bayes is the algorithm chosen for predicting the sentiment. It is selected as the
classifier for the following reasons: (Gupta, 2018) (Shailendra Singh Kathait, 2017)
3.2. Explanation of the AI algorithm
Naïve Bayes is a probabilistic algorithm that takes advantage of probability theory and
Bayes’ theorem to predict the sentiment of a text. In this algorithm the probability of each tag
for a given text is calculated, and the output is the tag with the highest probability. In
probability theory, Bayes’ rule describes the probability of a feature based on prior knowledge
of conditions that might be related to that feature. (Stecanella, 2017)
P(A|B) – posterior
P(A) – prior
P(B) – evidence
P(B|A) – likelihood
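With the four terms listed above, Bayes’ theorem can be written as follows. A and B are the
generic events used in the list; in sentiment classification A would stand for the tag and B
for the text.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```

Because the evidence P(B) is the same for every tag, comparing tags only requires the
numerator P(B | A) * P(A).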
The first step in the Naïve Bayes algorithm is creating a frequency table of word
frequencies. Every document is treated as a set of the words it contains, ignoring word
order and sentence construction. The text in the training data can be represented using the
bag-of-words approach, in which each word of a sentence is separated out and its number of
occurrences in that sentence is counted. For example:
(helpful, course, and, materials, boring, don’t, waste, time, in, this, useful, content, helped, lot,
thanks, a)
Review                         helpful  course  and  materials  boring  don’t  waste  time  in  this  useful  content  helped  lot  thanks  a  Label
Helpful course and materials.  1        1       1    1          0       0      0      0     0   0     0       0        0       0    0       0  +
Boring.                        0        0       0    0          1       0      0      0     0   0     0       0        0       0    0       0  -
Don’t waste time in this.      0        0       0    0          0       1      1      1     1   1     0       0        0       0    0       0  -
Useful materials and content.  0        0       1    1          0       0      0      0     0   0     1       1        0       0    0       0  +
Helped a lot. Thanks           0        0       0    0          0       0      0      0     0   0     0       0        1       1    1       1  +
Table 2 Bag of words
To classify a new review such as “I don’t like it”, we compute P(“I don’t like it” | +) * P(+) and
P(“I don’t like it” | -) * P(-). Comparing these two values determines whether the given review
is positive or negative.
As we are using the naïve Bayes algorithm, we assume every word in a sentence is independent
of the other ones, so we are no longer looking at entire sentences, but rather at individual
words.
So, for P(“I don’t like it” | +) * P(+) we write P(+) * P(I | +) * P(don’t | +) * P(like | +) * P(it | +),
and for P(“I don’t like it” | -) * P(-) we write P(-) * P(I | -) * P(don’t | -) * P(like | -) * P(it | -).
Each word probability is estimated as (count of the word in the class + 1) / (number of words in
the class + vocabulary size). In Table 2 the positive reviews contain 12 words, the negative
reviews contain 6 words and the vocabulary has 16 words. Adding 1 to every count (Laplace
correction) keeps a word that was never seen in a class from making the whole product zero.
For positive:
P(+) = 3/5 = 0.6
P(I | +) = (0+1)/(12+16) = 0.0357
P(don’t | +) = (0+1)/(12+16) = 0.0357
P(like | +) = (0+1)/(12+16) = 0.0357
P(it | +) = (0+1)/(12+16) = 0.0357
y+ = 0.6 * (1/28)^4 = 0.00000098 (approximately)
For negative:
P(-) = 2/5 = 0.4
P(I | -) = (0+1)/(6+16) = 0.0455
P(don’t | -) = (1+1)/(6+16) = 0.0909
P(like | -) = (0+1)/(6+16) = 0.0455
P(it | -) = (0+1)/(6+16) = 0.0455
y- = 0.4 * (1/22)^3 * (2/22) = 0.0000034 (approximately)
As the value of y- (the score for the negative class) is greater than y+ (the score for the
positive class), the review is classified as negative. This is how Bayes’ theorem is used in the
naïve Bayes classifier.
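The calculation above can be reproduced with a short script. This is a sketch under stated
assumptions rather than the final implementation: the function names are illustrative, the
tokenizer is a simple regular expression, and the scores are summed as logarithms instead of
multiplied, which gives the same ranking of classes while avoiding very small numbers.

```python
import math
import re
from collections import Counter

# The five labeled training reviews from the bag-of-words example.
TRAINING = [
    ("Helpful course and materials.", "+"),
    ("Boring.", "-"),
    ("Don't waste time in this.", "-"),
    ("Useful materials and content.", "+"),
    ("Helped a lot. Thanks", "+"),
]


def tokenize(text):
    """Lower-case, normalise apostrophes and split into word tokens."""
    return re.findall(r"[a-z']+", text.lower().replace("’", "'"))


def train(examples):
    """Collect per-class document counts, per-class word counts and the vocabulary."""
    doc_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for text, label in examples:
        doc_counts[label] += 1
        tokens = tokenize(text)
        word_counts.setdefault(label, Counter()).update(tokens)
        vocabulary.update(tokens)
    return doc_counts, word_counts, vocabulary


def log_score(text, label, doc_counts, word_counts, vocabulary):
    """log P(label) + sum of log P(word | label), with add-one (Laplace) smoothing."""
    total_docs = sum(doc_counts.values())
    total_words = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / total_docs)
    for token in tokenize(text):
        count = word_counts[label][token]
        score += math.log((count + 1) / (total_words + len(vocabulary)))
    return score


def classify(text, model):
    """Return the label with the highest score."""
    labels = model[0]
    return max(labels, key=lambda label: log_score(text, label, *model))


model = train(TRAINING)
print(classify("I don't like it", model))  # prints "-"
```

Words that never occurred in training (here “I”, “like” and “it”) still receive a small
non-zero probability through the smoothing term, so the decision rests on “don’t”, which was
seen only in a negative review.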
To increase the performance of this classifier, some NLP techniques are used. They are listed
below:
- Removing stopwords
- Tokenization
- Ignoring case and punctuation
- Stripping white space
- Removing numbers and other characters
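The steps listed above can be combined into a single cleaning function. The sketch below uses
only the standard library; the stopword list is a hypothetical, deliberately tiny one, and the
function name preprocess is an assumption for illustration. A real implementation would
normally take its stopword list from a library such as NLTK.

```python
import re

# Hypothetical, deliberately tiny stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "and", "the", "in", "this", "is", "it", "of", "to"}


def preprocess(text):
    """Apply the listed cleaning steps and return the remaining tokens."""
    text = text.lower()                     # ignore case
    text = re.sub(r"[^a-z\s]", " ", text)   # remove numbers, punctuation and other characters
    tokens = text.split()                   # tokenize; split() also strips extra white space
    return [tok for tok in tokens if tok not in STOPWORDS]  # remove stopwords


print(preprocess("  Helpful course, and 10 useful materials in this!  "))
# ['helpful', 'course', 'useful', 'materials']
```

Note that this simple character filter also splits contractions such as “don’t”, so a
classifier that relies on negation words would need a gentler rule for apostrophes.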
The Naïve Bayes classifier can be implemented effectively in Python. Python is used for this
algorithm because it provides many libraries for data pre-processing, NLP and machine
learning. The libraries are listed below:
- Pandas
- NumPy
- Scikit-learn
- NLTK
A predictive model will be built using these Python libraries, and the end product will be a
web app built using the Flask framework.
3.3. Pseudocode
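# Read the dataset and separate the sentiment text from its sentiment label.
# dataframe is assumed to be a pandas DataFrame that has already been loaded.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
X = dataframe.sentimentText
y = dataframe.sentimentLabel
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Tokenize the text and remove stopwords while building the bag of words.
# CountVectorizer is an assumption; the original does not name the vectorizer.
vectorizer = CountVectorizer(stop_words="english")
X_train_counts = vectorizer.fit_transform(X_train)
# Fit the Naive Bayes model on the word counts.
model = MultinomialNB()
model.fit(X_train_counts, y_train)
# Predict the sentiment of new text (my_test_data is a list of strings).
my_vectorized = vectorizer.transform(my_test_data)
model.predict(my_vectorized)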
3.4. Flowchart
4. Conclusion
4.1. Analysis of the work done
Owing to the increase in computational power and developments in big data, the field of AI is
flourishing; it has brought revolutionary changes to current technologies and has not yet
reached its full extent. This report gives a short explanation of AI, highlighting its impact
on other fields. Making a machine or a piece of software smart can be achieved through
different machine learning approaches, and the report explains how machine learning
techniques achieve this. Making a machine understand our natural language and act
accordingly is one of the ultimate goals of AI, and machine learning algorithms have made
this possible to some extent. NLP and its different applications are described briefly in this
report. For a business to succeed, it has to monitor many aspects, including customer reviews,
customer feedback and brand monitoring, and this report highlights how these can be
achieved by implementing machine learning algorithms. Sentiment analysis is introduced in
the introduction of this report, together with an analysis of the approaches it takes to tackle
different problem domains.
Different approaches can be taken in sentiment analysis, and these are explained in the
background section of this report. Some research works conducted on sentiment analysis have
been included, and the procedures they followed and the results they obtained have been
highlighted.
From the different machine learning classifiers available for text classification, the Naïve
Bayes classifier was selected as the classifier for sentiment analysis. The reasoning behind
this selection has been included in this report. The Naïve Bayes classifier uses Bayes’
theorem to predict the sentiment. How this theorem is used for predicting the sentiment of a
text is explained with each step of the algorithm. An example has also been worked through in
this report to show how the sentiment of a text can be predicted using Bayes’ theorem.
Pseudocode and a flowchart of the algorithm have been included in the report, which can be
used during the actual implementation of the algorithm.
4.2. Solution addressing the real-world problems
Sentiment analysis has become a key tool for making sense of data at a time when 2.5
quintillion bytes of data are generated every day. It has helped companies to gain key insights
and to automate all kinds of processes and analytics for improving business. Sentiment
analysis is being used for various purposes. In a company that manufactures different types
of products, sentiment analysis has helped track the performance of a product in the market
by collecting sentiment from customer feedback and reviews. (Stecanella, 2017)
Sentiment analysis has empowered all kinds of market research and competitive analysis.
Whether a company is exploring a new market, anticipating future trends or keeping an edge
over the competition, sentiment analysis can make a difference. It does so by analyzing the
product reviews of a brand and comparing them with those of competitors, comparing
sentiment across international markets, and so on. (Stecanella, 2017)
Sentiment analysis can be used in monitoring social media. Tweets and Facebook posts can be
analyzed over a period of time to see the sentiment of a particular audience. This can be used
to gain deep insight into the current market status of a product. It helps prioritize action
and track trends over time. (Stecanella, 2017)
For any type of public service, such as a trolley bus service or a free water service, the
feedback and opinions of the public are crucial. Surveys can be conducted to gather this
feedback and these opinions. Sentiment analysis can be performed on the survey responses to
identify how well the services are benefiting people and to understand the changes required
to improve them.
These are only some of the real-world areas that sentiment analysis can benefit or is already
benefiting. It can be applied to many other aspects of business, from brand monitoring to
product analytics, from customer service to market research. Leading brands are able to work
faster and with more accuracy by incorporating sentiment analysis into their existing
systems and analytics.
4.3. Further work
This report has only touched the surface of sentiment analysis. Accurately predicting
sentiment requires the combined use of rule-based approaches, such as lexicons, and
automatic approaches, i.e. machine learning. Naïve Bayes is a basic model, but its
performance can be increased by using different data pre-processing techniques, bringing it
closer to the level of more advanced methods. Techniques such as lemmatization, N-grams,
TF-IDF, Laplace correction, stemming, and the handling of emoticons, negation and
dictionaries can significantly increase the accuracy score. (Ray, 2017) (Giulio Angiani, 2015)
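Two of these techniques, N-grams and TF-IDF, can be illustrated briefly. The sketch below is
a minimal example with hypothetical function names; the TF-IDF weighting shown (term
frequency multiplied by the logarithm of the inverse document frequency) is one common
variant among several.

```python
import math


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(term, document, corpus):
    """Term frequency in the document times log inverse document frequency in the corpus."""
    tf = document.count(term) / len(document)
    df = sum(1 for doc in corpus if term in doc)
    return tf * math.log(len(corpus) / df) if df else 0.0


tokens = ["i", "don't", "like", "it"]
print(ngrams(tokens, 2))
# [('i', "don't"), ("don't", 'like'), ('like', 'it')]

corpus = [["helpful", "course"], ["boring", "course"]]
print(tf_idf("course", corpus[0], corpus))   # 0.0, because "course" occurs in every document
print(tf_idf("helpful", corpus[0], corpus))  # positive, because "helpful" is distinctive
```

Bigrams such as ("don't", 'like') keep a negation together with the word it modifies, which
single-word features cannot do, and TF-IDF down-weights words that appear in every review.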
Data visualization is very important because it makes analytics visible, which helps in
grasping difficult concepts and identifying new patterns. Sentiment across products can be
compared using charts such as pie charts and line graphs. This is very useful for companies
that want to track product performance, identify necessary changes and gain other insights.
Sentiment visualization is therefore another prospect that further increases the usefulness
of sentiment analysis.
5. References
Alec Go, R. B. L. H., 2009. Twitter Sentiment Classification using Distant Supervision,
Stanford: s.n.
Alexander Pak, P. P., 2008. Twitter as a Corpus for Sentiment Analysis and Opinion Mining.
In: France: Orsay Cedex, pp. 1321-1326.
Bo Pang, L. L., 2008. Opinion mining and sentiment analysis. 2 ed. s.l.:Foundations and
Trends in Information Retrieval.
Changhua Yang, K. H.-Y. L. a. H.-H. C., 2007. Emotion classification using web blog
corpora. In: Washington: s.n., pp. 275-278.