JAIST

Natural Language Processing: Document-Level Sentiment Analysis of
News Articles Using a Lexicon-Based Approach
May Sabal Myo

4th year (Knowledge Engineering)
Abstract
• Newspapers and blogs express opinions about news entities (such as
people, places and things) while reporting on recent events.
• I propose a system that assigns scores indicating positive or
negative opinion to each distinct entity.
Introduction
• News can be good or bad, but it is rarely neutral.
• Statistical analysis of relatively simple sentiment cues can provide a
surprisingly meaningful sense of how the latest news impacts important
entities.
• In this experiment, we calculate sentiment scores for five popular
areas (business, entertainment, sport, politics and tech) using the BBC
news dataset.
• This research applies lexicon-based sentiment analysis to news
articles and observes the role of text pre-processing in sentiment
analysis.
Background Theory
• Sentiment analysis is a text analysis method that detects polarity (e.g.
a positive or negative opinion) within text, whether a whole
document, paragraph, sentence, or clause.
• In general, sentiment analysis is divided into three main levels:
1. Document level
2. Sentence level
3. Entity or aspect level
Background Theory (cont.)
• Sentiment analysis can generally be carried out using supervised or
unsupervised approaches.
1. A supervised approach uses a set of labeled training data to build a
classification model, which is then used to classify new data for
which labels are not available.
2. Unsupervised, or lexicon-based, approaches to sentiment analysis do
not require any training data. For a sentence or a document, the
polarities of the individual words that compose it collectively
convey its sentiment.
Proposed System and Methodologies
• Sentiment analysis can be done on document level, sentence level,
word level or phrase level.
• This system explores sentiment analysis on the document level. This
research identifies whether the opinions expressed in news articles
are positive, negative or neutral.
• The proposed system is based on the Lexicon-based approach.

(cont.)
• This approach can use the following methods:
1. Dictionary-based methods: a lexicon dictionary is used to find the
positive opinion words and negative opinion words.
2. Corpus-based methods: a large corpus of words is used, and other
opinion words are found within the context based on syntactic
patterns.
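A minimal sketch of the dictionary-based idea, assuming a tiny hand-made lexicon. The word sets and the function name `find_opinion_words` are hypothetical and only illustrate the lookup; they are not the lexicon or code of the proposed system.

```python
# Toy lexicon: these words and their labels are illustrative assumptions,
# not entries from the lexicon used in the original system.
POSITIVE_WORDS = {"good", "gain", "win", "success", "strong"}
NEGATIVE_WORDS = {"bad", "loss", "lose", "crisis", "weak"}


def find_opinion_words(tokens):
    """Split tokens into the positive and negative opinion words they contain."""
    positive = [t for t in tokens if t in POSITIVE_WORDS]
    negative = [t for t in tokens if t in NEGATIVE_WORDS]
    return positive, negative


tokens = ["strong", "sales", "offset", "the", "loss", "in", "weak", "markets"]
positive, negative = find_opinion_words(tokens)
print(positive)  # ['strong']
print(negative)  # ['loss', 'weak']
```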
(cont.)
• The methodology comprises five steps:
1. Data Collection
The proposed system uses the BBC news dataset, which contains 2225
documents from the website corresponding to stories in five popular
areas: Business, Entertainment, Politics, Sport and Tech. Each class
contains between 386 and 511 articles, stored as .txt files.
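Loading such a collection could be sketched as follows. The directory layout (one sub-folder per category, each holding .txt files) and the function name `load_articles` are assumptions made for illustration; the slides do not describe how the files were read.

```python
from pathlib import Path


def load_articles(root):
    """Return a list of (category, text) pairs from root/<category>/*.txt.

    The folder-per-category layout is an assumption, not a detail taken
    from the original system.
    """
    articles = []
    for category_dir in sorted(Path(root).iterdir()):
        if not category_dir.is_dir():
            continue  # skip stray files such as a README at the top level
        for path in sorted(category_dir.glob("*.txt")):
            # errors="replace" keeps loading robust if a file has odd bytes.
            text = path.read_text(encoding="utf-8", errors="replace")
            articles.append((category_dir.name, text))
    return articles
```

Calling `load_articles("bbc")` on a folder laid out this way would return one pair per article, with the sub-folder name as the category label.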
(cont.)
2. Text Pre-processing
• Pre-processing is a necessary step that cleans the text (reduces
noise) and removes inconsistencies so that the cleaned data can be
used more effectively in text mining or sentiment analysis tasks.
• The first pre-processing task is tokenization, which breaks a
sequence of sentences into individual components such as words,
phrases or symbols, called tokens.
• During tokenization, some items such as punctuation marks, stop
words, white space and numbers are discarded from the data.
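The pre-processing described above can be sketched in a few lines. This is a simplified illustration: the stop-word set and the function name `preprocess` are assumptions, and a real system would likely use a fuller stop-word list.

```python
import re

# A very small stop-word list for illustration; this set is an assumption,
# not the one used in the original work.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def preprocess(text):
    """Lowercase, tokenize, and drop punctuation, numbers and stop words."""
    # [a-z]+ keeps alphabetic runs only, so punctuation, digits and
    # white space never become tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


print(preprocess("Profits rose 12% in 2004, and the outlook is strong."))
# ['profits', 'rose', 'outlook', 'strong']
```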
(cont.)
3. Calculate Polarity of Sentiment Words
• Using TF-IDF (Term Frequency-Inverse Document Frequency), the
important words or terms in each document were identified and assigned
a weight according to the occurrence of the words in the news article.
• After the important words were identified, the WordNet dictionary (a
lexical database for the English language) was used in this experiment.
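As an illustration of the TF-IDF weighting mentioned above, the sketch below uses tf = count / document length and idf = log(N / df). This is one common variant and is an assumption here; the slides do not state which formula was used, and `tf_idf` is a hypothetical helper name.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF weights for a list of tokenized documents.

    Uses tf = count / document length and idf = log(N / df), which is one
    common variant; the original system may have used a different one.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    ["profits", "rose", "strong", "profits"],
    ["team", "lost", "match"],
    ["profits", "fell", "weak"],
]
weights = tf_idf(docs)
# Terms of the first document, sorted from highest to lowest weight.
top = sorted(weights[0].items(), key=lambda kv: kv[1], reverse=True)
print(top)
```

Note that a term appearing in every document gets a weight of zero under this variant, since log(N / N) = 0; such terms carry no information about any single article.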
(cont.)
4. Calculate Total Sentiment Score
• Find the polarity of each individual word, phrase and sentence, and
combine them to predict the polarity of the whole document.
• The sentiment score of the whole news article was calculated using
the extract sentiment operator. Text with a sentiment score of -1 is
considered negative, and text with a sentiment score of +1 is
considered positive.
(cont.)
5. Sentiment Results
• News articles were classified into positive, negative and neutral
classes according to their total sentiment score.
• The sentiment of each news article was then calculated as the
average value of its word sentiments.
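Steps 4 and 5 can be sketched together as averaging word polarities and mapping the result to a class. The toy lexicon, the choice to average over sentiment-bearing words only, the neutral threshold of 0.1, and the names `document_score` and `classify` are all assumptions for illustration; they are not taken from the original system.

```python
# Toy polarity lexicon with scores in [-1, 1]. The words, scores and the
# neutral threshold below are illustrative assumptions.
POLARITY = {"strong": 1.0, "win": 1.0, "gain": 0.5,
            "weak": -1.0, "loss": -1.0, "crisis": -0.5}


def document_score(tokens):
    """Average the polarity of the sentiment words found in the document."""
    scores = [POLARITY[t] for t in tokens if t in POLARITY]
    # A document with no sentiment words is treated as neutral (score 0).
    return sum(scores) / len(scores) if scores else 0.0


def classify(score, threshold=0.1):
    """Map a score in [-1, 1] to a positive, negative or neutral label."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


score = document_score(["strong", "gain", "offset", "loss"])
print(round(score, 3))  # 0.167
print(classify(score))  # positive
```

Because the average stays within the range of the lexicon scores, the result always lies between -1 and +1, which matches the score range described in step 4.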
Conclusion
• There are many interesting directions that can be explored. We are
interested in how sentiment can vary by demographic group, news
source or geographic location.
• By expanding our spatial analysis of news entities to sentiment maps,
we can identify geographical regions of favorable or adverse opinions
for given entities.
• We are also studying the degree to which our sentiment indices
predict future changes in popularity or market behavior.
