See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/301403733

Sentiment Analysis on News Articles for Stocks

Conference Paper · September 2014


DOI: 10.1109/AMS.2014.14


4 authors, including:

Rohan Tondulkar Sangeeta Oswal


Indian Institute of Technology Hyderabad Vivekanand Education Society's Institute of Technology


Sentiment Analysis on News Articles for Stocks
Vaanchitha Kalyanaraman, Student, VESIT, vaanchitha@gmail.com
Sarah Kazi, VESIT, sarahkazi1106@gmail.com
Rohan Tondulkar, VESIT, ron1tondulkar@gmail.com
Sangeeta Oswal, Assistant Professor, VESIT, Mumbai, sangita.oswal@ves.ac.in

Abstract— In recent years, social media has become an important platform for content sharing. In this paper we use social-media-generated content on news articles to study its effect on stock prices. We collected our dataset using the Bing API, which gave us links to news articles about a specific company. Two different machine learning algorithms were applied to the dataset and their accuracy was compared. To test our results, we attached an overall sentiment to each article in our data set, which was compared with the sentiment predicted by the algorithm. We also compared the predicted results with the actual change in stock prices on the market.

Keywords: Sentiment, Sentiment Analysis, Stock Market

I. INTRODUCTION

Social media presents an abundance of information to the end user, yet most of this information is untapped. One popular field is the financial sector, where an abundance of financial information is present which can be analyzed and presented to the investor. Analyzing stock trends has always been an interesting area of research, mainly because no universal rule to predict change in stock prices has been discovered. More research began with most stock exchanges becoming partially or fully electronic. With the electronic element introduced, stock traders and research analysts were able to obtain more accurate data on the movement of stock prices.

Stocks can be analyzed using Fundamental analysis or Technical analysis. Fundamental analysts make predictions on the basis of the company as a whole, i.e. its performance in the last quarter, earnings ratio etc. Technical analysts make predictions solely on the basis of the past history of the stock.

An investment theory known as the ‘Efficient Market Hypothesis’ (EMH) exists. The EMH states that every stock is valued after reflecting all relevant information about it, i.e. stock prices in the market change when new information about the stock is brought to light, and the past value of a stock has no effect on its current value. We use the sentiments of web users to predict the stock market price. Our work involves scanning the financial information available on stocks from social media and extracting the sentiment expressed by individual users.

In this paper we have analyzed, on the basis of the EMH, the effect of news articles on stock prices. A novel method of sentiment analysis using features extracted from the news article is proposed. We performed analysis on news articles over a period and compared it with the change in stock prices for the same period for a company. No past data has been considered.

The difficulty in this approach lies in the fact that most news articles about stocks or the stock market are objective in nature, hence an existing sentiment analysis dictionary cannot be directly applied to them. Most available sentiment dictionaries have a value for subjective words used in general language. This would not be useful to us. With a specialized sentiment dictionary we created, we predicted the value of the stocks of 11 companies from the Nifty 50 of the National Stock Exchange (NSE). Using two different machine learning algorithms, we have compared the results from the two.

The paper is arranged in the following manner. In Section 2 we discuss our overall approach to the experiment. It contains details about how the dataset was collected and created, the data cleansing process, the creation of a specialized sentiment dictionary for stocks, the creation of feature vectors for the algorithm and the implementation of the process. Section 3 covers in detail the two machine learning algorithms we have used. Section 4 contains our results, comparisons and findings, which include how accurately each algorithm works as a predictor.

II. EXPERIMENT

We generated a feature vector using our sentiment dictionary and training data set. Machine learning algorithms were then used to predict the impact of news articles on the stock market. In this section, we explain the entire procedure that we followed to analyze the stock market with news articles. The following sections describe it in detail:

A. Dataset
We selected 11 companies registered under India’s National Stock Exchange (NSE). News articles about these companies comprised our data set. Bing’s API was used to obtain the data set. From the API we got the date, URL and description of articles for each of the selected 11 companies. Using the URL, we accessed the articles, and compiled our data set into .csv files. For each company, we considered about 100 articles. Thus, our overall dataset had more than 1000 articles.

The dataset was divided into two types – Training Dataset and Testing (Experimental) Dataset.

1) Training Dataset: Training Data was that part of the dataset which helped the machine learn how to behave in any situation i.e. when given a set of inputs. The feature vector generated from the training dataset was fed to the machine learning algorithm. The training dataset consisted of a total of 120 articles, 20 articles each from 6 companies. We manually read through

TABLE I. SAMPLE TRAINING SET


each article in the training dataset and attached a sentiment to each article. This sentiment was either positive or negative. We also looked for words which were sentiment specific in the context of stock news and created a sentiment dictionary by attaching a positive or negative sentiment to each word. These words formed the columns (features) in the feature vector.

2) Testing Dataset: Testing Dataset was that part of the dataset which was given to the machine learning algorithm to predict the results. The rest of the articles of the first 6 companies (the articles not used in the training dataset) and all the articles from the other 5 companies formed the testing dataset. A feature vector was generated for the testing dataset. Machine learning algorithms were applied on this feature vector to predict the sentiment (impact) of the articles in the testing dataset.

B. Data Cleansing
The articles obtained from the Bing API contained data that was irrelevant or unnecessary to the stock market. To improve the accuracy of the prediction, we pre-processed the data i.e. we converted raw, unprocessed data to correct, relevant data. Only news articles related to that specific company were kept in the dataset. We also had many articles which were common to the stocks of various companies and so contained information about more than one company. In such cases we made sure to include only that part of the article that was relevant to the specific company we were analysing.
After pre-processing, from 100 articles of each company, our dataset was reduced to about 50-60 articles for each of the 11 companies. This completed the Data Pre-processing and Data Cleansing steps.

C. Sentiment Dictionary
Instead of using available sentiment dictionaries, we created our own sentiment dictionary. This was because we were unable to find any sentiment dictionary which had an accurate sentiment value for words used in news articles about stocks. Most pre-existing dictionaries contained words which are not used in news articles. Moreover, they did not contain an appropriate sentiment value for words which were specific to news articles related to stocks. E.g. "bull" and "bear" are important words with positive and negative sentiment respectively when used in news articles about stocks, but in normal language they do not have a strong positive or negative sentiment. We created a sentiment dictionary by manually reading all the articles in the training dataset. All the words with sentiments attached to them were classified as positive or negative words in our sentiment dictionary. The sentiment dictionary contains 532 words, with 266 positive and 266 negative words.

D. Feature Vector
The positive and negative words of the sentiment dictionary formed the features of the feature vector. The feature vector had 532 features. The feature vector helped the machine learning algorithm to correctly predict the sentiment of the articles in the training dataset. For every article, each feature had a value. This value was dependent on the frequency of that word in the article and its sentiment i.e. positive/negative. For example, for a negative word occurring three times, the value would be -3, whereas for a positive word occurring three times, the value would be +3. The feature vector is demonstrated in Tab. II.

TABLE II. FEATURE VECTOR
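The signed-count rule of Section II-D can be illustrated with a minimal sketch. It is not the authors' implementation: the four-word dictionary, the regular-expression tokenizer and the function name `feature_vector` are assumptions made for illustration (only "bull" and "bear" are taken from the paper; the real dictionary has 532 entries).

```python
from collections import Counter
import re

# Hypothetical miniature dictionary: word -> +1 (positive) or -1 (negative).
# "surge" and "slump" are invented entries used only for this illustration.
SENTIMENT = {"bull": 1, "surge": 1, "bear": -1, "slump": -1}
FEATURES = sorted(SENTIMENT)  # fixed column order shared by every article


def feature_vector(text):
    """Return one signed count per dictionary word (frequency times polarity)."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [SENTIMENT[word] * counts[word] for word in FEATURES]


article = "Bear market fears: bear calls grow as bear funds gain, but shares surge."
vector = feature_vector(article)
print(dict(zip(FEATURES, vector)))
```

Here "bear" occurs three times and so contributes -3, matching the example in the text, while words absent from the article contribute 0.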

E. Implementation using sentiment analysis
To the testing data, we applied two Machine Learning Algorithms to predict the rise or fall in the stocks with the help of the training dataset, the sentiment dictionary and the feature vector. We used two variants of Linear Regression, namely Gradient Descent and Normal Equations. The working of these algorithms is explained in detail in the next section.
Finally, we compared our results with the actual change in the stock prices over the period from which we had collected the news articles that made up the data set.
Fig.1. Block Diagram

III. MACHINE LEARNING ALGORITHMS


Linear Regression is one of the most widely used methods for statistical analysis. It is a supervised machine learning algorithm which is based upon building a linear hypothesis for a training data set. The hypothesis can later be used for output prediction. Since every word from our word list is a feature, Linear Regression with multiple features was used. Linear regression with multiple variables is also known as "multivariate linear regression". The feature vector of the training data set is fed into these machine learners with the features as input. The impact is obtained from the output. The hypothesis generated is used to continuously predict the output of articles in the experimental data set.

The notation for equations with any number of input variables is given below:
$x_j^{(i)}$ = value of feature $j$ in the $i$th training example
$x^{(i)}$ = the column vector of all the input features of the $i$th training example
$m$ = the number of training examples
$n = |x^{(i)}|$ (the number of features)
$\theta$ = $(n+1)$-dimensional column vector consisting of the parameters of the hypothesis

The multivariable form of the hypothesis function accommodating these multiple features is as follows:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \cdots + \theta_n x_n \tag{1}$$

Using the definition of matrix multiplication, and taking $x_0 = 1$, our multivariable hypothesis function can be concisely represented as [1]:

$$h_\theta(x) = \begin{bmatrix} \theta_0 & \theta_1 & \cdots & \theta_n \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} = \theta^T x \tag{2}$$

There are two variations in this machine learner:

A. Linear Regression using Gradient Descent
In this type, the value of a cost function J is minimized and the resulting value of the parameter vector θ is used for regression. J is just a function of the parameter vector.
Our cost function is:

$$J(\theta_0, \theta_1, \dots, \theta_n) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \tag{3}$$

where $y^{(i)}$ is the target value of the $i$th training example.
We aim to minimize the cost function J over θ:

$$\min_{\theta_0, \theta_1, \dots, \theta_n} J(\theta_0, \theta_1, \dots, \theta_n) \tag{4}$$

Gradient Descent Algorithm:

Repeat until convergence {
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1, \dots, \theta_n) \tag{5}$$
}
for j = 0 to j = n

That is, each θj is replaced by θj minus the learning rate (α) times the partial derivative of J with respect to θj. All θj values are updated simultaneously.

B. Linear Regression using Normal Equations
The normal equations method is used to find the optimum values of the θ vector without iteration.
The parameter vector θ is given by:

$$\theta = (X^T X)^{-1} X^T y \tag{6}$$

This value of the vector θ is used for continuous prediction [2].

IV. RESULTS
Results were calculated on the basis of two comparisons. Comparing the results from Linear Regression using Gradient Descent and Linear Regression using Normal Equations, we found Linear Regression using Gradient Descent to be more effective.

A. Comparing with expected sentiment found manually
The articles in the experimental data set were given a positive/negative sentiment after reading each one of them. Using the linear regression algorithms, their sentiment was predicted and compared with the actual sentiment. The results were as follows:

1) Results with Normal Equations: We achieved the following accuracy using Linear Regression with Normal Equations.

(in %)      Correct   Wrong
Positive    60.47     39.53
Negative    45.96     54.03

TABLE III. ACCURACY ON COMPARISON OF PREDICTED SENTIMENT WITH EXPECTED SENTIMENT USING LINEAR REGRESSION WITH NORMAL EQUATIONS
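The two solvers of Section III, equations (3), (5) and (6), can be compared on a toy problem. The following is a minimal sketch under stated assumptions, not the authors' code: the function names, the learning rate, the iteration count, the +1/-1 label encoding, the threshold at 0 and the six-row data set are all invented for illustration, and plain Python is used only to keep the example self-contained.

```python
def predict(theta, x):
    """Hypothesis h_theta(x) = theta^T x, with x[0] = 1 as the intercept term."""
    return sum(t * xi for t, xi in zip(theta, x))


def gradient_descent(X, y, alpha=0.05, iterations=5000):
    """Batch gradient descent on J = 1/(2m) * sum((h(x_i) - y_i)^2)."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        errors = [predict(theta, x) - yi for x, yi in zip(X, y)]
        # Simultaneous update: every gradient is computed from the old theta.
        grads = [sum(e * x[j] for e, x in zip(errors, X)) / m for j in range(n)]
        theta = [t - alpha * g for t, g in zip(theta, grads)]
    return theta


def normal_equations(X, y):
    """Solve (X^T X) theta = X^T y by Gauss-Jordan elimination (no iteration)."""
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


# Invented toy data: columns are [1 (intercept), signed count of word A,
# signed count of word B]; labels are +1 (positive) or -1 (negative).
X = [[1, 3, 0], [1, 0, -2], [1, 1, -1], [1, 2, 0], [1, 0, -3], [1, 1, 0]]
y = [1, -1, -1, 1, -1, 1]

theta_gd = gradient_descent(X, y)
theta_ne = normal_equations(X, y)
labels = [1 if predict(theta_ne, x) >= 0 else -1 for x in X]
print(theta_gd, theta_ne, labels)
```

On this small, well-conditioned problem both methods reach essentially the same θ. Equation (6) requires $X^T X$ to be invertible; when there are more features than training examples it is not, and a pseudo-inverse or a regularization term is commonly used instead.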
2) Results with Gradient Descent: We achieved the following accuracy using Linear Regression with Gradient Descent.

(in %)      Correct   Wrong
Positive    59.53     40.47
Negative    59.63     40.37

TABLE IV. ACCURACY ON COMPARISON OF PREDICTED SENTIMENT WITH EXPECTED SENTIMENT USING LINEAR REGRESSION WITH GRADIENT DESCENT

The results show the percentage of correct predictions for positive and negative articles. We can clearly see that Normal Equations provides an average accuracy of 53.2% i.e. (60.47+45.96)/2, and Gradient Descent provides an average accuracy of 59.5% i.e. (59.53+59.62)/2.
Thus, Linear Regression with Gradient Descent provided better accuracy in these results.

B. Comparing with actual change in stock prices in a certain time period for selected companies
We had taken 11 companies from the National Stock Exchange (NSE) under the Nifty50 group. The time period for each company was determined from the first and last news article of that company in the experimental data set. The change in stock price for every company in this time period was found. If the majority of a company's articles in the experimental data set had a positive predicted sentiment, we predicted an increase in stock price in that period. If the majority of articles had a negative predicted sentiment, we predicted a drop in stock prices in that period. This predicted change in stock price was compared to the actual change for each company in that time period. The results achieved are as shown in Tab. V.
From the results, we can see that the accuracy of Linear Regression using Normal Equations is 54.54% while the accuracy of Linear Regression using Gradient Descent is 81.81%. Thus, in these results too, Linear Regression using Gradient Descent proved to be more accurate.

TABLE V. ACCURACY ON COMPARISON OF PREDICTED SENTIMENT WITH ACTUAL CHANGE OF STOCK PRICE IN MARKET

V. CONCLUSION
We have analyzed the causative relation between news articles and the price of stocks in the market. Our machine learning model was able to predict the sentiment of an article with an accuracy of 53.2% using Normal Equations and 59.5% using Gradient Descent when compared to the result manually predicted by us. On comparison with actual stock prices also, we found that Gradient Descent was more accurate, with an accuracy of 81.82%, while Normal Equations had an accuracy of only 54.54%. In both cases, we clearly see Linear Regression using Gradient Descent to be more accurate.

VI. FUTURE SCOPE
Our analysis is not perfectly accurate as there are many facets of the experiment that can be improved. Some of them are as follows:
1. Making the sentiment dictionary more comprehensive so that it can be used in all markets. We also hope to make it efficient enough to be able to analyze specific events which don’t occur on a daily basis, e.g. budget reports, annual reports etc.
2. Our results comparison takes into consideration the change in stock value over the entire time period. Mapping the analysis of the sentiment of the news article to immediate changes in the market will be more useful.
We intend to explore all these areas in our future work.
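As a supplementary illustration, the per-company rule of Section IV-B (a majority of positive predictions means a predicted rise, otherwise a predicted fall) can be sketched as below. The function names, the company labels and all numbers are hypothetical, and treating a tie as a fall is an assumption, since the paper does not say how ties were resolved.

```python
def predict_direction(article_sentiments):
    """Majority vote over predicted article sentiments (+1 or -1).

    Returns +1 for a predicted rise and -1 for a predicted fall.
    Assumption: a tie is treated as a fall.
    """
    positives = sum(1 for s in article_sentiments if s > 0)
    negatives = len(article_sentiments) - positives
    return 1 if positives > negatives else -1


def direction_accuracy(predicted, actual):
    """Percentage of companies whose predicted direction matches the actual one."""
    hits = sum(1 for company in predicted if predicted[company] == actual[company])
    return 100.0 * hits / len(predicted)


# Invented example: predicted sentiments of each company's test articles
# and the actual direction of its stock price over the same period.
predicted_sentiments = {"A": [1, 1, -1], "B": [-1, -1, 1, -1], "C": [1, -1, 1]}
actual_direction = {"A": 1, "B": -1, "C": -1}

predicted_direction = {c: predict_direction(s) for c, s in predicted_sentiments.items()}
accuracy = direction_accuracy(predicted_direction, actual_direction)
print(predicted_direction, round(accuracy, 2))
```

In this invented example two of the three predicted directions agree with the actual change.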
ACKNOWLEDGMENT
We would like to thank our mentor and guide, Ms. Sangeeta Oswal, for her constant guidance and support throughout our project.

REFERENCES
[1] https://share.coursera.org/wiki/index.php/ML:Linear_Regression_with_Multiple_Variables
[2] http://www.holehouse.org/mlclass/04_Linear_Regression_with_multiple_variables.html
[3] Vivek Sehgal and Charles Song, "SOPS: Stock Prediction using Web Sentiment."
[4] Anshul Mittal and Arpit Goel, "Stock Prediction Using Twitter Sentiment Analysis."
[5] Nikolay Archak, Anindya Ghose and Panagiotis G. Ipeirotis, "Deriving the Pricing Power of Product Features by Mining Consumer Reviews."
[6] Paul C. Tetlock, "Giving Content to Investor Sentiment: The Role of Media in the Stock Market."
