



Stock Prediction Using Web Sentiments, Financial News and Quotes

Sulana Maria Rebelo and Kavita Asnani
Abstract: In this paper, we present a model that predicts the stock market closing value of the Dow Jones Industrial Average (DJI) index for a given trading day. The prediction uses unstructured data, namely financial message board messages and news articles, together with financial stock quotes data. We derive the sentiment of each message board message using SentiWordNet, and from these we derive the sentiment for every company of DJI for each trading day. News articles are replaced by key phrases using the Key Phrase Extraction Algorithm (KEA). The processed message board messages, news articles and stock quotes data are used to train a Neural Network using the Back Propagation Algorithm. The trained network predicts the closing value of DJI for a particular trading day.

Index Terms: Back Propagation Algorithm, Dow Jones Industrial Average (DJI), Key Phrase Extraction Algorithm (KEA), Neural Network, SentiWordNet.


Data mining can be used extensively in the financial markets and can help in stock-price forecasting. It can help investors discover hidden patterns in historic data that have probable predictive capability for their investment decisions. The web has rapidly emerged as a great source of financial information, ranging from financial news articles to personal opinions. Research has shown that sentiments and stock value are closely related and that web sentiments can be used to predict stock behavior [6]. The same is true for financial news. Online forum discussions between investors are not equivalent to market noise, and instead contain financially relevant informational content [9]. Text mining of such financial information can aid stock market predictions. Text mining refers to the process of deriving meaningful information from natural language text. Compared with quotes data, text is unstructured, amorphous, and difficult to deal with algorithmically. There is an important need to extract useful knowledge from vast amounts of textual data.

We propose a prediction model that predicts the stock closing value of DJI using quotes data, key phrases in news articles and sentiments from message boards. To develop this model we have used daily quotes data and financial news articles corresponding to DJI, and we also make use of the message board posts for each of the 30 companies of DJI. The data is collected over the period August 2011 to March 2012. Using the message board sentiments, key phrases from news articles and quotes data, a Neural Network is trained using the Back Propagation Algorithm. The trained Neural Network is used to predict the closing value of DJI.

A method is proposed in [3] to predict the stock closing value using quotes data and news articles. Key phrases are extracted from the news articles. The relationship between the news articles and the trends in the stock prices is used to train an Artificial Neural Network using the Back Propagation Algorithm. In our research, we aim to determine the cumulative effect of the quotes data, key phrases from news articles and sentiments from message boards on the closing value of DJI.

Sentiment Analysis or Opinion Mining aims to determine the attitude of a speaker or a writer with respect to some topic, or the overall contextual polarity of a document. It has to be determined whether the opinion expressed is positive, negative or neutral. Due to the richness of human language, its expressiveness and its ambiguities, the problem of sentiment classification is nontrivial. In [4], a document in a language other than English is first translated into English using standard translation software. The translated document is then classified according to its sentiment into one of the classes positive and negative. For sentiment classification, the document is searched for sentiment-bearing words such as adjectives. By means of SentiWordNet (a lexical resource for sentiment analysis in English) [1], scores for positivity and negativity are determined for these words. An interpretation of the scores then leads to the document polarity.

The work in [5] proposes measures that determine the semantic orientation of adjectives for three factors of subjective meaning. These three factors of emotive meaning are the evaluative factor (e.g., good-bad), the potency factor (e.g., strong-weak) and the activity factor (e.g., active-passive). Among these three factors, the evaluative factor has the strongest relative weight. These measures make use of the WordNet synonymy graph.

The approach in [6] involves scanning financial message boards and extracting the sentiments expressed by individual authors. Each message is converted to a vector of words and author names. The value of each entry in the vector is then calculated using the TF-IDF formula. The prediction at time instance i depends upon the values (messages and stock value) at the previous time instance. The Weka toolkit is used for classifier training. This approach calculates a TrustValue, which assigns trust to each message based on its author. A classifier is trained to predict whether the stock price will go up or down using the features extracted or calculated (including sentiment and TrustValue) over the past one day.

An approach to classifying reviews as recommended or not recommended is given in [7]. The classification of a review is predicted by the average semantic orientation of the phrases in the review that contain adjectives or adverbs. The semantic orientation of a phrase is calculated as the pointwise mutual information (PMI) between the given phrase and the word "excellent" minus the PMI between the given phrase and the word "poor". PMI is estimated by issuing queries to a search engine and noting the number of hits (matching documents).

Sulana Maria Rebelo, Department of Information Technology (M.E.), Padre Conceicao College of Engineering, Goa University, Verna, India.
Kavita Asnani, Department of Information Technology (M.E.), Padre Conceicao College of Engineering, Goa University, Verna, India.

1. Quotes Normalization: The quotes data for DJI is downloaded from rices. The quotes data is normalized using Min-Max Normalization.
2. Sentiment Classification: For each downloaded message board post, we derive the sentiment the author is trying to convey. The message board post is downloaded from (
3. Key Phrase Extraction: News articles are downloaded from ( For each downloaded news article, we extract the key phrases from that article using the Key Phrase Extraction Algorithm (KEA). From the key phrases we derive the global list of key phrases, which is the top 10 most significant key phrases impacting the entire corpus [2], [3].
4. Prediction Module: A Neural Network is trained using the backpropagation algorithm. This Neural Network is used to predict the closing value of DJI.
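The Min-Max Normalization named in step 1 can be illustrated with a minimal sketch. The function name `min_max_normalize`, the target range [0, 1], the handling of a constant column and the sample values are assumptions made for illustration; the paper does not state them.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale a list of numbers linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column is mapped to the lower bound.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min)
            for v in values]


# Hypothetical opening values for five trading days (not data from the paper).
opens = [11000.0, 11250.0, 11500.0, 11750.0, 12000.0]
print(min_max_normalize(opens))
```

Each quote column (for example Open, High and Low) would be rescaled separately, so that all inputs to the network lie on a comparable scale.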

3.1 Block Diagram

Fig. 1 gives the overall design of the system. Quotes data, message board messages and news articles for DJI are downloaded from the Internet for each trading day.

Fig. 1. Block Diagram. (Blocks: Quotes for DJI, Message Board Messages for DJI, News Articles for DJI, Quotes Normalization, Sentiment Classification, Key Phrase Extraction, Prediction Module, Predicted Closing Value.)

3.2 Sentiment Classification

Sentiment Classification aims to automatically predict the sentiment polarity (e.g. positive, negative or neutral) of a text such as a blog, message board post or review. The approach followed in this research for Sentiment Classification is based on SentiWordNet [4]. Once the message board messages are downloaded, we need to derive the sentiment for each message, that is, classify each message as expressing positive, negative or neutral sentiment.

Fig. 2. Sentiment Classification process. (Blocks: Message Board Messages for DJI, Part-of-speech Tagging, Tagged Message Board Messages, Adjective List, SentiWordNet lexicon, Semantic Orientation Identification, Message Board Messages with Orientation.)



1. Part-of-Speech Tagging: A part-of-speech tagger is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb or adjective, to each word (and other tokens). We use the Stanford tagger [8] to tag each message board post. The output of the Stanford tagger is a tagged version of each message, in which each token is assigned an appropriate part of speech. After tagging, each adjective in the message is followed by /JJ.
2. Adjective list: Adjectives convey a high degree of opinion; hence they play an important role in Sentiment Classification. From each tagged message board post we extract all the adjectives, which we identify by the /JJ tag that follows them.
3. SentiWordNet lexicon: SentiWordNet is a lexical resource for opinion mining. It assigns to each synset of WordNet three sentiment scores: positivity, negativity and objectivity. For each adjective we look up this lexicon and use the corresponding positivity, negativity and objectivity scores to derive the sentiment of the adjective.
4. Semantic Orientation Identification: DJI has 30 component companies. On any trading day there are many messages for each company. For each company we need to find the semantic orientation for each trading day by considering all the messages for that company on that day. After the POS tagging described above, we extract the adjectives from each message and then follow the steps below to derive the sentiment for a particular company on a particular trading day.
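Before the orientation steps, the adjective extraction of item 2 can be sketched as follows. The sketch assumes the tagged message is a single string of whitespace-separated `word/TAG` tokens and keeps only tokens whose tag is exactly `JJ`; the function name `extract_adjectives`, the lower-casing and the sample sentence are illustrative choices, not details taken from the paper.

```python
def extract_adjectives(tagged_message):
    """Return the lower-cased adjectives from a 'word/TAG word/TAG ...' string."""
    adjectives = []
    for token in tagged_message.split():
        # Split on the last slash so that words containing '/' keep their text.
        word, sep, tag = token.rpartition("/")
        if sep and tag == "JJ":
            adjectives.append(word.lower())
    return adjectives


# Hypothetical tagged message (not taken from the paper's data set).
tagged = ("The/DT earnings/NNS report/NN was/VBD strong/JJ "
          "but/CC guidance/NN looks/VBZ weak/JJ")
print(extract_adjectives(tagged))  # ['strong', 'weak']
```

Comparative and superlative forms carry the tags JJR and JJS; whether they should also be counted is a design choice the text leaves open, and this sketch excludes them.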

1. We find the Semantic Orientation of each adjective in a message.
2. Using the results from step 1, we find the Semantic Orientation of each message.
3. Using the results from step 2, we find the sentiment for each trading day for each company.

The steps mentioned above are explained below. For each message we find its semantic orientation using the SentiWordNet file. To do so, we first find the semantic orientation of each adjective in the message. For each adjective (Adj) in the message (M) we find its Semantic Orientation (SO) as follows:

i. Look up the SentiWordNet file and find all records where this adjective appears. Let n be the total number of records found.

ii. Calculate the positivity score PosScore(Adj) and the negativity score NegScore(Adj) of the adjective as follows:

$$\mathrm{PosScore}(Adj) = \frac{1}{n}\sum_{i=1}^{n} \mathrm{PosScore}_i$$

$$\mathrm{NegScore}(Adj) = \frac{1}{n}\sum_{i=1}^{n} \mathrm{NegScore}_i$$

where PosScore_i and NegScore_i are the positivity and negativity scores in the i-th record.

iii. If PosScore(Adj) = NegScore(Adj), SO(Adj) = neutral; else if PosScore(Adj) > NegScore(Adj), SO(Adj) = positive; else SO(Adj) = negative.

Once we have the semantic orientation of each adjective in a message, we need to find the semantic orientation of the message. For each message (M):

PosCnt = number of adjectives having positive semantic orientation.
NegCnt = number of adjectives having negative semantic orientation.
NeuCnt = number of adjectives having neutral semantic orientation.

Semantic Orientation (M) is:

neutral if PosCnt = NegCnt
neutral if NeuCnt > PosCnt and NeuCnt > NegCnt
positive if PosCnt > NegCnt and PosCnt >= NeuCnt
negative if NegCnt > PosCnt and NegCnt >= NeuCnt

Once we have the semantic orientation of each message, we find the semantic orientation for each trading day, for each company, by following the same approach that we used to find the semantic orientation of a message. We first count the number of messages having positive, negative and neutral sentiment, and then apply the rules given above to these counts. We then create a sentiment vector giving the sentiment for each company.
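The three levels of aggregation described above (adjective, message, trading day) can be sketched as follows. The sketch assumes that an adjective's scores are the averages over its n SentiWordNet records and that an adjective with no record counts as neutral; neither detail is confirmed by the text. The `LEXICON` dictionary is a hypothetical in-memory excerpt with made-up scores, not the real SentiWordNet file, and all function names are illustrative.

```python
from collections import Counter

# Hypothetical excerpt: adjective -> list of (positivity, negativity) pairs,
# one pair per SentiWordNet record in which the adjective appears.
LEXICON = {
    "strong": [(0.5, 0.0), (0.25, 0.125)],
    "weak": [(0.0, 0.625), (0.125, 0.5)],
    "quarterly": [(0.0, 0.0)],
}


def adjective_orientation(adj, lexicon=LEXICON):
    """Compare the averaged positivity and negativity scores of one adjective."""
    records = lexicon.get(adj, [])
    if not records:
        return "neutral"  # assumption: unknown adjectives count as neutral
    n = len(records)
    pos = sum(p for p, _ in records) / n
    neg = sum(q for _, q in records) / n
    if pos == neg:
        return "neutral"
    return "positive" if pos > neg else "negative"


def combine(labels):
    """Apply the PosCnt / NegCnt / NeuCnt rules to a list of orientations."""
    counts = Counter(labels)
    pos, neg, neu = counts["positive"], counts["negative"], counts["neutral"]
    if pos == neg or (neu > pos and neu > neg):
        return "neutral"
    return "positive" if pos > neg else "negative"


def message_orientation(adjectives):
    return combine(adjective_orientation(a) for a in adjectives)


def day_orientation(messages):
    """messages: one list of adjectives per message for a company and day."""
    return combine(message_orientation(m) for m in messages)


# Hypothetical messages for one company on one trading day.
day = [["strong", "quarterly"], ["weak"], ["strong"]]
print(day_orientation(day))  # positive
```

Because the same counting rule is used at the message level and at the day level, a single `combine` helper serves both.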

3.3 Neural Network Training

We use a feed-forward neural network trained with the backpropagation algorithm. The network configuration details are as follows:

1. Input layer: There is only one input layer. This layer accepts the input data that is fed to the neural network. Each record in the input data corresponds to one trading day. In our configuration the input layer has 43 neurons.
   Neurons 1 to 3: Stock quotes data (Open, High, Low) in normalised form.
   Neurons 4 to 13: Boolean values, i.e. the presence/absence of each of the top 10 global key phrases among the key phrases extracted from the news articles corresponding to a given trading day.
   Neurons 14 to 43: The sentiment derived from the message board posts for each of the 30 companies of DJI corresponding to a given trading day (AA, AXP, BA, BAC, CAT, CSCO, CVX, DD, DIS, GE, HD, HPQ, IBM, INTC, JNJ, JPM, KFT, KO, MCD, MMM, MRK, MSFT, PFE, PG, T, TRV, UTX, VZ, WMT, XOM). A value of 1 signifies positive sentiment, 0 negative sentiment and 0.5 neutral sentiment.
2. Hidden layer: The neural network may contain one or more hidden layers. In our configuration we have one hidden layer, which contains 45 neurons.
3. Output layer: In this study, the neural network contains only one output neuron, which outputs the closing index value predicted by the network for a given trading day.
4. Learning rate: A learning rate of 0.09 has been found to be appropriate.
5. Terminating condition: Training stops when either the error calculated is below some pre-specified threshold, or a pre-specified number of epochs has expired.
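The configuration above can be sketched as a small 43-45-1 network in plain Python. Several details are assumptions, since the text does not specify them: sigmoid activations in both layers, a squared-error loss, per-sample weight updates, uniform random initial weights, and a closing value normalised to [0, 1] as the target. The names `build_input` and `Network` and the sample values are illustrative only.

```python
import math
import random

SENTIMENT_VALUE = {"positive": 1.0, "negative": 0.0, "neutral": 0.5}


def build_input(quotes, phrase_flags, sentiments):
    """Concatenate 3 normalised quotes, 10 key-phrase flags, 30 sentiments."""
    assert len(quotes) == 3 and len(phrase_flags) == 10 and len(sentiments) == 30
    return (list(quotes)
            + [1.0 if flag else 0.0 for flag in phrase_flags]
            + [SENTIMENT_VALUE[s] for s in sentiments])


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class Network:
    def __init__(self, n_in=43, n_hidden=45, rate=0.09, seed=0):
        rng = random.Random(seed)
        # Each row holds the weights of one neuron; the last entry is its bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
        self.rate = rate

    def forward(self, x):
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1])
                  for row in self.w_hidden]
        out = sigmoid(sum(w * h for w, h in zip(self.w_out, hidden))
                      + self.w_out[-1])
        return hidden, out

    def train_step(self, x, target):
        hidden, out = self.forward(x)
        # Output delta for squared error with a sigmoid output neuron.
        delta_out = (out - target) * out * (1.0 - out)
        # Hidden deltas are computed before the output weights change.
        delta_hidden = [delta_out * self.w_out[j] * h * (1.0 - h)
                        for j, h in enumerate(hidden)]
        for j, h in enumerate(hidden):
            self.w_out[j] -= self.rate * delta_out * h
        self.w_out[-1] -= self.rate * delta_out
        for j, row in enumerate(self.w_hidden):
            for i, v in enumerate(x):
                row[i] -= self.rate * delta_hidden[j] * v
            row[-1] -= self.rate * delta_hidden[j]
        return 0.5 * (out - target) ** 2

    def train(self, samples, max_epochs=1000, threshold=1e-4):
        """Stop when the mean error drops below threshold or epochs run out."""
        for epoch in range(1, max_epochs + 1):
            error = sum(self.train_step(x, t) for x, t in samples) / len(samples)
            if error < threshold:
                break
        return epoch, error


# Hypothetical record for one trading day (not data from the paper).
x = build_input([0.42, 0.47, 0.40], [True] + [False] * 9, ["neutral"] * 30)
net = Network()
epochs, error = net.train([(x, 0.45)], max_epochs=200, threshold=1e-5)
print(len(x), epochs, error)
```

The two stopping criteria of item 5 appear as the `threshold` and `max_epochs` parameters of `train`; their default values here are placeholders.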

For each trading day, the data comprises the stock quotes, the message board messages collected from the time the market opens until it closes, and three news articles collected over different time slots: when the stock market opens, during the mid-day session and at the close of the stock market. The data set comprises quotes data, news articles and message board posts collected for 151 days in the period 11-08-2011 to 16-03-2012. Out of these 151 days, 127 days are used for training the neural network and 24 days are used for testing. For the testing days, the data set comprises data collected 2 hours after the open of the stock market. Table 1 and Table 2 give the results observed for test data collected over a period of 24 days.

TABLE 1
CORRELATION AND ACCURACY

Measure                                                 Value
Correlation between Actual Close and Predicted Close    0.80601804
No. of test days that correctly predicted the trend



TABLE 2

Magnitude of difference (Actual Close - Predicted Close)    No. of test days
[0-30)                                                      8
[30-60)                                                     6
[60-90)                                                     7
[90-120)                                                    1
[120-150)                                                   1
>150                                                        1

Fig. 3. Details of Neural Network training module. (Blocks: Quotes Normalization, Sentiment Vector Creation, Key Phrase Vector Creation, Global List of Key Phrases, Neural Network training using Feed forward Backpropagation Algorithm.)

In Table 2 above, if (Actual Close - Predicted Close) = 20.3 we count it in [0-30). We can see that the magnitude of the difference (Actual Close - Predicted Close) lies most often in the range [0-30), with 8 of the 24 test days, followed by [60-90) with 7 days and [30-60) with 6 days.
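The two summaries reported in the tables, the correlation between actual and predicted close and the binning of the error magnitude, can be computed with a short sketch. It assumes the Pearson correlation coefficient and treats a difference of exactly 150 as part of the last bin; both are assumptions, as are the function names and the sample values, which are not the paper's test data.

```python
import math

BIN_LABELS = ["[0-30)", "[30-60)", "[60-90)", "[90-120)", "[120-150)", ">150"]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def bin_errors(actual, predicted, width=30.0):
    """Count days per bin of |actual - predicted|, using bins of equal width."""
    counts = dict.fromkeys(BIN_LABELS, 0)
    for a, p in zip(actual, predicted):
        # Differences beyond the last boundary fall into the final bin.
        index = min(int(abs(a - p) // width), len(BIN_LABELS) - 1)
        counts[BIN_LABELS[index]] += 1
    return counts


# Hypothetical closing values for four test days (not data from the paper).
actual = [12000.0, 12100.0, 12050.0, 12200.0]
predicted = [11979.7, 12140.0, 12120.0, 12030.0]
print(pearson(actual, predicted))
print(bin_errors(actual, predicted))
```

With these sample values, the first day has a difference of about 20.3 and is therefore counted in [0-30), matching the worked case in the text.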


In this study we predicted the closing value of the Dow Jones Industrial Average index for a given trading day by considering the impact of historical stock quotes, news articles and message board posts. A Neural Network trained using the backpropagation algorithm is used for prediction.

In this paper, we introduced a novel method to predict the stock closing value using sentiment derived from financial message boards, key phrases extracted from news articles and quotes data. In our experiments over 24 test days, the correlation between the actual and the predicted closing value of the index was 0.806, and the magnitude of the difference was below 90 on 21 of those days.



[1] SentiWordNet.
[2] Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin and Craig G. Nevill-Manning, "KEA: Practical Automatic Keyphrase Extraction", Proceedings of the Fourth ACM Conference on Digital Libraries, pp. 254-255.
[3] Manisha V. Pinto and Kavita Asnani, "Stock Price Prediction Using Quotes and Financial News", International Journal of Soft Computing and Engineering, Volume-1, Issue-5, pp. 266-269.
[4] Kerstin Denecke, "Using SentiWordNet for Multilingual Sentiment Analysis", Data Engineering Workshop, 2008 (ICDEW 2008), IEEE 24th International Conference on, pp. 507-512.
[5] Jaap Kamps, Maarten Marx, Robert J. Mokken and Maarten de Rijke, "Using WordNet to Measure Semantic Orientation of Adjective", Proceedings of the 4th Intl. Conference on Language Resources and Evaluation (LREC'04), vol. IV, pp. 1115-1118.
[6] Vivek Sehgal and Charles Song, "SOPS: Stock Prediction using Web Sentiment", ICDMW '07 Proceedings of the Seventh IEEE International Conference on Data Mining Workshops, pp. 21-26.
[7] Peter D. Turney, "Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews", Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 417-424.
[8] Stanford Tagger.
[9] Christopher C. Chua, Maria Milosavljevic and James R. Curran, "A Sentiment Detection Engine for Internet Stock Message Boards", Proceedings of the Australasian Language Technology Association Workshop 2009, pp. 89-93.

Sulana Maria Rebelo received her B.E. degree in Computer Science from Goa University, India in 2006. She is currently pursuing an M.E. in Information Technology at Goa University, India. She has 4 years of IT industry experience, with extensive work on Java-related technologies. Her research interests include Data Mining and Information Retrieval. E-mail:

Kavita Asnani is Head of Department (Incharge) of the Information Technology Department at Padre Conceicao Engineering College, affiliated to Goa University, Goa. She received her Master's degree in Information Technology from Goa University, Goa, India. She has 12 years of teaching experience at college level. She has published many papers in international and national journals, and also at international and national conferences. Her areas of research include Data Mining, Information Retrieval and Distributed Systems. E-mail: