Welcome to Scribd, the world's digital library. Read, publish, and share books and documents. See more
Download
Standard view
Full view
of .
Look up keyword
Like this
36Activity
0 of .
Results for:
No results containing your search query
P. 1
Stock Prediction

Stock Prediction

Ratings:

4.33

(3)
|Views: 2,873|Likes:
Published by greg
Recently, the web has rapidly emerged as a great source
of financial information ranging from news articles to personal
opinions. Data mining and analysis of such financial
information can aid stock market predictions. Traditional
approaches have usually relied on predictions based on past
performance of the stocks. In this paper, we introduce a
novel way to do stock market prediction based on sentiments
of web users. Our method involves scanning for financial
message boards and extracting sentiments expressed by individual
authors. The system then learns the correlation
between the sentiments and the stock values. The learned
model can then be used to make future predictions about
stock values. In our experiments, we show that our method
is able to predict the sentiment with high precision and we
also show that the stock performance and its recent web
sentiments are also closely correlated.
Recently, the web has rapidly emerged as a great source
of financial information ranging from news articles to personal
opinions. Data mining and analysis of such financial
information can aid stock market predictions. Traditional
approaches have usually relied on predictions based on past
performance of the stocks. In this paper, we introduce a
novel way to do stock market prediction based on sentiments
of web users. Our method involves scanning for financial
message boards and extracting sentiments expressed by individual
authors. The system then learns the correlation
between the sentiments and the stock values. The learned
model can then be used to make future predictions about
stock values. In our experiments, we show that our method
is able to predict the sentiment with high precision and we
also show that the stock performance and its recent web
sentiments are also closely correlated.

More info:

Published by: greg on Nov 26, 2008
Copyright:Attribution Non-commercial

Availability:

Read on Scribd mobile: iPhone, iPad and Android.
download as PDF, TXT or read online from Scribd
See more
See less

11/09/2012

pdf

text

original

 
SOPS: Stock Prediction using Web Sentiment
Vivek Sehgal and Charles SongDepartment of Computer ScienceUniversity of MarylandCollege Park, Maryland, USA
{
viveks, csfalcon
}
@cs.umd.edu
Abstract
 Recently, the web has rapidly emerged as a great sourceof financial information ranging from news articles to per-sonal opinions. Data mining and analysis of such financialinformation can aid stock market predictions. Traditionalapproacheshaveusuallyreliedonpredictionsbasedonpas performance of the stocks. In this paper, we introduce anovelwaytodostockmarketpredictionbasedonsentimentsof web users. Our method involves scanning for financialmessage boards and extracting sentiments expressed by in-dividual authors. The system then learns the correlationbetween the sentiments and the stock values. The learnemodel can then be used to make future predictions about stock values. In our experiments, we show that our method is able to predict the sentiment with high precision and wealso show that the stock performance and its recent websentiments are also closely correlated.
1. Introduction
As web based technologies continue to be embraced bythe financial sector, abundance of financial information isbecoming available for the investors. One of the popularforms of financial information is the message board; thesewebsites have emerged as a major source for exchangingideas and information. Message boards provide a platformto investors from across the spectrum to come together andshare their opinions on companies and stocks. However,extracting good information from message boards is stilldifficult. The problem is the good information is hiddenwithin vast amount of data and it is nearly impossible foran investor to read all these websites and sort out the infor-mation. Therefore, providing computer software that canextract information can help investors greatly.The data contained in the websites are almost always un-structured. Unstructured data makes an interesting yet chal-lenging machine learning problem. On a message board,each data entry relates to a discussion to some stocks. Thiscan be visualized as a temporal data where the frequency of words and topics is changing with time. The opinions ona stock changes both with time and its performance on thestock exchanges. To put it more formally, there is correla-tion between the sentiment and performance of a stock. Ina study done recently [4], email spam and blogs have beenfound to closely predict stock market behavior. Messageboard data posses a challenge to data mining; opinions onmessage boards can be bullish, bearish, spam, rumor, vit-riolic or simply unrelated to the stock. We found that thenumber of useless messages surpasses useful ones. Fortu-nately for us, the sheer size of total messages allows signifi-cant number of posts of relevant opinions to be analyzed, aslong as we are able to filter out the noise.In this paper, we introduce a novel way to do sentimentprediction using features extracted from the messages. Asmentioned before, many of the sentiments extracted are ir-relevant. To solve this problem, we develop a new measureknown as “TrustValue” which assigns trust to each messagebased on its author. This method rewards those authors whowrite relevant information and whose sentiments closelyfollow stock performance. The sentiments along with their“TrustValue” are then used to learn their relation with thestock behavior. In our experiments, we find our hypothesisthat stock values and sentiments are correlated is true formany message boards.
2. Related Work
In previous work on stock market predictions, [8] ana-lyzed how daily variations in financial message counts ef-fect volumes in stock market. [5] used computer algorithmsto identify news stories that influence markets, and thentraded successfully on this information. Both did real mar-ket stimulations to test the efficacy of their systems.Another approach used in [4], tried to extract sentimentsfor a stock from the messages posted on web. They traineda meta-classifier which can classify messages as “good”,
 
“bad” or “neutral” based on the contents. The system usedboth the naive bag of word models and a primitive languagemodel. They found that time series and cross-sectional ag-gregationofmessagesentimentimprovedthequalityofsen-timent index. [4] reaffirmed the findings of [8] by showingthat market activity is strongly correlated to small investorsentiments. [4] further showed that overnight message post-ing volume predicts changes in next day stock trading vol-ume.There has been a lot of work on sentiment extraction.In [3], a social network is used to study how the sentimentpropagates across like minded blogs. The problem requirednatural language processing of text [6], they concluded thatcomputation can become intractable for a large corpus likeWeb. Our goal is develop an efficient model to predict sen-timent of message boards.
3. Approach
3.1 System Overview
Figure 1 gives an overall outline of our system. The firststep involved data collection. In this step we crawled mes-sage boards and stored the data in a database. The next stepwas extraction of information from the unstructured data.We removed HTML tags and extracted useful features suchas date, rating, message text etc. The information extractedis then used to build sentiment classifiers. By comparing onthe sentiments predicted from the web data and the actualstock value, our system calculated each author’s trust value.This trust value is then applied to filter noise, thus improv-ing our classifiers performance. Finally, our system canmake predictions on stock behavior using all the featuresextracted or calculated.
3.2 Data Collection
We collected over 260,000 messages for 52 popularstocks on
http
:
//finance.yahoo.com
. The stocks werechosen to cover a good spectrum from technology to oil sec-tor. The messages covered over 6 month time period; thislarge amount of data gave us a big time window to analyzethe stock behavior. All the data extracted was stored in arepository for later processing.On this website, the messages are organized by stockssymbol; a message board exists for each stock traded onmajor stock exchange such as NYSE and NASDAQ. Usersmust sign up for an account before they can post messagesand every message posted is associated with the author.This features makes author accountability possible in ourdata processing step. Along with text messages, the authorscan express the sentiments of their posts as “StrongBuy”,“Buy”, “Hold”, “Sell” and “StrongSell”. Aslo, other users
Figure 1. System Overview: The messageboard data is collected and processed. Thenwe predict the sentiments and calculate theTrustValues. These new features are thenused to predict stock behavior.
can rate the messages they read according to their opinionsand views, the rating is scaled out of five stars. And as withany message board, each message has a date, a title andmessage text.We used a scraper program to extract message boardposts for the chosen set of stock symbols. Figure 2 showsa sample message from Yahoo! Finance along with the rel-evant information circled. In this sample message, the au-thor expressed extreme optimism about YAHOO! stock andencouraged others to buy the Yahoo! stocks as its currentvery underpriced with the possibility of a stock split. Ourscraper program would extract the subject, text, stock sym-bol, rating, numberofrating, authorname, dateandauthor’ssentiment.
3.3 Feature Representation
After the relevant information has been extracted. Weconverted each message to a vector of words and authornames. The dates are mapped to an integer values. Thevalue of each entry in the vector is then calculated usingTFIDF formula:
TFIDF 
(
w
) =
T
(
w
)
×
ID
(
w
)
T
(
w
) =
n
(
w
)
w
n
(
w
)
ID
(
w
) =
log
(
|
|{
m
:
w
m
}
)
M is the set of all messages while n(w) is the frequencyof the term w in a message. The TFIDF (Term Frequency
 
Figure 2. We extracted relevant informationfrom the above message such as subject,text, date, author etc.
Inverse Document Frequency) weight is a statistical mea-sure used to evaluate how important a term (i.e., word, fea-ture, etc) is to a message in a corpus. The importance in-creases proportionally to the number of times the term ap-pears in the message but it is offset by the frequency of theterm in the corpus. Thus if the term is common in all themessages of the corpus then it is not a good indicator indifferentiating messages.
3.4 Sentiment Prediction
We assumed that the sentiment of a stock is highly re-sponsive to the performance of the stock and recent newsabout the company. For example, the news about introduc-tion of iPhone by Apple can fuel considerable interest inthe users, it can also affect their sentiments positively. Sim-ilarly, the sentiment can change when there is sharp changein stock performance in the near past. Using the above in-tuition, we modeled the sentiments as conditionally depen-dent upon the messages and stock value over the past oneday (it can be extended to a longer time period in our sys-tem). The sentiment for a message m at time instant i ismodeled as follows:
(
Sentiment
|
θ
) =
(
Sentiment
|
m,
i
,SV 
i
)
This can also be visualized as a Markov process wherethe prediction at time instance i depends upon the valuesat previous time instance. In the above formula,
i
1
and
SV 
i
1
correspond to the set of messages and stock value attime instant i-1 respectively. The parameters for the abovemodel can now be learned using a suitable learning algo-rithm. For our experimentation, we used Naive Bayes, De-cision trees [3] and Bagging [2] to learn the correspondingclassifier. Naive Bayes is a simple model and can easilygeneralize to large data. Unfortunately, it is not able tomodel complex relationships. Decision trees on the otherhand can encode complex relationships but can often leadto over-fitting. Over-fitting can cause the model to performwell for training data but is unable to show similar perfor-mance for actual data.For classifier training, we used a popular toolkit knownas weka [1] which provides all the standard classifierssuch as Naive Bayes, Decision Trees, etc. In our data,a small fractions of messages had sentiments already as-signed to them; their authors expressed the sentiment ex-plicitly. These messages were used as ground-truth whiletraining the sentiment classifier. We trained a classifier foreach stock on a daily bases and each message is classified as“StrongBuy”, “Buy”, “Hold”, “Sell” or “StrongSell”. Thenumber of features used by each classifier was in the rangeof 10,000.
3.5 TrustValue Calculation
We acknowledge that some authors are more knowledge-able than others about the stock market. For example, pro-fessionalfinancialanalystsshouldreceivemoretrust, mean-ing their posts should carry more weight than the posts bythe less knowledgeable authors. However, obtaining an au-thors background is tricky and difficult. Message boardscommonly provide the user profile feature where the au-thors themselves can fill in information about their back-ground. But this feature is often left unused or filled withinaccurate information.Instead of discovering each author’s background, we usean algorithm to calculate an author’s TrustValue base on hisor her historical performance on the message boards. Foreach message, we used the author’s sentiment from the sen-timent prediction step and compare them to the actual stock performance around the time of the post. If the author’spost supported the actual performance, then author’s trustvalue is increased. Not only do we care about the directionin which the stock price went, we also care about the mag-nitude. For example, if the author gave a strong sell sen-timent in the post, but in reality the stock price only dropslightly, then the author should earn less TrustValue. Weused percentage difference in stock price at closing bell asthe normalized measure of stock performance.Our algorithm also takes into account the fact that a sin-gle author cannot be expert on all stocks. It’s commonplacefor even professional financial analysts to keep track of onlya set of stocks. This means an author can only be trusted forthe set stocks he or she knows best. Each author can be as-signed different trust values for different stocks, this featureenables us to paint a clear picture of each author’s abilitieswith our algorithm. The TrustValue is calculated as follows:

Activity (36)

You've already reviewed this. Edit your review.
1 thousand reads
1 hundred reads
Johnson K. Gao liked this
Johnson K. Gao liked this
Johnson K. Gao liked this
rtlindore liked this
Wouter Rengelink liked this

You're Reading a Free Preview

Download
scribd
/*********** DO NOT ALTER ANYTHING BELOW THIS LINE ! ************/ var s_code=s.t();if(s_code)document.write(s_code)//-->