
Cryptocurrencies and Bitcoin:

Twitter Sentiment Analysis

SSTCSZG628T Dissertation

Submitted by

Sumir Saini
(BITS ID: 2016TC12004)

In fulfilment of the requirements for the degree of

Master of Technology in Software Systems


From

Birla Institute of Technology and Science, Pilani,


Rajasthan, India

Under the supervision of

Mr. Rhythm Boruah, Technical Lead, TCS

Mr. Mohit Gupta, Service Delivery Owner, TCS

Contents
Acknowledgement
Introduction and Business Problem
Tools and Libraries Used
Twitter Apps
Development
    Python Script Name: GatherTweets.py
    Script Name: SQLiteToCSV.py
    Python Script Name: cleanupsentiments.ipynb
    Python Script Name: cryptocurrencies.ipynb
Inference
Conclusion and Business Recommendations
References

Acknowledgement
I would like to take this opportunity to express my sincere thanks to Mr. Rhythm Boruah, who guided
me throughout the dissertation. Mr. Boruah has 13 years of experience in the IT
storage technologies domain. He pointed me to the necessary tools and libraries for
the project. It would not have been possible for me to complete the project without his
help in the areas of text sentiment analysis and Twitter applications.

Introduction and Business Problem
Cryptocurrencies and the blockchain are buzzwords amongst tech-savvy youth.
These two could transform society, not only by removing the middlemen
(financial institutions) but also by changing some of the basic assumptions of modern finance.

However, there is also apprehension: most people remain reluctant to adopt
this new technology. Businesses therefore need to identify the trend and leverage it.

Early adopters of cryptocurrencies stand to reap the benefits in the future, so it
would be judicious to start getting your feet wet soon.

Text analytics, sentiment analysis and visualization give a brief insight into the hidden
patterns of people's thinking.

Therefore, this project tries to capture people's sentiments regarding these new
technologies. These days, people largely express their opinions on social
media, and one prominent platform is Twitter. Twitter provides APIs that
enable us to obtain a copy of public tweets. Based on the tweets
collected, we perform a sentiment analysis and try to understand the dominant
sentiment and what the future of cryptocurrencies looks like.

Tools and Libraries Used
The following tools and libraries were used for the project work.

IDE: Enthought Canopy (Version 2.1.9):

Canopy provides Python 2.7 and 3.5, with easy installation and updates via a
graphical package manager of over 450 pre-built and tested scientific and analytic
Python packages from the Enthought Python Distribution. These include NumPy,
Pandas, SciPy, matplotlib, scikit-learn, and Jupyter / IPython. Canopy also provides
an integrated analysis environment, with editor, IPython console, Data Import Tool,
debugger, and documentation browser.

Python: Version 2.7.13 | Enthought, Inc:

Both Python versions, 2.7 and 3.5, are available with the Enthought distribution. In
our project, Python version 2.7 is used.

Visualization Tools: Python (Matplotlib):

matplotlib.pyplot is a collection of command-style functions that make matplotlib
work like MATLAB. Each pyplot function makes some change to a figure: e.g., creates
a figure, creates a plotting area in a figure, plots some lines in a plotting area,
decorates the plot with labels, etc.

Collection Tools: Python (tweepy, SQLite Database):

Tweepy is an easy-to-use Python library for accessing the Twitter API.


SQLite is a C library that provides a lightweight disk-based database that doesn’t
require a separate server process and allows accessing the database using a
nonstandard variant of the SQL query language. Some applications can use SQLite for
internal data storage. It’s also possible to prototype an application using SQLite and
then port the code to a larger database such as PostgreSQL or Oracle.
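As a minimal sketch of that prototyping workflow (the table and column names below are made up for the example, not the project's actual schema):

```python
import sqlite3

# Prototype a tiny tweet store in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cryptos (id_str TEXT PRIMARY KEY, text TEXT, polarity REAL)"
)
conn.execute(
    "INSERT INTO cryptos VALUES (?, ?, ?)",
    ("1001", "bitcoin is looking strong today", 0.43),
)
conn.commit()

# Query it back exactly as a larger SQL database would be queried later.
row = conn.execute(
    "SELECT text, polarity FROM cryptos WHERE id_str = ?", ("1001",)
).fetchone()
print(row)  # ('bitcoin is looking strong today', 0.43)
```

Because the `dataset` library used later in this project also speaks SQL, swapping SQLite for a server database later is mostly a change of connection string.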

Other Python Libraries: dataset, textblob, sys, etc.

TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple
API for diving into common natural language processing (NLP) tasks such as part-of-
speech tagging, noun phrase extraction, sentiment analysis, classification,
translation, and more.
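TextBlob's `sentiment.polarity` is a score in [-1, 1]. As a toy illustration of what such a score means, here is a stdlib-only scorer over a tiny hand-made lexicon. This is only a sketch of the idea; TextBlob itself uses a trained pattern lexicon, not word counting.

```python
import re

# Tiny hand-made lexicon; real sentiment lexicons contain thousands of entries.
POSITIVE = {"good", "great", "strong", "bullish", "love"}
NEGATIVE = {"bad", "weak", "bearish", "scam", "crash"}

def toy_polarity(text):
    """Return a crude polarity score in [-1, 1] from lexicon hits."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    # Crude scaling so a few lexicon hits saturate the [-1, 1] range.
    return max(-1.0, min(1.0, 5.0 * score / len(words)))

print(toy_polarity("bitcoin looking strong, great momentum"))  # 1.0
print(toy_polarity("another scam, prices crash"))              # -1.0
print(toy_polarity("bitcoin price unchanged today"))           # 0.0
```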

Twitter Apps:

The Twitter developer portal is a set of self-serve tools that developers can use to
manage their access to the premium APIs, as well as to create and manage their
Twitter apps.

Twitter Apps
Twitter’s developer platform offers many APIs, tools, and resources that enable you to
harness the power of Twitter's open, global and real-time communication network.

One of the main uses of a Twitter App is to gather tweets in real time, since Twitter is a
real-time communication network with very little latency. Many Twitter integrations
depend on the ability to stream Tweets in real time using HTTP streaming.

The Twitter API allows other applications to access Twitter data using Twitter's
authentication and authorization mechanisms.

I have created a Twitter App, called SumirApp, for this project.

Twitter App Name: SumirApp (https://apps.twitter.com/app/15647463/show)

Website: http://sumirapp.com

The following screenshots show the created Twitter App and its settings.

The next screenshot shows the authentication mechanism through which our Python
application accesses the real-time tweets.

The tweepy library uses the following Twitter App fields for authentication:

1. Consumer Key (API Key): Identifies the application to Twitter.
2. Consumer Secret (API Secret): Keep the Consumer Secret a secret. This key should
never be human-readable in your application.
3. Access Token: This access token can be used to make API requests on your own
account's behalf.
4. Access Token Secret: Do not share your access token secret with anyone.

Development
The following scripts were developed to meet the requirements.

Python Script Name: GatherTweets.py

Purpose:
- This script gathers real-time tweets via the Twitter App using the tweepy library.
- The script uses the following Twitter App parameters for authentication. I have hidden the
keys in this document for security purposes.
  o App Key
  o App Secret
  o Key
  o Secret
- It filters tweets on keywords such as "cryptocurrency", "cryptocurrencies", "bit coin",
"ether", "lite coin", "cryptos" and "blockchain".
- The script was run daily for 15-20 minutes to gather tweets.

Script:

# Import required libraries
import sys
import json

import tweepy
import dataset
from textblob import TextBlob
from sqlalchemy.exc import ProgrammingError

# Keys partially hidden for security purposes.
TWITTER_APP_KEY = "C2om7O2ZyAMlrRrJQSIDzzIuv"
TWITTER_APP_SECRET = "*******************************************"
TWITTER_KEY = "1016186897239105536-gXRiMf1t0HcSWPfNRfHkLcQxYMCwI0"
TWITTER_SECRET = "************************************************"

db = dataset.connect('sqlite:///cryptotweets.db')


class StreamListener(tweepy.StreamListener):

    def on_status(self, status):
        # Skip retweets so each tweet is stored only once.
        if status.retweeted:
            return

        description = status.user.description
        loc = status.user.location
        text = status.text
        coords = status.coordinates
        geo = status.geo
        name = status.user.screen_name
        user_created = status.user.created_at
        followers = status.user.followers_count
        id_str = status.id_str
        created = status.created_at
        retweets = status.retweet_count
        bg_color = status.user.profile_background_color

        # Score the tweet text with TextBlob.
        blob = TextBlob(text)
        sent = blob.sentiment

        if geo is not None:
            geo = json.dumps(geo)
        if coords is not None:
            coords = json.dumps(coords)

        table = db['cryptos']
        try:
            table.insert(dict(
                user_description=description,
                user_location=loc,
                coordinates=coords,
                text=text,
                geo=geo,
                user_name=name,
                user_created=user_created,
                user_followers=followers,
                id_str=id_str,
                created=created,
                retweet_count=retweets,
                user_bg_color=bg_color,
                polarity=sent.polarity,
                subjectivity=sent.subjectivity,
            ))
        except ProgrammingError as err:
            print(err)
        except KeyboardInterrupt:
            print("Bye")
            sys.exit()
        print("Writing tweets to file, CTRL+C to terminate the program")

    def on_error(self, status_code):
        if status_code == 420:
            # 420 means we are being rate limited;
            # returning False disconnects the stream.
            return False


auth = tweepy.OAuthHandler(TWITTER_APP_KEY, TWITTER_APP_SECRET)
auth.set_access_token(TWITTER_KEY, TWITTER_SECRET)
api = tweepy.API(auth)

stream_listener = StreamListener()
stream = tweepy.Stream(auth=api.auth, listener=stream_listener)
stream.filter(track=["cryptocurrency", "cryptocurrencies", "bit coin",
                     "ether", "lite coin", "cryptos", "blockchain"])

Output:

After starting, the script pulls tweets containing the mentioned filter words until we
hit Ctrl+C to terminate it. The following is the output:

%run "C:\Users\Administrator\Downloads\Project\Deliverables-
1\Deliverables\Code and Data\GatherTweets.py"
Writing tweets to file,CTRL+C to terminate the program
Writing tweets to file,CTRL+C to terminate the program
Writing tweets to file,CTRL+C to terminate the program
Writing tweets to file,CTRL+C to terminate the program
Writing tweets to file,CTRL+C to terminate the program
Writing tweets to file,CTRL+C to terminate the program
Writing tweets to file,CTRL+C to terminate the program
Writing tweets to file,CTRL+C to terminate the program

KeyboardInterruptTraceback (most recent call last)


C:\Users\Administrator\Downloads\Project\Deliverables-
1\Deliverables\Code and Data\GatherTweets.py in <module>()
77 stream_listener = StreamListener()
78 stream = tweepy.Stream(auth=api.auth,
listener=stream_listener)
---> 79 stream.filter(track=["cryptocurrency", "cryptocurrencies",
"bit coin", "ether", "lite coin", "cryptos", "blockchain"])

C:\Users\Administrator\AppData\Local\Enthought\Canopy\edm\envs\User\
lib\site-packages\tweepy\streaming.pyc in filter(self, follow,
track, async, locations, stall_warnings, languages, encoding,
filter_level)
443 self.session.params = {'delimited': 'length'}
444 self.host = 'stream.twitter.com'
--> 445 self._start(async)
446
447 def sitestream(self, follow, stall_warnings=False,
C:\Users\Administrator\AppData\Local\Enthought\Canopy\edm\envs\User\
lib\site-packages\tweepy\streaming.pyc in _start(self, async)
359 self._thread.start()
360 else:
--> 361 self._run()
362
363 def on_closed(self, resp):
C:\Users\Administrator\AppData\Local\Enthought\Canopy\edm\envs\User\
lib\site-packages\tweepy\streaming.pyc in _run(self)
261 self.snooze_time = self.snooze_time_step
262 self.listener.on_connect()
--> 263 self._read_loop(resp)
264 except (Timeout, ssl.SSLError) as exc:
265 # This is still necessary, as a SSLError can
actually be
C:\Users\Administrator\AppData\Local\Enthought\Canopy\edm\envs\User\
lib\site-packages\tweepy\streaming.pyc in _read_loop(self, resp)
311 length = 0
312 while not resp.raw.closed:
--> 313 line = buf.read_line().strip()
314 if not line:
315 self.listener.keep_alive() # keep-alive
new lines are expected

C:\Users\Administrator\AppData\Local\Enthought\Canopy\edm\envs\User\
lib\site-packages\tweepy\streaming.pyc in read_line(self, sep)
177 else:
178 start = len(self._buffer)
--> 179 self._buffer +=
self._stream.read(self._chunk_size)
180
181 def _pop(self, length):
C:\Users\Administrator\AppData\Local\Enthought\Canopy\edm\envs\User\
lib\site-packages\urllib3\response.pyc in read(self, amt,
decode_content, cache_content)
382 else:
383 cache_content = False
--> 384 data = self._fp.read(amt)
385 if amt != 0 and not data: # Platform-
specific: Buggy versions of Python.
386 # Close the connection when no data is
returned
C:\Users\Administrator\AppData\Local\Enthought\Canopy\edm\envs\User\
lib\httplib.pyc in read(self, amt)
571
572 if self.chunked:
--> 573 return self._read_chunked(amt)
574
575 if amt is None:
C:\Users\Administrator\AppData\Local\Enthought\Canopy\edm\envs\User\
lib\httplib.pyc in _read_chunked(self, amt)
613 while True:
614 if chunk_left is None:
--> 615 line = self.fp.readline(_MAXLINE + 1)
616 if len(line) > _MAXLINE:
617 raise LineTooLong("chunk size")
C:\Users\Administrator\AppData\Local\Enthought\Canopy\edm\envs\User\
lib\socket.pyc in readline(self, size)
478 while True:
479 try:
--> 480 data = self._sock.recv(self._rbufsize)
481 except error, e:
482 if e.args[0] == EINTR:
C:\Users\Administrator\AppData\Local\Enthought\Canopy\edm\envs\User\
lib\ssl.pyc in recv(self, buflen, flags)
764 "non-zero flags not allowed in calls to
recv() on %s" %
765 self.__class__)
--> 766 return self.read(buflen)
767 else:
768 return self._sock.recv(buflen, flags)
C:\Users\Administrator\AppData\Local\Enthought\Canopy\edm\envs\User\
lib\ssl.pyc in read(self, len, buffer)
651 v = self._sslobj.read(len, buffer)
652 else:
--> 653 v = self._sslobj.read(len)
654 return v
655 except SSLError as x:
KeyboardInterrupt:

The script generates an SQLite database file, as shown in the following figure.

Script Name: SQLiteToCSV.py

Purpose: This script converts the SQLite database file to CSV for easier viewing in
Excel.

Script:

import dataset
from datafreeze.app import freeze

# Open a connection to the SQLite DB
db = dataset.connect('sqlite:///cryptotweets.db')

# Gather all the rows in the database into the variable result
result = db['cryptos'].all()

# Convert the data into CSV file format
freeze(result, format='csv', filename='cryptotweets.csv')

Output:

The output is saved in a CSV file. The following are the columns of the CSV file:

1. Tweet time
2. Tweet text
3. User description
4. Number of followers the user has

5. Location
6. User ID
7. User Name
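Once exported, the CSV can be inspected with Python's standard `csv` module. A minimal sketch follows; the header names and the sample row are hypothetical stand-ins, not the actual export:

```python
import csv
import io

# Hypothetical one-row sample standing in for cryptotweets.csv.
sample = io.StringIO(
    "created,text,user_description,user_followers,user_location,id_str,user_name\n"
    '2018-07-10 09:15:00,"bitcoin to the moon",crypto fan,120,Delhi,1001,alice\n'
)

# DictReader maps each row to a dict keyed by the header line.
for row in csv.DictReader(sample):
    print(row["user_name"], row["user_followers"], row["text"])
# alice 120 bitcoin to the moon
```

In practice one would open the real file with `open('cryptotweets.csv')` instead of the in-memory sample.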

Python Script Name: cleanupsentiments.ipynb

Purpose: This is the main script of the project, which performs the following actions:

1. Takes input from the CSV file, cryptotweets.csv, which was created in the previous
steps from real-time Twitter data.

2. Cleans up the tweets before analysis. The following text-cleaning operations are
performed:
a. Remove stop words.
b. Convert data to lower case.
c. Substitute extra whitespace with single space.
d. Remove multiple punctuation marks.
e. Remove Hyperlinks and https.
f. Remove back slash.
g. Additional cleaning.
3. Performs the actual sentiment analysis to assign a polarity to each tweet.
The following polarity classes are generated:
a. Positive
b. Neutral
c. Negative
4. Generates a histogram based on the polarity of the tweets.
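Steps 2(c)-(g) above are all regex substitutions. A condensed stdlib sketch of such a cleanup chain (the notebook's exact patterns differ slightly) looks like:

```python
import re

def clean_tweet(text):
    """Apply the cleanup chain: lowercase, drop links, mentions,
    punctuation runs, back slashes, then collapse whitespace."""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)                # drop hyperlinks / https
    text = re.sub(r"@\S+", "", text)                   # drop @mentions
    text = re.sub(r"[-!,&;#:/\\%'.$@_+]+", "", text)   # strip punctuation runs
    text = re.sub(r"\\", "", text)                     # remove back slashes
    text = re.sub(r"\s+", " ", text).strip()           # collapse whitespace
    return text

print(clean_tweet("Check   THIS out!! https://t.co/abc @user #crypto"))
# check this out crypto
```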

Script:

import time

import matplotlib.pyplot as plt
import matplotlib.animation as animation
from matplotlib import style

import pandas as pd
import numpy as np
import re

style.use("ggplot")

%matplotlib inline

# Read the CSV file collected
file_location = input("enter the Data Location: ")
file_name = input("enter the file name: ")

twitter_data = pd.read_csv(file_location + "/" + file_name + ".csv",
                           error_bad_lines=False, warn_bad_lines=False)

twitter_data.head()
# We are now working on the latest tweets

Output:

# Step 1: Remove stop words and convert data to lowercase
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))
temp = []
for s in twitter_data['text']:
    s = str(s)
    t1 = [i for i in s.lower().split() if i not in stop]
    t2 = " ".join(t1)
    temp.append(t2)

# Create a new column in the data to store the lower-case contents.
twitter_data['Tweets'] = temp
twitter_data.head()

place_holder = []
for i in twitter_data['Tweets']:
    place_holder.append(i)

# For removing @mentions
place_holder = [re.sub('\S*@\S*\s?', '', i) for i in place_holder]
# Substitute extra whitespace with a single space
place_holder = [re.sub('\s+', ' ', i) for i in place_holder]
# Handle removal of multiple punctuation marks
place_holder = [re.sub("[-!,&;#:/\\%''.$@_+]", "", i) for i in place_holder]
# For removing any links
place_holder = [re.sub("(http+s?:?/*[a-zA-Z0-9]+.com/)", '', i) for i in place_holder]
# For removing https
place_holder = [re.sub("(http+s?)", '', i) for i in place_holder]
# For removing back slashes
place_holder = [re.sub("[\\\]", '', i) for i in place_holder]
# For removing unicode escape remnants
place_holder = [re.sub("(u[0-9]+|\\+)", '', i) for i in place_holder]

place_holder[0:5]

twitter_data['Tweets'] = place_holder
twitter_data.head()

Output:

from textblob import TextBlob
import re

# Clean tweets
def split_tweet(tweet):
    return ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)",
                           " ", tweet).split())

def quantize_polarity(tweet):
    # Polarity tells whether the tweet is positive, negative or neutral
    analysis = TextBlob(split_tweet(tweet))
    if analysis.sentiment.polarity > 0:
        return 1
    elif analysis.sentiment.polarity == 0:
        return 0
    else:
        return -1

twitter_data['Sentiment analysis'] = np.array(
    [quantize_polarity(tweet) for tweet in twitter_data['Tweets']])
twitter_data.head()

len(twitter_data)

fig = plt.figure()
ax1 = fig.add_subplot(1, 1, 1)
fig.set_size_inches(12, 6, forward=True)

def animate(i):
    xar = []
    yar = []
    x = 0
    y = 0

    # Build a running total of sentiment across the tweets.
    for l, m in twitter_data.iterrows():
        x += 1
        if m['Sentiment analysis'] >= 1:
            y += 1
        elif m['Sentiment analysis'] < 0:
            y -= 1
        xar.append(int(x))
        yar.append(int(y))

    ax1.clear()
    ax1.plot(xar, yar)
    ax1.set_xlabel('Number of tweets', fontsize=15)
    ax1.set_ylabel('Sentiment overall counts', fontsize=15)
    ax1.set_title('Trend of Sentiments', fontsize=15, fontweight='bold')

ani = animation.FuncAnimation(fig, animate, interval=1000)
plt.show()

Output:

The sentiment analysis shows a roughly linear relation between the number of tweets
and the cumulative sentiment count.

tweets_by_sentiments = twitter_data['Sentiment analysis'].value_counts()

fig, ax = plt.subplots()
fig.set_size_inches(13, 7, forward=True)
ax.tick_params(axis='x', labelsize=15)
ax.tick_params(axis='y', labelsize=10)
ax.set_xlabel('Sentiment_class', fontsize=15)
ax.set_ylabel('Number of tweets', fontsize=15)
ax.set_title('Top Sentiments', fontsize=15, fontweight='bold')
tweets_by_sentiments[:].plot(ax=ax, kind='bar', color='blue')

Output:

Python Script Name: cryptocurrencies.ipynb

Purpose: This script does a trend analysis of three prevalent cryptocurrencies (Bitcoin,
Ether and XRP) based on their market value. We then compare how the market value
relates to the tweet sentiments.
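The trend analysis below plots each coin's daily "high" over time. A simple trailing moving average, sketched here in pure Python with made-up prices, shows the kind of smoothing that makes such price trends easier to read (the notebook itself plots the raw values):

```python
def moving_average(values, window):
    """Trailing moving average; shorter prefix windows are averaged as-is."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Hypothetical daily BTC 'high' values.
highs = [6200.0, 6350.0, 6100.0, 6500.0, 6700.0]
print(moving_average(highs, 3))
```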

%matplotlib inline
import pandas as pd
from matplotlib import pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
import numpy as np

# Import the CSV containing market values of three prevalent cryptocurrencies.
data = pd.read_csv(r'C:\Users\Administrator\Downloads\Project\Deliverables-1'
                   r'\Deliverables\Code and Data\crypto-markets.csv')

data.head()

Output:

data.describe()

Output:

data.shape

Output:

(785024, 13)

data.dtypes

Output:

# Check whether the data contains null values.
data[data.notnull()].shape

Output:

(785024, 13)

data.info()
data.columns

Output:

Index([u'slug', u'symbol', u'name', u'date', u'ranknow', u'open', u'high',
       u'low', u'close', u'volume', u'market', u'close_ratio', u'spread'],
      dtype='object')

data['date']= pd.to_datetime(data.date)
## Segregating the top 3 cryptocurrencies
BTC_data = data[data['symbol']=='BTC']
ETH_data = data[data['symbol']=='ETH']
XRP_data = data[data['symbol']=='XRP']
type(BTC_data)
BTC_data['date']= pd.to_datetime(BTC_data.date)
ETH_data['date']= pd.to_datetime(ETH_data.date)
XRP_data['date']= pd.to_datetime(XRP_data.date)

ETH_data.dtypes

Output:

slug object
symbol object
name object
date datetime64[ns]

ranknow int64
open float64
high float64
low float64
close float64
volume float64
market float64
close_ratio float64
spread float64
dtype: object

# sort_values returns a sorted copy, so assign the result back
BTC_data = BTC_data.sort_values('date')
ETH_data = ETH_data.sort_values('date')
XRP_data = XRP_data.sort_values('date')
XRP_data.head()

Output:

# Plotting the three cryptocurrencies' timeline values
t1 = BTC_data.date
t2 = ETH_data.date
t3 = XRP_data.date
plt.plot(t1, BTC_data.high, 'r')
plt.plot(t2, ETH_data.high, 'b')
plt.plot(t3, XRP_data.high, 'g')
plt.show()

Output:

Inference
From the cleanupsentiments.ipynb script, the polarity histogram shows that there is a
positive response to cryptocurrencies these days. However, the neutral bar in the
histogram is still higher than the positive and negative bars, which suggests that many
people still place their trust in traditional financial institutions, for several reasons,
the primary one being security.

And from the cryptocurrencies.ipynb script, cryptocurrencies, especially Bitcoin, have
seen a steady rise in market value; despite a marginal decline over the past year, values
remain much higher than over the preceding 10-15 years.

Conclusion and Business
Recommendations
From the above inferences, there appears to be an upward trend of interest in
cryptocurrencies and blockchain. Businesses should start investing in proofs of concept
(POCs) and infrastructure for blockchain and cryptocurrencies; this will, in turn, be
fruitful for them in the future.

The same project could be augmented in the future to capture the particulars of people
with positive sentiments, who could then be targeted with customized services.
Geo-location data could be used to identify the countries that show the most positive
response towards cryptocurrencies. The project could also be tweaked to capture which
cryptocurrencies are talked about the most.

References
1. https://www.enthought.com/product/canopy/
2. http://docs.enthought.com/canopy/2.1/index.html
3. https://developer.twitter.com/en/docs/basics/developer-portal/guides/apps.html
4. https://matplotlib.org/tutorials/introductory/pyplot.html
5. https://tweepy.readthedocs.io/en/v3.5.0/
6. https://docs.python.org/2/library/sqlite3.html
7. https://www.learnpython.org/
8. https://www.datacamp.com/
9. The Python Book, Future Publishing Limited,
(https://www.myfavouritemagazines.co.uk/tech-and-gadgets-guides-and-
specials/the-python-book-6th-edition/)
10. R Data Analysis Projects, Gopi Subramanian,
(https://www.packtpub.com/big-data-and-business-intelligence/r-data-analysis-
projects)

