SSTCSZG628T Dissertation
Submitted by
Sumir Saini
(BITS ID: 2016TC12004)
Under supervision of
Contents
Acknowledgement
Introduction and Business Problem
Tools and Libraries Used
Twitter Apps
Development
    Python Script Name: GatherTweets.py
    Script Name: SQLiteToCSV.py
    Python Script Name: cleanupsentiments.ipynb
    Python Script Name: cryptocurrencies.ipynb
Inference
Conclusion and Business Recommendations
References
Acknowledgement
I am using this opportunity to express my sincere thanks to Mr. Rhythm Boruah, who guided
me throughout the dissertation. Mr. Boruah has 13 years of experience in the IT storage
technologies domain. He guided me on the tools and libraries needed for the project. It
would not have been possible for me to complete the project without his help in this area
of text sentiment analysis and Twitter applications.
Introduction and Business Problem
Cryptocurrencies and the blockchain are buzzwords amongst the tech-savvy youth.
These two technologies could revolutionize society, not only by removing the middle men
(financial institutions) but also by changing the very paradigm of modern society.
However, there is some apprehension as well, as most people are a little reluctant to adapt
to this new technology. That is why businesses need to identify the trend and leverage it.
The early birds will reap the benefits in the future, so it would be judicious to start
getting your feet wet as soon as possible.
Text analytics, sentiment analysis and visualization give a brief insight into the hidden
patterns of people's thought processes.
Therefore, in this project, we try to capture the sentiments of people regarding these new
technologies. These days, people generally express their emotions on social media. One
such prominent social media platform is Twitter, which provides built-in APIs that enable
us to get a copy of such tweets. Based on the tweets collected, we perform a sentiment
analysis and try to understand the dominant sentiment and what the future of
cryptocurrencies looks like.
Tools and Libraries Used
The following tools and libraries were used to perform the project work.
Canopy provides Python 2.7 and 3.5, with easy installation and updates via a
graphical package manager of over 450 pre-built and tested scientific and analytic
Python packages from the Enthought Python Distribution. These include NumPy,
Pandas, SciPy, matplotlib, scikit-learn, and Jupyter / IPython. Canopy also provides
an integrated analysis environment, with editor, IPython console, Data Import Tool,
debugger, and documentation browser.
Both Python versions, 2.7 and 3.5, are available with the Enthought distribution. In
our project, Python version 2.7 is used.
TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple
API for diving into common natural language processing (NLP) tasks such as part-of-
speech tagging, noun phrase extraction, sentiment analysis, classification,
translation, and more.
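TextBlob's sentiment property returns a polarity score in [-1, 1] and a subjectivity score in [0, 1]. As a minimal sketch of how such polarity scores are consumed later in this project (the polarity values below are illustrative stand-ins, not real TextBlob output):

```python
# Minimal sketch: mapping a TextBlob-style polarity score in [-1, 1]
# to a sentiment class, as done later in cleanupsentiments.ipynb.
# The example polarity values are illustrative, not real TextBlob output.
def classify(polarity):
    if polarity > 0:
        return "positive"
    elif polarity == 0:
        return "neutral"
    return "negative"

examples = [("bitcoin is great", 0.8),
            ("blockchain", 0.0),
            ("crypto is a scam", -0.7)]
for text, polarity in examples:
    print(text, "->", classify(polarity))
```

With the real library, the polarity would come from TextBlob(text).sentiment.polarity instead of a hand-written value.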
Twitter Apps:
The Twitter developer portal is a set of self-serve tools that developers can use to
manage their access to the premium APIs, as well as to create and manage their
Twitter apps.
Twitter Apps
Twitter's developer platform offers many APIs, tools, and resources that enable you to
harness the power of Twitter's open, global and real-time communication network.
One of the main uses of a Twitter App is to gather tweets in real time, as Twitter is a
real-time communication network with very little latency. Many Twitter integrations
depend on the ability to stream Tweets in real time using HTTP streaming.
The Twitter API allows other applications to access Twitter data using Twitter's
authentication and authorization management.
I have created a Twitter App, called SumirApp, for our project.
Website: http://sumirapp.com
Following screenshots show the created Twitter App and its settings.
The next screenshot shows the authentication mechanism through which our Python
application accesses the real-time Tweets to process.
The Tweepy library uses the following Twitter App fields for authentication: -
Development
The following scripts were developed to meet our requirements.
Python Script Name: GatherTweets.py
Purpose:
- This script gathers real-time Tweets from the Twitter App using the Tweepy library.
- The script uses the following Twitter App parameters for authentication. I have hidden
the keys in the documentation for security purposes.
o App Key
o App Secret
o Key
o Secret
- It uses words such as "cryptocurrency", "cryptocurrencies", "bit coin", "ether",
"lite coin", "cryptos" and "blockchain" for filtering.
- This script is run for 15-20 minutes every day to gather the tweets.
Script:
# Import settings
import sys
import tweepy
import dataset
import json
from textblob import TextBlob
from sqlalchemy.exc import ProgrammingError

db = dataset.connect('sqlite:///cryptotweets.db')


class StreamListener(tweepy.StreamListener):

    def on_status(self, status):
        description = status.user.description
        loc = status.user.location
        text = status.text
        coords = status.coordinates
        geo = status.geo
        name = status.user.screen_name
        user_created = status.user.created_at
        followers = status.user.followers_count
        id_str = status.id_str
        created = status.created_at
        retweets = status.retweet_count
        bg_color = status.user.profile_background_color

        # Sentiment polarity and subjectivity of the tweet text
        blob = TextBlob(text)
        sent = blob.sentiment

        table = db['cryptos']
        try:
            table.insert(dict(
                user_description=description,
                user_location=loc,
                coordinates=coords,
                text=text,
                geo=geo,
                user_name=name,
                user_created=user_created,
                user_followers=followers,
                id_str=id_str,
                created=created,
                retweet_count=retweets,
                user_bg_color=bg_color,
                polarity=sent.polarity,
                subjectivity=sent.subjectivity,
            ))
        except ProgrammingError as err:
            print(err)
        except KeyboardInterrupt:
            print "Bye"
            sys.exit()
        print "Writing tweets to file, CTRL+C to terminate the program"

# The key and secret values are hidden in the documentation for security purposes
auth = tweepy.OAuthHandler(TWITTER_APP_KEY, TWITTER_APP_SECRET)
auth.set_access_token(TWITTER_KEY, TWITTER_SECRET)
api = tweepy.API(auth)

stream_listener = StreamListener()
stream = tweepy.Stream(auth=api.auth, listener=stream_listener)
stream.filter(track=["cryptocurrency", "cryptocurrencies", "bit coin",
                     "ether", "lite coin", "cryptos", "blockchain"])
Output:
After starting the script, it starts pulling tweets containing the mentioned filter words,
until we hit Ctrl+C to terminate it. The following is the output: -
%run "C:\Users\Administrator\Downloads\Project\Deliverables-1\Deliverables\Code and Data\GatherTweets.py"
Writing tweets to file, CTRL+C to terminate the program
Writing tweets to file, CTRL+C to terminate the program
Writing tweets to file, CTRL+C to terminate the program
(one line is printed per captured tweet; pressing CTRL+C raises the
KeyboardInterrupt below inside Tweepy's streaming loop)

C:\Users\Administrator\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\tweepy\streaming.pyc in filter(self, follow, track, async, locations, stall_warnings, languages, encoding, filter_level)
    443             self.session.params = {'delimited': 'length'}
    444         self.host = 'stream.twitter.com'
--> 445         self._start(async)
    446
    447     def sitestream(self, follow, stall_warnings=False,

C:\Users\Administrator\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\tweepy\streaming.pyc in _run(self)
    261                     self.snooze_time = self.snooze_time_step
    262                     self.listener.on_connect()
--> 263                     self._read_loop(resp)
    264             except (Timeout, ssl.SSLError) as exc:
    265                 # This is still necessary, as a SSLError can actually be

(intermediate frames through tweepy, urllib3, httplib, socket and ssl elided)

C:\Users\Administrator\AppData\Local\Enthought\Canopy\edm\envs\User\lib\ssl.pyc in read(self, len, buffer)
    651                 v = self._sslobj.read(len, buffer)
    652             else:
--> 653                 v = self._sslobj.read(len)
    654             return v
    655         except SSLError as x:

KeyboardInterrupt:
The script generates an SQLite database file, as shown in the following figure.
Script Name: SQLiteToCSV.py
Purpose: This script converts the SQLite database file to CSV for easier visualization in
Excel.
Script:
import dataset
from datafreeze.app import freeze

# Read all rows from the cryptos table and export them to CSV
db = dataset.connect('sqlite:///cryptotweets.db')
result = db['cryptos'].all()
freeze(result, format='csv', filename='cryptotweets.csv')
Output:
Following are the contents of the CSV file. The columns of the CSV file are: -
1. Tweet time
2. Tweet text
3. User description
4. Number of followers the user has
5. Location
6. User ID
7. User name
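For reference, the same export can be sketched with only the Python standard library (sqlite3 and csv). The table name 'cryptos' matches this project, while the in-memory database and its two sample rows are illustrative stand-ins for cryptotweets.db:

```python
import csv
import sqlite3

# Sketch: export a SQLite table to CSV using only the standard library.
# An in-memory database with two sample rows stands in for cryptotweets.db.
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE cryptos (created TEXT, text TEXT, user_followers INTEGER)")
conn.executemany("INSERT INTO cryptos VALUES (?, ?, ?)",
                 [("2018-05-01", "bitcoin to the moon", 120),
                  ("2018-05-01", "blockchain is overhyped", 45)])

cur = conn.execute("SELECT * FROM cryptos")
with open('cryptotweets.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cur.description])  # header row from cursor metadata
    writer.writerows(cur)                                 # one CSV row per table row
```

The datafreeze approach used in SQLiteToCSV.py does essentially this, with the dataset library handling the connection and column discovery.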
Python Script Name: cleanupsentiments.ipynb
Purpose: This is the main script of the project, which performs the following actions: -
1. Takes input from the CSV file, cryptotweets, which was created in the previous steps
from real-time Twitter data.
2. Cleans up the tweets. The following text-cleaning operations are
performed: -
a. Remove stop words.
b. Convert data to lower case.
c. Substitute extra whitespace with single space.
d. Remove multiple punctuation marks.
e. Remove hyperlinks and https.
f. Remove back slashes.
g. Additional cleaning.
3. Performs the actual sentiment analysis to generate a polarity for each tweet.
The following polarity classes are generated based on the sentiments: -
a. Positive
b. Neutral
c. Negative
4. Generates a histogram based on the polarity of the tweets.
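The cleaning steps above can be sketched on a single sample tweet. The regexes and the tiny stop-word list here are illustrative simplifications of what the notebook does, not the exact patterns it uses:

```python
import re

# Sketch of the text-cleaning steps listed above, applied to one sample
# tweet. The stop-word list is a tiny illustrative subset, not the full
# list used in the project.
STOP_WORDS = {"the", "is", "a", "to", "and"}

def clean_tweet(tweet):
    tweet = tweet.lower()                          # (b) convert to lower case
    tweet = re.sub(r"http\S+", "", tweet)          # (e) remove hyperlinks
    tweet = tweet.replace("\\", "")                # (f) remove back slashes
    tweet = re.sub(r"([!?.,])\1+", r"\1", tweet)   # (d) collapse repeated punctuation
    words = [w for w in tweet.split() if w not in STOP_WORDS]   # (a) remove stop words
    return re.sub(r"\s+", " ", " ".join(words)).strip()         # (c) single spaces

print(clean_tweet("The price IS going to the moon!!! http://t.co/abc"))
# -> price going moon!
```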
Script:
from matplotlib import style
import matplotlib.pyplot as plt
from textblob import TextBlob
import pandas as pd
import numpy as np
import re

style.use("ggplot")

# file_location and file_name point to the CSV exported in the previous step
twitter_data = pd.read_csv(file_location + "/" + file_name + ".csv",
                           error_bad_lines=False, warn_bad_lines=False)
twitter_data.head()
# We are now working on the latest tweets
Output:
#Step1: Remove stop words and convert the data to lower case
# Create a new column in the tweet data to store the lower-case contents.
# (temp holds the lower-cased, stop-word-filtered tweet text, built in a
# cell not shown in this extract.)
twitter_data['Tweets'] = temp
twitter_data.head()

place_holder = []
for i in twitter_data['Tweets']:
    place_holder.append(i)

place_holder = [re.sub("(http+s?)", '', i) for i in place_holder]      # remove hyperlinks and https
place_holder = [re.sub("[\\\]", '', i) for i in place_holder]          # remove back slashes
place_holder = [re.sub("(u[0-9]+|\\+)", '', i) for i in place_holder]  # remove unicode escapes
place_holder[0:5]
twitter_data['Tweets'] = place_holder
twitter_data.head()
Output:
# Clean tweets
def split_tweet(tweet):
    return ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)",
                           " ", tweet).split())

# Polarity tells whether the tweet is positive, negative or neutral
def quantize_polarity(tweet):
    analysis = TextBlob(split_tweet(tweet))
    if analysis.sentiment.polarity > 0:
        return 1
    elif analysis.sentiment.polarity == 0:
        return 0
    else:
        return -1

len(twitter_data)

fig = plt.figure()
ax1 = fig.add_subplot(1, 1, 1)
fig.set_size_inches(12, 6, forward=True)

def animate(i):
    xar = []
    yar = []
    x = 0
    y = 0
    # (the remainder of the animation callback is not shown in this extract)
Output:
The sentiment analysis resulted in a linear relation between the number of tweets and the
overall sentiments.
tweets_by_sentiments = twitter_data['Sentiment analysis'].value_counts()

fig, ax = plt.subplots()
fig.set_size_inches(13, 7, forward=True)
ax.tick_params(axis='x', labelsize=15)
ax.tick_params(axis='y', labelsize=10)
ax.set_xlabel('Sentiment_class', fontsize=15)
ax.set_ylabel('Number of tweets', fontsize=15)
ax.set_title('Top Sentiments', fontsize=15, fontweight='bold')
tweets_by_sentiments[:].plot(ax=ax, kind='bar', color='blue')
Output:
Python Script Name: cryptocurrencies.ipynb
Purpose: This script performs a trend analysis of three prevalent cryptocurrencies (Bitcoin,
Ether and XRP) based on their share value. We then compare how the share value tracks the
tweet sentiments.
%matplotlib inline
import pandas as pd
from matplotlib import pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
import numpy as np

# Raw string so the backslashes in the Windows path are not treated as escapes
data = pd.read_csv(r'C:\Users\Administrator\Downloads\Project\Deliverables-1\Deliverables\Code and Data\crypto-markets.csv')
data.head()
Output:
data.describe()
Output:
data.shape
Output:
(785024, 13)
data.dtypes
Output:
## Check whether the data contains nulls.
data[data.notnull()].shape
Output:
(785024, 13)
data.info()
data.columns
Output:
data['date']= pd.to_datetime(data.date)
## Segregating the top 3 cryptocurrencies
# .copy() avoids pandas' SettingWithCopyWarning on the later assignments
BTC_data = data[data['symbol'] == 'BTC'].copy()
ETH_data = data[data['symbol'] == 'ETH'].copy()
XRP_data = data[data['symbol'] == 'XRP'].copy()
type(BTC_data)

BTC_data['date'] = pd.to_datetime(BTC_data.date)
ETH_data['date'] = pd.to_datetime(ETH_data.date)
XRP_data['date'] = pd.to_datetime(XRP_data.date)
ETH_data.dtypes
Output:
slug                   object
symbol                 object
name                   object
date           datetime64[ns]
ranknow                 int64
open                  float64
high                  float64
low                   float64
close                 float64
volume                float64
market                float64
close_ratio           float64
spread                float64
dtype: object
# sort_values returns a new frame, so assign the result back
BTC_data = BTC_data.sort_values('date')
ETH_data = ETH_data.sort_values('date')
XRP_data = XRP_data.sort_values('date')
Output:
Output:
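The segregate-and-sort steps above can be sketched on a toy frame standing in for crypto-markets.csv (the symbols match this project; the dates and close prices are illustrative):

```python
import pandas as pd

# Toy stand-in for crypto-markets.csv, reduced to the columns used here.
data = pd.DataFrame({
    'symbol': ['BTC', 'ETH', 'BTC', 'XRP', 'ETH'],
    'date':   ['2018-02-01', '2018-01-01', '2018-01-01', '2018-01-01', '2018-02-01'],
    'close':  [10000.0, 1100.0, 13500.0, 2.1, 850.0],
})
data['date'] = pd.to_datetime(data.date)

# .copy() avoids pandas' SettingWithCopyWarning; sort_values returns a new
# frame, so the result must be assigned back to keep the sorted order.
BTC_data = data[data['symbol'] == 'BTC'].copy().sort_values('date')
print(BTC_data['close'].tolist())  # closes in date order: [13500.0, 10000.0]
```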
Inference
From the cleanupsentiments.ipynb script, it can be inferred from the polarity histogram
that there is a positive response to cryptocurrencies these days. However, the neutral bar
in the histogram is still higher than the positive and negative bars. That suggests people
are still putting their trust in traditional financial institutions, for several reasons,
the primary one being security.
And from the cryptocurrencies.ipynb script, cryptocurrencies, especially Bitcoin, have
seen a constant increase in their share value; despite a marginal decline over the past
year, values are still much higher than in the preceding 10-15 years.
Conclusion and Business Recommendations
From the above inferences, there seems to be an upward trend among people towards
cryptocurrencies and blockchain. Businesses should start investing in proofs of concept
(POC) and infrastructure for blockchain and cryptocurrencies; this will in turn be
fruitful for them in the future.
The same project could be augmented in the future to capture the particulars of people
with positive sentiments, who could then be targeted for customized services. Geo-location
data could be used to identify the countries that show the most positive response towards
cryptocurrencies. The project could also be tweaked to capture which cryptocurrencies are
talked about the most.
References
1. https://www.enthought.com/product/canopy/
2. http://docs.enthought.com/canopy/2.1/index.html
3. https://developer.twitter.com/en/docs/basics/developer-portal/guides/apps.html
4. https://matplotlib.org/tutorials/introductory/pyplot.html
5. https://tweepy.readthedocs.io/en/v3.5.0/
6. https://docs.python.org/2/library/sqlite3.html
7. https://www.learnpython.org/
8. https://www.datacamp.com/
9. The Python Book, Future Publishing Limited
(https://www.myfavouritemagazines.co.uk/tech-and-gadgets-guides-and-specials/the-python-book-6th-edition/)
10. R Data Analysis Projects, Gopi Subramanian
(https://www.packtpub.com/big-data-and-business-intelligence/r-data-analysis-projects)