Professional Documents
Culture Documents
Sentiment Analysis - Twitter Data
Sentiment Analysis - Twitter Data
Note:
From the machine learning point of view, raw text is useless. Only if we manage to transform
it into meaningful numbers, can we feed it into our machine-learning algorithms such as clustering.
The same is true for more mundane operations on text,
such as similarity measurement
This project can pull data from Tweeter but to do that you need to request for your own API keys
specified below (I removed mine):
my_api_key = "xxxxxxxxx"
my_api_secret = "yyyyyyy"
If you don't have API keys already, you may use "Raw Data" which i pulled from tweeter using:
You can specifiy amount of tweets you want to pull. Here I pulled 100
import numpy as np
import re
import nltk
import matplotlib
%matplotlib inline
sns.set(style="white",color_codes=True)
sns.set(font_scale=1.5)
import tweepy as tw
import warnings
warnings.filterwarnings('ignore')
matplotlib_axes_logger.setLevel('ERROR')
my_api_secret = "xxxxxxxxxxxxxxx"
# authenticate
#tweets = tw.Cursor(api.search_tweets,q=search_query,lang="en",since="2015-09-16").item
tweets = tw.Cursor(api.search_tweets,q=search_query,lang="en").items(50)
tweets_copy = []
tweets_copy.append(tweet)
data= pd.DataFrame()
hashtags = []
try:
hashtags.append(hashtag["text"])
except:
pass
'user_location': tweet.user.location,
'user_description': tweet.user.description,
'user_verified': tweet.user.verified,
'date': tweet.created_at,
'text': text,
'language': tweet.lang,
'favourites-count': tweet.favorite_count,
'author': tweet.user.screen_name,
'retweet-count': tweet.retweet_count,
In [6]:
# Run this cell if you have API key inputed in cell one above
In [5]:
#save raw data extrated in local drive
# if you didn't run first cell, you need to remove this cell.
data.to_csv("Raw Data.csv")
#remove hashtags
# Remove URLS
stop_words = stopwords.words('english')
def clean_text(text):
text = str(text).lower()
return text
In [6]:
wml = WordNetLemmatizer()
tokens = wml.lemmatize(word)
lemma_words.append(tokens)
In [7]:
# Now we have cleaned data for three features: user_description, text, and user_name
pd.DataFrame(data).head()
Out[7]:
user_name ID user_location user_description user_verified
In [8]:
data.to_csv("Clean Data.csv")
data
Out[8]:
user_name ID user_location user_description user_verified
202
0 rohit sharma 1486515814840766464 India manmauji False
01:46:0
🇮🇳🎠🎠⚔⚔
🇮🇳a lone kaffir bio doesnt speak but tweets 202
0 1486513298996482049 In Patriots ♥ False
army®🇮🇳⚔⚔ believe in 🕉 not l... 01:36:0
🎠🎠🇮🇳
202
0 javeed shariff 1486509959281926148 javeed shariff False
01:22:4
For example,- Words like 'love,' 'enjoy,' 'happy,' 'like' all convey a positive sentiment. Also, VADER is
intelligent enough to understand these words' basic context, such as "did not love" as a negative
statement. It also understands the emphasis of capitalization and punctuation, such as "ENJOY."
In [9]:
## Added "Sentiment" column and categorized in positive, negative and neutral
In [10]:
sid = SIA()
In [11]:
# drop sentiments column... not needed
data.drop(columns=['Sentiments'],inplace=True)
data.head()
Out[11]:
user_name ID user_location user_description user_verified
In [12]:
#Number of Words
Out[12]:
user_name ID user_location user_description user_verified
In [13]:
# WordCloud using atual clean data
#plt.axis('off')
#plt.show
Sentimental Analysis
In starting with the analysis we will create the new columns namely Polarity and Subjectivity and
acquire the very values of each comment. Polarity ranges from -1 to 1 and measures how positive or
negative a comment is. It simply means emotions expressed in a sentence. Subjectivity expresses
some personal feelings, views, or beliefs. A subjective sentence may not express any sentiment.
In [14]:
# get subjectivity
def getSubjectivity(txt):
return TextBlob(txt).sentiment.subjectivity
# get polarity
def getPolarity(txt):
return TextBlob(txt).sentiment.polarity
#Columns
data['Subjectivity'] = data['text'].apply(getSubjectivity)
data['Polarity'] = data['text'].apply(getPolarity)
data.head()
Out[14]:
user_name ID user_location user_description user_verified
In [15]:
# function to compute analysis
def getAnalysis(score):
if score < 0 :
return 'Negative'
elif score == 0:
return 'Neutral'
else:
return 'Positive'
data['Analysis'] = data['Polarity'].apply(getAnalysis)
In [16]:
data.head()
Out[16]:
user_name ID user_location user_description user_verified
5 rows × 21 columns
In [17]:
# % Percentages:
pcomments = pcomments['text']
ncomments = ncomments['text']
nucomments = nucomments['text']
Positive: 25.5%
Negative: 11.0%
Nuetral: 63.5%
In [18]:
# the below function will create a word cloud
])
width=2500,
height=2000
).generate(cleaned_word)
plt.figure(1,figsize=(5, 7))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
In [19]:
wordcloud_draw(data.text, 'black')
In [20]:
print("Positive words are", pcomments.count())
wordcloud_draw(pcomments, 'black')
In [21]:
print("Negative words are", ncomments.count())
wordcloud_draw(ncomments)
In [22]:
print("Neutral words are", nucomments.count())
wordcloud_draw(nucomments, 'black')
In [23]:
# Value Count
data['Analysis'].value_counts
plt.title('Sentiment Analysis')
plt.xlabel('Sentiment')
plt.ylabel('Counts')
data['Analysis'].value_counts().plot(kind= 'bar')
plt.show()
user_name 0
Out[24]:
ID 0
user_location 0
user_description 0
user_verified 0
date 0
text 0
language 0
favourites-count 0
author 0
retweet-count 0
hashtags 174
source 0
Positive Sentiment 0
Neutral Sentiment 0
Negative Sentiment 0
Number of Words 0
Subjectivity 0
Analysis 0
dtype: int64
In [25]:
data.shape
(200, 21)
Out[25]:
In [26]:
data.dropna(inplace=True)
data.isnull().sum()
user_name 0
Out[26]:
ID 0
user_location 0
user_description 0
user_verified 0
date 0
text 0
language 0
favourites-count 0
author 0
retweet-count 0
hashtags 0
source 0
Positive Sentiment 0
Neutral Sentiment 0
Negative Sentiment 0
Number of Words 0
Subjectivity 0
Polarity 0
Analysis 0
dtype: int64
In [27]:
data.shape
(26, 21)
Out[27]:
In [28]:
data.columns
Out[28]:
'date', 'text', 'language', 'favourites-count', 'author',
dtype='object')
In [29]:
# drop irrelevant data
In [30]:
# check data types and encode object type
data.dtypes
user_location object
Out[30]:
user_description object
user_verified bool
text object
favourites-count int64
retweet-count int64
source object
Subjectivity float64
Polarity float64
Analysis object
dtype: object
In [31]:
enco = LabelEncoder()
data['user_location'] = enco.fit_transform(data['user_location'])
data['user_description'] = enco.fit_transform(data['user_description'])
data['user_verified'] = enco.fit_transform(data['user_verified'])
data['text'] = enco.fit_transform(data['text'])
data['date'] = enco.fit_transform(data['date'])
data['source'] = enco.fit_transform(data['source'])
data['Analysis'] = enco.fit_transform(data['Analysis'])
In [32]:
data.head()
0 3 7 0 25 10 0 0 3 0.0
0 8 3 1 24 8 0 3 3 0.0
0 0 19 0 23 8 0 3 1 0.0
0 2 16 1 22 11 20 3 0 0.0
0 0 2 0 21 4 0 0 2 0.0
In [33]:
X = data.drop(["Analysis"], axis=1)
y= data.Analysis
In [34]:
x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=0.10,random_state=1)
In [35]:
#Feature Scaling/Standardize (not important step but it boost accuracy)
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
In [36]:
print (x_train.shape, y_train.shape)
In [37]:
# use another model to confirm the accuracy
In [38]:
# Apply model and check error- eg linear reg
linreg=LinearRegression()
linreg.fit(x_train,y_train)
y_predict = linreg.predict(x_test)
print()
print("mse is:{:.2f}".format(sqrt(mean_squared_error(y_test,y_predict))))
print()
print()
print("Coefficient:",linreg.coef_)
mse is:0.03
In [39]:
classifier = LogisticRegression(random_state = 0)
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)
accuracy = metrics.accuracy_score(y_test,y_pred)
print("Accuracy: {:.2f}%".format(accuracy*100))
Accuracy: 100.00%
In [40]:
from sklearn.metrics import confusion_matrix, accuracy_score
normalize=False,
title='Confusion matrix',
cmap=plt.cm.Blues):
"""
"""
plt.colorbar()
tick_marks = np.arange(len(classes))
plt.yticks(tick_marks, classes)
if normalize:
else:
print(cm)
thresh = cm.max() / 2.
horizontalalignment="center",
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
In [41]:
import itertools
plt.figure(figsize=(7,5))
print()
print("Accuracy: {:.2f}%".format(accuracy*100))
[[1 0]
[0 2]]
Accuracy: 100.00%
In [42]: plt.figure(figsize=(20,7))
<AxesSubplot:>
Out[42]:
In [43]:
print(classification_report(y_test,y_pred))
accuracy 1.00 3