
SAVITRIBAI PHULE PUNE UNIVERSITY, PUNE

A
MINI PROJECT REPORT ON

“Classification Of Tweets Into Positive And Negative Tweets”

SUBMITTED TO THE SAVITRIBAI PHULE PUNE UNIVERSITY, PUNE IN PARTIAL
FULFILLMENT OF THE REQUIREMENT OF

Data Science and Big Data Analytics, Third Year Computer Engineering

Academic Year 2023-24

BY

Name of Students                Roll No

1. Om Pasalkar                   34
2. Abhijit Jagtap                18
3. Viraj Bodke                   07
4. Akash Hede                    15

Under The Guidance Of

Prof. N. M. Dimble

Department of Computer Engineering
Navsahyadri Group of Institutes, Naigoan, Pune

CERTIFICATE

This is to certify that project entitled,

“Classification Of Tweets Into Positive And Negative Tweets”

Submitted by

Name of Students                Roll No

1. Om Pasalkar                   34
2. Abhijit Jagtap                18
3. Viraj Bodke                   07
4. Akash Hede                    15

is a bonafide work carried out by them under the supervision of Prof. N. M. Dimble and is
approved for the partial fulfillment of the requirement of the Data Science and Big Data
Analytics course in Third Year Computer Engineering, in the academic year 2023-2024, as
prescribed by Savitribai Phule Pune University, Pune.

Prof. N. M. Dimble                          Prof. S. N. Gujar
Guide                                       Head of Department,
                                            Computer Engineering

Place : Pune

Abstract
In the era of social media dominance, Twitter stands out as a significant platform for expressing opinions
and sentiments on various topics. The massive volume of tweets generated daily presents an invaluable
source of data for understanding public sentiment towards diverse subjects. This project focuses on
developing a classification system capable of automatically categorizing tweets into positive and negative
sentiments.

The project involves several key steps. First, a diverse dataset of tweets is collected, covering a wide range
of topics and sentiments. Next, preprocessing techniques are applied to clean the text data, removing noise
and irrelevant information. Feature extraction methods are then employed to represent the textual content of
tweets numerically, enabling machine learning algorithms to process and classify them effectively.

Various machine learning models, including logistic regression, support vector machines, and neural
networks, are trained and evaluated for their effectiveness in classifying tweets into positive and negative
categories. Performance metrics such as accuracy, precision, recall, and F1-score are used to assess the
models' performance and identify the most suitable approach.

The developed classification system holds significant practical implications across different domains. It can
be utilized by businesses to monitor customer sentiment towards their products or services, by policymakers
to gauge public opinion on political issues, and by researchers to analyze trends and attitudes in society.
Moreover, the system can aid in market research, brand sentiment analysis, and customer feedback analysis,
providing valuable insights for decision-making and strategic planning.

Overall, this project aims to contribute to the field of sentiment analysis by developing a robust classification
system capable of accurately categorizing tweets into positive and negative sentiments. Through the
integration of natural language processing techniques and machine learning algorithms, the project
endeavors to harness the power of social media data for understanding and interpreting public sentiment in
real-time.

Introduction

In today's digital age, social media platforms like Twitter have become indispensable sources
for expressing opinions, sentiments, and emotions. With millions of tweets being posted
every day, there's a vast pool of data that can be analyzed to gain insights into public
sentiment towards various topics, products, events, and more. Understanding the sentiment
behind these tweets can be crucial for businesses, policymakers, and researchers alike.

The aim of this mini project is to develop a classification model that can automatically
categorize tweets into positive and negative sentiments. By leveraging natural language
processing (NLP) techniques and machine learning algorithms, we seek to build a system
that can accurately distinguish between tweets expressing positive emotions, such as
happiness, satisfaction, or approval, and tweets conveying negative sentiments, such as
anger, disappointment, or dissatisfaction.

Objectives
Data Collection: Gather a diverse dataset of tweets covering a range of topics and sentiments.

Preprocessing: Clean and preprocess the collected tweets to remove noise, handle special
characters, tokenize the text, and perform other necessary tasks.

Feature Extraction: Extract relevant features from the preprocessed tweet text, such as word
frequencies, n-grams, or embeddings, to represent each tweet numerically.

Model Development: Train and evaluate various machine learning models, such as logistic
regression, support vector machines, or neural networks, to classify tweets into positive and
negative categories (a brief pipeline sketch follows this list).

Performance Evaluation: Assess the performance of the developed classification models
using metrics such as accuracy, precision, recall, and F1-score.

Deployment: Deploy the trained model as a web application or API that allows users to input
a tweet and receive its sentiment classification in real-time.

Future Enhancements: Explore additional techniques for improving the model's performance,
such as fine-tuning hyperparameters, experimenting with different feature representations, or
leveraging deep learning architectures.
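
As a rough illustration of the Feature Extraction, Model Development, and Performance Evaluation objectives, the sketch below trains a TF-IDF plus logistic-regression baseline with scikit-learn. It is only a sketch: the file name tweets_labelled.csv and the columns tweet and label are assumed placeholders, while the actual notebook attached later in this report labels the collected tweets with VADER instead of a pre-labelled file.

    # Minimal baseline sketch; assumed file and column names, not the project's final pipeline
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report

    data = pd.read_csv('tweets_labelled.csv')            # hypothetical: columns 'tweet', 'label'
    X_train, X_test, y_train, y_test = train_test_split(
        data['tweet'], data['label'], test_size=0.2, random_state=42)

    vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=50000)
    X_train_vec = vectorizer.fit_transform(X_train)      # feature extraction (word/bigram TF-IDF)
    X_test_vec = vectorizer.transform(X_test)

    model = LogisticRegression(max_iter=1000)            # model development
    model.fit(X_train_vec, y_train)

    # performance evaluation: accuracy, precision, recall and F1-score per class
    print(classification_report(y_test, model.predict(X_test_vec)))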
Significance
The ability to automatically classify tweets into positive and negative sentiments has several
practical applications:

Brand Sentiment Analysis: Businesses can monitor social media sentiment towards their
products or services, allowing them to quickly identify and address any issues or capitalize
on positive feedback.

Political Opinion Tracking: Researchers and policymakers can analyze public sentiment on
political issues, candidates, or policies, aiding in decision-making and campaign strategies.

Customer Feedback Analysis: Companies can analyze customer feedback on social media
platforms to identify trends, preferences, and areas for improvement in their products or
services.

Market Research: Market analysts can track consumer sentiment towards specific brands,
products, or industry trends, providing valuable insights for investment decisions and market
predictions.


In [45]: import numpy as np   # linear algebra
         import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

In [46]: df = pd.read_csv('data_science.csv')

C:\Users\Suraj\AppData\Local\Temp\ipykernel_7592\3628501462.py:2: DtypeWarning:
Columns (9) have mixed types. Specify dtype option on import or set low_memory=False.
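
The DtypeWarning only means that column 9 (place) mixes types; it can be silenced, if desired, by passing low_memory=False (or an explicit dtype) when reading the file, for example:

    df = pd.read_csv('data_science.csv', low_memory=False)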

In [47]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 241386 entries, 0 to 241385
Data columns (total 36 columns):
 #   Column           Non-Null Count   Dtype
---  ------           --------------   -----
 0   id               241386 non-null  int64
 1   conversation_id  241386 non-null  int64
 2   created_at       241386 non-null  object
 3   date             241386 non-null  object
 4   time             241386 non-null  object
 5   timezone         241386 non-null  int64
 6   user_id          241386 non-null  int64
 7   username         241386 non-null  object
 8   name             241386 non-null  object
 9   place            354 non-null     object
 10  tweet            241386 non-null  object
 11  language         241386 non-null  object
 12  mentions         241386 non-null  object
 13  urls             241386 non-null  object
 14  photos           241386 non-null  object
 15  replies_count    241386 non-null  int64
 16  retweets_count   241386 non-null  int64
 17  likes_count      241386 non-null  int64
 18  hashtags         241386 non-null  object
 19  cashtags         241386 non-null  object
 20  link             241386 non-null  object
 21  retweet          241386 non-null  bool
 22  quote_url        10321 non-null   object
 23  video            241386 non-null  int64
 24  thumbnail        110338 non-null  object
 25  near             0 non-null       float64
 26  geo              0 non-null       float64
 27  source           0 non-null       float64
 28  user_rt_id       0 non-null       float64
 29  user_rt          0 non-null       float64
 30  retweet_id       0 non-null       float64
 31  reply_to         241386 non-null  object
 32  retweet_date     0 non-null       float64
 33  translate        0 non-null       float64
 34  trans_src        0 non-null       float64
 35  trans_dest       0 non-null       float64
dtypes: bool(1), float64(10), int64(8), object(17)
memory usage: 64.7+ MB

In [48]: df['tweet'][10]


Out[48]: 'Trends in #AI for next 5 years, including revenue, applications, and talent (#
INFOGRAPHIC) ——————— #BigData #DataScience #MachineLearning #DeepLearning #Comp
uterVision #NLProc #DataLiteracy #AIStrategy #DigitalTransformation #EdgeAI #Ed
ge #IoT #IIoT #IoTPL #IoTCommunity https://t.co/mn7vFSgyyv'

In [49]: import nltk
         nltk.download('vader_lexicon')
         from nltk.sentiment.vader import SentimentIntensityAnalyzer
         sid = SentimentIntensityAnalyzer()

         import re
         nltk.download('words')
         words = set(nltk.corpus.words.words())

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\Suraj\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\Suraj\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!

In [50]: sentence = df['tweet'][0]
         sid.polarity_scores(sentence)['compound']

Out[50]: -0.1783
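
The compound value is VADER's normalized aggregate score in the range [-1, +1]; scores above zero are treated as positive later in the notebook. As a quick check (assuming the same sid and sentence objects), the full per-class breakdown can be printed with:

    print(sid.polarity_scores(sentence))   # 'neg', 'neu', 'pos' proportions plus the 'compound' score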

In [51]: def cleaner(tweet):
             tweet = re.sub("@[A-Za-z0-9]+", "", tweet)                        # remove @mentions
             tweet = re.sub(r"(?:\@|http?\://|https?\://|www)\S+", "", tweet)  # remove URLs/links
             tweet = " ".join(tweet.split())                                   # collapse whitespace
             tweet = tweet.replace("#", "").replace("_", " ")                  # remove hashtag sign but keep the word
             tweet = " ".join(w for w in nltk.wordpunct_tokenize(tweet)
                              if w.lower() in words or not w.isalpha())        # keep dictionary words and non-alphabetic tokens
             return tweet

         df['tweet_clean'] = df['tweet'].apply(cleaner)
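
A quick illustration of what cleaner() strips out, using a made-up tweet rather than one from the dataset:

    sample = "@someuser I really like this new course! https://t.co/abc123 #DataScience"
    print(cleaner(sample))   # mention, URL and '#' removed; out-of-dictionary tokens are also dropped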


In [52]: # Extra domain-specific words added to the VADER lexicon
         # (the remaining entries are truncated in the source printout)
         word_dict = {'manipulate': -1, 'manipulative': -1, 'jamescharlesiscancelled': -1,
                      'pedophile': -1, 'pedo': -1, 'cancel': -1, 'cancelled': -1,
                      'teamjamescharles': 1, 'liar': -1}

         import nltk
         nltk.download('vader_lexicon')
         from nltk.sentiment.vader import SentimentIntensityAnalyzer

         sid = SentimentIntensityAnalyzer()
         sid.lexicon.update(word_dict)

         list1 = []
         for i in df['tweet_clean']:
             list1.append(sid.polarity_scores(str(i))['compound'])

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\Suraj\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
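
Because the lexicon now contains the custom entries, tweets mentioning those terms are scored accordingly. A hypothetical spot check (the sentence is illustrative only):

    print(sid.polarity_scores("the brand got cancelled")['compound'])   # expected below zero, since 'cancelled' now has weight -1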

In [53]: df['sentiment'] = pd.Series(list1)

         def sentiment_category(sentiment):
             if sentiment > 0:
                 label = 'positive'
             elif sentiment == 0:
                 label = 'neutral'
             else:
                 label = 'negative'
             return label

         df['sentiment_category'] = df['sentiment'].apply(sentiment_category)

In [54]: df = df[['tweet','date','id','sentiment','sentiment_category']]
         df.head()

Out[54]:
                                                tweet        date                   id  sentiment sentiment_category
0   What can be done? - Never blindly trust an ab...  2021-06-20  1406400408545804288    -0.4592           negative
1  "We need a paradigm shift from model-centric t...  2021-06-20  1406390341176016897    -0.3535           negative
2   Using highresolution satellite data and compu...  2021-06-20  1406386311481774083     0.0000            neutral
3   .@Stephenson_Data shares four steps that will...  2021-06-20  1406383545153638402     0.6249           positive
4  "Curricula is inherently brittle in a world wh...  2021-06-20  1406358632648818689     0.2960           positive
localhost:8888/lab/tree/OneDrive/Desktop/Untitled.ipynb 3/7
4/7/24, 9:27 PM Untitled

In [57]: neg = df[df['sentiment_category']=='negative']
         neg = neg.groupby(['date'], as_index=False).count()

         pos = df[df['sentiment_category']=='positive']
         pos = pos.groupby(['date'], as_index=False).count()

         pos = pos[['date','id']]
         neg = neg[['date','id']]

In [60]: import matplotlib.pyplot as plt

         fig, ax = plt.subplots()

         # Plot daily counts of positive and negative tweets
         ax.plot(pos['date'], pos['id'], label='positive', color='green')
         ax.plot(neg['date'], neg['id'], label='negative', color='red')

         # Add legend, labels and title
         ax.legend()
         ax.set_xlabel('Date')
         ax.set_ylabel('Number of tweets')
         ax.set_title('Tweets Over Time')

         # Show plot
         plt.show()


In [62]: pip install wordcloud

Collecting wordcloud
Downloading wordcloud-1.9.3-cp311-cp311-win_amd64.whl.metadata (3.5 kB)
Requirement already satisfied: numpy>=1.6.1 in c:\users\prajw\appdata\local\progr
ams\python\python311\lib\site-packages (from wordcloud) (1.26.1)
Requirement already satisfied: pillow in c:\users\prajw\appdata\local\programs\py
thon\python311\lib\site-packages (from wordcloud) (10.1.0)
Requirement already satisfied: matplotlib in c:\users\prajw\appdata\local\program
s\python\python311\lib\site-packages (from wordcloud) (3.8.0)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\prajw\appdata\local\p
rograms\python\python311\lib\site-packages (from matplotlib->wordcloud) (1.1.1)
Requirement already satisfied: cycler>=0.10 in c:\users\prajw\appdata\local\progr
ams\python\python311\lib\site-packages (from matplotlib->wordcloud) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\prajw\appdata\local
\programs\python\python311\lib\site-packages (from matplotlib->wordcloud) (4.43.
1)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\prajw\appdata\local
\programs\python\python311\lib\site-packages (from matplotlib->wordcloud) (1.4.5)
Requirement already satisfied: packaging>=20.0 in c:\users\prajw\appdata\local\pr
ograms\python\python311\lib\site-packages (from matplotlib->wordcloud) (23.2)
Requirement already satisfied: pyparsing>=2.3.1 in c:\users\prajw\appdata\local\p
rograms\python\python311\lib\site-packages (from matplotlib->wordcloud) (3.1.1)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\prajw\appdata\loc
al\programs\python\python311\lib\site-packages (from matplotlib->wordcloud) (2.8.
2)
Requirement already satisfied: six>=1.5 in c:\users\prajw\appdata\local\programs
\python\python311\lib\site-packages (from python-dateutil>=2.7->matplotlib->wordc
loud) (1.16.0)
Downloading wordcloud-1.9.3-cp311-cp311-win_amd64.whl (300 kB)
Installing collected packages: wordcloud

Successfully installed wordcloud-1.9.3
Note: you may need to restart the kernel to use updated packages.

In [63]: import matplotlib.pyplot as plt
         from wordcloud import WordCloud

         df2 = df[(df['date'] >= '2019-05-11') & (df['date'] <= '2019-05-14')]
         positive = df2[df2['sentiment_category'] == 'positive']

         # Assumed completion: generate the cloud from the positive tweets
         # (the WordCloud call is cut off in the original printout)
         wordcloud = WordCloud(max_font_size=50, max_words=500,
                               background_color="white").generate(" ".join(positive['tweet']))
         plt.figure()
         plt.imshow(wordcloud, interpolation="bilinear")
         plt.axis("off")
         plt.show()

In [64]: print(df[df['sentiment_category']=='positive'])

                                                     tweet        date  \
3       .@Stephenson_Data shares four steps that will ...  2021-06-20
4       "Curricula is inherently brittle in a world wh...  2021-06-20
6       @LinkLabsInc @IoTchannel Wow! Wonderful!! Cong...  2021-06-20
9        Demystifying #AI with 10 top applications: ht...  2021-06-20
10      Trends in #AI for next 5 years, including reve...  2021-06-20
...                                                   ...         ...
241370  Four short links: 15 January 2010 - Best Scien...  2010-01-15
241375   Anti-science disinformers to media: Please ma...  2010-01-13
241377  @Sheril_ I'd love to see some empirical data o...  2010-01-12
241380   Top nations in computer science: http://bit.l...  2010-01-10
241382  RT @filiber: Have a Computer Science backgroun...  2010-01-06

                         id  sentiment sentiment_category
3       1406383545153638402     0.6249           positive
4       1406358632648818689     0.2960           positive


6       1406344023254634499     0.9036           positive
9       1406334476905500679     0.2023           positive
10      1406333930551324673     0.4215           positive
...                     ...        ...                ...
241370           7794185676     0.6369           positive
241375           7707597565     0.4215           positive
241377           7671245065     0.6369           positive
241380           7590323198     0.3182           positive
241382           7445162404     0.6767           positive

[113285 rows x 5 columns]

In [65]: print(df[df['sentiment_category']=='negative'])

                                                     tweet        date  \
0        What can be done? - Never blindly trust an ab...  2021-06-20
1       "We need a paradigm shift from model-centric t...  2021-06-20
5       Many common colour maps distort data through u...  2021-06-20
19      ApolloScape (world’s largest open-source datas...  2021-06-20
36      Disruption defines our world, and the latest h...  2021-06-19
...                                                   ...         ...
241355  @DanaKCTV5 We think Phil now studies weather d...  2010-02-02
241366  @GrahamHill And to be really consequent: not o...  2010-01-21
241371  @andrewbarnett you could, note that iphones mo...  2010-01-15
241373  CARPE DIEM BLOG: "Structural Barriers" Discour...  2010-01-14
241384  All in the....data RT @noahWG Dr. Petra provid...  2010-01-05

                         id  sentiment sentiment_category
0       1406400408545804288    -0.4592           negative
1       1406390341176016897    -0.3535           negative
5       1406350577756524555    -0.0772           negative
19      1406332752815869955    -0.4215           negative
36      1406312471531601920    -0.7650           negative
...                     ...        ...                ...
241355           8540493580    -0.4019           negative
241366           8020770355    -0.3612           negative
241371           7764817738    -0.5043           negative
241373           7748404739    -0.4215           negative
241384           7376226272    -0.2960           negative

[23782 rows x 5 columns]
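
Of the 241,386 tweets, about 113,285 (roughly 47%) are labelled positive and 23,782 (roughly 10%) negative, leaving around 104,000 (about 43%) neutral. The split can be confirmed directly (assuming the same df) with:

    print(df['sentiment_category'].value_counts(normalize=True).round(3))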


Conclusion:
This mini project aims to develop a robust classification model for categorizing tweets into positive and
negative sentiments, with the potential to offer valuable insights across various domains. By leveraging NLP
techniques and machine learning algorithms, we seek to contribute to the growing field of sentiment analysis
and its applications in real-world scenarios.

