
Twitter Sentiment Analysis

Submitted by

Group no – 13
Batch – F
1. Shiv Datt Mishra (Dr. APJ Abdul Kalam UIT Jabua)
2. Ishant Sharma (Jabalpur Engineering College)
3. Bhupesh Watti (Ujjain Engineering College)
4. Praval Sharma (Ujjain Engineering College)
5. Poornima Bhumarkar (Jabalpur Engineering College)

ABSTRACT

In the era of social media dominance, Twitter stands out as a significant platform
for real-time expression and opinion sharing. Understanding the sentiments
expressed on Twitter holds immense value for various sectors, including
marketing, politics, and public opinion analysis. This project focuses on
harnessing the power of Natural Language Processing (NLP) techniques to
analyze sentiments expressed on Twitter. Leveraging machine learning
algorithms, sentiment analysis models are trained to classify tweets into positive,
negative, or neutral categories. Additionally, advanced techniques such as topic
modeling and sentiment trend analysis are employed to extract deeper insights
from the data. The project aims to provide a comprehensive understanding of the
sentiments prevailing on Twitter, enabling stakeholders to make informed
decisions based on the public's opinions and emotions.

TABLE OF CONTENTS

ABSTRACT
TABLE OF CONTENTS
TABLE OF FIGURES
1. INTRODUCTION
   1.1 Objective
   1.2 Data Utilization
   1.3 Algorithmic Approach
   1.4 Impact and Outcome
2. LITERATURE SURVEY
3. PROJECT PLANNING AND MANAGEMENT
   3.1 Roles
   3.2 Collaborative Efforts
   3.3 Challenges
   3.4 Project Milestones
   3.5 Conclusion
4. METHODOLOGY
   4.1 Data Collection
   4.2 Data Preprocessing
   4.3 Feature Engineering
   4.4 Model Selection and Training
   4.5 Model Evaluation
   4.6 Model Deployment
   4.7 Documentation and Reporting
5. SYSTEM DESIGN
   5.1 Naive Bayes
   5.2 Logistic Regression
   5.3 Decision Tree
   5.4 Random Forest
   5.5 XGBoost (Extreme Gradient Boosting)
6. CODE AND TESTING
   6.1 Basic checks
   6.2 Data Collection
   6.3 Exploratory Data Analysis (EDA)
   6.4 Data Preprocessing
       Categorical Encoding
   6.5 Training and Evaluation
       6.5.1 Model Creation with Naive Bayes
             Model Evaluation
       6.5.2 Model Creation with the Logistic Regression
             Model Evaluation with the Logistic Regression
       6.5.3 Model Creation using Decision Tree Classifier
             Model Evaluation Using Decision Tree Classifier
       6.5.4 Model Creation using Random Forest Classifier
             Model Evaluation of Random Forest Classifier
       6.5.5 Model Creation with XGBoost
             Model Evaluation using XGBoost
       Model Comparison Report
   6.6 Model Deployment Code Explanation
7. RESULTS AND DISCUSSION
8. CONCLUSION

TABLE OF FIGURES

FIGURE 1. Heat map of airline vs airline sentiment
FIGURE 2. Histogram plot of count vs Length
FIGURE 3. Pie chart of airline sentiment
FIGURE 4. Model comparison bar plot
FIGURE 5. Deployment screenshot
CHAPTER 1

INTRODUCTION

As the internet continues to expand, social media and microblogging platforms like Twitter
play a pivotal role in disseminating news and trending topics globally at an unprecedented pace.
Trending topics emerge as a result of extensive user engagement, making them valuable
sources of online perception. These topics span a wide range, from spreading awareness to
promoting public figures, political campaigns, product endorsements, and entertainment events
like movies and award shows. Businesses leverage user feedback from social media to enhance
their products and services, driving improvements in marketing strategies.

Sentiment analysis, a vital component of understanding online discourse, predicts emotions
within text, allowing for a deeper comprehension of attitudes and opinions expressed online. It
categorizes conversations into positive, negative, or neutral labels, offering insights into wider
public sentiment. Social media platforms serve as hubs for individuals to network, stay
informed about news and events, and express their opinions. This wealth of user-generated
content fuels discussions and evaluations, influencing perceptions about various products and
services.

Twitter, characterized by its 140-character limit per tweet, generates a staggering volume of
content, with approximately 6500 tweets published per second. Analyzing this vast stream of
tweets presents challenges due to their noisy and unstructured nature, encompassing diverse
topics and changing attitudes. Sentiment analysis on Twitter involves leveraging natural
language processing techniques to extract and characterize sentiment content, addressing
coarse and fine-level analysis of sentiments.

However, analyzing tweets is not without its challenges. The informal nature of tweets, coupled
with the use of slang, acronyms, and grammatical irregularities, poses difficulties for natural
language processors.
The remainder of this project report outlines related work, details the technologies employed,
elaborates on the methodology and implementation, addresses encountered challenges,
discusses future work, and concludes with a summary of findings and implications.

1.1 Objective:
Our task involves developing a robust sentiment analysis system for Twitter data. The
primary goal is to accurately classify tweets into positive, negative, or neutral categories,
thereby deciphering the emotional nuances and underlying sentiments expressed in tweets.
This endeavor aims to provide comprehensive insights into the prevailing sentiments within
the Twitter community.

1.2 Data Utilization


To achieve this objective, we utilize the airline sentiment dataset, which covers major US-based
airlines. This dataset comprises a diverse range of tweets reflecting customer opinions and
experiences with various airline services. By incorporating this specific dataset, we tailor our
sentiment analysis model to the context of airline-related discussions on Twitter, enabling more
accurate classification and deeper insights into customer sentiment towards specific airlines.

1.3 Algorithmic Approach


We implement several machine learning algorithms, including logistic
regression, Naive Bayes, decision tree classifier, random forest, and XGBoost. By employing
these advanced techniques, we aim to extract meaningful patterns and sentiments from the data.
These algorithms enable us to analyze and classify tweets effectively, empowering stakeholders
to make informed decisions, engage effectively with users, and gain valuable insights into
market trends and brand perceptions.

1.4 Impact and Outcome:


The ultimate goal of this project is to provide a comprehensive sentiment analysis framework
that not only accurately classifies tweets but also offers actionable insights for businesses. By
proactively addressing customer concerns, improving service quality, and enhancing overall
customer satisfaction, stakeholders can better understand and respond to customer sentiments
effectively. This framework serves as a valuable tool for businesses to navigate the dynamic
landscape of social media, enabling them to make data-driven decisions and cultivate
positive interactions with their audience.

CHAPTER 2

LITERATURE SURVEY

Sentiment analysis on Twitter data has garnered significant attention in recent years due to the
platform's popularity and its role as a rich source of real-time user opinions and emotions.
Existing literature provides a comprehensive overview of various techniques, challenges, and
advancements in this field.

1. Challenges in Twitter Sentiment Analysis:

Researchers have identified several challenges unique to sentiment analysis on Twitter data.
These include the brevity of tweets, noisy and informal language, sarcasm, ambiguity, and the
dynamic nature of language trends. Papers like "Twitter Sentiment Analysis: The Good the
Bad and the OMG!" by Kouloumpis, Wilson, and Moore (2011) and "Twitter as a Corpus for
Sentiment Analysis and Opinion Mining" by Pak and Paroubek (2010) delve into these
challenges and propose strategies to address them.

2. Techniques and Approaches:

Literature offers a wide range of techniques for sentiment analysis on Twitter data, including
lexicon-based methods, machine learning algorithms, deep learning models, and hybrid
approaches. Papers such as "Sentiment Analysis of Twitter Data" by Agarwal et al. (2011) and
"Combining Lexicon-based and Learning-based Methods for Twitter Sentiment Analysis" by
Zhang et al. (2011) review these techniques and discuss their advantages and limitations.

3. Machine Learning and Deep Learning Models:

Traditional machine learning algorithms like naive Bayes, logistic regression, decision trees,
and ensemble methods have been widely applied to sentiment analysis on Twitter data.
Additionally, deep learning models such as recurrent neural networks (RNNs), convolutional
neural networks (CNNs), and transformer-based architectures like BERT have shown
promising results. "Deep Learning for Sentiment Analysis: A Survey" by Zhang et al. (2018)
provides an in-depth overview of deep learning techniques for sentiment analysis tasks.

4. Datasets and Tools:

Various datasets and tools have been developed to facilitate research in Twitter sentiment
analysis. The Sentiment140 dataset, introduced by Go et al. (2009), is one of the most widely
used datasets for training and evaluating sentiment analysis models on Twitter data.
Additionally, tools like the Sentiment140 platform and other sentiment analysis APIs provide
researchers with resources for collecting and analyzing sentiment-labeled tweets.

CHAPTER 3

PROJECT PLANNING AND MANAGEMENT


3.1 Roles
The project team consists of five members, each assigned specific roles and responsibilities
under the leadership of Shiv Datt Mishra.

• Shiv Datt Mishra:

As the project head, Shiv Datt oversees the overall project planning and management. His
responsibilities include coordinating team activities, setting project goals, and ensuring timely
progress. Shiv Datt is also actively involved in implementing machine learning models for
sentiment analysis.

• Ishant Sharma:

Ishant is responsible for providing the dataset essential for sentiment analysis. Additionally, he
takes charge of creating presentations and revising project documentation to ensure clarity and
accuracy.

• Praval Sharma:

Praval's role focuses on data preprocessing, including cleaning, transforming, and organizing
the dataset to prepare it for analysis. His meticulous attention to detail ensures the integrity and
quality of the data used for sentiment analysis.

• Bhupesh Watti:

Bhupesh is tasked with defining the project scope and requirements. He works on structuring
the data and crafting detailed reports to document project progress and findings. Bhupesh's
thorough analysis contributes to a comprehensive understanding of the project's objectives and
outcomes.

• Poornima Bhumarkar:

Poornima's role involves gathering information related to the project for further enhancement.
She conducts research on sentiment analysis methodologies, explores new techniques, and
identifies opportunities for improvement. Poornima's insights contribute to the project's
innovation and advancement.

3.2 Collaborative Efforts:


While each team member has designated responsibilities, collaboration is key to the project's
success. Shiv and Ishant collaborate on implementing machine learning models, leveraging
their combined expertise to develop robust sentiment analysis algorithms. Regular
communication and coordination ensure that all team members are aligned with project
objectives and timelines.

3.3 Challenges:
1. Monetization of Twitter API: One of the primary challenges faced by the team is the
monetization of the Twitter API. Due to this restriction, accessing real-time Twitter data
becomes costly, limiting the team's ability to gather live data for sentiment analysis.

2. Dataset Limitations: Although the team initially planned to collect data using the Twitter
API, they encountered limitations in obtaining a comprehensive dataset. As a result, they had
to rely solely on the available airline sentiment dataset, which may not encompass all relevant
tweets or provide real-time insights.

3. Inability to Utilize YouTube Data: While considering alternative data sources, the team
explored the possibility of leveraging YouTube comments for sentiment analysis. However,
they found that comments vary significantly across different channels, making
it difficult to create a cohesive dataset.

3.4 Project Milestones:


The project timeline is divided into key milestones, including data collection, preprocessing,
model development, analysis, and reporting. Regular progress meetings are scheduled to
review milestones, address any challenges, and adjust plans as needed to ensure project
success.

3.5 Conclusion:
Through effective planning and management, the project team leverages individual strengths
and expertise to achieve collective goals. By adhering to established roles and responsibilities,
maintaining open communication, and fostering collaboration, the team works cohesively
towards delivering a high-quality Twitter sentiment analysis solution.

CHAPTER 4

METHODOLOGY

We employed a systematic approach to develop our sentiment analysis system, involving the
following steps:

4.1 Data Collection:


Utilize the Twitter API to gather tweets related to the specified topic or domain, focusing on
sentiment analysis of airline-related discussions.

Alternatively, access pre-existing datasets, such as the airline sentiment dataset covering
US-based airlines, to obtain relevant data for analysis.

4.2 Data Preprocessing:


Perform data cleaning to remove duplicates, handle missing values, and standardize formats to
ensure data integrity.

Apply text normalization techniques, including lowercase conversion, punctuation removal,
and stop word removal, to standardize text for analysis.

4.3 Feature Engineering:


Extract features from the preprocessed text data using techniques such as tokenization,
vectorization, and TF-IDF (Term Frequency-Inverse Document Frequency) representation.

Explore additional feature extraction methods, such as sentiment lexicons or word embeddings,
to capture semantic information and enhance classification accuracy.

4.4 Model Selection and Training:


Evaluate and select appropriate classification algorithms for sentiment analysis, including
logistic regression, Naive Bayes, decision trees, random forests, and XGBoost.

Split the dataset into training and testing sets, typically using a 70-30 or 80-20 split, to train
and evaluate the performance of selected models.

Train the selected models on the training set and fine-tune hyperparameters using techniques
such as cross-validation to optimize model performance.
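
As an illustration of this step, a minimal cross-validated grid search sketch is shown below;
the parameter grid and scoring choice are illustrative assumptions rather than the project's
actual settings, and x_train/y_train stand for the vectorized training data built in Chapter 6.

# A minimal hyperparameter-tuning sketch with 5-fold cross-validation (illustrative values).
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.01, 0.1, 1, 10]}  # candidate regularization strengths (assumed grid)
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring='f1_macro')
grid.fit(x_train, y_train)
print(grid.best_params_, grid.best_score_)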

4.5 Model Evaluation:


Evaluate the trained models using performance metrics such as accuracy, precision, recall,
F1-score, and ROC-AUC (Receiver Operating Characteristic - Area Under the Curve) to assess
classification performance.

Compare the performance of different models to identify the most effective approach for
sentiment analysis on Twitter data.
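
As a sketch, these metrics can be computed for any fitted classifier as follows; 'model' is a
placeholder for one of the classifiers trained in Chapter 6, and the one-vs-rest scheme handles
ROC-AUC for the three sentiment classes.

# Illustrative evaluation sketch; 'model' stands for any fitted classifier with predict_proba.
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_pred = model.predict(x_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Macro F1:', f1_score(y_test, y_pred, average='macro'))
# One-vs-rest ROC-AUC over the three classes, computed from predicted probabilities
print('ROC-AUC :', roc_auc_score(y_test, model.predict_proba(x_test), multi_class='ovr'))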

4.6 Model Deployment:


The machine learning model has been successfully deployed using Streamlit, enabling
on-the-fly sentiment analysis for individual comments. Through a user-friendly interface, users
can input a single comment, and the deployed model swiftly generates the sentiment
associated with it.

4.7 Documentation and Reporting:


Document the methodology, including data collection procedures, preprocessing steps, feature
engineering techniques, model selection, and evaluation criteria.

Prepare comprehensive reports summarizing findings, insights, and recommendations based
on the sentiment analysis of Twitter data, facilitating informed decision-making for
stakeholders.

CHAPTER 5

SYSTEM DESIGN

The machine learning algorithms used are as follows:

5.1 Naive Bayes:


• Naive Bayes is a probabilistic classification algorithm based on Bayes' theorem with an
assumption of independence among features.

• It's particularly popular for text classification tasks, including sentiment analysis, spam
detection, and document categorization.

• Naive Bayes calculates the probability of each class given a set of input features and selects
the class with the highest probability as the predicted class.

• Despite its simplicity and the "naive" assumption of feature independence, Naive Bayes often
performs well in practice, especially for large datasets with high dimensionality.
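
In symbols, for a class c and features x_1, ..., x_n, the Naive Bayes decision rule is

    \hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)

where the independence assumption lets the joint likelihood factor into the per-feature terms P(x_i | c).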

5.2 Logistic Regression:


• Logistic Regression is a linear classification algorithm used for binary classification tasks.

• It models the probability that an instance belongs to a particular class using the logistic
function, also known as the sigmoid function.

• Logistic Regression estimates the coefficients for each feature, which represent the impact of
the feature on the log-odds of the target class.

• During prediction, Logistic Regression calculates the probability of the positive class and
assigns the instance to the positive class if the probability exceeds a threshold (typically 0.5).

• Logistic Regression is widely used due to its simplicity, interpretability, and efficiency,
especially for problems with linear decision boundaries.
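
In symbols, the model estimates

    P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^{\top}\mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^{\top}\mathbf{x} + b)}}

where w holds the learned coefficients and b the intercept; the instance is assigned to the
positive class when this probability exceeds the chosen threshold.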

5.3 Decision Tree:


• Decision Tree is a versatile and interpretable classification algorithm that recursively
partitions the input space into regions, each associated with a class label.

• It makes decisions based on a sequence of binary splits along the features, aiming to maximize
the information gain or minimize impurity at each node.

• Decision Trees can handle both categorical and numerical features and can capture complex
nonlinear relationships between features and the target variable.

• However, Decision Trees are prone to overfitting, especially when trained on noisy or high-
dimensional data. Techniques like pruning, limiting tree depth, and ensemble methods (e.g.,
Random Forest) are commonly used to mitigate overfitting.
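
Two standard impurity measures behind these splits, for a node t with class proportions p_k, are

    \text{Gini}(t) = 1 - \sum_{k} p_k^{2} \qquad \text{and} \qquad H(t) = -\sum_{k} p_k \log_2 p_k

and a split is chosen to maximize the impurity reduction (information gain) between the parent
node and the weighted average of its children.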

5.4 Random Forest:


• Random Forest is an ensemble learning technique that constructs multiple Decision Trees and
combines their predictions to improve performance and robustness.

• Each Decision Tree in the Random Forest is trained on a bootstrapped sample of the dataset
and may use a random subset of features for splitting at each node.

• During prediction, the Random Forest aggregates the predictions of individual trees through
voting (for classification) or averaging (for regression), resulting in a more stable and accurate
prediction.

• Random Forests are highly effective for a wide range of classification and regression tasks,
offering improved generalization performance and resilience to overfitting compared to
individual Decision Trees.
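
For classification, the ensemble prediction over B trees h_1, ..., h_B is the majority vote

    \hat{y}(\mathbf{x}) = \operatorname{mode}\{\, h_1(\mathbf{x}), \ldots, h_B(\mathbf{x}) \,\}

so individual trees' errors tend to cancel out as long as the trees are reasonably decorrelated.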

5.5 XGBoost (Extreme Gradient Boosting):


• XGBoost is an advanced implementation of gradient boosting, a machine learning technique
that builds an ensemble of weak learners (typically decision trees) in a sequential manner.

• It optimizes a differentiable loss function by adding new trees that minimize the loss of the
combined ensemble, employing gradient descent techniques.

• XGBoost introduces several innovations, such as regularization techniques, tree pruning, and
parallelized computation, to improve training speed, accuracy, and scalability.

• XGBoost is renowned for its exceptional performance in various machine learning
competitions and is widely used in practice for classification, regression, and ranking tasks,
particularly when dealing with structured/tabular data.
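
In symbols, XGBoost minimizes a regularized objective over K additive trees f_k:

    \mathcal{L} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k), \qquad \Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^{2}

where l is the training loss, T is the number of leaves of a tree, w its leaf weights, and γ, λ
are regularization parameters.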

CHAPTER 6

CODE AND TESTING


6.1 Basic checks
# Importing Essential Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
from PIL import Image
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

Libraries and Features:

1. NumPy (np) - Efficient multidimensional arrays and array operations for numerical
computing.
2. Pandas (pd) - High-performance data structures (Series, DataFrames) for data
manipulation, analysis, and visualization.
3. Matplotlib.pyplot (plt) - Creates various plots and charts for data visualization.
4. WordCloud - Generates visually appealing word clouds from text data, highlighting
high-frequency words.
5. Seaborn (sns) - Statistical data visualization library built on top of matplotlib, offering
a stylish aesthetic.
6. CountVectorizer (from sklearn.feature_extraction.text) - Transforms text data into
numerical features by counting word frequencies in documents.
7. TfidfVectorizer (from sklearn.feature_extraction.text) - Transforms text data into
numerical features by weighting word counts with term frequency-inverse document
frequency (TF-IDF).
6.2 Data Collection

We collected Twitter data from the provided dataset named 'twitter_x_y_train.csv',
containing text data and corresponding sentiment labels (positive, negative, or neutral).
# Load the dataset
train_df=pd.read_csv('twitter_x_y_train.csv')
# This code reads a CSV file named 'twitter_x_y_train.csv' into a pandas DataFrame called 'train_df'.

# To show top 3 rows


train_df.head(3)

# @title airline_sentiment vs airline
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd

plt.subplots(figsize=(8, 8))
df_2dhist = pd.DataFrame({
    x_label: grp['airline'].value_counts()
    for x_label, grp in train_df.groupby('airline_sentiment')
})
sns.heatmap(df_2dhist/100, cmap='viridis', annot=True)
plt.xlabel('airline_sentiment')
_ = plt.ylabel('airline')

Figure 1. Heat map of airline vs airline sentiment

This code generates a heatmap comparing the frequency of each airline based on sentiment.
• 'df_2dhist' is a DataFrame where each row represents a sentiment category and each
column represents an airline, with values indicating the frequency of occurrences.
• The heatmap is created using seaborn's heatmap function, with 'viridis' colormap.
• The x-axis label is set as 'airline_sentiment'.

• The y-axis label is set as 'airline'.

The heatmap shows the number of comments of each sentiment for each airline. For example,
United Airlines receives the most negative comments, followed by American Airlines and US
Airways, while Virgin America receives the fewest. Going by the reviews, Delta's services
appear to compare favorably with those of its competitors.
# Add a column named 'Length' which contains the character count of each text
train_df['Length'] = train_df['text'].apply(len)

# To keep only the specified columns and remove the rest
train_df=train_df[['text','airline_sentiment','Length']]

# Show the first 5 rows of the data
train_df.head()

# For data overview


train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10980 entries, 0 to 10979
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   text               10980 non-null  object
 1   airline_sentiment  10980 non-null  object
 2   Length             10980 non-null  int64
dtypes: int64(1), object(2)
memory usage: 257.5+ KB

# To generate the descriptive statistics for the numerical columns in the DataFrame train_df
train_df.describe()

6.3 Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is the process of studying and exploring data sets to
understand their main characteristics, discover patterns, locate outliers, and identify
relationships between variables.
sns.histplot(train_df.Length,kde=True,color='c')

''' This code creates a histogram plot of the 'Length' column from the DataFrame 'train_df' using
seaborn, with a kernel density estimate (kde) and cyan color. '''

Figure 2. Histogram plot of count vs Length

The histogram shows that most reviews in the text column are between 125 and 150 characters
long.
plt.pie(train_df.airline_sentiment.value_counts(),labels=['Negative','Neutral','Positive'],autopct='%.2f')
plt.show()

Figure 3. Pie chart of airline sentiment

This code generates a pie chart illustrating the distribution of sentiment categories ('Negative',
'Neutral', 'Positive') within the 'airline_sentiment' column of the 'train_df' DataFrame.
The sizes of the pie slices are determined by the counts of each sentiment category. Labels
are provided for each slice ('Negative', 'Neutral', 'Positive'). The 'autopct' parameter formats
the numerical values displayed on the pie chart to show the percentage with two decimal
places. Finally, the pie chart is displayed using matplotlib's 'plt.show()' function
The pie chart shows that 62.40% of the comments are negative (blue section), 21.19% are
neutral (orange section), and 16.41% are positive (green section).
6.4 Data Preprocessing

Categorical Encoding

Categorical encoding transforms categorical variables into numerical format for analysis or
machine learning; methods include ordinal, one-hot, label, frequency, and target encoding.
''' The code imports the LabelEncoder class from the sklearn.preprocessing module and creates an
instance of it called le,
which is used to transform categorical variables into numerical labels by assigning each unique
category a unique integer.'''

from sklearn.preprocessing import LabelEncoder


le=LabelEncoder()

# To convert the text data into numerical values


train_df.airline_sentiment=le.fit_transform(train_df.airline_sentiment)

'''This code uses the LabelEncoder instance (le)
to transform the categorical column 'airline_sentiment' in the 'train_df' DataFrame into numerical
values.'''
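
A quick sanity check (not part of the original notebook) makes the learned mapping explicit;
LabelEncoder assigns integers alphabetically, which matters later when interpreting model
predictions in the deployment code.

# Optional check of the learned label mapping; classes are ordered alphabetically,
# so here: negative -> 0, neutral -> 1, positive -> 2.
print(dict(zip(le.classes_, le.transform(le.classes_))))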

{"type":"string"}

import string
string.punctuation
# This code returns a string containing all punctuation characters.

{"type":"string"}

import nltk
nltk.download('stopwords')
# This code downloads the stopwords corpus from NLTK (Natural Language Toolkit).

[nltk_data] Downloading package stopwords to /root/nltk_data...


[nltk_data] Unzipping corpora/stopwords.zip.

True

# to import stopwords for english language


from nltk.corpus import stopwords
stopwords=stopwords.words('english')
'''
This code imports the stopwords corpus for the English language from
NLTK (Natural Language Toolkit), providing a list of common words often considered irrelevant for
analysis or processing tasks.'''

{"type":"string"}

np.array(stopwords)
# This code converts the list of English stopwords into a NumPy array.

# Removing the stopwords and punctuation for better analysis


def remove_stopwords(text):
    text=text.split(' ')
    text = " ".join([char for char in text if char not in stopwords])
    return text

train_df['text_clean'] = train_df['text'].apply(lambda x: remove_stopwords(x))

train_df.head()

def remove_punc(text):
    text = "".join([char for char in text if char not in string.punctuation])
    return text

train_df['text_clean'] = train_df['text_clean'].apply(lambda x: remove_punc(x))

train_df.head()

# The code defines two functions:


# 1. 'remove_stopwords': This function splits the input text into words, removes any stopwords
# based on the predefined English stopwords list, and joins the remaining words back into a string.
# 2. 'remove_punc': This function removes any punctuation characters from the input text.

# These functions are then applied to the 'text' column of the 'train_df' DataFrame using the
# 'apply' method:
# - The 'remove_stopwords' function is applied first to remove stopwords.
# - The 'remove_punc' function is applied subsequently to remove punctuation.
# The resulting cleaned text is stored in a new column named 'text_clean' in the 'train_df' DataFrame.

# Show the processed text column


train_df.text_clean
# This code accesses the 'text_clean' column from the DataFrame 'train_df',
# displaying the processed text data after stopwords removal and punctuation removal.'''

0 SouthwestAir I scheduled morning 2 days fact y...
1 SouthwestAir seeing workers time time going be...
2 united Flew ORD Miami back great crew service...
3 SouthwestAir dultch97 thats horse radish
4 united flight ORD delayed Air Force One last f...
...
10975 AmericanAir followback
10976 united thanks help Wish phone reps could accom...
10977 usairways the Worst Ever dca customerservice
10978 nrhodes85 look Another apology DO NOT FLY USAi...
10979 united far worst airline 4 plane delays 1 roun...
Name: text_clean, Length: 10980, dtype: object

x=train_df.text
y=train_df.airline_sentiment

#This code assigns the 'text' column from the DataFrame 'train_df' to the variable 'x',
#representing the input text data, and assigns the 'airline_sentiment' column to the variable 'y',
#representing the target sentiment labels. Note that the raw 'text' column is used here; the
#cleaned 'text_clean' column could be substituted to train on the preprocessed text instead.

from sklearn.model_selection import train_test_split


x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=40)

'''
This code splits the input text data ('x') and target sentiment labels ('y') into training and testing
sets (80% training, 20% testing) using a random seed of 40 for reproducibility, and assigns them to
the respective variables 'x_train', 'x_test', 'y_train', and 'y_test'.'''

{"type":"string"}

print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(8784,)
(8784,)
(2196,)
(2196,)

CV=CountVectorizer(stop_words='english')
"""
This code initializes a CountVectorizer object named 'CV', which is used to convert a collection of text
documents into a
matrix of token counts, excluding English stopwords during the tokenization process."""

{"type":"string"}

TV=TfidfVectorizer(stop_words='english')
""" This code initializes a TfidfVectorizer object named 'TV', which is used to convert a collection of
text documents into a matrix of TF-IDF (Term Frequency-Inverse Document Frequency) features,
excluding
English stopwords during the tokenization process."""

{"type":"string"}

x_train=CV.fit_transform(x_train)
'''
This code fits the CountVectorizer 'CV' to the training data 'x_train' and transforms it into
a matrix of token counts, where each row represents a document and each column represents a
unique token (word), capturing the frequency of each token's occurrence in the documents.'''

x_train
"""
The variable x_train represents the transformed training data after applying the CountVectorizer,
where each row corresponds to a
document (text sample) and each column corresponds to a unique token (word), storing the token
counts for each document."""

{"type":"string"}

x_test=CV.transform(x_test)
'''
This code transforms the testing data 'x_test' using the CountVectorizer 'CV',
maintaining the same tokenization scheme learned from the training data, and produces
a matrix of token counts representing the testing data.'''

6.5 Training and Evaluation

We trained multiple classification models, including naive Bayes classifiers, logistic
regression, decision trees, random forests, and XGBoost, using the preprocessed text data.
Each model was evaluated using various performance metrics to assess its accuracy and
effectiveness in sentiment classification.
6.5.1 Model Creation with Naive Bayes

Naive Bayes classifiers are a collection of classification algorithms based on Bayes'
Theorem. They are not a single algorithm but a family of algorithms that share a common
principle: every pair of features being classified is independent of each other.

from sklearn.naive_bayes import MultinomialNB
model_nb=MultinomialNB()

# This code imports the Multinomial Naive Bayes classifier
# from the scikit-learn library and initializes an instance of it named 'model_nb'.

model_nb.fit(x_train,y_train)

# This code trains the Multinomial Naive Bayes classifier 'model_nb' using the training
# data 'x_train' (vectorized text data) and corresponding target labels 'y_train' (sentiment labels).

MultinomialNB()

y_nb_pred=model_nb.predict(x_test)

# This code uses the trained Multinomial Naive Bayes classifier 'model_nb' to predict the
# sentiment labels for the testing data 'x_test' and assigns the predictions to 'y_nb_pred'.

Model Evaluation
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             classification_report, roc_auc_score, confusion_matrix)

# This code imports various metrics and evaluation functions from scikit-learn, including
# accuracy_score, precision_score, recall_score, f1_score, classification_report, roc_auc_score,
# and confusion_matrix, for evaluating the performance of classification models.

naive_bayes_accuracy=accuracy_score(y_test,y_nb_pred)
naive_bayes_accuracy

# This code calculates the accuracy of the Multinomial Naive Bayes classifier on
# the testing data by comparing the predicted labels ('y_nb_pred') with the actual labels ('y_test')
# and assigns the result to the variable 'naive_bayes_accuracy'.

0.755464480874317

print(classification_report(y_test,y_nb_pred))

              precision    recall  f1-score   support

           0       0.76      0.95      0.85      1384
           1       0.67      0.34      0.45       440
           2       0.77      0.51      0.61       372

    accuracy                           0.76      2196
   macro avg       0.73      0.60      0.64      2196
weighted avg       0.75      0.76      0.73      2196

''' This code prints a classification report, which includes precision, recall, F1-score, and
support for each class, based on the comparison between the actual labels ('y_test') and the
predicted labels ('y_nb_pred') using the Multinomial Naive Bayes classifier.'''

confusion_matrix(y_test,y_nb_pred)

array([[1320,   42,   22],
       [ 256,  149,   35],
       [ 151,   31,  190]])

This code generates a confusion matrix, a table showing, for each actual class (rows), how many
instances were predicted as each class (columns). For example, the first row shows that 1320
negative tweets were classified correctly, while 42 were misclassified as neutral and 22 as
positive.

6.5.2 Model Creation with the Logistic Regression

Logistic regression is a classification algorithm that applies the sigmoid function to a linear
combination of the input features to produce a probability value between 0 and 1.
# Convert the sparse feature matrices to dense NumPy arrays
# (memory-intensive for large vocabularies)
x_train=np.array(x_train.todense())
x_test=np.array(x_test.todense())

from sklearn.linear_model import LogisticRegression


lr=LogisticRegression(multi_class='ovr',n_jobs=-1)

#This code imports the LogisticRegression class from scikit-learn and initializes a logistic
#regression model named 'lr' with the 'ovr' (one-vs-rest) strategy for multi-class classification
#and parallel processing enabled using all available CPU cores ('n_jobs=-1').

lr.fit(x_train,y_train)

LogisticRegression(multi_class='ovr', n_jobs=-1)

y_lr_pred=lr.predict(x_test)

Model Evaluation with the logistic regression


logistic_accuracy=accuracy_score(y_test,y_lr_pred)
logistic_accuracy

0.7891621129326047

print(classification_report(y_test,y_lr_pred))

              precision    recall  f1-score   support

           0       0.84      0.90      0.87      1384
           1       0.61      0.57      0.59       440
           2       0.77      0.64      0.70       372

    accuracy                           0.79      2196
   macro avg       0.74      0.70      0.72      2196
weighted avg       0.78      0.79      0.78      2196

confusion_matrix(y_test,y_lr_pred)

array([[1247,  101,   36],
       [ 155,  249,   36],
       [  79,   56,  237]])

6.5.3 Model Creation using Decision Tree Classifier

A decision tree is a flowchart-like tree structure where each internal node denotes a test on a
feature, the branches denote the decision rules, and the leaf nodes denote the outcome. It is a
versatile supervised machine-learning algorithm, used for both classification and regression
problems.
from sklearn.tree import DecisionTreeClassifier
"""
This code imports the DecisionTreeClassifier class from
the scikit-learn library, which is used to create a decision
tree model for classification tasks."""

{"type":"string"}

dtc = DecisionTreeClassifier()

dtc.fit(x_train,y_train)

DecisionTreeClassifier()

y_pred_dtc= dtc.predict(x_test)

# This code uses the trained DecisionTreeClassifier model 'dtc' to predict the
# target labels for the testing data 'x_test', and assigns the predicted labels
# to the variable 'y_pred_dtc'.

Model Evaluation Using Decision Tree Classifier


decisiontree_accuracy=accuracy_score(y_test,y_pred_dtc)
decisiontree_accuracy

0.6584699453551912

print(classification_report(y_test,y_pred_dtc))

              precision    recall  f1-score   support

           0       0.80      0.74      0.77      1384
           1       0.38      0.49      0.43       440
           2       0.58      0.56      0.57       372

    accuracy                           0.66      2196
   macro avg       0.59      0.60      0.59      2196
weighted avg       0.68      0.66      0.67      2196

confusion_matrix(y_test,y_pred_dtc)

array([[1020,  275,   89],
       [ 162,  216,   62],
       [  89,   73,  210]])

6.5.4 Model Creation using Random Forest Classifier

Random Forest is a classifier that trains a number of decision trees on various subsets of
the given dataset and aggregates their predictions to improve predictive accuracy.
from sklearn.ensemble import RandomForestClassifier

#This code imports the RandomForestClassifier class from the scikit-learn library,
#which is used to create a random forest model for classification tasks.

rfc=RandomForestClassifier()

rfc.fit(x_train,y_train)

RandomForestClassifier()

y_pred_rfc=rfc.predict(x_test)

Model Evaluation of Random Forest Classifier


rf_accuracy=accuracy_score(y_test,y_pred_rfc)
rf_accuracy

0.7454462659380692

print(classification_report(y_test,y_pred_rfc))

              precision    recall  f1-score   support

           0       0.82      0.86      0.84      1384
           1       0.51      0.51      0.51       440
           2       0.72      0.60      0.65       372

    accuracy                           0.75      2196
   macro avg       0.68      0.66      0.67      2196
weighted avg       0.74      0.75      0.74      2196

confusion_matrix(y_test,y_pred_rfc)

array([[1190,  149,   45],
       [ 175,  224,   41],
       [  82,   67,  223]])

6.5.5 Model Creation with XGBoost

XGBoost, short for eXtreme Gradient Boosting, is a machine learning algorithm known for
its efficiency, speed, and accuracy. It belongs to the family of boosting algorithms, which are
ensemble learning techniques that combine the predictions of multiple weak learners. Note
that the class used below, XGBRFClassifier, is XGBoost's random forest variant, which trains
its trees on bootstrapped samples in parallel rather than boosting them sequentially.
from xgboost import XGBRFClassifier
'''This line imports the XGBRFClassifier class from the XGBoost library,
which is used for fitting an XGBoost Random Forest classifier.'''

xgb=XGBRFClassifier(n_jobs=-1)

xgb.fit(x_train,y_train)

y_xgb_pred=xgb.predict(x_test)

Model Evaluation using XGBoost


xgb_accuracy=accuracy_score(y_test,y_xgb_pred)
xgb_accuracy

0.703551912568306

print(classification_report(y_test,y_xgb_pred))

              precision    recall  f1-score   support

           0       0.71      0.95      0.81      1384
           1       0.61      0.10      0.18       440
           2       0.68      0.49      0.57       372

    accuracy                           0.70      2196
   macro avg       0.67      0.52      0.52      2196
weighted avg       0.69      0.70      0.64      2196

confusion_matrix(y_test,y_xgb_pred)

array([[1316,   19,   49],
       [ 358,   45,   37],
       [ 178,   10,  184]])

Model Comparison Report

plt.figure(figsize=(20,7))
sns.barplot(x=['Naive Bayes','Logistic Regression','Decision Tree Classifier',
               'Random Forest Classifier','XGBRFClassifier'],
            y=[naive_bayes_accuracy,logistic_accuracy,decisiontree_accuracy,
               rf_accuracy,xgb_accuracy])

Figure 4. Model Comparison bar plot

Our results indicate that logistic regression performed the best among the tested models,
closely followed by Naive Bayes. These accuracies suggest the effectiveness of our sentiment
analysis system in accurately classifying tweets into positive, negative, or neutral categories.

6.6 Model Deployment Code Explanation:


The provided code demonstrates the deployment of a sentiment analysis model using
Streamlit. It begins with importing necessary libraries, including Streamlit and pickle,
followed by loading the pre-trained logistic regression model and CountVectorizer.
Streamlit UI Setup: The Streamlit application is initiated with the title "Sentiment Analysis."
Users are prompted to input a comment via a text input field.

Prediction Function: The predict_sentiment function preprocesses the input comment using
the loaded CountVectorizer and then predicts the sentiment using the pre-trained logistic
regression model.

User Interaction: Upon clicking the "Predict" button, the input comment is processed, and the
predicted sentiment (positive, neutral, or negative) is displayed to the user.

Error Handling: The code includes basic error handling to ensure that users are prompted to
input a comment before making predictions.

Model Saving: Additionally, the code saves the trained model using pickle for future use,
ensuring that the model can be loaded and deployed efficiently without the need for
retraining.
Overall, this deployment code provides a simple yet effective interface for users to perform
sentiment analysis on individual comments in real-time. The use of Streamlit simplifies the
deployment process, while pickle allows for easy model persistence.

Creating Pickle File:

import pickle

# Assuming 'lr' is your trained logistic regression model object
filename = "logistic_regression_model.pkl"
with open(filename, "wb") as model_file:
    pickle.dump(lr, model_file)

# Assuming CV is your CountVectorizer object
with open("count_vectorizer.pkl", 'wb') as cv_file:
    pickle.dump(CV, cv_file)

Streamlit Code:
import streamlit as st
import pickle
from sklearn.feature_extraction.text import CountVectorizer

# Load the saved logistic regression model
with open("D://Data Science Projects//Twitter sentiment analysis//logistic_regression_model.pkl", 'rb') as model_file:
    lr_model = pickle.load(model_file)

# Load the saved CountVectorizer
with open("D://Data Science Projects//Twitter sentiment analysis//count_vectorizer.pkl", 'rb') as cv_file:
    CV = pickle.load(cv_file)

# Function to preprocess input comment using CountVectorizer
def preprocess_comment(comment):
    # Transform the comment using CountVectorizer
    comment_transformed = CV.transform([comment])
    return comment_transformed

# Function to make predictions
def predict_sentiment(comment):
    # Preprocess the comment
    comment_transformed = preprocess_comment(comment)
    # Predict sentiment using the model
    prediction = lr_model.predict(comment_transformed)
    return prediction[0]  # Assuming the model returns a single prediction

# Streamlit UI
st.title('Sentiment Analysis')

# Input comment
comment = st.text_input('Enter your comment:')

# Prediction
if st.button('Predict'):
    if comment:
        sentiment = predict_sentiment(comment)
        # LabelEncoder assigned labels alphabetically: 0 = negative, 1 = neutral, 2 = positive
        if sentiment == 2:
            st.write('Sentiment: Positive')
        elif sentiment == 1:
            st.write('Sentiment: Neutral')
        else:
            st.write('Sentiment: Negative')
    else:
        st.write('Please enter a comment to predict.')
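
Assuming the script above is saved as app.py (the filename is illustrative), the application can
be launched locally with:

streamlit run app.py

which starts a local web server and opens the interface in the browser.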

Figure 5. Deployment screenshot

CHAPTER 7

RESULTS AND DISCUSSION

The culmination of our efforts in this project has yielded a sentiment analysis framework adept
at discerning the emotional nuances and underlying sentiments expressed in Twitter data. By
leveraging machine learning algorithms such as logistic regression, Naive Bayes, decision tree
classifier, random forest, and XGBoost, we've not only achieved accurate classification of
tweets but also extracted meaningful patterns and insights from the data. Our model, trained on
the airline sentiment dataset, has provided comprehensive insights into customer opinions and
experiences with various airline services, thus enabling stakeholders to make informed
decisions and engage effectively with users.

The impact of our sentiment analysis framework extends beyond mere classification, as it
equips businesses with actionable insights to address customer concerns, enhance service
quality, and ultimately elevate customer satisfaction. By leveraging these insights, stakeholders
can navigate the dynamic landscape of social media, cultivate positive interactions with their
audience, and drive improvements in brand perception and market competitiveness. In essence,
our project underscores the transformative power of sentiment analysis in empowering
businesses to adapt, innovate, and thrive in the digital age.

CHAPTER 8

CONCLUSION

This exploration delved into the sentiment analysis of public opinions towards US-based
airlines on Twitter, utilizing a diverse range of machine learning algorithms. We compared
the performance of Naive Bayes, Logistic Regression, XGBoost, Random Forest, and Decision
Tree, evaluating their effectiveness in classifying tweets as positive, negative, or neutral. Our
findings illuminate valuable insights into customer sentiment and pave the way for the
deployment of a user-friendly tool for further analysis.

Among the tested algorithms, Logistic Regression emerged as the champion, achieving an
accuracy of 78.91%. This indicates its superior ability to accurately discern the sentiment
expressed in tweets directed at the airlines. While the other algorithms also performed
reasonably well, Logistic Regression's higher accuracy ultimately tipped the scales in its
favor.

Leveraging the power of Streamlit, we successfully deployed the logistic regression model as
a user-friendly web application. This interactive interface empowers users to analyze the
sentiment of individual tweets or comments in real time, providing valuable insights into
public perception.

Looking forward, this project establishes a robust foundation for further exploration. By
incorporating additional features like sentiment lexicon expansion and exploring deep learning
techniques, we can refine the model's accuracy and delve deeper into the nuances of public
sentiment. The deployed Streamlit application can be further enhanced to offer functionalities
like sentiment visualization and trend analysis, providing American Airlines with a
comprehensive picture of public perception on various aspects of their services.

In conclusion, this endeavor demonstrated the effectiveness of machine learning in analyzing
public sentiment on Twitter. By leveraging the logistic regression model within the Streamlit
framework, we have empowered stakeholders to gain valuable insights into customer
perception, enabling data-driven decision-making and improved customer experience.

