Submitted to
Group no – 13
Batch – F
1. Shiv Datt Mishra (Dr. APJ Abdul Kalam UIT Jabua)
2. Ishant Sharma (Jabalpur Engineering College)
3. Bhupesh Watti (Ujjain Engineering College)
4. Praval Sharma (Ujjain Engineering College)
5. Poornima Bhumarkar (Jabalpur Engineering College)
ABSTRACT
In the era of social media dominance, Twitter stands out as a significant platform
for real-time expression and opinion sharing. Understanding the sentiments
expressed on Twitter holds immense value for various sectors, including
marketing, politics, and public opinion analysis. This project focuses on
harnessing the power of Natural Language Processing (NLP) techniques to
analyze sentiments expressed on Twitter. Leveraging machine learning
algorithms, sentiment analysis models are trained to classify tweets into positive,
negative, or neutral categories. Additionally, advanced techniques such as topic
modeling and sentiment trend analysis are employed to extract deeper insights
from the data. The project aims to provide a comprehensive understanding of the
sentiments prevailing on Twitter, enabling stakeholders to make informed
decisions based on the public's opinions and emotions.
TABLE OF CONTENTS
ABSTRACT
TABLE OF CONTENTS
TABLE OF FIGURES
1. INTRODUCTION
1.1 Objective:
1.2 Data Utilization
1.3 Algorithmic Approach
1.4 Impact and Outcome:
2. LITERATURE SURVEY
3. PROJECT PLANNING AND MANAGEMENT
3.1 Roles
3.2 Collaborative Efforts:
3.3 Challenges:
3.4 Project Milestones:
3.5 Conclusion:
4. METHODOLOGY
4.1 Data Collection:
4.2 Data Preprocessing:
4.3 Feature Engineering:
4.4 Model Selection and Training:
4.5 Model Evaluation:
4.6 Model Deployment:
4.7 Documentation and Reporting:
5. SYSTEM DESIGN
5.1 Naive Bayes:
5.2 Logistic Regression:
5.3 Decision Tree:
5.4 Random Forest:
5.5 XGBoost (Extreme Gradient Boosting):
6.2 Data Collection
6.3 Exploratory Data Analysis (EDA)
6.4 Data Preprocessing
Categorical Encoding
TABLE OF FIGURES
CHAPTER 1
INTRODUCTION
As the internet continues to expand, social media and microblogging platforms like Twitter
play a pivotal role in disseminating news and trending topics globally at an unprecedented pace.
Trending topics emerge as a result of extensive user engagement, making them valuable
sources of online perception. These topics span a wide range, from spreading awareness to
promoting public figures, political campaigns, product endorsements, and entertainment events
like movies and award shows. Businesses leverage user feedback from social media to enhance
their products and services, driving improvements in marketing strategies.
Twitter, characterized by its short per-tweet character limit (originally 140 characters, raised
to 280 in 2017), generates a staggering volume of content, with approximately 6500 tweets
published per second. Analyzing this vast stream of
tweets presents challenges due to their noisy and unstructured nature, encompassing diverse
topics and changing attitudes. Sentiment analysis on Twitter involves leveraging natural
language processing techniques to extract and characterize sentiment content, addressing
coarse and fine-level analysis of sentiments.
However, analyzing tweets is not without its challenges. The informal nature of tweets, coupled
with the use of slang, acronyms, and grammatical irregularities, poses difficulties for natural
language processors.
The remainder of this project report outlines related work, details the technologies employed,
elaborates on the methodology and implementation, addresses encountered challenges,
discusses future work, and concludes with a summary of findings and implications.
1.1 Objective:
Our task involves developing a robust sentiment analysis system for Twitter data. The
primary goal is to accurately classify tweets into positive, negative, or neutral categories,
thereby deciphering the emotional nuances and underlying sentiments expressed in tweets.
This endeavor aims to provide comprehensive insights into the prevailing sentiments within
the Twitter community.
CHAPTER 2
LITERATURE SURVEY
Sentiment analysis on Twitter data has garnered significant attention in recent years due to the
platform's popularity and its role as a rich source of real-time user opinions and emotions.
Existing literature provides a comprehensive overview of various techniques, challenges, and
advancements in this field.
Researchers have identified several challenges unique to sentiment analysis on Twitter data.
These include the brevity of tweets, noisy and informal language, sarcasm, ambiguity, and the
dynamic nature of language trends. Papers like "Twitter Sentiment Analysis: The Good the
Bad and the OMG!" by Kouloumpis et al. (2011) and "Twitter as a Corpus for Sentiment
Analysis and Opinion Mining" by Pak and Paroubek (2010) delve into these challenges and
propose strategies to address them.
Literature offers a wide range of techniques for sentiment analysis on Twitter data, including
lexicon-based methods, machine learning algorithms, deep learning models, and hybrid
approaches. Papers such as "A Survey of Sentiment Analysis Techniques on Twitter Data" by
Agarwal et al. (2011) and "Combining Lexicon-based and Learning-based Methods for Twitter
Sentiment Analysis" by Zhang et al. (2011) review these techniques and discuss
their advantages and limitations.
Traditional machine learning algorithms like naive Bayes, logistic regression, decision trees,
and ensemble methods have been widely applied to sentiment analysis on Twitter data.
Additionally, deep learning models such as recurrent neural networks (RNNs), convolutional
neural networks (CNNs), and transformer-based architectures like BERT have shown
promising results. "Deep Learning for Sentiment Analysis: A Survey" by Zhang et al. (2018)
provides an in-depth overview of deep learning techniques for sentiment analysis tasks.
Various datasets and tools have been developed to facilitate research in Twitter sentiment
analysis. The Sentiment140 dataset, introduced by Go et al. (2009), is one of the most widely
used datasets for training and evaluating sentiment analysis models on Twitter data.
Additionally, tools like the Sentiment140 platform and other sentiment analysis APIs provide
researchers with resources for collecting and analyzing sentiment-labeled tweets.
CHAPTER 3
PROJECT PLANNING AND MANAGEMENT
3.1 Roles
• Shiv Datt Mishra:
As the project head, Shiv Datt oversees the overall project planning and management. His
responsibilities include coordinating team activities, setting project goals, and ensuring timely
progress. Shiv Datt is also actively involved in implementing machine learning models for
sentiment analysis.
• Ishant Sharma:
Ishant is responsible for providing the dataset essential for sentiment analysis. Additionally, he
takes charge of creating presentations and revising project documentation to ensure clarity and
accuracy.
• Praval Sharma:
Praval's role focuses on data preprocessing, including cleaning, transforming, and organizing
the dataset to prepare it for analysis. His meticulous attention to detail ensures the integrity and
quality of the data used for sentiment analysis.
• Bhupesh Watti:
Bhupesh is tasked with defining the project scope and requirements. He works on structuring
the data and crafting detailed reports to document project progress and findings. Bhupesh's
thorough analysis contributes to a comprehensive understanding of the project's objectives and
outcomes.
• Poornima Bhumarkar:
Poornima's role involves gathering information related to the project for further enhancement.
She conducts research on sentiment analysis methodologies, explores new techniques, and
identifies opportunities for improvement. Poornima's insights contribute to the project's
innovation and advancement.
3.2 Collaborative Efforts:
their combined expertise to develop robust sentiment analysis algorithms. Regular
communication and coordination ensure that all team members are aligned with project
objectives and timelines.
3.3 Challenges:
1. Monetization of Twitter API: One of the primary challenges faced by the team is the
monetization of the Twitter API. Due to this restriction, accessing real-time Twitter data
becomes costly, limiting the team's ability to gather live data for sentiment analysis.
2. Dataset Limitations: Although the team initially planned to collect data using the Twitter
API, they encountered limitations in obtaining a comprehensive dataset. As a result, they had
to rely solely on the available airline sentiment dataset, which may not encompass all relevant
tweets or provide real-time insights.
3. Inability to Utilize YouTube Data: While considering alternative data sources, the team
explored the possibility of leveraging YouTube comments for sentiment analysis. However,
they encountered challenges as comments vary significantly across different channels, making
it challenging to create a cohesive dataset.
3.5 Conclusion:
Through effective planning and management, the project team leverages individual strengths
and expertise to achieve collective goals. By adhering to established roles and responsibilities,
maintaining open communication, and fostering collaboration, the team works cohesively
towards delivering a high-quality Twitter sentiment analysis solution.
CHAPTER 4
METHODOLOGY
We employed a systematic approach to develop our sentiment analysis system, involving the
following steps:
Alternatively, access pre-existing datasets, such as the airline sentiment dataset covering
US-based airlines, to obtain relevant data for analysis.
Explore additional feature extraction methods, such as sentiment lexicons or word embeddings,
to capture semantic information and enhance classification accuracy.
Split the dataset into training and testing sets, typically using a 70-30 or 80-20 split, to train
and evaluate the performance of selected models.
Train the selected models on the training set and fine-tune hyperparameters using techniques
such as cross-validation to optimize model performance.
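The 80-20 split mentioned above can be sketched in plain Python. This is only an illustration and not the project's code: `split_indices` is a hypothetical helper, while the seed of 40 and the 10,980-row dataset size are the values reported in the implementation chapter.

```python
import random

def split_indices(n_rows, test_fraction=0.2, seed=40):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded shuffle for reproducibility
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(10980)
print(len(train_idx), len(test_idx))
```

The resulting sizes match the shapes printed later in the report (8784 training rows and 2196 test rows), although the exact rows selected will differ from those chosen by scikit-learn.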
Compare the performance of different models to identify the most effective approach for
sentiment analysis on Twitter data.
CHAPTER 5
SYSTEM DESIGN
5.1 Naive Bayes:
• It's particularly popular for text classification tasks, including sentiment analysis, spam
detection, and document categorization.
• Naive Bayes calculates the probability of each class given a set of input features and selects
the class with the highest probability as the predicted class.
• Despite its simplicity and the "naive" assumption of feature independence, Naive Bayes often
performs well in practice, especially for large datasets with high dimensionality.
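To make the probability calculation concrete, here is a minimal multinomial Naive Bayes sketch in plain Python. It assumes whitespace tokenization, Laplace (add-one) smoothing, and a tiny invented training set; `train_nb` and `predict_nb` are illustrative names, not part of the project code, which uses scikit-learn's MultinomialNB.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def predict_nb(doc, word_counts, class_counts, vocab):
    """Return the class with the highest log-posterior (Laplace smoothing)."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in doc.lower().split():
            if word in vocab:  # words never seen in training are skipped
                likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
                score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)

# Invented toy data, for illustration only.
docs = ["great flight friendly crew", "lost my bag terrible service",
        "terrible delay again", "friendly staff great service"]
labels = ["positive", "negative", "negative", "positive"]
model = train_nb(docs, labels)
print(predict_nb("terrible flight delay", *model))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.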
5.2 Logistic Regression:
• It models the probability that an instance belongs to a particular class using the logistic
function, also known as the sigmoid function.
• Logistic Regression estimates the coefficients for each feature, which represent the impact of
the feature on the log-odds of the target class.
• During prediction, Logistic Regression calculates the probability of the positive class and
assigns the instance to the positive class if the probability exceeds a threshold (typically 0.5).
• Logistic Regression is widely used due to its simplicity, interpretability, and efficiency,
especially for problems with linear decision boundaries.
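The prediction step described above can be sketched as follows. The weights, bias, and the two-feature input are hypothetical values chosen for illustration; in the project they would be learned by scikit-learn's LogisticRegression.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one instance."""
    z = bias + sum(w * x for w, x in zip(weights, features))  # log-odds
    return sigmoid(z)

def predict(features, weights, bias, threshold=0.5):
    """Assign the positive class (1) when the probability exceeds the threshold."""
    return 1 if predict_proba(features, weights, bias) > threshold else 0

# Hypothetical coefficients for the token counts of "great" and "delayed".
weights = [1.2, -2.0]
bias = -0.1
print(predict_proba([2, 0], weights, bias), predict([2, 0], weights, bias))
print(predict_proba([0, 1], weights, bias), predict([0, 1], weights, bias))
```

Each coefficient shifts the log-odds by a fixed amount per unit of its feature, which is what makes the model easy to interpret.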
5.3 Decision Tree:
• It makes decisions based on a sequence of binary splits along the features, aiming to maximize
the information gain or minimize impurity at each node.
• Decision Trees can handle both categorical and numerical features and can capture complex
nonlinear relationships between features and the target variable.
• However, Decision Trees are prone to overfitting, especially when trained on noisy or
high-dimensional data. Techniques like pruning, limiting tree depth, and ensemble methods (e.g.,
Random Forest) are commonly used to mitigate overfitting.
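The split criterion mentioned above can be illustrated with Gini impurity. This is a minimal sketch on invented labels; `gini` and `impurity_decrease` are illustrative helper names, and a real tree learner would evaluate many candidate splits and keep the one with the largest decrease.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, larger for more mixed nodes."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

# Invented node contents for one candidate binary split.
parent = ["neg", "neg", "neg", "pos", "pos", "neu"]
left = ["neg", "neg", "neg"]
right = ["pos", "pos", "neu"]
print(gini(parent), gini(left), gini(right))
print(impurity_decrease(parent, left, right))
```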
5.4 Random Forest:
• Each Decision Tree in the Random Forest is trained on a bootstrapped sample of the dataset
and may use a random subset of features for splitting at each node.
• During prediction, the Random Forest aggregates the predictions of individual trees through
voting (for classification) or averaging (for regression), resulting in a more stable and accurate
prediction.
• Random Forests are highly effective for a wide range of classification and regression tasks,
offering improved generalization performance and resilience to overfitting compared to
individual Decision Trees.
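The two ingredients described above, bootstrap sampling and vote aggregation, can be sketched in a few lines. The tree predictions below are invented stand-ins; no actual trees are trained in this example.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so some rows repeat and some are left out."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
rows = list(range(10))
sample = bootstrap_sample(rows, rng)
print(sample)

# Hypothetical predictions from five trees for a single tweet.
tree_votes = ["negative", "neutral", "negative", "positive", "negative"]
print(majority_vote(tree_votes))
```

Because each tree sees a different resampled dataset, their individual errors are less correlated, and the vote tends to be more stable than any single tree.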
5.5 XGBoost (Extreme Gradient Boosting):
• It optimizes a differentiable loss function by adding new trees that minimize the loss of the
combined ensemble, employing gradient descent techniques.
• XGBoost introduces several innovations, such as regularization techniques, tree pruning, and
parallelized computation, to improve training speed, accuracy, and scalability.
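The core idea of adding trees that reduce the loss of the combined ensemble can be sketched with plain gradient boosting on a one-dimensional toy problem. This is not XGBoost itself: it assumes squared-error loss, depth-1 trees (stumps), and omits XGBoost's regularization, pruning, and second-order terms. `fit_stump` and `boost` are illustrative names.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on a 1-D feature that minimizes squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # every threshold that leaves both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Fit stumps one after another to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared-error loss the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

# Invented toy data: the target switches from 0 to 1 between x=3 and x=4.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = boost(xs, ys)
print(round(model(2), 3), round(model(5), 3))
```

The learning rate shrinks each tree's contribution, so the ensemble approaches the targets gradually instead of fitting them in a single step.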
CHAPTER 6
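# The opening of this statement was lost; the next two lines are an assumed reconstruction.
df_2dhist = pd.DataFrame({
    x_label: grp['airline'].value_counts()
    for x_label, grp in train_df.groupby('airline_sentiment')
})
sns.heatmap(df_2dhist/100, cmap='viridis',annot=True)
plt.xlabel('airline_sentiment')
_ = plt.ylabel('airline')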
This code generates a heatmap comparing how often each airline appears under each sentiment.
• 'df_2dhist' is a DataFrame where each row represents an airline and each column
represents a sentiment category, with values indicating the frequency of occurrences.
• The counts are divided by 100 before plotting, so the annotated values are in hundreds of tweets.
• The heatmap is created using seaborn's heatmap function, with the 'viridis' colormap.
• The x-axis label is set as 'airline_sentiment'.
• The y-axis label is set as 'airline'.
The heatmap shows the number of tweets per sentiment for each airline. For example, United
Airlines has the most negative tweets, followed by American Airlines and US Airways, while
Virgin America has the fewest negative tweets. Judging by these reviews, Delta's service
compares favorably with that of its competitors.
# Add a 'Length' column containing the number of characters in each tweet's text
train_df['Length'] = train_df['text'].apply(len)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10980 entries, 0 to 10979
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   text               10980 non-null  object
 1   airline_sentiment  10980 non-null  object
 2   Length             10980 non-null  int64
dtypes: int64(1), object(2)
memory usage: 257.5+ KB
# To generate the descriptive statistics for the numerical columns in the DataFrame train_df
train_df.describe()
Exploratory Data Analysis (EDA) is the process of examining data sets to understand their
main characteristics, discover patterns, spot outliers, and identify relationships between
variables.
sns.histplot(train_df.Length,kde=True,color='c')
''' This code creates a histogram of the 'Length' column from the DataFrame 'train_df' using seaborn,
with a kernel density estimate (kde) overlay and cyan color. '''
Figure 2. Histogram of tweet length (count vs. Length)
The histogram shows that most tweets in the text column are between 125 and 150
characters long.
plt.pie(train_df.airline_sentiment.value_counts(),labels=['Negative','Neutral','Positive'],autopct='%.2f')
plt.show()
Figure 3. Pie chart of airline sentiment
This code generates a pie chart illustrating the distribution of sentiment categories ('Negative',
'Neutral', 'Positive') within the 'airline_sentiment' column of the 'train_df' DataFrame.
The sizes of the pie slices are determined by the counts of each sentiment category. Labels
are provided for each slice ('Negative', 'Neutral', 'Positive'). The 'autopct' parameter formats
the numerical values displayed on the pie chart to show the percentage with two decimal
places. Finally, the pie chart is displayed using matplotlib's 'plt.show()' function.
The pie chart shows that 62.40% of the tweets are negative (blue section), 21.19% are
neutral (orange section), and 16.41% are positive (green section).
6.4 Data Preprocessing
Categorical Encoding
Categorical encoding transforms categorical variables into numerical format for analysis or
machine learning; methods include ordinal, one-hot, label, frequency, and target encoding.
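As an illustration of two of the methods listed above, the sketch below implements label encoding and one-hot encoding in plain Python. `label_encode` and `one_hot_encode` are hypothetical helpers; integer codes are assigned in sorted class order, which is also how scikit-learn's LabelEncoder orders its classes.

```python
def label_encode(values):
    """Map each distinct category to an integer, in sorted category order."""
    classes = sorted(set(values))
    mapping = {c: i for i, c in enumerate(classes)}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Represent each value as a 0/1 vector with one position per category."""
    classes = sorted(set(values))
    return [[1 if v == c else 0 for c in classes] for v in values]

sentiments = ["neutral", "positive", "negative", "negative"]
codes, mapping = label_encode(sentiments)
print(mapping)
print(codes)
print(one_hot_encode(sentiments)[0])
```

With the three sentiment labels used in this project, sorted order yields negative = 0, neutral = 1, and positive = 2.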
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
''' The code imports the LabelEncoder class from the sklearn.preprocessing module and creates an instance of it called le,
which is used to transform categorical variables into numerical labels by assigning each unique category a unique integer.'''
train_df['airline_sentiment'] = le.fit_transform(train_df['airline_sentiment'])
'''This code uses the LabelEncoder instance (le)
to transform the categorical column 'airline_sentiment' in the 'train_df' DataFrame into numerical values.'''
import string
string.punctuation
# This code returns a string containing all punctuation characters.
import nltk
nltk.download('stopwords')
# This code downloads the stopwords corpus from NLTK (Natural Language Toolkit).
True
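import numpy as np
# Assumed definition: the cell that created 'stopwords' is missing from the report.
stopwords = nltk.corpus.stopwords.words('english')
np.array(stopwords)
# This code converts the list of English stopwords into a NumPy array.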
train_df.head()
def remove_punc(text):
    # Keep only the characters that are not punctuation.
    text = "".join([char for char in text if char not in string.punctuation])
    return text
train_df.head()
# based on the predefined English stopwords list, and joins the remaining words back into a string.
# 2. 'remove_punc': This function removes any punctuation characters from the input text.
# These functions are then applied to the 'text' column of the 'train_df' DataFrame using the 'apply' method:
# - The 'remove_stopwords' function is applied first to remove stopwords.
# - The 'remove_punc' function is applied subsequently to remove punctuation.
# The resulting cleaned text is stored in a new column named 'text_clean' in the 'train_df' DataFrame.
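The two cleaning steps described in the comments above can be sketched as a self-contained example. The report's own `remove_stopwords` definition is not shown, so the version below is an assumption, and `STOPWORDS` is a tiny hypothetical list standing in for NLTK's English stopword list.

```python
import string

# Hypothetical, deliberately small stopword set; the project uses NLTK's English list.
STOPWORDS = {"the", "a", "is", "was", "my", "to", "and"}

def remove_stopwords(text):
    """Drop words found in the stopword set and rejoin the rest."""
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

def remove_punc(text):
    """Remove every punctuation character from the text."""
    return "".join(ch for ch in text if ch not in string.punctuation)

def clean(text):
    """Apply stopword removal first, then punctuation removal."""
    return remove_punc(remove_stopwords(text))

print(clean("@united the flight was late, and my bag is lost!"))
```

One consequence of this order is that a stopword with punctuation attached (for example "the,") is not recognized as a stopword, so applying punctuation removal first may be worth testing.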
x=train_df.text
y=train_df.airline_sentiment
# This code assigns the 'text' column from the DataFrame 'train_df' to the variable 'x',
# representing the input text data, and assigns the 'airline_sentiment' column to the variable 'y',
# representing the target sentiment labels.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=40)
'''
This code splits the input text data ('x') and target sentiment labels ('y') into training and testing
sets (80% training, 20% testing) using a random seed of 40 for reproducibility, and assigns them to the
respective variables 'x_train', 'x_test', 'y_train', and 'y_test'.'''
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
(8784,)
(8784,)
(2196,)
(2196,)
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
CV=CountVectorizer(stop_words='english')
"""
This code initializes a CountVectorizer object named 'CV', which is used to convert a collection of text documents into a
matrix of token counts, excluding English stopwords during the tokenization process."""
TV=TfidfVectorizer(stop_words='english')
""" This code initializes a TfidfVectorizer object named 'TV', which is used to convert a collection of
text documents into a matrix of TF-IDF (Term Frequency-Inverse Document Frequency) features, excluding
English stopwords during the tokenization process."""
x_train=CV.fit_transform(x_train)
'''
This code fits the CountVectorizer 'CV' to the training data 'x_train' and transforms it into
a matrix of token counts, where each row represents a document and each column represents a unique
token (word), capturing the frequency of each token's occurrence in the respective documents.'''
x_train
"""
The variable x_train represents the transformed training data after applying the CountVectorizer,
where each row corresponds to a document (text sample) and each column corresponds to a unique
token (word), storing the token counts for each document."""
x_test=CV.transform(x_test)
'''
This code transforms the testing data 'x_test' using the CountVectorizer 'CV',
maintaining the same tokenization scheme learned from the training data, and produces
a matrix of token counts representing the testing data.'''
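The fit-on-train, transform-on-test pattern used above can be illustrated with a minimal bag-of-words sketch. `fit_vocabulary` and `transform` are hypothetical helpers that mimic the idea behind CountVectorizer using simple whitespace tokenization; they do not reproduce its tokenizer or stopword handling.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each distinct training token to a column index (sorted for stability)."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def transform(docs, vocab):
    """Turn each document into a row of token counts; unseen tokens are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows

# Invented example documents.
train_docs = ["late flight late bag", "great crew"]
test_docs = ["great flight tomorrow"]

vocab = fit_vocabulary(train_docs)
print(vocab)
print(transform(train_docs, vocab))
print(transform(test_docs, vocab))
```

The test document's word "tomorrow" has no column because the vocabulary is learned from the training data only, which keeps the training and testing matrices the same width.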
from sklearn.naive_bayes import MultinomialNB
model_nb=MultinomialNB()
model_nb.fit(x_train,y_train)
# This code trains the Multinomial Naive Bayes classifier 'model_nb' using the training
# data 'x_train' (vectorized text data) and corresponding target labels 'y_train' (sentiment labels).
MultinomialNB()
y_nb_pred=model_nb.predict(x_test)
# This code uses the trained Multinomial Naive Bayes classifier 'model_nb' to predict the sentiment labels
# for the testing data 'x_test' and assigns the predictions to the variable 'y_nb_pred'.
Model Evaluation
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score,classification_report,roc_auc_score,confusion_matrix
# This code imports various metrics and evaluation functions from scikit-learn, including accuracy_score,
# precision_score, recall_score, f1_score, classification_report, roc_auc_score, and confusion_matrix, for
# evaluating the performance of classification models.
naive_bayes_accuracy=accuracy_score(y_test,y_nb_pred)
naive_bayes_accuracy
# This code calculates the accuracy of the Multinomial Naive Bayes classifier on
# the testing data by comparing the predicted labels ('y_nb_pred') with the actual labels ('y_test')
# and assigns the result to the variable 'naive_bayes_accuracy'.
0.755464480874317
print(classification_report(y_test,y_nb_pred))
''' This code prints a classification report, which includes precision, recall, F1-score, and
support for each class, based on the comparison between the actual labels ('y_test') and the
predicted labels ('y_nb_pred') using the Multinomial Naive Bayes classifier.'''
confusion_matrix(y_test,y_nb_pred)
This code generates a confusion matrix, a table whose rows are the actual labels ('y_test') and
whose columns are the labels predicted by the Multinomial Naive Bayes classifier ('y_nb_pred').
With three sentiment classes it is a 3×3 table: the diagonal holds the correctly classified
tweets, and each off-diagonal cell counts one kind of misclassification.
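To show how the metrics in the classification report relate to the confusion matrix, here is a small sketch in plain Python. The labels are invented, and `build_confusion_matrix` and `precision_recall` are illustrative helpers that follow the same rows-are-actual, columns-are-predicted convention as scikit-learn.

```python
def build_confusion_matrix(y_true, y_pred, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix

def precision_recall(matrix, k):
    """Precision and recall for class k, read off the matrix."""
    true_positive = matrix[k][k]
    predicted_k = sum(row[k] for row in matrix)  # column total
    actual_k = sum(matrix[k])                    # row total
    precision = true_positive / predicted_k if predicted_k else 0.0
    recall = true_positive / actual_k if actual_k else 0.0
    return precision, recall

# Invented labels: 0 = negative, 1 = neutral, 2 = positive.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 0, 2]
cm = build_confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
accuracy = sum(cm[i][i] for i in range(3)) / len(y_true)
print(cm)
print(accuracy, precision_recall(cm, 0))
```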
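Logistic regression is used for binary classification: a sigmoid function takes a weighted
combination of the independent variables as input and produces a probability value between 0
and 1. For the three sentiment classes here, one binary classifier is fitted per class (one-vs-rest).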
x_train=x_train.toarray()
x_test=x_test.toarray()
# Convert the sparse count matrices into dense NumPy arrays.
from sklearn.linear_model import LogisticRegression
lr=LogisticRegression(multi_class='ovr', n_jobs=-1)
# This code imports the LogisticRegression class from scikit-learn and initializes a logistic regression model named 'lr' with
# the 'ovr' (one-vs-rest) strategy for multi-class classification and parallel processing enabled using all available CPU cores
# ('n_jobs=-1').
lr.fit(x_train,y_train)
LogisticRegression(multi_class='ovr', n_jobs=-1)
y_lr_pred=lr.predict(x_test)
logistic_accuracy=accuracy_score(y_test,y_lr_pred)
logistic_accuracy
0.7891621129326047
print(classification_report(y_test,y_lr_pred))
accuracy 0.79 2196
macro avg 0.74 0.70 0.72 2196
weighted avg 0.78 0.79 0.78 2196
confusion_matrix(y_test,y_lr_pred)
A decision tree is a flowchart-like tree structure where each internal node denotes the feature,
branches denote the rules and the leaf nodes denote the result of the algorithm. It is a versatile
supervised machine-learning algorithm, which is used for both classification and regression
problems.
from sklearn.tree import DecisionTreeClassifier
"""
This code imports the DecisionTreeClassifier class from
the scikit-learn library, which is used to create a decision
tree model for classification tasks."""
dtc = DecisionTreeClassifier()
dtc.fit(x_train,y_train)
DecisionTreeClassifier()
y_pred_dtc= dtc.predict(x_test)
# This code uses the trained DecisionTreeClassifier model 'dtc' to predict the
# target labels for the testing data 'x_test', and assigns the predicted labels
# to the variable 'y_pred_dtc'.
decisiontree_accuracy=accuracy_score(y_test,y_pred_dtc)
decisiontree_accuracy
0.6584699453551912
print(classification_report(y_test,y_pred_dtc))
weighted avg 0.68 0.66 0.67 2196
confusion_matrix(y_test,y_pred_dtc)
Random Forest is a classifier that builds a number of decision trees on various subsets of
the given dataset and combines their predictions (by majority vote for classification) to
improve predictive accuracy.
from sklearn.ensemble import RandomForestClassifier
#This code imports the RandomForestClassifier class from the scikit-learn library,
#which is used to create a random forest model for classification tasks.
rfc=RandomForestClassifier()
rfc.fit(x_train,y_train)
RandomForestClassifier()
y_pred_rfc=rfc.predict(x_test)
rf_accuracy=accuracy_score(y_test,y_pred_rfc)
rf_accuracy
0.7454462659380692
print(classification_report(y_test,y_pred_rfc))
confusion_matrix(y_test,y_pred_rfc)
6.5.2 Model Creation with the XGBoost
XGBoost, short for eXtreme Gradient Boosting, is a machine learning algorithm known for
its efficiency, speed, and accuracy. It belongs to the family of boosting algorithms, which are
ensemble learning techniques that combine the predictions of multiple weak learners.
from xgboost import XGBRFClassifier
'''This line imports the XGBRFClassifier class from the XGBoost library,
which is used for fitting an XGBoost Random Forest classifier.'''
xgb=XGBRFClassifier(n_jobs=-1)
xgb.fit(x_train,y_train)
y_xgb_pred=xgb.predict(x_test)
xgb_accuracy=accuracy_score(y_test,y_xgb_pred)
xgb_accuracy
0.703551912568306
print(classification_report(y_test,y_xgb_pred))
confusion_matrix(y_test,y_xgb_pred)
y=[naive_bayes_accuracy,logistic_accuracy,decisiontree_accuracy,rf_accuracy,xgb_accuracy])
Our results indicate that logistic regression performed best among the tested models (about
78.9% test accuracy), followed by Naive Bayes (75.5%), Random Forest (74.5%), XGBoost
(70.4%), and Decision Tree (65.8%). These accuracies suggest that the system classifies most
tweets correctly as positive, negative, or neutral, although roughly one in five test tweets
is still misclassified by the best model.
Prediction Function: The predict_sentiment function preprocesses the input comment using
the loaded CountVectorizer and then predicts the sentiment using the pre-trained logistic
regression model.
User Interaction: Upon clicking the "Predict" button, the input comment is processed, and the
predicted sentiment (positive, neutral, or negative) is displayed to the user.
Error Handling: The code includes basic error handling to ensure that users are prompted to
input a comment before making predictions.
Model Saving: Additionally, the code saves the trained model using pickle for future use,
ensuring that the model can be loaded and deployed efficiently without the need for
retraining.
Overall, this deployment code provides a simple yet effective interface for users to perform
sentiment analysis on individual comments in real-time. The use of Streamlit simplifies the
deployment process, while pickle allows for easy model persistence.
import pickle
# Assuming CV is your CountVectorizer object
with open("count_vectorizer.pkl", 'wb') as cv_file:
    pickle.dump(CV, cv_file)
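Loading is the mirror image of saving. The sketch below shows a pickle round trip using an in-memory buffer and a plain dictionary as a hypothetical stand-in for the fitted CountVectorizer, so it runs without any saved file.

```python
import io
import pickle

# Hypothetical stand-in object; in the project this would be the fitted CountVectorizer.
vectorizer_stand_in = {"vocabulary": {"flight": 0, "late": 1}}

buffer = io.BytesIO()                     # plays the role of the .pkl file
pickle.dump(vectorizer_stand_in, buffer)  # serialize
buffer.seek(0)                            # rewind before reading
restored = pickle.load(buffer)            # deserialize

print(restored == vectorizer_stand_in)
```

Pickle files can execute arbitrary code when loaded, so the deployed application should only load files it created itself.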
Streamlit Code:
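import streamlit as st
import pickle
from sklearn.feature_extraction.text import CountVectorizer

# Load the fitted CountVectorizer saved above.
with open("count_vectorizer.pkl", 'rb') as cv_file:
    CV = pickle.load(cv_file)

# Load the trained logistic regression model.
# NOTE: "logistic_model.pkl" is an assumed file name; the report does not show where the model is saved.
with open("logistic_model.pkl", 'rb') as model_file:
    lr = pickle.load(model_file)

def predict_sentiment(comment):
    # Vectorize the comment with the fitted CountVectorizer, then classify it.
    features = CV.transform([comment]).toarray()
    return lr.predict(features)[0]

# Streamlit UI
st.title('Sentiment Analysis')
# Input comment
comment = st.text_input('Enter your comment:')
# Prediction
if st.button('Predict'):
    if comment:
        sentiment = predict_sentiment(comment)
        # LabelEncoder numbers the classes in sorted order: negative=0, neutral=1, positive=2.
        if sentiment == 2:
            st.write('Sentiment: Positive')
        elif sentiment == 1:
            st.write('Sentiment: Neutral')
        else:
            st.write('Sentiment: Negative')
    else:
        st.write('Please enter a comment to predict.')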
CHAPTER 7
The culmination of our efforts in this project has yielded a sentiment analysis framework adept
at discerning the emotional nuances and underlying sentiments expressed in Twitter data. By
leveraging machine learning algorithms such as logistic regression, Naive Bayes, decision tree
classifier, random forest, and XGBoost, we've not only achieved accurate classification of
tweets but also extracted meaningful patterns and insights from the data. Our model, trained on
the airline sentiment dataset, has provided comprehensive insights into customer opinions and
experiences with various airline services, thus enabling stakeholders to make informed
decisions and engage effectively with users.
The impact of our sentiment analysis framework extends beyond mere classification, as it
equips businesses with actionable insights to address customer concerns, enhance service
quality, and ultimately elevate customer satisfaction. By leveraging these insights, stakeholders
can navigate the dynamic landscape of social media, cultivate positive interactions with their
audience, and drive improvements in brand perception and market competitiveness. In essence,
our project underscores the transformative power of sentiment analysis in empowering
businesses to adapt, innovate, and thrive in the digital age.
CHAPTER 8
CONCLUSION
This project analyzed the sentiment of public opinion towards US airlines on Twitter, using a
range of machine learning algorithms. We compared the performance of Naive Bayes, Logistic
Regression, XGBoost, Random Forest, and Decision Tree, evaluating their effectiveness in
classifying tweets as positive, negative, or neutral. Our findings offer insight into customer
sentiment and pave the way for the deployment of a user-friendly tool for further analysis.
Among the tested algorithms, Logistic Regression achieved the highest accuracy, approximately
78.9%, ahead of Naive Bayes (75.5%), Random Forest (74.5%), XGBoost (70.4%), and Decision
Tree (65.8%). This indicates that it was the most reliable of the five at identifying the
sentiment of tweets directed at US airlines. While the other algorithms performed reasonably
well, this accuracy advantage is the reason Logistic Regression was selected for deployment.
Using Streamlit, we deployed the logistic regression model as a user-friendly web application.
This interactive interface lets users enter an individual tweet or comment and see its
predicted sentiment, providing insight into public perception.
Looking forward, this project establishes a foundation for further exploration. By
incorporating additional features like sentiment lexicon expansion and exploring deep learning
techniques, we can refine the model's accuracy and delve deeper into the nuances of public
sentiment. The deployed Streamlit application can be further enhanced to offer functionalities
like sentiment visualization and trend analysis, providing airlines with a
comprehensive picture of public perception on various aspects of their services.
REFERENCES