Subject:
Natural Language Processing
Submitted by:
MeharunNisa 2022-MSCS-212
Muhammad Haris 2022-MSCS-214
Saqib Yaseen 2022-MSCS-209
Feature Engineering for NLP Tasks
Contents

3 Feature Engineering
  3.1 Importance of Feature Engineering
  3.2 Feature Engineering in NLP
    3.2.1 Preprocessing
    3.2.2 Feature Extraction
    3.2.3 Classification
  3.3 Feature Engineering Techniques for NLP Tasks
    3.3.1 Bag of Words
    3.3.2 TF-IDF
    3.3.3 N-Grams
    3.3.4 Word2Vec & GloVe
  3.4 Movie Reviews Sentiment Analysis Using Bag of Words
    3.4.1 Methodology
    3.4.2 Code
    3.4.3 Model Evaluation
    3.4.4 Flowchart
    3.4.5 The Dataset
    3.4.6 Dataset Features and Target
    3.4.7 Results
  3.5 Fake News Detection Using TF-IDF
    3.5.1 Methodology
    3.5.2 Code
  3.6 Emotion Detection From Text Using Word2Vec
    3.6.1 Methodology
    3.6.2 Code
    3.6.3 Evaluating and Outputting Results
    3.6.4 Classifier Evaluation
Chapter 3
Feature Engineering
When training a machine learning model to predict classes, for example to identify whether a picture shows a girl or a boy, the training results will be clearer and more accurate if the model knows what a girl looks like and what a boy looks like. In short, the model should know the features of each class: either all of their features, or only those that differentiate the two. The first thing that comes to mind is the face, which provides significant differences for identifying the gender. In simple terms, this is all that feature engineering is.
Raw data is taken from the datasets, then cleaned and transformed into the required shape to obtain features.

[Figure: Raw Data → Feature Engineering (transform) → Machine Learning Model (train)]
There are thousands of examples; a common and relevant one for readers in the IT industry comes from data analysis. Suppose a shopkeeper hires a data analyst and asks for yearly statistics on the customers visiting the shop, broken down by gender and age, so that after looking at the statistics the shopkeeper can stock more or fewer of certain clothes or items according to customer demand.

What will the data analyst do? First, a dataset is required, such as the footage from the camera that monitors customer movement in the shop. After recognizing a person in the footage, the next step is to identify the gender, which requires certain points that distinguish a female from a male. This is where feature engineering comes into the picture: it provides the features to use, and the same applies later when the age analysis is done. The results will be binary or numeric values, which can be used as tabular data or plotted as graphs for more detailed visualization.
The definition can now be summarized as follows: feature engineering is the process of effectively converting raw data into features that are more valuable and better suited for statistical or machine learning studies. This procedure improves the interpretability of data, allowing for better informed decision-making. (?)
3.2 Feature Engineering in NLP
3.2.1 Preprocessing
Preprocessing is the first step in data preparation. For example, in web scraping, raw data taken directly from a website or some other medium passes through various transformations: cleaning, such as removing punctuation that adds no value to the meaning of the data; organizing, such as grouping semantically similar data to optimize it; and enhancing, such as formatting it into what is required. All of this makes the data suitable for analysis or modeling, and addresses issues such as missing values, outliers, and noise.
In NLP, common preprocessing techniques include:
• Punctuation removal
• Stop-word removal
• Stemming
• Lemmatization
The objective of preprocessing is to ensure data consistency and usability, reducing the impact of irrelevant or problematic information on future analysis or modeling operations. (?)
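As an illustration, here is a minimal preprocessing sketch using NLTK; the sample sentence and resource names are assumptions, and newer NLTK versions may additionally require the 'punkt_tab' resource:

import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

text = "The cats are sitting on the mats!"
tokens = word_tokenize(text.lower())                                 # tokenize + lowercase
tokens = [t for t in tokens if t not in string.punctuation]          # remove punctuation
tokens = [t for t in tokens if t not in stopwords.words('english')]  # remove stop words
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in tokens])                     # ['cat', 'sitting', 'mat']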
3.2.2 Feature Extraction
After preprocessing, the text is converted into numerical features. Common feature extraction techniques include:
• Vectorization
• Bag of Words
• TF-IDF
• N-grams
3.2.3 Classification
The basic goal of supervised machine learning, particularly in classification tasks, is to assign incoming data to predetermined classes based on its qualities. This approach comprises training a model on a dataset in which each instance is already assigned a label. The model learns to identify patterns and correlations between the input qualities and their labels. Once successfully trained, it can be used to predict the categories of fresh, previously unseen data. This methodology is extensively employed in different sectors, such as filtering emails for spam, recognising photos, and diagnosing medical disorders, with the purpose of automating the sorting of data into designated categories. (?)
3.3 Feature Engineering Techniques for NLP Tasks
3.3.1 Bag of Words
In the context of feature engineering within NLP, the Bag of Words (BoW) model trans-
forms text data into a numerical format, specifically into vectors. This method treats a
document as a collection of words without considering their order or grammatical struc-
ture. Essentially, it involves creating a unique index for every distinct word in the text
and then quantifying the document by tallying the occurrences of each word. This results
in a vector where each word is represented by its frequency count in the text, effectively
turning textual information into data that can be analyzed computationally.
[Figure: a short text transformed by Bag of Words into a count vector, e.g. (0, 1, 1, 0)]
In the example above, the data is in tabular form, with the count of each word displayed for each sentence. This is a basic example, but paragraphs and documents contain many punctuation marks, pronouns, verb forms, adjectives, and so on. In that case, those are first removed or reduced to their base words.
The bag-of-words output can be viewed in sparse form, as (document, word index) → count pairs, and in dense form, as one count per vocabulary word. For the two lines of text 'The fox jumps over the lazy dog.' and 'Dog and fox are lazy':

Sparse pairs, document 0: (0,7) 2, (0,3) 1, (0,4) 1, (0,6) 1, (0,5) 1, (0,2) 1
Sparse pairs, document 1: (1,3) 1, (1,5) 1, (1,2) 1, (1,0) 1, (1,1) 1

Dense table (vocabulary in alphabetical order):

Sentence                             and  are  dog  fox  jumps  lazy  over  the
'The fox jumps over the lazy dog.'    0    0    1    1    1      1     1     2
'Dog and fox are lazy'                1    1    1    1    0      1     0     0
In the example above there are two lines of text. The words are separated and the frequency (count) of each word is calculated, as shown in the sparse representation. The pairs with 0 as the first coordinate belong to the first sentence, and those with 1 belong to the second. Note that 'the' occurs twice in the first line, so its count is 2, while 'fox' occurs once, so its value is 1, and the same goes for the rest of the words.
The dense table shows the counts for both sentences side by side, giving a more explicit parallel view. This is how Bag of Words works, computing numerical values according to word occurrences.
1. Source: https://miro.medium.com/v2/resize:fit:828/format:webp/1*38q3qy950o1auKtd1i41Yg.png
2. Source: http://www.prathapkudupublog.com/2019/04/bag-of-words.html
To enhance this further, a few raw-data cleaning steps can also be used in the process, which are as follows:
• Punctuation
• Tokenization
• Stop words
• Lemmatization
• Stemming
Tokenization
It is the process of breaking down text into words or other meaningful parts known as
tokens. It assists in generating a bag of individual words from the input text.
Lowercasing
To ensure that different casings of the same word are counted as one, it is good practice to convert all words to lowercase.
Removing Punctuation
For more clarity and to simplify the vocabulary it is useful to eliminate punctuation marks
from the text.
Stemming
To avoid variations of similar terms, it’s important to reduce them to their base or root
form (e.g., ”programming” to ”program”).
Lemmatization
To provide more accurate representations, words are reduced to their basic or dictionary
form, known as the lemma, based on their meaning.
[Figure: preprocessing pipeline. Input: 'The quick brown fox jumps over the lazy dog.' → Tokenization: (The, quick, brown, fox, jumps, over, the, lazy, dog, .) → Lowercasing → Removing Punctuation → Stemming → Lemmatization → Output: (quick, brown, fox, jump, lazi, dog)]
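The stemmed form 'lazi' in the output above is genuinely what a Porter-family stemmer produces; a quick check with NLTK's SnowballStemmer (a sketch, assuming punctuation and stop words have already been removed):

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')
words = ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
print([stemmer.stem(w) for w in words])  # ['quick', 'brown', 'fox', 'jump', 'lazi', 'dog']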
Example
# BOW Example
from sklearn.feature_extraction.text import CountVectorizer
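# Only the import survives in the original listing; a minimal sketch of the
# rest, using the two sentences from the worked example above
# (get_feature_names_out assumes scikit-learn >= 1.0):
corpus = ['The fox jumps over the lazy dog.', 'Dog and fox are lazy']
cv = CountVectorizer()
X = cv.fit_transform(corpus)
print(cv.get_feature_names_out())  # ['and' 'are' 'dog' 'fox' 'jumps' 'lazy' 'over' 'the']
print(X.toarray())                 # [[0 0 1 1 1 1 1 2]
                                   #  [1 1 1 1 0 1 0 0]]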
Figure 3.5: Bow example code result
3.3.2 TF-IDF
In the context of feature engineering for NLP, TF-IDF, short for term frequency-inverse
document frequency, is a statistical metric used to assess the significance of a word in the
context of a document that is part of a larger corpus.
This metric is calculated by integrating two crucial statistics: the frequency of a term
in a single text and its inverse frequency across a larger set of documents. In essence,
TF-IDF grows according to the number of times a word appears in a document, offset by
the term’s frequency in the corpus. This makes it a useful tool for text analysis tasks,
which are often used in automated systems. In machine learning models for NLP, TF-IDF
serves as an informative feature for recognising and ranking the significance of words within
documents.
Importance
TF-IDF is important in Natural Language Processing (NLP), notably for document re-
trieval and analysis. This strategy increases the importance of terms based on their fre-
quency within a given document. However, it strikes a balance by taking into account the
term’s popularity across a large number of papers. As a result of their ubiquity, regularly
used phrases such as ”this”, ”what”, or ”if” are accorded reduced value. These words
typically have less relevance in the context of a given work. In contrast, a phrase such as
”bug” obtains importance in a certain context if it appears frequently in that document
but is uncommon elsewhere. For example, when analysing consumer feedback for software
products, the frequent use of the term ”bug” in comments could be a significant predictor
of talks about software dependability difficulties. This is especially important when cate-
gorising input into topics, as terms like ”bug” are likely to be associated with discussions
regarding product performance or quality.
By incorporating TF-IDF into feature engineering for NLP, we can successfully dis-
tinguish between everyday language and key terms that identify the subject matter of
individual texts, increasing the analytical depth of our NLP applications.
In conjunction with TF, the inverse document frequency (IDF) is crucial. This statistic
measures the rarity or commonality of a phrase throughout the full collection of documents.
The essence of this approach is to invert the proportion of documents containing the phrase,
then apply a logarithmic scale. This technique efficiently magnifies the weight of uncommon
terms, providing a balance against too common terms that may not contribute meaning-
fully to the text’s distinctive context.
The intersection of these two metrics, TF and IDF, yields the TF-IDF score. This score
reflects a term’s relevance within a certain document. Higher TF-IDF scores indicate that
phrases are not only common in a specific document, but also unique over the entire docu-
ment set. This dual emphasis makes TF-IDF a strong tool in feature engineering for NLP.
It provides both a quantitative and qualitative lens to understand and extract meaningful
patterns from textual input.
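In its textbook form, the score of a term t in a document d, given N documents of which df(t) contain t, is

\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log \frac{N}{\text{df}(t)}

(Implementations differ slightly; scikit-learn, for instance, uses the smoothed variant idf(t) = ln((1 + N) / (1 + df(t))) + 1.)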
The vectorization approach used can have a substantial impact on the results.
Term Frequency-Inverse Document Frequency, or TF-IDF, is an important factor in this
conversion process. It statistically assesses a word’s relevance in a document, with its value
falling (approaching zero) when the word appears frequently in multiple texts, signalling
lower importance. In contrast, a TF-IDF value close to one indicates the word’s scarcity
across documents, implying more relevance. To calculate a word’s TF-IDF score, multiply
its term frequency by its inverse document frequency, with higher scores suggesting words
that are more relevant to the specific text in question. This numerical representation is
necessary for machine learning models to successfully comprehend and analyse text data.
Information extraction
The most relevant results for your search can be obtained by using TF-IDF, which was
developed for document search. Let’s say someone uses your search engine to find LeBron.
The results will be shown in the most relevant order. In other words, since TF-IDF assigns
a greater score to the phrase ”LeBron,” sports articles that are more pertinent will be
ranked higher.
It is highly probable that every search engine you have ever used incorporated TF-IDF scores into its algorithm. TF-IDF is also a helpful tool for extracting keywords from text. How? Since the words with the highest scores are the most pertinent to the document, they can be regarded as keywords for that document. Quite simple.
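A minimal keyword-extraction sketch along these lines, using scikit-learn's TfidfVectorizer; the toy corpus is an assumption:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["LeBron scored again as the team won",
        "the team lost the early game",
        "stocks fell while the market closed early"]
tfidf = TfidfVectorizer(stop_words='english')
X = tfidf.fit_transform(docs)
terms = tfidf.get_feature_names_out()
for i in range(len(docs)):
    row = X[i].toarray().ravel()
    # The highest-scoring term is the most distinctive word of the document
    print(f"Top term in doc {i}:", terms[row.argmax()])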
3.3.3 N-Grams
N-grams are contiguous sequences of n neighbouring elements in a text: words, symbols, or tokens. Their most significant applications are in NLP (Natural Language Processing) tasks involving text data.
Figure 3.7: N-grams Work Flow
Importance
N-gram models are widely used in a variety of applications where the modelling inputs are
n-gram distributions, including statistical natural language processing, speech recognition,
machine translation, phonemes and phoneme sequences, predictive text input, and many
others. Another typical application of n-grams is to generate features for supervised machine learning models such as SVMs, MaxEnt models, and Naive Bayes. The basic idea is to extend the unigram feature space with bigram (and trigram and higher-order n-gram) tokens.
Many NLP tasks involve n-grams of text. For example, when building language models in natural language processing, n-grams are used to create not only unigram models but also bigram and trigram models. Web-scale n-gram models, developed by tech giants like Google and Microsoft, can be applied to a range of NLP-related tasks, including word breaking, text summarization, and spelling correction. (?)
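For instance, a bigram language model estimates the probability of the next word from counts of adjacent word pairs; in the standard maximum-likelihood form (a textbook formula, not something specific to this chapter),

P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})}

where C(\cdot) counts occurrences in the training corpus.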
Trigrams are groupings of three words that appear in the same sentence, formed by shifting a window ahead one word at a time. For the sentence 'Cowards die many times before their deaths; the valiant never taste of death but once', the trigrams are 'Cowards die many', 'die many times', 'many times before', 'times before their', 'before their deaths', and so on, through 'the valiant never', 'valiant never taste', 'never taste of', 'taste of death', 'of death but', and 'death but once'.
4-grams: With this window we can combine four words at a time: 'Cowards die many times', 'die many times before', 'many times before their', and so on, up to 'taste of death but' and 'of death but once'. Similarly, we can pick n bigger than 4 or n smaller than 4 and create 5-grams, etc.
[Figure: TF-IDF keyword-extraction flow. Start → Collect Documents (a collection of different books) → Tokenize Text (break each book into individual words) → Calculate TF (count how often each word appears in a specific book) → Calculate IDF (assess how unique each word is across all books) → Calculate TF-IDF (multiply TF by IDF for each word in each book) → Select Keywords (identify words with high TF-IDF scores) → Use in Search → End]
# N-Grams Example
import nltk
from nltk import ngrams
from nltk.tokenize import word_tokenize

nltk.download('punkt')
# The original listing omits the input text; a sample sentence:
text = "Cowards die many times before their deaths"
tokens = word_tokenize(text)
# Generate bigrams
bigrams = list(ngrams(tokens, 2))
print(bigrams)
3.3.4 Word2Vec & GloVe
Deep learning has been shown to perform incredibly well when applied to NLP problems. The main idea is to feed phrases that are legible to humans into neural networks, so that the models can learn something from them.
Comprehending the preprocessing of textual material for neural networks is crucial. Raw text cannot be fed into a neural network, but numbers can; therefore, we must translate the words into a numerical format.
Importance
Word Maps
Word vectors are a far better way to represent words than one-hot encoded vectors (which require a lot of memory to represent text as the vocabulary grows). In one-hot encoding, no semantic meaning is attached to the index given to each word: the vectors for "dog" and "cat" are nearly as close to one another as the vectors for "dog" and "computer", so the neural network must work extremely hard to comprehend each word, because all words are viewed as totally separate entities. The goal of using word vectors is to address both of these problems.
”Word vectors retain the semantic representation of the word and take up far less space
than one hot encoded vector.”
Prior to delving into word vectors, it is crucial to remember that related words tend to
occur together more frequently than dissimilar terms.
Even though two words that appear close to one another do not necessarily have the same meaning, words with comparable meanings are frequently found together when we look at the frequency of these word co-occurrences.
Word2Vec
There are two architectures in Word2Vec: Skip-gram and CBOW (Continuous Bag of Words). Training either of them requires knowing which words appear near a given word. To do this, we use a tool known as a context window.
Recall the sentence "Deep Learning is very fun and hard". We must establish what is referred to as the window size; in this instance, let's say 2. We go through every word in the provided data (here, only one sentence) and consider the window of words that surrounds it. Each word will have four words associated with it, because our window size is two: we take into account the two words that come before and the two that come after the term.
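A small sketch of how such (word, context) pairs can be enumerated; the sentence and window size follow the example above, everything else is illustrative:

sentence = "Deep Learning is very fun and hard".split()
window = 2
pairs = [(sentence[i], sentence[j])
         for i in range(len(sentence))
         for j in range(max(0, i - window), min(len(sentence), i + window + 1))
         if i != j]
print(pairs[:4])  # [('Deep', 'Learning'), ('Deep', 'is'), ('Learning', 'Deep'), ('Learning', 'is')]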
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk

nltk.download('punkt')
# The original listing omits the training corpus and model creation; a
# minimal reconstruction with an assumed toy corpus and parameters
# (gensim >= 4.0 uses vector_size):
corpus = ["natural language processing is fun",
          "word vectors capture the meaning of a word"]
tokenized = [word_tokenize(s.lower()) for s in corpus]
model_w2v = Word2Vec(tokenized, vector_size=100, window=2, min_count=1)
# Example: Get the vector for the word "natural"
vector_natural = model_w2v.wv['natural']
print("Vector for 'natural':", vector_natural)
# Example: Get the vector for the word "word"
vector_word = model_w2v.wv['word']
print("Vector for 'word':", vector_word)
Figure 3.11: Word2Vec example results
3.4 Movie Reviews Sentiment Analysis Using Bag of Words
3.4.1 Methodology
This section introduces the sentiment analysis project, focusing on natural language processing techniques for binary classification of movie reviews into two sentiments, positive or negative. NLP techniques are used to extract the useful words of each review, and based on these words a binary classifier predicts whether the review's sentiment is positive or negative.
3.4.2 Code
## Importing modules ##
import re
import nltk
import joblib
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score
# Download the NLTK data needed below (the original called nltk.download()
# with no arguments, which opens the interactive downloader)
nltk.download('stopwords')
nltk.download('punkt')
## 1 | Data Preprocessing ##
"""Prepare the dataset before training"""
# 1.1 Load dataset
dataset = pd.read_csv('Dataset/IMDB.csv')
print(f"Dataset shape: {dataset.shape}\n")
print(f"Dataset head:\n{dataset.head()}\n")
# 1.2 Output counts
print(f"Dataset output counts:\n{dataset.sentiment.value_counts()}\n")
## 2 | Data cleaning ##
# 2.1 Remove HTML tags
def clean(text):
    cleaned = re.compile(r'<.*?>')
    return re.sub(cleaned, '', text)

# 2.4 Remove stopwords
def rem_stopwords(text):
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(text)
    return [w for w in words if w not in stop_words]
## 3 | Model Creation ##
"""Create model to fit it to the data"""
# 3.1 Creating Bag Of Words (BOW)
X = np.array(dataset.iloc[:, 0].values)
y = np.array(dataset.sentiment.values)
cv = CountVectorizer(max_features=2000)
X = cv.fit_transform(dataset.review).toarray()
print("=== Bag of words ===\n")
print(f"BOW X shape: {X.shape}")
print(f"BOW y shape: {y.shape}\n")
## 4 | Model Evaluation ##
"""Evaluate model performance"""
print(f"Gaussian accuracy = {round(accuracy_score(y_test, ypg), 2) * 100}%")
print(f"Multinomial accuracy = {round(accuracy_score(y_test, ypm), 2) * 100}%")
print(f"Bernoulli accuracy = {round(accuracy_score(y_test, ypb), 2) * 100}%")
Importing Modules
Importing necessary libraries and modules for data handling, natural language processing
(NLP), machine learning, and model evaluation. The required NLTK data is downloaded.
Data Preprocessing
• Load dataset: Reads the movie reviews dataset from a CSV file.
• Output counts: Displays the count of positive and negative sentiments in the dataset.
• Encode output column: Converts sentiment labels (’positive’ and ’negative’) into
binary values (1 for positive, 0 for negative).
Data Cleaning
Series of steps to clean and preprocess text data in reviews:
• Remove HTML tags: Eliminates HTML tags from reviews.
• Remove special characters: Retains alphanumeric characters, replacing others with
spaces.
• Convert to lowercase: Converts all text to lowercase.
• Remove stopwords: Eliminates common English stopwords (e.g., ’the’, ’and’).
• Stemming: Reduces words to their root form.
Model Creation
• Creating Bag Of Words (BOW): Uses CountVectorizer to create a Bag of Words
representation with a limit of 2000 features.
• Train-test split: Divides the dataset into training and testing sets.
• Define and Train models: Defines and trains Gaussian, Multinomial, and Bernoulli
Naive Bayes models on the training data.
• Save models: Saves the trained models for future use.
• Make predictions: Applies trained models to predict sentiments on the test data.
3.4.4 Flowchart
[Figure: Start → Importing Modules → 1 Data Preprocessing → 2 Data Cleaning → 3 Model Creation → 4 Model Evaluation → End]
Dataset Head:
3.4.7 Results
Figure 3.15: Movie Reviews After Removing Special Characters
Figure 3.17: After Removing Stop Words
3.5 Fake News Detection Using TF-IDF
3.5.1 Methodology
This section introduces fake news detection using TF-IDF, focusing on natural language processing techniques for binary classification, and follows a published model from Kaggle. (?)
3.5.2 Code
## Importing modules ##
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report, accuracy_score

# Load the dataset (filename from the accompanying text)
data = pd.read_csv('WELFake_Dataset.csv')
data.describe()
data.isnull().sum()
print("Original DataFrame shape:", data.shape)
data = data.dropna()
[Figure: fake news detection model. Source: https://www.researchgate.net/figure/Fake-news-detection-model_fig1_358004087]
print ( " New DataFrame shape : " , data . shape )
data . isnull () . sum ()
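# The training steps are elided in the original listing; a minimal
# reconstruction of the pipeline described in the text (TF-IDF features
# + Multinomial Naive Bayes). The column names 'text' and 'label' are
# assumptions about the WELFake dataset:
X_train, X_test, y_train, y_test = train_test_split(
    data['text'], data['label'], test_size=0.2, random_state=42)
model = make_pipeline(TfidfVectorizer(stop_words='english'), MultinomialNB())
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))

# Prediction function used in the example usages below
def predict_fake_news(text):
    return model.predict([text])[0]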
# Example usage
print(predict_fake_news("Ukraine is being invaded by russia."))
# Example usage
print(predict_fake_news("India won the fifa worldcup."))
Import Libraries:
• pandas: For data manipulation and analysis.
• train_test_split: For splitting the dataset into training and testing sets.
• TfidfVectorizer: For converting a collection of raw documents to a matrix of TF-IDF features.
• MultinomialNB: Naive Bayes classifier for multinomial models.
• make_pipeline: For creating a pipeline that applies a series of transformations followed by a final estimator.
• classification_report, accuracy_score: For evaluating the performance of the classification model.
Load the Dataset:
• Reads a CSV file (’WELFake Dataset.csv’) into a pandas DataFrame.
• Checks for missing values and drops rows with missing values.
Split Data:
• Splits the dataset into training and testing sets using the train_test_split function.
• Prints the classification report and accuracy score to evaluate the model’s perfor-
mance.
[Figure: Load Dataset → Data Exploration and Cleaning → Split Data → Evaluate Model Performance → Define Prediction Function → Example Usages]
3.6 Emotion Detection From Text Using Word2Vec

Each word gets a 1 × 3 vector over the dimensions Femininity, Youth, and Royalty:

Word       Femininity  Youth  Royalty
Man             0        0       0
Woman           1        0       0
Boy             0        1       0
Girl            1        1       0
Prince          0        1       1
Princess        1        1       1
Queen           1        0       1
King            0        0       1

Source: https://www.shanelynn.ie/get-busy-with-word-embeddings-introduction/
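Even with these toy vectors, arithmetic in the embedding space captures relationships; this is the classic analogy usually demonstrated with trained Word2Vec models (a sketch using the hand-crafted 3-dimensional vectors above, not output from a trained model):

import numpy as np

# Hand-crafted (Femininity, Youth, Royalty) vectors from the table above
vec = {"man":   np.array([0, 0, 0]),
       "woman": np.array([1, 0, 0]),
       "king":  np.array([0, 0, 1]),
       "queen": np.array([1, 0, 1])}
# king - man + woman lands exactly on queen
print(vec["king"] - vec["man"] + vec["woman"])  # [1 0 1], i.e. vec["queen"]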
3.6.2 Code
## Importing modules ##
import pandas as pd
import re
import numpy as np
import imblearn
from gensim.models import Word2Vec
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score, f1_score
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# Loading stop words and lemmatizer
nltk.download('stopwords')
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

import opendatasets as od
import pandas as pd

# Loading Data
data = pd.read_csv('/content/emotion-detection-from-text/tweet_emotions.csv')
# dataset info
data.info()
# Let's remove the extra column and clean up the dataset
data.drop(columns=['tweet_id'], inplace=True)
data.drop_duplicates(keep="first", inplace=True)
data.dropna(inplace=True)
# Preparing data for Word2Vec
w2v_data = [text.split() for text in data['cleaned_content']]
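# The steps between tokenization and classifier training are elided in the
# original listing (the 'cleaned_content' column is produced by the
# clean_and_preprocess function described below). A plausible reconstruction
# following the methodology described in the text (Word2Vec features,
# oversampling, train/test split); parameters and the 'sentiment' column
# name are assumptions:
model_w2v = Word2Vec(w2v_data, vector_size=100, window=5, min_count=1)
# Represent each tweet as the mean of its word vectors
X = np.array([
    np.mean([model_w2v.wv[w] for w in words if w in model_w2v.wv]
            or [np.zeros(100)], axis=0)
    for words in w2v_data
])
y = data['sentiment'].values
# Balance the emotion classes, then split
X, y = RandomOverSampler(random_state=42).fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)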
# Classifier SVM
svm = SVC()
svm.fit(X_train, y_train)
Importing Modules:
The required libraries and modules are imported. These include pandas for data manipula-
tion, regular expression (re) for text cleaning, numpy for numerical operations, imbalanced-
learn (imblearn) for handling imbalanced datasets, gensim for Word2Vec modeling, and
various classifiers and evaluation metrics from scikit-learn.
NLP Preprocessing:
NLTK is used for natural language processing tasks. Stop words and a lemmatizer are
loaded from NLTK's corpus. A custom function clean_and_preprocess is defined for
cleaning and preprocessing text data, which includes removing user references, URLs, spe-
cial characters, converting to lowercase, and lemmatization.
Data Cleaning:
The unnecessary column 'tweet_id' is dropped, and duplicate and missing values are re-
moved from the dataset.
3.6.3 Evaluating and Outputting Results:
The trained SVM classifier is used to predict the labels on the test set. The function output_result is defined to print accuracy, F1 score, and a classification report based on the predicted and true labels of the test set.
Bibliography
[1] https://towardsdatascience.com/what-is-feature-engineering-importance-tools-and-techniques-for-machine-learning-2080b0269f10
[2] https://www.analyticsvidhya.com/blog/2022/01/nlp-tutorials-part-ii-feature-extraction/
[3] https://monkeylearn.com/blog/what-is-tf-idf/
[4] https://www.shanelynn.ie/get-busy-with-word-embeddings-introduction/
[5] https://www.kaggle.com/code/ritabanmitra/fake-news-classifier/notebook
[7] https://github.com/omaarelsherif/Movie-Reviews-Sentiment-Analysis-Using-Machine-Learning
[8] https://spotintelligence.com/2023/03/25/nlp-feature-engineering
[9] https://www.kaggle.com/code/longtng/nlp-preprocessing-feature-extraction-methods-a-z
[10] https://www.kaggle.com/amar09/text-pre-processing-and-feature-extraction
[11] https://www.kaggle.com/ashishpatel26/beginner-to-intermediate-nlp-tutorial
[12] https://www.kaggle.com/ashutosh3060/nlp-basic-feature-creation-and-preprocessing
[13] https://www.kaggle.com/liananapalkova/simply-about-word2vec