
Supervisor:

Dr. Usman Ghani

Subject:
Natural Language Processing

Submitted by:
MeharunNisa 2022-MSCS-212
Muhammad Haris 2022-MSCS-214
Saqib Yaseen 2022-MSCS-209

Department of Computer Science


University of Engineering & Technology

Feature engineering for NLP Tasks

January 12, 2024


Contents

3 Feature Engineering
  3.1 Importance of Feature Engineering
  3.2 Feature Engineering in NLP
    3.2.1 Preprocessing
    3.2.2 Feature Extraction
    3.2.3 Classification
  3.3 Feature Engineering Techniques for NLP Tasks
    3.3.1 Bag of Words
    3.3.2 TF-IDF
    3.3.3 N-Grams
    3.3.4 Word2Vec & GloVe
  3.4 Movie Reviews Sentiment Analysis Using Bag of Words
    3.4.1 Methodology
    3.4.2 Code
    3.4.3 Model Evaluation
    3.4.4 Flowchart
    3.4.5 The Dataset
    3.4.6 Dataset Features and Target
    3.4.7 Results
  3.5 Fake News Detection Using TF-IDF
    3.5.1 Methodology
    3.5.2 Code
  3.6 Emotion Detection From Text Using Word2Vec
    3.6.1 Methodology
    3.6.2 Code
    3.6.3 Evaluating and Outputting Results
    3.6.4 Classifier Evaluation

List of Figures

3.1 Bag of Words Representation
3.2 Bag of Words detailed example
3.3 Modeling text with machine learning algorithms
3.4 Steps to enhance performance of BoW
3.5 BoW example code result
3.6 TF-IDF Work Flow
3.7 N-grams Work Flow
3.8 TF-IDF Work Flow
3.9 N-grams example code results
3.10 Word2Vec Work Flow
3.11 Word2Vec Example Results
3.12 Movie Review Dataset Head
3.13 Movie Review Dataset Head After Encoding
3.14 Movie Reviews After Removing HTML Tags
3.15 Movie Reviews After Removing Special Characters
3.16 Movie Reviews After Converting To Lower Case
3.17 After Removing Stop Words
3.18 After Applying Stemming
3.19 Accuracy
3.20 Fake News Detection Model
3.21 Fake News Detection Work Flow
3.22 Mapping words to vectors and transformation

Chapter 3
Feature Engineering

When training a machine learning model to predict classes, for example to decide whether a picture shows a girl or a boy, the results are clearer and more accurate if the model knows what a girl looks like and what a boy looks like. In short, the model should know the features of each class: either all of their characteristics, or only those that differentiate the two. The first thing that comes to mind is the face, which carries significant differences for identifying gender. In simple terms, this is what feature engineering is all about.
Raw data is taken from the datasets and then cleaned and transformed into the required shape to obtain features.

Raw Data → Feature Engineering (transform) → Machine Learning Model (train)
There are thousands of examples; a common and relevant one from the IT industry is data analysis. Suppose a shopkeeper hires a data analyst and asks for yearly statistics on customer visits to the shop with respect to gender and age, so that after looking at the statistics the shopkeeper can increase or decrease the stock of certain clothes or items according to customer demand.
What will the data analyst do? First, a dataset is required, for example the footage from the camera installed in the shop to monitor customer movement. After detecting each person, the next step is to identify the gender, which requires certain points that distinguish a female from a male. This is where feature engineering comes into the picture: it provides the features to use, and the same applies later when the age analysis is done. The results can be binary values or numbers, which can be used as tabular data or plotted as graphs for more detailed visualisation.

The definition can now be summarised as follows: feature engineering is the process of effectively converting raw data into features that are more valuable and better suited for statistical or machine learning studies. This procedure improves the interpretability of data, allowing for better informed decision-making. (? )

3.1 Importance of Feature Engineering


When working directly with raw data, data analysts used to face considerable complexity in refining it into something useful. It often required a lot of manual work, which reduced both efficiency and accuracy, and many in-depth analysis factors were ignored along the way. Feature engineering adds value to the extracted data by optimising and refining it, and it gives a more detailed understanding of the whole process.

3.2 Feature Engineering in NLP


Feature engineering allows the extraction of significant, meaningful features from text data. Since traditional machine learning algorithms and neural networks generally require numerical input, methods such as
• Bag of Words
• TF-IDF
• N-grams
• Word2Vec & GloVe
convert the given text into numerical representations that algorithms can use, while maintaining the semantic relationships and contextual information within the data.

Preprocessing → Feature Extraction → Classification

3.2.1 Preprocessing
Preprocessing is the first step in data preparation. In web scraping, for example, raw data taken directly from a website (or some other medium) passes through various transformations: cleaning, such as removing punctuation that does not add to the meaning of the data; organising, such as grouping similar data by meaning to optimise it; and enhancing, such as formatting it into what is required. All of this makes the data suitable for analysis or modelling and addresses issues such as missing values, outliers, and noise. In NLP, common preprocessing techniques include

• Punctuation removal

• Stop word removal

• Stemming

• Lemmatization

The objective of preprocessing is to ensure data consistency and usability, reducing the impact of irrelevant or problematic information on subsequent analysis or modelling operations. (? )
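
A minimal sketch of these steps with NLTK follows, assuming the 'punkt', 'stopwords', 'wordnet', and 'omw-1.4' resources can be downloaded; the sample sentence and the outputs shown in the comments are illustrative only.

# Minimal preprocessing sketch (assumes the NLTK resources below are available)
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

text = "The cats are running quickly, over the fences!"

# Tokenize and lowercase
tokens = [t.lower() for t in word_tokenize(text)]

# Remove punctuation and stop words
stop_words = set(stopwords.words('english'))
tokens = [t for t in tokens if t not in string.punctuation and t not in stop_words]

# Stemming reduces words to a crude root form
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])          # e.g. ['cat', 'run', 'quickli', 'fenc']

# Lemmatization maps words to their dictionary form
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in tokens])  # e.g. ['cat', 'running', 'quickly', 'fence']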

3.2.2 Feature Extraction


Once a model is trained, new data is tested on it. But what if it does not evaluate new, unseen data as it should, given what it was trained on? This is where the problem lies, so after preprocessing, feature extraction transforms the raw data into a set of relevant and informative features that represent the underlying patterns or characteristics of the data, so that when new data is introduced the model is able to pick up those features. Feature extraction involves selecting, transforming, or creating new features from the original dataset, aiming to capture the most significant information for a given task. In NLP, common feature extraction techniques include

• Vectorization

• Bag of Words

• TF-IDF

• N-grams

3.2.3 Classification
The basic goal of supervised machine learning, particularly in classification tasks, is to assign incoming data to predetermined classes based on its attributes. The approach comprises training a model on a dataset in which each instance is already assigned a label; the model learns to identify patterns and correlations between the input attributes and their labels. Once successfully trained, it can be used to predict the categories of new, previously unseen data. This methodology is widely employed in different sectors, such as filtering emails for spam, recognising images, and diagnosing medical disorders, with the purpose of automating the sorting of data into designated categories. (? )
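
As a small hedged illustration of this idea, the sketch below trains a Naive Bayes classifier on a tiny invented set of labelled sentences and then predicts the class of unseen text; the data, labels, and expected predictions are purely for demonstration.

# Minimal text classification sketch (toy data, invented for illustration)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny labelled training set: 1 = spam, 0 = not spam
texts = ["win a free prize now", "cheap meds available",
         "meeting at noon tomorrow", "please review the attached report"]
labels = [1, 1, 0, 0]

# Vectorize the text and train a Naive Bayes classifier
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)

# Predict the class of new, unseen text
print(clf.predict(["free prize meds"]))          # likely [1]
print(clf.predict(["see you at the meeting"]))   # likely [0]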

3.3 Feature Engineering Techniques for NLP Tasks
3.3.1 Bag of Words
In the context of feature engineering within NLP, the Bag of Words (BoW) model trans-
forms text data into a numerical format, specifically into vectors. This method treats a
document as a collection of words without considering their order or grammatical struc-
ture. Essentially, it involves creating a unique index for every distinct word in the text
and then quantifying the document by tallying the occurrences of each word. This results
in a vector where each word is represented by its frequency count in the text, effectively
turning textual information into data that can be analyzed computationally.

Figure 3.1: Bag of Words Representation: short phrases such as 'small dog', 'cute cute dog', and 'cute dog' are mapped to count vectors over the vocabulary {cat, cute, dog, small}.

In the example above, the data is shown in tabular form, with the count of each word displayed for each sentence. This is a basic example; when it comes to paragraphs and documents there are many punctuation marks, pronouns, verb forms, adjectives, and so on. In that case these are first removed or reduced to their base words.

Input text:
    Document 0: 'The fox jumps over the lazy dog.'
    Document 1: 'Dog and fox are lazy'

Sparse representation (document index, word index): count
    (0,7) 2   (0,3) 1   (0,4) 1   (0,6) 1   (0,5) 1   (0,2) 1
    (1,3) 1   (1,5) 1   (1,2) 1   (1,0) 1   (1,1) 1

Dense count matrix:
                 and  are  dog  fox  jumps  lazy  over  the
    Document 0     0    0    1    1      1     1     1    2
    Document 1     1    1    1    1      0     1     0    0

Figure 3.2: Bag of Words detailed example

In the example above there are two lines of text. The words are separated and the frequency (count) of each word is calculated, as shown in the sparse representation. The entries with 0 as the first coordinate belong to the first sentence, and those with 1 as the first coordinate belong to the second. Note that 'the' occurs twice in the first line, so its count is 2, while 'fox' occurs once in the first line, so its value is 1, and the same goes for the rest of the words.
The dense matrix shows the counts for both sentences side by side to give a more explicit picture. This is how Bag of Words works: it calculates numerical values according to word occurrences.

Figure 3.3: Modeling text with machine learning algorithms

Source (Figure 3.2): https://miro.medium.com/v2/resize:fit:828/format:webp/1*38q3qy950o1auKtd1i41Yg.png
Source (Figure 3.3): http://www.prathapkudupublog.com/2019/04/bag-of-words.html

To enhance this further, a few raw-data cleaning steps can also be used in the process, which are as follows:

• Punctuation

• Tokenization

• Stop words

• Lemmatization

• Stemming

Tokenization
It is the process of breaking down text into words or other meaningful parts known as
tokens. It assists in generating a bag of individual words from the input text.

Lowercasing
To ensure that different capitalisations of the same word are counted as a single token, it is good practice to convert all words to lowercase.

Removing Punctuation
For more clarity and to simplify the vocabulary it is useful to eliminate punctuation marks
from the text.

Removing Stop Words

To focus on relevant words and exclude noise from the data, one approach is to eliminate frequent words that usually do not carry substantial meaning (e.g., "and", "is", and "the").

Stemming
To avoid variations of similar terms, it’s important to reduce them to their base or root
form (e.g., ”programming” to ”program”).

Lemmatization
To provide more accurate representations, words are reduced to their basic or dictionary
form, known as the lemma, based on their meaning.

Tokenization:          The, quick, brown, fox, jumps, over, the, lazy, dog, .
Lowercasing:           the, quick, brown, fox, jumps, over, the, lazy, dog, .
Removing punctuation:  the, quick, brown, fox, jumps, over, the, lazy, dog
Removing stop words:   quick, brown, fox, jumps, lazy, dog
Stemming:              quick, brown, fox, jump, lazy, dog
Lemmatization (output): quick, brown, fox, jump, lazi, dog

Figure 3.4: Steps to enhance performance of BoW

Example

# BoW Example
from sklearn.feature_extraction.text import CountVectorizer

# Sample text data
corpus = [
    "The quick brown fox",
    "Jumps over the lazy dog",
    "The dog barks loudly"
]

# Create BoW representation
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(corpus)

# Print feature names and BoW matrix
print("Feature Names:", vectorizer.get_feature_names_out())
print("BoW Matrix:\n", X_bow.toarray())

Figure 3.5: BoW example code result
3.3.2 TF-IDF
In the context of feature engineering for NLP, TF-IDF, short for term frequency-inverse
document frequency, is a statistical metric used to assess the significance of a word in the
context of a document that is part of a larger corpus.
This metric is calculated by integrating two crucial statistics: the frequency of a term
in a single text and its inverse frequency across a larger set of documents. In essence,
TF-IDF grows according to the number of times a word appears in a document, offset by
the term’s frequency in the corpus. This makes it a useful tool for text analysis tasks,
which are often used in automated systems. In machine learning models for NLP, TF-IDF
serves as an informative feature for recognising and ranking the significance of words within
documents.

Figure 3.6: TF-IDF Work Flow

Importance
TF-IDF is important in Natural Language Processing (NLP), notably for document retrieval and analysis. The measure increases the importance of terms based on their frequency within a given document, but it strikes a balance by taking into account the term's prevalence across a large number of documents. Because of their ubiquity, frequently used words such as "this", "what", or "if" are given reduced weight; such words typically carry little relevance in the context of a given document. In contrast, a word such as "bug" gains importance in a certain context if it appears frequently in one document but is uncommon elsewhere. For example, when analysing customer feedback for software products, frequent use of the term "bug" in comments could be a significant indicator of discussions about software reliability issues. This is especially important when categorising feedback into topics, as terms like "bug" are likely to be associated with discussions of product performance or quality.
By incorporating TF-IDF into feature engineering for NLP, we can successfully distinguish between everyday language and key terms that identify the subject matter of individual texts, increasing the analytical depth of our NLP applications.

Method for Calculating TF-IDF


When investigating the TF-IDF calculation, two fundamental metrics must be considered. The first is the term frequency (TF), which measures how frequently a single term appears in a given document. This is more than just a raw count of occurrences: adjustments are made, such as normalising the count by the overall length of the document or by the count of the most frequent term in it. This adjustment gives a more nuanced assessment of the term's prevalence across the document.

In conjunction with TF, the inverse document frequency (IDF) is crucial. This statistic measures how rare or common a term is throughout the full collection of documents. The essence of the approach is to invert the proportion of documents containing the term and then apply a logarithmic scale. This efficiently magnifies the weight of uncommon terms, providing a balance against overly common terms that may not contribute meaningfully to a text's distinctive context.

The product of these two metrics, TF and IDF, yields the TF-IDF score. This score reflects a term's relevance within a certain document: higher TF-IDF scores indicate terms that are not only frequent in a specific document but also rare across the entire document set. This dual emphasis makes TF-IDF a strong tool in feature engineering for NLP, providing both a quantitative and a qualitative lens for understanding and extracting meaningful patterns from textual input.
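
A minimal sketch of this calculation using scikit-learn's TfidfVectorizer (which, by default, applies a smoothed IDF and L2-normalises each document vector); the toy corpus is invented for illustration:

# Minimal TF-IDF sketch on a toy corpus
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the software has a bug",
    "the bug causes a crash",
    "the interface looks clean"
]

vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(corpus)

# Each row is a document, each column a term; rare terms get higher weight
print(vectorizer.get_feature_names_out())
print(X_tfidf.toarray().round(2))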

Why is TF-IDF used in machine learning?

Dealing with text-based information is a challenge in natural language processing (NLP), because machine learning algorithms are essentially built to process numerical data. Vectorization is the procedure that converts textual material into a numerical format. This transformation is critical during the data analysis phase of machine learning, because the vectorization approach used can have a substantial impact on the results.
Term Frequency-Inverse Document Frequency, or TF-IDF, is an important factor in this conversion process. It statistically assesses a word's relevance in a document: the value falls (approaching zero) when the word appears frequently across many texts, signalling lower importance, whereas a TF-IDF value closer to one indicates the word's scarcity across documents, implying greater relevance. To calculate a word's TF-IDF score, its term frequency is multiplied by its inverse document frequency, with higher scores indicating words that are more relevant to the specific text in question. This numerical representation is what allows machine learning models to comprehend and analyse text data effectively.

Why is this effective?


In other words, a word vector is simply a collection of numbers, one for every term in the corpus vocabulary. When a document is vectorized, the text is turned into one of these vectors, and the numbers in the vector represent the text's content. Using TF-IDF, we can assign a number to each word in a document that indicates how important that word is to the document. What we want in a machine learning system is that documents containing comparable, pertinent terms end up with similar vectors.

The uses of TF-IDF


Scoring how relevant a word is to a document with TF-IDF has several benefits.

Information retrieval
TF-IDF was developed for document search, and it helps return the most relevant results for a query. Say someone uses your search engine to look for LeBron: the results are shown in order of relevance, and since TF-IDF assigns a greater score to the term "LeBron" in sports articles, the more pertinent sports articles are ranked higher. It is highly probable that every search engine you have ever used incorporates TF-IDF scores into its algorithm.

Keyword extraction
TF-IDF is also a helpful tool for extracting keywords from text. How? Since the words with the highest scores are the most pertinent to a document, they can be regarded as that document's keywords. Quite simple.
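
A minimal keyword-extraction sketch along these lines, on an invented three-document corpus: score every term of one document with TF-IDF and keep its top-scoring terms as keywords.

# Keyword extraction sketch: rank a document's terms by TF-IDF score
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "LeBron scored forty points in the basketball game last night",
    "the stock market closed higher after the earnings report",
    "the new phone features a larger screen and better battery"
]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
terms = vectorizer.get_feature_names_out()

# Top 3 keywords of the first document (the sports sentence)
row = X[0].toarray().ravel()
top = np.argsort(row)[::-1][:3]
print([terms[i] for i in top])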

3.3.3 N-Grams
N-grams are contiguous sequences of neighbouring items in a text: sequences of words, symbols, or tokens. Their most significant applications are in NLP (Natural Language Processing) tasks involving text data.

Figure 3.7: N-grams Work Flow

Importance
N-gram models are widely used in applications where the modelling inputs are n-gram distributions, including statistical natural language processing, speech recognition, machine translation, phoneme and phoneme-sequence modelling, predictive text input, and many others. Another typical application of n-grams is to generate features for supervised machine learning models such as SVMs, MaxEnt models, and Naive Bayes. The basic idea is to extend the feature space beyond single-word unigram tokens with bigrams (and trigrams and higher-order n-grams).

What are they?


An N-gram is a contiguous sequence of n items retrieved from a text or speech sample. Depending on the application, the items might be letters, words, or base pairs. Typically, N-grams are collected from a large corpus of text or speech.
N-grams can also be regarded as groups of words that occur together within a given window, obtained by moving the window forward by k words at a time (k can be 1 or more). Unigrams contain a single word, bigrams two, trigrams three, 4-grams four, 5-grams five, and so on.

Uses of N-grams
The n-gram model is commonly used in natural language processing and text mining tasks that involve n-grams of text. For example, when building language models, n-grams are used to create not only unigram models but also bigram and trigram models. Web-scale n-gram models, developed by tech giants like Google and Microsoft, can be applied to a range of NLP-related tasks, including word breaking, text summarization, and spelling correction. (? )

The significance of word order in text and NLP

The words used in natural language text are generally not arranged at random. In English, "the green apple" is valid word order, whereas "apple green the" is not, even though both contain the same words. There are also many intricate word relationships in the flow of a text. Using n-grams is one comparatively easy way to capture some of these word associations: an impressive amount of information can be gathered just by observing which words are most frequently found next to one another. By building n-grams we can create co-occurring word features and use them in relationship modelling and other applications. The core notion underlying n-grams is that, given a large enough corpus, we can examine every pair, triple, or group of four or more words that appear next to one another. When we construct these pairs of words, we are far more likely to see "the green" and "green apple" several times than "apple green" or "green the". Many NLP use cases benefit from understanding and applying this kind of background knowledge; for example, it can be used to estimate how likely a speaker is to have said various things, which helps pick the output of an automatic speech recognition system.

What Classification Do N-Grams Have?


In the n-gram model in NLP, different n-gram orders are appropriate for different applications. To determine which order is ideal for analysing a text corpus, we must test the various options on the dataset. Studies have, for instance, demonstrated that trigrams and 4-grams are effective for spam filtering.

Why is this effective?


Consider the example sentence "Cowards die many times before their deaths; the valiant never taste of death but once" and the n-grams associated with it.
Unigrams are simply the individual words of the sentence: 'cowards', 'die', 'many', 'times', and so on. Bigrams are pairs of adjacent words, formed by advancing one word at a time: 'cowards die', 'die many', 'many times', 'times before', 'before their', 'their deaths', and so on. Trigrams are groups of three consecutive words, again formed by sliding the window one word at a time: 'cowards die many', 'die many times', 'many times before', and so on. 4-grams combine four consecutive words in the same way: 'cowards die many times', 'die many times before', 'many times before their', and so on. Similarly, we can pick n larger or smaller than 4 and create 5-grams, etc. A short sketch that generates these n-grams follows.
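
The sketch below uses NLTK's ngrams helper on the same sentence (an illustrative snippet, assuming NLTK is installed) and prints the first few n-grams of each order:

# Generate unigrams to 4-grams of the example sentence
from nltk import ngrams

sentence = ("cowards die many times before their deaths "
            "the valiant never taste of death but once")
tokens = sentence.split()

for n in range(1, 5):
    grams = [" ".join(g) for g in ngrams(tokens, n)]
    print(f"{n}-grams:", grams[:4], "...")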

Start
  1. Collect Documents: a collection of different books
  2. Tokenize Text: break each book into individual words
  3. Calculate TF: count how often each word appears in a specific book
  4. Calculate IDF: assess how unique each word is across all books
  5. Calculate TF-IDF: multiply TF by IDF for each word in each book
  6. Select Keywords: identify words with high TF-IDF scores
  7. Use in Search: in a search engine, show documents with relevant keywords
End

Figure 3.8: TF-IDF Work Flow

# N-Grams Example
import nltk
from nltk import ngrams
from nltk.tokenize import word_tokenize

nltk.download('punkt')

# Sample text data
text = "The quick brown fox jumps over the lazy dog"

# Tokenize the text
tokens = word_tokenize(text)

# Generate bigrams
bigrams = list(ngrams(tokens, 2))

# Print the bigrams
print("Bigrams:", bigrams)

Figure 3.9: N-grams example code results

3.3.4 Word2Vec & GloVe
Deep learning has been shown to perform incredibly well when applied to NLP problems. The main idea is to feed human-readable sentences into neural networks so that the models can learn something from them.
Understanding how textual material is pre-processed for neural networks is therefore crucial. Raw text cannot be fed into a neural network, but numbers can, so we must translate words into a numerical format.

Figure 3.10: Word2Vec Work Flow

Importance
Word Vectors
Word vectors are a far better way to represent words than one-hot encoded vectors, which require a lot of memory as the vocabulary grows and attach no semantic meaning to the index given to each word. In a one-hot encoding, the vectors for "dog" and "cat" are just as far apart as the vectors for "dog" and "computer", so the neural network must work extremely hard to understand each word, because all words are treated as totally separate entities. The goal of using word vectors is to address both of these problems.
"Word vectors retain the semantic representation of the word and take up far less space than a one-hot encoded vector."
Before delving into word vectors, it is important to remember that related words tend to occur together more frequently than dissimilar ones. Even though two words that appear close to one another do not necessarily have the same meaning, words with comparable meanings are frequently found together when we look at the frequency of such co-occurrences.

Word2Vec
Word2Vec has two architectures: Skip-gram and CBOW (Continuous Bag of Words). Both need to know which words appear near a given word, and to capture this we use what is known as a context window.
Take the sentence "Deep Learning is very fun and hard". We must first establish the window size; in this instance, let's say 2. We go through every word in the provided data, which in this example is only one sentence, and consider a window of words that surrounds it. Because our window size is two, each word is associated with up to four context words: the two words that come before it and the two that come after it.
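
A minimal sketch, in plain Python, of how such (centre word, context word) pairs can be generated with a window size of 2; the sentence is the example above:

# Build (centre, context) training pairs with a context window of size 2
sentence = "deep learning is very fun and hard".split()
window = 2

pairs = []
for i, centre in enumerate(sentence):
    # Look up to `window` words to the left and right of the centre word
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((centre, sentence[j]))

print(pairs[:6])
# e.g. [('deep', 'learning'), ('deep', 'is'), ('learning', 'deep'), ...]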
# Word2Vec Example
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk

# Download the punkt tokenizer if not already downloaded
nltk.download('punkt')

# Sample text data
corpus = [
    "Word embeddings are powerful tools in natural language processing.",
    "They help computers understand the relationships between words.",
    "Word2Vec is a popular algorithm for creating word embeddings.",
    "It captures semantic relationships and word similarities.",
    "Machine learning models benefit from using pre-trained word embeddings.",
    "Gensim is a Python library that makes it easy to work with Word2Vec."
]

# Tokenize the text data
tokenized_text = [word_tokenize(sentence.lower()) for sentence in corpus]

# Train Word2Vec model
model_w2v = Word2Vec(sentences=tokenized_text, vector_size=100, window=5,
                     min_count=1, workers=4)

# Example: Get the vector for the word "natural"
vector_natural = model_w2v.wv['natural']
print("Vector for 'natural':", vector_natural)

# Example: Get the vector for the word "word"
vector_word = model_w2v.wv['word']
print("Vector for 'word':", vector_word)

# Find similar words to a given word
similar_words = model_w2v.wv.most_similar("word", topn=3)
print("Similar words to 'word':", similar_words)

similar_words = model_w2v.wv.most_similar("natural", topn=3)
print("Similar words to 'natural':", similar_words)

Figure 3.11: Word2Vec Example Results
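
GloVe, named in this subsection's title, provides pre-trained word vectors that can be used in much the same way. A minimal sketch, assuming the pre-trained 'glove-wiki-gigaword-50' vectors are fetched through gensim's downloader (a one-off download over the internet is required):

# Load pre-trained GloVe vectors through gensim's downloader
import gensim.downloader as api

# 50-dimensional GloVe vectors trained on Wikipedia + Gigaword
glove = api.load("glove-wiki-gigaword-50")

# Vector for a single word and its nearest neighbours
print(glove["king"][:10])                    # first 10 dimensions of the vector
print(glove.most_similar("king", topn=3))    # e.g. words such as 'prince', 'queen'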

3.4 Movie Reviews Sentiment Analysis Using Bag of Words
3.4.1 Methodology
Movie reviews sentiment analysis is a natural language processing project: NLP techniques are used to extract the useful words of each review, and based on these words a binary classifier predicts whether the sentiment of the review is positive or negative. This section walks through the methodology, code, and results of such a project built on the Bag of Words representation.

3.4.2 Code

## Importing modules ##
import re
import nltk
import joblib
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score

# Download the required NLTK resources (tokenizer and stop word list)
nltk.download('punkt')
nltk.download('stopwords')

## 1 | Data Preprocessing ##
"""Prepare the dataset before training"""

# 1.1 Load dataset
dataset = pd.read_csv('Dataset/IMDB.csv')
print(f"Dataset shape: {dataset.shape}\n")
print(f"Dataset head:\n{dataset.head()}\n")

# 1.2 Output counts
print(f"Dataset output counts:\n{dataset.sentiment.value_counts()}\n")

# 1.3 Encode output column into binary
dataset.sentiment.replace('positive', 1, inplace=True)
dataset.sentiment.replace('negative', 0, inplace=True)
print(f"Dataset head after encoding:\n{dataset.head(10)}\n")

## 2 | Data cleaning ##

# 2.1 Remove HTML tags
def clean(text):
    cleaned = re.compile(r'<.*?>')
    return re.sub(cleaned, '', text)

dataset.review = dataset.review.apply(clean)
print(f"Review sample after removing HTML tags:\n{dataset.review[0]}\n")

# 2.2 Remove special characters
def is_special(text):
    rem = ''
    for i in text:
        if i.isalnum():
            rem = rem + i
        else:
            rem = rem + ' '
    return rem

dataset.review = dataset.review.apply(is_special)
print(f"Review sample after removing special characters:\n{dataset.review[0]}\n")

# 2.3 Convert everything to lowercase
def to_lower(text):
    return text.lower()

dataset.review = dataset.review.apply(to_lower)
print(f"Review sample after converting everything to lowercase:\n{dataset.review[0]}\n")

# 2.4 Remove stopwords
def rem_stopwords(text):
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(text)
    return [w for w in words if w not in stop_words]

dataset.review = dataset.review.apply(rem_stopwords)
print(f"Review sample after removing stopwords:\n{dataset.review[0]}\n")

# 2.5 Stem the words
def stem_txt(text):
    ss = SnowballStemmer('english')
    return " ".join([ss.stem(w) for w in text])

dataset.review = dataset.review.apply(stem_txt)
print(f"Review sample after stemming the words:\n{dataset.review[0]}\n")

## 3 | Model Creation ##
"""Create model to fit it to the data"""

# 3.1 Creating Bag Of Words (BOW)
X = np.array(dataset.iloc[:, 0].values)
y = np.array(dataset.sentiment.values)
cv = CountVectorizer(max_features=2000)
X = cv.fit_transform(dataset.review).toarray()
print("=== Bag of words ===\n")
print(f"BOW X shape: {X.shape}")
print(f"BOW y shape: {y.shape}\n")

# 3.2 Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=9)
print(f"Train shapes: X = {X_train.shape}, y = {y_train.shape}")
print(f"Test shapes: X = {X_test.shape}, y = {y_test.shape}\n")

# 3.3 Defining the models and training them
gnb = GaussianNB()
mnb = MultinomialNB(alpha=1.0, fit_prior=True)
bnb = BernoulliNB(alpha=1.0, fit_prior=True)
gnb.fit(X_train, y_train)
mnb.fit(X_train, y_train)
bnb.fit(X_train, y_train)

# 3.4 Save the three models
joblib.dump(gnb, "Models/MRSA_gnb.pkl")
joblib.dump(mnb, "Models/MRSA_mnb.pkl")
joblib.dump(bnb, "Models/MRSA_bnb.pkl")

# 3.5 Make predictions
ypg = gnb.predict(X_test)
ypm = mnb.predict(X_test)
ypb = bnb.predict(X_test)

## 4 | Model Evaluation ##
"""Evaluate model performance"""
print(f"Gaussian accuracy = {round(accuracy_score(y_test, ypg), 2) * 100}%")
print(f"Multinomial accuracy = {round(accuracy_score(y_test, ypm), 2) * 100}%")
print(f"Bernoulli accuracy = {round(accuracy_score(y_test, ypb), 2) * 100}%")

Importing Modules
Importing necessary libraries and modules for data handling, natural language processing (NLP), machine learning, and model evaluation. The required NLTK resources (tokenizer and stop word list) are downloaded.

Data Preprocessing
• Load dataset: Reads the movie reviews dataset from a CSV file.
• Output counts: Displays the count of positive and negative sentiments in the dataset.
• Encode output column: Converts sentiment labels (’positive’ and ’negative’) into
binary values (1 for positive, 0 for negative).

Data Cleaning
Series of steps to clean and preprocess text data in reviews:
• Remove HTML tags: Eliminates HTML tags from reviews.
• Remove special characters: Retains alphanumeric characters, replacing others with
spaces.
• Convert to lowercase: Converts all text to lowercase.
• Remove stopwords: Eliminates common English stopwords (e.g., ’the’, ’and’).
• Stemming: Reduces words to their root form.

Model Creation
• Creating Bag Of Words (BOW): Uses CountVectorizer to create a Bag of Words
representation with a limit of 2000 features.
• Train-test split: Divides the dataset into training and testing sets.
• Define and Train models: Defines and trains Gaussian, Multinomial, and Bernoulli
Naive Bayes models on the training data.
• Save models: Saves the trained models for future use.
• Make predictions: Applies trained models to predict sentiments on the test data.

3.4.3 Model Evaluation


• Evaluate Gaussian Naive Bayes model: Prints accuracy score for the Gaussian Naive
Bayes model on the test data.
• Evaluate Multinomial Naive Bayes model: Prints accuracy score for the Multinomial
Naive Bayes model on the test data.
• Evaluate Bernoulli Naive Bayes model: Prints accuracy score for the Bernoulli Naive
Bayes model on the test data.

3.4.4 Flowchart
Start

Introduction - Movie Reviews Sentiment Analysis

Importing Modules

1 — Data Preprocessing

2 — Data Cleaning

3 — Model Creation

4 — Model Evaluation

End

3.4.5 The Dataset


The IMDB movie reviews dataset contains approximately 50,000 records, each a sample movie review. The dataset includes a target column labeled "sentiment", which describes the viewer's sentiment about the movie as either positive or negative.

3.4.6 Dataset Features and Target:


• [’review’, ’sentiment’]

Dataset Head:

Figure 3.12: Movie Review Dataset Head

3.4.7 Results

Figure 3.13: Movie Review Dataset Head After Encoding

Figure 3.14: Movie Reviews After Removing HTML Tags

Figure 3.15: Movie Reviews After Removing Special Characters

Figure 3.16: Movie Reviews After Converting To Lower Case

Figure 3.17: After Removing Stop Words

Figure 3.18: After Applying Stemming

Figure 3.19: Accuracy

3.5 Fake News Detection Using TF-IDF
3.5.1 Methodology
This section introduces fake news detection using TF-IDF, focusing on natural language processing techniques for binary classification. It refers to a published notebook from Kaggle. (? )

Figure 3.20: Fake News Detection Model (Source: https://www.researchgate.net/figure/Fake-news-detection-model_fig1_358004087)

3.5.2 Code

## Importing modules ##
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report, accuracy_score

# Load the dataset
data = pd.read_csv('/kaggle/input/fake-news-classification/WELFake_Dataset.csv')
data.head()

# Explore the data and check for missing values
data.describe()
data.isnull().sum()

# Drop rows with missing values
print("Original DataFrame shape:", data.shape)
data = data.dropna()
print("New DataFrame shape:", data.shape)
data.isnull().sum()

# Combining title and text
data['content'] = data['title'] + ' ' + data['text']

# Drop rows with missing values
data.dropna(inplace=True)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data['content'], data['label'],
                                                    test_size=0.2, random_state=42)

# Build a TF-IDF + Multinomial Naive Bayes pipeline and train it
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

# Evaluate on the test set
predicted = model.predict(X_test)
print(classification_report(y_test, predicted))
print("Accuracy:", round(accuracy_score(y_test, predicted) * 100), '%')

# Prediction helper
def predict_fake_news(news):
    prediction = model.predict([news])
    return 'Fake' if prediction[0] == 0 else 'Real'

# Example usage
print(predict_fake_news("SATAN 2: Russia unvelis an image of its terrif."))

# Example usage
print(predict_fake_news("Ukraine is being invaded by russia."))

# Example usage
print(predict_fake_news("India won the fifa worldcup."))

Import Libraries:
• pandas: for data manipulation and analysis.
• train_test_split: for splitting the dataset into training and testing sets.
• TfidfVectorizer: for converting a collection of raw documents to a matrix of TF-IDF features.
• MultinomialNB: Naive Bayes classifier for multinomial models.
• make_pipeline: for creating a pipeline that applies a series of transformations followed by a final estimator.
• classification_report, accuracy_score: for evaluating the performance of the classification model.

Load the Dataset:
• Reads a CSV file ('WELFake_Dataset.csv') into a pandas DataFrame.

Data Exploration and Cleaning:


• Displays the first few rows of the dataset.

• Describes the dataset.

• Checks for missing values and drops rows with missing values.

Combine Text Columns:


• Combines the ’title’ and ’text’ columns into a new column called ’content’.

Split Data:
• Splits the dataset into training and testing sets using the train_test_split function.

Build and Train the Model:


• Creates a pipeline that combines the TfidfVectorizer for feature extraction and
MultinomialNB for classification.

• Fits the model using the training data.

Evaluate Model Performance:


• Predicts labels for the test set.

• Prints the classification report and accuracy score to evaluate the model’s perfor-
mance.

Load Dataset

Data Exploration and Cleaning

Combine Text Columns

Split Data

Build and Train the Model

Evaluate Model Performance

Define Prediction Function

Example Usages

Figure 3.21: Fake News Detection Work Flow

Figure 3.22: Mapping words to vectors and transformation: each of the words Man, Woman, Boy, Girl, Prince, Princess, Queen, and King is assigned a 1 x 3 vector over dimensions such as femininity, royalty, and youth. (Source: https://www.shanelynn.ie/get-busy-with-word-embeddings-introduction/)

3.6 Emotion Detection From Text Using Word2Vec


3.6.1 Methodology
This section introduces emotion detection from text using Word2Vec, focusing on natural language processing techniques for classifying tweets into emotion categories.

3.6.2 Code

## Importing modules ##
import pandas as pd
import re
import numpy as np
import imblearn
from gensim.models import Word2Vec
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score, f1_score
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Loading stop words and lemmatizer
nltk.download('stopwords')
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Text cleaning and preprocessing function
def clean_and_preprocess(text):
    # Remove user references, URLs, and special characters
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'[^A-Za-z0-9]', ' ', text)

    # Lowercase conversion, word splitting, stop word removal, and lemmatization
    words = text.lower().split()
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(words)

!pip install -q opendatasets

import opendatasets as od

# Download the dataset from Kaggle (insert your Kaggle username and key when prompted)
od.download('https://www.kaggle.com/datasets/pashupatigupta/emotion-detection-from-text')

# Loading data
data = pd.read_csv('/content/emotion-detection-from-text/tweet_emotions.csv')

# Dataset info
data.info()

# Remove the extra column and clean up the dataset
data.drop(columns=['tweet_id'], inplace=True)
data.drop_duplicates(keep="first", inplace=True)
data.dropna(inplace=True)

# Applying basic cleanup to tweet text
data['cleaned_content'] = data['content'].apply(clean_and_preprocess)
data.info()

# Display the first values in the processed corpus
data.head(n=100)

# Preparing data for Word2Vec
w2v_data = [text.split() for text in data['cleaned_content']]

# Training the Word2Vec model
w2v_model = Word2Vec(w2v_data, vector_size=150, window=10, min_count=2, workers=4)

# Function to convert text to a vector by averaging its word vectors
def text_to_vector(text):
    words = text.split()
    word_vectors = [w2v_model.wv[word] for word in words
                    if word in w2v_model.wv.key_to_index]
    return np.mean(word_vectors, axis=0) if word_vectors else np.zeros(w2v_model.vector_size)

# Convert all texts to vectors
vectorized_texts = np.array([text_to_vector(text) for text in data['cleaned_content']])

# Initializing and using RandomOverSampler
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(vectorized_texts, data['sentiment'])

# Dividing the data into training and test samples
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled,
                                                    test_size=0.2, random_state=42)

# Results output function
def output_result(y_pred):
    print('Accuracy:', accuracy_score(y_test, y_pred))
    print('F1 Score:', f1_score(y_test, y_pred, average='weighted'))
    print('Classification Report:\n', classification_report(y_test, y_pred))

# SVM classifier
svm = SVC()
svm.fit(X_train, y_train)

y_pred = svm.predict(X_test)
output_result(y_pred)

Importing Modules:
The required libraries and modules are imported. These include pandas for data manipula-
tion, regular expression (re) for text cleaning, numpy for numerical operations, imbalanced-
learn (imblearn) for handling imbalanced datasets, gensim for Word2Vec modeling, and
various classifiers and evaluation metrics from scikit-learn.

NLP Preprocessing:
NLTK is used for natural language processing tasks. Stop words and a lemmatizer are loaded from NLTK's corpus. A custom function clean_and_preprocess is defined for cleaning and preprocessing text data, which includes removing user references, URLs, and special characters, converting to lowercase, and lemmatization.

Downloading and Loading Dataset:


The code downloads a dataset from Kaggle using the opendatasets library. The dataset
contains tweets labeled with emotions. The dataset is loaded into a Pandas DataFrame
(data).

Data Cleaning:
The unnecessary column 'tweet_id' is dropped, and duplicate and missing values are removed from the dataset.

Applying Basic Cleanup to Tweet Text:


The clean_and_preprocess function is applied to the 'content' column to clean and preprocess the tweet text. The cleaned content is stored in a new column called 'cleaned_content'.

Word Embedding with Word2Vec:


The data is prepared for Word2Vec by splitting the cleaned content into words. A Word2Vec
model is trained on the preprocessed text data.

Converting Text to Vectors:


A function text_to_vector is defined to convert each cleaned text into a vector using the trained Word2Vec model. The text vectors are then stored in the variable vectorized_texts.

Handling Imbalanced Data:


The code uses the RandomOverSampler from imbalanced-learn to address class imbalance
by oversampling the minority class.

Splitting Data and Training SVM Classifier:


The oversampled data is split into training and testing sets. An SVM (Support Vector
Machine) classifier is initialized and trained on the training data.

3.6.3 Evaluating and Outputting Results:
The trained SVM classifier is used to predict the labels on the test set. The function output_result is defined to print accuracy, F1 score, and a classification report based on the predicted and true labels of the test set.

3.6.4 Classifier Evaluation:


The SVM classifier's predictions on the test set are evaluated using the output_result function, which prints accuracy, F1 score, and a classification report.

Bibliography

[1] https://towardsdatascience.com/what-is-feature-engineering-importance-tools-and-techniques-for-machine-learning-2080b0269f10

[2] https://www.analyticsvidhya.com/blog/2022/01/nlp-tutorials-part-ii-feature-extraction/

[3] https://monkeylearn.com/blog/what-is-tf-idf/

[4] https://www.shanelynn.ie/get-busy-with-word-embeddings-introduction/

[5] https://www.kaggle.com/code/ritabanmitra/fake-news-classifier/notebook

[6] https://www.researchgate.net/figure/Fake-news-detection-model_fig1_358004087

[7] https://github.com/omaarelsherif/Movie-Reviews-Sentiment-Analysis-Using-Machine-Learning

[8] https://spotintelligence.com/2023/03/25/nlp-feature-engineering

[9] https://www.kaggle.com/code/longtng/nlp-preprocessing-feature-extraction-methods-a-z

[10] https://www.kaggle.com/amar09/text-pre-processing-and-feature-extraction

[11] https://www.kaggle.com/ashishpatel26/beginner-to-intermediate-nlp-tutorial

[12] https://www.kaggle.com/ashutosh3060/nlp-basic-feature-creation-and-preprocessing

[13] https://www.kaggle.com/liananapalkova/simply-about-word2vec
