
NATURAL LANGUAGE PROCESSING

Python Libraries:
nltk, re, word2vec
List of Experiments
1. Demonstrate Noise Removal for any textual data and remove regular expression
patterns such as hashtags from textual data.
2. Perform lemmatization and stemming using the Python library nltk.
3. Demonstrate object standardization, such as replacing social media slang in a text.
4. Perform part of speech tagging on any textual data.
5. Implement topic modeling using Latent Dirichlet Allocation (LDA) in Python.
6. Demonstrate Term Frequency – Inverse Document Frequency (TF-IDF) using Python.
7. Demonstrate word embeddings using word2vec.
8. Implement text classification using the Naïve Bayes classifier and the TextBlob library.
9. Apply support vector machine for text classification.
10. Convert text to vectors (using term frequency) and apply cosine similarity to
provide closeness among two texts.
11. Case study 1: Identify the sentiment of tweets
In this problem, you are provided with tweet data to predict netizens' sentiment
about electronic products.
12. Case study 2: Detect hate speech in tweets.
The objective of this task is to detect hate speech in tweets. For the sake of
simplicity, we say a tweet contains hate speech if it has a racist or sexist
sentiment associated with it.
So, the task is to classify racist or sexist tweets from other tweets.

What is natural language processing?

Natural language processing (NLP) refers to the branch of computer science—and more
specifically, the branch of artificial intelligence or AI—concerned with giving computers the
ability to understand text and spoken words in much the same way human beings can.

NLP combines computational linguistics—rule-based modeling of human language—with
statistical, machine learning, and deep learning models. Together, these technologies enable
computers to process human language in the form of text or voice data and to ‘understand’ its
full meaning, complete with the speaker or writer’s intent and sentiment.

NLP drives computer programs that translate text from one language to another, respond to
spoken commands, and summarize large volumes of text rapidly—even in real time. There’s a
good chance you’ve interacted with NLP in the form of voice-operated GPS systems, digital
assistants, speech-to-text dictation software, customer service chat bots, and other consumer
conveniences. But NLP also plays a growing role in enterprise solutions that help streamline
business operations, increase employee productivity, and simplify mission-critical business
processes.

Several NLP tasks break down human text and voice data in ways that help the computer
make sense of what it's ingesting. Some of these tasks include the following:

 Speech recognition, also called speech-to-text, is the task of reliably converting voice
data into text data. Speech recognition is required for any application that follows voice
commands or answers spoken questions. What makes speech recognition especially
challenging is the way people talk—quickly, slurring words together, with varying
emphasis and intonation, in different accents, and often using incorrect grammar.
 Part of speech tagging, also called grammatical tagging, is the process of determining
the part of speech of a particular word or piece of text based on its use and context. Part
of speech identifies ‘make’ as a verb in ‘I can make a paper plane,’ and as a noun in
‘what make of car do you own?’
 Word sense disambiguation is the selection of the meaning of a word with multiple
meanings through a process of semantic analysis that determines the word that makes the
most sense in the given context. For example, word sense disambiguation helps
distinguish the meaning of the verb 'make' in ‘make the grade’ (achieve) vs. ‘make a bet’
(place). A short NLTK sketch after this list illustrates this task alongside named entity
recognition.
 Named entity recognition, or NER, identifies words or phrases as useful entities. NER
identifies ‘Kentucky’ as a location or ‘Fred’ as a man's name.
 Co-reference resolution is the task of identifying if and when two words refer to the
same entity. The most common example is determining the person or object to which a
certain pronoun refers (e.g., ‘she’ = ‘Mary’), but it can also involve identifying a
metaphor or an idiom in the text (e.g., an instance in which 'bear' isn't an animal but a
large hairy person).
 Sentiment analysis attempts to extract subjective qualities—attitudes, emotions,
sarcasm, confusion, suspicion—from text.
 Natural language generation is sometimes described as the opposite of speech
recognition or speech-to-text; it's the task of putting structured information into human
language.
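
Several of these tasks can be tried directly in Python with the NLTK library described in the
next section. The following is a minimal sketch (the sample sentence and the choice of named
entity recognition and word sense disambiguation as examples are illustrative, not prescribed
by this manual):

import nltk
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.wsd import lesk

# One-time resource downloads for the tokenizer, tagger, NE chunker and WordNet
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('wordnet')

sentence = "Fred moved to Kentucky to make furniture."
tokens = word_tokenize(sentence)

# Named entity recognition: chunk the POS-tagged tokens into entities
print(ne_chunk(pos_tag(tokens)))

# Word sense disambiguation with the Lesk algorithm: choose a WordNet sense
# for the verb 'make' given the surrounding words
sense = lesk(tokens, 'make', pos='v')
print(sense, '-', sense.definition() if sense else 'no sense found')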

Python and the Natural Language Toolkit (NLTK)

The Python programming language provides a wide range of tools and libraries for attacking
specific NLP tasks. Many of these are found in the Natural Language Toolkit, or NLTK, an open
source collection of libraries, programs, and education resources for building NLP programs.

The NLTK includes libraries for many of the NLP tasks listed above, plus libraries for subtasks,
such as sentence parsing, word segmentation, stemming and lemmatization (methods of
trimming words down to their roots), and tokenization (for breaking phrases, sentences,
paragraphs and passages into tokens that help the computer better understand the text). It also
includes libraries for implementing capabilities such as semantic reasoning, the ability to reach
logical conclusions based on facts extracted from text.

Steps to install NLTK and its data:

Install pip: run in terminal:
easy_install pip

Install NLTK: run in terminal:
pip install -U nltk

Download NLTK data: run the Python shell (in terminal) and write the following code:
import nltk
nltk.download()
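
Once the download completes, a quick check such as the following (a small illustrative
sketch; the sample sentence is arbitrary) confirms that the toolkit and the 'punkt'
tokenizer data are usable:

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer models used by word_tokenize

# If the installation succeeded this prints the sentence split into tokens
print(word_tokenize("NLTK is ready to use."))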

What is Word2Vec?

Word2Vec is a widely used method in natural language processing (NLP) that allows
words to be represented as vectors in a continuous vector space. Developed by researchers at
Google, Word2Vec maps words to high-dimensional vectors in order to capture the semantic
relationships between them. The main principle of Word2Vec is that words with similar
meanings should have similar vector representations.

The basic idea of word embeddings is that words occurring in similar contexts tend to be closer
to each other in vector space. For generating word vectors in Python, the modules needed
are nltk and gensim. Run these commands in a terminal to install nltk and gensim:

pip install nltk


pip install gensim

 NLTK: For handling human language data, NLTK, or Natural Language Toolkit, is a
powerful Python library. It offers user-friendly interfaces to more than 50 lexical resources
and corpora, including WordNet. A collection of text processing libraries for tasks like
categorization, tokenization, stemming, tagging, parsing, and semantic reasoning is also
included with NLTK.
 GENSIM: Gensim is an open-source Python library for managing and analyzing massive
amounts of unstructured text data through topic modeling and document similarity
modeling. It is especially well known for its implementations of widely used topic and
vector space modeling algorithms such as Word2Vec and Latent Dirichlet Allocation
(LDA); a short sketch after this list loads a pretrained model to illustrate the idea.
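
As a quick illustration of the "similar words, similar vectors" idea described above, the sketch
below (an optional aside; on first use it downloads a small pretrained GloVe model of roughly
65 MB through gensim's downloader) queries the nearest neighbors of a word in a ready-made
vector space:

import gensim.downloader as api

# Download (on first use) and load 50-dimensional GloVe word vectors
vectors = api.load("glove-wiki-gigaword-50")

# Words whose vectors lie closest to the vector for 'computer'
print(vectors.most_similar("computer", topn=5))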

Experiment 1: Demonstrate Noise Removal for any textual data and remove regular
expression patterns such as hashtags from textual data.
Description: Noise removal is a crucial preprocessing step in natural language processing
(NLP) tasks. It involves cleaning up textual data by removing irrelevant or unwanted
information that does not contribute to the meaning of the text. This can include special
characters, punctuation, numbers, and other non-alphabetic symbols. Regular expressions are
often used to identify and remove specific patterns of noise, such as hashtags (#) in social
media data. Noise removal helps in improving the quality of textual data for further analysis or
processing, such as sentiment analysis, text classification, or language modeling. It simplifies
the text and reduces the dimensionality of the data, making it easier for machine learning
algorithms to extract meaningful insights.

Code:
import re

def remove_noise(text):
    # Define the regular expression pattern for hashtags
    hashtag_pattern = r'#\w+'

    # Remove hashtags using regex substitution
    clean_text = re.sub(hashtag_pattern, '', text)
    return clean_text

# Example text with hashtags
text_with_noise = "Just finished reading #HarryPotter and it was #amazing! #booklovers"

# Remove noise from the text
clean_text = remove_noise(text_with_noise)

print("Original Text:")
print(text_with_noise)
print("\nText after Noise Removal:")
print(clean_text)

Output:
Original Text:
Just finished reading #HarryPotter and it was #amazing! #booklovers

Text after Noise Removal:
Just finished reading  and it was !

Experiment 2: Perform lemmatization and stemming using the Python library nltk

Description:
Lemmatization and stemming are both techniques used in natural language processing to
reduce words to their base or root form. While stemming involves chopping off prefixes or
suffixes of words to obtain the root, lemmatization involves using vocabulary and morphological
analysis to return the base or dictionary form of a word, which is known as the lemma. The
NLTK library provides a robust toolkit for natural language processing tasks, including
tokenization, lemmatization, stemming, part-of-speech tagging, and more. These techniques are
often used in preprocessing text data before further analysis or modeling.

Code:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Download NLTK resources (you only need to do this once)


nltk.download('punkt')
nltk.download('wordnet')

# Initialize lemmatizer and stemmer


lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

def lemmatize_text(text):
    # Tokenize the text into words
    tokens = word_tokenize(text)

    # Lemmatize each word in the text
    lemmatized_text = [lemmatizer.lemmatize(word) for word in tokens]

    # Join the lemmatized words back into a single string
    return ' '.join(lemmatized_text)

def stem_text(text):
    # Tokenize the text into words
    tokens = word_tokenize(text)

    # Stem each word in the text
    stemmed_text = [stemmer.stem(word) for word in tokens]

    # Join the stemmed words back into a single string
    return ' '.join(stemmed_text)

# Example text
text = "nowadays students all are need to learn the programming"

# Lemmatize the text


lemmatized_text = lemmatize_text(text)
print("Lemmatized Text:")
print(lemmatized_text)

# Stem the text


stemmed_text = stem_text(text)
print("\nStemmed Text:")
print(stemmed_text)

Output:
Lemmatized Text:
nowadays student all are need to learn the programming

Stemmed Text:
nowaday student all are need to learn the program
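
Note that WordNetLemmatizer treats every token as a noun unless told otherwise. Passing a
part-of-speech hint (a small illustrative aside, not part of the experiment above) changes the
result for verbs:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Default (noun) lemmatization vs. lemmatization with a verb POS hint
print(lemmatizer.lemmatize("are"))                # 'are'
print(lemmatizer.lemmatize("are", pos="v"))       # 'be'
print(lemmatizer.lemmatize("learning", pos="v"))  # 'learn'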

Experiment 3: Demonstrate object standardization, such as replacing social media slang in a
text.

Description:
Standardizing text by replacing social media slang with formal language involves
converting informal or colloquial language commonly used in social media platforms into more
formal language. This process is typically done to improve readability, professionalism, or
clarity in various contexts, such as formal documents, academic writing, or professional
communications.

Code:
# Define a dictionary mapping social media slangs to their formal equivalents
slang_to_formal = {
    "lol": "laughing out loud",
    "brb": "be right back",
    "omg": "oh my god",
    "btw": "by the way",
    "ttyl": "talk to you later",
    # Add more mappings as needed
}

# Function to replace social media slangs with formal language
def standardize_text(text):
    words = text.split()
    standardized_words = []
    for word in words:
        # Strip trailing punctuation so that "omg," still matches the key "omg"
        core = word.rstrip('.,!?')
        punctuation = word[len(core):]
        standardized_words.append(slang_to_formal.get(core.lower(), core) + punctuation)
    return ' '.join(standardized_words)

# Example usage
input_text = "omg, lol, brb! btw, ttyl!"
standardized_text = standardize_text(input_text)
print("Original text:", input_text)
print("Standardized text:", standardized_text)

Output:
Original text: omg, lol, brb! btw, ttyl!
Standardized text: oh my god, laughing out loud, be right back! by the way, talk to you later!

Experiment 4: Perform part of speech tagging on any textual data

Description:
Part-of-speech (POS) tagging is a process in natural language processing where words in
a text are assigned to their respective parts of speech, such as nouns, verbs, adjectives, adverbs,
etc. This tagging helps in understanding the grammatical structure of sentences and is an
essential step in many NLP tasks, such as text analysis, information extraction, and sentiment
analysis. Python's NLTK library provides easy-to-use tools for POS tagging.
Code:
import nltk
from nltk.tokenize import word_tokenize

# Download the tokenizer and tagger resources (only needed once)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Sample text data
text = "I love to eat pizza with extra cheese."

# Tokenize the text into words
words = word_tokenize(text)

# Perform part-of-speech tagging
pos_tags = nltk.pos_tag(words)

# Print the tagged words with their part-of-speech
print("Original Text:", text)
print("Part-of-Speech Tagging:")
for word, tag in pos_tags:
    print(f"{word}: {tag}")

Output:
Original Text: I love to eat pizza with extra cheese.
Part-of-Speech Tagging:
I: PRP
love: VBP
to: TO
eat: VB
pizza: NN
with: IN
extra: JJ
cheese: NN

Explanation:

"I" is tagged as PRP (Personal Pronoun)


"love" is tagged as VBP (Verb, non-3rd person singular present)
"to" is tagged as TO (to as a preposition or infinitive marker)
"eat" is tagged as VB (Verb, base form)
"pizza" is tagged as NN (Noun, singular or mass)
"with" is tagged as IN (Preposition or subordinating conjunction)
"extra" is tagged as JJ (Adjective)
"cheese" is tagged as NN (Noun, singular or mass)

Experiment 5: Implement topic modeling using Latent Dirichlet Allocation (LDA ) in python.

Description:
Latent Dirichlet Allocation (LDA) is a popular technique for topic modeling, which is
used to discover latent topics present in a collection of documents.

Code:
import gensim
from gensim import corpora
from pprint import pprint

# Sample documents
documents = [
"Machine learning is a subset of artificial intelligence.",
"Natural language processing is used in many applications.",
"Deep learning models have achieved state-of-the-art results.",
"Topic modeling is an unsupervised learning technique.",
"Python programming language is commonly used in data science.",
"Text data preprocessing is important for NLP tasks.",
]

# Tokenize each document into words


tokenized_documents = [gensim.utils.simple_preprocess(doc) for doc in documents]

# Create a dictionary mapping words to their integer ids


dictionary = corpora.Dictionary(tokenized_documents)

# Convert tokenized documents into Bag of Words (BoW) format


bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_documents]

# Train the LDA model


lda_model = gensim.models.LdaModel(bow_corpus, num_topics=3, id2word=dictionary, passes=10)

# Print the topics and their top words


print("Topics and their top words:")
print(lda_model.print_topics())

Output:
Topics and their top words:
[(0,
'0.079*"learning" + 0.063*"machine" + 0.046*"intelligence" + 0.046*"artificial" +
0.031*"subset" + 0.031*"natural" + 0.031*"processing" + 0.031*"applications" + 0.031*"used"
+ 0.031*"many"'),
(1,
'0.090*"learning" + 0.070*"deep" + 0.070*"models" + 0.045*"state" + 0.045*"achieved" +
0.045*"results" + 0.045*"art" + 0.045*"of" + 0.045*"have" + 0.045*"modeling"'),
(2,
'0.075*"topic" + 0.061*"unsupervised" + 0.061*"is" + 0.061*"an" + 0.061*"modeling" +
0.061*"learning" + 0.061*"technique" + 0.036*"preprocessing" + 0.036*"important" +
0.036*"data"')]

Each line represents a topic along with its top words and their corresponding probabilities. For
example, the first line shows the top words for topic 0, the second line shows the top words for
topic 1, and so on.
You can interpret these topics as follows:
Topic 0: Machine learning and artificial intelligence
Topic 1: Deep learning models and achievements
Topic 2: Topic modeling and unsupervised learning techniques
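
Once the model is trained, the same dictionary can be used to infer a topic distribution for a
previously unseen document. A minimal sketch that continues from the code above (the new
sample sentence is illustrative):

# Convert a new document to Bag of Words using the same dictionary
new_doc = "Artificial intelligence and machine learning are related fields."
new_bow = dictionary.doc2bow(gensim.utils.simple_preprocess(new_doc))

# Probability of each topic for the new document
print(lda_model.get_document_topics(new_bow))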

Experiment 6: Demonstrate Term Frequency – Inverse Document Frequency (TF-IDF) using
Python

Description: Term Frequency-Inverse Document Frequency (TF-IDF) is a widely used
technique in Natural Language Processing (NLP) and Information Retrieval. It helps quantify
the importance of a term within a document relative to a collection of documents: a common
formulation weights each term by tf-idf(t, d) = tf(t, d) × log(N / df(t)), where tf(t, d) is the
term's frequency in the document, N is the number of documents, and df(t) is the number of
documents containing the term (scikit-learn's TfidfVectorizer uses a smoothed variant of this
formula). Here's a Python program to demonstrate TF-IDF calculation using the scikit-learn
library.

pip install scikit-learn

Code:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
"TF-IDF stands for Term Frequency-Inverse Document Frequency.",
"It is a technique used in Natural Language Processing and Information Retrieval.",
"TF-IDF quantifies the importance of a term in a document relative to the entire corpus.",
"It helps in identifying the significance of words in a document.",
]

# Create TF-IDF vectorizer


tfidf_vectorizer = TfidfVectorizer()

# Fit the vectorizer to the documents and transform the documents into TF-IDF vectors
tfidf_vectors = tfidf_vectorizer.fit_transform(documents)

# Get feature names (terms)


feature_names = tfidf_vectorizer.get_feature_names_out()

# Display TF-IDF vectors for each document
for i, document in enumerate(documents):
    print(f"Document {i+1}:")
    for j, term in enumerate(feature_names):
        print(f"{term}: {tfidf_vectors[i, j]:.4f}")
    print("\n")

Output:
Document 1:
and: 0.0000
document: 0.3536
frequency: 0.3536
idf: 0.3536
importance: 0.0000
in: 0.0000
information: 0.0000
is: 0.0000
it: 0.0000
language: 0.0000
natural: 0.0000
of: 0.0000
processing: 0.3536
quantifies: 0.3536
relative: 0.3536
stands: 0.3536
technique: 0.3536
term: 0.3536
tf: 0.3536
the: 0.0000
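
Scikit-learn's default weighting combines a raw term count with a smoothed inverse document
frequency and then length-normalizes each document vector. The sketch below (an illustrative
toy example with two short documents, not part of the experiment above) reproduces that
formula by hand and checks it against TfidfVectorizer:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Two tiny documents so the arithmetic is easy to follow
docs = ["the cat sat", "the dog sat on the mat"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs).toarray()
terms = vectorizer.get_feature_names_out()

# Reproduce the default weighting manually:
#   tf  = raw count of the term in the document
#   idf = ln((1 + n_docs) / (1 + df)) + 1   (smooth_idf=True)
# and each row is then L2-normalized (norm='l2')
counts = np.array([[doc.split().count(t) for t in terms] for doc in docs], dtype=float)
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + df)) + 1
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

print(np.allclose(X, manual))  # True: the manual computation matches the library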

Experiment 7: Demonstrate word embeddings using word2vec.

Description:
Word embeddings are numerical representations of words in a continuous vector space. They
encode semantic information and are a key component in various NLP tasks, including
language modeling, sentiment analysis, machine translation, and named entity recognition.
In Python, there are several popular methods and libraries for generating word embeddings, with
Word2Vec, GloVe, and FastText being among the most commonly used ones. Here's a brief
overview of word2vec:

Word2Vec: Developed by researchers at Google, Word2Vec is a shallow neural network-based
model that learns word embeddings by predicting the context of words in a large corpus. It
provides dense vector representations of words, where the similarity between words is measured
by the cosine similarity of their vectors.

Code:

from gensim.models import Word2Vec

# Sample corpus
corpus = [
["I", "love", "coding"],
["Machine", "learning", "is", "fun"],
["Python", "is", "a", "popular", "programming", "language"],
["Word", "embeddings", "capture", "semantic", "meanings"] ]

# Train Word2Vec model


model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, workers=4)

# Get word embeddings


word_embeddings = model.wv

# Find most similar words


similar_words = word_embeddings.most_similar("learning", topn=3)

# Print most similar words


print("Most similar words to 'learning':")
for word, similarity in similar_words:
    print(f"{word}: {similarity:.4f}")

# Get word vector for a specific word


word_vector = word_embeddings["coding"]
print("\nWord vector for 'coding':")
print(word_vector)

Output:
Most similar words to 'learning':
language: 0.08412346255779266
fun: 0.0378827781085968
semantic: -0.003678104073524475

Word vector for 'coding':


[ 4.1898028e-03 -4.2786840e-03 -1.0587460e-03 2.6510254e-03
3.9717924e-03 -3.0818630e-03 4.2672198e-03 -3.3203912e-03
-3.4059923e-03 -1.6481152e-03 -4.4648279e-03 -2.8914210e-03
-1.4949015e-03 -1.7316621e-03 3.0592727e-03 -2.1864980e-03
...
1.0573900e-03 -2.6487130e-03 3.4680717e-03 -1.1877659e-03
-3.1844315e-03 4.3596888e-03 -1.0220140e-03 -1.6755638e-03
-4.9010170e-03 -1.0648153e-03 -4.1524348e-03 -4.0424663e-03]
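
With only four toy sentences the model cannot learn meaningful relationships, which is why the
similarity scores above are close to zero. Training on a larger corpus gives far more sensible
neighbors; the sketch below (an optional extension, using the small 'text8' Wikipedia sample
of roughly 30 MB that gensim's downloader fetches on first use) shows the difference:

import gensim.downloader as api
from gensim.models import Word2Vec

# 'text8' is a small Wikipedia text sample distributed through gensim-data
sentences = api.load("text8")

# Train Word2Vec on the larger corpus (this can take a few minutes)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)

print(model.wv.most_similar("computer", topn=5))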

Experiment 8: Implement text classification using the Naïve Bayes classifier and the TextBlob
library.

Description:

Text classification is the task of automatically assigning predefined categories or labels to
text documents based on their content. It is a fundamental problem in Natural Language
Processing (NLP) and has numerous applications such as sentiment analysis, spam detection,
topic categorization, and more.

Naïve Bayes classifier is a simple probabilistic classifier based on Bayes' theorem with
the assumption of independence between features. Despite its simplicity, it often performs well
in practice, especially for text classification tasks. It's called "naïve" because it assumes that all
features are independent of each other, which is rarely the case in real-world data, especially in
natural language processing tasks. However, despite this simplification, Naïve Bayes classifiers
can perform surprisingly well in many situations.

TextBlob is a Python library built on top of NLTK (Natural Language Toolkit) and
Pattern libraries, providing an easy-to-use interface for common NLP tasks. It includes
functionalities for text processing, such as tokenization, part-of-speech tagging, noun phrase
extraction, and sentiment analysis. TextBlob also provides a simple API for text classification
using various classifiers, including Naïve Bayes.

Code:

from textblob import TextBlob


from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Sample text data for classification


texts = ["I love this movie", "This movie is great", "I dislike this movie", "This movie is terrible"]

# Labels for the text data
labels = ['positive', 'positive', 'negative', 'negative']

# Split data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

# Vectorize the text data


vectorizer = CountVectorizer()
X_train_vectors = vectorizer.fit_transform(X_train)
X_test_vectors = vectorizer.transform(X_test)

# Train Naive Bayes classifier


classifier = MultinomialNB()
classifier.fit(X_train_vectors, y_train)

# Predict labels for test data


y_pred = classifier.predict(X_test_vectors)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Display classification report


print("Classification Report:")
print(classification_report(y_test, y_pred))

Output:
Accuracy: 1.00
Classification Report:
precision recall f1-score support
negative 1.00 1.00 1.00 1
positive 1.00 1.00 1.00 1

accuracy 1.00 2
macro avg 1.00 1.00 1.00 2
weighted avg 1.00 1.00 1.00 2
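
The experiment title also mentions the TextBlob library; TextBlob ships its own Naïve Bayes
classifier that works directly on (text, label) pairs without a separate vectorization step. A
minimal sketch of that route, using the same toy data as above (it assumes TextBlob is installed
with pip install textblob and its corpora downloaded with python -m textblob.download_corpora):

from textblob.classifiers import NaiveBayesClassifier

# Training data as (text, label) pairs
train = [
    ("I love this movie", "positive"),
    ("This movie is great", "positive"),
    ("I dislike this movie", "negative"),
    ("This movie is terrible", "negative"),
]

# Train the classifier and label new sentences
cl = NaiveBayesClassifier(train)
print(cl.classify("I love this film"))                  # prints the predicted label
print(cl.accuracy([("I hate this movie", "negative")])) # accuracy on a small test set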
