Natural Language Processing (NLP) Zero to Mastery Part I: Foundations

Unlocking Ten Key Concepts for NLP Proficiency

ChenDataBytes · Dec 28, 2023


Photo by Sven Brandsma on Unsplash

The articles in this series cover the following topics:

Part 1 (this article): Presents the fundamental principles of Natural Language Processing (NLP).

Part 2: Explores the common applications of NLP.

Natural Language Processing (NLP) is a field of study within computer science and artificial intelligence that focuses on the interaction between computers and human languages. Its objective is to enable computers to understand, interpret, and generate human language, thereby facilitating communication and interaction between humans and machines.

NLP relies on several libraries commonly employed in the field, such as NLTK and spaCy. NLTK provides a comprehensive set of tools and resources for NLP tasks, while spaCy offers fast, production-oriented processing pipelines. Note, however, that spaCy alone may not cover certain applications, such as sentiment analysis, which may call for more specialized libraries or approaches. For sequence modelling in NLP, TensorFlow's deep learning framework can be used.

Regarding the foundational aspects of NLP, we will delve into ten essential
topics: lemmatization, stemming, part-of-speech tagging, stop words,
pattern matching, sentence segmentation, named entity recognition,
tokenization, word embedding and bag-of-words.


NLP concepts

Linguistic Basics

1. Stemming
Stemming is a linguistic method utilized to obtain the base or root form of
words by removing letters from the word’s end. Its objective is to simplify
words by disregarding tense, pluralization, and other grammatical
variations. The Porter stemming algorithm employs a collection of
predetermined rules and heuristics to eliminate common English suffixes,
thereby converting words into their corresponding stems. spaCy doesn't have a built-in implementation of the Porter stemmer, so we use NLTK for this exercise.

from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()
words = ["runner", "running", "ran"]
for word in words:
    print(word + ' --> ' + p_stemmer.stem(word))


2. Lemmatization
In contrast to stemming, lemmatization is a more sophisticated linguistic
process that aims to reduce a word to its base or dictionary form, known as a
lemma. Lemmatization takes into account factors such as part-of-speech
(POS) tags and contextual understanding to ensure accurate and meaningful
transformations.

import spacy
nlp = spacy.load('en_core_web_sm')

doc1 = nlp(u"The dedicated runner, after running for hours, finally ran across the finish line with a smile.")

for token in doc1:
    print(token.text, '\t', token.lemma, '\t', token.lemma_)


3. Part of speech
Part of speech refers to the grammatical category of a word in a sentence,
such as noun, verb, adjective, adverb, pronoun, preposition, conjunction, or
interjection. Part of speech tagging can be used for various purposes,
including identifying named entities and supporting speech recognition. The probabilities of part-of-speech tags occurring near one another can be used to generate the most reasonable output.

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"The dedicated runner, after running for hours, finally ran across the finish line with a smile.")
# print POS for "after"
print(doc[4].text, doc[4].pos_, doc[4].tag_, spacy.explain(doc[4].tag_))
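As a quick follow-up, here is a minimal sketch (not from the original article) that tallies how often each coarse part-of-speech tag appears in the same example sentence, using spaCy's count_by:

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"The dedicated runner, after running for hours, finally ran across the finish line with a smile.")

# tally the frequency of each coarse POS tag in the doc
pos_counts = doc.count_by(spacy.attrs.POS)
for pos_id, count in sorted(pos_counts.items()):
    # look up the human-readable tag name for each attribute id
    print(doc.vocab[pos_id].text, count)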

Text Identification / Extraction

4. Stop words
Common words such as “a” and “the” occur so frequently in text that they
often do not carry significant meaning compared to nouns, verbs, and
modifiers. These commonly occurring words are referred to as stop words
and can be excluded or filtered out during text processing. spaCy provides a built-in list of approximately 305 English stop words that can be readily used.

import spacy
nlp = spacy.load('en_core_web_sm')

sentences = "The dedicated runner, after running for hours, finally ran across the finish line with a smile."

def remove_stopwords(sentence):
    sentence = sentence.lower()
    words = sentence.split()
    sentence = " ".join([w for w in words if not nlp.vocab[w].is_stop])
    return sentence

print(remove_stopwords(sentences))

5. Pattern Match
Pattern matching entails the identification and extraction of linguistic
patterns or structural information from text. This process involves searching
for specific sequences of words, phrases, or syntactic structures that
conform to predefined patterns or rules.

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')

doc = nlp(u'The dedicated runner, after running for hours, finally ran across the finish line with a smile.')

matcher = PhraseMatcher(nlp.vocab)

phrase_list = ['runner', 'smile']
phrase_patterns = [nlp(text) for text in phrase_list]
# spaCy 3.x signature; in spaCy 2.x this was matcher.add('newproduct', None, *phrase_patterns)
matcher.add('newproduct', phrase_patterns)

matches = matcher(doc)

# Print the matches found in the text.
# Each match is represented as a tuple of (match ID, start index, end index).
print(matches)
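PhraseMatcher matches exact phrases; for rule-based patterns over token attributes (lemmas, POS tags, and so on), spaCy also provides the token-level Matcher. The following is a minimal sketch assuming the spaCy 3.x API; the pattern and sentence are illustrative:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'The dedicated runner, after running for hours, finally ran across the finish line with a smile.')

matcher = Matcher(nlp.vocab)
# match any inflection of the lemma "run" followed by an adposition, e.g. "running for", "ran across"
pattern = [{"LEMMA": "run"}, {"POS": "ADP"}]
matcher.add('run_prep', [pattern])

for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)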


6. Sentence Segmentation
Sentence segmentation refers to the process of dividing a document or a
textual piece into individual sentences. In natural language processing,
accurately identifying sentence boundaries is crucial for various text
analysis and language understanding tasks.

import spacy
nlp = spacy.load('en_core_web_sm')

doc = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')
for sent in doc.sents:
    print(sent)

7. Named Entity Recognition (NER)
Named Entity Recognition (NER) entails the identification and classification of named entities present in the given text. Named entities typically encompass distinct types of words or phrases that represent recognizable entities, including but not limited to names of individuals, organizations, locations, dates, numerical expressions, and others.


import spacy
nlp = spacy.load('en_core_web_sm')

doc = nlp(u'The dedicated runner, after running for hours, finally ran across the finish line with a smile.')
for ent in doc.ents:
    print(ent.text, ent.start, ent.end, ent.start_char, ent.end_char, ent.label_)

Text Representation

8. Tokenization
Tokenization involves breaking down the original text into smaller components known as tokens. These tokens can be created based on contiguous sequences of characters or words. The use of spaCy for tokenization is demonstrated below.

import spacy
nlp = spacy.load('en_core_web_sm')
mystring = '"The dedicated runner, after running for hours, finally ran across the finish line with a smile."'
doc = nlp(mystring)

for token in doc:
    print(token.text, end=' | ')

# Counting Tokens
print("\n Token Counts:", len(doc))

# Counting Vocab Entries
print("\n Vocab Entries: " + str(len(doc.vocab)))

An n-gram is a consecutive sequence of n items, where an item can be a character or a word.

from nltk import bigrams

text = """The dedicated runner, after running for hours, finally ran across the finish line with a smile."""
lines = map(str.split, text.split('\n'))
for line in lines:
    print("\n".join([" ".join(bi) for bi in bigrams(line)]))


The following example illustrates how to generate tokens and sequences and perform padding using the TensorFlow framework. In NLP model training, it is also common to create input-output pairs, where the input consists of a sequence of words or characters and the output is the subsequent word or character; a sketch of this follows the padding example below.

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Define your input texts
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

# Initialize the Tokenizer class
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
# Tokenize the input sentences
tokenizer.fit_on_texts(sentences)
# Get the word index dictionary
word_index = tokenizer.word_index
# Generate list of token sequences
sequences = tokenizer.texts_to_sequences(sentences)

# Print the result
print("\nWord Index = ", word_index)
print("\nSequences = ", sequences)

# Pad the sequences to a uniform length
padded = pad_sequences(sequences, maxlen=5)
# Print the result
print("\nPadded Sequences:")
print(padded)
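Building on the tokenizer above, the sketch below (illustrative, not from the original article) shows one common way to turn the token sequences into input-output pairs for next-word prediction: every prefix of a sequence becomes an input, and its last token becomes the label.

from tensorflow.keras.preprocessing.sequence import pad_sequences

# reuse `sequences` produced by tokenizer.texts_to_sequences(sentences) above
input_sequences = []
for seq in sequences:
    # every prefix of length >= 2 becomes one training example
    for i in range(2, len(seq) + 1):
        input_sequences.append(seq[:i])

# pad all prefixes to the same length, then split off the last token as the label
max_len = max(len(s) for s in input_sequences)
input_sequences = pad_sequences(input_sequences, maxlen=max_len, padding='pre')
xs, labels = input_sequences[:, :-1], input_sequences[:, -1]

print(xs[:3])
print(labels[:3])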

Subword tokenization is a text tokenization technique that breaks down words into smaller units, known as subwords or subword units. Unlike traditional word-based tokenization, where each word is considered a single token, subword tokenization allows the representation of words as a sequence of subword units.

import tensorflow_datasets as tfds

# Download the subword-encoded, pretokenized dataset
imdb_subwords, info_subwords = tfds.load("imdb_reviews/subwords8k", with_info=True)
train_data, test_data = imdb_subwords['train'], imdb_subwords['test']

# Get the encoder
tokenizer_subwords = info_subwords.features['text'].encoder

# Define sample sentence
sample_string = 'TensorFlow, from basics to mastery'

# Encode using the subword text encoder
tokenized_string = tokenizer_subwords.encode(sample_string)
print('Tokenized string is {}'.format(tokenized_string))

# Decode and print the results
original_string = tokenizer_subwords.decode(tokenized_string)
print('The original string: {}'.format(original_string))

9. Word vectors/word embeddings
Basic word representations can be classified into three categories: integers, one-hot vectors, and word embeddings. Word embedding is a technique that represents words as dense, low-dimensional vectors in a continuous vector space. The main objective of word embeddings is to capture the semantic and contextual relationships between words. For instance, when visualizing word embeddings in 2D, similar words tend to be located close to each other.
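To make the three categories concrete, here is a toy illustration (not from the original article) of integer ids, one-hot vectors, and a dense embedding matrix for a three-word vocabulary; in practice the embedding values are learned rather than random.

import numpy as np

vocab = ["runner", "running", "ran"]

# 1) integer ids
word_to_id = {w: i for i, w in enumerate(vocab)}
print(word_to_id)                      # {'runner': 0, 'running': 1, 'ran': 2}

# 2) one-hot vectors: sparse, dimension equals the vocabulary size
one_hot = np.eye(len(vocab))
print(one_hot[word_to_id["ran"]])      # [0. 0. 1.]

# 3) word embeddings: dense, low-dimensional (random here, learned in practice)
embedding_dim = 4
embedding_matrix = np.random.rand(len(vocab), embedding_dim)
print(embedding_matrix[word_to_id["ran"]])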

Word Embedding Methods:

Continuous bag-of-words (CBOW): the model learns to predict the center word given some context words.

Continuous skip-gram / skip-gram with negative sampling (SGNS): the model learns to predict the words surrounding a given input word.

word2vec (Google, 2013): overcomes the limitations of BoW and TF-IDF by preserving contextual information and representing words in a dense vector space. It does not handle out-of-vocabulary (OOV) words well.

Global Vectors (GloVe) (Stanford, 2014): factorizes the logarithm of the corpus's word co-occurrence matrix, a count-based alternative to the predictive word2vec models.

Deep learning-based contextual embeddings include BERT and GPT.
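As a small illustration of the CBOW/skip-gram idea, the sketch below trains word2vec on a toy corpus with the gensim library (an assumption; gensim is not mentioned in the article, and the gensim 4.x API is used). Setting sg=1 selects skip-gram; sg=0 would select CBOW.

from gensim.models import Word2Vec

corpus = [
    ["the", "dedicated", "runner", "ran", "across", "the", "finish", "line"],
    ["i", "enjoy", "running", "in", "the", "morning"],
    ["the", "runner", "likes", "to", "run"],
]

# train a small skip-gram model; vector_size is the embedding dimension
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["runner"])                       # dense 50-dimensional vector
print(model.wv.most_similar("runner", topn=3))  # nearest neighbours in the embedding space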

The code provided demonstrates two ways of adding an embedding layer to a TensorFlow model. The first uses a Keras Embedding layer. The second uses TensorFlow Hub to build a neural network with the Universal Sentence Encoder as a pre-trained embedding layer; by setting trainable=True, its parameters can be fine-tuned during training.

import tensorflow_hub as hub
import tensorflow as tf

# embedding method 1
# num_words, embedding_dim and max_length (a name assumed here for the truncated
# input_length argument) are expected to be defined earlier.
model = tf.keras.Sequential([
    # input_dim: size of the vocabulary, i.e. maximum integer index + 1
    # output_dim: dimension of the dense embedding
    # input_length: length of input sequences, when it is constant
    # input:  2D tensor with shape (batch_size, input_length)
    # output: 3D tensor with shape (batch_size, input_length, output_dim)
    tf.keras.layers.Embedding(input_dim=num_words, output_dim=embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(265, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    # sigmoid for the single binary output (a one-unit softmax would always output 1)
    tf.keras.layers.Dense(1, activation='sigmoid'),
])


#embedding method 2
model = tf.keras.Sequential([
hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",
trainable=True, dtype=tf.string, input_shape=[]),
tf.keras.layers.Dense(64, activation="relu"),
tf.keras.layers.Dense(1, activation="sigmoid")
])

10. Bag-of-Words/TF-IDF
The bag-of-words approach represents text as an assortment or "bag" of individual words or tokens, disregarding their order. It creates a numerical representation of a document or corpus by tallying the occurrences of each word; the original word order is discarded, and the focus is solely on how often each word occurs.
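A minimal bag-of-words sketch with scikit-learn's CountVectorizer (illustrative; the documents are toy examples) shows how each document becomes a row of word counts:

from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "I enjoy running",
    "I like to run in the morning"
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(bow.toarray())                       # one row of word counts per document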

TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting scheme applied to the bag-of-words representation. Its purpose is to assign weights to words that reflect their importance within a document in the context of a larger collection of documents, known as a corpus.

TF-IDF takes into account two key factors:

1. Term Frequency (TF): This measures how frequently a term (word) appears in a document. It assigns a higher weight to words that occur more frequently within the document. The formula for TF is:

TF(t,d) = (Number of times term t appears in document d) / (Total number of terms in document d)

2. Inverse Document Frequency (IDF): This measures how unique or rare a term is across all documents. It assigns a higher weight to words that appear in fewer documents and therefore carry more distinguishing information. The formula for IDF is:

IDF(t,D) = log(Total number of documents in the corpus N / (Number of documents containing term t + 1))

The TF-IDF score is then calculated by multiplying the two:

TF-IDF(t,d,D) = TF(t,d) × IDF(t,D)

If the TF-IDF value is high, the term is both frequent in the document and rare across the corpus, making it a distinctive feature of that document.
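The snippet below is a hand-rolled sketch of the formulas above on a toy corpus (illustrative only; scikit-learn's TfidfVectorizer, used next, applies a smoothed IDF and normalisation, so its numbers will differ):

import math

docs = [
    ["the", "runner", "ran", "across", "the", "finish", "line"],
    ["i", "enjoy", "running"],
    ["i", "like", "to", "run", "in", "the", "morning"],
]

def tf(term, doc):
    # frequency of the term within one document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # rarity of the term across the corpus, following the formula above
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (n_containing + 1))

term = "running"  # appears once in the second document, and in 1 of 3 documents overall
print(tf(term, docs[1]) * idf(term, docs))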

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The dedicated runner, after running for hours, finally ran across the finish line with a smile.",
    "I enjoy running",
    "I like to run in the morning"
]

vectorizer = TfidfVectorizer()
tfidf_vectors = vectorizer.fit_transform(documents)

feature_names = vectorizer.get_feature_names_out()

# Print the TF-IDF matrix
for i in range(len(documents)):
    print("Document:", i + 1)
    for j, feature in enumerate(feature_names):
        tfidf_value = tfidf_vectors[i, j]
        if tfidf_value != 0:
            print(feature, ":", tfidf_value)
    print()

End note:
In summary, this NLP primer covered ten basic concepts, from tokenization
and word embeddings to part-of-speech tagging and named entity
recognition. This foundational understanding sets the stage for Part II,
where we’ll explore practical NLP applications across diverse domains.
