Natural Language Processing (NLP) Zero to Mastery Part I: Foundations

Unlocking Ten Key Concepts for NLP Proficiency

ChenDataBytes · Dec 28, 2023


Photo by Sven Brandsma on Unsplash

The articles in this series cover the following topics:

Part 1 (this article): Presents the fundamental principles of Natural Language Processing (NLP).

Part 2: Explores the common applications of NLP.

Natural Language Processing (NLP) is a field of study within computer science and artificial intelligence that focuses on the interaction between computers and human languages. Its objective is to enable computers to understand, interpret, and generate human language, thereby facilitating communication and interaction between humans and machines.

NLP relies on several libraries commonly employed in the field, such as NLTK and spaCy. NLTK provides a comprehensive set of tools and resources for NLP tasks, while spaCy offers fast, production-oriented processing pipelines. Note, however, that spaCy alone may not cover certain applications, such as sentiment analysis, which may call for more specialized libraries or approaches. For sequence modelling in NLP, TensorFlow's deep learning framework can be used.

Regarding the foundational aspects of NLP, we will delve into ten essential
topics: lemmatization, stemming, part-of-speech tagging, stop words,
pattern matching, sentence segmentation, named entity recognition,
tokenization, word embedding and bag-of-words.


NLP concepts

Linguistic Basics

1. Stemming
Stemming is a linguistic method utilized to obtain the base or root form of
words by removing letters from the word’s end. Its objective is to simplify
words by disregarding tense, pluralization, and other grammatical
variations. The Porter stemming algorithm employs a collection of
predetermined rules and heuristics to eliminate common English suffixes,
thereby converting words into their corresponding stems. spaCy doesn't have a built-in implementation of the Porter stemmer, so we use NLTK for this exercise.

from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()
words = ["runner", "running", "ran"]
for word in words:
    print(word + ' --> ' + p_stemmer.stem(word))


2. Lemmatization
In contrast to stemming, lemmatization is a more sophisticated linguistic
process that aims to reduce a word to its base or dictionary form, known as a
lemma. Lemmatization takes into account factors such as part-of-speech
(POS) tags and contextual understanding to ensure accurate and meaningful
transformations.

import spacy
nlp = spacy.load('en_core_web_sm')

doc1 = nlp(u"The dedicated runner, after running for hours, finally ran across the finish line with a smile.")

for token in doc1:
    print(token.text, '\t', token.lemma, '\t', token.lemma_)


3. Part of speech
Part of speech refers to the grammatical category of a word in a sentence,
such as noun, verb, adjective, adverb, pronoun, preposition, conjunction, or
interjection. Part of speech tagging can be used for various purposes,
including identifying named entities and supporting speech recognition. The probabilities of part-of-speech tags occurring near one another can be used to generate the most reasonable output.

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"The dedicated runner, after running for hours, finally ran across the finish line with a smile.")
# print POS for "after"
print(doc[4].text, doc[4].pos_, doc[4].tag_, spacy.explain(doc[4].tag_))
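As a quick follow-up, here is a minimal sketch (not from the original article) that tallies how often each coarse part-of-speech tag appears in the same example sentence, using spaCy's count_by:

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"The dedicated runner, after running for hours, finally ran across the finish line with a smile.")

# tally the frequency of each coarse POS tag in the doc
pos_counts = doc.count_by(spacy.attrs.POS)
for pos_id, count in sorted(pos_counts.items()):
    # look up the human-readable tag name for each attribute id
    print(doc.vocab[pos_id].text, count)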

Text Identification / Extraction

4. Stop words
Common words such as “a” and “the” occur so frequently in text that they
often do not carry significant meaning compared to nouns, verbs, and
modifiers. These commonly occurring words are referred to as stop words
and can be excluded or filtered out during text processing. spaCy provides a built-in list of approximately 305 English stop words that can be readily used.

import spacy
nlp = spacy.load('en_core_web_sm')

sentences = "The dedicated runner, after running for hours, finally ran across the finish line with a smile."

def remove_stopwords(sentence):
    sentence = sentence.lower()
    words = sentence.split()
    sentence = " ".join([w for w in words if not nlp.vocab[w].is_stop])
    return sentence

print(remove_stopwords(sentences))

5. Pattern Match
Pattern matching entails the identification and extraction of linguistic
patterns or structural information from text. This process involves searching
for specific sequences of words, phrases, or syntactic structures that
conform to predefined patterns or rules.

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')

doc = nlp(u'The dedicated runner, after running for hours, finally ran across the finish line with a smile.')

matcher = PhraseMatcher(nlp.vocab)

phrase_list = ['runner', 'smile']
phrase_patterns = [nlp(text) for text in phrase_list]
# spaCy 3.x signature; in spaCy 2.x this was matcher.add('newproduct', None, *phrase_patterns)
matcher.add('newproduct', phrase_patterns)

matches = matcher(doc)

# Print the matches found in the text.
# Each match is represented as a tuple of (match ID, start index, end index).
print(matches)
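PhraseMatcher matches exact phrases; for rule-based patterns over token attributes (lemmas, POS tags, and so on), spaCy also provides the token-level Matcher. The following is a minimal sketch assuming the spaCy 3.x API; the pattern and sentence are illustrative:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'The dedicated runner, after running for hours, finally ran across the finish line with a smile.')

matcher = Matcher(nlp.vocab)
# match any inflection of the lemma "run" followed by an adposition, e.g. "running for", "ran across"
pattern = [{"LEMMA": "run"}, {"POS": "ADP"}]
matcher.add('run_prep', [pattern])

for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)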


6. Sentence Segmentation
Sentence segmentation refers to the process of dividing a document or a
textual piece into individual sentences. In natural language processing,
accurately identifying sentence boundaries is crucial for various text
analysis and language understanding tasks.

import spacy
nlp = spacy.load('en_core_web_sm')

doc = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')
for sent in doc.sents:
    print(sent)

7. Named Entity Recognition (NER)
Named Entity Recognition (NER) entails the identification and classification of named entities present in the given text. Named entities typically encompass distinct types of words or phrases that represent recognizable entities, including but not limited to names of individuals, organizations, locations, dates, numerical expressions, and others.


import spacy
nlp = spacy.load('en_core_web_sm')

doc = nlp(u'The dedicated runner, after running for hours, finally ran across the finish line with a smile.')
for ent in doc.ents:
    print(ent.text, ent.start, ent.end, ent.start_char, ent.end_char, ent.label_)

Text Representation

8. Tokenization
Tokenization involves breaking down the original text into smaller components known as tokens. These tokens can be created based on contiguous sequences of characters or words. The use of spaCy for tokenization is demonstrated below.

import spacy
nlp = spacy.load('en_core_web_sm')
mystring = '"The dedicated runner, after running for hours, finally ran across the finish line with a smile."'
doc = nlp(mystring)

for token in doc:
    print(token.text, end=' | ')

# Counting Tokens
print("\n Token Counts:", len(doc))

# Counting Vocab Entries
print("\n Vocab Entries: " + str(len(doc.vocab)))

An n-gram is a consecutive sequence of n items, where an item can be a character or a word.

from nltk import bigrams

text = """The dedicated runner, after running for hours, finally ran across the finish line with a smile."""
lines = map(str.split, text.split('\n'))
for line in lines:
    print("\n".join([" ".join(bi) for bi in bigrams(line)]))


The following example illustrates how to generate tokens and sequences and perform padding using the TensorFlow framework. In NLP model training, it is also common to create input-output pairs, where the input consists of a sequence of words or characters and the output is the subsequent word or character; a sketch of this follows the padding example below.

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Define your input texts
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

# Initialize the Tokenizer class
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
# Tokenize the input sentences
tokenizer.fit_on_texts(sentences)
# Get the word index dictionary
word_index = tokenizer.word_index
# Generate list of token sequences
sequences = tokenizer.texts_to_sequences(sentences)

# Print the result
print("\nWord Index = ", word_index)
print("\nSequences = ", sequences)

# Pad the sequences to a uniform length
padded = pad_sequences(sequences, maxlen=5)
# Print the result
print("\nPadded Sequences:")
print(padded)
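Building on the tokenizer above, the sketch below (illustrative, not from the original article) shows one common way to turn the token sequences into input-output pairs for next-word prediction: every prefix of a sequence becomes an input, and its last token becomes the label.

from tensorflow.keras.preprocessing.sequence import pad_sequences

# reuse `sequences` produced by tokenizer.texts_to_sequences(sentences) above
input_sequences = []
for seq in sequences:
    # every prefix of length >= 2 becomes one training example
    for i in range(2, len(seq) + 1):
        input_sequences.append(seq[:i])

# pad all prefixes to the same length, then split off the last token as the label
max_len = max(len(s) for s in input_sequences)
input_sequences = pad_sequences(input_sequences, maxlen=max_len, padding='pre')
xs, labels = input_sequences[:, :-1], input_sequences[:, -1]

print(xs[:3])
print(labels[:3])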

Subword tokenization is a text tokenization technique that breaks down words into smaller units, known as subwords or subword units. Unlike traditional word-based tokenization, where each word is considered a single token, subword tokenization allows the representation of words as a sequence of subword units.

import tensorflow_datasets as tfds

# Download the subword-encoded, pretokenized dataset
imdb_subwords, info_subwords = tfds.load("imdb_reviews/subwords8k", with_info=True)
train_data, test_data = imdb_subwords['train'], imdb_subwords['test']

# Get the encoder
tokenizer_subwords = info_subwords.features['text'].encoder

# Define sample sentence
sample_string = 'TensorFlow, from basics to mastery'

# Encode using the subword text encoder
tokenized_string = tokenizer_subwords.encode(sample_string)
print('Tokenized string is {}'.format(tokenized_string))

# Decode and print the results
original_string = tokenizer_subwords.decode(tokenized_string)
print('The original string: {}'.format(original_string))

9. Word vectors/word embeddings
Basic word representations can be classified into three categories: integers, one-hot vectors, and word embeddings. Word embedding is a technique that represents words as dense, low-dimensional vectors in a continuous vector space. The main objective of word embeddings is to capture the semantic and contextual relationships between words. For instance, when visualizing word embeddings in 2D, similar words tend to be located close to each other.
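To make the three categories concrete, here is a toy illustration (not from the original article) of integer ids, one-hot vectors, and a dense embedding matrix for a three-word vocabulary; in practice the embedding values are learned rather than random.

import numpy as np

vocab = ["runner", "running", "ran"]

# 1) integer ids
word_to_id = {w: i for i, w in enumerate(vocab)}
print(word_to_id)                      # {'runner': 0, 'running': 1, 'ran': 2}

# 2) one-hot vectors: sparse, dimension equals the vocabulary size
one_hot = np.eye(len(vocab))
print(one_hot[word_to_id["ran"]])      # [0. 0. 1.]

# 3) word embeddings: dense, low-dimensional (random here, learned in practice)
embedding_dim = 4
embedding_matrix = np.random.rand(len(vocab), embedding_dim)
print(embedding_matrix[word_to_id["ran"]])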

Word Embedding Methods:

Continuous bag-of-words (CBOW): the model learns to predict the center word given some context words.

Continuous skip-gram / skip-gram with negative sampling (SGNS): the model learns to predict the words surrounding a given input word.

word2vec (Google, 2013): overcomes the limitations of BoW and TF-IDF by preserving contextual information and representing words in a dense vector space. It does not handle out-of-vocabulary (OOV) words well.

Global Vectors (GloVe) (Stanford, 2014): factorizes the logarithm of the corpus's word co-occurrence matrix, a count-based alternative to the predictive word2vec models.

Deep learning-based contextual embeddings include BERT and GPT.
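As a small illustration of the CBOW/skip-gram idea, the sketch below trains word2vec on a toy corpus with the gensim library (an assumption; gensim is not mentioned in the article, and the gensim 4.x API is used). Setting sg=1 selects skip-gram; sg=0 would select CBOW.

from gensim.models import Word2Vec

corpus = [
    ["the", "dedicated", "runner", "ran", "across", "the", "finish", "line"],
    ["i", "enjoy", "running", "in", "the", "morning"],
    ["the", "runner", "likes", "to", "run"],
]

# train a small skip-gram model; vector_size is the embedding dimension
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["runner"])                       # dense 50-dimensional vector
print(model.wv.most_similar("runner", topn=3))  # nearest neighbours in the embedding space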

The code provided demonstrates two ways of adding an embedding layer to a TensorFlow model. The first uses a Keras Embedding layer. The second uses TensorFlow Hub to build a neural network with the Universal Sentence Encoder as a pre-trained embedding layer; by setting trainable=True, its parameters can be fine-tuned during training.

import tensorflow_hub as hub
import tensorflow as tf

# embedding method 1
# num_words, embedding_dim and max_length (a name assumed here for the truncated
# input_length argument) are expected to be defined earlier.
model = tf.keras.Sequential([
    # input_dim: size of the vocabulary, i.e. maximum integer index + 1
    # output_dim: dimension of the dense embedding
    # input_length: length of input sequences, when it is constant
    # input:  2D tensor with shape (batch_size, input_length)
    # output: 3D tensor with shape (batch_size, input_length, output_dim)
    tf.keras.layers.Embedding(input_dim=num_words, output_dim=embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(265, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    # sigmoid for the single binary output (a one-unit softmax would always output 1)
    tf.keras.layers.Dense(1, activation='sigmoid'),
])


#embedding method 2
model = tf.keras.Sequential([
hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",
trainable=True, dtype=tf.string, input_shape=[]),
tf.keras.layers.Dense(64, activation="relu"),
tf.keras.layers.Dense(1, activation="sigmoid")
])

10. Bag-of-Words/TF-IDF
The bag-of-words approach represents text as an assortment or "bag" of individual words or tokens, disregarding their order. It creates a numerical representation of a document or corpus by tallying the occurrences of each word; the original word order is discarded, and the focus is solely on how often each word occurs.
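A minimal bag-of-words sketch with scikit-learn's CountVectorizer (illustrative; the documents are toy examples) shows how each document becomes a row of word counts:

from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "I enjoy running",
    "I like to run in the morning"
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(bow.toarray())                       # one row of word counts per document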

TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting scheme applied to the bag-of-words representation. Its purpose is to assign weights to words that reflect their importance within a document in the context of a larger collection of documents, known as a corpus.

TF-IDF takes into account two key factors:

1. Term Frequency (TF): This measures how frequently a term (word) appears in a document. It assigns a higher weight to words that occur more frequently within the document. The formula for TF is:

TF(t,d) = (Number of times term t appears in document d) / (Total number of terms in document d)

2. Inverse Document Frequency (IDF): This measures how unique or rare a term is across all documents. It assigns a higher weight to words that appear in fewer documents and therefore carry more distinguishing information. The formula for IDF is:

IDF(t,D) = log(Total number of documents in the corpus N / (Number of documents containing term t + 1))

The TF-IDF score is then calculated by multiplying the two:

TF-IDF(t,d,D) = TF(t,d) × IDF(t,D)

If the TF-IDF value is high, the term is both frequent in the document and rare across the corpus, making it a distinctive feature of that document.
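The snippet below is a hand-rolled sketch of the formulas above on a toy corpus (illustrative only; scikit-learn's TfidfVectorizer, used next, applies a smoothed IDF and normalisation, so its numbers will differ):

import math

docs = [
    ["the", "runner", "ran", "across", "the", "finish", "line"],
    ["i", "enjoy", "running"],
    ["i", "like", "to", "run", "in", "the", "morning"],
]

def tf(term, doc):
    # frequency of the term within one document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # rarity of the term across the corpus, following the formula above
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (n_containing + 1))

term = "running"  # appears once in the second document, and in 1 of 3 documents overall
print(tf(term, docs[1]) * idf(term, docs))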

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The dedicated runner, after running for hours, finally ran across the finish line with a smile.",
    "I enjoy running",
    "I like to run in the morning"
]

vectorizer = TfidfVectorizer()
tfidf_vectors = vectorizer.fit_transform(documents)

feature_names = vectorizer.get_feature_names_out()

# Print the TF-IDF matrix
for i in range(len(documents)):
    print("Document:", i + 1)
    for j, feature in enumerate(feature_names):
        tfidf_value = tfidf_vectors[i, j]
        if tfidf_value != 0:
            print(feature, ":", tfidf_value)
    print()

End note:
In summary, this NLP primer covered ten basic concepts, from tokenization
and word embeddings to part-of-speech tagging and named entity
recognition. This foundational understanding sets the stage for Part II,
where we’ll explore practical NLP applications across diverse domains.
